SinGAN: Learning a Generative Model from a Single Natural Image

Computer Vision | Jupyter Notebook | Final Paper

My Role: GAN Architecture, Network Training, Random Sampling (with scaled random sampling)

The Team: Xiangyu Peng, Alyssa Scheske

The goal: generate unconditional, realistic images from a single natural input image, with quality comparable to state-of-the-art Generative Adversarial Networks (GANs) trained on entire classes of datasets. During my last semester at the University of Michigan, I worked with the team to replicate the original SinGAN paper as the final project for my Computer Vision course. SinGAN learns the internal patch distribution of the image and then generates high-quality samples that carry the same visual content as the original. The generated samples are diverse yet preserve both the global structure and the fine textures of the image. Being unconditional, SinGAN can generate new samples from pure noise. Training takes about 40-60 minutes on Google Colab with a GPU, which is a huge plus considering that the trained model can then be used to perform a wide variety of image manipulation tasks.

One of the basic functions of SinGAN is to generate fake images that are hard to distinguish from real ones. SinGAN can generate fakes of variable size and aspect ratio simply by changing the size of the input noise map. When generating fakes, the model produces new structures while still maintaining the visual content of the original image. Interestingly, SinGAN not only preserves reflections but in many cases even synthesizes them.
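Variable-size generation falls out of the architecture: because the generators are fully convolutional, the output resolution simply tracks the noise map. A toy stand-in network (not the actual SinGAN generator) illustrates the idea:

```python
import torch
import torch.nn as nn

# Any fully convolutional net maps an HxW input to an HxW output (with
# padding that preserves size), so the generated image's size and aspect
# ratio follow the noise map directly.
net = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.LeakyReLU(0.2),
                    nn.Conv2d(8, 3, 3, padding=1))

wide = net(torch.randn(1, 3, 64, 128))  # wide noise map -> wide image
tall = net(torch.randn(1, 3, 128, 64))  # tall noise map -> tall image
```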

Can you tell the real images from the fake ones? Below is a gallery mixing real photos with samples generated by SinGAN. Guess whether each image is real or fake, then move your mouse cursor over it to see whether you were right!

Architecture

SinGAN has a pyramid architecture with N levels, each consisting of a GAN (a generator-discriminator pair Gn and Dn). I was responsible for creating this pyramid of GANs. The number of levels N is calculated from the image size to optimize for training resources. For the project, the maximal dimension of the input image is set to 250px while maintaining the aspect ratio. During training, the output of each level is used as the input to the next level; the initial input to the pyramid, at the coarsest level, is a random noise map. Each level of the pyramid evaluates patches of the input image in a coarse-to-fine fashion.
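As a rough sketch of how the number of levels can be derived from the image size: the constants below (a minimum coarsest dimension of 25 px and a downscaling factor of 0.75) follow the defaults of the public SinGAN implementation and are assumptions here, not values measured from this project.

```python
import math

def num_downscale_steps(max_dim: int, min_dim: int = 25,
                        scale_factor: float = 0.75) -> int:
    """Largest number of x0.75 downscaling steps that keeps the coarsest
    level at least `min_dim` pixels on its longer side (illustrative)."""
    return math.floor(math.log(min_dim / max_dim) / math.log(scale_factor))

steps = num_downscale_steps(250)  # 8 downscaling steps -> a 9-level pyramid
coarsest = 250 * 0.75 ** steps    # ~25 px on the longer side at the coarsest level
```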

The generator and discriminator architecture is the same at every level. Both are fully convolutional: a noise map is added to the (upsampled) output of the previous level, and the result is passed through the generator. At the coarsest level the number of kernels is 32, and we double it every 4 levels as in the original implementation. The receptive field is set at the coarsest level and remains constant (11 × 11) at every level. As we move up the pyramid the image is upscaled, but a fixed receptive field means a varying effective patch size, as shown by the yellow block.
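A single pyramid level can be sketched in PyTorch as below. The class and layer names are illustrative assumptions, not the project's actual identifiers; the one structural fact carried over is that five stacked 3×3 convolutions yield the 11 × 11 effective receptive field described above.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Sequential):
    # Conv -> BatchNorm -> LeakyReLU with a 3x3 kernel. Five of these in a
    # row give an effective receptive field of 1 + 5*2 = 11 pixels.
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.2),
        )

class SingleScaleGenerator(nn.Module):
    # Sketch of one pyramid level (hypothetical names). Fully convolutional,
    # so it accepts any spatial size.
    def __init__(self, nc: int = 32):  # 32 kernels at the coarsest level
        super().__init__()
        self.head = ConvBlock(3, nc)
        self.body = nn.Sequential(*[ConvBlock(nc, nc) for _ in range(3)])
        self.tail = nn.Sequential(nn.Conv2d(nc, 3, 3, padding=1), nn.Tanh())

    def forward(self, noise: torch.Tensor, prev: torch.Tensor) -> torch.Tensor:
        # Noise is added to the upsampled output of the previous level, and
        # the network learns a residual on top of that image.
        x = self.tail(self.body(self.head(noise + prev)))
        return x + prev
```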

Applications

Apart from random sample generation, SinGAN can perform multiple image manipulation tasks from a single image: paint-to-image, editing, harmonization, super-resolution, and animation. All of these can be performed with the trained model as-is, without any further tuning. The team implemented two manipulation tasks: editing and harmonization.

Editing

In the editing task, we copy certain patches of the original image and paste them at other locations in the same image. SinGAN can regenerate fine texture, seamlessly stitch the pasted parts, and produce a realistic image.
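Preparing the edited input is a plain copy-paste; the helper below is hypothetical, not project code. The raw paste leaves visible seams, which SinGAN's finer scales then re-texture into a coherent image.

```python
import numpy as np

def paste_patch(img: np.ndarray, src_yx, dst_yx, size) -> np.ndarray:
    """Copy an (h, w) patch from src_yx and paste it at dst_yx (illustrative)."""
    out = img.copy()
    (sy, sx), (dy, dx), (h, w) = src_yx, dst_yx, size
    out[dy:dy + h, dx:dx + w] = img[sy:sy + h, sx:sx + w]
    return out
```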

Injecting the image at different levels of the pyramid affects how the final output looks, because the effective patch size of the GAN differs at each scale. The lower the injection scale, the more the outcome resembles the edited input, and the smaller the structures that are modified in the output. At injection scale 7, more global properties are changed, but at injection scale 1, only local textures change and the global structure remains the same. Based on our experiments, injection scales of 2, 3, and 4 (for levels 4, 5, and 6 respectively) work best.
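The injection mechanism can be sketched as follows, under a few assumptions: `generators` is ordered coarse to fine, each generator takes `(noise, image)` arguments, the zero-noise choice and all names are illustrative, and per-scale sizes are derived from the full-resolution edit with a 0.75 downscaling factor.

```python
import torch
import torch.nn.functional as F

def inject(edited, generators, inject_at, scale_factor=0.75):
    """Downscale the edited image to scale `inject_at`, then let the finer
    generators re-synthesize texture on the way back up (illustrative)."""
    n = len(generators)
    h, w = edited.shape[-2:]
    # Spatial size of each scale, derived from the full-resolution edit.
    sizes = [(max(1, round(h * scale_factor ** (n - 1 - i))),
              max(1, round(w * scale_factor ** (n - 1 - i))))
             for i in range(n)]
    x = F.interpolate(edited, size=sizes[inject_at],
                      mode='bilinear', align_corners=False)
    for i in range(inject_at, n):
        x = F.interpolate(x, size=sizes[i], mode='bilinear',
                          align_corners=False)
        x = generators[i](torch.zeros_like(x), x)  # zero noise keeps the edit
    return x
```

Injecting at a lower index (a coarser scale) gives the pyramid more upsampling steps in which to rework the edit, matching the global-vs-local behavior described above.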

Harmonization

Harmonization is a technique in which the styles of images are explicitly matched before blending them. Using a multi-level technique lets us transfer the appearance of one image to another. To achieve this, SinGAN is first trained on a background image. Then an input image, consisting of the original with newly added content, is injected and realistically blended into the style of the background. As with editing, the injection scale affects the final output. The best results are achieved at scales 3 or 4, because the newly added content keeps its structure while being converted to the background's style.