CS180 - Fall 2024 Alex Cao
This project is divided into two parts. In the first part, I explore the sampling process of diffusion models, try different denoising methods, and create cool images such as visual anagrams, image-to-image translations, and hybrid images. In the second part, I implement my own UNet for the diffusion process and train it on the MNIST dataset.
In this project, we use the DeepFloyd IF diffusion model. This is a two-stage model: the first stage produces images of size 64 x 64, and the second stage takes the output from the first stage and upscales it to 256 x 256. Below are the results of generation using the three text prompts provided in the project spec. The image quality is quite nice; I am surprised the results are this good with only 20 denoising steps. The picture of the man wearing a hat is especially realistic. The images also reflect their prompts very well: the objects are exactly what the prompts describe, and the first image accurately captures the oil painting style mentioned in its prompt. I am using the seed 1104.
an oil painting of a snowy mountain village (stage 1)
a man wearing a hat (stage 1)
a rocket ship (stage 1)
an oil painting of a snowy mountain village (stage 2)
a man wearing a hat (stage 2)
a rocket ship (stage 2)
Below are the results of generation using the prompt "an oil painting of a snowy mountain village" with different numbers of denoising steps. We can see that as the number of denoising steps increases, the image becomes more colorful and detailed; perhaps more denoising steps allow the model to develop more of the "oil painting" texture in the image.
20 denoising steps
60 denoising steps
80 denoising steps
For this part, I implemented the forward function. I obtained alpha_cumprod by indexing the t-th element of alphas_cumprod, and epsilon is generated using torch.randn_like. Below is the Berkeley Campanile at different noise levels.
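Here is a minimal sketch of this forward function, where alphas_cumprod is assumed to be the scheduler's precomputed 1-D tensor of cumulative products:

```python
import torch

def forward(im, t, alphas_cumprod):
    """Forward process: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps."""
    alpha_bar = alphas_cumprod[t]   # abar_t for this timestep
    eps = torch.randn_like(im)      # eps ~ N(0, I)
    return torch.sqrt(alpha_bar) * im + torch.sqrt(1 - alpha_bar) * eps
```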
Berkeley Campanile
Noisy Campanile at t=250
Noisy Campanile at t=500
Noisy Campanile at t=750
In this part I applied Gaussian blurring with a kernel size of 5 and a sigma of 3. Below are the side-by-side results of Gaussian blur filtering and the original noisy images.
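As a sketch, this baseline can be computed with torchvision using the parameters above; noisy_im stands in for any of the noisy images:

```python
import torchvision.transforms.functional as TF

# Classical baseline: the blur smears the noise rather than removing it,
# which is why the results below stay both blurry and noisy.
blurred = TF.gaussian_blur(noisy_im, kernel_size=5, sigma=3.0)
```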
Noisy Campanile at t=250
Noisy Campanile at t=500
Noisy Campanile at t=750
Gaussian Blur Denoising at t = 250
Gaussian Blur Denoising at t = 500
Gaussian Blur Denoising at t = 750
To perform one-step denoising, I first use the forward process to generate a noisy image at a given noise level t, then I use the stage-1 UNet to predict the noise. Once I have the predicted noise, I obtain the clean image by solving for x_0 in the equation \[ x_t = \sqrt{\bar{\alpha}_t}x_0 + \sqrt{1-\bar{\alpha}_t}\epsilon \quad \text{where} \quad \epsilon \sim \mathcal{N}(0,1) \] Below are the original image, and the noisy and one-step denoised images at different noise levels.
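A minimal sketch of this step, where eps_hat is assumed to be the UNet's noise prediction for x_t at timestep t:

```python
import torch

# Solve the forward-process equation for x_0, plugging in the
# predicted noise eps_hat in place of the true epsilon.
alpha_bar = alphas_cumprod[t]
x0_hat = (x_t - torch.sqrt(1 - alpha_bar) * eps_hat) / torch.sqrt(alpha_bar)
```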
Original Image
Noisy Campanile at t=250
Noisy Campanile at t=500
Noisy Campanile at t=750
One-Step Denoised Campanile at t = 250
One-Step Denoised Campanile at t = 500
One-Step Denoised Campanile at t = 750
I created a list of monotonically decreasing timesteps starting at 990, with a stride of 30, ending at 0, and followed this formula to iteratively denoise the image: \[ x_{t'} = \frac{\sqrt{\bar{\alpha}_{t'}}\beta_t}{1-\bar{\alpha}_t}x_0 + \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t'})}{1-\bar{\alpha}_t}x_t + v_\sigma \] Below is a series of images from every 5th loop of the denoising process. Further below, I show the original image, the iteratively denoised image, the one-step denoised image, and the Gaussian-blurred image for comparison.
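A minimal sketch of one loop iteration, assuming strided_timesteps = [990, 960, ..., 0], x0_hat is the current clean-image estimate (computed from the UNet's noise prediction as in one-step denoising), and v_sigma is the model's predicted variance term:

```python
import torch

# One iterative-denoising update from timestep t to the next, smaller
# timestep t'. Names follow the formula above.
t, t_prime = strided_timesteps[i], strided_timesteps[i + 1]
alpha_bar_t, alpha_bar_tp = alphas_cumprod[t], alphas_cumprod[t_prime]
alpha_t = alpha_bar_t / alpha_bar_tp
beta_t = 1 - alpha_t
x_t = (torch.sqrt(alpha_bar_tp) * beta_t / (1 - alpha_bar_t)) * x0_hat \
    + (torch.sqrt(alpha_t) * (1 - alpha_bar_tp) / (1 - alpha_bar_t)) * x_t \
    + v_sigma
```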
Noisy Campanile at t=90
Noisy Campanile at t=240
Noisy Campanile at t=390
Noisy Campanile at t=540
Noisy Campanile at t=690
We can see that the iteratively denoised image has the best quality, followed by the one-step denoised image; the Gaussian-blurred image has the worst quality.
Original
Iteratively Denoised Campanile
One-Step Denoised Campanile
Gaussian Blurred Campanile
In this part, I used iterative_denoise and set i_start = 0; I passed in random noise generated using torch.randn, and used the prompt embedding of "a high quality photo". Here are five results sampled using this procedure, with a seed of 1104.
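A minimal sketch of this sampling call; iterative_denoise is the function from the previous part, and hq_photo_embeds stands for the precomputed embedding of "a high quality photo":

```python
import torch

torch.manual_seed(1104)
x = torch.randn(1, 3, 64, 64, device=device)  # pure noise input
sample = iterative_denoise(x, i_start=0, prompt_embeds=hq_photo_embeds)
```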
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Some of the images in the prior section are not very good. To address this, we can perform classifier-free guidance (CFG). This is done by computing both a noise estimate conditioned on the prompt and an unconditional noise estimate, then forming a new noise estimate as: \[\varepsilon = \varepsilon_u + \gamma(\varepsilon_c - \varepsilon_u)\] where epsilon_u is the unconditional noise estimate, epsilon_c is the conditional noise estimate, and gamma is the guidance scale. To get the unconditional noise estimate, we can simply pass an empty prompt embedding to the model. The rest of the process is the same as in the last part. Below are some results using classifier-free guidance and a seed of 2002.
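A minimal sketch of the CFG combination, where noise_est is an assumed helper that runs the stage-1 UNet with a given prompt embedding, and the guidance scale shown is illustrative:

```python
# Extrapolate from the unconditional estimate toward the conditional one;
# gamma > 1 pushes the sample to follow the prompt more strongly.
eps_u = noise_est(x_t, t, null_embeds)  # embedding of the empty prompt ""
eps_c = noise_est(x_t, t, cond_embeds)  # embedding of the text prompt
gamma = 7.0                             # guidance scale (illustrative)
eps = eps_u + gamma * (eps_c - eps_u)
```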
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
If we take an image, add some noise to it, and then denoise it, we get an image similar to the original. The more noise we add and remove, the more the denoised image differs from the original. In this part, I added different amounts of noise to images, then denoised them using the text prompt "a high quality photo" to get new images (following the SDEdit algorithm). Below are some examples at different noise levels.
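A minimal sketch of SDEdit built from the earlier pieces, with forward and iterative_denoise as defined above:

```python
# Noise the original image up to the timestep indexed by i_start, then
# run the usual denoising loop from there. A larger i_start means less
# noise, so the result stays closer to the original image.
t_start = strided_timesteps[i_start]
x_noisy = forward(x_orig, t_start, alphas_cumprod)
edited = iterative_denoise(x_noisy, i_start=i_start,
                           prompt_embeds=hq_photo_embeds)
```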
SDEdit with i_start = 1
SDEdit with i_start = 3
SDEdit with i_start = 5
SDEdit with i_start = 7
SDEdit with i_start = 10
SDEdit with i_start = 20
Campanile
SDEdit with i_start = 1
SDEdit with i_start = 3
SDEdit with i_start = 5
SDEdit with i_start = 7
SDEdit with i_start = 10
SDEdit with i_start = 20
A Tree
SDEdit with i_start = 1
SDEdit with i_start = 3
SDEdit with i_start = 5
SDEdit with i_start = 7
SDEdit with i_start = 10
SDEdit with i_start = 20
A Car
We can also apply SDEdit to hand-drawn and web images to force them onto the natural image manifold. Below are the results of applying SDEdit at different noise levels to web and hand-drawn images. The first is a web image, and the other two are my own hand-drawn sketches.
Volcano with i_start = 1
Volcano with i_start = 3
Volcano with i_start = 5
Volcano with i_start = 7
Volcano with i_start = 10
Volcano with i_start = 20
Volcano
Apple with i_start = 1
Apple with i_start = 3
Apple with i_start = 5
Apple with i_start = 7
Apple with i_start = 10
Apple with i_start = 20
Original Apple Sketch
SDEdit with i_start = 1
SDEdit with i_start = 3
SDEdit with i_start = 5
SDEdit with i_start = 7
SDEdit with i_start = 10
SDEdit with i_start = 20
Original Pumpkin Sketch
To do image inpainting, I first generate a mask specifying the region I want to replace, then I run the diffusion denoising loop, but at every step, after obtaining x_t, I "force" x_t to have the same pixels as the original image where the mask is 0, i.e.: \[x_t \leftarrow \mathbf{m}x_t + (1-\mathbf{m})\,\text{forward}(x_{orig}, t)\] This way, we only generate new content inside the region we want to replace. Below are the results of inpainting for the Campanile and two images of my own choice.
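A minimal sketch of the per-step constraint, with mask equal to 1 inside the region to regenerate:

```python
# After each denoising step, overwrite everything outside the mask with
# an appropriately noised copy of the original image.
x_t = mask * x_t + (1 - mask) * forward(x_orig, t, alphas_cumprod)
```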
Campanile
Mask
Hole to Fill
Campanile Inpainted
Picture of People
Mask
Hole to Fill
People Inpainted
Neymar
Mask
Hole to Fill
Neymar Inpainted
Now we will do SDEdit but guide the projection with a text prompt. The following results show SDEdit at different noise levels with the text prompt "a rocket ship". The first image is the Campanile, and the last two are images of my own choice (also conditioned on "a rocket ship").
Rocket Ship at noise level 1
Rocket Ship at noise level 3
Rocket Ship at noise level 5
Rocket Ship at noise level 7
Rocket Ship at noise level 10
Rocket Ship at noise level 20
Campanile
Rocket Ship at noise level 1
Rocket Ship at noise level 3
Rocket Ship at noise level 5
Rocket Ship at noise level 7
Rocket Ship at noise level 10
Rocket Ship at noise level 20
Toothpaste
Rocket Ship at noise level 1
Rocket Ship at noise level 3
Rocket Ship at noise level 5
Rocket Ship at noise level 7
Rocket Ship at noise level 10
Rocket Ship at noise level 20
Ironman
To create a visual anagram that looks like one thing right-side up and another thing upside down, we denoise an image normally with one prompt to get one noise estimate; we then flip the image upside down and denoise with another prompt to get a second noise estimate. We flip the second estimate back and average the two, then proceed with the denoising step using the averaged noise estimate. Below are some of the results, where the first image is the original and the second is flipped upside down.
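A minimal sketch of the anagram noise estimate, where noise_est is the assumed CFG helper from before and dims=[-2] flips along the height axis:

```python
import torch

# Average the upright estimate for prompt 1 with the flipped-back
# estimate for prompt 2, which is computed on the upside-down image.
eps1 = noise_est(x_t, t, prompt1_embeds)
eps2 = noise_est(torch.flip(x_t, dims=[-2]), t, prompt2_embeds)
eps = (eps1 + torch.flip(eps2, dims=[-2])) / 2
```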
An oil painting of an old man
An oil painting of people around campfire
A parrot sitting on a tree branch
A blooming garden with flowers and trees
A lion under a tree
A person walking in the forest
To implement the make_hybrids function, I first estimate the noise separately using two different prompts, then create a composite noise estimate by combining the low frequencies from one noise estimate with the high frequencies from the other, i.e.: \[ \epsilon = f_{\text{lowpass}}(\epsilon_1) + f_{\text{highpass}}(\epsilon_2) \]
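A minimal sketch of this composite estimate, where noise_est is the assumed CFG helper from before and the Gaussian low-pass parameters are illustrative:

```python
import torchvision.transforms.functional as TF

# Low frequencies come from prompt 1's estimate, high frequencies from
# prompt 2's; the high-pass is the residual of the same Gaussian blur.
eps1 = noise_est(x_t, t, prompt1_embeds)
eps2 = noise_est(x_t, t, prompt2_embeds)
low = TF.gaussian_blur(eps1, kernel_size=33, sigma=2.0)
high = eps2 - TF.gaussian_blur(eps2, kernel_size=33, sigma=2.0)
eps = low + high
```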
Hybrid image of a skull and a waterfall
Hybrid image of a lion and a skull
Hybrid image of flower patterns and a human skeleton
First we need to implement the UNet. A UNet consists of a few downsampling and upsampling blocks with skip connections. In this part, I build UNets to be trained on the famous MNIST dataset.
To train the denoiser, we need training data pairs (z, x), where each x is a clean MNIST digit and z is a noisy version of it. For each training batch, z is generated on the fly by adding Gaussian noise to x: z = x + sigma * epsilon, with epsilon drawn from a standard normal.
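A minimal sketch of this noising step (sigma = 0.5 in the training runs below):

```python
import torch

def add_noise(x, sigma):
    """Build the noisy half of a (z, x) pair: z = x + sigma * eps."""
    return x + sigma * torch.randn_like(x)
```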
Varying levels of noise on MNIST digits
I implemented the UNet according to Figures 1 and 2 in the project spec. Then, I trained the denoiser on the MNIST dataset, with noisy images z generated from clean images x using sigma = 0.5. I trained the model for 5 epochs; below is the training loss curve from the training process (linear scale).
Training Loss Curve (Linear Scale)
Below are sample results after the 1st and 5th epoch.
Results on digits from the test set after 1 epoch of training
Results on digits from the test set after 5 epochs of training
Because the denoiser was trained on MNIST digits noised with sigma = 0.5, it may not perform well at other noise levels. Below are the denoiser's results on test set digits with noise sigma = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0].
Results on digits from the test set with varying noise levels
Following Figures 8 and 9 in the project spec, I implemented the fully-connected block and injected the conditioning signal into the UNet. For training, I followed Algorithm B.1 in the spec: pick a random image from the training set and a random t, and train the denoiser to predict the noise added. This is repeated for different images and different t values until convergence. Below is the training loss curve of the time-conditioned UNet (linear scale; I used the parameters specified in the spec).
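A minimal sketch of one training step following Algorithm B.1; T, alphas_cumprod, unet, and optimizer are assumed to be set up as in the spec, with t normalized to [0, 1] before it enters the UNet:

```python
import torch
import torch.nn.functional as F

x0 = x0_batch.to(device)                                   # clean digits
t = torch.randint(0, T, (x0.shape[0],), device=device)     # random timesteps
eps = torch.randn_like(x0)
abar = alphas_cumprod[t][:, None, None, None]
x_t = torch.sqrt(abar) * x0 + torch.sqrt(1 - abar) * eps   # forward process
loss = F.mse_loss(unet(x_t, t / T), eps)                   # predict the noise
optimizer.zero_grad()
loss.backward()
optimizer.step()
```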
Time-Conditioned UNet training loss curve (Linear Scale)
The sampling process is very similar to the previous parts; here I followed Algorithm B.2 in the project spec. Below are the sampling results for the time-conditioned UNet after 5 and 20 epochs. They are rendered as gifs to demonstrate the denoising process.
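A minimal sketch of this sampler; the beta/alpha schedules and the trained unet are assumed, with zero-based schedule tensors of length T:

```python
import torch

x = torch.randn(n, 1, 28, 28, device=device)  # start from pure noise
for t in range(T - 1, 0, -1):
    eps_hat = unet(x, torch.full((n,), t / T, device=device))
    z = torch.randn_like(x) if t > 1 else torch.zeros_like(x)  # no noise at the end
    x = (x - (1 - alphas[t]) / torch.sqrt(1 - alphas_cumprod[t]) * eps_hat) \
        / torch.sqrt(alphas[t]) + torch.sqrt(betas[t]) * z
```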
Sampling process (epoch 5)
Sampling process (epoch 20)
Following the project spec, I added class conditioning by injecting the class label c into the time-conditioned UNet model. In addition, dropout (with probability 0.1) is applied so that sometimes the conditioning vector c is set to 0, and the model can still perform well without a class label. The training process is very similar to the previous part, the only difference being the added conditioning vector c. Below is the training loss curve of the class-conditioned UNet.
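A minimal sketch of the conditioning dropout, assuming the class label is one-hot encoded before entering the UNet:

```python
import torch
import torch.nn.functional as F

# With probability 0.1, zero out the class vector so the model also
# learns to denoise without any class information.
c = F.one_hot(labels, num_classes=10).float()                  # (B, 10)
keep = (torch.rand(c.shape[0], device=c.device) >= 0.1).float()
c = c * keep[:, None]
```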
Class-Conditioned UNet training loss curve (Linear Scale)
I followed Algorithm B.4 in the project spec to perform CFG sampling. Below are the sampling results for the class-conditioned UNet after 5 and 20 epochs. They are rendered as gifs to demonstrate the denoising process.
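A minimal sketch of the CFG step inside the sampling loop; unet, t_norm, and the class vector c are assumed as before, and the guidance scale shown is illustrative:

```python
import torch

eps_c = unet(x, t_norm, c)                    # class-conditional estimate
eps_u = unet(x, t_norm, torch.zeros_like(c))  # unconditional (c set to 0)
gamma = 5.0                                   # guidance scale (illustrative)
eps_hat = eps_u + gamma * (eps_c - eps_u)
```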
Sampling process (epoch 5)
Sampling process (epoch 20)
By keeping track of the intermediate denoising results (I used an interval of 10 denoising steps), I made gifs to further visualize the denoising process. Below is a summary of the time-conditioned and class-conditioned models' sampling gifs.
Time-Conditioned Model (epoch 5)
Time-Conditioned Model (epoch 20)
Class-Conditioned Model (epoch 5)
Class-Conditioned Model (epoch 20)
This comparison shows the progression of both models across training epochs, with image quality and stability improving as training progresses.
I realized that the infinitely looping gifs might be hard to read, so here I have included the exact same gifs, looping only once, so readers can better see the end result of the denoising process.
Time-Conditioned Warped Model (epoch 5)
Time-Conditioned Warped Model (epoch 20)
Class-Conditioned Warped Model (epoch 5)
Class-Conditioned Warped Model (epoch 20)
These warped sampling results show an alternative visualization of the denoising process, highlighting different aspects of how the models learn to generate images.
This is a very fun project and I learned a lot!