Fun with Diffusion Models!

CS180 - Fall 2024 Alex Cao

Introduction

This project is divided into two parts. In the first part, I explore the sampling process of diffusion models, try different denoising methods, and create fun results such as visual anagrams, image-to-image translations, and hybrid images. In the second part, I implement my own UNet for the diffusion process and train it on the famous MNIST dataset.

Part A: Diffusion Models

Setup

In this project, we use the DeepFloyd IF diffusion model. This is a two-stage model: the first stage produces images of size 64 x 64, and the second stage takes the output of the first stage and upscales it to 256 x 256. Below are generation results for the three text prompts provided in the project spec. The image quality is quite nice; I am surprised the results are this good with only 20 denoising steps. The picture of the man wearing a hat is especially realistic. The images also match the prompts well: the objects are exactly what the prompts describe, and the first image accurately reflects the oil painting style mentioned in its prompt. I am using the seed 1104.

an oil painting of a snowy mountain village (stage 1)

a man wearing a hat (stage 1)

a rocket ship (stage 1)

an oil painting of a snowy mountain village (stage 2)

a man wearing a hat (stage 2)

a rocket ship (stage 2)

Denoising Steps Analysis

Below are generation results for the prompt "an oil painting of a snowy mountain village" with different numbers of denoising steps. As the number of denoising steps increases, the image becomes more colorful and detailed. Perhaps more denoising steps allow the model to generate more "oil paintingness" in the image.

20 denoising steps

60 denoising steps

80 denoising steps

1.1 Implementing the Forward Process

For this part, I implemented the forward function. I obtained alpha_cumprod by indexing the t-th element of alphas_cumprod, and epsilon is generated using torch.randn_like.
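As a reference, here is a minimal sketch of the forward function; the variable names are my own, and `alphas_cumprod` is assumed to be the scheduler's cumulative-product table indexed by timestep.

```python
import torch

def forward(im, t, alphas_cumprod):
    # forward process: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps
    abar_t = alphas_cumprod[t]
    eps = torch.randn_like(im)  # eps ~ N(0, I)
    return torch.sqrt(abar_t) * im + torch.sqrt(1 - abar_t) * eps
```

Below is the Berkeley Campanile at different noise levels.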

Berkeley Campanile

Noisy Campanile at t=250

Noisy Campanile at t=500

Noisy Campanile at t=750

1.2 Classical Denoising

In this part I applied Gaussian blurring with kernel size 5 and sigma 3 as a classical denoising baseline.
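The blur itself is one call to torchvision; a minimal sketch, assuming `noisy_im` is an image tensor:

```python
import torchvision.transforms.functional as TF

# classical denoising baseline: Gaussian blur with kernel size 5, sigma 3
blurred = TF.gaussian_blur(noisy_im, kernel_size=5, sigma=3)
```

Below are side-by-side results of the original noisy images and their Gaussian-blurred versions.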

Noisy Campanile at t=250

Noisy Campanile at t=500

Noisy Campanile at t=750

Gaussian Blur Denoising at t=250

Gaussian Blur Denoising at t=500

Gaussian Blur Denoising at t=750

1.3 One-Step Denoising

To perform one-step denoising, I first use the forward process to generate a noisy image at a given noise level t, then I use the stage 1 UNet to predict the noise. Once I have the predicted noise, I obtain the clean image by solving for \(x_0\) in the equation \[ x_t = \sqrt{\bar{\alpha}_t}x_0 + \sqrt{1-\bar{\alpha}_t}\epsilon \quad \text{where} \quad \epsilon \sim \mathcal{N}(0,1) \]
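Rearranging that equation for \(x_0\) gives a one-liner; a minimal sketch, assuming `eps_pred` is the UNet's noise prediction:

```python
import torch

def one_step_denoise(x_t, eps_pred, t, alphas_cumprod):
    # solve the forward-process equation for x_0 given the predicted noise
    abar_t = alphas_cumprod[t]
    return (x_t - torch.sqrt(1 - abar_t) * eps_pred) / torch.sqrt(abar_t)
```

Below are the original image, the noisy images, and the one-step denoised images at different noise levels.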

Original Image

Noisy Campanile at t=250

Noisy Campanile at t=500

Noisy Campanile at t=750

One-Step Denoised Campanile at t=250

One-Step Denoised Campanile at t=500

One-Step Denoised Campanile at t=750

1.4 Iterative Denoising

Creating a list of monotonically decreasing timesteps starting at 990, with a stride of 30, ending at 0, I followed this formula to iteratively denoise the image: \[ x_{t'} = \frac{\sqrt{\bar{\alpha}_{t'}}\beta_t}{1-\bar{\alpha}_t}x_0 + \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t'})}{1-\bar{\alpha}_t}x_t + v_\sigma \] where \(\alpha_t = \bar{\alpha}_t / \bar{\alpha}_{t'}\), \(\beta_t = 1 - \alpha_t\), \(x_0\) is the current clean-image estimate, and \(v_\sigma\) is an added variance term.
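Here is a minimal sketch of the loop; the unet call follows the diffusers convention (an assumption on my part), and the \(v_\sigma\) variance term is omitted for brevity:

```python
import torch

def iterative_denoise(x, timesteps, unet, alphas_cumprod, prompt_embeds):
    for t, t_prime in zip(timesteps[:-1], timesteps[1:]):
        abar_t, abar_tp = alphas_cumprod[t], alphas_cumprod[t_prime]
        alpha_t = abar_t / abar_tp
        beta_t = 1 - alpha_t
        eps = unet(x, t, encoder_hidden_states=prompt_embeds).sample  # predicted noise
        # current clean-image estimate, from the forward-process equation
        x0_hat = (x - torch.sqrt(1 - abar_t) * eps) / torch.sqrt(abar_t)
        # interpolate between the clean estimate and the current noisy image
        x = (torch.sqrt(abar_tp) * beta_t / (1 - abar_t)) * x0_hat \
            + (torch.sqrt(alpha_t) * (1 - abar_tp) / (1 - abar_t)) * x
    return x
```

Below is a series of images from every 5th loop of the denoising process. Further below, I show the original image, the iteratively denoised image, the one-step denoised image, and the Gaussian-blurred image for comparison.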

Noisy Campanile at t=90

Noisy Campanile at t=240

Noisy Campanile at t=390

Noisy Campanile at t=540

Noisy Campanile at t=690

We can see that the iteratively denoised image has the best quality, followed by the one-step denoised image, while the Gaussian-blurred image has the worst quality.

Original

Iteratively Denoised Campanile

One-Step Denoised Campanile

Gaussian Blurred Campanile

1.5 Diffusion Model Sampling

In this part, I used iterative_denoise with i_start = 0: I passed in random noise generated using torch.randn and used the prompt embedding of "a high quality photo".
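Sampling from scratch is just the same loop started from pure noise; a minimal sketch reusing the names from section 1.4 (`hq_photo_embeds` stands for the embedding of "a high quality photo"):

```python
import torch

torch.manual_seed(1104)
x = torch.randn(1, 3, 64, 64)  # pure noise at the stage-1 resolution
sample = iterative_denoise(x, timesteps, unet, alphas_cumprod, hq_photo_embeds)
```

Here are five results sampled using this procedure, with a seed of 1104.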

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

1.6 Classifier-Free Guidance

Some of the images in the prior section are not very good. To address this, we can perform classifier-free guidance. This is done by computing both a noise estimate conditioned on the prompt and an unconditional noise estimate, then calculating the new noise estimate as: \[\varepsilon = \varepsilon_u + \gamma(\varepsilon_c - \varepsilon_u)\] where \(\varepsilon_u\) is the unconditional noise estimate, \(\varepsilon_c\) is the conditional noise estimate, and \(\gamma\) is the guidance scale. To get the unconditional noise estimate, we can simply pass an empty prompt embedding to the model. The rest of the process is the same as in the last part.
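A minimal sketch of the CFG noise estimate; the unet call signature follows the diffusers convention and is an assumption:

```python
def cfg_noise_estimate(unet, x, t, cond_embeds, uncond_embeds, gamma=7.0):
    eps_c = unet(x, t, encoder_hidden_states=cond_embeds).sample    # conditional
    eps_u = unet(x, t, encoder_hidden_states=uncond_embeds).sample  # unconditional
    # gamma > 1 pushes the estimate further toward the prompt
    return eps_u + gamma * (eps_c - eps_u)
```

Below are some results using classifier-free guidance and a seed of 2002.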

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

1.7 Image-to-image Translation

If we take an image, add some noise to it, and then denoise it, we get an image that is similar to the original. The more noise we add and remove, the more the denoised image differs from the original. In this part, I added different amounts of noise to images and then denoised them using the text prompt "a high quality photo" to get new images (following the SDEdit algorithm).
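A minimal SDEdit sketch, reusing the forward() and iterative_denoise() sketches from earlier sections; i_start indexes into the strided timestep list:

```python
def sdedit(x_orig, i_start, timesteps, unet, alphas_cumprod, prompt_embeds):
    # noise the original image up to timesteps[i_start], then denoise back down
    x = forward(x_orig, timesteps[i_start], alphas_cumprod)
    return iterative_denoise(x, timesteps[i_start:], unet, alphas_cumprod, prompt_embeds)
```

Below are some examples at different noise levels.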

SDEdit with i_start = 1

SDEdit with i_start = 3

SDEdit with i_start = 5

SDEdit with i_start = 7

SDEdit with i_start = 10

SDEdit with i_start = 20

Campanile

SDEdit with i_start = 1

SDEdit with i_start = 3

SDEdit with i_start = 5

SDEdit with i_start = 7

SDEdit with i_start = 10

SDEdit with i_start = 20

A Tree

SDEdit with i_start = 1

SDEdit with i_start = 3

SDEdit with i_start = 5

SDEdit with i_start = 7

SDEdit with i_start = 10

SDEdit with i_start = 20

A Car

1.7.1 Editing Hand-Drawn and Web Images

We can also apply SDEdit to hand-drawn and web images to force them onto the natural image manifold. Below are the results of applying SDEdit with different noise levels to web and hand-drawn images. The first is a web image; the second and third are my hand-drawn images.

Volcano with i_start = 1

Volcano with i_start = 3

Volcano with i_start = 5

Volcano with i_start = 7

Volcano with i_start = 10

Volcano with i_start = 20

Volcano

Apple with i_start = 1

Apple with i_start = 3

Apple with i_start = 5

Apple with i_start = 7

Apple with i_start = 10

Apple with i_start = 20

Original Apple Sketch

SDEdit with i_start = 1

SDEdit with i_start = 3

SDEdit with i_start = 5

SDEdit with i_start = 7

SDEdit with i_start = 10

SDEdit with i_start = 20

Original Pumpkin Sketch

1.7.2 Inpainting

To do image inpainting, I first generate a mask specifying the region I want to replace, then I run the diffusion denoising loop, but at every step, after obtaining x_t, I "force" x_t to have the same pixels as the original image where the mask is 0, i.e.: \[x_t \leftarrow \mathbf{m}x_t + (1-\mathbf{m})\text{forward}(x_{orig}, t)\] This way, we only generate new content inside the region we want to replace.
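A minimal sketch of the forcing step, reusing the forward() sketch from section 1.1; m is 1 inside the region to regenerate and 0 elsewhere:

```python
def inpaint_step(x_t, x_orig, m, t, alphas_cumprod):
    # keep original pixels (noised to level t) wherever the mask is 0
    return m * x_t + (1 - m) * forward(x_orig, t, alphas_cumprod)
```

Below are the inpainting results for the Campanile and two images of my own choice.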

Campanile

Mask

Hole to Fill

Campanile Inpainted

Picture of People

Mask

Hole to Fill

People Inpainted

Neymar

Mask

Hole to Fill

Neymar Inpainted

1.7.3 Text-Conditional Image-to-image Translation

Now we perform SDEdit but guide the projection with a text prompt. The following results are SDEdit at different noise levels with the text prompt "a rocket ship". The first is the Campanile, and the last two are images of my own choice (also conditioned on "a rocket ship").

Rocket Ship at noise level 1

Rocket Ship at noise level 3

Rocket Ship at noise level 5

Rocket Ship at noise level 7

Rocket Ship at noise level 10

Rocket Ship at noise level 20

Campanile

Rocket Ship at noise level 1

Rocket Ship at noise level 3

Rocket Ship at noise level 5

Rocket Ship at noise level 7

Rocket Ship at noise level 10

Rocket Ship at noise level 20

Toothpaste

Rocket Ship at noise level 1

Rocket Ship at noise level 3

Rocket Ship at noise level 5

Rocket Ship at noise level 7

Rocket Ship at noise level 10

Rocket Ship at noise level 20

Ironman

1.8 Visual Anagrams

To create a visual anagram that looks like one thing upright but another thing upside down, we denoise the image with one prompt to get one noise estimate, then flip the image upside down and denoise with another prompt to get a second noise estimate. We flip the second estimate back and average the two, then proceed with denoising using the averaged noise estimate.
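A minimal sketch of the combined noise estimate; the unet call convention is an assumption:

```python
import torch

def anagram_noise(unet, x, t, embeds_a, embeds_b):
    eps_a = unet(x, t, encoder_hidden_states=embeds_a).sample
    # estimate noise on the flipped image with the second prompt
    eps_b = unet(torch.flip(x, dims=[-2]), t, encoder_hidden_states=embeds_b).sample
    eps_b = torch.flip(eps_b, dims=[-2])  # flip the estimate back
    return (eps_a + eps_b) / 2
```

Below are some of the results; for each pair, the first image is shown upright and the second is flipped upside down.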

An oil painting of an old man

An oil painting of people around a campfire

A parrot sitting on a tree branch

A blooming garden with flowers and trees

A lion under a tree

A person walking in the forest

1.9 Hybrid Images

To implement the make_hybrids function, I first estimate the noise separately using two different prompts, then create a composite noise estimate by combining the low frequencies from one noise estimate with the high frequencies from the other, i.e.:

\[\varepsilon_1 = \text{UNet}(x_t, t, p_1)\] \[\varepsilon_2 = \text{UNet}(x_t, t, p_2)\] \[\varepsilon = f_\text{lowpass}(\varepsilon_1) + f_\text{highpass}(\varepsilon_2)\]
We then use the same sampling process, but with the new composite noise estimate. For the low-pass Gaussian filter I used kernel size 33 and sigma 2.
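A minimal sketch of the composite estimate, using torchvision's Gaussian blur as the low-pass filter (the unet call convention is an assumption):

```python
import torchvision.transforms.functional as TF

def hybrid_noise(unet, x, t, embeds_low, embeds_high):
    eps1 = unet(x, t, encoder_hidden_states=embeds_low).sample
    eps2 = unet(x, t, encoder_hidden_states=embeds_high).sample
    lowpass = TF.gaussian_blur(eps1, kernel_size=33, sigma=2)
    highpass = eps2 - TF.gaussian_blur(eps2, kernel_size=33, sigma=2)
    return lowpass + highpass
```

Below are some of the results.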

Hybrid image of a skull and a waterfall

Hybrid image of a lion and a skull

Hybrid image of flower patterns and a human skeleton

Part B: Diffusion Models from Scratch!

Single-Step Denoising UNet

1.1 Implementing the UNet

First we need to implement the UNet. A UNet consists of a series of downsampling and upsampling blocks with skip connections. In this part, I build UNets to be trained on the famous MNIST dataset.
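As a flavor of the structure, here is a minimal sketch of the basic Conv block; the full architecture (DownBlocks, UpBlocks, and skip connections) follows figures 1 and 2 in the spec, and the exact layer choices shown here are assumptions:

```python
import torch.nn as nn

class ConvBlock(nn.Module):
    # two 3x3 convolutions with normalization and GELU, preserving spatial size
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch), nn.GELU(),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch), nn.GELU(),
        )

    def forward(self, x):
        return self.net(x)
```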

1.2 Using the UNet to Train a Denoiser

To train the denoiser, we need training data pairs (z, x), where each x is a clean MNIST digit and z is a noisy version of it. For each training batch,

\[ z = x + \sigma\varepsilon, \text{ where } \varepsilon \sim \mathcal{N}(0, I) \]
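A minimal sketch of generating a noisy training pair from a clean digit x:

```python
import torch

def add_noise(x, sigma):
    # z = x + sigma * eps, with eps ~ N(0, I)
    return x + sigma * torch.randn_like(x)
```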
Below is a visualization of the noising process over sigma = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0].

Varying levels of noise on MNIST digits

1.2.1 Training

I implemented the UNet according to figures 1 and 2 in the project spec. Then, I trained the denoiser on the MNIST dataset for 5 epochs, with noisy images z generated by applying sigma = 0.5 noise to clean images x.
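A minimal sketch of the training loop; the optimizer settings shown here (Adam, lr=1e-4) are assumptions, and fresh noise is drawn for every batch:

```python
import torch
import torch.nn.functional as F

opt = torch.optim.Adam(model.parameters(), lr=1e-4)  # settings are an assumption
for epoch in range(5):
    for x, _ in train_loader:              # labels are unused for plain denoising
        z = x + 0.5 * torch.randn_like(x)  # sigma = 0.5, fresh noise each batch
        loss = F.mse_loss(model(z), x)     # L2 loss between prediction and clean x
        opt.zero_grad()
        loss.backward()
        opt.step()
```

Below is the training loss during the training process (linear scale).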

Training Loss Curve (Linear Scale)

Below are sample results after the 1st and 5th epochs.

Results on digits from the test set after 1 epoch of training

Results on digits from the test set after 5 epochs of training

1.2.2 Out-of-Distribution Testing

Because the denoiser was trained on MNIST digits noised with sigma = 0.5, it might not perform well given other noise levels. Below are the denoiser results on test set digits with noise sigma = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0].

Results on digits from the test set with varying noise levels

Time-Conditioned UNet

Following figures 8 and 9 in the project spec, I implemented the fully-connected block and injected the conditioning signal into the UNet. For training, I followed Algorithm B.1 in the spec: pick a random image from the training set and a random t, and train the denoiser to predict the noise added. This is repeated for different images and different t values until convergence.
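A minimal sketch of the fully-connected block; the exact layer choices are assumptions based on the spec's figures:

```python
import torch.nn as nn

class FCBlock(nn.Module):
    # maps the (normalized) timestep to a per-channel conditioning signal
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, out_dim), nn.GELU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, t):
        # t: (batch, in_dim); reshape to (batch, out_dim, 1, 1) so the signal
        # broadcasts over the UNet's feature maps
        return self.net(t).unsqueeze(-1).unsqueeze(-1)
```

Below is the training loss curve of the time-conditioned UNet (linear scale; I used the parameters specified in the spec).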

Time-Conditioned UNet training loss curve (Linear Scale)

Sampling results

The sampling process is very similar to the previous parts; here I followed Algorithm B.2 in the project spec.
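A minimal sketch of the sampling loop in the style of Algorithm B.2 (standard DDPM ancestral sampling); the model's conditioning shape is an assumption:

```python
import torch

@torch.no_grad()
def sample(model, T, betas, alphas, alphas_cumprod, shape=(1, 1, 28, 28)):
    x = torch.randn(shape)  # start from pure noise
    for t in range(T - 1, -1, -1):
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        t_norm = torch.full((x.shape[0], 1), t / T)  # normalized timestep
        eps = model(x, t_norm)                       # predicted noise
        x = (x - (1 - alphas[t]) / (1 - alphas_cumprod[t]).sqrt() * eps) / alphas[t].sqrt()
        x = x + betas[t].sqrt() * z                  # add back variance
    return x
```

Below are the sampling results for the time-conditioned UNet after 5 and 20 epochs of training. They are shown as GIFs to demonstrate the denoising process.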

Sampling process (epoch 5)

Sampling process (epoch 20)

Adding Class-Conditioning to UNet

Following the project spec, I added class-conditioning by injecting the class label c into the time-conditioned UNet. In addition, dropout (probability 0.1) is applied to the conditioning so that the tensor c is sometimes set to 0 and the model can still perform well without a class label. The training process is very similar to the previous part, the only difference being the added conditioning vector c.
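A minimal sketch of the conditioning-with-dropout step, assuming one-hot class labels:

```python
import torch
import torch.nn.functional as F

def condition_vector(labels, num_classes=10, p_uncond=0.1):
    c = F.one_hot(labels, num_classes).float()
    # zero out c with probability 0.1 so the model also learns the
    # unconditional distribution (needed later for CFG)
    keep = (torch.rand(labels.shape[0], 1) > p_uncond).float()
    return c * keep
```

Below is the training loss curve of the class-conditioned UNet.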

Class-Conditioned UNet training loss curve (Linear Scale)

Sampling results

I followed Algorithm B.4 in the project spec to perform CFG sampling.
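A minimal sketch of the CFG noise estimate inside the sampling loop; the guidance scale and model signature are assumptions:

```python
import torch

def cfg_eps(model, x, t_norm, c, gamma=5.0):
    eps_c = model(x, t_norm, c)                    # class-conditional estimate
    eps_u = model(x, t_norm, torch.zeros_like(c))  # unconditional (null class)
    return eps_u + gamma * (eps_c - eps_u)
```

Below are the sampling results for the class-conditioned UNet after 5 and 20 epochs of training. They are shown as GIFs to demonstrate the denoising process.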

Sampling process (epoch 5)

Sampling process (epoch 20)

Bells and Whistles

Sampling Gifs

By keeping track of intermediate denoising results (I used an interval of 10 denoising steps), I made GIFs to further visualize the denoising process. Below is a summary of the time-conditioned and class-conditioned models' sampling GIFs.

Time-Conditioned Model Results:

Time-Conditioned Model (epoch 5)

Time-Conditioned Model (epoch 20)

Class-Conditioned Model Results:

Class-Conditioned Model (epoch 5)

Class-Conditioned Model (epoch 20)

This comparison shows the progression of both models across training epochs: image quality and stability improve as training progresses.

Stationary pictures

I realized that the infinitely looping GIFs might be hard to read, so here I have included the exact same GIFs, looping only once, so readers can better see the final result of the denoising process.

Time-Conditioned Results (single loop):

Time-Conditioned Model (epoch 5)

Time-Conditioned Model (epoch 20)

Class-Conditioned Results (single loop):

Class-Conditioned Model (epoch 5)

Class-Conditioned Model (epoch 20)

These single-loop versions make it easier to see the end state of the denoising process.

Reflection

This is a very fun project and I learned a lot!