CS180 - Fall 2024 Alex Cao
This project is divided into two parts. In the first part, I will explore the sampling process of diffusion models, try different denoising methods, and create fun results such as visual anagrams, image-to-image translations, and hybrid images. In the second part, I will implement my own UNets for the diffusion process and train them on the famous MNIST dataset.
In this project, we are using the DeepFloyd IF diffusion model. This is a two-stage model: the first stage produces images of size 64 x 64, and the second stage takes the output from the first stage and upscales it to 256 x 256. Below are the results of generation using the three text prompts provided in the project spec. The quality of the images is quite nice; I am surprised they are this good with only 20 denoising steps. The picture of the man wearing a hat is especially realistic. The images also reflect the prompts very well: the objects are exactly what the prompts describe, and the first image accurately captures the oil painting style mentioned in its prompt. I am using the seed 1104.
Below are the results of generation using the prompt "an oil painting of a snowy mountain village" with different numbers of denoising steps. We can see that as the number of denoising steps increases, the image becomes more colorful and detailed. Perhaps more denoising steps allow the model to generate more "oil paintingness" in the image.
For this part, I implemented the forward function. I got alpha_cumprod by indexing the t-th element of alphas_cumprod, and epsilon is generated using torch.randn_like. Below is the Berkeley Campanile at different noise levels.
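For reference, here is a minimal sketch of this forward process, assuming `alphas_cumprod` is the 1-D tensor of cumulative alpha products that comes with the DeepFloyd scheduler (the function name and arguments are my own):

```python
import torch

def forward(im, t, alphas_cumprod):
    """Add noise to a clean image `im` at timestep `t` (higher t = more noise)."""
    alpha_cumprod = alphas_cumprod[t]          # scalar \bar{alpha}_t
    epsilon = torch.randn_like(im)             # epsilon ~ N(0, I)
    # x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * epsilon
    x_t = torch.sqrt(alpha_cumprod) * im + torch.sqrt(1 - alpha_cumprod) * epsilon
    return x_t, epsilon
```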
In this part I applied Gaussian blurring with kernel size 5 and sigma 3. Below are the side-by-side results of the Gaussian-blurred images and the original noisy images.
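A short sketch of this classical baseline, using torchvision's built-in Gaussian blur (here `x_t` stands for a noisy image produced by the forward process above):

```python
import torchvision.transforms as T

# Gaussian blur baseline: kernel size 5, sigma 3, applied to the noisy image x_t
blur = T.GaussianBlur(kernel_size=5, sigma=3.0)
blurred = blur(x_t)
```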
To perform one-step denoising, I first use the forward process to generate a noisy image at a given noise level t, then I use the stage-1 UNet to predict the noise. Once I have the predicted noise, I obtain the clean image by solving for \(x_0\) in the equation \[ x_t = \sqrt{\bar{\alpha}_t}x_0 + \sqrt{1-\bar{\alpha}_t}\epsilon \quad \text{where} \quad \epsilon \sim \mathcal{N}(0,1) \] Below are the original image, the noisy images, and the one-step denoised images at different noise levels.
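Below is a hedged sketch of the one-step denoise, assuming `unet` is the stage-1 UNet, `prompt_embeds` is the embedding of "a high quality photo", and the first three output channels of the UNet are the noise estimate (as in the course notebook):

```python
import torch

def one_step_denoise(unet, x_t, t, alphas_cumprod, prompt_embeds):
    """Predict the noise in x_t with the UNet, then solve for the clean image x_0."""
    with torch.no_grad():
        # the first 3 output channels are the noise estimate (the rest is predicted variance)
        noise_est = unet(x_t, t, encoder_hidden_states=prompt_embeds).sample[:, :3]
    abar_t = alphas_cumprod[t]
    # x_0 = (x_t - sqrt(1 - abar_t) * eps) / sqrt(abar_t)
    x_0 = (x_t - torch.sqrt(1 - abar_t) * noise_est) / torch.sqrt(abar_t)
    return x_0
```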
Creating a list of monotonically decreasing timesteps starting at 990, with a stride of 30, and ending at 0, I followed this formula to iteratively denoise the image: \[ x_{t'} = \frac{\sqrt{\bar{\alpha}_{t'}}\beta_t}{1-\bar{\alpha}_t}x_0 + \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t'})}{1-\bar{\alpha}_t}x_t + v_\sigma \] Below is a series of images taken at every 5th iteration of the denoising loop. Further below, I have shown the original image, the iteratively denoised image, the one-step denoised image, and the Gaussian-blurred image for comparison.
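Here is a rough sketch of the loop, reusing the notation above; `strided_timesteps` is the list 990, 960, ..., 0, and the added variance term \(v_\sigma\) is omitted for brevity:

```python
import torch

def iterative_denoise(unet, x, strided_timesteps, alphas_cumprod, prompt_embeds, i_start=0):
    """Iteratively denoise x from timestep strided_timesteps[i_start] down to 0."""
    x_t = x
    for i in range(i_start, len(strided_timesteps) - 1):
        t, t_prime = strided_timesteps[i], strided_timesteps[i + 1]   # t > t'
        abar_t, abar_tp = alphas_cumprod[t], alphas_cumprod[t_prime]
        alpha_t = abar_t / abar_tp
        beta_t = 1 - alpha_t
        with torch.no_grad():
            eps = unet(x_t, t, encoder_hidden_states=prompt_embeds).sample[:, :3]
        # current estimate of the clean image
        x_0 = (x_t - torch.sqrt(1 - abar_t) * eps) / torch.sqrt(abar_t)
        # x_{t'} = sqrt(abar_t')*beta_t/(1-abar_t) * x_0
        #        + sqrt(alpha_t)*(1-abar_t')/(1-abar_t) * x_t  (+ v_sigma, omitted)
        x_t = (torch.sqrt(abar_tp) * beta_t / (1 - abar_t)) * x_0 \
            + (torch.sqrt(alpha_t) * (1 - abar_tp) / (1 - abar_t)) * x_t
    return x_t
```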
We can see that the iteratively denoised image has the best quality, followed by the one-step denoised image, while the Gaussian-blurred image has the worst quality.
In this part, I used iterative_denoise with i_start = 0, passed in random noise generated using torch.randn, and used the prompt embedding of "a high quality photo". Here are five results sampled using this procedure, with a seed of 1104.
Some of the images in the prior section are not very good. To address this, we can perform classifier-free guidance (CFG). This is done by computing both a noise estimate conditioned on the prompt and an unconditional noise estimate, then calculating a new noise estimate as: \[\varepsilon = \varepsilon_u + \gamma(\varepsilon_c - \varepsilon_u)\] where \(\varepsilon_u\) is the unconditional noise estimate, \(\varepsilon_c\) is the conditional noise estimate, and \(\gamma\) is the guidance scale. To get the unconditional noise estimate, we can simply pass an empty prompt embedding to the model. The rest of the process is the same as the last part. Below are some results using classifier-free guidance and a seed of 2002.
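A minimal sketch of the CFG noise estimate, assuming `uncond_embeds` is the embedding of the empty prompt "":

```python
import torch

def cfg_noise_estimate(unet, x_t, t, prompt_embeds, uncond_embeds, gamma=7.0):
    """Classifier-free guidance: blend conditional and unconditional noise estimates."""
    with torch.no_grad():
        eps_c = unet(x_t, t, encoder_hidden_states=prompt_embeds).sample[:, :3]  # conditional
        eps_u = unet(x_t, t, encoder_hidden_states=uncond_embeds).sample[:, :3]  # unconditional
    # eps = eps_u + gamma * (eps_c - eps_u); gamma > 1 pushes samples toward the prompt
    return eps_u + gamma * (eps_c - eps_u)
```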
If we take an image, add some noise to it, and then denoise it, we get an image that is similar to the original. The more noise we add and remove, the more the denoised image differs from the original. In this part, I added different amounts of noise to images, and then denoised them using the text prompt "a high quality photo" to get new images (following the SDEdit algorithm). Below are some examples with different noise levels.
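In code, SDEdit is just the forward process followed by iterative denoising from a later starting index; this sketch reuses the functions from the previous parts, with `i_start` indexing into `strided_timesteps`:

```python
# SDEdit sketch: noise the original image to strided_timesteps[i_start],
# then run the iterative denoiser from that point.
# Smaller i_start = more noise = a result further from the original image.
for i_start in [1, 3, 5, 7, 10, 20]:
    t_start = strided_timesteps[i_start]
    x_noisy, _ = forward(original_image, t_start, alphas_cumprod)
    edited = iterative_denoise(unet, x_noisy, strided_timesteps, alphas_cumprod,
                               prompt_embeds, i_start=i_start)
```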
We can also apply SDEdit to hand-drawn and web images to force them onto the image manifold. Below are the results of applying SDEdit with different noise levels to web and hand-drawn images. The first one is a web image, and the second and last are my hand-drawn images.
To do image inpainting, I first generate a mask specifying the region I want to replace, then I run the diffusion denoising loop, but at every step, after obtaining x_t, I "force" x_t to have the same pixels as the original image wherever the mask is 0, i.e.: \[x_t \leftarrow \mathbf{m}x_t + (1-\mathbf{m})\text{forward}(x_{orig}, t)\] This way, we only generate new content inside the region we want to replace. Below are the results of inpainting for the Campanile and two images of my own choice.
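A sketch of this forcing step, applied after each denoising update and reusing the `forward` function from the forward-process part (names are mine):

```python
def inpaint_step(x_t, x_orig, mask, t, alphas_cumprod):
    """Keep the original pixels where mask == 0; regenerate where mask == 1.
    `mask` is 1 inside the region to replace, 0 elsewhere."""
    noisy_orig, _ = forward(x_orig, t, alphas_cumprod)   # original image noised to level t
    return mask * x_t + (1 - mask) * noisy_orig
```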
Now we will do SDEdit but guide the projection with a text prompt. The following results are SDEdit at different noise levels, with the text prompt "a rocket ship". The first one is the Campanile, and the last two are images of my own choice (also conditioned on "a rocket ship").
To create a visual anagram that looks like one thing normally but like another thing upside down, we denoise an image normally with one prompt to get one noise estimate; we then flip the image upside down and denoise it with another prompt to get a second noise estimate. We flip the second noise estimate back, average the two, and proceed with the denoising step using the averaged noise estimate. Below are some of the results, where the first image is the original orientation, and the second image is flipped upside down.
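A sketch of the anagram noise estimate (CFG omitted for brevity; `embeds_a` and `embeds_b` are the two prompt embeddings):

```python
import torch

def anagram_noise_estimate(unet, x_t, t, embeds_a, embeds_b):
    """Average the noise estimate for prompt A on x_t with the (flipped back)
    noise estimate for prompt B on the upside-down image."""
    with torch.no_grad():
        eps_a = unet(x_t, t, encoder_hidden_states=embeds_a).sample[:, :3]
        x_flip = torch.flip(x_t, dims=[-2])              # flip vertically (upside down)
        eps_b = unet(x_flip, t, encoder_hidden_states=embeds_b).sample[:, :3]
    eps_b = torch.flip(eps_b, dims=[-2])                 # flip the estimate back
    return (eps_a + eps_b) / 2
```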
To implement the make_hybrids function, I first estimate the noise separately using two different prompts, then create a composite noise estimate by combining the low frequencies from one noise estimate with the high frequencies from the other, i.e.: \[\varepsilon = f_{\text{lowpass}}(\varepsilon_1) + f_{\text{highpass}}(\varepsilon_2)\]
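A sketch of this composite noise estimate, using a Gaussian blur as the low-pass filter (the kernel size and sigma are illustrative, and CFG is again omitted for brevity):

```python
import torch
import torchvision.transforms as T

def hybrid_noise_estimate(unet, x_t, t, embeds_low, embeds_high,
                          kernel_size=33, sigma=2.0):
    """Low frequencies of one prompt's noise estimate plus high frequencies of the other's."""
    lowpass = T.GaussianBlur(kernel_size=kernel_size, sigma=sigma)
    with torch.no_grad():
        eps_1 = unet(x_t, t, encoder_hidden_states=embeds_low).sample[:, :3]
        eps_2 = unet(x_t, t, encoder_hidden_states=embeds_high).sample[:, :3]
    # high-pass = original minus its low-pass component
    return lowpass(eps_1) + (eps_2 - lowpass(eps_2))
```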
First we need to implement the UNet. A UNet consists of a few downsampling and upsampling blocks with skip connections. In this part, I will be building UNets to be trained on the famous MNIST dataset.
To train the denoiser, we need training data pairs (z, x), where each x is a clean MNIST digit and z is a noisy version of it. For each training batch, the noisy images are generated on the fly by adding Gaussian noise to the clean digits: \(z = x + \sigma\epsilon\), where \(\epsilon \sim \mathcal{N}(0, I)\).
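A minimal sketch of generating such a pair for each batch (the function name is mine):

```python
import torch

def add_noise(x, sigma=0.5):
    """Generate a noisy training input: z = x + sigma * eps, with eps ~ N(0, I)."""
    eps = torch.randn_like(x)
    return x + sigma * eps
```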
I implemented the UNet according to figure 1 and figure 2 in the project spec. Then, I trained the denoiser on the MNIST dataset, with each noisy image z generated from a clean image x using sigma = 0.5. I trained the model for 5 epochs; below is the training loss during the training process (linear scale).
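For reference, here is a rough sketch of the training loop I describe; `UNet(...)` stands in for the denoiser from figures 1 and 2, and the batch size and learning rate shown follow the spec's recommendations as I remember them, so treat the exact values as illustrative:

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"
train_set = datasets.MNIST(root="./data", train=True, download=True,
                           transform=transforms.ToTensor())
loader = DataLoader(train_set, batch_size=256, shuffle=True)

model = UNet(in_channels=1, num_hiddens=128).to(device)   # assumed constructor
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = torch.nn.MSELoss()

losses = []
for epoch in range(5):
    for x, _ in loader:                  # labels are unused for the plain denoiser
        x = x.to(device)
        z = add_noise(x, sigma=0.5)      # noisy input from the sketch above
        optimizer.zero_grad()
        loss = criterion(model(z), x)    # L2 loss: predict the clean image from z
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
```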
Below are sample results after the 1st and 5th epoch.
Because the denoiser was trained on MNIST digits noised with sigma = 0.5, it might not perform well given other noise levels. Below are the denoiser results on test set digits with noise sigma = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0].
Following figure 8 and figure 9 in the project spec, I implemented the fully-connected block and injected the conditioning signal into the UNet. For training, I followed Algorithm B.1 in the spec: pick a random image from the training set and a random t, and train the denoiser to predict the noise that was added. This is repeated for different images and different t values until convergence. Below is the training loss curve of the time-conditioned UNet (linear scale; I used the parameters specified in the spec).
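Below is a rough sketch of the fully-connected block and one possible injection point; the layer sizes and wiring reflect my reading of figures 8 and 9, so treat this as illustrative rather than the spec's exact architecture:

```python
import torch
import torch.nn as nn

class FCBlock(nn.Module):
    """Fully-connected block used to embed the (normalized) timestep t."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.GELU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, t):
        # t has shape (batch, in_dim); in_dim = 1 for the normalized timestep
        return self.net(t)

# Inside the UNet's forward pass, the embedded timestep is broadcast-added to
# intermediate feature maps, e.g. (illustrative, not the exact spec wiring):
#   t_emb = self.t_fc(t.view(-1, 1) / T)          # normalize t to [0, 1] and embed
#   unflat = unflat + t_emb.view(-1, D, 1, 1)     # condition the bottleneck features
```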
The sampling process is very similar to the previous parts; here I followed Algorithm B.2 in the project spec. Below are the sampling results for the time-conditioned UNet after 5 and 20 epochs. They are presented as GIFs to demonstrate the denoising process.
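A hedged sketch of this sampling loop, following the standard DDPM update that Algorithm B.2 is based on; `model` is the time-conditioned UNet, and `betas`, `alphas`, `alphas_cumprod` are the DDPM schedule tensors indexed from 1 to T:

```python
import torch

@torch.no_grad()
def sample_time_conditioned(model, betas, alphas, alphas_cumprod, T=300,
                            shape=(16, 1, 28, 28), device="cuda"):
    """Start from pure noise and walk t = T ... 1, denoising one step at a time."""
    betas, alphas, alphas_cumprod = (s.to(device) for s in (betas, alphas, alphas_cumprod))
    x = torch.randn(shape, device=device)
    for t in range(T, 0, -1):
        z = torch.randn_like(x) if t > 1 else torch.zeros_like(x)
        t_norm = torch.full((shape[0], 1), t / T, device=device)   # normalized timestep
        eps = model(x, t_norm)                                     # predicted noise
        # DDPM update: remove the predicted noise, then add fresh noise
        x = (1 / torch.sqrt(alphas[t])) * (
                x - (1 - alphas[t]) / torch.sqrt(1 - alphas_cumprod[t]) * eps
            ) + torch.sqrt(betas[t]) * z
    return x
```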
Following the project spec, I added class conditioning by injecting the class label c into the time-conditioned UNet model. In addition, dropout (probability = 0.1) is applied so that the conditioning vector c is sometimes set to 0, and the model can still perform well without a class label. The training process is very similar to the previous part, with the only difference being the added conditioning vector c. Below is the training loss curve of the class-conditioned UNet.
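A small sketch of how this conditioning vector with dropout can be built for each batch (names are mine):

```python
import torch
import torch.nn.functional as F

def condition_vector(labels, num_classes=10, p_uncond=0.1):
    """One-hot encode the class labels, then zero out the whole vector for ~10% of
    the batch so the model also learns an unconditional (null-class) mode."""
    c = F.one_hot(labels, num_classes).float()                      # (batch, 10)
    keep = (torch.rand(c.shape[0], 1, device=c.device) > p_uncond).float()
    return c * keep                                                 # dropped rows become all-zero
```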
I followed Algorithm B.4 in the project spec to perform CFG sampling. Below are the sampling results for the class-conditioned UNet after 5 and 20 epochs. They are presented as GIFs to demonstrate the denoising process.
By keeping track of the intermediate denoising results (at an interval of 10 denoising steps), I made GIFs to further visualize the denoising process. Below is a summary of the time-conditioned and class-conditioned model sampling GIFs.
This comparison clearly shows the progression of both models' capabilities across training epochs, demonstrating the improvements in image quality and stability as training progresses.
I realized that the infinitely looping GIFs might be hard to read, so here I have included the exact same GIFs, looping only once, for readers to better see the result of the denoising process.
These warped sampling results show an alternative visualization of the denoising process, highlighting different aspects of how the models learn to generate images.
This is a very fun project and I learned a lot!