Diffusion Model
First "stir" an image into noise, then learn to "restore" it from noise — the core idea behind Stable Diffusion and DALL-E
✦ See It in Action: An image disappears into noise, then reappears
Click "Add Noise" to watch the image gradually become random noise; click "Remove Noise" to see it restored from the noise.
Reverse Process: A neural network learns to predict and remove the noise at each step, gradually restoring the image
01 Plain English Explanation of Diffusion Models
The Story of a Photo Disappearing Under Sand
Imagine a clear photo. Every second you sprinkle a little sand on it. After 100 seconds, the photo is completely covered by sand, leaving only a pile of random grains.
The core question of diffusion models is: Can we train a neural network to learn "how to remove one grain of sand at a time"?
If the network masters this, you can start from a pile of random noise, gradually remove the "sand", and eventually restore an image — or even generate a brand new one.
Two Processes
Forward process (noising): each step adds a little Gaussian noise to the image. This process follows a fixed mathematical formula — you can jump directly from x₀ to any timestep x_t without stepping through each one.
Reverse process (denoising): starting from pure noise x_T, the neural network predicts "what noise was mixed into this image" at each step. Subtracting that noise gives a slightly cleaner x_{t-1}; repeat for T steps to recover x₀.
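The forward "jump" can be written in a few lines. A minimal numpy sketch (the schedule values and the toy `add_noise` helper are illustrative assumptions, not any particular library's API):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # linear noise schedule, as in DDPM
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)       # cumulative signal preservation ᾱ_t

def add_noise(x0, t, rng):
    """Sample x_t directly from x0 in one shot -- no step-by-step loop."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps

rng = np.random.default_rng(0)
x0 = np.ones((4, 4))                 # a toy "image"
x_mid = add_noise(x0, 500, rng)      # halfway: signal and noise mixed
x_end = add_noise(x0, T - 1, rng)    # near the end: almost pure noise
print(alpha_bar[0], alpha_bar[-1])   # close to 1 at t=0, close to 0 at t=T
```

Because ᾱ_t shrinks toward zero, the signal term fades while the noise term grows, which is exactly the "photo disappearing" animation above.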
What Happens During Training?
Use the forward formula to directly generate x_t (the version with t steps of noise)
The network takes x_t and timestep t as input, and outputs the predicted noise ε̂
Loss = ‖ε - ε̂‖², a very simple MSE loss
That's it — just these three steps. Train on millions of images, and the network learns to "denoise".
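The three steps above can be sketched as a single training step. This is a schematic, not a real trainer: `dummy_model` is a stand-in for the actual noise-prediction network.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def dummy_model(x_t, t):
    # Placeholder network: always predicts zero noise.
    return np.zeros_like(x_t)

def training_step(x0, model, rng):
    t = rng.integers(0, T)                               # pick a random timestep
    eps = rng.standard_normal(x0.shape)                  # the true noise ε
    # 1) forward formula: build x_t directly from x0
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps
    eps_hat = model(x_t, t)                              # 2) network predicts ε̂
    return np.mean((eps - eps_hat) ** 2)                 # 3) simple MSE loss

rng = np.random.default_rng(0)
loss = training_step(np.ones((8, 8)), dummy_model, rng)
```

With the zero-predicting dummy model, the loss is just the mean of ε², i.e. about 1; a trained network drives it well below that.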
Why Is It Better Than GAN?
Only needs to minimize MSE; no instability from GAN's two-network adversarial game
Starting from different random noise each time produces diverse results without mode collapse
Adding text conditions during denoising (Stable Diffusion) enables text-to-image generation
The trade-off: generation requires dozens to hundreds of denoising steps, far slower than a GAN's single forward pass
Diffusion Models in the Real World
Stable Diffusion: performs diffusion in a compressed "latent space" for faster speed; open-source and runs locally
DALL-E: OpenAI's text-to-image model that understands complex text prompts
Video generation: extends diffusion models to video, applying diffusion along the temporal dimension as well
Other domains: WaveGrad (speech synthesis), RFdiffusion (protein structure generation)
Building the Diffusion Process Step by Step
From noise scheduling to forward diffusion visualization, build it up gradually.
A linear schedule gradually increases the noise level. ᾱ_t is the cumulative signal preservation rate.
No need to iterate step by step — directly compute the noised signal at any timestep t using a single formula.
Observe the signal evolution from t=0 (original sine wave) to t=99 (pure noise).
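The three demo steps above can be reproduced in a few lines of numpy. The exact schedule range here (β from 1e-4 to 0.05 over 100 steps) is an assumption chosen so the signal is nearly gone by t=99:

```python
import numpy as np

T = 100
betas = np.linspace(1e-4, 0.05, T)      # linear schedule
alpha_bar = np.cumprod(1.0 - betas)     # cumulative signal preservation ᾱ_t

x = np.linspace(0, 4 * np.pi, 200)
x0 = np.sin(x)                          # the original sine wave

rng = np.random.default_rng(42)
def noised(t):
    # Closed-form noising: jump straight to timestep t.
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps

for t in (0, 50, 99):
    corr = np.corrcoef(x0, noised(t))[0, 1]
    print(f"t={t:3d}  signal kept={alpha_bar[t]:.3f}  corr with x0={corr:+.2f}")
```

The printed correlation with the original wave starts near 1 and decays toward 0, mirroring the visualization: by t=99 the signal is essentially buried in noise.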
02 Code
03 Academic Explanation
Denoising Diffusion Probabilistic Models (DDPM; Ho et al., 2020) define a Markov-chain noising process and learn its reverse, surpassing GANs in image generation quality while training far more stably.
Forward Process (Fixed Noising Process)
The forward process is defined as a Markov chain, adding Gaussian noise at each step:
q(x_t | x_{t-1}) = N(x_t; √(1-β_t)·x_{t-1}, β_t·I)
Thanks to the reparameterization property of Gaussian distributions, you can sample any step x_t directly from x₀:
x_t = √ᾱ_t·x₀ + √(1-ᾱ_t)·ε, where ε ~ N(0, I), α_t = 1-β_t, ᾱ_t = ∏ᵢ₌₁ᵗ αᵢ
As t approaches T, ᾱ_t → 0, and x_T becomes indistinguishable from standard Gaussian noise.
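A quick numerical sanity check of the closed form (a sketch with arbitrary schedule values): simulate many chains step by step with q(x_t | x_{t-1}) and compare the sample mean and variance against what the one-shot formula predicts, namely √ᾱ_T·x₀ and 1-ᾱ_T.

```python
import numpy as np

T = 200
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
n = 200_000
x = np.full(n, 2.0)                      # every chain starts at x0 = 2
for t in range(T):
    # one forward step: x_t = sqrt(1-beta_t)*x_{t-1} + sqrt(beta_t)*eps
    x = np.sqrt(1 - betas[t]) * x + np.sqrt(betas[t]) * rng.standard_normal(n)

print("empirical mean:", x.mean(), " predicted:", np.sqrt(alpha_bar[-1]) * 2.0)
print("empirical var :", x.var(),  " predicted:", 1 - alpha_bar[-1])
```

The two pairs agree to a few decimal places, confirming that the reparameterization trick lets you skip the loop entirely.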
Reverse Process (Learned Denoising Process)
The reverse transition q(x_{t-1}|x_t) is intractable directly, but has a closed form given x₀:
q(x_{t-1} | x_t, x₀) = N(x_{t-1}; μ̃_t, β̃_t·I)
The neural network ε_θ learns to predict the noise ε given x_t and t, thereby estimating the mean μ̃_t.
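One reverse step can be sketched as follows. This is a minimal illustration, assuming the DDPM parameterization with σ_t² = β_t; `eps_hat` stands in for the network's prediction ε_θ(x_t, t).

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def reverse_step(x_t, t, eps_hat, rng):
    """One DDPM reverse step: posterior mean from predicted noise, plus fresh noise."""
    mean = (x_t - betas[t] / np.sqrt(1 - alpha_bar[t]) * eps_hat) / np.sqrt(alphas[t])
    if t == 0:
        return mean                          # final step: return the mean, no noise
    return mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)

# Sanity check: with an "oracle" prediction (the exact noise used to build x_t),
# the very last reverse step recovers x0 exactly.
rng = np.random.default_rng(1)
x0 = np.ones(10)
eps = rng.standard_normal(10)
x_t = np.sqrt(alpha_bar[0]) * x0 + np.sqrt(1 - alpha_bar[0]) * eps
recovered = reverse_step(x_t, 0, eps, rng)
```

Full sampling just applies `reverse_step` for t = T-1 down to 0, starting from `x = rng.standard_normal(shape)`.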
Training Objective
After simplification (dropping the timestep-dependent weighting), the original ELBO objective reduces to:
L = E_{x₀,ε,t} [ ‖ε - ε_θ(√ᾱ_t·x₀ + √(1-ᾱ_t)·ε, t)‖² ]
That is: randomly sample a timestep t and noise ε, add the noise to get x_t, have the network output its prediction ε̂, and minimize the MSE between ε and ε̂.
Network Architecture
U-Net backbone: the encoder gradually reduces resolution to extract features, while the decoder gradually increases resolution to restore them, with skip connections preserving details. Well suited to image noise prediction.
Timestep embedding: sinusoidal positional encoding maps the scalar t to a high-dimensional vector, injected into each U-Net layer so the network knows "which step we're at".
Text conditioning: cross-attention injects text encodings (e.g., from CLIP) into the U-Net's intermediate layers, guiding the denoising direction. Classifier-Free Guidance (CFG) further strengthens conditional control.
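A minimal sketch of such a sinusoidal embedding (the exact frequency spacing varies between implementations; this Transformer-style variant is an assumption):

```python
import numpy as np

def timestep_embedding(t, dim, max_period=10_000):
    """Map a scalar timestep t to a dim-dimensional vector of sines and cosines."""
    half = dim // 2
    # Geometrically spaced frequencies from 1 down to 1/max_period.
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.sin(args), np.cos(args)])

emb_a = timestep_embedding(5, 64)
emb_b = timestep_embedding(500, 64)
print(emb_a.shape)   # a 64-dimensional vector per timestep
```

Nearby timesteps get similar vectors while distant ones differ, so the network can smoothly condition its denoising behavior on t.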
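CFG itself is a one-line blend of two noise predictions, one with the text condition and one without. A sketch (the guidance scale 7.5 is a commonly used example value, not a requirement):

```python
import numpy as np

def cfg(eps_uncond, eps_cond, s):
    """Classifier-Free Guidance: extrapolate from the unconditional prediction
    toward the conditional one with guidance scale s."""
    # s = 1 recovers the plain conditional prediction;
    # s > 1 pushes the sample harder toward the text condition.
    return eps_uncond + s * (eps_cond - eps_uncond)

eps_u = np.zeros(4)   # stand-in: prediction without the text prompt
eps_c = np.ones(4)    # stand-in: prediction with the text prompt
guided = cfg(eps_u, eps_c, 7.5)
```

At sampling time the network is simply run twice per step (with and without the condition) and the guided prediction is used in the reverse update.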
Improvements
DDIM: non-Markovian sampling reduces steps from 1000 to around 50, greatly speeding up generation
Latent Diffusion (LDM): diffusion in a VAE-compressed latent space, reducing resolution by 8x — the foundation of Stable Diffusion
Rectified Flow / Flow Matching: uses straight paths instead of diffusion paths for more stable training; adopted by Stable Diffusion 3 / Flux
Consistency Models: proposed by OpenAI — train a network to jump directly from any x_t to x₀ in one step, enabling ultra-fast generation
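The step-skipping idea behind fast samplers can be seen in a single deterministic DDIM update (η = 0): predict x₀ from the current noise estimate, then re-noise to an earlier timestep. A sketch, with `eps_hat` standing in for the network:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def ddim_step(x_t, t, t_prev, eps_hat):
    """Deterministic DDIM update: jump from timestep t to any earlier t_prev."""
    # 1) invert the forward formula to estimate x0
    x0_pred = (x_t - np.sqrt(1 - alpha_bar[t]) * eps_hat) / np.sqrt(alpha_bar[t])
    # 2) re-noise the estimate to timestep t_prev using the same eps_hat
    return np.sqrt(alpha_bar[t_prev]) * x0_pred + np.sqrt(1 - alpha_bar[t_prev]) * eps_hat

# Sanity check: with the exact noise, one jump from t=999 to t=0 recovers
# (almost exactly) the clean signal.
rng = np.random.default_rng(0)
x0 = np.linspace(-1, 1, 16)
eps = rng.standard_normal(16)
x_T = np.sqrt(alpha_bar[999]) * x0 + np.sqrt(1 - alpha_bar[999]) * eps
x_back = ddim_step(x_T, 999, 0, eps)
```

Because each update is deterministic given the noise estimate, the sampler can take large jumps through the timestep schedule instead of walking all T steps.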
Summary
Forward process: progressive noising via a Markov chain
Reverse process: a neural network predicts and removes the noise
Training: MSE on noise prediction, simple and stable
Sampling: T denoising steps from pure noise to image