Diffusion Model
First "stir" an image into noise, then learn to "restore" it from noise — the core idea behind Stable Diffusion and DALL-E
✦ See It in Action: An image disappears into noise, then reappears
Click "Add Noise" to watch the image gradually become random noise; click "Remove Noise" to see it restored from the noise.
Reverse Process: A neural network learns to predict and remove the noise at each step, gradually restoring the image
01 Plain English Explanation of Diffusion Models
The Story of a Photo Disappearing Under Sand
Imagine a clear photo. Every second you sprinkle a little sand on it. After 100 seconds, the photo is completely covered by sand, leaving only a pile of random grains.
The core question of diffusion models is: Can we train a neural network to learn "how to remove one grain of sand at a time"?
If the network masters this, you can start from a pile of random noise, gradually remove the "sand", and eventually restore an image — or even generate a brand new one.
Two Processes
Forward process (noising): each step adds a little Gaussian noise to the image. This process follows a fixed mathematical formula — you can jump directly from x₀ to any timestep x_t without stepping through each one.
Reverse process (denoising): starting from pure noise x_T, the neural network predicts "what noise was mixed into this image" at each step. Subtracting that noise gives a slightly cleaner x_{t-1}; repeat for T steps to recover x₀.
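The forward "jump" can be written in a few lines. A minimal numpy sketch (the schedule values and the toy `add_noise` helper are illustrative assumptions, not any particular library's API):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # linear noise schedule, as in DDPM
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)       # cumulative signal preservation ᾱ_t

def add_noise(x0, t, rng):
    """Sample x_t directly from x0 in one shot -- no step-by-step loop."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps

rng = np.random.default_rng(0)
x0 = np.ones((4, 4))                 # a toy "image"
x_mid = add_noise(x0, 500, rng)      # halfway: signal and noise mixed
x_end = add_noise(x0, T - 1, rng)    # near the end: almost pure noise
print(alpha_bar[0], alpha_bar[-1])   # close to 1 at t=0, close to 0 at t=T
```

Because ᾱ_t shrinks toward zero, the signal term fades while the noise term grows, which is exactly the "photo disappearing" animation above.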
What Happens During Training?
Use the forward formula to directly generate x_t (the version with t steps of noise)
The network takes x_t and timestep t as input, and outputs the predicted noise ε̂
Loss = ‖ε - ε̂‖², a very simple MSE loss
That's it — just these three steps. Train on millions of images, and the network learns to "denoise".
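The three steps above can be sketched as a single training step. This is a schematic, not a real trainer: `dummy_model` is a stand-in for the actual noise-prediction network.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def dummy_model(x_t, t):
    # Placeholder network: always predicts zero noise.
    return np.zeros_like(x_t)

def training_step(x0, model, rng):
    t = rng.integers(0, T)                               # pick a random timestep
    eps = rng.standard_normal(x0.shape)                  # the true noise ε
    # 1) forward formula: build x_t directly from x0
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps
    eps_hat = model(x_t, t)                              # 2) network predicts ε̂
    return np.mean((eps - eps_hat) ** 2)                 # 3) simple MSE loss

rng = np.random.default_rng(0)
loss = training_step(np.ones((8, 8)), dummy_model, rng)
```

With the zero-predicting dummy model, the loss is just the mean of ε², i.e. about 1; a trained network drives it well below that.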
Why Is It Better Than GAN?
Only needs to minimize MSE; no instability from GAN's two-network adversarial game
Starting from different random noise each time produces diverse results without mode collapse
Adding text conditions during denoising (Stable Diffusion) enables text-to-image generation
The trade-off: generation requires dozens to hundreds of denoising steps, far slower than a GAN's single forward pass
Diffusion Models in the Real World
Stable Diffusion: performs diffusion in a compressed "latent space" for faster speed; open-source and runs locally
DALL-E: OpenAI's text-to-image model that understands complex text prompts
Video generation: extends diffusion models to video, applying diffusion along the temporal dimension as well
Other domains: WaveGrad (speech synthesis), RFdiffusion (protein structure generation)
Building the Diffusion Process Step by Step
From noise scheduling to forward diffusion visualization, build it up gradually.
A linear schedule gradually increases the noise level. ᾱ_t is the cumulative signal preservation rate.
No need to iterate step by step — directly compute the noised signal at any timestep t using a single formula.
Observe the signal evolution from t=0 (original sine wave) to t=99 (pure noise).
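The three demo steps above can be reproduced in a few lines of numpy. The exact schedule range here (β from 1e-4 to 0.05 over 100 steps) is an assumption chosen so the signal is nearly gone by t=99:

```python
import numpy as np

T = 100
betas = np.linspace(1e-4, 0.05, T)      # linear schedule
alpha_bar = np.cumprod(1.0 - betas)     # cumulative signal preservation ᾱ_t

x = np.linspace(0, 4 * np.pi, 200)
x0 = np.sin(x)                          # the original sine wave

rng = np.random.default_rng(42)
def noised(t):
    # Closed-form noising: jump straight to timestep t.
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps

for t in (0, 50, 99):
    corr = np.corrcoef(x0, noised(t))[0, 1]
    print(f"t={t:3d}  signal kept={alpha_bar[t]:.3f}  corr with x0={corr:+.2f}")
```

The printed correlation with the original wave starts near 1 and decays toward 0, mirroring the visualization: by t=99 the signal is essentially buried in noise.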
02 Code
03 Academic Explanation
Denoising Diffusion Probabilistic Models (DDPM; Ho et al., 2020) define a Markov-chain noising process and learn its reverse, surpassing GANs in image generation quality while training far more stably.
Forward Process (Fixed Noising Process)
The forward process is defined as a Markov chain, adding Gaussian noise at each step:
q(x_t | x_{t-1}) = N(x_t; √(1-β_t)·x_{t-1}, β_t·I)
Thanks to the reparameterization property of Gaussian distributions, you can sample any step x_t directly from x₀:
x_t = √ᾱ_t·x₀ + √(1-ᾱ_t)·ε, where ε ~ N(0, I), α_t = 1-β_t, ᾱ_t = ∏ᵢ₌₁ᵗ αᵢ
As t approaches T, ᾱ_t → 0, and x_T becomes indistinguishable from standard Gaussian noise.
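A quick numerical sanity check of the closed form (a sketch with arbitrary schedule values): simulate many chains step by step with q(x_t | x_{t-1}) and compare the sample mean and variance against what the one-shot formula predicts, namely √ᾱ_T·x₀ and 1-ᾱ_T.

```python
import numpy as np

T = 200
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
n = 200_000
x = np.full(n, 2.0)                      # every chain starts at x0 = 2
for t in range(T):
    # one forward step: x_t = sqrt(1-beta_t)*x_{t-1} + sqrt(beta_t)*eps
    x = np.sqrt(1 - betas[t]) * x + np.sqrt(betas[t]) * rng.standard_normal(n)

print("empirical mean:", x.mean(), " predicted:", np.sqrt(alpha_bar[-1]) * 2.0)
print("empirical var :", x.var(),  " predicted:", 1 - alpha_bar[-1])
```

The two pairs agree to a few decimal places, confirming that the reparameterization trick lets you skip the loop entirely.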
Reverse Process (Learned Denoising Process)
The reverse transition q(x_{t-1}|x_t) is intractable directly, but has a closed form given x₀:
q(x_{t-1} | x_t, x₀) = N(x_{t-1}; μ̃_t, β̃_t·I)
The neural network ε_θ learns to predict the noise ε given x_t and t, thereby estimating the mean μ̃_t.
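One reverse step can be sketched as follows. This is a minimal illustration, assuming the DDPM parameterization with σ_t² = β_t; `eps_hat` stands in for the network's prediction ε_θ(x_t, t).

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def reverse_step(x_t, t, eps_hat, rng):
    """One DDPM reverse step: posterior mean from predicted noise, plus fresh noise."""
    mean = (x_t - betas[t] / np.sqrt(1 - alpha_bar[t]) * eps_hat) / np.sqrt(alphas[t])
    if t == 0:
        return mean                          # final step: return the mean, no noise
    return mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)

# Sanity check: with an "oracle" prediction (the exact noise used to build x_t),
# the very last reverse step recovers x0 exactly.
rng = np.random.default_rng(1)
x0 = np.ones(10)
eps = rng.standard_normal(10)
x_t = np.sqrt(alpha_bar[0]) * x0 + np.sqrt(1 - alpha_bar[0]) * eps
recovered = reverse_step(x_t, 0, eps, rng)
```

Full sampling just applies `reverse_step` for t = T-1 down to 0, starting from `x = rng.standard_normal(shape)`.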
Training Objective
After simplification (dropping the timestep-dependent weighting), the original ELBO objective reduces to:
L = E_{x₀,ε,t} [ ‖ε - ε_θ(√ᾱ_t·x₀ + √(1-ᾱ_t)·ε, t)‖² ]
That is: randomly sample a timestep t and noise ε, add the noise to get x_t, have the network output its prediction ε̂, and minimize the MSE between ε and ε̂.
Network Architecture
U-Net backbone: the encoder gradually reduces resolution to extract features, while the decoder gradually increases resolution to restore them, with skip connections preserving details. Well suited to image noise prediction.
Timestep embedding: sinusoidal positional encoding maps the scalar t to a high-dimensional vector, injected into each U-Net layer so the network knows "which step we're at".
Text conditioning: cross-attention injects text encodings (e.g., from CLIP) into the U-Net's intermediate layers, guiding the denoising direction. Classifier-Free Guidance (CFG) further strengthens conditional control.
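A minimal sketch of such a sinusoidal embedding (the exact frequency spacing varies between implementations; this Transformer-style variant is an assumption):

```python
import numpy as np

def timestep_embedding(t, dim, max_period=10_000):
    """Map a scalar timestep t to a dim-dimensional vector of sines and cosines."""
    half = dim // 2
    # Geometrically spaced frequencies from 1 down to 1/max_period.
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.sin(args), np.cos(args)])

emb_a = timestep_embedding(5, 64)
emb_b = timestep_embedding(500, 64)
print(emb_a.shape)   # a 64-dimensional vector per timestep
```

Nearby timesteps get similar vectors while distant ones differ, so the network can smoothly condition its denoising behavior on t.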
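CFG itself is a one-line blend of two noise predictions, one with the text condition and one without. A sketch (the guidance scale 7.5 is a commonly used example value, not a requirement):

```python
import numpy as np

def cfg(eps_uncond, eps_cond, s):
    """Classifier-Free Guidance: extrapolate from the unconditional prediction
    toward the conditional one with guidance scale s."""
    # s = 1 recovers the plain conditional prediction;
    # s > 1 pushes the sample harder toward the text condition.
    return eps_uncond + s * (eps_cond - eps_uncond)

eps_u = np.zeros(4)   # stand-in: prediction without the text prompt
eps_c = np.ones(4)    # stand-in: prediction with the text prompt
guided = cfg(eps_u, eps_c, 7.5)
```

At sampling time the network is simply run twice per step (with and without the condition) and the guided prediction is used in the reverse update.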
Improvements
DDIM: non-Markovian sampling reduces steps from 1000 to around 50, greatly speeding up generation
Latent Diffusion (LDM): diffusion in a VAE-compressed latent space, reducing resolution by 8x — the foundation of Stable Diffusion
Rectified Flow / Flow Matching: uses straight paths instead of diffusion paths for more stable training; adopted by Stable Diffusion 3 / Flux
Consistency Models: proposed by OpenAI — train a network to jump directly from any x_t to x₀ in one step, enabling ultra-fast generation
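The step-skipping idea behind fast samplers can be seen in a single deterministic DDIM update (η = 0): predict x₀ from the current noise estimate, then re-noise to an earlier timestep. A sketch, with `eps_hat` standing in for the network:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def ddim_step(x_t, t, t_prev, eps_hat):
    """Deterministic DDIM update: jump from timestep t to any earlier t_prev."""
    # 1) invert the forward formula to estimate x0
    x0_pred = (x_t - np.sqrt(1 - alpha_bar[t]) * eps_hat) / np.sqrt(alpha_bar[t])
    # 2) re-noise the estimate to timestep t_prev using the same eps_hat
    return np.sqrt(alpha_bar[t_prev]) * x0_pred + np.sqrt(1 - alpha_bar[t_prev]) * eps_hat

# Sanity check: with the exact noise, one jump from t=999 to t=0 recovers
# (almost exactly) the clean signal.
rng = np.random.default_rng(0)
x0 = np.linspace(-1, 1, 16)
eps = rng.standard_normal(16)
x_T = np.sqrt(alpha_bar[999]) * x0 + np.sqrt(1 - alpha_bar[999]) * eps
x_back = ddim_step(x_T, 999, 0, eps)
```

Because each update is deterministic given the noise estimate, the sampler can take large jumps through the timestep schedule instead of walking all T steps.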
Summary
Forward process: progressive noising via a Markov chain
Reverse process: a neural network predicts and removes the noise
Training: MSE on noise prediction, simple and stable
Sampling: T denoising steps from pure noise to image