VAE: Variational Autoencoder
Compress data into a "fuzzy cloud", then sample a random point from it and decode — each sample yields a brand-new image.
01 Core Principles (Plain English)
A regular autoencoder (AE) compresses an image into a single point, then reconstructs from that point — but you can only reproduce the original image, not generate new ones.
VAE's improvement: compress into a "Gaussian cloud" instead of a single point — the cloud has a center (mean) and a radius (variance). During generation, randomly sample a point from the cloud and decompress it. Each result is slightly different, producing "brand new but plausible" images.
Comparison with GAN
VAE
Has clear probabilistic interpretation, stable training, continuous and interpolable latent space (transitions between two faces look natural), but generated images tend to be blurry.
GAN
Generates sharper, more realistic images, but training is unstable, latent space is not guaranteed to be continuous, and suffers from mode collapse.
Structure in Three Steps
Encoder: the input image is mapped to two vectors, the mean μ and the log variance log σ², which describe the position and size of the "cloud" in latent space.
Sampling: draw a point from the cloud via z = μ + σ·ε, where ε ~ N(0, I). This reparameterization trick lets gradients backpropagate through the random sampling.
Decoder: reconstruct the sampled point z back into an image. Training objective: the reconstruction must be accurate (reconstruction loss) and the cloud must stay close to a standard normal distribution (KL divergence).
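The three steps above can be sketched in a few lines of numpy. This is a toy forward pass with randomly initialized weights — the dimensions and variable names are illustrative, not the walkthrough's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 2))          # a mini-batch of 2-D points

# Step 1: encoder outputs the mean and log variance of the latent "cloud"
W_mu, W_logvar = rng.normal(size=(2, 2)), rng.normal(size=(2, 2))
mu = x @ W_mu
logvar = x @ W_logvar

# Step 2: reparameterization trick — z = mu + sigma * eps, eps ~ N(0, I)
eps = rng.normal(size=mu.shape)
z = mu + np.exp(0.5 * logvar) * eps

# Step 3: decoder reconstructs; loss = reconstruction + KL
W_dec = rng.normal(size=(2, 2))
x_hat = z @ W_dec
recon = np.mean((x - x_hat) ** 2)
kl = -0.5 * np.mean(1 + logvar - mu**2 - np.exp(logvar))
loss = recon + kl
```

Because the noise ε is drawn outside the computation graph, gradients of `loss` flow cleanly through `mu` and `logvar`.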
Latent Space Interpolation: VAE's most fascinating property. Walking from "cat vector" to "dog vector" in latent space passes through plausible intermediates that look "cat-like yet dog-like", because the entire latent space is forced to learn a smooth, continuous representation.
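The interpolation idea in code — a sketch with a hypothetical placeholder decoder standing in for a trained network, and made-up "cat"/"dog" latent vectors:

```python
import numpy as np

def decode(z, W=np.array([[1.0, 0.5], [-0.5, 1.0]])):
    """Placeholder decoder (a real VAE would use its trained network here)."""
    return z @ W

z_cat = np.array([1.0, 0.0])   # hypothetical "cat" latent vector
z_dog = np.array([0.0, 1.0])   # hypothetical "dog" latent vector

# Walk in a straight line through latent space; because the space is smooth,
# every intermediate point decodes to a plausible in-between sample.
for t in np.linspace(0.0, 1.0, 5):
    z = (1 - t) * z_cat + t * z_dog
    print(t, decode(z))
```

With a real trained decoder the printed coordinates would instead be images that morph gradually from cat-like to dog-like.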
Build VAE Step by Step
From spiral data to latent space, build the variational autoencoder step by step.
Generate two spirals as training data and randomly initialize the encoder and decoder weight matrices.
The encoder outputs the mean μ and log variance log σ²; the reparameterization trick allows backpropagation through the random sampling.
The decoder reconstructs the latent vector back to original coordinates; loss = reconstruction error + KL divergence.
Mini-batch gradient updates for 600 epochs, observing the latent space transition from chaos to order.
02 Code
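A minimal self-contained sketch of the walkthrough above, assuming purely linear encoder/decoder maps with hand-derived gradients (the tutorial's actual code may use MLPs; all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two spirals as training data (the toy dataset the walkthrough describes)
def make_spirals(n=200):
    t = np.linspace(0.5, 3 * np.pi, n)
    a = np.stack([t * np.cos(t), t * np.sin(t)], axis=1)
    b = -a                                    # second spiral, rotated 180°
    x = np.concatenate([a, b]) / (3 * np.pi)  # roughly unit scale
    return x + rng.normal(scale=0.02, size=x.shape)

x_all = make_spirals()

# Linear encoder/decoder weights, randomly initialized
d = 2
W_mu = rng.normal(scale=0.1, size=(d, d))     # encoder: mean head
W_lv = rng.normal(scale=0.1, size=(d, d))     # encoder: log-variance head
W_de = rng.normal(scale=0.1, size=(d, d))     # decoder
lr = 0.05

for epoch in range(600):
    rng.shuffle(x_all)
    for i in range(0, len(x_all), 64):        # mini-batches
        x = x_all[i:i + 64]
        mu, logvar = x @ W_mu, x @ W_lv
        sig = np.exp(0.5 * logvar)
        eps = rng.normal(size=mu.shape)
        z = mu + sig * eps                    # reparameterization trick
        x_hat = z @ W_de

        # loss = reconstruction MSE + KL(q(z|x) || N(0, I)); gradients by hand
        g = 2 * (x_hat - x) / x.size          # d(recon)/d(x_hat)
        dz = g @ W_de.T
        dmu = dz + mu / mu.size               # second term: d(KL)/d(mu)
        dlv = dz * eps * 0.5 * sig - 0.5 * (1 - np.exp(logvar)) / logvar.size

        W_de -= lr * (z.T @ g)
        W_mu -= lr * (x.T @ dmu)
        W_lv -= lr * (x.T @ dlv)

# After training, decode the latent means to check reconstruction
recon = np.mean((x_all @ W_mu @ W_de - x_all) ** 2)
```

Plotting `x_all @ W_mu` every few epochs would show the latent space drifting from chaos toward an ordered, roughly Gaussian layout, as the walkthrough describes.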
03 Academic Explanation
Variational Inference Framework
Assume data x is generated from a latent variable z. The true posterior p(z|x) is intractable, so we introduce an approximate posterior q(z|x) and maximize the Evidence Lower Bound (ELBO):
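In standard notation, the bound reads:

```latex
\log p(x) \;\ge\; \underbrace{\mathbb{E}_{q(z|x)}\!\left[\log p(x|z)\right]}_{\text{reconstruction}} \;-\; \underbrace{\mathrm{KL}\!\left(q(z|x)\,\|\,p(z)\right)}_{\text{regularization}} \;=\; \mathrm{ELBO}
```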
The first term is the reconstruction loss (decoder's ability to reconstruct), and the second term is the KL regularization term (forcing the posterior toward the prior N(0,I)).
Reparameterization Trick
Sampling z ~ q(z|x) directly cannot be backpropagated through. Reparameterization moves the randomness into an external noise variable:
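```latex
z = \mu + \sigma \odot \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I)
```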
Gradients can flow through μ and σ, while ε is just a noise source that doesn't need gradients. This is the key to end-to-end training of VAE.
Analytical Form of KL Divergence
When q(z|x) = N(μ, σ²I) and p(z) = N(0, I), the KL divergence has a closed-form solution:
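```latex
\mathrm{KL}\!\left(\mathcal{N}(\mu, \sigma^2 I)\,\|\,\mathcal{N}(0, I)\right) = \frac{1}{2}\sum_{j=1}^{d}\left(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\right)
```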
No Monte Carlo estimation needed — compute directly from the encoder's μ and log σ² outputs.
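That closed form is one line of numpy (the helper name is illustrative):

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dims."""
    return 0.5 * np.sum(mu**2 + np.exp(logvar) - logvar - 1.0, axis=-1)

# Sanity check: when q equals the prior, the KL is exactly zero
print(kl_to_standard_normal(np.zeros(3), np.zeros(3)))  # → 0.0
```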
β-VAE and Disentanglement
Add weight β > 1 to the KL term:
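```latex
\mathcal{L}_{\beta} = \mathbb{E}_{q(z|x)}\!\left[\log p(x|z)\right] - \beta\,\mathrm{KL}\!\left(q(z|x)\,\|\,p(z)\right)
```

Setting β = 1 recovers the standard ELBO.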
The larger β is, the more strongly the latent code is compressed and the more each dimension is pushed toward an independent, disentangled semantic factor (e.g., face orientation or lighting direction). This is the core contribution of β-VAE.
Theoretical Comparison with GAN
- VAE: Maximizes ELBO, has a clear probabilistic objective, stable training, but reconstruction loss (MSE/BCE) causes blurry generated images
- GAN: Minimizes JS divergence (original) or Wasserstein distance, discriminator provides "perceptual loss", generates sharper images, but prone to gradient vanishing and mode collapse
- VQ-VAE / Diffusion Models: subsequent work that combines the advantages of both approaches