VAE: Variational Autoencoder
Compress data into a "fuzzy cloud", then sample a random point from it and decode — each sample yields a brand-new image.
01 Core Principles (Plain English)
A regular autoencoder (AE) compresses an image into a single point, then reconstructs from that point — but you can only reproduce the original image, not generate new ones.
VAE's improvement: compress into a "Gaussian cloud" instead of a single point — the cloud has a center (mean) and a radius (variance). During generation, randomly sample a point from the cloud and decompress it. Each result is slightly different, producing "brand new but plausible" images.
Comparison with GAN
VAE
Has clear probabilistic interpretation, stable training, continuous and interpolable latent space (transitions between two faces look natural), but generated images tend to be blurry.
GAN
Generates sharper, more realistic images, but training is unstable, latent space is not guaranteed to be continuous, and suffers from mode collapse.
Structure in Three Steps
Encoder: the input image is mapped to two vectors, the mean μ and the log variance log σ², which describe the position and size of the "cloud" in latent space.
Sampling: draw a point from the cloud via z = μ + σ·ε, where ε ~ N(0, I). This reparameterization trick lets gradients backpropagate through the random sampling.
Decoder: reconstruct the sampled point z back into an image. Training objective: the reconstruction must be accurate (reconstruction loss) and the cloud must stay close to a standard normal distribution (KL divergence).
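The three steps above can be sketched in a few lines of numpy. This is a toy forward pass with randomly initialized weights — the dimensions and variable names are illustrative, not the walkthrough's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 2))          # a mini-batch of 2-D points

# Step 1: encoder outputs the mean and log variance of the latent "cloud"
W_mu, W_logvar = rng.normal(size=(2, 2)), rng.normal(size=(2, 2))
mu = x @ W_mu
logvar = x @ W_logvar

# Step 2: reparameterization trick — z = mu + sigma * eps, eps ~ N(0, I)
eps = rng.normal(size=mu.shape)
z = mu + np.exp(0.5 * logvar) * eps

# Step 3: decoder reconstructs; loss = reconstruction + KL
W_dec = rng.normal(size=(2, 2))
x_hat = z @ W_dec
recon = np.mean((x - x_hat) ** 2)
kl = -0.5 * np.mean(1 + logvar - mu**2 - np.exp(logvar))
loss = recon + kl
```

Because the noise ε is drawn outside the computation graph, gradients of `loss` flow cleanly through `mu` and `logvar`.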
Latent Space Interpolation: VAE's most fascinating property. Walking from "cat vector" to "dog vector" in latent space passes through plausible intermediates that look "cat-like yet dog-like", because the entire latent space is forced to learn a smooth, continuous representation.
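The interpolation idea in code — a sketch with a hypothetical placeholder decoder standing in for a trained network, and made-up "cat"/"dog" latent vectors:

```python
import numpy as np

def decode(z, W=np.array([[1.0, 0.5], [-0.5, 1.0]])):
    """Placeholder decoder (a real VAE would use its trained network here)."""
    return z @ W

z_cat = np.array([1.0, 0.0])   # hypothetical "cat" latent vector
z_dog = np.array([0.0, 1.0])   # hypothetical "dog" latent vector

# Walk in a straight line through latent space; because the space is smooth,
# every intermediate point decodes to a plausible in-between sample.
for t in np.linspace(0.0, 1.0, 5):
    z = (1 - t) * z_cat + t * z_dog
    print(t, decode(z))
```

With a real trained decoder the printed coordinates would instead be images that morph gradually from cat-like to dog-like.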
Build VAE Step by Step
From spiral data to latent space, build the variational autoencoder step by step.
Generate two spirals as training data and randomly initialize the encoder and decoder weight matrices.
The encoder outputs the mean μ and log variance log σ²; the reparameterization trick allows backpropagation through the random sampling.
The decoder reconstructs the latent vector back to original coordinates; loss = reconstruction error + KL divergence.
Mini-batch gradient updates for 600 epochs, observing the latent space transition from chaos to order.
02 Code
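A minimal self-contained sketch of the walkthrough above, assuming purely linear encoder/decoder maps with hand-derived gradients (the tutorial's actual code may use MLPs; all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two spirals as training data (the toy dataset the walkthrough describes)
def make_spirals(n=200):
    t = np.linspace(0.5, 3 * np.pi, n)
    a = np.stack([t * np.cos(t), t * np.sin(t)], axis=1)
    b = -a                                    # second spiral, rotated 180°
    x = np.concatenate([a, b]) / (3 * np.pi)  # roughly unit scale
    return x + rng.normal(scale=0.02, size=x.shape)

x_all = make_spirals()

# Linear encoder/decoder weights, randomly initialized
d = 2
W_mu = rng.normal(scale=0.1, size=(d, d))     # encoder: mean head
W_lv = rng.normal(scale=0.1, size=(d, d))     # encoder: log-variance head
W_de = rng.normal(scale=0.1, size=(d, d))     # decoder
lr = 0.05

for epoch in range(600):
    rng.shuffle(x_all)
    for i in range(0, len(x_all), 64):        # mini-batches
        x = x_all[i:i + 64]
        mu, logvar = x @ W_mu, x @ W_lv
        sig = np.exp(0.5 * logvar)
        eps = rng.normal(size=mu.shape)
        z = mu + sig * eps                    # reparameterization trick
        x_hat = z @ W_de

        # loss = reconstruction MSE + KL(q(z|x) || N(0, I)); gradients by hand
        g = 2 * (x_hat - x) / x.size          # d(recon)/d(x_hat)
        dz = g @ W_de.T
        dmu = dz + mu / mu.size               # second term: d(KL)/d(mu)
        dlv = dz * eps * 0.5 * sig - 0.5 * (1 - np.exp(logvar)) / logvar.size

        W_de -= lr * (z.T @ g)
        W_mu -= lr * (x.T @ dmu)
        W_lv -= lr * (x.T @ dlv)

# After training, decode the latent means to check reconstruction
recon = np.mean((x_all @ W_mu @ W_de - x_all) ** 2)
```

Plotting `x_all @ W_mu` every few epochs would show the latent space drifting from chaos toward an ordered, roughly Gaussian layout, as the walkthrough describes.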
03 Academic Explanation
Variational Inference Framework
Assume data x is generated from a latent variable z. The true posterior p(z|x) is intractable, so we introduce an approximate posterior q(z|x) and maximize the Evidence Lower Bound (ELBO):
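In standard notation, the bound reads:

```latex
\log p(x) \;\ge\; \underbrace{\mathbb{E}_{q(z|x)}\!\left[\log p(x|z)\right]}_{\text{reconstruction}} \;-\; \underbrace{\mathrm{KL}\!\left(q(z|x)\,\|\,p(z)\right)}_{\text{regularization}} \;=\; \mathrm{ELBO}
```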
The first term is the reconstruction loss (decoder's ability to reconstruct), and the second term is the KL regularization term (forcing the posterior toward the prior N(0,I)).
Reparameterization Trick
Sampling z ~ q(z|x) directly cannot be backpropagated through. Reparameterization moves the randomness into an external noise variable:
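```latex
z = \mu + \sigma \odot \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I)
```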
Gradients can flow through μ and σ, while ε is just a noise source that doesn't need gradients. This is the key to end-to-end training of VAE.
Analytical Form of KL Divergence
When q(z|x) = N(μ, σ²I) and p(z) = N(0, I), the KL divergence has a closed-form solution:
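```latex
\mathrm{KL}\!\left(\mathcal{N}(\mu, \sigma^2 I)\,\|\,\mathcal{N}(0, I)\right) = \frac{1}{2}\sum_{j=1}^{d}\left(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\right)
```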
No Monte Carlo estimation needed — compute directly from the encoder's μ and log σ² outputs.
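That closed form is one line of numpy (the helper name is illustrative):

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dims."""
    return 0.5 * np.sum(mu**2 + np.exp(logvar) - logvar - 1.0, axis=-1)

# Sanity check: when q equals the prior, the KL is exactly zero
print(kl_to_standard_normal(np.zeros(3), np.zeros(3)))  # → 0.0
```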
β-VAE and Disentanglement
Add weight β > 1 to the KL term:
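```latex
\mathcal{L}_{\beta} = \mathbb{E}_{q(z|x)}\!\left[\log p(x|z)\right] - \beta\,\mathrm{KL}\!\left(q(z|x)\,\|\,p(z)\right)
```

Setting β = 1 recovers the standard ELBO.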
The larger β is, the more strongly the latent code is compressed and the more each dimension is pushed toward an independent, disentangled semantic factor (e.g., face orientation or lighting direction). This is the core contribution of β-VAE.
Theoretical Comparison with GAN
- VAE: Maximizes ELBO, has a clear probabilistic objective, stable training, but reconstruction loss (MSE/BCE) causes blurry generated images
- GAN: Minimizes JS divergence (original) or Wasserstein distance, discriminator provides "perceptual loss", generates sharper images, but prone to gradient vanishing and mode collapse
- VQ-VAE / Diffusion Models: subsequent work that combines the advantages of both approaches