01 Core Concept (Plain English)

Imagine you have a 100-dimensional "map" and want to print it on paper (2D). PCA finds a good angle and takes a photo, preserving the directions of maximum variance.

t-SNE takes a completely different approach: it ignores global structure and only cares about neighborhood relationships — points that were close in high dimensions should stay close in 2D; points that were far apart can go anywhere.

The trade-off: global distances in t-SNE have no meaning. The distance between two clusters tells you nothing. Only the structure within each cluster is trustworthy.

Four Steps to Understand t-SNE

1. High-dimensional similarity: Gaussian kernel

For each point i, convert distances to the other points into probabilities p_{j|i} using a Gaussian kernel — closer neighbors get higher probability, distant points approach zero. The per-point bandwidth σ_i is found by binary search so that the resulting distribution's perplexity matches a user-chosen value.
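
A minimal numpy sketch of that binary search (function names here are my own, not from the article's code):

```python
import numpy as np

def p_cond_row(d2_row, sigma):
    """Conditional probabilities p_{j|i} for one point i, given the
    squared distances d2_row to every *other* point (self excluded)."""
    logits = -d2_row / (2.0 * sigma ** 2)
    logits -= logits.max()          # shift for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def sigma_for_perplexity(d2_row, target, n_iter=50):
    """Binary-search the bandwidth sigma_i so that 2^H(P_i) matches target."""
    lo, hi = 1e-10, 1e5
    for _ in range(n_iter):
        sigma = (lo + hi) / 2.0
        p = p_cond_row(d2_row, sigma)
        H = -np.sum(p * np.log2(p + 1e-12))   # Shannon entropy in bits
        if 2.0 ** H > target:
            hi = sigma    # distribution too flat -> shrink sigma
        else:
            lo = sigma    # too peaked -> grow sigma
    return (lo + hi) / 2.0
```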

2. Symmetrize: P_ij = (p_{j|i} + p_{i|j}) / (2n)

Convert asymmetric conditional probabilities into a symmetric joint probability so each pair has a single similarity value.
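
In code the symmetrization is essentially one line; this helper (my own naming) also floors tiny values so later logarithms stay finite:

```python
import numpy as np

def joint_P(P_cond):
    """Turn the n x n matrix of conditionals p_{j|i} (rows sum to 1,
    zero diagonal) into the symmetric joint P_ij = (p_{j|i} + p_{i|j}) / (2n)."""
    n = P_cond.shape[0]
    P = (P_cond + P_cond.T) / (2.0 * n)
    return np.maximum(P, 1e-12)   # floor keeps log(P/Q) finite later
```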

3. Low-dimensional similarity: t-distribution

In 2D, use a Student-t distribution (df = 1) to compute Q_ij. The heavier tails of the t-distribution allow dissimilar points to spread further apart, solving the "crowding problem."
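
A sketch of the Q computation, assuming 2D coordinates in an n × 2 array Y:

```python
import numpy as np

def joint_Q(Y):
    """Pairwise low-dimensional similarities Q_ij from 2D coordinates Y (n x 2),
    using a Student-t kernel with one degree of freedom."""
    d2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    num = 1.0 / (1.0 + d2)        # heavy-tailed kernel (1 + d^2)^-1
    np.fill_diagonal(num, 0.0)    # no self-similarity
    Q = num / num.sum()           # normalize over all pairs
    return np.maximum(Q, 1e-12), num   # num is reused by the gradient
```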

4. Minimize KL divergence: gradient descent

Loss = KL(P||Q) = Σ_ij P_ij log(P_ij / Q_ij). Gradient descent adjusts the 2D coordinates to make Q as close to P as possible.
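
The gradient (derived in the Deep Dive section below) vectorizes cleanly; this sketch assumes the P, Q, and unnormalized-kernel num arrays from the previous steps:

```python
import numpy as np

def kl_gradient(P, Q, num, Y):
    """dC/dy_i = 4 * sum_j (P_ij - Q_ij) * (y_i - y_j) / (1 + ||y_i - y_j||^2)."""
    W = (P - Q) * num                              # (P_ij - Q_ij) * (1 + d^2)^-1
    return 4.0 * (np.diag(W.sum(axis=1)) - W) @ Y  # vectorized sum over j
```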

Step 1: What does high-dimensional data look like?

The Iris dataset has 4 features (sepal length/width, petal length/width). Let's plot the first two to get a feel for the raw data:
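
Something like the following reproduces that raw-feature view, using sklearn's built-in copy of Iris and matplotlib for the scatter:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target        # 150 samples, 4 features

# Raw view: just the first two features (sepal length vs sepal width)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap="viridis", s=20)
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])
plt.title("Iris: first two raw features")
plt.show()
```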

Step 2: Compute high-dimensional similarity P

The Gaussian kernel converts Euclidean distances to probabilities — nearby points get high probability, distant points near zero. Here we show point #0's similarity to all others:
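
A quick illustration with a fixed bandwidth (real t-SNE would calibrate σ per point via the perplexity search from Step 1 of the concept section):

```python
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data
d2 = np.sum((X - X[0]) ** 2, axis=1)   # squared distances from point #0
d2[0] = np.inf                         # exclude self

sigma = 0.5                            # illustrative fixed bandwidth
p = np.exp(-d2 / (2 * sigma ** 2))
p /= p.sum()                           # conditional probabilities p_{j|0}

print(np.argsort(p)[::-1][:5])         # point #0's strongest neighbors
```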

Step 3: Gradient descent to optimize 2D layout

Start with random 2D coordinates and run gradient descent to make Q match P. See the result after 200 steps:
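
A compact, runnable version of that loop — simplified to a fixed bandwidth for P and plain gradient descent, with no momentum or early exaggeration:

```python
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data
n = X.shape[0]

# P with a fixed bandwidth (stand-in for the perplexity-calibrated version)
d2 = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
P_cond = np.exp(-d2 / (2 * 0.5 ** 2))
np.fill_diagonal(P_cond, 0.0)
P_cond /= P_cond.sum(axis=1, keepdims=True)
P = np.maximum((P_cond + P_cond.T) / (2 * n), 1e-12)

rng = np.random.default_rng(0)
Y = rng.normal(scale=1e-2, size=(n, 2))      # random 2D starting layout
for step in range(200):
    d2y = np.sum((Y[:, None] - Y[None, :]) ** 2, axis=-1)
    num = 1.0 / (1.0 + d2y)                  # Student-t kernel
    np.fill_diagonal(num, 0.0)
    Q = np.maximum(num / num.sum(), 1e-12)
    W = (P - Q) * num
    Y -= 100.0 * 4.0 * (np.diag(W.sum(axis=1)) - W) @ Y   # lr = 100
# Y now holds the 2D embedding
```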

Step 4: Compare PCA vs t-SNE

PCA is a linear projection; t-SNE is nonlinear. On the largely linearly separable Iris data the difference is subtle; on complex manifold data, t-SNE wins clearly:
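
A side-by-side comparison might look like this, using sklearn's PCA and TSNE:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, y = load_iris(return_X_y=True)
Z_pca = PCA(n_components=2).fit_transform(X)
Z_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, Z, title in zip(axes, (Z_pca, Z_tsne), ("PCA", "t-SNE")):
    ax.scatter(Z[:, 0], Z[:, 1], c=y, s=20)
    ax.set_title(title)
plt.show()
```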

Perplexity: think of it as "how many effective neighbors each point should have." Too small (e.g. 5) → only looks at immediate neighbors, clusters fragment; too large (e.g. 50) → neighborhoods too broad, clusters merge. Typical range: 5–50, default 30.
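
A quick sweep makes the effect visible (the values 5/30/50 here just echo the rule of thumb above):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE

X, y = load_iris(return_X_y=True)
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, perp in zip(axes, (5, 30, 50)):
    Z = TSNE(perplexity=perp, random_state=0).fit_transform(X)
    ax.scatter(Z[:, 0], Z[:, 1], c=y, s=15)
    ax.set_title(f"perplexity = {perp}")
plt.show()
```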

02 Code

Full hand-written t-SNE with Perplexity support (change the perplexity variable). The JS version runs 150 Iris samples in the browser (~10–20 seconds); the Python version uses sklearn's optimized implementation.
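
The interactive code itself isn't reproduced inline here; a self-contained sketch in the same spirit — Steps 1–4 assembled into one function, without the momentum and early-exaggeration tricks production implementations add — could look like:

```python
import numpy as np

def tsne(X, perplexity=30.0, n_iter=500, lr=100.0, seed=0):
    """Minimal t-SNE: perplexity-calibrated P, Student-t Q, KL gradient descent."""
    n = X.shape[0]
    d2 = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)

    # Step 1: per-point sigma via binary search on perplexity
    P_cond = np.zeros((n, n))
    for i in range(n):
        lo, hi = 1e-10, 1e5
        row = np.delete(d2[i], i)            # squared distances, self excluded
        for _ in range(50):
            sigma = (lo + hi) / 2
            p = np.exp(-row / (2 * sigma ** 2))
            p /= p.sum()
            H = -np.sum(p * np.log2(p + 1e-12))
            lo, hi = (lo, sigma) if 2 ** H > perplexity else (sigma, hi)
        P_cond[i, np.arange(n) != i] = p

    # Step 2: symmetrize into a joint distribution
    P = np.maximum((P_cond + P_cond.T) / (2 * n), 1e-12)

    # Steps 3-4: t-kernel Q, then gradient descent on the 2D layout
    rng = np.random.default_rng(seed)
    Y = rng.normal(scale=1e-2, size=(n, 2))
    for _ in range(n_iter):
        d2y = np.sum((Y[:, None] - Y[None, :]) ** 2, axis=-1)
        num = 1.0 / (1.0 + d2y)
        np.fill_diagonal(num, 0.0)
        Q = np.maximum(num / num.sum(), 1e-12)
        W = (P - Q) * num
        Y -= lr * 4.0 * (np.diag(W.sum(axis=1)) - W) @ Y
    return Y

if __name__ == "__main__":
    from sklearn.datasets import load_iris
    Y = tsne(load_iris().data)    # 150 x 2 embedding
```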

03 Deep Dive

Why t-distribution instead of Gaussian?

In high dimensions, distances between dissimilar points are generally large. If the low-dimensional space also used a Gaussian, these points would get squashed together — the crowding problem. The heavier tails of the t-distribution map the same Q_ij to a larger low-dimensional distance, letting dissimilar points spread out and preventing everything from collapsing into a single blob.
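
A few numbers make the tail difference concrete (σ = 1 Gaussian vs. the df = 1 t-kernel):

```python
import numpy as np

d = np.array([1.0, 2.0, 3.0, 5.0])
gauss = np.exp(-d ** 2 / 2)       # Gaussian kernel, sigma = 1
student = 1.0 / (1.0 + d ** 2)    # Student-t kernel, df = 1

# gauss   ~ [0.61, 0.14, 0.011, 0.0000037]
# student ~ [0.50, 0.20, 0.100, 0.038]
# At d = 5 the Gaussian has effectively vanished while the t-kernel
# still assigns noticeable similarity -- the "heavier tail".
```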

Gradient of the KL divergence

Differentiating with respect to the low-dimensional coordinate y_i:

∂C/∂y_i = 4 Σ_j (P_ij - Q_ij)(y_i - y_j)(1 + ||y_i - y_j||²)⁻¹

Positive terms (P > Q) act as attraction, pulling high-dimensional neighbors closer; negative terms (P < Q) act as repulsion, pushing dissimilar points apart.

Perplexity and σ

Binary search finds a per-point σ_i satisfying:

Perplexity = 2^{H(P_i)},  H(P_i) = -Σ_j p_{j|i} log₂ p_{j|i}

Each point gets its own σ_i — small in dense regions, large in sparse regions — adapting to local density.

Limitations of t-SNE

Global distances are meaningless

The distance between two clusters does not reflect true high-dimensional distance. You cannot infer "which classes are more similar" from the layout.

Non-deterministic results

Random initialization means the layout changes from run to run (fix the random seed for reproducibility), though the cluster topology — which points group together — should be consistent.

High computational cost

A naïve implementation is O(n²). For large datasets (n > 100k) use the Barnes-Hut approximation (sklearn's default), which brings the cost down to O(n log n).

For visualization only

t-SNE coordinates are not suitable as features for downstream classification or regression — use only for visual exploration.

t-SNE vs PCA: when to use which?

Prefer PCA

Fast preprocessing, global variance structure matters, reproducible results needed, or as a pre-step to t-SNE (reduce to ~50 dims first for large datasets — see the sketch at the end of this section).

Prefer t-SNE

Goal is cluster visualization, data has nonlinear manifold structure, or you care about local neighborhood relationships.
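
And a sketch of the PCA-then-t-SNE pipeline mentioned above (the data shape here is hypothetical):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Hypothetical wide dataset: 5,000 samples x 784 features
X = np.random.default_rng(0).normal(size=(5_000, 784))

X50 = PCA(n_components=50).fit_transform(X)                 # cheap linear squeeze
Z = TSNE(n_components=2, perplexity=30).fit_transform(X50)  # t-SNE on 50 dims
```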