t-SNE
PCA projects data linearly; t-SNE instead places each point's high-dimensional neighbors close to it in 2D — points that were neighbors stay neighbors, even when global distances are distorted
01 Core Concept (Plain English)
Imagine you have a 100-dimensional "map" and want to print it on paper (2D). PCA finds a good angle and takes a photo, preserving the directions of maximum variance.
t-SNE takes a completely different approach: it ignores global structure and only cares about neighborhood relationships — points that were close in high dimensions should stay close in 2D; points that were far apart can go anywhere.
The trade-off: global distances in t-SNE have no meaning. The distance between two clusters tells you nothing. Only the structure within each cluster is trustworthy.
Four Steps to Understand t-SNE
For each point i, convert distances to the other points into conditional probabilities pj|i using a Gaussian kernel — closer neighbors get higher probability, distant points approach zero. The bandwidth σi is chosen per point by binary search so that the distribution matches a target perplexity.
Convert the asymmetric conditional probabilities into a symmetric joint probability, P_ij = (p_{j|i} + p_{i|j}) / (2n), so each pair has a single similarity value.
In 2D, use a Student-t distribution (df=1) to compute Qij. The heavier tails of the t-distribution allow dissimilar points to spread further apart, solving the "crowding problem."
Loss = KL(P||Q) = Σ_{i≠j} P_ij log(P_ij/Q_ij). Gradient descent adjusts the 2D coordinates to make Q as close to P as possible.
Step 1: What does high-dimensional data look like?
The Iris dataset has 4 features (sepal length/width, petal length/width). Let's plot the first two to get a feel for the raw data:
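As a stand-in for the interactive plot, a minimal sketch (assuming sklearn is available for the dataset) that prints the per-class means of the two plotted features — setosa separates cleanly in this 2D view, while versicolor and virginica overlap:

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target      # X: (150, 4)
sepal = X[:, :2]                   # sepal length, sepal width (the two plotted features)

# Per-class means of the plotted features.
for cls, name in enumerate(iris.target_names):
    mean = sepal[y == cls].mean(axis=0)
    print(f"{name:10s} mean sepal (length, width) = {mean.round(2)}")
```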
Step 2: Compute high-dimensional similarity P
The Gaussian kernel converts Euclidean distances to probabilities — nearby points get high probability, distant points near zero. Here we show point #0's similarity to all others:
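A minimal numpy sketch of this step, using a tiny synthetic two-blob dataset instead of Iris (the blob layout and fixed σ = 1 are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Tiny synthetic "high-dimensional" dataset: two well-separated blobs in 4D.
X = np.vstack([rng.normal(0, 0.5, (5, 4)), rng.normal(5, 0.5, (5, 4))])

def conditional_p(X, i, sigma):
    """p_{j|i}: Gaussian similarity of point i to every other point."""
    d2 = np.sum((X - X[i]) ** 2, axis=1)   # squared Euclidean distances
    w = np.exp(-d2 / (2 * sigma ** 2))
    w[i] = 0.0                             # a point is not its own neighbor
    return w / w.sum()

p = conditional_p(X, 0, sigma=1.0)
print(p.round(3))   # nearly all mass lands on points 1-4 (point 0's own blob)
```

The second blob sits so far away that its similarities underflow to essentially zero — exactly the "distant points approach zero" behavior described above.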
Step 3: Gradient descent to optimize 2D layout
Start with random 2D coordinates and run gradient descent to make Q match P. See the result after 200 steps:
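The optimization loop can be sketched end-to-end in numpy. This is a simplified version of what the demo runs: synthetic two-blob data, a fixed σ = 1 instead of the per-point perplexity search, and plain gradient descent without momentum or early exaggeration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two well-separated 4D blobs stand in for "high-dimensional data".
X = np.vstack([rng.normal(0, 0.5, (10, 4)), rng.normal(4, 0.5, (10, 4))])
n = len(X)

# High-dimensional affinities with fixed sigma = 1 (a full implementation
# would pick a per-point sigma_i via the perplexity binary search).
d2 = np.sum((X[:, None] - X[None]) ** 2, axis=-1)
P = np.exp(-d2 / 2.0)
np.fill_diagonal(P, 0.0)
P = P / P.sum(axis=1, keepdims=True)   # conditional p_{j|i}
P = (P + P.T) / (2 * n)                # symmetric joint P_ij

def q_and_w(Y):
    """Low-dimensional similarities: Student-t kernel with df = 1."""
    dy2 = np.sum((Y[:, None] - Y[None]) ** 2, axis=-1)
    W = 1.0 / (1.0 + dy2)
    np.fill_diagonal(W, 0.0)
    return W / W.sum(), W

def kl(P, Q):
    m = P > 0
    return float(np.sum(P[m] * np.log(P[m] / Q[m])))

Y = rng.normal(0.0, 1e-2, (n, 2))      # random 2D initialization
kl_init = kl(P, q_and_w(Y)[0])
lr = 1.0
for _ in range(200):
    Q, W = q_and_w(Y)
    # grad_i = 4 * sum_j (P_ij - Q_ij) (y_i - y_j) / (1 + ||y_i - y_j||^2)
    grad = 4 * np.sum(((P - Q) * W)[:, :, None] * (Y[:, None] - Y[None]), axis=1)
    Y -= lr * grad
kl_final = kl(P, q_and_w(Y)[0])
print(f"KL: {kl_init:.3f} -> {kl_final:.3f}")
```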
Step 4: Compare PCA vs t-SNE
PCA is linear projection; t-SNE is nonlinear. On linearly separable Iris data the difference is subtle; on complex manifold data, t-SNE wins clearly:
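Assuming sklearn is available, the comparison can be reproduced in a few lines (parameter values here are illustrative defaults, not tuned):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, y = load_iris(return_X_y=True)

# Linear projection: keeps the two directions of maximum variance.
Y_pca = PCA(n_components=2).fit_transform(X)

# Nonlinear embedding: preserves local neighborhoods only.
Y_tsne = TSNE(n_components=2, perplexity=30, init="pca",
              random_state=0).fit_transform(X)

print(Y_pca.shape, Y_tsne.shape)   # (150, 2) (150, 2)
```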
Perplexity: think of it as "how many effective neighbors each point should have." Too small (e.g. 5) → only looks at immediate neighbors, clusters fragment; too large (e.g. 50) → neighborhoods too broad, clusters merge. Typical range: 5–50, default 30.
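The "effective neighbors" intuition is literal: perplexity is 2 raised to the Shannon entropy of the neighbor distribution, so a distribution spread uniformly over k neighbors has perplexity exactly k. A small demonstration:

```python
import numpy as np

def perplexity(p):
    """2^H(p): the 'effective number of neighbors' of a distribution."""
    p = p[p > 0]
    H = -np.sum(p * np.log2(p))
    return 2.0 ** H

# Uniform over exactly 5 neighbors -> perplexity 5.
print(perplexity(np.array([0.2] * 5 + [0.0] * 10)))    # ~5 (exact up to float error)

# Mass concentrated on one neighbor -> perplexity close to 1.
print(perplexity(np.array([0.97, 0.01, 0.01, 0.01])))
```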
02 Code
A full hand-written t-SNE implementation with perplexity support (change the perplexity variable to experiment). The JS version runs the 150 Iris samples in the browser (~10–20 seconds); the Python version uses sklearn's optimized implementation.
03 Deep Dive
Why t-distribution instead of Gaussian?
In high dimensions, distances between dissimilar points are generally large. If the low-dimensional space also uses a Gaussian, these points get squashed together — the crowding problem. The heavier tails of the t-distribution map the same Qij to a larger low-dimensional distance, letting dissimilar points spread out and preventing everything from collapsing into a single blob.
Gradient of the KL divergence
Differentiating with respect to low-dimensional coordinate yi:
∂C/∂y_i = 4 Σ_j (P_ij - Q_ij)(y_i - y_j)(1 + ||y_i - y_j||²)⁻¹
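The formula can be sanity-checked numerically: the analytic gradient should match a finite-difference approximation of the KL loss. A small numpy check on a random symmetric P (the data here is arbitrary, chosen only to exercise the formula):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
P = rng.random((n, n))
np.fill_diagonal(P, 0.0)
P = P + P.T
P /= P.sum()                 # symmetric joint distribution, zero diagonal
Y = rng.normal(size=(n, 2))  # arbitrary 2D coordinates

def kl_loss(Y):
    dy2 = np.sum((Y[:, None] - Y[None]) ** 2, axis=-1)
    W = 1.0 / (1.0 + dy2)
    np.fill_diagonal(W, 0.0)
    Q = W / W.sum()
    m = P > 0
    return np.sum(P[m] * np.log(P[m] / Q[m]))

# Analytic gradient from the formula above.
dy2 = np.sum((Y[:, None] - Y[None]) ** 2, axis=-1)
W = 1.0 / (1.0 + dy2)
np.fill_diagonal(W, 0.0)
Q = W / W.sum()
grad = 4 * np.sum(((P - Q) * W)[:, :, None] * (Y[:, None] - Y[None]), axis=1)

# Finite-difference check on one coordinate.
eps = 1e-6
Yp = Y.copy()
Yp[0, 0] += eps
num = (kl_loss(Yp) - kl_loss(Y)) / eps
print(grad[0, 0], num)   # the two values should agree closely
```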
Positive terms (P > Q) act as attraction, pulling high-dimensional neighbors closer; negative terms (P < Q) act as repulsion, pushing dissimilar points apart.
Perplexity and σ
Binary search finds a per-point σi satisfying:
Perplexity = 2^{H(P_i)}, H(P_i) = -Σ_j p_{j|i} log p_{j|i}
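The search itself is straightforward because perplexity is monotonic in σ (larger σ flattens the distribution and raises the entropy). A numpy sketch on synthetic data, assuming a target perplexity of 30:

```python
import numpy as np

def sigma_for_perplexity(d2_row, target, iters=50):
    """Binary-search the Gaussian bandwidth sigma_i so that the
    conditional distribution p_{j|i} hits the target perplexity."""
    lo, hi = 1e-10, 1e10
    for _ in range(iters):
        sigma = (lo + hi) / 2
        w = np.exp(-d2_row / (2 * sigma ** 2))
        p = w / w.sum()
        H = -np.sum(p[p > 0] * np.log2(p[p > 0]))
        if 2.0 ** H > target:   # too many effective neighbors
            hi = sigma          # -> shrink the bandwidth
        else:
            lo = sigma          # -> widen the bandwidth
    return sigma, p

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
d2 = np.sum((X - X[0]) ** 2, axis=1)[1:]   # distances from point 0, excluding itself
sigma, p = sigma_for_perplexity(d2, target=30.0)
achieved = 2.0 ** (-np.sum(p[p > 0] * np.log2(p[p > 0])))
print(sigma, achieved)   # achieved perplexity ~= 30
```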
Each point gets its own σi — small in dense regions, large in sparse regions — adapting to local density.
Limitations of t-SNE
Global distances are meaningless
The distance between two clusters does not reflect true high-dimensional distance. You cannot infer "which classes are more similar" from the layout.
Non-deterministic results
Random initialization means the layout changes each run, though the cluster topology (which points group together) should be consistent.
High computational cost
Naïve implementation is O(n²). For large datasets (n > 100k) use Barnes-Hut approximation (sklearn default).
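In sklearn this is just the default: `method="barnes_hut"` gives O(n log n) gradients (it only supports embedding into 3 or fewer dimensions), while `method="exact"` is the O(n²) version. A quick sketch on random data:

```python
import numpy as np
from sklearn.manifold import TSNE

X = np.random.default_rng(0).normal(size=(300, 10))

# Barnes-Hut approximation (the sklearn default).
Y = TSNE(n_components=2, method="barnes_hut", random_state=0).fit_transform(X)
print(Y.shape)   # (300, 2)
```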
For visualization only
t-SNE coordinates are not suitable as features for downstream classification or regression — use only for visual exploration.
t-SNE vs PCA: when to use which?
Prefer PCA
Fast preprocessing, global variance structure matters, reproducible results needed, or as a pre-step to t-SNE (reduce to 50 dims first for large datasets).
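The pre-step combination looks like this in sklearn (random data stands in for a large high-dimensional dataset; 50 is the conventional intermediate dimensionality, not a tuned value):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 200))   # stand-in for a large high-dimensional dataset

X50 = PCA(n_components=50).fit_transform(X)   # denoise + speed up distance computation
Y = TSNE(n_components=2, random_state=0).fit_transform(X50)
print(Y.shape)   # (500, 2)
```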
Prefer t-SNE
Goal is cluster visualization, data has nonlinear manifold structure, or you care about local neighborhood relationships.