Attention
The attention mechanism lets models "selectively focus": when translating a sentence, there is no need to memorize the whole thing, just look at the most relevant words.
[Interactive demo: a heatmap showing how much each word attends to every other word; darker colors indicate higher attention weights.]
01 Plain English Explanation of Attention
The Library Analogy
Imagine you go to the library to look up information. Here's how the process works:
You have a question in mind, like "Which words are related to this word?" That's the Query.
Every book in the library has a title and category tags. You compare your question against each book's tags; the more similar they are, the more likely it's what you need. Those tags are the Keys.
After finding the relevant books, you read their content weighted by relevance, ultimately arriving at an answer. That's the Value.
What Problem Does Attention Solve?
Before Attention, RNNs translating a sentence had to "compress" the entire sentence into a single fixed-length vector before decoding; the longer the sentence, the more information was lost.
Attention's approach: When translating each word, look back at the original sentence directly, weighting by relevance. When translating "apple," pay more attention to "apple" — no need to memorize the entire sentence.
Self-Attention
The most powerful variant of Attention applies attention to the sentence itself — each word simultaneously plays all three roles of Q, K, and V:
Use Q·Kᵀ to compute a match score for each pair of words, then divide by √dₖ to keep the values from growing too large
Convert scores into a probability distribution (each row sums to 1) — these are the attention weights
Use the weights to compute a weighted sum of V (each word's content vector), yielding a "new representation enriched with context"
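The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration: the Q, K, V vectors here are random placeholders rather than representations of real words.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Toy setup: 3 words, each with a 4-dimensional Q, K, and V vector
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))   # queries
K = rng.normal(size=(3, 4))   # keys
V = rng.normal(size=(3, 4))   # values

d_k = K.shape[-1]

# Step 1: pairwise match scores, scaled by sqrt(d_k)
scores = Q @ K.T / np.sqrt(d_k)      # shape (3, 3)

# Step 2: softmax turns each row into a probability distribution
weights = softmax(scores, axis=-1)   # each row sums to 1

# Step 3: weighted sum of the value vectors
output = weights @ V                 # shape (3, 4), context-enriched
```

Each row of `weights` is one word's attention distribution over the whole sentence, and the matching row of `output` is that word's new, context-aware representation.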
What Can Attention Do?
When translating each word, look back at the most relevant words in the source sentence (the original use case)
"Apple is delicious" vs. "Apple released a new product" — use contextual Attention to distinguish meanings
The subject at the beginning of a sentence can directly attend to the verb at the end, unconstrained by distance
Large language models like GPT and BERT are all built on top of self-attention
Building Attention Step by Step
From word embeddings to attention weights, let's build it up gradually.
3 words, each represented by a 4-dimensional vector. This is the raw input to Attention.
Use three learnable matrices to project the input into Query (Q), Key (K), and Value (V).
Compute softmax(Q·Kᵀ / √dₖ)·V: the softmax gives each word's attention over every other word, and the weighted sum over V gives each word's new representation.
02 Code
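Putting the walkthrough together, here is a minimal single-head self-attention implementation in NumPy. The function and variable names are my own, and random toy weights stand in for trained projection matrices.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the max before exponentiating
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention over a sequence of word vectors.

    X:   (n, d_model) input embeddings
    W_*: (d_model, d_k) learned projection matrices
    Returns (output, weights) with shapes (n, d_k) and (n, n).
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v   # project input into Q, K, V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # (n, n) scaled match scores
    weights = softmax(scores, axis=-1)    # each row sums to 1
    return weights @ V, weights

# Toy run matching the walkthrough: 3 words, 4-dimensional embeddings
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
W_q, W_k, W_v = (rng.normal(size=(4, 4)) for _ in range(3))
out, attn = self_attention(X, W_q, W_k, W_v)
print(out.shape, attn.shape)   # (3, 4) (3, 3)
```

In a real model the three projection matrices are learned parameters updated by backpropagation; here they are fixed random draws purely to show the shapes and data flow.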
03 Academic Explanation
Attention Mechanism enables models to dynamically aggregate information from all positions in a sequence when processing each position, with weights determined by content similarity rather than a fixed local window.
Scaled Dot-Product Attention
The standard attention mechanism computes via three sets of vectors — Query, Key, and Value:
Matrix multiplication computes dot products for all Q-K pairs at once, with complexity O(n²d), where n is the sequence length
Dot products grow in variance as dimension d increases, pushing Softmax into saturation with vanishing gradients. Dividing by √dₖ stabilizes variance around 1
After Softmax normalizes the weights, a weighted sum over V produces a dynamic mixture of information from all positions in the sequence
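The √dₖ scaling argument can be checked empirically: if query and key components have zero mean and unit variance, raw dot products have variance ≈ d, and dividing by √d brings it back to ≈ 1. A quick numerical sanity check (the numbers here are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
d = 512              # head dimension
n_samples = 100_000  # number of random query/key pairs

# Components with zero mean and unit variance
q = rng.normal(size=(n_samples, d))
k = rng.normal(size=(n_samples, d))

dots = (q * k).sum(axis=1)      # raw dot products
scaled = dots / np.sqrt(d)      # scaled as in attention

print(dots.var())    # ≈ d (here ≈ 512): softmax would saturate
print(scaled.var())  # ≈ 1: scores stay in softmax's sensitive range
```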
Why Do We Need Three Separate Q/K/V Matrices?
Using word vectors directly to compute similarity would work, but three independent projection matrices (W_Q, W_K, W_V) decouple "how to query" from "what information to carry," greatly increasing representational capacity. These three matrices are learned during training.
Computational Complexity
O(n²d), where n is sequence length and d is dimension — expensive for long sequences
O(n²) to store the attention matrix — this is the main bottleneck for long contexts
Matrix multiplication is highly parallelizable, far faster than RNN's sequential computation on GPUs
Any two positions are just 1 hop apart, avoiding the long gradient paths that cause RNNs' vanishing-gradient problem over long distances
Summary
Query (Q): what the current word "wants to look up"
Key (K): what each word "can match against"
Value (V): the actual information each word "carries"
Attention weights: Q·K similarity normalized via Softmax