See It in Action: Attention Weights Between Words

[Interactive demo: a heatmap showing how much each word attends to every other word; darker cells indicate higher attention weights. Each generation uses random weights to simulate different attention distributions across contexts.]

01 Plain English Explanation of Attention

The Library Analogy

Imagine you go to the library to look up information. Here's how the process works:

Q
Query — What you want to find

You have a question in mind, like "Which words are related to this word?" That's the Query.

K
Key — Book titles/labels

Every book in the library has a title and category tags. You compare your question against each book's tags — the more similar, the more likely it's what you need.

V
Value — The actual content of the book

After finding the relevant books, you read their content weighted by relevance, ultimately arriving at an answer. That's the Value.

What Problem Does Attention Solve?

Before Attention, RNNs translating a sentence had to "compress" the entire sentence into a fixed vector before decoding — the longer the sentence, the more information was lost.

Attention's approach: When translating each word, look back at the original sentence directly, weighting by relevance. When translating "apple," pay more attention to "apple" — no need to memorize the entire sentence.

Self-Attention

The most powerful variant of Attention applies attention to the sentence itself — each word simultaneously plays all three roles of Q, K, and V:

1
Compute similarity between words

Use Q·Kᵀ to compute a match score for every pair of words, then divide by √dₖ to keep the values from growing too large

2
Softmax normalization

Convert scores into a probability distribution (each row sums to 1) — these are the attention weights

3
Weighted sum

Use the weights to compute a weighted sum of V (each word's content vector), yielding a "new representation enriched with context"
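The three steps above can be sketched in a few lines of NumPy. This is a toy illustration with made-up numbers (3 words, 2-dimensional Q/K/V vectors), not values from any real model:

```python
import numpy as np

# Toy Q, K, V for 3 words, each a 2-dimensional vector (illustrative values).
Q = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
K = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
V = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

d = Q.shape[-1]

# Step 1: pairwise match scores, scaled by sqrt(d)
scores = Q @ K.T / np.sqrt(d)

# Step 2: softmax over each row -> attention weights (each row sums to 1)
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

# Step 3: weighted sum of V -> context-enriched representation per word
output = weights @ V

print(weights.sum(axis=-1))  # each row sums to 1
print(output.shape)          # (3, 2)
```

Note how each output row mixes information from all three words, with the mix ratios given by that word's attention weights.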

What Can Attention Do?

Machine Translation

When translating each word, look back at the most relevant words in the source sentence (the original use case)

Disambiguation

"Apple is delicious" vs. "Apple released a new product" — use contextual Attention to distinguish meanings

Long-range Dependencies

The subject at the beginning of a sentence can directly attend to the verb at the end, unconstrained by distance

Foundation of Transformers

Large language models like GPT and BERT are all built on top of self-attention

Building Attention Step by Step

From word embeddings to attention weights, let's build it up gradually.

Step 1 Input: Word Embedding Matrix

3 words, each represented by a 4-dimensional vector. This is the raw input to Attention.

Step 2 Linear Projection: Generate Q, K, V

Use three learnable matrices to project the input into Query (Q), Key (K), and Value (V).

Step 3 Scaled Dot-Product Attention

Compute softmax(Q·Kᵀ / √dₖ) · V, producing each word's context-enriched output vector based on its attention to every other word.

02 Code
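A self-contained NumPy sketch of the three-step walkthrough above: 3 words with 4-dimensional embeddings, three projection matrices, then scaled dot-product attention. The embeddings and projection matrices are random placeholders here; in a real model the projections would be learned during training:

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: input embedding matrix — 3 words, 4 dimensions each
X = rng.normal(size=(3, 4))

# Step 2: project the input into Q, K, V with three separate matrices
# (random here for illustration; learned in a real model)
d_k = 4
W_Q = rng.normal(size=(4, d_k))
W_K = rng.normal(size=(4, d_k))
W_V = rng.normal(size=(4, d_k))
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Step 3: scaled dot-product attention
weights = softmax(Q @ K.T / np.sqrt(d_k))  # (3, 3): each word's attention over every word
output = weights @ V                        # (3, 4): context-enriched representations

print(weights.shape)
print(output.shape)
```

The `weights` matrix is exactly what the heatmap at the top of the page visualizes: row i shows how much word i attends to each word in the sentence.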

03 Academic Explanation

The attention mechanism lets a model dynamically aggregate information from all positions in a sequence when processing each position, with weights determined by content similarity rather than by a fixed local window.

Scaled Dot-Product Attention

The standard attention mechanism computes via three sets of vectors — Query, Key, and Value:

Attention(Q, K, V) = Softmax(QKᵀ / √dₖ) · V
1
Similarity computation: QKᵀ

Matrix multiplication computes dot products for all Q-K pairs at once, with complexity O(n²d), where n is the sequence length

2
Scaling: divide by √dₖ

Dot products grow in variance as dimension d increases, pushing Softmax into saturation with vanishing gradients. Dividing by √dₖ stabilizes variance around 1

3
Weighted aggregation: Softmax × V

After Softmax normalizes the weights, a weighted sum over V produces a dynamic mixture of information from all positions in the sequence
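The scaling step can be verified empirically. For random vectors with unit-variance components, the variance of a dot product grows linearly with the dimension d, and dividing by √d brings it back to roughly 1 (all numbers below are from this simulation, not from a trained model):

```python
import numpy as np

rng = np.random.default_rng(42)

# Dot products of d-dimensional vectors with i.i.d. N(0, 1) components
# have variance ~d; scaling by 1/sqrt(d) restores variance ~1.
for d in (4, 64, 512):
    q = rng.normal(size=(10_000, d))
    k = rng.normal(size=(10_000, d))
    dots = (q * k).sum(axis=-1)
    print(d, round(dots.var(), 2), round((dots / np.sqrt(d)).var(), 2))
```

Without the scaling, the large-magnitude scores at high d would push the softmax toward a near one-hot distribution, where gradients vanish.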

Why Do We Need Three Separate Q/K/V Matrices?

Using word vectors directly to compute similarity would work, but three independent projection matrices (W_Q, W_K, W_V) decouple "how to query" from "what information to carry," greatly increasing representational capacity. These three matrices are learned during training.

Computational Complexity

Time Complexity

O(n²d), where n is sequence length and d is dimension — expensive for long sequences

Space Complexity

O(n²) to store the attention matrix — this is the main bottleneck for long contexts

Parallelism

Matrix multiplication is highly parallelizable, making attention far faster on GPUs than an RNN's sequential computation

Receptive Field

Any two positions are just one hop apart, which largely sidesteps the vanishing-gradient problem RNNs face over long distances
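The O(n²) space cost is easy to make concrete with a back-of-envelope calculation: storing one n×n attention matrix in float32 takes 4n² bytes (per head, before any optimizations such as FlashAttention-style recomputation):

```python
# Memory for a single n x n attention matrix in float32 (4 bytes per entry).
for n in (1_000, 10_000, 100_000):
    bytes_needed = n * n * 4
    print(f"n={n:>7}: {bytes_needed / 1e9:.2f} GB")
```

At n = 100,000 a single attention matrix already needs 40 GB, which is why long-context models rely on techniques that avoid materializing the full matrix.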

Summary

Query

What the current word "wants to look up"

Key

What each word "can match against"

Value

The actual information each word "carries"

Weights

Q·K similarity normalized via Softmax