See It in Action: Different Views of Multi-Head Attention

Switch between different attention heads and observe the patterns each head focuses on within the same sentence — some focus on adjacent words, others capture long-range dependencies.


01 Plain English Explanation of Transformer

RNN's Problem: Can Only Step Through Sequentially

An RNN processes a sentence like reading a book: you must read from the first word to the last, and you cannot take the next step until the previous one is done. This causes two problems:

Cannot Parallelize

GPUs have thousands of cores, but an RNN's sequential steps use only one at a time, leaving most of the hardware idle.

Long-Range Forgetting

Information from the beginning of a sentence must pass through dozens of steps to influence the end, getting diluted along the way.

Transformer's Solution: All Positions See Each Other at Once

Transformer uses Self-Attention to let every word simultaneously attend to all other words in the sentence. There's no sequential dependency — the entire sentence is processed in one parallel pass.

The trade-off: you must explicitly tell the model the order of words — that's the role of positional encoding.

Positional Encoding: Assigning "Seat Numbers" to Words

Transformer uses sine/cosine functions to generate unique encodings for each position, which are added to the word vectors:

1. Low Dimensions Oscillate Fast, High Dimensions Oscillate Slowly

Like the second, minute, and hour hands of a clock: combined, they can precisely represent any moment (any position).

2. Relative Position Is Computable

The encoding at position pos + k is a fixed linear transformation of the encoding at position pos, depending only on the offset k. This lets the model perceive relative distance, not just absolute position.
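This property can be checked numerically: for each sin/cos dimension pair, the encoding at position pos + k is a rotation of the encoding at position pos by an angle that depends only on the offset k. A minimal NumPy sketch (the frequency w and the values of pos and k are arbitrary illustration choices):

```python
import numpy as np

# One sin/cos dimension pair at frequency w. PE(pos + k) equals a rotation
# of PE(pos) by angle w*k -- a linear map that depends only on the offset k.
w = 1.0 / 10000 ** (0 / 64)       # frequency of the first dim pair (d_model=64)
pos, k = 5, 3

pe_pos = np.array([np.sin(w * pos), np.cos(w * pos)])
rot = np.array([[ np.cos(w * k), np.sin(w * k)],
                [-np.sin(w * k), np.cos(w * k)]])   # rotation depends only on k
pe_shifted = rot @ pe_pos

# matches the encoding computed directly at position pos + k
assert np.allclose(pe_shifted, [np.sin(w * (pos + k)), np.cos(w * (pos + k))])
```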

Multi-Head Attention: Understanding from Multiple Angles

A single Attention can only focus on one pattern at a time. Multi-Head Attention runs multiple Attention heads in parallel, where each "head" learns to focus on different types of relationships:

Head 1

Focuses on adjacent words (local syntax)

Head 2

Focuses on similar words (semantic similarity)

Head 3

Focuses on syntactic dependencies (subject-verb-object)

Head 4

Focuses on long-range references (coreference resolution)

Finally, the outputs of all heads are concatenated together to produce a richer representation.
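The split-attend-concatenate pipeline above can be sketched in NumPy as follows; the dimensions, head count, and random weight matrices are illustrative assumptions, not trained values:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """Project Q/K/V, split into heads, attend per head, concat, project."""
    n, d = X.shape
    dk = d // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv              # (n, d) each
    outs = []
    for h in range(n_heads):
        q = Q[:, h*dk:(h+1)*dk]                   # (n, dk) slice = one head
        k = K[:, h*dk:(h+1)*dk]
        v = V[:, h*dk:(h+1)*dk]
        weights = softmax(q @ k.T / np.sqrt(dk))  # (n, n) attention weights
        outs.append(weights @ v)                  # (n, dk) per-head output
    return np.concatenate(outs, axis=1) @ Wo      # concat heads, project

rng = np.random.default_rng(0)
n, d, heads = 4, 8, 2                             # toy sizes for illustration
X = rng.normal(size=(n, d))
W = [rng.normal(size=(d, d)) for _ in range(4)]   # Wq, Wk, Wv, Wo
out = multi_head_attention(X, *W, n_heads=heads)
print(out.shape)  # (4, 8)
```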

Why Is Transformer So Powerful?

Fast Training

Fully parallel, maximizing GPU utilization; this parallelism is what makes training models with hundreds of billions of parameters, like GPT-3, feasible

Long-Range Dependencies in One Hop

The path length between any two words is 1, which greatly alleviates the gradient vanishing problem

Universal Architecture

GPT (decoder-only), BERT (encoder-only), T5 (encoder+decoder) are all Transformer variants

Build Positional Encoding Step by Step

Transformer has no recurrent structure — it relies on positional encoding to inject sequence order information.

Step 1 Positional Encoding Formula

Even dimensions use sin, odd dimensions use cos, with frequency decreasing as dimension increases.

Step 2 Build the Complete PE Matrix

Generate a [MAX_LEN × D_MODEL] positional encoding matrix, added to word embeddings.

Step 3 Heatmap Visualization

Use a heatmap to observe the encoding values at each position and dimension — dimensions on the right change more slowly.

02 Code
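The three steps above can be sketched as follows; MAX_LEN and D_MODEL are assumed illustration sizes, and step 3's heatmap is simply a plot of the resulting matrix:

```python
import numpy as np

MAX_LEN, D_MODEL = 50, 32   # assumed sizes for illustration

# Steps 1+2: PE(pos, 2i) = sin(pos / 10000^(2i/d)),
#            PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
pos = np.arange(MAX_LEN)[:, None]        # (MAX_LEN, 1) positions
i = np.arange(0, D_MODEL, 2)[None, :]    # even dimension indices
freq = 1.0 / 10000 ** (i / D_MODEL)      # frequency falls as dimension rises
pe = np.zeros((MAX_LEN, D_MODEL))
pe[:, 0::2] = np.sin(pos * freq)         # even dims: sin
pe[:, 1::2] = np.cos(pos * freq)         # odd dims:  cos

# The matrix is then added to the word embeddings:
#   x = embedding + pe[:seq_len]
# Step 3: e.g. matplotlib's imshow(pe) shows fast oscillation on the left
# (low dims) and slow change on the right (high dims).
print(pe.shape)  # (50, 32)
```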

03 Academic Explanation

Transformer (Vaswani et al., 2017 "Attention Is All You Need") is built entirely on the self-attention mechanism, completely discarding recurrence and convolution. It achieves efficient sequence modeling through multi-head attention, residual connections, and layer normalization.

Overall Architecture: Encoder-Decoder

[Architecture diagram] Encoder: input + positional encoding → N stacked blocks of Multi-Head Self-Attention → Add & Norm → Feed-Forward Network (FFN) → Add & Norm; the encoder output supplies K, V to the decoder. Decoder: output + positional encoding → N stacked blocks of Masked Self-Attention → Add & Norm → Cross-Attention (K, V from encoder) → FFN → Add & Norm, followed by Linear + Softmax to predict the next word.

Core Components in Detail

1. Positional Encoding

PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d)). Superimposed sine waves at different frequencies enable the model to compute relative position differences through linear transformations

2. Multi-Head Attention

Project Q/K/V into h separate low-dimensional subspaces, perform attention independently, then concatenate: MultiHead(Q, K, V) = Concat(head₁, ..., headₕ)·W^O, where headᵢ = Attention(Q·Wᵢ^Q, K·Wᵢ^K, V·Wᵢ^V)

3. Residual Connection + Layer Normalization (Add & Norm)

Each sublayer output is LayerNorm(x + Sublayer(x)). Residual connections prevent gradient vanishing in deep networks; layer normalization stabilizes training

4. Masked Attention

When generating the t-th word, the decoder can only see words at positions 1 to t−1 (future words are masked), ensuring causality in autoregressive generation

5. Cross Attention

The decoder's Q comes from its own states, while K/V come from the encoder output — this is the channel through which the decoder "consults" source sequence information
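The causal mask behind masked attention (component 4 above) can be sketched as: fill future positions of the score matrix with −inf before the softmax, so their attention weights become exactly zero. The sizes and raw scores below are random illustration values:

```python
import numpy as np

# Causal (look-ahead) mask: position t may attend only to positions <= t.
n = 4
scores = np.random.default_rng(0).normal(size=(n, n))  # raw QK^T scores
mask = np.triu(np.ones((n, n), dtype=bool), k=1)       # True above the diagonal
scores[mask] = -np.inf                                 # block future positions

# row-wise softmax: exp(-inf) = 0, so masked weights are exactly zero
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

print(np.triu(weights, k=1).sum())  # 0.0 -- no attention to the future
```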

Comparison with RNN

Longest Path

RNN O(n); Transformer O(1) — any two words are directly connected

Parallelism

RNN sequential dependencies prevent parallelism; Transformer is fully parallel

Computational Cost

RNN O(nd²); Transformer O(n²d) per layer. Attention is cheaper when the sequence length n is smaller than the model dimension d, but the n² term dominates for very long sequences

Applications

GPT/BERT/T5 are all based on Transformer, which has become the default NLP architecture
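The per-layer cost comparison above can be checked with quick arithmetic; d = 512 is an assumed model width, and constant factors are dropped:

```python
# Per-layer cost (constant factors dropped):
#   RNN            ~ n * d^2  (n sequential steps, each a d x d matmul)
#   self-attention ~ n^2 * d  (an n x n score matrix over d-dim vectors)
# The crossover is at n = d: attention wins below it, loses above it.
d = 512
for n in (128, 512, 2048):
    rnn_cost, attn_cost = n * d**2, n**2 * d
    cheaper = "attention" if attn_cost < rnn_cost else "RNN"
    print(f"n={n}: {cheaper} is cheaper or equal")
```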

Summary

Parallelism

Discards recurrence, fully parallel

Positional Encoding

Sine waves inject order information

Multi-Head

Captures dependencies from multiple angles

Residual + Normalization

Stabilizes deep network training