Transformer
Stacked attention layers replace the RNN entirely: all positions "see" each other simultaneously, enabling fully parallel training
✦ See It in Action: Different Views of Multi-Head Attention
Switch between different attention heads and observe the patterns each head focuses on within the same sentence — some focus on adjacent words, others capture long-range dependencies.
Hover over a token to see attention weights; line thickness indicates weight magnitude
01 Plain English Explanation of Transformer
RNN's Problem: Can Only Step Through Sequentially
An RNN processes a sentence the way you read a book: you must go from the first word to the last, and no step can begin until the previous one finishes. This causes two problems:
GPUs have thousands of cores, but an RNN computes one timestep at a time, leaving most of that parallelism idle
Information from the beginning of a sentence must pass through dozens of steps to influence the end, getting diluted along the way
Transformer's Solution: All Positions See Each Other at Once
Transformer uses Self-Attention to let every word simultaneously attend to all other words in the sentence. There's no sequential dependency — the entire sentence is processed in one parallel pass.
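As a minimal sketch of this idea (plain NumPy, a single head, no learned Q/K/V projections; all names are illustrative), every position attends to every other position in one matrix multiply, with no sequential loop:

```python
import numpy as np

def self_attention(X):
    """Scaled dot-product self-attention over X: [seq_len, d].
    Single head, no learned projections -- for illustration only."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                    # every position scores every other
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)      # softmax over positions
    return weights @ X, weights                      # weighted mix of all positions

X = np.random.randn(5, 8)                            # a 5-token "sentence"
out, weights = self_attention(X)
print(out.shape)                                     # (5, 8): all positions at once
```

Note that the whole sentence is processed in one pass; there is no loop over timesteps, which is exactly what makes the computation parallelizable.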
The trade-off: you must explicitly tell the model the order of words — that's the role of positional encoding.
Positional Encoding: Assigning "Seat Numbers" to Words
Transformer uses sine/cosine functions to generate unique encodings for each position, which are added to the word vectors:
Like the second hand, minute hand, and hour hand of a clock — combined together they can precisely represent any moment (any position)
For any fixed offset k, PE(pos + k) is a fixed linear transformation of PE(pos), letting the model perceive relative distance, not just absolute position
Multi-Head Attention: Understanding from Multiple Angles
A single Attention can only focus on one pattern at a time. Multi-Head Attention runs multiple Attention heads in parallel, where each "head" learns to focus on different types of relationships:
Focuses on adjacent words (local syntax)
Focuses on similar words (semantic similarity)
Focuses on syntactic dependencies (subject-verb-object)
Focuses on long-range references (coreference resolution)
Finally, the outputs of all heads are concatenated and passed through a linear projection to produce a richer representation.
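A sketch of the multi-head computation in NumPy (random weights stand in for learned projections; the sizes and variable names are assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 16, 4
d_head = d_model // n_heads

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical random projection weights; in a real model these are learned.
W_q, W_k, W_v = (rng.standard_normal((n_heads, d_model, d_head)) for _ in range(3))
W_o = rng.standard_normal((d_model, d_model))

X = rng.standard_normal((seq_len, d_model))
heads = []
for h in range(n_heads):                             # each head attends in its own subspace
    Q, K, V = X @ W_q[h], X @ W_k[h], X @ W_v[h]
    A = softmax(Q @ K.T / np.sqrt(d_head))
    heads.append(A @ V)
out = np.concatenate(heads, axis=-1) @ W_o           # concat heads, project back to d_model
print(out.shape)                                     # (5, 16)
```

Each head sees only a d_model/n_heads slice of the representation, so the total cost is comparable to a single full-width attention, yet each head is free to specialize in a different relationship pattern.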
Why Is Transformer So Powerful?
Fully parallel training maximizes GPU utilization; this is what makes it practical to train models with hundreds of billions of parameters, such as GPT-3
The path length between any two words is 1, greatly alleviating the vanishing-gradient problem over long ranges
GPT (decoder-only), BERT (encoder-only), T5 (encoder+decoder) are all Transformer variants
Build Positional Encoding Step by Step
Transformer has no recurrent structure — it relies on positional encoding to inject sequence order information.
Even dimensions use sin, odd dimensions use cos, with frequency decreasing as dimension increases.
Generate a [MAX_LEN × D_MODEL] positional encoding matrix, added to word embeddings.
Use a heatmap to observe the encoding values at each position and dimension — dimensions on the right change more slowly.
02 Code
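The steps above can be sketched in NumPy as follows (MAX_LEN and D_MODEL are assumed example sizes):

```python
import numpy as np

MAX_LEN, D_MODEL = 50, 64                            # assumed sizes for illustration

# Even dimensions use sin, odd dimensions use cos; the frequency falls as the
# dimension index grows, like ever-slower clock hands.
pos = np.arange(MAX_LEN)[:, None]                    # [MAX_LEN, 1]
i = np.arange(0, D_MODEL, 2)[None, :]                # even dimension indices
angle = pos / (10000 ** (i / D_MODEL))               # [MAX_LEN, D_MODEL/2]

pe = np.zeros((MAX_LEN, D_MODEL))
pe[:, 0::2] = np.sin(angle)
pe[:, 1::2] = np.cos(angle)

# In a model this matrix is simply added to the word embeddings:
#   embedded = token_embeddings + pe[:seq_len]
print(pe.shape)                                      # (50, 64)
```

Plotting `pe` as a heatmap reproduces the visualization described above: columns on the left oscillate quickly, while columns on the right change slowly across positions.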
03 Academic Explanation
Transformer (Vaswani et al., 2017 "Attention Is All You Need") is built entirely on the self-attention mechanism, completely discarding recurrence and convolution. It achieves efficient sequence modeling through multi-head attention, residual connections, and layer normalization.
Overall Architecture: Encoder-Decoder
Core Components in Detail
PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d)). Superimposing sine waves at different frequencies lets the model express relative position offsets as linear transformations
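A quick numeric check of this linearity property for one sin/cos dimension pair (the frequency and positions here are arbitrary illustrative values): shifting by a fixed offset k is a fixed 2x2 rotation, independent of pos.

```python
import numpy as np

w = 1.0 / 10000 ** (4 / 64)                          # frequency of one dimension pair
pos, k = 7, 3
pe_pos = np.array([np.sin(pos * w), np.cos(pos * w)])

# Rotation by angle k*w, via the angle-sum identities:
#   sin((pos+k)w) =  cos(kw)*sin(pos*w) + sin(kw)*cos(pos*w)
#   cos((pos+k)w) = -sin(kw)*sin(pos*w) + cos(kw)*cos(pos*w)
R = np.array([[ np.cos(k * w), np.sin(k * w)],
              [-np.sin(k * w), np.cos(k * w)]])
pe_shifted = R @ pe_pos

target = np.array([np.sin((pos + k) * w), np.cos((pos + k) * w)])
print(np.allclose(pe_shifted, target))               # True
```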
Project Q/K/V into h separate low-dimensional subspaces, perform attention independently in each, then concatenate: MultiHead(Q,V,K) = Concat(head₁,...,headₕ)·W_O, where headᵢ = Attention(Q·Wᵢ^Q, K·Wᵢ^K, V·Wᵢ^V)
Each sublayer output is LayerNorm(x + Sublayer(x)). Residual connections prevent gradient vanishing in deep networks; layer normalization stabilizes training
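A minimal sketch of this wrapper in NumPy (learned gain/bias of layer normalization omitted; the stand-in sublayer is an arbitrary linear map chosen for illustration):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each position's vector to zero mean, unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def sublayer(x, fn):
    """Post-norm residual wrapper from the paper: LayerNorm(x + Sublayer(x))."""
    return layer_norm(x + fn(x))

x = np.random.randn(5, 16)
y = sublayer(x, lambda h: h @ (np.random.randn(16, 16) * 0.1))  # stand-in sublayer
print(y.shape)                                       # (5, 16)
```

The residual path `x + fn(x)` gives gradients a direct route around every sublayer, which is what keeps very deep stacks trainable.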
When generating the t-th word, the decoder can only see words at positions 1 to t-1 (future positions are masked), ensuring causality in autoregressive generation
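A sketch of the causal mask in NumPy (illustrative sizes): score entries above the diagonal are set to a large negative value so that softmax assigns them zero weight.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

seq_len = 4
scores = np.random.randn(seq_len, seq_len)           # raw attention scores
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)  # future positions
scores[mask] = -1e9                                  # effectively -inf before softmax
weights = softmax(scores)
print(np.allclose(np.triu(weights, k=1), 0.0))       # True: no attention to the future
```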
The decoder's Q comes from itself, while K/V come from the encoder output — this is the channel through which the decoder "consults" source sequence information
Comparison with RNN
Maximum path length between any two words: RNN O(n) vs. Transformer O(1), since any two words are directly connected
Parallelism: an RNN's sequential dependencies prevent parallel computation; the Transformer is fully parallel
Per-layer complexity: RNN O(n·d²) vs. Transformer O(n²·d), so the Transformer is faster when the sequence length n is smaller than the model dimension d, and slower for very long sequences
GPT/BERT/T5 are all based on Transformer, which has become the default NLP architecture
Summary
Self-attention: discards recurrence, fully parallel
Positional encoding: sine waves inject order information
Multi-head attention: captures dependencies from multiple angles
Residual connections and layer normalization: stabilize deep network training