Transformer
Stacked attention layers replace the RNN entirely: all positions "see" each other simultaneously, enabling fully parallel training
✦ See It in Action: Different Views of Multi-Head Attention
Switch between different attention heads and observe the patterns each head focuses on within the same sentence — some focus on adjacent words, others capture long-range dependencies.
Hover over a token to see attention weights; line thickness indicates weight magnitude
01 Plain English Explanation of Transformer
RNN's Problem: Can Only Step Through Sequentially
An RNN processes a sentence the way you read a book: you must go from the first word to the last, and no step can begin until the previous one finishes. This causes two problems:
GPUs have thousands of cores, but an RNN computes one timestep at a time, leaving most of that parallelism idle
Information from the beginning of a sentence must pass through dozens of steps to influence the end, getting diluted along the way
Transformer's Solution: All Positions See Each Other at Once
Transformer uses Self-Attention to let every word simultaneously attend to all other words in the sentence. There's no sequential dependency — the entire sentence is processed in one parallel pass.
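As a minimal sketch of this idea (plain NumPy, a single head, no learned Q/K/V projections; all names are illustrative), every position attends to every other position in one matrix multiply, with no sequential loop:

```python
import numpy as np

def self_attention(X):
    """Scaled dot-product self-attention over X: [seq_len, d].
    Single head, no learned projections -- for illustration only."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                    # every position scores every other
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)      # softmax over positions
    return weights @ X, weights                      # weighted mix of all positions

X = np.random.randn(5, 8)                            # a 5-token "sentence"
out, weights = self_attention(X)
print(out.shape)                                     # (5, 8): all positions at once
```

Note that the whole sentence is processed in one pass; there is no loop over timesteps, which is exactly what makes the computation parallelizable.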
The trade-off: you must explicitly tell the model the order of words — that's the role of positional encoding.
Positional Encoding: Assigning "Seat Numbers" to Words
Transformer uses sine/cosine functions to generate unique encodings for each position, which are added to the word vectors:
Like the second hand, minute hand, and hour hand of a clock — combined together they can precisely represent any moment (any position)
For any fixed offset k, PE(pos + k) is a fixed linear transformation of PE(pos), letting the model perceive relative distance, not just absolute position
Multi-Head Attention: Understanding from Multiple Angles
A single Attention can only focus on one pattern at a time. Multi-Head Attention runs multiple Attention heads in parallel, where each "head" learns to focus on different types of relationships:
Focuses on adjacent words (local syntax)
Focuses on similar words (semantic similarity)
Focuses on syntactic dependencies (subject-verb-object)
Focuses on long-range references (coreference resolution)
Finally, the outputs of all heads are concatenated and passed through a linear projection to produce a richer representation.
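A sketch of the multi-head computation in NumPy (random weights stand in for learned projections; the sizes and variable names are assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 16, 4
d_head = d_model // n_heads

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical random projection weights; in a real model these are learned.
W_q, W_k, W_v = (rng.standard_normal((n_heads, d_model, d_head)) for _ in range(3))
W_o = rng.standard_normal((d_model, d_model))

X = rng.standard_normal((seq_len, d_model))
heads = []
for h in range(n_heads):                             # each head attends in its own subspace
    Q, K, V = X @ W_q[h], X @ W_k[h], X @ W_v[h]
    A = softmax(Q @ K.T / np.sqrt(d_head))
    heads.append(A @ V)
out = np.concatenate(heads, axis=-1) @ W_o           # concat heads, project back to d_model
print(out.shape)                                     # (5, 16)
```

Each head sees only a d_model/n_heads slice of the representation, so the total cost is comparable to a single full-width attention, yet each head is free to specialize in a different relationship pattern.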
Why Is Transformer So Powerful?
Fully parallel training maximizes GPU utilization; this is what makes it practical to train models with hundreds of billions of parameters, such as GPT-3
The path length between any two words is 1, greatly alleviating the vanishing-gradient problem over long ranges
GPT (decoder-only), BERT (encoder-only), T5 (encoder+decoder) are all Transformer variants
Build Positional Encoding Step by Step
Transformer has no recurrent structure — it relies on positional encoding to inject sequence order information.
Even dimensions use sin, odd dimensions use cos, with frequency decreasing as dimension increases.
Generate a [MAX_LEN × D_MODEL] positional encoding matrix, added to word embeddings.
Use a heatmap to observe the encoding values at each position and dimension — dimensions on the right change more slowly.
02 Code
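The steps above can be sketched in NumPy as follows (MAX_LEN and D_MODEL are assumed example sizes):

```python
import numpy as np

MAX_LEN, D_MODEL = 50, 64                            # assumed sizes for illustration

# Even dimensions use sin, odd dimensions use cos; the frequency falls as the
# dimension index grows, like ever-slower clock hands.
pos = np.arange(MAX_LEN)[:, None]                    # [MAX_LEN, 1]
i = np.arange(0, D_MODEL, 2)[None, :]                # even dimension indices
angle = pos / (10000 ** (i / D_MODEL))               # [MAX_LEN, D_MODEL/2]

pe = np.zeros((MAX_LEN, D_MODEL))
pe[:, 0::2] = np.sin(angle)
pe[:, 1::2] = np.cos(angle)

# In a model this matrix is simply added to the word embeddings:
#   embedded = token_embeddings + pe[:seq_len]
print(pe.shape)                                      # (50, 64)
```

Plotting `pe` as a heatmap reproduces the visualization described above: columns on the left oscillate quickly, while columns on the right change slowly across positions.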
03 Academic Explanation
Transformer (Vaswani et al., 2017 "Attention Is All You Need") is built entirely on the self-attention mechanism, completely discarding recurrence and convolution. It achieves efficient sequence modeling through multi-head attention, residual connections, and layer normalization.
Overall Architecture: Encoder-Decoder
Core Components in Detail
PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d)). Superimposing sine waves at different frequencies lets the model express relative position offsets as linear transformations
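A quick numeric check of this linearity property for one sin/cos dimension pair (the frequency and positions here are arbitrary illustrative values): shifting by a fixed offset k is a fixed 2x2 rotation, independent of pos.

```python
import numpy as np

w = 1.0 / 10000 ** (4 / 64)                          # frequency of one dimension pair
pos, k = 7, 3
pe_pos = np.array([np.sin(pos * w), np.cos(pos * w)])

# Rotation by angle k*w, via the angle-sum identities:
#   sin((pos+k)w) =  cos(kw)*sin(pos*w) + sin(kw)*cos(pos*w)
#   cos((pos+k)w) = -sin(kw)*sin(pos*w) + cos(kw)*cos(pos*w)
R = np.array([[ np.cos(k * w), np.sin(k * w)],
              [-np.sin(k * w), np.cos(k * w)]])
pe_shifted = R @ pe_pos

target = np.array([np.sin((pos + k) * w), np.cos((pos + k) * w)])
print(np.allclose(pe_shifted, target))               # True
```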
Project Q/K/V into h separate low-dimensional subspaces, perform attention independently in each, then concatenate: MultiHead(Q,V,K) = Concat(head₁,...,headₕ)·W_O, where headᵢ = Attention(Q·Wᵢ^Q, K·Wᵢ^K, V·Wᵢ^V)
Each sublayer output is LayerNorm(x + Sublayer(x)). Residual connections prevent gradient vanishing in deep networks; layer normalization stabilizes training
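A minimal sketch of this wrapper in NumPy (learned gain/bias of layer normalization omitted; the stand-in sublayer is an arbitrary linear map chosen for illustration):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each position's vector to zero mean, unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def sublayer(x, fn):
    """Post-norm residual wrapper from the paper: LayerNorm(x + Sublayer(x))."""
    return layer_norm(x + fn(x))

x = np.random.randn(5, 16)
y = sublayer(x, lambda h: h @ (np.random.randn(16, 16) * 0.1))  # stand-in sublayer
print(y.shape)                                       # (5, 16)
```

The residual path `x + fn(x)` gives gradients a direct route around every sublayer, which is what keeps very deep stacks trainable.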
When generating the t-th word, the decoder can only see words at positions 1 to t-1 (future positions are masked), ensuring causality in autoregressive generation
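A sketch of the causal mask in NumPy (illustrative sizes): score entries above the diagonal are set to a large negative value so that softmax assigns them zero weight.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

seq_len = 4
scores = np.random.randn(seq_len, seq_len)           # raw attention scores
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)  # future positions
scores[mask] = -1e9                                  # effectively -inf before softmax
weights = softmax(scores)
print(np.allclose(np.triu(weights, k=1), 0.0))       # True: no attention to the future
```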
The decoder's Q comes from itself, while K/V come from the encoder output — this is the channel through which the decoder "consults" source sequence information
Comparison with RNN
Maximum path length between any two words: RNN O(n) vs. Transformer O(1), since any two words are directly connected
Parallelism: an RNN's sequential dependencies prevent parallel computation; the Transformer is fully parallel
Per-layer complexity: RNN O(n·d²) vs. Transformer O(n²·d), so the Transformer is faster when the sequence length n is smaller than the model dimension d, and slower for very long sequences
GPT/BERT/T5 are all based on Transformer, which has become the default NLP architecture
Summary
Self-attention: discards recurrence, fully parallel
Positional encoding: sine waves inject order information
Multi-head attention: captures dependencies from multiple angles
Residual connections and layer normalization: stabilize deep network training