GNN: Graph Neural Network
Teaching neural networks to understand "relationships" — not just individual data points, but who they're connected to and what their neighbors look like
✦ See It in Action: Social Network Node Classification
A 7-node social graph below. The GNN aggregates neighbor information to classify each node into Community A (blue) or Community B (orange). Click any node to see its prediction, or click "Retrain" to rerun the model.
01 Core Idea (Plain English)
To judge someone's personality, you don't just look at them — you look at their social circle. If all their friends are athletes, they're probably sporty too. If all their friends are programmers, they probably stay up late.
That's exactly what a GNN does: each node doesn't just look at its own features — it aggregates its neighbors' features too, repeating for several rounds until every node "knows" its position in the whole graph.
Why Can't Regular Neural Networks Handle Graphs?
Graphs have two properties that frustrate ordinary neural networks:
Variable Number of Nodes
A social network has 100 people today, 101 tomorrow. A fully-connected layer's input dimension would change. CNNs require fixed-size images, MLPs require fixed-length vectors — graphs can't provide that.
No Order, Only Connections
The "distance" between nodes A and B is determined by edges, not index order. Shuffle the nodes and the graph structure is unchanged — but a regular neural network would see it as completely different input.
GNN's key insight: Use graph structure (who connects to whom) as an inductive bias. Message passing lets each node aggregate neighbor information — naturally handling graphs of any size and structure.
Message Passing: The Core of GNNs
- Message: each node sends its feature vector along edges to all of its neighbors. Edges can carry weights representing relationship strength.
- Aggregate: each node collects the messages from all of its neighbors and summarizes them — common choices are sum, mean, or max. These are order-invariant, which guarantees permutation invariance over nodes.
- Update: concatenate the aggregated result with the node's own features, pass it through a small neural network (usually a linear layer + ReLU), and the output is the new node representation. That's one GNN layer.
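As a rough sketch of these three steps, here is one mean-aggregation layer in plain NumPy. The function name `gnn_layer` and the toy graph are illustrative, not part of any library:

```python
import numpy as np

def gnn_layer(A, H, W):
    """One message-passing layer with mean aggregation.

    A: (n, n) 0/1 adjacency matrix, H: (n, d) node features,
    W: (2*d, d_out) weights applied to [self || aggregated].
    """
    # Aggregate: mean of neighbor features (isolated nodes stay zero)
    deg = A.sum(axis=1, keepdims=True)          # (n, 1) node degrees
    agg = A @ H / np.maximum(deg, 1)            # (n, d) mean over neighbors
    # Update: concatenate self features with the aggregate, then linear + ReLU
    Z = np.concatenate([H, agg], axis=1) @ W    # (n, d_out)
    return np.maximum(Z, 0)                     # ReLU

# Tiny 3-node path graph: 0 — 1 — 2, one-hot node features
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
H = np.eye(3)
W = np.ones((6, 2)) * 0.5
H1 = gnn_layer(A, H, W)
print(H1.shape)  # (3, 2)
```

After one layer, each node's new representation already mixes in its neighbors' features.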
Stacking Layers = Expanding Receptive Field
Exactly the same logic as stacking convolutional layers in a CNN:
- Layer 1: each node perceives its direct neighbors (1-hop)
- Layer 2: each node perceives neighbors of neighbors (2-hop)
- Layer k: each node perceives the entire local subgraph within k hops
Receptive field analogy: A CNN's receptive field is a region of the image; a GNN's receptive field is the k-hop subgraph centered on the current node. More layers = more global structural information.
What Can GNNs Do?
Node Classification
Classify each user in a social network into a community; classify each paper in a citation network into a research area.
Link Prediction
Predict whether two nodes will form a connection — used in recommendation systems ("People You May Know") and protein interaction prediction.
Graph Classification
Classify an entire graph (e.g. a molecule) — is this molecule toxic? Does this code snippet have a vulnerability?
Graph Generation
Generate new graph structures — designing molecules with desired properties, generating road network layouts.
Building a GNN Step by Step
Use an adjacency matrix and node feature matrix to represent the graph. Node features can be degree, attribute labels, etc.
Implement one-layer Mean Aggregation: each node's new feature = mean(self + all neighbors' old features).
After message passing, apply a linear transform W + ReLU. Parameters are learned via backpropagation.
Stack two GNN layers, add a Softmax output for node class probabilities, train with cross-entropy loss.
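The four steps above can be sketched as a forward pass in NumPy — a minimal illustration under the self-inclusive mean-aggregation described above, not the page's actual implementation; the backpropagation training loop is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_agg(A, H):
    """Mean over self + neighbors, as in the step list above."""
    A_hat = A + np.eye(A.shape[0])              # include self
    deg = A_hat.sum(axis=1, keepdims=True)
    return A_hat @ H / deg

def forward(A, X, W1, W2):
    H1 = np.maximum(mean_agg(A, X) @ W1, 0)     # layer 1: aggregate + linear + ReLU
    logits = mean_agg(A, H1) @ W2               # layer 2
    Z = np.exp(logits - logits.max(axis=1, keepdims=True))
    return Z / Z.sum(axis=1, keepdims=True)     # per-node softmax

# 4-node graph, 3-dim features, 2 classes
A = np.array([[0,1,1,0],[1,0,0,0],[1,0,0,1],[0,0,1,0]], float)
X = rng.normal(size=(4, 3))
W1 = rng.normal(size=(3, 8)) * 0.1
W2 = rng.normal(size=(8, 2)) * 0.1
probs = forward(A, X, W1, W2)
y = np.array([0, 0, 1, 1])                      # node labels
loss = -np.log(probs[np.arange(4), y]).mean()   # cross-entropy loss
print(probs.shape, round(loss, 3))
```

Training would repeatedly compute this loss and update W1, W2 by gradient descent.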
02 Code
03 Deep Dive
Mathematical Definition of Message Passing
Let $G = (V, E)$, and let $h_v^{(k)}$ denote the representation of node $v$ at layer $k$. The Message Passing Neural Network (MPNN) framework is:

$$m_v^{(k)} = \mathrm{AGG}_{u \in \mathcal{N}(v)}\,\mathrm{MSG}\big(h_v^{(k-1)}, h_u^{(k-1)}, e_{uv}\big), \qquad h_v^{(k)} = \mathrm{UPDATE}\big(h_v^{(k-1)}, m_v^{(k)}\big)$$

$\mathcal{N}(v)$ is the neighbor set of node $v$, and $e_{uv}$ is an optional edge feature. Different choices of MSG, AGG, and UPDATE correspond to different GNN variants.
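A minimal sketch of this framework with MSG, AGG, and UPDATE as pluggable functions (all names here are illustrative, not from any library):

```python
import numpy as np

def mpnn_layer(adj_list, H, msg, agg, update):
    """One generic MPNN layer over an adjacency list.

    adj_list: {v: [neighbors]}, H: dict v -> feature vector.
    msg(h_u, h_v) builds a message, agg(list) pools messages,
    update(h_v, m_v) produces the new representation.
    """
    H_new = {}
    for v, nbrs in adj_list.items():
        msgs = [msg(H[u], H[v]) for u in nbrs]  # MSG along each edge
        m_v = agg(msgs)                         # AGG over the neighborhood
        H_new[v] = update(H[v], m_v)            # UPDATE the node
    return H_new

# Instantiating MSG = identity, AGG = sum, UPDATE = self + messages gives a
# simple "sum your neighbors into yourself" layer on a 3-node path graph:
adj = {0: [1], 1: [0, 2], 2: [1]}
H = {v: np.eye(3)[v] for v in adj}
H1 = mpnn_layer(adj, H,
                msg=lambda h_u, h_v: h_u,
                agg=lambda ms: np.sum(ms, axis=0),
                update=lambda h_v, m: h_v + m)
print(H1[1])  # node 1 has absorbed the features of nodes 0 and 2
```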
Graph Convolutional Network — GCN (Kipf & Welling, 2017)
GCN is the most classic GNN variant. Its propagation rule is:
à = A + I (self-loops added), D̃ is the corresponding degree matrix, W(k) is the learnable weight matrix at layer k, σ is an activation function (e.g. ReLU). Symmetric normalization D̃-1/2 à D̃-1/2 prevents gradient explosion at high-degree nodes.
Message Passing Animation
Watch messages propagate through the graph round by round, and see the receptive field expand with each layer:
Click "Play" to watch messages propagate layer by layer
Graph Attention Network — GAT (Veličković et al., 2018)
GCN's neighbor weights are fixed (determined by degree). GAT introduces attention so the model learns which neighbors to focus on more:
$$\alpha_{uv} = \mathrm{softmax}_{u \in \mathcal{N}(v)}\Big(\mathrm{LeakyReLU}\big(a^{\top}\,[\,W h_v \,\Vert\, W h_u\,]\big)\Big)$$

$\Vert$ denotes vector concatenation, and $a$ is a learnable attention vector. Multi-head attention further improves stability — the same idea as Transformer attention.
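A sketch of one attention head in NumPy, computing only the coefficients α (a naive O(n²) loop for clarity; all names are illustrative):

```python
import numpy as np

def gat_scores(A, H, W, a):
    """Attention coefficients alpha[v, u] for one GAT head.

    W: (d, d') projection, a: (2*d',) attention vector. Scores are
    LeakyReLU(a^T [W h_v || W h_u]), softmax-normalized over each
    node's neighborhood (self-loop included).
    """
    n = A.shape[0]
    Wh = H @ W                                  # (n, d') projected features
    A_hat = A + np.eye(n)                       # attend over self + neighbors
    e = np.full((n, n), -np.inf)                # -inf masks non-edges in softmax
    for v in range(n):
        for u in range(n):
            if A_hat[v, u]:
                s = a @ np.concatenate([Wh[v], Wh[u]])
                e[v, u] = np.where(s > 0, s, 0.2 * s)   # LeakyReLU, slope 0.2
    e = e - e.max(axis=1, keepdims=True)        # numerical stability
    alpha = np.exp(e)                           # exp(-inf) = 0 for non-edges
    return alpha / alpha.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], float)  # star graph
H = rng.normal(size=(3, 4))
alpha = gat_scores(A, H, rng.normal(size=(4, 4)), rng.normal(size=8))
print(alpha.round(2))  # each row sums to 1 over that node's neighborhood
```

The node update then takes the α-weighted sum of projected neighbor features instead of a degree-based mean.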
Over-smoothing Problem
Stacking too many GNN layers causes all node representations to converge to the same value — because each aggregation step "averages" information. In practice, 2–3 GNN layers are usually sufficient; beyond 5 layers performance often degrades.
Solutions: residual connections (add each layer's input back into its output), DropEdge (randomly drop edges during training), Jumping Knowledge Networks (JK-Net, which aggregates representations from all layers).
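Over-smoothing can be observed directly: repeatedly mean-aggregating on a connected graph drives the spread of node representations toward zero (a toy NumPy demo of the problem, not of the fixes):

```python
import numpy as np

def smooth_step(A, H):
    """One round of mean aggregation over self + neighbors (no weights)."""
    A_hat = A + np.eye(A.shape[0])
    return A_hat @ H / A_hat.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
# Connected 5-node ring graph
A = np.zeros((5, 5))
for i in range(5):
    A[i, (i + 1) % 5] = A[(i + 1) % 5, i] = 1

H = rng.normal(size=(5, 3))                     # random node features
for k in [1, 2, 4, 8, 16]:
    Hk = H
    for _ in range(k):
        Hk = smooth_step(A, Hk)
    # Spread across nodes shrinks toward 0 as "layers" pile up
    print(k, Hk.std(axis=0).mean().round(4))
```

Real GNN layers interleave learned transforms with aggregation, but the same averaging pressure is why depth beyond a few layers often hurts.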
Graph Readout / Pooling
Node classification uses node representations directly. Graph classification needs to pool all node representations into one graph-level vector:

$$h_G = \mathrm{READOUT}\big(\{\, h_v^{(K)} : v \in V \,\}\big)$$
Common READOUT functions: global mean pooling, global max pooling, learnable Set2Set, DiffPool (differentiable hierarchical pooling).
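Global mean and max pooling are one-liners (a toy sketch; the learnable options like Set2Set and DiffPool need trained parameters and are omitted):

```python
import numpy as np

def readout(H, how="mean"):
    """Pool node representations (n, d) into one graph-level vector (d,)."""
    if how == "mean":
        return H.mean(axis=0)   # global mean pooling
    if how == "max":
        return H.max(axis=0)    # global max pooling
    raise ValueError(how)

H = np.array([[1., 0.], [3., 2.], [2., 4.]])   # 3 nodes, 2-dim features
print(readout(H, "mean"))   # [2. 2.]
print(readout(H, "max"))    # [3. 4.]
```

Note both are permutation-invariant over nodes, so the graph vector does not depend on node ordering.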
GNN vs CNN vs Transformer
GNN vs CNN
CNN is a special case of GNN on a regular grid (pixels): neighbors are the fixed up/down/left/right pixels, weights are shared via the kernel. GNN generalizes this to arbitrary graph structures.
GNN vs Transformer
Transformer can be viewed as "GAT on a complete graph": every token has an edge to every other token, and attention weights are the edge weights. GNN does the same on sparse graphs at lower computational cost.
Classic GNN Architecture Timeline
GCN (2017)
Simplified spectral graph convolution with mean aggregation and symmetric normalization. Established the GNN paradigm. Best for semi-supervised node classification.
GraphSAGE (2017)
Samples a fixed number of neighbors, supports inductive learning (generalizes to unseen nodes). Used by Pinterest at scale for recommendation.
GAT (2018)
Attention weights are learned automatically, so different neighbors contribute differently. Outperforms GCN on several node classification benchmarks.
GIN (2019)
Graph Isomorphism Network — provably as expressive as the 1-dimensional Weisfeiler-Leman (1-WL) graph isomorphism test, the theoretical upper bound for message-passing GNNs. One of the most expressive GNNs for graph classification.