See It in Action: Social Network Node Classification

Below is a 7-node social graph. The GNN aggregates neighbor information to classify each node into Community A (blue) or Community B (orange). Click any node to see its prediction, or click "Retrain" to rerun the model.


01 Core Idea (Plain English)

To judge someone's personality, you don't just look at them — you look at their social circle. If all their friends are athletes, they're probably sporty too. If all their friends are programmers, they probably stay up late.

That's exactly what a GNN does: each node doesn't just look at its own features — it aggregates its neighbors' features too, repeating for several rounds until every node "knows" its position in the whole graph.

Why Can't Regular Neural Networks Handle Graphs?

Graphs have two properties that frustrate ordinary neural networks:

Variable Number of Nodes

A social network has 100 people today, 101 tomorrow. A fully-connected layer's input dimension would change. CNNs require fixed-size images, MLPs require fixed-length vectors — graphs can't provide that.

No Order, Only Connections

The "distance" between nodes A and B is determined by edges, not index order. Shuffle the nodes and the graph structure is unchanged — but a regular neural network would see it as completely different input.

GNN's key insight: Use graph structure (who connects to whom) as an inductive bias. Message passing lets each node aggregate neighbor information — naturally handling graphs of any size and structure.

Message Passing: The Core of GNNs

1
Send Messages

Each node sends its feature vector along edges to all neighbors. Edges can carry weights representing relationship strength.

2
Aggregate

Each node collects messages from all neighbors and summarizes them; common approaches are sum, mean, or max. These operations are order-invariant, which guarantees permutation invariance over nodes.

3
Update

Concatenate the aggregated result with the node's own features, pass through a small neural network (usually linear layer + ReLU), and get the new node representation. That's one GNN layer.
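The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the original demo's code: `adj` is a binary adjacency matrix, node features are the rows of `X`, and the weights are toy values.

```python
import numpy as np

def message_passing_layer(adj, X, W):
    """One round of message passing: send -> aggregate (mean) -> update."""
    deg = adj.sum(axis=1, keepdims=True)         # number of neighbors per node
    agg = (adj @ X) / np.maximum(deg, 1)         # mean of neighbor messages
    combined = np.concatenate([X, agg], axis=1)  # own features || aggregate
    return np.maximum(combined @ W, 0)           # small network: linear + ReLU

# Tiny path graph 0-1-2 with 2-dim node features.
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
W = np.full((4, 2), 0.5)                         # shape (2*d_in, d_out)
H = message_passing_layer(adj, X, W)
print(H.shape)   # (3, 2): one new representation per node
```

One call to this function is exactly one GNN layer; stacking calls (with fresh weights) gives the multi-layer behavior described next.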

Stacking Layers = Expanding Receptive Field

Exactly the same logic as stacking convolutional layers in a CNN:

  • Layer 1: each node perceives its direct neighbors (1-hop)
  • Layer 2: each node perceives neighbors of neighbors (2-hop)
  • Layer k: each node perceives the entire local subgraph within k hops

Receptive field analogy: A CNN's receptive field is a region of the image; a GNN's receptive field is the k-hop subgraph centered on the current node. More layers = more global structural information.
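One way to watch the receptive field grow is through powers of the adjacency matrix with self-loops: entry (v, u) of (A + I)^k is nonzero exactly when u is within k hops of v. A small check (the path graph here is an illustrative choice):

```python
import numpy as np

# Path graph 0-1-2-3-4: node 0 reaches one extra node per layer.
A = np.zeros((5, 5))
for i in range(4):
    A[i, i + 1] = A[i + 1, i] = 1
A_hat = A + np.eye(5)          # self-loops: a node always sees itself

reach = np.eye(5)
for k in range(1, 4):
    reach = reach @ A_hat      # one more round of message passing
    hops = np.flatnonzero(reach[0]).tolist()   # nodes visible from node 0
    print(f"layer {k}: node 0 sees {hops}")
# layer 1: [0, 1]   layer 2: [0, 1, 2]   layer 3: [0, 1, 2, 3]
```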

What Can GNNs Do?

Node Classification

Classify each user in a social network into a community; classify each paper in a citation network into a research area.

Link Prediction

Predict whether two nodes will form a connection — used in recommendation systems ("People You May Know") and protein interaction prediction.

Graph Classification

Classify an entire graph (e.g. a molecule) — is this molecule toxic? Does this code snippet have a vulnerability?

Graph Generation

Generate new graph structures — designing molecules with desired properties, generating road network layouts.

Building a GNN Step by Step

Step 1 Represent the Graph

Use an adjacency matrix and node feature matrix to represent the graph. Node features can be degree, attribute labels, etc.

Step 2 Write Message Passing by Hand

Implement one-layer Mean Aggregation: each node's new feature = mean(self + all neighbors' old features).

Step 3 Add Learnable Weights

After message passing, apply a linear transform W + ReLU. Parameters are learned via backpropagation.

Step 4 Train Node Classification

Stack two GNN layers, add a Softmax output for node class probabilities, train with cross-entropy loss.

02 Code
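A compact NumPy sketch that carries out the four build steps above end to end. The toy graph, labels, layer sizes, and learning rate are all illustrative choices, not values from the original:

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: represent the graph. Two triangles (0-2 and 3-5) joined by one edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
n = 6
A = np.zeros((n, n))
for u, v in edges:
    A[u, v] = A[v, u] = 1
A_hat = A + np.eye(n)                               # self-loops keep own features
A_norm = A_hat / A_hat.sum(axis=1, keepdims=True)   # row-normalize = mean aggregation
X = np.eye(n)                                       # one-hot node features
y = np.array([0, 0, 0, 1, 1, 1])                    # community labels

# Steps 2-3: mean aggregation followed by learnable weights + ReLU.
W1 = rng.normal(scale=0.5, size=(n, 8))
W2 = rng.normal(scale=0.5, size=(8, 2))

def forward(W1, W2):
    H1 = np.maximum(A_norm @ X @ W1, 0)     # layer 1: 1-hop information
    logits = A_norm @ H1 @ W2               # layer 2: 2-hop information
    P = np.exp(logits - logits.max(axis=1, keepdims=True))
    return H1, P / P.sum(axis=1, keepdims=True)

# Step 4: two stacked layers + softmax, trained with cross-entropy.
for _ in range(300):
    H1, P = forward(W1, W2)
    G = P.copy()
    G[np.arange(n), y] -= 1                 # gradient of loss w.r.t. logits
    gW2 = H1.T @ (A_norm.T @ G)
    dH1 = (A_norm.T @ G) @ W2.T * (H1 > 0)  # backprop through ReLU
    gW1 = (A_norm @ X).T @ dH1
    W1 -= 0.2 * gW1
    W2 -= 0.2 * gW2

_, P = forward(W1, W2)
acc = (P.argmax(axis=1) == y).mean()
print("train accuracy:", acc)
```

Even this hand-rolled version separates the two communities, because the mean aggregation pulls each triangle's nodes toward a shared representation.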

03 Deep Dive

Mathematical Definition of Message Passing

Let G = (V, E) and write h_v^(k) for the representation of node v at layer k. The Message Passing Neural Network (MPNN) framework is:

m_v^(k) = AGG({ MSG(h_u^(k-1), h_v^(k-1), e_uv) | u ∈ N(v) })
h_v^(k) = UPDATE(h_v^(k-1), m_v^(k))

N(v) is the neighbor set of node v and e_uv is an optional edge feature. Different choices of MSG, AGG, and UPDATE correspond to different GNN variants.

Graph Convolutional Network — GCN (Kipf & Welling, 2017)

GCN is the most classic GNN variant. Its propagation rule is:

H^(k) = σ( D̃^(-1/2) Ã D̃^(-1/2) H^(k-1) W^(k) )

Ã = A + I (self-loops added), D̃ is the degree matrix of Ã, W^(k) is the learnable weight matrix at layer k, and σ is an activation function (e.g. ReLU). The symmetric normalization D̃^(-1/2) Ã D̃^(-1/2) keeps feature magnitudes stable, so high-degree nodes do not dominate the aggregation.

Message Passing Animation

Watch messages propagate through the graph round by round, and see the receptive field expand with each layer:


Graph Attention Network — GAT (Veličković et al., 2018)

GCN's neighbor weights are fixed (determined by degree). GAT introduces attention so the model learns which neighbors to focus on more:

α_uv = softmax_{u ∈ N(v)}( LeakyReLU( a^T [W h_v ∥ W h_u] ) )
h'_v = σ( Σ_{u ∈ N(v)} α_uv W h_u )

∥ denotes vector concatenation, a is a learnable attention vector. Multi-head attention further improves stability — the same idea as Transformer attention.
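A single attention head can be sketched in NumPy. Non-edges are masked with -inf before the softmax; splitting the attention vector a in two halves computes a^T [W h_v ∥ W h_u] without an explicit concatenation. All names and sizes here are illustrative:

```python
import numpy as np

def gat_layer(A, H, W, a):
    """One GAT head: learned attention over neighbors, then weighted sum."""
    n = A.shape[0]
    Wh = H @ W                          # transformed features, shape (n, d_out)
    src = Wh @ a[: W.shape[1]]          # a1^T (W h_v) term
    dst = Wh @ a[W.shape[1] :]          # a2^T (W h_u) term
    e = src[:, None] + dst[None, :]     # e[v, u] = a^T [W h_v || W h_u]
    e = np.where(e > 0, e, 0.2 * e)     # LeakyReLU (slope 0.2, as in the paper)
    mask = (A + np.eye(n)) > 0          # attend to neighbors and self
    e = np.where(mask, e, -np.inf)      # non-edges get zero attention
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha = alpha / alpha.sum(axis=1, keepdims=True)  # softmax per node
    return alpha @ Wh                   # attention-weighted neighbor sum

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
H = rng.normal(size=(3, 4))
W = rng.normal(size=(4, 2))
a = rng.normal(size=(4,))               # length 2 * d_out
out = gat_layer(A, H, W, a)
print(out.shape)   # (3, 2)
```

Running several such heads with independent W and a, then concatenating or averaging their outputs, gives the multi-head variant.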

Over-smoothing Problem

Stacking too many GNN layers causes all node representations to converge to the same value — because each aggregation step "averages" information. In practice, 2-3 GNN layers is usually sufficient; beyond 5 layers performance often degrades.

Solutions: residual connections (add earlier-layer representations back at each layer), DropEdge (randomly drop edges during training), and Jumping Knowledge Networks (JK-Net: combine the representations from all layers).
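Over-smoothing is easy to demonstrate numerically: repeatedly applying mean aggregation (with no learnable weights, a deliberate simplification) drives all node representations toward the same vector. The cycle graph below is an illustrative choice:

```python
import numpy as np

# Cycle graph on 6 nodes; pure mean aggregation, no weights or nonlinearity.
n = 6
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[i, (i - 1) % n] = 1
A_hat = A + np.eye(n)
A_norm = A_hat / A_hat.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
H = rng.normal(size=(n, 3))            # random initial node features
spreads = []
for k in (1, 5, 20):
    Hk = np.linalg.matrix_power(A_norm, k) @ H
    spread = float(Hk.std(axis=0).mean())   # how different nodes still are
    spreads.append(spread)
    print(f"{k:2d} layers: node spread = {spread:.4f}")
# spread shrinks toward 0 as the number of layers grows
```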

Graph Readout / Pooling

Node classification uses node representations directly. Graph classification needs to pool all node representations into one graph-level vector:

h_G = READOUT({ h_v^(K) | v ∈ V })

Common READOUT functions: global mean pooling, global max pooling, learnable Set2Set, DiffPool (differentiable hierarchical pooling).
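The two simplest readouts are one line each. Here `H` stands for the final-layer node representations of a 3-node graph (made-up values for illustration):

```python
import numpy as np

H = np.array([[1.0, 0.0],
              [3.0, 2.0],
              [2.0, 4.0]])             # final node representations, one row per node

h_graph_mean = H.mean(axis=0)          # global mean pooling -> [2., 2.]
h_graph_max = H.max(axis=0)            # global max pooling  -> [3., 4.]
print(h_graph_mean, h_graph_max)
```

Either vector can then be fed to an ordinary classifier for graph-level prediction.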

GNN vs CNN vs Transformer

GNN vs CNN

CNN is a special case of GNN on a regular grid (pixels): neighbors are the fixed up/down/left/right pixels, weights are shared via the kernel. GNN generalizes this to arbitrary graph structures.

GNN vs Transformer

Transformer can be viewed as "GAT on a complete graph": every token has an edge to every other token, and attention weights are the edge weights. GNN does the same on sparse graphs at lower computational cost.

Classic GNN Architecture Timeline

GCN (2017)

Simplified spectral graph convolution with mean aggregation and symmetric normalization. Established the GNN paradigm. Best for semi-supervised node classification.

GraphSAGE (2017)

Samples a fixed number of neighbors, supports inductive learning (generalizes to unseen nodes). Used by Pinterest at scale for recommendation.

GAT (2018)

Attention weights learned automatically so different neighbors contribute differently. Often outperforms GCN on standard node classification benchmarks.

GIN (2019)

Graph Isomorphism Network — provably as powerful as the Weisfeiler-Leman (1-WL) graph isomorphism test, the theoretical upper bound for message-passing GNNs. One of the most expressive GNNs for graph classification.