MLP (Multi-Layer Perceptron)
The foundational building block of deep learning — stack neurons layer by layer, and it can learn any complex pattern
01 Core Principles (Plain English)
You look at a photo and recognize a cat in it. How does your brain do this? The retina senses light → the primary visual cortex detects edges → higher-level areas recognize ears, whiskers → finally concludes "it's a cat."
MLP mimics exactly this process: one layer of neurons handles one level of abstraction, with signals flowing from low-level features to high-level judgments. Each layer does another round of "processing" on top of the previous layer's results. The more layers, the more complex the patterns it can learn.
How Signals Flow Through the Network
- Input: Feed in raw data (image pixels, sensor readings, text encodings, etc.), with each input feature corresponding to one neuron.
- Hidden layers: Each neuron takes all outputs from the previous layer, multiplies them by weights and sums them, then applies an activation function (like ReLU) to introduce non-linearity. This step repeats across multiple layers.
- Output: The final layer produces the prediction. Classification tasks use Softmax to output class probabilities; regression tasks output numerical values directly.
- Training: Compute the difference between the prediction and the ground truth (the loss), use the chain rule to work out each weight's contribution to the error from back to front, then fine-tune with gradient descent. Repeat this cycle, and the network gets more and more accurate.
The activation function is the soul of the network. Without activation functions, no matter how many layers you stack, the entire network is equivalent to a single linear model that can only draw straight lines. ReLU (zeros out negatives, keeps positives) breaks linearity, giving the network the ability to fit any curve.
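This collapse can be seen in a few lines. The 1-D weights below are toy values of our own choosing: two stacked linear layers reduce to a single straight line, while the same weights with a ReLU in between give a bent, piecewise function.

```python
def linear2(x):
    # two stacked linear layers with no activation in between
    h = 3.0 * x + 1.0
    return -2.0 * h + 4.0       # simplifies to -6x + 2: still one straight line

def relu(z):
    return z if z > 0 else 0.0

def mlp2(x):
    # same weights, but ReLU between the layers bends the function
    h = relu(3.0 * x + 1.0)
    return -2.0 * h + 4.0

# linear2 is the single line y = -6x + 2 everywhere:
assert linear2(0.0) == 2.0 and linear2(1.0) == -4.0
# mlp2 agrees with that line where the hidden unit is active (x > -1/3)...
assert mlp2(1.0) == -4.0
# ...but flattens to the constant 4 where ReLU zeroes the hidden unit (x < -1/3):
assert mlp2(-1.0) == 4.0
```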
Are More Layers Always Better?
Too Shallow (1-2 Layers)
Can only learn simple patterns. Not capable enough for complex tasks (image recognition, natural language), leading to underfitting.
Too Deep (Without Techniques)
Gradients become increasingly small during backpropagation (vanishing gradient), and layers near the input barely learn anything. Requires techniques like residual connections and batch normalization to address this.
Teaching Example: Gaussian Binary Classification
We use a simple but representative task throughout: two groups of points on a plane, each following a 2D Gaussian distribution. The MLP needs to find a nonlinear boundary to separate them.
- Input: 2D coordinates (x₁, x₂)
- Output: Probability of belonging to class 0 or class 1
- Network Architecture: 2 → 4 → 4 → 2, two ReLU hidden layers
- Training Method: Pure hand-written SGD, no frameworks
Linear models (like logistic regression) can only draw straight lines and cannot perfectly separate these two groups of points. With hidden layers and activation functions, MLP can bend the decision boundary to fit the data.
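To see why the bend matters, the classic XOR toy (a different dataset from the Gaussian demo, with hand-picked weights rather than trained ones) is the smallest example: no single straight line separates the four points, but two ReLU units plus a linear readout do.

```python
def relu(z):
    return z if z > 0 else 0.0

def tiny_mlp(x1, x2):
    # hidden layer: two ReLU units (weights hand-picked for illustration)
    h1 = relu(x1 + x2 - 0.5)
    h2 = relu(x1 + x2 - 1.5)
    # output layer: a linear readout of the bent features
    return 2.0 * (h1 - 3.0 * h2)

# XOR truth table: 0,1,1,0 — impossible for any single linear boundary
assert [tiny_mlp(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]] \
       == [0.0, 1.0, 1.0, 0.0]
```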
Building MLP Step by Step (Pure Hand-Written, No Libraries)
From data generation to backpropagation, every line is written by hand.
Use the Box-Muller method to generate two groups of Gaussian-distributed points. The scatter plot gives an intuitive feel for the data distribution.
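A minimal sketch of this step in pure Python (the blob centres and spread below are illustrative values, not necessarily the demo's):

```python
import math, random

def box_muller():
    # convert two uniform samples into one standard-normal sample
    u1 = 1.0 - random.random()          # keep u1 in (0, 1] so log() is safe
    u2 = random.random()
    return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)

def make_blob(n, cx, cy, std, label):
    # n points from an isotropic 2-D Gaussian centred at (cx, cy)
    return [(cx + std * box_muller(), cy + std * box_muller(), label)
            for _ in range(n)]

random.seed(0)
data = make_blob(100, -1.0, 0.0, 0.5, 0) + make_blob(100, 1.0, 0.0, 0.5, 1)
```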
Hand-write ReLU, Softmax, cross-entropy, and a Dense layer class with forward/backward propagation.
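A minimal pure-Python sketch of these pieces (the initialization scale, the 1e-12 guard in the log, and the exact `Dense` interface are our own choices, not the demo's actual code):

```python
import math, random

def relu(v):
    # zero out negatives, keep positives (element-wise on a list)
    return [x if x > 0 else 0.0 for x in v]

def relu_grad(v):
    # ReLU derivative: 1 where the input was positive, else 0
    return [1.0 if x > 0 else 0.0 for x in v]

def softmax(v):
    m = max(v)                          # subtract max for numerical stability
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def cross_entropy(p, label):
    # negative log-probability of the true class
    return -math.log(p[label] + 1e-12)

class Dense:
    """Fully connected layer acting on one sample (a plain Python list)."""
    def __init__(self, n_in, n_out):
        self.W = [[random.gauss(0, 0.5) for _ in range(n_in)]
                  for _ in range(n_out)]
        self.b = [0.0] * n_out

    def forward(self, x):
        self.x = x                      # cache the input for the backward pass
        return [sum(w * xi for w, xi in zip(row, x)) + bi
                for row, bi in zip(self.W, self.b)]

    def backward(self, grad_z, lr):
        # grad_z is dL/dz for this layer's outputs; return dL/dx for the
        # previous layer, then apply the SGD update W -= lr * dL/dW
        grad_x = [sum(self.W[j][i] * grad_z[j] for j in range(len(grad_z)))
                  for i in range(len(self.x))]
        for j, g in enumerate(grad_z):
            for i, xi in enumerate(self.x):
                self.W[j][i] -= lr * g * xi
            self.b[j] -= lr * g
        return grad_x
```

Softmax subtracts the max logit before exponentiating, a standard trick that avoids overflow without changing the result.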
Three-layer network with per-sample SGD: forward propagation → compute loss → backpropagation → gradient descent.
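Putting the pieces together, the per-sample SGD loop might look like this self-contained sketch of the 2 → 4 → 4 → 2 network (blob centres, learning rate, and epoch count are illustrative guesses, not the demo's actual values):

```python
import math, random

random.seed(1)

# toy data: two Gaussian blobs via Box-Muller (centres/spread are made up)
def gauss():
    u1, u2 = 1.0 - random.random(), random.random()
    return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)

data = [(-1.0 + 0.5 * gauss(), 0.5 * gauss(), 0) for _ in range(100)] \
     + [( 1.0 + 0.5 * gauss(), 0.5 * gauss(), 1) for _ in range(100)]

# 2 -> 4 -> 4 -> 2 network stored as nested weight/bias lists
sizes = [2, 4, 4, 2]
W = [[[random.gauss(0, 0.5) for _ in range(sizes[l])]
      for _ in range(sizes[l + 1])] for l in range(3)]
b = [[0.0] * sizes[l + 1] for l in range(3)]

def forward(x):
    acts = [x]
    for l in range(3):
        z = [sum(w * a for w, a in zip(row, acts[-1])) + bi
             for row, bi in zip(W[l], b[l])]
        if l < 2:
            z = [v if v > 0 else 0.0 for v in z]   # ReLU on hidden layers
        acts.append(z)
    m = max(acts[-1])
    e = [math.exp(v - m) for v in acts[-1]]
    s = sum(e)
    return acts, [v / s for v in e]                # softmax probabilities

lr = 0.05
for epoch in range(30):
    random.shuffle(data)
    for x1, x2, y in data:                         # per-sample SGD
        acts, p = forward([x1, x2])
        # gradient of softmax + cross-entropy w.r.t. logits: p - one_hot(y)
        g = [p[i] - (1.0 if i == y else 0.0) for i in range(2)]
        for l in (2, 1, 0):                        # backpropagate layer by layer
            g_prev = [sum(W[l][j][i] * g[j] for j in range(len(g)))
                      for i in range(sizes[l])]
            for j in range(len(g)):                # gradient-descent update
                for i in range(sizes[l]):
                    W[l][j][i] -= lr * g[j] * acts[l][i]
                b[l][j] -= lr * g[j]
            if l > 0:                              # ReLU derivative gates the gradient
                g = [gp if acts[l][i] > 0 else 0.0
                     for i, gp in enumerate(g_prev)]

acc = sum((forward([x1, x2])[1][1] > 0.5) == (y == 1)
          for x1, x2, y in data) / len(data)
```

Because softmax plus cross-entropy has the simple logit gradient p − one_hot(y), the output layer needs no separate activation derivative.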
02 Code
03 Academic Explanation
MLP (Multi-Layer Perceptron) is a foundational deep learning model. It consists of multiple layers of neurons, with each layer fully connected to the next, capable of learning nonlinear relationships.
Network Architecture
MLP typically contains three types of layers. The model in this demo has the structure 2 → 16 → 8 → 2:
- Input Layer (2 neurons): corresponds to the two input features x, y; shape (batch, 2)
- Hidden Layer 1 (16 neurons): weight matrix W¹ of shape (2, 16), bias b¹ of shape (16,), ReLU activation, output (batch, 16)
- Hidden Layer 2 (8 neurons): weight matrix W² of shape (16, 8), bias b² of shape (8,), ReLU activation, output (batch, 8)
- Output Layer (2 neurons): weight matrix W³ of shape (8, 2), bias b³ of shape (2,), passes through Softmax to output class probabilities, shape (batch, 2)
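These shapes can be verified with a few lines of NumPy (the demo itself is framework-free; NumPy here is only shape bookkeeping, and the random weights are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
batch = 5

X = rng.normal(size=(batch, 2))           # input: (batch, 2)
W1, b1 = rng.normal(size=(2, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 8)), np.zeros(8)
W3, b3 = rng.normal(size=(8, 2)),  np.zeros(2)

H1 = np.maximum(X @ W1 + b1, 0)           # ReLU, shape (batch, 16)
H2 = np.maximum(H1 @ W2 + b2, 0)          # ReLU, shape (batch, 8)
logits = H2 @ W3 + b3                     # shape (batch, 2)
P = np.exp(logits - logits.max(axis=1, keepdims=True))
P = P / P.sum(axis=1, keepdims=True)      # softmax: each row sums to 1

assert H1.shape == (5, 16) and H2.shape == (5, 8) and P.shape == (5, 2)
assert np.allclose(P.sum(axis=1), 1.0)
```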
Forward Propagation
Data flows from the input layer to the output layer, with each layer computing a weighted sum plus bias and then applying an activation: zˡ = Wˡaˡ⁻¹ + bˡ, aˡ = f(zˡ).
Common activation functions include ReLU, Sigmoid, and Tanh.
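These three activations, and the derivatives that backpropagation multiplies by, can be sketched as follows (the finite-difference check is our own addition):

```python
import math

def relu(z):      return max(z, 0.0)
def sigmoid(z):   return 1.0 / (1.0 + math.exp(-z))
def tanh(z):      return math.tanh(z)

# the derivatives used as the ∂a/∂z factor during backpropagation
def d_relu(z):    return 1.0 if z > 0 else 0.0
def d_sigmoid(z): s = sigmoid(z); return s * (1.0 - s)
def d_tanh(z):    return 1.0 - math.tanh(z) ** 2

# sanity-check each derivative against a central finite difference
eps = 1e-6
for f, df in [(relu, d_relu), (sigmoid, d_sigmoid), (tanh, d_tanh)]:
    numeric = (f(0.3 + eps) - f(0.3 - eps)) / (2 * eps)
    assert abs(numeric - df(0.3)) < 1e-4
```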
Backpropagation
Through the chain rule, the gradient of the loss with respect to each weight is propagated from the output layer back to the input layer layer by layer:
- ∂L/∂a: Gradient of loss w.r.t. activation value, tells us whether the activation is too high or too low
- ∂a/∂z: Derivative of the activation function (ReLU derivative: 1 for positive, 0 for negative)
- ∂z/∂W: Derivative of the weighted sum w.r.t. weights, which is simply the previous layer's activation value
After computing the gradient for each layer, update the weights using gradient descent: W ← W − η × ∂L/∂W, where η is the learning rate.
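The chain rule can be checked numerically on a toy one-layer softmax classifier (the numbers below are made up; the identity dL/dW[j][i] = (p[j] − one_hot(y)[j]) × x[i] follows from the three factors above):

```python
import math

x = [0.5, -1.0]                  # single input sample (made-up numbers)
W = [[0.2, -0.3], [0.1, 0.4]]    # 2x2 weight matrix, true label y = 0
y = 0

def loss(W):
    # linear layer -> softmax -> cross-entropy for one sample
    z = [W[j][0] * x[0] + W[j][1] * x[1] for j in range(2)]
    m = max(z)
    e = [math.exp(v - m) for v in z]
    p = [v / sum(e) for v in e]
    return -math.log(p[y])

# analytic gradient for W[1][0] via the chain rule
z = [W[j][0] * x[0] + W[j][1] * x[1] for j in range(2)]
m = max(z)
e = [math.exp(v - m) for v in z]
p = [v / sum(e) for v in e]
analytic = (p[1] - 0.0) * x[0]   # j = 1 is not the true class, so one_hot = 0

# numeric gradient by central finite difference
eps = 1e-6
W[1][0] += eps;     up = loss(W)
W[1][0] -= 2 * eps; down = loss(W)
W[1][0] += eps                   # restore the weight
numeric = (up - down) / (2 * eps)
assert abs(analytic - numeric) < 1e-5
```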
Neuron Activation Animation
Click "Forward Propagation" to observe signals flowing through the network layer by layer. After training completes, the weights in the diagram will automatically update to reflect the actual learned values from the model.
How to read this diagram: Connection lines in green indicate positive weights, red indicates negative weights, and thicker lines mean larger absolute values; brighter node colors indicate higher activation values (closer to 1), while darker colors mean values closer to 0 (ReLU cutoff).
Summary
- Architecture: input + hidden + output layers
- Activations: ReLU / Sigmoid / Tanh
- Training: backpropagation + gradient descent
- Capability: learning nonlinear relationships