01 Core Principles (Plain English)

You look at a photo and instantly recognize a cat. How does your brain do this? The retina senses light → the primary visual cortex detects edges → higher-level areas recognize ears and whiskers → finally you conclude "it's a cat."

An MLP mimics exactly this process: each layer of neurons handles one level of abstraction, with signals flowing from low-level features to high-level judgments. Each layer does another round of "processing" on top of the previous layer's output. The more layers, the more complex the patterns the network can learn.

How Signals Flow Through the Network

1. Input Layer: Receives Data

Feed in raw data (image pixels, sensor readings, text encodings, etc.), with each input feature corresponding to one neuron.

2. Hidden Layers: Weighted Sum + Activation

Each neuron takes all outputs from the previous layer, multiplies them by weights and sums them up, then applies an activation function (like ReLU) to introduce non-linearity. This step can be repeated for multiple layers.
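The weighted-sum-plus-activation step can be sketched in a few lines (a minimal illustration, assuming a hypothetical 3-input, 4-neuron layer with small random weights; NumPy is used only for the array math):

```python
import numpy as np

def relu(z):
    """Zero out negatives, keep positives."""
    return np.maximum(0.0, z)

# Hypothetical sizes: 3 inputs feeding a 4-neuron hidden layer.
rng = np.random.default_rng(0)
x = np.array([0.5, -1.2, 2.0])        # previous layer's outputs
W = rng.normal(size=(4, 3)) * 0.1     # one weight per (neuron, input) pair
b = np.zeros(4)

z = W @ x + b     # weighted sum for each of the 4 neurons
a = relu(z)       # activation introduces non-linearity
print(a.shape)    # (4,)
```

Stacking another such layer on top of `a` repeats the same pattern with a new weight matrix.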

3. Output Layer: Reaches a Conclusion

The final layer outputs the prediction. Classification tasks use Softmax to output class probabilities; regression tasks directly output numerical values.
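As a quick sketch of the classification case, Softmax turns raw output scores into probabilities that sum to 1 (the logit values below are made up for illustration):

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; result sums to 1.
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([2.0, 0.5])   # raw output-layer scores for two classes
probs = softmax(logits)
print(probs.sum())              # 1.0
```

A regression output layer would simply skip this step and emit the raw values.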

4. Backpropagation: Updates Weights

Calculate the difference between the prediction and the ground truth (the loss), use the chain rule to compute each weight's contribution to the error from back to front, then nudge the weights with gradient descent. Repeat this loop, and the network's predictions get more and more accurate.

The activation function is the soul of the network. Without activation functions, no matter how many layers you stack, the entire network is equivalent to a single linear model that can only draw straight lines. ReLU (zeros out negatives, keeps positives) breaks linearity, giving the network the ability to fit any curve.
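This claim is easy to verify numerically: without an activation in between, two weight matrices collapse into a single one (the matrix shapes here are arbitrary, chosen just for the demonstration):

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(4, 2))
W2 = rng.normal(size=(3, 4))
x = rng.normal(size=2)

# Without activation: two layers collapse into one linear map W2 @ W1.
two_layers = W2 @ (W1 @ x)
one_layer = (W2 @ W1) @ x
print(np.allclose(two_layers, one_layer))   # True: no extra power gained

# With ReLU in between, the composition is no longer a single matrix,
# so the network can represent bent, non-linear decision boundaries.
relu = lambda z: np.maximum(0.0, z)
nonlinear = W2 @ relu(W1 @ x)
```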

Are More Layers Always Better?

Too Shallow (1-2 Layers)

Can only learn simple patterns. Not capable enough for complex tasks (image recognition, natural language), leading to underfitting.

Too Deep (Without Techniques)

Gradients become increasingly small during backpropagation (vanishing gradient), and layers near the input barely learn anything. Requires techniques like residual connections and batch normalization to address this.

Teaching Example: Gaussian Binary Classification

We use a simple but representative task throughout: two groups of points on a plane, each following a 2D Gaussian distribution. The MLP needs to find a nonlinear boundary to separate them.

  • Input: 2D coordinates (x₁, x₂)
  • Output: Probability of belonging to class 0 or class 1
  • Network Architecture: 2 → 4 → 4 → 2, two ReLU hidden layers
  • Training Method: Pure hand-written SGD, no frameworks

Linear models (like logistic regression) can only draw straight lines and cannot perfectly separate these two groups of points. With hidden layers and activation functions, MLP can bend the decision boundary to fit the data.

Building MLP Step by Step (Pure Hand-Written, No Libraries)

From data generation to backpropagation, every line is written by hand.

Step 1 Generate Data and Visualize

Use the Box-Muller method to generate two groups of Gaussian-distributed points. The scatter plot gives an intuitive feel for the data distribution.
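A minimal sketch of this step using only the Python standard library (the cluster centers, spread, and sample counts are illustrative choices, not necessarily the tutorial's exact values):

```python
import math
import random

def box_muller():
    """Turn two uniform samples into one standard normal sample."""
    u1 = 1.0 - random.random()   # shift into (0, 1] so log(u1) is defined
    u2 = random.random()
    return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)

def gaussian_cloud(cx, cy, std, n):
    """n points scattered around center (cx, cy) with the given spread."""
    return [(cx + std * box_muller(), cy + std * box_muller()) for _ in range(n)]

# Two hypothetical clusters on the plane, one per class.
class0 = gaussian_cloud(-1.0, -1.0, 0.7, 100)
class1 = gaussian_cloud(1.0, 1.0, 0.7, 100)
```

Plotting `class0` and `class1` in two colors gives the scatter plot described above.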

Step 2 Define Activation Functions and Fully Connected Layers

Hand-write ReLU, Softmax, cross-entropy, and a Dense layer class with forward/backward propagation.
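One possible shape for these pieces in plain Python, with no libraries beyond the standard `math` and `random` modules (the names and the Gaussian weight initialization are my own choices, not the tutorial's exact code):

```python
import math
import random

def relu(v):
    return [max(0.0, x) for x in v]

def relu_grad(v):
    # Derivative of ReLU: 1 for positive inputs, 0 otherwise.
    return [1.0 if x > 0 else 0.0 for x in v]

def softmax(v):
    m = max(v)                           # subtract max for stability
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def cross_entropy(probs, label):
    # label is the true class index; clamp to avoid log(0).
    return -math.log(max(probs[label], 1e-12))

class Dense:
    """Fully connected layer: out = W x + b, with hand-written backward."""
    def __init__(self, n_in, n_out):
        self.W = [[random.gauss(0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
        self.b = [0.0] * n_out

    def forward(self, x):
        self.x = x                       # cache input for the backward pass
        return [sum(w * xi for w, xi in zip(row, x)) + bi
                for row, bi in zip(self.W, self.b)]

    def backward(self, grad_out, lr):
        # Gradient w.r.t. input, to pass on to the previous layer.
        grad_in = [sum(self.W[j][i] * grad_out[j] for j in range(len(grad_out)))
                   for i in range(len(self.x))]
        # SGD update: dL/dW[j][i] = grad_out[j] * x[i], dL/db[j] = grad_out[j].
        for j, g in enumerate(grad_out):
            for i, xi in enumerate(self.x):
                self.W[j][i] -= lr * g * xi
            self.b[j] -= lr * g
        return grad_in
```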

Step 3 Build Network, Train, Update Weights

Three-layer network with per-sample SGD: forward propagation → compute loss → backpropagation → gradient descent.
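A compact, self-contained sketch of the whole loop for the 2 → 4 → 4 → 2 teaching network (NumPy is used here only to keep the array arithmetic short; the data centers, learning rate, and epoch count are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two Gaussian clouds on the plane (assumed centers and spread).
X = np.vstack([rng.normal([-1, -1], 0.7, (100, 2)),
               rng.normal([1, 1], 0.7, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

# 2 -> 4 -> 4 -> 2 network with small random initialization.
W1, b1 = rng.normal(0, 0.5, (2, 4)), np.zeros(4)
W2, b2 = rng.normal(0, 0.5, (4, 4)), np.zeros(4)
W3, b3 = rng.normal(0, 0.5, (4, 2)), np.zeros(2)
lr = 0.05

for epoch in range(100):
    for i in rng.permutation(len(X)):            # per-sample SGD
        x, t = X[i], y[i]
        # Forward: weighted sum + ReLU, twice, then softmax.
        z1 = x @ W1 + b1; a1 = np.maximum(0, z1)
        z2 = a1 @ W2 + b2; a2 = np.maximum(0, z2)
        z3 = a2 @ W3 + b3
        p = np.exp(z3 - z3.max()); p /= p.sum()
        # Backward: softmax + cross-entropy gives dL/dz3 = p - onehot(t).
        d3 = p.copy(); d3[t] -= 1.0
        d2 = (d3 @ W3.T) * (z2 > 0)              # ReLU gate
        d1 = (d2 @ W2.T) * (z1 > 0)
        # Gradient descent updates.
        W3 -= lr * np.outer(a2, d3); b3 -= lr * d3
        W2 -= lr * np.outer(a1, d2); b2 -= lr * d2
        W1 -= lr * np.outer(x, d1);  b1 -= lr * d1

# Check training accuracy on the whole set.
Z1 = np.maximum(0, X @ W1 + b1)
Z2 = np.maximum(0, Z1 @ W2 + b2)
pred = (Z2 @ W3 + b3).argmax(axis=1)
print((pred == y).mean())
```

With these well-separated clusters the network should reach near-perfect training accuracy within a few dozen epochs.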

02 Code

03 Academic Explanation

MLP (Multi-Layer Perceptron) is a foundational deep learning model. It consists of multiple layers of neurons, with each layer fully connected to the next, capable of learning nonlinear relationships.

Network Architecture

MLP typically contains three types of layers. The model in this demo has the structure 2 → 16 → 8 → 2:

  • Input Layer (2 neurons): Corresponds to two input features x, y, shape (batch, 2)
  • Hidden Layer 1 (16 neurons): Weight matrix W¹ shape (2, 16), bias b¹ shape (16,), uses ReLU activation, output (batch, 16)
  • Hidden Layer 2 (8 neurons): Weight matrix W² shape (16, 8), bias b² shape (8,), uses ReLU activation, output (batch, 8)
  • Output Layer (2 neurons): Weight matrix W³ shape (8, 2), bias b³ shape (2,), passes through Softmax to output class probabilities, shape (batch, 2)
[Network diagram: inputs x, y → Input Layer (2) → Hidden Layer 1 (16) → Hidden Layer 2 (8) → Output Layer (2: Class 0 / Class 1), connected by weights W¹, W², W³]

Forward Propagation

Data flows from the input layer to the output layer, with each layer performing computation:

$$z = Wx + b$$
$$a = \text{activation}(z)$$

Common activation functions include ReLU, Sigmoid, and Tanh.

Backpropagation

Through the chain rule, the gradient of the loss with respect to each weight is propagated from the output layer back to the input layer layer by layer:

$$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial a}\cdot\frac{\partial a}{\partial z}\cdot\frac{\partial z}{\partial W}$$
  • ∂L/∂a: Gradient of loss w.r.t. activation value, tells us whether the activation is too high or too low
  • ∂a/∂z: Derivative of the activation function (ReLU derivative: 1 for positive, 0 for negative)
  • ∂z/∂W: Derivative of the weighted sum w.r.t. weights, which is simply the previous layer's activation value
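The three factors can be checked on a one-neuron toy example (all numbers below are made up for illustration; NumPy is used only for the array math):

```python
import numpy as np

# Single-neuron layer: a = relu(z), z = W x + b.
x = np.array([1.0, -2.0])     # previous layer's activations
W = np.array([[0.5, -0.3]])   # one output neuron, two inputs
b = np.array([0.1])

z = W @ x + b                 # z = [1.2]
a = np.maximum(0, z)

dL_da = np.array([0.4])            # assumed upstream gradient from the loss
da_dz = (z > 0).astype(float)      # ReLU derivative: 1 where z > 0
dL_dz = dL_da * da_dz
dL_dW = np.outer(dL_dz, x)         # dz/dW is the previous activation x
print(dL_dW)                       # [[ 0.4 -0.8]]
```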

After computing the gradient for each layer, update the weights using gradient descent:

$$W \leftarrow W - \eta \frac{\partial L}{\partial W}$$

where η is the learning rate.

Neuron Activation Animation

Click "Forward Propagation" to observe signals flowing through the network layer by layer. After training completes, the weights in the diagram will automatically update to reflect the actual learned values from the model.

How to read this diagram: Connection lines in green indicate positive weights, red indicates negative weights, and thicker lines mean larger absolute values; brighter node colors indicate higher activation values (closer to 1), while darker colors mean values closer to 0 (ReLU cutoff).

Summary

  • Architecture: Input + hidden + output layers
  • Activation: ReLU / Sigmoid / Tanh
  • Training: Backpropagation + gradient descent
  • Capability: Learning nonlinear relationships