See It in Action: Handwritten Digit Recognition

Draw a digit (0-9) on the canvas below, and the CNN will show real-time confidence scores for each digit. The model reaches ~99% accuracy and runs entirely in your browser.


01 Core Principles (Plain English)

You recognize a photo of a cat not because you memorized every pixel's color — but because you see pointy ears, whiskers, a furry outline and other features.

A CNN does exactly the same thing: it first finds local features in the image, then combines those features into higher-level concepts, and finally makes a judgment.

Why aren't regular neural networks enough?

A 224×224 color image has 224 × 224 × 3 = 150,528 pixel values. If you use a regular fully connected layer, each neuron in the first layer would need to connect to all 150,528 inputs — an explosion of parameters that also completely ignores spatial relationships between pixels.

Key insight: useful information in images is local — edges, corners, and textures occupy only small patches of the image. CNNs exploit this property by scanning the image with small windows instead of looking at everything at once.

Three Core Operations of CNN

1. Convolution — Find Features

Use a small window (convolutional kernel, e.g. 3×3) to slide across the image. At each position, compute a weighted sum to output a single number. After sliding across the entire image, you get a feature map that records "where this kernel's feature responds".
Multiple different kernels = detecting multiple features simultaneously (horizontal edges, vertical edges, diagonals…).
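The sliding-window computation can be sketched in a few lines of plain Python — a minimal single-channel, stride-1, no-padding version, with a made-up example image and a vertical-edge kernel:

```python
# Minimal 2D convolution (cross-correlation), stride 1, no padding.
# image and kernel are lists of lists; single channel.

def conv2d(image, kernel):
    H, W = len(image), len(image[0])
    kH, kW = len(kernel), len(kernel[0])
    out_h, out_w = H - kH + 1, W - kW + 1  # output size: (H - kH)/1 + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # weighted sum of the kH x kW window at position (i, j)
            s = sum(image[i + m][j + n] * kernel[m][n]
                    for m in range(kH) for n in range(kW))
            row.append(s)
        feature_map.append(row)
    return feature_map

# A kernel that responds where left and right columns of the window differ,
# i.e. at vertical edges.
vertical_edge = [[1, 0, -1],
                 [1, 0, -1],
                 [1, 0, -1]]

# Made-up 5x5 image: dark on the left, bright on the right.
image = [[0, 0, 1, 1, 1]] * 5

fmap = conv2d(image, vertical_edge)  # 5x5 input, 3x3 kernel -> 3x3 map
```

The feature map has its strongest (most negative) responses exactly where the window straddles the dark-to-bright boundary, and 0 where the window sees a uniform region.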

2. Pooling — Downsample

Feature maps are too large. Use Max Pooling to shrink each 2×2 block into 1 number (take the maximum), halving the image size while preserving important features. Benefits: reduces computation and introduces some translation invariance — features shifting a few pixels won't change the conclusion.
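Max pooling is equally simple to sketch — 2×2 windows with stride 2, on a made-up feature map:

```python
# 2x2 max pooling with stride 2 on a list-of-lists feature map.

def max_pool2x2(fmap):
    H, W = len(fmap), len(fmap[0])
    pooled = []
    for i in range(0, H - 1, 2):
        row = []
        for j in range(0, W - 1, 2):
            # keep only the strongest response in each 2x2 block
            row.append(max(fmap[i][j], fmap[i][j + 1],
                           fmap[i + 1][j], fmap[i + 1][j + 1]))
        pooled.append(row)
    return pooled

fmap = [[1, 3, 2, 0],
        [4, 2, 0, 1],
        [5, 1, 9, 2],
        [0, 6, 3, 8]]

pooled = max_pool2x2(fmap)  # 4x4 -> 2x2, each block reduced to its maximum
```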

3. Fully Connected (FC) — Make Decisions

Flatten the last feature map into a 1D vector and feed it into a regular fully connected layer. At this point, features are already abstract enough (not pixels anymore, but high-level concepts like "has ears"), and the FC layer combines these features to output classification probabilities.
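A minimal sketch of this final stage in plain Python: flatten a tiny made-up 2×2×2 feature map, then apply one dense layer with softmax (the weights are arbitrary illustration values, not trained ones):

```python
import math

def softmax(z):
    # turn raw scores into probabilities that sum to 1
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up final feature map: 2x2 spatial positions, 2 channels.
features = [[[0.0, 1.0], [2.0, 0.0]],
            [[0.0, 0.0], [3.0, 1.0]]]

# Flatten into a 1D vector of length 2*2*2 = 8.
flat = [v for row in features for cell in row for v in cell]

# One made-up weight vector per class (3 classes), plus a bias each.
W = [[0.1] * 8, [0.2] * 8, [-0.1] * 8]
b = [0.0, 0.0, 0.0]
logits = [sum(w * x for w, x in zip(Wk, flat)) + bk for Wk, bk in zip(W, b)]
probs = softmax(logits)  # classification probabilities
```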

Two Key Properties of CNN

Local Receptive Field

Each neuron only looks at a small patch of the image (receptive field), not the entire image. This matches the natural structure of images: meaningful features are always local.

Weight Sharing

The same convolutional kernel uses the same set of weights as it slides across the image. A kernel that detects "vertical lines" uses the same weights whether the line is on the left or right side. Parameters drop from millions to dozens.

Stacked layers, increasingly abstract features

CNNs typically stack multiple "convolution + pooling" combinations:

  • 1st convolutional layer: learns low-level features like edges and color gradients
  • 2nd convolutional layer: combines edges into mid-level features like corners and arcs
  • Deeper layers: combine into high-level semantic features like eyes, wheels, text

This is what "deep" in deep learning means: instead of manually designing features, let the network abstract pixels into concepts layer by layer.

Build a CNN Step by Step

From pixel drawing to classification training, step by step.

Step 1 Draw 16×16 pixel shapes

Use code to draw circles, squares, and triangles on a 16×16 canvas, with each pixel value being 0 or 1.

Step 2 Generate the dataset

Batch generate 180 samples and organize them into the 4D tensor format [N, H, W, C] that CNN requires.
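The [N, H, W, C] layout can be illustrated in plain Python (`to_nhwc` is a made-up helper name; a real pipeline would use a tensor library, but the nesting is the same):

```python
# Organize N binary 16x16 images into [N, H, W, C] layout: each pixel
# becomes a length-1 channel vector, since the drawings are grayscale.

def to_nhwc(images):
    return [[[[pixel] for pixel in row] for row in img] for img in images]

blank = [[0] * 16 for _ in range(16)]   # one 16x16 all-zero image
dataset = to_nhwc([blank] * 180)        # 180 samples, as in the demo

N = len(dataset)              # number of samples
H = len(dataset[0])           # height
W = len(dataset[0][0])        # width
C = len(dataset[0][0][0])     # channels
```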

Step 3 Build the convolutional network

Two convolutional layers (extract features) → flatten → fully connected layer (classify).

Step 4 Train and observe accuracy

Train with the Adam optimizer, printing loss and accuracy each epoch.

02 Code

03 Academic Explanation

Mathematical Definition of Convolution

For input feature map X (height H × width W × channels C_in) and kernel K (height k_H × width k_W × C_in), the output feature map at position (i, j) is:

Y[i, j] = Σ_c Σ_m Σ_n X[i·s+m, j·s+n, c] · K[m, n, c] + b

where s is the stride and b is the bias. When using multiple filters, each kernel produces a feature map, and all feature maps are stacked to form the next layer's input.
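The formula transcribes almost directly into Python. A single-output-channel sketch (`conv_at` is a made-up helper name; X is H×W×C_in, K is k_H×k_W×C_in):

```python
# Y[i, j] = sum over m, n, c of X[i*s + m, j*s + n, c] * K[m, n, c] + b

def conv_at(X, K, b, i, j, s=1):
    kH, kW, C_in = len(K), len(K[0]), len(K[0][0])
    return sum(X[i * s + m][j * s + n][c] * K[m][n][c]
               for m in range(kH) for n in range(kW) for c in range(C_in)) + b

# Made-up 4x4 single-channel input with X[i][j][0] = i + j.
X = [[[i + j] for j in range(4)] for i in range(4)]
K = [[[1], [1]],
     [[1], [1]]]   # 2x2 kernel, one input channel, all-ones weights

y00 = conv_at(X, K, 0, 0, 0)       # window over rows 0-1, cols 0-1
y11 = conv_at(X, K, 0, 1, 1, s=2)  # stride 2: window over rows 2-3, cols 2-3
```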

Convolution Operation Diagram

[Diagram: a 3×3 kernel (w₁…w₉) slides over a 5×5 input to produce a 3×3 feature map.]

Output size (no padding): H_out = (H - k_H) / s + 1, W_out = (W - k_W) / s + 1. For example, (5 - 3)/1 + 1 = 3. With padding=same, the output keeps the input's size.

Parameter Count

A convolutional layer's parameters are only the kernel weights and biases, independent of input image size:

Parameters = k_H × k_W × C_in × C_out + C_out (bias)

Our demo model (1st Conv2D layer): 3 × 3 × 1 × 8 + 8 = 80 parameters. Compare with a fully connected layer processing 16×16 input to 8 neurons, which needs 16×16×8 + 8 = 2,056 parameters — convolution is far more efficient.

Network Architecture

Our demo model structure is Conv(8) → Pool → Conv(16) → Pool → Flatten → Dense(32) → Dense(3):

  • Conv2D(8, 3×3, ReLU, same): input (batch, 16, 16, 1), output (batch, 16, 16, 8), 8 kernels each detecting one feature, 80 parameters
  • MaxPooling2D(2×2): output (batch, 8, 8, 8), size halved, retains strongest responses
  • Conv2D(16, 3×3, ReLU, same): output (batch, 8, 8, 16), combines features from layer 1 into more complex structures, 1,168 parameters
  • MaxPooling2D(2×2): output (batch, 4, 4, 16)
  • Flatten: output (batch, 256)
  • Dense(32, ReLU): output (batch, 32), 8,224 parameters
  • Dense(3, Softmax): output (batch, 3), probabilities for circle/square/triangle, 99 parameters
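The parameter counts listed above can be checked directly against the formulas (conv: k_H × k_W × C_in × C_out + C_out; dense: inputs × units + units):

```python
def conv_params(kh, kw, c_in, c_out):
    # kernel weights plus one bias per output channel
    return kh * kw * c_in * c_out + c_out

def dense_params(n_in, n_out):
    # weight matrix plus one bias per neuron
    return n_in * n_out + n_out

conv1 = conv_params(3, 3, 1, 8)      # first Conv2D layer
conv2 = conv_params(3, 3, 8, 16)     # second Conv2D layer
fc1 = dense_params(4 * 4 * 16, 32)   # Dense(32) after Flatten
fc2 = dense_params(32, 3)            # Dense(3) output layer
```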

Pooling Layer

Max Pooling takes the maximum value within each pooling window:

y[i, j] = max_{m,n ∈ window} x[i·s+m, j·s+n]

Max pooling gives features a degree of local translation invariance: even if a feature shifts slightly within the window, its maximum value is still captured. Average Pooling takes the mean instead, producing smoother output but potentially diluting strong edge responses.
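A one-window sketch of the trade-off, with made-up response values:

```python
# One 2x2 pooling window containing a single strong edge response.
window = [8, 0, 0, 0]

max_pooled = max(window)                 # the edge survives intact
avg_pooled = sum(window) / len(window)   # the edge is diluted by the zeros
```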

Activation Function

Convolution is usually followed by ReLU activation to introduce non-linearity:

ReLU(x) = max(0, x)

ReLU zeros out negative responses ("this feature is not present here") and passes positive ones through unchanged. It is cheap to compute and does not saturate for positive inputs, which mitigates vanishing gradients — making it the most commonly used activation function in CNNs.
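Applied element-wise to a row of a feature map (made-up values):

```python
def relu(x):
    # negative responses become 0, positive ones pass through
    return max(0.0, x)

row = [-3.0, -0.5, 0.0, 1.2, 4.0]
activated = [relu(x) for x in row]
```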

Backpropagation and Weight Updates

CNNs are trained with backpropagation just like MLPs, but in a convolutional layer the gradient with respect to the input is obtained by convolving the upstream gradient with the flipped kernel (an operation equivalent to a transposed convolution, sometimes loosely called deconvolution). Due to weight sharing, gradients for the same kernel weight coming from different positions are accumulated:

∂L/∂K[m,n] = Σ_{i,j} ∂L/∂Y[i,j] · X[i·s+m, j·s+n]

Weight update formulas are the same as MLP: K ← K − η × ∂L/∂K, bias b ← b − η × ∂L/∂b.
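The accumulation over positions can be sketched directly from the gradient formula (single channel, stride 1; `kernel_grad` is a made-up helper name):

```python
# dL/dK[m, n] = sum over all output positions (i, j) of
#               dL/dY[i, j] * X[i + m, j + n]
# Weight sharing means every position contributes to the SAME kernel entry.

def kernel_grad(X, dL_dY, kH, kW):
    out_h, out_w = len(dL_dY), len(dL_dY[0])
    dK = [[0.0] * kW for _ in range(kH)]
    for m in range(kH):
        for n in range(kW):
            for i in range(out_h):
                for j in range(out_w):
                    dK[m][n] += dL_dY[i][j] * X[i + m][j + n]
    return dK

# Made-up 3x3 input and an all-ones upstream gradient for a 2x2 kernel.
X = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
dL_dY = [[1.0, 1.0],
         [1.0, 1.0]]

dK = kernel_grad(X, dL_dY, 2, 2)
```

With an all-ones upstream gradient, each dK entry is just the sum of the input values that entry touched across all window positions — four positions per kernel weight here.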

Convolutional Feature Map Visualization

Observe how different kernels slide across the image and the resulting feature maps:

[Interactive: an 8×8 input image, a sliding kernel, and the resulting 6×6 feature map.]

Evolution of Classic CNN Architectures

LeNet-5 (1998)

The first practical CNN, used for handwritten digit recognition. Structure: 2 convolutional layers + 3 fully connected layers, establishing the basic CNN paradigm.

AlexNet (2012)

ImageNet competition winner that brought CNNs into the deep learning era. Introduced ReLU, Dropout, and data augmentation. Top-5 error rate dropped to 15.3%.

VGG (2014)

Used uniform 3×3 small kernels to build deeper networks (16-19 layers), proving that network depth is key to performance.

ResNet (2015)

Introduced residual connections (skip connections), solving the vanishing gradient problem in deep networks, pushing depth to 152 layers. Top-5 error rate dropped to 3.57%.