CNN: Convolutional Neural Network
The core architecture that lets machines "see" images — not by memorizing every pixel, but by learning to recognize shapes, textures, and outlines that truly matter
✦ See It in Action: Handwritten Digit Recognition
Draw a digit (0-9) on the canvas below, and the CNN on the right will show real-time confidence scores for each digit. Model accuracy ~99%, runs entirely in your browser.
01 Core Principles (Plain English)
You recognize a photo of a cat not because you memorized every pixel's color — but because you see pointy ears, whiskers, a furry outline and other features.
A CNN does exactly the same thing: it first finds local features in the image, then combines those features into higher-level concepts, and finally makes a judgment.
Why aren't regular neural networks enough?
A 224×224 color image has 150,528 pixel values. If you use a regular fully connected layer, each neuron in the first layer would need to connect to 150,528 inputs — an explosion of parameters that completely ignores spatial relationships between pixels.
Key insight: Useful information in images is local — edges, corners, and textures only occupy small patches of the image. CNNs exploit this property by scanning the image with small windows instead of looking at everything at once.
Three Core Operations of a CNN
Convolution: slide a small window (the convolutional kernel, e.g. 3×3) across the image. At each position, compute a weighted sum that yields a single number. After sliding across the entire image, you get a feature map, which records where this kernel's feature responds. Multiple different kernels means detecting multiple features simultaneously (horizontal edges, vertical edges, diagonals…).
Pooling: feature maps are too large, so Max Pooling shrinks each 2×2 block into one number (the maximum), halving the width and height while preserving the important features. Benefits: it reduces computation and introduces some translation invariance, so a feature shifting a few pixels won't change the conclusion.
Fully connected: flatten the last feature map into a 1D vector and feed it into a regular fully connected layer. By this point the features are abstract enough (no longer pixels, but high-level concepts like "has ears"), and the FC layer combines these features to output classification probabilities.
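The three operations above can be sketched in a few lines of NumPy. This is a toy illustration, not the demo's actual implementation; the 6×6 image and hand-set vertical-edge kernel are assumptions chosen to make the feature map easy to read:

```python
import numpy as np

def conv2d(image, kernel):
    """Slide the kernel over the image; each position yields one weighted sum."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def max_pool(fmap, p=2):
    """Shrink each p x p block to its maximum value."""
    H, W = fmap.shape
    return fmap[:H - H % p, :W - W % p].reshape(H // p, p, W // p, p).max(axis=(1, 3))

# A vertical edge in a tiny image, and a hand-set vertical-edge kernel
img = np.zeros((6, 6))
img[:, 3:] = 1.0                       # left half dark, right half bright
k = np.array([[-1.0, 1.0], [-1.0, 1.0]])

fmap = conv2d(img, k)    # responds strongly along the 0 -> 1 transition column
pooled = max_pool(fmap)  # half-size map that keeps the strongest responses
flat = pooled.reshape(-1)  # flatten: ready for a fully connected layer
```

The kernel's weights are negative on the left and positive on the right, so it fires exactly where dark meets bright, which is what "recording where this kernel's feature responds" means in practice.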
Two Key Properties of CNNs
Local Receptive Field
Each neuron only looks at a small patch of the image (receptive field), not the entire image. This matches the natural structure of images: meaningful features are always local.
Weight Sharing
The same convolutional kernel uses the same set of weights as it slides across the image. A kernel that detects "vertical lines" uses the same weights whether the line is on the left or right side. Parameters drop from millions to dozens.
Stacked layers, increasingly abstract features
CNNs typically stack multiple "convolution + pooling" combinations:
- 1st convolutional layer: learns low-level features like edges and color gradients
- 2nd convolutional layer: combines edges into mid-level features like corners and arcs
- Deeper layers: combine into high-level semantic features like eyes, wheels, text
This is what "deep" in deep learning means: instead of manually designing features, let the network abstract pixels into concepts layer by layer.
Build a CNN Step by Step
From pixel drawing to classification training, step by step.
- Draw shapes: use code to draw circles, squares, and triangles on a 16×16 canvas, with each pixel value being 0 or 1.
- Build the dataset: batch-generate 180 samples and organize them into the 4D tensor format [N, H, W, C] that a CNN requires.
- Define the model: two convolutional layers (extract features) → flatten → fully connected layer (classify).
- Train: use the Adam optimizer, printing loss and accuracy each epoch.
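The data-generation step can be sketched as follows. This is a minimal stand-in, not the demo's code: it draws only squares, and the drawing routine and label index are my own illustrative choices:

```python
import numpy as np

def draw_square(size=16):
    """One 16x16 binary image containing a hollow square outline."""
    img = np.zeros((size, size), dtype=np.float32)
    img[4, 4:12] = img[11, 4:12] = 1.0   # top and bottom edges
    img[4:12, 4] = img[4:12, 11] = 1.0   # left and right edges
    return img

# Batch-generate samples and pack them into the [N, H, W, C] tensor a CNN expects
N = 180
X = np.stack([draw_square() for _ in range(N)])[..., np.newaxis]  # add channel dim
y = np.full(N, 1, dtype=np.int64)  # hypothetical class index for "square"

print(X.shape)  # (180, 16, 16, 1) -- [N, H, W, C]
```

The trailing `np.newaxis` is the step people most often forget: convolutional layers expect an explicit channel dimension even for grayscale input.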
02 Code
03 Academic Explanation
Mathematical Definition of Convolution
For an input feature map X (height H × width W × channels C_in) and a kernel K (height k_H × width k_W × C_in), the output feature map at position (i, j) is:

Y(i, j) = Σ_{m=0..k_H−1} Σ_{n=0..k_W−1} Σ_{c=0..C_in−1} X(i·s + m, j·s + n, c) · K(m, n, c) + b

where s is the stride and b is the bias. When using multiple kernels, each kernel produces its own feature map, and the feature maps are stacked along the channel dimension to form the next layer's input.
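The definition maps directly to code. Below is a naive sketch of a strided, multi-channel convolution with valid padding and a single kernel; the function name and test values are illustrative, not from the demo:

```python
import numpy as np

def conv2d_multi(X, K, b=0.0, s=1):
    """Y[i,j] = sum over m,n,c of X[i*s+m, j*s+n, c] * K[m,n,c], plus bias b."""
    H, W, C = X.shape
    kH, kW, _ = K.shape
    out_h = (H - kH) // s + 1
    out_w = (W - kW) // s + 1
    Y = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # one weighted sum over the whole window, across all input channels
            Y[i, j] = np.sum(X[i*s:i*s+kH, j*s:j*s+kW, :] * K) + b
    return Y

X = np.arange(4 * 4 * 2, dtype=float).reshape(4, 4, 2)  # 4x4 image, 2 channels
K = np.ones((2, 2, 2))                                  # 2x2 all-ones kernel
Y = conv2d_multi(X, K, b=1.0, s=2)
print(Y.shape)  # (2, 2): stride 2 halves the output resolution
```

Note how the channel sum happens inside a single output number: one kernel always collapses all C_in input channels into one feature map.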
Convolution Operation Diagram
Parameter Count
A convolutional layer's parameters are only the kernel weights and biases, independent of input image size:

params = k_H × k_W × C_in × C_out + C_out
Our demo model (1st Conv2D layer): 3 × 3 × 1 × 8 + 8 = 80 parameters. Compare with a fully connected layer processing 16×16 input to 8 neurons, which needs 16×16×8 + 8 = 2,056 parameters — convolution is far more efficient.
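The comparison is easy to verify in code (the helper names below are mine, not from the demo):

```python
def conv_params(kh, kw, c_in, c_out):
    """Kernel weights (kh*kw*c_in per output channel) plus one bias per channel."""
    return kh * kw * c_in * c_out + c_out

def dense_params(n_in, n_out):
    """Fully connected: one weight per input-output pair, plus one bias per output."""
    return n_in * n_out + n_out

print(conv_params(3, 3, 1, 8))    # first Conv2D layer of the demo: 80
print(dense_params(16 * 16, 8))   # fully connected alternative on raw pixels: 2056
```

The gap widens further on real images: the convolutional count never changes with image size, while the fully connected count grows with every added pixel.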
Network Architecture
Our demo model structure is Conv(8) → Pool → Conv(16) → Pool → Flatten → Dense(32) → Dense(3):
- Conv2D(8, 3×3, ReLU, same): input (batch, 16, 16, 1), output (batch, 16, 16, 8); 8 kernels, each detecting one feature; 80 parameters
- MaxPooling2D(2×2): output (batch, 8, 8, 8); size halved, retains the strongest responses
- Conv2D(16, 3×3, ReLU, same): output (batch, 8, 8, 16); combines features from layer 1 into more complex structures; 1,168 parameters
- MaxPooling2D(2×2): output (batch, 4, 4, 16)
- Flatten: output (batch, 256)
- Dense(32, ReLU): output (batch, 32); 8,224 parameters
- Dense(3, Softmax): output (batch, 3), probabilities for circle/square/triangle; 99 parameters
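The shape bookkeeping in this architecture can be traced mechanically; a small sketch (the helper names are illustrative):

```python
def conv_same(shape, filters):
    """'same' padding keeps height and width; only the channel count changes."""
    h, w, _ = shape
    return (h, w, filters)

def pool2(shape):
    """2x2 max pooling halves height and width, keeps channels."""
    h, w, c = shape
    return (h // 2, w // 2, c)

s = (16, 16, 1)         # input image
s = conv_same(s, 8)     # (16, 16, 8)
s = pool2(s)            # (8, 8, 8)
s = conv_same(s, 16)    # (8, 8, 16)
s = pool2(s)            # (4, 4, 16)
flat = s[0] * s[1] * s[2]   # 256 values -> Dense(32) -> Dense(3)
print(s, flat)
```

Tracing shapes like this before training is a cheap sanity check: the Flatten size (256 here) must match the first Dense layer's expected input, or the model will not build.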
Pooling Layer
Max Pooling takes the maximum value within each p×p pooling window:

Y(i, j) = max over 0 ≤ m, n < p of X(i·p + m, j·p + n)
Max pooling gives features a degree of translation invariance: even if a feature shifts slightly, the maximum value is still captured. Average Pooling takes the mean instead; the output is smoother, but sharp edge features can get diluted.
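A minimal NumPy illustration of both points, using a single hand-placed feature value (an assumption for demonstration):

```python
import numpy as np

def max_pool(x, p=2):
    """2x2 max pooling via reshape: each block collapses to its maximum."""
    H, W = x.shape
    return x.reshape(H // p, p, W // p, p).max(axis=(1, 3))

a = np.zeros((4, 4)); a[0, 0] = 5.0   # a strong feature response at one corner
b = np.zeros((4, 4)); b[1, 1] = 5.0   # the same feature shifted by one pixel

print(max_pool(a)[0, 0], max_pool(b)[0, 0])  # max pooling absorbs the shift

avg = a.reshape(2, 2, 2, 2).mean(axis=(1, 3))  # average pooling on the same input
print(avg[0, 0])                               # the peak is diluted across the block
```

Both shifted inputs produce the same pooled value under max pooling, while average pooling smears the single strong response over the whole 2×2 block.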
Activation Function
Convolution is usually followed by a ReLU activation to introduce non-linearity:

ReLU(x) = max(0, x)
ReLU zeros out negative responses ("this feature is not present here") and keeps positive ones unchanged. It is cheap to compute and, unlike saturating activations such as sigmoid, does not squash gradients toward zero, which is why it is the most commonly used activation function in CNNs.
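Applied element-wise to a feature map, ReLU is a one-liner (a sketch; the sample values are arbitrary):

```python
import numpy as np

def relu(x):
    """Zero out negative responses, keep positive ones unchanged."""
    return np.maximum(0, x)

fmap = np.array([[-2.0, 0.5],
                 [ 3.0, -0.1]])
print(relu(fmap))  # negatives become 0, positives pass through
```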
Backpropagation and Weight Updates
CNNs are trained with backpropagation just like MLPs, but in a convolutional layer the gradient flows back to the input through a convolution with the flipped kernel (often implemented as a transposed convolution, also called deconvolution). Due to weight sharing, gradients for the same kernel weight are accumulated over every position where the kernel was applied:

∂L/∂K(m, n, c) = Σ_{i,j} ∂L/∂Y(i, j) × X(i·s + m, j·s + n, c)
Weight update formulas are the same as MLP: K ← K − η × ∂L/∂K, bias b ← b − η × ∂L/∂b.
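The accumulation rule can be checked numerically. The sketch below assumes (my choice, for simplicity) that the loss is just the sum of the output map, so the upstream gradient ∂L/∂Y is all ones, and verifies one accumulated kernel gradient against a finite difference:

```python
import numpy as np

def conv2d(x, k):
    """Valid-padding single-channel convolution (stride 1)."""
    kh, kw = k.shape
    H, W = x.shape
    y = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(y.shape[0]):
        for j in range(y.shape[1]):
            y[i, j] = np.sum(x[i:i+kh, j:j+kw] * k)
    return y

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 5))
k = rng.standard_normal((3, 3))
dL_dy = np.ones_like(conv2d(x, k))  # L = sum(Y), so upstream gradient is all ones

# Weight sharing: every output position contributes to the SAME dL/dK
dL_dk = np.zeros_like(k)
for i in range(dL_dy.shape[0]):
    for j in range(dL_dy.shape[1]):
        dL_dk += dL_dy[i, j] * x[i:i+3, j:j+3]

# Finite-difference check on one kernel weight
eps = 1e-6
kp = k.copy(); kp[0, 0] += eps
numeric = (conv2d(x, kp).sum() - conv2d(x, k).sum()) / eps
print(abs(numeric - dL_dk[0, 0]))  # agreement up to floating-point noise
```

With the gradient in hand, the update from the formula above is simply `k -= lr * dL_dk`.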
Convolutional Feature Map Visualization
Observe how different kernels slide across the image and the resulting feature maps:
Evolution of Classic CNN Architectures
LeNet-5 (1998)
The first practical CNN, used for handwritten digit recognition. Structure: 2 convolutional layers + 3 fully connected layers, establishing the basic CNN paradigm.
AlexNet (2012)
ImageNet competition winner that brought CNNs into the deep learning era. Introduced ReLU, Dropout, and data augmentation. Top-5 error rate dropped to 15.3%.
VGG (2014)
Used uniform 3×3 small kernels to build deeper networks (16-19 layers), proving that network depth is key to performance.
ResNet (2015)
Introduced residual connections (skip connections), solving the vanishing gradient problem in deep networks, pushing depth to 152 layers. Top-5 error rate dropped to 3.57%.