01 Core Principles (Plain English)

Imagine a map with two types of cities: red and blue. You need to draw a boundary line to separate them. There are infinitely many lines you could draw — which one is best?

SVM's answer: find the line that is as far as possible from the nearest points on both sides. It's like building the widest "road" between the two groups of points, with the boundary line running down the middle.

Three Key Concepts

1. Hyperplane

In 2D it's a line, in 3D it's a plane, and in higher dimensions it's called a hyperplane. SVM finds this separating hyperplane.

2. Support Vectors

The points closest to the boundary line. They alone determine the width of the "road"; all the other points play no part in the calculation.

3. Kernel Function

What if the data can't be linearly separated? Use a kernel function to implicitly map the data into a higher-dimensional space, where it often becomes linearly separable. The linear boundary found there corresponds to a curved boundary back in the original space.

Not Linearly Separable → Kernel Trick

Linear Kernel

Use when data is already linearly separable. Fastest and simplest.

RBF Kernel (Gaussian Kernel)

The most commonly used kernel. Can handle non-linear boundaries and suits most scenarios. The parameter γ controls the influence radius.

Polynomial Kernel

Suitable for data with multiplicative feature interactions, such as image recognition.

SVM's advantages: Excels on small samples and high-dimensional data (text classification, bioinformatics). Drawbacks include slow training on large samples and sensitivity to feature scaling — always standardize before use.
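The standardize-before-use advice is worth making concrete. A minimal sketch using scikit-learn's pipeline utilities (the synthetic dataset and the exaggerated feature scale are illustrative choices, not from the original):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data with one feature on a wildly different scale
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X[:, 0] *= 1000  # exaggerate one feature's scale on purpose

# Standardize first, then fit an RBF-kernel SVM
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma="scale", C=1.0))
model.fit(X, y)
print(model.score(X, y))  # training accuracy
```

Without the `StandardScaler` step, the oversized feature would dominate the RBF distance computation and hurt the fit.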

Build SVM Step by Step

Step 1 Generate Linearly Separable Data

SVM labels use -1 and +1 (not 0/1), so the margin's mathematical expression is more symmetric: y·(w·x+b) ≥ 1.
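Step 1 could be sketched like this (the blob centers, spread, and sample counts are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(42)

# Two Gaussian blobs, far enough apart to be linearly separable
n = 50
X_pos = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(n, 2))
X_neg = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(n, 2))
X = np.vstack([X_pos, X_neg])
y = np.hstack([np.ones(n), -np.ones(n)])  # labels are +1 / -1, not 0/1
```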

Step 2 Hinge Loss: Only Penalize Points Within the Margin

Points with sufficient margin incur zero loss; only points squeezed inside the "road" (or on the wrong side of it) are penalized, and the penalty grows linearly with how far they intrude.
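A hinge-loss sketch for Step 2 (the function name and signature are my own):

```python
import numpy as np

def hinge_loss(w, b, X, y):
    """Mean hinge loss: max(0, 1 - y*(w.x + b)).

    Points with margin y*(w.x + b) >= 1 contribute zero;
    only points inside the margin (or misclassified) are penalized.
    """
    margins = y * (X @ w + b)
    return np.maximum(0.0, 1.0 - margins).mean()
```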

Step 3 Subgradient Update: Widen the Margin

L2 regularization makes ||w|| smaller (margin wider), hinge loss pushes misclassified points back to the correct side. The trade-off is controlled by C.
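Step 3 might be sketched as one subgradient step on the objective ½‖w‖² + C·mean(hinge); the full-batch formulation and the learning rate are assumptions of this sketch:

```python
import numpy as np

def subgradient_step(w, b, X, y, C=1.0, lr=0.01):
    """One subgradient step on  0.5*||w||^2 + C * mean(hinge)."""
    margins = y * (X @ w + b)
    active = margins < 1  # only margin violators contribute a hinge subgradient
    grad_w = w - C * (y[active, None] * X[active]).sum(axis=0) / len(X)
    grad_b = -C * y[active].sum() / len(X)
    return w - lr * grad_w, b - lr * grad_b
```

The `w` term in `grad_w` is the L2 part that keeps ‖w‖ small (margin wide); the sum over `active` points is the hinge part that pushes violators back to the correct side.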

Step 4 Training Loop

Repeat subgradient updates. ||w|| gradually shrinks, the margin gradually widens, and support vectors emerge.
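Step 4 could look like the following, wrapping the subgradient update in a loop and reading off the support vectors at the end (hyperparameter values are illustrative):

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=500):
    """Full-batch subgradient descent on 0.5*||w||^2 + C * mean(hinge)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1
        grad_w = w - C * (y[active, None] * X[active]).sum(axis=0) / len(X)
        grad_b = -C * y[active].sum() / len(X)
        w -= lr * grad_w
        b -= lr * grad_b
    # Support vectors: points at or inside the margin
    support = np.where(y * (X @ w + b) <= 1 + 1e-6)[0]
    return w, b, support
```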

Combine all four segments with visualization to get the complete demo code — see below.

02 Code
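A minimal end-to-end sketch of the four steps above, using NumPy only (visualization omitted; all hyperparameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

# Step 1: linearly separable data with +1 / -1 labels
n = 50
X = np.vstack([rng.normal([2.0, 2.0], 0.5, (n, 2)),
               rng.normal([-2.0, -2.0], 0.5, (n, 2))])
y = np.hstack([np.ones(n), -np.ones(n)])

# Steps 2-4: subgradient descent on 0.5*||w||^2 + C * mean(hinge)
C, lr, epochs = 1.0, 0.01, 500
w, b = np.zeros(2), 0.0
for _ in range(epochs):
    margins = y * (X @ w + b)          # Step 2: hinge is active where margin < 1
    active = margins < 1
    grad_w = w - C * (y[active, None] * X[active]).sum(axis=0) / len(X)
    grad_b = -C * y[active].sum() / len(X)
    w -= lr * grad_w                   # Step 3: shrink ||w||, widen the margin
    b -= lr * grad_b

# Support vectors emerge as the points at or inside the margin
support = np.where(y * (X @ w + b) <= 1 + 1e-6)[0]
accuracy = (np.sign(X @ w + b) == y).mean()
print("w =", w, "b =", b)
print("support vectors:", len(support), "accuracy:", accuracy)
```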

03 Academic Explanation

Hard Margin SVM (Linearly Separable)

The hyperplane w·x + b = 0 separates two classes. The goal is to maximize the margin 2/‖w‖, equivalent to:

min ½‖w‖² s.t. yᵢ(w·xᵢ + b) ≥ 1, ∀i

This is a convex quadratic programming problem with a unique global optimal solution.

Soft Margin SVM (Allowing Misclassification)

Introduce slack variables ξᵢ ≥ 0 to allow some points to cross the margin boundary:

min ½‖w‖² + C·Σξᵢ s.t. yᵢ(w·xᵢ + b) ≥ 1 − ξᵢ

Parameter C controls the penalty strength: a large C tolerates little misclassification and yields a narrower margin; a small C allows more misclassification, a wider margin, and stronger regularization.
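The effect of C shows up directly in scikit-learn: with a small C the model tolerates margin violations and keeps more points inside the margin (hence more support vectors); with a large C it fits tighter. A small sketch (the dataset and the specific C values are illustrative):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping clusters, so some misclassification is unavoidable
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)

clf_small = SVC(kernel="linear", C=0.01).fit(X, y)   # weak penalty, wide margin
clf_large = SVC(kernel="linear", C=100.0).fit(X, y)  # strong penalty, narrow margin
print(clf_small.n_support_.sum(), clf_large.n_support_.sum())
```

The small-C model typically reports many more support vectors, because a wider margin sweeps more points inside it.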

Dual Problem and Kernel Functions

Through Lagrange duality, the primal problem transforms into a form that depends only on inner products xᵢ·xⱼ. Replacing inner products with kernel functions K(xᵢ,xⱼ) = φ(xᵢ)·φ(xⱼ) enables classification in high-dimensional space without explicitly computing the high-dimensional mapping φ (kernel trick).

Common kernel functions:

  • Linear kernel: K(x,z) = x·z
  • Polynomial kernel: K(x,z) = (γx·z + r)ᵈ
  • RBF kernel: K(x,z) = exp(−γ‖x−z‖²)
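The kernel trick in the list above can be checked numerically: for the polynomial kernel with γ=1, r=1, d=2, K(x,z) = (x·z + 1)² equals the inner product of explicit degree-2 feature maps, so the kernel evaluates φ(x)·φ(z) without ever constructing φ. A small verification (the feature map φ here is the standard degree-2 expansion for 2-D inputs):

```python
import numpy as np

def poly_kernel(x, z):
    """Polynomial kernel K(x,z) = (x.z + 1)^2  (gamma=1, r=1, d=2)."""
    return (x @ z + 1.0) ** 2

def phi(x):
    """Explicit degree-2 feature map for a 2-D input."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([x1**2, x2**2, s * x1 * x2, s * x1, s * x2, 1.0])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(poly_kernel(x, z))   # 4.0
print(phi(x) @ phi(z))     # same value, up to float rounding
```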

VC Dimension and Generalization Bound

SVM maximizing the margin is equivalent to minimizing model complexity (VC dimension). According to structural risk minimization theory, the generalization error upper bound is:

R(f) ≤ R_emp(f) + O(√(h/n))

where h is the VC dimension and n is the number of samples. The larger the margin, the smaller the VC dimension, the better the generalization — this is the core explanation for "why SVM works" in theory.