Gradient Descent
Gradient Descent is the most fundamental optimization algorithm in machine learning.
01 Core Principles (Plain English)
You've made a pot of soup, but it doesn't taste quite right and you want to improve the recipe. You can adjust the amount of salt — too much and it's too salty, too little and it's bland. How do you find the perfect amount?
A simple approach: try a little, taste it, sense which direction would make it better, adjust a tiny bit in that direction, taste again, adjust again... repeat a few times, getting closer and closer to the best recipe.
Gradient Descent is the mathematical version of this process: replacing "how good it tastes" with a "loss function", "which direction to adjust" with the "gradient direction", and "adjust a tiny bit" with the "learning rate".
Only Three Things
1. Loss: use a single number to measure how poorly the model is currently predicting. This number is called the "loss value", and we want to make it as small as possible.
2. Gradient: the gradient tells you in which direction the loss would increase the fastest. So going in the opposite direction means the loss decreases the fastest.
3. Learning rate: you can't take too big a step, since overshooting the minimum is trouble. The learning rate controls taking only a small step each time, gradually approaching the optimal solution.
Repeat these three steps, and the loss value will keep getting smaller while the model gets more accurate. Stop when: the gradient is close to zero (you're already at the bottom of the valley), or you've reached the maximum number of iterations.
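The three-step loop above can be sketched in a few lines of Python. This is a minimal illustration, not the article's full implementation; the function name gradient_descent and its parameters are my own choices:

```python
def gradient_descent(grad, x0, lr=0.1, max_iter=1000, tol=1e-6):
    """Repeat: compute the gradient, step against it, stop when converged."""
    x = x0
    for _ in range(max_iter):          # stop condition 2: iteration cap
        g = grad(x)
        if abs(g) < tol:               # stop condition 1: gradient near zero
            break
        x = x - lr * g                 # small step in the negative gradient direction
    return x

# Minimizing f(x) = (x - 3)**2, whose gradient is 2 * (x - 3):
x_star = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)  # ≈ 3
```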
Learning Rate: Step Size is Key
Step Too Large
Like taking huge strides down a mountain, you overshoot to the opposite slope, bouncing back and forth never reaching the valley floor. The loss value oscillates or even grows.
Step Too Small
Like an ant crawling, the direction is right but it's absurdly slow, requiring tens of thousands of iterations to converge — completely impractical in real applications.
In practice, start with a learning rate of 0.01 and adjust based on the loss curve.
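A quick numerical experiment makes the three regimes concrete. On the toy loss f(x) = x² (gradient 2x), each update multiplies x by (1 − 2·lr), so the step size directly controls whether we diverge, converge, or crawl. The helper name run is my own:

```python
def run(lr, steps=50, x=10.0):
    """Run `steps` gradient descent updates on f(x) = x**2."""
    for _ in range(steps):
        x -= lr * 2 * x   # gradient of x**2 is 2x
    return x

for lr in (1.1, 0.1, 0.001):
    print(f"lr={lr}: x after 50 steps = {run(lr):.6g}")
# lr=1.1   -> |x| blows up (oscillating divergence)
# lr=0.1   -> x is essentially at the minimum
# lr=0.001 -> x has barely moved from 10
```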
Building Gradient Descent Step by Step
Talk is cheap — let's build the code from scratch.
We'll use a quadratic function f(x) = x² to simulate a "loss function": it has a minimum point, and our goal is to find it.
Let's first plot it to see what it looks like:
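The original plot isn't reproduced here; a minimal sketch that draws the curve, assuming numpy and matplotlib are available:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs anywhere
import matplotlib.pyplot as plt

x = np.linspace(-10, 10, 201)
y = x ** 2  # our toy "loss function" f(x) = x**2

plt.plot(x, y)
plt.xlabel("x")
plt.ylabel("f(x)")
plt.title("f(x) = x^2")
plt.savefig("loss_curve.png")  # a U-shaped curve with its minimum at x = 0
```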
The problem is obvious: x + 1 is blindly wandering. It doesn't know which direction to go to make f(x) smaller.
The derivative (gradient) tells us the "slope" and "direction" of the function at the current point. Moving in the negative gradient direction means going downhill:
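For f(x) = x² the derivative is 2x, so a minimal sketch of the downhill step looks like this (variable names are my own):

```python
def f(x):
    return x ** 2

def grad(x):
    return 2 * x  # derivative of f(x) = x**2

x = 10.0   # start far from the minimum
lr = 0.1   # learning rate
for i in range(5):
    x = x - lr * grad(x)   # step in the negative gradient direction
    print(f"step {i + 1}: x = {x:.4f}, f(x) = {f(x):.4f}")
# x shrinks toward 0 by a factor of 0.8 per step: 8.0, 6.4, 5.12, 4.096, 3.2768
```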
We now have the core prototype of gradient descent. Adding convergence checks and visualization gives us the complete version — see the full code below.
02 Code
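The article's full listing did not survive here; the sketch below is my reconstruction of the complete version described above, with a convergence check included and visualization omitted:

```python
def gradient_descent(lr=0.1, x0=10.0, max_iter=1000, tol=1e-8):
    """Gradient descent on the toy loss f(x) = x**2, with a convergence check."""
    x = x0
    history = [x]                  # record the path for later visualization
    for _ in range(max_iter):
        g = 2 * x                  # gradient of x**2
        if abs(g) < tol:           # converged: gradient near zero
            break
        x -= lr * g                # step in the negative gradient direction
        history.append(x)
    return x, history

x_min, history = gradient_descent()
print(f"minimum near x = {x_min:.6f} after {len(history) - 1} steps")
```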
03 Academic Explanation
Gradient Descent is the most fundamental optimization algorithm in machine learning. Whether it's linear regression, neural networks, or deep learning, gradient descent is everywhere. Its core idea is remarkably simple: like descending a mountain, take each step in the steepest downhill direction until you reach the valley.
What is a Gradient?
A gradient can be understood as the direction of steepest ascent at your current position. Imagine standing on a slope in a valley — the gradient points in the direction that would increase your altitude the fastest (going uphill).
Why Follow the Negative Gradient Direction?
Imagine you're lost on a mountain and want to get down as quickly as possible. The most intuitive approach is: take each step in the steepest downhill direction. This is exactly the core idea of gradient descent!
How to Compute the Gradient?
Gradient descent aims to minimize the loss function, so the gradient is the partial derivative of the loss with respect to the parameters, not the derivative of the prediction function itself.
Take linear regression as an example: the prediction function is ŷ = wx + b, and the loss for a single sample is the squared error:

L = (ŷ − y)² = (wx + b − y)²

Let err = wx + b − y. Using the chain rule to compute the partial derivatives with respect to w and b:

∂L/∂w = 2 · err · x
∂L/∂b = 2 · err

The x term in b's derivative becomes 1, because the derivative of ŷ with respect to b is 1.
In actual training with n samples, the loss is averaged, i.e., MSE (Mean Squared Error):

J = (1/n) · Σᵢ (wxᵢ + b − yᵢ)²

The 1/n is a constant and is pulled straight out when differentiating. The partial derivatives become:

∂J/∂w = (2/n) · Σᵢ errᵢ · xᵢ
∂J/∂b = (2/n) · Σᵢ errᵢ
The 2/n is simply the 2 from the single-sample derivative multiplied by the 1/n from the MSE definition. The loop in the code accumulates errᵢ · xᵢ, and after the loop multiplies by 2/n, which exactly matches the formula.
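The loop referred to above is not reproduced here; a minimal version of it, with my own function name mse_gradients, might look like:

```python
def mse_gradients(w, b, xs, ys):
    """Accumulate err_i and err_i * x_i over all samples, then scale by 2/n."""
    n = len(xs)
    grad_w = 0.0
    grad_b = 0.0
    for x, y in zip(xs, ys):
        err = w * x + b - y   # err = wx + b - y
        grad_w += err * x     # accumulate err_i * x_i
        grad_b += err         # accumulate err_i
    return 2 / n * grad_w, 2 / n * grad_b

# Sanity check on data from y = 2x + 1: with perfect parameters the gradients vanish.
xs, ys = [1.0, 2.0, 3.0], [3.0, 5.0, 7.0]
print(mse_gradients(2.0, 1.0, xs, ys))  # (0.0, 0.0)
```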
Learning Rate: How Far Each Step Goes
The Learning Rate controls the length of each step. It is one of the most important hyperparameters in gradient descent.
Learning Rate Too Large
- Steps too large, easily overshoot the minimum
- May cause oscillation and divergence, never converging
- Loss value bounces back and forth around the minimum
Learning Rate Just Right
- Moderate step size, able to descend steadily
- Can reach a local or global minimum
- This is the effect we want
Learning Rate Too Small
- Steps too small, descending very slowly
- Requires many iterations to converge
- May stall in shallow local minima or flat regions before reaching a good solution
Each iteration updates the parameters by the rule θ ← θ − α · ∇J(θ), where α is the learning rate and ∇J(θ) is the gradient. Common learning rate choices include 0.001, 0.01, and 0.1.
How to Judge Convergence?
Gradient descent needs to stop at some point, otherwise it will run forever. Common convergence criteria include:
1. Set Number of Iterations
The simplest approach is to set a fixed number of iterations. The downside is that you need to experiment repeatedly to find a good value.
2. Loss Change Below Threshold
When the loss value change between two iterations is very small, we consider it converged.
3. Gradient Magnitude Below Threshold
When the gradient is close to zero, it means we've reached near the minimum point.
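The three criteria can be combined in one loop. This is an illustrative sketch with my own names (loss_tol, grad_tol), not the article's code:

```python
def gradient_descent(grad, loss, x, lr=0.1, max_iter=1000,
                     loss_tol=1e-9, grad_tol=1e-6):
    prev_loss = loss(x)
    for i in range(max_iter):                     # criterion 1: iteration cap
        g = grad(x)
        if abs(g) < grad_tol:                     # criterion 3: gradient near zero
            return x, i, "small gradient"
        x -= lr * g
        cur_loss = loss(x)
        if abs(prev_loss - cur_loss) < loss_tol:  # criterion 2: loss barely changes
            return x, i + 1, "small loss change"
        prev_loss = cur_loss
    return x, max_iter, "max iterations"

x, steps, reason = gradient_descent(lambda x: 2 * x, lambda x: x ** 2, x=10.0)
print(f"stopped after {steps} steps ({reason}), x = {x:.6g}")
```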
Key Takeaways
- Goal: find the minimum of the loss function, i.e., the optimal model parameters
- Direction: follow the negative gradient, i.e., the steepest descent direction
- Step size: controlled by the learning rate, determining how far each step moves
- Stopping: gradient close to zero, or maximum iterations reached
Loss Surface 3D Visualization
The image below shows the surface of the loss function J(w, b) = sin(w)·cos(b) + w²/8 + b²/8 that you're optimizing. High areas (warm colors) indicate large loss values, and low areas (cool colors) indicate small loss values.
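The figure itself can't be reproduced here, but the same surface can be explored numerically. The sketch below runs gradient descent on this J, using partial derivatives I derived by hand (∂J/∂w = cos(w)cos(b) + w/4, ∂J/∂b = −sin(w)sin(b) + b/4), which are not part of the original text:

```python
import math

def J(w, b):
    return math.sin(w) * math.cos(b) + w ** 2 / 8 + b ** 2 / 8

def grad_J(w, b):
    dw = math.cos(w) * math.cos(b) + w / 4      # ∂J/∂w
    db = -math.sin(w) * math.sin(b) + b / 4     # ∂J/∂b
    return dw, db

w, b = 2.0, 2.0   # arbitrary starting point on the surface
lr = 0.1
for _ in range(500):
    dw, db = grad_J(w, b)
    w -= lr * dw
    b -= lr * db
print(w, b, J(w, b))  # settles in a low (cool-colored) region of the surface
```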