01 Core Principles (Plain English)

You've made a pot of soup, but it doesn't taste quite right and you want to improve the recipe. You can adjust the amount of salt — too much and it's too salty, too little and it's bland. How do you find the perfect amount?

A simple approach: try a little, taste it, sense which direction would make it better, adjust a tiny bit in that direction, taste again, adjust again... repeat a few times, getting closer and closer to the best recipe.

Gradient Descent is the mathematical version of this process: replacing "how good it tastes" with a "loss function", "which direction to adjust" with the "gradient direction", and "adjust a tiny bit" with the "learning rate".

Only Three Things

1. Calculate how bad it is (Loss Function)

Use a single number to measure how poorly the model is currently predicting. This number is called the "loss value", and we want to make it as small as possible.

2. Calculate which direction to adjust (Gradient)

The gradient tells you: in which direction would the loss increase the fastest. So going in the opposite direction means the loss decreases the fastest.

3. Take a small step (Learning Rate)

Take too big a step and you overshoot the minimum. The learning rate caps each update at a small step, so you approach the optimal solution gradually.

Repeat these three steps, and the loss value will keep getting smaller while the model gets more accurate. Stop when: the gradient is close to zero (you're already at the bottom of the valley), or you've reached the maximum number of iterations.

Learning Rate: Step Size is Key

Step Too Large

Like taking huge strides down a mountain, you overshoot to the opposite slope, bouncing back and forth never reaching the valley floor. The loss value oscillates or even grows.

Step Too Small

Like an ant crawling, the direction is right but it's absurdly slow, requiring tens of thousands of iterations to converge — completely impractical in real applications.

In practice, start with a learning rate of 0.01 and adjust based on the loss curve.
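You can see all three behaviors on a toy problem. Below is a minimal sketch (the function f(x) = x², its gradient 2x, and the specific rates are illustrative choices, not recommendations):

```python
# Gradient descent on f(x) = x**2, whose gradient is 2*x.
def run(lr, steps=50, x=1.0):
    for _ in range(steps):
        x = x - lr * 2 * x  # update rule: x <- x - lr * f'(x)
    return x

print(run(0.4))    # moderate rate: converges quickly toward 0
print(run(0.001))  # tiny rate: still far from 0 after 50 steps
print(run(1.1))    # oversized rate: |x| grows every step (diverges)
```

With lr = 1.1, each update multiplies x by -1.2, so the iterate bounces across the minimum with growing amplitude, which is exactly the oscillation described above.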

Building Gradient Descent Step by Step

Talk is cheap — let's build the code from scratch.

Step 1 Define the Function to Optimize

Use a quadratic function f(x) = x² to simulate a "loss function" — it has a minimum point, and our goal is to find it.
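In code, this is a one-liner (a sketch; the name `f` is our choice):

```python
# The "loss function" we want to minimize: a parabola with its minimum at x = 0.
def f(x):
    return x ** 2
```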

Step 2 Add a Loop to Make It "Run"

Let's first plot it to see what it looks like:
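In place of the plot, here is a minimal version of that naive loop, printing the trajectory instead (the fixed +1 update is deliberately wrong, which the next paragraph explains):

```python
def f(x):
    return x ** 2

x = -5.0
for i in range(10):
    x = x + 1  # blind update: always step right, regardless of the slope
    print(f"step {i}: x = {x:.1f}, f(x) = {f(x):.1f}")
```

Starting at x = -5, the loop walks right past the minimum at x = 0 and climbs back up the other side of the parabola.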

The problem is obvious: the `x + 1` update steps blindly in one direction. It has no idea which way actually makes f(x) smaller.

Step 3 Add Derivatives So It Knows Which Way to Go

The derivative (gradient) tells us the "slope" and "direction" of the function at the current point. Moving in the negative gradient direction means going downhill:
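Replacing the blind `x + 1` with a step against the derivative gives the core algorithm (a sketch; the starting point, learning rate, and step count are illustrative):

```python
def f(x):
    return x ** 2

def grad(x):
    return 2 * x  # derivative of x**2

x = -5.0
lr = 0.1  # learning rate
for i in range(50):
    x = x - lr * grad(x)  # move against the gradient: downhill
print(x)  # ends up very close to 0, the minimum
```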

We now have the core prototype of gradient descent. Adding convergence checks and visualization gives us the complete version — see the full code below.

02 Code
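Putting the pieces together, one possible complete version looks like this (a sketch; the quadratic f(x) = x², the tolerance, and names such as `history` are illustrative choices):

```python
def f(x):
    return x ** 2

def grad(x):
    return 2 * x

def gradient_descent(x0, lr=0.1, max_iters=1000, tol=1e-6):
    """Minimize f starting from x0; stop when the gradient is near zero."""
    x = x0
    history = [x]  # record the trajectory (useful for plotting later)
    for _ in range(max_iters):
        g = grad(x)
        if abs(g) < tol:  # convergence check: gradient close to zero
            break
        x = x - lr * g    # step in the negative gradient direction
        history.append(x)
    return x, history

x_min, history = gradient_descent(x0=-5.0)
print(f"minimum near x = {x_min:.6f} after {len(history)} steps")
```

Plotting `history` against `[f(x) for x in history]` with any plotting library shows the characteristic curve of the loss shrinking toward zero.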

03 Academic Explanation

Gradient Descent is the most fundamental optimization algorithm in machine learning. Whether it's linear regression, neural networks, or deep learning, gradient descent is everywhere. Its core idea is remarkably simple: like descending a mountain, take each step in the steepest downhill direction until you reach the valley.

What is a Gradient?

A gradient can be understood as the direction of steepest ascent at your current position. Imagine standing on a slope in a valley — the gradient points in the direction that would increase your altitude the fastest (going uphill).

[Figure: at point P(x, y) on the loss surface, the gradient direction points uphill; the negative gradient direction points toward the steepest descent to the valley bottom (minimum).]
Key Point: The gradient is a vector that points in the direction where the function value increases the fastest.

Why Follow the Negative Gradient Direction?

Imagine you're lost on a mountain and want to get down as quickly as possible. The most intuitive approach is: take each step in the steepest downhill direction. This is exactly the core idea of gradient descent!

[Figure: from the start point, each step follows the negative gradient direction, gradually approaching the minimum at the convergence point.]
Vivid Analogy: Imagine you're blindfolded on a hillside and can only feel the slope with your feet. Each step, you move in the steepest downhill direction you can feel. Eventually, you'll reach a valley — this is a local minimum.

How to Compute the Gradient?

Gradient descent aims to minimize the loss function, so the gradient is the partial derivative of the loss with respect to the parameters, not the derivative of the prediction function itself.

Take linear regression as an example: the prediction function is ŷ = wx + b, and the loss for a single sample is the squared error:

$$L = (wx+b-y)^2$$

Let err = wx + b − y, using the chain rule to compute partial derivatives with respect to w and b:

$$\frac{\partial L}{\partial w} = 2\cdot\text{err}\cdot x$$
$$\frac{\partial L}{\partial b} = 2\cdot\text{err}$$

There is no x factor in b's derivative, because the derivative of ŷ with respect to b is 1 (whereas its derivative with respect to w is x).

In actual training with n samples, the loss is averaged, i.e., MSE (Mean Squared Error):

$$L = \frac{1}{n}\sum_i (wx_i+b-y_i)^2$$

The 1/n factor is a constant, so it can be pulled outside the derivative. The partial derivatives become:

$$\frac{\partial L}{\partial w} = \frac{2}{n}\sum_i \text{err}_i\cdot x_i$$
$$\frac{\partial L}{\partial b} = \frac{2}{n}\sum_i \text{err}_i$$

2/n is simply the 2 from the single-sample derivative multiplied by 1/n from the MSE definition. The loop in the code accumulates errᵢ · xᵢ, and after the loop multiplies by 2/n, which exactly matches the formula.
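The derivation can be checked directly in code. Below is a sketch on a tiny made-up dataset (the data, learning rate, and iteration count are illustrative; NumPy's vectorized sum replaces the explicit accumulation loop):

```python
import numpy as np

# Tiny synthetic dataset following y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2 * x + 1

w, b = 0.0, 0.0
lr = 0.05
n = len(x)
for _ in range(2000):
    err = w * x + b - y                  # err_i = w*x_i + b - y_i
    grad_w = (2 / n) * np.sum(err * x)   # dL/dw = (2/n) * sum(err_i * x_i)
    grad_b = (2 / n) * np.sum(err)       # dL/db = (2/n) * sum(err_i)
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)  # approaches the true parameters w = 2, b = 1
```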

Learning Rate: How Far Each Step Goes

The Learning Rate controls the length of each step. It is one of the most important hyperparameters in gradient descent.

Learning Rate Too Large

Oscillating / Diverging
  • Steps too large, easily overshoot the minimum
  • May cause oscillation and divergence, never converging
  • Loss value bounces back and forth around the minimum

Learning Rate Just Right

Smooth Descent
  • Moderate step size, able to descend steadily
  • Can reach a local or global minimum
  • This is the effect we want

Learning Rate Too Small

Descending Too Slowly
  • Steps too small, descending very slowly
  • Requires many iterations to converge
  • Easily gets stuck in local optima
Parameter Update Formula:

$$\theta = \theta - \alpha \cdot \nabla J(\theta)$$

Where α is the learning rate and ∇J(θ) is the gradient. Common learning rate choices include 0.001, 0.01, 0.1, etc.

How to Judge Convergence?

Gradient descent needs to stop at some point, otherwise it will run forever. Common convergence criteria include:

1. Set Number of Iterations

The simplest approach is to set a fixed number of iterations. The downside is that you need to experiment repeatedly to find a good value.

# Set 1000 iterations
for i in range(1000):
    gradient_descent_step()

2. Loss Change Below Threshold

When the loss value change between two iterations is very small, we consider it converged.

# Stop when loss change is less than 0.0001
if abs(loss_prev - loss_curr) < 0.0001:
    break

3. Gradient Magnitude Below Threshold

When the gradient is close to zero, it means we've reached near the minimum point.

# Stop when gradient norm is less than 0.0001
if np.linalg.norm(gradient) < 0.0001:
    break

Key Takeaways

Goal

Find the minimum of the loss function, i.e., the optimal model parameters

Direction

Follow the negative gradient direction, i.e., the steepest descent direction

Step Size

Controlled by the learning rate, determining how far each step moves

Stopping

Gradient close to zero or maximum iterations reached

Loss Surface 3D Visualization

The image below shows the surface of the loss function J(w, b) = sin(w)·cos(b) + w²/8 + b²/8 that you're optimizing. High areas (warm colors) indicate large loss values, and low areas (cool colors) indicate small loss values.
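The surface can be reproduced numerically. The sketch below evaluates J on a grid and locates its lowest grid point (the grid range and resolution are illustrative choices):

```python
import numpy as np

def J(w, b):
    return np.sin(w) * np.cos(b) + w**2 / 8 + b**2 / 8

# Evaluate J on a grid covering the region around the origin.
w = np.linspace(-4, 4, 401)
b = np.linspace(-4, 4, 401)
W, B = np.meshgrid(w, b)
Z = J(W, B)

# Find the lowest point on the grid: a numerical stand-in for the valley floor.
i, j = np.unravel_index(np.argmin(Z), Z.shape)
print(f"lowest grid point: w = {W[i, j]:.2f}, b = {B[i, j]:.2f}, J = {Z[i, j]:.3f}")
```

Passing `W`, `B`, `Z` to any 3D surface plotter (e.g. matplotlib's `plot_surface`) reproduces the warm-high, cool-low picture described above.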