Linear Regression
Use a straight line to describe patterns in data — the simplest and most important first step in machine learning
01 Core Principles (Plain English)
You recorded the temperature and ice cream sales for every day over the past year, and noticed that higher temperatures lead to more sales. Now your boss asks: if tomorrow is 35°C, how many will we sell?
You plot all the data points on paper, then line up a ruler — find a straight line that stays as close to all the points as possible. Once you've found it, use that line to make predictions.
Linear regression is having a machine automatically find that "best-fitting ruler."
Three Steps to Linear Regression
A line only needs two numbers: slope w (steepness) and intercept b (where it crosses the y-axis). The formula is just one line:

y = w × x + b
Square the distance from each data point to the line and take the average — this is called Mean Squared Error (MSE). The smaller this number, the better the line fits the data.
At each step, calculate which direction to adjust w and b to make the MSE smaller, then nudge them a tiny bit in that direction. Repeat hundreds of times, and the line naturally approaches the optimal position.
The whole process is: draw a random line → measure error → fine-tune → measure again → adjust again... until the line barely moves. Run the code below to see this process with your own eyes.
Building Linear Regression Step by Step
We'll break the complete code into pieces and examine what each step does.
Use the known y = 2x + 5 plus random noise to simulate real data, so you can see what the data structure looks like.
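A minimal sketch of such a generator (the name `makeData`, the sample count, and the noise range are illustrative assumptions, not taken from the demo):

```javascript
// Hypothetical generator: y = 2x + 5 plus uniform noise in [-1, 1).
function makeData(n) {
  const data = [];
  for (let i = 0; i < n; i++) {
    const x = Math.random() * 10;            // x in [0, 10)
    const noise = (Math.random() - 0.5) * 2; // noise in [-1, 1)
    data.push([x, 2 * x + 5 + noise]);       // each sample is an [x, y] pair
  }
  return data;
}

const data = makeData(100);
```

Because we built the data from a known line, we can later check whether training recovers w ≈ 2 and b ≈ 5.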
Use Mean Squared Error (MSE) to measure the deviation between the line and the data. The smaller the MSE, the better the line fits the data.
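One way that measurement looks as code (a sketch; the `mse` name and the `[x, y]` pair format are assumptions consistent with the snippets below):

```javascript
// Mean Squared Error of the line y = w*x + b over an array of [x, y] pairs.
function mse(w, b, data) {
  let sum = 0;
  for (const [x, y] of data) {
    const err = w * x + b - y; // signed gap between prediction and truth
    sum += err * err;          // squaring makes it positive and punishes big misses
  }
  return sum / data.length;    // average over all samples
}

// A line passing exactly through the data has MSE 0:
mse(2, 5, [[0, 5], [1, 7], [2, 9]]); // → 0
```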
Each iteration goes through all the data, computes the gradients for w and b separately, and takes a small step in the negative gradient direction.
Prediction function: ŷ = wx + b; single-sample loss: L = (wx + b − y)². Let err = wx + b − y; by the chain rule:

∂L/∂w = 2 · err · x,  ∂L/∂b = 2 · err
If the formulas feel abstract, skip them and look at the code directly — code is the most intuitive. Here's the code corresponding to the formulas:
const err = w * x + b - y; // prediction error
const dw = 2 * err * x;    // ∂L/∂w
const db = 2 * err;        // ∂L/∂b
The formulas above apply to any single sample (x, y). Extending to n samples, the loss is the average over all samples (MSE, Mean Squared Error); the per-sample gradients are accumulated and then divided by n, since the constant 1/n passes straight through the derivative:

∂MSE/∂w = (2/n) Σ err · x,  ∂MSE/∂b = (2/n) Σ err
As before, if the formulas feel abstract, go straight to the code; here is the n-sample version:
let dw = 0, db = 0;
for (const [x, y] of data) {
const err = w * x + b - y;
dw += err * x; // accumulate ∂L/∂w
db += err; // accumulate ∂L/∂b
}
dw = dw * 2 / n;
db = db * 2 / n;
The relationship between the loss function and the gradient is the single most important idea for getting started with machine learning:
- The loss function tells you how bad things are right now. It is a scalar (a single number).
- The gradient is obtained by taking partial derivatives of the loss function: one component per parameter, each telling you the direction in which adjusting that parameter makes the loss increase the fastest.
- During training we update parameters along the negative gradient direction, so the loss decreases as fast as possible. Geometrically, on the loss surface the gradient vector points straight uphill, so its opposite is the steepest downhill direction at the current position.
$\text{MSE} = \frac{1}{n}\sum(wx+b-y)^2$, let $\text{err} = wx+b-y$, then:

$\frac{\partial \text{MSE}}{\partial w} = \frac{2}{n}\sum \text{err} \cdot x, \qquad \frac{\partial \text{MSE}}{\partial b} = \frac{2}{n}\sum \text{err}$
Together the two components form [dw, db], the gradient vector on the loss surface. Stepping in the opposite direction, w −= α · dw and b −= α · db, is the parameter update.
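Putting the gradient code into a loop gives a minimal end-to-end sketch (the learning rate 0.01, the step count, and the zero initialization are illustrative choices, not taken from the demo):

```javascript
// Fit y = w*x + b to [x, y] pairs by gradient descent on the MSE.
function fit(data, lr = 0.01, steps = 2000) {
  const n = data.length;
  let w = 0, b = 0; // start from an arbitrary line
  for (let step = 0; step < steps; step++) {
    let dw = 0, db = 0;
    for (const [x, y] of data) {
      const err = w * x + b - y;
      dw += err * x; // accumulate ∂L/∂w
      db += err;     // accumulate ∂L/∂b
    }
    dw = dw * 2 / n;
    db = db * 2 / n;
    w -= lr * dw;    // step against the gradient...
    b -= lr * db;    // ...so the MSE shrinks
  }
  return { w, b };
}

// Noise-free points on y = 2x + 5, so fit should recover w ≈ 2, b ≈ 5:
const { w, b } = fit([[0, 5], [1, 7], [2, 9], [3, 11], [4, 13]]);
```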
Put the three parts together with visualization, and you get the complete demo code — see below.
What Do w and b Each Control?
Slope w
Controls the steepness of the line. Large w → steep line, a small increase in x causes a big increase in y; small w → gentle line; negative w → line slopes downward to the right.
Intercept b
Controls the vertical position of the line. When x=0, y=b. Adjusting b is like shifting the entire line up or down without changing the angle.
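A two-line check of both knobs (the `predict` helper is hypothetical, added here for illustration):

```javascript
// The whole model is one expression.
const predict = (w, b, x) => w * x + b;

// Slope: with w = 3, each +1 in x adds +3 to y.
predict(3, 0, 1) - predict(3, 0, 0); // → 3

// Intercept: raising b from 0 to 5 lifts the line by 5 at every x, angle unchanged.
predict(3, 5, 7) - predict(3, 0, 7); // → 5
```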
02 Code
03 Academic Explanation
Hypothesis Function
Given n training samples {(x₁,y₁), …, (xₙ,yₙ)}, linear regression assumes a linear mapping between x and y, represented by a parameterized function:

h(x) = wx + b
w is called the weight, and b is called the bias. The goal of learning is to find a set of (w, b) that makes h(x) as close to the true y as possible.
Objective Function
To make the model "close" to the data, we first need to quantitatively describe "how far off." Define the residual for the i-th sample:

εᵢ = h(xᵢ) − yᵢ
Why use squared error instead of absolute value? Two reasons: ① It's differentiable everywhere, making gradient computation convenient; ② It penalizes large errors more heavily. Taking the average of squared residuals across all samples gives the Mean Squared Error (MSE):

J(w, b) = (1/n) Σᵢ εᵢ² = (1/n) Σᵢ (wxᵢ + b − yᵢ)²
J(w,b) is the objective function, also called the loss function. The essence of training is an optimization problem:

(w*, b*) = arg min J(w, b), minimizing over all (w, b)
Gradient Derivation
Taking partial derivatives of J(w,b) with respect to w and b, by the chain rule:

∂J/∂w = (2/n) Σᵢ εᵢ xᵢ,  ∂J/∂b = (2/n) Σᵢ εᵢ
The gradient vector [∂J/∂w, ∂J/∂b] points in the direction where J increases the fastest. Taking a step in the opposite direction is the gradient descent update:

w ← w − α · ∂J/∂w,  b ← b − α · ∂J/∂b
α is the learning rate (step size), controlling the magnitude of each update. Iterate until the gradient approaches zero, i.e., reaching the minimum point. At this point J(w,b) converges — because MSE is a convex function with respect to w and b, the minimum is the global minimum.
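Convexity means gradient descent lands at the same minimum from any starting line. A quick numeric sketch of that claim (starting points and hyperparameters are illustrative assumptions):

```javascript
// Run gradient descent on fixed data from a given starting (w, b).
function descend(w, b, data, lr, steps) {
  const n = data.length;
  for (let i = 0; i < steps; i++) {
    let dw = 0, db = 0;
    for (const [x, y] of data) {
      const err = w * x + b - y;
      dw += err * x;
      db += err;
    }
    w -= lr * 2 * dw / n; // negative-gradient step for w
    b -= lr * 2 * db / n; // negative-gradient step for b
  }
  return [w, b];
}

// Points exactly on y = 2x + 5, so the global minimum is (w, b) = (2, 5).
const pts = [[0, 5], [1, 7], [2, 9], [3, 11]];
const a = descend(-10, 20, pts, 0.02, 5000); // start far above the data
const c = descend(15, -8, pts, 0.02, 5000);  // start somewhere else entirely
// Both runs converge to the same (w, b): the global minimum.
```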
Summary
- h(x) = wx + b
- εᵢ = h(xᵢ) − yᵢ
- J = (1/n)Σεᵢ² (MSE)
- ∂J/∂w, ∂J/∂b derived by chain rule
- parameter ← parameter − α · gradient
- MSE is convex w.r.t. the parameters; gradient descent converges to the global optimum