Linear Regression
Use a straight line to describe patterns in data — the simplest and most important first step in machine learning
01 Core Principles (Plain English)
You recorded the temperature and ice cream sales for every day over the past year, and noticed that higher temperatures lead to more sales. Now your boss asks: if tomorrow is 35°C, how many will we sell?
You plot all the data points on paper, then line up a ruler — find a straight line that stays as close to all the points as possible. Once you've found it, use that line to make predictions.
Linear regression is having a machine automatically find that "best-fitting ruler."
Three Steps to Linear Regression
A line only needs two numbers: slope w (steepness) and intercept b (where it crosses the y-axis). The formula is just one line:

y = w × x + b
Square the distance from each data point to the line and take the average — this is called Mean Squared Error (MSE). The smaller this number, the better the line fits the data.
At each step, calculate which direction to adjust w and b to make the MSE smaller, then nudge them a tiny bit in that direction. Repeat hundreds of times, and the line naturally approaches the optimal position.
The whole process is: draw a random line → measure error → fine-tune → measure again → adjust again... until the line barely moves. Run the code below to see this process with your own eyes.
Building Linear Regression Step by Step
We'll break the complete code into pieces and examine what each step does.
Use the known y = 2x + 5 plus random noise to simulate real data, so you can see what the data structure looks like.
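A minimal sketch of such a generator (the name `makeData`, the sample count, and the noise range are illustrative assumptions, not taken from the demo):

```javascript
// Hypothetical generator: y = 2x + 5 plus uniform noise in [-1, 1).
function makeData(n) {
  const data = [];
  for (let i = 0; i < n; i++) {
    const x = Math.random() * 10;            // x in [0, 10)
    const noise = (Math.random() - 0.5) * 2; // noise in [-1, 1)
    data.push([x, 2 * x + 5 + noise]);       // each sample is an [x, y] pair
  }
  return data;
}

const data = makeData(100);
```

Because we built the data from a known line, we can later check whether training recovers w ≈ 2 and b ≈ 5.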
Use Mean Squared Error (MSE) to measure the deviation between the line and the data. The smaller the MSE, the better the line fits the data.
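One way that measurement looks as code (a sketch; the `mse` name and the `[x, y]` pair format are assumptions consistent with the snippets below):

```javascript
// Mean Squared Error of the line y = w*x + b over an array of [x, y] pairs.
function mse(w, b, data) {
  let sum = 0;
  for (const [x, y] of data) {
    const err = w * x + b - y; // signed gap between prediction and truth
    sum += err * err;          // squaring makes it positive and punishes big misses
  }
  return sum / data.length;    // average over all samples
}

// A line passing exactly through the data has MSE 0:
mse(2, 5, [[0, 5], [1, 7], [2, 9]]); // → 0
```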
Each iteration goes through all the data, computes the gradients for w and b separately, and takes a small step in the negative gradient direction.
Prediction function: ŷ = wx + b; single-sample loss: L = (wx + b − y)². Let err = wx + b − y; by the chain rule:

∂L/∂w = 2 · err · x,  ∂L/∂b = 2 · err
If the formulas feel abstract, skip them and look at the code directly — code is the most intuitive. Here's the code corresponding to the formulas:
const err = w * x + b - y; // prediction error
const dw = 2 * err * x;    // ∂L/∂w
const db = 2 * err;        // ∂L/∂b
The formulas above apply to any single sample (x, y). Extending to n samples, the loss is the average over all samples (MSE, Mean Squared Error); the per-sample gradients are accumulated and then divided by n, since the constant 1/n passes straight through the derivative:

∂MSE/∂w = (2/n) Σ err · x,  ∂MSE/∂b = (2/n) Σ err
As before, if the formulas feel abstract, go straight to the code; here is the n-sample version:
let dw = 0, db = 0;
for (const [x, y] of data) {
const err = w * x + b - y;
dw += err * x; // accumulate ∂L/∂w
db += err; // accumulate ∂L/∂b
}
dw = dw * 2 / n;
db = db * 2 / n;
The relationship between the loss function and the gradient is the single most important idea for getting started with machine learning:
- The loss function tells you how bad things are right now. It is a scalar (a single number).
- The gradient is obtained by taking partial derivatives of the loss function: one component per parameter, each telling you the direction in which adjusting that parameter makes the loss increase the fastest.
- During training we update parameters along the negative gradient direction, so the loss decreases as fast as possible. Geometrically, on the loss surface the gradient vector points straight uphill, so its opposite is the steepest downhill direction at the current position.
$\text{MSE} = \frac{1}{n}\sum(wx+b-y)^2$, let $\text{err} = wx+b-y$, then:

$\frac{\partial \text{MSE}}{\partial w} = \frac{2}{n}\sum \text{err} \cdot x, \qquad \frac{\partial \text{MSE}}{\partial b} = \frac{2}{n}\sum \text{err}$
Together the two components form [dw, db], the gradient vector on the loss surface. Stepping in the opposite direction, w −= α · dw and b −= α · db, is the parameter update.
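Putting the gradient code into a loop gives a minimal end-to-end sketch (the learning rate 0.01, the step count, and the zero initialization are illustrative choices, not taken from the demo):

```javascript
// Fit y = w*x + b to [x, y] pairs by gradient descent on the MSE.
function fit(data, lr = 0.01, steps = 2000) {
  const n = data.length;
  let w = 0, b = 0; // start from an arbitrary line
  for (let step = 0; step < steps; step++) {
    let dw = 0, db = 0;
    for (const [x, y] of data) {
      const err = w * x + b - y;
      dw += err * x; // accumulate ∂L/∂w
      db += err;     // accumulate ∂L/∂b
    }
    dw = dw * 2 / n;
    db = db * 2 / n;
    w -= lr * dw;    // step against the gradient...
    b -= lr * db;    // ...so the MSE shrinks
  }
  return { w, b };
}

// Noise-free points on y = 2x + 5, so fit should recover w ≈ 2, b ≈ 5:
const { w, b } = fit([[0, 5], [1, 7], [2, 9], [3, 11], [4, 13]]);
```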
Put the three parts together with visualization, and you get the complete demo code — see below.
What Do w and b Each Control?
Slope w
Controls the steepness of the line. Large w → steep line, a small increase in x causes a big increase in y; small w → gentle line; negative w → line slopes downward to the right.
Intercept b
Controls the vertical position of the line. When x=0, y=b. Adjusting b is like shifting the entire line up or down without changing the angle.
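A two-line check of both knobs (the `predict` helper is hypothetical, added here for illustration):

```javascript
// The whole model is one expression.
const predict = (w, b, x) => w * x + b;

// Slope: with w = 3, each +1 in x adds +3 to y.
predict(3, 0, 1) - predict(3, 0, 0); // → 3

// Intercept: raising b from 0 to 5 lifts the line by 5 at every x, angle unchanged.
predict(3, 5, 7) - predict(3, 0, 7); // → 5
```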
02 Code
03 Academic Explanation
Hypothesis Function
Given n training samples {(x₁,y₁), …, (xₙ,yₙ)}, linear regression assumes a linear mapping between x and y, represented by a parameterized function:

h(x) = wx + b
w is called the weight, and b is called the bias. The goal of learning is to find a set of (w, b) that makes h(x) as close to the true y as possible.
Objective Function
To make the model "close" to the data, we first need to quantitatively describe "how far off." Define the residual for the i-th sample:

εᵢ = h(xᵢ) − yᵢ
Why use squared error instead of absolute value? Two reasons: ① It's differentiable everywhere, making gradient computation convenient; ② It penalizes large errors more heavily. Taking the average of squared residuals across all samples gives the Mean Squared Error (MSE):

J(w, b) = (1/n) Σᵢ εᵢ² = (1/n) Σᵢ (wxᵢ + b − yᵢ)²
J(w,b) is the objective function, also called the loss function. The essence of training is an optimization problem:

(w*, b*) = arg min J(w, b), minimizing over all (w, b)
Gradient Derivation
Taking partial derivatives of J(w,b) with respect to w and b, by the chain rule:

∂J/∂w = (2/n) Σᵢ εᵢ xᵢ,  ∂J/∂b = (2/n) Σᵢ εᵢ
The gradient vector [∂J/∂w, ∂J/∂b] points in the direction where J increases the fastest. Taking a step in the opposite direction is the gradient descent update:

w ← w − α · ∂J/∂w,  b ← b − α · ∂J/∂b
α is the learning rate (step size), controlling the magnitude of each update. Iterate until the gradient approaches zero, i.e., reaching the minimum point. At this point J(w,b) converges — because MSE is a convex function with respect to w and b, the minimum is the global minimum.
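Convexity means gradient descent lands at the same minimum from any starting line. A quick numeric sketch of that claim (starting points and hyperparameters are illustrative assumptions):

```javascript
// Run gradient descent on fixed data from a given starting (w, b).
function descend(w, b, data, lr, steps) {
  const n = data.length;
  for (let i = 0; i < steps; i++) {
    let dw = 0, db = 0;
    for (const [x, y] of data) {
      const err = w * x + b - y;
      dw += err * x;
      db += err;
    }
    w -= lr * 2 * dw / n; // negative-gradient step for w
    b -= lr * 2 * db / n; // negative-gradient step for b
  }
  return [w, b];
}

// Points exactly on y = 2x + 5, so the global minimum is (w, b) = (2, 5).
const pts = [[0, 5], [1, 7], [2, 9], [3, 11]];
const a = descend(-10, 20, pts, 0.02, 5000); // start far above the data
const c = descend(15, -8, pts, 0.02, 5000);  // start somewhere else entirely
// Both runs converge to the same (w, b): the global minimum.
```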
Summary
- h(x) = wx + b
- εᵢ = h(xᵢ) − yᵢ
- J = (1/n)Σεᵢ² (MSE)
- ∂J/∂w, ∂J/∂b derived by chain rule
- parameter ← parameter − α · gradient
- MSE is convex w.r.t. the parameters; gradient descent converges to the global optimum