01 Core Principles (Plain English)

The linear regression we covered earlier is essentially about fitting a function: finding a line that stays as close as possible to all the data points.

Logistic regression has the opposite goal: we don't care about the specific values of the data points; we just want to find a line that separates two classes of data.

Why Not Use a Step Function Directly?

You might naturally think: just check which side of the line a data point is on — one side is 0, the other is 1. This is exactly the idea behind a step function.

Step Function

Jumps directly from 0 to 1 at the boundary, with zero gradient everywhere (undefined at the boundary). No gradient means gradient descent can't work.

Sigmoid Function

Also squeezes the output into (0, 1), but with a smooth transition that is differentiable everywhere. Gradient descent can use it to guide parameter updates.

Logistic regression uses Sigmoid to smooth out the step, enabling gradient descent training: σ(z) = 1 / (1 + e⁻ᶻ), z = w·x + b. The decision boundary is the line where z = 0, with different classes on each side.
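The difference between the two functions shows up directly in their gradients. A minimal numerical check (sample points and step size are my own choices for illustration):

```python
import numpy as np

def step(z):
    # Hard threshold: jumps from 0 to 1 at z = 0.
    return (z > 0).astype(float)

def sigmoid(z):
    # Smooth squashing into (0, 1), differentiable everywhere.
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, -0.5, 0.5, 2.0])

# Central-difference gradients: the step function is flat away from the
# boundary, while the sigmoid has a usable gradient at every point.
h = 1e-5
step_grad = (step(z + h) - step(z - h)) / (2 * h)
sig_grad = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)

print(step_grad)  # zeros: nothing for gradient descent to follow
print(sig_grad)   # strictly positive everywhere
```

Gradient descent needs that nonzero slope to know which direction to move the parameters.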

Building Logistic Regression Step by Step

We'll break the complete code into pieces and understand what each step does.

Step 1 Generate Labeled Data

Compared to linear regression, the data has an additional dimension — the label y, which only takes values 0 and 1.
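A minimal sketch of such data, assuming two Gaussian clusters (the cluster centers, spread, and random seed are my own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# Two clusters: class 0 centered at (-2, -2), class 1 centered at (2, 2).
x0 = rng.normal(loc=-2.0, scale=1.0, size=(n // 2, 2))
x1 = rng.normal(loc=2.0, scale=1.0, size=(n // 2, 2))

X = np.vstack([x0, x1])  # features, shape (n, 2)
y = np.concatenate([np.zeros(n // 2), np.ones(n // 2)])  # labels: 0 or 1

print(X.shape, y.shape)  # (100, 2) (100,)
```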

Step 2 Sigmoid: Giving the Step a Gradient

To separate two classes of data, it's natural to use a discriminant: z = w·x + b, classify as category 1 if z > 0, and category 0 if z < 0. Note that x here is a vector — with two-dimensional features, it's actually x₁ and x₂, written in full as:

$$z = w_1 x_1 + w_2 x_2 + b$$

z is the result of plugging the data point into the line equation: positive on one side, negative on the other. But using z directly for classification is a step function: output 1 if z > 0, otherwise 0. Its gradient is zero everywhere (and undefined at the jump), so gradient descent can't work.

The solution: wrap it in a Sigmoid, which squeezes z into a probability between (0,1) while remaining smooth and differentiable everywhere:

$$\hat{y} = \sigma(z) = \frac{1}{1+e^{-z}}$$

One line of code compresses any real number into (0,1) while guaranteeing differentiability everywhere — this is the core of logistic regression.
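That one line, sketched in numpy:

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into (0, 1); differentiable everywhere.
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))    # 0.5, exactly on the decision boundary
print(sigmoid(10.0))   # close to 1
print(sigmoid(-10.0))  # close to 0
```

A point exactly on the line (z = 0) gets probability 0.5; moving away from the line pushes the output toward 0 or 1.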

Step 3 Loss Function: Binary Cross-Entropy

Using mean squared error to measure classification performance has issues: combined with Sigmoid it produces a non-convex loss. Cross-entropy is a loss function specifically designed for probability outputs:

$$\mathcal{L} = -\bigl[y\log\hat{y} + (1-y)\log(1-\hat{y})\bigr]$$

ŷ is the probability output by Sigmoid, and y is the true label (0 or 1). The more accurate the prediction, the closer the log value is to 0, and the smaller the loss. When the prediction is completely wrong, log approaches negative infinity and the loss approaches infinity — the more wrong the prediction, the heavier the penalty.
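A minimal sketch of this loss (the small `eps` clip is a standard numerical guard I've added so log(0) never occurs; the sample predictions are illustrative):

```python
import numpy as np

def bce_loss(y_hat, y, eps=1e-12):
    # Clip predictions away from 0 and 1 to avoid log(0).
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y = np.array([1.0, 0.0])
good = bce_loss(np.array([0.9, 0.1]), y)  # confident and correct: small loss
bad = bce_loss(np.array([0.1, 0.9]), y)   # confident and wrong: large loss
print(good, bad)
```

The confidently wrong prediction is penalized far more heavily than the confidently correct one is rewarded.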

Step 4 Training Loop: Gradient Descent Parameter Update

Taking partial derivatives of the cross-entropy loss and expanding via the chain rule gives:

$$\frac{\partial L}{\partial w_1} = (\hat{y}-y)\cdot x_1$$
$$\frac{\partial L}{\partial w_2} = (\hat{y}-y)\cdot x_2$$
$$\frac{\partial L}{\partial b} = \hat{y}-y$$

Let err = ŷ − y; the form is exactly the same as the gradient for linear regression. This is a nice property of pairing cross-entropy with Sigmoid: the derivative of the Sigmoid and the derivative of the cross-entropy cancel out perfectly, leaving only the prediction error err. Extending to n samples by taking the average:

$$\frac{\partial L}{\partial w_1} = \frac{1}{n}\sum_i \text{err}_i \cdot x_{1i}\qquad \frac{\partial L}{\partial w_2} = \frac{1}{n}\sum_i \text{err}_i \cdot x_{2i}$$
$$\frac{\partial L}{\partial b} = \frac{1}{n}\sum_i \text{err}_i$$
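The training loop can be sketched as follows. The toy data, learning rate, epoch count, and seed are my own illustrative choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, lr=0.1, epochs=500):
    # One weight per feature, plus a scalar bias.
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        y_hat = sigmoid(X @ w + b)  # forward pass
        err = y_hat - y             # the shared factor in every gradient
        grad_w = X.T @ err / n      # averaged over the n samples
        grad_b = err.mean()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy data: two separable Gaussian clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])

w, b = train(X, y)
acc = np.mean((sigmoid(X @ w + b) > 0.5) == y)
print(acc)  # near 1.0 on this well-separated data
```

Note that the update rule is identical to linear regression's; only the forward pass (adding the Sigmoid) and the loss differ.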

Put the four parts together with visualization, and you get the complete demo code — see below.

02 Code
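A minimal console version of the complete demo, assembling the four steps above (data parameters, learning rate, epoch count, and seed are my own choices; a matplotlib scatter plot of the points and the boundary line can be layered on top):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(y_hat, y, eps=1e-12):
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Step 1: labeled data, two Gaussian clusters (illustrative parameters).
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])

# Steps 2-4: Sigmoid forward pass, cross-entropy loss, gradient descent.
w, b = np.zeros(2), 0.0
lr, epochs = 0.1, 1000
for epoch in range(epochs):
    y_hat = sigmoid(X @ w + b)
    err = y_hat - y
    w -= lr * (X.T @ err) / len(y)
    b -= lr * err.mean()
    if epoch % 200 == 0:
        print(f"epoch {epoch}: loss = {bce_loss(y_hat, y):.4f}")

acc = np.mean((sigmoid(X @ w + b) > 0.5) == y)
print(f"accuracy: {acc:.2f}")
print(f"decision boundary: {w[0]:.2f}*x1 + {w[1]:.2f}*x2 + {b:.2f} = 0")
```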

03 Academic Explanation

Although logistic regression has "regression" in its name, it is actually a classification algorithm. It is primarily used for binary classification problems: determining whether the result is "0" or "1".

Sigmoid Function

The core of logistic regression is the Sigmoid function, which maps any real number to the (0, 1) interval:

$$\sigma(z) = \dfrac{1}{1+e^{-z}}$$

Key Point: The output range of the Sigmoid function is (0, 1), which can be interpreted as a probability. When z > 0, the output is > 0.5 (positive class); when z < 0, the output is < 0.5 (negative class).

Logistic Regression Model

Pass the linear combination into the Sigmoid function:

$$P(y=1\mid x) = \sigma(wx+b) = \dfrac{1}{1+e^{-(wx+b)}}$$

The model outputs a probability value between 0 and 1. We typically use 0.5 as the threshold for classification.

Loss Function: Binary Cross-Entropy

Logistic regression uses Binary Cross-Entropy as the loss function:

$$\mathcal{L} = -\bigl[y\log\hat{y} + (1-y)\log(1-\hat{y})\bigr]$$

where ŷ is the predicted probability and y is the true label (0 or 1).

Why not use MSE? If using mean squared error, the loss function becomes non-convex, easily getting stuck in local optima. Cross-entropy is a convex function, so gradient descent can converge to the global optimum.

Decision Boundary

The decision boundary is the dividing line where the model classifies. For linear logistic regression, the decision boundary is a straight line:

$$wx + b = 0$$

Each side of the line belongs to a different class.
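With two features, the boundary equation can be rearranged to express one coordinate in terms of the other, which is how it would be drawn as a line. A small check with hypothetical parameter values (w1, w2, b below are made up for illustration):

```python
import numpy as np

# Hypothetical trained parameters, for illustration only.
w1, w2, b = 1.5, 2.0, -1.0

def boundary_x2(x1):
    # Solve w1*x1 + w2*x2 + b = 0 for x2.
    return -(w1 * x1 + b) / w2

# Points exactly on the boundary give z = 0, so Sigmoid outputs 0.5 there.
x1 = np.array([-1.0, 0.0, 1.0])
x2 = boundary_x2(x1)
z = w1 * x1 + w2 * x2 + b
print(z)  # ~[0, 0, 0]
```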

Summary

| Item | Value |
|------|-------|
| Task | Binary classification |
| Function | Sigmoid |
| Loss | Cross-Entropy |
| Output | Probability value |