Softmax Regression
The multi-class version of logistic regression — from "is it or not" to "which one is it"
01 Core Principles (Plain English)
Logistic regression solves binary classification problems: it outputs a single probability telling you how likely something "is" the target class.
Softmax regression extends this to multi-class problems: it simultaneously outputs K probabilities, one for each class, and all probabilities sum to 1.
How Does Sigmoid Become Softmax?
Logistic Regression (Binary Classification)
A single linear score z is squeezed through Sigmoid into (0,1), outputting the probability of "being class 1".
Softmax Regression (Multi-Class Classification)
K linear scores (logits) are normalized through Softmax into K probabilities — pick the class with the highest probability.
Softmax formula: $P(y=k) = \dfrac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$. The exponential amplifies differences, and normalization ensures probabilities sum to 1. When K=2, Softmax is equivalent to Sigmoid — logistic regression is a special case of Softmax.
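To make the K=2 equivalence concrete, here is a small sketch (the helper names are my own): Softmax over the two logits [z1, z0] gives exactly Sigmoid applied to the difference z1 − z0.

```javascript
// Sketch: 2-class Softmax reduces to Sigmoid on the logit difference.
function sigmoid(z) {
  return 1 / (1 + Math.exp(-z));
}

function softmax(logits) {
  const exps = logits.map(Math.exp);
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / sum);
}

const z1 = 1.7, z0 = 0.4;
const pSoftmax = softmax([z1, z0])[0]; // probability of class 1
const pSigmoid = sigmoid(z1 - z0);     // same value, ≈ 0.786
```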
Build Softmax Regression Step by Step
Compared to logistic regression, the labels y extend from 0/1 to 0/1/2, with three classes distributed in different regions.
For K classes, first compute the raw score (logit) for each class, then convert them to a probability distribution using Softmax: $P(y=k) = \dfrac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$
The probabilities of all classes sum to exactly 1, and each probability is between (0,1). Non-vectorized implementation — expanded step by step for clarity:
```javascript
// Suppose there are 3 classes, each with a raw score (logit)
const logits = [2.0, 1.0, 0.5]; // scores for class 0, 1, 2

// Step 1: exponentiate
const exps = [Math.exp(2.0), Math.exp(1.0), Math.exp(0.5)]; // [7.389, 2.718, 1.649]

// Step 2: sum
const sum = exps[0] + exps[1] + exps[2]; // 11.756

// Step 3: normalize to get probabilities
const probs = [exps[0] / sum, exps[1] / sum, exps[2] / sum]; // [0.629, 0.231, 0.140] → sum = 1
```
Vectorized implementation (processes all classes simultaneously, subtracts max value to prevent exp overflow):
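The vectorized version the text describes might look like this (a minimal sketch; the function name is my own):

```javascript
// Vectorized Softmax: process all classes at once, and subtract the
// max logit before exponentiating so Math.exp never overflows.
function softmax(logits) {
  const maxLogit = Math.max(...logits);                 // numerical-stability shift
  const exps = logits.map(z => Math.exp(z - maxLogit)); // shifted exponentials
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / sum);                        // normalize to probabilities
}

softmax([2.0, 1.0, 0.5]);  // ≈ [0.629, 0.231, 0.140], same as the step-by-step version
softmax([1000, 999, 998]); // still finite thanks to the max shift
```

The shift is safe because Softmax is invariant to adding a constant to every logit: the $e^{-\max}$ factor cancels in the numerator and denominator.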
The general form of cross-entropy (summing over all classes): $L = -\sum_{k=1}^{K} y_k \log p_k$
where $y_k$ is the one-hot label: the class index is converted to a vector with 1 at the true class position and 0 elsewhere. For example, if the true class is 1 in a 3-class problem, then $\mathbf{y} = [0, 1, 0]$. In the summation, only the term for the true class $c$ is non-zero, so the formula simplifies to: $L = -\log p_c$
Here's the key point! In plain terms: Softmax outputs a probability array, and one-hot encoding uses the true label y to pick out one entry of it. In code it's extremely simple: just probs[y].
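As a tiny sketch of that index selection (values reused from the step-by-step example above):

```javascript
// Cross-entropy loss is just the negative log of the probability
// that Softmax assigned to the true class's index.
const probs = [0.629, 0.231, 0.140]; // Softmax output
const y = 1;                          // true class index
const loss = -Math.log(probs[y]);     // -log(0.231) ≈ 1.465
```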
Compare with logistic regression: Sigmoid outputs a single number, and y ∈ {0, 1} makes a binary choice (yes or no); Softmax outputs K numbers, and y ∈ {0, 1, ..., K−1} is used as an index for selection. The essence is the same, just from a binary choice to a K-way choice.
$p_c$ is the probability the model assigns to the true class. The more accurate the prediction, the closer $p_c$ is to 1, and since $-\log(1)=0$ the loss approaches zero; when the prediction is completely wrong, $p_c \to 0$ and the loss approaches infinity. The worse the mistake, the heavier the penalty.
Each class has its own weight vector. The gradient form is identical to logistic regression: err = pred - label, except now K sets of parameters are updated simultaneously.
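A minimal sketch of one such update step, with illustrative numbers (the names and values are my own assumptions, not the demo's actual code):

```javascript
// One SGD step: for each class k, err_k = p_k - y_k, and class k's own
// weight vector and bias are updated with that error, all in one pass.
const K = 3, lr = 0.1;
const x = [1.5, -0.5];               // one 2-D sample
const yOneHot = [0, 1, 0];           // true class = 1
const probs = [0.629, 0.231, 0.140]; // Softmax output for this sample

// weights[k] is class k's weight vector, biases[k] its bias
const weights = [[0.1, 0.2], [0.0, -0.1], [0.3, 0.1]];
const biases = [0.0, 0.0, 0.0];

for (let k = 0; k < K; k++) {
  const err = probs[k] - yOneHot[k]; // same form as logistic regression
  for (let d = 0; d < x.length; d++) {
    weights[k][d] -= lr * err * x[d];
  }
  biases[k] -= lr * err;
}
```

Note the signs: the true class (err < 0) has its bias pushed up, while the wrong classes (err > 0) are pushed down.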
Combine all four segments with visualization to get the complete demo code — see below.
02 Code
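The interactive demo itself is not reproduced here; the following is a minimal self-contained sketch of the full pipeline (toy data, names, and hyperparameters are all my own assumptions):

```javascript
// Numerically stable Softmax
function softmax(logits) {
  const m = Math.max(...logits);
  const exps = logits.map(z => Math.exp(z - m));
  const s = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / s);
}

// Toy dataset: three classes in different regions of the plane
const X = [[2, 2], [2.5, 1.8], [-2, 2], [-1.8, 2.4], [0, -2], [0.3, -2.2]];
const Y = [0, 0, 1, 1, 2, 2];
const K = 3, lr = 0.5, epochs = 500;

// One weight vector and one bias per class
const W = Array.from({ length: K }, () => [0, 0]);
const b = [0, 0, 0];

for (let epoch = 0; epoch < epochs; epoch++) {
  for (let i = 0; i < X.length; i++) {
    const logits = W.map((wk, k) => wk[0] * X[i][0] + wk[1] * X[i][1] + b[k]);
    const p = softmax(logits);
    for (let k = 0; k < K; k++) {
      const err = p[k] - (Y[i] === k ? 1 : 0); // err = pred - label
      W[k][0] -= lr * err * X[i][0];
      W[k][1] -= lr * err * X[i][1];
      b[k] -= lr * err;
    }
  }
}

// Predict: pick the class with the highest probability
function predict(x) {
  const logits = W.map((wk, k) => wk[0] * x[0] + wk[1] * x[1] + b[k]);
  const p = softmax(logits);
  return p.indexOf(Math.max(...p));
}
```

After training, `predict` should place each cluster center in its own class, since the three regions are linearly separable.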
03 Academic Explanation
Softmax regression (also known as multinomial logistic regression) extends logistic regression to multi-class problems, suitable for K mutually exclusive output classes.
Softmax Function
Given K classes with linear scores (logits) z₁, z₂, ..., z_K, Softmax converts them into a probability distribution: $P(y=k) = \dfrac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$
where $z_k = w_k \cdot x + b_k$ is the linear combination for class k.
Loss Function: Multi-Class Cross-Entropy
Only penalizes the predicted probability of the true class: $L = -\log p_c$
where c is the true class of the sample. This is equivalent to the full cross-entropy over all classes −Σ y_k log p_k (after one-hot encoding, non-true-class terms are 0).
Gradient Derivation
After jointly differentiating Softmax + cross-entropy, the gradient form is extremely concise: $\dfrac{\partial L}{\partial z_k} = p_k - y_k$
The gradient form is identical to logistic regression: err = predicted probability − true label, except each of the K classes has its own set of parameters.
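The joint differentiation can be written out in one chain-rule step, using the Softmax Jacobian $\dfrac{\partial p_j}{\partial z_k} = p_j(\delta_{jk} - p_k)$:

```latex
\frac{\partial L}{\partial z_k}
  = -\frac{\partial \log p_c}{\partial z_k}
  = -\frac{1}{p_c}\, p_c\,(\delta_{ck} - p_k)
  = p_k - \delta_{ck}
  = p_k - y_k
```

The $p_c$ factors cancel, which is why the combined gradient is so much simpler than differentiating Softmax and the log separately.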
Decision Boundaries
The decision boundary between class i and class j is the hyperplane where their probabilities are equal: $(\mathbf{w}_i - \mathbf{w}_j) \cdot \mathbf{x} + (b_i - b_j) = 0$
K classes produce K(K-1)/2 pairwise decision boundaries, partitioning the feature space into K regions.
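A quick sketch verifying the boundary condition (the weights here are illustrative): any point satisfying $(\mathbf{w}_i - \mathbf{w}_j)\cdot\mathbf{x} + (b_i - b_j) = 0$ receives equal probabilities for classes i and j.

```javascript
function softmax(logits) {
  const m = Math.max(...logits);
  const exps = logits.map(z => Math.exp(z - m));
  const s = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / s);
}

const W = [[1, 0], [0, 1], [-1, -1]]; // one weight vector per class
const b = [0, 0, 0];

// Boundary between class 0 and class 1: (w0 - w1)·x + (b0 - b1) = 0,
// i.e. x - y = 0, so any point with x = y lies on it, e.g. (2, 2).
const x = [2, 2];
const logits = W.map((wk, k) => wk[0] * x[0] + wk[1] * x[1] + b[k]);
const p = softmax(logits);
// p[0] === p[1]: the point sits exactly on the 0/1 boundary
```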
Summary
Task: multi-class classification
Activation function: Softmax
Loss function: multi-class cross-entropy
Output: K probabilities