01 Core Principles (Plain English)

Logistic regression solves binary classification problems: it outputs a single probability telling you how likely something "is" the target class.

Softmax regression extends this to multi-class problems: it simultaneously outputs K probabilities, one for each class, and all probabilities sum to 1.

How Does Sigmoid Become Softmax?

Logistic Regression (Binary Classification)

A single linear score z is squeezed through Sigmoid into (0,1), outputting the probability of "being class 1".

Softmax Regression (Multi-Class Classification)

K linear scores (logits) are normalized through Softmax into K probabilities — pick the class with the highest probability.

Softmax formula: $P(y=k) = \dfrac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$. The exponential amplifies differences, and normalization ensures probabilities sum to 1. When K=2, Softmax is equivalent to Sigmoid — logistic regression is a special case of Softmax.
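A quick check of the K=2 claim: with two scores $z_0$ and $z_1$, divide the numerator and denominator by $e^{z_1}$:

$$P(y=1) = \frac{e^{z_1}}{e^{z_0} + e^{z_1}} = \frac{1}{1 + e^{-(z_1 - z_0)}} = \sigma(z_1 - z_0)$$

which is exactly Sigmoid applied to the difference of the two scores.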

Build Softmax Regression Step by Step

Step 1 Generate Three-Class Data

Compared to logistic regression, the labels y extend from 0/1 to 0/1/2, with three classes distributed in different regions.
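As an illustration, here is a minimal sketch of such a dataset in plain JavaScript; the cluster centers, spread, and sample counts are assumptions for demonstration, not the article's exact data:

```javascript
// Box-Muller transform: sample from a standard normal distribution
function gaussian() {
  const u = 1 - Math.random();  // avoid log(0)
  const v = Math.random();
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
}

// Three Gaussian clusters, one per class; labels y are now 0/1/2
const centers = [[0, 0], [3, 0], [1.5, 2.5]];  // assumed positions
const X = [];  // feature vectors [x1, x2]
const y = [];  // class labels
for (let k = 0; k < 3; k++) {
  for (let i = 0; i < 50; i++) {
    X.push([centers[k][0] + 0.6 * gaussian(),
            centers[k][1] + 0.6 * gaussian()]);
    y.push(k);
  }
}
```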

Step 2 Softmax: Turn Scores into Probabilities

For K classes, first compute the raw score (logit) for each class, then convert them to a probability distribution using Softmax:

$$p(k) = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$$

The probabilities of all classes sum to exactly 1, and each probability is between (0,1). Non-vectorized implementation — expanded step by step for clarity:

// Suppose there are 3 classes, each with a raw score (logit)
const logits = [2.0, 1.0, 0.5];  // scores for class 0, 1, 2

// Step 1: exponentiate
const exps = [Math.exp(2.0), Math.exp(1.0), Math.exp(0.5)];
// [7.389, 2.718, 1.649]

// Step 2: sum
const sum = exps[0] + exps[1] + exps[2];  // 11.756

// Step 3: normalize to get probabilities
const probs = [exps[0]/sum, exps[1]/sum, exps[2]/sum];
// [0.629, 0.231, 0.140]  → sum = 1

Vectorized implementation (processes all classes simultaneously, subtracts max value to prevent exp overflow):
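A sketch of that vectorized version in the same plain JavaScript (the function name `softmax` is ours):

```javascript
// Numerically stable softmax: subtracting max(z) leaves the result
// unchanged (the shift cancels in the ratio) but keeps exp() from overflowing.
function softmax(logits) {
  const maxZ = Math.max(...logits);
  const exps = logits.map(z => Math.exp(z - maxZ));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / sum);
}

softmax([2.0, 1.0, 0.5]);    // ≈ [0.629, 0.231, 0.140], as above
softmax([1000, 999, 998.5]); // same result: only logit differences matter
```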

Step 3 Loss Function: Multi-Class Cross-Entropy

The general form of cross-entropy (summing over all classes):

$$\mathcal{L} = -\sum_{k=1}^{K} y_k \log p_k$$

where $y_k$ is the one-hot label: the class index is converted to a vector with 1 at the true class position and 0 elsewhere. For example, if the true class is 1 in a 3-class problem, then $\mathbf{y} = [0, 1, 0]$. In the summation, only the term for the true class $c$ is non-zero, so the formula simplifies to:

$$\mathcal{L} = -\log p_c$$

Here's the key point! In plain terms: Softmax outputs an array of probabilities, and the one-hot label y simply selects one entry from that array. In code it's extremely simple: just probs[y].

Compare with logistic regression: Sigmoid outputs a single number, and y ∈ {0, 1} makes a binary choice (yes or no); Softmax outputs K numbers, and y ∈ {0,1,...,K-1} is used as an index for selection — the essence is the same, just from a binary choice to a K-way choice.

$p_c$ is the probability the model assigns to the true class. The more accurate the prediction, the closer $p_c$ is to 1, and since $-\log(1)=0$ the loss approaches zero; when the prediction is completely wrong, $p_c \to 0$ and the loss grows toward infinity. The worse the mistake, the heavier the penalty.
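Putting the selection and the penalty together, a minimal sketch (the eps guard is our addition to avoid log(0)):

```javascript
// Cross-entropy loss for one sample: index into the probability
// array with the true class label, then apply -log.
function crossEntropy(probs, y) {
  const eps = 1e-12;                 // guard against log(0)
  return -Math.log(probs[y] + eps);
}

crossEntropy([0.7, 0.2, 0.1], 0);  // correct and confident → ≈ 0.357
crossEntropy([0.7, 0.2, 0.1], 2);  // wrong class → ≈ 2.303, heavy penalty
```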

Step 4 Training Loop: Gradient Descent Parameter Update

Each class has its own weight vector. The gradient form is identical to logistic regression: err = pred - label, except now K sets of parameters are updated simultaneously.
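A sketch of one such epoch for 2-D features; the names (`trainEpoch`, `W`, `b`, `lr`) are ours, and the stable `softmax` from Step 2 is inlined so the snippet runs on its own:

```javascript
const softmax = z => {
  const m = Math.max(...z);               // stability shift
  const e = z.map(v => Math.exp(v - m));
  const s = e.reduce((a, b) => a + b, 0);
  return e.map(v => v / s);
};

// One pass over the data: per sample, compute K probabilities, then
// update every class's (w, b) with err = predicted prob - one-hot label.
function trainEpoch(X, y, W, b, lr) {
  for (let i = 0; i < X.length; i++) {
    const logits = W.map((w, k) => w[0] * X[i][0] + w[1] * X[i][1] + b[k]);
    const probs = softmax(logits);
    for (let k = 0; k < W.length; k++) {
      const err = probs[k] - (y[i] === k ? 1 : 0);  // pred - label
      W[k][0] -= lr * err * X[i][0];
      W[k][1] -= lr * err * X[i][1];
      b[k]    -= lr * err;
    }
  }
}
```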

Combine all four segments with visualization to get the complete demo code — see below.

02 Code

03 Academic Explanation

Softmax regression (also known as multinomial logistic regression) extends logistic regression to multi-class problems, suitable for K mutually exclusive output classes.

Softmax Function

Given K classes with linear scores (logits) z₁, z₂, ..., z_K:

$$P(y=k \mid x) = \dfrac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$$

where $z_k = w_k \cdot x + b_k$ is the linear combination for class k.

Numerical Stability: In practice, subtract max(z) before computing exp — the result is unchanged but avoids floating-point overflow: $\dfrac{e^{z_k - \max(z)}}{\sum_j e^{z_j - \max(z)}}$

Loss Function: Multi-Class Cross-Entropy

Only penalizes the predicted probability of the true class:

$$\mathcal{L} = -\log P(y=c\mid x)$$

where c is the true class of the sample. This is equivalent to the full cross-entropy over all classes −Σ y_k log p_k (after one-hot encoding, non-true-class terms are 0).

Gradient Derivation

Differentiating Softmax and cross-entropy jointly yields an extremely concise gradient:

$$\frac{\partial L}{\partial w_k} = (p_k - \mathbf{1}[k=c])\cdot x$$

The gradient form is identical to logistic regression — err = predicted probability − true label, except each of the K classes has its own set of parameters.
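For readers who want the intermediate step: differentiate the loss with respect to the logit first, then apply the chain rule through $z_k = w_k \cdot x + b_k$:

$$\frac{\partial \mathcal{L}}{\partial z_k} = p_k - \mathbf{1}[k=c], \qquad \frac{\partial z_k}{\partial w_k} = x$$

Multiplying the two factors recovers the gradient above.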

Decision Boundaries

The decision boundary between class i and class j is the hyperplane where their probabilities are equal:

(w_i − w_j) · x + (b_i − b_j) = 0

K classes produce K(K-1)/2 pairwise decision boundaries, partitioning the feature space into K regions.
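The boundary equation follows directly because the exponential is monotone and the Softmax denominator is shared across classes:

$$P(y=i \mid x) = P(y=j \mid x) \;\Longleftrightarrow\; e^{z_i} = e^{z_j} \;\Longleftrightarrow\; w_i \cdot x + b_i = w_j \cdot x + b_j$$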

Summary

| Aspect | Softmax Regression |
|---|---|
| Task | Multi-class classification |
| Function | Softmax |
| Loss | Multi-class cross-entropy |
| Output | K probabilities |