See It in Action: Fitting Residuals Tree by Tree

Click "Add One Tree" to watch each new tree consume the current error, or click "Auto Add" to play continuously. Adjust the learning rate and tree depth to see how convergence changes.

[Interactive demo controls: learning rate η, max tree depth, trees added, current RMSE, history]

01 Core Idea (Plain English)

A novice chef cooks a dish and it comes out bland (prediction too low). A second chef specializes in "fixing" it — adding seasoning wherever it's needed. A third chef looks at what's still off and fixes that too…

XGBoost works the same way: each new tree corrects only the errors left by all previous trees. Summing every tree's contribution gives the final, accurate prediction.

XGBoost vs Random Forest

Random Forest (Parallel)

Trains many fully independent trees simultaneously, each on a random subset of the data. Advantage: naturally parallel, less prone to overfitting.

XGBoost (Sequential)

Trees are added one by one, each trained on the residuals of its predecessors. Advantage: higher final accuracy. Downside: must train sequentially; needs early stopping to prevent overfitting.

Three Core Steps

1. Initialize Predictions

Before the first tree, initialize predictions with a constant — mean for regression, log-odds for classification.

2. Compute Residuals (Negative Gradient)

Calculate the gap between current predictions and true values: residual = true − predicted. This is "how much is still off" — the negative gradient of the loss function.

3. Fit Residuals + Update Predictions

Train a new decision tree to fit the residuals, then multiply by learning rate η and add to the current prediction. Repeat from step 2 until the tree limit is reached or residuals are small enough.
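The three steps above can be sketched in plain Python. This is a toy version using depth-1 trees (stumps) on a noisy sine curve; `fit_stump` and all data are illustrative, not XGBoost's actual implementation:

```python
import math
import random

def fit_stump(x, r):
    """Depth-1 regression tree: pick the threshold on x that minimizes
    the summed squared error of the residuals r."""
    best = None
    for s in sorted(set(x))[1:]:                   # candidate thresholds
        left  = [ri for xi, ri in zip(x, r) if xi <  s]
        right = [ri for xi, ri in zip(x, r) if xi >= s]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((ri - lm) ** 2 for ri in left)
               + sum((ri - rm) ** 2 for ri in right))
        if best is None or sse < best[0]:
            best = (sse, s, lm, rm)
    _, s, lm, rm = best
    return lambda xi: lm if xi < s else rm

random.seed(0)
x = [i / 10 for i in range(50)]
y = [math.sin(xi) + random.gauss(0, 0.1) for xi in x]

eta = 0.3
pred = [sum(y) / len(y)] * len(x)                  # step 1: init with the mean
for t in range(30):
    resid = [yi - pi for yi, pi in zip(y, pred)]   # step 2: residuals
    tree = fit_stump(x, resid)                     # step 3: fit the residuals
    pred = [pi + eta * tree(xi) for pi, xi in zip(pred, x)]

rmse = math.sqrt(sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y))
print(rmse)
```

Each pass through the loop shrinks the residuals a little; the RMSE falls from the standard deviation of y toward the noise level.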

Learning Rate: Slow and Steady Wins the Race

η (learning rate) controls how much each tree contributes: η=1 trusts each new tree fully — fast convergence but prone to overfitting; η=0.1 takes tiny steps, needs more trees, but generalizes better. Industry standard: η=0.05–0.3 with 100–1000 trees.

Why "Gradient" Boosting?

Fitting residuals is gradient descent in function space: the residual is the negative gradient of MSE loss, and adding each tree is one step in the gradient direction. XGBoost extends this to any differentiable loss function, using first-order gradients (g) and second-order gradients (h) to determine each leaf node's optimal output value.
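This claim is easy to verify numerically. The sketch below (self-contained, not XGBoost code) finite-differences the ½-MSE loss and confirms that the first-order gradient g equals ŷ − y (minus the residual) and the second-order gradient h equals 1:

```python
# For L = ½(y - ŷ)², check g = ∂L/∂ŷ = ŷ - y and h = ∂²L/∂ŷ² = 1
# by central finite differences.
def loss(y, yhat):
    return 0.5 * (y - yhat) ** 2

y, yhat, eps = 3.0, 2.2, 1e-5
g_num = (loss(y, yhat + eps) - loss(y, yhat - eps)) / (2 * eps)
h_num = (loss(y, yhat + eps) - 2 * loss(y, yhat) + loss(y, yhat - eps)) / eps**2
print(g_num, h_num)   # ≈ -0.8 (= ŷ - y) and ≈ 1.0
```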

Building XGBoost Step by Step

Step 1 Generate Data + Write a Regression Tree

Generate a noisy sine curve and hand-write a single regression tree using variance reduction.
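The core of this step can be sketched as a split search (`variance` and `best_split` are illustrative names, not from the tutorial's code): the best threshold is the one that maximizes variance reduction, i.e. parent variance minus the size-weighted variance of the two children.

```python
def variance(v):
    m = sum(v) / len(v)
    return sum((x - m) ** 2 for x in v) / len(v)

def best_split(x, y):
    """Return the threshold with the largest variance reduction."""
    parent = variance(y)
    best_gain, best_s = 0.0, None
    for s in sorted(set(x))[1:]:                   # candidate thresholds
        left  = [yi for xi, yi in zip(x, y) if xi <  s]
        right = [yi for xi, yi in zip(x, y) if xi >= s]
        child = (len(left) * variance(left)
                 + len(right) * variance(right)) / len(y)
        if parent - child > best_gain:
            best_gain, best_s = parent - child, s
    return best_s, best_gain

x = [1, 2, 3, 10, 11, 12]
y = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
print(best_split(x, y))   # splits between the two clusters, at x = 10
```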

Step 2 Compute Residuals + Fit Residuals

Initialize predictions as the mean, compute residuals, and let a second tree fit those residuals.

Step 3 Loop Boosting N Rounds

Loop "compute residuals → fit residuals → update predictions" N times and watch the RMSE drop.

Step 4 Add Regularization and Early Stopping

Compare overfitting with and without regularization. Add validation-set early stopping to prevent too many trees.
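The early-stopping logic itself is small. A hypothetical sketch (`early_stop` and the example curve are illustrative): stop when the validation RMSE has not improved for `patience` consecutive rounds, then roll back to the best round.

```python
def early_stop(val_rmse, patience=3):
    """Return (best_round, best_rmse) for a per-round validation curve."""
    best_round, best = 0, float("inf")
    for t, r in enumerate(val_rmse):
        if r < best:
            best_round, best = t, r
        elif t - best_round >= patience:           # no improvement for too long
            break
    return best_round, best

# Typical shape: validation error falls, bottoms out, then rises (overfitting).
curve = [0.80, 0.52, 0.40, 0.35, 0.33, 0.34, 0.36, 0.39, 0.43]
print(early_stop(curve))   # → (4, 0.33)
```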

02 Code

03 Deep Dive

Objective Function

XGBoost's objective = loss term + regularization term:

Obj = Σᵢ L(yᵢ, ŷᵢ) + Σₖ Ω(fₖ)

L is a differentiable loss function (MSE for regression, log-loss for classification). Ω(f) penalizes tree complexity:

Ω(f) = γT + ½λ Σⱼ wⱼ²

T is the number of leaf nodes, wⱼ is the output value of leaf j. γ penalizes the number of leaves (controls tree depth), λ applies L2 regularization to leaf weights.

Per-Round Approximate Objective

At round t we add a new tree fₜ. Applying a second-order Taylor expansion of the loss around the current prediction ŷ⁽ᵗ⁻¹⁾, and dropping the constant term L(yᵢ, ŷ⁽ᵗ⁻¹⁾), gives:

Obj⁽ᵗ⁾ ≈ Σᵢ [ gᵢ fₜ(xᵢ) + ½ hᵢ fₜ²(xᵢ) ] + Ω(fₜ)

Where gᵢ = ∂L/∂ŷ⁽ᵗ⁻¹⁾ (first-order gradient) and hᵢ = ∂²L/∂(ŷ⁽ᵗ⁻¹⁾)² (second-order gradient). For the ½-MSE loss L = ½(y − ŷ)²: gᵢ = ŷ − y (the negative of the residual) and hᵢ = 1.

Optimal Leaf Weights

Collect the samples that fall into each leaf node j, differentiate the objective with respect to the leaf weight wⱼ, and set the derivative to zero to get the optimal weight:

wⱼ* = − Gⱼ / (Hⱼ + λ)

Gⱼ = Σᵢ∈j gᵢ, Hⱼ = Σᵢ∈j hᵢ. Substituting back gives the score for this tree structure:

Obj* = −½ Σⱼ Gⱼ² / (Hⱼ + λ) + γT
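Plugging numbers into w* = −G/(H + λ) makes the effect of λ concrete. For ½-MSE, gᵢ = ŷ − y and hᵢ = 1, so the optimal leaf weight is the mean residual shrunk toward zero by λ (the residual values below are illustrative):

```python
lam = 1.0
residuals = [0.5, 0.7, 0.6]          # y - ŷ for the samples in this leaf
g = [-r for r in residuals]          # gᵢ = ŷ - y = -residual
G, H = sum(g), float(len(g))         # hᵢ = 1 for ½-MSE, so H = leaf size
w = -G / (H + lam)
print(w)                             # 0.45: shrunk below the mean residual 0.6
```

With λ = 0 the weight would be exactly the mean residual 0.6; the L2 penalty pulls it toward zero, which is precisely the regularization effect described above.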

Split Gain

To evaluate whether a split is worthwhile, compute the reduction in the objective:

Gain = ½ [ G_L²/(H_L+λ) + G_R²/(H_R+λ) − (G_L+G_R)²/(H_L+H_R+λ) ] − γ

Only split when Gain > 0. Larger γ makes splits harder to trigger, keeping trees shorter. This is the essence of XGBoost's built-in regularization.
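Evaluating the gain formula with some hypothetical leaf statistics shows the pruning effect of γ directly:

```python
def split_gain(GL, HL, GR, HR, lam, gamma):
    """Reduction in the objective from splitting one node into L and R."""
    return 0.5 * (GL**2 / (HL + lam) + GR**2 / (HR + lam)
                  - (GL + GR)**2 / (HL + HR + lam)) - gamma

print(split_gain(-6, 4, 5, 4, lam=1, gamma=0))   # positive: split
print(split_gain(-6, 4, 5, 4, lam=1, gamma=7))   # negative: don't split
```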

Four Engineering Optimizations in XGBoost

Column Subsampling

Randomly select a subset of features per tree or per split. Reduces overfitting and speeds up training — same idea as feature sampling in Random Forests.

Histogram Approximation

Instead of evaluating all split points, bucket feature values into histograms and only search bucket boundaries. Dramatically faster than the naive approach.
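The bucketing idea can be sketched in a few lines (a simplified quantile binning, not XGBoost's weighted quantile sketch): only bin boundaries are tested as thresholds, so the split search is O(#bins) instead of O(#distinct values).

```python
def histogram_candidates(values, n_bins=4):
    """Quantile bin boundaries used as the only candidate thresholds."""
    v = sorted(values)
    return sorted({v[len(v) * k // n_bins] for k in range(1, n_bins)})

feature = [0.3, 9.1, 2.2, 7.7, 1.5, 4.4, 6.0, 3.8, 8.3, 5.2]
print(histogram_candidates(feature))   # 3 candidates instead of 9
```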

Sparsity-aware Split

Handles missing values and sparse one-hot features with a dedicated strategy. Automatically learns whether missing values should go left or right.

Cache-aware Access

Optimizes memory access patterns by storing gradient data column-wise to maximize CPU cache hit rates — several times faster than a naive implementation.

Key Hyperparameters

n_estimators / num_round

Number of trees. More = higher accuracy, but too many causes overfitting. Use with early_stopping_rounds on a validation set.

learning_rate (η)

Scaling factor per tree. Smaller values need more trees. Typically 0.05–0.3: trade training time for better generalization.

max_depth

Maximum tree depth, controls model complexity. Typical values: 3–8. Too deep overfits; too shallow underfits.

subsample / colsample_bytree

Row and column sampling ratios (0–1). Adds randomness to prevent overfitting. Typically 0.7–0.9.
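Putting the four hyperparameters together, a typical starting configuration might look like this (the values are illustrative defaults, not tuned for any dataset; the commented call assumes the `xgboost` Python package):

```python
params = {
    "n_estimators": 500,         # upper bound; early stopping picks the real count
    "learning_rate": 0.1,        # η
    "max_depth": 5,
    "subsample": 0.8,            # row sampling ratio per tree
    "colsample_bytree": 0.8,     # column sampling ratio per tree
}
# With the xgboost package this would be passed roughly as
#   XGBRegressor(**params).fit(X_train, y_train, eval_set=[(X_val, y_val)])
# with early stopping configured on the validation set.
print(params)
```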

XGBoost vs LightGBM vs CatBoost

XGBoost (2016)

The library that made gradient boosting mainstream, adding explicit regularization and second-order gradient optimization to the classic algorithm. Won countless Kaggle competitions and remains the go-to baseline.

LightGBM (2017, Microsoft)

Histogram acceleration + leaf-wise tree growth. 5–10× faster than XGBoost, lower memory usage. First choice for large datasets.

CatBoost (2017, Yandex)

Native support for categorical features. Ordered Boosting reduces prediction shift. Most friendly for missing values and categorical variables.