XGBoost
Turning weak trees into a strong model — not by training one great tree, but by adding many small trees that each specialize in fixing mistakes
✦ See It in Action: Fitting Residuals Tree by Tree
Click "Add One Tree" to watch each new tree consume the current error, or click "Auto Add" to play continuously. Adjust the learning rate and tree depth to see how convergence changes.
01 Core Idea (Plain English)
A novice chef cooks a dish and it comes out bland (prediction too low). A second chef specializes in "fixing" it — adding seasoning wherever it's needed. A third chef looks at what's still off and fixes that too…
XGBoost works the same way: each new tree corrects only the errors left by all previous trees. Sum every tree's contribution and you get an accurate final prediction.
XGBoost vs Random Forest
Random Forest (Parallel)
Trains many fully independent trees simultaneously, each on a random subset of the data. Advantage: naturally parallel, less prone to overfitting.
XGBoost (Sequential)
Trees are added one by one, each trained on the residuals of its predecessors. Advantage: higher final accuracy. Downside: must train sequentially; needs early stopping to prevent overfitting.
Three Core Steps
Before the first tree, initialize predictions with a constant — mean for regression, log-odds for classification.
Calculate the gap between current predictions and true values: residual = true − predicted. This "how much is still off" is exactly the negative gradient of the squared-error loss.
Train a new decision tree to fit the residuals, then multiply by learning rate η and add to the current prediction. Repeat from step 2 until the tree limit is reached or residuals are small enough.
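The three steps above can be sketched in a few lines. This is a minimal illustration with scikit-learn trees standing in for XGBoost's regularized trees; the data and settings are made up for the demo:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data: a noisy sine curve (illustrative, matching the tutorial's demo theme)
rng = np.random.default_rng(0)
X = np.linspace(0, 6, 200).reshape(-1, 1)
y = np.sin(X.ravel()) + rng.normal(0, 0.1, 200)

eta, n_trees = 0.1, 100
pred = np.full_like(y, y.mean())       # Step 1: initialize with the mean
trees = []
for _ in range(n_trees):
    residual = y - pred                # Step 2: how much is still off
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)
    pred += eta * tree.predict(X)      # Step 3: add a damped correction
    trees.append(tree)

rmse = np.sqrt(np.mean((y - pred) ** 2))
```

Prediction for new data is the same sum: the initial constant plus η times each tree's output.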
Learning Rate: Slow and Steady Wins the Race
η (learning rate) controls how much each tree contributes: η=1 trusts each new tree fully — fast convergence but prone to overfitting; η=0.1 takes tiny steps, needs more trees, but generalizes better. Industry standard: η=0.05–0.3 with 100–1000 trees.
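As an update rule, writing Fₘ for the ensemble after m trees and fₘ for the tree fit to the current residuals:

```latex
F_m(x) = F_{m-1}(x) + \eta\, f_m(x)
```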
Why "Gradient" Boosting?
Fitting residuals is gradient descent in function space: the residual is the negative gradient of MSE loss, and adding each tree takes one step in the gradient direction. XGBoost generalizes this to any twice-differentiable loss function, using first-order gradients (g) and second-order gradients (h) to determine each leaf node's optimal output value.
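For the two common losses mentioned above, g and h work out to simple closed forms. This is a sketch; the real library computes them internally from the `objective` you choose:

```python
import numpy as np

def mse_grad_hess(y_true, y_pred):
    """g and h for squared-error loss L = (y_true - y_pred)^2 / 2."""
    g = y_pred - y_true           # first-order gradient (negative residual)
    h = np.ones_like(y_true)      # second-order gradient: constant curvature
    return g, h

def logloss_grad_hess(y_true, raw_score):
    """g and h for binary log-loss on the raw (pre-sigmoid) score."""
    p = 1.0 / (1.0 + np.exp(-raw_score))  # predicted probability
    g = p - y_true
    h = p * (1.0 - p)             # curvature peaks at p = 0.5
    return g, h
```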
Building XGBoost Step by Step
Generate a noisy sine curve and hand-write a single regression tree using variance reduction.
Initialize predictions as the mean, compute residuals, and let a second tree fit those residuals.
Loop "compute residuals → fit residuals → update predictions" N times and watch the RMSE drop.
Compare overfitting with and without regularization. Add validation-set early stopping to prevent too many trees.
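The validation-based early stopping in the last step can be sketched on top of the same hand-rolled boosting loop (data and thresholds are illustrative; the real library exposes this as `early_stopping_rounds`):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Illustrative data: noisy sine, split into train and validation
rng = np.random.default_rng(1)
X = rng.uniform(0, 6, (300, 1))
y = np.sin(X.ravel()) + rng.normal(0, 0.2, 300)
X_tr, X_val, y_tr, y_val = X[:200], X[200:], y[:200], y[200:]

eta, patience = 0.1, 10
pred_tr = np.full_like(y_tr, y_tr.mean())
pred_val = np.full_like(y_val, y_tr.mean())
best_rmse, best_round, rounds_since_best = np.inf, 0, 0

for t in range(1, 501):
    tree = DecisionTreeRegressor(max_depth=3).fit(X_tr, y_tr - pred_tr)
    pred_tr += eta * tree.predict(X_tr)
    pred_val += eta * tree.predict(X_val)
    val_rmse = np.sqrt(np.mean((y_val - pred_val) ** 2))
    if val_rmse < best_rmse:
        best_rmse, best_round, rounds_since_best = val_rmse, t, 0
    else:
        rounds_since_best += 1
        if rounds_since_best >= patience:   # validation stopped improving
            break
```

Training error keeps falling forever; validation error bottoms out and turns back up, and `best_round` marks where to stop.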
02 Code
03 Deep Dive
Objective Function
XGBoost's objective = loss term + regularization term:
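Written out (standard form from the XGBoost paper, with n samples and K trees):

```latex
\mathrm{Obj} = \sum_{i=1}^{n} L\!\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega(f_k)
```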
L is a twice-differentiable loss function (MSE for regression, log-loss for classification). Ω(f) penalizes tree complexity:
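The complexity penalty takes the standard form:

```latex
\Omega(f) = \gamma T + \frac{1}{2}\,\lambda \sum_{j=1}^{T} w_j^{2}
```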
T is the number of leaf nodes, wⱼ is the output value of leaf j. γ penalizes the number of leaves (controls tree depth), λ applies L2 regularization to leaf weights.
Per-Round Approximate Objective
At round t, adding tree fₜ. Apply second-order Taylor expansion around the current prediction ŷ(t-1):
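Keeping only the terms that depend on the new tree fₜ:

```latex
\mathrm{Obj}^{(t)} \approx \sum_{i=1}^{n}\left[\, g_i\, f_t(x_i) + \tfrac{1}{2}\, h_i\, f_t^{2}(x_i)\,\right] + \Omega(f_t) + \mathrm{const}
```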
Where gᵢ = ∂L/∂ŷ(t-1) (first-order gradient), hᵢ = ∂²L/∂(ŷ(t-1))² (second-order gradient). For MSE loss: gᵢ = ŷ − y, the negative of the residual, and hᵢ = 1.
Optimal Leaf Weights
Group samples in each leaf node j, differentiate with respect to wⱼ and set to zero to get the optimal weight:
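The closed-form optimum is:

```latex
w_j^{*} = -\frac{G_j}{H_j + \lambda}
```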
Gⱼ = Σᵢ∈Iⱼ gᵢ and Hⱼ = Σᵢ∈Iⱼ hᵢ, where Iⱼ is the set of samples falling in leaf j. Substituting back gives the score for this tree structure:
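Plugging wⱼ* back in:

```latex
\mathrm{Obj}^{*} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^{2}}{H_j + \lambda} + \gamma T
```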
Split Gain
To evaluate whether a split is worthwhile, compute the reduction in the objective:
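For splitting a leaf into left (L) and right (R) children:

```latex
\mathrm{Gain} = \frac{1}{2}\left[\, \frac{G_L^{2}}{H_L + \lambda} + \frac{G_R^{2}}{H_R + \lambda} - \frac{(G_L + G_R)^{2}}{H_L + H_R + \lambda} \,\right] - \gamma
```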
Only split when Gain > 0. Larger γ makes splits harder to trigger, keeping trees shorter. This is the essence of XGBoost's built-in regularization.
Four Engineering Optimizations in XGBoost
Column Subsampling
Randomly select a subset of features per tree or per split. Reduces overfitting and speeds up training — same idea as feature sampling in Random Forests.
Histogram Approximation
Instead of evaluating all split points, bucket feature values into histograms and only search bucket boundaries. Dramatically faster than the naive approach.
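A sketch of histogram split finding for a single feature, using the gain formula from the Deep Dive section (function name and defaults are illustrative, not the library's API):

```python
import numpy as np

def best_split_histogram(x, g, h, n_bins=16, lam=1.0, gamma=0.0):
    """Sketch of histogram-based split finding for one feature:
    accumulate per-bin gradient sums, then scan only the bin edges."""
    # Quantile bin edges (n_bins - 1 interior boundaries)
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.searchsorted(edges, x)                      # bin index per sample
    G = np.bincount(bins, weights=g, minlength=n_bins)    # per-bin g sums
    H = np.bincount(bins, weights=h, minlength=n_bins)    # per-bin h sums

    def score(G_, H_):                                    # leaf score G^2 / (H + lambda)
        return G_ ** 2 / (H_ + lam)

    G_tot, H_tot = G.sum(), H.sum()
    best_gain, best_edge = 0.0, None
    G_L = H_L = 0.0
    for b in range(n_bins - 1):                           # candidate split after bin b
        G_L += G[b]
        H_L += H[b]
        gain = 0.5 * (score(G_L, H_L)
                      + score(G_tot - G_L, H_tot - H_L)
                      - score(G_tot, H_tot)) - gamma
        if gain > best_gain:
            best_gain, best_edge = gain, edges[b]
    return best_gain, best_edge
```

Only n_bins − 1 candidates are scanned instead of one per distinct feature value; the real `hist` tree method adds further tricks, such as reusing a sibling's histogram by subtraction.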
Sparsity-aware Split
Handles missing values and sparse one-hot features with a dedicated strategy. Automatically learns whether missing values should go left or right.
Cache-aware Access
Optimizes memory access patterns: feature values are kept in sorted column blocks, and gradient statistics are prefetched to raise CPU cache hit rates. Several times faster than a naive implementation.
Key Hyperparameters
n_estimators / num_round
Number of trees. More = higher accuracy, but too many causes overfitting. Use with early_stopping_rounds on a validation set.
learning_rate (η)
Scaling factor per tree. Smaller values need more trees. Typically 0.05–0.3: trade training time for better generalization.
max_depth
Maximum tree depth, controls model complexity. Typical values: 3–8. Too deep overfits; too shallow underfits.
subsample / colsample_bytree
Row and column sampling ratios (0–1). Adds randomness to prevent overfitting. Typically 0.7–0.9.
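A starting configuration that pulls these ranges together (sklearn-wrapper parameter names for `XGBRegressor`; values are illustrative starting points, not tuned):

```python
# Illustrative defaults drawn from the ranges above
params = {
    "n_estimators": 500,        # ceiling on trees; rely on early stopping
    "learning_rate": 0.1,       # within the usual 0.05-0.3 range
    "max_depth": 5,             # 3-8 typical
    "subsample": 0.8,           # row sampling ratio per tree
    "colsample_bytree": 0.8,    # column sampling ratio per tree
    "reg_lambda": 1.0,          # L2 on leaf weights (the lambda above)
    "gamma": 0.0,               # minimum split gain (the gamma above)
}
```

`n_estimators` acts as a cap: with `early_stopping_rounds` and a validation set, the model keeps only the best iteration.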
XGBoost vs LightGBM vs CatBoost
XGBoost (2016)
The library that made gradient boosting scalable and mainstream, introducing explicit regularization and second-order gradients. Won countless Kaggle competitions and remains the go-to baseline.
LightGBM (2017, Microsoft)
Histogram acceleration + leaf-wise tree growth. 5–10× faster than XGBoost, lower memory usage. First choice for large datasets.
CatBoost (2017, Yandex)
Native support for categorical features. Ordered Boosting reduces prediction shift. Most friendly for missing values and categorical variables.