Demo: Single Tree vs Random Forest

Left: a single decision tree. Right: a random forest. Notice the jagged decision boundary on the left — that's overfitting. The right boundary is smoother and generalizes better. Adjust tree count or depth, then click "Retrain" to see changes.

[Interactive demo: sliders for "Number of trees" (default 20) and "Max tree depth" (default 5); readouts for single-tree and forest training accuracy; side-by-side decision-boundary plots titled "Single Decision Tree (prone to overfitting)" and "Random Forest (smoother boundary)".]

01 How It Works

Imagine you're deciding whether to bring an umbrella. If one friend says "no rain," you might not be sure. But if you ask 100 people and 80 say "it will rain," you trust the majority.

Random Forest works the same way: train many slightly different decision trees, then let them vote. Each tree can be wrong, but they're wrong in different directions, so the errors cancel out.

Why Do Decision Trees Need Random Forest?

Decision trees have one big weakness: high variance. A small change in training data can completely change the tree structure. Deep trees memorize every detail of training data (overfitting) and perform poorly on new data.

Core idea: inject diversity to gain stability. Each tree is trained on slightly different data and features, so even though no individual tree is perfect, their errors point in different directions and largely cancel out in the vote.
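The error-cancellation claim can be checked with a toy simulation. This is a hypothetical sketch, not part of the demo: each "tree" is just a classifier that is right 60% of the time, with independent errors, and we measure how often the majority vote is right.

```python
import random

random.seed(0)

def noisy_vote(p_correct=0.6, n_trees=100, n_trials=1000):
    """Fraction of trials where a majority of weak voters is correct."""
    wins = 0
    for _ in range(n_trials):
        # Each tree independently votes for the correct class with prob. 0.6.
        votes = sum(random.random() < p_correct for _ in range(n_trees))
        if votes > n_trees / 2:  # majority is correct
            wins += 1
    return wins / n_trials

print(noisy_vote())  # well above 0.6 for a single tree
```

Real trees are correlated rather than independent, so the gain is smaller in practice, which is exactly why Random Forest works to decorrelate them (see the two sources of randomness below).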

Two Sources of Randomness

1
Bootstrap Sampling (row-level randomness)

Each tree's training set is created by sampling with replacement from the original dataset. About 63.2% of the samples end up in a given bootstrap sample; the remaining ~36.8% are left out (the OOB samples, useful for estimating generalization error for free).
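The ~63%/37% split is easy to verify numerically. A minimal sketch using NumPy (assumed available): draw n indices with replacement and count how many distinct samples made it in.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# A bootstrap sample: n draws with replacement from indices 0..n-1.
boot = rng.integers(0, n, size=n)
in_bag = np.unique(boot).size / n

print(f"in-bag: {in_bag:.3f}, out-of-bag: {1 - in_bag:.3f}")
# The in-bag fraction approaches 1 - 1/e ≈ 0.632 as n grows.
```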

2
Feature Subsampling (column-level randomness)

At each split, only a random subset of about √p features is considered (p = total number of features; √p is the common default for classification, p/3 for regression). This prevents every tree from being dominated by the same strongest feature, increasing diversity between trees.
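The mechanism is just a fresh random draw of feature indices at every split. A minimal sketch (p = 16 is an assumed example value):

```python
import math
import random

random.seed(0)

p = 16                    # total number of features (assumed for illustration)
mtry = int(math.sqrt(p))  # √p features considered per split → 4 here

def candidate_features(p, mtry):
    # A *fresh* random subset at every split, not one fixed subset per tree.
    return random.sample(range(p), mtry)

print(candidate_features(p, mtry))  # a different subset on each call
```

Drawing a new subset per split (rather than per tree) is what distinguishes Random Forest from plain bagging of trees.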

Bagging vs Boosting

Random Forest (Bagging)

All trees are trained in parallel independently. Each tree is a full deep tree. Final prediction by majority vote or average. Goal: reduce variance, prevent overfitting. Fast to train, easy to parallelize.

XGBoost (Boosting)

Trees are trained sequentially, each correcting the errors of the previous one. Each tree is intentionally shallow (weak learner). Goal: reduce bias, improve accuracy. Must be trained in order; more prone to overfitting.

How Voting Works

1
Classification: Majority Vote

Each tree predicts a class label. The final prediction is the class with the most votes.

2
Regression: Average

Each tree predicts a numeric value. The final prediction is the average of all trees' predictions.
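Both aggregation rules fit in a few lines. A sketch with hypothetical per-tree outputs for a single input point:

```python
from collections import Counter

# Classification: each tree votes for a class label; majority wins.
class_votes = ["rain", "rain", "no rain", "rain", "no rain"]
majority = Counter(class_votes).most_common(1)[0][0]
print(majority)  # "rain" (3 votes vs 2)

# Regression: each tree predicts a number; the forest averages them.
reg_preds = [3.1, 2.8, 3.4, 3.0]
average = sum(reg_preds) / len(reg_preds)
print(average)  # 3.075
```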

Building a Random Forest Step by Step

Step 1 Bootstrap Sampling

Sample with replacement from the original dataset. Each bootstrap sample looks slightly different — that's what makes trees diverse.

Step 2 Single Decision Tree

Train one CART classification tree using Gini impurity. See how it learns a decision boundary.

Step 3 Feature Subsampling

At each split, only consider a random subset of features. Compare two forests: with and without feature subsampling.

Step 4 Ensemble Voting + Accuracy Curve

Train N trees, vote for final prediction, and watch accuracy rise and stabilize as more trees are added.

02 Code
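A minimal sketch of the demo's setup using scikit-learn (assumed available): a single unconstrained tree vs. a 20-tree forest on a noisy 2-D dataset, mirroring the default slider values above. The dataset and seeds are illustrative choices, not the demo's actual data.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy two-class 2-D data, so the decision boundary is easy to visualize.
X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# An unconstrained tree memorizes the training set (train accuracy = 1.0).
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# The forest matches the demo defaults: 20 trees, max depth 5.
forest = RandomForestClassifier(n_estimators=20, max_depth=5,
                                random_state=0).fit(X_tr, y_tr)

print("tree   train/test:", tree.score(X_tr, y_tr), tree.score(X_te, y_te))
print("forest train/test:", forest.score(X_tr, y_tr), forest.score(X_te, y_te))
```

The single tree typically shows a large gap between train and test accuracy (the jagged, overfit boundary), while the forest's two scores sit closer together.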

03 Deep Dive

Bias-Variance Decomposition

Prediction error can be decomposed into three components:

E[(y − f̂(x))²] = Bias²(f̂) + Var(f̂) + σ²

Deep decision trees have low bias but high variance (they memorize training data). Bagging averages B trees; if the trees were independent, this would cut the variance by a factor of B:

Var(f̄) = Var(fₜ) / B

In practice the trees are correlated (they see overlapping data), so the variance of the average is Var(f̄) = ρσ² + ((1 − ρ)/B)·σ², where σ² = Var(fₜ) and ρ is the pairwise correlation between trees. As B grows the second term vanishes but the first does not, so ρ sets the floor; feature subsampling reduces ρ and brings the variance reduction closer to the independent-tree limit.

Out-of-Bag (OOB) Error

Each bootstrap sample only uses ~63.2% of the data; the remaining ~36.8% are OOB samples. For each training sample, we can use the trees that didn't train on it to predict its label — a free, unbiased estimate of generalization error without a separate validation set:

OOB Error = (1/n) Σᵢ 𝟙[ŷᵢ_oob ≠ yᵢ]
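In scikit-learn (assumed here) this estimate comes for free by passing oob_score=True; the synthetic dataset is illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# oob_score=True: each sample is predicted only by trees that never saw it.
forest = RandomForestClassifier(n_estimators=200, oob_score=True,
                                random_state=0).fit(X, y)

print("OOB accuracy:", forest.oob_score_)  # OOB error = 1 - oob_score_
```

Because every sample is scored only by the trees that excluded it, no separate validation split is needed.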

Feature Importance

Random Forest naturally provides feature importance scores. The most common is Gini importance:

Importance(j) = Σₜ Σₙ∈t [p(n) · ΔGini(n, j)]

Sum up the weighted Gini impurity decrease for all splits on feature j across all trees. Another method is permutation importance: randomly shuffle feature j and measure how much OOB error increases — a larger increase means the feature is more important.
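Both importance measures are exposed directly in scikit-learn (assumed here). A sketch on synthetic data where, by construction, only the first 3 of 10 features are informative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# shuffle=False keeps the 3 informative features in columns 0-2.
X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Gini importance: weighted impurity decrease per feature, normalized to sum to 1.
print(forest.feature_importances_)

# Permutation importance: accuracy drop when each feature is shuffled.
perm = permutation_importance(forest, X, y, n_repeats=5, random_state=0)
print(perm.importances_mean)
```

Gini importance is cheap but can favor high-cardinality features; permutation importance is slower but measures what the model actually relies on.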

Number of Trees (n_estimators)

As you add more trees to a Random Forest, test error falls and then plateaus. Unlike Boosting, adding more trees does not cause overfitting; it only costs training and prediction time. In practice 100–500 trees is usually sufficient; beyond that, returns diminish.

More trees in a Random Forest can only help or do nothing; more trees in XGBoost can overfit. This is one of the most important practical differences between Bagging and Boosting.
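The plateau is easy to see empirically. A sketch with scikit-learn (assumed; dataset and seeds are illustrative): refit forests of increasing size and watch test accuracy rise and then level off rather than fall back down.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=600, noise=0.3, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

# Accuracy climbs quickly, then flattens; no drop at large n_estimators.
for n in (1, 5, 20, 100, 300):
    acc = RandomForestClassifier(n_estimators=n, random_state=1) \
              .fit(X_tr, y_tr).score(X_te, y_te)
    print(n, round(acc, 3))
```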

Random Forest vs XGBoost: When to Use Which?

Prefer Random Forest

Smaller datasets, fewer features, need quick results, care about interpretability, don't want to tune hyperparameters. OOB score gives built-in validation; almost no tuning needed.

Prefer XGBoost

Chasing maximum accuracy, large datasets, competitions, willing to tune hyperparameters. Usually outperforms Random Forest on tabular data but requires tuning learning_rate, max_depth, n_estimators, etc.