01 Core Principles (Plain English)

Should you bring an umbrella this morning? First ask: will it rain today? Yes → bring one. No → ask: will it be windy? Yes → bring one. No → don't bother.

That's a decision tree: break a complex decision into a chain of yes/no questions, follow the branches to a leaf node, and get your answer.

The algorithm does one thing: find the best question

1. Select a feature

Iterate over all features and find "which question splits the data most cleanly" — quantified using information gain or Gini index.

2. Split the node

Divide the dataset into two (or more) subsets based on the selected feature, then recursively repeat Step 1 on each subset.

3. Stopping condition

Stop splitting when a subset is pure (all one class), max depth is reached, or too few samples remain. The node becomes a leaf (the prediction result).

Information Gain vs Gini Index

Information Gain (ID3/C4.5)

Uses entropy to measure "impurity." Picks the feature that reduces entropy the most after splitting. Biased toward features with many values; C4.5 corrects this with gain ratio.

Gini Index (CART)

Measures the probability that two randomly drawn samples have different labels. Faster to compute than entropy (no logarithms). scikit-learn's tree implementation is CART-based, with Gini as the default criterion.

Random Forest = many decision trees voting together. Each tree is trained on a random subset of samples and features. The ensemble vote is much more accurate than any single tree and far less prone to overfitting.
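A minimal sketch of that voting effect, assuming scikit-learn is available (the dataset, sample counts, and hyperparameters here are illustrative, not from the original demo):

```python
# Compare one decision tree against a 100-tree random forest on noisy
# moon-shaped data; the ensemble's vote usually generalizes better.
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

print(f"single tree:   {tree.score(X_test, y_test):.3f}")
print(f"random forest: {forest.score(X_test, y_test):.3f}")
```

On noisy data like this, the unrestricted single tree tends to memorize the noise, while the forest's averaged vote smooths it out.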

Build a Decision Tree Step by Step

Step 1 Generate moon-shaped data

Use non-linear moon-shaped data for testing — linear models fail here, which perfectly showcases the strength of decision trees.
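One way to generate such data, assuming scikit-learn (the sample count and noise level are illustrative):

```python
# Two interleaving half-moons: a classic non-linearly-separable benchmark.
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=300, noise=0.25, random_state=0)
print(X.shape, y.shape)  # (300, 2) (300,)
```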

Step 2 Gini Index: quantifying "impurity"

A lower Gini index means purer data. The decision tree aims to minimize the Gini index of its subsets at every split.
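A from-scratch sketch of the Gini computation (function name and examples are my own, not the original demo's):

```python
import numpy as np

def gini(y):
    """Gini impurity of a label array: 1 - sum_i p_i^2."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini(np.array([0, 0, 0, 0])))  # pure node -> 0.0
print(gini(np.array([0, 0, 1, 1])))  # 50/50 mix -> 0.5 (maximum for 2 classes)
```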

Step 3 Find the optimal split point

Iterate over all features and all possible thresholds, and pick the split with the highest Gini gain — this is everything a decision tree "learns."
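That exhaustive scan can be sketched as follows (a brute-force version with hypothetical names; the thresholds tried here are the observed feature values):

```python
import numpy as np

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Try every feature and every candidate threshold; return the split
    whose two children have the lowest weighted Gini impurity."""
    n, d = X.shape
    best = (None, None, gini(y))  # (feature index, threshold, impurity)
    for j in range(d):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            if len(left) == 0 or len(right) == 0:
                continue  # degenerate split, skip
            weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
            if weighted < best[2]:
                best = (j, t, weighted)
    return best

# Toy data: feature 0 separates the classes perfectly at <= 0.2.
X = np.array([[0.0, 5.0], [0.2, 1.0], [0.8, 4.0], [1.0, 2.0]])
y = np.array([0, 0, 1, 1])
print(best_split(X, y))  # feature 0, threshold 0.2, impurity 0.0
```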

Step 4 Recursive tree building and prediction

After finding the best split, divide the data in two and recursively repeat the same process for the left and right subtrees until a stopping condition is met.
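The recursion can be sketched like this (self-contained, so the helper functions from the previous step are repeated; structure and names are my own):

```python
import numpy as np

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    n, d = X.shape
    best = (None, None, gini(y))
    for j in range(d):
        for t in np.unique(X[:, j]):
            mask = X[:, j] <= t
            left, right = y[mask], y[~mask]
            if len(left) == 0 or len(right) == 0:
                continue
            w = (len(left) * gini(left) + len(right) * gini(right)) / n
            if w < best[2]:
                best = (j, t, w)
    return best

def build_tree(X, y, depth=0, max_depth=5):
    """Recursively split until the node is pure, max_depth is hit,
    or no split improves impurity."""
    j, t, _ = best_split(X, y)
    if gini(y) == 0.0 or depth == max_depth or j is None:
        values, counts = np.unique(y, return_counts=True)
        return {"leaf": values[np.argmax(counts)]}  # majority class
    mask = X[:, j] <= t
    return {"feature": j, "threshold": t,
            "left": build_tree(X[mask], y[mask], depth + 1, max_depth),
            "right": build_tree(X[~mask], y[~mask], depth + 1, max_depth)}

def predict_one(node, x):
    """Walk from the root to a leaf by answering each node's question."""
    while "leaf" not in node:
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    return node["leaf"]

X = np.array([[0.0, 5.0], [0.2, 1.0], [0.8, 4.0], [1.0, 2.0]])
y = np.array([0, 0, 1, 1])
tree = build_tree(X, y)
print([predict_one(tree, x) for x in X])  # reproduces y on the training data
```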

Combine all four parts, add visualization, and you get the complete demo code — see below.

02 Code
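The original listing is not reproduced here; as a compact stand-in covering the same steps, a scikit-learn version (all parameters are assumed, and the visualization step is omitted):

```python
# Moon data + a Gini-criterion decision tree, evaluated on a held-out split.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=400, noise=0.25, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

clf = DecisionTreeClassifier(criterion="gini", max_depth=5, random_state=0)
clf.fit(X_tr, y_tr)
print(f"test accuracy: {clf.score(X_te, y_te):.3f}")
```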

03 Academic Explanation

Information Entropy and Information Gain

The entropy of a dataset S is defined as:

$$H(S) = -\sum_i p_i \log_2 p_i$$

where pᵢ is the proportion of class i samples. Higher entropy means more impurity (greater uncertainty).

Information gain after splitting on feature A:

$$\text{IG}(S,A) = H(S) - \sum_v \frac{|S_v|}{|S|} H(S_v)$$

Select the feature with the largest IG — i.e., the one that reduces uncertainty the most after splitting.
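Both definitions translate directly to code; a small worked example (function names are my own):

```python
import numpy as np

def entropy(y):
    """H(S) = -sum_i p_i log2 p_i."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(y, y_left, y_right):
    """IG(S, A) = H(S) - sum_v (|S_v| / |S|) H(S_v) for a binary split."""
    n = len(y)
    return (entropy(y)
            - (len(y_left) / n) * entropy(y_left)
            - (len(y_right) / n) * entropy(y_right))

y = np.array([0, 0, 1, 1])            # 50/50 -> H = 1 bit
print(entropy(y))                     # 1.0
print(information_gain(y, np.array([0, 0]), np.array([1, 1])))  # 1.0: perfect split
```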

Gini Index (CART)

The Gini impurity of node t:

$$\text{Gini}(t) = 1 - \sum_i p_i^2$$

CART selects the feature and threshold that maximize the weighted decrease in Gini impurity, building a binary tree (each split produces exactly two branches).

Overfitting and Pruning

An unrestricted decision tree will overfit (memorize the training set). Common control methods:

  • Pre-pruning: limit max depth (max_depth), minimum samples to split (min_samples_split)
  • Post-pruning: grow the full tree first, then merge unnecessary splits bottom-up (cost-complexity pruning)
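Both styles map onto scikit-learn parameters; a sketch with assumed data and an illustrative `ccp_alpha` value:

```python
# Pre-pruning vs post-pruning controls in scikit-learn.
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=300, noise=0.3, random_state=1)

# Pre-pruning: cap growth while the tree is being built.
pre = DecisionTreeClassifier(max_depth=4, min_samples_split=10).fit(X, y)

# Post-pruning: an unrestricted tree vs the same tree with
# minimal cost-complexity pruning applied (ccp_alpha > 0).
full = DecisionTreeClassifier(random_state=1).fit(X, y)
post = DecisionTreeClassifier(ccp_alpha=0.01, random_state=1).fit(X, y)

print(pre.get_depth(), full.get_depth(), post.get_depth())
```

A suitable `ccp_alpha` is usually chosen by cross-validation rather than fixed by hand as here.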

Variance Reduction in Random Forests

If n independent and identically distributed trees each have variance σ², the variance of their average is σ²/n. Random forests introduce diversity through bootstrap sampling + random feature subsets, reducing inter-tree correlation ρ. The ensemble variance becomes:

$$\text{Var} = \rho\sigma^2 + \frac{1-\rho}{n}\sigma^2$$

When the number of trees is large enough, the second term approaches 0, and ensemble error is dominated by inter-tree correlation.
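Plugging illustrative numbers into the formula (the values of ρ and σ² here are assumed, not measured):

```python
# Var = rho * sigma^2 + (1 - rho) * sigma^2 / n
def ensemble_variance(rho, sigma2, n):
    return rho * sigma2 + (1 - rho) * sigma2 / n

sigma2 = 1.0
for rho in (0.0, 0.3, 0.7):
    # With n = 500 trees, the second term is tiny; variance floors at rho * sigma^2.
    print(rho, ensemble_variance(rho, sigma2, n=500))
```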

Complexity

  • Training: O(n·d·log n), where n is the number of samples and d is the number of features
  • Prediction: O(depth), where depth is typically O(log n)
  • Random Forest Training: O(T·n·√d·log n), where T is the number of trees