Decision Tree
Classify data like playing "20 Questions" — ask one question at a time, follow the answers, and arrive at a conclusion
01 Core Principles (Plain English)
Should you bring an umbrella this morning? First ask: will it rain today? Yes → bring one. No → ask: will it be windy? Yes → bring one. No → don't bother.
That's a decision tree: break a complex decision into a chain of yes/no questions, follow the branches to a leaf node, and get your answer.
The algorithm does one thing: find the best question
Step 1 — Find the best question: iterate over all features and find which split separates the data most cleanly — quantified using information gain or the Gini index.
Step 2 — Split: divide the dataset into two (or more) subsets based on the selected feature, then recursively repeat Step 1 on each subset.
Step 3 — Stop: when a subset is pure (all one class), max depth is reached, or too few samples remain, stop splitting. The node becomes a leaf (the prediction result).
Information Gain vs Gini Index
Information Gain (ID3/C4.5)
Uses entropy to measure "impurity." Picks the feature that reduces entropy the most after splitting. Biased toward features with many values; C4.5 corrects this with gain ratio.
Gini Index (CART)
Measures the probability that two randomly picked samples have different labels. Faster to compute (no logarithms). scikit-learn's decision trees implement CART and use Gini by default.
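The two criteria are easy to compare side by side. A minimal sketch (the helper names `entropy` and `gini` are mine, not from any original demo):

```python
import numpy as np

def entropy(y):
    """Shannon entropy (base 2) of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(y):
    """Gini impurity: probability two random samples have different labels."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

mixed = np.array([0, 0, 1, 1])   # 50/50 split: maximum impurity for 2 classes
pure = np.array([1, 1, 1, 1])    # single class: both measures give 0
print(gini(mixed), entropy(mixed))  # 0.5 1.0
```

Both measures hit 0 on a pure node and their maximum on a 50/50 split, so in practice they usually pick very similar splits.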
Random Forest = many decision trees voting together. Each tree is trained on a random subset of samples and features. The ensemble vote is much more accurate than any single tree and far less prone to overfitting.
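As a rough illustration of the ensemble effect, a sketch comparing one unrestricted tree against a 100-tree forest on noisy moon data (dataset size, noise level, and seeds are arbitrary choices of mine):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# noisy, non-linear two-class data
X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

print("single tree test accuracy:", tree.score(X_te, y_te))
print("forest test accuracy:     ", forest.score(X_te, y_te))
```

On noisy data like this, the forest's test accuracy typically lands a few points above the single tree's.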
Build a Decision Tree Step by Step
Use non-linear moon-shaped data for testing — linear models fail here, which perfectly showcases the strength of decision trees.
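To see why this dataset makes a good test, a small sketch pitting a linear model against a shallow tree (model settings are illustrative):

```python
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=400, noise=0.25, random_state=0)

linear = LogisticRegression().fit(X, y)                           # one straight boundary
tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X, y)

print("linear model :", linear.score(X, y))   # capped by the best straight cut
print("decision tree:", tree.score(X, y))     # axis-aligned splits trace the curve
```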
A lower Gini index means purer data. The decision tree aims to minimize the Gini index of its subsets at every split.
Iterate over all features and all possible thresholds, and pick the split with the highest Gini gain — this is everything a decision tree "learns."
After finding the best split, divide the data in two and recursively repeat the same process for the left and right subtrees until a stopping condition is met.
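The split search and the recursion just described can be sketched end to end on a toy dataset (all function names are my own; this is not the full demo code):

```python
import numpy as np

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Scan every feature and threshold; the lowest weighted child Gini wins
    (equivalent to the highest Gini gain, since the parent's Gini is fixed)."""
    best_f, best_t, best_score = None, None, np.inf
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f]):
            left = X[:, f] <= t
            if left.all() or not left.any():
                continue                      # degenerate split, skip
            score = (left.sum() * gini(y[left])
                     + (~left).sum() * gini(y[~left])) / len(y)
            if score < best_score:
                best_f, best_t, best_score = f, t, score
    return best_f, best_t

def build(X, y, depth=0, max_depth=3, min_samples=2):
    # stop on a pure node, the depth limit, or too few samples -> leaf
    if gini(y) == 0 or depth >= max_depth or len(y) < min_samples:
        return {"leaf": int(np.bincount(y).argmax())}
    f, t = best_split(X, y)
    if f is None:                             # no valid split found
        return {"leaf": int(np.bincount(y).argmax())}
    m = X[:, f] <= t
    return {"f": f, "t": t,
            "L": build(X[m], y[m], depth + 1, max_depth, min_samples),
            "R": build(X[~m], y[~m], depth + 1, max_depth, min_samples)}

def predict(node, x):
    while "leaf" not in node:
        node = node["L"] if x[node["f"]] <= node["t"] else node["R"]
    return node["leaf"]

# tiny toy set: feature 0 alone separates the two classes
X = np.array([[0.1, 5.0], [0.2, 1.0], [0.9, 4.0], [0.8, 2.0]])
y = np.array([0, 0, 1, 1])
tree = build(X, y)
print([predict(tree, x) for x in X])  # [0, 0, 1, 1]
```

Here one split on feature 0 already produces two pure leaves, so the recursion stops immediately on both sides.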
Combine all four parts, add visualization, and you get the complete demo code — see below.
02 Code
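The original demo code is not reproduced here; the following is a minimal reconstruction of the four parts described in section 01 — moon data, Gini impurity, exhaustive split search, and recursive building — with the visualization step omitted and all names my own:

```python
import numpy as np
from sklearn.datasets import make_moons

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    # part 3: scan every feature/threshold for the lowest weighted Gini
    best_f, best_t, best_score = None, None, np.inf
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f]):
            left = X[:, f] <= t
            if left.all() or not left.any():
                continue
            score = (left.sum() * gini(y[left])
                     + (~left).sum() * gini(y[~left])) / len(y)
            if score < best_score:
                best_f, best_t, best_score = f, t, score
    return best_f, best_t

def build(X, y, depth=0, max_depth=6, min_samples=5):
    # part 4: recurse until pure, too deep, or too small
    if gini(y) == 0 or depth >= max_depth or len(y) < min_samples:
        return {"leaf": int(np.bincount(y).argmax())}
    f, t = best_split(X, y)
    if f is None:
        return {"leaf": int(np.bincount(y).argmax())}
    m = X[:, f] <= t
    return {"f": f, "t": t,
            "L": build(X[m], y[m], depth + 1, max_depth, min_samples),
            "R": build(X[~m], y[~m], depth + 1, max_depth, min_samples)}

def predict(node, x):
    while "leaf" not in node:
        node = node["L"] if x[node["f"]] <= node["t"] else node["R"]
    return node["leaf"]

# part 1: non-linear moon-shaped data
X, y = make_moons(n_samples=300, noise=0.25, random_state=42)
tree = build(X, y)
preds = np.array([predict(tree, x) for x in X])
acc = (preds == y).mean()
print(f"training accuracy: {acc:.2f}")
```

To reproduce the visualization, plot the points colored by class and overlay `predict` evaluated on a grid of (x, y) coordinates to show the learned decision boundary.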
03 Academic Explanation
Information Entropy and Information Gain
The entropy of a dataset S is defined as:

H(S) = −Σᵢ pᵢ log₂ pᵢ

where pᵢ is the proportion of class-i samples. Higher entropy means more impurity (greater uncertainty).
Information gain after splitting on feature A:

IG(S, A) = H(S) − Σᵥ (|Sᵥ| / |S|) · H(Sᵥ)

where v ranges over the values of A and Sᵥ is the subset of S taking value v. Select the feature with the largest IG — i.e., the one that reduces uncertainty the most after splitting.
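A worked example of the selection rule, with an assumed helper `information_gain` of my own naming — a perfectly separating split earns IG = H(S), an uninformative one earns 0:

```python
import numpy as np

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(-p * np.log2(p)))

def information_gain(y, left_mask):
    """IG(S, A) = H(S) minus the weighted average of the child entropies."""
    w = left_mask.mean()                      # fraction of samples going left
    return entropy(y) - w * entropy(y[left_mask]) - (1 - w) * entropy(y[~left_mask])

y = np.array([0, 0, 1, 1])
perfect = np.array([True, True, False, False])   # cleanly separates the classes
useless = np.array([True, False, True, False])   # splits each class in half
print(information_gain(y, perfect))  # 1.0 -> removes all uncertainty
print(information_gain(y, useless))  # 0.0 -> tells us nothing
```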
Gini Index (CART)
The Gini impurity of node t:

Gini(t) = 1 − Σᵢ pᵢ²

where pᵢ is the proportion of class-i samples at node t.
CART selects the feature and threshold that maximize the weighted decrease in Gini impurity, building a binary tree (each split produces exactly two branches).
Overfitting and Pruning
An unrestricted decision tree will overfit (memorize the training set). Common control methods:
- Pre-pruning: limit max depth (max_depth), minimum samples to split (min_samples_split)
- Post-pruning: grow the full tree first, then merge unnecessary splits bottom-up (cost-complexity pruning)
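In scikit-learn both styles map to constructor parameters: `max_depth` and `min_samples_split` for pre-pruning, `ccp_alpha` for cost-complexity post-pruning. A sketch (thresholds here are arbitrary, not tuned values):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.35, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)    # no limits: memorizes
pre = DecisionTreeClassifier(max_depth=4, min_samples_split=10,
                             random_state=0).fit(X_tr, y_tr)     # pre-pruning
post = DecisionTreeClassifier(ccp_alpha=0.01,
                              random_state=0).fit(X_tr, y_tr)    # post-pruning

for name, m in [("unrestricted", full), ("pre-pruned", pre), ("post-pruned", post)]:
    print(f"{name:>12}: train {m.score(X_tr, y_tr):.2f}  test {m.score(X_te, y_te):.2f}")
```

The unrestricted tree reaches 100% training accuracy (memorization); the pruned variants trade a little training fit for a simpler tree.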
Variance Reduction in Random Forests
If n independent and identically distributed trees each have variance σ², the variance of their average is σ²/n. Random forests introduce diversity through bootstrap sampling + random feature subsets, reducing inter-tree correlation ρ. For trees with pairwise correlation ρ, the ensemble variance becomes:

Var = ρσ² + ((1 − ρ) / n) · σ²
When the number of trees is large enough, the second term approaches 0, and ensemble error is dominated by inter-tree correlation.
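The formula can be checked numerically by simulating n correlated "tree predictions" with a shared covariance structure (the particular numbers below are arbitrary):

```python
import numpy as np

n, sigma2, rho = 50, 1.0, 0.3        # 50 trees, unit variance, correlation 0.3
# covariance of the trees' predictions: sigma2 on the diagonal, rho*sigma2 off it
cov = sigma2 * (rho * np.ones((n, n)) + (1 - rho) * np.eye(n))
rng = np.random.default_rng(0)
draws = rng.multivariate_normal(np.zeros(n), cov, size=100_000)

empirical = draws.mean(axis=1).var()             # variance of the ensemble average
predicted = rho * sigma2 + (1 - rho) * sigma2 / n
print(empirical, predicted)                      # both ≈ 0.314
```

With ρ = 0.3 the floor ρσ² = 0.3 dominates: adding trees shrinks only the (1 − ρ)σ²/n term, which is why decorrelating the trees matters more than growing more of them.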
Complexity
- Training: O(n·d·log n), where n is the number of samples and d is the number of features
- Prediction: O(depth), where depth is typically O(log n)
- Random Forest Training: O(T·n·√d·log n), where T is the number of trees