K-Means Clustering
You don't need to tell the algorithm the right answers in advance; it can group similar things together on its own. That's K-Means.
01 Core Principles (Plain English)
Imagine walking into a candy store where hundreds of candies are scattered across a table. The owner asks you to sort them into groups, but doesn't tell you how — what would you do?
Intuitively, you'd put candies that look alike together: similar colors, similar sizes... That's exactly what K-Means does.
The Algorithm Has Only Four Steps
1. Randomly select K points from the data and plant flags — these are the K "captains" (centroids).
2. Each data point looks at which captain is closest, and follows that one.
3. Each captain moves to the center of its team members (taking the average), positioning itself more "centrally".
4. Repeat steps 2 and 3. The captains move less and less, eventually settling — clustering is complete.
Why does it converge? After each "captain relocation", the total "travel cost" (the sum of squared distances from each point to its captain) only decreases, never increases. When it can't decrease any further, it stops.
Two Pitfalls to Watch Out For
You Must Choose K Yourself
You have to tell the algorithm how many groups to create. If K is too small, different categories get mixed together; if K is too large, the same category gets split apart. The "Elbow Method" is commonly used to find a good K.
Initial Positions Affect Results
Different random starting positions can lead to different final results, sometimes getting stuck in local optima. In practice, it's common to run multiple times and take the best result.
Building K-Means Step by Step
Generate three natural clusters, then randomly pick K points from the data as initial centroids. Different starting positions may lead to different final results.
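A minimal sketch of this step, assuming numpy; the seed, blob centers, and noise scale are arbitrary choices of mine:

```python
import numpy as np

# Synthetic demo data: three Gaussian blobs around hand-picked centers
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.8, size=(50, 2)) for c in centers])

# Step 1: randomly pick K distinct data points as initial centroids
K = 3
centroids = X[rng.choice(len(X), size=K, replace=False)]
```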
Each point calculates its distance to all centroids and joins the nearest one. This is the core operation of K-Means.
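The assignment step can be written as one vectorized distance computation. `assign_clusters` is my own helper name, and the tiny arrays are made-up demo values:

```python
import numpy as np

def assign_clusters(X, centroids):
    # Distance from every point to every centroid: result has shape (n, K)
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    # Each point joins the nearest centroid
    return dists.argmin(axis=1)

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0]])
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])
print(assign_clusters(X, centroids))  # → [0 0 1]
```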
Each centroid moves to the mean position of all its members. After each move, the total distance only decreases, guaranteeing eventual convergence.
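A matching sketch of the update step (`update_centroids` is my own helper name; the toy points are chosen so the means are easy to check by hand):

```python
import numpy as np

def update_centroids(X, labels, K):
    # Each centroid moves to the mean position of its current members
    return np.array([X[labels == k].mean(axis=0) for k in range(K)])

X = np.array([[0.0, 0.0], [2.0, 0.0], [5.0, 5.0], [7.0, 5.0]])
labels = np.array([0, 0, 1, 1])
print(update_centroids(X, labels, 2))  # cluster means: (1, 0) and (6, 5)
```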
Repeatedly execute "assign + update", stopping when centroids no longer move. Usually only a few rounds are needed to converge.
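The loop itself can be sketched like this, printing how far the captains move each round (numpy only; the seed and blob positions are arbitrary, and an empty cluster simply keeps its centroid in place as a simple fallback):

```python
import numpy as np

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(8, 0.5, (20, 2))])

K = 2
centroids = X[rng.choice(len(X), size=K, replace=False)]
for step in range(100):
    # Assign: each point joins its nearest centroid
    labels = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
    # Update: each centroid moves to the mean of its members
    new_centroids = np.array([
        X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
        for k in range(K)
    ])
    shift = np.linalg.norm(new_centroids - centroids, axis=1).max()
    print(f"round {step + 1}: max centroid shift = {shift:.4f}")
    centroids = new_centroids
    if shift < 1e-9:  # centroids stopped moving -> converged
        break
```

On well-separated data like this, the shift typically collapses to zero within a handful of rounds.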
Combine all four segments together, add visualization, and you have the complete demo code — see below.
02 Code
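One possible complete demo, combining the four segments described above (numpy only; the seed, blob centers, and noise scale are my own choices). For the visualization, a matplotlib scatter colored by label can be appended at the end:

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Plain K-Means: random init -> assign -> update -> repeat until stable."""
    rng = np.random.default_rng(seed)
    # 1. Random initialization: K distinct data points become centroids
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iter):
        # 2. Assignment: each point joins its nearest centroid
        labels = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
        # 3. Update: each centroid moves to the mean of its members
        #    (an empty cluster keeps its centroid in place as a simple fallback)
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])
        # 4. Stop when centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Demo data: three Gaussian blobs (centers and noise scale are arbitrary)
rng = np.random.default_rng(42)
blob_centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.8, size=(50, 2)) for c in blob_centers])

centroids, labels = kmeans(X, K=3)
print("final centroids:\n", np.round(centroids, 2))
# To visualize: plt.scatter(X[:, 0], X[:, 1], c=labels), plus the centroids
```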
03 Academic Explanation
Formal Definition
Given a dataset X = {x₁, x₂, ..., xₙ}, K-Means aims to find K clusters C = {C₁, C₂, ..., C_K} and their corresponding centroids μ₁, ..., μ_K that minimize the objective function:

J = Σ_{k=1}^{K} Σ_{x ∈ C_k} ‖x − μ_k‖²
Where ‖x − μ_k‖² is the squared Euclidean distance, and J is also called the Within-Cluster Sum of Squares (WCSS / SSE).
Algorithm Pseudocode
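In the notation above, the standard (Lloyd's) algorithm can be stated as:

```
Input:  dataset X = {x₁, …, xₙ}, number of clusters K
Output: centroids μ₁, …, μ_K and cluster assignments C₁, …, C_K

1. Initialize μ₁, …, μ_K (e.g., K random points drawn from X)
2. repeat:
3.     Assignment:  for each xᵢ, place xᵢ in Cₖ where k = argmin_j ‖xᵢ − μ_j‖²
4.     Update:      for each k, μ_k ← (1 / |C_k|) · Σ_{x ∈ C_k} x
5. until the assignments no longer change (or a max iteration count is reached)
```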
Convergence
After each iteration, the objective function J is monotonically non-increasing, and the state space is finite (there are finitely many assignment schemes), so the algorithm is guaranteed to converge — but only to a local optimum, not necessarily the global optimum.
In practice, K-Means++ initialization is commonly used: the next centroid is selected with probability proportional to d(x, nearest selected centroid)², significantly reducing the chance of getting stuck in poor local optima.
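A sketch of K-Means++ seeding under exactly that rule (numpy; `kmeans_pp_init` is my own name for the helper):

```python
import numpy as np

def kmeans_pp_init(X, K, seed=0):
    """K-Means++ seeding: spread initial centroids apart."""
    rng = np.random.default_rng(seed)
    # First centroid: uniform at random
    centroids = [X[rng.integers(len(X))]]
    for _ in range(K - 1):
        # Squared distance from each point to its nearest chosen centroid
        d2 = np.min(((X[:, None] - np.array(centroids)[None]) ** 2).sum(axis=2), axis=1)
        # Next centroid: sampled with probability proportional to d²
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
init = kmeans_pp_init(X, 3)
```

The seeded centroids then feed into the ordinary assign/update loop in place of purely random starting points.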
Distance Metric
Standard K-Means uses Euclidean distance. If the data dimensions have very different scales, standardization (z-score) should be applied first, otherwise dimensions with larger scales will dominate the distance calculation.
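A quick z-score illustration with two made-up features on very different scales:

```python
import numpy as np

# Two made-up features on wildly different scales: income ($) vs. age (years)
X = np.array([[30_000.0, 25.0], [60_000.0, 40.0], [45_000.0, 55.0]])

# z-score standardization: per column, subtract the mean, divide by the std
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
# Both columns now have mean 0 and std 1, so neither dominates the distances
```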
Choosing K
Elbow Method
Plot SSE against K. The "elbow" where the SSE curve's decline slows sharply is a good choice for K.
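One way to sketch this, using a minimal numpy K-Means to compute the SSE for each candidate K (the synthetic three-blob data and seeds are my own choices):

```python
import numpy as np

def kmeans_sse(X, K, seed=0, max_iter=50):
    """Run a minimal K-Means and return the within-cluster sum of squares."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iter):
        d = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = d.argmin(axis=1)
        new = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                        else centroids[k] for k in range(K)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return ((X - centroids[labels]) ** 2).sum()

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, (30, 2)) for c in [(0, 0), (6, 0), (3, 5)]])
for K in range(1, 7):
    print(K, round(kmeans_sse(X, K), 1))
# Ideally, SSE drops sharply up to the true cluster count (here 3), then flattens
```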
Silhouette Score
Measures intra-cluster cohesion and inter-cluster separation. Scores lie in [−1, 1]; the closer to 1, the better.
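A simplified version of the score, computed directly from pairwise distances for illustration (scikit-learn's `silhouette_score` provides the standard implementation):

```python
import numpy as np

def silhouette(X, labels):
    """Mean silhouette: s = (b - a) / max(a, b) per point, averaged."""
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)  # pairwise distances
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        # a: mean distance to the rest of my own cluster
        a = D[i, same].sum() / max(same.sum() - 1, 1)
        # b: lowest mean distance to any other cluster
        b = min(D[i, labels == k].mean() for k in set(labels.tolist()) if k != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

X = np.array([[0, 0], [0, 1], [10, 10], [10, 11]], dtype=float)
print(round(silhouette(X, np.array([0, 0, 1, 1])), 2))  # → 0.93
```

Two tight, well-separated pairs score close to 1, as expected.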
Business Constraints
In real-world scenarios, K is often determined by business requirements (e.g., segmenting users into 3 tiers).
Complexity
- Time Complexity: O(n · K · d · T), where n is the number of samples, K the number of clusters, d the number of dimensions, and T the number of iterations
- Space Complexity: O(n · d + K · d)
- For large-scale data, Mini-Batch K-Means can be used, updating centroids with random subsets each time for significant speedup.
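A sketch of the mini-batch update, using the per-centroid decaying learning rate from Sculley's 2010 formulation (scikit-learn ships this idea as `MiniBatchKMeans`; the data and seeds here are my own choices):

```python
import numpy as np

def minibatch_kmeans(X, K, batch_size=32, n_steps=100, seed=0):
    """Mini-batch K-Means: update centroids from small random subsets."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)].copy()
    counts = np.zeros(K)  # how many points each centroid has absorbed so far
    for _ in range(n_steps):
        batch = X[rng.choice(len(X), size=batch_size)]
        labels = np.linalg.norm(batch[:, None] - centroids[None], axis=2).argmin(axis=1)
        for x, k in zip(batch, labels):
            counts[k] += 1
            eta = 1.0 / counts[k]  # per-centroid learning rate decays over time
            centroids[k] = (1 - eta) * centroids[k] + eta * x
    return centroids

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (500, 2)), rng.normal(10, 1, (500, 2))])
centroids = minibatch_kmeans(X, K=2)
```

Each step touches only `batch_size` points instead of all n, which is where the speedup on large datasets comes from.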
Limitations
- Assumes convex-shaped clusters — performs poorly on non-convex distributions like rings or crescents
- Sensitive to outliers (outliers can strongly pull centroids off course)
- Requires specifying K in advance, and results depend on initialization