01 Core Principles (Plain English)

You have a photo that you need to send to a friend, but the file is too large to transfer. You compress the image — the main subject is still there, some details are lost, but the overall information is mostly preserved.

PCA does the same thing: "compress" high-dimensional data into low dimensions, but prioritize preserving the most important information.

The key question is: what counts as "most important"? PCA's answer — the directions with the largest variance are the most important. The more the data varies and spreads in a given direction, the more information that direction contains. By projecting the data onto that direction, you can express more content with fewer numbers.
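
The variance-as-information idea can be checked numerically: project toy correlated data onto the 45° diagonal and onto the x-axis, and compare the spreads. A minimal NumPy sketch (the data and seed are illustrative assumptions, not from the original):

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2D data: x2 follows x1 plus small noise, so points spread along 45°
x1 = rng.normal(size=500)
x2 = x1 + 0.3 * rng.normal(size=500)
X = np.column_stack([x1, x2])
X = X - X.mean(axis=0)  # center first so direction comparisons are fair

def variance_along(X, direction):
    """Variance of the data projected onto a unit direction."""
    d = direction / np.linalg.norm(direction)
    return np.var(X @ d)

v_diag = variance_along(X, np.array([1.0, 1.0]))  # 45° diagonal
v_axis = variance_along(X, np.array([1.0, 0.0]))  # x-axis
print(v_diag, v_axis)  # the diagonal captures much more spread
```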

Four Steps to Complete PCA

1. Center the Data

Subtract the mean from all data points, moving the data's "center of gravity" to the origin. This makes directional comparisons meaningful.

2. Compute the Covariance Matrix

The covariance matrix records how each pair of features varies together: which two dimensions tend to change in step, and which are uncorrelated.

3. Find Principal Component Directions (Eigenvectors)

Perform eigen decomposition on the covariance matrix to get several mutually perpendicular directions (principal components). The larger the corresponding eigenvalue, the greater the variance and information in that direction.

4. Project to Reduce Dimensions

Project the data onto the first k principal component directions. Originally 100-dimensional data might only need 2-3 dimensions to preserve 95% of the information.

Run the code below to see the complete four-step process: original scatter plot → centered data moved to the origin → red line marking the PC1 principal direction → all points vertically projected onto the principal axis.

How Many Dimensions Should You Keep?

Look at the "Variance Explained Ratio"

Each principal component has a variance explained ratio. If the first k components together explain 95% of the variance, that's usually enough — the remaining 5% is mostly noise.
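
As a sketch of this rule: compute the explained-variance ratios from the covariance eigenvalues, then pick the smallest k whose cumulative ratio crosses 95%. The 5-D toy data below (two strong latent directions plus a little noise) is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical 5-D data: two strong latent directions plus small noise
base = rng.normal(size=(300, 2)) * np.array([5.0, 2.0])
mix = rng.normal(size=(2, 5))
X = base @ mix + 0.1 * rng.normal(size=(300, 5))
Xc = X - X.mean(axis=0)

eigvals = np.linalg.eigvalsh(np.cov(Xc.T))[::-1]  # descending order
ratio = eigvals / eigvals.sum()                   # variance explained ratio
cumulative = np.cumsum(ratio)
k = int(np.searchsorted(cumulative, 0.95) + 1)    # smallest k reaching 95%
print("ratios:", ratio.round(3), "-> keep k =", k)
```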

Use a "Scree Plot" as a Guide

Plot the eigenvalues of each principal component from largest to smallest. The point where the curve shows a clear "elbow" is a good cutoff for the number of dimensions.
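
The elbow can also be located numerically: sort the eigenvalues and look for the largest ratio between consecutive values. A NumPy sketch with made-up data (two strong directions, four near-noise ones); the actual plot is omitted here:

```python
import numpy as np

rng = np.random.default_rng(2)
# 6-D toy data: 2 strong directions, 4 near-noise ones -> elbow after component 2
signal = rng.normal(size=(400, 2)) * np.array([6.0, 3.0])
X = np.hstack([signal, 0.2 * rng.normal(size=(400, 4))])
Xc = X - X.mean(axis=0)

eigvals = np.linalg.eigvalsh(np.cov(Xc.T))[::-1]  # largest to smallest
drops = eigvals[:-1] / eigvals[1:]                # ratio of consecutive eigenvalues
elbow = int(np.argmax(drops) + 1)                 # biggest drop marks the elbow
print(eigvals.round(2), "-> elbow after component", elbow)
```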

Build PCA Step by Step

Step 1 Generate Diagonal Elliptical Data

An ellipse with its major axis along the 45° direction. The principal component direction is obvious — perfect for demonstrating how PCA finds this hidden "main direction".
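
A minimal sketch of such data, assuming NumPy (the sample size, standard deviations, and offset are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 300
# Ellipse: long axis std 3, short axis std 1, then rotated by 45°
raw = rng.normal(size=(n, 2)) * np.array([3.0, 1.0])
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X = raw @ R.T + np.array([5.0, 5.0])  # shift off the origin so centering matters
print(X.shape)
```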

Step 2 Center the Data: Move the Center of Gravity to the Origin

Subtracting the mean puts all directional comparisons on the same starting point. Without centering, the principal component directions would be contaminated by the mean offset.
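
Centering is one line in NumPy; a sketch with toy off-center data:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 2)) * np.array([3.0, 1.0]) + np.array([5.0, -2.0])

mean = X.mean(axis=0)
Xc = X - mean                    # subtract the per-feature mean
print(Xc.mean(axis=0).round(6))  # ~[0, 0]: center of gravity now at the origin
```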

Step 3 Covariance Matrix + Eigenvalue Decomposition

The covariance matrix captures correlations between dimensions. Eigenvectors are the principal component directions, and larger eigenvalues indicate more information in that direction.
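
A sketch of this step with NumPy, reusing the 45°-rotated elliptical toy data (an illustrative assumption). Note that `np.linalg.eigh` returns eigenvalues in ascending order, so we re-sort descending:

```python
import numpy as np

rng = np.random.default_rng(42)
raw = rng.normal(size=(300, 2)) * np.array([3.0, 1.0])
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
Xc = raw @ R.T
Xc = Xc - Xc.mean(axis=0)

cov = np.cov(Xc.T)                       # 2x2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: for symmetric matrices
order = np.argsort(eigvals)[::-1]        # sort by descending eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
print(eigvals.round(2))                  # larger eigenvalue = PC1 variance
print(eigvecs[:, 0].round(2))            # PC1 direction, roughly the 45° diagonal
```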

Step 4 Project onto PC1 to Complete Dimensionality Reduction

Project each point onto the PC1 direction (a dot product). The 2D data becomes 1D coordinates, and we record how much variance was retained.
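
A sketch of the projection step, assuming axis-aligned toy data (std 3 vs. std 1) so the retained-variance figure is easy to sanity-check:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 2)) * np.array([3.0, 1.0])
Xc = X - X.mean(axis=0)

eigvals, eigvecs = np.linalg.eigh(np.cov(Xc.T))
pc1 = eigvecs[:, np.argmax(eigvals)]      # direction of largest variance

z = Xc @ pc1                              # dot product: 2D -> 1D coordinates
retained = eigvals.max() / eigvals.sum()  # fraction of variance kept, ~0.9 here
print(z.shape, round(retained, 3))
```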

Combine all four segments with visualization to get the complete demo code — see below.

02 Code
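
The demo code itself was not included with this text; below is a minimal end-to-end sketch of the four steps in NumPy (visualization omitted; the toy data follows the diagonal ellipse described above, and the seed is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(42)

# Step 1: diagonal elliptical data (long axis along 45°, shifted off the origin)
raw = rng.normal(size=(300, 2)) * np.array([3.0, 1.0])
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X = raw @ R.T + np.array([5.0, 5.0])

# Step 2: center the data
Xc = X - X.mean(axis=0)

# Step 3: covariance matrix + eigendecomposition, sorted by descending eigenvalue
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc.T))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 4: project onto PC1 (2D -> 1D) and report variance retained
z = Xc @ eigvecs[:, 0]
retained = eigvals[0] / eigvals.sum()
print("PC1 direction:", eigvecs[:, 0].round(2))   # roughly the 45° diagonal
print("variance retained:", round(retained, 3))
```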

03 Academic Explanation

PCA (Principal Component Analysis) is a common dimensionality reduction method. Its core idea is: find the directions of maximum variance in the data, and project high-dimensional data into a lower-dimensional space while preserving as much information as possible.

Why Do We Need Dimensionality Reduction?

  • Visualization: Reduce high-dimensional data to 2-3 dimensions for visualization
  • Compression: Reduce storage space
  • Acceleration: Reduce computational cost
  • Denoising: Remove noise and redundant features

What Are Principal Components?

Principal components are the directions of maximum variance in the data. The first principal component is the direction where the data is most "spread out", and the second principal component is the direction with the next highest variance that is orthogonal to the first.

[Figure: PC1 and PC2, the directions of maximum data variance]

Algorithm Steps

1. Center the Data

Subtract the mean from each feature, moving the data center to the origin

2. Compute the Covariance Matrix

Reflects the correlation between features

3. Find Eigenvalues and Eigenvectors

Eigenvectors represent principal component directions, eigenvalues represent the importance of each direction

4. Select Principal Components

Sort by eigenvalue magnitude and choose the top k as principal components
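
In practice these four steps are usually a single library call; a sketch assuming scikit-learn is available (the toy data is illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) * np.array([3.0, 1.0])

pca = PCA(n_components=1)   # keep only the top component
Z = pca.fit_transform(X)    # centers internally, then projects
print(Z.shape)                                 # (200, 1)
print(pca.explained_variance_ratio_.round(3))  # fraction of variance kept
```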

PCA Dimensionality Reduction Demo

Reduce 2D data to 1D and observe the principal component direction. Left: original data with principal component directions; right: 1D projection after reduction:

Dimensionality Reduction Results

[Interactive panel: PC1 direction, PC2 direction, and variance explained ratio, filled in when the demo runs]

Summary

  • Goal: Maximize variance
  • Input: High-dimensional data
  • Output: Low-dimensional projection
  • Evaluation: Variance explained ratio