Definition
Principal Component Analysis (PCA) is a dimensionality reduction technique that applies an orthogonal linear transformation to a dataset, mapping it to a new coordinate system in which the direction of greatest variance lies along the first coordinate (the first principal component), the direction of second greatest variance, orthogonal to the first, lies along the second coordinate, and so on.
Why It Matters
In an era of “big data,” the sheer number of variables can obscure the structure in a dataset. PCA helps by concentrating most of the variance into a few components, so researchers and engineers can focus on the dominant patterns and set aside low-variance directions, which often (though not always) correspond to noise or redundant variables. This makes high-dimensional problems easier to visualize and model.
Core Concepts
- Data Centering: Subtracting the mean vector from each data point so the data is centered at the origin.
- Covariance Matrix: Computing $V = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - m)(x_i - m)^\top$.
- How to read: “The covariance matrix V is equal to one divided by the quantity n minus one, times the sum over all data points of the outer product of the difference between the vector x sub i and the mean vector m with itself.”
- Meaning: Measures how features vary and co-vary after centering at mean $m$.
- Eigen-decomposition: The principal components are the eigenvectors of $V$. The eigenvalues $\lambda_i$ represent the amount of variance captured by each component.
- How to read: “The eigenvectors of the covariance matrix V; and the corresponding eigenvalues lambda i.”
- Meaning: Rotate the axes to the directions of maximum spread; the eigenvector with the largest eigenvalue is the first PC.
- Dimension Reduction: Projecting the centered data onto the top $k$ eigenvectors (those with the largest eigenvalues) to reduce dimensionality while retaining as much variance as possible.
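
The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a production implementation: the function name `pca`, the parameter `k` (number of components kept), and the synthetic data are assumptions made for the example and do not come from the text above.

```python
import numpy as np

def pca(X, k):
    """Project the rows of X (n samples x d features) onto the top-k principal components.

    Hypothetical helper for illustration only.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]

    # Data centering: subtract the mean vector m from every data point.
    m = X.mean(axis=0)
    Xc = X - m

    # Covariance matrix: V = 1/(n-1) * sum_i (x_i - m)(x_i - m)^T.
    V = (Xc.T @ Xc) / (n - 1)

    # Eigen-decomposition: eigh is meant for symmetric matrices and
    # returns eigenvalues in ascending order, so reorder to descending.
    eigvals, eigvecs = np.linalg.eigh(V)
    order = np.argsort(eigvals)[::-1]
    eigvals = eigvals[order]
    eigvecs = eigvecs[:, order]

    # Dimension reduction: keep the k eigenvectors with the largest
    # eigenvalues and project the centered data onto them.
    W = eigvecs[:, :k]
    Z = Xc @ W
    return Z, W, eigvals

# Synthetic example: 200 points in 3 dimensions with correlated features.
rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.1]])
X = rng.normal(size=(200, 3)) @ mixing

Z, W, eigvals = pca(X, k=2)

# Fraction of the total variance captured by each component.
explained_ratio = eigvals / eigvals.sum()
print(Z.shape)
print(explained_ratio)
```

Each column of `Z` has sample variance equal to the corresponding eigenvalue, which is one way to check that the eigenvalues really do measure the variance captured by each component. For large or ill-conditioned datasets, an SVD of the centered data is usually preferred over forming `V` explicitly.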