
Principal Component Analysis

Definition

Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms a dataset into a new coordinate system such that the greatest variance by any projection of the data comes to lie on the first coordinate (the first principal component), the second greatest variance on the second coordinate, and so on.

Why It Matters

In an era of “big data,” the sheer number of measured variables can obscure the structure in a dataset. PCA matters because it concentrates most of the variance into a few components, so directions with little variance (often noise or redundant features) can be dropped. This lets researchers and engineers focus on the main drivers of a system and makes high-dimensional problems easier to visualize and model.

Core Concepts

  • Data Centering: Subtracting the mean vector $m$ from each data point $x_i$ so the data is centered at the origin.
  • Covariance Matrix: Computing $V = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - m)(x_i - m)^T$.
    • How to read: “The covariance matrix V is equal to one divided by the quantity n minus one, times the sum over all data points of the outer product of the difference between the vector x i and the mean vector m.”
    • Meaning: Measures how features vary and co-vary after centering at mean $m$.
  • Eigen-decomposition: The principal components are the eigenvectors of $V$. The eigenvalues $\lambda_i$ represent the amount of variance captured by each component.
    • How to read: “The principal components are the eigenvectors of the covariance matrix V, and lambda i is the eigenvalue that belongs to the i-th eigenvector.”
    • Meaning: Rotate the axes to the directions of maximum spread; the eigenvector with the largest $\lambda$ is the first PC.
  • Dimension Reduction: Projecting the centered data onto the top $k$ eigenvectors (those with the largest eigenvalues) to reduce dimensionality while preserving as much variance as possible.
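
The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `pca`, the variable names, and the synthetic two-feature dataset are assumptions made for the example. It assumes the rows of `X` are data points and the columns are features.

```python
import numpy as np

def pca(X, k):
    """Hypothetical helper: project the rows of X onto the top-k principal components."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]

    # Data centering: subtract the mean vector m from every data point x_i.
    m = X.mean(axis=0)
    Xc = X - m

    # Covariance matrix: V = 1/(n-1) * sum of outer products (x_i - m)(x_i - m)^T.
    V = (Xc.T @ Xc) / (n - 1)

    # Eigen-decomposition: V is symmetric, so eigh applies.
    # eigh returns eigenvalues in ascending order, so reorder to descending.
    eigvals, eigvecs = np.linalg.eigh(V)
    order = np.argsort(eigvals)[::-1]
    eigvals = eigvals[order]
    eigvecs = eigvecs[:, order]

    # Dimension reduction: keep the top-k eigenvectors and project onto them.
    W = eigvecs[:, :k]
    Z = Xc @ W
    return Z, W, eigvals

# Synthetic example (an assumption for illustration): two strongly correlated features.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, 2 * t + 0.1 * rng.normal(size=200)])

Z, W, eigvals = pca(X, k=1)
```

Here `Z` holds the data expressed along the first principal component, `W` holds that component as a column, and `eigvals` lists the variance captured by each component in descending order. The sample variance of `Z` along the first component equals the largest eigenvalue, which is one way to sanity-check the result.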

Connected Concepts