Principal components analysis
In statistics, principal components analysis (PCA) is a technique that can be used to simplify a dataset; more formally it is a linear transformation that chooses a new coordinate system for the data set such that the greatest variance by any projection of the data set comes to lie on the first axis (then called the first principal component), the second greatest variance on the second axis, and so on. PCA can be used for reducing dimensionality in a dataset while retaining those characteristics of the dataset that contribute most to its variance by eliminating the later principal components (by a more or less heuristic decision). These characteristics may be the "most important", but this is not necessarily the case, depending on the application.
PCA is also called the Karhunen-Loève transform (named after Kari Karhunen and Michel Loève) or the Hotelling transform (in honor of Harold Hotelling). PCA is the optimal linear transformation for keeping the subspace that has the largest variance. However, this optimality comes at the price of greater computational requirements, for example compared to the discrete cosine transform. Unlike other linear transforms, PCA does not have a fixed set of basis vectors; its basis vectors depend on the data set.
Assuming zero empirical mean (the empirical mean of the distribution has been subtracted from the data set), the first principal component w1 of a dataset x can be defined as:
- <math>\mathbf{w}_1 = \arg\max_{\Vert \mathbf{w} \Vert = 1} E\left\{ \left( \mathbf{w}^T \mathbf{x}\right)^2 \right\}</math>
(See arg max for the notation.) Given the first <math>k - 1</math> components, the <math>k</math>-th component can be found by subtracting the first <math>k - 1</math> principal components from x:
- <math>\mathbf{\hat{x}}_{k - 1} = \mathbf{x} - \sum_{i = 1}^{k - 1} \mathbf{w}_i \mathbf{w}_i^T \mathbf{x}</math>
and by substituting this as the new dataset to find a principal component in
- <math>\mathbf{w}_k = \arg\max_{\Vert \mathbf{w} \Vert = 1} E\left\{ \left( \mathbf{w}^T \mathbf{\hat{x}}_{k - 1} \right)^2 \right\}</math>.
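The following sketch illustrates this definition numerically. It is not prescribed by the text above: one possible way to compute each maximizer is power iteration on the sample second-moment matrix, followed by the deflation step just described. The function names are illustrative, the data are assumed to be stored as a zero-mean d × n matrix, and the sketch is intended for k smaller than the rank of the data.

```python
import numpy as np

def leading_direction(X, n_iter=500):
    # Unit vector w maximizing the sample estimate of E[(w^T x)^2],
    # found as the leading eigenvector of X X^T / n by power iteration.
    M = X @ X.T / X.shape[1]
    w = np.random.default_rng(0).standard_normal(M.shape[0])
    for _ in range(n_iter):
        w = M @ w
        w /= np.linalg.norm(w)
    return w

def first_k_components(X, k):
    # Repeatedly find the direction of maximum variance, then subtract
    # each data vector's projection onto it (the deflation step above).
    X_hat = X.copy()
    W = []
    for _ in range(k):
        w = leading_direction(X_hat)
        W.append(w)
        X_hat = X_hat - np.outer(w, w) @ X_hat
    return np.column_stack(W)
```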
A simpler way to calculate the components wi uses the empirical covariance matrix of x, the measurement vector. By finding the eigenvalues and eigenvectors of the covariance matrix, we find that the eigenvectors with the largest eigenvalues correspond to the directions of strongest variation in the dataset. The original measurements are finally projected onto the reduced vector space. Note that these eigenvectors are the columns of the matrix V, where X = ULV′ is the singular value decomposition of the mean-subtracted data matrix X with observations arranged as rows (with observations as columns, they are instead the columns of U).
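A minimal numerical check of this relationship, assuming the data are arranged as an n × d matrix of row observations; the random data and variable names are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))        # n x d, rows are observations
X = X - X.mean(axis=0)                   # zero empirical mean

C = X.T @ X / (X.shape[0] - 1)           # d x d empirical covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]        # sort by decreasing eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

U, L, Vt = np.linalg.svd(X, full_matrices=False)
# Columns of V agree with the covariance eigenvectors up to sign,
# and L**2 / (n - 1) reproduces the eigenvalues.
print(np.allclose(np.abs(Vt.T), np.abs(eigvecs)))        # True
print(np.allclose(L**2 / (X.shape[0] - 1), eigvals))     # True
```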
PCA is equivalent to empirical orthogonal functions (EOF).
PCA is a popular technique in pattern recognition. However, PCA is not optimized for class separability. An alternative is linear discriminant analysis, which does take this into account. PCA optimally minimizes reconstruction error under the L2 norm.
Algorithm details
Following is a detailed description of PCA using the covariance method. Suppose you have n data vectors of d dimensions each, and you want to project the data into a k-dimensional subspace.
Find the basis vectors
- Organize your data into column vectors, so you end up with a <math>d \times n</math> matrix, D.
- Find the empirical mean along each dimension, so you end up with a <math>d \times 1</math> empirical mean vector, M.
- Subtract the empirical mean vector M from each column of the data matrix D. Store the mean-subtracted <math>d \times n</math> matrix in S.
- Find the empirical covariance <math>d \times d</math> matrix C of S: <math>C = S \cdot S^T</math> (dividing by n or n − 1 gives the usual empirical covariance, but the scaling does not change the eigenvectors).
- Compute the eigenvectors V of C and sort them by decreasing eigenvalue.
- Save the mean vector M. Save the first k columns of V as P. P will have dimension <math>d \times k</math>, where <math>1 \leq k \leq d</math>.
Projecting new data
Suppose you have a <math>d \times 1</math> data vector x. Then the <math>k \times 1</math> projected vector is <math>v = P^T(x - M)</math>.
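A compact sketch of these steps with NumPy; the function names pca_fit and pca_project are illustrative, and the normalization of the covariance by n − 1 is an assumption that does not affect the resulting eigenvectors.

```python
import numpy as np

def pca_fit(D, k):
    # D is a d x n matrix whose columns are the data vectors.
    M = D.mean(axis=1, keepdims=True)        # d x 1 empirical mean
    S = D - M                                # mean-subtracted data
    C = S @ S.T / (D.shape[1] - 1)           # d x d empirical covariance
    eigvals, V = np.linalg.eigh(C)           # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # sort by decreasing eigenvalue
    P = V[:, order[:k]]                      # first k eigenvectors, d x k
    return M, P

def pca_project(x, M, P):
    # Project a d x 1 data vector onto the k-dimensional subspace.
    return P.T @ (x - M)                     # k x 1 projected vector

# usage on random data (for illustration only)
D = np.random.default_rng(0).standard_normal((10, 500))
M, P = pca_fit(D, k=3)
v = pca_project(D[:, [0]], M, P)
```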
Derivation of PCA using the covariance method
Let X be a d-dimensional random vector expressed as a column vector. Without loss of generality, assume X has zero empirical mean. We want to find a <math>d \times d</math> orthonormal projection matrix P such that
- <math>Y = P^\top X</math>
with the constraint that
- <math>\operatorname{cov}(Y)</math> is a diagonal matrix and <math>P^{-1} = P^\top</math>.
By substitution and matrix algebra, we get
- <math>\begin{matrix}
\operatorname{cov}(Y) &=& \operatorname{E}[YY^\top]\\
&=& \operatorname{E}[(P^\top X) (P^\top X)^\top]\\
&=& \operatorname{E}[(P^\top X) (X^\top P)]\\
&=& P^\top \operatorname{E}[X X^\top] P\\
&=& P^\top \operatorname{cov}(X) P
\end{matrix}</math>.
We now have
- <math>\begin{matrix}
P\operatorname{cov}(Y) &=& P P^\top \operatorname{cov}(X) P\\
&=& \operatorname{cov}(X) P
\end{matrix}</math>.
Rewrite P as d <math>d \times 1</math> column vectors, so
- <math>P = [P_1, P_2, \ldots, P_d]</math>
and <math>\operatorname{cov}(Y)</math> as
- <math>\begin{bmatrix}
\lambda_1 & \cdots & 0 \\
\vdots & \ddots & \vdots \\
0 & \cdots & \lambda_d
\end{bmatrix}</math>.
Substituting into the equation above, we get
- <math>[\lambda_1 P_1, \lambda_2 P_2, \ldots, \lambda_d P_d] = [\operatorname{cov}(X)P_1, \operatorname{cov}(X)P_2, \ldots, \operatorname{cov}(X)P_d]</math>.
Notice that in <math>\lambda_i P_i = \operatorname{cov}(X)P_i</math>, Pi is an eigenvector of the covariance matrix of X with eigenvalue λi. Therefore, by finding the eigenvectors of the covariance matrix of X, we find a projection matrix P that satisfies the original constraints.
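A quick numerical check of this conclusion (the random data and tolerance are illustrative): taking P to be the eigenvectors of the sample covariance of X makes the sample covariance of Y = P^T X diagonal.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 1000))               # d x n samples of a zero-mean random vector
X = X - X.mean(axis=1, keepdims=True)

cov_X = np.cov(X)                                # d x d sample covariance of X
_, P = np.linalg.eigh(cov_X)                     # orthonormal eigenvectors, so P^{-1} = P^T
Y = P.T @ X                                      # Y = P^T X

cov_Y = np.cov(Y)                                # equals P^T cov(X) P
off_diag = cov_Y - np.diag(np.diag(cov_Y))
print(np.allclose(off_diag, 0, atol=1e-10))      # True: cov(Y) is diagonal
```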
Correspondence analysis
Correspondence analysis is conceptually similar to PCA, but scales the data (which must be positive) so that rows and columns are treated equivalently. It is traditionally applied to contingency tables where Pearson's chi-square test has shown a relationship between rows and columns.
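For comparison, a minimal sketch of simple correspondence analysis of a contingency table, in its classical formulation as an SVD of standardized residuals; the function name and toy table below are hypothetical, not taken from this article.

```python
import numpy as np

def correspondence_analysis(N):
    # N: contingency table of nonnegative counts (rows x columns)
    P = N / N.sum()                                  # correspondence matrix
    r = P.sum(axis=1)                                # row masses
    c = P.sum(axis=0)                                # column masses
    # standardized residuals: rows and columns enter symmetrically
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    row_coords = (U * sv) / np.sqrt(r)[:, None]      # principal coordinates of rows
    col_coords = (Vt.T * sv) / np.sqrt(c)[:, None]   # principal coordinates of columns
    return row_coords, col_coords, sv

# toy contingency table (hypothetical counts)
N = np.array([[30.0, 10.0, 5.0],
              [10.0, 40.0, 10.0],
              [5.0, 10.0, 30.0]])
rows, cols, sv = correspondence_analysis(N)
```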
See also
- eigenface
- transform coding
- independent components analysis
- singular value decomposition