Principal Components Analysis
Introduction
Principal Component Analysis (PCA) is a statistical variable-reduction technique that linearly transforms a set of variables into a new set of uncorrelated factors. Each new factor is a linear combination of the original variables, and the complete set of new factors preserves all of the variation present in the original variables.
The factors are sorted in descending order of the variation they explain, as measured by their variances, and in general the first few principal components explain most of the variation, indicating that the effective dimensionality of the original set of variables is considerably smaller than the total number of variables. The remaining components are associated with eigenvalues of the covariance matrix that are close to zero and have little explanatory power. Including all 15 Rank-Statistics would have produced very small, and possibly zero, eigenvalues that explain no additional variation in the data, so I used only the remaining nine Rank-Statistics before applying PCA.
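To make this concrete, here is a minimal sketch, assuming NumPy and an illustrative, made-up set of eigenvalues (not those of the Rank-Statistics data), of how the proportion of variation explained by each component follows from the eigenvalues:

```python
import numpy as np

# Hypothetical eigenvalues of a covariance matrix, already sorted in
# descending order; the near-zero values add almost no information.
eigenvalues = np.array([4.8, 2.1, 1.3, 0.4, 0.2, 0.1, 0.06, 0.03, 0.01])

# Proportion of total variation explained by each component, and the
# cumulative proportion across components.
explained = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(explained)

for i, (e, c) in enumerate(zip(explained, cumulative), start=1):
    print(f"PC{i}: {e:.1%} explained, {c:.1%} cumulative")
```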
Applications
PCA is applied to the dataset to reduce the dimensionality of the problem whilst producing uncorrelated factors. The covariance matrix of the original data expresses the variability of, and the covariability between, the variables. The information preserved after the transformation is represented in the variances of the uncorrelated factors. PCA, which can be solved as a singular value decomposition (SVD) problem, extracts orthogonal factors in descending order of importance, as measured by the factor variances. The factors are a rotation of the original variables centred on zero, and the complete set of nine principal components explains all the variation observed in the original dataset. Each factor is therefore a linear combination of all the original centred variables. In many cases, scholars have found that the first three or four factors explain most of the variation whilst the remaining factors tend to capture noise. By selecting the first three or four factors, we therefore capture the essence of the original dataset with a much smaller set of uncorrelated variables.
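As an illustration of the SVD route, the following sketch (assuming NumPy and randomly generated placeholder data, not the Rank-Statistics themselves) extracts the factors and their variances from a centred data matrix:

```python
import numpy as np

def pca_svd(RS):
    """PCA of an m-by-N data matrix (rows are variables) via SVD.

    Returns the factors F (rows sorted by variance) and the factor
    variances, i.e. the eigenvalues of the covariance matrix.
    """
    N = RS.shape[1]
    RS_hat = RS - RS.mean(axis=1, keepdims=True)   # centre each variable
    # Columns of U are the orthonormal eigenvectors of the covariance
    # matrix; the singular values s give eigenvalues s**2 / (N - 1).
    U, s, _ = np.linalg.svd(RS_hat, full_matrices=False)
    F = U.T @ RS_hat                               # rotate into factor space
    return F, s**2 / (N - 1)

# Placeholder data: nine variables, 500 observations.
rng = np.random.default_rng(0)
F, variances = pca_svd(rng.normal(size=(9, 500)))
print(variances)                                   # descending order
```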
We refer to the original \(m \times N\) data matrix as \(RS\), where each row of \(RS\) corresponds to a variable, and similarly we refer to the principal component factors by an \(m \times N\) matrix \(F\) whose rows are the transformed factors. The \(m\) variables are centred so that each has zero mean, by subtracting the sample mean from each of the observations. In many applications of PCA, the data are also normalised by scaling the centred variables by their standard deviations; however, this is not necessary here. The original variables are linearly transformed by an orthogonal matrix \(V_{m\times m}\), with \(V^{-1}=V^{T}\), whose rows are the orthonormal eigenvectors of \(Cov \left(RS_{m\times N}-\overline{RS}_{m\times N}\right)=Cov \left( \widehat{RS}_{m\times N} \right)\), where \(\widehat{RS}=RS-\overline{RS}\) and all elements of the \(i\)-th row of the mean matrix \(\overline{RS}\) are identical and equal to $$ \overline{rs}_{i,k}=\frac{1}{N}\sum_{j=1}^{N}rs_{i,j}, \quad k=1,\ldots,N, $$ where \(\overline{rs}_{i,k}\) and \(rs_{i,j}\) are entries of the matrices \(\overline{RS}\) and \(RS\), respectively.
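In code, the centring and the eigendecomposition of the covariance matrix might look like the following sketch, again assuming NumPy and placeholder data; `RS_bar` plays the role of \(\overline{RS}\) and `V` that of \(V\):

```python
import numpy as np

rng = np.random.default_rng(1)
m, N = 9, 200
RS = rng.normal(size=(m, N))               # placeholder data, rows = variables

# Mean matrix: every entry in row i equals the mean of variable i.
RS_bar = np.tile(RS.mean(axis=1, keepdims=True), (1, N))
RS_hat = RS - RS_bar                       # centred data

# Covariance matrix of the centred variables (m x m).
C = RS_hat @ RS_hat.T / (N - 1)

# eigh returns eigenvalues in ascending order, so reverse the order to
# sort them descending; the rows of V are the orthonormal eigenvectors.
eigenvalues, eigenvectors = np.linalg.eigh(C)
order = np.argsort(eigenvalues)[::-1]
V = eigenvectors[:, order].T               # V @ V.T is the identity
```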
The rows of \(V\), which are the eigenvectors of the covariance matrix, are sorted in descending order of the corresponding eigenvalues. We then get
\begin{equation} F_{m\times N}=V_{m\times m} \left(RS_{m\times N}-\overline{RS}_{m\times N}\right)=V_{m\times m} \, \widehat{RS}_{m\times N} \end{equation} with the diagonal covariance matrix \begin{equation} Cov(F)=\frac{1}{N-1}\, F F^{T}=V\, Cov(\widehat{RS})\, V^{T}= \begin{bmatrix} \lambda_1 & 0 & \ldots & 0 \\ 0 & \lambda_2 & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ 0 & \ldots & 0 & \lambda_m \end{bmatrix}, \end{equation} where the diagonal entries, \(\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_m\), are the eigenvalues of the covariance matrix.
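As a numerical check of these equations, the following self-contained sketch (assuming NumPy and placeholder data) forms \(F=V\,\widehat{RS}\) and verifies that its covariance matrix is diagonal with the eigenvalues on the diagonal:

```python
import numpy as np

rng = np.random.default_rng(1)
m, N = 9, 200
RS_hat = rng.normal(size=(m, N))
RS_hat -= RS_hat.mean(axis=1, keepdims=True)       # centred placeholder data

C = RS_hat @ RS_hat.T / (N - 1)                    # covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(C)
order = np.argsort(eigenvalues)[::-1]              # descending order
V = eigenvectors[:, order].T                       # rows = eigenvectors

F = V @ RS_hat                                     # the transformed factors
Cov_F = F @ F.T / (N - 1)

# Off-diagonal entries vanish up to floating-point error, and the
# diagonal holds the eigenvalues in descending order.
assert np.allclose(Cov_F, np.diag(eigenvalues[order]), atol=1e-10)
print(np.diag(Cov_F))
```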
References
- Bollen, J., Van de Sompel, H., Hagberg, A., and Chute, R. (2009). A principal component analysis of 39 scientific impact measures. PLoS ONE, 4(6), e6022.
- Daultrey, S. (1976). Principal Component Analysis (Concepts and Techniques in Modern Geography; no. 8). Norwich: Geo Abstracts Limited.
- Dunteman, G. H. (1989). Principal Components Analysis (No. 69). Sage.
- Haidar, H. (2014). An empirical analysis of controlled risk and investment performance using risk measures: a study of risk controlled environment (Doctoral dissertation, University of Sussex).
- Jolliffe, I. T. (2002). Principal Component Analysis, Second Edition. Springer Series in Statistics, Berlin.
- Shlens, J. (2005). A tutorial on principal component analysis. Systems Neurobiology Laboratory, University of California at San Diego, 82.