Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a statistical variable reduction technique that linearly transforms a set of variables by a rotation whose images are new, uncorrelated factors. It is used to remove noise while preserving the main variation in the original data. The transformed factors are sorted in descending order according to their contribution to the variation of the original data. Initially proposed by Pearson (1901), the technique was developed further by Hotelling (1933) (see the textbooks by Dunteman, 1989, and Jolliffe, 2002). Applications of PCA range from signalling networks and gene expression analysis (e.g. Yeung and Ruzzo, 2001; Raychaudhuri et al., 2000) to image compression, face recognition and modern geography (e.g. Daultrey, 1976). PCA has also been applied to study bond returns (Litterman and Scheinkman, 1991). Classifying funds into classes and observing their performance has previously been examined by Brown and Goetzmann (2003), who used generalized style classifications obtained by comparing returns data to index portfolios and the corresponding factor loadings. Further studies have been carried out by Gibson and Gyger (2007).

Principal Components Analysis

Introduction

Principal Component Analysis is a statistical variable reduction technique that linearly transforms a set of variables into a new set of uncorrelated factors. Each new factor is a linear combination of the original variables, and the complete set of new factors preserves as much variation as the original variables contained.

The factors are sorted in descending order according to the amount of variation their variances explain, and in general the first few principal components explain most of the variation, indicating that the effective dimensionality of the original set of variables is considerably less than the total number of variables. The remaining components are associated with eigenvalues of the covariance matrix that are close to zero and have little explanatory power. Including all 15 Rank-Statistics would have produced very small, and possibly zero, eigenvalues that explain no additional variation in the data; therefore I used only the remaining nine Rank-Statistics before applying PCA.
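As a rough illustration of this screening step, the following Python/NumPy sketch computes the eigenvalues of the sample covariance matrix and the cumulative proportion of variation explained by the leading components; it uses a hypothetical random data matrix in place of the actual Rank-Statistics.

    import numpy as np

    # Hypothetical stand-in for the Rank-Statistics: 9 variables (rows)
    # observed over 250 periods (columns); not the actual data.
    rng = np.random.default_rng(0)
    RS = rng.standard_normal((9, 250))

    # Eigenvalues of the sample covariance matrix, sorted in descending order.
    eigvals = np.linalg.eigvalsh(np.cov(RS))[::-1]

    # Proportion (and cumulative proportion) of total variation explained.
    explained = eigvals / eigvals.sum()
    print(np.round(np.cumsum(explained), 3))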

Applications

PCA is applied to the dataset to reduce the dimensionality of the problem while producing uncorrelated factors. The covariance matrix of the original data expresses the variability of, and covariability between, the variables. The information preserved after the transformation is represented in the variances of the uncorrelated factors. PCA, which can be computed as a singular value decomposition (SVD) problem, extracts orthogonal factors in descending order of importance as measured by the factor variance. The factors are a rotation of the original variables centred on zero, and the complete set of nine principal components explains all the variation observed in the original dataset. Each factor is therefore a linear combination of all the original centred variables. In many cases, scholars have found that the first three or four factors explain most of the variation while the remaining factors tend to capture noise. By selecting the first three or four factors, we therefore capture the essence of the original dataset with a much smaller set of uncorrelated variables.
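To make the SVD connection concrete, here is a minimal sketch of PCA computed through a singular value decomposition of the centred data matrix. The function name pca_via_svd and the random input are illustrative choices of mine, not the implementation used in the study.

    import numpy as np

    def pca_via_svd(X, n_components):
        """PCA of an m x N data matrix X (rows = variables) via a thin SVD."""
        m, N = X.shape
        Xc = X - X.mean(axis=1, keepdims=True)        # centre each variable on zero
        U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
        factor_variances = S**2 / (N - 1)             # eigenvalues of Cov(Xc)
        F = U.T @ Xc                                   # factors, sorted by variance
        return F[:n_components], factor_variances[:n_components]

    # Example on hypothetical data: keep the first three of nine factors.
    rng = np.random.default_rng(0)
    RS = rng.standard_normal((9, 250))
    factors, variances = pca_via_svd(RS, 3)
    print(factors.shape, np.round(variances, 3))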

We denote the original \(m \times N\) matrix by \(RS\), where each row of \(RS\) corresponds to a variable, and similarly we denote the principal component factors by an \(m \times N\) matrix \(F\) whose rows are the transformed factors. The \(m\) variables are centred so that each has zero mean, by subtracting the sample mean from each of the observations. In most applications of PCA, the data are also normalised by scaling the centred variables by their standard deviations; however, this is not necessary. The original variables are linearly transformed by a transformation matrix \(V_{m\times m}\), with \(V^{-1}=V^{T}\), whose rows are the orthonormal eigenvectors of \(Cov \left(RS_{m\times N}-\overline{RS}_{m\times N}\right)=Cov \left( \widehat{RS}_{m\times N} \right)\), where \(\widehat{RS}=RS-\overline{RS}\) and all elements of the \(i\)-th row of the mean matrix \(\overline{RS}\) are identical and equal to $$ \overline{rs}_{i,j}=\frac{1}{N}\sum_{k=1}^{N}rs_{i,k}, $$ where \(\overline{rs}_{i,j}\) and \(rs_{i,j}\) are the entries of the matrices \(\overline{RS}\) and \(RS\), respectively.

The rows of \(V\), which are the eigenvectors of the covariance matrix, are sorted in descending order according to the value of the corresponding eigenvalue. We then get
\begin{equation} F_{m\times N}=V_{m\times m} \left(RS_{m\times N}-\overline{RS}_{m\times N}\right)=V_{m\times m} \left( \widehat{RS}_{m\times N} \right), \end{equation}
with the diagonal covariance matrix
\begin{equation} Cov(F)=\frac{1}{N-1} F\times F^T=V\times Cov(\widehat{RS})\times V^T= \begin{bmatrix} \lambda_1 & 0&\ldots & 0 \\ 0& \lambda_2& \ddots & \vdots \\ \vdots& \ddots& \ddots & \vdots \\ 0 & 0& \ldots & \lambda_m \end{bmatrix}, \end{equation}
where the diagonal entries, \(\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_m\), are the eigenvalues of the covariance matrix.
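A quick numerical check of this identity, again on a hypothetical data matrix, confirms that rotating the centred data by the sorted eigenvector matrix yields factors whose covariance is diagonal with the eigenvalues on the diagonal.

    import numpy as np

    # Hypothetical m x N data matrix RS (rows = variables).
    rng = np.random.default_rng(0)
    RS = rng.standard_normal((9, 250))
    RS_hat = RS - RS.mean(axis=1, keepdims=True)      # centred data

    # V has the orthonormal eigenvectors of Cov(RS_hat) as rows,
    # sorted by descending eigenvalue.
    eigvals, eigvecs = np.linalg.eigh(np.cov(RS_hat))
    order = np.argsort(eigvals)[::-1]
    V = eigvecs[:, order].T

    # F = V * RS_hat has the diagonal covariance Cov(F) = F F^T / (N - 1).
    F = V @ RS_hat
    cov_F = F @ F.T / (RS_hat.shape[1] - 1)
    assert np.allclose(cov_F, np.diag(eigvals[order]))
    print(np.round(np.diag(cov_F), 3))                # lambda_1 >= ... >= lambda_m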

References
  1. Bollen, J., Van de Sompel, H., Hagberg, A., and Chute, R. (2009). A principal component analysis of 39 scientific impact measures. PLoS ONE, 4(6), e6022.
  2. Daultrey, S. (1976). Principal Component Analysis (Concepts and Techniques in Modern Geography, No. 8). Norwich: Geo Abstracts Limited.
  3. Dunteman, G. H. (1989). Principal Components Analysis (No. 69). Sage.
  4. Haidar, H. (2014). An empirical analysis of controlled risk and investment performance using risk measures: a study of risk controlled environment (Doctoral dissertation, University of Sussex).
  5. Jolliffe, I. T. (2002). Principal Component Analysis, Second Edition. Springer Series in Statistics, Berlin.
  6. Shlens, J. (2005). A tutorial on principal component analysis. Systems Neurobiology Laboratory, University of California at San Diego, 82.
