Principal Component Analysis (PCA) examines the relationships among a set of variables. It can be used, for example, to reduce the number of variables supplied to regression or clustering.
Each principal component is a linear combination of the variables chosen to maximize variance. Let X be a matrix of n observations on p variables, and let S be the covariance matrix. For a linear combination of the variables

$$z_1=\sum_{i=1}^{p}a_{i1}x_i$$

where $x_i$ is the ith variable and $a_{i1}$ are the combination coefficients for $z_1$, the coefficients can be collected in a column vector $a_1$, normalized so that $a_1^Ta_1=1$. The variance of $z_1$ is then $a_1^TSa_1$.
The vector $a_1$ is found by maximizing this variance, and $z_1$ is called the first principal component. The second principal component is found in the same way, by maximizing $a_2^TSa_2$ subject to the constraints $a_2^Ta_2=1$ and $a_2^Ta_1=0$. This yields a second principal component orthogonal to the first; the remaining principal components are derived in the same manner. In fact, the coefficient vectors $a_i$ can be calculated from the eigenvectors of the matrix S. Origin uses different methods according to the way missing values are excluded.
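The claim that the maximizing coefficients come from eigenvectors of S can be checked numerically. The following NumPy sketch (an illustration, not Origin's implementation) verifies that no random unit vector achieves more variance than the leading eigenvector:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))  # correlated sample data
S = np.cov(X, rowvar=False)                              # p x p covariance matrix

# Eigenvectors of S, sorted by descending eigenvalue
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
a1 = eigvecs[:, order[0]]          # candidate first-PC coefficients, a1' a1 = 1

# Variance of z1 = X a1 equals a1' S a1, the largest eigenvalue;
# no other unit vector attains a larger variance.
best = a1 @ S @ a1
for _ in range(1000):
    a = rng.normal(size=4)
    a /= np.linalg.norm(a)         # enforce the normalization a' a = 1
    assert a @ S @ a <= best + 1e-12
```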
When missing values are excluded listwise, an observation containing one or more missing values is excluded from the analysis entirely, and a matrix $X_0$ for SVD is derived from X according to the matrix type chosen for the analysis.

For the covariance matrix, let $X_0$ be the matrix X with each column's mean subtracted and each column scaled by $1/\sqrt{n-1}$.

For the correlation matrix, let $X_0$ be the matrix X with each column's mean subtracted and each column scaled by $1/(\sqrt{n-1}\,s_i)$, where $s_i$ is the standard deviation of the ith variable.

Perform SVD on $X_0$:
$$X_0=V\Sigma P^T$$

where V is an n by p matrix with $V^TV=I$, P is a p by p matrix, and $\Sigma$ is a diagonal matrix with diagonal elements $\sqrt{\lambda_1}\ge\sqrt{\lambda_2}\ge\dots\ge\sqrt{\lambda_p}$.
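For the covariance-matrix analysis type, the SVD step can be sketched in NumPy (an illustration of the decomposition above, not Origin's code):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3)).cumsum(axis=0)   # n = 50 observations, p = 3 variables

n = X.shape[0]
X0 = (X - X.mean(axis=0)) / np.sqrt(n - 1)    # center, then scale by 1/sqrt(n-1)

# numpy returns the singular values (the diagonal of Sigma) in
# descending order, so the eigenvalues come out already sorted.
V, sing, Pt = np.linalg.svd(X0, full_matrices=False)
eigenvalues = sing ** 2                       # lambda_i = (sqrt(lambda_i))^2

# The squared singular values are the eigenvalues of the covariance
# matrix S = X0' X0, as the decomposition requires.
S = np.cov(X, rowvar=False)
assert np.allclose(np.sort(eigenvalues), np.sort(np.linalg.eigvalsh(S)))
```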
The principal component scores are

$$\mathrm{Scores}=\sqrt{n-1}\,V\Sigma$$

where each column of Scores holds the scores for the corresponding principal component. Scores are reported as missing values for any observation that contains missing values.

When missing values are excluded pairwise, an observation is excluded from the calculation of the covariance or correlation between two variables only if it has a missing value in either of those two variables.
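Pairwise exclusion can be sketched as follows; `pairwise_cov` is a hypothetical helper written for illustration in NumPy, with missing values represented as NaN:

```python
import numpy as np

def pairwise_cov(X):
    """Covariance matrix with pairwise deletion of missing values (NaN).

    An observation is dropped from the (i, j) entry only if it has a
    NaN in variable i or variable j, mirroring the pairwise-exclusion
    rule described above.
    """
    n, p = X.shape
    S = np.empty((p, p))
    for i in range(p):
        for j in range(p):
            ok = ~np.isnan(X[:, i]) & ~np.isnan(X[:, j])  # rows usable for (i, j)
            xi = X[ok, i] - X[ok, i].mean()
            xj = X[ok, j] - X[ok, j].mean()
            S[i, j] = (xi @ xj) / (ok.sum() - 1)
    return S

X = np.array([[1.0, 2.0, 3.0],
              [2.0, np.nan, 4.0],
              [3.0, 5.0, 1.0],
              [4.0, 6.0, np.nan],
              [5.0, 7.0, 2.0]])
S = pairwise_cov(X)
```

Note that each entry may be computed from a different subset of observations, which is why S is formed first and then decomposed, rather than going through an SVD of the data matrix.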
Eigenvalues and eigenvectors are then calculated from the covariance or correlation matrix S directly:

$$S=PDP^T$$

where P is a p by p matrix whose columns are the eigenvectors and D is a diagonal matrix with diagonal elements $\lambda_1\ge\lambda_2\ge\dots\ge\lambda_p$; $\lambda_i$ is the eigenvalue for the ith principal component, and the eigenvalues are sorted in descending order. The scores are

$$\mathrm{Scores}=X_cP$$

where $X_c$ is the matrix X with each column's mean subtracted from each variable.

Bartlett's Test tests the equality of the remaining $p-k$ eigenvalues. It is available only when the analysis matrix is the covariance matrix.
$$\chi^2=-\left(n-\frac{2p+11}{6}\right)\sum_{i=k+1}^{p}\ln\frac{\lambda_i}{\bar{\lambda}}$$

It approximates a $\chi^2$ distribution with $(p-k-1)(p-k+2)/2$ degrees of freedom, where

$$\bar{\lambda}=\frac{1}{p-k}\sum_{i=k+1}^{p}\lambda_i$$
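Given the eigenvalues sorted in descending order, the statistic and its degrees of freedom can be computed directly. This is an illustrative sketch of the formula above; `bartlett_test` is a hypothetical helper, not an Origin function:

```python
import numpy as np

def bartlett_test(eigenvalues, n, k):
    """Bartlett's chi-square statistic for H0: the last p - k
    eigenvalues of the covariance matrix are equal.

    eigenvalues : eigenvalues of S, sorted in descending order
    n           : number of observations
    k           : number of leading components excluded from the test
    Returns (chi2, degrees_of_freedom).
    """
    lam = np.asarray(eigenvalues, dtype=float)
    p = lam.size
    rest = lam[k:]                      # lambda_{k+1}, ..., lambda_p
    lam_bar = rest.mean()               # average of the remaining eigenvalues
    chi2 = -(n - (2 * p + 11) / 6) * np.sum(np.log(rest / lam_bar))
    df = (p - k - 1) * (p - k + 2) / 2
    return chi2, df
```

A p-value then follows from the upper tail of the $\chi^2$ distribution, e.g. `scipy.stats.chi2.sf(chi2, df)`. When the remaining eigenvalues are exactly equal the statistic is zero, and by the AM-GM inequality it is never negative.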