# Principal Components Analysis 2

## Exploring Principal Components Analysis

### Understanding Multivariate Methods

### What is Principal Components Analysis (PCA)?

### How PCA Works

### PCA Equations

### Practical Aspects

### Interpreting Principal Components

### Example: PCA in Lung Function Tests

### Advantages and Disadvantages of PCA

## Reference

This material is based on examples from the Oxford Handbook of Medical Statistics (Janet Peacock, Philip J. Peacock) [1].

- Multivariate methods analyze multiple outcome variables at once, rather than a single outcome variable
- They aim to simplify complex datasets for easier interpretation

- PCA reduces a dataset with many inter-correlated variables to a smaller set of uncorrelated variables
- This is also known as ‘reducing the dimensionality of a dataset’
- The smaller set of variables is then used in subsequent analyses

- Generates a set of principal components (PCs), each being a linear combination of original variables
- A maximum of *n* PCs can be computed for *n* variables
- PCs explain a proportion of the total variability, with the first PC explaining the maximum amount, and subsequent PCs explaining progressively smaller amounts
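The points above can be illustrated with a minimal NumPy sketch. It assumes one common formulation, PCA as an eigendecomposition of the correlation matrix; the source does not say which formulation it has in mind. The function name `pca` and the synthetic data are hypothetical, chosen only for illustration.

```python
import numpy as np

def pca(X):
    """Sketch of PCA via eigendecomposition of the correlation matrix.

    X has shape (n_subjects, n_variables).
    Returns (coefficients, explained_ratio, scores).
    """
    # Standardize each variable to mean 0 and standard deviation 1.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Correlation matrix of the original variables.
    R = np.corrcoef(X, rowvar=False)
    # eigh returns eigenvalues in ascending order; reorder so the largest is first.
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Row i of `coefficients` holds the weights that define the (i+1)-th PC.
    coefficients = eigvecs.T
    # Share of total variability explained by each PC.
    explained_ratio = eigvals / eigvals.sum()
    # One value per PC per subject.
    scores = Z @ eigvecs
    return coefficients, explained_ratio, scores

# Hypothetical data: four inter-correlated variables driven by one shared factor.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
X = latent + 0.3 * rng.normal(size=(200, 4))

coefficients, explained_ratio, scores = pca(X)
print(explained_ratio.round(3))                     # shares in decreasing order
print(np.corrcoef(scores, rowvar=False).round(3))   # approximately the identity matrix
```

Four variables yield four PCs, the explained shares decrease from the first PC onwards, and the resulting scores are uncorrelated with one another.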

- Original variables: \(x_1, x_2, x_3, \dots, x_p\)
- PCA generates *p* principal components: \(y_1, y_2, y_3, \dots, y_p\)
- PCs are defined as:
- \[y_1 = b_{11}x_1 + b_{12}x_2 + \dots + b_{1p}x_p\]
- \[y_2 = b_{21}x_1 + b_{22}x_2 + \dots + b_{2p}x_p\]
- \[y_p = b_{p1}x_1 + b_{p2}x_2 + \dots + b_{pp}x_p\]

- \(b_{11}, b_{12}\), etc., are coefficients (often called loadings) that weight the contribution of each original variable to a component
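The same equations can be written compactly in matrix form, which may make the structure easier to see. The vector and matrix symbols \(\mathbf{x}\), \(\mathbf{y}\) and \(B\) are notation introduced here for illustration and do not appear in the source:

\[
\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_p \end{pmatrix}
=
\begin{pmatrix}
b_{11} & b_{12} & \dots & b_{1p} \\
b_{21} & b_{22} & \dots & b_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
b_{p1} & b_{p2} & \dots & b_{pp}
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_p \end{pmatrix},
\qquad \text{i.e. } \mathbf{y} = B\mathbf{x}
\]

In the usual formulation (an assumption here, since the source does not state it), each row of \(B\) is a unit-length eigenvector of the covariance or correlation matrix of the original variables, so that \(\sum_{j=1}^{p} b_{ij}^2 = 1\) for every \(i\), and the rows are ordered by decreasing eigenvalue.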

- It is common practice to retain enough PCs to explain at least 80% of the total variability; often just two or three are enough
- PCA generates a single value (a score) for each PC for each subject, creating new variables
- These new variables are used in further analyses like other variables
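The 80% rule of thumb can be sketched as a small helper. The function name `n_components_for` and the explained-variability shares below are hypothetical; the threshold is a convention rather than a fixed rule.

```python
import numpy as np

def n_components_for(explained_ratio, threshold=0.80):
    """Smallest number of leading PCs whose cumulative share reaches `threshold`."""
    cumulative = np.cumsum(explained_ratio)
    reached = cumulative >= threshold
    if not reached.any():
        raise ValueError("the threshold is never reached")
    # argmax gives the index of the first True; add 1 to turn it into a count.
    return int(np.argmax(reached)) + 1

# Hypothetical shares of total variability for four PCs, largest first.
explained_ratio = np.array([0.55, 0.30, 0.10, 0.05])

k = n_components_for(explained_ratio)
print(k)  # 2, because 0.55 + 0.30 = 0.85 is the first cumulative share at or above 0.80
```

Only the scores for the first `k` PCs would then be carried into further analyses as new variables.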

- Specific PCs may represent an overarching theme with contributions from several original variables

Researchers wanted to identify important features of six lung function tests in 458 coalminers [1]. They used PCA and reduced the six tests to three meaningful respiratory components. The results are summarized in the following table:

Lung function test | 1st component | 2nd component | 3rd component | 4th component |
---|---|---|---|---|
FEV1 | -0.46 | 0.18 | 0.23 | -0.26 |
FVC | -0.38 | 0.58 | -0.22 | -0.24 |
FEV1/FVC | -0.38 | -0.57 | -0.24 | -0.52 |
Vmax50 | -0.44 | -0.32 | 0.12 | 0.05 |
Vmax25 | -0.43 | -0.21 | 0.17 | 0.77 |
TLCO | -0.35 | 0.41 | -0.83 | 0.14 |
% Variability | 74% | 15% | 7% | 3% |
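As a rough aid to reading the table, the sketch below copies its values into NumPy arrays, accumulates the percentage of variability, and reports which test has the largest absolute coefficient in each component. Using the largest absolute coefficient as a summary is a simplification chosen here for illustration, not the method of the original study.

```python
import numpy as np

tests = ["FEV1", "FVC", "FEV1/FVC", "Vmax50", "Vmax25", "TLCO"]

# Rows are tests, columns are the 1st to 4th components (values from the table).
loadings = np.array([
    [-0.46,  0.18,  0.23, -0.26],
    [-0.38,  0.58, -0.22, -0.24],
    [-0.38, -0.57, -0.24, -0.52],
    [-0.44, -0.32,  0.12,  0.05],
    [-0.43, -0.21,  0.17,  0.77],
    [-0.35,  0.41, -0.83,  0.14],
])
percent_variability = np.array([74, 15, 7, 3])

# Running total of the variability explained by the leading components.
cumulative = np.cumsum(percent_variability)

# For each component (column), the test with the largest absolute coefficient.
dominant = [tests[i] for i in np.abs(loadings).argmax(axis=0)]

for number, (name, total) in enumerate(zip(dominant, cumulative), start=1):
    print(f"Component {number}: largest coefficient on {name}, cumulative {total}%")
```

The first component has coefficients of the same sign and similar size on all six tests, so it behaves like an overall summary of the tests, while the third and fourth are dominated by TLCO and Vmax25 respectively.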

- Advantage: replaces inter-correlated variables with a smaller set of uncorrelated components that capture the key features of the original data
- Advantage: overcomes collinearity problems among correlated predictor variables, making it easier to examine possible predictors
- Disadvantage: each component is a new variable formed as a linear combination of the original variables, which makes the actual component values harder to interpret

[1] Oxford Handbook of Medical Statistics (Janet Peacock, Philip J. Peacock). Cowie H, Lloyd MH, Soutar CA. Study of lung function data by principal components analysis. Thorax 1985; 40(6):438–43.