Principal Components Analysis 2
This material is reproduced based on the examples from:
Oxford handbook of medical statistics (Janet Peacock Philip J Peacock) [1].
Exploring Principal Components Analysis
Understanding Multivariate Methods
- Used to analyze multiple outcome variables at once, as opposed to single outcome variables
- Aims to simplify complex datasets for easier interpretation
What is Principal Components Analysis (PCA)?
- Reduces datasets with many inter-correlated variables to a smaller set of uncorrelated variables
- Also known as ‘reducing the dimensionality of a dataset’
- The smaller set of variables is used in subsequent analyses
How PCA Works
- Generates a set of principal components (PCs), each being a linear combination of original variables
- At most p PCs can be computed for p variables
- PCs explain a proportion of total variability, with the first PC explaining the maximum amount, and subsequent PCs explaining progressively smaller amounts
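As a rough illustration of the points above, the following sketch runs PCA by eigen-decomposition of a correlation matrix using NumPy. The data are synthetic, and the function name `pca_explained_variance` and the choice to standardise the variables (that is, to work on the correlation matrix) are assumptions of this example, not something taken from the handbook.

```python
import numpy as np

def pca_explained_variance(X):
    """Return proportions of variability and PC coefficients.

    X is an (n_subjects, p_variables) array. PCA is run on the correlation
    matrix, i.e. on standardised variables (an assumption of this sketch).
    """
    corr = np.corrcoef(X, rowvar=False)        # p x p correlation matrix
    eigvals, eigvecs = np.linalg.eigh(corr)    # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]          # reorder so the largest comes first
    eigvals = eigvals[order]
    coeffs = eigvecs[:, order].T               # row i holds the coefficients of PC i
    proportions = eigvals / eigvals.sum()      # share of total variability per PC
    return proportions, coeffs

rng = np.random.default_rng(0)
# Synthetic data: 4 variables driven by one shared factor plus noise
shared = rng.normal(size=(200, 1))
X = shared + 0.5 * rng.normal(size=(200, 4))

proportions, coeffs = pca_explained_variance(X)
print(len(proportions))            # one PC per original variable
print(np.round(proportions, 2))    # largest first, summing to 1
```

With four input variables the sketch yields four components, and the proportions come out in decreasing order, which mirrors the description of the first PC explaining the most variability.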
PCA Equations
- Original variables: \(x_1, x_2, x_3, \dots, x_p\)
- PCA generates p principal components: \(y_1, y_2, y_3, \dots, y_p\)
- PCs are defined as:
\[y_1 = b_{11}x_1 + b_{12}x_2 + \dots + b_{1p}x_p\]
\[y_2 = b_{21}x_1 + b_{22}x_2 + \dots + b_{2p}x_p\]
\[\vdots\]
\[y_p = b_{p1}x_1 + b_{p2}x_2 + \dots + b_{pp}x_p\]
- \(b_{11}, b_{12}\), etc., are coefficients: \(b_{ij}\) is the weight given to variable \(x_j\) in component \(y_i\)
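The same set of equations can be written more compactly in matrix form. This is only a restatement of the definitions above, with the coefficients collected into a \(p \times p\) matrix:

\[
\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_p \end{pmatrix}
=
\begin{pmatrix}
b_{11} & b_{12} & \dots & b_{1p} \\
b_{21} & b_{22} & \dots & b_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
b_{p1} & b_{p2} & \dots & b_{pp}
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_p \end{pmatrix}
\]

A common convention, not stated in the source, is to scale each row of coefficients to unit length, so that \(b_{i1}^2 + b_{i2}^2 + \dots + b_{ip}^2 = 1\) for every component \(y_i\).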
Practical Aspects
- Common practice to include enough PCs to explain at least 80% of total variability, often just two or three
- PCA generates a single value for each PC for each subject, creating new variables
- These new variables are used in further analyses like other variables
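The 80% rule of thumb can be sketched as a small helper. The function name `components_needed` is hypothetical, and the proportions used are the percentages reported for the lung function example in this document (74%, 15%, 7%, 3%).

```python
from itertools import accumulate

def components_needed(proportions, threshold=0.80):
    """Smallest number of leading PCs whose cumulative proportion of
    variability reaches `threshold`. `proportions` must be in PC order."""
    for k, cumulative in enumerate(accumulate(proportions), start=1):
        if cumulative >= threshold:
            return k
    return len(proportions)  # threshold never reached: keep all supplied PCs

# Proportions of variability reported for the first four components
reported = [0.74, 0.15, 0.07, 0.03]

print(components_needed(reported))        # 2, since 74% + 15% = 89%
print(components_needed(reported, 0.95))  # 3, since 74% + 15% + 7% = 96%
```

Under these figures the first two components already pass the 80% threshold, which is in line with the remark that two or three components are often enough.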
Interpreting Principal Components
- Specific PCs may represent an overarching theme with contributions from several original variables
Example: PCA in Lung Function Tests
Researchers wanted to identify important features of six lung function tests in 458 coalminers [1].
They used PCA and reduced the six tests to three meaningful respiratory components. The coefficients and the percentage of variability explained for the first four components are summarized in the following table:
| Component | 1st | 2nd | 3rd | 4th |
|---|---|---|---|---|
| FEV1 | -0.46 | 0.18 | 0.23 | -0.26 |
| FVC | -0.38 | 0.58 | -0.22 | -0.24 |
| FEV1/FVC | -0.38 | -0.57 | -0.24 | -0.52 |
| Vmax50 | -0.44 | -0.32 | 0.12 | 0.05 |
| Vmax25 | -0.43 | -0.21 | 0.17 | 0.77 |
| TLCO | -0.35 | 0.41 | -0.83 | 0.14 |
| % Variability | 74% | 15% | 7% | 3% |
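One simple way to start interpreting a component is to look at which original variables carry the largest coefficients in absolute value. The sketch below does this for the table above. The helper name `dominant_tests` is hypothetical, and ranking by absolute coefficient is just one possible heuristic, not the method used by the study authors.

```python
# Coefficients copied from the table above: test -> [1st, 2nd, 3rd, 4th]
loadings = {
    "FEV1":     [-0.46,  0.18,  0.23, -0.26],
    "FVC":      [-0.38,  0.58, -0.22, -0.24],
    "FEV1/FVC": [-0.38, -0.57, -0.24, -0.52],
    "Vmax50":   [-0.44, -0.32,  0.12,  0.05],
    "Vmax25":   [-0.43, -0.21,  0.17,  0.77],
    "TLCO":     [-0.35,  0.41, -0.83,  0.14],
}
component_names = ["1st", "2nd", "3rd", "4th"]

def dominant_tests(loadings, top=2):
    """For each component, return the `top` tests with the largest |coefficient|."""
    n_components = len(next(iter(loadings.values())))
    result = []
    for j in range(n_components):
        ranked = sorted(loadings, key=lambda test: abs(loadings[test][j]), reverse=True)
        result.append([(test, loadings[test][j]) for test in ranked[:top]])
    return result

top = dominant_tests(loadings)
for name, pairs in zip(component_names, top):
    print(name, pairs)

# Do all six tests contribute to the 1st component with the same sign?
first_same_sign = all(row[0] < 0 for row in loadings.values())
print(first_same_sign)  # True
```

In the first component all six coefficients are negative and of similar size, so every test contributes in the same direction, whereas the third component is dominated by TLCO (-0.83) and the fourth by Vmax25 (0.77).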
Advantages and Disadvantages of PCA
- Replaces inter-correlated variables with a smaller set of uncorrelated components, capturing key features of original data
- Overcomes collinearity problems in complex predictor variables, making it easier to examine possible predictor variables
- Each component is a new variable, which is a linear combination of the original variables, making actual component values harder to interpret
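The collinearity point can be illustrated with a minimal sketch on synthetic data: three strongly correlated predictors are turned into component scores, and the scores are then checked for correlation. The data, the variable names, and the use of the correlation matrix are all assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(1)
# Three predictors that all track the same underlying quantity (highly collinear)
base = rng.normal(size=(300, 1))
X = np.hstack([base + 0.3 * rng.normal(size=(300, 1)) for _ in range(3)])

# Standardise, then project onto the eigenvectors of the correlation matrix
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
order = np.argsort(eigvals)[::-1]      # largest eigenvalue first
scores = Z @ eigvecs[:, order]         # one score per subject per component

predictor_corr = np.corrcoef(X, rowvar=False)
score_corr = np.corrcoef(scores, rowvar=False)
off_diagonal = ~np.eye(3, dtype=bool)

print(np.round(predictor_corr, 2))                      # large off-diagonal values
print(np.abs(score_corr[off_diagonal]).max() < 1e-8)    # scores are uncorrelated
```

The original predictors are strongly correlated with one another, while the component scores have correlations that are zero up to rounding error, so they can enter a regression model without the collinearity problem. The cost, as noted above, is that each score is a blend of all the original variables.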
Reference
- Peacock J, Peacock PJ. Oxford handbook of medical statistics.
- Cowie H, Lloyd MH, Soutar CA. Study of lung function data by principal components analysis. Thorax 1985; 40(6):438–43.