Population structure can be a confounding factor in genetic association studies, including rare variant analysis with variant collapse. One approach to address this issue is to use principal component analysis (PCA) to capture the underlying population structure and include the top principal components (PCs) as covariates in the analysis.
It is worth noting that if the variant allele frequency is already filtered quite low, for example to MAF \(<0.005\), then I do not think that PCA is useful as a covariate during analysis. However, empirical testing would be ideal.
A study (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5851819/) recommends performing PCA on all variants (rare + common) and including the top 10 PCs as covariates. The eigenvec table generated from the PCA contains rows for each sample and columns for each PC, providing a value for how similar each individual is to others in the population. All samples - case and control - should be included.
To perform PCA, plink2 (https://www.cog-genomics.org/plink/2.0/) and GCTA (https://yanglab.westlake.edu.cn/software/gcta/#PCA) can be used. It is recommended to prune the data to variants that best represent the linkage disequilibrium (LD) structure before running PCA.
Here are the steps to use PCA as a covariate in rare variant analysis:
plink2 --vcf samples.filt.vcf --make-bed samples.filt
. This will create three output files with extensions .bed
, .bim
, and .fam
. This binary fileset will contain information about the SNPs and samples in the VCF file, including SNP names, positions, alleles, and genotypes, as well as sample IDs, phenotypes, and family information.exclusion_regions_hg19.txt
) and name the dataset samples.filt
.Some useful resources for understanding and implementing PCA in rare variant analysis include:
An example script for performing PCA on plink-formatted GWAS data, pruning the data, and generating a plot of the results can be found below at grm.sh
and pca_plot.R
:
#!/bin/bash
#SBATCH --job-name=grm
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=24
#SBATCH --time=03:00:00
#SBATCH --mem=60gb
#SBATCH -o ./log/2.4_grm.%J.out
#SBATCH -e ./log/2.4_grm.%J.err
processed=/work/user/data/processed
wdir=$processed/combined
#==============================================================================
# Starting with a VCF file of joint genotyped cohort
# Make GRM and PCA
# Prune known and cohort-specific LD regions for the grm
#==============================================================================
# Plink2 <https://www.cog-genomics.org/plink/2.0/>
# Convert VCF to plink format
plink2 \
--vcf $wdir/geno/samples.filt.vcf \
--make-bed $wdir/geno/samples.filt
# Removing long-range LD regions for PCA
plink2 \
--bfile $wdir/geno/samples.filt \
--exclude range $processed/exclusion_regions_hg19.txt \
--make-bed \
--out $wdir/grm/samples.filt.no_lrldr
# Produce a pruned subset of markers that are in approximate
# linkage equilibrium with each other
# Writes the IDs to plink.prune.in
# (and the IDs of all excluded variants to plink.prune.out)
plink2 \
--bfile $wdir/grm/samples.filt.no_lrldr \
--indep-pairwise 50 5 0.5 \
--out $wdir/grm/samples.filt.no_lrldr
# Extract the pruned subset of markers
# Extract the pruned subset of markers
plink2 \
--bfile $wdir/grm/samples.filt.no_lrldr \
--extract $wdir/grm/samples.filt.no_lrldr.prune.in \
--make-bed --out $wdir/grm/samples.filt.no_lrldr.pruned
# PCA
# this takes apporx 10 minutes on 32 threads, 70 minutes on this dataset on ~18 thread
# GCTA <https://yanglab.westlake.edu.cn/software/gcta/#Download>
/work/user/tool/gcta64 \
--bfile $wdir/grm/samples.filt.no_lrldr.pruned \
--autosome --make-grm \
--out $wdir/grm/samples.filt.grm \
--thread-num 24
/work/user/tool/gcta64 \
--grm $wdir/grm/samples.filt.grm \
--pca 20 \
--out $wdir/grm/samples.filt.pca \
--thread-num 24
# # have a look at the PCA in grm/plotpca.pdf
# module load intel/18.0.5 &&\
# module load intel-mkl/2018.5.274 &&\
# module load r/3.6.0 &&\
#
# cd $wdir/grm/
# Rscript pca.R
(END)
library(ggplot2)
library(dplyr)
library(gridExtra)
# import data
vec <- read.table("./samples.filt.pca.eigenvec", header=F)
phenotypes <- read.csv("./phenotypes/samples_pheno.csv", header=T, sep = "," )
pheno <- phenotypes %>% select(V1, V2, age.days, study.site, gender)
vec_pheno <- merge(x=vec, y=pheno, by=c("V1", "V2"), all=TRUE)
# Set plot style
theme_set <- theme(text = element_text(face="bold"), legend.position="none", panel.background = element_rect("#F7F7F7"))
geom_set <- geom_point(aes(fill = ethnicity), shape=21, alpha=0.5)
# Loop to create plots and store them in a list
plots <- list()
for (i in 1:17) {
x_col <- paste0("V", i+2)
y_col <- paste0("V", i+3)
labs_set <- labs(x=paste0("PC", i), y=paste0("PC", i+1))
p <- vec_pheno %>%
ggplot(aes(x=get(x_col), y=get(y_col))) +
geom_set + labs_set + theme_set
plots[[i]] <- p
}
# Arrange plots using grid.arrange function
grid.arrange(grobs=plots, ncol=4, top="Cohort Principal component analysis", left="")
(END)
Making a Genetic Relationship Matrix (GRM) is a method to estimate the genetic relatedness between individuals based on their genomic similarity. The GCTA-GRM software can be used to generate a GRM from Single Nucleotide Polymorphism (SNP) data. The GRM is a square matrix where each element represents the genetic covariance between two individuals, and the diagonal elements represent the proportion of genetic variance accounted for by each individual.
The GCTA-GRM method uses a set of SNPs to estimate the covariance matrix, and the SNPs can be filtered for various quality control criteria such as minor allele frequency (MAF) or linkage disequilibrium (LD). The resulting GRM can be used to examine the genetic relatedness between individuals and to control for population stratification in downstream analyses.
After generating a Genetic Relationship Matrix (GRM) using GCTA, the software can be used to run Principal Component Analysis (PCA) on the genetic data. Running PCA on the GRM generates two output files: the eigenvalue file (ending in “.eigenval”) and the eigenvector file (ending in “.eigenvec”). These are equivalent to those calculated by the program EIGENSTRAT. The eigenvalue file contains a single column of numbers, one for each principal component (PC), which represents the magnitude of the corresponding eigenvector. The eigenvalue measures the amount of variation in the data captured by each PC and is the factor by which the corresponding eigenvector is scaled. The higher the eigenvalue, the more important the corresponding PC is in explaining the variation in the data.
It is worth noting that the sum of all eigenvalues is equal to the total variance in the data. Thus, the proportion of variance explained by each PC can be calculated by dividing each eigenvalue by the sum of all eigenvalues. The eigenvector file contains a table with rows for each sample and columns for each PC, with each value in the table representing the sample’s contribution to that particular PC, scaled by the corresponding eigenvalue. These eigenvector values can be used as covariates in downstream analyses to control for population structure and other confounding factors.
The sum of all eigenvalues is equal to the total variance in the data, and the proportion of variance explained by each PC can be calculated by dividing each eigenvalue by the sum of all eigenvalues. The eigenvector file provides sample-specific contributions to each PC, scaled by the corresponding eigenvalue, which can be used as covariates in downstream analyses to control for population structure and other confounding factors.