STAAR framework: variant-set test using annotation
Last update: 26 Apr 2023
Abbreviations
 COSI: Calibration Coalescent Model
 FAVOR: Functional Annotation of Variants - Online Resource
 GWAS: Genome-Wide Association Study
 MAF: Minor Allele Frequency
 RVAT: Rare-Variant Association Test
 SKAT: Sequence Kernel Association Test
 STAAR: variant-Set Test for Association using Annotation infoRmation
 STAAR-O: STAAR with multiple annotation weights (omnibus test)
 SCANG-STAAR: SCANG (scan the genome) method combined with STAAR
 WGS: Whole-Genome Sequencing
 WES: Whole-Exome Sequencing
Overview
The SKAT (Sequence Kernel Association Test) method is fundamental to the STAAR (variant-Set Test for Association using Annotation infoRmation) framework.
STAAR builds upon the SKAT method by incorporating multiple functional annotations for genetic variants to improve the power of the association tests.
The integration of annotation information in STAAR allows prioritization of functional variants using multidimensional variant biological functions.
The authors target:

Limitations of existing methods: Current approaches for RV analysis of WGS/WES studies have limited ability to define analysis units in the noncoding genome.

Proposed framework: Introduce a computationally efficient and robust noncoding RV association detection framework for WGS data.

Gene-centric analysis: Propose additional strategies for grouping noncoding variants based on functional annotations within STAAR.

Non-gene-centric analysis: Propose SCANG-STAAR, a flexible RVAT method with data-adaptive window sizes that incorporates multiple functional annotations and accounts for relatedness and population structure.

STAARpipeline: Develop a pipeline that functionally annotates both noncoding and coding variants of a WGS study and performs RVATs using the proposed methods for both gene-centric and non-gene-centric analysis.
First, the rare-variant association tests (RVATs) are defined to analyse rare variants in association with phenotypes.
Then, STAAR is applied, which incorporates multiple functional annotations for genetic variants to boost power in RVATs. STAAR can be applied in different ways, such as STAAR-O, which uses multiple annotation weights. Additionally, for non-gene-centric analysis, SCANG-STAAR is proposed, which uses dynamic windows with data-adaptive sizes and incorporates multidimensional functional annotations.
Method
 Notations and model:
 Generalized linear model for unrelated samples (Equation 1):
\(g(\mu_i) = \alpha_0 + \boldsymbol{X}_i^T\alpha + \boldsymbol{G}_i^T\beta\)
 Generalized linear mixed model for related samples (Equation 2):
\(g(\mu_i) = \alpha_0 + \boldsymbol{X}_i^T\alpha + \boldsymbol{G}_i^T\beta + b_i\)
 Variant-set test using STAAR:
 Burden test statistic:
\(Q_{\mathrm{Burden},l,k} = \left(\sum_{j=1}^p \hat{\pi}_{jk}w_{jl}S_j\right)^2\)
 SKAT test statistic:
\(Q_{\mathrm{SKAT},l,k} = \sum_{j=1}^p \hat{\pi}_{jk}w_{jl}^2S_j^2\)
 ACAT-V test statistic:
\(Q_{\mathrm{ACAT\text{-}V},l,k} = \overline{\hat{\pi}_{\cdot k}w_{\cdot l}^2\mathrm{MAF}(1 - \mathrm{MAF})}\tan\left((0.5 - p_{0,k})\pi\right) + \sum_{j=1}^{p'} \hat{\pi}_{jk}w_{jl}^2\mathrm{MAF}_j(1 - \mathrm{MAF}_j)\tan\left((0.5 - p_j)\pi\right)\)
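As a rough illustration of the Burden and SKAT forms above, the sketch below computes \(Q_{\mathrm{Burden},l,k}\) and \(Q_{\mathrm{SKAT},l,k}\) for a single \((l, k)\) pair with NumPy. It assumes the usual score statistic \(S_j = \sum_i G_{ij}(Y_i - \hat{\mu}_i)\) from the null model (\(\beta = 0\)), which these notes do not define explicitly. The function names and inputs are hypothetical, and converting each \(Q\) into a P value (which requires its null distribution) is deliberately left out.

```python
import numpy as np

def score_statistics(G, residuals):
    """Assumed score statistics: S_j = sum_i G_ij * (Y_i - mu_hat_i)."""
    # G is an (n samples x p variants) genotype matrix,
    # residuals is the length-n vector Y - mu_hat from the null model.
    return np.asarray(G, dtype=float).T @ np.asarray(residuals, dtype=float)

def burden_stat(S, pi_hat, w):
    """Q_Burden = (sum_j pi_hat_j * w_j * S_j)^2 for one (l, k) pair."""
    return float(np.sum(pi_hat * w * S) ** 2)

def skat_stat(S, pi_hat, w):
    """Q_SKAT = sum_j pi_hat_j * w_j^2 * S_j^2 for one (l, k) pair."""
    return float(np.sum(pi_hat * w**2 * S**2))
```

Because the Burden statistic sums the weighted scores before squaring, effects in opposite directions cancel, whereas SKAT squares each score first and is insensitive to sign.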
 STAAR tests:
 STAAR-Burden (STAAR-B):
\(T_{\mathrm{STAAR\text{-}B}} = \sum_{l=1}^2\sum_{k=0}^K\frac{\tan\{(0.5 - p_{\mathrm{Burden},l,k})\pi\}}{2(K+1)}\)
 STAAR-SKAT (STAAR-S):
\(T_{\mathrm{STAAR\text{-}S}} = \sum_{l=1}^2\sum_{k=0}^K\frac{\tan\{(0.5 - p_{\mathrm{SKAT},l,k})\pi\}}{2(K+1)}\)
 STAAR-ACAT-V (STAAR-A):
\(T_{\mathrm{STAAR\text{-}A}} = \sum_{l=1}^2\sum_{k=0}^K\frac{\tan\{(0.5 - p_{\mathrm{ACAT\text{-}V},l,k})\pi\}}{2(K+1)}\)
 STAAR-O test statistic:
\(T_{\mathrm{STAAR\text{-}O}} = \frac{1}{3}\left[\tan\{(0.5 - p_{\mathrm{STAAR\text{-}B}})\pi\} + \tan\{(0.5 - p_{\mathrm{STAAR\text{-}S}})\pi\} + \tan\{(0.5 - p_{\mathrm{STAAR\text{-}A}})\pi\}\right]\)
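The two-stage combination above can be sketched in a few lines of Python. This is a minimal illustration, not the STAAR package code: the function names are hypothetical, each input list is assumed to hold the \(2(K+1)\) P values of one test type, and the conversion from a Cauchy statistic to a P value uses the standard Cauchy tail, \(p = 0.5 - \arctan(T)/\pi\), which these notes do not state explicitly.

```python
import math

def cauchy_stat(pvalues):
    """Equal-weight Cauchy statistic: mean of tan((0.5 - p) * pi)."""
    return sum(math.tan((0.5 - p) * math.pi) for p in pvalues) / len(pvalues)

def cauchy_pvalue(t):
    """Upper-tail probability of a standard Cauchy variable at t (assumed null)."""
    return 0.5 - math.atan(t) / math.pi

def staar_o(p_burden, p_skat, p_acatv):
    """Combine per-(l, k) P values into STAAR-B/S/A, then into STAAR-O."""
    p_b = cauchy_pvalue(cauchy_stat(p_burden))  # STAAR-B
    p_s = cauchy_pvalue(cauchy_stat(p_skat))    # STAAR-S
    p_a = cauchy_pvalue(cauchy_stat(p_acatv))   # STAAR-A
    # T_STAAR-O is the mean of the three transformed P values.
    return cauchy_pvalue(cauchy_stat([p_b, p_s, p_a]))
```

A single very small P value dominates the sum because the tangent transform grows rapidly as \(p\) approaches 0, which is why the omnibus result stays small when only one test or annotation carries the signal.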
 Dynamic window analysis using SCANG-STAAR:
 Minimum P value of all candidate moving windows:
\(p_{\mathrm{min}} = \min_{L_{\mathrm{min}} \le I \le L_{\mathrm{max}}}p(I)\)
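To make the minimisation concrete, the sketch below enumerates every candidate window whose size lies between \(L_{\mathrm{min}}\) and \(L_{\mathrm{max}}\) and keeps the smallest P value. It assumes window size is counted in number of variants and that a caller-supplied function `window_pvalue(start, end)` returns \(p(I)\) for the half-open variant range `[start, end)`; both the helper names and this brute-force enumeration are illustrative assumptions, not the search strategy of the SCANG software.

```python
def candidate_windows(n_variants, l_min, l_max):
    """Yield (start, end) index pairs for windows of l_min..l_max variants."""
    for start in range(n_variants):
        for length in range(l_min, l_max + 1):
            end = start + length
            if end > n_variants:
                break  # longer windows from this start would overrun the region
            yield start, end

def min_window_pvalue(n_variants, l_min, l_max, window_pvalue):
    """Return (p_min, window) over all candidate windows."""
    best_p, best_window = None, None
    for start, end in candidate_windows(n_variants, l_min, l_max):
        p = window_pvalue(start, end)
        if best_p is None or p < best_p:
            best_p, best_window = p, (start, end)
    return best_p, best_window
```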
 Conditional analysis
 The STAARpipeline performs conditional analysis to identify RV associations independent of known variants. We first select a list of known variants by including the trait-associated variants identified in the literature, for example, variants indexed in the GWAS Catalog or significant variants in large-scale GWAS. Significant variants detected in individual analyses using the same data can also be added to the known variants list to ensure that the RV signals are not captured by significant individual variants. We then use the following stepwise selection strategy to select a subset of independent variants representing the known variants list as the variants adjusted for in the conditional analysis:
 1. Calculate the individual P value of all variants in the known variants list and select the most significant variant.
 2. For each step, calculate the P values of all the remaining variants conditional on the variant(s) that have already been selected. For each variant, we only condition on the selected variants within a specified region of that variant, such as the \(\pm\)1-Mb window.
 3. Select the variant with the minimum conditional P value that is lower than the cutoff P value, for example, \(1 \times 10^{-4}\).
 4. Repeat steps 2–3 until no variants can be selected.
Finally, we calculate the conditional P value of each significant RV analysis unit by adjusting for the selected variants residing in an extended region (for example, the \(\pm\)1-Mb window) of the analysis unit.
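The stepwise selection described above can be sketched as a short loop. This is only an outline under stated assumptions: `cond_pvalue(variant, conditioned_on)` is a hypothetical callback that would refit the association model and return the variant's P value adjusted for the listed variants, `positions` maps hypothetical variant IDs to base-pair positions, and the first variant is selected without applying the cutoff, as in step 1.

```python
def stepwise_select(positions, cond_pvalue, cutoff=1e-4, window=1_000_000):
    """Pick a subset of independent known variants by forward selection."""
    selected = []
    remaining = set(positions)
    while remaining:
        best_variant, best_p = None, None
        for v in sorted(remaining):
            # Condition only on already-selected variants within +/- window of v.
            nearby = [s for s in selected
                      if abs(positions[s] - positions[v]) <= window]
            p = cond_pvalue(v, nearby)
            if best_p is None or p < best_p:
                best_variant, best_p = v, p
        # After the first pick, stop once nothing passes the cutoff.
        if selected and best_p >= cutoff:
            break
        selected.append(best_variant)
        remaining.remove(best_variant)
    return selected
```

The returned list is what would then be adjusted for when computing the conditional P value of each significant analysis unit.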
Type I error rate simulations
The authors perform this as follows:
 100,000 sequencing chromosomes in a 10-Mb region
 African American population linkage disequilibrium structure (COSI)
 Total sample sizes (n = 50,000)
 Continuous traits from a linear model:
\(Y_i = 0.5X_{1i} + 0.5X_{2i} + \epsilon_i\),
\(X_{1i} \sim N(0,1)\), \(X_{2i} \sim \mathrm{Bernoulli}(0.5)\), \(\epsilon_i \sim N(0,1)\)
 Dichotomous traits from a logistic model:
\(\mathrm{logit}\,P(Y_i = 1) = \alpha_0 + 0.5X_{1i} + 0.5X_{2i}\),
\(X_{1i}\) and \(X_{2i}\) defined as in continuous traits, \(\alpha_0\) for 1% prevalence
 Case–control sampling
 Ten annotations generated: \(A_1, \ldots, A_{10}\) as i.i.d. \(N(0,1)\) for each variant
 SCANG-STAAR-S, SCANG-STAAR-B, SCANG-STAAR-O with MAF and ten annotations as weights
 10,000 replicates for genome-wise (family-wise) type I error rates at α = 0.05 and 0.01
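The phenotype models in this list can be reproduced with a short NumPy sketch. It covers only the covariates and the two trait models. Genotype simulation with COSI, case–control sampling, and the annotation weights are not included. The function names are hypothetical, and choosing \(\alpha_0\) by bisection on the mean predicted probability is an assumption about how a 1% prevalence could be obtained, not a description of the authors' procedure.

```python
import numpy as np

def simulate_covariates(n, rng):
    """X1 ~ N(0, 1) and X2 ~ Bernoulli(0.5)."""
    return rng.normal(0.0, 1.0, n), rng.binomial(1, 0.5, n)

def simulate_continuous(x1, x2, rng):
    """Y = 0.5*X1 + 0.5*X2 + eps, with eps ~ N(0, 1)."""
    return 0.5 * x1 + 0.5 * x2 + rng.normal(0.0, 1.0, len(x1))

def calibrate_alpha0(x1, x2, prevalence=0.01):
    """Bisection for the intercept that gives the target mean P(Y = 1)."""
    lin = 0.5 * x1 + 0.5 * x2
    lo, hi = -20.0, 20.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        mean_prob = np.mean(1.0 / (1.0 + np.exp(-(mid + lin))))
        if mean_prob > prevalence:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0

def simulate_dichotomous(x1, x2, alpha0, rng):
    """logit P(Y = 1) = alpha0 + 0.5*X1 + 0.5*X2."""
    prob = 1.0 / (1.0 + np.exp(-(alpha0 + 0.5 * x1 + 0.5 * x2)))
    return rng.binomial(1, prob)
```

Under the null, none of these traits depends on genotype, so the fraction of replicates in which any window is declared significant estimates the genome-wise type I error rate.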
Using ACAT in STAAR
The Aggregated Cauchy Association Test (ACAT) is a statistical method used for rare-variant association tests (RVATs) in genetic studies.
ACAT is designed to aggregate the association signals of multiple rare genetic variants within a genomic region or a gene, while accounting for the directions of the effects of these variants on the phenotype of interest.
The ACAT method utilizes a Cauchy distribution, which allows for improved performance in identifying true associations, especially when the directions and magnitudes of variant effects are heterogeneous.
In a separate page, I discuss the ACAT method.
Some of the following passages appear on both pages since they are related.
Figure 1. Slide from presentation of ACAT method by Dr. Xihong Lin.
Noncoding rare-variant association tests
The authors propose the following:

Noncoding RVAT framework: Propose a computationally efficient and robust noncoding rare-variant association test (RVAT) framework for phenotype-genotype association analyses of WGS data.

Regression-based: Allows adjustment for covariates, population structure, and relatedness by fitting linear and logistic mixed models for quantitative and dichotomous traits.

Gene-centric approach: Group noncoding rare variants for each gene using eight functional categories of regulatory regions and apply STAAR, which incorporates multiple in silico variant functional annotation scores.

Non-gene-centric approach: Propose SCANG-STAAR, using dynamic windows with data-adaptive sizes and incorporating multidimensional functional annotations instead of fixed-size sliding windows.

Conditional analysis: Perform analytical follow-up to dissect RV association signals independent of known variants.
Applying STAAR-O for multiple annotation weights
In the STAAR Nature Methods paper, the section "Gene-centric analysis of the noncoding genome" shows how STAAR builds on the ACAT method to obtain a combined p-value from a set of annotations. The STAAR framework incorporates multiple functional annotation scores into the RVATs (rare-variant association tests) to increase the power of association analysis. In this context, it uses the STAAR-O test, an omnibus test that aggregates the annotation-weighted burden test, SKAT, and ACAT-V within the STAAR framework.
By incorporating multiple functional annotation scores, such as CADD, LINSIGHT, FATHMM-XF, and annotation principal components (aPCs), the STAAR method enhances the ability to detect associations between variants and traits of interest. The STAAR framework can therefore leverage the strengths of the ACAT method to obtain a combined p-value from a set of annotations for a single variant or a set of variants.
Non-gene-centric analysis using dynamic windows with SCANG-STAAR
SCANG-STAAR improves on the fixed-size sliding-window RVAT in the STAAR framework. It is a dynamic window-based approach that extends the SCANG procedure by incorporating multidimensional functional annotations. This allows flexible detection of the locations and sizes of signal windows across the genome, since the locations of regions associated with a disease or trait are often unknown in advance and their sizes may vary across the genome. Using a prespecified fixed-size sliding window for RVAT can lead to power loss if the prespecified window sizes do not align with the true locations of the signals.
The SCANG-STAAR method has two main procedures: SCANG-STAAR-S and SCANG-STAAR-B. SCANG-STAAR-S extends the SCANG-SKAT (SCANG-S) procedure by calculating the STAAR-SKAT (STAAR-S) p-value in each overlapping window, incorporating multiple variant functional annotations instead of using just the MAF-weight-based SKAT p-value. SCANG-STAAR-B is based on the STAAR-Burden p-value. SCANG-STAAR-S has two advantages over SCANG-STAAR-B in detecting noncoding associations using dynamic windows: first, the effects of causal variants in a neighborhood in the noncoding genome tend to be in different directions, especially in intergenic regions; second, due to the different correlation structures of the two test statistics for overlapping windows, the genome-wide significance threshold of SCANG-STAAR-B is lower than that of SCANG-STAAR-S.
SCANG-STAAR also provides the SCANG-STAAR-O procedure, based on an omnibus p-value of SCANG-STAAR-S and SCANG-STAAR-B calculated by the ACAT method. However, unlike STAAR-O, the ACAT-V test is not incorporated into the omnibus test because it is designed for sparse alternatives and, as a result, tends to detect the region with the smallest size that contains the most significant variant in the dynamic window procedure.
Figure 2. Slide from presentation of ACAT application in STAAR by Dr. Xihong Lin.
Multi-weight annotation analysis
The STAAR framework can be used to combine the p-values associated with each of the 5 annotation columns (CADD_score, MAF, GnomAD_AF, REVEL_score, ClinVar_score) for a single variant. STAAR incorporates multiple functional annotation scores as weights when constructing its statistics, making it suitable for combining p-values from different annotation columns to obtain a single combined p-value for that variant.
Figure 3. From Li et al., Nat Methods 2022: a, Prepare the input data of STAARpipeline, including genotypes, phenotypes and covariates. b, Annotate all variants in the genome using FAVORannotator through the FAVOR database and calculate the (sparse) GRM. c, Define analysis units in the noncoding genome: eight functional categories of regulatory regions, sliding windows and dynamic windows using SCANG. d, Obtain genome-wide significant associations and perform analytical follow-up via conditional analysis.