
Integrating databases

Population genetics
Phenotype, genotype, and function
References

Population genetics

gnomAD (version r2.0.2) (Lek et al., 2016) was used in these studies as the best source of population genetics data. The reference genome is GRCh37. Offline local database mirrors were used in most cases. Input sets used gnomAD variant allele frequencies and reference sequences processed as VCF and CSV files. A specific data transformation using the gnomAD database is outlined elsewhere, but in general gnomAD provided the expected population frequency of each variant, which was then used as a filtering threshold. A strict threshold for rare variants could be set to ignore any candidate variants that are more frequent than 0.001. However, in most cases a more lenient level is used and any remaining benign or common variants are removed by “technical control” (a filter on the cohort that removes variants shared between individuals who do not share a phenotype). A more modest cut-off threshold sometimes allows us to identify variants that are present in the general population and that are responsible for a recessive disease with no predictable heterozygous loss-of-function intolerance.
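The two filtering steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used in these studies: the variant keys, sample names, phenotype labels, and the functions `filter_rare` and `technical_control` are all hypothetical, and variants missing from the population table are assumed to have a frequency of zero.

```python
RARE_AF_THRESHOLD = 0.001  # strict cut-off mentioned in the text


def filter_rare(variants, population_af, threshold=RARE_AF_THRESHOLD):
    """Keep variants whose population allele frequency is at or below threshold.

    Assumption: a variant absent from the population table has frequency 0.
    """
    return [v for v in variants if population_af.get(v, 0.0) <= threshold]


def technical_control(carriers, phenotype):
    """Drop variants shared by individuals who do not share a phenotype.

    carriers maps variant -> list of samples carrying it;
    phenotype maps sample -> phenotype label.
    """
    kept = {}
    for variant, samples in carriers.items():
        phenotypes = {phenotype[s] for s in samples}
        if len(phenotypes) <= 1:
            kept[variant] = samples
    return kept


# Hypothetical allele frequencies keyed by "chrom-pos-ref-alt".
population_af = {"1-1000-A-G": 0.25, "2-2000-C-T": 0.0004, "3-3000-G-A": 0.008}
variants = ["1-1000-A-G", "2-2000-C-T", "3-3000-G-A", "4-4000-T-C"]

strict = filter_rare(variants, population_af)
lenient = filter_rare(variants, population_af, threshold=0.01)

# Hypothetical cohort: s1 and s2 share a phenotype, s3 does not.
phenotype = {"s1": "phenotype_A", "s2": "phenotype_A", "s3": "phenotype_B"}
carriers = {
    "2-2000-C-T": ["s1", "s2"],
    "3-3000-G-A": ["s1", "s3"],
    "4-4000-T-C": ["s2"],
}
controlled = technical_control(carriers, phenotype)
```

With the lenient threshold, the variant at 0.008 survives the frequency filter but is then removed by the cohort filter because its carriers have different phenotypes.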

Other sources of population genetics data come from resources such as ClinVar and dbSNP, which, as they grow in size, become an annotated and curated form of population data. These resources allow us to calculate the expected frequencies for disease-causing variants. However, since these are manually curated databases and predominantly European-based, they are inherently biased and not reliable for statistical applications.

Phenotype, genotype, and function

The population genetics database gnomAD has been addressed individually in the Population genetics section above, as this is the most important type of annotation and filtering criterion for genetic determinants of rare disease. Additionally, many phenotype and genotype databases have been used in these studies for annotation and interpretation. Specifically, the most frequently used data came from MGI Phenotype, MorbidMap, VOC MammalianPhenotype, Gencode symbol, UniProtKB, Entrez ID, ENSGene ID, GO ID, Description, OMIM, BIOGRID interactions, HGMD human phenotype, ClinVar, and dbSNP. In most cases, every candidate variant was annotated with the main information per gene from a local database containing the information from each of the listed resources.
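As a rough illustration of this per-gene annotation step, the sketch below joins candidate variants to a local gene table. The table layout, the field names (`omim`, `go_id`), the placeholder gene symbols and identifiers, and the function `annotate` are assumptions made for the example and do not describe the actual local database.

```python
def annotate(variants, gene_annotations, fields):
    """Attach per-gene annotation fields to each variant record.

    variants: list of dicts, each with a "gene" key.
    gene_annotations: dict mapping gene symbol -> dict of annotation fields.
    Missing genes or fields are recorded as None rather than dropped.
    """
    annotated = []
    for variant in variants:
        info = gene_annotations.get(variant["gene"], {})
        row = dict(variant)  # copy so the input records stay unchanged
        for field in fields:
            row[field] = info.get(field)
        annotated.append(row)
    return annotated


# Hypothetical local gene table with placeholder identifiers.
gene_annotations = {
    "GENE_A": {"omim": "OMIM_PLACEHOLDER_1", "go_id": "GO_PLACEHOLDER_1"},
    "GENE_B": {"omim": "OMIM_PLACEHOLDER_2"},
}
candidates = [
    {"variant": "1-1000-A-G", "gene": "GENE_A"},
    {"variant": "2-2000-C-T", "gene": "GENE_B"},
    {"variant": "3-3000-G-A", "gene": "GENE_C"},
]
annotated = annotate(candidates, gene_annotations, fields=["omim", "go_id"])
```

Keeping unannotated variants with empty fields, instead of discarding them, leaves the decision about variants of unknown significance to a later step.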

These are the “basic” information databases that we used to annotate variants. In a cohort study, data mining can find correlations across these annotations, so they were retained for later use, as they do not significantly increase data storage. Even if an obvious cause of disease is found, we may later return to the data to look for other cofactors or genetic modifiers. In a single case study, for example, a variant of unknown significance may have no statistical basis to be selected or ignored. We use this information to decide whether that mutation is worth consideration: Is it in a protein domain of known function? Are there other cases reported with the same phenotype? What is the gene function, ontology, etc.?

We have also used some gene lists that are specific to disease, druggability, etc. A major contributor to collecting these gene lists has been the MacArthur Lab (Lab, 2018). These gene lists can be used in special cases. For example, a study looking at (1) dominant pathogenic mutations (2) in known immune genes might filter to include only genes on those lists. Alternatively, we could decide to study only SNPs in FDA-approved drug targets.

Gene list	Gene count	Source	Reference
	19,194	HUGO 2018	(at the European Bioinformatics Institute, 2018)
FDA-approved drug targets	385	Wishart 2018	(Wishart et al., 2017)
Drug targets	201	Nelson 2012	(Nelson et al., 2012)
Autosomal dominant genes	307	Blekhman 2008	(Blekhman et al., 2008)
Autosomal dominant genes	631	Berg 2013	(Berg et al., 2013)
Autosomal recessive genes	527	Blekhman 2008	(Blekhman et al., 2008)
Autosomal recessive genes	1,073	Berg 2013	(Berg et al., 2013)
X-linked genes	66	Blekhman 2008	(Blekhman et al., 2008)
X-linked recessive genes	102	Berg 2013	(Berg et al., 2013)
X-linked dominant genes	34	Berg 2013	(Berg et al., 2013)
X-linked ClinVar genes	61	Landrum 2014	(Landrum et al., 2013)
All dominant genes	709	Blekhman 2008, Berg 2013	(missing reference)
All recessive genes	1,183	Blekhman 2008, Berg 2013	(missing reference)
Homozygous LoF tolerant	330	Lek 2016	(Lek et al., 2016)
Essential in culture	283	Hart 2014	(Hart et al., 2014)
Essential in culture	683	Hart 2017	(Hart et al., 2017)
Non-essential in culture	913	Hart 2017	(Hart et al., 2017)
Essential in mice	2,454	Blake 2011, Georgi 2013, Liu 2013	(missing reference)
Genes nearest to GWAS peaks	6,336	MacArthur 2017	(MacArthur et al., 2016)
DNA Repair Genes	178	Wood 2005	(Wood et al., 2005)
DNA Repair Genes	151	Kang 2012	(Kang et al., 2012)
ClinGen haploinsufficient genes	294	Rehm 2015	(Rehm et al., 2015)
Olfactory receptors	371	Mainland 2015	(Mainland et al., 2015)
Reported in ClinVar	3,078	Landrum 2014	(Landrum et al., 2013)
Kinases	347	UniProt 2016	(Consortium, 2016)
GPCRs from Guide to Pharmacology	391	Alexander 2017, Harding 2018	(missing reference)
GPCRs from Uniprot	756	UniProt 2016	(Consortium, 2016)
Natural product targets	37	Dancik 2010	(Dančík et al., 2010)
BROCA - Cancer Risk Panel	66	BROCA Cancer Risk Panel	(Department of Laboratory Medicine, n.d.)
ACMG V2.0	59	Kalia 2017	(Kalia et al., 2016)
GPI-anchored proteins	135	UniProt 2016	(Consortium, 2016)
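Combining gene lists such as those in the table above amounts to set operations on gene symbols. The sketch below is one way this might look, assuming each list is available as plain text with one symbol per line; the helper names `load_gene_list` and `filter_by_gene_lists` and the placeholder symbols are hypothetical.

```python
def load_gene_list(lines):
    """Parse gene symbols from lines of text, one symbol per line.

    Blank lines and lines starting with '#' are skipped.
    """
    genes = set()
    for line in lines:
        symbol = line.strip()
        if symbol and not symbol.startswith("#"):
            genes.add(symbol)
    return genes


def filter_by_gene_lists(variants, *gene_lists):
    """Keep variants whose gene appears in every supplied gene list."""
    if not gene_lists:
        return list(variants)
    allowed = set.intersection(*(set(g) for g in gene_lists))
    return [v for v in variants if v["gene"] in allowed]


# Placeholder lists standing in for, e.g., dominant genes and immune genes.
dominant_genes = load_gene_list(["# dominant", "GENE_A", "GENE_B", ""])
immune_genes = load_gene_list(["GENE_B", "GENE_C"])

candidates = [
    {"variant": "1-1000-A-G", "gene": "GENE_A"},
    {"variant": "2-2000-C-T", "gene": "GENE_B"},
    {"variant": "3-3000-G-A", "gene": "GENE_C"},
]
dominant_immune = filter_by_gene_lists(candidates, dominant_genes, immune_genes)
```

Passing a single list, for instance FDA-approved drug targets, restricts the candidates to that list alone.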

Verma et al. (2018) take an interesting approach to comparing druggable targets with population genetics data. DrugBank is a database linking over 800 genes to over 950 unique drugs. Genetic data can be filtered for these genes and targeted for LoF variants. Association analysis consists of logistic regression using the ICD-9 codes, and linear regression using quantitative variables. The gene binning and regression analysis steps are done using BioBin.
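The idea of gene-based binning can be illustrated with a short sketch. This is not BioBin and does not use its interface; it only shows, under assumed inputs, how LoF variants in a set of target genes might be collapsed into a per-gene, per-sample burden count that a regression could then use as a predictor. The tuple layout, the `"LoF"` label, and the function `bin_variants_by_gene` are hypothetical.

```python
from collections import defaultdict


def bin_variants_by_gene(calls, target_genes, consequence="LoF"):
    """Count qualifying variants per gene and per sample.

    calls: iterable of (sample, gene, consequence) tuples.
    Only calls in target_genes with the given consequence are counted.
    Returns {gene: {sample: count}}.
    """
    burden = defaultdict(lambda: defaultdict(int))
    for sample, gene, call_consequence in calls:
        if gene in target_genes and call_consequence == consequence:
            burden[gene][sample] += 1
    return {gene: dict(samples) for gene, samples in burden.items()}


# Placeholder drug-target genes and variant calls.
target_genes = {"GENE_A", "GENE_B"}
calls = [
    ("s1", "GENE_A", "LoF"),
    ("s1", "GENE_A", "LoF"),
    ("s2", "GENE_A", "missense"),
    ("s2", "GENE_B", "LoF"),
    ("s3", "GENE_C", "LoF"),
]
burden = bin_variants_by_gene(calls, target_genes)
```

The regression step itself is left out here; each gene's burden counts would serve as the predictor, with an ICD-9 derived case/control label or a quantitative variable as the outcome.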

The International Statistical Classification of Diseases and Related Health Problems (commonly known as the ICD) provides alphanumeric codes to classify diseases and a wide variety of signs, symptoms, abnormal findings, complaints, social circumstances, and external causes of injury or disease. Nearly every health condition can be assigned to a unique category and given a code, up to six characters long. Such categories usually include a set of similar diseases.
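For a logistic regression, ICD codes first have to be turned into a binary case/control label. A minimal sketch follows, assuming ICD-9 codes are stored in their dotted form (category before the decimal point) and that a patient counts as a case if any of their codes falls in the category of interest. The patient identifiers and the functions `icd9_category` and `case_control_status` are hypothetical.

```python
def icd9_category(code):
    """Return the category part of a dotted ICD-9 code, e.g. '250.01' -> '250'.

    Assumption: codes keep their decimal point; undotted codes are not handled.
    """
    return code.strip().upper().split(".")[0]


def case_control_status(patient_codes, category):
    """Label each patient 1 (case) or 0 (control) for one ICD-9 category."""
    return {
        patient: int(any(icd9_category(code) == category for code in codes))
        for patient, codes in patient_codes.items()
    }


# Hypothetical patients with dotted ICD-9 codes.
patient_codes = {
    "p1": ["250.01", "401.9"],
    "p2": ["401.9"],
    "p3": [],
}
status = case_control_status(patient_codes, "250")
```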

BioBin relies on the Library of Knowledge Integration (LOKI), which integrates multiple databases to provide a comprehensive biological knowledge platform for variant binning (Pendergrass et al., 2013). The LOKI database consolidates biological information from several sources, most notably the National Center for Biotechnology Information (NCBI) dbSNP and Entrez Gene, Kyoto Encyclopedia of Genes and Genomes (KEGG), Reactome, Gene Ontology (GO), Protein families database (Pfam), and NetPath signal transduction pathways, amongst others (missing reference).

References

Lek, M., Karczewski, K. J., Minikel, E. V., Samocha, K. E., Banks, E., Fennell, T., O’Donnell-Luria, A. H., Ware, J. S., Hill, A. J., Cummings, B. B., & others. (2016). Analysis of protein-coding genetic variation in 60,706 humans. Nature, 536(7616), 285.
Lab, M. A. (2018). List of gene lists for genomic analysis. GitHub. https://github.com/macarthur-lab/gene_lists
HUGO Gene Nomenclature Committee at the European Bioinformatics Institute. (2018). Genenames.org.
Wishart, D. S., Feunang, Y. D., Guo, A. C., Lo, E. J., Marcu, A., Grant, J. R., Sajed, T., Johnson, D., Li, C., Sayeeda, Z., Assempour, N., Iynkkaran, I., Liu, Y., Maciejewski, A., Gale, N., Wilson, A., Chin, L., Cummings, R., Le, D., … Wilson, M. (2017). DrugBank 5.0: a major update to the DrugBank database for 2018 . Nucleic Acids Research, 46(D1), D1074–D1082. https://doi.org/10.1093/nar/gkx1037
Nelson, M. R., Wegmann, D., Ehm, M. G., Kessner, D., St. Jean, P., Verzilli, C., Shen, J., Tang, Z., Bacanu, S.-A., Fraser, D., Warren, L., Aponte, J., Zawistowski, M., Liu, X., Zhang, H., Zhang, Y., Li, J., Li, Y., Li, L., … Mooser, V. (2012). An Abundance of Rare Functional Variants in 202 Drug Target Genes Sequenced in 14,002 People. Science, 337(6090), 100–104. https://doi.org/10.1126/science.1217876
Blekhman, R., Man, O., Herrmann, L., Boyko, A. R., Indap, A., Kosiol, C., Bustamante, C. D., Teshima, K. M., & Przeworski, M. (2008). Natural Selection on Genes that Underlie Human Disease Susceptibility. Current Biology, 18(12), 883–889. https://doi.org/10.1016/j.cub.2008.04.074
Berg, J. S., Adams, M., Nassar, N., Bizon, C., Lee, K., Schmitt, C. P., Wilhelmsen, K. C., & Evans, J. P. (2013). An informatics approach to analyzing the incidentalome. Genetics in Medicine, 15(1), 36.
Landrum, M. J., Lee, J. M., Riley, G. R., Jang, W., Rubinstein, W. S., Church, D. M., & Maglott, D. R. (2013). ClinVar: public archive of relationships among sequence variation and human phenotype. Nucleic Acids Research, 42(D1), D980–D985. https://doi.org/10.1093/nar/gkt1113
Hart, T., Brown, K. R., Sircoulomb, F., Rottapel, R., & Moffat, J. (2014). Measuring error rates in genomic perturbation screens: gold standards for human functional genomics. Molecular Systems Biology, 10(7).
Hart, T., Tong, A. H. Y., Chan, K., Van Leeuwen, J., Seetharaman, A., Aregger, M., Chandrashekhar, M., Hustedt, N., Seth, S., Noonan, A., & others. (2017). Evaluation and design of genome-wide CRISPR/SpCas9 knockout screens. G3: Genes, Genomes, Genetics, 7(8), 2719–2727.
MacArthur, J., Bowler, E., Cerezo, M., Gil, L., Hall, P., Hastings, E., Junkins, H., McMahon, A., Milano, A., Morales, J., Pendlington, Z. M., Welter, D., Burdett, T., Hindorff, L., Flicek, P., Cunningham, F., & Parkinson, H. (2016). The new NHGRI-EBI Catalog of published genome-wide association studies (GWAS Catalog). Nucleic Acids Research, 45(D1), D896–D901. https://doi.org/10.1093/nar/gkw1133
Wood, R. D., Mitchell, M., & Lindahl, T. (2005). Human DNA repair genes, 2005. Mutation Research/Fundamental and Molecular Mechanisms of Mutagenesis, 577(1), 275–283. https://doi.org/10.1016/j.mrfmmm.2005.03.007
Kang, J., D’Andrea, A. D., & Kozono, D. (2012). A DNA Repair Pathway–Focused Score for Prediction of Outcomes in Ovarian Cancer Treated With Platinum-Based Chemotherapy. JNCI: Journal of the National Cancer Institute, 104(9), 670–681. https://doi.org/10.1093/jnci/djs177
Rehm, H. L., Berg, J. S., Brooks, L. D., Bustamante, C. D., Evans, J. P., Landrum, M. J., Ledbetter, D. H., Maglott, D. R., Martin, C. L., Nussbaum, R. L., Plon, S. E., Ramos, E. M., Sherry, S. T., & Watson, M. S. (2015). ClinGen — The Clinical Genome Resource. New England Journal of Medicine, 372(23), 2235–2242. https://doi.org/10.1056/NEJMsr1406261
Mainland, J. D., Li, Y. R., Zhou, T., Liu, W. L. L., & Matsunami, H. (2015). Human olfactory receptor responses to odorants. Scientific Data, 2, 150002. https://doi.org/10.1038/sdata.2015.2
Consortium, T. U. P. (2016). UniProt: the universal protein knowledgebase. Nucleic Acids Research, 45(D1), D158–D169. https://doi.org/10.1093/nar/gkw1099
Dančík, V., Seiler, K. P., Young, D. W., Schreiber, S. L., & Clemons, P. A. (2010). Distinct Biological Network Properties between the Targets of Natural Products and Disease Genes. Journal of the American Chemical Society, 132(27), 9259–9261. https://doi.org/10.1021/ja102798t
Department of Laboratory Medicine, University of Washington. (n.d.). BROCA Cancer Risk Panel. http://depts.washington.edu/labweb/Divisions/MolDiag/MolDiagGen/index.htm
Kalia, S. S., Adelman, K., Bale, S. J., Chung, W. K., Eng, C., Evans, J. P., Herman, G. E., Hufnagel, S. B., Klein, T. E., Korf, B. R., McKelvey, K. D., Ormond, K. E., Richards, C. S., Vlangos, C. N., Watson, M., Martin, C. L., & Miller, D. T. (2016). Recommendations for reporting of secondary findings in clinical exome and genome sequencing, 2016 update (ACMG SF v2.0): a policy statement of the American College of Medical Genetics and Genomics. Genetics in Medicine, 19, 249. https://doi.org/10.1038/gim.2016.190
Verma, S. S., Josyula, N., Verma, A., Zhang, X., Veturi, Y., Dewey, F. E., Hartzel, D. N., Leader, J., Ritchie, M. D., & Pendergrass, S. A. (2018). Rare variants in drug target genes contributing to complex diseases, phenome-wide. Scientific Reports, 8(1), 4624.
Pendergrass, S. A., Frase, A., Wallace, J., Wolfe, D., Katiyar, N., Moore, C., & Ritchie, M. D. (2013). Genomic analyses with biofilter 2.0: knowledge driven filtering, annotation, and model development. BioData Mining, 6(1), 25.