Integrating databases

Population genetics

The gnomAD database (version r2.0.2) (Lek et al., 2016) was used in these studies as the best source of population genetics data. The reference genome is GRCh37. Offline local database mirrors were used in most cases. Input sets used gnomAD variant allele frequencies and reference sequences processed as VCF and CSV files. A specific data transformation using the gnomAD database is outlined separately (missing reference), but in general, gnomAD was used to set a filtering threshold on the expected population frequency of each variant. A strict threshold for rare variants could be set to ignore any candidate variants that are more frequent than 0.001. However, in most cases a more lenient level is used, and any remaining benign or common variants are removed by a “technical control” (a filter on the cohort that removes variants common to individuals who do not share a phenotype). A more modest cut-off threshold sometimes allows us to identify variants that are present in the general population yet are responsible for a recessive disease, with no predictable heterozygous loss-of-function intolerance.
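The two filtering steps described above can be sketched as follows. This is a minimal illustration, not the pipeline used in these studies: the variant identifiers, sample names, phenotype labels, and the function names `filter_rare` and `technical_control` are all hypothetical, and treating variants absent from gnomAD as having frequency 0 is an assumption.

```python
RARE_AF_THRESHOLD = 0.001  # strict cut-off mentioned in the text


def filter_rare(variants, gnomad_af, max_af=RARE_AF_THRESHOLD):
    """Keep variants whose gnomAD allele frequency is at or below max_af.

    Variants missing from the gnomAD lookup are treated as frequency 0.0
    (an assumption made for this sketch).
    """
    return [v for v in variants if gnomad_af.get(v, 0.0) <= max_af]


def technical_control(variants, carriers, phenotype):
    """Drop variants carried by individuals who do not share a phenotype."""
    kept = []
    for v in variants:
        phenotypes = {phenotype[s] for s in carriers.get(v, ())}
        if len(phenotypes) <= 1:
            kept.append(v)
    return kept


# Hypothetical toy data: variant IDs, allele frequencies, carriers, phenotypes.
variants = ["1-1000-A-G", "2-2000-C-T", "3-3000-G-A"]
gnomad_af = {"1-1000-A-G": 0.0002, "2-2000-C-T": 0.05}
carriers = {
    "1-1000-A-G": {"s1", "s2"},
    "2-2000-C-T": {"s2", "s3"},
    "3-3000-G-A": {"s1", "s3"},
}
phenotype = {"s1": "phenotype_A", "s2": "phenotype_A", "s3": "phenotype_B"}

strict = filter_rare(variants, gnomad_af)
lenient = technical_control(
    filter_rare(variants, gnomad_af, max_af=0.1), carriers, phenotype
)
```

With the lenient threshold, the common variant survives the frequency filter but is removed by the technical control, because its carriers have different phenotypes.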

Other sources of population genetics data come from resources such as ClinVar and dbSNP, which, as they grow in size, become an annotated and curated form of population data. These resources allow us to calculate the expected frequencies for disease-causing variants. However, since these are manually curated databases and predominantly European-based, they are inherently biased and not reliable for statistical applications.

Phenotype, genotype, and function

The population genetics database gnomAD has been addressed individually in section [subsec:gnomad], as this is the most important type of annotation and filtering criterion for genetic determinants of rare disease. In addition, many phenotype and genotype databases were used in these studies for annotation and interpretation. Specifically, the most frequently used data came from MGI Phenotype, MorbidMap, VOC MammalianPhenotype, Gencode symbol, UniProtKB, Entrez ID, ENSGene ID, GO ID, Description, OMIM, BIOGRID interactions, HGMD human phenotype, ClinVar, and dbSNP. In most cases, every candidate variant was annotated with the main information per gene from a local database containing the information from each of the listed resources.
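As a rough sketch of this per-gene annotation step, the example below looks up each candidate variant's gene in a local SQLite table. The table name `gene_annotation`, its columns, and every value in it are placeholders invented for illustration; the actual local database schema is not described in this text.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory stand-in for the local annotation database."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE gene_annotation ("
        "symbol TEXT PRIMARY KEY, omim TEXT, go_id TEXT, description TEXT)"
    )
    # Placeholder rows; not real OMIM or GO identifiers.
    con.executemany(
        "INSERT INTO gene_annotation VALUES (?, ?, ?, ?)",
        [
            ("GENE_A", "OMIM:000001", "GO:0000001", "placeholder description A"),
            ("GENE_B", "OMIM:000002", "GO:0000002", "placeholder description B"),
        ],
    )
    return con


def annotate(variants, con):
    """Attach per-gene annotation fields to each variant record."""
    fields = ("omim", "go_id", "description")
    annotated = []
    for variant in variants:
        row = con.execute(
            "SELECT omim, go_id, description FROM gene_annotation "
            "WHERE symbol = ?",
            (variant["gene"],),
        ).fetchone()
        extra = dict(zip(fields, row)) if row else {}
        annotated.append({**variant, **extra})
    return annotated


con = build_demo_db()
candidates = [
    {"id": "v1", "gene": "GENE_A"},
    {"id": "v2", "gene": "GENE_X"},  # gene absent from the local database
]
annotated = annotate(candidates, con)
```

Variants in genes that are missing from the local database are passed through unchanged, so no candidate is silently dropped at the annotation stage.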

These are the “basic” information databases that we used to annotate variants. In a cohort study, data mining of these annotations can find correlations, so they were retained for later use, as they do not significantly increase data storage. Even if an obvious cause of disease is found, we may later return to the data to find other cofactors or genetic modifiers. In a single case study, for example, a variant of unknown significance may have no statistical basis to be selected or ignored. We use this information to decide whether that mutation is worth consideration: Is it in a protein domain of known function? Are there other cases reported with the same phenotype? What is the gene function, ontology, etc.?

We have also used some gene lists that are specific to disease, druggability, etc. A major contributor to collecting these gene lists has been the MacArthur Lab (Lab, 2018). These gene lists can be used in special cases. For example, a study looking at (1) dominant pathogenic mutations (2) in known immune genes might filter to include only genes on the corresponding lists. Similarly, we could decide to study only SNPs in FDA-approved drug targets.

| Mechanism | Gene count | Name | Reference |
|---|---|---|---|
| | 19,194 | HUGO 2018 | (at the European Bioinformatics Institute, 2018) |
| FDA-approved drug targets | 385 | Wishart 2018 | (Wishart et al., 2017) |
| Drug targets | 201 | Nelson 2012 | (Nelson et al., 2012) |
| Autosomal dominant genes | 307 | Blekhman 2008 | (Blekhman et al., 2008) |
| Autosomal dominant genes | 631 | Berg 2013 | (Berg et al., 2013) |
| Autosomal recessive genes | 527 | Blekhman 2008 | (Blekhman et al., 2008) |
| Autosomal recessive genes | 1,073 | Berg 2013 | (Berg et al., 2013) |
| X-linked genes | 66 | Blekhman 2008 | (Blekhman et al., 2008) |
| X-linked recessive genes | 102 | Berg 2013 | (Berg et al., 2013) |
| X-linked dominant genes | 34 | Berg 2013 | (Berg et al., 2013) |
| X-linked ClinVar genes | 61 | Landrum 2014 | (Landrum et al., 2013) |
| All dominant genes | 709 | Blekhman 2008, Berg 2013 | (missing reference) |
| All recessive genes | 1,183 | Blekhman 2008, Berg 2013 | (missing reference) |
| Homozygous LoF tolerant | 330 | Lek 2016 | (Lek et al., 2016) |
| Essential in culture | 283 | Hart 2014 | (Hart et al., 2014) |
| Essential in culture | 683 | Hart 2017 | (Hart et al., 2017) |
| Non-essential in culture | 913 | Hart 2017 | (Hart et al., 2017) |
| Essential in mice | 2,454 | Blake ‘11, Georgi ‘13, Liu ‘13 | (missing reference) |
| Genes nearest to GWAS peaks | 6,336 | MacArthur 2017 | (MacArthur et al., 2016) |
| DNA repair genes | 178 | Wood 2005 | (Wood et al., 2005) |
| DNA repair genes | 151 | Kang 2012 | (Kang et al., 2012) |
| ClinGen haploinsufficient genes | 294 | Rehm 2015 | (Rehm et al., 2015) |
| Olfactory receptors | 371 | Mainland 2015 | (Mainland et al., 2015) |
| Reported in ClinVar | 3,078 | Landrum 2014 | (Landrum et al., 2013) |
| Kinases | 347 | UniProt 2016 | (Consortium, 2016) |
| GPCRs from Guide to Pharmacology | 391 | Alexander 2017, Harding 2018 | (missing reference) |
| GPCRs from UniProt | 756 | UniProt 2016 | (Consortium, 2016) |
| Natural product targets | 37 | Dancik 2010 | (Dančík et al., 2010) |
| BROCA - Cancer Risk Panel | 66 | BROCA Cancer Risk Panel | (Department of Laboratory Medicine, n.d.) |
| ACMG V2.0 | 59 | Kalia 2017 | (Kalia et al., 2016) |
| GPI-anchored proteins | 135 | UniProt 2016 | (Consortium, 2016) |
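Restricting candidates to genes that appear on several lists at once, such as dominant genes that are also known immune genes, amounts to a set intersection. The sketch below assumes each gene list has already been loaded as a collection of gene symbols; the function name `filter_by_gene_lists` and all gene symbols are hypothetical.

```python
def filter_by_gene_lists(variants, *gene_lists):
    """Keep variants whose gene appears in every supplied gene list."""
    if not gene_lists:
        return list(variants)
    allowed = set.intersection(*(set(genes) for genes in gene_lists))
    return [v for v in variants if v["gene"] in allowed]


# Placeholder gene symbols standing in for real list contents.
dominant_genes = {"GENE_A", "GENE_B", "GENE_C"}
immune_genes = {"GENE_B", "GENE_C", "GENE_D"}

candidates = [
    {"id": "v1", "gene": "GENE_A"},
    {"id": "v2", "gene": "GENE_B"},
    {"id": "v3", "gene": "GENE_D"},
]

dominant_immune = filter_by_gene_lists(candidates, dominant_genes, immune_genes)
```

Passing a single list, for example one of drug target genes, gives the simpler one-list filter.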

Verma et al. (2018) take an interesting approach to comparing druggable targets with population genetics data. DrugBank is a database covering over 800 genes targeted by over 950 unique drugs. Genetic data can be filtered for these genes and targeted for LoF variants. Association analysis consists of logistic regression using the ICD-9 codes, and linear regression using quantitative variables. The gene binning and regression analysis steps are done using BioBin.
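The binning idea can be illustrated with a small sketch that collapses LoF variant calls into per-gene counts for each individual, which could then serve as the predictor in a regression against an ICD-9 case/control label. This is not BioBin's interface: the function names, the input tuple layout, and the choice of consequence terms counted as LoF are all assumptions, and the regression step itself is not shown.

```python
from collections import defaultdict

# Consequence terms treated as loss-of-function in this sketch (an assumption).
LOF_CONSEQUENCES = {
    "stop_gained",
    "frameshift_variant",
    "splice_donor_variant",
    "splice_acceptor_variant",
}


def bin_lof_by_gene(calls, target_genes):
    """Count LoF variant calls per individual within each target gene.

    calls: iterable of (sample, gene, consequence) tuples.
    Returns {gene: {sample: LoF count}}.
    """
    burden = defaultdict(lambda: defaultdict(int))
    for sample, gene, consequence in calls:
        if gene in target_genes and consequence in LOF_CONSEQUENCES:
            burden[gene][sample] += 1
    return {gene: dict(samples) for gene, samples in burden.items()}


def carrier_vector(burden, gene, samples):
    """Return 1 for carriers of any LoF variant in the gene, else 0."""
    carriers = burden.get(gene, {})
    return [1 if carriers.get(s, 0) > 0 else 0 for s in samples]


# Hypothetical calls for three individuals and one drug target gene.
calls = [
    ("s1", "GENE_A", "stop_gained"),
    ("s1", "GENE_A", "frameshift_variant"),
    ("s2", "GENE_A", "missense_variant"),
    ("s3", "GENE_B", "stop_gained"),
]
burden = bin_lof_by_gene(calls, target_genes={"GENE_A"})
predictor = carrier_vector(burden, "GENE_A", ["s1", "s2", "s3"])
```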

The International Statistical Classification of Diseases and Related Health Problems (commonly known as the ICD) provides alphanumeric codes to classify diseases and a wide variety of signs, symptoms, abnormal findings, complaints, social circumstances, and external causes of injury or disease. Nearly every health condition can be assigned to a unique category and given a code, up to six characters long. Such categories usually include a set of similar diseases.

BioBin relies on the Library of Knowledge Integration (LOKI), which integrates multiple databases to provide a comprehensive biological knowledge platform for variant binning (Pendergrass et al., 2013). The LOKI database consolidates biological information from several sources, most notably the National Center for Biotechnology Information (NCBI) dbSNP and Entrez Gene, Kyoto Encyclopedia of Genes and Genomes (KEGG), Reactome, Gene Ontology (GO), the Protein families database (Pfam), and NetPath signal transduction pathways, amongst others (missing reference).


  1. Lek, M., Karczewski, K. J., Minikel, E. V., Samocha, K. E., Banks, E., Fennell, T., O’Donnell-Luria, A. H., Ware, J. S., Hill, A. J., Cummings, B. B., & others. (2016). Analysis of protein-coding genetic variation in 60,706 humans. Nature, 536(7616), 285.
  2. MacArthur Lab. (2018). List of gene lists for genomic analysis. GitHub.
  3. HUGO Gene Nomenclature Committee at the European Bioinformatics Institute. (2018).
  4. Wishart, D. S., Feunang, Y. D., Guo, A. C., Lo, E. J., Marcu, A., Grant, J. R., Sajed, T., Johnson, D., Li, C., Sayeeda, Z., Assempour, N., Iynkkaran, I., Liu, Y., Maciejewski, A., Gale, N., Wilson, A., Chin, L., Cummings, R., Le, D., … Wilson, M. (2017). DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Research, 46(D1), D1074–D1082.
  5. Nelson, M. R., Wegmann, D., Ehm, M. G., Kessner, D., St. Jean, P., Verzilli, C., Shen, J., Tang, Z., Bacanu, S.-A., Fraser, D., Warren, L., Aponte, J., Zawistowski, M., Liu, X., Zhang, H., Zhang, Y., Li, J., Li, Y., Li, L., … Mooser, V. (2012). An Abundance of Rare Functional Variants in 202 Drug Target Genes Sequenced in 14,002 People. Science, 337(6090), 100–104.
  6. Blekhman, R., Man, O., Herrmann, L., Boyko, A. R., Indap, A., Kosiol, C., Bustamante, C. D., Teshima, K. M., & Przeworski, M. (2008). Natural Selection on Genes that Underlie Human Disease Susceptibility. Current Biology, 18(12), 883–889.
  7. Berg, J. S., Adams, M., Nassar, N., Bizon, C., Lee, K., Schmitt, C. P., Wilhelmsen, K. C., & Evans, J. P. (2013). An informatics approach to analyzing the incidentalome. Genetics in Medicine, 15(1), 36.
  8. Landrum, M. J., Lee, J. M., Riley, G. R., Jang, W., Rubinstein, W. S., Church, D. M., & Maglott, D. R. (2013). ClinVar: public archive of relationships among sequence variation and human phenotype. Nucleic Acids Research, 42(D1), D980–D985.
  9. Hart, T., Brown, K. R., Sircoulomb, F., Rottapel, R., & Moffat, J. (2014). Measuring error rates in genomic perturbation screens: gold standards for human functional genomics. Molecular Systems Biology, 10(7).
  10. Hart, T., Tong, A. H. Y., Chan, K., Van Leeuwen, J., Seetharaman, A., Aregger, M., Chandrashekhar, M., Hustedt, N., Seth, S., Noonan, A., & others. (2017). Evaluation and design of genome-wide CRISPR/SpCas9 knockout screens. G3: Genes, Genomes, Genetics, 7(8), 2719–2727.
  11. MacArthur, J., Bowler, E., Cerezo, M., Gil, L., Hall, P., Hastings, E., Junkins, H., McMahon, A., Milano, A., Morales, J., Pendlington, Z. M., Welter, D., Burdett, T., Hindorff, L., Flicek, P., Cunningham, F., & Parkinson, H. (2016). The new NHGRI-EBI Catalog of published genome-wide association studies (GWAS Catalog). Nucleic Acids Research, 45(D1), D896–D901.
  12. Wood, R. D., Mitchell, M., & Lindahl, T. (2005). Human DNA repair genes, 2005. Mutation Research/Fundamental and Molecular Mechanisms of Mutagenesis, 577(1), 275–283.
  13. Kang, J., D’Andrea, A. D., & Kozono, D. (2012). A DNA Repair Pathway–Focused Score for Prediction of Outcomes in Ovarian Cancer Treated With Platinum-Based Chemotherapy. JNCI: Journal of the National Cancer Institute, 104(9), 670–681.
  14. Rehm, H. L., Berg, J. S., Brooks, L. D., Bustamante, C. D., Evans, J. P., Landrum, M. J., Ledbetter, D. H., Maglott, D. R., Martin, C. L., Nussbaum, R. L., Plon, S. E., Ramos, E. M., Sherry, S. T., & Watson, M. S. (2015). ClinGen — The Clinical Genome Resource. New England Journal of Medicine, 372(23), 2235–2242.
  15. Mainland, J. D., Li, Y. R., Zhou, T., Liu, W. L. L., & Matsunami, H. (2015). Human olfactory receptor responses to odorants. Scientific Data, 2, 150002.
  16. Consortium, T. U. P. (2016). UniProt: the universal protein knowledgebase. Nucleic Acids Research, 45(D1), D158–D169.
  17. Dančík, V., Seiler, K. P., Young, D. W., Schreiber, S. L., & Clemons, P. A. (2010). Distinct Biological Network Properties between the Targets of Natural Products and Disease Genes. Journal of the American Chemical Society, 132(27), 9259–9261.
  18. Department of Laboratory Medicine, U. of W. BROCA-CancerRiskPanel.
  19. Kalia, S. S., Adelman, K., Bale, S. J., Chung, W. K., Eng, C., Evans, J. P., Herman, G. E., Hufnagel, S. B., Klein, T. E., Korf, B. R., McKelvey, K. D., Ormond, K. E., Richards, C. S., Vlangos, C. N., Watson, M., Martin, C. L., & Miller, D. T. (2016). Recommendations for reporting of secondary findings in clinical exome and genome sequencing, 2016 update (ACMG SF v2.0): a policy statement of the American College of Medical Genetics and Genomics. Genetics in Medicine, 19, 249.
  20. Verma, S. S., Josyula, N., Verma, A., Zhang, X., Veturi, Y., Dewey, F. E., Hartzel, D. N., Leader, J., Ritchie, M. D., & Pendergrass, S. A. (2018). Rare variants in drug target genes contributing to complex diseases, phenome-wide. Scientific Reports, 8(1), 4624.
  21. Pendergrass, S. A., Frase, A., Wallace, J., Wolfe, D., Katiyar, N., Moore, C., & Ritchie, M. D. (2013). Genomic analyses with biofilter 2.0: knowledge driven filtering, annotation, and model development. BioData Mining, 6(1), 25.