Genomic analysis for precision medicine

Genomic analysis turns DNA sequence data into evidence.

The task is not only to find variants. The task is to decide which variants were measured, which were missed, which are common, which are rare, which may change protein function, and which can support a clinical or biological conclusion.

This page gives a practical map of the field. It is written for readers learning genomics, rare disease analysis, bioinformatics, or precision medicine.

The short version

DNA sequencing produces reads.

Reads are aligned to a reference genome.

Aligned reads are converted into variant calls.

Variant calls are annotated with biological, clinical, and population data.

Interpretation combines variant evidence with phenotype, inheritance, mechanism, and uncertainty.

The main file types are:

File type	Meaning
FASTQ	Raw sequencing reads and base quality scores
BAM or CRAM	Reads aligned to a reference genome
VCF or gVCF	Genetic variants and genotype-level evidence
BED	Genomic intervals, often used for gene panels or exome targets
TSV, CSV, JSON or HTML	Interpreted results, reports, tables, and machine-readable summaries
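
As an illustration of what "reads and base quality scores" means in practice, the sketch below parses a FASTQ stream into records. It assumes the common four-line record layout and Phred+33 quality encoding; the names `FastqRecord` and `read_fastq` are made up for this example, and real pipelines would normally use an established parser.

```python
import io
from typing import Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    qualities: List[int]  # one Phred score per base


def read_fastq(handle) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream with four lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of stream
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored in this sketch
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        # Assumption: Phred+33 encoding, so score = ASCII code - 33
        scores = [ord(char) - 33 for char in quality]
        yield FastqRecord(header[1:], sequence, scores)


# A single invented read, only to show the shape of the data
example = io.StringIO("@read1\nACGT\n+\nII5#\n")
records = list(read_fastq(example))
print(records[0].name, records[0].sequence, records[0].qualities)
```

A Phred score of 40 corresponds to an estimated error probability of 1 in 10,000, and a score of 20 to 1 in 100.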

The choice of reference genome is equally critical. A genomic coordinate only has meaning relative to a specific genome build, usually GRCh37, GRCh38, or T2T-CHM13.

Exome sequencing and genome sequencing

Exome sequencing reads mainly the protein-coding regions of the genome.

Whole genome sequencing reads coding and non-coding sequence, including regulatory regions and mitochondrial DNA, with more even coverage. That evenness also makes structural variants easier to detect.

Gene panels read selected genes.

Each approach has a different purpose.

Test	What it measures	Common use
Gene panel	Selected disease genes	Focused diagnosis, lower cost, easier interpretation
Whole exome sequencing	Most protein-coding exons	Rare disease, Mendelian disorders, gene discovery
Whole genome sequencing	The full genome	Rare disease, structural variants, non-coding variants, research, long-term data reuse
Long-read sequencing	Longer DNA fragments	Repeat expansions, structural variants, phasing, difficult regions
Ultra-deep sequencing	Very high depth at selected loci	Low-level mosaicism, somatic variants, cancer, autoinflammatory disease

Exome sequencing remains useful because coding variants are often easier to interpret. Whole genome sequencing is stronger when the question includes structural variation, non-coding regulation, copy number change, repeat expansions, mitochondrial DNA, phasing, or future reuse.

In precision medicine, whole genome sequencing is often the better long-term data asset. Exome sequencing can still be the better immediate diagnostic test when cost, coverage, and interpretation are the main constraints.

Sample preparation

Most germline sequencing begins with DNA from blood, saliva, or tissue.

Blood is often preferred because it gives reliable DNA quality and usually represents the inherited genome well. Saliva can work, but bacterial DNA contamination and variable sample quality can affect sequencing. Tumour, skin, buccal, fibroblast, or sorted cell samples may be needed when mosaicism or somatic disease is suspected.

Before sequencing, laboratories usually check:

Check	Purpose
DNA quantity	Enough DNA for library preparation
DNA purity	Absence of inhibitors or contamination
DNA integrity	Fragment size and degradation
Sample identity	Avoiding swaps and contamination
Consent and metadata	Ensuring lawful and interpretable use

For exome sequencing and panels, DNA is fragmented, adapted, indexed, captured with probes, and sequenced. Capture systems include products such as Agilent SureSelect, IDT xGen, Twist Bioscience panels, and Roche KAPA or SeqCap designs.

For whole genome sequencing, capture is usually not required. The library is sequenced after fragmentation, adaptor ligation, and indexing, using either PCR amplification or a PCR-free preparation.

PCR-free whole genome sequencing is often preferred when the goal is uniform coverage and fewer amplification artefacts.

Sequencing platforms

Most clinical and research short-read sequencing still uses Illumina instruments. Common platform families include MiSeq, NextSeq, NovaSeq, and older HiSeq systems.

Other relevant technologies include:

Technology	Typical use
Illumina short-read sequencing	Germline SNVs, indels, exomes, genomes, panels
Element Biosciences AVITI	Short-read sequencing with alternative chemistry
MGI DNBSEQ	Short-read sequencing at scale
Oxford Nanopore	Long reads, structural variants, methylation, rapid sequencing
PacBio HiFi	Accurate long reads, structural variants, phasing, repeat regions

Short reads are efficient for SNVs and small indels. Long reads are stronger for structural variants, repeat expansions, phasing, and difficult genomic regions.

A modern precision medicine programme may use both.

A practical analysis workflow

A typical short-read germline workflow follows this structure.

FASTQ
  ↓
quality control
  ↓
alignment to reference genome
  ↓
BAM or CRAM
  ↓
duplicate marking and base quality score recalibration
  ↓
variant calling
  ↓
gVCF or VCF
  ↓
joint genotyping or cohort merging
  ↓
variant filtering
  ↓
annotation
  ↓
clinical or research interpretation

Useful tools include:

Step	Common tools
FASTQ quality control	FastQC, MultiQC, fastp
Adapter trimming	fastp, Trim Galore, Cutadapt
Alignment	BWA-MEM, BWA-MEM2, DRAGMAP, minimap2
BAM processing	samtools, Picard, Sambamba
Germline SNV and indel calling	GATK HaplotypeCaller, DeepVariant, Sentieon DNAscope, DRAGEN
Joint genotyping	GATK GenomicsDBImport, GATK GenotypeGVCFs, GLnexus
Somatic SNV and indel calling	GATK Mutect2, Strelka2, VarDict, Octopus
Structural variant calling	Manta, Delly, Lumpy, GRIDSS, Sniffles, cuteSV
Copy number calling	ExomeDepth, XHMM, CNVkit, GATK gCNV, Canvas
Variant normalisation	bcftools norm, vt
Variant annotation	Ensembl VEP, ANNOVAR, SnpEff, Nirvana
Workflow management	Nextflow, Snakemake, WDL, Cromwell

The exact tools matter less than the traceability of the workflow. A useful analysis records the reference genome, software versions, parameters, input files, output files, and quality metrics.
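
One way to make that traceability concrete is a small run manifest written next to the outputs. The sketch below is an assumption about how such a record could look, not a standard format: the class `RunManifest`, its field names, and the placeholder version strings are all hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


def sha256_of(path: str) -> str:
    """Checksum a file in chunks so large BAM or FASTQ files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


@dataclass
class RunManifest:
    sample_id: str
    reference_build: str
    software_versions: dict
    parameters: dict
    inputs: dict = field(default_factory=dict)      # path -> sha256
    outputs: dict = field(default_factory=dict)     # path -> sha256
    qc_metrics: dict = field(default_factory=dict)

    def add_input(self, path: str) -> None:
        self.inputs[path] = sha256_of(path)

    def add_output(self, path: str) -> None:
        self.outputs[path] = sha256_of(path)

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Hypothetical values; version strings are placeholders, not recommendations
manifest = RunManifest(
    sample_id="SAMPLE_001",
    reference_build="GRCh38",
    software_versions={"aligner": "x.y.z", "caller": "x.y.z"},
    parameters={"min_mapping_quality": 20},
)
manifest.qc_metrics["mean_depth"] = 34.2
print(manifest.to_json())
```

Storing checksums rather than only file names means a later reanalysis can confirm it is starting from the same inputs.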

Reference genomes

The reference genome is the coordinate system for the analysis.

A variant reported as chr7:117559593 only means something if the genome build is known. The same biological variant may have different coordinates in GRCh37, GRCh38, and T2T-CHM13.
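
A simple safeguard is to carry the build inside every variant identifier, so that coordinates from different builds cannot be compared by accident. The sketch below is illustrative only: `VariantKey` and `same_variant` are hypothetical names, and the alleles attached to the example coordinate are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantKey:
    build: str   # e.g. "GRCh37", "GRCh38", "T2T-CHM13"
    chrom: str
    pos: int     # 1-based position
    ref: str
    alt: str


def same_variant(a: VariantKey, b: VariantKey) -> bool:
    """Compare two variants, refusing to compare across genome builds."""
    if a.build != b.build:
        raise ValueError(
            f"cannot compare {a.build} with {b.build}: lift over or remap first"
        )
    return (a.chrom, a.pos, a.ref, a.alt) == (b.chrom, b.pos, b.ref, b.alt)


# The alleles here are invented for illustration
v38 = VariantKey("GRCh38", "chr7", 117559593, "A", "G")
v38_again = VariantKey("GRCh38", "chr7", 117559593, "A", "G")
v37 = VariantKey("GRCh37", "chr7", 117559593, "A", "G")

print(same_variant(v38, v38_again))
try:
    same_variant(v38, v37)
except ValueError as error:
    print("refused:", error)
```

The same number on two builds is not evidence of the same variant, which is why the function raises instead of returning False.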

Common references include:

Reference	Use
GRCh37 or hg19	Older clinical and research databases
GRCh38 or hg38	Current standard for many new analyses
T2T-CHM13	More complete telomere-to-telomere assembly
hs37d5	GRCh37 with decoy sequences, used in many older pipelines
GRCh38 with ALT contigs	Improved representation of difficult regions

Modern analyses should usually use GRCh38 unless there is a clear reason to remain on GRCh37.

The reason many older datasets use GRCh37 is practical. Large historical cohorts, clinical databases, and analysis pipelines were built around it. Moving older data to GRCh38 improves consistency with new work but requires careful coordinate liftover, remapping, or full reprocessing.

Quality control

Quality control asks whether the data are suitable for interpretation.

Useful checks include:

Level	Examples
Sample identity	Sex check, kinship, relatedness, contamination, sample swap detection
Sequencing quality	Q30, read depth, duplication rate, insert size, GC bias
Alignment quality	Mapping rate, coverage uniformity, off-target rate
Variant quality	Ti/Tv ratio, heterozygosity, call rate, allele balance
Cohort quality	Batch effects, ancestry, principal components, outlier samples
Clinical quality	Coverage over disease genes, reportable regions, medically relevant gaps

Common tools include FastQC, MultiQC, samtools stats, Picard CollectHsMetrics, Picard CollectWgsMetrics, VerifyBamID, Somalier, PLINK, bcftools, mosdepth, and GATK CollectReadCounts.
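
To show what one of these variant-level metrics measures, the sketch below computes a Ti/Tv ratio from a list of reference and alternate alleles. Transitions are purine-to-purine or pyrimidine-to-pyrimidine changes; everything else is a transversion. The function names and the example calls are made up for illustration. In human data the ratio is typically reported as roughly 2 genome-wide and higher in exomes, so a sample far from the expected range for its assay may deserve a closer look.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """A<->G and C<->T are transitions; all other SNVs are transversions."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def titv_ratio(alleles):
    """Return transitions / transversions over biallelic SNVs, or None."""
    transitions = transversions = 0
    for ref, alt in alleles:
        ref, alt = ref.upper(), alt.upper()
        # Skip indels, non-ACGT symbols, and non-variant rows
        if len(ref) != 1 or len(alt) != 1 or ref == alt:
            continue
        if ref not in "ACGT" or alt not in "ACGT":
            continue
        if is_transition(ref, alt):
            transitions += 1
        else:
            transversions += 1
    if transversions == 0:
        return None
    return transitions / transversions


# Invented calls: three transitions, one transversion, one indel (skipped)
calls = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("AT", "A")]
print(titv_ratio(calls))
```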

For precision medicine, a negative result is only meaningful if the relevant regions were actually measured. A report should distinguish “no variant found” from “the region was not adequately assessed”.
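
The distinction can be enforced in code rather than left to the reader. The sketch below takes per-base depths for a region and only allows a "no variant found" statement when enough of the region is covered. The thresholds of 20 reads and 95 percent are illustrative assumptions, not a recommended standard, and the function names are hypothetical.

```python
def fraction_covered(depths, min_depth=20):
    """Fraction of positions with depth >= min_depth; None if region is empty."""
    if not depths:
        return None
    return sum(1 for depth in depths if depth >= min_depth) / len(depths)


def negative_result_statement(region, depths, variants_found,
                              min_depth=20, min_fraction=0.95):
    """Word a result so poor coverage is never reported as a true negative."""
    if variants_found:
        return f"{region}: variant(s) found"
    covered = fraction_covered(depths, min_depth)
    if covered is None or covered < min_fraction:
        return f"{region}: not adequately assessed"
    return f"{region}: no variant found in an adequately covered region"


# Invented per-base depths for two hypothetical exons
well_covered = [35, 40, 38, 36, 41, 39, 36, 37, 33, 30]
patchy = [35, 40, 38, 5, 0, 2, 41, 39, 36, 37]

print(negative_result_statement("EXON_1", well_covered, variants_found=False))
print(negative_result_statement("EXON_2", patchy, variants_found=False))
```

In practice the depths would come from a tool such as mosdepth or samtools restricted to the reportable intervals.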

Variant calling

Variant calling converts aligned reads into genetic differences from the reference.

The main variant classes are:

Variant type	Meaning
SNV	Single nucleotide variant
Indel	Small insertion or deletion
MNV	Multi-nucleotide variant
CNV	Copy number variant
SV	Structural variant
STR or repeat expansion	Variable repeat sequence
mtDNA variant	Mitochondrial genome variant
Mosaic or somatic variant	Variant present in only some cells

A single caller rarely captures every class well. SNVs and indels may be handled by GATK HaplotypeCaller or DeepVariant. Structural variants may require Manta, Delly, GRIDSS, Sniffles, or cuteSV. Copy number variants may require ExomeDepth, CNVkit, GATK gCNV, or Canvas.

Long-read sequencing improves variant detection in repetitive and structurally complex regions.

Annotation

Annotation adds biological and clinical context to variants.

A raw VCF tells us the coordinate, genotype, and quality metrics. It does not explain whether the variant affects a gene, changes a protein, is common in the population, or has been reported in disease.
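
To make the content of a raw VCF concrete, the sketch below splits one data line into its fixed columns, INFO entries, and per-sample fields. It is a teaching aid under simplifying assumptions (tab-separated text, no header handling, no type conversion of INFO values); established libraries should be preferred for real work. The example line, including its alleles and values, is invented.

```python
def parse_vcf_line(line, sample_names):
    """Parse one VCF data line into a plain dictionary."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, variant_id, ref, alt, qual, filt, info = fields[:8]
    record = {
        "chrom": chrom,
        "pos": int(pos),
        "id": None if variant_id == "." else variant_id,
        "ref": ref,
        "alt": alt.split(","),            # several ALT alleles are allowed
        "qual": None if qual == "." else float(qual),
        "filter": filt,
        "info": {},
        "samples": {},
    }
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, separator, value = item.partition("=")
        # INFO entries without '=' are flags
        record["info"][key] = value if separator else True
    if len(fields) > 9:
        format_keys = fields[8].split(":")
        for name, column in zip(sample_names, fields[9:]):
            record["samples"][name] = dict(zip(format_keys, column.split(":")))
    return record


# Invented example line for a single sample called "proband"
line = "chr7\t117559593\t.\tA\tG\t812.5\tPASS\tDP=42;DB\tGT:AD:DP\t0/1:20,22:42"
record = parse_vcf_line(line, ["proband"])
print(record["chrom"], record["pos"], record["samples"]["proband"]["GT"])
```

Nothing in the parsed record says which gene is affected or how common the variant is, which is the gap annotation fills.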

Common annotation resources include:

Resource	Use
Ensembl VEP	Transcript consequence annotation
ANNOVAR	Variant annotation framework
SnpEff	Variant consequence annotation
MANE Select	Preferred matched Ensembl and RefSeq transcripts
GENCODE	Gene and transcript models
ClinVar	Submitted clinical variant interpretations
gnomAD	Population allele frequencies and constraint metrics
dbSNP	Variant identifiers
OMIM	Mendelian disease genes and phenotypes
Orphanet	Rare disease information
ClinGen	Gene-disease validity and dosage sensitivity
PanelApp	Curated disease-gene panels
HGNC	Approved gene symbols
HPO	Human phenotype ontology
UniProt	Protein function and domains
Pfam and InterPro	Protein domains and families
AlphaFold DB	Predicted protein structures
STRING	Protein interaction networks

Good annotation depends on transcript choice. For clinical work, MANE Select transcripts are often preferred where available. A variant can appear missense on one transcript and non-coding on another, so transcript reporting must be explicit.

Population frequency

Population allele frequency is one of the strongest filters in rare disease analysis.

A fully penetrant variant causing a very rare dominant disease should not be common in the general population. A recessive pathogenic variant can be more common because carriers may be unaffected.

The most widely used population resource is gnomAD. It provides allele frequencies across large cohorts and ancestry groups. Other resources may be relevant in specific settings, including TOPMed, UK Biobank, 1000 Genomes, All of Us, and national reference datasets.

Frequency filtering should consider:

Factor	Why it matters
Disease prevalence	Common variants cannot usually cause very rare diseases alone
Inheritance model	Dominant, recessive, X-linked, mitochondrial, de novo
Penetrance	Low penetrance permits higher population frequency
Ancestry	A variant may be rare globally but common in one ancestry group
Technical quality	Some recurrent calls are artefacts
Cohort context	Internal frequency can reveal batch effects or shared ancestry

A simple minor allele frequency threshold can be useful, but it is not interpretation. Frequency is evidence that must be combined with mechanism, phenotype, segregation, and variant effect.
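
The sketch below shows how several of these factors can enter a frequency filter. It derives a threshold from prevalence, allelic contribution, and penetrance using a simplified form often applied to dominant conditions, then compares it with the highest frequency seen in any ancestry group rather than the global figure. The formula as written, the function names, and all numbers are assumptions made for illustration.

```python
import math


def max_credible_af_dominant(prevalence, max_allelic_contribution, penetrance):
    """Simplified upper bound on allele frequency for a dominant condition.

    Assumption: threshold = prevalence * contribution / (2 * penetrance),
    where the factor 2 reflects two alleles per person.
    """
    if not 0 < penetrance <= 1:
        raise ValueError("penetrance must be in (0, 1]")
    return prevalence * max_allelic_contribution / (2 * penetrance)


def highest_group_frequency(frequencies_by_group):
    """Highest allele frequency across ancestry groups (0.0 if none given)."""
    return max(frequencies_by_group.values(), default=0.0)


def passes_frequency_filter(frequencies_by_group, threshold):
    return highest_group_frequency(frequencies_by_group) <= threshold


# Illustrative inputs: prevalence 1 in 500, one variant explains at most 2%
# of cases, penetrance 50%
threshold = max_credible_af_dominant(1 / 500, 0.02, 0.5)

# Invented frequencies keyed by hypothetical ancestry group labels
rare_everywhere = {"group_a": 0.00001, "group_b": 0.0, "group_c": 0.00002}
common_in_one = {"group_a": 0.0031, "group_b": 0.00001, "group_c": 0.0}

print(f"threshold = {threshold:.1e}")
print(passes_frequency_filter(rare_everywhere, threshold))
print(passes_frequency_filter(common_in_one, threshold))
```

Lower penetrance raises the threshold, which matches the table above: a low-penetrance variant can be more common and still contribute to disease.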

Disease-gene panels

A disease-gene panel is a curated list of genes relevant to a phenotype or clinical indication.

Panels are useful because they define the search space. They also make interpretation more reproducible. A rare missense variant in a gene unrelated to the phenotype is usually less useful than a rare damaging variant in a validated disease gene.

Important panel resources include:

Resource	Use
Genomics England PanelApp	Curated gene panels with confidence levels
ClinGen	Gene-disease validity and dosage curation
OMIM	Mendelian disease-gene relationships
Orphanet	Rare disease entities and genes
ACMG Secondary Findings list	Genes recommended for reporting of actionable secondary findings
HPO	Phenotype terms used to select relevant genes

Panel selection should be recorded. A report should state which genes were assessed, which transcripts were used, and which regions had insufficient coverage.

Virtual panels are often applied to genome or exome data after sequencing. This allows reanalysis when gene knowledge changes.
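
A virtual panel is, at its simplest, a named and versioned gene set applied as a filter. The sketch below keeps the panel identity and the genes with coverage gaps in the result so that both can be reported. The gene symbols, panel name, and version string are placeholders, and the data layout is an assumption.

```python
def apply_virtual_panel(variants, panel_name, panel_version, panel_genes,
                        genes_with_gaps=frozenset()):
    """Filter variants to a gene panel and record what was assessed."""
    panel_genes = frozenset(panel_genes)
    return {
        "panel": panel_name,
        "panel_version": panel_version,
        "genes_assessed": sorted(panel_genes),
        # Panel genes that were not fully covered must be reported as gaps
        "genes_with_insufficient_coverage": sorted(panel_genes & set(genes_with_gaps)),
        "variants": [v for v in variants if v["gene"] in panel_genes],
    }


# Placeholder gene symbols and identifiers, not a real panel
variants = [
    {"gene": "GENE_A", "id": "var1"},
    {"gene": "GENE_X", "id": "var2"},
    {"gene": "GENE_B", "id": "var3"},
]
result = apply_virtual_panel(
    variants,
    panel_name="Hypothetical epilepsy panel",
    panel_version="v0.0",
    panel_genes={"GENE_A", "GENE_B", "GENE_C"},
    genes_with_gaps={"GENE_C", "GENE_Z"},
)
print([v["id"] for v in result["variants"]])
print(result["genes_with_insufficient_coverage"])
```

Because the filter runs on stored variant calls, rerunning it with a newer panel version is cheap compared with resequencing.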

Variant interpretation

Variant interpretation asks whether a variant explains a phenotype.

For rare disease, interpretation usually combines:

Evidence type	Examples
Variant consequence	Missense, nonsense, frameshift, splice, structural
Gene-disease validity	ClinGen, OMIM, PanelApp, published cases
Population frequency	gnomAD, ancestry-specific frequency, internal controls
Inheritance	Dominant, recessive, de novo, compound heterozygous, X-linked
Segregation	Does the variant track with disease in the family
Phenotype match	HPO similarity, clinical fit, disease mechanism
Functional evidence	Assays, expression, protein studies, model systems
Computational prediction	SpliceAI, CADD, REVEL, AlphaMissense, ESM1b
Constraint	LOEUF, pLI, missense Z-score, regional constraint
Previous classification	ClinVar, locus-specific databases, literature

The ACMG/AMP framework is the main clinical structure for germline variant classification. It assigns each variant to one of five classes: pathogenic, likely pathogenic, uncertain significance, likely benign, or benign.

That label is useful, but it is not the whole conclusion. A patient-level diagnosis also needs gene validity, phenotype match, inheritance consistency, and confidence that the relevant regions were measured.

Inheritance models

Inheritance is often the fastest way to reduce candidate variants.

Model	Typical pattern
Autosomal dominant	One disease-causing allele may be sufficient
Autosomal recessive	Two disease-causing alleles in the same gene are usually required
Compound heterozygous	Two different variants affect the two copies of one gene
X-linked	Variant on the X chromosome, often sex-dependent
De novo	Variant present in the child but absent from parents
Mitochondrial	Variant in mitochondrial DNA, often heteroplasmic
Mosaic	Variant present in only a fraction of cells
Somatic	Acquired variant, common in cancer and some inflammatory disorders

Trio sequencing is powerful because it shows inheritance directly. A trio includes the affected child and both biological parents. It helps identify de novo variants, recessive inheritance, compound heterozygosity, and sample mix-ups.
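
The logic a trio enables can be sketched with genotypes written as alternate allele counts (0, 1, or 2). The example below is restricted to autosomal, biallelic sites and ignores genotype quality, depth, and phasing, all of which matter in real analysis. Function names and gene symbols are hypothetical.

```python
from collections import defaultdict


def classify_trio(child, mother, father):
    """Classify one autosomal site from alt allele counts (0, 1, or 2)."""
    if child == 0:
        return "absent in child"
    if child == 1:
        if mother == 0 and father == 0:
            return "de novo candidate"
        if mother == 2 and father == 2:
            return "Mendelian inconsistency"
        if mother >= 1 and father >= 1:
            return "inherited, parent of origin ambiguous"
        return "maternal" if mother >= 1 else "paternal"
    # child == 2: one alternate allele is expected from each parent
    if mother >= 1 and father >= 1:
        return "homozygous, one allele from each parent"
    return "Mendelian inconsistency"


def compound_het_candidates(variants):
    """Genes with at least one maternal and one paternal heterozygous variant.

    A de novo variant paired with an inherited one can also be compound
    heterozygous, but confirming that needs phasing, which is not modelled.
    """
    origins = defaultdict(set)
    for variant in variants:
        origin = classify_trio(variant["child"], variant["mother"], variant["father"])
        if origin in ("maternal", "paternal"):
            origins[variant["gene"]].add(origin)
    return sorted(gene for gene, seen in origins.items()
                  if seen == {"maternal", "paternal"})


# Invented genotypes for placeholder genes
trio_variants = [
    {"gene": "GENE_A", "child": 1, "mother": 1, "father": 0},
    {"gene": "GENE_A", "child": 1, "mother": 0, "father": 1},
    {"gene": "GENE_B", "child": 1, "mother": 0, "father": 0},
    {"gene": "GENE_C", "child": 2, "mother": 0, "father": 1},
]
for variant in trio_variants:
    print(variant["gene"], classify_trio(variant["child"], variant["mother"], variant["father"]))
print(compound_het_candidates(trio_variants))
```

Repeated Mendelian inconsistencies across many sites are one of the ways a trio exposes a sample mix-up.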

Phenotype data

Genomic interpretation is weaker without structured phenotype data.

The Human Phenotype Ontology, or HPO, provides standard terms for clinical features. Examples include HP:0001250 for seizure, HP:0001631 for atrial septal defect, and HP:0001249 for intellectual disability.

HPO terms help match patients to disease-gene panels, OMIM diseases, and published cases. Tools such as Exomiser, PhenoTips, LIRICAL, Phen2Gene, and PanelAppRex AI use phenotype information to prioritise genes or panels.
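
The basic idea of phenotype-driven ranking can be shown with plain set overlap. The sketch below scores candidate diseases by the Jaccard similarity between the patient's HPO terms and each disease's terms. This is a deliberate simplification and not how the tools named above work internally; they generally use the ontology structure and term frequencies. The disease profiles are invented placeholders.

```python
import re

HPO_ID = re.compile(r"^HP:\d{7}$")


def validate_terms(terms):
    """Return the terms as a set, rejecting anything not shaped like an HPO ID."""
    bad = [term for term in terms if not HPO_ID.match(term)]
    if bad:
        raise ValueError(f"not valid HPO identifiers: {bad}")
    return set(terms)


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_diseases(patient_terms, disease_profiles):
    """Rank diseases by term overlap with the patient, best match first."""
    patient = validate_terms(patient_terms)
    scored = [
        (name, jaccard(patient, validate_terms(terms)))
        for name, terms in disease_profiles.items()
    ]
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Patient with seizure and intellectual disability
patient_terms = ["HP:0001250", "HP:0001249"]

# Hypothetical disease profiles, not taken from any database
disease_profiles = {
    "DISEASE_A": ["HP:0001250", "HP:0001249", "HP:0001631"],
    "DISEASE_B": ["HP:0001631"],
}
ranking = rank_diseases(patient_terms, disease_profiles)
print(ranking)
```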

Phenotype quality matters. “Developmental delay” alone is less informative than a structured record of age of onset, neurological findings, growth measurements, imaging results, immune features, and laboratory abnormalities.

Precision medicine

Precision medicine uses molecular evidence to guide diagnosis, prognosis, prevention, treatment, or data reuse.

In genomics, precision medicine can include:

Area	Example
Rare disease diagnosis	Identifying a causal variant in a Mendelian disorder
Cancer genomics	Matching tumour variants to therapy, prognosis, or trials
Pharmacogenomics	Using CYP2D6, CYP2C19, TPMT, DPYD, SLCO1B1, or HLA genotypes
Carrier screening	Identifying reproductive risk for recessive disorders
Newborn sequencing	Early detection of actionable inherited conditions
Polygenic risk	Estimating inherited risk across many common variants
Infectious disease genomics	Host or pathogen genomic contribution to disease
Multi-omics	Integrating DNA, RNA, protein, metabolite, and phenotype evidence

Precision medicine depends on infrastructure as much as sequencing. Results must be traceable, reproducible, interpretable, and updateable.

A genome sequenced today may become more informative years later. That only works if the raw files, metadata, reference build, consent, and interpretation records are preserved.

Reanalysis

A negative genomic result is not always final.

Reanalysis can find diagnoses because gene-disease knowledge improves, variant databases grow, phenotype information becomes clearer, and methods improve.

Common reasons to reanalyse include:

Reason	Example
New disease gene	A gene was not known at the first analysis
Updated ClinVar classification	A VUS becomes likely pathogenic or benign
Improved phenotype	New HPO terms refine the search
Better caller	Structural variant or repeat expansion detected later
New reference	GRCh38 or T2T improves mapping
Additional family data	Segregation becomes informative
RNA or protein evidence	Multi-omics supports a candidate gene

Reanalysis is strongest when the original data are stored in standard formats such as FASTQ, BAM or CRAM, VCF, and structured reports.

Network and pathway analysis

Single-gene interpretation works well when a known disease gene explains the case.

Cohort studies often need a broader view. Different patients can carry damaging variants in different genes that affect the same pathway.

Useful resources and tools include STRING, Reactome, KEGG, Gene Ontology, BioGRID, IntAct, OmniPath, Cytoscape, and clustering methods such as the Markov Cluster Algorithm.

Pathway analysis can help identify shared biology, but it needs discipline. Protein interaction databases contain many weak or context-dependent links. A network result should be treated as hypothesis-generating unless it is supported by variant evidence, phenotype fit, statistical enrichment, and functional validation.

Modern rare variant association methods include burden tests, SKAT, SKAT-O, STAAR, SAIGE-GENE, REGENIE, DeepRVAT, and other gene or variant-set methods. These methods are useful when cohorts are large enough for statistical testing.
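
The simplest member of this family is a collapsing burden test: count how many cases and controls carry at least one qualifying rare variant in a gene, then test the resulting two-by-two table. The sketch below uses a one-sided Fisher's exact test written with the standard library. It does not adjust for ancestry, relatedness, or covariates, which methods such as SAIGE-GENE and REGENIE are designed to handle, and the counts are invented.

```python
from math import comb


def burden_fisher_one_sided(case_carriers, case_total,
                            control_carriers, control_total):
    """P(carriers among cases >= observed) under the hypergeometric null."""
    carriers = case_carriers + control_carriers
    total = case_total + control_total
    denominator = comb(total, carriers)
    most_extreme = min(carriers, case_total)
    numerator = sum(
        # Ways to place k carriers among cases and the rest among controls
        comb(case_total, k) * comb(control_total, carriers - k)
        for k in range(case_carriers, most_extreme + 1)
    )
    return numerator / denominator


# Invented counts: 5 of 10 cases carry a qualifying variant, 0 of 10 controls
p_value = burden_fisher_one_sided(5, 10, 0, 10)
print(f"one-sided p = {p_value:.4f}")
```

With only ten cases and ten controls even a complete separation of carriers gives modest evidence, and a real analysis would also correct for testing many genes.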

Small rare disease cohorts often remain interpretation-led. In those settings, pathway analysis can organise evidence, but it does not replace variant interpretation.

Data security and governance

Genomic data are inherently identifying.

A genome is not like a routine blood test result. It contains information about the individual, biological relatives, ancestry, disease risk, and future interpretability.

Good genomic infrastructure should record:

Item	Reason
Consent	Defines permitted use
Sample metadata	Supports interpretation and audit
File provenance	Shows where each file came from
Reference genome	Defines coordinates
Software versions	Supports reproducibility
Access history	Shows who accessed or received data
Report versions	Preserves interpretation at a point in time
Reanalysis history	Shows how conclusions changed

Relevant standards and organisations include GA4GH, HL7 FHIR Genomics, ISO 15189, ACMG, AMP, ClinGen, SPHN, and national data protection frameworks.

The practical principle is simple. Genome data should be private by default, shareable by explicit request, and interpretable with a complete audit trail.

What a good genomic report should say

A useful genomic report should not only list variants.

It should state:

Report element	Purpose
Test type	Panel, exome, genome, long-read, tumour, germline
Reference genome	GRCh37, GRCh38, T2T-CHM13, or other
Regions assessed	Genes, transcripts, intervals, coverage limits
Methods	Sequencing, alignment, calling, annotation, filtering
Main findings	Variants and interpretation
Inheritance evidence	De novo, recessive, compound heterozygous, segregation
Population evidence	gnomAD or other frequency evidence
Clinical evidence	ClinVar, OMIM, ClinGen, literature
Phenotype match	Relevant HPO terms or clinical features
Limitations	Regions not covered, variant classes not assessed
Data retained	FASTQ, BAM or CRAM, VCF, report, metadata
Reanalysis recommendation	When and why to review again

The most important distinction is between absence of evidence and evidence of absence. A report should make clear whether a variant was not found, or whether the relevant region was not measured well enough to know.

Common mistakes

A few mistakes cause many interpretation problems.

Using the wrong reference build gives incorrect coordinates.

Using old gene symbols creates failed database joins.

Ignoring transcript choice can misreport the variant consequence.

Filtering too strictly can remove real pathogenic variants.

Filtering too loosely creates false candidate lists.

Treating ClinVar as truth ignores submitter disagreement and review status.

Treating a VUS as causal overstates the evidence.

Ignoring coverage makes negative reports unreliable.

Ignoring ancestry makes frequency interpretation weaker.

Reporting a variant without the phenotype context can mislead the reader.

Key terms

Term	Meaning
Allele	One version of a genetic sequence at a locus
Genotype	The allele combination in an individual
Heterozygous	Two different alleles at a locus, typically one reference and one alternate
Homozygous	Two copies of the same allele
Hemizygous	One copy of a chromosome region, often X-linked in males
Compound heterozygous	Two different variants affecting the two copies of one gene
Penetrance	Probability that a genotype produces a phenotype
Expressivity	Range or severity of features caused by a genotype
Mosaicism	A variant present in only some cells
Phasing	Determining which variants are on the same parental chromosome
Coverage	Number of reads covering a genomic position
Allele balance	Fraction of reads supporting each allele
VUS	Variant of uncertain significance
SNV	Single nucleotide variant
Indel	Insertion or deletion
CNV	Copy number variant
SV	Structural variant
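
Two of these terms, coverage and allele balance, are simple enough to compute by hand. The sketch below derives allele balance from read counts and attaches a rough description. The cut-offs (depth 20, fractions 0.3, 0.7, and 0.9) are illustrative assumptions only; appropriate values depend on the assay, depth, and variant type.

```python
def allele_balance(ref_reads, alt_reads):
    """Fraction of reads supporting the alternate allele, or None at zero depth."""
    total = ref_reads + alt_reads
    if total == 0:
        return None
    return alt_reads / total


def describe_balance(ref_reads, alt_reads, min_depth=20):
    """Rough reading of allele balance using illustrative thresholds."""
    if ref_reads + alt_reads < min_depth:
        return "insufficient depth"
    balance = allele_balance(ref_reads, alt_reads)
    if balance < 0.3:
        return "low alternate fraction: possible mosaic variant or artefact"
    if balance <= 0.7:
        return "consistent with heterozygous"
    if balance >= 0.9:
        return "consistent with homozygous alternate"
    return "intermediate: review manually"


# Invented read counts as (reference reads, alternate reads)
for counts in [(20, 22), (1, 39), (90, 10), (3, 4)]:
    print(counts, describe_balance(*counts))
```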

Tools and resources to know

Category	Examples
Read QC	FastQC, MultiQC, fastp
Alignment	BWA-MEM2, DRAGMAP, minimap2
BAM and CRAM handling	samtools, Picard, Sambamba
Variant calling	GATK HaplotypeCaller, DeepVariant, DRAGEN, Sentieon DNAscope
Joint genotyping	GATK GenomicsDBImport, GenotypeGVCFs, GLnexus
Somatic calling	Mutect2, Strelka2, VarDict, Octopus
Structural variants	Manta, Delly, GRIDSS, Sniffles, cuteSV
Copy number	ExomeDepth, CNVkit, XHMM, GATK gCNV
Annotation	Ensembl VEP, ANNOVAR, SnpEff, Nirvana
Population frequency	gnomAD, TOPMed, 1000 Genomes, UK Biobank
Clinical interpretation	ClinVar, ClinGen, OMIM, Orphanet
Phenotype	HPO, PhenoTips, Exomiser, LIRICAL
Disease panels	PanelApp, ACMG SF, ClinGen, OMIM
Protein context	UniProt, Pfam, InterPro, AlphaFold DB
Pathways	Reactome, KEGG, GO, STRING, BioGRID
Workflow systems	Nextflow, Snakemake, WDL, Cromwell
Cohort genetics	PLINK, bcftools, REGENIE, SAIGE, SKAT, STAAR

A useful mental model

Genomic analysis has three layers.

The first layer is measurement. Did the sequencing and alignment measure the region well enough?

The second layer is evidence. What do the variant, gene, population frequency, inheritance, and phenotype suggest?

The third layer is consequence. Can the result support diagnosis, treatment, research, reanalysis, or data reuse?

Most errors happen when these layers are mixed together.

A variant can be real but irrelevant.

A variant can be rare but benign.

A gene can be plausible but not validated.

A negative result can be uninformative if coverage was poor.

A report can be technically correct but clinically unusable if it does not state uncertainty.

Precision medicine depends on keeping these distinctions clear.

Closing note

DNA sequencing is now routine. Genomic interpretation is not.

The value comes from the chain: sample, sequence, reference, alignment, variant call, annotation, phenotype, inheritance, mechanism, population frequency, report, and reanalysis.

When that chain is visible, genomic data becomes usable evidence.