Reading time 12 minutes

Pharmacogenomics from a VCF file

Updated 07 May 2026

Audience

This page is written for a postgraduate class on precision medicine.

The aim is to understand how a pharmacogenomics analysis works under the hood. Professional laboratories and clinical systems usually use validated software, curated databases, quality-controlled pipelines, and formal reporting rules. Those systems are necessary for real use.

The exercise here is different. We start with a VCF file and build the logic ourselves. The point is to see the file types, the columns, the joins, the assumptions, and the failure points. Once those are clear, packages and pipelines become easier to judge.

What pharmacogenomics asks

Pharmacogenomics studies how genetic variation affects drug metabolism, toxicity, dosing, and response.

A pharmacogenomic result usually connects five things:

Component Example
Gene CYP2C19
Allele or genotype CYP2C192/17
Drug Clopidogrel
Phenotype Intermediate metaboliser
Recommendation source CPIC or DPWG guideline

A variant in a drug-related gene is not automatically pharmacogenomic evidence. A drug may bind a protein, a gene may encode a metabolising enzyme, or a variant may be present in a pharmacogene. These are different facts.

This distinction matters throughout the analysis.

What this workflow produces

The workflow produces a candidate table.

VCF
  ↓
variant annotation
  ↓
gene and consequence extraction
  ↓
pharmacogene matching
  ↓
drug-gene annotation
  ↓
candidate table

The candidate table can show that a person has variants in genes connected to drugs. It does not, by itself, provide a clinical pharmacogenomic report.

A clinical report requires validated allele calling, diplotype interpretation, quality control, guideline mapping, and clinical review.

Input files

The exercise uses five file types.

File Purpose
input.vcf.gz Genetic variants from one genome or exome
annotated_vep.tsv VEP annotation table
pharmacogenes.txt Genes of interest
drug_gene_table.tsv Drug-gene relationships
candidate_output.tsv Merged candidate table

A real workflow may also use BED files, reference FASTA files, VEP cache files, gene panels, PharmVar allele definitions, CPIC tables, and FDA drug labels.

Resources used

Each resource answers a different part of the pharmacogenomic question.

Resource Role
Ensembl VEP Converts genomic coordinates into gene and transcript annotations
MANE Select Defines preferred matched Ensembl and RefSeq transcripts
PharmVar Defines star alleles for pharmacogenes
CPIC Provides clinical pharmacogenetic implementation guidelines
PharmGKB Curates drug-gene evidence, annotations, labels, and pathways
DPWG Provides pharmacogenetic dosing recommendations
FDA pharmacogenomic biomarker table Links drug labels to pharmacogenomic biomarkers
DrugBank Lists drug targets, enzymes, carriers, transporters, and drug metadata
ClinVar Curated variant-level clinical assertions
gnomAD Population allele frequencies

DrugBank is useful for drug-target relationships. CPIC, PharmGKB, PharmVar, and DPWG are closer to clinical pharmacogenomic interpretation.

Example VCF

Public example genomes are available from the Personal Genome Project.

The original classroom exercise used:

Profile: hu24385B
Source: Personal Genome Project
File: hu24385B 2019-04-07.vcf.gz
Sequencing provider: Dante Labs Whole Genome
Approximate size: 147 MB
Approximate variants: 3.46 million

A VCF is a structured text file. It contains metadata, a column header, and one row per variant.

Column Meaning
CHROM Chromosome or contig
POS Position on the reference genome
ID Variant identifier, often rsID
REF Reference allele
ALT Alternate allele
QUAL Variant quality score
FILTER Pass or filter status
INFO Variant-level annotations
FORMAT Genotype field definitions
Sample column Genotype-level data for one sample

Inspect the VCF

Keep the original compressed file unchanged.

ls -lh hu24385B.vcf.gz

Count the number of variant rows.

zgrep -vc '^#' hu24385B.vcf.gz

Inspect the metadata.

zgrep '^##' hu24385B.vcf.gz | head -40

Inspect the column header.

zgrep '^#CHROM' hu24385B.vcf.gz

Inspect the first variant rows.

zgrep -v '^#' hu24385B.vcf.gz | head

The first useful skill is recognising the difference between metadata lines, the VCF header, and variant rows.

Check the reference genome

Genomic coordinates depend on the reference genome.

A variant reported at one coordinate in GRCh37 may have a different coordinate in GRCh38. Annotation databases must match the reference build used to generate the VCF.

Check the reference metadata.

zgrep '^##reference=' hu24385B.vcf.gz

Check chromosome naming.

zgrep '^##contig=' hu24385B.vcf.gz | head

A VCF using 1 will not automatically match a BED file using chr1. Coordinate systems and chromosome naming must be consistent before filtering or annotation.

Annotate variants with VEP

A raw VCF usually contains coordinates, alleles, genotype fields, and variant quality information. It does not necessarily contain gene symbols, transcript consequences, protein changes, ClinVar terms, or population frequencies.

Ensembl Variant Effect Predictor, or VEP, adds these fields.

Useful VEP fields include:

VEP field Meaning
SYMBOL HGNC gene symbol
Gene Ensembl gene ID
Feature Transcript ID
Consequence Predicted variant consequence
HGVSc Coding DNA change
HGVSp Protein change
IMPACT Broad consequence severity
MANE_SELECT MANE Select transcript flag
Existing_variation Known variant ID, often rsID
CLIN_SIG ClinVar clinical significance where available
gnomAD_AF Population allele frequency where available

For a small example, the VEP web interface is acceptable. For a whole genome or repeated analysis, run VEP locally with a matching cache.

Split a VCF for web annotation

The VEP web interface has file-size limits. Splitting a VCF is useful for a classroom example. It is not a good production strategy.

The header must be preserved in every split file.

#!/usr/bin/env bash
set -euo pipefail

input="hu24385B.vcf.gz"
outdir="split_vcf"
lines_per_file=150000

mkdir -p "$outdir"

zgrep '^#' "$input" > "$outdir/header.vcf"

zgrep -v '^#' "$input" | split -l "$lines_per_file" - "$outdir/block_"

for block in "$outdir"/block_*; do
  cat "$outdir/header.vcf" "$block" > "${block}.vcf"
  rm "$block"
done

Each block_*.vcf file now contains the original metadata, the VCF column header, and a subset of variant rows.

Run VEP locally

A local VEP command for a GRCh37 VCF might look like this.

vep \
  --input_file hu24385B.vcf.gz \
  --output_file hu24385B.vep.tsv \
  --cache \
  --offline \
  --assembly GRCh37 \
  --tab \
  --symbol \
  --canonical \
  --mane \
  --hgvs \
  --protein \
  --variant_class \
  --sift b \
  --polyphen b \
  --af_gnomad \
  --clin_sig \
  --fork 8

For GRCh38, change the assembly argument.

--assembly GRCh38

The output file is a tab-separated table. It can be searched, filtered, joined, and imported into R or Python.

Clean the VEP output

VEP tabular output may include metadata lines beginning with ##. Keep the main header and remove metadata lines.

grep -v '^##' hu24385B.vep.tsv > hu24385B.vep.clean.tsv

Print the column names as a numbered list.

head -1 hu24385B.vep.clean.tsv | tr '\t' '\n' | nl

Find the SYMBOL column.

head -1 hu24385B.vep.clean.tsv | tr '\t' '\n' | nl | grep 'SYMBOL'

Suppose SYMBOL is column 14. Extract unique gene symbols.

awk -F '\t' 'NR > 1 && $14 != "" {print $14}' hu24385B.vep.clean.tsv \
  | sort -u > genes_from_vcf.txt

This file contains genes with at least one annotated variant in the VCF.

It is a screening file. It does not contain allele interpretation.

Build a small pharmacogene list

Create a simple pharmacogene file.

CYP2D6
CYP2C19
CYP2C9
VKORC1
TPMT
NUDT15
DPYD
UGT1A1
SLCO1B1
HLA-B
HLA-A
CFTR
G6PD

Save it as:

pharmacogenes.txt

These genes illustrate several pharmacogenomic mechanisms.

Gene Common relevance
CYP2D6 Antidepressants, antipsychotics, opioids, beta blockers
CYP2C19 Clopidogrel, proton pump inhibitors, antidepressants
CYP2C9 Warfarin, NSAIDs, phenytoin
VKORC1 Warfarin dose
TPMT Thiopurines
NUDT15 Thiopurines
DPYD Fluoropyrimidines
UGT1A1 Irinotecan and atazanavir
SLCO1B1 Statin-associated myopathy risk
HLA-B Abacavir and allopurinol hypersensitivity
HLA-A Carbamazepine hypersensitivity in some populations
CFTR CFTR modulator eligibility
G6PD Haemolysis risk with oxidant drugs

The list is deliberately small. A complete clinical pharmacogenomics system would use curated allele definitions and guideline tables.

Compare VCF genes with pharmacogenes

Sort both gene lists.

sort -u genes_from_vcf.txt > genes_from_vcf.sorted.txt
sort -u pharmacogenes.txt > pharmacogenes.sorted.txt

Find genes present in both.

comm -12 genes_from_vcf.sorted.txt pharmacogenes.sorted.txt \
  > pharmacogene_hits.txt

Inspect the result.

cat pharmacogene_hits.txt

This step answers one question only:

Which pharmacogenes contain at least one annotated variant in this VCF?

It does not answer:

Does this person have an actionable pharmacogenomic allele?

Filter VEP rows to pharmacogenes

Use the pharmacogene list to retain only VEP rows in relevant genes.

awk 'NR==FNR {genes[$1]; next}
     FNR==1 {print; next}
     $14 in genes {print}' \
     pharmacogenes.txt \
     hu24385B.vep.clean.tsv \
     > hu24385B.pharmacogene_variants.tsv

This assumes SYMBOL is column 14. Change $14 if your VEP output uses another column position.

Count the remaining rows.

wc -l hu24385B.pharmacogene_variants.tsv

Inspect the first rows.

head hu24385B.pharmacogene_variants.tsv

Reduce the VCF before annotation

A full genome contains millions of variants. If the question is limited to pharmacogenes, reduce the VCF first.

A BED file defines genomic intervals.

chr10   94762681    94855547    CYP2C19
chr22   42126499    42130865    CYP2D6
chr7    117120016   117308718   CFTR

A BED file usually has at least three columns:

Column Meaning
chrom Chromosome
chromStart Start coordinate
chromEnd End coordinate

The BED file must match the VCF reference genome and chromosome naming.

Using bcftools:

bcftools view \
  -R pharmacogenes.GRCh37.bed \
  -Oz \
  -o hu24385B.pharmacogenes.vcf.gz \
  hu24385B.vcf.gz

bcftools index hu24385B.pharmacogenes.vcf.gz

Using vcftools:

vcftools \
  --gzvcf hu24385B.vcf.gz \
  --bed pharmacogenes.GRCh37.bed \
  --recode \
  --keep-INFO-all \
  --out hu24385B.pharmacogenes

The reduced VCF is faster to annotate and easier to inspect.

Prepare a drug-gene table

A simple drug-gene table can be stored as tab-separated text.

Gene.Name   Drug.ID Drug.Name   Relationship    Source  Evidence
CYP2C19 DB00758 Clopidogrel metabolism  CPIC    guideline
CYP2D6  DB00472 Fluoxetine  metabolism  PharmGKB    curated
SLCO1B1 DB00641 Simvastatin toxicity    CPIC    guideline
DPYD    DB00544 Fluorouracil    toxicity    CPIC    guideline
CFTR    DB08820 Ivacaftor   target  FDA label

This file is a teaching substitute for a curated pharmacogenomic database.

In a professional setting, the equivalent table would be assembled from controlled sources such as CPIC, PharmGKB, PharmVar, DPWG, FDA labels, and internal validation records.

Merge VEP and drug-gene data in R

Import the VEP output and drug-gene table.

vep <- read.delim(
  "hu24385B.vep.clean.tsv",
  stringsAsFactors = FALSE,
  check.names = FALSE
)

drug_genes <- read.delim(
  "drug_gene_table.tsv",
  stringsAsFactors = FALSE,
  check.names = FALSE
)

Rename the VEP gene symbol column to match the drug-gene table.

names(vep)[names(vep) == "SYMBOL"] <- "Gene.Name"

Remove rows without a gene symbol.

vep <- vep[!is.na(vep$Gene.Name) & vep$Gene.Name != "", ]

Merge by gene name.

merged <- merge(
  x = vep,
  y = drug_genes,
  by = "Gene.Name",
  all = FALSE
)

Remove low-information variant consequences for this exercise.

low_information <- paste(
  c(
    "synonymous_variant",
    "intron_variant",
    "upstream_gene_variant",
    "downstream_gene_variant",
    "UTR",
    "non_coding_transcript",
    "NMD_transcript_variant"
  ),
  collapse = "|"
)

keep <- !grepl(low_information, merged$Consequence)
merged_filtered <- merged[keep, ]

Write the candidate table.

write.table(
  merged_filtered,
  file = "pharmacogene_variant_drug_matches.tsv",
  sep = "\t",
  quote = FALSE,
  row.names = FALSE
)

The output table links annotated variants to drug-gene records.

It is still a candidate table. Interpretation comes later.

Merge VEP and drug-gene data in Python

The same join can be written in Python using pandas.

import pandas as pd

vep = pd.read_csv("hu24385B.vep.clean.tsv", sep="\t")
drug_genes = pd.read_csv("drug_gene_table.tsv", sep="\t")

vep = vep.rename(columns={"SYMBOL": "Gene.Name"})
vep = vep[vep["Gene.Name"].notna() & (vep["Gene.Name"] != "")]

merged = vep.merge(drug_genes, on="Gene.Name", how="inner")

low_information = (
    "synonymous_variant|"
    "intron_variant|"
    "upstream_gene_variant|"
    "downstream_gene_variant|"
    "UTR|"
    "non_coding_transcript|"
    "NMD_transcript_variant"
)

merged_filtered = merged[
    ~merged["Consequence"].astype(str).str.contains(low_information, regex=True)
]

merged_filtered.to_csv(
    "pharmacogene_variant_drug_matches.tsv",
    sep="\t",
    index=False
)

The Python version is shorter. The Bash and R versions make more of the file mechanics visible.

Read the candidate table

The candidate table should contain enough fields to show the chain from variant to drug relationship.

Column Meaning
Gene.Name HGNC gene symbol
Location or Uploaded_variation Variant coordinate or VEP variant ID
Allele Alternate allele
Consequence VEP consequence
HGVSc Coding change
HGVSp Protein change
Existing_variation Known ID such as rsID
CLIN_SIG ClinVar assertion where available
Drug.ID Drug identifier
Drug.Name Drug name
Relationship Metabolism, toxicity, efficacy, target, biomarker, or warning
Source CPIC, PharmGKB, FDA, DrugBank, or other
Evidence Guideline, label, curated evidence, prediction, or candidate match

A candidate row does not mean that a drug should be given or avoided. It means the row has enough structure for review.

Consequence is not interpretation

VEP consequence terms describe predicted molecular effect.

Consequence Typical meaning
stop_gained Creates a premature stop codon
frameshift_variant Changes the reading frame
splice_acceptor_variant Affects a canonical splice acceptor
splice_donor_variant Affects a canonical splice donor
start_lost Alters the start codon
missense_variant Changes one amino acid
synonymous_variant Does not change the amino acid sequence
intron_variant Located in an intron

A high-impact consequence can support interpretation. It is not automatically pathogenic. A missense variant can be benign, damaging, uncertain, or irrelevant to the drug response.

Transcript choice also matters. A variant can be coding on one transcript and non-coding on another. MANE Select transcripts are often preferred for clinical reporting where available.

Known pharmacogenomic alleles

Many pharmacogenomic results are not interpreted from single VCF rows.

They are often interpreted as star alleles, diplotypes, copy number states, repeat genotypes, or HLA alleles.

Gene Result type
CYP2D6 Star alleles, copy number, hybrid alleles
CYP2C19 Star alleles
CYP2C9 Star alleles
TPMT Star alleles or named variants
NUDT15 Named alleles
HLA-B HLA allele, such as HLA-B*57:01
UGT1A1 Repeat genotype, such as UGT1A1*28
SLCO1B1 Specific variant or haplotype

A standard SNV VCF can miss important pharmacogenomic structure.

Task Tools or resources
CYP2D6 calling Aldy, Stargazer, Cyrius, Astrolabe
HLA typing OptiType, HLA-HD, HISAT-genotype
Pharmacogenomic annotation PharmCAT
Star allele definitions PharmVar
Clinical recommendations CPIC, DPWG

Gene-level matching is the first pass. Allele-level interpretation is required before a pharmacogenomic conclusion.

Zygosity and dosage

Genotype affects interpretation.

State Meaning
Heterozygous One alternate allele
Homozygous Two alternate alleles
Hemizygous One copy of the region
Compound heterozygous Two different variants in the same gene
Copy number change Deletion or duplication of a gene or region
Mosaic Variant present in only some cells

Pharmacogenomic reports often translate genotype into a functional phenotype, such as poor, intermediate, normal, rapid, or ultrarapid metaboliser.

CYP2D6 shows why dosage matters. A person may carry inactive alleles, reduced-function alleles, gene duplications, or hybrid alleles. A simple SNV table cannot always resolve that structure.

Drug-gene relationship types

A drug-gene match can mean several things.

Relationship Meaning
Metabolism The gene product affects drug exposure
Toxicity The genotype changes adverse event risk
Efficacy The genotype changes probability of response
Target The drug binds the gene product
Biomarker The gene or variant is part of a labelled indication
Warning The genotype or disease state changes risk

A drug target is not the same as a prescribing rule.

A drug may bind a protein encoded by a gene, but inherited variation in that gene may have no known treatment effect. A pharmacogenomic variant may affect metabolism of many drugs without being part of the drug target.

The FDA pharmacogenomic biomarker table and drug prescribing information help distinguish these categories.

Unknown variants

Most variants in a genome are not actionable.

A rare missense variant in a pharmacogene may be known, benign, uncertain, technical, or research-only.

Interpretation Meaning
Known actionable allele Evidence supports a drug-specific recommendation
Known benign or tolerated variant No action expected
Variant of uncertain significance Insufficient evidence
Technical artefact Variant call is not reliable
Research candidate Interesting, but not clinically interpretable

Prediction tools can help rank variants. They do not define clinical action.

Tool Use
CADD General deleteriousness score
REVEL Missense variant prediction
AlphaMissense Missense effect prediction
SIFT Protein tolerance prediction
PolyPhen-2 Missense effect prediction
SpliceAI Splice effect prediction

Predicted effects should remain labelled as predictions.

Limits of the first-principles workflow

The workflow explains the mechanics of a simple pharmacogenomics analysis. It does not replace a validated clinical pipeline.

Special caution is needed for:

Gene or region Reason
CYP2D6 Copy number, hybrid alleles, and complex haplotypes
HLA genes High polymorphism and specialised typing requirements
UGT1A1 Repeat alleles may not be represented in simple VCFs
Structural variants Often absent from SNV-focused VCFs
Low coverage regions Negative results may be uninformative
Direct-to-consumer data Often incomplete for clinical PGx
Transcript ambiguity Consequence can change by transcript
Reference mismatch Coordinates and annotations can become incorrect

The most common error is treating a matched gene as an interpreted result.

The correct interpretation chain is stricter.

variant call
  ↓
quality check
  ↓
correct reference build
  ↓
correct transcript or allele definition
  ↓
validated pharmacogenomic allele
  ↓
drug-specific guideline or label
  ↓
clinical interpretation

Practical summary

This exercise builds a candidate pharmacogenomics table from first principles.

It uses:

VCF
VEP annotation
gene extraction
pharmacogene matching
drug-gene joining
consequence filtering
candidate review

The file operations are straightforward. The interpretation is not.

A candidate table can show where to look. A pharmacogenomic conclusion requires allele-level evidence, drug-specific guidance, and quality-controlled genotyping.

References and resources