I will go thorough the steps to get to this final summary image:
Opinion:
Opinion:
The closer pLI is to one, the more intolerant of protein-truncating variants the transcript appears to be. We consider pLI ≥ 0.9 as an extremely intolerant set of transcripts. (Lek et al., 2016).
To find ZNF genes with pLI >0.9 I used
Handling requires installation of samtools https://www.htslib.org (Samtools, BCFtools, HTSlib). Uncompress the data with bgzip.
Query the whole genoe pLI score for all ZNF genes. Clean the dataset to extract the data of interest: Gene name, numbers of observed/expected LoF, pLI score, etc.
The R code performs a lot of formatting and cleaning to interpret the dataset. We find all genes wiht pLI > 0.9.
I have pre-emptively shown ZNF777 in red. I will show in the next section why I chose this as a pilot study.
This was found from existing data, as shown in the next section.
This was found from existing data, as shown in the next section.
I used existing chip-seq data. This will pair one ZNF to a list of all genes that it ostensibly regulates. I believe this dataset to be a useful example: (Imbeault et al., 2017), https://www.nature.com/articles/nature21683.
“We then determined the genomic targets of human KZFPs by performing chromatin immunoprecipitation with exonuclease digestion (ChIP–exo)[13] on 257 HEK293T cell lines, each engineered to express one haemagglutinin (HA)-tagged family member. We obtained single base-pair resolution binding sites for 222 human KZFPs (Supplementary Table 4), 12 of them in duplicate (Extended Data Fig. 2), with the number of high-quality peaks per protein ranging from more than 10,000 to only around 15 (Extended Data Fig. 3a).
I compared my list of all ZNF for pLI >0.9 with the ZNF genes available in this study. The first match was ZNF777, which has pLI 0.99 (as shown in Figure 1).
I downloaded the data:
GSM2466654 ZNF777 -https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE78099 -https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2466654
Methods in Code block 3
I cleaned and visualise the MACS (Model-based Analysis of ChIP-Seq) peak data. What is MACS? - ref 1, (Feng et al., 2011). What is MACS? - ref 2
The genes “affected” by ZNF777 were clustered into their best-fit protein pathways using STRING-db. String uses experimental and curated data from Biocarta, BioCyc, GO, KEGG, and Reactome, BIND, DIP, GRID, HPRD, IntAct, MINT, and PID.
I access the database using Cytoscape.
Pathway clustering was performed using Markov Cluster Algorithm (MCL), an unsupervised cluster algorithm for networks based on stochastic flow. (Van Dongen, 2000), https://micans.org/mcl/.
In Figure A, I show all of the gene target of ZNF777.
Since we cannot define the protein pathways clearly here (something that you find when reading hierarchical GO enrichement analsis results), I instead cluster these proteins into the best-fitting pathways.
The result of MCL clustering is show in Figure B.
Now we have a set of distinct protein pathways.
Next, since we started with a ZNF that is intolerant to LoF, we also believe that the target genes must also be
I cleaned the data further to group gene according to their
Figure 3 shows the top 5 largest pathways.
These by-stander genes are still surely sensitive to regulation by ZNF777. However, we can infer the evolutionary hierarchy of the major and minor protein pathways under the control of this ZNF.
There are 2 major further ways to progress.
From all candidate ZNF gene targets, we find those which match it’s intolerance to LoF. We therefore define these pathways as the major targets of this ZNF.
I have purposely refrained for completing the final step: quantifying the top pLI pathway/gene list. This is to prevent from making a conclusion without testing the theory and stats.
#!/bin/bash
# dataset
# <https://gnomad.broadinstitute.org/downloads>
# Section "Constraint"
# File "pLoF Metrics by Gene TSV"
# Ref <https://www.nature.com/articles/s41586-020-2308-7>
# install samtools
bgzip -d gnomad.v2.1.1.lof_metrics.by_gene.txt.bgz
#!/bin/bash
set -e
cd ../data/
output=ZNF_summary.tsv
# query gene and get summary column
head -1 gnomad.v2.1.1.lof_metrics.by_gene.txt | cut -f 1,17,18,20,21,42,43,46,75,76,77 > $output
# obs_lof mu_lof exp_lof pLI obs_het_lof obs_hom_lof exp_hom_lof
grep "ZNF*" gnomad.v2.1.1.lof_metrics.by_gene.txt | cut -f 1,17,18,20,21,42,43,46,75,76,77 >> $output
Major note: some of the data used in the second half of this script requires the scripts from 4,5,6. I have named them in the order of writing. It will be obvious when a dataset is missing, that the next bash script should be run. If continuing this project, 3.analyse.R should be split into sequential order, or call the bash script within the R code.
Minor note: This script is needlessly long and contains rough exploration work. If continuing this project, it can be heavily simplified.
library(dplyr)
library(ggplot2)
library(tidyr)
# Multiple sequence alignment file
df <-
read.table(file="../data/ZNF_summary.tsv",
header = TRUE,
sep = "\t",
stringsAsFactors = FALSE)
# Note there is one KAZN gene included, figure out the grep correction later.
# The closer pLI is to one, the more intolerant of protein-truncating variants the transcript appears to be. We consider pLI ≥ 0.9 as an extremely intolerant set of transcripts. _Lek et al Nature 2016._
names(df)
# get the data in chr order
df <- arrange(df, chromosome, start_position)
df$chromosome <- as.numeric(df$chromosome)
df$start_position <- as.numeric(df$start_position)
# replace NAs with X chromosome
df$chromosome[is.na(df$chromosome)]<-"X"
# create the subset for example: ZNF777
g1 <- subset(df, gene == "ZNF777")
pZNF777 <-
df %>%
# ggplot(aes(x=start_position, y=pLI))+
ggplot(aes(x=factor(chromosome, stringr::str_sort(unique(chromosome), numeric = TRUE)),
y=pLI))+
# geom_point() +
geom_jitter() +
geom_point(data=g1, colour="red") + # this adds a red point
geom_text(data=g1, label="ZNF777", y=.95, colour="red") +
geom_hline(linetype="dotted", yintercept=0.9) +
#facet_grid(~chromosome) +
# facet_grid(.~factor(chromosome, stringr::str_sort(unique(chromosome), numeric = TRUE))) +
# theme(legend.position="none", panel.background = element_rect("#F7F7F7"),
# #axis.text.x=element_text(angle=45,hjust=1,vjust=0.5),
# axis.text.x=element_blank()) +
xlab("ZNF genes genomic position") +
ylab("pLI (≥ 0.9 extremely intolerant to LoF)")
pZNF777
# Next I want to simulate some chip-seq data, matching one ZNF to a list of all genes that are regulated
# I believe this dataset is useful
#### https://www.nature.com/articles/nature21683
# We then determined the genomic targets of human KZFPs by performing chromatin immunoprecipitation with exonuclease digestion (ChIP–exo)13 on 257 HEK293T cell lines, each engineered to express one haemagglutinin (HA)-tagged family member. We obtained single base-pair resolution binding sites for 222 human KZFPs (Supplementary Table 4), 12 of them in duplicate (Extended Data Fig. 2), with the number of high-quality peaks per protein ranging from more than 10,000 to only around 15 (Extended Data Fig. 3a).
#ZNF77 has pLI 0.99
# Download
# GSM2466654 ZNF777
# https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE78099
## https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2466654
df2 <-
read.table(file="../data/GSM2466654_ZNF777_peaks_processed_score_signal_exo.bed.txt",
header = FALSE,
sep = "\t",
stringsAsFactors = FALSE)
df2 <- separate(df2, V1, into = c("text", "chromosome"), sep = "chr") %>% select(-text)
colnames(df2)[colnames(df2) == 'V2'] <- 'start_position'
colnames(df2)[colnames(df2) == 'V3'] <- 'end_position'
colnames(df2)[colnames(df2) == 'V5'] <- 'score'
df2 %>%
ggplot(aes(x=start_position, y=score ))+
geom_point() +
# geom_hline(linetype="dotted", yintercept=0.9) +
#facet_grid(~chromosome) +
facet_grid(.~factor(chromosome, stringr::str_sort(unique(chromosome), numeric = TRUE))) +
theme(legend.position="none", panel.background = element_rect("#F7F7F7"),
#axis.text.x=element_text(angle=45,hjust=1,vjust=0.5),
axis.text.x=element_blank()) +
xlab("Genes genomic position") +
ylab("Score MACS peak")
# aligned to GRCh37g1k
# get gene symbtols via ensenbke biomart
# http://grch37.ensembl.org/biomart/martview/35883cadc275e8738b07668ab1f8d58f
# run 4.chip.sh
# Query for these coordinates and output attibutes:
#Dataset : Human genes (GRCh37.p13)
#Filters: coordinates, e.g. 1:100:10000:-1, 1:100000:200000:1: [ID-list specified]
#Attributes:
#Gene stable ID
#Gene start (bp)
#Gene end (bp)
#Chromosome/scaffold name
#Transcription start site (TSS)
#Gene name
# Output: GSM2466654_ZNF777_peaks_processed_score_signal_exo.bed.biomartformat_annotated.txt
df3 <-
read.table(file="../data/GSM2466654_ZNF777_peaks_processed_score_signal_exo.bed.biomartformat_annotated.txt",
header = TRUE,
sep = "\t",
stringsAsFactors = FALSE)
# run 5.genelist.sh
# cytoscape
# string gene query, print list
# MCL cluster, default 2.5 inflation
df4 <-
read.table(file="../output/img2_string_network_ZNF777_peaks_MCL2.5_table.csv",
header = TRUE,
sep = ",",
stringsAsFactors = FALSE)
df4 <- df4 %>% select(X__mclCluster, display.name)
# colnames(df)[colnames(df) == 'oldName'] <- 'newName'
colnames(df4)[colnames(df4) == 'display.name'] <- 'Gene.name'
colnames(df4)[colnames(df4) == 'X__mclCluster'] <- 'mclCluster'
# merge MCL clusters with the MACS peaks (genes named via biomart)
df5 <- merge(df3, df4)
# get first in gene (instead of all TSS)
df5 <- df5 %>%
group_by(Gene.name) %>%
filter(row_number()==1)
names(df5)
colnames(df5)[colnames(df5) == 'Chromosome.scaffold.name'] <- 'chromosome'
colnames(df5)[colnames(df5) == 'Gene.start..bp.'] <- 'start_position'
colnames(df5)[colnames(df5) == 'Gene.end..bp.'] <- 'end_position'
df5 %>%
ggplot(aes(x=start_position, y=(mclCluster), color= (as.character(mclCluster) ) ))+
geom_point() +
# geom_hline(linetype="dotted", yintercept=0.9) +
#facet_grid(~chromosome) +
facet_grid(~factor(chromosome, stringr::str_sort(unique(chromosome), numeric = TRUE))) +
theme(legend.position="none", panel.background = element_rect("#F7F7F7"),
#axis.text.x=element_text(angle=45,hjust=1,vjust=0.5),
axis.text.x=element_blank()) +
xlab("Genes genomic position") +
ylab("Score MACS peak")
df5 %>%
ggplot(aes(x=factor(chromosome, stringr::str_sort(unique(chromosome), numeric = TRUE)),
y=(mclCluster),
group=mclCluster,
color= (as.character(mclCluster) ) ))+
#geom_point() +
geom_jitter() +
geom_line(aes(alpha=0.2)) +
theme(legend.position="none", panel.background = element_rect("#F7F7F7"),
#axis.text.x=element_text(angle=45,hjust=1,vjust=0.5),
#axis.text.x=element_blank()
) +
xlab("Genes genomic position") +
ylab("Protein pathway ID")
# merge protein pathway data with MACS peaks
df5 <- df5 %>% filter(mclCluster < 6)
df2
names(df2)
names(df5)
tmp <- merge(
(df2 %>% select(chromosome, start_position, end_position, score)),
(df5 %>% select(Gene.name, chromosome, start_position, end_position, mclCluster)),
all = T )
df2 <- df2 %>%
select(chromosome, start_position, end_position, score)
tmp <- merge(df2, df5, all = T )
# get pLI scores
# now get the plof for mcl genes
gene_names <- tmp %>% filter(mclCluster > 0) %>% select(Gene.name)
write.table(gene_names, file='../output/gene_names_mcl_top5.csv',sep=",", quote=FALSE, row.names=FALSE, col.names = FALSE)
# run 6.mclplof.sh
# Multiple sequence alignment file
df6 <-
read.table(file="../output/gene_names_mcl_top5_plof.txt",
header = TRUE,
sep = "\t",
stringsAsFactors = FALSE)
names(tmp)
names(df6)
# colnames(df6)[colnames(df6) == 'oldName'] <- 'newName'
colnames(df6)[colnames(df6) == 'gene'] <- 'Gene.name'
tmp <- merge(tmp, df6, all.x = T)
tmp %>%
ggplot(aes(x=factor(chromosome, stringr::str_sort(unique(chromosome), numeric = TRUE)),
y=mclCluster,
group=mclCluster,
color= (as.character(mclCluster) ) ))+
# geom_point() +
geom_jitter() +
geom_line(aes(alpha=0.2)) +
theme(legend.position="none", panel.background = element_rect("#F7F7F7"),
#axis.text.x=element_text(angle=45,hjust=1,vjust=0.5),
#axis.text.x=element_blank()
) +
xlab("Genes genomic position") +
ylab("Protein pathway ID")
p0 <- tmp %>%
# filter(chromosome ==1 ) %>%
ggplot(aes(x=start_position,
y=mclCluster,
group=mclCluster,
color= (as.character(mclCluster) ) ))+
# geom_point() +
geom_jitter() +
geom_line(aes(alpha=0.2)) +
theme(legend.position="none", panel.background = element_rect("#F7F7F7"),
#axis.text.x=element_text(angle=45,hjust=1,vjust=0.5),
#axis.text.x=element_blank()
) +
xlab("Genes genomic position") +
ylab("Protein pathway ID")
g2$Gene.name
g2 <- subset(tmp, pLI >= 0.9)
p1 <- tmp2 %>%
ggplot(aes(x=factor(chromosome, stringr::str_sort(unique(chromosome), numeric = TRUE)),
y=(mclCluster),
group=mclCluster,
color= (as.character(mclCluster) ) ))+
#geom_point() +
geom_jitter(aes(size=pLI)) +
geom_line(aes(alpha=0.2)) +
theme(legend.position="none", panel.background = element_rect("#F7F7F7"),
#axis.text.x=element_text(angle=45,hjust=1,vjust=0.5),
#axis.text.x=element_blank()
) +
xlab("Genes genomic position") +
ylab("Protein pathway ID") +
geom_text(data=g2, aes(label=Gene.name), color='black')
p2 <- tmp %>%
# filter(chromosome ==1 ) %>%
ggplot(aes(start_position, score ))+
geom_segment(aes(xend = start_position, yend = 0))+
facet_grid(.~factor(chromosome, stringr::str_sort(unique(chromosome), numeric = TRUE))) +
theme(legend.position="none", panel.background = element_rect("#F7F7F7"),
#axis.text.x=element_text(angle=45,hjust=1,vjust=0.5),
axis.text.x=element_blank()) +
xlab("Genes genomic position") +
ylab("Score MACS peak")
# Plot main figure ----
require(gridExtra)
grid.arrange(pZNF777, p2, p1,ncol=1,
bottom = "",
left = "")
#!/bin/bash
cd ../data/
# convert to correct format
# (Chr:Start:End:Strand) [Max 500 advised]
# e.g. 1:100:10000:-1, 1:100000:200000:1
cut -f 1-3 GSM2466654_ZNF777_peaks_processed_score_signal_exo.bed.txt |\
sed 's/ /:/g' |\
sed 's/chr//g' \
> GSM2466654_ZNF777_peaks_processed_score_signal_exo.bed.biomartformat.txt
#!/bin/bash
# gene simple gene list to cluster with STRING
cut -f 6 GSM2466654_ZNF777_peaks_processed_score_signal_exo.bed.biomartformat_annotated.txt |\
sort |\
uniq \
> GSM2466654_ZNF777_peaks_processed_score_signal_exo.bed.biomartformat_annotated_genes.txt
#!/bin/bash
cd ../output/
# for every line in the top 5 protein pathways, get the pLI data
head -1 ../data/gnomad.v2.1.1.lof_metrics.by_gene.txt | cut -f 1,17,18,20,21,42,43,46,75,76,77 > gene_names_mcl_top5_plof.txt
while read -r line;
do
grep "$line" ../data/gnomad.v2.1.1.lof_metrics.by_gene.txt;
done < gene_names_mcl_top5.csv |\
cut -f 1,17,18,20,21,42,43,46,75,76,77 >> gene_names_mcl_top5_plof.txt