DeepG4ToolsComparison: A snakemake pipeline to run and evaluate G4’s DNA prediction tools
Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation, topic
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
DeepG4ToolsComparison: A snakemake pipeline to run and compare G4 DNA prediction tools with DeepG4
The predictions for differents tissues and cancer with DeepG4 is available here
The code to generate the precision/recall curve is available here .
Overview
It’s based on Snakemake to manage the workflow and Docker to isolate the application and run it with the appropriate tool versions.
Installation
Clone the repository :
git clone https://github.com/morphos30/DeepG4ToolsComparison.git
cd DeepG4ToolsComparison
Install the docker image and run it :
docker build . -t morphos30/g4docker -f Dockerfile/Dockerfile
docker run -it -v $(pwd):/DeepG4ToolsComparison morphos30/g4docker /bin/bash
Where
$(pwd)
is the working directory of
DeepG4ToolsComparison
on
your computer.
Launch the pipeline :
cd /DeepG4ToolsComparison
snakemake --use-conda -j 30
You have to set the option
--use-conda
in order to install and run
each tool in its proper environment.
Workflow specifications
Input
- DNA sequences into bed format, split into positive set and negative set, written into the bed directory.
Note :
if you want add a new dataset, edit the
Snakefile
file and
add the bed files in the dictionnary
EXPERIMENTS
, without the
.bed
extension. Example :
TestSet_Peaks_BG4_G4seq_HaCaT_GSE76688_hg19_201b_Ctrl_gkmSVM_0.8_42_Ctrl_gkmSVM.bed
TestSet_Peaks_BG4_G4seq_HaCaT_GSE76688_hg19_201b_Ctrl_gkmSVM_0.8_42.bed
EXPERIMENTS = { "TestSet_Peaks_BG4_G4seq_HaCaT_GSE76688_hg19_201b_Ctrl_gkmSVM_0.8_42_Ctrl_gkmSVM":{"CTRL":"TestSet_Peaks_BG4_G4seq_HaCaT_GSE76688_hg19_201b_Ctrl_gkmSVM_0.8_42_Ctrl_gkmSVM","EXP":"TestSet_Peaks_BG4_G4seq_HaCaT_GSE76688_hg19_201b_Ctrl_gkmSVM_0.8_42"}
}
Where
CTRL
is the negative set and
EXP
is the positive set.
-
DNA Accessibility (ATAC-seq/DNAse-seq/MNase-seq) in bigwig format or
directly the averaged value for each sequence in a
one-column tsv
file.
ATACFILE = { "TestSet_Peaks_BG4_G4seq_HaCaT_GSE76688_hg19_201b_Ctrl_gkmSVM":["ATAC_entinostat_mean.bw"]
}
or
one-column tsv
file in
fasta/{Experiment_name}/{Experiment_name}_atac_merged.tsv
. Example :
fasta/TestSet_Peaks_BG4_G4seq_HaCaT_GSE76688_hg19_201b_Ctrl_gkmSVM/TestSet_Peaks_BG4_G4seq_HaCaT_GSE76688_hg19_201b_Ctrl_gkmSVM_atac_merged.tsv
head TestSet_Peaks_BG4_G4seq_HaCaT_GSE76688_hg19_201b_Ctrl_gkmSVM_atac_merged.tsv
0.01628741641898675
0.028752257447422012
0.028878783223623482
0.055516399884055316
0.02825982069785745
0.03582923041809851
0.023904436394151577
0.07724288611280039
0.01740800116454673
0.05779605688479145
Rulegraph :
Workflow output for each tools :
Outputs | Tools | Methods |
---|---|---|
ATACDeepG4_ATACnormBG | ATACDeepG4 | DeepG4 using accessibily (DeepG4 in paper) |
ATACDeepG4_classictuningOH5 | ATACDeepG4 | DeepG4 without accessibility (DeepG4* in paper) |
penguinn_retrained | penguinn | penguinn using custom model trained on BG4G4seq dataset |
penguinn | penguinn | penguinn using default model |
G4detector_retrained | G4detector | G4detector using custom model trained on BG4G4seq dataset |
G4detector | G4detector | G4detector using default model |
quadron_retrained | quadron | quadron using custom model trained on BG4G4seq dataset |
quadron_score | quadron | quadron using default model |
Code Snippets
10 11 | shell: "Rscript "+AUC_script+" {input} {output.pdf}" |
22 23 | shell: "Rscript "+supp_metrics_script+" {input} {output.pdf}" |
35 36 | shell: "Rscript "+subseq_fasta+" {params.sub} {input.fas} {output.fasta_trimmed}" |
82 83 | script: "../scripts/Snakemake/generate_fasta_atac_from_bed_bgnorm.R" |
11 12 | shell: "python2 "+G4CatchAll_script+" --fasta {input.fas} > {output}" |
26 27 | shell: "Rscript "+G4CatchAll_tsv+" {input.table} {input.fas} {output} {params.calc}" |
15 16 | shell: "python "+G4detector_script+" test {input.fas} {params.model} {output}" |
29 30 | shell: "Rscript "+G4detector_to_tsv+" {input.table} {input.fas} {output}" |
44 45 | shell: "python "+G4detector_script+" test {input.fas} {params.model} {output}" |
59 60 | shell: "Rscript "+G4detector_to_tsv+" {input.table} {input.fas} {output}" |
40 41 | shell: "Rscript "+G4Hunter_tsv+" {input} {output} {params.calc} {params.score_selec}" |
58 59 | shell: "Rscript "+G4Hunter_retrained_script+" {params.model_path} {input} {output.score} {params.calc}" |
12 13 14 15 16 17 18 19 20 21 22 23 24 25 | run: from Bio import SeqIO with open(output[0],"w") as outfile: with open(input.fas, "rU") as handle: for record in SeqIO.parse(handle, "fasta"): try: cmd = shell("echo '"+str(record.seq)+"' | timeout 20s "+GQRS_Mapper_script+" -csv -notitle",iterable=True) for line in cmd: if len(line) == 0: outfile.write(record.id+",,,,,,,,,\n") else: outfile.write(record.id+","+line+"\n") except: outfile.write(record.id+",,,,,,,,,\n") |
39 40 | shell: "Rscript "+GQRS_Mappper_tsv+" {input.table} {output} {params.calc}" |
27 28 | shell: "Rscript "+DeepG4OH4_script+" {input.fas} {input.atac_merged} {output} {params.model}" |
46 47 | shell: "Rscript "+penguinn_script_R+" {input.fas} {output} {params.model}" |
61 62 | shell: "Rscript "+penguinn_script_R+" {input.fas} {output} {params.model}" |
13 14 | shell: "python2 "+qparse_script+" -i {input.fas} -o {output}" |
25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 | run: fasta = {} with open(output[0],"w") as outfile: with open(input.qparse_file, "r") as qparsefile: for line in qparsefile: line = line.strip() if not line: continue if line.startswith("#"): continue if line.startswith(">"): active_sequence_name = line[1:] if active_sequence_name not in fasta: fasta[active_sequence_name] = [] continue sequence = line fasta[active_sequence_name].append(sequence) for i,v in fasta.items(): outfile.write("\n".join([i+"\t"+j for j in v])+"\n") |
56 57 | shell: "Rscript "+qparse_tsv+" {input.fas} {input.qparse_file} {output} {params.calc}" |
14 15 | shell: "g4predict intra -M -f {input} -b {output}" |
30 31 | shell: "Rscript "+quadparser_tsv+" {input.table} {input.fas} {output} {params.calc}" |
23 24 | shell: "Rscript "+generate_quadron_dat+" {params.type} {params.Quadron_lib} {params.source} {threads} {input.fas} {output.quadron_table}" |
43 44 | shell: "Rscript "+generate_quadron_dat+" {params.type} {params.Quadron_lib} {params.source} {threads} {input.fas} {output.quadron_table}" |
61 62 | shell: "Rscript "+quadron_retrained_script+" {params.model_path} {input.quadron_table} {output.score}" |
8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 | require(Biostrings) if (!require(plyranges)) { BiocManager::install("plyranges") } require(plyranges) if (!require(BSgenome.Hsapiens.UCSC.hg19)) { BiocManager::install("BSgenome.Hsapiens.UCSC.hg19") } library(BSgenome.Hsapiens.UCSC.hg19) require(tidyverse) seqlens <- seqlengths(BSgenome.Hsapiens.UCSC.hg19) binbed <- snakemake@params[["random_regions"]] %>% read_bed() #Functions (copied from DeepG4 package) getScoreBW <- function (WIG, BED,forScan=F) { res <- do.call("rbind",lapply(split(BED, droplevels(GenomeInfoDb::seqnames(BED))), function(zz) { cov <- WIG[[unique(as.character(GenomeInfoDb::seqnames(zz)))]] score <- IRanges::Views(cov, start = BiocGenerics::start(zz), end = BiocGenerics::end(zz)) return(as.matrix(score)) })) if(forScan){ return(res) }else{ return(rowMeans(res)) } } NormBW <- function(x,binbed){ ranges_bin <- base::range(IRanges::subsetByOverlaps(x,binbed)$score,na.rm = TRUE, finite = TRUE) x$score <- scales::rescale(x$score,to = c(0,1),from = ranges_bin) x <- IRanges::coverage(x,weight = "score") return(x) } #read bed files readBed <- . %>% read_tsv(col_names = F) %>% dplyr::select(1:3) %>% setNames(c("seqnames","start","end")) %>% as_granges() G4pos.bed <- snakemake@input[["bed_pos"]] %>% read_bed() if(unique(width(G4pos.bed)) != 201){ G4pos.bed <- snakemake@input[["bed_pos"]] %>% readBed } G4neg.bed <- snakemake@input[["bed_ctrl"]] %>% read_bed() if(unique(width(G4neg.bed)) != 201){ G4neg.bed <- snakemake@input[["bed_ctrl"]] %>% readBed } G4pos.bed <- G4pos.bed %>% filter(seqnames != "chrY") G4neg.bed <- G4neg.bed %>% filter(seqnames != "chrY") #generate open chromatin dataset (ATAC or DNAse-seq) my_seuil_bg <- snakemake@params[["seuil"]] my_window_bg <- as.numeric(snakemake@params[["window"]]) ATAC_dataset <- snakemake@input[["atac_data"]] G4all.bed <- c(G4pos.bed,G4neg.bed) G4all.bed$order <- 1:length(G4all.bed) G4all.bed <- BiocGenerics::sort(GenomeInfoDb::sortSeqlevels(G4all.bed)) G4all.atac <- ATAC_dataset %>% map(function(x){ xbw <- x %>% import.bw %>% NormBW(binbed) res <- xbw %>% getScoreBW(G4all.bed) res[is.na(res)] <- 0 return(res) })%>% purrr::reduce(`+`) / length(ATAC_dataset) G4all.bed.bg <- G4all.bed %>% anchor_center() %>% mutate(width=my_window_bg)%>% as_tibble() %>% left_join(enframe(seqlens),by = c("seqnames"="name")) %>% mutate(end = ifelse(end>value,value,end)) %>% dplyr::select(-width) %>% as_granges() G4all.atac.bg <- ATAC_dataset %>% map(function(x){ xbw <- x %>% import.bw() %>% NormBW(binbed) res <- xbw %>% getScoreBW(G4all.bed.bg) res[is.na(res)] <- 0 return(res) })%>% purrr::reduce(`+`) / length(ATAC_dataset) my_test <- (G4all.atac/G4all.atac.bg)<my_seuil_bg G4all.atac[my_test] <- 0 tibble(atac_seq=G4all.atac) %>% write_tsv(snakemake@output[["atac_merged"]],col_names = F) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 | require(tidyverse) # files <- list.files("results", recursive=TRUE, full.names=TRUE) # files <- files[str_detect(files,"AUC/.*\\.tsv")] res <- snakemake@input %>% map(read_tsv) %>% map(gather,key = metric,value = value,-file) %>% bind_rows() fres <- res %>% mutate(Exp = str_extract(file,"GSE[0-9]+|breast-cancer-PDTX")) %>% mutate(size = str_extract(file,"201b|500b|253b")) %>% mutate(Ctrl = str_extract(file,"Ctrl(_[A-Za-z0-9]+|)")) %>% mutate(Tool = str_extract(basename(file),"\\..+")) %>% mutate(Tool = str_remove(Tool,".8_42_Ctrl_(gkmSVM|neg).|.")) %>% mutate(cell_line = str_extract(file,"qG4|HaCaT|K562|HEKnp|H1975|A549|293T|HeLaS3")) %>% mutate(TypeExp = str_extract(file,"Peaks_G4P_G4seq|TestSet_Peaks_qG4|TestSet_Peaks_G4seqpm_BG4|TestSet_Peaks_BG4_G4seq|Promoters_qG4|Promoters_G4seq_BG4|Promoters_BG4_G4seq|Peaks_G4seqpm_BG4|Peaks_BG4_G4seq|Peaks_G4seq_qG4|Peaks_G4seq_BG4|Peaks_qG4|Peaks_G4seqpm_qG4")) %>% dplyr::select(-file) #Raf en veut fres %>% unite(Ctrl_dat,TypeExp,Exp,cell_line,Ctrl,size) %>% unite(Ctrl_dat,metric,Ctrl_dat) %>% spread(key = Ctrl_dat,value = value) %>% write_tsv(snakemake@output[[1]]) #Raf en veut fres %>% group_by(Exp,Ctrl,cell_line,TypeExp,size,metric) %>% arrange(desc(value)) %>% slice(1) %>% write_tsv(snakemake@output[[2]]) fres %>% unite(Ctrl_dat,TypeExp,Exp,cell_line,Ctrl,size) %>% group_by(Ctrl_dat,metric) %>% arrange(desc(value)) %>% mutate(Classement = 1:dplyr::n()) %>% dplyr::select(-value) %>% spread(key = Ctrl_dat,value = Classement) %>% write_tsv(snakemake@output[[3]]) fres %>% write_tsv(snakemake@output[[4]]) |
98 99 | script: "scripts/Snakemake/report_AUC.R" |
Support
- Future updates
Related Workflows





