singularity & snakemake workflow to detect mutations with several tools to detect mutations
detect Mutations
Sylvain Schmitt April 20, 2021
singularity
&
snakemake
workflow to detect mutations with several alignment and mutation
detection tools.
Installation
-
[x] Python ≥3.5
-
[x] Snakemake ≥5.24.1
-
[x] Golang ≥1.15.2
-
[x] Singularity ≥3.7.3
-
[x] This workflow
# Python
sudo apt-get install python3.5
# Snakemake
sudo apt install snakemake`
# Golang
export VERSION=1.15.8 OS=linux ARCH=amd64 # change this as you need
wget -O /tmp/go${VERSION}.${OS}-${ARCH}.tar.gz https://dl.google.com/go/go${VERSION}.${OS}-${ARCH}.tar.gz && \
sudo tar -C /usr/local -xzf /tmp/go${VERSION}.${OS}-${ARCH}.tar.gz
echo 'export GOPATH=${HOME}/go' >> ~/.bashrc && \
echo 'export PATH=/usr/local/go/bin:${PATH}:${GOPATH}/bin' >> ~/.bashrc && \
source ~/.bashrc
# Singularity
mkdir -p ${GOPATH}/src/github.com/sylabs && \
cd ${GOPATH}/src/github.com/sylabs && \
git clone https://github.com/sylabs/singularity.git && \
cd singularity
git checkout v3.7.3
cd ${GOPATH}/src/github.com/sylabs/singularity && \
./mconfig && \
cd ./builddir && \
make && \
sudo make install
# detect Mutations
git clone git@github.com:sylvainschmitt/detectMutations.git
cd detectMutations
Usage
Get data
Generate data using the generate Mutations workflow.
git clone git@github.com:sylvainschmitt/generateMutations.git
cd ../generateMutations
snakemake --use-singularity --cores 4
cd ../detectMutations
bash scripts/get_data.sh
Locally
snakemake -np # dry run
snakemake --dag | dot -Tsvg > dag/dag.svg # dag
snakemake --use-singularity --cores 4 # run
snakemake --use-singularity --cores 1 --verbose # debug
snakemake --report report.html # report
HPC
module purge ; module load bioinfo/snakemake-5.25.0 # for test on node
snakemake -np # dry run
sbatch job.sh ; watch 'squeue -u sschmitt' # run
less detMut.*.err # snakemake outputs, use MAJ+F
less detMut.*.out # snakemake outputs, use MAJ+F
snakemake --dag | dot -Tsvg > dag/dag.svg # dag
module purge ; module load bioinfo/snakemake-5.8.1 ; module load system/Python-3.6.3 # for report
snakemake --report report.html # report
module purge ; module load system/R-3.6.2 ; R # to build results
Workflow
Reference
Index reference and SNPs for software to work with.
bwa_index
-
Tools:
BWA index
-
Singularity: oras://registry.forgemia.inra.fr/gafl/singularity/bwa/bwa:latest
samtools_faidx
-
Tools:
samtools faidx
-
Singularity: oras://registry.forgemia.inra.fr/gafl/singularity/samtools/samtools:latest
gatk_dict
-
Singularity: docker://broadinstitute/gatk:4.2.6.1
gatk_idx
-
Tools:
gatk IndexFeatureFile
-
Singularity: docker://broadinstitute/gatk:4.2.6.1
Reads
CReport quality and trim.
fastqc
-
Tools:
fastQC
-
Singularity: docker://biocontainers/fastqc:v0.11.9_cv8
trimmomatic
-
Tools:
Trimmomatic
-
Singularity: oras://registry.forgemia.inra.fr/gafl/singularity/trimmomatic/trimmomatic:latest
Alignments
Align reads against reference, mark duplicated, and report alignment quality.
bwa_mem
-
Tools:
BWA mem
-
Singularity: oras://registry.forgemia.inra.fr/gafl/singularity/bwa/bwa:latest
samtools_sort
-
Tools:
Samtools sort
-
Singularity: oras://registry.forgemia.inra.fr/gafl/singularity/samtools/samtools:latest
samtools_index
-
Tools:
Samtools index
-
Singularity: oras://registry.forgemia.inra.fr/gafl/singularity/samtools/samtools:latest
gatk_markduplicates
-
Tools:
gatk MarkDuplicates
-
Singularity: docker://broadinstitute/gatk:4.2.6.1
samtools_index_md
-
Tools:
Samtools index
-
Singularity: oras://registry.forgemia.inra.fr/gafl/singularity/samtools/samtools:latest
samtools_mpileup
-
Tools:
Samtools mpileup
-
Singularity: oras://registry.forgemia.inra.fr/gafl/singularity/samtools/samtools:latest
samtools_stats
-
Tools:
Samtools stats
-
Singularity: oras://registry.forgemia.inra.fr/gafl/singularity/samtools/samtools:latest
qualimap
-
Tools:
QualiMap
-
Singularity: docker://pegi3s/qualimap:2.2.1
Detection
Detect mutations.
gatk_mutect2
-
Tools:
gatk Mutect2
-
Singularity: docker://broadinstitute/gatk:4.2.6.1
freebayes
-
Tools:
freebayes
-
Singularity: oras://registry.forgemia.inra.fr/gafl/singularity/freebayes/freebayes:latest
gatk_haplotypecaller
-
Tools:
gatk HaplotypeCaller
-
Singularity: docker://broadinstitute/gatk:4.2.6.1
gatk_genotypegvcfs
-
Tools:
gatk GenotypeGVCFs
-
Singularity: docker://broadinstitute/gatk:4.2.6.1
strelka2
-
Tools:
Strelka2
-
Singularity: docker://quay.io/wtsicgp/strelka2-manta
varscan
-
Tools:
VarScan
-
Singularity: docker://alexcoppe/varscan
varscan2vcf
-
Script:
varscan2vcf.R
somaticsniper
-
Tools:
Somatic Sniper
-
Singularity: docker://lethalfang/somaticsniper:1.0.5.0
muse
-
Tools:
MuSe
-
Singularity: docker://opengenomics/muse:v0.1.1
Mutations
cp_vcfs
-
Tools:
cp
bedtools_substract
-
Tools:
bedtools substract
-
Singularity: oras://registry.forgemia.inra.fr/gafl/singularity/bedtools/bedtools:latest
Quality check
Combined quality information from
QualiMap
,
Picard
,
Samtools
,
Trimmomatic
, and
FastQC
(see previous steps) and assess calls
performance.
multiqc
-
Tools:
MultiQC
-
Singularity: oras://registry.forgemia.inra.fr/gafl/singularity/multiqc/multiqc:latest
evaluate_call
-
Script:
evaluate_call.R
Results
module load system/singularity-3.7.3
singularity pull https://github.com/sylvainschmitt/singularity-r-bioinfo/releases/download/0.0.3/sylvainschmitt-singularity-r-bioinfo.latest.sif
singularity shell sylvainschmitt-singularity-r-bioinfo.latest.sif
library(tidyverse)
lapply(list.files("results/stats", full=T), read_tsv) %>%
bind_rows() %>%
write_tsv("stats.tsv")
quit()
exit
Code Snippets
13 14 | shell: "bedtools subtract -header -a {input[0]} -b {input[1]} > {output}" |
13 14 | shell: "bwa index {input}" |
21 22 23 | shell: "bwa mem -M -R '{params.rg_mutated}' -t {threads} {input[0]} {input[1]} {input[2]} > {output[0]} ; " "bwa mem -M -R '{params.rg_base}' -t {threads} {input[0]} {input[3]} {input[4]} > {output[1]} ; " |
12 13 | shell: "cp {input} {output}" |
13 14 | script: "../scripts/evaluate_call.R" |
16 17 | shell: "fastqc -t {threads} -q {input} --outdir=results/{wildcards.library}/" |
16 17 18 19 | shell: "freebayes -f {input[0]} --pooled-continuous --pooled-discrete --genotype-qualities --report-genotype-likelihood-max" " --allele-balance-priors-off --min-alternate-fraction 0.03 --min-repeat-entropy 1 --min-alternate-count 2" " {input[2]} {input[1]} > {output}" |
12 13 | shell: "gatk CreateSequenceDictionary R={input} O={output}" |
15 16 | shell: "gatk GenotypeGVCFs -R {input[0]} -V {input[1]} -O {output}" |
16 17 | shell: "gatk HaplotypeCaller -R {input[0]} -I {input[1]} -O {output} -ERC GVCF" |
12 13 | shell: "gatk IndexFeatureFile -I {input}" |
16 17 18 | shell: "gatk MarkDuplicates --java-options \"-Xmx16G -Xms1G -Djava.io.tmpdir=tmp\" I={input[0]} O={output[0]} M={output[2]} ; " "gatk MarkDuplicates --java-options \"-Xmx16G -Xms1G -Djava.io.tmpdir=tmp\" I={input[1]} O={output[1]} M={output[3]}" |
18 19 20 | shell: "gatk Mutect2 -R {input[0]} -I {input[1]} -tumor {wildcards.lib}_REP{wildcards.REP}_mutated -I {input[2]} -normal {wildcards.lib}_REP{wildcards.REP}_base " " --panel-of-normals {input[3]} -dont-use-soft-clipped-bases true -O {output}" |
16 17 18 19 | shell: "multiqc results/{wildcards.library}/ -o results/{wildcards.library} ; " "rm -r results/{wildcards.library}/qualimap_mutated ; " "rm -r results/{wildcards.library}/qualimap_base ; " |
21 22 | shell: "muse.py -f {input[0]} --tumor-bam {input[1]} --normal-bam {input[2]} -O {output} -n {threads} -w results/{wildcards.lib}_REP{wildcards.REP}/muse/" |
15 16 17 18 19 | shell: "qualimap bamqc -bam {input[0]} --paint-chromosome-limits -nt {threads} -skip-duplicated --skip-dup-mode " "0 -outdir results/{wildcards.library}/qualimap_mutated -outformat HTML ; " "qualimap bamqc -bam {input[1]} --paint-chromosome-limits -nt {threads} -skip-duplicated --skip-dup-mode " "0 -outdir results/{wildcards.library}/qualimap_base -outformat HTML" |
12 13 | shell: "samtools faidx {input}" |
12 13 | shell: "samtools index {input[0]} ; samtools index {input[1]}" |
12 13 | shell: "samtools index {input[0]} ; samtools index {input[1]}" |
16 17 18 | shell: "samtools mpileup -f {input[0]} {input[1]} > {output[0]} ; " "samtools mpileup -f {input[0]} {input[2]} > {output[1]}" |
12 13 14 | shell: "samtools sort --threads {threads} {input[0]} > {output[0]} ; " "samtools sort --threads {threads} {input[1]} > {output[1]}" |
12 13 14 | shell: "samtools stats --threads {threads} {input[0]} > {output[0]} ; " "samtools stats --threads {threads} {input[1]} > {output[1]}" |
16 17 | shell: "/opt/somatic-sniper/build/bin/bam-somaticsniper -q 25 -Q 15 -s 0.0001 -F vcf -f {input[0]} {input[1]} {input[2]} {output}" |
19 20 21 22 23 24 25 26 27 | shell: "configureStrelkaSomaticWorkflow.py " "--normalBam {input[2]} " "--tumorBam {input[1]} " "--referenceFasta {input[0]} " "--runDir results/{wildcards.lib}_{wildcards.REP}/strelka2/ ; " "results/{wildcards.lib}_{wildcards.REP}/strelka2/runWorkflow.py -m local -j {threads} ; " "zcat results/{wildcards.lib}_{wildcards.REP}/strelka2/results/variants/somatic.snvs.vcf.gz > {output} ; " "rm -r results/{wildcards.lib}_{wildcards.REP}/strelka2" |
17 18 19 20 21 | shell: "trimmomatic PE -threads {threads} {input[0]} {input[1]} {output[0]} {output[4]} {output[1]} {output[5]} " "ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:2:keepBothReads SLIDINGWINDOW:4:15 2> {output[8]} ; " "trimmomatic PE -threads {threads} {input[2]} {input[3]} {output[2]} {output[6]} {output[3]} {output[7]} " "ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:2:keepBothReads SLIDINGWINDOW:4:15 2> {output[9]}" |
15 16 | script: "../scripts/varscan2vcf.R" |
12 13 | shell: "/.run somatic {input[1]} {input[0]} results/{wildcards.library}/varscan/{wildcards.library}" |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 | callfile <- snakemake@input[[1]] mutationsfile <- snakemake@input[[2]] statsfile <- snakemake@output[[1]] library(tidyverse) library(vcfR) mutations <- read_tsv(mutationsfile, col_types = cols( CHROM = col_character(), POS = col_double(), REF = col_character(), TYPE = col_character(), ALT = col_character() )) %>% mutate(True = 1) %>% filter(REF != "N") call <- try(read.vcfR(callfile, verbose = F)@fix, silent = T) if(!inherits(call, "try-error")){ call <- as.data.frame(call) %>% mutate_at(c("CHROM", "REF", "ALT"), as.character) %>% mutate(POS = as.numeric(as.character(POS))) %>% mutate(Called = 1) %>% mutate(REF = substr(REF, 1, 1), ALT = substr(ALT, 1, 1)) } else { call <- data.frame(CHROM = character(), POS = double(), REF = character(), ALT = character(), Called = integer()) } stats <- full_join(mutations, call, by = c("CHROM", "POS", "REF", "ALT")) %>% mutate_at(c("True", "Called"), funs(ifelse(is.na(.), 0, .))) %>% mutate(Confusion = recode(paste0(True, Called), "01" = "FP", "10" = "FN", "11" = "TP")) %>% group_by(Confusion) %>% summarise(N = n()) %>% reshape2::dcast(mutationsfile + callfile ~ Confusion, value.var = "N") if(!("FP" %in% names(stats))) stats$FP <- 0 if(!("FN" %in% names(stats))) stats$FN <- 0 if(!("TP" %in% names(stats))) stats$TP <- 0 stats <- mutate(stats, Precision = round(TP/(TP+FP), 2), Recall = round(TP/(TP+FN), 2)) %>% mutate(callfile = callfile) %>% mutate(mutationsfile = mutationsfile) write_tsv(stats, statsfile) |
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 | varscanfile <- snakemake@input[[1]] snpfile <- snakemake@input[[2]] # output vcffile <- snakemake@output[[1]] # manual # varscanfile <- "results/N100_R2_AF0.1_NR7000/varscan/N100_R2_AF0.1_NR7000.snp" # snpfile <- file.path("results/reference/", yaml::read_yaml("config/config.dev.yml")$snps) # vcffile <- "results/N100_R2_AF0.1_NR7000/varscan/N100_R2_AF0.1_NR7000.vcf" library(tidyverse) library(Biostrings) cat("##fileformat=VCFv4.1\n##fileDate=20210521\n##source=varscan\n#CHROM POS ID REF ALT\n", file = vcffile) snps <- read_tsv(snpfile, skip = 11) %>% dplyr::rename(CHROM = `#CHROM`) %>% select(CHROM, POS, ID, REF, ALT) %>% mutate(ID = ".") %>% as_tibble() %>% mutate_all(as.character) %>% mutate_at("POS", as.numeric) read_tsv(varscanfile) %>% mutate(ID = ".", CHROM = chrom, POS = position, REF =ref, ALT = var) %>% dplyr::select(CHROM, POS, ID, REF, ALT) %>% anti_join(snps) %>% write_tsv(file = vcffile, append = T, col_names = F) |
Support
- Future updates
Related Workflows





