mega-non-model-wgs-snakeflow
There are two main changes:
- The genomics databases are now created directly from the gvcf sections. As a consequence, the gvcfs for all chromosomes and scaffold groups do not have to be complete before moving on to the genomics DB stage. This is helpful when there are some very long chromosomes. Otherwise, this change makes no difference to the user's responsibilities.
- The GenotypeGVCFs step is now done in a scattered fashion over intervals that are all less than some desired length. Instead of waiting a very long time for a long chromosome to finish this step, the run times of the individual jobs are roughly equalized by chopping each chromosome into, say, 5-megabase chunks, dispatching each of those jobs separately, and stitching the results together at the end. This also means the workflow can be designed so that none of these jobs exceeds 24 hours, allowing them to run on the standard queue on most clusters. A minimal sketch of the chunking idea appears below.
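The actual interval construction is done by the R script sequence-scatter-bins.R shown later in the Code Snippets section. The following is only a minimal Python sketch of the idea, assuming a hypothetical chromosome length and a 5 Mb target chunk size (values chosen for illustration, not taken from the workflow config).

    import math

    def chop_chromosome(length, max_chunk=5_000_000):
        """Split a chromosome of `length` bases into roughly equal,
        1-based inclusive intervals, each no longer than max_chunk."""
        n_bins = math.ceil(length / max_chunk)   # number of scatter jobs
        bin_len = math.ceil(length / n_bins)     # near-equal bin length, <= max_chunk
        return [(s, min(s + bin_len - 1, length))
                for s in range(1, length + 1, bin_len)]

    # hypothetical example: a 23.7 Mb chromosome becomes five ~4.74 Mb intervals
    for start, end in chop_chromosome(23_700_000):
        print(f"chr1:{start}-{end}")

Each interval then becomes a separate GenotypeGVCFs job, and the per-interval VCFs are concatenated afterwards with bcftools concat.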
Condensed DAG for the workflow
Here is a DAG for the workflow on the test data in .test, condensed into an easier-to-look-at picture by the condense_dag() function in Eric's SnakemakeDagR package.
snakemake workflow
- fastqc on both reads
- don't bother with single end
- add adapters so illumina clip can work
- benchmark each rule
- use genomicsDBimport
- allow for merging of lots of small scaffolds into genomicsDB
Code Snippets
shell:
    " bam clipOverlap --in {input} --out {output} --stats 2> {log.clip} && "
    " samtools index {output} 2> {log.index}"

shell:
    " for i in {params.sams}; do echo $i; done > {output.sample_list} 2>{log} "

shell:
    " IGRP={wildcards.igrp}; "
    " if [ $IGRP = \"__ALL\" ]; then "  # just hard-link the files in this case
    " (ln {input.indel} {output.vcf}; "
    " ln {input.indel_idx} {output.idx}) 2>{log}; "
    " else "
    " gatk SelectVariants -R {input.ref} "
    " -V {input.indel} "
    " -O {output.vcf} "
    " --sample-name {input.sample_list} "
    " --exclude-non-variants 2>{log}; "
    " fi "

shell:
    "SAMP=$(bcftools query -l {input.indel} | head -n 1); "
    " bcftools view -Oz -s $SAMP {input.indel} > {output.vcf} 2> {log}; "
    " bcftools index -t {output.vcf} 2>> {log}; "

shell:
    " (bcftools concat {params.opts} -Oz {input} > {output.vcf} 2> {log}; "
    " bcftools index -t {output.vcf}) 2>> {log}; "

shell:
    "( wget -P {output.dir} https://storage.googleapis.com/gatk-software/package-archive/gatk/GenomeAnalysisTK-3.8-0-ge9d806836.tar.bz2 && "
    " bzip2 -d {output.dir}/GenomeAnalysisTK-3.8-0-ge9d806836.tar.bz2 && "
    " tar --directory {output.dir} -xvf {output.dir}/GenomeAnalysisTK-3.8-0-ge9d806836.tar && "
    " mv {output.dir}/GenomeAnalysisTK-3.8-0-ge9d806836/GenomeAnalysisTK.jar {output.dir} && "
    " if [ $(uname -s) = \"Darwin\" ]; then gatk3-register {output.jar}; fi"
    " ) > {log} 2>&1 "

shell:
    "gatk3 {params.jopts} -T RealignerTargetCreator -nt {threads} "
    " -R resources/genome.fasta "
    " -o {output} "
    " -known {input.vcf} 2> {log} "

shell:
    "gatk3 {params.jopts} \"-Dsamjdk.compression_level=9\" "
    " -T IndelRealigner "
    " -R {input.fasta} "
    " -I {input.bam} "
    " -targetIntervals {input.intervals} "
    " -known {input.vcf} "
    " --consensusDeterminationModel KNOWNS_ONLY "
    " -LOD 0.4 "  # this is recommended at https://github.com/broadinstitute/gatk-docs/blob/master/gatk3-methods-and-algorithms/Local_Realignment_around_Indels.md
    " -o {output.bam} "
    " > {log} 2>&1 "
wrapper:
    "0.59.2/bio/vep/annotate"

shell:
    " git log | head -n 150 > {params.config_dir}/latest-git-commits.txt; "
    " mkdir -p results/qc_summaries/bqsr-round-{{0..{params.BQR}}}; "
    " for i in {{0..{params.BQR}}}; do cp -r results/bqsr-round-$i/qc/{{multiqc.html,bcftools_stats/*.txt}} results/qc_summaries/bqsr-round-$i/; done; "
    " for i in {{0..{params.BQR}}}; do tar -cvf results/bqsr-round-$i/qc.tar results/bqsr-round-$i/qc; gzip results/bqsr-round-$i/qc.tar; done; "
    " for i in {{0..{params.BQR}}}; do tar -cvf results/bqsr-round-$i/benchmarks.tar results/bqsr-round-$i/benchmarks; gzip results/bqsr-round-$i/benchmarks.tar; done; "
    " for i in {{0..{params.BQR}}}; do tar -cvf results/bqsr-round-$i/logs.tar results/bqsr-round-$i/logs; gzip results/bqsr-round-$i/logs.tar; done; "
    " "
    " rclone copy --dry-run --bwlimit 8650k . {params.rclone_base} "
    " --include='{params.config_dir}/**' "
    " --include='results/qc_summaries/**' "
    " --include='results/bqsr-round-{{{params.comma_nums}}}/{{qc,benchmarks,logs}}.tar.gz' "
    " --include='results/bqsr-round-{{{params.comma_nums}}}/{{bcf,bq_recal_tables,bq_variants}}/**' "
    " --include='resources/**' "
    " {params.data_comm} "
    " --include='results/bqsr-round-{params.BQR}/gvcf/*' "
    " --include='results/{params.bamdir}/*' "
    " --include='results/bqsr-round-{params.BQR}/indel_realigned/**' "

shell:
    " git log | head -n 150 > config/latest-git-commits.txt; "
    " mkdir -p results/qc_summaries/bqsr-round-{{0..{params.BQR}}}; "
    " for i in {{0..{params.BQR}}}; do cp -r results/bqsr-round-$i/qc/{{multiqc.html,bcftools_stats/*.txt}} results/qc_summaries/bqsr-round-$i/; done; "
    " for i in {{0..{params.BQR}}}; do tar -cvf results/bqsr-round-$i/qc.tar results/bqsr-round-$i/qc; gzip results/bqsr-round-$i/qc.tar; done; "
    " for i in {{0..{params.BQR}}}; do tar -cvf results/bqsr-round-$i/benchmarks.tar results/bqsr-round-$i/benchmarks; gzip results/bqsr-round-$i/benchmarks.tar; done; "
    " for i in {{0..{params.BQR}}}; do tar -cvf results/bqsr-round-$i/logs.tar results/bqsr-round-$i/logs; gzip results/bqsr-round-$i/logs.tar; done; "
    " "
    " rclone copy --dry-run --drive-stop-on-upload-limit . {params.rclone_base} "
    " --include='config/**' "
    " --include='results/qc_summaries/**' "
    " --include='results/bqsr-round-{params.BQR}/{{qc,benchmarks,logs}}.tar.gz' "
    " --include='results/bqsr-round-{params.BQR}/bcf/**' "
    " --include='resources/**' "
    " --include='data/**' "
    " --include='results/bqsr-round-{params.BQR}/gvcf/*' "
    " --include='results/{params.bamdir}/*' "
    " --include='results/bqsr-round-{params.BQR}/overlap_clipped/*' "
shell:
    "workflow/scripts/qd_and_qual.sh {input.bcf} {output.qd} {output.qual} {log} "

shell:
    "SAMP=$(bcftools query -l {input.bcf} | head -n 1); "
    " bcftools view -i 'QUAL >= {params.qual} && QD >= {params.qd}' "
    " -Oz -s $SAMP {input.bcf} > {output.vcf} 2> {log}; "
    " bcftools index -t {output.vcf} 2>> {log}; "
    " bcftools stats {output.vcf} > {output.stats} 2>> {log}; "

shell:
    "gatk BaseRecalibrator "
    " -I {input.bam} "
    " -R {input.ref} "
    " --known-sites {input.vcf} "
    " --bqsr-baq-gap-open-penalty 30 "
    " -O {output} 2> {log} "

shell:
    "gatk --java-options \"-Dsamjdk.compression_level=9\" ApplyBQSR "
    " -R {input.ref} "
    " -I {input.bam} "
    " --bqsr-recal-file {input.rtable} "
    " -O {output.bam} 2> {log} "
shell:
    " awk -v sg={wildcards.scaff_group} 'NR>1 && $1 == sg {{print $2}}' {input.scaff_groups} > {output} 2> {log};"

shell:
    " echo {wildcards.chromo} > {output} 2> {log};"

shell:
    " awk -v sgc={wildcards.sg_or_chrom} -v scat={wildcards.scatter} ' "
    " NR>1 && $1 == sgc && $2==scat {{printf(\"%s:%s-%s\\n\", $3, $4, $5)}} "
    " ' {input.scatters_file} > {output} 2> {log};"

shell:
    "gatk --java-options \"{params.java_opts}\" HaplotypeCaller "
    " -R {input.ref} "
    " -I {input.bam} "
    " -O {output.gvcf} "
    " -L {input.interval_list} "
    " --native-pair-hmm-threads {threads} "
    " {params.conf_pars} "
    " -ERC GVCF > {log.stdout} 2> {log.stderr} "

shell:
    " bcftools concat {params.opts} -O z {input} > {output.gvcf} 2>{log}; "
    " bcftools index -t {output.gvcf} "
shell:
    " mkdir -p results/bqsr-round-{wildcards.bqsr_round}/genomics_db; "
    " gatk --java-options {params.java_opts} GenomicsDBImport "
    " $(echo {input.gvcfs} | awk '{{for(i=1;i<=NF;i++) printf(\" -V %s \", $i)}}') "
    " {params.my_opts} {params.db} 2>&1 | tee {log} > {log}.import_{params.import_num} && "
    " (echo $(({params.import_num} + 1)) > {output.counter}) && "
    " (for i in {params.histories}; do mkdir -p $(dirname $i); echo {params.import_num} $(date) > $i; done)"

shell:
    " mkdir -p results/bqsr-round-{wildcards.bqsr_round}/genomics_db; "
    " awk -v sg={wildcards.scaff_group} 'NR>1 && $1 == sg {{print $2}}' {input.scaff_groups} > {output.interval_list}; "
    " gatk --java-options {params.java_opts} GenomicsDBImport "
    " $(echo {input.gvcfs} | awk '{{for(i=1;i<=NF;i++) printf(\" -V %s \", $i)}}') "
    " {params.my_opts} {params.db} 2>&1 | tee {log} > {log}.import_{params.import_num} && "
    " (echo $(({params.import_num} + 1)) > {output.counter}) && "
    " (for i in {params.histories}; do mkdir -p $(dirname $i); echo {params.import_num} $(date) > $i; done) "

shell:
    " gatk --java-options {params.java_opts} GenotypeGVCFs "
    " {params.pextra} "
    " -L {input.scatters} "
    " -R {input.genome} "
    " -V gendb://{params.gendb} "
    " -O {output.vcf} > {log} 2> {log} "
shell:
    " (bcftools concat {params.opts} -Oz {input.vcf} > {output.vcf}; "
    " bcftools index -t {output.vcf}) 2>{log}; "

shell:
    "(bcftools +setGT {input.vcf} -- -t q -n . -i 'FMT/DP=0 | (FMT/PL[:0]=0 & FMT/PL[:1]=0 & FMT/PL[:2]=0)' | "
    " bcftools +fill-tags - -- -t 'NMISS=N_MISSING' | "
    " bcftools view -Oz - > {output.vcf}; "
    " bcftools index -t {output.vcf}) 2> {log} "

shell:
    " (bcftools concat {params.opts} -Ob {input} > {output.bcf}; "
    " bcftools index {output.bcf}) 2> {log}; "

shell:
    " (bcftools concat {params.opts} -Ob {input} > {output.bcf}; "
    " bcftools index {output.bcf}) 2>{log}; "

script:
    "../scripts/sequence-scatter-bins.R"
shell:
    " awk 'NF>0 {{clen = $3}} END {{print clen}}' {input} > {output} "

shell:
    " ("
    " printf \"sample\\tave_depth\\n\"; "
    " for i in {input.ss}; do "
    " FN=$(basename $i); "
    " FN=${{FN/.txt/}}; "
    " awk -F\"\t\" -v f=$FN -v NumBases=$(cat {input.gLen}) ' "
    " BEGIN {{OFS=\"\t\";}} "
    " $2==\"bases mapped (cigar):\" {{sub(/\\.stats/, \"\", f); print f, $3/NumBases}} "
    " ' $i; done) > {output} "

shell:
    " ( "
    " OPT=$(awk 'NR>1 && $1==\"{wildcards.sample}\" {{ wc = \"{wildcards.cov}\"; if(wc == \"FD\") {{print \"NOSAMPLE\"; exit}} fract = wc / $NF; if(fract < 1) print fract; else print \"NOSAMPLE\"; }}' {input.dps}); "
    " if [ $OPT = \"NOSAMPLE\" ]; then "
    " ln {input.bam} {output.bam}; "
    " ln {input.bai} {output.bai}; "
    " else "
    " samtools view --subsample $OPT --subsample-seed 1 -b {input.bam} > {output.bam}; "
    " samtools index {output.bam}; "
    " fi "
    " ) 2> {log} "
wrapper:
    "0.59.0/bio/gatk/selectvariants"

wrapper:
    "0.59.2/bio/gatk/variantfiltration"

wrapper:
    "0.59.2/bio/gatk/variantrecalibrator"

wrapper:
    "0.59.2/bio/picard/mergevcfs"
shell:
    " gatk SelectVariants -V {input.vcf} -select-type SNP -O {output.vcf} > {log} 2>&1 "

shell:
    " gatk SelectVariants -V {input.vcf} -select-type INDEL -O {output.vcf} > {log} 2>&1 "

shell:
    "gatk VariantFiltration "
    " -V {input.vcf} "
    " -filter 'QD < 2.0' --filter-name 'QD2' "
    " -filter 'QUAL < 30.0' --filter-name 'QUAL30' "
    " -filter 'SOR > 3.0' --filter-name 'SOR3' "
    " -filter 'FS > 60.0' --filter-name 'FS60' "
    " -filter 'MQ < 40.0' --filter-name 'MQ40' "
    " -filter 'MQRankSum < -12.5' --filter-name 'MQRankSum-12.5' "
    " -filter 'ReadPosRankSum < -8.0' --filter-name 'ReadPosRankSum-8' "
    " -O {output.vcf} > {log} 2>&1 "

shell:
    "gatk VariantFiltration "
    " -V {input.vcf} "
    " -filter 'QD < 2.0' --filter-name 'QD2' "
    " -filter 'QUAL < 30.0' --filter-name 'QUAL30' "
    " -filter 'FS > 200.0' --filter-name 'FS200' "
    " -filter 'ReadPosRankSum < -20.0' --filter-name 'ReadPosRankSum-20' "
    " -O {output.vcf} > {log} 2>&1 "

shell:
    "(bcftools concat -a {input.snp} {input.indel} | "
    " bcftools view -Ob > {output.vcf}; ) 2> {log} "

shell:
    " bcftools view -Ob -i 'FILTER=\"PASS\" & MAF > {params.maf} ' "
    " {input} > {output} 2>{log} "
wrapper:
    "v1.1.0/bio/trimmomatic/pe"

wrapper:
    "v1.23.3/bio/bwa/mem"

wrapper:
    "v1.1.0/bio/picard/markduplicates"

wrapper:
    "v1.1.0/bio/fastqc"

wrapper:
    "v1.1.0/bio/fastqc"

shell:
    "samtools stats {input} > {output} 2> {log} "

wrapper:
    "v1.23.3/bio/multiqc"

shell:
    "bcftools view {params.filter_opt} -Ou {input} | "
    " bcftools stats -s {params.comma_samples} "
    " -u NMISS:0:{params.stop}:{params.steps} > "
    " {output} 2> {log} "

shell:
    " bcftools stats -s {params.comma_samples} "
    " -u NMISS:0:{params.stop}:{params.steps} {input} > "
    " {output} 2> {log} "

shell:
    " plot-vcfstats -m {input} > {output} 2>{log} "

shell:
    " plot-vcfstats -m {input} > {output} 2>{log} "
shell:
    " (tmp_dir=$(mktemp -d) && "
    " URL={params.url} && "
    " if [[ $URL =~ \.gz$ ]]; then EXT='.gz'; else EXT=''; fi && "
    " wget -O $tmp_dir/file$EXT $URL && "
    " if [[ $URL =~ \.gz$ ]]; then gunzip $tmp_dir/file$EXT; fi && "
    " mv $tmp_dir/file {output}) > {log} 2>&1 "

shell:
    "samtools faidx {input}"

shell:
    "samtools dict {input} > {output} 2> {log} "

wrapper:
    "0.59.2/bio/bwa/index"

shell:
    "(bcftools view --apply-filters PASS --output-type u {input} | "
    "rbt vcf-to-txt -g --fmt DP AD --info ANN | "
    "gzip > {output}) 2> {log}"

script:
    "../scripts/plot-depths.py"
import sys

sys.stderr = open(snakemake.log[0], "w")

import common
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns

calls = pd.read_table(snakemake.input[0], header=[0, 1])
samples = [name for name in calls.columns.levels[0] if name != "VARIANT"]
sample_info = calls.loc[:, samples].stack([0, 1]).unstack().reset_index(1, drop=False)
sample_info = sample_info.rename({"level_1": "sample"}, axis=1)

sample_info = sample_info[sample_info["DP"] > 0]
sample_info["freq"] = sample_info["AD"] / sample_info["DP"]
sample_info.index = np.arange(sample_info.shape[0])

plt.figure()
sns.stripplot(x="sample", y="freq", data=sample_info, jitter=True)
plt.ylabel("allele frequency")
plt.xticks(rotation="vertical")
plt.savefig(snakemake.output.freqs)

plt.figure()
sns.stripplot(x="sample", y="DP", data=sample_info, jitter=True)
plt.ylabel("read depth")
plt.xticks(rotation="vertical")
plt.savefig(snakemake.output.depths)
INP=$1
QD=$2
QUAL=$3
LOG=$4

(bcftools query -f '%QUAL\t%INFO/QD\n' $INP | awk -v qd_file=${QD}-unsrt -v qual_file=${QUAL}-unsrt '
  BEGIN {OFS="\t"}
  {
    qual[int($1)]++;
    qd[int($2)]++;
  }
  END {
    for(i in qual) print i, qual[i] > qual_file;
    for(i in qd) print i, qd[i] > qd_file;
  }
') 2> $LOG;

(sort -grk1 ${QUAL}-unsrt > ${QUAL} && rm ${QUAL}-unsrt) 2>>$LOG;
(sort -grk1 ${QD}-unsrt > ${QD} && rm ${QD}-unsrt) 2>>$LOG
log <- file(snakemake@log[[1]], open="wt")
sink(log, type = "output")
sink(log, type = "message")

library(tidyverse)

# these are while developing
#chroms_file = ".test/config/chromosomes.tsv"
#scaffs_file = ".test/config/scaffold_groups.tsv"
#max_chunk <- 900000  # shooting for just < than 1 Mb chunks

chroms_file <- snakemake@input$chroms
scaffs_file <- snakemake@input$scaffs
max_chunk <- as.integer(snakemake@params$binsize)
outfile <- snakemake@output$interval_list

# read the files in and prepare them so they have
# columns that are comparable
chroms <- read_tsv(chroms_file) %>%
  mutate(id = chrom, .before = chrom) %>%
  group_by(id) %>%
  mutate(cumul = cumsum(num_bases)) %>%
  ungroup()

scaffs <- read_tsv(scaffs_file) %>%
  rename(num_bases = len)

# Now, we handle the chromosomes and the scaffold groups a little
# differently. (It's a shame really, they should just all be scaff_groups,
# with the chromosomes just being scaff groups of size 1. Sigh...)

# for the chromos, we just find a good length to chop them up into.
# Here is a function that tries to get them all close to an equal length
# that is less than max_chunk:
#' @param L length of the chromosome
#' @param m max length of chunk
#' @param S the starting value
#' @return returns a tibble with start and stop columns of the different
#' chunks. (base-1 indexed, inclusive).
chromo_chopper = function(L, m, S = 1) {
  # The number of bins will be:
  B = ceiling(L / m)

  # if m divides L perfectly, mstar is m and last_one = mstar
  if((L %% m) == 0) {
    mstar = m
    last_one = mstar
  } else { # otherwise
    mstar = ceiling(L / B)
    last_one = L - mstar * (B - 1)
  }

  starts <- seq(S, S + B * (mstar - 1), by = mstar)
  ends <- starts + mstar - 1
  ends[length(ends)] <- starts[length(ends)] + last_one - 1

  tibble(
    start = starts,
    end = ends
  ) %>%
    mutate(scatter_length = end - start + 1) %>%
    mutate(scatter_idx = sprintf("scat_%04d", 1:n()), .before = start)
}

chroms2 <- chroms %>%
  group_by(id) %>%
  mutate(scats = map(num_bases, chromo_chopper, m = max_chunk)) %>%
  unnest(scats) %>%
  select(id, scatter_idx, chrom, start, end, scatter_length)

# now, for the scaffold groups we will do it a little differently.
# Any scaffold that is greater than the max_chunk will get broken down into
# roughly equal-sized chunks that are all less than the max_chunk.
scaff_chopper <- function(L, m) {
  if(L <= m) return(
    tibble(
      start = 1,
      end = L,
      seg_length = L
    )
  )
  else {
    return(
      chromo_chopper(L, m) %>%
        rename(seg_length = scatter_length) %>%
        select(-scatter_idx)
    )
  }
}

scaffs2 <- scaffs %>%
  ungroup() %>%
  mutate(chops = map(num_bases, scaff_chopper, m = max_chunk)) %>%
  unnest(chops) %>%
  group_by(id) %>%
  mutate(cumul = cumsum(seg_length))

# now we will group by id and within each one, just iterate through
# the number of bases and assign to different scat groups.
#' @param s the number of bases in each segment (i.e. the seg_length)
#' @param m the desired max chunk size
iteratively_assign_scats <- function(n, m) {
  c <- 0  # cumulative in scat group
  scg <- 1
  ret <- character(length(n))
  for(i in 1:length(n)) {
    c <- c + n[i]
    if(c > m) {
      scg <- scg + 1
      c <- n[i]
    }
    ret[i] <- sprintf("scat_%04d", scg)
  }
  ret
}

scaffs3 <- scaffs2 %>%
  group_by(id) %>%
  mutate(scatter_idx = iteratively_assign_scats(n = seg_length, m = max_chunk)) %>%
  group_by(id, scatter_idx) %>%
  mutate(scatter_length = sum(seg_length)) %>%
  ungroup() %>%
  select(id, scatter_idx, chrom, start, end, scatter_length)

chrom_and_sg <- bind_rows(chroms2, scaffs3)

write_tsv(chrom_and_sg, file = snakemake@output$tsv)
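For the scaffold groups, the key step above is iteratively_assign_scats(), which packs consecutive scaffold segments into scatter groups until the running total would exceed max_chunk. Here is a rough, standalone Python rendering of that same greedy packing, using hypothetical segment lengths that are not taken from the workflow:

    def assign_scatter_groups(seg_lengths, max_chunk=5_000_000):
        """Greedily pack consecutive segment lengths into scatter groups,
        starting a new group whenever the running total exceeds max_chunk."""
        groups, total, idx = [], 0, 1
        for n in seg_lengths:
            total += n
            if total > max_chunk:
                idx += 1       # start a new scatter group
                total = n      # this segment begins the new group
            groups.append(f"scat_{idx:04d}")
        return groups

    # hypothetical scaffold segment lengths (bases)
    print(assign_scatter_groups([3_000_000, 1_500_000, 2_000_000, 4_500_000, 200_000]))
    # -> ['scat_0001', 'scat_0001', 'scat_0002', 'scat_0003', 'scat_0003']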
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2018, Johannes Köster"
__email__ = "johannes.koester@protonmail.com"
__license__ = "MIT"

from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
java_opts = snakemake.params.get("java_opts", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell(
    "gatk --java-options '{java_opts}' SelectVariants -R {snakemake.input.ref} -V {snakemake.input.vcf} "
    "{extra} -O {snakemake.output.vcf} {log}"
)
__author__ = "Patrik Smeds"
__copyright__ = "Copyright 2016, Patrik Smeds"
__email__ = "patrik.smeds@gmail.com"
__license__ = "MIT"

from os import path

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=False, stderr=True)

# Check inputs/arguments.
if len(snakemake.input) == 0:
    raise ValueError("A reference genome has to be provided!")
elif len(snakemake.input) > 1:
    raise ValueError("Only one reference genome can be inputed!")

# Prefix that should be used for the database
prefix = snakemake.params.get("prefix", "")

if len(prefix) > 0:
    prefix = "-p " + prefix

# Contrunction algorithm that will be used to build the database, default is bwtsw
construction_algorithm = snakemake.params.get("algorithm", "")

if len(construction_algorithm) != 0:
    construction_algorithm = "-a " + construction_algorithm

shell(
    "bwa index"
    " {prefix}"
    " {construction_algorithm}"
    " {snakemake.input[0]}"
    " {log}"
)
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2018, Johannes Köster"
__email__ = "johannes.koester@protonmail.com"
__license__ = "MIT"

from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
java_opts = snakemake.params.get("java_opts", "")

filters = [
    "--filter-name {} --filter-expression '{}'".format(name, expr.replace("'", "\\'"))
    for name, expr in snakemake.params.filters.items()
]

log = snakemake.log_fmt_shell(stdout=True, stderr=True)
shell(
    "gatk --java-options '{java_opts}' VariantFiltration -R {snakemake.input.ref} -V {snakemake.input.vcf} "
    "{extra} {filters} -O {snakemake.output.vcf} {log}"
)
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2018, Johannes Köster"
__email__ = "johannes.koester@protonmail.com"
__license__ = "MIT"

import os

from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
java_opts = snakemake.params.get("java_opts", "")


def fmt_res(resname, resparams):
    fmt_bool = lambda b: str(b).lower()
    try:
        f = snakemake.input.get(resname)
    except KeyError:
        raise RuntimeError(
            "There must be a named input file for every resource (missing: {})".format(
                resname
            )
        )
    return "{},known={},training={},truth={},prior={}:{}".format(
        resname,
        fmt_bool(resparams["known"]),
        fmt_bool(resparams["training"]),
        fmt_bool(resparams["truth"]),
        resparams["prior"],
        f,
    )


resources = [
    "--resource {}".format(fmt_res(resname, resparams))
    for resname, resparams in snakemake.params["resources"].items()
]
annotation = list(map("-an {}".format, snakemake.params.annotation))
tranches = ""
if snakemake.output.tranches:
    tranches = "--tranches-file " + snakemake.output.tranches

log = snakemake.log_fmt_shell(stdout=True, stderr=True)
shell(
    "gatk --java-options '{java_opts}' VariantRecalibrator {extra} {resources} "
    "-R {snakemake.input.ref} -V {snakemake.input.vcf} "
    "-mode {snakemake.params.mode} "
    "--output {snakemake.output.vcf} "
    "{tranches} {annotation} {log}"
)
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2018, Johannes Köster"
__email__ = "johannes.koester@protonmail.com"
__license__ = "MIT"

from snakemake.shell import shell

inputs = " ".join("INPUT={}".format(f) for f in snakemake.input.vcfs)
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
extra = snakemake.params.get("extra", "")

shell(
    "picard"
    " MergeVcfs"
    " {extra}"
    " {inputs}"
    " OUTPUT={snakemake.output[0]}"
    " {log}"
)
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2020, Johannes Köster"
__email__ = "johannes.koester@uni-due.de"
__license__ = "MIT"

from pathlib import Path
from snakemake.shell import shell


def get_only_child_dir(path):
    children = [child for child in path.iterdir() if child.is_dir()]
    assert (
        len(children) == 1
    ), "Invalid VEP cache directory, only a single entry is allowed, make sure that cache was created with the snakemake VEP cache wrapper"
    return children[0]


extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)

fork = "--fork {}".format(snakemake.threads) if snakemake.threads > 1 else ""
stats = snakemake.output.stats
cache = snakemake.input.cache
plugins = snakemake.input.plugins

entrypath = get_only_child_dir(get_only_child_dir(Path(cache)))
species = entrypath.parent.name
release, build = entrypath.name.split("_")

load_plugins = " ".join(map("--plugin {}".format, snakemake.params.plugins))

if snakemake.output.calls.endswith(".vcf.gz"):
    fmt = "z"
elif snakemake.output.calls.endswith(".bcf"):
    fmt = "b"
else:
    fmt = "v"

shell(
    "(bcftools view {snakemake.input.calls} | "
    "vep {extra} {fork} "
    "--format vcf "
    "--vcf "
    "--cache "
    "--cache_version {release} "
    "--species {species} "
    "--assembly {build} "
    "--dir_cache {cache} "
    "--dir_plugins {plugins} "
    "--offline "
    "{load_plugins} "
    "--output_file STDOUT "
    "--stats_file {stats} | "
    "bcftools view -O{fmt} > {snakemake.output.calls}) {log}"
)
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "julianderuiter@gmail.com"
__license__ = "MIT"

from os import path
import re
from tempfile import TemporaryDirectory

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=True, stderr=True)


def basename_without_ext(file_path):
    """Returns basename of file path, without the file extension."""
    base = path.basename(file_path)
    # Remove file extension(s) (similar to the internal fastqc approach)
    base = re.sub("\\.gz$", "", base)
    base = re.sub("\\.bz2$", "", base)
    base = re.sub("\\.txt$", "", base)
    base = re.sub("\\.fastq$", "", base)
    base = re.sub("\\.fq$", "", base)
    base = re.sub("\\.sam$", "", base)
    base = re.sub("\\.bam$", "", base)
    return base


# Run fastqc, since there can be race conditions if multiple jobs
# use the same fastqc dir, we create a temp dir.
with TemporaryDirectory() as tempdir:
    shell(
        "fastqc {snakemake.params} -t {snakemake.threads} "
        "--outdir {tempdir:q} {snakemake.input[0]:q}"
        " {log}"
    )

    # Move outputs into proper position.
    output_base = basename_without_ext(snakemake.input[0])
    html_path = path.join(tempdir, output_base + "_fastqc.html")
    zip_path = path.join(tempdir, output_base + "_fastqc.zip")

    if snakemake.output.html != html_path:
        shell("mv {html_path:q} {snakemake.output.html:q}")

    if snakemake.output.zip != zip_path:
        shell("mv {zip_path:q} {snakemake.output.zip:q}")
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"

import tempfile
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts

log = snakemake.log_fmt_shell()
extra = snakemake.params.get("extra", "")
java_opts = get_java_opts(snakemake)

bams = snakemake.input
if isinstance(bams, str):
    bams = [bams]
bams = list(map("--INPUT {}".format, bams))

with tempfile.TemporaryDirectory() as tmpdir:
    shell(
        "picard MarkDuplicates"  # Tool and its subcommand
        " {java_opts}"  # Automatic java option
        " {extra}"  # User defined parmeters
        " {bams}"  # Input bam(s)
        " --TMP_DIR {tmpdir}"
        " --OUTPUT {snakemake.output.bam}"  # Output bam
        " --METRICS_FILE {snakemake.output.metrics}"  # Output metrics
        " {log}"  # Logging
    )
__author__ = "Johannes Köster, Jorge Langa"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"

from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts


# Distribute available threads between trimmomatic itself and any potential pigz instances
def distribute_threads(input_files, output_files, available_threads):
    gzipped_input_files = sum(1 for file in input_files if file.endswith(".gz"))
    gzipped_output_files = sum(1 for file in output_files if file.endswith(".gz"))
    potential_threads_per_process = available_threads // (
        1 + gzipped_input_files + gzipped_output_files
    )
    if potential_threads_per_process > 0:
        # decompressing pigz creates at most 4 threads
        pigz_input_threads = (
            min(4, potential_threads_per_process) if gzipped_input_files != 0 else 0
        )
        pigz_output_threads = (
            (available_threads - pigz_input_threads * gzipped_input_files)
            // (1 + gzipped_output_files)
            if gzipped_output_files != 0
            else 0
        )
        trimmomatic_threads = (
            available_threads
            - pigz_input_threads * gzipped_input_files
            - pigz_output_threads * gzipped_output_files
        )
    else:
        # not enough threads for pigz
        pigz_input_threads = 0
        pigz_output_threads = 0
        trimmomatic_threads = available_threads
    return trimmomatic_threads, pigz_input_threads, pigz_output_threads


def compose_input_gz(filename, threads):
    if filename.endswith(".gz") and threads > 0:
        return "<(pigz -p {threads} --decompress --stdout {filename})".format(
            threads=threads, filename=filename
        )
    return filename


def compose_output_gz(filename, threads, compression_level):
    if filename.endswith(".gz") and threads > 0:
        return ">(pigz -p {threads} {compression_level} > {filename})".format(
            threads=threads, compression_level=compression_level, filename=filename
        )
    return filename


extra = snakemake.params.get("extra", "")
java_opts = get_java_opts(snakemake)
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
compression_level = snakemake.params.get("compression_level", "-5")
trimmer = " ".join(snakemake.params.trimmer)

# Distribute threads
input_files = [snakemake.input.r1, snakemake.input.r2]
output_files = [
    snakemake.output.r1,
    snakemake.output.r1_unpaired,
    snakemake.output.r2,
    snakemake.output.r2_unpaired,
]

trimmomatic_threads, input_threads, output_threads = distribute_threads(
    input_files, output_files, snakemake.threads
)

input_r1, input_r2 = [
    compose_input_gz(filename, input_threads) for filename in input_files
]

output_r1, output_r1_unp, output_r2, output_r2_unp = [
    compose_output_gz(filename, output_threads, compression_level)
    for filename in output_files
]

shell(
    "trimmomatic PE -threads {trimmomatic_threads} {java_opts} {extra} "
    "{input_r1} {input_r2} "
    "{output_r1} {output_r1_unp} "
    "{output_r2} {output_r2_unp} "
    "{trimmer} "
    "{log}"
)
__author__ = "Johannes Köster, Julian de Ruiter"
__copyright__ = "Copyright 2016, Johannes Köster and Julian de Ruiter"
__email__ = "koester@jimmy.harvard.edu, julianderuiter@gmail.com"
__license__ = "MIT"

import tempfile
from os import path

from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts
from snakemake_wrapper_utils.samtools import get_samtools_opts

# Extract arguments.
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
sort = snakemake.params.get("sorting", "none")
sort_order = snakemake.params.get("sort_order", "coordinate")
sort_extra = snakemake.params.get("sort_extra", "")
samtools_opts = get_samtools_opts(snakemake)
java_opts = get_java_opts(snakemake)

index = snakemake.input.idx
if isinstance(index, str):
    index = path.splitext(snakemake.input.idx)[0]
else:
    index = path.splitext(snakemake.input.idx[0])[0]

# Check inputs/arguments.
if not isinstance(snakemake.input.reads, str) and len(snakemake.input.reads) not in {
    1,
    2,
}:
    raise ValueError("input must have 1 (single-end) or 2 (paired-end) elements")

if sort_order not in {"coordinate", "queryname"}:
    raise ValueError("Unexpected value for sort_order ({})".format(sort_order))

# Determine which pipe command to use for converting to bam or sorting.
if sort == "none":
    # Simply convert to bam using samtools view.
    pipe_cmd = "samtools view {samtools_opts}"
elif sort == "samtools":
    # Add name flag if needed.
    if sort_order == "queryname":
        sort_extra += " -n"
    # Sort alignments using samtools sort.
    pipe_cmd = "samtools sort {samtools_opts} {sort_extra} -T {tmpdir}"
elif sort == "picard":
    # Sort alignments using picard SortSam.
    pipe_cmd = "picard SortSam {java_opts} {sort_extra} --INPUT /dev/stdin --TMP_DIR {tmpdir} --SORT_ORDER {sort_order} --OUTPUT {snakemake.output[0]}"
else:
    raise ValueError(f"Unexpected value for params.sort ({sort})")

with tempfile.TemporaryDirectory() as tmpdir:
    shell(
        "(bwa mem"
        " -t {snakemake.threads}"
        " {extra}"
        " {index}"
        " {snakemake.input.reads}"
        " | " + pipe_cmd + ") {log}"
    )
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "julianderuiter@gmail.com"
__license__ = "MIT"

from os import path

from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
# Set this to False if multiqc should use the actual input directly
# instead of parsing the folders where the provided files are located
use_input_files_only = snakemake.params.get("use_input_files_only", False)

if not use_input_files_only:
    input_data = set(path.dirname(fp) for fp in snakemake.input)
else:
    input_data = set(snakemake.input)

output_dir = path.dirname(snakemake.output[0])
output_name = path.basename(snakemake.output[0])
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell(
    "multiqc"
    " {extra}"
    " --force"
    " -o {output_dir}"
    " -n {output_name}"
    " {input_data}"
    " {log}"
)