Snakemake workflow for somatic mutation detection without matched normal samples
Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation, topic
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
TOSCA ( T umor O nly S omatic CA lling) is a Snakemake workflow , aimed at performing a somatic variant calling (without matched normal samples) analysis in a reproducible, automated, and open source workflow.
TOSCA consists of a
Snakefile
, a set of
conda
environment files (
envs/*.yaml
) a configuration file (
config/config.yaml
) and a set of
R
scripts. It is able to perform an end-to-end analysis, from raw read files, via quality checks, alignment and variant calling, to functional annotation, databases filtering, tumor purity and ploidy estimation and finally variant classification in whole exome / target sequencing data.
By default, the pipeline performs only mandatory steps shown in the
diagram
below. However, you can turn on optional rules in the
config/config.yaml
file.
Advanced use
: If you wish to custom your analysis by adding or removing some steps (e.g. VarScan over Mutect2 or Bowtie2 over BWA), you can use the software of your preference provided you have your own script(s), and change some lines within the
Snakefile
. If you think your "custom rule" might be of interest to a broader audience, let us know by opening an issue.
Using the TOSCA workflow
We assume that you already have conda and Snakemake installed, otherwise you can easily install them with the following commands:
To install conda: https://docs.conda.io/projects/conda/en/latest/user-guide/install/linux.html
To install Snakemake via conda: conda install -c conda-forge -c bioconda snakemake snakemake-wrapper-utils mamba
To use TOSCA:
git clone https://github.com/mdelcorvo/TOSCA.git
#edit config and meta file
cd TOSCA && snakemake --use-conda --configfile config/config.yaml
More details about TOSCA workflow can be found in the wiki .
Workflow graph
Simplified directed acyclic graph (DAG) of the TOSCA workflow.
Oval shapes are referred to data (input/output and intermediate files) while rectangles to processes. Input and output files are depicted as green while files independently downloaded by TOSCA are depicted as orange ovals. Red blocks are mandatory rules, while dashed lines and blue blocks are optional rules, controlled in the
config.yaml
. By default only mandatory rules are executed.
Contributors
Current contributors include:
Paper
- Main paper describing the tool:
Marcello Del Corvo, Saveria Mazzara, Stefano A Pileri, TOSCA: an automated Tumor Only Somatic CAlling workflow for somatic mutation detection without matched normal samples, Bioinformatics Advances, Volume 2, Issue 1, 2022, vbac070, https://doi.org/10.1093/bioadv/vbac070
Code Snippets
20 21 22 23 | shell: "java -jar {params.gatk3} -T CallableLoci -R {input.ref} -I {input.bam} --minDepth {params.minD} -summary {output.out_sum} -o {output.out_bed} -U ALLOW_SEQ_DICT_INCOMPATIBILITY -L {input.bed};" "grep 'CALLABLE' {output.out_bed} > {output.out_filt};" "Rscript --vanilla {input.script} {output.out_filt} {output.out_final}" |
48 49 50 51 52 53 54 | shell: "gatk Mutect2 -R {input.ref} -I {input.bam} --germline-resource {input.exac} --panel-of-normals {input.pon} -L {input.bed} --f1r2-tar-gz {output.f1r2} -O {output.vcf_raw};" "gatk LearnReadOrientationModel -I {output.f1r2} -O {output.rom};" "gatk GetPileupSummaries -I {input.bam} -V {input.exac} -L {input.bed} -O {output.getpileupsum};" "gatk CalculateContamination -I {output.getpileupsum} -O {output.contamination};" "gatk FilterMutectCalls -R {input.ref} --contamination-table {output.contamination} --ob-priors {output.rom} -V {output.vcf_raw} -O {output.vcf_filt};" "gatk VariantsToTable -V {output.vcf_raw} -F CHROM -F POS -F TYPE -GF AF --show-filtered true -O {output.af_table};" |
12 13 14 15 16 17 | shell: "curl -k -L '{params.Cosmic}' > resources/database/{params.build}/COSMIC.vcf.gz; " "curl -k -L '{params.Gen1K}' > resources/database/{params.build}/1000G-phase_3.vcf.gz; " "curl -k -L '{params.ESP}' > resources/database/{params.build}/temp_ESP6500SI.vcf.tar.gz; " "curl -k -L '{params.Clinvar}' > resources/database/{params.build}/ClinVar.vcf.gz; " "curl -k -L '{params.Clinvar}.tbi' > resources/database/{params.build}/ClinVar.vcf.gz.tbi; " |
26 27 28 29 30 | run: if config["ref"]["build"]=='GRCh38': shell("curl -L '{params.mappability}_{params.kmer}.bw' > resources/database/{params.build}/raw_mappability_{params.kmer}.bw; ") else: shell("curl -L '{params.mappability}{params.kmer}mer.bigWig' > resources/database/{params.build}/raw_mappability_{params.kmer}.bw; ") |
42 43 44 45 46 47 | shell: "tar -xzvf resources/database/{params.build}/temp_ESP6500SI.vcf.tar.gz -C resources/database/{params.build}; " "Rscript --vanilla {input.script} {params.build}; " "vcf-concat resources/database/{params.build}/*.vcf | sort -k1,1V -k2,2n | bgzip -c > resources/database/{params.build}/ESP6500SI.vcf.gz; " "rm resources/database/{params.build}/temp_ESP6500SI.vcf.tar.gz; " "rm resources/database/{params.build}/*.vcf; " |
56 57 | wrapper: "v1.23.3/bio/tabix/index" |
66 67 | wrapper: "v1.23.3/bio/tabix/index" |
76 77 | wrapper: "v1.23.3/bio/tabix/index" |
91 92 93 94 95 | shell: "bigWigToBedGraph {input.raw_map} {output.raw_bed}; " "Rscript {input.script} {output.raw_bed} {output.bed}; " "faidx {input.ref} -i chromsizes > {output.chromsizes}; " "bedGraphToBigWig {output.bed} {output.chromsizes} {output.map}; " |
16 17 | shell: "snpEff -Xmx{params.mem_gb}g -dataDir $(pwd)/resources/database/snpEff -v -csvStats {output.csv} -s {output.html} {params.ref} {input.vcf} > {output.call}; " |
30 31 | wrapper: "v1.23.3/bio/snpsift/annotate" |
44 45 | wrapper: "v1.23.3/bio/snpsift/annotate" |
57 58 | wrapper: "v1.23.3/bio/snpsift/annotate" |
71 72 | wrapper: "v1.23.3/bio/snpsift/annotate" |
86 87 | shell: "Rscript --vanilla {input.script} {input.call} {input.database} {output.call} " |
99 100 | wrapper: "v1.23.3/bio/snpsift/annotate" |
110 111 | shell: "java -jar {params} -B {input.call} {input.bam} > {output.cov}" |
137 138 | shell: "Rscript --vanilla {input.script} {input.vcf} {input.snpeff} {input.gen1k} {input.esp} {input.exac} {input.dbsnp} {input.cosmic} {input.clinvar} {input.cov} {input.af_table} {input.gtf} {params.ref} {params.depth} {params.vaf} {output.call} " |
151 152 | shell: "ls {input.call} > {output.list};ls {input.ps} > {output.list_ps};Rscript {input.script} {output.list} {output.res} {output.list_ps} " if "meta_N" in config else "ls {input.call} > {output.list};Rscript {input.script} {output.list} {output.res} {output.list_ps} " |
12 13 | wrapper: "v1.23.3/bio/reference/ensembl-sequence" |
23 24 | wrapper: "v1.23.3/bio/reference/ensembl-annotation" |
36 37 | shell: "Rscript --vanilla {input.script} {input.gtf} {output.rdata} " |
46 47 | shell: "samtools dict {input} > {output} " |
58 59 | wrapper: "v1.23.3/bio/bwa/index" |
67 68 | wrapper: "v1.23.3/bio/samtools/faidx" |
82 83 | wrapper: "v1.23.3/bio/reference/ensembl-variation" |
92 93 | wrapper: "v1.23.3/bio/tabix/index" |
102 103 | shell: "snpEff download -v {params.db} -dataDir $(pwd)/resources/database/snpEff; " |
17 18 | wrapper: "v1.23.3/bio/trimmomatic/pe" |
37 38 | wrapper: "v1.23.3/bio/bwa/mem" |
54 55 | wrapper: "v1.23.3/bio/picard/markduplicates" |
13 14 | wrapper: "v1.23.3/bio/trimmomatic/pe" |
31 32 | wrapper: "v1.23.3/bio/bwa/mem" |
46 47 | wrapper: "v1.23.3/bio/picard/markduplicates" |
61 62 | shell: "gatk Mutect2 -R {input.ref} -I {input.bam} --germline-resource {input.exac} --panel-of-normals {input.pon} -O {output.vcf_raw}; " |
71 72 73 | shell: "bgzip -c {input.vcf} > {output.vcf};" "tabix -p vcf {output.vcf}" |
83 84 85 | shell: "ls {input.vcf} > {output.list};" "bcftools merge --file-list {output.list} | bcftools view --min-ac=5 > {output.norm};" |
94 95 96 | shell: "bgzip -c {input.vcf} > {output.vcf};" "tabix -p vcf {output.vcf}" |
15 16 | shell: "Rscript {input.script} {input.baits} {input.ref} {input.mappability} {input.gtf} {params.min_target} {params.min_off_target} {output.intervals} " |
28 29 | shell: "Rscript {input.script} {input.intervals} {input.bam} {output.coverage_raw} {output.coverage_loess} " |
41 42 | shell: "Rscript {input.script} {input.intervals} {input.bam} {output.coverage_raw} {output.coverage_loess} " |
57 58 59 | shell: "ls {input.cov_norm} > {output.list};" "Rscript {input.script} {input.vcf} {output.list} {output.rds} {output.rds_bias} {params.ref};" |
81 82 | shell: "Rscript {input.script} {input.rds} {input.cov_tum} {input.vcf_tum} {input.intervals} {input.rds_bias} {params.ref} {params.sample} {output.result_rds} {output.ps} {output.plot_output} " |
12 13 | wrapper: "v1.23.3/bio/fastqc" |
24 25 | wrapper: "v1.23.3/bio/samtools/stats" |
41 42 | wrapper: "v1.23.3/bio/mosdepth" |
58 59 | wrapper: "v1.23.3/bio/multiqc" |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 | __author__ = "Patrik Smeds" __copyright__ = "Copyright 2016, Patrik Smeds" __email__ = "patrik.smeds@gmail.com" __license__ = "MIT" from os.path import splitext from snakemake.shell import shell log = snakemake.log_fmt_shell(stdout=False, stderr=True) # Check inputs/arguments. if len(snakemake.input) == 0: raise ValueError("A reference genome has to be provided!") elif len(snakemake.input) > 1: raise ValueError("Only one reference genome can be inputed!") # Prefix that should be used for the database prefix = snakemake.params.get("prefix", splitext(snakemake.output.idx[0])[0]) if len(prefix) > 0: prefix = "-p " + prefix # Contrunction algorithm that will be used to build the database, default is bwtsw construction_algorithm = snakemake.params.get("algorithm", "") if len(construction_algorithm) != 0: construction_algorithm = "-a " + construction_algorithm shell( "bwa index" " {prefix}" " {construction_algorithm}" " {snakemake.input[0]}" " {log}" ) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 | __author__ = "Johannes Köster, Julian de Ruiter" __copyright__ = "Copyright 2016, Johannes Köster and Julian de Ruiter" __email__ = "koester@jimmy.harvard.edu, julianderuiter@gmail.com" __license__ = "MIT" import tempfile from os import path from snakemake.shell import shell from snakemake_wrapper_utils.java import get_java_opts from snakemake_wrapper_utils.samtools import get_samtools_opts # Extract arguments. extra = snakemake.params.get("extra", "") log = snakemake.log_fmt_shell(stdout=False, stderr=True) sort = snakemake.params.get("sorting", "none") sort_order = snakemake.params.get("sort_order", "coordinate") sort_extra = snakemake.params.get("sort_extra", "") samtools_opts = get_samtools_opts(snakemake) java_opts = get_java_opts(snakemake) index = snakemake.input.idx if isinstance(index, str): index = path.splitext(snakemake.input.idx)[0] else: index = path.splitext(snakemake.input.idx[0])[0] # Check inputs/arguments. if not isinstance(snakemake.input.reads, str) and len(snakemake.input.reads) not in { 1, 2, }: raise ValueError("input must have 1 (single-end) or 2 (paired-end) elements") if sort_order not in {"coordinate", "queryname"}: raise ValueError("Unexpected value for sort_order ({})".format(sort_order)) # Determine which pipe command to use for converting to bam or sorting. if sort == "none": # Simply convert to bam using samtools view. pipe_cmd = "samtools view {samtools_opts}" elif sort == "samtools": # Add name flag if needed. if sort_order == "queryname": sort_extra += " -n" # Sort alignments using samtools sort. pipe_cmd = "samtools sort {samtools_opts} {sort_extra} -T {tmpdir}" elif sort == "picard": # Sort alignments using picard SortSam. pipe_cmd = "picard SortSam {java_opts} {sort_extra} --INPUT /dev/stdin --TMP_DIR {tmpdir} --SORT_ORDER {sort_order} --OUTPUT {snakemake.output[0]}" else: raise ValueError(f"Unexpected value for params.sort ({sort})") with tempfile.TemporaryDirectory() as tmpdir: shell( "(bwa mem" " -t {snakemake.threads}" " {extra}" " {index}" " {snakemake.input.reads}" " | " + pipe_cmd + ") {log}" ) |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 | __author__ = "Julian de Ruiter" __copyright__ = "Copyright 2017, Julian de Ruiter" __email__ = "julianderuiter@gmail.com" __license__ = "MIT" from os import path import re from tempfile import TemporaryDirectory from snakemake.shell import shell log = snakemake.log_fmt_shell(stdout=True, stderr=True) def basename_without_ext(file_path): """Returns basename of file path, without the file extension.""" base = path.basename(file_path) # Remove file extension(s) (similar to the internal fastqc approach) base = re.sub("\\.gz$", "", base) base = re.sub("\\.bz2$", "", base) base = re.sub("\\.txt$", "", base) base = re.sub("\\.fastq$", "", base) base = re.sub("\\.fq$", "", base) base = re.sub("\\.sam$", "", base) base = re.sub("\\.bam$", "", base) return base # Run fastqc, since there can be race conditions if multiple jobs # use the same fastqc dir, we create a temp dir. with TemporaryDirectory() as tempdir: shell( "fastqc {snakemake.params} -t {snakemake.threads} " "--outdir {tempdir:q} {snakemake.input[0]:q}" " {log}" ) # Move outputs into proper position. output_base = basename_without_ext(snakemake.input[0]) html_path = path.join(tempdir, output_base + "_fastqc.html") zip_path = path.join(tempdir, output_base + "_fastqc.zip") if snakemake.output.html != html_path: shell("mv {html_path:q} {snakemake.output.html:q}") if snakemake.output.zip != zip_path: shell("mv {zip_path:q} {snakemake.output.zip:q}") |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 | __author__ = "William Rowell" __copyright__ = "Copyright 2020, William Rowell" __email__ = "wrowell@pacb.com" __license__ = "MIT" import sys from snakemake.shell import shell extra = snakemake.params.get("extra", "") log = snakemake.log_fmt_shell(stdout=True, stderr=True) bed = snakemake.input.get("bed", "") by = snakemake.params.get("by", "") if by: if bed: sys.exit( "Either provide a bed input file OR a window size via params.by, not both." ) else: by = f"--by {by}" if bed: by = f"--by {bed}" quantize_out = False thresholds_out = False regions_bed_out = False region_dist_out = False for file in snakemake.output: if ".per-base." in file and "--no-per-base" in extra: sys.exit( "You asked not to generate per-base output (--no-per-base), but your rule specifies a '.per-base.' output file. Remove one of the two." ) if ".quantized.bed.gz" in file: quantize_out = True if ".thresholds.bed.gz" in file: thresholds_out = True if ".mosdepth.region.dist.txt" in file: region_dist_out = True if ".regions.bed.gz" in file: regions_bed_out = True if by and not regions_bed_out: sys.exit( "You ask for by-region output. Please also specify *.regions.bed.gz as a rule output." ) if by and not region_dist_out: sys.exit( "You ask for by-region output. Please also specify *.mosdepth.region.dist.txt as a rule output." ) if (region_dist_out or regions_bed_out) and not by: sys.exit( "You specify *.regions.bed.gz and/or *.mosdepth.region.dist.txt as a rule output. You also need to ask for by-region output via 'input.bed' or 'params.by'." ) quantize = snakemake.params.get("quantize", "") if quantize: if not quantize_out: sys.exit( "You ask for quantized output via params.quantize. Please also specify *.quantized.bed.gz as a rule output." ) quantize = f"--quantize {quantize}" if not quantize and quantize_out: sys.exit( "The rule has output *.quantized.bed.gz specified. Please also specify params.quantize to actually generate it." ) thresholds = snakemake.params.get("thresholds", "") if thresholds: if not thresholds_out: sys.exit( "You ask for --thresholds output via params.thresholds. Please also specify *.thresholds.bed.gz as a rule output." ) thresholds = f"--thresholds {thresholds}" if not thresholds and thresholds_out: sys.exit( "The rule has output *.thresholds.bed.gz specified. Please also specify params.thresholds to actually generate it." ) precision = snakemake.params.get("precision", "") if precision: precision = f"MOSDEPTH_PRECISION={precision}" fasta = snakemake.input.get("fasta", "") if fasta: fasta = f"--fasta {fasta}" # mosdepth takes additional threads through its option --threads # One thread for mosdepth # Other threads are *additional* decompression threads passed to the '--threads' argument threads = "" if snakemake.threads <= 1 else "--threads {}".format(snakemake.threads - 1) # named output summary = "*.mosdepth.summary.txt" is required prefix = snakemake.output.summary.replace(".mosdepth.summary.txt", "") shell( "({precision} mosdepth {threads} {fasta} {by} {quantize} {thresholds} {extra} {prefix} {snakemake.input.bam}) {log}" ) |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 | __author__ = "Julian de Ruiter" __copyright__ = "Copyright 2017, Julian de Ruiter" __email__ = "julianderuiter@gmail.com" __license__ = "MIT" from os import path from snakemake.shell import shell extra = snakemake.params.get("extra", "") # Set this to False if multiqc should use the actual input directly # instead of parsing the folders where the provided files are located use_input_files_only = snakemake.params.get("use_input_files_only", False) if not use_input_files_only: input_data = set(path.dirname(fp) for fp in snakemake.input) else: input_data = set(snakemake.input) output_dir = path.dirname(snakemake.output[0]) output_name = path.basename(snakemake.output[0]) log = snakemake.log_fmt_shell(stdout=True, stderr=True) shell( "multiqc" " {extra}" " --force" " -o {output_dir}" " -n {output_name}" " {input_data}" " {log}" ) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 | __author__ = "Johannes Köster, Christopher Schröder" __copyright__ = "Copyright 2016, Johannes Köster" __email__ = "koester@jimmy.harvard.edu" __license__ = "MIT" import tempfile from snakemake.shell import shell from snakemake_wrapper_utils.java import get_java_opts log = snakemake.log_fmt_shell() extra = snakemake.params.get("extra", "") java_opts = get_java_opts(snakemake) bams = snakemake.input.bams if isinstance(bams, str): bams = [bams] bams = list(map("--INPUT {}".format, bams)) if snakemake.output.bam.endswith(".cram"): output = "/dev/stdout" if snakemake.params.embed_ref: view_options = "-O cram,embed_ref" else: view_options = "-O cram" convert = f" | samtools view -@ {snakemake.threads} {view_options} --reference {snakemake.input.ref} -o {snakemake.output.bam}" else: output = snakemake.output.bam convert = "" with tempfile.TemporaryDirectory() as tmpdir: shell( "(picard MarkDuplicates" # Tool and its subcommand " {java_opts}" # Automatic java option " {extra}" # User defined parmeters " {bams}" # Input bam(s) " --TMP_DIR {tmpdir}" " --OUTPUT {output}" # Output bam " --METRICS_FILE {snakemake.output.metrics}" # Output metrics " {convert} ) {log}" # Logging ) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 | __author__ = "Johannes Köster" __copyright__ = "Copyright 2019, Johannes Köster" __email__ = "johannes.koester@uni-due.de" __license__ = "MIT" import subprocess import sys from pathlib import Path from snakemake.shell import shell log = snakemake.log_fmt_shell(stdout=False, stderr=True) species = snakemake.params.species.lower() build = snakemake.params.build release = int(snakemake.params.release) out_fmt = Path(snakemake.output[0]).suffixes out_gz = (out_fmt.pop() and True) if out_fmt[-1] == ".gz" else False out_fmt = out_fmt.pop().lstrip(".") branch = "" if release >= 81 and build == "GRCh37": # use the special grch37 branch for new releases branch = "grch37/" elif snakemake.params.get("branch"): branch = snakemake.params.branch + "/" flavor = snakemake.params.get("flavor", "") if flavor: flavor += "." suffix = "" if out_fmt == "gtf": suffix = "gtf.gz" elif out_fmt == "gff3": suffix = "gff3.gz" else: raise ValueError( "invalid format specified. Only 'gtf[.gz]' and 'gff3[.gz]' are currently supported." ) url = "ftp://ftp.ensembl.org/pub/{branch}release-{release}/{out_fmt}/{species}/{species_cap}.{build}.{release}.{flavor}{suffix}".format( release=release, build=build, species=species, out_fmt=out_fmt, species_cap=species.capitalize(), suffix=suffix, flavor=flavor, branch=branch, ) try: if out_gz: shell("curl -L {url} > {snakemake.output[0]} {log}") else: shell("(curl -L {url} | gzip -d > {snakemake.output[0]}) {log}") except subprocess.CalledProcessError as e: if snakemake.log: sys.stderr = open(snakemake.log[0], "a") print( "Unable to download annotation data from Ensembl. " "Did you check that this combination of species, build, and release is actually provided?", file=sys.stderr, ) exit(1) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 | __author__ = "Johannes Köster" __copyright__ = "Copyright 2019, Johannes Köster" __email__ = "johannes.koester@uni-due.de" __license__ = "MIT" import subprocess as sp import sys from itertools import product from snakemake.shell import shell species = snakemake.params.species.lower() release = int(snakemake.params.release) build = snakemake.params.build branch = "" if release >= 81 and build == "GRCh37": # use the special grch37 branch for new releases branch = "grch37/" elif snakemake.params.get("branch"): branch = snakemake.params.branch + "/" log = snakemake.log_fmt_shell(stdout=False, stderr=True) spec = ("{build}" if int(release) > 75 else "{build}.{release}").format( build=build, release=release ) suffixes = "" datatype = snakemake.params.get("datatype", "") chromosome = snakemake.params.get("chromosome", "") if datatype == "dna": if chromosome: suffixes = ["dna.chromosome.{}.fa.gz".format(chromosome)] else: suffixes = ["dna.primary_assembly.fa.gz", "dna.toplevel.fa.gz"] elif datatype == "cdna": suffixes = ["cdna.all.fa.gz"] elif datatype == "cds": suffixes = ["cds.all.fa.gz"] elif datatype == "ncrna": suffixes = ["ncrna.fa.gz"] elif datatype == "pep": suffixes = ["pep.all.fa.gz"] else: raise ValueError("invalid datatype, must be one of dna, cdna, cds, ncrna, pep") if chromosome: if not datatype == "dna": raise ValueError( "invalid datatype, to select a single chromosome the datatype must be dna" ) spec = spec.format(build=build, release=release) url_prefix = f"ftp://ftp.ensembl.org/pub/{branch}release-{release}/fasta/{species}/{datatype}/{species.capitalize()}.{spec}" success = False for suffix in suffixes: url = f"{url_prefix}.{suffix}" try: shell("curl -sSf {url} > /dev/null 2> /dev/null") except sp.CalledProcessError: continue shell("(curl -L {url} | gzip -d > {snakemake.output[0]}) {log}") success = True break if not success: if len(suffixes) > 1: url = f"{url_prefix}.[{'|'.join(suffixes)}]" else: url = f"{url_prefix}.{suffixes[0]}" print( f"Unable to download requested sequence data from Ensembl ({url}). " "Please check whether above URL is currently available (might be a temporal server issue). " "Apart from that, did you check that this combination of species, build, and release is actually provided?", file=sys.stderr, ) exit(1) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 | __author__ = "Johannes Köster" __copyright__ = "Copyright 2019, Johannes Köster" __email__ = "johannes.koester@uni-due.de" __license__ = "MIT" import tempfile import subprocess import sys import os from snakemake.shell import shell from snakemake.exceptions import WorkflowError species = snakemake.params.species.lower() release = int(snakemake.params.release) build = snakemake.params.build type = snakemake.params.type chromosome = snakemake.params.get("chromosome", "") branch = "" if release >= 81 and build == "GRCh37": # use the special grch37 branch for new releases branch = "grch37/" elif snakemake.params.get("branch"): branch = snakemake.params.branch + "/" if release < 98 and not branch: print("Ensembl releases <98 are unsupported.", file=open(snakemake.log[0], "w")) exit(1) log = snakemake.log_fmt_shell(stdout=False, stderr=True) if chromosome and type != "all": raise ValueError( "Parameter chromosome given but chromosome-wise download" "is only implemented for type='all'." ) if type == "all": if species == "homo_sapiens" and release >= 93: chroms = ( list(range(1, 23)) + ["X", "Y", "MT"] if not chromosome else [chromosome] ) suffixes = ["-chr{}".format(chrom) for chrom in chroms] else: if chromosome: raise ValueError( "Parameter chromosome given but chromosome-wise download" "is only implemented for homo_sapiens in releases >=93." ) suffixes = [""] elif type == "somatic": suffixes = ["_somatic"] elif type == "structural_variations": suffixes = ["_structural_variations"] else: raise ValueError( "Unsupported type {} (only all, somatic, structural_variations are allowed)".format( type ) ) species_filename = species if release >= 91 else species.capitalize() urls = [ "ftp://ftp.ensembl.org/pub/{branch}release-{release}/variation/vcf/{species}/{species_filename}{suffix}.{ext}".format( release=release, species=species, suffix=suffix, species_filename=species_filename, branch=branch, ext=ext, ) for suffix in suffixes for ext in ["vcf.gz", "vcf.gz.csi"] ] names = [os.path.basename(url) for url in urls if url.endswith(".gz")] try: gather = "curl {urls}".format(urls=" ".join(map("-O {}".format, urls))) workdir = os.getcwd() with tempfile.TemporaryDirectory() as tmpdir: if snakemake.input.get("fai"): shell( "(cd {tmpdir}; {gather} && " "bcftools concat -Oz --naive {names} > concat.vcf.gz && " "bcftools reheader --fai {workdir}/{snakemake.input.fai} concat.vcf.gz " "> {workdir}/{snakemake.output}) {log}" ) else: shell( "(cd {tmpdir}; {gather} && " "bcftools concat -Oz --naive {names} " "> {workdir}/{snakemake.output}) {log}" ) except subprocess.CalledProcessError as e: if snakemake.log: sys.stderr = open(snakemake.log[0], "a") print( "Unable to download variation data from Ensembl. " "Did you check that this combination of species, build, and release is actually provided? ", file=sys.stderr, ) exit(1) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 | __author__ = "Michael Chambers" __copyright__ = "Copyright 2019, Michael Chambers" __email__ = "greenkidneybean@gmail.com" __license__ = "MIT" from snakemake.shell import shell from snakemake_wrapper_utils.samtools import get_samtools_opts samtools_opts = get_samtools_opts( snakemake, parse_threads=False, parse_write_index=False, parse_output_format=False ) extra = snakemake.params.get("extra", "") log = snakemake.log_fmt_shell(stdout=True, stderr=True) shell("samtools faidx {samtools_opts} {extra} {snakemake.input[0]} {log}") |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 | __author__ = "Julian de Ruiter" __copyright__ = "Copyright 2017, Julian de Ruiter" __email__ = "julianderuiter@gmail.com" __license__ = "MIT" from snakemake.shell import shell from snakemake_wrapper_utils.samtools import get_samtools_opts bed = snakemake.input.get("bed", "") if bed: bed = "-t " + bed samtools_opts = get_samtools_opts( snakemake, parse_write_index=False, parse_output=False, parse_output_format=False ) extra = snakemake.params.get("extra", "") region = snakemake.params.get("region", "") log = snakemake.log_fmt_shell(stdout=False, stderr=True) shell( "samtools stats {samtools_opts} {extra} {snakemake.input.bam} {bed} {region} > {snakemake.output} {log}" ) |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 | __author__ = "Thibault Dayris" __copyright__ = "Copyright 2020, Dayris Thibault" __email__ = "thibault.dayris@gustaveroussy.fr" __license__ = "MIT" from snakemake.shell import shell from snakemake_wrapper_utils.java import get_java_opts java_opts = get_java_opts(snakemake) log = snakemake.log_fmt_shell(stdout=False, stderr=True) extra = snakemake.params.get("extra", "") min_threads = 1 incall = snakemake.input["call"] if snakemake.input["call"].endswith("bcf"): min_threads += 1 incall = "< <(bcftools view {})".format(incall) elif snakemake.input["call"].endswith("gz"): min_threads += 1 incall = "< <(gunzip -c {})".format(incall) outcall = snakemake.output["call"] if snakemake.output["call"].endswith("gz"): min_threads += 1 outcall = "| gzip -c > {}".format(outcall) elif snakemake.output["call"].endswith("bcf"): min_threads += 1 outcall = "| bcftools view > {}".format(outcall) else: outcall = "> {}".format(outcall) if snakemake.threads < min_threads: raise ValueError( "At least {} threads required, {} provided".format( min_threads, snakemake.threads ) ) shell( "SnpSift annotate" # Tool and its subcommand " {java_opts} {extra}" # Extra parameters " {snakemake.input.database}" # Path to annotation vcf file " {incall} " # Path to input vcf file " {outcall} " # Path to output vcf file " {log}" # Logging behaviour ) |
1 2 3 4 5 6 7 8 9 10 11 | __author__ = "Johannes Köster" __copyright__ = "Copyright 2016, Johannes Köster" __email__ = "koester@jimmy.harvard.edu" __license__ = "MIT" from snakemake.shell import shell log = snakemake.log_fmt_shell(stdout=False, stderr=True) shell("tabix {snakemake.params} {snakemake.input[0]} {log}") |
12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 | __author__ = "Johannes Köster, Jorge Langa" __copyright__ = "Copyright 2016, Johannes Köster" __email__ = "koester@jimmy.harvard.edu" __license__ = "MIT" from snakemake.shell import shell from snakemake_wrapper_utils.java import get_java_opts # Distribute available threads between trimmomatic itself and any potential pigz instances def distribute_threads(input_files, output_files, available_threads): gzipped_input_files = sum(1 for file in input_files if file.endswith(".gz")) gzipped_output_files = sum(1 for file in output_files if file.endswith(".gz")) potential_threads_per_process = available_threads // ( 1 + gzipped_input_files + gzipped_output_files ) if potential_threads_per_process > 0: # decompressing pigz creates at most 4 threads pigz_input_threads = ( min(4, potential_threads_per_process) if gzipped_input_files != 0 else 0 ) pigz_output_threads = ( (available_threads - pigz_input_threads * gzipped_input_files) // (1 + gzipped_output_files) if gzipped_output_files != 0 else 0 ) trimmomatic_threads = ( available_threads - pigz_input_threads * gzipped_input_files - pigz_output_threads * gzipped_output_files ) else: # not enough threads for pigz pigz_input_threads = 0 pigz_output_threads = 0 trimmomatic_threads = available_threads return trimmomatic_threads, pigz_input_threads, pigz_output_threads def compose_input_gz(filename, threads): if filename.endswith(".gz") and threads > 0: return "<(pigz -p {threads} --decompress --stdout {filename})".format( threads=threads, filename=filename ) return filename def compose_output_gz(filename, threads, compression_level): if filename.endswith(".gz") and threads > 0: return ">(pigz -p {threads} {compression_level} > {filename})".format( threads=threads, compression_level=compression_level, filename=filename ) return filename extra = snakemake.params.get("extra", "") java_opts = get_java_opts(snakemake) log = snakemake.log_fmt_shell(stdout=True, stderr=True) compression_level = snakemake.params.get("compression_level", "-5") trimmer = " ".join(snakemake.params.trimmer) # Distribute threads input_files = [snakemake.input.r1, snakemake.input.r2] output_files = [ snakemake.output.r1, snakemake.output.r1_unpaired, snakemake.output.r2, snakemake.output.r2_unpaired, ] trimmomatic_threads, input_threads, output_threads = distribute_threads( input_files, output_files, snakemake.threads ) input_r1, input_r2 = [ compose_input_gz(filename, input_threads) for filename in input_files ] output_r1, output_r1_unp, output_r2, output_r2_unp = [ compose_output_gz(filename, output_threads, compression_level) for filename in output_files ] shell( "trimmomatic PE -threads {trimmomatic_threads} {java_opts} {extra} " "{input_r1} {input_r2} " "{output_r1} {output_r1_unp} " "{output_r2} {output_r2_unp} " "{trimmer} " "{log}" ) |
Support
- Future updates
Related Workflows





