A Snakemake workflow for benchmarking variant calling approaches with Genome in a Bottle (GIAB), CHM (syndip), or other custom datasets
A Snakemake workflow for benchmarking variant calling approaches with Genome in a Bottle (GIAB) data (and other custom benchmark datasets). The workflow uses a combination of bedtools, mosdepth, rtg-tools, pandas, and datavzrd.
Usage
The usage of this workflow is described in the Snakemake Workflow Catalog.
If you use this workflow in a paper, don't forget to give credit to the authors by citing the URL of this (original) repository and its DOI (see above).
Code Snippets
```python
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=True, stderr=True)

## Extract arguments
extra = snakemake.params.get("extra", "")

shell("bcftools index {extra} {snakemake.input[0]} {log}")
```
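The wrappers collected here are invoked from rules via Snakemake's `wrapper:` directive (the pinned versions used by this workflow appear in the rule snippets further below). As a minimal, hypothetical sketch (rule name and paths invented for illustration), indexing a BCF with the wrapper above could look like this:

```snakemake
rule bcftools_index:
    input:
        "results/calls.bcf",  # hypothetical path
    output:
        "results/calls.bcf.csi",  # bcftools index writes the index next to the input
    log:
        "logs/bcftools-index/calls.log",
    wrapper:
        "v1.9.0/bio/bcftools/index"
```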
```python
__author__ = "William Rowell"
__copyright__ = "Copyright 2020, William Rowell"
__email__ = "wrowell@pacb.com"
__license__ = "MIT"

import sys
from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

bed = snakemake.input.get("bed", "")
by = snakemake.params.get("by", "")
if by:
    if bed:
        sys.exit(
            "Either provide a bed input file OR a window size via params.by, not both."
        )
    else:
        by = f"--by {by}"
if bed:
    by = f"--by {bed}"

quantize_out = False
thresholds_out = False
regions_bed_out = False
region_dist_out = False
for file in snakemake.output:
    if ".per-base." in file and "--no-per-base" in extra:
        sys.exit(
            "You asked not to generate per-base output (--no-per-base), but your rule specifies a '.per-base.' output file. Remove one of the two."
        )
    if ".quantized.bed.gz" in file:
        quantize_out = True
    if ".thresholds.bed.gz" in file:
        thresholds_out = True
    if ".mosdepth.region.dist.txt" in file:
        region_dist_out = True
    if ".regions.bed.gz" in file:
        regions_bed_out = True

if by and not regions_bed_out:
    sys.exit(
        "You ask for by-region output. Please also specify *.regions.bed.gz as a rule output."
    )

if by and not region_dist_out:
    sys.exit(
        "You ask for by-region output. Please also specify *.mosdepth.region.dist.txt as a rule output."
    )

if (region_dist_out or regions_bed_out) and not by:
    sys.exit(
        "You specify *.regions.bed.gz and/or *.mosdepth.region.dist.txt as a rule output. You also need to ask for by-region output via 'input.bed' or 'params.by'."
    )

quantize = snakemake.params.get("quantize", "")
if quantize:
    if not quantize_out:
        sys.exit(
            "You ask for quantized output via params.quantize. Please also specify *.quantized.bed.gz as a rule output."
        )
    quantize = f"--quantize {quantize}"

if not quantize and quantize_out:
    sys.exit(
        "The rule has output *.quantized.bed.gz specified. Please also specify params.quantize to actually generate it."
    )

thresholds = snakemake.params.get("thresholds", "")
if thresholds:
    if not thresholds_out:
        sys.exit(
            "You ask for --thresholds output via params.thresholds. Please also specify *.thresholds.bed.gz as a rule output."
        )
    thresholds = f"--thresholds {thresholds}"

if not thresholds and thresholds_out:
    sys.exit(
        "The rule has output *.thresholds.bed.gz specified. Please also specify params.thresholds to actually generate it."
    )

precision = snakemake.params.get("precision", "")
if precision:
    precision = f"MOSDEPTH_PRECISION={precision}"

fasta = snakemake.input.get("fasta", "")
if fasta:
    fasta = f"--fasta {fasta}"

# mosdepth takes additional threads through its option --threads
# One thread for mosdepth
# Other threads are *additional* decompression threads passed to the '--threads' argument
threads = "" if snakemake.threads <= 1 else "--threads {}".format(snakemake.threads - 1)

# named output summary = "*.mosdepth.summary.txt" is required
prefix = snakemake.output.summary.replace(".mosdepth.summary.txt", "")

shell(
    "({precision} mosdepth {threads} {fasta} {by} {quantize} {thresholds} {extra} {prefix} {snakemake.input.bam}) {log}"
)
```
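Because this wrapper cross-checks `params` against the declared outputs, a rule using it must keep both in sync. A minimal sketch with hypothetical paths, declaring only the mandatory named `summary` output plus the global distribution file:

```snakemake
rule mosdepth:
    input:
        bam="mapped/sample.bam",  # hypothetical paths
        bai="mapped/sample.bam.bai",
    output:
        "coverage/sample.mosdepth.global.dist.txt",
        summary="coverage/sample.mosdepth.summary.txt",  # required named output
    params:
        extra="--no-per-base",  # consistent with not declaring a '.per-base.' output
    log:
        "logs/mosdepth/sample.log",
    threads: 4  # 1 for mosdepth itself, 3 additional decompression threads
    wrapper:
        "v1.7.2/bio/mosdepth"
```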
```python
__author__ = "Johannes Köster, Christopher Schröder"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"

import tempfile
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts

log = snakemake.log_fmt_shell()

extra = snakemake.params.get("extra", "")
java_opts = get_java_opts(snakemake)

bams = snakemake.input.bams
if isinstance(bams, str):
    bams = [bams]
bams = list(map("--INPUT {}".format, bams))

if snakemake.output.bam.endswith(".cram"):
    output = "/dev/stdout"
    if snakemake.params.embed_ref:
        view_options = "-O cram,embed_ref"
    else:
        view_options = "-O cram"
    convert = f" | samtools view -@ {snakemake.threads} {view_options} --reference {snakemake.input.ref} -o {snakemake.output.bam}"
else:
    output = snakemake.output.bam
    convert = ""

with tempfile.TemporaryDirectory() as tmpdir:
    shell(
        "(picard MarkDuplicates"  # Tool and its subcommand
        " {java_opts}"  # Automatic java option
        " {extra}"  # User defined parameters
        " {bams}"  # Input bam(s)
        " --TMP_DIR {tmpdir}"
        " --OUTPUT {output}"  # Output bam
        " --METRICS_FILE {snakemake.output.metrics}"  # Output metrics
        " {convert} ) {log}"  # Logging
    )
```
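A sketch of a corresponding rule (paths hypothetical); `resources.mem_mb` is picked up by `get_java_opts` to size the JVM heap:

```snakemake
rule mark_duplicates:
    input:
        bams="mapped/sample.bam",
    output:
        bam="dedup/sample.bam",  # a .cram suffix would instead trigger the samtools conversion above
        metrics="dedup/sample.metrics.txt",
    log:
        "logs/picard/dedup/sample.log",
    resources:
        mem_mb=4096,
    wrapper:
        "v1.7.2/bio/picard/markduplicates"
```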
```python
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2019, Johannes Köster"
__email__ = "johannes.koester@uni-due.de"
__license__ = "MIT"

import subprocess as sp
import sys
from itertools import product
from snakemake.shell import shell

species = snakemake.params.species.lower()
release = int(snakemake.params.release)
build = snakemake.params.build

branch = ""
if release >= 81 and build == "GRCh37":
    # use the special grch37 branch for new releases
    branch = "grch37/"

log = snakemake.log_fmt_shell(stdout=False, stderr=True)

spec = ("{build}" if int(release) > 75 else "{build}.{release}").format(
    build=build, release=release
)

suffixes = ""
datatype = snakemake.params.get("datatype", "")
chromosome = snakemake.params.get("chromosome", "")
if datatype == "dna":
    if chromosome:
        suffixes = ["dna.chromosome.{}.fa.gz".format(chromosome)]
    else:
        suffixes = ["dna.primary_assembly.fa.gz", "dna.toplevel.fa.gz"]
elif datatype == "cdna":
    suffixes = ["cdna.all.fa.gz"]
elif datatype == "cds":
    suffixes = ["cds.all.fa.gz"]
elif datatype == "ncrna":
    suffixes = ["ncrna.fa.gz"]
elif datatype == "pep":
    suffixes = ["pep.all.fa.gz"]
else:
    raise ValueError("invalid datatype, must be one of dna, cdna, cds, ncrna, pep")

if chromosome:
    if not datatype == "dna":
        raise ValueError(
            "invalid datatype, to select a single chromosome the datatype must be dna"
        )

spec = spec.format(build=build, release=release)
url_prefix = f"ftp://ftp.ensembl.org/pub/{branch}release-{release}/fasta/{species}/{datatype}/{species.capitalize()}.{spec}"

success = False
for suffix in suffixes:
    url = f"{url_prefix}.{suffix}"

    try:
        shell("curl -sSf {url} > /dev/null 2> /dev/null")
    except sp.CalledProcessError:
        continue

    shell("(curl -L {url} | gzip -d > {snakemake.output[0]}) {log}")
    success = True
    break

if not success:
    if len(suffixes) > 1:
        url = f"{url_prefix}.[{'|'.join(suffixes)}]"
    else:
        url = f"{url_prefix}.{suffixes[0]}"
    print(
        f"Unable to download requested sequence data from Ensembl ({url}). "
        "Please check whether above URL is currently available (might be a temporal server issue). "
        "Apart from that, did you check that this combination of species, build, and release is actually provided?",
        file=sys.stderr,
    )
    exit(1)
```
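A sketch of a download rule for this wrapper (the release number is an arbitrary example; the wrapper decompresses the downloaded file):

```snakemake
rule get_genome:
    output:
        "resources/genome.fasta",
    params:
        species="homo_sapiens",
        datatype="dna",
        build="GRCh38",
        release="107",  # hypothetical release
    log:
        "logs/get-genome.log",
    wrapper:
        "v1.7.2/bio/reference/ensembl-sequence"
```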
```python
__author__ = "Michael Chambers"
__copyright__ = "Copyright 2019, Michael Chambers"
__email__ = "greenkidneybean@gmail.com"
__license__ = "MIT"

from snakemake.shell import shell
from snakemake_wrapper_utils.samtools import get_samtools_opts

samtools_opts = get_samtools_opts(
    snakemake, parse_threads=False, parse_write_index=False, parse_output_format=False
)
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell("samtools faidx {samtools_opts} {extra} {snakemake.input[0]} {log}")
```
```python
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"

from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

# Samtools takes additional threads through its option -@
# One thread for samtools index
# Other threads are *additional* threads passed to the '-@' argument
threads = "" if snakemake.threads <= 1 else " -@ {} ".format(snakemake.threads - 1)

shell(
    "samtools index {threads} {extra} {snakemake.input[0]} {snakemake.output[0]} {log}"
)
```
```python
__author__ = "Patrik Smeds"
__copyright__ = "Copyright 2016, Patrik Smeds"
__email__ = "patrik.smeds@gmail.com"
__license__ = "MIT"

from os.path import splitext
from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=False, stderr=True)

# Check inputs/arguments.
if len(snakemake.input) == 0:
    raise ValueError("A reference genome has to be provided!")
elif len(snakemake.input) > 1:
    raise ValueError("Only one reference genome can be input!")

# Prefix that should be used for the database
prefix = snakemake.params.get("prefix", splitext(snakemake.output.idx[0])[0])

if len(prefix) > 0:
    prefix = "-p " + prefix

# Construction algorithm that will be used to build the database, default is bwtsw
construction_algorithm = snakemake.params.get("algorithm", "")

if len(construction_algorithm) != 0:
    construction_algorithm = "-a " + construction_algorithm

shell(
    "bwa index"
    " {prefix}"
    " {construction_algorithm}"
    " {snakemake.input[0]}"
    " {log}"
)
```
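Since the default index prefix is derived from `output.idx[0]`, declaring the five BWA index files via `multiext` works well. A hypothetical sketch:

```snakemake
rule bwa_index:
    input:
        "resources/genome.fasta",  # hypothetical path
    output:
        idx=multiext("resources/genome.fasta", ".amb", ".ann", ".bwt", ".pac", ".sa"),
    log:
        "logs/bwa-index.log",
    wrapper:
        "v1.8.0/bio/bwa/index"
```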
```python
__author__ = "Johannes Köster, Julian de Ruiter"
__copyright__ = "Copyright 2016, Johannes Köster and Julian de Ruiter"
__email__ = "koester@jimmy.harvard.edu, julianderuiter@gmail.com"
__license__ = "MIT"

import re
import sys  # needed for sys.exit below
import tempfile
from os import path

from snakemake.shell import shell

# Extract arguments.
extra = snakemake.params.get("extra", "")

sort = snakemake.params.get("sorting", "none")
sort_order = snakemake.params.get("sort_order", "coordinate")
sort_extra = snakemake.params.get("sort_extra", "")

index = snakemake.input.idx
if isinstance(index, str):
    index = path.splitext(snakemake.input.idx)[0]
else:
    index = path.splitext(snakemake.input.idx[0])[0]

if re.search(r"-T\b", sort_extra) or re.search(r"--TMP_DIR\b", sort_extra):
    sys.exit(
        "You have specified temp dir (`-T` or `--TMP_DIR`) in params.sort_extra; this is automatically set from params.tmp_dir."
    )

log = snakemake.log_fmt_shell(stdout=False, stderr=True)

# Check inputs/arguments.
if not isinstance(snakemake.input.reads, str) and len(snakemake.input.reads) not in {
    1,
    2,
}:
    raise ValueError("input must have 1 (single-end) or 2 (paired-end) elements")

if sort_order not in {"coordinate", "queryname"}:
    raise ValueError("Unexpected value for sort_order ({})".format(sort_order))

# Determine which pipe command to use for converting to bam or sorting.
if sort == "none":
    # Simply convert to bam using samtools view.
    pipe_cmd = "samtools view -Sbh -o {snakemake.output[0]} -"
elif sort == "samtools":
    # Add name flag if needed.
    if sort_order == "queryname":
        sort_extra += " -n"
    # Sort alignments using samtools sort.
    pipe_cmd = "samtools sort -T {tmp} {sort_extra} -o {snakemake.output[0]} -"
elif sort == "picard":
    # Sort alignments using picard SortSam.
    pipe_cmd = (
        "picard SortSam {sort_extra} --INPUT /dev/stdin"
        " --OUTPUT {snakemake.output[0]} --SORT_ORDER {sort_order} --TMP_DIR {tmp}"
    )
else:
    raise ValueError("Unexpected value for params.sort ({})".format(sort))

with tempfile.TemporaryDirectory() as tmp:
    shell(
        "(bwa mem"
        " -t {snakemake.threads}"
        " {extra}"
        " {index}"
        " {snakemake.input.reads}"
        " | " + pipe_cmd + ") {log}"
    )
```
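A sketch of a paired-end mapping rule with samtools coordinate sorting (read group and paths are invented for illustration):

```snakemake
rule bwa_mem:
    input:
        reads=["reads/sample.R1.fastq.gz", "reads/sample.R2.fastq.gz"],
        idx=multiext("resources/genome.fasta", ".amb", ".ann", ".bwt", ".pac", ".sa"),
    output:
        "mapped/sample.bam",
    params:
        extra=r"-R '@RG\tID:sample\tSM:sample'",  # hypothetical read group
        sorting="samtools",       # one of: none, samtools, picard
        sort_order="coordinate",  # or queryname
    log:
        "logs/bwa-mem/sample.log",
    threads: 8
    wrapper:
        "v1.8.0/bio/bwa/mem"
```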
```python
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"

from snakemake.shell import shell
from snakemake_wrapper_utils.bcftools import get_bcftools_opts

bcftools_opts = get_bcftools_opts(
    snakemake, parse_ref=False, parse_output_format=False, parse_memory=False
)
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

if "--tbi" in extra or "--csi" in extra:
    raise ValueError(
        "You have specified index format (`--tbi/--csi`) in `params.extra`; this is automatically inferred from the first output file."
    )

if snakemake.output[0].endswith(".tbi"):
    extra += " --tbi"
elif snakemake.output[0].endswith(".csi"):
    extra += " --csi"
else:
    raise ValueError("invalid index file format ('.tbi', '.csi').")

shell("bcftools index {bcftools_opts} {extra} {snakemake.input[0]} {log}")
```
```python
__author__ = "Dayne Filer"
__copyright__ = "Copyright 2019, Dayne Filer"
__email__ = "dayne.filer@gmail.com"
__license__ = "MIT"

from snakemake.shell import shell
from snakemake_wrapper_utils.bcftools import get_bcftools_opts

bcftools_opts = get_bcftools_opts(snakemake, parse_memory=False)
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell("bcftools norm {bcftools_opts} {extra} {snakemake.input[0]} {log}")
```
```python
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2017, Johannes Köster"
__email__ = "johannes.koester@protonmail.com"
__license__ = "MIT"

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=True, stderr=True)
extra = snakemake.params.get("extra", "")

shell("datavzrd {snakemake.input.config} {extra} --output {snakemake.output[0]} {log}")
```
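This wrapper only needs the datavzrd config as named input `config` and an output directory. A hypothetical sketch:

```snakemake
rule datavzrd_report:
    input:
        config="resources/datavzrd/report.yaml",  # hypothetical config path
    output:
        directory("results/datavzrd-report"),
    log:
        "logs/datavzrd/report.log",
    wrapper:
        "v2.1.1/utils/datavzrd"
```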
```snakemake
shell:
    "(set +o pipefail; samtools view -f3 -h"
    " {params.bam_url}"
    " {params.limit} |"
    " samtools sort -n -O BAM --threads {resources.sort_threads} | "
    " samtools fastq -1 {output.r1} -2 {output.r2} -s /dev/null -0 /dev/null -) 2> {log}"
```

```snakemake
shell:
    "(mkdir -p {output}; curl -L {params.url} | tar -x -C {output} --strip-components 1) 2> {log}"
```

```snakemake
shell:
    "ls {input.archive}; (bcftools view {params.url}"
    " | sed {params.repl_chr} | bcftools view -Ob - > {output}"
    ") 2> {log}"
```

```snakemake
shell:
    "bcftools concat -O b --allow-overlap {input} > {output} 2> {log}"
```

```snakemake
wrapper:
    "v1.9.0/bio/bcftools/norm"
```

```snakemake
shell:
    "({params.cmd} | sed {params.repl_chr} > {output}) 2> {log}"
```

```snakemake
shell:
    "curl --insecure -L http://hgdownload.soe.ucsc.edu/goldenPath/hg19/liftOver/hg19ToHg38.over.chain.gz > {output} 2> {log}"
```

```snakemake
shell:
    "({params.get_bed} {params.liftover}) 2> {log}"
```

```snakemake
shell:
    "sed {params.repl_chr} {input} > {output} 2> {log}"
```

```snakemake
wrapper:
    "v1.7.2/bio/reference/ensembl-sequence"
```

```snakemake
wrapper:
    "v1.7.2/bio/samtools/faidx"
```

```snakemake
wrapper:
    "v1.8.0/bio/bwa/index"
```

```snakemake
wrapper:
    "v1.8.0/bio/bwa/mem"
```

```snakemake
wrapper:
    "v1.7.2/bio/picard/markduplicates"
```

```snakemake
wrapper:
    "v1.7.2/bio/samtools/index"
```

```snakemake
wrapper:
    "v1.7.2/bio/mosdepth"
```

```snakemake
shell:
    "bcftools annotate {input.calls} --rename-chrs {input.repl_file} "
    "-Ob -o {output} 2> {log}"
```

```snakemake
shell:
    "(bedtools intersect -b {input.regions} -a "
    "<(bcftools view {input.variants}) -wa -f 1.0 -header | "
    "bcftools view -Oz > {output}) 2> {log}"
```

```snakemake
wrapper:
    "v1.7.2/bio/bcftools/index"
```

```snakemake
script:
    "../scripts/stat-truth.py"
```

```snakemake
shell:
    "rtg format --output {output} {input.genome} &> {log}"
```

```snakemake
shell:
    "rm -r {params.output}; rtg vcfeval --threads {threads} --ref-overlap --all-records "
    "--output-mode ga4gh --baseline {input.truth} --calls {input.query} "
    "--output {params.output} --template {input.genome} &> {log}"
```

```snakemake
script:
    "../scripts/calc-precision-recall.py"
```

```snakemake
script:
    "../scripts/collect-stratifications.py"
```

```snakemake
script:
    "../scripts/collect-precision-recall.py"
```

```snakemake
wrapper:
    "v2.1.1/utils/datavzrd"
```

```snakemake
script:
    "../scripts/extract-fp-fn.py"
```

```snakemake
script:
    "../scripts/collect-fp-fn.py"
```

```snakemake
wrapper:
    "v2.1.1/utils/datavzrd"
```

```snakemake
wrapper:
    "v1.9.0/bio/bcftools/index"
```

```snakemake
wrapper:
    "v1.9.0/bio/bcftools/index"
```
```python
from collections import defaultdict
import sys, os

# sys.path.insert(0, os.path.dirname(__file__))
sys.stderr = open(snakemake.log[0], "w")

from abc import ABC, abstractmethod
from enum import Enum
import pandas as pd
import pysam

from common.classification import CompareExactGenotype, CompareExistence, Class


class Classifications:
    def __init__(self, comparator):
        self.tp_query = 0
        self.tp_truth = 0
        self.fn = 0
        self.fp = 0
        self.comparator = comparator

    def register(self, record):
        for c in self.comparator.classify(record):
            if c.cls is Class.TP_truth:
                self.tp_truth += 1
            elif c.cls is Class.TP_query:
                self.tp_query += 1
            elif c.cls is Class.FN:
                self.fn += 1
            elif c.cls is Class.FP:
                self.fp += 1
            else:
                assert False, "unexpected case"

    def precision(self):
        p = self.tp_query + self.fp
        if p == 0:
            return 1.0
        return float(self.tp_query) / float(p)

    def recall(self):
        t = self.tp_truth + self.fn
        if t == 0:
            return 1.0
        return float(self.tp_truth) / float(t)


def collect_results(vartype):
    classifications_exact = Classifications(CompareExactGenotype(vartype))
    classifications_existence = Classifications(CompareExistence(vartype))

    for record in pysam.VariantFile(snakemake.input.calls):
        classifications_exact.register(record)
        classifications_existence.register(record)

    vartype = "indels" if vartype == "INDEL" else "snvs"

    mismatched_genotype = (
        classifications_existence.tp_query - classifications_exact.tp_query
    )
    if classifications_existence.tp_query > 0:
        mismatched_genotype_rate = (
            mismatched_genotype / classifications_existence.tp_query
        )
    else:
        mismatched_genotype_rate = 0.0

    d = pd.DataFrame(
        {
            "precision": [classifications_existence.precision()],
            "tp_query": [classifications_existence.tp_query],
            "fp": [classifications_existence.fp],
            "recall": [classifications_existence.recall()],
            "tp_truth": [classifications_existence.tp_truth],
            "fn": [classifications_existence.fn],
            "genotype_mismatch_rate": [mismatched_genotype_rate],
        }
    )

    return d[
        [
            "precision",
            "tp_query",
            "fp",
            "recall",
            "tp_truth",
            "fn",
            "genotype_mismatch_rate",
        ]
    ]


assert snakemake.wildcards.vartype in ["snvs", "indels"]
vartype = "SNV" if snakemake.wildcards.vartype == "snvs" else "INDEL"

collect_results(vartype).to_csv(snakemake.output[0], sep="\t", index=False)
```
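Note that precision and recall are computed from two separate true-positive counts, in the spirit of the GA4GH scheme also used by rtg vcfeval above: `tp_query` counts confirmed query records (the denominator side of precision), while `tp_truth` counts recovered truth records (the denominator side of recall). A toy calculation with made-up counts:

```python
# Hypothetical counts, for illustration only:
tp_query, fp = 95, 5   # query records confirmed / not confirmed by the truth
tp_truth, fn = 90, 10  # truth records recovered / missed by the callset

precision = tp_query / (tp_query + fp)  # 95 / 100 = 0.95
recall = tp_truth / (tp_truth + fn)     # 90 / 100 = 0.90
```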
```python
import os
import sys

sys.stderr = open(snakemake.log[0], "w")

import pandas as pd
from scipy.cluster.hierarchy import ward, leaves_list
from scipy.spatial.distance import pdist
from sklearn.feature_selection import chi2
from statsmodels.stats.multitest import fdrcorrection
import numpy as np


def collect_chromosomes(f):
    d = pd.read_csv(f, sep="\t", dtype=str, usecols=["chromosome"]).drop_duplicates()
    return d["chromosome"].tolist()


def read_data(f, callset, chromosome=None):
    print("reading", f, "...", file=sys.stderr)
    data = pd.read_csv(
        f,
        sep="\t",
        dtype=str,
    )
    if chromosome is not None:
        data = data.loc[data["chromosome"] == chromosome]
    print(data.head(), file=sys.stderr)
    data = data.set_index(
        [
            "chromosome",
            "position",
            "ref_allele",
            "alt_allele",
            "true_genotype",
        ]
    )
    data.drop("class", axis="columns", inplace=True)
    assert (
        not data.index.duplicated().any()
    ), f"bug: not expecting any duplicates in FP/FN table {f}"
    data.columns = [callset]
    if snakemake.wildcards.classification == "fn":
        data.loc[:, callset] = "FN"
    return data


def get_idx_sorted_by_clustering(data):
    cluster_matrix = ward(pdist(~data.isna(), metric="hamming"))
    idx = leaves_list(cluster_matrix)
    return idx


chromosomes = sorted(
    {chrom for f in snakemake.input.tables for chrom in collect_chromosomes(f)}
)
if not chromosomes:
    chromosomes = [None]

n_written = 0

# process data for each chromosome separately and append to the same files
for i, chromosome in enumerate(chromosomes):
    if n_written > snakemake.params.max_entries:
        break
    data = pd.concat(
        [
            read_data(f, callset, chromosome)
            for f, callset in zip(snakemake.input.tables, snakemake.params.callsets)
        ],
        axis="columns",
    )

    data = data.loc[data.isna().sum(axis="columns").sort_values().index,]
    data = data.dropna(how="all")

    if not data.empty:
        idx_rows = get_idx_sorted_by_clustering(data)
        idx_cols = get_idx_sorted_by_clustering(data.T)
        data = data.iloc[idx_rows, idx_cols]

    label_df = pd.DataFrame(snakemake.params.labels)

    def store(data, output, label_idx=None):
        _label_df = label_df
        if label_idx is not None:
            _label_df = label_df.iloc[[label_idx]]
        # add labels
        index_cols = data.index.names
        cols = data.columns
        data = pd.concat([_label_df, data.reset_index()]).set_index(index_cols)
        # restore column order
        data = data[cols]

        if i == 0:
            mode = "w"
            header = True
        else:
            mode = "a"
            header = False
        data.to_csv(output, sep="\t", mode=mode, header=header)

    store(data, snakemake.output.main)
    n_written += len(data)

    label_df.index = snakemake.params.label_names

    os.makedirs(snakemake.output.dependency_sorting, exist_ok=True)
    # for each label, calculate mutual information and sort FP/FN entries in descending order
    for label_idx, label in enumerate(snakemake.params.label_names):
        outdata = data
        if not data.empty:
            # feature matrix: genotypes, transposed into samples x features,
            # replace NA with False (match) and any genotype with True (mismatch with truth).
            feature_matrix = data.reset_index(drop=True).T.copy()
            feature_matrix[~pd.isna(feature_matrix)] = True
            feature_matrix[pd.isna(feature_matrix)] = False

            # target vector: label values, converted into factors
            target_vector = label_df.loc[label]

            # ignore any NA in the target vector and correspondingly remove the rows in the feature matrix
            not_na_target_vector = target_vector[~pd.isna(target_vector)]
            feature_matrix = feature_matrix.loc[not_na_target_vector.index]

            # perform chi² test against each feature
            _, pvals = chi2(feature_matrix, not_na_target_vector)
            sorted_idx = np.argsort(pvals)
            _, fdr = fdrcorrection(
                pvals, method="negcorr"
            )  # use Benjamini/Yekutieli as variants might be both positively or negatively correlated

            # clone data
            sorted_data = data.copy(deep=True)
            # sort by label
            sorted_target_vector = target_vector.sort_values()
            sorted_data = sorted_data[sorted_target_vector.index]
            # add pvalue and FDR
            sorted_data.insert(0, "FDR dependency", np.around(fdr, 3))
            sorted_data.insert(0, "p-value dependency", np.around(pvals, 3))
            outdata = sorted_data.iloc[sorted_idx]
            # only keep the rather significant entries (but be a bit more permissive than 0.05)
            outdata = outdata.loc[outdata["p-value dependency"] <= 0.25]

        outpath = os.path.join(snakemake.output.dependency_sorting, f"{label}.tsv")
        store(outdata, outpath, label_idx=label_idx)
```
```python
import sys

sys.stderr = open(snakemake.log[0], "w")

import pandas as pd


def load_data(path, callset):
    d = pd.read_csv(path, sep="\t")
    d.insert(0, "callset", callset)
    return d


results = pd.concat(
    [
        load_data(f, callset)
        for f, callset in zip(snakemake.input.tables, snakemake.params.callsets)
    ],
    axis="rows",
)


def cov_key(cov_label):
    # return lower bound as integer for sorting
    if ".." in cov_label:
        return int(cov_label.split("..")[0])
    else:
        return int(cov_label[1:])


def sort_key(col):
    if col.name == "callset":
        return col
    else:
        return col.apply(cov_key)


results.sort_values(["callset", "coverage"], inplace=True, key=sort_key)

results["sort_index"] = results["coverage"].apply(cov_key)

results.to_csv(snakemake.output[0], sep="\t", index=False)
```
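The custom sort key is needed because coverage labels such as `10..20` or `≥30` would otherwise sort lexicographically (where `"10" < "5"`). A quick illustration of `cov_key`, assuming the label formats produced by the stratification script below:

```python
assert cov_key("10..20") == 10  # interval label: take the lower bound
assert cov_key("≥30") == 30     # open-ended label: strip the leading "≥"
```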
```python
import sys

sys.stderr = open(snakemake.log[0], "w")

import pandas as pd


def get_cov_label(coverage):
    lower = snakemake.params.coverage_lower_bounds[coverage]
    bounds = [
        bound
        for bound in snakemake.params.coverage_lower_bounds.values()
        if bound > lower
    ]
    if bounds:
        upper = min(bounds)
        return f"{lower}..{upper}"
    else:
        return f"≥{lower}"


def load_data(f, coverage):
    d = pd.read_csv(f, sep="\t")
    d.insert(0, "coverage", get_cov_label(coverage))
    return d


if snakemake.input:
    report = pd.concat(
        load_data(f, cov)
        for cov, f in zip(snakemake.params.coverages, snakemake.input)
    )

    if (report["tp_truth"] == 0).all():
        raise ValueError(
            f"The callset {snakemake.wildcards.callset} does not predict any variant from the truth. "
            "This is likely a technical issue in the callset and should be checked before further evaluation."
        )

    report.to_csv(snakemake.output[0], sep="\t", index=False)
else:
    pd.DataFrame(
        {
            col: []
            for col in [
                "coverage",
                "precision",
                "tp_query",
                "fp",
                "recall",
                "tp_truth",
                "fn",
                "genotype_mismatch_rate",
            ]
        }
    ).to_csv(snakemake.output[0], sep="\t")
```
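For example, with a hypothetical configuration `coverage_lower_bounds = {"low": 0, "mid": 10, "high": 30}`, the labels come out as:

```python
# get_cov_label("low")  -> "0..10"   (next larger bound becomes the upper limit)
# get_cov_label("mid")  -> "10..30"
# get_cov_label("high") -> "≥30"     (no larger bound exists)
```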
```python
from collections import defaultdict
import sys, os

# sys.path.insert(0, os.path.dirname(__file__))
sys.stderr = open(snakemake.log[0], "w")

import csv
import pysam

from common.classification import CompareExistence, Class, is_het

cmp = CompareExistence()

varfile = pysam.VariantFile(snakemake.input.calls)

with open(snakemake.output[0], "w", newline="") as outfile:
    writer = csv.writer(outfile, delimiter="\t")
    writer.writerow(
        [
            "class",
            "chromosome",
            "position",
            "ref_allele",
            "alt_allele",
            "true_genotype",
            "predicted_genotype",
        ]
    )
    for record in varfile:
        for c in cmp.classify(record):
            if c.cls == Class.FP and snakemake.wildcards.classification == "fp":
                classification = "FP"
                truth_gt = "0/0"
                query_gt = "0/1" if is_het(record, 1, c.variant) else "1/1"
            elif c.cls == Class.FN and snakemake.wildcards.classification == "fn":
                classification = "FN"
                truth_gt = "0/1" if is_het(record, 0, c.variant) else "1/1"
                query_gt = ""
            else:
                continue
            for alt in c.variant.alts:
                writer.writerow(
                    [
                        classification,
                        record.contig,
                        record.pos,
                        record.ref,
                        alt,
                        truth_gt,
                        query_gt,
                    ]
                )
```
```python
import sys

sys.stderr = open(snakemake.log[0], "w")

import json
from pysam import VariantFile

with VariantFile(snakemake.input[0]) as infile, open(
    snakemake.output[0], "w"
) as outfile:
    json.dump({"isempty": not any(infile)}, outfile)
```
Created: 1yr ago
Updated: 1yr ago
Maintainers: public
URL: https://github.com/snakemake-workflows/dna-seq-benchmark
Name: dna-seq-benchmark
Version: v1.8.3
Downloaded: 0
Copyright: Public Domain
License: MIT License