Snakemake workflow for evaluating varlociraptor

public 1yr ago 0 bookmarks

View Workflow

varlociraptor-evaluation — View Workflow

Help improve this workflow!

This workflow has been published but could be further improved with some additional meta data:

Keyword(s) in categories input, output, operation, topic

You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .

This Snakemake workflow automatically generates all results and figures from the initial Varlociraptor paper focusing on somatic variant calling .

Rerunning the workflow requires a lot of computation time and some unavoidable external resources that have to be manually deployed . We therefore hope that the Snakemake report in the supplementary material of the paper , providing all results together with comprehensive provenance information (workflow steps, parameters, software versions, code) will already yield sufficient information in most of the cases. If you nevertheless intend to rerun the analysis, feel free to follow the steps below, and please inform us about any potential issues.

General Requirements

Any 64-bit Linux installation with GLIBC 2.5 or newer (i.e. any Linux distribution that is newer than CentOS 6). Note that the restriction of this workflow to Linux is purely a design decision (to save space and ensure reproducibility) and not related to Conda/Bioconda. Bioconda packages are available for both Linux and MacOS in general.

Usage

This workflow can be used to recreate all results found in the paper.

Step 1: Setup system

Variant a: Installing Miniconda on your system

If you are on a Linux system with GLIBC 2.5 or newer (i.e. any Linux distribution that is newer than CentOS 6), you can simply install Miniconda3 with

curl -o /tmp/miniconda.sh https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh && bash /tmp/miniconda.sh

Make sure to answer yes to the question whether your PATH variable shall be modified. Afterwards, open a new shell/terminal.

Variant b: Use a Docker container

Otherwise, e.g., on MacOS or if you don't want to modify your system setup, install Docker , run

docker run -it continuumio/miniconda3 /bin/bash

and execute all the following steps within that container.

Variant c: Use an existing Miniconda installation

If you want to use an existing Miniconda installation, please be aware that this is only possible if it uses Python 3 by default. You can check this via

python --version

Further, ensure it is up to date with

conda update --all

Step 2: Setup Bioconda channel

Setup Bioconda with

conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda

Step 3: Install Snakemake

Install Snakemake >=5.4.0 with

conda install snakemake

If you already have an older version of Snakemake, please make sure it is updated to >=5.4.0.

Step 4: Download the workflow

First, create a working directory:

mkdir varlociraptor-workflow
cd varlociraptor-workflow

Then, download the workflow archive from https://doi.org/10.5281/zenodo.3361700 and unpack it with

tar -xf workflow.tar.gz

Step 5: Obtain additional resources

In this special case there are unfortunately unavoidable additional requirements, due to licensing restrictions and data size.

The required real data has to be obtained from EGA (EGAD00001002142) . After downloading it, edit the config.yaml in order to point to the right paths here and below .
The required simulated data has to be obtained from Zenodo: https://doi.org/10.5281/zenodo.1421298. Download it, convert back to BAM, and edit the config.yaml in order to point to the right paths here and here .
The synthetic data can be obtained by running a separate workflow: https://doi.org/10.5281/zenodo.3630241.

Once all data is obtained, edit the config.yaml under datasets: to point to the right file paths in your system.

Step 6: Run the workflow

Execute the analysis workflow with Snakemake

snakemake --use-conda

Please wait a few minutes for the analysis to finish. Results can be found in the folder figs/ . If you have been running the workflow in the docker container (see above), you can obtain the results with

docker cp <container-id>:/bioconda-workflow/figs .

whith <container-id> being the ID of the container.

Known errors

If you see an error like
```
ImportError: No module named 'appdirs'
```
when starting Snakemake, you are likely suffering from a bug in an older conda version. Make sure to update your conda installation with
```
conda update --all
```
and then reinstall the appdirs and snakemake package with
```
conda install -f appdirs snakemake
```
If you see an error like
```
ImportError: Missing required dependencies ['numpy']
```
you are likely suffering from a bug in an older conda version. Make sure to update your conda installation with
```
conda update --all
```
and then reinstall the snakemake package with
```
conda install -f snakemake
```

Code Snippets

__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"


from snakemake.shell import shell


shell(
    "bcftools concat {snakemake.params} -o {snakemake.output[0]} "
    "{snakemake.input}")

Python Snakemake BCFtools From line 1 of concat/wrapper.py

__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"


from snakemake.shell import shell


try:
    exclude = "-x " + snakemake.input.exclude
except AttributeError:
    exclude = ""


extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell(
    "OMP_NUM_THREADS={snakemake.threads} delly call {extra} "
    "{exclude} -t {snakemake.params.vartype} -g {snakemake.input.ref} "
    "-o {snakemake.output[0]} {snakemake.input.samples} {log}")

Python Snakemake delly From line 1 of delly/wrapper.py

__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"


from snakemake.shell import shell


shell(
    "bcftools concat {snakemake.params} -o {snakemake.output[0]} "
    "{snakemake.input}")

Python Snakemake BCFtools From line 1 of concat/wrapper.py

__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"


from snakemake.shell import shell


shell(
    "bcftools view {snakemake.params} {snakemake.input[0]} "
    "-o {snakemake.output[0]}")

Python Snakemake BCFtools From line 1 of view/wrapper.py

__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"


from snakemake.shell import shell


shell(
    "bcftools concat {snakemake.params} -o {snakemake.output[0]} "
    "{snakemake.input.calls}")

Python Snakemake BCFtools From line 1 of concat/wrapper.py

__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"


from snakemake.shell import shell


shell(
    "bcftools view {snakemake.params} {snakemake.input[0]} "
    "-o {snakemake.output[0]}")

Python Snakemake BCFtools From line 1 of view/wrapper.py

__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"


from snakemake.shell import shell


shell("picard MarkDuplicates {snakemake.params} INPUT={snakemake.input} "
      "OUTPUT={snakemake.output.bam} METRICS_FILE={snakemake.output.metrics} "
      "&> {snakemake.log}")

Python Snakemake Picard From line 1 of markduplicates/wrapper.py

__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"


from snakemake.shell import shell


shell("samtools index {snakemake.params} {snakemake.input[0]} {snakemake.output[0]}")

Python Snakemake SAMtools From line 1 of index/wrapper.py

__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"


import os
from snakemake.shell import shell


prefix = os.path.splitext(snakemake.output[0])[0]

shell(
    "samtools sort {snakemake.params} -@ {snakemake.threads} -o {snakemake.output[0]} "
    "-T {prefix} {snakemake.input[0]}")

Python Snakemake SAMtools From line 1 of sort/wrapper.py

__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "julianderuiter@gmail.com"
__license__ = "MIT"


from snakemake.shell import shell


extra = snakemake.params.get("extra", "")
region = snakemake.params.get("region", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)


shell("samtools stats {extra} {snakemake.input}"
      " {region} > {snakemake.output} {log}")

Python Snakemake SAMtools From line 3 of stats/wrapper.py

script:
    "../scripts/adhoc-calling.py"

SnakeMake From line 9 of rules/adhoc.smk

shell:
    "(break-point-inspector -vcf {input.manta} -ref {input.samples[1]} -tumor {input.samples[0]} "
    "-output_vcf {output}) > {log} 2>&1"

SnakeMake From line 14 of rules/bpi.smk

wrapper:
    "0.22.0/bio/bcftools/view"

SnakeMake From line 29 of rules/bpi.smk

wrapper:
    "0.22.0/bio/bcftools/view"

SnakeMake From line 40 of rules/bpi.smk

wrapper:
    "0.17.4/bio/delly"

SnakeMake delly From line 16 of rules/delly.smk

wrapper:
    "0.17.4/bio/bcftools/concat"

SnakeMake From line 27 of rules/delly.smk

run:
    with open(output[0], "w") as out:
        print("tumor", "tumor", file=out)
        print("normal", "control", file=out)

SnakeMake From line 34 of rules/delly.smk

shell:
    "delly filter -m 0 -r 1.0 --samples {input.samples} "
    "-o {params.tmp} {input.bcf}; "
    "bcftools view -i INFO/SOMATIC -f PASS -Ob {params.tmp} > {output}"

SnakeMake BCFtools delly From line 54 of rules/delly.smk

script:
    "../scripts/annotate-truth.py"

SnakeMake From line 14 of rules/eval.smk

shell:
    "bcftools view {params.regions} {input.calls} | rbt vcf-match {params.match} {input.truth} > {output}"

SnakeMake BCFtools rust-bio-tools From line 43 of rules/eval.smk

shell:
    "bcftools view {params.regions} {input.calls} | rbt vcf-match {params.match} {input.truth} > {output}"

SnakeMake BCFtools rust-bio-tools From line 59 of rules/eval.smk

shell:
    "rbt vcf-to-txt --genotypes --info SOMATIC SVLEN SVTYPE TAF NAF < {input} > {output}"

SnakeMake rust-bio-tools From line 70 of rules/eval.smk

shell:
    "rbt vcf-to-txt {params.gt} {params.tags} --info MATCHING < {input} > {output}"

SnakeMake rust-bio-tools From line 108 of rules/eval.smk

shell:
    "rbt vcf-to-txt {params.gt} {params.tags} < {input} > {output}"

SnakeMake rust-bio-tools From line 122 of rules/eval.smk

shell:
    "rbt vcf-to-txt {params.gt} {params.tags} --info MATCHING < {input} > {output}"

SnakeMake rust-bio-tools From line 136 of rules/eval.smk

script:
    "../scripts/obtain-tp-fp.py"

SnakeMake From line 154 of rules/eval.smk

script:
    "../scripts/plot-precision-recall.py"

SnakeMake From line 211 of rules/eval.smk

script:
    "../scripts/plot-allelefreq-recall.py"

SnakeMake From line 228 of rules/eval.smk

script:
    "../scripts/plot-fdr-control.py"

SnakeMake From line 245 of rules/eval.smk

script:
    "../scripts/plot-allelefreq-estimation.py"

SnakeMake From line 260 of rules/eval.smk

script:
    "../scripts/plot-allelefreq-scatter.py"

SnakeMake From line 275 of rules/eval.smk

script:
    "../scripts/plot-score-dist.py"

SnakeMake From line 289 of rules/eval.smk

script:
    "../scripts/bam-stats.py"

SnakeMake From line 300 of rules/eval.smk

shell:
    "rbt vcf-match {params.match} {params.bcfs[1]} < {params.bcfs[0]} > {output}"

SnakeMake rust-bio-tools From line 315 of rules/eval.smk

script:
    "../scripts/aggregate-concordance.py"

SnakeMake From line 336 of rules/eval.smk

shell:
    "rbt vcf-to-txt {params.gt} {params.tags} --info MATCHING < {input} > {output}"

SnakeMake rust-bio-tools From line 350 of rules/eval.smk

script:
    "../scripts/plot-concordance-upset.R"

SnakeMake From line 364 of rules/eval.smk

script:
    "../scripts/plot-concordance.py"

SnakeMake From line 382 of rules/eval.smk

script:
    "../scripts/plot-freqdist.py"

SnakeMake From line 393 of rules/eval.smk

shell:
    """
    cd resources
    curl -L https://github.com/nygenome/lancet/archive/{params.commit}.tar.gz | tar -xz
    cd lancet-{params.commit}
    sed -i 's/-Wl,-rpath,$(ABS_BAMTOOLS_DIR)\/lib\///' Makefile
    make clean
    prefix=$(dirname $(dirname $(which $CXX)))
    make CXX=$CXX INCLUDES="-I$prefix/include/ -I$prefix/include/bamtools -L$prefix/lib"
    cp lancet ..
    """

SnakeMake BamTools From line 8 of rules/lancet.smk

shell:
   "LD_LIBRARY_PATH=$CONDA_PREFIX/lib "
   "resources/lancet --tumor {input.bams[0]} --normal {input.bams[1]} "
   "--ref {input.ref} --reg {params.region} --num-threads {threads} "
   "{params.extra} > {output} 2> {log}"

SnakeMake From line 47 of rules/lancet.smk

shell:
    "sed -r 's/MS\=[0-9]+[ACGT]+/MS/g' {input.vcf} | bcftools annotate -o {output} -h {input.header} -"

SnakeMake BCFtools From line 62 of rules/lancet.smk

shell:
    "bcftools concat -Ob {input} > {output}"

SnakeMake BCFtools From line 73 of rules/lancet.smk

wrapper:
    "0.19.3/bio/bcftools/view"

SnakeMake From line 87 of rules/lancet.smk

shell:
    "rm -rf {params.dir}; "
    "(configManta.py {params.extra} --tumorBam {input.samples[0]} --normalBam {input.samples[1]} "
    "--referenceFasta {input.ref} --runDir {params.dir}; "
    "{params.dir}/runWorkflow.py -m local -j {threads}) > {log} 2>&1"

SnakeMake From line 20 of rules/manta.smk

wrapper:
    "0.22.0/bio/bcftools/view"

SnakeMake From line 34 of rules/manta.smk

wrapper:
    "0.22.0/bio/bcftools/view"

SnakeMake From line 45 of rules/manta.smk

wrapper:
    "0.22.0/bio/bcftools/view"

SnakeMake From line 59 of rules/manta.smk

wrapper:
    "0.22.0/bio/samtools/sort"

SnakeMake From line 15 of rules/mapping.smk

shell:
    "samtools bam2fq {input} -1 {output.m1} -2 {output.m2} -0 {output.mixed}"

SnakeMake SAMtools From line 28 of rules/mapping.smk

shell:
    "bwa index -p {params.prefix} {input}"

SnakeMake BWA From line 41 of rules/mapping.smk

shell:
    "(resources/bwa mem -t {threads} {params.extra} {params.index} {input.sample} | "
    "samtools view -Sb - > {output}) 2> {log}"

SnakeMake SAMtools BWA From line 72 of rules/mapping.smk

wrapper:
    "0.22.0/bio/samtools/sort"

SnakeMake From line 85 of rules/mapping.smk

wrapper:
    "0.22.0/bio/picard/markduplicates"

SnakeMake From line 99 of rules/mapping.smk

wrapper:
    "0.22.0/bio/samtools/index"

SnakeMake From line 108 of rules/mapping.smk

shell:
    "faidx --transform bed {input} > {output}"

SnakeMake From line 8 of rules/neusomatic.smk

shell:
    """
    (preprocess.py --mode call --reference {input.ref} \
                  --region_bed {input.bed} --tumor_bam {input.bams[0]} \
                  --normal_bam {input.bams[1]} --work {output.workdir} \
                  --min_mapq 10 --num_threads {threads} \
                  --scan_alignments_binary /opt/neusomatic/neusomatic/bin/scan_alignments

    call.py --candidates_tsv {output.workdir}/dataset/*/candidates*.tsv \
            --reference {input.ref} --out {output.workdir} \
            --checkpoint /opt/neusomatic/neusomatic/models/NeuSomatic_v0.1.4_standalone_SEQC-WGS-Spike.pth \
            --num_threads {threads} \
            --batch_size 100

    python `which postprocess.py` --reference {input.ref} --tumor_bam {input.bams[0]} \
                   --pred_vcf {output.workdir}/pred.vcf \
                   --candidates_vcf {output.workdir}/work_tumor/filtered_candidates.vcf \
                   --output_vcf {output.vcf} \
                   --work {output.workdir}) 2> {log}
    """

SnakeMake From line 29 of rules/neusomatic.smk

shell:
    "bcftools annotate -Ob -o {output} -h {input.header} {input.vcf}"

SnakeMake BCFtools From line 59 of rules/neusomatic.smk

wrapper:
    "0.19.3/bio/bcftools/view"

SnakeMake From line 74 of rules/neusomatic.smk

shell:
    "run_novobreak {input.ref} {input.bams[0]} {input.bams[1]} {threads} {output.workdir} > {log} 2>&1"

SnakeMake From line 15 of rules/novobreak.smk

shell:
    "bcftools view -Ob {input}/novoBreak.pass.flt.vcf > {output}"

SnakeMake BCFtools From line 26 of rules/novobreak.smk

shell:
    "bcftools concat -Ob {input}/split/*.sp.vcf > {output}"

SnakeMake BCFtools From line 37 of rules/novobreak.smk

wrapper:
    "0.22.0/bio/samtools/stats"

SnakeMake From line 11 of rules/stats.smk

shell:
    "samtools depth {input} | gzip > {output}"

SnakeMake SAMtools From line 24 of rules/stats.smk

shell:
    "rm -rf {params.dir}; "
    "(configureStrelkaSomaticWorkflow.py {params.extra} --tumorBam {input.samples[0]} --normalBam {input.samples[1]} "
    "--referenceFasta {input.ref} --runDir {params.dir} --indelCandidates {input.manta}; "
    "{params.dir}/runWorkflow.py -m local -j {threads}) > {log} 2>&1"

SnakeMake From line 20 of rules/strelka.smk

wrapper:
    "0.22.0/bio/bcftools/concat"

SnakeMake From line 35 of rules/strelka.smk

wrapper:
    "0.22.0/bio/bcftools/view"

SnakeMake From line 49 of rules/strelka.smk

shell:
    "cairosvg {input} -o {output}"

SnakeMake From line 8 of rules/tools.smk

shell:
    "bcftools view -Ou {input.calls} {params.chrom_prefix} | "
    "varlociraptor call variants {input.ref} "
    "{config[caller][varlociraptor][params]} {params.caller} "
    "tumor-normal {input.bams} "
    "--purity {params.purity} "
    "> {output} 2> {log}"

SnakeMake BCFtools Varlociraptor From line 39 of rules/varlociraptor.smk

wrapper:
    "0.19.1/bio/bcftools/concat"

SnakeMake From line 55 of rules/varlociraptor.smk

shell:
    "varlociraptor filter-calls posterior-odds positive --events SOMATIC_TUMOR < {input} > {output}"

SnakeMake Varlociraptor From line 66 of rules/varlociraptor.smk

shell:
    "varlociraptor filter-calls control-fdr {input} --events SOMATIC_TUMOR --var {wildcards.type} "
    "--minlen {wildcards.minlen} --maxlen {wildcards.maxlen} "
    "--fdr {wildcards.fdr} > {output}"

SnakeMake Varlociraptor From line 77 of rules/varlociraptor.smk

wrapper:
    "0.22.0/bio/bcftools/view"

SnakeMake From line 92 of rules/varlociraptor.smk

from cyvcf2 import VCF, Writer
import numpy as np

def get_sample_name(tissue):
    ds = snakemake.config["runs"][snakemake.wildcards.run]["dataset"]
    return snakemake.config["datasets"][ds][tissue]["name"]


tumor, normal = map(get_sample_name, ["tumor", "normal"])


bcf_in = VCF(snakemake.input.vcf)
bcf_out = Writer(snakemake.output[0], bcf_in)


for rec in bcf_in:
    if rec.FILTER:
        continue
    gt = rec.genotypes
    tumor_gt = gt[0][:2]
    normal_gt = gt[1][:2]
    if (np.any(tumor_gt) and
        not np.any(normal_gt) and
        not np.any(np.isnan(normal_gt))):
        # somatic variant
        bcf_out.write_record(rec)

bcf_out.close()

Python numpy cyvcf2 From line 1 of scripts/adhoc-calling.py

from common import load_variants
import networkx as nx
import pandas as pd
import numpy as np

vartype = snakemake.wildcards.vartype

index_cols = ["CHROM", "POS", "SVLEN"] if vartype == "INS" or vartype == "DEL" else ["CHROM", "POS", "ALT"]

all_variants = [load_variants(f, vartype=vartype) for f in snakemake.input.calls]

G = nx.Graph()
for calls, (i, j) in zip(all_variants, snakemake.params.dataset_combinations):
    calls["component"] = None
    for call in calls.itertuples():
        a = (i, call.Index)
        G.add_node(a)
        if call.MATCHING >= 0:
            b = (j, call.MATCHING)
            G.add_node(b)
            G.add_edge(a, b)

# get a set of calls for each dataset (we don't need all pairwise comparisons for that)
representatives = {snakemake.params.dataset_combinations[i][0]: calls for i, calls in enumerate(all_variants)}

if snakemake.wildcards.mode != "varlociraptor":
    varlociraptor_variants = [load_variants(f, vartype=vartype) for f in snakemake.input.varlociraptor_calls]
    for calls in varlociraptor_variants:
        calls.set_index(index_cols, inplace=True)
    varlociraptor_representatives = {snakemake.params.dataset_combinations[i][0]: calls for i, calls in enumerate(varlociraptor_variants)}

# annotate calls with their component, i.e. their equivalence class
for component_id, component in enumerate(nx.connected_components(G)):
    for i, k in component:
        representatives[i].loc[k, "component"] = component_id
for calls in representatives.values():
    calls["component"] = calls["component"].astype(np.float32)
    calls.set_index("component", inplace=True)

# join calls based on their equivalence class
aggregated = None
suffix = "_{}".format
dataset_name = lambda i: snakemake.params.datasets[i]
is_varlociraptor = False
for dataset_id, calls in representatives.items():
    cols = list(index_cols)
    if "CASE_AF" in calls.columns:
        cols.extend(["CASE_AF", "PROB_SOMATIC_TUMOR"])
        is_varlociraptor = True
    calls = calls[cols]
    if snakemake.wildcards.mode != "varlociraptor":
        idx_calls = calls.set_index(cols, drop=False)
        caseaf = idx_calls.join(varlociraptor_representatives[dataset_id][["CASE_AF"]], how="left")["CASE_AF"]
        caseaf = caseaf[~caseaf.index.duplicated()]
        calls = calls[~idx_calls.index.duplicated()]
        calls["CASE_AF"] = caseaf.values

    calls.columns = [c + suffix(dataset_name(dataset_id)) for c in calls.columns]
    if aggregated is None:
        aggregated = calls
    else:
        aggregated = aggregated.join(calls, how="outer", lsuffix="", rsuffix="")

# Forget the component id. Otherwise, we might run into errors with duplicate elements
# in the index below. These can occur if there are multiple ambiguous calls.
aggregated.reset_index(inplace=True, drop=True)

pos_cols = aggregated.columns[aggregated.columns.str.startswith("POS_")]
is_called = (~aggregated[pos_cols].isnull()).astype(int)
is_called.columns = pos_cols.str.replace("POS_", "")
aggregated = aggregated.join(is_called, lsuffix="", rsuffix="")

aggregated.insert(len(aggregated.columns), "concordance_count", is_called.sum(axis=1))

aggregated["max_case_af"] = aggregated[aggregated.columns[aggregated.columns.str.startswith("CASE_AF")]].max(axis=1)
if is_varlociraptor:
    aggregated["max_prob_somatic_tumor"] =  aggregated[aggregated.columns[aggregated.columns.str.startswith("PROB_SOMATIC")]].min(axis=1)

aggregated.to_csv(snakemake.output[0], sep="\t", index=False)

Python Pandas numpy common networkx Varlociraptor From line 1 of scripts/aggregate-concordance.py

from cyvcf2 import VCF, Writer
import numpy as np


def subclone_vaf(gt):
    """Calculate subclone allele frequency"""
    if np.all(gt[:2] == [1, 1]):
        return 1.0
    elif (np.all(gt[:2] == [0, 1]) or np.all(gt[:2] == [1, 0]) or
          np.all(gt[:2] == [-1, 1]) or np.all(gt[:2] == [1, -1])):
        return 0.5
    else:
        return 0.0


# Reader
vcf_in = VCF(snakemake.input[0])

# Setup subclone information
subclones = ["Som{}".format(i) for i in range(1, 5)]
fractions = [1/3, 1/3, 1/4, 1/12]


# Prepare writer
vcf_in.add_info_to_header({"ID": "TAF",
                           "Number": "1",
                           "Description": "True tumor allele frequency",
                           "Type": "Float"})
vcf_in.add_info_to_header({"ID": "NAF",
                           "Number": "1",
                           "Description": "True normal allele frequency",
                           "Type": "Float"})
bcf_out = Writer(snakemake.output[0], vcf_in)

for rec in vcf_in:
    if len(rec.ALT) > 1:
        raise ValueError("multiallelic sites are not supported at the moment")

    try:
        # get VAFs from VCF
        tumor_vaf = rec.INFO["TAF"]
        normal_vaf = rec.INFO["NAF"]
    except KeyError:
        # calculate VAFs
        subclone_idx = [vcf_in.samples.index(s) for s in subclones]
        control_idx = vcf_in.samples.index("Control")

        tumor_vaf = sum(fraction * subclone_vaf(rec.genotypes[idx])
                        for idx, fraction in zip(subclone_idx, fractions))
        normal_vaf = subclone_vaf(rec.genotypes[control_idx])

        rec.INFO["TAF"] = tumor_vaf
        rec.INFO["NAF"] = normal_vaf

    # only keep somatic variants
    if normal_vaf == 0.0 and tumor_vaf > 0.0:
        bcf_out.write_record(rec)

bcf_out.close()

Python numpy cyvcf2 From line 1 of scripts/annotate-truth.py

import sys
import numpy as np
import pandas as pd
import pysam
import matplotlib
matplotlib.use("agg")
import matplotlib.pyplot as plt
import seaborn as sns
from functools import partial

tumor = pysam.AlignmentFile(snakemake.input[0], "rb")
normal = pysam.AlignmentFile(snakemake.input[1], "rb")

softclips = []

for i, rec in enumerate(normal):
    if rec.is_supplementary or rec.is_unmapped:
        continue
    is_first_read = rec.pos < rec.mpos
    get_clip = lambda c: c[1] if c[0] == 4 else None
    clip_left = get_clip(rec.cigartuples[0])
    if clip_left is not None:
        softclips.append([clip_left, True, is_first_read])
    clip_right = get_clip(rec.cigartuples[-1])
    if clip_right is not None:
        softclips.append([clip_right, False, is_first_read])
    if i == 10000000:
        break

softclips = pd.DataFrame(softclips, columns=["len", "left", "first_in_pair"])

def plot(*args, **kwargs):
    softclips = args[0]
    plt.hist(softclips, normed=True)
    q95 = np.percentile(softclips, 99)
    plt.plot([q95, q95], [0, 1.0], "--k")
    m = max(softclips)
    plt.plot([m, m], [0, 1.0], ":k")
    plt.text(m, 1, "max={}".format(m), horizontalalignment="right", verticalalignment="top")


g = sns.FacetGrid(softclips, col="left", row="first_in_pair")
g = g.map(plot, "len")

plt.savefig(snakemake.output[0])

Python Pandas numpy matplotlib seaborn pysam From line 3 of scripts/bam-stats.py

import pandas as pd
import numpy as np
from common import load_variants


minlen = int(snakemake.wildcards.minlen)
maxlen = int(snakemake.wildcards.maxlen)
vartype = snakemake.wildcards.vartype

if snakemake.wildcards.mode == "varlociraptor":
    score = snakemake.config["caller"]["varlociraptor"]["score"]
    # calls are already filtered by FDR control step
    minlen = None
    maxlen = None
elif snakemake.wildcards.mode == "default":
    score = snakemake.config["caller"][snakemake.wildcards.caller]["score"]
else:
    score = None

calls = load_variants(snakemake.input.calls, vartype=vartype, minlen=minlen, maxlen=maxlen)

calls["is_tp"] = calls["MATCHING"] >= 0

calls["score"] = calls[score] if score else np.nan

calls.to_csv(snakemake.output[0], sep="\t")

Python Pandas numpy common Varlociraptor From line 1 of scripts/obtain-tp-fp.py

from itertools import product
import math
import matplotlib
matplotlib.use("agg")
from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd
import common
import numpy as np


MIN_CALLS = 10

vartype = snakemake.wildcards.vartype
colors = common.get_colors(snakemake.config)

truth = common.load_variants(snakemake.input.truth, vartype=vartype)

def props(callers):
    return product(callers, snakemake.params.len_ranges)

def plot_len_range(minlen, maxlen):
    def plot(calls, colors):
        calls = calls[calls.is_tp]
        true_af = truth.loc[calls.MATCHING].reset_index().TAF
        calls = calls.reset_index()
        calls["error"] = calls.CASE_AF - true_af

        if calls.empty:
            return

        calls["true_af"] = true_af
        true_af = pd.Series(calls["true_af"].unique()).sort_values()
        # standard deviation when sampling in binomial process from allele freq
        # this is the expected sampling error within the correctly mapped fragments
        # sd = true_af.apply(lambda af: 1 / 40 * math.sqrt(40 * af * (1 - af)))
        # x = np.arange(len(true_af))
        # offsets = [-0.5, 0.5]
        # y_upper = np.array([v for v in sd for o in offsets])
        #  y_lower = np.maximum(-y_upper, [-f for f in true_af for o in offsets])
        # plt.fill_between([v + o for v in x for o in offsets], y_lower, y_upper, color="#EEEEEE", zorder=-5)

        calls["true_af"] = calls["true_af"].apply("{:.3f}".format)

        size = 1 if maxlen == 30 else 2
        sns.stripplot("true_af", "error", hue="caller", data=calls, palette=colors, dodge=True, jitter=True, alpha=0.5, size=size, rasterized=True)
        sns.boxplot("true_af", "error", hue="caller", data=calls, color="white", fliersize=0, linewidth=1)

        handles, labels = plt.gca().get_legend_handles_labels()
        n = len(calls.caller.unique())

        plt.ylim((-1,1))
        plt.grid(axis="y", linestyle=":", color="grey")
        sns.despine()
        plt.xticks(rotation="vertical")
        ax = plt.gca()
        ax.legend().remove()

        return ax, handles[n:]

    all_calls, all_colors = load_calls(minlen, maxlen)
    return plot(all_calls, all_colors)

def load_calls(minlen, maxlen):
    all_calls = []
    all_colors = []
    for calls, (caller, len_range) in zip(snakemake.input.varlociraptor_calls, props(snakemake.params.varlociraptor_callers)):
        if len_range[0] != minlen and len_range[1] != maxlen:
            continue
        label = "varlociraptor+{}".format(caller)
        calls = pd.read_table(calls)
        calls["caller"] = label
        if not calls.empty:
            all_calls.append(calls)
            all_colors.append(colors[caller])

    all_calls = pd.concat(all_calls)
    return all_calls, all_colors

common.plot_ranges(
    snakemake.params.len_ranges,
    plot_len_range,
    xlabel="true allele frequency",
    ylabel="predicted - truth")

plt.savefig(snakemake.output[0], bbox_inches="tight")

Python Pandas numpy matplotlib seaborn common From line 1 of scripts/plot-allelefreq-estimation.py

from itertools import product
import matplotlib
matplotlib.use("agg")
from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd
import common
import numpy as np
import math
from matplotlib.lines import Line2D

MIN_CALLS = 10

vartype = snakemake.wildcards.vartype
colors = common.get_colors(snakemake.config)


def props(callers):
    return product(callers, snakemake.params.len_ranges)


def plot_len_range(minlen, maxlen):

    truth = common.load_variants(
        snakemake.input.truth, minlen, maxlen, vartype=vartype)

    afs = pd.Series(truth.TAF.unique()).sort_values()

    def plot(calls,
             label,
             color,
             varlociraptor=True,
             style="-.",
             markersize=4):
        calls = pd.read_table(calls, index_col=0)
        if len(calls) < 10:
            return
        if varlociraptor:
            phred = lambda p: -10 * math.log10(p)
            def calc_recall(p):
                c = calls[calls.score <= phred(p)]
                return [common.recall(c, truth[truth.TAF >= af]) for af in afs]

            return plt.fill_between(
                afs,
                calc_recall(0.98 if maxlen > 30 else 0.99),
                calc_recall(0.9),
                color=color,
                label=label,
                alpha=0.6)
        else:
            recall = [common.recall(calls, truth[truth.TAF >= af]) for af in afs]
            # plot a white background first to increase visibility
            plt.plot(afs, recall, "-", color="white", alpha=0.8)
            return plt.plot(
                afs,
                recall,
                style,
                color=color,
                label=label)[0]

    handles = []
    def register_handle(handle):
        if handle is not None:
            handles.append(handle)
    for calls, (caller,
                len_range) in zip(snakemake.input.varlociraptor_calls,
                                  props(snakemake.params.varlociraptor_callers)):
        if len_range[0] != minlen and len_range[1] != maxlen:
            continue
        label = "varlociraptor+{}".format(caller)
        handle = plot(calls, label, colors[caller], varlociraptor=True)
        register_handle(handle)
        #handles.append(Line2D([0], [0], color=colors[caller], label=label))

    for calls, (caller, len_range) in zip(snakemake.input.adhoc_calls,
                             props(snakemake.params.adhoc_callers)):
        if len_range[0] != minlen and len_range[1] != maxlen:
            continue
        color = colors[caller]
        handle = plot(calls, caller, color, style=":", varlociraptor=False)
        register_handle(handle)
        #handles.append(Line2D([0], [0], linestyle=":", color=color, label=caller))

    sns.despine()
    ax = plt.gca()
    return ax, handles


common.plot_ranges(
    snakemake.params.len_ranges,
    plot_len_range,
    xlabel="allele frequency",
    ylabel="recall")

plt.savefig(snakemake.output[0], bbox_inches="tight")

Python Pandas numpy matplotlib seaborn common From line 1 of scripts/plot-allelefreq-recall.py

import math
import matplotlib
matplotlib.use("agg")
from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd
import common
import numpy as np

MIN_COUNT = 20
MAX_DEPTH = 60

vartype = snakemake.wildcards.vartype
colors = common.get_colors(snakemake.config)

truth = common.load_variants(snakemake.input.truth, vartype=vartype)

all_calls = []
for caller, calls in zip(snakemake.params.callers, snakemake.input.calls):
    calls = pd.read_table(calls)
    calls.loc[:, "caller"] = caller
    all_calls.append(calls)
all_calls = pd.concat(all_calls)

def plot(af, _):
    constrain_lower = lambda error: np.maximum(error, -af)
    constrain_upper = lambda error: np.minimum(error, 1.0 - af)

    dp = all_calls["TUMOR_DP"]
    calls = all_calls[all_calls.is_tp]
    true_af = truth.loc[calls.MATCHING].reset_index().TAF
    calls = calls.reset_index()
    calls["true_af"] = true_af
    calls = calls[calls["true_af"] == af]
    calls["error"] = calls.CASE_AF - true_af

    sns.kdeplot(calls["TUMOR_DP"], calls["error"], cmap="Blues", n_levels=50, shade=True, alpha=0.7, shade_lowest=False) #alpha=0.5, clip=((0.0, 1.0), (0.0, af)))
    plt.plot(calls["TUMOR_DP"], calls["error"], ",", color="k", lw=0, alpha=1.0, rasterized=True)
    by_depth = calls.groupby("TUMOR_DP")["error"].describe().reset_index()
    by_depth["-std"] = constrain_lower(-by_depth["std"])
    by_depth["std"] = constrain_upper(by_depth["std"])
    by_depth = by_depth[by_depth["count"] >= MIN_COUNT]
    plt.plot(by_depth.TUMOR_DP, by_depth["std"], "--", color="k")
    plt.plot(by_depth.TUMOR_DP, by_depth["-std"], "--", color="k")
    plt.plot(by_depth.TUMOR_DP, by_depth["mean"], "-", color="k")

    depths = np.arange(0, MAX_DEPTH)
    # standard deviation when sampling in binomial process from allele freq
    # this is the expected sampling error within the correctly mapped fragments
    sd = np.array([1.0 / depth * math.sqrt(depth * af * (1.0 - af)) for depth in depths])
    plt.fill_between(depths, constrain_lower(-sd), constrain_upper(sd), color="grey", alpha=0.5)

    sns.despine()
    plt.xticks(rotation="vertical")
    ax = plt.gca()
    ax.legend().remove()
    handles, labels = ax.get_legend_handles_labels()
    plt.ylim((-1.0, 1.0))
    plt.xlim((0, MAX_DEPTH))

    return ax, []

afs = [(af, af) for af in truth.TAF.sort_values().unique()]

common.plot_ranges(
    afs,
    plot,
    "depth",
    "predicted - truth")

plt.savefig(snakemake.output[0], bbox_inches="tight")

Python Pandas numpy matplotlib seaborn common From line 1 of scripts/plot-allelefreq-scatter.py

from itertools import product
import matplotlib
matplotlib.use("agg")
from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd
import common
import numpy as np
import math
from matplotlib.lines import Line2D
from matplotlib.colors import to_rgba


class NotEnoughObservationsException(Exception):
    pass


MIN_CALLS = 20
MAX_LEN = 1000

vartype = snakemake.wildcards.vartype
colors = common.get_colors(snakemake.config)



varlociraptor_calls_low = [pd.read_table(f) for f in snakemake.input.varlociraptor_calls_low]
varlociraptor_calls_high = [pd.read_table(f) for f in snakemake.input.varlociraptor_calls_high]
adhoc_calls = [pd.read_table(f) for f in snakemake.input.adhoc_calls]


def expected_count(af, effective_mutation_rate):
    """Calculate the expected number of somatic variants
       greater than a given allele frequency given an effective mutation
       rate, according to the model of Williams et al. Nature 
       Genetics 2016"""
    return effective_mutation_rate * (1.0 / af - 1.0)


def expected_counts(afs, effective_mutation_rate):
    return [expected_count(af, effective_mutation_rate) for af in afs]


def calc_concordance(calls):
    n = len(calls)
    return (calls["concordance_count"] > 1).sum() / n


def plot_len_range(minlen, maxlen, yfunc=None, yscale=None, upper_bound=None):
    handles_varlociraptor = []
    handles_adhoc = []
    for i, caller in enumerate(snakemake.params.callers):
        def plot_calls(calls, label, color, style, calls_lower=None):
            def get_xy(calls, caseafs=None):
                svlen = calls.loc[:, calls.columns.str.startswith("SVLEN")].abs()
                # at least one of the calls has a valid svlen
                valid = ((svlen >= minlen) & (svlen <= maxlen)).sum(axis=1) >= 1
                calls = calls[valid]
                if caseafs is None:
                    caseafs = calls["max_case_af"].dropna().unique()
                y = []
                _caseafs = []
                for caseaf in sorted(caseafs):
                    _calls = calls[calls["max_case_af"] >= caseaf]
                    if upper_bound is not None:
                        _calls = _calls[_calls["max_case_af"] <= caseaf + upper_bound]
                    if len(_calls) < MIN_CALLS:
                        continue
                    _caseafs.append(caseaf)
                    y.append(yfunc(_calls))
                return _caseafs, y

            x, y = get_xy(calls)
            if not x:
                raise NotEnoughObservationsException()
            if calls_lower is not None:
                _, y2 = get_xy(calls_lower, caseafs=x)
                return plt.fill_between(x, y, y2, label=label, edgecolor=color, facecolor=to_rgba(color, alpha=0.2))
            else:
                if style != "-":
                    plt.plot(x, y, "-", color="white", alpha=0.8)
                return plt.plot(x, y, style, label=label, color=color)[0]

        color = colors[snakemake.params.callers[i]]
        try:
            handles_varlociraptor.append(
                plot_calls(
                    varlociraptor_calls_high[i], 
                    "varlociraptor+{}".format(caller), 
                    color=color, style="-", 
                    calls_lower=varlociraptor_calls_low[i]))
        except NotEnoughObservationsException:
            # skip plot
            pass
        try:
            handles_adhoc.append(plot_calls(adhoc_calls[i], caller, color=color, style=":"))
        except NotEnoughObservationsException:
            # skip plot
            pass

    handles = handles_varlociraptor + handles_adhoc
    sns.despine()
    ax = plt.gca()
    if yscale is not None:
        ax.set_yscale(yscale)
    return ax, handles

plt.figure(figsize=(10, 4))
plt.subplot(121)
plot_len_range(1, MAX_LEN, yfunc=calc_concordance)
plt.xlabel("$\geq$ tumor allele frequency")
plt.ylabel("concordance")

plt.subplot(122)
for effective_mutation_rate in 10 ** np.linspace(1, 5, 7):
    afs = np.linspace(0.0, 1.0, 100, endpoint=False)
    plt.semilogy(afs, expected_counts(afs, effective_mutation_rate), "-", color="grey", alpha=0.4)

ax, handles = plot_len_range(1, MAX_LEN, yfunc=lambda calls: len(calls), yscale="log")

plt.xlabel("$\geq$ tumor allele frequency")
plt.ylabel("# of calls")

ax.legend(handles=handles, loc="upper left", bbox_to_anchor=(1.0, 1.0))

plt.tight_layout()

plt.savefig(snakemake.output[0], bbox_inches="tight")

Python Pandas numpy matplotlib seaborn common From line 1 of scripts/plot-concordance.py

library(UpSetR)
library(plyr)
library(ggplot2)
library(grid)
library(rlist)

calls <- read.table(snakemake@input[[1]], row.names = NULL, sep = "\t", header = TRUE, check.names = FALSE)
#print(dim(calls))
calls <- calls[calls$max_case_af >= 0.1, ]
#calls <- calls[calls$max_prob_somatic_tumor < 0.07]_
#calls$max_case_af_log10 <- log10(calls$max_case_af)

queries <- list()
for(ds in snakemake@params[["datasets"]]) {
    queries <- list.append(queries, list(query=intersects, params=list(ds)))
}


svg(file = snakemake@output[[1]])
if(!empty(calls)) {
    upset(calls, 
          order.by = "freq", 
          sets = snakemake@params[["datasets"]],
          queries = queries
    )
}
dev.off()

R ggplot2 plyr UpSetR rlist From line 1 of scripts/plot-concordance-upset.R

from itertools import product
import matplotlib
matplotlib.use("agg")
from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd
import common
import numpy as np


MIN_CALLS = 100

colors = common.get_colors(snakemake.config)

props = product(snakemake.params.callers,
                snakemake.params.len_ranges, snakemake.params.fdrs)

calls = []

for _calls, (caller, len_range, fdr) in zip(snakemake.input.varlociraptor_calls, props):
    calls.append({"caller": caller, "len_range": len_range, "fdr": float(fdr), "calls": _calls})

calls = pd.DataFrame(calls)
calls = calls.set_index("caller", drop=False)


def plot_len_range(minlen, maxlen):

    def plot(caller):
        color = colors[caller]
        label = "varlociraptor+{}".format(caller)
        fdrs = []
        alphas = []
        calls_ = calls.loc[caller]
        calls_ = calls_[calls_["len_range"].map(lambda r: r == [minlen, maxlen])]
        calls_ = calls_.sort_values("fdr")
        for e in calls_.itertuples():
            c = pd.read_table(e.calls)
            n = c.shape[0]
            if n < MIN_CALLS:
                continue
            true_fdr = 1.0 - common.precision(c)
            if fdrs and fdrs[-1] == true_fdr:
                continue
            fdrs.append(true_fdr)
            alphas.append(e.fdr)
        plt.plot(alphas, fdrs, ".-", color=color, label=label)


    for caller in calls.index.unique():
        plot(caller)

    plt.plot([0, 1], [0, 1], ":", color="grey")

    sns.despine()
    ax = plt.gca()
    handles, _ = ax.get_legend_handles_labels()
    return ax, handles

common.plot_ranges(
    snakemake.params.len_ranges,
    plot_len_range,
    xlabel="FDR threshold",
    ylabel="true FDR")

plt.savefig(snakemake.output[0], bbox_inches="tight")

Python Pandas numpy matplotlib seaborn common From line 1 of scripts/plot-fdr-control.py

from pysam import VariantFile
import seaborn as sns
import matplotlib.pyplot as plt
from itertools import islice
import pandas as pd


sns.set_style("ticks")


calls = VariantFile(snakemake.input[0])


def freq(sample):
    ref, alt = sample.get("AD")
    if alt == 0:
        return 0
    return alt / (ref + alt)


normal_freqs = []
tumor_freqs = []
vartypes = []

print("collecting calls")
for record in calls:
    if not record.info.get("SHARED"):
        continue
    normal_freqs.append(freq(record.samples.get("normal")))
    tumor_freqs.append(freq(record.samples.get("tumor")))
    vartypes.append(record.info.get("TYPE"))

d = pd.DataFrame({"normal": normal_freqs, "tumor": tumor_freqs, "type": vartypes})

print("subsampling")
# sample d to not get overwhelmed
d = d.sample(10000, random_state=245746)

print("plotting and density estimation")
g = sns.FacetGrid(col="type", data=d)
g.map(sns.kdeplot, "normal", "tumor", shade=True, clip=(0, 1), shade_lowest=False)
g.map(sns.scatterplot, "normal", "tumor", size=1, alpha=0.5, marker=".")

plt.savefig(snakemake.output[0], bbox_inches="tight")

Python Pandas matplotlib seaborn pysam From line 1 of scripts/plot-freqdist.py

from itertools import product
from functools import partial
import matplotlib
matplotlib.use("agg")
from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd
import common
import numpy as np
import math
from matplotlib.lines import Line2D

vartype = snakemake.wildcards.vartype
colors = common.get_colors(snakemake.config)


def props(callers):
    return product(callers, snakemake.params.len_ranges)


def plot_len_range(minlen, maxlen, min_precision=0.0):

    truth = common.load_variants(
        snakemake.input.truth, minlen, maxlen, vartype=vartype)

    def plot(calls,
             label,
             color,
             line=True,
             style="-",
             invert=False,
             markersize=4,
             endmarker=False):
        calls = pd.read_table(calls, index_col=0)
        if len(calls) < 10:
            return
        if line:
            thresholds = calls.score.quantile(np.linspace(0.0, 1.0, 50))
            precision = []
            recall = []
            for t in thresholds:
                if invert:
                    c = calls[calls.score >= t]
                else:
                    c = calls[calls.score <= t]
                p = common.precision(c)
                r = common.recall(c, truth)
                print(label, t, c.shape[0], p, r)
                if len(c) < 10:
                    print("skipping threshold: too few calls", c)
                    continue
                precision.append(p)
                recall.append(r)
            if len(precision) <= 2:
                print("skipping curve because we have too few values")
                return
        else:
            precision = [common.precision(calls)]
            recall = [common.recall(calls, truth)]
            style = "."
            print(label, calls.shape[0], precision, recall)

        plt.plot(
            recall,
            precision,
            style,
            color=color,
            label=label,
            markersize=markersize
        )
        if endmarker:
            plt.plot(recall[-1], precision[-1], "s", color=color, markersize=markersize)

    handles = []
    for calls, (caller,
                len_range) in zip(snakemake.input.varlociraptor_calls,
                                  props(snakemake.params.varlociraptor_callers)):
        if len_range[0] != minlen and len_range[1] != maxlen:
            continue
        label = "varlociraptor+{}".format(caller)
        plot(calls, label, colors[caller], endmarker=True)
        handles.append(Line2D([0], [0], color=colors[caller], label=label))

    for calls, (caller,
                len_range) in zip(snakemake.input.default_calls,
                                  props(snakemake.params.default_callers)):
        if len_range[0] != minlen and len_range[1] != maxlen:
            continue
        color = colors[caller]
        plot(
            calls,
            caller,
            color,
            style=":",
            invert=snakemake.config["caller"][caller].get("invert", False))
        if caller in snakemake.params.adhoc_callers:
            handles.append(Line2D([0], [0], markersize=10, markerfacecolor=color, markeredgecolor=color, color=color, label=caller, marker=".", linestyle=":"))
        else:
            handles.append(Line2D([0], [0], color=color, label=caller, linestyle=":"))

    for calls, (caller, len_range) in zip(snakemake.input.adhoc_calls,
                             props(snakemake.params.adhoc_callers)):
        if len_range[0] != minlen and len_range[1] != maxlen:
            continue
        color = colors[caller]
        plot(calls, caller, color, markersize=10, line=False)
        if caller not in snakemake.params.default_callers:
            handles.append(Line2D([0], [0], markersize=10, markerfacecolor=color, markeredgecolor=color, label=caller, marker=".", lw=0))

    sns.despine()
    ax = plt.gca()
    plt.ylim((min_precision, 1.01 if min_precision == 0.0 else 1.001))
    return ax, handles


plot = plot_len_range
fig_height = None
legend_outside = snakemake.params.legend_outside
if snakemake.wildcards.zoom == "zoom":
    plot = partial(plot_len_range, min_precision=0.99 if vartype == "INS" else 0.95)
    fig_height = 3
    legend_outside = True


common.plot_ranges(
    snakemake.params.len_ranges,
    plot,
    xlabel="recall",
    ylabel="precision",
    fig_height=fig_height,
    legend_outside=legend_outside,
)

plt.savefig(snakemake.output[0], bbox_inches="tight")

Python Pandas numpy matplotlib seaborn common From line 1 of scripts/plot-precision-recall.py

from itertools import product
import matplotlib
matplotlib.use("agg")
from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd
import common
import numpy as np
import math

vartype = snakemake.wildcards.vartype
colors = common.get_colors(snakemake.config)

def props(callers):
    return product(callers, snakemake.params.len_ranges)

phred_to_log_factor = -0.23025850929940456
log_to_phred_factor = -4.3429448190325175

def plot_len_range(minlen, maxlen):
    for calls, (caller, len_range) in zip(snakemake.input.varlociraptor_calls, props(snakemake.params.varlociraptor_callers)):
        if len_range[0] != minlen and len_range[1] != maxlen:
            continue
        label = "varlociraptor+{}".format(caller)
        calls = pd.read_table(calls)
        calls["caller"] = label
        if not calls.empty:
            color = colors[caller]
            sns.kdeplot(calls[calls.is_tp].PROB_SOMATIC_TUMOR.map(np.log), color=color, label=label)
            sns.kdeplot(calls[~calls.is_tp].PROB_SOMATIC_TUMOR.map(np.log), color=color, linestyle=":", label="")

    ax = plt.gca()
    fmt_ticks = lambda ticks: ["{:.1g}".format(np.exp(t)) for t in ticks]
    ax.set_xticklabels(fmt_ticks(plt.xticks()[0]))
    ax.legend().remove()
    handles, _ = ax.get_legend_handles_labels()
    sns.despine()

    return ax, handles

common.plot_ranges(
    snakemake.params.len_ranges,
    plot_len_range,
    xlabel=r"$-10 \log_{10}$ Pr(somatic)",
    ylabel="density")

plt.savefig(snakemake.output[0], bbox_inches="tight")

Python Pandas numpy matplotlib seaborn common From line 1 of scripts/plot-score-dist.py

shell:
    "bcftools index {input}"

SnakeMake BCFtools From line 114 of master/Snakefile

run:
    calls = pd.read_table(input[0], header=[0, 1])
    calls = calls["VARIANT"]
    calls = calls[(calls.MATCHING >= 0 if params.type else calls.MATCHING < 0)]
    calls.sort_values("PROB_SOMATIC_TUMOR", ascending=not params.type, inplace=True)
    calls[["CHROM", "POS", "PROB_SOMATIC_TUMOR", "MATCHING"]].to_csv(output[0], sep="\t")

SnakeMake From line 125 of master/Snakefile

shell:
    "bcftools view {input.bcf} {wildcards.varpos} > {output.vcf}; "
    "samtools view -b {input.bams[0]} {params.region} > {output.tumor}; "
    "samtools view -b {input.bams[1]} {params.region} > {output.normal}; "
    "samtools index {output.tumor}; samtools index {output.normal}"