Snakemake-based workflow for detecting structural variants in genomic data
Structural variants (SVs) are an important class of genetic variation implicated in a wide array of genetic diseases. sv-callers is a Snakemake-based workflow that combines several state-of-the-art tools for detecting SVs in whole-genome sequencing (WGS) data. The workflow is easy to use and deploy on any Linux-based machine. In particular, it supports automated software deployment, easy configuration and the addition of new analysis tools, and it scales from a single computer to different HPC clusters with minimal effort.
Dependencies
- Conda - package/environment management system
- Snakemake - workflow management system
- Xenon CLI - command-line interface to compute and storage resources
- jq - command-line JSON processor (optional)
- YAtiML - library for YAML type inference and schema validation
The workflow includes the following bioinformatics tools: the SV callers DELLY, GRIDSS, LUMPY and Manta, plus SURVIVOR and Viola for post-processing (filtering and merging) of the resulting calls (see the code snippets below).
The software dependencies can be found in the conda environment files: [1], [2], [3].
1. Clone this repo.
git clone https://github.com/GooglingTheCancerGenome/sv-callers.git
cd sv-callers
2. Install dependencies.
# download Miniconda3 installer
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
# install Conda (respond with 'yes')
bash miniconda.sh
# update Conda
conda update -y conda
# install Mamba
conda install -n base -c conda-forge -y mamba
# create a new environment with dependencies & activate it
mamba env create -n wf -f environment.yaml
conda activate wf
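To verify the installation, one can query the tool versions in the active environment (a minimal sanity check; the exact version numbers will vary):
# confirm that the main dependencies are available
conda --version
snakemake --version
jq --version  # optional dependency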
3. Configure the workflow.
- config files:
  - analysis.yaml - analysis-specific settings (e.g., workflow mode, I/O files, SV callers, post-processing and resources used)
  - samples.csv - list of (paired) samples
- input files (a quick existence check is sketched after this list):
  - example data in the workflow/data directory
  - reference genome in .fasta (incl. index files)
  - excluded regions in .bed (optional)
  - WGS samples in .bam (incl. index files)
- output files:
  - (filtered) SVs per caller and merged calls in .vcf (incl. index files)
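Before a real run, it helps to confirm that each input file and its index are in place. The sketch below uses placeholder paths (ref.fasta, sample.bam), which must be adjusted to your own data layout:
# placeholder paths - substitute your actual reference and sample files
REF="ref.fasta"
BAM="sample.bam"

# reference genome plus FASTA index, WGS sample plus BAM index
for f in "${REF}" "${REF}.fai" "${BAM}" "${BAM}.bai"; do
    [ -f "$f" ] || echo "missing: $f"
done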
4. Execute the workflow.
cd workflow
Locally
# 'dry' run only checks I/O files
snakemake -np
# 'vanilla' run: if echo_run is set to 1 (default) in analysis.yaml,
# it merely mimics the execution of SV callers by writing (dummy) VCF files;
# real SV calling is performed if echo_run is set to 0
snakemake --use-conda --jobs
Submit jobs to a Slurm or GridEngine cluster
SCH=slurm # or gridengine
snakemake --use-conda --latency-wait 30 --jobs \
--cluster "xenon scheduler $SCH --location local:// submit --name smk.{rule} --inherit-env --cores-per-task {threads} --max-run-time 1 --max-memory {resources.mem_mb} --working-directory . --stderr stderr-%j.log --stdout stdout-%j.log" &>smk.log&
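The submission command above runs in the background and writes Snakemake's output to smk.log, so progress can be followed with standard tools, for example:
# follow the workflow log
tail -f smk.log

# list your pending/running cluster jobs (Slurm; use qstat -u $USER on GridEngine)
squeue -u $USER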
Note: one sample or a tumor/normal pair generates a total of 18 SV calling and post-processing jobs. See the workflow instance of a single-sample (germline) or paired-sample (somatic) analysis.
To perform SV calling:
- edit (default) parameters in analysis.yaml:
  - set echo_run to 0 (see the one-liner after this list)
  - choose between two workflow modes: single-sample (s) or paired-sample (p - default)
  - select one or more callers using enable_callers (default: all)
- use the xenon CLI to set:
  - --max-run-time of workflow jobs (in minutes)
  - --temp-space (optional, in MB)
- adjust compute requirements per SV caller according to the system used:
  - the number of threads,
  - the amount of memory (in MB),
  - the amount of temporary disk space or tmpspace (path in the TMPDIR env variable); tmpspace can be used for intermediate files by LUMPY and GRIDSS only.
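For example, assuming echo_run appears as a top-level `echo_run: 1` entry in analysis.yaml, switching to real SV calling could be done in place with a one-liner (hypothetical layout; verify the key's exact location in your copy of the file):
# set echo_run from 1 (dummy run) to 0 (real SV calling)
sed -i 's/^echo_run: 1/echo_run: 0/' analysis.yaml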
Query job accounting information
SCH=slurm # or gridengine
xenon --json scheduler $SCH --location local:// list --identifier [jobID] | jq ...
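For instance, to pretty-print the complete JSON record of a job (12345 is a placeholder job ID; jq '.' simply formats the output):
SCH=slurm  # or gridengine
xenon --json scheduler $SCH --location local:// list --identifier 12345 | jq '.'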
Code Snippets
# DELLY: SV calling and somatic filtering on a tumor/normal pair
shell:
    """
    set -xe
    OUTDIR="$(dirname "{output}")"
    PREFIX="$(basename "{output}" .bcf)"
    OUTFILE="${{OUTDIR}}/${{PREFIX}}.unfiltered.bcf"
    TSV="${{OUTDIR}}/sample_pairs.tsv"

    # fetch sample ID from a BAM file
    function get_samp_id() {{
        echo "$(samtools view -H ${{1}} | \
            perl -lne 'print ${{1}} if /\sSM:(\S+)/' | \
            head -n 1)"
    }}

    # run dummy or real job
    if [ "{config.echo_run}" -eq "1" ]; then
        echo "{input}" > "{output}"
    else
        # use OpenMP for threaded jobs
        export OMP_NUM_THREADS={threads}
        #export OMP_PROC_BIND=true
        #export OMP_PLACES=threads

        # SV calling
        delly call \
            -t "{wildcards.sv_type}" \
            -g "{input.fasta}" \
            -o "${{OUTFILE}}" \
            -q 1 `# min. paired-end mapping quality` \
            -s 9 `# insert size cutoff, DELs only` \
            {params.excl_opt} \
            "{input.tumor_bam}" "{input.normal_bam}"

        # somatic + SV quality filtering
        # create sample list
        TID=$(get_samp_id "{input.tumor_bam}")
        CID=$(get_samp_id "{input.normal_bam}")
        printf "${{TID}}\ttumor\n${{CID}}\tcontrol\n" > ${{TSV}}

        delly filter \
            -f somatic \
            -t "{wildcards.sv_type}" \
            -p \
            -s "${{TSV}}" \
            -o "{output}" \
            "${{OUTFILE}}"
    fi
    """
# DELLY: SV calling and quality filtering on a single sample
shell:
    """
    set -xe
    OUTDIR="$(dirname "{output}")"
    PREFIX="$(basename "{output}" .bcf)"
    OUTFILE="${{OUTDIR}}/${{PREFIX}}.unfiltered.bcf"

    # run dummy or real job
    if [ "{config.echo_run}" -eq "1" ]; then
        echo "{input}" > "{output}"
    else
        # use OpenMP for threaded jobs
        export OMP_NUM_THREADS={threads}

        # SV calling
        delly call \
            -t "{wildcards.sv_type}" \
            -g "{input.fasta}" \
            -o "${{OUTFILE}}" \
            -q 1 `# min. paired-end mapping quality` \
            -s 9 `# insert size cutoff, DELs only` \
            {params.excl_opt} \
            "{input.bam}"

        # SV quality filtering
        bcftools filter \
            -O b `# compressed BCF format` \
            -o "{output}" \
            -i "FILTER == 'PASS'" \
            "${{OUTFILE}}"

        # index BCF file
        bcftools index "{output}"
    fi
    """
# concatenate the per-SV-type BCF files (from DELLY) into a single VCF
shell:
    """
    set -x

    # run dummy or real job
    if [ "{config.echo_run}" -eq "1" ]; then
        cat {input} > "{output}"
    else
        # concatenate rather than merge BCF files
        bcftools concat \
            -a `# allow overlaps` \
            -O v `# uncompressed VCF format` \
            -o "{output}" \
            {input}
    fi
    """
# GRIDSS: SV calling and somatic filtering on a tumor/normal pair
shell:
    """
    set -xe

    # if 'tmpspace' set to >0MB use TMPDIR otherwise use OUTDIR
    OUTDIR="$(dirname "{output}")"
    PREFIX="$(basename "{output}" .vcf)"
    OUTFILE="${{OUTDIR}}/${{PREFIX}}.unfiltered.vcf"
    TMP=$([ "{resources.tmp_mb}" -eq "0" ] && echo "${{OUTDIR}}" || echo "${{TMPDIR}}")

    # set JVM max. heap size dynamically (in GB)
    # N.B. don't allocate >31G due to Compressed Oops and JDK-8029679
    MAX_HEAP=$(LC_ALL=C printf "%.f" $(bc <<< "scale=2; \
        {resources.mem_mb} / 1024 * .8")) # max. 80% of requested mem
    MAX_HEAP=$([ "${{MAX_HEAP}}" -gt "31" ] && echo "31g" || echo "${{MAX_HEAP}}g")
    export _JAVA_OPTIONS="-Djava.io.tmpdir=${{TMP}} -Xmx${{MAX_HEAP}}"

    # run dummy or real job
    if [ "{config.echo_run}" -eq "1" ]; then
        echo "{input}" "${{TMP}}" > "{output}"
    else
        # clean-up outdir prior to SV calling
        rm -fr ${{OUTDIR}}/*gridss*

        gridss gridss.CallVariants \
            WORKER_THREADS={threads} \
            REFERENCE_SEQUENCE="{input.fasta}" \
            {params.excl_opt} \
            INPUT="{input.normal_bam}" \
            INPUT="{input.tumor_bam}" \
            OUTPUT="${{OUTFILE}}" \
            ASSEMBLY="${{OUTDIR}}/gridss_assembly.bam" \
            WORKING_DIR="${{TMP}}" \
            TMP_DIR="${{TMP}}/gridss.${{RANDOM}}"

        # somatic + SV quality filtering
        # 'normal' sample assumes index 0
        bcftools filter \
            -O v `# uncompressed VCF format` \
            -o "{output}" \
            -i "FORMAT/QUAL[0] == 0 && FILTER == '.'" \
            "${{OUTFILE}}"
    fi
    """
# GRIDSS: SV calling and quality filtering on a single sample
shell:
    """
    set -xe

    # if 'tmpspace' set to >0MB use TMPDIR otherwise use OUTDIR
    OUTDIR="$(dirname "{output}")"
    PREFIX="$(basename "{output}" .vcf)"
    OUTFILE="${{OUTDIR}}/${{PREFIX}}.unfiltered.vcf"
    TMP=$([ "{resources.tmp_mb}" -eq "0" ] && echo "${{OUTDIR}}" || echo "${{TMPDIR}}")

    # set JVM max. heap size dynamically (in GB)
    # N.B. don't allocate >31G due to Compressed Oops and JDK-8029679
    MAX_HEAP=$(LC_ALL=C printf "%.f" $(bc <<< "scale=2; \
        {resources.mem_mb} / 1024 * .8")) # max. 80% of requested mem
    MAX_HEAP=$([ "${{MAX_HEAP}}" -gt "31" ] && echo "31g" || echo "${{MAX_HEAP}}g")
    export _JAVA_OPTIONS="-Djava.io.tmpdir=${{TMP}} -Xmx${{MAX_HEAP}}"

    # run dummy or real job
    if [ "{config.echo_run}" -eq "1" ]; then
        echo "{input}" "${{TMP}}" > "{output}"
    else
        # clean-up outdir prior to SV calling
        rm -fr ${{OUTDIR}}/*gridss* &&
        gridss gridss.CallVariants \
            WORKER_THREADS={threads} \
            REFERENCE_SEQUENCE="{input.fasta}" \
            {params.excl_opt} \
            INPUT="{input.bam}" \
            OUTPUT="${{OUTFILE}}" \
            ASSEMBLY="${{OUTDIR}}/gridss_assembly.bam" \
            WORKING_DIR="${{TMP}}" \
            TMP_DIR="${{TMP}}/gridss.${{RANDOM}}"

        # SV quality filtering
        bcftools filter \
            -O v `# uncompressed VCF format` \
            -o "{output}" \
            -i "FILTER == '.'" \
            "${{OUTFILE}}"
    fi
    """
# LUMPY: SV calling and somatic filtering on a tumor/normal pair
shell:
    """
    set -xe

    # if 'tmpspace' set to >0MB use TMPDIR otherwise use OUTDIR
    OUTDIR="$(dirname "{output}")"
    PREFIX="$(basename "{output}" .vcf)"
    OUTFILE="${{OUTDIR}}/${{PREFIX}}.unfiltered.vcf"
    TMP=$([ "{resources.tmp_mb}" -eq "0" ] && echo "${{OUTDIR}}" || echo "${{TMPDIR}}")

    # run dummy or real job
    if [ "{config.echo_run}" -eq "1" ]; then
        echo "{input}" "${{TMP}}" > "{output}"
    else
        lumpyexpress \
            -B "{input.tumor_bam}","{input.normal_bam}" \
            {params.excl_opt} \
            -o "${{OUTFILE}}" \
            -m 4 `# min. sample weight` \
            -r 0 `# trim threshold` \
            -k `# keep tmp files` \
            -T "${{TMP}}/lumpy.${{RANDOM}}"

        # somatic + SV quality filtering
        # 'normal' sample assumes index 1
        bcftools filter \
            -O v `# uncompressed VCF format` \
            -o "{output}" \
            -i "FORMAT/SU[1] == 0 && FILTER == '.'" \
            "${{OUTFILE}}"
    fi
    """
# LUMPY: SV calling and quality filtering on a single sample
shell:
    """
    set -xe

    # if 'tmpspace' set to >0MB use TMPDIR otherwise use OUTDIR
    OUTDIR="$(dirname "{output}")"
    PREFIX="$(basename "{output}" .vcf)"
    OUTFILE="${{OUTDIR}}/${{PREFIX}}.unfiltered.vcf"
    TMP=$([ "{resources.tmp_mb}" -eq "0" ] && echo "${{OUTDIR}}" || echo "${{TMPDIR}}")

    # run dummy or real job
    if [ "{config.echo_run}" -eq "1" ]; then
        echo "{input}" "${{TMP}}" > "{output}"
    else
        lumpyexpress \
            -B "{input.bam}" \
            {params.excl_opt} \
            -o "${{OUTFILE}}" \
            -m 4 `# min. sample weight` \
            -r 0 `# trim threshold` \
            -k `# keep tmp files` \
            -T "${{TMP}}/lumpy.${{RANDOM}}"

        # SV quality filtering
        bcftools filter \
            -O v `# uncompressed VCF format` \
            -o "{output}" \
            -i "FILTER == '.'" \
            "${{OUTFILE}}"
    fi
    """
# Manta: SV calling and quality filtering on a tumor/normal pair
shell:
    """
    set -xe
    OUTDIR="$(dirname "{output}")"
    OUTFILE="$(basename "{output}")"

    # run dummy or real job
    if [ "{config.echo_run}" -eq "1" ]; then
        echo "{input}" > "{output}"
    else
        configManta.py \
            --runDir "${{OUTDIR}}" \
            --reference "{input.fasta}" \
            --tumorBam "{input.tumor_bam}" \
            --normalBam "{input.normal_bam}"
        cd "${{OUTDIR}}"
        ./runWorkflow.py \
            --quiet \
            -m local \
            -j {threads}

        # SV quality filtering
        bcftools filter \
            -O v `# uncompressed VCF format` \
            -o "${{OUTFILE}}" \
            -i "FILTER == 'PASS'" \
            "{params.outfile}"
    fi
    """
# Manta: SV calling and quality filtering on a single sample
shell:
    """
    set -xe
    OUTDIR="$(dirname "{output}")"
    OUTFILE="$(basename "{output}")"

    # run dummy or real job
    if [ "{config.echo_run}" -eq "1" ]; then
        echo "{input}" > "{output}"
    else
        configManta.py \
            --runDir "${{OUTDIR}}" \
            --reference "{input.fasta}" \
            {params.bam_opt} "{input.bam}"
        cd "${{OUTDIR}}"
        ./runWorkflow.py \
            --quiet \
            -m local \
            -j {threads}

        # SV quality filtering
        bcftools filter \
            -O v `# uncompressed VCF format` \
            -o "${{OUTFILE}}" \
            -i "FILTER == 'PASS'" \
            "{params.outfile}"
    fi
    """
# SURVIVOR: filter SVs (e.g., against excluded regions) if enabled, otherwise link the input VCF
shell:
    """
    set -xe

    # run dummy or real job
    if [ "{config.echo_run}" -eq "1" ]; then
        cat "{input}" > "{output}"
    else
        if [ "{params.excl}" -eq "1" ]; then
            SURVIVOR filter "{input}" {params.args} "{output}"
        else
            ln -sr "{input}" "{output}"
        fi
    fi
    """
# SURVIVOR: merge the SV call sets of the enabled callers
shell:
    """
    set -xe

    # create a list of VCF files
    for f in $(echo "{input}")
    do
        echo "$f" >> "{output[0]}"
    done

    # run dummy or real job
    if [ "{config.echo_run}" -eq "1" ]; then
        cat "{output[0]}" > "{output[1]}"
    else
        SURVIVOR merge "{output[0]}" {params.args} "{output[1]}"
    fi
    """
# Viola-based post-processing of caller VCFs (see the script below)
script:
    "../scripts/viola_vcf.py"
# scripts/viola_vcf.py: convert breakend (BND) records to SV-type records with Viola
import os
import shutil

import viola

vcf_in = str(snakemake.input)
vcf_org = vcf_in + '.org'
vcf_out = str(snakemake.output)
caller = str(snakemake.wildcards.prefix)

# GRIDSS VCFs use PARID in the INFO field; replace it with MATEID before parsing
if caller == 'gridss':
    os.rename(vcf_in, vcf_org)
    with open(vcf_in, 'w') as new:
        with open(vcf_org, 'r') as org:
            for line in org:
                # FIX INFO field: change PARID to MATEID
                new.write(line.replace('PARID', 'MATEID'))

# convert breakends to breakpoints; fall back to copying the input VCF on failure
try:
    sv = viola.read_vcf(vcf_in, variant_caller=caller).breakend2breakpoint()
    sv.to_vcf(vcf_out)
except Exception:
    shutil.copyfile(vcf_in, vcf_out)
Support
- Future updates