Genome Re-sequencing Analysis Snakemake Workflow: De-novo and Variant Calling Modes
This Snakemake workflow is for analysing genome re-sequencing experiments. It features two modes. The **de-novo** mode is used to confirm sample relationships from the raw sequencing reads with [kwip](https://github.com/kdmurray91/kWIP) and [mash](https://github.com/marbl/Mash). The **varcall** mode performs read alignment to one or several reference genomes, followed by variant detection. Read alignments can be performed with [bwa](http://bio-bwa.sourceforge.net/bwa.shtml) and/or NextGenMap.
Usage
- Create a new GitHub repository in your GitHub account, using this workflow as a template.
- Clone your newly created repository to the local system where you want to perform the analysis.
- Set up the software dependencies.
- Configure the workflow for your needs and input files.
- Run the workflow.
- Archive your workflow to document your work and allow easy reproduction.

Some pointers for setting up, configuring, and running the workflow are given below; for details, please consult the technical documentation.
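In shell terms, a typical session might look like the sketch below. The repository URL placeholder and the core count are example values; each step is explained in the sections that follow.

```
git clone https://github.com/your-username/dna-proto-workflow.git   # your copy of the workflow
cd dna-proto-workflow
conda env create --name dna-proto --file envs/condaenv.yml          # install dependencies
conda activate dna-proto
# edit Snakefile, config.yml, metadata/, genomes_and_annotations/, snpEff.config for your project
snakemake -npr                                                      # dry run: check what would be done
snakemake -j 16                                                     # real run; adjust the core count
```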
Setup
An easy way to set up the dependencies is conda.
Get the Miniconda Python 3 distribution:
$ wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ bash Miniconda3-latest-Linux-x86_64.sh
Create an environment with the required software:
NOTE: the conda environment name in these examples is `dna-proto`.
$ conda env create --name dna-proto --file envs/condaenv.yml
Activate the environment:
$ conda activate dna-proto
Additional useful conda commands can be found in the conda documentation.
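For convenience, a few standard conda commands that are often useful here (generic conda CLI, not specific to this workflow):

```
# list available environments
conda env list
# update the dna-proto environment after changes to envs/condaenv.yml
conda env update --name dna-proto --file envs/condaenv.yml
# leave the environment when finished
conda deactivate
```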
----
Check config and metadata
We provide scripts to list metadata and configuration parameters in `utils/`:
python utils/check_metadata.py
python utils/check_config.py
Visualising the workflow
You can inspect the workflow in graphical form by printing the so-called DAG (directed acyclic graph) of jobs.
snakemake --dag -npr -j 1 | dot -Tsvg > dag.svg
eog dag.svg
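For large projects the full job DAG can become hard to read; Snakemake's standard `--rulegraph` option prints a condensed, per-rule graph instead, which can be rendered the same way:

```
snakemake --rulegraph -npr -j 1 | dot -Tsvg > rulegraph.svg
```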
Pretending a run of the workflow
Prior to running the workflow, pretend a run and confirm it will do what is intended.
snakemake -npr
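Here `-n` performs a dry run (nothing is executed), `-p` prints the shell commands that would be run, and `-r` prints the reason each job is scheduled. Once the dry run looks as intended, start the workflow, for example:

```
# real run; the core count is an example value, adjust it to your machine
snakemake -j 16
```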
Data
Main directory content:
.
├── envs
├── genomes_and_annotations
├── metadata
├── output
├── rules
├── scripts
├── utils
├── config.yml
├── Snakefile
├── snpEff.config
NOTE: the `output` directory and some files in the `metadata` directory are/will be generated by the workflow.
You will need to configure the workflow for your specific project. For details see the technical documentation. The following files and directories will need editing:
- Snakefile
- genomes_and_annotations/
- metadata/
- config.yml
- snpEff.config
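For orientation only, the sketch below shows roughly what such a configuration can look like. All keys and paths here are hypothetical placeholders (only the `abra2: temp:` key is actually referenced in the code snippets further down); the shipped `config.yml` and the technical documentation define the real schema.

```
# hypothetical sketch -- consult the shipped config.yml for the actual keys and structure
samples: metadata/samples.tsv                             # placeholder: sample sheet location
references:
  genome_A: genomes_and_annotations/genome_A/genome.fa    # placeholder: reference genome path
abra2:
  temp: /tmp/abra2                                        # temp directory used by the BAM-processing rules
```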
You can download example data for testing the workflow.
--
Clone the repository
Now **clone the forked repository** to your machine. Go to your GitHub account, open the forked repository, click on the clone button and then click the *copy to clipboard* icon. The URL will look like ```https://github.com/your-username/dna-proto-workflow.git```, where `your-username` is your GitHub username.
Open a terminal and run the following git command:
git clone https://github.com/your-username/dna-proto-workflow.git
Once you've cloned your fork, you can edit your local copy. However, if you want to contribute, you'll need to create a new branch.
Create a branch
Change to the repository directory on your computer (if you are not already there):
NOTE: Don't change the name of this directory!
cd dna-proto-workflow
You can check your branches and the active branch using the `git branch` command:
git branch -a
Now create a branch using the `git checkout` command:
git checkout -b new-branch-name
For example:
git checkout -b development
From this point, you are in the new branch and edits only affect your branch. If things go wrong, you can switch back to the `master` branch with `git checkout master` and, if needed, delete your branch with `git branch -d name-of-the-branch`.
Make changes and commit
Once you've modified something, you can confirm that there are changes with `git status` (called from the top-level directory). Add those changes to your branch with `git add`:
git status
git add .
or
git add name_of_the_file_you_modified
Commit those changes with `git commit`:
git commit -m "write a message"
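When you are ready to share your work (for example to open a pull request), push the branch to your fork; `development` below is just the example branch name used above:

```
git push -u origin development
```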
Code Snippets
```
shell:
    "python3 scripts/tidybamstat.py"
    " -o output/alnstats/everything"  # prefix
    " {input}"
    ">'{log}' 2>&1"
```
```
shell:
    "( bwa mem"
    " -p"  # paired input
    " -t {threads}"
    " -R '@RG\\tID:{wildcards.run}_{wildcards.lib}\\tSM:{params.sample}'"
    " {input.ref}"
    " {input.reads}"
    "| samtools view -Suh - >{output.bam}"
    " ) >'{log}' 2>&1"
```
```
shell:
    "( java"
    " -{params.mem}"
    " -jar {params.abra_release}"
    " --in {input.set}"
    " --out {output}"
    " --ref {params.ref}"
    " --threads {params.threads}"
    " --targets {params.region}"
    " --tmpdir {params.abra_temp}"
    ") >'{log}' 2>&1"
```
```
shell:
    "( samtools fixmate "
    " -m"
    " -@ {threads}"
    " --output-fmt bam,level=0"
    " {input.bam}"
    " -"
    " | samtools sort"
    " -T {config[abra2][temp]}/{wildcards.run}_{wildcards.lib}_sort_$RANDOM"
    " --output-fmt bam,level=0"
    " -@ {threads}"
    " -m 1g"
    " -"
    " | samtools markdup"
    " -T {config[abra2][temp]}/{wildcards.run}_{wildcards.lib}_markdup_$RANDOM"
    " -s"  # report stats
    " -@ {threads}"
    " --output-fmt bam,level=3"
    " -"
    " {output.bam}"
    " ) >'{log}' 2>&1"
```
```
shell:
    "( samtools merge"
    " -@ {threads}"
    " --output-fmt bam,level=4"
    " {output.bam}"
    " {input}"
    " ) >'{log}' 2>&1"
```
```
shell:
    "( unset DISPLAY; qualimap bamqc"
    " --java-mem-size=4G"
    " -bam {input.bam}"
    " -nr 10000"
    " -nt {threads}"
    " -outdir {output}"
    " {input}"
    " ) >'{log}' 2>&1"
```
```
run:
    with open(output[0], "w") as fh:
        for s in input:
            print(s, file=fh)
```
```
shell:
    "( samtools merge"
    " --output-fmt bam,level=7"
    " -@ {threads}"
    " -"
    " {input}"
    " | tee {output.bam}"
    " | samtools index - {output.bai}"  # indexing takes bloody ages, we may as well do this on the fly
    " ) >'{log}' 2>&1"
```
```
shell:
    "(samtools stats -i 5000 -x {input} >{output}) >'{log}'"
```
```
shell:
    "samtools index {input}"
```
```
shell:
    "samtools view -Suh {input} >{output}"
```
```
shell:
    "( ngm"
    " -q {input.reads}"
    " --paired --broken-pairs"
    " -r {input.ref}"
    " -t {threads}"
    " --rg-id {wildcards.run}_{wildcards.lib}"
    " --rg-sm {params.sample}"
    " --sensitivity {params.sensitivity}"  # this is the mean from a bunch of different runs
    "| samtools view -Suh - >{output.bam}"
    " ) >'{log}' 2>&1"
```
```
shell:
    "( snpEff ann"
    " -filterInterval {input.bed}"
    " -csvStats {output.csvstats}"
    " -htmlStats {output.htmlStats}"
    " {params.extra}"
    " {params.database}"
    " {input.vcf}"
    " > {output.vcf}"
    " ) >'{log}' 2>&1"
```
```
script:
    "../scripts/pca_mash.R"
```
```
script:
    "../scripts/pca_kwip.R"
```
```
shell:
    " mash sketch"
    " -k {wildcards.ksize}"
    " -s {wildcards.sketchsize}"
    " -p {threads}"
    " -o {output}"
    " {input}"
    " >{log} 2>&1"
```
```
shell:
    "mash dist"
    " -p {threads}"
    " -t"  # tabular format
    " {input} {input}"  # needs input twice
    " >{output}"
    " 2>{log}"
```
```
shell:
    "load-into-counting.py"
    " -N 1"
    " -x {wildcards.sketchsize}"
    " -k {wildcards.ksize}"
    " -b"
    " -f"
    " -s tsv"
    " -T {threads}"
    " {output.ct}"
    " {input}"
    " >{log} 2>&1"
```
```
shell:
    "kwip"
    " -d {output.d}"
    " -k {output.k}"
    " -t {threads}"
    " {input}"
    " >{log} 2>&1"
```
```
shell:
    "( kdm-unique-kmers.py"
    " -t {threads}"
    " -k {params.kmersize}"
    " {input}"
    " >{output}"
    " ) 2>{log}"
```
```
shell:
    "( sourmash compute"
    " --name '{wildcards.sample}'"
    " -k {wildcards.ksize}"
    " -n {wildcards.sketchsize}"
    " -o {output}"
    " {input}"
    ") >{log} 2>&1"
```
```
shell:
    "(sourmash compare -k {wildcards.ksize} -o {output} {input} ) >{log} 2>&1"
```
```
shell:
    "( AdapterRemoval"
    " --file1 {input.r1}"
    " --file2 {input.r2}"
    " --adapter1 {params.adp1}"
    " --adapter2 {params.adp2}"
    " --combined-output"
    " --interleaved-output"
    " --trimns"
    " --trimqualities"
    " --trimwindows 10"
    " --minquality {params.minqual}"
    " --threads 2"
    " --settings {log.settings}"
    " --output1 /dev/stdout"
    " | seqhax pairs"
    " -l 20"
    " -b >(pigz -p 5 >{output.reads})"
    " /dev/stdin"
    ") >'{log.log}' 2>&1"
```
```
shell:
    "( AdapterRemoval"
    " --file1 {input}"
    " --adapter1 {params.adp1}"
    " --adapter2 {params.adp2}"
    " {params.extra}"
    " --minquality {params.minqual}"
    " --threads 2"
    " --settings {log.settings}"
    " --output1 /dev/stdout"
    " | seqhax pairs"
    " -l 20"
    " -b >(pigz -p 5 >{output.reads})"
    " /dev/stdin"
    ") >'{log.log}' 2>&1"
```
```
shell:
    "cat {input} > {output}"
```
```
shell:
    "( seqhax stats"
    " -t {threads}"
    " {input}"
    " >{output}"
    " ) 2>'{log}'"
```
```
shell:
    "( seqhax stats"
    " -t {threads}"
    " {input}"
    " >{output}"
    " ) 2>'{log}'"
```
```
script:
    "../scripts/plot-quals.py"
```
```
shell:
    "( freebayes"
    " --theta {params.theta}"
    " --samples {input.sset}"
    " --ploidy 2"
    " --use-best-n-alleles 3"
    " --min-mapping-quality {params.minmq}"
    " --min-base-quality {params.minbq}"
    " --read-max-mismatch-fraction 0.1"
    " --min-alternate-fraction 0"
    " --min-alternate-count 2"  # per sample
    " --min-alternate-total 5"  # across all samples
    " --min-coverage 10"  # across all samples
    " --prob-contamination 1e-6"
    " --use-mapping-quality"
    " --strict-vcf"
    " --region '{wildcards.region}'"
    " --fasta-reference {input.ref}"
    " {input.bam}"
    " | bcftools view"
    " -O b -o '{output.bcf}'"
    " ) >'{log}' 2>&1"
```
```
shell:
    "( bcftools mpileup"
    " --adjust-MQ 50"
    " --redo-BAQ"
    " --max-depth 20000"  # the default per file max (250x) is insane, i.e. <1x for most sets. new limit of 20000x equates to a max. of 20x across all samples.
    " --min-MQ {params.minmq}"
    " --min-BQ {params.minbq}"
    " --fasta-ref {input.ref}"
    " --samples-file {input.sset}"
    " --annotate FORMAT/DP,FORMAT/AD,FORMAT/SP,INFO/AD"  # output extra tags
    " --region '{wildcards.region}'"
    " --output-type u"  # uncompressed
    " {input.bam}"
    " | bcftools call"
    " --targets '{wildcards.region}'"  # might not be needed
    " --multiallelic-caller"
    " --prior {params.theta}"
    " -O b"
    " -o {output.bcf}"
    " ) >'{log}' 2>&1"
```
```
shell:
    "( bcftools view"
    " {params.filtarg}"
    " '{input.bcf}'"
    " -O b -o '{output.bcf}'"
    " ) >'{log}' 2>&1"
```
```
run:
    with open(output[0], "w") as fh:
        for s in sorted(input):
            print(s, file=fh)
```
```
shell:
    "( bcftools concat"
    " --threads {threads}"
    " -O b"
    " -o {output.bcf}"
    " --file-list {input.fofn}"
    " ) >'{log}' 2>&1"
```
```
shell:
    "( bcftools view"
    " {input.bcf}"
    " -O z"
    " --threads {threads}"
    " -o {output.vcf}"
    " ) >'{log}' 2>&1"
```
```
shell:
    "( bcftools view"
    " {input.bcf}"
    " -O v"
    " --threads {threads}"
    " -o {output.vcf}"
    " ) >'{log}' 2>&1"
```
```
shell:
    "bcftools index -c -f {input} && bcftools index -t -f {input}"
```
```
shell:
    "bcftools stats -s - -d 0,1000,2 --threads {threads} {input} >{output}"
```
```
library(rgl)

print(snakemake@input[[1]])
print(snakemake@output[[1]])

y0 <- read.delim(snakemake@input[[1]], header=T)
data <- as.matrix(y0[, -1])
labels <- colnames(data)

pca <- prcomp(data, scale=F)
print('PCA done!')

plot3d(pca$x[,c(1,2,3)], size=10)
text3d(pca$x[,c(1,2,3)], text=labels)
rgl.postscript(snakemake@output[[1]], "pdf")
```
```
library(rgl)

print(snakemake@input[[1]])
print(snakemake@output[[1]])

y0 <- read.delim(snakemake@input[[1]], header=T)
data <- as.matrix(y0[, -1])
labels <- as.data.frame(do.call(rbind, strsplit(colnames(data), "[.]")))$V4

pca <- prcomp(data, scale=F)
print('PCA done!')

plot3d(pca$x[,c(1,2,3)], size=10)
text3d(pca$x[,c(1,2,3)], text=labels)
rgl.postscript(snakemake@output[[1]], "pdf")
```
```
import matplotlib
import matplotlib.pyplot as plt
from pysam import VariantFile
import numpy as np

###matplotlib.use("Agg")

print("$ Script: plot-quals")
print("INPUT: ", snakemake.input)

quals = [record.qual for record in VariantFile(snakemake.input[0])]
plt.hist(quals)
plt.savefig(snakemake.output[0])
```
```
from sys import stderr, stdout, stdin, argv
from os.path import basename, splitext
import re
import json
import csv
import argparse

try:
    from tqdm import tqdm
except ImportError:
    tqdm = lambda x: x


def doSN(line, data):
    if "SN" not in data:
        data["SN"] = {}
    row = line.rstrip('\n').split('\t')
    field = re.sub(r'\(|\)|:|%', '', row[1])
    field = re.sub(r'\s+$|^\s+', '', field)
    field = re.sub(r'\s+', '_', field)
    val = row[2]
    data["SN"][field] = val


# TODO: not handling this right now
# def doTableByCycle(line, data, key):
#     cols = {
#         "FFQ": ["cycle_number", "quality"],
#         "LFQ": ["cycle_number", "quality"],
#     }

TABLE_COLS = {
    "IS": ["insert_size", "pairs", "pairs_in", "pairs_out", "pairs_other"],
    "ID": ["indel_size", "n_insertion", "n_deletion"],
    "GCF": ["gc_content", "n_reads"],
    "GCL": ["gc_content", "n_reads"],
    "GCC": ["cycle_number", "a_pct", "c_pct", "g_pct", "t_pct", "n_pct", None],
    "RL": ["read_length", "n_reads"],
    "COV": ["coverage_bin", None, "genome_bp"],
}


def doTable(line, data):
    key = line.strip().split('\t')[0]
    if key not in data:
        data[key] = []
    vals = line.strip().split("\t")
    valdict = {}
    for i, col in enumerate(TABLE_COLS[key]):
        if col is None:
            continue
        valdict[col] = vals[i+1]
    data[key].append(valdict)


def dump(samples, key, outfile):
    with open(outfile, "w") as fh:
        csvfh = None
        for sample, alldata in samples.items():
            if key not in alldata:
                continue
            data = alldata[key]
            if isinstance(data, dict):
                # for some where rows are split over multiple lines, e.g. SN
                data = [data, ]
            if csvfh is None:
                header = ["sample", ] + list(data[0].keys())
                csvfh = csv.DictWriter(fh, header)
                csvfh.writeheader()
            for row in data:
                row["sample"] = sample
                csvfh.writerow(row)


def main():
    p = argparse.ArgumentParser(prog="tidybamstat")
    p.add_argument("-o", "--outprefix", type=str, required=True,
                   help="Output prefix")
    p.add_argument("input", type=str, nargs='+')
    args = p.parse_args()

    samples = {}
    print("Load alignment stats:", file=stderr)
    for file in tqdm(args.input):
        data = dict()
        with open(file) as fh:
            for line in fh:
                dtype = line.strip().split("\t")[0]
                if line.startswith("#"):
                    continue
                elif line.startswith("SN"):
                    doSN(line, data)
                elif dtype in TABLE_COLS:
                    doTable(line, data)
        sample = splitext(basename(file))[0]
        samples[sample] = data

    allkeys = ["SN"] + list(TABLE_COLS.keys())
    print("Dumping data by data type: ", file=stderr, end='', flush=True)
    for key in allkeys:
        print(f"{key}, ", file=stderr, end='', flush=True)
        outfile = f"{args.outprefix}_{key}.csv"
        dump(samples, key, outfile)
    print(" Done!", file=stderr)


if __name__ == "__main__":
    main()
```
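For reference, `scripts/tidybamstat.py` can also be run by hand on a set of `samtools stats` reports; judging from its argparse definition above, it takes an output prefix and one or more input files. The output prefix matches how the workflow calls it in the first snippet; the input paths below are placeholders.

```
# -o/--outprefix: prefix for the per-table CSV files written by dump()
# positional arguments: samtools stats text reports (placeholder paths)
python3 scripts/tidybamstat.py -o output/alnstats/everything output/alnstats/*.stats.txt
```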
Support
- Future updates