Multiple Sequence Alignment (MSA) and Phylogeny Construction Workflow with Snakemake

public 1yr ago 0 bookmarks

View Workflow

Help improve this workflow!

This workflow has been published but could be further improved with some additional meta data:

Keyword(s) in categories input, output

You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .

This is the template for a new Snakemake workflow. Replace this text with a comprehensive description covering the purpose and domain. Insert your code into the respective folders, i.e. scripts , rules , and envs . Define the entry point of the workflow in the Snakefile and the main configuration in the config.yaml file.

Usage

If you use this workflow in a paper, don't forget to give credits to the authors by citing the URL of this (original) repository and, if available, its DOI (see above).

Step 1: Obtain a copy of this workflow

Create a new github repository using this workflow as a template .
Clone the newly created repository to your local system, into the place where you want to perform the data analysis.

Step 2: Configure workflow

Configure the workflow according to your needs via editing the files in the config/ folder. Adjust config.yaml to configure the workflow execution, and samples.tsv to specify your sample setup.

Step 3: Install Snakemake

Install Snakemake using conda :

conda create -c bioconda -c conda-forge -n snakemake snakemake

For installation details, see the instructions in the Snakemake documentation .

Step 4: Execute workflow

Activate the conda environment:

conda activate snakemake

Test your configuration by performing a dry-run via

snakemake --use-conda -n

Execute the workflow locally via

snakemake --use-conda --cores $N

using $N cores or run it in a cluster environment via

snakemake --use-conda --cluster qsub --jobs 100

Step 5: Investigate results

After successful execution, you can create a self-contained interactive HTML report with all results via:

snakemake --report report.html

This report can, e.g., be forwarded to your collaborators. An example (using some trivial test data) can be seen here .

Step 6: Commit changes

Whenever you change something, don't forget to commit the changes back to your github copy of the repository:

git commit -a
git push

Step 7: Obtain updates from upstream

Whenever you want to synchronize your workflow copy with new developments from upstream, do the following.

Once, register the upstream repository in your local copy: git remote add -f upstream git@github.com:snakemake-workflows/MSA-phylogeny.git or git remote add -f upstream https://github.com/snakemake-workflows/MSA-phylogeny.git if you do not have setup ssh keys.
Update the upstream version: git fetch upstream .
Create a diff with the current version: git diff HEAD upstream/master workflow > upstream-changes.diff .
Investigate the changes: vim upstream-changes.diff .
Apply the modified diff via: git apply upstream-changes.diff .
Carefully check whether you need to update the config files: git diff HEAD upstream/master config . If so, do it manually, and only where necessary, since you would otherwise likely overwrite your settings and samples.

Code Snippets

shell:
    "gff2bed < <(zcat {input.gff} | "
    "awk '{{if ($3 ~ /gene/) print $0}}') > {output.bed} 2> {log}"

SnakeMake gff2bed From line 10 of rules/MSA.smk

shell:
    "seqtk subseq {input.sequence} {input.bed} > {output.regions} 2> {log}"

SnakeMake seqtk subSeq From line 25 of rules/MSA.smk

script:
    "../scripts/rename-regions.py"

SnakeMake From line 39 of rules/MSA.smk

shell:
    "minimap2 -a {input.target} {input.query} -o {output} 2> {log}"

SnakeMake Minimap2 From line 53 of rules/MSA.smk

script:
    "../scripts/extract-sequences.py"

SnakeMake From line 67 of rules/MSA.smk

shell:
    "cat {input} > {output}"

SnakeMake From line 80 of rules/MSA.smk

shell:
    "mafft {input} > {output} 2> {log}"

SnakeMake MAFFT API (EBI) From line 93 of rules/MSA.smk

shell:
    "CIAlign --infile {input} --outfile_stem {params.output_stem} --clean --visualise --interpret"

SnakeMake cialign From line 108 of rules/MSA.smk

shell:
    "cp {input.fasta} {output.inter_fasta} && "
    "iqtree -s {output.inter_fasta} > {log} 2>&1"

SnakeMake IQ-TREE From line 14 of rules/phylogeny.smk

script:
    "../scripts/update-sample-sheet.py"

SnakeMake From line 8 of rules/preprocessing.smk

shell:
    "((esearch -db nucleotide -query '{params.accession}' | "
    "efetch -format fasta > {output}) && [ -s {output} ]) 2> {log}"

SnakeMake From line 12 of rules/ref.smk

wrapper:
    "0.70.0/bio/samtools/index"

SnakeMake From line 8 of rules/utils.smk

wrapper:
    "0.70.0/bio/samtools/faidx"

SnakeMake From line 19 of rules/utils.smk

shell:
    "samtools sort -o {output} -O BAM {input}"

SnakeMake SAMtools From line 32 of rules/utils.smk

wrapper:
    "0.70.0/bio/tabix"

SnakeMake tabix From line 45 of rules/utils.smk

wrapper:
    "v1.12.2/bio/bcftools/mpileup"

SnakeMake From line 23 of rules/variant-plot.smk

wrapper:
    "v1.12.2/bio/bcftools/call"

SnakeMake From line 38 of rules/variant-plot.smk

wrapper:
    "v1.12.2/bio/vep/plugins"

SnakeMake From line 49 of rules/variant-plot.smk

wrapper:
    "v1.12.2/bio/vep/annotate"

SnakeMake From line 72 of rules/variant-plot.smk

script:
    "../scripts/plot-variants.py"

SnakeMake From line 85 of rules/variant-plot.smk

import pysam
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

with pysam.AlignmentFile(snakemake.input.bam, "rb") as samfile, open(snakemake.output.seq, "w") as out_fasta:
    for record in samfile.fetch():
        if record.reference_name.endswith(snakemake.wildcards.region):
            out_seq = SeqRecord(
                Seq(record.query_sequence),
                id="%s_%s" %(record.query_name, record.reference_name.split("_")[-1]),
                description=""
            )
            # print(record.reference_name, record.reference_start, record.query_alignment_start, record.query_alignment_end)
            SeqIO.write(out_seq, out_fasta, "fasta")

Python Biopython pysam From line 1 of scripts/extract-sequences.py

from Bio import SeqIO
import pandas as pd

bed = pd.read_csv(snakemake.input.bed, delimiter="\t", header=None)

bed[1] += 1
bed["name"] = bed[0] + ":" + bed[1].astype(str) + "-" + bed[2].astype(str)
bed["feature"] = bed[9].str.extract(r';Name=([A-Z]{1,3}\d*[a-z]{0,2});')[0]
bed.set_index("name", inplace=True)

with open(snakemake.input.regions) as input_handle, open(snakemake.output.renamed_regions, "w") as output_handle:
    for record in SeqIO.parse(input_handle, "fasta"):
        record.id = record.id + "_" + bed.loc[str(record.id)]["feature"]
        record.name = ""
        record.description = ""
        SeqIO.write(record, output_handle, "fasta")

Python Pandas Biopython From line 1 of scripts/rename-regions.py

import os
from datetime import date, datetime
from os import getcwd, listdir, mkdir, path
from shutil import copy2, move
from Bio import SeqIO

import pandas as pd
from ruamel import yaml  # conda install -c conda-forge ruamel.yaml

# define location of sample sheet and workflow config

def update_sample_sheet():
    print("""
    This function updates the samples.csv file in your config with all .fasta files
    included in data/query and data/reference. Please make sure all query files are
    single fastas per sample. Reference fasta files can be multiple sequenence fasta 
    files. All sequences have to be named properly (≤ ten letters) as some applications
    cutoff sequence names.

    Please make sure to define a tag after the sample sheet is generated
    """)

    config = snakemake.config

    QUERY_PATH = str(config["data-handling"]["data-query"])
    REFERENCE_PATH = str(config["data-handling"]["data-reference"])

    incoming_files = [f for f in listdir(QUERY_PATH)]

    if not incoming_files:
        print("No files in data/query")
        new_files_reference = pd.DataFrame(columns=["sample_name", "file", "type", "tag"])
    else:
        print("Updating sample sheet")
        # create dataframe
        new_files_query = pd.DataFrame(incoming_files, columns=["file"])

        # get id of sample by splitting the file handle
        new_files_query["sample_name"] = new_files_query["file"].apply(
            lambda x: (x.rsplit(".", 1)[0])
        )

        new_files_query["type"] = "query"
        new_files_query["tag"] = "your_tag"
        # add path of file
        new_files_query["file"] = QUERY_PATH + "/" + new_files_query["file"]

        print(new_files_query)
        print("\t{} query samples added".format(len(new_files_query)))

    incoming_files = [f for f in listdir(REFERENCE_PATH)]

    if not incoming_files:
        print("No files in data/reference")
        new_files_reference = pd.DataFrame(columns=["sample_name", "file", "type"])
    else:
        print("Updating sample sheet")
        # create dataframe
        new_files_reference = pd.DataFrame(incoming_files, columns=["file"])

        # get id of sample by splitting the file handle
        new_files_reference["sample_name"] = new_files_reference["file"].apply(
            lambda x: (x.rsplit(".", 1)[0])
        )

        new_files_reference["type"] = "reference"
        new_files_reference["tag"] = "your_tag"
        # add path of file
        new_files_reference["file"] = REFERENCE_PATH + "/" + new_files_reference["file"]

        print(new_files_reference)
        print("\t{} reference file added".format(len(new_files_reference)))

        sample_sheet = pd.concat([new_files_query, new_files_reference])

        sample_sheet.to_csv(snakemake.input[0], index=False, columns=["sample_name", "file", "type", "tag"])


update_sample_sheet()

Python Pandas Biopython From line 6 of scripts/update-sample-sheet.py

__author__ = "Michael Chambers"
__copyright__ = "Copyright 2019, Michael Chambers"
__email__ = "greenkidneybean@gmail.com"
__license__ = "MIT"


from snakemake.shell import shell


shell("samtools faidx {snakemake.params} {snakemake.input[0]} > {snakemake.output[0]}")

Python Snakemake SAMtools From line 1 of faidx/wrapper.py

__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"


from snakemake.shell import shell


shell("samtools index {snakemake.params} {snakemake.input[0]} {snakemake.output[0]}")

Python Snakemake SAMtools From line 1 of index/wrapper.py

__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"


from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=False, stderr=True)

shell("tabix {snakemake.params} {snakemake.input[0]} {log}")

Python Snakemake tabix From line 1 of tabix/wrapper.py

__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"


from snakemake.shell import shell
from snakemake_wrapper_utils.bcftools import get_bcftools_opts


bcftools_opts = get_bcftools_opts(snakemake, parse_ref=False, parse_memory=False)
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)


class CallerOptionError(Exception):
    pass


valid_caller_opts = {"-c", "--consensus-caller", "-m", "--multiallelic-caller"}

caller_opt = snakemake.params.get("caller", "")
if caller_opt.strip() not in valid_caller_opts:
    raise CallerOptionError(
        "bcftools call expects either -m/--multiallelic-caller or "
        "-c/--consensus-caller as caller option."
    )


shell(
    "bcftools call"
    " {bcftools_opts}"
    " {caller_opt}"
    " {extra}"
    " {snakemake.input[0]}"
    " {log}"
)

Python Snakemake BCFtools snakemake-wrapper-utils From line 1 of call/wrapper.py

__author__ = "Michael Hall"
__copyright__ = "Copyright 2020, Michael Hall"
__email__ = "michael@mbh.sh"
__license__ = "MIT"


from snakemake.shell import shell
from snakemake_wrapper_utils.bcftools import get_bcftools_opts


extra = snakemake.params.get("extra", "")
bcftools_opts = get_bcftools_opts(
    snakemake, parse_ref=("--no-reference" not in extra), parse_memory=False
)
log = snakemake.log_fmt_shell(stdout=True, stderr=True)


class MissingReferenceError(Exception):
    pass


shell("bcftools mpileup {bcftools_opts} {extra} {snakemake.input[0]} {log}")

Python Snakemake BCFtools snakemake-wrapper-utils From line 1 of mpileup/wrapper.py

__author__ = "Johannes Köster"
__copyright__ = "Copyright 2020, Johannes Köster"
__email__ = "johannes.koester@uni-due.de"
__license__ = "MIT"

import os
from pathlib import Path
from snakemake.shell import shell


def get_only_child_dir(path):
    children = [child for child in path.iterdir() if child.is_dir()]
    assert (
        len(children) == 1
    ), "Invalid VEP cache directory, only a single entry is allowed, make sure that cache was created with the snakemake VEP cache wrapper"
    return children[0]


extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)

fork = "--fork {}".format(snakemake.threads) if snakemake.threads > 1 else ""
stats = snakemake.output.stats
cache = snakemake.input.get("cache", "")
plugins = snakemake.input.plugins
plugin_aux_files = {"LoFtool": "LoFtool_scores.txt", "ExACpLI": "ExACpLI_values.txt"}

load_plugins = []
for plugin in snakemake.params.plugins:
    if plugin in plugin_aux_files.keys():
        aux_path = os.path.join(plugins, plugin_aux_files[plugin])
        load_plugins.append(",".join([plugin, aux_path]))
    else:
        load_plugins.append(",".join([plugin, snakemake.input.get(plugin.lower(), "")]))
load_plugins = " ".join(map("--plugin {}".format, load_plugins))

if snakemake.output.calls.endswith(".vcf.gz"):
    fmt = "z"
elif snakemake.output.calls.endswith(".bcf"):
    fmt = "b"
else:
    fmt = "v"

fasta = snakemake.input.get("fasta", "")
if fasta:
    fasta = "--fasta {}".format(fasta)

gff = snakemake.input.get("gff", "")
if gff:
    gff = "--gff {}".format(gff)

if cache:
    entrypath = get_only_child_dir(get_only_child_dir(Path(cache)))
    species = (
        entrypath.parent.name[:-7]
        if entrypath.parent.name.endswith("_refseq")
        else entrypath.parent.name
    )
    release, build = entrypath.name.split("_")
    cache = (
        "--offline --cache --dir_cache {cache} --cache_version {release} --species {species} --assembly {build}"
    ).format(cache=cache, release=release, build=build, species=species)

shell(
    "(bcftools view '{snakemake.input.calls}' | "
    "vep {extra} {fork} "
    "--format vcf "
    "--vcf "
    "{cache} "
    "{gff} "
    "{fasta} "
    "--dir_plugins {plugins} "
    "{load_plugins} "
    "--output_file STDOUT "
    "--stats_file {stats} | "
    "bcftools view -O{fmt} > {snakemake.output.calls}) {log}"
)

Python Snakemake BCFtools Variant Effect Predictor (VEP) From line 1 of annotate/wrapper.py

__author__ = "Johannes Köster"
__copyright__ = "Copyright 2020, Johannes Köster"
__email__ = "johannes.koester@uni-due.de"
__license__ = "MIT"

import sys
from pathlib import Path
from urllib.request import urlretrieve
from zipfile import ZipFile
from tempfile import NamedTemporaryFile

if snakemake.log:
    sys.stderr = open(snakemake.log[0], "w")

outdir = Path(snakemake.output[0])
outdir.mkdir()

with NamedTemporaryFile() as tmp:
    urlretrieve(
        "https://github.com/Ensembl/VEP_plugins/archive/release/{release}.zip".format(
            release=snakemake.params.release
        ),
        tmp.name,
    )

    with ZipFile(tmp.name) as f:
        for member in f.infolist():
            memberpath = Path(member.filename)
            if len(memberpath.parts) == 1:
                # skip root dir
                continue
            targetpath = outdir / memberpath.relative_to(memberpath.parts[0])
            if member.is_dir():
                targetpath.mkdir()
            else:
                with open(targetpath, "wb") as out:
                    out.write(f.read(member.filename))