Multiple Sequence Alignment (MSA) and Phylogeny Construction Workflow with Snakemake

Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
This is the template for a new Snakemake workflow. Replace this text with a comprehensive description covering the purpose and domain. Insert your code into the respective folders, i.e.
scripts
,
rules
, and
envs
. Define the entry point of the workflow in the
Snakefile
and the main configuration in the
config.yaml
file.
Usage
If you use this workflow in a paper, don't forget to give credits to the authors by citing the URL of this (original) repository and, if available, its DOI (see above).
Step 1: Obtain a copy of this workflow
-
Create a new github repository using this workflow as a template .
-
Clone the newly created repository to your local system, into the place where you want to perform the data analysis.
Step 2: Configure workflow
Configure the workflow according to your needs via editing the files in the
config/
folder. Adjust
config.yaml
to configure the workflow execution, and
samples.tsv
to specify your sample setup.
Step 3: Install Snakemake
Install Snakemake using conda :
conda create -c bioconda -c conda-forge -n snakemake snakemake
For installation details, see the instructions in the Snakemake documentation .
Step 4: Execute workflow
Activate the conda environment:
conda activate snakemake
Test your configuration by performing a dry-run via
snakemake --use-conda -n
Execute the workflow locally via
snakemake --use-conda --cores $N
using
$N
cores or run it in a cluster environment via
snakemake --use-conda --cluster qsub --jobs 100
Step 5: Investigate results
After successful execution, you can create a self-contained interactive HTML report with all results via:
snakemake --report report.html
This report can, e.g., be forwarded to your collaborators. An example (using some trivial test data) can be seen here .
Step 6: Commit changes
Whenever you change something, don't forget to commit the changes back to your github copy of the repository:
git commit -a
git push
Step 7: Obtain updates from upstream
Whenever you want to synchronize your workflow copy with new developments from upstream, do the following.
-
Once, register the upstream repository in your local copy:
git remote add -f upstream git@github.com:snakemake-workflows/MSA-phylogeny.git
orgit remote add -f upstream https://github.com/snakemake-workflows/MSA-phylogeny.git
if you do not have setup ssh keys. -
Update the upstream version:
git fetch upstream
. -
Create a diff with the current version:
git diff HEAD upstream/master workflow > upstream-changes.diff
. -
Investigate the changes:
vim upstream-changes.diff
. -
Apply the modified diff via:
git apply upstream-changes.diff
. -
Carefully check whether you need to update the config files:
git diff HEAD upstream/master config
. If so, do it manually, and only where necessary, since you would otherwise likely overwrite your settings and samples.
Code Snippets
10 11 12 | shell: "gff2bed < <(zcat {input.gff} | " "awk '{{if ($3 ~ /gene/) print $0}}') > {output.bed} 2> {log}" |
25 26 | shell: "seqtk subseq {input.sequence} {input.bed} > {output.regions} 2> {log}" |
39 40 | script: "../scripts/rename-regions.py" |
53 54 | shell: "minimap2 -a {input.target} {input.query} -o {output} 2> {log}" |
67 68 | script: "../scripts/extract-sequences.py" |
80 81 | shell: "cat {input} > {output}" |
93 94 | shell: "mafft {input} > {output} 2> {log}" |
108 109 | shell: "CIAlign --infile {input} --outfile_stem {params.output_stem} --clean --visualise --interpret" |
14 15 16 | shell: "cp {input.fasta} {output.inter_fasta} && " "iqtree -s {output.inter_fasta} > {log} 2>&1" |
8 9 | script: "../scripts/update-sample-sheet.py" |
12 13 14 | shell: "((esearch -db nucleotide -query '{params.accession}' | " "efetch -format fasta > {output}) && [ -s {output} ]) 2> {log}" |
8 9 | wrapper: "0.70.0/bio/samtools/index" |
19 20 | wrapper: "0.70.0/bio/samtools/faidx" |
32 33 | shell: "samtools sort -o {output} -O BAM {input}" |
45 46 | wrapper: "0.70.0/bio/tabix" |
23 24 | wrapper: "v1.12.2/bio/bcftools/mpileup" |
38 39 | wrapper: "v1.12.2/bio/bcftools/call" |
49 50 | wrapper: "v1.12.2/bio/vep/plugins" |
72 73 | wrapper: "v1.12.2/bio/vep/annotate" |
85 86 | script: "../scripts/plot-variants.py" |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 | import pysam from Bio import SeqIO from Bio.Seq import Seq from Bio.SeqRecord import SeqRecord with pysam.AlignmentFile(snakemake.input.bam, "rb") as samfile, open(snakemake.output.seq, "w") as out_fasta: for record in samfile.fetch(): if record.reference_name.endswith(snakemake.wildcards.region): out_seq = SeqRecord( Seq(record.query_sequence), id="%s_%s" %(record.query_name, record.reference_name.split("_")[-1]), description="" ) # print(record.reference_name, record.reference_start, record.query_alignment_start, record.query_alignment_end) SeqIO.write(out_seq, out_fasta, "fasta") |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 | from Bio import SeqIO import pandas as pd bed = pd.read_csv(snakemake.input.bed, delimiter="\t", header=None) bed[1] += 1 bed["name"] = bed[0] + ":" + bed[1].astype(str) + "-" + bed[2].astype(str) bed["feature"] = bed[9].str.extract(r';Name=([A-Z]{1,3}\d*[a-z]{0,2});')[0] bed.set_index("name", inplace=True) with open(snakemake.input.regions) as input_handle, open(snakemake.output.renamed_regions, "w") as output_handle: for record in SeqIO.parse(input_handle, "fasta"): record.id = record.id + "_" + bed.loc[str(record.id)]["feature"] record.name = "" record.description = "" SeqIO.write(record, output_handle, "fasta") |
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 | import os from datetime import date, datetime from os import getcwd, listdir, mkdir, path from shutil import copy2, move from Bio import SeqIO import pandas as pd from ruamel import yaml # conda install -c conda-forge ruamel.yaml # define location of sample sheet and workflow config def update_sample_sheet(): print(""" This function updates the samples.csv file in your config with all .fasta files included in data/query and data/reference. Please make sure all query files are single fastas per sample. Reference fasta files can be multiple sequenence fasta files. All sequences have to be named properly (≤ ten letters) as some applications cutoff sequence names. Please make sure to define a tag after the sample sheet is generated """) config = snakemake.config QUERY_PATH = str(config["data-handling"]["data-query"]) REFERENCE_PATH = str(config["data-handling"]["data-reference"]) incoming_files = [f for f in listdir(QUERY_PATH)] if not incoming_files: print("No files in data/query") new_files_reference = pd.DataFrame(columns=["sample_name", "file", "type", "tag"]) else: print("Updating sample sheet") # create dataframe new_files_query = pd.DataFrame(incoming_files, columns=["file"]) # get id of sample by splitting the file handle new_files_query["sample_name"] = new_files_query["file"].apply( lambda x: (x.rsplit(".", 1)[0]) ) new_files_query["type"] = "query" new_files_query["tag"] = "your_tag" # add path of file new_files_query["file"] = QUERY_PATH + "/" + new_files_query["file"] print(new_files_query) print("\t{} query samples added".format(len(new_files_query))) incoming_files = [f for f in listdir(REFERENCE_PATH)] if not incoming_files: print("No files in data/reference") new_files_reference = pd.DataFrame(columns=["sample_name", "file", "type"]) else: print("Updating sample sheet") # create dataframe new_files_reference = pd.DataFrame(incoming_files, columns=["file"]) # get id of sample by splitting the file handle new_files_reference["sample_name"] = new_files_reference["file"].apply( lambda x: (x.rsplit(".", 1)[0]) ) new_files_reference["type"] = "reference" new_files_reference["tag"] = "your_tag" # add path of file new_files_reference["file"] = REFERENCE_PATH + "/" + new_files_reference["file"] print(new_files_reference) print("\t{} reference file added".format(len(new_files_reference))) sample_sheet = pd.concat([new_files_query, new_files_reference]) sample_sheet.to_csv(snakemake.input[0], index=False, columns=["sample_name", "file", "type", "tag"]) update_sample_sheet() |
1 2 3 4 5 6 7 8 9 10 | __author__ = "Michael Chambers" __copyright__ = "Copyright 2019, Michael Chambers" __email__ = "greenkidneybean@gmail.com" __license__ = "MIT" from snakemake.shell import shell shell("samtools faidx {snakemake.params} {snakemake.input[0]} > {snakemake.output[0]}") |
1 2 3 4 5 6 7 8 9 10 | __author__ = "Johannes Köster" __copyright__ = "Copyright 2016, Johannes Köster" __email__ = "koester@jimmy.harvard.edu" __license__ = "MIT" from snakemake.shell import shell shell("samtools index {snakemake.params} {snakemake.input[0]} {snakemake.output[0]}") |
1 2 3 4 5 6 7 8 9 10 11 | __author__ = "Johannes Köster" __copyright__ = "Copyright 2016, Johannes Köster" __email__ = "koester@jimmy.harvard.edu" __license__ = "MIT" from snakemake.shell import shell log = snakemake.log_fmt_shell(stdout=False, stderr=True) shell("tabix {snakemake.params} {snakemake.input[0]} {log}") |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 | __author__ = "Johannes Köster" __copyright__ = "Copyright 2016, Johannes Köster" __email__ = "koester@jimmy.harvard.edu" __license__ = "MIT" from snakemake.shell import shell from snakemake_wrapper_utils.bcftools import get_bcftools_opts bcftools_opts = get_bcftools_opts(snakemake, parse_ref=False, parse_memory=False) extra = snakemake.params.get("extra", "") log = snakemake.log_fmt_shell(stdout=True, stderr=True) class CallerOptionError(Exception): pass valid_caller_opts = {"-c", "--consensus-caller", "-m", "--multiallelic-caller"} caller_opt = snakemake.params.get("caller", "") if caller_opt.strip() not in valid_caller_opts: raise CallerOptionError( "bcftools call expects either -m/--multiallelic-caller or " "-c/--consensus-caller as caller option." ) shell( "bcftools call" " {bcftools_opts}" " {caller_opt}" " {extra}" " {snakemake.input[0]}" " {log}" ) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 | __author__ = "Michael Hall" __copyright__ = "Copyright 2020, Michael Hall" __email__ = "michael@mbh.sh" __license__ = "MIT" from snakemake.shell import shell from snakemake_wrapper_utils.bcftools import get_bcftools_opts extra = snakemake.params.get("extra", "") bcftools_opts = get_bcftools_opts( snakemake, parse_ref=("--no-reference" not in extra), parse_memory=False ) log = snakemake.log_fmt_shell(stdout=True, stderr=True) class MissingReferenceError(Exception): pass shell("bcftools mpileup {bcftools_opts} {extra} {snakemake.input[0]} {log}") |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 | __author__ = "Johannes Köster" __copyright__ = "Copyright 2020, Johannes Köster" __email__ = "johannes.koester@uni-due.de" __license__ = "MIT" import os from pathlib import Path from snakemake.shell import shell def get_only_child_dir(path): children = [child for child in path.iterdir() if child.is_dir()] assert ( len(children) == 1 ), "Invalid VEP cache directory, only a single entry is allowed, make sure that cache was created with the snakemake VEP cache wrapper" return children[0] extra = snakemake.params.get("extra", "") log = snakemake.log_fmt_shell(stdout=False, stderr=True) fork = "--fork {}".format(snakemake.threads) if snakemake.threads > 1 else "" stats = snakemake.output.stats cache = snakemake.input.get("cache", "") plugins = snakemake.input.plugins plugin_aux_files = {"LoFtool": "LoFtool_scores.txt", "ExACpLI": "ExACpLI_values.txt"} load_plugins = [] for plugin in snakemake.params.plugins: if plugin in plugin_aux_files.keys(): aux_path = os.path.join(plugins, plugin_aux_files[plugin]) load_plugins.append(",".join([plugin, aux_path])) else: load_plugins.append(",".join([plugin, snakemake.input.get(plugin.lower(), "")])) load_plugins = " ".join(map("--plugin {}".format, load_plugins)) if snakemake.output.calls.endswith(".vcf.gz"): fmt = "z" elif snakemake.output.calls.endswith(".bcf"): fmt = "b" else: fmt = "v" fasta = snakemake.input.get("fasta", "") if fasta: fasta = "--fasta {}".format(fasta) gff = snakemake.input.get("gff", "") if gff: gff = "--gff {}".format(gff) if cache: entrypath = get_only_child_dir(get_only_child_dir(Path(cache))) species = ( entrypath.parent.name[:-7] if entrypath.parent.name.endswith("_refseq") else entrypath.parent.name ) release, build = entrypath.name.split("_") cache = ( "--offline --cache --dir_cache {cache} --cache_version {release} --species {species} --assembly {build}" ).format(cache=cache, release=release, build=build, species=species) shell( "(bcftools view '{snakemake.input.calls}' | " "vep {extra} {fork} " "--format vcf " "--vcf " "{cache} " "{gff} " "{fasta} " "--dir_plugins {plugins} " "{load_plugins} " "--output_file STDOUT " "--stats_file {stats} | " "bcftools view -O{fmt} > {snakemake.output.calls}) {log}" ) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 | __author__ = "Johannes Köster" __copyright__ = "Copyright 2020, Johannes Köster" __email__ = "johannes.koester@uni-due.de" __license__ = "MIT" import sys from pathlib import Path from urllib.request import urlretrieve from zipfile import ZipFile from tempfile import NamedTemporaryFile if snakemake.log: sys.stderr = open(snakemake.log[0], "w") outdir = Path(snakemake.output[0]) outdir.mkdir() with NamedTemporaryFile() as tmp: urlretrieve( "https://github.com/Ensembl/VEP_plugins/archive/release/{release}.zip".format( release=snakemake.params.release ), tmp.name, ) with ZipFile(tmp.name) as f: for member in f.infolist(): memberpath = Path(member.filename) if len(memberpath.parts) == 1: # skip root dir continue targetpath = outdir / memberpath.relative_to(memberpath.parts[0]) if member.is_dir(): targetpath.mkdir() else: with open(targetpath, "wb") as out: out.write(f.read(member.filename)) |
Support
- Future updates
Related Workflows





