aMeta: Ancient DNA Shotgun Metagenomics Analysis Workflow
Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
aMeta is a Snakemake workflow for identifying microbial sequences in ancient DNA shotgun metagenomics samples. The workflow performs:
-
trimming adapter sequences and removing reads shorter than 30 bp with Cutadapt
-
quaity control before and after trimming with FastQC and MultiQC
-
taxonomic sequence kmer-based classification with KrakenUniq
-
sequence alignment with Bowtie2 and screening for common microbial pathogens
-
deamination pattern analysis with MapDamage2
-
Lowest Common Ancestor (LCA) sequence alignment with Malt
-
authentication and validation of identified microbial species with MaltExtract
When using aMeta and / or pre-built databases provided together with the wokflow for your research projects, please cite our preprint: https://www.biorxiv.org/content/10.1101/2022.10.03.510579v1
Installation
Clone the github repository, then create and activate aMeta conda environment (here and below
cd aMeta
implies navigating to the cloned root aMeta directory). For this purpose, we recommend installing own
conda
, for example from here https://docs.conda.io/en/latest/miniconda.html, and
mamba
https://mamba.readthedocs.io/en/latest/installation.html:
git clone https://github.com/NBISweden/aMeta
cd aMeta
mamba env create -f workflow/envs/environment.yaml
# alternatively: conda env create -f workflow/envs/environment.yaml, however it takes longer
conda activate aMeta
Run a test to make sure that the workflow was installed correctly:
cd .test
./runtest.sh -j 4
Here, and below, by
-j
you can specify the number of threads that the workflow can use. Please make sure that the installation and test run accomplished successfully before proceeding with running aMeta on your data. Potential problems with installation and test run often come from unstable internet connection and particular
conda
settings used e.g. at computer clusters, therefore we advise you to use your own freshly installed
conda
.
Quick start
To run the worflow you need to prepare a tab-delimited sample-file
config/samples.tsv
with at least two columns, and a configuration file
config/config.yaml
, below we provide examples for both files.
Here is an example of
samples.tsv
, this implies that the fastq-files are located in
aMeta/data
folder:
sample fastq
foo data/foo.fq.gz
bar data/bar.fq.gz
Currently, it is important that the sample names in the first column exactly match the names of the fastq-files in the second column. For example, a fastq-file "data/foo.fq.gz" specified in the "fastq" column, must have a name "foo" in the "sample" column. Please make sure that the names in the first and second columns match.
Main results of the workflow and their interpretation
All output files of the workflow are located in
aMeta/results
directory. To get a quick overview of ancient microbes present in your samples you should check a heatmap in
results/overview_heatmap_scores.pdf
.
The heatmap demonstrates microbial species (in rows) authenticated for each sample (in columns). The colors and the numbers in the heatmap represent authentications scores, i.e. numeric quantification of eight quality metrics that provide information about microbial presence and ancient status. The authentication scores can vary from 0 to 10, the higher is the score the more likely that a microbe is present in a sample and is ancient. Typically, scores from 8 to 10 (red color in the heatmap) provide good confidence of ancient microbial presence in a sample. Scores from 5 to 7 (yellow and orange colors in the heatmap) can imply that either: a) a microbe is present but not ancient, i.e. modern contaminant, or b) a microbe is ancient (the reads are damaged) but was perhaps aligned to a wrong reference, i.e. it is not the microbe you think about. The former is a more common case scenario. The latter often happens when an ancient microbe is correctly detected on a genus level but we are not confident about the exact species, and might be aligning the damaged reads to a non-optimal reference which leads to a lot of mismatches or poor evennes of coverage. Scores from 0 to 4 (blue color in the heatmap) typically mean that we have very little statistical evedence (very few reads) to claim presence of a microbe in a sample.
Code Snippets
23 24 | shell: "bowtie2-build-l --threads {threads} {input.ref} {input.ref} > {log} 2>&1" |
47 48 49 50 51 52 53 54 55 56 57 | shell: """bowtie2 --large-index -x {params.BOWTIE2_DB} --end-to-end --threads {threads} --very-sensitive -U {input.fastq} > $(dirname {output.bam})/AlignedToBowtie2DB.sam 2> {log};""" """grep @ $(dirname {output.bam})/AlignedToBowtie2DB.sam | awk '!seen[$2]++' > $(dirname {output.bam})/header_nodups.txt;""" """grep -v '^@' $(dirname {output.bam})/AlignedToBowtie2DB.sam > $(dirname {output.bam})/AlignedToBowtie2DB.noheader.sam;""" """cat $(dirname {output.bam})/header_nodups.txt $(dirname {output.bam})/AlignedToBowtie2DB.noheader.sam > $(dirname {output.bam})/AlignedToBowtie2DB.nodups.sam;""" """samtools view -bS -q 1 -h -@ {threads} $(dirname {output.bam})/AlignedToBowtie2DB.nodups.sam | samtools sort - -@ {threads} -o {output.bam};""" """samtools index {output.bam};""" """rm $(dirname {output.bam})/header_nodups.txt;""" """rm $(dirname {output.bam})/AlignedToBowtie2DB.noheader.sam;""" """rm $(dirname {output.bam})/AlignedToBowtie2DB.nodups.sam;""" """rm $(dirname {output.bam})/AlignedToBowtie2DB.sam;""" |
19 20 21 22 | shell: "mkdir -p {params.dir}; " "while read taxid; do mkdir -p {params.dir}/$taxid; touch {params.dir}/$taxid/.done; done<{input.species};" "touch {output.done}" |
48 49 | shell: "touch {output}; " |
63 64 | shell: "awk -v var={wildcards.taxid} '{{ if($1==var) print $0 }}' {params.tax_db}/taxDB | cut -f3 > {output.node_list}" |
94 95 | shell: "time MaltExtract -Xmx32G -i {input.rma6} -f def_anc -o {params.extract} -r {params.ncbi_db} --reads --threads {threads} --matches --minPI 85.0 --maxReadLength 0 --minComp 0.0 --meganSummary -t {input.node_list} -v 2> {log}" |
114 115 | shell: "postprocessing.AMPS.r -m def_anc -r {params.extract} -t {threads} -n {input.node_list} || echo 'postprocessing failed for {wildcards.sample}_{wildcards.taxid}' > {output.analysis} 2> {log}" |
134 135 | shell: "samtools faidx {input.fna} 2> {log}" |
159 160 161 162 163 164 165 166 167 | shell: "echo {params.ref_id} > {output.name_list}; " "zgrep {params.ref_id} {input.sam} | uniq > {output.sam}; " "samtools view -bS {output.sam} > results/AUTHENTICATION/{wildcards.sample}/{wildcards.taxid}/{params.ref_id}.bam; " "samtools sort results/AUTHENTICATION/{wildcards.sample}/{wildcards.taxid}/{params.ref_id}.bam > {output.sorted_bam}; " "samtools index {output.sorted_bam}; " "samtools depth -a {output.sorted_bam} > {output.breadth_of_coverage}; " "grep -w -f {output.name_list} {input.malt_fasta_fai} | awk '{{printf(\"%s:1-%s\\n\", $1, $2)}}' > {output.name_list}.regions; " "samtools faidx {input.malt_fasta} -r {output.name_list}.regions -o results/AUTHENTICATION/{wildcards.sample}/{wildcards.taxid}/{params.ref_id}.fasta" |
184 185 | shell: "samtools view {input.bam} | awk '{{ print length($10) }}' > {output.distribution}" |
202 203 | shell: "(samtools view -h {input.bam} || true) | pmdtools --printDS > {output.scores}" |
226 227 | shell: "Rscript {params.exe} {wildcards.taxid} {wildcards.sample}.trimmed.rma6 {input.dir}/" |
245 246 247 248 | shell: "(samtools view {input.bam} || true) | pmdtools --platypus --number 2000000 > {output.tmp}; " "cd results/AUTHENTICATION/{wildcards.sample}/{wildcards.taxid}; " "R CMD BATCH $(which plotPMD); " |
269 270 | shell: "Rscript {params.exe} {input.rma6} $(dirname {input.maltextractlog}) {input.name_list} $(dirname {input.name_list}) &> {log};" |
23 24 25 26 27 28 29 30 31 32 | shell: "mkdir {output.dir}; " "if [ -s {input.species_tax_id} ]; then " 'cat {input.species_tax_id} | parallel -j {threads} "grep -w {{}} {params.bowtie2_seqid2taxid} | cut -f1 > {output.dir}/{{}}.seq_ids" ; ' "for i in $(cat {input.species_tax_id}); do xargs --arg-file={output.dir}/${{i}}.seq_ids samtools view -bh {input.bam} --write-index -@ {threads} -o {output.dir}/${{i}}.tax.bam; done >> {log} 2>&1; " "find {output.dir} -name '*.tax.bam' | parallel -j {threads} \"mapDamage {params.options} -i {{}} -r {params.BOWTIE2_DB} --merge-reference-sequences -d {output.dir}/mapDamage_{{}}\" >> {log} 2>&1 || true; " "for filename in {output.dir}/*.tax.bam; do newname=`echo $filename | sed 's/tax\.//g'`; mv $filename $newname; done >> {log} 2>&1; " "mv {output.dir}/mapDamage_{output.dir}/* {output.dir} >> {log} 2>&1; " "rm -r {output.dir}/mapDamage_results >> {log} 2>&1; " "else echo NO MICROBES TO AUTHENTICATE > {output.dir}/README.txt; fi" |
21 22 | shell: "krakenuniq --preload --db {params.DB} --fastq-input {input.fastq} --threads {threads} --output {output.seqs} --report-file {output.report} --gzip-compressed --only-classified-out &> {log}" |
49 50 51 52 | shell: """{params.exe} {input.krakenuniq} {params.n_unique_kmers} {params.n_tax_reads} {input.pathogenomesFound} &> {log}; """ """cut -f7 {output.pathogens} | tail -n +2 > {output.pathogen_tax_id};""" """cut -f7 {output.filtered} | tail -n +2 > {output.species_tax_id}""" |
78 79 80 81 | shell: "{params.exe} {input.report} {input.seqs} &> {log}; " "cat {output.seqs} | cut -f 2,3 > {output.krona}; " "ktImportTaxonomy {output.krona} -o {output.html} {params.DB} &>> {log}" |
109 110 111 | shell: "Rscript {params.exe} results/KRAKENUNIQ {output.out_dir} {params.n_unique_kmers} {params.n_tax_reads} &> {log};" "Rscript {params.exe_plot} {output.out_dir} {output.out_dir} &> {log}" |
25 26 | script: "../scripts/malt-build.py" |
49 50 | shell: "unset DISPLAY; malt-run -at SemiGlobal -m BlastN -i {input.fastq} -o {output.rma6} -a {params.gunzipped_sam} -t {threads} -d {input.db} -sup 1 -mq 100 -top 1 -mpi 85.0 -id 85.0 -v &> {log}" |
69 70 71 | shell: "mkdir -p {output.out_dir}; " "{params.exe} {input.sam} {params.unique_taxids} > {output.counts} 2> {log}" |
95 96 | shell: "Rscript {params.exe} results/MALT_QUANTIFY_ABUNDANCE {output.out_dir} &> {log}" |
118 119 120 121 | shell: "{params.exe} -d $(dirname {input.rma6}) -r 'S' &> {log}; " "mv results/MALT/count_table.tsv {output.out_dir}; " "mv {output.out_dir}/count_table.tsv {output.abundance_matrix}" |
141 142 143 | shell: "mv {input.tre} {output.tre};" "mv {input.map} {output.map};" |
19 20 | shell: "fastqc {input.fastq} --threads {threads} --nogroup --outdir results/FASTQC_BEFORE_TRIMMING &> {log}" |
41 42 | shell: "cutadapt {params.adapters} --minimum-length 31 -o {output.fastq} {input.fastq} &> {log}" |
62 63 | shell: "fastqc {input.fastq} --threads {threads} --nogroup --outdir results/FASTQC_AFTER_TRIMMING &> {log}" |
85 86 87 | shell: 'echo {input} | tr " " "\n" > {output.html}.fof;' "multiqc -c {params.config} -l {output.html}.fof --verbose --force --outdir results/MULTIQC &> {log}" |
17 18 | shell: "Rscript {params.exe} results/AUTHENTICATION $(dirname {output.heatmap}) &> {log}" |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 | __author__ = "Per Unneberg" __copyright__ = "Copyright 2022, Per Unneberg" __email__ = "per.unneberg@scilifelab.se" __license__ = "MIT" import re from snakemake.shell import shell import subprocess as sp out = sp.run(["malt-build", "--help"], capture_output=True) regex = re.compile("version (?P<major>\d+)\.(?P<minor>\d+)") m = regex.search(out.stderr.decode()) if m is None: # Assume minor version 4 minor = 4 else: minor = int(m.groupdict()["minor"]) a2t_option = "-a2taxonomy" if minor <= 4 else "-a2t" log = snakemake.log_fmt_shell(stdout=False, stderr=True, append=True) options = snakemake.params.get("extra", "") unique_taxids = snakemake.input.unique_taxids seqid2taxid = snakemake.params.seqid2taxid nt_fasta = snakemake.params.nt_fasta accession2taxid = snakemake.params.accession2taxid output = snakemake shell( "grep -wFf {unique_taxids} {seqid2taxid} > {snakemake.output.seqid2taxid_project}; " "cut -f1 {snakemake.output.seqid2taxid_project} > {snakemake.output.seqids_project}; " "grep -Ff {snakemake.output.seqids_project} {nt_fasta} | sed 's/>//g' > {snakemake.output.project_headers}; " "seqtk subseq {nt_fasta} {snakemake.output.project_headers} > {snakemake.output.project_fasta} {log}; " "unset DISPLAY; " "malt-build -i {snakemake.output.project_fasta} {a2t_option} {accession2taxid} -s DNA -t {snakemake.threads} -d {snakemake.output.db} {log}" ) |
Support
- Future updates
Related Workflows





