Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation, topic
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
About
aMeta is a Snakemake workflow for identifying microbial sequences in ancient DNA shotgun metagenomics samples. The workflow performs:
-
trimming adapter sequences and removing reads shorter than 30 bp with Cutadapt
-
quaity control before and after trimming with FastQC and MultiQC
-
taxonomic sequence kmer-based classification with KrakenUniq
-
sequence alignment with Bowtie2 and screening for common microbial pathogens
-
deamination pattern analysis with MapDamage2
-
Lowest Common Ancestor (LCA) sequence alignment with Malt
-
authentication and validation of identified microbial species with MaltExtract
When using aMeta and / or pre-built databases provided together with the wokflow for your research projects, please cite our preprint: https://www.biorxiv.org/content/10.1101/2022.10.03.510579v1
Authors
-
Nikolay Oskolkov (@LeandroRitter) nikolay.oskolkov@scilifelab.se
-
Claudio Mirabello (@clami66) claudio.mirabello@scilifelab.se
-
Per Unneberg (@percyfal) per.unneberg@scilifelab.se
Installation
Clone the github repository, then create and activate aMeta conda environment (here and below
cd aMeta
implies navigating to the cloned root aMeta directory). For this purpose, we recommend installing own
conda
, for example from here https://docs.conda.io/en/latest/miniconda.html, and
mamba
https://mamba.readthedocs.io/en/latest/installation.html:
git clone https://github.com/NBISweden/aMeta
cd aMeta
mamba env create -f workflow/envs/environment.yaml
# alternatively: conda env create -f workflow/envs/environment.yaml, however it takes longer
conda activate aMeta
Run a test to make sure that the workflow was installed correctly:
cd .test
./runtest.sh -j 4
Here, and below, by
-j
you can specify the number of threads that the workflow can use. Please make sure that the installation and test run accomplished successfully before proceeding with running aMeta on your data. Potential problems with installation and test run often come from unstable internet connection and particular
conda
settings used e.g. at computer clusters, therefore we advise you to use your own freshly installed
conda
.
Quick start
To run the worflow you need to prepare a tab-delimited sample-file
config/samples.tsv
with at least two columns, and a configuration file
config/config.yaml
, below we provide examples for both files.
Here is an example of
samples.tsv
, this implies that the fastq-files are located in
aMeta/data
folder:
sample fastq
foo data/foo.fq.gz
bar data/bar.fq.gz
Currently, it is important that the sample names in the first column exactly match the names of the fastq-files in the second column. For example, a fastq-file "data/foo.fq.gz" specified in the "fastq" column, must have a name "foo" in the "sample" column. Please make sure that the names in the first and second columns match.
Below is an example of
config.yaml
, here you will need to download a few databases that we made public (or build databases yourself).
samplesheet: "config/samples.tsv"
# KrakenUniq Microbial NCBI NT database (if you are interested in prokaryotes only)
# can be downloaded from https://doi.org/10.17044/scilifelab.20518251
krakenuniq_db: resources/DBDIR_KrakenUniq_MicrobialNT
# KrakenUniq full NCBI NT database (if you are interested in prokaryotes and eukaryotes)
# can be downloaded from https://doi.org/10.17044/scilifelab.20205504
#krakenuniq_db: resources/DBDIR_KrakenUniq_Full_NT
# Bowtie2 index and helping files for following up microbial pathogens
# can be downloaded from https://doi.org/10.17044/scilifelab.21185887
bowtie2_db: resources/library.pathogen.fna
bowtie2_seqid2taxid_db: resources/seqid2taxid.pathogen.map
pathogenomesFound: resources/pathogensFound.very_inclusive.tab
# Bowtie2 index for full NCBI NT (for quick followup of prokaryotes and eukaryotes),
# can be downloaded from https://doi.org/10.17044/scilifelab.21070063 (please unzip files)
# For using Bowtie2 NT index, replace "bowtie2_db" and "bowtie2_seqid2taxid_db" above by
#bowtie2_db: resources/library.fna
#bowtie2_seqid2taxid_db: resources/seqid2taxid.map.orig
# Helping files for building Malt database
# can be downloaded from https://doi.org/10.17044/scilifelab.21070063
malt_nt_fasta: resources/library.fna
malt_seqid2taxid_db: resources/seqid2taxid.map.orig
malt_accession2taxid: resources/nucl_gb.accession2taxid
# A path for downloading NCBI taxonomy files (performed automatically)
# you do not need to change this line
ncbi_db: resources/ncbi
# Breadth and depth of coverage filters
# default thresholds are very conservative, can be tuned by users
n_unique_kmers: 1000
n_tax_reads: 200
There are several ways to download the database files. One option is to follow this link https://docs.figshare.com/#articles_search and search for the last number in the database links provided above in the "article_id" search bar. This will give you the download url for each file. Then you can either use wget inside a screen session or tmux session to download it, or aria2c, for example, https://aria2.github.io/. N.B. We strongly recommend you not to mix the databases in the same directory but place them in individual folders, otherwise they may overwrite each other.
After you have prepared the sample- and configration-file, please install job-specific environments and update Krona taxonomy:
cd aMeta
snakemake --snakefile workflow/Snakefile --use-conda --conda-create-envs-only -j 20
env=$(grep krona .snakemake/conda/*yaml | awk '{print $1}' | sed -e "s/.yaml://g" | head -1)
cd $env/opt/krona/
./updateTaxonomy.sh taxonomy
cd -
Finally, the workflow can be run using the following command line:
cd aMeta
snakemake --snakefile workflow/Snakefile --use-conda -j 20
In the sections More configuration options , Environment module configuration and Runtime configuration we will give more information about fine-tuning the configuration as well as instructions on how to run the workflow in a computer cluster enviroment.
More details about running aMeta can be found in the step-by-step tutorial available in the
aMeta/vignettes
directory.
Main results of the workflow and their interpretation
All output files of the workflow are located in
aMeta/results
directory. To get a quick overview of ancient microbes present in your samples you should check a heatmap in
results/overview_heatmap_scores.pdf
.
The heatmap demonstrates microbial species (in rows) authenticated for each sample (in columns). The colors and the numbers in the heatmap represent authentications scores, i.e. numeric quantification of eight quality metrics that provide information about microbial presence and ancient status. The authentication scores can vary from 0 to 10, the higher is the score the more likely that a microbe is present in a sample and is ancient. Typically, scores from 8 to 10 (red color in the heatmap) provide good confidence of ancient microbial presence in a sample. Scores from 5 to 7 (yellow and orange colors in the heatmap) can imply that either: a) a microbe is present but not ancient, i.e. modern contaminant, or b) a microbe is ancient (the reads are damaged) but was perhaps aligned to a wrong reference, i.e. it is not the microbe you think about. The former is a more common case scenario. The latter often happens when an ancient microbe is correctly detected on a genus level but we are not confident about the exact species, and might be aligning the damaged reads to a non-optimal reference which leads to a lot of mismatches or poor evennes of coverage. Scores from 0 to 4 (blue color in the heatmap) typically mean that we have very little statistical evedence (very few reads) to claim presence of a microbe in a sample.
To visually examine the eight quality metrics
-
deamination profile,
-
evenness of coverage,
-
edit distance (amount of mismatches) for all reads,
-
edit distance (amount of mismatches) for damaged reads,
-
read length distribution,
-
PMD scores distribution,
-
number of assigned reads (depth of coverage),
-
average nucleotide identity (ANI)
corresponding to the numbers and colors of the heatmap, one can find them in
results/AUTHENTICATION/sampleID/taxID/authentic_Sample_sampleID.trimmed.rma6_TaxID_taxID.pdf
for each sample
sampleID
and each authenticated microbe
taxID
. An example of such quality metrics is shown below:
In case you are interested in an overview of microbial species present in your samples
irrespective of their ancient status
, you can simply check a KrakenUniq abundance matrix here
results/KRAKENUNIQ_ABUNDANCE_MATRIX/krakenuniq_absolute_abundance_heatmap.pdf
:
The values in the heatmap above indicate the numbers of reads assigned to each microbe in each species. The corresponding Total Sum Scaled (TSS), aka library size normalized, KrakenUniq abundance matrix is located in
results/KRAKENUNIQ_ABUNDANCE_MATRIX/krakenuniq_normalized_abundance_heatmap.pdf
. Please note that the microbial species in the KrakenUniq abundance matrix might not always overlap with the ones present in the authentication score heatmap above. This is because not all microbes detected by KrakenUniq at the pre-screening step can be successfully validated by Malt + MaltExtract. The absolute and normalized microbial abundance heatmaps are optimal for visual exploration for up to ~50-100 samples and ~100-200 species. For larger numbers of samples and / or species, the heatmaps may become too crowded and difficult to view; therefore, in this case, the users are advised to utilize the raw KrakenUniq abundance matrix
krakenuniq_abundance_matrix.txt
for their own custom visualization.
Finally, below we list locations and provide short comments for a few other useful metrics / plots / information delivered by aMeta which you perhaps would also like to check:
-
the deamination profile computed by MaltExtract among the eight validation and authentication metrics above might be less informative than the one delivered by MapDamage. You can find the deamination profile for the microbe of interest with
taxID
detected in samplesampleID
hereresults/MAPDAMAGE/sampleID/taxID.tax.bam/Fragmisincorporation_plot.pdf
. Please note that the MapDamage deamination profile is computed on Bowtie2 alignments without LCA, these alignments might be less accurate than the LCA-based Malt alignments used for MaltExtract. -
visualization of KrakenUniq microbial detection results for each sample
sampleID
is available atresults/KRAKENUNIQ/sampleID/taxonomy.krona.html
. -
unfiltered KrakenUniq results for each sample
sampleID
can be found inresults/KRAKENUNIQ/sampleID/krakenuniq.output
, filtered withn_unique_kmers
(breadth of coverage) andn_tax_reads
(depth of coverage) can be found inresults/KRAKENUNIQ/sampleID/krakenuniq.output.filtered
, and the pathogenic subset of the filtered KrakenUniq results for each samplesampleID
is available inresults/KRAKENUNIQ/sampleID/krakenuniq.output.pathogens
. -
Malt microbial abundance matrix quantified from rma6- and sam-files can be found in
results/MALT_ABUNDANCE_MATRIX_RMA6/malt_abundance_matrix_rma6.txt
andresults/MALT_ABUNDANCE_MATRIX_SAM/malt_abundance_matrix_sam.txt
, respectively. These abundance matrices are complementary and can be compared with the KrakenUniq abundance matrix fromresults/KRAKENUNIQ_ABUNDANCE_MATRIX/krakenuniq_abundance_matrix.txt
for a better intuition about microbial presence and abundance. -
all technical details on the quality of your data, adapter trimming, alignments etc. can be found in
results/MULTIQC/multiqc_report.html
.
More configuration options
Within
config.yaml
one can specify what samples to analyse in the
samples
section through the
include
and
exclude
keys, so that a global samplesheet can be reused multiple times.
Analyses
mapdamage
,
authentication
,
malt
, and
krona
can be individually turned on and off in the
analyses
section.
Adapter sequences can be defined in the
adapters
section of
config.yaml
.
The keys
config['adapters']['illumina']
(default
true
) and
config['adapters']['nextera']
(default
false
) are switches
that turn on/off adapter trimming of illumina (
AGATCGGAAGAG
) and nextera (
AGATCGGAAGAG
) adapter sequences. Addional custom adapter sequences can be set in the configuration key
config['adapters']['custom']
which must be an array of strings.
An example snippet that can optionally be added to the configuration file
config.yaml
is shown below:
# you can include or exclude samples
samples:
include:
- foo
exclude:
- bar
# you can include or exclude certain types of analysis
analyses:
mapdamage: true
authentication: true
malt: true
krona: true
# you can specify type of adapters to trim
adapters:
illumina: true
nextera: false
# custom is a list of adapter sequences
custom: []
Environment module configuration
To run the workflow in a computer cluster environemnt you should specify environmental modules and runtimes via
--profile
as follows:
snakemake --snakefile workflow/Snakefile -j 100 --profile .profile --use-envmodules
If the workflow is run on a HPC with the
--use-envmodules
option
(see
using-environment-modules
),
the workflow will check for an additional file that configures environment modules. By default, the file is
config/envmodules.yaml
, but a custom location can be set with the
environment variable
ANCIENT_MICROBIOME_ENVMODULES
.
Environmental modules configurations are placed in a configuration section
envmodules
with key-value pairs that map a dependency set to a list
of environment modules. The dependency sets are named after the rule's
corresponding conda environment file, such that a dependency set may
affect multiple rules. For instance, the following example shows how
to define modules for rules depending on fastqc, as it would be
implemented on the
uppmax
compute cluster:
envmodules:
fastqc:
- bioinfo-tools
- FastQC
See the configuration schema file
(
workflows/schema/config.schema.yaml
) for more information.
Runtime configuration
Most individual rules define the number of threads to run. Although
the number of threads for a given rule can be tweaked on the command
line via the option
--set-threads
, it is advisable to put all
runtime configurations in a
profile
.
At its simplest, a profile is a directory (e.g.
.profile
in the
working directory) containing a file
config.yaml
which consists of
command line option settings. In addition to customizing threads, it
enables the customization of resources, such as runtime and memory. An
example is shown here:
# Rerun incomplete jobs
rerun-incomplete: true
# Restart jobs once on failure
restart-times: 1
# Set threads for mapping and fastqc
set-threads:
- Bowtie2_Alignment=10
- FastQC_BeforeTrimming=5
# Set resources (runtime in minutes, memory in mb) for malt
set-resources:
- Malt:runtime=7200
- Malt:mem_mb=512000
# Set defalt resources that apply to all rules
default-resources:
- runtime=120
- mem_mb=16000
- disk_mb=1000000
For more advanced profiles for different hpc systems, see Snakemake-Profiles github page .
Frequently Asked Questions (FAQ)
My fastq-files do not contain adapters, how can I skip the adapter removal step?
To our experinece, there are very often adapter traces left even after an adapter removing software has been applied to the fastq-files. Therefore, we strongly recommend not to skip the adapter removal step. This step is typically not time consuming and can only be beneficial for the analysis. Otherwise, adapter contamination can lead to severe biases in microbial discovery.
I have paired-end (PE) sequencing data, does aMeta support it?
Currently not, but this option will be added soon. As in other aDNA analyses, PE reads need to be merged prior to using aMeta, this can be achieve by e.g.
fastp
https://github.com/OpenGene/fastp. Alternatively, simple concatenation of R1 and R2 reads as
cat R1.fastq.gz R2.fastq.gz > merged.fastq.gz
is also possible, and the resulting file
merged.fastq.gz
can be used as input for aMeta.
I get "Java heap space error" on the Malt step, what should I do?
You will need to adjust Malt max memory usage (64 GB by default) via modifying
malt-build.vmoptions
and
malt-run.vmoptions
files.
To locate these files you have to find a Malt conda environment, activate it and replace the default 64 GB with the amount of RAM available on you computer node, in the example below it is 512 GB:
cd aMeta env=$(grep hops .snakemake/conda/*yaml | awk '{print $1}' | sed -e "s/.yaml://g" | head -1) conda activate $env version=$(conda list malt --json | grep version | sed -e "s/\"//g" | awk '{print $2}') cd $env/opt/malt-$version sed -i -e "s/-Xmx64G/-Xmx512G/" malt-build.vmoptions sed -i -e "s/-Xmx64G/-Xmx512G/" malt-run.vmoptions cd - conda deactivate
I get "Java heap space error" on the FastQC step, what should I do?
Similarly to Malt, see above, you will need to modify the default memory usage of FastQC. An example of how this can be done is demonstrated below:
cd aMeta env=$(grep fastqc .snakemake/conda/*yaml | awk '{print $1}' | sed -e "s/.yaml://g" | head -1) conda activate $env version=$(conda list fastqc --json | grep version | sed -e "s/\"//g" | awk '{print $2}') cd $env/opt/fastqc-$version sed -i -e "s/-Xmx250m/-Xmx10g/" fastqc cd - conda deactivate
I get a "No validator found" message, should I be worried?
Short answer: no, you do not need to be worried about purple snakemake warning text. Only red messages indicate an error and should be investigated.
aMeta takes a lot of time to run, can I speed it up?
If you run aMeta using our pre-built database:
KrakenUniq database based on full NCBI NT: https://doi.org/10.17044/scilifelab.20205504
KrakenUniq database based on microbial part of NCBI NT: https://doi.org/10.17044/scilifelab.20518251
KrakenUniq database based on microbial part of NCBI RefSeq: https://doi.org/10.17044/scilifelab.21299541
Bowtie2 index for full NCBI NT database: https://doi.org/10.17044/scilifelab.21070063
Bowtie2 index for pathogenic microbial species of NCBI NT: https://doi.org/10.17044/scilifelab.21185887
it can be very fast (a few hours for a sample with ~10 mln reads) if you have enough RAM (recommended minimum ~200 GB, ideally ~1 TB). Otherwise, runing aMeta with smaller RAM is also possible but results in much longer computation times. We prioritize using large databases for more accurate metagenomic analysis. Alternatively, smaller databases can also be used which might speed up aMeta considerably, but very likely result in less accurate analysis (lower sensitivity and specificity).
Code Snippets
23 24 | shell: "bowtie2-build-l --threads {threads} {input.ref} {input.ref} > {log} 2>&1" |
47 48 49 50 51 52 53 54 55 56 57 | shell: """bowtie2 --large-index -x {params.BOWTIE2_DB} --end-to-end --threads {threads} --very-sensitive -U {input.fastq} > $(dirname {output.bam})/AlignedToBowtie2DB.sam 2> {log};""" """grep @ $(dirname {output.bam})/AlignedToBowtie2DB.sam | awk '!seen[$2]++' > $(dirname {output.bam})/header_nodups.txt;""" """grep -v '^@' $(dirname {output.bam})/AlignedToBowtie2DB.sam > $(dirname {output.bam})/AlignedToBowtie2DB.noheader.sam;""" """cat $(dirname {output.bam})/header_nodups.txt $(dirname {output.bam})/AlignedToBowtie2DB.noheader.sam > $(dirname {output.bam})/AlignedToBowtie2DB.nodups.sam;""" """samtools view -bS -q 1 -h -@ {threads} $(dirname {output.bam})/AlignedToBowtie2DB.nodups.sam | samtools sort - -@ {threads} -o {output.bam};""" """samtools index {output.bam};""" """rm $(dirname {output.bam})/header_nodups.txt;""" """rm $(dirname {output.bam})/AlignedToBowtie2DB.noheader.sam;""" """rm $(dirname {output.bam})/AlignedToBowtie2DB.nodups.sam;""" """rm $(dirname {output.bam})/AlignedToBowtie2DB.sam;""" |
19 20 21 22 | shell: "mkdir -p {params.dir}; " "while read taxid; do mkdir -p {params.dir}/$taxid; touch {params.dir}/$taxid/.done; done<{input.species};" "touch {output.done}" |
48 49 | shell: "touch {output}; " |
63 64 | shell: "awk -v var={wildcards.taxid} '{{ if($1==var) print $0 }}' {params.tax_db}/taxDB | cut -f3 > {output.node_list}" |
94 95 | shell: "time MaltExtract -Xmx32G -i {input.rma6} -f def_anc -o {params.extract} -r {params.ncbi_db} --reads --threads {threads} --matches --minPI 85.0 --maxReadLength 0 --minComp 0.0 --meganSummary -t {input.node_list} -v 2> {log}" |
114 115 | shell: "postprocessing.AMPS.r -m def_anc -r {params.extract} -t {threads} -n {input.node_list} || echo 'postprocessing failed for {wildcards.sample}_{wildcards.taxid}' > {output.analysis} 2> {log}" |
134 135 | shell: "samtools faidx {input.fna} 2> {log}" |
159 160 161 162 163 164 165 166 167 | shell: "echo {params.ref_id} > {output.name_list}; " "zgrep {params.ref_id} {input.sam} | uniq > {output.sam}; " "samtools view -bS {output.sam} > results/AUTHENTICATION/{wildcards.sample}/{wildcards.taxid}/{params.ref_id}.bam; " "samtools sort results/AUTHENTICATION/{wildcards.sample}/{wildcards.taxid}/{params.ref_id}.bam > {output.sorted_bam}; " "samtools index {output.sorted_bam}; " "samtools depth -a {output.sorted_bam} > {output.breadth_of_coverage}; " "grep -w -f {output.name_list} {input.malt_fasta_fai} | awk '{{printf(\"%s:1-%s\\n\", $1, $2)}}' > {output.name_list}.regions; " "samtools faidx {input.malt_fasta} -r {output.name_list}.regions -o results/AUTHENTICATION/{wildcards.sample}/{wildcards.taxid}/{params.ref_id}.fasta" |
184 185 | shell: "samtools view {input.bam} | awk '{{ print length($10) }}' > {output.distribution}" |
202 203 | shell: "(samtools view -h {input.bam} || true) | pmdtools --printDS > {output.scores}" |
226 227 | shell: "Rscript {params.exe} {wildcards.taxid} {wildcards.sample}.trimmed.rma6 {input.dir}/" |
245 246 247 248 | shell: "(samtools view {input.bam} || true) | pmdtools --platypus --number 2000000 > {output.tmp}; " "cd results/AUTHENTICATION/{wildcards.sample}/{wildcards.taxid}; " "R CMD BATCH $(which plotPMD); " |
269 270 | shell: "Rscript {params.exe} {input.rma6} $(dirname {input.maltextractlog}) {input.name_list} $(dirname {input.name_list}) &> {log};" |
23 24 25 26 27 28 29 30 31 32 | shell: "mkdir {output.dir}; " "if [ -s {input.species_tax_id} ]; then " 'cat {input.species_tax_id} | parallel -j {threads} "grep -w {{}} {params.bowtie2_seqid2taxid} | cut -f1 > {output.dir}/{{}}.seq_ids" ; ' "for i in $(cat {input.species_tax_id}); do xargs --arg-file={output.dir}/${{i}}.seq_ids samtools view -bh {input.bam} --write-index -@ {threads} -o {output.dir}/${{i}}.tax.bam; done >> {log} 2>&1; " "find {output.dir} -name '*.tax.bam' | parallel -j {threads} \"mapDamage {params.options} -i {{}} -r {params.BOWTIE2_DB} --merge-reference-sequences -d {output.dir}/mapDamage_{{}}\" >> {log} 2>&1 || true; " "for filename in {output.dir}/*.tax.bam; do newname=`echo $filename | sed 's/tax\.//g'`; mv $filename $newname; done >> {log} 2>&1; " "mv {output.dir}/mapDamage_{output.dir}/* {output.dir} >> {log} 2>&1; " "rm -r {output.dir}/mapDamage_results >> {log} 2>&1; " "else echo NO MICROBES TO AUTHENTICATE > {output.dir}/README.txt; fi" |
21 22 | shell: "krakenuniq --preload --db {params.DB} --fastq-input {input.fastq} --threads {threads} --output {output.seqs} --report-file {output.report} --gzip-compressed --only-classified-out &> {log}" |
49 50 51 52 | shell: """{params.exe} {input.krakenuniq} {params.n_unique_kmers} {params.n_tax_reads} {input.pathogenomesFound} &> {log}; """ """cut -f7 {output.pathogens} | tail -n +2 > {output.pathogen_tax_id};""" """cut -f7 {output.filtered} | tail -n +2 > {output.species_tax_id}""" |
78 79 80 81 | shell: "{params.exe} {input.report} {input.seqs} &> {log}; " "cat {output.seqs} | cut -f 2,3 > {output.krona}; " "ktImportTaxonomy {output.krona} -o {output.html} {params.DB} &>> {log}" |
109 110 111 | shell: "Rscript {params.exe} results/KRAKENUNIQ {output.out_dir} {params.n_unique_kmers} {params.n_tax_reads} &> {log};" "Rscript {params.exe_plot} {output.out_dir} {output.out_dir} &> {log}" |
25 26 | script: "../scripts/malt-build.py" |
49 50 | shell: "unset DISPLAY; malt-run -at SemiGlobal -m BlastN -i {input.fastq} -o {output.rma6} -a {params.gunzipped_sam} -t {threads} -d {input.db} -sup 1 -mq 100 -top 1 -mpi 85.0 -id 85.0 -v &> {log}" |
69 70 71 | shell: "mkdir -p {output.out_dir}; " "{params.exe} {input.sam} {params.unique_taxids} > {output.counts} 2> {log}" |
95 96 | shell: "Rscript {params.exe} results/MALT_QUANTIFY_ABUNDANCE {output.out_dir} &> {log}" |
118 119 120 121 | shell: "{params.exe} -d $(dirname {input.rma6}) -r 'S' &> {log}; " "mv results/MALT/count_table.tsv {output.out_dir}; " "mv {output.out_dir}/count_table.tsv {output.abundance_matrix}" |
141 142 143 | shell: "mv {input.tre} {output.tre};" "mv {input.map} {output.map};" |
19 20 | shell: "fastqc {input.fastq} --threads {threads} --nogroup --outdir results/FASTQC_BEFORE_TRIMMING &> {log}" |
41 42 | shell: "cutadapt {params.adapters} --minimum-length 31 -o {output.fastq} {input.fastq} &> {log}" |
62 63 | shell: "fastqc {input.fastq} --threads {threads} --nogroup --outdir results/FASTQC_AFTER_TRIMMING &> {log}" |
85 86 87 | shell: 'echo {input} | tr " " "\n" > {output.html}.fof;' "multiqc -c {params.config} -l {output.html}.fof --verbose --force --outdir results/MULTIQC &> {log}" |
17 18 | shell: "Rscript {params.exe} results/AUTHENTICATION $(dirname {output.heatmap}) &> {log}" |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 | __author__ = "Per Unneberg" __copyright__ = "Copyright 2022, Per Unneberg" __email__ = "per.unneberg@scilifelab.se" __license__ = "MIT" import re from snakemake.shell import shell import subprocess as sp out = sp.run(["malt-build", "--help"], capture_output=True) regex = re.compile("version (?P<major>\d+)\.(?P<minor>\d+)") m = regex.search(out.stderr.decode()) if m is None: # Assume minor version 4 minor = 4 else: minor = int(m.groupdict()["minor"]) a2t_option = "-a2taxonomy" if minor <= 4 else "-a2t" log = snakemake.log_fmt_shell(stdout=False, stderr=True, append=True) options = snakemake.params.get("extra", "") unique_taxids = snakemake.input.unique_taxids seqid2taxid = snakemake.params.seqid2taxid nt_fasta = snakemake.params.nt_fasta accession2taxid = snakemake.params.accession2taxid output = snakemake shell( "grep -wFf {unique_taxids} {seqid2taxid} > {snakemake.output.seqid2taxid_project}; " "cut -f1 {snakemake.output.seqid2taxid_project} > {snakemake.output.seqids_project}; " "grep -Ff {snakemake.output.seqids_project} {nt_fasta} | sed 's/>//g' > {snakemake.output.project_headers}; " "seqtk subseq {nt_fasta} {snakemake.output.project_headers} > {snakemake.output.project_fasta} {log}; " "unset DISPLAY; " "malt-build -i {snakemake.output.project_fasta} {a2t_option} {accession2taxid} -s DNA -t {snakemake.threads} -d {snakemake.output.db} {log}" ) |
Support
- Future updates
Related Workflows





