Snakemake-based workflow for the retrieval of Eukaryotic (and other) genomes from metagenomes
1.0 Introduction
Molecular and genomic approaches, particularly those applied to whole, mixed communities (e.g., metagenomics and metatranscriptomics), have shed light on the ecological roles, evolutionary histories, and physiological capabilities of the organisms in these communities. We developed a scalable and reproducible pipeline to facilitate the retrieval, taxonomic assignment, and annotation of eukaryotic metagenome-assembled genomes (MAGs) from mixed-community metagenomes. The representation of the pipeline below shows how EukHeist can be used to retrieve eukaryotic (and other) metagenome-assembled genomes from the Tara Oceans global dataset.
General workflow
1.1 EukHeist capabilities
EukHeist is capable of running metagenome analysis on its own and incorporating metatranscriptome reads downstream. In the hierarchy_cluster.yaml file, a user can select metaG, metaT, or both (see section 3.1).
User workflow:
- get set up on HPC with Snakemake
- prepare the input sample list and assembly group file
- run tests and confirm the rules in the pipeline you plan to use
2.0 Set up
Locate and organize metagenomic and metatranscriptomic fastq files. Create a raw reads directory, and within that a single directory for your metagenomic reads and a second directory for metatranscriptomic reads. You will need to know the full path for all files, the individual sample IDs, and an idea of how the assemblies should be grouped. This grouping may be selected on the basis of anticipated sample similarity: e.g., you may wish to co-assemble several replicates from a single or closely-related set of sampling locations in order to maximize metagenome-assembled genome recovery but minimize mixing of organisms from distinct sites. In this step, we are creating input file lists that will tell EukHeist where to look for our input fastq reads, what to name them, and how to group the assemblies and mapping steps.
2.1 Structure input fastq files in individual directories
Example file structure:
EukHeist/raw_dir/
├── metagenome
└── metatranscriptome
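If you are organizing your own data rather than the test data, a minimal sketch to create this layout (the directory names here simply mirror the example above; the source paths are placeholders):

mkdir -p EukHeist/raw_dir/metagenome EukHeist/raw_dir/metatranscriptome
# copy or symlink your paired fastq.gz files into the matching directory
cp /path/to/your/metagenome/*.fastq.gz EukHeist/raw_dir/metagenome/
cp /path/to/your/metatranscriptome/*.fastq.gz EukHeist/raw_dir/metatranscriptome/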
To use the test data, run:
bash download-test-data.sh
This will download raw reads, for a test run, to the metagenome and metatranscriptome directories. This is also a way to familiarize yourself with the format that we expect and get ready for your future runs of EukHeist.
2.2 Create sample list files for metagenome and metatranscriptome data
The final sample list files should look like this, with complete paths under "FULLPATH" and an "ASSEMBLY_GROUPING" column that lists how you want the samples to be assembled for the metagenomic and metatranscriptomic pipelines. In this example, all three of these files would be assembled together in a single MEGAHIT assembly, because they have the same "ASSEMBLY_GROUPING" name. If you don't want any of your samples to be assembled in the same batch, simply give each assembly group the same name as its sample.
Example sample list files
SAMPLEID SAMPLENAME OMIC FULLPATH R1 R2 ASSEMBLY_GROUPING
ERR1163068 CTD1200_DNA METAGENOMIC /../../../metagenome ERR1163068_1.fastq.gz ERR1163068_2.fastq.gz CTD1200_DNA
ERR1163069 CTD1200_DNA METAGENOMIC /../../../metagenome ERR1163069_1.fastq.gz ERR1163069_2.fastq.gz CTD1200_DNA
ERR1163070 CTD1200_DNA METAGENOMIC /../../../metagenome ERR1163070_1.fastq.gz ERR1163070_2.fastq.gz CTD1200_DNA
See the example file "input-samplelist-example-metagenomic.txt". These should be tab-separated .txt files.
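A quick, hedged check that every row of your sample list has the expected seven tab-separated columns (the filename here is just the example file above):

awk -F'\t' 'NF != 7 {print "line " NR " has " NF " columns"}' input-samplelist-example-metagenomic.txt
# no output means every row parsed cleanly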
2.3 Generate an Assembly group file
The first column, 'ASSEMBLY_GROUPING', lists the unique names for each group specified in the sample list file. The second column lists the sample IDs (in this example, the accession numbers) that are associated with the ASSEMBLY_GROUPING, separated by commas. This file should match your list of samples above (so if you have one sample per assembly group, every sample list in this file should have only one sample and no commas).
ASSEMBLY_GROUPING SAMPLE_LIST
Group1 ERR1726828,ERR599214
Group2 ERR868421
Create sample list files that end in metatranscriptomic.txt and metagenomic.txt, and run the provided R script to automatically generate the assembly group files:

Rscript /../generate-assembly-group-tables.r

This will read the sample list file and generate the assembly grouping file, as long as the sample names designate the assembly groups.
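If you prefer not to use R, a minimal awk sketch (assuming the tab-separated sample list format shown in section 2.2, with SAMPLEID in column 1 and ASSEMBLY_GROUPING in column 7) produces the same table:

awk -F'\t' 'NR > 1 {g[$7] = g[$7] ? g[$7] "," $1 : $1}
    END {print "ASSEMBLY_GROUPING\tSAMPLE_LIST"; for (k in g) print k "\t" g[k]}' \
    input-samplelist-example-metagenomic.txt > assembly-list-metaG.txt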
3.0 Set up EukHeist
Start by cloning this repo.
git clone https://github.com/AlexanderLabWHOI/EukHeist.git
Then, create a conda environment to run the EukHeist pipeline. We recommend using mamba for this; run
conda install mamba -n base -c conda-forge
to install mamba into your base conda environment.
conda create -n EUKHeist -c conda-forge mamba #create EUKHeist environment
conda activate EUKHeist #activate EUKHeist
mamba install -c bioconda -c conda-forge snakemake python jinja2 networkx matplotlib graphviz #install items from bioconda and conda-forge needed to run pipeline
(Note: the above instructions will be updated when versions of the needed software are updated.)
3.1 Modify hierarchy_cluster.yaml to tell EukHeist the location of raw read directories, the assembly group table, and where you want results to be stored.
Explanation of the hierarchy_cluster.yaml file:
directories:
    input: test-data # Input path for where metagenomic and metatranscriptomic reads are
    output: /vortexfs1/omics/alexander/akrinos/EUKHeist_test # Set a location as your output directory
    scratch: /vortexfs1/scratch/akrinos/EUKHeist_scratch # Full path for scratch directory
metaG:
    create_sample_table: False
    sample_data_table: test-data/samplelist-metaG-wgroups.txt # Location of metagenomic sample list
    assembly_group_table: test-data/assembly-list-metaG.txt # Location of metagenomic assembly list
    folder_name: metaG # Directory name (within "input") where metagenomic reads are located
metaT:
    create_sample_table: False
    sample_data_table: test-data/samplelist-metaT-wgroups.txt
    assembly_group_table: test-data/assembly-list-metaT.txt
    folder_name: metaT #PRJEB6609
adapters: input/adapters/illumina-adapters.fa
mode: metaG
megahit:
    cpu: 80
    min_contig: 1000
    mem: .95
    other: --continue --k-list 29,39,59,79,99,119
Co-assembly is dictated by the user providing an ASSEMBLY_GROUPING column in the sample list input file and toggling the create_sample_table parameter.
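Before moving on, a small hedged sanity check that the input paths you entered actually exist (the key names in the grep pattern come from the example config above; the output and scratch directories are excluded because they may not exist yet):

for p in $(grep -E '^\s*(input|sample_data_table|assembly_group_table|adapters):' hierarchy_cluster.yaml | awk '{print $2}'); do
    [ -e "$p" ] || echo "missing: $p"
done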
3.2 Review working directory
Explanation of working directory:
EukHeist
├── cluster.yaml # File that is modified by user to run snakemake on HPC
├── config.yaml #RM
├── environment.yaml # Environment to create snakemake conda env
├── envs # Directory with all environment files for each snakemake rule
├── EUKHeist # Snakemake script for EukHeist
├── hierarchy_cluster.yaml # User-modified file that configures EukHeist to run your samples
├── input # Necessary input files to run EukHeist
├── LICENSE
├── README.md
├── rules # each step of the snakemake pipeline is a 'rule'
├── scripts # Directory with additional scripts used in EukHeist
├── Snakefile #RM
├── start-up #RM?
└── submit_script #RM?
4.0 Execute EukHeist dry run / test
EukHeist relies on the workflow manager Snakemake. Snakemake will create a script of all required commands, check that the needed files are in place, and confirm that all rules can be executed. We can use the "dry run" approach to perform an initial test of our EukHeist run.
4.1 Set up with your HPC
Navigate to the EukHeist directory and modify the cluster.yaml file for your specific HPC access. Under __default__ you can specify your account information, threads, time, and node that will be used for each slurm run. Then, below that, each rule has a set of thread, memory, and time parameters that can be modified for your individual dataset. For instance, the trimmomatic step looks like this:
trimmomatic:
    queue: compute
    threads: 6
    mem: 24
    time: 720
4.3 Run Snakemake test run
If running with your HPC and slurm, navigate to EukHeist/submit_script/. There are three bash scripts that can be used to run dry tests of your EukHeist workflow and to actually submit the whole thing to slurm.
In order to execute a dry run, activate the EukHeist conda environment you previously created, then use the -s and -np flags with snakemake to test the workflow.
conda activate EukHeist # Prefix to each line should now read "(EukHeist)"
snakemake -s EUKHeist -np --use-conda
4.4 Execute full snakemake pipeline
Once you get a green output from snakemake with the dry run, the EukHeist pipeline is ready to run. Run the bash script in the submit_script directory to send all jobs to slurm. This will apply the parameters you specified in the cluster.yaml file.
bash submit_script/submit_snakemake.sh
## If not using HPC
# snakemake -s EUKHeist --use-conda --cores <number of available cores>
5.0 EukHeist output
Within the output-data directory, the final file structure should look like this:
.
├── eukrep # sorted prok vs euk contigs
├── logs
├── megahit # co-assembled contigs
├── metabat2 # bwa-derived abundances of prok contigs
├── metabat2_euk # bwa-derived abundances of euk contigs
├── prodigal
└── qc
# From scratch
.
├── bwa_index
├── mapping
├── sourmash
└── trimmed
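As a quick, hedged check that binning produced output, you can count the bins in the metabat2 directories (this assumes metabat2's usual .fa naming; adjust the paths to match your configured output directory):

ls output-data/metabat2/*.fa | wc -l      # prokaryote-enriched bins
ls output-data/metabat2_euk/*.fa | wc -l  # eukaryote-enriched bins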
6.0 Snakemake troubleshooting
- When throwing an error, snakemake will list log files. Each time snakemake is executed, a log file is created in CURRENT_SNAKEMAKE_DIR/.snakemake/log/. These are dated and provide the printed output. Some common errors and steps to diagnose them follow.
- Compatibility of snakemake and conda: note the versions of snakemake and anaconda you are using. Upon conda's switch from source activate to conda activate, snakemake was not calling the conda environments properly. Errors associated with this were returned non-zero exit status 127 and an error about line 56 in thread.py like this: LOCATION_OF_YOUR_CONDA_ENVS/.conda/envs/snake-18S/lib/python3.6/concurrent/futures/thread.py", line 56, in run. Update your version of snakemake; the versions listed above are compatible. This error will also be generated when an incompatible conda environment is called, such as an outdated snakemake wrapper. Try updating that, but ensure you have first updated the snakemake version.
- Check all log files, found in ./snakemake/log and the log files generated as output. Most helpful are the output files from each slurm job, slurm-XXX.out. Look at the most recent slurm out files after a job fails to find more informative errors.
- Error in rule xxx: typically, snakemake will also tell you in which rule an error occurred. In this example, an error read Error in rule fastqc: and pointed to the location of the fastqc log file itself (for EukHeist this will be in the output data). When we inspected the fastqc log file, we discovered that the .yaml file for fastqc had an issue, so snakemake was not able to create this environment or find 'fastqc' to run.
- KeyError: a KeyError indicates that some samples are not named correctly. This should become apparent during the test runs of snakemake, where EukHeist goes through all the code it plans to run and makes sure that all files will exist and are named correctly. In the case of EukHeist, ensure your assembly group and sample list files are free of typos. Typically the KeyError will show where snakemake found the problem, so this can be used to search for a potential typo.
- See snakemake -h for additional commands to clean up the working snakemake environment, list steps, restart attempts, etc. When running, snakemake "locks" the working directory, so only one snakemake command can be run at a time. If you have to cancel, make sure to run snakemake --unlock to clear it. See other flags to clean up your working environment after updating conda, snakemake, or other environment versions (--cleanup-shadow, --cleanup-conda). To look for additional code errors that may result in syntax errors, try adding these flags to the snakemake execution (see the dry-run example at the end of this section):
  - --summary or --detailed-summary
  - --printshellcmds
  - --debug
- During the trimmomatic step, if you run into an error that looks like this:
Exception in thread "main" java.io.EOFException: Unexpected end of ZLIB input stream
at java.base/java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:245)
at java.base/java.util.zip.InflaterInputStream.read(InflaterInputStream.java:159)
at java.base/java.util.zip.GZIPInputStream.read(GZIPInputStream.java:118)
at org.usadellab.trimmomatic.util.ConcatGZIPInputStream.read(ConcatGZIPInputStream.java:73)
it indicates that there is something wrong with your original fastq file and trimmomatic detected it. To test your raw fastq files, use the command gunzip -t <fastq-file-name> && echo "VALID"; the output will print "VALID" if the fastq file is formatted and zipped properly. If there is an issue, you may need to re-download the raw sequence file. A loop that checks every file at once is sketched at the end of this section.
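For example, a hedged dry-run invocation combining the debugging flags above with the EUKHeist Snakefile from section 4.3:

snakemake -s EUKHeist -np --printshellcmds
snakemake -s EUKHeist --detailed-summary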
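And a minimal sketch to run the gunzip integrity test over all raw fastq files at once (assuming the raw_dir layout from section 2.1):

for f in EukHeist/raw_dir/metagenome/*.fastq.gz EukHeist/raw_dir/metatranscriptome/*.fastq.gz; do
    gunzip -t "$f" && echo "VALID: $f" || echo "CORRUPT: $f"
done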
7.0 Accessing EukHeist Tara Oceans results
7.1 Downloading raw data
Please see our GitHub repository for information on how to pull the raw data from the Tara Oceans project from ENA.
7.2 Links to processed data (MAGs) from OSF (Alexander et al 2021)
- Preprint here:
- OSF link for data:
- Talk from the BVCN conference:
7.3 Using R to download TOPAZ MAGs
The osfr package can access OSF directories and extract the files and data that you may need. Set up your environment (R > 4.0). In R:
# install.packages("osfr")
# install.packages("tidyverse")
library(osfr)
library(tidyverse)
TOPAZ eukaryotic MAGs:
topaz <- osf_retrieve_node("c9hj5")
The initial set of MAGs from the Tara Oceans project processed with EukHeist can be found here:
osf_open(topaz)
TOPAZ prokaryotic MAGs:
topaz_prok <- osf_retrieve_node("n3v2c")
List all available files in eukaryotic TOPAZ MAGs.
# topaz
topaz_euk_all <- osf_ls_files(topaz,
                              path = "Eukaryotic_TOPAZ_MAGs",
                              # pattern = "IOS", # Option to isolate specific regions
                              # n_max = 10, # Option to only print first 10 (default)
                              n_max = 1000)
topaz_euk_all now holds a list of all the eukaryotic TOPAZ MAGs.
topaz_prok <- osf_ls_files(topaz_prok,
                           path = "Prokaryotic_TOPAZ_MAGs",
                           # pattern = "IOS", # Option to isolate specific regions
                           # n_max = 10, # Option to only print first 10 (default)
                           n_max = 1000)
Example download
Isolate a subset of eukaryotic MAG files from NPS and download them to a test-download folder.
topaz_euk_NPS <- osf_ls_files(topaz,
                              path = "Eukaryotic_TOPAZ_MAGs",
                              n_max = 10,
                              pattern = "NPS")
# topaz_euk_NPS
# ?osf_ls_files
# Download to local directory
osf_download(
    topaz_euk_NPS,
    path = "topaz-mags/test-download/",
    recurse = FALSE,
    conflicts = "error",
    # verbose = FALSE,
    progress = FALSE
)
Download highly complete eukaryotic MAGs
A text file lists the highly complete eukaryotic MAGs. We can use this list to download only the highly complete eukaryotic MAGs.
osf_download(osf_ls_files(topaz,
                          path = "Eukaryotic_TOPAZ_MAGs",
                          pattern = "hqmags.txt"),
             path = ".",
             conflicts = "error",
             recurse = FALSE)
hc_mags <- read.delim("hqmags.txt", header = FALSE)
Get list of the 485 total highly complete MAGs.
hc_mags_list <- as.character(hc_mags$V1)
# length(hc_mags_list)
Filter from the complete list.
topaz_euk_HC <- topaz_euk_all %>%
    filter(name %in% hc_mags_list)
# topaz_euk_HC
Download all highly complete eukaryotic MAGs
osf_download(
    topaz_euk_HC,
    path = "topaz-mags/highly-complete-mags",
    recurse = FALSE,
    conflicts = "error",
    progress = FALSE
)
Download MAGs by taxa
Access supplementary data from EukHeist manuscript.
supp <- osf_retrieve_node("twz2f")
osf_download(osf_ls_files(supp,
                          pattern = "TableS02"),
             path = ".",
             conflicts = "error",
             recurse = FALSE)
Import back into the R session:
taxa_mags <- read.csv("TableS02_EukaryoticMAG.csv")
head(taxa_mags)
Only download specific haptophyta MAGs
haptophyta <- taxa_mags %>%
    # filter(groups == "Haptophyta") %>%
    filter(grepl("Prymnesiophyceae", eukulele_taxonomy))
prym_list <- as.character(haptophyta$X)
Select the TOPAZ MAGs that are Prymnesiophyceae and download them.
osf_download(topaz_euk_all %>%
                 filter(name %in% prym_list),
             path = ".",
             conflicts = "error",
             recurse = FALSE)
Last updated 20 June 2023. If you have questions about the approach we took here, or have tried it out yourself and have input about our modular, à la carte approach to metagenome-assembled genome (MAG) binning, please do reach out to us over on the Issues tab!
Code Snippets
wrapper: "0.27.1/bio/fastqc"
wrapper: "0.27.1/bio/trimmomatic/pe"
wrapper: "0.27.1/bio/fastqc"
shell:
    """
    multiqc -n multiqc.html {input.rawG}
    mv multiqc.html {output.html_rawG}
    mv multiqc_data/multiqc_general_stats.txt {output.stats_rawG}
    rm -rf multiqc_data
    multiqc -n multiqc.html {input.trimmedG}
    mv multiqc.html {output.html_trimmedG}
    mv multiqc_data/multiqc_general_stats.txt {output.stats_trimmedG}
    rm -rf multiqc_data
    multiqc -n multiqc.html {input.trimmedT}
    mv multiqc.html {output.html_trimmedT}
    mv multiqc_data/multiqc_general_stats.txt {output.stats_trimmedT}
    rm -rf multiqc_data
    multiqc -n multiqc.html {input.trimmedT}
    mv multiqc.html {output.html_trimmedT}
    mv multiqc_data/multiqc_general_stats.txt {output.stats_trimmedT}
    rm -rf multiqc_data
    """
shell:
    """
    zcat {input.r1} {input.r2} | sourmash compute -k 21,31,51 \
        --scaled 10000 --track-abundance \
        -o {output} - 2> {log}
    """
shell:
    """
    megahit -1 {params.inputr1} -2 {params.inputr2} --min-contig-len {params.min_contig_len} --memory {params.memory} -t {params.cpu_threads} --out-dir {params.megahit_output_name} {params.other_options} >> {log} 2>&1
    """
shell:
    """
    bwa index {input} 2> {log}
    """
shell:
    """
    cp {input} {output}
    """
shell:
    """
    bwa mem -t {params.threads} {params.extra} {input.reference} {input.r1} {input.r2} | samtools sort -o {output} - >> {log} 2>&1
    """
shell:
    """
    jgi_summarize_bam_contig_depths --outputDepth {output} {input} > {log} 2>&1
    """
shell:
    """
    metabat2 {params.other} --numThreads {params.threads} -i {input.assembly} -a {input.depth} -o {output} > {log} 2>&1
    """
shell:
    """
    EukRep -i {input} -o {output} --prokarya {params.prok} --min {params.min_contig} > {log} 2>&1
    """
shell:
    """
    metabat2 {params.other} --numThreads {params.threads} -i {input.assembly} -a {input.depth} -o {output} > {log} 2>&1
    """
shell:
    """
    prodigal -i {input.assembly} -f gff -o {output.genes} -a {output.proteins} -p meta
    """
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "julianderuiter@gmail.com"
__license__ = "MIT"


from os import path
from tempfile import TemporaryDirectory

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=False, stderr=True)


def basename_without_ext(file_path):
    """Returns basename of file path, without the file extension."""
    base = path.basename(file_path)
    split_ind = 2 if base.endswith(".gz") else 1
    base = ".".join(base.split(".")[:-split_ind])
    return base


# Run fastqc; since there can be race conditions if multiple jobs
# use the same fastqc dir, we create a temp dir.
with TemporaryDirectory() as tempdir:
    shell(
        "fastqc {snakemake.params} --quiet "
        "--outdir {tempdir} {snakemake.input[0]}"
        " {log}"
    )

    # Move outputs into proper position.
    output_base = basename_without_ext(snakemake.input[0])
    html_path = path.join(tempdir, output_base + "_fastqc.html")
    zip_path = path.join(tempdir, output_base + "_fastqc.zip")

    if snakemake.output.html != html_path:
        shell("mv {html_path} {snakemake.output.html}")

    if snakemake.output.zip != zip_path:
        shell("mv {zip_path} {snakemake.output.zip}")
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"


from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

trimmer = " ".join(snakemake.params.trimmer)

shell(
    "trimmomatic PE {snakemake.params.extra} "
    "{snakemake.input.r1} {snakemake.input.r2} "
    "{snakemake.output.r1} {snakemake.output.r1_unpaired} "
    "{snakemake.output.r2} {snakemake.output.r2_unpaired} "
    "{trimmer} "
    "{log}"
)
Support
- Future updates