Snakemake workflow for bacterial genome assembly + polishing for Oxford Nanopore (ONT) sequencing using multiple tools
A snakemake-wrapper for easily creating de novo bacterial genome assemblies from Oxford Nanopore (ONT) sequencing data, and optionally Illumina data, using any combination of read filtering, assembly, long and short read polishing, and reference-based polishing.
See the preprint here: Snakemake Workflows for Long-read Bacterial Genome Assembly and Evaluation, Preprints.org 2022
Included programs
read filtering | assembly | long read polishing | short read polishing | reference-based polishing |
---|---|---|---|---|
Filtlong
Rasusa |
Flye
raven miniasm Unicycler Canu |
racon
medaka |
pilon
Polypolish POLCA |
Homopolish
proovframe |
Quick start
# Install
git clone https://github.com/pmenzel/ont-assembly-snake.git
conda config --add channels bioconda
conda env create -n ont-assembly-snake --file ont-assembly-snake/env/conda-main.yaml
conda activate ont-assembly-snake
# Prepare ONT reads, one file per sample
mkdir fastq-ont
cp /path/to/my/data/my_sample/ont_reads.fastq.gz fastq-ont/mysample.fastq.gz
# optionally: add Illumina paired-end reads
mkdir fastq-illumina
cp /path/to/my/data/my_sample/illumina_reads_R1.fastq.gz fastq-illumina/mysample_R1.fastq.gz
cp /path/to/my/data/my_sample/illumina_reads_R2.fastq.gz fastq-illumina/mysample_R2.fastq.gz
# Declare desired combination of read filtering, assembly and polishing
mkdir assemblies
mkdir assemblies/mysample_flye+medaka
mkdir assemblies/mysample+filtlongMB500_flye+racon2+medaka
mkdir assemblies/mysample_raven2+medaka+pilon
[...]
# Run workflow
snakemake -s ont-assembly-snake/Snakefile --use-conda --cores 20
Setup
Clone repository, for example into the existing folder
/opt/software/
:
git clone https://github.com/pmenzel/ont-assembly-snake.git /opt/software/ont-assembly-snake
Install
conda
and create a new environment called
ont-assembly-snake
:
conda config --add channels bioconda
conda env create -n ont-assembly-snake --file /opt/software/ont-assembly-snake/env/conda-main.yaml
Activate the environment:
conda activate ont-assembly-snake
Usage
First, prepare a folder called
fastq-ont/
containing the ONT sequencing reads as
one
.fastq
or
.fastq.gz
file per sample, e.g.
fastq-ont/sample1.fastq.gz
.
Unicycler and some polishing tools additonally use paired-end Illumina reads, which need to be placed in the folder
fastq-illumina/
using
_R[12].fastq
or
_R[12].fastq.gz
suffixes, e.g.
fastq-illumina/sample1_R1.fastq.gz
and
fastq-illumina/sample1_R2.fastq.gz
.
Next, create a folder
assemblies
and in there, create empty folders specifying
the desired combinations of read filtering, assembly, and polishing steps by using specific keywords for each program, see below.
The first part of a folder name is a sample name, which must match the filenames in
fastq-ont/
and, optionally,
fastq-illumina/
.
The next part can be a keyword for read filtering with Filtlong, see below, which is separated from the sample name by
+
.
Then follows, separated by an underscore, a keyword for the assembler.
NB: This also means that sample names must not contain underscores.
After the keyword for the assembler follow the keywords for one ore more polishing steps, all separated by
+
.
After making the desired subfolders in
assemblies/
, run the workflow, e.g. with 20 threads:
snakemake -s /opt/software/ont-assembly-snake/Snakefile --use-conda --cores 20
Assemblies created in each step are contained in the files
output.fa
in each folder and symlinked as
.fa
files in the
assemblies/
folder, see the example below.
Included Programs
Filtlong
The ONT reads can be filtered by length and quality using Filtlong prior to the assembly.
The available keywords are:
filtlong
:
This will filter the ONT reads in
fastq-ont/mysample.fastq
and keep only
reads longer than 1000 bases; using the Filtlong option
--min_length
. The filtered read set is written to
fastq-ont/mysample+filtlong.fastq
. The length can be changed using the
snakemake configuration option
filtlong_min_read_length
.
filtlongPC<p>
This will filter the reads to only include the top
p
percent of megabases from reads with highest average quality using
the Filtlong option
--keep_percent
.
Further, reads are filtered by their length as above. The output is written to
fastq-ont/mysample+filtlongPC<p>.fastq
.
filtlongMB<m>
This will filter the reads to only include reads with highest average quality up to a total length of
m
megabases.
Further, reads are filtered by their length. The output is written to
fastq-ont/mysample+filtlongMB<m>.fastq
.
filtlongMB<m>,<q>,<l>
This will filter the reads to only include reads up to a total length of
m
megabases, which are filtered by length
and quality, where
q
and
l
set the priority for each using the Filtlong options
--mean_q_weight
and
--length_weight
, respectively.
See also the
section in the Filtlong docs
.
Further, reads are filtered by their length as above. The output is written to
fastq-ont/mysample+filtlongMB<m>,<q>,<l>.fastq
.
filtlongMB<m>,<q>,<l>,<n>
As above, but the the minimum read length is explicitly specified by
n
and not by the global option
filtlong_min_read_length
:
The output is written to
fastq-ont/mysample+filtlongMB<m>,<q>,<l>,<n>.fastq
.
When using any of the Filtlong keywords in a folder name, they must be followed by an underscore, followed by the keyword for the assembler.
Rasusa
The ONT reads can be randomly subsampled prior to the assembly.
The available keywords are:
rasusaMB<m>
This will subsample the ONT reads to a total of
m
megabases.
The output is written to
fastq-ont/mysample+rasusaMB<m>.fastq
.
When using any the Rasusa keyword in a folder name, it must be followed by an underscore, followed by the keyword for the assembler.
Flye
Following keywords can be used to run the assembly with Flye:
flye
Default assembly, which includes one round of internal polishing the assembly with the ONT reads.
flyeX
Assembly with
X
rounds of internal polishing. Setting
X
to 0 disables polishing altogether.
flyehq
Assembly for high-quality ONT reads using Flye option
--nano-hq
for ONT Guppy5+ in SUP mode, with one round of internal polishing.
flyehqX
High-quality assembly, with
X
rounds of internal polishing. Setting
X
to 0 disables polishing altogether.
Raven
Following keywords can be used to run the assembly with raven:
raven
Default assembly, which includes two rounds of internal polishing with racon using the ONT reads.
ravenX
Assembly with
X
rounds of internal polishing with racon. Setting
X
to 0 disables polishing altogether.
Miniasm
Following keywords can be used to run the assembly with miniasm:
miniasm
Default assembly. Miniasm does not do any polishing by itself.
Unicycler
unicycler
Unicycler does a hybrid assembly, i.e., both ONT and Illumina reads must be present in
fastq-ont
and
fastq-illumina
, respectively.
Canu
canu
The Canu assembler requires to know the genome size (in Megabases) beforehand, use Snakemake option:
--config genome_size=5.2
(e.g. for 5.2 Mb)
racon
Following keywords can be used to polish an assembly using ONT reads:
racon
Polishing the assembly once.
raconX
Run racon polishing iteratively
X
times.
medaka
medaka
Medaka polishes the assembly using the ONT reads, but also requires the name of
the Medaka model to be used, which depends on the flow cell and basecalling that were used for creating the reads.
The model name can either be set globally for all samples using the snakemake configuration option
medaka_model
,
or by supplying a tab-separated file with two columns that maps sample names to medaka models using the snakemake configuration option
map_medaka_model
.
Options are specified using snakemake's
--config
parameter, for example:
snakemake /opt/software/ont-assembly-snake/Snakefile --cores 20 --config map_medaka_model=map_medaka.tsv
where
map_medaka.tsv
contains, for example, the two columns:
sample1 r941_min_high_g330
sample2 r941_min_high_g351
pilon
pilon
Pilon polishes an assembly using Illumina reads, which must be located in the
fastq-illumina
folder.
Polypolish
polypolish
Polypolish polishes an assembly using Illumina reads, which must be located in the
fastq-illumina
folder.
POLCA
polca
POLCA polishes an assembly using Illumina reads, which must be located in the
fastq-illumina
folder.
Homopolish
homopolish
Homopolish does reference-based polishing based on one ore more provided reference genomes in fasta format located in
references/NAME1.fa
,
references/NAME2.fa
, etc., where
NAME1
and
NAME2
can be any string.
Snakemake will create output files
...+homopolish/output_NAME1.fa
,
...+homopolish/output_NAME1.fa
, etc., containing the polished assemblies.
When using homopolish, it must be the last keyword in the folder name.
proovframe
proovframe
Proovframe does reference-based polishing based on one ore more provided reference proteomes in fasta format containing the amino acid sequences located in
references-protein/NAME1.faa
,
references-protein/NAME2.faa
, etc., where
NAME1
and
NAME2
can be any string.
Snakemake will create output files
...+proovframe/output_NAME1.fa
,
...+proovframe/output_NAME1.fa
, etc., containing the polished assemblies.
When using proovframe, it must be the last keyword in the folder name.
Example
This example contains two samples with ONT sequencing reads and Illumina reads for sample 2 only.
For sample 1, the assembly should be done with flye (including the default single round of
polishing), followed by polishing the assembly with racon (twice), medaka, and eventually homopolish, which will use the E. coli genome in the file
references/Ecoli.faa
.
In another assembly, we also want to filter the ONT reads of sample 1 to only include the highest quality reads up to a total of 500Mb
using Filtlong and apply the same assembly and polishing protocol.
Sample 2 should be assembled by raven including two internal polishing rounds,
followed by medaka and pilon polishing using the Illumina reads, and finally reference-based polishing with E. coli proteins using proovframe, which requires the file
references-protein/Ecoli.faa
.
We therefore create the folders and files as follows:
.
├── assemblies
│ ├── sample1+filtlongMB500_flye+racon2+medaka+homopolish
│ ├── sample1_flye+racon2+medaka+homopolish
│ ├── sample2_raven2+medaka+pilon+proovframe
├── fastq-illumina
│ ├── sample2_R1.fastq
│ └── sample2_R2.fastq
├── fastq-ont
│ ├── sample1.fastq
│ └── sample2.fastq
├── references
│ └── Ecoli.fa
└── references-protein └── Ecoli.faa
We also want to set the minimum read length threshold for Filtlong to 500nt and use the medaka model
r941_min_high_g351
for both samples.
Therefore, we run the workflow with:
snakemake -s /opt/software/ont-assembly-snake/Snakefile --use-conda --cores 20 --config medaka_model=r941_min_high_g351 filtlong_min_read_length=500
Snakemake will recursively handle the dependencies for each assembly,
and create folders for all intermediate steps automatically.
Additionally, a symlink is created for each output assembly in the
assemblies/
folder, so they can easily be used as input for
score-assemblies
.
For the above example, the folders will look like this after running the workflow:
.
├── assemblies
│ ├── sample1+filtlongMB500_flye
│ ├── sample1+filtlongMB500_flye.fa -> sample1+filtlongMB500_flye/output.fa
│ ├── sample1+filtlongMB500_flye+racon2
│ ├── sample1+filtlongMB500_flye+racon2.fa -> sample1+filtlongMB500_flye+racon2/output.fa
│ ├── sample1+filtlongMB500_flye+racon2+medaka
│ ├── sample1+filtlongMB500_flye+racon2+medaka.fa -> sample1+filtlongMB500_flye+racon2+medaka/output.fa
│ ├── sample1+filtlongMB500_flye+racon2+medaka+homopolish
│ ├── sample1+filtlongMB500_flye+racon2+medaka+homopolishEcoli.fa -> sample1+filtlongMB500_flye+racon2+medaka+homopolish/output_Ecoli.fa
│ ├── sample1_flye
│ ├── sample1_flye.fa -> sample1_flye/output.fa
│ ├── sample1_flye+racon2
│ ├── sample1_flye+racon2.fa -> sample1_flye+racon2/output.fa
│ ├── sample1_flye+racon2+medaka
│ ├── sample1_flye+racon2+medaka.fa -> sample1_flye+racon2+medaka/output.fa
│ ├── sample1_flye+racon2+medaka+homopolish
│ ├── sample1_flye+racon2+medaka+homopolishEcoli.fa -> sample1_flye+racon2+medaka+homopolish/output_Ecoli.fa
│ ├── sample2_raven2
│ ├── sample2_raven2.fa -> sample2_raven2/output.fa
│ ├── sample2_raven2+medaka
│ ├── sample2_raven2+medaka.fa -> sample2_raven2+medaka/output.fa
│ ├── sample2_raven2+medaka+pilon
│ ├── sample2_raven2+medaka+pilon.fa -> sample2_raven2+medaka+pilon/output.fa
│ ├── sample2_raven2+medaka+pilon+proovframe
│ └── sample2_raven2+medaka+pilon+proovframeEcoli.fa -> sample2_raven2+medaka+pilon+proovframe/output_Ecoli.fa
├── fastq-illumina
│ ├── sample2_R1.fastq
│ └── sample2_R2.fastq
├── fastq-ont
│ ├── sample1.fastq
│ ├── sample1+filtlongMB500.fastq
│ └── sample2.fastq
├── references
│ └── Ecoli.fa
└── references-protein
└── Ecoli.faa
(Not shown is the content of each subfolder in
assemblies/
and some auxiliary files.)
Code Snippets
160 161 162 163 | shell: """ filtlong --min_length {filtlong_min_read_length} {input} > {output} 2>{log} """ |
176 177 178 179 | shell: """ rasusa --bases {wildcards.num}m -i {input} -o {output} 2>{log} """ |
190 191 192 193 | shell: """ filtlong --min_length {filtlong_min_read_length} -t {wildcards.num}000000 {input} > {output} 2>{log} """ |
205 206 207 208 | shell: """ filtlong --min_length {filtlong_min_read_length} --keep_percent {wildcards.num} {input} > {output} 2>{log} """ |
223 224 225 226 | shell: """ filtlong --min_length {filtlong_min_read_length} --mean_q_weight {wildcards.qweight} --length_weight {wildcards.lweight} -t {wildcards.mb}000000 {input} > {output} 2>{log} """ |
242 243 244 245 | shell: """ filtlong --min_length {wildcards.readlen} --mean_q_weight {wildcards.qweight} --length_weight {wildcards.lweight} -t {wildcards.mb}000000 {input} > {output} 2>{log} """ |
259 260 261 262 263 264 265 | shell: """ minimap2 -x ava-ont -t {threads} {input} {input} > assemblies/{wildcards.sample}_miniasm/minimap2_overlap.paf 2>{log} miniasm -f {input} assemblies/{wildcards.sample}_miniasm/minimap2_overlap.paf > assemblies/{wildcards.sample}_miniasm/minimap2_miniasm.gfa 2>>{log} perl -lsane 'print ">$F[1]\n$F[2]" if $F[0] =~ /S/;' assemblies/{wildcards.sample}_miniasm/minimap2_miniasm.gfa > {output.fa} 2>>{log} ln -sr {output.fa} {output.link} """ |
281 282 283 284 285 286 287 288 | shell: """ # del spades folder if already exists (e.g. when workflow was canceled), so that it does not warn about it upon restart [ -d "assemblies/{wildcards.sample}_unicycler/spades_assembly" ] && rm -r "assemblies/{wildcards.sample}_unicycler/spades_assembly" >{log} unicycler -1 {input.fq1} -2 {input.fq2} -l {input.fqont} -t {threads} --keep 0 -o assemblies/{wildcards.sample}_unicycler/ >>{log} 2>&1 cp assemblies/{wildcards.sample}_unicycler/assembly.fasta {output.fa} 2>>{log} ln -sr {output.fa} {output.link} """ |
303 304 305 306 307 308 | shell: """ flye --nano-raw {input.fq} -o assemblies/{wildcards.sample}_flye/ -t {threads} 2>{log} mv assemblies/{wildcards.sample}_flye/assembly.fasta {output.fa} ln -sr {output.fa} {output.link} """ |
322 323 324 325 326 327 | shell: """ flye --nano-raw {input.fq} -o assemblies/{wildcards.sample}_flye{wildcards.num}/ -t {threads} -i {wildcards.num} 2>{log} mv assemblies/{wildcards.sample}_flye{wildcards.num}/assembly.fasta {output.fa} ln -sr {output.fa} {output.link} """ |
342 343 344 345 346 347 | shell: """ flye --nano-hq {input.fq} -o assemblies/{wildcards.sample}_flyehq/ -t {threads} 2>{log} mv assemblies/{wildcards.sample}_flyehq/assembly.fasta {output.fa} ln -sr {output.fa} {output.link} """ |
361 362 363 364 365 366 | shell: """ flye --nano-hq {input.fq} -o assemblies/{wildcards.sample}_flyehq{wildcards.num}/ -t {threads} -i {wildcards.num} 2>{log} mv assemblies/{wildcards.sample}_flyehq{wildcards.num}/assembly.fasta {output.fa} ln -sr {output.fa} {output.link} """ |
381 382 383 384 385 | shell: """ raven --disable-checkpoints -t {threads} {input.fq} >{output.fa} 2>{log} ln -sr {output.fa} {output.link} """ |
400 401 402 403 404 | shell: """ raven --disable-checkpoints -p {wildcards.num} -t {threads} {input.fq} >{output.fa} 2>{log} ln -sr {output.fa} {output.link} """ |
418 419 420 421 422 423 | shell: """ canu -nanopore -d assemblies/{wildcards.sample}_canu/ -p output useGrid=false maxThreads={threads} genomeSize={target_genome_size}m {input.fqont} >>{log} 2>&1 cp assemblies/{wildcards.sample}_canu/output.contigs.fasta {output.fa} 2>>{log} ln -sr {output.fa} {output.link} """ |
440 441 442 443 444 445 | shell: """ minimap2 -ax map-ont -t {threads} {input.prev_fa} {input.fq} > {output.sam} 2>{log} racon --threads {threads} --include-unpolished {input.fq} {output.sam} {input.prev_fa} > {output.fa} 2>>{log} ln -sr {output.fa} {output.link} """ |
461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 | shell: """ DIR_temp=$(mktemp -d --suffix=.raconX) trap "rm -r $DIR_temp" EXIT cp {input.prev_fa} $DIR_temp/prev.fa for i in `seq 1 {wildcards.num}` do echo "Polishing round $i / {wildcards.num}" >>{log} minimap2 -ax map-ont -t {threads} $DIR_temp/prev.fa {input.fq} > $DIR_temp/map.sam 2>>{log} racon --threads {threads} --include-unpolished {input.fq} $DIR_temp/map.sam $DIR_temp/prev.fa > $DIR_temp/polished.fa 2>>{log} mv $DIR_temp/polished.fa $DIR_temp/prev.fa done mv $DIR_temp/prev.fa {output.fa} ln -sr {output.fa} {output.link} """ |
509 510 511 512 513 514 | shell: """ medaka_consensus -f -i {input.fq} -d {input.prev_fa} -o assemblies/{wildcards.sample}_{wildcards.assembly}+medaka -t {threads} {params.model} >{log} 2>&1 mv assemblies/{wildcards.sample}_{wildcards.assembly}+medaka/consensus.fasta assemblies/{wildcards.sample}_{wildcards.assembly}+medaka/output.fa ln -sr {output.fa} {output.link} """ |
531 532 533 534 535 536 537 538 539 | shell: """ bwa index -p assemblies/{wildcards.sample}_{wildcards.assembly}+pilon/bwa_index {input.prev_fa} >{log} 2>&1 bwa mem -t {threads} assemblies/{wildcards.sample}_{wildcards.assembly}+pilon/bwa_index {input.fq1} {input.fq2} 2>>{log} | samtools sort -o {output.bam} - 2>>{log} samtools index {output.bam} 2>>{log} pilon -Xmx60G --genome {input.prev_fa} --frags {output.bam} --outdir assemblies/{wildcards.sample}_{wildcards.assembly}+pilon/ --output pilon --changes --vcf >>{log} 2>&1 mv assemblies/{wildcards.sample}_{wildcards.assembly}+pilon/pilon.fasta {output.fa} ln -sr {output.fa} {output.link} """ |
557 558 559 560 561 562 | shell: """ polca.sh -t {threads} -a {input.prev_fa} -r '{input.fq1} {input.fq2}' >{log} 2>&1 mv output.fa.PolcaCorrected.fa {output.fa} ln -sr {output.fa} {output.link} """ |
578 579 580 581 582 583 584 585 586 | shell: """ bwa index -p assemblies/{wildcards.sample}_{wildcards.assembly}+polypolish/bwa_index {input.prev_fa} >{log} 2>&1 bwa mem -a -t {threads} assemblies/{wildcards.sample}_{wildcards.assembly}+polypolish/bwa_index {input.fq1} > assemblies/{wildcards.sample}_{wildcards.assembly}+polypolish/alignments_R1.sam 2>>{log} bwa mem -a -t {threads} assemblies/{wildcards.sample}_{wildcards.assembly}+polypolish/bwa_index {input.fq2} > assemblies/{wildcards.sample}_{wildcards.assembly}+polypolish/alignments_R2.sam 2>>{log} polypolish_insert_filter.py --in1 assemblies/{wildcards.sample}_{wildcards.assembly}+polypolish/alignments_R1.sam --in2 assemblies/{wildcards.sample}_{wildcards.assembly}+polypolish/alignments_R2.sam --out1 assemblies/{wildcards.sample}_{wildcards.assembly}+polypolish/filtered_R1.sam --out2 assemblies/{wildcards.sample}_{wildcards.assembly}+polypolish/filtered_R2.sam >>{log} 2>&1 polypolish {input.prev_fa} assemblies/{wildcards.sample}_{wildcards.assembly}+polypolish/filtered_R1.sam assemblies/{wildcards.sample}_{wildcards.assembly}+polypolish/filtered_R2.sam > {output.fa} 2>>{log} ln -sr {output.fa} {output.link} """ |
601 602 603 604 605 606 607 608 | shell: """ DIR_temp=$(mktemp -d --suffix=.raconX) trap "rm -r $DIR_temp" EXIT homopolish polish -a {input.prev_fa} -m R9.4.pkl -o $DIR_temp -l {input.ref} >{log} 2>&1 cp $DIR_temp/*_homopolished.fasta {output.fa} ln -sr {output.fa} {output.link} """ |
621 622 623 624 | shell: """ diamond makedb -p {threads} --in {input} --db {output} >{log} 2>&1 """ |
640 641 642 643 644 645 | shell: """ proovframe map -d {input.ref} -o {output.tsv} {input.prev_fa} >{log} 2>&1 proovframe fix -o {output.fa} {input.prev_fa} {output.tsv} >>{log} 2>&1 ln -sr {output.fa} {output.link} """ |
Support
- Future updates
Related Workflows





