v1.0
NanoPlasm is a workflow for assembling, typing, and analyzing plasmid sequences from long reads (Nanopore or PacBio), or from a combination of long and short reads. To keep analyses reliable, reproducible, and efficient, NanoPlasm is built on the Snakemake workflow framework.
The tool is still under development.
Workflow overview
Install
Currently, the only way to install and use NanoPlasm is to clone this repository and then install the dependencies and databases outlined below.
# Clone NanoPlasm repository
git clone https://github.com/Flomauf/NanoPlasm.git NanoPlasm
Dependencies
NanoPlasm relies on Docker images for most of its tools, but a few dependencies must be installed beforehand for seamless execution. It is strongly recommended to set up a dedicated environment for them. The following instructions show how to do this with Conda:
# Create the environment
conda create -n nanoplasm python=3.10.0
# Install dependencies
conda install -c bioconda -c conda-forge snakemake bwa biopython mge-cluster
# Activate the environment
conda activate nanoplasm
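Before going further, a quick sanity check can confirm that the core dependencies are visible from the environment. This is an optional sketch, assuming Docker is already installed system-wide (NanoPlasm pulls most of its tools as Docker images):
# Optional: verify the core dependencies (run inside the nanoplasm environment)
snakemake --version                                # Snakemake drives the workflow
bwa 2>&1 | head -n 3                               # bwa prints its usage/version to stderr
python -c "import Bio; print(Bio.__version__)"     # Biopython is used by the helper scripts
docker --version                                   # required for the Dockerized tools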
Database
Homopolish requires a dedicated database, which should be installed within the NanoPlasm directory. A script is provided to simplify the installation:
# Go into NanoPlasm directory
cd NanoPlasm
# Install database
sh install_db.sh
NanoPlasm is now ready to use.
Usage
To streamline the usage of NanoPlasm, a wrapper script handles the invocation of the Snakemake pipeline. If needed, arguments can still be passed directly to Snakemake via the --options argument, allowing additional customization and flexibility.
To maintain consistency and organization, it is essential to adhere to the following file naming conventions when working with NanoPlasm:
For long reads:
- Each long read file should be named as "ID.fastq", where "ID" represents a unique identifier associated with the sample.
For short reads:
- Paired-end short read files should be stored in a separate folder.
- The forward read file should be named "ID_R1.fastq", where "ID" corresponds to the sample identifier.
- The reverse read file should be named "ID_R2.fastq", following the same sample identifier pattern.
Adhering to these file naming conventions ensures clarity and facilitates the organization of data within the NanoPlasm workflow.
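For illustration, a hybrid run with two samples could be laid out as follows (folder and sample names are hypothetical):
$ tree fastq_nanopore fastq_illumina
fastq_nanopore
├── sampleA.fastq
└── sampleB.fastq
fastq_illumina
├── sampleA_R1.fastq
├── sampleA_R2.fastq
├── sampleB_R1.fastq
└── sampleB_R2.fastq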
Creation of the analysis folder
The first step is to create a dedicated folder where the analysis will take place. Within this folder, a configuration file is automatically generated, containing all the parameters required for the analysis. Edit this file according to your specific analysis requirements and preferences.
usage: nanoplasm init [-h] -i dir [-I dir] -o dir
Generate the folder and configuration file for NanoPlasm analysis.
options:
-h, --help show this help message and exit
-i dir, --input dir directory with Nanopore fastq files. Files must be in the following format: ID.fastq
-I dir, --INPUT dir directory with Illumina fastq files. Files must be in the following format: ID_R#.fastq
-o dir, --output dir directory to create for the run
Starting a NanoPlasm run
After successfully creating the analysis folder, you can initiate a NanoPlasm run. The workflow offers two distinct assembly modes: "long" and "hybrid."
- "long" mode: designed for long-read sequencing data, such as Nanopore or PacBio reads. It leverages the unique characteristics of long reads to perform plasmid assembly, typing, and analysis.
- "hybrid" mode: combines long and short reads, harnessing the strengths of both to achieve more comprehensive and accurate plasmid assembly, typing, and analysis.
You can select the appropriate assembly mode based on the sequencing data available for your analysis.
usage: nanoplasm typing [-h] [-m dir] -d dir [-t int] [--quality_check] [--options str]
Start Nanoplasm with long|hybrid mode.
options:
-h, --help show this help message and exit
-m dir, --mode dir mode for assembling genomes [long]
-d dir, --dir dir directory for the run (created with nanoplasm init)
-t int, --thread int number of threads [1]
--quality_check enable reads quality check [disabled]
--options str Snakemake options. Use as --options= and snakemake arguments between brackets
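For instance, extra Snakemake arguments can be forwarded through --options. The flags below are standard Snakemake options rather than NanoPlasm-specific ones, and the exact quoting may differ from the bracket syntax mentioned in the help text above:
# Forward additional arguments to Snakemake (illustrative)
path/to/nanoplasm typing -d analysis_folder -t 8 --options="--rerun-incomplete --keep-going"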
Example
# Generate the analysis folder and configuration file
path/to/nanoplasm init -i fastq_nanopore -o analysis_folder
# Edit the configuration file
nano analysis_folder/config.yaml
# Start NanoPlasm run
path/to/nanoplasm typing -d analysis_folder -t 24 --quality_check
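A hybrid run follows the same pattern. The sketch below assumes an additional folder of Illumina reads named fastq_illumina:
# Hybrid mode: provide both Nanopore and Illumina read folders at init
path/to/nanoplasm init -i fastq_nanopore -I fastq_illumina -o analysis_hybrid
# Edit the configuration file, then start the run in hybrid mode
path/to/nanoplasm typing -m hybrid -d analysis_hybrid -t 24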
Future features
- Module for discarding reads when in excess. Samples with reads in excess (very high coverage) can lead to errors during Flye assembly.
- Merge function for combining different analysis folders.
Tips
- The pipeline might crash because a sample assembly failed.
  This can be due to very high coverage for this sample, causing Flye to fail at the disjointig assembly stage. To avoid this error, you can use the --asm-coverage and --genome-size options in Flye to limit the coverage used in your assemblies. Add these options in the config.yaml file of the run (see the sketch after this list).
- I need to analyse groups of samples with different parameters, but I ultimately want to compare them all together.
  Run the analyses in different folders with appropriate parameters for each analysis and, once all analyses are completed, merge all folders using the merge function.
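Regarding the first tip: the Flye rule appends a free-form {params.others} string to its command line (see the flye snippet in Code Snippets below), so the extra flags most likely belong in the corresponding field of config.yaml. A minimal sketch, with a hypothetical key name:
# Open the run's configuration file
nano analysis_folder/config.yaml
# ...and add the extra Flye arguments to the free-form options entry,
# e.g. (key name is hypothetical; check your generated config.yaml):
#   flye_others: "--asm-coverage 50 --genome-size 5.5m"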
Code Snippets
from Bio import SeqIO
import os

# Size thresholds used to classify contigs as plasmid or chromosome
min_size = snakemake.config["classify_contigs"]["min_size"]
max_size = snakemake.config["classify_contigs"]["max_size"]
assemblies = snakemake.input
output = snakemake.output["class_file"]
plasmids_files_path = snakemake.output["plasmids_list"]

# If there is only one sample, it must be converted into a list
if len(assemblies) < 2:
    assemblies = [assemblies]

# Create output folders
if not os.path.exists("Sequences/plasmids"):
    os.mkdir("Sequences/plasmids")
if not os.path.exists("Sequences/chromosomes"):
    os.mkdir("Sequences/chromosomes")

with open(output, "w") as report, open(plasmids_files_path, "w") as plasmids_list:
    # Write the report header line
    report.write("Sample\tContig ID\tType\tLength\n")

    for file in assemblies:
        sample = file.split("/")[1]  # sample ID (second term in the path)
        assembly = SeqIO.parse(file, "fasta")
        for seq in assembly:
            # Contig within the plasmid size range
            if min_size <= len(seq.seq) <= max_size:
                with open(f"Sequences/plasmids/{sample}_plasmid_{seq.id}.fasta", "w") as out:
                    SeqIO.write([seq], out, "fasta")
                report.write(f"{sample}\t{seq.id}\tPlasmid\t{len(seq.seq)}\n")
                plasmids_list.write(f"Sequences/plasmids/{sample}_plasmid_{seq.id}.fasta\n")
            # Contig larger than the plasmid size range -> chromosome
            elif max_size < len(seq.seq):
                with open(f"Sequences/chromosomes/{sample}_chromosome_{seq.id}.fasta", "w") as out:
                    SeqIO.write([seq], out, "fasta")
                report.write(f"{sample}\t{seq.id}\tChromosome\t{len(seq.seq)}\n")
import pandas as pd

"""Gather results from ResFinder and MOB-suite."""


def add_resfinder(master_df, res_files):
    """Add ResFinder results to the classification file."""
    # Dataframe for ResFinder results
    res_df = pd.DataFrame({"Sample": [], "Contig ID": [], "ResFinder Resistance genes": [], "ResFinder phenotype": []})

    # Parse all ResFinder files
    for f in res_files:
        sample = f.split("/")[1]  # extracted from the path
        results = {}  # stores the results per contig
        text = open(f, "r").readlines()[1:]  # open and skip header
        for line in text:
            gene = line.split("\t")[0]
            contig = line.split("\t")[5]
            phenotype = line.split("\t")[7].split(", ")  # list of phenotypes
            if contig not in results:
                results[contig] = [gene, phenotype]  # list of genes in pos 0, list of phenotypes in pos 1
            else:
                results[contig][0] += f", {gene}"  # append the new gene to the string in position 0
                results[contig][1].extend(x for x in phenotype if x not in results[contig][1])  # add phenotype if not already present

        # Loop over the results dict to complete the dataframe. The label used is the
        # length of the dataframe, creating a simple index.
        for elt in results:
            res_df.loc[len(res_df)] = [sample, elt, results[elt][0], ", ".join(results[elt][1])]

    # Merge the classification dataframe with the ResFinder dataframe
    new_df = master_df.merge(res_df, how="outer", on=["Sample", "Contig ID"])
    return new_df


def add_mobsuite(master_df, mob_files):
    """Add MOB-suite results to the master file."""
    # Dataframe for MOB-suite results
    res_df = pd.DataFrame({"Sample": [], "Contig ID": [], "Mob-suite replicon types": [], "Mob-suite relaxase types": []})

    # Parse all MOB-suite files
    for f in mob_files:
        sample = f.split("/")[1].split("_")[0]  # extracted from the path
        results = {}  # stores the results per contig
        text = open(f, "r").readlines()[1:]  # open and skip header
        for line in text:
            contig = line.split("\t")[0]
            replicon = line.split("\t")[5]
            relaxase = line.split("\t")[7]
            results[contig] = [replicon, relaxase]

        # Loop over the results dict to complete the dataframe
        for elt in results:
            res_df.loc[len(res_df)] = [sample, elt, results[elt][0], results[elt][1]]

    # Merge the classification dataframe with the MOB-suite dataframe
    new_df = master_df.merge(res_df, how="outer", on=["Sample", "Contig ID"])
    return new_df


if __name__ == "__main__":
    # Load the dataframe and lists of files from the Snakefile
    class_file = pd.read_csv(snakemake.input["classification"], sep="\t", header=0)
    class_file = class_file.astype({"Sample": str})  # convert to string in case sample IDs are just numbers
    res_files = snakemake.input["resfinder"]
    mob_files = snakemake.input["mobsuite"]

    # If there is only one sample, it must be converted into a list
    res_list = res_files if len(res_files) > 1 else [res_files[0]]
    mob_list = mob_files if len(mob_files) > 1 else [mob_files[0]]

    # Add information from each analysis
    infos_resfinder = add_resfinder(class_file, res_list)
    infos_mobsuite = add_mobsuite(infos_resfinder, mob_list)

    # Export the final dataframe
    final_df = infos_mobsuite.sort_values(by=["Sample", "Type", "Length"])
    final_df.to_csv(snakemake.output[0], sep="\t")
shell:
    "NanoFilt -q {params.quality} -l {params.length} {input} --logfile {log} > {output}"
shell:
    "medaka_consensus -i {input.reads} -d {input.assembly} -t {threads} -m {params.model} -o 04-Medaka/{wildcards.sample} > {log} 2>&1"
shell:
    "nanoq -i {input} -s -vvv 2> {output}"
shell:
    "flye --nano-hq {input} -t {threads} -o 03-assembly/{wildcards.sample}_long {params.others} > {log} 2> /dev/null"
shell:
    "homopolish polish -a {input} -s {workflow.basedir}/{params.db} -m R10.3.pkl -o 05-Homopolish/{wildcards.sample} > {log} 2>&1"
shell:
    "fastp -i {input.R1} -I {input.R2} -o {output.R1} -O {output.R2} -q {params.qual} "
    "-u {params.unqual} -e {params.ave_qual} -l {params.length} -h {output.html} -j {output.json} 2> {log}"
shell:
    "mkdir QC/{wildcards.sample};"
    "fastqc {input} -o QC > /dev/null 2>&1"
shell:
    "unicycler -1 {input.short_1} -2 {input.short_2} -l {input.long} -o 03-assembly/{wildcards.sample}_hybrid -t {threads} --mode {params.mode} "
    "--verbosity 0 --keep 0"
shell:
    "bwa index {input.draft} > /dev/null 2>&1;"
    "bwa mem -t {threads} -a {input.draft} {input.short_1} > {output.alignments_1} 2> {log};"
    "bwa mem -t {threads} -a {input.draft} {input.short_2} > {output.alignments_2} 2> {log}"
shell:
    "polypolish_insert_filter.py --in1 {input.alignments_1} --in2 {input.alignments_2} --out1 {output.alignments_filt_1} --out2 {output.alignments_filt_2} > /dev/null 2>&1;"
    "polypolish {input.draft} {output.alignments_filt_1} {output.alignments_filt_2} > {output.assembly} 2> {log}"
shell:
    "python -m resfinder -ifa {input} --nanopore -acq -o 06-resfinder/{wildcards.sample} -l {params.coverage_threshold} -t {params.id_threshold} > /dev/null 2>&1"
shell:
    """java -Xmx16g -cp {workflow.basedir}/KARGA KARGA {input} d:{workflow.basedir}/{params.db};
    mv 01-NanoFilt/{wildcards.sample}_nanofilt_KARGA_mappedReads.csv 09-karga;
    mv 01-NanoFilt/{wildcards.sample}_nanofilt_KARGA_mappedGenes.csv 09-karga"""
shell:
    "mob_typer -i {input} -o {output} -n {threads} --multi > {log} 2>&1"
script:
    "bin/classify_contigs.py"
shell:
    "mge_cluster --create --input {input} --outdir 08-mge-cluster --perplexity {params.perplexity} --min_cluster {params.min_cluster} --kmer {params.kmer} --threads {threads}"
script:
    "bin/gather_typing_results.py"