ON-rep-seq analysis toolbox
ON-rep-seq is a molecular method where bacterial (or yeast) selective intragenomic fragments generated with Rep-PCR are sequenced using Oxford Nanopore Technologies.
This approach allows for species- and sub-species-level identification, and often even strain-level discrimination, of bacterial and yeast isolates at very low cost.
The current version of ON-rep-seq can analyse up to 192 isolates on one R9 flow cell, but gives the most cost-effective results on the `Flongle <https://nanoporetech.com/products/flongle>`_, for which it was initially designed.
Requirements
- Anaconda
You can follow the `installation guide <https://docs.anaconda.com/anaconda/install/>`_.
Installation
Clone the GitHub repo and enter the directory::

    git clone https://github.com/lauramilena3/On-rep-seq
    cd On-rep-seq
Create the On-rep-seq virtual environment and activate it::

    conda env create -n On-rep-seq -f On-rep-seq.yaml
    source activate On-rep-seq
Go into the On-rep-seq directory and create variables pointing to your basecalled data and to the results directory of your choice::

    fastqDir="/path/to/your/basecalled/data"
    reusultDir="/path/to/your/desired/results/dir"
Note to macOS users (Canu)
If you are using macOS, you need to edit the config file to set a new directory for Canu::
sed -i'.bak' -e 's/Linux-amd64/Darwin-amd64/g' config.yaml
Download kraken database
Check the number of available cores on your machine and set a number::

    nproc
    nCores="n"
If you are running on a laptop, we suggest leaving 2 cores free for other system tasks.
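The advice above can be scripted so that two cores are always left free. This is a minimal sketch (not part of the workflow itself); it assumes ``nproc`` is available, as on most Linux systems:

```shell
# Use all available cores minus two, but never fewer than one
total=$(nproc)
nCores=$(( total > 2 ? total - 2 : 1 ))
echo "$nCores"
```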
Download the Kraken database. Note that this step can take up to 48 hours, but it only needs to be done once::

    kraken2-build --download-taxonomy --db db/NCBI-bacteria --threads $nCores  #4h
    kraken2-build --download-library bacteria --db db/NCBI-bacteria --threads $nCores  #33h
    kraken2-build --build --db db/NCBI-bacteria --threads $nCores  #4h
Running On-rep-seq analysis
Note to all users
ON-rep-seq is updated regularly. For best results, please keep your local installation up to date::

    cd On-rep-seq
    git pull
Input data
The input data is basecalled fastq files. Please check the `Guppy basecaller <https://community.nanoporetech.com/downloads>`_ downloads.
For best performance we strongly recommend basecalling on a GPU (tested on a GTX 1080 Ti and an RTX 2080).
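For orientation, a GPU basecalling call with Guppy might look like the following. This is only a sketch: the fast5 path, flow cell, and kit identifiers below are placeholders (not taken from this workflow) and must match your own sequencing run:

```shell
# Placeholder example: adjust input path, flow cell, and kit to your run
guppy_basecaller \
    -i /path/to/fast5 \
    -s "$fastqDir" \
    --flowcell FLO-MIN106 --kit SQK-RBK004 \
    -x cuda:0   # basecall on the first GPU
```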
Running
Run the snakemake pipeline with the desired number of cores::
snakemake -j $nCores --use-conda --config basecalled_dir=$fastqDir results_dir=$reusultDir
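Before launching a full run, Snakemake's standard dry-run flag (``-n``) can be used to print the planned jobs without executing anything:

```shell
# Dry run: show what would be executed, without running it
snakemake -n -j $nCores --use-conda --config basecalled_dir=$fastqDir results_dir=$reusultDir
```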
Limiting memory
You can limit the memory (in megabytes) available to concurrently running jobs by passing the --resources directive as follows::
snakemake -j $nCores --use-conda --config basecalled_dir=$fastqDir results_dir=$reusultDir --resources mem_mb=$max_mem
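The variable ``max_mem`` is not set by the workflow itself. One way to choose it, assuming a Linux system where ``free`` is available, is to take a fraction (here 80%) of total RAM in megabytes:

```shell
# Set max_mem to 80% of total RAM, in megabytes (assumes Linux `free`)
max_mem=$(free -m | awk '/^Mem:/{print int($2*0.8)}')
echo "$max_mem"
```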
View DAG of jobs to visualize the workflow

To view the DAG, run::
snakemake --dag | dot -Tpdf > dag.pdf
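With many barcodes the full DAG can get crowded; Snakemake's standard ``--rulegraph`` option condenses it to one node per rule:

```shell
# Condensed graph: one node per rule instead of one per job
snakemake --rulegraph | dot -Tpdf > rulegraph.pdf
```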
Results structure
All results are stored in the Results folder as follows::
    Results
    ├── 01_porechopped_data
    │   └── {barcode}_demultiplexed.fastq      # Demultiplexed fastq per barcode
    ├── 02_LCPs
    │   ├── LCP_clustering_heatmaps.ipynb      # Clustering jupyter notebook
    │   ├── LCP_plots.pdf                      # Plots
    │   ├── {barcode}.txt                      # All LCPs
    │   └── LCPsClusteringData
    │       └── {barcode}.txt                  # LCPs used for clustering
    ├── 03_LCPs_peaks
    │   ├── 00_peak_consensus
    │   │   └── fixed_{barcode}_{peak}.fasta   # Corrected consensus fasta of peaks
    │   ├── 01_taxonomic_assignments
    │   │   ├── taxonomy_assignments.txt       # Taxonomy of all barcodes
    │   │   └── taxonomy_{barcode}.txt         # Taxonomy per barcode
    │   └── peaks_{barcode}.txt                # File with the peaks of each barcode
    └── check.txt                              # Final file "On-rep-seq succesfuly executed"
Publications & citing
The method is described in a preprint on `bioRxiv <https://www.biorxiv.org/content/10.1101/402156v1>`_.
Code Snippets
Demultiplexing reads per barcode with Porechop::

    shell:
        """
        echo {wildcards.sample}
        porechop -i {input}/{wildcards.sample}.fastq -b {params} -t {threads} --discard_unassigned --verbosity 2 > /dev/null 2>&1
        line=$(echo {BARCODES})
        for barcode in $line
        do
            touch {params}/$barcode.fastq
        done
        """
Merging the demultiplexed reads of each barcode::

    shell:
        """
        cat {params}/*/{wildcards.barcode}.fastq > {output}
        echo "{params}/*/{wildcards.barcode}.fastq > {output}"
        """
Trimming a second adapter run and setting aside barcodes with fewer than 2000 reads::

    shell:
        """
        if [ -s {input} ]
        then
            porechop -i {input} -o {output} --fp2ndrun --verbosity 0
            reads=$(grep -c "^@" {output})
            if (( $reads < 2000 ))
            then
                mv {output} {params}
                touch {output}
            fi
        else
            touch {output}
        fi
        """
Computing the read length-count profile (LCP) of each barcode::

    shell:
        """
        cat {input} | awk '{{if(NR%4==2) print length($1)+0}}' | sort -n | uniq -c | sed "s/ //g" | sed "s/ //g" | sed "s/^ *//" > {output}
        """
Plotting the LCPs of all barcodes (a Python run block)::

    run:
        #import libraries
        import os
        import matplotlib.pyplot as plt
        import numpy as np
        import math

        #set subplot features
        filelist=sorted(input, key=lambda x: int(x.split('BC')[1].split(".")[0]))
        nro=math.ceil(len(filelist)/3)
        fig, axes = plt.subplots(nrows=nro, ncols=3, figsize=(12, 50), sharex=True, sharey=True)
        plt.xlim(0,3500)

        #plot each barcode
        i = 0
        for row in axes:
            for ax in row:
                if i < len(filelist):
                    if os.path.getsize(filelist[i]) > 10:
                        data=np.loadtxt(filelist[i])
                        X=data[:,0]
                        Y=data[:,1]
                    else:
                        X=0
                        Y=0
                    ax.plot(Y, X)
                    #add label to barcode subplot
                    ax.text(0.9, 0.5, filelist[i].split("/")[-1].split(".")[0],
                            transform=ax.transAxes, ha="right")
                    i += 1

        #save figure to pdf
        fig.savefig(OUTPUT_DIR + "/02_LCPs/LCP_plots.pdf", bbox_inches='tight')
Peak picking with the peakpicker.R script::

    shell:
        """
        Rscript --vanilla scripts/peakpicker.R -f {input} -o {output.txt} -v TRUE || true
        touch {output.txt}
        touch {output.pdf}
        """
Clustering the LCPs and rendering the Jupyter notebook::

    shell:
        """
        mkdir -p "{output.directory}"
        cp "{params.work_directory}"/*.txt "{output.directory}"
        find "{output.directory}" -size -{params.min_size}c -delete
        Rscript -e "IRkernel::installspec()"
        CLUSTSCRIPT="$(realpath ./scripts/LCpCluster.R)"
        ( cd "{params.work_directory}"; "$CLUSTSCRIPT" LCPsClusteringData/ "{params.ipynb}" )
        mv "{params.work_directory}/{params.ipynb}" "{output.ipynb}"
        mv "{params.work_directory}/{params.png1}" "{output.png1}"
        mv "{params.work_directory}/{params.png2}" "{output.png2}"
        mv "{params.work_directory}/{params.fl_pdf}" "{output.fl_pdf}"
        jupyter-nbconvert --to html --template full "{output.ipynb}"
        """
Extracting the reads of each peak with cutadapt::

    shell:
        """
        sed 1d {input} | while read line
        do
            P1=$(echo $line | cut -d',' -f 5)
            P2=$(echo $line | cut -d',' -f 6)
            if [ "$P2" -gt 300 ]
            then
                name=$(echo $line | cut -d',' -f 3)
                cutadapt -m $P1 {params.porechopped}/{wildcards.barcode}_demultiplexed.fastq -o {params.peaks}/{wildcards.barcode}_short_$name.fastq
                cutadapt -M $P2 {params.peaks}/{wildcards.barcode}_short_$name.fastq -o {params.peaks}/{wildcards.barcode}_$name.fastq
                echo "{wildcards.barcode}_$name" >> {output}
            fi
        done
        touch {output}
        """
Correcting the peak reads with Canu::

    shell:
        """
        cat {input} | while read line
        do
            echo $line
            ./{config[canu_dir]}/canu -correct -p peak -d {params}/fixed_$line genomeSize=5k -nanopore-raw {params}/$line.fastq \
                minReadLength=300 correctedErrorRate=0.01 corOutCoverage=5000 corMinCoverage=2 minOverlapLength=300 cnsErrorRate=0.1 \
                cnsMaxCoverage=5000 useGrid=false || true
            if [ -s {params}/fixed_$line/peak.correctedReads.fasta.gz ]; then
                gunzip -c {params}/fixed_$line/peak.correctedReads.fasta.gz > {params}/fixed_$line.fastq
                echo "fixed_$line" >> {output}
            fi
        done
        touch {output}
        """
Building peak consensus sequences with vsearch::

    shell:
        """
        mkdir -p {params.consensus}
        cat {input} | while read line
        do
            count=$(grep -c ">" {params.LCPs}/$line.fastq)
            min=$(echo "scale=0 ; $count / 5" | bc)
            echo "$line" >> {output}
            vsearch --sortbylength {params.LCPs}/$line.fastq --output {params.LCPs}/sorted_$line.fasta
            vsearch --cluster_fast {params.LCPs}/sorted_$line.fasta -id 0.9 --consout {params.LCPs}/consensus_$line.fasta -strand both -minsl 0.80 -sizeout -minuniquesize $min
            vsearch --sortbysize {params.LCPs}/consensus_$line.fasta --output {params.consensus}/$line.fasta --minsize 50
        done
        touch {output}
        """
Taxonomic assignment of the consensus sequences with Kraken2::

    shell:
        """
        mkdir -p {params.taxonomy}
        cat {input} | while read line
        do
            echo "{params.consensus}/$line.fasta"
            if [ -s {params.consensus}/$line.fasta ]
            then
                cat {params.consensus}/$line.fasta >> {output.merged}
            fi
        done
        kraken2 --db {config[kraken_db]} {output.merged} --use-names > {output.taxonomy} || true
        touch {output.taxonomy}
        touch {output.merged}
        awk -F '\t' '{{print FILENAME " " $3}}' {output.taxonomy} | sort | uniq -c | sort -nr >> {params.taxonomy_final}
        """
Cleaning up intermediate files and writing the final check file::

    shell:
        """
        rm {params.peaks}/*fastq {params.peaks}/*fasta
        echo "On-rep-seq succesfuly executed" >> {output}
        """
Support
- Future updates