A tool for generating bacterial genomes from metagenomes with nanopore long read sequencing
Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation, topic
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
Lathe
A tool for generating bacterial genomes from metagenomes with Nanopore long read sequencing
Installation
First, install miniconda3
Then install snakemake. This can be done with the following.
conda install snakemake
snakemake --version #please ensure this is >=5.4.3
Next, clone this github directory to some location where it can be stored permanently. Remember to keep it updated with
git pull
.
git clone https://github.com/elimoss/lathe.git
Instructions to enable cluster execution with SLURM can be found at https://github.com/bhattlab/slurm
Inputs
Alter config.yaml to provide the following:
-
sample_name : Name of sample and output directory
-
fast5_dirs_list : text file containing a list of absolute paths to run/fast5/* subfolders containing .fast5 files. A good way to generate this is with
find -maxdepth 2 -mindepth 2 fast5_parent_dir > fodn.txt
-
flowcell : flowcell code, e.g. FLO-MIN106, passed to basecaller
-
kit : kit code, e.g. SQK-LSK109, passed to basecaller
-
genome_size : Estimated genome size, e.g. 50m, passed to Canu.
-
singularity : location (including on the internet) of a singularity image to be used for the workflow. Don't change this.
-
short_reads : location of short reads to be used for Pilon polishing, or empty quotes for long-read polishing.
-
use_grid : should Canu execute in distributed mode on a cluster?
-
grid_options : Extra options for execution on a cluster
-
canu_args : Extra options for Canu
-
skip_circularization : Should circularization be omitted from the workflow?
Lathe uses the Flye assembler by default. For Canu, please specify 'canu' for the assembler parameter in the config. For cluster Canu execution, please note: if set to True, you will need to install Canu, e.g.
conda install -c conda-forge -c bioconda Canu=1.8
as well as provide any additional required parameters for your job scheduler in the config.yaml file. Please see the example config file. When executing on a cluster, Canu will appear to fail, as the first process does not produce an assembly and instead spawns subsequent jobs on the cluster. Don't worry, just re-run Lathe when the assembly completes.
To execute please run the following. Please note, you must substitute a parent directory containing all of your data and working directories for
/labs/
.
snakemake --use-singularity --singularity-args '--bind /labs/ --bind /scratch/ --bind /scg/ ' -s /path/to/lathe/Snakefile \
--configfile path/to/modified_config.yaml --restart-times 0 --keep-going --latency-wait 30
# --profile scg #enable cluster support, highly recommended. See above.
Outputs
The outputs generated by this workflow will look like the following:
samplename/
├── 0.basecall
│ ├── samplename.fq
│ └── nanoplots
├── 1.assemble
│ ├── samplename_merged.fasta
│ ├── samplename_raw_assembly.fa
│ ├── samplename_raw_assembly.fa.amb
│ ├── samplename_raw_assembly.fa.ann
│ ├── samplename_raw_assembly.fa.bwt
│ ├── samplename_raw_assembly.fa.fai
│ ├── samplename_raw_assembly.fa.pac
│ ├── samplename_raw_assembly.fa.paf
│ ├── samplename_raw_assembly.fa.sa
│ ├── assemble_100m (if specified)
│ └── assemble_250m (if specified)
├── 2.polish
│ ├── samplename_polished.corrected.fasta
│ ├── samplename_polished.fasta
│ ├── samplename_polished.fasta.bam
│ ├── samplename_polished.fasta.bam.bai
│ ├── samplename_polished.fasta.fai
│ ├── samplename_polished.fasta.misassemblies.tsv
│ ├── medaka (if specified)
│ ├── pilon (if specified)
│ └── racon (if specified)
├── 3.circularization
│ ├── 1.candidate_genomes
│ ├── 2.circularization
│ ├── 3.circular_sequences #circularized genomes
│ ├── 4.samplename_circularized.corrected.fasta
│ ├── 4.samplename_circularized.fasta
│ ├── 4.samplename_circularized.fasta.bam
│ ├── 4.samplename_circularized.fasta.bam.bai
│ ├── 4.samplename_circularized.fasta.fai
│ └── 4.samplename_circularized.fasta.misassemblies.tsv
└── 5.final
├── samplename_final.fa
└── samplename_final.fa.fai
Code Snippets
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 | import sys import os import argparse import pysam fastaf = snakemake.input[0] out = open(snakemake.output[0], 'w') #knobs smooth_gap_width = 150000 contig_edge_margin = 150000 min_smoothed_aln_len = 10000 min_aln_len = 5000 #align print("Aligning...".format(t=str(snakemake.threads))) os.system("nucmer -p {delta} -b 4000 -l 2000 --maxmatch {fa} {fa}".format( #-t {threads} back to mummer3, no more multithreading : ( threads = str(snakemake.threads), delta = snakemake.params['delta'], fa = fastaf )) os.system("show-coords -T {delta}.delta -L 2000 | sed '1,5d' > coords.tsv".format( delta = snakemake.params['delta'] )) #the 1,5d gets rid of the header and identity hit as well print("Trimming genome...") max_tiglen = 0 with open('coords.tsv') as coords: lines = coords.readlines() smoothed_lines = [] if len(lines) > 0: aln_start_line = lines[0].strip().split("\t") prev_line = lines[0].strip().split("\t") #goal: identify corner-cutting parallel off-diagonals. # print(prev_line) for l in lines[1:] + [lines[0]]: #just so we don't miss the last item s = l.strip().split("\t") #print(s) tigname = s[-1] #store the name of the tig. this repeats nugatorily. if int(s[1]) > max_tiglen: #keep track of the largest alignment coordinate found, used as the tig length max_tiglen = int(s[1]) if int(s[0]) > int(s[1]): #ignore inversions # print('ignoring') continue if int(s[1]) - int(s[0]) < min_aln_len: # print('ignoring') continue if abs(int(s[0]) - int(prev_line[1])) < smooth_gap_width and \ abs(int(s[2]) - int(prev_line[3])) < smooth_gap_width: # print("joining") pass else: # print("terminating") newline = aln_start_line newline[1] = prev_line[1] newline[3] = prev_line[3] if int(newline[1]) - int(newline[0]) > min_smoothed_aln_len: # print("storing") smoothed_lines.append(newline) aln_start_line = s prev_line = s #print("output") #for s in smoothed_lines: # print(s) if len(smoothed_lines) > 0: #is it a corner cutting parallel off-diagonal? if int(smoothed_lines[0][0]) < contig_edge_margin and int(smoothed_lines[0][3]) > max_tiglen - contig_edge_margin: if int(smoothed_lines[-1][2]) < contig_edge_margin and int(smoothed_lines[-1][1]) > max_tiglen - contig_edge_margin: #out.write("It's overcircularized!") out.write(tigname + ":" + smoothed_lines[0][0] + '-' + smoothed_lines[-1][0] + "\n") out.write("done\n") |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 | import fileinput import os prev = [] prev_is_terminal = False verbose = True margin = snakemake.params['margin'] with open(snakemake.output[0], 'w') as oot: for l in open(snakemake.input[0]): s = l.strip().split("\t") if prev == []: next if s == ['no circularizations']: oot.write("no circularizations\n") break target_tig = s[9] spanner = s[10] #test for terminal alignment r_start = min([int(i) for i in s[0:2]]) r_end = max([int(i) for i in s[0:2]]) q_start = min([int(i) for i in s[2:4]]) q_end = max([int(i) for i in s[2:4]]) r_len = int(s[7]) q_len = int(s[8]) if r_start < margin or r_end > (r_len - margin): #ref is terminal if q_start < margin or q_end > (q_len - margin): #query is terminal if prev_is_terminal and spanner == prev[10] and target_tig == prev[9]: #spanned event #trim when the center of the spanning contig aligns to both ends of the genome (consistent with overcircularization) if q_start < prev_q_end: trim_bases = prev_q_end - q_start # oot.write('Need to trim ' + str(trim_bases) + " bases.\n") oot.write("{c}:1-{e}\n".format(c = target_tig, e = str(r_len-trim_bases))) #extend with spanning contig when it closes a gap else: insert_bases = [q_start, prev_q_end] insert_bases.sort() insert_string = spanner + ":" + '-'.join([str(i) for i in insert_bases]) #print(insert_string) oot.write(target_tig + "\n") oot.write(insert_string + "\n") prev_is_terminal = False next prev_is_terminal=True else: prev_is_terminal = False else: prev_is_terminal = False #assign prevs prev = s prev_r_start = r_start prev_r_end = r_end prev_q_start = q_start prev_q_end = q_end prev_r_len = r_len prev_q_len = q_len oot.write('done\n') |
52 53 | shell: "ln -s {input} {output}" |
67 68 69 70 71 72 73 74 75 76 | shell: "guppy_basecaller --cpu_threads_per_caller {threads} -i {params.in_dir} -s {params.out_dir} " + "--flowcell {fc} --kit {k}" .format(fc=config['flowcell'], k = config['kit']) rule basecall_final: #Collate called bases into a single fastq file input: expand('{{sample}}/0.basecall/raw_calls/{foo}/sequencing_summary.txt', foo = [os.path.basename(f).split('.')[0] for f in fast5_files]) output: '{sample}/0.basecall/{sample}.fq' shell: "find {sample}/0.basecall/raw_calls/*/*.fastq | xargs cat > {output}" |
87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 | shell: "NanoPlot --fastq {input} -t {threads} --color green -o " + "{s}/0.basecall/nanoplots".format(s = sample) rule assemble_canu: #Run canu. This can be run either in distributed fashion on the cluster or in a single job. If run in distributed fashion (usedgrid is set to 'True') #then canu must be installed in the user's environment. In this case, the singualarity image is not used. Canu arguments are passed through from #the config within the 'canu_args' key. 'gridOptions' are also passed through from the config and instruct Canu on any additional parameters #needed to run on the cluster. input: rules.basecall_final.output output: '{sample}/1.assemble/assemble_{genome_size}/{sample}_{genome_size}.contigs.fasta', #'{sample}/1.assemble/assemble_{genome_size}/{sample}_{genome_size}.correctedReads.fasta.gz' threads: 1 resources: mem=100, time=80 singularity: singularity_image if config['usegrid'] != 'True' and config['usegrid'] != True else '' #switch between image and local environment for canu depending on whether cluster execution is required shell: "canu -p {sample}_{wildcards.genome_size} -d {sample}/1.assemble/assemble_{wildcards.genome_size}/ -nanopore-raw {input} " + config['canu_args'] + " stopOnReadQuality=false genomeSize={wildcards.genome_size} " + "useGrid={grid} gridOptions='{opts}'".format( grid = config['usegrid'], opts = config['grid_options'] ) rule assemble_flye: input: rules.basecall_final.output output: "{sample}/1.assemble/assemble_{genome_size}/assembly.fasta" |
121 122 | shell: "flye -t {threads} --nano-raw {input} --meta -o {wildcards.sample}/1.assemble/assemble_{wildcards.genome_size}/ -g {wildcards.genome_size}" |
143 144 145 146 147 148 149 150 151 152 | shell: """ bedtools makewindows -g {input[1]} -w {params.window_width} | join - {input[1]} | tr ' ' '\t' | \ cut -f1-4 | awk '{{if ($2 > {params.window_width} && $3 < $4 - {params.window_width} && $4 > {params.min_tig_size}) print $0}}' | \ xargs -P 16 -l bash -c ' htsbox samview {input[2]} $0:$1-$1 -p | \ cut -f8,9 | awk "{{if (\$1 < $1 - ({params.window_width}/2) && \$2 > $1 + ({params.window_width}/2)) print \$0}}" | wc -l | \ paste <(echo $0) <(echo $1) - ' | awk '{{if ($3 < 2) print $0}} ' > {output} """ |
165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 | shell: """ if [ -s {input[0]} ] #if the input is nonempty... then cat {input[0]} | grep -v ^# | sort -k1,1 -k2,2g | join - <(cat {input[1]} | sort -k1,1 -k2,2g) | \ awk '{{ if ($1 == prev_tig){{ print($1,prev_coord,$2) }} else{{ if (prev_len > 0){{ print(prev_tig,prev_coord,prev_len) }} print($1,"1",$2) }} prev_tig = $1 prev_coord = $2 prev_len = $4 }} END {{ print(prev_tig,prev_coord,prev_len) }}' | sed "s/\(.*\) \(.*\)\ \(.*\)/\\1:\\2-\\3/g" | xargs samtools faidx {input[2]} \ | cut -f1 -d ':' | awk '(/^>/ && s[$0]++){{$0=$0\"_\"s[$0]}}1;' > {output[0]} cut -f1 {input[0]} > {sample}/{wildcards.sequence}.tigs.toremove grep -vf {sample}/{wildcards.sequence}.tigs.toremove {input[1]} | cut -f1 | xargs samtools faidx {input[2]} >> {output[0]} rm {sample}/{wildcards.sequence}.tigs.toremove else cp {input[2]} {output} fi """ |
210 211 | shell: "merge_wrapper.py {input} -ml 10000 -c 5 -hco 10; mv merged_out.fasta {output}" |
216 217 | shell: "ln -s ../../{input} {output}" |
232 | shell: "sort -k2,2gr {input[1]} | awk '{{if ($2 > {wildcards.contig_cutoff}) print $1}}' | xargs samtools faidx {input[0]} > {output}" |
247 248 | shell: "cat {input} | cut -f1 -d '_' | fold -w 120 | awk '(/^>/ && s[$0]++){{$0=$0\"_\"s[$0]}}1;' > {output}" |
294 295 296 297 | shell: """ racon -m 8 -x -6 -g -8 -w 500 -t {threads} {input} > {output} """ |
307 308 309 310 311 | shell: """ mkdir -p {output} cut -f1 {input[1]} | xargs -n 1 -I foo sh -c 'touch {output[0]}/foo; samtools faidx {input[0]} foo > {output[1]}/foo.fa' """ |
325 326 327 328 | shell: """ medaka consensus {input[1]} {output} --model r941_flip213 --threads {threads} --regions {wildcards.range} """ |
342 343 344 345 | shell: """ medaka stitch {input[0]} {input[1]} {output} """ |
366 367 368 369 | shell: """ cat {params.indir} | cut -f1 -d ':' > {output} """ |
385 386 | shell: "bwa index {input[0]}; bwa mem -t {threads} {input} | samtools sort --threads {threads} > {output}" |
397 398 399 400 401 402 | shell: """ mkdir {output} bedtools makewindows -w 100000 -g {input[1]} | awk '{{print $1,\":\", $2+ 1, \"-\", $3}}' | tr -d ' ' | xargs -n 1 -I foo touch {output}/foo """ |
425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 | shell: """ # set env var $i to be the smallest read subset decimal (in increments of 0.1, with a couple very low values thrown in, too) # sufficient to generate at least 40x coverage depth of the target sequence, or # 1 if 40x coverage cannot be achieved with the available read data for i in 0.01 0.05 $(seq 0.1 0.1 1); do cov=$(samtools view {input[1]} -s $i -h {wildcards.range} | samtools depth - | cut -f3 | awk '{{sum+=$1}}END{{print sum/(NR+1)}}') if [ $(echo $cov'>'{params.target_coverage}|bc) -eq 1 ] then break fi done echo Using $i x of total coverage; samtools view -h -O BAM -s $i {input[1]} {wildcards.range} > {params.bam} samtools index {params.bam} samtools faidx {input[0]} $(echo {wildcards.range}| cut -f1 -d ':') | cut -f1 -d ':' > {params.fa} java -Xmx{params.java_mem}G -jar $(which pilon | sed 's/\/pilon//g')/../share/pilon*/pilon*.jar \ --genome {params.fa} --targets {wildcards.range} \ --unpaired {params.bam} --output {sample}_{wildcards.range} --outdir {params.subrun_folder} \ --vcf --nostrays --mindepth 1 bgzip {params.subrun_folder}/{sample}_{wildcards.range}.vcf tabix -fp vcf {params.subrun_folder}/{sample}_{wildcards.range}.vcf.gz """ |
475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 | shell: """ #workaround! Snakemake was causing the vcf's to appear newer than the indices, which tabix didn't like touch {sample}/2.polish/pilon/sub_runs/*/*.vcf.gz.tbi #get properly sorted intervals ls {sample}/2.polish/pilon/ranges/ | tr ':-' '\t' | sort -k1,1 -k2,2g | awk '{{print $1,":",$2,"-",$3}}' | tr -d ' ' > sorted_ranges.tmp #get header (zcat {sample}/2.polish/pilon/sub_runs/*/*.vcf.gz | head -1000 | grep ^# #get corrections within each range (omitting DUP records, which bcftools can't understand) cat sorted_ranges.tmp | xargs -n 1 -I foo sh -c " tabix {sample}/2.polish/pilon/sub_runs/foo/{sample}_foo.vcf.gz foo | grep -v '0/0' | grep -v DUP" | #sort by position sort -k1,1 -k2,2g) | #compress and store bgzip > {output} || true #index tabix -p vcf {output} """ |
508 509 510 511 | shell: """ bcftools consensus -f {input} -o {output} """ |
518 519 520 521 | shell: """ cut -f1 -d ':' {input} > {output} """ |
535 536 537 538 539 540 541 | shell: """ #mkdir {output} cat {input[1]} | awk '{{if ($2 > {params.min_size}) print $1}}' | xargs -n 1 -I foo sh -c " samtools faidx {input[0]} foo > {sample}/3.circularization/1.candidate_genomes/foo.fa " """ |
552 553 554 555 556 557 | shell: """ (samtools idxstats {input[0]} | grep {wildcards.tig} | awk '{{if ($2 > 50000) print $1, ":", $2-50000, "-", $2; else print $1, ":", 1, "-", $2 }}' | tr -d ' '; samtools idxstats {input[0]} | grep {wildcards.tig} | awk '{{if ($2 > 50000) print $1, ":", 1, "-", 50000; else print $1, ":", 1, "-", $2 }}' | tr -d ' ') | xargs -I foo sh -c 'samtools view -h {input[0]} foo | samtools fastq - || true' | paste - - - - | sort | uniq | tr '\t' '\n' | bgzip > {output} """ |
572 573 574 575 | shell: """ flye -t {threads} --nano-raw {input} -o {params.directory} -g 1m """ |
600 601 602 603 604 605 606 607 608 609 610 | shell: """ nucmer -b 5000 {input[0]} {input[1]} -p {params.directory}/{params.prefix} #-t {threads} #reverted nucmer back down to 3, no more multithreading :( delta-filter -q {params.directory}/{params.prefix}.delta > {params.directory}/{params.prefix}.filt.delta show-coords -Tq {params.directory}/{params.prefix}.filt.delta | cut -f8,9 | sed '1,3d' | sort | \ uniq -c | tr -s ' ' '\\t' | cut -f2-99 | grep -v ^1 | cut -f2,3 > {params.directory}/potential_circularizations.tsv || true show-coords -Tql {params.directory}/{params.prefix}.filt.delta | grep -f {params.directory}/potential_circularizations.tsv | cat > {output} || true """ |
619 620 | script: "scripts/spancircle.py" |
634 635 636 637 638 639 640 641 642 643 644 645 646 | run: span_out = open(input[0], 'r').readlines() cmd = '' if span_out == ['done\n'] or span_out[0].strip() == 'no circularizations': #no circularization occurred print('Nothing to do') else: trim = span_out[0].strip() trim_cmd = 'samtools faidx ' + input[1] + ' ' + trim + " > " + params.outfa + "\n" cmd += trim_cmd if len(span_out) == 3: extend = span_out[1].strip() extend_cmd = 'samtools faidx ' + input[3] + ' ' + extend + " | grep -v '>'" + " >> " + params.outfa + "\n" |
671 672 | script: "scripts/encircle.py" |
686 687 688 689 690 691 692 693 694 695 696 | run: span_out = open(input[0], 'r').readlines() cmd = '' if span_out == ['done\n']: #no circularization occurred print('No over-circularization') else: trim = span_out[0].strip() trim_cmd = 'samtools faidx ' + input[1] + ' ' + trim + " > " + params.outfa + "\n" cmd += trim_cmd open(output[0], 'w').write(cmd + '\n') |
719 720 721 722 723 724 725 726 727 | shell: """ find {sample}/3.circularization/3.circular_sequences/sh/ -type f | xargs cat | bash ls {sample}/3.circularization/3.circular_sequences/ | grep .fa$ | cut -f1-2 -d '_' > circs.tmp || true (cat {input[1]} | grep -vf circs.tmp | cut -f1 | xargs samtools faidx {input[0]}; ls {sample}/3.circularization/3.circular_sequences/* | grep .fa$ | xargs cat) | tr -d '\\n' | sed 's/\\(>[contigscaffold_]*[0-9]*\\)/\\n\\1\\n/g' | fold -w 120 | cut -f1 -d ':' | grep -v '^$' > {output} || true rm circs.tmp """ |
743 | shell: "cp {input} {output}" |
757 758 | shell: "minimap2 -t {threads} -x map-ont {input} > {output}" |
772 773 | shell: "minimap2 -t {threads} -ax map-ont {input} | samtools sort --threads {threads} > {output}" |
782 783 | shell: "samtools index {input}" |
790 | shell: "samtools faidx {input}" |
Support
- Future updates
Related Workflows





