A tool for generating bacterial genomes from metagenomes with nanopore long read sequencing
Lathe
A tool for generating bacterial genomes from metagenomes with Nanopore long read sequencing
Installation
First, install miniconda3
Then install snakemake. This can be done with the following.
conda install snakemake
snakemake --version #please ensure this is >=5.4.3
Next, clone this github directory to some location where it can be stored permanently. Remember to keep it updated with
git pull
.
git clone https://github.com/elimoss/lathe.git
Instructions to enable cluster execution with SLURM can be found at https://github.com/bhattlab/slurm .
Typical installation time: 5-10 minutes.
Change as of 2021-02-03
Lathe has been adapted to run on multiple samples simultaneously, instead of one sample per snakemake command. The pipeline can now take in either .fast5 raw data from a nanopore run, or basecalled fastq files. The config file has been changed to reflect this. You now provide sample information and datasets in a tab-delimited file, and indicate this file in the
config.yaml
. Provide this file to the
file_names_txt
argument in the configfile.
file_names_txt is a 2 or 3 column tsv with the following columns
SAMPLE_NAME FAST5/FASTQ_READS SHORT_READS_1,SHORT_READS_2
sample name in the first column will be used to name ouptut
the second column can be a directory containing fast5 files (the output of a nanopore run)
-OR- a single fastq file containing basecalled data
Optionally, a short read sequencing dataset can be provided in the third column,
with pairs separated by a comma. If this option is selected, short read
polishing will be turned on.
Inputs
Alter config.yaml to provide the following:
-
file_names_txt : Tab delimited file describing sample names and input datasets. See config.yaml for a description.
-
flowcell : flowcell code, e.g. FLO-MIN106, passed to basecaller
-
kit : kit code, e.g. SQK-LSK109, passed to basecaller
-
genome_size : Estimated genome size, e.g. 50m, passed to Canu.
-
singularity : location (including on the internet) of a singularity image to be used for the workflow. Don't change this.
-
use_grid : should Canu execute in distributed mode on a cluster?
-
grid_options : Extra options for execution on a cluster
-
canu_args : Extra options for Canu
-
skip_circularization : Should circularization be omitted from the workflow?
Lathe uses the Flye assembler by default. For Canu, please specify 'canu' for the assembler parameter in the config. For cluster Canu execution, please note: if set to True, you will need to install Canu, e.g.
conda install -c conda-forge -c bioconda Canu=1.8
as well as provide any additional required parameters for your job scheduler in the config.yaml file. Please see the example config file. When executing on a cluster, Canu will appear to fail, as the first process does not produce an assembly and instead spawns subsequent jobs on the cluster. Don't worry, just re-run Lathe when the assembly completes.
To execute please run the following. Please note, you must substitute a parent directory containing all of your data and working directories for
/labs/
.
snakemake --use-singularity --singularity-args '--bind /labs/,/scg/,/home/ ' -s /path/to/lathe/Snakefile \
--configfile path/to/modified_config.yaml --restart-times 0 --keep-going --latency-wait 30
# --profile scg #enable cluster support, highly recommended. See above.
Outputs
The outputs generated by this workflow will look like the following:
samplename/
├── 0.basecall
│ ├── samplename.fq
│ └── nanoplots
├── 1.assemble
│ ├── samplename_merged.fasta
│ ├── samplename_raw_assembly.fa
│ ├── samplename_raw_assembly.fa.amb
│ ├── samplename_raw_assembly.fa.ann
│ ├── samplename_raw_assembly.fa.bwt
│ ├── samplename_raw_assembly.fa.fai
│ ├── samplename_raw_assembly.fa.pac
│ ├── samplename_raw_assembly.fa.paf
│ ├── samplename_raw_assembly.fa.sa
│ ├── assemble_100m (if specified)
│ └── assemble_250m (if specified)
├── 2.polish
│ ├── samplename_polished.corrected.fasta
│ ├── samplename_polished.fasta
│ ├── samplename_polished.fasta.bam
│ ├── samplename_polished.fasta.bam.bai
│ ├── samplename_polished.fasta.fai
│ ├── samplename_polished.fasta.misassemblies.tsv
│ ├── medaka (if specified)
│ ├── pilon (if specified)
│ └── racon (if specified)
├── 3.circularization
│ ├── 1.candidate_genomes
│ ├── 2.circularization
│ ├── 3.circular_sequences #circularized genomes
│ ├── 4.samplename_circularized.corrected.fasta
│ ├── 4.samplename_circularized.fasta
│ ├── 4.samplename_circularized.fasta.bam
│ ├── 4.samplename_circularized.fasta.bam.bai
│ ├── 4.samplename_circularized.fasta.fai
│ └── 4.samplename_circularized.fasta.misassemblies.tsv
└── 5.final
├── samplename_final.fa
└── samplename_final.fa.fai
Tutorial
The tutorial can be run using the provided config file and input data within the tutorial folder. This tutorial uses pre-basecalled long read data (to reduce total file sizes) and performs assembly with Flye and short read polishing. To reduce runtime, this tutorial skips basecalling, long read polishing, and circularization steps. With cluster execution enabled, this tutorial should be completed in under 6 hours. Successful completion will be indicated by the presence of a atcc_tutorial_final.fa file in the 5.final directory. To run the tutorial:
-
unzip the short read (tutorial/inputdata/atcc_100000_sr.fastq.gz) and long read (tutorial/atcc_tutorial/0.basecall/atcc_tutorial.fq.gz) data
-
edit the config file to provide the absolute path to the short read input data (atcc_100000_sr.fastq)
-
run Lathe using the command:
snakemake --use-singularity --singularity-args '--bind /yourrootdirectories/ ' -s /path/to/lathe/Snakefile \
--configfile path/to/config_nobasecalling.yaml --restart-times 0 --keep-going --latency-wait 30
# --profile clusterconfiguration #enable cluster support, highly recommended. See above.
Code Snippets
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 | import sys import os import argparse import pysam fastaf = snakemake.input[0] out = open(snakemake.output[0], 'w') #knobs smooth_gap_width = 150000 contig_edge_margin = 150000 min_smoothed_aln_len = 10000 min_aln_len = 5000 #align print("Aligning...".format(t=str(snakemake.threads))) os.system("nucmer -p {delta} -b 4000 -l 2000 --maxmatch {fa} {fa}".format( #-t {threads} back to mummer3, no more multithreading : ( threads = str(snakemake.threads), delta = snakemake.params['delta'], fa = fastaf )) os.system("show-coords -T {delta}.delta -L 2000 | sed '1,5d' > coords.tsv".format( delta = snakemake.params['delta'] )) #the 1,5d gets rid of the header and identity hit as well print("Trimming genome...") max_tiglen = 0 with open('coords.tsv') as coords: lines = coords.readlines() smoothed_lines = [] if len(lines) > 0: aln_start_line = lines[0].strip().split("\t") prev_line = lines[0].strip().split("\t") #goal: identify corner-cutting parallel off-diagonals. # print(prev_line) for l in lines[1:] + [lines[0]]: #just so we don't miss the last item s = l.strip().split("\t") #print(s) tigname = s[-1] #store the name of the tig. this repeats nugatorily. if int(s[1]) > max_tiglen: #keep track of the largest alignment coordinate found, used as the tig length max_tiglen = int(s[1]) if int(s[0]) > int(s[1]): #ignore inversions # print('ignoring') continue if int(s[1]) - int(s[0]) < min_aln_len: # print('ignoring') continue if abs(int(s[0]) - int(prev_line[1])) < smooth_gap_width and \ abs(int(s[2]) - int(prev_line[3])) < smooth_gap_width: # print("joining") pass else: # print("terminating") newline = aln_start_line newline[1] = prev_line[1] newline[3] = prev_line[3] if int(newline[1]) - int(newline[0]) > min_smoothed_aln_len: # print("storing") smoothed_lines.append(newline) aln_start_line = s prev_line = s #print("output") #for s in smoothed_lines: # print(s) if len(smoothed_lines) > 0: #is it a corner cutting parallel off-diagonal? if int(smoothed_lines[0][0]) < contig_edge_margin and int(smoothed_lines[0][3]) > max_tiglen - contig_edge_margin: if int(smoothed_lines[-1][2]) < contig_edge_margin and int(smoothed_lines[-1][1]) > max_tiglen - contig_edge_margin: #out.write("It's overcircularized!") out.write(tigname + ":" + smoothed_lines[0][0] + '-' + smoothed_lines[-1][0] + "\n") out.write("done\n") |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 | import fileinput import os prev = [] prev_is_terminal = False verbose = True margin = snakemake.params['margin'] with open(snakemake.output[0], 'w') as oot: for l in open(snakemake.input[0]): s = l.strip().split("\t") if prev == []: next if s == ['no circularizations']: oot.write("no circularizations\n") break target_tig = s[9] spanner = s[10] #test for terminal alignment r_start = min([int(i) for i in s[0:2]]) r_end = max([int(i) for i in s[0:2]]) q_start = min([int(i) for i in s[2:4]]) q_end = max([int(i) for i in s[2:4]]) r_len = int(s[7]) q_len = int(s[8]) if r_start < margin or r_end > (r_len - margin): #ref is terminal if q_start < margin or q_end > (q_len - margin): #query is terminal if prev_is_terminal and spanner == prev[10] and target_tig == prev[9]: #spanned event #trim when the center of the spanning contig aligns to both ends of the genome (consistent with overcircularization) if q_start < prev_q_end: trim_bases = prev_q_end - q_start # oot.write('Need to trim ' + str(trim_bases) + " bases.\n") oot.write("{c}:1-{e}\n".format(c = target_tig, e = str(r_len-trim_bases))) #extend with spanning contig when it closes a gap else: insert_bases = [q_start, prev_q_end] insert_bases.sort() insert_string = spanner + ":" + '-'.join([str(i) for i in insert_bases]) #print(insert_string) oot.write(target_tig + "\n") oot.write(insert_string + "\n") prev_is_terminal = False next prev_is_terminal=True else: prev_is_terminal = False else: prev_is_terminal = False #assign prevs prev = s prev_r_start = r_start prev_r_end = r_end prev_q_start = q_start prev_q_end = q_end prev_r_len = r_len prev_q_len = q_len oot.write('done\n') |
91 92 | shell: "ln -s {input} {output}" |
106 107 108 109 110 111 112 113 114 115 | shell: "guppy_basecaller --cpu_threads_per_caller {threads} -i {params.in_dir} -s {params.out_dir} " + "--flowcell {fc} --kit {k}" .format(fc=config['flowcell'], k = config['kit']) rule basecall_final: #Collate called bases into a single fastq file input: lambda wildcards: expand('{{sample}}/0.basecall/raw_calls/{foo}/sequencing_summary.txt', foo = [os.path.basename(f).split('.')[0] for f in fast5_files_dict[wildcards.sample]]) output: '{sample}/0.basecall/{sample}.fq' params: sample="{sample}" |
116 117 | shell: "find {params.sample}/0.basecall/raw_calls/*/*.fastq | xargs cat > {output}" |
128 | shell: "NanoPlot --fastq {input} -t {threads} --color green -o {wildcards.sample}/0.basecall/nanoplots" |
145 146 147 148 149 150 151 152 153 154 155 156 157 | shell: "canu -p {sample}_{wildcards.genome_size} -d {sample}/1.assemble/assemble_{wildcards.genome_size}/ -nanopore-raw {input} " + config['canu_args'] + " genomeSize={wildcards.genome_size} " + # stopOnReadQuality=false "useGrid={grid} gridOptions='{opts}'".format( grid = config['usegrid'], opts = config['grid_options'] ) rule assemble_flye: input: lambda wildcards: fq_files_dict[wildcards.sample] output: "{sample}/1.assemble/assemble_{genome_size}/assembly.fasta" |
163 164 | shell: "flye -t {threads} --meta --nano-raw {input} -o {wildcards.sample}/1.assemble/assemble_{wildcards.genome_size}/ -g {wildcards.genome_size}" |
185 186 187 188 189 190 191 192 193 | shell: """ bedtools makewindows -g {input[1]} -w {params.window_width} | join - {input[1]} | tr ' ' '\t' | \ cut -f1-4 | awk '{{if ($2 > {params.window_width} && $3 < $4 - {params.window_width} && $4 > {params.min_tig_size}) print $0}}' | \ xargs -P 16 -l bash -c ' htsbox samview {input[2]} $0:$1-$1 -p | \ cut -f8,9 | awk "{{if (\$1 < $1 - ({params.window_width}/2) && \$2 > $1 + ({params.window_width}/2)) print \$0}}" | wc -l | \ paste <(echo $0) <(echo $1) - ' | awk '{{if ($3 < 2) print $0}} ' > {output} """ |
206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 | shell: """ if [ -s {input[0]} ] #if the input is nonempty... then cat {input[0]} | grep -v ^# | sort -k1,1 -k2,2g | join - <(cat {input[1]} | sort -k1,1 -k2,2g) | \ awk '{{ if ($1 == prev_tig){{ print($1,prev_coord,$2) }} else{{ if (prev_len > 0){{ print(prev_tig,prev_coord,prev_len) }} print($1,"1",$2) }} prev_tig = $1 prev_coord = $2 prev_len = $4 }} END {{ print(prev_tig,prev_coord,prev_len) }}' | sed "s/\(.*\) \(.*\)\ \(.*\)/\\1:\\2-\\3/g" | xargs samtools faidx {input[2]} \ | cut -f1 -d ':' | awk '(/^>/ && s[$0]++){{$0=$0\"_\"s[$0]}}1;' > {output[0]} cut -f1 {input[0]} > {wildcards.sample}/{wildcards.sequence}.tigs.toremove # DM: temp fix to skip misassembly if formatting fails x=$(cat {input[1]} | wc -l) if [ $x -gt 1 ] then grep -vf {wildcards.sample}/{wildcards.sequence}.tigs.toremove {input[1]} | cut -f1 | xargs samtools faidx {input[2]} >> {output[0]} rm {wildcards.sample}/{wildcards.sequence}.tigs.toremove else cp {input[2]} {output} echo "Misassembly error: skipping misassembly removal" fi else cp {input[2]} {output} fi """ |
260 261 | shell: "merge_wrapper.py {input} -ml 10000 -c 5 -hco 10; mv merged_out.fasta {output}" |
266 267 | shell: "ln -s ../../{input} {output}" |
285 | shell: "sort -k2,2gr {input[1]} | awk '{{if ($2 > {params.min_contig_size}) print $1}}' | xargs samtools faidx {input[0]} > {output}" |
300 301 | shell: "cat {input} | cut -f1 -d '_' | fold -w 120 | awk '(/^>/ && s[$0]++){{$0=$0\"_\"s[$0]}}1;' > {output}" |
350 351 352 | shell: """ racon -m 8 -x -6 -g -8 -w 500 -t {threads} {input} > {output} """ |
362 363 364 365 | shell: """ mkdir -p {output} cut -f1 {input[1]} | xargs -n 1 -I foo sh -c 'touch {output[0]}/foo; samtools faidx {input[0]} foo > {output[1]}/foo.fa' """ |
379 380 381 | shell: """ medaka consensus {input[1]} {output} --model r941_flip213 --threads {threads} --regions {wildcards.range} """ |
395 396 397 | shell: """ medaka stitch {input[0]} {input[1]} {output} """ |
418 419 420 | shell: """ cat {params.indir} | cut -f1 -d ':' > {output} """ |
434 435 436 | shell: "bwa index {input.ref}; bwa mem -t {threads} {input.ref} {input.reads} | \ samtools sort --threads {threads} > {output}" |
447 448 449 450 451 | shell: """ mkdir {output} bedtools makewindows -w 100000 -g {input[1]} | awk '{{print $1,\":\", $2+ 1, \"-\", $3}}' | tr -d ' ' | xargs -n 1 -I foo touch {output}/foo """ |
476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 | shell: """ # set env var $i to be the smallest read subset decimal (in increments of 0.1, with a couple very low values thrown in, too) # sufficient to generate at least 40x coverage depth of the target sequence, or # 1 if 40x coverage cannot be achieved with the available read data for i in 0.01 0.05 $(seq 0.1 0.1 1); do cov=$(samtools view {input[1]} -s $i -h {params.myrange} | samtools depth - | cut -f3 | awk '{{sum+=$1}}END{{print sum/(NR+1)}}') if [ $(echo $cov'>'{params.target_coverage}|bc) -eq 1 ] then break fi done echo Using $i x of total coverage; samtools view -h -O BAM -s $i {input[1]} {params.myrange} > {params.bam} samtools index {params.bam} samtools faidx {input[0]} $(echo {params.myrange}| cut -f1 -d ':') | cut -f1 -d ':' > {params.fa} java -Xmx{params.java_mem}G -jar $(which pilon | sed 's/\/pilon//g')/../share/pilon*/pilon*.jar \ --genome {params.fa} --targets {params.myrange} \ --unpaired {params.bam} --output {params.sample}_{params.myrange} --outdir {params.subrun_folder} \ --vcf --nostrays --mindepth 1 bgzip {params.subrun_folder}/{params.sample}_{params.myrange}.vcf tabix -fp vcf {params.subrun_folder}/{params.sample}_{params.myrange}.vcf.gz """ |
527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 | shell: """ #workaround! Snakemake was causing the vcf's to appear newer than the indices, which tabix didn't like touch {params.sample}/2.polish/pilon/sub_runs/*/*.vcf.gz.tbi #get properly sorted intervals ls {params.sample}/2.polish/pilon/ranges/ | tr ':-' '\t' | sort -k1,1 -k2,2g | awk '{{print $1,":",$2,"-",$3}}' | tr -d ' ' > sorted_ranges.tmp #get header (zcat {params.sample}/2.polish/pilon/sub_runs/*/*.vcf.gz | head -1000 | grep ^# #get corrections within each range (omitting DUP records, which bcftools can't understand) cat sorted_ranges.tmp | xargs -n 1 -I foo sh -c " tabix {params.sample}/2.polish/pilon/sub_runs/foo/{params.sample}_foo.vcf.gz foo | grep -v '0/0' | grep -v DUP" | #sort by position sort -k1,1 -k2,2g) | #compress and store bgzip > {output} || true #index tabix -p vcf {output} """ |
559 560 561 | shell: """ bcftools consensus -f {input} -o {output} """ |
568 569 570 | shell: """ cut -f1 -d ':' {input} > {output} """ |
585 586 587 588 589 590 | shell: """ #mkdir {output} cat {input[1]} | awk '{{if ($2 > {params.min_size}) print $1}}' | xargs -n 1 -I foo sh -c " samtools faidx {input[0]} foo > {params.sample}/3.circularization/1.candidate_genomes/foo.fa " """ |
603 604 605 606 607 | shell: """ (samtools idxstats {input[0]} | grep {params.tig} | awk '{{if ($2 > 50000) print $1, ":", $2-50000, "-", $2; else print $1, ":", 1, "-", $2 }}' | tr -d ' '; samtools idxstats {input[0]} | grep {params.tig} | awk '{{if ($2 > 50000) print $1, ":", 1, "-", 50000; else print $1, ":", 1, "-", $2 }}' | tr -d ' ') | xargs -I foo sh -c 'samtools view -h {input[0]} foo | samtools fastq - || true' | paste - - - - | sort | uniq | tr '\t' '\n' | bgzip > {output} """ |
622 623 624 | shell: """ flye -t {threads} --nano-raw {input} -o {params.directory} -g 1m """ |
649 650 651 652 653 654 655 656 657 658 | shell: """ nucmer -b 5000 {input[0]} {input[1]} -p {params.directory}/{params.prefix} #-t {threads} #reverted nucmer back down to 3, no more multithreading :( delta-filter -q {params.directory}/{params.prefix}.delta > {params.directory}/{params.prefix}.filt.delta show-coords -Tq {params.directory}/{params.prefix}.filt.delta | cut -f8,9 | sed '1,3d' | sort | \ uniq -c | tr -s ' ' '\\t' | cut -f2-99 | grep -v ^1 | cut -f2,3 > {params.directory}/potential_circularizations.tsv || true show-coords -Tql {params.directory}/{params.prefix}.filt.delta | grep -f {params.directory}/potential_circularizations.tsv | cat > {output} || true """ |
667 668 | script: "scripts/spancircle.py" |
682 683 684 685 686 687 688 689 690 691 692 693 694 | run: span_out = open(input[0], 'r').readlines() cmd = '' if span_out == ['done\n'] or span_out[0].strip() == 'no circularizations': #no circularization occurred print('Nothing to do') else: trim = span_out[0].strip() trim_cmd = 'samtools faidx ' + input[1] + ' ' + trim + " > " + params.outfa + "\n" cmd += trim_cmd if len(span_out) == 3: extend = span_out[1].strip() extend_cmd = 'samtools faidx ' + input[3] + ' ' + extend + " | grep -v '>'" + " >> " + params.outfa + "\n" |
719 720 | script: "scripts/encircle.py" |
734 735 736 737 738 739 740 741 742 743 744 | run: span_out = open(input[0], 'r').readlines() cmd = '' if span_out == ['done\n']: #no circularization occurred print('No over-circularization') else: trim = span_out[0].strip() trim_cmd = 'samtools faidx ' + input[1] + ' ' + trim + " > " + params.outfa + "\n" cmd += trim_cmd open(output[0], 'w').write(cmd + '\n') |
769 770 771 772 773 774 775 776 777 778 | shell: """ mkdir -p {params.sample}/3.circularization/3.circular_sequences/sh/ touch {params.sample}/3.circularization/3.circular_sequences/sh/dummy find {params.sample}/3.circularization/3.circular_sequences/sh/ -type f | xargs cat | bash ls {params.sample}/3.circularization/3.circular_sequences/ | grep .fa$ | cut -f1-2 -d '_' > circs.tmp || true (cat {input[1]} | grep -vf circs.tmp | cut -f1 | xargs samtools faidx {input[0]}; ls {params.sample}/3.circularization/3.circular_sequences/* | grep .fa$ | xargs cat) | tr -d '\\n' | sed 's/\\(>[contigscaffold_]*[0-9]*\\)/\\n\\1\\n/g' | fold -w 120 | cut -f1 -d ':' | grep -v '^$' > {output} || true rm circs.tmp """ |
794 | shell: "cp {input} {output}" |
808 809 | shell: "minimap2 -t {threads} -x map-ont {input} > {output}" |
823 824 | shell: "minimap2 -t {threads} -ax map-ont {input} | samtools sort --threads {threads} > {output}" |
833 834 | shell: "samtools index {input}" |
841 | shell: "samtools faidx {input}" |
Support
- Future updates
Related Workflows





