Novel-X: Novel sequence insertion detection using Linked-Reads
Novel-X detects and genotypes novel sequence insertions in 10X Genomics sequencing datasets using non-trivial read alignment signatures and barcode information.
Installation
To start working with Novel-X, please clone this repository recursively:

```
git clone --recursive git@github.com:1dayac/Novel-X.git
```

If you clone the repository non-recursively, Novel-X will not work. To fix this, run from the Novel-X folder:

```
git submodule update --init --recursive
```
Novel-X is a pipeline built on the popular Snakemake workflow management system; it consists of multiple steps and requires a fair amount of external software.
First, the following software should be installed (the version numbers used for testing are shown in brackets, but other versions should also work):
- Longranger (version 2.15) - Download Page
- Velvet (commit 9adf09f) - GitHub Page - an outdated but still useful assembler with minimal assumptions about the data. Note that we use a k-mer length of 63 during the assembly of 10X data, so Velvet should be compiled using the

  ```
  make 'MAXKMERLENGTH=63'
  ```

  command. For more information, refer to the Velvet manual.
- BlastN (version 2.2.31) - Download Page
- Samtools (version 1.7) - Project Page
- SPAdes (version 3.13) - Project Page
- Quast (version 4.4) - Project Page
All of these programs (except LongRanger) can be installed with conda. We provide a conda-env.yml file that allows you to install them using the following command:

```
conda env create -f conda-env.yml
```
Paths to the executables (if they are not in $PATH) should be provided in the path_to_executables_config.json file.
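The sketch below shows one way to create such a file. This is a hypothetical example: the key names and paths are illustrative only, and the exact keys the pipeline expects should be checked against the template shipped in the repository.

```bash
# Hypothetical sketch of path_to_executables_config.json; the key names and
# paths here are illustrative, not the pipeline's authoritative schema.
cat > path_to_executables_config.json <<'EOF'
{
    "longranger": "/opt/longranger/longranger",
    "velvet": "/opt/velvet",
    "blastn": "/usr/local/bin/blastn",
    "samtools": "/usr/local/bin/samtools",
    "spades": "/opt/SPAdes-3.13.0/bin",
    "quast": "/opt/quast-4.4"
}
EOF
```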
Python dependencies are listed in the requirements.txt file. They can be downloaded and installed with the following command:

```
pip install -r requirements.txt
```
Inside the bxtools folder, run the following commands (estimated execution time is around 2 minutes):

```
./configure
make
make install
```
We tested our tool on CentOS Linux 7, but we expect it to work on any modern Unix-like system. Then you are ready to go.
Command Options
Novel-X can be run with the novel-x.py script in one of two modes:

- run - run the pipeline from scratch
- restart - if a previous run did not finish for some reason, you can try to pick up where it left off with the novel-x.py restart command.
A typical command to start Novel-X is:

```
python novel-x.py run --bam my_bam.bam --genome my_genome.fasta --outdir my_dir
```
Optional arguments are:

- --lr20 - needed if you run the pipeline on a bam file obtained with the LongRanger 2.0 pipeline
- --nt - optional filtering of non-human sequences from the orphan contigs
We added two option groups to handle different kinds of data and their properties (molecule length, intra-molecule coverage, etc.); a combined example is shown after the lists.
Data option group:

- --10x - for 10X Genomics data [Default]
- --tellseq - for Tell-Seq data
- --stlfr - for stLFR data
Tell-Seq and stLFR data should be converted to a LongRanger-compatible bam file. For stLFR data, use this pipeline. For Tell-Seq data, refer to the Tell-Seq paper.
Coverage option group:

- --high-coverage - best for 60X coverage and higher [Default]
- --low-coverage - best for 20X-40X coverage
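For example, a run on stLFR data sequenced to roughly 30X coverage could combine the two option groups as follows (the file names are placeholders):

```bash
python novel-x.py run --stlfr --low-coverage \
    --bam my_stlfr.bam --genome my_genome.fasta --outdir my_dir
```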
You can invoke the help message by typing:

```
python novel-x.py run --help
```

or

```
python novel-x.py restart --help
```
Output Formats
Novel-X writes its results to a vcf file. If your bam file was named HM2KYBBXX_NA18509.bam, the resulting vcf file will be named HM2KYBBXX_NA18509.vcf and will be stored inside the outdir folder.
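Since the calls are plain vcf records, a quick sanity check is to count the non-header lines (file names here follow the examples in this README):

```bash
# Count the insertion records in the output vcf (skip the '#' header lines).
grep -vc '^#' my_dir/HM2KYBBXX_NA18509.vcf
```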
Example Commands
Run from the start:

```
python ~/Novel-X/novel-x.py run --bam /athena/ihlab/scratch/dmm2017/70_samples_data/HLF3WBBXX_NA12006_longranger.bam -t 8 -m 200 --nt /athena/ihlab/scratch/dmm2017/blast_database/ --genome /athena/ihlab/scratch/dmm2017/hg38/hg38.fa --outdir /athena/ihlab/scratch/dmm2017/70_samples/novelx_NA12006
```
Restart from the last stage:

```
python ~/Novel-X/novel-x.py restart --outdir novelx_NA12006
```
There is currently a problem at the filter_target_contigs stage: it can exit with a non-zero exit code. We recommend commenting out the following line before using the restart option:

```
parallel --jobs {THREADS} filter_target_contigs ::: {input.contigs}/*
```
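If you would rather not edit the file by hand, a sed one-liner along these lines should do it. This assumes the line lives in the top-level Snakefile of the repository, which is worth verifying first:

```bash
# Prefix the parallel call with '#' so the shell treats it as a comment.
# Assumes the rule is defined in ~/Novel-X/Snakefile; adjust the path if not.
sed -i 's|^\( *parallel --jobs {THREADS} filter_target_contigs\)|#\1|' ~/Novel-X/Snakefile
```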
Demo command
We placed a toy dataset in the demo folder so you can test that the software is installed correctly. You can run the command:

```
python ~/Novel-X/novel-x.py run --bam ~/Novel-X/demo/demo.bam -t 1 -m 20 --genome ~/Novel-X/demo/demo.fasta --outdir out
```
This command takes about 15 minutes to finish on our hardware. It produces a vcf file with a single record:

```
chr1_25500000_25535000 29503 . T TGTATTGTGTGTATGAGGGTTGTGTGCTGTGTGTTGTGTATATATTGTATGTGTTATGTGTATGTATGTCGTATGAGTGTATTCTGTATATGTGTTTTGTGTGGTCTATTATGTATGTGGCATGTGTTGTGTATGTGTGTTGTGTGTGATGTGTTGTATGTGTGTTGTGCATATATGTTGTTTCTGTGTATGTATGTTATGTGTATGTGTATGTTGTGTTGTATGTATGGGTTGTGCCTATGTGCTGTGTTGTGTGCTGCATGCATGTTTGTGTGGTGTGTGTATTTAGGTTGTGTGCTATTTATGTGTCTATATTGTATGTGTTGTATGTGTGTTGTATGTATGTGTAGTGTATGTGTGTTGTGTGTGATGTGTATATGTGGTGTGTGTATGTCTGTTATGTGTATGTATGAGTGTATGTGTGTTGTGTGTGTTGTGTATATGTGTTGTGTGTGTTGTGTATGTGTGTTGTGAGTTGTGTATATGTGGTGTGAGTTGTGTTGTGTCATGTATGTGTGCATTGTGTATAGGTGTTGCATGTGTGTTGTGTTGTGTGTATGTGTTGTGTTGTGTATATGTGGTATGTGAATGTGTATGTTGTATGTTGTGTTGTATGTATATGTGTTATGTATATGTGATGTGTGTGTTGTGTATATGCTGGGTGTGTGTGTACATGTGTGTATGTGTGTTGTATGTATGTGTGTATGCATGTGTGTTGCGTATATGTGGTATGTGTGCATGTGTGTTGTCATGTGTATGTGTGTTGTGTATATGTGTGTGTTGTGTATATGTGTTGTGTGTATGTGTATCATGTTGTGTGTATGTGTTATGTTGTGTATATGTGGTGTGTGAATGTGTGTTGTGTGTATGTGTATGTTGTCTGTTTTGTGTGTGTATACGTGGTGTGTGTGTGTTGTGTTGTGTATATGTGTTGTGTGTGTTGCGTGTATGTGTTGTGTGTT . PASS DP=100 NODE_1_length_4180_cov_43.887399 2776 276 347 1306
```
Output may slightly differ based on your software versions.
Publications
"Novel sequence insertion detection using Linked-Reads" preprint is available at https://www.biorxiv.org/content/10.1101/551028v1 .
Contact & Support
Feel free to send any inquiries to meleshko.dmitrii@gmail.com
Code Snippets
```python
# Filter a fasta file, keeping only records longer than a given length threshold.
import sys
from Bio import SeqIO

if len(sys.argv) < 4:
    print("Usage: %s <length threshold> <contigs_file> <output>" % sys.argv[0])
    sys.exit(1)

f_n = sys.argv[2]
input_seq_iterator = SeqIO.parse(open(f_n, "r"), "fasta")
filtered_iterator = (record for record in input_seq_iterator
                     if len(record.seq) > int(sys.argv[1]))

output_handle = open(sys.argv[3], "w")
SeqIO.write(filtered_iterator, output_handle, "fasta")
output_handle.close()
```
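This script can also be run standalone; the call below mirrors how the pipeline invokes it (file names are placeholders):

```bash
# Keep only records longer than 200 bp.
python contig_length_filter.py 200 contigs.fasta filtered_contigs.fasta
```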
```python
# Select fasta records whose names appear in column 5 of an alignment report
# and write them to a new fasta file.
import sys
from Bio import SeqIO

with open(sys.argv[2], "r") as nucmer_file:
    lines = nucmer_file.readlines()

record_dict = SeqIO.to_dict(SeqIO.parse(sys.argv[1], "fasta"))

if len(lines) == 1:
    exit()

records_to_output = []
with open(sys.argv[3], "w") as output_handle:
    for line in lines:
        split = line.split("\t")
        if len(split) < 5 or split[0] == "S1":
            pass
        else:
            if split[4].strip() not in records_to_output:
                records_to_output.append(split[4].strip())
    for record in records_to_output:
        SeqIO.write([record_dict[record]], output_handle, "fasta")
```
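This snippet appears to be the per-barcode record filter called in the filter_target_contigs rule further down (filter_correct_record.py); if so, the standalone argument order is contigs fasta, Quast alignment report, output fasta (the paths below are placeholders modeled on that rule):

```bash
# argv order: contigs fasta, alignment report tsv, output fasta.
python filter_correct_record.py prefilter_contigs/1.fasta quast_res/1/contigs_reports/all_alignments_1.tsv filtered_contigs/1.fasta
```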
```python
# Parse a Quast/nucmer alignment report, detect insertion (and deletion)
# signatures in contig-to-reference alignments, and write the insertions
# to a vcf file plus an anchors fasta.
import sys
import re
from Bio import SeqIO

current_set = []
current_node = ""
current_chrom = ""
unique_insertions = []
non_unique_insertions = []
unique_deletions = []
non_unique_deletions = []

class Deletion(object):
    def __init__(self, node, contig_pos, rc, ref_id, ref_pos1, ref_pos2, anchor1, anchor2):
        self.node = node
        self.contig_pos = contig_pos
        self.rc = rc
        self.ref_id = ref_id
        self.ref_pos1 = ref_pos1
        self.ref_pos2 = ref_pos2
        self.anchor1 = anchor1
        self.anchor2 = anchor2

    def __eq__(self, other):
        return (abs(self.ref_pos1 - other.ref_pos1) < 100 or abs(self.ref_pos2 - other.ref_pos2) < 100 or
                abs(self.ref_pos2 - other.ref_pos1) < 100 or abs(self.ref_pos1 - other.ref_pos2) < 100)

    def size(self):
        return abs(self.ref_pos2 - self.ref_pos1) - 1

class Insertion(object):
    def __init__(self, node, pos1, pos2, rc, ref_id, ref_pos, anchor1, anchor2, overlap):
        self.node = node
        self.pos1 = pos1
        self.pos2 = pos2
        self.rc = rc
        self.ref_id = ref_id
        self.ref_pos = ref_pos
        self.anchor1 = anchor1
        self.anchor2 = anchor2
        self.overlap = overlap

    def __eq__(self, other):
        return abs(self.ref_pos - other.ref_pos) < 100

    def size(self):
        return abs(self.pos2 - self.pos1) + self.overlap - 1

def Overlaps(al1, al2):
    cont11 = al1.cont1
    cont12 = al1.cont2
    cont21 = al2.cont1
    cont22 = al2.cont2
    if cont11 > cont12:
        cont11, cont12 = cont12, cont11
    if cont21 > cont22:
        cont21, cont22 = cont22, cont21
    return (cont11 >= cont21 and cont11 <= cont22) or (cont12 <= cont21 and cont12 >= cont22)

def Near(sv1, sv2):
    return sv2 - sv1 < 50 and sv2 - sv1 > -10

def BigAnchors(al1, al2):
    return abs(al1.cont1 - al1.cont2) > 300 or abs(al2.cont1 - al2.cont2) > 300

def IsDeletion(al1, al2):
    # NB: cont11/cont12/cont21/cont22 come from module scope
    # (they are set in the main loop below before this is called).
    ref11 = al1.ref1
    ref12 = al1.ref2
    is_rc = ref12 - ref11 < 0
    ref21 = al2.ref1
    ref22 = al2.ref2
    is_rc2 = ref22 - ref21 < 0
    if is_rc != is_rc2:
        return False
    if ref11 > ref12:
        ref11, ref12 = ref12, ref11
    if ref21 > ref22:
        ref21, ref22 = ref22, ref21
    if is_rc:
        contig_breakpoint = al1.cont1
    else:
        contig_breakpoint = al1.cont2
    if cont21 - cont12 < 5:
        if ref21 - ref12 > 5:
            deletion = Deletion(al1.node, contig_breakpoint, is_rc, al1.chrom, ref12, ref21,
                                abs(cont22 - cont21), abs(cont12 - cont11))
            if deletion.size() > 1000:
                print(deletion.size())
                print("Deletion: " + str(al1) + " \t " + str(al2))
            if deletion not in unique_deletions:
                unique_deletions.append(deletion)
            else:
                non_unique_deletions.append(deletion)
            return True
    if cont22 < cont11:
        deletion = Deletion(al1.node, contig_breakpoint, is_rc, al1.chrom, ref12, ref21,
                            abs(cont22 - cont21), abs(cont12 - cont11))
        if deletion.size() > 1000:
            print(deletion.size())
            print("Deletion: " + str(al1) + " \t " + str(al2))
        if deletion in non_unique_deletions:
            return False
        if cont11 - cont22 > 5:
            if deletion not in unique_deletions:
                unique_deletions.append(deletion)
            else:
                non_unique_deletions.append(deletion)
            return True
    return False

def GetOverlap(ref11, ref12, ref21, ref22):
    if ref11 > ref12:
        ref11, ref12 = ref12, ref11
    if ref21 > ref22:
        ref21, ref22 = ref22, ref21
    if (min(ref21, ref22) > max(ref11, ref12)) or (max(ref21, ref22) < min(ref11, ref12)):
        return 0
    if ref11 < ref21 < ref12 and ref22 > ref12:
        return ref12 - ref21
    if ref11 < ref22 < ref12 and ref21 < ref11:
        return ref22 - ref11
    return 0

def IsInsertion(al1, al2):
    cont11 = al1.cont1
    cont12 = al1.cont2
    ref_breakpoint11 = al1.ref1
    ref_breakpoint12 = al1.ref2
    is_rc = cont12 - cont11 < 0
    cont21 = al2.cont1
    cont22 = al2.cont2
    ref_breakpoint21 = al2.ref1
    ref_breakpoint22 = al2.ref2
    is_rc2 = cont22 - cont21 < 0
    if cont21 > cont22:
        cont21, cont22 = cont22, cont21
        ref_breakpoint21, ref_breakpoint22 = ref_breakpoint22, ref_breakpoint21
    if is_rc != is_rc2:
        return False
    if cont11 > cont12:
        cont11, cont12 = cont12, cont11
        ref_breakpoint11, ref_breakpoint12 = ref_breakpoint12, ref_breakpoint11
    if (ref_breakpoint21 < ref_breakpoint11 and ref_breakpoint11 < ref_breakpoint12):
        return False
    overlap = GetOverlap(ref_breakpoint11, ref_breakpoint12, ref_breakpoint21, ref_breakpoint22)
    print(overlap)
    if abs(ref_breakpoint12 - ref_breakpoint21) > 50 and overlap == 0:
        print("Not an exact insertion")
        return False
    if cont12 < cont21:
        if cont21 - cont12 > 5:
            ins = Insertion(al1.node, cont12, cont21, is_rc, al1.chrom, ref_breakpoint12,
                            abs(cont22 - cont21), abs(cont12 - cont11), overlap)
            if ins.size() > 500:
                print(ins.size())
                print("Insertion: " + str(al1) + " \t " + str(al2))
            if ins not in unique_insertions:
                unique_insertions.append(ins)
            else:
                non_unique_insertions.append(ins)
            return True
    if cont22 < cont11:
        ins = Insertion(al1.node, cont22, cont11, is_rc, al1.chrom, ref_breakpoint12,
                        abs(cont22 - cont21), abs(cont12 - cont11), overlap)
        if ins.size() > 500:
            print(ins.size())
            print("Insertion: " + str(al1) + " \t " + str(al2))
        if cont11 - cont22 > 5:
            if ins not in unique_insertions:
                unique_insertions.append(ins)
            else:
                non_unique_insertions.append(ins)
            return True
    return False

def AnalyzeCurrentSet(al1, al2):
    if BigAnchors(al1, al2):
        if IsInsertion(al1, al2):
            pass
            #print(str(al1) + " \t " + str(al2))
        if IsDeletion(al1, al2):
            pass
            #print(str(al1) + " \t " + str(al2))

class ContigFilter(object):
    def __init__(self):
        self.coverage_cutoff = 0.0
        self.length_cutoff = 0

    def ParseCoverage(self, contig_name):
        m = re.search("cov_([\d\.]+)", contig_name)
        return float(m.group(1))

    def ParseLength(self, contig_name):
        m = re.search("length_([\d\.]+)_", contig_name)
        return int(m.group(1))

    def PassFilter(self, contig_name):
        return self.ParseCoverage(contig_name) > self.coverage_cutoff \
               and self.ParseLength(contig_name) > self.length_cutoff

class Alignment(object):
    def __init__(self, ref1, ref2, cont1, cont2, chrom, node):
        self.ref1 = ref1
        self.ref2 = ref2
        self.cont1 = cont1
        self.cont2 = cont2
        self.chrom = chrom
        self.node = node

    def __str__(self):
        return str(self.ref1) + "\t" + str(self.ref2) + "\t" + str(self.cont1) + "\t" + \
               str(self.cont2) + "\t" + self.chrom + "\t" + self.node

i = 0
filter = ContigFilter()
with open(sys.argv[1], "r") as nucmer_file:
    lines = nucmer_file.readlines()
    # First pass: misassembly lines flanked by two alignments signal a candidate SV.
    for i in range(1, len(lines) - 1):
        line = lines[i]
        if line.startswith("local misassembly") or line.startswith("transloc") or \
           line.startswith("relocation") or line.startswith("indel") or line.startswith("unknown"):
            line1 = lines[i-1]
            line2 = lines[i+1]
            if line2.startswith("CON"):
                continue
            chrom = line1.split("\t")[4].strip()
            ref11 = int(line1.split("\t")[0].strip())
            ref12 = int(line1.split("\t")[1].strip())
            ref21 = int(line2.split("\t")[0].strip())
            ref22 = int(line2.split("\t")[1].strip())
            cont11 = int(line1.split("\t")[2].strip())
            cont12 = int(line1.split("\t")[3].strip())
            cont21 = int(line2.split("\t")[2].strip())
            cont22 = int(line2.split("\t")[3].strip())
            contig_name = line1.split("\t")[5].strip()
            if not filter.PassFilter(contig_name):
                continue
            i += 1
            if i > 100000:
                pass
                #break
            al1 = Alignment(ref11, ref12, cont11, cont12, chrom, contig_name)
            al2 = Alignment(ref21, ref22, cont21, cont22, chrom, contig_name)
            AnalyzeCurrentSet(al1, al2)

def get_insertion_pos(cont11, cont12, contig_length):
    if contig_length - max(cont11, cont12) > min(cont11, cont12):
        return max(cont11, cont12), contig_length
    else:
        return 0, min(cont11, cont12)

with open(sys.argv[1], "r") as nucmer_file:
    lines = nucmer_file.readlines()
    # Second pass: contigs with a single partial alignment carry a dangling insertion.
    for i in range(2, len(lines)):
        line = lines[i]
        if line.startswith("CONTIG") and (lines[i-2].startswith("CONTIG") or lines[i-2].startswith("S1")):
            alignment_line = lines[i-1]
            if not alignment_line[0].isdigit():
                continue
            chrom = alignment_line.split("\t")[4].strip()
            ref11 = int(alignment_line.split("\t")[0].strip())
            ref12 = int(alignment_line.split("\t")[1].strip())
            cont11 = int(alignment_line.split("\t")[2].strip())
            cont12 = int(alignment_line.split("\t")[3].strip())
            contig_name = alignment_line.split("\t")[5].strip()
            if not filter.PassFilter(contig_name):
                continue
            contig_length = filter.ParseLength(contig_name)
            if abs(cont12 - cont11) < 600:
                continue
            ins_start, ins_end = get_insertion_pos(cont11, cont12, contig_length)
            if abs(ins_start - ins_end) < 50:
                continue
            if ins_start == 0:
                if cont12 < cont11:
                    ref_breakpoint = ref12
                else:
                    ref_breakpoint = ref11
            else:
                if cont12 < cont11:
                    ref_breakpoint = ref11
                else:
                    ref_breakpoint = ref12
            is_rc = cont11 > cont12
            if cont12 < cont11:
                cont12, cont11 = cont11, cont12
            ins = Insertion(contig_name, ins_start, ins_end, is_rc, chrom, ref_breakpoint,
                            abs(cont12 - cont11), 0, 0)
            if ins.size() > 500:
                print(ins.size())
                print("Insertion: " + str(chrom) + " \t " + str(ref_breakpoint))
            if ins not in unique_insertions:
                unique_insertions.append(ins)
            else:
                non_unique_insertions.append(ins)

records = list(SeqIO.parse(sys.argv[2], "fasta"))
record_dict = {}
for record in records:
    if record.id not in record_dict.keys():
        record_dict[record.id] = record

with open(sys.argv[3], "w") as vcf, open("insertions_with_anchors.fasta", "w") as fasta:
    for insertion in unique_insertions:
        if insertion.size() >= 300:
            fasta.write(">" + insertion.node + "\n")
            fasta.write(str(record_dict[insertion.node].seq))
            fasta.write("\n")
            try:
                ins_seq = str(record_dict[insertion.node].seq[insertion.pos1 : insertion.pos2 + insertion.overlap]
                              if not insertion.rc
                              else record_dict[insertion.node].seq[insertion.pos1 : insertion.pos2 + insertion.overlap].reverse_complement())
                vcf.write(str(insertion.ref_id) + "\t" + str(insertion.ref_pos) + "\t" + "."
                          + "\t" + str(record_dict[insertion.node].seq[insertion.pos1]) + "\t" + ins_seq
                          + "\t" + "." + "\t" + "PASS" + "\t" + "DP=100" + "\t" + insertion.node
                          + "\t" + str(insertion.anchor1) + "\t" + str(insertion.anchor2)
                          + "\t" + str(insertion.pos1) + "\t" + str(insertion.pos2) + "\n")
            except:
                pass
```
```python
# Remove contigs listed in a contamination file from a fasta file,
# writing the remaining contigs to the output file.
import sys
import re
import math
import os

class Usage(Exception):
    def __init__(self, msg):
        self.msg = msg

def usage():
    print("\nUsage: python extractContaminants.py contaminationFile orphan.fq.contigs.fa orphan.fq.contigs.wocontamination.fa")
    sys.exit(-1)

def main():
    args = sys.argv[1:]
    if len(args) != 3:
        usage()
    fcontamination = open(sys.argv[1], 'r')
    fcontig = open(sys.argv[2], 'r')
    fout = open(sys.argv[3], 'w')
    contamination = fcontamination.readline()
    while contamination != '':
        contamination = contamination.split()[0]
        contig = fcontig.readline()[:-1]
        if (contig != ''):
            contig = re.split(">| ", contig)[1]
        while contamination != contig and contig != '':
            fout.write(">" + contig + "\n")
            while True:
                temp_string = fcontig.readline()
                if temp_string != '' and temp_string[0] != '>':
                    fout.write(temp_string)
                else:
                    break
            contig = temp_string
            if (contig != ''):
                contig = re.split(">| ", contig)[1][:-1]
        if (contamination == contig and contig != ''):
            last_pos = fcontig.tell()
            fcontig.readline()
            while contig != '':
                if contig[0] == '>':
                    fcontig.seek(last_pos)
                    break
                last_pos = fcontig.tell()
                contig = fcontig.readline()
        contamination = fcontamination.readline()
    contig = fcontig.readline()
    while contig != '':
        fout.write(contig)
        fout.write(fcontig.readline())
        contig = fcontig.readline()
    fout.close()

if __name__ == "__main__":
    sys.exit(main())
```
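The script prints its own usage message when given the wrong number of arguments; a typical call is:

```bash
python extractContaminants.py contaminationFile orphan.fq.contigs.fa orphan.fq.contigs.wocontamination.fa
```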
```
shell:
    """
    {SAMTOOLS} sort -@ {THREADS} -n {input} -o {output.sorted}
    {GIT_ROOT}/bxtools/bin/bxtools filter {output.sorted} -b -s 0.2 -q 10 >{output.unmapped_bam}
    """
```
```
run:
    shell("rm -rf temp_reads")
    shell("mkdir temp_reads")
    shell("mkdir -p fasta")
    shell("{GIT_ROOT}/bxtools/bin/bxtools bamtofastq {input.bam} temp_reads/")
    if DATA == "other":
        shell("{SPADES} -t {THREADS} -k {VELVET_K} -1 temp_reads/{wildcards.sample}_R1.fastq -2 temp_reads/{wildcards.sample}_R2.fastq --only-assembler -o spades_{wildcards.sample}")
        shell("cp spades_{wildcards.sample}/scaffolds.fasta fasta/{wildcards.sample}.fasta")
    else:
        shell("{VELVETH} velvet_{wildcards.sample} {VELVET_K} -shortPaired -fastq -separate temp_reads/{wildcards.sample}_R1.fastq temp_reads/{wildcards.sample}_R2.fastq")
        shell("{VELVETG} velvet_{wildcards.sample} -exp_cov auto -cov_cutoff {VELVET_COVERAGE} -max_coverage 100 -scaffolding no")
        shell("cp velvet_{wildcards.sample}/contigs.fa fasta/{wildcards.sample}.fasta")
    shell("rm -rf temp_reads/")
```
```
shell:
    """
    {GIT_ROOT}/contig_length_filter.py 200 {input.fasta} {output.filtered_fasta}
    """
```
```
shell:
    """
    {GIT_ROOT}/bamtofastq {ADDITIONAL_BAMTOFASTQ_FLAGS} {input} {output.temp_dir}
    """
```
```
run:
    if BLAST_DB != 'None':
        shell("mkdir -p blast")
        shell("{BLASTN} -task megablast -query {input.filtered_fasta} -db {BLAST_DB} -num_threads {THREADS} > blast/{wildcards.sample}.megablast")
        shell("{GIT_ROOT}/cleanmega blast/{wildcards.sample}.megablast blast/{wildcards.sample}.cleanmega")
        shell("{GIT_ROOT}/find_contaminations.py blast/{wildcards.sample}.cleanmega blast/{wildcards.sample}.contaminants")
        shell("python {GIT_ROOT}/remove_contaminations.py blast/{wildcards.sample}.contaminants {input.filtered_fasta} {output.filtered_fasta}")
    else:
        shell("cp {input.filtered_fasta} {output.filtered_fasta}")
```
```
shell:
    """
    {LONGRANGER} mkref {input.filtered_fasta}
    {LONGRANGER} align --localcores={THREADS} --localmem={MEMORY} --id=temp_{wildcards.sample} --reference={output.refdata} --fastqs={input.temp_dir}/
    {SAMTOOLS} view -b -F 12 temp_{wildcards.sample}/outs/possorted_bam.bam >{output.mapped_bam}
    """
```
```
shell:
    """
    {GIT_ROOT}/bxtools/bin/bxtools filter {input.mapped_bam} -s 0.2 -c 0.2 -q 50 >mapped/{wildcards.sample}.filtered.bam
    {GIT_ROOT}/bxtools/bin/bxtools split-by-ref mapped/{wildcards.sample}.filtered.bam -o {output.barcode_folder}
    """
```
```
shell:
    """
    mkdir small_bams_{wildcards.sample}
    {GIT_ROOT}/bxtools/bin/bxtools extract {input.sample} {input.barcode_folder} {output.small_bams}/
    """
```
```
shell:
    """
    mkdir -p {output.small_reads}
    function prepare_reads {{
        a="$(basename $1 | sed "s/\..*//")"
        if [ -d "{output.small_reads}_2/$a" ]; then
            mv {output.small_reads}_2/$a {output.small_reads}/$a
            return
        fi
        mkdir -p {output.small_reads}/$a
        {GIT_ROOT}/bxtools/bin/bxtools bamtofastq {input.small_bams}/$a.bam {output.small_reads}/$a/
    }}
    export -f prepare_reads
    parallel --jobs {THREADS} prepare_reads ::: {input.small_bams}/*
    """
```
```
shell:
    """
    mkdir -p {output.contigs}
    mkdir -p {output.assemblies_folder}
    function local_assembly {{
        a="$(basename $1 | sed "s/\..*q//")"
        if [ -f "{output.assemblies_folder}/$a/scaffolds.fasta" ]; then
            cp {output.assemblies_folder}/$a/scaffolds.fasta {output.contigs}/$a.fasta
            return
        fi
        {SPADES} --only-assembler -t 1 -m {MEMORY_PER_THREAD} -k {SPADES_K} --cov-cutoff 3 --pe1-1 {input.small_reads}/$a/$a\_R1.fastq --pe1-2 {input.small_reads}/$a/$a\_R2.fastq -o {output.assemblies_folder}/$a
        cp {output.assemblies_folder}/$a/scaffolds.fasta {output.contigs}/$a.fasta
        rm -rf {output.assemblies_folder}/$a/K*
    }}
    export -f local_assembly
    parallel --jobs {THREADS} local_assembly ::: {input.small_reads}/*
    """
```
```
shell:
    """
    mkdir -p {output.splitted_insertions}
    mkdir -p {output.prefilter_contigs}
    python {GIT_ROOT}/split_fasta.py {input.insertions} {output.splitted_insertions}
    mkdir -p quast_res
    mkdir -p {output.filtered_contigs}
    function filter_target_contigs {{
        a="$(basename $1 | sed "s/\..*//")"
        {GIT_ROOT}/contig_length_filter.py 500 {input.contigs}/$a.fasta {output.prefilter_contigs}/$a.fasta
        {QUAST} -t 1 --fast --min-contig 199 -R {output.prefilter_contigs}/$a.fasta {output.splitted_insertions}/$a.fasta -o quast_res/$a
        python {GIT_ROOT}/filter_correct_record.py {output.prefilter_contigs}/$a.fasta quast_res/$a/contigs_reports/all_alignments_$a.tsv {output.filtered_contigs}/$a.fasta
    }}
    export -f filter_target_contigs
    set +oe pipefail
    parallel --jobs {THREADS} filter_target_contigs ::: {input.contigs}/*
    cat {output.filtered_contigs}/*.fasta >{output.contigs}
    rm -rf quast_res
    rm -f core*
    """
```
```
shell:
    """
    {QUAST} --fast -R {GENOME} {input.contigs} -o quast_res
    cp quast_res/contigs_reports/all_alignments_final_set_*.tsv {output.final_alignments}
    """
```
```
shell:
    """
    python {GIT_ROOT}/minimap_to_vcf.py {input.final_alignments} {input.contigs} {output.final_vcf}
    """
```
```python
# Split a multi-fasta file into one file per record in the given directory.
import sys
from Bio import SeqIO
import os

directory = sys.argv[2]

try:
    os.stat(directory)
except:
    os.mkdir(directory)

current_index = 1
for record in SeqIO.parse(sys.argv[1], "fasta"):
    with open(directory + os.sep + str(current_index) + ".fasta", "w") as output_handle:
        SeqIO.write([record], output_handle, "fasta")
    current_index += 1
```
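Standalone, this mirrors the call in the filter_target_contigs rule above (the output directory is created if it does not exist; the names are placeholders):

```bash
# Writes 1.fasta, 2.fasta, ... into splitted_insertions/, one record per file.
python split_fasta.py insertions.fasta splitted_insertions
```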