A Snakemake workflow to correct genome annotation feature coordinates after polishing a reference genome with short reads from one or more sequencing runs, allowing for insertions and deletions. It also allows traversing a relationship tree of, e.g., mutant cell lines.
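For orientation, the central input is a pilon ".changes" file: one space-separated record per correction, giving the old and new coordinates plus the old and new bases (with "." standing for no bases), which is exactly the shape the script below reads with pandas. The following minimal sketch is purely illustrative and not part of the workflow; the contig name and positions are invented. It shows how one such record encodes the length difference that shifts downstream features.

# Illustrative only: contig name and positions are invented.
# Assumed record shape: "<old contig:coords> <new contig:coords> <old bases> <new bases>",
# with "." meaning no bases, so this record describes a 2 bp deletion.
record = "tig00001:1000-1001 tig00001_pilon:999 AT ."

old_coord, new_coord, old_bases, new_bases = record.split(" ")
old_len = 0 if old_bases == "." else len(old_bases)
new_len = 0 if new_bases == "." else len(new_bases)
difference = new_len - old_len  # -2: features downstream of position 1000 move 2 bp left

print(old_coord.split(":"), new_coord.split(":"), difference)
# ['tig00001', '1000-1001'] ['tig00001_pilon', '999'] -2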
Code Snippets
__author__ = "Christopher Schröder, Patrik Smeds"
__copyright__ = "Copyright 2020, Christopher Schröder, Patrik Smeds"
__email__ = "christopher.schroeder@tu-dortmund.de, patrik.smeds@gmail.com"
__license__ = "MIT"

from os import path

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=True, stderr=True)

# Check inputs/arguments.
if len(snakemake.input) == 0:
    raise ValueError("A reference genome has to be provided.")
elif len(snakemake.input) > 1:
    raise ValueError("Please provide exactly one reference genome as input.")

valid_suffixes = {".0123", ".amb", ".ann", ".bwt.2bit.64", ".pac"}


def get_valid_suffix(path):
    for suffix in valid_suffixes:
        if path.endswith(suffix):
            return suffix


prefixes = set()
for s in snakemake.output:
    suffix = get_valid_suffix(s)
    if suffix is None:
        raise ValueError(f"{s} cannot be generated by bwa-mem2 index (invalid suffix).")
    prefixes.add(s[: -len(suffix)])

if len(prefixes) != 1:
    raise ValueError("Output files must share common prefix up to their file endings.")

(prefix,) = prefixes

shell("bwa-mem2 index -p {prefix} {snakemake.input[0]} {log}")
__author__ = "Christopher Schröder, Johannes Köster, Julian de Ruiter"
__copyright__ = (
    "Copyright 2020, Christopher Schröder, Johannes Köster and Julian de Ruiter"
)
__email__ = "christopher.schroeder@tu-dortmund.de koester@jimmy.harvard.edu, julianderuiter@gmail.com"
__license__ = "MIT"

import tempfile
from os import path

from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts
from snakemake_wrapper_utils.samtools import get_samtools_opts

# Extract arguments.
extra = snakemake.params.get("extra", "")

log = snakemake.log_fmt_shell(stdout=False, stderr=True)

sort = snakemake.params.get("sort", "none")
sort_order = snakemake.params.get("sort_order", "coordinate")
sort_extra = snakemake.params.get("sort_extra", "")
samtools_opts = get_samtools_opts(snakemake)
java_opts = get_java_opts(snakemake)

index = snakemake.input.get("index", "")
if isinstance(index, str):
    index = path.splitext(snakemake.input.idx)[0]
else:
    index = path.splitext(snakemake.input.idx[0])[0]

# Check inputs/arguments.
if not isinstance(snakemake.input.reads, str) and len(snakemake.input.reads) not in {
    1,
    2,
}:
    raise ValueError("input must have 1 (single-end) or 2 (paired-end) elements")

if sort_order not in {"coordinate", "queryname"}:
    raise ValueError(f"Unexpected value for sort_order ({sort_order})")

# Determine which pipe command to use for converting to bam or sorting.
if sort == "none":
    # Simply convert to bam using samtools view.
    pipe_cmd = "samtools view {samtools_opts}"
elif sort == "samtools":
    # Sort alignments using samtools sort.
    pipe_cmd = "samtools sort {samtools_opts} {sort_extra} -T {tmpdir}"
    # Add name flag if needed.
    if sort_order == "queryname":
        sort_extra += " -n"
elif sort == "picard":
    # Sort alignments using picard SortSam.
    pipe_cmd = "picard SortSam {java_opts} {sort_extra} --INPUT /dev/stdin --TMP_DIR {tmpdir} --SORT_ORDER {sort_order} --OUTPUT {snakemake.output[0]}"
else:
    raise ValueError(f"Unexpected value for params.sort ({sort})")

with tempfile.TemporaryDirectory() as tmpdir:
    shell(
        "(bwa-mem2 mem"
        " -t {snakemake.threads}"
        " {extra}"
        " {index}"
        " {snakemake.input.reads}"
        " | " + pipe_cmd + ") {log}"
    )
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"

from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

# Samtools takes additional threads through its option -@
# One thread for samtools index
# Other threads are *additional* threads passed to the '-@' argument
threads = "" if snakemake.threads <= 1 else " -@ {} ".format(snakemake.threads - 1)

shell(
    "samtools index {threads} {extra} {snakemake.input[0]} {snakemake.output[0]} {log}"
)
from bisect import bisect_right
from collections.abc import Sequence
from typing import List, Union, Dict

import pandas as pd

import geffa

# Give the snakemake object a type...
snakemake: object

# Read the pilon changes file
df = pd.read_table(snakemake.input.changes, sep=' ', header=None, names=['old_coordinate', 'new_coordinate', 'old', 'new'])

# Split the coordinates into contig name and position
split = df.old_coordinate.str.split(':', expand=True)
df.loc[:, 'contig'] = split.iloc[:, 0]
df.loc[:, 'old_coordinate'] = split.iloc[:, 1]
split = df.new_coordinate.str.split(':', expand=True)
df.loc[:, 'new_coordinate'] = split.iloc[:, 1]


# Function to categorise each row (deletion, insertion or snp)
# Also calculates the change in length (negative for deletions, positive for insertions, zero for snps)
def categorize_change(row):
    lold = len(row.old)
    lnew = len(row.new)
    diff = lnew - lold
    if diff == 0:
        if lold == 1:
            category = 'snp'
        elif lold == 0:
            raise ValueError('No change!?!')
        else:
            category = 'indel'
    elif diff < 0:
        category = 'deletion'
    else:
        category = 'insertion'
    return pd.Series({'difference': diff, 'category': category})


# Incorporate the categorisation into the table
df = pd.concat([df, df.replace('.', '').apply(categorize_change, axis=1)], axis=1)
df.loc[:, 'old_coordinate_int'] = df.old_coordinate.map(lambda a: int(a.split('-')[0]))


# Class that takes a list of change coordinates and associated shifts and allows us to
# then transform old feature coordinates into new ones.
class PieceWiseConstantShift:
    def __init__(self, coordinates: Sequence[int], shifts: Sequence[int]) -> None:
        self.coordinates: list[int] = [0]
        self.shifts: list[int] = [0]
        self.cumshifts: list[int] = [0]
        # Calculate cumulative shifts
        for c, s in zip(coordinates, shifts):
            if s != 0:  # Filter out snps, they don't shift anything
                self.coordinates.append(c)
                self.shifts.append(s)
                self.cumshifts.append(self.cumshifts[-1] + s)

    # This takes a coordinate and shifts it according to the changes given above
    def __call__(self, x: Union[int, List[int]]) -> Union[int, List[int]]:
        if isinstance(x, Sequence):
            return [self(y) for y in x]
        # Find the index of the shift to the left of the given coordinate
        index = bisect_right(self.coordinates, x) - 1
        shift = self.shifts[index]
        cumshift = self.cumshifts[index]
        coordinate = self.coordinates[index]
        # If the old coordinate is within a deletion or insertion region, we need to shift it beyond
        if x < coordinate - shift:
            new = coordinate
        else:
            new = x
        new += cumshift
        try:
            # If we shifted beyond the next change, we need to correct for that
            if new > self.coordinates[index + 1]:
                new = self.coordinates[index + 1]
        except IndexError:
            # We were at the last entry, so nothing to see here.
            pass
        return new


# Build a translator for each contig
translations: Dict[str, PieceWiseConstantShift] = {
    contig: PieceWiseConstantShift(g.old_coordinate_int, g.difference)
    for contig, g in df.groupby('contig')
}

# TODO: Validate - mark incomplete CDS as pseudogenes!!!

# Load the parental GFF file and the new sequence fasta file
gff = geffa.GffFile(snakemake.input.gff, fasta_file=snakemake.input.fasta, ignore_unknown_feature_types=True)

for seqreg in gff.sequence_regions.values():  # Iterate over contigs
    try:
        # Load the correct translator
        trans = translations[seqreg.name]
    except KeyError:
        # If we have no reads in a contig, it won't show up in the translation table
        continue
    marked_for_deletion = []
    # Iterate over all features in the contig
    for feature in seqreg.node_registry.values():
        # Translate the start and end coordinates
        start, end = feature.start, feature.end
        feature.start, feature.end = trans((feature.start, feature.end))
        try:
            # Validate the feature with the new coordinates
            feature.validate()
        except Exception as e:
            # The new feature isn't correct!
            # Print some debug info
            print(feature)
            print(feature.sequence_region.sequence[feature.start - 5:feature.end + 5])
            print(f"{start}->{feature.start}, {end}->{feature.end}")
            msg = str(e)
            # Delete the feature if it's a feature that has been corrupted by an indel
            if ("SLAS feature needs to be of length 2" in msg) or ("__len__() should return >= 0" in msg):
                print("Deleting this feature because it's been destroyed by an indel.")
                marked_for_deletion.append(feature)
            elif "Exon parent needs to be an RNA" in msg:
                print("Invalid exon found - skipping because it's not that severe.")
            else:
                # Fail otherwise, don't know what to do.
                print(list(zip(trans.coordinates, trans.shifts, trans.cumshifts)))
                raise
    # Delete the features that were marked for deletion
    # (couldn't do that earlier, can't change a container while iterating over it)
    for feature in marked_for_deletion:
        feature.delete()

# Save the new annotation GFF
gff.save(snakemake.output.gff, include_sequences=True)
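As a quick sanity check of the translation logic, the toy snippet below exercises the PieceWiseConstantShift class defined in the script above with invented change coordinates. It assumes the class is already defined in the current session and is not part of the workflow itself.

# Toy illustration with invented coordinates: a contig with a 2 bp deletion
# recorded at old position 100 and a 1 bp insertion at old position 200.
shift = PieceWiseConstantShift(coordinates=[100, 200], shifts=[-2, 1])

print(shift(50))              # 50  -> upstream of any change, unshifted
print(shift(150))             # 148 -> downstream of the deletion, shifted left by 2
print(shift(250))             # 249 -> downstream of both changes, net shift of -1
print(shift([50, 150, 250]))  # [50, 148, 249] -> sequences are translated element-wise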
shell: """
    curl -o {output.fasta} "https://tritrypdb.org/common/downloads/release-{wildcards.version}/{wildcards.organism}/fasta/data/TriTrypDB-{wildcards.version}_{wildcards.organism}_Genome.fasta"
    """
shell: """
    cd {wildcards.path}
    # Prefetching followed by fasterq-dump is faster than fasterq-dump alone (not sure why?)
    prefetch SRR{wildcards.sraid}
    fasterq-dump --split-files SRR{wildcards.sraid}
    # Move to correct location
    mv SRR{wildcards.sraid}_1.fastq SRR{wildcards.sraid}_1.fq
    mv SRR{wildcards.sraid}_2.fastq SRR{wildcards.sraid}_2.fq
    """
shell: """
    gunzip < "{input}" > "{output}"
    """
shell: """
    run_rcorrector.pl -t 24 -1 {input.r1} -2 {input.r2} -od results/{wildcards.experiment}/
    """
run:
    from itertools import zip_longest

    def grouper(iterable, n, fillvalue=None):
        "Collect data into fixed-length chunks or blocks"
        # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
        args = [iter(iterable)] * n
        return zip_longest(fillvalue=fillvalue, *args)

    r1_cor_count = 0
    r2_cor_count = 0
    pair_cor_count = 0
    unfix_r1_count = 0
    unfix_r2_count = 0
    unfix_both_count = 0

    # Walk both FASTQ files in parallel, four lines (one record) at a time.
    with \
            open(input.r1, 'r') as r1_in, \
            open(input.r2, 'r') as r2_in, \
            open(output.r1, 'w') as r1_out, \
            open(output.r2, 'w') as r2_out:
        R1 = grouper(r1_in, 4)
        R2 = grouper(r2_in, 4)
        counter = 0
        for entry in R1:
            counter += 1
            if counter % 100000 == 0:
                print("%s reads processed" % counter)
            head1, seq1, placeholder1, qual1 = [i.strip() for i in entry]
            head2, seq2, placeholder2, qual2 = [j.strip() for j in next(R2)]
            # Drop pairs where rcorrector flagged either read as unfixable;
            # otherwise count corrections and write the pair back out.
            if 'unfixable' in head1 and 'unfixable' not in head2:
                unfix_r1_count += 1
            elif 'unfixable' in head2 and 'unfixable' not in head1:
                unfix_r2_count += 1
            elif 'unfixable' in head1 and 'unfixable' in head2:
                unfix_both_count += 1
            else:
                if 'cor' in head1:
                    r1_cor_count += 1
                if 'cor' in head2:
                    r2_cor_count += 1
                if 'cor' in head1 or 'cor' in head2:
                    pair_cor_count += 1
                r1_out.write('%s\n' % '\n'.join([head1, seq1, placeholder1, qual1]))
                r2_out.write('%s\n' % '\n'.join([head2, seq2, placeholder2, qual2]))

    # Summarise what was removed, retained and corrected.
    total_unfixable = unfix_r1_count + unfix_r2_count + unfix_both_count
    total_retained = counter - total_unfixable
    print('total PE reads:%s\nremoved PE reads:%s\nretained PE reads:%s\n'
          'R1 corrected:%s\nR2 corrected:%s\npairs corrected:%s\n'
          'R1 unfixable:%s\nR2 unfixable:%s\nboth reads unfixable:%s\n'
          % (counter, total_unfixable, total_retained,
             r1_cor_count, r2_cor_count, pair_cor_count,
             unfix_r1_count, unfix_r2_count, unfix_both_count))
shell: """
    trim_galore --paired --phred33 --length 36 -q 5 --stringency 1 -e 0.1 -j 8 --no_report_file --dont_gzip --output_dir 'results/{wildcards.experiment}' --basename '{wildcards.sample_id}' '{input.r1}' '{input.r2}'
    """
wrapper:
    "v1.32.0/bio/bwa-mem2/index"
wrapper:
    "v1.32.0/bio/bwa-mem2/mem"
wrapper:
    "v1.32.0/bio/samtools/index"
shell: """
    pilon -Xmx20G --genome {input.genome} --output results/{wildcards.experiment}/genome.polish --changes --fix snps,indels `for frag in {input.frags}; do echo -n "--frags $frag "; done;`
    """
shell: "sed 's/_pilon//' {input} > {output}"
script: "scripts/shift_gff_coordinates.py"
Created: 1yr ago
Updated: 1yr ago
Maintainers: public
URL: https://github.com/Wheeler-Lab/InDelCoordinateCorrection
Name: indelcoordinatecorrection
Version: 1
Downloaded: 0
Copyright: Public Domain
License: MIT License
Keywords: none listed