Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
Snakemake workflow for flattening annotations and creating merged annotations in the case of spike-ins.
Code Snippets
8 9 | shell: "cat {input.ref} {input.spike} > {output.cref}" |
17 18 | shell: "cat {input.ref} {input.spike} > {output.cref}" |
11 12 | script: "../scripts/dexseq_prepare_annotation.py" |
26 27 28 29 30 | shell: """ chmod +x {params.shellscript} {params.shellscript} {input} {output} """ |
11 12 | wrapper: "v2.1.1/bio/samtools/sort" |
28 29 30 31 32 33 | shell: """ htseq-count -t aggregate_gene -m intersection-strict -s {params.strand} \ -r pos -p bam --add-chromosome-info -i gene_id \ -c {output.counts} {input.bam} {input.gtf} """ |
49 50 51 52 53 54 | shell: """ htseq-count -t exonic_part -m union -s {params.strand} \ -r pos -p bam --add-chromosome-info -i exon_id \ -c {output.counts} {input.bam} {input.gtf} """ |
69 70 71 72 73 74 | shell: """ htseq-count -t exonic_part -m intersection-strict -s {params.strand} \ -r pos -p bam --add-chromosome-info -i gene_id \ -c {output.counts} {input.bam} {input.gtf} """ |
11 12 | wrapper: "v1.21.4/bio/reference/ensembl-sequence" |
26 27 | wrapper: "v1.21.4/bio/reference/ensembl-annotation" |
40 41 42 43 44 | shell: """ chmod +x {params.shellscript} {params.shellscript} {input} {output} """ |
56 57 58 59 60 | shell: """ chmod +x {params.shellscript} {params.shellscript} {input} {output.real_o} """ |
13 14 | wrapper: "v1.21.4/bio/reference/ensembl-sequence" |
28 29 | wrapper: "v1.21.4/bio/reference/ensembl-annotation" |
43 44 45 46 47 | shell: """ chmod +x {params.shellscript} {params.shellscript} {input} {output} {params.species} """ |
61 62 63 64 65 | shell: """ chmod +x {params.shellscript} {params.shellscript} {input} {output.temp_o} {output.real_o} {params.species} """ |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 | import sys, collections, itertools, os.path, optparse args = [snakemake.input.gtf, snakemake.output.flatgtf] if len( args ) != 2: sys.stderr.write( "Script to prepare annotation for DEXSeq.\n\n" ) sys.stderr.write( "Usage: python %s <in.gtf> <out.gff>\n\n" % os.path.basename(sys.argv[0]) ) sys.stderr.write( "This script takes an annotation file in Ensembl GTF format\n" ) sys.stderr.write( "and outputs a 'flattened' annotation file suitable for use\n" ) sys.stderr.write( "with the count_in_exons.py script.\n" ) sys.exit(1) try: import HTSeq except ImportError: sys.stderr.write( "Could not import HTSeq. Please install the HTSeq Python framework\n" ) sys.stderr.write( "available from http://www-huber.embl.de/users/anders/HTSeq\n" ) sys.exit(1) gtf_file = args[0] out_file = args[1] aggregateGenes = True # Step 1: Store all exons with their gene and transcript ID # in a GenomicArrayOfSets exons = HTSeq.GenomicArrayOfSets( "auto", stranded=True ) for f in HTSeq.GFF_Reader( gtf_file ): if f.type != "exon": continue f.attr['gene_id'] = f.attr['gene_id'].replace( ":", "_" ) exons[f.iv] += ( f.attr['gene_id'], f.attr['transcript_id'] ) # Step 2: Form sets of overlapping genes # We produce the dict 'gene_sets', whose values are sets of gene IDs. Each set # contains IDs of genes that overlap, i.e., share bases (on the same strand). # The keys of 'gene_sets' are the IDs of all genes, and each key refers to # the set that contains the gene. # Each gene set forms an 'aggregate gene'. if aggregateGenes == True: gene_sets = collections.defaultdict( lambda: set() ) for iv, s in exons.steps(): # For each step, make a set, 'full_set' of all the gene IDs occuring # in the present step, and also add all those gene IDs, whch have been # seen earlier to co-occur with each of the currently present gene IDs. full_set = set() for gene_id, transcript_id in s: full_set.add( gene_id ) full_set |= gene_sets[ gene_id ] # Make sure that all genes that are now in full_set get associated # with full_set, i.e., get to know about their new partners for gene_id in full_set: assert gene_sets[ gene_id ] <= full_set gene_sets[ gene_id ] = full_set # Step 3: Go through the steps again to get the exonic sections. Each step # becomes an 'exonic part'. The exonic part is associated with an # aggregate gene, i.e., a gene set as determined in the previous step, # and a transcript set, containing all transcripts that occur in the step. # The results are stored in the dict 'aggregates', which contains, for each # aggregate ID, a list of all its exonic_part features. aggregates = collections.defaultdict( lambda: list() ) for iv, s in exons.steps( ): # Skip empty steps if len(s) == 0: continue gene_id = list(s)[0][0] ## if aggregateGenes=FALSE, ignore the exons associated to more than one gene ID if aggregateGenes == False: check_set = set() for geneID, transcript_id in s: check_set.add( geneID ) if( len( check_set ) > 1 ): continue else: aggregate_id = gene_id # Take one of the gene IDs, find the others via gene sets, and # form the aggregate ID from all of them else: assert set( gene_id for gene_id, transcript_id in s ) <= gene_sets[ gene_id ] aggregate_id = '+'.join( gene_sets[ gene_id ] ) # Make the feature and store it in 'aggregates' f = HTSeq.GenomicFeature( aggregate_id, "exonic_part", iv ) f.source = os.path.basename( sys.argv[0] ) # f.source = "camara" f.attr = {} f.attr[ 'gene_id' ] = aggregate_id transcript_set = set( ( transcript_id for gene_id, transcript_id in s ) ) f.attr[ 'transcripts' ] = '+'.join( transcript_set ) aggregates[ aggregate_id ].append( f ) # Step 4: For each aggregate, number the exonic parts aggregate_features = [] for l in list(aggregates.values()): for i in range( len(l)-1 ): assert l[i].name == l[i+1].name, str(l[i+1]) + " has wrong name" assert l[i].iv.end <= l[i+1].iv.start, str(l[i+1]) + " starts too early" if l[i].iv.chrom != l[i+1].iv.chrom: raise ValueError("Same name found on two chromosomes: %s, %s" % ( str(l[i]), str(l[i+1]) )) if l[i].iv.strand != l[i+1].iv.strand: raise ValueError("Same name found on two strands: %s, %s" % ( str(l[i]), str(l[i+1]) )) aggr_feat = HTSeq.GenomicFeature( l[0].name, "aggregate_gene", HTSeq.GenomicInterval( l[0].iv.chrom, l[0].iv.start, l[-1].iv.end, l[0].iv.strand ) ) aggr_feat.source = os.path.basename( sys.argv[0] ) aggr_feat.attr = { 'gene_id': aggr_feat.name } for i in range( len(l) ): l[i].attr['exonic_part_number'] = "%03d" % ( i+1 ) aggregate_features.append( aggr_feat ) # Step 5: Sort the aggregates, then write everything out aggregate_features.sort( key = lambda f: ( f.iv.chrom, f.iv.start ) ) fout = open( out_file, "w" ) for aggr_feat in aggregate_features: fout.write( aggr_feat.get_gff_line() ) for f in aggregates[ aggr_feat.name ]: fout.write( f.get_gff_line() ) fout.close() |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 | __author__ = "Johannes Köster" __copyright__ = "Copyright 2019, Johannes Köster" __email__ = "johannes.koester@uni-due.de" __license__ = "MIT" import subprocess import sys from pathlib import Path from snakemake.shell import shell log = snakemake.log_fmt_shell(stdout=False, stderr=True) species = snakemake.params.species.lower() build = snakemake.params.build release = int(snakemake.params.release) out_fmt = Path(snakemake.output[0]).suffixes out_gz = (out_fmt.pop() and True) if out_fmt[-1] == ".gz" else False out_fmt = out_fmt.pop().lstrip(".") branch = "" if release >= 81 and build == "GRCh37": # use the special grch37 branch for new releases branch = "grch37/" elif snakemake.params.get("branch"): branch = snakemake.params.branch + "/" flavor = snakemake.params.get("flavor", "") if flavor: flavor += "." suffix = "" if out_fmt == "gtf": suffix = "gtf.gz" elif out_fmt == "gff3": suffix = "gff3.gz" else: raise ValueError( "invalid format specified. Only 'gtf[.gz]' and 'gff3[.gz]' are currently supported." ) url = "ftp://ftp.ensembl.org/pub/{branch}release-{release}/{out_fmt}/{species}/{species_cap}.{build}.{release}.{flavor}{suffix}".format( release=release, build=build, species=species, out_fmt=out_fmt, species_cap=species.capitalize(), suffix=suffix, flavor=flavor, branch=branch, ) try: if out_gz: shell("curl -L {url} > {snakemake.output[0]} {log}") else: shell("(curl -L {url} | gzip -d > {snakemake.output[0]}) {log}") except subprocess.CalledProcessError as e: if snakemake.log: sys.stderr = open(snakemake.log[0], "a") print( "Unable to download annotation data from Ensembl. " "Did you check that this combination of species, build, and release is actually provided?", file=sys.stderr, ) exit(1) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 | __author__ = "Johannes Köster" __copyright__ = "Copyright 2019, Johannes Köster" __email__ = "johannes.koester@uni-due.de" __license__ = "MIT" import subprocess as sp import sys from itertools import product from snakemake.shell import shell species = snakemake.params.species.lower() release = int(snakemake.params.release) build = snakemake.params.build branch = "" if release >= 81 and build == "GRCh37": # use the special grch37 branch for new releases branch = "grch37/" elif snakemake.params.get("branch"): branch = snakemake.params.branch + "/" log = snakemake.log_fmt_shell(stdout=False, stderr=True) spec = ("{build}" if int(release) > 75 else "{build}.{release}").format( build=build, release=release ) suffixes = "" datatype = snakemake.params.get("datatype", "") chromosome = snakemake.params.get("chromosome", "") if datatype == "dna": if chromosome: suffixes = ["dna.chromosome.{}.fa.gz".format(chromosome)] else: suffixes = ["dna.primary_assembly.fa.gz", "dna.toplevel.fa.gz"] elif datatype == "cdna": suffixes = ["cdna.all.fa.gz"] elif datatype == "cds": suffixes = ["cds.all.fa.gz"] elif datatype == "ncrna": suffixes = ["ncrna.fa.gz"] elif datatype == "pep": suffixes = ["pep.all.fa.gz"] else: raise ValueError("invalid datatype, must be one of dna, cdna, cds, ncrna, pep") if chromosome: if not datatype == "dna": raise ValueError( "invalid datatype, to select a single chromosome the datatype must be dna" ) spec = spec.format(build=build, release=release) url_prefix = f"ftp://ftp.ensembl.org/pub/{branch}release-{release}/fasta/{species}/{datatype}/{species.capitalize()}.{spec}" success = False for suffix in suffixes: url = f"{url_prefix}.{suffix}" try: shell("curl -sSf {url} > /dev/null 2> /dev/null") except sp.CalledProcessError: continue shell("(curl -L {url} | gzip -d > {snakemake.output[0]}) {log}") success = True break if not success: if len(suffixes) > 1: url = f"{url_prefix}.[{'|'.join(suffixes)}]" else: url = f"{url_prefix}.{suffixes[0]}" print( f"Unable to download requested sequence data from Ensembl ({url}). " "Please check whether above URL is currently available (might be a temporal server issue). " "Apart from that, did you check that this combination of species, build, and release is actually provided?", file=sys.stderr, ) exit(1) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 | __author__ = "Johannes Köster" __copyright__ = "Copyright 2016, Johannes Köster" __email__ = "koester@jimmy.harvard.edu" __license__ = "MIT" import tempfile from pathlib import Path from snakemake.shell import shell from snakemake_wrapper_utils.snakemake import get_mem from snakemake_wrapper_utils.samtools import get_samtools_opts samtools_opts = get_samtools_opts(snakemake) extra = snakemake.params.get("extra", "") log = snakemake.log_fmt_shell(stdout=True, stderr=True) mem_per_thread_mb = int(get_mem(snakemake) / snakemake.threads) with tempfile.TemporaryDirectory() as tmpdir: tmp_prefix = Path(tmpdir) / "samtools_sort" shell( "samtools sort {samtools_opts} -m {mem_per_thread_mb}M {extra} -T {tmp_prefix} {snakemake.input[0]} {log}" ) |
Support
Do you know this workflow well? If so, you can
request seller status , and start supporting this workflow.
Created: 1yr ago
Updated: 1yr ago
Maitainers:
public
URL:
https://github.com/isaacvock/Flatstacks
Name:
flatstacks
Version:
1
Downloaded:
0
Copyright:
Public Domain
License:
None
Keywords:
- Future updates
Related Workflows

ENCODE pipeline for histone marks developed for the psychENCODE project
psychip pipeline is an improved version of the ENCODE pipeline for histone marks developed for the psychENCODE project.
The o...

Near-real time tracking of SARS-CoV-2 in Connecticut
Repository containing scripts to perform near-real time tracking of SARS-CoV-2 in Connecticut using genomic data. This pipeli...

snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...

ATLAS - Three commands to start analyzing your metagenome data
Metagenome-atlas is a easy-to-use metagenomic pipeline based on snakemake. It handles all steps from QC, Assembly, Binning, t...
raw sequence reads
Genome assembly
Annotation track
checkm2
gunc
prodigal
snakemake-wrapper-utils
MEGAHIT
Atlas
BBMap
Biopython
BioRuby
Bwa-mem2
cd-hit
CheckM
DAS
Diamond
eggNOG-mapper v2
MetaBAT 2
Minimap2
MMseqs
MultiQC
Pandas
Picard
pyfastx
SAMtools
SemiBin
Snakemake
SPAdes
SqueezeMeta
TADpole
VAMB
CONCOCT
ete3
gtdbtk
h5py
networkx
numpy
plotly
psutil
utils
metagenomics

RNA-seq workflow using STAR and DESeq2
This workflow performs a differential gene expression analysis with STAR and Deseq2. The usage of this workflow is described ...

This Snakemake pipeline implements the GATK best-practices workflow
This Snakemake pipeline implements the GATK best-practices workflow for calling small germline variants. The usage of thi...