Using RNA-Seq data to improve microRNA target prediction accuracy in animals
FilTar is a tool that integrates RNA-Seq data into pre-existing miRNA target prediction workflows in order to increase prediction accuracy.
It achieves this by:

- Removing transcripts which are not expressed, or are poorly expressed, in a given cell type or tissue
- Generating 3'UTR annotations specific to a given cell type or tissue
It also operates as a fully functional wrapper around the pre-existing TargetScan7 and miRanda target prediction workflows.
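To illustrate the first point, here is a minimal sketch of expression-based filtering of target predictions. This is not FilTar's actual implementation (FilTar performs this filtering inside its Snakemake rules, e.g. via `filtar::filter_miRanda_scores`); the file layout and threshold below are assumptions for illustration, although the `Name` and `TPM` column names do match the salmon quantification tables used by the workflow.

```python
# Illustrative sketch only: keep target predictions whose transcript passes a
# TPM expression threshold. Assumes a tab-separated quantification file with
# 'Name' and 'TPM' columns (as produced by salmon).
import csv

TPM_THRESHOLD = 1.0  # illustrative cut-off; FilTar exposes this as a parameter


def load_expressed_transcripts(quant_tsv, threshold=TPM_THRESHOLD):
    """Return the set of transcript IDs whose TPM meets the threshold."""
    expressed = set()
    with open(quant_tsv) as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            if float(row["TPM"]) >= threshold:
                expressed.add(row["Name"])
    return expressed


def filter_predictions(predictions, expressed):
    """Keep only (transcript_id, mirna_id, score) records for expressed transcripts."""
    return [p for p in predictions if p[0] in expressed]
```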
Installation
Instructions on how to install FilTar can be found at the following location: https://tbradley27.github.io/FilTar/
Basic Usage
FilTar can be used by following two steps:

1. Specify the options you would like to use to run FilTar by editing `config/basic.yaml`.
2. Run the following command:

```
snakemake --use-conda --cores $N target_predictions.txt
```

After running the command, all target predictions are contained inside `target_predictions.txt`.
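As a quick sanity check before launching a run, the configuration can also be inspected programmatically. The sketch below is illustrative only: the key names (`ensembl_release`, `mirnas`, `transcripts`) appear as `snakemake.config` entries in the workflow's scripts, but the snippet is not an authoritative schema; consult the full documentation for the complete set of options.

```python
# Illustrative only: print a few config/basic.yaml entries before running.
# Key names are taken from snakemake.config references in the workflow's
# scripts; see https://tbradley27.github.io/FilTar/ for the full schema.
import yaml

with open("config/basic.yaml") as fh:
    config = yaml.safe_load(fh)

print("Ensembl release:", config.get("ensembl_release"))
# In the workflow's filtering scripts, an empty 'mirnas' or 'transcripts'
# list means "use all records".
print("miRNAs:", config.get("mirnas") or "all (empty list)")
print("Transcripts:", config.get("transcripts") or "all (empty list)")
```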
The following video presents a concise demonstration of basic FilTar usage:
https://www.youtube.com/watch?v=Xhl-nsg7_xo
More detailed instructions can be found in the full documentation: https://tbradley27.github.io/FilTar/
Publication
The article describing FilTar was published in Bioinformatics (Oxford University Press), Volume 36, Issue 8, pages 2410-2416. An online, open-access version of the article is available at https://doi.org/10.1093/bioinformatics/btaa007
DOI: 10.1093/bioinformatics/btaa007
PMCID: PMC7178423
PMID: 31930382
The recommended way to cite the article is as follows:
Thomas Bradley, Simon Moxon, FilTar: using RNA-Seq data to improve microRNA target prediction accuracy in animals, Bioinformatics, Volume 36, Issue 8, 15 April 2020, Pages 2410–2416, https://doi.org/10.1093/bioinformatics/btaa007
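For convenience, an equivalent BibTeX entry is given below. It is hand-assembled from the details above, so please double-check the fields against the publisher's record:

```bibtex
@article{bradley2020filtar,
  author  = {Bradley, Thomas and Moxon, Simon},
  title   = {FilTar: using RNA-Seq data to improve microRNA target prediction accuracy in animals},
  journal = {Bioinformatics},
  volume  = {36},
  number  = {8},
  pages   = {2410--2416},
  year    = {2020},
  doi     = {10.1093/bioinformatics/btaa007}
}
```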
Getting Help
To ensure your enquiry is seen by as many people as possible who may share your problem, it is best to raise it publicly. The first port of call is the Biostars bioinformatics forum (https://www.biostars.org/). If using Biostars, please add the 'filtar' tag to your question so that I am notified and can answer promptly.
If this option doesn't work for whatever reason, I also accept correspondence via my academic email address: thomas.bradley@uea.ac.uk
Reporting bugs, suggesting enhancements or raising other issues
The issues page of this repository is the best place to post these.
Contributions and Acknowledgements
Simon Moxon came up with the original idea and project proposal for FilTar. The FilTar concept was then extended and developed by Simon Moxon and Thomas Bradley over the course of the latter's BBSRC (Biotechnology and Biological Sciences Research Council) PhD studentship on the Norwich Research Park (NRP) Bioscience Doctoral Training Partnership (DTP) programme. Thomas Bradley worked under the primary supervision of Simon Moxon, initially predominantly at the Earlham Institute, and later predominantly at the School of Biological Sciences, University of East Anglia.
Full acknowledgements can be found in the preprint of the article associated with FilTar.
Code Snippets
```python
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"

import os

from snakemake.shell import shell

prefix = os.path.splitext(snakemake.output[0])[0]

shell(
    "samtools sort {snakemake.params} -@ {snakemake.threads} -o {snakemake.output[0]} "
    "-T {prefix} {snakemake.input[0]}")
```
```python
import sys
import os

# Download FASTQ files from the ENA FTP server, falling back to the
# alternative directory layout if the first URL fails.
if 'single_end' in snakemake.output[0]:
    os.system("wget -nv --directory-prefix=data/single_end/ ftp://ftp.sra.ebi.ac.uk/vol1/fastq/{}/00{}/{}/{}.fastq.gz || wget -nv --directory-prefix=data/single_end/ ftp://ftp.sra.ebi.ac.uk/vol1/fastq/{}/{}/{}.fastq.gz".format(snakemake.wildcards['accession'][0:6], snakemake.wildcards['accession'][-1], snakemake.wildcards['accession'], snakemake.wildcards['accession'], snakemake.wildcards['accession'][0:6], snakemake.wildcards['accession'], snakemake.wildcards['accession']))
elif 'paired_end' in snakemake.output[0]:
    os.system("wget -nv --directory-prefix=data/paired_end/ ftp://ftp.sra.ebi.ac.uk/vol1/fastq/{}/00{}/{}/{}_{}.fastq.gz || wget -nv --directory-prefix=data/paired_end/ ftp://ftp.sra.ebi.ac.uk/vol1/fastq/{}/{}/{}_{}.fastq.gz".format(snakemake.wildcards['accession'][0:6], snakemake.wildcards['accession'][-1], snakemake.wildcards['accession'], snakemake.wildcards['accession'], snakemake.wildcards['mate_number'], snakemake.wildcards['accession'][0:6], snakemake.wildcards['accession'], snakemake.wildcards['accession'], snakemake.wildcards['mate_number']))
```
```
script: "download_fastq.py"
```

```
script: "download_fastq.py"
```
```
shell: "rsync -av rsync://ftp.ensembl.org/ensembl/pub/release-{config[ensembl_release]}/gtf/{params}/{wildcards.genus_species}.{wildcards.build}.{config[ensembl_release]}.chr.gtf.gz data/ && gunzip data/{wildcards.genus_species}.{wildcards.build}.{config[ensembl_release]}.chr.gtf.gz && sed 's/^chr//g' data/{wildcards.genus_species}.{wildcards.build}.{config[ensembl_release]}.chr.gtf > tmp && mv tmp data/{wildcards.genus_species}.{wildcards.build}.{config[ensembl_release]}.chr.gtf"
```
```
shell: "sed -r 's/^MT//g' {input} | sed '/^\t/d' > {output}" # delete mitochondrial records and records not assigned to a chromosome

rule get_transcript_ids:
    input:
        gtf=get_gtf_file
    output:
        "results/bed/{species}_chr{chrom}_all_transcripts.txt"
```

```
shell: "scripts/get_all_transcripts.sh {input} {wildcards.chrom} > {output}"
```
```
shell: "rsync -av rsync://ftp.ensembl.org/ensembl/pub/release-{config[ensembl_release]}/fasta/{params}/dna/{wildcards.genus_species}.{wildcards.build}.dna.toplevel.fa.gz data/"
```

```
shell: "rsync -av rsync://ftp.ensembl.org/ensembl/pub/release-{config[ensembl_release]}/fasta/{params}/dna/{wildcards.genus_species}.{wildcards.build}.dna.primary_assembly.fa.gz data/"
```

```
shell: "rsync -av rsync://ftp.ensembl.org/ensembl/pub/release-{config[ensembl_release]}/fasta/{params}/dna/{wildcards.genus_species}.{wildcards.build}.dna.chromosome.{wildcards.chrom}.fa.gz data/"
```

```
shell: "pigz -p {threads} -d {input}"
```
```
shell: "wget --directory-prefix=data/maf_hsa/ -nv http://hgdownload.soe.ucsc.edu/goldenPath/hg38/multiz100way/maf/chr{wildcards.chrom}.maf.gz && gunzip {output}.gz"
```

```
shell: "wget --directory-prefix=data/maf_mmu/ -nv http://hgdownload.cse.ucsc.edu/goldenPath/mm10/multiz60way/maf/chr{wildcards.chrom}.maf.gz && gunzip {output}.gz"
```
```
shell: "rsync -av rsync://ftp.ensembl.org/ensembl/pub/release-{config[ensembl_release]}/fasta/{params}/cdna/{wildcards.genus_species}.{wildcards.build}.cdna.all.fa.gz data"
```

```
shell: "gunzip {input}"
```

```
shell: 'wget https://sourceforge.net/projects/apatrap/files/APAtrap_Linux.zip/download && mv download scripts && unzip -d scripts/ scripts/download'
```
```
shell: "fasterq-dump -O data/single_end/ --threads {threads} {wildcards.accession}"
```

```
shell: "fasterq-dump -O data/paired_end/ --threads {threads} {wildcards.accession}"
```

```
shell: "gzip {input}"
```

```
shell: "gzip {input.file1} {input.file2}"
```
```r
library(tidyverse)

# load in target predictions
target_predictions = readr::read_tsv(snakemake@input[['targets']])

utr_genomic_positions = readr::read_tsv(snakemake@input[['bed_file']], col_names=c('chromosome','start','stop','strand','transcript'), col_types='ciicc')
utr_genomic_positions = utr_genomic_positions[!duplicated(utr_genomic_positions$transcript),]

print(target_predictions)
print(utr_genomic_positions)

target_predictions = dplyr::inner_join(target_predictions, utr_genomic_positions, by=c(`Gene ID`='transcript'))

target_predictions$genomic_start = target_predictions$start + target_predictions$`UTR start`
target_predictions$genomic_end = target_predictions$start + target_predictions$`UTR end`

print(target_predictions %>% select(`Gene ID`,`Mirbase ID`,`UTR start`,`UTR end`,start,stop,genomic_start,genomic_end), width=Inf)

write.table(x=target_predictions, file=snakemake@output[[1]], sep='\t', quote=FALSE, col.names=TRUE, row.names=FALSE)
```
```
script: 'get_target_genomic_coordinates.R'
```
```python
import sys
import functools
from functools import reduce
import re
import os
from Bio import AlignIO
from Bio.AlignIO import MafIO
from subprocess import call

def add_alignment():
    global start_pos
    global end_pos
    if strand == -1:
        start_pos = start_pos[::-1]  # reverse order
        end_pos = end_pos[::-1]
    else:
        pass
    new_multiple_alignment = idx.get_spliced(start_pos, end_pos, strand)  # splice through the index
    for i in range(len(new_multiple_alignment)):
        new_multiple_alignment[i].id = re.sub('\..*', '', new_multiple_alignment[i].id)  # strip chromosome information from id
        new_multiple_alignment[i].id = accession + '\t' + new_multiple_alignment[i].id  # label all alignments with the relevant transcript accession
    global big_alignment
    big_alignment += new_multiple_alignment.format('fasta')
    return()

if os.stat(snakemake.input['bed']).st_size == 0:
    target = open(snakemake.output[0], 'w')
else:
    if snakemake.wildcards['species'] == "hsa":  # identify the species identifier passed through the command line
        build = "hg38"
    elif snakemake.wildcards['species'] == "mmu":
        build = "mm10"
    else:
        build = ''

    idx = AlignIO.MafIO.MafIndex(snakemake.input['maf_index'], snakemake.input['maf'], "{}.chr{}".format(build, snakemake.wildcards['chrom']))

    start_pos = []
    end_pos = []
    accession = 'empty'

    with open(snakemake.input['bed']) as f:
        big_alignment = ''
        for line in f:  # open and loop through the relevant BED file line by line
            parts = line.split()
            if accession == 'empty':
                accession = parts[4]
                start_pos.append(int(parts[1]))
                end_pos.append(int(parts[2]))
                strand = int(parts[3])
            elif parts[4] == accession:  # if the accession has multiple entries in the BED file, add the additional genome co-ordinates to the relevant lists
                start_pos.append(int(parts[1]))
                end_pos.append(int(parts[2]))
                strand = int(parts[3])
            else:
                add_alignment()
                start_pos = []  # initialise a new transcript record
                end_pos = []
                start_pos.append(int(parts[1]))
                end_pos.append(int(parts[2]))
                strand = int(parts[3])
                accession = parts[4]
        else:
            add_alignment()

    result = reduce(lambda x, y: x.replace(y, snakemake.config["TaxID"][y]), snakemake.config["TaxID"], big_alignment)  # get NCBI taxonomic IDs
    result = result.replace('\n', '')   # convert from fasta to tsv
    result = result.replace('>', '\n')  # convert from fasta to tsv
    result = re.sub('(\s[0-9]{4,7})', r'\1\t', result)  # convert from fasta to tsv
    result = re.sub('\n', '', result, count=1)  # remove leading empty line

    target = open(snakemake.output[0], 'w')

    for line in iter(result.splitlines()):
        pattern = re.compile('\.[0-9][0-9]?\sN+')
        pattern2 = re.compile('T[0-9]+\sN+')
        pattern3 = re.compile('^N+$')  # when the reference transcript is unknown, trailing lines of Ns are left which need to be removed
        if 'delete' in line:
            pass
        elif 'unknown' in line:
            pass
        elif pattern.search(line):
            pass
        elif pattern2.search(line):
            pass
        elif pattern3.search(line):
            pass
        else:
            line2 = line + '\n'
            target.write(line2)
```
```python
import sys
import re
from Bio import AlignIO
from Bio.AlignIO import MafIO

if snakemake.wildcards['species'] == "hsa":
    build = "hg38"
elif snakemake.wildcards['species'] == "mmu":
    build = "mm10"
else:
    build = ''

idx = AlignIO.MafIO.MafIndex(snakemake.output[0], snakemake.input[0], "{}.chr{}".format(build, snakemake.wildcards['chrom']))
```
```
script: "biopython_maf_processing.py"
```

```
script: "biopython_maf_processing2.py"
```
```r
if ( file.info(snakemake@input[[1]])$size == 0 ) {
  file.copy(from=snakemake@input[[1]], to=snakemake@output[[1]])
} else {
  output = filtar::MergeFasta(snakemake@input[[1]])
  write.table(output, snakemake@output[[1]], quote=FALSE, sep="\t", col.names=FALSE, row.names=FALSE)
}
```
```
shell: "scripts/get_bedtools_bed.sh {input} {output}"
```

```
shell: "bedtools getfasta -name -s -fi {input.dna} -bed {input.bed} -fo {output}"
```

```
script: "merge_fasta.R"
```

```
shell: "scripts/convert_fa_to_tsv2.sh {input} {params} {output}"
```
```python
from Bio import SeqIO

records = list(SeqIO.parse(snakemake.input[0], 'fasta'))

filtered_records = []

for record in records:
    if record.id in snakemake.config['mirnas']:
        filtered_records.append(record)
    elif len(snakemake.config['mirnas']) == 0:  # use all records if the mirna config entry is empty
        filtered_records.append(record)
    else:
        pass

SeqIO.write(filtered_records, snakemake.output[0], 'fasta')
```
```
shell: "grep -A 1 {wildcards.species} {input} | awk '{{ print $1 }}' | sed 's/--//g' | sed '/^$/d' > {output}"
```

```
script: "filter_mirna_fasta.py"
```
```r
united_quant = readr::read_tsv(
  file=snakemake@input[[1]],
  col_types=readr::cols(.default = 'd', Name = 'c')
)

united_quant2 = filtar::AvgSalmonQuant(united_quant)

avg_quant = united_quant2[,c('Name','avg')]
colnames(avg_quant) = c('Name','TPM')

write.table(
  x=avg_quant,
  file=snakemake@output[[1]],
  row.names=FALSE,
  col.names=TRUE,
  quote=FALSE,
  sep="\t"
)
```
```r
united_quant = readr::read_tsv(
  file=snakemake@input[[1]],
  col_types=readr::cols(.default = 'd', Name = 'c')
)

united_quant2 = filtar::AvgSalmonQuant(united_quant)

avg_quant = tibble::tibble(Name=united_quant2$Name, Length=20, EffectiveLength=20.00, TPM=united_quant2$avg, NumReads=20.00)

real_output = paste(snakemake@output[[1]], 'quant.sf', sep="/")
dir.create(snakemake@output[[1]])

write.table(
  x=avg_quant,
  file=real_output,
  row.names=FALSE,
  col.names=TRUE,
  quote=FALSE,
  sep="\t"
)
```
```python
import sys
import os

if len(snakemake.input) == 2:
    os.system('salmon quant --validateMappings --rangeFactorizationBins 4 --seqBias --posBias -p {} -i {} -l A -r {} -o {}'.format(snakemake.threads, snakemake.input['index'], snakemake.input['reads'][0], snakemake.output))
else:
    os.system('salmon quant --validateMappings --rangeFactorizationBins 4 --seqBias --posBias -p {} -i {} -l A -1 {} -2 {} -o {}'.format(snakemake.threads, snakemake.input['index'], snakemake.input['reads'][0], snakemake.input['reads'][1], snakemake.output))
```
```
shell: "salmon index --threads {threads} -t {input} -i {output} --type quasi -k 31"
```

```
shell: "salmon index --threads {threads} -t {input} -i {output} --type quasi -k 31"
```

```
script: "quant_salmon.py"
```

```
script: "quant_salmon.py"
```
```
shell: "grep 'expected' {input}/lib_format_counts.json | awk '{{print $2}}' | sed 's/\"//g' | sed 's/,//g' > {output}"
```

```
shell: "salmon quantmerge --quants {input} --names {input} -o {output}"
```

```
script: "get_average_quant.R"
```

```
shell: "salmon quantmerge --quants {input} --names {input} -o {output}"
```

```
script: "get_average_quant2.R"
```
```r
data = readr::read_tsv(snakemake@input[[1]], col_names=FALSE)

data = tidyr::separate(data, X5, into=c('X5.1','X5.2'), remove=TRUE)
data = tidyr::separate(data, X6, into=c('X6.1','X6.2'), remove=TRUE)

colnames(data) = c('miRNA_ID','transcript_ID','score','energy(kCal/Mol)','miRNA_start','miRNA_end','3UTR_start','3UTR_end','alignment_length','percent_matches','percent_matches_and_wobbles')

write.table(
  x=data,
  file=snakemake@output[[1]],
  quote=FALSE,
  row.names=FALSE,
  col.names=TRUE,
  sep="\t"
)
```
```
shell: "sed 's/(+)//g' {input} | sed 's/(-)//g' > {output}"
```

```
shell: "miranda {input.mirna} {input.utr} {params} -sc {config[miRanda.minimum_alignment_score]} -en {config[miRanda.minimum_energy_score]} -scale {config[miRanda.5_prime_3_prime_scaling_factor]} -go {config[miRanda.alignment_gap_open_penalty]} -ge {config[miRanda.alignment_gap_extension_penalty]} > {output}"
```

```
shell: "scripts/convert_miRanda_to_tsv.sh {input} {output}"
```

```
shell: "cat {input} > {output}"
```

```
script: "add_miRanda_header.R"
```
```r
if ( file.info(snakemake@input[['miRanda_scores']])$size == 0 ) {
  file.create(snakemake@output[[1]])
} else {
  filtered_miRanda_scores = filtar::filter_miRanda_scores(snakemake@input[['miRanda_scores']], snakemake@input[['expression_values']], snakemake@params[['tpm_expression_threshold']])
  write.table(filtered_miRanda_scores, snakemake@output[[1]], row.names=FALSE, col.names=TRUE, sep="\t", quote=FALSE)
}
```
```
script: 'filter_miRanda_scores.R'
```

```
script: 'filter_miRanda_scores.R'
```

```
shell: 'cp {input} {output}'
```
```r
if ( file.info(snakemake@input[['miRanda_scores']])$size == 0 ) {
  file.create(snakemake@output[[1]])
} else {
  filtered_miRanda_scores = filtar::filter_miRanda_scores(snakemake@input[['miRanda_scores']], snakemake@input[['expression_values']], snakemake@params[['tpm_expression_threshold']])
  write.table(filtered_miRanda_scores, snakemake@output[[1]], row.names=FALSE, col.names=TRUE, sep="\t", quote=FALSE)
}
```
```
script: 'filter_miRanda_scores.R'
```

```
script: 'filter_miRanda_scores.R'
```
```r
filtar::filter_mir_families(snakemake@input[['mature_mirnas']], snakemake@input[['mirna_families']], snakemake@config[["mirnas"]], snakemake@output[[1]])
```

```r
filtar::filter_mature_mirs(snakemake@input[[1]], snakemake@config[["mirnas"]], snakemake@output[[1]])
```
```r
species = snakemake@wildcards$species

test = filtar::get_mirna_context(snakemake@input$mirna_seed, snakemake@input$mature_mirnas, snakemake@config$tax_ids[[species]])

readr::write_tsv(test, snakemake@output[[1]], col_names=FALSE)
```
```r
species = snakemake@wildcards$species

mirna_seeds = filtar::get_mirna_family(snakemake@input[[1]], snakemake@wildcards$species, snakemake@config$tax_ids[[species]])

write.table(mirna_seeds, snakemake@output[[1]], sep="\t", quote=FALSE, col.names=FALSE, row.names=FALSE)
```
```
shell: "{input.script} {input.data} > {output}"
```

```
script: "get_mirna_family.R"
```

```
shell: "{input.script} {input.data} {wildcards.species} {params} > {output}"
```

```
script: "get_mirna_context.R"
```

```
script: "filter_mir_for_context_scores.R"
```

```
script: "filter_mir_families.R"
```
```
shell: "{input.script} {input.mirna_families} {input.msa} {output}"
```

```
shell: "{input.script} {input.msa_3UTR} {params.tax_id} > {output}"
```

```
shell: "{input.script} {input.mirna_family} {input.mirna_sites} {input.branch_lengths} > {output}"
```

```
shell: "{input.script} {input.mirna_seeds} {input.CDS} > {output.eightmer_counts}"
```

```
shell: "{input.script} {input.mirnas} {input.msa} {input.PCTs} {input.CDS_lengths} {input.eightmer_counts} {output} {params.tax_id} {input.AIRs} {params.RNAplfold_dir}"
```

```
shell: "cat {input} | sed '1b;/Gene/d' > {output}"
```

```
shell: "cat {input} | sed '1b;/Gene/d' > {output}"
```

```
shell: "awk '{{ print $5}}' {input} | sort | uniq > {output}"
```
```
shell: "cd scripts/targetscan7 && wget http://www.targetscan.org/vert_72/vert_72_data_download/targetscan_70.zip && unzip targetscan_70.zip && rm UTR_Sequences_sample.txt miR_Family_info_sample.txt targetscan_70_output.txt README_70.txt targetscan_70.zip"
```

```
shell: "wget http://www.targetscan.org/vert_72/vert_72_data_download/targetscan_70_BL_PCT.zip && unzip targetscan_70_BL_PCT.zip && mv TargetScan7_BL_PCT/PCT_parameters data/ && mv TargetScan7_BL_PCT/targetscan_70_BL_bins.pl TargetScan7_BL_PCT/targetscan_70_BL_PCT.pl scripts/targetscan7 && rm -rf targetscan_70_BL_PCT.zip TargetScan7_BL_PCT/"
```

```
shell: "wget http://www.targetscan.org/vert_72/vert_72_data_download/TargetScan7_context_scores.zip && unzip TargetScan7_context_scores.zip && mv TargetScan7_context_scores/Agarwal_2015_parameters.txt TargetScan7_context_scores/TA_SPS_by_seed_region.txt TargetScan7_context_scores/targetscan_count_8mers.pl TargetScan7_context_scores/targetscan_70_context_scores.pl scripts/targetscan7/ && rm -rf TargetScan7_context_scores/ TargetScan7_context_scores.zip"
```
```
shell: "sed -e '40 s/9606/$ARGV[1]/g' {input} | sed -e '38 s;PCT_parameters;data\/PCT_parameters;g' > {output} && chmod +x {output}"
```

```
shell: "sed -e '55 s;PCT_parameters;data\/PCT_parameters;g' {input} > {output} && chmod +x {output}"
```
```r
filtered_contextpp_scores = filtar::filter_contextpp_scores(snakemake@input[['contextpp_scores']], snakemake@input[['expression_values']], snakemake@params[['tpm_expression_threshold']])

write.table(filtered_contextpp_scores, snakemake@output[[1]], row.names=FALSE, col.names=TRUE, sep="\t", quote=FALSE)
```

```
script: 'filter_contextpp_scores.R'
```

```
script: 'filter_contextpp_scores.R'
```

```
shell: "cp {input} {output}"
```
```r
filtered_contextpp_scores = filtar::filter_contextpp_scores(snakemake@input[['contextpp_scores']], snakemake@input[['expression_values']], snakemake@params[['tpm_expression_threshold']])

write.table(filtered_contextpp_scores, snakemake@output[[1]], row.names=FALSE, col.names=TRUE, sep="\t", quote=FALSE)
```

```
script: 'filter_contextpp_scores.R'
```

```
script: 'filter_contextpp_scores.R'
```
```
shell: "trim_galore --output_dir results/trimmed_fastq/ --length {config[trim_galore.length]} --stringency {config[trim_galore.stringency]} {input}"
```

```
shell: "trim_galore --output_dir results/trimmed_fastq/ --length {config[trim_galore.length]} --stringency {config[trim_galore.stringency]} --paired {input[0]} {input[1]}"
```
```r
input = readr::read_tsv(snakemake@input[[1]], col_names=FALSE, col_types='ciiic')

if (length(snakemake@config[['transcripts']]) == 0) {
  file.copy(from=snakemake@input[[1]], to=snakemake@output[[1]])
} else {
  for (transcript in snakemake@config[['transcripts']]) {
    if (!transcript %in% input$X5) {
      write(stringr::str_interp("The transcript identifier '${transcript}' is not a valid identifier for the selected species for ensembl release ${snakemake@config[['ensembl_release']]}"), stderr())
    }
  }

  filtered_input = input[input$X5 %in% snakemake@config[['transcripts']],]

  if (dim(filtered_input)[1] == 0) {
    write("No valid transcript identifiers in input. Halting execution.", stderr())
    quit(save="no", status=1)
  }

  write.table(filtered_input, snakemake@output[[1]], col.names=FALSE, row.names=FALSE, quote=FALSE, sep="\t")
}
```
```r
AIRs = filtar::get_canonical_AIRs(snakemake@input[[1]])

write.table(AIRs, snakemake@output[[1]], quote=FALSE, row.names=FALSE, col.names=FALSE, sep="\t")
```

```r
utr_lengths = filtar::get_utr_lengths(snakemake@input[[1]])

write.table(
  x=utr_lengths,
  file=snakemake@output[[1]],
  sep="\t",
  quote=FALSE,
  row.names=FALSE,
  col.names=TRUE
)
```
```
shell: "{input.script} {input.gtf} {wildcards.feature} {output}"
```

```
script: 'get_utr_lengths.R'
```

```
script: 'get_canonical_AIR.R'
```

```
script: 'get_canonical_AIR.R'
```

```
shell: 'grep -E "^{wildcards.chrom}\s" {input} > {output}'
```

```
script: 'filter_bed6.R'
```

```
shell: 'grep -E "^{wildcards.chrom}\s" {input} > {output} || true'
```
```r
if ( file.info(snakemake@input$normal_bed)$size == 0 | file.info(snakemake@input$extended_bed)$size == 0 ) {
  file.copy(from=snakemake@input$normal_bed, to=snakemake@output[[1]])
} else {
  full_set_sorted = filtar::get_full_bed(snakemake@input$normal_bed, snakemake@input$extended_bed, snakemake@input$all_transcripts, snakemake@input$tx_quant)
  write.table(full_set_sorted, file=snakemake@output[[1]], quote=FALSE, row.names=FALSE, col.names=FALSE, sep="\t")
}
```
```r
input = readr::read_tsv(snakemake@input[[1]], col_names=FALSE, col_types='ciiciciiiicc')

if (length(snakemake@config[['transcripts']]) == 0) {
  file.copy(from=snakemake@input[[1]], to=snakemake@output[[1]])
} else {
  transcripts = gsub('\\..*', '', snakemake@config[['transcripts']])
  filtered_input = input[input$X4 %in% transcripts,]
  write.table(filtered_input, snakemake@output[[1]], col.names=FALSE, row.names=FALSE, quote=FALSE, sep="\t")
}
```
```r
input = readr::read_tsv(snakemake@input[[1]], col_names=FALSE, col_types='ciiic')

if (length(snakemake@config[['transcripts']]) == 0) {
  file.copy(from=snakemake@input[[1]], to=snakemake@output[[1]])
} else {
  for (transcript in snakemake@config[['transcripts']]) {
    if (!transcript %in% input$X5) {
      write(stringr::str_interp("The transcript identifier '${transcript}' is not a valid identifier for the selected species for ensembl release ${snakemake@config[['ensembl_release']]}"), stderr())
    }
  }

  filtered_input = input[input$X5 %in% snakemake@config[['transcripts']],]

  if (dim(filtered_input)[1] == 0) {
    write("No valid transcript identifiers in input. Halting execution.", stderr())
    quit(save="no", status=1)
  }

  write.table(filtered_input, snakemake@output[[1]], col.names=FALSE, row.names=FALSE, quote=FALSE, sep="\t")
}
```
```r
united_bedgraph = readr::read_tsv(
  file=snakemake@input[[1]],
  col_names=FALSE,
  col_types=readr::cols(.default = 'd', X1 = 'c', X2 = 'i', X3 = 'i')
)

united_bedgraph = filtar::AvgBedgraph(united_bedgraph)

united_bedgraph = united_bedgraph[,c('X1','X2','X3','avg')]

write.table(
  x=united_bedgraph,
  file=snakemake@output[[1]],
  row.names=FALSE,
  col.names=FALSE,
  quote=FALSE,
  sep="\t"
)
```
```r
if ( file.info(snakemake@input[[2]])$size == 1 ) {
  file.copy(from=snakemake@input[[1]], to=snakemake@output[[1]])
} else {
  AIR_file = filtar::get_AIR_file(snakemake@input[[1]], snakemake@input[[2]])
  write.table(AIR_file, snakemake@output[[1]], sep="\t", col.names=FALSE, row.names=FALSE, quote=FALSE)
}
```
```r
utr_lengths = filtar::get_utr_lengths(snakemake@input[[1]])

write.table(
  x=utr_lengths,
  file=snakemake@output[[1]],
  sep="\t",
  quote=FALSE,
  row.names=FALSE,
  col.names=TRUE
)
```
```python
import sys
import os

index_base_name = snakemake.input['index'].replace('.1.ht2', '')

if len(snakemake.input['reads']) == 1:
    print('single_end_processing')
    os.system('hisat2 -x {} -p {} -U {} -S {} {}'.format(index_base_name, snakemake.threads, snakemake.input['reads'], snakemake.output, snakemake.params[0]))
else:
    print('paired_end_processing')
    os.system('hisat2 -x {} -p {} -1 {} -2 {} -S {} {}'.format(index_base_name, snakemake.threads, snakemake.input['reads'][0], snakemake.input['reads'][1], snakemake.output, snakemake.params[0]))
```
```
shell: "python {input.py_script} {input.gtf} > {output}"
```

```
shell: "python {input.py_script} {input.gtf} > {output}"
```

```
shell: "hisat2-build -f -p {threads} --ss {input.splice_sites} --exon {input.exons} {input.assembly} data/{wildcards.species}"
```

```
shell: "hisat2-build -f -p {threads} {input.assembly} data/{wildcards.species}"
```

```
script: "map_reads.py"
```

```
shell: "samtools view -@ {threads} -Sb {input} > {output}"
```

```
wrapper: "0.27.0/bio/samtools/sort"
```
```python
import os

if len(snakemake.input) > 1:
    os.system("bedtools unionbedg -i {} > {}".format(snakemake.input, snakemake.output))
else:
    os.system("cp {} {}".format(snakemake.input, snakemake.output))
```
```
shell: "{input.script} {input.gtf} {output}"
```

```
script: 'filter_bed6.R'
```

```
shell: 'grep -E "^{wildcards.chrom}\s" {input} > {output} || true'
```

```
shell: "gtfToGenePred {input.gtf} {output}"
```

```
shell: "genePredToBed {input.genepred} {output}"
```

```
shell: "genomeCoverageBed -bg -ibam {input.sorted_bam} -split > {output}"
```

```
shell: 'grep -E "^{wildcards.chrom}\s" {input} > {output}'
```
```
script: "merge_bedgraphs.py"
```

```
script: "get_average_bedgraph.R"
```

```
script: "merge_bedgraphs.py"
```

```
script: "get_average_bedgraph.R"
```

```
shell: 'grep -E "^{wildcards.chrom}\s" {input} > {output}'
```

```
script: "filter_bed12.R"
```

```
shell: "./{input.script} -i {input.bedgraphs} -p {config[APAtrap.min_proportion_of_valid_nucleotides_in_window]} -c {config[APAtrap.min_window_coverage]} -w {config[APAtrap.window_size]} -e {config[APAtrap.utr_extension_size]} -m {input.bed} -o {output}"
```

```
script: "extend_bed2.R"
```

```
shell: "cat {input} > {output}"
```
```
shell: "./{input.script} -i {input.bedgraphs} -g 1 -n 1 -d {config[APAtrap.min_cov_variation_between_APA_sites]} -c {config[APAtrap.min_average_cov]} -a {config[APAtrap.min_distance_between_APA_sites]} -w {config[APAtrap.predictAPA_window_size]} -u {input.bed} -o {output}"
```

```
shell: "cat {input} | sed '1b;/Gene/d' > {output}"
```

```
script: 'get_utr_lengths.R'
```

```
script: 'get_utr_lengths.R'
```

```
script: "get_tissue_specific_APA_file.R"
```

```
shell: "{input.script} {input.gtf} CDS {output}"
```

```
script: "filter_bed6.R"
```

```
shell: 'grep -E "^{wildcards.chrom}\s" {input} > {output} || true'
```
```bash
sed 's/(+)//g' $1 | sed 's/(-)//g' | sed 's/(-)//g' | tr '\n' '\t' | sed 's/>/\n/g' | sed '/^$/d' | awk -v species=$2 '{OFS="\t"}{ print $1,species,$2}' > $3
```
```bash
grep 'hit' $1
command=$? # get exit code

if [[ $command -eq 0 ]]
then
    grep -A 1 'hit' $1 | sed '/--/d' | grep '>' | sed 's/>//' > $2
else
    touch $2
fi
```
```bash
cat $1 | awk '$3 == "transcript"' | awk '{OFS="\t"}{print $1,$14,$16}' | sed 's/"//g' | sed 's/;//g' | sed 's/\t/\./2' | grep -E "^$2 " | awk '{print $2}'
```
```bash
awk '{OFS="\t"}{ print $1,$2,$3,$5,$5,$4}' $1 | sed 's/\t1$/\t+/g' | sed 's/\t-1$/\t-/g' > $2
```