Analysis Pipeline for Ribosome Profiling-Based Identification of Open Reading Frames in Bacteria
Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
This repository contains all code to recreate the contents of [RiboReport - Benchmarking tools for ribosome profiling-based identification of open reading frames in bacteria](https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbab549/6509045). Following are descriptions of the required steps to reproduce our analysis.
Processing of High Throughput Sequencing Data
In the publication, the data required for the generation of our result figures was created using the
snakemake
workflow described below. After the creation of the input data, the figures from the publication were generated by using the evaluation scripts provided in the evaluation folder. The generation and usage of the final tables and figures is described below.
Generation of the Dataset
Dependencies
-
snakemake >=6.4.1
-
singularity
Input data
For running the workflow, several input files are required:
-
genome.fa
-
annotation.gff
-
samples.tsv
-
config.yaml
-
fastq files
The fastq files for our newly published dataset are available via NCBI GEO (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE131514). The data will be published together with the publication. Please contact us if you require the access token.
The rest of the required files can be found in the data folder.
Workflow
Note
Even though
snakemake
workflows are executable locally, we do not advise this due to high memory usage and runtime of some of the processing steps. We ran the workflow on a TORQUE cluster system (powered by bwHPC) and later on a de.NBI cloud instance.
For more information how to run the workflow using your prefered cluster setup, please refer to the snakemake documentation .
To run the provided
snakemake
workflow, follow the example for
pseudomonas_aeroginosa
below:
1. Setup the workflow folder and download the workflow:
mkdir pseudomonas_aeroginosa; cd pseudomonas_aeroginosa;
wget https://github.com/RickGelhausen/RiboReport/archive/2021.tar.gz;
tar -xzf 2021.tar.gz; mv RiboReport-2021 RiboReport; rm 2021.tar.gz;
2. Required tools
All required tools are downloaded automatically via conda and docker.
3. Fetch the annotation and genome files:
cp RiboReport/data/pseudomonas_aeroginosa/annotation.gff . ;
cp RiboReport/data/pseudomonas_aeroginosa/genome.fa . ;
4. Fetch the config.yaml and samples.tsv files:
cp RiboReport/data/pseudomonas_aeroginosa/config.yaml RiboReport/config.yaml ;
cp RiboReport/data/pseudomonas_aeroginosa/samples.tsv RiboReport/samples.tsv ;
5. Retrieve the sequencing data:
In order to download the fastq files, we used the
SRA Toolkit
.
We run
fasterq-dump
executable to download the according
.fastq
files, by providing the appropriate
SRR ID
.
If you do not have the
SRA Toolkit
, we suggest creating a conda environment:
conda create -n sra-tools -c bioconda -c conda-forge sra-tools pigz
conda activate sra-tools
(use source activate, if conda is not set-up for your bash)
For simplicity, we already collected the required
fasterq-dump
calls in a file.
cp RiboReport/data/pseudomonas_aeroginosa/download.sh .
bash download.sh
This will download all required fastq files into a fastq folder.
6. Run the snakemake workflow:
In order to run
snakemake
, the creation of a conda environment is required.
Once miniconda3 is installed. Create a snakemake environment:
conda create -n snakemake -c conda-forge -c bioconda snakemake singularity
conda activate snakemake
Next, we will show how we executed the RiboReport pipeline in both our bwHPC cluster system and on our de.NBI Cloud.
On the cloud or locally (tested on ubuntu)
If you have set up everything correctly, simply run:
snakemake -p -k --use-conda --use-singularity --singularity-args " -c " --greediness 0 -s RiboReport/Snakefile --configfile RiboReport/config.yaml --directory ${PWD} -j 4 --latency-wait 60
You can also run it using nohup, as it might take some time to finish:
nohup snakemake -p -k --use-conda --use-singularity --singularity-args " -c " --greediness 0 -s RiboReport/Snakefile --configfile RiboReport/config.yaml --directory ${PWD} -j 4 --latency-wait 60 &
Depending on your available resources you can also remove --greediness 0 and increase -j, to your preferences in order to boost performance.
On the cluster
Then you can copy and complete one of the provided submission scripts, or create your own.
cp RiboReport/torque.sh .
Example for
torque.sh
:
#!/bin/bash
#PBS -N benchmark
#PBS -S /bin/bash
#PBS -q "long"
#PBS -d <file path>/benchmark
#PBS -l nodes=1:ppn=1
#PBS -o <file path>/benchmark
#PBS -j oe
cd <file path>/benchmark
export PATH="<file path>/miniconda3/bin/:$PATH"
source activate snakemake
snakemake --latency-wait 600 --use-conda -s RiboReport/Snakefile --configfile RiboReport/config.yaml --directory ${PWD} -j 20 --cluster-config RiboReport/torque.yaml --cluster "qsub -N {cluster.jobname} -S /bin/bash -q {cluster.qname} -d <file path>/benchmark -l {cluster.resources} -o {cluster.logoutputdir} -j oe"
All file path statements have to be replaced by the path to your benchmark folder.
Please note
that these scripts might need some extra changes depending on your cluster system. If your cluster system does not run SGE or TORQUE, these files will most likely not work at all. In the case they run SGE or TORQUE, there might be slightly different definitions for the resource statements (here
#PBS -l nodes=1:ppn=1
). This is then also the case for the configuration files
sge.yaml
and
torque.yaml
. Please check the
snakemake documentation
.
Excluding a prediction tool:
Currently, to remove a tool from the analysis simply remove the line from the input and shell part of the mergeConditions rule. You can find it in under
rules/postprocessing.smk
.
To remove IRSOM change:
rule mergeConditions:
input:
ribotish="tracks/{condition}.ribotish.gff",
reparation="tracks/{condition}.reparation.gff",
deepribo="tracks/{condition}.deepribo.gff",
irsom="tracks/{condition}.irsom.gff",
spectre="tracks/{condition}.spectre.gff",
smorfer="tracks/{condition}.smorfer.gff",
ribotricer="tracks/{condition}.ribotricer.gff",
price="tracks/{condition}.price.gff"
output:
"tracks/{condition, [a-zA-Z]+}.merged.gff"
conda:
"../envs/bedtools.yaml"
threads: 1
shell:
"""
mkdir -p tracks;
cat {input.spectre} > {output}.unsorted;
cat {input.ribotish} >> {output}.unsorted;
cat {input.reparation} >> {output}.unsorted;
cat {input.deepribo} >> {output}.unsorted;
cat {input.irsom} >> {output}.unsorted;
cat {input.smorfer} >> {output}.unsorted;
cat {input.ribotricer} >> {output}.unsorted;
cat {input.price} >> {output}.unsorted;
bedtools sort -i {output}.unsorted > {output};
"""
to
rule mergeConditions:
input:
ribotish="tracks/{condition}.ribotish.gff",
reparation="tracks/{condition}.reparation.gff",
deepribo="tracks/{condition}.deepribo.gff",
spectre="tracks/{condition}.spectre.gff",
smorfer="tracks/{condition}.smorfer.gff",
ribotricer="tracks/{condition}.ribotricer.gff",
price="tracks/{condition}.price.gff"
output:
"tracks/{condition, [a-zA-Z]+}.merged.gff"
conda:
"../envs/bedtools.yaml"
threads: 1
shell:
"""
mkdir -p tracks;
cat {input.spectre} > {output}.unsorted;
cat {input.ribotish} >> {output}.unsorted;
cat {input.reparation} >> {output}.unsorted;
cat {input.deepribo} >> {output}.unsorted;
cat {input.smorfer} >> {output}.unsorted;
cat {input.ribotricer} >> {output}.unsorted;
cat {input.price} >> {output}.unsorted;
bedtools sort -i {output}.unsorted > {output};
"""
smORFer output
As smORFer is modular and has different optional steps it is hard to add all combinations to the snakemake pipeline.
Some of these steps even require the creation of a calibrated bam file, which involves manual checking of offsets.
Rules for all the steps involving Ribo-seq libraries are included in
rules/smorfer.smk
. To use the steps involving manual work, simply add the calibrated bam file to a directory called
calibrated_bam/method-condition-replicate_calibrated.bam
.
By default, the filtering steps involving manual work are skipped. To add them the input/output of the rules in
rules/smorfer.smk
have to be updated accordingly.
Extract operon regions from the annotation:
This describes how we extracted the operons from the available annotation:
conda create -n operons -c conda-forge -c bioconda interlap
conda activate operons
./RiboReport/scripts/operons.py --in_gff_filepath RiboReport/data/pseudomonas_aeroginosa/annotation.gff > RiboReport/data/pseudomonas_aeroginosa/operons.gff ;
Visualisation and plotting of the Results
All data can be visualised using the following scripts inside the evaluation folder. Further, an
evaluation_call.sh
script, containing all calls for the plotting pipeline, is provided. This bash script only works if all python scripts are positioned in the same folder and the input gtf-files from the data folder are stored in a "data"-folder in the same location.
Dependencies
-
numpy =1.16.3
-
matplotlib =3.0.3
-
seaborn =0.9.0
-
pandas =0.24.2
-
simple_venn =0.1.0
-
bedtools =v2.28.0
statistics_pos_neg.py
Parameters:
-
reference_pos_data (-p) path to the .gtf file containing all genes of the investigated genome or the investigated subset of genes
-
reference_neg_data (-n) path to the .gtf file containing all genes of the investigated genome or the investigated subset of genes
-
tool_data (-t) path to gtf file containing all predictions for the tools: deepribo, ribotish, reparation and irsom
-
save_path (-o) path to a directory, where the result tables will be stored
-
overlap_cutoff (-c) allowed sequence overlap cutoff between gene and prediction
-
flag_subopt (-s) allowed allow suboptimals in statistical calculation. Set to 1 to enable flag. If not 0
-
flag_no_gene (-g) add predictions which do not overlap with any genen for the given cutoff. Set to 1 to enable flag. If not 0.
Output:
-
df_stat.csv -> main result table storing all computed statistical measurements. Needed for the bar plots.
-
SetX_labels_pos_predictions_overlap_fp.gtf -> all prdictions which do not overlap with the positive or the negative gene set. Therefore, predictions outside of the annotation borders for a given overlap (temporary result).
-
SetX_labels_pos_predictions_overlap_neg.gtf -> all genes of the negative set overlapping with a given percent with any prediction (temporary result).
-
SetX_labels_pos_predictions_overlap_pos.gtf -> all genes of the positive set overlapping with a given percent with any prediction (temporary result).
-
SetX_labels_pos_predictions_overlap_temp.gtf -> all prdictions which do not overlap with the positive gene set (temporary result).
-
df_venn_genes.csv -> table containing a column for each tool listing the number of genes counted as true positive (TP). Needed for the overlap plots.
-
toolX_score_list -> For each prediction, correct genen prediction and not predicted gene a tuple of: (score, overlap, label) is saved as a python object serialization (pickle.dump). Needed to compute the ROC and PRC.
This script generates several statistical measurements for a reference and tool prediction .gtf file. First true positives (TP), false positives (FP) and false negatives (FN) are predicted. One prediction will be associated with one gene and counted as one true positve. The association selection is based on the lowest p-value (0.05). All genes fulfilling the overlap cutoff will be counted as suboptimals and not as false positives. False positives are all predictions, where no gene fulfilling the overlap cutoff could be found and the false negatives vice versa. Based on this computations the recall, FNR, precision, FDR and F1 measure are calculated.
plot_barplots.py
Parameters:
-
input1_df (-i1) df_stat.csv of the statistics.py script for the first overlap cutoff.
-
input2_df (-i2) df_stat.csv of the statistics.py script for the second overlap cutoff.
-
save_path (-o) path to directory where the plots will be stored.
Output:
-
bar_FNR.pdf -> barplot of FNR measures for different tools and two cutoff conditions.
-
bar_recall.pdf -> barplot of recall measures for different tools and two cutoff conditions.
-
bar_precision.pdf -> barplot of precision measures for different tools and two cutoff conditions.
-
bar_FDR.pdf -> barplot of FDR measures for different tools and two cutoff conditions.
-
bar_F1.pdf -> barplot of F1 measures for different tools and two cutoff conditions.
This script generates barplots of statistical measures for the different tools and two overlap cutoff conditions. The output barplots consist of FNR, recall, precision, FDR and F1 measures. It is based on the statistics.py statistical output table.
venn_diagram.py
Parameters:
-
input_df (-i) .cvs containing the list of true positive genes for the used tools.
-
save_path (-o) path where the Venn diagramm will be saved.
-
name_folder (-n) name of the result folder of the investigated set. It will be used in the file name to make it unique.
-
coverage_percent (-c) percentage of overlap cutoff that was used to determine the true positves. It will be used in the file name to make it unique.
Output:
- venn_diagram.pdf -> a 4 Venn diagram highlighting overlapping predictions of the tool.
This script will generate a 4 Venn diagram for the overlap of true positive predicted genes of the 4 investigated tools.
prc_plotting.py
Parameters:
-
input_path: path list of triplets (score, overlap, label) for each tool.
-
species_label: name of the bacterial datasets.
-
save_path: path to save data to.
-
percent_overlap: how much of the prediction should overlap with the gene and vice versa.
-
experiment: name of the subset.
Output:
- datesetX_prc_overlapScore_subset_lables_.pdf
This scripts generates Precision-Recall-Curves (PRC) for all different tools and overlaps.
roc_plotting.py
Parameters:
-
input_path: path list of triplets (score, overlap, label) for each tool.
-
save_path: path to save data to.
-
percent_overlap: how much of the prediction should overlap with the gene and vice versa.
-
experiment: name of the subset.
Output:
- datesetX_roc_overlapScore_.pdf
This scripts generates Receiver-Operator-Curves (ROC) for all different tools and overlaps.
Code Snippets
10 11 | shell: "mkdir -p ribotish; RiboReport/scripts/createRiboTISHannotation.py -a {input.annotation} --genome_sizes {input.sizes} --annotation_output {output}" |
21 22 | shell: "mkdir -p ribotish; gff3ToGenePred {input} {output}" |
32 33 | shell: "mkdir -p ribotish; genePredToBed {input} {output}" |
43 44 | shell: "mkdir -p qc/all; RiboReport/scripts/annotation_featurecount.py -a {input.annotation} -o {output};" |
54 55 | shell: "mkdir -p auxiliary; RiboReport/scripts/enrich_annotation.py -a {input.annotation} -o {output}" |
65 66 67 68 69 | shell: """ mkdir -p auxiliary; awk -F'\\t' '/^[^#]/ {{printf "%s\\t%s\\t%s\\t%s\\t%s\\t%s\\t%s\\t%s\\tID=uid%s;\\n", $1, $2, $3, $4, $5, $6, $7, $8, NR-1}}' {input} > {output} """ |
83 84 | shell: "mkdir -p transcripts; cufflinks {input.bam} -p {threads} -o ./transcripts/{wildcards.condition}-{wildcards.replicate}/ -g {input.annotation}" |
18 19 | run: shell("mkdir -p deepribo; mv {input} deepribo/DeepRibo_model_v1.pt") |
31 32 | shell: "mkdir -p coverage_deepribo; RiboReport/scripts/coverage_deepribo.py --alignment_file {input.bam} --output_file_prefix coverage_deepribo/{wildcards.condition}-{wildcards.replicate}" |
44 45 46 47 48 49 | shell: """ mkdir -p coverage_deepribo bedtools genomecov -bg -ibam {input.bam} -strand + > {output.covfwd} bedtools genomecov -bg -ibam {input.bam} -strand - > {output.covrev} """ |
64 65 66 67 68 69 | shell: """ mkdir -p deepribo/{wildcards.condition}-{wildcards.replicate}/0/; mkdir -p deepribo/{wildcards.condition}-{wildcards.replicate}/1/; DataParser.py {input.covS} {input.covAS} {input.asiteS} {input.asiteAS} {input.genome} deepribo/{wildcards.condition}-{wildcards.replicate} -g {input.annotation} """ |
79 80 | shell: "mkdir -p deepribo; Rscript RiboReport/scripts/parameter_estimation.R -f {input} -o {output}" |
95 96 97 98 99 | shell: """ mkdir -p deepribo; DeepRibo.py predict deepribo/ --pred_data {wildcards.condition}-{wildcards.replicate}/ -r {params.rpkm} -c {params.cov} --model {input.model} --dest {output} --num_workers {threads} """ |
109 110 | shell: "mkdir -p tracks; RiboReport/scripts/deepriboGFF.py -c {wildcards.condition} -i {input} -o {output}" |
120 121 | shell: "mkdir -p tracks; RiboReport/scripts/concatGFF.py {input} -o {output}" |
9 10 | shell: "samtools faidx {rules.retrieveGenome.output}" |
21 22 | shell: "mkdir -p genomes; cut -f1,2 {input[0]} > genomes/sizes.genome" |
35 36 | shell: "samtools index -@ {threads} maplink/{params.prefix}" |
10 11 | shell: "mkdir -p transcripts; RiboReport/scripts/transcriptsToBed.py -i {input} -o {output}" |
22 23 | shell: "mkdir -p transcripts; bedtools getfasta -fi {input.genome} -name -bed {input.transcripts} -fo {output}" |
38 39 40 41 42 43 | shell: """ mkdir -p irsom/model/Escherichia_coli/ mv forge.ibisc.univ-evry.fr/lplaton/IRSOM/raw/master/model/Escherichia_coli/SLSOM irsom/model/Escherichia_coli/ mv forge.ibisc.univ-evry.fr/lplaton/IRSOM/raw/master/model/Escherichia_coli/SOM irsom/model/Escherichia_coli/ """ |
50 51 | shell: "mkdir -p irsom; mv {input} irsom/Featurer; chmod +x irsom/Featurer;" |
67 68 69 70 71 | shell: """ mkdir -p irsom; predict.py --featurer={input.featurer} --file={input.transcripts} --model=irsom/model/Escherichia_coli/ --output=irsom/{wildcards.condition}-{wildcards.replicate}/ """ |
81 82 | shell: "mkdir -p tracks; RiboReport/scripts/irsomGFF.py -c {wildcards.condition} -i {input} -o {output}" |
92 93 | shell: "mkdir -p tracks; RiboReport/scripts/concatGFF.py {input} -o {output}" |
10 11 | shell: "mkdir -p maplink; ln -s {params.inlink} {params.outlink}" |
22 23 | shell: "mkdir -p maplink/RIBO/; ln -s {params.inlink} {params.outlink}" |
34 35 | shell: "mkdir -p maplink/RIBO/; ln -s {params.inlink} {params.outlink}" |
46 47 | shell: "mkdir -p maplink/RNA/; ln -s {params.inlink} {params.outlink}" |
58 59 | shell: "mkdir -p maplink/RNA/; ln -s {params.inlink} {params.outlink}" |
10 11 | shell: "mkdir -p maplink/TIS/; ln -s {params.inlink} {params.outlink}" |
22 23 | shell: "mkdir -p maplink/TIS/; ln -s {params.inlink} {params.outlink}" |
34 35 | shell: "mkdir -p maplink/RNATIS/; ln -s {params.inlink} {params.outlink}" |
46 47 | shell: "mkdir -p maplink/RNATIS/; ln -s {params.inlink} {params.outlink}" |
11 12 | shell: "mkdir -p genomeSegemehlIndex; echo \"Computing Segemehl index\"; segemehl.x --threads {threads} -x {output.index} -d {input.genome} 2> {log};" |
28 29 | shell: "mkdir -p sammulti; segemehl.x -e -d {input.genome} -i {input.genomeSegemehlIndex} -q {input.fastq} --threads {threads} -o {output.sammulti} 2> {log}" |
39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 | shell: """ set +e mkdir -p sam awk '$2 == "4"' {input.sammulti} > {input.sammulti}.unmapped gawk -i inplace '$2 != "4"' {input.sammulti} samtools view -H <(cat {input.sammulti}) | grep '@HD' > {output.sam} samtools view -H <(cat {input.sammulti}) | grep '@SQ' | sort -t$'\t' -k1,1 -k2,2V >> {output.sam} samtools view -H <(cat {input.sammulti}) | grep '@RG' >> {output.sam} samtools view -H <(cat {input.sammulti}) | grep '@PG' >> {output.sam} cat {input.sammulti} |grep -v '^@' | grep -w 'NH:i:1' >> {output.sam} exitcode=$? if [ $exitcode -eq 1 ] then exit 1 else exit 0 fi """ |
68 | shell: "if [ \"{params.method}\" == \"NOTSET\" ]; then RiboReport/scripts/samstrandinverter.py --sam_in_filepath={input.sam} --sam_out_filepath={output.sam}; else cp {input.sam} {output.sam}; fi" |
78 79 | shell: "mkdir -p bammulti; samtools view -@ {threads} -bh {input.sam} | samtools sort -@ {threads} -o {output} -O bam" |
89 90 | shell: "mkdir -p rRNAbam; samtools view -@ {threads} -bh {input.sam} | samtools sort -@ {threads} -o {output} -O bam" |
17 18 19 20 21 22 23 24 25 26 27 28 29 | shell: """ mkdir -p tracks; cat {input.spectre} > {output}.unsorted; cat {input.ribotish} >> {output}.unsorted; cat {input.reparation} >> {output}.unsorted; cat {input.deepribo} >> {output}.unsorted; cat {input.irsom} >> {output}.unsorted; cat {input.smorfer} >> {output}.unsorted; cat {input.ribotricer} >> {output}.unsorted; cat {input.price} >> {output}.unsorted; bedtools sort -i {output}.unsorted > {output}; """ |
39 40 | shell: "mkdir -p tracks; RiboReport/scripts/concatGFF.py {input.mergedGff} -o {output}" |
50 51 | shell: "mkdir -p tracks; RiboReport/scripts/allToUniqueGTF.py -i {input} -o {output}" |
7 8 | shell: "mkdir -p genomes; mv genome.fa genomes/" |
16 17 | shell: "mkdir -p annotation; mv annotation.gff annotation/" |
25 26 | shell: "mkdir -p annotation; RiboReport/scripts/gtf2gff3.py -a {input} -o {output}" |
34 35 | shell: "mkdir -p annotation; RiboReport/scripts/gff2gtf.py -i {input} -o {output}" |
10 11 12 13 14 | shell: """ mkdir -p price; gedi -e IndexGenome -s {input.genome} -a {input.annotation} -nobowtie -nokallisto -nostar -f price/ -o {output} -n price_genome """ |
27 28 29 30 31 | shell: """ mkdir -p price; gedi -e Price -reads {input.bam} -genomic {input.genomic} -prefix price/{wildcards.condition}-{wildcards.replicate}/results """ |
41 42 43 44 45 | shell: """ mkdir -p price; gedi -e ViewCIT -m bed -name 'd.getType()' {input.cit} > {output.bed} """ |
57 58 | shell: "mkdir -p tracks; RiboReport/scripts/priceGFF.py -c {wildcards.condition} -i {input.filtered} -u {input.unfiltered} -o {output.unfiltered} -f {output.filtered}" |
70 71 72 73 74 75 | shell: """ mkdir -p tracks; RiboReport/scripts/concatGFF.py {input.unfiltered} -o {output.unfiltered} RiboReport/scripts/concatGFF.py {input.filtered} -o {output.filtered} """ |
13 14 | shell: "mkdir -p qc/4unique; fastqc -o qc/4unique -t {threads} -f sam_mapped {input.sam}; mv qc/4unique/{params.prefix}_fastqc.html {output.html}; mv qc/4unique/{params.prefix}_fastqc.zip {output.zip}" |
28 29 | shell: "mkdir -p qc/3mapped; fastqc -o qc/3mapped -t {threads} -f sam_mapped {input.sam}; mv qc/3mapped/{params.prefix}_fastqc.html {output.html}; mv qc/3mapped/{params.prefix}_fastqc.zip {output.zip}" |
43 44 | shell: "mkdir -p qc/1raw; fastqc -o qc/1raw -t {threads} {input.fastq}; mv qc/1raw/{params.prefix}_fastqc.html {output.html}; mv qc/1raw/{params.prefix}_fastqc.zip {output.zip}" |
58 59 | shell: "mkdir -p qc/2trimmed; fastqc -o qc/2trimmed -t {threads} {input}; mv qc/2trimmed/{params.prefix}_fastqc.html {output.html}; mv qc/2trimmed/{params.prefix}_fastqc.zip {output.zip}" |
73 74 | shell: "mkdir -p qc/5removedrRNA; fastqc -o qc/5removedrRNA -t {threads} {input}; mv qc/5removedrRNA/{params.prefix}_fastqc.html {output.html}; mv qc/5removedrRNA/{params.prefix}_fastqc.zip {output.zip}" |
87 88 89 90 91 92 93 94 95 96 97 | shell: """ mkdir -p qc/all; column3=$(cut -f3 auxiliary/unambigous_annotation.gff | sort | uniq) if [[ " ${{column3[@]}} " =~ "gene" ]]; then featureCounts -T {threads} -t gene -g ID -a {input.annotation} -o {output.txt} {input.bam}; else touch {output.txt}; fi """ |
108 109 110 111 112 113 114 115 116 117 118 | shell: """ mkdir -p qc/trnainall; column3=$(cut -f3 auxiliary/unambigous_annotation.gff | sort | uniq) if [[ " ${{column3[@]}} " =~ "tRNA" ]]; then featureCounts -T {threads} -t tRNA -g ID -a {input.annotation} -o {output.txt} {input.bam}; else touch {output.txt}; fi """ |
129 130 131 132 133 134 135 136 137 138 139 | shell: """ mkdir -p qc/rrnainall; column3=$(cut -f3 auxiliary/unambigous_annotation.gff | sort | uniq) if [[ " ${{column3[@]}} " =~ "rRNA" ]]; then featureCounts -T {threads} -t rRNA -g ID -a {input.annotation} -o {output.txt} {input.bam}; else touch {output.txt}; fi """ |
150 151 152 153 154 155 156 157 158 159 160 | shell: """ mkdir -p qc/rrnainallaligned; column3=$(cut -f3 auxiliary/unambigous_annotation.gff | sort | uniq) if [[ " ${{column3[@]}} " =~ "rRNA" ]]; then featureCounts -T {threads} -t rRNA -g ID -a {input.annotation} -o {output.txt} {input.bam}; else touch {output.txt}; fi """ |
171 172 173 174 175 176 177 178 179 180 181 | shell: """ mkdir -p qc/rrnainuniquelyaligned; column3=$(cut -f3 auxiliary/unambigous_annotation.gff | sort | uniq) if [[ " ${{column3[@]}} " =~ "rRNA" ]]; then featureCounts -T {threads} -t rRNA -g ID -a {input.annotation} -o {output.txt} {input.bam}; else touch {output.txt}; fi """ |
191 192 | shell: "mkdir -p coverage; bedtools genomecov -ibam {input} -bg > {output}" |
215 216 | shell: "export LC_ALL=en_US.utf8; export LANG=en_US.utf8; multiqc -f -d --exclude picard --exclude gatk -z -o {params.dir} qc/3mapped qc/1raw qc/2trimmed qc/5removedrRNA qc/4unique qc/all qc/trnainall qc/rrnainallaligned qc/rrnainuniquelyaligned qc/rrnainall trimmed 2> {log}" |
11 12 13 14 15 16 17 18 19 20 21 22 23 24 | shell: """ mkdir -p auxiliary UNIQUE="$(cut -f3 {input.annotation} | sort | uniq)" IDENTIFIER="ID" LINE="$(sed '3q;d' {input.annotation})" if [[ $LINE == *"gene_id="* ]]; then IDENTIFIER="gene_id"; fi; for f in ${{UNIQUE}} do featureCounts -F GTF -s 1 -g $IDENTIFIER -O -t $f -M --fraction -a {input.annotation} {input.bam} -T {threads} -o auxiliary/annotation_total_reads.raw.tmp cat auxiliary/annotation_total_reads.raw.tmp | sed 1,2d | awk -v var=$f -FS'\\t' '{{print $0"\\t"var}}' >> {output} rm auxiliary/annotation_total_reads.raw.tmp done """ |
35 36 37 38 | shell: """ mkdir -p auxiliary; RiboReport/scripts/map_reads_to_annotation.py -i {input.reads} -a {input.annotation} -o {output} """ |
50 51 | shell: "mkdir -p auxiliary; RiboReport/scripts/total_mapped_reads.py -b {input.bam} -m {output.mapped} -l {output.length}" |
65 66 | shell: "mkdir -p auxiliary; RiboReport/scripts/generate_output_annotation.py -t {input.total} -r {input.reads} -g {input.genome} -o {output.xlsx} --simple {output.annotation} --complete {output.complete}" |
10 11 12 | run: outputName = os.path.basename(input[0]) shell("mkdir -p uniprotDB; mv {input} uniprotDB/{outputName}; gunzip uniprotDB/{outputName}") |
34 35 | shell: "mkdir -p reparation; if [ uniprotDB/uniprot_sprot.fasta.bak does not exist ]; then cp -p uniprotDB/uniprot_sprot.fasta uniprotDB/uniprot_sprot.fasta.bak; fi; mkdir -p {params.prefix}/tmp; reparation.pl -bam {input.bam} -g {input.genome} -gtf {input.gtf} -db {input.db} -out {params.prefix} -threads {threads}; if [ uniprotDB/uniprot_sprot.fasta does not exist ]; then cp -p uniprotDB/uniprot_sprot.fasta.bak uniprotDB/uniprot_sprot.fasta; fi;" |
45 46 | shell: "mkdir -p tracks; RiboReport/scripts/reparationGFF.py -c {wildcards.condition} -r {wildcards.replicate} -i {input} -o {output}" |
56 57 | shell: "mkdir -p tracks; RiboReport/scripts/concatGFF.py {input} -o {output}" |
19 20 | shell: "mkdir -p ribotish; ribotish quality -v -p {threads} -b {input.fp} -g {input.annotation} -o {params.reporttxt} -f {params.reportpdf} 2> {log} || true; if grep -q \"offdict = {{'m0': {{}}}}\" {params.offsetparameters}; then mv {params.offsetparameters} {params.offsetparameters}.unused; fi; touch {output.offsetdone}" |
43 44 | shell: "mkdir -p ribotish; ribotish predict -v {params.codons} -p {threads} -b {params.fplist} -g {input.annotation} -f {input.genome} -o {output.filtered} 2> {log}" |
54 55 | shell: "mkdir -p tracks; RiboReport/scripts/ribotishGFF.py -c {wildcards.condition} -i {input} -o {output}" |
16 17 18 19 20 | shell: """ mkdir -p ribotricer; ribotricer prepare-orfs --gtf {input.annotation} --fasta {input.genome} --prefix ribotricer/ribotricer --min_orf_length 6 --start_codons ATG,GTG,TTG """ |
33 34 35 36 37 | shell: """ mkdir -p ribotricer/{wildcards.condition}-{wildcards.replicate}; ribotricer learn-cutoff --ribo_bams {input.bam_ribo} --rna_bams {input.bam_rna} --prefix ribotricer/ribotricer --ribotricer_index {input.index} | tail -n 1 | grep -Eo '[+-]?[0-9]+([.][0-9]+)?'> {output} """ |
50 51 52 53 54 | shell: """ mkdir -p ribotricer/{wildcards.condition}-{wildcards.replicate} ribotricer detect-orfs --bam {input.bam_ribo} --ribotricer_index {input.index} --prefix ribotricer/{wildcards.condition}-{wildcards.replicate}/ribotricer --phase_score_cutoff {params.cutoff} """ |
64 65 | shell: "mkdir -p tracks; RiboReport/scripts/ribotricerGFF.py -c {wildcards.condition} -i {input} -o {output}" |
75 76 | shell: "mkdir -p tracks; RiboReport/scripts/concatGFF.py {input} -o {output}" |
9 10 11 12 | shell: """ mkdir -p annotation; awk -F'\\t' '$3 == "rRNA" || $3 == "tRNA"' {input.annotation} | awk -F'\\t' '{{print $1 FS $4 FS $5 FS "." FS "." FS $7}}' > {output.annotation} """ |
23 24 | shell: "bedtools intersect -v -a {input.mapuniq} -b {input.annotation} > {output.bam}" |
20 21 22 23 24 | shell: """ mkdir -p smorfer; smorfer.sh -s putative_orfs.sh -g {input.genome} -c {params.genome_name} -n smorfer_pORFs -i 9-150 -o smorfer/ """ |
35 36 37 38 39 | shell: """ mkdir -p smorfer; smorfer.sh -s FT_GCcontent.sh -g {input.genome} -o smorfer/ -b {input.bed} """ |
50 51 52 53 54 | shell: """ mkdir -p smorfer/{wildcards.condition}-{wildcards.replicate}; smorfer.sh -s count_RPF.sh -b {input.bed} -a {input.bam} -o smorfer/{wildcards.condition}-{wildcards.replicate}/ """ |
65 66 67 68 69 | shell: """ mkdir -p smorfer/{wildcards.condition}-{wildcards.replicate}; smorfer.sh -s FT_RPF.sh high_expressed_ORFs.bed calibrated_RiboSeq.bam -b {input.bed} -a {input.bam} -o smorfer/{wildcards.condition}-{wildcards.replicate}/ """ |
80 81 82 83 84 | shell: """ mkdir -p smorfer/{wildcards.condition}-{wildcards.replicate}; smorfer.sh -s find_best_start.sh -b {input.bed} -a {input.bam} -o smorfer/{wildcards.condition}-{wildcards.replicate}/ """ |
96 97 98 99 100 | shell: """ mkdir -p smorfer/{wildcards.condition}-{wildcards.replicate}_noFT; smorfer.sh -s count_RPF.sh -b {input.bed} -a {input.bam} -o smorfer/{wildcards.condition}-{wildcards.replicate}_noFT/ """ |
111 112 | shell: "mkdir -p tracks; RiboReport/scripts/smorferGFF.py -c {wildcards.condition} -i {input} -o {output}" |
122 123 | shell: "mkdir -p tracks; RiboReport/scripts/concatGFF.py {input} -o {output}" |
14 15 16 17 18 | shell: """ mkdir -p spectre; SPECtre.py --input {input.bam} --output {output.results} --log {output.log} --fpkm {input.fpkm} --gtf {input.gtf} --nt {threads} """ |
28 29 | shell: "mkdir -p tracks; RiboReport/scripts/spectreGFF.py -c {wildcards.condition} -i {input} -o {output}" |
39 40 | shell: "mkdir -p tracks; RiboReport/scripts/concatGFF.py {input} -o {output}" |
16 17 | shell: "mkdir -p trimlink; ln -s {params.inlink} {params.outlink};" |
31 32 | shell: "mkdir -p trimmed; cutadapt -j {threads} {params.adapter} {params.quality} {params.filtering} -o {output.fastq} {input.fastq}" |
11 12 | shell: "mkdir -p genomes; RiboReport/scripts/reverseComplement.py --input_fasta_filepath genomes/genome.fa --output_fasta_filepath genomes/genome.rev.fa" |
23 24 | shell: "mkdir -p tracks; RiboReport/scripts/motif2GFF3.py --input_genome_fasta_filepath {input.fwd} --input_reverse_genome_fasta_filepath {input.rev} --motif_string ATG --output_gff3_filepath {output}" |
35 36 | shell: "mkdir -p tracks; RiboReport/scripts/motif2GFF3.py --input_genome_fasta_filepath {input.fwd} --input_reverse_genome_fasta_filepath {input.rev} --motif_string GTG,TTG,CTG --output_gff3_filepath {output}" |
48 49 | shell: "mkdir -p tracks; RiboReport/scripts/motif2GFF3.py --input_genome_fasta_filepath {input.fwd} --input_reverse_genome_fasta_filepath {input.rev} --motif_string TAG,TGA,TAA --output_gff3_filepath {output}" |
60 61 | shell: "mkdir -p tracks; RiboReport/scripts/motif2GFF3.py --input_genome_fasta_filepath {input.fwd} --input_reverse_genome_fasta_filepath {input.rev} --motif_string AAGG --output_gff3_filepath {output}" |
82 83 | shell: "mkdir -p globaltracks; mkdir -p globaltracks/raw; mkdir -p globaltracks/mil; mkdir -p globaltracks/min; RiboReport/scripts/mapping.py --mapping_style global --bam_path {input.bam} --wiggle_file_path globaltracks/ --no_of_aligned_reads_file_path {input.stats} --library_name {params.prefix};" |
94 95 | shell: "wigToBigWig {input.fwd} {input.genomeSize} {output.fwd}" |
106 107 | shell: "wigToBigWig {input.rev} {input.genomeSize} {output.rev}" |
118 119 | shell: "wigToBigWig {input.fwd} {input.genomeSize} {output.fwd}" |
130 131 | shell: "wigToBigWig {input.rev} {input.genomeSize} {output.rev}" |
142 143 | shell: "wigToBigWig {input.fwd} {input.genomeSize} {output.fwd}" |
154 155 | shell: "wigToBigWig {input.rev} {input.genomeSize} {output.rev}" |
176 177 | shell: "mkdir -p centeredtracks; mkdir -p centeredtracks/raw; mkdir -p centeredtracks/mil; mkdir -p centeredtracks/min; RiboReport/scripts/mapping.py --mapping_style centered --bam_path {input.bam} --wiggle_file_path centeredtracks/ --no_of_aligned_reads_file_path {input.stats} --library_name {params.prefix};" |
188 189 | shell: "wigToBigWig {input.fwd} {input.genomeSize} {output.fwd}" |
200 201 | shell: "wigToBigWig {input.rev} {input.genomeSize} {output.rev}" |
212 213 | shell: "wigToBigWig {input.fwd} {input.genomeSize} {output.fwd}" |
224 225 | shell: "wigToBigWig {input.rev} {input.genomeSize} {output.rev}" |
236 237 | shell: "wigToBigWig {input.fwd} {input.genomeSize} {output.fwd}" |
248 249 | shell: "wigToBigWig {input.rev} {input.genomeSize} {output.rev}" |
270 271 | shell: "mkdir -p fiveprimetracks; mkdir -p fiveprimetracks/raw; mkdir -p fiveprimetracks/mil; mkdir -p fiveprimetracks/min; RiboReport/scripts/mapping.py --mapping_style first_base_only --bam_path {input.bam} --wiggle_file_path fiveprimetracks/ --no_of_aligned_reads_file_path {input.stats} --library_name {params.prefix};" |
282 283 | shell: "wigToBigWig {input.fwd} {input.genomeSize} {output.fwd}" |
294 295 | shell: "wigToBigWig {input.rev} {input.genomeSize} {output.rev}" |
306 307 | shell: "wigToBigWig {input.fwd} {input.genomeSize} {output.fwd}" |
318 319 | shell: "wigToBigWig {input.rev} {input.genomeSize} {output.rev}" |
330 331 | shell: "wigToBigWig {input.fwd} {input.genomeSize} {output.fwd}" |
342 343 | shell: "wigToBigWig {input.rev} {input.genomeSize} {output.rev}" |
364 365 | shell: "mkdir -p threeprimetracks; mkdir -p threeprimetracks/raw; mkdir -p threeprimetracks/mil; mkdir -p threeprimetracks/min; RiboReport/scripts/mapping.py --mapping_style last_base_only --bam_path {input.bam} --wiggle_file_path threeprimetracks/ --no_of_aligned_reads_file_path {input.stats} --library_name {params.prefix};" |
376 377 | shell: "wigToBigWig {input.fwd} {input.genomeSize} {output.fwd}" |
388 389 | shell: "wigToBigWig {input.rev} {input.genomeSize} {output.rev}" |
400 401 | shell: "wigToBigWig {input.fwd} {input.genomeSize} {output.fwd}" |
412 413 | shell: "wigToBigWig {input.rev} {input.genomeSize} {output.rev}" |
424 425 | shell: "wigToBigWig {input.fwd} {input.genomeSize} {output.fwd}" |
436 437 | shell: "wigToBigWig {input.rev} {input.genomeSize} {output.rev}" |
447 448 | shell: "mkdir -p tracks; multiBamSummary bins --smartLabels --bamfiles maplink/*.bam -o {output} -p {threads};" |
458 459 | shell: "mkdir -p figures; plotCorrelation -in {input.npz} --corMethod spearman --skipZeros --plotTitle \"Spearman Correlation of Read Counts\" --whatToPlot heatmap --colorMap RdYlBu --plotNumbers -o {output.correlation} --outFileCorMatrix SpearmanCorr_readCounts.tab" |
469 470 | shell: "mkdir -p tracks; cat {input[0]} | grep -v '\tgene\t' > tracks/annotation-woGenes.gtf; gtf2bed < tracks/annotation-woGenes.gtf > tracks/annotation.bed" |
481 482 | shell: "mkdir -p tracks; cut -f1-6 {input[0]} > tracks/annotationNScore.bed6; awk '{{$5=1 ; print ;}}' tracks/annotation.bed6 > tracks/annotation.bed6; bedToBigBed -type=bed6 -tab tracks/annotation.bed6 {input[1]} tracks/annotation.bb" |
Support
- Future updates
Related Workflows





