riboCleaner: rRNA Contamination Detection and Quantification Workflow for RNA-Seq Data
riboCleaner is an open-source workflow designed to identify and quantify rRNA abundance in RNA-Seq data. riboCleaner has two main workflows: identify and quantify.
The identify workflow flags putative rDNA genes in a genome of interest. First, it uses barrnap to find the coordinates of rDNA elements in the genome (fasta) and bundles these elements into non-overlapping operons. It then compares these regions to the predicted gene models (gff) and flags any gene whose coordinates overlap a predicted rDNA operon as a putative rDNA contaminant. Finally, the remaining genes are compared to the rDNA contaminant list with blastn, and any that pass sequence identity and coverage thresholds are also flagged as rDNA.
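In essence, the overlap test flags any predicted gene whose interval intersects an rDNA operon on the same sequence. The workflow performs this step with bedtools intersect (see the snippets below); the following is only a minimal Python sketch of the idea, with hypothetical coordinates and gene names:

```python
# Illustrative only: the workflow uses bedtools intersect, not this code.
def overlaps(a_start, a_end, b_start, b_end):
    # two closed intervals overlap if each one starts before the other ends
    return a_start <= b_end and b_start <= a_end

# a hypothetical rDNA operon from barrnap and two hypothetical gene models
rdna_operons = [("chr1", 1200, 9800)]
genes = {"geneA": ("chr1", 1500, 2400), "geneB": ("chr1", 20000, 21000)}

flagged = {name for name, (chrom, start, end) in genes.items()
           if any(chrom == c and overlaps(start, end, s, e)
                  for c, s, e in rdna_operons)}
print(flagged)  # {'geneA'}
```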
The quantify workflow quantifies the abundance of rRNA transcripts in RNA-Seq samples. It uses salmon to quantify transcript abundance in each sample with the putative rDNA gene models included in the index, so reads derived from rRNA map to the rRNA transcripts rather than biasing the counts of mRNA transcripts that share homologous sequence. riboCleaner then summarizes the abundance of the rRNA transcripts to estimate rRNA prevalence in each sample and visualizes the results. As a final step, it removes the rRNA transcripts from the salmon counts table, yielding a table of mRNA abundances that can be normalized and used for downstream analysis.
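The final rescaling step is simple arithmetic: after the rRNA transcripts are dropped, the remaining TPM values are scaled back up so they sum to the original total, as if the rDNA models had never been in the index. A toy sketch with hypothetical numbers (the real rule appears in the quantify snippets below):

```python
import pandas as pd

# hypothetical TPMs for two mRNAs and one flagged rDNA transcript
tpm = pd.Series({"geneA": 600.0, "geneB": 300.0, "rdna1": 100.0})

total = tpm.sum()                      # 1000.0, including rDNA
clean = tpm.drop(labels=["rdna1"])     # remove the rDNA transcript
clean = clean * (total / clean.sum())  # rescale: geneA -> ~666.7, geneB -> ~333.3
print(clean)
```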
This workflow diagram shows how the different parts of the riboCleaner standard workflow fit together. You can also read a more detailed description of the methods or see the Snakemake DAG.
Quick Start
If you want to see the output riboCleaner produces, we have provided the report from running riboCleaner on a test dataset.
To jump right in and run riboCleaner as fast as possible:
```
# clone this repo
git clone https://github.com/basf/riboCleaner.git
cd riboCleaner/test_data
```
Then follow the instructions in the test data README to make sure everything is set up properly.
Once the workflow runs on the test data, replace the test data inputs with your own, delete the contents of the `results` directory, and rerun.
Otherwise, see our detailed documentation for installing and running riboCleaner.
Code Snippets
```
shell:
    """
    barrnap --kingdom {params.kingdom} --threads {threads} {input.reference} > {output.gff}
    """
```
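barrnap scans the reference with kingdom-specific rRNA profile HMMs and reports its predictions as GFF3; these predictions seed every downstream step of the identify workflow.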
```
shell:
    """
    bedtools merge -c 1 -o count -d {params.nor_distance} -i {input.gff} | grep -v ^unmapped | awk -v c={params.nor_counts} '{{if ($4>c) print}}' > {output.nor}
    """
```
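Here `bedtools merge -d {params.nor_distance}` collapses barrnap hits that lie within the configured distance of one another into candidate rDNA regions, and `-c 1 -o count` appends the number of merged features as a fourth column; the awk filter then keeps only regions containing more than `{params.nor_counts}` features, and sequences named `unmapped` are excluded.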
```
shell:
    """
    bedtools intersect -a {input.gene_models} -b {input.barrnap} {input.nor} > {output.tmp}

    # grep will exit with a 1 if it finds no matches
    set +e
    grep {params.gff_feature} {output.tmp} > {output.rdna_models}
    exitcode=$?

    if [ $exitcode -eq 1 ]  # no matches were found
    then
        exit 0
    else
        exit $exitcode
    fi
    """
```
```
run:
    # parse through the gff and extract the fields required for the BED
    # [chromosome, start, end, name]
    with open(input.gff, 'r') as IN, open(output.bed, 'w') as OUT:
        for line in IN:
            if line.startswith("#"):
                continue

            columns = line.strip().split("\t")

            chrom = columns[0]
            feature_type = columns[2]
            start = columns[3]
            end = columns[4]

            # only grab the features we want
            if feature_type == params.feature:
                # get a string like "ID=Glyma.13G020400.1.Wm82.a2.v1;Name=Glyma.13G020400.1;pacid=30501300;longest=1;Parent=Glyma.13G020400.Wm82.a2.v1"
                attrs = columns[8]

                # tokenize attrs
                tokens = {attr.split("=")[0]: attr.split("=")[1] for attr in attrs.split(";")}

                try:
                    name = tokens[params.name]
                except KeyError:
                    raise ValueError("'{name}' missing in '{line}'".format(name=params.name, line=line))

                # write the line in the BED file
                OUT.write("\t".join((chrom, start, end, name)) + "\n")
```
```
shell:
    "bedtools getfasta -name -fi {input.reference} -bed {input.models} -s > {output.fasta}"
```
```
shell:
    "dedupe.sh minidentity={params.identity} in={input.fasta} out={output.deduped}"
```
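dedupe.sh (from BBTools) collapses exact and near-identical duplicates at the configured `minidentity` threshold, so downstream comparisons run against unique representative sequences rather than every copy of the rDNA repeat.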
```
run:
    import re

    # gff/gtf have different ways to specify attributes
    gff_regex = re.compile('(?P<key>[^=]+) *= *(?P<value>[^;]+);? *')
    gtf_regex = re.compile('(?P<key>[^ ]+) *\"(?P<value>[^;\"]+)\"?;? *')

    names = set()
    with open(input.models, 'r') as IN:
        # read the first line to determine if we are using gff/gtf
        first_line = IN.readline()
        if first_line:
            attrs = first_line.split("\t")[8]

            if gff_regex.match(attrs):
                attr_regex = gff_regex
                name_key = "ID"
            elif gtf_regex.match(attrs):
                attr_regex = gtf_regex
                name_key = "gene_id"
            else:
                raise ValueError("Couldn't detect filetype as either GFF/GTF")

            # reset to the beginning of the file
            IN.seek(0)

            for line in IN:
                # get a string like "ID=Glyma.13G020400.1.Wm82.a2.v1;Name=Glyma.13G020400.1;pacid=30501300;longest=1;Parent=Glyma.13G020400.Wm82.a2.v1"
                # or 'gene_id "FBgn0267517"; gene_symbol "5.8SrRNA-Psi:CR45857"';
                # see: http://uswest.ensembl.org/info/website/upload/gff.html
                attrs = line.split("\t")[8]

                # tokenize attrs
                for match in attr_regex.finditer(attrs):
                    if match.group("key") == name_key:
                        names.add(match.group("value"))
                        break
                else:
                    raise ValueError("Couldn't detect gene name in '{line}'".format(line=line))

    # write the names 1/line to output
    with open(output.names, 'w') as OUT:
        for n in names:
            OUT.write(n + "\n")
```
```
shell:
    """
    # -w gets the full transcript (+ UTR, etc)
    gffread -w {output.features} -g {input.fasta} {input.gff}
    """
```
```
shell:
    "blastn -query {input.query} -subject {input.subject} -outfmt '6 qseqid sseqid qstart qend pident qcovs' -out {output.hits} || touch {output.hits}"
```
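The trailing `|| touch {output.hits}` guarantees the expected output file exists even if blastn exits non-zero (for example, when given an empty query), letting the rest of the workflow proceed with an empty hit list.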
```
run:
    rdna = set()
    with open(input.blast, 'r') as IN:
        current_query = ""
        current_subject = ""
        current_weight_id = 0
        current_weight_n = 0
        current_cov = 0
        for line in IN:
            query, subject, start, end, pident, qcovs = line.strip().split("\t")

            # gather the HSPs to each subject for each query and check them against our thresholds
            if query != current_query or subject != current_subject:
                # check the coverage and weighted perc id of the hits
                if current_query != "" and float(current_cov) >= params.qcovs and current_weight_id / current_weight_n >= params.pident:
                    rdna.add(current_query)

                # reset
                current_query = query
                current_subject = subject
                current_weight_id = 0
                current_weight_n = 0
                current_cov = qcovs

            # add this hit; keeping a running % identity for the HSPs in this query/subject pairing
            current_weight_id += float(pident) * abs(int(end) - int(start))
            current_weight_n += abs(int(end) - int(start))

        # check the final query/subject pairing (otherwise the last group would never be evaluated)
        if current_query != "" and float(current_cov) >= params.qcovs and current_weight_id / current_weight_n >= params.pident:
            rdna.add(current_query)

    # subtract the known names
    barrnap_names = set()
    with open(input.barrnap_names, 'r') as IN:
        for line in IN:
            barrnap_names.add(line.strip())

    with open(output.names, 'w') as OUT:
        for n in rdna.difference(barrnap_names):
            OUT.write(n + "\n")
```
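The filter keeps a running, length-weighted percent identity across all HSPs of a query/subject pair and accepts the query once both the weighted identity and the query coverage clear the configured thresholds. A toy calculation with hypothetical numbers:

```python
# two HSPs from one query/subject pair (hypothetical values):
# 95% identity over 400 bp and 80% identity over 100 bp
hsps = [(95.0, 400), (80.0, 100)]

weight_id = sum(pident * length for pident, length in hsps)  # 38000 + 8000
weight_n = sum(length for _, length in hsps)                 # 500
print(weight_id / weight_n)  # 92.0 -> passes a 90% identity threshold
```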
```
shell:
    "cat {input.barrnap} {input.homology} > {output.names}"
```
```
shell:
    "/opt/bbmap/filterbyname.sh include=t substring=name in={input.features} names={input.names} out={output.fasta}"
```
```
run:
    from collections import defaultdict

    # count non-partial rDNA predicted by barrnap
    barrnap_results = defaultdict(lambda: defaultdict(int))
    with open(input.barrnap_predictions, 'r') as IN:
        for line in IN:
            if not line.startswith("#"):
                # detect if this is partial
                partial = "partial" in line

                # quantify each rDNA type
                if "5S_rRNA" in line:
                    barrnap_results["5S"][partial] += 1
                elif "18S_rRNA" in line:
                    barrnap_results["18S"][partial] += 1
                elif "28S_rRNA" in line:
                    barrnap_results["28S"][partial] += 1
                elif "5_8S_rRNA" in line:
                    barrnap_results["5.8S"][partial] += 1
                # needs to be expanded for prokaryotes
                else:
                    barrnap_results["other"][partial] += 1

    # count the rDNA regions
    rDNA_regions = 0
    with open(input.rdna_regions, 'r') as IN:
        for line in IN:
            rDNA_regions += 1

    # count the models added via BLAST homology
    homology_models = 0
    with open(input.homology_models, 'r') as IN:
        for line in IN:
            homology_models += 1

    # count the rDNA genes
    rDNA_gene_models = 0
    with open(input.gene_models, 'r') as IN:
        for line in IN:
            rDNA_gene_models += 1

    # output the summary
    with open(output.summary, 'w') as OUT:
        OUT.write("riboCleaner ran barrnap to detect the following counts of rDNA features:\n")
        for elem in ["18S", "5.8S", "28S", "5S", "other"]:
            OUT.write("  {elem}: {complete} complete, {partial} partial\n".format(elem=elem, complete=barrnap_results[elem][False], partial=barrnap_results[elem][True]))
        OUT.write("\n")

        OUT.write("These rDNA features were merged into {} rDNA-rich regions.\n".format(rDNA_regions))
        OUT.write("\n")

        OUT.write("The rDNA regions were found to overlap with {overlap} features with the `{tag}` tag in the GFF file.\n".format(overlap=rDNA_gene_models - homology_models, tag=params.gff_feature))
        OUT.write("Additionally, {homology} models were identified as having significant homology to these features for a total of {total} models flagged as putative rDNA.\n".format(homology=homology_models, total=rDNA_gene_models))
        OUT.write("\n")

        # begin the warning section
        if not rDNA_regions:
            OUT.write("WARNING: There were no rDNA regions detected. This may be because there are no rDNA sequences in your reference genome (detectable by barrnap) or because they do not form regular rDNA repeats. riboCleaner works best when it is able to detect rDNA regions.\n")

        if not rDNA_gene_models:
            OUT.write("WARNING: There were no rDNA gene models detected. This is often good news but means riboCleaner will detect no rDNA for this genome. However, since there are no false gene models originating from rDNA, your counts table resulting from standard transcriptome analysis won't be biased by false rDNA gene models!\n")
```
```
shell:
    """
    salmon index -p {threads} -t <(cat {input.features} {input.genome}) -d <(grep "^>" {input.genome} | cut -d " " -f 1 | sed 's/>//g') -i {output}
    """
```
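This builds a decoy-aware salmon index: the transcript sequences (including the putative rDNA models) and the whole genome are concatenated as the targets (`-t`), while the genome sequence names extracted from the FASTA headers are listed as decoys (`-d`). Reads that originate from unannotated genomic regions can then map to the decoys instead of being forced onto transcripts.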
```
shell:
    """
    salmon quant --threads {threads} -i {input.index} --libType A -1 {input.r1} -2 {input.r2} -o {output.tmp} 2>&1 | tee {log}
    cp -v {output.tmp}/quant.sf {output.sf}
    cp -v {output.tmp}/aux_info/meta_info.json {output.meta_info}
    """
```
```
shell:
    """
    salmon quant --threads {threads} -i {input.index} --libType A -r {input.r1} -o {output.tmp} 2>&1 | tee {log}
    cp -v {output.tmp}/quant.sf {output.sf}
    cp -v {output.tmp}/aux_info/meta_info.json {output.meta_info}
    """
```
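In both quantification rules (paired-end and single-end), `--libType A` lets salmon detect the library type automatically, and `tee` captures salmon's console output in the log. The transcript quantifications (`quant.sf`) and mapping statistics (`aux_info/meta_info.json`) are then copied out of salmon's output directory for the summary steps.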
```
run:
    import pandas as pd
    import json

    # get the rDNA names
    with open(input.rdna_names, 'r') as IN:
        names = [line.strip() for line in IN]

    sample_summary = {}

    # gather the counts for each file
    all_counts = []
    for sf, meta_info, sample in zip(input.sf, input.meta_info, params.samples):
        with open(meta_info) as IN:
            info = json.load(IN)

        d = pd.read_csv(sf, sep="\t", header=0, index_col=0)
        counts = d["TPM"]
        counts.name = sample
        all_counts.append(counts)

        # get the sample summary info
        sample_summary[sample] = {
            'percent_unmapped': 100 - float(info['percent_mapped']),
            # scale percent rDNA to the mapped portion
            'percent_rDNA': float(info['percent_mapped']) * counts.filter(items=names).sum() / counts.sum()
        }

        # the usable percent is whatever is left over
        sample_summary[sample]['percent_mRNA'] = 100 - sample_summary[sample]['percent_unmapped'] - sample_summary[sample]['percent_rDNA']

    # and combine into one df
    all_counts = pd.DataFrame(all_counts).fillna(0).transpose()

    # make the index be just the gene names (not the position)
    all_counts.index = [g.split(":")[0] for g in all_counts.index]

    # get just the rdna
    all_counts_rdna = all_counts.filter(items=names, axis=0)
    all_counts_rdna.to_csv(output.rdna_abundance, sep="\t", index_label="sample")

    # done as is above to write out more details in the future, for now, just the TPM
    summary = pd.DataFrame.from_dict(sample_summary, orient="index").round(2)

    # write sorted by sample name
    summary.sort_index().to_csv(output.summary, sep="\t", index_label="sample", float_format="%.2f")
```
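Note how `percent_rDNA` is rescaled by salmon's overall mapping rate: the TPM-based rDNA fraction describes only the mapped reads, so multiplying by `percent_mapped` expresses it as a share of all reads; `percent_mRNA` is then simply the remainder.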
```
run:
    import pandas as pd

    # get the rDNA names
    with open(input.rdna_names, 'r') as IN:
        rdna = [line.strip() for line in IN]

    # gather the counts for each file
    all_counts = []
    for sf, sample in zip(input.sf, params.samples):
        d = pd.read_csv(sf, sep="\t", header=0, index_col=0)
        counts = d["TPM"]

        # get the total (including rDNA)
        total = counts.sum()

        # remove rDNA
        counts = counts.filter(items=[n for n in counts.index if n not in rdna])

        # rescale counts by the rDNA ratio (to get the TPM as if the rDNA was never there)
        counts = counts * (total / counts.sum())

        counts.name = sample
        all_counts.append(counts)

    # and combine into one df
    all_counts = pd.DataFrame(all_counts).fillna(0).transpose()

    # make the index be just the gene names (not the position)
    all_counts.index = [g.split(":")[0] for g in all_counts.index]

    all_counts.to_csv(output.clean_tpm, sep="\t", index_label="gene")
```
```
run:
    import pandas as pd
    from matplotlib import pyplot as plt

    summary = pd.read_csv(input.summary, sep="\t", index_col=0)[["percent_mRNA", "percent_rDNA", "percent_unmapped"]]
    summary.columns = ["mRNA", "rDNA", "unmapped"]

    # make the experiment level plot
    # adapted from: https://www.dataforeverybody.com/matplotlib-seaborn-pie-charts/
    overall_summary = summary.mean(axis=0)
    pie, ax = plt.subplots(figsize=[10, 6])
    labels = overall_summary.keys()
    plt.pie(x=overall_summary, autopct="%.2f%%", explode=[0.05] * 3, labels=labels, pctdistance=0.5)
    plt.title("Experiment Overview", fontsize=14)
    pie.savefig(output.experiment_summary)
    pie.savefig(output.experiment_summary_svg)

    # make the sample level plot
    # limited to 5 samples per inch of plot (and 3 inches for overhead)
    width = 3 + int(len(summary) / 5)
    summary.plot(kind="bar", stacked=True, figsize=(width, 8))
    plt.legend(loc='lower left', bbox_to_anchor=(1.0, 0.5))
    plt.xticks(rotation=45, ha="right")
    plt.title("Sample Composition")
    plt.ylabel("% of Reads")
    plt.tight_layout()
    plt.gcf().savefig(output.sample_summary)
    plt.gcf().savefig(output.sample_summary_svg)
```
```
run:
    # print the workflow docstring
    print("Snakefile: {p}".format(p=workflow.snakefile))
    if workflow.globals["__doc__"]:
        print(workflow.globals["__doc__"])

    # print each (public) rule
    print("\n##### RULES #####\n")
    for rule in workflow.rules:
        if not rule.name.startswith("_"):
            if rule.docstring:
                print(rule.name)
                print(rule.docstring)
                print("----------\n")
```
```
shell:
    """
    bbduk.sh in={input.r1} in2={input.r2} out={output.r1} out2={output.r2} reads={params.reads} threads={threads}
    """
```
```
shell:
    """
    bbduk.sh in={input.r1} out={output.r1} reads={params.reads} threads={threads}
    """
```