circDetector (CD): Circular RNA Detection and Annotation Workflow
Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
circDetector (CD) : circular RNAs detection and annotation
Description:
circDetector (CD) is a computational tool for detecting and annotation of circular RNAs (circRNAs) from Total RNA-Seq data. This tool is implemented with the Snakemake workflow management system allowing reproducible and scalable data analyses. CD was developped to identify circRNAs from reads mapped to the reference genome with the STAR tool (Spliced Transcripts Alignment to a Reference) in Single-End (SE). It consists of identifying reads with a circular junction ( chimeric reads ) from the chimeric.out.junction files provided by STAR. CD also provides a file reporting all statistics of STAR-SE mapping. Mapping informations were extracted from the STAR file Log.final.out .
Note: The documentation of the CD tool is in progress.
Table of contents:
Installation:
The source code can be downloaded from GitHub .
Compilation and configuration:
To install this tool, you need to first fetch the repository git repository or download the corresponding compressed files.
git clone https://github.com/GenEpi-GenPhySE/circRNA.git
cd circRNA
Required tools:
Local installation:
sudo apt-get install bedtools
Installation on the Genotoul cluster:
module load bioinfo/bedtools2-2.29.0
Setting the environment:
python3 -m venv circrnaenv
source circrnaenv/bin/activate
pip install Cython
pip install pybedtools
pip install snakemake
pip install pandas
pip install natsort
pip install tqdm
pip install networkx
Preparing the config file:
Snakemake provides a config file mechanism in which the path of the several files (for example annotation file, sample file, data...) can be specified. Note: The paths indicated in this file must be adapted to the data.
Preparing the sample file:
The script prepare.py takes as input an tabulated file containing the paths to all the mapping files and generates a simplified tabular file which will be taken at the entrance of the snakemake workflow.
python scripts/prepare.py -i metadata.tsv -o samples.tsv
- Input: metadata.tsv is a tabulated file containing the following fields:
Column | Type | Description |
---|---|---|
1 | string | species (bos_taurus, sus_scrofa) |
2 | string | tissue (testis, liver) |
3 | string | sex (male, female) |
4 | string | sample (single, orient) |
5 | string | sample_unit (sample uniq name) |
6 | string | animal_name (specific name for each individual) |
7 | string | mapdir (file path to mapping files) |
Note: The table must contain all these fields (column names).
- Output: samples.tsv is a tabulated file containing the following fields:
Column | Type | Description |
---|---|---|
1 | string | sample (concatenation of species, tissue and animal_name) |
2 | string | sample_unit (sample uniq name) |
3 | string | species (bos_taurus = cow, sus_scrofa = pig) |
4 | string | sex (male, female) |
5 | string | mapdir (file path to mapping files) |
Snakemake worflow:
The Snakefile is composed of the main following rules:
-
rule detection : circRNAs detection
-
rule mappingstat : analyses of statistics of STAR-SE mapping
-
rule annotation : circRNAs annotation
Rule detection:
-r1 R1_INPUT_FILE, --r1_input_file R1_INPUT_FILE
R1 input file name
-r2 R2_INPUT_FILE, --r2_input_file R2_INPUT_FILE
R2 input file name
-min_cr CR_THRESHOLD, --cr_threshold CR_THRESHOLD
Minimum number of chimeric reads
-tol TOLERANCE, --tolerance TOLERANCE
Tolerance of different positions to start and end
-fmt {bed,gtf}, --output_file_format {bed,gtf}
Output file format : 0-based (bed) or 1-based (gtf)
-o OUTPUT_FILE, --output_file OUTPUT_FILE
Output file path
--verbose Print more info
Example of a command executed by Snakemake:
python3 scripts/circRNA_detection.py -r1 -r2 -o circ_rnas.bed
Rule mappingstat:
CD provides a file reporting all statistics of STAR-SE mapping. Mapping informations were extracted from the STAR file Log.final.out .
- Input: Example of a Log.final.out file:
Started job on | Oct 26 05:21:11
Started mapping on | Oct 26 05:29:19
Finished on | Oct 26 05:52:39
Mapping speed, Million of reads per hour | 121.89
Number of input reads | 47400458
Average input read length | 150
UNIQUE READS:
Uniquely mapped reads number | 33941471
Uniquely mapped reads % | 71.61%
Average mapped length | 149.32
Number of splices: Total | 8294615
Number of splices: Annotated (sjdb) | 7882009
Number of splices: GT/AG | 8204363
Number of splices: GC/AG | 58632
Number of splices: AT/AC | 5362
Number of splices: Non-canonical | 26258
Mismatch rate per base, % | 0.57%
Deletion rate per base | 0.02%
Deletion average length | 2.11
Insertion rate per base | 0.02%
Insertion average length | 1.74
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 3381777
% of reads mapped to multiple loci | 7.13%
Number of reads mapped to too many loci | 28860
% of reads mapped to too many loci | 0.06%
UNMAPPED READS:
Number of reads unmapped: too many mismatches | 0
% of reads unmapped: too many mismatches | 0.00%
Number of reads unmapped: too short | 9943085
% of reads unmapped: too short | 20.98%
Number of reads unmapped: other | 18180
% of reads unmapped: other | 0.04%
CHIMERIC READS:
Number of chimeric reads | 129065
% of chimeric reads | 0.27%
- Output: reports/mapping_stat.tsv is a tabulated file containing all the informations of the input file (column) for each sample (row).
Rule annotation:
-circ CIRC_RNA_FILE, --circ_rna_file CIRC_RNA_FILE
Circular RNA file path
-fmt {bed,gtf}, --annot_format {bed,gtf}
Annotation file format
-annot ANNOTATION_FILE, --annotation_file ANNOTATION_FILE
Annotation file path
-oe EXONIC_OUTPUT_FILE, --exonic_output_file EXONIC_OUTPUT_FILE
Exonic output file path
-oi INTRONIC_OUTPUT_FILE, --intronic_output_file INTRONIC_OUTPUT_FILE
Intronic output file path
-o OUTPUT, --output OUTPUT
Annotation output file path
--verbose Print more info
Example of a command executed by Snakemake:
python3 scripts/circRNA_annotation.py -circ circ_rnas.bed -annot -o annotation_circRNAs.tsv
Note: Manual post-processing of data is required to remove short circRNAs.
Example Commands:
Launching the pipeline locally:
snakemake -p --jobs 8
Launching the pipeline on the Genotoul cluster:
sbatch circ_rnas.sh
Testing the detection rule locally for a single sample:
snakemake pig_testis_neonat-1/auzeville.bed -p --cores 1
Note: Make sure the circrnaenv environment is in this folder.
Contact & Support
For any bug report, feature request, or questions please fill out an issue through circRNA_project's issue page .
Copyright and License
This software is released under GNU General Public License (v3.0).
Code Snippets
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 | # [-fmt {bed,gtf}] [-o OUTPUT_FILE] # Imports import circRNA as circ from collections import defaultdict import sys, os, re, argparse, natsort import numpy as np import csv import pybedtools as pybed # Utility functions def eprint(*args, **kwargs): print(*args, file=sys.stderr, **kwargs) dsp_control = circ.DisplayControl.instance() if dsp_control.verbose: print(*args, file=sys.stderr, **kwargs) def exonic_annotations(circ_rnas, exons): """ Annotate the circrnas according to the exon annotations circ_rnas: an array of cirRNAs object exons: an array of circRNAs objects returns the circ_rnas array with the circRNAs annotated We annotate each circRNA's splice junction with the corresponding exons spice junctions using a dict of exons splice junctions This means that splice junction correspondances must be exact """ eprint("\nCompute junction dictionary...\n") annotation_junctions = compute_junction_dict(exons) eprint("\nCircRNA annotation...\n") set_annotation(circ_rnas, annotation_junctions) return circ_rnas def junction_key(chrom, pos): return "%s:%s" %(chrom, pos) def compute_junction_dict(annot_array): """ Compute a dictionnary of splic junctions: keys: chrom:pos values: list of tuples tuple: (exon, end_type) end_type: 3 (resp 5) for donor (resp acceptor) """ junction_dict = defaultdict(list) seen_exons = defaultdict() for annot in annot_array: if annot.get_att("exon_id") in seen_exons: continue seen_exons[annot.get_att("exon_id")] = 1 key_start = junction_key(annot.chrom, annot.start) key_end = junction_key(annot.chrom, annot.end) if annot.strand == "+": start_end_type = "5" end_end_type = "3" else: start_end_type = "3" end_end_type = "5" # annotating start junction_dict[key_start].append((annot, start_end_type)) # annotating end junction_dict[key_end].append((annot, end_end_type)) return junction_dict def set_annotation(circ_rnas, annotation_junctions): """ An annotation string would be very much appreciated """ for circ_rna in circ_rnas: junction_start = junction_key(circ_rna.chrom, circ_rna.start) junction_end = junction_key(circ_rna.chrom, circ_rna.end) if junction_start in annotation_junctions: circ_rna.add_start_annotation(annotation_junctions[junction_start]) circ_rna.add_start_transcript_id(annotation_junctions[junction_start]) circ_rna.add_start_gene_id(annotation_junctions[junction_start]) if junction_end in annotation_junctions: circ_rna.add_end_annotation(annotation_junctions[junction_end]) circ_rna.add_end_transcript_id(annotation_junctions[junction_end]) circ_rna.add_end_gene_id(annotation_junctions[junction_end]) def intronic_annotations(circ_rnas, exons): """ Annotate the circrnas according to the introns annotations circ_rnas: an array of cirRNAs object exons: an array of circRNAs objects returns the circ_rnas array with the circRNAs intrno-annotated Because at least on circRNA junction is away from existing splice junctions, the annotation is solely based on overlap between the circRNA and an intron : - [start, end] interval must be within the intron - this interval must cover at least 90% of the intron """ #eprint("\nCompute introns from exons...\n") transcripts = get_exons_per_transcript(exons) introns = circ.compute_intronic_positions(transcripts) #eprint("\nCompute intersection with circRNAs...\n") circ_rnas = annotate_intron_intersection(circ_rnas, introns) return circ_rnas def annotate_intron_intersection(circ_rnas, introns): pybed_introns = annot_to_pybed(introns) pybed_circrnas = annot_to_pybed(circ_rnas) intersections = pybed_circrnas.intersect(pybed_introns, f=0.95, wo=True) circrna_d = {circ.name: circ for circ in circ_rnas} introns_d = {intron.name: intron for intron in introns} for i in intersections: intron = introns_d[i[9]] circ = circrna_d[i[3]] overlap = i[-1] p_overlap = float(overlap)/intron.length # taille circ sur taille intron circ.annotate_intron(intron, p_overlap) return circ_rnas def annot_to_pybed(annots): pybed_annot = [] for a in annots: pybed_annot.append(a.topybed()) return pybed.BedTool(pybed_annot).sort() def infra_exonic_annotations(circ_rnas, exons): """ """ pybed_exons = annot_to_pybed_ife(exons) pybed_circrnas = annot_to_pybed_ife(circ_rnas) intersections = pybed_circrnas.intersect(pybed_exons, f=0.95, wo=True) circrna_d = {circ.name: circ for circ in circ_rnas} exons_d = {exon.name: exon for exon in exons} for i in intersections: exon = exons_d[i[9]] circ = circrna_d[i[3]] circ.annotate_ife(exon) return circ_rnas def annot_to_pybed_ife(annots): pybed_annot = [] for a in annots: pybed_annot.append(a.topybed_ife()) return pybed.BedTool(pybed_annot).sort() def get_exons_per_transcript(annots): """ returns a dictionnary key: ... """ d_transcript = defaultdict(list) for exon in annots: transcript_id = exon.get_att("transcript_id") d_transcript[transcript_id].append(exon) return d_transcript def write_annotated_circrnas(circ_annotated, outputfile): circ_rnas = natsort.natsorted(circ_annotated, key=lambda e: (e.chrom, e.start)) header = "\t".join(["chrom", "start", "end", "strand", "nb_ccr", "circ_rna_name", "exons_id_start", "exons_id_end", "transcript_id_start", "transcript_id_end", "gene_id_start", "gene_id_end", "intron_name", "start_i", "end_i", "gene_id_i", "exon_id_i", 'exon_id_ife', "exons_start_end_ife", "gene_id_ife"]) with open(outputfile, "w") as fout: fout.write(header+"\n") for circ_rna in circ_rnas: intron_annot = circ_rna.intron_annotation if len(intron_annot) == 0: start_i = "" end_i = "" name_i = "" else: for annot_i in intron_annot: intron = annot_i[0] start_i = intron.start end_i = intron.end name_i = intron.name att_d = dict(re.split(" |=", item) for item in circ_rna.attributes.split("; ")) nb_ccr = att_d["nb_ccr"].replace('\"', '') s = [circ_rna.chrom, circ_rna.start, circ_rna.end, circ_rna.strand, nb_ccr, circ_rna.name, circ_rna.get_start_annotation_str(), circ_rna.get_end_annotation_str(), circ_rna.get_start_transcript_id_str(), circ_rna.get_end_transcript_id_str(), circ_rna.get_start_gene_id_str(), circ_rna.get_end_gene_id_str(), name_i, start_i, end_i, circ_rna.get_intron_annot_gene_str('gene_id'), circ_rna.get_intron_annot_str('exon_id'), circ_rna.get_infra_exonic_annot_str("exon_id"), circ_rna.get_infra_exonic_exons_start_end(), circ_rna.get_infra_exonic_gene_annot_str("gene_id")] fout.write( "%s\n" % "\t".join(map(str, s))) def main(): # Exonic: eprint("\nIdentification of EXONIC circRNAs:\n") eprint("\nReading circRNA file...\n") circ_rnas = circ.read_annotation(args.circ_rna_file, fmt=args.annot_format) eprint("\nReading annotation file...\n") annots = circ.read_annotation(args.annotation_file, fmt=args.annot_format) eprint("\nExonic annotations..\n") circ_rnas = exonic_annotations(circ_rnas, annots) eprint("\nIntronic annotations..\n") circ_rnas = intronic_annotations(circ_rnas, annots) ##### eprint("\nInfra-exonic annotations..\n") circ_rnas = infra_exonic_annotations(circ_rnas, annots) ##### eprint("\nWrite annotations..\n") write_annotated_circrnas(circ_rnas, args.output) def parse_arguments(): parser = argparse.ArgumentParser(description='Annotation') parser.add_argument('-circ', '--circ_rna_file', required=True, help='Circular RNA file path') parser.add_argument('-fmt', '--annot_format', required=False, default="bed", choices=["bed","gtf"], help='Annotation file format') parser.add_argument('-annot', '--annotation_file', required=True, help='Annotataion file path') parser.add_argument('-oe', '--exonic_output_file', required=False, default='exonic_circ_rnas_annot.out', help='Exonic output file path') parser.add_argument('-oi', '--intronic_output_file', required=False, default='annotation_introns.bed', help='Intronic output file path') parser.add_argument('-o', '--output', required=True, help='Annotation output file path') parser.add_argument('--verbose', help='Print more info', action='store_true') args = parser.parse_args() dsp_control = circ.DisplayControl.instance() dsp_control.verbose = args.verbose return args if __name__ == '__main__': args = parse_arguments() main() |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 | # [-min_cr CR_THRESHOLD] [-tol TOLERANCE] # [-fmt {bed,gtf}] [-o OUTPUT_FILE] # Imports import sys, os, re, argparse, natsort, time if not sys.warnoptions: import warnings warnings.simplefilter("ignore") from collections import defaultdict import pandas as pd import numpy as np from networkx import (draw, DiGraph, Graph) import networkx as netwx import statistics from statistics import mode from time import sleep from tqdm import tqdm import circRNA as circ # Utility functions def eprint(*args, **kwargs): print(*args, file=sys.stderr, **kwargs) dsp_control = circ.DisplayControl.instance() if dsp_control.verbose: print(*args, file=sys.stderr, **kwargs) def merge_junctions(df_a): """ Merge two dataframes :df_r1: dataframe of R1 reads :df_r2: dataframe of R2 reads :return the merged dataframe """ df_merged = pd.concat(df_a,ignore_index=True) return df_merged def read_junction_file(file, read_type=None): """ Read the data, adds the header of the table and adds a column with the corresponding read type. :file: file to read :read_type: read type (R1 or R2) :return the dataframe of reads """ names = ["chr_donor", "pos_donor","strand_donor","chr_acceptor", "pos_acceptor","strand_acceptor","junction_type", "repeat_left","repeat_right","read_name","pos_s1", "CIGAR_s1","pos_s2","CIGAR_s2"] # Columns names of the input data # Read data as table with pandas: df = pd.read_table(file, sep = '\t', names=names, dtype=str, converters = {'pos_acceptor':int, 'pos_donor':int}) df['read_type'] = read_type # Add column "read" to specify the read type return df def get_valid_circjunctions(df): """ Selects valid circular chimeric reads (ccr) according to the conditions of the "valid_ccr" function. :df: datatframe to filter return: the filtered dataframe """ # Parsing dataframe to select valid rows # Boolean array with true if the read is a valid ccr: valid_ccrs = [] #for index, row in tqdm(df.iterrows(), total=df.shape[0]): for index, row in circ.tqdm_wrapper(df): valid_ccrs.append(valid_ccr(row)) df_circ_cr = df[valid_ccrs] ccr_a = [] # Array of circular junctions eprint(" "+"\n"+"Generating array of selected ccr..."+"\n") #for index, row in tqdm(df_circ_cr.iterrows(), total=df_circ_cr.shape[0]): for index, row in circ.tqdm_wrapper(df_circ_cr): ccr = circ.CCR(row) # Create CCR object ccr_a.append(ccr) # Add chimeric reads into array eprint("\n") return ccr_a def valid_ccr(row): """ Select ccr according to several conditions. :row: row of the dataframe :return boolean "False" for not valid ccr or boolean "True" for valid ccr """ # Regular expression of the left dangling CIGAR: dangling_left = re.compile('^\d+S\d+M$') # Regular expression of the rigth dangling CIGAR: dangling_right = re.compile('^\d+M\d+S$') if (row['chr_donor']!=row['chr_acceptor'] or row['strand_donor']!=row['strand_acceptor']): return False elif abs(row['pos_donor'] - row['pos_acceptor']) > 10000000: return False else: if row['strand_donor'] == '-' : if (row['pos_donor']>row['pos_acceptor'] or row['pos_s1']>row['pos_s2']): return False else: if (not dangling_left.match(row['CIGAR_s1']) or not dangling_right.match(row['CIGAR_s2'])): return False else: return True elif row['strand_donor'] == '+' : if (row['pos_donor']<row['pos_acceptor'] or row['pos_s1']<row['pos_s2']): return False else: if (not dangling_right.match(row['CIGAR_s1']) or not dangling_left.match(row['CIGAR_s2'])): return False else: return True def get_exact_circrnas(circ_junctions, cr_threshold): """ Detects circ RNAs as CR sharing exactly the same circ signature chrom start end strand """ eprint("Exact circular RNA detection") circ_rna_dict = defaultdict(list) for circ_junc in circ_junctions: circ_rna_dict[circ_junc.key].append(circ_junc) circ_array = [] for key, cr_array in circ_rna_dict.items(): if len(cr_array) >= cr_threshold: circ_array.append(circ.CircRNA(args.output_file_format, ccr_array=cr_array)) return circ_array def same_circ_signature(cr_a, cr_b, tol): """ Two-by-two comparison of chimeric reads :cr_a: chimeric read a :cr_b: chimeric read b :tol: number of different start-end positions tolerated :return boolean "False" for not valid conditions and "True" for valid conditions """ if cr_a.chrom != cr_b.chrom: return False if abs(cr_a.start - cr_b.start) > tol: return False if abs(cr_a.end - cr_b.end) > tol: return False if cr_a.read_type == cr_b.read_type: if cr_a.strand != cr_b.strand: return False else: if cr_a.strand == cr_b.strand: return False return True def merge_circ_rnas(component): if len(component) == 1: return component[0] elements = [] for circ_rna in component: elements.extend(circ_rna.ccr_array) return circ.CircRNA(args.output_file_format, ccr_array=elements) def connected_components(circ_a, neighbours, cr_threshold): """ Generate an undirected graph and cluters ccr into connected components :ccr_a: array of ccr object :neighbours: dictionary of neighbours chimeric reads :return an array of circular RNAs """ undirected = netwx.Graph() undirected.add_nodes_from(range(len(circ_a))) undirected.add_edges_from(neighbours) circ_rnas = [] # Circular RNAs array # Parsing of connected components composed of chimeric reads: for comp in netwx.connected_components(undirected): circ_rna = merge_circ_rnas([circ_a[i] for i in comp]) if circ_rna.nb_ccr >= cr_threshold: circ_rnas.append(circ_rna) return circ_rnas def cumulative_cr(circ_a): return sum([c.nb_ccr for c in circ_a]) def compute_fuzzy_circ_rnas(ccr_a, cr_threshold, tolerance): """ Get the circular RNA corresponding to each connected component of chimeric reads :ccr_a: array of ccr object :return array of circular RNAs """ if cumulative_cr(ccr_a) < cr_threshold: return [] n = len(ccr_a) # Neighbour cr share sign the same circular RNA neighbours = [] circ_rnas = [] for i in range(n): for j in range(i+1, n): if ccr_a[j].start > ccr_a[i].end: break # Two-by-two comparison of chimeric reads: if same_circ_signature(ccr_a[i], ccr_a[j], tolerance): neighbours.append((i, j)) circ_rnas = connected_components(ccr_a, neighbours, cr_threshold) return circ_rnas def get_independent_intervals(circ_rnas): # Get independant intervals independant_intervals = [] circ_interval = [] chrom = None for circ_rna in circ_rnas: if not chrom: chrom = circ_rna.chrom max_end = circ_rna.end if circ_rna.chrom != chrom: independant_intervals.append(circ_interval) circ_interval = [] chrom = circ_rna.chrom max_end = circ_rna.start if circ_rna.chrom != chrom or circ_rna.start > max_end: independant_intervals.append(circ_interval) circ_interval = [] chrom = circ_rna.chrom max_end = circ_rna.start max_end = max(max_end, circ_rna.end) circ_interval.append(circ_rna) independant_intervals.append(circ_interval) return independant_intervals def get_fuzzy_circrnas(circ_rnas, cr_threshold, tolerance): eprint("Fuzzy circular RNA detection") circ_rnas = natsort.natsorted(circ_rnas, key=lambda e: (e.chrom, e.start)) intervals = get_independent_intervals(circ_rnas) circ_rnas = [] for interval in intervals: fuzzy_circ_rnas = compute_fuzzy_circ_rnas(interval, cr_threshold, tolerance) if fuzzy_circ_rnas: circ_rnas.extend(fuzzy_circ_rnas) return circ_rnas def circrna_detection(circ_juntcions, cr_threshold, tolerance): """ Detecte circular RNAs :df_ccr: dataframe of circular chimeric reads :return the array of circular RNAs detected """ # First pass, detect exact circular RNAs (no tolerance) circ_rnas = get_exact_circrnas(circ_juntcions, cr_threshold) eprint("nb of tol %d circRNAS %d" % (tolerance, len(circ_rnas)))######### # if tolerance > 0 compute fuzzy circrnas from previous list if tolerance > 0: circ_rnas = get_fuzzy_circrnas(circ_rnas, cr_threshold, tolerance) else: circ_rnas = [c for c in circ_rnas if c.nb_ccr >= cr_threshold] return circ_rnas def write_tab_circ_rnas(circ_rnas, outputfile, fmt): """ Write the table of circ RNA retained :circ_rnas: array of circular RNAs object :outputfile: the output file path :return the table of circular RNAs retained :start Chimeric.out.junction file: last intron base in the reference genome order in the circular transcript :end Chimeric.out.junction file: first intron base in the reference genome order in the circular transcript :start GTF: first exon position :end GTF: last exon position """ nb_line = 0 # name bed file circ_rnas = natsort.natsorted(circ_rnas, key=lambda e: (e.chrom, e.start)) eprint(" "+"\n"+"Writing the table of circular RNAs..."+"\n") with open(outputfile, "w") as fout: eprint("\nConvertion the Chimeric.out.junction format to the %s format...\n" % fmt) for circ_rna in circ_rnas: # Convert Chimeric.out.junction format to out format: circ_rna.start += 1 circ_rna.end -= 1 # Convert out format to the chosen format: circ_rna.write_annot(nb_line, fout, fmt) nb_line += 1 def stats(min_cr, tol, nb_ccr_tot, nb_ccr, nb_circ_rna_detected, circ_rnas): nb_tot_cr_circ = sum([c.nb_ccr for c in circ_rnas]) return "%d\t%d\t%d\t%d\t%d\t%d" % (min_cr, tol, nb_ccr_tot, nb_ccr, nb_circ_rna_detected, nb_tot_cr_circ) def stats2(df): return df.groupby('nb_ccr')['nb_ccr'] \ .count().reset_index(name='nb_circ_rnas') def main(r1_input_file, r2_input_file, cr_threshold, tolerance, output_file): eprint("\nReading junction files...\n") df_junctions_a = [] if r1_input_file: df_junctions_a.append(read_junction_file(r1_input_file, read_type="R1")) if r2_input_file: df_junctions_a.append(read_junction_file(r2_input_file, read_type="R2")) eprint("Merging junction files...\n") df = merge_junctions(df_junctions_a) eprint("%d chimeric reads \n" % len(df.index)) # Filtering data: eprint("Selecting valid ccr...\n") circ_junctions = get_valid_circjunctions(df) eprint("%d circular chimeric reads" % len(circ_junctions)) # Circular RNAs detection: circ_rnas = circrna_detection(circ_junctions, cr_threshold, tolerance) #### 'circRNA' object has no attribute 'ccr_array' eprint("%d circRNAs detected" % len(circ_rnas)) #for cir in circ_rnas: # print(cir.var_info_str()) # Write the table of circular RNAs retained: circ.write_annotation(circ_rnas, output_file, fmt=args.output_file_format, attributes=["nb_ccr", "genomic_size", "left", "right", "complete", "nb_distinct"]) def parse_arguments(): parser = argparse.ArgumentParser(description='Identification of chimeric reads and circular RNAs') parser.add_argument('-r1', '--r1_input_file', required=False, help='R1 input file name') parser.add_argument('-r2', '--r2_input_file', required=False, help='R2 input file name') parser.add_argument('-min_cr', '--cr_threshold', type=int, required=False, default=5, help='Minimum number of chimeric reads') parser.add_argument('-tol', '--tolerance', type=int, required=False, default=3, help='Tolerance of different positions to start and end') parser.add_argument('-fmt', '--output_file_format', required=False, default='gtf', choices=["bed","gtf"], # bed ou gtf help='Output file format : 0-based (bed) or 1-based (gtf)') parser.add_argument('-o', '--output_file', required=False, default='circ_rnas_detected.out', help='Output file path') # gft par défaut # attribut : nb ccr, min_cr_distinct parser.add_argument('--verbose', help='Print more info', action='store_true') args = parser.parse_args() dsp_control = circ.DisplayControl.instance() dsp_control.verbose = args.verbose if not (args.r1_input_file or args.r2_input_file): parser.error('At least one input file is requested') return args if __name__ == '__main__': args = parse_arguments() main(args.r1_input_file, args.r2_input_file, args.cr_threshold, args.tolerance, args.output_file) |
4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 | import os import argparse import pandas as pd import numpy as np import re import sys from collections import defaultdict def read_file(file): """ Read the circ_rnas_annotation.out file and return a pandas dataframe""" header = ["chrom", "start", "end", "circ_name", "nb_ccr", "strand", "source", "feature", "phase", "attributes"] df = pd.read_table(file, sep='\t', names=header) df.replace(np.nan, "", inplace=True) return df def get_folder_to_create(df): folders_name = [] dict_spl_sp = dict(zip(df["sample"], df["species"])) split_keys = [x.split("_") for x in dict_spl_sp.keys()] for key in split_keys: folders_name.append(key[0]+"_"+key[1]) folders = list(set(folders_name)) return folders def create_folders(folders): for folder in folders: if not os.path.exists(folder): os.makedirs(folder) def filter_samples(df): df = df.drop(df[(df["sample"].isin(["ssc_liver_3", "ssc_liver_4", "ssc_muscle_1", "ssc_testis_1"]))].index) df = df.reset_index(drop=True) return df def main(): # Read samples.tsv file: df_samples = pd.read_csv(args.sample_file, sep='\t') # Remove ssc_liver_3/4, ssc_muscle_1 and ssc_testis_1 samples: df_filtered = filter_samples(df_samples) samples_folders = df_filtered["sample"].tolist() df_filtered = df_filtered.loc[df_filtered['sample'].isin(samples_folders)] # Get the folder list to create: folders = get_folder_to_create(df_filtered) # Create folders: create_folders(folders) # Group folders by species and tissue: df_grouped = df_filtered.groupby(["species", "tissue"]) for key,item in df_grouped: group = df_grouped.get_group(key) # Create col with path of folder: group["path_file_bed"] = group["sample"]+"/auzeville.bed" samples = group["sample"].tolist() samples_split = [s.split("_") for s in samples][0] folder_name = samples_split[0]+"_"+samples_split[1] paths = group["path_file_bed"].tolist() df_concat = pd.concat(read_file(f) for f in paths) df_concat = df_concat.sort_values(by=['chrom', 'start', 'end']) df_final = df_concat.groupby(['chrom', 'start', 'end', 'strand'], as_index=False)['nb_ccr'].sum() df_final.to_csv(folder_name+"/"+"auzeville.bed", sep="\t", index=False) def parse_arguments(): parser = argparse.ArgumentParser(description='Annotation') parser.add_argument('-sp', '--sample_file', required=True, help='Sample file path') parser.add_argument('-circ', '--circ_rna_file', required=False, help='Circular RNA file path') args = parser.parse_args() return args if __name__ == '__main__': args = parse_arguments() main() |
4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 | import os, re, sys, csv, argparse import circRNA as circ import pandas as pd import numpy as np # Utility functions: def eprint(*args, **kwargs): print(*args, file=sys.stderr, **kwargs) def get_sample(file): """ Get the sample name from the file path""" sample_name = os.path.split(os.path.dirname(file))[1] return sample_name def read_file(file): """ Read the circ_rnas_annotation.out file and return a pandas dataframe""" df = pd.read_table(file, sep='\t') df.replace(np.nan, "", inplace=True) return df def intersection(lst1, lst2): """ Return the intersection between two lists""" return list(set(lst1) & set(lst2)) def get_exonic_circrnas(df): """ Get statistics about exonic circRNAs from the annotation_circRNA.out file""" # Arrays, dicts: exonic_circ_names, monoexonic_circ_names, true_exonic, single_end_circ_names = [], [], [], [] ccr_start_end_exonic, ccr_single_annot = dict(), dict() # Counter exonic circRNAs: nb_monoexonic, nb_start_end_exonic, nb_antisens_exonic, nb_single_annotated_junction = 0, 0, 0, 0 for index, row in df.iterrows(): exon_id_start = row.exons_id_start.split(",") exon_start = str(exon_id_start[0]) exon_start_tmp = exon_start.split("_") exon_start = exon_start_tmp[0] exon_id_end = row.exons_id_end.split(",") exon_end = str(exon_id_end[0]) exon_end_tmp = exon_end.split("_") exon_end = exon_end_tmp[0] if row.nb_ccr >= 5: # if ((len(row.exons_id_start) > 0 or len(row.exons_id_end) > 0) # and len(row.intron_name) == 0): if len(row.exons_id_start) > 0 or len(row.exons_id_end) > 0: if row.strand == "+": if ("5" and "+") in row.exons_id_start: if len(row.exons_id_end)==0: nb_single_annotated_junction += 1 ccr_single_annot[row.circ_rna_name] = row.nb_ccr single_end_circ_names.append(row.circ_rna_name) if ("3" and "+") in row.exons_id_end: nb_start_end_exonic += 1 true_exonic.append(row) exonic_circ_names.append(row.circ_rna_name) ccr_start_end_exonic[row.circ_rna_name] = row.nb_ccr if ((len(exon_id_start)==1 and len(exon_id_end)==1) and (exon_start == exon_end)): nb_monoexonic += 1 monoexonic_circ_names.append(row.circ_rna_name) if ("3" and "+") in row.exons_id_end: if len(row.exons_id_start)==0: nb_single_annotated_junction += 1 ccr_single_annot[row.circ_rna_name] = row.nb_ccr single_end_circ_names.append(row.circ_rna_name) if ("3" and "-") in row.exons_id_start: if ("5" and "-") in row.exons_id_end: nb_antisens_exonic += 1 elif row.strand == "-": if ("5" and "-") in row.exons_id_end: if len(row.exons_id_start)==0: nb_single_annotated_junction += 1 ccr_single_annot[row.circ_rna_name] = row.nb_ccr single_end_circ_names.append(row.circ_rna_name) if ("3" and "-") in row.exons_id_start: nb_start_end_exonic += 1 true_exonic.append(row) exonic_circ_names.append(row.circ_rna_name) ccr_start_end_exonic[row.circ_rna_name] = row.nb_ccr if ((len(exon_id_start)==1 and len(exon_id_end)==1) and (exon_start == exon_end)): nb_monoexonic += 1 monoexonic_circ_names.append(row.circ_rna_name) if ("3" and "-") in row.exons_id_start: if len(row.exons_id_end)==0: nb_single_annotated_junction += 1 ccr_single_annot[row.circ_rna_name] = row.nb_ccr single_end_circ_names.append(row.circ_rna_name) if ("5" and "+") in row.exons_id_start: if ("3" and "+") in row.exons_id_end: nb_antisens_exonic += 1 return true_exonic, nb_start_end_exonic, nb_single_annotated_junction, single_end_circ_names, ccr_single_annot, monoexonic_circ_names, exonic_circ_names def get_intronic_circrnas(df, exonic_circrnas): # Arrays, dicts: intronic_circ_names, true_intronic = [], [] ccr_true_intronic = dict() # Counter intronic circRNAs: nb_true_intronic = 0 for index, row in df.iterrows(): if row.circ_rna_name not in exonic_circrnas: if row.nb_ccr >= 5: if len(row.intron_name) > 0: if row.strand == "+": if (row.end_i - row.end) in range(-5,60): if (row.start - row.start_i) in range(-5,5) or (row.start==row.start_i): nb_true_intronic += 1 true_intronic.append(row) ccr_true_intronic[row.circ_rna_name] = row.nb_ccr intronic_circ_names.append(row.circ_rna_name) elif (row.start == row.start_i and (row.end_i - row.end) > 32): nb_true_intronic += 1 true_intronic.append(row) ccr_true_intronic[row.circ_rna_name] = row.nb_ccr intronic_circ_names.append(row.circ_rna_name) elif row.strand == "-": if (row.start - row.start_i) in range(-5,60): if ((row.end - row.end_i) in range(-5,5) or (row.end == row.end_i)): nb_true_intronic += 1 true_intronic.append(row) ccr_true_intronic[row.circ_rna_name] = row.nb_ccr intronic_circ_names.append(row.circ_rna_name) elif (row.end == row.end_i and (row.start - row.start_i) > 60): nb_true_intronic += 1 true_intronic.append(row) ccr_true_intronic[row.circ_rna_name] = row.nb_ccr intronic_circ_names.append(row.circ_rna_name) return true_intronic, nb_true_intronic, ccr_true_intronic, intronic_circ_names def get_subexonic_circrnas(df, monoexonic_circ_names, intronic_circ_names): # Subexonic_meg : sens + antisens # Subexonic_pleg : sens # Arrays, dicts: infraexonic_tot_names, infraexonic_circ_names, subexonic, subexonic_antisens_names = [], [], [], [] ccr_infraexonic = dict() # Counter subexonic circRNAs: nb_infraexonic, nb_infraexonic_tot, nb_infraexonic_sens, nb_infraexonic_antisens = 0, 0, 0, 0 for index, row in df.iterrows(): if row.nb_ccr >= 5: if ((len(row.gene_id_ife) > 0) and (row.circ_rna_name not in monoexonic_circ_names)): if (("_5_c_+" not in row.exons_id_start) and ("_5_lnc_+" not in row.exons_id_start) and ("_3_c_-" not in row.exons_id_start) and ("_3_lnc_-" not in row.exons_id_start) and ("_3_c_+" not in row.exons_id_end) and ("_3_lnc_+" not in row.exons_id_end) and ("_5_c_-" not in row.exons_id_end) and ("_5_lnc_-" not in row.exons_id_end)): if len(row.gene_id_start) == 0 and len(row.gene_id_end) == 0: nb_infraexonic_tot += 1 infraexonic_tot_names.append(row.circ_rna_name) if row.circ_rna_name not in intronic_circ_names: if row.strand == "+": if "+" in row.gene_id_ife: nb_infraexonic_sens += 1 subexonic.append(row) ccr_infraexonic[row.circ_rna_name] = row.nb_ccr infraexonic_circ_names.append(row.circ_rna_name) elif "-" in row.gene_id_ife: nb_infraexonic_antisens += 1 subexonic.append(row) infraexonic_circ_names.append(row.circ_rna_name) subexonic_antisens_names.append(row.circ_rna_name) elif row.strand == "-": if "-" in row.gene_id_ife: nb_infraexonic_sens += 1 subexonic.append(row) infraexonic_circ_names.append(row.circ_rna_name) ccr_infraexonic[row.circ_rna_name] = row.nb_ccr elif "+" in row.gene_id_ife: nb_infraexonic_antisens += 1 subexonic.append(row) infraexonic_circ_names.append(row.circ_rna_name) subexonic_antisens_names.append(row.circ_rna_name) elif ((len(row.gene_id_start) > 0 and len(row.gene_id_end) == 0) or (len(row.gene_id_start) == 0 and len(row.gene_id_end) > 0)): nb_infraexonic_tot += 1 infraexonic_tot_names.append(row.circ_rna_name) if row.circ_rna_name not in intronic_circ_names: if row.strand == "+": if "+" in row.gene_id_ife: nb_infraexonic_sens += 1 subexonic.append(row) infraexonic_circ_names.append(row.circ_rna_name) ccr_infraexonic[row.circ_rna_name] = row.nb_ccr elif "-" in row.gene_id_ife: nb_infraexonic_antisens += 1 subexonic.append(row) infraexonic_circ_names.append(row.circ_rna_name) subexonic_antisens_names.append(row.circ_rna_name) elif row.strand == "-": if "-" in row.gene_id_ife: nb_infraexonic_sens += 1 subexonic.append(row) infraexonic_circ_names.append(row.circ_rna_name) ccr_infraexonic[row.circ_rna_name] = row.nb_ccr elif "+" in row.gene_id_ife: nb_infraexonic_antisens += 1 subexonic.append(row) infraexonic_circ_names.append(row.circ_rna_name) subexonic_antisens_names.append(row.circ_rna_name) return subexonic, infraexonic_circ_names, nb_infraexonic_sens, ccr_infraexonic, subexonic_antisens_names def get_circrnas(df): """ Read the annotation_circRNA.out dataframe and return all statistics to tabular format about circRNAs""" # Get exonic circRNAs: stats_exonic_circrnas = get_exonic_circrnas(df) exonic = stats_exonic_circrnas[0] monoexonic_names = stats_exonic_circrnas[5] exonic_names = stats_exonic_circrnas[6] # Get intronic circRNAs: stats_intronic_circrnas = get_intronic_circrnas(df, exonic_names) intronic = stats_intronic_circrnas[0] intronic_names = stats_intronic_circrnas[3] # Get subexonic circRNAs: stats_subexonic_circrnas = get_subexonic_circrnas(df, monoexonic_names, intronic_names) subexonic = stats_subexonic_circrnas[0] subexonic_antisens = stats_subexonic_circrnas[4] return exonic, subexonic, intronic, subexonic_antisens def get_stats_circrnas(sample, df, output_file_name): """ Read the annotation_circRNA.out dataframe and return all statistics to tabular format about circRNAs""" nb_circ_tot = len(df) # Get circRNAs: exonic = get_circrnas(df)[0] subexonic = get_circrnas(df)[1] subexonic_antisens = get_circrnas(df)[3] intronic = get_circrnas(df)[2] # Get exonic circRNAs stats: stats_exonic = write_comparison_exonic_table(sample, exonic, output_file_name) # Get subexonic circRNAs stats: nb_subexonic = get_nb_type(sample, subexonic, subexonic_antisens)[0] nb_gene_subexonic = get_nb_type(sample, subexonic, subexonic_antisens)[1] nb_subexonic_pleg = get_nb_type(sample, subexonic, subexonic_antisens)[2] nb_subexonic_meg = get_nb_type(sample, subexonic, subexonic_antisens)[3] nb_ccr_pleg = get_nb_type(sample, subexonic, subexonic_antisens)[4] nb_ccr_meg = get_nb_type(sample, subexonic, subexonic_antisens)[5] if len(stats_exonic) > 5: nb_c = stats_exonic[0] nb_lnc = stats_exonic[1] nb_autres = stats_exonic[2] nb_start_end_exonic_selected = stats_exonic[3] nb_ccr_start_end_exonic = stats_exonic[4] nb_exonic = write_comparison_exonic_table(sample, exonic, args.output_comp_exonic_file)[5] else: nb_c = stats_exonic[0] nb_lnc = 0 nb_autres = stats_exonic[1] nb_start_end_exonic_selected = stats_exonic[2] nb_ccr_start_end_exonic = stats_exonic[3] nb_exonic = write_comparison_exonic_table(sample, exonic, args.output_comp_exonic_file)[4] nb_start_end_exonic = get_exonic_circrnas(df)[1] nb_single_annotated_junction = get_exonic_circrnas(df)[2] monoexonic_circ_names = get_exonic_circrnas(df)[5] exonic_names = get_exonic_circrnas(df)[6] intronic_circ_names = get_intronic_circrnas(df, exonic_names)[3] infraexonic_circ_names = get_subexonic_circrnas(df, monoexonic_circ_names, intronic_circ_names)[1] single_end_circ_names = get_exonic_circrnas(df)[3] nb_infraexonic_sens = get_subexonic_circrnas(df, monoexonic_circ_names, intronic_circ_names)[2] nb_true_intronic = get_intronic_circrnas(df, exonic_names)[1] ccr_single_annot = get_exonic_circrnas(df)[4] ccr_true_intronic = get_intronic_circrnas(df, exonic_names)[2] ccr_infraexonic = get_subexonic_circrnas(df, monoexonic_circ_names, intronic_circ_names)[3] nb_start_end_false_exonic = nb_start_end_exonic - nb_start_end_exonic_selected nb_tot_exonic = nb_start_end_exonic + nb_single_annotated_junction nb_common_infra_single = len(intersection(infraexonic_circ_names, single_end_circ_names)) nb_infraexonic = nb_infraexonic_sens - nb_common_infra_single nb_circ_non_annotated = nb_circ_tot - (nb_tot_exonic + nb_infraexonic + nb_true_intronic - nb_start_end_false_exonic) nb_ccr_single_annot = sum(ccr_single_annot.values()) nb_ccr_intronic = sum(ccr_true_intronic.values()) nb_ccr_infraexonic = sum(ccr_infraexonic.values()) return "\t".join(map(str,[sample, nb_circ_tot, nb_exonic, nb_lnc, nb_autres, nb_ccr_start_end_exonic, nb_subexonic_pleg, nb_ccr_pleg, nb_subexonic_meg, nb_ccr_meg, nb_gene_subexonic, nb_true_intronic, nb_ccr_intronic]))+"\n" def write_stats_table(stats, output_file): with open(output_file, "w") as fout: fout.write(stats) def write_circ_table(self, header, path, index=None, sep="\t", na_rep='', float_format=None, index_label=None, mode='w', encoding=None, date_format=None, decimal='.'): """ Write a circRNAs list to a tabular-separated file (tsv). """ from pandas.core.frame import DataFrame df = DataFrame(self) # result is only a string if no path provided, otherwise None result = df.to_csv(path, index=index, sep=sep, na_rep=na_rep, float_format=float_format, header=header, index_label=index_label, mode=mode, encoding=encoding, date_format=date_format, decimal=decimal) if path is None: return result def write_comparison_exonic_table(sample, circ_rnas, output_file_name): # Write exonic table for comparaison between tissues/species: df_circ_rnas = pd.DataFrame(circ_rnas, index=None) header = ["chrom:start-end,strand", "nb_ccr", "gene_id", "biotype", "genomic_size"] nb_start_end_annotated, nb_exonic = 0, 0 nb_ccr_exonics, biotypes = [], [] with open(output_file_name, 'w') as fout: tsv_writer = csv.writer(fout, delimiter='\t') tsv_writer.writerow(header) for index, row in df_circ_rnas.iterrows(): exons_id_start_a, exons_id_end_a = [], [] # Select well-annotated true exonic circRNAs bases on the gene_id: # Get key: chrom_start = ":".join(map(str,[row.chrom, row.start])) chrom_start_end = "-".join(map(str, [chrom_start , row.end])) chrom_start_end_strand = ",".join(map(str, [chrom_start_end , row.strand])) # Get the ccr number: nb_ccr = row.nb_ccr nb_ccr_exonics.append(nb_ccr) # Get gene_id: genes_id_start = list(set(list(row.gene_id_start.split(",")))) genes_id_end = list(set(list(row.gene_id_end.split(",")))) # Get biotype: exons_id_start = list(set(list(row.exons_id_start.split(",")))) exons_id_end = list(set(list(row.exons_id_end.split(",")))) exons_id_start = list(i.split("_") for i in exons_id_start) exons_id_end = list(i.split("_") for i in exons_id_end) intersect_gene_id = ",".join(list(set(genes_id_start).intersection(genes_id_end))) if len(intersect_gene_id) > 0: nb_start_end_annotated += 1 # Get biotype: biotypes_start = list(set(list(i[2] for i in exons_id_start))) biotypes_end = list(set(list(i[2] for i in exons_id_end))) intersect_biotypes = "".join(list(set(biotypes_start).intersection(biotypes_end))) # Write line into the output file: nb_exonic += 1 genomic_size = row.end - row.start s = [chrom_start_end_strand, nb_ccr, intersect_gene_id, intersect_biotypes, genomic_size] tsv_writer.writerow(s) biotypes.append(intersect_biotypes) # Dictionary with key = biotype and value = counter: d = {} d["sample"] = sample for biotype in biotypes: d[biotype] = d.get(biotype, 0) + 1 values = [] valid_keys = ["c", "lnc", "sample"] for key, value in d.items(): if key not in valid_keys: values.append(value) d["autres"] = sum(values) nb_ccr_exonic = sum(nb_ccr_exonics) for key, value in d.items(): if ("c" in d.keys()) and ("lnc" in d.keys()): return d["c"], d["lnc"], d["autres"], nb_start_end_annotated, nb_ccr_exonic, nb_exonic elif ("c" in d.keys() and "lnc" not in d.keys()): return d["c"], d["autres"], nb_start_end_annotated, nb_ccr_exonic, nb_exonic def compute_size(exons_start_end): pos_start_end = list(exons_start_end.split("_")) pos_start = int(pos_start_end[0]) pos_end = int(pos_start_end[1]) size = pos_end - pos_start return size def get_true_exons_gene_id(exons_start_end, exons_id, gene_id): if len(exons_start_end)==1: exon_start_end = ",".join(exons_start_end) exon_id = ",".join(exons_id) gene_id = gene_id else: annot = dict(zip(exons_start_end, exons_id)) sizes = [] for key in annot: size = compute_size(key) sizes.append(size) values = zip(exons_start_end, sizes) annot_f = dict(zip(exons_id, values)) min_size = min([i[1] for i in list(annot_f.values())]) for key, values in annot_f.items(): if values[1] == min_size: exon_id = key exon_start_end = values[0] return exon_start_end, exon_id, gene_id def get_nb_type(sample, circ_rnas, subexonic_antisens): # Write exonic table for comparaison between tissues/species: df_circ_rnas = pd.DataFrame(circ_rnas, index=None) nb_sub_exonic, nb_sub_exonic_pleg, nb_sub_exonic_meg = 0, 0, 0 subexonic_genes, nb_ccr_pleg, nb_ccr_meg = [], [], [] for index, row in df_circ_rnas.iterrows(): # Select well-annotated subexonic circRNAs according to the gene_id: # Get gene_id: genes_id = list(set(list(row.gene_id_ife.split(",")))) genes_id = list(i.split("_") for i in genes_id) exons_start_end = sorted(set(row.exons_start_end_ife.split(",")), key=row.exons_start_end_ife.split(',').index) exon_id = sorted(set(row.exon_id_ife.split(",")), key=row.exon_id_ife.split(',').index) gene_id = ",".join(list(set(list(row.gene_id_ife.split(","))))) biotypes = [] exon_start_end = get_true_exons_gene_id(exons_start_end, exon_id, gene_id)[0] exon_id = get_true_exons_gene_id(exons_start_end, exon_id, gene_id)[1] gene_id = get_true_exons_gene_id(exons_start_end, exon_id, gene_id)[2] row.exons_start_end_ife = exon_start_end row.exon_id_ife = exon_id row.gene_id_ife = gene_id if len(genes_id)==1: for i in genes_id: # Subexonic pleg circRNAs: if ('c' in i or 'lnc' in i or 'pseudo' in i) and (row.circ_rna_name not in subexonic_antisens): nb_sub_exonic += 1 nb_sub_exonic_pleg += 1 nb_ccr_pleg.append(row.nb_ccr) subexonic_genes.append(gene_id) # Subexonic meg circRNAs: elif 'c' not in i and 'lnc' not in i and 'pseudo' not in i and 'rRNA' not in i: nb_sub_exonic += 1 nb_sub_exonic_meg += 1 nb_ccr_meg.append(row.nb_ccr) subexonic_genes.append(gene_id) elif 'rRNA' in i: pass if len(genes_id) > 1: for i in genes_id: if len(i) == 3: biotypes.append(i[2]) biotypes = list(set(biotypes)) elif len(i) > 3: p_biotypes = [i[2],i[3]] biotypes = [p for p in p_biotypes] if 'rRNA' in biotypes: pass # Subexonic pleg circRNAs: elif ((biotypes == ['c'] or biotypes == ['lnc'] or biotypes == ['pseudo'] or ('c' and 'pseudo' in biotypes) or ('c' and 'lnc' in biotypes) or ('lnc' and 'pseudo' in biotypes)) and (row.circ_rna_name not in subexonic_antisens)): nb_sub_exonic += 1 nb_sub_exonic_pleg += 1 nb_ccr_pleg.append(row.nb_ccr) subexonic_genes.append(gene_id) # Subexonic meg circRNAs: else: nb_sub_exonic += 1 nb_sub_exonic_meg += 1 nb_ccr_meg.append(row.nb_ccr) subexonic_genes.append(gene_id) nb_subexonic_genes = len(list(set([val for sublist in subexonic_genes for val in sublist]))) nb_ccr_p = sum(nb_ccr_pleg) nb_ccr_m = sum(nb_ccr_meg) return nb_sub_exonic, nb_subexonic_genes, nb_sub_exonic_pleg, nb_sub_exonic_meg, nb_ccr_p, nb_ccr_m def write_subexonic_tables(sample, circ_rnas, subexonic_antisens, output_file_name_meg, output_file_name_pleg): # Write exonic table for comparaison between tissues/species: df_circ_rnas = pd.DataFrame(circ_rnas, index=None) nb_sub_exonic, nb_sub_exonic_pleg, nb_sub_exonic_meg = 0, 0, 0 subexonic_genes, nb_ccr_pleg, nb_ccr_meg = [], [], [] header = list(df_circ_rnas.columns.values.tolist()) with open(output_file_name_meg, 'w') as fout_meg, open(output_file_name_pleg, 'w') as fout_pleg: # Subexonic pleg circRNAs: tsv_writer_pleg = csv.writer(fout_pleg, delimiter='\t') tsv_writer_pleg.writerow(header) # Subexonic meg circRNAs: tsv_writer_meg = csv.writer(fout_meg, delimiter='\t') tsv_writer_meg.writerow(header) for index, row in df_circ_rnas.iterrows(): # Select well-annotated subexonic circRNAs according to the gene_id: # Get gene_id: genes_id = list(set(list(row.gene_id_ife.split(",")))) genes_id = list(i.split("_") for i in genes_id) exons_start_end = sorted(set(row.exons_start_end_ife.split(",")), key=row.exons_start_end_ife.split(',').index) exon_id = sorted(set(row.exon_id_ife.split(",")), key=row.exon_id_ife.split(',').index) gene_id = ",".join(list(set(list(row.gene_id_ife.split(","))))) biotypes = [] exon_start_end = get_true_exons_gene_id(exons_start_end, exon_id, gene_id)[0] exon_id = get_true_exons_gene_id(exons_start_end, exon_id, gene_id)[1] gene_id = get_true_exons_gene_id(exons_start_end, exon_id, gene_id)[2] row.exons_start_end_ife = exon_start_end row.exon_id_ife = exon_id row.gene_id_ife = gene_id if len(genes_id)==1: for i in genes_id: # Subexonic pleg circRNAs: if ('c' in i or 'lnc' in i or 'pseudo' in i) and (row.circ_rna_name not in subexonic_antisens): nb_sub_exonic += 1 nb_sub_exonic_pleg += 1 nb_ccr_pleg.append(row.nb_ccr) subexonic_genes.append(gene_id) s = row tsv_writer_pleg.writerow(s) # Subexonic meg circRNAs: elif 'c' not in i and 'lnc' not in i and 'pseudo' not in i and 'rRNA' not in i: nb_sub_exonic += 1 nb_sub_exonic_meg += 1 nb_ccr_meg.append(row.nb_ccr) subexonic_genes.append(gene_id) s = row tsv_writer_meg.writerow(s) elif 'rRNA' in i: pass if len(genes_id) > 1: for i in genes_id: if len(i) == 3: biotypes.append(i[2]) biotypes = list(set(biotypes)) elif len(i) > 3: p_biotypes = [i[2],i[3]] biotypes = [p for p in p_biotypes] if 'rRNA' in biotypes: pass # Subexonic pleg circRNAs: elif ((biotypes == ['c'] or biotypes == ['lnc'] or biotypes == ['pseudo'] or ('c' and 'pseudo' in biotypes) or ('c' and 'lnc' in biotypes) or ('lnc' and 'pseudo' in biotypes)) and (row.circ_rna_name not in subexonic_antisens)): nb_sub_exonic += 1 nb_sub_exonic_pleg += 1 nb_ccr_pleg.append(row.nb_ccr) subexonic_genes.append(gene_id) s = row tsv_writer_pleg.writerow(s) # Subexonic meg circRNAs: else: nb_sub_exonic += 1 nb_sub_exonic_meg += 1 nb_ccr_meg.append(row.nb_ccr) subexonic_genes.append(gene_id) s = row tsv_writer_meg.writerow(s) nb_subexonic_genes = len(list(set([val for sublist in subexonic_genes for val in sublist]))) nb_ccr_p = sum(nb_ccr_pleg) nb_ccr_m = sum(nb_ccr_meg) return nb_sub_exonic, nb_subexonic_genes, nb_sub_exonic_pleg, nb_sub_exonic_meg, nb_ccr_p, nb_ccr_m def write_circrnas_tables(sample, df, exonic_circrnas, intronic_circrnas, subexonic_circrnas, subexonic_antisens): header = list(df.columns.values.tolist()) if len(exonic_circrnas)>0: # Write the exonic circRNAs table: write_circ_table(exonic_circrnas, header, args.output_exonic_file) else: with open(args.output_exonic_file, 'w') as fout: tsv_writer = csv.writer(fout, delimiter='\t') tsv_writer.writerow(header) if len(intronic_circrnas)>0: # Write the intronic circRNAs table: write_circ_table(intronic_circrnas, header, args.output_intronic_file) else: with open(args.output_intronic_file, 'w') as fout: tsv_writer = csv.writer(fout, delimiter='\t') tsv_writer.writerow(header) eprint("subexonic_circrnas", len(subexonic_circrnas)) eprint("subexonic_antisens", len(subexonic_antisens)) if len(subexonic_circrnas)>0 or len(subexonic_antisens)>0: # Write the subexonic circRNAs tables: write_subexonic_tables(sample, subexonic_circrnas, subexonic_antisens, args.output_subexonic_meg_file, args.output_subexonic_pleg_file) else: eprint("ok") open(args.output_subexonic_meg_file, 'a').close() open(args.output_subexonic_pleg_file, 'a').close() def main(): # Get sample name: sample = get_sample(args.input_file) # Read the circRNAs annotation file: df_circ_annot = read_file(args.input_file) # Get the list of exonic, subexonic and intronic circRNAs: exonic_circrnas = get_circrnas(df_circ_annot)[0] subexonic_circrnas = get_circrnas(df_circ_annot)[1] intronic_circrnas = get_circrnas(df_circ_annot)[2] subexonic_antisens = get_circrnas(df_circ_annot)[3] # Write exonic, intronic and subexonic circRNAs tables: circrnas_tables = write_circrnas_tables(sample, df_circ_annot, exonic_circrnas, intronic_circrnas, subexonic_circrnas, subexonic_antisens) # Compute statistics about exonic, subexonic and intronic circRNAs: stats = get_stats_circrnas(sample, df_circ_annot, args.output_comp_exonic_file) # Write the circRNAs statistics table: write_stats_table(stats, args.output_stats_file) def parse_arguments(): parser = argparse.ArgumentParser(description='Return annotation statistics tables') parser.add_argument('-i', '--input_file', required=True, help='Annotation circRNAs file') parser.add_argument('-o_stats', '--output_stats_file', required=False, default="stats_annotation.tsv", help='Table containing statistics of annotation') parser.add_argument('-oi', '--output_intronic_file', required=False, default="intronic_circRNAs.tsv", help='Table containing intronic circRNAs') parser.add_argument('-oe', '--output_exonic_file', required=False, default="exonic_circRNAs.tsv", help='Table containing exonic circRNAs') parser.add_argument('-oce', '--output_comp_exonic_file', required=False, default="exonic_summary.tsv", help='Table containing exonics circRNAs for comparisons') parser.add_argument('-osepleg', '--output_subexonic_pleg_file', required=False, default="subexonic_pleg_circRNAs.tsv", help='Table containing subexonic circRNAs') parser.add_argument('-osemeg', '--output_subexonic_meg_file', required=False, default="subexonic_meg_circRNAs.tsv", help='Table containing subexonic circRNAs') args = parser.parse_args() return args if __name__ == '__main__': args = parse_arguments() main() |
5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 | import os import argparse import pandas as pd import numpy as np import re import sys # Utility functions def create_folder(path): if not os.path.exists(path): os.mkdir(path) def eprint(*args, **kwargs): print(*args, file=sys.stderr, **kwargs) def read_file(file): """ Read the sample file containing path files and return a pandas dataframe""" sample = pd.read_csv(file, sep='\t', dtype=str) return sample def check_mapdirs(mapdirs): """ Check if the mapdir exists or not""" for mapdir in mapdirs: if not os.path.exists(mapdir): raise Exception("WARNING following mapdir is missing %s" % mapdir) def read_log_file(sample): """ Read and clean the Log.final.out files containing the summary mapping statistics after mapping job is complete and return an array of the pandas dataframe containing statistics""" stats = [] d = {} reads = ["R1", "R2"] paths_out = [] for index, row in sample.iterrows(): sample_name = row["sample"] sample_unit_name = row["sample_unit"] path = row["mapdir"] for read in reads: path_file = path+"/se/"+read+"/Log.final.out" f = open(path_file, "r") for line in f: l = line.split("|") stats.append(l) for stat in stats: if len(stat) == 1: stats.remove(stat) for stat in stats: stat[0] = ' '.join(stat[0].split("\t")) stat[0] = ' '.join(stat[0].split()) stat[1] = ' '.join(stat[1].split()) stat[1] = stat[1].replace("%", "") d['sample'] = sample_name d['sample_unit'] = sample_unit_name d['read'] = read d[stat[0]] = stat[1] f.close() create_folder("reports/"+sample_unit_name+"/") create_folder("reports/"+sample_unit_name+"/"+read+"/") path_out = "reports/"+sample_unit_name+"/"+read+"/"+args.output_file paths_out.append(path_out) write_sample_file(d, path_out) return paths_out def write_sample_file(dict, output_file): """Get value from a statistic dictionnary and return a pandas object""" df = pd.DataFrame([dict]) df.to_csv(output_file, sep="\t", index=False) def write_final_stat_tab(paths_out, output_file): df_final = pd.DataFrame() for path in paths_out: df = pd.read_csv(path, sep="\t") df_final = pd.concat([df_final, df], ignore_index=True, sort =False) df_final.to_csv(output_file, sep="\t", index=False) def main(): # Read the sample file: sample = read_file(args.input_file) # Check the mapdir of each sample exists or not: check_mapdirs(sample["mapdir"]) create_folder("reports") # Read log files and get the path of a table containing statistics: paths_out = read_log_file(sample) # Write the statistic table with all samples: write_final_stat_tab(paths_out, args.output_file) def parse_arguments(): parser = argparse.ArgumentParser(description='Sample file') parser.add_argument('-i', '--input_file', required=True, help='Sample file') parser.add_argument('-o', '--output_file', required=False, default="mapping_stat.tsv", help='Sample file') args = parser.parse_args() return args if __name__ == '__main__': args = parse_arguments() main() |
4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 | import os, re, sys, csv, argparse import pandas as pd import numpy as np # Utility functions: def eprint(*args, **kwargs): print(*args, file=sys.stderr, **kwargs) def read_file(file): """ Read the circRNAs annotation file and return a pandas dataframe""" df = pd.read_table(file, sep='\t') df.replace(np.nan, "", inplace=True) return df def get_item(item): return ''.join(list(set(item))) def write_summary_meg_table(df, output_file_name, min_size): header = ["gene_name", "biotype", "nb_circ_sens", "nb_circ_antisens", "nb_ccr", "exon_name", "start_exon", "end_exon"] with open(output_file_name, 'w') as fout: tsv_writer = csv.writer(fout, delimiter='\t') tsv_writer.writerow(header) if len(df) > 0: for key, item in df: exon_start = get_item(item.exons_start_end_ife).split("_")[0] exon_end = get_item(item.exons_start_end_ife).split("_")[1] if int(exon_end) - int(exon_start) > min_size: strand = list(item.strand) gene_id = list(key.split("_"))[0] strand_gene_ife = list(key.split("_"))[1] # Get biotype: if len(list(key.split("_"))) == 3: biotype = list(key.split("_"))[2] else: biotype = list(key.split("_"))[2]+"_"+list(key.split("_"))[3] nb_circ_rna = len(df.get_group(key)) nb_circ_sens = strand.count(strand_gene_ife) nb_circ_antisens = nb_circ_rna - nb_circ_sens nb_ccr = sum(item.nb_ccr) exon_name = get_item(item.exon_id_ife) # print(df.get_group(key), "\n\n") row = [gene_id, biotype, nb_circ_sens, nb_circ_antisens, nb_ccr, exon_name, exon_start, exon_end] tsv_writer.writerow(row) def write_summary_intronic_table(df, output_file_name, min_size): header = ["gene_name", "biotype", "intron_name", "nb_circ_rna", "nb_ccr", "start_intron", "end_intron"] with open(output_file_name, 'w') as fout: tsv_writer = csv.writer(fout, delimiter='\t') tsv_writer.writerow(header) if len(df) > 0: for key, item in df: start_i = ''.join([str(round(x)) for x in list(set(item.start_i))]) end_i = ''.join([str(round(x)) for x in list(set(item.end_i))]) if int(end_i) - int(start_i) > min_size: intron_name = key gene_id = get_item(item.gene_id_i).split("_")[0] # Get biotype: biotype = get_item(item.gene_id_i).split(",") if len(biotype)>1: biotype = biotype[0].split("_")[4] else: biotype = ''.join(biotype).split("_")[4] nb_ccr = sum(item.nb_ccr) nb_circ_rna = len(df.get_group(key)) row = [gene_id, biotype, intron_name, nb_circ_rna, nb_ccr, start_i, end_i] tsv_writer.writerow(row) def write_summary_pleg_table(df, output_file_name, min_size): header = ["gene_name", "nb_gene", "biotype", "nb_circ_sens", "nb_circ_antisens", "nb_ccr", "exon_name", "start_exon", "end_exon"] with open(output_file_name, 'w') as fout: tsv_writer = csv.writer(fout, delimiter='\t') tsv_writer.writerow(header) if len(df) > 0: for key, item in df: exon_start = get_item(item.exons_start_end_ife).split("_")[0] exon_end = get_item(item.exons_start_end_ife).split("_")[1] if int(exon_end) - int(exon_start) > min_size: # Get gene_id and biotype: gene_id_ife = get_item(item.gene_id_ife).split(",") nb_gene = len(gene_id_ife) if len(gene_id_ife) > 1: gene_id = gene_id_ife[0].split("_")[0]+","+gene_id_ife[1].split("_")[0] strand_gene_ife = get_item(gene_id_ife[0].split("_")[1]+gene_id_ife[1].split("_")[1]) biotype = get_item(gene_id_ife[0].split("_")[2]+gene_id_ife[1].split("_")[2]) else: gene_id = ''.join(gene_id_ife).split("_")[0] strand_gene_ife = ''.join(gene_id_ife).split("_")[1] biotype = ''.join(gene_id_ife).split("_")[2] strand = list(item.strand) nb_circ_rna = len(df.get_group(key)) nb_circ_sens = strand.count(strand_gene_ife) nb_circ_antisens = nb_circ_rna - nb_circ_sens nb_ccr = sum(item.nb_ccr) exon_name = key row = [gene_id, nb_gene, biotype, nb_circ_sens, nb_circ_antisens, nb_ccr, exon_name, exon_start, exon_end] tsv_writer.writerow(row) def check_file(file): with open(file, 'r') as f: if len(f.read()) <= 1: result="Empty" else: result="Not empty" return result def main(): # Get min_size: min_size = int(args.min_size) # Read input annotation circRNAs files: if check_file(args.input_meg_file) == "Not empty": df_meg = read_file(args.input_meg_file) # Meg circRNAs: ## Group by gene: df_meg = df_meg.groupby('gene_id_ife') write_summary_meg_table(df_meg, args.output_meg_file, min_size) else: open(args.output_meg_file, 'a').close() if check_file(args.input_intronic_file) == "Not empty": df_intronic = read_file(args.input_intronic_file) # Intronic circRNAs: ## Group by intron name: df_intronic = df_intronic.groupby('intron_name') write_summary_intronic_table(df_intronic, args.output_intronic_file, min_size) else: open(args.output_intronic_file, 'a').close() if check_file(args.input_pleg_file) == "Not empty": df_pleg = read_file(args.input_pleg_file) # Pleg circRNAs: ## Group by exon: df_pleg = df_pleg.groupby('exon_id_ife') write_summary_pleg_table(df_pleg, args.output_pleg_file, min_size) else: open(args.output_pleg_file, 'a').close() def parse_arguments(): parser = argparse.ArgumentParser(description='Return summary circRNAs tables') parser.add_argument('-ip', '--input_pleg_file', required=True, help='Annotation pleg circRNAs file') parser.add_argument('-im', '--input_meg_file', required=True, help='Annotation meg circRNAs file') parser.add_argument('-ii', '--input_intronic_file', required=True, help='Annotation intronic circRNAs file') parser.add_argument('-op', '--output_pleg_file', required=False, default="pleg_summary.tsv", help='Table containing summary of pleg circRNAs annotation') parser.add_argument('-om', '--output_meg_file', required=False, default="meg_summary.tsv", help='Table containing summary of meg circRNAs annotation') parser.add_argument('-oi', '--output_intronic_file', required=False, default="intronic_summary.tsv", help='Table containing summary of intronic circRNAs annotation') parser.add_argument('-ms', '--min_size', required=False, default=55, help='Minimum size of circRNAs') args = parser.parse_args() return args if __name__ == '__main__': args = parse_arguments() main() |
86 87 | shell: "find {input} -type f -empty -delete" |
95 96 | shell: "cat {{bta,oar,ssc}}_*/*_exonic_summary.tsv | cut -f1,3,4 |tail -n+2|sort|uniq > {output}" |
113 114 115 116 | shell: "python3 ../scripts/summary_table.py -ip {input.pleg} -im {input.meg} -ii {input.intronic}" " -op {output.pleg} -om {output.meg} -oi {output.intronic} -ms {params.min_size}" " 1>{log.stdout} 2>{log.stderr}" |
124 125 | shell: "cat {input} >> {output}" |
141 142 143 144 145 | shell: "python3 ../scripts/stats_annotation.py -i {input} -o_stats {output.stats}" " -oi {output.intronic} -oe {output.exonic} -oce {output.comp_exonic} -osepleg {output.sub_exonic_pleg}" " -osemeg {output.sub_exonic_meg}" " 1>{log.stdout} 2>{log.stderr}" |
157 158 159 | shell: "python3 ../scripts/circRNA_annotation.py -circ {input.circ_detected}" " -annot {input.annot_exon} -fmt bed -o {output} 1>{log.stdout} 2>{log.stderr}" |
167 168 | shell: "python3 ../scripts/cumul_bed.py -sp {input} > {log}" |
180 181 182 183 184 | shell: "if grep -q 'Nreads' {input.R1}; then head -n -2 {input.R1} > {input.R1}2; mv {input.R1}2 {input.R1}; fi ;" " if grep -q 'Nreads' {input.R2}; then head -n -2 {input.R2} > {input.R2}2; mv {input.R2}2 {input.R2}; fi ;" " python3 ../scripts/circRNA_detection.py -r1 {input.R1} -r2 {input.R2}" " -min_cr {params.min_ccr} -tol 0 -fmt bed -o {output} 1>{log.stdout} 2>{log.stderr}" |
191 192 | shell: "python3 ../scripts/stats_mapping.py -i {input} -o {output}" |
Support
- Future updates
Related Workflows





