Automatic identification, classification and annotation of plant lncRNAs in transcriptomic sequences
Introduction
Long non-coding RNAs (lncRNAs) are RNA molecules longer than 200 nucleotides that do not encode proteins. Experimental studies have shown the diversity and importance of lncRNA functions in plants. To expand knowledge of lncRNAs in further species, computational pipelines have recently been developed that standardise the data-processing steps and run without user intervention up to the final result, widening the functionality available for lncRNA identification and analysis. In the present work, we propose the ICAnnoLncRNA pipeline for automatic identification, classification and annotation of plant lncRNAs in transcriptomic sequences assembled from high-throughput RNA sequencing (RNA-seq) data.
This pipeline is only applicable to the Linux operating system.
The pipeline includes the following steps:
1. Pre-processing.
   - GMAP index building.
   - LncFinder model building.
2. LncRNA filtering.
   - Length filtering.
   - LncFinder prediction.
   - CPC2 prediction.
   - GMAP alignment to the reference genome.
   - Filtering errors/noise.
   - Filtering possible transposable elements (TEs).
3. LncRNA annotation.
   - Classification with gffcompare.
   - BLASTn alignment.
   - Analysis of lncRNA expression.
The pipeline is implemented with the workflow management system Snakemake, which enables platform-independent installation and execution of the software.
Schematic diagram
Installation
Automatic (recommended for clusters/servers)
Install only the programs.yaml environment; the other environments are installed automatically when ICAnnoLncRNA starts.
1. wget https://github.com/artempronozin95/ICAnnoLncRNA-identification-classification-and-annotation-of-LncRNA/archive/refs/heads/main.zip
2. unzip main.zip
3. cd ./ICAnnoLncRNA-identification-classification-and-annotation-of-LncRNA-main
4. conda env create --file env/programs.yaml
5. conda activate ICAnnoLncRNA
After these steps all necessary packages are installed. If you need to update packages (not recommended), change the version after "=" (for example, snakemake=4.0.1 -> snakemake=6.0.0), then run conda env update --file ./programs.yaml and all necessary packages will be updated. This method is recommended on clusters, as it requires a lot of processing power.
Step method (recommended for a personal computer)
1. conda update conda
2. conda create -n ICAnnoLncRNA python=3.6
3. conda activate ICAnnoLncRNA
4. conda install -c bioconda emboss
5. Install the remaining packages from the file below.
Install all packages represented here.
It is also necessary to download CPC2 into the pipeline folder:
1. cd ./ICAnnoLncRNA---identification-classification-and-annotation-of-LncRNA-main
2. wget https://github.com/biocoder/CPC2/archive/refs/heads/master.zip
3. unzip master.zip
4. mv CPC2-master CPC2
Then download LncAPDB.fasta into the data/reference/data_index folder.
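A minimal sketch of this last step is shown below; the file URL is a placeholder, so copy the actual LncAPDB.fasta link from the Mendeley dataset listed in the Data section.

```bash
# Placeholder URL: take the actual LncAPDB.fasta link from
# https://data.mendeley.com/datasets/fnk8pmp2yz/3
wget -O data/reference/data_index/LncAPDB.fasta "<LncAPDB.fasta-download-link>"
```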
Input
Genome sequence
- Known lncRNA transcripts of the species in FASTA format: a file with lncRNAs from a public lncRNA database, for example Ensembl or EVLncRNAs v2.
- Known protein-coding sequences of the species in FASTA format: a file with protein-coding sequences from a public database, for example Ensembl.
- RNA-seq transcripts of the species in FASTA format: transcripts obtained after transcriptome assembly, for example the provided file 30k_zey.fasta, obtained with Trinity. All three sets are required as input.
Reference genome
- Reference genome of the species in FASTA format.
- Annotation of the species in GFF/GTF format.
Additional data
- TE coordinates on the reference genome (example: Zea_N_merged.bed).
- File with sequence IDs from public lncRNA databases linked to IDs in the library of known lncRNAs (example: index_and_newindex.csv).
- Annotation of the organism's transcriptome libraries by tissue type (example: SRX_all_org.tsv).
Data
Additional data is available here: https://data.mendeley.com/datasets/fnk8pmp2yz/3
It includes:
- blast.outfmt6 - BLASTn results; contains homologs of known lncRNA sequences from the LncAPDB library.
- index_and_newindex.fasta - indices of the PNRD, CANTATAdb, GREENC, PlncDB and EVLncRNA databases mapped to the new indices of the LncAPDB library.
- LncAPDB.fasta - lncRNA sequences of the LncAPDB library in FASTA format.
- lncrna_class.tmap - novel lncRNAs divided into gffcompare classes.
- lncrna_coordinats.bed - coordinates of novel lncRNAs on chromosomes.
- lncrna_intron_size.tsv - intron sizes of novel lncRNAs and their coordinates on the genome.
- new_lncrna.fasta - novel lncRNA sequences in FASTA format.
- new_lncrna_locus.loci - loci of lncRNA sequences on the genome.
- transcriptome_lib.txt - maize transcriptome libraries.
Configuration file
Input all necessary file paths and settings into the configuration file config.yaml:
- lnc: known lncRNA sequences of the studied organism in FASTA format. (Example: lnc: "data/input/lincrna.fasta")
- cds: known CDS (protein-coding) sequences of the studied organism in FASTA format. (Example: cds: "data/input/7000_len_mRNA.fasta")
- model: the pipeline checks whether a lncFinder model for this organism already exists and uses it; if not, the pipeline creates a new one (in the output folder). Existing models are listed under Models. (Example: model: "Zea_mays")
- sequence: the sequences to study in FASTA format. (Example: sequence: "data/input/Zea.fasta")
- structure: whether to use secondary structure in model building. Choose between DNA and SS (secondary structure). (Example: structure: "DNA")
- gmap:
  - gff_reference: reference genome annotation in GFF format. (Example: gff_reference: "data/reference/Zea_mays.AGPv4.40.gff")
  - reference: reference genome in FASTA format. (Example: reference: "data/reference/Zea_mays.AGPv4.dna.toplevel.fa")
  - out: output file. (Example: out: "data/output/alignm.gff")
- gff: GFF file of the reference genome. (Example: gff: "data/reference/Zea_mays.AGPv4.40.gff")
- blast: standard BLAST parameters.
  - evalue: evalue. (Example: evalue: 1e-50)
  - max_target: max_target_seqs. (Example: max_target: 1)
  - identity: perc_identity. (Example: identity: 30)
  - threads: num_threads. (Example: threads: 1)
- TE_coords: if no TE coordinate information is available, it is recommended to turn this step off.
  - option: turns this step on or off (on/off). (Example: option: "on")
  - coords: TE coordinates. (Example: coords: "data/reference/N_coords/Zea_N_merged.bed")
- tissue:
  - organism: choose the species you study from the species list, or input your own species ID in the same format. (Example: organism: "ZMAY")
  - exp: choose between the organism tissue experiments listed under Tissue analysis, or input your own organism. (Example: exp: "ZM")
  - LncAPDB: file with sequence IDs from public lncRNA databases linked to IDs in the library of known lncRNAs. (Example: LncAPDB: "tissue/index_and_newindex.csv")
  - SRX_exp: annotation of the transcriptome libraries by tissue type and gene expression level. (Example: SRX_exp: "tissue/SRX_all_org.tsv")
- expression:
  - option: turns this step on or off (on/off). (Example: option: "on")
  - path: path to the expression data. (Example: path: "expression/Kallisto")
  - calculation_type: type of expression data; there are three types: Kallisto, Htseq, or your own table. (Example: calculation_type: "Kallisto")
Folders
data/input
Contains:
- Known lncRNA transcripts of the species in FASTA format.
- Known protein-coding sequences of the species in FASTA format.
- RNA-seq transcripts of the species in FASTA format.
data/reference
Contains:
- Reference genome of the studied organism in FASTA format.
- Annotation of the studied organism in GFF/GTF format.
- data_index - folder with the library of known lncRNAs from databases.
- intron_data - folder with structural information for organisms.
- models - folder with models for lncFinder.
- N_coords - folder with TE coordinates.
Work start
1. snakemake -j 2
-j or --cores - use at most N CPU cores/jobs in parallel. If N is omitted or 'all', the limit is set to the number of available CPU cores.
2. snakemake -nr
-n - do not execute anything, just display what would be done. If you have a very large workflow, use --dry-run --quiet to print only a summary of the DAG of jobs.
-r - print the reason for each executed rule.
3. snakemake --use-conda
--use-conda - use the additional conda environments.
4. Recommended run: snakemake -j 2 --use-conda
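A typical first session therefore combines a dry run with the recommended invocation:

```bash
# preview the jobs and the reason each would run (nothing is executed)
snakemake -nr
# run the pipeline with 2 parallel jobs, letting Snakemake create the extra conda environments
snakemake -j 2 --use-conda
```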
Models
- Zea_mays
- Arabidopsis_thaliana
LncRNA structure information
- Zea_mays
- Arabidopsis_thaliana
- Oryza_sativa
Known LncRNA for database
The data/reference/data_index folder contains the lncRNA library (LncAPDB.fasta), built from five lncRNA databases (PNRD, CANTATAdb, GREENC, PlncDB, EVLncRNA). The library can be found here: https://data.mendeley.com/v1/datasets/fnk8pmp2yz/draft?a=13de1dfb-b631-42f3-8108-f3de2f50fd90
Species
- Medicago truncatula - MTRU
- Glycine max - GMAX
- Populus trichocarpa - PTRI
- Arabidopsis thaliana - ATHA
- Vitis vinifera - VVIN
- Oryza sativa japonica - OSAT
- Brachypodium distachyon - BDIS
- Sorghum bicolor - SBIC
- Zea mays - ZMAY
- Selaginella moellendorffii - SMOE
- Physcomitrella patens - PPAT
- Ostreococcus tauri - OTAU
- Volvox - VOLV
- Amborella trichopoda - ATRI
Tissue analysis
The file tissue/SRX_all_org.tsv contains information about 1241 transcriptome experiment libraries for 5 organisms, annotated by tissue. In config.yaml you need to choose one of the organisms present in the SRX file:
- HV - Hordeum vulgare
- OS - Oryza sativa
- SL - Solanum lycopersicum
- ST - Solanum tuberosum
- ZM - Zea mays
If the libraries you are investigating (for example, for another organism) are not in this file, you can add them to the data file in the same format:
ST	SRX1478098	leaf
ZM	SRX339769	stomatal division zone
ZM	SRX339787	ear
ZM	SRX711129	leaf
OS	SRX2582231	unknown
ZM	SRX711024	ear
ZM	SRX711053	leaf
where column 1 is the organism ID, column 2 the SRX library and column 3 the tissue (the file is tab-separated).
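Appending a library is then a one-line edit; a sketch is shown below (the SRX accession and tissue values here are hypothetical):

```bash
# organism ID, SRX library and tissue, separated by tabs (values are examples only)
printf 'ZM\tSRX0000000\troot\n' >> tissue/SRX_all_org.tsv
```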
Expression and entropy
This step requires expression data for every transcript, so it is recommended to turn it off on the first run of the pipeline. After obtaining expression data, turn the step on and the pipeline will calculate entropy and expression statistics. In config.yaml you need to choose between three types of table building:
- Kallisto - results are obtained with the Kallisto program and should be organized as in this example:
├── SRR957466
│   ├── abundance.h5
│   ├── abundance.tsv
│   └── run_info.json
├── SRR957467
│   ├── abundance.h5
│   ├── abundance.tsv
│   └── run_info.json
└── SRR957468
    ├── abundance.h5
    ├── abundance.tsv
    └── run_info.json
- Htseq - results are obtained with the HTSeq program and should be organized as in this example:
├── SRR1586686.txt
└── SRR504468.txt
- own - in this case the user builds the table themselves. It should look like the table presented in the example folder, where:
  - type - mRNA for protein-coding genes, noncons for non-conserved lncRNAs, consv for conserved lncRNAs;
  - one column per tissue of interest - in the example table, each value is the median expression over the libraries belonging to the tissue named in the column;
  - target_id - ID of the transcript; the transcript IDs must correspond to the results of the pipeline.
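For orientation, an "own" table following this description might look like the sketch below (tab-separated; the tissue columns and TPM values are invented for illustration, so match the column layout of the table in the example folder):

```
target_id	type	leaf	ear	root
Transcript_1	mRNA	12.4	3.1	0.8
Transcript_2	consv	0.0	5.6	1.2
Transcript_3	noncons	2.3	0.0	0.4
```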
Output
A typical structure of the output directory is as follows:
├── gffcompare_first
│   ├── gffcmp.annotated.gtf
│   ├── gffcmp.filter_alignm.bed.refmap
│   ├── gffcmp.filter_alignm.bed.tmap
│   ├── gffcmp.loci
│   ├── gffcmp.stats
│   └── gffcmp.tracking
├── gffcompare_second
│   ├── gff.annotated.gtf
│   ├── gff.filter_alignm.bed.refmap
│   ├── gff.filter_alignm.bed.tmap
│   ├── gff.loci
│   ├── gff.stats
│   └── gff.tracking
├── GMAP
│   ├── alignm.bed
│   ├── alignm_filter.gff
│   ├── filter_alignm.bed
│   ├── gmap_build.err.log
│   ├── gmap_build.out.log
│   ├── gmap.out.log
│   ├── reference.bed
│   └── statistic.csv
├── lncRNA_prediction
│   ├── Coding.fasta
│   ├── cpc.txt
│   ├── lncFinder.csv
│   └── Noncoding.fasta
├── lncRNA_structure
│   ├── anti.png
│   ├── exon_size.png
│   ├── intron_size.png
│   ├── itron_coordin.tsv
│   ├── long_transcripts.bed
│   ├── number_of_exon.png
│   ├── number_of_lncRNA.png
│   └── statistic_bed.tsv
├── loci
│   ├── lncRNA_before_loci.bed
│   └── lncRNA_loci.bed
├── new_lncRNA
│   ├── blast.outfmt6
│   ├── classes.png
│   ├── LncAPDB_vs_blast.csv
│   ├── new_lncrna.fasta
│   └── True_lncRNA.bed
├── TE_finder
│   ├── Lnc_aling_with_TE.tsv
│   └── lncRNA_after_loci.bed
├── tissue
│   ├── all.txt
│   ├── cod.txt
│   ├── cons.txt
│   ├── id.txt
│   ├── non.txt
│   ├── tissue_org.csv
│   ├── tissue_org.png
│   ├── transc_cod.csv
│   ├── transc.csv
│   └── transc_non.csv
└── expression
    ├── box_plot.png
    ├── entropy.png
    ├── expr_table.tsv
    ├── hist_expr.png
    └── table_type.txt
- The "tissue" folder contains the distribution of lncRNAs across tissues.
Groups of output files
1. Pre-processing.
Pre-processing results are written to the input directory:
- filtered_seq.fasta - transcripts remaining after filtering out sequences shorter than 200 nt.
- new_vs_old_id.tsv - new and old IDs of the transcripts.
- test_train - folder used for lncFinder model training.
2. LncRNA filtering.
lncRNA_prediction
- Coding.fasta - transcripts predicted as protein-coding by lncFinder, FASTA format.
- cpc.txt - transcripts predicted as lncRNA by CPC2, TXT format.
- lncFinder.csv - transcripts predicted as lncRNA by lncFinder, CSV format.
- Noncoding.fasta - transcripts predicted as lncRNA by lncFinder, FASTA format.
GMAP alignment to the reference genome.
GMAP
- alignm.bed - results of lncRNA transcript alignment to the reference genome by GMAP, BED format.
- alignm_filter.gff - alignment results without long-intron transcripts, GFF format.
- filter_alignm.bed - alignment results without long-intron transcripts, BED format.
- gmap_build.err.log
- gmap_build.out.log
- gmap.out.log
- reference.bed - reference genome annotation, BED format.
- statistic.csv - lncRNA structure statistics, CSV format.
Filtering errors/noise.
gffcompare_first - classification of the candidate lncRNA sequences by their location in the genome relative to protein-coding genes; more information can be found in the gffcompare documentation.
- gffcmp.annotated.gtf - gffcompare reports a GTF file containing the "union" of all transcript IDs in each sample.
- gffcmp.filter_alignm.bed.refmap - this tab-delimited file lists, for each reference transcript, which query transcript fully or partially matches it.
- gffcmp.filter_alignm.bed.tmap - this tab-delimited file lists the most closely matching reference transcript for each query transcript.
- gffcmp.loci - loci of the transcripts.
- gffcmp.stats
- gffcmp.tracking - this file matches transcripts up between samples.
loci
- lncRNA_before_loci.bed - the candidate lncRNA sequences before merging into loci, BED format.
- lncRNA_loci.bed - loci of the candidate lncRNAs, BED format.
gffcompare_second - classification of the candidate lncRNA sequences by their location in the genome relative to lncRNA loci.
- gff.annotated.gtf
- gff.filter_alignm.bed.refmap
- gff.filter_alignm.bed.tmap
- gff.loci
- gff.stats
- gff.tracking
Filtering possible transposable elements (TEs).
TE_finder
- Lnc_aling_with_TE.tsv - the candidate lncRNA sequences that map to TEs, TSV format.
- lncRNA_after_loci.bed - the candidate lncRNA sequences that completely match (the = class of gffcompare) lncRNA loci, BED format.
3. LncRNA annotation.
lncRNA_structure
- anti.png - distribution of antisense lncRNA alignments over the target gene structure.
- exon_size.png - lncRNA exon size distribution.
- intron_size.png - lncRNA intron size distribution.
- itron_coordin.tsv - coordinates of lncRNA introns, TSV format.
- long_transcripts.bed - transcripts longer than 5000 nt, BED format.
- number_of_exon.png - distribution of the number of exons per lncRNA.
- number_of_lncRNA.png - number of lncRNAs per chromosome.
- statistic_bed.tsv - lncRNA intron statistics, TSV format.
Blastn alignment.
new_lncRNA
- blast.outfmt6 - BLASTn results in outfmt6 format.
- classes.png - distribution of lncRNA classes.
- LncAPDB_vs_blast.csv - known lncRNAs that were predicted.
- new_lncrna.fasta - identified lncRNA transcripts, FASTA format (final result).
- True_lncRNA.bed - identified lncRNA transcripts, BED format.
tissue
- all.txt - lncRNAs that have homologs among known lncRNAs.
- cod.txt - IDs of protein-coding transcripts.
- cons.txt - IDs of lncRNAs similar to lncRNAs from other species (conserved sequences).
- id.txt - IDs of all input transcripts.
- non.txt - IDs of lncRNAs similar only to known sequences of the studied organism (non-conserved sequences).
- tissue_org.csv - all in one (protein-coding, conserved lncRNAs, non-conserved lncRNAs).
- tissue_org.png - heatmap of transcript library tissue types.
- transc_cod.csv - protein-coding transcript library tissue types.
- transc.csv - conserved lncRNA transcript library tissue types.
- transc_non.csv - non-conserved lncRNA transcript library tissue types.
Analysis of lncRNA expression.
expression
- box_plot.png - box plots of gene expression.
- entropy.png - density plot of Shannon entropy.
- expr_table.tsv - table with expression data for every transcript in every tissue.
- hist_expr.png - density plot of maximum transcript expression levels.
- table_type.txt - type of expression results.
Errors
rule model_lncFinder:
input: data/input/test_train/sets.txt
output: data/input/test_train/dirs.txt
jobid: 2
Activating conda environment /home/pronozin/ICAnnoLncRNA---identification-classification-and-annotation-of-LncRNA-main/.snakemake/conda/767d2b45.
/bin/bash: activate:
Solution: run conda install conda in the ICAnnoLncRNA environment.

rule Gmap_build:
input: data/reference/Zea_mays.AGPv4.dna.toplevel.fa, data/reference/Zea_mays.AGPv4.40.gff
output: data/output/statistic.csv
jobid: 5
...
Error in rule Gmap_build:
jobid: 5
output: data/output/statistic.csv
RuleException:
CalledProcessError in line 123 of ...:
Command ' set -euo pipefail; ...'
Solution: update the gtfparse package, for example conda install -c bioconda gtfparse.
Code Snippets
```python
import pandas as pd
import sys
import numpy as np
from pybedtools import BedTool

pd.set_option('display.max_columns', None)

def clean(x):
    split = x.str.split('.', 4, expand=True)
    split = split[0]
    return (split)

tmap = pd.read_csv(sys.argv[1], sep='\t')
gmap_align_prime = pd.read_csv(sys.argv[2], sep='\t', header=None)
# take only lncRNA classes from tmap file
tmap['group'] = np.nan
tmap['group'][tmap['class_code'].isin(['='])] = 'true lncrna'
tmap.dropna(subset=["group"], inplace=True)
tmap = tmap[tmap['qry_gene_id'].str.contains("path1")]
print(tmap['group'].value_counts())
# filter duplicates
query = clean(tmap['qry_gene_id'])
gmap_align_prime['name'] = clean(gmap_align_prime[3])
query = query.drop_duplicates()
# take lncRNA from alignment file
if str(sys.argv[3]) == 'on':
    print('TE finder ON')
    gmap_align_prime = gmap_align_prime[gmap_align_prime['name'].isin(query)]
    gmap_align = gmap_align_prime[gmap_align_prime[7].str.contains("gene")]
    gmap_align.to_csv(sys.argv[4], sep='\t', index=False, header=None)
    # BedTool intersect
    genes = BedTool(sys.argv[4])
    loci = genes.intersect(wa=True, wb=True, b=sys.argv[5])
    loci_df = pd.read_table(loci.fn, sep='\t', names=['chr','lncRNA_start','lncRNA_end','lncRNA',4,'lncRNA_strand',6,7,8,9,10,'chr_TE','TE_start','TE_end'])
    loci_df = loci_df[['chr','lncRNA_start','lncRNA_end','lncRNA','lncRNA_strand','chr_TE','TE_start','TE_end']]
    loci_df = loci_df.drop_duplicates(subset=['lncRNA'])
    loci_name = loci_df['lncRNA'].str.split('.', 4, expand=True)
    true_lncrna = gmap_align_prime[~gmap_align_prime['name'].isin(loci_name[0])]
    loci_df.to_csv(sys.argv[6], sep='\t', index=False, header=None)
    true_lncrna.to_csv(sys.argv[7], sep='\t', index=False, header=None)
else:
    gmap_align_prime = gmap_align_prime[gmap_align_prime['name'].isin(query)]
    gmap_align_prime.to_csv(sys.argv[4], sep='\t', index=False, header=None)
```
```python
import os
import subprocess
import sys
from Bio import SeqIO
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

def data_base(x):
    x_n = x.rsplit('/', 1)[1]
    x_n = x_n.rsplit('.', 1)[0]
    dir_check = os.path.isfile('data/reference/data_index/' + x_n + '.nin')
    if dir_check is True:
        print('index exist')
        index_path = os.path.join('data/reference/data_index/', x_n)
        return index_path
    else:
        print('build index')
        index_path = os.path.join('data/reference/data_index/', x_n)
        command = 'makeblastdb -in {db} -dbtype nucl -parse_seqids -out {index}'.format(db=x, index=index_path)
        exit_code = subprocess.call(command, shell=True)
        return index_path

def fasta(x):
    query = x['name']
    for w in query:
        try:
            old_id = old_new[old_new['new_id'].isin([w])]
            print(record_dict[str(old_id['old_id'].to_list()[0])].format('fasta'), end='', file=lncrna)
        except KeyError:
            continue

def alingment(x, y):
    path = x.rsplit('/', 1)[0]
    outfmt = os.path.join(path, 'blast' + '.outfmt6')
    aling = 'blastn -query {q} -db {dbw} -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen" -evalue {evalue} -max_target_seqs {max_target} -perc_identity {identity} -num_threads {threads} -out {outfmt}'.format(q=x, dbw=y, evalue=evalue, max_target=max_target, identity=identity, threads=threads, outfmt=outfmt)
    #| sort -k1,1 -k12,12nr -k11,11n | sort -u -k1,1 --merge > {outfmt}'. format(q=x, dbw=y, outfmt=outfmt)
    exit_code = subprocess.call(aling, shell=True)

# input data
query = pd.read_csv(sys.argv[1], sep='\t', header=None)
gff_tmap = sys.argv[2]
data = sys.argv[3]
record_dict = SeqIO.to_dict(SeqIO.parse(sys.argv[4], "fasta"))
lncrna = open(sys.argv[5], 'w', encoding='utf-8')
evalue = sys.argv[6]
max_target = sys.argv[7]
identity = sys.argv[8]
threads = sys.argv[9]
old_new = pd.read_csv(sys.argv[10], sep='\t')
tmap = pd.read_csv(gff_tmap, sep='\t')
# lncRNA classification graf
query = query[query[7].str.contains("gene")]
tmap = tmap[tmap['qry_gene_id'].str.contains("path1")]
name_tmap = tmap['qry_gene_id'].str.rsplit('.', expand=True)
tmap[['name','seq']] = name_tmap[[0,1]]
#tmap = tmap[tmap['seq'].str.contains("path1")]
tmap = tmap[tmap['name'].isin(query[10])]
tmap = tmap.drop_duplicates(subset=['name'])
tmap['group'] = np.nan
tmap['group'][tmap['class_code'].isin(['x'])] = "exon antisense"
tmap['group'][tmap['class_code'].isin(['i'])] = "intron"
tmap['group'][tmap['class_code'].isin(['u'])] = "intergenic"
tmap.dropna(subset=["group"], inplace=True)
print(tmap['group'].value_counts())
f, ax = plt.subplots(figsize=(13, 13))
tmap['group'].value_counts().plot(kind='bar')
plt.xticks(size=15, rotation=30)
plt.yticks(size=15)
plt.ylabel('LncRNA transcripts number', size=15)
plt.xlabel('LncRNA classes', size=15)
plt.savefig("data/output/new_lncRNA/classes.png", dpi=500)
# BLAST
fasta(tmap)
indx = data_base(data)
alingment(sys.argv[5], indx)
```
```bash
test=$(ls -lrt $2 | cut -f 5 -d ' ')
minimumsize=4000000000
if [ $test -gt $minimumsize ]; then
    mkdir ./data/input/chunk
    awk -v size=10000 -v pre=prefix -v pad=5 '/^>/{n++;if(n%size==1){close(f);f=sprintf("./data/input/chunk/%s%0"pad"d.fa",pre,n)}}{print>>f}' $2
    Rscript scripts/lncFind.r $1 ./data/input/chunk $3 $4
else
    Rscript scripts/lncFind.r $1 $2 $3 $4
fi
```
```r
library(ggpubr)
library(dplyr)
library(purrr)
library(BioQC)
library(ggplot2)
library(data.table)

args <- commandArgs(TRUE)

entrop <- function(a) {
  ent <- entropySpecificity(a, norm=TRUE)
  return(ent)
}

trans <- function(a) {
  a$target_id <- NULL
  a$type <- NULL
  a$sum <- NULL
  entrop(a)
}

type <- function(a) {
  col_name <- colnames(a)
  col_name <- col_name[! col_name %in% c("type", "target_id")]
  a$sum <- rowSums(a[,col_name], na.rm=TRUE)
  a <- filter(a, sum > 0)
  ent <- trans(a)
  a$entropy <- ent
  return(a)
}

plot <- function(a, b, c) {
  ggplot() +
    geom_density(data = a, aes(x = entropy, fill="Non"), alpha=0.4) +
    geom_density(data = b, aes(x = entropy, fill="Cons"), alpha=0.4) +
    geom_density(data = c, aes(x = entropy, fill="mRNA"), alpha=0.4) +
    scale_colour_manual("", breaks = c("Non", "Cons", "mRNA")) +
    xlab('entropy') + ylab('density') +
    guides(fill = guide_legend(title = "Types")) +
    theme(text = element_text(size = 20))
  ggsave("data/output/expression/entropy.png", width = 20, height = 20)
}

log <- function(a) {
  col_name <- colnames(a)
  col_name <- col_name[! col_name %in% c("type", "target_id")]
  col <- c()
  for (x in col_name) {
    col <- append(col, paste("log_", noquote(x), sep = ""))
    a[, paste("log_", noquote(x), sep = "")] <- log2(a[, noquote(x)])
  }
  new_col <<- col
  return(a)
}

# install dataframe
df <- read.table(args[1], header = TRUE)
# log of columns
df_log <- log(df)
# expression analysis
df_log[df_log == -Inf] <- 0
my_comparisons = list(c("consv", "non_cons"), c("consv", "prot"), c("prot", "non_cons"))
ggboxplot(df_log, x="type", y= new_col, merge = "flip", ylab = "log2(TPM)", color = "type", palette = "jco") +
  font("subtitle", size = 20) +
  font("caption", size = 20) +
  font("xlab", size = 20) +
  font("ylab", size = 20) +
  font("xy.text", size = 20) +
  theme(legend.text=element_text(size=20))
ggsave("data/output/expression/box_plot.png", width = 20, height = 20)
ggdensity(df_log, x = new_col, y = "..density..", facet.by = "type", merge = TRUE,
          xlab = "TPM", rug = TRUE, add = "median", palette = "jco") +
  font("subtitle", size = 20) +
  font("caption", size = 20) +
  font("xlab", size = 20) +
  font("ylab", size = 20) +
  font("xy.text", size = 20) +
  theme(legend.text=element_text(size=20))
ggsave("data/output/expression/hist_expr.png", width = 20, height = 20)
# entropy analysis
non <- df[grepl("noncons", df$type),]
cons <- df[grepl("consv", df$type),]
mRNA <- df[grepl("mRNA", df$type),]
non_ent <- type(non)
cons_ent <- type(cons)
mRNA_ent <- type(mRNA)
plot(non_ent, cons_ent, mRNA_ent)
```
```python
import pandas as pd
import os
import sys

strc = str(sys.argv[3])[1:-1]
if strc == 'own':
    print('Own table')
    with open(sys.argv[9], 'w') as f:
        f.write('Own table')
else:
    print('Build table')
    tissue = pd.read_csv(sys.argv[1], sep='\t', header=None)
    path = str(sys.argv[2])[1:-1]
    new_old = pd.read_csv(sys.argv[4], sep='\t')
    cons = pd.read_csv(sys.argv[5], sep=' ', header=None)
    noncons = pd.read_csv(sys.argv[6], sep=' ', header=None)
    cod = pd.read_csv(sys.argv[7], sep=' ', header=None)
    cod = pd.merge(cod, new_old, how='inner', left_on=[0], right_on=['new_id']).drop(columns=[0])
    cod['type'] = 'mRNA'
    noncons['type'] = 'noncons'
    cons['type'] = 'consv'
    if strc == 'Kallisto':
        list_dir = os.listdir(path)
        lib = {}
        for w in list_dir:
            abudance = pd.read_csv(path + '/' + w + '/' + "abundance.tsv", sep='\t')
            lib[w] = abudance['tpm']
    if strc == 'Htseq':
        list_dir = os.listdir(path)
        lib = {}
        for w in list_dir:
            id = w.split('.')
            abudance = pd.read_csv(path + '/' + w, sep='\t', header=None)
            abudance = abudance.rename(columns={0: 'target_id'})
            lib[id[0]] = abudance[1]
    lib_data = pd.DataFrame.from_dict(lib)
    colnames = lib.keys()
    tissue = tissue[tissue[1].isin(colnames)]
    tissue_gr = tissue.groupby([2])
    median_tis = {}
    for k, v in tissue_gr:
        lib_data_gr = lib_data[v[1]]
        median_tis[k] = lib_data_gr.median(axis=1)
    median_tis_df = pd.DataFrame.from_dict(median_tis)
    median_tis_df['target_id'] = abudance['target_id']
    cod = pd.merge(cod, median_tis_df, how='inner', left_on=['new_id'], right_on=['target_id']).drop(columns=['old_id', 'new_id'])
    noncons = pd.merge(noncons, median_tis_df, how='inner', left_on=[0], right_on=['target_id']).drop(columns=[0])
    cons = pd.merge(cons, median_tis_df, how='inner', left_on=[0], right_on=['target_id']).drop(columns=[0])
    full_table = pd.concat([cod, cons, noncons])
    full_table.to_csv(sys.argv[8], sep='\t', index=False)
    with open(sys.argv[9], 'w') as f:
        f.write(strc)
```
```python
import pandas as pd
import sys
from collections import Counter
import matplotlib.pyplot as plt
import scipy.stats as stats
from matplotlib.ticker import FuncFormatter
from functools import partial
import math
import statistics
from scipy.stats import lognorm
import numpy as np

pd.options.mode.chained_assignment = None  # default='warn'

def to_percent(y, position, n):
    s = str(round(100 * y / n, 3))
    return s + '%'

def coord(x):
    Row_list_query = []
    for index, rows in x.iterrows():
        my_list = [rows[1], rows[2]]
        Row_list_query.append(my_list)
    return(Row_list_query)

def find_same_coords(x):
    x['exon_num'] = x['name'] + "_" + exon[7].astype(str).str.cat(x['number'].astype(str), sep='_')
    uni3 = x.drop_duplicates(subset=[1])
    target_coord = coord(uni3)
    for start_stop1 in query_coord:
        start1, stop1 = map(int, start_stop1)
        for start_stop in target_coord:
            start, stop = map(int, start_stop)
            if start1 in range(start-100, start + 100) and stop1 in range(stop - 100, stop+100):
                same_query = uni[1] == start1
                same_target = uni3[1] == start
                index = uni3[same_target]
                lines = uni[same_query]
                chrom.append([' '.join('{0}'.format(i) for i in lines[3]), ' '.join('{0}'.format(i) for i in index['exon_num']), str(round(start1)), str(round(stop1)), str(start), str(stop), ' '.join('{0}'.format(i) for i in index['number']), ' '.join('{0}'.format(i) for i in lines['length']), ' '.join('{0}'.format(i) for i in index[5]), ' '.join('{0}'.format(i) for i in lines[5])])

query = pd.read_csv(sys.argv[1], sep='\t', header=None)
target = pd.read_csv(sys.argv[2], sep='\t', header=None)
tmap = pd.read_csv(sys.argv[3], sep='\t', header=None, skiprows=1)
small_intron = pd.read_csv(sys.argv[4], sep='\t', header=None)
query = query[pd.to_numeric(query[0], errors='coerce').notnull()]
target = target[pd.to_numeric(target[0], errors='coerce').notnull()]
statistic = open(sys.argv[5], 'w', encoding='utf-8')
query = query[~query[10].isin(small_intron[0])]
name_tmap = tmap[3].str.split('.', 4, expand=True)
tmap['name'] = name_tmap[0]
tmap = tmap[tmap[2].isin(['x', 'i', 'u'])]
tmap = tmap[tmap['name'].isin(query[10])]
# exon structure of lncRNA
chart = query[query[7].str.contains("exon")]
spl = chart[3].str.split('.', 2, expand=True)
size = spl.pivot_table(index=[0], aggfunc='size').to_dict()
res = Counter(size.values())
dict = {}
for k, v in res.items():
    dict[k] = v/len(size)*100
f, ax = plt.subplots(figsize=(15, 5))
ax.bar(list(dict.keys()), dict.values(), color='b')
#ax.set_xlim([0, 20])
plt.xticks(np.sort(range(max(list(dict.keys())) + 1)))
plt.ylabel('Proportion of lncRNAs (%)', size=15)
plt.xlabel('Number of exon', size=15)
plt.savefig("data/output/lncRNA_structure/number_of_exon.png", dpi=500)
# exon size chart
chart['exon_length'] = (chart[2] - chart[1])
step = int(math.ceil(chart['exon_length'].mean() / 100.0)) * 100
f, ax = plt.subplots(figsize=(10, 5))
density = stats.gaussian_kde(chart['exon_length'])
n, x, _ = ax.hist(chart['exon_length'], density=True, bins=np.logspace(np.log10(min(chart['exon_length'])), np.log10(max(chart['exon_length'])), 50), color='b')
ax.plot(x, density(x), color='r')
ax.set_xscale('log')
ax.set_ylabel('Probability', size=15)
ax.set_xlabel('Exon size (bp)', size=15)
plt.savefig("data/output/lncRNA_structure/exon_size.png", dpi=500)
print('Mean_exon_length', '\t', statistics.mean(chart['exon_length']), file=statistic)
print('Max_exon_length', '\t', max(chart['exon_length']), file=statistic)
print('Min_exon_length', '\t', min(chart['exon_length']), file=statistic)
print('Median_exon_length', '\t', statistics.median(chart['exon_length']), file=statistic)
# intron size chart
chart['name'] = spl[0]
gr = chart[[1, 2, 'name']].groupby(by='name')
intron_length = []
name = []
for key, item in gr:
    n = 0
    if 0 < len(item['name']) - 2:
        intron = abs(item[2].iloc[n] - item[1].iloc[n + 1])
        if intron < 60:
            pass
        elif intron > 100000:
            pass
        else:
            n = n + 1
            intron_length.append(intron)
    else:
        intron_length.append(0)
intron_length = list(filter(lambda x: x != 0, intron_length))
step = int(math.ceil(statistics.mean(intron_length) / 100.0)) * 100
f, ax = plt.subplots(figsize=(10, 5))
density = stats.gaussian_kde(intron_length)
n, x, _ = ax.hist(intron_length, density=True, bins=np.logspace(np.log10(min(intron_length)), np.log10(max(intron_length)), 50), color='b')
ax.plot(x, density(x), color="r")
ax.set_xscale('log')
ax.set_ylabel('Probability', size=15)
ax.set_xlabel('Intron size (bp)', size=15)
plt.savefig("data/output/lncRNA_structure/intron_size.png", dpi=500)
print('Mean_intron_length', '\t', statistics.mean(intron_length), file=statistic)
print('Max_intron_length', '\t', max(intron_length), file=statistic)
print('Min_intron_length', '\t', min(intron_length), file=statistic)
print('Median_intron_length', '\t', statistics.median(intron_length), file=statistic)
# choose only query transcripts
query = query[query[3].str.contains("path1")]
query['length'] = abs(query[1] - query[2])
print('Mean transcript length', '\t', statistics.mean(query['length']), file=statistic)
# chart lcnRNA distribution across chromosome
f, ax = plt.subplots(figsize=(10, 5))
labels, counts = np.unique(query[0].astype(int), return_counts=True)
ax.bar(labels, counts, align='center', color='b')
plt.gca().set_xticks(labels)
plt.ylabel('Number of lncRNAs', size=15)
plt.xlabel('Chromosome of organism', size=15)
plt.savefig("data/output/lncRNA_structure/number_of_lncRNA.png", dpi=500)
chromasom_all = pd.unique(target[0].astype(int))
# choose lncRNA type
name_target = target[9].str.split(';', 4, expand=True)
target['name'] = name_target[0]
new_tmap = tmap[tmap[2].isin(['x'])]
# filter seq by lncRNA type
tmap_name = new_tmap[3].str.split('.', 4, expand=True)
query_anti = query[query[10].isin(tmap_name[0].to_list())]
# choose only target exons
exon = target[target[7].str.contains("exon")]
exon = exon[exon['name'].groupby(exon['name']).transform('size') > 1]
uni = query_anti.drop_duplicates(subset=[1])
chrom = []
for w in chromasom_all:
    print(w)
    query_new = uni[uni[0].astype(int).isin([w])]
    target_new = exon[exon[0].astype(int).isin([w])]
    query_coord = coord(query_new)
    strand = target_new.groupby([5])
    for key, item in strand:
        if key == "-":
            item['number'] = item.groupby(['name']).cumcount(ascending=False) + 1
            find_same_coords(item)
        elif key == "+":
            item['number'] = item.groupby(['name']).cumcount() + 1
            find_same_coords(item)
df = pd.DataFrame(chrom)
df['number'] = df.groupby([6]).cumcount() + 1
df[6] = df[6].astype(int)
df = df.sort_values(by=[6])
df.rename(columns={0: 'transcript_id', 1: 'protein_coding_id', 2: 'protein_coding_id_start', 3: 'protein_coding_id_end', 4: 'transcript_id_start', 5: 'transcript_id_end', 6: 'protein_coding_exon', 7: 'transcript_length', 8: 'protein_coding_strand', 9: 'transcript_strand', "number": 'number_of_transcripts'}, inplace=True)
df.to_csv('data/output/lncRNA_structure/anti_vs_mrna.tsv', sep='\t', index=False)
f, ax = plt.subplots(figsize=(15, 5))
ax.bar(df['protein_coding_exon'], (df['number_of_transcripts']/(len(df['protein_coding_exon']))*100), color='b')
plt.ylabel('Number of aligned lnRNAs', size=15)
plt.xlabel('Exon of the gene', size=15)
plt.xticks(range(max(df['protein_coding_exon'])+1))
plt.savefig('data/output/lncRNA_structure/anti.png')
```
```python
import os
import sys
import subprocess
from Bio import SeqIO
import pandas as pd
import numpy as np
from gtfparse import read_gtf

data_base = sys.argv[1]
gff = sys.argv[2]

def build(reference, kmer):
    args_alignment = []
    gmap_build = 'gmap_build'
    path, base = os.path.split(reference)
    ref_label = os.path.splitext(base)[0]
    gmap_build_logger_out_path = os.path.join('./data/output/GMAP', gmap_build + '.out.log')
    gmap_build_logger_err_path = os.path.join('./data/output/GMAP', gmap_build + '.err.log')
    command = '{gmap_build} -D {tmp_dir} -d {ref_index_name} -k {kmer_value} {reference} >> {log_out_1} 2>> {log_out_2}'.\
        format(gmap_build=gmap_build, tmp_dir=path, ref_index_name=ref_label, reference=reference,
               log_out_1=gmap_build_logger_out_path, log_out_2=gmap_build_logger_err_path, kmer_value=kmer)
    exit_code = subprocess.call(command, shell=True)

gff_df = read_gtf(gff)
df = SeqIO.to_dict(SeqIO.parse(data_base, "fasta"))
size = round(os.path.getsize(data_base) / (1024*1024))
statistic = open(sys.argv[3], 'w', encoding='utf-8')
if size <= 64:
    k = 12
elif 64 < size <= 256:
    k = 13
elif 256 < size <= 1000:
    k = 14
elif 1000 < size <= 10000:
    k = 15
print("k", '\t', str(k), file=statistic)
total_len = []
for name, seq in df.items():
    total_len.append(len(seq))
print('Golden_Path_Length', '\t', str(np.sum(total_len)), file=statistic)
exon_only = gff_df[gff_df['feature'].str.contains('exon')]
exon_length = (exon_only['end'] - exon_only['start']).mean()
print('Mean_exon_length', '\t', str(round(exon_length)), file=statistic)
transcript_only = gff_df[gff_df['feature'].str.contains('transcript')]
transcript_length = (transcript_only['end'] - transcript_only['start']).mean()
print('Mean_transcript_length', '\t', str(round(transcript_length)), file=statistic)
gr = exon_only[['start','end','gene_id']].groupby(by='gene_id')
intron_length = []
for key, item in gr:
    n = 0
    while n <= len(item['gene_id']) - 2:
        intron = abs(item['end'].iloc[n] - item['start'].iloc[n + 1]) - 1
        n = n + 1
        intron_length.append(intron)
mean_intron_length = np.mean(intron_length)
max_intron_length = np.max(intron_length)
min_intron_length = np.min(intron_length)
print('Mean_intron_length', '\t', str(round(mean_intron_length)), file=statistic)
print('Max_intron_length', '\t', str(round(max_intron_length)), file=statistic)
print('Min_intron_length', '\t', str(round(min_intron_length)), file=statistic)
build(data_base, k)
```
```python
import pandas as pd
import sys

def coord(x):
    uni = x.drop_duplicates(subset=[1])
    Row_list_query = []
    for index, rows in uni.iterrows():
        my_list = [rows[1], rows[2]]
        Row_list_query.append(my_list)
    return(Row_list_query)

gmap = sys.argv[1]
known_seq = sys.argv[2]
out = sys.argv[3]
gmap_align = pd.read_csv(gmap, sep='\t', header=None)
target = pd.read_csv(known_seq, sep='\t', header=None)
target_coords = coord(target[target[7].str.contains("transcript")])
name = gmap_align[9].str.split(';', 4, expand=True)
name = name[1].str.split('=', 2, expand=True)
gmap_align['name'] = name[1]
gene = gmap_align[gmap_align[3].str.contains("path1")]
exon = gmap_align[gmap_align[3].str.contains("exon")]
gr = exon[[0,1,2,'name']].groupby(by='name')
intron_large = []
intron_small = []
intron_data = []
for key, item in gr:
    n = 0
    if 0 < len(item['name']) - 2:
        intron = abs(item[2].iloc[n] - item[1].iloc[n + 1])
        intron_data.append([item[0].iloc[n], item[2].iloc[n], item[1].iloc[n + 1], key + '_intron' + str(n+1), intron])
        n = n + 1
        if intron < 60:
            intron_small.append(key)
        if intron > 20000:
            intron_large.append(key)
        else:
            pass
    else:
        pass
df = pd.DataFrame(intron_data)
df.to_csv('data/output/lncRNA_structure/itron_coordin.tsv', sep='\t', index=False, header=None)
its = pd.DataFrame(intron_small)
its.to_csv('data/output/lncRNA_structure/itron_small.txt', index=False, header=None)
gmap_align = gmap_align[~gmap_align['name'].isin(intron_large)]
gmap_align['length'] = gmap_align[2] - gmap_align[1]
gmap_align = gmap_align[gmap_align['length'] < 5000]
gmap_align_big = gmap_align[gmap_align['length'] > 5000]
gmap_align_big.to_csv('data/output/lncRNA_structure/long_transcripts.bed', index=False, sep='\t', header=None)
gmap_align = gmap_align[~gmap_align['name'].isin(gmap_align_big['name'])]
gmap_align = gmap_align[[0,1,2,3,4,5,6,7,8,9]]
gmap_align.to_csv(out, index=False, sep='\t', header=None)
```
```python
import os
import sys
import subprocess
import pandas as pd

data_base = sys.argv[1]
target = sys.argv[2]
parameters = sys.argv[3]
out_file = sys.argv[5]
opt = sys.argv[7:]

def aling(reference, transcripts, min, max, thr, lg):
    if lg[0] > 2**32:
        gmap_run = 'gmapl'
        print('large genome use gmapl')
    else:
        gmap_run = 'gmap'
    path, base = os.path.split(reference)
    ref_label = os.path.splitext(base)[0]
    out = os.path.join(out_file)
    gmap_run_logger_err_path = os.path.join('./data/output/GMAP', gmap_run + '.out.log')
    command = '{gmap} -D {tmp_dir} -d {ref_index_name} {transcripts} --min-intronlength={min_intron_length} --intronlength={intron_length} --cross-species ' \
              '--format=gff3_gene --split-large-introns --npaths=1 -t {threads} {option} > {alignment_out} 2>> {log_out_2}'.\
        format(gmap=gmap_run, tmp_dir=path, ref_index_name=ref_label, transcripts=transcripts, threads=thr,
               alignment_out=out, log_out_2=gmap_run_logger_err_path, min_intron_length=min[0], intron_length=max[0],
               option=' '.join('{0}'.format(w) for w in opt))
    exit_code = subprocess.call(command, shell=True)

dir_check = os.path.isdir('data/reference/intron_data/' + sys.argv[4])
if dir_check is True:
    print('use of existing statistics')
    gff_df = pd.read_csv('data/reference/intron_data/' + sys.argv[4] + '/' + sys.argv[4] + '.tsv', sep='\t', header=None)
    minn = gff_df[gff_df[0].str.contains("Min_intron_length")][1].to_list()
    maxx = gff_df[gff_df[0].str.contains("Max_intron_length")][1].to_list()
    length = gff_df[gff_df[0].str.contains("Golden_Path_Length")][1].to_list()
    thr = 40
    aling(data_base, target, minn, maxx, thr, length)
    print(target)
else:
    print('use of reference statistics')
    gff_df = pd.read_csv(parameters, sep='\t', header=None)
    minn = gff_df[gff_df[0].str.contains("Min_intron_length")][1].to_list()
    maxx = gff_df[gff_df[0].str.contains("Max_intron_length")][1].to_list()
    length = gff_df[gff_df[0].str.contains("Golden_Path_Length")][1].to_list()
    thr = 40
    aling(data_base, target, minn, maxx, thr, length)
```
```python
import pandas as pd
import sys
import numpy as np
from pybedtools import BedTool

def clean(x):
    split = x.str.rsplit('.', 2, expand=True)
    split = split[0]
    return (split)

tmap = pd.read_csv(sys.argv[1], sep='\t')
gmap_align = pd.read_csv(sys.argv[2], sep='\t', header=None)
# take only lncRNA classes from tmap file
tmap['group'] = np.nan
tmap['group'][tmap['class_code'].isin(['x'])] = "exon antisense"
tmap['group'][tmap['class_code'].isin(['i'])] = "intron"
tmap['group'][tmap['class_code'].isin(['u'])] = "intergenic"
tmap.dropna(subset=["group"], inplace=True)
tmap = tmap[tmap['qry_gene_id'].str.contains("path1")]
print(tmap['group'].value_counts())
# filter duplicates
query = clean(tmap['qry_gene_id'])
gmap_align['name'] = clean(gmap_align[3])
query = query.drop_duplicates()
# take lncRNA from alignment file
gmap_align = gmap_align[gmap_align['name'].isin(query)]
gmap_align = gmap_align[gmap_align[7].str.contains("gene")]
gmap_align.to_csv(sys.argv[3], sep='\t', index=False, header=None)
# lncRNA into loci
genes = BedTool(sys.argv[3])
loci = genes.merge(s=True, c=[4,5,6], o=['collapse','mean','distinct'])
loci_df = pd.read_table(loci.fn, sep='\t', names=[0,1,2,3,4,5])
loci_df = loci_df[(loci_df[2] - loci_df[1]) > 200]
loci_df['numb'] = list(range(0, len(loci_df[3])))
loci_df['loc'] = 'LOC'
loci_df['loci'] = loci_df['loc'] + '_' + loci_df['numb'].astype(str)
loci_df = loci_df[[0,1,2,'loci',4,5,3]]
loci_df.to_csv(sys.argv[4], sep='\t', index=False, header=None)
```
```r
library(seqinr)
library(LncFinder)

args <- commandArgs(TRUE)
list_dir <- list.dirs(path = args[1], recursive = FALSE)
for (w in list_dir) {
  lnc <- read.fasta(args[2])
  mrna <- read.fasta(paste(w, '/train_mrna.fasta', sep=''))
  strc <- args[3]
  print(strc)
  if (strc == "SS") {
    print('secondary structure on')
    RNAfold.path <- '/home/pronozinau/miniconda3/bin/RNAfold'
    mrna1 <- run_RNAfold(mrna, RNAfold.path = RNAfold.path, parallel.cores = 40)
    lnc1 <- run_RNAfold(lnc, RNAfold.path = RNAfold.path, parallel.cores = 40)
    feach <- make_frequencies(cds.seq = mrna1, mRNA.seq = mrna1, lncRNA.seq = lnc1,
                              SS.features = TRUE, cds.format = "SS", lnc.format = "SS",
                              check.cds = TRUE, ignore.illegal = FALSE)
    print('done')
    saveRDS(feach, file=paste(w, '/frequencies.rds', sep=""))
    my_model <- build_model(lncRNA.seq = lnc1, mRNA.seq = mrna1, frequencies.file = feach,
                            SS.features = TRUE, lncRNA.format = "SS", mRNA.format = "SS",
                            parallel.cores = 2, folds.num = 2, seed = 1)
    saveRDS(my_model, file=paste(w, '/model.rds', sep=""))
    print(paste(w, '/test.fasta', sep=''))
    seq <- read.fasta(paste(w, '/test.fasta', sep=''))
    test <- run_RNAfold(seq, RNAfold.path = RNAfold.path, parallel.cores = 40)
    result_2 <- lnc_finder(test, SS.features = TRUE, format = "SS",
                           frequencies.file = feach, svm.model = my_model, parallel.cores = 2)
    write.table(result_2, file=paste(w, '/results.csv', sep=""), col.names=F, sep = ',')
  } else {
    print('secondary structure off')
    feach <- make_frequencies(cds.seq = mrna, mRNA.seq = mrna, lncRNA.seq = lnc,
                              SS.features = FALSE, cds.format = "DNA", lnc.format = "DNA",
                              check.cds = TRUE, ignore.illegal = FALSE)
    print('done')
    saveRDS(feach, file=paste(w, '/frequencies.rds', sep=""))
    my_model <- build_model(lncRNA.seq = lnc, mRNA.seq = mrna, frequencies.file = feach,
                            SS.features = FALSE, lncRNA.format = "DNA", mRNA.format = "DNA",
                            parallel.cores = 2, folds.num = 2, seed = 1)
    saveRDS(my_model, file=paste(w, '/model.rds', sep=""))
    print(paste(w, '/test.fasta', sep=''))
    seq <- read.fasta(paste(w, '/test.fasta', sep=''))
    result_2 <- lnc_finder(seq, SS.features = FALSE, format = "DNA",
                           frequencies.file = feach, svm.model = my_model, parallel.cores = 2)
    write.table(result_2, file=paste(w, '/results.csv', sep=""), col.names=F, sep = ',')
  }
}
print(list_dir)
```
```python
from Bio import SeqIO
import sys
import pandas as pd

fasta = SeqIO.parse(sys.argv[1], "fasta")
corrected_file = sys.argv[2]
new_old_id = []
with open(corrected_file, 'w') as corrected:
    n = 1
    for rec in fasta:
        if len(rec.seq) >= 200:
            new_name = "Transcript_" + str(n)
            new_old_id.append([rec.id, new_name])
            rec.id = new_name
            rec.description = new_name  # <- Add this line
            n = n + 1
            SeqIO.write(rec, corrected, 'fasta')
        else:
            pass
new_old_id_df = pd.DataFrame(new_old_id, columns=['old_id', 'new_id'])
new_old_id_df.to_csv(sys.argv[3], sep='\t', index=False)
```
```python
import pandas as pd
import numpy as np
from Bio import SeqIO
import sys

lncfinder = pd.read_csv(sys.argv[1], sep=',', header=None)
record_dict = SeqIO.to_dict(SeqIO.parse(sys.argv[2], "fasta"))
lncrna = open(sys.argv[3], 'w', encoding='utf-8')
coding = open('data/output/lncRNA_prediction/Coding.fasta', 'w', encoding='utf-8')
lncfinder = lncfinder[[0,1]]
finder_cod = lncfinder[lncfinder[1].str.contains("NonCoding")==False]
finder_non = lncfinder[lncfinder[1].str.contains("NonCoding")]
for w in finder_non[0]:
    try:
        print(record_dict[w].format('fasta'), end='', file=lncrna)
    except KeyError:
        continue
for w in finder_cod[0]:
    try:
        print(record_dict[w].format('fasta'), end='', file=coding)
    except KeyError:
        continue
```
```python
import numpy as np
from sklearn.model_selection import train_test_split
from Bio import SeqIO
import sys
import os

lnc = SeqIO.to_dict(SeqIO.parse(sys.argv[1], "fasta"))
cds = SeqIO.to_dict(SeqIO.parse(sys.argv[2], "fasta"))
train_lnc = open('./data/input/test_train/train_lnc.fasta', 'w', encoding='utf-8')
print(len(lnc.keys()))
print(len(cds.keys()))
test_lnc = len(lnc.keys())*0.25
test_mrna = test_lnc*2
n = 0
keys = list(cds.keys())
pers_lnc = test_lnc/len((list(lnc.keys())))
Y_train, Y_test = train_test_split(list(lnc.keys()), test_size=pers_lnc, shuffle=True)
print(len(Y_test))
if len(Y_train)*2 <= len(cds.keys())/5:
    train_mrna = len(Y_train)*2
else:
    train_mrna = len(cds.keys())/5.2
print(train_mrna)
for w in Y_train:
    print(lnc[w].format('fasta'), end='', file=train_lnc)
while n <= 4:
    os.mkdir('./data/input/test_train/' + str(n))
    train_cds = open('./data/input/test_train/' + str(n) + '/train_mrna.fasta', 'w', encoding='utf-8')
    test_file = open('./data/input/test_train/' + str(n) + '/test.fasta', 'w', encoding='utf-8')
    compare = open('./data/input/test_train/' + str(n) + '/compare.csv', 'w', encoding='utf-8')
    pers_mrna = train_mrna/(len(keys))
    print(pers_mrna)
    X_train, X_test = train_test_split(keys, test_size=pers_mrna, shuffle=True)
    print('train_set', len(X_train))
    print('test_set', len(X_test))
    if len(X_test) > test_mrna:
        pers_mrna_test = test_mrna/(len(X_test))
    else:
        true_per = len(X_test)*0.25
        pers_mrna_test = true_per/len(X_test)
    pers_mrna_test = test_mrna/(len(X_test))
    train, test = train_test_split(X_test, test_size=pers_mrna_test, shuffle=True)
    for w in test:
        print(cds[w].format('fasta'), end='', file=test_file)
        print(cds[w].id, 'mrna', sep='\t', file=compare)
    for w in train:
        print(cds[w].format('fasta'), end='', file=train_cds)
    for w in Y_test:
        print(lnc[w].format('fasta'), end='', file=test_file)
        print(lnc[w].id, 'lnc', sep='\t', file=compare)
    keys = [w for w in keys if w not in X_test]
    print('new_set', len(keys))
    n = n + 1
```
```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sys

def value_prepare(x, y, z):
    name = x[0].str.rsplit('_', 1, expand=True)
    x['srx'] = name[1]
    same = pd.merge(df, x, how='inner', left_on=[1], right_on=['srx'])
    transcript = same[1].value_counts().reset_index()
    transcript = pd.merge(df, transcript, how='inner', left_on=[1], right_on=['index'])
    transcript.to_csv('data/output/tissue/' + y + '.csv')
    pr = same[2].value_counts().reset_index()
    pr = pr.rename(columns={2: z}, inplace=False)
    return (pr, same)

# LncAPDB_vs_blast table
blast = pd.read_csv(sys.argv[2], sep='\t', header=None)
LncAPDB = pd.read_csv(sys.argv[3], sep=' ', header=None)
old_new = pd.read_csv(sys.argv[4], sep='\t', header=None, skiprows=1)
LncAPDB = LncAPDB.drop_duplicates(subset=[1])
blast = blast[blast[2].isin(['100.000'])]
LncAPDB_vs_blast = pd.merge(blast, LncAPDB, how='inner', left_on=[1], right_on=[1])
LncAPDB_vs_blast = LncAPDB_vs_blast[['0_x',1,'2_x',10, '0_y','2_y']]
LncAPDB_vs_blast.rename(columns={'0_x': 'transcript_id', 1: 'library_id', '2_x': 'percent_identity',
                                 10: 'e_value', '0_y': 'id_of_database', '2_y': 'database'}, inplace=True)
LncAPDB_vs_blast.to_csv('data/output/new_lncRNA/LncAPDB_vs_blast.csv', sep='\t', index=False)
# tissue analysis
blast = pd.read_csv('data/output/tissue/cons.txt', header=None)
blast_non = pd.read_csv('data/output/tissue/non.txt', header=None)
blast_cod = pd.read_csv('data/output/tissue/cod.txt', header=None)
df = pd.read_csv(sys.argv[5], sep='\t', header=None)
df = df[df[0].isin([str(sys.argv[1])])]
df_tis = df[2].value_counts().reset_index()
blast_cod = old_new[old_new[1].isin(blast_cod[0])]
try:
    transc, same = value_prepare(blast, 'transc', 'Conserved')
    transc_non, same_non = value_prepare(blast_non, 'transc_non', 'Nonconserved')
    transc_cod, same_cod = value_prepare(blast_cod, 'transc_cod', 'Coding')
    full_table = pd.merge(transc, transc_cod, how='outer', left_on=['index'], right_on=['index'])
    full_table = pd.merge(full_table, transc_non, how='outer', left_on=['index'], right_on=['index'])
    full_table = pd.merge(full_table, df_tis, how='inner', left_on=['index'], right_on=['index'])
    full_table = full_table.fillna(0)
    full_table['con'] = full_table['Conserved'] / full_table[2]
    full_table['noncon'] = full_table['Nonconserved'] / full_table[2]
    full_table['cod'] = full_table['Coding'] / full_table[2]
    full_table['lncRNA conserved'] = full_table['con'] / len(same[2])
    full_table['lncRNA nonconseved'] = full_table['noncon'] / len(same_non[2])
    full_table['mRNA'] = full_table['cod'] / len(same_cod[2])
    full_table = full_table[['index', 'lncRNA conserved', 'lncRNA nonconseved', 'mRNA']]
    full_table.to_csv('data/output/tissue/tissue_org.csv')
    full_table = full_table.set_index('index')
    fig, ax = plt.subplots(1, 1, figsize=(8, 10))
    sns.heatmap(full_table, annot=True, annot_kws={'size': 10}, xticklabels=True, cmap='Blues',
                ax=ax, cbar_kws={"shrink": .82}, yticklabels=True)
    plt.savefig('data/output/tissue/tissue_org.png', bbox_inches='tight')
except KeyError:
    print('Please check format of your sequence ID. It should look like "MSTRG.1031.1_SRX123456" or "TRINITY_DN195_c0_g1_i1_SRX339783". The defining parameter is "_SRX#######"; add it after your main ID.')
```
```python
import pandas as pd
import sys
import glob

def Specificity(x, y):
    if x+y == 0:
        return float(0)
    else:
        return x/(x+y)

def Precision(x, y):
    if x+y == 0:
        return float(0)
    else:
        return x/(x+y)

def Sensitivity(x, y):
    if x+y == 0:
        return float(0)
    else:
        return x/(x+y)

def F1(x, y):
    if x+y == 0:
        return float(0)
    else:
        return 2*(x*y)/(x+y)

def TP_FP(q_c, q_n, t_c, t_n):
    TP = 0
    FP = 0
    FN = 0
    TN = 0
    for w in t_n.to_list():
        if w in q_n.to_list():
            TP = TP+1
        if w in q_c.to_list():
            FP = FP+1
    for w in t_c.to_list():
        if w in q_c.to_list():
            TN = TN+1
        if w in q_n.to_list():
            FN = FN+1
    print(TP, FN, FP, TN)
    Pre = Precision(TP, FP)
    Spe = Specificity(TN, FP)
    Sen = Sensitivity(TP, FN)
    F = F1(Pre, Sen)
    return Pre, Sen, Spe, F

dirr = glob.glob(sys.argv[1] + '/*/')
f1_comp = {}
for w in dirr:
    cds_lnc = pd.read_csv(str(w) + '/compare.csv', sep='\t', header=None)
    test = pd.read_csv(str(w) + '/results.csv', sep=',', header=None)
    test = test[[0,1]]
    results = open(str(w) + '/F1.csv', 'w', encoding='utf-8')
    cds_lnc_cod = cds_lnc[cds_lnc[1].str.contains("mrna")]
    cds_lnc_non = cds_lnc[cds_lnc[1].str.contains("lnc")]
    test_cod = test[test[1].str.contains("NonCoding")==False]
    test_non = test[test[1].str.contains("NonCoding")]
    print(len(cds_lnc_cod[0]), len(cds_lnc_non[0]), len(test_cod[0]), len(test_non[0]))
    pre, sen, spe, f = TP_FP(cds_lnc_cod[0], cds_lnc_non[0], test_cod[0], test_non[0])
    f1_comp[w] = f
    print(pre, sen, spe, f, sep='\t', file=results)
best_model = open(sys.argv[1] + '/best_model.txt', 'w', encoding='utf-8')
print(max(f1_comp, key=f1_comp.get), file=best_model)
```
Snakefile rule fragments (shell/run directives):

```python
run:
    shell("python scripts/test_train.py {input} > {output}")

shell:
    "Rscript scripts/model_lncFind.r {params.path} {params.lnc} {params.str} >> {output}"

shell:
    "python scripts/TP_FN.py {params}"

shell:
    "python scripts/rename.py {input} {output} {params}"

shell:
    "scripts/Chunk_dataframe.sh {input} {params}"

shell:
    "CPC2/bin/CPC2.py -i {input} -r -o {params}"

shell:
    "python scripts/take_noncoding.py {input} {output}"

shell:
    "python scripts/GMAP_build.py {input} {output}"

shell:
    "python scripts/GMAP.py {input} {config[model]} {output}"

run:
    shell("gff2bed < {input.query} > {output.query_out}")
    shell("gff2bed < {input.reference} > {output.reference_out}")

shell:
    "python scripts/gmap_filter.py {input} {output}"

run:
    shell("gffcompare -r {input} -o {params}")
    shell("mv data/output/GMAP/gffcmp.filter_alignm.bed.tmap data/output/gffcompare_first/")
    shell("mv data/output/GMAP/gffcmp.filter_alignm.bed.refmap data/output/gffcompare_first/")

run:
    shell("python scripts/lncRNA_class.py {input} {params} {output}")

run:
    shell("gffcompare -r {input} -o {params}")
    shell("mv data/output/GMAP/gff.filter_alignm.bed.tmap data/output/gffcompare_second/")
    shell("mv data/output/GMAP/gff.filter_alignm.bed.refmap data/output/gffcompare_second/")

shell:
    "python scripts/bed_intersect.py {input.gffcomp} {input.aling} {params.option} {output.loci} {input.coords} {output.TE} {output.true_lnc}"

run:
    shell("python scripts/lncRNA_class.py {input} {params} {output}")

run:
    shell("gffcompare -r {input} -o {params}")
    shell("mv data/output/GMAP/gff.filter_alignm.bed.tmap data/output/gffcompare_second/")
    shell("mv data/output/GMAP/gff.filter_alignm.bed.refmap data/output/gffcompare_second/")

shell:
    "python scripts/bed_intersect.py {input} {params} {output}"

shell:
    "python scripts/gff.py {input} {params}"

shell:
    "python scripts/blast.py {input} {output.fasta} {params.evalue} {params.max_target} {params.identity} {params.threads} {params.old_new}"

run:
    shell("""grep -v NonCoding {input.lnc} | awk -F ',' '{{print $1}}' | tr -d '"' > data/output/tissue/cod.txt""")
    shell("grep -v {params.org} {input.blast} | awk -F '\t' '{{print $1}}' | sort | uniq > data/output/tissue/cons.txt")
    shell("grep {params.org} {input.blast} | awk -F '\t' '{{print $1}}' | sort | uniq > data/output/tissue/non.txt")
    shell("grep '>' {input.new} | awk -F '>' '{{print $2}}' > data/output/tissue/id.txt")
    shell("awk -F '\t' '{{print $1}}' {input.blast} | sort | uniq > data/output/tissue/all.txt")
    shell("grep -v -w -f data/output/tissue/all.txt data/output/tissue/id.txt > data/output/tissue/non_aling.txt")
    shell("python scripts/tissue.py {params.exp} {input.blast} {params.LncAPDB} {params.old_new} {params.SRX}")

shell:
    "python scripts/expr_table_form.py {input.SRX} {params} {output}"

shell:
    "Rscript scripts/Expraession_entropy.R {params.table}"
```
Support
- Future updates