Automatic identification, classification and annotation of plant lncRNAs in transcriptomic sequences
Introduction
Long non-coding RNAs (lncRNAs) are RNA molecules longer than 200 nucleotides that do not encode proteins. Experimental studies have shown the diversity and importance of lncRNA functions in plants. To expand knowledge of lncRNAs in further species, computational pipelines have recently been developed that standardise the data-processing steps and run without user intervention up to the final result, widening the functionality available for lncRNA identification and analysis. In the present work, we propose the ICAnnoLncRNA pipeline for automatic identification, classification and annotation of plant lncRNAs in transcriptomic sequences assembled from high-throughput RNA sequencing (RNA-seq) data.
This pipeline is only applicable to the Linux operating system.
The pipeline includes the following steps:
1. Pre-processing.
   - GMAP index building.
   - LncFinder model building.
2. LncRNA filtering.
   - Length filtering.
   - LncFinder prediction.
   - CPC2 prediction.
   - GMAP alignment to the reference genome.
   - Filtering errors/noise.
   - Filtering possible transposable elements (TEs).
3. LncRNA annotation.
   - Classification with gffcompare.
   - BLASTn alignment.
   - Analysis of lncRNA expression.
The pipeline is implemented with the workflow management system Snakemake, which enables platform-independent installation and execution of the software.
Schematic diagram
Installation
Automatic (recommended for clusters/servers)
Install only the programs.yaml environment; the other environments are installed automatically when ICAnnoLncRNA starts.
1. wget https://github.com/artempronozin95/ICAnnoLncRNA-identification-classification-and-annotation-of-LncRNA/archive/refs/heads/main.zip
2. unzip main.zip
3. cd ./ICAnnoLncRNA-identification-classification-and-annotation-of-LncRNA-main
4. conda env create --file env/programs.yaml
5. conda activate ICAnnoLncRNA
After these steps all necessary packages are installed. If you need to update packages (not recommended), change the version after "=" (for example, snakemake=4.0.1 -> snakemake=6.0.0), then run conda env update --file ./programs.yaml and all necessary packages will be updated. This method is recommended on clusters, as it requires a lot of processing power.
Step method (recommended for a personal computer)
1. conda update conda
2. conda create -n ICAnnoLncRNA python=3.6
3. conda activate ICAnnoLncRNA
4. conda install -c bioconda emboss
5. Install the remaining packages from the file below.
Install all packages represented here.
It is also necessary to download CPC2 into the pipeline folder:
1. cd ./ICAnnoLncRNA---identification-classification-and-annotation-of-LncRNA-main
2. wget https://github.com/biocoder/CPC2/archive/refs/heads/master.zip
3. unzip master.zip
4. mv CPC2-master CPC2
Then download LncAPDB.fasta into the data/reference/data_index folder.
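A minimal sketch of this last step is shown below; the file URL is a placeholder, so copy the actual LncAPDB.fasta link from the Mendeley dataset listed in the Data section.

```bash
# Placeholder URL: take the actual LncAPDB.fasta link from
# https://data.mendeley.com/datasets/fnk8pmp2yz/3
wget -O data/reference/data_index/LncAPDB.fasta "<LncAPDB.fasta-download-link>"
```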
Input
Genome sequence
- Known lncRNA transcripts of the species in FASTA format: a file with lncRNAs from a public lncRNA database, for example Ensembl or EVLncRNAs v2.
- Known protein-coding sequences of the species in FASTA format: a file with protein-coding sequences from a public database, for example Ensembl.
- RNA-seq transcripts of the species in FASTA format: transcripts obtained after transcriptome assembly, for example the provided file 30k_zey.fasta, obtained with Trinity. All three sets are required as input.
Reference genome
- Reference genome of the species in FASTA format.
- Annotation of the species in GFF/GTF format.
Additional data
- TE coordinates on the reference genome (example: Zea_N_merged.bed).
- File with sequence IDs from public lncRNA databases linked to IDs in the library of known lncRNAs (example: index_and_newindex.csv).
- Annotation of the organism's transcriptome libraries by tissue type (example: SRX_all_org.tsv).
Data
Additional data is available here: https://data.mendeley.com/datasets/fnk8pmp2yz/3
It includes:
- blast.outfmt6 - BLASTn results; contains homologs of known lncRNA sequences from the LncAPDB library.
- index_and_newindex.fasta - indices of the PNRD, CANTATAdb, GREENC, PlncDB and EVLncRNA databases mapped to the new indices of the LncAPDB library.
- LncAPDB.fasta - lncRNA sequences of the LncAPDB library in FASTA format.
- lncrna_class.tmap - novel lncRNAs divided into gffcompare classes.
- lncrna_coordinats.bed - coordinates of novel lncRNAs on chromosomes.
- lncrna_intron_size.tsv - intron sizes of novel lncRNAs and their coordinates on the genome.
- new_lncrna.fasta - novel lncRNA sequences in FASTA format.
- new_lncrna_locus.loci - loci of lncRNA sequences on the genome.
- transcriptome_lib.txt - maize transcriptome libraries.
Configuration file
Input all necessary file paths and settings into the configuration file config.yaml:
- lnc: known lncRNA sequences of the studied organism in FASTA format. (Example: lnc: "data/input/lincrna.fasta")
- cds: known CDS (protein-coding) sequences of the studied organism in FASTA format. (Example: cds: "data/input/7000_len_mRNA.fasta")
- model: the pipeline checks whether a lncFinder model for this organism already exists and uses it; if not, the pipeline creates a new one (in the output folder). Existing models are listed under Models. (Example: model: "Zea_mays")
- sequence: the sequences to study in FASTA format. (Example: sequence: "data/input/Zea.fasta")
- structure: whether to use secondary structure in model building. Choose between DNA and SS (secondary structure). (Example: structure: "DNA")
- gmap:
  - gff_reference: reference genome annotation in GFF format. (Example: gff_reference: "data/reference/Zea_mays.AGPv4.40.gff")
  - reference: reference genome in FASTA format. (Example: reference: "data/reference/Zea_mays.AGPv4.dna.toplevel.fa")
  - out: output file. (Example: out: "data/output/alignm.gff")
- gff: GFF file of the reference genome. (Example: gff: "data/reference/Zea_mays.AGPv4.40.gff")
- blast: standard BLAST parameters.
  - evalue: evalue. (Example: evalue: 1e-50)
  - max_target: max_target_seqs. (Example: max_target: 1)
  - identity: perc_identity. (Example: identity: 30)
  - threads: num_threads. (Example: threads: 1)
- TE_coords: if no TE coordinate information is available, it is recommended to turn this step off.
  - option: turns this step on or off (on/off). (Example: option: "on")
  - coords: TE coordinates. (Example: coords: "data/reference/N_coords/Zea_N_merged.bed")
- tissue:
  - organism: choose the species you study from the species list, or input your own species ID in the same format. (Example: organism: "ZMAY")
  - exp: choose between the organism tissue experiments listed under Tissue analysis, or input your own organism. (Example: exp: "ZM")
  - LncAPDB: file with sequence IDs from public lncRNA databases linked to IDs in the library of known lncRNAs. (Example: LncAPDB: "tissue/index_and_newindex.csv")
  - SRX_exp: annotation of the transcriptome libraries by tissue type and gene expression level. (Example: SRX_exp: "tissue/SRX_all_org.tsv")
- expression:
  - option: turns this step on or off (on/off). (Example: option: "on")
  - path: path to the expression data. (Example: path: "expression/Kallisto")
  - calculation_type: type of expression data; there are three types: Kallisto, Htseq, or your own table. (Example: calculation_type: "Kallisto")
Folders
data/input
Contains:
- Known lncRNA transcripts of the species in FASTA format.
- Known protein-coding sequences of the species in FASTA format.
- RNA-seq transcripts of the species in FASTA format.
data/reference
Contains:
- Reference genome of the studied organism in FASTA format.
- Annotation of the studied organism in GFF/GTF format.
- data_index - folder with the library of known lncRNAs from databases.
- intron_data - folder with structural information for organisms.
- models - folder with models for lncFinder.
- N_coords - folder with TE coordinates.
Work start
1. snakemake -j 2
-j or --cores - use at most N CPU cores/jobs in parallel. If N is omitted or 'all', the limit is set to the number of available CPU cores.
2. snakemake -nr
-n - do not execute anything, just display what would be done. If you have a very large workflow, use --dry-run --quiet to print only a summary of the DAG of jobs.
-r - print the reason for each executed rule.
3. snakemake --use-conda
--use-conda - use the additional conda environments.
4. Recommended run: snakemake -j 2 --use-conda
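A typical first session therefore combines a dry run with the recommended invocation:

```bash
# preview the jobs and the reason each would run (nothing is executed)
snakemake -nr
# run the pipeline with 2 parallel jobs, letting Snakemake create the extra conda environments
snakemake -j 2 --use-conda
```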
Models
- Zea_mays
- Arabidopsis_thaliana
LncRNA structure information
- Zea_mays
- Arabidopsis_thaliana
- Oryza_sativa
Known LncRNA for database
The data/reference/data_index folder contains the lncRNA library (LncAPDB.fasta), built from five lncRNA databases (PNRD, CANTATAdb, GREENC, PlncDB, EVLncRNA). The library can be found here: https://data.mendeley.com/v1/datasets/fnk8pmp2yz/draft?a=13de1dfb-b631-42f3-8108-f3de2f50fd90
Species
- Medicago truncatula - MTRU
- Glycine max - GMAX
- Populus trichocarpa - PTRI
- Arabidopsis thaliana - ATHA
- Vitis vinifera - VVIN
- Oryza sativa japonica - OSAT
- Brachypodium distachyon - BDIS
- Sorghum bicolor - SBIC
- Zea mays - ZMAY
- Selaginella moellendorffii - SMOE
- Physcomitrella patens - PPAT
- Ostreococcus tauri - OTAU
- Volvox - VOLV
- Amborella trichopoda - ATRI
Tissue analysis
The file tissue/SRX_all_org.tsv contains information about 1241 transcriptome experiment libraries for 5 organisms, annotated by tissue. In config.yaml you need to choose one of the organisms present in the SRX file:
- HV - Hordeum vulgare
- OS - Oryza sativa
- SL - Solanum lycopersicum
- ST - Solanum tuberosum
- ZM - Zea mays
If the libraries you are investigating (for example, for another organism) are not in this file, you can add them to the data file in the same format:
ST	SRX1478098	leaf
ZM	SRX339769	stomatal division zone
ZM	SRX339787	ear
ZM	SRX711129	leaf
OS	SRX2582231	unknown
ZM	SRX711024	ear
ZM	SRX711053	leaf
where column 1 is the organism ID, column 2 the SRX library and column 3 the tissue (the file is tab-separated).
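Appending a library is then a one-line edit; a sketch is shown below (the SRX accession and tissue values here are hypothetical):

```bash
# organism ID, SRX library and tissue, separated by tabs (values are examples only)
printf 'ZM\tSRX0000000\troot\n' >> tissue/SRX_all_org.tsv
```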
Expression and entropy
This step requires expression data for every transcript, so it is recommended to turn it off on the first run of the pipeline. After obtaining expression data, turn the step on and the pipeline will calculate entropy and expression statistics. In config.yaml you need to choose between three types of table building:
- Kallisto - results are obtained with the Kallisto program and should be organized as in this example:
├── SRR957466
│   ├── abundance.h5
│   ├── abundance.tsv
│   └── run_info.json
├── SRR957467
│   ├── abundance.h5
│   ├── abundance.tsv
│   └── run_info.json
└── SRR957468
    ├── abundance.h5
    ├── abundance.tsv
    └── run_info.json
- Htseq - results are obtained with the HTSeq program and should be organized as in this example:
├── SRR1586686.txt
└── SRR504468.txt
- own - in this case the user builds the table themselves. It should look like the table presented in the example folder, where:
  - type - mRNA for protein-coding genes, noncons for non-conserved lncRNAs, consv for conserved lncRNAs;
  - one column per tissue of interest - in the example table, each value is the median expression over the libraries belonging to the tissue named in the column;
  - target_id - ID of the transcript; the transcript IDs must correspond to the results of the pipeline.
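For orientation, an "own" table following this description might look like the sketch below (tab-separated; the tissue columns and TPM values are invented for illustration, so match the column layout of the table in the example folder):

```
target_id	type	leaf	ear	root
Transcript_1	mRNA	12.4	3.1	0.8
Transcript_2	consv	0.0	5.6	1.2
Transcript_3	noncons	2.3	0.0	0.4
```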
Output
A typical structure of the output directory is as follows:
├── gffcompare_first
│   ├── gffcmp.annotated.gtf
│   ├── gffcmp.filter_alignm.bed.refmap
│   ├── gffcmp.filter_alignm.bed.tmap
│   ├── gffcmp.loci
│   ├── gffcmp.stats
│   └── gffcmp.tracking
├── gffcompare_second
│   ├── gff.annotated.gtf
│   ├── gff.filter_alignm.bed.refmap
│   ├── gff.filter_alignm.bed.tmap
│   ├── gff.loci
│   ├── gff.stats
│   └── gff.tracking
├── GMAP
│   ├── alignm.bed
│   ├── alignm_filter.gff
│   ├── filter_alignm.bed
│   ├── gmap_build.err.log
│   ├── gmap_build.out.log
│   ├── gmap.out.log
│   ├── reference.bed
│   └── statistic.csv
├── lncRNA_prediction
│   ├── Coding.fasta
│   ├── cpc.txt
│   ├── lncFinder.csv
│   └── Noncoding.fasta
├── lncRNA_structure
│   ├── anti.png
│   ├── exon_size.png
│   ├── intron_size.png
│   ├── itron_coordin.tsv
│   ├── long_transcripts.bed
│   ├── number_of_exon.png
│   ├── number_of_lncRNA.png
│   └── statistic_bed.tsv
├── loci
│   ├── lncRNA_before_loci.bed
│   └── lncRNA_loci.bed
├── new_lncRNA
│   ├── blast.outfmt6
│   ├── classes.png
│   ├── LncAPDB_vs_blast.csv
│   ├── new_lncrna.fasta
│   └── True_lncRNA.bed
├── TE_finder
│   ├── Lnc_aling_with_TE.tsv
│   └── lncRNA_after_loci.bed
├── tissue
│   ├── all.txt
│   ├── cod.txt
│   ├── cons.txt
│   ├── id.txt
│   ├── non.txt
│   ├── tissue_org.csv
│   ├── tissue_org.png
│   ├── transc_cod.csv
│   ├── transc.csv
│   └── transc_non.csv
└── expression
    ├── box_plot.png
    ├── entropy.png
    ├── expr_table.tsv
    ├── hist_expr.png
    └── table_type.txt
- The "tissue" folder contains the distribution of lncRNAs across tissues.
Groups of output files
1. Pre-processing.
Pre-processing results are written to the input directory:
- filtered_seq.fasta - transcripts remaining after filtering out sequences shorter than 200 nt.
- new_vs_old_id.tsv - new and old IDs of the transcripts.
- test_train - folder used for lncFinder model training.
2. LncRNA filtering.
lncRNA_prediction
- Coding.fasta - transcripts predicted as protein-coding by lncFinder, FASTA format.
- cpc.txt - transcripts predicted as lncRNA by CPC2, TXT format.
- lncFinder.csv - transcripts predicted as lncRNA by lncFinder, CSV format.
- Noncoding.fasta - transcripts predicted as lncRNA by lncFinder, FASTA format.
GMAP alignment to the reference genome.
GMAP
- alignm.bed - results of lncRNA transcript alignment to the reference genome by GMAP, BED format.
- alignm_filter.gff - alignment results without long-intron transcripts, GFF format.
- filter_alignm.bed - alignment results without long-intron transcripts, BED format.
- gmap_build.err.log
- gmap_build.out.log
- gmap.out.log
- reference.bed - reference genome annotation, BED format.
- statistic.csv - lncRNA structure statistics, CSV format.
Filtering errors/noise.
gffcompare_first - classification of the candidate lncRNA sequences by their location in the genome relative to protein-coding genes; more information can be found in the gffcompare documentation.
- gffcmp.annotated.gtf - gffcompare reports a GTF file containing the "union" of all transcript IDs in each sample.
- gffcmp.filter_alignm.bed.refmap - this tab-delimited file lists, for each reference transcript, which query transcript fully or partially matches it.
- gffcmp.filter_alignm.bed.tmap - this tab-delimited file lists the most closely matching reference transcript for each query transcript.
- gffcmp.loci - loci of the transcripts.
- gffcmp.stats
- gffcmp.tracking - this file matches transcripts up between samples.
loci
- lncRNA_before_loci.bed - the candidate lncRNA sequences before merging into loci, BED format.
- lncRNA_loci.bed - loci of the candidate lncRNAs, BED format.
gffcompare_second - classification of the candidate lncRNA sequences by their location in the genome relative to lncRNA loci.
- gff.annotated.gtf
- gff.filter_alignm.bed.refmap
- gff.filter_alignm.bed.tmap
- gff.loci
- gff.stats
- gff.tracking
Filtering possible transposable elements (TEs).
TE_finder
- Lnc_aling_with_TE.tsv - the candidate lncRNA sequences that map to TEs, TSV format.
- lncRNA_after_loci.bed - the candidate lncRNA sequences that completely match (the = class of gffcompare) lncRNA loci, BED format.
3. LncRNA annotation.
lncRNA_structure
- anti.png - distribution of antisense lncRNA alignments over the target gene structure.
- exon_size.png - lncRNA exon size distribution.
- intron_size.png - lncRNA intron size distribution.
- itron_coordin.tsv - coordinates of lncRNA introns, TSV format.
- long_transcripts.bed - transcripts longer than 5000 nt, BED format.
- number_of_exon.png - distribution of the number of exons per lncRNA.
- number_of_lncRNA.png - number of lncRNAs per chromosome.
- statistic_bed.tsv - lncRNA intron statistics, TSV format.
Blastn alignment.
new_lncRNA
- blast.outfmt6 - BLASTn results in outfmt6 format.
- classes.png - distribution of lncRNA classes.
- LncAPDB_vs_blast.csv - known lncRNAs that were predicted.
- new_lncrna.fasta - identified lncRNA transcripts, FASTA format (final result).
- True_lncRNA.bed - identified lncRNA transcripts, BED format.
tissue
- all.txt - lncRNAs that have homologs among known lncRNAs.
- cod.txt - IDs of protein-coding transcripts.
- cons.txt - IDs of lncRNAs similar to lncRNAs from other species (conserved sequences).
- id.txt - IDs of all input transcripts.
- non.txt - IDs of lncRNAs similar only to known sequences of the studied organism (non-conserved sequences).
- tissue_org.csv - all in one (protein-coding, conserved lncRNAs, non-conserved lncRNAs).
- tissue_org.png - heatmap of transcript library tissue types.
- transc_cod.csv - protein-coding transcript library tissue types.
- transc.csv - conserved lncRNA transcript library tissue types.
- transc_non.csv - non-conserved lncRNA transcript library tissue types.
Analysis of lncRNA expression.
expression
- box_plot.png - box plots of gene expression.
- entropy.png - density plot of Shannon entropy.
- expr_table.tsv - table with expression data for every transcript in every tissue.
- hist_expr.png - density plot of maximum transcript expression levels.
- table_type.txt - type of expression results.
Errors
rule model_lncFinder:
input: data/input/test_train/sets.txt
output: data/input/test_train/dirs.txt
jobid: 2
Activating conda environment /home/pronozin/ICAnnoLncRNA---identification-classification-and-annotation-of-LncRNA-main/.snakemake/conda/767d2b45.
/bin/bash: activate:
Solution: run conda install conda in the ICAnnoLncRNA environment.

rule Gmap_build:
input: data/reference/Zea_mays.AGPv4.dna.toplevel.fa, data/reference/Zea_mays.AGPv4.40.gff
output: data/output/statistic.csv
jobid: 5
...
Error in rule Gmap_build:
jobid: 5
output: data/output/statistic.csv
RuleException:
CalledProcessError in line 123 of ...:
Command ' set -euo pipefail; ...'
Solution: update the gtfparse package, for example conda install -c bioconda gtfparse.
Code Snippets
```python
import pandas as pd
import sys
import numpy as np
from pybedtools import BedTool

pd.set_option('display.max_columns', None)

def clean(x):
    split = x.str.split('.', 4, expand=True)
    split = split[0]
    return (split)

tmap = pd.read_csv(sys.argv[1], sep='\t')
gmap_align_prime = pd.read_csv(sys.argv[2], sep='\t', header=None)
# take only lncRNA classes from tmap file
tmap['group'] = np.nan
tmap['group'][tmap['class_code'].isin(['='])] = 'true lncrna'
tmap.dropna(subset=["group"], inplace=True)
tmap = tmap[tmap['qry_gene_id'].str.contains("path1")]
print(tmap['group'].value_counts())
# filter duplicates
query = clean(tmap['qry_gene_id'])
gmap_align_prime['name'] = clean(gmap_align_prime[3])
query = query.drop_duplicates()
# take lncRNA from alignment file
if str(sys.argv[3]) == 'on':
    print('TE finder ON')
    gmap_align_prime = gmap_align_prime[gmap_align_prime['name'].isin(query)]
    gmap_align = gmap_align_prime[gmap_align_prime[7].str.contains("gene")]
    gmap_align.to_csv(sys.argv[4], sep='\t', index=False, header=None)
    # BedTool intersect
    genes = BedTool(sys.argv[4])
    loci = genes.intersect(wa=True, wb=True, b=sys.argv[5])
    loci_df = pd.read_table(loci.fn, sep='\t', names=['chr','lncRNA_start','lncRNA_end','lncRNA',4,'lncRNA_strand',6,7,8,9,10,'chr_TE','TE_start','TE_end'])
    loci_df = loci_df[['chr','lncRNA_start','lncRNA_end','lncRNA','lncRNA_strand','chr_TE','TE_start','TE_end']]
    loci_df = loci_df.drop_duplicates(subset=['lncRNA'])
    loci_name = loci_df['lncRNA'].str.split('.', 4, expand=True)
    true_lncrna = gmap_align_prime[~gmap_align_prime['name'].isin(loci_name[0])]
    loci_df.to_csv(sys.argv[6], sep='\t', index=False, header=None)
    true_lncrna.to_csv(sys.argv[7], sep='\t', index=False, header=None)
else:
    gmap_align_prime = gmap_align_prime[gmap_align_prime['name'].isin(query)]
    gmap_align_prime.to_csv(sys.argv[4], sep='\t', index=False, header=None)
```
```python
import os
import subprocess
import sys
from Bio import SeqIO
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

def data_base(x):
    x_n = x.rsplit('/', 1)[1]
    x_n = x_n.rsplit('.', 1)[0]
    dir_check = os.path.isfile('data/reference/data_index/' + x_n + '.nin')
    if dir_check is True:
        print('index exist')
        index_path = os.path.join('data/reference/data_index/', x_n)
        return index_path
    else:
        print('build index')
        index_path = os.path.join('data/reference/data_index/', x_n)
        command = 'makeblastdb -in {db} -dbtype nucl -parse_seqids -out {index}'.format(db=x, index=index_path)
        exit_code = subprocess.call(command, shell=True)
        return index_path

def fasta(x):
    query = x['name']
    for w in query:
        try:
            old_id = old_new[old_new['new_id'].isin([w])]
            print(record_dict[str(old_id['old_id'].to_list()[0])].format('fasta'), end='', file=lncrna)
        except KeyError:
            continue

def alingment(x, y):
    path = x.rsplit('/', 1)[0]
    outfmt = os.path.join(path, 'blast' + '.outfmt6')
    aling = 'blastn -query {q} -db {dbw} -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen" -evalue {evalue} -max_target_seqs {max_target} -perc_identity {identity} -num_threads {threads} -out {outfmt}'.format(q=x, dbw=y, evalue=evalue, max_target=max_target, identity=identity, threads=threads, outfmt=outfmt)
    #| sort -k1,1 -k12,12nr -k11,11n | sort -u -k1,1 --merge > {outfmt}'. format(q=x, dbw=y, outfmt=outfmt)
    exit_code = subprocess.call(aling, shell=True)

# input data
query = pd.read_csv(sys.argv[1], sep='\t', header=None)
gff_tmap = sys.argv[2]
data = sys.argv[3]
record_dict = SeqIO.to_dict(SeqIO.parse(sys.argv[4], "fasta"))
lncrna = open(sys.argv[5], 'w', encoding='utf-8')
evalue = sys.argv[6]
max_target = sys.argv[7]
identity = sys.argv[8]
threads = sys.argv[9]
old_new = pd.read_csv(sys.argv[10], sep='\t')
tmap = pd.read_csv(gff_tmap, sep='\t')
# lncRNA classification graf
query = query[query[7].str.contains("gene")]
tmap = tmap[tmap['qry_gene_id'].str.contains("path1")]
name_tmap = tmap['qry_gene_id'].str.rsplit('.', expand=True)
tmap[['name','seq']] = name_tmap[[0,1]]
#tmap = tmap[tmap['seq'].str.contains("path1")]
tmap = tmap[tmap['name'].isin(query[10])]
tmap = tmap.drop_duplicates(subset=['name'])
tmap['group'] = np.nan
tmap['group'][tmap['class_code'].isin(['x'])] = "exon antisense"
tmap['group'][tmap['class_code'].isin(['i'])] = "intron"
tmap['group'][tmap['class_code'].isin(['u'])] = "intergenic"
tmap.dropna(subset=["group"], inplace=True)
print(tmap['group'].value_counts())
f, ax = plt.subplots(figsize=(13, 13))
tmap['group'].value_counts().plot(kind='bar')
plt.xticks(size=15, rotation=30)
plt.yticks(size=15)
plt.ylabel('LncRNA transcripts number', size=15)
plt.xlabel('LncRNA classes', size=15)
plt.savefig("data/output/new_lncRNA/classes.png", dpi=500)
# BLAST
fasta(tmap)
indx = data_base(data)
alingment(sys.argv[5], indx)
```
```bash
test=$(ls -lrt $2 | cut -f 5 -d ' ')
minimumsize=4000000000
if [ $test -gt $minimumsize ]; then
    mkdir ./data/input/chunk
    awk -v size=10000 -v pre=prefix -v pad=5 '/^>/{n++;if(n%size==1){close(f);f=sprintf("./data/input/chunk/%s%0"pad"d.fa",pre,n)}}{print>>f}' $2
    Rscript scripts/lncFind.r $1 ./data/input/chunk $3 $4
else
    Rscript scripts/lncFind.r $1 $2 $3 $4
fi
```
```r
library(ggpubr)
library(dplyr)
library(purrr)
library(BioQC)
library(ggplot2)
library(data.table)

args <- commandArgs(TRUE)

entrop <- function(a) {
  ent <- entropySpecificity(a, norm=TRUE)
  return(ent)
}

trans <- function(a) {
  a$target_id <- NULL
  a$type <- NULL
  a$sum <- NULL
  entrop(a)
}

type <- function(a) {
  col_name <- colnames(a)
  col_name <- col_name[! col_name %in% c("type", "target_id")]
  a$sum <- rowSums(a[,col_name], na.rm=TRUE)
  a <- filter(a, sum > 0)
  ent <- trans(a)
  a$entropy <- ent
  return(a)
}

plot <- function(a, b, c) {
  ggplot() +
    geom_density(data = a, aes(x = entropy, fill="Non"), alpha=0.4) +
    geom_density(data = b, aes(x = entropy, fill="Cons"), alpha=0.4) +
    geom_density(data = c, aes(x = entropy, fill="mRNA"), alpha=0.4) +
    scale_colour_manual("", breaks = c("Non", "Cons", "mRNA")) +
    xlab('entropy') + ylab('density') +
    guides(fill = guide_legend(title = "Types")) +
    theme(text = element_text(size = 20))
  ggsave("data/output/expression/entropy.png", width = 20, height = 20)
}

log <- function(a) {
  col_name <- colnames(a)
  col_name <- col_name[! col_name %in% c("type", "target_id")]
  col <- c()
  for (x in col_name) {
    col <- append(col, paste("log_", noquote(x), sep = ""))
    a[, paste("log_", noquote(x), sep = "")] <- log2(a[, noquote(x)])
  }
  new_col <<- col
  return(a)
}

# install dataframe
df <- read.table(args[1], header = TRUE)
# log of columns
df_log <- log(df)
# expression analysis
df_log[df_log == -Inf] <- 0
my_comparisons = list(c("consv", "non_cons"), c("consv", "prot"), c("prot", "non_cons"))
ggboxplot(df_log, x="type", y= new_col, merge = "flip", ylab = "log2(TPM)", color = "type", palette = "jco") +
  font("subtitle", size = 20) +
  font("caption", size = 20) +
  font("xlab", size = 20) +
  font("ylab", size = 20) +
  font("xy.text", size = 20) +
  theme(legend.text=element_text(size=20))
ggsave("data/output/expression/box_plot.png", width = 20, height = 20)
ggdensity(df_log, x = new_col, y = "..density..", facet.by = "type", merge = TRUE,
          xlab = "TPM", rug = TRUE, add = "median", palette = "jco") +
  font("subtitle", size = 20) +
  font("caption", size = 20) +
  font("xlab", size = 20) +
  font("ylab", size = 20) +
  font("xy.text", size = 20) +
  theme(legend.text=element_text(size=20))
ggsave("data/output/expression/hist_expr.png", width = 20, height = 20)
# entropy analysis
non <- df[grepl("noncons", df$type),]
cons <- df[grepl("consv", df$type),]
mRNA <- df[grepl("mRNA", df$type),]
non_ent <- type(non)
cons_ent <- type(cons)
mRNA_ent <- type(mRNA)
plot(non_ent, cons_ent, mRNA_ent)
```
```python
import pandas as pd
import os
import sys

strc = str(sys.argv[3])[1:-1]
if strc == 'own':
    print('Own table')
    with open(sys.argv[9], 'w') as f:
        f.write('Own table')
else:
    print('Build table')
    tissue = pd.read_csv(sys.argv[1], sep='\t', header=None)
    path = str(sys.argv[2])[1:-1]
    new_old = pd.read_csv(sys.argv[4], sep='\t')
    cons = pd.read_csv(sys.argv[5], sep=' ', header=None)
    noncons = pd.read_csv(sys.argv[6], sep=' ', header=None)
    cod = pd.read_csv(sys.argv[7], sep=' ', header=None)
    cod = pd.merge(cod, new_old, how='inner', left_on=[0], right_on=['new_id']).drop(columns=[0])
    cod['type'] = 'mRNA'
    noncons['type'] = 'noncons'
    cons['type'] = 'consv'
    if strc == 'Kallisto':
        list_dir = os.listdir(path)
        lib = {}
        for w in list_dir:
            abudance = pd.read_csv(path + '/' + w + '/' + "abundance.tsv", sep='\t')
            lib[w] = abudance['tpm']
    if strc == 'Htseq':
        list_dir = os.listdir(path)
        lib = {}
        for w in list_dir:
            id = w.split('.')
            abudance = pd.read_csv(path + '/' + w, sep='\t', header=None)
            abudance = abudance.rename(columns={0: 'target_id'})
            lib[id[0]] = abudance[1]
    lib_data = pd.DataFrame.from_dict(lib)
    colnames = lib.keys()
    tissue = tissue[tissue[1].isin(colnames)]
    tissue_gr = tissue.groupby([2])
    median_tis = {}
    for k, v in tissue_gr:
        lib_data_gr = lib_data[v[1]]
        median_tis[k] = lib_data_gr.median(axis=1)
    median_tis_df = pd.DataFrame.from_dict(median_tis)
    median_tis_df['target_id'] = abudance['target_id']
    cod = pd.merge(cod, median_tis_df, how='inner', left_on=['new_id'], right_on=['target_id']).drop(columns=['old_id', 'new_id'])
    noncons = pd.merge(noncons, median_tis_df, how='inner', left_on=[0], right_on=['target_id']).drop(columns=[0])
    cons = pd.merge(cons, median_tis_df, how='inner', left_on=[0], right_on=['target_id']).drop(columns=[0])
    full_table = pd.concat([cod, cons, noncons])
    full_table.to_csv(sys.argv[8], sep='\t', index=False)
    with open(sys.argv[9], 'w') as f:
        f.write(strc)
```
```python
import pandas as pd
import sys
from collections import Counter
import matplotlib.pyplot as plt
import scipy.stats as stats
from matplotlib.ticker import FuncFormatter
from functools import partial
import math
import statistics
from scipy.stats import lognorm
import numpy as np

pd.options.mode.chained_assignment = None  # default='warn'

def to_percent(y, position, n):
    s = str(round(100 * y / n, 3))
    return s + '%'

def coord(x):
    Row_list_query = []
    for index, rows in x.iterrows():
        my_list = [rows[1], rows[2]]
        Row_list_query.append(my_list)
    return(Row_list_query)

def find_same_coords(x):
    x['exon_num'] = x['name'] + "_" + exon[7].astype(str).str.cat(x['number'].astype(str), sep='_')
    uni3 = x.drop_duplicates(subset=[1])
    target_coord = coord(uni3)
    for start_stop1 in query_coord:
        start1, stop1 = map(int, start_stop1)
        for start_stop in target_coord:
            start, stop = map(int, start_stop)
            if start1 in range(start-100, start + 100) and stop1 in range(stop - 100, stop+100):
                same_query = uni[1] == start1
                same_target = uni3[1] == start
                index = uni3[same_target]
                lines = uni[same_query]
                chrom.append([' '.join('{0}'.format(i) for i in lines[3]), ' '.join('{0}'.format(i) for i in index['exon_num']), str(round(start1)), str(round(stop1)), str(start), str(stop), ' '.join('{0}'.format(i) for i in index['number']), ' '.join('{0}'.format(i) for i in lines['length']), ' '.join('{0}'.format(i) for i in index[5]), ' '.join('{0}'.format(i) for i in lines[5])])

query = pd.read_csv(sys.argv[1], sep='\t', header=None)
target = pd.read_csv(sys.argv[2], sep='\t', header=None)
tmap = pd.read_csv(sys.argv[3], sep='\t', header=None, skiprows=1)
small_intron = pd.read_csv(sys.argv[4], sep='\t', header=None)
query = query[pd.to_numeric(query[0], errors='coerce').notnull()]
target = target[pd.to_numeric(target[0], errors='coerce').notnull()]
statistic = open(sys.argv[5], 'w', encoding='utf-8')
query = query[~query[10].isin(small_intron[0])]
name_tmap = tmap[3].str.split('.', 4, expand=True)
tmap['name'] = name_tmap[0]
tmap = tmap[tmap[2].isin(['x', 'i', 'u'])]
tmap = tmap[tmap['name'].isin(query[10])]
# exon structure of lncRNA
chart = query[query[7].str.contains("exon")]
spl = chart[3].str.split('.', 2, expand=True)
size = spl.pivot_table(index=[0], aggfunc='size').to_dict()
res = Counter(size.values())
dict = {}
for k, v in res.items():
    dict[k] = v/len(size)*100
f, ax = plt.subplots(figsize=(15, 5))
ax.bar(list(dict.keys()), dict.values(), color='b')
#ax.set_xlim([0, 20])
plt.xticks(np.sort(range(max(list(dict.keys())) + 1)))
plt.ylabel('Proportion of lncRNAs (%)', size=15)
plt.xlabel('Number of exon', size=15)
plt.savefig("data/output/lncRNA_structure/number_of_exon.png", dpi=500)
# exon size chart
chart['exon_length'] = (chart[2] - chart[1])
step = int(math.ceil(chart['exon_length'].mean() / 100.0)) * 100
f, ax = plt.subplots(figsize=(10, 5))
density = stats.gaussian_kde(chart['exon_length'])
n, x, _ = ax.hist(chart['exon_length'], density=True, bins=np.logspace(np.log10(min(chart['exon_length'])), np.log10(max(chart['exon_length'])), 50), color='b')
ax.plot(x, density(x), color='r')
ax.set_xscale('log')
ax.set_ylabel('Probability', size=15)
ax.set_xlabel('Exon size (bp)', size=15)
plt.savefig("data/output/lncRNA_structure/exon_size.png", dpi=500)
print('Mean_exon_length', '\t', statistics.mean(chart['exon_length']), file=statistic)
print('Max_exon_length', '\t', max(chart['exon_length']), file=statistic)
print('Min_exon_length', '\t', min(chart['exon_length']), file=statistic)
print('Median_exon_length', '\t', statistics.median(chart['exon_length']), file=statistic)
# intron size chart
chart['name'] = spl[0]
gr = chart[[1, 2, 'name']].groupby(by='name')
intron_length = []
name = []
for key, item in gr:
    n = 0
    if 0 < len(item['name']) - 2:
        intron = abs(item[2].iloc[n] - item[1].iloc[n + 1])
        if intron < 60:
            pass
        elif intron > 100000:
            pass
        else:
            n = n + 1
            intron_length.append(intron)
    else:
        intron_length.append(0)
intron_length = list(filter(lambda x: x != 0, intron_length))
step = int(math.ceil(statistics.mean(intron_length) / 100.0)) * 100
f, ax = plt.subplots(figsize=(10, 5))
density = stats.gaussian_kde(intron_length)
n, x, _ = ax.hist(intron_length, density=True, bins=np.logspace(np.log10(min(intron_length)), np.log10(max(intron_length)), 50), color='b')
ax.plot(x, density(x), color="r")
ax.set_xscale('log')
ax.set_ylabel('Probability', size=15)
ax.set_xlabel('Intron size (bp)', size=15)
plt.savefig("data/output/lncRNA_structure/intron_size.png", dpi=500)
print('Mean_intron_length', '\t', statistics.mean(intron_length), file=statistic)
print('Max_intron_length', '\t', max(intron_length), file=statistic)
print('Min_intron_length', '\t', min(intron_length), file=statistic)
print('Median_intron_length', '\t', statistics.median(intron_length), file=statistic)
# choose only query transcripts
query = query[query[3].str.contains("path1")]
query['length'] = abs(query[1] - query[2])
print('Mean transcript length', '\t', statistics.mean(query['length']), file=statistic)
# chart lcnRNA distribution across chromosome
f, ax = plt.subplots(figsize=(10, 5))
labels, counts = np.unique(query[0].astype(int), return_counts=True)
ax.bar(labels, counts, align='center', color='b')
plt.gca().set_xticks(labels)
plt.ylabel('Number of lncRNAs', size=15)
plt.xlabel('Chromosome of organism', size=15)
plt.savefig("data/output/lncRNA_structure/number_of_lncRNA.png", dpi=500)
chromasom_all = pd.unique(target[0].astype(int))
# choose lncRNA type
name_target = target[9].str.split(';', 4, expand=True)
target['name'] = name_target[0]
new_tmap = tmap[tmap[2].isin(['x'])]
# filter seq by lncRNA type
tmap_name = new_tmap[3].str.split('.', 4, expand=True)
query_anti = query[query[10].isin(tmap_name[0].to_list())]
# choose only target exons
exon = target[target[7].str.contains("exon")]
exon = exon[exon['name'].groupby(exon['name']).transform('size') > 1]
uni = query_anti.drop_duplicates(subset=[1])
chrom = []
for w in chromasom_all:
    print(w)
    query_new = uni[uni[0].astype(int).isin([w])]
    target_new = exon[exon[0].astype(int).isin([w])]
    query_coord = coord(query_new)
    strand = target_new.groupby([5])
    for key, item in strand:
        if key == "-":
            item['number'] = item.groupby(['name']).cumcount(ascending=False) + 1
            find_same_coords(item)
        elif key == "+":
            item['number'] = item.groupby(['name']).cumcount() + 1
            find_same_coords(item)
df = pd.DataFrame(chrom)
df['number'] = df.groupby([6]).cumcount() + 1
df[6] = df[6].astype(int)
df = df.sort_values(by=[6])
df.rename(columns={0: 'transcript_id', 1: 'protein_coding_id', 2: 'protein_coding_id_start', 3: 'protein_coding_id_end', 4: 'transcript_id_start', 5: 'transcript_id_end', 6: 'protein_coding_exon', 7: 'transcript_length', 8: 'protein_coding_strand', 9: 'transcript_strand', "number": 'number_of_transcripts'}, inplace=True)
df.to_csv('data/output/lncRNA_structure/anti_vs_mrna.tsv', sep='\t', index=False)
f, ax = plt.subplots(figsize=(15, 5))
ax.bar(df['protein_coding_exon'], (df['number_of_transcripts']/(len(df['protein_coding_exon']))*100), color='b')
plt.ylabel('Number of aligned lnRNAs', size=15)
plt.xlabel('Exon of the gene', size=15)
plt.xticks(range(max(df['protein_coding_exon'])+1))
plt.savefig('data/output/lncRNA_structure/anti.png')
```
```python
import os
import sys
import subprocess
from Bio import SeqIO
import pandas as pd
import numpy as np
from gtfparse import read_gtf

data_base = sys.argv[1]
gff = sys.argv[2]

def build(reference, kmer):
    args_alignment = []
    gmap_build = 'gmap_build'
    path, base = os.path.split(reference)
    ref_label = os.path.splitext(base)[0]
    gmap_build_logger_out_path = os.path.join('./data/output/GMAP', gmap_build + '.out.log')
    gmap_build_logger_err_path = os.path.join('./data/output/GMAP', gmap_build + '.err.log')
    command = '{gmap_build} -D {tmp_dir} -d {ref_index_name} -k {kmer_value} {reference} >> {log_out_1} 2>> {log_out_2}'.\
        format(gmap_build=gmap_build, tmp_dir=path, ref_index_name=ref_label, reference=reference,
               log_out_1=gmap_build_logger_out_path, log_out_2=gmap_build_logger_err_path, kmer_value=kmer)
    exit_code = subprocess.call(command, shell=True)

gff_df = read_gtf(gff)
df = SeqIO.to_dict(SeqIO.parse(data_base, "fasta"))
size = round(os.path.getsize(data_base) / (1024*1024))
statistic = open(sys.argv[3], 'w', encoding='utf-8')
if size <= 64:
    k = 12
elif 64 < size <= 256:
    k = 13
elif 256 < size <= 1000:
    k = 14
elif 1000 < size <= 10000:
    k = 15
print("k", '\t', str(k), file=statistic)
total_len = []
for name, seq in df.items():
    total_len.append(len(seq))
print('Golden_Path_Length', '\t', str(np.sum(total_len)), file=statistic)
exon_only = gff_df[gff_df['feature'].str.contains('exon')]
exon_length = (exon_only['end'] - exon_only['start']).mean()
print('Mean_exon_length', '\t', str(round(exon_length)), file=statistic)
transcript_only = gff_df[gff_df['feature'].str.contains('transcript')]
transcript_length = (transcript_only['end'] - transcript_only['start']).mean()
print('Mean_transcript_length', '\t', str(round(transcript_length)), file=statistic)
gr = exon_only[['start','end','gene_id']].groupby(by='gene_id')
intron_length = []
for key, item in gr:
    n = 0
    while n <= len(item['gene_id']) - 2:
        intron = abs(item['end'].iloc[n] - item['start'].iloc[n + 1]) - 1
        n = n + 1
        intron_length.append(intron)
mean_intron_length = np.mean(intron_length)
max_intron_length = np.max(intron_length)
min_intron_length = np.min(intron_length)
print('Mean_intron_length', '\t', str(round(mean_intron_length)), file=statistic)
print('Max_intron_length', '\t', str(round(max_intron_length)), file=statistic)
print('Min_intron_length', '\t', str(round(min_intron_length)), file=statistic)
build(data_base, k)
```
```python
import pandas as pd
import sys

def coord(x):
    uni = x.drop_duplicates(subset=[1])
    Row_list_query = []
    for index, rows in uni.iterrows():
        my_list = [rows[1], rows[2]]
        Row_list_query.append(my_list)
    return(Row_list_query)

gmap = sys.argv[1]
known_seq = sys.argv[2]
out = sys.argv[3]
gmap_align = pd.read_csv(gmap, sep='\t', header=None)
target = pd.read_csv(known_seq, sep='\t', header=None)
target_coords = coord(target[target[7].str.contains("transcript")])
name = gmap_align[9].str.split(';', 4, expand=True)
name = name[1].str.split('=', 2, expand=True)
gmap_align['name'] = name[1]
gene = gmap_align[gmap_align[3].str.contains("path1")]
exon = gmap_align[gmap_align[3].str.contains("exon")]
gr = exon[[0,1,2,'name']].groupby(by='name')
intron_large = []
intron_small = []
intron_data = []
for key, item in gr:
    n = 0
    if 0 < len(item['name']) - 2:
        intron = abs(item[2].iloc[n] - item[1].iloc[n + 1])
        intron_data.append([item[0].iloc[n], item[2].iloc[n], item[1].iloc[n + 1], key + '_intron' + str(n+1), intron])
        n = n + 1
        if intron < 60:
            intron_small.append(key)
        if intron > 20000:
            intron_large.append(key)
        else:
            pass
    else:
        pass
df = pd.DataFrame(intron_data)
df.to_csv('data/output/lncRNA_structure/itron_coordin.tsv', sep='\t', index=False, header=None)
its = pd.DataFrame(intron_small)
its.to_csv('data/output/lncRNA_structure/itron_small.txt', index=False, header=None)
gmap_align = gmap_align[~gmap_align['name'].isin(intron_large)]
gmap_align['length'] = gmap_align[2] - gmap_align[1]
gmap_align = gmap_align[gmap_align['length'] < 5000]
gmap_align_big = gmap_align[gmap_align['length'] > 5000]
gmap_align_big.to_csv('data/output/lncRNA_structure/long_transcripts.bed', index=False, sep='\t', header=None)
gmap_align = gmap_align[~gmap_align['name'].isin(gmap_align_big['name'])]
gmap_align = gmap_align[[0,1,2,3,4,5,6,7,8,9]]
gmap_align.to_csv(out, index=False, sep='\t', header=None)
```
```python
import os
import sys
import subprocess
import pandas as pd

data_base = sys.argv[1]
target = sys.argv[2]
parameters = sys.argv[3]
out_file = sys.argv[5]
opt = sys.argv[7:]

def aling(reference, transcripts, min, max, thr, lg):
    if lg[0] > 2**32:
        gmap_run = 'gmapl'
        print('large genome use gmapl')
    else:
        gmap_run = 'gmap'
    path, base = os.path.split(reference)
    ref_label = os.path.splitext(base)[0]
    out = os.path.join(out_file)
    gmap_run_logger_err_path = os.path.join('./data/output/GMAP', gmap_run + '.out.log')
    command = '{gmap} -D {tmp_dir} -d {ref_index_name} {transcripts} --min-intronlength={min_intron_length} --intronlength={intron_length} --cross-species ' \
              '--format=gff3_gene --split-large-introns --npaths=1 -t {threads} {option} > {alignment_out} 2>> {log_out_2}'.\
        format(gmap=gmap_run, tmp_dir=path, ref_index_name=ref_label, transcripts=transcripts, threads=thr,
               alignment_out=out, log_out_2=gmap_run_logger_err_path, min_intron_length=min[0], intron_length=max[0],
               option=' '.join('{0}'.format(w) for w in opt))
    exit_code = subprocess.call(command, shell=True)

dir_check = os.path.isdir('data/reference/intron_data/' + sys.argv[4])
if dir_check is True:
    print('use of existing statistics')
    gff_df = pd.read_csv('data/reference/intron_data/' + sys.argv[4] + '/' + sys.argv[4] + '.tsv', sep='\t', header=None)
    minn = gff_df[gff_df[0].str.contains("Min_intron_length")][1].to_list()
    maxx = gff_df[gff_df[0].str.contains("Max_intron_length")][1].to_list()
    length = gff_df[gff_df[0].str.contains("Golden_Path_Length")][1].to_list()
    thr = 40
    aling(data_base, target, minn, maxx, thr, length)
    print(target)
else:
    print('use of reference statistics')
    gff_df = pd.read_csv(parameters, sep='\t', header=None)
    minn = gff_df[gff_df[0].str.contains("Min_intron_length")][1].to_list()
    maxx = gff_df[gff_df[0].str.contains("Max_intron_length")][1].to_list()
    length = gff_df[gff_df[0].str.contains("Golden_Path_Length")][1].to_list()
    thr = 40
    aling(data_base, target, minn, maxx, thr, length)
```
```python
import pandas as pd
import sys
import numpy as np
from pybedtools import BedTool

def clean(x):
    split = x.str.rsplit('.', 2, expand=True)
    split = split[0]
    return (split)

tmap = pd.read_csv(sys.argv[1], sep='\t')
gmap_align = pd.read_csv(sys.argv[2], sep='\t', header=None)
# take only lncRNA classes from tmap file
tmap['group'] = np.nan
tmap['group'][tmap['class_code'].isin(['x'])] = "exon antisense"
tmap['group'][tmap['class_code'].isin(['i'])] = "intron"
tmap['group'][tmap['class_code'].isin(['u'])] = "intergenic"
tmap.dropna(subset=["group"], inplace=True)
tmap = tmap[tmap['qry_gene_id'].str.contains("path1")]
print(tmap['group'].value_counts())
# filter duplicates
query = clean(tmap['qry_gene_id'])
gmap_align['name'] = clean(gmap_align[3])
query = query.drop_duplicates()
# take lncRNA from alignment file
gmap_align = gmap_align[gmap_align['name'].isin(query)]
gmap_align = gmap_align[gmap_align[7].str.contains("gene")]
gmap_align.to_csv(sys.argv[3], sep='\t', index=False, header=None)
# lncRNA into loci
genes = BedTool(sys.argv[3])
loci = genes.merge(s=True, c=[4,5,6], o=['collapse','mean','distinct'])
loci_df = pd.read_table(loci.fn, sep='\t', names=[0,1,2,3,4,5])
loci_df = loci_df[(loci_df[2] - loci_df[1]) > 200]
loci_df['numb'] = list(range(0, len(loci_df[3])))
loci_df['loc'] = 'LOC'
loci_df['loci'] = loci_df['loc'] + '_' + loci_df['numb'].astype(str)
loci_df = loci_df[[0,1,2,'loci',4,5,3]]
loci_df.to_csv(sys.argv[4], sep='\t', index=False, header=None)
```
```r
library(seqinr)
library(LncFinder)

args <- commandArgs(TRUE)
list_dir <- list.dirs(path = args[1], recursive = FALSE)
for (w in list_dir) {
  lnc <- read.fasta(args[2])
  mrna <- read.fasta(paste(w, '/train_mrna.fasta', sep=''))
  strc <- args[3]
  print(strc)
  if (strc == "SS") {
    print('secondary structure on')
    RNAfold.path <- '/home/pronozinau/miniconda3/bin/RNAfold'
    mrna1 <- run_RNAfold(mrna, RNAfold.path = RNAfold.path, parallel.cores = 40)
    lnc1 <- run_RNAfold(lnc, RNAfold.path = RNAfold.path, parallel.cores = 40)
    feach <- make_frequencies(cds.seq = mrna1, mRNA.seq = mrna1, lncRNA.seq = lnc1,
                              SS.features = TRUE, cds.format = "SS", lnc.format = "SS",
                              check.cds = TRUE, ignore.illegal = FALSE)
    print('done')
    saveRDS(feach, file=paste(w, '/frequencies.rds', sep=""))
    my_model <- build_model(lncRNA.seq = lnc1, mRNA.seq = mrna1, frequencies.file = feach,
                            SS.features = TRUE, lncRNA.format = "SS", mRNA.format = "SS",
                            parallel.cores = 2, folds.num = 2, seed = 1)
    saveRDS(my_model, file=paste(w, '/model.rds', sep=""))
    print(paste(w, '/test.fasta', sep=''))
    seq <- read.fasta(paste(w, '/test.fasta', sep=''))
    test <- run_RNAfold(seq, RNAfold.path = RNAfold.path, parallel.cores = 40)
    result_2 <- lnc_finder(test, SS.features = TRUE, format = "SS",
                           frequencies.file = feach, svm.model = my_model, parallel.cores = 2)
    write.table(result_2, file=paste(w, '/results.csv', sep=""), col.names=F, sep = ',')
  } else {
    print('secondary structure off')
    feach <- make_frequencies(cds.seq = mrna, mRNA.seq = mrna, lncRNA.seq = lnc,
                              SS.features = FALSE, cds.format = "DNA", lnc.format = "DNA",
                              check.cds = TRUE, ignore.illegal = FALSE)
    print('done')
    saveRDS(feach, file=paste(w, '/frequencies.rds', sep=""))
    my_model <- build_model(lncRNA.seq = lnc, mRNA.seq = mrna, frequencies.file = feach,
                            SS.features = FALSE, lncRNA.format = "DNA", mRNA.format = "DNA",
                            parallel.cores = 2, folds.num = 2, seed = 1)
    saveRDS(my_model, file=paste(w, '/model.rds', sep=""))
    print(paste(w, '/test.fasta', sep=''))
    seq <- read.fasta(paste(w, '/test.fasta', sep=''))
    result_2 <- lnc_finder(seq, SS.features = FALSE, format = "DNA",
                           frequencies.file = feach, svm.model = my_model, parallel.cores = 2)
    write.table(result_2, file=paste(w, '/results.csv', sep=""), col.names=F, sep = ',')
  }
}
print(list_dir)
```
```python
from Bio import SeqIO
import sys
import pandas as pd

fasta = SeqIO.parse(sys.argv[1], "fasta")
corrected_file = sys.argv[2]
new_old_id = []
with open(corrected_file, 'w') as corrected:
    n = 1
    for rec in fasta:
        if len(rec.seq) >= 200:
            new_name = "Transcript_" + str(n)
            new_old_id.append([rec.id, new_name])
            rec.id = new_name
            rec.description = new_name  # <- Add this line
            n = n + 1
            SeqIO.write(rec, corrected, 'fasta')
        else:
            pass
new_old_id_df = pd.DataFrame(new_old_id, columns=['old_id', 'new_id'])
new_old_id_df.to_csv(sys.argv[3], sep='\t', index=False)
```
```python
import pandas as pd
import numpy as np
from Bio import SeqIO
import sys

lncfinder = pd.read_csv(sys.argv[1], sep=',', header=None)
record_dict = SeqIO.to_dict(SeqIO.parse(sys.argv[2], "fasta"))
lncrna = open(sys.argv[3], 'w', encoding='utf-8')
coding = open('data/output/lncRNA_prediction/Coding.fasta', 'w', encoding='utf-8')
lncfinder = lncfinder[[0,1]]
finder_cod = lncfinder[lncfinder[1].str.contains("NonCoding")==False]
finder_non = lncfinder[lncfinder[1].str.contains("NonCoding")]
for w in finder_non[0]:
    try:
        print(record_dict[w].format('fasta'), end='', file=lncrna)
    except KeyError:
        continue
for w in finder_cod[0]:
    try:
        print(record_dict[w].format('fasta'), end='', file=coding)
    except KeyError:
        continue
```
```python
import numpy as np
from sklearn.model_selection import train_test_split
from Bio import SeqIO
import sys
import os

lnc = SeqIO.to_dict(SeqIO.parse(sys.argv[1], "fasta"))
cds = SeqIO.to_dict(SeqIO.parse(sys.argv[2], "fasta"))
train_lnc = open('./data/input/test_train/train_lnc.fasta', 'w', encoding='utf-8')
print(len(lnc.keys()))
print(len(cds.keys()))
test_lnc = len(lnc.keys())*0.25
test_mrna = test_lnc*2
n = 0
keys = list(cds.keys())
pers_lnc = test_lnc/len((list(lnc.keys())))
Y_train, Y_test = train_test_split(list(lnc.keys()), test_size=pers_lnc, shuffle=True)
print(len(Y_test))
if len(Y_train)*2 <= len(cds.keys())/5:
    train_mrna = len(Y_train)*2
else:
    train_mrna = len(cds.keys())/5.2
print(train_mrna)
for w in Y_train:
    print(lnc[w].format('fasta'), end='', file=train_lnc)
while n <= 4:
    os.mkdir('./data/input/test_train/' + str(n))
    train_cds = open('./data/input/test_train/' + str(n) + '/train_mrna.fasta', 'w', encoding='utf-8')
    test_file = open('./data/input/test_train/' + str(n) + '/test.fasta', 'w', encoding='utf-8')
    compare = open('./data/input/test_train/' + str(n) + '/compare.csv', 'w', encoding='utf-8')
    pers_mrna = train_mrna/(len(keys))
    print(pers_mrna)
    X_train, X_test = train_test_split(keys, test_size=pers_mrna, shuffle=True)
    print('train_set', len(X_train))
    print('test_set', len(X_test))
    if len(X_test) > test_mrna:
        pers_mrna_test = test_mrna/(len(X_test))
    else:
        true_per = len(X_test)*0.25
        pers_mrna_test = true_per/len(X_test)
    pers_mrna_test = test_mrna/(len(X_test))
    train, test = train_test_split(X_test, test_size=pers_mrna_test, shuffle=True)
    for w in test:
        print(cds[w].format('fasta'), end='', file=test_file)
        print(cds[w].id, 'mrna', sep='\t', file=compare)
    for w in train:
        print(cds[w].format('fasta'), end='', file=train_cds)
    for w in Y_test:
        print(lnc[w].format('fasta'), end='', file=test_file)
        print(lnc[w].id, 'lnc', sep='\t', file=compare)
    keys = [w for w in keys if w not in X_test]
    print('new_set', len(keys))
    n = n + 1
```
```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sys

def value_prepare(x, y, z):
    name = x[0].str.rsplit('_', 1, expand=True)
    x['srx'] = name[1]
    same = pd.merge(df, x, how='inner', left_on=[1], right_on=['srx'])
    transcript = same[1].value_counts().reset_index()
    transcript = pd.merge(df, transcript, how='inner', left_on=[1], right_on=['index'])
    transcript.to_csv('data/output/tissue/' + y + '.csv')
    pr = same[2].value_counts().reset_index()
    pr = pr.rename(columns={2: z}, inplace=False)
    return (pr, same)

# LncAPDB_vs_blast table
blast = pd.read_csv(sys.argv[2], sep='\t', header=None)
LncAPDB = pd.read_csv(sys.argv[3], sep=' ', header=None)
old_new = pd.read_csv(sys.argv[4], sep='\t', header=None, skiprows=1)
LncAPDB = LncAPDB.drop_duplicates(subset=[1])
blast = blast[blast[2].isin(['100.000'])]
LncAPDB_vs_blast = pd.merge(blast, LncAPDB, how='inner', left_on=[1], right_on=[1])
LncAPDB_vs_blast = LncAPDB_vs_blast[['0_x',1,'2_x',10, '0_y','2_y']]
LncAPDB_vs_blast.rename(columns={'0_x': 'transcript_id', 1: 'library_id', '2_x': 'percent_identity',
                                 10: 'e_value', '0_y': 'id_of_database', '2_y': 'database'}, inplace=True)
LncAPDB_vs_blast.to_csv('data/output/new_lncRNA/LncAPDB_vs_blast.csv', sep='\t', index=False)
# tissue analysis
blast = pd.read_csv('data/output/tissue/cons.txt', header=None)
blast_non = pd.read_csv('data/output/tissue/non.txt', header=None)
blast_cod = pd.read_csv('data/output/tissue/cod.txt', header=None)
df = pd.read_csv(sys.argv[5], sep='\t', header=None)
df = df[df[0].isin([str(sys.argv[1])])]
df_tis = df[2].value_counts().reset_index()
blast_cod = old_new[old_new[1].isin(blast_cod[0])]
try:
    transc, same = value_prepare(blast, 'transc', 'Conserved')
    transc_non, same_non = value_prepare(blast_non, 'transc_non', 'Nonconserved')
    transc_cod, same_cod = value_prepare(blast_cod, 'transc_cod', 'Coding')
    full_table = pd.merge(transc, transc_cod, how='outer', left_on=['index'], right_on=['index'])
    full_table = pd.merge(full_table, transc_non, how='outer', left_on=['index'], right_on=['index'])
    full_table = pd.merge(full_table, df_tis, how='inner', left_on=['index'], right_on=['index'])
    full_table = full_table.fillna(0)
    full_table['con'] = full_table['Conserved'] / full_table[2]
    full_table['noncon'] = full_table['Nonconserved'] / full_table[2]
    full_table['cod'] = full_table['Coding'] / full_table[2]
    full_table['lncRNA conserved'] = full_table['con'] / len(same[2])
    full_table['lncRNA nonconseved'] = full_table['noncon'] / len(same_non[2])
    full_table['mRNA'] = full_table['cod'] / len(same_cod[2])
    full_table = full_table[['index', 'lncRNA conserved', 'lncRNA nonconseved', 'mRNA']]
    full_table.to_csv('data/output/tissue/tissue_org.csv')
    full_table = full_table.set_index('index')
    fig, ax = plt.subplots(1, 1, figsize=(8, 10))
    sns.heatmap(full_table, annot=True, annot_kws={'size': 10}, xticklabels=True, cmap='Blues',
                ax=ax, cbar_kws={"shrink": .82}, yticklabels=True)
    plt.savefig('data/output/tissue/tissue_org.png', bbox_inches='tight')
except KeyError:
    print('Please check format of your sequence ID. It should look like "MSTRG.1031.1_SRX123456" or "TRINITY_DN195_c0_g1_i1_SRX339783". The defining parameter is "_SRX#######"; add it after your main ID.')
```
```python
import pandas as pd
import sys
import glob

def Specificity(x, y):
    if x+y == 0:
        return float(0)
    else:
        return x/(x+y)

def Precision(x, y):
    if x+y == 0:
        return float(0)
    else:
        return x/(x+y)

def Sensitivity(x, y):
    if x+y == 0:
        return float(0)
    else:
        return x/(x+y)

def F1(x, y):
    if x+y == 0:
        return float(0)
    else:
        return 2*(x*y)/(x+y)

def TP_FP(q_c, q_n, t_c, t_n):
    TP = 0
    FP = 0
    FN = 0
    TN = 0
    for w in t_n.to_list():
        if w in q_n.to_list():
            TP = TP+1
        if w in q_c.to_list():
            FP = FP+1
    for w in t_c.to_list():
        if w in q_c.to_list():
            TN = TN+1
        if w in q_n.to_list():
            FN = FN+1
    print(TP, FN, FP, TN)
    Pre = Precision(TP, FP)
    Spe = Specificity(TN, FP)
    Sen = Sensitivity(TP, FN)
    F = F1(Pre, Sen)
    return Pre, Sen, Spe, F

dirr = glob.glob(sys.argv[1] + '/*/')
f1_comp = {}
for w in dirr:
    cds_lnc = pd.read_csv(str(w) + '/compare.csv', sep='\t', header=None)
    test = pd.read_csv(str(w) + '/results.csv', sep=',', header=None)
    test = test[[0,1]]
    results = open(str(w) + '/F1.csv', 'w', encoding='utf-8')
    cds_lnc_cod = cds_lnc[cds_lnc[1].str.contains("mrna")]
    cds_lnc_non = cds_lnc[cds_lnc[1].str.contains("lnc")]
    test_cod = test[test[1].str.contains("NonCoding")==False]
    test_non = test[test[1].str.contains("NonCoding")]
    print(len(cds_lnc_cod[0]), len(cds_lnc_non[0]), len(test_cod[0]), len(test_non[0]))
    pre, sen, spe, f = TP_FP(cds_lnc_cod[0], cds_lnc_non[0], test_cod[0], test_non[0])
    f1_comp[w] = f
    print(pre, sen, spe, f, sep='\t', file=results)
best_model = open(sys.argv[1] + '/best_model.txt', 'w', encoding='utf-8')
print(max(f1_comp, key=f1_comp.get), file=best_model)
```
Snakefile rule fragments (shell/run directives):

```python
run:
    shell("python scripts/test_train.py {input} > {output}")

shell:
    "Rscript scripts/model_lncFind.r {params.path} {params.lnc} {params.str} >> {output}"

shell:
    "python scripts/TP_FN.py {params}"

shell:
    "python scripts/rename.py {input} {output} {params}"

shell:
    "scripts/Chunk_dataframe.sh {input} {params}"

shell:
    "CPC2/bin/CPC2.py -i {input} -r -o {params}"

shell:
    "python scripts/take_noncoding.py {input} {output}"

shell:
    "python scripts/GMAP_build.py {input} {output}"

shell:
    "python scripts/GMAP.py {input} {config[model]} {output}"

run:
    shell("gff2bed < {input.query} > {output.query_out}")
    shell("gff2bed < {input.reference} > {output.reference_out}")

shell:
    "python scripts/gmap_filter.py {input} {output}"

run:
    shell("gffcompare -r {input} -o {params}")
    shell("mv data/output/GMAP/gffcmp.filter_alignm.bed.tmap data/output/gffcompare_first/")
    shell("mv data/output/GMAP/gffcmp.filter_alignm.bed.refmap data/output/gffcompare_first/")

run:
    shell("python scripts/lncRNA_class.py {input} {params} {output}")

run:
    shell("gffcompare -r {input} -o {params}")
    shell("mv data/output/GMAP/gff.filter_alignm.bed.tmap data/output/gffcompare_second/")
    shell("mv data/output/GMAP/gff.filter_alignm.bed.refmap data/output/gffcompare_second/")

shell:
    "python scripts/bed_intersect.py {input.gffcomp} {input.aling} {params.option} {output.loci} {input.coords} {output.TE} {output.true_lnc}"

run:
    shell("python scripts/lncRNA_class.py {input} {params} {output}")

run:
    shell("gffcompare -r {input} -o {params}")
    shell("mv data/output/GMAP/gff.filter_alignm.bed.tmap data/output/gffcompare_second/")
    shell("mv data/output/GMAP/gff.filter_alignm.bed.refmap data/output/gffcompare_second/")

shell:
    "python scripts/bed_intersect.py {input} {params} {output}"

shell:
    "python scripts/gff.py {input} {params}"

shell:
    "python scripts/blast.py {input} {output.fasta} {params.evalue} {params.max_target} {params.identity} {params.threads} {params.old_new}"

run:
    shell("""grep -v NonCoding {input.lnc} | awk -F ',' '{{print $1}}' | tr -d '"' > data/output/tissue/cod.txt""")
    shell("grep -v {params.org} {input.blast} | awk -F '\t' '{{print $1}}' | sort | uniq > data/output/tissue/cons.txt")
    shell("grep {params.org} {input.blast} | awk -F '\t' '{{print $1}}' | sort | uniq > data/output/tissue/non.txt")
    shell("grep '>' {input.new} | awk -F '>' '{{print $2}}' > data/output/tissue/id.txt")
    shell("awk -F '\t' '{{print $1}}' {input.blast} | sort | uniq > data/output/tissue/all.txt")
    shell("grep -v -w -f data/output/tissue/all.txt data/output/tissue/id.txt > data/output/tissue/non_aling.txt")
    shell("python scripts/tissue.py {params.exp} {input.blast} {params.LncAPDB} {params.old_new} {params.SRX}")

shell:
    "python scripts/expr_table_form.py {input.SRX} {params} {output}"

shell:
    "Rscript scripts/Expraession_entropy.R {params.table}"
```
Support
- Future updates