Annotating Novel lncRNAs Using Existing Annotations: A Snakemake-Based Pipeline
A Snakemake-based pipeline to annotate novel lncRNAs using existing annotations.
Description
This pipeline uses various bioinformatics tools to annotate lncRNAs. It requires one or more reference genomes (with an optional prebuilt HISAT2 index), sequencing reads, and one or more reference annotations.
The workflow is divided into two main parts:
Part 1 (Snakefile):
The first part of the pipeline performs the following steps (a minimal sketch of how the rules chain together follows this list):
- uses HISAT2 to align all reads to the reference genome,
- creates a new gene list with all known and novel genes,
- separates the novel genes,
- uses CPAT and CPC2 to annotate novel genes with coding potential,
- extracts non-coding genes,
- identifies and removes transcripts with coding isoforms, as well as genes shorter than 200 bp.
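To show how Snakemake chains these steps, here is a minimal, hypothetical sketch of two early Part 1 rules; the file paths, index name, and wildcard layout are illustrative assumptions, and the pipeline's real rules appear under Code Snippets below:

rule hisat2_align:
    input:
        fq1="trimmed/{sample}_1.fq.gz",   # hypothetical paths
        fq2="trimmed/{sample}_2.fq.gz",
    output:
        sam="aligned/{sample}.sam",
    threads: 8
    shell:
        "hisat2 -q -p {threads} --dta -x genome_index "
        "-1 {input.fq1} -2 {input.fq2} -S {output.sam}"

rule sam_to_bam:
    input:
        "aligned/{sample}.sam",
    output:
        "aligned/{sample}.bam",
    shell:
        "samtools view -b {input} > {output}"

Because the aligner's output is declared as the next rule's input, Snakemake infers the execution order and runs independent samples in parallel.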
Part 2 (Snakefile_quantification):
The second part of the pipeline performs the following steps (a hedged sketch of the corresponding commands follows this list):
- uses StringTie to merge lncRNAs from the different samples,
- creates the transcriptome with gffread,
- creates a Salmon index for the transcriptome,
- quantifies the lncRNAs of each replicate of a tissue/sample using Salmon,
- creates a TPM matrix for each sample and its replicates,
- filters out lncRNAs with low expression values.
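The Part 2 shell commands do not appear in the Code Snippets section below, so here is a minimal sketch of what each step typically looks like. All file names are placeholders and the pipeline's actual flags may differ:

# merge the per-sample lncRNA GTFs into one annotation (sample_gtfs.txt lists them)
stringtie --merge -G reference.gff -o merged.gtf sample_gtfs.txt
# extract transcript sequences to build the transcriptome FASTA
gffread -w transcripts.fa -g genome.fa merged.gtf
# build a Salmon index for the transcriptome
salmon index -t transcripts.fa -i salmon_index
# quantify one replicate; -l A lets Salmon infer the library type
salmon quant -i salmon_index -l A -1 rep1_1.fq.gz -2 rep1_2.fq.gz -o quant/rep1

Each Salmon run writes a quant.sf file whose TPM column is the natural input for the per-sample TPM matrix.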
Pipeline
A simple overview of the pipeline is given by two diagrams in the repository, one for Part 1 (Snakefile) and one for Part 2 (Snakefile_quantification); the images are not reproduced here.
Creating Conda environment
This pipeline uses conda environments, which Snakemake creates during the first run of the pipeline. However, on a High-Performance Computing (HPC) cluster the compute nodes may not have an active internet connection. In that case, the conda environments can be created from a node with internet access using

snakemake -s env_snakefile --use-conda

and the pipeline will then be able to utilise the existing environments.
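On recent Snakemake releases the environments can also be pre-built from the main Snakefiles themselves, without running any jobs, via the built-in flag (a sketch; confirm your Snakemake version supports it):

snakemake -s Snakefile --use-conda --conda-create-envs-only

With the environments in place, a typical run of the two parts might look like this (the core count is illustrative):

snakemake -s Snakefile --use-conda --cores 8
snakemake -s Snakefile_quantification --use-conda --cores 8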
Authors
Code Snippets
shell: "sed -n 's/^##//p' {input}"  # print only lines starting with '##', with the marker stripped
shell:
    """
    trim_galore --illumina -o {params.dir} --paired {input.fq1} {input.fq2}
    """

shell:
    """
    mkdir -p {params.fq_dir} && \
    fastqc {input.fq} --outdir={params.fq_dir}
    """
shell:
    """
    hisat2 -q -p {threads} --dta -x {params.index} -1 {input.fq1} -2 {input.fq2} -S {output.sam} --summary-file {output.summary}
    """
shell:
    """
    samtools view -b {input} > {output}
    """

shell:
    """
    samtools sort {input} > {output}
    """

shell:
    """
    samtools index -c {input} {output}
    """
shell:
    """
    stringtie {input.bam} -p 4 -G {input.gff} -o {output.gtf} -A {output.abundance} -b {params.ballgown} -l {params.label}
    """
shell:
    """
    # gffcompare class code "u" marks transcripts with no overlap to the reference, i.e. novel
    cd {params.gtf_dir} && \
    gffcompare -r {input.ref_gff} {input.query_gtf} -o {params.label} && \
    cat {params.gffcompare_file} | awk '$3=="u"{{print $0}}' | cut {params.column} | sort | uniq > {output.novel_transcripts_list} && \
    sed 's/^/"/; s/$/"/' {output.novel_transcripts_list} > {output.novel_transcripts_list2} && \
    grep -F -f {output.novel_transcripts_list2} {input.query_gtf} > {output.novel_transcripts}
    """
shell:
    """
    gffread -w {output.novel_transcripts_fasta} -g {input.reference_fasta} {input.novel_transcripts_gtf}
    """

shell:
    """
    make_hexamer_tab.py -c {input.cds} -n {input.ncrna} > {output.hexamer}
    """

shell:
    """
    make_logitModel.py -c {input.cds} -n {input.ncrna} -x {input.hexamer} -o {params.logitmodel}
    """
shell:
    """
    # CPAT upper-cases transcript IDs ("Stringtie" becomes "STRINGTIE"), so restore the original case afterwards
    cpat.py -g {input.fasta} -d {input.logitmodel} -x {input.hexamer} -o {output.tsv} && \
    perl -p -e 's/STRINGTIE/Stringtie/g' {output.tsv} > {output.tsv_lowercase}
    """
shell:
    """
    # keep transcripts whose CPAT coding probability (column 6) is at or below the cutoff
    awk -F "\t" '{{ if($6 <= {params.cutoff} ) {{print $1}} }}' {input.tsv} > {output.nctsv}
    """

shell:
    """
    sort {input.nctsv} > {output.ncsortedtsv}
    """
shell:
    """
    CPC2.py -i {input.fasta} -o {output.tsv}
    """

shell:
    """
    # keep transcripts whose CPC2 label (column 8) equals the configured filter value (typically "noncoding")
    awk '{{ if($8 == "{params.filter}" ) {{print $1}} }}' {input.tsv} > {output.nctsv}
    """

shell:
    """
    sort {input.nctsv} > {output.ncsortedtsv}
    """

shell:
    """
    # keep only transcripts called non-coding by both CPAT and CPC2
    comm -12 {input.cpat} {input.cpc} > {output.comm}
    """
shell:
    """
    sed 's/^/"/; s/$/"/' {input.tsv} > {output.tsv} && \
    grep -F -f {output.tsv} {input.query_gtf} > {output.novel_transcripts}
    """
shell:
    """
    # keep the quotes around IDs to avoid matching the wrong gene ids later,
    # then deduplicate the gene list in place
    perl -lne 'print @m if @m=(/((?:gene_id)\s+\S+)/g);' {input.nc_gtf} | perl -p -e 's/gene_id|;| //g' > {output.gene_list} && \
    sort -u -o {output.gene_list} {output.gene_list}
    """
shell:
    """
    # gene list is already quoted, so no need to add quotes again (could use Mikado instead)
    grep -F -f {input.gene_list} {input.query_gtf} > {output.transcripts}
    """
shell:
    """
    python {params.scripts}/tra.py < {input.nc_gtf} > {output.nc_pair} && \
    python {params.scripts}/tra.py < {input.combine_gtf} > {output.combine_pair}
    """

shell:
    """
    python {params.scripts}/filter.py {input.nc_pair} {input.combine_pair} > {output.diff}
    """
shell:
    """
    sed 's/^/"/; s/$/"/' {input.genes} > {output.quoted_genes} && \
    grep -v -F -f {output.quoted_genes} {input.gtf} > {output.gtf}
    """
shell:
    """
    # -T writes GTF output; -l 200 discards transcripts shorter than 200 bp
    gffread {input} -T -l 200 -o {output}
    """
Support
- Future updates