Assemble transcriptome and calculate gene/transcript abundance

public public 1yr ago 0 bookmarks

First follow the instructions here:

Step by step guide on how to use my pipelines
Click here for an introduction to Snakemake

ABOUT

This is a pipeline that aligns raw RNA-seq reads (downloaded from SRA) to a genome and quantifies gene abundance. Based on: Pertea, M., Kim, D., Pertea, G. M., Leek, J. T., & Salzberg, S. L. (2016). Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. Nature Protocols, 11(9), 1650–1667. https://doi.org/10.1038/nprot.2016.095
The final product is a table with gene abundances for every sample.

Tools used:

| DAG | |:--:| | Pipeline workflow |

Edit config.yaml with the paths to your files

METALIST: /path/to/metalist.txt
ASSEMBLY: /path/to/assembly.fa
ANNOTATION: /path/to/annotation.gff3
OUTDIR: /path/to/outdir
  • METALIST - text file with sample details. Each row is a different sample/record. Mandatory columns:

    • Run - the most important - SRA code

    • AGE - age of individual. If not applicable, leave blank or add NA

    • tissue - tissue where the sample comes from. If not applicable, leave blank or add NA

  • ASSEMBLY - decompressed fasta file

  • ANNOTATION - annotation gff3 file

  • OUTDIR - directory where snakemake will run and where the results will be written to
    If you want the results to be written to this directory (not to a new directory), open config.yaml and comment out OUTDIR: /path/to/outdir

RESULTS

The most important results are:

  • <run_date>_files.txt dated file with an overview of the files used to run the pipeline (for documentation purposes)

  • reads directory that contains alignments

  • results directory with final results

    • merged_transcriptome.gtf - non-redundant set of transcripts

    • results/{SRA}/{SRA}.gtf and results/{SRA}/gene_abundance.tab - re-estimated abundances for each sample

    • {SRA}/gene_abundance_{SRA}.txt - re-estimated abundances for each sample with extra informationa added - AGE and tissue

    • results/all_abundances.txt - final file with gene abundances for all samples

Code Snippets

13
14
15
16
17
run:
    with open(output[0], "w") as outfile:
        outfile.write(f"Files used to run {pipeline} on {running_date}\n")
        for key, value in config.items():     
            outfile.write(f"{key}: {value}\n")
58
59
shell:
    "fasterq-dump.2.11.0 -e 16 --outdir fastq {params}"
72
73
shell:
    'reformat.sh in1={input.in1} in2={input.in2} out={output}'
87
88
shell:
    "hisat2-build {input} {params}"
107
108
shell:
    "hisat2 -x {params} {input.fq} --summary-file {output.summary} > {output.sam}"
121
122
123
124
125
126
shell:
    """
    module load samtools
    samtools sort -@ 16 {input} > {output.bam}
    samtools index -@ 16 {output.bam}
    """
145
146
147
148
149
150
151
152
153
154
155
shell:
    """
    stringtie {input.bam} \
    -G {input.annotation} \
    -l {params} \
    -o {output.gtf} \
    -p 16 \
    -A {output.abundance} \
    -B \
    -e 
    """
167
168
169
170
171
run:
    with open(output[0], "w") as out:
        for el in input:
            out.write(el)
            out.write("\n")
SnakeMake From line 167 of master/Snakefile
186
187
188
189
190
191
shell:
    """
    stringtie --merge \
    -G {input.annotation} \
    -o {output} {input.list_to_merge}
    """
210
211
212
213
214
215
216
217
218
219
220
shell:
    """
    stringtie {input.bam} \
    -G {input.annotation} \
    -l {params} \
    -o {output.gtf} \
    -p 16 \
    -A {output.abundance} \
    -B \
    -e 
    """
238
239
240
241
242
243
244
245
246
247
248
run:
    abundance = pd.read_table(input[0])
    metadata = pd.read_table(input[1], sep=",")
    details = metadata[metadata["Run"] == params[0]]
    details_subset = details[["Run", "AGE", "tissue"]]
    details_subset.reset_index(inplace=True)
    combined_tables = pd.concat([abundance, details_subset], axis=1)
    combined_tables["Run"] = combined_tables.iloc[0]["Run"]
    combined_tables["AGE"] = combined_tables.iloc[0]["AGE"]
    combined_tables["tissue"] = combined_tables.iloc[0]["tissue"]
    combined_tables.to_csv(output[0], sep="\t", index=False)
SnakeMake From line 238 of master/Snakefile
261
262
shell:
    "awk 'FNR>1 || NR==1' {input} > {output}"
SnakeMake From line 261 of master/Snakefile
ShowHide 9 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/CarolinaPB/calculate-gene-abundance
Name: calculate-gene-abundance
Version: 1
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: None
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...