# Assemble transcriptome and calculate gene/transcript abundance
First follow the step-by-step guide on how to use my pipelines. If you are new to Snakemake, see the linked introduction to Snakemake.
## ABOUT
This is a pipeline that aligns raw RNA-seq reads (downloaded from SRA) to a genome and quantifies gene abundance.
Based on:
Pertea, M., Kim, D., Pertea, G. M., Leek, J. T., & Salzberg, S. L. (2016). Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. Nature Protocols, 11(9), 1650–1667. https://doi.org/10.1038/nprot.2016.095
The final product is a table with gene abundances for every sample.
Tools used:

- `fasterq-dump` - download records from the SRA
- `BBMap` (`reformat.sh`) - combine paired reads into one FASTQ
- `HISAT2` - align RNA-seq reads to the genome
- `SAMtools` - sort and index mapped reads
- `StringTie` - assemble alignments into transcripts
*(Figure: pipeline workflow)*
Edit `config.yaml` with the paths to your files:

```yaml
METALIST: /path/to/metalist.txt
ASSEMBLY: /path/to/assembly.fa
ANNOTATION: /path/to/annotation.gff3
OUTDIR: /path/to/outdir
```
- `METALIST` - text file with sample details; each row is a different sample/record. Mandatory columns:
  - `Run` - the most important column - the SRA accession
  - `AGE` - age of the individual; if not applicable, leave blank or use `NA`
  - `tissue` - tissue the sample comes from; if not applicable, leave blank or use `NA`
- `ASSEMBLY` - decompressed FASTA file
- `ANNOTATION` - annotation file in GFF3 format
- `OUTDIR` - directory where Snakemake will run and where the results will be written

If you want the results to be written to the directory where you launch the pipeline (not to a new directory), open `config.yaml` and comment out `OUTDIR: /path/to/outdir`.
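For reference, a minimal `metalist.txt` might look like the fragment below. The accessions and values are hypothetical; the pipeline reads this file as a comma-separated table, so keep the `Run`, `AGE`, and `tissue` headers exactly as shown:

```
Run,AGE,tissue
SRR0000001,adult,liver
SRR0000002,juvenile,brain
SRR0000003,NA,NA
```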
## RESULTS
The most important results are:

- `<run_date>_files.txt` - dated file with an overview of the files used to run the pipeline (for documentation purposes)
- `reads` - directory that contains the alignments
- `results` - directory with the final results:
  - `merged_transcriptome.gtf` - non-redundant set of transcripts
  - `results/{SRA}/{SRA}.gtf` and `results/{SRA}/gene_abundance.tab` - re-estimated abundances for each sample
  - `{SRA}/gene_abundance_{SRA}.txt` - re-estimated abundances for each sample with extra information added (`AGE` and `tissue`)
  - `results/all_abundances.txt` - final file with gene abundances for all samples
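Once the pipeline finishes, `results/all_abundances.txt` can be loaded with pandas for downstream analysis. A minimal sketch (the column names `Gene ID` and `TPM` follow StringTie's `-A` abundance output, but treat them as assumptions and check your own file's header; the toy data below stands in for the real file):

```python
import io
import pandas as pd

# Toy stand-in for results/all_abundances.txt: tab-separated,
# one row per gene per sample (real files have more columns).
example = io.StringIO(
    "Gene ID\tTPM\tRun\tAGE\ttissue\n"
    "g1\t5.0\tSRR0000001\tadult\tliver\n"
    "g2\t7.5\tSRR0000001\tadult\tliver\n"
    "g1\t2.0\tSRR0000002\tNA\tbrain\n"
    "g2\t0.0\tSRR0000002\tNA\tbrain\n"
)
abundances = pd.read_table(example)

# Reshape the long table into a genes x samples expression matrix.
matrix = abundances.pivot(index="Gene ID", columns="Run", values="TPM")
print(matrix)
```

Replace `example` with the path to your own `results/all_abundances.txt` when running this for real.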
## Code Snippets
```
run:
    with open(output[0], "w") as outfile:
        outfile.write(f"Files used to run {pipeline} on {running_date}\n")
        for key, value in config.items():
            outfile.write(f"{key}: {value}\n")
```
```
shell:
    "fasterq-dump.2.11.0 -e 16 --outdir fastq {params}"
```
```
shell:
    'reformat.sh in1={input.in1} in2={input.in2} out={output}'
```
```
shell:
    "hisat2-build {input} {params}"
```
```
shell:
    "hisat2 -x {params} {input.fq} --summary-file {output.summary} > {output.sam}"
```
```
shell:
    """
    module load samtools
    samtools sort -@ 16 {input} > {output.bam}
    samtools index -@ 16 {output.bam}
    """
```
```
shell:
    """
    stringtie {input.bam} \
        -G {input.annotation} \
        -l {params} \
        -o {output.gtf} \
        -p 16 \
        -A {output.abundance} \
        -B \
        -e
    """
```
```
run:
    with open(output[0], "w") as out:
        for el in input:
            out.write(el)
            out.write("\n")
```
```
shell:
    """
    stringtie --merge \
        -G {input.annotation} \
        -o {output} {input.list_to_merge}
    """
```
```
shell:
    """
    stringtie {input.bam} \
        -G {input.annotation} \
        -l {params} \
        -o {output.gtf} \
        -p 16 \
        -A {output.abundance} \
        -B \
        -e
    """
```
```
run:
    abundance = pd.read_table(input[0])
    metadata = pd.read_table(input[1], sep=",")
    details = metadata[metadata["Run"] == params[0]]
    details_subset = details[["Run", "AGE", "tissue"]]
    details_subset.reset_index(inplace=True)
    combined_tables = pd.concat([abundance, details_subset], axis=1)
    combined_tables["Run"] = combined_tables.iloc[0]["Run"]
    combined_tables["AGE"] = combined_tables.iloc[0]["AGE"]
    combined_tables["tissue"] = combined_tables.iloc[0]["tissue"]
    combined_tables.to_csv(output[0], sep="\t", index=False)
```
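The `run` block above joins each sample's abundance table with its single metadata row: `pd.concat(axis=1)` aligns on the index, so the metadata values land only on row 0, and the subsequent assignments broadcast that first row's `Run`/`AGE`/`tissue` down every row. A standalone sketch of the same logic with toy data (not the pipeline's actual inputs):

```python
import pandas as pd

# Toy abundance table (3 genes) and one-row metadata table.
abundance = pd.DataFrame({"Gene ID": ["g1", "g2", "g3"], "TPM": [5.0, 7.5, 0.0]})
metadata = pd.DataFrame({"Run": ["SRR0000001"], "AGE": ["adult"], "tissue": ["liver"]})

# Side-by-side concat: metadata fills row 0 only, NaN below...
combined = pd.concat([abundance, metadata], axis=1)

# ...so copy the first row's value down each metadata column.
for col in ["Run", "AGE", "tissue"]:
    combined[col] = combined.iloc[0][col]

print(combined)
```

After the loop, every gene row carries the sample's metadata, which is what makes the per-sample tables safe to concatenate later.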
```
shell:
    "awk 'FNR>1 || NR==1' {input} > {output}"
```
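The `awk 'FNR>1 || NR==1'` idiom above concatenates the per-sample tables while keeping only the first file's header: `NR==1` is true only on the very first line overall, and `FNR>1` skips the header line of every subsequent file. A small self-contained demonstration with hypothetical two-column tables:

```shell
# Two per-sample tables, each with its own header line.
printf 'gene\ttpm\ng1\t5.0\n' > sample1.tab
printf 'gene\ttpm\ng2\t7.5\n' > sample2.tab

# Keep the header from the first file only, then all data rows.
awk 'FNR>1 || NR==1' sample1.tab sample2.tab > merged.tab

cat merged.tab
# gene  tpm
# g1    5.0
# g2    7.5
```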
## Support

- Future updates