Transcriptome Analysis Pipeline for Mangrove Crab Scylla serrata Heat Stress Response Study

public 1yr ago 0 bookmarks

View Workflow

Help improve this workflow!

This workflow has been published but could be further improved with some additional meta data:

Keyword(s) in categories input, output, operation

You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .

Introduction

This repository contains scripts to reproduce the results presented in Shrestha et al.: Comparative transcriptome profiling of heat stress response of the mangrove crab Scylla serrata across sites of varying climate profiles

Data

The raw RNA-seq reads in fastq format can be downloaded from DDBJ Sequence Read Archive under the accession number DRA010977. Here is a direct link . There are 29 RNA-seq samples: DRX242230 -- DRX242258. Each sample is from one of 3 sites: Bataan, Cagayan, Bicol; and one of 2 conditions: control or heat-stressed. Sample information can be obtained by clicking the link to each experiment.

Assembly-mapping-quantification

Software requirement

This pipeline requires requires Conda and Snakemake . Other dependencies are handled by Snakemake automatically. Snakemake recommends installation via Conda:

$ conda install -c conda-forge mamba
$ mamba create -c conda-forge -c bioconda -n snakemake snakemake

This creates an isolated enviroment containing the latest Snakemake. To activate it:

$ conda activate snakemake

To test snakemake installation

$ snakemake --help

Running the pipeline

Download or clone this repository. In the config.yaml file, set (1) the absolute paths to the RNA-seq reads, (2) an absolute path to the directory where you want the outputs to be saved, and (3) the absolute paths to the databases for annotation. Then at the top-level directory, run the following :

snakemake --configfile config.yaml --use-conda --cores all

Output

The main output is arranged into several folders, of which: trinity_out_dir which contains the assembly Trinity.fasta , trinity_abundance which contains the counts required for subsequent differential expression analysis.

The transciptome assembly we obtained can be downloaded here . The quantification results we obtained can be downloaded here .

Assembly annotation

Data

For annotation, download the Swiss-Prot database and the fruit fly proteome UP000000803 .

Software requirement

We used Dammit (v 1.2), which can be installed using bioconda:

$ mamba create -c conda-forge -c bioconda -n dammit dammit=1.2

Next activate the environment:

$conda activate dammit

Running Dammit

With the Swiss-Prot dataset and fruit fly proteome, launch Dammit:

dammit databases --install --quick --busco-group metazoa
dammit annotate trinity_out/Trinity.fasta --quick --user-databases Swiss-Prot.fa fruifly.fa -e 1e-10 --output-dir dammit_out

Ouput

The output contains many files, which we have summarized as a data-frame required by downstream analysis here .

Differential expression analysis

Differential expression analysis can be performed using the R Markdown notebooks `1 - Explore and QC.Rmd` and `2 - DE Analysis.Rmd`.

Software requirement

The following steps were executed on RStudio (v.1.2.5042) and R (v. 4.0.3). RStudio should prompt you for all other library dependecies.

Data

Place the following files in a data folder:

Folder of RSEM Counts

Current folder: counts_per_sample 💾

Generated from RSEM after TRINITY assembly. Contains ".genes.results" file for each sample.

Filename format : "BAT_C_1.genes.results" - This refers to Bataan Control group sample 1
Annotation Dataframe

Current file: annotation.20200701.csv 💾

Dataframe should have columns "gene_id" and also for labeling and filtering.

| gene_id | swiss_prot_acc | swiss_prot_stat_sig | |-|-|-| | TRINITY_DN10022_c0_g1 | Q9UQ80 | 1.2e-176 | | TRINITY_DN10031_c0_g1 | Q8VDW4 | 1.0e-99 |
Descendant Genes Dataframe

Current file: geneID_isDescendant.csv 💾

Generated from "supplementary notebooks/GO_stressResponse.Rmd". View below for more details.

DEG workflow

Run notebook ` 1 - Explore and QC.Rmd ` for exploring data and perform quality checking

Chunk 1 : Prepare DESeq dataset using txImport
- Reads count data produced by Trinity as a DESeq2 dataset
Chunk 2 : Filter Genes and Samples
- Removes outlier samples
- Removes genes with low annotation significance and alignment
Chunk 3 : (optional) Save parameters of this experiment
Chunk 4 : Quality Check
- Graphs sample cluster with heatmap
- Graphs sample cluster dendrogram
- Graphs sample PCA plot
- Graphs gene heatmap
- Graphs Mean SD plot (for checking if variance is dependent on the mean)
Chunk 5 : Save filtered DESEq2 object to be loaded in `2 - DE Analysis.Rmd`

Run notebook ` 2 - DE Analysis.Rmd ` for performing significance testing and graphing

Chunk 1 : Load the filtered DESeq2 dataset saved from `1 - Explore and QC.Rmd`
Chunk 2 : Perform significance testing
Chunk 3 : Export list of Differentially Expressed (DE) genes
Chunk 4 : Graph results from Main effects (the 3 sites: "CAG", "BAT", "BIC") testing
- Graphs expression of top most significant DE genes for each site
- Graphs venn diagram of the DE genes across the 3 site
- Graphs bar graph of DE gene counts for each site
- Graphs MA plot for each site
Chunk 5 : Graph results from Interaction effects testing
- Graphs venn diagram of intersecting DDE genes
- Graphs a count table of the interaction category for each site
- Graphs gene heatmaps of CAG vs BAT and CAG vs BIC
- Graphs interaction plots for each site
Chunk 6 : Combine plots for publication (Figure 7)
Chunk 7 : Combine plots for publication (Figure 8)
Chunk 8 : Plot figure of Top 20 read counts

Supplementary Notebooks

GO_stressResponse.Rmd

Requirements :

Annotation Dataframe with columns "gene_id" and its Uniprot Accession ID

| gene_id | swiss_prot_acc | |-|-| | TRINITY_DN10022_c0_g1 | Q9UQ80 | | TRINITY_DN10031_c0_g1 | Q8VDW4 |
At least 1 hour stable internet

Workflow :

Get list of Accession IDs from specified column (currently "swiss_prot_acc")
Using Uniprot API, for each Accession ID, retrieve all GO terms related to Biological Process.
Using QuickGO API, for each GO term of an Accession ID, retrieve its ancestors and look for Stress Response To Heat ("GO:0006950"). If so, isDescendant is True.
Map Gene IDs to isDecendant.

Output : This notebook will produce 3 CSV files :

geneID_isDescendant.csv

gene_id isDescendant

TRINITY_DN100016_c0_g1 TRUE
uniprotID_isDescendant.csv
- with columns : uniprot_id | isDescendant |
uniprotID_to_goID.csv
- with columns : uniprot_id | GO_id | isDescendant |

gene_id	isDescendant
TRINITY_DN100016_c0_g1	TRUE

Notes :

Current DGE Analysis as of 08/24/2020 uses swiss_prot_acc to determine if gene is descendant. But the current filter criteria uses fly annotation. I tried using fly_acc for isDescendant gene and cross-referenced to the current isDescendant file, around 17% isDescendant did not match.

climate_data_analysis.Rmd

Select Input file :

ERSSTv5 NetCDF files from NOAA website (.zip here 💾 )
- Use this if site coordinates are not yet final and you are still exploring.

Climate Data.csv 💾

Use this if site coordinates and date range is already final.

This file is generated in this notebook. Contains the SST and SSTA for all pre-specified sites within a date range.

site	date	SST	SSTA
CAG	185401	26.54	1.19

Workflow (Using NetCDF files) :

For each site, define its 2x2 lon and lat grid.
Read all NetCDF files and extract SST and SSTAs from all site coordinates.
Generates the dataframe in Climate Data.csv, which can then be used for plotting.

Output : This notebook will produce 2 graphs :

Distribution of SSTs as a Violin plot
Time series of SSTA w/ yearly moving average

pca_metadata.Rmd

Requirements :

Metadata CSV file
- Current file : NGS_morphodata_summary.csv 💾
- Columns should contain "samplenames" and other factors of that sample
  
  | samplename | sex | weight_init | weight_final | carap_init | carap_final | |-|-|-|-|-|-| | BIC_E_1 | M | 100.8 | 116.4 | 96.58 | 96.66 | | BIC_E_2 | I | 88.8 | 97.4 | 91.92 | 92.21 |

Workflow :

PCA of samples can then be color coded by their characteristics.

Code Snippets

shell:
    """
    dammit databases --install --quick --busco-group metazoa &&
    dammit annotate  {input.trinity_assembly} --quick --user-databases {input.db} -e 1e-10 --output-dir {params.dammit_out_dir} --n_threads {threads}
    """

SnakeMake dammit From line 52 of master/Snakefile

shell:
    "bowtie2-build {input.reference} {params.index_basename}"

SnakeMake Bowtie 2 From line 67 of master/Snakefile

shell:
    "bowtie2 -p {threads} -x {params.index_basename} -{params.file_type} -1 {input.reads1} -2 {input.reads2} --no-mixed --no-discordant --gbar 100 --end-to-end -k 200 | samtools view -bS - > {output}"

SnakeMake SAMtools Bowtie 2 From line 85 of master/Snakefile

shell:
    "rsem-prepare-reference {input.reference} {params.rsem_index_basename}"

SnakeMake RSEM From line 98 of master/Snakefile

shell:
    "rsem-calculate-expression -q --no-bam-output --alignments --paired-end {input.alignments} {params.rsem_index_basename} {params.rsem_out_dir}"

SnakeMake RSEM From line 114 of master/Snakefile

shell:
    "$TRINITY_HOME/util/align_and_estimate_abundance.pl --transcripts {input.assembly} --left {input.reads1} --right {input.reads2} --seqType {params.seq_type} --est_method salmon --output_dir {params.trinity_abundance_out_dir} --thread_count {threads} --trinity_mode --prep_reference"

SnakeMake Salmon From line 132 of master/Snakefile

shell:
    "Trinity --seqType {params.seq_type} --max_memory 90G --left {params.left} --right {params.right} --CPU {threads}"

SnakeMake seqtk-trinity From line 147 of master/Snakefile

ShowHide 1 more snippets with no or duplicated tags.

Comments

Support

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Created: 1yr ago

Updated: 1yr ago

Maitainers: public

URL: https://github.com/anishmss/rna-seq-analysis

Name: rna-seq-analysis

Version: 1

Badge:

Insert copied code into your website to add a link to this workflow.

License: None

Keywords:

Bowtie 2 dammit RSEM Salmon SAMtools seqtk-trinity Snakemake Comparative genomics

Future updates

Related Workflows

psychip_snakemake — Show Details View Workflow

ENCODE pipeline for histone marks developed for the psychENCODE project

public

psychip pipeline is an improved version of the ENCODE pipeline for histone marks developed for the psychENCODE project. The o...

raw sequence reads Alignment Sequence alignment report macs2 ucsc-bedclip bedGraphToBigWig BEDTools BWA Picard SAMtools Snakemake

Free

Near-real time tracking of SARS-CoV-2 in Connecticut

public

Repository containing scripts to perform near-real time tracking of SARS-CoV-2 in Connecticut using genomic data. This pipeli...

JSON nextclade Augur Biopython FOCUS Pandas Snakemake bs4 epiweeks geopy matplotlib numpy pycountry pycountry-convert uszipcode

Free

cellranger-snakemake-gke — Show Details View Workflow

snakemake workflow to run cellranger on a given bucket using gke.

public

A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...

macs2 ucsc-bedclip bedGraphToBigWig BEDTools BWA Picard SAMtools Snakemake

Free

ATLAS - Three commands to start analyzing your metagenome data

public

Metagenome-atlas is a easy-to-use metagenomic pipeline based on snakemake. It handles all steps from QC, Assembly, Binning, t...

raw sequence reads Genome assembly Annotation track checkm2 gunc prodigal snakemake-wrapper-utils MEGAHIT Atlas BBMap Biopython BioRuby Bwa-mem2 cd-hit CheckM DAS Diamond eggNOG-mapper v2 MetaBAT 2 Minimap2 MMseqs MultiQC Pandas Picard pyfastx SAMtools SemiBin Snakemake SPAdes SqueezeMeta TADpole VAMB CONCOCT ete3 gtdbtk h5py networkx numpy plotly psutil utils metagenomics

Free

175

rna-seq-star-deseq2 — Show Details View Workflow

RNA-seq workflow using STAR and DESeq2

public

This workflow performs a differential gene expression analysis with STAR and Deseq2. The usage of this workflow is described ...

Free

dna-seq-gatk-variant-calling — Show Details View Workflow

This Snakemake pipeline implements the GATK best-practices workflow

public

This Snakemake pipeline implements the GATK best-practices workflow for calling small germline variants. The usage of thi...

VCF raw sequence reads Variant calling genetic variants gatk rust-bio-tools snakemake-wrapper-utils tabix BCFtools BWA FastQC MultiQC Pandas Picard SAMtools Snakemake Trimmomatic Variant Effect Predictor (VEP) common matplotlib numpy seaborn DNA

Free