A small Snakemake pipeline to explore RNA-Seq data with deepTools.
Maciej Bak, Swiss Institute of Bioinformatics
deepTools is a very nice toolset for exploring RNA-Seq data.
This repository is a Snakemake workflow based on the example usage from the deepTools manual:
https://deeptools.readthedocs.io/en/develop/content/example_usage.html
My aim was to develop an automated and reproducible pipeline for my research, which I now happily share with the community :)
Snakemake pipeline execution
Snakemake is a workflow management system that helps to create and execute data processing pipelines. It requires Python 3 and can be most easily installed via the Bioconda package from the Anaconda cloud service.
Step 1: Download and installation of Miniconda3
Linux:
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
source ~/.bashrc
macOS:
wget https://repo.continuum.io/miniconda/Miniconda3-latest-MacOSX-x86_64.sh
bash Miniconda3-latest-MacOSX-x86_64.sh
source ~/.bashrc
Step 2: Pandas and Snakemake installation
To execute the workflow you will need the pandas Python library and the Snakemake workflow manager.
Unless a specific Snakemake version is required explicitly, it is most likely the best choice to install the latest versions:
conda install -c conda-forge pandas
conda install -c bioconda snakemake
In case you are missing some dependency packages, please install them first (with conda install ... as well).
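If you prefer to keep the pipeline's dependencies separate from your base installation, both packages can also go into a dedicated conda environment; a minimal sketch (the environment name deeptools_pipeline is just a placeholder):

# create and activate an isolated environment with both requirements
conda create -n deeptools_pipeline -c conda-forge -c bioconda pandas snakemake
conda activate deeptools_pipeline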
Step 3: Pipeline execution
Specify all the required information (input/output/parameters) in the config.yaml
The main input to the pipeline is a simple design table, which has to have the following format (an illustrative example follows the list below):
sample bam
[sample_name] [path_to_bam]
...
Where:
- Each row is a sequencing sample.
- All the BAM files need to have distinct file names, regardless of their location.
- The design table may contain more columns than the ones above.
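For illustration, a design table for two samples could look as follows (sample names and paths are invented, and the sketch assumes a tab-separated file; check the repository for the exact expected delimiter):

# write a minimal two-sample design table with explicit tab separators
printf 'sample\tbam\n' > design_table.tsv
printf 'treated_rep1\t/data/bams/treated_rep1.bam\n' >> design_table.tsv
printf 'control_rep1\t/data/bams/control_rep1.bam\n' >> design_table.tsv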
Apart from the design table, the pipeline requires a FASTA-formatted genome file.
Once the metadata are ready, write a DAG (directed acyclic graph) of the workflow into dag.pdf:
bash snakemake_dag_run.sh
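This script is a thin wrapper around Snakemake's DAG export; under typical setups it boils down to something like the following (the dot command requires Graphviz, and the exact flags used by the script may differ):

# render the workflow graph to PDF via Graphviz
snakemake --configfile config.yaml --dag | dot -Tpdf > dag.pdf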
There are two scripts to start the pipeline, depending on whether you want to run it locally or on a SLURM computational cluster. In order to execute the workflow, Snakemake automatically creates internal conda virtual environments and installs the software from the Anaconda cloud service. For cluster execution it might be required to adapt 'cluster_config.json' and the submission scripts before starting the run.
bash snakemake_local_run_conda_env.sh
bash snakemake_cluster_run_conda_env.sh
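For orientation, the local script essentially runs a Snakemake invocation along these lines (the core count is a placeholder; the shipped scripts remain authoritative):

# local run with per-rule conda environments
snakemake \
    --configfile config.yaml \
    --use-conda \
    --cores 4

The cluster script works the same way but additionally hands every job to SLURM, typically via Snakemake's --cluster option together with the settings from cluster_config.json.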
License
Apache 2.0
Code Snippets
# create the result and log directories and an empty marker file
shell:
    """
    mkdir -p {params.DIR_results_dir}; \
    mkdir -p {params.DIR_cluster_log}; \
    mkdir -p {log.DIR_local_log}; \
    touch {output.TMP_output}
    """
# coordinate-sort every input BAM file
shell:
    """
    samtools sort -@ {resources.threads} {params.BAM_path} \
    1> {output.BAM_sorted} \
    2> {log.LOG_local_log};
    """
# index the sorted BAM files
shell:
    """
    samtools index -@ {resources.threads} {input.BAM_sorted} \
    1> {output.BAI_index} \
    2> {log.LOG_local_log};
    """
# convert each sorted BAM into a single-base-resolution bigWig coverage track
shell:
    """
    bamCoverage \
    --bam {input.BAM_sorted} \
    --outFileName {output.BW_sample} \
    --binSize 1 \
    --outFileFormat bigwig \
    --numberOfProcessors {resources.threads} \
    2> {log.LOG_local_log};
    """
# summarize all bigWig tracks over genomic bins into one compressed matrix
shell:
    """
    multiBigwigSummary bins \
    --bwfiles {input.BW_sample} \
    --outFileName {output.NPZ_summary} \
    --numberOfProcessors {resources.threads} \
    2> {log.LOG_local_log};
    """
# PCA of the bin summary, both in the default orientation and transposed
shell:
    """
    plotPCA \
    --corData {input.NPZ_summary} \
    --plotFile {output.PNG_pca} \
    --outFileNameData {params.TSV_pca_table} \
    --ntop 1000 \
    2> {log.LOG_local_log};
    plotPCA --transpose \
    --corData {input.NPZ_summary} \
    --plotFile {output.PNG_pca_transposed} \
    --outFileNameData {params.TSV_pca_transposed_table} \
    --ntop 1000 \
    2> {log.LOG_local_log};
    """
# pairwise Pearson correlations between samples: heatmap and scatterplot
shell:
    """
    plotCorrelation \
    --corData {input.NPZ_summary} \
    --corMethod pearson \
    --whatToPlot heatmap \
    --plotFile {output.PNG_heatmap} \
    --outFileCorMatrix {params.TSV_heatmap_table} \
    2> {log.LOG_local_log};
    plotCorrelation \
    --corData {input.NPZ_summary} \
    --corMethod pearson \
    --whatToPlot scatterplot \
    --plotFile {output.PNG_scatterplot} \
    --outFileCorMatrix {params.TSV_scatterplot_table} \
    2> {log.LOG_local_log};
    """
# plot read coverage across all BAM files
shell:
    """
    plotCoverage \
    --bamfiles {input.BAM_sorted} \
    --plotFile {output.PNG_coverage_plot} \
    --numberOfProcessors {resources.threads} \
    2> {log.LOG_local_log};
    """
# convert the FASTA genome into 2bit format for computeGCBias
shell:
    """
    faToTwoBit {params.FASTA_genome} {output.TWOBIT_genome_2bit} \
    2> {log.LOG_local_log};
    """
# assess the GC bias of each sample against the genome
shell:
    """
    computeGCBias \
    --bamfile {input.BAM_sorted} \
    --effectiveGenomeSize {params.effective_genome_size} \
    --genome {input.TWOBIT_genome_2bit} \
    --numberOfProcessors {resources.threads} \
    --GCbiasFrequenciesFile {params.TSV_gc_tsv} \
    --biasPlot {output.PNG_gc_plot} \
    --plotFileFormat png \
    2> {log.LOG_local_log};
    """
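The effective genome size used above comes from the pipeline configuration; deepTools documents typical values per organism (for human GRCh38 it is roughly 2,913,022,398). Outside of Snakemake, the same step could be run by hand along these lines (all file names below are placeholders):

# standalone GC-bias check for a single sorted, indexed BAM
computeGCBias \
    --bamfile sample.sorted.bam \
    --effectiveGenomeSize 2913022398 \
    --genome genome.2bit \
    --numberOfProcessors 4 \
    --GCbiasFrequenciesFile sample_gc_frequencies.tsv \
    --biasPlot sample_gc_bias.png \
    --plotFileFormat png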