A small Snakemake pipeline to explore RNA-Seq data with deepTools.
Maciej Bak, Swiss Institute of Bioinformatics
deepTools is a very nice toolset for exploring RNA-Seq data.
This repository is a Snakemake workflow based on the example usage from the deepTools manual:
https://deeptools.readthedocs.io/en/develop/content/example_usage.html
My aim was to develop an automated and reproducible pipeline for my research, which I now happily share with the community :)
Snakemake pipeline execution
Snakemake is a workflow management system that helps to create and execute data processing pipelines. It requires Python 3 and can be most easily installed via the Bioconda package from the Anaconda cloud service.
Step 1: Download and installation of Miniconda3
Linux:
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
source ~/.bashrc
macOS:
wget https://repo.continuum.io/miniconda/Miniconda3-latest-MacOSX-x86_64.sh
bash Miniconda3-latest-MacOSX-x86_64.sh
source ~/.bashrc
Step 2: Pandas and Snakemake installation
To execute the workflow you will need the pandas Python library and the Snakemake workflow manager.
Unless a specific Snakemake version is required explicitly, it is most likely the best choice to install the latest versions:
conda install -c conda-forge pandas
conda install -c bioconda snakemake
In case you are missing some dependency packages, please install them first (with conda install ... as well).
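If you prefer to keep the pipeline's dependencies separate from your base installation, both packages can also go into a dedicated conda environment; a minimal sketch (the environment name deeptools_pipeline is just a placeholder):

# create and activate an isolated environment with both requirements
conda create -n deeptools_pipeline -c conda-forge -c bioconda pandas snakemake
conda activate deeptools_pipeline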
Step 3: Pipeline execution
Specify all the required information (input/output/parameters) in the config.yaml
The main input to the pipeline is a simple design table, which has to have the following format (an illustrative example follows the list below):
sample bam
[sample_name] [path_to_bam]
...
Where:
- Each row is a sequencing sample.
- All the BAM files need to have distinct file names, regardless of their location.
- The design table may contain more columns than the ones above.
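For illustration, a design table for two samples could look as follows (sample names and paths are invented, and the sketch assumes a tab-separated file; check the repository for the exact expected delimiter):

# write a minimal two-sample design table with explicit tab separators
printf 'sample\tbam\n' > design_table.tsv
printf 'treated_rep1\t/data/bams/treated_rep1.bam\n' >> design_table.tsv
printf 'control_rep1\t/data/bams/control_rep1.bam\n' >> design_table.tsv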
Apart from the design table, the pipeline requires a FASTA-formatted genome file.
Once the metadata are ready, write a DAG (directed acyclic graph) of the workflow into dag.pdf:
bash snakemake_dag_run.sh
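This script is a thin wrapper around Snakemake's DAG export; under typical setups it boils down to something like the following (the dot command requires Graphviz, and the exact flags used by the script may differ):

# render the workflow graph to PDF via Graphviz
snakemake --configfile config.yaml --dag | dot -Tpdf > dag.pdf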
There are two scripts to start the pipeline, depending on whether you want to run it locally or on a SLURM computational cluster. In order to execute the workflow, Snakemake automatically creates internal conda virtual environments and installs the software from the Anaconda cloud service. For cluster execution it might be required to adapt 'cluster_config.json' and the submission scripts before starting the run.
bash snakemake_local_run_conda_env.sh
bash snakemake_cluster_run_conda_env.sh
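For orientation, the local script essentially runs a Snakemake invocation along these lines (the core count is a placeholder; the shipped scripts remain authoritative):

# local run with per-rule conda environments
snakemake \
    --configfile config.yaml \
    --use-conda \
    --cores 4

The cluster script works the same way but additionally hands every job to SLURM, typically via Snakemake's --cluster option together with the settings from cluster_config.json.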
License
Apache 2.0
Code Snippets
# create the result and log directories and an empty marker file
shell:
    """
    mkdir -p {params.DIR_results_dir}; \
    mkdir -p {params.DIR_cluster_log}; \
    mkdir -p {log.DIR_local_log}; \
    touch {output.TMP_output}
    """
# coordinate-sort every input BAM file
shell:
    """
    samtools sort -@ {resources.threads} {params.BAM_path} \
    1> {output.BAM_sorted} \
    2> {log.LOG_local_log};
    """
# index the sorted BAM files
shell:
    """
    samtools index -@ {resources.threads} {input.BAM_sorted} \
    1> {output.BAI_index} \
    2> {log.LOG_local_log};
    """
# convert each sorted BAM into a single-base-resolution bigWig coverage track
shell:
    """
    bamCoverage \
    --bam {input.BAM_sorted} \
    --outFileName {output.BW_sample} \
    --binSize 1 \
    --outFileFormat bigwig \
    --numberOfProcessors {resources.threads} \
    2> {log.LOG_local_log};
    """
# summarize all bigWig tracks over genomic bins into one compressed matrix
shell:
    """
    multiBigwigSummary bins \
    --bwfiles {input.BW_sample} \
    --outFileName {output.NPZ_summary} \
    --numberOfProcessors {resources.threads} \
    2> {log.LOG_local_log};
    """
# PCA of the bin summary, both in the default orientation and transposed
shell:
    """
    plotPCA \
    --corData {input.NPZ_summary} \
    --plotFile {output.PNG_pca} \
    --outFileNameData {params.TSV_pca_table} \
    --ntop 1000 \
    2> {log.LOG_local_log};
    plotPCA --transpose \
    --corData {input.NPZ_summary} \
    --plotFile {output.PNG_pca_transposed} \
    --outFileNameData {params.TSV_pca_transposed_table} \
    --ntop 1000 \
    2> {log.LOG_local_log};
    """
# pairwise Pearson correlations between samples: heatmap and scatterplot
shell:
    """
    plotCorrelation \
    --corData {input.NPZ_summary} \
    --corMethod pearson \
    --whatToPlot heatmap \
    --plotFile {output.PNG_heatmap} \
    --outFileCorMatrix {params.TSV_heatmap_table} \
    2> {log.LOG_local_log};
    plotCorrelation \
    --corData {input.NPZ_summary} \
    --corMethod pearson \
    --whatToPlot scatterplot \
    --plotFile {output.PNG_scatterplot} \
    --outFileCorMatrix {params.TSV_scatterplot_table} \
    2> {log.LOG_local_log};
    """
# plot read coverage across all BAM files
shell:
    """
    plotCoverage \
    --bamfiles {input.BAM_sorted} \
    --plotFile {output.PNG_coverage_plot} \
    --numberOfProcessors {resources.threads} \
    2> {log.LOG_local_log};
    """
# convert the FASTA genome into 2bit format for computeGCBias
shell:
    """
    faToTwoBit {params.FASTA_genome} {output.TWOBIT_genome_2bit} \
    2> {log.LOG_local_log};
    """
# assess the GC bias of each sample against the genome
shell:
    """
    computeGCBias \
    --bamfile {input.BAM_sorted} \
    --effectiveGenomeSize {params.effective_genome_size} \
    --genome {input.TWOBIT_genome_2bit} \
    --numberOfProcessors {resources.threads} \
    --GCbiasFrequenciesFile {params.TSV_gc_tsv} \
    --biasPlot {output.PNG_gc_plot} \
    --plotFileFormat png \
    2> {log.LOG_local_log};
    """
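The effective genome size used above comes from the pipeline configuration; deepTools documents typical values per organism (for human GRCh38 it is roughly 2,913,022,398). Outside of Snakemake, the same step could be run by hand along these lines (all file names below are placeholders):

# standalone GC-bias check for a single sorted, indexed BAM
computeGCBias \
    --bamfile sample.sorted.bam \
    --effectiveGenomeSize 2913022398 \
    --genome genome.2bit \
    --numberOfProcessors 4 \
    --GCbiasFrequenciesFile sample_gc_frequencies.tsv \
    --biasPlot sample_gc_bias.png \
    --plotFileFormat png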