This is the template for a new Snakemake workflow. Replace this text with a comprehensive description covering the purpose and domain.
Insert your code into the respective folders, i.e. scripts, rules, and envs. Define the entry point of the workflow in the Snakefile and the main configuration in the config.yaml file.
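Following the standard snakemake-workflows layout (a sketch; your copy may differ in detail), the repository is organized like this:

workflow/
├── Snakefile
├── rules/
├── scripts/
└── envs/
config/
├── config.yaml
└── samples.tsv
.test/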
Authors
- Chao Di (@dic)
Usage
If you use this workflow in a paper, don't forget to give credit to the authors by citing the URL of this (original) repository and, if available, its DOI (see above).
Step 1: Obtain a copy of this workflow
- Create a new GitHub repository using this workflow as a template.
- Clone the newly created repository to your local system, into the place where you want to perform the data analysis.
Step 2: Configure workflow
Configure the workflow according to your needs by editing the files in the config/ folder. Adjust config.yaml to configure the workflow execution, and samples.tsv to specify your sample setup.
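For example, a minimal samples.tsv needs at least a sample column, since the bundled R scripts look up samples$sample (the second sample name below is illustrative):

sample
Ad5input1
Ad5input2

A matching config.yaml would then point at the sample sheet; the samples: key shown here is the usual snakemake-workflows convention, not something this repository confirms:

samples: config/samples.tsv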
Step 3: Install Snakemake
Install Snakemake using conda:
conda create -c bioconda -c conda-forge -n snakemake snakemake
For installation details, see the instructions in the Snakemake documentation.
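Once the environment is activated (see Step 4), you can verify the installation via

snakemake --version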
Step 4: Execute workflow
Activate the conda environment:
conda activate snakemake
Test your configuration by performing a dry-run via
snakemake --use-conda -n
Execute the workflow locally via
snakemake --use-conda --cores $N
using $N cores, or run it in a cluster environment via
snakemake --use-conda --cluster qsub --jobs 100
or
snakemake --use-conda --drmaa --jobs 100
If you want to pin not only the software stack but also the underlying OS, use
snakemake --use-conda --use-singularity
in combination with any of the modes above. See the Snakemake documentation for further details.
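For example, a containerized cluster run combining these flags could look as follows; the qsub resource options are placeholders for whatever your scheduler expects:

snakemake --use-conda --use-singularity --cluster "qsub -l nodes=1:ppn={threads}" --jobs 100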
Step 5: Investigate results
After successful execution, you can create a self-contained interactive HTML report with all results via:
snakemake --report report.html
This report can, e.g., be forwarded to your collaborators. An example (using some trivial test data) can be seen here.
Step 6: Commit changes
Whenever you change something, don't forget to commit the changes back to your GitHub copy of the repository:
git commit -a
git push
Step 7: Obtain updates from upstream
Whenever you want to synchronize your workflow copy with new developments from upstream, do the following.
- Once, register the upstream repository in your local copy:
  git remote add -f upstream git@github.com:snakemake-workflows/J2seq.git
  or
  git remote add -f upstream https://github.com/snakemake-workflows/J2seq.git
  if you have not set up SSH keys.
- Update the upstream version:
  git fetch upstream
- Create a diff with the current version:
  git diff HEAD upstream/master workflow > upstream-changes.diff
- Investigate the changes:
  vim upstream-changes.diff
- Apply the modified diff via:
  git apply upstream-changes.diff
- Carefully check whether you need to update the config files:
  git diff HEAD upstream/master config
  If so, do it manually, and only where necessary, since you would otherwise likely overwrite your settings and samples.
Step 8: Contribute back
In case you have also changed or added steps, please consider contributing them back to the original repository:
- Fork the original repo to a personal or lab account.
- Clone the fork to your local system, to a different place than where you ran your analysis.
- Copy the modified files from your analysis to the clone of your fork, e.g.,
  cp -r workflow path/to/fork
  Make sure not to accidentally copy config file contents or sample sheets. Instead, manually update the example config files if necessary.
- Commit and push your changes to your fork.
- Create a pull request against the original repository.
Testing
Test cases are in the subfolder .test. They are automatically executed via continuous integration with GitHub Actions.
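To execute the test case locally before pushing, you can point Snakemake at that directory; a sketch, assuming .test contains its own config and data:

snakemake --use-conda --cores 2 --directory .test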
Code Snippets
The following excerpts are taken from the workflow's rules and scripts.
shell:
    '''
    # Keep the header and output BAM for reads in the Ad5 region
    samtools view -h -b {input} Ad5 > {output} 2> {log}
    '''
shell:
    '''
    samtools index {input} {output.bai}
    '''
shell:
    '''
    multiqc ../results/fastq_screen_output -o {params.outdir} &> {log}
    '''
script:
    "../scripts/featureCount.R"
script:
    "../scripts/featureCount.R"
script:
    "../scripts/featureCount.R"
script:
    "../scripts/featureCount_segments.R"
shell:
    "samtools index {input} {output}"
shell:
    "samtools index {input} {output}"
shell:
    '''
    rm -rf ../results/multiqc/star/
    multiqc STAR_align -o {params.outdir} &> {log}
    '''
shell:
    '''
    TEcount --sortByPos --format BAM --mode multi -b {input} \\
        --GTF {params.gene} --TE {params.TE} \\
        --stranded forward --project "../results/TEcount/TEcount.{wildcards.sample}"
    '''
shell:
    '''
    rm -f {output}
    # First column: feature IDs, with quotes stripped and "gene/TE" renamed to "gene_TE"
    cut -f1 ../results/TEcount/TEcount.Ad5input1.cntTable | sed 's/"//g;s/gene\/TE/gene_TE/g' > {output}
    # Append the count column of every sample, cleaning the BAM-derived column names
    for i in {input}; do
        cut -f2 $i | sed 's/merged_bam\///g;s/_merged_dedup_sorted.bam//g' > foo
        paste {output} foo > foo1
        mv foo1 {output}
    done
    rm -f foo foo1
    '''
library(Rsubread)
library(dplyr)
library(mgsub)

# Redirect all output and messages to the Snakemake log file
log <- file(snakemake@log[[1]], open = "wt")
sink(log)
sink(log, type = "message")

## Count RPFs on CDS for each gene, using featureCounts
## Run all BAMs together
samples <- read.table(snakemake@input[["samples"]], header = TRUE)
bamfiles <- paste0("./merged_bam/", as.vector(samples$sample), "_merged_dedup_sorted.bam")
## To run a single BAM file instead:
# bamfiles <- snakemake@input[["bamfile"]]

RPFcounts <- featureCounts(files = bamfiles,
                           annot.ext = snakemake@input[["gtf"]],
                           isGTFAnnotationFile = TRUE,
                           GTF.featureType = snakemake@params[["featureType"]],
                           GTF.attrType = "gene_name",
                           strandSpecific = snakemake@params[["strand"]],
                           countMultiMappingReads = FALSE,
                           juncCounts = TRUE,
                           nthreads = snakemake@threads[[1]])

write.table(RPFcounts$counts, file = snakemake@output[[1]], sep = "\t",
            quote = FALSE, row.names = TRUE, col.names = NA)
library(Rsubread)
library(dplyr)
library(mgsub)

# Redirect all output and messages to the Snakemake log file
log <- file(snakemake@log[[1]], open = "wt")
sink(log)
sink(log, type = "message")

## Count RPFs on custom segments (SAF annotation) for each gene, using featureCounts
## Run all BAMs together
samples <- read.table(snakemake@input[["samples"]], header = TRUE)
bamfiles <- paste0("./merged_bam/", as.vector(samples$sample), "_merged_dedup_sorted.bam")
## To run a single BAM file instead:
# bamfiles <- snakemake@input[["bamfile"]]

RPFcounts <- featureCounts(files = bamfiles,
                           annot.ext = snakemake@input[["saf"]],
                           isGTFAnnotationFile = FALSE,
                           fracOverlap = 1,
                           strandSpecific = snakemake@params[["strand"]],
                           countMultiMappingReads = FALSE,
                           juncCounts = TRUE,
                           nthreads = snakemake@threads[[1]])

write.table(RPFcounts$counts, file = snakemake@output[[1]], sep = "\t",
            quote = FALSE, row.names = TRUE, col.names = NA)
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "julianderuiter@gmail.com"
__license__ = "MIT"

from os import path

from snakemake.shell import shell

# Collect the unique parent directories of all input files
input_dirs = set(path.dirname(fp) for fp in snakemake.input)
output_dir = path.dirname(snakemake.output[0])
output_name = path.basename(snakemake.output[0])
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

# Run MultiQC over the input directories, writing the report to the requested path
shell(
    "multiqc"
    " {snakemake.params}"
    " --force"
    " -o {output_dir}"
    " -n {output_name}"
    " {input_dirs}"
    " {log}"
)