This workflow performs general data processing for ChIP-seq.
Workflow
Install Snakemake and Git
Please make sure that Snakemake and Git are correctly installed
Snakemake: https://snakemake.readthedocs.io/en/stable/getting_started/installation.html
Git: https://anaconda.org/anaconda/git
Clone the workflow into your working directory
git clone https://github.com/mhu10/SMK_ChIPseq path/to/workdir
Edit the config file and workflow files as needed
./SMK_ChIPseq/config/config.yaml
./SMK_ChIPseq/Snakefile
Activate the Snakemake conda environment
conda activate snakemake
Dry run workflow
snakemake -n
Execute workflow
snakemake --use-conda --cores 12
Build the workflow DAG chart
snakemake --rulegraph | dot -Tsvg > dag1.svg
Code Snippets
# scripts/ConvertPeaks.R: reorder HOMER mergePeaks output into BED-like,
# chromosome-sorted tables for the control and treatment peak sets.
library(dplyr)

out_file1 <- snakemake@output[["bed_ctl"]]
out_file2 <- snakemake@output[["bed_treatment"]]

datainput_ctl <- read.table(snakemake@input[['peaks_ctl']], sep="\t")[,c(2,3,4,1,6,5)]
datainput_treatment <- read.table(snakemake@input[['peaks_treatment']], sep="\t")[,c(2,3,4,1,6,5)]

colnames(datainput_ctl) = c('Chromosome','Start','End','PeakID','Column5','Strand')
colnames(datainput_treatment) = c('Chromosome','Start','End','PeakID','Column5','Strand')

datainput_ctl = arrange(datainput_ctl, Chromosome)
datainput_treatment = arrange(datainput_treatment, Chromosome)

write.table(datainput_ctl, out_file1, quote = FALSE, sep='\t', row.names = FALSE)
write.table(datainput_treatment, out_file2, quote = FALSE, sep='\t', row.names = FALSE)
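For orientation, the sketch below shows one way a rule in the Snakefile could invoke this script through Snakemake's script: directive. The rule name and file paths are hypothetical, but the input/output keys (peaks_ctl, peaks_treatment, bed_ctl, bed_treatment) are exactly those the script reads from the snakemake object.

rule convert_peaks_to_bed:          # hypothetical rule name
    input:
        peaks_ctl="results/HOMER/merged_peaks_ctl.txt",              # hypothetical paths
        peaks_treatment="results/HOMER/merged_peaks_treatment.txt",
    output:
        bed_ctl="results/HOMER/merged_peaks_ctl.bed",
        bed_treatment="results/HOMER/merged_peaks_treatment.bed",
    script:
        "scripts/ConvertPeaks.R"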
wrapper:
    "v1.7.1/bio/fastqc"
wrapper:
    "v1.7.1/bio/trimmomatic/pe"
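The trimmomatic/pe wrapper (its full source is reproduced near the end of this section) expects paired-end reads under input.r1/input.r2, four outputs (paired and unpaired mates), and a params.trimmer list. A minimal sketch of a compatible rule follows; the sample wildcard, file paths, and trimmer settings are illustrative assumptions, not the workflow's actual values.

rule trimmomatic_pe:                 # hypothetical rule name
    input:
        r1="data/{sample}_R1.fastq.gz",     # hypothetical paths
        r2="data/{sample}_R2.fastq.gz",
    output:
        r1="results/trimmed/{sample}_R1.fastq.gz",
        r1_unpaired="results/trimmed/{sample}_R1.unpaired.fastq.gz",
        r2="results/trimmed/{sample}_R2.fastq.gz",
        r2_unpaired="results/trimmed/{sample}_R2.unpaired.fastq.gz",
    params:
        # example Trimmomatic steps; the real settings live in the Snakefile/config
        trimmer=["LEADING:3", "TRAILING:3", "SLIDINGWINDOW:4:15", "MINLEN:36"],
    threads: 8
    wrapper:
        "v1.7.1/bio/trimmomatic/pe"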
shell:
    "bowtie2 -p 8 --local --no-discordant --no-mixed "
    "-x {params.bowtieindex} "
    "-1 {input.r1} "
    "-2 {input.r2} "
    "-S {output} "
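The bowtie2 call above depends on params.bowtieindex, which would normally point at the basename of a prebuilt Bowtie2 index, typically supplied through config.yaml. A hedged sketch of such a rule is shown below; the config key, rule name, and file paths are assumptions for illustration only.

rule bowtie2_align:                  # hypothetical rule name
    input:
        r1="results/trimmed/{sample}_R1.fastq.gz",   # hypothetical paths
        r2="results/trimmed/{sample}_R2.fastq.gz",
    output:
        "results/bowtie2/{sample}.sam",
    params:
        bowtieindex=config["bowtie2_index"],   # assumed config key holding the index basename
    shell:
        "bowtie2 -p 8 --local --no-discordant --no-mixed "
        "-x {params.bowtieindex} "
        "-1 {input.r1} "
        "-2 {input.r2} "
        "-S {output} "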
shell:
    "samtools view -bS {input} > {output} "
shell:
    "samtools sort -@ 5 {input} -o {output} "
shell:
    "samtools view -bh -q 10 {input} > {output} "
wrapper:
    "v1.20.0/bio/picard/markduplicates"
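The markduplicates wrapper (source reproduced further below) reads snakemake.input.bams and writes snakemake.output.bam plus snakemake.output.metrics. A minimal hypothetical rule using it could look like this; the rule name and paths are placeholders.

rule mark_duplicates:                # hypothetical rule name
    input:
        bams="results/filtered/{sample}.filtered.sorted.bam",   # hypothetical path
    output:
        bam="results/dedup/{sample}.dedup.bam",
        metrics="results/dedup/{sample}.metrics.txt",
    log:
        "logs/markduplicates/{sample}.log",
    wrapper:
        "v1.20.0/bio/picard/markduplicates"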
wrapper:
    "v1.25.0/bio/samtools/index"
shell:
    "multiBamSummary bins "
    "--ignoreDuplicates -p 8 "
    "--bamfile {input.bams} "
    "-out {output.countnpz} "
    "--outRawCounts {output.counttab}"
shell:
    "plotPCA --corData {input.countnpz} "
    "--plotFile ./results/MultiBamSummary/deepTools_pcaplot.png "
    "-T 'PCA of read counts' "
    "--outFileNameData ./results/MultiBamSummary/deeptools_pcaProfile.tab "
shell:
    "plotCorrelation --corData {input.countnpz} "
    "--plotFile ./results/MultiBamSummary/deepTools_heatmap_spearman.png "
    "--corMethod spearman "
    "--whatToPlot heatmap "
    "--plotTitle 'Spearman Correlation of Read Counts' "
    "--plotNumbers "
shell:
    "macs2 callpeak -t {input} -n {wildcards.sample} --outdir results/MACS2/{wildcards.sample} -g hs -f BAM --keep-dup auto --bdg "
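Since this rule passes -n {wildcards.sample} and --outdir results/MACS2/{wildcards.sample}, MACS2 writes its standard output files (peaks, summits, and, with --bdg, bedGraph pileups) under that per-sample directory. A sketch of how such a rule might declare the main narrowPeak file as its output follows; the input path and the choice of declared output are assumptions, not the workflow's actual declarations.

rule macs2_callpeak:                 # hypothetical rule name
    input:
        "results/dedup/{sample}.dedup.bam",                      # hypothetical input path
    output:
        # MACS2 names the peak file <name>_peaks.narrowPeak for default (narrow) peak calling
        "results/MACS2/{sample}/{sample}_peaks.narrowPeak",
    shell:
        "macs2 callpeak -t {input} -n {wildcards.sample} "
        "--outdir results/MACS2/{wildcards.sample} "
        "-g hs -f BAM --keep-dup auto --bdg "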
wrapper:
    "v1.25.0/bio/homer/mergePeaks"
wrapper:
    "v1.25.0/bio/homer/mergePeaks"
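Both snippets above use the HOMER mergePeaks wrapper, whose source (shown below) simply runs mergePeaks on all declared inputs and redirects stdout to the output file. A hypothetical rule merging two replicate peak sets might look like the following sketch; paths, rule name, and the extra option are illustrative.

rule merge_peaks_ctl:                # hypothetical rule name
    input:
        # hypothetical peak files from two control replicates
        "results/MACS2/ctl_rep1/ctl_rep1_peaks.narrowPeak",
        "results/MACS2/ctl_rep2/ctl_rep2_peaks.narrowPeak",
    output:
        "results/HOMER/merged_peaks_ctl.txt",
    params:
        extra="-d given",            # example mergePeaks option; -prefix is rejected by the wrapper
    log:
        "logs/homer/mergePeaks_ctl.log",
    wrapper:
        "v1.25.0/bio/homer/mergePeaks"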
script:
    "scripts/ConvertPeaks.R"
__author__ = "Johannes Köster, Christopher Schröder"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"

import tempfile
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts

log = snakemake.log_fmt_shell()
extra = snakemake.params.get("extra", "")
java_opts = get_java_opts(snakemake)

bams = snakemake.input.bams
if isinstance(bams, str):
    bams = [bams]
bams = list(map("--INPUT {}".format, bams))

if snakemake.output.bam.endswith(".cram"):
    output = "/dev/stdout"
    if snakemake.params.embed_ref:
        view_options = "-O cram,embed_ref"
    else:
        view_options = "-O cram"
    convert = f" | samtools view -@ {snakemake.threads} {view_options} --reference {snakemake.input.ref} -o {snakemake.output.bam}"
else:
    output = snakemake.output.bam
    convert = ""

with tempfile.TemporaryDirectory() as tmpdir:
    shell(
        "(picard MarkDuplicates"  # Tool and its subcommand
        " {java_opts}"  # Automatic java option
        " {extra}"  # User defined parameters
        " {bams}"  # Input bam(s)
        " --TMP_DIR {tmpdir}"
        " --OUTPUT {output}"  # Output bam
        " --METRICS_FILE {snakemake.output.metrics}"  # Output metrics
        " {convert} ) {log}"  # Logging
    )
__author__ = "Jan Forster"
__copyright__ = "Copyright 2020, Jan Forster"
__email__ = "j.forster@dkfz.de"
__license__ = "MIT"

from snakemake.shell import shell
import os.path as path
import sys

extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)


class PrefixNotSupportedError(Exception):
    pass


if "-prefix" in extra:
    raise PrefixNotSupportedError(
        "The use of the -prefix parameter is not yet supported in this wrapper"
    )

shell(
    "(mergePeaks"
    " {snakemake.input}"
    " {extra}"
    " > {snakemake.output})"
    " {log}"
)
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"

from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

# Samtools takes additional threads through its option -@
# One thread for samtools index
# Other threads are *additional* threads passed to the '-@' argument
threads = "" if snakemake.threads <= 1 else " -@ {} ".format(snakemake.threads - 1)

shell(
    "samtools index {threads} {extra} {snakemake.input[0]} {snakemake.output[0]} {log}"
)
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "julianderuiter@gmail.com"
__license__ = "MIT"

from os import path
import re
from tempfile import TemporaryDirectory

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=True, stderr=True)


def basename_without_ext(file_path):
    """Returns basename of file path, without the file extension."""
    base = path.basename(file_path)
    # Remove file extension(s) (similar to the internal fastqc approach)
    base = re.sub("\\.gz$", "", base)
    base = re.sub("\\.bz2$", "", base)
    base = re.sub("\\.txt$", "", base)
    base = re.sub("\\.fastq$", "", base)
    base = re.sub("\\.fq$", "", base)
    base = re.sub("\\.sam$", "", base)
    base = re.sub("\\.bam$", "", base)
    return base


# Run fastqc, since there can be race conditions if multiple jobs
# use the same fastqc dir, we create a temp dir.
with TemporaryDirectory() as tempdir:
    shell(
        "fastqc {snakemake.params} -t {snakemake.threads} "
        "--outdir {tempdir:q} {snakemake.input[0]:q}"
        " {log}"
    )

    # Move outputs into proper position.
    output_base = basename_without_ext(snakemake.input[0])
    html_path = path.join(tempdir, output_base + "_fastqc.html")
    zip_path = path.join(tempdir, output_base + "_fastqc.zip")

    if snakemake.output.html != html_path:
        shell("mv {html_path:q} {snakemake.output.html:q}")

    if snakemake.output.zip != zip_path:
        shell("mv {zip_path:q} {snakemake.output.zip:q}")
__author__ = "Johannes Köster, Jorge Langa"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"

from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts


# Distribute available threads between trimmomatic itself and any potential pigz instances
def distribute_threads(input_files, output_files, available_threads):
    gzipped_input_files = sum(1 for file in input_files if file.endswith(".gz"))
    gzipped_output_files = sum(1 for file in output_files if file.endswith(".gz"))
    potential_threads_per_process = available_threads // (
        1 + gzipped_input_files + gzipped_output_files
    )
    if potential_threads_per_process > 0:
        # decompressing pigz creates at most 4 threads
        pigz_input_threads = (
            min(4, potential_threads_per_process) if gzipped_input_files != 0 else 0
        )
        pigz_output_threads = (
            (available_threads - pigz_input_threads * gzipped_input_files)
            // (1 + gzipped_output_files)
            if gzipped_output_files != 0
            else 0
        )
        trimmomatic_threads = (
            available_threads
            - pigz_input_threads * gzipped_input_files
            - pigz_output_threads * gzipped_output_files
        )
    else:
        # not enough threads for pigz
        pigz_input_threads = 0
        pigz_output_threads = 0
        trimmomatic_threads = available_threads
    return trimmomatic_threads, pigz_input_threads, pigz_output_threads


def compose_input_gz(filename, threads):
    if filename.endswith(".gz") and threads > 0:
        return "<(pigz -p {threads} --decompress --stdout {filename})".format(
            threads=threads, filename=filename
        )
    return filename


def compose_output_gz(filename, threads, compression_level):
    if filename.endswith(".gz") and threads > 0:
        return ">(pigz -p {threads} {compression_level} > {filename})".format(
            threads=threads, compression_level=compression_level, filename=filename
        )
    return filename


extra = snakemake.params.get("extra", "")
java_opts = get_java_opts(snakemake)
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
compression_level = snakemake.params.get("compression_level", "-5")
trimmer = " ".join(snakemake.params.trimmer)

# Distribute threads
input_files = [snakemake.input.r1, snakemake.input.r2]
output_files = [
    snakemake.output.r1,
    snakemake.output.r1_unpaired,
    snakemake.output.r2,
    snakemake.output.r2_unpaired,
]

trimmomatic_threads, input_threads, output_threads = distribute_threads(
    input_files, output_files, snakemake.threads
)

input_r1, input_r2 = [
    compose_input_gz(filename, input_threads) for filename in input_files
]

output_r1, output_r1_unp, output_r2, output_r2_unp = [
    compose_output_gz(filename, output_threads, compression_level)
    for filename in output_files
]

shell(
    "trimmomatic PE -threads {trimmomatic_threads} {java_opts} {extra} "
    "{input_r1} {input_r2} "
    "{output_r1} {output_r1_unp} "
    "{output_r2} {output_r2_unp} "
    "{trimmer} "
    "{log}"
)
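To make the thread bookkeeping in distribute_threads concrete, here is a small worked example (not part of the wrapper) for a rule given 12 threads with two gzipped inputs and four gzipped outputs, i.e. the paired-end case with all FASTQ files compressed.

# Illustrative arithmetic only, mirroring distribute_threads above.
available_threads = 12
gzipped_inputs = 2       # r1, r2
gzipped_outputs = 4      # r1, r1_unpaired, r2, r2_unpaired

per_process = available_threads // (1 + gzipped_inputs + gzipped_outputs)  # 12 // 7 = 1
pigz_input_threads = min(4, per_process)                                   # 1
pigz_output_threads = (available_threads - pigz_input_threads * gzipped_inputs) // (
    1 + gzipped_outputs
)                                                                          # 10 // 5 = 2
trimmomatic_threads = (
    available_threads
    - pigz_input_threads * gzipped_inputs
    - pigz_output_threads * gzipped_outputs
)                                                                          # 12 - 2 - 8 = 2

# So Trimmomatic itself gets 2 threads, each decompressing pigz gets 1,
# and each compressing pigz gets 2.
print(trimmomatic_threads, pigz_input_threads, pigz_output_threads)        # 2 1 2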
Created: 1yr ago
Updated: 1yr ago
Maintainers: public
URL: https://github.com/mhu10/SMK_ChIPseq
Name: smk_chipseq
Version: 1
Downloaded: 0
Copyright: Public Domain
License: MIT License
Keywords:
- Future updates
Related Workflows

ENCODE pipeline for histone marks developed for the psychENCODE project
The psychip pipeline is an improved version of the ENCODE pipeline for histone marks, developed for the psychENCODE project.

Near-real time tracking of SARS-CoV-2 in Connecticut
Repository containing scripts to perform near-real-time tracking of SARS-CoV-2 in Connecticut using genomic data.

Snakemake workflow to run cellranger on a given bucket using GKE
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine.

ATLAS - Three commands to start analyzing your metagenome data
Metagenome-Atlas is an easy-to-use metagenomic pipeline based on Snakemake.

RNA-seq workflow using STAR and DESeq2
This workflow performs a differential gene expression analysis with STAR and DESeq2.

This Snakemake pipeline implements the GATK best-practices workflow
This Snakemake pipeline implements the GATK best-practices workflow for calling small germline variants.