A snakemake workflow to process ATAC-seq data

public 1yr ago 0 bookmarks

View Workflow

Help improve this workflow!

This workflow has been published but could be further improved with some additional meta data:

Keyword(s) in categories input, output, operation, topic

You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .

Yet another snakemake workflow for ATAC-seq data processing. This pipeline was created from code developed by:

Crazy Hot Tommy! 's many instructional guides
TOBIAS ATAC-seq footprinting Snakemake workflow

For SLURM setup we reference:

jdblischak/smk-simple-slur repo for simple submitting snakemake on SLURM
Tessa Pierce blog for example templates

Workflow Overview

Snakemake pipelines promote experimental reproducibility. For this project, you should have the following inputs customized for your analysis:

A config.yaml that describes the run parameters and location of reference data.
A tab-delimited sample meta file file that describes the experiments to download from SRA and how to group them.
A unique output directory.

A detailed overview of the steps in the ATAC-seq data processing are found on the maxATAC wiki site .

This version of snakeATAC is geared towards use with maxATAC and TOBIAS for making TF binding predictions.

Installation

This pipeline uses Anaconda and Snakemake. Follow the Snakemake install instructions for the best experience. Below is a brief overview of how to install Snakemake.

Create environment

Create a conda environment and download mamba :

conda create -n snakeatac -c conda-forge -c bioconda mamba snakemake

Activate the snakeatac environment:

conda activate snakeatac

Clone the snakeATAC repository

In your favorite directory clone the snakeATAC repo:

git clone https://github.com/tacazares/snakeATAC.git

Set up run-specific parameters

If you are running this pipeline for your first time, you will need to install all the conda environments used and perform a dry-run to make sure that everything was installed right.

Adjust the config.yaml and the tab-delimited sample meta file for your specific experiment.
Change to the working directory for snakeATAC. By default, Snakemake will look for a file called Snakefile with the rules and run information. You can use a custome Snakefile with the -s flag followed by the path to the file.
```
cd ./snakeATAC/
```

Next, use the --conda-create-envs-only flag to create the environments.

snakemake --cores 14 --use-conda --conda-frontend mamba --conda-create-envs-only --configfile ./inputs/config.yaml

Test the workflow and scripts are correctly set up by performing a dry-run with the --dry-run flag.

snakemake --cores 14 --use-conda --conda-frontend mamba --configfile ./inputs/config.yaml --dry-run

Test snakeATAC

The ./snakeATAC/inputs/GM12878_sample.tsv contains information for a test run to process GM12878 OMNI ATAC-seq data.

After install, you can run the full run using your favorite HPC system.

snakemake --cores 14 --use-conda --conda-frontend mamba --configfile ./inputs/config.yaml

Use Snakemake to submit jobs through SLURM

If you want to use Snakemake to submit jobs to slurm, you will need to follow the instruction described by jdblischak/smk-simple-slur repo . The directory and scripts are included in this repository, but you will need to adjust the account information. You can also adjust any defaults that you wish to use with your job submissions. NOTE: You will need to use chmod +x status-sacct.sh to make the script executable.

Example .bat file to drive the snakeATAC workflow

#!/bin/bash
#SBATCH -D ./outputs
#SBATCH -J dmnd_snake 
#SBATCH -t 96:00:00
#SBATCH --ntasks=8
#SBATCH --mem=16gb
#SBATCH --account={YOUR_ACCOUNT}
#SBATCH --output ./outputs/snakeatac-%j.out
#SBATCH --error ./outputs/snakeatac-%j.err
# Load modules
module load python/3.7-2019.10
# Load the snakemake/mamba env
source activate mamba
# go to a particular directory
cd ./snakeATAC
# make things fail on errors
set -o nounset
set -o errexit
set -x
### run your commands here!
# Develop from the below links
# https://bluegenes.github.io/snakemake-via-slurm/
# https://github.com/jdblischak/smk-simple-slurm
snakemake -s /snakeATAC/Snakefile \
--use-conda \
--conda-frontend mamba \
--configfile /snakeATAC/inputs/config.yaml \
--profile simple/

Code Snippets

__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "julianderuiter@gmail.com"
__license__ = "MIT"


from os import path
import re
from tempfile import TemporaryDirectory

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=True, stderr=True)


def basename_without_ext(file_path):
    """Returns basename of file path, without the file extension."""

    base = path.basename(file_path)
    # Remove file extension(s) (similar to the internal fastqc approach)
    base = re.sub("\\.gz$", "", base)
    base = re.sub("\\.bz2$", "", base)
    base = re.sub("\\.txt$", "", base)
    base = re.sub("\\.fastq$", "", base)
    base = re.sub("\\.fq$", "", base)
    base = re.sub("\\.sam$", "", base)
    base = re.sub("\\.bam$", "", base)

    return base


# Run fastqc, since there can be race conditions if multiple jobs
# use the same fastqc dir, we create a temp dir.
with TemporaryDirectory() as tempdir:
    shell(
        "fastqc {snakemake.params} -t {snakemake.threads} "
        "--outdir {tempdir:q} {snakemake.input[0]:q}"
        " {log}"
    )

    # Move outputs into proper position.
    output_base = basename_without_ext(snakemake.input[0])
    html_path = path.join(tempdir, output_base + "_fastqc.html")
    zip_path = path.join(tempdir, output_base + "_fastqc.zip")

    if snakemake.output.html != html_path:
        shell("mv {html_path:q} {snakemake.output.html:q}")

    if snakemake.output.zip != zip_path:
        shell("mv {zip_path:q} {snakemake.output.zip:q}")

Python Snakemake FastQC From line 3 of fastqc/wrapper.py

__author__ = "Johannes Köster, Derek Croote"
__copyright__ = "Copyright 2020, Johannes Köster"
__email__ = "johannes.koester@uni-due.de"
__license__ = "MIT"

import os
import tempfile
from snakemake.shell import shell


log = snakemake.log_fmt_shell(stdout=True, stderr=True)
extra = snakemake.params.get("extra", "")


outdir = os.path.dirname(snakemake.output[0])
if outdir:
    outdir = f"--outdir {outdir}"


compress = ""
for output in snakemake.output:
    out_name, out_ext = os.path.splitext(output)
    if out_ext == ".gz":
        compress += f"pigz -p {snakemake.threads} {out_name}; "
    elif out_ext == ".bz2":
        compress += f"pbzip2 -p{snakemake.threads} {out_name}; "


with tempfile.TemporaryDirectory() as tmp:
    shell(
        "(fasterq-dump --temp {tmp} --threads {snakemake.threads} "
        "{extra} {outdir} {snakemake.wildcards.accession}; "
        "{compress}"
        ") {log}"
    )

Python Snakemake From line 1 of fasterq-dump/wrapper.py

myargs <- commandArgs(trailingOnly=TRUE)
bamfile <- myargs[1]
species <- myargs[2]

print("loading packages (ATACseqQC, ggplot, etc)...")
suppressPackageStartupMessages(library(ggplot2, quietly=TRUE))
suppressPackageStartupMessages(library(Rsamtools, quietly=TRUE))
suppressPackageStartupMessages(library(ATACseqQC, quietly=TRUE))
suppressPackageStartupMessages(library(ChIPpeakAnno, quietly=TRUE))
suppressPackageStartupMessages(library("GenomicAlignments", quietly=TRUE))

if (species == "mm") {
  suppressPackageStartupMessages(library(TxDb.Mmusculus.UCSC.mm10.knownGene, quietly=TRUE))
  suppressPackageStartupMessages(library(BSgenome.Mmusculus.UCSC.mm10, quietly=TRUE))
  txdb <- TxDb.Mmusculus.UCSC.mm10.knownGene
  bsgenome <- BSgenome.Mmusculus.UCSC.mm10
  genome <- Mmusculus
  print("species is 'mm' using mm10 for analysis")
  ### Note: Everything below is deprecated until I can figure out a way to port a 
  ### static/local package with snakemake
  # Note: phastCons60way was manually curated from GenomicAlignments, built, and installed as an R package
  # score was obtained according to: https://support.bioconductor.org/p/96226/
  # package was built and installed according to: https://www.bioconductor.org/packages/devel/bioc/vignettes/GenomicScores/inst/doc/GenomicScores.html
  # (section 5.1: Building an annotation package from a GScores object)
  #suppressWarnings(suppressPackageStartupMessages(library(GenomicScores, lib.loc="/users/dia6sx/snakeATAC/scripts/", quietly=TRUE)))
  #suppressWarnings(suppressPackageStartupMessages(library(phastCons60way.UCSC.mm10, lib.loc="/users/dia6sx/snakeATAC/scripts/", quietly=TRUE)))
} else if (species == "hs") {
  suppressPackageStartupMessages(library(TxDb.Hsapiens.UCSC.hg38.knownGene, quietly=TRUE))
  suppressPackageStartupMessages(library(BSgenome.Hsapiens.UCSC.hg38, quietly=TRUE))
  txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
  bsgenome <- BSgenome.Hsapiens.UCSC.hg38
  genome <- Hsapiens
  print("species is 'hs' using hg38 for analysis")
} else {
  print(paste("params ERROR: ATACseqQC is not configured to use species =", species))
  print("exiting...")
  quit(status=1)
}

doATACseqQC <- function(bamfile, txdb, bsgenome, genome) {
    # Fragment size distribution
    print(paste("generating output for ",strsplit(basename(bamfile),split='\\.')[[1]][1],"...",sep=""))
    print("calculating Fragment size distribution...")
    bamfile.labels <- gsub(".bam", "", basename(bamfile))
    loc_to_save_figures <- paste(dirname(dirname(bamfile)),"/qc/ATACseqQC",sep="")
    if (file.exists(loc_to_save_figures)) {
        print("Warning: old figures will be overwritten")
    } else {
        dir.create(loc_to_save_figures)
    }
    png_file <- paste(loc_to_save_figures,"/",bamfile.labels,"_fragment_size_distribution.png",sep="")
    png(png_file)
    fragSizeDist(bamfile, bamfile.labels)
    dev.off()

    # Adjust the read start sites
    print("adjusting read start sites...")
    ## bamfile tags to be read in
    possibleTag <- list("integer"=c("AM", "AS", "CM", "CP", "FI", "H0", "H1", "H2", 
                                    "HI", "IH", "MQ", "NH", "NM", "OP", "PQ", "SM",
                                    "TC", "UQ"), 
                    "character"=c("BC", "BQ", "BZ", "CB", "CC", "CO", "CQ", "CR",
                                "CS", "CT", "CY", "E2", "FS", "LB", "MC", "MD",
                                "MI", "OA", "OC", "OQ", "OX", "PG", "PT", "PU",
                                "Q2", "QT", "QX", "R2", "RG", "RX", "SA", "TS",
                                "U2"))
    bamTop100 <- scanBam(BamFile(bamfile, yieldSize = 100),
                     param = ScanBamParam(tag=unlist(possibleTag)))[[1]]$tag
    tags <- names(bamTop100)[lengths(bamTop100)>0]
    ## files will be output into outPath
    ## shift the coordinates of 5'ends of alignments in the bam file
    outPath <- paste(dirname(dirname(bamfile)),"/alignments_shifted", sep="")
    seqinformation <- seqinfo(txdb)
    gal <- readBamFile(bamfile, tag=tags, asMates=TRUE, bigFile=TRUE)
    shiftedBamfile <- file.path(outPath, paste(bamfile.labels,"_shifted.bam",sep=""))
    # check if shifted Bam file exists from previous run
    if (file.exists(shiftedBamfile)) {
        print("Shifted Bamfile found.")
        print("Loading in...")
        gal <- readBamFile(shiftedBamfile, tag=tags, asMates=TRUE, bigFile=TRUE)
        ## This step is mostly for formating so splitBam can
        ## take in bamfile. Implementing shift of 0 bp on positive strand
        ## and 0 bp on negative strand because shifted Bamfile should
        ## already have these shifts
        gal1 <- shiftGAlignmentsList(gal, positive = 0L, negative = 0L)
    } else {
        # shifted bam file does not exist check if
        # old shifted alignments directory exists
        # if so remove and create new one
        if (file.exists(outPath)){
            unlink(outPath,recursive=TRUE)
        }
        dir.create(outPath)
        print("*** creating shifted bam file ***")
        gal1 <- shiftGAlignmentsList(gal, outbam=shiftedBamfile)
    }

    # Promoter/Transcript body (PT) score
    print("calculating Promoter/Transcript body (PT) score...")
    txs <- transcripts(txdb)
    pt <- PTscore(gal1, txs)
    png_file <- paste(loc_to_save_figures,"/",bamfile.labels,"_ptscore.png",sep="")
    png(png_file)
    plot(pt$log2meanCoverage, pt$PT_score, 
        xlab="log2 mean coverage",
        ylab="Promoter vs Transcript",
        main=paste(bamfile.labels,"PT score"))
    dev.off()

    # Nucleosome Free Regions (NFR) score
    print("calculating Nucleosome Free Regions (NFR) score")
    nfr <- NFRscore(gal1, txs)
    png_file <- paste(loc_to_save_figures,"/",bamfile.labels,"_nfrscore.png",sep="")
    png(png_file)
    plot(nfr$log2meanCoverage, nfr$NFR_score, 
        xlab="log2 mean coverage",
        ylab="Nucleosome Free Regions score",
        main=paste(bamfile.labels,"\n","NFRscore for 200bp flanking TSSs",sep=""),
        xlim=c(-10, 0), ylim=c(-5, 5))
    dev.off()

    # Transcription Start Site (TSS) Enrichment Score
    print("calculating Transcription Start Site (TSS) Enrichment score")
    tsse <- TSSEscore(gal1, txs)
    png_file <- paste(loc_to_save_figures,"/",bamfile.labels,"_tss_enrichment_score.png",sep="")
    png(png_file)
    plot(100*(-9:10-.5), tsse$values, type="b", 
        xlab="distance to TSS",
        ylab="aggregate TSS score",
        main=paste(bamfile.labels,"\n","TSS Enrichment score",sep=""))
    dev.off()

    # Split reads, Heatmap and coverage curve for nucleosome positions
    print("splitting reads by fragment length...")
    genome <- genome
    outPath <- paste(dirname(dirname(bamfile)),"/alignments_split", sep="")
    TSS <- promoters(txs, upstream=0, downstream=1)
    TSS <- unique(TSS)
    ## estimate the library size for normalization
    librarySize <- estLibSize(bamfile)
    ## calculate the signals around TSSs.
    NTILE <- 101
    dws <- ups <- 1010
    splitBamfiles <- paste(outPath,"/",c("NucleosomeFree", 
                                             "mononucleosome",
                                             "dinucleosome",
                                             "trinucleosome"),".bam",sep="")
    # check if split Bam files exists from previous run
    if (all(file.exists(splitBamfiles))) {
        print("*** split bam files found! ***")
        print("Loading in...")
        sigs <- enrichedFragments(bamfiles=splitBamfiles,
                                    index=splitBamfiles, 
                                    TSS=TSS,
                                    librarySize=librarySize,
                                    TSS.filter=0.5,
                                    n.tile = NTILE,
                                    upstream = ups,
                                    downstream = dws)
    } else {
        # split bam files do not exist check if
        # old split alignments directory exists
        # if so remove and create new one
        if (file.exists(outPath)){
            unlink(outPath,recursive=TRUE)
        }
        print("*** creating split bam files ***")
        dir.create(outPath)
        ## split the reads into NucleosomeFree, mononucleosome, 
        ## dinucleosome and trinucleosome.
        ## and save the binned alignments into bam files.
        objs <- splitGAlignmentsByCut(gal1, txs=txs, genome=genome, outPath = outPath)
        #objs <- splitBam(bamfile, tags=tags, outPath=outPath,
            #        txs=txs, genome=genome,
            #       conservation=phastCons60way.UCSC.mm10,
            #      seqlev=paste0("chr", c(1:19, "X", "Y")))
        sigs <- enrichedFragments(gal=objs[c("NucleosomeFree", 
                                        "mononucleosome",
                                        "dinucleosome",
                                        "trinucleosome")], 
                                    TSS=TSS,
                                    librarySize=librarySize,
                                    TSS.filter=0.5,
                                    n.tile = NTILE,
                                    upstream = ups,
                                    downstream = dws)
    }
    ## log2 transformed signals
    sigs.log2 <- lapply(sigs, function(.ele) log2(.ele+1))
    ## plot heatmap
    png_file <- paste(loc_to_save_figures,"/",bamfile.labels,"_nucleosome_pos_heatmap.png",sep="")
    png(png_file)
    featureAlignedHeatmap(sigs.log2, reCenterPeaks(TSS, width=ups+dws),
                        zeroAt=.5, n.tile=NTILE)
    dev.off()
    ## get signals normalized for nucleosome-free and nucleosome-bound regions.
    out <- featureAlignedDistribution(sigs, 
                                    reCenterPeaks(TSS, width=ups+dws),
                                    zeroAt=.5, n.tile=NTILE, type="l", 
                                    ylab="Averaged coverage")
    ## rescale the nucleosome-free and nucleosome signals to 0~1
    range01 <- function(x){(x-min(x))/(max(x)-min(x))}
    out <- apply(out, 2, range01)
    png_file <- paste(loc_to_save_figures,"/",bamfile.labels,"_TSS_signal_distribution.png",sep="")
    png(png_file)
    matplot(out, type="l", xaxt="n", 
            xlab="Position (bp)", 
            ylab="Fraction of signal",
            main=paste(bamfile.labels,"\n","TSS signal distribution",sep=""))
    axis(1, at=seq(0, 100, by=10)+1, 
        labels=c("-1K", seq(-800, 800, by=200), "1K"), las=2)
    abline(v=seq(0, 100, by=10)+1, lty=2, col="gray")
    dev.off()

    print("QC Finished.")
    print("Generated QC figures can be found in qc folder under ATACseQC")
    print(paste("*** removing temp files in",outPath,"***"))
    unlink(outPath,recursive=TRUE)
    outPath <- paste(dirname(dirname(bamfile)),"/alignments_shifted", sep="")
    print(paste("*** removing temp files in",outPath,"***"))
    unlink(outPath,recursive=TRUE)
}

doATACseqQC(bamfile, txdb, bsgenome, genome)

R Snakemake ggplot2 BSgenome.Hsapiens.UCSC.hg38 BSgenome.Mmusculus.UCSC.mm10 TxDb.Mmusculus.UCSC.mm10.knownGene TxDb.Hsapiens.UCSC.hg38.knownGene GenomicAlignments Rsamtools ATACseqQC GenomicScores From line 23 of scripts/doATACseqQC.R

read_counts=$(gunzip -c ${1} | wc -l)

# Sort and merge input peaks. Then intersect with list of tags
# Bedtools docs for -u: Write original A entry (tag) once if any overlaps found in B (peaks). In other words, just report the fact at least one overlap was found in B
reads_in_peaks=$(bedtools sort -i ${2} | bedtools merge -i - | bedtools intersect -u -a ${1} -b - | wc -l)

echo ${reads_in_peaks} > ${3}
echo ${read_counts} >> ${3}

echo $(bc -l <<< "${reads_in_peaks}/${read_counts}") >> ${3}

Shell BEDTools From line 7 of scripts/frip.sh

mapped_reads=$(samtools view -c -F 260 ${1})
reads_factor=$(bc -l <<< "1/${mapped_reads}")
rpm_factor=$(bc -l <<< "${reads_factor} * ${2}")

bedtools genomecov -i ${3} -g ${4} -bg -scale ${rpm_factor} | LC_COLLATE=C sort -k1,1 -k2,2n > ${5}

bedGraphToBigWig ${5} ${4} ${6}

Shell BEDTools bedGraphToBigWig From line 10 of scripts/generate_bigwig.sh

bedtools bamtobed -i ${1} | \
awk 'BEGIN {OFS = "\t"} ; {if ($6 == "+") print $1, $2 + 4, $2 + 5, $4, $5, $6; else print $1, $3 - 5, $3 - 4, $4, $5, $6}' | \
bedtools slop  -i - -g ${2} -b ${3} | \
bedtools intersect -a - -b ${4} -v | \
pigz > ${5}

Shell BEDTools From line 9 of scripts/shift_reads.sh

shell:
	"""
	Rscript ./scripts/doATACseqQC.R {input} {params.species} > {log}
	"""

SnakeMake From line 12 of snakefiles/atacseqqc.sf

shell:
    """
    bowtie2 {params.bowtie} -p {threads} -x {config[idx_bt2]} -1 {input[0]} -2 {input[1]} -S {output} 2> {log}
    """

SnakeMake Bowtie 2 From line 25 of snakefiles/maxatac.sf

shell:
    """
    samtools view -@ {threads} -b -u -q 30 {input} | \
    samtools sort -@ {threads} -n -o {output} -
    """

SnakeMake SAMtools From line 46 of snakefiles/maxatac.sf

shell:
    """
    # Add mate information to reads for deduplication
    samtools fixmate -@ {threads} -r -m {input} - 2> {log} | \
    samtools sort -@ {threads} -o {output.fixmate_bam} - 2> {log}

    samtools index -@ {threads} {output.fixmate_bam} 2> {log}
    """

SnakeMake SAMtools From line 69 of snakefiles/maxatac.sf

shell:
    """
    # use markdup to remove PCR duplicates
    samtools markdup -@ {threads} -r -s {input} - 2> {log} | \
    samtools sort -@ {threads} -o {output.dedup_bam} - 2> {log}

    samtools index -@ {threads} {output.dedup_bam} 2> {log}

    # remove unmapped reads with -F 4
    samtools view -@ {threads} -b -f 3 -F 4 {output.dedup_bam} {params} | \
    samtools sort -@ {threads} -o {output.final_bam} - 2> {log}

    samtools index -@ {threads} {output.final_bam} 2> {log}

    """

SnakeMake SAMtools From line 99 of snakefiles/maxatac.sf

shell:
    """        
    sh ./scripts/shift_reads.sh {input} {params.chrom_sizes} {params.slop} {params.blacklist} {output}
    """

SnakeMake From line 135 of snakefiles/maxatac.sf

shell:
    """
    sh ./scripts/generate_bigwig.sh {input.bam} {params.millions_factor} {input.tn5_bed} {params.chrom_sizes} {output.tn5_bedgraph} {output.tn5_bigwig}
    """

SnakeMake From line 163 of snakefiles/maxatac.sf

shell:
    """
    zcat {input.tn5_bed} | grep -w "+" | pigz > {output.tn5_bed_pos}
    zcat {input.tn5_bed} | grep -w "-" | pigz > {output.tn5_bed_neg}
    """

SnakeMake From line 183 of snakefiles/maxatac.sf

shell:
    """
    sh ./scripts/generate_bigwig.sh {input.bam} {params.millions_factor} {input.tn5_bed_pos} {params.chrom_sizes} {output.tn5_bedgraph_pos} {output.tn5_bigwig_pos}
    sh ./scripts/generate_bigwig.sh {input.bam} {params.millions_factor} {input.tn5_bed_neg} {params.chrom_sizes} {output.tn5_bedgraph_neg} {output.tn5_bigwig_neg}
    """

SnakeMake From line 214 of snakefiles/maxatac.sf

shell:
    """
    macs2 callpeak -t {input} --name {params.NAME} -g {params.species} --outdir {params.PEAK_DIR} --nomodel --shift -{params.shift_size} --extsize {params.ext_size} --keep-dup=all -q 0.{params.qvalue}
    """

SnakeMake macs2 From line 243 of snakefiles/maxatac.sf

shell:
    """
    samtools flagstat {input} > {output} 2> {log}
    """

SnakeMake SAMtools From line 266 of snakefiles/maxatac.sf

shell:
    """
    samtools flagstat {input} > {output} 2> {log}
    """

SnakeMake SAMtools From line 289 of snakefiles/maxatac.sf

shell:
    """
    sh ./scripts/frip.sh {input[0]} {input[1]} {output}
    """

SnakeMake From line 312 of snakefiles/maxatac.sf

shell:
    """
    bamPEFragmentSize -b {input} -p {threads} --binSize {params.bin_size} --blackListFileName {params.blacklist} --outRawFragmentLengths {output}
    """

SnakeMake From line 337 of snakefiles/maxatac.sf

shell:
    """
    samtools idxstats {input} | cut -f 1,3 > {output}
    """

SnakeMake SAMtools From line 358 of snakefiles/maxatac.sf

shell:
    """
    maxatac average -i {input} --prefix {params.prefix} --output {params.outdir}
    """

SnakeMake From line 380 of snakefiles/maxatac.sf

shell:
    """
    maxatac normalize --signal {input} --prefix {params.prefix} --output {params.outdir}
    """

SnakeMake From line 402 of snakefiles/maxatac.sf

shell:
    """
    maxatac predict -tf {params.TF_model} -s {input} --prefix {params.prefix} --output {params.outdir}
    """

SnakeMake From line 425 of snakefiles/maxatac.sf

wrapper:
    "0.77.0/bio/sra-tools/fasterq-dump"

SnakeMake From line 21 of snakefiles/process_fastq.sf

shell:
    """
    cat {input.R1} > {output.R1_OUT}
    cat {input.R2} > {output.R2_OUT}
    """

SnakeMake From line 38 of snakefiles/process_fastq.sf

wrapper:
    "0.77.0/bio/fastqc"

SnakeMake FastQC From line 61 of snakefiles/process_fastq.sf

shell:
    """
    trim_galore -q 30 -paired -j 4 -o {params.TRIM_DIR} {input.fq1} {input.fq2} 2> {log}
    """

SnakeMake Trim_Galore From line 86 of snakefiles/process_fastq.sf

wrapper:
    "0.77.0/bio/fastqc"

SnakeMake FastQC From line 108 of snakefiles/process_fastq.sf

shell:
    """
    # Select Parameter Description

    # runThreadN: Number of threads to use
    # readFilesIn: Fastq files
    # outFileNamePrefix: Prefix of the outputDIR and the name to use
    # outSAMtype: Output the file as a BAM file that is sorted
    # outSAMunmapped: Output unmapped reads in the SAM file with special flag
    # outSAMattributes: Standard SAM attributes
    # alignIntronMax: Allow only 1 max intron. This is specific to ATAC-seq
    # STAR was designed for RNA transcripts so we want to ignore some parameters
    # alignMatesGapMax: Allow a maximum of 2000 bp gap.
    # alignEndsType: This aligns the full read and considers the whole read in score calculation.
    # outMultimapperOrder: Output multimapped reads in random order
    # outFilterMultimapNmax: Max 1000 multimapped reads
    # outSAMmultNmax: 1
    # outFilterMismatchNoverLmax: .1 mismatch
    # outFilterMatchNmin: 20
    # alignSJDBoverhangMin: 999
    # alignEndsProtrude: 10 ConcordantPair 
    # outFilterScoreMinOverLread: 0.66
    # outFilterMatchNminOverLread: 0.66
    # readFilesCommand: zcat

    echo "Align Files with STAR: In Progress"

    STAR --genomeDir {params.STAR_INDEX} \
    --runThreadN {params.THREADS} \
    --readFilesIn {input[0]} {input[1]} \
    --outFileNamePrefix {params.Prefix} \
    --outSAMtype BAM SortedByCoordinate \
    --outSAMattributes Standard \
    --alignIntronMax 1 \
    --alignMatesGapMax 2000 \
    --alignEndsType EndToEnd \
    --outMultimapperOrder Random \
    --outFilterMultimapNmax 999 \
    --outSAMmultNmax 1 \
    --outFilterMismatchNoverLmax 0.1 \
    --outFilterMatchNmin 20 \
    --alignSJDBoverhangMin 999 \
    --alignEndsProtrude 10 ConcordantPair  \
    --outFilterScoreMinOverLread 0.66 \
    --outFilterMatchNminOverLread 0.66 \
    --readFilesCommand zcat
    """

SnakeMake STAR From line 25 of snakefiles/tobias.sf

shell:
"""
samtools merge -@ {threads} {output} {input}

samtools index -@ 4 {output}
"""

SnakeMake SAMtools From line 84 of snakefiles/tobias.sf

shell:
    """
    macs2 callpeak -t {input} --name {params.NAME} -g {params.species} --outdir {params.PEAK_DIR} --nomodel --shift -100 --extsize 200 --broad
    """

SnakeMake macs2 From line 108 of snakefiles/tobias.sf

shell:
    "TOBIAS ATACorrect --bam ${file} --genome Homo_sapiens.GRCh38.fa --peaks ${peaks} --blacklist hg38_maxatac_blacklist_V2.bed --outdir /TOBIAS/${base_filename} --cores 4"