KAS-Seq Analysis Workflow

public 1yr ago Version: v0.2 0 bookmarks

View Workflow

Help improve this workflow!

This workflow has been published but could be further improved with some additional meta data:

Keyword(s) in categories input, output, operation, topic

You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .

Workflow to perform KAS-seq analysis

This is the template for a new Snakemake workflow. Replace this text with a comprehensive description, covering the purpose and domain. Insert your code into the respective folders, i.e. scripts ,

Code Snippets

wrapper:
    "0.72.0/bio/bwa/mem"

SnakeMake From line 22 of rules/align.smk

wrapper:
    "0.72.0/bio/samtools/index"

SnakeMake From line 35 of rules/align.smk

wrapper:
    "0.72.0/bio/picard/markduplicates"

SnakeMake From line 52 of rules/align.smk

wrapper:
    "0.72.0/bio/bwa/index"

SnakeMake From line 71 of rules/align.smk

shell:
    """
    bamCoverage \
            -b {input.bam:q} \
            -o {output:q} \
            --normalizeUsing RPGC \
            --effectiveGenomeSize {params.effectiveGenomeSize} \
            > {log:q} 2>&1
    """

SnakeMake DeepTools From line 14 of rules/coverage_plots.smk

shell:
    """
    computeMatrix \
        scale-regions \
        --regionsFileName {input.regions:q} \
        --upstream 3000 \
        --downstream 3000 \
        --regionBodyLength 6000 \
        --scoreFileName {input.score:q} \
        --outFileName {output.matrix_gz:q} \
        --outFileNameMatrix {output.matrix_tab:q} \
        --outFileSortedRegions {output.matrix_bed:q} \
        --numberOfProcessors {threads} \
        {params.extra} \
        > {log:q} 2>&1
    """

SnakeMake DeepTools From line 49 of rules/coverage_plots.smk

wrapper:
    "0.72.0/bio/deeptools/plotheatmap"

SnakeMake From line 86 of rules/coverage_plots.smk

wrapper:
    "0.72.0/bio/deeptools/plotprofile"

SnakeMake From line 109 of rules/coverage_plots.smk

wrapper:
    "0.72.0/bio/macs2/callpeak"

SnakeMake From line 36 of rules/macs.smk

shell:
    """
    echo -e 'chr\tstart\tend\tname\tscore\tstrand\tfold_enrichment\t-log10(pvalue)\t-log10(qvalue)' > {output:q}
    cat {input:q} >> {output:q}
    """

SnakeMake From line 49 of rules/macs.smk

wrapper:
    "0.72.0/bio/samtools/faidx"

SnakeMake From line 10 of rules/peak_analysis.smk

script:
    "../scripts/save_txdb.R"

SnakeMake From line 24 of rules/peak_analysis.smk

script:
    "../scripts/peak_annotation.R"

SnakeMake From line 78 of rules/peak_analysis.smk

script:
    "../scripts/peak_profile.R"

SnakeMake From line 109 of rules/peak_analysis.smk

script:
    "../scripts/peak_overlap_enrichment.R"

SnakeMake From line 134 of rules/peak_analysis.smk

wrapper:
    "0.72.0/bio/fastqc"

SnakeMake FastQC From line 12 of rules/qc.smk

wrapper:
    "0.72.0/bio/samtools/stats"

SnakeMake From line 25 of rules/qc.smk

shell:
    """
    plotCoverage \
            --bamfiles {input.bamfiles:q} \
            --plotFile {output.plotfile:q} \
            --outRawCounts {output.coverage:q} \
            --numberOfProcessors {threads} \
            --smartLabels \
            --plotTitle "Read Coverage (deduplicated)" \
            {params} \
            > {log:q} 2>&1 
    """

SnakeMake From line 47 of rules/qc.smk

shell:
    """
    multiBamSummary \
            bins \
            --bamfiles {input.bamfiles:q} \
            --outFileName {output.multibamsummary:q} \
            --numberOfProcessors {threads} \
            --smartLabels \
            {params} \
            > {log:q} 2>&1
    """

SnakeMake From line 78 of rules/qc.smk

shell:
    """
    plotCorrelation \
            --corData {input.cordata:q} \
            --corMethod spearman \
            --whatToPlot heatmap \
            --plotFile {output.plotfile:q} \
            --plotTitle "Alignment correlation (Spearman)" \
            --outFileCorMatrix {output.cormatrix:q} \
            {params} \
            > {log:q} 2>&1
    """

SnakeMake From line 103 of rules/qc.smk

shell:
    """
    bamPEFragmentSize \
            --bamfiles {input.bamfiles:q} \
            --histogram {output.histogram:q} \
            --plotTitle "Paired End Fragment Size (deduplicated)" \
            --numberOfProcessors {threads} \
            --table {output.table:q} \
            --outRawFragmentLengths {output.rawfragmentlengths:q} \
            {params} \
            > {log:q} 2>&1
    """

SnakeMake From line 136 of rules/qc.smk

wrapper:
    "0.72.0/bio/deeptools/plotfingerprint"

SnakeMake From line 203 of rules/qc.smk

wrapper:
    "0.72.0/bio/multiqc"

SnakeMake MultiQC From line 241 of rules/qc.smk

shell:
    """
    ln -sfr {input.fq1:q} {output.fastq1:q}
    ln -sfr {input.fq2:q} {output.fastq2:q}
    """

SnakeMake From line 14 of rules/trim.smk

shell:
    """
    ln -s {input.fq1:q} {output.fastq:q} >> {log:q} 2>&1
    """

SnakeMake From line 30 of rules/trim.smk

wrapper:
    "0.72.0/bio/cutadapt/pe"

SnakeMake From line 54 of rules/trim.smk

wrapper:
    "0.72.0/bio/cutadapt/se"

SnakeMake From line 71 of rules/trim.smk

library(BiocManager, quietly = TRUE)
library(ChIPseeker, quietly = TRUE)
library(org.Hs.eg.db, quietly = TRUE)
library(cowplot, quietly = TRUE)
library(readr, quietly = TRUE)
library(argparser, quietly = TRUE)

p <- arg_parser("KAS-Seq Peak Annotation and Comparison")
p <- add_argument(p, "--peak_files",
                  help = "Peak files to annotate and compare",
                  nargs = Inf)
p <- add_argument(p, "--txdb_file",
                  help = "File to load txdb using AnnotationDbi::loadDb()")
p <- add_argument(p, "--annotation_file",
                  help = paste0("GFF3 or GTF file of gene annotations used to ",
                                "build txdb"))
p <- add_argument(p, "--txdb",
                  help = "Name of txdb package to install from Bioconductor")

# Add an optional arguments
p <- add_argument(p, "--names", help = "Sample names for each peak file",
                  nargs = Inf)
p <- add_argument(p, "--output_dir",
                  help = "Directory for output files",
                  default = "peak_annotation")
p <- add_argument(p, "--annotation_distribution_plot",
                  help = "Peak annotation distribution barplot filename",
                  default = "annotationDistributionPlot.pdf")
p <- add_argument(p, "--peak_annotation_list_rdata",
                  help = "Peak annotation list Rdata file",
                  default = "peak_anno_list.Rdata")
p <- add_argument(p, "--venn_diagram",
                  help = paste0("Venn digagram of annotated genes per sample ",
                                "pdf filename"),
                  default = "annotationVennDiagram.pdf")

# Parse arguments (interactive, snakemake, or command line)
if (exists("snakemake")) {
  # Arguments via Snakemake
  argv <- parse_args(p, c(
    "--peak_files", snakemake@input[["peak_files"]],
    "--txdb_file", snakemake@input[["txdb_file"]],
    "--names", snakemake@params[["names"]],
    "--output_dir", snakemake@params[["output_dir"]],
    "--annotation_distribution_plot",
    snakemake@output[["annotation_distribution_plot"]],
    "--venn_diagram", snakemake@output[["venn_diagram"]],
    "--peak_annotation_list_rdata",
    snakemake@output[["peak_annotation_list_rdata"]]
  ))
} else if (interactive()) {
  # Arguments supplied inline (for debug/testing when running interactively)
  print("Running interactively...")
  peak_files <- c("results_2020-12-03/macs2/D701-lane1_peaks.broadPeak",
                  "results_2020-12-03/macs2/D702-lane1_peaks.broadPeak",
                  "results_2020-12-03/macs2/D703-lane1_peaks.broadPeak",
                  "results_2020-12-03/macs2/D704-lane1_peaks.broadPeak",
                  "results_2020-12-03/macs2/D705-lane1_peaks.broadPeak")
  names <- c("D701", "D702", "D703", "D704", "D705")
  annotation_file <- "genomes/hg38/annotation/Homo_sapiens.GRCh38.101.gtf"
  txdb_file <- "txdb.db"
  argv <- parse_args(p, c("--peak_files", peak_files,
                          "--names", names,
                          "--txdb_file", txdb_file))
  print(argv)
} else {
  # Arguments from command line
  argv <- parse_args(p)
  print(argv)
}

# Set names
if (!anyNA(argv$names)) {
  peak_file_names <- argv$names
} else {
  peak_file_names <- sapply(argv$peak_files, basename)
}
names(argv$peak_files) <- peak_file_names

# Output directory
if (!dir.exists(argv$output_dir)) {
  dir.create(argv$output_dir, recursive = TRUE)
}

# Get txdb object
if (!is.na(argv$txdb)) {
  # Load (install if needed) txdb from bioconductor
  library(pacman, quietly = TRUE)
  pacman::p_load(argv$txdb, character.only = TRUE)
  txdb <- eval(parse(text = argv$txdb))
} else if (!is.na(argv$txdb_file)) {
  # Load txdb
  library(AnnotationDbi, quietly = TRUE)
  txdb <- AnnotationDbi::loadDb(argv$txdb_file)
} else if (!is.na(argv$annotation_file)) {
  # Create txdb object from supplied annotation file
  library(GenomicFeatures, quietly = TRUE)
  txdb <- GenomicFeatures::makeTxDbFromGFF(argv$annotation_file)
} else {
  stop("Must specify one of --txdb, --txdb_file, or --annotation_file")
}


# Peak Annotation
# TODO Provide config parameter for annoDb
peak_anno_list <- lapply(argv$peak_files, annotatePeak, TxDb = txdb,
                       tssRegion = c(-3000, 3000), annoDb = "org.Hs.eg.db",
                       verbose = FALSE)
lapply(names(peak_anno_list), function(name) {
  filebase <- file.path(argv$output_dir, basename(argv$peak_files[[name]]))
  write_tsv(as.data.frame(peak_anno_list[[name]]),
            file = paste0(filebase, ".annotated.tsv.gz"))
  sink(file = paste0(filebase, ".annotated.summary.txt"))
  print(peak_anno_list[[name]])
  sink()
})
saveRDS(peak_anno_list,
        file = argv$peak_annotation_list_rdata)

# Peak annotation distribution plot
peak_anno_dist_plot <- plotAnnoBar(peak_anno_list)
save_plot(
  filename = argv$annotation_distribution_plot,
  plot = peak_anno_dist_plot,
  base_height = 14,
  base_width = 14
)

# nolint start
# Functional profiles comparison
# NOTE: Not all data return enrichment
# genes = lapply(peak_anno_list, function(i) as.data.frame(i)$geneId)
# names(genes) = sub("_", "\n", names(genes))
# compKEGG <- compareCluster(geneCluster   = genes,
#                            fun           = "enrichKEGG",
#                            pvalueCutoff  = 0.05,
#                            pAdjustMethod = "BH")
# dotplot(compKEGG, showCategory = 15, title = "KEGG Pathway Enrichment Analysis")
# nolint end

# Venn Diagram of gene annotated per sample
# Note: vennplot does not return a ggplot object
pdf(
  file = argv$venn_diagram,
  height = 14,
  width = 14
)
genes <- lapply(peak_anno_list, function(i) as.data.frame(i)$geneId)
vennplot(genes)
dev.off()

R Snakemake ggplot2 org.Hs.eg.db readr cowplot BiocManager AnnotationDbi argparser ChIPseeker PACMAN From line 3 of scripts/peak_annotation.R

library(BiocManager, quietly = TRUE)
library(ChIPseeker, quietly = TRUE)
library(readr, quietly = TRUE)
library(argparser, quietly = TRUE)

p <- arg_parser("KAS-Seq Peak Overlap Enrichment Analysis")
p <- add_argument(p, "--query_peak_file",
                  help = "Query peak files to compare targets against")
p <- add_argument(p, "--target_peak_files",
                  help = "Target peak files to compare against query peaks",
                  nargs = Inf)
p <- add_argument(p, "--txdb_file",
                  help = "File to load txdb using AnnotationDbi::loadDb()")
p <- add_argument(p, "--annotation_file",
                  help = paste0("GFF3 or GTF file of gene annotations used ",
                                "to build txdb"))
p <- add_argument(p, "--txdb",
                  help = "Name of txdb package to install from Bioconductor")

# Add an optional arguments
p <- add_argument(p, "--nshuffle", help = "Number of shuffle iterations",
                  default = 1000)
p <- add_argument(p, "--p_adjust_method", help = "pvalue adjustment method",
                  default = "BH")
p <- add_argument(p, "--output_tsv_file",
                  help = "Path and filename of output tsv file",
                  default = "peak_overlap_enrichment.tsv")
p <- add_argument(p, "--cores",
                  help = "number of cores (threads) to use",
                  default = NA)
p <- add_argument(p, "--chromlist",
                  help = paste0("List of chromosomes to include, others will ",
                                "be filtered out, new peaks stored in ",
                                "[peak_file].filtered"),
                  nargs = Inf, default = NA)

# Parse arguments (interactive, snakemake, or command line)
if (exists("snakemake")) {
  # Arguments via Snakemake
  argv <- parse_args(p, c(
    "--query_peak_file", snakemake@input[["query_peak_file"]],
    "--target_peak_files", snakemake@input[["target_peak_files"]],
    "--txdb_file", snakemake@input[["txdb_file"]],
    "--nshuffle", snakemake@params[["nshuffle"]],
    "--p_adjust_method", snakemake@params[["p_adjust_method"]],
    "--output_tsv_file", snakemake@output[["output_tsv_file"]],
    "--cores", snakemake@threads,
    "--chromlist", snakemake@params[["chromlist"]]
  ))
} else if (interactive()) {
  # Arguments supplied inline (for debug/testing when running interactively)
  print("Running interactively...")
  query_peak_file <- "results_2020-12-03/macs2/D701-lane1_peaks.broadPeak"
  target_peak_files <- c("results_2020-12-03/macs2/D702-lane1_peaks.broadPeak",
                         "results_2020-12-03/macs2/D703-lane1_peaks.broadPeak",
                         "results_2020-12-03/macs2/D704-lane1_peaks.broadPeak",
                         "results_2020-12-03/macs2/D705-lane1_peaks.broadPeak")
  nshuffle <- 50
  txdb_file <- "txdb.db"
  chromlist <- paste(c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11",
                       "12", "13", "14", "15", "16", "17", "18", "19", "20",
                       "21", "22", "23"), sep = ", ")
  argv <- parse_args(p, c("--query_peak_file", query_peak_file,
                          "--target_peak_files", target_peak_files,
                          "--nshuffle", nshuffle,
                          "--txdb_file", txdb_file,
                          "--chromlist", chromlist))
  print(argv)
} else {
  # Arguments from command line
  argv <- parse_args(p)
  print(argv)
}

# Get txdb object
if (!is.na(argv$txdb)) {
  # Load (install if needed) txdb from bioconductor
  library(pacman, quietly = TRUE)
  pacman::p_load(argv$txdb, character.only = TRUE)
  txdb <- eval(parse(text = argv$txdb))
} else if (!is.na(argv$txdb_file)) {
  # Load txdb
  library(AnnotationDbi, quietly = TRUE)
  txdb <- AnnotationDbi::loadDb(argv$txdb_file)
} else if (!is.na(argv$annotation_file)) {
  # Create txdb object from supplied annotation file
  library(GenomicFeatures, quietly = TRUE)
  txdb <- GenomicFeatures::makeTxDbFromGFF(argv$annotation_file)
} else {
  stop("Must specify one of --txdb, --txdb_file, or --annotation_file")
}



# Number of cores
if (is.na(argv$cores)) {
  cores <- detectCores() - 1
} else {
  cores <- argv$cores
}

# Peak Overlap Enrichment
# Running with a single core can cause:
# Error in sample.int(length(x), size, replace, prob) :
#    cannot take a sample larger than the population when 'replace = FALSE'
# Running with multiple cores throws only a warning:
# Warning message:
# In mclapply(seq_along(idx), function(j) { :
#    all scheduled cores encountered errors in user code
# This results in incorrect output, thus we throw and error when we encounter
# that warning message
withCallingHandlers(
  peak_overlap_enrichment <- enrichPeakOverlap(
    queryPeak     = argv$query_peak_file,
    targetPeak    = unlist(argv$target_peak_files),
    TxDb          = txdb,
    pAdjustMethod = argv$p_adjust_method,
    nShuffle      = argv$nshuffle,
    chainFile     = NULL,
    verbose       = TRUE,
    mc.cores      = cores,
  ),
  warning = function(w) {
    if (grepl("encountered errors in user code", w$message)) {
      stop(paste0("Errors encounted executing enrichPeakOverlap: ", w$message))
    } else {
      message(w$message)
    }
  }
)

write_tsv(peak_overlap_enrichment, argv$output_tsv_file)

R Snakemake readr BiocManager AnnotationDbi argparser ChIPseeker PACMAN From line 3 of scripts/peak_overlap_enrichment.R

library(BiocManager, quietly = TRUE)
library(ChIPseeker, quietly = TRUE)
library(cowplot, quietly = TRUE)
library(readr, quietly = TRUE)
library(argparser, quietly = TRUE)

p <- arg_parser("Profile of peaks binding to TSS regions")
p <- add_argument(p, "--peak_files",
                  help = "Peak files to annotate and compare",
                  nargs = Inf)
p <- add_argument(p, "--txdb_file",
                  help = "File to load txdb using AnnotationDbi::loadDb()")
p <- add_argument(p, "--annotation_file",
                  help = paste0("GFF3 or GTF file of gene annotations used ",
                                "to build txdb"))
p <- add_argument(p, "--txdb",
                  help = "Name of txdb package to install from Bioconductor")

# Add an optional arguments
p <- add_argument(p, "--names", help = "Sample names for each peak file",
                  nargs = Inf)
p <- add_argument(p, "--tag_profile_plot",
                  help = "Average peak profile plot filename",
                  default = "avgProfilePlot.pdf")
p <- add_argument(p, "--tag_heatmap_plot",
                  help = "Tag heatmap plot PDF filename",
                  default = "tagHeatmapPlot.pdf")
p <- add_argument(p, "--tag_matrix_list",
                  help = paste0("List of outputs from getTagMatrix ",
                                "CHiPseeker function"),
                  default = "tag_matrix_list.Rdata")

# Parse arguments (interactive, snakemake, or command line)
if (exists("snakemake")) {
  # Arguments via Snakemake
  argv <- parse_args(p, c(
    "--peak_files", snakemake@input[["peak_files"]],
    "--txdb_file", snakemake@input[["txdb_file"]],
    "--names", snakemake@params[["names"]],
    "--tag_profile_plot", snakemake@output[["tag_profile_plot"]],
    "--tag_heatmap_plot", snakemake@output[["tag_heatmap_plot"]],
    "--tag_matrix_list", snakemake@output[["tag_matrix_list"]]
  ))
} else if (interactive()) {
  # Arguments supplied inline (for debug/testing when running interactively)
  print("Running interactively...")
  input_file <- c("results_2020-12-03/macs2/D701-lane1_peaks.broadPeak",
                  "results_2020-12-03/macs2/D702-lane1_peaks.broadPeak",
                  "results_2020-12-03/macs2/D703-lane1_peaks.broadPeak",
                  "results_2020-12-03/macs2/D704-lane1_peaks.broadPeak",
                  "results_2020-12-03/macs2/D705-lane1_peaks.broadPeak")
  names <- c("D701", "D702", "D703", "D704", "D705")
  annotation_file <- "genomes/hg38/annotation/Homo_sapiens.GRCh38.101.gtf"
  txdb_file <- "txdb.db"
  argv <- parse_args(p, c("--peak_files", input_file,
                          "--names", names,
                          "--txdb_file", txdb_file))
  print(argv)
} else {
  # Arguments from command line
  argv <- parse_args(p)
  print(argv)
}

# Set names
if (!anyNA(argv$names)) {
  peak_file_names <- argv$names
} else {
  peak_file_names <- sapply(argv$peak_files, basename)
}
names(argv$peak_files) <- peak_file_names

# Get txdb object
if (!is.na(argv$txdb)) {
  # Load (install if needed) txdb from bioconductor
  library(pacman, quietly = TRUE)
  pacman::p_load(argv$txdb, character.only = TRUE)
  txdb <- eval(parse(text = argv$txdb))
} else if (!is.na(argv$txdb_file)) {
  # Load txdb
  library(AnnotationDbi, quietly = TRUE)
  txdb <- AnnotationDbi::loadDb(argv$txdb_file)
} else if (!is.na(argv$annotation_file)) {
  # Create txdb object from supplied annotation file
  library(GenomicFeatures, quietly = TRUE)
  txdb <- GenomicFeatures::makeTxDbFromGFF(argv$annotation_file)
} else {
  stop("Must specify one of --txdb, --txdb_file, or --annotation_file")
}


# Profile of peaks binding to TSS regions
promoter <- getPromoters(TxDb = txdb, upstream = 3000, downstream = 3000)
tag_matrix_list <- lapply(argv$peak_files, getTagMatrix, windows = promoter)
saveRDS(tag_matrix_list, file = argv$tag_matrix_list)

# Average tag profile
tag_profile_plot <- plotAvgProf(tag_matrix_list, xlim = c(-3000, 3000))
save_plot(
  filename = argv$tag_profile_plot,
  plot = tag_profile_plot,
  base_height = 14,
  base_width = 14
)

# Tag heatmap
# Note: tagHeatMap does not return a ggplot object, we must save it differently
pdf(
  file = argv$tag_heatmap_plot,
  height = 14,
  width = 14
)
tagHeatmap(tag_matrix_list, xlim = c(-3000, 3000), color = NULL)
dev.off()

R Snakemake ggplot2 readr cowplot BiocManager AnnotationDbi argparser ChIPseeker PACMAN From line 3 of scripts/peak_profile.R

library(BiocManager, quietly = TRUE)
library(AnnotationDbi, quietly = TRUE)
library(readr, quietly = TRUE)
library(dplyr, quietly = TRUE)
library(argparser, quietly = TRUE)

p <- arg_parser("Get a TXDB object and save for use in other scripts")
p <- add_argument(p, "txdb_input",
                  help = paste0("Name of txdb package to install from ",
                                "Bioconductor or annotation file (GFF3 ",
                                "or GTF) to use to build a TXDB object"))
p <- add_argument(p, "--chrom_info",
                  help = paste0("Chromosome info file with columns: ",
                  "chrom, length, is_circular (optional)"))
p <- add_argument(p, "--output",
                  help = "File to save txdb object in",
                  default = "txdb.db")

# Parse arguments (interactive, snakemake, or command line)
if (exists("snakemake")) {
  # Arguments via Snakemake
  argv <- parse_args(p, c(
    snakemake@input[["txdb_input"]],
    "--chrom_info", snakemake@input[["chrom_info"]],
    "--output", snakemake@output[[1]]
  ))
} else if (interactive()) {
  # Arguments supplied inline (for debug/testing when running interactively)
  print("Running interactively...")
  txdb_input <- "genomes/hg38/annotation/Homo_sapiens.GRCh38.101.gtf"
  chrom_info <- paste0("genomes/hg38/genome/fasta/",
                       "Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.fai")
  argv <- parse_args(p, c(txdb_input,
                          "--chrom_info", chrom_info))
  print(argv)
} else {
  # Arguments from command line
  argv <- parse_args(p)
  print(argv)
}

# Build txdb if txdb_input is file
if (file.exists(argv$txdb_input)) {
  # Create txdb object from supplied annotation file
  library(GenomicFeatures, quietly = TRUE)
  if (!is.na(argv$chrom_info)) {
    # Get chrom info
    is_circular_column <-
      count_fields(argv$chrom_info, tokenizer_tsv())[1] > 2
    if (is_circular_column) {
      col_names <- c("chrom", "length", "is_circular")
      col_spec <- cols_only(X1 = col_character(),
                            X2 = col_integer(),
                            X3 = col_guess())
    } else {
      col_names <- c("chrom", "length")
      col_spec <- cols(X1 = col_character(),
                       X2 = col_integer())
    }
    chrom_info <- read_tsv(argv$chrom_info,
                           col_names = FALSE,
                           col_types = col_spec)
    if (is.logical(chrom_info$X3)) {
      chrom_info <-
        chrom_info %>% select(c(
          chrom = X1,
          length = X2,
          is_circular = X3
        ))
    } else {
      chrom_info <- chrom_info %>% select(c(chrom = X1, length = X2))
    }
    txdb <- makeTxDbFromGFF(argv$txdb_input, chrominfo = chrom_info)
  } else {
    txdb <- makeTxDbFromGFF(argv$txdb_input)
  }
} else {
  library(pacman, quietly = TRUE)
  # Load (install if needed) txdb from bioconductor
  pacman::p_load(argv$txdb_input, character.only = TRUE)
  txdb <- eval(parse(text = argv$txdb_input))
}

# Save txdb
AnnotationDbi::saveDb(txdb, argv$output)

R Snakemake dplyr readr BiocManager AnnotationDbi argparser PACMAN From line 3 of scripts/save_txdb.R

__author__ = "Patrik Smeds"
__copyright__ = "Copyright 2016, Patrik Smeds"
__email__ = "patrik.smeds@gmail.com"
__license__ = "MIT"

from os import path

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=False, stderr=True)

# Check inputs/arguments.
if len(snakemake.input) == 0:
    raise ValueError("A reference genome has to be provided!")
elif len(snakemake.input) > 1:
    raise ValueError("Only one reference genome can be inputed!")

# Prefix that should be used for the database
prefix = snakemake.params.get("prefix", "")

if len(prefix) > 0:
    prefix = "-p " + prefix

# Contrunction algorithm that will be used to build the database, default is bwtsw
construction_algorithm = snakemake.params.get("algorithm", "")

if len(construction_algorithm) != 0:
    construction_algorithm = "-a " + construction_algorithm

shell(
    "bwa index" " {prefix}" " {construction_algorithm}" " {snakemake.input[0]}" " {log}"
)

Python Snakemake BWA From line 1 of index/wrapper.py

__author__ = "Johannes Köster, Julian de Ruiter"
__copyright__ = "Copyright 2016, Johannes Köster and Julian de Ruiter"
__email__ = "koester@jimmy.harvard.edu, julianderuiter@gmail.com"
__license__ = "MIT"


from os import path

from snakemake.shell import shell


# Extract arguments.
extra = snakemake.params.get("extra", "")

sort = snakemake.params.get("sort", "none")
sort_order = snakemake.params.get("sort_order", "coordinate")
sort_extra = snakemake.params.get("sort_extra", "")

log = snakemake.log_fmt_shell(stdout=False, stderr=True)

# Check inputs/arguments.
if not isinstance(snakemake.input.reads, str) and len(snakemake.input.reads) not in {
    1,
    2,
}:
    raise ValueError("input must have 1 (single-end) or " "2 (paired-end) elements")

if sort_order not in {"coordinate", "queryname"}:
    raise ValueError("Unexpected value for sort_order ({})".format(sort_order))

# Determine which pipe command to use for converting to bam or sorting.
if sort == "none":

    # Simply convert to bam using samtools view.
    pipe_cmd = "samtools view -Sbh -o {snakemake.output[0]} -"

elif sort == "samtools":

    # Sort alignments using samtools sort.
    pipe_cmd = "samtools sort {sort_extra} -o {snakemake.output[0]} -"

    # Add name flag if needed.
    if sort_order == "queryname":
        sort_extra += " -n"

    prefix = path.splitext(snakemake.output[0])[0]
    sort_extra += " -T " + prefix + ".tmp"

elif sort == "picard":

    # Sort alignments using picard SortSam.
    pipe_cmd = (
        "picard SortSam {sort_extra} INPUT=/dev/stdin"
        " OUTPUT={snakemake.output[0]} SORT_ORDER={sort_order}"
    )

else:
    raise ValueError("Unexpected value for params.sort ({})".format(sort))

shell(
    "(bwa mem"
    " -t {snakemake.threads}"
    " {extra}"
    " {snakemake.params.index}"
    " {snakemake.input.reads}"
    " | " + pipe_cmd + ") {log}"
)

Python Snakemake SAMtools Picard From line 1 of mem/wrapper.py

__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "julianderuiter@gmail.com"
__license__ = "MIT"


from snakemake.shell import shell


n = len(snakemake.input)
assert n == 2, "Input must contain 2 (paired-end) elements."

extra = snakemake.params.get("extra", "")
adapters = snakemake.params.get("adapters", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)

assert (
    extra != "" or adapters != ""
), "No options provided to cutadapt. Please use 'params: adapters=' or 'params: extra='."

shell(
    "cutadapt"
    " {snakemake.params.adapters}"
    " {snakemake.params.extra}"
    " -o {snakemake.output.fastq1}"
    " -p {snakemake.output.fastq2}"
    " -j {snakemake.threads}"
    " {snakemake.input}"
    " > {snakemake.output.qc} {log}"
)

Python Snakemake Cutadapt From line 3 of pe/wrapper.py

__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "julianderuiter@gmail.com"
__license__ = "MIT"


from snakemake.shell import shell


n = len(snakemake.input)
assert n == 1, "Input must contain 1 (single-end) element."

extra = snakemake.params.get("extra", "")
adapters = snakemake.params.get("adapters", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)

assert (
    extra != "" or adapters != ""
), "No options provided to cutadapt. Please use 'params: adapters=' or 'params: extra='."

shell(
    "cutadapt"
    " {snakemake.params.adapters}"
    " {snakemake.params.extra}"
    " -j {snakemake.threads}"
    " -o {snakemake.output.fastq}"
    " {snakemake.input[0]}"
    " > {snakemake.output.qc} {log}"
)

Python Snakemake Cutadapt From line 3 of se/wrapper.py

__author__ = "Antonie Vietor"
__copyright__ = "Copyright 2020, Antonie Vietor"
__email__ = "antonie.v@gmx.de"
__license__ = "MIT"

from snakemake.shell import shell
import re

log = snakemake.log_fmt_shell(stdout=True, stderr=True)

jsd_sample = snakemake.input.get("jsd_sample")
out_counts = snakemake.output.get("counts")
out_metrics = snakemake.output.get("qc_metrics")
optional_output = ""
jsd = ""

if jsd_sample:
    jsd += " --JSDsample {jsd} ".format(jsd=jsd_sample)

if out_counts:
    optional_output += " --outRawCounts {out_counts} ".format(out_counts=out_counts)

if out_metrics:
    optional_output += " --outQualityMetrics {metrics} ".format(metrics=out_metrics)

shell(
    "(plotFingerprint "
    "-b {snakemake.input.bam_files} "
    "-o {snakemake.output.fingerprint} "
    "{optional_output} "
    "--numberOfProcessors {snakemake.threads} "
    "{jsd} "
    "{snakemake.params}) {log}"
)
# ToDo: remove the 'NA' string replacement when fixed in deepTools, see:
# https://github.com/deeptools/deepTools/pull/999
regex_passes = 2

with open(out_metrics, "rt") as f:
    metrics = f.read()
    for i in range(regex_passes):
        metrics = re.sub("\tNA(\t|\n)", "\tnan\\1", metrics)

with open(out_metrics, "wt") as f:
    f.write(metrics)

Python Snakemake From line 1 of plotfingerprint/wrapper.py

__author__ = "Antonie Vietor"
__copyright__ = "Copyright 2020, Antonie Vietor"
__email__ = "antonie.v@gmx.de"
__license__ = "MIT"

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=True, stderr=True)

out_region = snakemake.output.get("regions")
out_matrix = snakemake.output.get("heatmap_matrix")

optional_output = ""

if out_region:
    optional_output += " --outFileSortedRegions {out_region} ".format(
        out_region=out_region
    )

if out_matrix:
    optional_output += " --outFileNameMatrix {out_matrix} ".format(
        out_matrix=out_matrix
    )

shell(
    "(plotHeatmap "
    "-m {snakemake.input[0]} "
    "-o {snakemake.output.heatmap_img} "
    "{optional_output} "
    "{snakemake.params}) {log}"
)

Python Snakemake From line 1 of plotheatmap/wrapper.py

__author__ = "Antonie Vietor"
__copyright__ = "Copyright 2020, Antonie Vietor"
__email__ = "antonie.v@gmx.de"
__license__ = "MIT"

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=True, stderr=True)

out_region = snakemake.output.get("regions")
out_data = snakemake.output.get("data")

optional_output = ""

if out_region:
    optional_output += " --outFileSortedRegions {out_region} ".format(
        out_region=out_region
    )

if out_data:
    optional_output += " --outFileNameData {out_data} ".format(out_data=out_data)

shell(
    "(plotProfile "
    "-m {snakemake.input[0]} "
    "-o {snakemake.output.plot_img} "
    "{optional_output} "
    "{snakemake.params}) {log}"
)

Python Snakemake From line 1 of plotprofile/wrapper.py

__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "julianderuiter@gmail.com"
__license__ = "MIT"


from os import path
from tempfile import TemporaryDirectory

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=False, stderr=True)


def basename_without_ext(file_path):
    """Returns basename of file path, without the file extension."""

    base = path.basename(file_path)

    split_ind = 2 if base.endswith(".fastq.gz") else 1
    base = ".".join(base.split(".")[:-split_ind])

    return base


# Run fastqc, since there can be race conditions if multiple jobs
# use the same fastqc dir, we create a temp dir.
with TemporaryDirectory() as tempdir:
    shell(
        "fastqc {snakemake.params} --quiet -t {snakemake.threads} "
        "--outdir {tempdir:q} {snakemake.input[0]:q}"
        " {log:q}"
    )

    # Move outputs into proper position.
    output_base = basename_without_ext(snakemake.input[0])
    html_path = path.join(tempdir, output_base + "_fastqc.html")
    zip_path = path.join(tempdir, output_base + "_fastqc.zip")

    if snakemake.output.html != html_path:
        shell("mv {html_path:q} {snakemake.output.html:q}")

    if snakemake.output.zip != zip_path:
        shell("mv {zip_path:q} {snakemake.output.zip:q}")

Python Snakemake FastQC From line 3 of fastqc/wrapper.py

__author__ = "Antonie Vietor"
__copyright__ = "Copyright 2020, Antonie Vietor"
__email__ = "antonie.v@gmx.de"
__license__ = "MIT"

import os
import sys
from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=True, stderr=True)

in_contr = snakemake.input.get("control")
params = "{}".format(snakemake.params)
opt_input = ""
out_dir = ""

ext = "_peaks.xls"
out_file = [o for o in snakemake.output if o.endswith(ext)][0]
out_name = os.path.basename(out_file[: -len(ext)])
out_dir = os.path.dirname(out_file)

if in_contr:
    opt_input = "-c {contr}".format(contr=in_contr)

if out_dir:
    out_dir = "--outdir {dir}".format(dir=out_dir)

if any(out.endswith(("_peaks.narrowPeak", "_summits.bed")) for out in snakemake.output):
    if any(
        out.endswith(("_peaks.broadPeak", "_peaks.gappedPeak"))
        for out in snakemake.output
    ):
        sys.exit(
            "Output files with _peaks.narrowPeak and/or _summits.bed extensions cannot be created together with _peaks.broadPeak and/or _peaks.gappedPeak extended output files.\n"
            "For usable extensions please see https://snakemake-wrappers.readthedocs.io/en/stable/wrappers/macs2/callpeak.html.\n"
        )
    else:
        if " --broad" in params:
            sys.exit(
                "If --broad option in params is given, the _peaks.narrowPeak and _summits.bed files will not be created. \n"
                "Remove --broad option from params if these files are needed.\n"
            )

if any(
    out.endswith(("_peaks.broadPeak", "_peaks.gappedPeak")) for out in snakemake.output
):
    if "--broad " not in params and not params.endswith("--broad"):
        params += " --broad "

if any(
    out.endswith(("_treat_pileup.bdg", "_control_lambda.bdg"))
    for out in snakemake.output
):
    if all(p not in params for p in ["--bdg", "-B"]):
        params += " --bdg "
else:
    if any(p in params for p in ["--bdg", "-B"]):
        sys.exit(
            "If --bdg or -B option in params is given, the _control_lambda.bdg and _treat_pileup.bdg extended files must be specified in output. \n"
        )

shell(
    "(macs2 callpeak "
    "-t {snakemake.input.treatment} "
    "{opt_input} "
    "{out_dir} "
    "-n {out_name} "
    "{params}) {log}"
)

Python Snakemake From line 1 of callpeak/wrapper.py

__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "julianderuiter@gmail.com"
__license__ = "MIT"


from os import path

from snakemake.shell import shell


input_dirs = set(path.dirname(fp) for fp in snakemake.input)
output_dir = path.dirname(snakemake.output[0])
output_name = path.basename(snakemake.output[0])
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell(
    "multiqc"
    " {snakemake.params}"
    " --force"
    " -o {output_dir}"
    " -n {output_name}"
    " {input_dirs}"
    " {log}"
)

Python Snakemake MultiQC From line 3 of multiqc/wrapper.py

__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"


from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts

log = snakemake.log_fmt_shell(stdout=True, stderr=True)

extra = snakemake.params
java_opts = get_java_opts(snakemake)

shell(
    "picard MarkDuplicates "  # Tool and its subcommand
    "{java_opts} "  # Automatic java option
    "{extra} "  # User defined parmeters
    "INPUT={snakemake.input} "  # Input file
    "OUTPUT={snakemake.output.bam} "  # Output bam
    "METRICS_FILE={snakemake.output.metrics} "  # Output metrics
    "{log}"  # Logging
)

Python Snakemake Picard snakemake-wrapper-utils From line 1 of markduplicates/wrapper.py

__author__ = "Michael Chambers"
__copyright__ = "Copyright 2019, Michael Chambers"
__email__ = "greenkidneybean@gmail.com"
__license__ = "MIT"


from snakemake.shell import shell


shell("samtools faidx {snakemake.params} {snakemake.input[0]} > {snakemake.output[0]}")

Python Snakemake SAMtools From line 1 of faidx/wrapper.py

__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"


from snakemake.shell import shell


shell("samtools index {snakemake.params} {snakemake.input[0]} {snakemake.output[0]}")

Python Snakemake SAMtools From line 1 of index/wrapper.py

__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "julianderuiter@gmail.com"
__license__ = "MIT"


from snakemake.shell import shell


extra = snakemake.params.get("extra", "")
region = snakemake.params.get("region", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)


shell("samtools stats {extra} {snakemake.input} {region} > {snakemake.output} {log}")