Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation, topic
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
Workflow to perform KAS-seq analysis
This is the template for a new Snakemake workflow. Replace this text with a comprehensive description, covering the purpose and domain.
Insert your code into the respective folders, i.e.
scripts
,
Code Snippets
22 23 | wrapper: "0.72.0/bio/bwa/mem" |
35 36 | wrapper: "0.72.0/bio/samtools/index" |
52 53 | wrapper: "0.72.0/bio/picard/markduplicates" |
71 72 | wrapper: "0.72.0/bio/bwa/index" |
14 15 16 17 18 19 20 21 22 | shell: """ bamCoverage \ -b {input.bam:q} \ -o {output:q} \ --normalizeUsing RPGC \ --effectiveGenomeSize {params.effectiveGenomeSize} \ > {log:q} 2>&1 """ |
49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 | shell: """ computeMatrix \ scale-regions \ --regionsFileName {input.regions:q} \ --upstream 3000 \ --downstream 3000 \ --regionBodyLength 6000 \ --scoreFileName {input.score:q} \ --outFileName {output.matrix_gz:q} \ --outFileNameMatrix {output.matrix_tab:q} \ --outFileSortedRegions {output.matrix_bed:q} \ --numberOfProcessors {threads} \ {params.extra} \ > {log:q} 2>&1 """ |
86 87 | wrapper: "0.72.0/bio/deeptools/plotheatmap" |
109 110 | wrapper: "0.72.0/bio/deeptools/plotprofile" |
36 37 | wrapper: "0.72.0/bio/macs2/callpeak" |
49 50 51 52 53 | shell: """ echo -e 'chr\tstart\tend\tname\tscore\tstrand\tfold_enrichment\t-log10(pvalue)\t-log10(qvalue)' > {output:q} cat {input:q} >> {output:q} """ |
10 11 | wrapper: "0.72.0/bio/samtools/faidx" |
24 25 | script: "../scripts/save_txdb.R" |
78 79 | script: "../scripts/peak_annotation.R" |
109 110 | script: "../scripts/peak_profile.R" |
134 135 | script: "../scripts/peak_overlap_enrichment.R" |
12 13 | wrapper: "0.72.0/bio/fastqc" |
25 26 | wrapper: "0.72.0/bio/samtools/stats" |
47 48 49 50 51 52 53 54 55 56 57 58 | shell: """ plotCoverage \ --bamfiles {input.bamfiles:q} \ --plotFile {output.plotfile:q} \ --outRawCounts {output.coverage:q} \ --numberOfProcessors {threads} \ --smartLabels \ --plotTitle "Read Coverage (deduplicated)" \ {params} \ > {log:q} 2>&1 """ |
78 79 80 81 82 83 84 85 86 87 88 | shell: """ multiBamSummary \ bins \ --bamfiles {input.bamfiles:q} \ --outFileName {output.multibamsummary:q} \ --numberOfProcessors {threads} \ --smartLabels \ {params} \ > {log:q} 2>&1 """ |
103 104 105 106 107 108 109 110 111 112 113 114 | shell: """ plotCorrelation \ --corData {input.cordata:q} \ --corMethod spearman \ --whatToPlot heatmap \ --plotFile {output.plotfile:q} \ --plotTitle "Alignment correlation (Spearman)" \ --outFileCorMatrix {output.cormatrix:q} \ {params} \ > {log:q} 2>&1 """ |
136 137 138 139 140 141 142 143 144 145 146 147 | shell: """ bamPEFragmentSize \ --bamfiles {input.bamfiles:q} \ --histogram {output.histogram:q} \ --plotTitle "Paired End Fragment Size (deduplicated)" \ --numberOfProcessors {threads} \ --table {output.table:q} \ --outRawFragmentLengths {output.rawfragmentlengths:q} \ {params} \ > {log:q} 2>&1 """ |
203 204 | wrapper: "0.72.0/bio/deeptools/plotfingerprint" |
241 242 | wrapper: "0.72.0/bio/multiqc" |
14 15 16 17 18 | shell: """ ln -sfr {input.fq1:q} {output.fastq1:q} ln -sfr {input.fq2:q} {output.fastq2:q} """ |
30 31 32 33 | shell: """ ln -s {input.fq1:q} {output.fastq:q} >> {log:q} 2>&1 """ |
54 55 | wrapper: "0.72.0/bio/cutadapt/pe" |
71 72 | wrapper: "0.72.0/bio/cutadapt/se" |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 | library(BiocManager, quietly = TRUE) library(ChIPseeker, quietly = TRUE) library(org.Hs.eg.db, quietly = TRUE) library(cowplot, quietly = TRUE) library(readr, quietly = TRUE) library(argparser, quietly = TRUE) p <- arg_parser("KAS-Seq Peak Annotation and Comparison") p <- add_argument(p, "--peak_files", help = "Peak files to annotate and compare", nargs = Inf) p <- add_argument(p, "--txdb_file", help = "File to load txdb using AnnotationDbi::loadDb()") p <- add_argument(p, "--annotation_file", help = paste0("GFF3 or GTF file of gene annotations used to ", "build txdb")) p <- add_argument(p, "--txdb", help = "Name of txdb package to install from Bioconductor") # Add an optional arguments p <- add_argument(p, "--names", help = "Sample names for each peak file", nargs = Inf) p <- add_argument(p, "--output_dir", help = "Directory for output files", default = "peak_annotation") p <- add_argument(p, "--annotation_distribution_plot", help = "Peak annotation distribution barplot filename", default = "annotationDistributionPlot.pdf") p <- add_argument(p, "--peak_annotation_list_rdata", help = "Peak annotation list Rdata file", default = "peak_anno_list.Rdata") p <- add_argument(p, "--venn_diagram", help = paste0("Venn digagram of annotated genes per sample ", "pdf filename"), default = "annotationVennDiagram.pdf") # Parse arguments (interactive, snakemake, or command line) if (exists("snakemake")) { # Arguments via Snakemake argv <- parse_args(p, c( "--peak_files", snakemake@input[["peak_files"]], "--txdb_file", snakemake@input[["txdb_file"]], "--names", snakemake@params[["names"]], "--output_dir", snakemake@params[["output_dir"]], "--annotation_distribution_plot", snakemake@output[["annotation_distribution_plot"]], "--venn_diagram", snakemake@output[["venn_diagram"]], "--peak_annotation_list_rdata", snakemake@output[["peak_annotation_list_rdata"]] )) } else if (interactive()) { # Arguments supplied inline (for debug/testing when running interactively) print("Running interactively...") peak_files <- c("results_2020-12-03/macs2/D701-lane1_peaks.broadPeak", "results_2020-12-03/macs2/D702-lane1_peaks.broadPeak", "results_2020-12-03/macs2/D703-lane1_peaks.broadPeak", "results_2020-12-03/macs2/D704-lane1_peaks.broadPeak", "results_2020-12-03/macs2/D705-lane1_peaks.broadPeak") names <- c("D701", "D702", "D703", "D704", "D705") annotation_file <- "genomes/hg38/annotation/Homo_sapiens.GRCh38.101.gtf" txdb_file <- "txdb.db" argv <- parse_args(p, c("--peak_files", peak_files, "--names", names, "--txdb_file", txdb_file)) print(argv) } else { # Arguments from command line argv <- parse_args(p) print(argv) } # Set names if (!anyNA(argv$names)) { peak_file_names <- argv$names } else { peak_file_names <- sapply(argv$peak_files, basename) } names(argv$peak_files) <- peak_file_names # Output directory if (!dir.exists(argv$output_dir)) { dir.create(argv$output_dir, recursive = TRUE) } # Get txdb object if (!is.na(argv$txdb)) { # Load (install if needed) txdb from bioconductor library(pacman, quietly = TRUE) pacman::p_load(argv$txdb, character.only = TRUE) txdb <- eval(parse(text = argv$txdb)) } else if (!is.na(argv$txdb_file)) { # Load txdb library(AnnotationDbi, quietly = TRUE) txdb <- AnnotationDbi::loadDb(argv$txdb_file) } else if (!is.na(argv$annotation_file)) { # Create txdb object from supplied annotation file library(GenomicFeatures, quietly = TRUE) txdb <- GenomicFeatures::makeTxDbFromGFF(argv$annotation_file) } else { stop("Must specify one of --txdb, --txdb_file, or --annotation_file") } # Peak Annotation # TODO Provide config parameter for annoDb peak_anno_list <- lapply(argv$peak_files, annotatePeak, TxDb = txdb, tssRegion = c(-3000, 3000), annoDb = "org.Hs.eg.db", verbose = FALSE) lapply(names(peak_anno_list), function(name) { filebase <- file.path(argv$output_dir, basename(argv$peak_files[[name]])) write_tsv(as.data.frame(peak_anno_list[[name]]), file = paste0(filebase, ".annotated.tsv.gz")) sink(file = paste0(filebase, ".annotated.summary.txt")) print(peak_anno_list[[name]]) sink() }) saveRDS(peak_anno_list, file = argv$peak_annotation_list_rdata) # Peak annotation distribution plot peak_anno_dist_plot <- plotAnnoBar(peak_anno_list) save_plot( filename = argv$annotation_distribution_plot, plot = peak_anno_dist_plot, base_height = 14, base_width = 14 ) # nolint start # Functional profiles comparison # NOTE: Not all data return enrichment # genes = lapply(peak_anno_list, function(i) as.data.frame(i)$geneId) # names(genes) = sub("_", "\n", names(genes)) # compKEGG <- compareCluster(geneCluster = genes, # fun = "enrichKEGG", # pvalueCutoff = 0.05, # pAdjustMethod = "BH") # dotplot(compKEGG, showCategory = 15, title = "KEGG Pathway Enrichment Analysis") # nolint end # Venn Diagram of gene annotated per sample # Note: vennplot does not return a ggplot object pdf( file = argv$venn_diagram, height = 14, width = 14 ) genes <- lapply(peak_anno_list, function(i) as.data.frame(i)$geneId) vennplot(genes) dev.off() |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 | library(BiocManager, quietly = TRUE) library(ChIPseeker, quietly = TRUE) library(readr, quietly = TRUE) library(argparser, quietly = TRUE) p <- arg_parser("KAS-Seq Peak Overlap Enrichment Analysis") p <- add_argument(p, "--query_peak_file", help = "Query peak files to compare targets against") p <- add_argument(p, "--target_peak_files", help = "Target peak files to compare against query peaks", nargs = Inf) p <- add_argument(p, "--txdb_file", help = "File to load txdb using AnnotationDbi::loadDb()") p <- add_argument(p, "--annotation_file", help = paste0("GFF3 or GTF file of gene annotations used ", "to build txdb")) p <- add_argument(p, "--txdb", help = "Name of txdb package to install from Bioconductor") # Add an optional arguments p <- add_argument(p, "--nshuffle", help = "Number of shuffle iterations", default = 1000) p <- add_argument(p, "--p_adjust_method", help = "pvalue adjustment method", default = "BH") p <- add_argument(p, "--output_tsv_file", help = "Path and filename of output tsv file", default = "peak_overlap_enrichment.tsv") p <- add_argument(p, "--cores", help = "number of cores (threads) to use", default = NA) p <- add_argument(p, "--chromlist", help = paste0("List of chromosomes to include, others will ", "be filtered out, new peaks stored in ", "[peak_file].filtered"), nargs = Inf, default = NA) # Parse arguments (interactive, snakemake, or command line) if (exists("snakemake")) { # Arguments via Snakemake argv <- parse_args(p, c( "--query_peak_file", snakemake@input[["query_peak_file"]], "--target_peak_files", snakemake@input[["target_peak_files"]], "--txdb_file", snakemake@input[["txdb_file"]], "--nshuffle", snakemake@params[["nshuffle"]], "--p_adjust_method", snakemake@params[["p_adjust_method"]], "--output_tsv_file", snakemake@output[["output_tsv_file"]], "--cores", snakemake@threads, "--chromlist", snakemake@params[["chromlist"]] )) } else if (interactive()) { # Arguments supplied inline (for debug/testing when running interactively) print("Running interactively...") query_peak_file <- "results_2020-12-03/macs2/D701-lane1_peaks.broadPeak" target_peak_files <- c("results_2020-12-03/macs2/D702-lane1_peaks.broadPeak", "results_2020-12-03/macs2/D703-lane1_peaks.broadPeak", "results_2020-12-03/macs2/D704-lane1_peaks.broadPeak", "results_2020-12-03/macs2/D705-lane1_peaks.broadPeak") nshuffle <- 50 txdb_file <- "txdb.db" chromlist <- paste(c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "23"), sep = ", ") argv <- parse_args(p, c("--query_peak_file", query_peak_file, "--target_peak_files", target_peak_files, "--nshuffle", nshuffle, "--txdb_file", txdb_file, "--chromlist", chromlist)) print(argv) } else { # Arguments from command line argv <- parse_args(p) print(argv) } # Get txdb object if (!is.na(argv$txdb)) { # Load (install if needed) txdb from bioconductor library(pacman, quietly = TRUE) pacman::p_load(argv$txdb, character.only = TRUE) txdb <- eval(parse(text = argv$txdb)) } else if (!is.na(argv$txdb_file)) { # Load txdb library(AnnotationDbi, quietly = TRUE) txdb <- AnnotationDbi::loadDb(argv$txdb_file) } else if (!is.na(argv$annotation_file)) { # Create txdb object from supplied annotation file library(GenomicFeatures, quietly = TRUE) txdb <- GenomicFeatures::makeTxDbFromGFF(argv$annotation_file) } else { stop("Must specify one of --txdb, --txdb_file, or --annotation_file") } # Number of cores if (is.na(argv$cores)) { cores <- detectCores() - 1 } else { cores <- argv$cores } # Peak Overlap Enrichment # Running with a single core can cause: # Error in sample.int(length(x), size, replace, prob) : # cannot take a sample larger than the population when 'replace = FALSE' # Running with multiple cores throws only a warning: # Warning message: # In mclapply(seq_along(idx), function(j) { : # all scheduled cores encountered errors in user code # This results in incorrect output, thus we throw and error when we encounter # that warning message withCallingHandlers( peak_overlap_enrichment <- enrichPeakOverlap( queryPeak = argv$query_peak_file, targetPeak = unlist(argv$target_peak_files), TxDb = txdb, pAdjustMethod = argv$p_adjust_method, nShuffle = argv$nshuffle, chainFile = NULL, verbose = TRUE, mc.cores = cores, ), warning = function(w) { if (grepl("encountered errors in user code", w$message)) { stop(paste0("Errors encounted executing enrichPeakOverlap: ", w$message)) } else { message(w$message) } } ) write_tsv(peak_overlap_enrichment, argv$output_tsv_file) |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 | library(BiocManager, quietly = TRUE) library(ChIPseeker, quietly = TRUE) library(cowplot, quietly = TRUE) library(readr, quietly = TRUE) library(argparser, quietly = TRUE) p <- arg_parser("Profile of peaks binding to TSS regions") p <- add_argument(p, "--peak_files", help = "Peak files to annotate and compare", nargs = Inf) p <- add_argument(p, "--txdb_file", help = "File to load txdb using AnnotationDbi::loadDb()") p <- add_argument(p, "--annotation_file", help = paste0("GFF3 or GTF file of gene annotations used ", "to build txdb")) p <- add_argument(p, "--txdb", help = "Name of txdb package to install from Bioconductor") # Add an optional arguments p <- add_argument(p, "--names", help = "Sample names for each peak file", nargs = Inf) p <- add_argument(p, "--tag_profile_plot", help = "Average peak profile plot filename", default = "avgProfilePlot.pdf") p <- add_argument(p, "--tag_heatmap_plot", help = "Tag heatmap plot PDF filename", default = "tagHeatmapPlot.pdf") p <- add_argument(p, "--tag_matrix_list", help = paste0("List of outputs from getTagMatrix ", "CHiPseeker function"), default = "tag_matrix_list.Rdata") # Parse arguments (interactive, snakemake, or command line) if (exists("snakemake")) { # Arguments via Snakemake argv <- parse_args(p, c( "--peak_files", snakemake@input[["peak_files"]], "--txdb_file", snakemake@input[["txdb_file"]], "--names", snakemake@params[["names"]], "--tag_profile_plot", snakemake@output[["tag_profile_plot"]], "--tag_heatmap_plot", snakemake@output[["tag_heatmap_plot"]], "--tag_matrix_list", snakemake@output[["tag_matrix_list"]] )) } else if (interactive()) { # Arguments supplied inline (for debug/testing when running interactively) print("Running interactively...") input_file <- c("results_2020-12-03/macs2/D701-lane1_peaks.broadPeak", "results_2020-12-03/macs2/D702-lane1_peaks.broadPeak", "results_2020-12-03/macs2/D703-lane1_peaks.broadPeak", "results_2020-12-03/macs2/D704-lane1_peaks.broadPeak", "results_2020-12-03/macs2/D705-lane1_peaks.broadPeak") names <- c("D701", "D702", "D703", "D704", "D705") annotation_file <- "genomes/hg38/annotation/Homo_sapiens.GRCh38.101.gtf" txdb_file <- "txdb.db" argv <- parse_args(p, c("--peak_files", input_file, "--names", names, "--txdb_file", txdb_file)) print(argv) } else { # Arguments from command line argv <- parse_args(p) print(argv) } # Set names if (!anyNA(argv$names)) { peak_file_names <- argv$names } else { peak_file_names <- sapply(argv$peak_files, basename) } names(argv$peak_files) <- peak_file_names # Get txdb object if (!is.na(argv$txdb)) { # Load (install if needed) txdb from bioconductor library(pacman, quietly = TRUE) pacman::p_load(argv$txdb, character.only = TRUE) txdb <- eval(parse(text = argv$txdb)) } else if (!is.na(argv$txdb_file)) { # Load txdb library(AnnotationDbi, quietly = TRUE) txdb <- AnnotationDbi::loadDb(argv$txdb_file) } else if (!is.na(argv$annotation_file)) { # Create txdb object from supplied annotation file library(GenomicFeatures, quietly = TRUE) txdb <- GenomicFeatures::makeTxDbFromGFF(argv$annotation_file) } else { stop("Must specify one of --txdb, --txdb_file, or --annotation_file") } # Profile of peaks binding to TSS regions promoter <- getPromoters(TxDb = txdb, upstream = 3000, downstream = 3000) tag_matrix_list <- lapply(argv$peak_files, getTagMatrix, windows = promoter) saveRDS(tag_matrix_list, file = argv$tag_matrix_list) # Average tag profile tag_profile_plot <- plotAvgProf(tag_matrix_list, xlim = c(-3000, 3000)) save_plot( filename = argv$tag_profile_plot, plot = tag_profile_plot, base_height = 14, base_width = 14 ) # Tag heatmap # Note: tagHeatMap does not return a ggplot object, we must save it differently pdf( file = argv$tag_heatmap_plot, height = 14, width = 14 ) tagHeatmap(tag_matrix_list, xlim = c(-3000, 3000), color = NULL) dev.off() |
R
Snakemake
ggplot2
readr
cowplot
BiocManager
AnnotationDbi
argparser
ChIPseeker
PACMAN
From
line
3
of
scripts/peak_profile.R
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 | library(BiocManager, quietly = TRUE) library(AnnotationDbi, quietly = TRUE) library(readr, quietly = TRUE) library(dplyr, quietly = TRUE) library(argparser, quietly = TRUE) p <- arg_parser("Get a TXDB object and save for use in other scripts") p <- add_argument(p, "txdb_input", help = paste0("Name of txdb package to install from ", "Bioconductor or annotation file (GFF3 ", "or GTF) to use to build a TXDB object")) p <- add_argument(p, "--chrom_info", help = paste0("Chromosome info file with columns: ", "chrom, length, is_circular (optional)")) p <- add_argument(p, "--output", help = "File to save txdb object in", default = "txdb.db") # Parse arguments (interactive, snakemake, or command line) if (exists("snakemake")) { # Arguments via Snakemake argv <- parse_args(p, c( snakemake@input[["txdb_input"]], "--chrom_info", snakemake@input[["chrom_info"]], "--output", snakemake@output[[1]] )) } else if (interactive()) { # Arguments supplied inline (for debug/testing when running interactively) print("Running interactively...") txdb_input <- "genomes/hg38/annotation/Homo_sapiens.GRCh38.101.gtf" chrom_info <- paste0("genomes/hg38/genome/fasta/", "Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.fai") argv <- parse_args(p, c(txdb_input, "--chrom_info", chrom_info)) print(argv) } else { # Arguments from command line argv <- parse_args(p) print(argv) } # Build txdb if txdb_input is file if (file.exists(argv$txdb_input)) { # Create txdb object from supplied annotation file library(GenomicFeatures, quietly = TRUE) if (!is.na(argv$chrom_info)) { # Get chrom info is_circular_column <- count_fields(argv$chrom_info, tokenizer_tsv())[1] > 2 if (is_circular_column) { col_names <- c("chrom", "length", "is_circular") col_spec <- cols_only(X1 = col_character(), X2 = col_integer(), X3 = col_guess()) } else { col_names <- c("chrom", "length") col_spec <- cols(X1 = col_character(), X2 = col_integer()) } chrom_info <- read_tsv(argv$chrom_info, col_names = FALSE, col_types = col_spec) if (is.logical(chrom_info$X3)) { chrom_info <- chrom_info %>% select(c( chrom = X1, length = X2, is_circular = X3 )) } else { chrom_info <- chrom_info %>% select(c(chrom = X1, length = X2)) } txdb <- makeTxDbFromGFF(argv$txdb_input, chrominfo = chrom_info) } else { txdb <- makeTxDbFromGFF(argv$txdb_input) } } else { library(pacman, quietly = TRUE) # Load (install if needed) txdb from bioconductor pacman::p_load(argv$txdb_input, character.only = TRUE) txdb <- eval(parse(text = argv$txdb_input)) } # Save txdb AnnotationDbi::saveDb(txdb, argv$output) |
R
Snakemake
dplyr
readr
BiocManager
AnnotationDbi
argparser
PACMAN
From
line
3
of
scripts/save_txdb.R
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 | __author__ = "Patrik Smeds" __copyright__ = "Copyright 2016, Patrik Smeds" __email__ = "patrik.smeds@gmail.com" __license__ = "MIT" from os import path from snakemake.shell import shell log = snakemake.log_fmt_shell(stdout=False, stderr=True) # Check inputs/arguments. if len(snakemake.input) == 0: raise ValueError("A reference genome has to be provided!") elif len(snakemake.input) > 1: raise ValueError("Only one reference genome can be inputed!") # Prefix that should be used for the database prefix = snakemake.params.get("prefix", "") if len(prefix) > 0: prefix = "-p " + prefix # Contrunction algorithm that will be used to build the database, default is bwtsw construction_algorithm = snakemake.params.get("algorithm", "") if len(construction_algorithm) != 0: construction_algorithm = "-a " + construction_algorithm shell( "bwa index" " {prefix}" " {construction_algorithm}" " {snakemake.input[0]}" " {log}" ) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 | __author__ = "Johannes Köster, Julian de Ruiter" __copyright__ = "Copyright 2016, Johannes Köster and Julian de Ruiter" __email__ = "koester@jimmy.harvard.edu, julianderuiter@gmail.com" __license__ = "MIT" from os import path from snakemake.shell import shell # Extract arguments. extra = snakemake.params.get("extra", "") sort = snakemake.params.get("sort", "none") sort_order = snakemake.params.get("sort_order", "coordinate") sort_extra = snakemake.params.get("sort_extra", "") log = snakemake.log_fmt_shell(stdout=False, stderr=True) # Check inputs/arguments. if not isinstance(snakemake.input.reads, str) and len(snakemake.input.reads) not in { 1, 2, }: raise ValueError("input must have 1 (single-end) or " "2 (paired-end) elements") if sort_order not in {"coordinate", "queryname"}: raise ValueError("Unexpected value for sort_order ({})".format(sort_order)) # Determine which pipe command to use for converting to bam or sorting. if sort == "none": # Simply convert to bam using samtools view. pipe_cmd = "samtools view -Sbh -o {snakemake.output[0]} -" elif sort == "samtools": # Sort alignments using samtools sort. pipe_cmd = "samtools sort {sort_extra} -o {snakemake.output[0]} -" # Add name flag if needed. if sort_order == "queryname": sort_extra += " -n" prefix = path.splitext(snakemake.output[0])[0] sort_extra += " -T " + prefix + ".tmp" elif sort == "picard": # Sort alignments using picard SortSam. pipe_cmd = ( "picard SortSam {sort_extra} INPUT=/dev/stdin" " OUTPUT={snakemake.output[0]} SORT_ORDER={sort_order}" ) else: raise ValueError("Unexpected value for params.sort ({})".format(sort)) shell( "(bwa mem" " -t {snakemake.threads}" " {extra}" " {snakemake.params.index}" " {snakemake.input.reads}" " | " + pipe_cmd + ") {log}" ) |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 | __author__ = "Julian de Ruiter" __copyright__ = "Copyright 2017, Julian de Ruiter" __email__ = "julianderuiter@gmail.com" __license__ = "MIT" from snakemake.shell import shell n = len(snakemake.input) assert n == 2, "Input must contain 2 (paired-end) elements." extra = snakemake.params.get("extra", "") adapters = snakemake.params.get("adapters", "") log = snakemake.log_fmt_shell(stdout=False, stderr=True) assert ( extra != "" or adapters != "" ), "No options provided to cutadapt. Please use 'params: adapters=' or 'params: extra='." shell( "cutadapt" " {snakemake.params.adapters}" " {snakemake.params.extra}" " -o {snakemake.output.fastq1}" " -p {snakemake.output.fastq2}" " -j {snakemake.threads}" " {snakemake.input}" " > {snakemake.output.qc} {log}" ) |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 | __author__ = "Julian de Ruiter" __copyright__ = "Copyright 2017, Julian de Ruiter" __email__ = "julianderuiter@gmail.com" __license__ = "MIT" from snakemake.shell import shell n = len(snakemake.input) assert n == 1, "Input must contain 1 (single-end) element." extra = snakemake.params.get("extra", "") adapters = snakemake.params.get("adapters", "") log = snakemake.log_fmt_shell(stdout=False, stderr=True) assert ( extra != "" or adapters != "" ), "No options provided to cutadapt. Please use 'params: adapters=' or 'params: extra='." shell( "cutadapt" " {snakemake.params.adapters}" " {snakemake.params.extra}" " -j {snakemake.threads}" " -o {snakemake.output.fastq}" " {snakemake.input[0]}" " > {snakemake.output.qc} {log}" ) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 | __author__ = "Antonie Vietor" __copyright__ = "Copyright 2020, Antonie Vietor" __email__ = "antonie.v@gmx.de" __license__ = "MIT" from snakemake.shell import shell import re log = snakemake.log_fmt_shell(stdout=True, stderr=True) jsd_sample = snakemake.input.get("jsd_sample") out_counts = snakemake.output.get("counts") out_metrics = snakemake.output.get("qc_metrics") optional_output = "" jsd = "" if jsd_sample: jsd += " --JSDsample {jsd} ".format(jsd=jsd_sample) if out_counts: optional_output += " --outRawCounts {out_counts} ".format(out_counts=out_counts) if out_metrics: optional_output += " --outQualityMetrics {metrics} ".format(metrics=out_metrics) shell( "(plotFingerprint " "-b {snakemake.input.bam_files} " "-o {snakemake.output.fingerprint} " "{optional_output} " "--numberOfProcessors {snakemake.threads} " "{jsd} " "{snakemake.params}) {log}" ) # ToDo: remove the 'NA' string replacement when fixed in deepTools, see: # https://github.com/deeptools/deepTools/pull/999 regex_passes = 2 with open(out_metrics, "rt") as f: metrics = f.read() for i in range(regex_passes): metrics = re.sub("\tNA(\t|\n)", "\tnan\\1", metrics) with open(out_metrics, "wt") as f: f.write(metrics) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 | __author__ = "Antonie Vietor" __copyright__ = "Copyright 2020, Antonie Vietor" __email__ = "antonie.v@gmx.de" __license__ = "MIT" from snakemake.shell import shell log = snakemake.log_fmt_shell(stdout=True, stderr=True) out_region = snakemake.output.get("regions") out_matrix = snakemake.output.get("heatmap_matrix") optional_output = "" if out_region: optional_output += " --outFileSortedRegions {out_region} ".format( out_region=out_region ) if out_matrix: optional_output += " --outFileNameMatrix {out_matrix} ".format( out_matrix=out_matrix ) shell( "(plotHeatmap " "-m {snakemake.input[0]} " "-o {snakemake.output.heatmap_img} " "{optional_output} " "{snakemake.params}) {log}" ) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 | __author__ = "Antonie Vietor" __copyright__ = "Copyright 2020, Antonie Vietor" __email__ = "antonie.v@gmx.de" __license__ = "MIT" from snakemake.shell import shell log = snakemake.log_fmt_shell(stdout=True, stderr=True) out_region = snakemake.output.get("regions") out_data = snakemake.output.get("data") optional_output = "" if out_region: optional_output += " --outFileSortedRegions {out_region} ".format( out_region=out_region ) if out_data: optional_output += " --outFileNameData {out_data} ".format(out_data=out_data) shell( "(plotProfile " "-m {snakemake.input[0]} " "-o {snakemake.output.plot_img} " "{optional_output} " "{snakemake.params}) {log}" ) |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 | __author__ = "Julian de Ruiter" __copyright__ = "Copyright 2017, Julian de Ruiter" __email__ = "julianderuiter@gmail.com" __license__ = "MIT" from os import path from tempfile import TemporaryDirectory from snakemake.shell import shell log = snakemake.log_fmt_shell(stdout=False, stderr=True) def basename_without_ext(file_path): """Returns basename of file path, without the file extension.""" base = path.basename(file_path) split_ind = 2 if base.endswith(".fastq.gz") else 1 base = ".".join(base.split(".")[:-split_ind]) return base # Run fastqc, since there can be race conditions if multiple jobs # use the same fastqc dir, we create a temp dir. with TemporaryDirectory() as tempdir: shell( "fastqc {snakemake.params} --quiet -t {snakemake.threads} " "--outdir {tempdir:q} {snakemake.input[0]:q}" " {log:q}" ) # Move outputs into proper position. output_base = basename_without_ext(snakemake.input[0]) html_path = path.join(tempdir, output_base + "_fastqc.html") zip_path = path.join(tempdir, output_base + "_fastqc.zip") if snakemake.output.html != html_path: shell("mv {html_path:q} {snakemake.output.html:q}") if snakemake.output.zip != zip_path: shell("mv {zip_path:q} {snakemake.output.zip:q}") |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 | __author__ = "Antonie Vietor" __copyright__ = "Copyright 2020, Antonie Vietor" __email__ = "antonie.v@gmx.de" __license__ = "MIT" import os import sys from snakemake.shell import shell log = snakemake.log_fmt_shell(stdout=True, stderr=True) in_contr = snakemake.input.get("control") params = "{}".format(snakemake.params) opt_input = "" out_dir = "" ext = "_peaks.xls" out_file = [o for o in snakemake.output if o.endswith(ext)][0] out_name = os.path.basename(out_file[: -len(ext)]) out_dir = os.path.dirname(out_file) if in_contr: opt_input = "-c {contr}".format(contr=in_contr) if out_dir: out_dir = "--outdir {dir}".format(dir=out_dir) if any(out.endswith(("_peaks.narrowPeak", "_summits.bed")) for out in snakemake.output): if any( out.endswith(("_peaks.broadPeak", "_peaks.gappedPeak")) for out in snakemake.output ): sys.exit( "Output files with _peaks.narrowPeak and/or _summits.bed extensions cannot be created together with _peaks.broadPeak and/or _peaks.gappedPeak extended output files.\n" "For usable extensions please see https://snakemake-wrappers.readthedocs.io/en/stable/wrappers/macs2/callpeak.html.\n" ) else: if " --broad" in params: sys.exit( "If --broad option in params is given, the _peaks.narrowPeak and _summits.bed files will not be created. \n" "Remove --broad option from params if these files are needed.\n" ) if any( out.endswith(("_peaks.broadPeak", "_peaks.gappedPeak")) for out in snakemake.output ): if "--broad " not in params and not params.endswith("--broad"): params += " --broad " if any( out.endswith(("_treat_pileup.bdg", "_control_lambda.bdg")) for out in snakemake.output ): if all(p not in params for p in ["--bdg", "-B"]): params += " --bdg " else: if any(p in params for p in ["--bdg", "-B"]): sys.exit( "If --bdg or -B option in params is given, the _control_lambda.bdg and _treat_pileup.bdg extended files must be specified in output. \n" ) shell( "(macs2 callpeak " "-t {snakemake.input.treatment} " "{opt_input} " "{out_dir} " "-n {out_name} " "{params}) {log}" ) |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 | __author__ = "Julian de Ruiter" __copyright__ = "Copyright 2017, Julian de Ruiter" __email__ = "julianderuiter@gmail.com" __license__ = "MIT" from os import path from snakemake.shell import shell input_dirs = set(path.dirname(fp) for fp in snakemake.input) output_dir = path.dirname(snakemake.output[0]) output_name = path.basename(snakemake.output[0]) log = snakemake.log_fmt_shell(stdout=True, stderr=True) shell( "multiqc" " {snakemake.params}" " --force" " -o {output_dir}" " -n {output_name}" " {input_dirs}" " {log}" ) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 | __author__ = "Johannes Köster" __copyright__ = "Copyright 2016, Johannes Köster" __email__ = "koester@jimmy.harvard.edu" __license__ = "MIT" from snakemake.shell import shell from snakemake_wrapper_utils.java import get_java_opts log = snakemake.log_fmt_shell(stdout=True, stderr=True) extra = snakemake.params java_opts = get_java_opts(snakemake) shell( "picard MarkDuplicates " # Tool and its subcommand "{java_opts} " # Automatic java option "{extra} " # User defined parmeters "INPUT={snakemake.input} " # Input file "OUTPUT={snakemake.output.bam} " # Output bam "METRICS_FILE={snakemake.output.metrics} " # Output metrics "{log}" # Logging ) |
1 2 3 4 5 6 7 8 9 10 | __author__ = "Michael Chambers" __copyright__ = "Copyright 2019, Michael Chambers" __email__ = "greenkidneybean@gmail.com" __license__ = "MIT" from snakemake.shell import shell shell("samtools faidx {snakemake.params} {snakemake.input[0]} > {snakemake.output[0]}") |
1 2 3 4 5 6 7 8 9 10 | __author__ = "Johannes Köster" __copyright__ = "Copyright 2016, Johannes Köster" __email__ = "koester@jimmy.harvard.edu" __license__ = "MIT" from snakemake.shell import shell shell("samtools index {snakemake.params} {snakemake.input[0]} {snakemake.output[0]}") |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 | __author__ = "Julian de Ruiter" __copyright__ = "Copyright 2017, Julian de Ruiter" __email__ = "julianderuiter@gmail.com" __license__ = "MIT" from snakemake.shell import shell extra = snakemake.params.get("extra", "") region = snakemake.params.get("region", "") log = snakemake.log_fmt_shell(stdout=False, stderr=True) shell("samtools stats {extra} {snakemake.input} {region} > {snakemake.output} {log}") |
Support
Do you know this workflow well? If so, you can
request seller status , and start supporting this workflow.
Created: 1yr ago
Updated: 1yr ago
Maitainers:
public
URL:
https://github.com/lparsons/kas-seq-workflow
Name:
kas-seq-workflow
Version:
v0.2
Downloaded:
0
Copyright:
Public Domain
License:
MIT License
Keywords:
- Future updates
Related Workflows

ENCODE pipeline for histone marks developed for the psychENCODE project
psychip pipeline is an improved version of the ENCODE pipeline for histone marks developed for the psychENCODE project.
The o...

Near-real time tracking of SARS-CoV-2 in Connecticut
Repository containing scripts to perform near-real time tracking of SARS-CoV-2 in Connecticut using genomic data. This pipeli...

snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...

ATLAS - Three commands to start analyzing your metagenome data
Metagenome-atlas is a easy-to-use metagenomic pipeline based on snakemake. It handles all steps from QC, Assembly, Binning, t...
raw sequence reads
Genome assembly
Annotation track
checkm2
gunc
prodigal
snakemake-wrapper-utils
MEGAHIT
Atlas
BBMap
Biopython
BioRuby
Bwa-mem2
cd-hit
CheckM
DAS
Diamond
eggNOG-mapper v2
MetaBAT 2
Minimap2
MMseqs
MultiQC
Pandas
Picard
pyfastx
SAMtools
SemiBin
Snakemake
SPAdes
SqueezeMeta
TADpole
VAMB
CONCOCT
ete3
gtdbtk
h5py
networkx
numpy
plotly
psutil
utils
metagenomics

RNA-seq workflow using STAR and DESeq2
This workflow performs a differential gene expression analysis with STAR and Deseq2. The usage of this workflow is described ...

This Snakemake pipeline implements the GATK best-practices workflow
This Snakemake pipeline implements the GATK best-practices workflow for calling small germline variants. The usage of thi...