Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation, topic
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
This pipeline has been adapted from https://github.com/gustaveroussy/EaCoN.
It leverages the
EaCoN
R package to conduct preprocessing and normalization,
segmentation and copy number estimation from specific microarray .CEL files.
The
EaCoN
package supports copy number estimation from Cytoscan, Oncoscan,
SNP6 arrays as well as WES data, however, the current pipeline has only
implemented support for microarray data.
Snakemake
This pipeline leverages the
snakemake
Python package for workflow management.
As a result the pipeline and its dependencies are easily
installed from this repository, allowing quick setup, configuration and
deployment.
For more information on Snakemake, please see: https://snakemake.readthedocs.io/en/stable/.
Software Environment
Dependency management for this pipeline is handled via
conda
for Python
and
renv
for R. To get started with setup you can install
miniconda3 using the instructions available here: https://docs.conda.io/en/latest/miniconda.html.
Alternatively you can install it directly from CRAN as described here: https://cran.r-project.org/.
Setting Up Your Software Environment
The first step to deploying an analysis pipeline is to install the various
software packages it depends on. We have included the
env/eacon.yml
and
renv.lock
files here to easily accomplish this.
All commands should be executed from the top level directory of this repository unless otherwise indicated.
Python and Snakemake
The first step to deploying an analysis pipeline is to install Python,
Snakemake and Singularity via
conda
. We have included the
envs/eacon.yml
which specifies all the requisite dependencies to use
snakemake
for this pipeline, including R.
You can use
conda
to install all Python and OS system dependencies
using:
conda env create --file env/eacon.yml
This will take some time to run as it gathers and installs the correct
package versions. The environment it creates should be called
eacon
.
If it is not automatically activated after installation please run
conda activate eacon
before proceeding to the next step.
R Dependencies
R dependencies are handled via
renv
and all rules in this
pipeline will use the local R package cache stored in the
renv
directory.
To install all R dependencies run:
Rscript -e 'library(renv); renv::init();'
If you wish to isolate the R dependencies from your Conda environment R libraries, you can use this command instead:
Rscript -e 'library(renv); renv::isolate(); renv::init(bare=TRUE)'
If intialization doesn't trigger dependency installation, you can do so manually using:
Rscript -e 'renv::restore()'
For more information on
renv
and how it can be used to manage dependencies in
your project, please see: https://rstudio.github.io/renv/articles/renv.html.
Configuring the Pipeline
This pipeline assumes the following directory structure:
.
├── env
├── metadata
├── procdata
├── rawdata
├── renv
├── results
└── scripts
Please at minimum create the
rawdata
and
metadata
directories, as they are assumed to hold the raw microarray plate data (.CEL) and the pairs file, respectively. For more information on the correct formatting for your pairs file, please see https://github.com/gustaveroussy/EaCoN.
The remaining missing directories will be created automatically as the pipeline runs.
config.yaml
This file hold the relevant pipeline documentation. Here you can specify the paths
to all the parameters for your current pipeline use case. Documentation is provided
in the
config.yaml
file on what each field should contain.
Using the Pipeline
Deployment
To run the pipeline end-to-end:
snakemake --cores <n_cores> --use-conda
Where <n_cores> is the number of cores to parallelize over.
For details on deploying this pipleine via your local HPC cluster or in the cloud, please consult the Snakemake documentation. Deployment to these platforms requires minimal additional configuration.
Individual Rules
The pipeline can also be run rule by rule using the rule names.
Batch Processing and Normalization
snakemake --cores 2 --use-conda batch_process_rawdata
Segmentation
snakemake --cores 2 --use-conda segment_processed_data
Copy Number Calling
snakemake --cores 2 --use-conda estimate_copy_number
Determine Optimal Value for Gamma
snakemake --cores 2 --use-conda select_optimal_gamma
Build Bioconductor RaggedExperiment Object
snakemake --cores 2 --use-conda build_ragged_experiment
Filter Samples Based on QC Criteria
snakemake --cores 2 --use-conda sample_quality_control
Code Snippets
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 | renv::activate() library(EaCoN) # 0.2 -- Parse Snakemake arguments input <- snakemake@input params <- snakemake@params nthreads <- snakemake@threads output <- snakemake@output # 0.3 -- Load platform specific dependencies if (grepl("snp|cytoscan|oncoscan", params$array_type, ignore.case=TRUE)) library(affy.CN.norm.data) if (grepl('snp', params$array_type)) { library(apt.snp6.1.20.0) library(rcnorm) library(GenomeWideSNP.6.na35.r1) } ## FIXME:: Add conditional dependencies for other platforms switch(params$reference, 'BSgenome.Hsapiens.UCSC.hg19'=library(BSgenome.Hsapiens.UCSC.hg19), 'BSgenome.Hsapiens.UCSC.hg38'={ if (grepl("snp", params$array_type, ignore.case=TRUE)) stop("Must use BSgenome.Hsapiens.UCSC.hg19 for GenomeWide SNP6 arrays!") library(BSgenome.Hsapiens.UCSC.hg38) } ) # 1 -- Load or create the metadata file specifying CEL paths if (file.exists(input$pairs_file)) { pairs_df <- read.table(input$pairs_file, sep="\t", header=TRUE, stringsAsFactors=FALSE) } else { # find all CEL files in the raw data cel_file_paths <- list.files(params$rawdata, pattern="*.CEL$", recursive=TRUE, full.names=TRUE) pairs_df <- data.frame( cel_files=cel_file_paths, # assumes the second element in path is the sample name SampleName=vapply(cel_file_paths, FUN=function(x) strsplit(x)[[1]][2], FUN.VALUE=character(1)) ) if (!is.null(input$pairs_file) || input$paris_file == "") { # create path if it doesn't exist pairs_path <- dirname(input$pairs_file) if (!file.exists(pairs_path)) dir.create(pairs_path, recursive=TRUE) # write out a pairs file write.table(pairs_df, input$pairs_file) } } # 2 -- Format the paths in the pairs_file to match this project directory # structure from config.yaml if (grepl('cytocscan|oncoscan', params$array_type, ignore.case=TRUE)) { stopifnot(c("ATChannelCel", "GCChannelCel", "SampleName") %in% colnames(pairs_df)) # remove existing path, if there is one pairs_df$ATChannelCel <- gsub("^.*\\/", "", pairs_df$ATChannelCel) pairs_df$GCChannelCel <- gsub("^.*\\/", "", pairs_df$GCChannelCel) # create a new path relative to specified rawdata directory pairs_df$GCChannelCel <- file.path(getwd(), params$rawdata, pairs_df$GCChannelCel) } else if (grepl('snp6', params$array_type, ignore.case=TRUE)) { stopifnot(all(c("cel_files", "SampleName") %in% colnames(pairs_df))) # pairs_df$cel_files <- gsub("^.*\\/", "", pairs_df$cel_files) # pairs_df$cel_files <- file.path(getwd(), params$rawdata, pairs_df$cel_files) } # output the file to temporary storage so it can be read by EaCoN pairs_file <- file.path(tempdir(), "CEL_pairs_file.csv") write.table(pairs_df, file=pairs_file, sep="\t") # 3 -- Preprocess and normalize the raw data; does if (grepl('cytoscan', params$array_type, ignore.case=TRUE)) { EaCoN:::CS.Process.Batch(pairs_file, nthread=nthreads, out.dir=params$procdata, force=TRUE, cluter.type=params$cluster_type) } else if (grepl('oncoscan', params$array_type, ignore.case=TRUE)) { EaCoN:::OS.Process.Batch(pairs_file, nthread=nthreads, out.dir=params$procdata, force=TRUE, cluster.type=params$cluster_type) } else if (grepl('snp', params$array_type, ignore.case=TRUE)) { EaCoN:::SNP6.Process.Batch(pairs_file, out.dir=params$procdata, force=TRUE, nthread=nthreads, cluster.type=params$cluster_type) } else if (grepl('wes', params$array_type, ignore.case=TRUE)) { stop("WES has not been implemented in this pipeline yet, please see https://github.com/gustaveroussy/EaCoN for information on setting up your own analysis script.") } else { stop("Supported assay families are wes, cytoscan, oncoscan and snp6") } |
2 3 4 5 6 7 8 9 10 11 12 13 14 15 | renv::activate() library(EaCoN) # 0 -- Parse Snakemake arguments input <- snakemake@input params <- snakemake@params nthreads <- snakemake@threads output <- snakemake@output # 1 -- L2R and BAF joint segment the preprocessed data from the previous rule EaCoN:::Segment.ff.Batch(RDS.files=unlist(input), nthread=nthreads, segmenter=params$segmenter, smooth.k=params$smoothk, BAF.filter=params$BAF_filter, nrf=params$nrf, SER.pen=params$SER_pen, force=TRUE) |
2 3 4 5 6 7 8 9 10 11 12 13 14 15 | renv::activate() library(EaCoN) # 0 -- Parse Snakemake arguments input <- snakemake@input params <- snakemake@params nthreads <- snakemake@threads output <- snakemake@output # 2 -- Generate .html qc reports for # 1 -- Make copy number calls based on L2R segmentation results EaCoN:::ASCN.ff.Batch(unlist(input), nthread=nthreads, force=TRUE, gammaRange=unlist(params$gamma_range)) |
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 | renv::activate() library(data.table) # -- 0.2 Parse snakemake arguments input <- snakemake@input params <- snakemake@params output <- snakemake@output # -- 0.3 Load local utility functions source(file.path("scripts", "utils.R")) # -- 1. Read in gamma files gammaFiles <- list.files(params$out_dir, '.*gammaEval.*txt', recursive=TRUE, full.names=TRUE) gamma_df <- rbindlist( setNames( lapply(gammaFiles, FUN=fread), gsub(".gammaEval.txt", "", basename(gammaFiles)) ), idcol="sample_name" ) # -- 2. Select the best fits best_fits <- gamma_df[, .SD[which.max(GoF)], by="sample_name"] # -- 3. Save the best fit data.frame to csv fwrite(best_fits, file=output[[1]]) |
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 | renv::activate() library(EaCoN) library(data.table) library(qs) library(GenomicRanges) library(TxDb.Hsapiens.UCSC.hg19.knownGene) library(org.Hs.eg.db) library(BiocParallel) library(RaggedExperiment) # -- 0.2 Parse snakemake parameters input <- snakemake@input params <- snakemake@params output <- snakemake@output # -- 0.3 Load utity functions source(file.path("scripts", "utils.R")) # -- 1. Load the optimal gamma for each sample best_fits <- fread(input[[1]]) # -- 2. Find the .RDS files associated with the best fits best_fit_files <- Map( function(x, y) grep(pattern=y, list.files(x, recursive=TRUE, full.names=TRUE), value=TRUE), x=file.path(params$out_dir, best_fits$sample_name), y=paste0(".*gamma", sprintf("%.2f", best_fits$gamma), "/.*RDS$") ) l2r_files <- Map( function(x, y) grep(pattern=y, list.files(x, recursive=TRUE, full.names=TRUE), value=TRUE), x=file.path(params$out_dir, best_fits$sample_name), y=".*L2R/.*RDS$" ) # -- 3. Load the best fit ASCN and L2R data and build GRanges objects .build_granges_from_cnv <- function(ascn, l2r) { ascn_data <- readRDS(ascn) l2r_data <- readRDS(l2r) buildGRangesFromASCNAndL2R(ascn_data, l2r_data) } BPPARAM <- BiocParallel::bpparam() BiocParallel::bpworkers(BPPARAM) <- params$nthreads gr_list <- BiocParallel::bpmapply(.build_granges_from_cnv, best_fit_files, l2r_files, SIMPLIFY=FALSE, USE.NAMES=TRUE, BPPARAM=BPPARAM ) # removing directory paths names(gr_list) <- basename(names(gr_list)) # -- 4. Construct a RaggedExperiment object ragged_exp <- as(GRangesList(gr_list), "RaggedExperiment") # include annotated bins to summarize the RaggedExperiment with txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene genome_bins <- binReferenceGenome() annotated_bins <- annotateGRangesWithTxDB(genome_bins, txdb=txdb) # include annotated genes to summarized RaggedExperiment with gene_granges <- genes(txdb) annotated_genes <- annotateGRangesWithTxDB(gene_granges, txdb=txdb) metadata(ragged_exp) <- list( annotated_genome_bins=annotated_bins, annotated_genes=annotated_genes, simplifyReduce=function(scores, ranges, qranges) { if (is.numeric(scores)) { x <- mean(scores, na.rm=TRUE) } else { count_list <- as.list(table(scores)) x <- paste0( paste0(names(count_list), ":", unlist(count_list)), collapse="," ) } return(x) } ) # -- Save files to disk qsave(ragged_exp, file=output[[1]], nthreads=params$nthread) |
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 | renv::activate() library(EaCoN) library(Biobase) library(RaggedExperiment) library(data.table) library(qs) # 0.2 -- Parse snakemake parameters input <- snakemake@input params <- snakemake@params output <- snakemake@output # 1 -- Collect and parse QC data qc_files <- list.files(params$procdata, ".*qc.txt", full.names=TRUE, recursive=TRUE) dt_list <- lapply(qc_files, FUN=fread) qc_dt <- rbindlist(dt_list) qc_dt$sample_name <- gsub("_.*$", "", basename(qc_files)) # 2 -- Output the QC results fwrite(qc_dt, output$qc_csv) # 3 -- Load the RaggedExperiment and filter on QC criteria ragged_exp <- qread(input$ragged_exp) # 4 -- Identify samples passing QC qc_dt <- qc_dt[ MAPD <= params$mapd & `waviness-sd` <= params$ndwavinesssd & SNPQC >= params$snpqc, ] if (params$cellularity > 0) { if ("TUSCAN-cellularity" %in% colnames(qcdt)) qc_dt <- qc_dt[`TUSCAN-cellularity` >= params$cellularity] } keep_samples <- intersect(colnames(ragged_exp), qc_dt$sample_name) print(keep_samples) # 5 -- Subset RaggedExperiment ragged_exp <- ragged_exp[, keep_samples] metadata(ragged_exp)$qc_parameters <- params[-c(1, 2)] # 6 -- Write RaggedExperiment to disk qsave(ragged_exp, file=output$ragged_exp, nthreads=params$nthreads) |
2
of
scripts/6_sampleQualityControl.R
73 74 | script: "scripts/1_batchProcessRawdataFiles.R" |
94 95 | script: "scripts/2_segmentProcessedData.R" |
109 110 | script: "scripts/3_estimateCopyNumber.R" |
124 125 | script: "scripts/4_selectOptimalGamma.R" |
139 140 | script: "scripts/5_buildRaggedExperiment.R" |
157 158 | script: "scripts/6_sampleQualityControl.R" |
170 171 | script: "scripts/7_customTotalCopyCalls.R" |
Support
- Future updates
Related Workflows





