Snakemake pipeline for calling CNAs from Affymetrix (now Thermo Fisher) Cytoscan and Oncoscan arrays
This pipeline has been adapted from https://github.com/gustaveroussy/EaCoN. It leverages the EaCoN R package to conduct preprocessing and normalization, segmentation, and copy number estimation from raw Cytoscan HD .CEL files. The EaCoN package also supports copy number estimation from Oncoscan and SNP6 arrays as well as WES data, but these features have not yet been implemented in this pipeline.
Snakemake
This pipeline leverages the snakemake Python package for workflow management. As a result, the pipeline and its dependencies are easily installed from this repository, allowing quick setup, configuration and deployment.
For more information on Snakemake, please see: https://snakemake.readthedocs.io/en/stable/.
Requirements
Dependency management for this pipeline is handled via conda for Python and renv for R. To get started with setup, install miniconda3 using the instructions available here: https://docs.conda.io/en/latest/miniconda.html. If you do not currently have R installed, you can install it via conda using the command: `conda install -c conda-forge r-base=3.6`. Please note that the EaCoN package has not been updated to work with R >= 4.0. Alternatively, you can install R directly from CRAN as described here: https://cran.r-project.org/.
Setting Up Your Software Environment
The first step to deploying an analysis pipeline is to install the various software packages it depends on. We have included the envs/affymetrix.yml and renv.lock files here to easily accomplish this.
All commands should be executed from the top level directory of this repository.
Python and System Dependencies
Conda can be used to install all Python and most OS-level system dependencies using:
`conda env create --file envs/affymetrix.yml`
This will take some time to run as it gathers and installs the correct package versions. The environment it creates is called affymetrix. If it is not automatically activated after installation, please run `conda activate affymetrix` before proceeding to the next step.
R Dependencies
The renv package can be used to install all R dependencies (both CRAN and Bioconductor). R version 3.6.3 and renv are included as dependencies in the envs/affymetrix.yml file and should be installed automatically when setting up your conda environment.
To initialize this project with renv, run:
`Rscript -e 'library(renv); renv::init()'`
If you wish to isolate the R dependencies from your Conda environment R libraries, you can use this command instead:
`Rscript -e 'library(renv); renv::isolate(); renv::init(bare=TRUE)'`
If initialization doesn't trigger dependency installation, you can do so manually using:
`Rscript -e 'renv::restore()'`
For more information on renv and how it can be used to manage dependencies in your project, please see: https://rstudio.github.io/renv/articles/renv.html.
Configuring the Pipeline
This pipeline assumes the following directory structure:
.
├── envs
├── metadata
├── procdata
├── rawdata
├── renv
├── results
└── scripts
Please at minimum create the rawdata and metadata directories, as they are assumed to hold the raw Cytoscan HD plate data (.CEL) and the pairs file, respectively. For more information on the correct formatting of your pairs file, please see https://github.com/gustaveroussy/EaCoN. The remaining directories will be created automatically as the pipeline runs.
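As an illustrative sketch only (the exact column names are an assumption based on the EaCoN documentation linked above; please verify there), a Cytoscan pairs/CEL list file is a tab-separated table mapping each raw .CEL file to a sample name:

```
cel_files	SampleName
rawdata/Sample1_CytoScanHD.CEL	Sample1
rawdata/Sample2_CytoScanHD.CEL	Sample2
```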
config.yaml
This file holds the pipeline configuration. Here you can specify the paths and parameters for your current pipeline use case. Documentation is provided in the config.yaml file on what each field should contain.
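As a hypothetical sketch of what such a configuration might look like (the field names below are assumptions inferred from the `snakemake@params` accesses in the scripts later in this document; the documented config.yaml in the repository is authoritative):

```yaml
# Array family to process: cytoscan, oncoscan or snp6
array_type: cytoscan
# Prefix used to name output files
analysis_name: my_cytoscan_analysis
# Directories for intermediate and final outputs
procdata: procdata
results: results
# Segmentation settings
segmenter: ASCAT
smoothk: 21
# Range of gamma values to evaluate during copy number calling
gamma_range: [0.35, 0.95]
```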
Using the Pipeline
Batch Processing and Normalization
snakemake --cores 2 batch_process_rawdata
Segmentation
snakemake --cores 2 segment_processed_data
Copy Number Calling
snakemake --cores 2
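Before executing any of these steps for real, Snakemake's standard dry-run flag can be used to preview the jobs a rule would run without actually running them, for example:

```shell
# -n / --dry-run prints the planned jobs without executing them
snakemake --cores 2 --dry-run batch_process_rawdata
```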
Code Snippets
```r
renv::activate()
library(EaCoN)

# 0 -- Parse Snakemake arguments
input <- snakemake@input
params <- snakemake@params
nthreads <- snakemake@threads
output <- snakemake@output

# 1 -- Preprocess and normalize the raw data based on the configured array type
if (grepl('cytoscan', params$array_type, ignore.case=TRUE)) {
    CS.Process.Batch(input$pairs_file, nthread=nthreads,
        out.dir=params$procdata, force=TRUE)
} else if (grepl('oncoscan', params$array_type, ignore.case=TRUE)) {
    OS.Process.Batch(input$pairs_file, nthread=nthreads,
        out.dir=params$procdata, force=TRUE)
} else if (grepl('snp6', params$array_type, ignore.case=TRUE)) {
    SNP6.Process.Batch(input$pairs_file, out.dir=params$procdata)
} else if (grepl('wes', params$array_type, ignore.case=TRUE)) {
    stop("WES has not been implemented in this pipeline yet, please see https://github.com/gustaveroussy/EaCoN for information on setting up your own analysis script.")
} else {
    stop("Supported array families are wes, cytoscan, oncoscan and snp6")
}
```
```r
renv::activate()
library(EaCoN)

# 0 -- Parse Snakemake arguments
input <- snakemake@input
params <- snakemake@params
nthreads <- snakemake@threads
output <- snakemake@output

# 1 -- L2R and BAF joint segmentation of the preprocessed data from the previous rule
EaCoN:::Segment.ff.Batch(unlist(input), nthread=nthreads,
    segmenter=params$segmenter, smooth.k=params$smoothk, force=TRUE)
```
```r
renv::activate()
library(EaCoN)

# 0 -- Parse Snakemake arguments
input <- snakemake@input
params <- snakemake@params
nthreads <- snakemake@threads
output <- snakemake@output

# 1 -- Make copy number calls based on the L2R segmentation results
EaCoN:::ASCN.ff.Batch(unlist(input), nthread=nthreads, force=TRUE,
    gammaRange=unlist(params$gamma_range))
```
```r
renv::activate()
library(EaCoN)
library(data.table)
library(qs)
library(GenomicRanges)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(S4Vectors)

# 0 -- Parse Snakemake arguments
input <- snakemake@input
params <- snakemake@params
output <- snakemake@output

# 1 -- Read in the gamma evaluation files
gammaFiles <- list.files(input$out_dir, '.*gammaEval.*txt',
    recursive=TRUE, full.names=TRUE)
print(gammaFiles)

# 2 -- Load pancanPloidy data as reference
data(pancanPloidy.noSegs)
pancan.ploidy <- pancan.obj.segless$breaks$ploidy

all.fits <- lapply(gammaFiles, function(file) {
    list(
        fit=fread(file, sep='\t'),
        sample=strsplit(file, '/')[[1]][2]
    )
})
names(all.fits) <- sapply(all.fits, function(x) x$sample)

.fitsToVector <- function(fit.list) {
    newList <- list()
    nameList <- list()
    for (i in seq_along(fit.list)) {
        fit <- as.list(fit.list[[i]]$fit[[2]])
        names(fit) <- tolower(unlist(fit.list[[i]]$fit[[1]]))
        print(fit)
        newList[[i]] <- fit
        nameList <- c(nameList, fit.list[[i]][[2]])
    }
    names(newList) <- unlist(nameList)
    return(newList)
}

vec.fits <- .fitsToVector(all.fits)

# 3 -- Annotate the RDS data associated with the gamma files
# Change directory to work around an error in annotateRDS.Batch
setwd('procdata')
print('Starting annotation...')
gr.cnv <- EaCoN:::annotateRDS.Batch(
    all.fits, 'ASCAT',
    nthread=params$nthreads,
    gamma.method='score',
    pancan.ploidy=pancan.ploidy
)
print('finished annotation')
setwd('..')

# Save the raw results object to disk
qsave(gr.cnv,
    file=file.path(input$out_dir, paste0(params$analysis_name, 'optimal_gamma_list.qs')),
    nthread=params$nthreads)

## ---- Create a GRangesList of segmentation results and save it to disk

# Extract segmentation data.frames from the results object
segmentation_df_list <- lapply(gr.cnv, function(x) x$seg)

# Convert all data.frames to GRanges and return in a list
list_of_gRanges <- lapply(segmentation_df_list,
    function(x) makeGRangesFromDataFrame(x, keep.extra.columns=TRUE))

# Convert the list of GRanges objects into a GRangesList object
cnv_grList <- GRangesList(list_of_gRanges)

# Save the GRangesList to disk for downstream analysis
qsave(cnv_grList,
    file=file.path(params$results, paste0(params$analysis_name, '_grList.qs')),
    nthread=params$nthreads)
```
```r
renv::activate()
library(EaCoN)
library(Biobase)
library(SummarizedExperiment)
library(data.table)
library(qs)

# 0 -- Parse Snakemake parameters
input <- snakemake@input
params <- snakemake@params
output <- snakemake@output

# 1 -- Load input data
gr.cnv <- qread(input$gr_cnv, nthread=params$nthreads)

# 2 -- Build ExpressionSets
pairs_file <- fread(input$pairs_file)
setDF(pairs_file)
EaCoN:::buildPSetOut(gr.cnv, params$analysis_name, file.path(params$results),
    meta=pairs_file[, -c(1)])

# 3 -- Convert ExpressionSet objects to SummarizedExperiments
esetFiles <- list.files(params$results,
    paste0(params$analysis_name, '.*ESet..RData'), full.names=TRUE)

esetNames <- list()
for (file in esetFiles) esetNames[file] <- load(file)
esets <- lapply(esetNames, get)
names(esets) <- gsub('.*/|_ESet..RData', '', esetFiles)

SEs <- lapply(esets, FUN=as, 'SummarizedExperiment')

# 4 -- Save results to disk
for (i in seq_along(SEs)) {
    qsave(SEs[[i]], file=file.path(params$results,
        paste0(names(SEs)[i], '_SumExp.qs')))
}
```
Each rule in the Snakefile delegates to one of the scripts above via a script directive:

```
script: 'scripts/1_batchProcessRawdataFiles.R'
script: 'scripts/2_segmentProcessedData.R'
script: 'scripts/3_estimateCopyNumber.R'
script: 'scripts/4_selectOptimalGamma.R'
script: 'scripts/5_buildSummarizedExperiments.R'
```
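The script directives shown above are only the tail of each rule. As a hypothetical sketch (the input path and rule body are assumptions pieced together from the config fields and `snakemake@` accesses used in the scripts, not the repository's actual Snakefile), a complete rule might look like:

```
rule batch_process_rawdata:
    input:
        # hypothetical pairs file location under metadata/
        pairs_file='metadata/pairs_file.tsv'
    params:
        array_type=config['array_type'],
        procdata=config['procdata']
    threads: 2
    script:
        'scripts/1_batchProcessRawdataFiles.R'
```

Snakemake exposes the input, params, and threads fields to the R script as the `snakemake` S4 object, which is how the scripts above retrieve their arguments.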
Support
- Future updates