Snakemake pipeline for calling CNAs from Affymetrix (now Thermo Fisher) Cytoscan and Oncoscan arrays
This pipeline has been adapted from https://github.com/gustaveroussy/EaCoN. It leverages the EaCoN R package to conduct preprocessing and normalization, segmentation, and copy number estimation from raw Cytoscan HD .CEL files. The EaCoN package also supports copy number estimation from Oncoscan and SNP6 arrays as well as WES data, but these features have not yet been implemented in this pipeline.
Snakemake
This pipeline leverages the snakemake Python package for workflow management. As a result, the pipeline and its dependencies are easily installed from this repository, allowing quick setup, configuration and deployment.
For more information on Snakemake, please see: https://snakemake.readthedocs.io/en/stable/.
Requirements
Dependency management for this pipeline is handled via conda for Python and renv for R. To get started with setup, install miniconda3 using the instructions available here: https://docs.conda.io/en/latest/miniconda.html. If you do not currently have R installed, you can install it via conda using the command: `conda install -c conda-forge r-base=3.6`. Please note that the EaCoN package has not been updated to work with R >= 4.0. Alternatively, you can install R directly from CRAN as described here: https://cran.r-project.org/.
Setting Up Your Software Environment
The first step to deploying an analysis pipeline is to install the various software packages it depends on. We have included the envs/affymetrix.yml and renv.lock files here to easily accomplish this.
All commands should be executed from the top level directory of this repository.
Python and System Dependencies
Conda can be used to install all Python and most OS-level system dependencies using:
`conda env create --file envs/affymetrix.yml`
This will take some time to run as it gathers and installs the correct package versions. The environment it creates is called affymetrix. If it is not automatically activated after installation, please run `conda activate affymetrix` before proceeding to the next step.
R Dependencies
The renv package can be used to install all R dependencies (both CRAN and Bioconductor). R version 3.6.3 and renv are included as dependencies in the envs/affymetrix.yml file and should be installed automatically when setting up your conda environment.
To initialize this project with renv, run:
`Rscript -e 'library(renv); renv::init()'`
If you wish to isolate the R dependencies from your Conda environment R libraries, you can use this command instead:
`Rscript -e 'library(renv); renv::isolate(); renv::init(bare=TRUE)'`
If initialization doesn't trigger dependency installation, you can do so manually using:
`Rscript -e 'renv::restore()'`
For more information on renv and how it can be used to manage dependencies in your project, please see: https://rstudio.github.io/renv/articles/renv.html.
Configuring the Pipeline
This pipeline assumes the following directory structure:
.
├── envs
├── metadata
├── procdata
├── rawdata
├── renv
├── results
└── scripts
Please at minimum create the rawdata and metadata directories, as they are assumed to hold the raw Cytoscan HD plate data (.CEL) and the pairs file, respectively. For more information on the correct formatting of your pairs file, please see https://github.com/gustaveroussy/EaCoN. The remaining directories will be created automatically as the pipeline runs.
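As an illustrative sketch only (the exact column names are an assumption based on the EaCoN documentation linked above; please verify there), a Cytoscan pairs/CEL list file is a tab-separated table mapping each raw .CEL file to a sample name:

```
cel_files	SampleName
rawdata/Sample1_CytoScanHD.CEL	Sample1
rawdata/Sample2_CytoScanHD.CEL	Sample2
```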
config.yaml
This file holds the pipeline configuration. Here you can specify the paths and parameters for your current pipeline use case. Documentation is provided in the config.yaml file on what each field should contain.
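As a hypothetical sketch of what such a configuration might look like (the field names below are assumptions inferred from the `snakemake@params` accesses in the scripts later in this document; the documented config.yaml in the repository is authoritative):

```yaml
# Array family to process: cytoscan, oncoscan or snp6
array_type: cytoscan
# Prefix used to name output files
analysis_name: my_cytoscan_analysis
# Directories for intermediate and final outputs
procdata: procdata
results: results
# Segmentation settings
segmenter: ASCAT
smoothk: 21
# Range of gamma values to evaluate during copy number calling
gamma_range: [0.35, 0.95]
```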
Using the Pipeline
Batch Processing and Normalization
snakemake --cores 2 batch_process_rawdata
Segmentation
snakemake --cores 2 segment_processed_data
Copy Number Calling
snakemake --cores 2
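Before executing any of these steps for real, Snakemake's standard dry-run flag can be used to preview the jobs a rule would run without actually running them, for example:

```shell
# -n / --dry-run prints the planned jobs without executing them
snakemake --cores 2 --dry-run batch_process_rawdata
```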
Code Snippets
```r
renv::activate()
library(EaCoN)

# 0 -- Parse Snakemake arguments
input <- snakemake@input
params <- snakemake@params
nthreads <- snakemake@threads
output <- snakemake@output

# 1 -- Preprocess and normalize the raw data based on the configured array type
if (grepl('cytoscan', params$array_type, ignore.case=TRUE)) {
    CS.Process.Batch(input$pairs_file, nthread=nthreads,
        out.dir=params$procdata, force=TRUE)
} else if (grepl('oncoscan', params$array_type, ignore.case=TRUE)) {
    OS.Process.Batch(input$pairs_file, nthread=nthreads,
        out.dir=params$procdata, force=TRUE)
} else if (grepl('snp6', params$array_type, ignore.case=TRUE)) {
    SNP6.Process.Batch(input$pairs_file, out.dir=params$procdata)
} else if (grepl('wes', params$array_type, ignore.case=TRUE)) {
    stop("WES has not been implemented in this pipeline yet, please see https://github.com/gustaveroussy/EaCoN for information on setting up your own analysis script.")
} else {
    stop("Supported array families are wes, cytoscan, oncoscan and snp6")
}
```
```r
renv::activate()
library(EaCoN)

# 0 -- Parse Snakemake arguments
input <- snakemake@input
params <- snakemake@params
nthreads <- snakemake@threads
output <- snakemake@output

# 1 -- L2R and BAF joint segmentation of the preprocessed data from the previous rule
EaCoN:::Segment.ff.Batch(unlist(input), nthread=nthreads,
    segmenter=params$segmenter, smooth.k=params$smoothk, force=TRUE)
```
```r
renv::activate()
library(EaCoN)

# 0 -- Parse Snakemake arguments
input <- snakemake@input
params <- snakemake@params
nthreads <- snakemake@threads
output <- snakemake@output

# 1 -- Make copy number calls based on the L2R segmentation results
EaCoN:::ASCN.ff.Batch(unlist(input), nthread=nthreads, force=TRUE,
    gammaRange=unlist(params$gamma_range))
```
```r
renv::activate()
library(EaCoN)
library(data.table)
library(qs)
library(GenomicRanges)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(S4Vectors)

# 0 -- Parse Snakemake arguments
input <- snakemake@input
params <- snakemake@params
output <- snakemake@output

# 1 -- Read in the gamma evaluation files
gammaFiles <- list.files(input$out_dir, '.*gammaEval.*txt',
    recursive=TRUE, full.names=TRUE)
print(gammaFiles)

# 2 -- Load pancanPloidy data as reference
data(pancanPloidy.noSegs)
pancan.ploidy <- pancan.obj.segless$breaks$ploidy

all.fits <- lapply(gammaFiles, function(file) {
    list(
        fit=fread(file, sep='\t'),
        sample=strsplit(file, '/')[[1]][2]
    )
})
names(all.fits) <- sapply(all.fits, function(x) x$sample)

.fitsToVector <- function(fit.list) {
    newList <- list()
    nameList <- list()
    for (i in seq_along(fit.list)) {
        fit <- as.list(fit.list[[i]]$fit[[2]])
        names(fit) <- tolower(unlist(fit.list[[i]]$fit[[1]]))
        print(fit)
        newList[[i]] <- fit
        nameList <- c(nameList, fit.list[[i]][[2]])
    }
    names(newList) <- unlist(nameList)
    return(newList)
}

vec.fits <- .fitsToVector(all.fits)

# 3 -- Annotate the RDS data associated with the gamma files
# Change directory to work around an error in annotateRDS.Batch
setwd('procdata')
print('Starting annotation...')
gr.cnv <- EaCoN:::annotateRDS.Batch(
    all.fits, 'ASCAT',
    nthread=params$nthreads,
    gamma.method='score',
    pancan.ploidy=pancan.ploidy
)
print('finished annotation')
setwd('..')

# Save the raw results object to disk
qsave(gr.cnv,
    file=file.path(input$out_dir, paste0(params$analysis_name, 'optimal_gamma_list.qs')),
    nthread=params$nthreads)

## ---- Create a GRangesList of segmentation results and save it to disk

# Extract segmentation data.frames from the results object
segmentation_df_list <- lapply(gr.cnv, function(x) x$seg)

# Convert all data.frames to GRanges and return in a list
list_of_gRanges <- lapply(segmentation_df_list,
    function(x) makeGRangesFromDataFrame(x, keep.extra.columns=TRUE))

# Convert the list of GRanges objects into a GRangesList object
cnv_grList <- GRangesList(list_of_gRanges)

# Save the GRangesList to disk for downstream analysis
qsave(cnv_grList,
    file=file.path(params$results, paste0(params$analysis_name, '_grList.qs')),
    nthread=params$nthreads)
```
```r
renv::activate()
library(EaCoN)
library(Biobase)
library(SummarizedExperiment)
library(data.table)
library(qs)

# 0 -- Parse Snakemake parameters
input <- snakemake@input
params <- snakemake@params
output <- snakemake@output

# 1 -- Load input data
gr.cnv <- qread(input$gr_cnv, nthread=params$nthreads)

# 2 -- Build ExpressionSets
pairs_file <- fread(input$pairs_file)
setDF(pairs_file)
EaCoN:::buildPSetOut(gr.cnv, params$analysis_name, file.path(params$results),
    meta=pairs_file[, -c(1)])

# 3 -- Convert ExpressionSet objects to SummarizedExperiments
esetFiles <- list.files(params$results,
    paste0(params$analysis_name, '.*ESet..RData'), full.names=TRUE)

esetNames <- list()
for (file in esetFiles) esetNames[file] <- load(file)
esets <- lapply(esetNames, get)
names(esets) <- gsub('.*/|_ESet..RData', '', esetFiles)

SEs <- lapply(esets, FUN=as, 'SummarizedExperiment')

# 4 -- Save results to disk
for (i in seq_along(SEs)) {
    qsave(SEs[[i]], file=file.path(params$results,
        paste0(names(SEs)[i], '_SumExp.qs')))
}
```
Each rule in the Snakefile delegates to one of the scripts above via a script directive:

```
script: 'scripts/1_batchProcessRawdataFiles.R'
script: 'scripts/2_segmentProcessedData.R'
script: 'scripts/3_estimateCopyNumber.R'
script: 'scripts/4_selectOptimalGamma.R'
script: 'scripts/5_buildSummarizedExperiments.R'
```
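The script directives shown above are only the tail of each rule. As a hypothetical sketch (the input path and rule body are assumptions pieced together from the config fields and `snakemake@` accesses used in the scripts, not the repository's actual Snakefile), a complete rule might look like:

```
rule batch_process_rawdata:
    input:
        # hypothetical pairs file location under metadata/
        pairs_file='metadata/pairs_file.tsv'
    params:
        array_type=config['array_type'],
        procdata=config['procdata']
    threads: 2
    script:
        'scripts/1_batchProcessRawdataFiles.R'
```

Snakemake exposes the input, params, and threads fields to the R script as the `snakemake` S4 object, which is how the scripts above retrieve their arguments.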
Support
- Future updates