Easy Copy Number Analysis (EaCoN) Pipeline

public 1yr ago 0 bookmarks

View Workflow

Help improve this workflow!

This workflow has been published but could be further improved with some additional meta data:

Keyword(s) in categories input, output, operation, topic

You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .

This pipeline has been adapted from https://github.com/gustaveroussy/EaCoN. It leverages the EaCoN R package to conduct preprocessing and normalization, segmentation and copy number estimation from specific microarray .CEL files. The EaCoN package supports copy number estimation from Cytoscan, Oncoscan, SNP6 arrays as well as WES data, however, the current pipeline has only implemented support for microarray data.

Snakemake

This pipeline leverages the snakemake Python package for workflow management. As a result the pipeline and its dependencies are easily installed from this repository, allowing quick setup, configuration and deployment.

For more information on Snakemake, please see: https://snakemake.readthedocs.io/en/stable/.

Software Environment

Dependency management for this pipeline is handled via conda for Python and renv for R. To get started with setup you can install miniconda3 using the instructions available here: https://docs.conda.io/en/latest/miniconda.html.

Alternatively you can install it directly from CRAN as described here: https://cran.r-project.org/.

Setting Up Your Software Environment

The first step to deploying an analysis pipeline is to install the various software packages it depends on. We have included the env/eacon.yml and renv.lock files here to easily accomplish this.

All commands should be executed from the top level directory of this repository unless otherwise indicated.

Python and Snakemake

The first step to deploying an analysis pipeline is to install Python, Snakemake and Singularity via conda . We have included the envs/eacon.yml which specifies all the requisite dependencies to use snakemake for this pipeline, including R.

You can use conda to install all Python and OS system dependencies using:

conda env create --file env/eacon.yml

This will take some time to run as it gathers and installs the correct package versions. The environment it creates should be called eacon .

If it is not automatically activated after installation please run conda activate eacon before proceeding to the next step.

R Dependencies

R dependencies are handled via renv and all rules in this pipeline will use the local R package cache stored in the renv directory.

To install all R dependencies run:

Rscript -e 'library(renv); renv::init();'

If you wish to isolate the R dependencies from your Conda environment R libraries, you can use this command instead:

Rscript -e 'library(renv); renv::isolate(); renv::init(bare=TRUE)'

If intialization doesn't trigger dependency installation, you can do so manually using:

Rscript -e 'renv::restore()'

For more information on renv and how it can be used to manage dependencies in your project, please see: https://rstudio.github.io/renv/articles/renv.html.

Configuring the Pipeline

This pipeline assumes the following directory structure:

.
├── env
├── metadata
├── procdata
├── rawdata
├── renv
├── results
└── scripts

Please at minimum create the rawdata and metadata directories, as they are assumed to hold the raw microarray plate data (.CEL) and the pairs file, respectively. For more information on the correct formatting for your pairs file, please see https://github.com/gustaveroussy/EaCoN. The remaining missing directories will be created automatically as the pipeline runs.

config.yaml

This file hold the relevant pipeline documentation. Here you can specify the paths to all the parameters for your current pipeline use case. Documentation is provided in the config.yaml file on what each field should contain.

Using the Pipeline

Deployment

To run the pipeline end-to-end:

snakemake --cores <n_cores> --use-conda

Where <n_cores> is the number of cores to parallelize over.

For details on deploying this pipleine via your local HPC cluster or in the cloud, please consult the Snakemake documentation. Deployment to these platforms requires minimal additional configuration.

Individual Rules

The pipeline can also be run rule by rule using the rule names.

Batch Processing and Normalization

snakemake --cores 2 --use-conda batch_process_rawdata

Segmentation

snakemake --cores 2 --use-conda segment_processed_data

Copy Number Calling

snakemake --cores 2 --use-conda estimate_copy_number

Determine Optimal Value for Gamma

snakemake --cores 2 --use-conda select_optimal_gamma

Build Bioconductor RaggedExperiment Object

snakemake --cores 2 --use-conda build_ragged_experiment

Filter Samples Based on QC Criteria

snakemake --cores 2 --use-conda sample_quality_control

Code Snippets

renv::activate()
library(EaCoN)


# 0.2 -- Parse Snakemake arguments
input <- snakemake@input
params <- snakemake@params
nthreads <- snakemake@threads
output <- snakemake@output


# 0.3 -- Load platform specific dependencies
if (grepl("snp|cytoscan|oncoscan", params$array_type, ignore.case=TRUE))
    library(affy.CN.norm.data)

if (grepl('snp', params$array_type)) {
        library(apt.snp6.1.20.0)
        library(rcnorm)
        library(GenomeWideSNP.6.na35.r1)
}
## FIXME:: Add conditional dependencies for other platforms

switch(params$reference,
    'BSgenome.Hsapiens.UCSC.hg19'=library(BSgenome.Hsapiens.UCSC.hg19),
    'BSgenome.Hsapiens.UCSC.hg38'={
        if (grepl("snp", params$array_type, ignore.case=TRUE))
            stop("Must use BSgenome.Hsapiens.UCSC.hg19 for GenomeWide SNP6 arrays!")
        library(BSgenome.Hsapiens.UCSC.hg38)
    }
)


# 1 -- Load or create the metadata file specifying CEL paths
if (file.exists(input$pairs_file)) {
    pairs_df <- read.table(input$pairs_file, sep="\t", header=TRUE,
        stringsAsFactors=FALSE)
} else {
    # find all CEL files in the raw data
    cel_file_paths <- list.files(params$rawdata, pattern="*.CEL$",
        recursive=TRUE, full.names=TRUE)
    pairs_df <- data.frame(
        cel_files=cel_file_paths,
        # assumes the second element in path is the sample name
        SampleName=vapply(cel_file_paths, FUN=function(x) strsplit(x)[[1]][2],
            FUN.VALUE=character(1))
    )
    if (!is.null(input$pairs_file) || input$paris_file == "") {
        # create path if it doesn't exist
        pairs_path <- dirname(input$pairs_file)
        if (!file.exists(pairs_path)) dir.create(pairs_path, recursive=TRUE)
        # write out a pairs file
        write.table(pairs_df, input$pairs_file)
    }
}


# 2 -- Format the paths in the pairs_file to match this project directory
#   structure from config.yaml
if (grepl('cytocscan|oncoscan', params$array_type, ignore.case=TRUE)) {
    stopifnot(c("ATChannelCel", "GCChannelCel", "SampleName") %in% colnames(pairs_df))
    # remove existing path, if there is one
    pairs_df$ATChannelCel <- gsub("^.*\\/", "", pairs_df$ATChannelCel)
    pairs_df$GCChannelCel <- gsub("^.*\\/", "", pairs_df$GCChannelCel)
    # create a new path relative to specified rawdata directory
    pairs_df$GCChannelCel <- file.path(getwd(), params$rawdata,
        pairs_df$GCChannelCel)
} else if (grepl('snp6', params$array_type, ignore.case=TRUE)) {
    stopifnot(all(c("cel_files", "SampleName") %in% colnames(pairs_df)))
    # pairs_df$cel_files <- gsub("^.*\\/", "", pairs_df$cel_files)
    # pairs_df$cel_files <- file.path(getwd(), params$rawdata, pairs_df$cel_files)
}
# output the file to temporary storage so it can be read by EaCoN
pairs_file <- file.path(tempdir(), "CEL_pairs_file.csv")
write.table(pairs_df, file=pairs_file, sep="\t")


# 3 -- Preprocess and normalize the raw data; does
if (grepl('cytoscan', params$array_type, ignore.case=TRUE)) {
    EaCoN:::CS.Process.Batch(pairs_file,
        nthread=nthreads, out.dir=params$procdata, force=TRUE,
        cluter.type=params$cluster_type)
} else if (grepl('oncoscan', params$array_type, ignore.case=TRUE)) {
    EaCoN:::OS.Process.Batch(pairs_file,
        nthread=nthreads, out.dir=params$procdata, force=TRUE,
        cluster.type=params$cluster_type)
} else if (grepl('snp', params$array_type, ignore.case=TRUE)) {
    EaCoN:::SNP6.Process.Batch(pairs_file, out.dir=params$procdata, force=TRUE,
        nthread=nthreads, cluster.type=params$cluster_type)
} else if (grepl('wes', params$array_type, ignore.case=TRUE)) {
    stop("WES has not been implemented in this pipeline yet, please see
        https://github.com/gustaveroussy/EaCoN for information on
        setting up your own analysis script.")
} else {
    stop("Supported assay families are wes, cytoscan, oncoscan and snp6")
}

R BSgenome.Hsapiens.UCSC.hg38 EaCoN From line 2 of scripts/1_batchProcessRawdataFiles.R

renv::activate()
library(EaCoN)

# 0 -- Parse Snakemake arguments
input <- snakemake@input
params <- snakemake@params
nthreads <- snakemake@threads
output <- snakemake@output

# 1 -- L2R and BAF joint segment the preprocessed data from the previous rule
EaCoN:::Segment.ff.Batch(RDS.files=unlist(input), nthread=nthreads,
    segmenter=params$segmenter, smooth.k=params$smoothk,
    BAF.filter=params$BAF_filter, nrf=params$nrf, SER.pen=params$SER_pen,
    force=TRUE)

R EaCoN From line 2 of scripts/2_segmentProcessedData.R

renv::activate()
library(EaCoN)

# 0 -- Parse Snakemake arguments
input <- snakemake@input
params <- snakemake@params
nthreads <- snakemake@threads
output <- snakemake@output

# 2 -- Generate .html qc reports for

# 1 -- Make copy number calls based on L2R segmentation results
EaCoN:::ASCN.ff.Batch(unlist(input), nthread=nthreads, force=TRUE,
    gammaRange=unlist(params$gamma_range))

R EaCoN From line 2 of scripts/3_estimateCopyNumber.R

renv::activate()
library(data.table)


# -- 0.2 Parse snakemake arguments
input <- snakemake@input
params <- snakemake@params
output <- snakemake@output

# -- 0.3 Load local utility functions
source(file.path("scripts", "utils.R"))

# -- 1. Read in gamma files

gammaFiles <- list.files(params$out_dir, '.*gammaEval.*txt', recursive=TRUE,
    full.names=TRUE)

gamma_df <- rbindlist(
    setNames(
        lapply(gammaFiles, FUN=fread),
        gsub(".gammaEval.txt", "", basename(gammaFiles))
    ),
    idcol="sample_name"
)

# -- 2. Select the best fits
best_fits <- gamma_df[, .SD[which.max(GoF)], by="sample_name"]

# -- 3. Save the best fit data.frame to csv
fwrite(best_fits, file=output[[1]])

R Snakemake data.table From line 2 of scripts/4_selectOptimalGamma.R

renv::activate()
library(EaCoN)
library(data.table)
library(qs)
library(GenomicRanges)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(org.Hs.eg.db)
library(BiocParallel)
library(RaggedExperiment)

# -- 0.2 Parse snakemake parameters
input <- snakemake@input
params <- snakemake@params
output <- snakemake@output

# -- 0.3 Load utity functions
source(file.path("scripts", "utils.R"))

# -- 1. Load the optimal gamma for each sample
best_fits <- fread(input[[1]])

# -- 2. Find the .RDS files associated with the best fits
best_fit_files <- Map(
    function(x, y)
        grep(pattern=y, list.files(x, recursive=TRUE, full.names=TRUE), value=TRUE),
    x=file.path(params$out_dir, best_fits$sample_name),
    y=paste0(".*gamma", sprintf("%.2f", best_fits$gamma), "/.*RDS$")
)
l2r_files <- Map(
    function(x, y)
        grep(pattern=y, list.files(x, recursive=TRUE, full.names=TRUE), value=TRUE),
    x=file.path(params$out_dir, best_fits$sample_name),
    y=".*L2R/.*RDS$"
)

# -- 3. Load the best fit ASCN and L2R data and build GRanges objects

.build_granges_from_cnv <- function(ascn, l2r) {
    ascn_data <- readRDS(ascn)
    l2r_data <- readRDS(l2r)
    buildGRangesFromASCNAndL2R(ascn_data, l2r_data)
}

BPPARAM <- BiocParallel::bpparam()
BiocParallel::bpworkers(BPPARAM) <- params$nthreads
gr_list <- BiocParallel::bpmapply(.build_granges_from_cnv,
    best_fit_files, l2r_files,
    SIMPLIFY=FALSE, USE.NAMES=TRUE, BPPARAM=BPPARAM
)

# removing directory paths
names(gr_list) <- basename(names(gr_list))

# -- 4. Construct a RaggedExperiment object
ragged_exp <- as(GRangesList(gr_list), "RaggedExperiment")

# include annotated bins to summarize the RaggedExperiment with
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
genome_bins <- binReferenceGenome()
annotated_bins <- annotateGRangesWithTxDB(genome_bins, txdb=txdb)

# include annotated genes to summarized RaggedExperiment with
gene_granges <- genes(txdb)
annotated_genes <- annotateGRangesWithTxDB(gene_granges, txdb=txdb)

metadata(ragged_exp) <- list(
    annotated_genome_bins=annotated_bins,
    annotated_genes=annotated_genes,
    simplifyReduce=function(scores, ranges, qranges) {
        if (is.numeric(scores)) {
            x <- mean(scores, na.rm=TRUE)
        } else {
            count_list <- as.list(table(scores))
            x <- paste0(
                paste0(names(count_list), ":", unlist(count_list)),
                collapse=","
            )
        }
        return(x)
    }
)

# -- Save files to disk
qsave(ragged_exp, file=output[[1]], nthreads=params$nthread)

R Snakemake data.table org.Hs.eg.db TxDb.Hsapiens.UCSC.hg19.knownGene GenomicRanges EaCoN qs RaggedExperiment From line 2 of scripts/5_buildRaggedExperiment.R

renv::activate()
library(EaCoN)
library(Biobase)
library(RaggedExperiment)
library(data.table)
library(qs)

# 0.2 -- Parse snakemake parameters
input <- snakemake@input
params <- snakemake@params
output <- snakemake@output

# 1 -- Collect and parse QC data
qc_files <- list.files(params$procdata, ".*qc.txt", full.names=TRUE,
    recursive=TRUE)
dt_list <- lapply(qc_files, FUN=fread)
qc_dt <- rbindlist(dt_list)
qc_dt$sample_name <- gsub("_.*$", "", basename(qc_files))

# 2 -- Output the QC results
fwrite(qc_dt, output$qc_csv)

# 3 -- Load the RaggedExperiment and filter on QC criteria
ragged_exp <- qread(input$ragged_exp)

# 4 -- Identify samples passing QC
qc_dt <- qc_dt[
    MAPD <= params$mapd & `waviness-sd` <= params$ndwavinesssd &
        SNPQC >= params$snpqc,
]
if (params$cellularity > 0) {
    if ("TUSCAN-cellularity" %in% colnames(qcdt))
        qc_dt <- qc_dt[`TUSCAN-cellularity` >= params$cellularity]
}
keep_samples <- intersect(colnames(ragged_exp), qc_dt$sample_name)
print(keep_samples)

# 5 -- Subset RaggedExperiment
ragged_exp <- ragged_exp[, keep_samples]
metadata(ragged_exp)$qc_parameters <- params[-c(1, 2)]

# 6 -- Write RaggedExperiment to disk
qsave(ragged_exp, file=output$ragged_exp, nthreads=params$nthreads)

R Snakemake data.table Biobase EaCoN qs RaggedExperiment From line 2 of scripts/6_sampleQualityControl.R

script:
    "scripts/1_batchProcessRawdataFiles.R"

SnakeMake From line 73 of master/Snakefile

script:
    "scripts/2_segmentProcessedData.R"

SnakeMake From line 94 of master/Snakefile

script:
    "scripts/3_estimateCopyNumber.R"

SnakeMake From line 109 of master/Snakefile

script:
    "scripts/4_selectOptimalGamma.R"

SnakeMake From line 124 of master/Snakefile

script:
    "scripts/5_buildRaggedExperiment.R"

SnakeMake From line 139 of master/Snakefile

script:
    "scripts/6_sampleQualityControl.R"

SnakeMake From line 157 of master/Snakefile

script:
    "scripts/7_customTotalCopyCalls.R"

SnakeMake From line 170 of master/Snakefile

ShowHide 9 more snippets with no or duplicated tags.

Comments

Support

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Created: 1yr ago

Updated: 1yr ago

Maitainers: public

URL: https://github.com/bhklab/eacon_cnv_pipeline

Name: eacon_cnv_pipeline

Version: 1

Badge:

Insert copied code into your website to add a link to this workflow.

License: None

Keywords:

BSgenome.Hsapiens.UCSC.hg38 org.Hs.eg.db TxDb.Hsapiens.UCSC.hg19.knownGene Biobase BiocParallel GenomicRanges RaggedExperiment Snakemake data.table qs EaCoN

Future updates

Related Workflows

psychip_snakemake — Show Details View Workflow

ENCODE pipeline for histone marks developed for the psychENCODE project

public

psychip pipeline is an improved version of the ENCODE pipeline for histone marks developed for the psychENCODE project. The o...

raw sequence reads Alignment Sequence alignment report macs2 ucsc-bedclip bedGraphToBigWig BEDTools BWA Picard SAMtools Snakemake

Free

Near-real time tracking of SARS-CoV-2 in Connecticut

public

Repository containing scripts to perform near-real time tracking of SARS-CoV-2 in Connecticut using genomic data. This pipeli...

JSON nextclade Augur Biopython FOCUS Pandas Snakemake bs4 epiweeks geopy matplotlib numpy pycountry pycountry-convert uszipcode

Free

cellranger-snakemake-gke — Show Details View Workflow

snakemake workflow to run cellranger on a given bucket using gke.

public

A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...

macs2 ucsc-bedclip bedGraphToBigWig BEDTools BWA Picard SAMtools Snakemake

Free

ATLAS - Three commands to start analyzing your metagenome data

public

Metagenome-atlas is a easy-to-use metagenomic pipeline based on snakemake. It handles all steps from QC, Assembly, Binning, t...

raw sequence reads Genome assembly Annotation track checkm2 gunc prodigal snakemake-wrapper-utils MEGAHIT Atlas BBMap Biopython BioRuby Bwa-mem2 cd-hit CheckM DAS Diamond eggNOG-mapper v2 MetaBAT 2 Minimap2 MMseqs MultiQC Pandas Picard pyfastx SAMtools SemiBin Snakemake SPAdes SqueezeMeta TADpole VAMB CONCOCT ete3 gtdbtk h5py networkx numpy plotly psutil utils metagenomics

Free

175

rna-seq-star-deseq2 — Show Details View Workflow

RNA-seq workflow using STAR and DESeq2

public

This workflow performs a differential gene expression analysis with STAR and Deseq2. The usage of this workflow is described ...

Free

dna-seq-gatk-variant-calling — Show Details View Workflow

This Snakemake pipeline implements the GATK best-practices workflow

public

This Snakemake pipeline implements the GATK best-practices workflow for calling small germline variants. The usage of thi...

VCF raw sequence reads Variant calling genetic variants gatk rust-bio-tools snakemake-wrapper-utils tabix BCFtools BWA FastQC MultiQC Pandas Picard SAMtools Snakemake Trimmomatic Variant Effect Predictor (VEP) common matplotlib numpy seaborn DNA

Free