Installation and Execution Guide for QDNAseq.snakemake Pipeline
Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
For the installation of this pipeline any Python install compatable Conda is required.
The pipeline itself will run on Python 3.8.5 and R 3.6.3. For exact dependencies view
environment.yaml
and
r-dependencies.R
.
Using Conda/Mamba
for easy installation you need (Mini)Conda.
Miniconda installation from folder where you want to install Miniconda:
cd </path/to/files/dir/>
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
follow the instructions of the installation process, give the location where you want Miniconda to be installed and answer YES to add Miniconda to your path.
go to the directory where the analysis need to be performed
cd </path/to/analysis/dir>
git clone https://github.com/tgac-vumc/QDNAseq.snakemake/
cd QDNAseq.snakemake
install Mamba as drop-in replacement for Conda with Mamba's improved installation-performance:
conda install -c conda-forge mamba
create the environment using Mamba:
mamba env create --name QDNAseq-snakemake --file environment.yaml
activate the environment by:
conda activate QDNAseq-snakemake
Then run the R-script r-dependencies.R in the terminal to install the non-conda R dependencies in the environment:
Rscript r-dependencies.R
Using Singularity
Under development
Preparing analysis
Prepare the data
go to analysis dir and prepare analysis by copy or create links to fastq.gz files:
cd </path/to/analysis/dir>
mkdir fastq
cd fastq
to link a single file:
ln -s <path/to/file>
to link all files from a folder:
for file in <path/to/fastq/files>/*.fastq.gz
do ln -s $file
done
Prepare the snakemake settings
Open the configuration file
config.yaml
to check the settings that snakemake will use and change according to your needs. For providing service-analysis, set
setting
to
'service'
. For research purposes, set
setting
to
'research'
. For all settings set
setting
to
'all'
.
One of the options in the configfile is
dewaving
, if set to
'true'
QNDAseq objects will be dewaved before segmentation.
These options change the rules performed in the pipeline, see the rule-graph in the next section.
Running analysis
Make sure that snakemake is able to find the excecutive file Snakefile by performing a dry-run:
cd ../QDNAseq.snakemake
snakemake -n
Check the rules that are planned to be performed, conform the rule-graph.
An visualization of the order of rules to be performed can be viewed by running the following command and opening the DAG-file
snakemake --forceall --rulegraph | dot -Tsvg > DAG.svg
Rulegraphs for the intial settings
'service'
,
'research'
and
'all'
are commited to this repro in the files
DAG_<setting>.svg
.
When ready, run the analysis
snakemake
Useful snakemake options
-j , --cores, --jobs
: Use at most N cores in parallel (default: 1). If N is omitted, the limit is set to the number of available cores.
-n , --dryrun
: Do not execute anything. but show rules which are planned to be performed.
-k , --keep-going
: Go on with independent jobs if a job fails.
-f , --force
: Force the execution of the selected target or the first rule regardless of already created output.
-R , --forcerun
: Force the re-execution or creation of the given rules or files. Use this option if you changed a rule and want to have all its output in your workflow updated.
-U , --until
: Runs the pipeline until it reaches the specified rules or files. Only runs jobs that are dependencies of the specified rule or files, does not run sibling DAGs.
for all options go to https://snakemake.readthedocs.io/en/v5.31.1/executing/cli.html#all-options
Code Snippets
9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 | msg <- snakemake@params[["suppressMessages"]] if (msg){ suppressMessages(library(QDNAseq)) suppressMessages(library(stringr)) } else{ library(QDNAseq) library(stringr) } source('scripts/ACE.R', echo = !msg) ploidies<-as.integer(snakemake@wildcards[["ploidy"]]) inputfile <-snakemake@input[["segmented"]] outputdir<-snakemake@params[["outputdir"]] failed <- snakemake@params[["failed"]] log<-snakemake@log[[1]] fitpicker <- snakemake@output[["fitpicker"]] # Insert by Erik 2021-01-27 imagetype <- snakemake@config[["ACE"]][["imagetype"]] method<-snakemake@config[["ACE"]][["method"]] penalty<-as.numeric(snakemake@config[["ACE"]][["penalty"]]) cap<-as.integer(snakemake@config[["ACE"]][["cap"]]) trncname<- snakemake@config[["ACE"]][["trncname"]] printsummaries<- snakemake@config[["ACE"]][["printsummaries"]] copyNumbersSegmented <- readRDS(inputfile) parameters <- data.frame(options = c("ploidies","imagetype","method","penalty","cap","trncname","printsummaries"), values = c(paste0(ploidies,collapse=", "),imagetype,method,penalty,cap,trncname,printsummaries)) #write.table(parameters, file=log, quote = FALSE, sep = "\t", na = "", row.names = FALSE) write.table(parameters, file=log, quote = FALSE, sep = "\t", na = "", col.names = NA) ploidyplotloop(copyNumbersSegmented ,outputdir , ploidies,imagetype,method,penalty,cap,trncname,printsummaries) # 1 to 3 dir.create warnings and 4 to 15 removed rows warnings, output gives NULL #create output for failed samples - for snakemake compatibility. failed_samples<-read.table(failed, stringsAsFactors=FALSE, header=TRUE) if(length(failed_samples[,1]>0)){ for(file in failed_samples[,1]){file.create(paste(outputdir, ploidies,"N/", file,"/summary_",file,".",imagetype,sep=""))} } |
8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 | msg <- snakemake@params[["suppressMessages"]] if (msg){ suppressMessages(library(QDNAseq)) suppressMessages(library(Biobase)) suppressMessages(library(CGHcall)) suppressMessages(library(CGHregions)) suppressMessages(library(CGHtest)) } else{ library(QDNAseq) library(Biobase) library(CGHcall) library(CGHregions) library(CGHtest) } source('scripts/CGHcallPlus.R', echo = FALSE) recalled <- snakemake@input[["recalled"]] RegionsCGH<-snakemake@output[["RegionsCGH"]] profiles <- snakemake@output[["profiles"]] averr <- snakemake@params[["averr"]] #0.0075 # default = 0.01 log<-snakemake@log[[1]] log<-file(log, open="wt") sink(log, append=T, split=FALSE) ############################################################################################################## # CGH regions & frequency plot regions ############################################################################################################## QCN.reCalled <- readRDS(recalled) # Make CGH CGH_reCalledRCs <- makeCgh(QCN.reCalled) # perform CGHregions reCalledRegionsCGH <- CGHregions(CGH_reCalledRCs, averror=averr) # save data: saveRDS(reCalledRegionsCGH, RegionsCGH) # frequency plot reCalled (CGH) regions #filename <- paste('profiles/freqPlot/freqPlotREGIONS_', bin , 'kbp.png', sep='') png(profiles, res=300, width=14, height=7, unit='in') frequencyPlot(reCalledRegionsCGH) dev.off() |
5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 | msg <- snakemake@params[["suppressMessages"]] if (msg){ suppressMessages(library(CGHtest)) suppressMessages(library(readxl)) } else{ library(CGHtest) library(readxl) } source("scripts/CGHtest_functions.R", echo = FALSE) ########## # input file paths CGHregions_RDS_file_path = '../../../output-CGH/100kbp/data/100kbp-RegionsCGH.rds' CGHregions_RDS_file_path <- snakemake@input[["RegionsCGH"]] freqPlotCompare_output <- snakemake@output[["freqPlotCompare"]] CGHtest_output <- snakemake@output[["CGHtest"]] plotPFDR_output <- snakemake@output[["plotPFDR"]] combined_output <- snakemake@output[["combined"]] # params path outputdir <- snakemake@params[["outputdir"]] clinicaldataPath <- 'clinical.xlsx' clinicaldataPath <- snakemake@params[["clinicaldataPath"]] clinicaldata_col_snames <- 1 clinicaldata_col_snames <- snakemake@params[["columnSampleNames"]] set_clinicaldata_class_samples <- c('short','long') set_clinicaldata_class_samples <- snakemake@params[["ClassSamples"]] clinicaldata_col_class_samples <- 3 clinicaldata_col_class_samples <- snakemake@params[["columnClassSamples"]] log<-snakemake@log[[1]] log<-file(log, open="wt") sink(log, append=T, split=FALSE) ################## if(!dir.exists(outputdir)){dir.create(outputdir)} # read CGHregions RDS CGHregions_RDS_file <- readRDS(CGHregions_RDS_file_path) # get sample names from CGHregions snames <- sampleNames(CGHregions_RDS_file) # read clinical data clinicaldata <- as.data.frame(read_excel(clinicaldataPath)) clinicaldata_snames <- clinicaldata[,clinicaldata_col_snames] if (!isTRUE(all.equal(sort(snames),sort(clinicaldata_snames)))){ print(data.frame('Clinical Sample Names' = clinicaldata_snames, 'CGHregions Sample Names' = snames)) stop('Sample Names in clinical data and CGHregions RDS file do not match!') } # match sample names from CGHregions with sample names in clinical data snames_i <- match(clinicaldata_snames, snames) # redefine clinical data matching the CGHregions names clinicaldata <- clinicaldata[snames_i,] # define clinical data classes and match with set classes clinicaldata_class_samples_all <- clinicaldata[,clinicaldata_col_class_samples] clinicaldata_class_samples_unique <- unique(clinicaldata_class_samples_all) tempa <- clinicaldata_class_samples_unique tempb <- set_clinicaldata_class_samples if (!isTRUE(all.equal(sort(tempa),sort(tempb)))){ print(data.frame('Classes in clinical data' = tempa, 'Classes defined by user in config' = tempb)) stop('Classes in the clinical data do not match the classes defined for the CGHtest-setting in the configuration-file!') } # define groups to compare group1_index <- which(clinicaldata_class_samples_all == set_clinicaldata_class_samples[1]) # c(4,5,8,11,14,17) for clinical.xlsx group2_index <- which(clinicaldata_class_samples_all == set_clinicaldata_class_samples[2]) # c(1,2,3,6,7,9,10,12,13,15,16,18) for clinical.xlsx # subset data group1 <- CGHregions_RDS_file[,group1_index] group2 <- CGHregions_RDS_file[,group2_index] # Compare 2 cohorts teststat <- "Chi-square" compareCNAs2cohorts(x = CGHregions_RDS_file, data1 = group1, data2 = group2, cohort1_ID = "group1", cohort2_ID = "group2", teststat, output_freqPlotCompare = freqPlotCompare_output, output_CGHtest = CGHtest_output, output_plotPFDR = plotPFDR_output, output_combined = combined_output) |
8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 | msg <- snakemake@params[["suppressMessages"]] if (msg){ suppressMessages(library(QDNAseq)) suppressMessages(library(Biobase)) suppressMessages(library(CGHcall)) suppressMessages(library(CGHtest)) suppressMessages(library(denstrip)) } else{ library(QDNAseq) library(Biobase) library(CGHcall) library(CGHtest) library(denstrip) } source('scripts/CGHcallPlus.R', echo = !msg) source('scripts/plotQDNAseq.R', echo = !msg) called <- snakemake@input[["called"]] fitpickertable<- snakemake@input[["fitpicker"]] recalled <- snakemake@output[["recalled"]] #copynumbers<-snakemake@output[["copynumbers"]]#moved to QDNAseq-segment #segments<-snakemake@output[["segments"]] #moved to QDNAseq-segment calls<-snakemake@output[["calls"]] profiles <- snakemake@params[["profiles"]] #copynumbersbed<-snakemake@params[["copynumbersbed"]] #moved to QDNAseq-segment #segmentsbed<-snakemake@params[["segmentsbed"]] #moved to QDNAseq-segment failed <- snakemake@params[["failed"]] #bedfolder <- snakemake@params[["bedfolder"]] freqplot<- snakemake@output[["freqplot"]] minimum_cellularity <- snakemake@params[["minimum_cellularity"]] log<-snakemake@log[[1]] log<-file(log, open="wt") sink(log, append=TRUE, split=FALSE) ############################################################################################################## # Call data & frequency plot calls ############################################################################################################## #First part is to call with altered cellularity using chgCall. # load data QCN.fcnsdsn <- readRDS(called) fitpicker<- read.delim(fitpickertable, stringsAsFactors=FALSE) fitpicker$likely_fit[fitpicker$likely_fit < minimum_cellularity] <- minimum_cellularity #perform functions # Call with correction for cellularity reCalled <- callBins(QCN.fcnsdsn, nclass=5, cellularity=fitpicker$likely_fit) saveRDS(reCalled, recalled) ############################################################################################################## # Create profiles and frequency plot ############################################################################################################## fitpicker$likely_fit<-paste0("cellularity: ",fitpicker$likely_fit ) if(!dir.exists(profiles)){dir.create(profiles)} plotQDNAseq(reCalled, profiles , subtitle=fitpicker$likely_fit) # frequency plot calls png(freqplot, res=300, width=14, height=7, unit='in') frequencyPlot(reCalled) dev.off() ############################################################################################################## # Create IGV objects from readcounts ############################################################################################################## #exportBins(reCalled, copynumbers, format="igv", type="copynumber") moved to CNA segments #exportBins(reCalled, segments, format="igv", type="segments") moved to CNA segments exportBins(reCalled, calls, format="igv", logTransform=FALSE, type="calls") #exportBins(reCalled, file=copynumbersbed, format="bed", logTransform=TRUE, type="copynumber") #exportBins(reCalled, file=segmentsbed, format="bed", type="segments") #create output for failed samples - for snakemake compatibility. failed_samples<-read.table(failed, stringsAsFactors=FALSE, header=TRUE) if(length(failed_samples[,1]>0)){for(file in failed_samples[,1]){file.create(paste(profiles, file,".png",sep="")) #file.create(paste(bedfolder, file,"-copynumbers.bed",sep="")) #file.create(paste(bedfolder, file,"-segments.bed",sep="")) }} |
8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 | msg <- snakemake@params[["suppressMessages"]] if (msg){ suppressMessages(library(QDNAseq)) suppressMessages(library(Biobase)) suppressMessages(library(CGHcall)) suppressMessages(library(CGHtest)) } else{ library(QDNAseq) library(Biobase) library(CGHcall) library(CGHtest) } source('scripts/CGHcallPlus.R', echo = !msg) source('scripts/plotQDNAseq.R', echo = !msg) segmented <- snakemake@input[["segmented"]] bin <- as.integer(snakemake@wildcards[["binSize"]]) called <- snakemake@output[["called"]] profiles <- snakemake@params[["profiles"]] freqplots <- snakemake@output[["freqplot"]] failed <- snakemake@params[["failed"]] calls<-snakemake@output[["calls"]] log<-snakemake@log[[1]] log<-file(log, open="wt") sink(log, append=TRUE, split=FALSE) ############################################################################################################## # Call data & frequency plot calls ############################################################################################################## # load data QCN.fcnsdsn <- readRDS(segmented) # perform functions # Call withouth correction for cellularity QCN.fcnsdsnc <- callBins(QCN.fcnsdsn, nclass=5) saveRDS(QCN.fcnsdsnc, called) # Plot called profiles plotQDNAseq(QCN.fcnsdsnc, profiles) # frequency plot calls png(freqplots, res=300, width=14, height=7, unit='in') frequencyPlot(QCN.fcnsdsnc, losscol='blue', gaincol='red', delcol="darkblue", ampcol="darkred") dev.off() #create output for failed samples - for snakemake compatibility. failed_samples<-read.table(failed, stringsAsFactors=FALSE, header=TRUE) if(length(failed_samples[,1]>0)){for(file in failed_samples[,1]){file.create(paste(profiles, file,".png",sep="")) }} ############################################################################################################## # Create IGV objects ############################################################################################################## exportBins(QCN.fcnsdsnc, calls, format="igv", logTransform=FALSE, type="calls") |
8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 | msg <- snakemake@params[["suppressMessages"]] if (msg){ suppressMessages(library(QDNAseq)) suppressMessages(library(Biobase)) # for dewaving: suppressMessages(library(limma)) suppressMessages(library(gtools)) suppressMessages(library(impute)) suppressMessages(library(MASS)) } else{ library(QDNAseq) library(Biobase) # for dewaving: library(limma) library(gtools) library(impute) library(MASS) } source("scripts/functions.R", echo = !msg) source("scripts/plotQDNAseq.R", echo = !msg) sourceDir(snakemake@config[["QDNAseq"]][["dewave_dir"]], echo = !msg) load(snakemake@params[["dewave_data"]], verbose = !msg) bin <- as.integer(snakemake@wildcards[["binSize"]]) corrected <- snakemake@input[["corrected"]] profiles <- snakemake@params[["profiles"]] dewaved <- snakemake@output[["dewaved"]] log<-snakemake@log[[1]] log<-file(log, open="wt") sink(log, append=T , split=FALSE) ############################################################################################################## # Correct, Normalize & Dewave raw data ############################################################################################################## QCN.fcns <- readRDS(corrected) if(bin!=1000){ QCN.fcns <- dewaveBins(QCN.fcns) } saveRDS(QCN.fcns, dewaved) plotQDNAseq(QCN.fcns, profiles) |
9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 | msg <- snakemake@params[["suppressMessages"]] if (msg){ suppressMessages(library(QDNAseq)) suppressMessages(library(CGHregions)) } else{ library(QDNAseq) library(CGHregions) } source("scripts/addCytobands.R", echo = !msg) source("scripts/CGHcallPlus.R", echo = !msg) RegionsCGH<-snakemake@input[["RegionsCGH"]] max.focal.size.mb<-snakemake@params[["max_focal_size_mb"]] min.freq.focal<-snakemake@params[["min_freq_focal"]] allRegions<-snakemake@output[["allRegions"]] allFocalRegions<-snakemake@output[["allFocalRegions"]] cytoband_data<-snakemake@params[["cytobands"]] log<-snakemake@log[[1]] log<-file(log, open="wt") sink(log, append=T, split=FALSE) ############################################################################################################## # Create CGHregion BED file + Annotate ############################################################################################################## # load data QCN.Regions <- readRDS(RegionsCGH) # create BED file from CGHregions makeCGHregionsTable <- function(CGHregions, max.focal.size.mb=3, min.freq.focal=25, output_allRegions, output_allFocalRegions, cytoband_data){ # all regions in BED format regionsBED <- fData(CGHregions)[,1:3] # region sizes regionSize <- fData(CGHregions)[,3] - fData(CGHregions)[,2] regionsBED <- cbind(regionsBED, regionSize) # add cytoband info regionsBED <- addCytobands(regionsBED, cytoband_data) # frequencies of gains and losses calls <- regions(CGHregions) loss.freq <- rowMeans(calls < 0) gain.freq <- rowMeans(calls > 0) # add to 1 table regionData <- cbind(regionsBED, loss.freq, gain.freq) colnames(regionData) <- c('chromo', 'start', 'end', 'Region size (Mb)', 'Cytoband', 'loss (%)', 'gain (%)') # round integers: regionData[,4] <- round(regionData[,4]/1000000, digits=2) regionData[,c(6,7)] <- round(regionData[,c(6,7)]*100, digits=2) # save write.table(regionData, output_allRegions, sep='\t', row.names=F, col.names=T, quote=F) # make separate table for focals only focal.ix <- which(regionData[,4] <= max.focal.size.mb) focalRegions <- regionData[focal.ix, ] # and in more than 25% of samples loss.cutoff.ix <- which(focalRegions[,6] > min.freq.focal) gain.cutoff.ix <- which(focalRegions[,7] > min.freq.focal) cutoff.ix <- sort( c(loss.cutoff.ix, gain.cutoff.ix )) focalRegions <- focalRegions[cutoff.ix, ] if(nrow(focalRegions) != 0 ){rownames(focalRegions) <- c(1:nrow(focalRegions))} write.table(focalRegions, output_allFocalRegions, sep='\t', row.names=F, col.names=F, quote=F) } makeCGHregionsTable(QCN.Regions, max.focal.size.mb, min.freq.focal, allRegions, allFocalRegions, cytoband_data) |
9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 | msg <- snakemake@params[["suppressMessages"]] if (msg){ suppressMessages(library(QDNAseq)) suppressMessages(library(denstrip)) suppressMessages(library(CGHcall)) } else{ library(QDNAseq) library(denstrip) library(CGHcall) } source("scripts/functions.R", echo = !msg) source("scripts/addCytobands.R", echo = !msg) source("scripts/CGHcallPlus.R", echo = !msg) recalled <- snakemake@input[["recalled"]] beddir <- snakemake@params[["beddir"]] cytoband_data<-snakemake@params[["cytobands"]] max.focal.size.mb<-snakemake@params[["max_focal_size_bed"]] failed <- snakemake@params[["failed"]] log<-snakemake@log[[1]] log<-file(log, open="wt") sink(log, append=TRUE, split=FALSE) ############################################################################################################## QCN.reCalled <- readRDS(recalled) # create BED files makeCNAbedFile <- function(CalledQDNAseqReadCounts, max.focal.size.mb=3, beddir,cytoband_data){ for (i in 1:length(sampleNames(CalledQDNAseqReadCounts))){ # sample <- CalledQDNAseqReadCounts[,i] sname <- sampleNames(sample) # print print( paste('Creating BED files of all CNAs in sample: ', sname, sep='') ) # get all call and segment values of CNAs in the sample allCNA.ix <- which(calls(sample)!=0) allCNAs <- calls(sample)[ allCNA.ix ] seg.allCNA <- segmented(sample)[ allCNA.ix ] # coordinates of all CNAs reg.allCNAs <- fData(sample)[ allCNA.ix, 1:3] # Put together in BED format allCNAsPerBin <- cbind(reg.allCNAs, allCNAs, seg.allCNA) colnames(allCNAsPerBin)[4] <- 'call' colnames(allCNAsPerBin)[5] <- 'segment' bedfile<-paste(beddir, sname, '_allCNAsPerBin.bed', sep="") # save as BED file; with or without column names..? write.table(allCNAsPerBin, bedfile, sep='\t', row.names=F, col.names=T, quote=F) # create BED file for all called regions: # collapse all adjacent bins based on unique segment values coll.calls.bed <- c() #all_segments <- uniquecreate.file()(allCNAsPerBin$segment) if(length(unique(allCNAsPerBin$segment)) !=0){ for (j in 1:length(unique(allCNAsPerBin$segment))){ # first segment SEG.val <- as.numeric( unique(allCNAsPerBin$segment)[j] ) # if the unique segment value occurs more often, then for each one create an entry intervals <- seqToIntervals(which(allCNAsPerBin$segment== SEG.val )) allCNAsPerBin[allCNAsPerBin$chromosome=="X","chromosome"]<-23 allCNAsPerBin[allCNAsPerBin$chromosome=="Y","chromosome"]<-24 for (z in 1:nrow(intervals)){ start.ix <- intervals[z,1] end.ix <- intervals[z,2] SEG.start <- as.numeric( allCNAsPerBin[start.ix,2] ) SEG.end <- as.numeric( allCNAsPerBin[end.ix,3] ) SEG.size <- as.numeric( allCNAsPerBin[end.ix,3] - allCNAsPerBin[start.ix,2] ) SEG.chromo <- as.numeric(allCNAsPerBin[end.ix,1]) SEG.call <- as.numeric( allCNAsPerBin[end.ix,4] ) # check if chromosome is the same; if not the bin spans more chromos and should be separated if(allCNAsPerBin[start.ix, 1] != allCNAsPerBin[end.ix, 1]){ print(paste(j, 'WARNING: Segment ', SEG.val, ' spans more than 1 chromosome', sep=''))} SEG.coll <- cbind(SEG.chromo, SEG.start, SEG.end, SEG.size, SEG.val, SEG.call) coll.calls.bed <- rbind(coll.calls.bed, SEG.coll) } } # reorder dataframe CNAbed <- coll.calls.bed CNAbed <- CNAbed [ order(CNAbed[,1], CNAbed[,2]), ] # if there is only 1 CNA in sample, transform the dataframe if (nrow(coll.calls.bed )==1) { CNAbed <- as.data.frame(t(CNAbed)) } else { CNAbed <- as.data.frame(CNAbed) } CNAbed[,4] <- round( CNAbed[,4]/1000000, digits = 2) CNAbed[,5] <- round( CNAbed[,5], digits=2) # add cytoband info CNAbed <- addCytobands(CNAbed, cytoband_data) CNAbed <- CNAbed[, c(1:4,7,5,6)] }else{CNAbed<- data.frame(matrix(ncol = 7, nrow = 0))} colnames(CNAbed) <- c('chromo', 'start', 'end', 'cytobands','sizeInMb', 'segment', 'call') fileName <- paste(beddir, sname, '_CNAs.bed', sep='') write.table(CNAbed, fileName, sep='\t', row.names=F, col.names=T, quote=F) # make a separate BED file with only focal CNAs (< 3 Mb) focal.ix <- which(CNAbed[,4] <= max.focal.size.mb) focalCNAs <- CNAbed[focal.ix, ] focalfileName <- paste(beddir, sname, '_focalCNAs.bed', sep='') write.table(focalCNAs, focalfileName, sep='\t', row.names=F, col.names=F, quote=F) } } makeCNAbedFile(QCN.reCalled, max.focal.size.mb, beddir, cytoband_data) #create output for failed samples - for snakemake compatibility. failed_samples<-read.table(failed, stringsAsFactors=FALSE, header=TRUE) if(length(failed_samples[,1]>0)){for(file in failed_samples[,1]){ file.create(paste(beddir, file, '_allCNAsPerBin.bed', sep="")) file.create(paste(beddir, file, '_CNAs.bed', sep='')) file.create(paste(beddir, file, '_focalCNAs.bed', sep='')) }} |
9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 | msg <- snakemake@params[["suppressMessages"]] if (msg){ suppressMessages(library(QDNAseq)) } else{ library(QDNAseq) } source('scripts/ACE.R', echo = !msg) ## Erik #copyNumbersSegmented <- snakemake@output[["ACE_post"]] ## inputfile <-snakemake@input[["segmented"]] outputdir<-snakemake@params[["outputdir"]] failed <- snakemake@params[["failed"]] fitpickertable<-snakemake@input[["fitpicker"]] ploidies<-as.integer(snakemake@wildcards[["ploidy"]]) imagetype <- snakemake@config[["ACE"]][["imagetype"]] trncname<- snakemake@config[["ACE"]][["trncname"]] copyNumbersSegmented <- readRDS(inputfile) postanalysisloop(copyNumbersSegmented, modelsfile=fitpickertable, imagetype=imagetype, outputdir=outputdir, trncname=trncname) # if Rplots.pdf exists rename from QDNAseq-snakemake directory to outputdir if (file.exists("./Rplots.pdf")){ file.rename(from = "./Rplots.pdf", to = paste0(outputdir,"all_samples.pdf")) } failed_samples<-read.table(failed, stringsAsFactors=FALSE, header=TRUE) if(length(failed_samples[,1]>0)){ for(file in failed_samples[,1]){file.create(paste(outputdir,"segmentfiles/",file,"_segments.tsv",sep=""))} } |
8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 | msg <- snakemake@params[["suppressMessages"]] if (msg){ suppressMessages(library(QDNAseq)) suppressMessages(library(Biobase)) suppressMessages(library(R.cache)) } else{ library(QDNAseq) library(Biobase) library(R.cache) } setCacheRootPath(path="../.Rcache") bam <- snakemake@input[["bams"]] bin <- as.integer(snakemake@wildcards[["binSize"]]) genome <- snakemake@params[["genome"]] binReadCounts <- snakemake@output[["binReadCounts"]] ############################################################################################################## # Get bin annotations and bin read counts ############################################################################################################## #custom bin annotations for chrX used - did this whole script by hand for 100kbp See creationofblacklist.R bins <- getBinAnnotations(bin, genome=genome) QRC <- binReadCounts(bins, bamfiles=bam, cache=TRUE) sub("(_[ACGT]+)?(_S\\d+)?(_L\\d{3})?_R\\d{1}_\\d{3}(\\.f(ast)?q\\.gz)?$", "", sampleNames(QRC)) -> samples if (sum(duplicated(samples)) > 0) { QRC <- poolRuns(QRC, samples=samples, force=TRUE) } saveRDS(QRC, binReadCounts) |
8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 | msg <- snakemake@params[["suppressMessages"]] if (msg){ suppressMessages(library(QDNAseq)) suppressMessages(library(Biobase)) } else{ library(QDNAseq) library(Biobase) } source("scripts/functions.R", echo = !msg) source("scripts/plotQDNAseq.R", echo = !msg) binReadCounts <- snakemake@input[["binReadCounts"]] bin <- as.integer(snakemake@wildcards[["binSize"]]) corrected <- snakemake@output[["corrected"]] profiles <- snakemake@params[["profiles"]] chrom_filter <- c(snakemake@params[["chrom_filter"]]) log<-snakemake@log[[1]] log<-file(log, open="wt") sink(log, append=T , split=FALSE) ############################################################################################################## # Correct, Normalize & Dewave raw data ############################################################################################################## QRC <- readRDS(binReadCounts) #did these steps manualy as is described in second part creationofblacklist.R for samples with chrX. QRC.f <- applyFilters(QRC, residual=TRUE, blacklist=TRUE, mappability=FALSE, bases=FALSE , chromosomes=chrom_filter) #featureData(QRC.f)$use[c(22232:22237)]<-F # segment 14:22400001-22500000 t/m 14:22900001-23000000 on chr14 QRC.f <- estimateCorrection(QRC.f) #QRC.f <- applyFilters(QRC, residual=TRUE, blacklist=TRUE, mappability=FALSE, bases=FALSE) - used for bia-ALCL (2019-07-09 not) #QRC.f <- estimateCorrection(QRC.f, residual=TRUE, blacklist=TRUE, mappability=FALSE, bases=FALSE) - used for bia-ALCL(2019-07-09 not) #QRC.f <- applyFilters(QRC.f, residual=TRUE, blacklist=TRUE, mappability=FALSE, bases=FALSE, chromosome="Y") - used for bia-ALCL(2019-07-09 not) QCN.fc <- correctBins(QRC.f) QCN.fcn <- normalizeBins(QCN.fc) QCN.fcns <- smoothOutlierBins(QCN.fcn) saveRDS(QCN.fcns, corrected) plotQDNAseq(QCN.fcns, profiles) |
8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 | msg <- snakemake@params[["suppressMessages"]] if (msg){ suppressMessages(library(QDNAseq)) suppressMessages(library(Biobase)) } else{ library(QDNAseq) library(Biobase) } source("scripts/plotQDNAseq.R", echo = !msg) dewaved<- snakemake@input[["dewaved"]] bin <- as.integer(snakemake@wildcards[["binSize"]]) segmented <- snakemake@output[["segmented"]] profiles <- snakemake@params[["profiles"]] failed<- snakemake@params[["failed"]] min_used_reads<-snakemake@params[["minimal_used_reads"]] copynumbersbed<-snakemake@params[["copynumbersbed"]] segmentsbed<-snakemake@params[["segmentsbed"]] copynumbers<-snakemake@output[["copynumbers"]] segments<-snakemake@output[["segments"]] bedfolder <- snakemake@params[["bedfolder"]] log<-snakemake@log[[1]] log<-file(log, open="wt") sink(log, append=TRUE , split=FALSE) # Adjust segmentation settings based on binsize if (bin==15) {SDundo=0.75; alph=1e-15} if (bin==30) {SDundo=0.75; alph=1e-15} if (bin==100) {SDundo=0.10; alph=1e-20} # default = 1e-20 # 0.01 used for PELLL_FS8_a0.01 if (bin==1000) {SDundo=0.10; alph=1e-20} # TODO: mogelijk hogere alpha nodig, om meer segmenten te krijgen: 0.01 (1e-2). Deze setting used in tmp_100kbp # load data QCN.fcnsd <- readRDS(dewaved) QCN.fcnsds <- segmentBins(QCN.fcnsd[,QCN.fcnsd$used.reads > min_used_reads ], undo.splits='sdundo', undo.SD=SDundo, alpha=alph, transformFun="sqrt") # gives 'Performing segmentation: NA QCN.fcnsdsn <- normalizeSegmentedBins(QCN.fcnsds) saveRDS(QCN.fcnsdsn, segmented) ############################################################################################################## # Create profiles ############################################################################################################## plotQDNAseq(QCN.fcnsdsn, profiles) #create output for failed samples - for snakemake compatibility. littledata<-colnames(QCN.fcnsd[,QCN.fcnsd$used.reads <= min_used_reads ]) if(length(littledata>0)){for(file in littledata){file.create(paste(profiles, file,".png",sep="")) file.create(paste(bedfolder, file,"-copynumbers.bed",sep="")) file.create(paste(bedfolder, file,"-segments.bed",sep="")) }} write.table(littledata, file=failed ) ############################################################################################################## # Create IGV objects and bedfiles from readcounts ############################################################################################################## exportBins(QCN.fcnsdsn, copynumbers, format="igv", type="copynumber") exportBins(QCN.fcnsdsn, segments, format="igv", type="segments") exportBins(QCN.fcnsdsn, file=copynumbersbed, format="bed", logTransform=TRUE, type="copynumber") exportBins(QCN.fcnsdsn, file=segmentsbed, format="bed", type="segments") |
74 75 76 77 | shell: "bwa aln -n {params.n} -t {threads} -q {params.q} {params.ref} {input} > {output.sai} 2> {log}; " "bwa samse -f {output.samse} -r '@RG\\tID:{wildcards.sample}\\tSM:{wildcards.sample}'" " {params.ref} {output.sai} {input} 2>> {log}" |
88 89 | shell: "samtools view -uS {input.samse} 2> {log}| samtools sort - -o {output} 2>> {log}" |
100 101 102 103 | shell: "picard MarkDuplicates I={input} O={output.bam} M={output.metrics_file}" " ASSUME_SORTED=true CREATE_INDEX=true VALIDATION_STRINGENCY=LENIENT USE_JDK_DEFLATER=true USE_JDK_INFLATER=true &> {log}; " "ln -s {output.bai} {output.bambai}" |
113 114 | shell: "{input.script} {input.bam} {wildcards.sample} {params.outdir} {output}" |
125 126 | script: "scripts/Run_QDNAseq_binReadCounts.R" |
139 140 | script: "scripts/Run_QDNAseq_normalize.R" |
153 154 | script: "scripts/Run_deWave.R" |
175 176 | script: "scripts/Run_QDNAseq_segment.R" |
191 192 | script: "scripts/Run_CNA_call.R" |
209 210 | script: "scripts/Run_CNA_call_cellularity_based.R" |
227 228 | script: "scripts/Run_makeCNAbedFile.R" |
241 242 | shell: "{input.script} {input.bedfile} {wildcards.sample} {params.outdir} {output} {log} 2> {log}" |
254 255 | script: "scripts/Run_CGHregions.R" |
269 270 | script: "scripts/Run_makeCGHregionstable.R" |
282 283 | shell: "{input.script} {input.bedfile} {params.filename} {params.outdir} {output} 2> {log}" |
294 295 | shell: "{input.script} {params.profiles} {params.lb2dir} > {output.index}" |
312 313 | shell: "{input.script} {wildcards.binSize}kpb {params.bamfolder} {params.statsfolder} {params.qcFastqfolder} {params.qcBamfolder}> {output}" |
328 329 | shell: "fastqc {input.fastq} --outdir {params.qcfastqdir} -t {threads} 2>> {log}" |
341 342 | shell: "fastqc {input.bam} --format bam_mapped --outdir {params.qcbamdir} -t {threads} 2>> {log}" |
355 356 | script: "scripts/Run_ACE.R" |
369 370 | script: "scripts/Run_postanalysisloop_ACE.R" |
388 389 | script: "scripts/Run_CGHtest.R" |
Support
- Future updates
Related Workflows





