A SnakeMake workflow to analyse whole genome bisulfite sequencing data from allopolyploids.
Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation, topic
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
ARPEGGIO: Automated Reproducible Polyploid EpiGenetic GuIdance wOrkflow

ARPEGGIO is a snakemake workflow that analyzes whole genome bisulfite sequencing (WGBS) data coming from (allo)polyploid species. The workflow includes all basic steps in WGBS data analysis (trimming, quality check and alignment), a read sorting tool specific for allopolyploids, the most comprehensive statistical tool for Differential Methylation (DM) analysis and a set of downstream analyses to obtain a list of genes showing differential methylation.
Motivation
In the last decade, the use of High-Throughput Sequencing (HTS) technologies has become widespread across life sciences. With technology not being a bottleneck anymore, the new challenge with HTS data has shifted towards data analysis. To process and analyze WGBS data, many tools exist, but most of them were developed and/or tested with a focus on diploid model species. For polyploid species there are some complexities that are often not taken into account. One example is the large amount of duplicated genes in polyploids (homeologous genes) which might be challenging at the mapping step and influence downstream analyses. To help with the analysis of polyploid WGBS data we developed ARPEGGIO: an automated and reproducible workflow which aims at being easy to set up and use.
Why ARPEGGIO?
ARPEGGIO is easily setup with one configuration file and once ready, it will automatically analyse your WGBS data to provide a list of differentially methylated regions (DMRs). Thanks to Snakemake, a human readable, Python based language for workflows and Conda, a widely-used package manager, ARPEGGIO takes care of installing all the software needed fo the analyses and running all the steps in the workflow in the correct order. ARPEGGIO also ensures reproducibility of your analysis, you only need to share your configuration and your initial raw data.
What's new in ARPEGGIO?
Besides the workflow itself (which is already quite a lot of new), ARPEGGIO includes an allopolyploid specific read-sorting algorithm that has been adapted to deal with BS-seq data:
EAGLE-RC
. Check out the papers
"Homeolog expression quantification methods for allopolyploids"
and
"EAGLE: Explicit Alternative Genome Likelihood Evaluator"
by Kuo
et al.
for more details. Together with EAGLE-RC, there's also
dmrseq
: an R package for differential methylation analysis. This package has one of the most comprehensive approaches to deal with WGBS data problems: mainly statistical and computational. If you're curious check out
"Detection and accurate false discovery rate control of differentially methylated regions from whole genome bisulfite sequencing"
by Korthauer
et al
.
Workflow overview
Installation
To install this workflow you first need to install Snakemake via Conda . To further ensure reproducibility you can also install Singularity . Once everything is set up, run the following commands to clone the ARPEGGIO repository to your computer and run the workflow. With Conda only:
git clone https://github.com/supermaxiste/ARPEGGIO
cd ARPEGGIO
snakemake --use-conda
With Conda and Singularity:
git clone https://github.com/supermaxiste/ARPEGGIO
cd ARPEGGIO
snakemake --use-conda --use-singularity
Setup and run
Check out the Wiki to set up and run ARPEGGIO. If you're in a hurry you can also find a Quick Setup section. The Wiki will help you better understand the workflow design, input, output and summary files.
Troubleshooting and support
Google doesn't help? Are you stuck on an error that no one else seems to be having? Have you checked all the pages mentioning your problem but the solutions are not suitable? On the Wiki there's a list of common problems together with some general solutions. If that didn't work either, feel free to open an issue. Please make sure to describe your problem/errors and your trials in detail so that you can get the best help possible.
Credits
This project was inspired by
, if you work with RNA-seq data check it out!
Citing ARPEGGIO
Milosavljevic, S., Kuo, T., Decarli, S. et al. ARPEGGIO: Automated Reproducible Polyploid EpiGenetic GuIdance workflOw. BMC Genomics 22, 547 (2021). https://doi.org/10.1186/s12864-021-07845-2
Code Snippets
24 25 | shell: "bismark_genome_preparation {params.genome1} 2> {log}" |
44 45 | shell: "bismark_genome_preparation {params.genome2} 2> {log}" |
78 79 | shell: "bismark --prefix {params.prefix} --multicore {params.bismark_cores} -o {params.output} --temp_dir {params.output} --genome {params.genome1} {input.fastq} 2> {log}" |
112 113 | shell: "bismark --prefix {params.prefix} --multicore {params.bismark_cores} -o {params.output} --temp_dir {params.output} --genome {params.genome2} {input.fastq} 2> {log}" |
149 150 | shell: "bismark --prefix {params.prefix} --multicore {params.bismark_cores} --genome {params.genome1} -1 {input.fastq1} -2 {input.fastq2} -o {params.output} --temp_dir {params.output} 2> {log}" |
186 187 | shell: "bismark --prefix {params.prefix} --multicore {params.bismark_cores} --genome {params.genome2} -1 {input.fastq1} -2 {input.fastq2} -o {params.output} --temp_dir {params.output} 2> {log}" |
50 51 | shell: "wget https://github.com/tony-kuo/eagle/archive/v{params.eagle_version}.tar.gz -O {output.eagle_tar} 2> {log}" |
65 66 | shell: "tar -C {params.build_prefix} -vxf {input.eagle_tar} 2> {log}" |
79 80 | shell: "wget https://github.com/samtools/htslib/releases/download/{params.htslib_version}/{params.htslib_tar_name} -O {output.htslib_tar} 2> {log}" |
94 95 | shell: "tar -C {params.build_prefix} -vxf {input.htslib_tar} 2> {log}" |
113 114 | shell: "{params.eagle_mk_env} make -j {threads} -C {params.eagle_dir_path} {params.eagle_mk_flags} 2> {log}" |
130 131 | shell: "{params.eagle_mk_env} make -C {params.eagle_dir_path} install {params.eagle_mk_flags} 2> {log}" |
28 29 | shell: "deduplicate_bismark -s --output_dir {params} --bam {input} 2> {log}" |
52 53 | shell: "deduplicate_bismark -p --output_dir {params} --bam {input} 2> {log}" |
26 27 | shell: "Rscript scripts/CoverageFileGeneratorComplete.R {input.p1} {params.output} {params.sample_name} 2> {log}" |
47 48 | shell: "Rscript scripts/CoverageFileGeneratorComplete.R {input.p1} {params.output} {params.sample_name} 2> {log}" |
62 63 | shell: "cat {input.p1} {input.p2} > {output}" |
80 81 | shell: "Rscript scripts/CoverageFileGeneratorComplete.R {input} {params.output} {params.sample_name} 2> {log}" |
104 105 | shell: "Rscript scripts/dmrseq.R {params.n_samples_p1} {params.n_samples_allo} {output.comparison1} {threads} {input.p1} {input.allo} 2> {log}" |
128 129 | shell: "Rscript scripts/dmrseq.R {params.n_samples_p2} {params.n_samples_allo} {output.comparison2} {threads} {input.p2} {input.allo} 2> {log}" |
152 153 | shell: "Rscript scripts/dmrseq.R {params.n_samples_p1} {params.n_samples_allo} {output.comparison1} {threads} {input.p1} {input.allo} 2> {log}" |
176 177 | shell: "Rscript scripts/dmrseq.R {params.n_samples_p2} {params.n_samples_allo} {output.comparison2} {threads} {input.p2} {input.allo} 2> {log}" |
200 201 | shell: "Rscript scripts/dmrseq.R {params.n_samples_p1} {params.n_samples_allo} {output.comparison1} {threads} {input.p1} {input.allo} 2> {log}" |
224 225 | shell: "Rscript scripts/dmrseq.R {params.n_samples_p2} {params.n_samples_allo} {output.comparison2} {threads} {input.p2} {input.allo} 2> {log}" |
249 250 | shell: "Rscript scripts/dmrseq.R {params.samples_B} {params.samples_A} {output.comparison} {threads} {input.condB} {input.condA} 2> {log}" |
271 272 | shell: "Rscript scripts/dmrseq.R {params.samples_B} {params.samples_A} {output.comparison} {threads} {input.condB} {input.condA} 2> {log}" |
293 294 | shell: "Rscript scripts/dmrseq.R {params.samples_B} {params.samples_A} {output.comparison} {threads} {input.condB} {input.condA} 2> {log}" |
23 24 | shell: "Rscript scripts/significantGenesToBed.R {input} {params} 2> {log}" |
49 50 | shell: "Rscript scripts/significantGenesToBed.R {input} {params} 2> {log}" |
69 70 | shell: "bedtools intersect -a {params.anno1} -b {input.i1} -wo > {output.o1} 2> {log}" |
86 87 | shell: "bedtools intersect -a {params.anno2} -b {input.i2} -wo > {output.o2} 2> {log}" |
105 106 107 108 | shell: "bedtools intersect -a {params.anno1} -b {input} -wo > {output} 2> {log}" if ( sum(samples.origin == "parent1") > 0 ) else "bedtools intersect -a {params.anno2} -b {input} -wo > {output} 2> {log}" |
122 123 | shell: "bedtools intersect -a {params.anno1} -b {input} -wo > {output} 2> {log}" |
137 138 | shell: "bedtools intersect -a {params.anno2} -b {input} -wo > {output} 2> {log}" |
157 158 | shell: "Rscript scripts/DMGeneSummary.R {input.i1} {input.dm1} {params.geneID1} {params.o1} 2> {log}" |
174 175 | shell: "Rscript scripts/DMGeneSummary.R {input.i2} {input.dm2} {params.geneID2} {params.o2} 2> {log}" |
195 196 197 198 | shell: "Rscript scripts/DMGeneSummary.R {input.i1} {input.dm1} {params.geneID1} {params.o1} 2> {log}" if ( sum(samples.origin == "parent1") > 0 ) else "Rscript scripts/DMGeneSummary.R {input.i1} {input.dm1} {params.geneID2} {params.o1} 2> {log}" |
214 215 | shell: "Rscript scripts/DMGeneSummary.R {input.i1} {input.dm1} {params.geneID1} {params.o1} 2> {log}" |
231 232 | shell: "Rscript scripts/DMGeneSummary.R {input.i1} {input.dm1} {params.geneID2} {params.o2} 2> {log}" |
71 72 73 | shell: "bismark_methylation_extractor -s -o {params.output} --gzip --genome_folder {params.genome} --multicore {params.bismark_cores} --no_overlap --comprehensive --bedGraph --scaffolds --CX {input} 2> {log}" if config[ "UNFINISHED_GENOME" |
141 142 143 | shell: "bismark_methylation_extractor -s -o {params.output} --gzip --genome_folder {params.genome} --multicore {params.bismark_cores} --no_overlap --comprehensive --bedGraph --scaffolds --CX {input} 2> {log}" if config[ "UNFINISHED_GENOME" |
237 238 239 | shell: "bismark_methylation_extractor -s -o {params.output} --gzip --genome_folder {params.genome} --multicore {params.bismark_cores} --no_overlap --comprehensive --bedGraph --scaffolds --CX {input} 2> {log}" if config[ "UNFINISHED_GENOME" |
333 334 335 | shell: "bismark_methylation_extractor -s -o {params.output} --gzip --genome_folder {params.genome} --multicore {params.bismark_cores} --no_overlap --comprehensive --bedGraph --scaffolds --CX {input} 2> {log}" if config[ "UNFINISHED_GENOME" |
410 411 412 | shell: "bismark_methylation_extractor -p -o {params.output} --gzip --genome_folder {params.genome} --multicore {params.bismark_cores} --no_overlap --comprehensive --bedGraph --scaffolds --CX {input} 2> {log}" if config[ "UNFINISHED_GENOME" |
487 488 489 | shell: "bismark_methylation_extractor -p -o {params.output} --gzip --genome_folder {params.genome} --multicore {params.bismark_cores} --no_overlap --comprehensive --bedGraph --scaffolds --CX {input} 2> {log}" if config[ "UNFINISHED_GENOME" |
594 595 596 | shell: "bismark_methylation_extractor -p -o {params.output} --gzip --genome_folder {params.genome} --multicore {params.bismark_cores} --no_overlap --comprehensive --bedGraph --scaffolds --CX {input} 2> {log}" if config[ "UNFINISHED_GENOME" |
701 702 703 | shell: "bismark_methylation_extractor -p -o {params.output} --gzip --genome_folder {params.genome} --multicore {params.bismark_cores} --no_overlap --comprehensive --bedGraph --scaffolds --CX {input} 2> {log}" if config[ "UNFINISHED_GENOME" |
740 741 | shell: "coverage2cytosine -CX --genome_folder {params.genome1} -o {params.filename1} {input.f1} 2> {log}" |
775 776 | shell: "coverage2cytosine -CX --genome_folder {params.genome2} -o {params.filename2} {input.f2} 2> {log}" |
818 819 | shell: "coverage2cytosine -CX --genome_folder {params.genome1} -o {params.filename1} {input.f1} 2> {log}" |
861 862 | shell: "coverage2cytosine -CX --genome_folder {params.genome2} -o {params.filename2} {input.f2} 2> {log}" |
25 26 | shell: "fastqc -o {params.FastQC} -t {threads} {input.fastq} 2> {log}" |
46 47 | shell: "fastqc -o {params.FastQC} -t {threads} {input.fastq} 2> {log}" |
69 70 | shell: "fastqc -o {params.FastQC} -t {threads} {input} 2> {log}" |
87 88 | shell: "bismark_genome_preparation {input.control} 2> {log}" |
121 122 | shell: "bismark --prefix {params.prefix} --multicore {params.bismark_cores} -o {params.output} --temp_dir {params.output} --genome {params.control} {input.fastq} 2> {log}" |
155 156 | shell: "bismark --prefix {params.prefix} --multicore {params.bismark_cores} --genome {params.control} -1 {input.fastq1} -2 {input.fastq2} -o {params.output} --temp_dir {params.output} 2> {log}" |
183 184 | shell: "samtools sort {input.p1} > {output.o1}" |
211 212 | shell: "samtools sort {input.p2} > {output.o2} 2> {log}" |
247 248 | shell: "samtools sort {input} > {output}" |
276 277 | shell: "qualimap bamqc -bam {input} -outdir {params.output} -nt {threads} --java-mem-size=4G 2> {log}" |
305 306 | shell: "qualimap bamqc -bam {input} -outdir {params.output} -nt {threads} --java-mem-size=4G 2> {log}" |
330 331 | shell: "qualimap bamqc -bam {input.genome} -outdir {params.output} -nt {threads} --java-mem-size=4G 2> {log}" |
355 356 | shell: "qualimap bamqc -bam {input.genome} -outdir {params.output} -nt {threads} --java-mem-size=4G 2> {log}" |
376 377 | shell: "multiqc {params.inputdir} -f -o {params.multiqcdir} 2> {log}" |
37 38 | shell: "{input.eagle_bin} --ngi {params.phred} --ref1={params.genome1} --bam1={input.reads1} --ref2={params.genome2} --bam2={input.reads2} -o {params.output} --bs=3 > {params.list} 2> {log}" |
68 69 | shell: "{input.eagle_bin} --ngi --paired {params.phred} --ref1={params.genome1} --bam1={input.reads1} --ref2={params.genome2} --bam2={input.reads2} -o {params.output} --bs=3 > {params.list} 2> {log}" |
29 30 31 | shell: "trim_galore -q 20 --clip_R1 {params.trim_5_r1} --phred33 --length 20 -j {params.trim_cores} -o {params.FASTQtrimmeddir} --path_to_cutadapt cutadapt {input.fastq1} 2> {log}" if config[ "RUN_TRIMMING" |
69 70 71 | shell: "trim_galore -q 20 --clip_R1 {params.trim_5_r1} --clip_R2 {params.trim_5_r2} --phred33 --length 20 -j {params.trim_cores} -o {params.FASTQtrimmeddir} --path_to_cutadapt cutadapt --paired {input.fastq1} {input.fastq2} 2> {log}" if config[ "RUN_TRIMMING" |
10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 | library(data.table) comm_args <- commandArgs(trailingOnly = TRUE) full_path <- normalizePath(comm_args[1]) ## Read coverage file coverage_file <- fread(full_path) ## Set working directory out <- comm_args[2] out <- normalizePath(out) setwd(out) # Save output name output <- comm_args[3] colnames(coverage_file) <- c("scaffold", "position", "strand", "mC", "uC", "context", "spec_context") print("File has been read") # Replace all NAs with 0s coverage_file[is.na(coverage_file$mC),]$mC <- 0 coverage_file[is.na(coverage_file$uC),]$uC <- 0 # Save all positions with reads with_reads <- (coverage_file$mC+coverage_file$uC)!=0 ## We vectorize conditions to simplify our problem ## First check only contexts of interest: in our case CG, CHG and CHH (with reads) CG_contexts <- coverage_file$context=="CG" & with_reads CHG_contexts <- coverage_file$context=="CHG" & with_reads CHH_contexts <- coverage_file$context=="CHH" & with_reads ## We now generate our cov files with the following format: ## <scaffold> <start_position> <end_position> <% methylation> # <count_methylated> <count_unmethylated> cov_CG <- cbind(coverage_file$scaffold[CG_contexts], coverage_file$position[CG_contexts], coverage_file$position[CG_contexts], (coverage_file$mC[CG_contexts]/(coverage_file$mC[CG_contexts]+coverage_file$uC[CG_contexts]))*100, coverage_file$mC[CG_contexts], coverage_file$uC[CG_contexts]) #Replace infinite values with 0 (not needed anymore) #infinite <- cov_CG[,4]==Inf #cov_CG[infinite,4] <- 0 cov_CHG <- cbind(coverage_file$scaffold[CHG_contexts], coverage_file$position[CHG_contexts], coverage_file$position[CHG_contexts], (coverage_file$mC[CHG_contexts]/(coverage_file$mC[CHG_contexts]+coverage_file$uC[CHG_contexts]))*100, coverage_file$mC[CHG_contexts], coverage_file$uC[CHG_contexts]) cov_CHH <- cbind(coverage_file$scaffold[CHH_contexts], coverage_file$position[CHH_contexts], coverage_file$position[CHH_contexts], (coverage_file$mC[CHH_contexts]/(coverage_file$mC[CHH_contexts]+coverage_file$uC[CHH_contexts]))*100, coverage_file$mC[CHH_contexts], coverage_file$uC[CHH_contexts]) write.table(cov_CG, file = paste0(output,"_CG.cov"), sep="\t", row.names = FALSE, col.names = FALSE, quote = FALSE) write.table(cov_CHG, file = paste0(output,"_CHG.cov"), sep="\t", row.names = FALSE, col.names = FALSE, quote = FALSE) write.table(cov_CHH, file = paste0(output,"_CHH.cov"), sep="\t", row.names = FALSE, col.names = FALSE, quote = FALSE) |
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 | library(data.table) library(stringr) comm_args <- commandArgs(trailingOnly = TRUE) # First argument: intersection file intersection <- comm_args[1] # Second argument: DM regions (dmrseq) regions <- comm_args[2] # Third argument: geneID name geneID <- comm_args[3] # Fifth argument: output name output_name <- comm_args[4] # Check if intersection file is empty if (file.info(intersection)$size==0){ print("There was no intersection between significant DMRs and the annotation file. Returning empty file") empty <- c("There was no intersection between significant DMRs and the annotation file") write.table(empty, file=paste0(output_name, ".txt"), quote = FALSE, sep = "\t", row.names = FALSE, col.names = TRUE) } else { # Read, clean and name file columns intersection_file <- fread(intersection) dm_regions <- fread(regions) # Pick columns of interest intersection_file <- intersection_file[, c(1, 3, 4, 5, 9, 11, 12, 13)] # Rename columns colnames(intersection_file) <- c("seqname", "feature", "start", "end", "attribute", "overlap_start", "overlap_end", "length") # Select only genes intersection_file <- intersection_file[intersection_file$feature=="gene",] # Pick geneID from attribute column geneID_name <- paste0(geneID, "=") # Separate subcolumns in attribute column attribute_col <- as.data.frame(str_split_fixed(intersection_file$attribute, ";", n=Inf)) # Look for column with geneID keyword geneID_column <- which(grepl(geneID_name, unlist(attribute_col[1,]))) # Remove geneID keyword from column intersection_file$attribute <- gsub(geneID_name, "", attribute_col[,geneID_column]) # Create final file (missing 1 column) DM_genes_summary <- as.data.frame(cbind(geneID=intersection_file$attribute, seqname=intersection_file$seqname, start=intersection_file$start, end=intersection_file$end, region_start=intersection_file$overlap_start, region_end=intersection_file$overlap_end, overlap_length=intersection_file$length)) # Match DM regions coordinates dm_coordinates <- paste(DM_genes_summary$seqname, ":", DM_genes_summary$region_start, "-", DM_genes_summary$region_end, sep = "") dm_regions <- cbind(dm_regions, coordinates=paste(dm_regions$seqname, ":", dm_regions$start, "-", dm_regions$end, sep = "")) # Find which line of the summary corresponds to which line of the dmrseq file corresponding_match <- match(dm_coordinates, dm_regions$coordinates) # Add column to dmrseq with methylation status m_status <- ifelse(dm_regions$stat>0, "decrease", "increase") # Add final column to summary file DM_genes_summary <- cbind(DM_genes_summary, methylation_status=m_status[corresponding_match]) # write file to output write.table(DM_genes_summary, file=paste0(output_name, ".txt"), quote = FALSE, sep = "\t", row.names = FALSE, col.names = TRUE) } |
4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 | library(dmrseq) library(data.table) library(BiocParallel) # Four command line arguements are needed: first is the number of samples # for the first species analyzed, second is the number of samples for the # second species analyzed, third is the output name with extension and # fourth is the number of cores. After those four, the cov files from the # two species you want to compare need to be added (as many as you have). # The order of the cov files MUST be diploid species first and polyploid species second comm_args <- commandArgs(trailingOnly = TRUE) # First argument: number of samples for one species samples1 <- comm_args[1] # Second argument: number of samples for the other species samples2 <- comm_args[2] # Second argument: output name (with extension) output <- comm_args[3] # Third argument: number of cores cores <- comm_args[4] # All other arguments: cov files (in the right order) samples <- as.integer(samples1) + as.integer(samples2) sample_counter <- 0 for (i in 1:samples){ if (!is.na(comm_args[i+4])){ assign(paste0("file_", i), comm_args[i+4]) sample_counter <- sample_counter + 1 } } # Create vector of cov files cov_files <- c() for (i in 1:sample_counter){ cov_files[i] <- get(paste0("file_", i)) } # Read cov files bismarkBSseq <- read.bismark(files = c(cov_files), rmZeroCov = TRUE, strandCollapse = FALSE, verbose = TRUE) # Specify conditions sampleNames = c(rep("par", as.integer(samples1)), rep("kam", as.integer(samples2))) pData(bismarkBSseq)$Species <- sampleNames # Filtering step loci.idx <- which(DelayedMatrixStats::rowSums2(getCoverage(bismarkBSseq, type="Cov")==0) == 0) sample.idx <- which(pData(bismarkBSseq)$Species %in% c("par", "kam")) bs.filtered <- bismarkBSseq[loci.idx, sample.idx] register(MulticoreParam(cores)) # DMRseq function, normally takes around 1.5h regions <- dmrseq(bs = bs.filtered, testCovariate = "Species") # Save Robject outputR <- paste0(substr(output, 1, nchar(output)-3), "Rdata") save(regions, file = outputR) #This took about 1.5 h regions_dataframe <- as.data.frame(regions) write.csv(regions_dataframe, file=output, quote = FALSE, row.names = FALSE, col.names = TRUE) |
4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 | library(data.table) comm_args <- commandArgs(trailingOnly = TRUE) # First argument: dmrseq output dmrseq_output <- comm_args[1] # Second argument: output name output_name <- comm_args[2] # Read file candidate_regions <- fread(dmrseq_output) # Select only significant regions sig_regions <- candidate_regions[candidate_regions$qval < .05] # If no regions are found, create an empty file and print message if (nrow(sig_regions)==0){ print("dmrseq didn't find any significant regions: all DMRs had q-value > 0.05. Returning empty file") empty <- c() write.table(empty, file=paste0(output_name, "_sorted.bed"), quote = FALSE, sep = "\t", row.names = FALSE, col.names = FALSE) } else { # Select range of significant regions sig_regions_bed <- sig_regions[,c("seqnames", "start", "end")] # Some regions might have start > end, so we need to fix this # Get index of wrong positions wrong_pos_index <- which(sig_regions_bed$start > sig_regions_bed$end) #select rows with wrong index wrong_pos <- sig_regions_bed[wrong_pos_index,] #switch starting and ending position in the original file sig_regions_bed[wrong_pos_index,2] <- wrong_pos$end sig_regions_bed[wrong_pos_index,3] <- wrong_pos$start #To check whether anything is wrong use following command: #wrong_pos_index <- which(sig_regions_bed$start>sig_regions_bed$end) # output bed file write.table(sig_regions_bed, file=paste0(output_name, ".bed"), quote = FALSE, sep = "\t", row.names = FALSE, col.names = FALSE) command <- paste("sort -k1,1 -k2,2n ", output_name, ".bed > ", output_name, "_sorted.bed", sep = "") system(command = command) } |
Support
- Future updates
Related Workflows





