Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation, topic
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
snakemake-bisulfite
Automatic pipeline for genome wide DNA methylome processing data based on snakemake
Automated workflow for WGBS and EM-Seq (NEB) data
Snakemake workflow for processing BS-seq libaries produced by Illumina bisulfite sequencing kits.
Supported protocols:
Note, both WGBS and EM-seq produce the same kind of data, so no adjustment for postprocessing needed.
Requirments
-
demultiplex fastq files in located in
data
directory. They need to be in the form{sample}_R1.fastq.gz
-
Snakefile
shipped with this repository. -
config.yaml
shipped with this repository. It contains all parameters and settings to customize the processing of the current dataset. -
samples.csv
listing all samples in thedata
directory withoug the_R1.fastq.gz
suffix. The first line is the header i.e. the worklibrary
. An example is shipped with this repository which can be used as a template. -
Optionall:
environment.yaml
to create the software environment if conda is used. -
If conda is not used,
bowtie
,fastqc
,samtools
anddeeptools
need to be in the PATH.The above files can be downloaded as a whole by cloning the repository (which requires git):
git clone https://github.com/seb-mueller/snakemake-bisulfite
Or individually for example the
Snakemake
file using
wget
:
wget https://raw.githubusercontent.com/seb-mueller/snakemake_sRNAseq/master/Snakefile
creating conda environment
conda env create --file environment.yaml --name bsseq_pipeline
activate
conda activate bsseq_pipeline
To
deactivate
the environment, run:
conda deactivate
Genome preparation:
The pipeline requires a bismark index file which might have to be created at first. This best done having the reference genome as fasta file located in a directory by itself (there shouldn't be any other fasta files since bismark works in mysterious ways and you can't specify a conrete file, just the directory).
Say our genome is located here:
ref/genome/mygenome.fa
with no other fa files in it.
Run bismark indexer which will create a folder within that:
bismark_genome_preparation ref/genome
This will create
ref/genome/Bisulfite_Genome
.
To let the pipeline know where the index is located, change the
config.yaml
:
reference: "ref/genome/" reference_short: "mygenome"
Also, the fasta-index is required at some point and should be generated in the same directory as follows:
samtools faidx mygenome.fa
Usage:
Navigate in a Unix shell to the base directory contains the files listed above plus the
data
directory including the data like int this example:
.
├── bsseq.makefile
├── config.yaml
├── data
│ ├── bsseq_sample1_R1.fastq.gz
│ ├── bsseq_sample1_R2.fastq.gz
│ ├── bsseq_sample2_R1.fastq.gz
│ └── bsseq_sample2_R2.fastq.gz
├── environment.yaml
├── README.md
├── samples.csv
└── Snakefile
The
sample.csv
could look something like this:
library,optional_info
bsseq_sample1,WT
bsseq_sample2,Mutant
Then just run snakmake in base directory:
snakemake
useful parameters:
-
--cores
max number of threads -
-n
dryrun -
-p
print commands -
--use-conda
-
--conda-prefix ~/.myconda
-
--forcerun postmapping
forces rerun of a given rule (e.g.postmapping
)
Output:
trimmed
,
log
and
mapped
directory with trimming and mapping results.
Below is the structure of all generated files once the pipeline is finished:
.
├── config.yaml
├── coverage
│ ├── bseq_sample2_MappedOn_chloroplast_CHH.gz
│ ├── bseq_sample2_MappedOn_chloroplast_CHH.gz.bismark.cov.gz
│ ├── bsseq_sample1_MappedOn_chloroplast_CHG
│ ├── bsseq_sample1_MappedOn_chloroplast_CHG.bw
│ ├── bsseq_sample1_MappedOn_chloroplast_CHG.CX_report.txt
│ ├── bsseq_sample1_MappedOn_chloroplast_CHG.gz.bismark.cov
│ ├── bsseq_sample1_MappedOn_chloroplast_CHH
│ ├── bsseq_sample1_MappedOn_chloroplast_CHH.bw
│ ├── bsseq_sample1_MappedOn_chloroplast_CHH.CX_report.txt
│ ├── bsseq_sample1_MappedOn_chloroplast_CHH.gz.bismark.cov
│ ├── bsseq_sample1_MappedOn_chloroplast_CpG
│ ├── bsseq_sample1_MappedOn_chloroplast_CpG.bw
│ ├── bsseq_sample1_MappedOn_chloroplast_CpG.CX_report.txt
│ ├── bsseq_sample1_MappedOn_chloroplast_CpG.gz.bismark.cov
│ ├── bsseq_sample2_MappedOn_chloroplast_CHG
│ ├── bsseq_sample2_MappedOn_chloroplast_CHG.bw
│ ├── bsseq_sample2_MappedOn_chloroplast_CHG.CX_report.txt
│ ├── bsseq_sample2_MappedOn_chloroplast_CHG.gz.bismark.cov
│ ├── bsseq_sample2_MappedOn_chloroplast_CHH
│ ├── bsseq_sample2_MappedOn_chloroplast_CHH.bw
│ ├── bsseq_sample2_MappedOn_chloroplast_CHH.CX_report.txt
│ ├── bsseq_sample2_MappedOn_chloroplast_CHH.gz.bismark.cov
│ ├── bsseq_sample2_MappedOn_chloroplast_CpG
│ ├── bsseq_sample2_MappedOn_chloroplast_CpG.bw
│ ├── bsseq_sample2_MappedOn_chloroplast_CpG.CX_report.txt
│ └── bsseq_sample2_MappedOn_chloroplast_CpG.gz.bismark.cov
├── data
│ ├── bsseq_sample1_R1.fastq.gz
│ ├── bsseq_sample1_R2.fastq.gz
│ ├── bsseq_sample2_R1.fastq.gz
│ ├── bsseq_sample2_R2.fastq.gz
│ └── index
│ ├── Bisulfite_Genome
│ │ ├── CT_conversion
│ │ │ ├── BS_CT.1.bt2
│ │ │ ├── BS_CT.2.bt2
│ │ │ ├── BS_CT.3.bt2
│ │ │ ├── BS_CT.4.bt2
│ │ │ ├── BS_CT.rev.1.bt2
│ │ │ ├── BS_CT.rev.2.bt2
│ │ │ └── genome_mfa.CT_conversion.fa
│ │ └── GA_conversion
│ │ ├── BS_GA.1.bt2
│ │ ├── BS_GA.2.bt2
│ │ ├── BS_GA.3.bt2
│ │ ├── BS_GA.4.bt2
│ │ ├── BS_GA.rev.1.bt2
│ │ ├── BS_GA.rev.2.bt2
│ │ └── genome_mfa.GA_conversion.fa
│ ├── chloroplast.fa
│ ├── chloroplast.fa.fai
│ └── genomic_nucleotide_frequencies.txt
├── environment.yaml
├── logs
│ ├── bismark
│ │ ├── bsseq_sample1.
│ │ ├── bsseq_sample1_CHG_MappedOn_chloroplast.methylation_extractor.log
│ │ ├── bsseq_sample1_CHH_MappedOn_chloroplast.methylation_extractor.log
│ │ ├── bsseq_sample1_CpG_MappedOn_chloroplast.methylation_extractor.log
│ │ ├── bsseq_sample1.log
│ │ ├── bsseq_sample1_MappedOn_chloroplast_CHG.bedGraphToBigWig.log
│ │ ├── bsseq_sample1_MappedOn_chloroplast_CHG.bismark2bedGraph.log
│ │ ├── bsseq_sample1_MappedOn_chloroplast_CHG.coverage2cytosine.log
│ │ ├── bsseq_sample1_MappedOn_chloroplast_CHH.bedGraphToBigWig.log
│ │ ├── bsseq_sample1_MappedOn_chloroplast_CHH.bismark2bedGraph.log
│ │ ├── bsseq_sample1_MappedOn_chloroplast_CHH.coverage2cytosine.log
│ │ ├── bsseq_sample1_MappedOn_chloroplast_CpG.bedGraphToBigWig.log
│ │ ├── bsseq_sample1_MappedOn_chloroplast_CpG.bismark2bedGraph.log
│ │ ├── bsseq_sample1_MappedOn_chloroplast_CpG.coverage2cytosine.log
│ │ ├── bsseq_sample1_MappedOn_chloroplast.deduplication.log
│ │ ├── bsseq_sample1_MappedOn_chloroplast.log
│ │ ├── bsseq_sample1_MappedOn_chloroplast.methylation_extractor.log
│ │ ├── bsseq_sample2_CHG_MappedOn_chloroplast.methylation_extractor.log
│ │ ├── bsseq_sample2_CHH_MappedOn_chloroplast.methylation_extractor.log
│ │ ├── bsseq_sample2_CpG_MappedOn_chloroplast.methylation_extractor.log
│ │ ├── bsseq_sample2.log
│ │ ├── bsseq_sample2_MappedOn_chloroplast_CHG.bedGraphToBigWig.log
│ │ ├── bsseq_sample2_MappedOn_chloroplast_CHG.bismark2bedGraph.log
│ │ ├── bsseq_sample2_MappedOn_chloroplast_CHG.coverage2cytosine.log
│ │ ├── bsseq_sample2_MappedOn_chloroplast_CHH.bedGraphToBigWig.log
│ │ ├── bsseq_sample2_MappedOn_chloroplast_CHH.bismark2bedGraph.log
│ │ ├── bsseq_sample2_MappedOn_chloroplast_CHH.coverage2cytosine.log
│ │ ├── bsseq_sample2_MappedOn_chloroplast_CpG.bedGraphToBigWig.log
│ │ ├── bsseq_sample2_MappedOn_chloroplast_CpG.bismark2bedGraph.log
│ │ ├── bsseq_sample2_MappedOn_chloroplast_CpG.coverage2cytosine.log
│ │ ├── bsseq_sample2_MappedOn_chloroplast.deduplication.log
│ │ ├── bsseq_sample2_MappedOn_chloroplast.log
│ │ └── bsseq_sample2_MappedOn_chloroplast.methylation_extractor.log
│ ├── cutadapt
│ │ ├── bsseq_sample1_R1.log
│ │ ├── bsseq_sample1_R2.log
│ │ ├── bsseq_sample2_R1.log
│ │ └── bsseq_sample2_R2.log
│ └── fastqc
│ └── raw
│ ├── bsseq_sample1_R1_fastqc.html
│ ├── bsseq_sample1_R1_fastqc.zip
│ ├── bsseq_sample1_R2_fastqc.html
│ ├── bsseq_sample1_R2_fastqc.zip
│ ├── bsseq_sample2_R1_fastqc.html
│ ├── bsseq_sample2_R1_fastqc.zip
│ ├── bsseq_sample2_R2_fastqc.html
│ └── bsseq_sample2_R2_fastqc.zip
├── mapped
│ ├── bsseq_sample1_MappedOn_chloroplast_trim_bismark_pe.bam
│ ├── bsseq_sample1_MappedOn_chloroplast_trim_bismark_pe.deduplicated.bam
│ ├── bsseq_sample1_MappedOn_chloroplast_trim_bismark_pe.deduplicated.sorted.bam
│ ├── bsseq_sample1_MappedOn_chloroplast_trim_bismark_pe.deduplicated.sorted.bam.bai
│ ├── bsseq_sample1_MappedOn_chloroplast_trim_bismark_pe.deduplicated.sorted.bw
│ ├── bsseq_sample1_MappedOn_chloroplast_trim_bismark_pe.deduplication_report.txt
│ ├── bsseq_sample1_MappedOn_chloroplast_trim_bismark_pe.nucleotide_stats.txt
│ ├── bsseq_sample1_MappedOn_chloroplast_trim_bismark_PE_report.txt
│ ├── bsseq_sample2_MappedOn_chloroplast_trim_bismark_pe.bam
│ ├── bsseq_sample2_MappedOn_chloroplast_trim_bismark_pe.deduplicated.bam
│ ├── bsseq_sample2_MappedOn_chloroplast_trim_bismark_pe.deduplicated.sorted.bam
│ ├── bsseq_sample2_MappedOn_chloroplast_trim_bismark_pe.deduplicated.sorted.bam.bai
│ ├── bsseq_sample2_MappedOn_chloroplast_trim_bismark_pe.deduplicated.sorted.bw
│ ├── bsseq_sample2_MappedOn_chloroplast_trim_bismark_pe.deduplication_report.txt
│ ├── bsseq_sample2_MappedOn_chloroplast_trim_bismark_pe.nucleotide_stats.txt
│ ├── bsseq_sample2_MappedOn_chloroplast_trim_bismark_PE_report.txt
│ ├── bws
│ │ ├── bsseq_sample1_MappedOn_chloroplast_trim_bismark_pe.deduplicated.sorted.bw
│ │ └── bsseq_sample2_MappedOn_chloroplast_trim_bismark_pe.deduplicated.sorted.bw
│ └── test
├── methylation_extracted
│ ├── bsseq_sample1_MappedOn_chloroplast_trim_bismark_pe.deduplicated.sorted.M-bias.txt
│ ├── bsseq_sample1_MappedOn_chloroplast_trim_bismark_pe.deduplicated.sorted_splitting_report.txt
│ ├── bsseq_sample2_MappedOn_chloroplast_trim_bismark_pe.deduplicated.sorted.M-bias.txt
│ ├── bsseq_sample2_MappedOn_chloroplast_trim_bismark_pe.deduplicated.sorted_splitting_report.txt
│ ├── CHG_context_bsseq_sample1_MappedOn_chloroplast_trim_bismark_pe.deduplicated.sorted.txt.gz
│ ├── CHG_context_bsseq_sample2_MappedOn_chloroplast_trim_bismark_pe.deduplicated.sorted.txt
│ ├── CHG_context_bsseq_sample2_MappedOn_chloroplast_trim_bismark_pe.deduplicated.sorted.txt.gz
│ ├── CHH_context_bsseq_sample1_MappedOn_chloroplast_trim_bismark_pe.deduplicated.sorted.txt.gz
│ ├── CHH_context_bsseq_sample2_MappedOn_chloroplast_trim_bismark_pe.deduplicated.sorted.txt
│ ├── CHH_context_bsseq_sample2_MappedOn_chloroplast_trim_bismark_pe.deduplicated.sorted.txt.gz
│ ├── CpG_context_bsseq_sample1_MappedOn_chloroplast_trim_bismark_pe.deduplicated.sorted.txt.gz
│ ├── CpG_context_bsseq_sample2_MappedOn_chloroplast_trim_bismark_pe.deduplicated.sorted.txt
│ └── CpG_context_bsseq_sample2_MappedOn_chloroplast_trim_bismark_pe.deduplicated.sorted.txt.gz
├── README.md
├── samples.csv
├── Snakefile
└── trimmed ├── bsseq_sample1_R1.fastq.gz_trimming_report.txt ├── bsseq_sample1_R1_trim.fq.gz ├── bsseq_sample1_R2.fastq.gz_trimming_report.txt ├── bsseq_sample1_R2_trim.fq.gz ├── bsseq_sample2_R1.fastq.gz_trimming_report.txt ├── bsseq_sample2_R1_trim.fq.gz ├── bsseq_sample2_R1_val_1.fq.gz ├── bsseq_sample2_R2.fastq.gz_trimming_report.txt ├── bsseq_sample2_R2_trim.fq.gz └── bsseq_sample2_R2_val_2.fq.gz
Update pipeline/environment:
git pull
conda env update --file environment.yaml --name bsseq_pipeline
Post processing
Results can be imported in various software packages such as
https://github.com/al2na/methylKit
using the R-snippet below.
Note, here all 3 context are considered seperately which can by useful for example for anyalysing plant DNA-metyhylation.
The imported files are
*.gz.bismark.cov
from the
coverage
folder:
dir_base <- "/mypath..." # set this as base path manually!
meta.data <- read_csv(file.path( dir_base, "samples.csv"))
config <- read_yaml(file.path(dir_base, "config.yaml"))
genome_suffix <- config$MAPPING$reference_short
dir_files <- file.path(dir_base, "coverage")
contexts <- c("CpG", "CHG", "CHH")
for (context in contexts) {
fls_cov_short <- paste(meta.data$library, "_MappedOn_", genome_suffix, "_", context, ".gz.bismark.cov", sep = "")
fls_cov <- file.path(dir_files, fls_cov_short)
myobj <- methRead(as.list(fls_cov), as.list(as.character(meta.data$library)), "assembly", dbtype = NA,
pipeline = "bismarkCoverage", header = FALSE, skip = 0, sep = "\t",
context = context, resolution = "base",
treatment = as.integer(as.factor(meta.data$condition)), dbdir = getwd(),
mincov = 1)
# ...
}
Code Snippets
9 10 11 12 13 14 15 16 17 18 19 20 21 | from Bio import SeqIO number_bp = 12 with open(snakemake.output["tsv"], "w") as output: for seq_record in SeqIO.parse(snakemake.input["fa"], "fasta"): myline = ( (str(seq_record.id)) + "\t" + str(seq_record.seq[0 : (number_bp + 1)]) + "\n" ) output.write(myline) |
126 127 | script: "scripts/fa2tsv.py" |
148 149 150 151 | shell: """ fastqc {input.read} -a {input.adapters} -t {threads} -f fastq --outdir logs/fastqc/trimmed """ |
171 172 173 174 | shell: """ fastqc {input.read} -a {input.adapters} -t {threads} -f fastq --outdir logs/fastqc/raw """ |
192 193 194 195 196 197 198 199 | shell: """ trim_galore {extra_params_trim} --cores {threads} --paired --trim1 {input.read1} {input.read2} --output_dir trimmed rename 's/val_[12]/trim/g' trimmed/{wildcards.sample}* mkdir -p logs/trim mv trimmed/{log.log1} logs/trim/ mv trimmed/{log.log2} logs/trim/ """ |
250 251 252 253 | shell: """ deduplicate_bismark --paired --bam {input.bam} --output_dir mapped 2> {log} """ |
262 263 264 265 266 | shell: """ samtools sort {input.bam} > {output.sort} samtools index {output.sort} """ |
284 285 286 287 | shell: """ bamCoverage {params.extra} -b {input.bam} -o {output} --binSize {params.binsize} -p {threads} """ |
326 327 328 329 | shell: """ bismark_methylation_extractor {params.extra} --gzip --paired-end --multicore {threads} --genome_folder {params.ref} -s {input.bam} --output methylation_extracted 2> {log} """ |
Support
- Future updates
Related Workflows





