Pipeline for the analysis of PE ChIP-seq data
Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation, topic
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
Snakemake pipepine for (Illumina) paired-end ChIP-Seq data
Aim
Snakemake pipeline made for reproducible analysis of paired-end Illumina ChIP-seq data. The desired output of this pipeline are:
-
Quality controls files to check for the quality of the reads. Reads are processed by programs such as fastp and deeptools in order to produce graph that are easily readable and inform quickly about the quality of the experiment.
-
Portable visualization files
bigwig files
, these files have a.bw
extension and are relatively small (compared to sam and bam files). There are used to visualize the read coverage over a genome and can be simply be drag and drop in a genome viewer such as IGV , JBrowse or UCSC . -
Peaks informations files, these
bed
files gather the information about the peak calling produces by the MACS2 algorithm. The.narrowPeak
and.broadPeak
files can be used to visualize the peak location as thebigwig
files but also used downstream analysis such as the intersection of peaks between replicates and different treatments. -
Deeptools visualization of the reads over genomic features such as the genes transcriptions start sites or other features.
Informations about the output of the pipeline can be found in the part Main outputs of this documentation.
Content of the repository
-
Snakefile containing the "recipes" of rules to generate the desired outputs from the input files.
-
config/ , folder containing the configuration files making the Snakefile adaptable to any input files, genome and parameter for the rules. Adapt the config file and its reference in the Snakefile. Please also pay attention to the parameters selected for deeptools, for convenience and faster test the bins have been defined at
1000bp
, do not forget to adapt it to your analysis. -
Fastq/ , folder containing subsetted paired-end fastq files used to test locally the pipeline. Generated using Seqtk :
seqtk sample -s100 read1.fq 5000 > sub1.fqseqtk sample -s100 read2.fq 5000 > sub2.fq
. RAW fastq or fastq.gz files should be placed here before running the pipeline. -
envs/ , folder containing the environment needed for the Snakefile to run. To use Snakemake, it is required to create and activate an environment containing snakemake (here :
envs/global_env.yaml
). The Snakefile will load the required environment for each rules automatically, no modification is required here only if a rule is added. -
units.tsv , is a tab separated value files containing information about the experiment name, the condition of the experiment (control or treatment) and the path to the fastq files relative to the Snakefile . Change this file according to your samples.
-
rules/ , folder containing the rules called by the snakefile to run the pipeline, this improves the clarity of the Snakefile and might help modifying the file in the future.
Usage
Conda environment
First, you need to create an environment for the use of Snakemake with Conda package manager .
-
Create a virtual environment named "chipseq" from the
global_env.yaml
file with the following command:conda env create --name chipseq --file envs/global_env.yaml
-
Then, activate this virtual environment with
source activate chipseq
The Snakefile will then take care of installing and loading the packages and softwares required by each step of the pipeline.
Test run on small fastq test files
To test this pipeline, you will need Snakemake installed on your local machine. If you have activated the 'chipseq' environment (see above), then Snakemake 5.2.0 is already installed and in use.
If the
singularity
software is available on your machine and you want to use 10 CPUs (
--cores 10
), then run
snakemake --use-conda --use-singularity --cores 10
Configuration file
The
config.yaml
file specifies custom options for Snakemake:
-
Sample information
units: units.tsv
indicates where to find the sample names. Paired-end files should be provided such as the forward read is to be found in thefq1
column while the reverse read in thefq2
column. Please pay attention at respecting the spaces between the filenames, these should be tab sepatation or 4 spaces . -
The directories:
-
working_dir
: a temporary directory that can be removed after the run -
result_dir
: a directory that contains the desired output files at the end of the run
-
-
The link to the reference genomes used for the mapping, please note that if you have a favorite version of the genome that you use in a genome browser you should use the same or update your reference genome in the genome browser in order to avoid compatibility problems.
-
Parameters for various tools in the pipeline, every tools in the pipeline offer a range of possible change in their parameters. It would be too long to describe them here but you can find these parameters in the manual of the tools.
This configuration file is then used to build parameters in the main
Snakefile
.
Snakemake execution
The Snakemake pipeline/workflow management system reads a master file (often called
Snakefile
) to list the steps to be executed and defining their order.
It has many rich features. Read more
here
.
Test run
Use the command
snakemake -np
after activating the
chipseq environment
to perform a dry run that prints out the rules and commands. If there is no
RED
message, everything is fine and you can proceed to the
Real run
Real run
Within
the folder containing the
Snakefile
(if unchanged
Snakemake_ChIPseq_PE/
, simply type
Snakemake --use-conda --use-singularity
and provide the number of cores with
--cores 10
for ten cores for instance.
Please pay attention to
--use-conda
, it is required for the installation and loading of the dependencies used by the rules of the pipeline.
Main outputs
The main output are :
-
fastqc : Provide informations about the quality of the sequences provided and generate a html file to visualize it. More information to be found here
-
bed : Provide information generated by the MACS2 algorithm for the locations and significance of peaks. These files can be used for direct visualization of the peaks location using IGV or as an input for further analysis using the bedtools
-
bigwig files : Provides files allowing fast displays of read coverages track on any type of genome browsers.
-
plotFingerprint contains interesting figures that answer the question: "Did my ChIP work???" . Explanation of the plot and the options available can be found here
-
PLOTCORRELATION folder contain pdf files displaying the correlation between the samples tested in the ChIP experiment, many options in the plotcorrelation rules can be changed via the configuration file. More information about this plot can be found here
-
HEATMAP folder contain pdf files displaying the content of the matrix produced by the
computeMatrix
rule under the form of a heatmap. Many option for thecomputeMatrix
and theplotHeatmap
rules can be changed in the configuration file. More information about this figure can be found here . -
plotProfile folder contain pdf files displaying profile plot for scores over sets of genomic region, again the genomic region are define in the matrix made previously. Again there are many options to change the plot and more information can be found here
Optionals outputs of the pipelines are bamCompare , bedgraph and bed files for broad peaks calling .
Directed Acyclic Graph of the Snakemake ChIP-seq PE pipeline
This graph has been produced with the command:
snakemake --rulegraph |dot -Tpng > dag.png
Code Snippets
20 21 22 23 24 25 26 27 28 | shell: "bamCoverage --bam {input.bam} \ -o {output} \ --effectiveGenomeSize {params.EFFECTIVEGENOMESIZE} \ --extendReads {params.EXTENDREADS} \ --binSize {params.binSize} \ --smoothLength {params.smoothLength} \ --ignoreForNormalization {params.ignoreForNormalization} \ &>{log}" |
52 53 54 55 56 57 58 59 60 61 62 | shell: "bamCompare -b1 {input.treatment} \ -b2 {input.control} \ --binSize {params.binSize} \ -o {output.bigwig} \ --normalizeUsing {params.normalizeUsing} \ --operation {params.operation} \ --smoothLength {params.smoothLength} \ --ignoreForNormalization {params.ignoreForNormalization} \ --scaleFactorsMethod {params.scaleFactorsMethod} \ &>{log}" |
78 79 80 81 82 83 84 85 86 | shell: "multiBamSummary bins \ --bamfiles {input} \ --numberOfProcessors {threads}\ --binSize {params.binSize} \ --centerReads \ --extendReads \ -o {output} \ 2> {log}" |
104 105 106 107 108 109 110 111 112 113 | shell: "plotCorrelation \ --corData {input} \ --corMethod {params.corMethod} \ --whatToPlot {params.whatToPlot} \ --skipZeros \ --colorMap {params.color} \ --plotFile {output} \ --plotNumbers \ 2> {log}" |
133 134 135 136 137 138 139 140 141 142 143 144 | shell: "computeMatrix \ reference-point \ --referencePoint TSS \ -S {input.bigwig} \ -R {input.bed} \ --afterRegionStartLength {params.upstream} \ --beforeRegionStartLength {params.downstream} \ --numberOfProcessors {threads} \ --binSize {params.binSize} \ -o {output} \ 2> {log}" |
162 163 164 165 166 167 168 169 | shell: "plotHeatmap \ --matrixFile {input} \ --outFileName {output} \ --kmeans {params.kmeans} \ --colorMap {params.color} \ --legendLocation best \ --outFileSortedRegions {params.cluster}" |
188 189 190 191 192 193 | shell: "plotFingerprint \ -b {input.treatment} {input.control} \ --extendReads {params.EXTENDREADS} \ --binSize {params.binSize} \ --plotFile {output}" |
211 212 213 214 215 216 217 218 | shell: "plotProfile \ --matrixFile {input} \ --outFileName {output.pdf} \ --outFileSortedRegions {output.bed} \ --kmeans {params.kmeans} \ --startLabel {params.startLabel} \ --endLabel {params.endLabel}" |
5 | shell: "wget -O {output} {GENOME_FASTA_URL}" |
12 | shell: "wget -O {output} {GFF_URL}" |
20 21 | shell: "python scripts/gff_to_gtf.py {input} {output}" |
18 19 20 21 | shell: """ macs2 callpeak -t {input.treatment} -c {input.control} {params.format} {params.genomesize} --name {params.name} --nomodel --bdg -q {params.qvalue} --outdir results/bed/ &>{log} """ |
40 41 42 43 | shell: """ macs2 callpeak -t {input.treatment} -c {input.control} {params.format} --broad --broad-cutoff 0.1 {params.genomesize} --name {params.name} --nomodel --bdg -q {params.qvalue} --outdir results/bed/ &>{log} """ |
26 27 28 29 30 31 32 33 34 35 36 37 | shell: "trimmomatic PE {params.phred} -threads {threads} " "{input.reads} " "{output.forward_reads} " "{output.forwardUnpaired} " "{output.reverse_reads} " "{output.reverseUnpaired} " "ILLUMINACLIP:{input.adapters}:{params.seedMisMatches}:{params.palindromeClipTreshold}:{params.simpleClipThreshhold} " "LEADING:{params.LeadMinTrimQual} " "TRAILING:{params.TrailMinTrimQual} " "SLIDINGWINDOW:{params.windowSize}:{params.avgMinQual} " "MINLEN:{params.minReadLen} &>{log}" |
53 54 | shell: "fastp -i {input.fwd} -I {input.rev} -h {output} 2>{log}" |
70 | shell:"bowtie2-build --threads {threads} {input} {params}" |
92 93 94 95 | shell: """ bowtie2 {params.bowtie} --threads {threads} -x {params.index} -1 {input.forward} -2 {input.reverse} -U {input.forwardUnpaired},{input.reverseUnpaired} --un-conc-gz {params.unmapped} | samtools view -Sb - > {output.mapped} 2>{log} """ |
109 | shell:"samtools sort -@ {threads} -o {output} {input} &>{log}" |
122 123 124 125 126 | shell: """ samtools rmdup {input} {output.bam} &>{log} samtools index {output.bam} """ |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 | import sys args = sys.argv f = open(args[1]) of = open(args[2], "w") keep = set(["gene", "transcript", "exon"]) for line in f: if line.startswith("#"): continue cols = line.split("\t") if cols[2] == "mRNA": cols[2] = "transcript" if cols[2] not in keep: continue attribs = cols[8].split(";") attribs = [x.replace("=", "\t").split("\t") for x in attribs] newAttribs = {} for t in attribs: if t[0] == "ID": _, ID = t[1].split(":") if cols[2] == "gene": newAttribs['gene_id'] = ID elif cols[2] == "transcript": newAttribs['transcript_id'] = ID x = ID.rindex(".") newAttribs['gene_id'] = ID[:x] elif cols[2] == "exon": x = ID.rindex(".") ID = ID[:x] newAttribs['transcript_id'] = ID x = ID.rindex(".") ID = ID[:x] newAttribs['gene_id'] = ID cols[8] = " ".join("{} \"{}\";".format(x, y) for x, y in newAttribs.items()) of.write("\t".join(cols)) of.write("\n") f.close() of.close() |
106 | shell:"rm -rf {WORKING_DIR}" |
Support
- Future updates
Related Workflows





