Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation, topic
- Lack of a description for a new keyword MACS2 .
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
snakemake workflow comparing Genrich and MACS2
workflow of the pipeline
Dependencies
-
bwa/0.7.15 for aligning the reads.
git clone the repo on July 22th
git clone https://github.com/jsh58/Genrich
cd Genrich
make
macs2 2.1.2
conda install macs2
How to run it
ssh bio1
## start a screen session
screen
git clone https://github.com/crazyhottommy/Genrich_compare
## make samples.json file
python sample2json.py /path/to/fastq/dir meta.txt
The
meta.txt
file should be a 4-column tab delimited tsv file.
e.g. for ATACseq data:
cat meta.txt
sample fastq factor read
SRR891269 SRR891269_GSM1155958_GM12878_ATACseq_50k_Rep2_Homo_sapiens_OTHER_1.fastq.gz atac R1
SRR891269 SRR891269_GSM1155958_GM12878_ATACseq_50k_Rep2_Homo_sapiens_OTHER_2.fastq.gz atac R2
SRR891270 SRR891270_GSM1155959_GM12878_ATACseq_50k_Rep3_Homo_sapiens_OTHER_1.fastq.gz atac R1
SRR891270 SRR891270_GSM1155959_GM12878_ATACseq_50k_Rep3_Homo_sapiens_OTHER_2.fastq.gz atac R2
SRR891271 SRR891271_GSM1155960_GM12878_ATACseq_50k_Rep4_Homo_sapiens_OTHER_1.fastq.gz atac R1
SRR891271 SRR891271_GSM1155960_GM12878_ATACseq_50k_Rep4_Homo_sapiens_OTHER_2.fastq.gz atac R2
SRR891272 SRR891272_GSM1155961_GM12878_ATACseq_500_Rep1_Homo_sapiens_OTHER_1.fastq.gz atac R1
SRR891272 SRR891272_GSM1155961_GM12878_ATACseq_500_Rep1_Homo_sapiens_OTHER_2.fastq.gz atac R2
SRR891273 SRR891273_GSM1155962_GM12878_ATACseq_500_Rep2_Homo_sapiens_OTHER_1.fastq.gz atac R1
SRR891273 SRR891273_GSM1155962_GM12878_ATACseq_500_Rep2_Homo_sapiens_OTHER_2.fastq.gz atac R2
SRR891274 SRR891274_GSM1155963_GM12878_ATACseq_500_Rep3_Homo_sapiens_OTHER_1.fastq.gz atac R1
SRR891274 SRR891274_GSM1155963_GM12878_ATACseq_500_Rep3_Homo_sapiens_OTHER_2.fastq.gz atac R2
for ChIPseq data with single-end reads
sample fastq_name factor reads
C57BL SRR534669.fastq.gz Per2 R1
C57BL SRR534670.fastq.gz Per2 R1
C57BL SRR534671.fastq.gz Per2 R1
C57BL SRR534672.fastq.gz Per2 R1
C57BL SRR534673.fastq.gz Per2 R1
C57BL SRR534674.fastq.gz Per2 R1
C57BL SRR534676.fastq.gz Input R1
for ChIPseq data with paired-end reads:
sample fastq_name factor reads
LKR10 Input_LKR10_1.fq.gz Input R1
LKR10 Input_LKR10_2.fq.gz Input R2
V6_5 Input_V6_5_1.fq.gz Input R1
V6_5 Input_V6_5_2.fq.gz Input R2
LKR10 Mll4_LKR10_1.fq.gz Mll4 R1
LKR10 Mll4_LKR10_2.fq.gz Mll4 R2
LKR10 Per2_A_LKR10_1.fq.gz Per2A R1
LKR10 Per2_A_LKR10_2.fq.gz Per2A R2
LKR10 Per2_B_LKR10_1.fq.gz Per2B R1
LKR10 Per2_B_LKR10_2.fq.gz Per2B R2
V6_5 Per2_B_V6_5_1.fq.gz Per2 R1
V6_5 Per2_B_V6_5_2.fq.gz Per2 R2
The
config.yaml
file contains various parameters one need to change.
Now it can handle single-end or pair-end fastqs; ChIPseq with or without Input/IgG controls
and ATACseq data.
## whether it is from fastq files or from bam files
from_fastq: True
## the reads are paired end or single end
paired_end: True
# >70bp long, <70bp false
long_reads: True
# what's the control for IP
# if it is ChIP-seq data
# this is the same as the third colmn of the meta.txt factor column
control: 'Input'
## if it is ChIP-seq data without Input/IgG control or ATACseq data (without control by nature)
## set control : False
## the reference fasta path
ref_fa: "/n/regal/informatics_public/reference_genome_by_tommy/ucsc/mm9/mm9.fa"
macs2_g: mm
macs2_pvalue: 1e-5
# other macs2 arguments
macs2_args: -f BAM
Genrich_args: "-r -p 0.05 -a 200 "
#number of reads downsample to, I set to 50 million, if reads number smaller than
## 50 million, downsample will keep the orignal reads.
## if downsample set to False, no downsample will be done, symbolic link will be used
downsample: False
target_reads: 50000000
## set up the config.yaml file
nano config.yaml
## dry run
snakemake -np
## real run
snakemake -j 24
## submit to slurm
./pyflow_Genrich_compare.sh
To do
-
[ ] Adatpor trimming
-
[ ] Add benchmark directive for logging time and memory usage.
-
[ ] Genrich with replicates https://github.com/jsh58/Genrich#multiple-replicates
-
[ ] IDR framework https://github.com/kundajelab/idr compare it with Genrich. IDR is fast.
-
[ ] conda env and docker container
Code Snippets
107 108 109 110 111 | shell: """ gunzip -c {input.r1} | gzip > {output[0]} 2> {log} gunzip -c {input.r2} | gzip > {output[1]} 2>> {log} """ |
120 | shell: "gunzip -c {input} | gzip > {output} 2> {log}" |
159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 | run: if config["long_reads"]: shell( r""" bwa mem -t 5 -M -v 1 -R '{params.rg}' {config[ref_fa]} {input[0]} {input[1]} 2> {log.bwa} \ | samblaster 2> {log.markdup} \ | samtools view -Sb -F 4 - \ | samtools sort -m 2G -@ 5 -T {output[0]}.tmp -o {output[0]} samtools index {output[0]} """) ## short reads < 70bp ## Probably one of the most important is how many mismatches you will allow between a read and a potential mapping location for that location to be considered a match. ## The default is 4% of the read length, but you can set this to be either another proportion of the read length, or a fixed integer else: shell( r""" bwa aln -t 5 {config[ref_fa]} {input[0]} 2> {log.bwa} > 03aln/{wildcards.sample}_R1.sai bwa aln -t 5 {config[ref_fa]} {input[1]} 2>> {log.bwa} > 03aln/{wildcards.sample}_R2.sai bwa sampe -s -r '{params.rg}' {config[ref_fa]} 03aln/{wildcards.sample}_R1.sai 03aln/{wildcards.sample}_R2.sai {input[0]} {input[1]} 2>> {log.bwa} \ | samblaster 2> {log.markdup} \ | samtools view -Sb -F 4 - \ | samtools sort -m 2G -@ 5 -T {output[0]}.tmp -o {output[0]} rm 03aln/{wildcards.sample}_R1.sai 03aln/{wildcards.sample}_R2.sai samtools index {output[0]} """) |
199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 | run: if config["long_reads"]: shell( r""" bwa mem -t 5 -M -v 1 -R '{params.rg}' {config[ref_fa]} {input} 2> {log.bwa} \ | samblaster -r 2> {log.markdup} \ | samtools view -Sb -F 4 - \ | samtools sort -m 2G -@ 5 -T {output[0]}.tmp -o {output[0]} samtools index {output[0]} """) ## short reads < 70bp else: shell( r""" bwa aln -t 5 {config[ref_fa]} {input} 2> {log.bwa} > 03aln/{wildcards.sample}.sai bwa samse -r '{params.rg}' {config[ref_fa]} 03aln/{wildcards.sample}.sai {input} 2>> {log.bwa} \ | samblaster -r 2> {log.markdup} \ | samtools view -Sb -F 4 - \ | samtools sort -m 2G -@ 5 -T {output[0]}.tmp -o {output[0]} rm 03aln/{wildcards.sample}.sai samtools index {output[0]} """) |
229 230 231 232 | shell: """ samtools flagstat {input} > {output} 2> {log} """ |
242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 | run: import re import subprocess with open (input[2], "r") as f: # fifth line contains the number of mapped reads line = f.readlines()[4] match_number = re.match(r'(\d.+) \+.+', line) total_reads = int(match_number.group(1)) if config["downsample"]: target_reads = config["target_reads"] # 15million reads by default, set up in the config.yaml file if total_reads > target_reads: down_rate = target_reads/total_reads else: down_rate = 1 shell("sambamba view -f bam -t 5 --subsampling-seed=3 -s {rate} {inbam} | samtools sort -m 2G -@ 5 -T {outbam}.tmp > {outbam} 2> {log}".format(rate = down_rate, inbam = input[0], outbam = output[0], log = log)) shell("samtools index {outbam}".format(outbam = output[0])) else: shell("ln -s {inbam} {outbam}".format(inbam = params.source_dir + "/" + input[0], outbam = output[0])) shell("ln -s {inbai} {outbai}".format(inbai = params.source_dir + "/" + input[1], outbai = output[1])) |
269 270 271 272 | shell: """ bamCoverage -b {input[0]} --normalizeUsing RPKM --binSize 30 --smoothLength 300 -p 5 --extendReads 200 -o {output} 2> {log} """ |
285 286 287 288 289 290 291 292 | shell: """ ## for macs2, when nomodel is set, --extsize is default to 200bp, this is the same as 2 * shift-size in macs14. source activate macs2 macs2 callpeak -t {input.case} \ -c {input.control} -g {config[macs2_g]} \ --outdir 09peak_macs2 -n {params.name} -p {config[macs2_pvalue]} {params.custom} &> {log} """ |
303 304 305 306 307 308 309 310 | shell: """ ## for macs2, when nomodel is set, --extsize is default to 200bp, this is the same as 2 * shift-size in macs14. source activate macs2 macs2 callpeak -t {input.case} \ -g {config[macs2_g]} \ --outdir 09peak_macs2 -n {params.name} -p {config[macs2_pvalue]} {params.custom} &> {log} """ |
322 323 324 325 326 327 | shell: """ samtools sort -n -m 2G -@ {threads} -T {wildcards.sample} \ -o {output} \ {input[0]} 2> {log} """ |
339 340 341 342 343 | shell: """ {config[Genrich_path]} -t {input.case} \ -c {input.control} {params.custom} -o {output} &> {log} """ |
354 355 356 357 358 | shell: """ {config[Genrich_path]} -t {input.case} \ {params.custom} -o {output} &> {log} """ |
Support
- Future updates
Related Workflows





