Introduction
This repository contains a Snakemake workflow implementing a basic ChIP-seq analysis pipeline. It was adapted from an existing ChIP-seq workflow and modified to suit my preferred analysis approach.
This workflow performs the following steps:
- Optional: Retrieval of publicly available sequencing data from NCBI GEO/SRA.
- Optional: Merge reads from the same sample that were sequenced on separate lanes or sequencing runs.
- Align raw reads to a reference genome. The workflow will automatically retrieve any publicly available reference genome and build the bowtie2 index.
- Filter aligned reads based on alignment quality and discard multi-mapping reads. The filtering parameters can be customized.
- Optional: Filter aligned reads to discard reads aligning to contigs or the mitochondrial genome. This filtering can also be customized.
- Optional: Perform peak calling using MACS2.
- Generate bigWig files containing z-score normalized read depth to allow data visualization in the genome browser.
- Optional: Spike-in normalization for experiments where exogenous control chromatin or cells are added to each sample as an internal control.
Running the workflow
Follow the steps below to run this workflow:
Step 1: Software installation
Ensure that you have a conda-based Python 3 distribution installed. I recommend Mambaforge.
Snakemake and Snakedeploy are used to deploy and run this workflow. These can be installed using Mamba:
mamba create -c conda-forge -c bioconda --name snakemake snakemake snakedeploy
OR Conda:
conda create -c conda-forge -c bioconda --name snakemake snakemake snakedeploy
Step 2: Deploy the workflow
Activate the conda environment:
conda activate snakemake
Create a new directory for your analysis and enter it:
mkdir -p path/to/project-workdir
cd path/to/project-workdir
Deploy the workflow:
snakedeploy deploy-workflow https://github.com/tjgibson/NGS-workflow-chipseq . --branch main
This command will create all the files necessary for running this workflow.
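For reference, after deployment your project directory will typically contain something like the following (exact contents may vary with the snakedeploy version):

workflow/Snakefile   # imports this workflow as a Snakemake module
config/config.yaml   # workflow configuration (edited in Step 4)
config/units.tsv     # sample sheet (edited in Step 3)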
Step 3: Entering your sample information
The previous step should have created the file config/units.tsv in your local project directory. This is a template that can be modified to contain the information about the samples you wish to analyze. Before modifying it, take a look at the file to get a sense of how it is formatted. The fields of this file are described below:
sample_name: The name you want to use for each individual sample. I like to use some sort of naming convention like developmental-stage_genotype_treatment_IP_rep1, where each piece of important information about the sample is separated by an underscore. If a single piece of information, such as developmental stage, contains multiple words, separate them with a dash: L3-larva_rep1.
unit_name: If you sequenced the same sample across multiple lanes or sequencing runs, these will be listed as separate units. For example, control_IP_rep1_A and control_IP_rep1_B. If you only have a single fastq file for a sample, then the unit_name should be identical to the sample_name.
fq1: The file path to the location on your computer where the raw fastq file is stored. These can be located anywhere on your computer, but I suggest placing them in a directory called data, raw, or raw_data within your analysis directory. Fastq files should be compressed via gzip to save space.
fq2: If you performed paired-end sequencing, list the second fastq file here. For single-end data, leave this field blank.
sra: This workflow can perform automatic retrieval of publicly available data from the Sequence Read Archive (SRA). Enter the SRA accession number (e.g. SRR0000001) here. If you provide both an SRA number and a local fastq file, the workflow will use the local file.
read_format: This should be set to SE for single-end or PE for paired-end.
call_peaks: This field indicates if peak calling should be performed using MACS2 and should be set to TRUE or FALSE.
input: If you have control samples to be used during peak calling (e.g. input or IgG), this column will indicate which control samples correspond to which IP samples. Enter the sample_name for the input/control that corresponds to the IP listed in each row. The name listed here must match that listed in the corresponding row of the sample_name column. See the example for more details. If call_peaks is set to TRUE and input is blank, MACS2 peak calling will be performed without using a control dataset.
peak_type: This specifies what type of peak calling will be performed by MACS2 and should be set to narrow (e.g. transcription factors, H3K27ac) or broad (e.g. H3K27me3, H3K9me3).
sample_group: Indicates the name to use for each group of replicates. This name will be used to name bigWig and peak files for merged replicates. For example, for samples control_IP_rep1 and control_IP_rep2, sample_group could be set to control_IP.
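To make the format concrete, here is a minimal hypothetical units.tsv for a single-end experiment with two IP replicates sharing one input control. Sample names and file paths are invented for illustration; the real file is tab-separated (blank fq2, sra, and input fields are simply left empty), and you should check the deployed template for the exact column order:

sample_name         unit_name           fq1                                    fq2  sra  read_format  call_peaks  input               peak_type  sample_group
control_IP_rep1     control_IP_rep1     raw_data/control_IP_rep1.fastq.gz                SE           TRUE        control_input_rep1  narrow     control_IP
control_IP_rep2     control_IP_rep2     raw_data/control_IP_rep2.fastq.gz                SE           TRUE        control_input_rep1  narrow     control_IP
control_input_rep1  control_input_rep1  raw_data/control_input_rep1.fastq.gz             SE           FALSE                                      control_input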
Step 4: Configuring the workflow
The workflow can be customized to suit your needs by editing the file config/config.yaml. Below is a description of the different options that can be customized within the config file.
Merging of technical replicates
If you have samples that were sequenced on multiple runs or sequencing lanes, change the following lines of the config.yaml file:
mergeReads:
activate: True
This will result in merging of the fastq files prior to alignment.
Filtering of reads based on alignment to specific chromosomes
When working with Drosophila data (or other organisms with high-quality genome assemblies), I typically discard reads aligning to unplaced contigs or scaffolds. For Drosophila, this entails only retaining reads aligning to the major chromosome arms (chr2L, chr2R, chr3L, chr3R, chrX, and chrY). This can be done with the following lines of the config file:
filter_chroms: True
keep_chroms: config/keep_chroms.txt
You can modify the keep_chroms.txt file to reflect a different filtering strategy. Make sure that the chromosome naming convention used in the file matches that of your chosen reference genome (see below). For example, chr2L for UCSC vs. 2L for Ensembl.
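For the Drosophila/UCSC example above, keep_chroms.txt would simply list one chromosome name per line:

chr2L
chr2R
chr3L
chr3R
chrX
chrY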
Setting the reference genome
This is where you set the reference genome to use for read alignment. Give the genome a name and provide a link to the fasta file for the genome. Reference genomes can be found through the UCSC Genome Browser, Ensembl, or organism-specific genome databases (e.g. FlyBase). The pipeline will automatically retrieve the reference genome from the provided link and build the bowtie2 index required for alignment. Modify the following lines as necessary for your desired reference genome:
ref_genome:
name: dm6
link: https://hgdownload.soe.ucsc.edu/goldenPath/dm6/bigZips/dm6.fa.gz
If you want to use a custom genome that is not publicly available, or a bowtie2 index that you have already built, create a folder called resources/ in your working directory and copy your fasta file or the bowtie2 index files to this directory. If using a custom fasta file, rename the file to ref_genome.fasta.gz. If using a custom bowtie2 index, use the prefix genome for the index files (e.g. genome.1.bt2).
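For example, to reuse an index you have already built (the /path/to/my_index prefix here is hypothetical), you could copy the six bowtie2 index files into resources/ and rename them to use the genome prefix:

mkdir -p resources
cp /path/to/my_index.1.bt2      resources/genome.1.bt2
cp /path/to/my_index.2.bt2      resources/genome.2.bt2
cp /path/to/my_index.3.bt2      resources/genome.3.bt2
cp /path/to/my_index.4.bt2      resources/genome.4.bt2
cp /path/to/my_index.rev.1.bt2  resources/genome.rev.1.bt2
cp /path/to/my_index.rev.2.bt2  resources/genome.rev.2.bt2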
Setting the spike-in genome
This workflow can optionally perform spike-in normalization if your experiment included exogenous DNA that was added to each sample prior to immunoprecipitation. Similar to above, modify the following lines of the config file:
use_spikeIn: False
spikeIn_genome:
name: yeast_w303
link: http://sgd-archive.yeastgenome.org/sequence/strains/W303/W303_SGD_2015_JRIU00000000/W303_SGD_2015_JRIU00000000.fsa.gz
The workflow will combine the reference genome and spike-in genome into a custom fasta file and use this for alignment. Following alignment, the workflow counts the number of reads aligning to each genome (after removing multi-mapping reads, which will also remove reads that align to both genomes) and then filters out the spike-in reads from the final BAM file. The percentage of spike-in reads is used to calculate a scaling factor. The values in the z-score normalized bigWig files will be multiplied by this scaling factor to produce spike-in normalized bigWig files. The pipeline also outputs a table with the number of reference and spike-in reads, the percentage of spike-in reads, and the scaling factors. If you wish to use a different strategy for spike-in normalization (e.g. skip the initial z-score normalization), you should be able to use the information in this table to implement alternative spike-in normalizations.
Note: For IP samples that have a corresponding input control, the calculation of the scaling factors takes into account the percentage of spike-in reads in both the input and the IP. This is the strategy we used in this paper, and it worked well for the experiments described therein. If IP samples lack an input, then the scaling factor will be based only on the percentage of spike-in reads in the IP.
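As a rough worked example of the default calculation (numbers are invented; see compute_scaling_factors.R in the Code Snippets below for the actual implementation): if an IP sample contains 4% spike-in reads and its input contains 2%, the IP/input enrichment is 4 / 2 = 2 and the raw scaling factor is 1 / 2 = 0.5. A second IP with 2% spike-in reads and an input with 2% gives an enrichment of 1 and a raw scaling factor of 1. Each raw scaling factor is then divided by the largest one across samples (here, 1), giving final factors of 0.5 and 1 that are multiplied into the z-score normalized bigWigs. For IP samples without an input, the raw scaling factor is simply 1 / (percent spike-in reads in the IP).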
Modifying parameters for specific steps
The final section of the config file allows customization of the parameters for individual steps in the workflow:
params:
bowtie2_index: ""
bowtie2_align: "-k 2 --very-sensitive --no-mixed --no-discordant -X 5000"
filter_multireads: "-f bam -F '(mapping_quality >= 30) and ([XS] == null)'"
bigwigs_ind: "--binSize 10"
bigwigs_merged: "--binSize 10"
macs2_call_peaks_narrow: "-g dm --call-summits"
macs2_call_peaks_broad: "-g dm"
filter_peaks_by_replicates:
min_overlap: 10
replicate_threshold: 2
These can be modified as desired to alter these steps. For example, by default the workflow will run bowtie2 with the parameters -k 2 --very-sensitive --no-mixed --no-discordant -X 5000. The parameters listed here in the config file will be passed directly to the corresponding step when it is run. The parameters for filter_peaks_by_replicates indicate that peaks called by MACS2 will be filtered to only include peaks that were detected in at least 2 replicates, considering a peak to be shared between replicates if the peaks in each replicate overlap one another by at least 10 bp.
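For example, if you were analyzing mouse rather than Drosophila data, you might swap the MACS2 genome-size shortcut from -g dm to -g mm (the standard MACS2 shortcut for mouse), leaving the other options unchanged:

macs2_call_peaks_narrow: "-g mm --call-summits"
macs2_call_peaks_broad: "-g mm"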
Step 5: Running the workflow
You are now ready to run the workflow. Before running the workflow, it is recommended to do a "dry run", in which Snakemake will check that all the required input and output files exist, and provide a summary of what the workflow will do. To do a dry run, invoke the following command:
snakemake -n
Examining the output of this dry run can help make sure that sample names and locations of raw files were entered correctly.
If everything looks good, run the workflow with the following command:
snakemake --use-conda -c 1
The -c 1 parameter tells Snakemake to use a single core for running the workflow. Depending on the number of processors/cores available on your computer, this number can be increased, which will speed up execution by allowing multiple steps to run in parallel and allowing steps such as read alignment to use multiple cores.
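For example, on a machine with eight available cores you could run:

snakemake --use-conda -c 8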
This workflow can also be run on a computing cluster or using cloud computing services. See the relevant sections of the Snakemake documentation (cluster execution and cloud execution) for more information.
Navigating the output
Once finished, the workflow will have produced various output files located in a folder named results/. The results will be organized into the following subdirectories:
aligned_reads: BAM files containing aligned, sorted, filtered reads and their BAI indices. A number of intermediate BAM files will be generated as the workflow runs. Upon completion, intermediate files will be deleted and only the filtered, sorted BAM files will be retained.
bigwigs: z-score normalized bigWig files for individual replicates and merged replicates. If spike-in normalization is indicated in the config file, spike-in normalized bigWigs will also appear here.
peaks: narrowPeak/broadPeak and bed files containing peaks called by MACS2. This includes peaks called on individual replicates, merged replicates, and peaks filtered by replicates according to the criteria in the config file.
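As a rough sketch of the layout (the file names within each subdirectory will depend on your sample_name and sample_group values):

results/
├── aligned_reads/   # filtered, sorted BAM files and BAI indices
├── bigwigs/         # z-score (and optionally spike-in) normalized bigWig files
└── peaks/           # narrowPeak/broadPeak and bed files from MACS2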
Combining workflows for multiple data types
It may be desirable to process multiple data types (e.g. ChIP-seq, RNA-seq, ATAC-seq) and then integrate those results into downstream analysis. This can be done in a single Snakemake workflow that contains multiple "modules" for the different data types. For an example of how to build a complex workflow including multiple datasets as well as custom downstream analysis, see this repository.
To-do
In the future, I hope to add some of the following features:
- Add small test fastq files to the repo to allow automated testing of rules.
- Create an HTML summary file containing read alignment and filtering statistics, number of peaks called, and other useful information about the workflow.
- Add an optional step to take unaligned reads, BLAST them, and provide a summary of which organisms the unaligned reads likely originated from.
- Add validation to ensure that the unit table and config file are formatted correctly.
Code Snippets
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"

from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

n = len(snakemake.input.sample)
assert (
    n == 1 or n == 2
), "input->sample must have 1 (single-end) or 2 (paired-end) elements."

if n == 1:
    reads = "-U {}".format(*snakemake.input.sample)
else:
    reads = "-1 {} -2 {}".format(*snakemake.input.sample)

shell(
    "(bowtie2 --threads {snakemake.threads} {extra} "
    "-x {snakemake.params.index} {reads} "
    "| samtools view -Sbh -o {snakemake.output[0]} -) {log}"
)
__author__ = "Daniel Standage"
__copyright__ = "Copyright 2020, Daniel Standage"
__email__ = "daniel.standage@nbacc.dhs.gov"
__license__ = "MIT"

from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

indexbase = snakemake.output[0].replace(".1.bt2", "")

shell(
    "bowtie2-build --threads {snakemake.threads} {snakemake.params.extra} "
    "{snakemake.input.reference} {indexbase}"
)
__author__ = "Antonie Vietor"
__copyright__ = "Copyright 2020, Antonie Vietor"
__email__ = "antonie.v@gmx.de"
__license__ = "MIT"

import os
import sys

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=True, stderr=True)

in_contr = snakemake.input.get("control")
params = "{}".format(snakemake.params)
opt_input = ""
out_dir = ""

ext = "_peaks.xls"
out_file = [o for o in snakemake.output if o.endswith(ext)][0]
out_name = os.path.basename(out_file[: -len(ext)])
out_dir = os.path.dirname(out_file)

if in_contr:
    opt_input = "-c {contr}".format(contr=in_contr)

if out_dir:
    out_dir = "--outdir {dir}".format(dir=out_dir)

if any(out.endswith(("_peaks.narrowPeak", "_summits.bed")) for out in snakemake.output):
    if any(
        out.endswith(("_peaks.broadPeak", "_peaks.gappedPeak")) for out in snakemake.output
    ):
        sys.exit(
            "Output files with _peaks.narrowPeak and/or _summits.bed extensions cannot be created together with _peaks.broadPeak and/or _peaks.gappedPeak extended output files.\n"
            "For usable extensions please see https://snakemake-wrappers.readthedocs.io/en/stable/wrappers/macs2/callpeak.html.\n"
        )
    else:
        if " --broad" in params:
            sys.exit(
                "If --broad option in params is given, the _peaks.narrowPeak and _summits.bed files will not be created. \n"
                "Remove --broad option from params if these files are needed.\n"
            )

if any(
    out.endswith(("_peaks.broadPeak", "_peaks.gappedPeak")) for out in snakemake.output
):
    if "--broad " not in params and not params.endswith("--broad"):
        params += " --broad "

if any(
    out.endswith(("_treat_pileup.bdg", "_control_lambda.bdg")) for out in snakemake.output
):
    if all(p not in params for p in ["--bdg", "-B"]):
        params += " --bdg "
else:
    if any(p in params for p in ["--bdg", "-B"]):
        sys.exit(
            "If --bdg or -B option in params is given, the _control_lambda.bdg and _treat_pileup.bdg extended files must be specified in output. \n"
        )

shell(
    "(macs2 callpeak "
    "-t {snakemake.input.treatment} "
    "{opt_input} "
    "{out_dir} "
    "-n {out_name} "
    "{params}) {log}"
)
__author__ = "Jan Forster"
__copyright__ = "Copyright 2021, Jan Forster"
__email__ = "j.forster@dkfz.de"
__license__ = "MIT"

import os

from snakemake.shell import shell

in_file = snakemake.input[0]
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)

if in_file.endswith(".sam") and ("-S" not in extra or "--sam-input" not in extra):
    extra += " --sam-input"

shell(
    "sambamba view {extra} -t {snakemake.threads} "
    "{snakemake.input[0]} > {snakemake.output[0]} "
    "{log}"
)
__author__ = "Antonie Vietor"
__copyright__ = "Copyright 2020, Antonie Vietor"
__email__ = "antonie.v@gmx.de"
__license__ = "MIT"

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=False, stderr=True)

shell("samtools idxstats {snakemake.input.bam} > {snakemake.output[0]} {log}")
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=True, stderr=True)

# Samtools takes additional threads through its option -@
# One thread for samtools merge
# Other threads are *additional* threads passed to the '-@' argument
threads = "" if snakemake.threads <= 1 else " -@ {} ".format(snakemake.threads - 1)

shell(
    "samtools index {threads} {snakemake.params} {snakemake.input[0]} {snakemake.output[0]} {log}"
)
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=True, stderr=True)

# Samtools takes additional threads through its option -@
# One thread for samtools merge
# Other threads are *additional* threads passed to the '-@' argument
threads = "" if snakemake.threads <= 1 else " -@ {} ".format(snakemake.threads - 1)

shell(
    "samtools merge {threads} {snakemake.params} "
    "{snakemake.output[0]} {snakemake.input} "
    "{log}"
)
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"

import os

from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

out_name, out_ext = os.path.splitext(snakemake.output[0])

tmp_dir = snakemake.params.get("tmp_dir", "")
if tmp_dir:
    prefix = os.path.join(tmp_dir, os.path.basename(out_name))
else:
    prefix = out_name

# Samtools takes additional threads through its option -@
# One thread for samtools
# Other threads are *additional* threads passed to the argument -@
threads = "" if snakemake.threads <= 1 else " -@ {} ".format(snakemake.threads - 1)

shell(
    "samtools sort {extra} {threads} -o {snakemake.output[0]} "
    "-T {prefix} {snakemake.input[0]} "
    "{log}"
)
script:
    "../scripts/fasterq-dump.py"

script:
    "../scripts/fasterq-dump.py"

shell:
    "cat {input} > {output} 2> {log}"

wrapper:
    "v1.1.0/bio/bowtie2/align"

wrapper:
    "v1.1.0/bio/samtools/sort"

wrapper:
    "v1.1.0/bio/samtools/index"

shell:
    "bamCoverage --bam {input.bam} -o {output} -p {threads} {params.extra}"

wrapper:
    "v1.1.0/bio/samtools/merge"

wrapper:
    "v1.1.0/bio/samtools/index"

shell:
    "bamCoverage --bam {input.bam} -o {output} -p {threads} {params.extra}"

script:
    "../scripts/zscore_normalize_bw.R"

script:
    "../scripts/zscore_normalize_bw.R"

script:
    "../scripts/compute_scaling_factors.R"

script:
    "../scripts/spikeIn_normalize_bw.R"

script:
    "../scripts/spikeIn_normalize_bw.R"

wrapper:
    "v1.1.0/bio/macs2/callpeak"

wrapper:
    "v1.1.0/bio/macs2/callpeak"

wrapper:
    "v1.1.0/bio/macs2/callpeak"

wrapper:
    "v1.1.0/bio/macs2/callpeak"

script:
    "../scripts/filter_peaks_by_replicates.R"

wrapper:
    "v1.1.0/bio/samtools/idxstats"

wrapper:
    "v1.1.0/bio/sambamba/view"

wrapper:
    "v1.1.0/bio/samtools/index"

wrapper:
    "v1.1.0/bio/samtools/idxstats"

shell:
    "samtools view -bh -L {input.keep_chroms} --output-fmt BAM -o {output} {input.bam} 2>> {log}"

wrapper:
    "v1.1.0/bio/sambamba/view"

wrapper:
    "v1.1.0/bio/samtools/index"

wrapper:
    "v1.1.0/bio/samtools/idxstats"

shell:
    "curl {params.link} > {output} 2> {log}"

shell:
    "curl {params.link} > {output} 2> {log}"

shell:
    """
    seqkit replace -p "(.+)" -r "spikeIn_\$1" -o resources/tmp_spikeIn.fasta.gz {input.spikeIn} 2> {log}
    cat {input.ref} resources/tmp_spikeIn.fasta.gz > {output} 2>> {log}
    rm resources/tmp_spikeIn.fasta.gz
    """

shell:
    "mv {input} {output} 2> {log}"

shell:
    "seqkit grep -f {input.keep_chroms} {input.genome}"
    " | seqkit fx2tab -nil"
    " | awk -v OFS='\t' '{{print $1, 1, $2}}' > {output}"

wrapper:
    "v1.1.0/bio/bowtie2/build"
library(tidyverse)

# import data ------------------------------------------------------------------
# read in all relevant stat files
stat_files <- snakemake@input[["mapping_stats"]]
names(stat_files) <- stat_files

mapping_stats <- stat_files %>%
  map_dfr(read_tsv, .id = "source",
          col_names = c("chromosome", "chromosome_size", "mapped_reads", "unmapped_reads")) %>%
  mutate(sample_name = gsub("_unireads.idxstats", "", basename(source)))

# add column marking reference or spike-in chromosomes
mapping_stats <- mapping_stats %>%
  mutate(reference_genome = case_when(
    str_detect(chromosome, "spikeIn") ~ "spikeIn",
    !str_detect(chromosome, "spikeIn") ~ "reference"
  ))

# compute % reads mapping to reference and spike-in genomes
mapping_stats <- mapping_stats %>%
  group_by(sample_name, reference_genome) %>%
  summarise(total_reads = sum(mapped_reads)) %>%
  pivot_wider(names_from = reference_genome, values_from = total_reads) %>%
  mutate(total_mapped_reads = reference + spikeIn) %>%
  mutate(percent_reference = reference / total_mapped_reads * 100,
         percent_spikeIn = spikeIn / total_mapped_reads * 100)

# read in unit table
sample_table <- snakemake@config[["units"]] %>%
  read_tsv() %>%
  # remove redundant technical replicates from sample_table
  distinct(sample_name, .keep_all = TRUE)

# mark samples as inputs or IPs
sample_table <- sample_table %>%
  mutate(chip_sample_type = case_when(
    (sample_name %in% .$input) ~ "input",
    !(sample_name %in% .$input) ~ "IP"
  )) %>%
  mutate(has_input = !is.na(input))

# create new column indicating input/IP pairs
tmp <- sample_table %>%
  select(input, sample_name) %>%
  dplyr::rename(IP = sample_name, sample_name = input)

sample_table <- sample_table %>%
  left_join(tmp, by = "sample_name") %>%
  mutate(IP_group = case_when(
    is.na(IP) ~ sample_name,
    !is.na(IP) ~ IP
  ))

# prepare sample_table and mapping_stats tables for join
individual_mapping_stats <- mapping_stats %>%
  select(sample_name, percent_reference, percent_spikeIn)

# join sample_table and mapping stats
individual_scaling_factors <- sample_table %>%
  left_join(individual_mapping_stats, by = "sample_name")

# compute normalization factors for individual samples -------------------------
sample_has_input <- individual_scaling_factors %>%
  filter(chip_sample_type == "IP") %>%
  select(has_input) %>%
  pull()

if (all(sample_has_input)) {
  individual_scaling_factors <- individual_scaling_factors %>%
    pivot_wider(names_from = chip_sample_type, values_from = percent_spikeIn, id_cols = IP_group) %>%
    mutate(enrichment = IP / input) %>%
    mutate(scaling_factor = 1 / enrichment) %>%
    mutate(norm_scaling_factor = scaling_factor / max(scaling_factor))
} else {
  warning("not all IPs have corresponding input sample, calculating scaling factors based on %spikeIn for IP samples")
  individual_scaling_factors <- individual_scaling_factors %>%
    pivot_wider(names_from = chip_sample_type, values_from = percent_spikeIn, id_cols = IP_group) %>%
    mutate(scaling_factor = 1 / IP) %>%
    mutate(norm_scaling_factor = scaling_factor / max(scaling_factor))
}

individual_scaling_factors %>%
  write_tsv(snakemake@output[[1]])

# prepare data for merged samples -----------------------------
merged_mapping_stats <- mapping_stats %>%
  select(sample_name, reference, spikeIn)

# combine replicate reads
merged_mapping_stats <- sample_table %>%
  left_join(merged_mapping_stats, by = "sample_name") %>%
  group_by(sample_group) %>%
  summarise(spikeIn = sum(spikeIn), reference = sum(reference))

# create new column indicating input/IP pairs
tmp <- sample_table %>%
  select(input, sample_group) %>%
  dplyr::rename(IP = sample_group, sample_name = input)

sample_table <- sample_table %>%
  select(-IP, -IP_group) %>%
  left_join(tmp, by = "sample_name") %>%
  mutate(IP_group = case_when(
    is.na(IP) ~ sample_group,
    !is.na(IP) ~ IP
  ))

# join sample_table and mapping stats for merged reads
merged_scaling_factors <- merged_mapping_stats %>%
  left_join(select(sample_table, sample_group, chip_sample_type, IP_group, has_input), by = "sample_group") %>%
  distinct(.keep_all = TRUE) %>%
  mutate(total_mapped_reads = reference + spikeIn) %>%
  mutate(percent_reference = reference / total_mapped_reads * 100,
         percent_spikeIn = spikeIn / total_mapped_reads * 100)

# compute normalization factors for merged samples -------------------------
sample_has_input <- merged_scaling_factors %>%
  filter(chip_sample_type == "IP") %>%
  select(has_input) %>%
  pull()

if (all(sample_has_input)) {
  merged_scaling_factors <- merged_scaling_factors %>%
    pivot_wider(names_from = chip_sample_type, values_from = percent_spikeIn, id_cols = IP_group) %>%
    mutate(enrichment = IP / input) %>%
    mutate(scaling_factor = 1 / enrichment) %>%
    mutate(norm_scaling_factor = scaling_factor / max(scaling_factor))
} else {
  warning("not all IPs have corresponding input sample, calculating scaling factors based on %spikeIn for IP samples")
  merged_scaling_factors <- merged_scaling_factors %>%
    pivot_wider(names_from = chip_sample_type, values_from = percent_spikeIn, id_cols = IP_group) %>%
    mutate(scaling_factor = 1 / IP) %>%
    mutate(norm_scaling_factor = scaling_factor / max(scaling_factor))
}

merged_scaling_factors %>%
  write_tsv(snakemake@output[[2]])
__author__ = "Johannes Köster, Derek Croote"
__copyright__ = "Copyright 2020, Johannes Köster"
__email__ = "johannes.koester@uni-due.de"
__license__ = "MIT"

import os
import tempfile

from snakemake.shell import shell
from snakemake_wrapper_utils.snakemake import get_mem

log = snakemake.log_fmt_shell(stdout=True, stderr=True)
extra = snakemake.params.get("extra", "")

# Parse memory
mem_mb = get_mem(snakemake, "MiB")

# Outdir
outdir = os.path.dirname(snakemake.output[0])
if outdir:
    outdir = f"--outdir {outdir}"

# Output compression
compress = ""
mem = f"-m{mem_mb}" if mem_mb else ""

for output in snakemake.output:
    out_name, out_ext = os.path.splitext(output)
    if out_ext == ".gz":
        compress += f"pigz -p {snakemake.threads} {out_name}; "
    elif out_ext == ".bz2":
        compress += f"pbzip2 -p{snakemake.threads} {mem} {out_name}; "

with tempfile.TemporaryDirectory() as tmpdir:
    mem = f"--mem {mem_mb}M" if mem_mb else ""
    shell(
        "(fasterq-dump --temp {tmpdir} --threads {snakemake.threads} {mem} "
        "{extra} {outdir} {snakemake.wildcards.accession}; "
        "{compress}"
        ") {log}"
    )
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(rtracklayer))
suppressPackageStartupMessages(library(GenomicRanges))

# define functions -------------------------------------------------------------
# function to merge multiple sets of ChIP peaks into a non-overlapping master set
# supports multiple methods
# peaks should be a GRangesList
# method should be either "overlap" or "merge":
# overlap method will determine overlaps among all peak sets and combine non-overlapping peaks
# merge method will combine overlapping peaks into a single peak
combine_peaks <- function(peaks, method = "overlap", min_overlap = 1L) {
  require(GenomicRanges)

  # check arguments
  if (length(peaks) < 2) {
    stop("Only one set of peaks provided. For merging peaks, provide multiple peak sets.")
  }
  if (!inherits(peaks, "GRangesList")) {
    stop("one or more peak set is not a Granges object. peaks must be a GRangesList")
  }
  if (!any(method == c("overlap", "merge"))) {
    stop("Invalid method provided. method should be 'overlap' or 'merge' ")
  }
  if (!is_integer(min_overlap)) {
    stop("min_overlap must be an integer")
  }

  # merge peaks using "overlap" method
  if (method == "overlap") {
    combined_peaks <- peaks[[1]]
    for (i in 2:length(peaks)) {
      i_set <- peaks[[i]]
      combined_peaks <- c(combined_peaks,
                          subsetByOverlaps(i_set, combined_peaks, minoverlap = min_overlap, invert = TRUE))
    }
  }

  # merge peaks using "merge" method
  if (method == "merge") {
    combined_peaks <- GenomicRanges::reduce(unlist(peaks))
  }

  mcols(combined_peaks) <- NULL
  return(combined_peaks)
}

# function to build an overlap table
# takes a Granges list object
# combines all nonoverlapping peaks from all samples, then builds a logical table
# indicating which peaks from each sample overlap each peak on the master list
peak_overlap_table <- function(peaks, method = "overlap", min_overlap = 1L, combine_peaks = TRUE) {
  # check arguments
  if (length(peaks) < 2) {
    stop("Only one set of peaks provided. For merging peaks, provide multiple peak sets.")
  }
  if (!inherits(peaks, "GRangesList")) {
    stop("one or more peak set is not a Granges object. peaks must be a GRangesList")
  }
  if (!any(method == c("overlap", "merge"))) {
    stop("Invalid method provided. method should be 'overlap' or 'merge' ")
  }
  if (!is_integer(min_overlap)) {
    stop("min_overlap must be an integer")
  }

  # set names
  if (is.null(names(peaks))) {
    names(peaks) <- paste0("sample_", seq((peaks)))
  }

  if (combine_peaks) {
    # get a master set of all nonoverlapping peaks from all peak sets
    all_peaks_gr <- combine_peaks(peaks, method = method, min_overlap = min_overlap)
  } else {
    all_peaks_gr <- peaks[[1]]
  }
  all_peaks_df <- as.data.frame(all_peaks_gr) %>%
    dplyr::select(1:5)

  # build a logical overlap table indicating which peaks were detected in which samples
  for (i in 1:length(peaks)) {
    sample_name <- names(peaks)[i]
    i_peaks <- peaks[[i]]
    all_peaks_df[, sample_name] <- FALSE
    overlaps <- findOverlaps(all_peaks_gr, i_peaks, minoverlap = min_overlap)@from
    all_peaks_df[overlaps, sample_name] <- TRUE
  }
  return(all_peaks_df)
}

# read peaks into overlap table ------------------------------------------------
merged_peak_file <- snakemake@input[["merged_peaks"]]
names(merged_peak_file) <- "merged_peaks"

replicate_peak_files <- snakemake@input[["ind_peaks"]]
names(replicate_peak_files) <- paste0("rep", seq(replicate_peak_files), "_peaks")

peak_files <- c(merged_peak_file, replicate_peak_files)

overlap_table <- peak_files %>%
  map(rtracklayer::import) %>%
  GRangesList() %>%
  peak_overlap_table(min_overlap = as.integer(snakemake@params[["min_overlap"]]))

# filter to include only peaks detected in multiple replicates -----------------
filtered_peaks <- overlap_table %>%
  mutate(reps_w_peak = rowSums(.[7:ncol(.)])) %>%
  filter(merged_peaks, reps_w_peak >= snakemake@params[["replicate_threshold"]]) %>%
  makeGRangesFromDataFrame()

# export filtered peaks to file -----------------------------------------------
if (snakemake@wildcards[["peak_type"]] == "narrow") {
  keep_cols <- c(1:3, 6:7, 5, 8:11)
} else if (snakemake@wildcards[["peak_type"]] == "broad") {
  keep_cols <- c(1:3, 6:7, 5, 8:10)
} else {
  stop("Peak format should be 'narrow' or 'broad'. Check input file format")
}

output <- merged_peak_file %>%
  rtracklayer::import() %>%
  subsetByOverlaps(filtered_peaks) %>%
  as.data.frame() %>%
  select(all_of(keep_cols)) %>%
  mutate(strand = ".")

write_tsv(output, snakemake@output[[1]], col_names = FALSE)
suppressPackageStartupMessages(library(GenomicRanges))
suppressPackageStartupMessages(library(rtracklayer))
suppressPackageStartupMessages(library(tidyverse))

# import data --------------------------------------------------------------------------------------------------
# get filename of bigwig that will be used for normalization
bw <- snakemake@input[["bw"]]

# read in normalization factors for merged replicates
scaling_factors <- read_tsv(snakemake@input[["scaling_factors"]])

# get sample name to use for finding the correct scaling factor
file_bn <- gsub(".bw", "", basename(bw))

# rescale bigwig file based on spike-in normalization factor ---------------------------------------------------
# get scaling factor
scaling_factor <- scaling_factors %>%
  filter(IP_group == file_bn) %>%
  select(norm_scaling_factor) %>%
  pull()

# normalize bigwig
scaled.gr <- import(bw) %>%
  as.data.frame() %>%
  mutate(score = score * scaling_factor) %>%
  makeGRangesFromDataFrame(keep.extra.columns = TRUE)

# set seqinfo of normalized Granges object
seqlevels(scaled.gr) <- seqlevels(BigWigFile(bw))
seqinfo(scaled.gr) <- seqinfo(BigWigFile(bw))

# export normalized bigwig -----------------------------------------------------
export(scaled.gr, snakemake@output[[1]], format = "bw")
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(rtracklayer))
suppressPackageStartupMessages(library(GenomicRanges))

zscore_bw <- function(bw) {
  require(tidyverse)
  require(rtracklayer)
  require(GenomicRanges)

  # import bigwig file to Granges
  if (typeof(bw) == "character") {
    message("reading bigwig file")
    bw <- import(bw)
  }

  # if using a spike-in, filter the seqlevels to only the reference genome
  if (snakemake@config[["use_spikeIn"]]) {
    message("removing spikeIn chromosomes")
    ref_chroms <- seqlevels(bw)[!grepl("spikeIn_", seqlevels(bw))]
    bw <- keepSeqlevels(bw, ref_chroms, pruning.mode = "coarse")
  }

  if (snakemake@config[["filter_chroms"]]) {
    message("filtering reference chromosomes")
    keep_chroms <- read_tsv(snakemake@config[["keep_chroms"]], col_names = c("chromosome"))
    ref_chroms <- seqlevels(bw)[seqlevels(bw) %in% keep_chroms$chromosome]
    bw <- keepSeqlevels(bw, ref_chroms, pruning.mode = "coarse")
  }

  # for large regions with the same score, expand into equal sized bins
  message("binning genome")
  min_binsize <- min(width(bw))
  all_bins <- tileGenome(seqinfo(bw), tilewidth = min_binsize, cut.last.tile.in.chrom = TRUE)

  message("getting scores for all bins")
  # add the coverage/score for both input and IP
  all_bins <- subsetByOverlaps(all_bins, bw)
  overlaps <- findOverlaps(all_bins, bw)
  all_bins$score[overlaps@from] <- bw$score[overlaps@to]

  # perform z-score normalization
  message("performing z-score normalization")
  all_bins$zscore <- scale(all_bins$score)[, 1]
  all_bins$score <- NULL
  all_bins$score <- all_bins$zscore
  all_bins$zscore <- NULL

  # collapse adjacent bins with same score
  collapsed <- unlist(GenomicRanges::reduce(split(all_bins, ~score)))
  collapsed$score <- as.numeric(names(collapsed))
  names(collapsed) <- NULL
  all_bins <- collapsed

  # set seqinfo for z-score normalized version
  seqinfo(all_bins) <- seqinfo(bw)

  return(all_bins)
}

# perform z-score normalization and write new bigwig files ---------------------
zscore.gr <- zscore_bw(snakemake@input[[1]])
export(zscore.gr, snakemake@output[[1]])