cutTag-pipeline
Snakemake pipeline for Cut&Tag analysis
Setup
1. Configure the project directory
# cd into a project directory
# type the following to get a copy of the pipeline
git clone https://github.com/maxsonBraunLab/cutTag-pipeline.git
# create a directory for your fastq files
cd cutTag-pipeline
mkdir -p data/raw
# link fastqs to data/raw
ln -s /path/to/fastq/files/sample1_R1.fastq.gz data/raw
ln -s /path/to/fastq/files/sample1_R2.fastq.gz data/raw
ln -s /path/to/fastq/files/sample2_R1.fastq.gz data/raw
ln -s /path/to/fastq/files/sample2_R2.fastq.gz data/raw
...
# make scripts executable
chmod +x src/*.py src/*.sh *.sh
IMPORTANT
After creating the symlinks, rename all the symlinks in data/raw to the following format:
{condition}_{replicate}_{mark}_{R1|R2}.fastq.gz
For example, a file originally named LIB200706TB_M6Q3_RBP1_S93_L001_R1_001.fastq.gz would be renamed to M6Q_3_RBP1_R1.fastq.gz.
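As a quick illustration, the renaming can be done with plain mv on the symlinks. The file names below are only the example above (the R2 sibling name is assumed); adapt them to your own samples:
# example only: rename symlinks to the {condition}_{replicate}_{mark}_{R1|R2} format
cd data/raw
mv LIB200706TB_M6Q3_RBP1_S93_L001_R1_001.fastq.gz M6Q_3_RBP1_R1.fastq.gz
mv LIB200706TB_M6Q3_RBP1_S93_L001_R2_001.fastq.gz M6Q_3_RBP1_R2.fastq.gz
cd ../..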
2. Make the sample sheet and deseq2 metadata.
Activate an environment containing snakemake, then run the make_sample_sheet.py script from the root of the project directory:
$ python src/make_sample_sheet.py data/raw
This creates a samplesheet for the experiment called samplesheet.tsv in the root of the directory, as well as the file src/deseq2_metadata.csv. The contents of the samplesheet will be structured like the following example:
sample R1 R2 mark condition igg
HoxE_1_IgG data/raw/HoxE_1_IgG_R1.fastq.gz data/raw/HoxE_1_IgG_R2.fastq.gz IgG HoxE HoxE_1_IgG
HoxE_1_Rbp1 data/raw/HoxE_1_Rbp1_R1.fastq.gz data/raw/HoxE_1_Rbp1_R2.fastq.gz Rbp1 HoxE HoxE_1_Rbp1
HoxE_2_Rbp1 data/raw/HoxE_2_Rbp1_R1.fastq.gz data/raw/HoxE_2_Rbp1_R2.fastq.gz Rbp1 HoxE HoxE_2_Rbp1
HoxE_3_Rbp1 data/raw/HoxE_3_Rbp1_R1.fastq.gz data/raw/HoxE_3_Rbp1_R2.fastq.gz Rbp1 HoxE HoxE_3_Rbp1
HoxM_1_IgG data/raw/HoxM_1_IgG_R1.fastq.gz data/raw/HoxM_1_IgG_R2.fastq.gz IgG HoxM HoxM_1_IgG
HoxM_1_Rbp1 data/raw/HoxM_1_Rbp1_R1.fastq.gz data/raw/HoxM_1_Rbp1_R2.fastq.gz Rbp1 HoxM HoxM_1_Rbp1
HoxM_2_Rbp1 data/raw/HoxM_2_Rbp1_R1.fastq.gz data/raw/HoxM_2_Rbp1_R2.fastq.gz Rbp1 HoxM HoxM_2_Rbp1
HoxM_3_Rbp1 data/raw/HoxM_3_Rbp1_R1.fastq.gz data/raw/HoxM_3_Rbp1_R2.fastq.gz Rbp1 HoxM HoxM_3_Rbp1
HoxW_1_IgG data/raw/HoxW_1_IgG_R1.fastq.gz data/raw/HoxW_1_IgG_R2.fastq.gz IgG HoxW HoxW_1_IgG
HoxW_1_Rbp1 data/raw/HoxW_1_Rbp1_R1.fastq.gz data/raw/HoxW_1_Rbp1_R2.fastq.gz Rbp1 HoxW HoxW_1_Rbp1
HoxW_2_Rbp1 data/raw/HoxW_2_Rbp1_R1.fastq.gz data/raw/HoxW_2_Rbp1_R2.fastq.gz Rbp1 HoxW HoxW_2_Rbp1
HoxW_3_Rbp1 data/raw/HoxW_3_Rbp1_R1.fastq.gz data/raw/HoxW_3_Rbp1_R2.fastq.gz Rbp1 HoxW HoxW_3_Rbp1
The script splits each file name on '_' and uses the first field for the condition and the third field for the mark. The 'igg' column is initially identical to the 'sample' column and should be manually replaced with the sample name of the IgG (or other control) you would like to use for that sample. If the sample is itself an IgG, the entry can remain its own name; this won't affect peak calling.
So a fixed version of the table above would look like this:
sample R1 R2 mark condition igg
HoxE_1_IgG data/raw/HoxE_1_IgG_R1.fastq.gz data/raw/HoxE_1_IgG_R2.fastq.gz IgG HoxE HoxE_1_IgG
HoxE_1_Rbp1 data/raw/HoxE_1_Rbp1_R1.fastq.gz data/raw/HoxE_1_Rbp1_R2.fastq.gz Rbp1 HoxE HoxE_1_IgG
HoxE_2_Rbp1 data/raw/HoxE_2_Rbp1_R1.fastq.gz data/raw/HoxE_2_Rbp1_R2.fastq.gz Rbp1 HoxE HoxE_1_IgG
HoxE_3_Rbp1 data/raw/HoxE_3_Rbp1_R1.fastq.gz data/raw/HoxE_3_Rbp1_R2.fastq.gz Rbp1 HoxE HoxE_1_IgG
HoxM_1_IgG data/raw/HoxM_1_IgG_R1.fastq.gz data/raw/HoxM_1_IgG_R2.fastq.gz IgG HoxM HoxM_1_IgG
HoxM_1_Rbp1 data/raw/HoxM_1_Rbp1_R1.fastq.gz data/raw/HoxM_1_Rbp1_R2.fastq.gz Rbp1 HoxM HoxM_1_IgG
HoxM_2_Rbp1 data/raw/HoxM_2_Rbp1_R1.fastq.gz data/raw/HoxM_2_Rbp1_R2.fastq.gz Rbp1 HoxM HoxM_1_IgG
HoxM_3_Rbp1 data/raw/HoxM_3_Rbp1_R1.fastq.gz data/raw/HoxM_3_Rbp1_R2.fastq.gz Rbp1 HoxM HoxM_1_IgG
HoxW_1_IgG data/raw/HoxW_1_IgG_R1.fastq.gz data/raw/HoxW_1_IgG_R2.fastq.gz IgG HoxW HoxW_1_IgG
HoxW_1_Rbp1 data/raw/HoxW_1_Rbp1_R1.fastq.gz data/raw/HoxW_1_Rbp1_R2.fastq.gz Rbp1 HoxW HoxW_1_IgG
HoxW_2_Rbp1 data/raw/HoxW_2_Rbp1_R1.fastq.gz data/raw/HoxW_2_Rbp1_R2.fastq.gz Rbp1 HoxW HoxW_1_IgG
HoxW_3_Rbp1 data/raw/HoxW_3_Rbp1_R1.fastq.gz data/raw/HoxW_3_Rbp1_R2.fastq.gz Rbp1 HoxW HoxW_1_IgG
In this example there was only one IgG per condition, so that IgG's sample name is used for every sample in its condition. If each sample had its own control file, each entry would instead point to the IgG for that sample. If only one IgG was used in the whole experiment, its sample name could be used for every row. If you are not using IgG, set config['USEIGG'] to false and leave the samplesheet as generated.
3. Edit configuration files
- Edit the runtime configuration in config.yml (an illustrative sketch is shown at the end of this step):
  - Specify whether or not to trim adapters from raw reads to improve downstream alignment rate. You can decide based on the adapter content in the sequencing core's FastQC reports.
  - Specify whether or not to use IgG for peak calling.
  - Specify the path to the bowtie2 index for the genome you are aligning to.
  - Specify the path to the fastq_screen configuration file.
  - Specify the path to the genome FASTA file.
  - Specify the path to the chromosome sizes file.
- The pipeline detects samples in the subdirectory data/raw with the following assumptions:
  - Paired-end reads
  - Read 1 and read 2 are designated using "_R1" and "_R2"
  - Samples are named using the following convention: {condition}_{replicate}_{mark}_{R1|R2}.fastq.gz
    This format affects the output files and ensures that bigwig files from the same mark are merged.
- Make sure src/deseq2_metadata.csv looks right. The file is created when you run make_sample_sheet.py and should have the following properties:
  - It should have two columns labeled "sample" and "condition".
  - The sample column corresponds to the individual biological replicates (all fields around the "_" delimiter).
  - The condition column is the condition for each sample, taken from the first field of the "_"-delimited name.
  - If you have multiple conditions and marks to analyze, you can introduce more columns into this file and adjust deseq2.R to account for the extra covariates.
The file src/deseq2_metadata.csv is populated with the following example data:
sample,condition
HoxE_1_IgG,HoxE
HoxE_1_Rbp1,HoxE
HoxE_2_Rbp1,HoxE
HoxE_3_Rbp1,HoxE
HoxM_1_IgG,HoxM
HoxM_1_Rbp1,HoxM
HoxM_2_Rbp1,HoxM
HoxM_3_Rbp1,HoxM
HoxW_1_IgG,HoxW
HoxW_1_Rbp1,HoxW
HoxW_2_Rbp1,HoxW
HoxW_3_Rbp1,HoxW
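For orientation, here is a rough sketch of the kinds of entries config.yml holds. The key names GENOME, FASTA, CSIZES, FASTQ_SCREEN, USEIGG, IGG, and N_INTERSECTS appear in the pipeline's scripts; the adapter-trimming key below is a placeholder, so start from the config.yml that ships with the repository rather than writing one from scratch.
# illustrative sketch only -- edit the repository's config.yml
TRIM_ADAPTERS: True                          # placeholder name: whether to adapter-trim reads with fastp
USEIGG: True                                 # whether to use an IgG control for peak calling
IGG: "IgG"                                   # substring that identifies IgG samples
GENOME: "/path/to/bowtie2/index/prefix"      # bowtie2 index
FASTA: "/path/to/genome.fa"                  # genome FASTA
CSIZES: "/path/to/chrom.sizes"               # chromosome sizes file
FASTQ_SCREEN: "/path/to/fastq_screen.conf"   # fastq_screen configuration
N_INTERSECTS: 2                              # replicates a peak must appear in to join the consensus set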
4. Set up SLURM integration (for batch jobs)
Do this step if you are running the pipeline as a batch job and don't yet have a SLURM profile set up.
Download the slurm folder from the maxsonBraunLab repository and copy the entire folder to ~/.config/snakemake. Your file configuration for SLURM should be as follows:
~/.config/snakemake/slurm/<files>
Change the file permissions for the scripts in the slurm folder so that they are executable. To do this, run:
chmod +x ~/.config/snakemake/slurm/slurm*
Execution
A "dry-run" can be accomplished to see what and how files will be generated by using the command:
snakemake -nrp
To invoke the pipeline, please use either of the two options below:
# run in interactive mode. Recommended for running light jobs.
snakemake -j <n cores> --use-conda --conda-prefix $CONDA_PREFIX_1/envs
# run in batch mode. Recommended for running intensive jobs.
sbatch run_pipeline_conda.sh
For users running the pipeline in batch mode, run_pipeline_conda.sh is a wrapper script that contains the following command:
snakemake -j <n jobs> --use-conda --conda-prefix $CONDA_PREFIX_1/envs --profile slurm --cluster-config cluster.yaml
Additional setup instructions are provided in the wrapper script. The maxsonBraunLab slurm repository contains further instructions for setting up a Snakemake SLURM profile, which you can use to standardize additional arguments for batch runs.
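For context, a Snakemake SLURM profile is essentially a config.yaml (plus submission scripts) under ~/.config/snakemake/slurm/ whose keys mirror snakemake command-line options. The sketch below only shows the general shape; it is not the maxsonBraunLab profile, so treat their repository as the source of truth.
# ~/.config/snakemake/slurm/config.yaml -- generic sketch, not the lab's actual profile
cluster: "sbatch --cpus-per-task={threads} --mem=8G --time=12:00:00"   # simplified; real profiles usually pass per-rule resources
jobs: 50
use-conda: true
printshellcmds: true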
Reproducible results with Snakemake + Singularity
To ensure the reproducibility of your results, we recommend running the Snakemake workflow with Singularity containers. The container standardizes the workflow's underlying operating system (e.g. Ubuntu, CentOS), while conda tracks the installation of bioinformatics software (e.g. bowtie2, samtools, DESeq2). To use Singularity in your analysis, log in to an interactive node and load the module first, like this:
# request an interactive node
srun -p light --time=36:00:00 --pty bash
# re-activate your environment with snakemake
conda activate <snakemake-env>
# load the singularity program
module load /etc/modulefiles/singularity/current
By default, Singularity will create and use a cache directory in your personal user root folder (i.e. in /home/users/<username>). This may create problems, as space in a user's root folder on Exacloud is limited. To avoid space issues in your root folder, you can point the Singularity cache directory at a folder in your lab group directory like this:
# make a cache folder inside your lab user folder
mkdir /home/groups/MaxsonLab/<your_user_folder>/singularity_cache
# make the path to the cache folder accessible to other processes
export SINGULARITY_CACHEDIR=/home/groups/MaxsonLab/<your_user_folder>/singularity_cache
If you are an experienced user, you can add the export SINGULARITY_CACHEDIR=... line to your .bashrc file. Otherwise, run the export SINGULARITY_CACHEDIR=... command before doing the steps below.
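For example, to make the setting persistent (the path is the same placeholder used above):
# one-time: persist the cache location in ~/.bashrc, then reload the shell configuration
echo 'export SINGULARITY_CACHEDIR=/home/groups/MaxsonLab/<your_user_folder>/singularity_cache' >> ~/.bashrc
source ~/.bashrc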
More Singularity documentation on Exacloud can be found here. If this is your first time running the pipeline, and especially when using Singularity, you must first install all the conda environments using the following command:
indices_folder="/home/groups/MaxsonLab/indices"
conda_folder="${CONDA_PREFIX_1}/envs"
fastq_folder="/home/groups/MaxsonLab/input-data2/path/to/FASTQ/folder/"
snakemake -j 1 \
--verbose \
--use-conda \
--conda-prefix $conda_folder \
--use-singularity \
--singularity-args "--bind $indices_folder,$conda_folder,$fastq_folder" \
--conda-create-envs-only
The above code snippet will take about an hour or more to run, but it is a one-time installation. After creating the conda environments, the symlinks, samplesheet.tsv, and src/deseq2_metadata.csv, you can invoke the pipeline in the same shell like this:
# Singularity + interactive run
snakemake -j <n cores> \
--use-conda \
--conda-prefix $conda_folder \
--use-singularity \
--singularity-args "--bind $indices_folder,$conda_folder,$fastq_folder"
# Singularity + slurm (batch) run
sbatch run_pipeline_singularity.sh
For users running the Singularity version of the pipeline in batch mode, run_pipeline_singularity.sh is a wrapper script for the pipeline. You will need to add the appropriate FASTQ folder path to the script prior to running; additional instructions are provided in the wrapper script.
NOTE: make sure to keep the double quotes around the --singularity-args value, and insert an integer for the -j flag.
The above command installs the pipeline's conda environments into the --conda-prefix directory, which means the conda environments are not stored inside the container. The --bind argument binds host (Exacloud) paths into the container so it can access the genome indices, the conda prefix, and the raw sequencing files. The --profile slurm option in the wrapper script configures default settings for Snakemake to interact with SLURM; more information can be found here. Feel free to create another snakemake profile with its own set of Singularity arguments for added convenience (a rough sketch follows).
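As a sketch of that idea, a hypothetical profile at ~/.config/snakemake/singularity/config.yaml could predefine the conda and Singularity arguments (all paths below are placeholders):
# ~/.config/snakemake/singularity/config.yaml -- hypothetical profile, paths are placeholders
use-conda: true
use-singularity: true
conda-prefix: "/path/to/conda/envs"
singularity-args: "--bind /home/groups/MaxsonLab/indices,/path/to/conda/envs,/path/to/FASTQ/folder"
With such a profile, snakemake -j 16 --profile singularity would be equivalent to spelling out those flags by hand.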
Method
Without adapter trimming:
With adapter trimming enabled:
Output
Below is an explanation of each output directory:
aligned - sorted and aligned sample bam files
callpeaks - the output of callpeaks.py with peak files and normalized bigwigs for each sample.
counts - the raw counts for each mark over a consensus peak list
deseq2 - the results of deseq2 for each counts table (by mark)
dtools - fingerprint plot data for multiqc to use
fastp - adapter-trimmed FASTQ files (if adapter-trimming option is enabled in `config.yml`)
fastqc - fastqc results
fastq_screen - fastq_screen results
logs - runtime logs for each snakemake rule
markd - duplicate marked bam files
multiqc - contains the file multiqc_report.html with a lot of QC info about the experiment.
plotEnrichment - FRIP statistics for each mark
preseq - library complexity data for multiqc report
Deseq2 outputs
Each mark should have the following output files:
"data/deseq2/{mark}/{mark}-rld-pca.png" - PCA of counts after reguarlized log transformation.
"data/deseq2/{mark}/{mark}-vsd-pca.png" - PCA of counts after variance stabilizing transformation.
"data/deseq2/{mark}/{mark}-normcounts.csv" - normalized count for each sample in each consensus peak.
"data/deseq2/{mark}/{mark}-lognormcounts.csv" - log2 normalized counts for each sample in each consensus peak.
"data/deseq2/{mark}/{mark}-rld.png", - the sdVsMean plot using regularized log transformation.
"data/deseq2/{mark}/{mark}-vsd.png" - the sdVsMean plot using variance stabilizing transformation.
"data/deseq2/{mark}/{mark}-vsd-dist.png" - the sample distance matrix after variance stabilizing transformation.
"data/deseq2/{mark}/{mark}-rld-dist.png" - the sample distance matrix using regularized log transformation.
"data/deseq2/{mark}/{mark}-dds.rds" - the R object with the results of running the DEseq() function.
For each contrast, the differentially expressed genes are written to a file ending in -diffexp.tsv, and those with an adjusted p-value less than 0.05 are written to a file ending in -sig05-diffexp.tsv. A summary of the results using an alpha of 0.05 is also written to a file ending in -sig05-diffexp-summary.txt. Additionally, two MA plots are written to a file ending in plotMA.png, highlighting differential peaks with an adjusted p-value less than 0.1.
See the DESeq2 vignette for further explanation of the above plots and data transforms: https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#exporting-results-to-csv-files
Code Snippets
shell: "fastp -i {input.r1} -I {input.r2} -o {output.r1} -O {output.r2} --detect_adapter_for_pe --thread {threads} -j {log} -h /dev/null"

shell: "fastqc -t {threads} --outdir data/fastqc {input} > {log} 2>&1"

shell:
    "fastq_screen --aligner bowtie2 --threads {threads} --outdir data/fastq_screen "
    "--conf {config[FASTQ_SCREEN]} --force {input} > {log} 2>&1"

shell:
    "bowtie2 --local --very-sensitive-local "
    "--no-unal --no-mixed --threads {threads} "
    "--no-discordant --phred33 "
    "-I 10 -X 700 -x {config[GENOME]} "
    "-1 {input.r1} -2 {input.r2} 2>{log.err} | samtools view -@ {threads} -Sbh - > {output}"

shell: "sambamba sort --tmpdir=data/aligned -t {threads} -o {output} {input} > {log} 2>&1"

shell: "sambamba markdup --tmpdir=data/markd -t {threads} {input} {output} > {log} 2>&1"

shell: "sambamba index -t {threads} {input} > {log} 2>&1"

shell: "bamCoverage -b {input[0]} -o {output} --binSize 10 --smoothLength 50 --normalizeUsing CPM -p {threads} "

shell: "bash src/mergebw.sh -c {config[CSIZES]} -o {output} {input}"

shell:
    "cat {input} | sort -k1,1 -k2,2n | "
    "bedtools merge | "
    "bedtools intersect -a - -b {input} -c | "
    "awk -v OFS='\t' '$4>=2 {{print}}' > {output}"

shell: "src/fraglen-dist.sh {input} {output}"

run:
    pd.options.plotting.backend = "plotly"
    dfs = []
    for i in input:
        cond_marker = [os.path.basename(i).split(".")[0]]
        temp_df = pd.read_csv(i, sep = "\t", index_col = 0, names = cond_marker)
        dfs.append(temp_df)
    df = pd.concat(dfs, axis = 1)
    fraglen = df.plot()
    fraglen.update_layout(
        title='Fragment Length Distribution',
        xaxis_title='Fragment Length (bp)',
        yaxis_title='Counts',
        legend_title_text='Samples')
    fraglen.write_html(str(output))

shell: "preseq c_curve -B {resources.defect_mode} -l 1000000000 -o {output} {input} > {log} 2>&1"

shell: "preseq lc_extrap -B {resources.defect_mode} -l 1000000000 -e 1000000000 -o {output} {input} > {log} 2>&1"

shell: "plotFingerprint -b {input[0]} --smartLabels --outRawCounts {output} > {log} 2>&1"

shell: "gopeaks -b {input[0]} {params.igg} -o data/callpeaks/{wildcards.sample} {params.params} > {log} 2>&1"

shell: "bash src/consensus_peaks.sh {wildcards.mark} {config[N_INTERSECTS]} {output}"

shell: "bash src/skip_frip.sh {input[0]} {input[1]} {output[0]} {output[1]} {log}"

run:
    pd.options.plotting.backend = "plotly"
    dfs = []
    for i in sorted(input):
        cond_marker = "_".join(i.split("_")[1:3])
        temp_df = pd.read_csv(i, sep = "\t", usecols=["percent"]).rename(columns = {'percent': cond_marker})
        dfs.append(temp_df)
    frip_df = pd.concat(dfs, axis = 1)
    frip_df = frip_df.rename(index={0: 'inside'})
    frip_df.loc["outside"] = 100 - frip_df.loc['inside']
    fig = go.Figure(data=[
        go.Bar(name="inside_peaks", x=frip_df.columns, y=frip_df.loc['inside'], marker_color='rgb(255, 201, 57)'),
        go.Bar(name='outside_peaks', x=frip_df.columns, y=frip_df.loc['outside'], marker_color='rgb(0,39, 118)')
    ])
    fig.update_layout(barmode='stack', title='Fraction of Reads in Peaks by Sample',
        xaxis_tickfont_size=14,
        yaxis=dict(title='Fraction of reads in peaks', titlefont_size=16, tickfont_size=14),
        xaxis=dict(title='Samples'))
    fig.write_html(str(output))

script: "src/deseq2.R"

shell: "bash src/homer.sh -m {wildcards.mark} -s 1 -p 8 -g {config[FASTA]}"
import os
import pandas as pd

# NOTE: `st` is the samplesheet DataFrame (samplesheet.tsv) loaded elsewhere in the Snakefile.

# map samples to fastqs
def get_samples():
    """ return list of samples from samplesheet.tsv """
    return list(st.index)

def get_marks():
    """ return list of marks from samplesheet.tsv """
    return list(st['mark'])

def get_mark_conditions():
    """ return list of samples by condition """
    st['mark_condition'] = st['mark'].astype(str) + "_" + st['condition']
    return st['mark_condition'].unique().tolist()

def get_tracks_by_mark_condition(wildcards):
    """ return list of tracks by mark_condition """
    st['mark_condition'] = st['mark'].astype(str) + "_" + st['condition']
    samps = st.groupby(["mark_condition"])['sample'].apply(list)[wildcards.mark_condition]
    return [f"data/tracks/{s}.bw" for s in samps]

def get_peaks_by_mark_condition(wildcards):
    """ return list of peaks by mark_condition """
    st['mark_condition'] = st['mark'].astype(str) + "_" + st['condition']
    samps = st.groupby(["mark_condition"])['sample'].apply(list)[wildcards.mark_condition]
    return [f"data/callpeaks/{s}_peaks.bed" for s in samps]

def get_bowtie2_input(wildcards):
    """ returns reads associated with a sample """
    r1 = st.loc[wildcards.sample]['R1']
    r2 = st.loc[wildcards.sample]['R2']
    return r1, r2

def get_reads():
    """ get list of all reads """
    rlist = list(st['R1']) + list(st['R2'])
    rlist = [os.path.basename(f).split('.')[0] for f in rlist]
    return rlist

def get_igg(wildcards):
    """
    Returns the igg file for the sample,
    unless the sample is IgG, in which case no control file is used.
    """
    if config['USEIGG']:
        igg = st.loc[wildcards.sample]['igg']
        iggbam = f'data/markd/{igg}.sorted.markd.bam'
        isigg = config['IGG'] in wildcards.sample
        if not isigg:
            return f'-c {iggbam}'
        else:
            return ""
    else:
        return ""

def get_callpeaks(wildcards):
    """
    Returns the callpeaks input files.
    """
    bam = f"data/markd/{wildcards.sample}.sorted.markd.bam"
    bai = f"data/markd/{wildcards.sample}.sorted.markd.bam.bai"
    if config["USEIGG"]:
        igg = st.loc[wildcards.sample]['igg']
        iggbam = f'data/markd/{igg}.sorted.markd.bam'
        # the second assignment repoints iggbam at the index file, so the IgG index is required as an input
        iggbam = f'data/markd/{igg}.sorted.markd.bam.bai'
        isigg = config['IGG'] in wildcards.sample
        if not isigg:
            return [bam, bai, iggbam]
        else:
            return [bam, bai]
    else:
        return [bam, bai]

def callpeaks_params(wildcards):
    """
    Returns callpeaks parameters specified by the user in the samplesheet.
    """
    params = st.loc[wildcards.sample]['gopeaks']
    if params == "-":
        return ""
    else:
        return params

def defect_mode(wildcards, attempt):
    if attempt == 1:
        return ""
    elif attempt > 1:
        return "-D"
if ! command -v bedtools &> /dev/null
then
    echo "bedtools could not be found"
    exit
fi

# check args
if [[ $# -eq 0 ]]
then
    echo "no arguments supplied"
    exit 1
fi

MARK=$1
N_INTERSECTS=$2
OUTFILE=$3

# array of replicates per mark
declare -a mpks=(data/callpeaks/*${MARK}*_peaks.bed)
all_samples=$(echo ${mpks[@]} | tr ' ' '\n' | cut -d/ -f3 | cut -d_ -f1-3)
all_conditions=$(echo ${mpks[@]} | tr ' ' '\n' | cut -d/ -f3 | cut -d_ -f1)

for condition in $all_conditions; do
    # file I/O
    tmp_output="data/counts/$MARK.$condition.tmp.bed"
    # list all replicates in one condition
    all_replicates=$(find data/callpeaks/ -name "*$condition*$MARK*.bed" | sort | tr '\n' ' ')
    # find widest peak that appear in at least n replicates + export to tmp file.
    cat $all_replicates | cut -f1-3 | sort -k1,1 -k2,2n | bedtools merge | \
        bedtools intersect -c -a stdin -b $all_replicates | \
        awk -v n=$N_INTERSECTS '$4 >= n' | awk -v OFS='\t' '{print $1,$2,$3}' > $tmp_output
done

# merge intervals for all tmp files and export as consensus peak.
all_temp_files=$(find data/counts -name "*$MARK*.tmp.bed" | sort | tr '\n' ' ')
cat $all_temp_files | sort -k1,1 -k2,2n | bedtools merge > data/counts/${MARK}_consensus.bed
rm $all_temp_files

# count the number of reads per union peak
bedtools multicov -bams data/markd/*${MARK}*.bam -bed data/counts/${MARK}_consensus.bed -D > ${OUTFILE}_tmp

# label the counts table
ls data/markd/*${MARK}*.bam | sed 's!.*/!!' | cut -d. -f1 | xargs | tr ' ' '\t' | awk '{print "chrom\tstart\tend\t" $0}' | cat - ${OUTFILE}_tmp > ${OUTFILE}

# remove tmp
rm ${OUTFILE}_tmp
library("tidyverse")
library("vsn")
library("pheatmap")
library("ggplot2")
library("RColorBrewer")
library("DESeq2")
library("PoiClaClu")
library("gplots")
library("plyranges")
library("tidyr")
library("dendextend")
library("ggrepel")
library("tools")

message("setting snakemake parameters")
exp_prefix = snakemake@params[["mark"]]
outdir = snakemake@params[["outdir"]]
numclusters = snakemake@params[["numclusters"]]
message(paste0("using numclusters = ", numclusters))
stopifnot(is.numeric(numclusters))

input = snakemake@input[["counts"]]
meta = snakemake@input[["meta"]]
genes = snakemake@input[["genes"]]

pcaPlot = snakemake@output[["pcaPlot"]]
pcaPlotVsd = snakemake@output[["pcaPlotVsd"]]
normCount = snakemake@output[["normCounts"]]
lnormCount = snakemake@output[["lnormCounts"]]
sdMeanRld = snakemake@output[["sdMeanRld"]]
sdMeanVsd = snakemake@output[["sdMeanVsd"]]
sampleDistPlotVsd = snakemake@output[["sampleDistVsd"]]
sampleDistPlotRld = snakemake@output[["sampleDistRld"]]
rds = snakemake@output[["rds"]]

if (!dir.exists(outdir)) {
    dir.create(outdir)
}

# read in genes file
genestab = read_tsv(genes, col_names=c("seqnames", "start", "end", "name", "score", "strand")) %>%
    GRanges()

# counts join peaks
peaks = read_tsv(input) %>%
    mutate(row=row_number()) %>%
    mutate(name=paste0("peak", as.character(row)))

# read in counts table
counts = read.delim(input, header=T, stringsAsFactors = F, check.names=F) %>%
    mutate(row=row_number()) %>%
    mutate(peak=paste0("peak", as.character(row)))
rownames(counts) = counts$peak
counts = counts[,4:ncol(counts)]

# read in metadata
meta <- read.csv(meta, header = T, stringsAsFactors = F)

# make the meta reflect the available sample names
meta = meta[meta$sample %in% colnames(counts),]

# make sure meta rownames and counts colnames are in the same order.
rownames(meta) <- meta$sample
counts <- counts[, rownames(meta)]
stopifnot(rownames(meta) == colnames(counts))

message('calculating deseq...')
print(head(counts))
print(head(meta))

# make deseqdataset
ddsMat <- DESeqDataSetFromMatrix(countData = counts, colData = meta, design = ~condition)
dds <- DESeq(ddsMat)

# write normalized counts tables
normCounts <- counts(dds, normalized=TRUE)
normCounts <- 0.00001+(normCounts) # ensure nonzero
lnormCounts <- log2(normCounts)
lnormCounts <- as.data.frame(lnormCounts)
lnormCounts$ID <- counts$peak
print(head(lnormCounts))
write.csv(normCounts, normCount)
write.csv(lnormCounts, lnormCount)

# calculate rlog transform
rld <- rlog(dds, blind=FALSE)
vsd <- varianceStabilizingTransformation(dds, blind = TRUE, fitType = "parametric")
ntd <- normTransform(dds)

message('plotting count transforms...')
save.image()

# plot the transform for the first two samples
pdf(sdMeanRld)
rldCounts <- meanSdPlot(assay(rld))
rldCounts$gg + labs(title="meanSdPlot - rlog transformed")
dev.off()

pdf(sdMeanVsd)
vsdCounts <- meanSdPlot(assay(vsd))
vsdCounts$gg + labs(title="meanSdPlot - vsd transformed")
dev.off()

message('plotting poisson distance sample cross correlation...')

# plot sample distances
pdf(sampleDistPlotVsd)
sampleDists <- dist(t(assay(vsd)))
sampleDistMatrix <- as.matrix(sampleDists)
rownames(sampleDistMatrix) <- paste(meta$sample, meta$condition, sep="-")
colnames(sampleDistMatrix) <- NULL
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
pheatmap(sampleDistMatrix,
    clustering_distance_rows=sampleDists,
    clustering_distance_cols=sampleDists,
    col=colors,
    main=paste(exp_prefix, "vsd sample distance matrix"))
dev.off()

pdf(sampleDistPlotRld)
sampleDists <- dist(t(assay(rld)))
sampleDistMatrix <- as.matrix(sampleDists)
rownames(sampleDistMatrix) <- paste(meta$sample, meta$condition, sep="-")
colnames(sampleDistMatrix) <- NULL
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
pheatmap(sampleDistMatrix,
    clustering_distance_rows=sampleDists,
    clustering_distance_cols=sampleDists,
    col=colors,
    main=paste(exp_prefix, "rld sample distance matrix"))
dev.off()

# plot PCA rld
pdf(pcaPlot)
plotPCA(rld, intgroup = c("condition")) + labs(title=paste0(exp_prefix,"-rld")) + geom_text_repel(aes(label=name))
dev.off()

# plot PCA vsd
pdf(pcaPlotVsd)
plotPCA(vsd, intgroup = c("condition")) + labs(title=paste0(exp_prefix,"-vsd")) + geom_text_repel(aes(label=name))
dev.off()

# make contrasts
c = combn(unique(meta$condition), 2)
lsc = split(c, rep(1:ncol(c), each = nrow(c)))

message('writing contrast results...')

# write differential results for each contrast
for (k in 1:length(lsc)) {

    cl = lsc[[k]]
    rname = paste0(cl[1], "vs", cl[2])
    res = results(dds, independentFiltering = TRUE, cooksCutoff = FALSE, contrast = c("condition", cl[1], cl[2]), alpha=0.05)

    # estimate shrinkage
    resLFC <- lfcShrink(dds, coef=2, type="apeglm")

    # plotMA
    maplot = paste0(outdir,"/",exp_prefix,"/",exp_prefix,"-",rname,"plotMA.pdf")
    pdf(maplot)
    par(mfrow=c(1,2), mar=c(4,4,2,1))
    xlim <- c(1,1e5); ylim <- c(-2,2)
    plotMA(resLFC, xlim=xlim, ylim=ylim, alpha=0.05, main=paste(rname, "apeglm"))
    plotMA(res, xlim=xlim, ylim=ylim, alpha=0.05, main=paste(rname, "normal"))
    dev.off()

    tableName = paste0(outdir,"/",exp_prefix,"/",cl[1],'-',cl[2],'-',exp_prefix,'-','peaks-annot.tsv')

    # annotate peaks
    diffexp = res %>%
        as_tibble(rownames='name') %>%
        left_join(peaks) %>%
        GRanges() %>%
        join_nearest(genestab, suffix=c(".peak", ".gene")) %>%
        as_tibble()
    write_tsv(diffexp, path=tableName)

    # differential peak files: lfc > 0 | lfc < 0, padj < 0.05 | padj < 0.01
    deup05 = paste0(outdir,"/",exp_prefix,"/",cl[1],'-',cl[2],'-',exp_prefix,'-','differential-up-05.bed')
    dedown05 = paste0(outdir,"/",exp_prefix,"/",cl[1],'-',cl[2],'-',exp_prefix,'-','differential-down-05.bed')
    deup01 = paste0(outdir,"/",exp_prefix,"/",cl[1],'-',cl[2],'-',exp_prefix,'-','differential-up-01.bed')
    dedown01 = paste0(outdir,"/",exp_prefix,"/",cl[1],'-',cl[2],'-',exp_prefix,'-','differential-down-01.bed')

    # sig up peaks
    diffexp %>%
        filter(log2FoldChange > 0, padj < 0.05) %>%
        select(seqnames, start, end, name.peak, name.gene, score, strand) %>%
        write_tsv(deup05, col_names=FALSE)

    # sig down peaks
    diffexp %>%
        filter(log2FoldChange < 0, padj < 0.05) %>%
        select(seqnames, start, end, name.peak, name.gene, score, strand) %>%
        write_tsv(dedown05, col_names=FALSE)

    # sig up peaks
    diffexp %>%
        filter(log2FoldChange > 0, padj < 0.01) %>%
        select(seqnames, start, end, name.peak, name.gene, score, strand) %>%
        write_tsv(deup01, col_names=FALSE)

    # sig down peaks
    diffexp %>%
        filter(log2FoldChange < 0, padj < 0.01) %>%
        select(seqnames, start, end, name.peak, name.gene, score, strand) %>%
        write_tsv(dedown01, col_names=FALSE)

    summaryName = paste0(outdir,"/",exp_prefix,"/",cl[1],"-",cl[2],"-",exp_prefix,"-diffexp-summary.txt")
    diffexp %>%
        summarise(Condition=paste0(exp_prefix,"-",rname),
                  DP05=nrow(filter(.,padj<0.05)),
                  DP01=nrow(filter(.,padj<0.01))) %>%
        write_tsv(summaryName)

    message("creating heatmap...")
    sig = diffexp %>% filter(padj < 0.05)

    # calculate heatmap if there are more than 10 sig peaks
    if (nrow(sig) > 10) {

        annot_filename = gsub(".tsv", "-05-clust.tsv", basename(tableName))

        # make direction annotation df
        direction <- sig %>%
            select(name.peak, log2FoldChange) %>%
            mutate(direction=ifelse(log2FoldChange>0, "up", "down")) %>%
            data.frame()
        rownames(direction) = direction[,1]
        direction[,1] = NULL

        # subset counts by sig peaks
        print(head(lnormCounts))
        counts <- rownames_to_column(lnormCounts, var="peak") %>% as_tibble()
        print(head(counts))
        sigcounts <- counts %>% filter(peak %in% sig$name.peak)

        # organize samples by condition
        meta_conditions <- meta %>%
            group_by(condition) %>%
            summarise(sample=paste(sample, collapse = ","))

        # rowMean counts for each condition
        for (c in meta_conditions$condition) {
            samples = filter(meta_conditions, condition==c)$sample
            s = unlist(strsplit(samples, ","))
            sigcounts <- sigcounts %>% mutate(!!c := rowMeans(sigcounts[s]))
        }

        # retain mean counts per condition
        sigcounts <- sigcounts[append("peak", meta_conditions$condition)]

        # scale the counts
        scaled <- t(apply(as.matrix(sigcounts[,-1]), 1, scale))
        colnames(scaled) = colnames(sigcounts[-1])
        rownames(scaled) <- sigcounts$peak

        # generate clusters
        clust = hclust(dist(scaled), method="ward.D2")
        clustdf = data.frame(cluster=cutree(as.dendrogram(clust), k = numclusters))
        annots = merge.data.frame(clustdf, direction, by=0, all = T) %>% select(-log2FoldChange)
        rownames(annots) = annots[,1]
        annots[,1] = NULL

        # get a number of unique colors
        cols = RColorBrewer::brewer.pal(numclusters, name="Set1")

        # name the colors by cluster
        names(cols) = unique(clustdf$cluster)

        # set cluster colors
        colors = list(cluster=cols)

        heattitle = paste0(exp_prefix,"-",rname)

        # create heatmap
        heat <- pheatmap(scaled,
            main=heattitle,
            show_rownames = F,
            show_colnames = T,
            cluster_rows = T,
            cluster_method = "ward.D2",
            annotation_row = annots,
            annotation_colors = colors)

        save_pheatmap <- function(x, filename, width=7, height=7) {
            if (tolower(tools::file_ext(filename)) == "png") {
                png(filename = filename, width = width, height = height)
            } else {
                pdf(filename, width = width, height = height)
            }
            grid::grid.newpage()
            grid::grid.draw(x$gtable)
            dev.off()
        }

        heatmap_filename = paste0(outdir,"/",exp_prefix,"/",cl[1],"-",cl[2],"-",exp_prefix,"-heatmap.pdf")
        save_pheatmap(heat, heatmap_filename)

        annot_filename = gsub(".tsv", "-05-clust.tsv", tableName)

        # annotate cluster and direction
        annots[["name.peak"]] = rownames(annots)
        sig %>%
            left_join(annots) %>%
            write_delim(annot_filename, delim="\t")
    }
}

# save RDS of deseq object
saveRDS(dds, rds)
samtools view $1 | awk '$9>0' | cut -f 9 | sort | uniq -c | sort -b -k2,2n | sed -e 's/^[ \t]*//' | sed 's/ /\t/g' | awk -v OFS='\t' '{print $2,$1}' > $2
while getopts "m:g:s:p:" op
do
    case "$op" in
        m) mark="$OPTARG";;
        g) genome="$OPTARG";;
        s) slurm="$OPTARG";;
        p) p="$OPTARG";;
        \?) exit 1;;
    esac
done

if [ -f $genome ]; then
    echo "I found the FASTA file!"
else
    echo -e "$genome does not exist. Exiting program."
    exit
fi

if [[ $slurm == 1 || $slurm == 0 ]]; then
    true
else
    echo "ERROR: -s must be either 0 or 1 for SLURM submission."
    exit 1
fi

log_file_exists() {
    if [ -f $1 ]; then
        rm $1
        touch $1
    fi
    if [ ! -f $1 ]; then
        touch $1
    fi
}

# detect contrast combinations per mark
contrasts=$(find "data/deseq2/${mark}" -name "*.bed" | cut -d/ -f4 | sed -E "s/([a-zA-Z0-9.-]+)-($mark)-([a-zA-Z0-9-]+)(.bed)/\1/" | sort | uniq)

# HOMER analysis per contrast -----------------------------------------------------------
for contrast in $contrasts; do

    echo "assessing contrast $contrast in mark $mark"

    up_05="data/deseq2/${mark}/${contrast}-${mark}-differential-up-05.bed"
    up_01="data/deseq2/${mark}/${contrast}-${mark}-differential-up-01.bed"
    down_05="data/deseq2/${mark}/${contrast}-${mark}-differential-down-05.bed"
    down_01="data/deseq2/${mark}/${contrast}-${mark}-differential-down-01.bed"

    declare -a differential_peaks=($up_05 $up_01 $down_05 $down_01)
    echo -e "Differential peaks files:\n${differential_peaks[@]}"

    # assess and run HOMER per bed file
    for file in ${differential_peaks[@]}; do

        # check if file exists
        if [ -f "$file" ]; then
            echo "$file detected"
            peak_count=$(cat $file | wc -l)
            echo "Peak count: $peak_count"

            # run HOMER if >= 10 peaks
            if [ "$peak_count" -ge 10 ]; then

                # extract dir,sig info for the run.
                echo "Running HOMER for $peak_count peaks in $file"
                directionality=$(basename $file | sed -E "s/([a-zA-Z0-9._-]+)-(down|up)-([0-9]+)(.bed)/\2/") # "up" or "down"
                sig=$(basename $file | sed -E "s/([a-zA-Z0-9._-]+)-(down|up)-([0-9]+)(.bed)/\3/") # significance level (01 or 05)

                # file I/O
                output_dir="data/homer/${contrast}-${mark}-${directionality}-${sig}"
                log="data/logs/homer_${contrast}-${mark}-${directionality}-${sig}.log"
                log_file_exists $log

                # HOMER local run/submission to SLURM
                if [ $slurm == 0 ]; then
                    findMotifsGenome.pl $file $genome $output_dir -size 200 -p $p > $log 2>&1 &
                fi
                if [ $slurm == 1 ]; then
                    if [ ! -d "jobs/homer" ]; then
                        mkdir -p jobs/homer
                    fi
                    job_out="jobs/homer/homer-${contrast}-${mark}_%j.out"
                    job_err="jobs/homer/homer-${contrast}-${mark}_%j.err"
                    sbatch -e $job_err -o $job_out --job-name "HOMER" --time "02:00:00" --mem="8G" --cpus-per-task=$p --wait --wrap="findMotifsGenome.pl $file $genome $output_dir -size 200 -p $p > $log 2>&1" &
                fi
            fi
        fi
    done
done

wait
touch data/homer/$mark.done
exit
usage(){
    echo "Usage: "
    echo "  $0 -c chrom_sizes_file -o output_file rep1.bw rep2.bw rep3.bw"
    exit 1
}

file_exits(){
    local f="$1"
    [[ -f "$f" ]] && return 0 || return 1
}

cleanup() {
    rm ${out_name}.tmp.bdg &> /dev/null
}

[[ $# -eq 0 ]] && usage

script_args=()
while [ $OPTIND -le "$#" ]
do
    if getopts c:o: option
    then
        case $option in
            c) chrom_size="$OPTARG";;
            o) out_name="$OPTARG";;
        esac
    else
        script_args+=("${!OPTIND}")
        ((OPTIND++))
    fi
done

if [ -z "$chrom_size" ]
then
    echo "Please supply the chrom.sizes file."
    usage
else
    if ( ! file_exits "$chrom_size" )
    then
        echo "error: file $chrom_size not found"
        usage
    fi
fi

if [ -z "$out_name" ]
then
    echo "Please supply an output file name."
    usage
fi

echo "Using chrom.sizes file: $chrom_size"
echo "Output file name will be: $out_name"

# require wiggletools
if ! command -v wiggletools &> /dev/null
then
    echo "wiggletools not found"
    exit
fi

# require bigWigToBedGraph
if ! command -v bigWigToBedGraph &> /dev/null
then
    echo "bigWigToBedGraph not found"
    exit
fi

echo "merging the following files..."
echo ${script_args[@]}

cmd="wiggletools mean ${script_args[@]} | grep -vE '_|EBV|GL|JH' | awk 'NF==4{print}' | sort -S 8G -k1,1 -k2,2n > ${out_name}.tmp.bdg"
echo $cmd
eval $cmd

if [ $? -eq 0 ]; then
    echo "..."
else
    echo "Command wiggletools failed to complete..."
    echo "quitting.."
    cleanup
    exit 1
fi

# convert to bigwig
bedGraphToBigWig ${out_name}.tmp.bdg $chrom_size $out_name

if [ $? -eq 0 ]; then
    echo "..."
else
    echo "Command bedGraphToBigWig failed..."
    echo "quitting.."
    cleanup
    exit 1
fi

cleanup
echo "done"
input_bed=$1
input_bam=$2
output_png=$3
output_tsv=$4
logfile=$5

if [ -s ${input_bed} ]
then
    plotEnrichment -b ${input_bam} --BED ${input_bed} --regionLabels 'frip' --outRawCounts ${output_tsv} -o ${output_png} > ${logfile} 2>&1
else
    total_read_count=$(samtools view ${input_bam} | wc -l)
    echo "file featureType percent featureReadCount totalReadCount" > ${output_tsv}  # creates output file with column headers
    echo "${input_bam} frip 0 0 ${total_read_count}" >> ${output_tsv}  # appends the sample name with 0 counts
    touch ${output_png}
fi