cutTag-pipeline
Snakemake pipeline for Cut&Tag analysis
Setup
1. Configure the project directory
# cd into a project directory
# type the following to get a copy of the pipeline
git clone https://github.com/maxsonBraunLab/cutTag-pipeline.git
# create a directory for your fastq files
cd cutTag-pipeline
mkdir -p data/raw
# link fastqs to data/raw
ln -s /path/to/fastq/files/sample1_R1.fastq.gz data/raw
ln -s /path/to/fastq/files/sample1_R2.fastq.gz data/raw
ln -s /path/to/fastq/files/sample2_R1.fastq.gz data/raw
ln -s /path/to/fastq/files/sample2_R2.fastq.gz data/raw
...
# make scripts executable
chmod +x src/*.py src/*.sh *.sh
IMPORTANT
After creating the symlinks, rename all the symlinks in data/raw to the following format:
{condition}_{replicate}_{mark}_{R1|R2}.fastq.gz
For example, a file originally named LIB200706TB_M6Q3_RBP1_S93_L001_R1_001.fastq.gz would be renamed to M6Q_3_RBP1_R1.fastq.gz.
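As a quick illustration, the renaming can be done with plain mv on the symlinks. The file names below are only the example above (the R2 sibling name is assumed); adapt them to your own samples:
# example only: rename symlinks to the {condition}_{replicate}_{mark}_{R1|R2} format
cd data/raw
mv LIB200706TB_M6Q3_RBP1_S93_L001_R1_001.fastq.gz M6Q_3_RBP1_R1.fastq.gz
mv LIB200706TB_M6Q3_RBP1_S93_L001_R2_001.fastq.gz M6Q_3_RBP1_R2.fastq.gz
cd ../..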
2. Make the sample sheet and deseq2 metadata.
Activate an environment containing snakemake, then run the make_sample_sheet.py script from the root of the project directory:
$ python src/make_sample_sheet.py data/raw
This creates a samplesheet for the experiment called samplesheet.tsv in the root of the directory, as well as the file src/deseq2_metadata.csv. The contents of the samplesheet will be structured like the following example:
sample R1 R2 mark condition igg
HoxE_1_IgG data/raw/HoxE_1_IgG_R1.fastq.gz data/raw/HoxE_1_IgG_R2.fastq.gz IgG HoxE HoxE_1_IgG
HoxE_1_Rbp1 data/raw/HoxE_1_Rbp1_R1.fastq.gz data/raw/HoxE_1_Rbp1_R2.fastq.gz Rbp1 HoxE HoxE_1_Rbp1
HoxE_2_Rbp1 data/raw/HoxE_2_Rbp1_R1.fastq.gz data/raw/HoxE_2_Rbp1_R2.fastq.gz Rbp1 HoxE HoxE_2_Rbp1
HoxE_3_Rbp1 data/raw/HoxE_3_Rbp1_R1.fastq.gz data/raw/HoxE_3_Rbp1_R2.fastq.gz Rbp1 HoxE HoxE_3_Rbp1
HoxM_1_IgG data/raw/HoxM_1_IgG_R1.fastq.gz data/raw/HoxM_1_IgG_R2.fastq.gz IgG HoxM HoxM_1_IgG
HoxM_1_Rbp1 data/raw/HoxM_1_Rbp1_R1.fastq.gz data/raw/HoxM_1_Rbp1_R2.fastq.gz Rbp1 HoxM HoxM_1_Rbp1
HoxM_2_Rbp1 data/raw/HoxM_2_Rbp1_R1.fastq.gz data/raw/HoxM_2_Rbp1_R2.fastq.gz Rbp1 HoxM HoxM_2_Rbp1
HoxM_3_Rbp1 data/raw/HoxM_3_Rbp1_R1.fastq.gz data/raw/HoxM_3_Rbp1_R2.fastq.gz Rbp1 HoxM HoxM_3_Rbp1
HoxW_1_IgG data/raw/HoxW_1_IgG_R1.fastq.gz data/raw/HoxW_1_IgG_R2.fastq.gz IgG HoxW HoxW_1_IgG
HoxW_1_Rbp1 data/raw/HoxW_1_Rbp1_R1.fastq.gz data/raw/HoxW_1_Rbp1_R2.fastq.gz Rbp1 HoxW HoxW_1_Rbp1
HoxW_2_Rbp1 data/raw/HoxW_2_Rbp1_R1.fastq.gz data/raw/HoxW_2_Rbp1_R2.fastq.gz Rbp1 HoxW HoxW_2_Rbp1
HoxW_3_Rbp1 data/raw/HoxW_3_Rbp1_R1.fastq.gz data/raw/HoxW_3_Rbp1_R2.fastq.gz Rbp1 HoxW HoxW_3_Rbp1
The script splits each file name on '_' and uses the first field for the condition and the third field for the mark. The 'igg' column is initially identical to the 'sample' column and should be manually replaced with the sample name of the IgG (or other control) you would like to use for that sample. If the sample is itself an IgG, the entry can remain its own name; this won't affect peak calling.
So a fixed version of the table above would look like this:
sample R1 R2 mark condition igg
HoxE_1_IgG data/raw/HoxE_1_IgG_R1.fastq.gz data/raw/HoxE_1_IgG_R2.fastq.gz IgG HoxE HoxE_1_IgG
HoxE_1_Rbp1 data/raw/HoxE_1_Rbp1_R1.fastq.gz data/raw/HoxE_1_Rbp1_R2.fastq.gz Rbp1 HoxE HoxE_1_IgG
HoxE_2_Rbp1 data/raw/HoxE_2_Rbp1_R1.fastq.gz data/raw/HoxE_2_Rbp1_R2.fastq.gz Rbp1 HoxE HoxE_1_IgG
HoxE_3_Rbp1 data/raw/HoxE_3_Rbp1_R1.fastq.gz data/raw/HoxE_3_Rbp1_R2.fastq.gz Rbp1 HoxE HoxE_1_IgG
HoxM_1_IgG data/raw/HoxM_1_IgG_R1.fastq.gz data/raw/HoxM_1_IgG_R2.fastq.gz IgG HoxM HoxM_1_IgG
HoxM_1_Rbp1 data/raw/HoxM_1_Rbp1_R1.fastq.gz data/raw/HoxM_1_Rbp1_R2.fastq.gz Rbp1 HoxM HoxM_1_IgG
HoxM_2_Rbp1 data/raw/HoxM_2_Rbp1_R1.fastq.gz data/raw/HoxM_2_Rbp1_R2.fastq.gz Rbp1 HoxM HoxM_1_IgG
HoxM_3_Rbp1 data/raw/HoxM_3_Rbp1_R1.fastq.gz data/raw/HoxM_3_Rbp1_R2.fastq.gz Rbp1 HoxM HoxM_1_IgG
HoxW_1_IgG data/raw/HoxW_1_IgG_R1.fastq.gz data/raw/HoxW_1_IgG_R2.fastq.gz IgG HoxW HoxW_1_IgG
HoxW_1_Rbp1 data/raw/HoxW_1_Rbp1_R1.fastq.gz data/raw/HoxW_1_Rbp1_R2.fastq.gz Rbp1 HoxW HoxW_1_IgG
HoxW_2_Rbp1 data/raw/HoxW_2_Rbp1_R1.fastq.gz data/raw/HoxW_2_Rbp1_R2.fastq.gz Rbp1 HoxW HoxW_1_IgG
HoxW_3_Rbp1 data/raw/HoxW_3_Rbp1_R1.fastq.gz data/raw/HoxW_3_Rbp1_R2.fastq.gz Rbp1 HoxW HoxW_1_IgG
In this example there was only one IgG per condition, so that IgG's sample name is used for every sample in its condition. If each sample had its own control file, each entry would instead point to the IgG for that sample. If only one IgG was used in the whole experiment, its sample name could be used for every row. If you are not using IgG, set config['USEIGG'] to false and leave the samplesheet as generated.
3. Edit configuration files
- Edit the runtime configuration in config.yml (an illustrative sketch is shown at the end of this step):
  - Specify whether or not to trim adapters from raw reads to improve downstream alignment rate. You can decide based on the adapter content in the sequencing core's FastQC reports.
  - Specify whether or not to use IgG for peak calling.
  - Specify the path to the bowtie2 index for the genome you are aligning to.
  - Specify the path to the fastq_screen configuration file.
  - Specify the path to the genome FASTA file.
  - Specify the path to the chromosome sizes file.
- The pipeline detects samples in the subdirectory data/raw with the following assumptions:
  - Paired-end reads
  - Read 1 and read 2 are designated using "_R1" and "_R2"
  - Samples are named using the following convention: {condition}_{replicate}_{mark}_{R1|R2}.fastq.gz
    This format affects the output files and ensures that bigwig files from the same mark are merged.
- Make sure src/deseq2_metadata.csv looks right. The file is created when you run make_sample_sheet.py and should have the following properties:
  - It should have two columns labeled "sample" and "condition".
  - The sample column corresponds to the individual biological replicates (all fields around the "_" delimiter).
  - The condition column is the condition for each sample, taken from the first field of the "_"-delimited name.
  - If you have multiple conditions and marks to analyze, you can introduce more columns into this file and adjust deseq2.R to account for the extra covariates.
The file src/deseq2_metadata.csv is populated with the following example data:
sample,condition
HoxE_1_IgG,HoxE
HoxE_1_Rbp1,HoxE
HoxE_2_Rbp1,HoxE
HoxE_3_Rbp1,HoxE
HoxM_1_IgG,HoxM
HoxM_1_Rbp1,HoxM
HoxM_2_Rbp1,HoxM
HoxM_3_Rbp1,HoxM
HoxW_1_IgG,HoxW
HoxW_1_Rbp1,HoxW
HoxW_2_Rbp1,HoxW
HoxW_3_Rbp1,HoxW
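For orientation, here is a rough sketch of the kinds of entries config.yml holds. The key names GENOME, FASTA, CSIZES, FASTQ_SCREEN, USEIGG, IGG, and N_INTERSECTS appear in the pipeline's scripts; the adapter-trimming key below is a placeholder, so start from the config.yml that ships with the repository rather than writing one from scratch.
# illustrative sketch only -- edit the repository's config.yml
TRIM_ADAPTERS: True                          # placeholder name: whether to adapter-trim reads with fastp
USEIGG: True                                 # whether to use an IgG control for peak calling
IGG: "IgG"                                   # substring that identifies IgG samples
GENOME: "/path/to/bowtie2/index/prefix"      # bowtie2 index
FASTA: "/path/to/genome.fa"                  # genome FASTA
CSIZES: "/path/to/chrom.sizes"               # chromosome sizes file
FASTQ_SCREEN: "/path/to/fastq_screen.conf"   # fastq_screen configuration
N_INTERSECTS: 2                              # replicates a peak must appear in to join the consensus set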
4. Set up SLURM integration (for batch jobs)
Do this step if you are running the pipeline as a batch job and don't yet have a SLURM profile set up.
Download the slurm folder from the maxsonBraunLab repository and copy the entire folder to ~/.config/snakemake. Your file configuration for SLURM should be as follows:
~/.config/snakemake/slurm/<files>
Change the file permissions for the scripts in the slurm folder so that they are executable. To do this, run:
chmod +x ~/.config/snakemake/slurm/slurm*
Execution
A "dry-run" can be accomplished to see what and how files will be generated by using the command:
snakemake -nrp
To invoke the pipeline, please use either of the two options below:
# run in interactive mode. Recommended for running light jobs.
snakemake -j <n cores> --use-conda --conda-prefix $CONDA_PREFIX_1/envs
# run in batch mode. Recommended for running intensive jobs.
sbatch run_pipeline_conda.sh
For users running the pipeline in batch mode, run_pipeline_conda.sh is a wrapper script that contains the following command:
snakemake -j <n jobs> --use-conda --conda-prefix $CONDA_PREFIX_1/envs --profile slurm --cluster-config cluster.yaml
Additional setup instructions are provided in the wrapper script. The maxsonBraunLab slurm repository contains further instructions for setting up a Snakemake SLURM profile, which you can use to standardize additional arguments for batch runs.
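For context, a Snakemake SLURM profile is essentially a config.yaml (plus submission scripts) under ~/.config/snakemake/slurm/ whose keys mirror snakemake command-line options. The sketch below only shows the general shape; it is not the maxsonBraunLab profile, so treat their repository as the source of truth.
# ~/.config/snakemake/slurm/config.yaml -- generic sketch, not the lab's actual profile
cluster: "sbatch --cpus-per-task={threads} --mem=8G --time=12:00:00"   # simplified; real profiles usually pass per-rule resources
jobs: 50
use-conda: true
printshellcmds: true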
Reproducible results with Snakemake + Singularity
To ensure the reproducibility of your results, we recommend running the Snakemake workflow with Singularity containers. The container standardizes the workflow's underlying operating system (e.g. Ubuntu, CentOS), while conda tracks the installation of bioinformatics software (e.g. bowtie2, samtools, DESeq2). To use Singularity in your analysis, log in to an interactive node and load the module first, like this:
# request an interactive node
srun -p light --time=36:00:00 --pty bash
# re-activate your environment with snakemake
conda activate <snakemake-env>
# load the singularity program
module load /etc/modulefiles/singularity/current
By default, Singularity will create and use a cache directory in your personal user root folder (i.e. in /home/users/<username>). This may create problems, as space in a user's root folder on Exacloud is limited. To avoid space issues in your root folder, you can point the Singularity cache directory at a folder in your lab group directory like this:
# make a cache folder inside your lab user folder
mkdir /home/groups/MaxsonLab/<your_user_folder>/singularity_cache
# make the path to the cache folder accessible to other processes
export SINGULARITY_CACHEDIR=/home/groups/MaxsonLab/<your_user_folder>/singularity_cache
If you are an experienced user, you can add the export SINGULARITY_CACHEDIR=... line to your .bashrc file. Otherwise, run the export SINGULARITY_CACHEDIR=... command before doing the steps below.
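For example, to make the setting persistent (the path is the same placeholder used above):
# one-time: persist the cache location in ~/.bashrc, then reload the shell configuration
echo 'export SINGULARITY_CACHEDIR=/home/groups/MaxsonLab/<your_user_folder>/singularity_cache' >> ~/.bashrc
source ~/.bashrc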
More Singularity documentation on Exacloud can be found here. If this is your first time running the pipeline, and especially when using Singularity, you must first install all the conda environments using the following command:
indices_folder="/home/groups/MaxsonLab/indices"
conda_folder="${CONDA_PREFIX_1}/envs"
fastq_folder="/home/groups/MaxsonLab/input-data2/path/to/FASTQ/folder/"
snakemake -j 1 \
--verbose \
--use-conda \
--conda-prefix $conda_folder \
--use-singularity \
--singularity-args "--bind $indices_folder,$conda_folder,$fastq_folder" \
--conda-create-envs-only
The above code snippet will take about an hour or more to run, but it is a one-time installation. After creating the conda environments, the symlinks, samplesheet.tsv, and src/deseq2_metadata.csv, you can invoke the pipeline in the same shell like this:
# Singularity + interactive run
snakemake -j <n cores> \
--use-conda \
--conda-prefix $conda_folder \
--use-singularity \
--singularity-args "--bind $indices_folder,$conda_folder,$fastq_folder"
# Singularity + slurm (batch) run
sbatch run_pipeline_singularity.sh
For users running the Singularity version of the pipeline in batch mode, run_pipeline_singularity.sh is a wrapper script for the pipeline. You will need to add the appropriate FASTQ folder path to the script prior to running; additional instructions are provided in the wrapper script.
NOTE: make sure to keep the double quotes around the --singularity-args value, and insert an integer for the -j flag.
The above command installs the pipeline's conda environments into the --conda-prefix directory, which means the conda environments are not stored inside the container. The --bind argument binds host (Exacloud) paths into the container so it can access the genome indices, the conda prefix, and the raw sequencing files. The --profile slurm option in the wrapper script configures default settings for Snakemake to interact with SLURM; more information can be found here. Feel free to create another snakemake profile with its own set of Singularity arguments for added convenience (a rough sketch follows).
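As a sketch of that idea, a hypothetical profile at ~/.config/snakemake/singularity/config.yaml could predefine the conda and Singularity arguments (all paths below are placeholders):
# ~/.config/snakemake/singularity/config.yaml -- hypothetical profile, paths are placeholders
use-conda: true
use-singularity: true
conda-prefix: "/path/to/conda/envs"
singularity-args: "--bind /home/groups/MaxsonLab/indices,/path/to/conda/envs,/path/to/FASTQ/folder"
With such a profile, snakemake -j 16 --profile singularity would be equivalent to spelling out those flags by hand.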
Method
Without adapter trimming:
With adapter trimming enabled:
Output
Below is an explanation of each output directory:
aligned - sorted and aligned sample bam files
callpeaks - the output of callpeaks.py with peak files and normalized bigwigs for each sample.
counts - the raw counts for each mark over a consensus peak list
deseq2 - the results of deseq2 for each counts table (by mark)
dtools - fingerprint plot data for multiqc to use
fastp - adapter-trimmed FASTQ files (if adapter-trimming option is enabled in `config.yml`)
fastqc - fastqc results
fastq_screen - fastq_screen results
logs - runtime logs for each snakemake rule
markd - duplicate marked bam files
multiqc - contains the file multiqc_report.html with a lot of QC info about the experiment.
plotEnrichment - FRIP statistics for each mark
preseq - library complexity data for multiqc report
Deseq2 outputs
Each mark should have the following output files:
"data/deseq2/{mark}/{mark}-rld-pca.png" - PCA of counts after reguarlized log transformation.
"data/deseq2/{mark}/{mark}-vsd-pca.png" - PCA of counts after variance stabilizing transformation.
"data/deseq2/{mark}/{mark}-normcounts.csv" - normalized count for each sample in each consensus peak.
"data/deseq2/{mark}/{mark}-lognormcounts.csv" - log2 normalized counts for each sample in each consensus peak.
"data/deseq2/{mark}/{mark}-rld.png", - the sdVsMean plot using regularized log transformation.
"data/deseq2/{mark}/{mark}-vsd.png" - the sdVsMean plot using variance stabilizing transformation.
"data/deseq2/{mark}/{mark}-vsd-dist.png" - the sample distance matrix after variance stabilizing transformation.
"data/deseq2/{mark}/{mark}-rld-dist.png" - the sample distance matrix using regularized log transformation.
"data/deseq2/{mark}/{mark}-dds.rds" - the R object with the results of running the DEseq() function.
For each contrast, the differentially expressed genes are written to a file ending in -diffexp.tsv, and those with an adjusted p-value less than 0.05 are written to a file ending in -sig05-diffexp.tsv. A summary of the results using an alpha of 0.05 is also written to a file ending in -sig05-diffexp-summary.txt. Additionally, two MA plots are written to a file ending in plotMA.png, highlighting differential peaks with an adjusted p-value less than 0.1.
See the DESeq2 vignette for further explanation of the above plots and data transforms: https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#exporting-results-to-csv-files
Code Snippets
shell: "fastp -i {input.r1} -I {input.r2} -o {output.r1} -O {output.r2} --detect_adapter_for_pe --thread {threads} -j {log} -h /dev/null"

shell: "fastqc -t {threads} --outdir data/fastqc {input} > {log} 2>&1"

shell:
    "fastq_screen --aligner bowtie2 --threads {threads} --outdir data/fastq_screen "
    "--conf {config[FASTQ_SCREEN]} --force {input} > {log} 2>&1"

shell:
    "bowtie2 --local --very-sensitive-local "
    "--no-unal --no-mixed --threads {threads} "
    "--no-discordant --phred33 "
    "-I 10 -X 700 -x {config[GENOME]} "
    "-1 {input.r1} -2 {input.r2} 2>{log.err} | samtools view -@ {threads} -Sbh - > {output}"

shell: "sambamba sort --tmpdir=data/aligned -t {threads} -o {output} {input} > {log} 2>&1"

shell: "sambamba markdup --tmpdir=data/markd -t {threads} {input} {output} > {log} 2>&1"

shell: "sambamba index -t {threads} {input} > {log} 2>&1"

shell: "bamCoverage -b {input[0]} -o {output} --binSize 10 --smoothLength 50 --normalizeUsing CPM -p {threads} "

shell: "bash src/mergebw.sh -c {config[CSIZES]} -o {output} {input}"

shell:
    "cat {input} | sort -k1,1 -k2,2n | "
    "bedtools merge | "
    "bedtools intersect -a - -b {input} -c | "
    "awk -v OFS='\t' '$4>=2 {{print}}' > {output}"

shell: "src/fraglen-dist.sh {input} {output}"

run:
    pd.options.plotting.backend = "plotly"
    dfs = []
    for i in input:
        cond_marker = [os.path.basename(i).split(".")[0]]
        temp_df = pd.read_csv(i, sep = "\t", index_col = 0, names = cond_marker)
        dfs.append(temp_df)
    df = pd.concat(dfs, axis = 1)
    fraglen = df.plot()
    fraglen.update_layout(
        title='Fragment Length Distribution',
        xaxis_title='Fragment Length (bp)',
        yaxis_title='Counts',
        legend_title_text='Samples')
    fraglen.write_html(str(output))

shell: "preseq c_curve -B {resources.defect_mode} -l 1000000000 -o {output} {input} > {log} 2>&1"

shell: "preseq lc_extrap -B {resources.defect_mode} -l 1000000000 -e 1000000000 -o {output} {input} > {log} 2>&1"

shell: "plotFingerprint -b {input[0]} --smartLabels --outRawCounts {output} > {log} 2>&1"

shell: "gopeaks -b {input[0]} {params.igg} -o data/callpeaks/{wildcards.sample} {params.params} > {log} 2>&1"

shell: "bash src/consensus_peaks.sh {wildcards.mark} {config[N_INTERSECTS]} {output}"

shell: "bash src/skip_frip.sh {input[0]} {input[1]} {output[0]} {output[1]} {log}"

run:
    pd.options.plotting.backend = "plotly"
    dfs = []
    for i in sorted(input):
        cond_marker = "_".join(i.split("_")[1:3])
        temp_df = pd.read_csv(i, sep = "\t", usecols=["percent"]).rename(columns = {'percent': cond_marker})
        dfs.append(temp_df)
    frip_df = pd.concat(dfs, axis = 1)
    frip_df = frip_df.rename(index={0: 'inside'})
    frip_df.loc["outside"] = 100 - frip_df.loc['inside']
    fig = go.Figure(data=[
        go.Bar(name="inside_peaks", x=frip_df.columns, y=frip_df.loc['inside'], marker_color='rgb(255, 201, 57)'),
        go.Bar(name='outside_peaks', x=frip_df.columns, y=frip_df.loc['outside'], marker_color='rgb(0,39, 118)')
    ])
    fig.update_layout(barmode='stack', title='Fraction of Reads in Peaks by Sample',
        xaxis_tickfont_size=14,
        yaxis=dict(title='Fraction of reads in peaks', titlefont_size=16, tickfont_size=14),
        xaxis=dict(title='Samples'))
    fig.write_html(str(output))

script: "src/deseq2.R"

shell: "bash src/homer.sh -m {wildcards.mark} -s 1 -p 8 -g {config[FASTA]}"
import os
import pandas as pd

# NOTE: `st` is the samplesheet DataFrame (samplesheet.tsv) loaded elsewhere in the Snakefile.

# map samples to fastqs
def get_samples():
    """ return list of samples from samplesheet.tsv """
    return list(st.index)

def get_marks():
    """ return list of marks from samplesheet.tsv """
    return list(st['mark'])

def get_mark_conditions():
    """ return list of samples by condition """
    st['mark_condition'] = st['mark'].astype(str) + "_" + st['condition']
    return st['mark_condition'].unique().tolist()

def get_tracks_by_mark_condition(wildcards):
    """ return list of tracks by mark_condition """
    st['mark_condition'] = st['mark'].astype(str) + "_" + st['condition']
    samps = st.groupby(["mark_condition"])['sample'].apply(list)[wildcards.mark_condition]
    return [f"data/tracks/{s}.bw" for s in samps]

def get_peaks_by_mark_condition(wildcards):
    """ return list of peaks by mark_condition """
    st['mark_condition'] = st['mark'].astype(str) + "_" + st['condition']
    samps = st.groupby(["mark_condition"])['sample'].apply(list)[wildcards.mark_condition]
    return [f"data/callpeaks/{s}_peaks.bed" for s in samps]

def get_bowtie2_input(wildcards):
    """ returns reads associated with a sample """
    r1 = st.loc[wildcards.sample]['R1']
    r2 = st.loc[wildcards.sample]['R2']
    return r1, r2

def get_reads():
    """ get list of all reads """
    rlist = list(st['R1']) + list(st['R2'])
    rlist = [os.path.basename(f).split('.')[0] for f in rlist]
    return rlist

def get_igg(wildcards):
    """
    Returns the igg file for the sample,
    unless the sample is IgG, in which case no control file is used.
    """
    if config['USEIGG']:
        igg = st.loc[wildcards.sample]['igg']
        iggbam = f'data/markd/{igg}.sorted.markd.bam'
        isigg = config['IGG'] in wildcards.sample
        if not isigg:
            return f'-c {iggbam}'
        else:
            return ""
    else:
        return ""

def get_callpeaks(wildcards):
    """
    Returns the callpeaks input files.
    """
    bam = f"data/markd/{wildcards.sample}.sorted.markd.bam"
    bai = f"data/markd/{wildcards.sample}.sorted.markd.bam.bai"
    if config["USEIGG"]:
        igg = st.loc[wildcards.sample]['igg']
        iggbam = f'data/markd/{igg}.sorted.markd.bam'
        # the second assignment repoints iggbam at the index file, so the IgG index is required as an input
        iggbam = f'data/markd/{igg}.sorted.markd.bam.bai'
        isigg = config['IGG'] in wildcards.sample
        if not isigg:
            return [bam, bai, iggbam]
        else:
            return [bam, bai]
    else:
        return [bam, bai]

def callpeaks_params(wildcards):
    """
    Returns callpeaks parameters specified by the user in the samplesheet.
    """
    params = st.loc[wildcards.sample]['gopeaks']
    if params == "-":
        return ""
    else:
        return params

def defect_mode(wildcards, attempt):
    if attempt == 1:
        return ""
    elif attempt > 1:
        return "-D"
if ! command -v bedtools &> /dev/null
then
    echo "bedtools could not be found"
    exit
fi

# check args
if [[ $# -eq 0 ]]
then
    echo "no arguments supplied"
    exit 1
fi

MARK=$1
N_INTERSECTS=$2
OUTFILE=$3

# array of replicates per mark
declare -a mpks=(data/callpeaks/*${MARK}*_peaks.bed)
all_samples=$(echo ${mpks[@]} | tr ' ' '\n' | cut -d/ -f3 | cut -d_ -f1-3)
all_conditions=$(echo ${mpks[@]} | tr ' ' '\n' | cut -d/ -f3 | cut -d_ -f1)

for condition in $all_conditions; do
    # file I/O
    tmp_output="data/counts/$MARK.$condition.tmp.bed"
    # list all replicates in one condition
    all_replicates=$(find data/callpeaks/ -name "*$condition*$MARK*.bed" | sort | tr '\n' ' ')
    # find widest peak that appear in at least n replicates + export to tmp file.
    cat $all_replicates | cut -f1-3 | sort -k1,1 -k2,2n | bedtools merge | \
        bedtools intersect -c -a stdin -b $all_replicates | \
        awk -v n=$N_INTERSECTS '$4 >= n' | awk -v OFS='\t' '{print $1,$2,$3}' > $tmp_output
done

# merge intervals for all tmp files and export as consensus peak.
all_temp_files=$(find data/counts -name "*$MARK*.tmp.bed" | sort | tr '\n' ' ')
cat $all_temp_files | sort -k1,1 -k2,2n | bedtools merge > data/counts/${MARK}_consensus.bed
rm $all_temp_files

# count the number of reads per union peak
bedtools multicov -bams data/markd/*${MARK}*.bam -bed data/counts/${MARK}_consensus.bed -D > ${OUTFILE}_tmp

# label the counts table
ls data/markd/*${MARK}*.bam | sed 's!.*/!!' | cut -d. -f1 | xargs | tr ' ' '\t' | awk '{print "chrom\tstart\tend\t" $0}' | cat - ${OUTFILE}_tmp > ${OUTFILE}

# remove tmp
rm ${OUTFILE}_tmp
library("tidyverse")
library("vsn")
library("pheatmap")
library("ggplot2")
library("RColorBrewer")
library("DESeq2")
library("PoiClaClu")
library("gplots")
library("plyranges")
library("tidyr")
library("dendextend")
library("ggrepel")
library("tools")

message("setting snakemake parameters")
exp_prefix = snakemake@params[["mark"]]
outdir = snakemake@params[["outdir"]]
numclusters = snakemake@params[["numclusters"]]
message(paste0("using numclusters = ", numclusters))
stopifnot(is.numeric(numclusters))

input = snakemake@input[["counts"]]
meta = snakemake@input[["meta"]]
genes = snakemake@input[["genes"]]

pcaPlot = snakemake@output[["pcaPlot"]]
pcaPlotVsd = snakemake@output[["pcaPlotVsd"]]
normCount = snakemake@output[["normCounts"]]
lnormCount = snakemake@output[["lnormCounts"]]
sdMeanRld = snakemake@output[["sdMeanRld"]]
sdMeanVsd = snakemake@output[["sdMeanVsd"]]
sampleDistPlotVsd = snakemake@output[["sampleDistVsd"]]
sampleDistPlotRld = snakemake@output[["sampleDistRld"]]
rds = snakemake@output[["rds"]]

if (!dir.exists(outdir)) {
    dir.create(outdir)
}

# read in genes file
genestab = read_tsv(genes, col_names=c("seqnames", "start", "end", "name", "score", "strand")) %>%
    GRanges()

# counts join peaks
peaks = read_tsv(input) %>%
    mutate(row=row_number()) %>%
    mutate(name=paste0("peak", as.character(row)))

# read in counts table
counts = read.delim(input, header=T, stringsAsFactors = F, check.names=F) %>%
    mutate(row=row_number()) %>%
    mutate(peak=paste0("peak", as.character(row)))
rownames(counts) = counts$peak
counts = counts[,4:ncol(counts)]

# read in metadata
meta <- read.csv(meta, header = T, stringsAsFactors = F)

# make the meta reflect the available sample names
meta = meta[meta$sample %in% colnames(counts),]

# make sure meta rownames and counts colnames are in the same order.
rownames(meta) <- meta$sample
counts <- counts[, rownames(meta)]
stopifnot(rownames(meta) == colnames(counts))

message('calculating deseq...')
print(head(counts))
print(head(meta))

# make deseqdataset
ddsMat <- DESeqDataSetFromMatrix(countData = counts, colData = meta, design = ~condition)
dds <- DESeq(ddsMat)

# write normalized counts tables
normCounts <- counts(dds, normalized=TRUE)
normCounts <- 0.00001+(normCounts) # ensure nonzero
lnormCounts <- log2(normCounts)
lnormCounts <- as.data.frame(lnormCounts)
lnormCounts$ID <- counts$peak
print(head(lnormCounts))
write.csv(normCounts, normCount)
write.csv(lnormCounts, lnormCount)

# calculate rlog transform
rld <- rlog(dds, blind=FALSE)
vsd <- varianceStabilizingTransformation(dds, blind = TRUE, fitType = "parametric")
ntd <- normTransform(dds)

message('plotting count transforms...')
save.image()

# plot the transform for the first two samples
pdf(sdMeanRld)
rldCounts <- meanSdPlot(assay(rld))
rldCounts$gg + labs(title="meanSdPlot - rlog transformed")
dev.off()

pdf(sdMeanVsd)
vsdCounts <- meanSdPlot(assay(vsd))
vsdCounts$gg + labs(title="meanSdPlot - vsd transformed")
dev.off()

message('plotting poisson distance sample cross correlation...')

# plot sample distances
pdf(sampleDistPlotVsd)
sampleDists <- dist(t(assay(vsd)))
sampleDistMatrix <- as.matrix(sampleDists)
rownames(sampleDistMatrix) <- paste(meta$sample, meta$condition, sep="-")
colnames(sampleDistMatrix) <- NULL
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
pheatmap(sampleDistMatrix,
    clustering_distance_rows=sampleDists,
    clustering_distance_cols=sampleDists,
    col=colors,
    main=paste(exp_prefix, "vsd sample distance matrix"))
dev.off()

pdf(sampleDistPlotRld)
sampleDists <- dist(t(assay(rld)))
sampleDistMatrix <- as.matrix(sampleDists)
rownames(sampleDistMatrix) <- paste(meta$sample, meta$condition, sep="-")
colnames(sampleDistMatrix) <- NULL
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
pheatmap(sampleDistMatrix,
    clustering_distance_rows=sampleDists,
    clustering_distance_cols=sampleDists,
    col=colors,
    main=paste(exp_prefix, "rld sample distance matrix"))
dev.off()

# plot PCA rld
pdf(pcaPlot)
plotPCA(rld, intgroup = c("condition")) + labs(title=paste0(exp_prefix,"-rld")) + geom_text_repel(aes(label=name))
dev.off()

# plot PCA vsd
pdf(pcaPlotVsd)
plotPCA(vsd, intgroup = c("condition")) + labs(title=paste0(exp_prefix,"-vsd")) + geom_text_repel(aes(label=name))
dev.off()

# make contrasts
c = combn(unique(meta$condition), 2)
lsc = split(c, rep(1:ncol(c), each = nrow(c)))

message('writing contrast results...')

# write differential results for each contrast
for (k in 1:length(lsc)) {

    cl = lsc[[k]]
    rname = paste0(cl[1], "vs", cl[2])
    res = results(dds, independentFiltering = TRUE, cooksCutoff = FALSE, contrast = c("condition", cl[1], cl[2]), alpha=0.05)

    # estimate shrinkage
    resLFC <- lfcShrink(dds, coef=2, type="apeglm")

    # plotMA
    maplot = paste0(outdir,"/",exp_prefix,"/",exp_prefix,"-",rname,"plotMA.pdf")
    pdf(maplot)
    par(mfrow=c(1,2), mar=c(4,4,2,1))
    xlim <- c(1,1e5); ylim <- c(-2,2)
    plotMA(resLFC, xlim=xlim, ylim=ylim, alpha=0.05, main=paste(rname, "apeglm"))
    plotMA(res, xlim=xlim, ylim=ylim, alpha=0.05, main=paste(rname, "normal"))
    dev.off()

    tableName = paste0(outdir,"/",exp_prefix,"/",cl[1],'-',cl[2],'-',exp_prefix,'-','peaks-annot.tsv')

    # annotate peaks
    diffexp = res %>%
        as_tibble(rownames='name') %>%
        left_join(peaks) %>%
        GRanges() %>%
        join_nearest(genestab, suffix=c(".peak", ".gene")) %>%
        as_tibble()
    write_tsv(diffexp, path=tableName)

    # differential peak files: lfc > 0 | lfc < 0, padj < 0.05 | padj < 0.01
    deup05 = paste0(outdir,"/",exp_prefix,"/",cl[1],'-',cl[2],'-',exp_prefix,'-','differential-up-05.bed')
    dedown05 = paste0(outdir,"/",exp_prefix,"/",cl[1],'-',cl[2],'-',exp_prefix,'-','differential-down-05.bed')
    deup01 = paste0(outdir,"/",exp_prefix,"/",cl[1],'-',cl[2],'-',exp_prefix,'-','differential-up-01.bed')
    dedown01 = paste0(outdir,"/",exp_prefix,"/",cl[1],'-',cl[2],'-',exp_prefix,'-','differential-down-01.bed')

    # sig up peaks
    diffexp %>%
        filter(log2FoldChange > 0, padj < 0.05) %>%
        select(seqnames, start, end, name.peak, name.gene, score, strand) %>%
        write_tsv(deup05, col_names=FALSE)

    # sig down peaks
    diffexp %>%
        filter(log2FoldChange < 0, padj < 0.05) %>%
        select(seqnames, start, end, name.peak, name.gene, score, strand) %>%
        write_tsv(dedown05, col_names=FALSE)

    # sig up peaks
    diffexp %>%
        filter(log2FoldChange > 0, padj < 0.01) %>%
        select(seqnames, start, end, name.peak, name.gene, score, strand) %>%
        write_tsv(deup01, col_names=FALSE)

    # sig down peaks
    diffexp %>%
        filter(log2FoldChange < 0, padj < 0.01) %>%
        select(seqnames, start, end, name.peak, name.gene, score, strand) %>%
        write_tsv(dedown01, col_names=FALSE)

    summaryName = paste0(outdir,"/",exp_prefix,"/",cl[1],"-",cl[2],"-",exp_prefix,"-diffexp-summary.txt")
    diffexp %>%
        summarise(Condition=paste0(exp_prefix,"-",rname),
                  DP05=nrow(filter(.,padj<0.05)),
                  DP01=nrow(filter(.,padj<0.01))) %>%
        write_tsv(summaryName)

    message("creating heatmap...")
    sig = diffexp %>% filter(padj < 0.05)

    # calculate heatmap if there are more than 10 sig peaks
    if (nrow(sig) > 10) {

        annot_filename = gsub(".tsv", "-05-clust.tsv", basename(tableName))

        # make direction annotation df
        direction <- sig %>%
            select(name.peak, log2FoldChange) %>%
            mutate(direction=ifelse(log2FoldChange>0, "up", "down")) %>%
            data.frame()
        rownames(direction) = direction[,1]
        direction[,1] = NULL

        # subset counts by sig peaks
        print(head(lnormCounts))
        counts <- rownames_to_column(lnormCounts, var="peak") %>% as_tibble()
        print(head(counts))
        sigcounts <- counts %>% filter(peak %in% sig$name.peak)

        # organize samples by condition
        meta_conditions <- meta %>%
            group_by(condition) %>%
            summarise(sample=paste(sample, collapse = ","))

        # rowMean counts for each condition
        for (c in meta_conditions$condition) {
            samples = filter(meta_conditions, condition==c)$sample
            s = unlist(strsplit(samples, ","))
            sigcounts <- sigcounts %>% mutate(!!c := rowMeans(sigcounts[s]))
        }

        # retain mean counts per condition
        sigcounts <- sigcounts[append("peak", meta_conditions$condition)]

        # scale the counts
        scaled <- t(apply(as.matrix(sigcounts[,-1]), 1, scale))
        colnames(scaled) = colnames(sigcounts[-1])
        rownames(scaled) <- sigcounts$peak

        # generate clusters
        clust = hclust(dist(scaled), method="ward.D2")
        clustdf = data.frame(cluster=cutree(as.dendrogram(clust), k = numclusters))
        annots = merge.data.frame(clustdf, direction, by=0, all = T) %>% select(-log2FoldChange)
        rownames(annots) = annots[,1]
        annots[,1] = NULL

        # get a number of unique colors
        cols = RColorBrewer::brewer.pal(numclusters, name="Set1")

        # name the colors by cluster
        names(cols) = unique(clustdf$cluster)

        # set cluster colors
        colors = list(cluster=cols)

        heattitle = paste0(exp_prefix,"-",rname)

        # create heatmap
        heat <- pheatmap(scaled,
            main=heattitle,
            show_rownames = F,
            show_colnames = T,
            cluster_rows = T,
            cluster_method = "ward.D2",
            annotation_row = annots,
            annotation_colors = colors)

        save_pheatmap <- function(x, filename, width=7, height=7) {
            if (tolower(tools::file_ext(filename)) == "png") {
                png(filename = filename, width = width, height = height)
            } else {
                pdf(filename, width = width, height = height)
            }
            grid::grid.newpage()
            grid::grid.draw(x$gtable)
            dev.off()
        }

        heatmap_filename = paste0(outdir,"/",exp_prefix,"/",cl[1],"-",cl[2],"-",exp_prefix,"-heatmap.pdf")
        save_pheatmap(heat, heatmap_filename)

        annot_filename = gsub(".tsv", "-05-clust.tsv", tableName)

        # annotate cluster and direction
        annots[["name.peak"]] = rownames(annots)
        sig %>%
            left_join(annots) %>%
            write_delim(annot_filename, delim="\t")
    }
}

# save RDS of deseq object
saveRDS(dds, rds)
samtools view $1 | awk '$9>0' | cut -f 9 | sort | uniq -c | sort -b -k2,2n | sed -e 's/^[ \t]*//' | sed 's/ /\t/g' | awk -v OFS='\t' '{print $2,$1}' > $2
while getopts "m:g:s:p:" op
do
    case "$op" in
        m) mark="$OPTARG";;
        g) genome="$OPTARG";;
        s) slurm="$OPTARG";;
        p) p="$OPTARG";;
        \?) exit 1;;
    esac
done

if [ -f $genome ]; then
    echo "I found the FASTA file!"
else
    echo -e "$genome does not exist. Exiting program."
    exit
fi

if [[ $slurm == 1 || $slurm == 0 ]]; then
    true
else
    echo "ERROR: -s must be either 0 or 1 for SLURM submission."
    exit 1
fi

log_file_exists() {
    if [ -f $1 ]; then
        rm $1
        touch $1
    fi
    if [ ! -f $1 ]; then
        touch $1
    fi
}

# detect contrast combinations per mark
contrasts=$(find "data/deseq2/${mark}" -name "*.bed" | cut -d/ -f4 | sed -E "s/([a-zA-Z0-9.-]+)-($mark)-([a-zA-Z0-9-]+)(.bed)/\1/" | sort | uniq)

# HOMER analysis per contrast -----------------------------------------------------------
for contrast in $contrasts; do

    echo "assessing contrast $contrast in mark $mark"

    up_05="data/deseq2/${mark}/${contrast}-${mark}-differential-up-05.bed"
    up_01="data/deseq2/${mark}/${contrast}-${mark}-differential-up-01.bed"
    down_05="data/deseq2/${mark}/${contrast}-${mark}-differential-down-05.bed"
    down_01="data/deseq2/${mark}/${contrast}-${mark}-differential-down-01.bed"

    declare -a differential_peaks=($up_05 $up_01 $down_05 $down_01)
    echo -e "Differential peaks files:\n${differential_peaks[@]}"

    # assess and run HOMER per bed file
    for file in ${differential_peaks[@]}; do

        # check if file exists
        if [ -f "$file" ]; then
            echo "$file detected"
            peak_count=$(cat $file | wc -l)
            echo "Peak count: $peak_count"

            # run HOMER if >= 10 peaks
            if [ "$peak_count" -ge 10 ]; then

                # extract dir,sig info for the run.
                echo "Running HOMER for $peak_count peaks in $file"
                directionality=$(basename $file | sed -E "s/([a-zA-Z0-9._-]+)-(down|up)-([0-9]+)(.bed)/\2/") # "up" or "down"
                sig=$(basename $file | sed -E "s/([a-zA-Z0-9._-]+)-(down|up)-([0-9]+)(.bed)/\3/") # significance level (01 or 05)

                # file I/O
                output_dir="data/homer/${contrast}-${mark}-${directionality}-${sig}"
                log="data/logs/homer_${contrast}-${mark}-${directionality}-${sig}.log"
                log_file_exists $log

                # HOMER local run/submission to SLURM
                if [ $slurm == 0 ]; then
                    findMotifsGenome.pl $file $genome $output_dir -size 200 -p $p > $log 2>&1 &
                fi
                if [ $slurm == 1 ]; then
                    if [ ! -d "jobs/homer" ]; then
                        mkdir -p jobs/homer
                    fi
                    job_out="jobs/homer/homer-${contrast}-${mark}_%j.out"
                    job_err="jobs/homer/homer-${contrast}-${mark}_%j.err"
                    sbatch -e $job_err -o $job_out --job-name "HOMER" --time "02:00:00" --mem="8G" --cpus-per-task=$p --wait --wrap="findMotifsGenome.pl $file $genome $output_dir -size 200 -p $p > $log 2>&1" &
                fi
            fi
        fi
    done
done

wait
touch data/homer/$mark.done
exit
usage(){
    echo "Usage: "
    echo "  $0 -c chrom_sizes_file -o output_file rep1.bw rep2.bw rep3.bw"
    exit 1
}

file_exits(){
    local f="$1"
    [[ -f "$f" ]] && return 0 || return 1
}

cleanup() {
    rm ${out_name}.tmp.bdg &> /dev/null
}

[[ $# -eq 0 ]] && usage

script_args=()
while [ $OPTIND -le "$#" ]
do
    if getopts c:o: option
    then
        case $option in
            c) chrom_size="$OPTARG";;
            o) out_name="$OPTARG";;
        esac
    else
        script_args+=("${!OPTIND}")
        ((OPTIND++))
    fi
done

if [ -z "$chrom_size" ]
then
    echo "Please supply the chrom.sizes file."
    usage
else
    if ( ! file_exits "$chrom_size" )
    then
        echo "error: file $chrom_size not found"
        usage
    fi
fi

if [ -z "$out_name" ]
then
    echo "Please supply an output file name."
    usage
fi

echo "Using chrom.sizes file: $chrom_size"
echo "Output file name will be: $out_name"

# require wiggletools
if ! command -v wiggletools &> /dev/null
then
    echo "wiggletools not found"
    exit
fi

# require bigWigToBedGraph
if ! command -v bigWigToBedGraph &> /dev/null
then
    echo "bigWigToBedGraph not found"
    exit
fi

echo "merging the following files..."
echo ${script_args[@]}

cmd="wiggletools mean ${script_args[@]} | grep -vE '_|EBV|GL|JH' | awk 'NF==4{print}' | sort -S 8G -k1,1 -k2,2n > ${out_name}.tmp.bdg"
echo $cmd
eval $cmd

if [ $? -eq 0 ]; then
    echo "..."
else
    echo "Command wiggletools failed to complete..."
    echo "quitting.."
    cleanup
    exit 1
fi

# convert to bigwig
bedGraphToBigWig ${out_name}.tmp.bdg $chrom_size $out_name

if [ $? -eq 0 ]; then
    echo "..."
else
    echo "Command bedGraphToBigWig failed..."
    echo "quitting.."
    cleanup
    exit 1
fi

cleanup
echo "done"
input_bed=$1
input_bam=$2
output_png=$3
output_tsv=$4
logfile=$5

if [ -s ${input_bed} ]
then
    plotEnrichment -b ${input_bam} --BED ${input_bed} --regionLabels 'frip' --outRawCounts ${output_tsv} -o ${output_png} > ${logfile} 2>&1
else
    total_read_count=$(samtools view ${input_bam} | wc -l)
    echo "file featureType percent featureReadCount totalReadCount" > ${output_tsv}  # creates output file with column headers
    echo "${input_bam} frip 0 0 ${total_read_count}" >> ${output_tsv}  # appends the sample name with 0 counts
    touch ${output_png}
fi