Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
Project Description:
Table of contents:
Description of individual steps in pipeline:
1. run_bwa_mem
# align pair of .fastq files to the genome and convert the .sam output to .bam
bwa mem \
-t {params.bwaThreads} \` # number of threads to use for alignment`
-SP5M \` # -5 for split alignment, take the alignment with the smallest coordinate as primary
# -S skip mate rescue
# -P skip pairing
# -M mark shorter split hits as secondary`
{params.bwaIndex} \` # path to bwa indexed genome`
{input.fq1} \` # path to fastq 1`
{input.fq2} | \` # path to fastq 2`
samtools view -Shb - > {output.bam}` # convert bwa mem output to .bam file with header`
2. make_pairs_with_pairtools_parse
pairtools parse \
-c {params.chrom_sizes} \
--drop-sam \
--add-columns mapq \
--output {output.pairs} \
{input.bam}
3. sort_pairs_with_pairtools_sort
# if temporary directory is not available, make it
[ -d {params.tempdir} ] || mkdir {params.tempdir}
# sort .pairs file created in step 2
pairtools sort \
--nproc {params.nproc} \
--memory {params.memory} \
--tmpdir {params.tempdir} \
--output {output.sorted_pairs} \
{output.pairs}
# remove temporary directory after sorting
rm -rf {params.tempdir}
4. mark_duplicates_with_pairtools_dedup
# Find and mark PCR/optical duplicates in .pairs file from step 3.
pairtools dedup \
--mark-dups \` # duplicate pairs are marked as DD in pair_type and as a duplicate in the sam entries.`
--output-dups - \` # output duplicates together with deduped pairs`
--output-unmapped - \` # output unmapped pairs together with deduped pairs`
--output {output.marked_pairs} \` # name of output .pairs file with duplicates marked`
{input.sorted_pairs}` # .pairsam file input from step 3`
# index the .pairsam
pairix {output.marked_pairs}
5. filter_pairs
## Generate lossless bam from the pairsam file
pairtools split \
--output-sam {output.lossless_bam} \`# name of .bam file produced`
{input.marked_pairs}` # .pairsam file input from step 3`
# Select UU, UR, RU reads
## UU = unique-unique, both alignments are unique
## UR or RU = unique-rescued, one alignment unique, the other rescued
pairtools select \
'(pair_type == "UU") or (pair_type == "UR") or (pair_type == "RU")' \
--output-rest {output.unmapped_sam} \ `# name of file with the remainder of read pairs`
--output {params.temp_file} \` # temporary output file with the selected read pairs`
{input.marked_pairs}` # .pairs file input from step 4`
# Generate .pairs file from the UU, UR, and RU reads selected above
pairtools split \
--output-pairs {params.temp_file1} \` # temporary .pairs output file`
{params.temp_file}` # input .pairsam file generated with pairtools select above`
# Make a .pairs file with only pairs in chromosomes of interest
pairtools select 'True' \
--chrom-subset {params.chrom_sizes} \` # path to a chromosomes file containing a chromosome subset of interest.`
-o {output.dedup_pairs} \` # ouput path and filename for .pairs file in chromosomes of interest`
{params.temp_file1}` # input .pairsam file generated with pairtools split above`
# index the .pairs file
pairix {output.dedup_pairs}
6. add_frag2Pairs
The pearl script used here (fragment_4dnpairs.pl) is from this GitHub (Release 1.6). Github - aidenlab/juicer: A one-click system for analyzing loop-resolution hi-c experiments. (n.d.). GitHub. Retrieved March 10, 2022, from https://github.com/aidenlab/juicer. This requres a restriction site file. See these instructions for generating this file outside of the snakemake pipeline.
6A. FIRST TIME ONLY. Generate restriction site file.
python generate_site_positions.py 'HindIII' 'ecoli' '/net/qlotsam.lan.omrf.org/qlotsam/data/sansam/hpc-nobackup/scripts/CutAndRun_2020Aug1/ecoliBowtie2/GCF_000005845.2_ASM584v2_genomic.fna'"
# convert to fragment map
gunzip -ck {input.dedup_pairs} | \` # unzip dsthe .pairs file generated in step 4`
workflow/scripts/fragment_4dnpairs.pl \
-a - ` # allows replacing existing frag1/frag2 columns`\
{params.frag2_pairs_basename} \ ` # filename for .pairs file generated`
{params.restriction_file}` # a restriction site file, which lists on each line, the sorted locations of the enzyme restriction sites.`
# Compress the .pairs file generated
bgzip -f {params.frag2_pairs_basename}
# Index the .pairs file generated
pairix -f {output.frag2_pairs}
7. run_cooler
# use for all chromosomes and contigs
cp {params.chrom_sizes} {params.cooler_tempchrsize}
# the cload command requires the chrom size file to exist besides the chrom size bin file.
cooler cload pairix \
-p {params.cooler_n_cores} \
-s {params.cooler_max_split} \
{params.cooler_tempchrsize}:{params.cooler_bin_size} \
{input.frag2_pairs} \
{output.cooler}
Step-by-step instructions on running Snakemake pipeline:
1. Load slurm and miniconda
Note. The commands to do this will be different on your machine. These commands are specific to an HPC using slurm with these modules installed.
ml slurm
ml miniconda
2. Clone repository
git clone https://github.com/SansamLab/Process_HiC_SnakeMake.git
# rename folder with project name
mv Process_HiC_SnakeMake/ My_HiC_Project_Folder/
# change directory into root of your project folder
cd My_HiC_Project_Folder
3. Start the conda environment
3A. FIRST TIME ONLY: Setup conda environment for snakemake
# -f is the location of the environment .yml file.
## The relative path assumes that you are in the root directory of this repository.
# -p is the path where you want to install this environment
conda env create -f workflow/envs/SnakemakeEnv.yml -p /s/sansam-lab/SnakemakeEnv
3B. Activate conda environment for snakemake
conda activate /s/sansam-lab/SnakemakeEnv
4. Modify the job-specific configuration files.
4A. Modify the config/config.yml file
You must enter paths to the following:
-
bwa_genome:
- location of bwa indexed genome for the alignment
-
chrom_sizes
- chromosome sizes file
-
juicer_RE_file
- restriction enzyme file generated with juicer
4B. Modify the config/samples.csv file
The samples.csv file in the config folder has paths to the test fastq files. You must replace those paths with those for your own fastq files. The first column of each row is the sample name. This name will be used for all output files.
4C. IF SLURM RESOURCE CHANGES ARE NEEDED. Modify the config/cluster_config.yml file
CPU and memory requests for each rule in the pipeline are detailed in this file. If you are using SLURM, you may need to alter this file to fit your needs/system.
5. Do a dry run.
A dry run produces a text output showing exactly what commands will be executed. Look this over carefully before submitting the full job. It is normal to see warnings about changes made to the code, input, and params.
snakemake -npr
6. Make a DAG diagram.
snakemake --dag | dot -Tpdf > dag.pdf
7A. Use conda environments
If conda is to be used for rule-specific environments, you may find it useful to create the environments first. Running 'snakemake' with the '--conda-create-envs-only' option will create the environments without running the pipeline. The '--conda-prefix' option is used to set a directory in which the ‘conda’ and ‘conda-archive’ directories are created. This directory may be changed to a stable or shared location.
sbatch --mem 32G \
--wrap="\
snakemake \
--cores all \
--use-conda \
--conda-prefix ../condEnvs/ \
--conda-create-envs-only \
--conda-frontend conda"
Once the environments are setup, you may execute pipeline with conda environments using the following command:
sbatch --constraint=westmere \
--wrap="\
snakemake \
-R \
-j 999 \
--use-conda \
--conda-prefix ../condEnvs/ \
--conda-frontend conda \
--latency-wait 100 \
--cluster-config config/cluster_config.yml \
--cluster '\
sbatch \
-A {cluster.account} \
-p {cluster.partition} \
--cpus-per-task {cluster.cpus-per-task} \
--mem {cluster.mem} \
--output {cluster.output}'"
7B. Use environment modules.
Rather than using conda environments, you may prefer to use modules installed on your computing cluster. These modules are defined for each rule in 'workflow/Snakefile'. This must be customized for your environment, and you must modify the Snakefile yourself.
To execute the pipeline with environment modules, enter the following:
sbatch --constraint=westmere \
--wrap="\
snakemake \
-R \
-j 999 \
--use-envmodules \
--latency-wait 100 \
--cluster-config config/cluster_config.yml \
--cluster '\
sbatch \
-A {cluster.account} \
-p {cluster.partition} \
--cpus-per-task {cluster.cpus-per-task} \
--mem {cluster.mem} \
--output {cluster.output}'"
8. Check results, and when finished, exit environment.
The results will be saved to the "results" folder. Look over log files generated in either the logs/ or logs/snakelogs folders (depending on whether slurm was used).
conda deactivate
References:
Goloborodko, A., Nezar Abdennur, Venev, S., Hbbrandao, & Gfudenberg. (2019). mirnylab/pairtools v0.3.0 (v0.3.0) [Computer software]. Zenodo. https://doi.org/10.5281/ZENODO.2649383
Abdennur, N., Goloborodko, A., Imakaev, M., Kerpedjiev, P., Fudenberg, G., Oullette, S., Lee, S., Strobelt, H., Gehlenborg, N., & Mirny, L. (2021). open2c/cooler: v0.8.11 (v0.8.11) [Computer software]. Zenodo. https://doi.org/10.5281/ZENODO.4655850
Lee, S., Bakker, C. R., Vitzthum, C., Alver, B. H., & Park, P. J. (2022). Pairs and Pairix: a file format and a tool for efficient storage and retrieval for Hi-C read pairs. In C. Alkan (Ed.), Bioinformatics (Vol. 38, Issue 6, pp. 1729–1731). Oxford University Press (OUP). https://doi.org/10.1093/bioinformatics/btab870
Durand, N. C., Shamim, M. S., Machol, I., Rao, S. S. P., Huntley, M. H., Lander, E. S., & Aiden, E. L. (2016). Juicer provides a one-click system for analyzing loop-resolution hi-c experiments. Cell Systems, 3(1), 95–98. https://doi.org/10.1016/j.cels.2016.07.002
Danecek, P., Bonfield, J. K., Liddle, J., Marshall, J., Ohan, V., Pollard, M. O., Whitwham, A., Keane, T., McCarthy, S. A., Davies, R. M., & Li, H. (2021). Twelve years of samtools and bcftools. GigaScience, 10(2), giab008. https://doi.org/10.1093/gigascience/giab008
Li, H. (2013). Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. ArXiv:1303.3997 [q-Bio]. http://arxiv.org/abs/1303.3997
Köster, J., & Rahmann, S. (2012). Snakemake—A scalable bioinformatics workflow engine. Bioinformatics (Oxford, England), 28(19), 2520–2522. https://doi.org/10.1093/bioinformatics/bts480
Anaconda Software Distribution. (2020). Anaconda Documentation. Anaconda Inc. Retrieved from https://docs.anaconda.com/
Code Snippets
69 70 71 72 | shell: """ bwa mem -t {params.bwaThreads} -SP5M {params.bwaIndex} {input.fq1} {input.fq2} | samtools view -Shb - > {output.bam} """ |
100 101 102 103 | shell: """ pairtools parse -c {params.chrom_sizes} --drop-sam --add-columns mapq --output {output.pairs} {input.bam} """ |
126 127 128 129 130 131 | shell: """ [ -d {params.tempdir} ] || mkdir {params.tempdir} pairtools sort --nproc {params.nproc} --memory {params.memory} --tmpdir {params.tempdir} --output {output.sorted_pairs} {input.pairs} rm -rf {params.tempdir} """ |
146 147 148 149 150 | shell: """ pairtools merge --output results/sorted_pairs/{params.filename}_sorted.pairs.gz {params.sorted_pairs_list} rm -rf results/temp_sorted_pairs/ """ |
173 174 175 176 177 178 | shell: """ [ -d {params.tempdir} ] || mkdir {params.tempdir} pairtools sort --nproc {params.nproc} --memory {params.memory} --tmpdir {params.tempdir} --output {output.sorted_pairs} {input.pairs} rm -rf {params.tempdir} """ |
196 197 198 199 200 | shell: """ pairtools dedup --mark-dups --output-dups - --output-unmapped - --output {output.marked_pairs} {input.sorted_pairs} pairix {output.marked_pairs} """ |
218 219 220 221 222 223 224 225 226 227 228 | shell: """ # Select UU, UR, RU reads pairtools select '(pair_type == "UU") or (pair_type == "UR") or (pair_type == "RU")' --output-rest {output.unmapped} --output {params.temp_file} {input.marked_pairs} # Select reads overlapping chromosomes in chromosome sizes file pairtools select 'True' --chrom-subset {params.chrom_sizes} -o {output.dedup_pairs} {params.temp_file} pairix {output.dedup_pairs} # remove temp file rm {params.temp_file} """ |
251 252 253 254 255 256 | shell: """ gunzip -ck {input.dedup_pairs} | workflow/scripts/fragment_4dnpairs.pl -a - {params.frag2_pairs_basename} {params.restriction_file} bgzip -f {params.frag2_pairs_basename} pairix -f {output.frag2_pairs} """ |
270 271 272 273 274 275 | shell: """ pairtools merge --output results/frag2_pairs/{params.filename}.frag2pairs.pairs.gz {params.frag2_pairs_list} pairix -f {output.merged} rm -rf results/temp_frag2_pairs/ """ |
297 298 299 300 301 302 | shell: """ gunzip -ck {input.dedup_pairs} | workflow/scripts/fragment_4dnpairs.pl -a - {params.frag2_pairs_basename} {params.restriction_file} bgzip -f {params.frag2_pairs_basename} pairix -f {output.frag2_pairs} """ |
326 327 328 329 330 331 332 333 | shell: """ # use for all chromosomes and contigs cp {params.chrom_sizes} {params.cooler_tempchrsize} # the cload command requires the chrom size file to exist besides the chrom size bin file. cooler cload pairix -p {params.cooler_n_cores} -s {params.cooler_max_split} {params.cooler_tempchrsize}:{params.cooler_bin_size} {input.frag2_pairs} {output.cooler} """ |
353 354 355 356 | shell: """ juicer_tools pre -r {params.custom_resolutions} -j {params.nproc} -q {params.mapQ_minimum} {input.frag2_pairs} {output.hic_file} {params.chrom_sizes} """ |
378 379 380 381 382 383 384 385 386 387 388 389 | shell: """ chromosomes=($(echo {params.chromosomes} | tr "," "\n")) mkdir results/compartments/temp_{params.sampleName}/ grep ${{chromosomes[@]/#/-e }} {params.chrom_sizes} | sort | uniq > results/compartments/temp_{params.sampleName}/chromosome.sizes bedtools makewindows -g results/compartments/temp_{params.sampleName}/chromosome.sizes -w {params.resolution} > results/compartments/temp_{params.sampleName}/{params.resolution}_windows.bed parallel juicer_tools eigenvector -p {params.normalization} {input.hic_file} {{.}} BP {params.resolution} results/compartments/temp_{params.sampleName}/{params.sampleName}_eigen_{{.}}.txt ::: ${{chromosomes[@]}} parallel grep -P {{.}}'\t' results/compartments/temp_{params.sampleName}/{params.resolution}_windows.bed '>' results/compartments/temp_{params.sampleName}/{params.sampleName}_eigen_{{.}}.bed ::: ${{chromosomes[@]}} parallel paste results/compartments/temp_{params.sampleName}/{params.sampleName}_eigen_{{.}}.bed results/compartments/temp_{params.sampleName}/{params.sampleName}_eigen_{{.}}.txt '>' results/compartments/{params.sampleName}_eigen_{{.}}.bedgraph ::: ${{chromosomes[@]}} gzip --stdout results/compartments/{params.sampleName}_eigen*.bedgraph > {output.compartment_file} rm -rf results/compartments/temp_{params.sampleName}/ """ |
410 411 412 413 | shell: """ juicer_tools48g arrowhead --threads {params.threads} -k {params.normalization} -m {params.sliding_window} -r {params.resolution} {input.hic_file} results/TADs/{params.outputDirectory} """ |
429 430 431 432 | shell: """ juicer_tools48g arrowhead --threads {params.threads} -k {params.normalization} -m {params.sliding_window} -r {params.resolution} {input.hic_file} results/TADs/{params.outputDirectory} """ |
448 449 450 451 | shell: """ juicer_tools48g arrowhead --threads {params.threads} -k {params.normalization} -m {params.sliding_window} -r {params.resolution} {input.hic_file} results/TADs/{params.outputDirectory} """ |
467 468 469 470 | shell: """ juicer_tools48g arrowhead --threads {params.threads} -k {params.normalization} -m {params.sliding_window} -r {params.resolution} {input.hic_file} results/TADs/{params.outputDirectory} """ |
Support
- Future updates
Related Workflows





