Make colorful identity heatmaps of genomic sequence

public 1yr ago Version: v0.5 0 bookmarks

View Workflow

This is a repository for making colorful identity heatmaps of genomic sequence.

Installation

To install you can follow the directions on the usage page or use the information below.

You will need a current version of snakemake to run this workflow. To get snakemake please follow the install instructions on their website, but in brief once conda and mamba are installed you can install snakemake with:

mamba create -n snakemake -c conda-forge -c bioconda snakemake

Afterwards you can activate the conda environment and download the repository. And all additional dependencies will be handled by snakemake .

conda activate snakemake
git clone https://github.com/mrvollger/StainedGlass.git

Running

Choose a sample identifier for your run e.g. chr8 and a fasta file on which you want to show the colorful alignments and the modify the config file config/config.yaml accordingly.

Once this is done and you have activated your conda env with snakemake you can run the pipeline like so:

snakemake --use-conda --cores 24

Or do a dry run of the pipeline:

snakemake --use-conda --cores 24 -n

All parameters are described in config/README.md and you can modify any of them by modifying config/config.yaml . You can also change the configuration via the command line. For example, to change the sample identifier and fasta options do:

snakemake --use-conda --cores 24 --config sample=test2 fasta=/some/fasta/path.fa

Please try the test case with the default configuration file before submitting issues. If you are familiar with snakemake and want to trouble shoot yourself you can find the Snakefile in the directory workflow .

Output

The file results/{sample}.{\d+}.{\d+}.bed will contain all the alignments identified by the pipeline, and is the main input for figure generation. Under the same prefix there will also be a bam file that contains the unprocessed alignments. Note the bam will contain additional alignments not present in the bed file because redundant alignments with lower scores are removed before the figure generation.

Static dot-plots for moderate sized regions

To make pdfs and pngs for a particular set of regions just add make_figures to your command. This is generally appropriate for comparing up to ~5 regions totaling at most ~40 Mbp.

snakemake --use-conda --cores 24 make_figures

This will make an output directory under results/{sample}.{\d+}.{\d+}_figures with a variety of dot plots in pdf and png format.

If you see tri.TRUE in the output pdf/png it means that the dot plot is rotated and cropped into a triangle. If you see onecolorscale.FALSE it means that between different facets in the same plot different color scales are being used.

Visualization of a large region or a whole genome

Making an interactive whole genome visualization requires the use of the program HiGlass and a web browser. However, this pipeline will make the necessary input files with the following command:

snakemake --use-conda --cores 24 cooler

To view locally, use higlass-manage :

pip install higlass-manage
higlass-manage view results/small.5000.10000.strand.mcool

See the T2T CHM13 v1.0 StainedGlass for an example.

High resolution interactive visualization

To create a high-resolution interactive visualization where the coloring is proportionally to the number of reads mapped to each bin, use the following command:

snakemake --use-conda --cores 24 cooler_density --config window=32 cooler_window=100

Arabidopsis: quick start, case example, and benchmark

To demonstrate a case example of using StainedGlass we applied the tool to a 132 Mbp chromosome level assembly of the Arabidopsis genome ( DOI:10.1126/science.abi7489 ).

wget https://github.com/schatzlab/Col-CEN/raw/main/v1.2/Col-CEN_v1.2.fasta.gz \
 && gunzip Col-CEN_v1.2.fasta.gz \
 && samtools faidx Col-CEN_v1.2.fasta

Using 8 cores on a laptop with 32 GB of ram we ran StainedGlass using the following commands:

time snakemake --cores 8 --config sample=arabidopsis fasta=Col-CEN_v1.2.fasta --use-conda

This command generated 41,036,963 self-self pairwise alignments within the assembly, 16,699,976 of which passed filters for downstream analysis.

Then to generate the cooler files that can be loaded in HiGlass we ran the following command with the already computed alignments:

time snakemake --cores 8 --config sample=arabidopsis fasta=Col-CEN_v1.2.fasta --use-conda cooler

The results can be viewed at resgen.io/paper-data/Naish , and we include a static view of the centromeres here:

Arabidopsis CENs

Arabidopsis runtime statistics:

step	window (bp)	user (s)	system (s)	cpu (%)	wall (h:m:s)
alignment	1,000	16,014.07	163.41	481	56:00.57
cooler	1,000	544.51	32.98	213	4:30.64
static figures [^tnote]	1,000	2,635.30	188.07	58	1:20:14.59

[^tnote]: Not recommended for whole genomes.

A full report of all steps executed and the runtime of those steps is available in case-example-arabidopsis/report.html .

TODO

Allow users to adjust the color pallet used in R
Test short read aligners with smaller window sizes
Make a more intelligent fragmentation method that won't be affected by offsets in repeat motifs
Consider alternative ways to score cells with multiple non-overlapping alignments

Cite

Mitchell R Vollger, Peter Kerpedjiev, Adam M Phillippy, Evan E Eichler, StainedGlass: Interactive visualization of massive tandem repeat structures with identity heatmaps, Bioinformatics, 2022; https://doi.org/10.1093/bioinformatics/btac018

Code Snippets

shell:
    """
    bedtools makewindows -g {input.fai} -w {wildcards.W} {params.slide} > {output.bed}
    """

SnakeMake BEDTools From line 148 of workflow/Snakefile

shell:
    """
    python {params.script} {input.bed} --outputs {output.bed}
    """

SnakeMake From line 168 of workflow/Snakefile

shell:
    """
    bedtools getfasta -fi {input.ref} -bed {input.bed} \
        > {output.fasta}
    """

SnakeMake BEDTools From line 187 of workflow/Snakefile

shell:
    """
    samtools faidx {input.ref} {params.name} > {output.fasta} 
    """

SnakeMake SAMtools From line 208 of workflow/Snakefile

shell:
    """
    minimap2 \
        -f {wildcards.F} -s {params.S} \
        {params.MAP_PARAMS} \
         -d {output.split_ref_index} \
        {input.ref} \
    2> {log}
    """

SnakeMake Minimap2 From line 229 of workflow/Snakefile

shell:
    """
    bedtools getfasta -fi {input.ref} -bed {input.bed} \
        > {output.query_fasta}
    """

SnakeMake BEDTools From line 253 of workflow/Snakefile

shell:
    """
    ( minimap2 \
        -t {threads} \
        -f {wildcards.F} -s {params.S} \
        {params.MAP_PARAMS} \
        --dual=yes --eqx \
        {input.split_ref} {input.query} \
            | samtools sort -m {resources.mem}G \
                -o {output.aln} \
    ) 2> {log}
    """

SnakeMake SAMtools Minimap2 From line 276 of workflow/Snakefile

run:
    open(output.alns, "w").write("\n".join(input.aln) + "\n")

SnakeMake From line 305 of workflow/Snakefile

shell:
    """
    #samtools cat -b {input.alns} -o {output.aln} 
    samtools merge -b {input.alns} {output.aln} 
    """

SnakeMake SAMtools From line 327 of workflow/Snakefile

shell:
    """
    samtools sort -m {resources.mem}G -@ {threads} --write-index \
        -o {output.aln} {input.aln}
    """

SnakeMake SAMtools From line 346 of workflow/Snakefile

shell:
    """
    python {params.script} --threads {threads} \
        --matches  {params.S} --header \
        {input.aln} \
        | pigz -p {threads} > {output.tbl}
    """

SnakeMake From line 369 of workflow/Snakefile

shell:
    """
    python {params.script} \
        --window {wildcards.W} --fai {input.fai} \
        --full {output.full} \
        {params.one} \
        {input.tbl} {output.bed}
    """

SnakeMake From line 395 of workflow/Snakefile

shell:
    """
    Rscript {params.script} \
        --bed {input.bed} \
        --threads {threads} \
        --prefix {wildcards.SM}.{wildcards.W}.{wildcards.F}
    """

SnakeMake From line 435 of workflow/Snakefile

shell:
    """
    gunzip -c {input.bed} | tail -n +2 \
        | sed  's/+$/100/g' \
        | sed  's/-$/50/g' \
          | cooler cload pairs \
            -c1 1 -p1 2 -c2 4 -p2 5 \
                    --field count=8:agg=mean,dtype=float \
            --chunksize 50000000000 \
            {input.fai}:{wildcards.W} \
            --zero-based \
            - {output.cool}
    """

SnakeMake cooler From line 465 of workflow/Snakefile

shell:
    """
    gunzip -c {input.bed} | tail -n +2 \
          | cooler cload pairs \
            -c1 1 -p1 2 -c2 4 -p2 5 \
                    --field count=7:agg=mean,dtype=float \
            --chunksize 50000000000 \
            {input.fai}:{wildcards.W} \
            --zero-based \
            - {output.cool}

    """

SnakeMake cooler From line 493 of workflow/Snakefile

shell:
    """
    cooler zoomify --field count:agg=mean,dtype=float {input.i} \
            -n {threads} \
         -o {output.i}
    """

SnakeMake cooler From line 519 of workflow/Snakefile

shell:
    """
    cooler zoomify --field count:agg=mean,dtype=float {input.s} \
            -n {threads} \
         -o {output.s}
    """

SnakeMake cooler From line 539 of workflow/Snakefile

shell:
    """
    bwa index -p temp/ref_{wildcards.SM} {input.fasta}
    """

SnakeMake BWA From line 557 of workflow/Snakefile

shell:
    """
    bwa aln -t {threads} temp/ref_{wildcards.SM} {input.reads} \
        | bwa samse -n {params.num_dups} temp/ref_{wildcards.SM} - {input.reads} \
        | gzip > {output.sam}
    """

SnakeMake BWA From line 576 of workflow/Snakefile

shell:
    """
    gunzip -c {input.sam} | python {params.script} - \
        | gzip > {output.contacts}
    """

SnakeMake From line 596 of workflow/Snakefile

shell:
    """
    cooler cload pairs \
            -c1 1 -p1 2 -c2 4 -p2 5 \
            --chunksize 50000000000 \
            {input.fai}:{wildcards.CW} \
            --zero-based \
            {input.contacts} {output.cool}
    """

SnakeMake cooler From line 616 of workflow/Snakefile

shell:
    """
    cooler zoomify {input.d} \
            -n {threads} \
         -o {output.d}
    """

SnakeMake cooler From line 639 of workflow/Snakefile

ShowHide 17 more snippets with no or duplicated tags.

Comments

Support

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Created: 1yr ago

Updated: 1yr ago

Maitainers: public

URL: https://mrvollger.github.io/StainedGlass/

Name: stainedglass

Version: v0.5

Badge:

Insert copied code into your website to add a link to this workflow.

License: MIT License

Keywords:

Sample annotation Chromatographic alignment Heat map cooler BEDTools BWA Minimap2 SAMtools Snakemake Genomics

Future updates

Related Workflows

psychip_snakemake — Show Details View Workflow

ENCODE pipeline for histone marks developed for the psychENCODE project

public

psychip pipeline is an improved version of the ENCODE pipeline for histone marks developed for the psychENCODE project. The o...

raw sequence reads Alignment Sequence alignment report macs2 ucsc-bedclip bedGraphToBigWig BEDTools BWA Picard SAMtools Snakemake

Free

Near-real time tracking of SARS-CoV-2 in Connecticut

public

Repository containing scripts to perform near-real time tracking of SARS-CoV-2 in Connecticut using genomic data. This pipeli...

JSON nextclade Augur Biopython FOCUS Pandas Snakemake bs4 epiweeks geopy matplotlib numpy pycountry pycountry-convert uszipcode

Free

cellranger-snakemake-gke — Show Details View Workflow

snakemake workflow to run cellranger on a given bucket using gke.

public

A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...

macs2 ucsc-bedclip bedGraphToBigWig BEDTools BWA Picard SAMtools Snakemake

Free

ATLAS - Three commands to start analyzing your metagenome data

public

Metagenome-atlas is a easy-to-use metagenomic pipeline based on snakemake. It handles all steps from QC, Assembly, Binning, t...

raw sequence reads Genome assembly Annotation track checkm2 gunc prodigal snakemake-wrapper-utils MEGAHIT Atlas BBMap Biopython BioRuby Bwa-mem2 cd-hit CheckM DAS Diamond eggNOG-mapper v2 MetaBAT 2 Minimap2 MMseqs MultiQC Pandas Picard pyfastx SAMtools SemiBin Snakemake SPAdes SqueezeMeta TADpole VAMB CONCOCT ete3 gtdbtk h5py networkx numpy plotly psutil utils metagenomics

Free

175

rna-seq-star-deseq2 — Show Details View Workflow

RNA-seq workflow using STAR and DESeq2

public

This workflow performs a differential gene expression analysis with STAR and Deseq2. The usage of this workflow is described ...

Free

dna-seq-gatk-variant-calling — Show Details View Workflow

This Snakemake pipeline implements the GATK best-practices workflow

public

This Snakemake pipeline implements the GATK best-practices workflow for calling small germline variants. The usage of thi...

VCF raw sequence reads Variant calling genetic variants gatk rust-bio-tools snakemake-wrapper-utils tabix BCFtools BWA FastQC MultiQC Pandas Picard SAMtools Snakemake Trimmomatic Variant Effect Predictor (VEP) common matplotlib numpy seaborn DNA

Free