Ultimate ATAC-seq Data Processing & Analysis Pipeline
From raw (unaligned) BAM files to normalized counts.
A Snakemake implementation of the BSF's ATAC-seq Data Processing Pipeline, extended by downstream processing and unsupervised analysis steps using Bash, Python and R. Reproducibility and scalability are ensured by using Snakemake, conda and Singularity.
If you use this workflow in a publication, please don't forget to give credit to the authors by citing it using this DOI: 10.5281/zenodo.6323634.
Table of contents
Authors
Software
This project wouldn't be possible without the following software:
| Software | Reference (DOI) |
| --- | --- |
| bedtools | https://doi.org/10.1093/bioinformatics/btq033 |
| Bowtie2 | https://doi.org/10.1038/nmeth.1923 |
| CQN | https://doi.org/10.1093/biostatistics/kxr054 |
| deeptools | https://doi.org/10.1093/nar/gkw257 |
| ENCODE | https://doi.org/10.1038/s41598-019-45839-z |
| fastp | https://doi.org/10.1093/bioinformatics/bty560 |
| HOMER | https://doi.org/10.1016/j.molcel.2010.05.004 |
| MACS2 | https://doi.org/10.1186/gb-2008-9-9-r137 |
| matplotlib | https://doi.org/10.1109/MCSE.2007.55 |
| MultiQC | https://doi.org/10.1093/bioinformatics/btw354 |
| pybedtools | https://doi.org/10.1093/bioinformatics/btr539 |
| pandas | https://doi.org/10.5281/zenodo.3509134 |
| samblaster | https://doi.org/10.1093/bioinformatics/btu314 |
| samtools | https://doi.org/10.1093/bioinformatics/btp352 |
| SCANPY | https://doi.org/10.1186/s13059-017-1382-0 |
| scikit-learn | http://jmlr.org/papers/v12/pedregosa11a.html |
| seaborn | https://doi.org/10.21105/joss.03021 |
| Snakemake | https://doi.org/10.12688/f1000research.29032.2 |
| TMM | https://doi.org/10.1186/gb-2010-11-3-r25 |
| UMAP | https://doi.org/10.21105/joss.00861 |
| UROPA | https://doi.org/10.1038/s41598-017-02464-y |
Methods
This is a template for the Methods section of a scientific publication and is intended to serve as a starting point. Only retain paragraphs relevant to your analysis. References [ref] to the respective publications are curated in the software table above. Versions (ver) have to be read out from the respective conda environment specifications (.yaml files) or after execution. Parameters that have to be adapted depending on the data or workflow configuration are denoted in square brackets, e.g. [X].
Processing. Sequencing adapters were removed using the software fastp (ver) [ref]. Bowtie2 (ver) [ref] was used with the “--very-sensitive” parameter to align the short reads (representing locations of transposition events) to the [GRCh38 (hg38)/GRCm38 (mm10)] assembly of the [human/mouse] genome. PCR duplicates were marked using samblaster (ver) [ref]. Aligned BAM files were then sorted, filtered using ENCODE blacklisted regions [ref], and indexed using samtools (ver) [ref]. To detect open chromatin regions, peak calling was performed on each sample using MACS2 (ver) [ref] with the “--nomodel”, “--keep-dup auto” and “--extsize 147” options. The HOMER (ver) [ref] function findMotifsGenome was used for motif enrichment analysis of the detected open chromatin regions.
Quality control metrics were aggregated and reported using MultiQC (ver) [ref]; [X] sample(s) had to be removed.
Quantification. A consensus region set, comprising of [X] genomic regions, was generated, by merging the identified peak summits, extended by 250 bp on both sides using the slop function from bedtools (ver) [ref] and pybedtools (ver) [ref], across all samples while again discarding peaks overlapping blacklisted features as defined by the ENCODE project [ref]. Consensus regions were annotated using annotatePeaks function from Homer (ver) [ref].
Additionally, we annotated all consensus regions using UROPA (ver) [ref] and genomic features from the [GENCODE vX] basic gene annotation as: “TSS proximal” if the region’s midpoint was within [X] bp upstream or [X] bp downstream from a TSS, or if the region overlapped with a TSS; “gene body” if the region overlapped with a gene; “distal” if the region’s midpoint was within [X] bp of a TSS; and “intergenic” otherwise. For each region, only the closest feature was considered, and the annotations took precedence in the following order: TSS proximal, gene body, distal, and intergenic.
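The precedence rule can be illustrated with a small, purely didactic helper function (this is not the actual UROPA configuration, which the workflow generates from templates in config/uropa/; all cutoffs below are placeholders):

```python
def characterize_region(dist_to_tss, overlaps_tss, overlaps_gene,
                        proximal_up=2000, proximal_dn=2000, distal=10000):
    """Assign one annotation per region, in order of precedence (placeholder cutoffs in bp).

    dist_to_tss: signed distance of the region midpoint to the closest TSS (negative = upstream).
    """
    # TSS proximal: midpoint within the proximal window, or the region overlaps a TSS
    if overlaps_tss or -proximal_up <= dist_to_tss <= proximal_dn:
        return "TSS proximal"
    # gene body: the region overlaps a gene
    if overlaps_gene:
        return "gene body"
    # distal: midpoint within the distal window of a TSS
    if abs(dist_to_tss) <= distal:
        return "distal"
    return "intergenic"

print(characterize_region(dist_to_tss=-1500, overlaps_tss=False, overlaps_gene=True))  # TSS proximal
```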
The consensus region set was used to quantify chromatin accessibility in each sample by summing the number of reads overlapping each consensus region. Consensus region set generation and sample-wise quantification of accessibility were performed using bedtools (ver) [ref] and pybedtools (ver) [ref].
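Per sample, this amounts to a bedtools coverage call over the consensus regions, roughly as in scripts/quantify_counts_sample.py (see the Code Snippets below); the paths and the sample name are placeholders:

```python
import pybedtools

consensus = pybedtools.BedTool("consensus_regions.bed")  # placeholder path
# count reads from the filtered BAM overlapping each consensus region
counts = consensus.coverage(b="sample1.filtered.bam", sorted=True, g="hg38.chrom.sizes").to_dataframe(
    names=["CHR", "START", "END", "ID", "sample1", "NA1", "NA2", "NA3"],
    usecols=["ID", "sample1"], index_col="ID",
)
counts.to_csv("sample1_quantification_counts.csv")
```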
Optional. We split the data into subsets according to the annotation [X] and performed all downstream analyses for each subset separately.
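Conceptually this is a column-wise subset of the count matrix per annotation level, similar to scripts/split_data.py (see the Code Snippets below); file and column names are placeholders:

```python
import pandas as pd

counts = pd.read_csv("counts.csv", index_col=0)          # regions x samples, placeholder path
annotation = pd.read_csv("annotation.csv", index_col=0)  # samples x metadata, placeholder path
split_by = "cell_type"                                   # placeholder annotation column

for level in annotation[split_by].unique():
    # samples belonging to this annotation level
    samples = annotation.index[annotation[split_by] == level]
    counts.loc[:, samples].to_csv(f"{level}_counts.csv")
```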
Downstream Analysis. For all downstream analyses, we filtered the [X] consensus regions to [X] regions that had reads in at least [X] samples, were covered by at least [X] reads (normalized by median library size) in at least [X] proportion of the samples of the smallest subsample group with potential signal (determined by the annotation [X]), and were covered by at least [X] total reads across all samples.
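A simplified sketch of this filter, condensing the logic of scripts/filter_regions.py (shown in the Code Snippets below); all thresholds and the input path are placeholders:

```python
import pandas as pd

counts = pd.read_csv("counts.csv", index_col=0)       # regions x samples, placeholder path
min_count, min_total_count, min_group_n = 10, 15, 3   # placeholder thresholds

# CPM cutoff derived from the median library size
median_lib_size = counts.sum(axis=0).median()
cpm = 1e6 * counts / counts.sum(axis=0)
cpm_cutoff = min_count / median_lib_size * 1e6

keep_cpm = (cpm >= cpm_cutoff).sum(axis=1) >= min_group_n  # enough samples above the CPM cutoff
keep_total = counts.sum(axis=1) >= min_total_count         # enough reads across all samples
filtered_counts = counts.loc[keep_cpm & keep_total]
```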
Next, we determined highly variable regions (HVRs) using an adaptation of the SCANPY (ver) [ref] function highly_variable_genes with flavor 'cell_ranger'. Instead of dispersion = variance/mean we use dispersion = standard deviation, because in ATAC-seq data the number of regions can be very large, so log(CPM) values and log(CPM) means can become negative, which would yield meaningless negative dispersion values (this also avoids division-by-zero problems). We therefore only employ the binning-by-mean strategy for stabilization, not a coefficient-of-variation strategy.
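A condensed sketch of that selection, adapted from scripts/select_HVR.py (shown in the Code Snippets below); the input path and the HVR percentage are placeholders:

```python
import numpy as np
import pandas as pd
from statsmodels import robust

data = pd.read_csv("normalized_counts.csv", index_col=0)  # regions x samples, placeholder path
top_n = int(data.shape[0] * 0.10)                         # e.g. top 10% HVRs (placeholder)

df = pd.DataFrame({"means": data.mean(axis=1),
                   "dispersions": data.std(axis=1)})      # standard deviation instead of var/mean

# bin regions by their mean and normalize the dispersion within each bin (median / MAD)
df["mean_bin"] = pd.cut(df["means"],
                        np.r_[-np.inf, np.percentile(df["means"], np.arange(10, 105, 5)), np.inf])
grouped = df.groupby("mean_bin")["dispersions"]
df["dispersions_norm"] = (df["dispersions"] - grouped.transform("median")) / grouped.transform(robust.mad)

hvr_idx = df.index[df["dispersions_norm"].rank(ascending=False) <= top_n]
data_hvr = data.loc[hvr_idx]
```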
Conditional quantile normalization (CQN) (ver) [ref] was performed on the filtered accessibility matrix, using the region length and GC content of the consensus regions as conditions, both quantified using bedtools (ver) [ref] and pybedtools (ver) [ref].
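The normalization itself is delegated to the R/Bioconductor package cqn via rpy2, essentially as in scripts/normalize_CQN.py (shown in the Code Snippets below); this requires R with cqn installed, and the input files (counts plus per-region GC content and length, e.g. from bedtools nuc) are placeholders:

```python
import numpy as np
import pandas as pd
from rpy2.robjects import numpy2ri, pandas2ri, r
from rpy2.robjects.packages import importr

numpy2ri.activate()
pandas2ri.activate()
importr("cqn")

counts = pd.read_csv("filtered_counts.csv", index_col=0)  # regions x samples, placeholder path
nuc = pd.read_csv("region_gc_length.csv", index_col=0)    # per-region gc_content and length, placeholder

cqn_out = r.cqn(counts, x=nuc["gc_content"], lengths=nuc["length"])
y = pd.DataFrame(np.array(cqn_out[list(cqn_out.names).index("y")]),
                 index=counts.index, columns=counts.columns)
offset = pd.DataFrame(np.array(cqn_out[list(cqn_out.names).index("offset")]),
                      index=counts.index, columns=counts.columns)
normalized = y + offset  # CQN-normalized log2 values
```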
Trimmed mean of M-values (TMM) normalization (ver) [ref] was performed on the filtered accessibility matrix.
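In the workflow, the TMM scaling factors are computed with edgeR's calcNormFactors (called from R in scripts/normalize_TMM.py) and then plugged into a log-CPM transformation; the CPM part alone looks roughly like this (placeholder path, and unit factors standing in for real TMM factors):

```python
import numpy as np
import pandas as pd

def log_cpm(counts, norm_factors=1.0, pseudocount=0.5, libsize_pseudocount=1.0):
    """log2 counts-per-million with optional normalization factors (cf. the pipeline's cpm() helper)."""
    counts = counts.astype(float)
    lib_size = counts.sum(axis=0) * norm_factors + libsize_pseudocount
    return np.log2(1e6 * (counts + pseudocount) / lib_size)

counts = pd.read_csv("filtered_counts.csv", index_col=0)  # regions x samples, placeholder path
tmm_factors = pd.Series(1.0, index=counts.columns)        # placeholder; the pipeline uses edgeR's TMM factors here
normalized_lcpm = log_cpm(counts, norm_factors=tmm_factors)
```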
Unsupervised Analysis & Visualization. We applied both linear and non-linear unsupervised dimensionality reduction methods to the normalized data to visualize emerging sample-wise patterns in two dimensions. We used Principal Component Analysis (PCA) [ref] from scikit-learn (ver) [ref] as the linear approach and Uniform Manifold Approximation and Projection (UMAP) from umap-learn (ver) [ref] with the correlation metric as the non-linear approach. For visualization, matplotlib (ver) [ref] was used.
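A minimal sketch of both embeddings, mirroring scripts/do_PCA.py and scripts/do_UMAP.py (shown in the Code Snippets below); the input path is a placeholder:

```python
import pandas as pd
import umap
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = pd.read_csv("normalized_counts.csv", index_col=0).T  # samples x regions, placeholder path

# linear: keep as many PCs as needed to explain 99.9% of the variance
pca = PCA(0.999, random_state=42)
pcs = pd.DataFrame(pca.fit_transform(StandardScaler().fit_transform(data)), index=data.index)

# non-linear: 2D UMAP with the correlation metric (requires at least 4 samples)
embedding = umap.UMAP(n_components=2, random_state=42, metric="correlation").fit_transform(data)
umap_df = pd.DataFrame(embedding, index=data.index, columns=["UMAP1", "UMAP2"])
```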
The processing and analysis described here was performed using a publicly available Snakemake (ver) [ref] workflow [10.5281/zenodo.6323634].
Features
- Processing
  - alignment of both single-end and paired-end reads in raw/unaligned BAM format (bowtie2)
  - peak calling (macs2)
  - peak annotation and motif analysis (homer)
  - MultiQC report generation (multiqc)
- Quantification
  - consensus region set generation
  - consensus region set annotation (uropa using the regulatory build and GENCODE as references, and homer)
  - read count and peak support quantification of the consensus region set across samples, yielding a count and a support matrix with dimensions regions x samples
- optional: splitting of the data into multiple subsets (e.g. by cell type or condition)
- Downstream Analysis (performed on the whole data set and on each split separately)
  - Visualization after each analysis step
    - step-specific plots (e.g. region filtering, HVR selection)
    - dimensionality reduction by PCA & UMAP and plotting of provided metadata
    - mean-variance-relationship plot
- Snakemake report generation for workflow statistics, documentation, reproducibility and result presentation
Usage
These steps are the recommended usage for the first run of the workflow:
1. Perform only the processing by setting the downstream_analysis flag to 0 in the project configuration.
2. Use the generated MultiQC report (project_path/atacseq_report/multiqc_report.html) to judge the quality of your samples.
3. Fill out the mandatory quality control column (pass_qc) in the sample metadata configuration file (you can even use some of the quality metrics for plotting, e.g. like I did in the example files with 'FRiP').
4. Finally, execute the remaining downstream analysis steps by setting the respective flag to 1; only the samples that passed quality control will be included (see the sketch below).
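For orientation, the downstream rules select samples roughly like this (cf. scripts/get_consensus_regions.py in the Code Snippets below; the annotation path is a placeholder):

```python
import pandas as pd

annotations = pd.read_csv("sample_annotation.csv", index_col=0)  # placeholder path
# only samples with pass_qc > 0 enter the downstream analysis
annotations = annotations[annotations["pass_qc"] > 0]
print(annotations.index.tolist())
```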
This workflow is written with Snakemake and its usage is described in the Snakemake Workflow Catalog.
Installation
This should take less than 10 minutes.
1. Install Snakemake, which requires conda & mamba, according to the documentation.
2. Clone/download this repository (e.g. git clone https://github.com/epigen/atacseq_pipeline.git).

All software/package dependencies are installed and managed automatically via Snakemake and conda (and, optionally, Singularity) upon the first run of the workflow.
Configuration
Detailed specifications can be found in ./config/README.md.
Execution
1. Change working directory & activate conda environment
Always execute from within the top level of the pipeline directory (i.e. atacseq_pipeline/). Snakemake commands only work from within the snakemake conda environment.
cd atacseq_pipeline
conda activate snakemake
2. Execute a dry-run
command for a dry-run with option -n (-p makes Snakemake print the resulting shell command for illustration)
snakemake -p -n
3. Execute workflow local or on a cluster
3a. Local execution
command for execution with one core
snakemake -p -j1 --use-conda
3b. Cluster execution
command for vanilla cluster execution on cluster engines that support shell scripts and have access to a common filesystem (e.g. the Sun Grid Engine); more info in the Snakemake cluster execution documentation
snakemake -p --use-conda --cluster qsub -j 32
command for cluster execution using --profile, which submits every task as a separate job with dependencies
snakemake -p --use-conda --profile config/slurm.cemm
the profile for CeMM's SLURM environment is provided in the config/ directory; of note:
- the number of jobs in slurm.cemm/config.yaml should be set as high as necessary, because arrayed job submission does not work (yet) and the scheduler (e.g. SLURM) should take care of the prioritization
- jobs whose dependencies can never be fulfilled are automatically removed from the queue
If you are using another setup, get your cluster execution profile from the Snakemake-Profiles project.
X. Singularity execution (not tested)
Singularity has to be installed (system-wide by root) and available/loaded (e.g. module load singularity). The pipeline automatically loads the correct Singularity image from [Dockerhub](https://hub.docker.com/r/ /atacseq_pipeline)
command for execution with Singularity; just add the flag --use-singularity and use --singularity-args to provide all the directories the pipeline needs access to (in the example it is configured for the three relevant partitions at CeMM)
snakemake -p --use-conda --use-singularity --singularity-args "--bind /nobackup:/nobackup --bind /research:/research --bind /home:/home"
Module
Use as a module in another Snakemake workflow (soon)
Report
command for report generation (this can take a few minutes, depending on the size of the dataset)
snakemake --report /absolute/path/to/report.zip
The command creates a self-contained, HTML-based report in a zip archive containing the following sections:
- Workflow: interactive rulegraph to recapitulate individual steps, used software and concrete code (reproducibility)
- Statistics: duration and timing of individual steps
- Configuration: the used pipeline configuration (accountability)
- Results
  - Configuration: all 3 used configuration files (project, samples, metadata)
  - QC reports: link to the MultiQC report
  - all: each step of the downstream analysis has its own searchable section with step-specific and unsupervised analysis plots
  - optional: one additional section per split, containing the respective downstream analysis results
Results
Project directory structure:
- all: downstream analysis results of the whole data set, including the consensus region set and region annotation
- atacseq_hub: genome browser track files (.bigWig) for each sample
- atacseq_report: statistics and metrics from the processing part as input for the MultiQC report
- atacseq_results: one directory per sample with all the processing and quantification results
- projectName_report.zip: self-contained HTML Snakemake report
- split1 (optional): downstream analysis results of subset 1 of the whole data
- split2 (optional): downstream analysis results of subset 2 of the whole data
Examples
We provide configuration files for two example datasets (mm10 & hg38). Additionally, the report zip archive of the hg38 test example is provided to showcase the pipeline results.
Resources
To ensure reproducibility of results and to make the pipeline easy to use, we provide all required reference data for the analysis of ATAC-seq samples for both supported genomes on Zenodo:
- resources for the GRCm38 (mm10) assembly of the mouse genome
- resources for the GRCh38 (hg38) assembly of the human genome
Tips
Here are some tips for troubleshooting & FAQs:
- always perform a dry-run first with option -n
- if unsure why a certain rule will be executed, use option --reason in the dry-run; this gives the reason for the execution of each rule
- when there are 3 or fewer samples, all UMAP data and plots will be empty
- when there are 2 or fewer samples, all PCA data and plots will be empty
- in case the pipeline crashes, you manually canceled your jobs, or Snakemake tries to "resume.. resubmit.." jobs, remove the .snakemake/incomplete directory
- if you submit a lot of jobs, e.g. via SLURM (>500), this might take some time (i.e. ~1 s per job submission)
- two comments on peak support quantification (see the sketch below):
  - even if the peak support of a region in a certain sample is 0, this does not mean that no reads were counted in the count matrix; it only states that no peak was called there
  - the peak support can be >1 for certain samples in case a consensus region spans more than one region (with peaks) in the respective sample
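To check this on your own data, you can compare the two matrices directly; a small sketch with placeholder file and sample names:

```python
import pandas as pd

counts = pd.read_csv("counts_matrix.csv", index_col=0)    # regions x samples, placeholder path
support = pd.read_csv("support_matrix.csv", index_col=0)  # regions x samples, placeholder path

# regions without a called peak in 'sample1' can still have non-zero read counts
no_peak = support["sample1"] == 0
print(counts.loc[no_peak, "sample1"].describe())

# support > 1: the consensus region spans more than one peak region of that sample
print((support > 1).sum().sum(), "region/sample pairs with support > 1")
```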
- command for generating the workflow's rulegraph (provided in workflow/dags)
  snakemake --rulegraph --forceall | dot -Tsvg > workflow/dags/atacseq_pipeline_rulegraph.svg
- command for generating the directed acyclic graph (DAG) of all jobs with the current configuration (provided for both test examples in workflow/dags)
  snakemake --dag --forceall | dot -Tsvg > workflow/dags/all_DAG.svg
Links
Publications
The following publications successfully used this module for their analyses.
Code Snippets
```
shell:
    """
    mkdir -p {params.results_dir};
    mkdir -p {params.sample_dir};
    mkdir -p {params.bam_dir};

    RG="--rg-id {params.sample_name} --rg SM:{params.sample_name} --rg PL:illumina --rg CN:CeMM_BSF"

    for i in {params.raw_bams}; do samtools fastq $i 2>> "{params.bam_dir}/{params.sample_name}.samtools.log" ; done | \
    fastp {params.adapter_sequence} {params.adapter_fasta} --stdin {params.interleaved_in} --stdout --html "{params.bam_dir}/{params.sample_name}.fastp.html" --json "{params.bam_dir}/{params.sample_name}.fastp.json" 2> "{params.bam_dir}/{params.sample_name}.fastp.log" | \
    bowtie2 $RG --very-sensitive --no-discordant -p {threads} --maxins 2000 -x {params.bowtie2_index} --met-file "{params.bam_dir}/{params.sample_name}.bowtie2.met" {params.interleaved} - 2> "{params.bam_dir}/{params.sample_name}.txt" | \
    samblaster {params.add_mate_tags} 2> "{params.bam_dir}/{params.sample_name}.samblaster.log" | \
    samtools sort -o "{params.bam_dir}/{params.sample_name}.bam" - 2>> "{params.bam_dir}/{params.sample_name}.samtools.log";

    samtools index "{params.bam_dir}/{params.sample_name}.bam" 2>> "{params.bam_dir}/{params.sample_name}.samtools.log";

    samtools idxstats "{params.bam_dir}/{params.sample_name}.bam" | awk '{{ sum += $3 + $4; if($1 == "{params.chrM}") {{ mito_count = $3; }}}}END{{ print "mitochondrial_fraction\t"mito_count/sum }}' > "{params.sample_dir}/{params.sample_name}.stats.tsv";

    samtools flagstat "{params.bam_dir}/{params.sample_name}.bam" > "{params.bam_dir}/{params.sample_name}.samtools_flagstat.log";

    samtools view {params.filtering} -o "{params.bam_dir}/{params.sample_name}.filtered.bam" "{params.bam_dir}/{params.sample_name}.bam";
    samtools index "{params.bam_dir}/{params.sample_name}.filtered.bam";
    """
```
```
script:
    "../scripts/do_UMAP.py"

script:
    "../scripts/do_PCA.py"

script:
    "../scripts/plot_dimred.py"
```
```
shell:
    """
    mkdir -p {params.hub_dir};

    bamCoverage --bam {input.bam} \
        -p max --binSize 10 --normalizeUsing RPGC \
        --effectiveGenomeSize {params.genome_size} --extendReads 175 \
        -o "{params.hub_dir}/{params.sample_name}.bigWig" > "{params.hub_dir}/{params.sample_name}.bigWig.log" 2>&1;

    echo "base,count" > {output.tss_hist};
    bedtools slop -b {params.tss_slop} -i {params.unique_tss} -g {params.chromosome_sizes} | \
    bedtools coverage -a - -b {input.bam} -d -sorted | \
    awk '{{if($6 == "+"){{ counts[$7] += $8;}} else counts[{params.double_slop} - $7 + 1] += $8;}} END {{ for(pos in counts) {{ if(pos < {params.noise_lower} || pos > {params.noise_upper}) {{ noise += counts[pos] }} }}; average_noise = noise /(2 * {params.noise_lower}); for(pos in counts) {{print pos-2000-1","(counts[pos]/average_noise) }} }}' | \
    sort -t "," -k1,1n >> {output.tss_hist} ;
    """
```
```
shell:
    """
    multiqc -fc {params.project_config} {params.project_path}
    """
```
```
shell:
    """
    mkdir -p {params.results_dir};
    mkdir -p {params.sample_dir};
    mkdir -p {params.peaks_dir};
    mkdir -p {params.homer_dir};

    export PATH="{params.homer_bin}:$PATH";

    macs2 callpeak -t {input.bam} {params.formating} \
        --nomodel --keep-dup auto --extsize 147 -g {params.genome_size} \
        -n {params.sample_name} \
        --outdir {params.peaks_dir} > "{params.peaks_dir}/{params.sample_name}.macs2.log" 2>&1;

    resources/homer/bin/annotatePeaks.pl {params.peaks_dir}/{params.sample_name}_peaks.narrowPeak {params.genome} \
        > {params.peaks_dir}/{params.sample_name}_peaks.narrowPeak.annotated.tsv \
        2> {params.peaks_dir}/{params.sample_name}_peaks.narrowPeak.annotated.tsv.log;

    resources/homer/bin/findMotifsGenome.pl "{params.peaks_dir}/{params.sample_name}_summits.bed" {params.genome} {params.homer_dir} -size 200 -mask \
        > "{params.homer_dir}/{params.sample_name}.homer.log" 2>&1

    cat {params.peaks_dir}/{params.sample_name}_peaks.narrowPeak | wc -l | \
        awk '{{print "peaks\t" $1}}' >> "{params.sample_dir}/{params.sample_name}.stats.tsv"

    TOTAL_READS=`samtools idxstats {input.bam} | awk '{{sum += $3}}END{{print sum}}'`;

    samtools view -c -L "{params.peaks_dir}/{params.sample_name}_peaks.narrowPeak" {input.bam} | \
        awk -v total=$TOTAL_READS '{{print "frip\t" $1/total}}' >> "{params.sample_dir}/{params.sample_name}.stats.tsv";

    samtools view -c -L {params.regulatory_regions} {input.bam} | \
        awk -v total=$TOTAL_READS '{{print "regulatory_fraction\t" $1/total}}' >> "{params.sample_dir}/{params.sample_name}.stats.tsv";
    """
```
```
script:
    "../scripts/split_data.py"

script:
    "../scripts/filter_regions.py"

script:
    "../scripts/normalize_TMM.py"

script:
    "../scripts/normalize_CQN.py"

script:
    "../scripts/select_HVR.py"

script:
    "../scripts/plot_mean_var.py"

script:
    "../scripts/get_consensus_regions.py"

script:
    "../scripts/quantify_support_sample.py"

script:
    "../scripts/quantify_support_aggregate.py"

script:
    "../scripts/quantify_counts_sample.py"

script:
    "../scripts/quantify_counts_aggregate.py"
```
20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 | run: ### generate gencode config with open( os.path.join("config","uropa","gencode_config_TEMPLATE_V4.txt") ) as f: gencode_template=Template(f.read()) gencode_config=gencode_template.substitute({ 'TSS_flanking':config['tss_size'], 'TSS_proximal_upstream':config['proximal_size_up'], 'TSS_proximal_downstream':config['proximal_size_dn'], 'distal_distance':config['distal_size'], 'gtf_file':'"{}"'.format(config["gencode_gtf"]), 'bed_file':'"{}"'.format(input.consensus_regions) }) gencode_config_file=output.gencode_config with open(gencode_config_file,'w') as out: out.write(gencode_config) ### generate reg config file with open( os.path.join("config","uropa","regulatory_config_TEMPLATE.txt") ) as f: reg_template=Template(f.read()) reg_config=reg_template.substitute({ 'gtf_file':'"{}"'.format(config["regulatory_build_gtf"]), 'bed_file':'"{}"'.format(input.consensus_regions) }) reg_config_file=output.reg_config with open(reg_config_file,'w') as out: out.write(reg_config) |
```
shell:
    """
    uropa -p {params.results_dir}/gencode -i {input.gencode_config} -t {threads} -l {params.results_dir}/uropa.gencode.log
    uropa -p {params.results_dir}/reg -i {input.reg_config} -t {threads} -l {params.results_dir}/uropa.reg.log
    """
```
```
shell:
    """
    export PATH="{params.homer_bin}:$PATH";

    resources/homer/bin/annotatePeaks.pl {input.consensus_regions} {config[genome]} \
        > {output.homer_annotations} \
        2> {output.homer_annotations_log};
    """
```
125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 | run: # load and format uropa gencode results gencode_characterization=pd.read_csv(input.gencode_results,sep='\t') gencode_characterization = gencode_characterization.set_index("peak_id") gencode_characterization.loc[gencode_characterization['feature']=='transcript','feat_type']='transcript:'+gencode_characterization.loc[gencode_characterization['feature']=='transcript','transcript_type'] gencode_characterization.loc[gencode_characterization['feature']=='gene','feat_type']='gene:'+gencode_characterization.loc[gencode_characterization['feature']=='gene','gene_type'] gencode_characterization['length']=gencode_characterization['peak_end']-gencode_characterization['peak_start'] gencode_characterization=gencode_characterization[['peak_chr','peak_start','peak_end','length','feat_anchor','distance','relative_location','feat_type','gene_id','gene_name','name']] gencode_characterization.columns=['chr','start','end','length','feat_anchor','distance','location','feat_type','gene_id','gene_name','characterization'] gencode_characterization.loc[gencode_characterization['characterization'].isna(),'characterization']='NONE' gencode_characterization=gencode_characterization.add_prefix('gencode_') # load and format uropa regulatory build results reg_characterization=pd.read_csv(input.reg_results,sep='\t') reg_characterization = reg_characterization.set_index('peak_id')[['feature','ID']] reg_characterization.columns=['reg_feature','reg_feature_id'] reg_characterization.loc[reg_characterization['reg_feature'].isna(),'reg_feature']='reg_NONE' reg_characterization=reg_characterization.add_prefix('regulatoryBuild_') |
8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 | import os import pandas as pd import numpy as np from itertools import compress # dimensionality reduction from sklearn.preprocessing import StandardScaler from sklearn.decomposition import PCA #### configurations # data=observations x features data=pd.read_csv(snakemake.input[0], index_col=0).T output_data=snakemake.output[0] output_expl_var=snakemake.output[1] if not os.path.exists(snakemake.params['results_dir']): os.mkdir(snakemake.params['results_dir']) #### Dimensionality reduction via unsupervised PCA so that 99.9% of variance is preserved for visualization purpose pca_obj = PCA(0.999, random_state=42, ) data_pca=pca_obj.fit_transform(StandardScaler().fit_transform(data)) data_df = pd.DataFrame(data_pca, index=data.index,) data_df = data_df.rename_axis(("sample_name")) data_df.to_csv(output_data) pd.DataFrame(pca_obj.explained_variance_ratio_).to_csv(output_expl_var) |
8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 | import os import pandas as pd import numpy as np from itertools import compress # dimensionality reduction import umap #### configurations # data=observations x features data=pd.read_csv(snakemake.input[0], index_col=0).T output=snakemake.output[0] if not os.path.exists(snakemake.params['results_dir']): os.mkdir(snakemake.params['results_dir']) # if 3 samples or less UMAP can not be performed if data.shape[0]<4: from pathlib import Path Path(output).touch() import sys sys.exit() #### Dimensionality reduction via unsupervised UMAP for visualization purpose data_umap = umap.UMAP( n_components=2, random_state=42, metric="correlation", ).fit(data) data_df = pd.DataFrame(data_umap.embedding_, index=data.index,) data_df = data_df.rename_axis(("sample_name")) data_df.to_csv(output) |
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 | import pybedtools as bedtools import os import time import pandas as pd import matplotlib.pyplot as plt import numpy as np import seaborn as sns import warnings warnings.filterwarnings("ignore") # utility functions def cpm(counts_matrix, feature_len=1, norm_factors=1, log=False, pseudocount=0.5, libsize_pseudocount=1, million=1e6): """ Counts per million normalization If feature_len=1 then you get counts per million Setting pseudocount=0.5, libsize_pseudocount=1 ensures results equivalent to LIMMA-voom """ rpk = counts_matrix.astype(float) / feature_len cpm = million * (rpk + pseudocount) / (rpk.sum(axis=0) * norm_factors + libsize_pseudocount) return np.log2(cpm) if log else cpm def filter_by_reads(counts_df, min_group_n, cpm_df=None, min_count=10, min_total_count=15, large_min_n=10, large_min_n_proportion=0.7, verbose=True): if min_group_n > large_min_n: # edgeR min_n = large_min_n + (min_group_n - large_min_n) * large_min_n_proportion else: min_n = min_group_n _million, _tolerance = 1e6, 1e-14 median_lib_size = counts_df.sum(axis=0).median() cpm_cutoff = min_count / median_lib_size * _million keep_cpm = (cpm_df >= cpm_cutoff).sum(axis=1) >= min_n - _tolerance keep_min_total_count = counts_df.sum(axis=1) >= min_total_count - _tolerance if verbose: print('min_n', min_n) print('median_lib_size', median_lib_size) print('cpm_cutoff', cpm_cutoff) print('remove based on cpm_cutoff', (~keep_cpm).sum()) print('additionally remove based on keep_min_total_count', (~keep_min_total_count[keep_cpm]).sum()) return keep_cpm & keep_min_total_count def plot_sequenced_samples(df, n_samples=30, ax=None, xlabel='Read count', ylabel='Density', title=None, xlim=None, ylim=None, kde=True, hist=False, legend=False, samples_as_rows=False): if samples_as_rows: df = df.T for i in np.random.choice(list(range(df.shape[1])), size=n_samples, replace=False) if n_samples is not None else range(df.shape[1]): sample = df.iloc[:, i] ax = sns.distplot(sample, ax=ax, kde=kde, hist=hist, label=df.columns[i] if legend else None) ax.set_xlabel(xlabel) ax.set_ylabel(ylabel) if xlim is not None: ax.set_xlim(xlim) if ylim is not None: ax.set_ylim(ylim) if title is None: ax.set_title('{} samples{}{}'.format(n_samples, ', xlim: {}'.format(xlim) if xlim is not None else '', ', ylim: {}'.format(ylim) if ylim is not None else '')) else: ax.set_title(title) if legend: ax.legend(bbox_to_anchor=(1, 1)) sns.despine() return ax #### configurations # input data_counts=pd.read_csv(snakemake.input['counts'], index_col=0) support=pd.read_csv(snakemake.input['support'], index_col=0) annot=pd.read_csv(snakemake.input['annot'], index_col=0) # parameters peak_support_threshold = snakemake.params["peak_support_threshold"] large_min_n_proportion = snakemake.params["proportion"] min_group = snakemake.params["min_group"] # output output_data=snakemake.output['filtered_data'] output_plot=snakemake.output['filtered_plots'] ##### filter regions by peak support support = support.loc[data_counts.index,:] 
support_sum = support.sum(axis=1) region_filter_by_support = (support_sum>=peak_support_threshold) data_filtered=data_counts.loc[region_filter_by_support,:] print('before filtering by support', data_counts.shape[0]) print('after filtering by support', data_filtered.shape[0]) ##### filter regions by mean/variance distribution data_filtered_cpm = cpm(data_filtered, feature_len=1, norm_factors=1, log=False, pseudocount=0.5, libsize_pseudocount=0) assert data_filtered_cpm.columns.equals(data_filtered.columns) assert data_filtered_cpm.index.equals(data_filtered.index) # determine min_group_n if min_group=='': min_group_n=data_filtered.shape[1] else: min_group_n=annot.groupby(min_group)[min_group].count().min() min_count_mask = filter_by_reads( data_filtered, min_group_n=min_group_n, cpm_df=data_filtered_cpm, large_min_n_proportion=large_min_n_proportion, ) print('before filtering', data_filtered.shape[0]) print('after filtering', data_filtered.loc[min_count_mask, :].shape[0]) lcpm = np.log2(data_filtered_cpm) fig, axs = plt.subplots(2, 2, figsize=(10, 10)) plt.subplots_adjust(wspace=0.5, hspace=0.5) ax = sns.scatterplot(lcpm.mean(axis=1), lcpm.std(axis=1), ax=axs[0, 0], rasterized=True) ax.set_xlabel('Mean') ax.set_ylabel('Std') sns.despine() ax.set_title('before filtering\n{} peaks'.format(lcpm.shape[0])) ax = sns.scatterplot(lcpm.loc[min_count_mask, :].mean(axis=1), lcpm.loc[min_count_mask, :].std(axis=1), ax=axs[0, 1], rasterized=True) ax.set_xlabel('Mean') ax.set_ylabel('Std') sns.despine() ax.set_title('after filtering\n{} peaks'.format(min_count_mask.sum())) plot_sequenced_samples(lcpm, n_samples=None, ax=axs[1, 0], xlabel='Log$2$ CPM', ylabel='Density', title='before filtering', samples_as_rows=False) plot_sequenced_samples(lcpm.loc[min_count_mask, :], n_samples=None, ax=axs[1, 1], xlabel='Log$2$ CPM', ylabel='Density', title='after filtering', samples_as_rows=False) fig.suptitle('Region filtering with proportion={} and min group size={}'.format(large_min_n_proportion,min_group_n), fontsize=10) plt.savefig(output_plot, dpi=300) plt.show() plt.close() # apply filter on pre filtered counts filtered_counts = data_filtered.loc[min_count_mask, :] # save filtered counts filtered_counts.to_csv(output_data) ### generate filtered consensus region set and region annotations # load consensus region set consensus_regions = pd.read_csv(snakemake.input['consensus_regions'], sep='\t', index_col=3, header=None,) # load consensus region set annotations annot_regions = pd.read_csv(snakemake.input['regions_annot'], index_col=0, header=0,) # apply filter on consensus region set and annotations consensus_regions = consensus_regions.loc[filtered_counts.index,:] annot_regions = annot_regions.loc[filtered_counts.index,:] # save filtered consensus region set and annotations consensus_regions['ID'] = consensus_regions.index bedtools.BedTool().from_dataframe(consensus_regions).saveas(snakemake.output['filtered_consensus_regions']) annot_regions.to_csv(snakemake.output['filtered_regions_annot']) |
(scripts/filter_regions.py)
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 | import os import pandas as pd import pybedtools as bedtools # configuration sloop_extension=snakemake.params["sloop_extension"] output = snakemake.output[0] blacklist_file= snakemake.params["blacklist_file"] chrom_file = snakemake.params["chrom_file"] results_dir = snakemake.params["results_dir"] # load annotation file and get sample names annotations = pd.read_csv(snakemake.input[0], index_col=0) annotations=annotations[(annotations['pass_qc']>0)] # save filtered annotation file annotations.to_csv(snakemake.output[1]) peakfiles = [os.path.join(results_dir,"{}".format(sample),"peaks", "{}_summits.bed".format(sample)) for sample in annotations.index] output_bed = None if not os.path.exists(os.path.split(output)[0]): os.mkdir(os.path.split(output)[0]) for peakfile in peakfiles: peak_bed = bedtools.BedTool(peakfile) if (blacklist_file is not None): peak_bed=peak_bed.intersect(blacklist_file,v=True, wa=True) peak_bed = peak_bed.slop(g=chrom_file, b=sloop_extension) if (output_bed is None): output_bed = peak_bed else: output_bed = output_bed.cat(peak_bed,force_truncate=True) output_bed.saveas(output) peaks = bedtools.BedTool(output).sort(faidx=chrom_file).to_dataframe(names=['CHR','START','END'],dtype={'START':int,'END':int}) peaks['ID'] = peaks.index.format(formatter=(lambda x: "CONS{:011d}".format(x))) bedtools.BedTool().from_dataframe(peaks).saveas(output) |
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 | import pybedtools import os import time import pandas as pd import matplotlib.pyplot as plt import numpy as np import seaborn as sns # utility functions def bed_to_index(df): import pybedtools if isinstance(df, pybedtools.BedTool): df = df.to_dataframe() elif isinstance(df, str): df = pybedtools.BedTool(df).to_dataframe() cols = ["chrom", "start", "end"] if not all([x in df.columns for x in cols]): raise AttributeError( "DataFrame does not have '{}' columns.".format("', '".join(cols)) ) index = ( df["chrom"].astype(str) + ":" + df["start"].astype(int).astype(str) + "-" + df["end"].astype(int).astype(str) ) return pd.Index(index, name="region") def get_peak_gccontent_length(bed_file, fasta_file) -> pd.DataFrame: import pybedtools # sites = pybedtools.BedTool(bed_file) if bed_file is path sites = pybedtools.BedTool(bed_file, from_string=True) nuc = sites.nucleotide_content(fi=fasta_file).to_dataframe(comment="#")[ ["score", "blockStarts"] ] nuc.columns = ["gc_content", "length"] nuc.index = bed_to_index(sites) return nuc def cqn(matrix, gc_content, lengths): from rpy2.robjects import numpy2ri, pandas2ri, r from rpy2.robjects.packages import importr numpy2ri.activate() pandas2ri.activate() importr("cqn") cqn_out = r.cqn(matrix, x=gc_content, lengths=lengths) y_r = cqn_out[list(cqn_out.names).index("y")] y = pd.DataFrame(np.array(y_r), index=matrix.index, columns=matrix.columns) offset_r = cqn_out[list(cqn_out.names).index("offset")] offset = pd.DataFrame( np.array(offset_r), index=matrix.index, columns=matrix.columns ) return y + offset #### configurations # input filtered_counts=pd.read_csv(snakemake.input[0], index_col=0) consensus_regions=pd.read_csv(snakemake.input[1], sep='\t', index_col=3, header=None) filtered_regions=consensus_regions.loc[filtered_counts.index,:] # parameters genome_fasta = snakemake.params["genome_fasta"] # output output_data=snakemake.output[0] # normalize the filtered count matrices by cqn method with gc-content as covariate # get regions bed_regions=filtered_regions.to_string(header=False, index=False) # get nuc info nuc = get_peak_gccontent_length(bed_regions, fasta_file=genome_fasta) # normalize via CQN method normalized_counts = cqn(filtered_counts, gc_content=nuc["gc_content"], lengths=nuc["length"]) # save normalized matrix normalized_counts.to_csv(output_data) |
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 | import pybedtools import os import time import pandas as pd import matplotlib.pyplot as plt import numpy as np import seaborn as sns # utility functions def calc_norm_factors(counts_df, method): assert method in ['TMM', 'TMMwsp', 'RLE', 'upperquartile'] from rpy2.robjects import numpy2ri, pandas2ri, r numpy2ri.activate() pandas2ri.activate() r.source(os.path.join('workflow','scripts','utils_calcNormFactors.R')) return r.calcNormFactors(counts_df, method=method) def cpm(counts_matrix, feature_len=1, norm_factors=1, log=False, pseudocount=0.5, libsize_pseudocount=1, million=1e6): """ Counts per million normalization If feature_len=1 then you get counts per million Setting pseudocount=0.5, libsize_pseudocount=1 ensures results equivalent to LIMMA-voom """ rpk = counts_matrix.astype(float) / feature_len cpm = million * (rpk + pseudocount) / (rpk.sum(axis=0) * norm_factors + libsize_pseudocount) return np.log2(cpm) if log else cpm #### configurations # input filtered_counts=pd.read_csv(snakemake.input[0], index_col=0) # output output_data=snakemake.output[0] # normalize filtered data by selected standard method norm_method = 'TMM' # determine norm factors norm_factors = calc_norm_factors(filtered_counts, method=norm_method) norm_factors = pd.Series(norm_factors, index=filtered_counts.columns, name='norm_factor') # calculate normalized counts normalized_lcpm = cpm(filtered_counts, norm_factors=norm_factors, log=True, pseudocount=0.5, libsize_pseudocount=1) # save normalized counts normalized_lcpm.to_csv(os.path.join(output_data)) |
8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 | import os import pandas as pd import numpy as np from itertools import compress # visualization import seaborn as sns import matplotlib.pyplot as plt # # utils for enumerating string lists # from collections import defaultdict # from itertools import count # from functools import partial #### configurations # annot=observations x annotations annot=pd.read_csv(snakemake.input[0], index_col=0) # type of dimensionality reduction dimred=snakemake.params["dimred"] # variables=columns of annot to plot variables=snakemake.params['variables'] # label=plot title and file name label = snakemake.params["label"] # results folder results_dir = snakemake.params["results_dir"] # if 3 samples or less UMAP could not be performed if (dimred=="UMAP" and annot.shape[0]<4): from pathlib import Path for variable in variables: Path(os.path.join(results_dir, dimred+"_"+label+"_"+variable+".svg")).touch() import sys sys.exit() # if 2 samples or less PCA only consists of one PC if (dimred=="PCA" and annot.shape[0]<3): from pathlib import Path for variable in variables: Path(os.path.join(results_dir, dimred+"_"+label+"_"+variable+".svg")).touch() import sys sys.exit() # data=observations x features data_df=pd.read_csv(snakemake.input[1], index_col=0) #plot parameters color_dict=None alpha=1 # plot data in 2D # sns.set(color_codes=True) for variable in variables: fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(5, 5)) unique_vals = annot[variable].unique() # if len(unique_vals)==1 or (sum(pd.isna(unique_vals))>0): # print("{} only one value or contains NaNs".format(variable)) # continue unique_vals = unique_vals[pd.notna(unique_vals)] if all([isinstance(i, (str, bool, np.bool_)) for i in unique_vals]): print('discrete variable ', variable) if color_dict==None: # colors = plt.cm.get_cmap("tab20").colors cm = plt.get_cmap('gist_ncar') colors=[cm(1.*i/len(unique_vals)) for i in range(len(unique_vals))] else: colors = [color_dict[x] for x in unique_vals] # plot data by unique value for g, c in zip(unique_vals, colors): c = [c] # to remove the warnings axes.scatter( data_df.loc[annot.index[annot[variable] == g].tolist(), '0',], data_df.loc[annot.index[annot[variable] == g].tolist(), '1',], label=g, marker=".", c=c, alpha=alpha, ) # centroids by mean axes.scatter( data_df.loc[annot.index[annot[variable] == g].tolist(), '0',].mean(), data_df.loc[annot.index[annot[variable] == g].tolist(), '1',].mean(), marker="s", alpha=0.5, label=str(g) + " centroid", c=c, ) axes.text( data_df.loc[annot.index[annot[variable] == g].tolist(), '0',].mean(), data_df.loc[annot.index[annot[variable] == g].tolist(), '1',].mean(), s=g, horizontalalignment="center", verticalalignment="bottom", alpha=0.5, ) # edit and position legend handles, labels = axes.get_legend_handles_labels() legend_idx = ["centroid" in label for label in labels] handles = list(compress(handles, legend_idx)) labels = list(compress(labels, legend_idx)) axes.legend(handles, labels, loc="center left", bbox_to_anchor=(1.05, 0.5)) elif 
all([isinstance(i, (int, float, np.int, np.float, np.int64)) for i in unique_vals]): print('continous variable ', variable) # check if enumerated ie numerical -> probably not needed anymore # if annot[variable].dtype != 'float64' and annot[variable].dtype != 'int64': # label_to_number = defaultdict(partial(next, count(1))) # annot[variable]=[label_to_number[label] for label in annot[variable]] cmap = sns.cubehelix_palette(as_cmap=True) points = axes.scatter( data_df.loc[:, '0',], data_df.loc[:, '1',], marker=".", c=annot[variable], s=50, cmap=cmap, alpha=alpha, ) fig.colorbar(points) else: print("variable type not-detected for {}".format(variable)) continue # show and save figure x_postfix='' y_postfix='' if dimred=='PCA': explained_variance=list(pd.read_csv(os.path.join(results_dir,"PCA_{}_{}_explained_variance.csv".format(snakemake.wildcards["split"],snakemake.wildcards["step"])), index_col=0)['0']) x_postfix=' ({:.1f}%)'.format(100*explained_variance[0]) y_postfix=' ({:.1f}%)'.format(100*explained_variance[1]) axes.set_xlabel(dimred+" 1"+x_postfix) axes.set_ylabel(dimred+" 2 "+y_postfix) plt.title(dimred+" of " + label + " "+variable) plt.show() # save plot fig.savefig( fname=os.path.join(results_dir, dimred+"_"+label+"_"+variable+".svg"), format="svg", dpi=300, bbox_inches="tight", ) |
8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 | import os import pandas as pd import numpy as np # visualization import seaborn as sns import matplotlib.pyplot as plt # mean-variance relationship import statsmodels.formula.api as sm from scipy import stats #### configurations # load data data=pd.read_csv(snakemake.input[0], index_col=0) if not os.path.exists(snakemake.params['results_dir']): os.mkdir(snakemake.params['results_dir']) mean=data.mean(axis=1) std=data.std(axis=1) xaxis='mean' yaxis='std' title='Mean-StD Relationship' lm_result = sm.ols(formula="std ~ mean", data=pd.DataFrame({'std': std, 'mean': mean})).fit() # print(lm_result.summary()) # print(lm_result.rsquared, lm_result.rsquared_adj) p = lm_result.params x_dummy = np.arange(min(mean), max(mean)) spr=stats.spearmanr(mean, std) plt.scatter(mean, std, marker='.', alpha=0.5) plt.plot(x_dummy, p.Intercept + p['mean'] * x_dummy, c='red') plt.xlabel(xaxis) plt.ylabel(yaxis) plt.title("{} R2={:.2f} rho={:.2f} p={:.2f}".format(title,lm_result.rsquared,spr[0],spr[1])) # save plot plt.savefig(fname=snakemake.output[0], format="svg", dpi=300, bbox_inches="tight", ) |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 | import os import pandas as pd # configuration output = snakemake.output[0] results_dir = snakemake.params["results_dir"] # load annotation file and get sample names annotations = pd.read_csv(snakemake.input[0], index_col=0) results=[] for sample in annotations.index: print(sample) result=pd.read_csv(os.path.join(results_dir,"{}".format(sample),"mapped", "{}_quantification_counts.csv".format(sample))) results.append(result) # results = [item for item in results if item is not None] results = pd.concat(results) results.set_index(results.columns[0]).T.to_csv(output) |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 | import os import pandas as pd import pybedtools as bedtools # configuration output = snakemake.output[0] chrom_file = snakemake.params["chrom_file"] results_dir = snakemake.params["results_dir"] sample=snakemake.wildcards["sample"] elements_to_quantify = bedtools.BedTool(snakemake.input[0]) bamfile=snakemake.input[1] print("Processing "+sample) try: result= elements_to_quantify.coverage(b=bamfile,sorted=True,g=chrom_file).to_dataframe( names=["CHR", "START", "END", "ID", sample, "NA1", "NA2", "NA3"], dtype={sample: int}, usecols=['ID', sample], index_col='ID').T result.to_csv(output) except Exception as e: print("Error occured while processing sample "+sample) pd.DataFrame(0,index=elements_to_quantify.index,columns=[sample]).T.to_csv(output) |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 | import os import pandas as pd # configuration output = snakemake.output[0] results_dir = snakemake.params["results_dir"] # load annotation file and get sample names annotations = pd.read_csv(snakemake.input[0], index_col=0) results=[] for sample in annotations.index: print(sample) result=pd.read_csv(os.path.join(results_dir,"{}".format(sample),"peaks", "{}_quantification_support.csv".format(sample))) results.append(result) # results = [item for item in results if item is not None] results = pd.concat(results) results.set_index(results.columns[0]).T.to_csv(output) |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 | import os import pandas as pd import pybedtools as bedtools # configuration output = snakemake.output[0] chrom_file = snakemake.params["chrom_file"] sample=snakemake.wildcards["sample"] consensus_peaks = bedtools.BedTool(snakemake.input[0]) consensus_peaks_df = bedtools.BedTool(snakemake.input[0]).to_dataframe().set_index('name') peakfile=snakemake.input[1] result = pd.DataFrame(0,index=consensus_peaks_df.index,columns=[sample]) try: if (peakfile is not None): sample_peaks = bedtools.BedTool(peakfile) result = consensus_peaks.intersect( sample_peaks, g=chrom_file, wa=True, c=True ).to_dataframe( index_col='name', usecols=[3,4], names=['name',sample] ) except Exception as e: print("Error occured while processing sample "+sample) finally: result.T.to_csv(output) |
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 | import os import pandas as pd import matplotlib.pyplot as plt import numpy as np import seaborn as sns # for robust dispersion norm calculation from statsmodels import robust import warnings warnings.filterwarnings("ignore") #### configurations # input data=pd.read_csv(snakemake.input[0], index_col=0) # parameters HVR_percentage=snakemake.params['HVR_percentage'] top_n = int((data.shape[0]/100)*HVR_percentage) # output output_data=snakemake.output[0] output_plot=snakemake.output[1] ##### determine highly variable regions (HRV) based on normalized dispersion ## inspired by “highly_variable_genes” function from scanpy 1.5.1 (Python 3.6.10) with the “flavor” argument set to “cell_ranger” ## but instead of dispersion=var/mean we use dispersion=std to keep it meaningful for negative mean values ## overlap of region selecrtion in test data was 96.6% mean = data.mean(axis=1) # var = data.var(axis=1) # now actually compute the dispersion mean[mean == 0] = 1e-12 # set entries equal to zero to small value dispersion = data.std(axis=1) #var / mean # all of the following quantities are "per-gene" here df = pd.DataFrame() df['means'] = mean df['dispersions'] = dispersion # bin regions by their mean in 20 bins df['mean_bin'] = pd.cut(df['means'], np.r_[ -np.inf, np.percentile(df['means'], np.arange(10, 105, 5)), np.inf ]) # group regions by mean_bin disp_grouped = df.groupby('mean_bin')['dispersions'] # determine median (robust) dispersion by group disp_median_bin = disp_grouped.median() # the next line raises the warning: "Mean of empty slice" with warnings.catch_warnings(): warnings.simplefilter('ignore') # calculate median absolute deviation (mad; robust measure of variance) per group disp_mad_bin = disp_grouped.apply(robust.mad) # determine "normalized" dispersion via (dispersion-dispersion_group_median)/dispersion_mad df['dispersions_norm'] = (df['dispersions'].values - disp_median_bin[df['mean_bin'].values].values ) / disp_mad_bin[df['mean_bin'].values].values dispersion_norm = df['dispersions_norm'].values.astype('float32') HVR_idx=df.index[df['dispersions_norm'].rank(ascending=False)<=top_n] # HVR_idx=std.index[std.rank(ascending=False)<=top_n] data_HVR = data.loc[HVR_idx, :] data_HVR.to_csv(output_data) # make plots describing the HVR selection fig, axs = plt.subplots(1, 2, figsize=(10, 5)) plt.subplots_adjust(wspace=0.5, hspace=0.5) # plot region variability ax = sns.scatterplot(df['dispersions_norm'].rank(ascending=False), df['dispersions_norm'], ax=axs[0], rasterized=True, ) ax = sns.scatterplot(df.loc[HVR_idx,'dispersions_norm'].rank(ascending=False), df.loc[HVR_idx,'dispersions_norm'], ax=axs[0], rasterized=True, color='r' ) ax.set_xlabel('rank') ax.set_ylabel('normalized dispersion') sns.despine() ax.set_title("Regions ranked by normalized dispersion") # mean-dispersion plot (highlighting selected regions) ax = sns.scatterplot(df.loc[:,'means'], df.loc[:,'dispersions_norm'], ax=axs[1], rasterized=True, ) ax = sns.scatterplot(df.loc[HVR_idx,'means'], df.loc[HVR_idx,'dispersions_norm'], ax=axs[1], rasterized=True, color='r' ) ax.set_xlabel('mean') ax.set_ylabel('normalized dispersion') sns.despine() 
ax.set_title("Mean to normalized dispersion relationship") fig.suptitle('Top {} HVRs by binned normalized dispersion'.format(top_n), fontsize=14) plt.savefig( fname=output_plot, format="svg", dpi=300, bbox_inches="tight", ) plt.show() plt.close() |
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 | import os import pandas as pd import numpy as np #### configurations # input data_counts=pd.read_csv(snakemake.input[0], index_col=0) annotation=pd.read_csv(snakemake.input[1], index_col=0) # parameters project_path = snakemake.params["project_path"] split_by = snakemake.params["split_by"] ##### make output folders, split and save data for split in list(annotation[split_by].unique()): # make split folder if not os.path.exists(os.path.join(project_path,split)): os.mkdir(os.path.join(project_path,split)) # split counts split_counts = data_counts.loc[:, annotation[(annotation[split_by]==split)].index] # save split counts split_counts.to_csv(os.path.join(project_path,split,"{}_counts.csv".format(split))) # split annotatios split_annot = annotation.loc[annotation[split_by]==split, :] # save split annotatios split_annot.to_csv(os.path.join(project_path,split,"{}_annotation.csv".format(split))) |
```
run:
    command_str="touch {}".format(os.path.join(config["project_path"],"{}_successfully_completed.txt".format(config["project_name"])))
    subprocess.run(command_str, shell=True)
```
Support
- Future updates