Ultimate ATAC-seq Data Processing & Analysis Pipeline
From raw (unaligned) BAM files to normalized counts.
A Snakemake implementation of the BSF's ATAC-seq Data Processing Pipeline, extended by downstream processing and unsupervised analysis steps using Bash, Python and R. Reproducibility and scalability are ensured by using Snakemake, conda and Singularity.
If you use this workflow in a publication, please don't forget to give credit to the authors by citing it using this DOI: 10.5281/zenodo.6323634.
Table of contents
Authors
Software
This project wouldn't be possible without the following software:
| Software | Reference (DOI) |
| --- | --- |
| bedtools | https://doi.org/10.1093/bioinformatics/btq033 |
| Bowtie2 | https://doi.org/10.1038/nmeth.1923 |
| CQN | https://doi.org/10.1093/biostatistics/kxr054 |
| deeptools | https://doi.org/10.1093/nar/gkw257 |
| ENCODE | https://doi.org/10.1038/s41598-019-45839-z |
| fastp | https://doi.org/10.1093/bioinformatics/bty560 |
| HOMER | https://doi.org/10.1016/j.molcel.2010.05.004 |
| MACS2 | https://doi.org/10.1186/gb-2008-9-9-r137 |
| matplotlib | https://doi.org/10.1109/MCSE.2007.55 |
| MultiQC | https://doi.org/10.1093/bioinformatics/btw354 |
| pybedtools | https://doi.org/10.1093/bioinformatics/btr539 |
| pandas | https://doi.org/10.5281/zenodo.3509134 |
| samblaster | https://doi.org/10.1093/bioinformatics/btu314 |
| samtools | https://doi.org/10.1093/bioinformatics/btp352 |
| SCANPY | https://doi.org/10.1186/s13059-017-1382-0 |
| scikit-learn | http://jmlr.org/papers/v12/pedregosa11a.html |
| seaborn | https://doi.org/10.21105/joss.03021 |
| Snakemake | https://doi.org/10.12688/f1000research.29032.2 |
| TMM | https://doi.org/10.1186/gb-2010-11-3-r25 |
| UMAP | https://doi.org/10.21105/joss.00861 |
| UROPA | https://doi.org/10.1038/s41598-017-02464-y |
Methods
This is a template for the Methods section of a scientific publication and is intended to serve as a starting point. Only retain paragraphs relevant to your analysis. References [ref] to the respective publications are curated in the software table above. Versions (ver) have to be read out from the respective conda environment specifications (.yaml files) or after execution. Parameters that have to be adapted depending on the data or workflow configuration are denoted in square brackets, e.g. [X].
Processing. Sequencing adapters were removed using the software fastp (ver) [ref]. Bowtie2 (ver) [ref] was used with the “--very-sensitive” parameter to align the short reads (representing locations of transposition events) to the [GRCh38 (hg38)/GRCm38 (mm10)] assembly of the [human/mouse] genome. PCR duplicates were marked using samblaster (ver) [ref]. Aligned BAM files were then sorted, filtered using ENCODE blacklisted regions [ref], and indexed using samtools (ver) [ref]. To detect open chromatin regions, peak calling was performed on each sample using MACS2 (ver) [ref] with the “--nomodel”, “--keep-dup auto” and “--extsize 147” options. The HOMER (ver) [ref] function findMotifsGenome was used for motif enrichment analysis of the detected open chromatin regions.
Quality control metrics were aggregated and reported using MultiQC (ver) [ref]; [X] sample(s) had to be removed.
Quantification. A consensus region set, comprising of [X] genomic regions, was generated, by merging the identified peak summits, extended by 250 bp on both sides using the slop function from bedtools (ver) [ref] and pybedtools (ver) [ref], across all samples while again discarding peaks overlapping blacklisted features as defined by the ENCODE project [ref]. Consensus regions were annotated using annotatePeaks function from Homer (ver) [ref].
Additionally, we annotated all consensus regions using UROPA (ver) [ref] and genomic features from the [GENCODE vX] basic gene annotation as: “TSS proximal” if the region’s midpoint was within [X] bp upstream or [X] bp downstream from a TSS, or if the region overlapped with a TSS; “gene body” if the region overlapped with a gene; “distal” if the region’s midpoint was within [X] bp of a TSS; and “intergenic” otherwise. For each region, only the closest feature was considered, and the annotations took precedence in the following order: TSS proximal, gene body, distal, and intergenic.
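The precedence rule can be illustrated with a small, purely didactic helper function (this is not the actual UROPA configuration, which the workflow generates from templates in config/uropa/; all cutoffs below are placeholders):

```python
def characterize_region(dist_to_tss, overlaps_tss, overlaps_gene,
                        proximal_up=2000, proximal_dn=2000, distal=10000):
    """Assign one annotation per region, in order of precedence (placeholder cutoffs in bp).

    dist_to_tss: signed distance of the region midpoint to the closest TSS (negative = upstream).
    """
    # TSS proximal: midpoint within the proximal window, or the region overlaps a TSS
    if overlaps_tss or -proximal_up <= dist_to_tss <= proximal_dn:
        return "TSS proximal"
    # gene body: the region overlaps a gene
    if overlaps_gene:
        return "gene body"
    # distal: midpoint within the distal window of a TSS
    if abs(dist_to_tss) <= distal:
        return "distal"
    return "intergenic"

print(characterize_region(dist_to_tss=-1500, overlaps_tss=False, overlaps_gene=True))  # TSS proximal
```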
The consensus region set was used to quantify chromatin accessibility in each sample by summing the number of reads overlapping each consensus region. Consensus region set generation and sample-wise quantification of accessibility were performed using bedtools (ver) [ref] and pybedtools (ver) [ref].
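Per sample, this amounts to a bedtools coverage call over the consensus regions, roughly as in scripts/quantify_counts_sample.py (see the Code Snippets below); the paths and the sample name are placeholders:

```python
import pybedtools

consensus = pybedtools.BedTool("consensus_regions.bed")  # placeholder path
# count reads from the filtered BAM overlapping each consensus region
counts = consensus.coverage(b="sample1.filtered.bam", sorted=True, g="hg38.chrom.sizes").to_dataframe(
    names=["CHR", "START", "END", "ID", "sample1", "NA1", "NA2", "NA3"],
    usecols=["ID", "sample1"], index_col="ID",
)
counts.to_csv("sample1_quantification_counts.csv")
```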
Optional. We split the data into subsets according to the annotation [X] and performed all downstream analyses for each subset separately.
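Conceptually this is a column-wise subset of the count matrix per annotation level, similar to scripts/split_data.py (see the Code Snippets below); file and column names are placeholders:

```python
import pandas as pd

counts = pd.read_csv("counts.csv", index_col=0)          # regions x samples, placeholder path
annotation = pd.read_csv("annotation.csv", index_col=0)  # samples x metadata, placeholder path
split_by = "cell_type"                                   # placeholder annotation column

for level in annotation[split_by].unique():
    # samples belonging to this annotation level
    samples = annotation.index[annotation[split_by] == level]
    counts.loc[:, samples].to_csv(f"{level}_counts.csv")
```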
Downstream Analysis. For all downstream analyses, we filtered the [X] consensus regions to [X] regions that had reads in at least [X] samples, were covered by at least [X] reads (normalized by median library size) in at least [X] proportion of the samples of the smallest subsample group with potential signal (determined by the annotation [X]), and were covered by at least [X] total reads across all samples.
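A simplified sketch of this filter, condensing the logic of scripts/filter_regions.py (shown in the Code Snippets below); all thresholds and the input path are placeholders:

```python
import pandas as pd

counts = pd.read_csv("counts.csv", index_col=0)       # regions x samples, placeholder path
min_count, min_total_count, min_group_n = 10, 15, 3   # placeholder thresholds

# CPM cutoff derived from the median library size
median_lib_size = counts.sum(axis=0).median()
cpm = 1e6 * counts / counts.sum(axis=0)
cpm_cutoff = min_count / median_lib_size * 1e6

keep_cpm = (cpm >= cpm_cutoff).sum(axis=1) >= min_group_n  # enough samples above the CPM cutoff
keep_total = counts.sum(axis=1) >= min_total_count         # enough reads across all samples
filtered_counts = counts.loc[keep_cpm & keep_total]
```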
Next, we determined highly variable regions (HVRs) using an adaptation of the SCANPY (ver) [ref] function highly_variable_genes with flavor 'cell_ranger'. Instead of dispersion = variance/mean we use dispersion = standard deviation, because in ATAC-seq data the number of regions can be very large, so log(CPM) values and log(CPM) means can become negative, which would yield meaningless negative dispersion values (this also avoids division-by-zero problems). We therefore only employ the binning-by-mean strategy for stabilization, not a coefficient-of-variation strategy.
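A condensed sketch of that selection, adapted from scripts/select_HVR.py (shown in the Code Snippets below); the input path and the HVR percentage are placeholders:

```python
import numpy as np
import pandas as pd
from statsmodels import robust

data = pd.read_csv("normalized_counts.csv", index_col=0)  # regions x samples, placeholder path
top_n = int(data.shape[0] * 0.10)                         # e.g. top 10% HVRs (placeholder)

df = pd.DataFrame({"means": data.mean(axis=1),
                   "dispersions": data.std(axis=1)})      # standard deviation instead of var/mean

# bin regions by their mean and normalize the dispersion within each bin (median / MAD)
df["mean_bin"] = pd.cut(df["means"],
                        np.r_[-np.inf, np.percentile(df["means"], np.arange(10, 105, 5)), np.inf])
grouped = df.groupby("mean_bin")["dispersions"]
df["dispersions_norm"] = (df["dispersions"] - grouped.transform("median")) / grouped.transform(robust.mad)

hvr_idx = df.index[df["dispersions_norm"].rank(ascending=False) <= top_n]
data_hvr = data.loc[hvr_idx]
```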
Conditional quantile normalization (CQN) (ver) [ref] was performed on the filtered accessibility matrix, using the region length and GC content of the consensus regions as conditions, both quantified using bedtools (ver) [ref] and pybedtools (ver) [ref].
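The normalization itself is delegated to the R/Bioconductor package cqn via rpy2, essentially as in scripts/normalize_CQN.py (shown in the Code Snippets below); this requires R with cqn installed, and the input files (counts plus per-region GC content and length, e.g. from bedtools nuc) are placeholders:

```python
import numpy as np
import pandas as pd
from rpy2.robjects import numpy2ri, pandas2ri, r
from rpy2.robjects.packages import importr

numpy2ri.activate()
pandas2ri.activate()
importr("cqn")

counts = pd.read_csv("filtered_counts.csv", index_col=0)  # regions x samples, placeholder path
nuc = pd.read_csv("region_gc_length.csv", index_col=0)    # per-region gc_content and length, placeholder

cqn_out = r.cqn(counts, x=nuc["gc_content"], lengths=nuc["length"])
y = pd.DataFrame(np.array(cqn_out[list(cqn_out.names).index("y")]),
                 index=counts.index, columns=counts.columns)
offset = pd.DataFrame(np.array(cqn_out[list(cqn_out.names).index("offset")]),
                      index=counts.index, columns=counts.columns)
normalized = y + offset  # CQN-normalized log2 values
```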
Trimmed mean of M-values (TMM) normalization (ver) [ref] was performed on the filtered accessibility matrix.
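In the workflow, the TMM scaling factors are computed with edgeR's calcNormFactors (called from R in scripts/normalize_TMM.py) and then plugged into a log-CPM transformation; the CPM part alone looks roughly like this (placeholder path, and unit factors standing in for real TMM factors):

```python
import numpy as np
import pandas as pd

def log_cpm(counts, norm_factors=1.0, pseudocount=0.5, libsize_pseudocount=1.0):
    """log2 counts-per-million with optional normalization factors (cf. the pipeline's cpm() helper)."""
    counts = counts.astype(float)
    lib_size = counts.sum(axis=0) * norm_factors + libsize_pseudocount
    return np.log2(1e6 * (counts + pseudocount) / lib_size)

counts = pd.read_csv("filtered_counts.csv", index_col=0)  # regions x samples, placeholder path
tmm_factors = pd.Series(1.0, index=counts.columns)        # placeholder; the pipeline uses edgeR's TMM factors here
normalized_lcpm = log_cpm(counts, norm_factors=tmm_factors)
```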
Unsupervised Analysis & Visualization. We applied both linear and non-linear unsupervised dimensionality reduction methods to the normalized data to visualize emerging sample-wise patterns in two dimensions. We used Principal Component Analysis (PCA) [ref] from scikit-learn (ver) [ref] as the linear approach and Uniform Manifold Approximation and Projection (UMAP) from umap-learn (ver) [ref] with the correlation metric as the non-linear approach. For visualization, matplotlib (ver) [ref] was used.
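A minimal sketch of both embeddings, mirroring scripts/do_PCA.py and scripts/do_UMAP.py (shown in the Code Snippets below); the input path is a placeholder:

```python
import pandas as pd
import umap
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = pd.read_csv("normalized_counts.csv", index_col=0).T  # samples x regions, placeholder path

# linear: keep as many PCs as needed to explain 99.9% of the variance
pca = PCA(0.999, random_state=42)
pcs = pd.DataFrame(pca.fit_transform(StandardScaler().fit_transform(data)), index=data.index)

# non-linear: 2D UMAP with the correlation metric (requires at least 4 samples)
embedding = umap.UMAP(n_components=2, random_state=42, metric="correlation").fit_transform(data)
umap_df = pd.DataFrame(embedding, index=data.index, columns=["UMAP1", "UMAP2"])
```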
The processing and analysis described here was performed using a publicly available Snakemake (ver) [ref] workflow [10.5281/zenodo.6323634].
Features
- Processing
  - alignment of both single-end and paired-end reads in raw/unaligned BAM format (bowtie2)
  - peak calling (macs2)
  - peak annotation and motif analysis (homer)
  - MultiQC report generation (multiqc)
- Quantification
  - consensus region set generation
  - consensus region set annotation (uropa using the regulatory build and GENCODE as references, and homer)
  - read count and peak support quantification of the consensus region set across samples, yielding a count and a support matrix with dimensions regions x samples
- optional: splitting of the data into multiple subsets (e.g. by cell type or condition)
- Downstream Analysis (performed on the whole data set and on each split separately)
  - Visualization after each analysis step
    - step-specific plots (e.g. region filtering, HVR selection)
    - dimensionality reduction by PCA & UMAP and plotting of provided metadata
    - mean-variance-relationship plot
- Snakemake report generation for workflow statistics, documentation, reproducibility and result presentation
Usage
These steps are the recommended usage for the first run of the workflow:
1. Perform only the processing by setting the downstream_analysis flag to 0 in the project configuration.
2. Use the generated MultiQC report (project_path/atacseq_report/multiqc_report.html) to judge the quality of your samples.
3. Fill out the mandatory quality control column (pass_qc) in the sample metadata configuration file (you can even use some of the quality metrics for plotting, e.g. like I did in the example files with 'FRiP').
4. Finally, execute the remaining downstream analysis steps by setting the respective flag to 1; only the samples that passed quality control will be included (see the sketch below).
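For orientation, the downstream rules select samples roughly like this (cf. scripts/get_consensus_regions.py in the Code Snippets below; the annotation path is a placeholder):

```python
import pandas as pd

annotations = pd.read_csv("sample_annotation.csv", index_col=0)  # placeholder path
# only samples with pass_qc > 0 enter the downstream analysis
annotations = annotations[annotations["pass_qc"] > 0]
print(annotations.index.tolist())
```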
This workflow is written with Snakemake and its usage is described in the Snakemake Workflow Catalog.
Installation
This should take less than 10 minutes.
1. Install Snakemake, which requires conda & mamba, according to the documentation.
2. Clone/download this repository (e.g. git clone https://github.com/epigen/atacseq_pipeline.git).

All software/package dependencies are installed and managed automatically via Snakemake and conda (and, optionally, Singularity) upon the first run of the workflow.
Configuration
Detailed specifications can be found in ./config/README.md.
Execution
1. Change working directory & activate conda environment
Always execute from within the top level of the pipeline directory (i.e. atacseq_pipeline/). Snakemake commands only work from within the snakemake conda environment.
cd atacseq_pipeline
conda activate snakemake
2. Execute a dry-run
command for a dry-run with option -n (-p makes Snakemake print the resulting shell command for illustration)
snakemake -p -n
3. Execute workflow local or on a cluster
3a. Local execution
command for execution with one core
snakemake -p -j1 --use-conda
3b. Cluster execution
command for vanilla cluster execution on cluster engines that support shell scripts and have access to a common filesystem (e.g. the Sun Grid Engine); more info in the Snakemake cluster execution documentation
snakemake -p --use-conda --cluster qsub -j 32
command for cluster execution using --profile, which submits every task as a separate job with dependencies
snakemake -p --use-conda --profile config/slurm.cemm
the profile for CeMM's SLURM environment is provided in the config/ directory; of note:
- the number of jobs in slurm.cemm/config.yaml should be set as high as necessary, because arrayed job submission does not work (yet) and the scheduler (e.g. SLURM) should take care of the prioritization
- jobs whose dependencies can never be fulfilled are automatically removed from the queue
If you are using another setup, get your cluster execution profile from the Snakemake-Profiles project.
X. Singularity execution (not tested)
Singularity has to be installed (system-wide by root) and available/loaded (e.g. module load singularity). The pipeline automatically loads the correct Singularity image from [Dockerhub](https://hub.docker.com/r/ /atacseq_pipeline)
command for execution with Singularity; just add the flag --use-singularity and use --singularity-args to provide all the directories the pipeline needs access to (in the example it is configured for the three relevant partitions at CeMM)
snakemake -p --use-conda --use-singularity --singularity-args "--bind /nobackup:/nobackup --bind /research:/research --bind /home:/home"
Module
Use as a module in another Snakemake workflow (soon)
Report
command for report generation (this can take a few minutes, depending on the size of the dataset)
snakemake --report /absolute/path/to/report.zip
The command creates a self-contained, HTML-based report in a zip archive containing the following sections:
- Workflow: interactive rulegraph to recapitulate individual steps, used software and concrete code (reproducibility)
- Statistics: duration and timing of individual steps
- Configuration: the used pipeline configuration (accountability)
- Results
  - Configuration: all 3 used configuration files (project, samples, metadata)
  - QC reports: link to the MultiQC report
  - all: each step of the downstream analysis has its own searchable section with step-specific and unsupervised analysis plots
  - optional: one additional section per split, containing the respective downstream analysis results
Results
Project directory structure:
- all: downstream analysis results of the whole data set, including the consensus region set and region annotation
- atacseq_hub: genome browser track files (.bigWig) for each sample
- atacseq_report: statistics and metrics from the processing part as input for the MultiQC report
- atacseq_results: one directory per sample with all the processing and quantification results
- projectName_report.zip: self-contained HTML Snakemake report
- split1 (optional): downstream analysis results of subset 1 of the whole data
- split2 (optional): downstream analysis results of subset 2 of the whole data
Examples
We provide configuration files for two example datasets (mm10 & hg38). Additionally, the report zip archive of the hg38 test example is provided to showcase the pipeline results.
Resources
To ensure reproducibility of results and to make the pipeline easy to use, we provide all required reference data for the analysis of ATAC-seq samples for both supported genomes on Zenodo:
- resources for the GRCm38 (mm10) assembly of the mouse genome
- resources for the GRCh38 (hg38) assembly of the human genome
Tips
Here are some tips for troubleshooting & FAQs:
- always perform a dry-run first with option -n
- if unsure why a certain rule will be executed, use option --reason in the dry-run; this gives the reason for the execution of each rule
- when there are 3 or fewer samples, all UMAP data and plots will be empty
- when there are 2 or fewer samples, all PCA data and plots will be empty
- in case the pipeline crashes, you manually canceled your jobs, or Snakemake tries to "resume.. resubmit.." jobs, remove the .snakemake/incomplete directory
- if you submit a lot of jobs, e.g. via SLURM (>500), this might take some time (i.e. ~1 s per job submission)
- two comments on peak support quantification (see the sketch below):
  - even if the peak support of a region in a certain sample is 0, this does not mean that no reads were counted in the count matrix; it only states that no peak was called there
  - the peak support can be >1 for certain samples in case a consensus region spans more than one region (with peaks) in the respective sample
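To check this on your own data, you can compare the two matrices directly; a small sketch with placeholder file and sample names:

```python
import pandas as pd

counts = pd.read_csv("counts_matrix.csv", index_col=0)    # regions x samples, placeholder path
support = pd.read_csv("support_matrix.csv", index_col=0)  # regions x samples, placeholder path

# regions without a called peak in 'sample1' can still have non-zero read counts
no_peak = support["sample1"] == 0
print(counts.loc[no_peak, "sample1"].describe())

# support > 1: the consensus region spans more than one peak region of that sample
print((support > 1).sum().sum(), "region/sample pairs with support > 1")
```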
- command for generating the workflow's rulegraph (provided in workflow/dags)
  snakemake --rulegraph --forceall | dot -Tsvg > workflow/dags/atacseq_pipeline_rulegraph.svg
- command for generating the directed acyclic graph (DAG) of all jobs with the current configuration (provided for both test examples in workflow/dags)
  snakemake --dag --forceall | dot -Tsvg > workflow/dags/all_DAG.svg
Links
Publications
The following publications successfully used this module for their analyses.
Code Snippets
```
shell:
    """
    mkdir -p {params.results_dir};
    mkdir -p {params.sample_dir};
    mkdir -p {params.bam_dir};

    RG="--rg-id {params.sample_name} --rg SM:{params.sample_name} --rg PL:illumina --rg CN:CeMM_BSF"

    for i in {params.raw_bams}; do samtools fastq $i 2>> "{params.bam_dir}/{params.sample_name}.samtools.log" ; done | \
    fastp {params.adapter_sequence} {params.adapter_fasta} --stdin {params.interleaved_in} --stdout --html "{params.bam_dir}/{params.sample_name}.fastp.html" --json "{params.bam_dir}/{params.sample_name}.fastp.json" 2> "{params.bam_dir}/{params.sample_name}.fastp.log" | \
    bowtie2 $RG --very-sensitive --no-discordant -p {threads} --maxins 2000 -x {params.bowtie2_index} --met-file "{params.bam_dir}/{params.sample_name}.bowtie2.met" {params.interleaved} - 2> "{params.bam_dir}/{params.sample_name}.txt" | \
    samblaster {params.add_mate_tags} 2> "{params.bam_dir}/{params.sample_name}.samblaster.log" | \
    samtools sort -o "{params.bam_dir}/{params.sample_name}.bam" - 2>> "{params.bam_dir}/{params.sample_name}.samtools.log";

    samtools index "{params.bam_dir}/{params.sample_name}.bam" 2>> "{params.bam_dir}/{params.sample_name}.samtools.log";

    samtools idxstats "{params.bam_dir}/{params.sample_name}.bam" | awk '{{ sum += $3 + $4; if($1 == "{params.chrM}") {{ mito_count = $3; }}}}END{{ print "mitochondrial_fraction\t"mito_count/sum }}' > "{params.sample_dir}/{params.sample_name}.stats.tsv";

    samtools flagstat "{params.bam_dir}/{params.sample_name}.bam" > "{params.bam_dir}/{params.sample_name}.samtools_flagstat.log";

    samtools view {params.filtering} -o "{params.bam_dir}/{params.sample_name}.filtered.bam" "{params.bam_dir}/{params.sample_name}.bam";
    samtools index "{params.bam_dir}/{params.sample_name}.filtered.bam";
    """
```
```
script:
    "../scripts/do_UMAP.py"

script:
    "../scripts/do_PCA.py"

script:
    "../scripts/plot_dimred.py"
```
```
shell:
    """
    mkdir -p {params.hub_dir};

    bamCoverage --bam {input.bam} \
        -p max --binSize 10 --normalizeUsing RPGC \
        --effectiveGenomeSize {params.genome_size} --extendReads 175 \
        -o "{params.hub_dir}/{params.sample_name}.bigWig" > "{params.hub_dir}/{params.sample_name}.bigWig.log" 2>&1;

    echo "base,count" > {output.tss_hist};
    bedtools slop -b {params.tss_slop} -i {params.unique_tss} -g {params.chromosome_sizes} | \
    bedtools coverage -a - -b {input.bam} -d -sorted | \
    awk '{{if($6 == "+"){{ counts[$7] += $8;}} else counts[{params.double_slop} - $7 + 1] += $8;}} END {{ for(pos in counts) {{ if(pos < {params.noise_lower} || pos > {params.noise_upper}) {{ noise += counts[pos] }} }}; average_noise = noise /(2 * {params.noise_lower}); for(pos in counts) {{print pos-2000-1","(counts[pos]/average_noise) }} }}' | \
    sort -t "," -k1,1n >> {output.tss_hist} ;
    """
```
```
shell:
    """
    multiqc -fc {params.project_config} {params.project_path}
    """
```
```
shell:
    """
    mkdir -p {params.results_dir};
    mkdir -p {params.sample_dir};
    mkdir -p {params.peaks_dir};
    mkdir -p {params.homer_dir};

    export PATH="{params.homer_bin}:$PATH";

    macs2 callpeak -t {input.bam} {params.formating} \
        --nomodel --keep-dup auto --extsize 147 -g {params.genome_size} \
        -n {params.sample_name} \
        --outdir {params.peaks_dir} > "{params.peaks_dir}/{params.sample_name}.macs2.log" 2>&1;

    resources/homer/bin/annotatePeaks.pl {params.peaks_dir}/{params.sample_name}_peaks.narrowPeak {params.genome} \
        > {params.peaks_dir}/{params.sample_name}_peaks.narrowPeak.annotated.tsv \
        2> {params.peaks_dir}/{params.sample_name}_peaks.narrowPeak.annotated.tsv.log;

    resources/homer/bin/findMotifsGenome.pl "{params.peaks_dir}/{params.sample_name}_summits.bed" {params.genome} {params.homer_dir} -size 200 -mask \
        > "{params.homer_dir}/{params.sample_name}.homer.log" 2>&1

    cat {params.peaks_dir}/{params.sample_name}_peaks.narrowPeak | wc -l | \
        awk '{{print "peaks\t" $1}}' >> "{params.sample_dir}/{params.sample_name}.stats.tsv"

    TOTAL_READS=`samtools idxstats {input.bam} | awk '{{sum += $3}}END{{print sum}}'`;

    samtools view -c -L "{params.peaks_dir}/{params.sample_name}_peaks.narrowPeak" {input.bam} | \
        awk -v total=$TOTAL_READS '{{print "frip\t" $1/total}}' >> "{params.sample_dir}/{params.sample_name}.stats.tsv";

    samtools view -c -L {params.regulatory_regions} {input.bam} | \
        awk -v total=$TOTAL_READS '{{print "regulatory_fraction\t" $1/total}}' >> "{params.sample_dir}/{params.sample_name}.stats.tsv";
    """
```
```
script:
    "../scripts/split_data.py"

script:
    "../scripts/filter_regions.py"

script:
    "../scripts/normalize_TMM.py"

script:
    "../scripts/normalize_CQN.py"

script:
    "../scripts/select_HVR.py"

script:
    "../scripts/plot_mean_var.py"

script:
    "../scripts/get_consensus_regions.py"

script:
    "../scripts/quantify_support_sample.py"

script:
    "../scripts/quantify_support_aggregate.py"

script:
    "../scripts/quantify_counts_sample.py"

script:
    "../scripts/quantify_counts_aggregate.py"
```
20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 | run: ### generate gencode config with open( os.path.join("config","uropa","gencode_config_TEMPLATE_V4.txt") ) as f: gencode_template=Template(f.read()) gencode_config=gencode_template.substitute({ 'TSS_flanking':config['tss_size'], 'TSS_proximal_upstream':config['proximal_size_up'], 'TSS_proximal_downstream':config['proximal_size_dn'], 'distal_distance':config['distal_size'], 'gtf_file':'"{}"'.format(config["gencode_gtf"]), 'bed_file':'"{}"'.format(input.consensus_regions) }) gencode_config_file=output.gencode_config with open(gencode_config_file,'w') as out: out.write(gencode_config) ### generate reg config file with open( os.path.join("config","uropa","regulatory_config_TEMPLATE.txt") ) as f: reg_template=Template(f.read()) reg_config=reg_template.substitute({ 'gtf_file':'"{}"'.format(config["regulatory_build_gtf"]), 'bed_file':'"{}"'.format(input.consensus_regions) }) reg_config_file=output.reg_config with open(reg_config_file,'w') as out: out.write(reg_config) |
```
shell:
    """
    uropa -p {params.results_dir}/gencode -i {input.gencode_config} -t {threads} -l {params.results_dir}/uropa.gencode.log
    uropa -p {params.results_dir}/reg -i {input.reg_config} -t {threads} -l {params.results_dir}/uropa.reg.log
    """
```
```
shell:
    """
    export PATH="{params.homer_bin}:$PATH";

    resources/homer/bin/annotatePeaks.pl {input.consensus_regions} {config[genome]} \
        > {output.homer_annotations} \
        2> {output.homer_annotations_log};
    """
```
125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 | run: # load and format uropa gencode results gencode_characterization=pd.read_csv(input.gencode_results,sep='\t') gencode_characterization = gencode_characterization.set_index("peak_id") gencode_characterization.loc[gencode_characterization['feature']=='transcript','feat_type']='transcript:'+gencode_characterization.loc[gencode_characterization['feature']=='transcript','transcript_type'] gencode_characterization.loc[gencode_characterization['feature']=='gene','feat_type']='gene:'+gencode_characterization.loc[gencode_characterization['feature']=='gene','gene_type'] gencode_characterization['length']=gencode_characterization['peak_end']-gencode_characterization['peak_start'] gencode_characterization=gencode_characterization[['peak_chr','peak_start','peak_end','length','feat_anchor','distance','relative_location','feat_type','gene_id','gene_name','name']] gencode_characterization.columns=['chr','start','end','length','feat_anchor','distance','location','feat_type','gene_id','gene_name','characterization'] gencode_characterization.loc[gencode_characterization['characterization'].isna(),'characterization']='NONE' gencode_characterization=gencode_characterization.add_prefix('gencode_') # load and format uropa regulatory build results reg_characterization=pd.read_csv(input.reg_results,sep='\t') reg_characterization = reg_characterization.set_index('peak_id')[['feature','ID']] reg_characterization.columns=['reg_feature','reg_feature_id'] reg_characterization.loc[reg_characterization['reg_feature'].isna(),'reg_feature']='reg_NONE' reg_characterization=reg_characterization.add_prefix('regulatoryBuild_') |
8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 | import os import pandas as pd import numpy as np from itertools import compress # dimensionality reduction from sklearn.preprocessing import StandardScaler from sklearn.decomposition import PCA #### configurations # data=observations x features data=pd.read_csv(snakemake.input[0], index_col=0).T output_data=snakemake.output[0] output_expl_var=snakemake.output[1] if not os.path.exists(snakemake.params['results_dir']): os.mkdir(snakemake.params['results_dir']) #### Dimensionality reduction via unsupervised PCA so that 99.9% of variance is preserved for visualization purpose pca_obj = PCA(0.999, random_state=42, ) data_pca=pca_obj.fit_transform(StandardScaler().fit_transform(data)) data_df = pd.DataFrame(data_pca, index=data.index,) data_df = data_df.rename_axis(("sample_name")) data_df.to_csv(output_data) pd.DataFrame(pca_obj.explained_variance_ratio_).to_csv(output_expl_var) |
8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 | import os import pandas as pd import numpy as np from itertools import compress # dimensionality reduction import umap #### configurations # data=observations x features data=pd.read_csv(snakemake.input[0], index_col=0).T output=snakemake.output[0] if not os.path.exists(snakemake.params['results_dir']): os.mkdir(snakemake.params['results_dir']) # if 3 samples or less UMAP can not be performed if data.shape[0]<4: from pathlib import Path Path(output).touch() import sys sys.exit() #### Dimensionality reduction via unsupervised UMAP for visualization purpose data_umap = umap.UMAP( n_components=2, random_state=42, metric="correlation", ).fit(data) data_df = pd.DataFrame(data_umap.embedding_, index=data.index,) data_df = data_df.rename_axis(("sample_name")) data_df.to_csv(output) |
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 | import pybedtools as bedtools import os import time import pandas as pd import matplotlib.pyplot as plt import numpy as np import seaborn as sns import warnings warnings.filterwarnings("ignore") # utility functions def cpm(counts_matrix, feature_len=1, norm_factors=1, log=False, pseudocount=0.5, libsize_pseudocount=1, million=1e6): """ Counts per million normalization If feature_len=1 then you get counts per million Setting pseudocount=0.5, libsize_pseudocount=1 ensures results equivalent to LIMMA-voom """ rpk = counts_matrix.astype(float) / feature_len cpm = million * (rpk + pseudocount) / (rpk.sum(axis=0) * norm_factors + libsize_pseudocount) return np.log2(cpm) if log else cpm def filter_by_reads(counts_df, min_group_n, cpm_df=None, min_count=10, min_total_count=15, large_min_n=10, large_min_n_proportion=0.7, verbose=True): if min_group_n > large_min_n: # edgeR min_n = large_min_n + (min_group_n - large_min_n) * large_min_n_proportion else: min_n = min_group_n _million, _tolerance = 1e6, 1e-14 median_lib_size = counts_df.sum(axis=0).median() cpm_cutoff = min_count / median_lib_size * _million keep_cpm = (cpm_df >= cpm_cutoff).sum(axis=1) >= min_n - _tolerance keep_min_total_count = counts_df.sum(axis=1) >= min_total_count - _tolerance if verbose: print('min_n', min_n) print('median_lib_size', median_lib_size) print('cpm_cutoff', cpm_cutoff) print('remove based on cpm_cutoff', (~keep_cpm).sum()) print('additionally remove based on keep_min_total_count', (~keep_min_total_count[keep_cpm]).sum()) return keep_cpm & keep_min_total_count def plot_sequenced_samples(df, n_samples=30, ax=None, xlabel='Read count', ylabel='Density', title=None, xlim=None, ylim=None, kde=True, hist=False, legend=False, samples_as_rows=False): if samples_as_rows: df = df.T for i in np.random.choice(list(range(df.shape[1])), size=n_samples, replace=False) if n_samples is not None else range(df.shape[1]): sample = df.iloc[:, i] ax = sns.distplot(sample, ax=ax, kde=kde, hist=hist, label=df.columns[i] if legend else None) ax.set_xlabel(xlabel) ax.set_ylabel(ylabel) if xlim is not None: ax.set_xlim(xlim) if ylim is not None: ax.set_ylim(ylim) if title is None: ax.set_title('{} samples{}{}'.format(n_samples, ', xlim: {}'.format(xlim) if xlim is not None else '', ', ylim: {}'.format(ylim) if ylim is not None else '')) else: ax.set_title(title) if legend: ax.legend(bbox_to_anchor=(1, 1)) sns.despine() return ax #### configurations # input data_counts=pd.read_csv(snakemake.input['counts'], index_col=0) support=pd.read_csv(snakemake.input['support'], index_col=0) annot=pd.read_csv(snakemake.input['annot'], index_col=0) # parameters peak_support_threshold = snakemake.params["peak_support_threshold"] large_min_n_proportion = snakemake.params["proportion"] min_group = snakemake.params["min_group"] # output output_data=snakemake.output['filtered_data'] output_plot=snakemake.output['filtered_plots'] ##### filter regions by peak support support = support.loc[data_counts.index,:] 
support_sum = support.sum(axis=1) region_filter_by_support = (support_sum>=peak_support_threshold) data_filtered=data_counts.loc[region_filter_by_support,:] print('before filtering by support', data_counts.shape[0]) print('after filtering by support', data_filtered.shape[0]) ##### filter regions by mean/variance distribution data_filtered_cpm = cpm(data_filtered, feature_len=1, norm_factors=1, log=False, pseudocount=0.5, libsize_pseudocount=0) assert data_filtered_cpm.columns.equals(data_filtered.columns) assert data_filtered_cpm.index.equals(data_filtered.index) # determine min_group_n if min_group=='': min_group_n=data_filtered.shape[1] else: min_group_n=annot.groupby(min_group)[min_group].count().min() min_count_mask = filter_by_reads( data_filtered, min_group_n=min_group_n, cpm_df=data_filtered_cpm, large_min_n_proportion=large_min_n_proportion, ) print('before filtering', data_filtered.shape[0]) print('after filtering', data_filtered.loc[min_count_mask, :].shape[0]) lcpm = np.log2(data_filtered_cpm) fig, axs = plt.subplots(2, 2, figsize=(10, 10)) plt.subplots_adjust(wspace=0.5, hspace=0.5) ax = sns.scatterplot(lcpm.mean(axis=1), lcpm.std(axis=1), ax=axs[0, 0], rasterized=True) ax.set_xlabel('Mean') ax.set_ylabel('Std') sns.despine() ax.set_title('before filtering\n{} peaks'.format(lcpm.shape[0])) ax = sns.scatterplot(lcpm.loc[min_count_mask, :].mean(axis=1), lcpm.loc[min_count_mask, :].std(axis=1), ax=axs[0, 1], rasterized=True) ax.set_xlabel('Mean') ax.set_ylabel('Std') sns.despine() ax.set_title('after filtering\n{} peaks'.format(min_count_mask.sum())) plot_sequenced_samples(lcpm, n_samples=None, ax=axs[1, 0], xlabel='Log$2$ CPM', ylabel='Density', title='before filtering', samples_as_rows=False) plot_sequenced_samples(lcpm.loc[min_count_mask, :], n_samples=None, ax=axs[1, 1], xlabel='Log$2$ CPM', ylabel='Density', title='after filtering', samples_as_rows=False) fig.suptitle('Region filtering with proportion={} and min group size={}'.format(large_min_n_proportion,min_group_n), fontsize=10) plt.savefig(output_plot, dpi=300) plt.show() plt.close() # apply filter on pre filtered counts filtered_counts = data_filtered.loc[min_count_mask, :] # save filtered counts filtered_counts.to_csv(output_data) ### generate filtered consensus region set and region annotations # load consensus region set consensus_regions = pd.read_csv(snakemake.input['consensus_regions'], sep='\t', index_col=3, header=None,) # load consensus region set annotations annot_regions = pd.read_csv(snakemake.input['regions_annot'], index_col=0, header=0,) # apply filter on consensus region set and annotations consensus_regions = consensus_regions.loc[filtered_counts.index,:] annot_regions = annot_regions.loc[filtered_counts.index,:] # save filtered consensus region set and annotations consensus_regions['ID'] = consensus_regions.index bedtools.BedTool().from_dataframe(consensus_regions).saveas(snakemake.output['filtered_consensus_regions']) annot_regions.to_csv(snakemake.output['filtered_regions_annot']) |
(scripts/filter_regions.py)
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 | import os import pandas as pd import pybedtools as bedtools # configuration sloop_extension=snakemake.params["sloop_extension"] output = snakemake.output[0] blacklist_file= snakemake.params["blacklist_file"] chrom_file = snakemake.params["chrom_file"] results_dir = snakemake.params["results_dir"] # load annotation file and get sample names annotations = pd.read_csv(snakemake.input[0], index_col=0) annotations=annotations[(annotations['pass_qc']>0)] # save filtered annotation file annotations.to_csv(snakemake.output[1]) peakfiles = [os.path.join(results_dir,"{}".format(sample),"peaks", "{}_summits.bed".format(sample)) for sample in annotations.index] output_bed = None if not os.path.exists(os.path.split(output)[0]): os.mkdir(os.path.split(output)[0]) for peakfile in peakfiles: peak_bed = bedtools.BedTool(peakfile) if (blacklist_file is not None): peak_bed=peak_bed.intersect(blacklist_file,v=True, wa=True) peak_bed = peak_bed.slop(g=chrom_file, b=sloop_extension) if (output_bed is None): output_bed = peak_bed else: output_bed = output_bed.cat(peak_bed,force_truncate=True) output_bed.saveas(output) peaks = bedtools.BedTool(output).sort(faidx=chrom_file).to_dataframe(names=['CHR','START','END'],dtype={'START':int,'END':int}) peaks['ID'] = peaks.index.format(formatter=(lambda x: "CONS{:011d}".format(x))) bedtools.BedTool().from_dataframe(peaks).saveas(output) |
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 | import pybedtools import os import time import pandas as pd import matplotlib.pyplot as plt import numpy as np import seaborn as sns # utility functions def bed_to_index(df): import pybedtools if isinstance(df, pybedtools.BedTool): df = df.to_dataframe() elif isinstance(df, str): df = pybedtools.BedTool(df).to_dataframe() cols = ["chrom", "start", "end"] if not all([x in df.columns for x in cols]): raise AttributeError( "DataFrame does not have '{}' columns.".format("', '".join(cols)) ) index = ( df["chrom"].astype(str) + ":" + df["start"].astype(int).astype(str) + "-" + df["end"].astype(int).astype(str) ) return pd.Index(index, name="region") def get_peak_gccontent_length(bed_file, fasta_file) -> pd.DataFrame: import pybedtools # sites = pybedtools.BedTool(bed_file) if bed_file is path sites = pybedtools.BedTool(bed_file, from_string=True) nuc = sites.nucleotide_content(fi=fasta_file).to_dataframe(comment="#")[ ["score", "blockStarts"] ] nuc.columns = ["gc_content", "length"] nuc.index = bed_to_index(sites) return nuc def cqn(matrix, gc_content, lengths): from rpy2.robjects import numpy2ri, pandas2ri, r from rpy2.robjects.packages import importr numpy2ri.activate() pandas2ri.activate() importr("cqn") cqn_out = r.cqn(matrix, x=gc_content, lengths=lengths) y_r = cqn_out[list(cqn_out.names).index("y")] y = pd.DataFrame(np.array(y_r), index=matrix.index, columns=matrix.columns) offset_r = cqn_out[list(cqn_out.names).index("offset")] offset = pd.DataFrame( np.array(offset_r), index=matrix.index, columns=matrix.columns ) return y + offset #### configurations # input filtered_counts=pd.read_csv(snakemake.input[0], index_col=0) consensus_regions=pd.read_csv(snakemake.input[1], sep='\t', index_col=3, header=None) filtered_regions=consensus_regions.loc[filtered_counts.index,:] # parameters genome_fasta = snakemake.params["genome_fasta"] # output output_data=snakemake.output[0] # normalize the filtered count matrices by cqn method with gc-content as covariate # get regions bed_regions=filtered_regions.to_string(header=False, index=False) # get nuc info nuc = get_peak_gccontent_length(bed_regions, fasta_file=genome_fasta) # normalize via CQN method normalized_counts = cqn(filtered_counts, gc_content=nuc["gc_content"], lengths=nuc["length"]) # save normalized matrix normalized_counts.to_csv(output_data) |
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 | import pybedtools import os import time import pandas as pd import matplotlib.pyplot as plt import numpy as np import seaborn as sns # utility functions def calc_norm_factors(counts_df, method): assert method in ['TMM', 'TMMwsp', 'RLE', 'upperquartile'] from rpy2.robjects import numpy2ri, pandas2ri, r numpy2ri.activate() pandas2ri.activate() r.source(os.path.join('workflow','scripts','utils_calcNormFactors.R')) return r.calcNormFactors(counts_df, method=method) def cpm(counts_matrix, feature_len=1, norm_factors=1, log=False, pseudocount=0.5, libsize_pseudocount=1, million=1e6): """ Counts per million normalization If feature_len=1 then you get counts per million Setting pseudocount=0.5, libsize_pseudocount=1 ensures results equivalent to LIMMA-voom """ rpk = counts_matrix.astype(float) / feature_len cpm = million * (rpk + pseudocount) / (rpk.sum(axis=0) * norm_factors + libsize_pseudocount) return np.log2(cpm) if log else cpm #### configurations # input filtered_counts=pd.read_csv(snakemake.input[0], index_col=0) # output output_data=snakemake.output[0] # normalize filtered data by selected standard method norm_method = 'TMM' # determine norm factors norm_factors = calc_norm_factors(filtered_counts, method=norm_method) norm_factors = pd.Series(norm_factors, index=filtered_counts.columns, name='norm_factor') # calculate normalized counts normalized_lcpm = cpm(filtered_counts, norm_factors=norm_factors, log=True, pseudocount=0.5, libsize_pseudocount=1) # save normalized counts normalized_lcpm.to_csv(os.path.join(output_data)) |
8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 | import os import pandas as pd import numpy as np from itertools import compress # visualization import seaborn as sns import matplotlib.pyplot as plt # # utils for enumerating string lists # from collections import defaultdict # from itertools import count # from functools import partial #### configurations # annot=observations x annotations annot=pd.read_csv(snakemake.input[0], index_col=0) # type of dimensionality reduction dimred=snakemake.params["dimred"] # variables=columns of annot to plot variables=snakemake.params['variables'] # label=plot title and file name label = snakemake.params["label"] # results folder results_dir = snakemake.params["results_dir"] # if 3 samples or less UMAP could not be performed if (dimred=="UMAP" and annot.shape[0]<4): from pathlib import Path for variable in variables: Path(os.path.join(results_dir, dimred+"_"+label+"_"+variable+".svg")).touch() import sys sys.exit() # if 2 samples or less PCA only consists of one PC if (dimred=="PCA" and annot.shape[0]<3): from pathlib import Path for variable in variables: Path(os.path.join(results_dir, dimred+"_"+label+"_"+variable+".svg")).touch() import sys sys.exit() # data=observations x features data_df=pd.read_csv(snakemake.input[1], index_col=0) #plot parameters color_dict=None alpha=1 # plot data in 2D # sns.set(color_codes=True) for variable in variables: fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(5, 5)) unique_vals = annot[variable].unique() # if len(unique_vals)==1 or (sum(pd.isna(unique_vals))>0): # print("{} only one value or contains NaNs".format(variable)) # continue unique_vals = unique_vals[pd.notna(unique_vals)] if all([isinstance(i, (str, bool, np.bool_)) for i in unique_vals]): print('discrete variable ', variable) if color_dict==None: # colors = plt.cm.get_cmap("tab20").colors cm = plt.get_cmap('gist_ncar') colors=[cm(1.*i/len(unique_vals)) for i in range(len(unique_vals))] else: colors = [color_dict[x] for x in unique_vals] # plot data by unique value for g, c in zip(unique_vals, colors): c = [c] # to remove the warnings axes.scatter( data_df.loc[annot.index[annot[variable] == g].tolist(), '0',], data_df.loc[annot.index[annot[variable] == g].tolist(), '1',], label=g, marker=".", c=c, alpha=alpha, ) # centroids by mean axes.scatter( data_df.loc[annot.index[annot[variable] == g].tolist(), '0',].mean(), data_df.loc[annot.index[annot[variable] == g].tolist(), '1',].mean(), marker="s", alpha=0.5, label=str(g) + " centroid", c=c, ) axes.text( data_df.loc[annot.index[annot[variable] == g].tolist(), '0',].mean(), data_df.loc[annot.index[annot[variable] == g].tolist(), '1',].mean(), s=g, horizontalalignment="center", verticalalignment="bottom", alpha=0.5, ) # edit and position legend handles, labels = axes.get_legend_handles_labels() legend_idx = ["centroid" in label for label in labels] handles = list(compress(handles, legend_idx)) labels = list(compress(labels, legend_idx)) axes.legend(handles, labels, loc="center left", bbox_to_anchor=(1.05, 0.5)) elif 
all([isinstance(i, (int, float, np.int, np.float, np.int64)) for i in unique_vals]): print('continous variable ', variable) # check if enumerated ie numerical -> probably not needed anymore # if annot[variable].dtype != 'float64' and annot[variable].dtype != 'int64': # label_to_number = defaultdict(partial(next, count(1))) # annot[variable]=[label_to_number[label] for label in annot[variable]] cmap = sns.cubehelix_palette(as_cmap=True) points = axes.scatter( data_df.loc[:, '0',], data_df.loc[:, '1',], marker=".", c=annot[variable], s=50, cmap=cmap, alpha=alpha, ) fig.colorbar(points) else: print("variable type not-detected for {}".format(variable)) continue # show and save figure x_postfix='' y_postfix='' if dimred=='PCA': explained_variance=list(pd.read_csv(os.path.join(results_dir,"PCA_{}_{}_explained_variance.csv".format(snakemake.wildcards["split"],snakemake.wildcards["step"])), index_col=0)['0']) x_postfix=' ({:.1f}%)'.format(100*explained_variance[0]) y_postfix=' ({:.1f}%)'.format(100*explained_variance[1]) axes.set_xlabel(dimred+" 1"+x_postfix) axes.set_ylabel(dimred+" 2 "+y_postfix) plt.title(dimred+" of " + label + " "+variable) plt.show() # save plot fig.savefig( fname=os.path.join(results_dir, dimred+"_"+label+"_"+variable+".svg"), format="svg", dpi=300, bbox_inches="tight", ) |
8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 | import os import pandas as pd import numpy as np # visualization import seaborn as sns import matplotlib.pyplot as plt # mean-variance relationship import statsmodels.formula.api as sm from scipy import stats #### configurations # load data data=pd.read_csv(snakemake.input[0], index_col=0) if not os.path.exists(snakemake.params['results_dir']): os.mkdir(snakemake.params['results_dir']) mean=data.mean(axis=1) std=data.std(axis=1) xaxis='mean' yaxis='std' title='Mean-StD Relationship' lm_result = sm.ols(formula="std ~ mean", data=pd.DataFrame({'std': std, 'mean': mean})).fit() # print(lm_result.summary()) # print(lm_result.rsquared, lm_result.rsquared_adj) p = lm_result.params x_dummy = np.arange(min(mean), max(mean)) spr=stats.spearmanr(mean, std) plt.scatter(mean, std, marker='.', alpha=0.5) plt.plot(x_dummy, p.Intercept + p['mean'] * x_dummy, c='red') plt.xlabel(xaxis) plt.ylabel(yaxis) plt.title("{} R2={:.2f} rho={:.2f} p={:.2f}".format(title,lm_result.rsquared,spr[0],spr[1])) # save plot plt.savefig(fname=snakemake.output[0], format="svg", dpi=300, bbox_inches="tight", ) |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 | import os import pandas as pd # configuration output = snakemake.output[0] results_dir = snakemake.params["results_dir"] # load annotation file and get sample names annotations = pd.read_csv(snakemake.input[0], index_col=0) results=[] for sample in annotations.index: print(sample) result=pd.read_csv(os.path.join(results_dir,"{}".format(sample),"mapped", "{}_quantification_counts.csv".format(sample))) results.append(result) # results = [item for item in results if item is not None] results = pd.concat(results) results.set_index(results.columns[0]).T.to_csv(output) |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 | import os import pandas as pd import pybedtools as bedtools # configuration output = snakemake.output[0] chrom_file = snakemake.params["chrom_file"] results_dir = snakemake.params["results_dir"] sample=snakemake.wildcards["sample"] elements_to_quantify = bedtools.BedTool(snakemake.input[0]) bamfile=snakemake.input[1] print("Processing "+sample) try: result= elements_to_quantify.coverage(b=bamfile,sorted=True,g=chrom_file).to_dataframe( names=["CHR", "START", "END", "ID", sample, "NA1", "NA2", "NA3"], dtype={sample: int}, usecols=['ID', sample], index_col='ID').T result.to_csv(output) except Exception as e: print("Error occured while processing sample "+sample) pd.DataFrame(0,index=elements_to_quantify.index,columns=[sample]).T.to_csv(output) |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 | import os import pandas as pd # configuration output = snakemake.output[0] results_dir = snakemake.params["results_dir"] # load annotation file and get sample names annotations = pd.read_csv(snakemake.input[0], index_col=0) results=[] for sample in annotations.index: print(sample) result=pd.read_csv(os.path.join(results_dir,"{}".format(sample),"peaks", "{}_quantification_support.csv".format(sample))) results.append(result) # results = [item for item in results if item is not None] results = pd.concat(results) results.set_index(results.columns[0]).T.to_csv(output) |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 | import os import pandas as pd import pybedtools as bedtools # configuration output = snakemake.output[0] chrom_file = snakemake.params["chrom_file"] sample=snakemake.wildcards["sample"] consensus_peaks = bedtools.BedTool(snakemake.input[0]) consensus_peaks_df = bedtools.BedTool(snakemake.input[0]).to_dataframe().set_index('name') peakfile=snakemake.input[1] result = pd.DataFrame(0,index=consensus_peaks_df.index,columns=[sample]) try: if (peakfile is not None): sample_peaks = bedtools.BedTool(peakfile) result = consensus_peaks.intersect( sample_peaks, g=chrom_file, wa=True, c=True ).to_dataframe( index_col='name', usecols=[3,4], names=['name',sample] ) except Exception as e: print("Error occured while processing sample "+sample) finally: result.T.to_csv(output) |
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 | import os import pandas as pd import matplotlib.pyplot as plt import numpy as np import seaborn as sns # for robust dispersion norm calculation from statsmodels import robust import warnings warnings.filterwarnings("ignore") #### configurations # input data=pd.read_csv(snakemake.input[0], index_col=0) # parameters HVR_percentage=snakemake.params['HVR_percentage'] top_n = int((data.shape[0]/100)*HVR_percentage) # output output_data=snakemake.output[0] output_plot=snakemake.output[1] ##### determine highly variable regions (HRV) based on normalized dispersion ## inspired by “highly_variable_genes” function from scanpy 1.5.1 (Python 3.6.10) with the “flavor” argument set to “cell_ranger” ## but instead of dispersion=var/mean we use dispersion=std to keep it meaningful for negative mean values ## overlap of region selecrtion in test data was 96.6% mean = data.mean(axis=1) # var = data.var(axis=1) # now actually compute the dispersion mean[mean == 0] = 1e-12 # set entries equal to zero to small value dispersion = data.std(axis=1) #var / mean # all of the following quantities are "per-gene" here df = pd.DataFrame() df['means'] = mean df['dispersions'] = dispersion # bin regions by their mean in 20 bins df['mean_bin'] = pd.cut(df['means'], np.r_[ -np.inf, np.percentile(df['means'], np.arange(10, 105, 5)), np.inf ]) # group regions by mean_bin disp_grouped = df.groupby('mean_bin')['dispersions'] # determine median (robust) dispersion by group disp_median_bin = disp_grouped.median() # the next line raises the warning: "Mean of empty slice" with warnings.catch_warnings(): warnings.simplefilter('ignore') # calculate median absolute deviation (mad; robust measure of variance) per group disp_mad_bin = disp_grouped.apply(robust.mad) # determine "normalized" dispersion via (dispersion-dispersion_group_median)/dispersion_mad df['dispersions_norm'] = (df['dispersions'].values - disp_median_bin[df['mean_bin'].values].values ) / disp_mad_bin[df['mean_bin'].values].values dispersion_norm = df['dispersions_norm'].values.astype('float32') HVR_idx=df.index[df['dispersions_norm'].rank(ascending=False)<=top_n] # HVR_idx=std.index[std.rank(ascending=False)<=top_n] data_HVR = data.loc[HVR_idx, :] data_HVR.to_csv(output_data) # make plots describing the HVR selection fig, axs = plt.subplots(1, 2, figsize=(10, 5)) plt.subplots_adjust(wspace=0.5, hspace=0.5) # plot region variability ax = sns.scatterplot(df['dispersions_norm'].rank(ascending=False), df['dispersions_norm'], ax=axs[0], rasterized=True, ) ax = sns.scatterplot(df.loc[HVR_idx,'dispersions_norm'].rank(ascending=False), df.loc[HVR_idx,'dispersions_norm'], ax=axs[0], rasterized=True, color='r' ) ax.set_xlabel('rank') ax.set_ylabel('normalized dispersion') sns.despine() ax.set_title("Regions ranked by normalized dispersion") # mean-dispersion plot (highlighting selected regions) ax = sns.scatterplot(df.loc[:,'means'], df.loc[:,'dispersions_norm'], ax=axs[1], rasterized=True, ) ax = sns.scatterplot(df.loc[HVR_idx,'means'], df.loc[HVR_idx,'dispersions_norm'], ax=axs[1], rasterized=True, color='r' ) ax.set_xlabel('mean') ax.set_ylabel('normalized dispersion') sns.despine() 
ax.set_title("Mean to normalized dispersion relationship") fig.suptitle('Top {} HVRs by binned normalized dispersion'.format(top_n), fontsize=14) plt.savefig( fname=output_plot, format="svg", dpi=300, bbox_inches="tight", ) plt.show() plt.close() |
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 | import os import pandas as pd import numpy as np #### configurations # input data_counts=pd.read_csv(snakemake.input[0], index_col=0) annotation=pd.read_csv(snakemake.input[1], index_col=0) # parameters project_path = snakemake.params["project_path"] split_by = snakemake.params["split_by"] ##### make output folders, split and save data for split in list(annotation[split_by].unique()): # make split folder if not os.path.exists(os.path.join(project_path,split)): os.mkdir(os.path.join(project_path,split)) # split counts split_counts = data_counts.loc[:, annotation[(annotation[split_by]==split)].index] # save split counts split_counts.to_csv(os.path.join(project_path,split,"{}_counts.csv".format(split))) # split annotatios split_annot = annotation.loc[annotation[split_by]==split, :] # save split annotatios split_annot.to_csv(os.path.join(project_path,split,"{}_annotation.csv".format(split))) |
```
run:
    command_str="touch {}".format(os.path.join(config["project_path"],"{}_successfully_completed.txt".format(config["project_name"])))
    subprocess.run(command_str, shell=True)
```
Support
- Future updates