# snakemake workflow for post-processing scATACseq data
Cite if you use it :)
## What does it do?
For a single-cell ATAC-seq experiment, one gets a merged bam file for all cells. After clustering, similar cells are grouped into cell types (cell states). This workflow splits the bam by cluster to create a pseudo-bulk bam for each cluster, creates bigwig tracks for visualization, calls peaks for each cluster, and merges the peaks across the clusters. Finally, it counts reads per peak per cell from the original bam file over the merged peaks.
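Conceptually, the steps look like the sketch below. The commands are illustrative only (file names are placeholders); the real commands live in the Snakefile rules shown under Code Snippets:

```bash
# 1. split the merged bam into one pseudo-bulk bam per cluster
scripts/split_scATAC_bam_by_cluster.py -prefix sample1 -outdir 01split_bam/sample1 sample1.csv sample1.bam

# 2. make a bigwig track per cluster for visualization
bamCoverage -b sample1_cluster1.bam -o sample1_cluster1.bw

# 3. call peaks per cluster (Genrich wants a name-sorted bam)
samtools sort -n -o sample1_cluster1.namesorted.bam sample1_cluster1.bam
Genrich -t sample1_cluster1.namesorted.bam -o sample1_cluster1.narrowPeak

# 4. merge the peaks across clusters
cat *.narrowPeak | sort -k1,1V -k2,2n | bedtools merge > merged_peaks.bed

# 5. count reads per peak per cell from the original bam
scripts/rcbbc -w white_list.txt -p sample1 merged_peaks.bed sample1.bam
```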
Thanks Brent Pedersen for `rcbbc` (Read Count By BarCode) in the scripts folder. This script can be very useful in general: it counts the number of reads per cell, in regions specified in bed format, from a bam file that contains a `CB` tag for each cell. The code is written in nim and is super fast: for over 1 million regions across over 8000 cells in a ~64G bam, it took ~40 minutes.
```
./rcbbc -h

rcbbc

read-count by barcode

Usage:
  rcbbc [options] bed bam

Arguments:
  bed
  bam

Options:
  -f, --fasta=FASTA        path to fasta. required only for CRAM
  -e, --exclude=EXCLUDE    exclude alignments with any of these bits set (default: 1796)
  -m, --min-mapping-quality=MIN_MAPPING_QUALITY
                           exclude alignments with a mapping-quality below this (default: 1)
  -w, --white-list=WHITE_LIST
                           line-delimited list of barcodes to include. default is to include any seen in the bam
  -t, --tag=TAG            tag on which to partition read-counts (default: CB)
  -p, --prefix=PREFIX      output prefix for files written by rcbbc (default: rcbbc)
  -h, --help               Show this help
```
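For example, following the help above, an invocation restricted to a barcode white list might look like this (file names are placeholders):

```bash
./rcbbc -w white_list.txt -p sample1 merged_peaks.bed possorted_bam.bam
```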
In the future, peak-calling software should be barcode aware, so one does not need to split the bam file by cluster. But for now, this works for me.
## Workflow of the pipeline
## Dependencies

- snakemake >= 5.4.5. Note that snakemake is python3 based.
- deeptools >= 3.1.3. We use `bamCoverage` from it to make bigwig files.
- Genrich for calling peaks. It implements ATAC-seq peak calling with internal +4/-5 bp shifting. Developed by John, our previous group member at Harvard FAS Informatics.
- bedtools >= 2.27.1 for manipulating bed files.
- samtools >= 1.9 for manipulating bam files.
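A conda env is still on the to-do list; until then, a minimal environment sketch covering these tools might look like the following, assuming everything is available from bioconda (the `environment.yaml` file is illustrative, not shipped with the repo):

```yaml
# environment.yaml -- illustrative sketch, not part of the repo
name: pyflow-scatacseq
channels:
  - conda-forge
  - bioconda
dependencies:
  - snakemake >=5.4.5
  - deeptools >=3.1.3
  - genrich          # assumed available from bioconda
  - bedtools >=2.27.1
  - samtools >=1.9
```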
## How to use this pipeline
- Clone the repo:

```bash
git clone https://github.com/crazyhottommy/pyflow-scATACseq
cd pyflow-scATACseq
```
- Prepare a tab-delimited tsv file with a header, `meta.tsv`, inside the `pyflow-scATACseq` folder:

| sample  | bam_path             | cluster_id.csv       | white_list                      |
|---------|----------------------|----------------------|---------------------------------|
| sample1 | /path/to/sample1.bam | /path/to/sample1.csv | /path/to/white_list_sample1.txt |
| sample2 | /path/to/sample2.bam | /path/to/sample2.csv | /path/to/white_list_sample2.txt |
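As plain text, the file is literally tab-separated with this header (paths are placeholders):

```
sample	bam_path	cluster_id.csv	white_list
sample1	/path/to/sample1.bam	/path/to/sample1.csv	/path/to/white_list_sample1.txt
sample2	/path/to/sample2.bam	/path/to/sample2.csv	/path/to/white_list_sample2.txt
```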
  - The first column is the sample name.
  - The second column is the path to the `possorted_bam.bam` produced by 10x `cellranger-atac`.
  - The `cluster_id.csv` is a two-column csv file with a header (the header names do not matter). The first column is the cell barcode; the second column is the cluster id (numeric or strings). Make sure the cluster_id.csv is named exactly `<sample>.csv`: e.g. if the sample column is sample1, the cluster information csv file should be sample1.csv.
```
cat example_cluster.csv

name,value
AAACGAAAGCACCATT-1,32
AAACGAAAGTAGTCGG-1,4
AAACGAAGTAACGGCA-1,31
AAACTCGAGAACAGGA-1,17
AAACTCGAGTCCAGAG-1,1
AAACTCGCAAAGGAAG-1,41
AAACTCGCAAGGCGTA-1,5
AAACTCGCAGAAAGAG-1,4
AAACTCGCAGTAGGCA-1,24
```
  - The white_list is a one-column txt file with the valid barcodes you want to include. You can create it by:

```bash
cat example_cluster.csv | cut -f1 -d, | sed '1d' > example_white_list.txt
```
- Create a `samples.json` file:

```bash
python sample2json.py meta.tsv
# a samples.json file will be created
cat samples.json
```

```json
{
    "sample1": [
        "/path/to/sample1.bam",
        "/path/to/sample1.csv",
        "/path/to/white_list_sample1.txt"
    ],
    "sample2": [
        "/path/to/sample2.bam",
        "/path/to/sample2.csv",
        "/path/to/white_list_sample2.txt"
    ]
}
```
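`sample2json.py` ships with the repo; a minimal sketch of what it does, assuming the `meta.tsv` layout above, could look like this (the real script may differ):

```python
# illustrative sketch of sample2json.py, assuming the meta.tsv columns shown above
import csv
import json
import sys

samples = {}
with open(sys.argv[1]) as f:
    reader = csv.DictReader(f, delimiter="\t")
    for row in reader:
        # map each sample to [bam, cluster csv, white list], as in samples.json
        samples[row["sample"]] = [row["bam_path"], row["cluster_id.csv"], row["white_list"]]

with open("samples.json", "w") as f:
    json.dump(samples, f, indent=4)
```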
- Run the workflow:

```bash
# dry run
snakemake -np

## real run on the local machine
snakemake -j 10

## if you want to use a singularity image that contains presto for the differential accessibility test
snakemake -j 10 --use-singularity

## submit to slurm (you can change this script for your own HPC)
./pyflow-scATACseq.sh
```
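The submission script is in the repo; if you need to adapt it for your own scheduler, a bare-bones equivalent using snakemake's generic `--cluster` flag might look like this (partition, time, and memory are placeholders):

```bash
#!/bin/bash
# minimal sketch of a slurm submission wrapper; adjust resources for your cluster
snakemake -j 100 \
    --cluster "sbatch -p general -t 12:00:00 --mem=16g" \
    --latency-wait 60 \
    --rerun-incomplete
```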
## To-do list

- [ ] Check that file paths exist and have the right format.
- [ ] Set up a conda env for the Snakefile.
- [ ] Provide a docker container.
- [ ] Motif analysis.
- [ ] Find differential peaks using presto.
- [ ] Associate peaks with genes using Cicero.
- [ ] Gene set enrichment.
## Other Notes

More thoughts on pileup and shifting reads:

- https://github.com/deeptools/deepTools/issues/453 (`maxFragmentLength`)
- https://github.com/deeptools/deepTools/issues/370
- https://groups.google.com/forum/#!topic/deeptools/JU9itiT5rYk

You probably want `--Offset 1`. That is not shifting exactly; for shifting, use `--ATACshift` from [alignmentSieve](https://deeptools.readthedocs.io/en/develop/content/tools/alignmentSieve.html). Filter out fragments?
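For reference, a minimal `alignmentSieve` call following the docs linked above might look like this (file names are placeholders; the shifted output is not coordinate-sorted, so re-sort and index it before use):

```bash
# the input bam must be coordinate-sorted and indexed
alignmentSieve -b possorted_bam.bam -o shifted.bam --ATACshift
samtools sort -o shifted.sorted.bam shifted.bam
samtools index shifted.sorted.bam
```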
## Code Snippets
`scripts/presto.R`:

```r
'Finding Differential accessible regions from one or more sparse matrix
The number of mtx sparse matrix should be the same as the cluster information
csv files and have the same basename. e.g. sample1.mtx and sample1.csv

Usage:
    presto.R --mtx=<mtx>... --cluster=<cluster>... <output>
    presto.R --mtx=sample1.mtx --cluster=sample1.csv sample1.txt
    presto.R --mtx=sample1.mtx --mtx=sample2.mtx --cluster=sample1.csv --cluster=sample2.csv sample.txt

Options:
    --mtx=<mtx>...          input file path of the sparse matrix
    --cluster=<cluster>...  input file path of the cluster information. a csv file
                            with a header. column 1 is the cell barcode, column 2
                            is the cluster id.
    -h --help               Show this screen.
    -v --version            Show version.

Arguments:
    output                  output file path of the differential peaks
' -> doc

suppressMessages(library(docopt))
suppressMessages(library(Seurat))
suppressMessages(library(Matrix))
suppressMessages(library(presto))
suppressMessages(library(tidyverse))

arguments <- docopt(doc, version = 'presto.R v1.0\n\n')

num_of_matrix <- length(arguments$mtx)
num_of_csv <- length(arguments$cluster)

if (num_of_matrix != num_of_csv){
  stop("the number of matrix must be the same with the number of cluster information csv file")
}

check_basename <- function(mtx, csv){
  return(gsub(".mtx", "", basename(mtx)) == gsub(".csv", "", basename(csv)))
}

if (!all(purrr::map2_lgl(arguments$mtx, arguments$cluster, check_basename))){
  stop("the basename of the mtx and the cluster info csv file do not match")
}

read_mtx <- function(mtx){
  recount <- readMM(mtx)
  peaks <- read_tsv(paste0(mtx, ".sites"), col_names = FALSE)
  cells <- read_tsv(paste0(mtx, ".cells"), col_names = FALSE)
  prefix <- gsub(".mtx", "", basename(mtx))
  cells <- paste0(prefix, "_", cells$X1)
  rownames(recount) <- peaks$X1
  colnames(recount) <- cells
  return(recount)
}

message(paste("reading in the sparse matrix:", arguments$mtx, "\n"))
mtxs <- purrr::map(arguments$mtx, read_mtx)

if (!length(unique(purrr::map(mtxs, rownames))) == 1){
  stop("all matrix should have same row/peak name")
}

recount.mat <- purrr::reduce(mtxs, cbind)

makeSeuratObject <- function(recount.mat){
  recount.TFIDF.mat <- TF.IDF(recount.mat)
  recount.TFIDF.mat <- Seurat:::LogNorm(data = recount.TFIDF.mat, scale_factor = 1e4)
  rownames(recount.TFIDF.mat) <- rownames(recount.mat)
  colnames(recount.TFIDF.mat) <- colnames(recount.mat)
  recount.atac <- CreateSeuratObject(counts = recount.TFIDF.mat, assay = 'ATAC', project = "recount")
  return(recount.atac)
}

### transferred label information
message(paste("reading in the cluster information csv files:", arguments$cluster, "\n"))

read_id <- function(csv){
  id <- read_csv(csv)
  prefix <- gsub(".csv", "", basename(csv))
  id <- id %>%
    mutate(cell = paste0(prefix, "_", name)) %>%
    dplyr::rename(cluster_id = value)
  return(id)
}

cluster_ids <- purrr::map(arguments$cluster, read_id)
cluster_ids <- bind_rows(cluster_ids)

message("making Seurat ATAC objects")
recount.atac <- makeSeuratObject(recount.mat)
recount.atac[["cell"]] <- rownames(recount.atac@meta.data)

old_meta <- recount.atac@meta.data
new_meta <- left_join(recount.atac@meta.data, cluster_ids)

if (!identical(rownames(old_meta), new_meta$cell)){
  stop("The name of the cells are different")
}

rownames(new_meta) <- rownames(old_meta)
recount.atac@meta.data <- new_meta

# library(devtools)
# install_github('immunogenomics/presto')

# this counts assay is the TF-IDF normalized and log transformed counts
# see how long it takes for 1.3 million features
message("Finding differential accessible regions by presto")
start_time <- Sys.time()
res <- wilcoxauc(recount.atac, 'cluster_id', seurat_assay = 'ATAC', assay = "counts")
end_time <- Sys.time()
message("presto takes")
end_time - start_time
# Time difference of 2.933884 mins !! impressive

message(paste("writing differential peaks to", arguments$output, "\n"))
write_tsv(res %>% arrange(padj, desc(auc)), arguments$output)
```
`scripts/split_scATAC_bam_by_cluster.py`:

```python
import pysam
import csv
import argparse
import os.path
import sys

parser = argparse.ArgumentParser()
parser.add_argument("csv", help="Required. the FULL path to the cluster csv file with header, \
first column is the cell barcode, second column is the cluster id")
parser.add_argument("bam", help="Required. the FULL path to the 10x scATAC bam file generated \
by cellranger-atac count")
parser.add_argument("-prefix", help="Optional, the prefix of the output bam, default is cluster_id.bam")
parser.add_argument("-outdir", help="Optional, the output directory for the splitted bams, default is current dir")

args = parser.parse_args()

if os.path.exists(args.csv):
    pass
else:
    print("csv file is not found")
    sys.exit(1)

if os.path.exists(args.bam):
    pass
else:
    print("10x scATAC bam not found")
    sys.exit(1)

if args.outdir:
    if os.path.isdir(args.outdir):
        pass
    else:
        try:
            os.mkdir(args.outdir)
        except OSError:
            print("can not create directory {}".format(args.outdir))

cluster_dict = {}
with open(args.csv) as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    # skip header
    header = next(csv_reader)
    for row in csv_reader:
        cluster_dict[row[0]] = row[1]

clusters = set(x for x in cluster_dict.values())

fin = pysam.AlignmentFile(args.bam, "rb")

# open the same number of bam files as clusters, and map the out file handler
# to the cluster id; write to a bam with "wb"
fouts_dict = {}
for cluster in clusters:
    if args.prefix:
        fout_name = args.prefix + "_" + cluster + ".bam"
    else:
        fout_name = cluster + ".bam"
    if args.outdir:
        fout = pysam.AlignmentFile(os.path.join(args.outdir, fout_name), "wb", template = fin)
    else:
        fout = pysam.AlignmentFile(fout_name, "wb", template = fin)
    fouts_dict[cluster] = fout

for read in fin:
    tags = read.tags
    # the 8th item is the CB tag
    CB_list = [x for x in tags if x[0] == "CB"]
    if CB_list:
        cell_barcode = CB_list[0][1]
    else:
        continue
    # the bam files may contain reads not in the final clustered barcodes;
    # cluster_id will be None if the barcode is not in the clusters.csv file
    cluster_id = cluster_dict.get(cell_barcode)
    if cluster_id:
        fouts_dict[cluster_id].write(read)

## do not forget to close the files
fin.close()
for fout in fouts_dict.values():
    fout.close()
```
Shell commands from the rules in the Snakefile:

```
shell:
    """
    scripts/split_scATAC_bam_by_cluster.py -prefix {wildcards.sample} -outdir 01split_bam/{wildcards.sample} {input[1]} {input[0]}
    touch {output}
    """
```

```
shell:
    """
    samtools index 01split_bam/{wildcards.sample}/{wildcards.sample}_{wildcards.cluster_id}.bam
    """
```

```
shell:
    """
    bamCoverage -b 01split_bam/{wildcards.sample}/{wildcards.sample}_{wildcards.cluster_id}.bam -p {threads} {params.custom} -o {output}
    """
```

```
shell:
    """
    samtools merge -@ 2 -r {output} {params.bam}
    """
```

```
shell:
    """
    samtools index {input}
    """
```

```
shell:
    """
    bamCoverage -b {input[0]} -p {threads} {params.custom} -o {output}
    """
```

```
shell:
    """
    samtools sort -n -m 2G -@ {threads} -T {wildcards.cluster_id} \
        -o {output} \
        {input[0]} 2> {log}
    """
```

```
shell:
    """
    Genrich -t {input} -o {output} {params.custom} 2> {log}
    """
```

```
shell:
    """
    bedtools intersect -a {input[0]} -b <(zcat {input[1]}) -v > {output}
    """
```

```
shell:
    """
    # the 10th column is the peak summit
    cat {input} | awk -v OFS="\t" '$2=$2+$10,$3=$2+1' > {output.summit}
    bedtools slop -i {output.summit} -g {config[genome_size]} -b {config[extend_size]} > {output.extend_summit}
    """
```

```
shell:
    """
    samtools sort -n -m 2G -@ {threads} -T {wildcards.sample}_{wildcards.cluster_id} \
        -o {output} \
        01split_bam/{wildcards.sample}/{wildcards.sample}_{wildcards.cluster_id}.bam 2> {log}
    """
```

```
shell:
    """
    Genrich -t {input} -o {output} {params.custom} 2> {log}
    """
```

```
shell:
    """
    bedtools intersect -a {input[0]} -b <(zcat {input[1]}) -v > {output}
    """
```

```
shell:
    """
    # the 10th column is the peak summit
    cat {input} | awk -v OFS="\t" '$2=$2+$10,$3=$2+1' > {output.summit}
    bedtools slop -i {output.summit} -g {config[genome_size]} -b {config[extend_size]} > {output.extend_summit}
    """
```

```
shell:
    """
    # get rid of unconventional chromosomes and natural sort the chr (chr10 after chr9)
    cat {input} | sort -k1,1V -k2,2n | bedtools merge | grep -v "_" > {output}
    """
```

```
shell:
    """
    scripts/rcbbc -w {params.white_list} -p 07recount/{wildcards.sample}/{wildcards.sample} {input.bed} {input.bam}
    """
```

```
shell:
    """
    # remove unconventional chromosomes and natural sort by -V
    cat {input} | sort -k1,1V -k2,2n | bedtools merge | grep -v "_" > {output}
    """
```

```
shell:
    """
    scripts/rcbbc -w {params.white_list} -p 07recount_all/{wildcards.sample}/{wildcards.sample} {input.bed} {input.bam}
    """
```

```
shell:
    """
    Rscript scripts/presto.R --mtx {input.mtx} --cluster {input.csv} {output}
    """
```

```
shell:
    """
    mtxs=""
    for mtx in {input.mtxs}; do
        mtxs+=" --mtx=$mtx"
    done

    csvs=""
    for csv in {input.csvs}; do
        csvs+=" --cluster=$csv"
    done

    Rscript scripts/presto.R $mtxs $csvs {output} > {log} 2>&1
    """
```