# snakemake workflow for post-processing scATACseq data
Cite if you use it :)
## What does it do?
For a single-cell ATAC-seq experiment, one gets a merged bam file for all cells. After clustering, similar cells are grouped into cell types (cell states). This workflow splits the bam by cluster to create a pseudo-bulk bam for each cluster, creates bigwig tracks for visualization, calls peaks for each cluster, and merges the peaks across the clusters. Finally, it counts reads per peak per cell from the original bam file over the merged peaks.
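Conceptually, the steps look like the sketch below. The commands are illustrative only (file names are placeholders); the real commands live in the Snakefile rules shown under Code Snippets:

```bash
# 1. split the merged bam into one pseudo-bulk bam per cluster
scripts/split_scATAC_bam_by_cluster.py -prefix sample1 -outdir 01split_bam/sample1 sample1.csv sample1.bam

# 2. make a bigwig track per cluster for visualization
bamCoverage -b sample1_cluster1.bam -o sample1_cluster1.bw

# 3. call peaks per cluster (Genrich wants a name-sorted bam)
samtools sort -n -o sample1_cluster1.namesorted.bam sample1_cluster1.bam
Genrich -t sample1_cluster1.namesorted.bam -o sample1_cluster1.narrowPeak

# 4. merge the peaks across clusters
cat *.narrowPeak | sort -k1,1V -k2,2n | bedtools merge > merged_peaks.bed

# 5. count reads per peak per cell from the original bam
scripts/rcbbc -w white_list.txt -p sample1 merged_peaks.bed sample1.bam
```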
Thanks Brent Pedersen for `rcbbc` (Read Count By BarCode) in the scripts folder. This script can be very useful in general: it counts the number of reads per cell, in regions specified in bed format, from a bam file that contains a `CB` tag for each cell. The code is written in nim and is super fast: for over 1 million regions across over 8000 cells in a ~64G bam, it took ~40 minutes.
```
./rcbbc -h

rcbbc

read-count by barcode

Usage:
  rcbbc [options] bed bam

Arguments:
  bed
  bam

Options:
  -f, --fasta=FASTA        path to fasta. required only for CRAM
  -e, --exclude=EXCLUDE    exclude alignments with any of these bits set (default: 1796)
  -m, --min-mapping-quality=MIN_MAPPING_QUALITY
                           exclude alignments with a mapping-quality below this (default: 1)
  -w, --white-list=WHITE_LIST
                           line-delimited list of barcodes to include. default is to include any seen in the bam
  -t, --tag=TAG            tag on which to partition read-counts (default: CB)
  -p, --prefix=PREFIX      output prefix for files written by rcbbc (default: rcbbc)
  -h, --help               Show this help
```
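For example, following the help above, an invocation restricted to a barcode white list might look like this (file names are placeholders):

```bash
./rcbbc -w white_list.txt -p sample1 merged_peaks.bed possorted_bam.bam
```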
In the future, peak-calling software should be barcode aware, so one does not need to split the bam file by cluster. But for now, this works for me.
## Workflow of the pipeline
## Dependencies

- snakemake >= 5.4.5. Note that snakemake is python3 based.
- deeptools >= 3.1.3. We use `bamCoverage` from it to make bigwig files.
- Genrich for calling peaks. It implements ATAC-seq peak calling with internal +4/-5 bp shifting. Developed by John, our previous group member at Harvard FAS Informatics.
- bedtools >= 2.27.1 for manipulating bed files.
- samtools >= 1.9 for manipulating bam files.
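A conda env is still on the to-do list; until then, a minimal environment sketch covering these tools might look like the following, assuming everything is available from bioconda (the `environment.yaml` file is illustrative, not shipped with the repo):

```yaml
# environment.yaml -- illustrative sketch, not part of the repo
name: pyflow-scatacseq
channels:
  - conda-forge
  - bioconda
dependencies:
  - snakemake >=5.4.5
  - deeptools >=3.1.3
  - genrich          # assumed available from bioconda
  - bedtools >=2.27.1
  - samtools >=1.9
```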
## How to use this pipeline
- Clone the repo:

```bash
git clone https://github.com/crazyhottommy/pyflow-scATACseq
cd pyflow-scATACseq
```
- Prepare a tab-delimited tsv file with a header, `meta.tsv`, inside the `pyflow-scATACseq` folder:

| sample  | bam_path             | cluster_id.csv       | white_list                      |
|---------|----------------------|----------------------|---------------------------------|
| sample1 | /path/to/sample1.bam | /path/to/sample1.csv | /path/to/white_list_sample1.txt |
| sample2 | /path/to/sample2.bam | /path/to/sample2.csv | /path/to/white_list_sample2.txt |
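As plain text, the file is literally tab-separated with this header (paths are placeholders):

```
sample	bam_path	cluster_id.csv	white_list
sample1	/path/to/sample1.bam	/path/to/sample1.csv	/path/to/white_list_sample1.txt
sample2	/path/to/sample2.bam	/path/to/sample2.csv	/path/to/white_list_sample2.txt
```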
  - The first column is the sample name.
  - The second column is the path to the `possorted_bam.bam` produced by 10x `cellranger-atac`.
  - The `cluster_id.csv` is a two-column csv file with a header (the header names do not matter). The first column is the cell barcode; the second column is the cluster id (numeric or strings). Make sure the cluster_id.csv is named exactly `<sample>.csv`: e.g. if the sample column is sample1, the cluster information csv file should be sample1.csv.
```
cat example_cluster.csv

name,value
AAACGAAAGCACCATT-1,32
AAACGAAAGTAGTCGG-1,4
AAACGAAGTAACGGCA-1,31
AAACTCGAGAACAGGA-1,17
AAACTCGAGTCCAGAG-1,1
AAACTCGCAAAGGAAG-1,41
AAACTCGCAAGGCGTA-1,5
AAACTCGCAGAAAGAG-1,4
AAACTCGCAGTAGGCA-1,24
```
  - The white_list is a one-column txt file with the valid barcodes you want to include. You can create it by:

```bash
cat example_cluster.csv | cut -f1 -d, | sed '1d' > example_white_list.txt
```
- Create a `samples.json` file:

```bash
python sample2json.py meta.tsv
# a samples.json file will be created
cat samples.json
```

```json
{
    "sample1": [
        "/path/to/sample1.bam",
        "/path/to/sample1.csv",
        "/path/to/white_list_sample1.txt"
    ],
    "sample2": [
        "/path/to/sample2.bam",
        "/path/to/sample2.csv",
        "/path/to/white_list_sample2.txt"
    ]
}
```
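`sample2json.py` ships with the repo; a minimal sketch of what it does, assuming the `meta.tsv` layout above, could look like this (the real script may differ):

```python
# illustrative sketch of sample2json.py, assuming the meta.tsv columns shown above
import csv
import json
import sys

samples = {}
with open(sys.argv[1]) as f:
    reader = csv.DictReader(f, delimiter="\t")
    for row in reader:
        # map each sample to [bam, cluster csv, white list], as in samples.json
        samples[row["sample"]] = [row["bam_path"], row["cluster_id.csv"], row["white_list"]]

with open("samples.json", "w") as f:
    json.dump(samples, f, indent=4)
```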
- Run the workflow:

```bash
# dry run
snakemake -np

## real run on the local machine
snakemake -j 10

## if you want to use a singularity image that contains presto for the differential accessibility test
snakemake -j 10 --use-singularity

## submit to slurm (you can change this script for your own HPC)
./pyflow-scATACseq.sh
```
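The submission script is in the repo; if you need to adapt it for your own scheduler, a bare-bones equivalent using snakemake's generic `--cluster` flag might look like this (partition, time, and memory are placeholders):

```bash
#!/bin/bash
# minimal sketch of a slurm submission wrapper; adjust resources for your cluster
snakemake -j 100 \
    --cluster "sbatch -p general -t 12:00:00 --mem=16g" \
    --latency-wait 60 \
    --rerun-incomplete
```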
## To-do list

- [ ] Check that file paths exist and have the right format.
- [ ] Set up a conda env for the Snakefile.
- [ ] Provide a docker container.
- [ ] Motif analysis.
- [ ] Find differential peaks using presto.
- [ ] Associate peaks with genes using Cicero.
- [ ] Gene set enrichment.
## Other Notes

More thoughts on pileup and shifting reads:

- https://github.com/deeptools/deepTools/issues/453 (`maxFragmentLength`)
- https://github.com/deeptools/deepTools/issues/370
- https://groups.google.com/forum/#!topic/deeptools/JU9itiT5rYk

You probably want `--Offset 1`. That is not shifting exactly; for shifting, use `--ATACshift` from [alignmentSieve](https://deeptools.readthedocs.io/en/develop/content/tools/alignmentSieve.html). Filter out fragments?
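For reference, a minimal `alignmentSieve` call following the docs linked above might look like this (file names are placeholders; the shifted output is not coordinate-sorted, so re-sort and index it before use):

```bash
# the input bam must be coordinate-sorted and indexed
alignmentSieve -b possorted_bam.bam -o shifted.bam --ATACshift
samtools sort -o shifted.sorted.bam shifted.bam
samtools index shifted.sorted.bam
```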
## Code Snippets
`scripts/presto.R`:

```r
'Finding Differential accessible regions from one or more sparse matrix
The number of mtx sparse matrix should be the same as the cluster information
csv files and have the same basename. e.g. sample1.mtx and sample1.csv

Usage:
    presto.R --mtx=<mtx>... --cluster=<cluster>... <output>
    presto.R --mtx=sample1.mtx --cluster=sample1.csv sample1.txt
    presto.R --mtx=sample1.mtx --mtx=sample2.mtx --cluster=sample1.csv --cluster=sample2.csv sample.txt

Options:
    --mtx=<mtx>...          input file path of the sparse matrix
    --cluster=<cluster>...  input file path of the cluster information. a csv file
                            with a header. column 1 is the cell barcode, column 2
                            is the cluster id.
    -h --help               Show this screen.
    -v --version            Show version.

Arguments:
    output                  output file path of the differential peaks
' -> doc

suppressMessages(library(docopt))
suppressMessages(library(Seurat))
suppressMessages(library(Matrix))
suppressMessages(library(presto))
suppressMessages(library(tidyverse))

arguments <- docopt(doc, version = 'presto.R v1.0\n\n')

num_of_matrix <- length(arguments$mtx)
num_of_csv <- length(arguments$cluster)

if (num_of_matrix != num_of_csv){
  stop("the number of matrix must be the same with the number of cluster information csv file")
}

check_basename <- function(mtx, csv){
  return(gsub(".mtx", "", basename(mtx)) == gsub(".csv", "", basename(csv)))
}

if (!all(purrr::map2_lgl(arguments$mtx, arguments$cluster, check_basename))){
  stop("the basename of the mtx and the cluster info csv file do not match")
}

read_mtx <- function(mtx){
  recount <- readMM(mtx)
  peaks <- read_tsv(paste0(mtx, ".sites"), col_names = FALSE)
  cells <- read_tsv(paste0(mtx, ".cells"), col_names = FALSE)
  prefix <- gsub(".mtx", "", basename(mtx))
  cells <- paste0(prefix, "_", cells$X1)
  rownames(recount) <- peaks$X1
  colnames(recount) <- cells
  return(recount)
}

message(paste("reading in the sparse matrix:", arguments$mtx, "\n"))
mtxs <- purrr::map(arguments$mtx, read_mtx)

if (!length(unique(purrr::map(mtxs, rownames))) == 1){
  stop("all matrix should have same row/peak name")
}

recount.mat <- purrr::reduce(mtxs, cbind)

makeSeuratObject <- function(recount.mat){
  recount.TFIDF.mat <- TF.IDF(recount.mat)
  recount.TFIDF.mat <- Seurat:::LogNorm(data = recount.TFIDF.mat, scale_factor = 1e4)
  rownames(recount.TFIDF.mat) <- rownames(recount.mat)
  colnames(recount.TFIDF.mat) <- colnames(recount.mat)
  recount.atac <- CreateSeuratObject(counts = recount.TFIDF.mat, assay = 'ATAC', project = "recount")
  return(recount.atac)
}

### transferred label information
message(paste("reading in the cluster information csv files:", arguments$cluster, "\n"))

read_id <- function(csv){
  id <- read_csv(csv)
  prefix <- gsub(".csv", "", basename(csv))
  id <- id %>%
    mutate(cell = paste0(prefix, "_", name)) %>%
    dplyr::rename(cluster_id = value)
  return(id)
}

cluster_ids <- purrr::map(arguments$cluster, read_id)
cluster_ids <- bind_rows(cluster_ids)

message("making Seurat ATAC objects")
recount.atac <- makeSeuratObject(recount.mat)
recount.atac[["cell"]] <- rownames(recount.atac@meta.data)

old_meta <- recount.atac@meta.data
new_meta <- left_join(recount.atac@meta.data, cluster_ids)

if (!identical(rownames(old_meta), new_meta$cell)){
  stop("The name of the cells are different")
}

rownames(new_meta) <- rownames(old_meta)
recount.atac@meta.data <- new_meta

# library(devtools)
# install_github('immunogenomics/presto')

# this counts assay is the TF-IDF normalized and log transformed counts
# see how long it takes for 1.3 million features
message("Finding differential accessible regions by presto")
start_time <- Sys.time()
res <- wilcoxauc(recount.atac, 'cluster_id', seurat_assay = 'ATAC', assay = "counts")
end_time <- Sys.time()
message("presto takes")
end_time - start_time
# Time difference of 2.933884 mins !! impressive

message(paste("writing differential peaks to", arguments$output, "\n"))
write_tsv(res %>% arrange(padj, desc(auc)), arguments$output)
```
`scripts/split_scATAC_bam_by_cluster.py`:

```python
import pysam
import csv
import argparse
import os.path
import sys

parser = argparse.ArgumentParser()
parser.add_argument("csv", help="Required. the FULL path to the cluster csv file with header, \
first column is the cell barcode, second column is the cluster id")
parser.add_argument("bam", help="Required. the FULL path to the 10x scATAC bam file generated \
by cellranger-atac count")
parser.add_argument("-prefix", help="Optional, the prefix of the output bam, default is cluster_id.bam")
parser.add_argument("-outdir", help="Optional, the output directory for the splitted bams, default is current dir")

args = parser.parse_args()

if os.path.exists(args.csv):
    pass
else:
    print("csv file is not found")
    sys.exit(1)

if os.path.exists(args.bam):
    pass
else:
    print("10x scATAC bam not found")
    sys.exit(1)

if args.outdir:
    if os.path.isdir(args.outdir):
        pass
    else:
        try:
            os.mkdir(args.outdir)
        except OSError:
            print("can not create directory {}".format(args.outdir))

cluster_dict = {}
with open(args.csv) as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    # skip header
    header = next(csv_reader)
    for row in csv_reader:
        cluster_dict[row[0]] = row[1]

clusters = set(x for x in cluster_dict.values())

fin = pysam.AlignmentFile(args.bam, "rb")

# open the same number of bam files as clusters, and map the out file handler
# to the cluster id; write to a bam with "wb"
fouts_dict = {}
for cluster in clusters:
    if args.prefix:
        fout_name = args.prefix + "_" + cluster + ".bam"
    else:
        fout_name = cluster + ".bam"
    if args.outdir:
        fout = pysam.AlignmentFile(os.path.join(args.outdir, fout_name), "wb", template = fin)
    else:
        fout = pysam.AlignmentFile(fout_name, "wb", template = fin)
    fouts_dict[cluster] = fout

for read in fin:
    tags = read.tags
    # the 8th item is the CB tag
    CB_list = [x for x in tags if x[0] == "CB"]
    if CB_list:
        cell_barcode = CB_list[0][1]
    else:
        continue
    # the bam files may contain reads not in the final clustered barcodes;
    # cluster_id will be None if the barcode is not in the clusters.csv file
    cluster_id = cluster_dict.get(cell_barcode)
    if cluster_id:
        fouts_dict[cluster_id].write(read)

## do not forget to close the files
fin.close()
for fout in fouts_dict.values():
    fout.close()
```
Shell commands from the rules in the Snakefile:

```
shell:
    """
    scripts/split_scATAC_bam_by_cluster.py -prefix {wildcards.sample} -outdir 01split_bam/{wildcards.sample} {input[1]} {input[0]}
    touch {output}
    """
```

```
shell:
    """
    samtools index 01split_bam/{wildcards.sample}/{wildcards.sample}_{wildcards.cluster_id}.bam
    """
```

```
shell:
    """
    bamCoverage -b 01split_bam/{wildcards.sample}/{wildcards.sample}_{wildcards.cluster_id}.bam -p {threads} {params.custom} -o {output}
    """
```

```
shell:
    """
    samtools merge -@ 2 -r {output} {params.bam}
    """
```

```
shell:
    """
    samtools index {input}
    """
```

```
shell:
    """
    bamCoverage -b {input[0]} -p {threads} {params.custom} -o {output}
    """
```

```
shell:
    """
    samtools sort -n -m 2G -@ {threads} -T {wildcards.cluster_id} \
        -o {output} \
        {input[0]} 2> {log}
    """
```

```
shell:
    """
    Genrich -t {input} -o {output} {params.custom} 2> {log}
    """
```

```
shell:
    """
    bedtools intersect -a {input[0]} -b <(zcat {input[1]}) -v > {output}
    """
```

```
shell:
    """
    # the 10th column is the peak summit
    cat {input} | awk -v OFS="\t" '$2=$2+$10,$3=$2+1' > {output.summit}
    bedtools slop -i {output.summit} -g {config[genome_size]} -b {config[extend_size]} > {output.extend_summit}
    """
```

```
shell:
    """
    samtools sort -n -m 2G -@ {threads} -T {wildcards.sample}_{wildcards.cluster_id} \
        -o {output} \
        01split_bam/{wildcards.sample}/{wildcards.sample}_{wildcards.cluster_id}.bam 2> {log}
    """
```

```
shell:
    """
    Genrich -t {input} -o {output} {params.custom} 2> {log}
    """
```

```
shell:
    """
    bedtools intersect -a {input[0]} -b <(zcat {input[1]}) -v > {output}
    """
```

```
shell:
    """
    # the 10th column is the peak summit
    cat {input} | awk -v OFS="\t" '$2=$2+$10,$3=$2+1' > {output.summit}
    bedtools slop -i {output.summit} -g {config[genome_size]} -b {config[extend_size]} > {output.extend_summit}
    """
```

```
shell:
    """
    # get rid of unconventional chromosomes and natural sort the chr (chr10 after chr9)
    cat {input} | sort -k1,1V -k2,2n | bedtools merge | grep -v "_" > {output}
    """
```

```
shell:
    """
    scripts/rcbbc -w {params.white_list} -p 07recount/{wildcards.sample}/{wildcards.sample} {input.bed} {input.bam}
    """
```

```
shell:
    """
    # remove unconventional chromosomes and natural sort by -V
    cat {input} | sort -k1,1V -k2,2n | bedtools merge | grep -v "_" > {output}
    """
```

```
shell:
    """
    scripts/rcbbc -w {params.white_list} -p 07recount_all/{wildcards.sample}/{wildcards.sample} {input.bed} {input.bam}
    """
```

```
shell:
    """
    Rscript scripts/presto.R --mtx {input.mtx} --cluster {input.csv} {output}
    """
```

```
shell:
    """
    mtxs=""
    for mtx in {input.mtxs}; do
        mtxs+=" --mtx=$mtx"
    done

    csvs=""
    for csv in {input.csvs}; do
        csvs+=" --cluster=$csv"
    done

    Rscript scripts/presto.R $mtxs $csvs {output} > {log} 2>&1
    """
```