Cell cycle analysis of single-cell proteomic and transcriptomic data for the FUCCI cell model -- RNA-Seq analysis pipeline

public 1yr ago Version: 1.3 0 bookmarks

View Workflow

fuccisinglecellseqpipeline — View Workflow

Help improve this workflow!

This workflow has been published but could be further improved with some additional meta data:

Keyword(s) in categories input, output, operation, topic

You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .

Single-cell proteogenomic analysis

This repository contains the snakemake pipeline for analyzing the RNA sequencing data for ~1k single cells. The results of this single-cell RNA-Seq analysis provide a transcriptomic context to a proteomic analysis based on immunofluorescence staining of ~200k individual cells. For the code used to perform that single-cell proteogenomic analysis of the human cell cycle, please see the CellProfiling/SingleCellProteogenomics repository.

Single-cell sequencing files

The single-cell RNA-Seq data is available at GEO SRA under project number GSE146773 .

This data is downloaded automatically in this pipeline.

Updating the Ensembl version

The genome and Ensembl versions are located at the top of the file workflow/config/FucciSingleCell.yaml . These can be updated, and the references will be downloaded automatically.

Usage

Clone repository and initialize submodules: git clone --recurse-submodules https://github.com/CellProfiling/FucciSingleCellSeqPipeline.git && cd FucciSingleCellSeqPipeline/workflow
Install conda: https://docs.conda.io/en/latest/miniconda.html
Create and activate setup environment: conda env create -n fuccisetup -f envs/setup.yaml && conda activate fuccisetup
Run the workflow: snakemake --use-conda --conda-frontend mamba --cores 24 --resources mem_mb=100000 , where you can subsitute the max number of cores and max memory allocation. At least 54 GB of free memory should be available.

Usage on cluster

In place of installing conda, you may need to activate it as a module, such as by module load conda and then follow the instructions to initialize it.

Adapt config/cluster_config.yaml for your needs.

In place of the last step above, you can use the scheduler like this: snakemake -j 500 --cores 16 --cluster-config config/cluster_config.yaml --latency-wait 60 --keep-going --use-conda --conda-frontend mamba --cluster "sbatch -t {cluster.time} -N {cluster.nodes} --cpus-per-task {threads} -p {cluster.partition}"

Usage on protected access cluster

Clone repository and initialize submodules on your local machine: git clone --recurse-submodules https://github.com/CellProfiling/FucciSingleCellSeqPipeline.git && cd FucciSingleCellSeqPipeline/workflow
Install conda: https://docs.conda.io/en/latest/miniconda.html
Create and activate setup environment: conda env create -n fuccisetup -f envs/setup.yaml && conda activate fuccisetup
If running the pipeline on protected access computer, predownload files by running snakemake -j 16 ../results/setup.txt on a machine with internet access.
Make a tarball of the project with cd ../.. && tar -cxvf FucciSingleCellSeqPipeline.zip FucciSingleCellSeqPipeline and transfer it to the protected access cluster.
Load conda as a module on the protected access cluster, such as with module load conda , and follow the instructions to activate it.
Create and activate setup environment: conda env create -n fuccisetup -f envs/setup.yaml && conda activate fuccisetup
Adapt config/cluster_config.yaml for your needs.
Use the scheduler from snakemake like this: snakemake -j 500 --cores 16 --cluster-config config/cluster_config.yaml --latency-wait 60 --keep-going --use-conda --conda-frontend mamba --cluster "sbatch -A {cluster.account} -t {cluster.time} -N {cluster.nodes} --cpus-per-task {threads} -p {cluster.partition}"

Citation

Mahdessian, D.*; Cesnik, A. J.*; Gnann, C.; Danielsson, F.; Stenström, L.; Arif, M.; Zhang, C.; Le, T.; Johansson, F.; Shutten, R.; Bäckström, A.; Axelsson, U.; Thul, P.; Cho, N. H.; Carja, O.; Uhlén, M.; Mardinoglu, A.; Stadler, C.; Lindskog, C.; Ayoglu, B.; Leonetti, M. D.; Pontén, F.; Sullivan, D. P.; Lundberg, E. “Spatiotemporal dissection of the cell cycle with single cell proteogenomics.” Nature, 2021, 590, 649–654. *Contributed equally. https://www.nature.com/articles/s41586-021-03232-9

Code Snippets

shell:
    "(STAR --runMode genomeGenerate --runThreadN {threads} --genomeDir {params.genomedir}"
    " --genomeFastaFiles {input.efa} {input.gfa} --sjdbGTFfile {input.gtf}"
    " {params.sjoptions} {params.size}) &> {log}"

SnakeMake From line 18 of rules/align.smk

shell:
    "(STAR --genomeLoad NoSharedMemory --genomeDir {params.genomedir}"
    " --runThreadN {threads} {params.bam} "
    " --outFileNamePrefix {params.outfolder}/{wildcards.sra}"
    " --readFilesIn <(zcat {input.fastq}) ) &> {log}"

SnakeMake From line 36 of rules/align.smk

shell:
    "(STAR --runMode genomeGenerate --runThreadN {threads} --genomeDir {params.genomedir}"
    " --genomeFastaFiles {input.efa} {input.gfa} --sjdbGTFfile {input.gtf} {params.sjoptions}"
    " --sjdbFileChrStartEnd {input.jj} {params.size}) &> {log}"

SnakeMake From line 57 of rules/align.smk

shell:
    "(STAR {params.mode} --genomeDir {params.genomedir}"
    " --runThreadN {threads} {params.bam} {params.transcripts}"
    " --outFileNamePrefix {params.outfolder}/{wildcards.sra}"
    " --readFilesIn <(zcat {input.fastq}) ) &> {log}"

SnakeMake From line 84 of rules/align.smk

shell:
    "((wget -O - {params.primary} || wget -O - {params.toplevel}) | gunzip -c - > {output.gfa} && "
    "wget -O - {params.gff} | gunzip -c - > {output.gff3} && "
    "wget -O - {params.pep} | gunzip -c - > {output.pfa} && "
    "wget -O - {params.gtf} | gunzip -c - > {output.gtf}) 2> {log}"

SnakeMake From line 18 of rules/downloads.smk

shell: "python scripts/fix_gff3_for_rsem.py {input} {output} &> {log}"

SnakeMake From line 31 of rules/downloads.smk

shell: "python scripts/fix_gff3_for_rsem.py {input} {output} &> {log}"

SnakeMake From line 40 of rules/downloads.smk

shell: "grep \"^ERCC\\|^#\\|^20\\|^21\\|^22\" {input} > {output}"

SnakeMake From line 49 of rules/downloads.smk

shell: "grep \"^ERCC\\|^#\\|^20\\|^21\\|^22\" {input} > {output}"

SnakeMake From line 58 of rules/downloads.smk

shell: "python scripts/filter_fasta.py {input} {output}"

SnakeMake From line 67 of rules/downloads.smk

shell:
    "prefetch {wildcards.sra}"
    " --output-directory {params.outdir} &> {log}"

SnakeMake From line 76 of rules/downloads.smk

shell:
    "fastq-dump -I --outdir {params.outdir} --split-files {input} && "
    "mv {params.outdir}/{wildcards.sra}_1.fastq {output} &> {log}"

SnakeMake From line 87 of rules/downloads.smk

shell:
    "fastp -q {params.quality}"
    " -i {input} -o {output.fq}"
    " -h {output.html} -j {output.json}"
    " -w {threads} -R {params.title} --detect_adapter_for_pe &> {log}"

SnakeMake fastp From line 104 of rules/downloads.smk

shell: "grep -v ERCC_ID {input} > {output} 2> {log}"

SnakeMake From line 8 of rules/quant.smk

shell:
    "(rsem-prepare-reference --num-threads {threads} --gtf {input.gtf} "
    " --star" # TODO: figure out how to use transcript STAR output
    " \"{input.efa}\",\"{input.gfa}\" {params.prefix}) &> {log}"

SnakeMake STAR From line 25 of rules/quant.smk

shell:
    "(rsem-calculate-expression --num-threads {threads}"
    " --star"
    " {params.options} <(zcat {input.fq})" #{input.bam}"
    " {params.prefix}"
    " {params.quant}/{wildcards.sra}) &> {log}"

SnakeMake From line 51 of rules/quant.smk

shell:
    "python scripts/make_rsem_dataframe.py genes {input.gtf} {input.srr_lookup} {input.series_matrix}"
    " {output.counts} {output.tpms} {output.names} {output.ids} &> {log}"

SnakeMake From line 77 of rules/quant.smk

shell:
    "python scripts/make_rsem_dataframe.py isoforms {input.gtf} {input.srr_lookup} {input.series_matrix}"
    " {output.counts} {output.tpms} {output.names} {output.ids} &> {log}"

SnakeMake From line 100 of rules/quant.smk

shell:
    "(gdown -O {output}.zip"

SnakeMake From line 10 of rules/singlecellproteogenomics.smk

shell:
    "(mkdir -p {output} && cp -r {params.rnadir}/* {output} && "
    "cp {input.quant} {input.ids} {input.velocity} {params.rnadir}) &> {log}"

SnakeMake From line 42 of rules/singlecellproteogenomics.smk

shell: "cd ../results && python ../SingleCellProteogenomics/1_ProteinCellCycleClusters.py &> {log}"

SnakeMake From line 52 of rules/singlecellproteogenomics.smk

shell: "cd ../results && python ../SingleCellProteogenomics/2_ProteinFucciPsuedotime.py --quicker &> {log}"

SnakeMake From line 60 of rules/singlecellproteogenomics.smk

shell: "cd ../results && python ../SingleCellProteogenomics/3_RNAFucciPseudotime.py --quicker &> {log}"

SnakeMake From line 68 of rules/singlecellproteogenomics.smk

shell: "cd ../results && python ../SingleCellProteogenomics/4_TemporalDelay.py &> {log}"

SnakeMake From line 76 of rules/singlecellproteogenomics.smk

shell: "cd ../results && python ../SingleCellProteogenomics/5_ProteinProperties.py &> {log}"

SnakeMake From line 84 of rules/singlecellproteogenomics.smk

shell:
    "(cp {input.protein} {output.protein} && "
    "cp {input.rna} {output.rna}) &> {log}"

SnakeMake From line 97 of rules/singlecellproteogenomics.smk

shell:
    "velocyto run-smartseq2 --outputfolder {params.outfolder}"
    " --sampleid {params.sampleid} {input.bams} {input.gtf} &> {log}"

SnakeMake From line 15 of rules/velo.smk

shell: "(" + "\n".join(["{params.commandpart1} " + sra + " {params.commandpart2} >> {output}" for sra in config['sra']]) + ") 2> {log}"

SnakeMake SRA From line 28 of rules/velo.smk

shell: "python scripts/make_a_obs.py {input.loom} {input.srr_lookup} {input.series_matrix} {output} 2> {log}"

SnakeMake From line 40 of rules/velo.smk

import sys, io
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
fasta=sys.argv[1]
output=sys.argv[2]
fasta_sequences = SeqIO.parse(open(fasta),'fasta')

with open(output,"w") as out:
    for seq_record in fasta_sequences:
        if seq_record.id.startswith("20") or seq_record.id.startswith("21") or seq_record.id.startswith("22"):
            out.write(seq_record.format("fasta"))

Python Biopython From line 1 of scripts/filter_fasta.py

import sys
import io
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

isgtf = sys.argv[2].endswith("gtf")
with open(sys.argv[2], "w") as outff:
    with open(sys.argv[1]) as ff:
        print(f"Reading {sys.argv[1]}")
        for line in ff:
            lins = line.split("\t")
            if line.startswith("#"): continue
            elif lins[8].startswith("ID=gene:"): lins[2] = "gene"
            outff.write("\t".join(lins))

    ercc = SeqIO.parse(open("../resources/ERCC.fa"),'fasta')
    for seq_record in ercc:
        if len(seq_record.seq) > 0:
            id = seq_record.id.split()[0]
            gene_fields = [str(x) for x in [id, "ERCC", "gene", 1, len(seq_record.seq), ".", "+", ".",
                f"ID=gene:{id};Name={id};biotype=;description={''.join(seq_record.id.split()[1:])};gene_id={id}"]]
            transcript_fields = [str(x) for x in [id, "ERCC", "transcript", 1, len(seq_record.seq), ".", "+", ".",
                f"ID=transcript:{id};Parent=gene:{id};Name={id};biotype=;transcript_id={id}"]]
            exon_fields = [str(x) for x in [id, "ERCC", "exon", 1, len(seq_record.seq), ".", "+", ".",
                f"Parent=transcript:{id};Name={id};exon_id={id}"]]
            transcript_fields_gtf = [str(x) for x in [id, "ERCC", "transcript", 1, len(seq_record.seq), ".", "+", ".",
                f"gene_id \"gene:{id}\"; transcript_id \"transcript:{id}\"; gene_name \"{id}\"; transcript_name \"{id}\"; gene_biotype \"\"; transcript_biotype \"\";"]]
            exon_fields_gtf = [str(x) for x in [id, "ERCC", "exon", 1, len(seq_record.seq), ".", "+", ".",
                f"gene_id \"gene:{id}\"; transcript_id \"transcript:{id}\"; exon_number \"1\"; gene_name \"{id}\"; transcript_name \"{id}\"; gene_biotype \"\"; transcript_biotype \"\";"]]
            if isgtf:
                outff.write("\t".join(transcript_fields_gtf) + "\n")
                outff.write("\t".join(exon_fields_gtf) + "\n")
            else:
                outff.write("\t".join(gene_fields) + "\n")
                outff.write("\t".join(transcript_fields) + "\n")
                outff.write("\t".join(exon_fields) + "\n")

Python Biopython From line 1 of scripts/fix_gff3_for_rsem.py

import scvelo as scv
import sys, io
import numpy as np
from srr_functions import create_srr_to_cell

aloomfile = sys.argv[1]
srr_lookup = sys.argv[2]
series_matrix = sys.argv[3]
output = sys.argv[4]

aloom = scv.read_loom(aloomfile)
aobs = aloom.obs_names

srr_to_cell = create_srr_to_cell(series_matrix, srr_lookup)

# Use aobs to look up cell name from cells using the SRR link somehow
acells = [srr_to_cell[x.split(':')[1].split('Aligned')[0]] for x in aobs]
with open(output, 'w') as output:
    output.write('well_plate\n')
    output.write('\n'.join(acells))

Python numpy scVelo From line 1 of scripts/make_a_obs.py

import sys
import numpy as np
import pandas as pd
import os
import glob
from srr_functions import create_srr_to_cell

USAGE = "python make_rsem_dataframe.py <level> <gtf> <srrLookup> <seriesMatrix> <outCountsFile> <outTpmsFile> <namesOut> <idsOut>"
if len(sys.argv) != 9: print(USAGE); exit();
level, gtf, srr_lookup, series_matrix, outcounts, outtpms, outn, outids = sys.argv[1:]

print(f"getting gene names from {gtf}")
with open(gtf) as gtfhandle:
    line_ct = sum([1 for line in gtfhandle])
gene_id_to_name, gene_id_to_biotype = {}, {}
transcript_id_to_name, transcript_id_to_biotype, transcript_id_to_gene = {}, {}, {}
with open(gtf) as gtfhandle:
    for i, line in enumerate(gtfhandle):
        if i % 50000 == 0: print(f"Processed {i} lines out of {line_ct} from {gtf}.")
        if not line.startswith('#'):
            line = line.strip().rstrip(';').split('\t')
            if len(line) < 9 or line[2] != "transcript": continue
            attribute_list = line[8].replace('; ', ';').split(';')
            attributes = {}
            for item in attribute_list:
                item = [x.strip('"') for x in item.replace(" \"",";").split(';')]
                if len(item) < 2:
                    print(f"Error: not a key-value pair: {attributes} {item} in line:\n{line}")
                attributes[item[0]] = item[1]
            if not "gene_name" in attributes:
                attributes["gene_name"] = attributes["gene_id"]
            if not "transcript_name" in attributes:
                attributes["transcript_name"] = attributes["transcript_id"]
            if any([x not in attributes for x in ["gene_id", "gene_name", "transcript_id", "transcript_name"]]):
                print("Error: \"gene_id\", \"gene_name\", \"transcript_id\", or \"transcript_name\" not found in line:\n" + "\t".join(line))
                exit(1)
            gene_id_to_name[attributes["gene_id"].strip("gene:")] = attributes["gene_name"]
            gene_id_to_biotype[attributes["gene_id"].strip("gene:")] = attributes["gene_biotype"] if "gene_biotype" in attributes else ""
            transcript_id_to_name[attributes["transcript_id"].strip("transcript:")] = attributes["transcript_name"]
            transcript_id_to_biotype[attributes["transcript_id"].strip("transcript:")] = attributes["transcript_biotype"] if "transcript_biotype" in attributes else ""
            transcript_id_to_gene[attributes["transcript_id"].strip("transcript:")] = attributes["gene_id"]

print(f"globbing ../results/quant/*.{level}.results")
files = glob.glob(f"../results/quant/*.{level}.results")
line_ct = sum(1 for line in open(files[0]))

print("Making SRR-to-cell lookup...")
srr_to_cell = create_srr_to_cell(series_matrix, srr_lookup)

def get_prefix(file):
    srr = os.path.basename(file).split(".")[0].split("_")[0]
    return srr_to_cell[srr]

doUseGene = level.startswith("gene")
ids = [line.split('\t')[0].strip("gene:").strip("transcript:") for line in open(files[0])]
ids[0] = "gene_id" if doUseGene else "transcript_id"
names = [
    (gene_id_to_name[id] if id in gene_id_to_name else id) if doUseGene else \
    (transcript_id_to_name[id] if id in transcript_id_to_name else id) \
    for id in ids[1:]]
biotypes = [
    (gene_id_to_biotype[id] if id in gene_id_to_biotype else "") if doUseGene else \
    (transcript_id_to_biotype[id] if id in transcript_id_to_biotype else "") \
    for id in ids[1:]]

def save_output(outfilename, id_array, in_files, col_idx, doUseGene):
    print(f"Reading data files to create {outfilename} ...")

    with open(outfilename, "w") as ensg_output_file, open(outfilename + ".ercc.csv", "w") as ercc_output_file:
        ensg_output_file.write(",".join([""] + id_array[1:]) + "\n")
        ercc_output_file.write(",".join([""] + id_array[1:]) + "\n")
        prefixes = [get_prefix(file) for file in in_files]
        file_order = np.argsort(prefixes)
        for file in np.array(in_files)[file_order]:
            prefix = get_prefix(file)
            print(f"reading {prefix}")

            df = pd.read_csv(file, sep='\t', header=None, usecols=[0, col_idx], names=['id', 'value'])
            df['id'] = df['id'].str.strip('gene:').str.strip('transcript:')
            df = df.set_index('id').transpose()
            df.index = [prefix]

            ensg_df = df.filter(regex="ENSG" if doUseGene else "ENST")
            ercc_df = df.filter(regex="ERCC")

            ensg_output_file.write(ensg_df.to_csv(header=False, index=True))
            ercc_output_file.write(ercc_df.to_csv(header=False, index=True))

    print(f"Data saved to {outfilename} and {outfilename}.ercc.csv")

for (filename, col_idx) in [(outcounts, 4), (outtpms, 5)]:
    save_output(filename, ids, files, col_idx, doUseGene)

print(f"Saving to {outn} ...")
np.savetxt(outn, np.column_stack((ids[1:], names, biotypes)), delimiter=",", fmt="%s")

print(f"Saving to {outids} ...")
np.savetxt(outids, 
    np.column_stack((list(transcript_id_to_gene.keys()), 
        list(transcript_id_to_gene.values()))), delimiter=",", fmt="%s")

print("Done.")