Cell cycle analysis of single-cell proteomic and transcriptomic data for the FUCCI cell model -- RNA-Seq analysis pipeline
Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation, topic
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
Single-cell proteogenomic analysis
This repository contains the snakemake pipeline for analyzing the RNA sequencing data for ~1k single cells. The results of this single-cell RNA-Seq analysis provide a transcriptomic context to a proteomic analysis based on immunofluorescence staining of ~200k individual cells. For the code used to perform that single-cell proteogenomic analysis of the human cell cycle, please see the CellProfiling/SingleCellProteogenomics repository.
Single-cell sequencing files
The single-cell RNA-Seq data is available at GEO SRA under project number GSE146773 .
This data is downloaded automatically in this pipeline.
Updating the Ensembl version
The genome and Ensembl versions are located at the top of the file
workflow/config/FucciSingleCell.yaml
.
These can be updated, and the references will be downloaded automatically.
Usage
-
Clone repository and initialize submodules:
git clone --recurse-submodules https://github.com/CellProfiling/FucciSingleCellSeqPipeline.git && cd FucciSingleCellSeqPipeline/workflow
-
Install conda: https://docs.conda.io/en/latest/miniconda.html
-
Create and activate setup environment:
conda env create -n fuccisetup -f envs/setup.yaml && conda activate fuccisetup
-
Run the workflow:
snakemake --use-conda --conda-frontend mamba --cores 24 --resources mem_mb=100000
, where you can subsitute the max number of cores and max memory allocation. At least 54 GB of free memory should be available.
Usage on cluster
In place of installing conda, you may need to activate it as a module, such as by
module load conda
and then follow the instructions to initialize it.
Adapt
config/cluster_config.yaml
for your needs.
In place of the last step above, you can use the scheduler like this:
snakemake -j 500 --cores 16 --cluster-config config/cluster_config.yaml --latency-wait 60 --keep-going --use-conda --conda-frontend mamba --cluster "sbatch -t {cluster.time} -N {cluster.nodes} --cpus-per-task {threads} -p {cluster.partition}"
Usage on protected access cluster
-
Clone repository and initialize submodules on your local machine:
git clone --recurse-submodules https://github.com/CellProfiling/FucciSingleCellSeqPipeline.git && cd FucciSingleCellSeqPipeline/workflow
-
Install conda: https://docs.conda.io/en/latest/miniconda.html
-
Create and activate setup environment:
conda env create -n fuccisetup -f envs/setup.yaml && conda activate fuccisetup
-
If running the pipeline on protected access computer, predownload files by running
snakemake -j 16 ../results/setup.txt
on a machine with internet access. -
Make a tarball of the project with
cd ../.. && tar -cxvf FucciSingleCellSeqPipeline.zip FucciSingleCellSeqPipeline
and transfer it to the protected access cluster. -
Load conda as a module on the protected access cluster, such as with
module load conda
, and follow the instructions to activate it. -
Create and activate setup environment:
conda env create -n fuccisetup -f envs/setup.yaml && conda activate fuccisetup
-
Adapt
config/cluster_config.yaml
for your needs. -
Use the scheduler from snakemake like this:
snakemake -j 500 --cores 16 --cluster-config config/cluster_config.yaml --latency-wait 60 --keep-going --use-conda --conda-frontend mamba --cluster "sbatch -A {cluster.account} -t {cluster.time} -N {cluster.nodes} --cpus-per-task {threads} -p {cluster.partition}"
Citation
Mahdessian, D.*; Cesnik, A. J.*; Gnann, C.; Danielsson, F.; Stenström, L.; Arif, M.; Zhang, C.; Le, T.; Johansson, F.; Shutten, R.; Bäckström, A.; Axelsson, U.; Thul, P.; Cho, N. H.; Carja, O.; Uhlén, M.; Mardinoglu, A.; Stadler, C.; Lindskog, C.; Ayoglu, B.; Leonetti, M. D.; Pontén, F.; Sullivan, D. P.; Lundberg, E. “Spatiotemporal dissection of the cell cycle with single cell proteogenomics.” Nature, 2021, 590, 649–654. *Contributed equally. https://www.nature.com/articles/s41586-021-03232-9
Code Snippets
18 19 20 21 | shell: "(STAR --runMode genomeGenerate --runThreadN {threads} --genomeDir {params.genomedir}" " --genomeFastaFiles {input.efa} {input.gfa} --sjdbGTFfile {input.gtf}" " {params.sjoptions} {params.size}) &> {log}" |
36 37 38 39 40 | shell: "(STAR --genomeLoad NoSharedMemory --genomeDir {params.genomedir}" " --runThreadN {threads} {params.bam} " " --outFileNamePrefix {params.outfolder}/{wildcards.sra}" " --readFilesIn <(zcat {input.fastq}) ) &> {log}" |
57 58 59 60 | shell: "(STAR --runMode genomeGenerate --runThreadN {threads} --genomeDir {params.genomedir}" " --genomeFastaFiles {input.efa} {input.gfa} --sjdbGTFfile {input.gtf} {params.sjoptions}" " --sjdbFileChrStartEnd {input.jj} {params.size}) &> {log}" |
84 85 86 87 88 | shell: "(STAR {params.mode} --genomeDir {params.genomedir}" " --runThreadN {threads} {params.bam} {params.transcripts}" " --outFileNamePrefix {params.outfolder}/{wildcards.sra}" " --readFilesIn <(zcat {input.fastq}) ) &> {log}" |
18 19 20 21 22 | shell: "((wget -O - {params.primary} || wget -O - {params.toplevel}) | gunzip -c - > {output.gfa} && " "wget -O - {params.gff} | gunzip -c - > {output.gff3} && " "wget -O - {params.pep} | gunzip -c - > {output.pfa} && " "wget -O - {params.gtf} | gunzip -c - > {output.gtf}) 2> {log}" |
31 | shell: "python scripts/fix_gff3_for_rsem.py {input} {output} &> {log}" |
40 | shell: "python scripts/fix_gff3_for_rsem.py {input} {output} &> {log}" |
49 | shell: "grep \"^ERCC\\|^#\\|^20\\|^21\\|^22\" {input} > {output}" |
58 | shell: "grep \"^ERCC\\|^#\\|^20\\|^21\\|^22\" {input} > {output}" |
67 | shell: "python scripts/filter_fasta.py {input} {output}" |
76 77 78 | shell: "prefetch {wildcards.sra}" " --output-directory {params.outdir} &> {log}" |
87 88 89 | shell: "fastq-dump -I --outdir {params.outdir} --split-files {input} && " "mv {params.outdir}/{wildcards.sra}_1.fastq {output} &> {log}" |
104 105 106 107 108 | shell: "fastp -q {params.quality}" " -i {input} -o {output.fq}" " -h {output.html} -j {output.json}" " -w {threads} -R {params.title} --detect_adapter_for_pe &> {log}" |
8 | shell: "grep -v ERCC_ID {input} > {output} 2> {log}" |
25 26 27 28 | shell: "(rsem-prepare-reference --num-threads {threads} --gtf {input.gtf} " " --star" # TODO: figure out how to use transcript STAR output " \"{input.efa}\",\"{input.gfa}\" {params.prefix}) &> {log}" |
51 52 53 54 55 56 | shell: "(rsem-calculate-expression --num-threads {threads}" " --star" " {params.options} <(zcat {input.fq})" #{input.bam}" " {params.prefix}" " {params.quant}/{wildcards.sra}) &> {log}" |
77 78 79 | shell: "python scripts/make_rsem_dataframe.py genes {input.gtf} {input.srr_lookup} {input.series_matrix}" " {output.counts} {output.tpms} {output.names} {output.ids} &> {log}" |
100 101 102 | shell: "python scripts/make_rsem_dataframe.py isoforms {input.gtf} {input.srr_lookup} {input.series_matrix}" " {output.counts} {output.tpms} {output.names} {output.ids} &> {log}" |
10 11 | shell: "(gdown -O {output}.zip" |
42 43 44 | shell: "(mkdir -p {output} && cp -r {params.rnadir}/* {output} && " "cp {input.quant} {input.ids} {input.velocity} {params.rnadir}) &> {log}" |
52 | shell: "cd ../results && python ../SingleCellProteogenomics/1_ProteinCellCycleClusters.py &> {log}" |
60 | shell: "cd ../results && python ../SingleCellProteogenomics/2_ProteinFucciPsuedotime.py --quicker &> {log}" |
68 | shell: "cd ../results && python ../SingleCellProteogenomics/3_RNAFucciPseudotime.py --quicker &> {log}" |
76 | shell: "cd ../results && python ../SingleCellProteogenomics/4_TemporalDelay.py &> {log}" |
84 | shell: "cd ../results && python ../SingleCellProteogenomics/5_ProteinProperties.py &> {log}" |
97 98 99 | shell: "(cp {input.protein} {output.protein} && " "cp {input.rna} {output.rna}) &> {log}" |
15 16 17 | shell: "velocyto run-smartseq2 --outputfolder {params.outfolder}" " --sampleid {params.sampleid} {input.bams} {input.gtf} &> {log}" |
28 | shell: "(" + "\n".join(["{params.commandpart1} " + sra + " {params.commandpart2} >> {output}" for sra in config['sra']]) + ") 2> {log}" |
40 | shell: "python scripts/make_a_obs.py {input.loom} {input.srr_lookup} {input.series_matrix} {output} 2> {log}" |
1 2 3 4 5 6 7 8 9 10 11 12 | import sys, io from Bio import SeqIO from Bio.Seq import Seq from Bio.SeqRecord import SeqRecord fasta=sys.argv[1] output=sys.argv[2] fasta_sequences = SeqIO.parse(open(fasta),'fasta') with open(output,"w") as out: for seq_record in fasta_sequences: if seq_record.id.startswith("20") or seq_record.id.startswith("21") or seq_record.id.startswith("22"): out.write(seq_record.format("fasta")) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 | import sys import io from Bio import SeqIO from Bio.Seq import Seq from Bio.SeqRecord import SeqRecord isgtf = sys.argv[2].endswith("gtf") with open(sys.argv[2], "w") as outff: with open(sys.argv[1]) as ff: print(f"Reading {sys.argv[1]}") for line in ff: lins = line.split("\t") if line.startswith("#"): continue elif lins[8].startswith("ID=gene:"): lins[2] = "gene" outff.write("\t".join(lins)) ercc = SeqIO.parse(open("../resources/ERCC.fa"),'fasta') for seq_record in ercc: if len(seq_record.seq) > 0: id = seq_record.id.split()[0] gene_fields = [str(x) for x in [id, "ERCC", "gene", 1, len(seq_record.seq), ".", "+", ".", f"ID=gene:{id};Name={id};biotype=;description={''.join(seq_record.id.split()[1:])};gene_id={id}"]] transcript_fields = [str(x) for x in [id, "ERCC", "transcript", 1, len(seq_record.seq), ".", "+", ".", f"ID=transcript:{id};Parent=gene:{id};Name={id};biotype=;transcript_id={id}"]] exon_fields = [str(x) for x in [id, "ERCC", "exon", 1, len(seq_record.seq), ".", "+", ".", f"Parent=transcript:{id};Name={id};exon_id={id}"]] transcript_fields_gtf = [str(x) for x in [id, "ERCC", "transcript", 1, len(seq_record.seq), ".", "+", ".", f"gene_id \"gene:{id}\"; transcript_id \"transcript:{id}\"; gene_name \"{id}\"; transcript_name \"{id}\"; gene_biotype \"\"; transcript_biotype \"\";"]] exon_fields_gtf = [str(x) for x in [id, "ERCC", "exon", 1, len(seq_record.seq), ".", "+", ".", f"gene_id \"gene:{id}\"; transcript_id \"transcript:{id}\"; exon_number \"1\"; gene_name \"{id}\"; transcript_name \"{id}\"; gene_biotype \"\"; transcript_biotype \"\";"]] if isgtf: outff.write("\t".join(transcript_fields_gtf) + "\n") outff.write("\t".join(exon_fields_gtf) + "\n") else: outff.write("\t".join(gene_fields) + "\n") outff.write("\t".join(transcript_fields) + "\n") outff.write("\t".join(exon_fields) + "\n") |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 | import scvelo as scv import sys, io import numpy as np from srr_functions import create_srr_to_cell aloomfile = sys.argv[1] srr_lookup = sys.argv[2] series_matrix = sys.argv[3] output = sys.argv[4] aloom = scv.read_loom(aloomfile) aobs = aloom.obs_names srr_to_cell = create_srr_to_cell(series_matrix, srr_lookup) # Use aobs to look up cell name from cells using the SRR link somehow acells = [srr_to_cell[x.split(':')[1].split('Aligned')[0]] for x in aobs] with open(output, 'w') as output: output.write('well_plate\n') output.write('\n'.join(acells)) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 | import sys import numpy as np import pandas as pd import os import glob from srr_functions import create_srr_to_cell USAGE = "python make_rsem_dataframe.py <level> <gtf> <srrLookup> <seriesMatrix> <outCountsFile> <outTpmsFile> <namesOut> <idsOut>" if len(sys.argv) != 9: print(USAGE); exit(); level, gtf, srr_lookup, series_matrix, outcounts, outtpms, outn, outids = sys.argv[1:] print(f"getting gene names from {gtf}") with open(gtf) as gtfhandle: line_ct = sum([1 for line in gtfhandle]) gene_id_to_name, gene_id_to_biotype = {}, {} transcript_id_to_name, transcript_id_to_biotype, transcript_id_to_gene = {}, {}, {} with open(gtf) as gtfhandle: for i, line in enumerate(gtfhandle): if i % 50000 == 0: print(f"Processed {i} lines out of {line_ct} from {gtf}.") if not line.startswith('#'): line = line.strip().rstrip(';').split('\t') if len(line) < 9 or line[2] != "transcript": continue attribute_list = line[8].replace('; ', ';').split(';') attributes = {} for item in attribute_list: item = [x.strip('"') for x in item.replace(" \"",";").split(';')] if len(item) < 2: print(f"Error: not a key-value pair: {attributes} {item} in line:\n{line}") attributes[item[0]] = item[1] if not "gene_name" in attributes: attributes["gene_name"] = attributes["gene_id"] if not "transcript_name" in attributes: attributes["transcript_name"] = attributes["transcript_id"] if any([x not in attributes for x in ["gene_id", "gene_name", "transcript_id", "transcript_name"]]): print("Error: \"gene_id\", \"gene_name\", \"transcript_id\", or \"transcript_name\" not found in line:\n" + "\t".join(line)) exit(1) gene_id_to_name[attributes["gene_id"].strip("gene:")] = attributes["gene_name"] gene_id_to_biotype[attributes["gene_id"].strip("gene:")] = attributes["gene_biotype"] if "gene_biotype" in attributes else "" transcript_id_to_name[attributes["transcript_id"].strip("transcript:")] = attributes["transcript_name"] transcript_id_to_biotype[attributes["transcript_id"].strip("transcript:")] = attributes["transcript_biotype"] if "transcript_biotype" in attributes else "" transcript_id_to_gene[attributes["transcript_id"].strip("transcript:")] = attributes["gene_id"] print(f"globbing ../results/quant/*.{level}.results") files = glob.glob(f"../results/quant/*.{level}.results") line_ct = sum(1 for line in open(files[0])) print("Making SRR-to-cell lookup...") srr_to_cell = create_srr_to_cell(series_matrix, srr_lookup) def get_prefix(file): srr = os.path.basename(file).split(".")[0].split("_")[0] return srr_to_cell[srr] doUseGene = level.startswith("gene") ids = [line.split('\t')[0].strip("gene:").strip("transcript:") for line in open(files[0])] ids[0] = "gene_id" if doUseGene else "transcript_id" names = [ (gene_id_to_name[id] if id in gene_id_to_name else id) if doUseGene else \ (transcript_id_to_name[id] if id in transcript_id_to_name else id) \ for id in ids[1:]] biotypes = [ (gene_id_to_biotype[id] if id in gene_id_to_biotype else "") if doUseGene else \ (transcript_id_to_biotype[id] if id in transcript_id_to_biotype else "") \ for id in ids[1:]] def save_output(outfilename, id_array, in_files, col_idx, doUseGene): print(f"Reading data files to create {outfilename} ...") with open(outfilename, "w") as ensg_output_file, open(outfilename + ".ercc.csv", "w") as ercc_output_file: ensg_output_file.write(",".join([""] + id_array[1:]) + "\n") ercc_output_file.write(",".join([""] + id_array[1:]) + "\n") prefixes = [get_prefix(file) for file in in_files] file_order = np.argsort(prefixes) for file in np.array(in_files)[file_order]: prefix = get_prefix(file) print(f"reading {prefix}") df = pd.read_csv(file, sep='\t', header=None, usecols=[0, col_idx], names=['id', 'value']) df['id'] = df['id'].str.strip('gene:').str.strip('transcript:') df = df.set_index('id').transpose() df.index = [prefix] ensg_df = df.filter(regex="ENSG" if doUseGene else "ENST") ercc_df = df.filter(regex="ERCC") ensg_output_file.write(ensg_df.to_csv(header=False, index=True)) ercc_output_file.write(ercc_df.to_csv(header=False, index=True)) print(f"Data saved to {outfilename} and {outfilename}.ercc.csv") for (filename, col_idx) in [(outcounts, 4), (outtpms, 5)]: save_output(filename, ids, files, col_idx, doUseGene) print(f"Saving to {outn} ...") np.savetxt(outn, np.column_stack((ids[1:], names, biotypes)), delimiter=",", fmt="%s") print(f"Saving to {outids} ...") np.savetxt(outids, np.column_stack((list(transcript_id_to_gene.keys()), list(transcript_id_to_gene.values()))), delimiter=",", fmt="%s") print("Done.") |
42 | shell: "touch {output} 2> {log}" |
Support
- Future updates
Related Workflows





