Cell cycle analysis of single-cell proteomic and transcriptomic data for the FUCCI cell model
This repository contains the code used to perform the single-cell proteogenomic analysis of the human cell cycle . This study was based on immunofluorescence staining of ~200k cells for single-cell analysis of proteomic heterogeneity and ~1k cells for analysis of single-cell RNA variability. These analyses were integrated with other proteomic studies and databases to investigate the functional importance of transcript-regulated and non-transcript regulated variability.
Structure of repository
The code is listed in order of execution, e.g. "1_", "2_" etc. The output of each script is used in the subsequent script. This workflow can also be run using snakemake (see below).
The logic for these analyses is contained in the
SingleCellProteogenomics
folder.
The input files are contained in the "input" folder. This folder is linked
here for release v1.2
as a zip file,
input.zip
. Expand this folder within the base directory of this repository. If you are looking for the raw imaging proteomic dataset produced after filtering artifacts and such, that is located
here
.
The output files are added to a folder "output" during the analysis, and figures are added to a folder "figures."
An R-script used to analyze skewness and kurtosis (noted in the Methods of the manuscript) is contained in the other_scripts folder. The
other_scripts/ProteinDisorder.py
script utilizes
IUPRED2A
and a
human UniProt
database.
Running the workflow using snakemake
This workflow can be run using
snakemake
:
-
Install Miniconda from https://docs.conda.io/en/latest/miniconda.html.
-
Install snakemake using
conda install -c conda-forge snakemake-minimal
. -
Within this directory, run
snakemake -j 1 --use-conda --snakefile workflow/Snakefile
.
Single-cell RNA-Seq analysis
The single-cell RNA-Seq data is available at GEO SRA under project number GSE146773 .
The cell cycle phase and FACS intensity information for these ~1,000 cells are contained in the
input folder
within three files, one per plate, starting with
RNAData/180911_Fucci_single cell seq_ss2-18-*.csv
.
The
snakemake
workflow used to analyze the scRNA-Seq dataset, including RNA velocity calculations and louvain unsupervised clustering, can be found in this repository: https://github.com/CellProfiling/FucciSingleCellSeqPipeline.
The
loom
file containing the results of RNA velocity analysis, including spliced and unspliced counts, can be found in the
input folder
under
RNAData/a.loom
, and the observation names used for each cell that match the "Well_Plate" identifiers can be found in
RNAData/a.obs_names.csv
.
Citation
Mahdessian, D.*; Cesnik, A. J.*; Gnann, C.; Danielsson, F.; Stenström, L.; Arif, M.; Zhang, C.; Le, T.; Johansson, F.; Shutten, R.; Bäckström, A.; Axelsson, U.; Thul, P.; Cho, N. H.; Carja, O.; Uhlén, M.; Mardinoglu, A.; Stadler, C.; Lindskog, C.; Ayoglu, B.; Leonetti, M. D.; Pontén, F.; Sullivan, D. P.; Lundberg, E. “Spatiotemporal dissection of the cell cycle with single cell proteogenomics.” Nature, 2021, 590, 649–654. *Contributed equally. https://www.nature.com/articles/s41586-021-03232-9
Code Snippets
10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 | from SingleCellProteogenomics import (ProteinDataPreparation, ProteinGaussianClustering) import matplotlib.pyplot as plt import seaborn as sbn # Make PDF text readable plt.rcParams["pdf.fonttype"] = 42 plt.rcParams["ps.fonttype"] = 42 plt.rcParams["savefig.dpi"] = 300 #%% Read in the protein data (methods in ProteinDataPreparation.py) my_df = ProteinDataPreparation.read_raw_data() plate, u_plate, well_plate, well_plate_imgnb, u_well_plates, ab_objnum, area_cell, area_nuc, area_cyto, ensg_dict, ab_dict, result_dict, compartment_dict, ENSG, antibody, result, compartment = ProteinDataPreparation.read_sample_info( my_df ) wp_ensg, wp_ab, wp_prev_ccd, wp_prev_notccd, wp_prev_negative, prev_ccd_ensg, prev_notccd_ensg, prev_negative_ensg = ProteinDataPreparation.previous_results( u_well_plates, result_dict, ensg_dict, ab_dict ) #%% Idea: Filter the raw data (methods in ProteinDataPreparation.py) # Execution: Use manual annotations and nucleus size to filter samples and images # Output: Filtered dataframe my_df_filtered = ProteinDataPreparation.apply_manual_filtering( my_df, result_dict, ab_dict ) my_df_filtered = ProteinDataPreparation.apply_big_nucleus_filter(my_df_filtered) my_df_filtered = ProteinDataPreparation.apply_cell_count_filter(my_df_filtered) # my_df_filtered.to_csv("input/ProteinData/FucciData.csv") plate, u_plate, well_plate, well_plate_imgnb, u_well_plates, ab_objnum, area_cell, area_nuc, area_cyto, ensg_dict, ab_dict, result_dict, compartment_dict, ENSG, antibody, result, compartment = ProteinDataPreparation.read_sample_info( my_df_filtered ) wp_ensg, wp_ab, wp_prev_ccd, wp_prev_notccd, wp_prev_negative, prev_ccd_ensg, prev_notccd_ensg, prev_negative_ensg = ProteinDataPreparation.previous_results( u_well_plates, result_dict, ensg_dict, ab_dict ) #%% Idea: Filter for variation and get compartments (methods in ProteinDataPreparation.py) # Execution: Use annotated variation and compartment information # Output: Number of cells filtered my_df_filtered_variation, my_df_filtered_novariation = ProteinDataPreparation.apply_variation_filter( my_df_filtered, result_dict, my_df ) ## Uncomment to output these dataframes (used for skewness / kurtosis analysis) # my_df_filtered_variation.to_csv("output/nuc_predicted_prob_phases_filtered_variation.csv") # my_df_filtered_novariation.to_csv("output/nuc_predicted_prob_phases_filtered_novariation.csv") # filter out the ones missing compartment information; these are localized to mitotic structures and handled differently plate, u_plate, well_plate, well_plate_imgnb, u_well_plates, ab_objnum, area_cell, area_nuc, area_cyto, ensg_dict, ab_dict, result_dict, compartment_dict, ENSG, antibody, result, compartment = ProteinDataPreparation.read_sample_info( my_df_filtered_variation ) wp_iscell, wp_isnuc, wp_iscyto, my_df_filtered_compartmentvariation = ProteinDataPreparation.metacompartments( u_well_plates, compartment_dict, my_df_filtered_variation ) # demonstrate that there are no more missing compartment information plate, u_plate, well_plate, well_plate_imgnb, u_well_plates, ab_objnum, area_cell, area_nuc, area_cyto, ensg_dict, ab_dict, result_dict, compartment_dict, ENSG, antibody, result, compartment = ProteinDataPreparation.read_sample_info( my_df_filtered_compartmentvariation ) wp_iscell, wp_isnuc, wp_iscyto, my_df_filtered_compartmentvariation = ProteinDataPreparation.metacompartments( u_well_plates, compartment_dict, my_df_filtered_compartmentvariation ) wp_ensg, wp_ab, wp_prev_ccd, wp_prev_notccd, wp_prev_negative, prev_ccd_ensg, prev_notccd_ensg, prev_negative_ensg = ProteinDataPreparation.previous_results( u_well_plates, result_dict, ensg_dict, ab_dict ) #%% Idea: Get and process intensities (methods in ProteinDataPreparation.py and ProteinGaussianClustering.py) # Execution: get intensities; zero center fucci intensities # Output: Fucci plot ab_nuc, ab_cyto, ab_cell, mt_cell, green_fucci, red_fucci = ProteinDataPreparation.read_sample_data( my_df_filtered_compartmentvariation ) log_green_fucci, log_red_fucci, log_green_fucci_zeroc, log_red_fucci_zeroc, log_green_fucci_zeroc_rescale, log_red_fucci_zeroc_rescale, fucci_data = ProteinGaussianClustering.zero_center_fucci( green_fucci, red_fucci, u_plate, well_plate, plate ) plt.hist2d(log_green_fucci_zeroc_rescale, log_red_fucci_zeroc_rescale, bins=200) plt.xlabel("Log10 Green Fucci Intensity") plt.ylabel("Log10 Red Fucci Intensity") plt.savefig("figures/FucciPlotProteinIFData_unfiltered.png") # plt.show() plt.close() # General picture of antibody intensity density sbn.displot(ab_cell, kind="hist") plt.xlabel("Mean Intensity") plt.ylabel("Density") plt.savefig("figures/antibody_cell_intensity.pdf") # plt.show() plt.close() #%% Idea: Gaussian clustering per plate to identify G1/S/G2 and do kruskal test for variance # Exec: sklearn.mixture.GaussianMixture & scipy.stats.kruskal # Output: FDR for cell cycle variation per well per compartment # NB! The cluster labels can change if any prior analysis changes. Inspect the plots so that top-left FUCCI cluster is G1, top-right is S, bottom-right is G2. g1_idx, sph_idx, g2_idx = 1, 2, 0 clusternames = [ "G2" if g2_idx == 0 else "G1" if g1_idx == 0 else "S-ph", "G2" if g2_idx == 1 else "G1" if g1_idx == 1 else "S-ph", "G2" if g2_idx == 2 else "G1" if g1_idx == 2 else "S-ph", ] cluster_labels = ProteinGaussianClustering.gaussian_clustering( log_green_fucci_zeroc_rescale, log_red_fucci_zeroc_rescale, clusternames ) g1 = cluster_labels == g1_idx sph = cluster_labels == sph_idx g2 = cluster_labels == g2_idx alpha_gauss, doGenerateBoxplotsPerGene = 0.05, False wp_comp_kruskal_gaussccd_adj, wp_pass_kruskal_gaussccd_bh_comp, wp_mt_kruskal_gaussccd_adj, wp_pass_gaussccd_bh_mt = ProteinGaussianClustering.gaussian_clustering_analysis( alpha_gauss, doGenerateBoxplotsPerGene, g1, sph, g2, wp_ensg, well_plate, u_well_plates, ab_cell, ab_nuc, ab_cyto, mt_cell, wp_iscell, wp_isnuc, wp_iscyto, ) # General look at replicates in mock-bulk analysis ProteinGaussianClustering.address_replicates( alpha_gauss, wp_pass_kruskal_gaussccd_bh_comp, wp_ensg, wp_ab, u_well_plates ) |
11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 | from SingleCellProteogenomics import (FucciPseudotime, Loaders, ProteinBimodality, ProteinCellCycleDependence, ProteinVariability, utils) import matplotlib.pyplot as plt import pandas as pd import numpy as np import scipy import argparse if __name__ == "__main__": description = "Single cell proteogenomic analysis script -- protein cell cycle dependence" parser = argparse.ArgumentParser(description=description) parser.add_argument( "--quicker", action="store_true", help="Skip some plotting and analyses, namely for HPA releases." ) args = parser.parse_args() doAllPlotsAndAnalyses = not args.quicker # Make PDF text readable plt.rcParams["pdf.fonttype"] = 42 plt.rcParams["ps.fonttype"] = 42 plt.rcParams["savefig.dpi"] = 300 #%% Read in the protein data import_dict = Loaders.load_protein_fucci_pseudotime() u_plate, well_plate, well_plate_imgnb, well_plate_imgnb_objnb, u_well_plates = ( import_dict["u_plate"], import_dict["well_plate"], import_dict["well_plate_imgnb"], import_dict["well_plate_imgnb_objnb"], import_dict["u_well_plates"], ) ab_nuc, ab_cyto, ab_cell, mt_cell = ( import_dict["ab_nuc"], import_dict["ab_cyto"], import_dict["ab_cell"], import_dict["mt_cell"], ) area_cell, area_nuc = import_dict["area_cell"], import_dict["area_nuc"] wp_ensg, wp_ab = import_dict["wp_ensg"], import_dict["wp_ab"] green_fucci, red_fucci = import_dict["green_fucci"], import_dict["red_fucci"] log_green_fucci_zeroc, log_red_fucci_zeroc = ( import_dict["log_green_fucci_zeroc"], import_dict["log_red_fucci_zeroc"], ) log_green_fucci_zeroc_rescale, log_red_fucci_zeroc_rescale = ( import_dict["log_green_fucci_zeroc_rescale"], import_dict["log_red_fucci_zeroc_rescale"], ) wp_comp_kruskal_gaussccd_adj, wp_pass_kruskal_gaussccd_bh_comp = ( import_dict["wp_comp_kruskal_gaussccd_adj"], import_dict["wp_pass_kruskal_gaussccd_bh_comp"], ) fucci_data = import_dict["fucci_data"] wp_iscell, wp_isnuc, wp_iscyto = ( import_dict["wp_iscell"], import_dict["wp_isnuc"], import_dict["wp_iscyto"], ) curr_wp_phases, mockbulk_phases = ( import_dict["curr_wp_phases"], import_dict["mockbulk_phases"], ) #%% # Idea: Calculate the polar coordinates and other stuff # Exec: Devin's calculations # Output: fucci plot with polar coordinates pseudotime_result = FucciPseudotime.pseudotime_protein( fucci_data, ab_nuc, ab_cyto, ab_cell, mt_cell, area_cell, area_nuc, well_plate, well_plate_imgnb, well_plate_imgnb_objnb, log_red_fucci_zeroc_rescale, log_green_fucci_zeroc_rescale, mockbulk_phases, ) pol_sort_well_plate, pol_sort_norm_rev, pol_sort_well_plate_imgnb, pol_sort_well_plate_imgnb_objnb, pol_sort_ab_nuc, pol_sort_ab_cyto, pol_sort_ab_cell, pol_sort_mt_cell, pol_sort_area_cell, pol_sort_area_nuc, pol_sort_fred, pol_sort_fgreen, pol_sort_mockbulk_phases = ( pseudotime_result ) #%% Calculate measures of variance of protein abundance in single cells # Idea: Calculate measures of variance, and show them in plots # Execution: Now that we already have the data filtered for variability, this is just descriptive. # Output: scatters of antibody vs microtubule variances by different measures of variaibility # toggle for using log-transformed intensities # we decided to use natural intensities use_log = False calculate_variaton_result = ProteinVariability.calculate_variation( use_log, u_well_plates, wp_iscell, wp_isnuc, wp_iscyto, pol_sort_well_plate, pol_sort_ab_cell, pol_sort_ab_nuc, pol_sort_ab_cyto, pol_sort_mt_cell, pol_sort_well_plate_imgnb, ) mean_mean_comp, var_comp, gini_comp, cv_comp, var_cell, gini_cell, cv_cell, var_mt, gini_mt, cv_mt = ( calculate_variaton_result ) # Compare variances for protein and microtubules, the internal control for each image if doAllPlotsAndAnalyses: removeThese = pd.read_csv("input/ProteinData/ReplicatesToRemove.txt", header=None)[0] # make these independent samples for one-sided Kruskal-Wallis tests utils.general_boxplot( (var_comp[~np.isin(u_well_plates, removeThese)], var_mt[~np.isin(u_well_plates, removeThese)]), ("Protein", "Microtubules"), "", "Variance", "", False, f"figures/ProteinMicrotubuleVariances.pdf", ) utils.general_boxplot( (cv_comp[~np.isin(u_well_plates, removeThese)], gini_mt[~np.isin(u_well_plates, removeThese)]), ("Protein", "Microtubules"), "", "CV", "", False, f"figures/ProteinMicrotubuleCVs.pdf", ) utils.general_boxplot( (gini_comp[~np.isin(u_well_plates, removeThese)], gini_mt[~np.isin(u_well_plates, removeThese)]), ("Protein", "Microtubules"), "", "Gini", "", False, f"figures/ProteinMicrotubuleGinis.pdf", ) p_varProt_varMt = 2*scipy.stats.kruskal(var_comp[~np.isin(u_well_plates, removeThese)], var_mt[~np.isin(u_well_plates, removeThese)])[1] p_cvProt_cvMt = 2*scipy.stats.kruskal(cv_comp[~np.isin(u_well_plates, removeThese)], cv_mt[~np.isin(u_well_plates, removeThese)])[1] p_giniProt_giniMt = 2*scipy.stats.kruskal(gini_comp[~np.isin(u_well_plates, removeThese)], gini_mt[~np.isin(u_well_plates, removeThese)])[1] print( f"{p_varProt_varMt}: p-value for difference between protein and microtubule variances" ) print( f"{p_cvProt_cvMt}: p-value for difference between protein and microtubule CVs" ) print( f"{p_giniProt_giniMt}: p-value for difference between protein and microtubule Gini indices" ) #%% Gaussian clustering to identify biomodal intensity distributions bimodal_results = ProteinBimodality.identify_bimodal_intensity_distributions( u_well_plates, wp_ensg, pol_sort_well_plate, pol_sort_norm_rev, pol_sort_ab_cell, pol_sort_ab_nuc, pol_sort_ab_cyto, pol_sort_mt_cell, wp_iscell, wp_isnuc, wp_iscyto, doAllPlotsAndAnalyses ) wp_isbimodal_fcpadj_pass, wp_bimodal_cluster_idxs, wp_isbimodal_generally, wp_bimodal_fcmaxmin = ( bimodal_results ) #%% Determine cell cycle dependence for each protein use_log_ccd = False do_remove_outliers = True alphaa = 0.05 # Determine cell cycle dependence for proteins ccd_results = ProteinCellCycleDependence.cell_cycle_dependence_protein( u_well_plates, wp_ensg, wp_ab, use_log_ccd, do_remove_outliers, pol_sort_well_plate, pol_sort_norm_rev, pol_sort_ab_cell, pol_sort_ab_nuc, pol_sort_ab_cyto, pol_sort_mt_cell, pol_sort_fred, pol_sort_fgreen, pol_sort_mockbulk_phases, pol_sort_area_cell, pol_sort_area_nuc, pol_sort_well_plate_imgnb, wp_iscell, wp_isnuc, wp_iscyto, wp_isbimodal_fcpadj_pass, wp_bimodal_cluster_idxs, wp_comp_kruskal_gaussccd_adj, doAllPlotsAndAnalyses ) wp_comp_ccd_difffromrng, mean_diff_from_rng_mt, wp_comp_ccd_clust1, wp_comp_ccd_clust2, wp_ccd_unibimodal, wp_comp_ccd_gauss, perc_var_comp, mean_diff_from_rng, wp_comp_eq_percvar_adj, mean_diff_from_rng_clust1, wp_comp_eq_percvar_adj_clust1, mean_diff_from_rng_clust2, wp_comp_eq_percvar_adj_clust2, mvavgs_x, mvavgs_comp, curr_pols, curr_ab_norms, mvperc_comps, curr_freds, curr_fgreens, curr_mockbulk_phases, curr_area_cell, curr_ab_nuc, curr_well_plate_imgnb, folder = ( ccd_results ) # Move the temporal average plots to more informative places if doAllPlotsAndAnalyses: ProteinCellCycleDependence.copy_mvavg_plots_protein( folder, wp_ensg, wp_comp_ccd_difffromrng, wp_isbimodal_fcpadj_pass, wp_comp_ccd_clust1, wp_comp_ccd_clust2, wp_ccd_unibimodal, wp_comp_ccd_gauss, ) ProteinCellCycleDependence.global_plots_protein( alphaa, u_well_plates, wp_ccd_unibimodal, perc_var_comp, mean_mean_comp, gini_comp, cv_comp, mean_diff_from_rng, wp_comp_eq_percvar_adj, wp_comp_kruskal_gaussccd_adj, ) # Analyze the CCD results and save them ccd_comp, nonccd_comp, bioccd = ProteinCellCycleDependence.analyze_ccd_variation_protein( folder, u_well_plates, wp_ensg, wp_ab, wp_iscell, wp_isnuc, wp_iscyto, wp_comp_ccd_difffromrng, wp_comp_ccd_clust1, wp_comp_ccd_clust2, var_comp, gini_comp, perc_var_comp, mean_diff_from_rng, wp_comp_kruskal_gaussccd_adj, wp_comp_eq_percvar_adj, mean_diff_from_rng_clust1, wp_comp_eq_percvar_adj_clust1, mean_diff_from_rng_clust2, wp_comp_eq_percvar_adj_clust2, wp_isbimodal_fcpadj_pass, wp_isbimodal_generally, wp_ccd_unibimodal, wp_bimodal_fcmaxmin, wp_comp_ccd_gauss, ) # Make a dataframe for plotting on the HPA website ProteinCellCycleDependence.make_plotting_dataframe( wp_ensg, wp_ab, u_well_plates, wp_iscell, wp_iscyto, wp_isnuc, ccd_comp, bioccd, curr_pols, curr_ab_norms, curr_freds, curr_fgreens, curr_mockbulk_phases, mvavgs_x, mvavgs_comp, mvperc_comps, gini_comp, perc_var_comp, ) # Perform comparison to LASSO for finding CCD proteins if doAllPlotsAndAnalyses: ProteinCellCycleDependence.compare_to_lasso_analysis( u_well_plates, pol_sort_norm_rev, pol_sort_well_plate, pol_sort_ab_cell, pol_sort_ab_nuc, pol_sort_ab_cyto, pol_sort_mt_cell, pol_sort_fred, pol_sort_fgreen, wp_iscell, wp_isnuc, wp_iscyto, wp_ensg, ccd_comp, ) # Generate UMAPs to illustrate cutoffs and stability ProteinCellCycleDependence.generate_protein_umaps( u_well_plates, pol_sort_norm_rev, pol_sort_well_plate, pol_sort_ab_cell, pol_sort_ab_nuc, pol_sort_ab_cyto, pol_sort_mt_cell, wp_iscell, wp_isnuc, wp_iscyto, mean_diff_from_rng, ) |
11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 | from SingleCellProteogenomics import (FucciPseudotime, RNACellCycleDependence, RNADataPreparation, RNAVelocity, utils) import scanpy as sc import numpy as np import matplotlib.pyplot as plt import shutil import os import argparse if __name__ == "__main__": description = "Single cell proteogenomic analysis script -- protein cell cycle dependence" parser = argparse.ArgumentParser(description=description) parser.add_argument( "--quicker", action="store_true", help="Skip some plotting and analyses, namely for HPA releases." ) args = parser.parse_args() doAllPlotsAndAnalyses = not args.quicker # Make PDF text readable plt.rcParams["pdf.fonttype"] = 42 plt.rcParams["ps.fonttype"] = 42 plt.rcParams["savefig.dpi"] = 300 bioccd = np.genfromtxt( "input/ProteinData/BiologicallyDefinedCCD.txt", dtype="str" ) # from mitotic structures wp_ensg = np.load("output/pickles/wp_ensg.npy", allow_pickle=True) ccd_comp = np.load("output/pickles/ccd_comp.npy", allow_pickle=True) nonccd_comp = np.load("output/pickles/nonccd_comp.npy", allow_pickle=True) u_plates = ["355","356","357"] #%% Convert FACS intensities for FUCCI markers to pseudotime using the same polar coordinate methods as for protein # Idea: Use the polar coordinate pseudotime calculations to calculate the pseudotime for each cell # Execution: Adapt Devin's code for the cells sorted for RNA-Seq # Output: Make log-log fucci intensity plots for the cells analyzed by RNA-Seq; Plot of all fucci pseudotimes; table of pseudotimes for each cell adata, phases_filt = RNADataPreparation.read_counts_and_phases( "Counts", False, "protein_coding", u_plates ) adata = RNADataPreparation.zero_center_fucci(adata) FucciPseudotime.pseudotime_rna(adata) #%% Single cell RNA-Seq data preparation and general analysis if doAllPlotsAndAnalyses: RNADataPreparation.general_plots(u_plates) RNADataPreparation.analyze_noncycling_cells(u_plates) adata, phasesfilt = RNADataPreparation.qc_filtering( adata, do_log_normalize=True, do_remove_blob=False ) adata = RNADataPreparation.zero_center_fucci(adata) RNADataPreparation.plot_pca_for_batch_effect_analysis(adata, "BeforeRemovingNoncycling") #%% Idea: Similar to mock-bulk analysis for proteins, we can evaluate each gene bundled by phase across cells # Execution: Make boxplots of RNA expression by phase # Output: boxplots for each gene valuetype, use_spikeins, biotype_to_use = "Tpms", False, "protein_coding" adata, phases = RNADataPreparation.read_counts_and_phases( valuetype, use_spikeins, biotype_to_use, u_plates ) adata, phasesfilt = RNADataPreparation.qc_filtering( adata, do_log_normalize=True, do_remove_blob=True ) adata = RNADataPreparation.zero_center_fucci(adata) if doAllPlotsAndAnalyses: RNADataPreparation.plot_markers_vs_reads(adata) RNADataPreparation.plot_pca_for_batch_effect_analysis(adata, "AfterRemovingNoncycling") g1 = adata.obs["phase"] == "G1" s = adata.obs["phase"] == "S-ph" g2 = adata.obs["phase"] == "G2M" do_make_boxplots = False if do_make_boxplots: for iii, ensg in enumerate(adata.var_names): maxtpm = np.max( np.concatenate((adata.X[g1, iii], adata.X[s, iii], adata.X[g2, iii])) ) RNACellCycleDependence.boxplot_result( adata.X[g1, iii] / maxtpm, adata.X[s, iii] / maxtpm, adata.X[g2, iii] / maxtpm, "figures/RNABoxplotByPhase", ensg, ) #%% Idea: Display general RNA expression patterns in single cells using UMAP dimensionality reduction, and display with FUCCI pseudotime overlayed if doAllPlotsAndAnalyses: FucciPseudotime.pseudotime_umap(adata) # Generate a UMAP with the pseudotime overlayed # We can also show that the cycle pattern remains when the curated CCD genes or CCD proteins are removed, # demonstrating that there's still valuable information about cell cycling beyond what was called CCD RNADataPreparation.demonstrate_umap_cycle_without_ccd(adata) # Read in the curated CCD genes / CCD proteins from the present work / Non-CCD genes from the present work; filter for genes that weren't filtered in QC of RNA-Seq ccd_regev_filtered, ccd_filtered, nonccd_filtered = utils.ccd_gene_lists(adata) adata_ccdprotein, adata_nonccdprotein, adata_regevccdgenes = RNADataPreparation.is_ccd( adata, wp_ensg, ccd_comp, nonccd_comp, bioccd, ccd_regev_filtered ) # Generate plots with expression of genes overlayed expression_data = adata.X normalized_exp_data = (expression_data.T / np.max(expression_data, axis=0)[:, None]).T # Log-log FUCCI plot with RNA expression overlayed if doAllPlotsAndAnalyses: RNACellCycleDependence.plot_expression_facs( adata, wp_ensg[np.isin(wp_ensg, adata.var_names)], normalized_exp_data, "figures/GeneExpressionFucci", ) # Cluster the expression into phases and analyze it that way bulk_phase_tests = RNACellCycleDependence.analyze_ccd_variation_by_phase_rna( adata, normalized_exp_data, biotype_to_use ) # RNACellCycleDependence.plot_expression_boxplots(adata, wp_ensg[np.isin(wp_ensg, adata.var_names)], bulk_phase_tests, "figures/GeneExpressionBoxplots") # RNACellCycleDependence.plot_expression_umap(adata, wp_ensg[np.isin(wp_ensg, adata.var_names)], "figures/GeneExpressionUmap") #%% Moving average calculations and randomization analysis for RNA rna_ccd_analysis_results = RNACellCycleDependence.analyze_ccd_variation_by_mvavg_rna( adata, wp_ensg, ccd_comp, bioccd, adata_nonccdprotein, adata_regevccdgenes, biotype_to_use, make_mvavg_plots_isoforms=doAllPlotsAndAnalyses, ) percent_ccd_variance, total_gini, mean_diff_from_rng, pass_meandiff, eq_percvar_adj, fucci_time_inds, norm_exp_sort, moving_averages, mvavg_xvals, perms, ccdtranscript, ccdprotein, mvpercs = ( rna_ccd_analysis_results ) RNACellCycleDependence.make_plotting_dataframe( adata, ccdtranscript, wp_ensg, bioccd, norm_exp_sort[np.argsort(fucci_time_inds), :], mvavg_xvals, moving_averages, mvpercs, ) if doAllPlotsAndAnalyses: RNACellCycleDependence.plot_umap_ccd_cutoffs(adata, mean_diff_from_rng) RNACellCycleDependence.figures_ccd_analysis_rna( adata, percent_ccd_variance, mean_diff_from_rng, pass_meandiff, eq_percvar_adj, wp_ensg, ccd_comp, ccd_regev_filtered, ) RNACellCycleDependence.plot_overall_and_ccd_variances( adata, biotype_to_use, total_gini, percent_ccd_variance, pass_meandiff, adata_ccdprotein, adata_nonccdprotein, adata_regevccdgenes, ) RNACellCycleDependence.compare_to_lasso_analysis(adata, ccdtranscript) RNACellCycleDependence.analyze_cnv_calls(adata, ccdtranscript) #%% Moving average calculations and randomization analysis for the spike-in internal controls if doAllPlotsAndAnalyses: adata_spikeins, phases_spikeins = RNADataPreparation.read_counts_and_phases( valuetype, use_spike_ins=True, biotype_to_use="", u_plates=u_plates ) sc.pp.filter_genes(adata_spikeins, min_cells=100) print(f"data shape after filtering: {adata_spikeins.X.shape}") RNACellCycleDependence.ccd_analysis_of_spikeins(adata_spikeins, perms) #%% Analyze RNA velocity valuetype, use_spikeins, biotype_to_use = "Tpms", False, "protein_coding" adata, phases = RNADataPreparation.read_counts_and_phases( valuetype, use_spikeins, biotype_to_use, u_plates, load_velocities=True ) adata_blobless, phasesfilt = RNADataPreparation.qc_filtering( adata, do_log_normalize=True, do_remove_blob=False ) RNAVelocity.analyze_rna_velocity(adata_blobless, mean_diff_from_rng, doAllPlotsAndAnalyses) #%% Analyze isoforms if doAllPlotsAndAnalyses: adata_isoform, ccdtranscript_isoform = RNACellCycleDependence.analyze_isoforms( adata, ccdtranscript, wp_ensg, ccd_comp, nonccd_comp, u_plates, make_mvavg_plots_isoforms=doAllPlotsAndAnalyses ) RNACellCycleDependence.compare_genes_to_isoforms( adata, ccdprotein, ccdtranscript, adata_nonccdprotein, adata_isoform, ccdtranscript_isoform, ) |
10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 | from SingleCellProteogenomics import (Loaders, RNADataPreparation, TemporalDelay) import matplotlib.pyplot as plt # Make PDF text readable plt.rcParams["pdf.fonttype"] = 42 plt.rcParams["ps.fonttype"] = 42 plt.rcParams["savefig.dpi"] = 300 #%% Read in the protein data import_dict = Loaders.load_temporal_delay() u_well_plates, wp_ensg = import_dict["u_well_plates"], import_dict["wp_ensg"] wp_iscell, wp_isnuc, wp_iscyto = ( import_dict["wp_iscell"], import_dict["wp_isnuc"], import_dict["wp_iscyto"], ) ccd_comp, ccdtranscript, ccdtranscript_isoform = ( import_dict["ccd_comp"], import_dict["ccdtranscript"], import_dict["ccdtranscript_isoform"], ) pol_sort_well_plate, pol_sort_norm_rev = ( import_dict["pol_sort_well_plate"], import_dict["pol_sort_norm_rev"], ) pol_sort_ab_nuc, pol_sort_ab_cyto, pol_sort_ab_cell, pol_sort_mt_cell = ( import_dict["pol_sort_ab_nuc"], import_dict["pol_sort_ab_cyto"], import_dict["pol_sort_ab_cell"], import_dict["pol_sort_mt_cell"], ) var_comp_prot, gini_comp_prot, cv_comp_prot = ( import_dict["var_comp"], import_dict["gini_comp"], import_dict["cv_comp"], ) var_cell_prot, gini_cell_prot, cv_cell_prot = ( import_dict["var_cell"], import_dict["gini_cell"], import_dict["cv_cell"], ) #%% Idea: Make temporal heatmap for peak protein expression, and compare known and novel proteins that peak at similar times # Execution: plt.imshow makes a heatmap if given a 2D array # Output: heatmap; correlations of known/novel proteins highlights = [] #'ORC6','DUSP19','BUB1B','DPH2', 'FLI1'] highlights_ensg = [] #'ORC6','DUSP19','BUB1B','DPH2', 'FLI1'] nbins = 20 protein_heatmap_results = TemporalDelay.protein_heatmap( nbins, highlights, highlights_ensg, ccd_comp, u_well_plates, wp_ensg, pol_sort_norm_rev, pol_sort_well_plate, pol_sort_ab_cell, pol_sort_ab_nuc, pol_sort_ab_cyto, pol_sort_mt_cell, wp_iscell, wp_isnuc, wp_iscyto, ) sorted_maxpol_array, wp_binned_values, wp_max_pol, wp_max_pol_ccd, xvals = ( protein_heatmap_results ) # Correlations of known and novel proteins that peak at similar times TemporalDelay.peak_expression_correlation_analysis( wp_binned_values, wp_max_pol, wp_ensg, pol_sort_well_plate, u_well_plates ) #%% Create a heatmap of peak RNA expression highlight_names, highlight_ensg = [], [] u_rna_plates = ["355","356","357"] # Read in RNA-Seq data; use TPMs so that the gene-specific results scales match for cross-gene comparisons valuetype, use_spikeins, biotype_to_use = "Tpms", False, "protein_coding" adata, phases = RNADataPreparation.read_counts_and_phases( valuetype, use_spikeins, biotype_to_use, u_rna_plates ) adata, phasesfilt = RNADataPreparation.qc_filtering( adata, do_log_normalize=True, do_remove_blob=True ) adata = RNADataPreparation.zero_center_fucci(adata) sorted_max_moving_avg_pol_ccd, norm_exp_sort, max_moving_avg_pol, sorted_rna_binned_norm = TemporalDelay.rna_heatmap( adata, highlight_names, highlight_ensg, ccdtranscript, xvals ) # Analyze isoforms adata_isoform, phases_isoform = RNADataPreparation.read_counts_and_phases( valuetype, use_spikeins, biotype_to_use, u_rna_plates, use_isoforms=True, ) adata_isoform, phasesfilt_isoform = RNADataPreparation.qc_filtering( adata_isoform, do_log_normalize=True, do_remove_blob=True ) adata_isoform = RNADataPreparation.zero_center_fucci(adata_isoform) sorted_max_moving_avg_pol_ccd_isoform, norm_exp_sort_isoform, max_moving_avg_pol_isoform, sorted_rna_binned_norm_isoform = TemporalDelay.rna_heatmap( adata_isoform, highlight_names, highlight_ensg, ccdtranscript_isoform, xvals, isIsoformData=True, ) pearsonCorrelations = TemporalDelay.analyze_ccd_isoform_correlations( adata, adata_isoform, ccdtranscript, ccdtranscript_isoform, xvals, u_rna_plates ) #%% Compare the variances and time of peak expression between protein and RNA TemporalDelay.compare_variances_prot_v_rna( adata, norm_exp_sort, wp_ensg, var_comp_prot, gini_comp_prot, cv_comp_prot, var_cell_prot, gini_cell_prot, cv_cell_prot, ) TemporalDelay.compare_peak_expression_prot_v_rna( adata, wp_ensg, ccd_comp, ccdtranscript, wp_max_pol, wp_max_pol_ccd, sorted_maxpol_array, max_moving_avg_pol, sorted_max_moving_avg_pol_ccd, ) |
12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 | from SingleCellProteogenomics import (FucciCellCycle, Loaders, ProteinPropertyAnalysis, RNADataPreparation, utils) import matplotlib.pyplot as plt import numpy as np # Make PDF text readable plt.rcParams["pdf.fonttype"] = 42 plt.rcParams["ps.fonttype"] = 42 plt.rcParams["savefig.dpi"] = 300 fucci = FucciCellCycle.FucciCellCycle() u_rna_plates = ["355","356","357"] #%% Import the genes names we're analyzing # Read in RNA-Seq data again and the CCD gene lists valuetype, use_spikeins, biotype_to_use = "Tpms", False, "protein_coding" adata, phases = RNADataPreparation.read_counts_and_phases( valuetype, use_spikeins, biotype_to_use, u_rna_plates ) adata, phasesfilt = RNADataPreparation.qc_filtering( adata, do_log_normalize=True, do_remove_blob=True ) adata = RNADataPreparation.zero_center_fucci(adata) import_dict = Loaders.load_ptm_and_stability(adata) wp_ensg, ccd_comp, nonccd_comp, ccdtranscript, wp_max_pol = ( import_dict["wp_ensg"], import_dict["ccd_comp"], import_dict["nonccd_comp"], import_dict["ccdtranscript"], import_dict["wp_max_pol"], ) ensg_results, name_results = utils.save_gene_names_by_category( adata, wp_ensg, ccd_comp, nonccd_comp, ccdtranscript ) ensg_ccdtranscript, ensg_nonccdtranscript, ensg_ccdprotein, ensg_nonccdprotein, ensg_ccdprotein_transcript_regulated, ensg_ccdprotein_nontranscript_regulated, genes_analyzed, ccd_regev_filtered, ccd_filtered = ensg_results names_ccdtranscript, names_nonccdtranscript, names_ccdprotein, names_nonccdprotein, names_ccdprotein_transcript_regulated, names_ccdprotein_nontranscript_regulated, names_genes_analyzed, names_ccd_regev_filtered, names_ccd_filtered = name_results bioccd = np.genfromtxt( "input/ProteinData/BiologicallyDefinedCCD.txt", dtype="str" ) # from mitotic structures names_bioccd = utils.ccd_gene_names(bioccd, utils.getGeneNameDict()) #%% Analyze properties of the different groups relative to melting points proteinProperties = ProteinPropertyAnalysis.ProteinProperties( wp_ensg, ensg_ccdprotein, ensg_ccdprotein_transcript_regulated, ensg_ccdprotein_nontranscript_regulated, bioccd, ensg_nonccdprotein, ensg_ccdtranscript, names_bioccd, names_ccdprotein, names_ccdprotein_transcript_regulated, names_ccdprotein_nontranscript_regulated, names_nonccdprotein, names_ccdtranscript, ) proteinProperties.analyze_melting_points() proteinProperties.analyze_disorder() proteinProperties.analyze_length() proteinProperties.statistical_properties_table() proteinProperties.generate_properties_table() proteinProperties.generate_statistical_boxplots() proteinProperties.tm_scatters() proteinProperties.kinase_families() |
25 | shell: "python 1_ProteinCellCycleClusters.py &> {log}" |
32 | shell: "python 2_ProteinFucciPsuedotime.py &> {log}" |
39 | shell: "python 3_RNAFucciPseudotime.py &> {log}" |
46 | shell: "python 4_TemporalDelay.py &> {log}" |
53 | shell: "python 5_ProteinProperties.py &> {log}" |
Support
- Future updates
Related Workflows





