Cell cycle analysis of single-cell proteomic and transcriptomic data for the FUCCI cell model

public 1yr ago Version: 1.2.1 0 bookmarks

View Workflow

singlecellproteogenomics — View Workflow

This repository contains the code used to perform the single-cell proteogenomic analysis of the human cell cycle . This study was based on immunofluorescence staining of ~200k cells for single-cell analysis of proteomic heterogeneity and ~1k cells for analysis of single-cell RNA variability. These analyses were integrated with other proteomic studies and databases to investigate the functional importance of transcript-regulated and non-transcript regulated variability.

Structure of repository

The code is listed in order of execution, e.g. "1_", "2_" etc. The output of each script is used in the subsequent script. This workflow can also be run using snakemake (see below).

The logic for these analyses is contained in the SingleCellProteogenomics folder.

The input files are contained in the "input" folder. This folder is linked here for release v1.2 as a zip file, input.zip . Expand this folder within the base directory of this repository. If you are looking for the raw imaging proteomic dataset produced after filtering artifacts and such, that is located here .

The output files are added to a folder "output" during the analysis, and figures are added to a folder "figures."

An R-script used to analyze skewness and kurtosis (noted in the Methods of the manuscript) is contained in the other_scripts folder. The other_scripts/ProteinDisorder.py script utilizes IUPRED2A and a human UniProt database.

Running the workflow using snakemake

This workflow can be run using snakemake :

Install Miniconda from https://docs.conda.io/en/latest/miniconda.html.
Install snakemake using conda install -c conda-forge snakemake-minimal .
Within this directory, run snakemake -j 1 --use-conda --snakefile workflow/Snakefile .

Single-cell RNA-Seq analysis

The single-cell RNA-Seq data is available at GEO SRA under project number GSE146773 .

The cell cycle phase and FACS intensity information for these ~1,000 cells are contained in the input folder within three files, one per plate, starting with RNAData/180911_Fucci_single cell seq_ss2-18-*.csv .

The snakemake workflow used to analyze the scRNA-Seq dataset, including RNA velocity calculations and louvain unsupervised clustering, can be found in this repository: https://github.com/CellProfiling/FucciSingleCellSeqPipeline.

The loom file containing the results of RNA velocity analysis, including spliced and unspliced counts, can be found in the input folder under RNAData/a.loom , and the observation names used for each cell that match the "Well_Plate" identifiers can be found in RNAData/a.obs_names.csv .

Citation

Mahdessian, D.*; Cesnik, A. J.*; Gnann, C.; Danielsson, F.; Stenström, L.; Arif, M.; Zhang, C.; Le, T.; Johansson, F.; Shutten, R.; Bäckström, A.; Axelsson, U.; Thul, P.; Cho, N. H.; Carja, O.; Uhlén, M.; Mardinoglu, A.; Stadler, C.; Lindskog, C.; Ayoglu, B.; Leonetti, M. D.; Pontén, F.; Sullivan, D. P.; Lundberg, E. “Spatiotemporal dissection of the cell cycle with single cell proteogenomics.” Nature, 2021, 590, 649–654. *Contributed equally. https://www.nature.com/articles/s41586-021-03232-9

Code Snippets

from SingleCellProteogenomics import (ProteinDataPreparation,
                                      ProteinGaussianClustering)
import matplotlib.pyplot as plt
import seaborn as sbn

# Make PDF text readable
plt.rcParams["pdf.fonttype"] = 42
plt.rcParams["ps.fonttype"] = 42
plt.rcParams["savefig.dpi"] = 300

#%% Read in the protein data (methods in ProteinDataPreparation.py)
my_df = ProteinDataPreparation.read_raw_data()
plate, u_plate, well_plate, well_plate_imgnb, u_well_plates, ab_objnum, area_cell, area_nuc, area_cyto, ensg_dict, ab_dict, result_dict, compartment_dict, ENSG, antibody, result, compartment = ProteinDataPreparation.read_sample_info(
    my_df
)
wp_ensg, wp_ab, wp_prev_ccd, wp_prev_notccd, wp_prev_negative, prev_ccd_ensg, prev_notccd_ensg, prev_negative_ensg = ProteinDataPreparation.previous_results(
    u_well_plates, result_dict, ensg_dict, ab_dict
)

#%% Idea: Filter the raw data (methods in ProteinDataPreparation.py)
# Execution: Use manual annotations and nucleus size to filter samples and images
# Output: Filtered dataframe
my_df_filtered = ProteinDataPreparation.apply_manual_filtering(
    my_df, result_dict, ab_dict
)
my_df_filtered = ProteinDataPreparation.apply_big_nucleus_filter(my_df_filtered)
my_df_filtered = ProteinDataPreparation.apply_cell_count_filter(my_df_filtered)
# my_df_filtered.to_csv("input/ProteinData/FucciData.csv")
plate, u_plate, well_plate, well_plate_imgnb, u_well_plates, ab_objnum, area_cell, area_nuc, area_cyto, ensg_dict, ab_dict, result_dict, compartment_dict, ENSG, antibody, result, compartment = ProteinDataPreparation.read_sample_info(
    my_df_filtered
)
wp_ensg, wp_ab, wp_prev_ccd, wp_prev_notccd, wp_prev_negative, prev_ccd_ensg, prev_notccd_ensg, prev_negative_ensg = ProteinDataPreparation.previous_results(
    u_well_plates, result_dict, ensg_dict, ab_dict
)

#%% Idea: Filter for variation and get compartments (methods in ProteinDataPreparation.py)
# Execution: Use annotated variation and compartment information
# Output: Number of cells filtered
my_df_filtered_variation, my_df_filtered_novariation = ProteinDataPreparation.apply_variation_filter(
    my_df_filtered, result_dict, my_df
)

## Uncomment to output these dataframes (used for skewness / kurtosis analysis)
# my_df_filtered_variation.to_csv("output/nuc_predicted_prob_phases_filtered_variation.csv")
# my_df_filtered_novariation.to_csv("output/nuc_predicted_prob_phases_filtered_novariation.csv")

# filter out the ones missing compartment information; these are localized to mitotic structures and handled differently
plate, u_plate, well_plate, well_plate_imgnb, u_well_plates, ab_objnum, area_cell, area_nuc, area_cyto, ensg_dict, ab_dict, result_dict, compartment_dict, ENSG, antibody, result, compartment = ProteinDataPreparation.read_sample_info(
    my_df_filtered_variation
)
wp_iscell, wp_isnuc, wp_iscyto, my_df_filtered_compartmentvariation = ProteinDataPreparation.metacompartments(
    u_well_plates, compartment_dict, my_df_filtered_variation
)

# demonstrate that there are no more missing compartment information
plate, u_plate, well_plate, well_plate_imgnb, u_well_plates, ab_objnum, area_cell, area_nuc, area_cyto, ensg_dict, ab_dict, result_dict, compartment_dict, ENSG, antibody, result, compartment = ProteinDataPreparation.read_sample_info(
    my_df_filtered_compartmentvariation
)
wp_iscell, wp_isnuc, wp_iscyto, my_df_filtered_compartmentvariation = ProteinDataPreparation.metacompartments(
    u_well_plates, compartment_dict, my_df_filtered_compartmentvariation
)
wp_ensg, wp_ab, wp_prev_ccd, wp_prev_notccd, wp_prev_negative, prev_ccd_ensg, prev_notccd_ensg, prev_negative_ensg = ProteinDataPreparation.previous_results(
    u_well_plates, result_dict, ensg_dict, ab_dict
)

#%% Idea: Get and process intensities (methods in ProteinDataPreparation.py and ProteinGaussianClustering.py)
# Execution: get intensities; zero center fucci intensities
# Output: Fucci plot
ab_nuc, ab_cyto, ab_cell, mt_cell, green_fucci, red_fucci = ProteinDataPreparation.read_sample_data(
    my_df_filtered_compartmentvariation
)
log_green_fucci, log_red_fucci, log_green_fucci_zeroc, log_red_fucci_zeroc, log_green_fucci_zeroc_rescale, log_red_fucci_zeroc_rescale, fucci_data = ProteinGaussianClustering.zero_center_fucci(
    green_fucci, red_fucci, u_plate, well_plate, plate
)

plt.hist2d(log_green_fucci_zeroc_rescale, log_red_fucci_zeroc_rescale, bins=200)
plt.xlabel("Log10 Green Fucci Intensity")
plt.ylabel("Log10 Red Fucci Intensity")
plt.savefig("figures/FucciPlotProteinIFData_unfiltered.png")
# plt.show()
plt.close()

# General picture of antibody intensity density
sbn.displot(ab_cell, kind="hist")
plt.xlabel("Mean Intensity")
plt.ylabel("Density")
plt.savefig("figures/antibody_cell_intensity.pdf")
# plt.show()
plt.close()

#%% Idea: Gaussian clustering per plate to identify G1/S/G2 and do kruskal test for variance
# Exec: sklearn.mixture.GaussianMixture & scipy.stats.kruskal
# Output: FDR for cell cycle variation per well per compartment

# NB! The cluster labels can change if any prior analysis changes. Inspect the plots so that top-left FUCCI cluster is G1, top-right is S, bottom-right is G2.
g1_idx, sph_idx, g2_idx = 1, 2, 0
clusternames = [
    "G2" if g2_idx == 0 else "G1" if g1_idx == 0 else "S-ph",
    "G2" if g2_idx == 1 else "G1" if g1_idx == 1 else "S-ph",
    "G2" if g2_idx == 2 else "G1" if g1_idx == 2 else "S-ph",
]
cluster_labels = ProteinGaussianClustering.gaussian_clustering(
    log_green_fucci_zeroc_rescale, log_red_fucci_zeroc_rescale, clusternames
)
g1 = cluster_labels == g1_idx
sph = cluster_labels == sph_idx
g2 = cluster_labels == g2_idx
alpha_gauss, doGenerateBoxplotsPerGene = 0.05, False
wp_comp_kruskal_gaussccd_adj, wp_pass_kruskal_gaussccd_bh_comp, wp_mt_kruskal_gaussccd_adj, wp_pass_gaussccd_bh_mt = ProteinGaussianClustering.gaussian_clustering_analysis(
    alpha_gauss,
    doGenerateBoxplotsPerGene,
    g1,
    sph,
    g2,
    wp_ensg,
    well_plate,
    u_well_plates,
    ab_cell,
    ab_nuc,
    ab_cyto,
    mt_cell,
    wp_iscell,
    wp_isnuc,
    wp_iscyto,
)

# General look at replicates in mock-bulk analysis
ProteinGaussianClustering.address_replicates(
    alpha_gauss, wp_pass_kruskal_gaussccd_bh_comp, wp_ensg, wp_ab, u_well_plates
)

Python matplotlib seaborn From line 10 of main/1_ProteinCellCycleClusters.py

from SingleCellProteogenomics import (FucciPseudotime, Loaders,
                                      ProteinBimodality,
                                      ProteinCellCycleDependence,
                                      ProteinVariability, utils)
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import scipy
import argparse

if __name__ == "__main__":
    description = "Single cell proteogenomic analysis script -- protein cell cycle dependence"
    parser = argparse.ArgumentParser(description=description)
    parser.add_argument(
        "--quicker", action="store_true", help="Skip some plotting and analyses, namely for HPA releases."
    )
    args = parser.parse_args()
    doAllPlotsAndAnalyses = not args.quicker

    # Make PDF text readable
    plt.rcParams["pdf.fonttype"] = 42
    plt.rcParams["ps.fonttype"] = 42
    plt.rcParams["savefig.dpi"] = 300

    #%% Read in the protein data
    import_dict = Loaders.load_protein_fucci_pseudotime()
    u_plate, well_plate, well_plate_imgnb, well_plate_imgnb_objnb, u_well_plates = (
        import_dict["u_plate"],
        import_dict["well_plate"],
        import_dict["well_plate_imgnb"],
        import_dict["well_plate_imgnb_objnb"],
        import_dict["u_well_plates"],
    )
    ab_nuc, ab_cyto, ab_cell, mt_cell = (
        import_dict["ab_nuc"],
        import_dict["ab_cyto"],
        import_dict["ab_cell"],
        import_dict["mt_cell"],
    )
    area_cell, area_nuc = import_dict["area_cell"], import_dict["area_nuc"]
    wp_ensg, wp_ab = import_dict["wp_ensg"], import_dict["wp_ab"]
    green_fucci, red_fucci = import_dict["green_fucci"], import_dict["red_fucci"]
    log_green_fucci_zeroc, log_red_fucci_zeroc = (
        import_dict["log_green_fucci_zeroc"],
        import_dict["log_red_fucci_zeroc"],
    )
    log_green_fucci_zeroc_rescale, log_red_fucci_zeroc_rescale = (
        import_dict["log_green_fucci_zeroc_rescale"],
        import_dict["log_red_fucci_zeroc_rescale"],
    )
    wp_comp_kruskal_gaussccd_adj, wp_pass_kruskal_gaussccd_bh_comp = (
        import_dict["wp_comp_kruskal_gaussccd_adj"],
        import_dict["wp_pass_kruskal_gaussccd_bh_comp"],
    )
    fucci_data = import_dict["fucci_data"]
    wp_iscell, wp_isnuc, wp_iscyto = (
        import_dict["wp_iscell"],
        import_dict["wp_isnuc"],
        import_dict["wp_iscyto"],
    )
    curr_wp_phases, mockbulk_phases = (
        import_dict["curr_wp_phases"],
        import_dict["mockbulk_phases"],
    )

    #%%
    # Idea: Calculate the polar coordinates and other stuff
    # Exec: Devin's calculations
    # Output: fucci plot with polar coordinates
    pseudotime_result = FucciPseudotime.pseudotime_protein(
        fucci_data,
        ab_nuc,
        ab_cyto,
        ab_cell,
        mt_cell,
        area_cell,
        area_nuc,
        well_plate,
        well_plate_imgnb,
        well_plate_imgnb_objnb,
        log_red_fucci_zeroc_rescale,
        log_green_fucci_zeroc_rescale,
        mockbulk_phases,
    )
    pol_sort_well_plate, pol_sort_norm_rev, pol_sort_well_plate_imgnb, pol_sort_well_plate_imgnb_objnb, pol_sort_ab_nuc, pol_sort_ab_cyto, pol_sort_ab_cell, pol_sort_mt_cell, pol_sort_area_cell, pol_sort_area_nuc, pol_sort_fred, pol_sort_fgreen, pol_sort_mockbulk_phases = (
        pseudotime_result
    )

    #%% Calculate measures of variance of protein abundance in single cells
    # Idea: Calculate measures of variance, and show them in plots
    # Execution: Now that we already have the data filtered for variability, this is just descriptive.
    # Output: scatters of antibody vs microtubule variances by different measures of variaibility

    # toggle for using log-transformed intensities
    # we decided to use natural intensities
    use_log = False
    calculate_variaton_result = ProteinVariability.calculate_variation(
        use_log,
        u_well_plates,
        wp_iscell,
        wp_isnuc,
        wp_iscyto,
        pol_sort_well_plate,
        pol_sort_ab_cell,
        pol_sort_ab_nuc,
        pol_sort_ab_cyto,
        pol_sort_mt_cell,
        pol_sort_well_plate_imgnb,
    )
    mean_mean_comp, var_comp, gini_comp, cv_comp, var_cell, gini_cell, cv_cell, var_mt, gini_mt, cv_mt = (
        calculate_variaton_result
    )

    # Compare variances for protein and microtubules, the internal control for each image
    if doAllPlotsAndAnalyses:
        removeThese = pd.read_csv("input/ProteinData/ReplicatesToRemove.txt", header=None)[0] # make these independent samples for one-sided Kruskal-Wallis tests
        utils.general_boxplot(
            (var_comp[~np.isin(u_well_plates, removeThese)], var_mt[~np.isin(u_well_plates, removeThese)]),
            ("Protein", "Microtubules"),
            "",
            "Variance",
            "",
            False,
            f"figures/ProteinMicrotubuleVariances.pdf",
        )
        utils.general_boxplot(
            (cv_comp[~np.isin(u_well_plates, removeThese)], gini_mt[~np.isin(u_well_plates, removeThese)]),
            ("Protein", "Microtubules"),
            "",
            "CV",
            "",
            False,
            f"figures/ProteinMicrotubuleCVs.pdf",
        )
        utils.general_boxplot(
            (gini_comp[~np.isin(u_well_plates, removeThese)], gini_mt[~np.isin(u_well_plates, removeThese)]),
            ("Protein", "Microtubules"),
            "",
            "Gini",
            "",
            False,
            f"figures/ProteinMicrotubuleGinis.pdf",
        )
        p_varProt_varMt = 2*scipy.stats.kruskal(var_comp[~np.isin(u_well_plates, removeThese)], var_mt[~np.isin(u_well_plates, removeThese)])[1]
        p_cvProt_cvMt = 2*scipy.stats.kruskal(cv_comp[~np.isin(u_well_plates, removeThese)], cv_mt[~np.isin(u_well_plates, removeThese)])[1]
        p_giniProt_giniMt = 2*scipy.stats.kruskal(gini_comp[~np.isin(u_well_plates, removeThese)], gini_mt[~np.isin(u_well_plates, removeThese)])[1]
        print(
            f"{p_varProt_varMt}: p-value for difference between protein and microtubule variances"
        )
        print(
            f"{p_cvProt_cvMt}: p-value for difference between protein and microtubule CVs"
        )
        print(
            f"{p_giniProt_giniMt}: p-value for difference between protein and microtubule Gini indices"
        )

    #%% Gaussian clustering to identify biomodal intensity distributions
    bimodal_results = ProteinBimodality.identify_bimodal_intensity_distributions(
        u_well_plates,
        wp_ensg,
        pol_sort_well_plate,
        pol_sort_norm_rev,
        pol_sort_ab_cell,
        pol_sort_ab_nuc,
        pol_sort_ab_cyto,
        pol_sort_mt_cell,
        wp_iscell,
        wp_isnuc,
        wp_iscyto,
        doAllPlotsAndAnalyses
    )
    wp_isbimodal_fcpadj_pass, wp_bimodal_cluster_idxs, wp_isbimodal_generally, wp_bimodal_fcmaxmin = (
        bimodal_results
    )

    #%% Determine cell cycle dependence for each protein
    use_log_ccd = False
    do_remove_outliers = True
    alphaa = 0.05

    # Determine cell cycle dependence for proteins
    ccd_results = ProteinCellCycleDependence.cell_cycle_dependence_protein(
        u_well_plates,
        wp_ensg,
        wp_ab,
        use_log_ccd,
        do_remove_outliers,
        pol_sort_well_plate,
        pol_sort_norm_rev,
        pol_sort_ab_cell,
        pol_sort_ab_nuc,
        pol_sort_ab_cyto,
        pol_sort_mt_cell,
        pol_sort_fred,
        pol_sort_fgreen,
        pol_sort_mockbulk_phases,
        pol_sort_area_cell,
        pol_sort_area_nuc,
        pol_sort_well_plate_imgnb,
        wp_iscell,
        wp_isnuc,
        wp_iscyto,
        wp_isbimodal_fcpadj_pass,
        wp_bimodal_cluster_idxs,
        wp_comp_kruskal_gaussccd_adj,
        doAllPlotsAndAnalyses
    )
    wp_comp_ccd_difffromrng, mean_diff_from_rng_mt, wp_comp_ccd_clust1, wp_comp_ccd_clust2, wp_ccd_unibimodal, wp_comp_ccd_gauss, perc_var_comp, mean_diff_from_rng, wp_comp_eq_percvar_adj, mean_diff_from_rng_clust1, wp_comp_eq_percvar_adj_clust1, mean_diff_from_rng_clust2, wp_comp_eq_percvar_adj_clust2, mvavgs_x, mvavgs_comp, curr_pols, curr_ab_norms, mvperc_comps, curr_freds, curr_fgreens, curr_mockbulk_phases, curr_area_cell, curr_ab_nuc, curr_well_plate_imgnb, folder = (
        ccd_results
    )

    # Move the temporal average plots to more informative places
    if doAllPlotsAndAnalyses:
        ProteinCellCycleDependence.copy_mvavg_plots_protein(
            folder,
            wp_ensg,
            wp_comp_ccd_difffromrng,
            wp_isbimodal_fcpadj_pass,
            wp_comp_ccd_clust1,
            wp_comp_ccd_clust2,
            wp_ccd_unibimodal,
            wp_comp_ccd_gauss,
        )
        ProteinCellCycleDependence.global_plots_protein(
            alphaa,
            u_well_plates,
            wp_ccd_unibimodal,
            perc_var_comp,
            mean_mean_comp,
            gini_comp,
            cv_comp,
            mean_diff_from_rng,
            wp_comp_eq_percvar_adj,
            wp_comp_kruskal_gaussccd_adj,
        )

    # Analyze the CCD results and save them
    ccd_comp, nonccd_comp, bioccd = ProteinCellCycleDependence.analyze_ccd_variation_protein(
        folder,
        u_well_plates,
        wp_ensg,
        wp_ab,
        wp_iscell,
        wp_isnuc,
        wp_iscyto,
        wp_comp_ccd_difffromrng,
        wp_comp_ccd_clust1,
        wp_comp_ccd_clust2,
        var_comp,
        gini_comp,
        perc_var_comp,
        mean_diff_from_rng,
        wp_comp_kruskal_gaussccd_adj,
        wp_comp_eq_percvar_adj,
        mean_diff_from_rng_clust1,
        wp_comp_eq_percvar_adj_clust1,
        mean_diff_from_rng_clust2,
        wp_comp_eq_percvar_adj_clust2,
        wp_isbimodal_fcpadj_pass,
        wp_isbimodal_generally,
        wp_ccd_unibimodal,
        wp_bimodal_fcmaxmin,
        wp_comp_ccd_gauss,
    )

    # Make a dataframe for plotting on the HPA website
    ProteinCellCycleDependence.make_plotting_dataframe(
        wp_ensg,
        wp_ab,
        u_well_plates,
        wp_iscell,
        wp_iscyto,
        wp_isnuc,
        ccd_comp,
        bioccd,
        curr_pols,
        curr_ab_norms,
        curr_freds,
        curr_fgreens,
        curr_mockbulk_phases,
        mvavgs_x,
        mvavgs_comp,
        mvperc_comps,
        gini_comp,
        perc_var_comp,
    )

    # Perform comparison to LASSO for finding CCD proteins
    if doAllPlotsAndAnalyses:
        ProteinCellCycleDependence.compare_to_lasso_analysis(
            u_well_plates,
            pol_sort_norm_rev,
            pol_sort_well_plate,
            pol_sort_ab_cell,
            pol_sort_ab_nuc,
            pol_sort_ab_cyto,
            pol_sort_mt_cell,
            pol_sort_fred,
            pol_sort_fgreen,
            wp_iscell,
            wp_isnuc,
            wp_iscyto,
            wp_ensg,
            ccd_comp,
        )

        # Generate UMAPs to illustrate cutoffs and stability
        ProteinCellCycleDependence.generate_protein_umaps(
            u_well_plates,
            pol_sort_norm_rev,
            pol_sort_well_plate,
            pol_sort_ab_cell,
            pol_sort_ab_nuc,
            pol_sort_ab_cyto,
            pol_sort_mt_cell,
            wp_iscell,
            wp_isnuc,
            wp_iscyto,
            mean_diff_from_rng,
        )

Python Pandas numpy matplotlib scipy From line 11 of main/2_ProteinFucciPsuedotime.py

from SingleCellProteogenomics import (FucciPseudotime,
                                      RNACellCycleDependence,
                                      RNADataPreparation,
                                      RNAVelocity,
                                      utils)
import scanpy as sc
import numpy as np
import matplotlib.pyplot as plt
import shutil
import os
import argparse

if __name__ == "__main__":
    description = "Single cell proteogenomic analysis script -- protein cell cycle dependence"
    parser = argparse.ArgumentParser(description=description)
    parser.add_argument(
        "--quicker", action="store_true", help="Skip some plotting and analyses, namely for HPA releases."
    )
    args = parser.parse_args()
    doAllPlotsAndAnalyses = not args.quicker

    # Make PDF text readable
    plt.rcParams["pdf.fonttype"] = 42
    plt.rcParams["ps.fonttype"] = 42
    plt.rcParams["savefig.dpi"] = 300

    bioccd = np.genfromtxt(
        "input/ProteinData/BiologicallyDefinedCCD.txt", dtype="str"
    )  # from mitotic structures
    wp_ensg = np.load("output/pickles/wp_ensg.npy", allow_pickle=True)
    ccd_comp = np.load("output/pickles/ccd_comp.npy", allow_pickle=True)
    nonccd_comp = np.load("output/pickles/nonccd_comp.npy", allow_pickle=True)
    u_plates = ["355","356","357"]

    #%% Convert FACS intensities for FUCCI markers to pseudotime using the same polar coordinate methods as for protein
    # Idea: Use the polar coordinate pseudotime calculations to calculate the pseudotime for each cell
    # Execution: Adapt Devin's code for the cells sorted for RNA-Seq
    # Output: Make log-log fucci intensity plots for the cells analyzed by RNA-Seq; Plot of all fucci pseudotimes; table of pseudotimes for each cell
    adata, phases_filt = RNADataPreparation.read_counts_and_phases(
        "Counts", False, "protein_coding", u_plates
    )
    adata = RNADataPreparation.zero_center_fucci(adata)
    FucciPseudotime.pseudotime_rna(adata)

    #%% Single cell RNA-Seq data preparation and general analysis
    if doAllPlotsAndAnalyses:
        RNADataPreparation.general_plots(u_plates)
        RNADataPreparation.analyze_noncycling_cells(u_plates)
        adata, phasesfilt = RNADataPreparation.qc_filtering(
            adata, do_log_normalize=True, do_remove_blob=False
        )
        adata = RNADataPreparation.zero_center_fucci(adata)
        RNADataPreparation.plot_pca_for_batch_effect_analysis(adata, "BeforeRemovingNoncycling")

    #%% Idea: Similar to mock-bulk analysis for proteins, we can evaluate each gene bundled by phase across cells
    # Execution: Make boxplots of RNA expression by phase
    # Output: boxplots for each gene
    valuetype, use_spikeins, biotype_to_use = "Tpms", False, "protein_coding"
    adata, phases = RNADataPreparation.read_counts_and_phases(
        valuetype, use_spikeins, biotype_to_use, u_plates
    )
    adata, phasesfilt = RNADataPreparation.qc_filtering(
        adata, do_log_normalize=True, do_remove_blob=True
    )
    adata = RNADataPreparation.zero_center_fucci(adata)

    if doAllPlotsAndAnalyses:
        RNADataPreparation.plot_markers_vs_reads(adata)
        RNADataPreparation.plot_pca_for_batch_effect_analysis(adata, "AfterRemovingNoncycling")
        g1 = adata.obs["phase"] == "G1"
        s = adata.obs["phase"] == "S-ph"
        g2 = adata.obs["phase"] == "G2M"
        do_make_boxplots = False
        if do_make_boxplots:
            for iii, ensg in enumerate(adata.var_names):
                maxtpm = np.max(
                    np.concatenate((adata.X[g1, iii], adata.X[s, iii], adata.X[g2, iii]))
                )
                RNACellCycleDependence.boxplot_result(
                    adata.X[g1, iii] / maxtpm,
                    adata.X[s, iii] / maxtpm,
                    adata.X[g2, iii] / maxtpm,
                    "figures/RNABoxplotByPhase",
                    ensg,
                )

    #%% Idea: Display general RNA expression patterns in single cells using UMAP dimensionality reduction, and display with FUCCI pseudotime overlayed
    if doAllPlotsAndAnalyses:
        FucciPseudotime.pseudotime_umap(adata)  # Generate a UMAP with the pseudotime overlayed

        # We can also show that the cycle pattern remains when the curated CCD genes or CCD proteins are removed,
        # demonstrating that there's still valuable information about cell cycling beyond what was called CCD
        RNADataPreparation.demonstrate_umap_cycle_without_ccd(adata)

    # Read in the curated CCD genes / CCD proteins from the present work / Non-CCD genes from the present work; filter for genes that weren't filtered in QC of RNA-Seq
    ccd_regev_filtered, ccd_filtered, nonccd_filtered = utils.ccd_gene_lists(adata)
    adata_ccdprotein, adata_nonccdprotein, adata_regevccdgenes = RNADataPreparation.is_ccd(
        adata, wp_ensg, ccd_comp, nonccd_comp, bioccd, ccd_regev_filtered
    )

    # Generate plots with expression of genes overlayed
    expression_data = adata.X
    normalized_exp_data = (expression_data.T / np.max(expression_data, axis=0)[:, None]).T

    # Log-log FUCCI plot with RNA expression overlayed
    if doAllPlotsAndAnalyses:
        RNACellCycleDependence.plot_expression_facs(
            adata,
            wp_ensg[np.isin(wp_ensg, adata.var_names)],
            normalized_exp_data,
            "figures/GeneExpressionFucci",
        )

    # Cluster the expression into phases and analyze it that way
    bulk_phase_tests = RNACellCycleDependence.analyze_ccd_variation_by_phase_rna(
        adata, normalized_exp_data, biotype_to_use
    )
    # RNACellCycleDependence.plot_expression_boxplots(adata, wp_ensg[np.isin(wp_ensg, adata.var_names)], bulk_phase_tests, "figures/GeneExpressionBoxplots")
    # RNACellCycleDependence.plot_expression_umap(adata, wp_ensg[np.isin(wp_ensg, adata.var_names)], "figures/GeneExpressionUmap")

    #%% Moving average calculations and randomization analysis for RNA
    rna_ccd_analysis_results = RNACellCycleDependence.analyze_ccd_variation_by_mvavg_rna(
        adata,
        wp_ensg,
        ccd_comp,
        bioccd,
        adata_nonccdprotein,
        adata_regevccdgenes,
        biotype_to_use,
        make_mvavg_plots_isoforms=doAllPlotsAndAnalyses,
    )
    percent_ccd_variance, total_gini, mean_diff_from_rng, pass_meandiff, eq_percvar_adj, fucci_time_inds, norm_exp_sort, moving_averages, mvavg_xvals, perms, ccdtranscript, ccdprotein, mvpercs = (
        rna_ccd_analysis_results
    )

    RNACellCycleDependence.make_plotting_dataframe(
        adata,
        ccdtranscript,
        wp_ensg,
        bioccd,
        norm_exp_sort[np.argsort(fucci_time_inds), :],
        mvavg_xvals,
        moving_averages,
        mvpercs,
    )

    if doAllPlotsAndAnalyses:
        RNACellCycleDependence.plot_umap_ccd_cutoffs(adata, mean_diff_from_rng)
        RNACellCycleDependence.figures_ccd_analysis_rna(
            adata,
            percent_ccd_variance,
            mean_diff_from_rng,
            pass_meandiff,
            eq_percvar_adj,
            wp_ensg,
            ccd_comp,
            ccd_regev_filtered,
        )
        RNACellCycleDependence.plot_overall_and_ccd_variances(
            adata,
            biotype_to_use,
            total_gini,
            percent_ccd_variance,
            pass_meandiff,
            adata_ccdprotein,
            adata_nonccdprotein,
            adata_regevccdgenes,
        )
        RNACellCycleDependence.compare_to_lasso_analysis(adata, ccdtranscript)
        RNACellCycleDependence.analyze_cnv_calls(adata, ccdtranscript)

    #%% Moving average calculations and randomization analysis for the spike-in internal controls
    if doAllPlotsAndAnalyses:
        adata_spikeins, phases_spikeins = RNADataPreparation.read_counts_and_phases(
            valuetype, use_spike_ins=True, biotype_to_use="", u_plates=u_plates
        )
        sc.pp.filter_genes(adata_spikeins, min_cells=100)
        print(f"data shape after filtering: {adata_spikeins.X.shape}")

        RNACellCycleDependence.ccd_analysis_of_spikeins(adata_spikeins, perms)

    #%% Analyze RNA velocity
    valuetype, use_spikeins, biotype_to_use = "Tpms", False, "protein_coding"
    adata, phases = RNADataPreparation.read_counts_and_phases(
        valuetype, use_spikeins, biotype_to_use, u_plates, load_velocities=True
    )
    adata_blobless, phasesfilt = RNADataPreparation.qc_filtering(
        adata, do_log_normalize=True, do_remove_blob=False
    )
    RNAVelocity.analyze_rna_velocity(adata_blobless, mean_diff_from_rng, doAllPlotsAndAnalyses)

    #%% Analyze isoforms
    if doAllPlotsAndAnalyses:
        adata_isoform, ccdtranscript_isoform = RNACellCycleDependence.analyze_isoforms(
            adata, ccdtranscript, wp_ensg, ccd_comp, nonccd_comp, u_plates, make_mvavg_plots_isoforms=doAllPlotsAndAnalyses
        )
        RNACellCycleDependence.compare_genes_to_isoforms(
            adata,
            ccdprotein,
            ccdtranscript,
            adata_nonccdprotein,
            adata_isoform,
            ccdtranscript_isoform,
        )

Python numpy matplotlib From line 11 of main/3_RNAFucciPseudotime.py

from SingleCellProteogenomics import (Loaders, RNADataPreparation,
                                      TemporalDelay)
import matplotlib.pyplot as plt

# Make PDF text readable
plt.rcParams["pdf.fonttype"] = 42
plt.rcParams["ps.fonttype"] = 42
plt.rcParams["savefig.dpi"] = 300

#%% Read in the protein data
import_dict = Loaders.load_temporal_delay()
u_well_plates, wp_ensg = import_dict["u_well_plates"], import_dict["wp_ensg"]
wp_iscell, wp_isnuc, wp_iscyto = (
    import_dict["wp_iscell"],
    import_dict["wp_isnuc"],
    import_dict["wp_iscyto"],
)
ccd_comp, ccdtranscript, ccdtranscript_isoform = (
    import_dict["ccd_comp"],
    import_dict["ccdtranscript"],
    import_dict["ccdtranscript_isoform"],
)
pol_sort_well_plate, pol_sort_norm_rev = (
    import_dict["pol_sort_well_plate"],
    import_dict["pol_sort_norm_rev"],
)
pol_sort_ab_nuc, pol_sort_ab_cyto, pol_sort_ab_cell, pol_sort_mt_cell = (
    import_dict["pol_sort_ab_nuc"],
    import_dict["pol_sort_ab_cyto"],
    import_dict["pol_sort_ab_cell"],
    import_dict["pol_sort_mt_cell"],
)
var_comp_prot, gini_comp_prot, cv_comp_prot = (
    import_dict["var_comp"],
    import_dict["gini_comp"],
    import_dict["cv_comp"],
)
var_cell_prot, gini_cell_prot, cv_cell_prot = (
    import_dict["var_cell"],
    import_dict["gini_cell"],
    import_dict["cv_cell"],
)

#%% Idea: Make temporal heatmap for peak protein expression, and compare known and novel proteins that peak at similar times
# Execution: plt.imshow makes a heatmap if given a 2D array
# Output: heatmap; correlations of known/novel proteins
highlights = []  #'ORC6','DUSP19','BUB1B','DPH2', 'FLI1']
highlights_ensg = []  #'ORC6','DUSP19','BUB1B','DPH2', 'FLI1']

nbins = 20
protein_heatmap_results = TemporalDelay.protein_heatmap(
    nbins,
    highlights,
    highlights_ensg,
    ccd_comp,
    u_well_plates,
    wp_ensg,
    pol_sort_norm_rev,
    pol_sort_well_plate,
    pol_sort_ab_cell,
    pol_sort_ab_nuc,
    pol_sort_ab_cyto,
    pol_sort_mt_cell,
    wp_iscell,
    wp_isnuc,
    wp_iscyto,
)
sorted_maxpol_array, wp_binned_values, wp_max_pol, wp_max_pol_ccd, xvals = (
    protein_heatmap_results
)

# Correlations of known and novel proteins that peak at similar times
TemporalDelay.peak_expression_correlation_analysis(
    wp_binned_values, wp_max_pol, wp_ensg, pol_sort_well_plate, u_well_plates
)

#%% Create a heatmap of peak RNA expression
highlight_names, highlight_ensg = [], []
u_rna_plates = ["355","356","357"]

# Read in RNA-Seq data; use TPMs so that the gene-specific results scales match for cross-gene comparisons
valuetype, use_spikeins, biotype_to_use = "Tpms", False, "protein_coding"
adata, phases = RNADataPreparation.read_counts_and_phases(
    valuetype, use_spikeins, biotype_to_use, u_rna_plates
)
adata, phasesfilt = RNADataPreparation.qc_filtering(
    adata, do_log_normalize=True, do_remove_blob=True
)
adata = RNADataPreparation.zero_center_fucci(adata)
sorted_max_moving_avg_pol_ccd, norm_exp_sort, max_moving_avg_pol, sorted_rna_binned_norm = TemporalDelay.rna_heatmap(
    adata, highlight_names, highlight_ensg, ccdtranscript, xvals
)

# Analyze isoforms
adata_isoform, phases_isoform = RNADataPreparation.read_counts_and_phases(
    valuetype, use_spikeins, biotype_to_use, u_rna_plates, use_isoforms=True,
)
adata_isoform, phasesfilt_isoform = RNADataPreparation.qc_filtering(
    adata_isoform, do_log_normalize=True, do_remove_blob=True
)
adata_isoform = RNADataPreparation.zero_center_fucci(adata_isoform)
sorted_max_moving_avg_pol_ccd_isoform, norm_exp_sort_isoform, max_moving_avg_pol_isoform, sorted_rna_binned_norm_isoform = TemporalDelay.rna_heatmap(
    adata_isoform,
    highlight_names,
    highlight_ensg,
    ccdtranscript_isoform,
    xvals,
    isIsoformData=True,
)
pearsonCorrelations = TemporalDelay.analyze_ccd_isoform_correlations(
    adata, adata_isoform, ccdtranscript, ccdtranscript_isoform, xvals, u_rna_plates
)

#%% Compare the variances and time of peak expression between protein and RNA
TemporalDelay.compare_variances_prot_v_rna(
    adata,
    norm_exp_sort,
    wp_ensg,
    var_comp_prot,
    gini_comp_prot,
    cv_comp_prot,
    var_cell_prot,
    gini_cell_prot,
    cv_cell_prot,
)

TemporalDelay.compare_peak_expression_prot_v_rna(
    adata,
    wp_ensg,
    ccd_comp,
    ccdtranscript,
    wp_max_pol,
    wp_max_pol_ccd,
    sorted_maxpol_array,
    max_moving_avg_pol,
    sorted_max_moving_avg_pol_ccd,
)

Python matplotlib From line 10 of main/4_TemporalDelay.py

from SingleCellProteogenomics import (FucciCellCycle, Loaders,
                                      ProteinPropertyAnalysis,
                                      RNADataPreparation, utils)
import matplotlib.pyplot as plt
import numpy as np

# Make PDF text readable
plt.rcParams["pdf.fonttype"] = 42
plt.rcParams["ps.fonttype"] = 42
plt.rcParams["savefig.dpi"] = 300
fucci = FucciCellCycle.FucciCellCycle()
u_rna_plates = ["355","356","357"]

#%% Import the genes names we're analyzing
# Read in RNA-Seq data again and the CCD gene lists
valuetype, use_spikeins, biotype_to_use = "Tpms", False, "protein_coding"
adata, phases = RNADataPreparation.read_counts_and_phases(
    valuetype, use_spikeins, biotype_to_use, u_rna_plates
)
adata, phasesfilt = RNADataPreparation.qc_filtering(
    adata, do_log_normalize=True, do_remove_blob=True
)
adata = RNADataPreparation.zero_center_fucci(adata)
import_dict = Loaders.load_ptm_and_stability(adata)
wp_ensg, ccd_comp, nonccd_comp, ccdtranscript, wp_max_pol = (
    import_dict["wp_ensg"],
    import_dict["ccd_comp"],
    import_dict["nonccd_comp"],
    import_dict["ccdtranscript"],
    import_dict["wp_max_pol"],
)
ensg_results, name_results = utils.save_gene_names_by_category(
    adata, wp_ensg, ccd_comp, nonccd_comp, ccdtranscript
)
ensg_ccdtranscript, ensg_nonccdtranscript, ensg_ccdprotein, ensg_nonccdprotein, ensg_ccdprotein_transcript_regulated, ensg_ccdprotein_nontranscript_regulated, genes_analyzed, ccd_regev_filtered, ccd_filtered = ensg_results
names_ccdtranscript, names_nonccdtranscript, names_ccdprotein, names_nonccdprotein, names_ccdprotein_transcript_regulated, names_ccdprotein_nontranscript_regulated, names_genes_analyzed, names_ccd_regev_filtered, names_ccd_filtered = name_results
bioccd = np.genfromtxt(
    "input/ProteinData/BiologicallyDefinedCCD.txt", dtype="str"
)  # from mitotic structures
names_bioccd = utils.ccd_gene_names(bioccd, utils.getGeneNameDict())

#%% Analyze properties of the different groups relative to melting points
proteinProperties = ProteinPropertyAnalysis.ProteinProperties(
    wp_ensg,
    ensg_ccdprotein,
    ensg_ccdprotein_transcript_regulated,
    ensg_ccdprotein_nontranscript_regulated,
    bioccd,
    ensg_nonccdprotein,
    ensg_ccdtranscript,
    names_bioccd,
    names_ccdprotein,
    names_ccdprotein_transcript_regulated,
    names_ccdprotein_nontranscript_regulated,
    names_nonccdprotein,
    names_ccdtranscript,
)
proteinProperties.analyze_melting_points()
proteinProperties.analyze_disorder()
proteinProperties.analyze_length()
proteinProperties.statistical_properties_table()
proteinProperties.generate_properties_table()
proteinProperties.generate_statistical_boxplots()
proteinProperties.tm_scatters()
proteinProperties.kinase_families()

Python numpy matplotlib From line 12 of main/5_ProteinProperties.py

shell: "python 1_ProteinCellCycleClusters.py &> {log}"

SnakeMake From line 25 of workflow/Snakefile

shell: "python 2_ProteinFucciPsuedotime.py &> {log}"

SnakeMake From line 32 of workflow/Snakefile

shell: "python 3_RNAFucciPseudotime.py &> {log}"

SnakeMake From line 39 of workflow/Snakefile

shell: "python 4_TemporalDelay.py &> {log}"

SnakeMake From line 46 of workflow/Snakefile

shell: "python 5_ProteinProperties.py &> {log}"

SnakeMake From line 53 of workflow/Snakefile

ShowHide 8 more snippets with no or duplicated tags.

Comments

Support

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Created: 1yr ago

Updated: 1yr ago

Maitainers: public

URL: https://github.com/CellProfiling/SingleCellProteogenomics

Name: singlecellproteogenomics

Version: 1.2.1

Badge:

Insert copied code into your website to add a link to this workflow.

License: None

Keywords:

Data Cell migration analysis Gene report Pandas SCANPY Snakemake matplotlib numpy scipy seaborn Proteogenomics

Future updates

Related Workflows

psychip_snakemake — Show Details View Workflow

ENCODE pipeline for histone marks developed for the psychENCODE project

public

psychip pipeline is an improved version of the ENCODE pipeline for histone marks developed for the psychENCODE project. The o...

raw sequence reads Alignment Sequence alignment report macs2 ucsc-bedclip bedGraphToBigWig BEDTools BWA Picard SAMtools Snakemake

Free

Near-real time tracking of SARS-CoV-2 in Connecticut

public

Repository containing scripts to perform near-real time tracking of SARS-CoV-2 in Connecticut using genomic data. This pipeli...

JSON nextclade Augur Biopython FOCUS Pandas Snakemake bs4 epiweeks geopy matplotlib numpy pycountry pycountry-convert uszipcode

Free

cellranger-snakemake-gke — Show Details View Workflow

snakemake workflow to run cellranger on a given bucket using gke.

public

A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...

macs2 ucsc-bedclip bedGraphToBigWig BEDTools BWA Picard SAMtools Snakemake

Free

ATLAS - Three commands to start analyzing your metagenome data

public

Metagenome-atlas is a easy-to-use metagenomic pipeline based on snakemake. It handles all steps from QC, Assembly, Binning, t...

raw sequence reads Genome assembly Annotation track checkm2 gunc prodigal snakemake-wrapper-utils MEGAHIT Atlas BBMap Biopython BioRuby Bwa-mem2 cd-hit CheckM DAS Diamond eggNOG-mapper v2 MetaBAT 2 Minimap2 MMseqs MultiQC Pandas Picard pyfastx SAMtools SemiBin Snakemake SPAdes SqueezeMeta TADpole VAMB CONCOCT ete3 gtdbtk h5py networkx numpy plotly psutil utils metagenomics

Free

175

rna-seq-star-deseq2 — Show Details View Workflow

RNA-seq workflow using STAR and DESeq2

public

This workflow performs a differential gene expression analysis with STAR and Deseq2. The usage of this workflow is described ...

Free

dna-seq-gatk-variant-calling — Show Details View Workflow

This Snakemake pipeline implements the GATK best-practices workflow

public

This Snakemake pipeline implements the GATK best-practices workflow for calling small germline variants. The usage of thi...

VCF raw sequence reads Variant calling genetic variants gatk rust-bio-tools snakemake-wrapper-utils tabix BCFtools BWA FastQC MultiQC Pandas Picard SAMtools Snakemake Trimmomatic Variant Effect Predictor (VEP) common matplotlib numpy seaborn DNA

Free