Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
Table of Contents
This repo contains files, scripts, and analysis related to exploring integration of single-cell and single-nuclei data.
Environment setup
Managing R packages with
renv
Package dependencies for the analysis workflows in this repository are managed using
renv
. For
renv
to work as intended, you'll need to work within the
sc-data-integration.Rproj
project in RStudio. You may need to run
renv::restore()
upon opening the project to ensure the
renv.lock
file is synced with the project library.
Each time you install or use new packages, you will want to run
renv::snapshot()
to update the
renv.lock
file with any added package and dependencies necessary to run the analyses and scripts in this repo.
If there are dependencies you want to include that are not captured automatically by
renv::snapshot()
, add them to
components/dependencies.R
with a call to
library()
and an explanatory comment. For example, if
dplyr
were recommended but not required by a package and you wanted to make sure to include it in the lockfile, you would add
library(dplyr)
to
components/dependencies.R
. Then rerun
renv::snapshot()
.
Snakemake/conda setup
The main workflow for the integration scripts is written with Snakemake, which will handle most dependencies internally, including the
renv
environment.
You will need the latest version of
snakemake
and the
peppy
python package. The easiest way to install these is with
conda
and/or
mamba
, which you will want to set up to use the
bioconda
and
conda-forge
channels using the following series of commands:
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set channel_priority strict
You can then install
snakemake
and
peppy
into your preferred environment with:
mamba install snakemake peppy
(Use
conda install
if you do not have
mamba
installed.)
Python-based environments will be built automatically by Snakemake when the workflow is run, but the environment for R should be built before running the workflow. To create or update the necessary environment for the R scripts, which includes an isolated version of
R
,
pandoc
, and the
renv
package installation, run the following command from the base of the repository:
bash setup_envs.sh
This script will use Snakemake to install all necessary components for the workflow in an isoloated Conda enviroment. If you are on an Apple Silicon (M1/M2/Arm) Mac, this should properly handle setting up R to use an Intel-based build for compatibility with Bioconductor packages.
This installation may take up to an hour, as all of the R packages will likely have to be compiled from scratch. However, this should be a one-time cost, and ensures that you have all of the tools for the workflow installed and ready.
To use the environment you have just created, you will need to run Snakemake with the
--use-conda
flag each time.
If there are updates to the
renv.lock
file only, those can be applied with the following command (on any system):
snakemake --use-conda -c2 build_renv
Data used for benchmarking integration
For exploring data integration, we used test datasets that were obtained from the Human Cell Atlas (HCA) Data Portal , the Single-cell Pediatric Cancer Atlas(ScPCA) Portal , and simulated single-cell data published in Luecken et al., (2022) .
All data from the HCA that we are using can be found in the private S3 bucket,
s3://sc-data-integration/human_cell_atlas_data
. The simulated data can be downloaded directly from
figshare
.
All gene expression data used for benchmarking is stored in the private S3 bucket,
s3://sc-data-integration
. In order to access these files, you must be a Data Lab staff member and have credentials setup for AWS.
Metadata
Inside the
sample-info
folder is metadata related to datasets used for testing data integration.
-
<project_name>-project-metadata.tsv
: This file contains information about each of the projects that are being used for testing integration from a given area (e.g., HCA, simulated, ScPCA). Each row in this file corresponds to a project, dataset, or group of libraries that should be integrated together. Allproject-metadata.tsv
files must contain aproject_name
column, but may also contain other relevant project information such as the following:
column_id | contents |
---|---|
project_name
|
The shorthand project name |
source_id
|
Unique ID associated with the project |
tissue_group
|
Tissue group the project belongs to (e.g. blood, brain, or kidney) |
files_directory
|
files directory on S3 |
metadata_filename
|
name of metadata file obtained from the HCA |
celltype_filename
|
file name corresponding to file containing cell type information as found on HCA |
celltype_filetype
|
format of cell type file availble on HCA |
-
hca-library-metadata.tsv
This file contains information about each library from every project that is being used as a test dataset from the HCA. Each row in this file corresponds to a library and contains the following columns:
column_id | contents |
---|---|
sample_biomaterial_id
|
Unique ID associated with the individual sample |
library_biomaterial_id
|
Unique ID associated with the individual library that was sequenced |
bundle_uuid
|
UUID for the individual folder containing each loom file |
project_name
|
The shorthand project name assigned by the HCA |
source_id
|
Unique ID associated with the project |
tissue_group
|
Tissue group the project belongs to (e.g. blood, brain, or kidney) |
technology
|
Sequencing/library technology used (e.g. 10Xv2, 10Xv3, etc.) |
seq_unit
|
Sequencing unit (cell or nucleus) |
diagnosis
|
Indicates if the sample came from diseasead or normal tissue |
organ
|
Specified tissue by the HCA where the sample was obtained from |
organ_part
|
Specified tissue region by the HCA where the sample was obtained from |
selected_cell_types
|
Identifies the group of cells selected for prior to sequencing, otherwise NA |
s3_files_dir
|
files directory on S3 |
loom_file
|
loom file name in the format
tissue_group/project_name/bundle_uuid/filename
|
-
<project_name>-processed-libraries.tsv
: This file contains the list of libraries from each project that are being used for testing data integration. This metadata file is required for most scripts, including runningscpca-downstream-analyses
using01-run-downstream-analyses.sh
and for running the integration workflow. This file must contain the following columns, but may also contain additional columns related to a given dataset:
column_id | contents |
---|---|
sample_biomaterial_id
|
Unique ID associated with the individual sample |
library_biomaterial_id
|
Unique ID associated with the individual library that was sequenced |
project_name
|
The shorthand project name |
integration_input_dir
|
The directory containing the
SingleCellExperiment
objects to be used as input to the data integration snakemake workflow
|
-
hca-celltype-info.tsv
: This file is not available on the repo and is stored in the private S3 bucket,s3://sc-data-integration/sample-info
. This file contains all available cell type information for projects listed inhca-project-metadata.tsv
. This file was created using thescripts/00a-reformat-celltype-info.R
which takes as input the cell type information available for each project from the Human Cell Atlas Data Portal. The cell type information for each project, in its original format, can be stored ins3://sc-data-integration/human_cell_atlas_data/celltype
. Each row corresponds to a single cell and contain the following information:
column_id | contents |
---|---|
sample_biomaterial_id
|
Unique ID associated with the individual sample |
library_biomaterial_id
|
Unique ID associated with the individual library that was sequenced |
project
|
The shorthand project name assigned by the HCA |
barcode
|
The unique cell barcode |
celltype
|
The assigned cell type for a given barcode, obtained from cell type data stored in
s3://sc-data-integration/human_cell_atlas_data/celltype
|
Data files present on S3
All data and intermediate files are stored in the private S3 bucket,
s3://sc-data-integration
. The following data can be found in the above S3 bucket within the
human_cell_atlas_data
folder:
-
The
loom
folder contains the original loom files downloaded from the Human Cell Atlas data portal for each test dataset. Here loom files are nested bytissue_group
,project_name
, andbundle_uuid
. -
The
sce
folder contains the unfilteredSingleCellExperiment
objects saved as RDS files. TheseSingleCellExperiment
objects have been converted from the loom files using the00-obtain-sce.R
script in thescripts
directory in this repo. Here RDS files are nested bytissue_group
andproject_name
.
The following data can be found in the S3 bucket within the
scib_simulated_data
folder:
-
The
hdf5
folder contains the originalhdf5
files for simulated data obtained from figshare . -
The
sce
folder contains the individualSingleCellExperiment
objects stored asrds
files after runningscripts/00b-obtain-sim-sce.R
andscripts/00c-create-sim1-subsets.R
A separate
reference-files
folder contains any reference files needed for processing dataset, such as the gtf file needed to generate the mitochondrial gene list found in the
reference-files
folder in the repository.
In order to access these files, you must be a Data Lab staff member and have credentials set up for AWS. Additionally, some of the scripts in this repository require use of AWS command line tools. We have previously written up detailed instructions on installing the AWS command line tools and configuring your credentials that can be used as a reference.
After AWS command line tools have been set up, the
SingleCellExperiment
objects found in
s3://sc-data-integration/human_cell_atlas_data/sce
can be copied to your local computer by running the
00-obtain-sce.R
script with the
--copy_s3
flag.
Rscript scripts/00-obtain-sce.R --copy_s3
This will copy any
SingleCellExperiment
objects for libraries listed in
hca-processed-libraries.tsv
that have already been converted from loom files. If any libraries listed in
hca-processed-libraries.tsv
do not have corresponding
SingleCellExperiment
objects, running the
00-obtain-sce.R
will also convert those loom files.
Processed SingleCellExperiment objects to use for data integration
The
human_cell_atlas_results/scpca-downstream-analyses
folder contains all processed
SingleCellExperiment
objects and the output from
running the core workflow in
scpca-downstream-analyses
. Within this folder each library that has been processed has its own folder that contains both the processed
SingleCellExperiment
object and an html summary report. The
SingleCellExperiment
objects in this folder have both been filtered to remove empty droplets and run through
scpca-downstream-analyses
using the
scripts/01-run-downstream-analyses.sh
script. This means they contain a
logcounts
assay with the normalized counts matrix, both PCA and UMAP embeddings, and clustering assignments that can be found in the
louvain_10
column of the
colData
. The
SingleCellExperiment
objects present in
human_cell_atlas_results/scpca-downstream-analyses
should be the objects used as input for integration methods.
These files were produced and synced to S3 using the following script:
Note:
To run the below script, you must have available in your path R (v4.1.2),
Snakemake
and
pandoc
.
pandoc
must be version 1.12.3 or higher, which can be checked using the
pandoc -v
command.
bash scripts/01-run-downstream-analyses.sh \ --downstream_repo <full path to scpca-downstream-analyses-repo> \ --s3_bucket "s3://sc-data-integration/human_cell_atlas_results/scpca-downstream-analyses"
Note: If you wish to run ScPCA data through the integration workflow, rather than HCA data, please see the special guidelines for preparing ScPCA data .
Running the integration workflow
To run the integration workflow, invoke
snakemake
from the
sc-data-integration
directory with the following command:
snakemake -c4 --use-conda
You can adjust the number of cores used by adjusting the
-c4
flag with however many cores you want to use where the given number represents the number of desired cores (here, 4). Note that you will want to have
set up the R conda environment already
, especially if you are on an Apple Silicon Mac.
To run the workflow for development, you may wish to specify the
config-test.yaml
file, which will only run one project through the pipeline to save time:
snakemake -c4 --use-conda --configfile config-test.yaml
Finally, to run the
scib_simulated
data through the pipeline, use:
snakemake -c4 --use-conda --configfile config-scib_simulated.yaml
Code Snippets
24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 | project_root <- here::here() renv::load(project_root) # import libraries library(magrittr) library(optparse) source(file.path(project_root, "scripts", "utils", "integration-helpers.R")) # Set up optparse options option_list <- list( make_option( opt_str = c("-l", "--library_file"), type = "character", default = file.path(project_root, "sample-info", "hca-processed-libraries.tsv"), help = "path to file listing all libraries that are to be converted" ), make_option( opt_str = c("-g", "--grouping_var"), type = "character", default = "project_name", help = "Column name present in the library metadata file that was used to indicate which SCE objects should be integrated in `02-prepare-merged-sce.R`." ), make_option( opt_str = c("--merged_sce_dir"), type = "character", default = file.path(project_root, "results", "human_cell_atlas", "merged_sce"), help = "Path to folder where SCE objects to be converted are stored, each file should contain the library ID in the filename and be stored as an RDS file. Typically this is the output from running scpca-downstream-analyses" ), make_option( opt_str = c("--anndata_output_dir"), type = "character", default = file.path(project_root, "results", "human_cell_atlas", "merged_anndata"), help = "path to folder where all AnnData files will be saved as HDF5 files" ) ) # Setup ------------------------------------------------------------------------ # Parse options opt <- parse_args(OptionParser(option_list = option_list)) # checks that provided metadata files exist if(!file.exists(opt$library_file)){ stop("--library_file provided does not exist.") } # read in library metadata and grab names of grouping variable library_metadata_df <- readr::read_tsv(opt$library_file) if(!opt$grouping_var %in% colnames(library_metadata_df)){ stop("--grouping_var must correspond to a column provided in the library metadata file.") } group_names <- library_metadata_df %>% dplyr::pull(opt$grouping_var) %>% unique() # setup output directory create_dir(opt$anndata_output_dir) # Identify merged SCE files ---------------------------------------------------- # find SCE files that match library ID group_name_search <- paste(group_names, collapse = "|") sce_files <- list.files(opt$merged_sce_dir, pattern = group_name_search, recursive = TRUE, full.names = TRUE) # if the number of sce files is different then the number of groups to integrate find the missing files if(length(sce_files) < length(group_names)){ groups_found <- stringr::str_extract(sce_files, group_name_search) missing_groups <- setdiff(group_names, groups_found) stop( glue::glue( "Missing SCE object for {missing_groups}. Make sure that you have run `02-prepare-merged-sce.R`." ) ) } # Write H5 --------------------------------------------------------------------- # extract group names from file path to make sure we are writing the input to the correctly named output sce_file_names <- stringr::str_extract(sce_files, pattern = group_name_search) # create paths to anndata h5 files anndata_files <- file.path(opt$anndata_output_dir, paste0(sce_file_names, "_anndata.h5")) # small function to read in sce and export as anndata export_anndata <- function(sce_file, anndata_file){ sce <- readr::read_rds(sce_file) scpcaTools::sce_to_anndata(sce,anndata_file = anndata_file) } # apply export_anndata function to all sce files purrr::walk2(sce_files, anndata_files, export_anndata) |
39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 | project_root <- here::here() renv::load(project_root) # import libraries library(magrittr) library(optparse) library(SingleCellExperiment) # source helper functions source(file.path(project_root, "scripts", "utils", "integration-helpers.R")) # Set up optparse options option_list <- list( make_option( opt_str = c("-l", "--library_file"), type = "character", default = file.path(project_root, "sample-info", "hca-processed-libraries.tsv"), help = "path to file listing all libraries that are to be converted" ), make_option( opt_str = c("-g", "--grouping_var"), type = "character", default = "project_name", help = "Column name present in the library metadata file to use for grouping SCE objects and merging." ), make_option( opt_str = c("--groups_to_integrate"), type = "character", default = "All", help = "Groups present in `grouping_var` column of metadata file to create merged SCE objects for. Default is 'All' which produces a merged object for each group in the metadata file. Specify groups by using a vector, e.g. c('group1','group2')" ), make_option( opt_str = c("--add_celltype"), action = "store_true", default = FALSE, help = "Boolean indicating whether or not celltype data should be added to the individual SCE object prior to merging." ), make_option( opt_str = c("--celltype_info"), type = "character", default = file.path(project_root, "sample-info", "hca-celltype-info.tsv"), help = "Path to file containing cell type information for each SCE object. Must contain columns named `library_biomaterial_id`, `celltype`, and `barcode`." ), make_option( opt_str = c("--batch_column"), type = "character", default = "batch", help = "The name of the column in colData that indicates the batches for each cell, typically this corresponds to the library id. Default is 'batch'." ), make_option( opt_str = c("--random_merge"), default = FALSE, action = "store_true", help = "Used to indicate whether or not to merge SCE objects in a random order. Default is FALSE." ), make_option( opt_str = c("--subset_hvg"), default = FALSE, action = "store_true", help = "Indicates whether or not to subset the merged SCE object by highly variable genes. If --subset_hvg is used, the merged SCE object will only contain genes identified as highly variable genes." ), make_option( opt_str = c("--pca_use_all_genes"), default = FALSE, action = "store_true", help = "Indicates whether or not to use the all genes as input to performing principal component analysis. Otherwise only highly variable genes are used as input." ), make_option( opt_str = c("-n", "--num_hvg"), dest = "num_genes", type = "integer", default = 5000, help = "Number of highly variable genes to select." ), make_option( opt_str = c("--merged_sce_dir"), type = "character", default = file.path(project_root, "results", "human_cell_atlas", "merged_sce"), help = "path to folder where all merged SCE objects files will be saved as RDS files" ), make_option( opt_str = c("--seed"), type = "integer", default = NULL, help = "random seed to set prior to merging" ) ) # Setup ------------------------------------------------------------------------ # Parse options opt <- parse_args(OptionParser(option_list = option_list)) set.seed(opt$seed) # check that num genes provided is an integer if(!is.integer(opt$num_genes)){ stop("--num_hvg must be an integer.") } # checks that provided metadata files exist if(!file.exists(opt$library_file)){ stop("--library_file provided does not exist.") } # read in library metadata and grab unfiltered sce file paths library_metadata_df <- readr::read_tsv(opt$library_file) # check that cell type file exists if using add_celltype option if(opt$add_celltype){ if(!file.exists(opt$celltype_info)){ stop("--celltype_info file provided does not exist.") } celltype_info_df <- readr::read_tsv(opt$celltype_info) } # check that grouping variable is present if(!opt$grouping_var %in% colnames(library_metadata_df)){ stop("Must provide a grouping_var that is a column in the library metadata file.") } # define groups to integrate groups <- library_metadata_df %>% dplyr::pull(opt$grouping_var) %>% unique() # check that groups to integrate isn't set to All if(length(opt$groups_to_integrate) == 1 && (opt$groups_to_integrate == "All")){ groups_to_integrate <- groups } else { # if All is not used then unlist groups, using space, needed to parse a list from snakemake groups_to_integrate <- unlist(stringr::str_split(opt$groups_to_integrate, " ")) # check that specified groups are present in grouping_var column if(!any(groups_to_integrate %in% groups)){ stop("Provided `--groups_to_integrate` must also be present in the `--grouping_var` column of the library metadata file.") } } # subset metadata file to only contain groups to integrate library_metadata_df <- library_metadata_df %>% dplyr::filter(.data[[opt$grouping_var]] %in% groups_to_integrate) # grab library ids library_ids <- library_metadata_df %>% dplyr::pull(library_biomaterial_id) # setup output directory create_dir(opt$merged_sce_dir) # Identify SCE files ----------------------------------------------------------- # grab all unique directories corresponding to projects considered for integration input_sce_dirs <- library_metadata_df %>% dplyr::pull(integration_input_dir) %>% unique() # find SCE files that match library ID, and throw an error if any are missing. sce_files <- find_input_sce_files(library_ids, input_sce_dirs) # Merge by group --------------------------------------------------------------- # get the library IDs from the SCE file names so that we can name the SCEs in the correct order library_ids_sce_order <- stringr::str_extract(sce_files, pattern = paste(library_ids, collapse = "|")) sce_file_df <- data.frame(sce_files = sce_files, library_id= library_ids_sce_order) %>% dplyr::left_join(library_metadata_df, by = c("library_id" = "library_biomaterial_id")) %>% dplyr::select(library_id, opt$grouping_var, sce_files) # group dataframe by the grouping variable grouped_sce_file_df <- split(sce_file_df, sce_file_df[,opt$grouping_var]) # create a list of SCE lists that is named by the grouping variable with # each individual inner SCE list named by the library IDs create_grouped_sce_list <- function(sce_info_dataframe, celltype_info_df = NULL, add_celltype){ library_sce_list = list() for (library_idx in 1:length(sce_info_dataframe$library_id)){ # read sce list for each library sce <- readr::read_rds(sce_info_dataframe$sce_files[library_idx]) library_name <- sce_info_dataframe$library_id[library_idx] if(add_celltype){ # if celltype info is provided add to sce object if(!is.null(celltype_info_df)){ # check that library has cell type information if(library_name %in% unique(celltype_info_df$library_biomaterial_id)){ # filter celltype info to only have info for specified library filtered_celltype_info <- celltype_info_df %>% dplyr::filter(library_biomaterial_id == library_name) %>% dplyr::select(barcode, celltype) # add celltype info sce <- add_celltype_info(sce_object = sce, celltype_info_df = filtered_celltype_info) # add flag indicating that cell type information is available metadata(sce)$celltype_info_available <- TRUE } # only add celltype column/metadata if add celltype is yes, but no celltype data is available } else { # note that no cell type information is available colData(sce)$celltype <- NA metadata(sce)$celltype_info_available <- FALSE } } # create a list for each group named by the library IDs library_sce_list[[library_name]] <- sce } return(library_sce_list) } # create grouped sce list with/without celltype addition if(opt$add_celltype){ grouped_sce_list <- grouped_sce_file_df %>% purrr::map(create_grouped_sce_list, celltype_info_df, add_celltype = opt$add_celltype) } else { grouped_sce_list <- grouped_sce_file_df %>% purrr::map(create_grouped_sce_list, add_celltype = opt$add_celltype) } # create a list of merged SCE objects by group # In this default usage, a batch column named `batch` will get created merged_sce_list <- grouped_sce_list %>% purrr::map(combine_sce_objects, preserve_rowdata_columns = c("Gene", "gene_names", "ensembl_ids"), random_merge = opt$random_merge) # HVG and dim reduction -------------------------------------------------------- # apply HVG calculation to list of merged SCEs # object will only be subset to HVG if subset_hvg is true merged_sce_list <- merged_sce_list %>% purrr::map(~ set_var_genes(.x, num_genes = opt$num_genes, subset_hvg = opt$subset_hvg, batch_column = opt$batch_column)) # add PCA and UMAP # if --pca_use_all_genes is used, use all genes, otherwise only HVG are used if(opt$pca_use_all_genes){ merged_sce_list <- merged_sce_list %>% purrr::map( ~ perform_dim_reduction(.x, var_genes = rownames(.x), pca_type = "multi")) } else { merged_sce_list <- merged_sce_list %>% purrr::map( ~ perform_dim_reduction(.x, var_genes = metadata(.x)$variable_genes, pca_type = "multi")) } # Write RDS -------------------------------------------------------------------- # create paths to merged SCE files # named with the name of the sce list which corresponds to the grouping variable, not library ID merged_sce_files <- file.path(opt$merged_sce_dir, paste0(names(merged_sce_list), "_merged_sce.rds")) # export files purrr::walk2(merged_sce_list, merged_sce_files, readr::write_rds) |
66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 | project_root <- here::here() renv::load(project_root) # import libraries library(magrittr) library(optparse) # source integration functions utils_path <- file.path(project_root, "scripts", "utils") suppressPackageStartupMessages({ source(file.path(utils_path, "integrate-fastMNN.R")) source(file.path(utils_path, "integrate-harmony.R")) source(file.path(utils_path, "integrate-seurat.R")) source(file.path(utils_path, "integration-helpers.R")) }) # Set up optparse options option_list <- list( make_option( opt_str = c("--input_sce_file"), type = "character", default = NULL, help = "Path to RDS file that contains the merged SCE object to integrate" ), make_option( opt_str = c("--output_sce_file"), type = "character", default = NULL, help = "Path to RDS file where the integrated SCE object will be saved" ), make_option( opt_str = c("--method"), type = "character", default = NULL, help = "Integration method to use, either `fastMNN`, `harmony`, or `Seurat` (case-insensitive)." ), make_option( opt_str = c("-b", "--batch_column"), type = "character", default = "batch", help = "Name of the column in the SCE object that indicates batch groupings." ), make_option( opt_str = c("--seed"), type = "integer", default = NULL, help = "random seed to set during integration" ), make_option( opt_str = c("--harmony_covariate_cols"), type = "character", default = NULL, help = "Optional comma-separated list of columns (e.g. patient, sex) to consider as covariates during integration with `harmony`." ), make_option( opt_str = c("--harmony_from_expression"), action = "store_false", default = TRUE, help = "Indicate whether to use the gene expression matrix, rather than PCs, during `harmony` integration. To use expression instead of PCs (default), use `--harmony_from_expression`" ), make_option( opt_str = c("--fastmnn_no_cosine"), action = "store_false", default = TRUE, help = "Indicate whether to turn off cosine normalization during `fastMNN` integration. To turn off cosine normalization, use `--fastmnn_no_cosine`" ), make_option( opt_str = c("--fastmnn_use_all_genes"), action = "store_true", default = FALSE, help = "Indicate whether to use all genes instead of only HVGs during `fastMNN` integration. To use all genes, use `--fastmnn_use_all_genes`" ), make_option( opt_str = c("--fastmnn_auto_merge"), action = "store_true", default = FALSE, help = "Indicates whether or not to use the auto.merge option for `fastMNN` integration. To perform auto.merge, use `--fastmnn_auto_merge`." ), make_option( opt_str = c("--seurat_reduction_method"), type = "character", default = NULL, help = "Reduction method to use during Seurat integration." ), make_option( opt_str = c("--seurat_num_genes"), type = "numeric", default = 2000, help = "Number of variable genes for Seurat to identify and use." ), make_option( opt_str = c("--seurat_integration_dims"), type = "numeric", default = 30, help = "Number of dimensions for Seurat to use during integration." ), make_option( opt_str = c("--seurat_umap_dims"), type = "numeric", default = 30, help = "Number of dimensions for Seurat to use during UMAP calculation." ), make_option( opt_str = c("--seurat_anchor_threshold"), type = "numeric", default = 100, help = "Minimum threshold for number of neighbors to consider when weighting anchors during integration." ), make_option( opt_str = c("--corrected_only"), action = "store_true", default = FALSE, help = "Indicate whether to only return the corrected gene expression data, and not uncorrected expression, in the integrated SCE object. To return only corrected expression, use `--corrected_only`." ) ) # --corrected_only: Flag to specify that only corrected gene expression values should # be returned in the integrated SCE object. Default usage of this script will # return all data. # Setup ------------------------------------------------------------------------ # Parse options opt <- parse_args(OptionParser(option_list = option_list)) # Check and assign provided method based on available methods available_methods <- c("fastmnn", "harmony", "seurat") # helper function for method check fails stop_method <- function() { stop( paste("You must specify one of the following (case-insensitive) to --method:", paste(available_methods, collapse = ", ") ) ) } if (is.null(opt$method)) { stop_method() } else { integration_method <- tolower(opt$method) if (!(integration_method %in% available_methods)) { stop_method() } } # Check that provided input file exists and is an RDS file if(is.null(opt$input_sce_file)) { stop("You must provide the path to the RDS file with merged SCEs to --input_sce_file") } else { if(!file.exists(opt$input_sce_file)) { stop("Provided --input_sce_file file does not exist. Make sure that you have run `02-prepare-merged-sce.R` or the provided path is correct.") } } # Check that both input and output files have RDS extensions if(!(grepl("\\.rds$", opt$input_sce_file, ignore.case = TRUE)) || !(grepl("\\.rds$", opt$output_sce_file, ignore.case = TRUE))) { stop("The provided --input_sce_file and --output_sce_file files must be RDS files.") } # Check that directory for output file exists and the specified file is an RDS file integrated_sce_dir <- dirname(opt$output_sce_file) create_dir(integrated_sce_dir) # Read in SCE file ----------------------------- merged_sce_obj <- readr::read_rds(opt$input_sce_file) # Perform integration with fastMNN, if specified ------------------------- if (integration_method == "fastmnn") { # Prepare `gene_list` argument if (opt$fastmnn_use_all_genes) { fastmnn_gene_list <- NULL } else { fastmnn_gene_list <- metadata(merged_sce_obj)$variable_genes } # Perform integration integrated_sce_obj <- integrate_fastMNN( merged_sce_obj, batch_column = opt$batch_column, cosine_norm = opt$fastmnn_no_cosine, gene_list = fastmnn_gene_list, seed = opt$seed, auto.merge = opt$fastmnn_auto_merge ) # add note about auto merge to metadata of integrated object if(opt$fastmnn_auto_merge){ auto_merge <- "auto" } else { auto_merge <- "input" } metadata(integrated_sce_obj)$fastmnn_auto_merge <- auto_merge } # Perform integration with harmony, if specified ------------------------- if (integration_method == "harmony") { # Set up `covariate_cols` argument based on user options if (is.null(opt$harmony_covariate_cols)) { harmony_covariate_cols <- c() } else { harmony_covariate_cols <- unlist(stringr::str_split(opt$harmony_covariate_cols, "\\s*,\\s*")) } # Perform integration integrated_sce_obj <- integrate_harmony( merged_sce_obj, batch_column = opt$batch_column, covariate_cols = opt$harmony_covariate_cols, from_pca = opt$harmony_from_expression, seed = opt$seed ) } # Perform integration with seurat, if specified ------------------------- if (integration_method == "seurat") { # Convert the merged sce object into a list of seurat objects seurat_list <- prepare_seurat_list(merged_sce_obj) # Perform integration integrated_seurat_obj <- integrate_seurat( seurat_list, opt$seurat_reduction_method, # integrate_seurat() will check this argument batch_column = opt$batch_column, num_genes = opt$seurat_num_genes, integration_dims = 1:opt$seurat_integration_dims, umap_dims = 1:opt$seurat_umap_dims, anchor_threshold = opt$seurat_anchor_threshold ) # Converted the integrated seurat object into an SCE object integrated_sce_obj <- as.SingleCellExperiment(integrated_seurat_obj) # Rename assays appropriately: # The converted `logcounts` is the CORRECTED expression assay(integrated_sce_obj, paste0(opt$seurat_reduction_method, "_corrected")) <- logcounts(integrated_sce_obj) # The logcounts in SCT assay is logcounts logcounts(integrated_sce_obj) <- logcounts( altExp(integrated_sce_obj, "SCT") )[rownames(integrated_sce_obj),] # The counts in RNA assay is counts counts(integrated_sce_obj) <- counts( altExp(integrated_sce_obj, "RNA") )[rownames(integrated_sce_obj),] # Remove altExps integrated_sce_obj <- removeAltExps(integrated_sce_obj) # Convert reducedDims names back to <lowercase>_<UPPERCASE> because # `as.SingleCellExperiment` makes them all uppercase reducedDimNames(integrated_sce_obj) <- stringr::str_replace_all( reducedDimNames(integrated_sce_obj), toupper(opt$seurat_reduction_method), opt$seurat_reduction_method ) } # Remove uncorrected expression values, if specified ---------- if (opt$corrected_only) { integrated_sce_obj <- remove_uncorrected_expression(integrated_sce_obj) } # Write integrated SCE object to RDS ------------------- readr::write_rds(integrated_sce_obj, opt$output_sce_file) |
14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 | import os import anndata as adata import argparse import re from utils.integrate_scanorama import integrate_scanorama # define arguments parser = argparse.ArgumentParser() parser.add_argument('-i', '--input_anndata', dest = 'input_anndata', required = True, help = 'Path to HDF5 file with merged AnnData object to integrate') parser.add_argument('-o', '--output_anndata', dest = 'output_anndata', required = True, help = 'Path to HDF5 file to save the integrated AnnData object') parser.add_argument('-b', '--batch_column', dest = 'batch_column', default = 'batch', help = 'The name of the column in `anndata.obs` that indicates the batches for each cell, ' ' typically this corresponds to the library id.') parser.add_argument('--use_hvg', dest = 'use_hvg', action = 'store_true', default = False, help = 'Boolean indicating whether or not to use only highly variable genes for data integration.' 'If --use_hvg is used, integration will only consider highly variable genes, and similarly ' 'the returned integrated object will only contain the highly variable genes.') parser.add_argument('--corrected_only', dest = 'corrected_only', action = 'store_true', help = 'Boolean indicating to return only the corrected data or all raw data.' ' Default will return all data. To return only corrected data, use --corrected_only.') parser.add_argument('-s', '--seed', dest = 'seed', type=int, default = None, help = 'Random seed to set for scanorama.') args = parser.parse_args() # compile extension regex file_ext = re.compile(r"\.hdf5$|.h5$", re.IGNORECASE) # check that input file exists, if it does exist, make sure it's an h5 file if not os.path.exists(args.input_anndata): raise FileExistsError("--input_anndata file not found.") elif not file_ext.search(args.input_anndata): raise ValueError("--input_anndata must end in either .hdf5 or .h5 and contain a merged AnnData object.") # make sure output file path is h5 file if not file_ext.search(args.output_anndata): raise ValueError("--output_anndata must provide a file path ending in either .hdf5 or .h5.") # check that output file directory exists and create directory if doesn't exist integrated_adata_dir = os.path.dirname(args.output_anndata) if not os.path.isdir(integrated_adata_dir): os.makedirs(integrated_adata_dir, exist_ok = True) # read in merged anndata object merged_adata = adata.read_h5ad(args.input_anndata) if args.use_hvg: try: var_genes = list(merged_adata.uns['variable_genes']) except KeyError: print("Variable genes cannot be found in anndata object." " If using --use_hvg, make sure HVG are stored in adata.uns['variable_genes'].") raise else: var_genes = list(merged_adata.var_names) # integrate anndata with scanorama scanorama_integrated_adata = integrate_scanorama(merged_adata, integrate_genes = var_genes, batch_column = args.batch_column, seed = args.seed) # remove raw data and logcounts to minimize space if corrected_only is true if args.corrected_only: print("Removing raw data and log-normalized data. Only corrected data will be returned.") scanorama_integrated_adata.X = None del scanorama_integrated_adata.layers["logcounts"] # write anndata to h5 scanorama_integrated_adata.write(filename = args.output_anndata) |
16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 | import os import anndata as adata import argparse import re from utils.integrate_scvi import integrate_scvi # define arguments parser = argparse.ArgumentParser() parser.add_argument('-i', '--input_anndata', dest = 'input_anndata', required = True, help = 'Path to HDF5 file with merged AnnData object to integrate') parser.add_argument('-o', '--output_anndata', dest = 'output_anndata', required = True, help = 'Path to HDF5 file to save the integrated AnnData object') parser.add_argument('-b', '--batch_column', dest = 'batch_column', default = 'batch', help = 'The name of the column in `anndata.obs` that indicates the batches for each cell, ' ' typically this corresponds to the library id.') parser.add_argument('--categorical_covariates', dest = 'categorical_covariates', default = None, type = str, help = 'A comma-separated list of columns containing additional categorical data to be' ' included as a covariate. Default is None.') parser.add_argument('--continuous_covariates', dest = 'continuous_covariates', default = "subsets_mito_percent", type = str, help = 'A comma-separated list of columns containing additional continous data to be' ' included as a covariate. Default is "subsets_mito_percent".') parser.add_argument('--num_latent', dest = 'num_latent', type = int, default = 20, help = 'Number of dimensions to return from the latent space.') parser.add_argument('--use_hvg', dest = 'use_hvg', action = 'store_true', default = False, help = 'Boolean indicating whether or not to use only highly variable genes for data integration.' 'If --use_hvg is used, the returned integrated object will only contain the highly variable genes.') parser.add_argument('--corrected_only', dest = 'corrected_only', action = 'store_true', help = 'Boolean indicating to return only the corrected data or all raw data.' ' Default will return all data. To return only corrected data, use --corrected_only.') parser.add_argument('-s', '--seed', dest = 'seed', type=int, default = None, help = 'Random seed to set for scanorama.') args = parser.parse_args() # compile extension regex file_ext = re.compile(r"\.hdf5$|.h5$", re.IGNORECASE) # check that input file exists, if it does exist, make sure it's an h5 file if not os.path.exists(args.input_anndata): raise FileExistsError("--input_anndata file not found.") elif not file_ext.search(args.input_anndata): raise ValueError("--input_anndata must end in either .hdf5 or .h5 and contain a merged AnnData object.") # make sure output file path is h5 file if not file_ext.search(args.output_anndata): raise ValueError("--output_anndata must provide a file path ending in either .hdf5 or .h5.") # check that output file directory exists and create directory if doesn't exist integrated_adata_dir = os.path.dirname(args.output_anndata) os.makedirs(integrated_adata_dir, exist_ok = True) # read in merged anndata object merged_adata = adata.read_h5ad(args.input_anndata) if args.use_hvg: try: var_genes = list(merged_adata.uns['variable_genes']) except KeyError: print("Variable genes cannot be found in anndata object." " If using --use_hvg, make sure HVG are stored in adata.uns['variable_genes'].") raise else: var_genes = list(merged_adata.var_names) # split covariates from comma separated strings into lists if args.categorical_covariates: categorical_covariates = [covariate.strip() for covariate in args.categorical_covariates.split(',')] else: categorical_covariates = None if args.continuous_covariates: if args.continuous_covariates == "None": continuous_covariates = None else: continuous_covariates = [covariate.strip() for covariate in args.continuous_covariates.split(',')] else: continuous_covariates = None # integrate anndata with scvi scvi_integrated_adata = integrate_scvi(merged_adata, integrate_genes = var_genes, batch_column = args.batch_column, categorical_covariate_columns = categorical_covariates, continuous_covariate_columns = continuous_covariates, num_latent = args.num_latent, seed = args.seed) # remove raw data and logcounts to minimize space if corrected_only is true if args.corrected_only: print("Removing raw data and log-normalized data. Only corrected data will be returned.") scvi_integrated_adata.X = None del scvi_integrated_adata.layers["logcounts"] # write anndata to h5 scvi_integrated_adata.write(filename = args.output_anndata) |
20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 | project_root <- here::here() renv::load(project_root) # import libraries suppressPackageStartupMessages({ library(magrittr) library(optparse) library(SingleCellExperiment) }) source(file.path(project_root, "scripts", "utils", "integration-helpers.R")) # Load helper functions source( file.path( "scripts", "utils", "integration-helpers.R" ) ) # Set up optparse options option_list <- list( make_option( opt_str = c("--input_anndata_file"), type = "character", default = NULL, help = "Path to H5 file that contains the integrated AnnData object" ), make_option( opt_str = c("--output_sce_file"), type = "character", default = NULL, help = "Path to RDS file where the integrated SCE object will be saved" ), make_option( opt_str = c("--method"), type = "character", default = NULL, help = "Method that was used for integration, either `scanorama` or `scVI` (case-insensitive)." ), make_option( opt_str = c("--seed"), type = "integer", default = NULL, help = "random seed to set" ), make_option( opt_str = c("--corrected_only"), action = "store_true", default = FALSE, help = "Indicate whether the integrated AnnData object contains only corrected gene expression data, and is missing uncorrected expression (no `X` matrix). To indicate the object only contains corrected data, use `--corrected_only`." ) ) # Setup ------------------------------------------------------------------------ # Parse options opt <- parse_args(OptionParser(option_list = option_list)) set.seed(opt$seed) # Check provided method based on available methods available_methods <- c("scanorama", "scvi") # helper function for method check fails stop_method <- function() { stop( paste("You must specify one of the following (case-insensitive) to --method:", paste(available_methods, collapse = ", ") ) ) } if (is.null(opt$method)) { stop_method() } else { integration_method <- tolower(opt$method) if (!(integration_method %in% available_methods)) { stop_method() } } # Check that provided input file exists and is an RDS file if(is.null(opt$input_anndata_file)) { stop("You must provide the path to the H5 file with merged SCEs to --input_anndata_file") } else { if(!file.exists(opt$input_anndata_file)) { stop("Provided --input_anndata_file file does not exist.") } } # Check that both input and output files have correct extensions if(!(stringr::str_ends(opt$input_anndata_file, ".hdf5|.h5"))){ stop("`--input_anndata_file` must end in either '.hdf5' or '.h5'") } if(!(stringr::str_ends(opt$output_sce_file, ".rds"))){ stop("`--output_sce_file` must end in '.rds'") } # Check that directory for output file exists integrated_sce_dir <- dirname(opt$output_sce_file) create_dir(integrated_sce_dir) # Anndata to SCE post-processing ----------------------------------------------- # read in AnnData object as H5 integrated_sce <- zellkonverter::readH5AD(opt$input_anndata_file) # Remove uncorrected expression values if corrected_only is set # We do this after zellkonverter which always retain counts, even if empty. if (opt$corrected_only) { integrated_sce <- remove_uncorrected_expression(integrated_sce) } # replace . added in anndata names with - found in other sce objects names(metadata(integrated_sce)) <- stringr::str_replace_all(names(metadata(integrated_sce)), "\\.", "-") names(colData(integrated_sce)) <- stringr::str_replace_all(names(colData(integrated_sce)), "\\.", "-") names(rowData(integrated_sce)) <- stringr::str_replace_all(names(rowData(integrated_sce)), "\\.", "-") # check for corrected data corrected_assay_name <- paste(integration_method, "corrected", sep = "_") if(!corrected_assay_name %in% assayNames(integrated_sce)){ stop("integrated SCE missing corrected gene expression data in assays.") } # function to check for SVD or latent embeddings check_pseudo_pca <- function(integrated_sce, pseudo_pca_name){ # check corrected data and SVD were converted if(!pseudo_pca_name %in% reducedDimNames(integrated_sce)){ stop(glue::glue("integrated SCE missing {pseudo_pca_name} in reduced dimensions slot.")) } } # function to run UMAP with respective pseudo PCA name add_umap <- function(integrated_sce, pseudo_pca_name, umap_name){ integrated_sce <- integrated_sce %>% scater::runUMAP(dimred = pseudo_pca_name, name = umap_name) } # calculate UMAP dependent on integration method if(integration_method == "scanorama"){ pseudo_pca_name <- "scanorama_SVD" check_pseudo_pca(integrated_sce, pseudo_pca_name) umap_name <- "scanorama_UMAP" # run UMAP using the SVD as input integrated_sce <- add_umap(integrated_sce, pseudo_pca_name, umap_name) } if(integration_method == "scvi"){ pseudo_pca_name <- "scvi_latent" check_pseudo_pca(integrated_sce, pseudo_pca_name) umap_name <- "scvi_UMAP" # run UMAP using the latent embeddings as input integrated_sce <- add_umap(integrated_sce, pseudo_pca_name, umap_name) } # Write RDS -------------------------------------------------------------------- readr::write_rds(integrated_sce, opt$output_sce_file) |
24 25 26 27 28 | shell: """ Rscript -e "renv::restore(lockfile = '{input}')" &> {log} date -u -Iseconds > {output} """ |
40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 | shell: """ Rscript scripts/02-prepare-merged-sce.R \ --library_file "{input.processed_tsv}" \ --add_celltype {config[add_celltype]} \ --celltype_info "{config[celltype_file]}" \ --grouping_var {config[grouping_var]} \ --groups_to_integrate "{config[groups_to_integrate]}" \ --merged_sce_dir "{output}" \ --num_hvg {config[num_hvg]} \ --subset_hvg \ --seed {config[seed]} \ {params.random_merge_flag} \ &> {log} """ |
65 66 67 68 69 70 71 72 73 | shell: """ Rscript scripts/02a-convert-sce-to-anndata.R \ --library_file "{input.processed_tsv}" \ --merged_sce_dir "{input.merged_sce_dir}" \ --grouping_var {config[grouping_var]} \ --anndata_output_dir "{output}" \ &> {log} """ |
89 90 91 92 93 94 95 96 97 98 99 | shell: """ Rscript scripts/03a-integrate-sce.R \ --input_sce_file "{input.merged_sce_dir}/{params.merged_sce_file}" \ --output_sce_file "{output}" \ --method fastMNN \ --seed {config[seed]} \ --corrected_only \ {params.auto_merge_flag} \ &> {log} """ |
113 114 115 116 117 118 119 120 121 122 | shell: """ Rscript scripts/03a-integrate-sce.R \ --input_sce_file "{input.merged_sce_dir}/{params.merged_sce_file}" \ --output_sce_file "{output}" \ --method harmony \ --seed {config[seed]} \ --corrected_only \ &> {log} """ |
139 140 141 142 143 144 145 146 147 148 149 | shell: """ Rscript scripts/03a-integrate-sce.R \ --input_sce_file "{input.merged_sce_dir}/{params.merged_sce_file}" \ --output_sce_file "{output}" \ --method seurat \ --seurat_reduction_method {wildcards.method} \ --num_genes {params.num_genes} \ --corrected_only \ &> {log} """ |
161 162 163 164 165 166 167 168 169 170 | shell: """ python scripts/03b-integrate-scanorama.py \ --input_anndata "{input.merged_anndata_dir}/{params.merged_anndata_file}" \ --output_anndata "{output}" \ --seed {config[seed]} \ --use_hvg \ --corrected_only \ &> {log} """ |
182 183 184 185 186 187 188 189 190 191 192 193 | shell: """ python scripts/03c-integrate-scvi.py \ --input_anndata "{input.merged_anndata_dir}/{params.merged_anndata_file}" \ --output_anndata "{output}" \ --continuous_covariates {config[continuous_covariates]} \ --num_latent {config[num_latent]} \ --seed {config[seed]} \ --use_hvg \ --corrected_only \ &> {log} """ |
203 204 205 206 207 208 209 210 211 212 | shell: """ Rscript scripts/04-post-process-anndata.R \ --input_anndata_file "{input}" \ --output_sce_file "{output}" \ --method "{wildcards.method}" \ --seed {config[seed]} \ --corrected_only \ &> {log} """ |
227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 | shell: """ Rscript -e \ "rmarkdown::render('analysis_templates/01-single-group-integration-check-template.Rmd', \ clean = TRUE, \ output_file = '{output}', \ output_dir = dirname('{output}'), \ params = list(group_name = '{wildcards.project}', \ merged_sce_dir = '{workflow.basedir}/{input.merged_sce_dir}', \ integrated_sce_dir = '{workflow.basedir}/{params.integrated_sce_dir}', \ integration_methods = '{config[integration_methods]}', \ library_metadata = '{workflow.basedir}/{config[processed_tsv]}' , \ max_celltypes = {config[max_celltypes]}))" \ &> {log} """ |
Support
- Future updates
Related Workflows





