Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation, topic
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .

bollito: single-cell RNA-seq pipeline.
Introduction
bollito is a Snakemake pipeline that performs a comprehensive single-cell RNA-seq analysis, covering both the basic steps (QC, alignment, quantification and cell specific QC) and more advanced downstream analyses (clustering, diferential expresion, trajectory inference, functional analysis and RNA velocity).
This pipeline uses state-of-the-art single-cell RNA-seq tools like Seurat , STARsolo , Vision , Slingshot , and Velocyto .
The pipeline makes extensive use of Snakemake's integration with the conda package manager and docker/singularity containers, to automatically take care of software requirements and dependencies.
bollito has two main modes of execution depending on the input data:
-
From FASTQ : it accepts FASTQ-formatted raw data (from drop-seq or 10x Genomics experiments).
-
From matrices including:
-
Feature-barcode matrices (expression matrix format from STARsolo).
-
Standard matrices (cell names and gene names included in the matrix).
-
We've built bollito for flexibility, with the aim of allowing the user to adjust the pipeline to different experiments using configuration parameters. This includes adjusting the cell filtering, normalization, variables regression, number of significant components, clustering resolution, etc. When in doubt, the default parameters were calculated to offer a good starting configuration. Additionally, once the main steps of the pipeline are finished, bollito creates a AnnData file from the Seurat file and test if it is correctly formatted. This feature improves the interconnection between R and Python, helping users which might want to perform analysis with tools from both languages.
Workflow overview
This is a schema of the complete workflow. Certain parts may be skipped depending on input data and chosen configuration.
Authors
-
Luis García-Jimeno
-
Coral Fustero-Torre
-
María José Jiménez-Santos
-
Gonzalo Gómez-López
-
Tomás Di Domenico
-
Fátima Al-Shahrour
Setup
The setup of the pipeline consists in the modification of three configuration files, indicating the desired parameters and the location of the input files. A general description of these files follows. See the Usage section for more details.
Configuration files
-
config.yaml contains all pipeline parameters.
-
samples.tsv contains information on the samples to be analysed.
-
units.tsv : contains information on the different data files associated to the samples to be analysed.
Input files
- raw data in gzip compressed FASTQ files
or
-
matrices of count data
- 10x -like input (matrix.mtx + genes.tsv + barcodes.tsv)
or
- standard tabular format (where rows are genes and columns are cells)
Usage
1. Set up the environment
bollito requires the conda package manager in order to work. If you have singularity installed and would rather not install conda, you can provide the "--use-singularity" option when running the pipeline (see below). Otherwise, please install conda by following the bioconda installation instructions .
2. Download bollito repository from Gitlab.
Use git clone command to create a local copy.
git clone https://gitlab.com/bu_cnio/bollito.git
3. Configure the pipeline.
Before executing the pipeline, the users must configure it according to their samples. To do this, they must fill these files:
TIP: different analysis can be run using just one cloned repository. This is achieved by changing the outdir and logdir in the configuration file. Also different parameters values can be used in the different analysis.
a. samples.tsv
This file contains information on the samples to be analyzed. The first column, called "sample", is mandatory, and defines the sample name for each sample. The other columns are used to define the samples. This information is stored as metadata in the Seurat object, so, every cell belonging to that sample is labeled with the value defined by the user, meanwhile the column heaeder will correspond to the condition name.
An example file ( samples-example.tsv) is included in the repository.
Rename it to
samples.tsv
and edit its contents to list the sample names and the features
related to each sample.
b. units.tsv
This file is used to configure the input files.
There are two example files, depending on the type of input data:
- If your input are FASTQ files:
An example file ( units-example_fastq.tsv ) is included in the repository.
Rename it to
units.tsv
and edit its contents according to the following table:
Field name | Description |
---|---|
sample | Sample name (must match the sample name specified in samples.tsv ). |
unit | A distinct name for each pair of files associated to the same sample (for example in the case of replicates). |
fq1 | FASTQ file for read 1, containing the Cell Barcode and UMI. |
fq2 | FASTQ file for read 2, containing the transcriptomic sequence. |
- If your input are matrix files:
An example file ( units-example_matrices.tsv ) is included in the repository.
Rename it to
units.tsv
and edit its contents according to the following table:
Field name | Description |
---|---|
sample | Sample name (must match the sample name specified in samples.tsv ). |
matrix | Matrix file (.mtx for 10x or .tsv for standard) storing the counts. |
cell_names (10x only) | tsv file containing one cell name per row. |
gene_names (10x only) | tsv file containing one gene name per row. |
metadata (optional) | tsv file with two or more columns. First column corresponds to each cell name specified in cell_names.tsv and the rest are metadata variables. First row indicates the metadata variable name. |
c. config.yaml
This is the pipeline configuration file, where you can tune all the available parameters to customise your single-cell analysis.
The example file ( config-example.yaml ) features extensive inline documentation.
Here are some of the main available parameters:
Parameter | Description |
---|---|
input_type | Type of input data ( fastq or matrix ). |
technology | Technology used to get the reads files ( 10x or Drop-seq ). |
outdir | Directory where to store the output files. |
logdir | Directory where to store the log files. |
graphics | Graphic card availability. |
random_seed | Seed parameter to allow for reproducible analyses. |
case | Type of case used to represent the gene names, must be use according the specie and genesets used. |
annotation | GTF file holding genetic features information. |
idx | Folder containing STAR genomes indexes. |
whitelist | Cell barcodes whitelist file needed for 10x experiments quantification and demultiplexing. |
d. metadata.tsv (optional)
This file is optional and it is only used when the input file used are matrices. The purpose of this file is to annotate each individual cell, storing that information in the Seurat object. It is a tsv file with two or more columns. The first column corresponds to each cell name specified in cell_names.tsv and the rest are the values of the metadata variables. First row of each column indicates the metadata variable name.
4. Create the Conda environments.
To run the pipeline, the user needs to create the conda environments first, which will take some minutes. This step is done automatically using this command:
snakemake --use-conda --conda-create-envs-only --conda-frontend mamba
5. Run the pipeline.
Once the pipeline is configured and conda environments are created, the user just needs to run bollito.
snakemake --use-conda -j 32
The mandatory arguments are:
-
--use-conda : to install and use the conda environemnts.
-
-j : number of threads/jobs provided to snakemake.
Pipeline steps
Here you can find a general description of the main steps of the pipeline.
1. FASTQ quality control.
General QC
bollito implements FastQC to check the overal quality of the input FASTQ files.
Contamination
FastQ Screen can optionally be enabled in order to check the input FASTQ files for contaminants.
MultiQC report
bollito creates a Quality Control HTML report using MultiQC . This report includes information from FastQC and RSeQC (which will be explained later).
2. Alignment & quantification.
To obtain the cell expression profiles we need to align the FASTQ files against an annotated reference genome. Once we obtain the aligned files, the pipeline assigns the reads to their corresponding cell barcodes, and performs an UMI-based quantification of the annotated features.
The alignment tool implemented in bollito is STAR , taking advantage of its STARsolo mode for single-cell RNA-seq quantification.
This step requires the following parameters, defined in the configuration file:
-
An annotation file containing the features of interest (in GTF format, must match the target genome)
-
One of the following options for the genome:
-
a FASTA file (will be indexed by bollito).
-
a STAR genome index.
-
-
Technology, which includes Drop-seq, 10x (chemistry version also needs to be specified), or a custom technology where the user will need to set the CB and UMI length.
-
The corresponding CB whitelist.
Example parameters for different STAR configurations are available in the example config file.
3. Alignment quality control.
Quality control of the resulting alignments is performed using RSeQC .
4. Cell quality control, normalization, and dimensionality reduction.
Once the UMI-count matrix is obtained, bollito performs a cell-based quality control. The purpose of this step is to control for broken cells and doublets, and to analyze the cell cycle status of the cells. Next, the normalization and dimensionality reduction steps are applied.
Cell quality control, normalization, and dimensionality reduction are implemented using the Seurat package.
Cell cycle checking is implemented based on molecular signatures from Tirosh et al, 2015 .
In order to customize this step, the following parameters can be adjusted via the configuration file:
-
Minimum and maximum number of detected genes per cell.
-
Minimum and maximum read counts per cell.
-
Minimum ribosomal content.
-
Minimum mitochondrial content.
-
Normalization method "SCT" ( (Hafemeister & Satija, 2019) ) or "standard".
The normalization itself can also be parametrized, including the possibility of regressing the desired variables (such as cell cycle scoring, number of detected genes, % of mitochondrial genes) in order to mitigate their effect on the dataset. Refer to the example config file for the available options.
NOTE: the user will have to choose the correct value for some parameters (QC threshold filters, significant PCA components or regressed variables) after the execution of these steps. For this purpose, it is necessary to stop the pipeline executions when these steps are finished. Then, take a look and the results to define those parameter values at the configuration file. Finally, re-run the pipeline from these steps in order to apply the changes.
4.b. Merging.
bollito can perform an optional merging step. This step consists in combining all the sample's Seurat objects which come from the single-cell QC step. No normalization is performed in this step, since the merged object will be used as input in the normalization step.
To enable this step, the user just need to set the "enabled" parameter to TRUE in the configuration file.
4.c. Integration.
bollito allows for an optional integration step, in order to detect shared cell states between datasets. The integration method is based on the identification of anchor cells between the datasets, and the projection of datasets on each other by using these anchors . For more information regarding the integration step, please refer to the Seurat Integration and Label Transfer documentation .
Integration step also performs the normalization of the integrated samples. It uses the same normalization method and values specified for the normalization rule, so the user does not need to specify any extra parameter.
To enable this step, the user just need to set the "enabled" parameter to TRUE in the configuration file.
5. Clustering.
Clustering of cells uses the normalized expression profiles. After, a dimensionality reduction by PCA, the k -nearest-neightbor (KNN) graph embedded in this lower dimensional space, is obtained. Then a Shared Nearest Neighbour (SNN) graph is constructed calculating the nearest neighbour cells overlapping using the Jaccard index. Once the graph is created, clusters are captured by using a the Louvain algorithm.
To explore the clusters along the resolutions, bollito uses Clustree . Cluster validation is achieved by calculating silhouette scores for each cluster. LISI ( (Korsunsky, I. et al. ) is used to assess if there is a batch effect produced by any of the categorical varaibles that are described in the experiment. If any batch effect is detected I would be advisabe to regress out that variable.
Additionally, once the main steps of the pipeline are finished, bollito creates a AnnData file from the Seurat output file and checks its formatting.
For this step, the following parameters need to be adjusted via the configuration file:
-
Number of significant components based on the elbow plot or JackStraw analysis obtained in previous steps.
-
Number of neighbours ( k ) used to generate the KNN graph (default = 20).
-
Resolutions to be used in the community detection method.
NOTE: the best resolution obtained should be specified in the configuration file, since it is used in posterior effects.
6. Differential expression analysis.
Differential expression analysis is based on the condition that the user requires, including a specific cluster resolution or some annotation information. For each condition, two analyses are performed:
-
Marker gene detection (for this test, only genes with a logFC threshold of 0.25, that are expressed in at least 10% of the cells are included).
-
Differential expression for all genes (no thresholds applied).
Output from this step includes a heatmap of top marker genes per condition, and a .rnk file that may be used for a downstream gene enrichment analysis with GSEA ,
The following parameters of this step need to be adjusted via the configuration file:
-
Set enabled to True .
-
Condition to analyze (cluster resolution or annotation information).
-
Differential expression test to apply.
For a list of available tests see the Seurat FindMarkers function documentation.
7. Functional analysis - Seurat-based.
bollito uses Seurat to apply the AddModuleScore function to the molecular signatures specified by the user. A paired Wilcoxon test is applied to compare the expression of the genes from the signature between the clusters.
To enable this step, the following parameters need to be adjusted via the configuration file:
-
Set enabled to True .
-
Set the path to the molecular signatures file to use (in .gmt format).
-
Set the ratio of expressed genes / total genes from a geneset to be tested (default is 0.2).
8. Functional analysis - Vision-based.
bollito applies the Vision methodology in order to study different molecular signatures and their significance at a specific clustering resolution.
To enable this step, the following parameters need to be adjusted via the configuration file:
-
Set enabled to True .
-
Set the path to the molecular signatures file to use (in .gmt format).
-
Select the desired metadata variables from Seurat.
-
Set the desired cluster resolution.
9. Trajectory inference.
This step analyses the cell lineages of your sample by inferring a pseudotime variable from the data and sorting the clusters according to it. The trajectory inference step is implemented by using the Slingshot package.
To enable it, the following parameters need to be adjusted via the configuration file:
-
Set enabled to True .
-
Cluster resolution must be specified.
-
Specifiy start and end cluster(s) (optional).
-
Number of genes for the heatmap representation.
-
Number of most variable genes (allows you to generate the general additive model of the heatmap).
10. RNA velocity.
The analysis of RNA velocity allows you to capture the expression dynamics of your data by estimating the spliced and unspliced mRNA abundances on each of the available splicing sites. Based on this information, future state of single-cells can be inferred. This step is implemented using the Velocyto wrapper.
To enable this step, the following parameters need to be adjusted via the configuration file:
-
Set enabled to True .
-
Cluster resolution must be specified.
-
If the dataset is large, a downsampling should be considered (optional).
NOTE: RNA velocity step can not be performed if we use a matrix as input of the pipeline, since it needs the BAM files to generate the three count matrices (spliced, unspliced and ambiguous).
Configuration of computation resources
The user can configure bollito to optimise the available computational resources in order to reduce the computational time. The optimization is achieved thanks to Snakemake's ability to run several samples at the same time and single-cell script parallelisation using the future package implementation. Resources configuration is done through the configuration file. This file has a field called resources , where the user can define the RAM memory usage and the number of threads (if the rule admits multithreading) available to a specific step. Additionally, if the user does not provide any value for some of these fields, the pipeline will use the default values.
Shortcuts
bollito features a shortcut system based on specfic targets related to the pipeline's steps. Each target calls a end point rule which terminate the pipeline execution.
To use the shorcuts, you only need to run the pipeline as usual, but with the --until option.
snakemake --use-conda --until target_name
The available targets are:
-
expression_matrix : run bollito until alignment step included.
-
qc_expression_matrix : run bollito until single-cell QC step included (rules: seurat_qc, seurat_postqc & seurat_merge).
-
normalized_expression_matrix : run bollito until single-cell normalization step included (rules: seurat_qc, seurat_postqc, seurat_merge, seurat_normalization & seurat_integration).
Additionally, the user might use the Snakemake rules names as targets, which are available in the config.yaml file.
Reporting
bollito produces a HTML report using Snakemake's automatic report generaration. This report includes the multiQC report and some quality control and normalization information at single-cell level from Seurat.
To generate the report, you only need to use --report option when the analysis is finished.
snakemake --report report.zip
This will generate a zip file containing an HTML index page and all the files needed to correctly display the report.
Scanpy interoperability
Bollito generates an AnnData output file to allow users to perform downstream analyses using Scanpy and other python-based packages. This AnnData file is obtained from the post-clustering Seurat object, so it stores all the annotations and cell filterings applied until that step.
Pipeline benchmarking
The following metrics were generating using the [10K PBMC 3p](https://s3-us-west-2.amazonaws.com/10x.files/samples/cell-exp/6.1.0/10k_PBMC_3p_nextgem_Chromium_X/10k_PBMC_3p_nextgem_Chromium_X_fastqs.tar
- from 10x Next GEM Chromium X dataset on a HPC cluster running CentOS 8 (248 cores, 2Tb RAM).
pipeline step | running time (s) | max RAM usage (MB) | threads |
---|---|---|---|
fastqc | 657.897 | 3830.980 | 4 |
star | 5388.925 | 36103.887 | 8 |
rseqc_junction_saturation | 1310.431 | 4848.223 | 1 |
rseqc_readdis | 2507.294 | 1058.127 | 1 |
rseqc_stat | 1022.082 | 116.960 | 1 |
rseqc_readgc | 1166.199 | 2183.963 | 1 |
rseqc_readdup | 2123.237 | 31723.523 | 1 |
rseqc_infer | 7.545 | 150.740 | 1 |
rseqc_innerdis | 630.251 | 970.620 | 1 |
rseqc_junction_annotation | 1171.999 | 291.133 | 1 |
multiQC | 40.741 | 212.860 | 2 |
seurat_preQC | 120.221 | 2310.437 | 1 |
seurat_postQC | 65.062 | 1869.23 | 1 |
seurat_normalization | 405.939 | 9125.253 | 1 |
seurat_find-clusters | 253.288 | 3710.987 | 2 |
References
-
Andrews S. (2010). FastQC: a quality control tool for high throughput sequence data. Available at: http://www.bioinformatics.babraham.ac.uk/projects/fastqc [Accessed 13 March 2020]
-
Wingett S. (2010). FastQ Screem:FastQ Screen allows you to screen a library of sequences in FastQ format against a set of sequence databases so you can see if the composition of the library matches with what you expect. Available at: http://www.bioinformatics.babraham.ac.uk/projects/fastq_screen [Accessed 13 March 2020]
-
Dobin, A., Davis, C., Schlesinger, F., Drenkow, J., Zaleski, C., Jha, S., Batut, P., Chaisson, M. and Gingeras, T., 2012. STAR: ultrafast universal RNA-seq aligner. Bioinformatics , 29(1), pp.15-21.
-
Wang, L., Wang, S. and Li, W., 2012. RSeQC: quality control of RNA-seq experiments. Bioinformatics , 28(16), pp.2184-2185.
-
Kowalczyk, M., Tirosh, I., Heckl, D., Ebert, B. and Regev, A., 2014. Single cell RNA-Seq of hematopoietic stem cells reveals a cell cycle-dependent interplay between aging and differentiation. Experimental Hematology , 42(8), p.S21.
-
Stuart, T., Butler, A., Hoffman, P., Hafemeister, C., Papalexi, E., Mauck, W., Hao, Y., Stoeckius, M., Smibert, P. and Satija, R., 2019. Comprehensive Integration of Single-Cell Data. Cell , 177(7), pp.1888-1902.e21.
-
Hafemeister, C. and Satija, R., 2019. Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression. Genome Biology , 20(1).
-
Zappia, L. and Oshlack, A., 2018. Clustering trees: a visualization for evaluating clusterings at multiple resolutions. GigaScience , 7(7).
-
DeTomaso, D., Jones, M., Subramaniam, M., Ashuach, T., Ye, C. and Yosef, N., 2019. Functional interpretation of single cell similarity maps. Nature Communications , 10(1).
-
Street, K., Risso, D., Fletcher, R., Das, D., Ngai, J., Yosef, N., Purdom, E. and Dudoit, S., 2018. Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics. BMC Genomics , 19(1).
-
La Manno, G., Soldatov, R., Zeisel, A., Braun, E., Hochgerner, H., Petukhov, V., Lidschreiber, K., Kastriti, M., Lönnerberg, P., Furlan, A., Fan, J., Borm, L., Liu, Z., van Bruggen, D., Guo, J., He, X., Barker, R., Sundström, E., Castelo-Branco, G., Cramer, P., Adameyko, I., Linnarsson, S. and Kharchenko, P., 2018. RNA velocity of single cells. Nature , 560(7719), pp.494-498.
-
Korsunsky, I., Millard, N., Fan, J., Slowikowski, K., Zhang, F., & Wei, K., Baglaenko, Y., Brenner, M., Loh, P. and Raychaudhuri, S. 2019. Fast, sensitive and accurate integration of single-cell data with Harmony. Nature Methods , 16(12), pp.1289-1296.
Test data
The system is pre-configured to run an example based on sample data available from 10x Genomics. The required datasets can be found at these URLS. Please update the "units.tsv" file to point at the data as needed.
-
https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.0/neuron_10k_v3
-
https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.0/heart_10k_v3
FASTQ_SCREEN
The pipeline can optionally run FASTQ_SCREEN to check the samples for contamination.
To disable it use the config file option
config["rules"]["fastq_screen"]["disabled"]
.
Config file pointing to indexes should be placed in a directory named
config['rules']['fastq_screen_indexes']['outdir']/FastQ_Screen_Genomes
.
If the rule is enabled and no config file provided, default indexes will be downloaded using the command
fastq_screen --get_genomes
.

Code Snippets
15 16 | wrapper: "0.60.0/bio/star/index" |
34 35 36 | shell:""" cat {input} > {output} 2> {log} """ |
82 83 | wrapper: "0.60.0/bio/star/align" |
27 28 | script: "../scripts/step1_qc.R" |
63 64 | script: "../scripts/step2_postqc.R" |
85 86 | script: "../scripts/step2.1_filter.R" |
108 109 | script: "../scripts/step2.5_seurat_merge.R" |
154 155 | script: "../scripts/step3_normalization.R" |
182 183 | script: "../scripts/step3.2_integration.R" |
208 209 | script: "../scripts/step4_find-clusters.R" |
228 229 | script: "../scripts/readAnnData.py" |
252 253 | script: "../scripts/step5_degs.R" |
275 276 | script: "../scripts/step6_gs-scoring.R" |
301 302 | script: "../scripts/step7_traj_in.R" |
328 329 | script: "../scripts/step8_func_analysis.R" |
19 20 | wrapper: "0.35.0/bio/fastqc" |
37 38 39 | shell:""" fastq_screen --threads {threads} --get_genomes --outdir {params.outdir}/ &> {log} """ |
60 61 | wrapper: "0.35.0/bio/fastq_screen" |
79 80 | script: "../scripts/gtf2bed.py" |
102 103 104 | shell: "junction_annotation.py {params.extra} -i {input.bam} -r {input.bed} -o {params.prefix} " "> {log[0]} 2>&1" |
126 127 128 | shell: "junction_saturation.py {params.extra} -i {input.bam} -r {input.bed} -o {params.prefix} " "> {log} 2>&1" |
146 147 | shell: "bam_stat.py -i {input} > {output} 2> {log}" |
167 168 | shell: "infer_experiment.py -r {input.bed} -i {input.bam} > {output} 2> {log}" |
189 190 | shell: "inner_distance.py -r {input.bed} -i {input.bam} -o {params.prefix} > {log} 2>&1" |
209 210 | shell: "read_distribution.py -r {input.bed} -i {input.bam} > {output} 2> {log}" |
230 231 | shell: "read_duplication.py -i {input} -o {params.prefix} > {log} 2>&1" |
251 252 | shell: "read_GC.py -i {input} -o {params.prefix} > {log} 2>&1" |
288 289 | wrapper: "v1.0.0/bio/multiqc" |
17 18 19 | shell:""" bash scripts/STAR_to_velocyto.sh {params.input_dir} 2> {log} """ |
42 43 | script: "../scripts/step9_RNA_velocity.R" |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 | import sys sys.stderr=open(snakemake.log[0], "a+") sys.stdout=open(snakemake.log[0], "a+") import gffutils db = gffutils.create_db(snakemake.input[0], dbfn=snakemake.output.db, force=True, keep_order=True, merge_strategy='merge', sort_attribute_values=True, disable_infer_genes=True, disable_infer_transcripts=True) with open(snakemake.output.bed, 'w') as outfileobj: for tx in db.features_of_type('transcript', order_by='start'): bed = [s.strip() for s in db.bed12(tx).split('\t')] bed[3] = tx.id outfileobj.write('{}\n'.format('\t'.join(bed))) sys.stderr.close() sys.stdout.close() |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 | import sys sys.stderr=open(snakemake.log[0], "a+") sys.stdout=open(snakemake.log[0], "a+") import scanpy import warnings anndata_path = snakemake.input["anndata_obj"] outdir = snakemake.params["output_dir"] a = scanpy.read_h5ad(filename = anndata_path) check_path = outdir + "/check.txt" print(check_path) check = open(check_path, 'w') ''' try: except: warnings.warn("AnnData can not be read.") sys.exit(1) ''' sys.stderr.close() sys.stdout.close() |
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 | head -n 2 ${1}/matrix.mtx > ${1}/mtx_header.txt head -n 3 ${1}/matrix.mtx | tail -n 1 > ${1}/mtx_summary.txt tail -n +4 ${1}/matrix.mtx | cut -d " " -f 1-3 > ${1}/ms.txt tail -n +4 ${1}/matrix.mtx | cut -d " " -f 1-2,4 > ${1}/mu.txt tail -n +4 ${1}/matrix.mtx | cut -d " " -f 1-2,5 > ${1}/ma.txt #rm -r ${1}/spliced ${1}/unspliced ${1}/ambiguous mkdir -p ${1}/spliced ${1}/unspliced ${1}/ambiguous cat ${1}/mtx_header.txt ${1}/mtx_summary.txt ${1}/ms.txt > ${1}/spliced/matrix.mtx cat ${1}/mtx_header.txt ${1}/mtx_summary.txt ${1}/mu.txt > ${1}/unspliced/matrix.mtx cat ${1}/mtx_header.txt ${1}/mtx_summary.txt ${1}/ma.txt > ${1}/ambiguous/matrix.mtx cp ${1}/features.tsv ${1}/genes.tsv cp ${1}/genes.tsv ${1}/barcodes.tsv ${1}/spliced/ cp ${1}/genes.tsv ${1}/barcodes.tsv ${1}/unspliced/ cp ${1}/genes.tsv ${1}/barcodes.tsv ${1}/ambiguous/ rm ${1}/*.txt |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 | log <- file(snakemake@log[[1]], open = "wt") sink(log) sink(log, type = "message") message("CONFIGURATION STEP") # A. Parameters: # 1. Load libraries. suppressMessages(library("Seurat")) suppressMessages(library("dplyr")) suppressMessages(library("data.table")) suppressMessages(library("reticulate")) suppressMessages(library("ggplot2")) suppressMessages(library("stringr")) suppressMessages(library("Matrix")) suppressMessages(library("patchwork")) message("1. Libraries were loaded.") # 2. Folder configuration. data_dir = paste0(snakemake@params[["input_dir"]],"/Solo.out/Gene/raw") dir.name = snakemake@params[["output_dir"]] folders = c("1_preprocessing", "2_normalization", "3_clustering", "4_degs", "5_gs", "6_traj_in", "7_func_analysis") message("2. Folder paths were set.") # 3. Get variables from Snakemake. samples_path = snakemake@params[["samples_path"]] min_cells_per_gene = snakemake@params[["min_cells_per_gene"]] input_type = snakemake@params[["input_type"]] technology = snakemake@params[["technology"]] units_path = snakemake@params[["units_path"]] sample = snakemake@params[["sample"]] random_seed = snakemake@params[["random_seed"]] case = snakemake@params[["case"]] ram = snakemake@resources[["mem_mb"]] message("3. Parameters were loaded.") # 4. Analysis configuration. # RAM configuration. options(future.globals.maxSize = ram*1024^2) # Set seed. if (is.numeric(random_seed)) { set.seed(random_seed) } message(paste0("4. Seed was set at ", random_seed, ".")) # 5. Set regexp for QC variables calculation. if (case == "lowercase") { mito_grep <- "^mt-" ribo_grep <- "^rp[sl][[:digit:]]" } else if (case == "titlecase") { mito_grep <- "^mt-" ribo_grep <- "^Rp[sl][[:digit:]]" } else if (case == "uppercase") { mito_grep <- "^MT-" ribo_grep <- "^RP[SL][[:digit:]]" } else { message("Please choose a correc case option.") quit() } message(paste0("5. Case is set as ", case, ".")) message("Configuration finished.") message("\n") message("PROCESSING STEP") # 0. Read input and create the expression matrix object. # If the input file is a fastq file (STARsolo input). if (input_type == "fastq") { file.rename(paste0(data_dir,"/features.tsv"), paste0(data_dir,"/genes.tsv")) expression_matrix <- Read10X(data.dir = data_dir) message("1. Expression matrix was created from alignment files.") # If the input file are matrices (directly read from units.tsv). } else if (input_type == "matrix") { # units.tsv is loaded units <- read.csv(units_path, header = TRUE, sep = "\t", row.names = 1, comment.char = "#") # If the expression matrix is in 10x like format (matrix, cell barcodes and genes). if (technology == "10x") { expression_matrix <- readMM(toString(units[sample,"matrix"])) colnames(expression_matrix) <- read.table(toString(units[sample,"cell_names"]))[,1] row.names(expression_matrix) <- read.table(toString(units[sample,"gene_names"]))[,1] message("1. Expression matrix was created from 10x/CellRanger-like files.") # If the expression matrix is in TSV format ((genes as row names and cells as column names). } else if (technology == "standard") { expression_matrix = read.csv(toString(units[sample,"matrix"]), sep = "\t", header = TRUE, row.names = 1) message("1. Expression matrix was created from TSV matrix.") } else { message("Please specify a correct unit input.") } } else { message("Please specify a correct input type.") } # 1. Creating a seurat object. expression_matrix <- expression_matrix[, colSums(expression_matrix != 0) > 100] # Take into account cells with more than 100 counts, since CreateSeuratObject function breaks. seurat = CreateSeuratObject(expression_matrix, project = sample, min.features = 200, min.cells = min_cells_per_gene) message("2. Seurat object was created.") # 1.1 Add metadata from samples.tsv file. samples_file = read.table(samples_path, sep = "\t", row.names = 1, header = TRUE) if (dim(samples_file)[2] != 0){ for (i in 1:length(colnames(samples_file))) { seurat <- AddMetaData(seurat, samples_file[sample, i], col.name = colnames(samples_file)[i]) } } message("3. Metadata from sample.tsv was added to Seurat object.") # 1.1.1 Add specific cell metadata from metadata.tsv file. if (input_type == "matrix") { if (file.exists(toString(units[sample,"metadata"]))){ meta_file = read.table(toString(units[sample,"metadata"]), sep = "\t", row.names = 1, header = TRUE) meta_file <- subset(meta_file, row.names(meta_file) %in% row.names(seurat@meta.data)) for (i in 1:length(colnames(meta_file))) { seurat <- AddMetaData(seurat, meta_file[,i], col.name = colnames(meta_file)[i]) } message("3.1 Metadata from metadata.tsv was added to Seurat object.") } else { message("Metadata.tsv was not found.") } } #1.2 Set idents to avoid new idents based on shared CB names. Idents(seurat) <- rep(sample, length(colnames(seurat$RNA@data))) # 2. Preprocessing: Get QC-related values. # 2.1. Mitochondrial genes - check levels of expression for mt genes. seurat[["percent.mt"]] <- PercentageFeatureSet(seurat, pattern = mito_grep) # 2.2. Ribosomal genes - check levels of expression for rb genes. seurat[["percent.ribo"]] <- PercentageFeatureSet(seurat, pattern = ribo_grep) message("4. Ribosomal and mitochondrial percentages were calculated.") # 2.3. QC: violin plots. p1 <- VlnPlot(seurat, features = c("nFeature_RNA"), pt.size = 0.25, cols = "#9CCCD0") + ggtitle("Nº features") + theme(legend.position="bottom") p2 <- VlnPlot(seurat, features = c("nCount_RNA"), pt.size = 0.25, cols = "#8ADD56") + ggtitle("Nº counts") + theme(legend.position="bottom") p3 <- VlnPlot(seurat, features = c("percent.mt"), pt.size = 0.25, cols = "#F07800") + ggtitle("Mitochondrial %") + theme(legend.position="bottom") p4 <- VlnPlot(seurat, features = c("percent.ribo"), pt.size = 0.25, cols = "#E44631") + ggtitle("Ribosomal %") + theme(legend.position="bottom") p_comp <- p1 + p2 + p3 + p4 + plot_layout(ncol = 4) ggsave(paste0(dir.name, "/", folders[1], "/1_vlnplot_QC_variables_prefilt.pdf"), plot = p_comp, scale = 1.2, width = 10, height = 8) message("5. Combined violin plot was generated.") # 2.4. QC: GenePlot. scatter1 <- FeatureScatter(seurat, feature1 = "nCount_RNA", feature2 = "percent.mt", pt.size = 0.25)+ theme(legend.position="bottom") + labs(title = "Mitochondrial % vs Nº counts", x = "Nº counts", y = "Mitochondrial %") scatter2 <- FeatureScatter(seurat, feature1 = "nCount_RNA", feature2 = "nFeature_RNA", pt.size = 0.25) + theme(legend.position="bottom") + labs(title = "Nº features vs Nº counts", x = "Nº counts", y = "Nº features") p_comb2 <- scatter1 + scatter2 + plot_layout(ncol = 2) ggsave(paste0(dir.name, "/", folders[1], "/2_geneplot_numi_vs_pctmit_ngene.pdf"), plot = p_comb2, scale = 1.5) message("6. Scaterplots were generated.") # 2.5. Save RDS: we can use this object to generate all the rest of the data. saveRDS(seurat, file = paste0(dir.name, "/" ,folders[1], "/seurat_pre-qc.rds")) message("7. Seurat object was saved.") |
1
of
scripts/step1_qc.R
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 | log <- file(snakemake@log[[1]], open = "wt") sink(log) sink(log, type = "message") message("CONFIGURATION STEP") # A. Parameters: # 1. Load libraries. suppressMessages(library("Seurat")) suppressMessages(library("dplyr")) suppressMessages(library("data.table")) suppressMessages(library("reticulate")) suppressMessages(library("ggplot2")) message("1. Libraries were loaded.") # 2. Folder configuration. dir.name = snakemake@params[["output_dir"]] folders = c("1_preprocessing", "2_normalization", "3_clustering", "4_degs", "5_gs", "6_traj_in", "7_func_analysis") message("2. Folder paths were set.") # 3. Get variables from Snakemake. gene.filter = c(snakemake@params[["genes"]]) # Gene or list of genes. filter.out = snakemake@params[["filter_out"]] # true or false filter.threshold = snakemake@params[["threshold"]] # numeric random_seed = snakemake@params[["random_seed"]] ram = snakemake@resources[["mem_mb"]] message("3. Parameters were loaded.") # 4. Analysis configuration. # RAM configuration. options(future.globals.maxSize = ram*1024^2) # Set seed. if (is.numeric(random_seed)) { set.seed(random_seed) } message(paste0("4. Seed was set at ", random_seed, ".")) message("Configuration finished.") message("\n") # B. Analysis. message("PROCESSING STEP") # Load Seurat object from previous step. seurat = readRDS(paste0(dir.name, "/", folders[1], "/seurat_post-qc.rds")) message("1. Seurat object was loaded.") # 4. Filter cell based on gene expression values. # 4.1 If there are negative markers availale: filter out cells based on gene expression. In this specific case, we are filtering out all cells expressing: Epcam, Pecam1, Krt19 and Ptprc. CHECK THIS if(length(gene.filter) > 0){ if (filter.out == TRUE){ for(i in 1:length(gene.filter)){ sub_seurat <- FetchData(object = seurat, vars = gene.filter[i]) seurat <- seurat[, which(x = sub_seurat > filter.threshold)] } } else { for(i in 1:length(gene.filter)){ sub_seurat <- FetchData(object = seurat, vars = gene.filter[i]) seurat <- seurat[, which(x = sub_seurat <= filter.threshold)] } } } else { next() } message("2. Cells were filter out.") # Save RDS: we can use this object to generate all the rest of the data. saveRDS(seurat, file = paste0(dir.name, "/",folders[1], "/seurat_post-qc-filtered.rds")) message("3. Seurat object was saved.") |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 | log <- file(snakemake@log[[1]], open = "wt") sink(log) sink(log, type = "message") message("CONFIGURATION STEP") # A. Parameters: # 1. Load libraries. suppressMessages(library("Seurat")) # 2. Folder configuration. dir.name = snakemake@params[["output_dir"]] folders = c("1_preprocessing", "2_normalization", "3_clustering", "4_degs", "5_gs", "6_traj_in", "7_func_analysis") message("2. Folder paths were set.") # 3. Get variables from Snakemake. input_data = snakemake@input[["data"]] random_seed = snakemake@params[["random_seed"]] velocyto = snakemake@params[["velocyto"]] outdir_config = snakemake@params[["outdir_config"]] ram = snakemake@resources[["mem_mb"]] write_table = as.logical(snakemake@params[["write_table"]]) message("3. Parameters were loaded.") # 4. Analysis configuration. # RAM configuration. options(future.globals.maxSize = ram*1024^2) # Set seed. if (is.numeric(random_seed)) { set.seed(random_seed) } message(paste0("4. Seed was set at ", random_seed, ".")) message("Configuration finished.") message("\n") # B. Analysis. message("PROCESSING STEP") # 2.5 Merging all samples in the execution. # Loading seurat object function. combine_object <- function(x, velocyto) { experiment = head(tail(strsplit(x, "/")[[1]], n=3), n=1) #Assay name is stored to later use in integrated object metadata. seurat_obj <- readRDS(x) seurat_obj <- AddMetaData(object = seurat_obj, metadata = experiment, col.name = 'assay_name') # If RNA velocity is going to be performed, we add the velocyto matrices in this step. if (velocyto){ # Velocyto matrices path is infered. velocyto_dir = paste0(outdir_config,"/star/", experiment,"/Solo.out/Velocyto/raw/") velo_names = c("spliced", "unspliced", "ambiguous") vel_matrices = list() # The matrices are read in 10x format. for (name in velo_names) { vel_matrices[[name]] <- Read10X(data.dir = paste0(velocyto_dir, name)) } # The matrices are added as assays in the respective seurat object. for (name in velo_names) { vel_matrices[[name]] <- vel_matrices[[name]][, which(x = colnames(vel_matrices[[name]]) %in% colnames(seurat_obj))] seurat_obj[[name]] <- CreateAssayObject(counts = vel_matrices[[name]]) } } return(seurat_obj) } message("1. All Seurat object were loaded.") # Function to get sample names. get_name_assays <- function(x) { experiment = head(tail(strsplit(x, "/")[[1]], n=3), n=1) #Assay name is stored to later use in integrated object metadata. return(experiment) } # Seurat objects are loaded and sample names are obtained. seurat_list = lapply(input_data, function(x) combine_object(x, velocyto)) cells_id = sapply(input_data, function(x) get_name_assays(x)) names(seurat_list) <- cells_id # Seurat object split for merging. x_seurat <- seurat_list[[1]] y_seurat <- seurat_list[2:length(seurat_list)] message("2. List of Seurat object was created.") # Merging step. seurat <- merge(x_seurat, y = as.vector(y_seurat), add.cell.ids = cells_id, project = "merged") message("3. Seurat objects were merged.") # Save expression matrix. if (write_table){ write.table(as.matrix(seurat@assays$RNA@counts), file = paste0(dir.name, "/", folders[1], "/expression_matrix.tsv"), sep = "\t", row.names = TRUE, col.names = NA, quote = FALSE) message("4. Merged expression matrix was saved.") } else { message("4. Merged expression matrix was not saved, as specified in the configuration file.") } # Save merged Seurat object. saveRDS(seurat, file = paste0(dir.name, "/", folders[1], "/seurat_post-qc.rds")) message("5. Seurat object was saved.") |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 | log <- file(snakemake@log[[1]], open = "wt") sink(log) sink(log, type = "message") message("CONFIGURATION STEP") # A. Parameters: # 1. Load libraries. suppressMessages(library("Seurat")) suppressMessages(library("dplyr")) suppressMessages(library("data.table")) suppressMessages(library("reticulate")) suppressMessages(library("ggplot2")) suppressMessages(library("patchwork")) message("1. Libraries were loaded.") # 2. Folder configuration. dir.name = snakemake@params[["output_dir"]] folders = c("1_preprocessing", "2_normalization", "3_clustering", "4_degs", "5_gs", "6_traj_in", "7_func_analysis") message("2. Folder paths were set.") # 3. Get variables from Snakemake. min_feat = snakemake@params[["min_feat"]] max_feat = snakemake@params[["max_feat"]] min_count = snakemake@params[["min_count"]] max_count = snakemake@params[["max_count"]] mit = snakemake@params[["mit"]] ribo = snakemake@params[["ribo"]] random_seed = snakemake@params[["random_seed"]] ram = snakemake@resources[["mem_mb"]] write_table = as.logical(snakemake@params[["write_table"]]) message("3. Parameters were loaded.") # 4. Analysis configuration. # RAM configuration. options(future.globals.maxSize = ram*1024^2) # Set seed. if (is.numeric(random_seed)) { set.seed(random_seed) } message(paste0("4. Seed was set at ", random_seed, ".")) message("Configuration finished.") message("\n") # B. Analysis. message("PROCESSING STEP") # Load Seurat object from previous step. seurat = readRDS(paste0(dir.name, "/", folders[1], "/seurat_pre-qc.rds")) message("1. Seurat object was loaded.") #3. Preprocessing: Filter out low-quality cells. # 3.1 Pre-filtering stats calculus. stats_pre <- c(length(colnames(seurat)), median(seurat@meta.data[["nCount_RNA"]]), median(seurat@meta.data[["nFeature_RNA"]]), median(seurat@meta.data[["percent.mt"]]), median(seurat@meta.data[["percent.ribo"]])) message("2. Pre-filtering statistics were obtained.") # 3.2 We should apply the filterings once the QC plots (GenePlot and Violin plots) have been checked. # 3.2.1 Feature filter. min_feat <- ifelse(is.null(min_feat), 0, min_feat) max_feat <- ifelse(is.null(max_feat), 0, max_feat) if (min_feat != 0 | max_feat != 0) { cells_seurat <- FetchData(object = seurat, vars = "nFeature_RNA") seurat <- seurat[, which(x = cells_seurat > min_feat & cells_seurat < max_feat)] } # 3.2.2 Count filter. min_count <- ifelse(is.null(min_count), 0, min_count) max_count <- ifelse(is.null(max_count), 0, max_count) if (min_count != 0 | max_count != 0) { cells_seurat <- FetchData(object = seurat, vars = "nCount_RNA") seurat <- seurat[, which(x = cells_seurat > min_count & cells_seurat < max_count)] } # 3.2.3 Mitochondrial filter. if (!is.null(mit)) { mit_seurat <- FetchData(object = seurat, vars = "percent.mt") seurat <- seurat[, which(x = mit_seurat < mit)] } # 3.2.4 Ribosomal filter. if (!is.null(ribo)) { ribo_seurat <- FetchData(object = seurat, vars = "percent.ribo") seurat <- seurat[, which(x = ribo_seurat < ribo)] } message("3. Cell filters were applied.") # 3.3 QC: violin plots - After filter. p1 <- VlnPlot(seurat, features = c("nFeature_RNA"), pt.size = 0.25, cols = "#9CCCD0") + ggtitle("Nº features") + theme(legend.position="bottom") p2 <- VlnPlot(seurat, features = c("nCount_RNA"), pt.size = 0.25, cols = "#8ADD56") + ggtitle("Nº counts") + theme(legend.position="bottom") p3 <- VlnPlot(seurat, features = c("percent.mt"), pt.size = 0.25, cols = "#F07800") + ggtitle("Mitochondrial %") + theme(legend.position="bottom") p4 <- VlnPlot(seurat, features = c("percent.ribo"), pt.size = 0.25, cols = "#E44631") + ggtitle("Ribosomal %") + theme(legend.position="bottom") p_comp <- p1 + p2 + p3 + p4 + plot_layout(ncol = 4) ggsave(paste0(dir.name, "/", folders[1], "/3_vlnplot_QC_variables_postfilt.pdf"), plot = p_comp, scale = 1.2, width = 10, height = 8) message("4. Combined violin plot post-filtering was generated.") # 3.4 Post-filter stats calculus. stats_post <- c(length(colnames(seurat)), median(seurat@meta.data[["nCount_RNA"]]), median(seurat@meta.data[["nFeature_RNA"]]), median(seurat@meta.data[["percent.mt"]]), median(seurat@meta.data[["percent.ribo"]])) message("5. Post-filtering statistics were obtained.") # 3.5 Statistics table. filtering_df <- data.frame("Number of cells" = c(stats_pre[1],stats_post[1]), "Count median" = c(stats_pre[2],stats_post[2]),"Expressed genes median" = c(stats_pre[3],stats_post[3]), "Mitochondrial percentage median" = c(stats_pre[4],stats_post[4]), "Ribosomal percentage median" = c(stats_pre[5],stats_post[5])) row.names(filtering_df) <- c("Pre-QC", "Post-QC") write.table(filtering_df, file = paste0(dir.name, "/", folders[1], "/4_pre_vs_post_stats.tsv"), sep = "\t", col.names = NA, quote = FALSE) message("6. Statistics table was saved.") # 3.6 Save expression matrix. if(write_table){ write.table(as.matrix(seurat@assays$RNA@counts), file = paste0(dir.name, "/", folders[1], "/expression_matrix.tsv"), sep = "\t", row.names = TRUE, col.names = NA, quote = FALSE) message("7. Post-qc expression matrix was saved.") } else { message("7. Post-qc expression matrix was not saved, as specified in the configuration file.") } # 3.7 Save RDS: we can use this object to generate all the rest of the data. saveRDS(seurat, file = paste0(dir.name, "/",folders[1], "/seurat_post-qc.rds")) message("8. Seurat object was saved.") |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 | log <- file(snakemake@log[[1]], open = "wt") sink(log) sink(log, type = "message") message("CONFIGURATION STEP") # A. Parameters: # 1. Load libraries. suppressMessages(library("Seurat")) suppressMessages(library("ggplot2")) suppressMessages(library("RColorBrewer")) suppressMessages(library("stringr")) suppressMessages(library("future")) suppressMessages(library("stringr")) message("1. Libraries were loaded.") # 2. Folder configuration. dir.name = snakemake@params[["output_dir"]] input_data = snakemake@input[["data"]] folders = c("1_preprocessing", "2_normalization", "3_clustering", "4_degs", "5_gs", "6_traj_in", "7_func_analysis") message("2. Folder paths were set.") # 3. Get variables from Snakemake. regress_out = snakemake@params[["regress_out"]] regress_cell_cycle = snakemake@params[["regress_cell_cycle"]] vars_to_regress = snakemake@params[["vars_to_regress"]] norm_type = snakemake@params[["norm_type"]] vf = snakemake@params[["variable_features"]] random_seed = snakemake@params[["random_seed"]] velocyto = snakemake@params[["velocyto"]] outdir_config = snakemake@params[["outdir_config"]] case = snakemake@params[["case"]] ram = snakemake@resources[["mem_mb"]] threads = snakemake@threads write_table = as.logical(snakemake@params[["write_table"]]) message("3. Parameters were loaded.") # 4. Analysis configuration. # RAM configuration. options(future.globals.maxSize = ram*1024^2) #Set parallelization. plan("multiprocess", workers = threads) message(paste0("4. Threads were set at ", threads, ".")) # Set seed. if (is.numeric(random_seed)) { set.seed(random_seed) } message(paste0("5. Seed was set at ", random_seed, ".")) # 5. Load cell cycle markers signature from Tirosh et al, 2015. s.genes <- cc.genes$s.genes g2m.genes <- cc.genes$g2m.genes ## 6. Change geneset signatures to specific gene case. if (case == "lowercase") { s.genes <- str_to_lower(s.genes) g2m.genes <- str_to_lower(g2m.genes) message ("Set to lowercase.") } else if (case == "titlecase") { s.genes <- str_to_title(s.genes) g2m.genes <- str_to_title(g2m.genes) message ("Set to titlecase.") } else if (case == "uppercase") { message ("Set to uppercase.") } else { message("Please choose a correc case option.") quit() } message(paste0("6. Case is set as ", case, ".")) message("Configuration finished.") message("\n") message("PROCESSING STEP") # 6. Integration. # 6.1. This function will read each one of the input files and will perform the early normalization steps for each sample. path_to_seurat_object <- function(x, regress_out, vars_to_regress, regress_cell_cycle, norm_type, add_velocity) { experiment = head(tail(strsplit(x, "/")[[1]], n=3), n=1) #Assay name is stored to later use in integrated object metadata. seurat_obj <- readRDS(x) seurat_obj <- AddMetaData(object = seurat_obj, metadata = experiment, col.name = 'assay_name') # If RNA velocity is going to be performed, we add the velocyto matrices in this step. if (add_velocity){ # Velocyto matrices path is infered. velocyto_dir = paste0(outdir_config,"/star/", experiment,"/Solo.out/Velocyto/raw/") velo_names = c("spliced", "unspliced", "ambiguous") vel_matrices = list() # The matrices are read in 10x format. for (name in velo_names) { vel_matrices[[name]] <- Read10X(data.dir = paste0(velocyto_dir, name)) } # The matrices are added as assays in the respective seurat object. for (name in velo_names) { vel_matrices[[name]] <- vel_matrices[[name]][, which(x = colnames(vel_matrices[[name]]) %in% colnames(seurat_obj))] seurat_obj[[name]] <- CreateAssayObject(counts = vel_matrices[[name]]) } } # Save regression parameters variable regression_param <- c(regress_out, regress_cell_cycle) vars_param <- list(vars_to_regress = vars_to_regress, cell_cycle = c("S.Score", "G2M.Score")) # Normalization of the individual samples if(norm_type == "standard"){ seurat_obj <- NormalizeData(seurat_obj, normalization.method = "LogNormalize", scale.factor = 10000) message(paste0("2. Seurat object was normalized using the ", norm_type, " approach")) seurat_obj <- FindVariableFeatures(seurat_obj, selection.method = "vst", nfeatures = 2500) # Scaling to perform PCA. seurat_obj <- ScaleData(seurat_obj, features = rownames(seurat_obj)) # PCA previous to cell cycle scoring. seurat_obj <- RunPCA(seurat_obj, features = if(vf) VariableFeatures(object = seurat_obj) else rownames(seurat_obj), npcs = 50) # Cell cycle scores and plots. seurat_obj <- CellCycleScoring(object = seurat_obj, s.features = s.genes, g2m.features = g2m.genes, set.ident = T) # Scaling. if(all(regression_param)){ seurat_obj <- ScaleData(seurat_obj, features = rownames(seurat_obj), vars.to.regress = c(vars_to_regress, "S.Score", "G2M.Score")) } else if (any(regression_param)){ seurat_obj <- ScaleData(seurat_obj, features = rownames(seurat_obj), vars.to.regress = unlist(vars_param[regression_param])) } } else if (norm_type == "SCT"){ seurat_obj <- SCTransform(seurat_obj, verbose = FALSE) message(paste0("2. Seurat object was normalized using the ", norm_type, " approach")) # PCA previous to cell cycle scoring. seurat_obj <- RunPCA(seurat_obj, features = if(vf) VariableFeatures(object = seurat_obj) else rownames(seurat_obj), npcs = 50) # Cell cycle scores and plots. seurat_obj <- CellCycleScoring(object = seurat_obj, s.features = s.genes, g2m.features = g2m.genes, set.ident = TRUE) # If cell cycle regression is needed, a new SCT transformation is performed. if(all(regression_param)){ seurat_obj <- SCTransform(seurat_obj, assay = "RNA", new.assay = "SCT", vars.to.regress = c(vars_to_regress, "S.Score", "G2M.Score"), verbose = FALSE) } else if (any(regression_param)){ seurat_obj <- SCTransform(seurat_obj, assay = "RNA", new.assay = "SCT", vars.to.regress = unlist(vars_param[regression_param]), verbose = FALSE) } } else { message("Normalization method not found.") } return(assign(paste0(experiment[[1]][6]),seurat_obj)) } # 6.2. The function is applied obtaining a list with all the seurat objects. seurat_object_list <- lapply(input_data, function(x) path_to_seurat_object(x, regress_out = regress_out, vars_to_regress = vars_to_regress, regress_cell_cycle = regress_cell_cycle, norm_type = norm_type, add_velocity = velocyto)) message("1. All Seurat objects were loaded.") # 6.3. The minimum number of features between assays is obtained to get an aproximate feature number and keep as much genes as possible. n_feat_calc <- function(seurat) { if (seurat@active.assay == "SCT"){ return(length(rownames(seurat@assays$SCT@counts))) } else if (seurat@active.assay == "RNA"){ return(length(rownames(seurat@assays$RNA@counts))) } } n_features <- min(unlist(lapply(seurat_object_list, function(x) n_feat_calc(x)))) # 6.4. Final integration steps and integration object creation. if (norm_type == "SCT") { seurat.features <- SelectIntegrationFeatures(object.list = seurat_object_list, nfeatures = n_features, verbose = FALSE) seurat.list <- PrepSCTIntegration(object.list = seurat_object_list, anchor.features = seurat.features, verbose = FALSE) seurat.anchors <- FindIntegrationAnchors(object.list = seurat.list, normalization.method = "SCT", anchor.features = seurat.features, verbose = FALSE) seurat.integrated <- IntegrateData(anchorset = seurat.anchors, normalization.method = "SCT", verbose = FALSE) message(paste0("2. All Seurat objects were integrated following the ", norm_type, " approach.")) } else if (norm_type == "standard") { seurat.features <- SelectIntegrationFeatures(object.list = seurat_object_list, nfeatures = n_features, verbose = FALSE) seurat.anchors <- FindIntegrationAnchors(object.list = seurat_object_list, dims = 1:100, anchor.features = seurat.features, verbose = FALSE) seurat.integrated <- IntegrateData(anchorset = seurat.anchors, dims = 1:100, verbose = FALSE) seurat.integrated <- ScaleData(seurat.integrated, verbose = TRUE) message(paste0("2. All Seurat objects were integrated following the ", norm_type, " approach.")) } # 6.5. PCA and Visualize Dimensional Reduction genes. seurat.integrated <- RunPCA(seurat.integrated, features = if(vf) VariableFeatures(object = seurat.integrated) else rownames(seurat.integrated), ncps = 100, verbose = FALSE) p1 <- VizDimLoadings(seurat.integrated, dims = 1:2, reduction = "pca") + theme(legend.position="bottom") ggsave(paste0(dir.name, "/",folders[2], "/1_viz_dim_loadings.pdf"), plot = p1, scale = 1.5) # 6.6. PCA projection and integration visualization plot. getPalette <- colorRampPalette(brewer.pal(9,'Set1')) p2 <- DimPlot(seurat.integrated, reduction = "pca", group.by = "assay_name", cols=getPalette(length(levels(as.factor(seurat.integrated$assay_name))))) ggsave(paste0(dir.name, "/", folders[2], "/2_dimplot_PCA.pdf"), plot = p2) message("3. PCA was done.") # 6.7. Principal component study using Elbow plot. p3 <- ElbowPlot(seurat.integrated, ndims = 50) + theme(legend.position="bottom") ggsave(paste0(dir.name, "/",folders[2], "/3_elbowplot.pdf"), plot = p3, scale = 1.5) message("4. Elbowplot was generated.") # 6.8. Cell cycle scores and plots. seurat.integrated <- CellCycleScoring(object = seurat.integrated, s.features = s.genes, g2m.features = g2m.genes, set.ident = T) p5 <- FeaturePlot(object = seurat.integrated, features ="S.Score") ggsave(paste0(dir.name, "/", folders[2], "/5_sscore_featureplot.pdf"), plot = p5, scale = 1.5) p6 <- FeaturePlot(object = seurat.integrated, features ="G2M.Score") ggsave(paste0(dir.name, "/", folders[2], "/6_g2mscore_featureplot.pdf"), plot = p6, scale = 1.5) p7 <- DimPlot(seurat.integrated, reduction = "pca", pt.size = 0.5, label = TRUE, label.size = 5) + RotatedAxis() #+ theme(legend.position ="bottom") ggsave(paste0(dir.name, "/", folders[2], "/7_no_umap_pca.pdf"), plot = p7, scale = 1.5) message("6. Cell-cycle analysis plot was done.") # 6.9. Save expression matrix. if(write_table){ write.table(as.matrix(seurat.integrated@assays$integrated@scale.data), file = paste0(dir.name, "/", folders[2], "/normalized_expression_matrix.tsv"), sep = "\t", row.names = TRUE, col.names = NA, quote = FALSE) message("7. Integrated expression matrix was saved.") } else { message("7. Integrated expression matrix was not saved, as specified in the configuration file.") } # 6.10. Save Seurat object. saveRDS(object = seurat.integrated, file = paste0(dir.name, "/", folders[2], "/seurat_normalized-pcs.rds")) message("8. Seurat object was saved.") |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 | log <- file(snakemake@log[[1]], open = "wt") sink(log) sink(log, type = "message") message("CONFIGURATION STEP") # A. Parameters: # 1. Load libraries. suppressMessages(library("Seurat")) suppressMessages(library("dplyr")) suppressMessages(library("data.table")) suppressMessages(library("reticulate")) suppressMessages(library("ggplot2")) suppressMessages(library("stringr")) suppressMessages(library("future")) message("1. Libraries were loaded.") # 2. Folder configuration. dir.name = snakemake@params[["output_dir"]] input_data = snakemake@input[["seurat_obj"]] folders = c("1_preprocessing", "2_normalization", "3_clustering", "4_degs", "5_gs") message("2. Folder paths were set.") # 3. Get variables from Snakemake. normalization = snakemake@params[["norm_type"]] # "SCT" or "standard" vf = snakemake@params[["variable_features"]] # TO ADD regress_out = snakemake@params[["regress_out"]] # true or false vars_to_regress = c(snakemake@params[["vars_to_regress"]]) # check if null random_seed = snakemake@params[["random_seed"]] regress_cell_cycle = snakemake@params[["regress_cell_cycle"]] regress_merge_effect = snakemake@params[["regress_merge_effect"]] case = snakemake@params[["case"]] ram = snakemake@resources[["mem_mb"]] threads = snakemake@threads write_table = as.logical(snakemake@params[["write_table"]]) message("3. Parameters were loaded.") # 4. Analysis configuration. # RAM configuration. options(future.globals.maxSize = ram*1024^2) #Set parallelization. plan("multiprocess", workers = threads) message(paste0("4. Threads were set at ", threads, ".")) # Set seed. if (is.numeric(random_seed)) { set.seed(random_seed) } message(paste0("5. Seed was set at ", random_seed, ".")) # 5. Load cell cycle markers signature from Tirosh et al, 2015. s.genes <- cc.genes$s.genes g2m.genes <- cc.genes$g2m.genes # 6. Change geneset signatures to specific gene case. if (case == "lowercase") { s.genes <- str_to_lower(s.genes) g2m.genes <- str_to_lower(g2m.genes) message ("Set to lowercase.") } else if (case == "titlecase") { s.genes <- str_to_title(s.genes) g2m.genes <- str_to_title(g2m.genes) message ("Set to titlecase.") } else if (case == "uppercase") { message ("Set to uppercase.") } else { message("Please choose a correct case option.") quit() } message(paste0("6. Case is set as ", case, ".")) message("Configuration finished.") message("\n") message("PROCESSING STEP") # Load seurat object. seurat = readRDS(input_data) message("1. Seurat object was loaded.") # Regress merge variable input. if (regress_merge_effect){ merge_var = "assay_name" } else { merge_var = NULL } # 5. Normalization. # 5.1. Normalize data depending of the method. if(normalization == "standard"){ seurat <- NormalizeData(seurat, normalization.method = "LogNormalize", scale.factor = 10000) seurat <- FindVariableFeatures(seurat, selection.method = "vst", nfeatures = 2500) # Identify the 10 most highly variable genes. top10 <- head(VariableFeatures(seurat), 10) p1 <- VariableFeaturePlot(seurat) + theme(legend.position="bottom") p1 <- p1 + LabelPoints(plot = p1, points = top10, repel = TRUE) + theme(legend.position="bottom") ggsave(paste0(dir.name, "/",folders[1], "/6_variable_features.pdf"), plot = p1, scale = 1.5) # Scaling to perform PCA. # Merged object check. if (seurat@project.name == "merged") { seurat <- ScaleData(seurat, features = rownames(seurat), vars.to.regress = merge_var) } else { seurat <- ScaleData(seurat, features = rownames(seurat)) } # PCA previous to cell cycle scoring. seurat <- RunPCA(seurat, features = if(vf) VariableFeatures(object = seurat) else rownames(seurat), npcs = 50) # Cell cycle scores and plots. seurat <- CellCycleScoring(object = seurat, s.features = s.genes, g2m.features = g2m.genes, set.ident = T) p4 <- FeaturePlot(object = seurat, features ="S.Score") + ggtitle("S phase score") ggsave(paste0(dir.name, "/", folders[2], "/4_sscore_featureplot.pdf"), plot = p4, scale = 1.5) p5 <- FeaturePlot(object = seurat, features ="G2M.Score") + ggtitle("G2/M phase score") ggsave(paste0(dir.name, "/", folders[2], "/5_g2mscore_featureplot.pdf"), plot = p5, scale = 1.5) p6 <- DimPlot(seurat, reduction = "pca", pt.size = 0.5, label = TRUE, label.size = 5) + RotatedAxis() #+ theme(legend.position ="bottom") ggsave(paste0(dir.name, "/", folders[2], "/6_cell_cycle_dimplot.pdf"), plot = p6, scale = 1.5) message(paste0("2. Seurat object was normalized using the ", normalization, " approach")) # Scaling. if(regress_out == TRUE){ if (regress_cell_cycle) { if (seurat@project.name == "merged"){ seurat <- ScaleData(seurat, features = rownames(seurat), vars.to.regress = c(vars_to_regress, "S.Score", "G2M.Score", merge_var)) } else { seurat <- ScaleData(seurat, features = rownames(seurat), vars.to.regress = c(vars_to_regress, "S.Score", "G2M.Score")) } } else { if (seurat@project.name == "merged"){ seurat <- ScaleData(seurat, features = rownames(seurat), vars.to.regress = c(vars_to_regress, merge_var)) } else { seurat <- ScaleData(seurat, features = rownames(seurat), vars.to.regress = c(vars_to_regress)) } } } } else if (normalization == "SCT"){ if(regress_out == TRUE){ if(seurat@project.name == "merged"){ seurat <- SCTransform(seurat, vars.to.regress = c(vars_to_regress, merge_var), verbose = FALSE) } else { seurat <- SCTransform(seurat, vars.to.regress = vars_to_regress, verbose = FALSE) } } else { if(seurat@project.name == "merged"){ seurat <- SCTransform(seurat, vars.to.regress = merge_var, verbose = FALSE) } else { seurat <- SCTransform(seurat, verbose = FALSE) } } # PCA previous to cell cycle scoring. seurat <- RunPCA(seurat, features = VariableFeatures(object = seurat), npcs = 50) # This result could all be saved in a table. # Cell cycle scores and plots. seurat <- CellCycleScoring(object = seurat, s.features = s.genes, g2m.features = g2m.genes, set.ident = TRUE) p4 <- FeaturePlot(object = seurat, features ="S.Score") + ggtitle("S phase score") ggsave(paste0(dir.name, "/", folders[2], "/4_sscore_featureplot.pdf"), plot = p4, scale = 1.5) p5 <- FeaturePlot(object = seurat, features ="G2M.Score") + ggtitle("G2/M phase score") ggsave(paste0(dir.name, "/", folders[2], "/5_g2mscore_featureplot.pdf"), plot = p5, scale = 1.5) p6 <- DimPlot(seurat, reduction = "pca", pt.size = 0.5, label = TRUE, label.size = 5) + RotatedAxis() #+ theme(legend.position ="bottom") ggsave(paste0(dir.name, "/", folders[2], "/6_cell_cycle_dimplot.pdf"), plot = p6, scale = 1.5) message(paste0("2. Seurat object was normalized using the ", normalization, " approach")) # If cell cycle regression is needed, a new SCT transformation is perform. if (regress_cell_cycle){ if (regress_out) { if (seurat@project.name == "merged"){ seurat <- SCTransform(seurat, assay = "RNA", new.assay = "SCT", vars.to.regress = c(vars_to_regress, "S.Score", "G2M.Score", merge_var), verbose = FALSE) } else { seurat <- SCTransform(seurat, assay = "RNA", new.assay = "SCT", vars.to.regress = c(vars_to_regress, "S.Score", "G2M.Score"), verbose = FALSE) } } else { if (seurat@project.name == "merged"){ seurat <- SCTransform(seurat, assay = "RNA", new.assay = "SCT", vars.to.regress = c("S.Score", "G2M.Score", merge_var), verbose = FALSE) } else { seurat <- SCTransform(seurat, assay = "RNA", new.assay = "SCT", vars.to.regress = c("S.Score", "G2M.Score"), verbose = FALSE) } } } } else { message("Normalization method not found.") } message("3. Seurat object was scaled.") ## 5.2. PCA & metrics calculation. Idents(seurat) <- seurat@project.name seurat <- RunPCA(seurat, features = if(vf) VariableFeatures(object = seurat) else rownames(seurat), npcs = 50) # This result could all be saved in a table. # Visualizing PCA in Different Ways: elbow plot most variable genes p2 <- VizDimLoadings(seurat, dims = 1:2, reduction = "pca") + theme(legend.position="bottom") ggsave(paste0(dir.name, "/",folders[2], "/1_viz_dim_loadings.pdf"), plot = p2, scale = 1.5)#, height = height, width = height * aspect_ratio) p3 <- DimPlot(seurat, reduction = "pca", pt.size = 0.5) + theme(legend.position="bottom") ggsave(paste0(dir.name, "/",folders[2], "/2_dimplot.pdf"), plot = p3, scale = 1.5) # Only if cell cycle was regressed. if (regress_cell_cycle) { p7 <- FeaturePlot(object = seurat, features ="S.Score") + ggtitle("S phase score") ggsave(paste0(dir.name, "/", folders[2], "/4_sscore_featureplot_regressed.pdf"), plot = p7, scale = 1.5) p8 <- FeaturePlot(object = seurat, features ="G2M.Score") + ggtitle("G2/M phase score") ggsave(paste0(dir.name, "/", folders[2], "/5_g2mscore_featureplot_regressed.pdf"), plot = p7, scale = 1.5) p9 <- DimPlot(seurat, group.by = "Phase", reduction = "pca", pt.size = 0.5, label = TRUE, label.size = 5) + RotatedAxis() ggsave(paste0(dir.name, "/", folders[2], "/6_cell_cycle_dimplot_regressed.pdf"), plot = p9, scale = 1.5) } message("4. PCA was calculated.") # 5.3. Determine the dimensionality of the dataset p4 <- ElbowPlot(seurat, ndims = 50) + theme(legend.position="bottom") ggsave(paste0(dir.name, "/",folders[2], "/3_elbowplot.pdf"), plot = p4, scale = 1.5) message("5. Elbowplot was generated.") #if (!(seurat@active.assay == "SCT")) { # seurat <- JackStraw(seurat, num.replicate = 100, dims = 50) # seurat <- ScoreJackStraw(seurat, dims = 1:50) # p5 <- JackStrawPlot(seurat, dims = 1:50) + theme(legend.position="bottom") + guides(fill=guide_legend(nrow=2, byrow=TRUE)) + labs(y = "Empirical", x="Theoretical") # ggsave(paste0(dir.name, "/",folders[2], "/4_jackstrawplot.pdf"), plot = p5, scale = 2) # message("6. JackStraw plot was generated.") #} if (seurat@project.name == "merged"){ Idents(seurat) <- "assay_name" p6 <- DimPlot(seurat, reduction = "pca", pt.size = 0.5) + theme(legend.position="bottom") ggsave(paste0(dir.name, "/",folders[2], "/2_dimplot_merged.pdf"), plot = p6, scale = 1.5) } # 5.4. Save the expresion matrix. if(write_table){ if (normalization == "SCT") { write.table(as.matrix(seurat@assays$SCT@data), file = paste0(dir.name, "/", folders[2], "/normalized_expression_matrix.tsv"), sep = "\t", row.names = TRUE, col.names = NA, quote = FALSE) } if (normalization == "standard") { write.table(as.matrix(seurat@assays$RNA@data), file = paste0(dir.name, "/", folders[2], "/normalized_expression_matrix.tsv"), sep = "\t", row.names = TRUE, col.names = NA, quote = FALSE) } message("6. Normalized expression matrix was saved.") } else { message("6. Normalized expression matrix was not saved, as specified in the configuration file.") } # 5.5. Save RDS: we can use this object to generate all the rest of the data. saveRDS(seurat, file = paste0(dir.name, "/",folders[2], "/seurat_normalized-pcs.rds")) message("7. Seurat object was saved.") |
1
of
scripts/step3_normalization.R
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 | log <- file(snakemake@log[[1]], open = "wt") sink(log) sink(log, type = "message") message("CONFIGURATION STEP") # A. Parameters: # 1. Load libraries. suppressMessages(library("Seurat")) suppressMessages(library("SeuratDisk")) suppressMessages(library("dplyr")) suppressMessages(library("data.table")) suppressMessages(library("reticulate")) suppressMessages(library("clustree")) suppressMessages(library("ggplot2")) suppressMessages(library("cluster")) suppressMessages(library("writexl")) suppressMessages(library("future")) suppressMessages(library("patchwork")) suppressMessages(library("lisi")) message("1. Libraries were loaded.") # 2. Folder configuration. input_file = snakemake@input[["seurat_obj"]] dir.name = snakemake@params[["output_dir"]] folders = c("1_preprocessing", "2_normalization", "3_clustering", "4_degs", "5_gs", "6_traj_in", "7_func_analysis") message("2. Folder paths were set.") # 3. Get variables from Snakemake. pc = snakemake@params[["pc"]] # We should check the PCs using the Elbowplot and JackStraw plot. res = as.vector(snakemake@params[["res"]]) random_seed = snakemake@params[["random_seed"]] k_neighbors = snakemake@params[["k_neighbors"]] batch_metadata = snakemake@params[["batch_metadata"]] # check if null ram = snakemake@resources[["mem_mb"]] threads = snakemake@threads message("3. Parameters were loaded.") # 4. Analysis configuration. # RAM configuration. options(future.globals.maxSize = ram*1024^2) #Set parallelization. plan("multiprocess", workers = threads) message(paste0("4. Threads were set at ", threads, ".")) # Set seed. if (is.numeric(random_seed)) { set.seed(random_seed) } message(paste0("5. Seed was set at ", random_seed, ".")) message("Configuration finished.") message("\n") message("PROCESSING STEP") # Load seurat object. seurat <- readRDS(input_file) assay_type <- seurat@active.assay message("1. Seurat object was loaded.") # 7. Clustering. # 7.1. FindClusters using UMAP projection. We keep the significant PC obtained from PCA. seurat <- FindNeighbors(seurat, reduction = "pca", dims = 1:pc, k.param = k_neighbors, future.seed = NULL) seurat <- FindClusters(seurat, resolution = res, future.seed = NULL) seurat <- RunUMAP(seurat,dims = 1:pc, n.components = 2, verbose = FALSE, future.seed = NULL) message("2. UMAP was done.") # 7.2. Clustree. p1 <- clustree(seurat, prefix = paste0(assay_type,"_snn_res.")) ggsave(paste0(dir.name, "/", folders[3], "/1_clustree.pdf"), plot = p1, scale = 1.5) message("3. Clustree representation was obtained.") # 7.3. Clustering plots and silhouette parameters calculus. # An empty list is created to store silhouette values. silhouette_scores <- vector(mode = "list", length = length(res)) # If the seurat object contains more than 50K samples, a subset is done to if (dim(seurat@meta.data)[1] > 50000){ message("Downsampling seurat object to perform silhouette analysis") seurat_sil <- seurat[, sample(colnames(seurat), size = 50000, replace=F)] } else { seurat_sil <- seurat } # Loop for each resolution. for(i in 1:length(which(grepl(paste0(assay_type,"_snn_"),colnames(seurat_sil@meta.data))))){ full_res = colnames(seurat@meta.data[which(grepl(paste0(assay_type,"_snn_"),colnames(seurat@meta.data)))][i]) Idents(seurat) <- full_res p2 <- DimPlot(seurat, reduction = "umap", label = TRUE, label.size = 5) + theme_minimal() #+ theme(legend.position="bottom") ggsave(paste0(dir.name, "/", folders[3], "/2_umap_",full_res,".pdf"), plot = p2, scale = 1.5) # Silhhouettes calculus. dist.matrix <- dist(x = Embeddings(object = seurat_sil[["pca"]])[, 1:pc]) #!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! clusters <- slot(seurat_sil, "meta.data")[,full_res] sil <- silhouette(x = as.numeric(x = as.factor(x = clusters)), dist = dist.matrix) if(is.null(dim(sil))){ silhouette_scores[[i]] <- as.data.frame(NA) } else { silhouette_scores[[i]] <- as.data.frame(summary(sil)[2]) } names(silhouette_scores[[i]]) <- full_res } # Create a xlsx file to store the silhouette scores. write_xlsx(silhouette_scores, path = paste0(dir.name, "/", folders[3], "/3_silhouette_score.xlsx"),col_names = TRUE, format_headers = TRUE ) message("4. Silhouette scores were calculated.") # 7.4. Plots on vars to regress p3 <- FeaturePlot(seurat, 'nFeature_RNA', pt.size = 0.75) + labs(title = "Nº features") ggsave(paste0(dir.name, "/", folders[3], "/4.1_featureplot.pdf"), plot = p3, scale = 1.5) p4 <- FeaturePlot(seurat, 'nCount_RNA', pt.size = 0.75) + labs(title = "Nº counts") ggsave(paste0(dir.name, "/", folders[3], "/4.2_countplot.pdf"), plot = p4, scale = 1.5) p5 <- FeaturePlot(seurat, 'percent.mt', pt.size = 0.75) + labs(title = "% mitochondrial") ggsave(paste0(dir.name, "/", folders[3], "/4.3_mitplot.pdf"), plot = p5, scale = 1.5) p6 <- FeaturePlot(seurat, 'percent.ribo', pt.size = 0.75) + labs(title = "% ribosomal") ggsave(paste0(dir.name, "/", folders[3], "/4.4_riboplot.pdf"), plot = p6, scale = 1.5) p7 <- FeaturePlot(seurat, 'S.Score', pt.size = 0.75) + labs(title = "S phase score") ggsave(paste0(dir.name, "/", folders[3], "/4.5_sscoreplot.pdf"), plot = p7, scale = 1.5) p8 <- FeaturePlot(seurat, 'G2M.Score', pt.size = 0.75) + labs(title = "G2M phase score") ggsave(paste0(dir.name, "/", folders[3], "/4.5_g2mplot.pdf"), plot = p8, scale = 1.5) message("4. Variable plots on UMAP dimension were produced.") # 7.5. Dimplot for merged or integrated objects. if (seurat@active.assay == TRUE || seurat@project.name == "merged"){ Idents(seurat) <- "assay_name" p3 <- DimPlot(seurat, reduction = "umap") ggsave(paste0(dir.name, "/", folders[3], "/2_umap_by_assay.pdf"), plot = p3, scale = 1.5) } # 7.6 Compute LISI score using the desired variables. if(!is.null(batch_metadata)){ X <- seurat@reductions$umap@cell.embeddings[1:dim(seurat)[2],] # Reductions matrix meta_data <- seurat@meta.data[, batch_metadata, drop = FALSE] # Meta data object res <- compute_lisi(X, meta_data, label_colnames = batch_metadata) # lisi scores # 7.6.1 Superpose lisi score with umap lapply(seq_along(batch_metadata), function(x){ # Orig p1 <- ggplot(res, aes(x = res[,x])) + geom_density() + theme_classic() #+ theme_minimal() p2 <- DimPlot(seurat, group.by = batch_metadata[x]) p3 <- p1 / p2 + plot_layout(nrow = 2, heights = c(0.25, 2)) ggsave(paste0(dir.name, "/", folders[3], "/5_lisi_score_by_", batch_metadata[x], ".pdf"), plot = p3, scale = 1.5) }) message("5. LISI score was calculated.") } # 7.7. Statistics table per cluster. # 7.7.1. Get the index of resolution columns. resolutions = grep("snn_res", colnames(seurat@meta.data)) # 7.7.2. Set the highest number of clusters. number_clusters = length(unique(seurat@meta.data$seurat_clusters)) # 7.7.3. Get the maximum cluster names vector. cluster_names = paste0("Cluster ", seq(number_clusters)) # 7.7.4. Loop for each resolution and write table. for(j in 1:length(resolutions)){ # Set idents and levels. Idents(seurat) <- seurat@meta.data[,resolutions[j]] levels(Idents(seurat)) <- cluster_names[1:length(levels(Idents(seurat)))] table(Idents(seurat)) # Proportion of cells per cluster. prop.table(x = table(Idents(seurat))) cluster.averages <- AverageExpression(object = seurat) genes_per_cluster <- Matrix::colSums(cluster.averages$RNA>0) # Generate a joint table. joint_stats = rbind(number_of_cells = table(Idents(seurat)), prop_per_cluster = prop.table(x = table(Idents(seurat))), genes_per_cluster = genes_per_cluster) colnames(joint_stats) = names(table(Idents(seurat))) write.table(t(joint_stats), file=paste0(dir.name, "/", folders[3], "/5_", colnames(seurat@meta.data)[resolutions[j]],"_stats.tsv"), sep="\t", col.names = NA, quote = FALSE) } message("5. Statistics table was done and saved.") # 7.8. Save Seurat object in AnnData SaveH5Seurat(seurat, filename = paste0(dir.name, "/", folders[3], "/seurat_find-clusters.h5Seurat")) Convert(paste0(dir.name, "/",folders[3], "/seurat_find-clusters.h5Seurat"), dest = "h5ad") # 7.9. Save RDS: we can use this object to generate all the rest of the data. saveRDS(seurat, file = paste0(dir.name, "/",folders[3], "/seurat_find-clusters.rds")) message("8. Seurat object was saved.") |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 | log <- file(snakemake@log[[1]], open = "wt") sink(log) sink(log, type = "message") message("CONFIGURATION STEP") # A. Parameters: # 1. Load libraries. suppressMessages(library("Seurat")) suppressMessages(library("dplyr")) suppressMessages(library("data.table")) suppressMessages(library("reticulate")) suppressMessages(library("ggplot2")) suppressMessages(library("BiocParallel")) suppressMessages(library("openxlsx")) suppressMessages(library("future")) message("1. Libraries were loaded.") # 2. Folder configuration. dir.name = snakemake@params[["output_dir"]] input_data = snakemake@input[["seurat_obj"]] folders = c("1_preprocessing", "2_normalization", "3_clustering", "4_degs", "5_gs", "6_traj_in", "7_func_analysis") message("2. Folder paths were set.") # 3. Get variables from Snakemake. selected_cond = snakemake@params[["selected_cond"]] random_seed = snakemake@params[["random_seed"]] test = snakemake@params[["test"]] ranking = snakemake@params[["ranking"]] ram = snakemake@resources[["mem_mb"]] threads = snakemake@threads message("3. Parameters were loaded.") # 4. Analysis configuration. # RAM configuration. options(future.globals.maxSize = ram*1024^2) #Set parallelization. plan("multiprocess", workers = threads) message(paste0("4. Threads were set at ", threads, ".")) # Set seed. if (is.numeric(random_seed)) { set.seed(random_seed) } message(paste0("5. Seed was set at ", random_seed, ".")) message("Configuration finished.") message("\n") message("PROCESSING STEP") # Load seurat object and set clustering resolution and assay type. seurat <- readRDS(input_data) assay_type <- seurat@active.assay message("1. Seurat object was loaded.") lapply(selected_cond, function(cond){ if(grepl("0.", cond)){ cond <- paste0(assay_type, "_snn_res.", cond) if(!(cond %in% colnames(seurat@meta.data))){ stop("The specified resolution is not available.") } } else { if (!(cond %in% colnames(seurat@meta.data))){ stop("The specified condition is not available.") } } if (length(unique(seurat@meta.data[[cond]])) >1){ # Check number of cells for the heatmap to avoid errors. if (length(colnames(seurat)) < 1000){ n_cells = length(colnames(seurat)) } else { n_cells = 1000 } #Set styles for xlsx files. redStyle <- createStyle(fontColour = "#B60A1C", bgFill = "#FFF06A", textDecoration = c("BOLD")) greenStyle <- createStyle(fontColour = "#309143", bgFill = "#FFF06A", textDecoration = c("BOLD")) # 8. Differentially expressed genes between clusters. Idents(seurat) <- cond # If the input seurat object is integrated we only perform the FindConservedMarkers. if (seurat@active.assay == "integrated") { comparisons <- FetchData(seurat, vars = c("assay_name", cond)) comparisons$assay_cond<- apply(comparisons[ ,c("assay_name", cond)] , 1 , paste , collapse = "-" ) if(length(unique(comparisons$assay_cond))>2){ for (i in 1:length(unique(Idents(seurat)))){ # 8.1. Table on differentially expressed genes - using basic filterings. clusterX.markers <- FindConservedMarkers(seurat, ident.1 = unique(Idents(seurat))[i], min.pct = 0.25, grouping.var = "assay_name", test.use = test) #min expressed write.table(clusterX.markers, file = paste0(dir.name, "/", folders[4], "/cluster", unique(Idents(seurat))[i],".markers.txt"), sep = "\t", col.names = NA, quote = FALSE) } } else { print("The specified condition overlaps with the assay level.") } message("2. DEG analysis was done for integrated assay.") # If the input seurat object is not integrated. } else { # 8.1. Table on differentially expressed genes - using basic filterings. for (i in 1:length(unique(Idents(seurat)))){ clusterX.markers <- FindMarkers(seurat, ident.1 = unique(Idents(seurat))[i], min.pct = 0.25, test.use = test) #min expressed write.table(clusterX.markers, file = paste0(dir.name, "/", folders[4], "/cluster", unique(Idents(seurat))[i],".markers.txt"), sep = "\t", col.names = NA, quote = FALSE) } message("2. Cluster markers were obtained.") # 8.2. DE includying all genes - needed for a GSEA analysis. if (ranking){ for (i in 1:length(unique(Idents(seurat)))){ clusterX.markers <- FindMarkers(seurat, ident.1 = unique(Idents(seurat))[i], min.pct = 0, logfc.threshold = 0, test.use = test) #min expressed #write.table(clusterX.markers, file = paste0(dir.name, "/", folders[4], "/cluster", unique(Idents(seurat))[i],".DE.txt"), sep = "\t", col.names = NA, quote = FALSE) wb <- createWorkbook() addWorksheet(wb, "DE analysis") writeData(wb, "DE analysis", clusterX.markers, rowNames = TRUE) conditionalFormatting(wb, "DE analysis", cols = 1:(ncol(clusterX.markers)+1), rows = 2:(nrow(clusterX.markers) + 1), rule = "AND($C2<0, $F2<0.05)", style = greenStyle) conditionalFormatting(wb, "DE analysis", cols = 1:(ncol(clusterX.markers)+1), rows = 2:(nrow(clusterX.markers) + 1), rule = "AND($C2>0, $F2<0.05)", style = redStyle) legend <- createComment(comment = c("Red means a positive LogFold\n\n", "Green means a negative LogFold"), style = c(redStyle, greenStyle)) writeComment(wb, "DE analysis", col = 8, row = 2, comment = legend) saveWorkbook(wb, paste0(dir.name, "/", folders[4], "/cluster", unique(Idents(seurat))[i],".DE.xlsx"), overwrite = TRUE) # 8.2.1. Create RNK file. rnk = NULL rnk = as.matrix(clusterX.markers[,2]) rownames(rnk)= toupper(row.names(clusterX.markers)) write.table(rnk, file = paste0(dir.name, "/", folders[4], "/cluster", unique(Idents(seurat))[i],".rnk"), sep = "\t", col.names = FALSE, quote = FALSE) } message("3. All genes DEG and RNK file were finished.") } # 8.3. Find TOP markers. seurat.markers <- FindAllMarkers(object = seurat, only.pos = TRUE, min.pct = 0.25, thresh.use = 0.25, test.use = test) groupedby.clusters.markers = seurat.markers %>% group_by(cluster) %>% top_n(10, avg_logFC) # 8.3.1. HeatMap top10.f # setting heatmap text label sizes ytext_value <- 70/(length(unique(seurat@meta.data[[cond]]))*1.30) if (ytext_value > 10) {ytext_value = 10} xtext_value <- 50/(length(unique(seurat@meta.data[[cond]]))*1.25) if (xtext_value > 8) {xtext_value = 8} # setting slim.col.label to TRUE will print just the cluster IDS instead of every cell name p1 <- DoHeatmap(object = seurat, features = groupedby.clusters.markers$gene, cells = 1:n_cells, size = xtext_value, angle = 45, group.bar = TRUE, draw.lines = F, raster = FALSE) + scale_fill_gradientn(colors = c("blue", "white", "red")) + guides(color=FALSE) + theme(axis.text.y = element_text(size = ytext_value)) + theme(legend.position="bottom") ggsave(paste0(dir.name, "/", folders[4], "/1_heatmap_topmarkers.pdf"), plot = p1, scale = 2) message("4. Top marker heatmap was done.") } # 8.4. Save RDS: we can use this object to generate all the rest of the data. saveRDS(seurat, file = paste0(dir.name, "/",folders[4], "/seurat_degs.rds")) message("5. Seurat object was saved.") } else if (length(unique(seurat@meta.data[[cond]])) <=1) { if(assay_type == "integrated"){ print("The specified condition has only one level.") if(!file.exists( paste0(dir.name, "/", folders[4], "/seurat_degs.rds"))){ saveRDS(seurat, file = paste0(dir.name, "/", folders[4], "/seurat_degs.rds")) } } else { print("The specified condition has only one level. Nothing to be done.") if(!file.exists( paste0(dir.name, "/", folders[4], "/seurat_degs.rds"))){ saveRDS(seurat, file = paste0(dir.name, "/", folders[4], "/seurat_degs.rds")) } } } }) |
1
of
scripts/step5_degs.R
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 | log <- file(snakemake@log[[1]], open = "wt") sink(log) sink(log, type = "message") message("CONFIGURATION STEP") # A. Parameters: # 1. Load libraries. suppressMessages(library("Seurat")) suppressMessages(library("dplyr")) suppressMessages(library("data.table")) suppressMessages(library("reticulate")) suppressMessages(library("ggplot2")) suppressMessages(library("qusage")) suppressMessages(library("clustree")) suppressMessages(library("patchwork")) message("1. Libraries were loaded.") # 2. Folder configuration. dir.name = snakemake@params[["output_dir"]] input_data = snakemake@input[["seurat_obj"]] folders = c("1_preprocessing", "2_normalization", "3_clustering", "4_degs", "5_gs", "6_traj_in", "7_func_analysis") message("2. Folder paths were set.") # B. Parameters: analysis configuration geneset_collection = snakemake@params[["gs_collection"]] random_seed = snakemake@params[["random_seed"]] resolutions = snakemake@params[["resolutions"]] geneset_percentage <- snakemake@params[["geneset_percentage"]] norm_type = snakemake@params[["norm_type"]] ram = snakemake@resources[["mem_mb"]] message("3. Parameters were loaded.") # 4. Analysis configuration. # RAM configuration. options(future.globals.maxSize = ram*1024^2) # Set seed. if (is.numeric(random_seed)) { set.seed(random_seed) } message(paste0("4. Seed was set at ", random_seed, ".")) message("Configuration finished.") message("\n") message("PROCESSING STEP") # Load seurat object. seurat = readRDS(input_data) assay_type <- seurat@active.assay dir.create(paste0(dir.name, "/", folders[5]), showWarnings = FALSE) message("1. Seurat object was loaded.") # 9. GS scoring # 9.1 Geneset loading, filtering and scoring. genesets <- read.gmt(geneset_collection) #should be a tab file, each row = pathway. genesets <- genesets[unlist(lapply(genesets, function(x) ((length(which(x%in% rownames(seurat)))/ length(x))> geneset_percentage)))] seurat <- AddModuleScore(object = seurat, features= genesets, name = names(genesets)) message("2. Signatures were loaded and scores were calculated.") # 9.2 We create a vector containing the different combinations for each resolutions and each cluster. res_clus_comb <- vector() #Empty vector. for (res in resolutions){ full_res = paste0(assay_type, "_snn_res.", res) for (cluster in levels(seurat@meta.data[[full_res]])) { x <- paste0(full_res,"C",cluster) res_clus_comb <- append(res_clus_comb, x) #Adding combinations to the vector. } } # 9.3 We create a MxN matrix for the p-values, where M is the previous calculated combinations and N is the different gene sets used. mtx_pval <- matrix(nrow=length(res_clus_comb), ncol=length(names(genesets))) # 9.4 We fill the matrix with the p-value calculated at cluster level for each resolution and gene set. if(norm_type == "standard") { norm_type <- "RNA" } if(assay_type == "integrated"){ seurat@active.assay <- norm_type } for (col in 1:length(genesets)) { row = 1 for (res in resolutions){ full_res = paste0(assay_type, "_snn_res.", res) if(length(levels(seurat@meta.data[[full_res]]))>1){ for (cluster in levels(seurat@meta.data[[full_res]])) { # The resolution is set as Idents Idents(seurat) <- seurat[[full_res]] #Counts per gene calculus. seurat_target_clust <- subset(seurat, idents = cluster) counts_in_gene_set_target<- rowMeans(as.matrix(seurat_target_clust@assays[[norm_type]]@data))[which(names(rowSums(as.matrix(seurat_target_clust@assays[[norm_type]]@data))) %in% genesets[[col]])] #expr_genes_target <- length(counts_in_gene_set_target[counts_in_gene_set_target > 1]) #all_expr_genes_target <- length(rowMeans(as.matrix(seurat_target_clust@assays[[assay_type]]@counts))[rowSums(as.matrix(seurat_target_clust@assays[[assay_type]]@counts)) > 1]) seurat_offtarget_clust <- subset(seurat, idents = cluster, invert = TRUE) counts_in_gene_set_offtarget<- rowMeans(as.matrix(seurat_offtarget_clust@assays[[norm_type]]@data))[which(names(rowSums(as.matrix(seurat_offtarget_clust@assays[[norm_type]]@data))) %in% genesets[[col]])] #expr_genes_offtarget <- length(counts_in_gene_set_offtarget[counts_in_gene_set_offtarget > 1]) #all_expr_genes_offtarget <- length(rowMeans(as.matrix(seurat_offtarget_clust@assays[[assay_type]]@counts))[rowSums(as.matrix(seurat_offtarget_clust@assays[[assay_type]]@counts)) > 1]) #Max count value used to generate the factor. max_val <- max(c(counts_in_gene_set_target, counts_in_gene_set_offtarget)) #gene_matrix <- matrix(data = c(expr_genes_target, expr_genes_offtarget, all_expr_genes_target, all_expr_genes_offtarget), nrow = 2) #The factor is calculated and applied for each vector. target_factor <- length(counts_in_gene_set_target[counts_in_gene_set_target > 0.05*max_val])/length(counts_in_gene_set_target) offtarget_factor <- length(counts_in_gene_set_offtarget[counts_in_gene_set_offtarget > 0.05*max_val])/length(counts_in_gene_set_offtarget) counts_in_gene_set_target_factor <- counts_in_gene_set_target*target_factor counts_in_gene_set_offtarget_factor <- counts_in_gene_set_offtarget*offtarget_factor # Wilcoxon test: paired, using a vectors of gene means for each Seurat subset. p_value <- wilcox.test(counts_in_gene_set_target_factor, counts_in_gene_set_offtarget_factor, paired = TRUE, alternative = "greater")$p.value mtx_pval[row, col] <- p_value row = row + 1 } } else { mtx_pval[row, col] <- NA row = row + 1 } } } message("3. Significance per signature and cluster was calculated.") # 9.5 The p-values are corrected per geneset. mtx_corr_pval <- mtx_pval for (col in 1:length(genesets)) { p_val_vec_corr <- p.adjust(mtx_corr_pval[,col], "fdr") mtx_corr_pval[,col] <- p_val_vec_corr } message("4. P-values were corrected.") # 9.6 We create a dataframe from a matrix df_pval <- data.frame(mtx_corr_pval) row.names(df_pval) <- res_clus_comb colnames(df_pval) <- names(genesets) df_pval <- df_pval[ order(row.names(df_pval)), ] # 9.7 Save p-value table write.table(df_pval, file = paste0(dir.name, "/",folders[5], "/pval_table.tsv"), sep = "\t", quote = FALSE, row.names = TRUE, col.names = NA) message("5. P-value table was saved.") # 9.8 We loop for each geneset generating the plots for (i in 1:length(genesets)){ gs_name = paste0(names(genesets[i]), i) module_name = colnames(seurat@meta.data)[grep(names(genesets)[i], colnames(seurat@meta.data))] #Feature plot p1 <- FeaturePlot(object = seurat, features = module_name) #+ theme(legend.position="bottom") ggsave(paste0(dir.name, "/", folders[5], "/", names(genesets)[i], "_featureplot.pdf"), plot = p1, scale = 1.5) #Mean expression plot p2 <- clustree(seurat, paste0(assay_type,"_snn_res."), node_colour = gs_name , node_colour_aggr = "mean") + ggtitle(label = names(genesets)[i]) + theme(plot.title = element_text(size = 12, face = "bold")) p2$labels["colour"] <- "signature mean" #ggsave(paste0(dir.name, "/", folders[5], "/", names(genesets)[i], "_clustree_mean.pdf"), plot = p2, scale = 1) #P-val plot p3 <- clustree(seurat, paste0(assay_type,"_snn_res."), node_colour = gs_name , node_colour_aggr = "mean") + ggtitle(label = names(genesets)[i]) + scale_colour_gradient2(low = "blue", mid = "white", high = "red", midpoint = 1.3) + theme(plot.title = element_text(size = 12, face = "bold")) clustree_gs_colname = paste0("mean_",gs_name) p3$data[[clustree_gs_colname]] <- -log(df_pval[, names(genesets)[i]]) p3$labels[["colour"]] <- "-log(p_value)" #ggsave(paste0(dir.name, "/", folders[5], "/", names(genesets)[i], "_clustree_pval.pdf"), plot = p3, scale = 1) #Plot combination p4 <- p2 + p3 ggsave(paste0(dir.name, "/", folders[5], "/", names(genesets)[i], "_clustree_mean_pval.pdf"), plot = p4, scale = 1, width = 12, height=8, units = "in") } message("6. Clustree plots were produced.") #9.9 Save seurat object saveRDS(seurat, file = paste0(dir.name, "/",folders[5], "/seurat_complete.rds")) message("7. Seurat object was saved.") |
1
of
scripts/step6_gs-scoring.R
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 | log <- file(snakemake@log[[1]], open = "wt") sink(log) sink(log, type = "message") message("CONFIGURATION STEP") # A. Parameters: # 1. Load libraries. suppressMessages(library("GenomeInfoDbData")) suppressMessages(library("Seurat")) suppressMessages(library("RColorBrewer")) suppressMessages(library("slingshot")) suppressMessages(library("SingleCellExperiment")) suppressMessages(library("rmarkdown")) suppressMessages(library("gam")) suppressMessages(library("pheatmap")) if (snakemake@params[["graphics"]]) {suppressMessages(library("rgl"))} if (snakemake@params[["graphics"]]) {suppressMessages(library("htmltools"))} message("1. Libraries were loaded.") # 2. Folder configuration. dir.name = snakemake@params[["output_dir"]] input_data = snakemake@input[["seurat_obj"]] folders = c("1_preprocessing", "2_normalization", "3_clustering", "4_degs", "5_gs", "6_traj_in", "7_func_analysis") message("2. Folder paths were set.") # 3. Get variables from Snakemake. selected_res = snakemake@params[["selected_res"]] start.clus = snakemake@params[["start_clus"]] end.clus = snakemake@params[["end_clus"]] n_var_genes = snakemake@params[["n_var_genes"]] n_plotted_genes = snakemake@params[["n_plotted_genes"]] random_seed = snakemake@params[["random_seed"]] pc = snakemake@params[["pc"]] graphics = snakemake@params[["graphics"]] ram = snakemake@resources[["mem_mb"]] threads = snakemake@threads message("3. Parameters were loaded.") # 4. Analysis configuration. # RAM configuration. options(future.globals.maxSize = ram*1024^2) # Set seed. if (is.numeric(random_seed)) { set.seed(random_seed) } message(paste0("4. Seed was set at ", random_seed, ".")) message("Configuration finished.") message("\n") message("PROCESSING STEP") # 10. Trajectory inference analysis using slingshot. # 10.1. Seurat object is converted to SingleCellExperiment object (required by slingshot) & cluster resolution is selected. seurat <- readRDS(input_data) assay_type <- seurat@active.assay cluster_res <- paste0(assay_type,"_snn_res.",selected_res) if (!(cluster_res %in% colnames(seurat@meta.data))){ stop("Specified resolution is not available.") } seurat.sim <- as.SingleCellExperiment(seurat) message("1. Seurat object was loaded.") if (graphics){ # Running UMAP to get 3 dimensions. seurat3D <- RunUMAP(seurat,dims = 1:pc, n.components = 3, verbose = FALSE) seurat.sim <- as.SingleCellExperiment(seurat) seurat3D.sim <- as.SingleCellExperiment(seurat3D) message("1.5. 3D UMAP was calculated.") } # 10.2. Slingshot algorithm (dimensions = UMAP). There are 4 options depending of the start and end cluster in the following trajectory: if(is.numeric(start.clus) == FALSE && is.numeric(end.clus) == FALSE){ seurat.sim <- slingshot(seurat.sim, clusterLabels=cluster_res, reducedDim="UMAP") if (graphics) {seurat3D.sim <- slingshot(seurat3D.sim, clusterLabels=cluster_res, reducedDim="UMAP")} const = FALSE } else if(is.numeric(start.clus) == TRUE && is.numeric(end.clus) == FALSE){ seurat.sim <- slingshot(seurat.sim, clusterLabels=cluster_res, reducedDim="UMAP", start.clus = start.clus) if (graphics) {seurat3D.sim <- slingshot(seurat3D.sim, clusterLabels=cluster_res, reducedDim="UMAP", start.clus = start.clus)} const = TRUE } else if(is.numeric(start.clus) == FALSE && is.numeric(end.clus) == TRUE){ seurat.sim <- slingshot(seurat.sim, clusterLabels=cluster_res, reducedDim="UMAP", end.clus = end.clus) if (graphics) {seurat3D.sim <- slingshot(seurat3D.sim, clusterLabels=cluster_res, reducedDim="UMAP", end.clus = end.clus)} const = TRUE } else if(is.numeric(start.clus) == TRUE && is.numeric(end.clus) == TRUE){ seurat.sim <- slingshot(seurat.sim, clusterLabels=cluster_res, reducedDim="UMAP", start.clus = start.clus, end.clus=end.clus) if (graphics) {seurat3D.sim <- slingshot(seurat3D.sim, clusterLabels=cluster_res, reducedDim="UMAP", start.clus = start.clus, end.clus=end.clus)} const = TRUE } message("2. Slingshot trajectories were calculated.") # 10.3. Plots generation. # 10.3.1. Set the number of colours from the palette and extend it. n_col <- length(levels(seurat.sim@colData[, cluster_res])) getPalette <- colorRampPalette(brewer.pal(9,'Set1')) if (graphics) { # 10.3.2. Set a good window size for the 3D plots. r3dDefaults$windowRect <- c(0,50, 1024, 720) # 10.3.3. Curves 3D plot with legend --> HTML output. plot3d(reducedDims(seurat3D.sim)$UMAP[,1:3], col = getPalette(n_col)[seurat3D.sim@colData[, cluster_res]]) plot3d.SlingshotDataSet(SlingshotDataSet(seurat3D.sim), type = "curves", add = TRUE, lwd =3) legend3d("topright", legend=paste0("Cluster - ", levels(seurat3D.sim@colData[, cluster_res])), pch=16, col=getPalette(n_col), inset=c(0)) p_curves <- rglwidget(width = 1240, height = 1024) htmltools::save_html(htmltools::tagList(p_curves), file = paste0(dir.name, "/", folders[6], "/", paste0("3D_curves_", selected_res, "_res.html"))) # 10.3.4. Lineage 3D plot with legend --> HTML output. plot3d(reducedDims(seurat3D.sim)$UMAP[,1:3], col = getPalette(n_col)[seurat3D.sim@colData[, cluster_res]]) plot3d.SlingshotDataSet(SlingshotDataSet(seurat3D.sim), type = "lineages", add = TRUE, lwd =3) legend3d("topright", legend=paste0("Cluster - ", levels(seurat3D.sim@colData[, cluster_res])), pch=16, col=getPalette(n_col), inset=c(0)) p_lineages <- rglwidget(width = 1240, height = 1024) htmltools::save_html(htmltools::tagList(p_lineages), file = paste0(dir.name, "/", folders[6], "/", paste0("3D_lineages_", selected_res, "_res.html"))) } # 10.3.5. Curves 2D plot with legend --> pdf output. pdf(paste0(dir.name, "/", folders[6], "/2D_curves_", selected_res, "_res.pdf"), width = 11, height = 11) plot(reducedDims(seurat.sim)$UMAP, col = getPalette(n_col)[seurat.sim@colData[, cluster_res]], pch=16, asp = 1, main = "2D curves - Clusters Trajectories") legend("topright", legend=paste0("Cluster - ", levels(seurat.sim@colData[, cluster_res])), pch=16, col=getPalette(n_col)) lines(SlingshotDataSet(seurat.sim), lwd=2, col = 'black') dev.off() # 10.3.6. Lineage 2D plot with legend --> pdf output. pdf(paste0(dir.name, "/", folders[6], "/2D_lineages_", selected_res, "_res.pdf"), width = 11, height = 11) plot(reducedDims(seurat.sim)$UMAP, col = getPalette(n_col)[seurat.sim@colData[, cluster_res]], pch=16, asp = 1, main = "2D lineages - Clusters Trajectories") legend("topright", legend=paste0("Cluster - ", levels(seurat.sim@colData[, cluster_res])), pch=16, col=getPalette(n_col)) lines(SlingshotDataSet(seurat.sim), lwd=2, type = 'lineages', col = 'black', show.constraints = const) dev.off() message("3. Trajectory lots were done.") # 10.4. Temporally expressed genes heatmap. # 10.4.1. Pseudotime is obtained. t <- seurat.sim$slingPseudotime_1 # 10.4.2. Get the n variable genes. Y <- assays(seurat.sim)$logcounts var_genes <- names(sort(apply(Y,1,var),decreasing = TRUE))[1:n_var_genes] Y <- Y[var_genes,] # 10.4.3. Fitting a gam model. gam.pval <- apply(Y,1,function(z){ d <- data.frame(z=z, t=t) suppressWarnings({ tmp <- suppressWarnings(gam(z ~ lo(t), data=d)) }) p <- summary(tmp)[3][[1]][2,3] p }) # 10.4.4. Get the top genes selected by p-val and plot the heatmap. topgenes <- names(sort(gam.pval, decreasing = FALSE))[1:n_plotted_genes] heatdata <- assays(seurat.sim)$logcounts[topgenes, order(t, na.last = NA)] heatclus <- seurat.sim@colData[,cluster_res][order(t, na.last = NA)] annotation <- data.frame("Cluster" = heatclus, row.names = colnames(heatdata)) ann_colors <- list("Cluster" = setNames(getPalette(n_col), levels(heatclus))) # 10.4.5. Plot the genes. pheatmap(heatdata, cluster_cols = FALSE, color = colorRampPalette(c("yellow", "red"))(100), annotation_col = annotation, annotation_colors = ann_colors, show_colnames = FALSE, filename = paste0(dir.name, "/", folders[6], "/temporally_expressed_heatmaps_", selected_res, "_res.pdf")) message("4. Pseudotime heatmap was obtained.") # Save used objects. if (graphics) {save(seurat.sim, seurat3D.sim, file = paste0(dir.name, "/", folders[6], "/slingshot_sce_objects.RData")) } else {save(seurat.sim, file = paste0(dir.name, "/", folders[6], "/slingshot_sce_objects.RData"))} message("5. R objects were saved.") |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 | log <- file(snakemake@log[[1]], open = "wt") sink(log) sink(log, type = "message") message("CONFIGURATION STEP") # A. Parameters: # 1. Load libraries. suppressMessages(library("Seurat")) suppressMessages(library("ggplot2")) suppressMessages(library("viridis")) suppressMessages(library("devtools")) suppressMessages(library("RColorBrewer")) suppressMessages(library("VISION")) suppressMessages(library("scales")) message("1. Libraries were loaded.") # 2. Folder configuration. dir.name = snakemake@params[["output_dir"]] input_data = snakemake@input[["seurat_obj"]] folders = c("1_preprocessing", "2_normalization", "3_clustering", "4_degs", "5_gs", "6_traj_in", "7_func_analysis") message("2. Folder paths were set.") # 3. Get variables from Snakemake. geneset_collection = snakemake@params[["geneset_collection"]] meta_columns = snakemake@params[["meta_columns"]] selected_res = snakemake@params[["selected_res"]] random_seed = snakemake@params[["random_seed"]] regress_out = snakemake@params[["regress_out"]] vars_to_regress = snakemake@params[["vars_to_regress"]] regress_cell_cycle = snakemake@params[["regress_cell_cycle"]] ram = snakemake@resources[["mem_mb"]] threads = snakemake@threads message("3. Parameters were loaded.") # 4. Analysis configuration. # RAM configuration. options(future.globals.maxSize = ram*1024^2) # Set seed. if (is.numeric(random_seed)) { set.seed(random_seed) } message(paste0("4. Seed was set at ", random_seed, ".")) message("Configuration finished.") message("\n") message("PROCESSING STEP") # 11. Vision functional analysis. # 11.1. Seurat object with clustering is loaded. seurat <- readRDS(input_data) assay_type <- seurat@active.assay cluster_res <- paste0(assay_type, "_snn_res.", selected_res) if (!(cluster_res %in% colnames(seurat@meta.data))){ stop("Specified resolution is not available.") } meta_columns <- c(meta_columns, cluster_res) message("1. Seurat object was loaded.") # 11.2. Vision object is created. # If seurat object is not a integration. if (seurat@active.assay != "integrated"){ # To avoid negative values with SCT normalization we use the standard, keeping the information from the previous analysis. if (seurat@active.assay == "SCT") { seurat@active.assay <- "RNA" seurat <- NormalizeData(seurat, normalization.method = "LogNormalize", scale.factor = 10000, seed = TRUE) all.genes <- rownames(seurat) if(regress_out == TRUE){ if (regress_cell_cycle) { seurat <- ScaleData(seurat, features = rownames(seurat), vars.to.regress = c(vars_to_regress, "S.Score", "G2M.Score")) } else { seurat <- ScaleData(seurat, features = rownames(seurat), vars.to.regress = vars_to_regress) } } else { seurat <- ScaleData(seurat, features = rownames(seurat)) } } suppressMessages(vis <- Vision(seurat, signatures = geneset_collection, meta = seurat@meta.data[,meta_columns], projection_methods = NULL)) } else { # Calculate vision score using the RNA slot when samples have been integrated. seurat@active.assay <- "RNA" seurat <- NormalizeData(seurat, normalization.method = "LogNormalize", scale.factor = 10000, seed = TRUE) all.genes <- rownames(seurat) if(regress_out == TRUE){ if (regress_cell_cycle) { seurat <- ScaleData(seurat, features = rownames(seurat), vars.to.regress = c(vars_to_regress, "S.Score", "G2M.Score")) } else { seurat <- ScaleData(seurat, features = rownames(seurat), vars.to.regress = vars_to_regress) } } else { seurat <- ScaleData(seurat, features = rownames(seurat)) } suppressMessages(vis <- Vision(seurat, signatures = geneset_collection, meta = seurat@meta.data[,c(meta_columns, "assay_name")], projection_methods = NULL)) seurat@active.assay <- "integrated" } message("2. Vision object was created.") # 11.3. Vision analysis step. options(mc.cores = threads) vis <- suppressMessages(analyze(vis)) message("3. Vision object was analyzed.") # 11.4. Outputs generation. # 11.4.1. Projection values stored. if (seurat@active.assay != "integrated"){ projections <- vis@Projections$Seurat_umap } else { vis@Projections[[1]] <- seurat@reductions$umap@cell.embeddings projections <- seurat@reductions$umap@cell.embeddings } # 11.4.2. Cluster plot in UMAP projection. getPalette <- colorRampPalette(brewer.pal(9,'Set1')) p1 <- ggplot() + aes(x=projections[, 1], y=projections[, 2], color=vis@metaData[,cluster_res]) + geom_point(alpha=0.5) + xlab("UMAP_1") + ylab("UMAP_2") + ggtitle("Clusters representation UMAP") + labs(color='clusters') + scale_color_manual(values = getPalette(nlevels(vis@metaData[,cluster_res]))) suppressMessages(ggsave(paste0(dir.name, "/", folders[7], "/CLUSTERS_REPRESENTATION_UMAP_set1.pdf"), plot = p1)) # 11.4.3. Integration plot in UMAP projection. if (seurat@active.assay == "integrated"){ p2 <- ggplot() + aes(x=projections[, 1], y=projections[, 2], color=vis@metaData[,"assay_name"]) + geom_point(alpha=0.5) + xlab("UMAP_1") + ylab("UMAP_2") + ggtitle("Integration representation UMAP") + labs(color='Assay') + scale_color_manual(values = getPalette(nlevels(vis@metaData[,"assay_name"]))) suppressMessages(ggsave(paste0(dir.name, "/", folders[7], "/INTEGRATION_REPRESENTATION_UMAP_set1.pdf"), plot = p2)) } # 11.4.4. Clusters vs Molecular Signatures statistics values. for (i in 1:nlevels(vis@metaData[,cluster_res])){ write.table(vis@ClusterComparisons$Signatures[[cluster_res]][[i]], file = paste0(dir.name, "/", folders[7], "/vision_table_res_", selected_res, "_cluster_", (i-1), ".tsv"), col.names = NA, quote = FALSE, sep = "\t") } message("4. Vision plots were produced.") # 11.4.5. Molecular Signatures statistics values. stats_DF <- data.frame("Consistency" = vis@LocalAutocorrelation$Signatures$C, "p-values" = vis@LocalAutocorrelation$Signatures$pValue, "FDR" = vis@LocalAutocorrelation$Signatures$FDR) colnames(stats_DF) <- c("Consistency", "p-values", "FDR") rownames(stats_DF) <- rownames(vis@LocalAutocorrelation$Signatures) write.table(stats_DF, file = paste0(dir.name, "/", folders[7], "/vision_param_values_per_geneset.txt"), col.names = NA, quote = FALSE, sep = "\t" ) message("5. Vision statistics were saved in a table.") # 11.4.6. Molecular signatures scores in UMAP projection. for (genesig in colnames(vis@SigScores)){ if (stats_DF[genesig,"FDR"] < 0.05) { p3 <- ggplot() + aes(x=projections[, 1], y=projections[, 2], color=vis@SigScores[, genesig]) + geom_point() + ggtitle(genesig) + scale_color_viridis(option = "D") + xlab("UMAP_1") + ylab("UMAP_2") + ggtitle(paste0(genesig, " - UMAP")) + labs(color='vision score') suppressMessages(ggsave(paste0(dir.name, "/", folders[7], "/", genesig, "_UMAP.pdf"), plot = p3)) } } message("6. Vision plots from signatures were produced.") # 11.5. Save Vision Object. saveRDS(vis, file = paste0(dir.name, "/", folders[7], "/vision_object.rds")) message("7. Vision object was saved.") |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 | log <- file(snakemake@log[[1]], open = "wt") sink(log) sink(log, type = "message") message("CONFIGURATION STEP") # A. Parameters: # 1. Load libraries. suppressMessages(library("Seurat")) suppressMessages(library("velocyto.R")) suppressMessages(library("SeuratWrappers")) suppressMessages(library("RColorBrewer")) suppressMessages(library("future")) message("1. Libraries were loaded.") # 2. Folder configuration. input_data = snakemake@input[["seurat_obj"]] velocyto_dir = snakemake@params[["velocyto_dir"]] dir.name = snakemake@params[["output_dir"]] folders = c("1_preprocessing", "2_normalization", "3_clustering", "4_degs", "5_gs", "6_traj_in", "7_func_analysis", "8_RNA_velocity") message("2. Folder paths were set.") # 3. Get variables from Snakemake. selected_res = snakemake@params[["selected_res"]] random_seed = snakemake@params[["random_seed"]] downsampling = snakemake@params[["downsampling"]] n_cells = snakemake@params[["n_cells"]] ram = snakemake@resources[["mem_mb"]] threads = snakemake@threads message("3. Parameters were loaded.") # 4. Analysis configuration. # RAM configuration. options(future.globals.maxSize = ram*1024^2) #Set parallelization. plan("multiprocess", workers = threads) message(paste0("4. Threads were set at ", threads, ".")) # Set seed. if (is.numeric(random_seed)) { set.seed(random_seed) } message(paste0("5. Seed was set at ", random_seed, ".")) message("Configuration finished.") message("\n") message("PROCESSING STEP") # Load seurat object seurat = readRDS(input_data) # 12. Compute velocity and velocity plots. # 12.1. Check cluster resoltution. assay_type <- seurat@active.assay cluster_res <- paste0(assay_type, "_snn_res.", selected_res) if (!(cluster_res %in% colnames(seurat@meta.data))){ stop("Specified resolution is not available.") } message("1. Seurat object was loaded.") # If Seurat objects are not integrated we need to add the velocyto matrices. if (!(seurat@active.assay == "integrated" || seurat@project.name == "merged")) { # 12.2. Get velocity matrices and place them in a list. velo_names = c("spliced", "unspliced", "ambiguous") vel_matrices = list() for (name in velo_names) { vel_matrices[[name]] <- Read10X(data.dir = paste0(velocyto_dir, name)) } # 12.3. Load Velocyto matrices as seurat assays. for (name in velo_names) { vel_matrices[[name]] <- vel_matrices[[name]][, which(x = colnames(vel_matrices[[name]]) %in% colnames(seurat))] seurat[[name]] <- CreateAssayObject(counts = vel_matrices[[name]]) } } message("2. Seurat object was prepared to perform RNA velocity.") # 12.4. Downsampling (optional). if (downsampling == TRUE && n_cells < length(rownames(seurat@meta.data))){ #n_total_cells = length(rownames(seurat@meta.data)) random_sample = sample(x = rownames(seurat@meta.data), size = n_cells, replace = FALSE) seurat <- subset(x = seurat, cells = random_sample) message("2.5. Downsampling was performed.") } # 12.5. Set specific cluster labels as idents. Idents(seurat) <- seurat@meta.data[[cluster_res]] # 12.6. Run velocyto from the wrapper. seurat <- RunVelocity(object = seurat, deltaT = 1, kCells = 25, fit.quantile = 0.02) message("3. RNA velocity was calculated.") # 12.7. Obtain palette. n_col <- length(levels(seurat)) getPalette <- colorRampPalette(brewer.pal(9,'Set1')) # 12.8. Set colours to clusters. ident.colors <- getPalette(n_col) names(ident.colors) <- levels(seurat) cell.colors <- ident.colors[Idents(seurat)] names(cell.colors) <- colnames(seurat) # 12.9. Create the RNA velocity plot. pdf(paste0(dir.name, "/",folders[8], "/RNA_velocity_plot.pdf")) par(mar=c(5.1, 4.1, 4.1, 8.1), xpd=TRUE) show.velocity.on.embedding.cor(emb = Embeddings(object = seurat, reduction = "umap"), vel = Tool(object = seurat, slot = "RunVelocity"), n = 100, scale = "sqrt", cell.colors = ac(x = cell.colors, alpha = 0.5), cex = 0.8, arrow.scale = 3, show.grid.flow = TRUE, min.grid.cell.mass = 0.5, grid.n = 40, arrow.lwd = 1, do.par = FALSE, cell.border.alpha = 0.1, xlab = "UMAP1", ylab = "UMAP2", main = paste0("RNA velocity plot - Cluster resolution: ", selected_res)) opar <- par(fig=c(0, 1, 0, 1), oma=c(0, 0, 0, 0), mar=c(0, 0, 0, 0), new=TRUE) on.exit(par(opar)) plot(0, 0, type='n', bty='n', xaxt='n', yaxt='n') legend("topright", inset = c(0.05, 0.115), legend=paste0("Cluster - ", levels(seurat)), pch=16, col=getPalette(n_col)) dev.off() message("4. Velocity plot was produced.") #12.10. This plot the sample coloring the cell by its origin assay (integrated objects). if (seurat@active.assay == "integrated") { Idents(seurat) <- seurat@meta.data[["assay_name"]] ident.colors <- getPalette(n_col) names(ident.colors) <- levels(seurat) cell.colors <- ident.colors[Idents(seurat)] names(cell.colors) <- colnames(seurat) # 12.11. Create the RNA velocity plot. pdf(paste0(dir.name, "/",folders[8], "/RNA_velocity_plot_by_assay.pdf")) par(mar=c(5.1, 4.1, 4.1, 8.1), xpd=TRUE) show.velocity.on.embedding.cor(emb = Embeddings(object = seurat, reduction = "umap"), vel = Tool(object = seurat, slot = "RunVelocity"), n = 100, scale = "sqrt", cell.colors = ac(x = cell.colors, alpha = 0.5), cex = 0.8, arrow.scale = 3, show.grid.flow = TRUE, min.grid.cell.mass = 0.5, grid.n = 40, arrow.lwd = 1, do.par = FALSE, cell.border.alpha = 0.1, xlab = "UMAP1", ylab = "UMAP2", main = paste0("RNA velocity plot - Origin assay")) opar <- par(fig=c(0, 1, 0, 1), oma=c(0, 0, 0, 0), mar=c(0, 0, 0, 0), new=TRUE) on.exit(par(opar)) plot(0, 0, type='n', bty='n', xaxt='n', yaxt='n') legend("topright", inset = c(0.05, 0.115), legend=paste0("Assay - ", levels(seurat)), pch=16, col=getPalette(n_col)) dev.off() message("4.5. Velocity plot from integrated object was produced.") } #12.12. This plot the sample coloring the cell by its origin assay (mergedd objects). if (seurat@project.name == "merged") { Idents(seurat) <- "assay_name" ident.colors <- getPalette(n_col) names(ident.colors) <- levels(seurat) cell.colors <- ident.colors[Idents(seurat)] names(cell.colors) <- colnames(seurat) # 12.13. Create the RNA velocity plot. pdf(paste0(dir.name, "/",folders[8], "/RNA_velocity_plot_by_assay.pdf")) par(mar=c(5.1, 4.1, 4.1, 8.1), xpd=TRUE) show.velocity.on.embedding.cor(emb = Embeddings(object = seurat, reduction = "umap"), vel = Tool(object = seurat, slot = "RunVelocity"), n = 100, scale = "sqrt", cell.colors = ac(x = cell.colors, alpha = 0.5), cex = 0.8, arrow.scale = 3, show.grid.flow = TRUE, min.grid.cell.mass = 0.5, grid.n = 40, arrow.lwd = 1, do.par = FALSE, cell.border.alpha = 0.1, xlab = "UMAP1", ylab = "UMAP2", main = paste0("RNA velocity plot - Origin assay")) opar <- par(fig=c(0, 1, 0, 1), oma=c(0, 0, 0, 0), mar=c(0, 0, 0, 0), new=TRUE) on.exit(par(opar)) plot(0, 0, type='n', bty='n', xaxt='n', yaxt='n') legend("topright", inset = c(0.05, 0.115), legend=paste0("Assay - ", levels(seurat)), pch=16, col=getPalette(n_col)) dev.off() message("4.5. Velocity plot from integrated object was produced.") } # 12.14. Save seurat object with RNA velocity slots. saveRDS(seurat, file = paste0(dir.name, "/",folders[8], "/seurat_velocity.rds")) message("5. Seurat object was saved.") |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 | __author__ = "Julian de Ruiter" __copyright__ = "Copyright 2017, Julian de Ruiter" __email__ = "julianderuiter@gmail.com" __license__ = "MIT" from os import path from tempfile import TemporaryDirectory from snakemake.shell import shell log = snakemake.log_fmt_shell(stdout=False, stderr=True) def basename_without_ext(file_path): """Returns basename of file path, without the file extension.""" base = path.basename(file_path) split_ind = 2 if base.endswith(".gz") else 1 base = ".".join(base.split(".")[:-split_ind]) return base # Run fastqc, since there can be race conditions if multiple jobs # use the same fastqc dir, we create a temp dir. with TemporaryDirectory() as tempdir: shell("fastqc {snakemake.params} --quiet " "--outdir {tempdir} {snakemake.input[0]}" " {log}") # Move outputs into proper position. output_base = basename_without_ext(snakemake.input[0]) html_path = path.join(tempdir, output_base + "_fastqc.html") zip_path = path.join(tempdir, output_base + "_fastqc.zip") if snakemake.output.html != html_path: shell("mv {html_path} {snakemake.output.html}") if snakemake.output.zip != zip_path: shell("mv {zip_path} {snakemake.output.zip}") |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 | import os from snakemake.shell import shell import tempfile __author__ = "Ryan Dale" __copyright__ = "Copyright 2016, Ryan Dale" __email__ = "dalerr@niddk.nih.gov" __license__ = "MIT" _config = snakemake.params['fastq_screen_config'] subset = snakemake.params.get('subset', 100000) aligner = snakemake.params.get('aligner', 'bowtie2') extra = snakemake.params.get('extra', '') log = snakemake.log_fmt_shell() # snakemake.params.fastq_screen_config can be either a dict or a string. If # string, interpret as a filename pointing to the fastq_screen config file. # Otherwise, create a new tempfile out of the contents of the dict: if isinstance(_config, dict): tmp = tempfile.NamedTemporaryFile(delete=False).name with open(tmp, 'w') as fout: for label, indexes in _config['database'].items(): for aligner, index in indexes.items(): fout.write('\t'.join([ 'DATABASE', label, index, aligner.upper()]) + '\n') for aligner, path in _config['aligner_paths'].items(): fout.write('\t'.join([aligner.upper(), path]) + '\n') config_file = tmp else: config_file = _config # fastq_screen hard-codes filenames according to this prefix. We will send # hard-coded output to a temp dir, and then move them later. prefix = os.path.basename(snakemake.input[0].split('.fastq')[0]) tempdir = tempfile.mkdtemp() shell( "fastq_screen --outdir {tempdir} " "--force " "--aligner {aligner} " "--conf {config_file} " "--subset {subset} " "--threads {snakemake.threads} " "{extra} " "{snakemake.input[0]} " "{log}" ) # Move output to the filenames specified by the rule shell("mv {tempdir}/{prefix}_screen.txt {snakemake.output.txt}") shell("mv {tempdir}/{prefix}_screen.png {snakemake.output.png}") # Clean up temp shell("rm -r {tempdir}") if isinstance(_config, dict): shell("rm {tmp}") |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 | __author__ = "Johannes Köster" __copyright__ = "Copyright 2016, Johannes Köster" __email__ = "koester@jimmy.harvard.edu" __license__ = "MIT" import os from snakemake.shell import shell extra = snakemake.params.get("extra", "") log = snakemake.log_fmt_shell(stdout=True, stderr=True) fq1 = snakemake.input.get("fq1") assert fq1 is not None, "input-> fq1 is a required input parameter" fq1 = ( [snakemake.input.fq1] if isinstance(snakemake.input.fq1, str) else snakemake.input.fq1 ) fq2 = snakemake.input.get("fq2") if fq2: fq2 = ( [snakemake.input.fq2] if isinstance(snakemake.input.fq2, str) else snakemake.input.fq2 ) assert len(fq1) == len( fq2 ), "input-> equal number of files required for fq1 and fq2" input_str_fq1 = ",".join(fq1) input_str_fq2 = ",".join(fq2) if fq2 is not None else "" input_str = " ".join([input_str_fq1, input_str_fq2]) if fq1[0].endswith(".gz"): readcmd = "--readFilesCommand zcat" else: readcmd = "" outprefix = os.path.dirname(snakemake.output[0]) + "/" shell( "STAR " "{extra} " "--runThreadN {snakemake.threads} " "--genomeDir {snakemake.params.index} " "--readFilesIn {input_str} " "{readcmd} " "--outFileNamePrefix {outprefix} " "--outStd Log " "{log}" ) |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 | __author__ = "Thibault Dayris" __copyright__ = "Copyright 2019, Dayris Thibault" __email__ = "thibault.dayris@gustaveroussy.fr" __license__ = "MIT" from snakemake.shell import shell from snakemake.utils import makedirs log = snakemake.log_fmt_shell(stdout=True, stderr=True) extra = snakemake.params.get("extra", "") sjdb_overhang = snakemake.params.get("sjdbOverhang", "100") gtf = snakemake.input.get("gtf") if gtf is not None: gtf = "--sjdbGTFfile " + gtf sjdb_overhang = "--sjdbOverhang " + sjdb_overhang else: gtf = sjdb_overhang = "" makedirs(snakemake.output) shell( "STAR " # Tool "--runMode genomeGenerate " # Indexation mode "{extra} " # Optional parameters "--runThreadN {snakemake.threads} " # Number of threads "--genomeDir {snakemake.output} " # Path to output "--genomeFastaFiles {snakemake.input.fasta} " # Path to fasta files "{sjdb_overhang} " # Read-len - 1 "{gtf} " # Highly recommended GTF "{log}" # Logging ) |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 | __author__ = "Julian de Ruiter" __copyright__ = "Copyright 2017, Julian de Ruiter" __email__ = "julianderuiter@gmail.com" __license__ = "MIT" from os import path from snakemake.shell import shell input_dirs = set(path.dirname(fp) for fp in snakemake.input) output_dir = path.dirname(snakemake.output[0]) output_name = path.basename(snakemake.output[0]) log = snakemake.log_fmt_shell(stdout=True, stderr=True) shell( "multiqc" " {snakemake.params}" " --force" " -o {output_dir}" " -n {output_name}" " {input_dirs}" " {log}" ) |
Support
- Future updates
Related Workflows





