Proyecto: Evaluación de la regulación trans en cáncer: un enfoque de biología de sistemas
Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation, topic
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
This pipelie serves to get differential expression and co-expression matrices for normal and cancer tissues from the TCGA dataset, downloaded from GDC, and the Toil Xena dataset that integrates both TCGA and GTEx data.
Folders in this repository
-
bin: It contains the GDC Data Transfer Tool command line for Linux
-
config: It contains the config.yaml file used for the Snakemake pipeline
-
input: With the GENECODE annotation files originally used for TCGA data (v22, according to this link ). It also contains GENECODE v37 (April, 2021) to filter genes and BioMart Ensembl Genes 80 to add GC content for QC purposes.
-
workflow: It has all the rules and scripts for the Snakemake pipeline
The pipeline
These are the main steps in the pipeline:
-
Get data:
-
Get data from Xena: It downloads counts, sample information and annotations directly from the Xena-Toil S3 bucket.
- Rules file: xena.smk
-
Get data from GDC: It queries GDC to creat a manifest and uses the gdc-client tool to download files.
-
Rules file: gdc.smk
-
Script: queryGDC.py
-
-
-
Get raw matrix: This step integrates, for each tissue, raw counts with their respective annotation to obtain a raw matrix.
-
Rules: raw.smk
-
Scripts: addAnnotations.R and getRawMatrix.R
-
-
QC: It gets NOISeq plots, filters genes with low expression (mean raw counts < 10), gets PCA and density plots (expression per sample) and removes samples with mean expression greater and lower than 2sd from the total mean.
-
Rules: qc.smk
-
Scripts: NOISeqPlots.R, filterLowExpression.R, PCA.R, densityPlot.R and filterOutliers.R
-
-
Normalization: In the first run, it performs a test with different normalization methods and integrates the results. When normalization steps are defined for each tissue, it gets normalized data and runs arsyn for batch correction.
-
Rules: normalization.smk
-
Scripts: normalizationPlots.R, normalizationTest.R, usrNormalization.R, and runArsyn.R After this step, we get a final expression matrix for normal and cancer tissues, having the same gene set.
-
-
Get final output:
-
Co-expression computation: It gets PCA and density plots after normalization and runs the ARACNE algorithm using a singularity image to get co-expression matrix for each tissue and condition. It can also obtain a pearson correlation matrix, if output files are required.
-
Rules: correlation.smk
-
Scripts: aracne_matrix.py, getPearsonMatrix.R
-
-
Differential expression: This step, like the co-expression computation, can be run after normalization and it gets gene differential expression and its associated plots.
-
Rules: deg.smk
-
Scripts: deg.R
-
-
How to run it
Unfortunately, the HTSeq raw counts used in this pipeline were removed by GDC on March, 2022 ( Data Release 32 ). This means that the GDC query used here (in the Get data from GDC step) will not return any result. However, all the data generated by this pipeline, including raw counts, can be found here:
Code Snippets
16 17 | script: "../scripts/aracne_matrix.py" |
28 29 | script: "../scripts/getPearsonMatrix.R" |
12 13 | script: "../scripts/deg.R" |
7 8 9 10 11 12 | shell: """ mkdir -p {output[0]}; ./bin/gdc-client download -d {output[0]} -m {input[0]} --retry-amount 3; touch {output[1]} """ |
25 26 | script: "../scripts/queryGDC.py" |
10 11 | script: "../scripts/normalizationPlots.R" |
25 26 | script: "../scripts/normalizationTest.R" |
44 45 | script: "../scripts/userNormalization.R" |
63 64 | script: "../scripts/runArsyn.R" |
11 12 | script: "../scripts/NOISeqPlots.R" |
22 23 | script: "../scripts/filterLowExpression.R" |
37 38 | script: "../scripts/PCA.R" |
51 52 | script: "../scripts/densityPlot.R" |
62 63 | script: "../scripts/filterOutliers.R" |
12 13 | script: "../scripts/addAnnotations.R" |
30 31 | script: "../scripts/getRawMatrix.R" |
6 7 8 | shell: "mkdir -p {params.xenadir};" "wget https://toil-xena-hub.s3.us-east-1.amazonaws.com/download/TcgaTargetGtex_gene_expected_count.gz -O {output}" |
15 16 17 | shell: "mkdir -p {params.xenadir};" "wget https://toil-xena-hub.s3.us-east-1.amazonaws.com/download/TcgaTargetGTEX_phenotype.txt.gz -O {output}" |
24 25 26 | shell: "mkdir -p {params.xenadir};" "wget https://toil-xena-hub.s3.us-east-1.amazonaws.com/download/probeMap%2Fgencode.v23.annotation.gene.probemap -O {output}" |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") ## Get the Work and Data dir library(readr) library(dplyr) library(rtracklayer) RDATADIR <- paste(snakemake@params[["tissue_dir"]], "rdata", sep="/") dir.create(RDATADIR) IS_XENA <- snakemake@params[["is_xena"]] ############################################################################## ## Get annotation data ## 1. Read the original annotation data ## 2. Read the new annotation data ## 3. Keep only genes/protein-coding present in original and new data ## 4. Query bioMart ensembl 80 for GC content ## 5. Save data ############################################################################### if(!IS_XENA) { cat("Getting annotation file \n") cat("Reading original file \n") ## Annotation file used for RNA-seq pipeline according to: ## https://gdc.cancer.gov/about-data/gdc-data-processing/gdc-reference-files annot <- rtracklayer::import('input/gencode.v22.annotation.gtf.gz') annot <- as.data.frame(annot) ## Only protein coding genes annot <- annot %>% dplyr::select(gene_id, seqnames, start, end, width, type, gene_type, gene_name) %>% filter(type == "gene" & gene_type == "protein_coding") annot <- annot %>% mutate(ensembl_id = unlist(lapply(strsplit(gene_id, "\\."), "[[", 1))) %>% select(-gene_type) } else { annot <- read_tsv(snakemake@input[["xena_annot"]], col_names = c("gene_id", "gene_name", "seqnames", "start", "end", "strand"), skip = 1) annot <- annot %>% dplyr::mutate(ensembl_id = unlist(lapply(strsplit(gene_id, "\\."), "[[", 1)), width = end - start) join_by <- "ensembl_id" } cat("Reading new file \n") ## Newest gencode file. April, 2021. annot_new <- rtracklayer::import('input/gencode.v37.annotation.gtf.gz') annot_new <- as.data.frame(annot_new) annot_new <- annot_new %>% dplyr::select(gene_id, gene_name, type, gene_type) %>% dplyr::filter(type == "gene" & gene_type == "protein_coding") annot_new <- annot_new %>% dplyr::mutate(ensembl_id = unlist(lapply(strsplit(gene_id, "\\."), "[[", 1))) ## Keep genes that remain in the newest annotation file ## but get the newest names and keep only conventional chromosomes ## remove duplicates annot <- annot %>% dplyr::select(-gene_name) %>% inner_join(annot_new %>% dplyr::select(ensembl_id, gene_name, gene_type), by = "ensembl_id") %>% filter(seqnames %in% paste0("chr", c(as.character(1:22), "X", "Y"))) %>% distinct() cat('Annotation file new/old merge: ', paste(dim(annot), collapse=", "), '\n') ## We need GC content per gene for normalization. ## Query Biomart 80 (accoring to gencode.v22.annotation.gtf, version 79 was used ## but it is no longer accessible in the website) ## http://may2015.archive.ensembl.org/biomart biomart <- read_tsv("input/Biomart_Ensembl80_GRCh38_p2.txt", col_names = c("ensembl_id", "version", "gc"), skip = 1) ## Get only genes matching ensemblID biomart <- biomart %>% mutate(gene_id = paste(ensembl_id, version, sep=".")) if(!IS_XENA) { annot <- annot %>% inner_join(biomart %>% dplyr::select(gene_id, gc), by = "gene_id") } else { annot <- annot %>% inner_join(biomart %>% dplyr::select(ensembl_id, gc), by = "ensembl_id") } annot <- annot %>% mutate(chr = gsub("chr", "", seqnames)) %>% dplyr::rename(length = width) %>% dplyr::select(-seqnames) %>% dplyr::select(gene_id, chr, everything()) cat('Annotation file. Final dimension: ', paste(dim(annot), collapse=", "), '\n') save(annot, file=snakemake@output[["annot_rdata"]], compress="xz") write_tsv(annot, snakemake@output[["annot_tsv"]]) cat('annot.RData saved \n') ############################################################################## ## Merging count and annotation ## 1. Build counts matrix. M=normal|tumour ## 2. Build targets matrix. targets=normal+tumor ## 3. Check M y targets integrity ## 4. Filter by annotation file ## 5. Save the clean data ############################################################################## { cat('Merging counts and annotations \n') normal_samples <- list(matrix=read_tsv(snakemake@input[["normal_matrix"]]), targets=read_tsv(snakemake@input[["normal_targets"]])) cancer_samples <- list(matrix=read_tsv(snakemake@input[["cancer_matrix"]]), targets=read_tsv(snakemake@input[["cancer_targets"]])) ## Raw counts M <- normal_samples$matrix %>% inner_join(cancer_samples$matrix, by = "gene_id") cat('Total number of features and samples: ', paste(dim(M), collapse=" ,"), '\n') ## Samples targets <- bind_rows(normal_samples$targets, cancer_samples$targets) ## Check M y targets integrity. Remove gene_ids col stopifnot(nrow(targets) == (ncol(M)-1)) cat('Number of counts columns match sample number\n') ## Filter counts by annotation data cat('Adding biomart data\n') M <- M %>% semi_join(annot, by = "gene_id") annot <- annot %>% semi_join(M, by = "gene_id") cat('Total number of annotated (genes/protein-coding) features:', nrow(M), '\n') cat('Total number of samples:', ncol(M)-1, '\n') no_gc <- sum(is.na(annot$gc)) cat("There are",no_gc, "entries with no GC info \n") ids <- annot %>% filter(!is.na(gc) & !is.na(length)) %>% select(gene_id) %>% unlist(use.names = F) M <- M %>% filter(gene_id %in% ids) %>% arrange(gene_id) annot <- annot %>% filter(gene_id %in% ids) %>% arrange(gene_id) cat("Non GC and lenght annotated genes removed.\n") ## Save it as a matrix ids <- M$gene_id MM <- M %>% select(-gene_id) %>% as.matrix() rownames(MM) <- ids MM <- MM[,targets$id] ## Make sure they are factors targets$group <- factor(targets$group, levels=c("cancer", "normal")) ##Save clean data cat('Saving raw full data \n') full <- list(M = MM, annot = annot, targets = targets) save(full, file=snakemake@output[["raw_rdata"]], compress="xz") cat("raw_full.RData saved \n") } |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 | from MultiAracne import Aracne import sys import pathlib sys.stdout = open(snakemake.log[0], "w") exp_matrix = snakemake.input[0] outmatrix = snakemake.output[0] outdir = snakemake.params[0] + "/correlation/output_" + snakemake.params[1] + "_" + snakemake.params[2] procs = int(snakemake.threads) print(f"Saving files in {outdir}") pathlib.Path(outdir).mkdir(parents=True, exist_ok=True) ma = Aracne(exp_matrix) ma.run(processes=procs, outdir=outdir, pval=1) ma.join_adj(outdir, outdir+"/matrix.adj") Aracne.adj_to_matrix(outdir+"/matrix.adj", outmatrix) |
12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library(limma) library(edgeR) library(Glimma) library(readr) library(dplyr) library(janitor) library(ggplot2) library(ggthemes) DEGDIR <- paste(snakemake@params[["deg_dir"]]) dir.create(DEGDIR) TISSUE <- snakemake@params[["tissue"]] cat("Loading files\n") load(snakemake@input[["full"]]) load(snakemake@input[["annot"]]) full$targets <- full$targets %>% dplyr::mutate(group = factor(group, levels=c("normal", "cancer"))) cat("Generating model matrix\n") mm <- model.matrix(~1+group, data = full$targets) rownames(mm) <- full$targets %>% pull(id) full$M <- full$M[, rownames(mm)] cat("Getting voom transformation\n") y <- voom(full$M, mm, plot = F, save.plot = T) p <- ggplot() + geom_point(aes(x = y$voom.xy$x, y = y$voom.xy$y), size = 0.5) + geom_line(aes(x = y$voom.line$x, y = y$voom.line$y), size = 1, color = "red") + xlab(y$voom.xy$xlab) + ylab(y$voom.xy$ylab) + theme_bw() + ggtitle(TISSUE) cat("Saving voom plot\n") png(file = paste0(DEGDIR, "/voom.png"), width = 800, height = 600) print(p) dev.off() cat("Getting linear adjustment\n") fit <- lmFit(y, mm) lfc_fit <- eBayes(fit) cat("Getting Glimma interactive plot plot\n") dt <- decideTests(lfc_fit, lfc = 2) annot <- annot %>% filter(gene_id %in% rownames(full$M)) %>% select(gene_id, ensembl_id, chr, start, end, gene_name) %>% arrange(gene_id) rownames(annot) <- annot$gene_id glMDPlot(path = DEGDIR, lfc_fit, counts = full$M, groups = full$targets %>% pull(group), status = dt, anno = annot) p <- ggplot() + geom_density(aes(x = lfc_fit$coefficients[,2]), size = 0.5) + theme_bw() + xlab("lfc") ggtitle(TISSUE) png(file = paste0(DEGDIR, "/density.png"), width = 800, height = 600) print(p) dev.off() cat("Saving deg results\n") topTable(lfc_fit, sort.by = "P", n = Inf, adjust.method="BH") %>% janitor::clean_names() %>% mutate(gene_id = rownames(.)) %>% inner_join(annot %>% select(gene_id, ensembl_id, chr, gene_name), by ="gene_id") %>% select(-gene_id) %>% select(ensembl_id, everything()) %>% mutate(adj_p_val = ifelse(adj_p_val > 0.05, 1, adj_p_val), log_fc = ifelse(adj_p_val > 0.05, 0, log_fc)) %>% write_tsv(snakemake@output[["deg_results"]]) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library(readr) library(dplyr) library(ggplot2) library(reshape2) library(ggthemes) TISSUE <- snakemake@params[["tissue"]] w <- 1024 h <- 1024 p <- 24 cat("Loading data \n") load(snakemake@input[[1]]) cat("Calculating mean \n") melted_data <- melt(log2(full$M+1)) means <- colMeans(full$M) mm <- mean(means) sd <- sd(means) down_outliers <- which(means < mm-2*sd) up_outliers <- which(means > mm+2*sd) if(length(down_outliers) > 0 | length(up_outliers) > 0 ) { cat("Outliers found, saving samples names \n") } else { cat("Outliers not found\n") } outliers <- data.frame(sample = c(names(down_outliers), names(up_outliers)), mean = c(means[down_outliers], means[up_outliers])) outliers <- outliers %>% mutate(sd_from_mean = (mean - mm)/sd ) write_tsv(outliers, file=snakemake@output[["outliers"]]) pl <- ggplot(data = melted_data, aes(x=value, group=Var2, colour=Var2)) + geom_density(show.legend = F) + theme_hc(base_size = 20) + xlab("log2(expr+1)") + ggtitle(TISSUE) png(file = snakemake@output[["density"]], width = w, height = h/2, pointsize = p) print(pl) dev.off() cat("Density plot for all samples generated\n") |
14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library(dplyr) ########################################################################## load(snakemake@input[[1]]) { ### We keep only genes with mean expression count > 10 exp_genes <- apply(full$M, 1, function(x) mean(x)>10) egtable <- table(exp_genes) cat("There are", egtable[[1]], "genes with mean expression count < 10", egtable[[2]], "with mean count > 10 \n") exp_genes <- names(exp_genes[exp_genes == TRUE]) non_zero_genes <- apply(full$M, 1, function(x) sum(x == 0) < ncol(full$M)/2) egtable <- table(non_zero_genes) cat("There are", egtable[[1]], "genes with 0 values in more than 50% samples", egtable[[2]], " genes with 0 values in less than 50% samples \n") non_zero_genes <- names(non_zero_genes[non_zero_genes == TRUE]) genes_pass <- intersect(exp_genes, non_zero_genes) ##Filtering low expression genes full <- list(M = round(full$M[genes_pass, full$targets$id]), annot = full$annot %>% filter(gene_id %in% genes_pass), targets = full$targets) full$annot <- full$annot %>% arrange(gene_id) full$M <- full$M[order(row.names(full$M)), full$targets$id] stopifnot(all(full$annot$gene_id == rownames(full$M))) cat(paste0("Genes in matrix and annotation match positions \n")) stopifnot(all(full$targets$id == colnames(full$M))) cat(paste0("Samples in matrix and annotation match positions \n")) row.names(full$annot) <- full$annot$gene_id cat("Saving full_filtered.RData \n") save(full, file=snakemake@output[[1]], compress="xz") } |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library(readr) library(dplyr) cat("Loading files \n") load(snakemake@input[[1]]) outliers <- read_tsv(snakemake@input[["outliers"]]) cat(nrow(outliers), " outliers will be removed.\n") targets <- full$targets %>% filter(!id %in% (outliers %>% pull(sample))) M <- full$M[, targets %>% pull(id)] full <- list(annot = full$annot, M = M, targets = targets) dimcancer <- nrow(full$targets %>% filter(group == "cancer")) dimnormal <- nrow(full$targets %>% filter(group == "normal")) cat("Final dimensions: ", dimcancer, " cancer and ", dimnormal, " normal samples \n") cat("Saving RData \n") save(full, file=snakemake@output[[1]]) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library(data.table) library(dplyr) library(magick) library(ComplexHeatmap) cat("Reading matrix \n") expr_file <- snakemake@input[["expr_matrix"]] annot_file <- snakemake@input[["annot"]] cat("Getting annotations \n") chr_pal <- c("#D909D1", "#0492EE", "#D2656C", "#106F35", "#5BD2AE", "#199F41", "#FE0F43", "#00FFCC", "#F495C5", "#E1BF5D", "#5F166F", "#088ACA", "#41CFE0", "#0F0A71", "#FFFF99", "#B06645", "#651532", "#B925AE", "#86C91F", "#CB97E8", "#130B9E", "#EF782B", "#79A5B9", "#F7AA7C") names(chr_pal) <- c("22","11","12","13","14","15","16","17","18","19","1" ,"2" ,"3" ,"4" ,"5" , "6" ,"7" ,"X" ,"8" ,"9" ,"20","10","21", "Y") load(annot_file) expr_matrix <- fread(expr_file) annot <- annot %>% mutate(chr = factor(chr, levels=c(as.character(1:22), "X", "Y"))) %>% arrange(chr, start) genes <- expr_matrix %>% pull(gene) expr_matrix <- expr_matrix %>% select(-gene) %>% as.matrix() rownames(expr_matrix) <- genes annot <- annot %>% filter(gene_id %in% genes) expr_matrix <- expr_matrix[annot$gene_id, ] expr_matrix <- t(expr_matrix) cat("Calculating correlation \n") corr <- cor(expr_matrix, expr_matrix, method="pearson") chrs <- annot %>% dplyr::select(chr) %>% dplyr::rename(Chr = chr) cat("Building heatmap \n") ha <- rowAnnotation(df = chrs, name = "Chr", show_annotation_name = F, col = list(Chr = chr_pal[as.character(chrs$Chr)]), width = unit(0.5, "cm"), annotation_legend_param = list( title = "Chr", title_gp = gpar(fontsize = 14), grid_height = unit(0.5, "cm"))) ht <- Heatmap(corr, cluster_rows = F, cluster_columns = F, show_row_names = F, show_column_names = F, heatmap_legend_param = list( title_gp = gpar(fontsize = 14), title = "Pearson \nCorrelation", legend_height = unit(4, "cm")), right_annotation = ha) cat("Saving image \n") png(snakemake@output[["plot"]], width = 1200, height = 1200) draw(ht, heatmap_legend_side = "right", annotation_legend_side = "right") dev.off() colnames(corr) <- genes corr[lower.tri(corr)] <- NA fwrite(corr, file=snakemake@output[["csv"]]) |
16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") ## Get the Work and Data dir library(data.table) library(readr) library(dplyr) library(janitor) library(rtracklayer) TISSUE <- snakemake@params[["tissue"]] TYPE <- snakemake@params[["type"]] IS_XENA <- snakemake@params[["is_xena"]] MCCORES <- as.numeric(snakemake@threads[[1]]) # ############################################################################### # ## Get and check count matrices # ## 1. Find files # ## 2. Read data # ## 4. Check sample sizes # ## 5. Check genes order # ## 6. Build target matrix # ############################################################################### RDATADIR <- paste(snakemake@params[["tissue_dir"]], "results", sep="/") dir.create(RDATADIR) if(!IS_XENA) { RAWDIR <- paste(snakemake@params[["tissue_dir"]], "raw", sep="/") cat(paste0("Checking ", TYPE, " samples \n")) ## Find the files files_to_read <- list.files(path = paste0(RAWDIR,"/", TYPE), pattern = "\\.htseq.counts.gz$", full.names = T, recursive = T) ## Read the data all_files <- lapply(files_to_read, function(file) { data <- read_tsv(file, col_names = c("gene_id", "raw_counts")) data <- data %>% filter(!startsWith(gene_id, "_" )) return(data) }) ## Check samples sizes size <- unique(do.call(rbind,lapply(all_files, dim))) stopifnot(nrow(size)==1) cat(paste0(TYPE, " samples have the same size \n")) ## Check genes order in samples genes <- do.call(cbind, lapply(all_files, function(x) dplyr::select(x, "gene_id"))) genes <- t(unique(t(genes))) stopifnot(dim(genes)==c(size[1,1], 1)) cat(paste0("Genes in ", TYPE, " samples match positions \n")) ## Build targets matrix targets <- data.frame(id = paste(TISSUE, TYPE, 1:length(files_to_read), sep = "_"), file = unlist(lapply(strsplit(files_to_read, "/"), "[[", 10)), file_id = unlist(lapply(strsplit(files_to_read, "/"), "[[", 9)), group = TYPE, stringsAsFactors = FALSE) ## Rename columns in counts matrix matrix <- bind_cols(lapply(all_files, function(x) dplyr::select(x, "raw_counts"))) colnames(matrix) <- targets$id matrix <- matrix %>% mutate(gene_id = genes[,1]) %>% dplyr::select(gene_id, everything()) } else { XENA_COUNTS <- snakemake@input[[1]] XENA_SAMPLES <- snakemake@input[[2]] PRIMARY <- snakemake@params[["primary"]] extended_type <- snakemake@params[["extended_type"]] matrix_samples <- read_tsv(XENA_SAMPLES) %>% clean_names() matrix_counts <- fread(XENA_COUNTS, nThread = MCCORES) # When the normal samples should come from TCGA in UCSC Xena, the extended type # for normal should be: "Solid Tissue Normal". This is only for testing # When the normal samples should come from GTEX in UCSC Xena, the extended type # for normal should be: "Normal Tissue". This is the usual pipeline if(is.null(extended_type)) { extended_type <- ifelse(TYPE == "normal", "Normal Tissue", "Primary Tumor") } cat("Extented type: ", extended_type, "\n") cat("Primary disease or tissue: ", PRIMARY, "\n") camelTissue <- paste0(toupper(substring(TISSUE, 1, 1)), substring(TISSUE, 2)) cat("Tissue name: ", camelTissue, "\n") targets <- matrix_samples %>% dplyr::filter(primary_site == camelTissue & primary_disease_or_tissue == PRIMARY & sample_type == extended_type) cat("Got", nrow(targets), " samples\n") cat("Getting count matrix\n") ### Expected counts from XENA are in log2(expected_count+1) ### get them back to expected_count for downstream pipeline matrix <- matrix_counts %>% dplyr::select_if(names(.) %in% c("sample", targets$sample)) %>% dplyr::select(sample, everything())%>% dplyr::mutate(across(-sample, ~ .x^2-1)) matrix[matrix < 0] <- 0 targets <- targets %>% dplyr::filter(sample %in% names(matrix)) %>% dplyr::mutate(id = paste(TISSUE, TYPE, 1:nrow(.), sep = "_"), group = TYPE) %>% dplyr::rename(file = sample) %>% select(id, file, group) matrix <- matrix %>% dplyr::select(sample, targets$file) colnames(matrix) <- c("gene_id", targets$id) } cat(paste0("Matrices for ", TYPE, " ready.\n")) cat("Saving matrix\n") write_tsv(matrix, snakemake@output[["matrix"]]) cat("Saving samples\n") write_tsv(targets, snakemake@output[["samples"]]) |
25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library(dplyr) library(NOISeq) library(ggplot2) PLOTSDIR <-paste(snakemake@params[["tissue_dir"]], "plots", snakemake@params[["plots_type"]], sep="/") dir.create(PLOTSDIR, recursive = TRUE) w <- 1024 h <- 1024 p <- 24 load(snakemake@input[[1]]) ########################################## ## EXPLORATORY ANALYSIS (NOISeq package) ########################################## { ## Reading data into NOISeq package -> mydata mydata <- NOISeq::readData( data = full$M, length = full$annot %>% select(gene_id, length) %>% as.data.frame(), biotype = full$annot %>% select(gene_id, gene_type) %>% as.data.frame(), chromosome = full$annot %>% select(chr, start, end) %>% as.data.frame(), factors = full$targets %>% select(group) %>% as.data.frame(), gc = full$annot %>% select(gene_id, gc) %>% as.data.frame()) } ########################################## ## Plots ########################################## { # Biodetection plot. Per group. mybiodetection <- dat(mydata, type="biodetection", factor="group", k=0) png(filename=paste(PLOTSDIR, "biodetection.Rd_%03d.png", sep="/"), width=w, height=h, pointsize=p) explo.plot(mybiodetection) dev.off() cat("Biodetection plots generated\n") ## Count distribution per biotype. Using count per million, only for one sample mycountsbio <- dat(mydata, factor = NULL, type = "countsbio") png(filename=paste(PLOTSDIR, "countsbio.png", sep="/"), width=w, height=h, pointsize=p) explo.plot(mycountsbio, toplot = 1, samples = 1, plottype = "boxplot") dev.off() cat("Counts distribution plot per biotype and one sample generated\n") #What about expression level? ## Count distribution per sample mycountsbio <- dat(mydata, factor = NULL, type = "countsbio") png(paste(PLOTSDIR, "protein_coding_boxplot.png", sep="/"), width=w*2, height=h, pointsize=p) explo.plot(mycountsbio, toplot = "protein_coding", samples = NULL, plottype = "boxplot") dev.off() cat("Counts distribution plot for protein coding and all samples generated\n") png(paste(PLOTSDIR, "protein_coding_barplot.png", sep="/"), width=w*2, height=h, pointsize=p) explo.plot(mycountsbio, toplot = "protein_coding", samples = NULL, plottype = "barplot") dev.off() cat("Counts distribution barplot for protein coding biotype and all samples generated\n") mycountsbio <- dat(mydata, factor = "group", type = "countsbio") ## Count distribution per Experimental factors png(paste(PLOTSDIR, "protein_coding_boxplot_group.png", sep="/"), width=w, height=h, pointsize=p) explo.plot(mycountsbio, toplot = "protein_coding", samples = NULL, plottype = "boxplot") dev.off() cat("Counts distribution boxplot for protein coding biotype and group generated\n") png(paste(PLOTSDIR, "protein_coding_barplot_group.png", sep="/"), width=w, height=h, pointsize=p) explo.plot(mycountsbio, toplot = "protein_coding", samples = NULL, plottype = "barplot") dev.off() cat("Counts distribution barplot for protein coding biotype and group generated\n") # How much sensitivity we loose? }########################################## ## Bias ########################################## { ## Length bias detection mylengthbias <- dat(mydata, factor="group", type="lengthbias") png(paste(PLOTSDIR, "length_bias.png", sep="/"), width=w, height=h, pointsize=p) explo.plot(mylengthbias, samples = NULL, toplot = "global") dev.off() cat("Lenght bias plot generated\n") # Do we see a clear pattern? ## GC bias mygcbias <- dat(mydata, factor = "group", type="GCbias") png(paste(PLOTSDIR, "gc_bias.png", sep="/"), width=w, height=h, pointsize=p) explo.plot(mygcbias, samples = NULL, toplot = "global") dev.off() cat("GC bias plot generated\n") # Do we see a clear pattern? ## RNA composition mycomp <- dat(mydata, type="cd") png(paste(PLOTSDIR, "rna_composition.png", sep="/"), width=w, height=h, pointsize=p) explo.plot(mycomp, samples=1:12) dev.off() cat("RNA composition plot generated\n") # Are samples comparable? } |
20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library(ggplot2) library(reshape2) library(grid) library(png) library(gridExtra) library(readr) library(dplyr) ############################################################################### args <- commandArgs(trailingOnly = T) PLOTSDIR <-paste(snakemake@params[["tissue_dir"]], "plots", sep="/") PLOTSNORMDIR <- paste(PLOTSDIR, "normalization", sep = "/") w <- 1024 h <- 1024 p <- 24 { cat("Integrating plots \n.") length_norm <- c("no","full", "loess", "median", "upper") gc_norm <- c("no", "full", "loess", "median", "upper") between_norm <- c("no", "full", "median", "tmm", "upper") ## This function will retrieve plots for one normalization set and will create ## one single plot savePlots <- function(step1, step2, step3) { plotname <- paste(step1, step2, step3, sep = "_") pngPlots <- c(paste0(PLOTSNORMDIR, "/", plotname, "_length_bias.png"), paste0(PLOTSNORMDIR, "/", plotname, "_gc_bias.png"), paste0(PLOTSNORMDIR, "/", plotname, "_rna_composition.png")) thePlots <- lapply (pngPlots, function(pngFile) { rasterGrob(readPNG(pngFile, native = FALSE), interpolate = FALSE) }) plotpath <- paste(PLOTSNORMDIR, paste(plotname, "png", sep ="."), sep="/") png(plotpath, height=h/2, width=w*(3/2)) par(oma = c(0, 0, 1.5, 0)) plot.new() do.call(grid.arrange, c(thePlots, ncol = 3, nrow=1)) mtext(plotname, outer = TRUE, cex = 1.5) dev.off() unlink(pngPlots) return(plotpath) } df_normalizations <- expand.grid(gcn = gc_norm, ln = length_norm, bn = between_norm, stringsAsFactors = F) cat("Getting plots for all ", nrow(df_normalizations), "normalization combinations.\n") plots <- lapply(1:nrow(df_normalizations), function(i){ gcn <- df_normalizations[i, "gcn"] ln <- df_normalizations[i, "ln"] bn <- df_normalizations[i, "bn"] gcp <- NA tryCatch(expr = { gcp <- savePlots(paste("gc", gcn, sep = "_"), paste("length", ln, sep = "_"), paste("between", bn, sep = "_")) }, error = function(cond){ cat(cond$message, "\n") }) lp <- NA tryCatch(expr = { lp <- savePlots(paste("length", ln, sep = "_"), paste("gc", gcn, sep = "_"), paste("between", bn, sep = "_")) }, error = function(cond){ cat(cond$message, "\n") }) return(list(gcp, lp)) }) plots <- sort(unlist(plots)) thePlots <- lapply (plots, function(pngFile) { rasterGrob(readPNG(pngFile, native = FALSE), interpolate = FALSE) }) cat("Saving normalization_plots.pdf\n") pdf(paste(PLOTSDIR, "normalization_plots.pdf", sep="/"), title="Normalization Plots") print(marrangeGrob(thePlots, nrow = 3, ncol = 1, top = NULL)) dev.off() } |
18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library(reshape2) library(grid) library(png) library(gridExtra) library(parallel) library(readr) library(dplyr) library(NOISeq) library(EDASeq) ############################################################################### MCCORES <- as.numeric(snakemake@threads[[1]]) PLOTSDIR <-paste(snakemake@params[["tissue_dir"]], "plots", sep="/") PLOTSNORMDIR <- paste(PLOTSDIR, "normalization", sep = "/") dir.create(PLOTSNORMDIR) w <- 1024 h <- 1024 p <- 24 load(snakemake@input[[1]]) cat("Testing normalization methods\n.") rawEDA <- EDASeq::newSeqExpressionSet( counts = full$M, featureData = data.frame(full$annot, row.names = full$annot$gene_id), phenoData = data.frame( conditions = full$targets$group, row.names = full$targets$id)) length_norm <- c("no", "full", "loess", "median", "upper") gc_norm <- c("no", "full", "loess", "median", "upper") between_norm <-c("no", "full", "median", "tmm", "upper") ## This function gets the relevant statistics for the regression methods for GC and Length bias getRegressionStatistics <- function(regressionmodel) { name <- names(regressionmodel) rsquared <- summary(regressionmodel[[1]])$r.squared fstatistic <- summary(regressionmodel[[1]])$fstatistic pvalue <- signif(pf(q = fstatistic[1], df1 = fstatistic[2], df2 = fstatistic[3], lower.tail = FALSE), 2) return(list("name" = name, "r2" = rsquared, "p" = pvalue)) } ## This function will retrieve the NOISeq results to test the normalization combination getNOISeqResults <- function(step1, step2, step3, n_counts, m10_data) { ### Check the NOISEq results mydata <- NOISeq::readData( data = n_counts, length = m10_data$annot %>% select(gene_id, length) %>% as.data.frame(), biotype = m10_data$annot %>% select(gene_id, gene_type) %>% as.data.frame(), chromosome = m10_data$annot %>% select(chr, start, end) %>% as.data.frame(), factors = m10_data$targets %>% select(group) %>% as.data.frame(), gc = m10_data$annot %>% select(gene_id, gc)%>% as.data.frame()) nsamples <- dim(m10_data$targets)[1] ### Length bias mylengthbias <- dat(mydata, factor="group", norm = TRUE, type="lengthbias") l_stats_1 <- getRegressionStatistics(mylengthbias@dat$RegressionModels[1]) l_stats_2 <- getRegressionStatistics(mylengthbias@dat$RegressionModels[2]) ## GC Bias mygcbias <- dat(mydata, factor = "group", norm = TRUE, type ="GCbias") gc_stats_1 <- getRegressionStatistics(mygcbias@dat$RegressionModels[1]) gc_stats_2 <- getRegressionStatistics(mygcbias@dat$RegressionModels[2]) #RNA Composition myrnacomp <- dat(mydata, norm = TRUE, type="cd") dtable <- table(myrnacomp@dat$DiagnosticTest[, "Diagnostic Test"]) if (is.na(dtable["PASSED"])) dtable <- data.frame(PASSED = 0) cat("Step1:", step1, ", step2:", step2, " step3: ", step3, " calculations done\n") plotname <- paste(step1, step2, step3, sep = "_") pngPlots <- c(paste0(PLOTSNORMDIR, "/", plotname, "_length_bias.png"), paste0(PLOTSNORMDIR, "/", plotname, "_gc_bias.png"), paste0(PLOTSNORMDIR, "/", plotname, "_rna_composition.png")) png(pngPlots[1], width=w/2, height=h/2) explo.plot(mylengthbias, samples = NULL, toplot = "global") dev.off() png(pngPlots[2], width=w/2, height=h/2) explo.plot(mygcbias, samples = NULL, toplot = "global") dev.off() png(pngPlots[3],width=w/2, height=h/2) explo.plot(myrnacomp, samples = 1:12) dev.off() cat("Step1:", step1, ", step2:", step2, " step3: ", step3, " plots saved\n") norm_set_results <- tibble(step1, step2, step3, l_stats_1$r2, l_stats_1$p, l_stats_2$r2, l_stats_2$p, gc_stats_1$r2, gc_stats_1$p, gc_stats_2$r2, gc_stats_2$p, dtable["PASSED"], dtable["PASSED"]/nsamples) colnames(norm_set_results) <- c("step1", "step2", "step3", paste("length", l_stats_1$name, "R2", sep = "_"), paste("length", l_stats_1$name, "p-value", sep = "_"), paste("length", l_stats_2$name, "R2", sep = "_"), paste("length", l_stats_2$name, "p-value", sep = "_"), paste("gc", gc_stats_1$name, "R2", sep = "_"), paste("gc", gc_stats_1$name, "p-value", sep = "_"), paste("gc", gc_stats_2$name, "R2", sep = "_"), paste("gc", gc_stats_2$name, "p-value", sep = "_"), "rna_passed_samples", "rna_passed_proportion") return(norm_set_results) } df_normalizations <- expand.grid(gcn = gc_norm, ln = length_norm, bn = between_norm, stringsAsFactors = F) cat("Testing all ", nrow(df_normalizations), "normalization combinations\n") ## Normalizations with GC step first gc_norms <- mclapply(X = 1:nrow(df_normalizations), mc.cores = MCCORES, mc.preschedule = FALSE, FUN = function(i){ gcn <- df_normalizations[i, "gcn"] ln <- df_normalizations[i, "ln"] bn <- df_normalizations[i, "bn"] tryCatch( expr = { if(gcn == "no") { gcn_data <- counts(rawEDA) } else { gcn_data <- withinLaneNormalization(counts(rawEDA), full$annot$gc, which = gcn) } if(ln == "no") { ln_data <- gcn_data } else { ln_data <- withinLaneNormalization(gcn_data, full$annot$length, which = ln) } if (bn == "no") { between_data <- ln_data } else if (bn == "tmm") { between_data <- tmm(ln_data, long = 1000, lc = 0, k = 0) } else { between_data <- betweenLaneNormalization(ln_data, which = bn, offset = FALSE) } cat("Testing with GC normalization: ", gcn, ", length normalization: ", ln, " and between lane normalization: ", bn, "\n") norm_noiseq_results <- getNOISeqResults(paste("gc", gcn, sep = "_"), paste("length", ln, sep = "_"), paste("between", bn, sep = "_"), between_data, full) return(norm_noiseq_results) }, error = function(cond) { norm_noiseq_results <- tibble(step1 = paste("gc", gcn, sep = "_"), step2 = paste("length", ln, sep = "_"), step3 = paste("between", bn, sep = "_"), error = cond$message) return(norm_noiseq_results) }) }) save(gc_norms, file=snakemake@output[["gc_norms"]]) gc_norms <- bind_rows(gc_norms) ## Normalizations with Length step first ln_norms <- mclapply(X = 1:nrow(df_normalizations), mc.preschedule = FALSE, mc.cores = MCCORES, FUN = function(i){ gcn <- df_normalizations[i, "gcn"] ln <- df_normalizations[i, "ln"] bn <- df_normalizations[i, "bn"] tryCatch( expr = { if(ln == "no") { ln_data <- counts(rawEDA) } else { ln_data <- withinLaneNormalization(counts(rawEDA), full$annot$length, which = ln) } if(gcn == "no") { gcn_data <- ln_data } else { gcn_data <- withinLaneNormalization(ln_data, full$annot$gc, which = gcn) } if (bn == "no") { between_data <- ln_data } else if (bn == "tmm") { between_data <- tmm(ln_data, long = 1000, lc = 0, k = 0) } else { between_data <- betweenLaneNormalization(ln_data, which = bn, offset = FALSE) } cat("Testing with length normalization: ", ln, "GC normalization: ", gcn, " and between lane normalization: ", bn, "\n") norm_noiseq_results <- getNOISeqResults(paste("length", ln, sep = "_"), paste("gc", gcn, sep = "_"), paste("between", bn, sep = "_"), between_data, full) return(norm_noiseq_results) }, error = function(cond) { norm_noiseq_results <- tibble(step1 = paste("length", ln, sep = "_"), step2 = paste("gc", gcn, sep = "_"), step3 = paste("between", bn, sep = "_"), error = cond$message) return(norm_noiseq_results) }) }) save(ln_norms, file=snakemake@output[["ln_norms"]]) ln_norms <- bind_rows(ln_norms) normalization_results <- bind_rows(gc_norms, ln_norms) cat("End of normalization testing\n") cat("Saving normalization_results.tsv\n") write_tsv(normalization_results, snakemake@output[["results"]]) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library(NOISeq) library(readr) library(dplyr) library(tidyr) library(ggplot2) library(ggthemes) library(ggpubr) w <- 1024 h <- 1024 p <- 24 PLOTSDIR <-paste(snakemake@params[["tissue_dir"]], "plots", snakemake@params[["plots_type"]], sep="/") dir.create(PLOTSDIR, recursive = TRUE) cat("Loading data") load(snakemake@input[[1]]) cat("Performing PCA") mydata <- NOISeq::readData( data = full$M, length = full$annot %>% select(gene_id, length) %>% as.data.frame(), biotype = full$annot %>% select(gene_id, gene_type) %>% as.data.frame(), chromosome = full$annot %>% select(chr, start, end) %>% as.data.frame(), factors = full$targets %>% select(group) %>% as.data.frame(), gc = full$annot %>% select(gene_id, gc) %>% as.data.frame()) pca_dat <- dat(mydata, type = "PCA", logtransf = F) pca_results <- pca_dat@dat$result pca_factors <- pca_dat@dat$factors cat("Getting variance plot") pc_eigenvalues <- tibble(pc = factor(1:nrow(pca_results$var.exp)), pct = pca_results$var.exp[,1], pct_cum = pca_results$var.exp[,2]) g1 <- ggplot(data = pc_eigenvalues[1:50,], aes(x = pc)) + geom_col(aes(y = pct)) + geom_line(aes(y = pct_cum, group = 1)) + geom_point(aes(y = pct_cum)) + labs(x = "Principal component", y = "Fraction variance explained") + theme_hc(base_size = 20) + theme(axis.text.x = element_text(size=12)) + ggtitle("PCA Variance") ggexport(g1, width = w, height = h/2, pointsize = p, filename = snakemake@output[["variance"]]) cat("Getting scores plot") pc_scores <- tibble(pc = factor(1:nrow(pca_results$var.exp)), name = rownames(pca_factors), group = factor(pca_factors$group, levels = c("normal", "cancer"), labels = c("Normal", "Cancer")), pc1 = pca_results$scores[,1], pc2 = pca_results$scores[,2], pc3 = pca_results$scores[,3]) color_pal <- c("#e3a098", "#a32e27") xrange <- range(pc_scores %>% select(pc1)) yrange <- range(pc_scores %>% select(pc2, pc3)) g1 <- ggplot(pc_scores, aes(x=pc1, y=pc2, color=group)) + geom_point(size=2) + labs(x = "PC1", y = "PC2") + scale_color_manual(values = color_pal) + theme_hc(base_size = 20) + theme(legend.title = element_blank()) + xlim(xrange) + ylim(yrange) g2 <- ggplot(pc_scores, aes(x=pc1, y=pc3, color=group)) + geom_point(size=2) + labs(x = "PC1", y = "PC3") + scale_color_manual(values = color_pal) + theme_hc(base_size = 20) + theme(legend.title = element_blank()) + xlim(xrange) + ylim(yrange) fig <- ggarrange(g1, g2, nrow = 1) fig <- annotate_figure(fig, top = text_grob("PCA Scores", size = 25)) ggexport(fig, width = w, height = h/2, pointsize = p, filename = snakemake@output[["score"]]) cat("Getting loading plot") pc_loadings <- tibble(gene = rownames(pca_results$loadings), pc1 = pca_results$loadings[, 1], pc2 = pca_results$loadings[, 2]) top_genes <- pc_loadings %>% pivot_longer(matches("pc"), names_to = "pc", values_to = "loading")%>% group_by(pc) %>% arrange(desc(abs(loading))) %>% slice(1:10) %>% pull(gene) %>% unique() top_loadings <- pc_loadings %>% filter(gene %in% top_genes) g1 <- ggplot(top_loadings) + geom_segment(aes(x = 0, y = 0, xend = pc1, yend = pc2), arrow = arrow(length = unit(0.1, "in")), color = "brown") + geom_text(aes(x = pc1, y = pc2, label = gene), nudge_y = 0.005, size = 3) + scale_x_continuous(expand = c(0.02, 0.02)) + ggtitle("PCA Loadings") + theme_hc(base_size = 15) ggexport(g1, width = w/2, height = h/2, pointsize = p, filename = snakemake@output[["loading"]]) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 | import requests import json import sys import pandas as pd from io import StringIO import pathlib logdir = snakemake.params[0] + "/log" manifestdir = snakemake.params[0] + "/manifests" pathlib.Path(logdir).mkdir(parents=True, exist_ok=True) pathlib.Path(manifestdir).mkdir(parents=True, exist_ok=True) sys.stderr = open(snakemake.log[0], "w") primary_site = snakemake.params[1] sample_type =snakemake.params[2] sst = sample_type if (sample_type == "cancer"): sample_type = "primary tumor" elif (sample_type == "normal"): sample_type = "solid tissue normal" else: sys.exit("Incorrect sample type") print("Getting data for: " + primary_site + ", " + sample_type) # Fields for the query fields = [ "cases.case_id", "file_name" ] fields = ",".join(fields) # Endpoints used files_endpt = "https://api.gdc.cancer.gov/files" manifest_endpt = "https://api.gdc.cancer.gov/manifest" # Filters to get RNA seq data filters = { "op": "and", "content":[ { "op": "in", "content":{ "field": "cases.project.primary_site", "value": [primary_site] } }, { "op": "in", "content":{ "field": "cases.samples.sample_type", "value": [sample_type] } }, { "op": "in", "content":{ "field": "files.analysis.workflow_type", "value": ["HTSeq - Counts"] } }, { "op": "in", "content":{ "field": "files.data_type", "value": ["Gene Expression Quantification"] } } ] } params = { "filters": json.dumps(filters), "fields": fields, "format": "tsv", "size": "2000" } # Getting data response = requests.get(files_endpt, params = params) data = response.content.decode("utf-8") print("RNA-seq query done") if(not data.strip()): sys.exit("No records found for " + primary_site + ", " + sample_type) df = pd.read_csv(StringIO(data), sep ="\t") df.columns = [col + "_rna" for col in df.columns] # Save file ids and file names just in case df.to_csv(snakemake.output[0], sep="\t", index=False) # Getting RNASeq files manifest for future download params = { "ids" : df["id_rna"].tolist() } response = requests.post(manifest_endpt, data= json.dumps(params), headers = {"Content-Type": "application/json"}) print("Got RNASeq manifest") f = open(snakemake.output[1], "w") f.write(response.content.decode("utf-8")) f.close() print("RNASeq manifest written") |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library(NOISeq) library(readr) library(dplyr) STEP1 = snakemake@params[["step1"]] STEP2 = snakemake@params[["step2"]] STEP3 = snakemake@params[["step3"]] ########################################## ## ARSyN to reduce batch effect ########################################## { cat("Loading data\n") load(snakemake@input[[1]]) mydata <- NOISeq::readData( data = full$M, factors = full$target %>% select(group) %>% as.data.frame()) cat("Performing ARSyN for batch correction") myARSyN <- ARSyNseq(mydata, norm = "n", logtransf = FALSE) ##Saving everything full_arsyn <- list(M = assayData(myARSyN)$exprs, annot = full$annot, targets = full$targets) stopifnot(nrow(full$M) == nrow(full_arsyn$annot)) stopifnot(all(rownames(full_arsyn$M) == rownames(full_arsyn$annot))) full <- full_arsyn cat("Saving", paste(STEP1, STEP2, STEP3, "norm_arsyn.RData", sep = "_"), "\n") save(full, file=snakemake@output[["arsyn_rdata"]], compress="xz") cat("Generating data matrices with arsyn for Aracne\n") ## Data matrices for Aracne #normal samples normal <- as_tibble(full$M[,full$targets$group == "normal"]) normal <- bind_cols(gene=as.character(full$annot$gene_id), normal) #cancer samples cancer <- as_tibble(full$M[,full$targets$group == "cancer"]) cancer <- bind_cols(gene=as.character(full$annot$gene_id), cancer) #gene_ids symbols <- full$annot %>% select(gene_id) cat("Saving arsyn data\n") cat("Saving", paste(STEP1, STEP2, STEP3, "norm_arsyn_normal.tsv", sep = "_"), "\n") write_tsv(normal, file=snakemake@output[["normal_matrix"]]) cat("Saving", paste(STEP1, STEP2, STEP3, "norm_arsyn_cancer.tsv", sep = "_"), "\n") write_tsv(cancer, file=snakemake@output[["cancer_matrix"]]) cat("Saving", paste(STEP1, STEP2, STEP3, "norm_arsyn_genelist.txt", sep = "_"), "\n") write_tsv(symbols, file=snakemake@output[["gene_list"]], col_names = F) } |
12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") ##Usefull Libraries library(ggplot2) library(reshape2) library(readr) library(dplyr) library(EDASeq) library(NOISeq) library(DESeq2) STEP1 = snakemake@params[["step1"]] STEP2 = snakemake@params[["step2"]] STEP3 = snakemake@params[["step3"]] is_xena = snakemake@params[["xena"]] RDATA <-paste(snakemake@params[["tissue_dir"]], "rdata", sep="/") w <- 1024 h <- 1024 p <- 24 ########################################################################## {##### USER SELECTED NORMALIZATION cat("Loading data\n") load(snakemake@input[[1]]) cat("Performing normalization. Step1: ", STEP1, ", Step2: ", STEP2, " Step3: ", STEP3, "\n") first <- strsplit(STEP1, "-")[[1]] second <- strsplit(STEP2, "-")[[1]] if(first[1] == "length") { if(first[2] != "no") { norm_counts <- withinLaneNormalization(full$M, full$annot$length, which = first[2]) } else { norm_counts <- full$M } if(second[1] == "gc") { if(second[2] != "no") { norm_counts <- withinLaneNormalization(norm_counts, full$annot$gc, which = second[2]) } } } else if (first[1] == "gc") { if(first[2] != "no") { norm_counts <- withinLaneNormalization(full$M, full$annot$gc, which = first[2]) } else { norm_counts <- full$M } if(second[1] == "length") { if(second[2] != "no") { norm_counts <- withinLaneNormalization(norm_counts, full$annot$length, which = second[2]) } } } else { norm_counts <- full$M } if(STEP3 == "no") { norm_counts <- full$M } else if (STEP3 == "tmm") { norm_counts <- tmm(norm_counts, long = 1000, lc = 0, k = 0) } else { norm_counts <- betweenLaneNormalization(norm_counts, which = STEP3, offset = FALSE) } cat("Normalization done. Step1: ", STEP1, ", Step2: ", STEP2, " Step3: ", STEP3, "\n") # Saving normalized data full$M <- norm_counts norm_data_cpm10 <- filtered.data(full$M, factor=full$targets$group, norm=TRUE, cv.cutoff=100, cpm=10) filtered <- nrow(full$M) - nrow(norm_data_cpm10) cat("After normalization. There are", filtered, "genes with counts per million mean < 10", nrow(norm_data_cpm10), "with counts per million mean > 10 \n") ### There is a problem with xena expression matrix. Values for ### genes "ENSG00000203811.1" and "ENSG00000203852.3" are exactly ### the same and it affects correlation calculation if(is_xena) { genes <- rownames(norm_data_cpm10) if("ENSG00000203811.1" %in% genes){ genes <- genes[genes != "ENSG00000203811.1"] norm_data_cpm10 <- norm_data_cpm10[genes,] } } full <-list(M = norm_data_cpm10, annot = full$annot %>% filter(gene_id %in% rownames(norm_data_cpm10)), targets = full$targets) stopifnot(nrow(full$M) == nrow(full$annot)) stopifnot(rownames(full$M) == rownames(full$annot)) cat("Saving", paste(STEP1, STEP2, STEP3, "norm_full.RData", sep = "_"), "\n") save(full, file=snakemake@output[["norm_rdata"]], compress="xz") cat("Generating data matrices for Aracne\n") ## Data matrices for Aracne #normal samples normal <- as_tibble(full$M[, full$targets$group == "normal"]) normal <- bind_cols(gene=as.character(full$annot$gene_id), normal) #cancer samples cancer <- as_tibble(full$M[, full$targets$group == "cancer"]) cancer <- bind_cols(gene=as.character(full$annot$gene_id), cancer) #gene_ids symbols <- full$annot %>% select(gene_id) cat("Saving data\n") cat("Saving", paste(STEP1, STEP2, STEP3, "norm_normal.tsv", sep = "_"), "\n") write_tsv(normal, file=snakemake@output[["normal_matrix"]]) cat("Saving", paste(STEP1, STEP2, STEP3, "norm_cancer.tsv", sep = "_"), "\n") write_tsv(cancer, file=snakemake@output[["cancer_matrix"]]) cat("Saving", paste(STEP1, STEP2, STEP3, "norm_genelist.txt", sep = "_"), "\n") write_tsv(symbols, file=snakemake@output[["gene_list"]], col_names = F) }########################################## |
Support
- Future updates
Related Workflows





