Spliced peptide identification from in vitro digestions of polypeptides with purified proteasomes
Please cite the following publication if you are using invitroSPI for your research:
Roetschke, H.P., Rodriguez-Hernandez, G., Cormican, J.A. et al. InvitroSPI and a large database of proteasome-generated spliced and non-spliced peptides. Sci Data 10, 18 (2023). https://doi.org/10.1038/s41597-022-01890-6
overview
The invitroSPI pipeline consists of five main steps that are implemented in a Snakemake workflow:
- parsing of search result files and creation of a preliminary MS database (MSDB) containing all peptide-spectrum matches (PSMs), mapping of peptides to their substrate origins and potential product type re-assignment
- scoring and filtering of PSMs using Mascot's ion score and q-value as well as the delta score described in the manuscript
- identification and, optionally, removal of synthesis errors using the control runs
- mapping of peptides to the substrate sequence accounting for potential multi-mappers
- database statistics and diagnostic information
Additionally, code for the computation of all possible spliced and non-spliced peptides, which is used for the Mascot search, is provided in SOURCE/_generateSearchDB_polypeptides.R. We also include a script containing useful functions for downstream analyses (invitroSPI_utils.R).
execution
invitroSPI relies on Conda and Snakemake.
To install Conda, follow the installation guidelines for your respective operating system. After installing Conda, you need to install Snakemake; the Snakemake installation procedure is described in the Snakemake documentation.
Briefly, open the terminal on your computer and paste the following lines sequentially:
conda install -n base -c conda-forge mamba
conda activate base
mamba create -c conda-forge -c bioconda -n snakemake snakemake
Additionally, you might need to run conda update conda.
Download this repository as a .zip file (click on Code at the upper right corner of this repository --> Download ZIP), move the .zip file to the desired directory on your computer and unpack it.
Make sure to edit INPUT/sample_list.csv and INPUT/config.yaml before starting the data processing (see below for a more detailed description).
Open the terminal in this directory and enter:
conda activate snakemake
The pipeline can be executed by pasting
snakemake --use-conda --cores all -R createMSDB
into the terminal. The progress of the pipeline execution should appear in your terminal window.
In case you have installed an older version of Conda/Snakemake and encounter an error when executing the pipeline, try executing:
snakemake --use-conda --cores all -R createMSDB --conda-frontend conda
After your jobs have finished, enter conda deactivate to terminate your Conda environment.
input
invitroSPI identifies spliced and non-spliced peptides from Mascot search result files. Therefore, the user must provide a table sample_list.csv in the INPUT/ folder containing information about:
- project name
- substrate ID
- substrate sequence
- time point
- search result file name
- replicate
- MSfile (optional)
- metainformation (optional)
!!!
In case you are creating/editing the sample_list.csv file in Microsoft Excel, make sure to save it as an actual comma-separated file, i.e. Save as... --> Comma-separated Values (.csv).
To check that the sample list is in the correct format, you can open it using a text editor and verify that the columns are separated by commas (and NOT semicolons).
!!!
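Instead of opening the file by hand, a quick shell check along the following lines can flag a semicolon-separated export. The header string below is illustrative; in practice you would read it from the file, e.g. with header=$(head -n 1 INPUT/sample_list.csv).

```shell
# Minimal sketch of a delimiter check for the sample list.
# Hypothetical header - replace with: header=$(head -n 1 INPUT/sample_list.csv)
header="project_name,substrateID,substrateSeq,digestTime,filename,replicate,MSfile,metainformation"

# A correctly exported file has commas, not semicolons, in its header line.
case "$header" in
  *";"*) result="semicolons found - re-save as comma-separated values" ;;
  *","*) result="comma-separated - OK" ;;
  *)     result="no delimiter detected - check the file" ;;
esac
echo "$result"
```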
Additionally, the user must provide search result files deposited in the folder INPUT/search_results/ and list their names in the sample_list.csv table.
An example of the sample_list.csv table is given below and can be modified accordingly by the user:
| project_name | substrateID | substrateSeq | digestTime | filename | replicate | MSfile | metainformation |
|---|---|---|---|---|---|---|---|
| test_data | TSN2 | TSN2.fasta | 4 | F029125.csv | 1 | | INPUT/metainformation_testData.csv |
| test_data | TSN89 | RTKAWNRQLYPEW | 4 | F029129.csv | 1 | | INPUT/metainformation_testData.csv |
| test_data | TSN2 | VSRQLRTKAWNRQLYPEWTEAQR | CTRL | F029123.csv | 1 | | INPUT/metainformation_testData.csv |
| test_data | TSN89 | RTKAWNRQLYPEW | CTRL | F029127.csv | 1 | | INPUT/metainformation_testData.csv |
You can either paste the substrate sequence into the sample_list directly or put the name of a single-entry .fasta file containing the substrate sequence. This file should be located in INPUT/sequences.
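For example, the TSN2.fasta file referenced in the sample list above would contain a single header line followed by the substrate sequence (the header text is illustrative; the sequence is taken from the TSN2 row of the example table):

```
>TSN2
VSRQLRTKAWNRQLYPEWTEAQR
```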
Note that the filenames of the control files must also be provided. For the control files, put CTRL in the digestTime column. Please provide all other time points in hours (e.g., 0.5 for 30 min).
MSfile is an optional column that might be helpful to keep track of the .raw/.mgf files that were used to generate the respective search result files. It will not be used to create the database.
In case you would like to include additional information in the final database (for instance species, location, instrument, substrate origin, proteasome isotype, ...), you can do so by creating a .csv table containing all this information. The .csv table must also contain a column filename corresponding to the filename in the sample list (see INPUT/metainformation_SciData2022 for an example). Provide the name of the metainformation table in the metainformation column of the sample list.
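A minimal metainformation table might look like the following sketch. Only the filename column is required by the pipeline; all other column names and values here are illustrative examples:

```
filename,species,instrument,proteasome_isotype
F029125.csv,human,Orbitrap,standard proteasome
F029129.csv,human,Orbitrap,immunoproteasome
```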
The invitroSPI workflow is constructed in such a way that the user can keep appending projects and search result files to the sample_list. However, only the samples of the current project (which is specified in INPUT/config.yaml) are processed and stored separately in OUTPUT/project_name subdirectories.
output
The pipeline provides a final database of identified spliced and non-spliced peptides in .csv format (OUTPUT/project_name/ProteasomeDB.csv) as well as all intermediate files in binary format (OUTPUT/project_name/tmp/).
Additionally, some diagnostic plots and database statistics are produced, which can also be found in the OUTPUT/ folder. They comprise:
- number of unique peptides
- number and frequencies of peptides and PSMs at each time point
- length distribution of peptides, splice reactants and intervening sequences at each time point
parameters that can be modified by the user
We use the following default parameters, which are specified in the INPUT/config.yaml file and can be changed by the user:
- project_name : indicates which files listed in the sample_list.csv should be processed (enter valid directory names only! - no spaces or German umlauts)
- delta_score : 0.3
- ion_score : 20
- q_value : 0.05
- include_scanNum : "yes"
- keep_synErrors : "no"
Processing of scan numbers is only possible if .mgf files were created with msconvert or Mascot Distiller. In case you provide search results in another format (not recommended), please set include_scanNum to "no".
In case you would like to include synthesis errors (labelled as such) in the final ProteasomeDB, change the keep_synErrors flag accordingly.
Code Snippets
```r
library(plyr)
library(dplyr)
library(seqinr)
library(stringr)

source("SOURCE/loadFunctions.r")

print("--------------------------------------------")
print("1) PARSE SEARCH RESULT FILES AND CREATE MSDB")
print("--------------------------------------------")

project_name = snakemake@params[["project_name"]]
include_scanNum = snakemake@params[["include_scanNum"]]
SEARCHRESULTS_PATH = "INPUT/search_results/"

### INPUT ###
sample_list = read.csv(file = snakemake@input[["sample_list"]],
                       sep = ",", header = T, stringsAsFactors = F)

# filter sample list
sample_list = sample_list[sample_list$project_name == project_name, ]

# add meta-information
if (!any(is.na(sample_list$metainformation)) & length(unique(sample_list$metainformation)) == 1) {
  metainfo = read.csv(sample_list$metainformation[1], stringsAsFactors = F)
  meta = T
} else {
  meta = F
}

### MAIN PART ###
if (!dir.exists(paste0("OUTPUT/", project_name))) {
  dir.create(paste0("OUTPUT/", project_name))
}

columns = c("runID", "substrateID", "substrateSeq", "digestTime",
            "pepSeq", "productType", "spliceType", "positions",
            "scanNum", "rank", "ionScore", "qValue", "charge", "PTM")

MSDB = list()
pb = txtProgressBar(min = 0, max = nrow(sample_list), style = 3)

for (i in 1:nrow(sample_list)) {

  # load search result file
  con = file(paste0(SEARCHRESULTS_PATH, sample_list$filename[i]), "r")
  i1 = readLines(con)
  close(con)

  # remove header from search result file
  # and remove not assigned queries if there are any
  start = grep("prot_hit_num", i1)
  currentSearchFile = read.table(paste(SEARCHRESULTS_PATH, sample_list$filename[i], sep = "/"),
                                 skip = start - 1, sep = ",", header = TRUE, fill = TRUE)
  kk = which(currentSearchFile$prot_hit_num == "")
  if (length(kk) > 0) {
    kk = min(kk)
    currentSearchFile = currentSearchFile[c(1:(kk - 2)), ]
  }

  substrateID = sample_list$substrateID[i]
  digestTime = sample_list$digestTime[i]

  if (nrow(currentSearchFile) > 0) {

    # empty DB for current search result file
    currentDB = data.frame(matrix(ncol = length(columns), nrow = nrow(currentSearchFile)))
    colnames(currentDB) = columns

    # fill data frame with info straight outta .csv
    currentDB$charge <- currentSearchFile$pep_exp_z
    currentDB$rank <- currentSearchFile$pep_rank
    currentDB$pepSeq <- currentSearchFile$pep_seq
    currentDB$ionScore <- currentSearchFile$pep_score
    currentDB$qValue <- currentSearchFile$pep_expect
    currentDB$PTM <- currentSearchFile$pep_var_mod
    currentDB$scanNum <- currentSearchFile$pep_query

    # get actual scan numbers, not query numbers
    # varies between Mascot and Mascot Distiller
    if (include_scanNum != "no") {
      warn = F
      scans = rep(NA, dim(currentSearchFile)[1])
      for (ii in 1:dim(currentSearchFile)[1]) {
        tit = currentSearchFile$pep_scan_title[ii]
        if (str_detect(tit, pattern = "scan=")) {
          scans[ii] = as.numeric(strsplit(strsplit(tit, split = "scan=")[[1]][2], split = "~")[[1]][1])
        } else if (str_detect(tit, pattern = "Scan ")) {
          scans[ii] = as.numeric(unlist(strsplit(tit, " "))[3])
        } else if (str_detect(tit, pattern = coll(".")) | str_detect(tit, pattern = "TSN")) {
          scans[ii] = as.numeric(strsplit(tit, split = "\\.")[[1]][2])
        } else {
          warn = T
        }
      }
      currentDB$scanNum <- scans

      if (warn) {
        print("could not extract scan numbers - check search result formatting!")
        currentDB$scanNum = currentSearchFile$pep_query
      }
    }

    # extract product type and position information
    pepName <- as.character(currentSearchFile$prot_acc)
    pepType <- c()
    pepPos <- c()
    for (z in 1:length(pepName)) {
      splitPepName <- unlist(strsplit(pepName[z], "_"))
      pepType <- c(pepType, splitPepName[1])
      splitPepName <- splitPepName[-1]
      pepPos <- c(pepPos, paste(splitPepName, collapse = "_"))
    }
    currentDB$productType <- pepType
    currentDB$positions <- pepPos

    # substrate sequence
    if (grepl(pattern = ".fasta", sample_list$substrateSeq[i])) {
      cntSeq = read.fasta(file = paste0("INPUT/sequences/", sample_list$substrateSeq[i]),
                          seqtype = "AA", strip.desc = T)[[1]] %>%
        unlist() %>%
        paste(collapse = "")
    } else {
      cntSeq = sample_list$substrateSeq[i]
    }

    # add metainfo
    currentDB$substrateID = substrateID
    currentDB$substrateSeq <- cntSeq
    currentDB$digestTime <- digestTime
    currentDB$runID = paste(substrateID, digestTime, sample_list$filename[i],
                            sample_list$replicate[i], sep = "-")

    # add further meta-information if provided
    if (meta) {
      cntMeta = metainfo[metainfo$filename == sample_list$filename[i], ]
      currentDB = cbind(currentDB, cntMeta[rep(1, nrow(currentDB)), ]) %>% as.data.frame()
      dp = which(duplicated(names(currentDB)))
      if (length(dp) > 0) { currentDB = currentDB[, -dp] }
    }

    # assign substrate hits as PCP
    substratehits = which(currentDB$substrateSeq == currentDB$pepSeq)
    if (length(substratehits) > 0) {
      currentDB$productType[substratehits] = "PCP"
      currentDB$positions[substratehits] = paste(1, nchar(currentDB$substrateSeq[substratehits]), sep = "_")
    }

    print("")
    paste0("obtain correct annotations for positions and splice types for run: ",
           currentDB$runID[1]) %>% print()
    currentDB = currentDB %>% mapping()
    currentDB$spliceType[currentDB$productType == "PCP"] = NA

  } else {
    currentDB = data.frame(matrix(ncol = length(columns), nrow = 0))
    colnames(currentDB) = columns
    print("No peptides listed in this file! Skipping...")
  }

  # append data frame for current search file to master data frame
  MSDB[[i]] = currentDB
}

MSDB = plyr::ldply(MSDB)

### OUTPUT ###
save(MSDB, file = unlist(snakemake@output[["MSDB"]]))
```
```r
library(stringr)
library(plyr)
library(dplyr)

print("------------------------------------------------")
print("2) EXTRACT HIGH-QUALITY PSMs FROM SEARCH RESULTS")
print("------------------------------------------------")

project_name = snakemake@params[["project_name"]]
delta_score = snakemake@params[["delta_score"]]
ion_score = snakemake@params[["ion_score"]]
q_value = snakemake@params[["q_value"]]

print(paste0("Delta score: ", delta_score))
print(paste0("Mascot ion score: ", ion_score))
print(paste0("q-value: ", q_value))

### INPUT ###
sample_list = read.csv(file = snakemake@input[["sample_list"]],
                       sep = ",", header = T, stringsAsFactors = F)

# filter sample list
sample_list = sample_list[sample_list$project_name == project_name, ]

load(snakemake@input[["MSDB"]])

### MAIN PART ###
columns = names(MSDB)
runIDs = unique(MSDB$runID)

allPSMs = list()
allPSMs_Delta = list()

pb = txtProgressBar(min = 0, max = length(runIDs), style = 3)
for (i in 1:length(runIDs)) {
  setTxtProgressBar(pb, i)

  selected <- data.frame(matrix(ncol = length(columns), nrow = 0))
  colnames(selected) = columns
  selectedDeltaRecord = data.frame(matrix(ncol = length(columns), nrow = 0))
  colnames(selectedDeltaRecord) <- columns

  runIDTable = MSDB[MSDB$runID == runIDs[i], ]
  scanNum = as.character(runIDTable$scanNum) %>% unique()

  # Iterate through scanNum and sort by rank
  for (k in 1:length(scanNum)) {
    PSPcandidatesIndex = NULL
    PCPcandidatesIndex = NULL
    deltaRecord = data.frame(matrix(ncol = length(columns), nrow = 0))
    colnames(deltaRecord) = columns

    scanNumTable = runIDTable[runIDTable$scanNum == scanNum[k], ]
    scanNumTable = scanNumTable[order(scanNumTable$rank), ]
    scanNumTable = unique(scanNumTable)

    filteredScans = scanNumTable
    filteredScans = filteredScans[order(filteredScans$rank), ]

    # extract the top score
    topScore <- filteredScans[1, "ionScore"]

    # extract scans with rank 1
    allRanks = table(filteredScans$rank)
    NoTopRankedScans <- allRanks[names(allRanks) == 1]

    # it can somehow happen that there is no top-ranked peptide...
    if (length(NoTopRankedScans) == 0) {
      filteredScans <- filteredScans[0, ]
      paste0("!!! NO RANK 1 PEPTIDES FOUND IN SCAN ", scanNum[k],
             ", RUN-ID ", runIDs[i], " !!!") %>% print()
      print("likely, there is an issue with the Mascot search settings - check!")

    } else if (NoTopRankedScans > 1) {
      topIndices = which(filteredScans$rank == 1)
      NoTopRankedPeps = gsub("I", "L", filteredScans$pepSeq[topIndices]) %>% unique() %>% length()
      PCPindex = which(filteredScans$productType[filteredScans$rank == 1] == "PCP")
      PSPindex = which(filteredScans$productType[filteredScans$rank == 1] == "PSP")

      # if more than one peptide is rank 1
      # keep scan only if there is just a single PCP, otherwise discard
      if (NoTopRankedPeps > 1) {
        if (length(PCPindex) >= 1) {
          if (length(unique(gsub("I", "L", filteredScans$pepSeq[PCPindex]))) == 1) {
            filteredScans = filteredScans[PCPindex[1], ]
          } else {
            filteredScans = filteredScans[0, ]
          }
        } else {
          if (length(PSPindex) > 1) {
            index = which(filteredScans$productType[which(filteredScans$rank == 1)] == "PSP")
            PSPcandidatescores <- filteredScans[index, "ionScore"]
            ind = which(1 - PSPcandidatescores / topScore <= delta_score)
            deltaRecord = rbind(deltaRecord, filteredScans[index[ind], ])
          }
          filteredScans = filteredScans[0, ]
        }
      } else {
        filteredScans = filteredScans[1, ]
      }

    # if there is only a single scan on rank 1, there is also only one peptide
    # if the 1st rank is a PCP, assign it as such
    } else if (NoTopRankedScans == 1 & filteredScans[1, "productType"] == "PCP") {
      filteredScans <- filteredScans[1, ]

    # If rank 1 entry is PSP check if a likely PCP is present in the lower ranks (PCPcandidates)
    # also check if there are lower-ranked PSPs
    } else if (NoTopRankedScans == 1 & filteredScans[1, "productType"] == "PSP") {

      # lower-ranked PCPs and PSPs
      PCPcandidatesIndex_temp <- which(as.numeric(filteredScans$rank) > 1 &
                                         as.character(filteredScans$productType) == "PCP")
      PSPcandidatesIndex_temp <- which(as.numeric(filteredScans$rank) > 1 &
                                         as.character(filteredScans$productType) == "PSP")

      # if there are no PSPs or PCPs with a lower rank, assign the PSP
      if (length(PCPcandidatesIndex_temp) == 0 & length(PSPcandidatesIndex_temp) == 0) {
        filteredScans <- filteredScans[1, ]
      }

      # if there are PSPs with a lower rank --> get highest-ranked PSM
      if (length(PSPcandidatesIndex_temp) > 0) {
        # keeps only top PSP candidate
        PSPcandidatesIndex = which(as.numeric(filteredScans$rank) == min(as.numeric(filteredScans$rank)[PSPcandidatesIndex_temp]) &
                                     as.character(filteredScans$productType) == "PSP")[1]
        # keeps all PSP candidates
        PSPcandidatesIndexAll = PSPcandidatesIndex_temp
      } else {
        PSPcandidatesIndex = NULL
      }

      # if there are PCPs with a lower rank --> get highest-ranked PSM
      if (length(PCPcandidatesIndex_temp) > 0) {
        PCPcandidatesIndex = which(as.numeric(filteredScans$rank) == min(as.numeric(filteredScans$rank)[PCPcandidatesIndex_temp]) &
                                     as.character(filteredScans$productType) == "PCP")
      } else {
        PCPcandidatesIndex = NULL
      }

      # for lower-ranked PCPs --> calculate Delta score
      # keep top-ranked PSP if delta score > threshold and remove rest
      if (length(PCPcandidatesIndex) > 0) {
        PCPcandidatescore <- filteredScans[PCPcandidatesIndex[1], "ionScore"]

        # if the Delta score between the top-ranked PSP and a lower-ranked PCP
        # is larger than the threshold
        if (1 - PCPcandidatescore / topScore > delta_score) {

          # if there are any lower-ranked PSPs,
          # calculate Delta scores also for lower-ranked PSPs
          if (length(PSPcandidatesIndex) > 0) {
            PSPcandidatescore <- filteredScans[PSPcandidatesIndex, "ionScore"]

            # lower-ranked PSP, but with very low ion score --> assign top-ranked PSP
            if (1 - PSPcandidatescore / topScore > delta_score) {
              filteredScans <- filteredScans[1, ]

            # discard spectrum bc there are several PSP candidates with similar ion score
            } else if (1 - PSPcandidatescore / topScore <= delta_score) {
              PSPcandidatescores <- filteredScans[PSPcandidatesIndexAll, "ionScore"]
              ind = which(1 - PSPcandidatescores / topScore <= delta_score)
              deltaRecord = rbind(deltaRecord, filteredScans[PSPcandidatesIndexAll[ind], ])
              filteredScans <- filteredScans[0, ]
            }

          # if there are no lower-ranked PSPs --> assign the top-ranked PSP
          } else if (length(PSPcandidatesIndex) == 0) {
            filteredScans <- filteredScans[1, ]
          }

        # if the Delta score is too small:
        # take candidate PCP as new top ranked and remove rest
        } else if (1 - PCPcandidatescore / topScore <= delta_score) {
          if (length(PCPcandidatesIndex) == 1) {
            filteredScans <- filteredScans[PCPcandidatesIndex, ]
          } else {
            # more than one lower-ranked PCP that survived Delta filter
            if (length(unique(gsub("I", "L", filteredScans$pepSeq[PCPcandidatesIndex]))) == 1) {
              filteredScans <- filteredScans[PCPcandidatesIndex[1], ]
            } else {
              filteredScans = filteredScans[0, ]
            }
          }
        }
        # end if there are both PCPs and PSPs in lower ranks

      # if there are lower-ranked PSPs but no PCPs: continue
      } else if (length(PSPcandidatesIndex) > 0 & length(PCPcandidatesIndex) == 0) {
        PSPcandidatescore <- filteredScans[PSPcandidatesIndex, "ionScore"]

        # keep the candidate PSP if the Delta score is large enough
        if (1 - PSPcandidatescore / topScore > delta_score) {
          filteredScans <- filteredScans[1, ]

        # if the Delta score is too small --> discard entire scan
        } else if (1 - PSPcandidatescore / topScore <= delta_score) {
          PSPcandidatescores <- filteredScans[PSPcandidatesIndexAll, "ionScore"]
          ind = which(1 - PSPcandidatescores / topScore <= delta_score)
          deltaRecord = rbind(deltaRecord, filteredScans[PSPcandidatesIndexAll[ind], ])
          filteredScans <- filteredScans[0, ]
        }
      }
    }

    # apply filter for q-value and ion score
    if (nrow(filteredScans) > 0) {
      if (as.double(filteredScans$qValue) >= q_value | as.double(filteredScans$ionScore <= ion_score)) {
        filteredScans <- filteredScans[0, ]
      } else {
        selected <- rbind(filteredScans, selected)
      }
    }

    if (nrow(deltaRecord) > 0) {
      k = which(as.double(deltaRecord$qValue) < q_value & as.double(deltaRecord$ionScore > ion_score))
      deltaRecord <- deltaRecord[k, ]
      if (nrow(deltaRecord) > 0) {
        selectedDeltaRecord <- rbind(deltaRecord, selectedDeltaRecord)
      }
    }
  }

  allPSMs[[i]] = selected
  allPSMs_Delta[[i]] = selectedDeltaRecord
}

allPSMs = plyr::ldply(allPSMs)
allPSMs_Delta = plyr::ldply(allPSMs_Delta)

paste0("number of PSMs: ", nrow(allPSMs)) %>% print()

### OUTPUT ###
save(allPSMs, file = unlist(snakemake@output[["allPSMs"]]))
save(allPSMs_Delta, file = unlist(snakemake@output[["allPSMs_Delta"]]))
```
```r
library(stringr)
library(dplyr)

print("----------------------------")
print("3) IDENTIFY SYNTHESIS ERRORS")
print("----------------------------")

### INPUT ###
load(snakemake@input[["allPSMs"]])
keep_synErrors = snakemake@params[["keep_synErrors"]]

### MAIN PART ###
subControls0 = allPSMs[allPSMs$digestTime == "CTRL", ]
extracted0 = allPSMs[allPSMs$digestTime != "CTRL", ]

unqTSNctrl <- unique(subControls0$substrateID)
unqTSNextr <- unique(extracted0$substrateID)
unqTSN = unqTSNextr

allExtracted = numeric()
for (i in 1:length(unqTSN)) {
  subIndex = which(subControls0$substrateID == unqTSN[i])
  extIndex = which(extracted0$substrateID == unqTSN[i])

  extracted <- extracted0[extIndex, ]
  subControls <- subControls0[subIndex, ]

  # Get all unique peptide seqs present in the substrate controls
  subControlSeq <- unique(gsub("I", "L", subControls$pepSeq))
  subControlPCP = unique(gsub("I", "L", subControls$pepSeq[subControls$productType == "PCP"]))

  # get SR2s of detected sequences
  extPos = str_split_fixed(extracted$positions, "[:punct:]", Inf)[, c(1:4)]
  extPos = apply(extPos, 2, as.numeric)
  extracted = extracted %>%
    mutate(sr2s = gsub("I", "L", str_sub(extracted$substrateSeq, extPos[, 3], extPos[, 4])),
           synErrSR2 = ifelse(sr2s %in% subControlPCP, "yes", "no"),
           synErrSR2 = ifelse(productType == "PCP", NA, synErrSR2)) %>%
    select(-sr2s)

  # Check which entries of extracted are also present in the substrate controls.
  # We conservatively regard all matches as synthesis errors.
  errorIndex <- which(gsub("I", "L", extracted$pepSeq) %in% subControlSeq)
  errorSeqs <- unique(gsub("I", "L", extracted$pepSeq[errorIndex]))

  # Match PSPs to errors
  PSPtoRemove <- c()
  for (b in 1:nrow(extracted)) {
    candidateType <- extracted$productType[b]
    if (str_detect(candidateType, "PSP")) {
      candidateSeq <- gsub("I", "L", extracted$pepSeq[b])
      if (any(grepl(candidateSeq, subControlSeq))) {
        PSPtoRemove <- c(PSPtoRemove, b)
      }
    }
  }

  mergeIndex = unique(c(errorIndex, PSPtoRemove))
  k = which(PSPtoRemove %in% errorIndex)
  if (length(k) > 0) {
    PSPtoRemove = PSPtoRemove[-k]
  }

  # Label PSPs that were found to match errors
  if (length(errorIndex) > 0) {
    extracted$productType[errorIndex] = paste(extracted$productType[errorIndex], "_synError", sep = "")
  }
  if (length(PSPtoRemove) > 0) {
    extracted$productType[PSPtoRemove] = paste(extracted$productType[PSPtoRemove], "_synErrorPrecursor", sep = "")
  }

  allExtracted = rbind(allExtracted, extracted)
}

extracted = allExtracted
subControls = subControls0

# Also, check if substrate controls are missing for any substrates
controlSubstrates <- unique(subControls$substrateID)
extractedSubstrates <- unique(extracted$substrateID)
missingSubIndex <- which(!(extractedSubstrates %in% controlSubstrates))

# remove synthesis errors
if (keep_synErrors == "no") {
  if (length(missingSubIndex) > 0) {
    print("WARNING! The following substrates do not have any substrate control measurements:")
    print(extractedSubstrates[missingSubIndex])
    print("They are being removed from your table!")
    extracted <- extracted[-which(extracted$substrateID == extractedSubstrates[missingSubIndex[i]]), ]
  } else {
    print("Success! All substrates in extracted are covered by substrate control measurements!")
  }

  print("Removing synthesis errors from the final dataset")
  kk = which(str_detect(extracted$productType, "_synError"))
  if (length(kk) > 0) {
    extracted = extracted[-kk, ]
    print("........................................................")
    paste0("number of synthesis errors that are being removed: ", length(kk)) %>% print()
    print("........................................................")
  }
}

### OUTPUT ###
save(extracted, file = unlist(snakemake@output[["PSMs"]]))
```
```r
library(seqinr)
source("SOURCE/loadFunctions.r")

print("-----------------------------")
print("4) MAPPING AND POSTPROCESSING")
print("-----------------------------")

### INPUT ###
load(snakemake@input[["PSMs"]])
load(snakemake@input[["allPSMs_Delta"]])

### MAIN PART ###
d = mapping(extracted)
d_delta = mapping(allPSMs_Delta)

### OUTPUT ###
write.csv(d, file = unlist(snakemake@output[["ProteasomeDB"]]), row.names = F)
write.csv(d_delta, file = unlist(snakemake@output[["DeltaPeptides"]]), row.names = F)
```
```r
library(dplyr)
library(stringr)
library(grid)
library(ggplot2)
library(gridExtra)
theme_set(theme_bw())

source("SOURCE/invitroSPI_utils.R")
source("SOURCE/plottingFunctions.R")

print("-------------------------------")
print("5) GENERATE DATABASE STATISTICS")
print("-------------------------------")

### INPUT ###
ProteasomeDB = read.csv(snakemake@input[["ProteasomeDB"]], stringsAsFactors = F)

S = ProteasomeDB$substrateID %>% unique()
tps = ProteasomeDB$digestTime %>% unique() %>% as.numeric() %>% sort()

### MAIN PART ###

# ----- 1) pre-processing -----
Peps = ProteasomeDB %>%
  remSynthErrors() %>%
  ILredundancy() %>%
  filterPepLength() %>%
  disentangleMultimappers.Type() %>%
  disentangleMultimappers.SRlen() %>%
  disentangleMultimappers.IVSeqLen() %>%
  disentangleMultimappers.AA() %>%
  DBcosmetics()

ProteasomeDB = ProteasomeDB %>% remSynthErrors()

# ----- 2) general stats + peptides over time -----
uniquePeps = Peps %>% uniquePeptides()

plotNumberofPeptides(ProteasomeDB,
                     outname = unlist(snakemake@output[["number_of_products"]]))

pdf(file = unlist(snakemake@output[["DB_stats"]]), height = 4, width = 6)
generalStats(uniquePeps, tp = "all") %>% grid.table()
for (t in 1:length(tps)) {
  cntDB = Peps[Peps$digestTime == tps[t], ] %>% uniquePeptides()
  grid.newpage()
  generalStats(DB = cntDB, tp = tps[t]) %>% grid.table()
}
dev.off()

# ----- 3) coverage maps -----
# pdf(unlist(snakemake@output[["coverage_map"]]), height = 12, width = 16)
# for (s in 1:length(S)) {
#   cntDB = uniquePeps[uniquePeps$substrateID == S[s], ]
#   out = plotCoverage(cntDB, name = S[s], tp = "all")
#   replayPlot(out)
# }
# dev.off()

# ----- 4) length distributions + product type frequencies -----
pdf(file = unlist(snakemake@output[["length_distributions"]]), height = 12, width = 16)
freq = TypeFrequency(uniquePeps)
pl = PepLength(uniquePeps, tp = "all")
srlen = SRLength(uniquePeps, tp = "all")
ivlen = IVSeqLength(uniquePeps, tp = "all")
cntG = grid.arrange(freq, pl, srlen, ivlen, nrow = 2, ncol = 2)
print(cntG)

for (t in 1:length(tps)) {
  cntDB = uniquePeps[uniquePeps$digestTime == tps[t], ]
  freq = TypeFrequency(cntDB)
  pl = PepLength(cntDB, tp = tps[t])
  srlen = SRLength(cntDB, tp = tps[t])
  ivlen = IVSeqLength(cntDB, tp = tps[t])
  cntG = grid.arrange(freq, pl, srlen, ivlen, nrow = 2, ncol = 2)
  print(cntG)
}
dev.off()
```
```r
# SOURCE/_generateSearchDB_polypeptides.R:
# computation of all possible spliced and non-spliced peptides for the Mascot search

library(data.table)
library(parallel)
library(dplyr)
library(seqinr)
library(stringr)
library(beepr)

# numCPU = if (detectCores() > 8) detectCores()-1 else if (detectCores() >= 4) 4 else 1
# cl <- makeForkCluster(numCPU)

# ----- user parameters -----
nmers = c(5, 40)
generateTrans = T  # if false, generate only cis peptides
outdir = "../DATA/polypeptideDBs/"
suppressWarnings(dir.create(outdir, recursive = T))

# change here (you can read in a dataframe with the name and the sequence)
sample_list = read.csv("~/Documents/SCIENCE/_tools/aSPIre+invitroSPI/data/sample_list_aSPIre.csv",
                       stringsAsFactors = F)
substrates = sample_list %>%
  select(substrateID, substrateSeq) %>%
  unique()

# ----- functions -----
getCis = function(subSeq, N, Lext, L) {
  if (L-1 >= N) {
    allCis = sapply(seq(1, L-N), function(i){
      sapply(seq(i+Lext-1, N-Lext+i-1), function(j){
        sr2 = N-j+i-1
        sapply(seq(j+2, L-sr2+1), function(k){
          n = k+sr2-1
          pepSeq = paste(str_sub(subSeq, start=i, end=j),
                         str_sub(subSeq, start=k, end=n), sep = "")
          positions = paste(i, j, k, n, sep = "_")
          return(c(pepSeq, positions))
        })
      })
    })
    CIS = unlist(allCis)
    cisPep = data.frame(pepSeq = CIS[seq(1, length(CIS), 2)],
                        PositionSubstrate = CIS[seq(2, length(CIS), 2)])
  } else {
    cisPep = NA
  }
  return(cisPep)
}

getRevCis = function(subSeq, L, N, Lext=1) {
  if (L >= N) {
    allRevCis = sapply(seq(1, L-N+1), function(k){
      sapply(seq(k+Lext-1, N-Lext+k-1), function(n){
        sr1 = N-n+k-1
        sapply(seq(n+1, L-sr1+1), function(i){
          j = i+sr1-1
          pepSeq = paste(str_sub(subSeq, start=i, end=j),
                         str_sub(subSeq, start=k, end=n), sep = "")
          positions = paste(i, j, k, n, sep = "_")
          return(c(pepSeq, positions))
        })
      })
    })
    REVCIS = unlist(allRevCis)
    revcisPep = data.frame(pepSeq = REVCIS[seq(1, length(REVCIS), 2)],
                           PositionSubstrate = REVCIS[seq(2, length(REVCIS), 2)])
  } else {
    revcisPep = NA
  }
  return(revcisPep)
}

getAllTrans = function(L, N, subSeq, Lext=1) {
  allSRs = lapply(seq(Lext, N-Lext), function(sr1){
    # SR1
    Is = seq(1, L-sr1+1)
    Js = Is+sr1-1
    # SR2
    Ks = seq(1, L-N+sr1+1)
    Ns = Ks+N-sr1-1

    SR1_pos = paste(Is, Js, sep = "_")
    SR2_pos = paste(Ks, Ns, sep = "_")
    SR1 = str_sub(subSeq, Is, Js)
    SR2 = str_sub(subSeq, Ks, Ns)

    PEP = outer(SR1, SR2, paste, sep="") %>% as.vector()
    positions = outer(SR1_pos, SR2_pos, paste, sep = "_") %>% as.vector()
    return(data.frame(pepSeq = PEP,
                      PositionSubstrate = positions))
  })
  transPep = plyr::ldply(allSRs)
  return(transPep)
}

getPCP = function(L, subSeq, N) {
  if (L > N) {
    allPCP = sapply(seq(1, L-N+1), function(i){
      j = N+i-1
      positions = paste(i, j, sep = "_")
      pepSeq = str_sub(subSeq, i, j)
      return(c(pepSeq, positions))
    })
    PCP = unlist(allPCP)
    PCP_Pep = data.frame(pepSeq = PCP[1, ],
                         PositionSubstrate = PCP[2, ])
  } else {
    PCP_Pep = NA
  }
  return(PCP_Pep)
}

# ----- compute DB -----
for (i in 1:nrow(substrates)) {

  Name = substrates$substrateID[i]
  print("...............")
  print(Name)
  print("...............")

  # get substrate sequence and length
  subSeq = substrates$substrateSeq[i]
  if (grepl(".fasta", subSeq)) {
    subSeq = read.fasta(file = subSeq, seqtype = "AA", strip.desc = T, as.string = T) %>%
      unlist() %>%
      suppressWarnings()
  }
  L = nchar(subSeq)

  # initiate empty fasta
  write.fasta(sequences = subSeq, names = Name,
              file.out = paste0(outdir, Name, "_allPSP_0.fasta"), open = "w")

  start_time = Sys.time()

  # generate peptides
  if (generateTrans) {
    print("generating cis and trans-spliced peptides.....")
    # all cis and trans
    allPSP <- lapply(seq(nmers[1], nmers[2]), function(N){
      paste0("generating ", N, "-mers") %>% print()
      DB = getAllTrans(subSeq = subSeq, L = L, N = N, Lext = 1)
      write.fasta(as.list(DB$pepSeq), names = paste0("PSP_", DB$PositionSubstrate),
                  file.out = paste0(outdir, Name, "_allPSP_0.fasta"), open = "a")
    })

    print("generating non-spliced peptides.....")
    Nmax = if (nmers[2] >= L-1) L-1 else nmers[2]
    allPCP <- lapply(seq(nmers[1], Nmax), function(N){
      paste0("generating ", N, "-mers") %>% print()
      DB = getPCP(subSeq = subSeq, L = L, N = N)
      write.fasta(as.list(DB$pepSeq), names = paste0("PSP_", DB$PositionSubstrate),
                  file.out = paste0(outdir, Name, "_allPSP_0.fasta"), open = "a")
    })

  } else {
    print("generating cis-spliced peptides.....")
    # forward cis
    allCis <- lapply(seq(nmers[1], nmers[2]), function(N){
      paste0("generating fwd cis ", N, "-mers") %>% print()
      DB = getCis(subSeq = subSeq, L = L, N = N, Lext = 1)
      write.fasta(as.list(DB$pepSeq), names = paste0("PSP_", DB$PositionSubstrate),
                  file.out = paste0(outdir, Name, "_allPSP_0.fasta"), open = "a")
    })
    # reverse cis
    allRevCis <- lapply(seq(nmers[1], nmers[2]), function(N){
      paste0("generating rev cis ", N, "-mers") %>% print()
      DB = getRevCis(subSeq = subSeq, L = L, N = N, Lext = 1)
      write.fasta(as.list(DB$pepSeq), names = paste0("PSP_", DB$PositionSubstrate),
                  file.out = paste0(outdir, Name, "_allPSP_0.fasta"), open = "a")
    })
  }

  end_time = Sys.time()
  paste0("..... finished after: ", difftime(end_time, start_time, units = "s")) %>% print()
}

print("FINISHED!")
beep("mario")
```
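To see the enumeration logic in isolation, here is a minimal, self-contained sketch of the non-spliced (PCP) case only; `getPCP_demo` and the toy substrate are illustrative names, not part of invitroSPI. For a substrate of length `L`, there are `L - N + 1` non-spliced `N`-mers, one per start position:

```r
# Minimal sketch of non-spliced (PCP) enumeration in base R.
# getPCP_demo and the toy substrate are illustrative, not part of the pipeline.
getPCP_demo = function(subSeq, N) {
  L = nchar(subSeq)
  # one peptide per start position i; its end position is j = N + i - 1
  sapply(seq(1, L - N + 1), function(i) substr(subSeq, i, N + i - 1))
}

print(getPCP_demo("ABCDEFG", 5))  # "ABCDE" "BCDEF" "CDEFG"
```

The spliced generators above follow the same coordinate convention, but enumerate two splice reactants (`i_j` and `k_n`) instead of a single contiguous window.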
```r
# invitroSPI_utils.R: helper functions for downstream analyses

library(dplyr)
library(stringr)

########## plotting colours ##########
plottingCols = c(
  PCP = "darkorange",
  PSP = "dodgerblue",
  cis = "darkslateblue",
  revCis = "lightskyblue",
  trans = "darkorchid",
  allcis = "darkslateblue",
  ProteasomeDB = "olivedrab4",
  randomDB = "lightsteelblue"
)

########## disentangle rows containing multimappers ##########

disentangleMultimappers.Type = function(DB) {
  print("DISENTANGLE MULTI-MAPPERS FOR PRODUCT TYPE")
  k = which(str_detect(DB$positions, coll(";")))
  if (length(k) > 0) {
    DB_mm = DB[k, ]
    DB_nomm = DB[-k, ]

    mm = list()
    pb = txtProgressBar(min = 0, max = nrow(DB_mm), style = 3)
    for (r in 1:nrow(DB_mm)) {
      setTxtProgressBar(pb, r)
      cnt_types = str_split(DB_mm$spliceType[r], coll(";"), simplify = T) %>% paste()
      cnt_pos = str_split(DB_mm$positions[r], coll(";"), simplify = T) %>% paste()
      cntDB = DB_mm[r, ]

      # all multi-mappers correspond to the same product type
      if (cnt_types %>% unique() %>% length() == 1) {
        cntDB$spliceType = cnt_types %>% unique()

      # in case of cis and trans --> keep cis
      } else if (any("trans" %in% cnt_types)) {
        x = which(cnt_types == "trans")
        cnt_types = cnt_types[-x]
        cnt_pos = cnt_pos[-x]
        if (cnt_types %>% unique() %>% length() == 1) {
          cntDB$spliceType = cnt_types %>% unique()
        } else {
          cntDB$spliceType = "type_multi-mapper"
        }
        cntDB$positions = cnt_pos %>% paste(collapse = ";")

      } else {
        cntDB$spliceType = "type_multi-mapper"
        cntDB$positions = cnt_pos %>% paste(collapse = ";")
      }

      mm[[r]] = cntDB
    }
    mm = plyr::ldply(mm)
    DB = rbind(DB_nomm, mm) %>% as.data.frame()
  }
  # beep("mario")
  return(DB)
}

disentangleMultimappers.AA = function(DB, retSinglePos = T) {
  print("DISENTANGLE MULTI-MAPPERS FOR AA AT sP1 AND sP1'")
  DB$AAmultimapper = "no"

  # only considering PSP multi-mappers
  k = which(str_detect(DB$positions, coll(";")) & !str_detect(DB$productType, "PCP"))
  if (length(k) > 0) {
    DB_mm = DB[k, ]
    DB_nomm = DB[-k, ]

    pb = txtProgressBar(min = 0, max = nrow(DB_mm), style = 3)
    for (r in 1:nrow(DB_mm)) {
      setTxtProgressBar(pb, r)
      cnt_pos = strsplit(DB_mm$positions[r], ";") %>%
        unlist() %>%
        str_split_fixed(pattern = coll("_"), n = Inf)

      sP1 = str_sub(DB_mm$substrateSeq[r], start = cnt_pos[, 2], end = cnt_pos[, 2])
      sP1d = str_sub(DB_mm$substrateSeq[r], start = cnt_pos[, 3], end = cnt_pos[, 3])

      if ((length(unique(sP1)) > 1) | (length(unique(sP1d)) > 1)) {
        DB_mm$AAmultimapper[r] = "yes"
      } else if (retSinglePos) {
        DB_mm$positions[r] = paste(cnt_pos[1, c(1:4)], collapse = "_")
      }
    }
    DB = rbind(DB_nomm, DB_mm) %>% as.data.frame()
  }
  # beep("mario")
  return(DB)
}

disentangleMultimappers.SRlen = function(DB, retSinglePos = T) {
  print("DISENTANGLE MULTI-MAPPERS FOR SR LENGTH")
  DB$SRmultimapper = "no"

  # only considering PSP multi-mappers
  k = which(str_detect(DB$positions, coll(";")) & !str_detect(DB$productType, "PCP"))
  if (length(k) > 0) {
    DB_mm = DB[k, ]
    DB_nomm = DB[-k, ]

    pb = txtProgressBar(min = 0, max = nrow(DB_mm), style = 3)
    for (r in 1:nrow(DB_mm)) {
      setTxtProgressBar(pb, r)
      cnt_pos = strsplit(DB_mm$positions[r], ";") %>%
        unlist() %>%
        str_split_fixed(pattern = coll("_"), n = Inf)

      SR1len = as.numeric(cnt_pos[, 2]) - as.numeric(cnt_pos[, 1]) + 1
      SR2len = as.numeric(cnt_pos[, 4]) - as.numeric(cnt_pos[, 3]) + 1

      if ((length(unique(SR1len)) > 1) | (length(unique(SR2len)) > 1)) {
        DB_mm$SRmultimapper[r] = "yes"
      } else if (retSinglePos) {
        DB_mm$positions[r] = paste(cnt_pos[1, c(1:4)], collapse = "_")
      }
    }
    DB = rbind(DB_nomm, DB_mm) %>% as.data.frame()
  }
  # beep("mario")
  return(DB)
}

disentangleMultimappers.IVSeqLen = function(DB, retSinglePos = T) {
  print("DISENTANGLE MULTI-MAPPERS FOR INTERVENING SEQUENCE LENGTH")
  DB$IVseqmultimapper = "no"

  # only considering PSP multi-mappers
  k = which(str_detect(DB$positions, coll(";")) & !str_detect(DB$productType, "PCP"))
  if (length(k) > 0) {
    DB_mm = DB[k, ]
    DB_nomm = DB[-k, ]

    pb = txtProgressBar(min = 0, max = nrow(DB_mm), style = 3)
    for (r in 1:nrow(DB_mm)) {
      setTxtProgressBar(pb, r)
      cnt_pos = strsplit(DB_mm$positions[r], ";") %>%
        unlist() %>%
        str_split_fixed(pattern = coll("_"), n = Inf)
      cnt_types = str_split(DB_mm$spliceType[r], coll(";"), simplify = T) %>% paste()

      ivLens = rep(NA, nrow(cnt_pos))
      for (p in 1:nrow(cnt_pos)) {
        # cis and trans (positions are compared numerically, not as strings)
        if (as.numeric(cnt_pos[p, 3]) >= as.numeric(cnt_pos[p, 1])) {
          ivLens[p] = abs(as.numeric(cnt_pos[p, 3]) - as.numeric(cnt_pos[p, 2])) - 1
        # revCis
        } else {
          ivLens[p] = abs(as.numeric(cnt_pos[p, 1]) - as.numeric(cnt_pos[p, 4])) - 1
        }
      }

      if (length(unique(ivLens)) > 1) {
        DB_mm$IVseqmultimapper[r] = "yes"
      } else if (retSinglePos) {
        DB_mm$positions[r] = paste(cnt_pos[1, c(1:4)], collapse = "_")
      }
    }
    DB = rbind(DB_nomm, DB_mm) %>% as.data.frame()
  }
  # beep("mario")
  return(DB)
}

removeMultimappers.Type = function(DB) {
  print("REMOVE PEPTIDES THAT CAN SOLELY BE MULTI-MAPPERS FOR PRODUCT TYPE")
  k = which(DB$spliceType == "type_multi-mapper")
  if (length(k) > 0) {
    DB = DB[-k, ]
  }
  return(DB)
}

removeMultimappers.AA = function(DB) {
  print("REMOVE PEPTIDES THAT ARE MULTI-MAPPERS IN TERMS OF AA AT sP1 OR sP1'")
  k = which(DB$AAmultimapper == "yes")
  if (length(k) > 0) {
    DB = DB[-k, ]
  }
  return(DB)
}

removeMultimappers.SRlen = function(DB) {
  print("REMOVE PEPTIDES THAT ARE MULTI-MAPPERS IN TERMS OF SR LENGTH")
  k = which(DB$SRmultimapper == "yes")
  if (length(k) > 0) {
    DB = DB[-k, ]
  }
  return(DB)
}

removeMultimappers.IVSeqLen = function(DB) {
  print("REMOVE PEPTIDES THAT ARE MULTI-MAPPERS IN TERMS OF INTERVENING SEQUENCE LENGTH")
  k = which(DB$IVseqmultimapper == "yes")
  if (length(k) > 0) {
    DB = DB[-k, ]
  }
  return(DB)
}

########## filter for peptide length ##########
filterPepLength = function(DB, cutoff=5) {
  print(paste0("REMOVE PEPTIDES THAT ARE SHORTER THAN ", cutoff, " AA"))
  k = which(nchar(DB$pepSeq) < cutoff)
  if (length(k) > 0) {
    DB = DB[-k, ]
  }
  return(DB)
}

########## synthesis errors ##########
remSynthErrors = function(DB) {
  print("REMOVE SYNTHESIS ERRORS")
  k = which(str_detect(DB$productType, "synError"))
  if (length(k) > 0) {
    DB = DB[-k, ]
  }
  return(DB)
}

keepSynthErrors = function(DB) {
  print("FILTER FOR SYNTHESIS ERRORS")
  k = which(str_detect(DB$productType, "synError"))
  if (length(k) > 0) {
    DB = DB[k, ]
  }
  return(DB)
}

########## early time points only ##########
filterEarlyTimepoints = function(DB) {
  print("FILTER FOR EARLY TIME POINTS")
  DB = DB[which(DB$digestTime %in% c(2, 4)), ]
  return(DB)
}

########## 20S standard proteasome only ##########
filter20Sstandard = function(DB) {
  print("FILTER FOR 20S STANDARD PROTEASOME")
  if ("protIsotype" %in% names(DB)) {
    DB = DB[DB$protIsotype %in% c("20S standard", "20S K562"), ]
  }
  return(DB)
}

########## I/L redundancy in product sequences ##########
ILredundancy = function(DB) {
  print("REPLACE ALL ISOLEUCINES BY LEUCINES")
  DB$pepSeq = str_replace_all(DB$pepSeq, "I", "L")
  DB$substrateSeq = str_replace_all(DB$substrateSeq, "I", "L")
  return(DB)
}

########## unique product sequences ##########
uniquePeptides = function(DB) {
  print("UNIQUE PEPTIDES")
  DB = DB %>%
    distinct(substrateID, substrateSeq, pepSeq, productType, #spliceType,
             positions, .keep_all = T)
  # DB2 = DB %>%
  #   group_by(substrateID, substrateSeq, pepSeq, productType,
  #            spliceType, positions) %>% slice(1)
  # DB3 = DB[, c("substrateID", "substrateSeq", "pepSeq",
  #              "productType", "spliceType", "positions")] %>%
  #   unique()
  # DB4 = DB[!duplicated(DB[, c("substrateID", "substrateSeq", "pepSeq",
  #                             "productType", "spliceType", "positions")]), ]
  return(DB)
}

uniquePeptides.perTP = function(DB) {
  print("UNIQUE PEPTIDES FOR EACH TIME POINT")
  DB = DB %>%
    distinct(substrateID, substrateSeq, digestTime, pepSeq, productType,
             spliceType, positions, .keep_all = T)
  return(DB)
}

########## cosmetics ##########
DBcosmetics = function(DB) {
  print("COSMETICS AND STATS")
  if ("X" %in% names(DB)) {
    DB$X = NULL
  }

  # replace NA in splice type column of PCPs
  DB$spliceType[str_detect(DB$productType, "PCP")] = "PCP"

  # # statistics
  # print(paste0("number of identified products: ", nrow(DB)))
  # print(table(DB$spliceType))
  # print(table(DB$spliceType) / nrow(DB))
  # print("substrates")
  # print(paste0("number of substrates: ", DB$substrateSeq %>% unique() %>% length()))
  # DB$substrateID %>%
  #   unique() %>%
  #   paste(collapse = ", ") %>%
  #   print()

  return(DB)
}

########## load random DBs ##########
loadRandom = function(fpath) {
  print("LOAD RANDOM DATABASES")
  # read cis, revCis, trans and PCP
  # concatenate and return list of data frames
  cis = read.csv(fpath[str_detect(fpath, "_cis")], stringsAsFactors = F)
  revCis = read.csv(fpath[str_detect(fpath, "_revCis")], stringsAsFactors = F)
  trans = read.csv(fpath[str_detect(fpath, "_trans")], stringsAsFactors = F)
  if (any(str_detect(fpath, "_PCP"))) {
    PCP = read.csv(fpath[str_detect(fpath, "_PCP")], stringsAsFactors = F)
  } else {
    PCP = NA
  }
  random = list(cis = cis, revCis = revCis, trans = trans, PCP = PCP)
  return(random)
}
```
```r
# SOURCE/plottingFunctions.R: plotting helpers for the database statistics
# (assumes dplyr, stringr, ggplot2, grid and gridExtra are loaded by the calling script)

generalStats = function(DB, tp) {
  cntTBL = table(DB$spliceType) %>% as.data.frame()
  cntTBL = cbind(cntTBL, (table(DB$spliceType) / nrow(DB)) %>% as.data.frame())
  names(cntTBL) = c(paste0("time point: ", tp), "# peptides", "type", "frequency")
  cntTBL$frequency = round(cntTBL$frequency, 4)
  return(cntTBL)
}

# ----- number of peptides over time -----
getNumberofPeptides = function(ProteasomeDB) {
  DB_filtered = ProteasomeDB %>%
    filterPepLength() %>%
    disentangleMultimappers.Type() %>%
    DBcosmetics()

  # fraction of PSMs
  psms = DB_filtered %>%
    group_by(substrateID, spliceType, digestTime) %>%
    summarise(n = n()) %>%
    ungroup() %>%
    group_by(substrateID, digestTime) %>%
    mutate(freq = n / sum(n))

  psms_allTP = DB_filtered %>%
    group_by(substrateID, spliceType) %>%
    summarise(n = n()) %>%
    ungroup() %>%
    group_by(substrateID) %>%
    mutate(freq = n / sum(n)) %>%
    mutate(digestTime = 0)

  # number + frequency of unique peptides
  peps = DB_filtered %>%
    uniquePeptides() %>%
    group_by(substrateID, spliceType, digestTime) %>%
    summarise(n = n()) %>%
    ungroup() %>%
    group_by(substrateID, digestTime) %>%
    mutate(freq = n / sum(n))

  peps_allTP = DB_filtered %>%
    uniquePeptides() %>%
    group_by(substrateID, spliceType) %>%
    summarise(n = n()) %>%
    ungroup() %>%
    group_by(substrateID) %>%
    mutate(freq = n / sum(n)) %>%
    mutate(digestTime = 0)

  return(list(psms = rbind(psms, psms_allTP[, names(psms)]),
              peps = rbind(peps, peps_allTP[, names(peps)])))
}

plotNumberofPeptides = function(ProteasomeDB, outname) {
  res = getNumberofPeptides(ProteasomeDB)
  psms = res$psms
  peps = res$peps

  if (length(unique(ProteasomeDB$digestTime)) == 1) {
    psms = psms[psms$digestTime == 0, ]
    peps = peps[peps$digestTime == 0, ]
  }

  # number of PSMs over time
  psms$digestTime = factor(psms$digestTime,
                           levels = sort(unique(as.numeric(psms$digestTime))))
  psms$spliceType = factor(psms$spliceType,
                           levels = c("PCP", "cis", "revCis", "trans", "type_multi-mapper"))

  theme_set(theme_classic())
  a = ggplot(psms, aes(x = digestTime, y = n, col = spliceType)) +
    geom_boxplot(position = position_dodge(width = 0.8)) +
    geom_point(alpha = .3, position = position_dodge(width = 0.8)) +
    scale_y_continuous(trans = 'log10') +
    scale_x_discrete(labels = gsub("^0$", "all", sort(unique(psms$digestTime)))) +
    scale_color_manual("product type",
                       values = c(plottingCols["PCP"], plottingCols["cis"],
                                  plottingCols["revCis"], plottingCols["trans"], "gray"),
                       labels = c("non-spliced", "forward cis", "reverse cis",
                                  "trans", "multi-mapper")) +
    ggtitle("number of PSMs") +
    ylab("# PSMs") +
    xlab("digestion time")

  a2 = ggplot(psms, aes(x = digestTime, y = freq, col = spliceType)) +
    geom_boxplot(position = position_dodge(width = 0.8)) +
    scale_x_discrete(labels = gsub("^0$", "all", sort(unique(psms$digestTime)))) +
    scale_color_manual("product type",
                       values = c(plottingCols["PCP"], plottingCols["cis"],
                                  plottingCols["revCis"], plottingCols["trans"], "gray"),
                       labels = c("non-spliced", "forward cis", "reverse cis",
                                  "trans", "multi-mapper")) +
    ylim(c(0, 1)) +
    ggtitle("frequency of PSMs") +
    ylab("frequency") +
    xlab("digestion time")

  peps$digestTime = factor(peps$digestTime,
                           levels = sort(unique(as.numeric(peps$digestTime))))
  peps$spliceType = factor(peps$spliceType,
                           levels = c("PCP", "cis", "revCis", "trans", "type_multi-mapper"))

  b = ggplot(peps, aes(x = digestTime, y = n, col = spliceType)) +
    geom_boxplot(position = position_dodge(width = 0.8)) +
    geom_point(alpha = .3, position = position_dodge(width = 0.8)) +
    scale_y_continuous(trans = 'log10') +
    scale_x_discrete(labels = gsub("^0$", "all", sort(unique(peps$digestTime)))) +
    scale_color_manual("product type",
                       values = c(plottingCols["PCP"], plottingCols["cis"],
                                  plottingCols["revCis"], plottingCols["trans"], "gray"),
                       labels = c("non-spliced", "forward cis", "reverse cis",
                                  "trans", "multi-mapper")) +
    ggtitle("number of unique peptides") +
    ylab("# unique peptides") +
    xlab("digestion time")

  b2 = ggplot(peps, aes(x = digestTime, y = freq, col = spliceType)) +
    geom_boxplot(position = position_dodge(width = 0.8)) +
    scale_x_discrete(labels = gsub("^0$", "all", sort(unique(peps$digestTime)))) +
    scale_color_manual("product type",
                       values = c(plottingCols["PCP"], plottingCols["cis"],
                                  plottingCols["revCis"], plottingCols["trans"], "gray"),
                       labels = c("non-spliced", "forward cis", "reverse cis",
                                  "trans", "multi-mapper")) +
    ylim(c(0, 1)) +
    ggtitle("frequency of unique peptides") +
    ylab("frequency") +
    xlab("digestion time")

  p = grid.arrange(a, b, a2, b2, nrow = 2, ncol = 2)
  ggsave(filename = outname, plot = p, device = cairo_pdf,
         height = 10, width = 14, dpi = "retina")
}

# ----- coverage maps -----
plotCoverage = function(ProteasomeDB, tp, name) {
  k = which(str_detect(ProteasomeDB$positions, ";"))
  if (length(k) > 0) {
    print("removing all multi-mappers")
    ProteasomeDB = ProteasomeDB[-k, ]
  }

  pcp_pos = ProteasomeDB$positions[ProteasomeDB$productType == "PCP"]
  psp_pos = ProteasomeDB$positions[ProteasomeDB$productType == "PSP"]
  S = ProteasomeDB$substrateSeq[1]

  # --- PCPs ---
  pcp_pos = str_split(pcp_pos, "_", simplify = T)
  pcp_pos = apply(pcp_pos, 2, as.numeric) %>% as.data.frame()
  pcp_pos = pcp_pos[order(pcp_pos$V1, decreasing = F), ]

  par(mfrow = c(2, 1))
  y0 = 0.1
  plot(NULL, ylim = c(0, nrow(pcp_pos)*0.05), ylab = "", xlab = "substrate",
       xlim = c(0, nchar(S)), main = paste0(name, ": ", tp, " hrs"), axes = F)
  for (i in 1:nrow(pcp_pos)) {
    if (length(which(pcp_pos$V1[1:i] <= pcp_pos$V1[i] &
                       pcp_pos$V2[1:i] >= pcp_pos$V1[i])) > 1) {
      cnt.intercept = length(which(pcp_pos$V1[1:i] <= pcp_pos$V1[i] &
                                     pcp_pos$V2[1:i] >= pcp_pos$V1[i]))*0.1
    } else {
      cnt.intercept = y0
    }
    segments(x0 = pcp_pos$V1[i], y0 = cnt.intercept,
             x1 = pcp_pos$V2[i], y1 = cnt.intercept,
             col = plottingCols["PCP"], lwd = .5)
  }
  xlabel = c("0", paste(S %>% strsplit("") %>% unlist(), seq(1, nchar(S)), sep = ""))
  text(x = seq(0, nchar(S)), par("usr")[3]-.3, labels = xlabel, srt = 90, xpd = T, cex = .5)
  axis(2, labels = NA)

  # --- PSPs ---
  psp_pos = str_split(psp_pos, "_", simplify = T)
  psp_pos = apply(psp_pos, 2, as.numeric) %>% as.data.frame()
  psppos = psp_pos[order(psp_pos$V1, decreasing = F), ]
  psp_pos = rbind(as.matrix(psppos[, c(1:2)]),
                  as.matrix(psppos[, c(3:4)])) %>% as.data.frame()

  y0 = 0.1
  plot(NULL, ylim = c(0, nrow(psp_pos)*0.05), ylab = "", xlab = "substrate",
       xlim = c(0, nchar(S)), main = paste0(name, ": ", tp, " hrs"), axes = F)
  for (i in 1:nrow(psp_pos)) {
    if (length(which(psp_pos$V1[1:i] <= psp_pos$V1[i] &
                       psp_pos$V2[1:i] >= psp_pos$V1[i])) > 1) {
      cnt.intercept = length(which(psp_pos$V1[1:i] <= psp_pos$V1[i] &
                                     psp_pos$V2[1:i] >= psp_pos$V1[i]))*0.1
    } else {
      cnt.intercept = y0
    }
    segments(x0 = psp_pos$V1[i], y0 = cnt.intercept,
             x1 = psp_pos$V2[i], y1 = cnt.intercept,
             col = plottingCols["PSP"], lwd = .5)
  }
  xlabel = c("0", paste(S %>% strsplit("") %>% unlist(), seq(1, nchar(S)), sep = ""))
  text(x = seq(0, nchar(S)), par("usr")[3]-.3, labels = xlabel, srt = 90, xpd = T, cex = .5)
  axis(2, labels = NA)

  out = recordPlot()
  return(out)
}

# ----- length distributions -----
# function for violin plots
ViolinPlot = function(data) {
  # print stats
  s = data %>%
    dplyr::group_by(type) %>%
    dplyr::summarise(n = dplyr::n(),
                     mean = mean(value),
                     median = median(value),
                     std = sd(value))
  print.data.frame(s)

  theme_set(theme_classic())
  # plotting
  data$type = factor(data$type, levels = c("PCP", "cis", "revCis", "trans"))
  sc = ggplot(data = data, aes(x = type, y = value, fill = type)) +
    geom_violin(size = .1) +
    stat_summary(fun = median, fun.min = median, fun.max = median,
                 geom = "crossbar", width = .2,
                 position = position_dodge(width = 4)) +
    xlab("product type") +
    scale_fill_manual("product type",
                      values = c(plottingCols["PCP"], plottingCols["cis"],
                                 plottingCols["revCis"], plottingCols["trans"]),
                      labels = c("non-spliced", "forward cis", "reverse cis", "trans")) +
    theme(axis.text.x = element_text(angle = 90),
          legend.position = "none")
  # add labs and title downstream
  return(sc)
}

# peptide length
PepLength = function(DB, tp) {
  print("PEPTIDE LENGTH")
  DB = removeMultimappers.Type(DB)
  pl = data.frame(value = nchar(DB$pepSeq), type = DB$spliceType)
  pepLen = ViolinPlot(data = pl) +
    ylab("peptide product length (aa residues)") +
    scale_x_discrete(labels = c("non-spliced", "forward cis", "reverse cis", "trans")) +
    # scale_y_continuous(limits = c(0, 40), breaks = seq(0, 40, by = 5)) +
    ggtitle("peptide length distribution", subtitle = paste0(tp, " hrs"))
  return(pepLen)
}

# SR length
SRLength = function(DB, tp) {
  print("SPLICE-REACTANT LENGTH")
  DB = DB[str_detect(DB$productType, "PSP"), ] %>%
    removeMultimappers.Type() %>%
    removeMultimappers.SRlen()

  srlen = function(db) {
    pos = str_split_fixed(db$positions, pattern = "_", n = Inf)
    sr = data.frame(value_sr1 = as.numeric(pos[, 2]) - as.numeric(pos[, 1]) + 1,
                    value_sr2 = as.numeric(pos[, 4]) - as.numeric(pos[, 3]) + 1,
                    type = db$spliceType)
    sr$type = factor(sr$type, levels = c("cis", "revCis", "trans"))
    return(sr)
  }

  sr = srlen(db = DB)
  sr1 = sr[, c("value_sr1", "type")]
  names(sr1)[1] = "value"
  sr2 = sr[, c("value_sr2", "type")]
  names(sr2)[1] = "value"

  sr1.plot = ViolinPlot(data = sr1) +
    ylab("splice reactant length (aa residues)") +
    scale_x_discrete(labels = c("forward cis", "reverse cis", "trans")) +
    # scale_y_continuous(limits = c(0, 40), breaks = seq(0, 40, by = 5)) +
    ggtitle("SR1 length distribution", subtitle = paste0(tp, " hrs"))

  sr2.plot = ViolinPlot(data = sr2) +
    ylab("splice reactant length (aa residues)") +
    scale_x_discrete(labels = c("forward cis", "reverse cis", "trans")) +
    # scale_y_continuous(limits = c(0, 40), breaks = seq(0, 40, by = 5)) +
    ggtitle("SR2 length distribution", subtitle = paste0(tp, " hrs"))

  sr.plot = arrangeGrob(grobs = list(sr1.plot, sr2.plot), ncol = 2)
  return(sr.plot)
}

# intervening sequence length
IVSeqLength = function(DB, tp) {
  print("INTERVENING SEQUENCE LENGTH")
  DB = DB[str_detect(DB$productType, "PSP"), ] %>%
    removeMultimappers.Type() %>%
    removeMultimappers.IVSeqLen()

  ivlen = function(db) {
    # intervening sequence length
    # calculated as in Specht et al., Paes et al. and SPI-ART
    pos = str_split_fixed(db$positions, pattern = "_", n = Inf)
    iv = rep(NA, nrow(db))
    cistrans.idx = which(db$spliceType %in% c("cis", "trans"))
    iv[cistrans.idx] = abs(as.numeric(pos[cistrans.idx, 3]) -
                             as.numeric(pos[cistrans.idx, 2])) - 1
    revcis.idx = which(db$spliceType == "revCis")
    iv[revcis.idx] = abs(as.numeric(pos[revcis.idx, 1]) -
                           as.numeric(pos[revcis.idx, 4])) - 1
    ivl = data.frame(value = iv, type = db$spliceType)
    ivl$type = factor(ivl$type, levels = c("cis", "revCis", "trans"))
    return(ivl)
  }

  iv = ivlen(db = DB)
  iv = iv[iv$type != "trans", ]
  iv.plot = ViolinPlot(data = iv) +
    ylab("intervening sequence length (aa residues)") +
    scale_x_discrete(labels = c("forward cis", "reverse cis")) +
    ggtitle("intervening sequence length distribution", subtitle = paste0(tp, " hrs"))
  return(iv.plot)
}

# ----- type frequencies -----
TypeFrequency = function(DB) {
  print("FREQUENCY OF PRODUCT TYPES")
  Freq = DB %>%
    removeMultimappers.Type() %>%
    group_by(spliceType) %>%
    summarise(n = n()) %>%
    mutate(freq = n / sum(n)) %>%
    arrange(desc(spliceType)) %>%
    mutate(ypos = cumsum(freq) - 0.5*freq)
  print(Freq)

  Freq$spliceType = factor(Freq$spliceType, levels = c("PCP", "cis", "revCis", "trans"))
  freqs = ggplot(Freq, aes(x = "", y = freq, fill = spliceType)) +
    geom_bar(stat = "identity", width = 1) +
    coord_polar("y", start = 0) +
    scale_fill_manual("product type",
                      values = c(plottingCols["PCP"], plottingCols["cis"],
                                 plottingCols["revCis"], plottingCols["trans"]),
                      labels = c("non-spliced", "forward cis", "reverse cis", "trans")) +
    theme_void() +
    geom_text(aes(y = ypos, label = paste0("n = ", n)), size = 4)
  return(freqs)
}
```
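The intervening-sequence computation used above can be checked by hand. A minimal base-R sketch, using assumed toy coordinates (the `"i_j_k_n"` strings below are illustrative, not pipeline output):

```r
# positions are encoded as "i_j_k_n": SR1 spans residues i..j, SR2 spans k..n
pos = as.numeric(strsplit("3_7_10_14", "_")[[1]])

# cis/trans: residues strictly between SR1 end (j = 7) and SR2 start (k = 10)
iv_cis = abs(pos[3] - pos[2]) - 1

# revCis: SR2 precedes SR1 in the substrate, so count residues strictly
# between SR1 start (i) and SR2 end (n)
pos_rev = as.numeric(strsplit("10_14_3_7", "_")[[1]])
iv_rev = abs(pos_rev[1] - pos_rev[4]) - 1

print(c(iv_cis, iv_rev))  # 2 2
```

Both toy peptides skip the same two substrate residues (positions 8 and 9), which is why the two conventions agree here.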