Introduction
wombat-p pipelines is a bioinformatics analysis pipeline that bundles different workflows for the analysis of label-free proteomics data, with the purpose of comparing and benchmarking them. It accepts input files described by the proteomics metadata standard SDRF.
The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers, making installation trivial and results highly reproducible. The Nextflow DSL2 implementation of this pipeline uses one container per process, which makes it much easier to maintain and update software dependencies. We used one of the nf-core templates.
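To illustrate the one-container-per-process layout, here is a minimal DSL2 process sketch. The process name and container tag are invented for this illustration; the `thermorawfileparser` command mirrors the conversion snippet shown further below.

```
// Hypothetical DSL2 process: each process declares its own container,
// so one tool can be updated without touching the rest of the pipeline
process CONVERT_RAW {
    container 'quay.io/example/thermorawfileparser:1.4.0'  // illustrative tag

    input:
    path rawfile

    output:
    path "${rawfile.baseName}.mzML"

    script:
    """
    thermorawfileparser -i "${rawfile}" -b "${rawfile.baseName}.mzML" -f 2
    """
}
```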
Pipeline summary
This work contains four major workflows for the analysis of label-free proteomics data originating from LC-MS experiments:
- MaxQuant + NormalyzerDE
- SearchGUI + Proline + PolySTest
- Compomics tools + FlashLFQ + MSqRob
- Tools from the Trans-Proteomic Pipeline + ROTS
Initialization and parameterization of the workflows is based on tools from the SDRF pipelines and the ThermoRawFileParser, with our own contributions, as well as additional programs from the wombat-p organization [https://github.com/wombat-p/Utilities] and our fork. This includes setting a generalized set of data analysis parameters and the calculation of multiple benchmarks.
Code Snippets
```bash
"""
echo '$foo' > params.json
cp "${fasta_file}" database.fasta
if [[ "${exp_design_file}" != "exp_design.txt" ]]
then
    cp "${exp_design_file}" exp_design.txt
fi
Rscript $baseDir/bin/CalcBenchmarks.R
mv benchmarks.json benchmarks_${workflow}.json
cp stand_pep_quant_merged.csv stand_pep_quant_merged${workflow}.csv
cp stand_prot_quant_merged.csv stand_prot_quant_merged${workflow}.csv
"""
```
```bash
"""
first_line=""
# avoid exp_design ending with .txt
mv "${exp_design}" exp_design.tsv
for file in *.txt
do
    echo \$file
    tail -n +2 "\$file" >> tlfq_ident.tabular
    first_line=\$(head -n1 "\$file")
done
# Use awk to add 3 new columns rep, frac, trep with ones to exp_design file
#awk 'NR==1{print \$0"\trep\tfrac\ttrep"} NR>1{print \$0"\t1\t1\t1"}' "exp_design.tsv" > ExperimentalDesign.tsv
cp exp_design.tsv ExperimentalDesign.tsv
# Remove .raw and .mzml from file names in first column of ExperimentalDesign.tsv
sed -i 's/.mzML//g' ExperimentalDesign.tsv
sed -i 's/.raw//g' ExperimentalDesign.tsv
sed -i 's/.Raw//g' ExperimentalDesign.tsv
sed -i 's/.RAW//g' ExperimentalDesign.tsv
# Add first line to tlfq_ident.tabular
echo "\$first_line" | cat - tlfq_ident.tabular > lfq_ident.tabular
# Needed as path is overwritten when running with singularity
PATH=\$PATH:/usr/local/lib/dotnet:/usr/local/lib/dotnet/tools CONDA_PREFIX=/usr/local FlashLFQ --idt "lfq_ident.tabular" --rep "./" --out ./ --mbr ${parameters.enable_match_between_runs} --ppm ${parameters.precursor_mass_tolerance} --sha ${protein_inference} --thr ${task.cpus}
"""
```
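The snippet above merges several identification tables that share a header: each file's rows are appended with `tail -n +2` (skipping the header), the header is remembered, and it is re-attached exactly once at the end. A minimal self-contained sketch of this merge pattern, using made-up files `a.txt` and `b.txt`:

```shell
#!/bin/sh
# Hypothetical input tables: same header, different rows
printf 'id\tscore\nPEP1\t10\n' > a.txt
printf 'id\tscore\nPEP2\t20\n' > b.txt

first_line=""
: > body.tabular
for file in a.txt b.txt
do
    tail -n +2 "$file" >> body.tabular   # append data rows, skipping the header
    first_line=$(head -n1 "$file")       # remember the (shared) header line
done
# Re-attach the header exactly once
echo "$first_line" | cat - body.tabular > merged.tabular
cat merged.tabular
```

This avoids duplicated header rows in the concatenated table regardless of how many input files there are.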
```bash
"""
first_line=""
# avoid exp_design ending with .txt
mv "${exp_design}" exp_design.tsv
for file in *.txt
do
    echo \$file
    tail -n +2 "\$file" >> tlfq_ident.tabular
    first_line=\$(head -n1 "\$file")
done
# Use awk to add 3 new columns rep, frac, trep with ones to exp_design file
#awk 'NR==1{print \$0"\trep\tfrac\ttrep"} NR>1{print \$0"\t1\t1\t1"}' "exp_design.tsv" > ExperimentalDesign.tsv
cp exp_design.tsv ExperimentalDesign.tsv
# Remove .raw and .mzml from file names in first column of ExperimentalDesign.tsv
sed -i 's/.mzML//g' ExperimentalDesign.tsv
sed -i 's/.raw//g' ExperimentalDesign.tsv
sed -i 's/.Raw//g' ExperimentalDesign.tsv
sed -i 's/.RAW//g' ExperimentalDesign.tsv
# Add first line to tlfq_ident.tabular
echo "\$first_line" | cat - tlfq_ident.tabular > lfq_ident.tabular
FlashLFQ --idt "lfq_ident.tabular" --rep "./" --out ./ --mbr ${parameters.enable_match_between_runs} --ppm ${parameters.precursor_mass_tolerance} --sha ${protein_inference} --thr ${task.cpus}
"""
```
```bash
"""
mzdb2mgf -i "${mzdbfile}" -o "${mzdbfile.baseName}.mgf"
"""
```
```bash
"""
cp "proteinGroups.txt" protein_file.txt
cp "peptides.txt" peptide_file.txt
Rscript $baseDir/bin/runNormalyzer.R --comps="${params.comps}" --method="${parameters.normalization_method}" --exp_design="${exp_file}" --comp_file="${comp_file}"
"""
```
```bash
"""
convertProline=\$(which runPolySTestCLI.R)
echo \$convertProline
convertProline=\$(dirname \$convertProline)
echo \$convertProline
Rscript \${convertProline}/convertFromProline.R ${exp_design} ${proline_res}
sed -i "s/threads: 2/threads: ${task.cpus}/g" pep_param.yml
sed -i "s/threads: 2/threads: ${task.cpus}/g" prot_param.yml
runPolySTestCLI.R pep_param.yml
runPolySTestCLI.R prot_param.yml
"""
```
```bash
"""
if [[ "$map" != "params2sdrf.yml" ]]
then
    cp "${map}" params2sdrf.yml
fi
if [[ "$sdrf" != "no_sdrf" ]]
then
    if [[ "$sdrf" != "sdrf_local.tsv" ]]
    then
        cp "${sdrf}" sdrf_local.tsv
    fi
fi
if [[ "$exp_design" != "no_exp_design" ]]
then
    if [[ "$exp_design" != "exp_design.txt" ]]
    then
        cp "${exp_design}" exp_design.txt
    fi
else
    if [[ "$sdrf" == "no_sdrf" ]]
    then
        printf "raw_file\texp_condition" >> exp_design.txt
        for a in $raws
        do
            printf "\n\$a\tA" >> exp_design.txt
        done
    else
        $baseDir/bin/sdrf2exp_design.py
    fi
fi
if [[ "$sdrf" == "no_sdrf" ]]
then
    $baseDir/bin/exp_design2sdrf.py
fi
if [ "$raws" == "no_raws" ] && [ "$mzmls" == "no_mzmls" ]
then
    # Download all files from column file uri
    echo "Downloading raw files from column file uri\n"
    for a in \$(awk -F '\t' -v column_val='comment[file uri]' '{ if (NR==1) {val=-1; for(i=1;i<=NF;i++) { if (\$i == column_val) {val=i;}}} if(val != -1) { if (NR!=1) print \$val} } ' "$sdrf")
    do
        echo "Downloading \$a\n"
        wget -c -T 100 -t 5 "\$a"
    done
fi
if [[ "$parameters" == "no_params" ]]
then
    printf "params:\n None: \nrawfiles: None\nfastafile: None" > params.yml
elif [[ "$parameters" != "params.yml" ]]
then
    cp "${parameters}" params.yml
fi
echo "See workflow version" > prepare_files.version.txt
cp sdrf_local.tsv sdrf_temp.tsv
"""
```
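The `awk` call above looks up the `comment[file uri]` column by its header name, so the column may sit at any position in the SDRF file. The same header-driven lookup, isolated on a toy tab-separated table (file name and contents are invented for illustration):

```shell
#!/bin/sh
# Toy SDRF-like table; the column of interest is not at a fixed position
printf 'source name\tcomment[file uri]\nsample1\thttp://example.org/a.raw\nsample2\thttp://example.org/b.raw\n' > toy.tsv

# On the header row, find the index of the column whose name matches column_val;
# on all later rows, print that column
awk -F '\t' -v column_val='comment[file uri]' '
    NR==1 { val=-1; for(i=1;i<=NF;i++) if ($i == column_val) val=i; next }
    val != -1 { print $val }
' toy.tsv > uris.txt
cat uris.txt
```

Matching on the header name rather than a hard-coded column index keeps the snippet robust against SDRF files with reordered or additional columns.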
```bash
"""
mkdir ./searchgui_results
for file in *.zip
do
    unzip "\$file" -d ./searchgui_results/
    mv \$(find ./ -type f \\( -name "*.t.xml.gz" -o -name "*.mzid" \\)) ./
    gunzip *.t.xml.gz
    rm -rf ./searchgui_results/*
done
touch import_file_list.txt
all_id_files=\$(find ./ -type f \\( -name "*.t.xml" -o -name "*.mzid" \\))
for file in \$all_id_files
do
    echo "./\$file" >> import_file_list.txt
    sed -i "s/PEPFDR/expected_fdr=${peptide_fdr}/g" "${param_file}"
    sed -i "s/PROTFDR/expected_fdr=${protein_fdr}/g" "${param_file}"
    sed -i "s/NUMPEPS/threshold=${parameters.min_num_peptides}/g" "${param_file}"
    sed -i "s/moz_tol = 5/moz_tol=${prec_tol}/g" "${param_file}"
    sed -i "s/moz_tol_unit = ppm/moz_tol_unit=${prec_ppm}/g" "${param_file}"
done
cp "${param_file}" lfq_param_file.txt
"""
```
```bash
"""
touch quant_exp_design.txt
echo "${exp_design_text}" >> quant_exp_design.txt
"""
```
```bash
"""
cp ${exp_design} quant_exp_design.txt
sed -i 's/raw_file/mzdb_file/g' quant_exp_design.txt
sed -i 's/.raw/.mzDB/g' quant_exp_design.txt
sed -i 's/.mzML/.mzDB/g' quant_exp_design.txt
sed -i 's/.mzml/.mzDB/g' quant_exp_design.txt
sed -i '2,\$s|^|./|' quant_exp_design.txt
# keep first two columns of quant_exp_design.txt
cut -f1,2 quant_exp_design.txt > quant_exp_design.txt.tmp
mv quant_exp_design.txt.tmp quant_exp_design.txt
"""
```
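The sed/cut sequence above rewrites file extensions to `.mzDB`, prefixes the data rows (lines 2 onward) with `./`, and keeps only the first two columns. The same steps on a toy design file (file name and contents invented for illustration):

```shell
#!/bin/sh
printf 'raw_file\texp_condition\textra\nrun1.raw\tA\tx\nrun2.mzML\tB\ty\n' > design.txt

sed -i 's/raw_file/mzdb_file/g' design.txt   # rename the header column
sed -i 's/.raw/.mzDB/g' design.txt           # note: the unescaped '.' matches any character
sed -i 's/.mzML/.mzDB/g' design.txt
sed -i '2,$s|^|./|' design.txt               # prefix data rows (lines 2+) with ./
cut -f1,2 design.txt > design.txt.tmp        # keep only the first two columns
mv design.txt.tmp design.txt
cat design.txt
```

Since `sed` lacks in-place column selection, the `cut` output goes through a temporary file that then replaces the original, mirroring the snippet above.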
```bash
"""
ls -la
thermo2mzdb -i "${rawfile}" -o "${rawfile.baseName}.mzDB"
"""
```
```bash
"""
# Check if the file is an mzML file (brace expansion such as *.{mzML,mzml}
# is not performed inside [[ ]], so each spelling is tested explicitly)
if [[ "${rawfile}" == *.mzML || "${rawfile}" == *.mzml ]]
then
    # check if same file
    if [[ "${rawfile}" != "${rawfile.baseName}.mzML" ]]
    then
        cp "${rawfile}" "${rawfile.baseName}.mzML"
    fi
else
    thermorawfileparser -i "${rawfile}" -b "${rawfile.baseName}.mzML" -f 2
fi
"""
```
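An alternative to listing the extensions in `[[ ]]` is a `case` statement, which handles multiple patterns naturally. A standalone sketch of the same copy-or-convert dispatch, with the real copy and ThermoRawFileParser calls replaced by a classification stub:

```shell
#!/bin/sh
# Decide per input file whether it is already mzML or still needs conversion.
# classify is a stand-in for the real copy / thermorawfileparser branches.
classify() {
    case "$1" in
        *.mzML|*.mzml) echo "copy" ;;    # already mzML, just (re)name it
        *)             echo "convert" ;; # vendor raw file, run the parser
    esac
}
classify sample1.mzML
classify sample2.mzml
classify sample3.raw
```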
```bash
"""
parse_sdrf \\
    convert-maxquant \\
    -s "${sdrf}" \\
    -f "PLACEHOLDER${fasta}" \\
    -r PLACEHOLDER \\
    -t PLACEHOLDERtemp \\
    -o2 exp_design.tsv \\
    -n ${task.cpus}
echo "Preliminary" > sdrf_merge.version.txt
parse_sdrf \\
    convert-normalyzerde \\
    -s "${sdrf}" \\
    -mq exp_design.tsv \\
    -o Normalyzer_design.tsv \\
    -oc Normalyzer_comparisons.txt
"""
```
```bash
"""
if [[ "$sdrf" != "sdrf.tsv" ]]
then
    cp "${sdrf}" sdrf.tsv
fi
if [[ "$parameters" != "params.yml" ]]
then
    cp "${parameters}" params.yml
fi
if [[ "$map" != "params2sdrf.yml" ]]
then
    cp "${map}" params2sdrf.yml
fi
# TODO change to package when available
python $projectDir/bin/add_data_analysis_param.py > changed_params.txt
python $projectDir/bin/sdrf2params.py
"""
```
```bash
"""
searchgui eu.isas.searchgui.cmd.FastaCLI -in ${fasta} -decoy
"""
```
```bash
"""
mkdir tmp
mkdir log
searchgui eu.isas.searchgui.cmd.PathSettingsCLI -temp_folder ./tmp -log ./log
searchgui eu.isas.searchgui.cmd.IdentificationParametersCLI -out searchgui \\
    -frag_tol ${frag_tol} -frag_ppm ${frag_ppm} -prec_tol ${prec_tol} -prec_ppm ${prec_ppm} -enzyme "${enzyme}" -mc ${parameters["allowed_miscleavages"]} \\
    -max_isotope ${parameters["isotope_error_range"]} \\
    ${fixed_mods} ${var_mods} \\
    -fi "${parameters["fions"]}" -ri "${parameters["rions"]}" -xtandem_quick_acetyl 0 -xtandem_quick_pyro 0 -peptide_fdr ${parameters["ident_fdr_peptide"]} \\
    -protein_fdr ${parameters["ident_fdr_protein"]} -psm_fdr ${parameters["ident_fdr_psm"]} \\
    -myrimatch_num_ptms ${parameters["max_mods"]} -ms_amanda_max_mod ${parameters["max_mods"]} -msgf_num_ptms ${parameters["max_mods"]} \\
    -meta_morpheus_max_mods_for_peptide ${parameters["max_mods"]} -directag_max_var_mods ${parameters["max_mods"]} -comet_num_ptms ${parameters["max_mods"]} \\
    -myrimatch_min_pep_length ${parameters["min_peptide_length"]} -myrimatch_max_pep_length ${parameters["max_peptide_length"]} -ms_amanda_min_pep_length ${parameters["min_peptide_length"]} \\
    -ms_amanda_max_pep_length ${parameters["max_peptide_length"]} -msgf_min_pep_length ${parameters["min_peptide_length"]} -msgf_max_pep_length ${parameters["max_peptide_length"]} \\
    -omssa_min_pep_length ${parameters["min_peptide_length"]} -omssa_max_pep_length ${parameters["max_peptide_length"]} -comet_min_pep_length ${parameters["min_peptide_length"]} \\
    -comet_max_pep_length ${parameters["max_peptide_length"]} -tide_min_pep_length ${parameters["min_peptide_length"]} -tide_max_pep_length ${parameters["max_peptide_length"]} \\
    -andromeda_min_pep_length ${parameters["min_peptide_length"]} -andromeda_max_pep_length ${parameters["max_peptide_length"]} -meta_morpheus_min_pep_length ${parameters["min_peptide_length"]} \\
    -meta_morpheus_max_pep_length ${parameters["max_peptide_length"]} -max_charge ${parameters.max_precursor_charge} -min_charge ${parameters.min_precursor_charge}
#-tide_max_ptms ${parameters["max_mods"]}
"""
```
```bash
"""
cat <<-END_VERSIONS > versions.yml
"${task.process}":
    maxquant: \$(maxquant --version 2>&1 > /dev/null | cut -f2 -d\" \")
END_VERSIONS
sed \"s_<numThreads>.*_<numThreads>$task.cpus</numThreads>_\" ${paramfile} > mqpar_changed.xml
sed -i \"s|PLACEHOLDER|\$PWD/|g\" mqpar_changed.xml
mkdir temp
chmod -R a+rw *
maxquant mqpar_changed.xml
mv combined/txt/*.txt .
mv combined/proc/*unningTimes.txt runningTimes.txt
"""
```
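The first `sed` call above rewrites MaxQuant's `<numThreads>` setting in the parameter XML before the run. The same substitution, isolated on a minimal XML fragment (file contents and thread count are invented for illustration):

```shell
#!/bin/sh
printf '<MaxQuantParams>\n  <numThreads>1</numThreads>\n</MaxQuantParams>\n' > mqpar.xml
# Replace whatever thread count is configured with the desired one (here 8);
# using _ as the sed delimiter means the / in </numThreads> needs no escaping
sed 's_<numThreads>.*_<numThreads>8</numThreads>_' mqpar.xml > mqpar_changed.xml
cat mqpar_changed.xml
```

Because `.*` runs to the end of the line, the whole `<numThreads>…</numThreads>` element is replaced in one pass while the surrounding indentation is preserved.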
Support
- Future updates
Related Workflows