The ProteoGenomics database generation workflow (pgdb) uses pypgatk and Nextflow to create different protein databases for ProteoGenomics data analysis.
Introduction
nf-core/pgdb is a bioinformatics pipeline to generate proteogenomics databases. pgdb allows users to create proteogenomics databases using ENSEMBL as the reference proteome database. Three major types of database can be attached to the final proteogenomics database:
- The reference proteome (ENSEMBL reference proteome)
- Non-canonical proteins: pseudogenes, sORFs, lncRNA
- Variants: COSMIC, cBioPortal, gnomAD variants
The pipeline can also generate decoy proteins with different methods and append them to the final proteogenomics database.
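To make the idea of a decoy concrete, here is a minimal shell sketch of one widely used strategy, sequence reversal. It is an illustration only and not how pypgatk generates decoys. The function name `make_reverse_decoys`, the `DECOY_` prefix and the toy FASTA entry are assumptions, and the sketch assumes each sequence sits on a single line.

```shell
# Hypothetical helper: reverse every sequence and tag its header.
# $1 = input FASTA (one sequence line per header), $2 = decoy prefix
make_reverse_decoys() {
    awk -v prefix="$2" '
        /^>/ { print ">" prefix substr($0, 2); next }   # tag the header
        {
            rev = ""
            for (i = length($0); i >= 1; i--) rev = rev substr($0, i, 1)
            print rev                                    # reversed sequence
        }
    ' "$1"
}

tmpdir=$(mktemp -d)
# Toy target database (made-up entry).
printf '>sp|P1|TEST\nMKTAYIAK\n' > "$tmpdir/target.fa"

make_reverse_decoys "$tmpdir/target.fa" "DECOY_" > "$tmpdir/decoy.fa"

# Append the decoys to the targets, as a target-decoy database would be built.
cat "$tmpdir/target.fa" "$tmpdir/decoy.fa" > "$tmpdir/target_decoy.fa"
cat "$tmpdir/target_decoy.fa"
```

The decoy entry keeps the amino acid composition and length of the target, which is why reversal is a common baseline for estimating false discoveries.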
The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It comes with Docker containers, making installation trivial and results highly reproducible.
Quick Start
- Install nextflow
- Install any of Docker, Singularity, Podman, Shifter or Charliecloud for full pipeline reproducibility (please only use Conda as a last resort; see docs)
- Download the pipeline and test it on a minimal dataset with a single command (this run will download the canonical ENSEMBL reference proteome and create a proteomics database from it):
  nextflow run nf-core/pgdb -profile test,<docker/singularity/podman/shifter/charliecloud/conda/institute>
  Please check nf-core/configs to see if a custom config file to run nf-core pipelines already exists for your institute. If so, you can simply use -profile <institute> in your command. This will enable either docker or singularity and set the appropriate execution settings for your local compute environment.
- Start running your own analysis!
  nextflow run nf-core/pgdb -profile <docker/singularity/podman/conda/institute> --ncrna true --pseudogenes true --altorfs true
  This will create a proteogenomics database with the ENSEMBL reference proteome and non-canonical proteins such as pseudogenes, non-coding RNAs and alternative open reading frames.
See usage docs for all of the available options when running the pipeline.
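Before the first run it can help to check which of the tools named above are actually on your PATH. The following is a small sketch, not part of the pipeline. The helper names `have` and `report_tools` are made up for illustration, and the binary names are assumptions (Charliecloud is probed through its `ch-run` launcher); adjust the list to your installation.

```shell
# Return success if the given command is found on PATH.
have() {
    command -v "$1" >/dev/null 2>&1
}

# Print one "found"/"missing" line per tool name passed as an argument.
report_tools() {
    for tool in "$@"; do
        if have "$tool"; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
        fi
    done
}

# Tool list taken from the Quick Start; binary names are assumptions.
report_tools nextflow docker singularity podman shifter ch-run conda
```

At least `nextflow` and one container engine (or Conda as a last resort) should be reported as found before launching the test profile.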
Pipeline Summary
By default, the pipeline currently performs the following:
- Download protein databases from ENSEMBL
- Translate genomic variant databases (COSMIC, gnomAD) into proteogenomics databases
- Add non-coding RNAs and pseudogenes to the reference proteome database
- Compute decoys for the proteogenomics database
Documentation
The nf-core/pgdb pipeline comes with documentation about the pipeline: usage and output.
Credits
nf-core/pgdb was originally written by Husen M. Umer (EMBL-EBI) & Yasset Perez-Riverol (Karolinska Institute)
Contributions and Support
If you would like to contribute to this pipeline, please see the contributing guidelines.
For further information or help, don't hesitate to get in touch on the Slack #pgdb channel (you can join with this invite).
Citations
You can cite the nf-core publication as follows:
The nf-core framework for community-curated bioinformatics pipelines.
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.
An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.
Code Snippets
"""
echo $workflow.manifest.version > v_pipeline.txt
echo $workflow.nextflow.version > v_nextflow.txt
scrape_software_versions.py &> software_versions_mqc.yaml
"""

"""
pypgatk_cli.py ensembl-downloader \\
    --config_file $ensembl_downloader_config \\
    --ensembl_name $params.ensembl_name \\
    -sv -sc
"""

"""
cat $reference_proteome >> reference_proteome.fa
"""

"""
cat $a >> total_cdnas.fa
cat $b >> total_cdnas.fa
"""

"""
pypgatk_cli.py dnaseq-to-proteindb \\
    --config_file "$ensembl_config" \\
    --input_fasta $x \\
    --output_proteindb ncRNAs_proteinDB.fa \\
    --include_biotypes "${params.biotypes['ncRNA']}" \\
    --skip_including_all_cds --var_prefix ncRNA_
"""

"""
pypgatk_cli.py dnaseq-to-proteindb \\
    --config_file "$ensembl_config" \\
    --input_fasta "$x" \\
    --output_proteindb pseudogenes_proteinDB.fa \\
    --include_biotypes "${params.biotypes['pseudogene']}" \\
    --skip_including_all_cds \\
    --var_prefix pseudo_
"""

"""
pypgatk_cli.py dnaseq-to-proteindb \\
    --config_file "$ensembl_config" \\
    --input_fasta "$x" \\
    --output_proteindb altorfs_proteinDB.fa \\
    --include_biotypes "${params.biotypes['protein_coding']}" \\
    --skip_including_all_cds \\
    --var_prefix altorf_
"""

"""
pypgatk_cli.py cosmic-downloader \\
    --config_file "$cosmic_config" \\
    --username $params.cosmic_user_name \\
    --password $params.cosmic_password
"""

"""
pypgatk_cli.py cosmic-to-proteindb \\
    --config_file "$cosmic_config" \\
    --input_mutation $m --input_genes $g \\
    --filter_column 'Histology subtype 1' \\
    --accepted_values $params.cosmic_cancer_type \\
    --output_db cosmic_proteinDB.fa
"""

"""
pypgatk_cli.py cosmic-to-proteindb \\
    --config_file "$cosmic_config" \\
    --input_mutation $m \\
    --input_genes $g \\
    --filter_column 'Sample name' \\
    --accepted_values $params.cosmic_cellline_name \\
    --output_db cosmic_celllines_proteinDB.fa
"""

"""
pypgatk_cli.py ensembl-downloader \\
    --config_file $ensembl_downloader_config \\
    --ensembl_name $params.ensembl_name \\
    -sg -sp -sc -sd -sn
"""

"""
awk 'BEGIN{FS=OFS="\t"}{if(\$1~"#" || (\$5!="" && \$4!="")) print}' $vcf_file > checked_$vcf_file
"""

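The awk filter in the snippet above keeps header lines and drops records whose REF (column 4) or ALT (column 5) field is empty. A standalone sketch on a made-up four-line VCF shows the effect. Outside a Nextflow script block the `\$` escapes become plain `$`. The function name `check_vcf` and the sample records are illustrative assumptions.

```shell
# Same filter as the pipeline step, written for a plain shell.
check_vcf() {
    awk 'BEGIN{FS=OFS="\t"}{if($1~"#" || ($5!="" && $4!="")) print}' "$1"
}

workdir=$(mktemp -d)

# Toy VCF (hypothetical records): the last line has an empty ALT field.
{
    printf '##fileformat=VCFv4.2\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
    printf '1\t100\trs1\tA\tG\t.\t.\t.\n'
    printf '1\t200\trs2\tA\t\t.\t.\t.\n'
} > "$workdir/toy.vcf"

check_vcf "$workdir/toy.vcf" > "$workdir/checked_toy.vcf"
cat "$workdir/checked_toy.vcf"   # both header lines and rs1 remain; rs2 is dropped
```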
"""
pypgatk_cli.py vcf-to-proteindb \\
    --config_file $e \\
    --af_field "$ensembl_af_field" \\
    --input_fasta $f \\
    --gene_annotations_gtf $g \\
    --vcf $v \\
    --output_proteindb "${v}_proteinDB.fa" \\
    --var_prefix ensvar \\
    --annotation_field_name 'CSQ'
"""

"""
gffread -w transcripts.fa -g $f $g
"""

"""
awk 'BEGIN{FS=OFS="\t"}{if(\$1=="chrM") \$1="MT"; gsub("chr","",\$1); print}' \\
    $v > ${v.baseName}_changedChrNames.vcf

pypgatk_cli.py vcf-to-proteindb \\
    --config_file $e \\
    --af_field "$af_field" \\
    --input_fasta $f \\
    --gene_annotations_gtf $g \\
    --vcf ${v.baseName}_changedChrNames.vcf \\
    --output_proteindb ${v.baseName}_proteinDB.fa \\
    --annotation_field_name ''
"""

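The awk command at the start of the snippet above rewrites UCSC-style chromosome names (`chr1`, `chrM`) to the Ensembl style (`1`, `MT`) before the VCF is translated. A standalone sketch on made-up records, with Nextflow's `\$` escapes removed, shows the mapping. The function name `strip_chr_prefix` and the sample records are illustrative assumptions.

```shell
# Same renaming rule as the pipeline step, written for a plain shell.
strip_chr_prefix() {
    awk 'BEGIN{FS=OFS="\t"}{if($1=="chrM") $1="MT"; gsub("chr","",$1); print}' "$1"
}

workdir=$(mktemp -d)

# Toy records with UCSC-style chromosome names (hypothetical data).
printf 'chr1\t100\t.\tA\tG\nchrX\t200\t.\tC\tT\nchrM\t300\t.\tG\tA\n' > "$workdir/ucsc.vcf"

strip_chr_prefix "$workdir/ucsc.vcf" > "$workdir/ensembl.vcf"
cut -f1 "$workdir/ensembl.vcf"   # first column is now 1, X and MT
```

The mitochondrial chromosome is handled before the generic `chr` removal because its Ensembl name is `MT`, not `M`.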
"""
wget ${g}/gencode.v19.pc_transcripts.fa.gz
wget ${g}/gencode.v19.annotation.gtf.gz
gunzip *.gz
"""

"""
gsutil cp $g .
"""

"""
zcat $g > ${g}.vcf
"""

"""
pypgatk_cli.py vcf-to-proteindb \\
    --config_file $e \\
    --vcf $v \\
    --input_fasta $f \\
    --gene_annotations_gtf $g \\
    --output_proteindb "${v}_proteinDB.fa" \\
    --af_field controls_AF \\
    --transcript_index 6 \\
    --annotation_field_name vep \\
    --var_prefix gnomadvar
"""

"""
wget ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/cds/Homo_sapiens.GRCh37.75.cds.all.fa.gz
gunzip *.gz
"""

"""
git clone https://github.com/cBioPortal/datahub.git .
git lfs install --local --skip-smudge
git lfs pull -I public --include "data*clinical*sample.txt"
git lfs pull -I public --include "data_mutations_mskcc.txt"
cat public/*/data_mutations_mskcc.txt > cbioportal_allstudies_data_mutations_mskcc.txt
cat public/*/*data*clinical*sample.txt | \\
    awk 'BEGIN{FS=OFS="\\t"}{if(\$1!~"#SAMPLE_ID"){gsub("#SAMPLE_ID", "\\nSAMPLE_ID");} print}' | \\
    awk 'BEGIN{FS=OFS="\\t"}{s=0; j=0; \\
        for(i=1;i<=NF;i++){ \\
            if(\$i=="CANCER_TYPE_DETAILED") j=1; \\
            if(\$i=="CANCER_TYPE") s=1; \\
        } \\
        if(j==1 && s==0){ \\
            gsub("CANCER_TYPE_DETAILED", "CANCER_TYPE"); \\
        } \\
        print; \\
    }' \\
    > cbioportal_allstudies_data_clinical_sample.txt
"""

"""
pypgatk_cli.py cbioportal-downloader \\
    --config_file "$cbioportal_config" \\
    -d "$params.cbioportal_study_id"
tar -xzvf database_cbioportal/${params.cbioportal_study_id}.tar.gz
cat ${params.cbioportal_study_id}/data_mutations_mskcc.txt > cbioportal_allstudies_data_mutations_mskcc.txt
cat ${params.cbioportal_study_id}/data_clinical_sample.txt | \\
    awk 'BEGIN{FS=OFS="\\t"}{if(\$1!~"#SAMPLE_ID"){gsub("#SAMPLE_ID", "\\nSAMPLE_ID");} print}' | \\
    awk 'BEGIN{FS=OFS="\\t"}{s=0; j=0; \\
        for(i=1;i<=NF;i++){ \\
            if(\$i=="CANCER_TYPE_DETAILED") j=1; if(\$i=="CANCER_TYPE") s=1; \\
        } \\
        if(j==1 && s==0){gsub("CANCER_TYPE_DETAILED", "CANCER_TYPE");} print;}' \\
    > cbioportal_allstudies_data_clinical_sample.txt
"""

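The second awk stage in the cBioPortal snippets above renames a `CANCER_TYPE_DETAILED` column to `CANCER_TYPE` on any line that has the detailed column but no plain `CANCER_TYPE` column, presumably so that one column name can be used for filtering across studies. The sketch below runs that stage on two made-up header lines, with Nextflow's escapes removed. The function name `normalize_cancer_type_header` is an assumption for illustration.

```shell
# Same logic as the second awk stage, written for a plain shell.
normalize_cancer_type_header() {
    awk 'BEGIN{FS=OFS="\t"}{s=0; j=0;
        for(i=1;i<=NF;i++){
            if($i=="CANCER_TYPE_DETAILED") j=1;   # detailed column present
            if($i=="CANCER_TYPE") s=1;            # plain column present
        }
        if(j==1 && s==0){ gsub("CANCER_TYPE_DETAILED", "CANCER_TYPE"); }
        print;
    }'
}

# Header with only the detailed column: it is renamed.
printf 'SAMPLE_ID\tPATIENT_ID\tCANCER_TYPE_DETAILED\n' | normalize_cancer_type_header

# Header that already has CANCER_TYPE: left unchanged.
printf 'SAMPLE_ID\tCANCER_TYPE\tCANCER_TYPE_DETAILED\n' | normalize_cancer_type_header
```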
"""
pypgatk_cli.py cbioportal-to-proteindb \\
    --config_file $cbioportal_config \\
    --input_mutation $m \\
    --input_cds $g \\
    --clinical_sample_file $s \\
    --filter_column $params.cbioportal_filter_column \\
    --accepted_values $params.cbioportal_accepted_values \\
    --output_db cbioPortal_proteinDB.fa
"""

"""
cat proteindb* > merged_databases.fa
"""

"""
pypgatk_cli.py ensembl-check \\
    -in "$file" \\
    --config_file "$e" \\
    -out database_clean.fa \\
    --num_aa "$params.minimum_aa" \\
    "$stop_codons"
"""

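The `--num_aa` option in the snippet above passes a minimum protein length to `ensembl-check`. As a rough illustration of length-based cleaning, and not a reproduction of what pypgatk does internally, the sketch below drops FASTA entries shorter than a threshold. The function name `filter_min_length`, the threshold of 6 and the toy entries are assumptions, and each sequence is assumed to sit on a single line.

```shell
# Hypothetical helper: keep only entries with at least $2 residues.
# $1 = FASTA with one sequence line per header, $2 = minimum length
filter_min_length() {
    awk -v min="$2" '
        /^>/ { header = $0; next }                        # remember the header
        length($0) >= min { print header; print $0 }     # emit long enough entries
    ' "$1"
}

workdir=$(mktemp -d)

# Toy database (made-up entries): 3 and 10 residues long.
printf '>short\nMKT\n>long\nMKTAYIAKQR\n' > "$workdir/db.fa"

filter_min_length "$workdir/db.fa" 6 > "$workdir/db_clean.fa"
cat "$workdir/db_clean.fa"   # only the "long" entry remains
```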
"""
pypgatk_cli.py generate-decoy \\
    --method "$params.decoy_method" \\
    --enzyme "$params.decoy_enzyme" \\
    --config_file $protein_decoy_config \\
    --input_database $f \\
    --decoy_prefix "$params.decoy_prefix" \\
    --output_database decoy_database.fa
"""

"""
markdown_to_html.py $output_docs -o results_description.html
"""