GEN-ERA toolbox suite of Nextflow-Singularity workflows
The GEN-ERA toolbox is a suite of Nextflow-Singularity workflows designed for comparative genomics of bacteria and small eukaryotes. Without any installation, it allows researchers to download, assemble and bin (meta)genomes from short or long reads. The suite also performs orthology inference and maximum likelihood phylogenomic analyses (with bootstrap or jackknife support). SSU rRNA phylogenies constrained by a ribosomal phylogenomic tree can also be inferred. Average nucleotide identity, GTDB identification and metabolic modelling are also included in the toolbox.
BCCM GEN-ERA tools repository
Please visit the wiki for tutorials and access to the tools: https://github.com/Lcornet/GENERA/wiki
NEWS
Update in Phylogeny.nf: switched to RAxML v8. The fast method is no longer used for jackknife, which now uses the ML best tree.
Information about the GEN-ERA project
Please visit
https://bccm.belspo.be/content/bccm-collections-genomic-era
GEN-ERA project final report
Pierre Becker, Luc Cornet, Elizabet D’hooge, Ilse Cleenwerck, Oren Tzfadia, Leen Rigouts, Wim Mulders, Heide-Marie Daniel, Annick Wilmotte, Denis Baurain. BCCM collections in the genomic era. Final Report.
Brussels: Belgian Science Policy Office, 2022. 40 p. (BRAIN-be 2.0 (Belgian Research Action through Interdisciplinary Networks))
https://www.belspo.be/belspo/brain2-be/projects/FinalReports/BCCMGENERA_FinRep.pdf
Publications
- ToRQuEMaDA: tool for retrieving queried Eubacteria, metadata and dereplicating assemblies.
  Léonard, R. R., Leleu, M., Vlierberghe, M. V., Cornet, L., Kerff, F., and Baurain, D. (2021).
  PeerJ 9, e11348. doi:10.7717/peerj.11348.
  https://peerj.com/articles/11348/
- The taxonomy of the Trichophyton rubrum complex: a phylogenomic approach.
  Cornet, L., D’hooge, E., Magain, N., Stubbe, D., Packeu, A., Baurain, D., and Becker, P. (2021).
  Microbial Genomics 7, 000707. doi:10.1099/mgen.0.000707.
  https://www.microbiologyresearch.org/content/journal/mgen/10.1099/mgen.0.000707
- ORPER: A Workflow for Constrained SSU rRNA Phylogenies.
  Cornet, L., Ahn, A.-C., Wilmotte, A., and Baurain, D. (2021).
  Genes 12, 1741. doi:10.3390/genes12111741.
  https://www.mdpi.com/2073-4425/12/11/1741/html
- AMAW: automated gene annotation for non-model eukaryotic genomes.
  Meunier, L., Baurain, D., and Cornet, L. (2021).
  https://www.biorxiv.org/content/10.1101/2021.12.07.471566v1
- Phylogenomic analyses of Snodgrassella isolates from honeybees and bumblebees reveal taxonomic and functional diversity.
  Cornet, L., Cleenwerck, I., Praet, J., Leonard, R., Vereecken, N.J., Michez, D., Smagghe, G., Baurain, D., and Vandamme, P. (2021).
  https://doi.org/10.1128/msystems.01500-21
- Contamination detection in genomic data: more is not enough.
  Cornet, L. and Baurain, D. (2022).
  Genome Biology. 2022;23:60.
  https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02619-9
- The GEN-ERA toolbox: unified and reproducible workflows for research in microbial genomics.
  Cornet, L., Durieu, B., Baert, F., D’hooge, E., Colignon, D., Meunier, L., Lupo, V., Cleenwerck, I., Daniel, HM., Rigouts, L., Sirjacobs, D., Declerck, D., Vandamme, P., Wilmotte, A., Baurain, D., and Becker, P. (2022).
  https://www.biorxiv.org/content/10.1101/2022.10.20.513017v1
- CRitical Assessment of genomic COntamination detection at several Taxonomic ranks (CRACOT).
  Cornet, L., Lupo, V., Declerck, S., and Baurain, D. (2022).
  https://www.biorxiv.org/content/10.1101/2022.11.14.516442v1
Copyright and License
This software is copyright (c) 2017-2021 by University of Liege / Sciensano / BCCM collection by Luc CORNET. This is free software; you can redistribute it and/or modify it.
Code Snippets
"""
setup-taxdir.pl --taxdir=$workingdir
echo $workingdir > taxdump_path.txt
"""

"""
echo $workingdir > taxdump_path.txt
"""

"""
echo $taxdump > taxdump_path.txt
"""

"""
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt -O refseq_sum.txt
$companion refseq_sum.txt --mode=sum
grep -v "#" refseq_sum-filt.txt | cut -f1 > GCF.list
fetch-tax.pl GCF.list --taxdir=\$(<taxdump_path.txt) --item-type=taxid --levels=phylum class order family genus species
grep -v "#" refseq_sum-filt.txt | cut -f20 > ftp.list
grep -v "#" refseq_sum-filt.txt | cut -f20 | cut -f10 -d"/" > names.list
for f in `cat ftp.list `; do echo "/"; done > slash.list
for f in `cat ftp.list `; do echo "_genomic.fna.gz"; done > end1.list
for f in `cat ftp.list `; do echo "wget "; done > get.list
for f in `cat ftp.list `; do echo " -O "; done > out.list
for f in `cat ftp.list `; do echo ".fna.gz"; done > end2.list
cut -f1,2 -d"_" names.list > id.list
paste get.list ftp.list slash.list names.list end1.list out.list id.list end2.list > ftp.sh
sed -i -e 's/\t//g' ftp.sh
echo "RefSeq metadata" >> Genome-downloader.log
"""

"""
echo "Add Refseq Genomes NOT activated" > GCF.tax
echo "Add Refseq Genomes NOT activated" > ftp.sh
echo "Add Refseq metadata NOT activated" >> Genome-downloader.log
"""

"""
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/assembly_summary_genbank.txt -O genbank_sum.txt
$companion genbank_sum.txt --mode=sum
grep -v "#" genbank_sum-filt.txt | cut -f1 > GCA.list
fetch-tax.pl GCA.list --taxdir=\$(<taxdump_path.txt) --item-type=taxid --levels=phylum class order family genus species
grep -v "#" genbank_sum-filt.txt | cut -f20 > ftp.list
grep -v "#" genbank_sum-filt.txt | cut -f20 | cut -f10 -d"/" > names.list
for f in `cat ftp.list `; do echo "/"; done > slash.list
for f in `cat ftp.list `; do echo "_genomic.fna.gz"; done > end1.list
for f in `cat ftp.list `; do echo "wget "; done > get.list
for f in `cat ftp.list `; do echo " -O "; done > out.list
for f in `cat ftp.list `; do echo ".fna.gz"; done > end2.list
cut -f1,2 -d"_" names.list > id.list
paste get.list ftp.list slash.list names.list end1.list out.list id.list end2.list > GCA-ftp.sh
sed -i -e 's/\t//g' GCA-ftp.sh
echo "Add GenBank metadata activated" >> Genome-downloader.log
"""

"""
echo "Add GenBank Genomes NOT activated" > GCA.tax
echo "Add GenBank Genomes NOT activated" > GCA-ftp.sh
echo "Add GenBank metadata NOT activated" >> Genome-downloader.log
"""

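The two metadata steps above assemble each download command by pasting eight single-column helper files together and then deleting the tabs. As a minimal sketch, the same command line can be produced in one pass with awk. The function name `build_wget_lines` and the demo row are assumptions made for illustration; the column position (FTP directory in column 20) is taken from the `cut` calls in the snippets. Two small deviations from the snippets: only lines that start with `#` are skipped (the snippets drop any line containing `#`), and the assembly name is taken as the last path component rather than field 10.

```shell
#!/usr/bin/env bash
# Sketch only: build_wget_lines is a hypothetical helper, not part of the toolbox.
# $1: filtered assembly summary (tab-separated), FTP directory in column 20,
#     as assumed by the `cut -f20` calls in the snippets above.
build_wget_lines() {
    awk -F'\t' '
        /^#/ { next }                       # skip header/comment lines
        {
            n = split($20, path, "/")       # last path component = assembly name
            name = path[n]
            split(name, tok, "_")           # first two "_"-separated tokens = accession
            id = tok[1] "_" tok[2]
            printf "wget %s/%s_genomic.fna.gz -O %s.fna.gz\n", $20, name, id
        }
    ' "$1"
}

# Demo on a single illustrative row (only columns 1 and 20 are filled in).
demo_summary=$(mktemp)
awk 'BEGIN {
    OFS = "\t"
    $1  = "GCF_000005845.2"
    $20 = "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2"
    print
}' > "$demo_summary"
build_wget_lines "$demo_summary"
rm -f "$demo_summary"
```

Each output line has the same shape as a line of `ftp.sh` or `GCA-ftp.sh`: `wget <ftp dir>/<name>_genomic.fna.gz -O <accession>.fna.gz`.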
"""
#Produce list of GCF IDs with reference group and taxa levels
$companion GCF.tax --mode=fetch --taxa=$taxa --refgroup=$group
for f in `cat GCF.refgroup.uniq`; do grep \$f ftp.sh; done > reduce-ftp.sh
bash reduce-ftp.sh
gunzip *.gz
find *.fna | cut -f1,2 -d"." > fna.list
for f in `cat fna.list`; do inst-abbr-ids.pl \$f*.fna --id-regex=:DEF --id-prefix=\$f; done
echo "Add RefSeq Genomes, abbr mode" >> Genome-downloader.log
"""

"""
#Produce list of GCF IDs with reference group and taxa levels
$companion GCF.tax --mode=fetch --taxa=$taxa --refgroup=$group
for f in `cat GCF.refgroup.uniq`; do grep \$f ftp.sh; done > reduce-ftp.sh
bash reduce-ftp.sh
gunzip *.gz
echo "Add RefSeq Genomes, non abbr mode" >> Genome-downloader.log
"""

"""
echo "Add RefSeq Genomes NOT activated" > FALSER-abbr.fna
echo "Add RefSeq Genomes NOT activated" > reduce-ftp.sh
echo "GCF_FALSE" > GCF.refgroup.uniq
echo "Add RefSeq Genomes NOT activated" >> Genome-downloader.log
"""

"""
#Produce list of GCA IDs with reference group and taxa levels
$companion GCA.tax --mode=fetch --taxa=$taxa --refgroup=$group
for f in `cat GCA.refgroup.uniq`; do grep \$f GCA-ftp.sh; done > GCA-reduce-ftp.sh
bash GCA-reduce-ftp.sh
gunzip *.gz
find *.fna | cut -f1,2 -d"." > fna.list
for f in `cat fna.list`; do inst-abbr-ids.pl \$f*.fna --id-regex=:DEF --id-prefix=\$f; done
#for fix and proceed , false genbank files
echo "Add GenBank Genomes activated" > FALSE-abbr.fna
echo "Add GenBank Genomes activated" > FALSE-GCA-reduce-ftp.sh
echo "Add GenBank Genomes activated, abbr mode" >> Genome-downloader.log
"""

"""
#Produce list of GCA IDs with reference group and taxa levels
$companion GCA.tax --mode=fetch --taxa=$taxa --refgroup=$group
for f in `cat GCA.refgroup.uniq`; do grep \$f GCA-ftp.sh; done > GCA-reduce-ftp.sh
bash GCA-reduce-ftp.sh
gunzip *.gz
#for fix and proceed , false genbank files
echo "Add GenBank Genomes activated" > FALSE-abbr.fna
echo "Add GenBank Genomes activated" > FALSE-GCA-reduce-ftp.sh
echo "Add GenBank Genomes activated, non abbr mode" >> Genome-downloader.log
"""

"""
echo "Add GenBank Genomes NOT activated" > FALSE-abbr.fna
echo "Add GenBank Genomes NOT activated" > GCA-reduce-ftp.sh
echo "Add GenBank Genomes NOT activated" >> Genome-downloader.log
"""

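Both download steps above select their wget commands by running one `grep` per accession inside a `for` loop. A possible alternative, sketched here under the assumption that the ID list holds one accession per line with no empty lines, is a single fixed-string `grep` that reads all patterns from the file. The behaviour differs slightly: output follows the order of the download script rather than the ID list, each matching line is printed once, and `-F` treats the dots in accessions literally. The function name `select_downloads` and the example.org URLs are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: select_downloads is a hypothetical helper, not part of the toolbox.
# $1: file with one accession per line (e.g. GCF.refgroup.uniq)
# $2: file with one wget command per line (e.g. ftp.sh)
select_downloads() {
    # -F: patterns are fixed strings; -f: read the patterns from a file
    grep -F -f "$1" "$2"
}

# Demo with made-up commands: only the first accession is requested.
workdir=$(mktemp -d)
printf '%s\n' 'GCF_000005845.2' > "$workdir/ids.list"
printf '%s\n' \
    'wget ftp://example.org/a/GCF_000005845.2_x_genomic.fna.gz -O GCF_000005845.2.fna.gz' \
    'wget ftp://example.org/b/GCF_000006945.2_y_genomic.fna.gz -O GCF_000006945.2.fna.gz' \
    > "$workdir/ftp.sh"
select_downloads "$workdir/ids.list" "$workdir/ftp.sh"
rm -rf "$workdir"
```

The demo prints only the command for `GCF_000005845.2`. Like any `grep`, the helper exits with status 1 when nothing matches, which matters if the surrounding script runs with `set -e`.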
"""
#Delete false GenBank files
rm -rf FALSE*
mkdir GEN
mv *.fna GEN/
dRep dereplicate DREP -g GEN/*.fna -p $cpu
mkdir DEREPLICATED/
mv DREP/dereplicated_genomes/*.fna DEREPLICATED/
echo "Drep Dereplication activated" >> Genome-downloader.log
"""

"""
#Delete false GenBank files
rm -rf FALSE*
mkdir GEN
mv *.fna GEN/
dRep dereplicate DREP -g GEN/*.fna -p $cpu --ignoreGenomeQuality
mkdir DEREPLICATED/
mv DREP/dereplicated_genomes/*.fna DEREPLICATED/
echo "Drep Dereplication activated" >> Genome-downloader.log
"""

"""
#Delete false GenBank files
rm -rf FALSE*
mkdir DEREPLICATED/
mv *.fna DEREPLICATED/
echo "Drep Dereplication not activated" >> Genome-downloader.log
"""

"""
#log part
echo test > Genome-downloader.log
#Merge ftp
cat ftp.sh GCA-ftp.sh > merge-ftp.sh
grep -v 'activated' merge-ftp.sh > temp; mv temp merge-ftp.sh
#Collect GCA/F number
cp DEREPLICATED/*.fna .
find *.fna > fna.list
sed -i -e 's/-abbr.fna//g' fna.list
sed -i -e 's/.fna//g' fna.list
rm -rf *.fna
#Get ftp part
for f in `cat fna.list`; do grep \$f merge-ftp.sh; done > prot-ftp.sh
sed -i -e 's/_genomic.fna.gz/_protein.faa.gz/g' prot-ftp.sh
sed -i -e 's/fna.gz/faa.gz/g' prot-ftp.sh
bash prot-ftp.sh
find *.gz -type f -empty -print -delete
gunzip *.gz
mkdir PROT
mv *.faa PROT/
"""

"""
#log part
echo "Prot download NOT activated" > Genome-downloader.log
mkdir PROT
echo "Prot download NOT activated" > PROT/info.faa
"""

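The protein step above derives its download commands from the genome ones with two `sed` substitutions. The sketch below applies the same two rewrites to a single sample command so the effect is visible. The dots are escaped here so they match literally, which is a small deviation from the snippet; the function name `to_protein_cmd` and the example.org URL are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: to_protein_cmd is a hypothetical helper, not part of the toolbox.
# Reads genome wget commands on stdin and writes the protein equivalents to stdout.
to_protein_cmd() {
    # 1st rule: remote file name, genomic FASTA -> protein FASTA
    # 2nd rule: local output name, .fna.gz -> .faa.gz
    sed -e 's/_genomic\.fna\.gz/_protein.faa.gz/g' \
        -e 's/fna\.gz/faa.gz/g'
}

# Demo on one made-up genome download command.
printf '%s\n' \
    'wget ftp://example.org/GCF_000005845.2_x/GCF_000005845.2_x_genomic.fna.gz -O GCF_000005845.2.fna.gz' \
    | to_protein_cmd
```

The demo prints `wget ftp://example.org/GCF_000005845.2_x/GCF_000005845.2_x_protein.faa.gz -O GCF_000005845.2.faa.gz`. Assemblies without a protein file lead to failed or empty downloads, which is what the `find *.gz -type f -empty -print -delete` line in the snippet cleans up.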
"""
mv DEREPLICATED/*.fna .
#log part
echo "Genome-downloader started at `date`" > Genome-downloader.log
echo "Genomes dowloaded: " >> Genome-downloader.log
find *.fna | wc -l >> Genome-downloader.log
echo "Genome-downloader, version: " >> Genome-downloader.log
echo "3.0.0 " >> Genome-downloader.log
#copy part
#find *.fna | cut -f1 -d'-' > GC.list
find *.fna > GC.list
mkdir GENOMES/
mv *.fna GENOMES/
sed -i -e 's/.fna//g' GC.list
fetch-tax.pl GC.list --taxdir=\$(<taxdump_path.txt) --item-type=taxid --levels=phylum class order family genus species
mv GC.tax Genomes.taxomonomy
mkdir PROTEINS
mv PROT/*.faa PROTEINS/
"""

Support
- Future updates