MGnify genomes analysis pipeline
A pipeline to perform taxonomic and functional annotation and to generate a catalogue from a set of isolate and/or metagenome-assembled genomes (MAGs), using the workflow described in the following publication:
Gurbich TA, Almeida A, Beracochea M, Burdett T, Burgin J, Cochrane G, Raj S, Richardson L, Rogers AB, Sakharova E, Salazar GA and Finn RD. (2023) MGnify Genomes: A Resource for Biome-specific Microbial Genome Catalogues. J Mol Biol. doi: https://doi.org/10.1016/j.jmb.2023.168016
Detailed information about existing MGnify catalogues: https://docs.mgnify.org/src/docs/genome-viewer.html
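The pipeline is orchestrated with Nextflow; the code snippets below are the shell blocks of its individual processes. For orientation, a launch command might look like the following minimal sketch — the profile and parameter names (--genomes_dir, --outdir) are illustrative assumptions rather than the documented interface, so check the repository README for the actual options.

# Hypothetical launch sketch; parameter names are placeholders, not the documented CLI
nextflow run EBI-Metagenomics/genomes-pipeline \
    -profile docker \
    --genomes_dir ./input_genomes \
    --outdir ./results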
Code Snippets
"""
amrfinder --plus \
    -n ${fna} \
    -p ${faa} \
    -g ${gff} \
    -d ${params.amrfinder_plus_db} \
    -a prokka \
    --output ${cluster}_amrfinderplus.tsv \
    --threads ${task.cpus}
"""
"""
annotate_gff.py \
    -g ${gff} \
    -i ${ips_annotations_tsv} \
    -e ${eggnog_annotations_tsv} \
    -r ${ncrna_tsv} \
    ${crisprcas_flag} ${sanntis_flag} ${amrfinder_flag}
"""
"""
touch ${gff.simpleName}_annotated.gff
"""
"""
bracken-build -d ${kraken_db} \
    -t ${task.cpus} \
    -l ${read_length}
"""
"""
checkm lineage_wf -t ${task.cpus} -x fa --tab_table ${assemblies_folder} checkm_output

# to csv #
checkm2csv.py -i checkm_output > checkm_quality.csv
"""
"""
touch checkm_quality.csv
"""
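The conversion done by checkm2csv.py is not shown in these snippets. As a rough, hedged sketch of what such a step could look like with standard tools — assuming the usual CheckM lineage_wf tab-table layout (bin id, completeness and contamination in columns 1, 12 and 13) and the genome,completeness,contamination header that dRep's --genomeInfo option expects — it might be approximated as:

# Hedged approximation only; the authoritative conversion is checkm2csv.py.
# checkm_summary.tsv is a placeholder name for the CheckM tab table.
echo "genome,completeness,contamination" > checkm_quality.csv
awk -F'\t' 'NR > 1 {print $1".fa,"$12","$13}' checkm_summary.tsv >> checkm_quality.csv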
"""
classify_folders.py -g ${genomes_folder} --text-file ${text_file}

# Clean any empty directories #
find many_genomes -type d -empty -print -delete
find one_genome -type d -empty -print -delete

mv many_genomes pg
mv one_genome sg
"""
"""
get_core_genes.py \
    -i ${panaroo_gen_preabs} \
    -o ${cluster_name}.core_genes.txt
"""
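get_core_genes.py derives the per-cluster core gene list from the Panaroo gene presence/absence matrix. A hedged shell approximation — assuming a Roary/Panaroo-style Rtab layout (genes as rows, genomes as 0/1 columns, header in the first row) and the same 90% core threshold used in the Panaroo call further down — could look like:

# Hedged approximation of the core-gene selection; the real logic is in get_core_genes.py
awk -F'\t' 'NR > 1 {n = 0; for (i = 2; i <= NF; i++) n += $i; if (n / (NF - 1) >= 0.9) print $1}' \
    gene_presence_absence.Rtab > core_genes.txt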
"""
CRISPRCasFinder.pl -i $fasta \
    -so /opt/CRISPRCasFinder/sel392v2.so \
    -def G \
    -drpt /opt/CRISPRCasFinder/supplementary_files/repeatDirection.tsv \
    -outdir crisprcasfinder_results

echo "Running post-processing"

process_crispr_results.py \
    --tsv-report crisprcasfinder_results/TSV/Crisprs_REPORT.tsv \
    --gffs crisprcasfinder_results/GFF/*gff \
    --tsv-output crisprcasfinder_results/${fasta.baseName}_crisprcasfinder.tsv \
    --gff-output crisprcasfinder_results/${fasta.baseName}_crisprcasfinder.gff \
    --gff-output-hq crisprcasfinder_results/${fasta.baseName}_crisprcasfinder_hq.gff \
    --fasta $fasta
"""
"""
cmscan \
    --cpu ${task.cpus} \
    --tblout overlapped_${fasta.baseName} \
    --hmmonly \
    --clanin ${rfam_ncrna_models}/Rfam.clanin \
    --fmt 2 \
    --cut_ga \
    --noali \
    -o /dev/null \
    ${rfam_ncrna_models}/Rfam.cm \
    ${fasta}

# De-overlap #
grep -v " = " overlapped_${fasta.baseName} > ${fasta.baseName}.ncrna.deoverlap.tbl
"""
"""
shopt -s extglob

RESULTS_FOLDER=results_folder
FASTA=${fasta}
CM_DB=${cm_models}

BASENAME=\$(basename "\${FASTA}")
FILENAME="\${BASENAME%.*}"

mkdir "\${RESULTS_FOLDER}"

echo "[ Detecting rRNAs ]"
for CM_FILE in "\${CM_DB}"/*.cm; do
    MODEL=\$(basename "\${CM_FILE}")
    echo "Running cmsearch for \${MODEL}..."
    cmsearch -Z 1000 \
        --hmmonly \
        --cut_ga --cpu ${task.cpus} \
        --noali \
        --tblout "\${RESULTS_FOLDER}/\${FILENAME}_\${MODEL}.tblout" \
        "\${CM_FILE}" "\${FASTA}" 1> "\${RESULTS_FOLDER}/\${FILENAME}_\${MODEL}.out"
done

echo "Concatenating results..."
cat "\${RESULTS_FOLDER}/\${FILENAME}"_*.tblout > "\${RESULTS_FOLDER}/\${FILENAME}.tblout"

echo "Removing overlaps..."
cmsearch-deoverlap.pl \
    --maxkeep \
    --clanin "\${CM_DB}/ribo.claninfo" \
    "\${RESULTS_FOLDER}/\${FILENAME}.tblout"

mv "\${FILENAME}.tblout.deoverlapped" "\${RESULTS_FOLDER}/\${FILENAME}.tblout.deoverlapped"

echo "Parsing final results..."
parse_rRNA-bacteria.py -i \
    "\${RESULTS_FOLDER}/\${FILENAME}.tblout.deoverlapped" 1> "\${RESULTS_FOLDER}/\${FILENAME}_rRNAs.out"

rRNA2seq.py -d \
    "\${RESULTS_FOLDER}/\${FILENAME}.tblout.deoverlapped" \
    -i "\${FASTA}" 1> "\${RESULTS_FOLDER}/\${FILENAME}_rRNAs.fasta"

echo "[ Detecting tRNAs ]"
tRNAscan-SE -B -Q \
    -m "\${RESULTS_FOLDER}/\${FILENAME}_stats.out" \
    -o "\${RESULTS_FOLDER}/\${FILENAME}_trna.out" "\${FASTA}"

parse_tRNA.py -i "\${RESULTS_FOLDER}/\${FILENAME}_stats.out" 1> "\${RESULTS_FOLDER}/\${FILENAME}_tRNA_20aa.out"

echo "Completed"
"""
"""
dRep dereplicate -g ${genomes_directory}/*.fa \
    -p ${task.cpus} \
    -pa 0.9 \
    -sa 0.95 \
    -nc 0.30 \
    -cm larger \
    -comp 50 \
    -con 5 \
    -extraW ${extra_weights_table} \
    --genomeInfo ${checkm_csv} \
    drep_output

tar -czf drep_data_tables.tar.gz drep_output/data_tables
"""
"""
mkdir -p drep_output/data_tables
touch drep_output/data_tables/Cdb.csv
touch drep_output/data_tables/Mdb.csv
touch drep_output/data_tables/Sdb.csv
"""
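After dereplication, the archived data tables are what the later split_drep.py step consumes. A small inspection sketch (table roles as described in dRep's documentation: Cdb.csv holds secondary-cluster membership, Mdb.csv the Mash comparisons, Sdb.csv the genome scores):

# Optional inspection sketch; file names come from the stub above
tar -xzf drep_data_tables.tar.gz
head -n 5 drep_output/data_tables/Cdb.csv drep_output/data_tables/Sdb.csv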
"""
emapper.py -i ${fasta} \
    --database ${eggnog_db} \
    --dmnd_db ${eggnog_diamond_db} \
    --data_dir ${eggnog_data_dir} \
    -m diamond \
    --no_file_comments \
    --cpu ${task.cpus} \
    --no_annot \
    --dbmem \
    -o ${fasta.baseName}
"""
"""
emapper.py \
    --data_dir ${eggnog_data_dir} \
    --no_file_comments \
    --cpu ${task.cpus} \
    --annotate_hits_table ${annotation_hit_table} \
    --dbmem \
    -o ${annotation_hit_table.baseName}
"""
"""
touch eggnog-output.emapper.seed_orthologs
echo "#query seed_ortholog evalue score eggNOG_OGs max_annot_lvl COG_category Description Preferred_name GOs EC KEGG_ko KEGG_Pathway KEGG_Module KEGG_Reaction KEGG_rclass BRITE KEGG_TC CAZy BiGG_Reaction PFAMs" > eggnog-output.emapper.seed_orthologs
echo "MGYG000000012_00001 948106.AWZT01000053_gene1589 1.1e-63 199.0 COG5654@1|root,COG5654@2|Bacteria,1N6P3@1224|Proteobacteria,2VSGY@28216|Betaproteobacteria,1KFU4@119060|Burkholderiaceae 28216|Betaproteobacteria S RES - - - - - - - - - - - - RES" >> eggnog-output.emapper.seed_orthologs
echo "MGYG000000001_00001 948106.AWZT01000053_gene1589 1.1e-63 199.0 COG5654@1|root,COG5654@2|Bacteria,1N6P3@1224|Proteobacteria,2VSGY@28216|Betaproteobacteria,1KFU4@119060|Burkholderiaceae 28216|Betaproteobacteria S RES - - - - - - - - - - - - RES" >> eggnog-output.emapper.seed_orthologs
echo "MGYG000000020_00001 948106.AWZT01000053_gene1589 1.1e-63 199.0 COG5654@1|root,COG5654@2|Bacteria,1N6P3@1224|Proteobacteria,2VSGY@28216|Betaproteobacteria,1KFU4@119060|Burkholderiaceae 28216|Betaproteobacteria S RES - - - - - - - - - - - - RES" >> eggnog-output.emapper.seed_orthologs
"""
"""
touch eggnog-output.emapper.annotations
echo "#query seed_ortholog evalue score eggNOG_OGs max_annot_lvl COG_category Description Preferred_name GOs EC KEGG_ko KEGG_Pathway KEGG_Module KEGG_Reaction KEGG_rclass BRITE KEGG_TC CAZy BiGG_Reaction PFAMs" > eggnog-output.emapper.annotations
echo "MGYG000000012_00001 59538.XP_005971304.1 7.97e-152 431.0 COG0101@1|root,KOG4393@2759|Eukaryota,39RAQ@33154|Opisthokonta,3BK4Y@33208|Metazoa,3D27W@33213|Bilateria,48A93@7711|Chordata,494G6@7742|Vertebrata,3J2WS@40674|Mammalia 33208|Metazoa J synthase-like 1 - GO:0001522 - - - - - - - - - - DSPc,Laminin_G_3,PseudoU_synth_1" >> eggnog-output.emapper.annotations
echo "MGYG000000001_00001 59538.XP_005971304.1 7.97e-152 431.0 COG0101@1|root,KOG4393@2759|Eukaryota,39RAQ@33154|Opisthokonta,3BK4Y@33208|Metazoa,3D27W@33213|Bilateria,48A93@7711|Chordata,494G6@7742|Vertebrata,3J2WS@40674|Mammalia 33208|Metazoa J synthase-like 1 - GO:0001522 - - - - - - - - - - DSPc,Laminin_G_3,PseudoU_synth_1" >> eggnog-output.emapper.annotations
echo "MGYG000000020_00001 59538.XP_005971304.1 7.97e-152 431.0 COG0101@1|root,KOG4393@2759|Eukaryota,39RAQ@33154|Opisthokonta,3BK4Y@33208|Metazoa,3D27W@33213|Bilateria,48A93@7711|Chordata,494G6@7742|Vertebrata,3J2WS@40674|Mammalia 33208|Metazoa J synthase-like 1 - GO:0001522 - - - - - - - - - - DSPc,Laminin_G_3,PseudoU_synth_1" >> eggnog-output.emapper.annotations
"""
"""
filter_qs50.py -i ${genomes} -c ${checkm_csv} --filter
"""
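The QC cut-off itself lives inside filter_qs50.py. As a hedged shell equivalent — assuming the commonly used QS50 rule (completeness ≥ 50%, contamination ≤ 5%, and completeness − 5 × contamination > 50) applied to a genome,completeness,contamination CSV — the filter might be approximated as:

# Hedged sketch; exact thresholds and input layout are assumptions, see filter_qs50.py
awk -F',' 'NR > 1 && $2 >= 50 && $3 <= 5 && ($2 - 5 * $3) > 50 {print $1}' checkm_quality.csv > qs50_pass.txt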
"""
functional_annotations_summary.py \
    -f ${cluster_rep_faa} \
    -i ${ips_annotation_tsvs} \
    -e ${eggnog_annotation_tsvs} \
    -k ${kegg_classes}
"""
"""
touch ${cluster_rep_faa.baseName}_annotation_coverage.tsv
touch ${cluster_rep_faa.baseName}_kegg_classes.tsv
touch ${cluster_rep_faa.baseName}_kegg_modules.tsv
touch ${cluster_rep_faa.baseName}_cazy_summary.tsv
touch ${cluster_rep_faa.baseName}_cog_summary.tsv
"""
"""
cut -f1 ${mmseqs_100_cluster_tsv} | sort -u > rep_list.txt

mkdir gene_catalogue
cp ${mmseqs_100_cluster_tsv} gene_catalogue/clusters.tsv

# Make the catalogue #
seqtk subseq \
    ${cluster_reps_ffn} \
    rep_list.txt > gene_catalogue/gene_catalogue-100.ffn
"""
"""
generate_extra_weight_table.py \
    -d ${genomes_folder} \
    -o extra_weight_table.txt ${args}
"""
"""
touch extra_weight_table.txt
"""
"""
generate_summary_json.py \
    --annot-cov ${coverage_summary} \
    --gff ${annotated_gff} \
    --metadata ${metadata} \
    --biome "${biome}" \
    --species-faa ${cluster_rep_faa} \
    --species-name ${cluster} ${args} \
    --output-file ${cluster}.json
"""
"""
touch ${cluster}.json
"""
"""
GTDBTK_DATA_PATH=/opt/gtdbtk_refdata \
gtdbtk classify_wf \
    --cpus ${task.cpus} \
    --pplacer_cpus ${task.cpus} \
    --genome_dir genomes_dir \
    --extension fna \
    --skip_ani_screen \
    --out_dir gtdbtk_results

tar -czf gtdbtk_results.tar.gz gtdbtk_results
"""
"""
mkdir gtdbtk_results
mkdir -p gtdbtk_results/classify
touch gtdbtk_results/classify/gtdbtk.bac120.summary.tsv
touch gtdbtk_results/classify/gtdbtk.ar53.summary.tsv
echo "user_genome classification fastani_reference fastani_reference_radius fastani_taxonomy fastani_ani fastani_af closest_placement_reference closest_placement_radius closest_placement_taxonomy closest_placement_ani closest_placement_af pplacer_taxonomy classification_method note other_related_references(genome_id,species_name,radius,ANI,AF) msa_percent translation_table red_value warnings" > gtdbtk_results/classify/gtdbtk.bac120.summary.tsv

for file in $drep_folder/*
do
    GENOME=\$(basename \$file .fna)
    echo "\$GENOME d__Bacteria;p__Actinobacteriota;c__Actinomycetia;o__Actinomycetales;f__Micrococcaceae;g__Rothia;s__Rothia mucilaginosa_B GCF_001548235.1 95 d__Bacteria;p__Actinobacteriota;c__Actinomycetia;o__Actinomycetales;f__Micrococcaceae;g__Rothia;s__Rothia mucilaginosa_B 95.51 0.96 GCF_000175615.1 95 d__Bacteria;p__Actinobacteriota;c__Actinomycetia;o__Actinomycetales;f__Micrococcaceae;g__Rothia;s__Rothia mucilaginosa 94.5 0.94 d__Bacteria;p__Actinobacteriota;c__Actinomycetia;o__Actinomycetales;f__Micrococcaceae;g__Rothia;s__ ANI topological placement and ANI have incongruent species assignments GCF_000269965.1, s__Bifidobacterium infantis, 95.0, 94.8, 0.77 97.9 11 N/A N/A" >> gtdbtk_results/classify/gtdbtk.bac120.summary.tsv
done
"""
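A simple way to pull a genome-to-lineage table out of the GTDB-Tk results, relying only on user_genome and classification being the first two columns of the summary TSVs (the output file name is a placeholder):

# Usage sketch: combine bacterial and archaeal summaries into one two-column table
grep -hv "^user_genome" \
    gtdbtk_results/classify/gtdbtk.bac120.summary.tsv \
    gtdbtk_results/classify/gtdbtk.ar53.summary.tsv | cut -f1,2 > gtdb_taxonomy.tsv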
"""
gunc run -t ${task.cpus} \
    -i ${fasta} \
    -r ${gunc_db}

### gunc contaminated genomes ###
awk '{if(\$8 > 0.45 && \$9 > 0.05 && \$12 > 0.5) print\$1}' GUNC.*.maxCSS_level.tsv | grep -v "pass.GUNC" > gunc_contaminated.txt
# gunc_contaminated.txt could be empty - that means genome is OK
# gunc_contaminated.txt could have this genome inside - that means gunc filtered this genome

### check completeness ###
# remove header
tail -n +2 "${genomes_checkm}" > genomes.csv

### get notcompleted genomes ###
cat genomes.csv | tr ',' '\t' | awk '{if(\$2 < 90)print\$1}' > notcompleted.txt

grep -f gunc_contaminated.txt notcompleted.txt > bad.txt || true
# if bad.txt is not empty - that means genome didnt pass completeness and gunc filters

### final decision ###
if [ -s bad.txt ]; then
    touch ${fasta.baseName}_gunc_empty.txt
else
    touch ${fasta.baseName}_gunc_complete.txt
fi
"""
"""
samtools faidx ${fasta}
"""
"""
touch ${fasta.simpleName}.fai
"""
"""
interproscan.sh \
    -cpu ${task.cpus} \
    -dp \
    --goterms \
    -pa \
    -f TSV \
    --input ${faa_fasta} \
    -o ${faa_fasta.baseName}.IPS.tsv
"""
"""
gunzip -c ${msa_fasta_gz} > ${output_prefix}_alignment.faa

iqtree -T 8 \
    -s ${output_prefix}_alignment.faa \
    --prefix iqtree.${output_prefix}

cp iqtree.${output_prefix}.treefile ${output_prefix}_iqtree.nwk
cp ${msa_fasta_gz} ${output_prefix}_alignment.faa.gz
"""
"""
# Prepare the GTDB inputs #
cat ${gtdbtk_concatenated} | grep -v \"user_genome\" | cut -f1-2 > kraken_taxonomy_temp.tsv

while read line; do
    NAME=\$(echo \$line | cut -d ' ' -f1 | cut -d '.' -f1)
    echo \$line | sed "s/__\\;/__\$NAME\\;/g" | sed "s/s__\$/s__\$NAME/g"
done < kraken_taxonomy_temp.tsv > kraken_taxonomy.tsv

sed -i "s/ /\t/" kraken_taxonomy.tsv

gtdbToTaxonomy.pl \
    --infile kraken_taxonomy.tsv \
    --sequence-dir reps_fa/ \
    --output-dir kraken_intermediate

mkdir ${kraken_db_name}
cp -r kraken_intermediate/taxonomy ${kraken_db_name}
"""
"""
kraken2-build \
    --add-to-library ${cluster_fna_tax_annotated} \
    --db ${kraken_db_path}
"""
"""
kraken2-build --build \
    --db ${kraken_db_path} \
    --threads ${task.cpus}
"""
"""
cat ${kraken_db}/library/added/*.fna > ${kraken_db}/library/library.fna
cp "${kraken_db}"/taxonomy/prelim_map.txt ${kraken_db}/library
"""
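Once the catalogue database has been built, it can be used outside the pipeline to classify reads and re-estimate species abundances. A hedged downstream sketch (database path, read files, read length and thread count are placeholders):

# Not part of the pipeline itself - example downstream use of the finished database
kraken2 --db kraken_db \
    --threads 8 \
    --paired reads_1.fastq.gz reads_2.fastq.gz \
    --report sample.kreport \
    --output sample.kraken

bracken -d kraken_db \
    -i sample.kreport \
    -o sample.bracken \
    -r 150 \
    -l S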
"""
mash2nwk1.R -m ${mash}

mv trees/mashtree.nwk ${mash.baseName}.nwk
"""
"""
mash sketch -o all_genomes.msh ${genomes_fasta.join( ' ' )}
"""
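The tree-building step above (mash2nwk1.R) consumes a Mash distance table rather than the sketch itself. The connecting step is not shown in these snippets; a hedged sketch of it, assuming an all-vs-all mash dist table is what the R script expects:

# Hedged sketch of the link between `mash sketch` and mash2nwk1.R; output name is a placeholder
mash dist all_genomes.msh all_genomes.msh > mash_distances.tsv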
"""
merge_ncbi_ena.py --ncbi ${ncbi_genomes} \
    --ncbi-csv ${ncbi_genomes_checkm} \
    --ena ${ena_genomes} \
    --ena-csv ${ena_genomes_checkm} \
    --outname merged_genomes
"""
"""
mkdir merged_genomes
touch merged_genomes.csv
"""
"""
create_metadata_table.py \
    --genomes-dir genomes_dir \
    --extra-weight-table ${extra_weights_tsv} \
    --checkm-results ${check_results_tsv} \
    --rna-results rRNA_outs \
    --naming-table ${name_mapping_tsv} \
    --clusters-table ${clusters_tsv} \
    --taxonomy ${gtdb_summary_tsv} \
    --ftp-name ${ftp_name} \
    --ftp-version ${ftp_version} \
    --geo ${geo_metadata} ${args} \
    --outfile genomes-all_metadata.tsv
"""
"""
timestamp() {
    date +"%H:%M:%S"
}

echo "\$(timestamp) [mmseqs script] Creating MMseqs database"
mmseqs createdb ${faa_file} mmseqs.db

echo "\$(timestamp) [mmseqs script] Clustering MMseqs with linclust with option -c ${id_threshold}"
mmseqs linclust \
    mmseqs.db \
    mmseqs_cluster.db \
    mmseqs-tmp --min-seq-id ${id_threshold} \
    --threads ${task.cpus} \
    -c ${cov_threshold} \
    --cov-mode 1 \
    --cluster-mode 2 \
    --kmer-per-seq 80

echo "\$(timestamp) [mmseqs script] Parsing output to create FASTA file of all sequences"
mmseqs createseqfiledb mmseqs.db \
    mmseqs_cluster.db \
    mmseqs_cluster_seq \
    --threads ${task.cpus}

mmseqs result2flat mmseqs.db \
    mmseqs.db \
    mmseqs_cluster_seq \
    mmseqs_cluster.fa

echo "\$(timestamp) [mmseqs script] Parsing output to create TSV file with cluster membership"
mmseqs createtsv mmseqs.db \
    mmseqs.db \
    mmseqs_cluster.db \
    protein_catalogue-${threshold_rounded}.tsv \
    --threads ${task.cpus}

echo "\$(timestamp) [mmseqs script] Parsing output to create FASTA file of representative sequences"
mmseqs result2repseq \
    mmseqs.db \
    mmseqs_cluster.db \
    mmseqs_cluster_rep \
    --threads ${task.cpus}

mmseqs result2flat \
    mmseqs.db \
    mmseqs.db \
    mmseqs_cluster_rep \
    protein_catalogue-${threshold_rounded}.faa \
    --use-fasta-header

# Create a tarball with all the mmseq files
tar -cv mmseqs* | gzip > mmseq_${threshold_rounded}_outdir.tar.gz

tar -cv protein_catalogue-${threshold_rounded}.faa \
    protein_catalogue-${threshold_rounded}.tsv | gzip > protein_catalogue-${threshold_rounded}.tar.gz
"""
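A possible downstream use of the protein catalogue is searching new query proteins against it. A hedged sketch using MMseqs2's convenience wrapper (query and output names are placeholders; protein_catalogue-90.faa refers to the 90% identity catalogue referenced later in these snippets):

# Optional usage sketch, not part of the pipeline
mmseqs easy-search query_proteins.faa \
    protein_catalogue-90.faa \
    hits.m8 tmp_dir \
    --threads 8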
"""
panaroo \
    -t ${task.cpus} \
    -i ${gff_files.join( ' ' )} \
    -o ${cluster_name}_panaroo \
    --clean-mode strict \
    --merge_paralogs \
    --core_threshold 0.90 \
    --threshold 0.90 \
    --family_threshold 0.5 \
    --no_clean_edges

mv ${cluster_name}_panaroo/pan_genome_reference.fa ${cluster_name}_panaroo/${cluster_name}.pan-genome.fna

tar -czf ${cluster_name}_panaroo.tar.gz ${cluster_name}_panaroo
"""
"""
per_genome_annotations.py \
    --ips ${ips_annotations_tsv} \
    --eggnog ${eggnog_annotations_tsv} \
    --rep-list ${species_reps_csv} \
    --mmseqs-tsv ${mmseq_tsv} \
    -c ${task.cpus} \
    -o output_folder
"""
"""
phylo_tree_generator.py --table ${gtdb_taxonomy_tsv} --out phylo_tree.json
"""
"""
cat ${fasta} | tr '-' ' ' > ${fasta.baseName}_cleaned.fasta

prokka ${fasta.baseName}_cleaned.fasta \
    --cpus ${task.cpus} \
    --kingdom 'Bacteria' \
    --outdir ${fasta.baseName}_prokka \
    --prefix ${fasta.baseName} \
    --force \
    --locustag ${fasta.baseName}
"""
"""
rename_fasta.py -d ${genomes} \
    -p ${genomes_prefix} \
    -i ${start_number} \
    --max ${max_number} \
    -t name_mapping.tsv \
    -o renamed_genomes \
    --csv ${check_csv}
"""
"""
mkdir renamed_genomes
touch name_mapping.tsv
touch renamed_${check_csv.baseName}_checkm.txt
"""
"""
gunzip -c ${interproscan_tsv} > interproscan.tsv

sanntis \
    --ip-file interproscan.tsv \
    --outfile ${cluster_name}_sanntis.gff \
    ${prokka_gbk}
"""
"""
sanntis \
    --ip-file ${interproscan_tsv} \
    --outfile ${cluster_name}_sanntis.gff \
    ${prokka_gbk}
"""
"""
split_drep.py --cdb ${cdb_csv} --mdb ${mdb_csv} --sdb ${sdb_csv} -o split_output
"""
"""
mv ${interproscan_annotations} protein_catalogue-90_InterProScan.tsv
mv ${eggnog_annotations} protein_catalogue-90_eggNOG.tsv

gunzip -c ${mmseq_90_tarball} > protein_catalogue-90.tar

rm ${mmseq_90_tarball}

tar -uf protein_catalogue-90.tar protein_catalogue-90_InterProScan.tsv protein_catalogue-90_eggNOG.tsv

gzip protein_catalogue-90.tar
"""
"""
rm -f gunc_failed.txt || true
touch gunc_failed.txt
for GUNC_FAILED in failed_gunc/*; do
    name=\$(basename \$GUNC_FAILED)
    genome_name="\${name%"_gunc_empty.txt"}"
    echo \$genome_name >> gunc_failed.txt
done
"""
Created: 1yr ago
Updated: 1yr ago
Maintainers:
public
URL:
https://github.com/EBI-Metagenomics/genomes-pipeline.git
Name:
mgnify-genomes-analysis-pipeline
Version:
Version 1
Downloaded:
0
Copyright:
Public Domain
License:
GNU General Public License v3.0
Keywords:
FASTQ
raw sequence reads
Genome annotation
genetic variants
AMRFinderPlus
bracken
gunc
CheckM
Diamond
dRep
eggNOG-mapper v2
Infernal cmscan (EBI)
InterProScan (EBI)
IQ-TREE
kraken2
Mash
metaprokka
MMseqs
Nextflow
SAMtools
seqtk
subSeq
tRNAscan-SE
cmsearch
extglob
gtdbtk
taxonomic and functional annotation
Related Workflows

ENCODE pipeline for histone marks developed for the psychENCODE project
psychip pipeline is an improved version of the ENCODE pipeline for histone marks developed for the psychENCODE project.
The o...

Near-real time tracking of SARS-CoV-2 in Connecticut
Repository containing scripts to perform near-real time tracking of SARS-CoV-2 in Connecticut using genomic data. This pipeli...

snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...

ATLAS - Three commands to start analyzing your metagenome data
Metagenome-atlas is an easy-to-use metagenomic pipeline based on snakemake. It handles all steps from QC, Assembly, Binning, t...
raw sequence reads
Genome assembly
Annotation track
checkm2
gunc
prodigal
snakemake-wrapper-utils
MEGAHIT
Atlas
BBMap
Biopython
BioRuby
Bwa-mem2
cd-hit
CheckM
DAS
Diamond
eggNOG-mapper v2
MetaBAT 2
Minimap2
MMseqs
MultiQC
Pandas
Picard
pyfastx
SAMtools
SemiBin
Snakemake
SPAdes
SqueezeMeta
TADpole
VAMB
CONCOCT
ete3
gtdbtk
h5py
networkx
numpy
plotly
psutil
utils
metagenomics

RNA-seq workflow using STAR and DESeq2
This workflow performs a differential gene expression analysis with STAR and DESeq2. The usage of this workflow is described ...

This Snakemake pipeline implements the GATK best-practices workflow
This Snakemake pipeline implements the GATK best-practices workflow for calling small germline variants. The usage of thi...