SAMBA: Standardized and Automated MetaBarcoding Analyses workflow
[DEPRECATED] Please use samba v4 instead: https://gitlab.ifremer.fr/bioinfo/workflows/samba
Introduction
SAMBA is a FAIR, scalable workflow integrating, in a single tool, state-of-the-art bioinformatics and statistical methods to conduct reproducible eDNA analyses using Nextflow. SAMBA starts by verifying the integrity of the raw reads and metadata. All bioinformatics processing then relies on commonly used tools (QIIME 2 and DADA2) but adds new steps based on dbOTU3 and microDecon to build high-quality ASV count tables. Extended statistical analyses are also performed. Finally, SAMBA produces a full dynamic HTML report including the resources used, commands executed, intermediate results, statistical analyses and figures.
The SAMBA pipeline can run tasks across multiple compute infrastructures in a very portable manner. It comes with Singularity containers, making installation trivial and results highly reproducible.
Quick Start
i. Install nextflow
ii. Install either Docker or Singularity for full pipeline reproducibility (please only use Conda as a last resort)
iii. Download the pipeline and test it on a minimal dataset with a single command:
- for the short-read test:
nextflow run main.nf -profile shortreadstest,<docker/singularity/conda>
- for the long-read test:
nextflow run main.nf -profile longreadstest,<docker/singularity/conda>
To use samba on a computing cluster, you need to provide a configuration file for your system. For some institutes such a file already exists and is referenced on nf-core/configs. If so, simply download your institute's custom config file and add
-c <institute_config_file>
to your command. This will enable either Docker or Singularity and set the appropriate execution settings for your local compute environment (an example command is shown after this list).
iv. Start running your own analysis!
nextflow run main.nf -profile <docker/singularity/conda>,custom [-c <institute_config_file>]
See usage docs for a complete description of all of the options available when running the pipeline.
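For illustration, a real short-read run might look like the sketch below. The parameter names (projectName, input_manifest, input_metadata, outdir) match the params used in the code snippets at the bottom of this page; the project name, file paths and institute config are placeholders to adapt to your own data:

# A minimal sketch of a custom run (all values are placeholders):
nextflow run main.nf \
    -profile singularity,custom \
    -c my_institute.config \
    --projectName my_project \
    --input_manifest q2_manifest.txt \
    --input_metadata q2_metadata.txt \
    --outdir results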
Documentation
The samba workflow comes with documentation about the pipeline, found in the docs/ directory:
Here is an overview of the many steps available in the samba pipeline:
At the end of a samba pipeline execution, you get an interactive HTML report that looks like this one:
A full description of the report is available in the samba pipeline documentation.
Credits
samba is written by SeBiMER, the Bioinformatics Core Facility of IFREMER.
Contributions
We welcome contributions to the pipeline. In such a case, you can do one of the following:
- Use issues to submit your questions
- Fork the project, make your developments and submit a pull request
- Contact us (see email below)
Support
For further information or help, don't hesitate to get in touch with the samba developers:
Citation
References
Reference databases (SILVA v132, PR2, UNITE) are available on the IFREMER FTP at ftp://ftp.ifremer.fr/ifremer/dataref/bioinfo/sebimer/sequence-set/SAMBA/2019.10.
The training dataset comes from the [Qiime2 Tutorial](https://docs.qiime2.org/2019.7/tutorials/atacama-soils); see the associated publication.
Code Snippets
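Each block below is the script body of a process in main.nf; the one-line descriptions preceding each block are inferred from the script names and arguments.

Download the minimal test dataset (reads, manifest and metadata) used by the test profiles: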
"""
get_test_data.sh ${baseDir} 'data_is_ready' ${datatype} 'manifest' 'metadata' &> get_test_data.log 2>&1
"""
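Convert an Excel input file to the TSV manifest/metadata format: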
"""
excel2tsv.py -x ${xls} -s ${datatype} &> excel2tsv.log 2>&1
"""
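Check the integrity of the metadata and manifest files (primer rate, control list):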
"""
data_integrity.py -e ${metadata} -a ${manifest} -p ${params.primer_filter} -s ${datatype} -t ${task.cpus} -c ${params.control_list} &> data_integrity.log 2>&1
"""
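Import the raw reads into a QIIME 2 artifact and record the QIIME 2 version: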
"""
q2_import.sh ${params.singleEnd} ${q2_manifest} data.qza data.qzv import_output completecmd &> q2_import.log 2>&1
qiime --version|grep 'q2cli'|cut -d' ' -f3 > v_qiime2.txt
"""
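Trim primers with cutadapt (QIIME 2 plugin):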
"""
q2_cutadapt.sh ${params.singleEnd} ${task.cpus} ${imported_data} ${params.primerF} ${params.primerR} ${params.errorRate} ${params.overlap} data_trimmed.qza data_trimmed.qzv trimmed_output completecmd &> q2_cutadapt.log 2>&1
"""
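Infer ASVs with DADA2 (trimming, truncation, error filtering, chimera removal):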
"""
q2_dada2.sh ${params.singleEnd} ${dada2_input} ${metadata} rep_seqs.qza rep_seqs.qzv table.qza table.qzv stats.qza stats.qzv dada2_output ${params.FtrimLeft} ${params.RtrimLeft} ${params.FtruncLen} ${params.RtruncLen} ${params.FmaxEE} ${params.RmaxEE} ${params.minQ} ${params.chimeras} ${task.cpus} completecmd &> q2_dada2.log 2>&1
"""
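Refine the ASVs with dbOTU3 distribution-based clustering: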
"""
q2_dbotu3.sh ${table} ${seqs} ${metadata} dbotu3_details.txt dbotu3_seqs.qza dbotu3_seqs.qzv dbotu3_table.qza dbotu3_table.qzv dbotu3_output ${params.gen_crit} ${params.abund_crit} ${params.pval_crit} completecmd &> q2_dbotu3.log 2>&1
"""
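Merge ASV tables and representative sequences from separate DADA2 runs: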
"""
q2_merge.sh ${table_dir} ${seq_dir} merged_table.qza merged_seq.qza ${metadata_merge} merged_table.qzv dada2_output merged_seq.qzv completecmd &> q2_merge.log 2>&1
"""
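Assign taxonomy to the representative sequences: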
"""
q2_taxo.sh ${task.cpus} ${params.extract_db} ${params.primerF} ${params.primerR} ${params.confidence} ${repseqs_taxo} taxonomy.qza taxonomy.qzv taxo_output ASV_taxonomy.tsv ${summary_output} ASV_table_with_taxonomy.biom ASV_table_with_taxonomy.tsv taxonomic_database.qza seqs_db_amplicons.qza completecmd tmpdir ${params.database} ${params.seqs_db} ${params.taxo_db} &> q2_taxo.log 2>&1
"""
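Filter the ASV table and sequences by taxonomy (inclusion/exclusion lists):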
"""
q2_filtering_tax.sh ${asv_table} ${asv_tax} ${params.tax_to_exclude} ${params.tax_to_include} tax_filtered_table.qza ${asv_seq} tax_filtered_seq.qza tax_filtered_table.qzv ${metadata} tax_filtered_output tax_filtered_seq.qzv ${asv_tax_tsv} tax_filtered_table_with_tax.biom tax_filtered_table_with_tax.tsv completecmd &> q2_filtering_tax.log 2>&1
"""
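Decontaminate the ASV table with microDecon using the control samples: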
"""
sed '1d' ${microDecon_table} > microDecon_table
sed -i 's/#OTU ID/ASV_ID/g' microDecon_table
sed -i "s/'//g" microDecon_table
microDecon.R microDecon_table ${params.control_list} ${params.nb_controls} ${params.nb_samples} decontaminated_ASV_table.tsv abundance_removed.txt ASV_removed.txt &> microDecon.log 2>&1
cp ${baseDir}/bin/microDecon.R completecmd &>> microDecon.log 2>&1
Rscript -e "write(x=as.character(packageVersion('microDecon')), file='v_microdecon.txt')"
"""
"""
biom convert -i ${table4microDecon} -o decontaminated_ASV_table.biom --to-hdf5 --table-type="OTU table" --process-obs-metadata taxonomy
qiime tools import --input-path decontaminated_ASV_table.biom --type 'FeatureTable[Frequency]' --input-format BIOMV210Format --output-path decontaminated_ASV_table.qza
"""
"""
cut -d \$'\t' -f1 ${decontam_table} | sed '1d' > decontaminated_ASV_ID.txt
seqtk subseq ${dada2_output}/sequences.fasta decontaminated_ASV_ID.txt > decontaminated_ASV.fasta
"""
"""
qiime tools import --input-path ${ASV_fasta} --output-path decontam_seqs.qza --type 'FeatureData[Sequence]'
"""
"""
minimap2 -t ${task.cpus} -K 25M -ax ${params.lr_type} -L ${params.lr_tax_fna} ${fastq} | samtools view -h -F0xe00 | samtools sort -o ${sample}.bam -O bam - &> lr_minimap2.log 2>&1
touch lr_mapping.ok
samtools index ${sample}.bam &> lr_samtools-sort.log 2>&1
touch lr_samtools-sort.ok
minimap2 --version > v_minimap2.txt
"""
"""
seqtk seq -a ${fastq} > 'lr_sequences.fasta'
"""
"""
lr_count_table_minimap2.py -b "." -t "${params.lr_taxo_flat}" -r ${params.lr_rank} -o samples.tsv &> lr_count_table.log 2>&1
touch lr_count_table.ok
"""
"""
q2_phylogeny.sh ${repseqs_phylo} aligned_repseq.qza masked-aligned_repseq.qza tree.qza tree.log rooted_tree.qza tree_export_dir tree_export.log completecmd ${task.cpus} &>> q2_phylogeny.log 2>&1
cp tree_export_dir/tree.nwk tree.nwk &>> q2_phylogeny.log 2>&1
"""
"""
q2_picrust2.sh ${table_picrust2} ${seqs_picrust2} q2-picrust2_output ${task.cpus} ${params.method} ${params.nsti} complete_picrust2_cmd &> q2_picrust2.log 2>&1
"""
"""
functional_predictions.R ec_metagenome_predictions_with-descriptions.tsv ko_metagenome_predictions_with-descriptions.tsv pathway_abundance_predictions_with-descriptions.tsv ${metadata} ${picrust_vars} functional_predictions_NMDS_${picrust_vars} ${params.microDecon_enable} ${params.control_list} &> picrust2_stats.log 2>&1
cp ${baseDir}/bin/functional_predictions.R complete_picrust2_stats_cmd &>> picrust2_stats.log 2>&1
"""
"""
q2_ANCOM.sh ${table4ancom} compo_table.qza ${metadata} ${ancom_var} ancom_${ancom_var}.qzv export_ancom_${ancom_var} ${taxonomy4ancom} collapsed_table_family.qza compo_table_family.qza ancom_${ancom_var}_family.qzv export_ancom_${ancom_var}_family collapsed_table_genus.qza compo_table_genus.qza ancom_${ancom_var}_genus.qzv export_ancom_${ancom_var}_genus completecmd_ancom &> q2_ancom.log 2>&1
"""
"""
if [ ${params.longreads} = "true" ]; then
    Rscript --vanilla ${baseDir}/bin/create_phyloseq_obj_longreads.R phyloseq.rds ${biom_tsv} ${metadata} ${params.lr_rank} ${params.kingdom} table_with_taxo_for_stats.tsv &>> stats_prepare_data_lr.log 2>&1
else
    prepare_data_for_stats.sh ${metadata} ${biom_tsv} table_with_taxo_for_stats.tsv metadata_stats.tsv ${params.microDecon_enable} ${params.stats_only} &> stats_prepare_data.log 2>&1
    Rscript --vanilla ${baseDir}/bin/create_phyloseq_obj.R phyloseq.rds table_with_taxo_for_stats.tsv metadata_stats.tsv ${params.microDecon_enable} ${params.control_list} ${newick_tree} ${params.remove_sample} ${params.sample_to_remove} &>> stats_prepare_data.log 2>&1
fi
## get statistics libraries version for report
Rscript -e "write(x=as.character(paste0(R.Version()[c('major','minor')], collapse = '.')), file='v_R.txt')"
Rscript -e "library(dplyr); write(x=as.character(packageVersion('dplyr')), file='v_dplyr.txt')"
Rscript -e "library(stringr); write(x=as.character(packageVersion('stringr')), file='v_stringr.txt')"
Rscript -e "library(phyloseq); x=as.character(packageVersion('phyloseq')); write(x, file='v_phyloseq.txt')"
Rscript -e "library(phangorn); write(x=as.character(packageVersion('phangorn')), file='v_phangorn.txt')"
Rscript -e "library(ggplot2); write(x=as.character(packageVersion('ggplot2')), file='v_ggplot2.txt')"
Rscript -e "library(svglite); write(x=as.character(packageVersion('svglite')), file='v_svglite.txt')"
Rscript -e "library(vegan); write(x=as.character(packageVersion('vegan')), file='v_vegan.txt')"
Rscript -e "library(RColorBrewer); write(x=as.character(packageVersion('RColorBrewer')), file='v_RColorBrewer.txt')"
Rscript -e "library(tidyr); write(x=as.character(packageVersion('tidyr')), file='v_tidyr.txt')"
Rscript -e "library(gridExtra); write(x=as.character(packageVersion('gridExtra')), file='v_gridExtra.txt')"
Rscript -e "library(egg); write(x=as.character(packageVersion('egg')), file='v_egg.txt')"
Rscript -e "library(reshape2); write(x=as.character(packageVersion('reshape2')), file='v_reshape2.txt')"
Rscript -e "library(BiocManager); write(x=as.character(packageVersion('BiocManager')), file='v_BiocManager.txt')"
Rscript -e "library(microbiome); write(x=as.character(packageVersion('microbiome')), file='v_microbiome.txt')"
Rscript -e "library(dendextend); write(x=as.character(packageVersion('dendextend')), file='v_dendextend.txt')"
Rscript -e "library(metagenomeSeq); write(x=as.character(packageVersion('metagenomeSeq')), file='v_metagenomeSeq.txt')"
Rscript -e "library(DESeq2); write(x=as.character(packageVersion('DESeq2')), file='v_DESeq2.txt')"
Rscript -e "library(UpSetR); write(x=as.character(packageVersion('UpSetR')), file='v_UpSetR.txt')"
touch 'version_ok'
"""
"""
Rscript --vanilla ${baseDir}/bin/alpha_diversity.R ${phyloseq_rds} alpha_div_values.txt alpha_div_plots_${alpha_var} ${params.kingdom} ${params.taxa_nb} barplot_phylum_${alpha_var} barplot_class_${alpha_var} barplot_order_${alpha_var} barplot_family_${alpha_var} barplot_genus_${alpha_var} ${alpha_var} index_significance_tests.txt $workflow.projectDir rarefaction_curve ${params.longreads} ${params.lr_rank} &> stats_alpha_diversity.log 2>&1
touch process_alpha_report.ok
"""
"""
Rscript --vanilla ${baseDir}/bin/beta_diversity.R ${phyloseq_rds} ${beta_var} ${metadata} $workflow.projectDir NMDS_${beta_var}_ PCoA_${beta_var}_ ${params.hc_method} hclustering_${beta_var}_ variance_significance_tests_ pie_ExpVar_ ${params.longreads} &> stats_beta_diversity.log 2>&1
touch process_beta_report.ok
"""
"""
Rscript --vanilla ${baseDir}/bin/beta_diversity_rarefied.R ${phyloseq_rds} Final_rarefied_ASV_table_with_taxonomy.tsv ${beta_var} ${metadata} $workflow.projectDir NMDS_rarefied_${beta_var}_ PCoA_rarefied_${beta_var}_ ${params.hc_method} hclustering_rarefied_${beta_var}_ variance_significance_tests_rarefied_ pie_ExpVar_rarefied_ ${params.longreads} &> stats_beta_diversity_rarefied.log 2>&1
touch process_beta_report_rarefied.ok
"""
"""
Rscript --vanilla ${baseDir}/bin/beta_diversity_deseq2.R ${phyloseq_rds} Final_DESeq2_ASV_table_with_taxonomy.tsv ${beta_var} ${metadata} $workflow.projectDir NMDS_DESeq2_${beta_var}_ PCoA_DESeq2_${beta_var}_ ${params.hc_method} hclustering_DESeq2_${beta_var}_ variance_significance_tests_DESeq2_ pie_ExpVar_DESeq2_ ${params.longreads} &> stats_beta_diversity_deseq2.log 2>&1
touch process_beta_report_DESeq2.ok
"""
"""
Rscript --vanilla ${baseDir}/bin/beta_diversity_css.R ${phyloseq_rds} Final_CSS_ASV_table_with_taxonomy.tsv ${beta_var} ${metadata} $workflow.projectDir NMDS_CSS_${beta_var}_ PCoA_CSS_${beta_var}_ ${params.hc_method} hclustering_CSS_${beta_var}_ variance_significance_tests_CSS_ pie_ExpVar_CSS_ ${params.longreads} &> stats_beta_diversity_css.log 2>&1
touch process_beta_report_CSS.ok
"""
"""
Rscript --vanilla ${baseDir}/bin/desc_comp.R ${phyloseq_rds} ${desc_comp_var} upset_plot_${desc_comp_var} ${params.desc_comp_tax_level} &> stats_desc_comp.log 2>&1
touch process_desc_comp_report.ok
"""
"""
#!/usr/bin/env python
import os, json
from shutil import copyfile
data = {}
data["projectName"] = '$params.projectName'
data["run_date"] = '$date'
data["singleEnd"] = '$params.singleEnd'
data["manifest"] = '$params.input_manifest'
data["metadata"] = '$params.input_metadata'
data["outdir"] = '$params.outdir'
data["steps"] = {}
data["steps"]["data_integrity_enable"] = '$params.data_integrity_enable'
data["steps"]["cutadapt_enable"] = '$params.cutadapt_enable'
data["steps"]["dbotu3_enable"] = '$params.dbotu3_enable'
data["steps"]["filtering_tax_enable"] = '$params.filtering_tax_enable'
data["steps"]["microDecon_enable"] = '$params.microDecon_enable'
data["steps"]["ancom_enable"] = '$params.ancom_enable'
data["steps"]["picrust2_enable"] = '$params.picrust2_enable'
data["steps"]["stats_alpha_enable"] = '$params.stats_alpha_enable'
data["steps"]["stats_beta_enable"] = '$params.stats_beta_enable'
data["steps"]["stats_desc_comp_enable"] = '$params.stats_desc_comp_enable'
data["steps"]["report_enable"] = '$params.report_enable'
data["steps"]["stats_only"] = '$params.stats_only'
data["steps"]["dada2merge"] = '$params.dada2merge'
data["steps"]["longreads"] = '$params.longreads'
data["integrity"] = {}
data["integrity"]["primer_filter"] = '$params.primer_filter' or None
data["cutadapt"] = {}
data["cutadapt"]["primerF"] = '$params.primerF' or None
data["cutadapt"]["primerR"] = '$params.primerR' or None
data["cutadapt"]["errorRate"] = '$params.errorRate' or None
data["cutadapt"]["overlap"] = '$params.overlap' or None
data["dada2"] = {}
data["dada2"]["FtrimLeft"] = '$params.FtrimLeft' or None
data["dada2"]["RtrimLeft"] = '$params.RtrimLeft' or None
data["dada2"]["FtruncLen"] = '$params.FtruncLen' or None
data["dada2"]["RtruncLen"] = '$params.RtruncLen' or None
data["dada2"]["FmaxEE"] = '$params.FmaxEE' or None
data["dada2"]["RmaxEE"] = '$params.RmaxEE' or None
data["dada2"]["minQ"] = '$params.minQ' or None
data["dada2"]["chimeras"] = '$params.chimeras' or None
data["dada2merge"] = {}
data["dada2merge"]["merge_tabledir"] = '$params.merge_tabledir' or None
data["dada2merge"]["merge_repseqsdir"] = '$params.merge_repseqsdir' or None
data["dbotu3"] = {}
data["dbotu3"]["gen_crit"] = '$params.gen_crit' or None
data["dbotu3"]["abund_crit"] = '$params.abund_crit' or None
data["dbotu3"]["pval_crit"] = '$params.pval_crit' or None
data["taxonomy"] = {}
data["taxonomy"]["database"] = '$params.database' or None
data["taxonomy"]["seqs_db"] = '$params.seqs_db' or None
data["taxonomy"]["taxo_db"] = '$params.taxo_db' or None
data["taxonomy"]["extract_db"] = '$params.extract_db'
data["taxonomy"]["confidence"] = '$params.confidence' or None
data["picrust2"] = {}
data["picrust2"]["method"] = '$params.method' or None
data["picrust2"]["nsti"] = '$params.nsti' or None
data["tax_filtering"] = {}
data["tax_filtering"]["tax_to_exclude"] = '$params.tax_to_exclude' or None
data["microdecon"] = {}
data["microdecon"]["control_list"] = '$params.control_list' or None
data["microdecon"]["nb_controls"] = '$params.nb_controls' or None
data["microdecon"]["nb_samples"] = '$params.nb_samples' or None
data["lr"] = {}
data["lr"]["lr_type"] = '$params.lr_type' or None
data["lr"]["lr_rank"] = '$params.lr_rank' or None
data["lr"]["lr_tax_fna"] = '$params.lr_tax_fna' or None
data["lr"]["lr_tax_flat"] = '$params.lr_taxo_flat' or None
data["stats"] = {}
data["stats"]["ancom_var"] = '$params.ancom_var' or None
data["stats"]["kingdom"] = '$params.kingdom' or None
data["stats"]["taxa_nb"] = '$params.taxa_nb' or None
data["stats"]["hc_method"] = '$params.hc_method' or None
data["stats"]["alpha_div_group"] = '$params.alpha_div_group' or None
data["stats"]["beta_div_var"] = '$params.beta_div_var' or None
data["stats"]["desc_comp_crit"] = '$params.desc_comp_crit' or None
data["stats"]["inasv_table"] = '$params.inasv_table' or None
data["stats"]["innewick"] = '$params.innewick' or None
#software versions
data["soft"] = {}
data["soft"]["samba"] = '$workflow.manifest.version'
data["soft"]["nextflow"] = '$workflow.nextflow.version'
for elt in os.listdir("${params.outdir}/${params.report_dirname}/version"):
    if elt.endswith(".txt"):
        f=open(os.path.join("${params.outdir}/${params.report_dirname}/version",elt),"r")
        tool_name=elt.replace("v_","").replace(".txt","")
        version=f.readline().rstrip()
        data["soft"][tool_name] = version
with open('data.json', 'w') as outfile:
    json.dump(data, outfile, sort_keys=True, indent=4)
os.system("${baseDir}/bin/SAMBAreport.py -t ${SAMBAtemplate} -p ${params.outdir}/${params.report_dirname} -c 'data.json'")
os.system("cp ${wf_image} samba_wf.png")
"""
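Zip the report directory and symlink the archive: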
"""
cd ${params.outdir}/${params.report_dirname} && zip -r SAMBA_report.zip *
cd -
ln -s ${params.outdir}/${params.report_dirname}/SAMBA_report.zip
"""