Snakemake workflow for metagenomic classification with Kraken2
A Snakemake pipeline wrapper of the Kraken2 short read metagenomic classification software, with additional tools for analysis, plots, compositional data and differential abundance calculation. Designed and maintained by Ben Siranosian in Ami Bhatt's lab at Stanford University.
Introduction
Kraken2 is a short read classification system that is fast and memory efficient. It assigns a taxonomic identification to each sequencing read, by using the lowest common ancestor (LCA) of matching genomes in the database. Using Bracken provides accurate estimates of proportions of different species. This guide will cover some of the basics, but the full Kraken2 manual has much more detail.
Table of contents
Quickstart
Install
If you're in the Bhatt lab, most of this work will take place on the SCG cluster. External users should set this pipeline up on their infrastructure of choice: cloud, HPC, or even a laptop will work for processing small datasets. You will have to
download or build a database
, set the database options, and create the
sample input files
. All steps of this pipeline are containerized, meaning only
snakemake
and
singularity
are required to run all tools.
If you're in the Bhatt lab, use these instructions to set up snakemake and set up a profile to submit jobs to the SCG cluster. External users should follow these instructions:
-
Create a fresh environment with
snakemake
or add it to an existing environment. Activate this environment for any step using this pipeline:
mamba create --name snakemake --channel conda-forge --channel bioconda snakemake
conda activate snakemake
Then, clone this repo in a convenient location.
git clone https://github.com/bhattlab/kraken2_classification.git
cd kraken2_classification
Run with test data
A Kraken2 database is required to use this pipeline. Pre-built database can be downloaded from Ben Langmead's site . As an example, we download the standard database limited to 8GB memory use, and unpack it into a folder to use with the tests:
cd kraken2_classification/tests
wget https://genome-idx.s3.amazonaws.com/kraken/k2_standard_08gb_20230605.tar.gz
mkdir db
tar -C db -xvf k2_standard_08gb_20230605.tar.gz
A small test dataset from Yassour et. al (2018) is included in this repo. 10,000 reads from several timepoints from a mother-infant pair are used. Even with such low coverage, the differences in microbiome composition are apparent in clustering and taxonomic barplots. Launch and end-to-end test run with a command like so:
# Launch this from the kraken2_classification directory
snakemake -s Snakefile --configfile tests/test_config/config_pe.yaml -j1 --use-singularity
The script
tests/run_tests.sh
ensures basic functionality of the pipeline executes as expected.
Run with real-world data
Copy the
config.yaml
file into the working directory for your samples. Change the options to suit your project. The main input is the
sample_reads_file
which defines the mapping from sample names to sequencing reads. See
Usage
for more detail.
On the Bhatt lab SCG cluster, you can then launch the workflow with a snakemake command like so:
# Snakemake workflow - change options in config.yaml first
snakemake -s path/to/Snakefile --configfile config.yaml --use-singularity --singularity-args '--bind /oak/,/labs/,/home' --profile scg --jobs 99
If you're not in the Bhatt lab, a more general command should be sufficient, but you might need to add singularity bind arguments or a profile for SLURM job submission depending on your configuration. This example uses 8 cores, but that can be changed to reflect available resources.
snakemake -s path/to/Snakefile --configfile config.yaml --use-singularity --jobs 8 --cores 8
After running the workflow and you're satisfied the results, run the cleanup command to remove temporary files that are not needed anymore.
snakemake cleanup -s path/to/Snakefile --configfile config.yaml
Run analysis on existing data
If you have a collection of kraken/bracken reports and just want to run the downstream analysis in this pipeline, you can provide the
sample_reports_file
in the config, which is a map from sample names to kraken and bracken report files. See
tests/test_config/config_downstream_only_bracken.yaml
as an example. Then, launch the pipeline with
Snakefile_downstream_only
. Tune the filtering and job submission parameters to meet your needs.
snakemake -s Snakefile_downstream_only --configfile tests/test_config/config_downstream_only_bracken.yaml -j1 --use-singularity
Parsing output reports
The Kraken reports
classification/sample.krak.report
, bracken reports
sample.krak_bracken.report
, and data matrices or GCTx objects in the
processed_results
folder are the best for downstream analysis. See
Downstream processing and plotting
for details on using the data in R.
Example output
The pipeline outputs data, results and figures in the structure below.
| classification
- sample.krak Kraken results - classification of each read. These files
can get very large and are unnecessary if you only want the reports.
- sample.krak.report Kraken report - reads and percentages at each taxonomic level.
- sample.krak.report.bracken Standard bracken report at species level. Not useful, use the one below.
- sample.krak_bracken.report Most useful format of the the Bracken results.
| processed_results
- diversity.txt Diversity of each sample at each taxonomic level
| ALDEX2_differential_abuncance Compositional data analysis done with the ALDEx2 package.
Only done if you have 2 groups in the sample_groups file.
- aldex_result_[].tsv Differential abundance at the given taxonomic level.
- aldex_scatter_[].pdf Scatterplot of effect vs dispersion with significant hits highlighted
- aldex_significant_boxplots_[].pdf Boxplot of any significant hits
| braycurtis_matrices
- bravcurtis_distance_[].tsv Matrix of braycurtis distance between samples at each taxonomic level
| plots Lots of plots!
- classified_taxonomy_barplot_[].pdf Barplot at each taxonomic level.
- compositional_PCA_plot.pdf PCA done on clr values
- diversity_allsamples.pdf Diversity barplot
- diversity_by_group.pdf Diversity barplot, stratified by sample group
- PCoA_2D_plot_labels.pdf Principal coordinates analysis, calculated on braycurtis distances
- PCoA_2D_plot_nolabels.pdf Same as above without the point labels
- rarefaction_curve.pdf Rarefaction "collectors" curve plot
| taxonomy_gctx_classified_only
- bracken_[]_reads.gctx GCTx file with matrix containing reads classified at each level
- bracken_[]_percentage.gctx Same, but percentage of classified reads
| taxonomy_matrices_classified_only
- bracken_[]_reads.txt Matrix of taxa by sample classified reads
- bracken_[]_percentage.txt Same, but percentage of classified reads
- clr_values_[].txt From compositional data analysis, centered log-ratio values at each level
| processed_results_krakenonly
| Same as above, but using the results without Bracken. Also includes taxonomy matrices
| that have unclassified reads in them (as bracken no longer reports unclassified reads)
| unmapped reads
- sample_unmapped_1.fq Only present if selected in the config file; reads that are not
- sample_unmapped_2.fq classified, as paired end fastq.
Taxonomic barplot
PCoA plot example
Changelog
2023-06-07
v2.0 (Breaking changes introduced to to configuration files and the ways parameters are used). This set of changes did a bit to modernize the pipeline:
-
All steps are now available with containerized execution
-
Created separate pipeline
Snakefile_downstream_only
which works from a list of report files to only un the downstream analysis steps. -
Included a small test dataset and better test execution
-
Various code and README/manual changes
-
License file added
2019-09-01
The outputs of this pipeline have been vastly improved! Both internally and saved data now use the GCTx data format, from the CMapR package. Basically, a GCT object is a data matrix that has associated row and column metadata. This allows for consistent metadata to live with the classification data, for both the rows (taxonomy information) and columns (sample metadata). See section 8. GCTx data processing for more information and tools for working with the new implementation.
Also as of this update, the NCBI taxonomy information used by Kraken is filtered and improved some before saving any data or figures. For example, there were previously many taxonomy levels simply labeled "environmental samples" that are now named with their pared taxa name to remove ambiguity. Also, levels without a proper rank designation (listed with an abbreviation and a number in the kraken report) have been forced into a specific rank when nothing was below them. This makes the taxonomy "technically incorrect", but much more practically useful in these cases. Contact me with any questions. The full list of changes is described in Additional considerations
Code Snippets
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 | import os import pandas as pd from itertools import compress import argparse ''' parser = argparse.ArgumentParser(description="Convert a kraken2 \ standard report into the mpa-style .") parser.add_argument('-i', required=True, action='store', dest='input_report', help='Input kraken2 report.') parser.add_argument('-o', required=True, action='store', dest='output', help='Output mpa style report') parser.add_argument('--no_corrections', required=False, action='store_true', dest='no_corrections', help='Some taxonomy levels \ are bugged in the mpa report because \ this script doesnt use the full ncbi taxonomy. \ Normally these are corrected based on a database \ built by comparing reports. This option allows \ you to skip the corrections.') args = parser.parse_args() ''' def main(): taxonomy_levels = ["D","P","C","O","F","G","S"] taxonomy_lower = [t.lower() for t in taxonomy_levels] report = pd.read_csv(snakemake.input[0], delimiter='\t', header=None) # filter to rows at taxonomy_levels report_filtered = report.loc[report[3].isin(taxonomy_levels)] # filter taxonomy names report[5] = [a.strip() for a in report[5]] # tax_dict keeps track of where we are in the taxonomy tax_dict = {t:"" for t in taxonomy_levels} # build report strings mpa_strings = [] for i in range(report.shape[0]): # reset tax dict appropriately # what an awful way to do this if report.iloc[i, 3][0] == "D": tax_dict.update({"P":"", "C":"", "O":"", "F":"", "G":"", "S":""}) elif report.iloc[i, 3][0] == "P": tax_dict.update({"C":"", "O":"", "F":"", "G":"", "S":""}) elif report.iloc[i, 3][0] == "C": tax_dict.update({"O":"", "F":"", "G":"", "S":""}) elif report.iloc[i, 3][0] == "O": tax_dict.update({"F":"", "G":"", "S":""}) elif report.iloc[i, 3][0] == "F": tax_dict.update({"G":"", "S":""}) elif report.iloc[i, 3][0] == "G": tax_dict.update({"S":""}) elif report.iloc[i, 3][0] == "S": tax_dict.update({}) else: continue if report.iloc[i, 3] in taxonomy_levels: tax_dict[report.iloc[i, 3]] = report.iloc[i, 5] mpa_strings.append(tax_dict_to_string(tax_dict)) # correct names if desired #if not args.no_corrections: mpa_strings = correct_mpa_strings(mpa_strings) # build data frame of report mpa_data = {"col1":mpa_strings, "col2":list(report_filtered[1])} mpa_df = pd.DataFrame(mpa_data) # write out mpa_df.to_csv(snakemake.output[0], sep='\t', header=False, index=False) def tax_dict_to_string(tax_dict): taxonomy_levels = ["D","P","C","O","F","G","S"] taxonomy_lower = [t.lower() for t in taxonomy_levels] level_list = [tax_dict[t] for t in taxonomy_levels if tax_dict[t] != ""] add_letters = list(compress(taxonomy_lower, [tax_dict[t] != "" for t in taxonomy_levels])) mpa_string = "" for letter, level in zip(add_letters, level_list): # print(letter) # print(level) if letter != "d": mpa_string = mpa_string + "|" + letter + "__" + level else: mpa_string = mpa_string + letter + "__" + level return(mpa_string) def correct_mpa_strings(mpa_strings): # database of strings to correct # shouldn't store this in this script... correction_dict = { "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Candidatus Hamiltonella|s__secondary endosymbiont of Ctenarytaina eucalypti" : "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|s__secondary endosymbiont of Ctenarytaina eucalypti", "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Candidatus Hamiltonella|s__secondary endosymbiont of Heteropsylla cubana" : "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|s__secondary endosymbiont of Heteropsylla cubana", "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Ruthia|s__Calyptogena okutanii thioautotrophic gill symbiont" : "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__Calyptogena okutanii thioautotrophic gill symbiont", "d__Bacteria|p__Proteobacteria|c__Alphaproteobacteria|o__Pelagibacterales|f__Pelagibacteraceae|g__Candidatus Fonsibacter" : "d__Bacteria|p__Proteobacteria|c__Alphaproteobacteria|o__Pelagibacterales|g__Candidatus Fonsibacter", "d__Bacteria|p__Proteobacteria|c__Alphaproteobacteria|o__Pelagibacterales|f__Pelagibacteraceae|g__Candidatus Fonsibacter|s__Candidatus Fonsibacter ubiquis" : "d__Bacteria|p__Proteobacteria|c__Alphaproteobacteria|o__Pelagibacterales|g__Candidatus Fonsibacter|s__Candidatus Fonsibacter ubiquis", "d__Bacteria|p__Proteobacteria|c__Deltaproteobacteria|o__Desulfurellales|f__Candidatus Desulfofervidaceae" : "d__Bacteria|p__Proteobacteria|c__Deltaproteobacteria|f__Candidatus Desulfofervidaceae", "d__Bacteria|p__Proteobacteria|c__Deltaproteobacteria|o__Desulfurellales|f__Candidatus Desulfofervidaceae|g__Candidatus Desulfofervidus" : "d__Bacteria|p__Proteobacteria|c__Deltaproteobacteria|f__Candidatus Desulfofervidaceae|g__Candidatus Desulfofervidus", "d__Bacteria|p__Proteobacteria|c__Deltaproteobacteria|o__Desulfurellales|f__Candidatus Desulfofervidaceae|g__Candidatus Desulfofervidus|s__Candidatus Desulfofervidus auxilii" : "d__Bacteria|p__Proteobacteria|c__Deltaproteobacteria|f__Candidatus Desulfofervidaceae|g__Candidatus Desulfofervidus|s__Candidatus Desulfofervidus auxilii", "d__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Clostridiales Family XIII. Incertae Sedis|g__Mogibacterium|s__[Eubacterium] minutum" : "d__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Clostridiales Family XIII. Incertae Sedis|s__[Eubacterium] minutum", "d__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Clostridiales Family XIII. Incertae Sedis|g__Mogibacterium|s__[Eubacterium] sulci" : "d__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Clostridiales Family XIII. Incertae Sedis|s__[Eubacterium] sulci", "d__Bacteria|p__Cyanobacteria|c__Gloeobacteria|o__Chroococcidiopsidales" : "d__Bacteria|p__Cyanobacteria|o__Chroococcidiopsidales", "d__Bacteria|p__Cyanobacteria|c__Gloeobacteria|o__Chroococcidiopsidales|f__Chroococcidiopsidaceae" : "d__Bacteria|p__Cyanobacteria|o__Chroococcidiopsidales|f__Chroococcidiopsidaceae", "d__Bacteria|p__Cyanobacteria|c__Gloeobacteria|o__Chroococcidiopsidales|f__Chroococcidiopsidaceae|g__Chroococcidiopsis" : "d__Bacteria|p__Cyanobacteria|o__Chroococcidiopsidales|f__Chroococcidiopsidaceae|g__Chroococcidiopsis", "d__Bacteria|p__Cyanobacteria|c__Gloeobacteria|o__Chroococcidiopsidales|f__Chroococcidiopsidaceae|g__Chroococcidiopsis|s__Chroococcidiopsis thermalis" : "d__Bacteria|p__Cyanobacteria|o__Chroococcidiopsidales|f__Chroococcidiopsidaceae|g__Chroococcidiopsis|s__Chroococcidiopsis thermalis", "d__Bacteria|p__Chloroflexi|c__Dehalococcoidia|o__Dehalococcoidales|f__Dehalococcoidaceae|g__Dehalogenimonas" : "d__Bacteria|p__Chloroflexi|c__Dehalococcoidia|g__Dehalogenimonas", "d__Bacteria|p__Chloroflexi|c__Dehalococcoidia|o__Dehalococcoidales|f__Dehalococcoidaceae|g__Dehalogenimonas|s__Dehalogenimonas formicexedens" : "d__Bacteria|p__Chloroflexi|c__Dehalococcoidia|g__Dehalogenimonas|s__Dehalogenimonas formicexedens", "d__Bacteria|p__Chloroflexi|c__Dehalococcoidia|o__Dehalococcoidales|f__Dehalococcoidaceae|g__Dehalogenimonas|s__Dehalogenimonas sp. WBC-2" : "d__Bacteria|p__Chloroflexi|c__Dehalococcoidia|g__Dehalogenimonas|s__Dehalogenimonas sp. WBC-2", "d__Bacteria|p__Chloroflexi|c__Dehalococcoidia|o__Dehalococcoidales|f__Dehalococcoidaceae|g__Dehalogenimonas|s__Dehalogenimonas lykanthroporepellens" : "d__Bacteria|p__Chloroflexi|c__Dehalococcoidia|g__Dehalogenimonas|s__Dehalogenimonas lykanthroporepellens", "d__Bacteria|p__Bacteroidetes|c__Chitinophagia|o__Bacteroidetes Order II. Incertae sedis" : "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis", "d__Bacteria|p__Bacteroidetes|c__Chitinophagia|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae" : "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae", "d__Bacteria|p__Bacteroidetes|c__Chitinophagia|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salinibacter" : "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salinibacter", "d__Bacteria|p__Bacteroidetes|c__Chitinophagia|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salinibacter|s__Salinibacter ruber" : "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salinibacter|s__Salinibacter ruber", "d__Bacteria|p__Bacteroidetes|c__Chitinophagia|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|s__Rhodothermaceae bacterium" : "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|s__Rhodothermaceae bacterium", "d__Bacteria|p__Bacteroidetes|c__Chitinophagia|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|s__Rhodothermaceae bacterium RA" : "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|s__Rhodothermaceae bacterium RA", "d__Bacteria|p__Bacteroidetes|c__Chitinophagia|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Rhodothermus" : "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Rhodothermus", "d__Bacteria|p__Bacteroidetes|c__Chitinophagia|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Rhodothermus|s__Rhodothermus marinus" : "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Rhodothermus|s__Rhodothermus marinus", "d__Bacteria|p__Candidatus Saccharibacteria|g__Candidatus Saccharimonas|s__Candidatus Saccharibacteria oral taxon TM7x" : "d__Bacteria|p__Candidatus Saccharibacteria|s__Candidatus Saccharibacteria oral taxon TM7x", "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Candidatus Regiella|s__secondary endosymbiont of Ctenarytaina eucalypti": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|s__secondary endosymbiont of Ctenarytaina eucalypti", "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Candidatus Regiella|s__secondary endosymbiont of Heteropsylla cubana": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|s__secondary endosymbiont of Heteropsylla cubana", "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Thiodiazotropha|s__Solemya velum gill symbiont": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__Solemya velum gill symbiont", "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Thiodiazotropha|s__endosymbiont of Galathealinum brachiosum": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__endosymbiont of Galathealinum brachiosum", "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Thiodiazotropha|s__Solemya velesiana gill symbiont": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__Solemya velesiana gill symbiont", "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Thiodiazotropha|s__Bathymodiolus thermophilus thioautotrophic gill symbiont": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__Bathymodiolus thermophilus thioautotrophic gill symbiont", "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Thiodiazotropha|s__enodsymbiont of Escarpia spicata": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__enodsymbiont of Escarpia spicata", "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Thiodiazotropha|s__Solemya elarraichensis gill symbiont": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__Solemya elarraichensis gill symbiont", "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Ruthia|s__Bathymodiolus azoricus thioautotrophic gill symbiont": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__Bathymodiolus azoricus thioautotrophic gill symbiont", "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Ruthia|s__Solemya pervernicosa gill symbiont": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__Solemya pervernicosa gill symbiont", "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Ruthia|s__endosymbiont of Ridgeia piscesae": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__endosymbiont of Ridgeia piscesae", "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Ruthia|s__Bathymodiolus septemdierum thioautotrophic gill symbiont": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__Bathymodiolus septemdierum thioautotrophic gill symbiont", "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Endoriftia|s__endosymbiont of Tevnia jerichonana": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__endosymbiont of Tevnia jerichonana", "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Endoriftia|s__endosymbiont of Riftia pachyptila": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__endosymbiont of Riftia pachyptila", "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Endoriftia|s__endosymbiont of Seepiophila jonesi": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__endosymbiont of Seepiophila jonesi", "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Endoriftia|s__endosymbiont of Lamellibrachia luymesi": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__endosymbiont of Lamellibrachia luymesi", "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Endoriftia|s__Gammaproteobacteria bacterium LUC13013_P4": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__Gammaproteobacteria bacterium LUC13013_P4", "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Endoriftia|s__Gammaproteobacteria bacterium LUC13017_P7": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__Gammaproteobacteria bacterium LUC13017_P7", "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Endoriftia|s__Gammaproteobacteria bacterium LUC14_002_19_P1": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__Gammaproteobacteria bacterium LUC14_002_19_P1", "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Endoriftia|s__Gammaproteobacteria bacterium LUC004_P11": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__Gammaproteobacteria bacterium LUC004_P11", "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Endoriftia|s__Gammaproteobacteria bacterium LUC13016_P6": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__Gammaproteobacteria bacterium LUC13016_P6", "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Endoriftia|s__Gammaproteobacteria bacterium LUC006_P13": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__Gammaproteobacteria bacterium LUC006_P13", "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Endoriftia|s__Gammaproteobacteria bacterium LUC13018_P8": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__Gammaproteobacteria bacterium LUC13018_P8", "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Endoriftia|s__Gammaproteobacteria bacterium LUC005_P12": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__Gammaproteobacteria bacterium LUC005_P12", "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Endoriftia|s__Gammaproteobacteria bacterium LUC003_P10": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__Gammaproteobacteria bacterium LUC003_P10", "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Endoriftia|s__Gammaproteobacteria bacterium LUC001_P9": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__Gammaproteobacteria bacterium LUC001_P9", "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Endoriftia|s__Gammaproteobacteria bacterium LUC13015_P5": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__Gammaproteobacteria bacterium LUC13015_P5", "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Sedimenticola": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Sedimenticola", "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Sedimenticola|s__Sedimenticola selenatireducens": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Sedimenticola|s__Sedimenticola selenatireducens", "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Sedimenticola|s__Sedimenticola thiotaurini": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Sedimenticola|s__Sedimenticola thiotaurini", "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Sedimenticola|s__Sedimenticola sp.": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Sedimenticola|s__Sedimenticola sp.", "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Wohlfahrtiimonas": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Wohlfahrtiimonas", "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Wohlfahrtiimonas|s__Wohlfahrtiimonas chitiniclastica": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Wohlfahrtiimonas|s__Wohlfahrtiimonas chitiniclastica", "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Wohlfahrtiimonas|s__Wohlfahrtiimonas populi": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Wohlfahrtiimonas|s__Wohlfahrtiimonas populi", "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Wohlfahrtiimonas|s__Wohlfahrtiimonas larvae": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Wohlfahrtiimonas|s__Wohlfahrtiimonas larvae", "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Wohlfahrtiimonas|s__Wohlfahrtiimonas sp. G9077": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Wohlfahrtiimonas|s__Wohlfahrtiimonas sp. G9077", "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Gallaecimonas": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Gallaecimonas", "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Gallaecimonas|s__Gallaecimonas pentaromativorans": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Gallaecimonas|s__Gallaecimonas pentaromativorans", "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Gallaecimonas|s__Gallaecimonas sp. HK-28": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Gallaecimonas|s__Gallaecimonas sp. HK-28", "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Gallaecimonas|s__Gallaecimonas xiamenensis": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Gallaecimonas|s__Gallaecimonas xiamenensis", "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Candidatus Thioglobus": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Thioglobus", "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Candidatus Thioglobus|s__Candidatus Thioglobus singularis": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Thioglobus|s__Candidatus Thioglobus singularis", "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Candidatus Thioglobus|s__Candidatus Thioglobus perditus": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Thioglobus|s__Candidatus Thioglobus perditus", "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Candidatus Thioglobus|s__Candidatus Thioglobus autotrophicus": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Thioglobus|s__Candidatus Thioglobus autotrophicus", "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Candidatus Thioglobus|s__Candidatus Thioglobus sp. MED-G25": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Thioglobus|s__Candidatus Thioglobus sp. MED-G25", "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Candidatus Thioglobus|s__Candidatus Thioglobus sp. REDSEA-S14_B12": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Thioglobus|s__Candidatus Thioglobus sp. REDSEA-S14_B12", "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Candidatus Thioglobus|s__uncultured SUP05 cluster bacterium": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Thioglobus|s__uncultured SUP05 cluster bacterium", "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Candidatus Thioglobus|s__Candidatus Thioglobus sp. REDSEA-S12_B1": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Thioglobus|s__Candidatus Thioglobus sp. REDSEA-S12_B1", "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Candidatus Thioglobus|s__Candidatus Thioglobus sp. REDSEA-S03_B1": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Thioglobus|s__Candidatus Thioglobus sp. REDSEA-S03_B1", "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Candidatus Thioglobus|s__Candidatus Thioglobus sp.": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Thioglobus|s__Candidatus Thioglobus sp.", "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Candidatus Thioglobus|s__Candidatus Thioglobus sp. TMED218": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Thioglobus|s__Candidatus Thioglobus sp. TMED218", "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Candidatus Thioglobus|s__Candidatus Thioglobus sp. MED-G23": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Thioglobus|s__Candidatus Thioglobus sp. MED-G23", "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Ignatzschineria": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Ignatzschineria", "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Ignatzschineria|s__Ignatzschineria cameli": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Ignatzschineria|s__Ignatzschineria cameli", "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Ignatzschineria|s__Ignatzschineria larvae": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Ignatzschineria|s__Ignatzschineria larvae", "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Ignatzschineria|s__Ignatzschineria ureiclastica": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Ignatzschineria|s__Ignatzschineria ureiclastica", "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Ignatzschineria|s__Ignatzschineria indica": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Ignatzschineria|s__Ignatzschineria indica", "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Ignatzschineria|s__Ignatzschineria sp. F8392": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Ignatzschineria|s__Ignatzschineria sp. F8392", "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Pseudohongiella": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Pseudohongiella", "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Pseudohongiella|s__Pseudohongiella acticola": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Pseudohongiella|s__Pseudohongiella acticola", "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Pseudohongiella|s__Pseudohongiella spirulinae": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Pseudohongiella|s__Pseudohongiella spirulinae", "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Pseudohongiella|s__Pseudohongiella nitratireducens": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Pseudohongiella|s__Pseudohongiella nitratireducens", "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Pseudohongiella|s__Pseudohongiella sp.": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Pseudohongiella|s__Pseudohongiella sp.", "d__Bacteria|p__Proteobacteria|c__Alphaproteobacteria|o__Rhizobiales|f__Salinarimonadaceae|g__Salinarimonas|s__Salinarimonadaceae bacterium HL-109": "d__Bacteria|p__Proteobacteria|c__Alphaproteobacteria|o__Rhizobiales|f__Salinarimonadaceae|s__Salinarimonadaceae bacterium HL-109", "d__Bacteria|p__Proteobacteria|c__Alphaproteobacteria|g__Candidatus Micropelagos|s__PS1 clade bacterium": "d__Bacteria|p__Proteobacteria|c__Alphaproteobacteria|s__PS1 clade bacterium", "d__Bacteria|p__Proteobacteria|c__Zetaproteobacteria|o__Mariprofundales|f__Mariprofundaceae|g__Ghiorsea": "d__Bacteria|p__Proteobacteria|c__Zetaproteobacteria|g__Ghiorsea", "d__Bacteria|p__Proteobacteria|c__Zetaproteobacteria|o__Mariprofundales|f__Mariprofundaceae|g__Ghiorsea|s__Ghiorsea bivora": "d__Bacteria|p__Proteobacteria|c__Zetaproteobacteria|g__Ghiorsea|s__Ghiorsea bivora", "d__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Micrococcales|f__Microbacteriaceae|g__Rhodoluna|s__Microbacteriaceae bacterium MWH-Ta3": "d__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Micrococcales|f__Microbacteriaceae|s__Microbacteriaceae bacterium MWH-Ta3", "d__Bacteria|p__Firmicutes|c__Bacilli|o__Bacillales|g__Candidatus Carbobacillus|s__[Flavobacterium] thermophilum": "d__Bacteria|p__Firmicutes|c__Bacilli|o__Bacillales|s__[Flavobacterium] thermophilum", "d__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|g__Monoglobus|s__[Bacteroides] pectinophilus": "d__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|s__[Bacteroides] pectinophilus", "d__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Clostridiales Family XIII. Incertae Sedis|g__Anaerovorax|s__[Eubacterium] infirmum": "d__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Clostridiales Family XIII. Incertae Sedis|s__[Eubacterium] infirmum", "d__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Clostridiales Family XIII. Incertae Sedis|g__Mobilibacterium|s__[Eubacterium] minutum": "d__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Clostridiales Family XIII. Incertae Sedis|s__[Eubacterium] minutum", "d__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Clostridiales Family XIII. Incertae Sedis|g__Mobilibacterium|s__[Eubacterium] sulci": "d__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Clostridiales Family XIII. Incertae Sedis|s__[Eubacterium] sulci", "d__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Clostridiales Family XIII. Incertae Sedis|g__Mobilibacterium|s__[Eubacterium] nodatum": "d__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Clostridiales Family XIII. Incertae Sedis|s__[Eubacterium] nodatum", "d__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Clostridiales Family XIII. Incertae Sedis|g__Mobilibacterium|s__[Eubacterium] brachy": "d__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Clostridiales Family XIII. Incertae Sedis|s__[Eubacterium] brachy", "d__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Clostridiales Family XIII. Incertae Sedis|g__Ileibacterium|s__[Eubacterium] saphenum": "d__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Clostridiales Family XIII. Incertae Sedis|s__[Eubacterium] saphenum", "d__Bacteria|p__Firmicutes|c__Tissierellia|o__Tissierellales|f__Tissierellaceae|g__Anaerosalibacter|s__[Clostridium] ultunense": "d__Bacteria|p__Firmicutes|c__Tissierellia|o__Tissierellales|f__Tissierellaceae|s__[Clostridium] ultunense", "d__Bacteria|p__Firmicutes|c__Tissierellia|o__Tissierellales|f__Gottschalkiaceae|g__Sporanaerobacter": "d__Bacteria|p__Firmicutes|c__Tissierellia|o__Tissierellales|g__Sporanaerobacter", "d__Bacteria|p__Firmicutes|c__Tissierellia|o__Tissierellales|f__Gottschalkiaceae|g__Sporanaerobacter|s__Sporanaerobacter acetigenes": "d__Bacteria|p__Firmicutes|c__Tissierellia|o__Tissierellales|g__Sporanaerobacter|s__Sporanaerobacter acetigenes", "d__Bacteria|p__Firmicutes|c__Tissierellia|o__Tissierellales|f__Gottschalkiaceae|g__Sporanaerobacter|s__Sporanaerobacter sp. PP17-6a": "d__Bacteria|p__Firmicutes|c__Tissierellia|o__Tissierellales|g__Sporanaerobacter|s__Sporanaerobacter sp. PP17-6a", "d__Bacteria|p__Firmicutes|c__Tissierellia|g__Sedimentibacter|s__[Bacteroides] coagulans": "d__Bacteria|p__Firmicutes|c__Tissierellia|s__[Bacteroides] coagulans", "d__Bacteria|p__Cyanobacteria|c__Gloeobacteria|o__Gloeoemargaritales": "d__Bacteria|p__Cyanobacteria|o__Gloeoemargaritales", "d__Bacteria|p__Cyanobacteria|c__Gloeobacteria|o__Gloeoemargaritales|f__Gloeomargaritaceae": "d__Bacteria|p__Cyanobacteria|o__Gloeoemargaritales|f__Gloeomargaritaceae", "d__Bacteria|p__Cyanobacteria|c__Gloeobacteria|o__Gloeoemargaritales|f__Gloeomargaritaceae|g__Gloeomargarita": "d__Bacteria|p__Cyanobacteria|o__Gloeoemargaritales|f__Gloeomargaritaceae|g__Gloeomargarita", "d__Bacteria|p__Cyanobacteria|c__Gloeobacteria|o__Gloeoemargaritales|f__Gloeomargaritaceae|g__Gloeomargarita|s__Gloeomargarita lithophora": "d__Bacteria|p__Cyanobacteria|o__Gloeoemargaritales|f__Gloeomargaritaceae|g__Gloeomargarita|s__Gloeomargarita lithophora", "d__Bacteria|p__Candidatus Margulisbacteria|c__Candidatus Sericytochromatia": "d__Bacteria|c__Candidatus Sericytochromatia", "d__Bacteria|p__Candidatus Margulisbacteria|c__Candidatus Sericytochromatia|s__Candidatus Sericytochromatia bacterium GL2-53 LSPB_72": "d__Bacteria|c__Candidatus Sericytochromatia|s__Candidatus Sericytochromatia bacterium GL2-53 LSPB_72", "d__Bacteria|p__Candidatus Margulisbacteria|c__Candidatus Sericytochromatia|s__Candidatus Sericytochromatia bacterium S15B-MN24 CBMW_12": "d__Bacteria|c__Candidatus Sericytochromatia|s__Candidatus Sericytochromatia bacterium S15B-MN24 CBMW_12", "d__Bacteria|p__Candidatus Margulisbacteria|c__Candidatus Sericytochromatia|s__Candidatus Sericytochromatia bacterium S15B-MN24 RAAC_196": "d__Bacteria|c__Candidatus Sericytochromatia|s__Candidatus Sericytochromatia bacterium S15B-MN24 RAAC_196", "d__Bacteria|p__Chloroflexi|c__Dehalococcoidia|o__Dehalococcoidales|g__Dehalogenimonas": "d__Bacteria|p__Chloroflexi|c__Dehalococcoidia|g__Dehalogenimonas", "d__Bacteria|p__Chloroflexi|c__Dehalococcoidia|o__Dehalococcoidales|g__Dehalogenimonas|s__Dehalogenimonas sp. GP": "d__Bacteria|p__Chloroflexi|c__Dehalococcoidia|g__Dehalogenimonas|s__Dehalogenimonas sp. GP", "d__Bacteria|p__Chloroflexi|c__Dehalococcoidia|o__Dehalococcoidales|g__Dehalogenimonas|s__Dehalogenimonas alkenigignens": "d__Bacteria|p__Chloroflexi|c__Dehalococcoidia|g__Dehalogenimonas|s__Dehalogenimonas alkenigignens", "d__Bacteria|p__Chloroflexi|c__Dehalococcoidia|o__Dehalococcoidales|g__Dehalogenimonas|s__Dehalogenimonas formicexedens": "d__Bacteria|p__Chloroflexi|c__Dehalococcoidia|g__Dehalogenimonas|s__Dehalogenimonas formicexedens", "d__Bacteria|p__Chloroflexi|c__Dehalococcoidia|o__Dehalococcoidales|g__Dehalogenimonas|s__Dehalogenimonas sp. WBC-2": "d__Bacteria|p__Chloroflexi|c__Dehalococcoidia|g__Dehalogenimonas|s__Dehalogenimonas sp. WBC-2", "d__Bacteria|p__Chloroflexi|c__Dehalococcoidia|o__Dehalococcoidales|g__Dehalogenimonas|s__Dehalogenimonas lykanthroporepellens": "d__Bacteria|p__Chloroflexi|c__Dehalococcoidia|g__Dehalogenimonas|s__Dehalogenimonas lykanthroporepellens", "d__Bacteria|p__Chloroflexi|c__Thermomicrobia|g__Thermorudis|s__Thermomicrobia bacterium": "d__Bacteria|p__Chloroflexi|c__Thermomicrobia|s__Thermomicrobia bacterium", "d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis", "d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae", "d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salinibacter": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salinibacter", "d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salinibacter|s__Salinibacter ruber": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salinibacter|s__Salinibacter ruber", "d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salinibacter|s__Salinibacter altiplanensis": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salinibacter|s__Salinibacter altiplanensis", "d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salinibacter|s__Salinibacter sp. 10B": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salinibacter|s__Salinibacter sp. 10B", "d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salinibacter|s__Salinibacter sp. J07SB67": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salinibacter|s__Salinibacter sp. J07SB67", "d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|s__Rhodothermaceae bacterium": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|s__Rhodothermaceae bacterium", "d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|s__Rhodothermaceae bacterium bin80": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|s__Rhodothermaceae bacterium bin80", "d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|s__Rhodothermaceae bacterium RA": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|s__Rhodothermaceae bacterium RA", "d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|s__Rhodothermaceae bacterium Tc-Br11_B2g6_7": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|s__Rhodothermaceae bacterium Tc-Br11_B2g6_7", "d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|s__Rhodothermaceae bacterium TMED105": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|s__Rhodothermaceae bacterium TMED105", "d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Rhodothermus": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Rhodothermus", "d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Rhodothermus|s__Rhodothermus marinus": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Rhodothermus|s__Rhodothermus marinus", "d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Rhodothermus|s__Rhodothermus profundi": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Rhodothermus|s__Rhodothermus profundi", "d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Longibacter": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Longibacter", "d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Longibacter|s__Longibacter salinarum": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Longibacter|s__Longibacter salinarum", "d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Longimonas": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Longimonas", "d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Longimonas|s__Longimonas halophila": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Longimonas|s__Longimonas halophila", "d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salisaeta": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salisaeta", "d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salisaeta|s__Salisaeta longa": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salisaeta|s__Salisaeta longa", "d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salisaeta|s__Bacteroidetes bacterium UBA7368": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|s__Bacteroidetes bacterium UBA7368", "d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salisaeta|s__Bacteroidetes bacterium UBA2364": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|s__Bacteroidetes bacterium UBA2364", "d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salisaeta|s__Bacteroidetes bacterium UBA2024": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|s__Bacteroidetes bacterium UBA2024", "d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salisaeta|s__Bacteroidetes bacterium UBA2412": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|s__Bacteroidetes bacterium UBA2412", "d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salisaeta|s__Bacteroidetes bacterium UBA5586": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|s__Bacteroidetes bacterium UBA5586", "d__Bacteria|p__Candidatus Kryptonia|g__Candidatus Aegiribacteria": "d__Bacteria|g__Candidatus Aegiribacteria", "d__Bacteria|p__Candidatus Kryptonia|g__Candidatus Aegiribacteria|s__Candidatus Aegiribacteria bacterium MLS_C": "d__Bacteria|g__Candidatus Aegiribacteria|s__Candidatus Aegiribacteria bacterium MLS_C", "d__Bacteria|p__Verrucomicrobia|c__Spartobacteria|o__Chthoniobacterales|g__Terrimicrobium": "d__Bacteria|p__Verrucomicrobia|c__Spartobacteria|g__Terrimicrobium", "d__Bacteria|p__Verrucomicrobia|c__Spartobacteria|o__Chthoniobacterales|g__Terrimicrobium|s__Terrimicrobium sacchariphilum": "d__Bacteria|p__Verrucomicrobia|c__Spartobacteria|g__Terrimicrobium|s__Terrimicrobium sacchariphilum", "d__Bacteria|p__Verrucomicrobia|c__Spartobacteria|o__Chthoniobacterales|g__Candidatus Xiphinematobacter": "d__Bacteria|p__Verrucomicrobia|c__Spartobacteria|g__Candidatus Xiphinematobacter", "d__Bacteria|p__Verrucomicrobia|c__Spartobacteria|o__Chthoniobacterales|g__Candidatus Xiphinematobacter|s__Candidatus Xiphinematobacter sp. Idaho Grape": "d__Bacteria|p__Verrucomicrobia|c__Spartobacteria|g__Candidatus Xiphinematobacter|s__Candidatus Xiphinematobacter sp. Idaho Grape", "d__Bacteria|p__Candidatus Saccharibacteria|g__Candidatus Saccharimonas|s__candidate division TM7 genomosp. GTL1": "d__Bacteria|p__Candidatus Saccharibacteria|s__candidate division TM7 genomosp. GTL1", "d__Bacteria|p__Candidatus Riflebacteria|c__Candidatus Ozemobacteria|f__Candidatus Ozemobacteraceae|g__Candidatus Ozemobacter|s__Candidatus Riflebacteria bacterium HGW-Riflebacteria-2": "d__Bacteria|p__Candidatus Riflebacteria|s__Candidatus Riflebacteria bacterium HGW-Riflebacteria-2", "d__Bacteria|p__Candidatus Riflebacteria|c__Candidatus Ozemobacteria|f__Candidatus Ozemobacteraceae|g__Candidatus Ozemobacter|s__Candidatus Riflebacteria bacterium GWC2_50_8": "d__Bacteria|p__Candidatus Riflebacteria|s__Candidatus Riflebacteria bacterium GWC2_50_8", "d__Bacteria|p__Candidatus Riflebacteria|c__Candidatus Ozemobacteria|f__Candidatus Ozemobacteraceae|g__Candidatus Ozemobacter|s__Candidatus Riflebacteria bacterium HGW-Riflebacteria-1": "d__Bacteria|p__Candidatus Riflebacteria|s__Candidatus Riflebacteria bacterium HGW-Riflebacteria-1", "d__Bacteria|p__Candidatus Riflebacteria|c__Candidatus Ozemobacteria|f__Candidatus Ozemobacteraceae|g__Candidatus Ozemobacter|s__Candidatus Riflebacteria bacterium": "d__Bacteria|p__Candidatus Riflebacteria|s__Candidatus Riflebacteria bacterium", "d__Bacteria|p__Candidatus Riflebacteria|c__Candidatus Ozemobacteria|f__Candidatus Ozemobacteraceae|g__Candidatus Ozemobacter|s__Candidatus Riflebacteria bacterium RBG_13_59_9": "d__Bacteria|p__Candidatus Riflebacteria|s__Candidatus Riflebacteria bacterium RBG_13_59_9", "d__Archaea|p__Euryarchaeota|c__Methanomicrobia|o__Methanosarcinales|g__Candidatus Syntrophoarchaeum|s__Methanosarcinales archaeon UBA261": "d__Archaea|p__Euryarchaeota|c__Methanomicrobia|o__Methanosarcinales|s__Methanosarcinales archaeon UBA261", "d__Archaea|p__Euryarchaeota|c__Methanomicrobia|o__Methanosarcinales|g__Candidatus Syntrophoarchaeum|s__Methanosarcinales archaeon UBA203": "d__Archaea|p__Euryarchaeota|c__Methanomicrobia|o__Methanosarcinales|s__Methanosarcinales archaeon UBA203", "d__Archaea|p__Euryarchaeota|c__Methanomicrobia|o__Methanosarcinales|g__Candidatus Syntrophoarchaeum|s__Methanosarcinales archaeon Methan_03": "d__Archaea|p__Euryarchaeota|c__Methanomicrobia|o__Methanosarcinales|s__Methanosarcinales archaeon Methan_03", "d__Archaea|p__Euryarchaeota|c__Methanomicrobia|o__Methanosarcinales|g__Candidatus Syntrophoarchaeum|s__Methanosarcinales archaeon Methan_02": "d__Archaea|p__Euryarchaeota|c__Methanomicrobia|o__Methanosarcinales|s__Methanosarcinales archaeon Methan_02", "d__Archaea|p__Euryarchaeota|c__Methanomicrobia|o__Methanosarcinales|g__Candidatus Syntrophoarchaeum|s__Methanosarcinales archaeon Methan_01": "d__Archaea|p__Euryarchaeota|c__Methanomicrobia|o__Methanosarcinales|s__Methanosarcinales archaeon Methan_01", "d__Archaea|p__Thaumarchaeota|c__Nitrososphaeria|o__Cenarchaeales": "d__Archaea|p__Thaumarchaeota|o__Cenarchaeales", "d__Archaea|p__Thaumarchaeota|c__Nitrososphaeria|o__Cenarchaeales|f__Cenarchaeaceae": "d__Archaea|p__Thaumarchaeota|o__Cenarchaeales|f__Cenarchaeaceae", "d__Archaea|p__Thaumarchaeota|c__Nitrososphaeria|o__Cenarchaeales|f__Cenarchaeaceae|g__Cenarchaeum": "d__Archaea|p__Thaumarchaeota|o__Cenarchaeales|f__Cenarchaeaceae|g__Cenarchaeum", "d__Archaea|p__Thaumarchaeota|c__Nitrososphaeria|o__Cenarchaeales|f__Cenarchaeaceae|g__Cenarchaeum|s__Cenarchaeum symbiosum": "d__Archaea|p__Thaumarchaeota|o__Cenarchaeales|f__Cenarchaeaceae|g__Cenarchaeum|s__Cenarchaeum symbiosum", "d__Archaea|p__Candidatus Korarchaeota|g__Candidatus Korarchaeum|s__Candidatus Korarchaeota archaeon NZ13-K": "d__Archaea|p__Candidatus Korarchaeota|s__Candidatus Korarchaeota archaeon NZ13-K", } mpa_strings_corrected = [correction_dict[s] if s in correction_dict.keys() else s for s in mpa_strings] return(mpa_strings_corrected) if __name__ == '__main__': main() |
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 | suppressMessages(library(ggplot2, quietly = TRUE, warn.conflicts = FALSE)) suppressMessages(library(rafalib, quietly = TRUE, warn.conflicts = FALSE)) suppressMessages(library(vegan, quietly = TRUE, warn.conflicts = FALSE)) suppressMessages(library(reshape2, quietly = TRUE, warn.conflicts = FALSE)) suppressMessages(library(RColorBrewer, quietly = TRUE, warn.conflicts = FALSE)) suppressMessages(library(cmapR, quietly = TRUE, warn.conflicts = FALSE)) suppressMessages(library(compositions, quietly = TRUE, warn.conflicts = FALSE)) suppressMessages(library(zCompositions, quietly = TRUE, warn.conflicts = FALSE)) suppressMessages(library(ALDEx2, quietly = TRUE, warn.conflicts = FALSE)) suppressMessages(library(ggpubr, quietly = TRUE, warn.conflicts = FALSE)) options(stringsAsFactors = F) # options we need from snakemake sample_reads_file <- snakemake@params[['sample_reads_file']] sample_reports_file <- snakemake@params[['sample_reports_file']] sample_groups_file <- snakemake@params[['sample_groups_file']] workflow.outdir <- snakemake@params[['workflow_outdir']] result.dir <- snakemake@params[['result_dir']] use.bracken.report <- snakemake@params[['use_bracken_report']] remove.chordata <- as.logical(snakemake@params[['remove_chordata']]) if (length(remove.chordata)==0){ remove.chordata <- F } # Use locations in the working directory for scripts and taxonomy array scripts.folder <- file.path(workflow.outdir, "scripts") tax.array.file <- file.path(workflow.outdir, "taxonomy_array.tsv") # Downstream processing for filtering OTUs ## This removes any classification result where all samples are below this threshold, ## which serves to remove the very long tail of lowly abundant species in Kraken2/Bracken results. ## This parameter can be tuned, but I recommend you keep something here to reduce the long tail. min_otu_percentage <- as.numeric(snakemake@config[['min_otu_percentage']]) # Filtering for compositional data analysis that gets applied for each taxonomic level ## Keep only samples with > codata_min_reads (default 5000) codata_min_reads <- as.numeric(snakemake@config[['codata_min_reads']]) ## Keep only OTUs with an FRACTION of at least 0.001 by default ## This is equivalent to a PERCENTAGE of 0.1 codata_min_prop = as.numeric(snakemake@config[['codata_min_prop']]) ## Keep OTUs that are found in at least codata_otu_cutoff fraction of samples ## Default is 0.3 or 30% of samples codata_otu_cutoff = as.numeric(snakemake@config[['codata_otu_cutoff']]) # Set up directories and create those that don't exist classification.folder <- file.path(workflow.outdir, 'classification') outfolder.matrices.taxonomy <- file.path(result.dir, 'taxonomy_matrices') outfolder.matrices.taxonomy.classified <- file.path(result.dir, 'taxonomy_matrices_classified_only') outfolder.gctx.taxonomy <- file.path(result.dir, 'taxonomy_gctx') outfolder.gctx.taxonomy.classified <- file.path(result.dir, 'taxonomy_gctx_classified_only') outfolder.matrices.bray <- file.path(result.dir, 'braycurtis_matrices') outfolder.plots <- file.path(result.dir, 'plots') outfolder.aldex <- file.path(result.dir, 'ALDEx2_differential_abundance') for (f in c(result.dir, outfolder.matrices.bray, outfolder.matrices.taxonomy.classified, outfolder.gctx.taxonomy.classified, outfolder.plots, outfolder.aldex)){ if (!dir.exists(f)){ dir.create(f, recursive = T)} } # dont create dirs we dont need from bracken if (!(use.bracken.report)){ for (f in c(outfolder.matrices.taxonomy, outfolder.gctx.taxonomy)){ if (!dir.exists(f)){ dir.create(f, recursive = T)} } } # Load other data processing and plotting scripts source.script.process <- file.path(scripts.folder, 'process_classification_gctx.R') source.script.plot <- file.path(scripts.folder, 'plotting_classification.R') source.script.codaseq <- file.path(scripts.folder, 'CoDaSeq_functions.R') if (!(file.exists(source.script.process) & file.exists(source.script.plot) & file.exists(source.script.codaseq))) { stop('processing scripts do not exist at any of the attempted directories. Exiting. ') } suppressMessages(source(source.script.process)) suppressMessages(source(source.script.plot)) suppressMessages(source(source.script.codaseq)) # read from sample_reads_file or sample_reports_file as input if(sample_reads_file != ""){ sample.reads <- read.table(sample_reads_file, sep='\t', quote='', header=F, comment.char = "#", colClasses = 'character') colnames(sample.reads) <- c('sample', 'r1', 'r2')[1:ncol(sample.reads)] # ensure we skip first row if it's a comment if ((tolower(sample.reads[1,1]) == 'sample') | (substr(sample.reads[1,1], 1,1) == "#")){ sample.reads <- sample.reads[2:nrow(sample.reads), ] } # create dataframe of output files that would be in sample.reports sample.reports <- data.frame(sample = sample.reads$sample, kraken_report = sapply(sample.reads$sample, function(x) file.path(classification.folder, paste(x, '.krak.report', sep=''))), bracken_report = sapply(sample.reads$sample, function(x) file.path(classification.folder, paste(x, '.krak_bracken_species.report', sep='')))) } else if(sample_reports_file != ""){ sample.reports <- read.table(sample_reports_file, sep='\t', quote='', header=F, comment.char = "#", colClasses = 'character') colnames(sample.reports) <- c('sample', 'kraken_report', 'bracken_report')[1:ncol(sample.reports)] # ensure we skip first row if it's a comment if ((tolower(sample.reports[1,1]) == 'sample') | (substr(sample.reports[1,1], 1,1) == "#")){ sample.reports <- sample.reports[2:nrow(sample.reports), ] } } sample.names <- sample.reports$sample sample.number <- nrow(sample.reports) # Define files for loading in based on sample. if(use.bracken.report){ flist <- sample.reports[,3] } else { flist <- sample.reports[,2] } names(flist) <- sample.names # Ensure all expected results files exist if (!(all(file.exists(flist)))){ # print which files don't exist message('The follwing files do not exist:') print(flist[!sapply(flist, file.exists)]) stop("Some classification files do not exist!") } # if a groups file is specified, read it. Otherwise assign everything to one group if (sample_groups_file != '') { sample.groups <- read.table(sample_groups_file, sep='\t', quote='', header=F, comment.char = "#", colClasses = 'character') colnames(sample.groups) <- c('sample', 'group') # if first sample is sample, discard the line if(tolower(sample.groups[1,'sample']) == 'sample'){ sample.groups <- sample.groups[2:nrow(sample.groups),] } } else { sample.groups <- data.frame(sample=sample.names, group='All') } # get taxonomy array # classification at each level for each entry in the database. # simplified after processing krakens default taxonomy tax.array <- read.table(tax.array.file, sep='\t', quote='', header=F, comment.char = '', colClasses = 'character') colnames(tax.array) <- c('id', 'taxid', 'root', 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species', 'subspecies')[1:ncol(tax.array)] # Bug in generation code gave muliple zero taxids. Eliminate that here now tax.array <- tax.array[!duplicated(tax.array$taxid),] rownames(tax.array) <- tax.array$taxid # some duplicate names in this. if so, change them to include the tax level dup.ids <- tax.array$id[duplicated(tax.array$id)] dup.inds <- which(tax.array$id %in% dup.ids) # fix these by adding taxid to the name tax.array[tax.array$id %in% dup.ids, "id"] <- paste(tax.array[tax.array$id %in% dup.ids, "id"], ' (', tax.array[tax.array$id %in% dup.ids, "taxid"], ')', sep='') # ensure reads and groups have the same data if (!(all(sample.groups$sample %in% sample.names) & all(sample.names %in% sample.groups$sample))){ message('sample.reads:') print(sample.reads) message('sample.groups:') print(sample.groups) message('sample.groups$sample[!(sample.groups$sample %in% sample.names)]') print(sample.groups$sample[!(sample.groups$sample %in% sample.names)]) message('sample.names[!(sample.names %in% sample.groups$sample)]') print(sample.names[!(sample.names %in% sample.groups$sample)]) stop('Sample reads and sample groups dont contain the same samples... check inputs') } # special case for UHGG and MAG databases. ## If reading from UHGG, the taxonomy levels go R, R1-R7 test.df <- kraken_file_to_df(flist[1]) ## UHGG database will have this structure uhgg <- all(paste0('R', 1:7) %in% test.df$tax.level) # MAG database will have this structure ## Number of genus classifications is zero segata <- !uhgg & sum(sum(test.df$tax.level=='G')) == 0 # load classification results from each sample and process into gct format message(paste('Loading data from ', length(flist), ' kraken/bracken results.', sep='')) df.list <- lapply(flist, function(x) kraken_file_to_df(x)) # merge classification matrix merge.mat <- merge_kraken_df_list(df.list) # sample metadata just has groups for now sample.metadata <- data.frame(id=sample.groups$sample, group=sample.groups$group) # construct gct object from reads matrix, sample metadata, row metadata kgct <- make_gct_from_kraken(merge.mat, sample.metadata, tax.array) # remove Chordata reads if desired if(remove.chordata){ kgct <- subset_gct(kgct, rid=kgct@rid[kgct@rdesc$phylum != "Chordata"]) } # filter to each of the taxonomy levels if (segata){ filter.levels <- c('species') kgct.filtered.list <- list(species=subset_gct(kgct, rid=kgct@rid[kgct@rid != 'root'])) } else { filter.levels <- c('kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species') kgct.filtered.list <- lapply(filter.levels, function(level) { subset_kgct_to_level(kgct, level) }) names(kgct.filtered.list) <- filter.levels } # print("kgct.filtered.list[['species']]@mat") # print(kgct.filtered.list[['species']]@mat) # subset to classified taxa only unclassified.rownames <- c('unclassified', 'classified at a higher level') kgct.filtered.classified.list <- lapply(kgct.filtered.list, function(x) subset_gct(x, rid=x@rid[!(x@rid %in% unclassified.rownames)])) # normalize to percentages kgct.filtered.percentage.list <- lapply(kgct.filtered.list, function(x) normalize_kgct(x, min_otu_percentage=min_otu_percentage)) kgct.filtered.classified.percentage.list <- lapply(kgct.filtered.classified.list, function(x) normalize_kgct(x, min_otu_percentage=min_otu_percentage)) message('Saving classification matrices...') # save matrices and gctx objects if (use.bracken.report){mat.name <- 'bracken'} else {mat.name <- 'kraken'} for (tn in filter.levels){ outf.mat.reads <- file.path(outfolder.matrices.taxonomy, paste(mat.name, tolower(tn), 'reads.txt', sep='_')) outf.mat.percentage <- file.path(outfolder.matrices.taxonomy, paste(mat.name, tolower(tn), 'percentage.txt', sep='_')) outf.mat.reads.classified <- file.path(outfolder.matrices.taxonomy.classified, paste(mat.name, tolower(tn), 'reads.txt', sep='_')) outf.mat.percentage.classified <- file.path(outfolder.matrices.taxonomy.classified, paste(mat.name, tolower(tn), 'percentage.txt', sep='_')) # gctx names outf.gctx.reads <- file.path(outfolder.gctx.taxonomy, paste(mat.name, tolower(tn), 'reads.gctx', sep='_')) outf.gctx.percentage <- file.path(outfolder.gctx.taxonomy, paste(mat.name, tolower(tn), 'percentage.gctx', sep='_')) outf.gctx.reads.classified <- file.path(outfolder.gctx.taxonomy.classified, paste(mat.name, tolower(tn), 'reads.gctx', sep='_')) outf.gctx.percentage.classified <- file.path(outfolder.gctx.taxonomy.classified, paste(mat.name, tolower(tn), 'percentage.gctx', sep='_')) # save matrices if (!use.bracken.report){ write.table(kgct.filtered.list[[tn]]@mat, outf.mat.reads, sep='\t', quote=F, row.names = T, col.names = T) } write.table(kgct.filtered.classified.list[[tn]]@mat, outf.mat.reads.classified, sep='\t', quote=F, row.names = T, col.names = T) mat.percentage <- kgct.filtered.percentage.list[[tn]]@mat # order rows by mean abundance across samples row.order <- rownames(mat.percentage)[order(rowMeans(mat.percentage), decreasing = T)] row.order <- row.order[!(row.order %in% unclassified.rownames)] row.order <- c(unclassified.rownames[unclassified.rownames %in% rownames(mat.percentage)], row.order) mat.percentage <- mat.percentage[row.order,] # classified only matrix mat.percentage.classified <- kgct.filtered.classified.percentage.list[[tn]]@mat row.order.classified <- order(rowMeans(mat.percentage.classified), decreasing = T) mat.percentage.classified <- mat.percentage.classified[row.order.classified,] if (!use.bracken.report){ write.table(mat.percentage, outf.mat.percentage, sep='\t', quote=F, row.names = T, col.names = T) } write.table(mat.percentage.classified, outf.mat.percentage.classified, sep='\t', quote=F, row.names = T, col.names = T) # save gctx # only save with unclassified if not using Bracken if (!use.bracken.report){ suppressMessages(write_gctx(kgct.filtered.list[[tn]], outf.gctx.reads, appenddim = F)) suppressMessages(write_gctx(kgct.filtered.percentage.list[[tn]], outf.gctx.percentage, appenddim = F)) } suppressMessages(write_gctx(kgct.filtered.classified.list[[tn]], outf.gctx.reads.classified, appenddim = F)) suppressMessages(write_gctx(kgct.filtered.classified.percentage.list[[tn]], outf.gctx.percentage.classified, appenddim = F)) } ################################################################################# ## Diversity calculation and plots ############################################## ################################################################################# message('Doing diversity calculations and saving figures...') # DISABLE rarefaction curve as it's useless with so many species # Wait actually people want it so Im going to re-enable it plot_rarefaction_curve(kgct.filtered.classified.list$species@mat, file.path(outfolder.plots, 'rarefaction_curve.pdf')) # diversity calculations div.methods <- c('shannon', 'simpson') # calculate for each tax level div.level.method <- lapply(filter.levels, function(x) { use.matrix <- kgct.filtered.classified.list[[x]]@mat # if not enough valid rows, skip it if (nrow(use.matrix) <3){ dl <- lapply(div.methods, function(x) rep(0, times=ncol(use.matrix))) } else { dl <- lapply(div.methods, function(y) diversity(use.matrix, index=y, MARGIN = 2)) } names(dl) <- div.methods dl }) names(div.level.method) <- filter.levels # coerce forcefully into a dataframe # there's definitely a better way to do this... div.df <- as.data.frame(div.level.method) rownames(div.df) <- sample.names div.df <- cbind(data.frame(sample=rownames(div.df)), div.df) div.df <- melt(div.df, id.vars = 'sample') div.df$tax.level <- sapply(div.df$variable, function(x) strsplit(as.character(x), split="\\.")[[1]][1]) div.df$method <- sapply(div.df$variable, function(x) strsplit(as.character(x), split="\\.")[[1]][2]) div.df <- div.df[,c('sample', 'tax.level','method','value')] # round to a sensible number div.df$value <- round(div.df$value, 3) # write out dataframe out.div <- file.path(result.dir, 'diversity.txt') write.table(div.df, out.div, sep='\t', quote=F, row.names=F, col.names=T) # barplot of diversity under different methods # one with everything, one separated by sample group div.df$sample <- factor(div.df$sample, levels = sample.groups$sample) div.plot.list <- list() for (g in unique(sample.groups$group)){ plot.samples <- sample.groups[sample.groups$group==g, "sample"] plot.df <- div.df[div.df$sample %in% plot.samples, ] plot.df$sample <- factor(plot.df$sample, levels = sample.groups$sample) p <- ggplot(plot.df, aes(x=sample, y=value)) + geom_bar(stat='identity') + facet_grid(rows = vars(tax.level), cols= vars(method), scales = 'fixed') + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + labs(title=paste('Diversity for group:', g), y='Diversity', x='Sample') div.plot.list[[g]] <- p } # save one page for each group div.group.pdf <- file.path(outfolder.plots, 'diversity_by_group.pdf') pdf(div.group.pdf, height=8, width = 10) for(p in div.plot.list){print(p)} trash <- dev.off() # also a big figure with one page for each method, all samples div.plot.all.list <- list() for (m in unique(div.df$method)){ plot.df <- div.df[div.df$method == m, ] p <- ggplot(plot.df, aes(x=sample, y=value)) + geom_bar(stat='identity') + facet_grid(rows = vars(tax.level), cols= vars(method), scales = 'fixed') + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + labs(title=paste('Diversity for all samples:', m), y='Diversity', x='Sample') div.plot.all.list[[m]] <- p } # width of plot = 9 + quarter inch for each sample over 10 max.samps <- max(table(sample.groups$group)) pdf.width <- max(9, (9 + (0.25 * (max.samps-10)))) div.all.pdf <- file.path(outfolder.plots, 'diversity_allsamples.pdf') pdf(div.all.pdf, height=5, width = pdf.width) for(p in div.plot.all.list){print(p)} trash <- dev.off() ################################################################################# ## Taxonomic Barplots ########################################################### ################################################################################# message('Generating taxonomic barplots...') # page in the pdf for each sample group taxlevel.plots <- list() taxlevel.plots.classified <- list() for (tn in filter.levels){ group.plots <- list() group.plots.classified <- list() for (g in unique(sample.groups$group)){ plot.samples <- sample.groups[sample.groups$group==g, "sample"] div.df.sub <- div.df[div.df$tax.level==tn & div.df$method=='shannon' & div.df$sample %in% plot.samples,] div.df.plot <- div.df.sub[, c('sample', 'value')] rownames(div.df.plot) <- div.df.plot$sample plot.title <- paste('Taxonomy and diversity: ', g, ', ', tn, sep='') plot.mat <- kgct.filtered.percentage.list[[tn]]@mat[,plot.samples, drop=FALSE] # no longer doing include unclassifed with Bracken report # because new version of report doesn't include this data if (!use.bracken.report){ group.plots[[g]] <- plot_many_samples_with_diversity_barplot(plot.mat, div.df.plot, plot.title = plot.title, include.unclassified=T, tax.level.name = tn) } group.plots.classified[[g]] <- plot_many_samples_with_diversity_barplot(plot.mat, div.df.plot, plot.title = plot.title, include.unclassified=F, tax.level.name = tn) } taxlevel.plots[[tn]] <- group.plots taxlevel.plots.classified[[tn]] <- group.plots.classified } # save pdfs # width of plot = 9 + quarter inch for each sample over 10 max.samps <- max(table(sample.groups$group)) pdf.width <- max(9, (9 + (0.25 * (max.samps-10)))) for (tn in filter.levels){ if (!use.bracken.report){ tax.pdf <- file.path(outfolder.plots, paste('taxonomy_barplot_', tolower(tn), '.pdf', sep='')) pdf(tax.pdf, height=6, width=pdf.width) for (p in taxlevel.plots[[tn]]){print(p)} trash <- dev.off() } tax.pdf.classified <- file.path(outfolder.plots, paste('classified_taxonomy_barplot_', tolower(tn), '.pdf', sep='')) pdf(tax.pdf.classified, height=6, width=pdf.width) for (p in taxlevel.plots.classified[[tn]]){print(p)} trash <- dev.off() } ################################################################################# ## PCoA plots ################################################################### ################################################################################# message('Doing PCoA calculations...') # only do this if we have >=3 samples if (sample.number >=3){ # do for each tax level plotlist.nolabels <- list() plotlist.labels <- list() # don't do for kingdom do.tn <- filter.levels[filter.levels!='kingdom'] for (tn in do.tn){ print(paste(' for:', tn)) fraction.mat <- kgct.filtered.classified.percentage.list[[tn]]@mat if(nrow(fraction.mat) > 2){ pcoa.res <- capscale(t(fraction.mat)~1, distance='bray') pcoa.df <- data.frame(sample.groups, scores(pcoa.res)$sites) pcoa.df.melt <- melt(pcoa.df, id.vars = c('sample','group')) pcoa.variance <- summary(pcoa.res)$cont$importance[2,1:2] # set colors based on number of groups ng <- length(unique(sample.groups$group)) if (ng <10){cols <- brewer.pal(max(ng,3), 'Set1')} else {cols <- colorRampPalette(brewer.pal(9,'Set1'))(ng)} plotlist.nolabels[[tn]] <- ggplot(pcoa.df, aes(x=MDS1, y=MDS2, color=group, label=sample)) + geom_point(size=3) + # scale_color_brewer(palette='Set1') + scale_color_manual(values = cols) + theme_bw() + labs(title=paste('Microbiome beta diversity, Bray-Curtis, ', tn, sep=''), subtitle='Principal Coordinates Analysis plot', x = paste('PC1 (', round(pcoa.variance[1], 3) * 100, '% of variance)', sep=''), y = paste('PC2 (', round(pcoa.variance[2], 3) * 100, '% of variance)', sep='')) + xlim(c(min(pcoa.df$MDS1), max(pcoa.df$MDS1) * 1.25)) + ylim(c(min(pcoa.df$MDS2), max(pcoa.df$MDS2) * 1.15)) + guides(color=guide_legend(title='Group')) # add a version with labels above the points nudge.x <- sum(abs(range(pcoa.df$MDS1))) / 12 nudge.y <- sum(abs(range(pcoa.df$MDS2))) / 30 plotlist.labels[[tn]] <- plotlist.nolabels[[tn]] + geom_text(nudge_x = nudge.x, nudge_y = nudge.y) } else{ plotlist.nolabels[[tn]] <- plot(1,1) plotlist.labels[[tn]] <- plot(1,1) } } # save plot pcoa.pdf.nolabels <- file.path(outfolder.plots, 'PCoA_2D_plot_nolabels.pdf') pdf(pcoa.pdf.nolabels, height=6.5, width=9) for (tn in rev(do.tn)){ print(plotlist.nolabels[[tn]]) } trash <- dev.off() pcoa.pdf.labels <- file.path(outfolder.plots, 'PCoA_2D_plot_labels.pdf') pdf(pcoa.pdf.labels, height=6.5, width=9) for (tn in rev(do.tn)){ print(plotlist.labels[[tn]]) } trash <- dev.off() } else { # warn the user and make fake plots warning('Less than 3 samples, not doing PCoA plots') pcoa.pdf.nolabels <- file.path(outfolder.plots, 'PCoA_2D_plot_nolabels.pdf') pdf(pcoa.pdf.nolabels, height=6.5, width=9) plot(0,0, col='white') text(0,0, 'Not enough samples for PCoA plot', col='firebrick') trash <- dev.off() pcoa.pdf.labels <- file.path(outfolder.plots, 'PCoA_2D_plot_labels.pdf') pdf(pcoa.pdf.labels, height=6.5, width=9) plot(0,0, col='white') text(0,0, 'Not enough samples for PCoA plot', col='firebrick') trash <- dev.off() } # simple bray curtis distance metrics for (tn in filter.levels){ bray.dist <- as.matrix(vegdist(t(kgct.filtered.classified.percentage.list[[tn]]@mat))) out.bray <- file.path(outfolder.matrices.bray, paste('braycurtis_distance_', tolower(tn), '.txt', sep='')) write.table(bray.dist, out.bray, sep='\t', quote=F, row.names = T, col.names = T) } ################################################################################# ## Compositional data analysis and plots ######################################## ################################################################################# if (length(unique(sample.groups$group)) != 2){ warning('Will not do ALDEx2 differential abundance with !=2 groups')} if (nrow(sample.groups) < 3){ warning('Will not do compositional data analysis with < 3 samples') } else { pca.plot.list <- list() if(segata){ do.tax.levels <- c('species') } else { do.tax.levels <- c('kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species') } print('Compositional data analysis....') for (tax.level in do.tax.levels[!(do.tax.levels %in% c('kingdom'))]){ print(paste('.....', tax.level)) use.mat <- kgct.filtered.classified.list[[tax.level]]@mat if(nrow(use.mat) > 2){ reads.filtered <- tryCatch(codaSeq.filter(use.mat, min.reads=codata_min_reads, min.occurrence=codata_otu_cutoff, min.prop=codata_min_prop, samples.by.row=FALSE), error=function(e) { warning(e) matrix(0) }) } else { reads.filtered <- matrix(0) } if(nrow(reads.filtered) > 2){ # replace 0 values with an estimate of the probability that the zero is not 0 # only if necessary # at this point samples are by row, and variables are by column # we can use the GBM or CZM methods. Function from zCompositions. if (sum(reads.filtered==0)>0){ rfz <- t(cmultRepl(t(reads.filtered), label=0, method="CZM", output="p-counts")) # round to integers and round anything low to 1 rfz <- round(rfz) rfz[rfz==0] <- 1 } else { rfz <- reads.filtered } # convert to CLR clr.aldex <- aldex.clr(rfz, conds=rep(NA, ncol(rfz)), mc.samples=128, denom = 'all', verbose = F) # get clr from mean of the MC instances rfz.clr <- t(sapply(clr.aldex@analysisData , function(x) rowMeans(x))) # save CLR matrix outf <- file.path(outfolder.matrices.taxonomy.classified, paste('clr_values_', tax.level, '.tsv', sep='')) write.table(round(rfz.clr, 4), outf, sep='\t', quote=F, row.names = T, col.names = T) rfz.clr.pca <- prcomp(rfz.clr) rfz.clr.mvar <- mvar(rfz.clr) # PCA biplot plot.df <- data.frame(rfz.clr.pca$x[,1:2], group=sample.groups[sample.groups$sample %in% rownames(rfz.clr), 'group']) pc1.var <- round(sum(rfz.clr.pca$sdev[1]^2)/rfz.clr.mvar * 100, 1) pc2.var <- round(sum(rfz.clr.pca$sdev[2]^2)/rfz.clr.mvar * 100, 1) pca.plot <- ggplot(plot.df, aes(x=PC1, y=PC2, col=group)) + geom_point(size=3) + geom_text(label=rownames(plot.df), nudge_x = sum(abs(range(plot.df$PC1)))/ 50, nudge_y = sum(abs(range(plot.df$PC2)))/ 50) + theme_bw() + labs(x=paste("PC1: ", pc1.var, "% of variace", sep=""), y=paste("PC2: ", pc2.var, "% of variace", sep=""), title=paste('Compositional PCA,', tax.level)) + scale_color_brewer(palette='Set1') pca.plot.list[[tax.level]] <- pca.plot # need to limit to two groups for aldex if (length(unique(sample.groups$group)) == 2){ # get two groups group.pos <- unique(sample.groups$group)[1] group.neg <- unique(sample.groups$group)[2] s.pos <- sample.groups$sample[sample.groups$group==group.pos] s.neg <- sample.groups$sample[sample.groups$group==group.neg] # aldex differential expression between groups aldex.res <- aldex(rfz[,c(s.pos, s.neg)], conditions = c(rep(group.pos, length(s.pos)), rep(group.neg, length(s.neg))), mc.samples=128, test="t", effect=TRUE, include.sample.summary=FALSE, denom="all", verbose=FALSE) # scatterplot outf.scatter <- file.path(outfolder.aldex, paste('aldex_scatter_', tax.level, '.pdf', sep='')) pdf(outf.scatter, width = 7, height = 4) mypar(1,2, mar = c(3.5, 3.5, 2.5, 1.1)) aldex.plot(aldex.res, type="MW", test="wilcox", xlab="Dispersion",ylab="Difference") mtext(paste(tax.level, ', wilcox', sep=''), line=0.25, cex=1.25) aldex.plot(aldex.res, type="MW", test="welch", xlab="Dispersion",ylab="Difference") mtext(paste(tax.level, ', welch', sep=''), line=0.25, cex=1.25) dev.off() # improve dataframe aldex.res <- signif(aldex.res, 4) # add sample numbers aldex.res$n.pos <- length(s.pos) aldex.res$n.neg <- length(s.neg) # add in name aldex.res$taxa <- rownames(aldex.res) # reorder aldex.res <- aldex.res[, c('taxa', 'n.pos', 'n.neg', colnames(aldex.res)[1:11])] aldex.res <- aldex.res[order(aldex.res$we.eBH),] # add abs effect aldex.res$abs.effect <- abs(aldex.res$effect) # save dataframe of differential results outf.res <- file.path(outfolder.aldex, paste('aldex_result_', tax.level, '.tsv', sep='')) write.table(aldex.res, outf.res, sep='\t', quote=F, row.names = F, col.names = T) # make boxplots # make plots from everything with this value or less plot.thresh <- 0.1 plot.thresh.col <- 'we.eBH' aldex.res.plot <- aldex.res[aldex.res[, plot.thresh.col] <= plot.thresh, ] outf.boxplot <- file.path(outfolder.aldex, paste('aldex_significant_boxplots_', tax.level, '.pdf', sep='')) if (nrow(aldex.res.plot) >0 ){ # make clr values m.prop <- apply(rfz, 2, function(x) x/sum(x)) clr.aldex <- aldex.clr(rfz[,c(s.pos, s.neg)], conds = c(rep(group.pos, length(s.pos)), rep(group.neg, length(s.neg))), mc.samples=128, denom = 'all', verbose = F) # get clr from mean of the MC instances rfz.clr <- sapply(clr.aldex@analysisData , function(x) rowMeans(x)) # save all significant in one boxplot pdf(outf.boxplot, height=5, width = 6, onefile=TRUE) toplot <- min(nrow(aldex.res.plot), 50) for ( i in 1:toplot){ g <- aldex.res.plot[i, 'taxa'] test.df <- data.frame(group = c(rep(group.pos, length(s.pos)), rep(group.neg, length(s.neg))), sample = c(s.pos, s.neg), proportion = c(m.prop[g, s.pos], m.prop[g, s.neg]), clr = c(rfz.clr[g, s.pos], rfz.clr[g, s.neg])) g1 <- ggplot(test.df, aes(x=group, y=proportion, fill=group)) + geom_boxplot() + theme_bw() + labs(subtitle=g) + ylim(c(0,max(test.df$proportion)*1.2)) + stat_compare_means(method = 'wilcox') + stat_compare_means(method = 't.test', label.y.npc = 0.95) + guides(fill=FALSE) g2 <- ggplot(test.df, aes(x=group, y=clr, fill=group)) + geom_boxplot() + theme_bw() + labs(subtitle=g) + ylim(c(min(test.df$clr), max(test.df$clr)*1.2)) + stat_compare_means(method = 'wilcox') + stat_compare_means(method = 't.test', label.y.npc = 0.95) print(ggarrange(g1,g2,widths = c(1,1), common.legend = T)) } dev.off() } } } } # save pca plots from each level pdf(file.path(outfolder.plots, 'compositional_PCA_plot.pdf'), height=6.5, width=9) for (tn in rev(do.tax.levels)){ print(pca.plot.list[[tn]]) } trash <- dev.off() } message('Done! :D') |
34 35 36 | shell: """ python {params.improve_taxonomy_script} {params.db} """ |
48 49 50 51 | shell: """ cp -r {params.scriptdir} {outdir} cp {input.tax_array} {outdir} """ |
69 70 71 72 73 | shell: """ time kraken2 --db {params.db} --threads {threads} --output {output.krak} \ --report {output.krak_report} {params.paired_string} {input.reads} \ --confidence {params.confidence_threshold} --use-names """ |
93 94 95 96 | shell: """ bracken -d {params.db} -i {input.krak_report} -o {params.outspec} -r {params.readlen} \ -l {params.level} -t {params.threshold} """ |
116 117 | script: 'scripts/post_classification_workflow.R' |
135 136 | script: 'scripts/post_classification_workflow.R' |
147 148 149 150 151 | shell: """ rm -rf {params.workflow_outdir}/scripts rm -f {params.workflow_outdir}/taxonomy_array.tsv touch {output} """ |
156 157 158 159 | shell: """ ktImportTaxonomy -m 3 -s 0 -q 0 -t 5 -i {input} -o {output} \ -tax $(which kraken2 | sed 's/envs\/classification2.*$//g')/envs/classification2/bin/taxonomy """ |
177 178 179 180 181 | shell: """ awk '$1=="U" {{ print }}' {input.krak} | cut -f 2 > {params.tempfile} filterbyname.sh in={input.r1} in2={input.r2} names={params.tempfile} include=true out={output.r1} out2={output.r2} rm {params.tempfile} """ |
193 194 195 196 197 | shell: """ awk '$1=="U" {{ print }}' {input.krak} | cut -f 2 > {params.tempfile} filterbyname.sh in={input.r1} names={params.tempfile} include=true out={output.r1} # rm {params.tempfile} """ |
208 209 210 211 | shell: """ rm -f {params.rmdir_1}/*.krak touch {output} """ |
221 222 | script: "scripts/convert_report_mpa_style.py" |
230 231 232 233 234 | shell: """ sum=$(grep -vP "\\|" {input} | cut -f 2 | awk '{{sum += $1}} END {{printf ("%.2f\\n", sum/100)}}') awk -v sum="$sum" 'BEGIN {{FS="\\t"}} {{OFS="\\t"}} {{print $1,$2/sum}}' {input} > {output} """ |
242 243 244 245 246 247 | shell: """ source activate biobakery2 merge_metaphlan_tables.py {input} > {output.merge} grep -E "(s__)|(^ID)" {output.merge} | grep -v "t__" | sed 's/^.*s__//g' > {output.merge_species} """ |
255 256 257 258 259 260 | shell: """ source activate biobakery2 metaphlan_hclust_heatmap.py --in {input} --top 25 --minv 0.1 -s log --out {output.heatmap1} -f braycurtis -d braycurtis -c viridis metaphlan_hclust_heatmap.py --in {input} --top 150 --minv 0.1 -s log --out {output.heatmap2} -f braycurtis -d braycurtis -c viridis """ |
268 269 270 271 272 | shell: """ kraken-biom outputs/*_bracken.report -o {output} """ ''' |
Support
- Future updates
Related Workflows





