Snakemake workflow for comparison of differential abundance ranks
Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation, topic
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
(Pronounced ka-da-bra )
Qadabra is a Snakemake workflow for running and comparing several differential abundance (DA) methods on the same microbiome dataset.
Importantly, Qadabra focuses on both FDR corrected p-values and feature ranks and generates visualizations of differential abundance results.
Installation
pip install qadabra
Qadabra requires the following dependencies:
-
snakemake
-
click
-
biom-format
-
pandas
-
numpy
-
cython
-
iow
Usage
1. Creating the workflow directory
Qadabra can be used on multiple datasets at once. First, we want to create the workflow directory to perfrom differential abundance with all methods:
qadabra create-workflow --workflow-dest <directory_name>
This command will initialize the workflow, but we still need to point to our dataset(s) of interest.
2. Adding a dataset
We can add datasets one-by-one with the
add-dataset
command:
qadabra add-dataset \
--workflow-dest <directory_name> \
--table <directory_name>/data/table.biom \
--metadata <directory_name>/data/metadata.tsv \
--tree <directory_name>/data/my_tree.nwk \
--name my_dataset \
--factor-name case_control \
--target-level case \
--reference-level control \
--confounder confounding_variable(s) <confounding_var> \
--verbose
Let's walkthrough the arguments provided here, which represent the inputs to Qadabra:
-
workflow-dest
: The location of the workflow that we created earlier -
table
: Feature table (features by samples) in BIOM format -
metadata
: Sample metadata in TSV format -
tree
: Phylogenetic tree in .nwk or other tree format (optional) -
name
: Name to give this dataset -
factor-name
: Metadata column to use for differential abundance -
target-level
: The value in the chosen factor to use as the target -
reference-level
: The reference level to which we want to compare our target -
confounder
: Any confounding variable metadata columns (optional) -
verbose
: Flag to show all preprocessing performed by Qadabra
Your dataset should now be added as a line in
my_qadabra/config/datasets.tsv
.
You can use
qadabra add-dataset --help
for more details.
To add another dataset, just run this command again with the new dataset information.
3. Running the workflow
The previous commands will create a subdirectory,
my_qadabra
in which the workflow structure is contained.
From the command line, execute the following to start the workflow:
snakemake --use-conda --cores <number of cores preferred> <other options>
Please read the Snakemake documentation for how to run Snakemake best on your system.
When this process is completed, you should have directories
figures
,
results
, and
log
.
Each of these directories will have a separate folder for each dataset you added.
4. Generating a report
After Qadabra has finished running, you can generate a Snakemake report of the workflow with the following command:
snakemake --report report.zip
This will create a zipped directory containing the report.
Unzip this file and open the
report.html
file to view the report containing results and visualizations in your browser.
Tutorial
See the tutorial page for a walkthroughon using Qadabra workflow with a microbiome dataset.
FAQs
Coming soon: An FAQs page of commonly asked question on the statistics and code pertaining to Qadabra.
Citation
The manuscript for Qadabra is currently in progress. Please cite this GitHub page if Qadabra is used for your analysis. This project is licensed under the MIT License. See the license file for details.
Code Snippets
19 20 | script: "../scripts/R/deseq2.R" |
35 36 | script: "../scripts/R/ancombc.R" |
51 52 | script: "../scripts/R/aldex2.R" |
67 68 | script: "../scripts/R/edger.R" |
86 87 88 89 90 91 92 93 94 95 96 97 98 99 | shell: """ songbird multinomial \ --input-biom {input.table} \ --metadata-file {input.metadata} \ --formula "{params.formula}" \ --epochs {params.epochs} \ --differential-prior {params.diff_prior} \ --summary-interval 1 \ --min-feature-count 0 \ --min-sample-count 0 \ --random-seed 1 \ --summary-dir {params.outdir} > {log} 2>&1 """ |
114 115 | script: "../scripts/R/maaslin2.R" |
130 131 | script: "../scripts/R/metagenomeseq.R" |
146 147 | script: "../scripts/R/corncob.R" |
161 162 | script: "../scripts/process_differentials.py" |
177 178 | script: "../scripts/concatenate_differentials.py" |
192 193 | script: "../scripts/process_pvalues.py" |
208 209 | script: "../scripts/concatenate_pvalues.py" |
222 223 | script: "../scripts/concatenate_all_results.py" |
12 13 | script: "../scripts/run_pca.py" |
27 28 | script: "../scripts/get_pctile_feats.py" |
40 41 | script: "../scripts/get_pctile_feats.py" |
54 55 | script: "../scripts/get_log_ratios.py" |
72 73 | script: "../scripts/logistic_regression.py" |
20 21 | script: "../scripts/plot_differentials.py" |
38 39 | script: "../scripts/plot_diff_kendall_heatmap.py" |
56 57 | script: "../scripts/plot_pvalue_kendall_heatmap.py" |
87 88 | script: "../scripts/plot_upset.py" |
109 110 | script: "../scripts/plot_roc.py" |
131 132 | script: "../scripts/plot_prc.py" |
151 152 | script: "../scripts/plot_pca.py" |
169 170 | script: "../scripts/plot_rank_comparison.py" |
187 188 | script: "../scripts/plot_pvalue_pw_comparisons.py" |
209 210 | script: "../scripts/plot_pvalue_volcanoes.py" |
229 230 231 232 233 234 235 236 | shell: """ qurro \ --ranks {input.ranks} \ --table {input.table} \ --sample-metadata {input.metadata} \ --output-dir {output} > {log} 2>&1 """ |
254 255 256 257 258 259 260 | shell: """ empress tree-plot \ --tree {input.tree} \ --feature-metadata {input.ranks} \ --output-dir {output} > {log} 2>&1 """ |
277 278 | script: "../scripts/create_diff_table.py" |
294 295 | script: "../scripts/create_pval_table.py" |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 | import logging import pandas as pd logger = logging.getLogger("qadabra") logger.setLevel(logging.INFO) fh = logging.FileHandler(snakemake.log[0], mode="w") formatter = logging.Formatter( f"[%(asctime)s - {snakemake.rule}] :: %(message)s" ) fh.setFormatter(formatter) logger.addHandler(fh) logging.captureWarnings(True) logging.getLogger("py.warnings").addHandler(fh) logger.info("Loading differentials...") concatenated_diff_file = pd.read_csv(snakemake.input["concatenated_differentials"], sep="\t", index_col=0) for col in concatenated_diff_file.columns: new_col_name = col + " differentials" concatenated_diff_file.rename(columns={col: new_col_name}, inplace=True) logger.info("Loading p-values...") concatenated_pvalue_file = pd.read_csv(snakemake.input["concatenated_pvalues"], sep="\t", index_col=0) for col in concatenated_pvalue_file.columns: new_col_name = col + " significant features" concatenated_pvalue_file.rename(columns={col: new_col_name}, inplace=True) qadabra_all_result = pd.concat([concatenated_diff_file, concatenated_pvalue_file], axis=1) for index, row in qadabra_all_result.iterrows(): count = 0 for col in qadabra_all_result.columns: if col.endswith(" significant features"): if row[col] < 0.05: count += 1 qadabra_all_result.at[index, col] = "p<0.05" else: qadabra_all_result.at[index, col] = "ns" qadabra_all_result.at[index, "# of tools p > 0.05"] = count qadabra_all_result.to_csv(snakemake.output[0], sep="\t", index=True) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 | import logging import pandas as pd logger = logging.getLogger("qadabra") logger.setLevel(logging.INFO) fh = logging.FileHandler(snakemake.log[0], mode="w") formatter = logging.Formatter( f"[%(asctime)s - {snakemake.rule}] :: %(message)s" ) fh.setFormatter(formatter) logger.addHandler(fh) logging.captureWarnings(True) logging.getLogger("py.warnings").addHandler(fh) logger.info("Loading differentials...") diffs_files = [ pd.read_table(x, sep="\t", index_col=0) for x in snakemake.input ] concat_diffs = pd.concat(diffs_files, axis=1) concat_diffs.to_csv(snakemake.output[0], sep="\t", index=True) logger.info(f"Saved to {snakemake.output[0]}") |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 | import logging import pandas as pd logger = logging.getLogger("qadabra") logger.setLevel(logging.INFO) fh = logging.FileHandler(snakemake.log[0], mode="w") formatter = logging.Formatter( f"[%(asctime)s - {snakemake.rule}] :: %(message)s" ) fh.setFormatter(formatter) logger.addHandler(fh) logging.captureWarnings(True) logging.getLogger("py.warnings").addHandler(fh) logger.info("Loading p-values...") pvalue_files = [ pd.read_table(x, sep="\t", index_col=0) for x in snakemake.input ] concat_pvalues = pd.concat(pvalue_files, axis=1) concat_pvalues.to_csv(snakemake.output[0], sep="\t", index=True) logger.info(f"Saved to {snakemake.output[0]}") |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 | import logging import pandas as pd from bokeh.plotting import output_file, save from bokeh.models import DataTable, TableColumn, ColumnDataSource logger = logging.getLogger("qadabra") logger.setLevel(logging.INFO) fh = logging.FileHandler(snakemake.log[0], mode="w") formatter = logging.Formatter( f"[%(asctime)s - {snakemake.rule}] :: %(message)s" ) fh.setFormatter(formatter) logger.addHandler(fh) output_file(filename=snakemake.output[0], title="Table") diff_df = pd.read_table(snakemake.input[0], sep="\t", index_col=0) diff_df = diff_df.round(2) diff_df = diff_df.reset_index() source = ColumnDataSource(diff_df) columns = [ TableColumn(field=x, title=x) for x in diff_df.columns ] data_table = DataTable(source=source, columns=columns, frozen_columns=1, sizing_mode="stretch_height") save(data_table) logger.info(f"Saved table to {snakemake.output[0]}") |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 | import logging import pandas as pd from bokeh.plotting import output_file, save from bokeh.models import DataTable, TableColumn, ColumnDataSource logger = logging.getLogger("qadabra") logger.setLevel(logging.INFO) fh = logging.FileHandler(snakemake.log[0], mode="w") formatter = logging.Formatter( f"[%(asctime)s - {snakemake.rule}] :: %(message)s" ) fh.setFormatter(formatter) logger.addHandler(fh) output_file(filename=snakemake.output[0], title="P-value table") pval_df = pd.read_table(snakemake.input[0], sep="\t", index_col=0) pval_df = pval_df.round(2) pval_df = pval_df.reset_index() source = ColumnDataSource(pval_df) columns = [ TableColumn(field=x, title=x) for x in pval_df.columns ] data_table = DataTable(source=source, columns=columns, frozen_columns=1, sizing_mode="stretch_height") save(data_table) logger.info(f"Saved table to {snakemake.output[0]}") |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 | import logging import biom import numpy as np import pandas as pd logger = logging.getLogger("qadabra") logger.setLevel(logging.INFO) fh = logging.FileHandler(snakemake.log[0], mode="w") formatter = logging.Formatter( f"[%(asctime)s - {snakemake.rule}] :: %(message)s" ) fh.setFormatter(formatter) logger.addHandler(fh) logger.info("Loading table...") table = biom.load_table(snakemake.input["table"]) table = table.to_dataframe(dense=True).T def log_ratio(table, top_feats, bot_feats): num_sum = table.loc[:, top_feats].sum(axis=1) denom_sum = table.loc[:, bot_feats].sum(axis=1) lr_df = pd.concat([num_sum, denom_sum], axis=1) lr_df.columns = ["num", "denom"] lr_df = lr_df.dropna(how="all") lr_df = lr_df + 1 lr_df["log_ratio"] = np.log(lr_df["num"]/lr_df["denom"]).to_frame() return lr_df feat_df = pd.read_table(snakemake.input["feats"], sep="\t") if not feat_df.empty: top_feats = feat_df.query("location == 'numerator'")["feature_id"] bot_feats = feat_df.query("location == 'denominator'")["feature_id"] logger.info("Creating log-ratios...") lr_df = log_ratio(table, top_feats, bot_feats).reset_index() lr_df["pctile"] = snakemake.wildcards["pctile"] lr_df["num_feats"] = feat_df["num_feats"].unique().item() if snakemake.wildcards.get("tool") is not None: lr_df["tool"] = snakemake.wildcards["tool"] else: lr_df["tool"] = "pca_pc1" lr_df = lr_df.rename(columns={"index": "sample_name"}) lr_df.to_csv(snakemake.output[0], sep="\t", index=False) logger.info(f"Saved to {snakemake.output[0]}") else: logger.info("The feature DataFrame is empty. Skipping the rest of the script.") lr_df = pd.DataFrame() lr_df.to_csv(snakemake.output[0], sep="\t", index=False) logger.info(f"Saved to {snakemake.output[0]}") |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 | import logging import pandas as pd logger = logging.getLogger("qadabra") logger.setLevel(logging.INFO) fh = logging.FileHandler(snakemake.log[0], mode="w") formatter = logging.Formatter( f"[%(asctime)s - {snakemake.rule}] :: %(message)s" ) fh.setFormatter(formatter) logger.addHandler(fh) logger.info("Loading differentials...") diffs = pd.read_table(snakemake.input[0], sep="\t", index_col=0) if snakemake.wildcards.get("tool") is not None: tool = snakemake.wildcards["tool"] else: tool = "PC1" pctile = int(snakemake.wildcards["pctile"]) sorted_diffs = diffs.sort_values(by=tool, ascending=False) n = int(diffs.shape[0]*pctile/100) top_feats = sorted_diffs.head(n).index.tolist() bot_feats = sorted_diffs.tail(n).index.tolist() top_feats = pd.DataFrame( dict(feature_id=top_feats, location="numerator") ) bot_feats = pd.DataFrame( dict(feature_id=bot_feats, location="denominator") ) df = pd.concat([top_feats, bot_feats]) df["pctile"] = pctile df["num_feats"] = n df.to_csv(snakemake.output[0], sep="\t", index=False) logger.info(f"Saved to {snakemake.output[0]}") |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 | import logging from joblib import dump import numpy as np import pandas as pd from sklearn.linear_model import LogisticRegression from sklearn.metrics import (roc_curve, auc, precision_recall_curve, average_precision_score) from sklearn.model_selection import RepeatedStratifiedKFold logger = logging.getLogger("qadabra") logger.setLevel(logging.INFO) fh = logging.FileHandler(snakemake.log[0], mode="w") formatter = logging.Formatter( f"[%(asctime)s - {snakemake.rule}] :: %(message)s" ) fh.setFormatter(formatter) logger.addHandler(fh) logging.captureWarnings(True) logging.getLogger("py.warnings").addHandler(fh) covariate = snakemake.params["factor_name"] target = snakemake.params["target"] logger.info(f"Covariate: {covariate}") logger.info(f"Target: {target}") metadata = pd.read_table(snakemake.input["metadata"], sep="\t", index_col=0).dropna(subset=[covariate]) try: log_ratios = pd.read_table(snakemake.input["log_ratios"], sep="\t", index_col=0).join(metadata, how="inner") X = log_ratios["log_ratio"].values.reshape(-1, 1) y = (log_ratios[covariate] == target).astype(int).values mean_fpr = np.linspace(0, 1, 100) mean_rec = np.linspace(0, 1, 100) n_splits = snakemake.config["ml_params"]["n_splits"] n_repeats = snakemake.config["ml_params"]["n_repeats"] logger.info( f"Using repeated stratified k-fold CV with {n_splits} splits & " f"{n_repeats} repeats" ) cv = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=1) model = LogisticRegression(penalty="none") folds = [(train, test) for train, test in cv.split(X, y)] model_data = { "folds": { f"fold_{i+1}": {"train_idx": train, "test_idx": test} for i, (train, test) in enumerate(folds) } } model_data["fprs"] = mean_fpr model_data["recalls"] = mean_rec model_data["log_ratios"] = X model_data["truth"] = y # https://stackoverflow.com/a/67968945 tprs = [] precs = [] roc_aucs = [] pr_aucs = [] for i, (train, test) in enumerate(folds): prediction = model.fit(X[train], y[train]).predict_proba(X[test]) fpr, tpr, _ = roc_curve(y[test], prediction[:, 1]) tpr_interp = np.interp(mean_fpr, fpr, tpr) tprs.append(tpr_interp) roc_auc = auc(mean_fpr, tpr_interp) roc_aucs.append(roc_auc) prec, rec, _ = precision_recall_curve(y[test], prediction[:, 1]) prec = prec[::-1] rec = rec[::-1] prec_interp = np.interp(mean_rec, rec, prec) precs.append(prec_interp) pr_auc = auc(mean_rec, prec_interp) pr_aucs.append(pr_auc) logger.info( f"Fold {i+1}/{len(folds)}: ROC AUC = {roc_auc:.2f}, " f"PR AUC = {pr_auc:.2f}" ) model_data["tprs"] = tprs model_data["precisions"] = precs model_data["roc_aucs"] = roc_aucs model_data["pr_aucs"] = pr_aucs mean_roc_auc = np.mean(roc_aucs) mean_pr_auc = np.mean(pr_aucs) logger.info(f"Mean ROC AUC = {mean_roc_auc:.2f}") logger.info(f"Mean PR AUC = {mean_pr_auc:.2f}") dump(model_data, snakemake.output[0], compress=True) logger.info(f"Saving model to {snakemake.output[0]}") except: logger.error(f"The file {snakemake.input['log_ratios']} is empty or does not exist") model_data = {} dump(model_data, snakemake.output[0], compress=True) logger.info(f"Saving empty model to {snakemake.output[0]}") |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 | import logging import matplotlib.pyplot as plt import numpy as np import pandas as pd import seaborn as sns from bokeh.plotting import figure, save from bokeh.models import ColumnDataSource, HoverTool from bokeh.io import output_file logger = logging.getLogger("qadabra") logger.setLevel(logging.INFO) fh = logging.FileHandler(snakemake.log[0], mode="w") formatter = logging.Formatter( f"[%(asctime)s - {snakemake.rule}] :: %(message)s" ) fh.setFormatter(formatter) logger.addHandler(fh) logging.captureWarnings(True) logging.getLogger("py.warnings").addHandler(fh) plt.style.use(snakemake.config["stylesheet"]) logger.info("Loading differentials...") diffs = pd.read_table(snakemake.input[0], sep="\t", index_col=0).squeeze() values = diffs.sort_values().values feature_names = diffs.sort_values().index logger.info("Plotting differentials...") # Create a ColumnDataSource to store the data source = ColumnDataSource(data=dict(x=np.arange(len(feature_names)), feature_names=feature_names, values=values)) # Create the figure # p = figure(x_range=(0, len(values)), height=400, width=600, toolbar_location=None) p = figure(x_range=(0, len(values)), toolbar_location=None) # Plot the bars p.vbar(x='x', top='values', width=0.8, source=source) # Add HoverTool to display information on hover hover = HoverTool(tooltips=[("Feature Name", "@feature_names")]) p.add_tools(hover) # Set plot properties p.xaxis.axis_label = "Rank" p.xgrid.grid_line_color = 'gray' p.ygrid.grid_line_color = 'gray' # Configure the title tool_name = snakemake.wildcards["tool"] p.title.text = f"{tool_name} Differentials" p.title.align = 'center' p.title.text_font_size = '18pt' # Save plot output_file(snakemake.output[0]) save(p) logger.info(f"Saved interactive rank plot to {snakemake.output[0]}") |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 | import logging import matplotlib.pyplot as plt import numpy as np import pandas as pd import seaborn as sns logger = logging.getLogger("qadabra") logger.setLevel(logging.INFO) fh = logging.FileHandler(snakemake.log[0], mode="w") formatter = logging.Formatter( f"[%(asctime)s - {snakemake.rule}] :: %(message)s" ) fh.setFormatter(formatter) logger.addHandler(fh) logging.captureWarnings(True) logging.getLogger("py.warnings").addHandler(fh) plt.style.use(snakemake.config["stylesheet"]) logger.info("Loading differentials...") diffs_df = pd.read_table(snakemake.input[0], sep="\t", index_col=0) corr = diffs_df.corr("kendall") mask = np.zeros_like(corr) mask[np.triu_indices_from(mask)] = True fig, ax = plt.subplots(1, 1) sns.heatmap( corr, annot=True, mask=mask, square=True, ax=ax ) ax.set_title("Kendall Rank Correlations") plt.savefig(snakemake.output[0]) logger.info(f"Saved to {snakemake.output[0]}") |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 | import logging import matplotlib.pyplot as plt from matplotlib.patches import Patch import numpy as np import pandas as pd import seaborn as sns logger = logging.getLogger("qadabra") logger.setLevel(logging.INFO) fh = logging.FileHandler(snakemake.log[0], mode="w") formatter = logging.Formatter( f"[%(asctime)s - {snakemake.rule}] :: %(message)s" ) fh.setFormatter(formatter) logger.addHandler(fh) logging.captureWarnings(True) logging.getLogger("py.warnings").addHandler(fh) plt.style.use(snakemake.config["stylesheet"]) logger.info("Loading PCA results...") feat_df = pd.read_table(snakemake.input["features"], sep="\t", index_col=0) tool_df = pd.read_table(snakemake.input["tools"], sep="\t", index_col=0) prop_exp = pd.read_table(snakemake.input["prop_exp"], sep="\t", index_col=0) prop_exp = prop_exp.squeeze() pc_cols = tool_df.columns.tolist() tool_list = tool_df.index.tolist() arrow_palette = dict(zip( tool_list, sns.color_palette("colorblind", len(tool_list)).as_hex() )) fig, ax = plt.subplots(1, 1) logger.info("Plotting scatterplot...") sns.scatterplot( data=feat_df, x="PC1", y="PC2", color="lightgray", edgecolor="black", linewidth=0.1, ax=ax ) xlim = ax.get_xlim() ylim = ax.get_ylim() xrange = xlim[1] - xlim[0] yrange = ylim[1] - ylim[0] logger.info("Plotting arrows...") for i, row in tool_df.iterrows(): tool_name = row.name color = arrow_palette[tool_name] ax.arrow( x=0, y=0, dx=row["PC1"], dy=row["PC2"], facecolor=color, linewidth=0.5, head_width=xrange*0.005*5, width=xrange*0.005, edgecolor="black", ) patches = [] for i, row in tool_df.iterrows(): tool_name = row.name color = arrow_palette[tool_name] ax.arrow( x=0, y=0, dx=row["PC1"], dy=row["PC2"], facecolor=color, linewidth=0.5, head_width=xrange*0.005*5, width=xrange*0.005, edgecolor="black", ) tool_patch = Patch( facecolor=color, edgecolor="black", label=tool_name, linewidth=0.5, ) patches.append(tool_patch) ax.legend( handles=patches, bbox_to_anchor=[1, 1], loc="upper left", frameon=False ) prop_exp_labels = [ f"{i} ({x*100:.2f}%)" for i, x in prop_exp.items() ] ax.set_xlabel(prop_exp_labels[0]) ax.set_ylabel(prop_exp_labels[1]) ax.set_title("PCA") plt.savefig(snakemake.output[0]) logger.info(f"Saved to {snakemake.output[0]}") |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 | from collections import defaultdict import logging import joblib from matplotlib.lines import Line2D import matplotlib.pyplot as plt import numpy as np import seaborn as sns logger = logging.getLogger("qadabra") logger.setLevel(logging.INFO) fh = logging.FileHandler(snakemake.log[0], mode="w") formatter = logging.Formatter( f"[%(asctime)s - {snakemake.rule}] :: %(message)s" ) fh.setFormatter(formatter) logger.addHandler(fh) logging.captureWarnings(True) logging.getLogger("py.warnings").addHandler(fh) plt.style.use(snakemake.config["stylesheet"]) tools = snakemake.config["tools"] + ["pca_pc1"] palette = dict(zip( tools, sns.color_palette("colorblind", len(tools)) )) pctile = snakemake.wildcards["pctile"] models = defaultdict(dict) for tool, model_loc in zip(tools, snakemake.input): logger.info(f"Loading {model_loc}...") models[tool] = joblib.load(model_loc) fig, ax = plt.subplots(1, 1) best_tool = None best_auc = 0 leg_lines = [] for tool in tools: this_model_data = models[tool] mean_recalls = this_model_data["recalls"] mean_precs = np.mean(this_model_data["precisions"], axis=0) pr_aucs = this_model_data["pr_aucs"] mean_auc = np.mean(pr_aucs) std_auc = np.std(pr_aucs) if mean_auc > best_auc: best_auc = mean_auc best_tool = tool color = palette[tool] line = Line2D([0], [0], color=color, lw=2, label=f"{tool} ({mean_auc:.2f} $\pm$ {std_auc:.2f})") leg_lines.append(line) ax.plot(mean_recalls, mean_precs, lw=2, color=color) logger.info(f"{tool} Mean AUC = {mean_auc:.2f} +- {std_auc:.2f}") leg = ax.legend( handles=leg_lines, loc="upper left", frameon=False, bbox_to_anchor=[1, 1] ) for text in leg.get_texts(): if best_tool in text._text: text.set_weight("bold") logger.info(f"Best model: {best_tool} ({best_auc:.2f})") ax.set_xlabel("Recall") ax.set_ylabel("Precision") ax.set_title(f"{pctile}% Features") plt.savefig(snakemake.output[0]) logger.info(f"Saved to {snakemake.output[0]}") |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 | import logging import matplotlib.pyplot as plt import numpy as np import pandas as pd import seaborn as sns logger = logging.getLogger("qadabra") logger.setLevel(logging.INFO) fh = logging.FileHandler(snakemake.log[0], mode="w") formatter = logging.Formatter( f"[%(asctime)s - {snakemake.rule}] :: %(message)s" ) fh.setFormatter(formatter) logger.addHandler(fh) logging.captureWarnings(True) logging.getLogger("py.warnings").addHandler(fh) plt.style.use(snakemake.config["stylesheet"]) logger.info("Loading pvalues...") diffs_df = pd.read_table(snakemake.input[0], sep="\t", index_col=0) corr = diffs_df.corr("kendall") mask = np.zeros_like(corr) mask[np.triu_indices_from(mask)] = True fig, ax = plt.subplots(1, 1) palette = sns.color_palette("rocket", as_cmap=True) sns.heatmap( corr, annot=True, mask=mask, square=True, ax=ax, cmap=palette, vmin=0, vmax=1 ) ax.set_title("Kendall Rank Correlations of P-values") plt.savefig(snakemake.output[0]) logger.info(f"Saved to {snakemake.output[0]}") |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 | import logging import numpy as np import pandas as pd import seaborn as sns from bokeh.plotting import figure, output_file, save from bokeh.layouts import column, row from bokeh.models import ColumnDataSource, Select, HoverTool from bokeh.models.callbacks import CustomJS logger = logging.getLogger("qadabra") logger.setLevel(logging.INFO) fh = logging.FileHandler(snakemake.log[0], mode="w") formatter = logging.Formatter( f"[%(asctime)s - {snakemake.rule}] :: %(message)s" ) fh.setFormatter(formatter) logger.addHandler(fh) logging.captureWarnings(True) logging.getLogger("py.warnings").addHandler(fh) output_file(filename=snakemake.output[0], title="Rank Comparison") pval_df = pd.read_table(snakemake.input[0], sep="\t", index_col=0) rank_df = ( pval_df.rename(columns={f"{x}": f"{x}_pvalue" for x in pval_df.columns}) .reset_index() ) tool_list = pval_df.columns.tolist() tool_rank_list = [x + "_pvalue" for x in tool_list] # Rank Comparison # logger.info("Creating rank comparison panel...") chosen_tool_1 = Select(options=tool_rank_list, value=tool_rank_list[0], title="x-axis") chosen_tool_2 = Select(options=tool_rank_list, value=tool_rank_list[1], title="y-axis") rank_df["x"] = rank_df[chosen_tool_1.value] rank_df["y"] = rank_df[chosen_tool_2.value] source = ColumnDataSource(rank_df) hover = HoverTool(mode="mouse", tooltips=[("Name", "points")], attachment="below") hover.tooltips = ( [("Feature ID", "@feature_id")] + [(f"{x}", f"@{x}") for x in rank_df.columns if x not in ["feature_id", "x", "y"]] ) tools = [ hover, "box_zoom", "reset", "pan", "save" ] # plot = figure(tools=tools) plot = figure(x_range=(0, 1), y_range=(0, 1), tools=tools) plot.circle( source=source, x="x", y="y", color="lightblue", line_width=0.5, line_color="black", size=10, name="points" ) plot.line( x=np.arange(0, rank_df.shape[0]), y=np.arange(0, rank_df.shape[0]), line_width=3, line_dash="dashed", color="black" ) tool_callback = CustomJS(args=dict(source=source, plot=plot), code=""" const data = source.data[cb_obj.value]; if (cb_obj.title == 'x-axis') { plot.below[0].axis_label = cb_obj.value; source.data.x = data; } else { plot.left[0].axis_label = cb_obj.value; source.data.y = data; } source.change.emit(); """) chosen_tool_1.js_on_change("value", tool_callback) chosen_tool_2.js_on_change("value", tool_callback) plot.title.text = "P-value Comparison" plot.title.text_font_size = "20pt" plot.xaxis.axis_label = chosen_tool_1.value plot.yaxis.axis_label = chosen_tool_2.value for ax in [plot.xaxis, plot.yaxis]: ax.axis_label_text_font_size = "14pt" ax.axis_label_text_font_style = "normal" ax.major_label_text_font_size = "10pt" layout = row(column(chosen_tool_1, chosen_tool_2), plot) save(layout) logger.info(f"Saved rank comparisons to {snakemake.output[0]}") |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 | import logging import pandas as pd import numpy as np import matplotlib.pyplot as plt from bokeh.plotting import figure, save from bokeh.models import HoverTool, ColumnDataSource from bokeh.io import output_file # Add logging logger = logging.getLogger("qadabra") logger.setLevel(logging.INFO) fh = logging.FileHandler(snakemake.log[0], mode="w") formatter = logging.Formatter( f"[%(asctime)s - {snakemake.rule}] :: %(message)s" ) fh.setFormatter(formatter) logger.addHandler(fh) logging.captureWarnings(True) logging.getLogger("py.warnings").addHandler(fh) plt.style.use(snakemake.config["stylesheet"]) # Read in tsv containing p-values and LFCs logger.info("Loading p-values and LFCs...") data = pd.read_table(snakemake.input[0], sep="\t", index_col=0).squeeze() pval_col = snakemake.params["pval_col"] diff_ab_col = snakemake.params["diff_ab_col"] # Take negative log of pval_col column and add as new column data['negative_log_pval'] = -np.log10(data[pval_col]) # Create a ColumnDataSource object source_blue = ColumnDataSource(data=dict( diff_ab_col=data[diff_ab_col], pval=data['negative_log_pval'], feature=data.index )) source_red = ColumnDataSource(data=dict( diff_ab_col=data[diff_ab_col], pval=data['negative_log_pval'], feature=data.index )) # Create a figure tool_name = snakemake.wildcards["tool"] p = figure(title=f"{tool_name} Volcano Plot", x_axis_label='coefficient', y_axis_label='-log10(pval_col)') p.title.text_font_size = '20pt' # Define the desired y-axis range y_min = data['negative_log_pval'].min() - 1 # Minimum value y_max = data['negative_log_pval'].max() + 1 # Maximum value # Set the y-axis range p.y_range.start = y_min p.y_range.end = y_max # Define the desired x-axis range x_min = data[diff_ab_col].min() - 1 # Minimum value x_max = data[diff_ab_col].max() + 1 # Maximum value # Set the y-axis range p.x_range.start = x_min p.x_range.end = x_max # # Add a vertical line at x = -1 p.line(x=[-1, -1], y=[y_min, y_max], line_color='gray', line_width=1, line_dash='dashed') p.line(x=[1, 1], y=[y_min, y_max], line_color='gray', line_width=1, line_dash='dashed') p.line(x=[x_min,x_max], y=[-np.log10(0.05), -np.log10(0.05)], line_color='gray', line_width=1, line_dash='dashed') # Add hover tool hover = HoverTool( tooltips=[ ('Feature', '@feature'), ('coefficient', '@diff_ab_col'), ('-log10(pval_col)', '@pval'), ] ) p.add_tools(hover) # Select points where y >= 1 and x >= -log10(0.05) pos_sig_diff = (data[diff_ab_col] >= 1) & (data['negative_log_pval'] >= -np.log10(0.05)) # Select points where y <= -1 and x >= -log10(0.05) neg_sig_diff = (data[diff_ab_col] <= -1) & (data['negative_log_pval'] >= -np.log10(0.05)) # Apply selection to the data source source_blue.selected.indices = np.where(neg_sig_diff)[0] # Apply selection to the data source for negative significant differences (blue) p.circle('diff_ab_col', 'pval', size=5, fill_alpha=0.6, nonselection_line_color='black', nonselection_fill_color='black', source=source_blue, selection_color='blue') # Apply selection to the data source source_red.selected.indices = np.where(pos_sig_diff)[0] # Apply selection to the data source for positive significant differences (red) p.circle('diff_ab_col', 'pval', size=5, fill_alpha=0.6, nonselection_line_color='black', nonselection_fill_color='black', source=source_red, selection_color='red') output_file(snakemake.output[0]) save(p) logger.info(f"Saved volocano plot to {snakemake.output[0]}") |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 | import logging import numpy as np import pandas as pd import seaborn as sns from bokeh.plotting import figure, output_file, save from bokeh.layouts import column, row from bokeh.models import ColumnDataSource, Select, HoverTool from bokeh.models.callbacks import CustomJS logger = logging.getLogger("qadabra") logger.setLevel(logging.INFO) fh = logging.FileHandler(snakemake.log[0], mode="w") formatter = logging.Formatter( f"[%(asctime)s - {snakemake.rule}] :: %(message)s" ) fh.setFormatter(formatter) logger.addHandler(fh) logging.captureWarnings(True) logging.getLogger("py.warnings").addHandler(fh) output_file(filename=snakemake.output[0], title="Rank Comparison") diff_df = pd.read_table(snakemake.input[0], sep="\t", index_col=0) rank_df = ( diff_df.rank(ascending=False) .rename(columns={f"{x}": f"{x}_rank" for x in diff_df.columns}) .join(diff_df) .reset_index() ) tool_list = diff_df.columns.tolist() tool_rank_list = [x + "_rank" for x in tool_list] # Rank Comparison logger.info("Creating rank comparison panel...") chosen_tool_1 = Select(options=tool_rank_list, value=tool_rank_list[0], title="x-axis") chosen_tool_2 = Select(options=tool_rank_list, value=tool_rank_list[1], title="y-axis") rank_df["x"] = rank_df[chosen_tool_1.value] rank_df["y"] = rank_df[chosen_tool_2.value] source = ColumnDataSource(rank_df) hover = HoverTool(mode="mouse", tooltips=[("Name", "points")], attachment="below") hover.tooltips = ( [("Feature ID", "@feature_id")] + [(f"{x}", f"@{x}") for x in rank_df.columns if x not in ["feature_id", "x", "y"]] ) tools = [ hover, "box_zoom", "reset", "pan", "save" ] plot = figure(tools=tools) plot.circle( source=source, x="x", y="y", color="lightgreen", line_width=0.5, line_color="black", size=10, name="points" ) plot.line( x=np.arange(0, rank_df.shape[0]), y=np.arange(0, rank_df.shape[0]), line_width=3, line_dash="dashed", color="black" ) tool_callback = CustomJS(args=dict(source=source, plot=plot), code=""" const data = source.data[cb_obj.value]; if (cb_obj.title == 'x-axis') { plot.below[0].axis_label = cb_obj.value; source.data.x = data; } else { plot.left[0].axis_label = cb_obj.value; source.data.y = data; } source.change.emit(); """) chosen_tool_1.js_on_change("value", tool_callback) chosen_tool_2.js_on_change("value", tool_callback) plot.title.text = "Differential Rank Comparison" plot.title.text_font_size = "20pt" plot.xaxis.axis_label = chosen_tool_1.value plot.yaxis.axis_label = chosen_tool_2.value for ax in [plot.xaxis, plot.yaxis]: ax.axis_label_text_font_size = "14pt" ax.axis_label_text_font_style = "normal" ax.major_label_text_font_size = "10pt" layout = row(column(chosen_tool_1, chosen_tool_2), plot) save(layout) logger.info(f"Saved rank comparisons to {snakemake.output[0]}") |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 | from collections import defaultdict import logging import joblib from matplotlib.lines import Line2D import matplotlib.pyplot as plt import numpy as np import seaborn as sns logger = logging.getLogger("qadabra") logger.setLevel(logging.INFO) fh = logging.FileHandler(snakemake.log[0], mode="w") formatter = logging.Formatter( f"[%(asctime)s - {snakemake.rule}] :: %(message)s" ) fh.setFormatter(formatter) logger.addHandler(fh) logging.captureWarnings(True) logging.getLogger("py.warnings").addHandler(fh) plt.style.use(snakemake.config["stylesheet"]) tools = snakemake.config["tools"] + ["pca_pc1"] palette = dict(zip( tools, sns.color_palette("colorblind", len(tools)) )) pctile = snakemake.wildcards["pctile"] models = defaultdict(dict) for tool, model_loc in zip(tools, snakemake.input): logger.info(f"Loading {model_loc}...") models[tool] = joblib.load(model_loc) fig, ax = plt.subplots(1, 1) ax.set_aspect("equal") best_tool = None best_auc = 0 leg_lines = [] for tool in tools: this_model_data = models[tool] mean_fprs = this_model_data["fprs"] mean_tprs = np.mean(this_model_data["tprs"], axis=0) roc_aucs = this_model_data["roc_aucs"] mean_auc = np.mean(roc_aucs) std_auc = np.std(roc_aucs) if mean_auc > best_auc: best_auc = mean_auc best_tool = tool color = palette[tool] line = Line2D([0], [0], color=color, lw=2, label=f"{tool} ({mean_auc:.2f} $\pm$ {std_auc:.2f})") leg_lines.append(line) ax.plot(mean_fprs, mean_tprs, lw=2, color=color) logger.info(f"{tool} Mean AUC = {mean_auc:.2f} +- {std_auc:.2f}") leg = ax.legend(handles=leg_lines, loc="lower right", frameon=False) for text in leg.get_texts(): if best_tool in text._text: text.set_weight("bold") logger.info(f"Best model: {best_tool} ({best_auc:.2f})") ax.plot([0, 1], [0, 1], linestyle="--", lw=2, color="black") ax.set_xlabel("False Positive Rate") ax.set_ylabel("True Positive Rate") ax.set_title(f"{pctile}% Features") plt.savefig(snakemake.output[0]) logger.info(f"Saved to {snakemake.output[0]}") |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 | import logging import matplotlib.pyplot as plt import pandas as pd from upsetplot import UpSet, from_contents logger = logging.getLogger("qadabra") logger.setLevel(logging.INFO) fh = logging.FileHandler(snakemake.log[0], mode="w") formatter = logging.Formatter( f"[%(asctime)s - {snakemake.rule}] :: %(message)s" ) fh.setFormatter(formatter) logger.addHandler(fh) logging.captureWarnings(True) logging.getLogger("py.warnings").addHandler(fh) plt.style.use(snakemake.config["stylesheet"]) tool_names = snakemake.config["tools"] pctile = snakemake.wildcards["pctile"] logger.info("Loading percentile features...") feat_df_dict = { tool: pd.read_table(f, sep="\t", index_col=0) for tool, f in zip(tool_names, snakemake.input) } num_list = { tool: df.query("location == 'numerator'").index.tolist() for tool, df in feat_df_dict.items() } denom_list = { tool: df.query("location == 'denominator'").index.tolist() for tool, df in feat_df_dict.items() } try: num_plt = UpSet(from_contents(num_list), subset_size="count", show_counts=True).plot() plt.title(f"Numerator - {pctile}%") except IndexError: plt.figure() plt.title(f"Numerator - {pctile}% (empty)") logger.info("num_list is empty. Creating empty numerator plot.") finally: plt.savefig(snakemake.output["numerator"]) logger.info(f"Saved to {snakemake.output['numerator']}") try: denom_plt = UpSet(from_contents(denom_list), subset_size="count", show_counts=True).plot() plt.title(f"Denominator - {pctile}%") except IndexError: plt.figure() plt.title(f"Denominator - {pctile}% (empty)") logger.info("denom_list is empty. Creating empty denominator plot.") finally: plt.savefig(snakemake.output["denominator"]) logger.info(f"Saved to {snakemake.output['denominator']}") |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 | import logging import pandas as pd logger = logging.getLogger("qadabra") logger.setLevel(logging.INFO) fh = logging.FileHandler(snakemake.log[0], mode="w") formatter = logging.Formatter( f"[%(asctime)s - {snakemake.rule}] :: %(message)s" ) fh.setFormatter(formatter) logger.addHandler(fh) logger.info(f"Loading {snakemake.wildcards['tool']} differentials...") col = snakemake.params["col"] logger.info(f"Using '{col}' as column") diffs = pd.read_table(snakemake.input[0], sep="\t", index_col=0)[col] diffs.name = snakemake.wildcards["tool"] diffs.index.name = "feature_id" diffs.to_csv(snakemake.output[0], sep="\t", index=True) logger.info(f"Saved to {snakemake.output[0]}") |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 | import logging import pandas as pd logger = logging.getLogger("qadabra") logger.setLevel(logging.INFO) fh = logging.FileHandler(snakemake.log[0], mode="w") formatter = logging.Formatter( f"[%(asctime)s - {snakemake.rule}] :: %(message)s" ) fh.setFormatter(formatter) logger.addHandler(fh) logger.info(f"Loading {snakemake.wildcards['tool']} p-values...") col = snakemake.params["col"] logger.info(f"Using '{col}' as column") pvalues = pd.read_table(snakemake.input[0], sep="\t", index_col=0)[col] pvalues.name = snakemake.wildcards["tool"] pvalues.index.name = "feature_id" pvalues.to_csv(snakemake.output[0], sep="\t", index=True) logger.info(f"Saved to {snakemake.output[0]}") |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 | library(biomformat) library(ALDEx2) # Set logging information log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") # Load the input table print("Loading table...") table <- biomformat::read_biom(snakemake@input[["table"]]) table <- as.matrix(biomformat::biom_data(table)) # Load the metadata print("Loading metadata...") metadata <- read.table(snakemake@input[["metadata"]], sep="\t", header=T, row.names=1) covariate <- snakemake@params[[1]][["factor_name"]] target <- snakemake@params[[1]][["target_level"]] reference <- snakemake@params[[1]][["reference_level"]] confounders <- snakemake@params[[1]][["confounders"]] # Harmonize table and metadata samples print("Harmonizing table and metadata samples...") samples <- colnames(table) metadata <- subset(metadata, rownames(metadata) %in% samples) metadata[[covariate]] <- as.factor(metadata[[covariate]]) metadata[[covariate]] <- relevel(metadata[[covariate]], reference) sample_order <- row.names(metadata) table <- table[, sample_order] row.names(table) <- paste0("F_", row.names(table)) # Append F_ to features to avoid R renaming # Create the design formula print("Creating design formula...") design.formula <- paste0("~", covariate) if (length(confounders) != 0) { confounders_form = paste(confounders, collapse=" + ") design.formula <- paste0(design.formula, " + ", confounders_form) } design.formula <- as.formula(design.formula) mm <- model.matrix(design.formula, metadata) # Run ALDEx2 print("Running ALDEx2...") x <- ALDEx2::aldex.clr(table, mm) # perform CLR transformation aldex2.results <- ALDEx2::aldex.glm(x) # apply generalized linear model saveRDS(aldex2.results, snakemake@output[[2]]) print("Saved RDS!") # Modify result names row.names(aldex2.results) <- gsub("^F_", "", row.names(aldex2.results)) target <- paste(covariate, target, sep = "") # Save results to output file write.table(aldex2.results, file=snakemake@output[[1]], sep="\t") print("Saved differentials!") |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 | library(biomformat) library(ANCOMBC) library(phyloseq) # Set logging information log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") # Load the input table print("Loading table...") table <- biomformat::read_biom(snakemake@input[["table"]]) table <- as.matrix(biomformat::biom_data(table)) # Load the metadata print("Loading metadata...") metadata <- read.table(snakemake@input[["metadata"]], sep="\t", header=T, row.names=1) covariate <- snakemake@params[[1]][["factor_name"]] target <- snakemake@params[[1]][["target_level"]] reference <- snakemake@params[[1]][["reference_level"]] confounders <- snakemake@params[[1]][["confounders"]] # Harmonize table and metadata samples print("Harmonizing table and metadata samples...") samples <- colnames(table) metadata <- subset(metadata, rownames(metadata) %in% samples) metadata[[covariate]] <- as.factor(metadata[[covariate]]) metadata[[covariate]] <- relevel(metadata[[covariate]], reference) sample_order <- row.names(metadata) table <- table[, sample_order] row.names(table) <- paste0("F_", row.names(table)) # Append F_ to features to avoid R renaming # Convert to phyloseq object print("Converting to phyloseq...") taxa <- phyloseq::otu_table(table, taxa_are_rows=T) meta <- phyloseq::sample_data(metadata) physeq <- phyloseq::phyloseq(taxa, meta) # Create the design formula print("Creating design formula...") design.formula <- covariate if (length(confounders) != 0) { confounders_form = paste(confounders, collapse=" + ") design.formula <- paste0(design.formula, " + ", confounders_form) } # Run ANCOMBC print("Running ANCOMBC...") ancombc.results <- ANCOMBC::ancombc(phyloseq=physeq, formula=design.formula, zero_cut=1.0, p_adj_method = "BH",) saveRDS(ancombc.results, snakemake@output[[2]]) print("Saved RDS!") # Access coefficients and p-values from the ANCOMBC results coef_col <- paste("coefs", covariate, target, sep = ".") pval_col <- paste("pvals", covariate, target, sep = ".") coefs <- ancombc.results$res$beta pvals <- ancombc.results$res$p_val # qvals <- ancombc.results$res$q_val # Modify result names row.names(coefs) <- gsub("^F_", "", row.names(coefs)) row.names(pvals) <- gsub("^F_", "", row.names(pvals)) # row.names(qvals) <- gsub("^F_", "", row.names(qvals)) results_all <- data.frame(coefs=coefs, pvals=pvals) colnames(results_all) <- c("coefs", "pvals") # Save results to output file write.table(results_all, file=snakemake@output[[1]], sep="\t") print("Saved differentials!") |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 | library(biomformat) library(corncob) library(phyloseq) # Set logging information log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") # Load the input table print("Loading table...") table <- biomformat::read_biom(snakemake@input[["table"]]) table <- as.matrix(biomformat::biom_data(table)) # Load the metadata print("Loading metadata...") metadata <- read.table(snakemake@input[["metadata"]], sep="\t", header=T, row.names=1) covariate <- snakemake@params[[1]][["factor_name"]] target <- snakemake@params[[1]][["target_level"]] reference <- snakemake@params[[1]][["reference_level"]] confounders <- snakemake@params[[1]][["confounders"]] # Harmonize table and metadata samples print("Harmonizing table and metadata samples...") samples <- colnames(table) metadata <- subset(metadata, rownames(metadata) %in% samples) metadata[[covariate]] <- as.factor(metadata[[covariate]]) metadata[[covariate]] <- relevel(metadata[[covariate]], reference) sample_order <- row.names(metadata) table <- table[, sample_order] row.names(table) <- paste0("F_", row.names(table)) # Append F_ to features to avoid R renaming # Convert to phyloseq object print("Converting to phyloseq...") taxa <- phyloseq::otu_table(table, taxa_are_rows=T) meta <- phyloseq::sample_data(metadata) physeq <- phyloseq::phyloseq(taxa, meta) # Create the design formula print("Creating design formula...") design.formula <- paste0("~", covariate) if (length(confounders) != 0) { confounders_form = paste(confounders, collapse=" + ") design.formula <- paste0(design.formula, " + ", confounders_form) } design.formula <- as.formula(design.formula) print(design.formula) # Run Corncob print("Running corncob...") fit <- corncob::differentialTest( formula=design.formula, formula_null=~1, phi.formula=design.formula, phi.formula_null=design.formula, test="Wald", data=physeq, boot=F, filter_discriminant=F, full_output=T ) saveRDS(fit, snakemake@output[[2]]) print("Saved RDS!") # Create results table taxa_names <- names(fit$p) colname <- paste0("mu.", covariate, target) print("Aggregating models...") all_models <- fit$all_models coefs <- c() for (mod in all_models) { coef <- mod$coefficients[c(colname), "Estimate"] coefs <- c(coefs, coef) } coefs <- as.data.frame(coefs) row.names(coefs) <- taxa_names row.names(coefs) <- gsub("^F_", "", row.names(coefs)) pvals <- as.data.frame(fit$p) adjusted_p_values <- p.adjust(fit$p, method = "BH") pval_BH_adj <- as.data.frame(adjusted_p_values) results_all <- data.frame(coefs=coefs, pvals=pvals, pval_BH_adj=pval_BH_adj) # Save results to output file write.table(results_all, snakemake@output[[1]], sep="\t") print("Saved differentials!") |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 | library(biomformat) library(DESeq2) # Set logging information log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") # Load the input table print("Loading table...") table <- biomformat::read_biom(snakemake@input[["table"]]) table <- as.matrix(biomformat::biom_data(table)) # Load the metadata print("Loading metadata...") metadata <- read.table(snakemake@input[["metadata"]], sep="\t", header=T, row.names=1) covariate <- snakemake@params[[1]][["factor_name"]] target <- snakemake@params[[1]][["target_level"]] reference <- snakemake@params[[1]][["reference_level"]] confounders <- snakemake@params[[1]][["confounders"]] # Harmonize table and metadata samples print("Harmonizing table and metadata samples...") samples <- colnames(table) metadata <- subset(metadata, rownames(metadata) %in% samples) metadata[[covariate]] <- as.factor(metadata[[covariate]]) metadata[[covariate]] <- relevel(metadata[[covariate]], reference) sample_order <- row.names(metadata) table <- table[, sample_order] row.names(table) <- paste0("F_", row.names(table)) # Append F_ to features to avoid R renaming # Create the design formula print("Creating design formula...") design.formula <- paste0("~", covariate) if (length(confounders) != 0) { confounders_form = paste(confounders, collapse=" + ") design.formula <- paste0(design.formula, " + ", confounders_form) } design.formula <- as.formula(design.formula) # Run DESeq2 print("Running DESeq2...") dds <- DESeq2::DESeqDataSetFromMatrix( countData=table, colData=metadata, design=design.formula ) dds.results <- DESeq2::DESeq(dds, sfType="poscounts") saveRDS(dds.results, snakemake@output[[2]]) print("Saved RDS!") # Create results table and modify result names results <- DESeq2::results( dds.results, format="DataFrame", tidy=TRUE, cooksCutoff=FALSE, pAdjustMethod="BH", contrast=c(covariate, target, reference) ) row.names(results) <- gsub("^F_", "", row.names(table)) # Save results to output file write.table(results, file=snakemake@output[[1]], sep="\t") print("Saved differentials!") |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 | library(biomformat) library(edgeR) # Set logging information log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") # Load the input table print("Loading table...") table <- biomformat::read_biom(snakemake@input[["table"]]) table <- as.matrix(biomformat::biom_data(table)) # Load the metadata print("Loading metadata...") metadata <- read.table(snakemake@input[["metadata"]], sep="\t", header=T, row.names=1) covariate <- snakemake@params[[1]][["factor_name"]] target <- snakemake@params[[1]][["target_level"]] reference <- snakemake@params[[1]][["reference_level"]] confounders <- snakemake@params[[1]][["confounders"]] # Harmonize table and metadata samples print("Harmonizing table and metadata samples...") samples <- colnames(table) metadata <- subset(metadata, rownames(metadata) %in% samples) metadata[[covariate]] <- as.factor(metadata[[covariate]]) metadata[[covariate]] <- relevel(metadata[[covariate]], reference) sample_order <- row.names(metadata) table <- table[, sample_order] row.names(table) <- paste0("F_", row.names(table)) # Append F_ to features to avoid R renaming # Create the design formula print("Creating design formula...") design.formula <- paste0("~", covariate) if (length(confounders) != 0) { confounders_form = paste(confounders, collapse=" + ") design.formula <- paste0(design.formula, " + ", confounders_form) } design.formula <- as.formula(design.formula) print(design.formula) mm <- model.matrix(design.formula, metadata) # Run edgeR print("Running edgeR...") d <- edgeR::DGEList(counts=table, group=metadata[[covariate]]) d <- edgeR::calcNormFactors(d) d <- edgeR::estimateDisp(d, design=mm) fit <- edgeR::glmQLFit(d, mm, robust=T) saveRDS(fit, snakemake@output[[2]]) print("Saved RDS!") # Obtain coefficients using glmQLFit coeff_results <- fit$coefficients row.names(coeff_results) <- gsub("^F_", "", row.names(coeff_results)) # Obtain p-values and corrected p-values using glmQLFTest res <- edgeR::glmQLFTest(fit, coef=2) pval_results <- res$table adjusted_p_values <- p.adjust(pval_results$PValue, method = "BH") pval_results$PValue_BH_adj <- adjusted_p_values row.names(pval_results) <- gsub("^F_", "", row.names(pval_results)) # Create results table results_all <- cbind(coeff_results, pval_results[match(rownames(coeff_results), rownames(pval_results)),]) # Save results to output file write.table(results_all, file=snakemake@output[[1]], sep="\t") print("Saved differentials!") |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 | library(biomformat) library(dplyr) library(Maaslin2) # Set logging information log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") # Load the input table print("Loading table...") table <- biomformat::read_biom(snakemake@input[["table"]]) table <- as.matrix(biomformat::biom_data(table)) # Load the metadata print("Loading metadata") metadata <- read.table(snakemake@input[["metadata"]], sep="\t", header=T, row.names=1) covariate <- snakemake@params[[1]][["factor_name"]] target <- snakemake@params[[1]][["target_level"]] reference <- snakemake@params[[1]][["reference_level"]] confounders <- snakemake@params[[1]][["confounders"]] # Harmonize table and metadata samples print("Harmonizing table and metadata samples...") samples <- colnames(table) metadata <- subset(metadata, rownames(metadata) %in% samples) metadata[[covariate]] <- as.factor(metadata[[covariate]]) metadata[[covariate]] <- relevel(metadata[[covariate]], reference) sample_order <- row.names(metadata) # Create results table and modify result names row.names(table) <- paste0("F_", row.names(table)) # Append F_ to features to avoid R renaming table <- t(table[, sample_order]) fixed.effects <- c(covariate, confounders) specify.reference <- paste(covariate, reference, sep = ",", collapse = ",") # Run maAsLin print("Running MaAsLin...") fit.data <- Maaslin2::Maaslin2( input_data=table, input_metadata=metadata, output=snakemake@output[["out_dir"]], fixed_effects=fixed.effects, reference=specify.reference, min_prevalence=0, min_abundance=0, plot_scatter=F, plot_heatmap=F ) results <- fit.data$results %>% dplyr::filter(metadata==covariate) results <- results %>% distinct(feature, .keep_all = TRUE) row.names(results) <- gsub("^F_", "", results$feature) results <- results %>% select(-c("feature")) # Extract coefficient, p-value, and adjusted p-value results adjusted_p_values <- p.adjust(results$pval, method = "BH") results$pval_BH_adj <- adjusted_p_values # results$coef <- results$coef # Save results to output file write.table(results, file=snakemake@output[["diff_file"]], sep="\t") print("Saved differentials!") |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 | library(biomformat) library(Biobase) library(metagenomeSeq) # Set logging information log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") # Load the input table print("Loading table...") table <- biomformat::read_biom(snakemake@input[["table"]]) table <- as.matrix(biomformat::biom_data(table)) table <- as.data.frame(table) # Load the metadata print("Loading metadata...") metadata <- read.table(snakemake@input[["metadata"]], sep="\t", header=T, row.names=1) covariate <- snakemake@params[[1]][["factor_name"]] target <- snakemake@params[[1]][["target_level"]] reference <- snakemake@params[[1]][["reference_level"]] confounders <- snakemake@params[[1]][["confounders"]] # Harmonize table and metadata samples print("Harmonizing table and metadata samples...") samples <- colnames(table) metadata <- subset(metadata, rownames(metadata) %in% samples) metadata[[covariate]] <- as.factor(metadata[[covariate]]) metadata[[covariate]] <- relevel(metadata[[covariate]], reference) sample_order <- row.names(metadata) table <- table[, sample_order] row.names(table) <- paste0("F_", row.names(table)) # Append F_ to features to avoid R renaming # Create AnnotatedDataFrame from metadata pheno <- Biobase::AnnotatedDataFrame(metadata) # Create MRexperiment object experiment <- metagenomeSeq::newMRexperiment( counts=table, phenoData=pheno, ) # Perform cumulative sum scaling (CSS) normalization pctile <- metagenomeSeq::cumNormStat(experiment, pFlag=T) experiment_norm <- metagenomeSeq::cumNorm(experiment, p=pctile) pd <- Biobase::pData(experiment_norm) # Create design formula for model matrix print("Creating design formula...") design.formula <- paste0("~", covariate) if (length(confounders) != 0) { for (c in confounders) { metadata[[c]] <- as.factor(metadata[[c]]) } confounders_form = paste(confounders, collapse=" + ") design.formula <- paste0(design.formula, " + ", confounders_form) } design.formula <- as.formula(design.formula) mm <- model.matrix(design.formula, data=pd) # Run MetagenomeSeq print("Running metagenomeSeq...") fit <- metagenomeSeq::fitZig(experiment_norm, mm) saveRDS(fit, snakemake@output[[2]]) print("Saved RDS!") # Create results table and modify results names results <- metagenomeSeq::MRcoefs( obj=fit, number=dim(table)[1], adjustMethod="BH", ) row.names(results) <- gsub("^F_", "", row.names(results)) # Save results to output file write.table(results, file=snakemake@output[[1]], sep="\t") print("Saved differentials!") |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 | import logging import numpy as np import pandas as pd from sklearn.decomposition import PCA from sklearn.preprocessing import StandardScaler logger = logging.getLogger("qadabra") logger.setLevel(logging.INFO) fh = logging.FileHandler(snakemake.log[0], mode="w") formatter = logging.Formatter( f"[%(asctime)s - {snakemake.rule}] :: %(message)s" ) fh.setFormatter(formatter) logger.addHandler(fh) logger.info("Loading differentials...") diff_df = pd.read_table(snakemake.input[0], sep="\t", index_col=0) rank_df = ( diff_df.rank(ascending=False) .rename(columns={f"{x}": f"{x}_rank" for x in diff_df.columns}) .join(diff_df) ) tool_list = diff_df.columns.tolist() tool_rank_list = [x + "_rank" for x in tool_list] logger.info("Scaling values...") scaled_diff_values = StandardScaler().fit_transform(diff_df.values) pc_cols = [f"PC{x+1}" for x in range(len(tool_list))] logger.info("Running PCA...") pca = PCA(n_components=len(tool_list)).fit(scaled_diff_values) feat_df = pd.DataFrame( pca.transform(scaled_diff_values), index=diff_df.index, columns=pc_cols ).join(rank_df) max_vals = feat_df[pc_cols].max() tool_df = pd.DataFrame( np.dot(pca.components_, np.diag(max_vals)), index=diff_df.columns, columns=pc_cols ) tool_df.index.name = "tool" feat_df.to_csv(snakemake.output["features"], sep="\t", index=True) tool_df.to_csv(snakemake.output["tools"], sep="\t", index=True) logger.info(f"Saved feature PCs to {snakemake.output['features']}") logger.info(f"Saved tool PCs to {snakemake.output['tools']}") prop_exp = pca.explained_variance_ratio_ prop_exp = pd.Series(prop_exp, index=pc_cols) prop_exp.name = "proportion_var_explained" prop_exp.to_csv(snakemake.output["prop_exp"], sep="\t", index=True) logger.info(f"Saved explained variance to {snakemake.output['prop_exp']}") |
Support
- Future updates
Related Workflows





