Cross validation workflow of Semi-binary Matrix Factorization (SBMF)
SBMFCV
SBMFCV searches for the optimal hyper-parameters (the rank and the binary regularization parameter) of Semi-binary Matrix Factorization (SBMF) as performed by dcTensor::dNMF. In SBMF, a non-negative matrix X is decomposed into a matrix product U * V', and only U is constrained to take binary ({0,1}) values. For details, see the vignette of dNMF.
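To make the decomposition concrete, here is a minimal pure-Python sketch of the product U * V' and of a check that U is binary. It is only an illustration of the model stated above, not the algorithm used by dNMF; the function names `reconstruct` and `is_binary` and the toy values are hypothetical.

```python
def reconstruct(U, V):
    """Return X_hat = U * V' for U (N x J) and V (M x J) given as nested lists."""
    return [[sum(u * v for u, v in zip(u_row, v_row)) for v_row in V]
            for u_row in U]


def is_binary(U, tol=1e-8):
    """Return True if every entry of U is within tol of 0 or 1."""
    return all(min(abs(u), abs(u - 1.0)) <= tol for row in U for u in row)


# Toy factors (illustrative values, not taken from the repository).
U = [[1, 0], [0, 1], [1, 1]]   # 3 samples x 2 components, binary
V = [[2.0, 0.0], [0.0, 3.0]]   # 2 variables x 2 components, non-negative

X_hat = reconstruct(U, V)      # 3 x 2 non-negative matrix
print(X_hat, is_binary(U))
```

Each row of X_hat is the sum of the rows of V' selected by the ones in the corresponding row of U, which is what makes a binary U easy to interpret.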
SBMFCV consists of the rules below:
Pre-requisites (our experiment)
- Snakemake: v7.30.1
- Singularity: v3.7.1
- Docker: v20.10.10 (optional)
Snakemake is available via Python package managers such as pip, conda, or mamba.
Singularity and Docker can be installed with the installers provided on their websites or with the package manager of each OS, such as apt-get/yum for Linux or brew for Mac.
For details, see the installation documents below.
- https://snakemake.readthedocs.io/en/stable/getting_started/installation.html
- https://docs.sylabs.io/guides/3.0/user-guide/installation.html
- https://docs.docker.com/engine/install/
Note: The following source code does not work on M1/M2 Mac. M1/M2 Mac users should refer to README_AppleSilicon.md instead.
Usage
In this demo, we use a toy data matrix (data/testdata.tsv) consisting of 1280 samples and 13 variables, but the user can specify any non-negative matrix.
Note that the input file is assumed to be in tab-separated values (TSV) format with no row or column names.
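Before launching a long run, it can help to confirm that a file meets these assumptions. The following is a minimal sketch of such a check, using only the Python standard library; it is not the repository's own input check, and the function name `check_tsv` is hypothetical.

```python
import csv
import io
import math


def check_tsv(handle):
    """Parse a header-less TSV and return it as a list of rows of floats.

    Raises ValueError if a value is non-numeric, non-finite, or negative,
    or if the rows do not all have the same number of columns.
    """
    rows = []
    for lineno, fields in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if not fields:
            continue  # ignore blank lines
        try:
            values = [float(f) for f in fields]
        except ValueError:
            raise ValueError(
                f"line {lineno}: non-numeric value (row/column names are not allowed)"
            ) from None
        if any(not math.isfinite(v) or v < 0 for v in values):
            raise ValueError(f"line {lineno}: values must be finite and non-negative")
        if rows and len(values) != len(rows[0]):
            raise ValueError(f"line {lineno}: inconsistent number of columns")
        rows.append(values)
    if not rows:
        raise ValueError("input is empty")
    return rows


# Small in-memory example; for a real file, pass open(path, newline="").
matrix = check_tsv(io.StringIO("1\t0\t2.5\n0\t3\t1\n"))
print(len(matrix), "rows x", len(matrix[0]), "columns")
```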
Download this GitHub repository
First, download this GitHub repository and change the working directory.
git clone https://github.com/chiba-ai-med/SBMFCV.git
cd SBMFCV
Example with local machine
Next, run SBMFCV with the snakemake command as follows.
Note: To check if the command is executable, set smaller parameters such as rank_min=2 rank_max=2 lambda_max=2 lambda_min=2 trials=2 n_iter_max=2.
snakemake -j 4 --config input=data/testdata.tsv outdir=output rank_min=2 \
rank_max=10 lambda_min=-10 lambda_max=10 trials=10 \
n_iter_max=100 ratio=20 --resources mem_gb=10 --use-singularity
The meanings of all the arguments are as follows.
- -j: Snakemake option to set the number of cores (e.g., 10, mandatory)
- --config: Snakemake option to set the configuration (mandatory)
- input: Input file (e.g., testdata.tsv, mandatory)
- outdir: Output directory (e.g., output, mandatory)
- rank_min: Lower limit of the rank parameter to search (e.g., 2, which is used for the rank parameter J of dNMF, mandatory)
- rank_max: Upper limit of the rank parameter to search (e.g., 10, which is used for the rank parameter J of dNMF, mandatory)
- lambda_min: Lower limit of the lambda parameter to search (e.g., -10, which means 10^-10 is used for the binary regularization parameter Bin_U of dNMF, mandatory)
- lambda_max: Upper limit of the lambda parameter to search (e.g., 10, which means 10^10 is used for the binary regularization parameter Bin_U of dNMF, mandatory)
- trials: Number of random trials (e.g., 50, mandatory)
- n_iter_max: Number of iterations (e.g., 100, mandatory)
- ratio: Sampling ratio of cross-validation (0 - 100, e.g., 20, mandatory)
- --resources: Snakemake option to control resources (optional)
- mem_gb: Memory usage (GB, e.g., 10, optional)
- --use-singularity: Snakemake option to use Docker containers via Singularity (mandatory)
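The rank and lambda limits above define the candidate values that are searched. The sketch below enumerates them, assuming integer steps of 1 for both parameters (the step size is an assumption, not stated above) and using the 10^lambda convention described for lambda_min and lambda_max. The function name `search_values` is hypothetical.

```python
def search_values(rank_min, rank_max, lambda_min, lambda_max):
    """Return candidate ranks J and candidate Bin_U values (10**exponent).

    Assumes both ranges are inclusive and stepped by 1, which is an
    assumption made for illustration.
    """
    ranks = list(range(rank_min, rank_max + 1))
    lambdas = [10.0 ** e for e in range(lambda_min, lambda_max + 1)]
    return ranks, lambdas


# Values from the example command: ranks 2..10 and exponents -10..10.
ranks, lambdas = search_values(2, 10, -10, 10)
print(len(ranks), "ranks,", len(lambdas), "lambda values")
```

Under this assumption, the example command considers 9 ranks and 21 lambda values, each repeated for the number of random trials, which is why the notes in this section suggest much smaller settings for a first test run.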
Example with the parallel environment (GridEngine)
If GridEngine (the qsub command) is available in your environment, you can pass the qsub command to snakemake. Simply by adding the --cluster option, the jobs are submitted to multiple nodes and the computations are distributed.
Note: To check if the command is executable, set smaller parameters such as rank_min=2 rank_max=2 lambda_max=2 lambda_min=2 trials=2 n_iter_max=2.
snakemake -j 4 --config input=data/testdata.tsv outdir=output rank_min=2 \
rank_max=10 lambda_min=-10 lambda_max=10 trials=10 \
n_iter_max=100 ratio=20 --resources mem_gb=10 --use-singularity \
--cluster "qsub -l nc=4 -p -50 -r yes" --latency-wait 60
Example with the parallel environment (Slurm)
Likewise, if Slurm (the sbatch command) is available in your environment, you can specify the sbatch command after the --cluster option.
Note: To check if the command is executable, set smaller parameters such as rank_min=2 rank_max=2 lambda_max=2 lambda_min=2 trials=2 n_iter_max=2.
snakemake -j 4 --config input=data/testdata.tsv outdir=output rank_min=2 \
rank_max=10 lambda_min=-10 lambda_max=10 trials=10 \
n_iter_max=100 ratio=20 --resources mem_gb=10 --use-singularity \
--cluster "sbatch -n 4 --nice=50 --requeue" --latency-wait 60
Example with a local machine with Docker
If the docker command is available, the following command can be run without installing any other tools.
Note: To check if the command is executable, set smaller parameters such as rank_min=2 rank_max=2 lambda_max=2 lambda_min=2 trials=2 n_iter_max=2.
docker run --rm -v $(pwd):/work ghcr.io/chiba-ai-med/sbmfcv:main \
-i /work/data/testdata.tsv -o /work/output \
--cores=4 --rank_min=2 --rank_max=10 \
--lambda_min=-10 --lambda_max=10 --trials=10 \
--n_iter_max=100 --ratio=20 --memgb=10
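The Docker command takes the same settings as the snakemake command, with -i/-o for the input and output paths, --cores in place of -j, and --memgb in place of mem_gb. As a convenience, the sketch below assembles that command from Python with the small test settings suggested in the note above; it only builds and prints the argument list and does not run Docker. The function name `docker_argv` is hypothetical, and the flags are those shown in the command above.

```python
import os
import shlex

IMAGE = "ghcr.io/chiba-ai-med/sbmfcv:main"
CONFIG_KEYS = ("rank_min", "rank_max", "lambda_min", "lambda_max",
               "trials", "n_iter_max", "ratio")


def docker_argv(input_path, outdir, cores, memgb, **config):
    """Build the docker run argument list for SBMFCV.

    input_path and outdir are relative to the current directory,
    which is mounted at /work inside the container.
    """
    argv = ["docker", "run", "--rm", "-v", f"{os.getcwd()}:/work", IMAGE,
            "-i", f"/work/{input_path}", "-o", f"/work/{outdir}",
            f"--cores={cores}"]
    argv += [f"--{key}={config[key]}" for key in CONFIG_KEYS]
    argv.append(f"--memgb={memgb}")
    return argv


# Small settings for a quick check that the command is executable.
smoke = docker_argv("data/testdata.tsv", "output", cores=4, memgb=10,
                    rank_min=2, rank_max=2, lambda_min=2, lambda_max=2,
                    trials=2, n_iter_max=2, ratio=20)
print(shlex.join(smoke))
```

The resulting list can be passed to `subprocess.run(smoke, check=True)` once the printed command looks right.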
Reference
Authors
- Koki Tsuyuzaki
- Eiryo Kawakami
Code Snippets
shell: 'src/check_input.sh {input} {output} >& {log}'
shell: 'src/nmf.sh {input.in1} {output} {wildcards.rank} {N_ITER_MAX} {RATIO} >& {log}'
shell: 'src/aggregate_nmf.sh {RANK_MIN} {RANK_MAX} {TRIALS} {OUTDIR} {output} > {log}'
shell: 'src/plot_test_error.sh {input} {output} > {log}'
shell: 'src/bestrank.sh {input} {output} > {log}'
shell: 'src/sbmf.sh {input} {output} {wildcards.l} {N_ITER_MAX} >& {log}'
shell: 'src/aggregate_sbmf.sh {LAMBDA_MIN} {LAMBDA_MAX} {TRIALS} {OUTDIR} {output} > {log}'
shell: 'src/plot_zero_one_percentage.sh {input} {output} > {log}'
shell: 'src/bestlambda.sh {input} {output} > {log}'
shell: 'src/bestrank_bestlambda_sbmf.sh {input} {output} {N_ITER_MAX} >& {log}'
shell: 'src/aggregate_bestrank_bestlambda_sbmf.sh {TRIALS} {OUTDIR} {output} > {log}'
shell: 'src/b3.sh {input} {output} > {log}'
shell: 'src/bindata_for_landscaper.sh {input} {output} > {log}'
SLURM_RESTART_COUNT=2 Rscript src/aggregate_bestrank_bestlambda_sbmf.R $@
SLURM_RESTART_COUNT=2 Rscript src/aggregate_nmf.R $@
SLURM_RESTART_COUNT=2 Rscript src/aggregate_sbmf.R $@
SLURM_RESTART_COUNT=2 Rscript src/bestlambda.R $@
SLURM_RESTART_COUNT=2 Rscript src/bestrank_bestlambda_sbmf.R $@
SLURM_RESTART_COUNT=2 Rscript src/bestrank.R $@
SLURM_RESTART_COUNT=2 Rscript src/bindata_for_landscaper.R $@
SLURM_RESTART_COUNT=2 Rscript src/check_input.R $@
SLURM_RESTART_COUNT=2 Rscript src/nmf.R $@
SLURM_RESTART_COUNT=2 Rscript src/plot_test_error.R $@
SLURM_RESTART_COUNT=2 Rscript src/plot_zero_one_percentage.R $@
SLURM_RESTART_COUNT=2 Rscript src/sbmf.R $@