Cross-validation workflow for Semi-binary Matrix Factorization (SBMF)

Version: v1.0.3

SBMFCV

SBMFCV searches for the optimal hyper-parameters (rank and binary regularization parameters) for Semi-binary Matrix Factorization (SBMF) performed by dcTensor::dNMF. In SBMF, a non-negative matrix X is decomposed into a matrix product U * V', and only U is constrained to take binary ({0,1}) values. For details, see the vignette of dNMF.
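In symbols, the decomposition can be sketched as below. This shows only the constraint structure: the dimension names N and M are chosen here for illustration, J denotes the rank, and V is assumed to be non-negative as is usual for NMF. The exact objective function and the form of the binary regularization on U are those of dNMF and are not reproduced here.

```latex
X \approx U V^{\top}, \qquad
X \in \mathbb{R}_{\ge 0}^{N \times M}, \quad
U \in \{0,1\}^{N \times J}, \quad
V \in \mathbb{R}_{\ge 0}^{M \times J}
```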

SBMFCV consists of the rules below:

Pre-requisites (our experiment)

Snakemake: v7.30.1
Singularity: v3.7.1
Docker: v20.10.10 (optional)

Snakemake is available via Python package managers such as pip, conda, or mamba.

Singularity and Docker can be installed with the installers provided on their websites or with an OS package manager such as apt-get or yum on Linux, or brew on macOS.

For details, see the installation documents below.

https://snakemake.readthedocs.io/en/stable/getting_started/installation.html
https://docs.sylabs.io/guides/3.0/user-guide/installation.html
https://docs.docker.com/engine/install/

Note: The following source code does not work on M1/M2 Mac. M1/M2 Mac users should refer to README_AppleSilicon.md instead.

Usage

In this demo, we use a toy data matrix (data/testdata.tsv) consisting of 1280 samples and 13 variables, but the user can specify any non-negative matrix.

Note that the input file is assumed to be in tab-separated values (TSV) format with no row or column names.
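Before starting a long run, it can be useful to confirm that a custom input file meets these assumptions. The following is a minimal sketch, not part of SBMFCV: check_tsv is a hypothetical helper that passes only if every row has the same number of tab-separated fields and every field is a non-negative number. The workflow performs its own input check (src/check_input.sh), whose exact criteria may differ.

```shell
# Hypothetical helper (not part of SBMFCV): pre-check a TSV matrix.
# Exit status is 0 if all rows have the same number of fields and
# every field is a non-negative number, and non-zero otherwise.
check_tsv() {
  awk -F '\t' '
    NR == 1 { ncol = NF }
    NF != ncol {
      printf "line %d: expected %d fields, found %d\n", NR, ncol, NF
      bad = 1
    }
    {
      for (i = 1; i <= NF; i++)
        if ($i !~ /^[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$/) {
          printf "line %d, column %d: not a non-negative number: %s\n", NR, i, $i
          bad = 1
        }
    }
    END { exit bad }
  ' "$1"
}
```

For example, `check_tsv data/testdata.tsv && echo "input looks OK"` prints the message only when the file passes. A header row or a column of row names is reported as a non-numeric field.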

Download this GitHub repository

First, download this GitHub repository and change the working directory.

git clone https://github.com/chiba-ai-med/SBMFCV.git
cd SBMFCV

Example with a local machine

Next, run SBMFCV with the snakemake command as follows.

Note: To check that the command runs, first set small parameters such as rank_min=2 rank_max=2 lambda_max=2 lambda_min=2 trials=2 n_iter_max=2.

snakemake -j 4 --config input=data/testdata.tsv outdir=output rank_min=2 \
rank_max=10 lambda_min=-10 lambda_max=10 trials=10 \
n_iter_max=100 ratio=20 --resources mem_gb=10 --use-singularity
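For a first check, the small settings from the note above can be combined into one command. The sketch below only assembles and prints that command so it can be reviewed before running. ratio=20 is carried over from the full example, and the output directory name output_smoke is a hypothetical choice that keeps test results apart from real ones.

```shell
# Smallest search suggested in the note: one rank, one lambda,
# two trials, two iterations.
# outdir=output_smoke is a hypothetical directory name.
smoke_config="input=data/testdata.tsv outdir=output_smoke rank_min=2 rank_max=2 lambda_min=2 lambda_max=2 trials=2 n_iter_max=2 ratio=20"

# Print the full command for review; run the printed line to start the check.
echo "snakemake -j 4 --config $smoke_config --resources mem_gb=10 --use-singularity"
```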

The arguments are as follows.

-j : Snakemake option to set the number of cores (e.g. 10, mandatory)
--config : Snakemake option to set the configuration (mandatory)
input : Input file (e.g., testdata.tsv, mandatory)
outdir : Output directory (e.g., output, mandatory)
rank_min : Lower limit of rank parameter to search (e.g., 2, which is used for the rank parameter J of dNMF, mandatory)
rank_max : Upper limit of rank parameter to search (e.g., 10, which is used for the rank parameter J of dNMF, mandatory)
lambda_min : Lower limit of lambda parameter to search (e.g., -10, which means 10^-10 is used for the binary regularization parameter Bin_U of dNMF, mandatory)
lambda_max : Upper limit of lambda parameter to search (e.g., 10, which means 10^10 is used for the binary regularization parameter Bin_U of dNMF, mandatory)
trials : Number of random trials (e.g., 50, mandatory)
n_iter_max : Number of iterations (e.g., 100, mandatory)
ratio : Sampling ratio of cross-validation (0 - 100, e.g., 20, mandatory)
--resources : Snakemake option to control resources (optional)
mem_gb : Memory usage (GB, e.g. 10, optional)
--use-singularity : Snakemake option to use Docker containers via Singularity (mandatory)
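Because lambda_min and lambda_max are given as base-10 exponents, it can help to print the Bin_U values that a given range covers. This is a minimal sketch, assuming the search steps through integer exponents from lambda_min to lambda_max; the step size is an assumption and is not stated in this document. lambda_grid is a hypothetical helper, not part of SBMFCV.

```shell
# Hypothetical helper: print each integer exponent from $1 to $2
# together with the corresponding value 10^exponent.
lambda_grid() {
  awk -v lo="$1" -v hi="$2" 'BEGIN {
    for (e = lo; e <= hi; e++) printf "%d\t%g\n", e, 10 ^ e
  }'
}

# Exponents -2..2 correspond to values from 0.01 to 100.
lambda_grid -2 2
```

With the settings in the example command (lambda_min=-10, lambda_max=10), the range would span 21 exponents under this assumption.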

Example with a parallel environment (GridEngine)

If GridEngine (the qsub command) is available in your environment, you can pass a qsub command to the --cluster option. With this option added, the jobs are submitted to multiple nodes and the computations are distributed.

Note: To check that the command runs, first set small parameters such as rank_min=2 rank_max=2 lambda_max=2 lambda_min=2 trials=2 n_iter_max=2.

snakemake -j 4 --config input=data/testdata.tsv outdir=output rank_min=2 \
rank_max=10 lambda_min=-10 lambda_max=10 trials=10 \
n_iter_max=100 ratio=20 --resources mem_gb=10 --use-singularity \
--cluster "qsub -l nc=4 -p -50 -r yes" --latency-wait 60

Example with a parallel environment (Slurm)

Likewise, if Slurm (the sbatch command) is available in your environment, you can pass an sbatch command to the --cluster option.

Note: To check that the command runs, first set small parameters such as rank_min=2 rank_max=2 lambda_max=2 lambda_min=2 trials=2 n_iter_max=2.

snakemake -j 4 --config input=data/testdata.tsv outdir=output rank_min=2 \
rank_max=10 lambda_min=-10 lambda_max=10 trials=10 \
n_iter_max=100 ratio=20 --resources mem_gb=10 --use-singularity \
--cluster "sbatch -n 4 --nice=50 --requeue" --latency-wait 60

Example with a local machine with Docker

If the docker command is available, the following command can be run without installing any other tools.

Note: To check that the command runs, first set small parameters such as rank_min=2 rank_max=2 lambda_max=2 lambda_min=2 trials=2 n_iter_max=2.

docker run --rm -v $(pwd):/work ghcr.io/chiba-ai-med/sbmfcv:main \
-i /work/data/testdata.tsv -o /work/output \
--cores=4 --rank_min=2 --rank_max=10 \
--lambda_min=-10 --lambda_max=10 --trials=10 \
--n_iter_max=100 --ratio=20 --memgb=10

Reference

dcTensor

Authors

Koki Tsuyuzaki
Eiryo Kawakami

Code Snippets

shell:
	'src/check_input.sh {input} {output} >& {log}'

SnakeMake From line 50 of main/Snakefile

shell:
	'src/nmf.sh {input.in1} {output} {wildcards.rank} {N_ITER_MAX} {RATIO} >& {log}'

SnakeMake From line 67 of main/Snakefile

shell:
	'src/aggregate_nmf.sh {RANK_MIN} {RANK_MAX} {TRIALS} {OUTDIR} {output} > {log}'

SnakeMake From line 80 of main/Snakefile

shell:
	'src/plot_test_error.sh {input} {output} > {log}'

SnakeMake From line 92 of main/Snakefile

shell:
	'src/bestrank.sh {input} {output} > {log}'

SnakeMake From line 104 of main/Snakefile

shell:
	'src/sbmf.sh {input} {output} {wildcards.l} {N_ITER_MAX} >& {log}'

SnakeMake From line 121 of main/Snakefile

shell:
	'src/aggregate_sbmf.sh {LAMBDA_MIN} {LAMBDA_MAX} {TRIALS} {OUTDIR} {output} > {log}'

SnakeMake From line 134 of main/Snakefile

shell:
	'src/plot_zero_one_percentage.sh {input} {output} > {log}'

SnakeMake From line 146 of main/Snakefile

shell:
	'src/bestlambda.sh {input} {output} > {log}'

SnakeMake From line 158 of main/Snakefile

shell:
	'src/bestrank_bestlambda_sbmf.sh {input} {output} {N_ITER_MAX} >& {log}'

SnakeMake From line 175 of main/Snakefile

shell:
	'src/aggregate_bestrank_bestlambda_sbmf.sh {TRIALS} {OUTDIR} {output} > {log}'

SnakeMake From line 188 of main/Snakefile

shell:
	'src/b3.sh {input} {output} > {log}'

SnakeMake From line 204 of main/Snakefile

shell:
	'src/bindata_for_landscaper.sh {input} {output} > {log}'

SnakeMake From line 216 of main/Snakefile

SLURM_RESTART_COUNT=2

Rscript src/aggregate_bestrank_bestlambda_sbmf.R $@

Shell From line 11 of src/aggregate_bestrank_bestlambda_sbmf.sh

SLURM_RESTART_COUNT=2

Rscript src/aggregate_nmf.R $@

Shell From line 11 of src/aggregate_nmf.sh

SLURM_RESTART_COUNT=2

Rscript src/aggregate_sbmf.R $@

Shell From line 11 of src/aggregate_sbmf.sh

SLURM_RESTART_COUNT=2

Rscript src/b3.R $@

Shell From line 11 of src/b3.sh

SLURM_RESTART_COUNT=2

Rscript src/bestlambda.R $@

Shell From line 11 of src/bestlambda.sh

SLURM_RESTART_COUNT=2

Rscript src/bestrank_bestlambda_sbmf.R $@

Shell From line 11 of src/bestrank_bestlambda_sbmf.sh

SLURM_RESTART_COUNT=2

Rscript src/bestrank.R $@

Shell From line 11 of src/bestrank.sh

SLURM_RESTART_COUNT=2

Rscript src/bindata_for_landscaper.R $@

Shell From line 11 of src/bindata_for_landscaper.sh

SLURM_RESTART_COUNT=2

Rscript src/check_input.R $@

Shell From line 11 of src/check_input.sh

SLURM_RESTART_COUNT=2

Rscript src/nmf.R $@

Shell From line 11 of src/nmf.sh

SLURM_RESTART_COUNT=2

Rscript src/plot_test_error.R $@

Shell From line 11 of src/plot_test_error.sh

SLURM_RESTART_COUNT=2

Rscript src/plot_zero_one_percentage.R $@

Shell From line 11 of src/plot_zero_one_percentage.sh

SLURM_RESTART_COUNT=2

Rscript src/sbmf.R $@

Shell From line 11 of src/sbmf.sh



URL: https://github.com/chiba-ai-med/SBMFCV


License: MIT License
