BLAST-like search across all pre-2019 bacteria on ordinary laptop and desktop computers within a few hours
Introduction
MOF-Search is a pipeline for BLAST-like search across all pre-2019 bacteria from ENA (the 661k collection) on standard desktop and laptop computers.
The central idea behind MOF-Search, which enables search at such a scale, is phylogenetic compression - a technique that uses estimated evolutionary history to guide compression and to search large collections of microbial genomes efficiently with existing algorithms and data structures. In short, the input data are reorganized according to the topology of the estimated phylogenies, which makes them highly locally compressible even with basic techniques. Existing software packages for compression, indexing, and search - in this case XZ, COBS, and Minimap2 - are then used as low-level tools. The resulting performance gains come from a wide range of benefits of phylogenetic compression, including easy parallelization, small memory requirements, small database size, better memory locality, and better branch prediction.
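The effect of phylogeny-guided reordering can be illustrated with standard command-line tools. The following sketch is purely illustrative and is not part of the MOF-Search pipeline; the file leaf_order.txt (one genome FASTA path per line, in the left-to-right order of the leaves of an estimated tree) is a hypothetical input.

# Illustrative sketch only - not part of MOF-Search.
# leaf_order.txt is assumed to list one genome FASTA path per line,
# in the left-to-right order of the leaves of an estimated phylogeny.

# 1) Concatenate genomes in phylogenetic order, so similar genomes end up adjacent.
xargs cat < leaf_order.txt > batch_phylo.fa

# 2) Compress with a standard tool; local similarity makes the archive much smaller
#    than compressing the same genomes in an arbitrary order.
xz -9 -T0 batch_phylo.fa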
For more information about phylogenetic compression and the implementation details of MOF-Search, see the corresponding paper (as well as its supplementary material and the associated website of the whole MOF framework).
Citation
K. Břinda, L. Lima, S. Pignotti, N. Quinones-Olvera, K. Salikhov, R. Chikhi, G. Kucherov, Z. Iqbal, and M. Baym. Efficient and Robust Search of Microbial Genomes via Phylogenetic Compression. bioRxiv 2023.04.15.536996, 2023. https://doi.org/10.1101/2023.04.15.536996
Installation
Step 1: Install dependencies
MOF-Search is implemented as a Snakemake pipeline, using the Conda system to manage all non-standard dependencies. To function smoothly, we recommend having Conda with the following packages:
- GNU Time (present by default on Linux; on OS X it can be installed with brew install gnu-time)
- Python (>=3.7)
- Snakemake (>=6.2.0)
- Mamba (>=0.20.0) - optional, but recommended
The last three packages can be installed using Conda by
conda install -y "python>=3.7" "snakemake>=6.2.0" "mamba>=0.20.0"
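A quick way to check that the core dependencies are available (a sanity check only, not part of the official instructions):

python3 --version     # should report >= 3.7
snakemake --version   # should report >= 6.2.0
mamba --version       # optional, should report >= 0.20.0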
Step 2: Clone the repository
git clone https://github.com/karel-brinda/mof-search
cd mof-search
Step 3: Run a simple test
Run
make test
to ensure the pipeline works for the sample queries and just
3 batches. This will also install additional dependencies using Conda or Mamba, such as COBS, SeqTK, and Minimap 2.
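To make the success criterion explicit, you can check the exit status directly (an optional convenience, not part of the official instructions):

make test && echo "Test passed" || echo "Test failed"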
Notes:
- make test should return 0 (success), and you should see the following message at the end of the execution, confirming that the test produced the expected output:
  Files output/backbone19Kbp___ecoli_reads_1___ecoli_reads_2___gc01_1kl.sam_summary.xz and data/backbone19Kbp___ecoli_reads_1___ecoli_reads_2___gc01_1kl.sam_summary.xz are identical
- If the test did not produce the expected output and you obtained an error message such as
  Files output/backbone19Kbp___ecoli_reads_1___ecoli_reads_2___gc01_1kl.sam_summary.xz and data/backbone19Kbp.fa differ
  make: *** [Makefile:21: test] Error 1
  you should investigate the cause before proceeding.
Step 4: Download the database
Run make download to download from Zenodo all the remaining phylogenetically compressed assemblies and COBS k-mer indexes for the 661k-HQ collection. The downloaded files will then be located in the asms/ and cobs/ directories.
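A quick sanity check after the download finishes (optional; exact file counts and sizes depend on the current release):

du -sh asms/ cobs/        # total size of the downloaded data
ls asms/ | head           # a few of the downloaded assembly batches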
Notes:
- The compressed assemblies comprise all the genomes from the 661k collection. The COBS indexes comprise only those genomes that passed quality control.
- For file accessions, see the MOF website.
Usage
Step 1: Copy or symlink your queries
Remove the default test files or your old files from the queries/ directory and copy or symlink (recommended) your query files there, for example as shown below. The supported input formats are FASTA and FASTQ, possibly gzipped. All query files will be preprocessed and merged together.
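For example, assuming your reads are stored in /data/my_sample.fastq.gz (a hypothetical path), the queries can be set up as follows:

rm -f queries/*                            # remove the default test queries (or your old files)
ln -s /data/my_sample.fastq.gz queries/    # symlink your query file (recommended)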
Notes:
- All query names have to be unique among all query files.
- All non-ACGT characters in your query sequences will be translated to A.
Step 2: Adjust configuration
Edit the config.yaml file for your desired search. All available options are documented directly there.
Step 3: Clean up intermediate files
Run make clean to clean the intermediate files from the previous runs. This includes COBS matching files, alignment files, and various reports.
Step 4: Run the pipeline
Simply run make, which will execute Snakemake with the corresponding parameters. If you want to run the pipeline step by step, run make match followed by make map.
Step 5: Analyze your results
Check the output files in output/.
If the results don't correspond to what you expected and you need to re-adjust your search parameters, go back to Step 2. If only the mapping part is affected by the changes, you can proceed more rapidly by manually removing the files in intermediate/03_map and output/ and running make map directly, as shown below.
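For instance, after changing only the mapping-related options, the following re-runs just the mapping step (paths as listed in the Directories section below):

rm -rf intermediate/03_map output/
make map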
Additional information
List of workflow commands
MOF-Search is executed via GNU Make, which handles all parameters and passes them to Snakemake.
Here's a list of all implemented commands (to be executed as make {command}):
######################
## General commands ##
######################
all Run everything (the default rule)
test Quick test using 3 batches
help Print help messages
clean Clean intermediate search files
cleanall Clean all generated and downloaded files
####################
## Pipeline steps ##
####################
conda Create the conda environments
download Download the assemblies and COBS indexes
match Match queries using COBS (queries -> candidates)
map Map candidates to assemblies (candidates -> alignments)
###############
## Reporting ##
###############
viewconf View configuration without comments
report Generate Snakemake report
##########
## Misc ##
##########
cluster_slurm Submit to a SLURM cluster
cluster_lsf_test Submit the test pipeline to LSF cluster
cluster_lsf Submit to LSF cluster
format Reformat Python and Snakemake files
Directories
- asms/, cobs/   Downloaded assemblies and COBS indexes
- queries/       Queries, to be provided within one or more FASTA files (.fa)
- intermediate/  Intermediate files
  - 00_cobs        Decompressed COBS indexes (tmp)
  - 01_match       COBS matches
  - 02_filter      Filtered candidates
  - 03_map         Minimap2 alignments
  - fixed_queries  Sanitized queries
- output/        Results
Running on a cluster
Running on a cluster is much faster as the jobs produced by this pipeline are quite light and usually start running as soon as they are scheduled.
For LSF clusters:
- Test that the pipeline works on an LSF cluster: make cluster_lsf_test
- Configure your queries and run the full pipeline: make cluster_lsf
Known limitations
- When the number of queries is too high, the auxiliary Python scripts start to use too much memory, which may result in swapping. Try to keep the number of queries moderate and their names short. If you have tens, hundreds, or more query files, concatenate them all into one before running mof-search (see the example below).
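For example, a large set of FASTA query files could be merged into a single file before the search (file names are illustrative; gzipped inputs are also supported):

cat queries/*.fa > all_queries.fa    # merge all queries into one file
rm queries/*.fa                      # remove the individual files
mv all_queries.fa queries/           # keep only the merged file in queries/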
License
Contacts
- Karel Brinda <karel.brinda@inria.fr>
- Leandro Lima <leandro@ebi.ac.uk>
Code Snippets
import sys
import argparse
from pathlib import Path
import subprocess
import datetime


def get_args():
    parser = argparse.ArgumentParser(description='Benchmark a command.')
    parser.add_argument('command', type=str, help='The command to be benchmarked')
    parser.add_argument('--log', type=str, required=True,
                        help='Path to the log file with benchmark statistics.')
    args = parser.parse_args()
    return args


def get_time_command():
    # GNU time: /usr/bin/time on Linux, gtime (from brew's gnu-time) on OS X
    if sys.platform == "linux":
        time_command = "/usr/bin/time"
    elif sys.platform == "darwin":
        time_command = "gtime"
    else:
        raise Exception("Unsupported OS")
    return time_command


def main():
    args = get_args()

    log_file = Path(args.log)
    log_file.parent.mkdir(parents=True, exist_ok=True)
    tmp_log_file = Path(f"{log_file}.tmp")

    # When benchmarking the whole Snakemake pipeline, additionally track system RAM usage
    is_benchmarking_pipeline = args.command.split()[0] == "snakemake"

    with open(log_file, "w") as log_fh:
        formatted_command = " ".join(args.command.replace("\\\n", " ").strip().split())
        print(f"# Benchmarking command: {formatted_command}", file=log_fh)
        header = [
            "real(s)", "sys(s)", "user(s)", "percent_CPU", "max_RAM(kb)",
            "FS_inputs", "FS_outputs", "elapsed_time_alt(s)"
        ]
        if is_benchmarking_pipeline:
            header.append("max_delta_system_RAM(kb)")
        print("\t".join(header), file=log_fh)

    time_command = get_time_command()
    benchmark_command = f'{time_command} -o {tmp_log_file} -f "%e\t%S\t%U\t%P\t%M\t%I\t%O"'

    start_time = datetime.datetime.now()
    main_process = subprocess.Popen(f'{benchmark_command} {args.command}', shell=True)
    if is_benchmarking_pipeline:
        RAM_tmp_log_file = Path(f"{log_file}.RAM.tmp")
        RAM_benchmarking_process = subprocess.Popen(
            [sys.executable, "scripts/get_RAM_usage.py", str(RAM_tmp_log_file), str(main_process.pid)])

    return_code = main_process.wait()
    if return_code:
        raise subprocess.CalledProcessError(return_code, main_process.args,
                                            output=main_process.stdout, stderr=main_process.stderr)
    end_time = datetime.datetime.now()
    elapsed_seconds = (end_time - start_time).total_seconds()

    # Append the measurements collected by GNU time (plus wall-clock time and, optionally, RAM usage)
    with open(tmp_log_file) as log_fh_tmp, open(log_file, "a") as log_fh:
        log_line = log_fh_tmp.readline().strip()
        log_line += f"\t{elapsed_seconds}"
        if is_benchmarking_pipeline:
            RAM_benchmarking_process.kill()
            with open(RAM_tmp_log_file) as RAM_tmp_log_fh:
                RAM_usage = RAM_tmp_log_fh.readline().strip()
            log_line += f"\t{RAM_usage}"
            RAM_tmp_log_file.unlink()
        print(log_line, file=log_fh)

    tmp_log_file.unlink()


if __name__ == "__main__":
    main()
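The pipeline rules invoke this script as shown in the snippets below. Used standalone, the invocation could look like this (the log path and wrapped command are illustrative):

./scripts/benchmark.py --log logs/benchmarks/example.txt 'make match'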
shell:
    """
    curl "{params.url}/{wildcards.batch}.tar.xz" > {output.xz}
    scripts/test_xz.py {output.xz}
    """
shell:
    """
    curl "{params.url}" > {output.xz}
    scripts/test_xz.py {output.xz}
    """
shell:
    """
    seqtk seq -A -U -C {input.original_query} \\
        | awk '{{if(NR%2==1){{print $0;}}else{{gsub(/[^ACGT]/, \"{params.base_to_replace}\"); print;}}}}' \\
        > {output.fixed_query}
    """
shell:
    """
    cat {input} > {output}
    """
shell:
    """
    ./scripts/benchmark.py --log logs/benchmarks/decompress_cobs/{wildcards.batch}.txt \\
        'xzcat --no-sparse --ignore-check "{input.xz}" > "{params.cobs_index_tmp}" \\
            && mv "{params.cobs_index_tmp}" "{output.cobs_index}"'
    """
shell:
    """
    ./scripts/benchmark.py --log logs/benchmarks/run_cobs/{wildcards.batch}____{wildcards.qfile}.txt \\
        'cobs query \\
            {params.load_complete} \\
            -t {params.kmer_thres} \\
            -T {threads} \\
            -i {input.cobs_index} \\
            -f {input.fa} \\
            | ./scripts/postprocess_cobs.py -n {params.nb_best_hits} \\
            | gzip --fast \\
            > {output.match}'
    """
shell:
    """
    if [ {params.streaming} = 1 ]
    then
        ./scripts/benchmark.py --log logs/benchmarks/run_cobs/{wildcards.batch}____{wildcards.qfile}.txt \\
            './scripts/run_cobs_streaming.sh {params.kmer_thres} {threads} "{input.compressed_cobs_index}" {params.uncompressed_batch_size} "{input.fa}" \\
                | ./scripts/postprocess_cobs.py -n {params.nb_best_hits} \\
                | gzip --fast \\
                > {output.match}'
    else
        mkdir -p {params.decompression_dir}
        ./scripts/benchmark.py --log logs/benchmarks/decompress_cobs/{wildcards.batch}____{wildcards.qfile}.txt \\
            'xzcat "{input.compressed_cobs_index}" > "{params.cobs_index_tmp}" \\
                && mv "{params.cobs_index_tmp}" "{params.cobs_index}"'
        ./scripts/benchmark.py --log logs/benchmarks/run_cobs/{wildcards.batch}____{wildcards.qfile}.txt \\
            'cobs query \\
                {params.load_complete} \\
                -t {params.kmer_thres} \\
                -T {threads} \\
                -i "{params.cobs_index}" \\
                -f "{input.fa}" \\
                | ./scripts/postprocess_cobs.py -n {params.nb_best_hits} \\
                | gzip --fast \\
                > {output.match}'
        rm -v "{params.cobs_index}"
    fi
    """
shell:
    """
    ./scripts/benchmark.py --log logs/benchmarks/translate_matches/translate_matches___{wildcards.qfile}.txt \\
        './scripts/filter_queries.py \\
            -n {params.nb_best_hits} \\
            -q {input.fa} \\
            {input.all_matches} \\
            > {output.fa} 2>{log}'
    """
shell:
    """
    xzcat data/661k_batches.txt.xz \\
        | grep {wildcards.batch} \\
        | cut -f2 \\
        > {params.refs_tmp}
    ./scripts/benchmark.py --log logs/benchmarks/batch_align_minimap2/{wildcards.batch}____{wildcards.qfile}.txt \\
        './scripts/batch_align.py \\
            --minimap-preset {params.minimap_preset} \\
            --threads {threads} \\
            --extra-params=\"{params.minimap_extra_params}\" \\
            --accessions {params.refs_tmp} \\
            {params.pipe} \\
            {input.asm} \\
            {input.qfa} \\
            2>{log} \\
            | {{ grep -Ev "^@" || true; }} \\
            | gzip --fast \\
            > {output.sam}'
    rm -f {params.refs_tmp}
    """
shell:
    """
    ./scripts/benchmark.py --log logs/benchmarks/aggregate_sams/aggregate_sams___{wildcards.qfile}.txt \\
        './scripts/aggregate_sams.sh {input.sam} \\
            > {output.pseudosam}'
    """
shell:
    """
    ./scripts/benchmark.py --log logs/benchmarks/aggregate_sams/final_stats___{wildcards.qfile}.txt \\
        './scripts/final_stats.py {input.concatenated_query} {input.pseudosam} \\
            > {output.stats}'
    """