Snakemake workflow to detect and classify viruses in metagenome assemblies.
Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation, topic
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
Snakemake workflow to detect and classify viruses in metagenome assemblies.
It first detects viral sequences in assemblies (
.fa
files) with
VirSorter2
,
VIBRANT
and
DeepVirFinder
. Predictions are strictly quality controlled with
CheckV
, followed by clustering with
CD-HIT
and taxonomic classification with
Demovir
.
Installation
git clone --recursive https://github.com/alexmsalmeida/virsearch.git
- Download and extract necessary databases (uncompressed directory will require a total of 30 GB).
wget http://ftp.ebi.ac.uk/pub/databases/metagenomics/genome_sets/virsearch_db.tar.gz
tar -xzvf virsearch_db.tar.gz
How to run
-
Edit
config.yml
file to point to the input , output and databases directories, as well as the USEARCH binary location (usearch_binary
). Input directory should contain the.fa
assemblies to analyse. -
Install the necessary conda environments through snakemake
snakemake --use-conda --conda-create-envs-only --cores 1
-
(option 1) Run the pipeline locally (adjust
-j
based on the number of available cores)
snakemake --use-conda -k -j 4
- (option 2) Run the pipeline on a cluster (e.g., SLURM)
snakemake --use-conda -k -j 100 --cluster-config cluster.yml --cluster 'sbatch -A ALMEIDA-SL3-CPU -p icelake-himem --time=12:00:00 --ntasks={cluster.nCPU} --mem={cluster.mem} -o {cluster.output}'
Output
The main output files generated per input FASTA are the
final_predictions.fa
and
final_predictions_tax.tsv
files, which contain the viral sequences in FASTA format and their taxonomic annotation, respectively. If these files are empty it likely means that no high-confidence viral sequences were detected (check individual logs of the tools to confirm no other issues arose).
Code Snippets
52 53 54 55 | shell: """ virsorter setup -d {params.outdir} -j 4 """ |
69 70 71 72 73 74 | shell: """ rm -rf {params.outdir} virsorter run -j 8 -i {input.fa} --db-dir {params.database} -w {params.outdir} --min-length 10000 all || touch {params.raw} tools/rename_multifasta_prefix.py -f {params.raw} -p {wildcards.id}_VS2 > {output} """ |
87 88 89 90 91 | shell: """ rm -rf {params.outdir} tools/DeepVirFinder/dvf.py -i {input} -m {params.database} -o {params.outdir} -l 10000 -c 4 || touch {output} """ |
104 105 106 107 108 109 110 | shell: """ awk '{{if($3 > 0.9 && $4 < 0.01)print$1}}' {input.dvf} > {params.contigs} tools/select_seqs_by_IDs.py -i {input.fa} -d {params.contigs} -o {params.fa_ori} tools/rename_multifasta_prefix.py -f {params.fa_ori} -p {wildcards.id}_DVF > {output} rm {params.fa_ori} {params.contigs} """ |
125 126 127 128 129 130 131 132 133 134 | shell: """ rm -rf {params.outdir} VIBRANT_run.py -t 4 -d {params.database} -m {params.files} -i {input} -folder {params.outdir} -l 10000 -no_plot || true if [[ ! -f {params.phages} ]]; then mkdir -p {params.outdir} && touch {output} else tools/rename_multifasta_prefix.py -f {params.phages} -p {wildcards.id}_VIBRANT > {output} fi """ |
144 145 | shell: "cat {input.vs2} {input.dvf} {input.vibrant} > {output}" |
153 154 | shell: "echo 'No high-confidence viruses detected, generating empty files'" |
166 167 168 169 170 | shell: """ rm -rf {params.outdir} checkv end_to_end -t 8 -d {params.database} {input} {params.outdir} """ |
181 182 183 184 185 186 | shell: """ tools/filter_checkv.py {params.indir}proviruses.fna {input} {params.indir}proviruses_filt.fna pro tools/filter_checkv.py {params.indir}viruses.fna {input} {params.indir}viruses_filt.fna vir cat {params.indir}proviruses_filt.fna {params.indir}viruses_filt.fna > {output} """ |
198 199 200 201 202 | shell: """ cd-hit-est -c 0.99 -i {input} -o {params.clst_dir}/filtered_predictions_nr -T 0 -M 0 -d 0 mv {params.clst_dir}/filtered_predictions_nr {output.pred} """ |
217 218 219 220 221 222 223 224 | shell: """ prodigal-gv -a {params.demovir_dir}/proteins.faa -i {input} -p meta &> /dev/null {params.usearch} -ublast {params.demovir_dir}/proteins.faa -db {params.database}/uniprot_trembl.viral.udb -evalue 1e-5 -trunclabels -blast6out {params.demovir_dir}/trembl_ublast.viral.txt -threads 4 &> /dev/null sort -u -k1,1 {params.demovir_dir}/trembl_ublast.viral.txt > {params.demovir_dir}/trembl_ublast.viral.u.txt cut -f 1,2 {params.demovir_dir}/trembl_ublast.viral.u.txt | sed 's/_[0-9]\+\t/\t/' | cut -f 1 | paste {params.demovir_dir}/trembl_ublast.viral.u.txt - > {output.contig_ids} tools/demovir.R {params.demovir_dir} {params.database} """ |
233 234 235 | run: shutil.move(input[0], output.fa) shutil.move(input[1], output.tax) |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 | import sys import Bio from Bio import SeqIO from Bio import SeqRecord if len(sys.argv) < 3: print("usage: script.py in.fa checkv_summary.tsv out.fa pro") sys.exit(1) contigs = set() with open(sys.argv[2]) as f: for line in f: line = line.rstrip() cols = line.split("\t") if cols[0] != "contig_id": if cols[9] != "NA": if int(cols[5]) > int(cols[6]) and int(cols[1]) >= 10000 and float(cols[9]) >= 50 and float(cols[12]) <= 1.0 and "contig >1.5x" not in line: contigs.add(cols[0]) with open(sys.argv[1]) as f, open(sys.argv[3], "w") as fout: for record in SeqIO.parse(f, "fasta"): if sys.argv[4] == "pro": contig_id = "_".join(record.id.split("_")[:-1]) elif sys.argv[4] == "vir": contig_id = record.id.split()[0] if contig_id in contigs and len(record.seq) >= 10000: SeqIO.write(record, fout, "fasta") |
Support
- Future updates
Related Workflows





