Snakemake workflow: genecatalog extension for metagenome atlas
This workflow extends metagenome-atlas with a gene catalog: predicted genes are clustered with MMseqs2, representative sequences for each cluster are extracted and renamed with a consistent prefix, and a mapping from every ORF to its gene in the catalog is written out.
Insert your code into the respective folders, i.e. `scripts`, `rules`, and `envs`. Define the entry point of the workflow in the `Snakefile` and the main configuration in the `config.yaml` file.
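For orientation, a minimal entry point could look like the sketch below; the rule file name and the final target are illustrative placeholders, not the actual contents of this repository:

```python
# Snakefile -- hypothetical minimal entry point (file and target names are illustrative)
configfile: "config.yaml"

# Pull in the rule definitions from the rules/ folder
include: "rules/genecatalog.smk"

rule all:
    input:
        # final target(s) of the workflow, e.g. the renamed gene catalog
        "genecatalog/gene_catalog.faa",
```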
Authors
- Silas Kieser (@silask)
Usage
Simple
Step 1: Install workflow
If you simply want to use this workflow, download and extract the latest release. If you intend to modify and further extend this workflow or want to work under version control, fork this repository as outlined in Advanced. The latter way is recommended.
In any case, if you use this workflow in a paper, don't forget to give credit to the authors by citing the URL of this repository and, if available, its DOI.
Step 2: Configure workflow
Configure the workflow according to your needs by editing the file `config.yaml`.
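The clustering parameters used by the rules below (see Code Snippets) hint at what the configuration controls. A hypothetical excerpt, with key names that are assumptions rather than the shipped defaults:

```yaml
# Hypothetical excerpt -- consult the shipped config.yaml for the real key names
genecatalog:
  clustermethod: linclust   # mmseqs module to run, e.g. linclust or cluster
  minid: 0.9                # passed to --min-seq-id
  coverage: 0.9             # passed to -c
  extra: ""                 # any additional mmseqs arguments
```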
Step 3: Execute workflow
Test your configuration by performing a dry-run via `snakemake --use-conda -n`.
Execute the workflow locally via `snakemake --use-conda --cores $N`, using `$N` cores, or run it in a cluster environment via `snakemake --use-conda --cluster qsub --jobs 100` or `snakemake --use-conda --drmaa --jobs 100`.
If you not only want to fix the software stack but also the underlying OS, use `snakemake --use-conda --use-singularity` in combination with any of the modes above. See the Snakemake documentation for further details.
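As a concrete cluster example, the submission command is usually quoted so that Snakemake can fill in per-rule resources; the exact qsub flags are site-specific and shown here only as an assumption:

```bash
# {threads} is substituted per rule by Snakemake; adapt the qsub options to your site
snakemake --use-conda --jobs 100 \
    --cluster "qsub -cwd -pe smp {threads}"
```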
Step 4: Investigate results
After successful execution, you can create a self-contained interactive HTML report with all results via `snakemake --report report.html`.
This report can, e.g., be forwarded to your collaborators.
Advanced
The following recipe provides established best practices for running and extending this workflow in a reproducible way.
- Fork the repo to a personal or lab account.
- Clone the fork to the desired working directory for the concrete project/run on your machine.
- Create a new branch (the project-branch) within the clone and switch to it. The branch will contain any project-specific modifications (e.g. to configuration, but also to code); see the command sketch after this list.
- Modify the config, and any necessary sheets (and probably the workflow) as needed.
- Commit any changes and push the project-branch to your fork on GitHub.
- Run the analysis.
- Optional: Merge back any valuable and generalizable changes to the upstream repo via a pull request. This would be greatly appreciated.
- Optional: Push results (plots/tables) to the remote branch on your fork.
- Optional: Create a self-contained workflow archive for publication along with the paper (`snakemake --archive`).
- Optional: Delete the local clone/workdir to free space.
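In terms of git commands, the first steps of the recipe boil down to something like the following; `<user>` and `<repo>` are placeholders for your fork:

```bash
# Sketch of the fork/branch recipe above; replace the placeholders
git clone https://github.com/<user>/<repo>.git
cd <repo>
git checkout -b my-project        # the project-branch
# ... edit config.yaml, sample sheets, and rules as needed ...
git commit -am "Configure workflow for my project"
git push origin my-project
```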
Testing
Test cases are in the subfolder `.test`. They are automatically executed via continuous integration with Travis CI.
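Assuming the `.test` folder follows the usual Snakemake workflow layout of a self-contained working directory, the same test can be run locally along these lines:

```bash
# Dry-run first, then execute against the bundled test data
snakemake --use-conda -n --directory .test
snakemake --use-conda --cores 2 --directory .test
```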
Code Snippets
```
shell:
    "ln -s {input} {output}"
```

```
shell:
    "mkdir {output} 2> {log} ; "
    "mmseqs createdb {input} {output}/db >> {log} 2>> {log} "
```

```
shell:
    """
    mkdir -p {params.tmpdir} {output} 2>> {log}

    mmseqs {params.clustermethod} -c {params.coverage} \
        --min-seq-id {params.minid} {params.extra} \
        --threads {threads} {input.db}/db {output.clusterdb}/db {params.tmpdir} >> {log} 2>> {log}

    rm -fr {params.tmpdir} 2>> {log}
    """
```

```
shell:
    """
    mkdir {output.rep_seqs_db} 2>> {log}

    mmseqs result2repseq {input.db}/db {input.clusterdb}/db {output.rep_seqs_db}/db >> {log} 2>> {log}

    mmseqs result2flat {input.db}/db {input.db}/db {output.rep_seqs_db}/db {output.rep_seqs} >> {log} 2>> {log}
    """
```

```
shell:
    """
    mmseqs createtsv {input.db}/db {input.db}/db {input.clusterdb}/db {output.cluster_attribution} > {log} 2>> {log}
    """
```

```
script:
    "../scripts/rename_catalog.py"
```

```
script:
    "../scripts/rename_mapping.py"
```

```
shell:
    """
    mkdir -p {output} {params.tmpdir} 2> {log}

    mmseqs {params.clustermethod} -c {params.coverage} \
        --min-seq-id {params.minid} {params.extra} \
        --threads {threads} {input.db}/db {output.clusterdb}/db {params.tmpdir} >> {log} 2>> {log}
    """
```

```
shell:
    """
    mkdir {output.rep_seqs_db} 2>> {log}

    mmseqs result2repseq {input.db}/db {input.clusterdb}/db {output.rep_seqs_db}/db >> {log} 2>> {log}

    mmseqs result2flat {input.db}/db {input.db}/db {output.rep_seqs_db}/db {output.rep_seqs} >> {log} 2>> {log}
    """
```

```
script:
    "../scripts/rename_catalog.py"
```

```
shell:
    """
    mmseqs createtsv {input.db}/db {input.db}/db {input.clusterdb}/db {output.cluster_attribution} > {log} 2>> {log}
    """
```

```
script:
    "../scripts/rename_mapping.py"
```
scripts/rename_catalog.py:

```python
import sys

import pandas as pd
import utils  # local helper module shipped in the workflow's scripts folder

# Redirect all print/error output into the Snakemake log file
sys.stdout = open(snakemake.log[0], "w")
sys.stderr = open(snakemake.log[0], "a")


def rename_fasta(fasta_in, fasta_out, new_names):
    """Rewrite fasta headers as '>new_name old_name' and return the old names."""
    old_names = []
    n = 0
    with open(fasta_in) as fin, open(fasta_out, "w") as fout:
        for line in fin:
            if line[0] == ">":
                old_name = line[1:].split(maxsplit=1)[0]
                assert (n == 0) or (
                    old_name != old_names[-1]
                ), f"Found duplicate representative name {old_name} in {fasta_in}"
                old_names.append(old_name)
                line = f">{new_names[n]} {old_name}\n"
                n += 1
            fout.write(line)
    return old_names


def parse_mmseqs_log_file(log_file, keyword="Number of clusters:"):
    """Extract an integer value (by default the cluster count) from an mmseqs log."""
    result = None
    with open(log_file) as f:
        for line in f:
            if line.startswith(keyword):
                try:
                    result = int(line.replace(keyword, "").strip())
                except ValueError as e:
                    raise Exception(f"Error parsing line:\n{line}\n") from e
    if result is None:
        raise Exception(f"Didn't find a value for keyword '{keyword}' in logfile {log_file}")
    return result


if __name__ == "__main__":
    Nrepresentatives = parse_mmseqs_log_file(snakemake.input.log)
    print(f"Number of representatives is {Nrepresentatives}")

    gene_names = utils.gen_names_for_range(Nrepresentatives, snakemake.params.prefix)

    # Rename the representative sequences
    original_names = rename_fasta(snakemake.input.faa, snakemake.output.faa, gene_names)
    assert len(gene_names) == len(
        original_names
    ), "Number of representatives should match the number of clusters found in the mmseqs log file"

    # Write the mapping from original ORF names to new gene names
    orf2gene = pd.Series(index=original_names, data=gene_names, name="Gene")
    orf2gene.index.name = "ORF"
    orf2gene.to_csv(snakemake.output.name_mapping, sep="\t", header=True)
    del orf2gene
```
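The script imports a local `utils` module that is not part of this snippet. Judging from the call above, `gen_names_for_range` produces `Nrepresentatives` systematic, zero-padded names sharing `snakemake.params.prefix`; a minimal sketch of that assumed behavior (the real helper may differ):

```python
def gen_names_for_range(N, prefix, start=1):
    # Hypothetical reimplementation: N names such as Gene000001 ... Gene00000N
    n_digits = len(str(N + start - 1))
    return [f"{prefix}{i:0{n_digits}d}" for i in range(start, N + start)]
```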
scripts/rename_mapping.py:

```python
import sys

# Redirect all print/error output into the Snakemake log file
sys.stdout = open(snakemake.log[0], "w")
sys.stderr = open(snakemake.log[0], "a")

import pandas as pd

# squeeze=True returns the single-column table as a Series (pandas < 2.0)
name_mapping = pd.read_csv(snakemake.input.name_mapping, index_col=0, sep="\t", squeeze=True)
assert type(name_mapping) == pd.Series

with pd.HDFStore(snakemake.output[0], complevel=3, mode="w") as store:

    # Read the cluster mapping in chunks
    chunknr = 0
    for orf2gene in pd.read_csv(
        snakemake.input.cluster_mapping,
        usecols=[0, 1],  # the cluster mapping can have a trailing tab, which would otherwise add an empty column
        index_col=1,  # the format is "{cluster}\t{orf}"
        squeeze=True,
        header=None,
        sep="\t",
        dtype={0: "category"},
        chunksize=1e7,
    ):
        # Relabel the categories (representative ORF names) with the new gene ids
        orf2gene.cat.set_categories(name_mapping, inplace=True, rename=True, ordered=True)

        orf2gene.name = snakemake.params.headers[1]
        orf2gene.index.name = snakemake.params.headers[0]

        key = "/".join(snakemake.params.headers)

        # Append each chunk to the HDF store; the header is written only once
        store.append(key, orf2gene, format="table", data_columns=[orf2gene.name], min_itemsize=50)

        chunknr += 1
        print(f"processed chunk {chunknr}")
```
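Downstream code can read the resulting table back with pandas. A small usage sketch, assuming the output file is named `orf2gene.h5` and `snakemake.params.headers` was `["ORF", "Gene"]` (both are assumptions, check your configuration):

```python
import pandas as pd

# "ORF/Gene" mirrors '/'.join(snakemake.params.headers); adjust to your settings
orf2gene = pd.read_hdf("orf2gene.h5", key="ORF/Gene")

# Because the Gene column was declared as a data column, subsets can be
# queried without loading the whole table into memory
subset = pd.read_hdf(
    "orf2gene.h5",
    key="ORF/Gene",
    where="Gene in ['Gene000001', 'Gene000002']",
)
```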