MAGqual is a command-line tool that evaluates the quality of metagenomic bins and generates recommended metadata in line with the MIMAG standards (as outlined in Bowers et al. 2017).
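As a rough guide, the MIMAG-style tiers that MAGqual assigns can be sketched in Python. The thresholds below are taken from the `qual_parse.py` script in the Code Snippets section; the helper function itself is illustrative, not part of MAGqual's API:

```python
def mimag_tier(completeness, contamination, n_trnas=0, rrnas_complete=False):
    """Illustrative MIMAG-style tier assignment.

    completeness / contamination are CheckM percentages. A bin only reaches
    "High Quality" if it also encodes >= 18 unique standard tRNAs and
    complete (non-partial) 5S, 16S and 23S rRNA genes; otherwise a
    >90% complete / <5% contaminated bin is "Near Complete".
    """
    if completeness > 90 and contamination < 5:
        if n_trnas >= 18 and rrnas_complete:
            return "High Quality"
        return "Near Complete"
    if completeness >= 50 and contamination < 10:
        return "Medium Quality"
    if completeness < 50 and contamination < 10:
        return "Low Quality"
    return "Failed"
```

For example, `mimag_tier(95, 2, n_trnas=20, rrnas_complete=True)` returns `"High Quality"`, while the same bin without the RNA evidence is only `"Near Complete"`.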
MAGqual Set-Up
Here is a step-by-step guide on how to install MAGqual from the GitHub repository at https://github.com/ac1513/MAGqual.
Step 1: Clone the MAGqual repository
Open a command-line interface (CLI) or terminal on your computer or computer cluster. Change to the directory where you want to install MAGqual. Then, run the following command to clone the MAGqual repository from GitHub:
git clone https://github.com/ac1513/MAGqual.git
This will create a copy of the MAGqual repository on your computer.
Step 2: Install dependencies
In order to run MAGqual, the following are required:
- Conda (package manager)
- Snakemake v.6.17.1 or higher (workflow manager)
Miniconda can be installed following the instructions for your system in the Miniconda documentation. Once Miniconda is installed, Snakemake can be installed following the instructions in the Snakemake documentation.
MAGqual will handle the other dependencies while the pipeline is running through the use of Conda environments.
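For example, one way to set this up with Conda (the environment name `magqual` is arbitrary; the conda-forge and bioconda channels are the ones the Snakemake documentation recommends):

```shell
# Create a Conda environment containing a suitable Snakemake version
conda create -y -n magqual -c conda-forge -c bioconda "snakemake>=6.17.1"

# Activate it before running MAGqual
conda activate magqual
snakemake --version
```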
Additional Notes:
MAGqual is compatible with Python 3.10.1 or higher (as of April 2023). Pull the latest changes from the MAGqual repository regularly to get updates and bug fixes. Refer to the MAGqual documentation in the GitHub repository for more information on how to use the tool and interpret the results.
Congratulations! You have successfully installed MAGqual on your system.
Running MAGqual
Quick start guide
Once you have created and activated an environment containing Snakemake, you can run MAGqual with the following command:
python MAGqual.py --asm assembly.fa --bins bins_dir/
- --asm corresponds to the location of the assembly used to generate the metagenome bins
- --bins is the location of the directory containing the metagenomic bins to be run through MAGqual (in fasta format)
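A fuller invocation might look like the following (the file names are illustrative; --prefix and --jobs are optional and described below):

```shell
# Run MAGqual on an assembly and its bins, naming the run and using 8 cores/jobs
python MAGqual.py --asm assembly.fa --bins bins_dir/ --prefix mag_run1 --jobs 8
```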
Full help documentation:
```
usage: MAGqual.py [-h] -a ASSEMBLY -b BINDIR [-p PREFIX] [-j JOBS] [--cluster CLUSTER] [--checkmdb CHECKMDB] [--baktadb BAKTADB]

Required: python MAGqual.py -a/--asm assembly.fa -b/--bins bins_dir/

options:
  -h, --help            show this help message and exit
  -a ASSEMBLY, --asm ASSEMBLY
                        location of the assembly used to generate the bins (Required)
  -b BINDIR, --bins BINDIR
                        path to the directory containing the bins to run through MAGqual (Required)
  -p PREFIX, --prefix PREFIX
                        prefix for the MAGqual run, default = MAGqual_YYYYMMDD
  -j JOBS, --jobs JOBS  the number of cores to be used or, if running on an HPC,
                        the number of jobs to be run concurrently, default = 1
  --cluster CLUSTER     OPTIONAL: the type of cluster to run MAGqual on an HPC system
                        (available options: slurm); don't use if running MAGqual locally
  --checkmdb CHECKMDB   OPTIONAL: location of an already-installed database for CheckM
  --baktadb BAKTADB     OPTIONAL: location of an already-installed database for Bakta
                        (note: must be v5.0 or above)
```
Additional Options
- -p / --prefix: specify a prefix for the output files (Default = MAGqual_YYYYMMDD)
- -j / --jobs: if running locally this is the number of cores to be used; if using the --cluster option and running on an HPC queue it corresponds to the number of jobs to be run concurrently (Default = 1)
Optional: CheckM and Bakta databases
MAGqual will handle the installation of both CheckM and Bakta; however, if you have previously used either of these tools you can speed up the process and save file space by specifying the location of pre-downloaded databases.
CheckM database
If you already have the CheckM database downloaded, you can specify its location using the parameter --checkmdb to skip the download; otherwise MAGqual will download the required databases for CheckM for you (this will be the most recent version, dated 2015-01-16).
Bakta database
If no Bakta database is provided, MAGqual will automatically download the lightweight Bakta database (as the full database is large and can take a long time to download). NOTE: MAGqual uses Bakta v1.7.0, which requires a Bakta database version > 5.0.
However, for more accurate MAG annotation we recommend downloading the full database (following the instructions in the Bakta documentation) and specifying its location for MAGqual using the parameter --baktadb. This database can then be used for each subsequent MAGqual run.
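A sketch of how that might look, assuming Bakta's `bakta_db` helper is on your PATH (the output directory is arbitrary, and the exact path to pass to --baktadb may differ depending on where `bakta_db` unpacks the files):

```shell
# Download the full Bakta database (large) into a local directory
bakta_db download --output databases/bakta-full --type full

# Reuse it on subsequent MAGqual runs
python MAGqual.py --asm assembly.fa --bins bins_dir/ --baktadb databases/bakta-full/db
```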
Running on a computing cluster
MAGqual can be integrated into an HPC queuing system using the following option:
- --cluster: run MAGqual with options configured for an HPC computer cluster with queuing architecture (current option: slurm)
The only queueing system currently integrated into MAGqual is slurm; however, if this doesn't work for you, it is possible to run MAGqual on different queuing systems through the Snakemake framework - see Running on different queuing architecture below.
For those familiar with Snakemake
It is possible (and encouraged) to further tweak MAGqual parameters if you are familiar with Snakemake. The config file config/config.yaml can be edited directly for common parameters. The pipeline can then be run from the MAGqual directory using the basic Snakemake command:
snakemake --use-conda -j 1
This command can then be decorated with any of the command-line options Snakemake allows - see the Snakemake documentation for options.
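For instance, a couple of commonly useful Snakemake options (the flags are standard Snakemake; the job counts are arbitrary):

```shell
# Preview the jobs MAGqual would run without executing anything
snakemake --use-conda -j 4 --dry-run

# Resume a run, re-running jobs whose output was left incomplete
snakemake --use-conda -j 4 --rerun-incomplete
```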
Running on different queuing architecture
Snakemake provides an easy way to run this pipeline on a computing cluster. We have provided support for an HPC with a Slurm queuing system; however, this configuration is unlikely to work for everyone.
Rule-specific cluster information is held in config/cluster.json - you can read more about the format of this file in the Snakemake documentation. This file can be edited, as can the command used to submit the pipeline.
NOTE: When using the --cluster slurm option with MAGqual.py, the following is added to the Snakemake command:
--cluster-config config/cluster.json --cluster "sbatch -t {cluster.time} -c {cluster.threads} --mem-per-cpu {cluster.mem}"
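A cluster.json for this sbatch template therefore needs time, threads and mem keys per rule. The fragment below is a sketch only: `__default__` is the standard Snakemake cluster-config fallback entry, while the rule name and resource values are illustrative and should be matched to the rules in the MAGqual Snakefile and your cluster's limits:

```json
{
    "__default__": {
        "time": "01:00:00",
        "threads": 4,
        "mem": "4G"
    },
    "checkm": {
        "time": "08:00:00",
        "threads": 8,
        "mem": "16G"
    }
}
```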
Code Snippets
workflow/scripts/python/qual_parse.py:
```python
import pandas as pd
import glob
import argparse
import os
from shutil import copyfile
import seaborn as sns
import matplotlib.pyplot as plt


def qual_cluster(comp, cont):
    """Assign a MIMAG-style quality tier from CheckM completeness/contamination."""
    if (comp > 90) and (cont < 5):
        qual = "High Quality"
    elif (comp >= 50) and (cont < 10):
        qual = "Medium Quality"
    elif (comp < 50) and (cont < 10):
        qual = "Low Quality"
    else:
        qual = "Failed"
    return qual


def gen_qual(comp, cont):
    """Flag bins passing the completeness - 5 * contamination >= 50 genome test."""
    if (comp - (cont * 5)) >= 50:
        genome = "y"
    else:
        genome = "n"
    return genome


def bsearch(bakta_loc, cluster):
    """Pull assembly statistics and software/DB versions from a Bakta .txt log."""
    length = 0
    loc = str(bakta_loc + str(cluster) + '/' + str(cluster) + '.txt')
    for name in glob.glob(loc):
        bakta_file = name
    with open(bakta_file, 'r') as bakta_log:
        for line in bakta_log:
            if "Length:" in line:
                length = int(line.split(":")[1])
            if "Count:" in line:
                count = int(line.split(":")[1])
            if "N50:" in line:
                N50 = int(line.split(":")[1])
            if "Software:" in line:
                software = line.split(":")[1].strip()
            if "Database" in line:
                db = line.split(":")[1].strip()
    bakta_v = "Bakta " + software + " DB " + db
    row_dict = {"length": length, "contigs": count, "N50": N50, "bakta_v": bakta_v}
    return pd.Series(row_dict)


parser = argparse.ArgumentParser(description='')
parser.add_argument('output', help='output directory for the organised bins', type=str)
parser.add_argument('checkm_log', help='checkm output log file (TAB SEPARATED)', type=str)
parser.add_argument('bakta_loc', help='directory containing all bakta output for all clusters', type=str)
parser.add_argument('seqkit_log', help='file containing the seqkit output for all clusters', type=str)
parser.add_argument('bin_loc', help='directory containing fasta files for all clusters', type=str)
parser.add_argument('jobid', help='prefix for current jobs', type=str)
args = parser.parse_args()

output = args.output
checkm_log = args.checkm_log
bakta_loc = args.bakta_loc
seqkit_log = args.seqkit_log
bin_loc = args.bin_loc
job_id = args.jobid

comp_software = "CheckM v.1.0.13"
comp_approach = "Marker gene"
colour_dict = dict({'High Quality': '#50C5B7', 'Near Complete': '#9CEC5B',
                    'Medium Quality': '#F0F465', 'Low Quality': "#F8333C",
                    'Failed': '#646E78'})

# =============================================================================
# CHECKM PARSE
# =============================================================================
checkm_df = pd.read_csv(checkm_log, sep="\t")
checkm_df['Quality'] = checkm_df.apply(lambda x: qual_cluster(x['Completeness'], x['Contamination']), axis=1)
checkm_df[['Size_bp', 'No_contigs', 'N50_length', '16S_Software']] = checkm_df.apply(
    lambda x: bsearch(bakta_loc, x["Bin Id"]), axis=1)
checkm_df = checkm_df.set_index("Bin Id")

high_clusters = checkm_df[checkm_df['Quality'].str.contains("High Quality")].index.values.tolist()
med_qual_clusters = checkm_df[checkm_df['Quality'].str.contains("Medium Quality")].index.values.tolist()
low_qual_clusters = checkm_df[checkm_df['Quality'].str.contains("Low Quality")].index.values.tolist()
NA = checkm_df[checkm_df['Quality'].str.contains("Failed")].index.values.tolist()
all_clusters = high_clusters + med_qual_clusters + low_qual_clusters + NA

checkm_df = checkm_df.drop(checkm_df.columns[[1, 2, 3, 4, 5, 6, 7, 8, 9]], axis=1)

# =============================================================================
# SEQKIT PARSE
# =============================================================================
seqkit_df = pd.read_csv(seqkit_log, sep="\t")
seqkit_df = seqkit_df[["file", "max_len"]]
seqkit_df["file"] = seqkit_df["file"].str.replace('.fasta', '', regex=True)
seqkit_df["file"] = seqkit_df["file"].str.split("/").str[-1]
seqkit_df.set_index("file", inplace=True)
seqkit_df.rename(columns={"max_len": "Max_contig_length"}, inplace=True)
checkm_df = pd.merge(checkm_df, seqkit_df, left_index=True, right_index=True, how="left")

# =============================================================================
# BAKTA PARSE
# =============================================================================
high_qual_clusters = []
near_comp_clusters = []
r16s_comp_clusters = []
trna_num = {}
for cluster in all_clusters:
    loc = str(bakta_loc + cluster + '/' + cluster + '.tsv')
    for name in glob.glob(loc):
        bakta_file = name
    with open(bakta_file, 'r') as bakta_in:
        trna_set = set()
        rna_set = set()
        rrna_16S = "N"
        for line in bakta_in:
            # Track unique standard amino-acid tRNAs (>= 18 of 20 needed for High Quality)
            for aa in ("Ala", "Arg", "Asn", "Asp", "Cys", "Gln", "Glu", "Gly",
                       "His", "Ile", "Leu", "Lys", "Met", "Phe", "Pro", "Ser",
                       "Thr", "Trp", "Tyr", "Val"):
                if "tRNA-" + aa in line:
                    trna_set.add("tRNA-" + aa)
            # Only full-length (non-partial) rRNA genes count
            if ("5S ribosomal RNA" in line) and ("partial" not in line):
                rna_set.add('5S')
            if ("16S ribosomal RNA" in line) and ("partial" not in line):
                rna_set.add('16S')
                rrna_16S = "Y"
            if ("23S ribosomal RNA" in line) and ("partial" not in line):
                rna_set.add('23s')
    if rrna_16S == "Y":
        r16s_comp_clusters.append(cluster)
    if cluster in high_clusters:
        if (len(trna_set) >= 18) and (len(rna_set) == 3):
            high_qual_clusters.append(cluster)
        else:
            near_comp_clusters.append(cluster)  # high qual that fail the tRNA/rRNA criteria
    trna_num.update({cluster: len(trna_set)})

# =============================================================================
# Add these new qualities to the checkm dataframe
# =============================================================================
checkm_df.loc[checkm_df.index.isin(near_comp_clusters), "Quality"] = "Near Complete"
checkm_df.loc[checkm_df.index.isin(r16s_comp_clusters), "16S_Recovered"] = "Y"
checkm_df["16S_Recovered"] = checkm_df["16S_Recovered"].fillna("N")

# Sort out legend order: keep only tiers that actually occur
# (a comprehension avoids mutating the list while iterating over it)
label_order = ["High Quality", "Near Complete", "Medium Quality", "Low Quality", "Failed"]
labels_all = checkm_df["Quality"].unique().tolist()
labels_list = [item for item in label_order if item in labels_all]

# =============================================================================
# Basic plot
# =============================================================================
checkm_df["Purity"] = 100 - checkm_df["Contamination"]
plt.figure(figsize=(15, 10))
ax = sns.scatterplot(data=checkm_df, x="Completeness", y="Purity", hue='Quality',
                     size='Size_bp', sizes=(20, 800), alpha=0.6,
                     palette=colour_dict, hue_order=labels_list)
sns.move_legend(ax, "upper left", bbox_to_anchor=(1, 1))
plt.xlabel('Completeness (%)', size=24)
plt.ylabel('Purity (%)', size=24)
plt.xlim(0)
plt.legend(prop={'size': 20})
plt.tight_layout()
plt.tick_params(labelsize=18)
plt.savefig(job_id + '_mag_qual.png')

# =============================================================================
# COPYING FILES INTO QUAL DIRECTORIES
# =============================================================================
location = bin_loc
new_loc = output + "/" + job_id + "/"
os.makedirs(new_loc + "high_qual", exist_ok=True)
os.makedirs(new_loc + "near_comp", exist_ok=True)
os.makedirs(new_loc + "med_qual", exist_ok=True)
os.makedirs(new_loc + "low_qual", exist_ok=True)
os.makedirs(new_loc + "failed", exist_ok=True)
for high in high_qual_clusters:
    file = location + high + ".fasta"
    copyfile(file, new_loc + "high_qual/" + high + ".fasta")
for nc in near_comp_clusters:
    file = location + nc + ".fasta"
    copyfile(file, new_loc + "near_comp/" + nc + ".fasta")
for med in med_qual_clusters:
    file = location + med + ".fasta"
    copyfile(file, new_loc + "med_qual/" + med + ".fasta")
for low in low_qual_clusters:
    file = location + low + ".fasta"
    copyfile(file, new_loc + "low_qual/" + low + ".fasta")
for NA_bin in NA:
    file = location + NA_bin + ".fasta"
    copyfile(file, new_loc + "failed/" + NA_bin + ".fasta")

# =============================================================================
# OUTPUT CREATED HERE
# =============================================================================
magqual_df = checkm_df[["Quality", "Completeness", "Contamination", "16S_Recovered",
                        "16S_Software", "Size_bp", "No_contigs", "N50_length",
                        "Max_contig_length"]].copy()
magqual_df["tRNA_Extracted"] = pd.Series(trna_num)
magqual_df["tRNA_Software"] = magqual_df["16S_Software"]
magqual_df["Completeness_Approach"] = comp_approach
magqual_df["Completeness_Software"] = comp_software
magqual_df = magqual_df.reindex(columns=["Quality", "Completeness", "Contamination",
                                         "Completeness_Software", "Completeness_Approach",
                                         "16S_Recovered", "16S_Software", "tRNA_Extracted",
                                         "tRNA_Software", "Size_bp", "No_contigs",
                                         "N50_length", "Max_contig_length"])
magqual_df.to_csv("analysis/" + job_id + "_mag_qual_statistics.csv")

print("-" * 12)
print(" NUMBER MAGs")
print("-" * 12)
print("High Quality:", len(high_qual_clusters))
print("Near Complete:", len(near_comp_clusters))
print("Med Quality:", len(med_qual_clusters))
print("Low Quality:", len(low_qual_clusters))
print("Failed:", len(NA), "\n")
print("-" * 12)
print(" MAG IDs")
print("-" * 12)
print("High Quality: ", ", ".join([str(x) for x in high_qual_clusters]))
print("Near Complete: ", ", ".join([str(x) for x in near_comp_clusters]))
print("Med Quality: ", ", ".join([str(x) for x in med_qual_clusters]))
print("Low Quality: ", ", ".join([str(x) for x in low_qual_clusters]))
print("Failed: ", ", ".join([str(x) for x in NA]))
```
Bakta database set-up:
```
shell:
    """
    if [ ! -z {params.database} ] && [ -d {params.database} ]; then
        cd databases/
        rm -r bakta/
        cd ..
        ln -sTf {params.database} databases/bakta
    else
        wget https://zenodo.org/record/7669534/files/db-light.tar.gz
        mkdir -p databases/bakta/
        tar -xzf db-light.tar.gz -C databases/bakta/ --strip-components 1
        rm db-light.tar.gz
        amrfinder_update --force_update --database databases/bakta/amrfinderplus-db/
    fi
    """
```
Bakta annotation:
```
shell:
    """
    bakta --db={params.database} --output {params.dir} --prefix {params.prefix} --threads {resources.threads} {input.clusters}
    """
```
CheckM database set-up:
```
shell:
    """
    if [ ! -z {params.database} ] && [ -d {params.database} ]; then
        cd databases
        rm -r checkm/
        cd ..
        ln -sTf {params.database} databases/checkm
    else
        wget https://zenodo.org/record/7401545/files/checkm_data_2015_01_16.tar.gz
        mkdir -p databases/checkm/
        tar -xzf checkm_data_2015_01_16.tar.gz -C databases/checkm/
        rm checkm_data_2015_01_16.tar.gz
    fi
    """
```
CheckM lineage workflow:
```
shell:
    """
    checkm_db={params.database}
    echo ${{checkm_db}} | checkm data setRoot ${{checkm_db}}
    checkm lineage_wf -f {output.log} --tab_table -x fasta -t {resources.threads} {params.input} {params.out}
    """
```
seqkit statistics:
```
shell:
    """
    seqkit stats -aT {input.clusters} > {output.tsv}
    """
```
Quality parsing and output:
```
shell:
    """
    python workflow/scripts/python/qual_parse.py {params.out_loc} {input.checkm} {params.bakta} {input.seqkit} {params.bin_loc} {params.jobid} > {output.txt}
    """
```