A microbial genomic analysis pipeline used by the Djordjevic Lab
About
This repo contains a genomic analysis pipeline used primarily for analysing E. coli genomes, though it also sees use with Salmonella genomes and some other species. Note that using it on other species may cause certain tools to break, or may otherwise limit the usefulness of the output.
Disclaimer
While I have made every effort to ensure these scripts produce reliable and accurate output, it is important that you spot-check your outputs to ensure they are correct :)
The documentation is still a work in progress, but if you need any help don't hesitate to post an issue :)
Modules/Workflows
This pipeline has three main modules:
- Quality control
- Genotyping
- Phylogenetics
All individual rules/analyses can be turned on or off in the configuration file.
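As a rough sketch, the on/off switches live in the YAML config; the key names below are illustrative only - see the template in the config directory for the real options:

# Hypothetical example of analysis toggles (names are illustrative)
run_abricate: True
run_pointfinder: True
run_gunc: False
run_spifinder: False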
1. Quality control
Reads are first filtered with fastp before being assembled with shovill. Assembly stats are analysed using assembly-stats. Species ID is determined using Kraken2 and Bracken, and a contamination and completeness screen is carried out using CheckM. GUNC is also run, though I usually disable it as the run time can be a bit long and I haven't yet found the output especially helpful for my purposes, though this might change.
This information is then consolidated into a tab-delimited QC report.
2. Genotyping
The word 'genotyping' is used a bit loosely here; this step performs a variety of analyses using:
- Abricate to screen for virulence, AMR, IS elements and other custom gene databases. New databases can be added and registered in the configuration file for use.
- Antibiotic resistance gene detection is performed using abritamr
- Point mutations associated with AMR are detected using CGE's PointFinder (there is some redundancy here with abritamr, which now has this functionality)
- MLST is performed using mlst
- Plasmid MLST is performed using pMLST
- Fimbrial adhesins are typed using FimTyper
- Phylogroup detection is performed with ClermonTyping and EZClermont
- Serotyping is performed with ECtyper
- Salmonella pathogenicity islands are detected using SPIfinder
- Genome annotation is performed with Prokka
- cgMLST is performed with cgMLSTfinder - this will be changed to chewBBACA
- Plasmid screening is performed using a slightly older version of abricate (screening of large reference genes seemed to break after 0.9.8).
Most of these rules output a tab-delimited summary file for each analysis; some also use downstream R scripts, but not many.
3. Phylogenetics
This workflow runs a pangenome-derived phylogenetic analysis utilising:
- Prokka
- Panaroo
- SNP-sites
- IQ-TREE
Pairwise SNP distances are also determined using SNP-dists.
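For orientation, the final steps of this module do something roughly equivalent to the following (file names are illustrative; the workflow wires these steps together via Snakemake):

# Extract variant (SNP) sites from the core gene alignment
snp-sites -c core_gene_alignment.aln > core_snps.aln
# Build a maximum-likelihood tree with model selection and ultrafast bootstrapping
iqtree -s core_snps.aln -m MFP -bb 1000
# Compute a pairwise SNP distance matrix
snp-dists core_gene_alignment.aln > snp_distances.tsv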
Installation and setup
There are many dependencies to install across the different analytical tools, so the first time you run the pipeline it can take some time to install everything via conda. Because of this I have added a script, setup_environments.sh, to install the environments for the three modules:
# Clone the git repository and enter it
git clone https://github.com/maxlcummins/pipelord.git
cd pipelord
# Setup and installation - this may take some time
sh setup_environments.sh
Configuration
Configuration of the Snakemake workflow is done via the configuration file in the config directory. It is a good idea to make a new one by copying the template and editing it accordingly, then specifying it in the snakemake command (e.g. --configfile config/new_template.yaml).
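For example (the new filename is arbitrary):

# Copy the template and edit the copy to point at your reads, databases and desired analyses
cp config/config_template.yaml config/my_project.yaml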
Usage
Note: change the --configfile path in the commands below if you created a new config file.
QC Workflow
#Run job in the background
nohup snakemake --resources mem_mb={max_memory_per_rule} -j 20 -p --use-conda --configfile config/config_template.yaml -s workflows/QC_workflow.smk > QC.log 2>&1 &
#Save the job ID so you can cancel it later if need be
PROCESS_ID=$!
echo "JOB_ID =" "$PROCESS_ID" > QC_jobID.txt
Genotyping
#Run job in the background
nohup snakemake --resources mem_mb={max_memory_per_rule} -j 20 -p --use-conda --configfile config/config_template.yaml -s workflows/Snakefile > genotype.log 2>&1 &
#Save the job ID so you can cancel it later if need be
PROCESS_ID=$!
echo "JOB_ID =" "$PROCESS_ID" > genotype_jobID.txt
Pangenomic Phylogenetics
#Run job in the background
nohup snakemake --resources mem_mb={max_memory_per_rule} -j 20 -p --use-conda --configfile config/config_template.yaml -s workflows/Treebuild_panaroo.smk > treebuild.log 2>&1 &
#Save the job ID so you can cancel it later if need be
PROCESS_ID=$!
echo "JOB_ID =" "$PROCESS_ID" > treebuild_jobID.txt
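If you need to stop a run later, the saved process ID can be fed to kill, for example:

# Cancel the treebuild job using the process ID saved above (third field of the file)
kill "$(awk '{print $3}' treebuild_jobID.txt)"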
Database update
To update your databases you can use a command like the following. Make sure the database directory path is correct and change the database name as appropriate:
abricate-get_db --dbdir resources/dbs/abricate --db vfdb --force
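To check which databases are currently installed in the pipeline's abricate database directory, something like the following should work:

abricate --list --datadir resources/dbs/abricate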
Code Snippets
import pandas as pd
import os.path
import re

# Directing python to the input from snakemake
abricate_summaries = snakemake.input["abricate_summary"]

#Create an empty list
dataframes = []

for file in abricate_summaries:
    df = pd.read_csv(file, sep="\t", index_col=None)
    dataframes.append(df)

#Concatenate our dfs into a single one
combined = pd.concat(dataframes)

#Change the filename column to be name
combined.rename(columns={'#FILE':'name'}, inplace=True)

#Write our output file to text
combined.to_csv(snakemake.output[0], sep="\t", index=False)

import pandas as pd
import os.path
import re

# Directing python to the input from snakemake
amrfinder = snakemake.input["amrfinder"]

#Create an empty list
dataframes = []

for file in amrfinder:
    df = pd.read_csv(file, sep="\t", index_col=None)
    df["name"] = file
    df["name"] = df["name"].str.replace("/amrfinder.out", "", regex=False)
    df["name"] = df["name"].str.replace(".*\/", "", regex=True)
    dataframes.append(df)

#Concatenate our dfs into a single one
combined = pd.concat(dataframes)

#Move the name column to the start
combined = combined[['name'] + [col for col in combined.columns if col != 'name']]

#Write our output file to text
combined.to_csv(snakemake.output["abritamr_resistance"], sep="\t", index=False)

import pandas as pd
import os.path
import re

#### Process our resistance data ####

# Directing python to the input from snakemake
abritamr_resistance = snakemake.input["abritamr_resistance"]

#Create an empty list
dataframes = []

for file in abritamr_resistance:
    df = pd.read_csv(file, sep="\t", index_col=None)
    dataframes.append(df)

#Concatenate our dfs into a single one
combined = pd.concat(dataframes)

#Change the filename column to be name
combined.rename(columns={'Isolate':'name'}, inplace=True)

#Trim characters before the samplename remaining from the path
combined['name'].replace({r".*\/" : ""}, inplace = True, regex = True)

#Write our output file to text
combined.to_csv(snakemake.output["abritamr_resistance"], sep="\t", index=False)

#### Process our virulence data ####

# Directing python to the input from snakemake
abritamr_virulence = snakemake.input["abritamr_virulence"]

#Create an empty list
dataframes = []

for file in abritamr_virulence:
    df = pd.read_csv(file, sep="\t", index_col=None)
    dataframes.append(df)

#Concatenate our dfs into a single one
combined = pd.concat(dataframes)

#Change the filename column to be name
combined.rename(columns={'Isolate':'name'}, inplace=True)

#Trim characters before the samplename remaining from the path
combined['name'].replace({r".*\/" : ""}, inplace = True, regex = True)

#Write our output file to text
combined.to_csv(snakemake.output["abritamr_virulence"], sep="\t", index=False)

#### Process our partial data ####

# Directing python to the input from snakemake
abritamr_partials = snakemake.input["abritamr_partials"]

#Create an empty list
dataframes = []

for file in abritamr_partials:
    df = pd.read_csv(file, sep="\t", index_col=None)
    dataframes.append(df)

#Concatenate our dfs into a single one
combined = pd.concat(dataframes)

#Change the filename column to be name
combined.rename(columns={'Isolate':'name'}, inplace=True)

#Trim characters before the samplename remaining from the path
combined['name'].replace({r".*\/" : ""}, inplace = True, regex = True)

#Write our output file to text
combined.to_csv(snakemake.output["abritamr_partials"], sep="\t", index=False)

import pandas as pd
import os.path

# Directing python to the input from snakemake
assembly_stats_summaries = snakemake.input["assembly_stats_summary"]

#Create an empty list
dataframes = []

#For each of our mlst files
for file in assembly_stats_summaries:
    #Read it in as df
    df = pd.read_csv(file, sep="\t", index_col=None)
    #Add the df to our list of dfs
    dataframes.append(df)

#Concatenate our dfs into a single one
combined = pd.concat(dataframes)

#Change the filename column to be name
combined.rename(columns={'filename':'name'}, inplace=True)

#Remove the path related info from our name column
combined["name"] = combined["name"].str.replace(".*/", "", regex=True)
combined["name"] = combined["name"].str.replace(".fasta", "", regex=False)

#Write our output file to text
combined.to_csv(snakemake.output[0], sep="\t", index=False)

import pandas as pd
import os.path

# Directing python to the input from snakemake
bracken_reports = snakemake.input["bracken_reports"]

dataframes = []

for file in bracken_reports:
    df = pd.read_csv(file, sep="\t")
    df["source"] = os.path.basename(file)
    df["source"] = df["source"].str.replace(".bracken.txt", "", regex=False)
    dataframes.append(df)

combined = pd.concat(dataframes)

combined = combined[
    [
        "source",
        "name",
        "taxonomy_id",
        "taxonomy_lvl",
        "kraken_assigned_reads",
        "added_reads",
        "new_est_reads",
        "fraction_total_reads",
    ]
]

combined.columns = [
    "name",
    "scientific_name",
    "taxonomy_id",
    "taxonomy_lvl",
    "kraken_assigned_reads",
    "added_reads",
    "new_est_reads",
    "fraction_total_reads",
]

combined.to_csv(snakemake.output[0], sep="\t", index=False)

import pandas as pd
import os.path
import re

# Directing python to the input from snakemake
clermontyping_summaries = snakemake.input['clermontyping']

print(clermontyping_summaries)

#Create an empty list
dataframes = []

#Initialise a for loop for processing our pMLST outputs
for file in clermontyping_summaries:
    #Read it in as df
    df = pd.read_csv(file, sep="\t", header=None)
    #Assign a column name to our input which isnt actually a CSV and therefore has a heading of '0'
    df.columns = ['name', 'genes_detected', 'quadraplex_resuts', 'alleles_for_C_or_G', 'clermontyping_phylogroup', 'mashfile']
    #Append our dataframes
    dataframes.append(df)

#Concatenate our dfs into a single one
combined = pd.concat(dataframes)

#Replace unwanted characters in our dataframe
combined['name'] = combined['name'].replace('\.fasta', '', regex=True)

#Write our simple file to text
combined.to_csv(snakemake.output[0], sep="\t", index=False)

import pandas as pd
import os.path

# Directing python to the input from snakemake
ectyper_summaries = snakemake.input["ectyper_summary"]

#Create an empty list
dataframes = []

#For each of our ectyper files
for file in ectyper_summaries:
    print(file)
    #Read it in as df
    df = pd.read_csv(file, sep="\t", header=0, index_col=None)
    #Add the df to our list of dfs
    dataframes.append(df)

#Concatenate our dfs into a single one
combined = pd.concat(dataframes)

#Write our output file to text
combined.to_csv(snakemake.output[0], sep="\t", index=False)

import pandas as pd
import os.path
import re

# Directing python to the input from snakemake
ezclermont_summaries = snakemake.input['ezclermont_summary']

print(ezclermont_summaries)

#Create an empty list
dataframes = []

#Initialise a for loop for processing our pMLST outputs
for file in ezclermont_summaries:
    #Read it in as df
    df = pd.read_csv(file, sep="\t", header=None, on_bad_lines='skip')
    #Assign a column name to our input which isnt actually a CSV and therefore has a heading of '0'
    df.columns = ['datacol']
    name = df[df.datacol.str.contains('Reading in sequence')]
    #Trim our filenames to use as a name column
    name.datacol = name.datacol.str.replace('Reading in sequence(s) from', '', regex=False)
    name.datacol = name.datacol.str.replace('.fasta', '', regex=False)
    #Create an empty dataframe
    df2 = pd.DataFrame()
    #Create our name column
    df2['name'] = name
    #Save the result for trpBA
    df2['trpBA'] = df[df.datacol.str.contains('trpBA_control')].iloc[0,0]
    print(df2['trpBA'])
    #Save the result for virA
    df2['virA'] = df[df.datacol.str.contains('virA:')].iloc[0,0]
    #Save the result for TspE4
    df2['TspE4'] = df[df.datacol.str.contains('TspE4:')].iloc[0,0]
    #Save the result for arpA:
    df2['arpA'] = df[df.datacol.str.contains('arpA:')].iloc[0,0]
    #Save the result for chu:
    df2['chu'] = df[df.datacol.str.contains('chu:')].iloc[0,0]
    #Save the result for yjaA:
    df2['yjaA'] = df[df.datacol.str.contains('yjaA:')].iloc[0,0]
    #Save the result for arpA:
    df2['arpA'] = df[df.datacol.str.contains('arpA:')].iloc[0,0]
    #Save the result for the Clermont_type:
    df2['Clermont_type'] = df[df.datacol.str.contains('Clermont type:')].iloc[0,0]
    #Append our dataframes
    dataframes.append(df2)

#Concatenate our dfs into a single one
combined = pd.concat(dataframes)

#Select our columns we wish to change
colnames = combined.select_dtypes(['object']).columns

#Replace unwanted characters in our dataframe
combined[colnames] = combined[colnames].replace('.*: ', '', regex=True)

#Replace dashes and pluses with 1s and 0s
combined[colnames] = combined[colnames].replace('+', 1, regex=False)
combined[colnames] = combined[colnames].replace('-', 0, regex=False)

#Write our simple file to text
combined.to_csv(snakemake.output[0], sep="\t", index=False)

import pandas as pd
import os.path

# Directing python to the input from snakemake
fimtyper_summaries = snakemake.input["fimtyper_summary"]

#Read in out dataframe and add a name column
dataframes = []

for file in fimtyper_summaries:
    df = pd.read_csv(file, sep="\t")
    if 'Contig' in df.columns:
        ## If detect No FimH type found skip sample
        ## If detect "Please contact curator, if new fimH-type" skip line
        df["name"] = os.path.basename(file)
        df["name"] = df["name"].str.replace("_named.txt", "", regex=False)
        dataframes.append(df)
    else:
        print("No fimH type found in "+str(os.path.basename(file)))

#Combine our fimtyper files
combined = pd.concat(dataframes)

#Name our columns
combined = combined[
    [
        "name",
        "Fimtype",
        "Identity",
        "Query/HSP",
        "Contig",
        "Position in contig",
        "Accession no."
    ]
]

#Reorder our columns (probably redundant now)
combined = combined[['name'] + [col for col in combined.columns if col != 'name']]

#Remove accession number column (usually empty)
combined = combined.drop('Accession no.', axis=1)

#Add a new column for query length and HSP (High-scoring segment pair i.e. match) length
combined["Query length"] = combined["Query/HSP"].str.replace("/.*", "", regex=True)
combined["HSP length"] = combined["Query/HSP"].str.replace(".*/", "", regex=True)

#Add an asterisk to novel alleles which have a non 100% identity or a length other than 489
combined.loc[combined['Identity'] != 100.0, 'Fimtype'] = combined.loc[combined['Identity'] != 100.0, 'Fimtype']+"*"
combined.loc[combined['Query length'] != combined['HSP length'], 'Fimtype'] = combined.loc[combined['Query length'] != combined['HSP length'], 'Fimtype']+"*"

#Remove instances of double asterisks
combined["Fimtype"] = combined["Fimtype"].str.replace("**", "*", regex=False)

#Define pattern of lines we want to remove (Messages from devs to report novel alleles)
false_line_pattern = 'Please'

#Identify rows which are lines we want to remove
pattern_filter = combined['Fimtype'].str.contains(false_line_pattern)

#Remove unwanted rows
combined = combined[~pattern_filter]

#Write to file
combined.to_csv(snakemake.output[0], sep="\t", index=False)

import pandas as pd
import os.path
import re

# Directing python to the input from snakemake
pmlst_summaries = snakemake.input['pMLST_summary']

#Create an empty list
dataframes = []

#Initialise a for loop for processing our pMLST outputs
for file in pmlst_summaries:
    #Read it in as df
    df = pd.read_csv(file, sep="\t", header=None, on_bad_lines='skip')
    #Trim our filenames to use as a name column
    name = re.sub(".out/results.txt","", file)
    name = re.sub(".*/", "", name)
    #Assign a column name to our input which isnt actually a CSV and therefore has a heading of '0'
    df.columns = ['datacol']
    #Identify pMLST data pertaining to IncF plasmids
    IncF_results = df['datacol'].str.contains('IncF RST').any()
    #Only process IncF data.
    #Wasteful loop iterations but snakemake was being painful so I decided on feeding all pMLST input rather than just IncF
    if IncF_results:
        #Pull out the columns with F replicons
        df = df[df.datacol.str.contains('^FI')]
        #Convert multiple spaces to commas to properly delimit our file
        df['datacol'] = df['datacol'].str.replace(' +', ',', regex=True)
        #Split our string by comma delimiter into different columns
        df = df['datacol'].str.split(",", expand=True)
        #Assign our name column
        df['name'] = name
        #Append our dataframes
        dataframes.append(df)

#Concatenate our dfs into a single one
combined = pd.concat(dataframes)

#Assign column names
combined.columns = [
    'Locus',
    'Identity',
    'Coverage',
    'Alignment Length',
    'Allele Length',
    'Gaps',
    'Allele',
    'name'
]

#Change column order
combined = combined[[
    'name',
    'Locus',
    'Identity',
    'Coverage',
    'Alignment Length',
    'Allele Length',
    'Gaps',
    'Allele'
]]

#Simplify when no allele is found with a dash
combined['Allele'] = combined['Allele'].str.replace('No hit found', '-', regex=False)

#Create a column for a simplified allele call
combined['Simple Allele'] = combined['Allele']
combined['Simple Allele'] = combined['Simple Allele'].str.replace('FIIS', 'S', regex=True)
combined['Simple Allele'] = combined['Simple Allele'].str.replace('FIIY', 'Y', regex=True)
combined['Simple Allele'] = combined['Simple Allele'].str.replace('FIIK', 'K', regex=True)
combined['Simple Allele'] = combined['Simple Allele'].str.replace('FI', '', regex=True)
combined['Simple Allele'] = combined['Simple Allele'].str.replace('I', 'F', regex=True)
combined['Simple Allele'] = combined['Simple Allele'].str.replace('_', '', regex=True)

#Separate our alleles to combine them in the correct order for IncF RST
#Sort by allele name. If they are instead sorted in order of discovery (via contig number) then we will see inconsistencies between samples
FICs = combined[combined['Locus'] == 'FIC'].sort_values(by='Simple Allele', ascending=False)
FIIs = combined[combined['Locus'] == 'FII'].sort_values(by='Simple Allele', ascending=False)
FIKs = combined[combined['Locus'] == 'FIIK'].sort_values(by='Simple Allele', ascending=False)
FISs = combined[combined['Locus'] == 'FIIS'].sort_values(by='Simple Allele', ascending=False)
FIYs = combined[combined['Locus'] == 'FIIY'].sort_values(by='Simple Allele', ascending=False)
FIA_type_df = combined[combined['Locus'] == 'FIA'].sort_values(by='Simple Allele', ascending=False)
FIB_type_df = combined[combined['Locus'] == 'FIB'].sort_values(by='Simple Allele', ascending=False)

#Combine our dataframes again
F_type_df = pd.concat([FIIs, FICs, FIKs, FISs, FIYs])
F_type_df = F_type_df[F_type_df['Simple Allele'] != '-']

#Group by name and collapse our FIIs, FICs, FIKs, FISs and FIYs with a slash separator
F_type = F_type_df.groupby('name').apply(lambda x: '/'.join(x['Simple Allele'])).reset_index()
#Assign our column names
F_type.columns = ['name', 'F type']

#Group by name and collapse our FIAs with a slash separator
FIA_type = FIA_type_df.groupby('name').apply(lambda x: '/'.join(x['Simple Allele'])).reset_index()
#Assign our column names
FIA_type.columns = ['name', 'FIA type']

#Group by name and collapse our FIBs with a slash separator
FIB_type = FIB_type_df.groupby('name').apply(lambda x: '/'.join(x['Simple Allele'])).reset_index()
#Assign our column names
FIB_type.columns = ['name', 'FIB type']

#Join our columns
IncF_RSTs = pd.merge(F_type, FIA_type, on='name', how='outer')
IncF_RSTs = pd.merge(IncF_RSTs, FIB_type, on='name', how='outer')

#Replace NaNs (for missing alleles) with a dash
IncF_RSTs = IncF_RSTs.fillna('-')

#Replace the cell contents for F types to make them better reflect null values
IncF_RSTs['F type'] = IncF_RSTs['F type'].str.replace('-', 'F-', regex=True)
IncF_RSTs['FIA type'] = IncF_RSTs['FIA type'].str.replace('-', 'A-', regex=True)
IncF_RSTs['FIB type'] = IncF_RSTs['FIB type'].str.replace('-', 'B-', regex=True)

#Generate an IncF RST column by combining the alleles
IncF_RSTs['IncF RST'] = IncF_RSTs['F type']+':'+IncF_RSTs['FIA type']+':'+IncF_RSTs['FIB type']

#Make a simple IncF RST dataframe to bind to our full IncF RST dataframe
IncF_RSTs_simple = IncF_RSTs[['name', 'IncF RST']]

#Join our columns
combined = pd.merge(combined, IncF_RSTs_simple, on='name', how='outer')

#Write our simple file to text
IncF_RSTs.to_csv(snakemake.output[0], sep="\t", index=False)

#Write our detailed file to text
combined.to_csv(snakemake.output[1], sep="\t", index=False)

import pandas as pd
import os.path

# Directing python to the input from snakemake
mlst_summaries = snakemake.input["mlst_summary"]

#Create an empty list
dataframes = []

#For each of our mlst files
for file in mlst_summaries:
    #Read it in as df
    df = pd.read_csv(file, sep="\t", header=None, index_col=None)
    #Add the df to our list of dfs
    dataframes.append(df)

#Concatenate our dfs into a single one
combined = pd.concat(dataframes)

#Extract the number of columns
numCols = combined.shape[1]

#Define the first three column names
mlst_colnames = ['name', 'scheme', 'ST']

#Create a list of column names for our alleles
allele_colnames = ["allele"+str(x) for x in range(1, numCols-2)]

#Combine our lists of column names
mlst_colnames = mlst_colnames + allele_colnames

#Assign our list of column names
combined.columns = mlst_colnames

#Remove the path related info from our name column
combined["name"] = combined["name"].str.replace(".*/", "", regex=True)
combined["name"] = combined["name"].str.replace(".fasta", "", regex=False)

#Fill empty columns with 0
combined = combined.fillna(0)

#Write our output file to text
combined.to_csv(snakemake.output[0], sep="\t", index=False)

import pandas as pd
import os.path
import re

# Directing python to the input from snakemake
mobsuite = snakemake.input["mobsuite"]

# Create an empty list
dataframes = []

# Initialise a for loop for processing our mob_suite outputs
for file in mobsuite:
    # Read it in as df
    df = pd.read_csv(file, sep="\t", header=0, on_bad_lines="skip")
    # Append our dataframes
    dataframes.append(df)

# Concatenate our dfs into a single one
combined = pd.concat(dataframes)

# Write our output file to text
combined.to_csv(snakemake.output[0], sep="\t", index=False)

import pandas as pd
import os.path
import re

# Directing python to the input from snakemake
pmlst_summaries = snakemake.input["pMLST_summary"]

#Create an empty list
dataframes = []

#Initialise a for loop for processing our pMLST outputs
for file in pmlst_summaries:
    #Read it in as df
    df = pd.read_csv(file, sep="\t", header=None, on_bad_lines='skip')
    #Trim our filenames to use as a name column
    name = re.sub(".out/results.txt","", file)
    name = re.sub(".*/", "", name)
    #Assign our name column
    df['name'] = name
    #Reassign name columns. The input isnt actually a CSV so we label the original input as data and redundantly reassign our name column for the names
    df.columns = ['datacol', 'name']
    #Isolate pMLST scheme info
    df_profile = df[df.datacol.str.contains('pMLST profile')]
    #Assign colnames to our new df with pMLST scheme info
    df_profile.columns = ['pMLST_scheme', 'name']
    #Isolate pMLST info
    df_pMLST = df[df.datacol.str.contains('Sequence Type')]
    #Assign colnames to our new df with pMLST info
    df_pMLST.columns = ['Sequence_type', 'name']
    #Merge our dataframes with this new info
    df2 = pd.merge(df_profile, df_pMLST)
    #Check if there are any lines which indicate that no MLST loci were found
    no_loci = df.datacol.str.contains('No MLST loci was found').any()
    #Check if there are any lines which indicate that either a novel ST was found or no ST was found
    #pMLST tool doesn't discriminate these two particularly well on this line.
    unknown_pMLST = df.datacol.str.contains('Unknown').any()
    #If the file indicates no MLST loci were found AND that the genome has a novel or NA ST then assign it as 'None'
    if no_loci > 0 and unknown_pMLST > 0:
        df2.Sequence_type = 'None'
    #If the file indicates that some MLST loci were found but the tool still couldnt assign a pMLST then assign it as 'Novel'
    elif no_loci == 0 and unknown_pMLST > 0:
        df2.Sequence_type = 'Novel'
    #Pull out lines which indicate the closest STs
    df_nearest_pMLST = df[df.datacol.str.contains('Nearest ST')]
    #Assign column names to our new dataframe with nearest STs
    df_nearest_pMLST.columns = ['Nearest_ST', 'name']
    #Trim unwanted chars preceding nearest ST data
    df_nearest_pMLST.Nearest_ST = df_nearest_pMLST['Nearest_ST'].str.replace("Nearest ST: ","", regex=False)
    #Trim unwanted chars preceding nearest ST data. Same again but for when multiple matches of equal weight are identified
    df_nearest_pMLST.Nearest_ST = df_nearest_pMLST['Nearest_ST'].str.replace("Nearest STs: ","", regex=False)
    #Merge the dataframe with nearest ST data with our info on scheme and ST
    df2 = pd.merge(df2, df_nearest_pMLST, on='name', how='outer')
    #Trim unwanted chars preceding scheme info
    df2.pMLST_scheme = df2.pMLST_scheme.str.replace("pMLST profile: ","", regex=False)
    #Trim unwanted chars - after a space is just the word pmlst
    df2.pMLST_scheme = df2.pMLST_scheme.str.replace(" .*","", regex=True)
    #Trim unwanted chars in IncA/C scheme - slashes are naughty characters
    df2.pMLST_scheme = df2.pMLST_scheme.str.replace("/","", regex=False)
    #Convert schemes lowercase
    df2.pMLST_scheme = df2.pMLST_scheme.str.lower()
    #Trim unwanted chars preceding ST
    df2.Sequence_type = df2.Sequence_type.str.replace("Sequence Type: ","", regex=False)
    #Trim unwanted chars before and after our IncF RSTs
    df2.Sequence_type = df2.Sequence_type.str.replace("[","", regex=False)
    df2.Sequence_type = df2.Sequence_type.str.replace("]","", regex=False)
    #Append our dataframes
    dataframes.append(df2)

#Concatenate our dfs into a single one
combined = pd.concat(dataframes)

#Change column order
combined = combined[[
    'name',
    'pMLST_scheme',
    'Sequence_type',
    'Nearest_ST'
]]

#Write our output file to text
combined.to_csv(snakemake.output[0], sep="\t", index=False)

import pandas as pd
import os.path

# Directing python to the input from snakemake
assembly_stats_summaries = snakemake.input["pointfinder_summary"]

#Create an empty list
dataframes = []

#For each of our mlst files
for file in assembly_stats_summaries:
    #Read it in as df
    df = pd.read_csv(file, sep="\t", index_col=None)
    df["name"] = df["name"].str.replace("_blastn_results.tsv", "", regex=False)
    df["name"] = df["name"].str.replace(".*/", "", regex=True)
    #Add the df to our list of dfs
    dataframes.append(df)

#Concatenate our dfs into a single one
combined = pd.concat(dataframes)

#Write our output file to text
combined.to_csv(snakemake.output[0], sep="\t", index=False)

import pandas as pd
import os.path

# Directing python to the input from snakemake
spifinder_summaries = snakemake.input["spifinder_summary"]

dataframes = []

for file in spifinder_summaries:
    df = pd.read_csv(file, sep="\t")
    df["source"] = os.path.basename(file)
    df["source"] = df["source"].str.replace(".txt", "", regex=False)
    dataframes.append(df)

combined = pd.concat(dataframes)

combined = combined[
    [
        "Database",
        "SPI",
        "Identity",
        "Query / Template length",
        "Contig",
        "Position in contig",
        "Organism",
        "Insertion Site",
        "Category Function",
        "Note",
        "Accession number",
        "source"
    ]
]

combined.columns = [
    "Database",
    "SPI",
    "Identity",
    "Query / Template length",
    "Contig",
    "Position in contig",
    "Organism",
    "Insertion Site",
    "Category Function",
    "Note",
    "Accession number",
    "name"
]

combined = combined[['name'] + [col for col in combined.columns if col != 'name']]

combined.to_csv(snakemake.output[0], sep="\t", index=False)

shell:
    """
    amrfinder -U &> {log}
    touch {output}
    """

shell:
    """
    abritamr run --prefix {params.abritamr_out}/{wildcards.sample} --contigs {input.contigs} --jobs {threads} --species {params.species} &> {log}
    """

script:
    "../../scripts/combine_abritamr.py"

script:
    "../../scripts/combine_abritamr_amrfinder_raw.py"

shell:
    """
    mkdir -p {output.cgmlstfinder_out}
    resources/tools/cgmlstfinder/cgMLST.py --shared_memory -s ecoli -db /projects/AusGEM/databases/cgmlstfinder_db -o {output.cgmlstfinder_out} -k $CONDA_PREFIX/bin/kma {input.r1_filt} {input.r2_filt}
    cp {output.cgmlstfinder_out}/ecoli_summary.txt {output.cgmlst_summary}
    """

shell:
    "ectyper -i {input} -o {output[0]} &> {log}"

script:
    "../../scripts/combine_ectyper.py"

shell:
    """
    python3 resources/tools/fimtyper/VALIDATE_DB {input} 2> {log}
    touch {output}
    """

shell:
    """
    perl resources/tools/fimtyper/fimtyper.pl -d resources/tools/fimtyper/fimtyper_db -b $CONDA_PREFIX -i {input.assembly} -o {params.output_dir} -k 95.0 -l 0.6 &> {log}
    cp {output.raw} {output.renamed}
    """

shell:
    """
    sed '0,/^FimH type -/d' {input} > {output.trimmed}
    awk 'NR == 1 {{print "name\t" $0; next;}}{{print FILENAME "\t" $0;}}' {output.trimmed} > {output.named}
    """

script:
    "../../scripts/combine_fimtyper.py"

shell:
    """
    if [ ! -d "resources/dbs/dfast/protein" ]; then
        echo 'dfast protein dbs directory not located, downloading to resources/dbs/dfast/protein...'
        #git clone https://github.com/nigyta/dfast_core.git
        dfast_file_downloader.py --dbroot resources/dbs/dfast --protein dfast
    fi
    if [ ! -d "resources/dbs/dfast/hmm" ]; then
        echo 'dfast hmm dbs directory not located, downloading to resources/dbs/dfast/hmm...'
        dfast_file_downloader.py --dbroot resources/dbs/dfast --cdd Cog --hmm TIGR
    fi
    """

shell:
    """
    dfast -g {input.assembly} --cpu {threads} --dbroot {input.database} -o {output} --use_locustag_as_gene_id
    """

shell:
    """
    mv {input}/genome.gff {output}
    """

shell:
    """
    prokka --cpus {threads} --outdir {output} {input.assembly}
    """

shell:
    """
    mv {input}/PROKKA_*.gff {output}
    """

shell:
    """
    if [ ! -d "{output}" ]; then
        echo 'bakta db not located, downloading to {output}'
        bakta_db download --output {output} &> {log}
    fi
    """

shell:
    """
    bakta --db {params.bakta_db} --verbose --output {params.bakta_out} --prefix {wildcards.sample} --threads {threads} {input[1]}
    cp {output}/*.log {log}
    """

shell:
    """
    mv {input}/*.gff3 {output} &> {log}
    """

shell:
    """
    cp -n {input} {output}
    """

shell:
    """
    shovill --minlen 200 --cpus {threads} --outdir {output.shov_out} --R1 {input.r1_filt} --R2 {input.r2_filt} 1> {log.out} 2> {log.err}
    cp {output.shov_out}/contigs.fa {output.assembly}
    """

shell:
    "assembly-stats -t {input} > {output}"

shell:
    """
    mkdir {output}
    cp {input} {output}
    """

script:
    "../../scripts/combine_assembly_stats.py"

shell:
    "abricate --nopath --datadir {params.datadir} --threads {threads} --db {params.db} {input.assembly} > {output} 2> {log}"

shell:
    "abricate --nopath --datadir {params.datadir} --threads {threads} --db {params.db} {input.assembly} > {output} 2> {log}"

shell:
    """
    cat {input} > {output.out1}
    perl -p -i -e 's@\.fasta@@g' {output.out1}
    #Rename first column name
    sed -E '1s/#FILE/name/g' {output.out1} > {output.out2}
    #Remove duplicate headers by negative grep
    grep -v '#FILE' {output.out2} > {output.out}
    """

script:
    "../../scripts/combine_abricate.py"

shell:
    """
    mob_init -d resources/dbs/mobsuite 1> {log.out} 2> {log.err}
    """

shell:
    "mob_recon --force --num_threads {threads} --infile {input.assembly} --outdir {params.outdir} 1> {log.out} 2> {log.err}"

script:
    "../../scripts/combine_mobsuite.py"

shell:
    """
    mkdir -p {params.ezclermont_outdir}
    ! ezclermont {input} --logfile {output.full} > {output.simple}
    """

script:
    "../../scripts/combine_ezclermont.py"

shell:
    """
    git clone https://github.com/A-BN/ClermonTyping.git resources/tools/clermontyping
    """

shell:
    """
    {input.github_repo}/clermonTyping.sh --fasta {input.assembly} --name {wildcards.sample} --minimal &> {log}
    ls {params}
    mv {wildcards.sample}/* {params}
    rmdir {wildcards.sample}
    """

script:
    "../../scripts/combine_clermontyping.py"

shell:
    """
    python3 {params.pmlst_script_path} -i {input} -o {params.output_dir} -p {params.pmlst_db_path} {params.pmlst_tool} $CONDA_PREFIX/bin/blastn -s {params.db} -x -t {params.tmp}
    rm -rf {params.tmp}
    """

script:
    "../../scripts/combine_pMLST.py"

shell:
    """
    touch {input}
    """

script:
    "../../scripts/combine_IncF_RST.py"

shell:
    """
    resources/tools/pointfinder/PointFinder.py -i {input} -o {params.output_dir} -p resources/tools/pointfinder/pointfinder_db {params.species} -m blastn -m_p $CONDA_PREFIX/bin/blastn 2> {log}
    """

shell:
    """awk 'NR == 1 {{print "name\t" $0; next;}}{{print FILENAME "\t" $0;}}' {input} > {output}"""

script:
    "../../scripts/combine_pointfinder.py"

shell:
    """
    fastp -i {input.r1} -I {input.r2} -o {output.r1_filt} -O {output.r2_filt} --thread {threads} -j {output.json} -h {output.html} &> {log}
    """

shell:
    """
    cp -n {input} {output}
    """

shell:
    "roary -p {threads} -e -v -n -r -cd {params.core_req} -qc -k {params.kraken_db} -f {params.out} {input} 2> {log}"

shell:
    "snp-sites -c {input} > {output}"

shell:
    "iqtree -s {input} -m MFP -ntmax {threads} -bb 1000"

shell:
    "iqtree -s {input} -m MFP -ntmax {threads} -bb 1000"

shell:
    """
    kraken2 --db {input.db} --use-names --memory-mapping --threads {threads} --report {output.report} --output {output.out} {input.assembly} 2> {log}
    """

shell:
    """
    kraken2 --db {input.db} --use-names --memory-mapping --threads {threads} --report {output.report} --output {output.out} {input.r1_filt} {input.r2_filt} 2> {log}
    """

shell:
    "resources/tools/Bracken/bracken -d {input.db} -i {input.report} -o {output.bracken} -w {output.species} -r 100 -l S -t {threads} &> {log}"

script:
    "../../scripts/combine_bracken.py"

shell:
    """
    mkdir -p {output.spifinder_out}
    resources/tools/spifinder/spifinder.py -p resources/tools/spifinder_db -x -o {output.spifinder_out} -l 0.60 -t 0.95 -mp $CONDA_PREFIX/bin/kma -i {input.r1_filt} {input.r2_filt}
    cp {output.spifinder_out}/results_tab.tsv {output.spifinder_summary}
    """

script:
    "../../scripts/combine_spifinder.py"

shell:
    """
    mkdir -p {output.spifinder_out}
    resources/tools/spifinder/spifinder.py -p resources/tools/spifinder_db -x -o {output.spifinder_out} -l 0.60 -t 0.95 -mp $CONDA_PREFIX/bin/blastn -i {input}
    cp {output.spifinder_out}/results_tab.tsv {output.spifinder_summary}
    """

script:
    "../../scripts/combine_spifinder.py"

shell:
    "mlst {input} > {output}"

script:
    "../../scripts/combine_mlst.py"

Support
- Future updates