A microbial genomic analysis pipeline used by the Djordjevic Lab
About
This repo contains a genomic analysis pipeline used primarily for analysing E. coli genomes, though it also sees use with Salmonella genomes and some other species. Note that using it on other species may cause certain tools to break, or may otherwise limit the usefulness of the output.
Disclaimer
While I have made every effort to ensure these scripts produce reliable and accurate output, it is important that you spot-check your outputs to ensure they are correct :)
The documentation is still a work in progress, but if you need any help don't hesitate to post an issue :)
Modules/Workflows
This pipeline has three main modules:
- Quality control
- Genotyping
- Phylogenetics
All individual rules/analyses can be turned on or off in the configuration file.
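As a rough sketch, the on/off switches live in the YAML config; the key names below are illustrative only - see the template in the config directory for the real options:

# Hypothetical example of analysis toggles (names are illustrative)
run_abricate: True
run_pointfinder: True
run_gunc: False
run_spifinder: False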
1. Quality control
Reads are first filtered with fastp before being assembled with shovill. Assembly stats are analysed using assembly-stats. Species ID is determined using Kraken2 and Bracken, and a contamination and completeness screen is carried out using CheckM. GUNC is also run, though I usually disable it as the run time can be a bit long and I haven't yet found the output especially helpful for my purposes, though this might change.
This information is then consolidated into a tab-delimited QC report.
2. Genotyping
The word 'genotyping' is used a bit loosely here; this step performs a variety of analyses using:
- Abricate to screen for virulence, AMR, IS elements and other custom gene databases. New databases can be added and registered in the configuration file for use.
- Antibiotic resistance gene detection is performed using abritamr
- Point mutations associated with AMR are detected using CGE's PointFinder (there is some redundancy here with abritamr, which now has this functionality)
- MLST is performed using mlst
- Plasmid MLST is performed using pMLST
- Fimbrial adhesins are typed using FimTyper
- Phylogroup detection is performed with ClermonTyping and EZClermont
- Serotyping is performed with ECtyper
- Salmonella pathogenicity islands are detected using SPIfinder
- Genome annotation is performed with Prokka
- cgMLST is performed with cgMLSTfinder - this will be changed to chewBBACA
- Plasmid screening is performed using a slightly older version of abricate (screening of large reference genes seemed to break after 0.9.8).
Most of these rules output a tab-delimited summary file for each analysis; some also use downstream R scripts, but not many.
3. Phylogenetics
This workflow runs a pangenome-derived phylogenetic analysis utilising:
- Prokka
- Panaroo
- SNP-sites
- IQ-TREE
Pairwise SNP distances are also determined using SNP-dists.
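For orientation, the final steps of this module do something roughly equivalent to the following (file names are illustrative; the workflow wires these steps together via Snakemake):

# Extract variant (SNP) sites from the core gene alignment
snp-sites -c core_gene_alignment.aln > core_snps.aln
# Build a maximum-likelihood tree with model selection and ultrafast bootstrapping
iqtree -s core_snps.aln -m MFP -bb 1000
# Compute a pairwise SNP distance matrix
snp-dists core_gene_alignment.aln > snp_distances.tsv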
Installation and setup
There are many dependencies to install across the different analytical tools, so the first time you run the pipeline it can take some time to install everything via conda. Because of this I have added a script, setup_environments.sh, to install the environments for the three modules:
# Clone the git repository and enter it
git clone https://github.com/maxlcummins/pipelord.git
cd pipelord
# Setup and installation - this may take some time
sh setup_environments.sh
Configuration
Configuration of the Snakemake workflow is done via the configuration file in the config directory. It is a good idea to make a new one by copying the template and editing it accordingly, then specifying it in the snakemake command (e.g. --configfile config/new_template.yaml).
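For example (the new filename is arbitrary):

# Copy the template and edit the copy to point at your reads, databases and desired analyses
cp config/config_template.yaml config/my_project.yaml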
Usage
Note: change the --configfile path in the commands below if you created a new config file.
QC Workflow
#Run job in the background
nohup snakemake --resources mem_mb={max_memory_per_rule} -j 20 -p --use-conda --configfile config/config_template.yaml -s workflows/QC_workflow.smk > QC.log 2>&1 &
#Save the job ID so you can cancel it later if need be
PROCESS_ID=$!
echo "JOB_ID =" "$PROCESS_ID" > QC_jobID.txt
Genotyping
#Run job in the background
nohup snakemake --resources mem_mb={max_memory_per_rule} -j 20 -p --use-conda --configfile config/config_template.yaml -s workflows/Snakefile > genotype.log 2>&1 &
#Save the job ID so you can cancel it later if need be
PROCESS_ID=$!
echo "JOB_ID =" "$PROCESS_ID" > genotype_jobID.txt
Pangenomic Phylogenetics
#Run job in the background
nohup snakemake --resources mem_mb={max_memory_per_rule} -j 20 -p --use-conda --configfile config/config_template.yaml -s workflows/Treebuild_panaroo.smk > treebuild.log 2>&1 &
#Save the job ID so you can cancel it later if need be
PROCESS_ID=$!
echo "JOB_ID =" "$PROCESS_ID" > treebuild_jobID.txt
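If you need to stop a run later, the saved process ID can be fed to kill, for example:

# Cancel the treebuild job using the process ID saved above (third field of the file)
kill "$(awk '{print $3}' treebuild_jobID.txt)"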
Database update
To update your databases you can use a command like the following. Make sure the database directory path is correct and change the database name as appropriate:
abricate-get_db --dbdir resources/dbs/abricate --db vfdb --force
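To check which databases are currently installed in the pipeline's abricate database directory, something like the following should work:

abricate --list --datadir resources/dbs/abricate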
Code Snippets
import pandas as pd
import os.path
import re

# Directing python to the input from snakemake
abricate_summaries = snakemake.input["abricate_summary"]

#Create an empty list
dataframes = []

for file in abricate_summaries:
    df = pd.read_csv(file, sep="\t", index_col=None)
    dataframes.append(df)

#Concatenate our dfs into a single one
combined = pd.concat(dataframes)

#Change the filename column to be name
combined.rename(columns={'#FILE':'name'}, inplace=True)

#Write our output file to text
combined.to_csv(snakemake.output[0], sep="\t", index=False)

import pandas as pd
import os.path
import re

# Directing python to the input from snakemake
amrfinder = snakemake.input["amrfinder"]

#Create an empty list
dataframes = []

for file in amrfinder:
    df = pd.read_csv(file, sep="\t", index_col=None)
    df["name"] = file
    df["name"] = df["name"].str.replace("/amrfinder.out", "", regex=False)
    df["name"] = df["name"].str.replace(".*\/", "", regex=True)
    dataframes.append(df)

#Concatenate our dfs into a single one
combined = pd.concat(dataframes)

#Move the name column to the start
combined = combined[['name'] + [col for col in combined.columns if col != 'name']]

#Write our output file to text
combined.to_csv(snakemake.output["abritamr_resistance"], sep="\t", index=False)

import pandas as pd
import os.path
import re

#### Process our resistance data ####

# Directing python to the input from snakemake
abritamr_resistance = snakemake.input["abritamr_resistance"]

#Create an empty list
dataframes = []

for file in abritamr_resistance:
    df = pd.read_csv(file, sep="\t", index_col=None)
    dataframes.append(df)

#Concatenate our dfs into a single one
combined = pd.concat(dataframes)

#Change the filename column to be name
combined.rename(columns={'Isolate':'name'}, inplace=True)

#Trim characters before the samplename remaining from the path
combined['name'].replace({r".*\/" : ""}, inplace = True, regex = True)

#Write our output file to text
combined.to_csv(snakemake.output["abritamr_resistance"], sep="\t", index=False)

#### Process our virulence data ####

# Directing python to the input from snakemake
abritamr_virulence = snakemake.input["abritamr_virulence"]

#Create an empty list
dataframes = []

for file in abritamr_virulence:
    df = pd.read_csv(file, sep="\t", index_col=None)
    dataframes.append(df)

#Concatenate our dfs into a single one
combined = pd.concat(dataframes)

#Change the filename column to be name
combined.rename(columns={'Isolate':'name'}, inplace=True)

#Trim characters before the samplename remaining from the path
combined['name'].replace({r".*\/" : ""}, inplace = True, regex = True)

#Write our output file to text
combined.to_csv(snakemake.output["abritamr_virulence"], sep="\t", index=False)

#### Process our partial data ####

# Directing python to the input from snakemake
abritamr_partials = snakemake.input["abritamr_partials"]

#Create an empty list
dataframes = []

for file in abritamr_partials:
    df = pd.read_csv(file, sep="\t", index_col=None)
    dataframes.append(df)

#Concatenate our dfs into a single one
combined = pd.concat(dataframes)

#Change the filename column to be name
combined.rename(columns={'Isolate':'name'}, inplace=True)

#Trim characters before the samplename remaining from the path
combined['name'].replace({r".*\/" : ""}, inplace = True, regex = True)

#Write our output file to text
combined.to_csv(snakemake.output["abritamr_partials"], sep="\t", index=False)

import pandas as pd
import os.path

# Directing python to the input from snakemake
assembly_stats_summaries = snakemake.input["assembly_stats_summary"]

#Create an empty list
dataframes = []

#For each of our mlst files
for file in assembly_stats_summaries:
    #Read it in as df
    df = pd.read_csv(file, sep="\t", index_col=None)
    #Add the df to our list of dfs
    dataframes.append(df)

#Concatenate our dfs into a single one
combined = pd.concat(dataframes)

#Change the filename column to be name
combined.rename(columns={'filename':'name'}, inplace=True)

#Remove the path related info from our name column
combined["name"] = combined["name"].str.replace(".*/", "", regex=True)
combined["name"] = combined["name"].str.replace(".fasta", "", regex=False)

#Write our output file to text
combined.to_csv(snakemake.output[0], sep="\t", index=False)

import pandas as pd
import os.path

# Directing python to the input from snakemake
bracken_reports = snakemake.input["bracken_reports"]

dataframes = []

for file in bracken_reports:
    df = pd.read_csv(file, sep="\t")
    df["source"] = os.path.basename(file)
    df["source"] = df["source"].str.replace(".bracken.txt", "", regex=False)
    dataframes.append(df)

combined = pd.concat(dataframes)

combined = combined[
    [
        "source",
        "name",
        "taxonomy_id",
        "taxonomy_lvl",
        "kraken_assigned_reads",
        "added_reads",
        "new_est_reads",
        "fraction_total_reads",
    ]
]

combined.columns = [
    "name",
    "scientific_name",
    "taxonomy_id",
    "taxonomy_lvl",
    "kraken_assigned_reads",
    "added_reads",
    "new_est_reads",
    "fraction_total_reads",
]

combined.to_csv(snakemake.output[0], sep="\t", index=False)

import pandas as pd
import os.path
import re

# Directing python to the input from snakemake
clermontyping_summaries = snakemake.input['clermontyping']

print(clermontyping_summaries)

#Create an empty list
dataframes = []

#Initialise a for loop for processing our pMLST outputs
for file in clermontyping_summaries:
    #Read it in as df
    df = pd.read_csv(file, sep="\t", header=None)
    #Assign a column name to our input which isnt actually a CSV and therefore has a heading of '0'
    df.columns = ['name', 'genes_detected', 'quadraplex_resuts', 'alleles_for_C_or_G', 'clermontyping_phylogroup', 'mashfile']
    #Append our dataframes
    dataframes.append(df)

#Concatenate our dfs into a single one
combined = pd.concat(dataframes)

#Replace unwanted characters in our dataframe
combined['name'] = combined['name'].replace('\.fasta', '', regex=True)

#Write our simple file to text
combined.to_csv(snakemake.output[0], sep="\t", index=False)

import pandas as pd
import os.path

# Directing python to the input from snakemake
ectyper_summaries = snakemake.input["ectyper_summary"]

#Create an empty list
dataframes = []

#For each of our ectyper files
for file in ectyper_summaries:
    print(file)
    #Read it in as df
    df = pd.read_csv(file, sep="\t", header=0, index_col=None)
    #Add the df to our list of dfs
    dataframes.append(df)

#Concatenate our dfs into a single one
combined = pd.concat(dataframes)

#Write our output file to text
combined.to_csv(snakemake.output[0], sep="\t", index=False)

import pandas as pd
import os.path
import re

# Directing python to the input from snakemake
ezclermont_summaries = snakemake.input['ezclermont_summary']

print(ezclermont_summaries)

#Create an empty list
dataframes = []

#Initialise a for loop for processing our pMLST outputs
for file in ezclermont_summaries:
    #Read it in as df
    df = pd.read_csv(file, sep="\t", header=None, on_bad_lines='skip')
    #Assign a column name to our input which isnt actually a CSV and therefore has a heading of '0'
    df.columns = ['datacol']
    name = df[df.datacol.str.contains('Reading in sequence')]
    #Trim our filenames to use as a name column
    name.datacol = name.datacol.str.replace('Reading in sequence(s) from', '', regex=False)
    name.datacol = name.datacol.str.replace('.fasta', '', regex=False)
    #Create an empty dataframe
    df2 = pd.DataFrame()
    #Create our name column
    df2['name'] = name
    #Save the result for trpBA
    df2['trpBA'] = df[df.datacol.str.contains('trpBA_control')].iloc[0,0]
    print(df2['trpBA'])
    #Save the result for virA
    df2['virA'] = df[df.datacol.str.contains('virA:')].iloc[0,0]
    #Save the result for TspE4
    df2['TspE4'] = df[df.datacol.str.contains('TspE4:')].iloc[0,0]
    #Save the result for arpA:
    df2['arpA'] = df[df.datacol.str.contains('arpA:')].iloc[0,0]
    #Save the result for chu:
    df2['chu'] = df[df.datacol.str.contains('chu:')].iloc[0,0]
    #Save the result for yjaA:
    df2['yjaA'] = df[df.datacol.str.contains('yjaA:')].iloc[0,0]
    #Save the result for arpA:
    df2['arpA'] = df[df.datacol.str.contains('arpA:')].iloc[0,0]
    #Save the result for the Clermont_type:
    df2['Clermont_type'] = df[df.datacol.str.contains('Clermont type:')].iloc[0,0]
    #Append our dataframes
    dataframes.append(df2)

#Concatenate our dfs into a single one
combined = pd.concat(dataframes)

#Select our columns we wish to change
colnames = combined.select_dtypes(['object']).columns

#Replace unwanted characters in our dataframe
combined[colnames] = combined[colnames].replace('.*: ', '', regex=True)

#Replace dashes and pluses with 1s and 0s
combined[colnames] = combined[colnames].replace('+', 1, regex=False)
combined[colnames] = combined[colnames].replace('-', 0, regex=False)

#Write our simple file to text
combined.to_csv(snakemake.output[0], sep="\t", index=False)

import pandas as pd
import os.path

# Directing python to the input from snakemake
fimtyper_summaries = snakemake.input["fimtyper_summary"]

#Read in out dataframe and add a name column
dataframes = []

for file in fimtyper_summaries:
    df = pd.read_csv(file, sep="\t")
    if 'Contig' in df.columns:
        ## If detect No FimH type found skip sample
        ## If detect "Please contact curator, if new fimH-type" skip line
        df["name"] = os.path.basename(file)
        df["name"] = df["name"].str.replace("_named.txt", "", regex=False)
        dataframes.append(df)
    else:
        print("No fimH type found in "+str(os.path.basename(file)))

#Combine our fimtyper files
combined = pd.concat(dataframes)

#Name our columns
combined = combined[
    [
        "name",
        "Fimtype",
        "Identity",
        "Query/HSP",
        "Contig",
        "Position in contig",
        "Accession no."
    ]
]

#Reorder our columns (probably redundant now)
combined = combined[['name'] + [col for col in combined.columns if col != 'name']]

#Remove accession number column (usually empty)
combined = combined.drop('Accession no.', axis=1)

#Add a new column for query length and HSP (High-scoring segment pair i.e. match) length
combined["Query length"] = combined["Query/HSP"].str.replace("/.*", "", regex=True)
combined["HSP length"] = combined["Query/HSP"].str.replace(".*/", "", regex=True)

#Add an asterisk to novel alleles which have a non 100% identity or a length other than 489
combined.loc[combined['Identity'] != 100.0, 'Fimtype'] = combined.loc[combined['Identity'] != 100.0, 'Fimtype']+"*"
combined.loc[combined['Query length'] != combined['HSP length'], 'Fimtype'] = combined.loc[combined['Query length'] != combined['HSP length'], 'Fimtype']+"*"

#Remove instances of double asterisks
combined["Fimtype"] = combined["Fimtype"].str.replace("**", "*", regex=False)

#Define pattern of lines we want to remove (Messages from devs to report novel alleles)
false_line_pattern = 'Please'

#Identify rows which are lines we want to remove
pattern_filter = combined['Fimtype'].str.contains(false_line_pattern)

#Remove unwanted rows
combined = combined[~pattern_filter]

#Write to file
combined.to_csv(snakemake.output[0], sep="\t", index=False)

import pandas as pd
import os.path
import re

# Directing python to the input from snakemake
pmlst_summaries = snakemake.input['pMLST_summary']

#Create an empty list
dataframes = []

#Initialise a for loop for processing our pMLST outputs
for file in pmlst_summaries:
    #Read it in as df
    df = pd.read_csv(file, sep="\t", header=None, on_bad_lines='skip')
    #Trim our filenames to use as a name column
    name = re.sub(".out/results.txt","", file)
    name = re.sub(".*/", "", name)
    #Assign a column name to our input which isnt actually a CSV and therefore has a heading of '0'
    df.columns = ['datacol']
    #Identify pMLST data pertaining to IncF plasmids
    IncF_results = df['datacol'].str.contains('IncF RST').any()
    #Only process IncF data.
    #Wasteful loop iterations but snakemake was being painful so I decided on feeding all pMLST input rather than just IncF
    if IncF_results:
        #Pull out the columns with F replicons
        df = df[df.datacol.str.contains('^FI')]
        #Convert multiple spaces to commas to properly delimit our file
        df['datacol'] = df['datacol'].str.replace(' +', ',', regex=True)
        #Split our string by comma delimiter into different columns
        df = df['datacol'].str.split(",", expand=True)
        #Assign our name column
        df['name'] = name
        #Append our dataframes
        dataframes.append(df)

#Concatenate our dfs into a single one
combined = pd.concat(dataframes)

#Assign column names
combined.columns = [
    'Locus',
    'Identity',
    'Coverage',
    'Alignment Length',
    'Allele Length',
    'Gaps',
    'Allele',
    'name'
]

#Change column order
combined = combined[[
    'name',
    'Locus',
    'Identity',
    'Coverage',
    'Alignment Length',
    'Allele Length',
    'Gaps',
    'Allele'
]]

#Simplify when no allele is found with a dash
combined['Allele'] = combined['Allele'].str.replace('No hit found', '-', regex=False)

#Create a column for a simplified allele call
combined['Simple Allele'] = combined['Allele']
combined['Simple Allele'] = combined['Simple Allele'].str.replace('FIIS', 'S', regex=True)
combined['Simple Allele'] = combined['Simple Allele'].str.replace('FIIY', 'Y', regex=True)
combined['Simple Allele'] = combined['Simple Allele'].str.replace('FIIK', 'K', regex=True)
combined['Simple Allele'] = combined['Simple Allele'].str.replace('FI', '', regex=True)
combined['Simple Allele'] = combined['Simple Allele'].str.replace('I', 'F', regex=True)
combined['Simple Allele'] = combined['Simple Allele'].str.replace('_', '', regex=True)

#Separate our alleles to combine them in the correct order for IncF RST
#Sort by allele name. If they are instead sorted in order of discovery (via contig number) then we will see inconsistencies between samples
FICs = combined[combined['Locus'] == 'FIC'].sort_values(by='Simple Allele', ascending=False)
FIIs = combined[combined['Locus'] == 'FII'].sort_values(by='Simple Allele', ascending=False)
FIKs = combined[combined['Locus'] == 'FIIK'].sort_values(by='Simple Allele', ascending=False)
FISs = combined[combined['Locus'] == 'FIIS'].sort_values(by='Simple Allele', ascending=False)
FIYs = combined[combined['Locus'] == 'FIIY'].sort_values(by='Simple Allele', ascending=False)
FIA_type_df = combined[combined['Locus'] == 'FIA'].sort_values(by='Simple Allele', ascending=False)
FIB_type_df = combined[combined['Locus'] == 'FIB'].sort_values(by='Simple Allele', ascending=False)

#Combine our dataframes again
F_type_df = pd.concat([FIIs, FICs, FIKs, FISs, FIYs])
F_type_df = F_type_df[F_type_df['Simple Allele'] != '-']

#Group by name and collapse our FIIs, FICs, FIKs, FISs and FIYs with a slash separator
F_type = F_type_df.groupby('name').apply(lambda x: '/'.join(x['Simple Allele'])).reset_index()
#Assign our column names
F_type.columns = ['name', 'F type']

#Group by name and collapse our FIAs with a slash separator
FIA_type = FIA_type_df.groupby('name').apply(lambda x: '/'.join(x['Simple Allele'])).reset_index()
#Assign our column names
FIA_type.columns = ['name', 'FIA type']

#Group by name and collapse our FIBs with a slash separator
FIB_type = FIB_type_df.groupby('name').apply(lambda x: '/'.join(x['Simple Allele'])).reset_index()
#Assign our column names
FIB_type.columns = ['name', 'FIB type']

#Join our columns
IncF_RSTs = pd.merge(F_type, FIA_type, on='name', how='outer')
IncF_RSTs = pd.merge(IncF_RSTs, FIB_type, on='name', how='outer')

#Replace NaNs (for missing alleles) with a dash
IncF_RSTs = IncF_RSTs.fillna('-')

#Replace the cell contents for F types to make them better reflect null values
IncF_RSTs['F type'] = IncF_RSTs['F type'].str.replace('-', 'F-', regex=True)
IncF_RSTs['FIA type'] = IncF_RSTs['FIA type'].str.replace('-', 'A-', regex=True)
IncF_RSTs['FIB type'] = IncF_RSTs['FIB type'].str.replace('-', 'B-', regex=True)

#Generate an IncF RST column by combining the alleles
IncF_RSTs['IncF RST'] = IncF_RSTs['F type']+':'+IncF_RSTs['FIA type']+':'+IncF_RSTs['FIB type']

#Make a simple IncF RST dataframe to bind to our full IncF RST dataframe
IncF_RSTs_simple = IncF_RSTs[['name', 'IncF RST']]

#Join our columns
combined = pd.merge(combined, IncF_RSTs_simple, on='name', how='outer')

#Write our simple file to text
IncF_RSTs.to_csv(snakemake.output[0], sep="\t", index=False)

#Write our detailed file to text
combined.to_csv(snakemake.output[1], sep="\t", index=False)

import pandas as pd
import os.path

# Directing python to the input from snakemake
mlst_summaries = snakemake.input["mlst_summary"]

#Create an empty list
dataframes = []

#For each of our mlst files
for file in mlst_summaries:
    #Read it in as df
    df = pd.read_csv(file, sep="\t", header=None, index_col=None)
    #Add the df to our list of dfs
    dataframes.append(df)

#Concatenate our dfs into a single one
combined = pd.concat(dataframes)

#Extract the number of columns
numCols = combined.shape[1]

#Define the first three column names
mlst_colnames = ['name', 'scheme', 'ST']

#Create a list of column names for our alleles
allele_colnames = ["allele"+str(x) for x in range(1, numCols-2)]

#Combine our lists of column names
mlst_colnames = mlst_colnames + allele_colnames

#Assign our list of column names
combined.columns = mlst_colnames

#Remove the path related info from our name column
combined["name"] = combined["name"].str.replace(".*/", "", regex=True)
combined["name"] = combined["name"].str.replace(".fasta", "", regex=False)

#Fill empty columns with 0
combined = combined.fillna(0)

#Write our output file to text
combined.to_csv(snakemake.output[0], sep="\t", index=False)

import pandas as pd
import os.path
import re

# Directing python to the input from snakemake
mobsuite = snakemake.input["mobsuite"]

# Create an empty list
dataframes = []

# Initialise a for loop for processing our mob_suite outputs
for file in mobsuite:
    # Read it in as df
    df = pd.read_csv(file, sep="\t", header=0, on_bad_lines="skip")
    # Append our dataframes
    dataframes.append(df)

# Concatenate our dfs into a single one
combined = pd.concat(dataframes)

# Write our output file to text
combined.to_csv(snakemake.output[0], sep="\t", index=False)

import pandas as pd
import os.path
import re

# Directing python to the input from snakemake
pmlst_summaries = snakemake.input["pMLST_summary"]

#Create an empty list
dataframes = []

#Initialise a for loop for processing our pMLST outputs
for file in pmlst_summaries:
    #Read it in as df
    df = pd.read_csv(file, sep="\t", header=None, on_bad_lines='skip')
    #Trim our filenames to use as a name column
    name = re.sub(".out/results.txt","", file)
    name = re.sub(".*/", "", name)
    #Assign our name column
    df['name'] = name
    #Reassign name columns. The input isnt actually a CSV so we label the original input as data and redundantly reassign our name column for the names
    df.columns = ['datacol', 'name']
    #Isolate pMLST scheme info
    df_profile = df[df.datacol.str.contains('pMLST profile')]
    #Assign colnames to our new df with pMLST scheme info
    df_profile.columns = ['pMLST_scheme', 'name']
    #Isolate pMLST info
    df_pMLST = df[df.datacol.str.contains('Sequence Type')]
    #Assign colnames to our new df with pMLST info
    df_pMLST.columns = ['Sequence_type', 'name']
    #Merge our dataframes with this new info
    df2 = pd.merge(df_profile, df_pMLST)
    #Check if there are any lines which indicate that no MLST loci were found
    no_loci = df.datacol.str.contains('No MLST loci was found').any()
    #Check if there are any lines which indicate that either a novel ST was found or no ST was found
    #pMLST tool doesn't discriminate these two particularly well on this line.
    unknown_pMLST = df.datacol.str.contains('Unknown').any()
    #If the file indicates no MLST loci were found AND that the genome has a novel or NA ST then assign it as 'None'
    if no_loci > 0 and unknown_pMLST > 0:
        df2.Sequence_type = 'None'
    #If the file indicates that some MLST loci were found but the tool still couldnt assign a pMLST then assign it as 'Novel'
    elif no_loci == 0 and unknown_pMLST > 0:
        df2.Sequence_type = 'Novel'
    #Pull out lines which indicate the closest STs
    df_nearest_pMLST = df[df.datacol.str.contains('Nearest ST')]
    #Assign column names to our new dataframe with nearest STs
    df_nearest_pMLST.columns = ['Nearest_ST', 'name']
    #Trim unwanted chars preceding nearest ST data
    df_nearest_pMLST.Nearest_ST = df_nearest_pMLST['Nearest_ST'].str.replace("Nearest ST: ","", regex=False)
    #Trim unwanted chars preceding nearest ST data. Same again but for when multiple matches of equal weight are identified
    df_nearest_pMLST.Nearest_ST = df_nearest_pMLST['Nearest_ST'].str.replace("Nearest STs: ","", regex=False)
    #Merge the dataframe with nearest ST data with our info on scheme and ST
    df2 = pd.merge(df2, df_nearest_pMLST, on='name', how='outer')
    #Trim unwanted chars preceding scheme info
    df2.pMLST_scheme = df2.pMLST_scheme.str.replace("pMLST profile: ","", regex=False)
    #Trim unwanted chars - after a space is just the word pmlst
    df2.pMLST_scheme = df2.pMLST_scheme.str.replace(" .*","", regex=True)
    #Trim unwanted chars in IncA/C scheme - slashes are naughty characters
    df2.pMLST_scheme = df2.pMLST_scheme.str.replace("/","", regex=False)
    #Convert schemes lowercase
    df2.pMLST_scheme = df2.pMLST_scheme.str.lower()
    #Trim unwanted chars preceding ST
    df2.Sequence_type = df2.Sequence_type.str.replace("Sequence Type: ","", regex=False)
    #Trim unwanted chars before and after our IncF RSTs
    df2.Sequence_type = df2.Sequence_type.str.replace("[","", regex=False)
    df2.Sequence_type = df2.Sequence_type.str.replace("]","", regex=False)
    #Append our dataframes
    dataframes.append(df2)

#Concatenate our dfs into a single one
combined = pd.concat(dataframes)

#Change column order
combined = combined[[
    'name',
    'pMLST_scheme',
    'Sequence_type',
    'Nearest_ST'
]]

#Write our output file to text
combined.to_csv(snakemake.output[0], sep="\t", index=False)

import pandas as pd
import os.path

# Directing python to the input from snakemake
assembly_stats_summaries = snakemake.input["pointfinder_summary"]

#Create an empty list
dataframes = []

#For each of our mlst files
for file in assembly_stats_summaries:
    #Read it in as df
    df = pd.read_csv(file, sep="\t", index_col=None)
    df["name"] = df["name"].str.replace("_blastn_results.tsv", "", regex=False)
    df["name"] = df["name"].str.replace(".*/", "", regex=True)
    #Add the df to our list of dfs
    dataframes.append(df)

#Concatenate our dfs into a single one
combined = pd.concat(dataframes)

#Write our output file to text
combined.to_csv(snakemake.output[0], sep="\t", index=False)

import pandas as pd
import os.path

# Directing python to the input from snakemake
spifinder_summaries = snakemake.input["spifinder_summary"]

dataframes = []

for file in spifinder_summaries:
    df = pd.read_csv(file, sep="\t")
    df["source"] = os.path.basename(file)
    df["source"] = df["source"].str.replace(".txt", "", regex=False)
    dataframes.append(df)

combined = pd.concat(dataframes)

combined = combined[
    [
        "Database",
        "SPI",
        "Identity",
        "Query / Template length",
        "Contig",
        "Position in contig",
        "Organism",
        "Insertion Site",
        "Category Function",
        "Note",
        "Accession number",
        "source"
    ]
]

combined.columns = [
    "Database",
    "SPI",
    "Identity",
    "Query / Template length",
    "Contig",
    "Position in contig",
    "Organism",
    "Insertion Site",
    "Category Function",
    "Note",
    "Accession number",
    "name"
]

combined = combined[['name'] + [col for col in combined.columns if col != 'name']]

combined.to_csv(snakemake.output[0], sep="\t", index=False)

shell:
    """
    amrfinder -U &> {log}
    touch {output}
    """

shell:
    """
    abritamr run --prefix {params.abritamr_out}/{wildcards.sample} --contigs {input.contigs} --jobs {threads} --species {params.species} &> {log}
    """

script:
    "../../scripts/combine_abritamr.py"

script:
    "../../scripts/combine_abritamr_amrfinder_raw.py"

shell:
    """
    mkdir -p {output.cgmlstfinder_out}
    resources/tools/cgmlstfinder/cgMLST.py --shared_memory -s ecoli -db /projects/AusGEM/databases/cgmlstfinder_db -o {output.cgmlstfinder_out} -k $CONDA_PREFIX/bin/kma {input.r1_filt} {input.r2_filt}
    cp {output.cgmlstfinder_out}/ecoli_summary.txt {output.cgmlst_summary}
    """

shell:
    "ectyper -i {input} -o {output[0]} &> {log}"

script:
    "../../scripts/combine_ectyper.py"

shell:
    """
    python3 resources/tools/fimtyper/VALIDATE_DB {input} 2> {log}
    touch {output}
    """

shell:
    """
    perl resources/tools/fimtyper/fimtyper.pl -d resources/tools/fimtyper/fimtyper_db -b $CONDA_PREFIX -i {input.assembly} -o {params.output_dir} -k 95.0 -l 0.6 &> {log}
    cp {output.raw} {output.renamed}
    """

shell:
    """
    sed '0,/^FimH type -/d' {input} > {output.trimmed}
    awk 'NR == 1 {{print "name\t" $0; next;}}{{print FILENAME "\t" $0;}}' {output.trimmed} > {output.named}
    """

script:
    "../../scripts/combine_fimtyper.py"

shell:
    """
    if [ ! -d "resources/dbs/dfast/protein" ]; then
        echo 'dfast protein dbs directory not located, downloading to resources/dbs/dfast/protein...'
        #git clone https://github.com/nigyta/dfast_core.git
        dfast_file_downloader.py --dbroot resources/dbs/dfast --protein dfast
    fi
    if [ ! -d "resources/dbs/dfast/hmm" ]; then
        echo 'dfast hmm dbs directory not located, downloading to resources/dbs/dfast/hmm...'
        dfast_file_downloader.py --dbroot resources/dbs/dfast --cdd Cog --hmm TIGR
    fi
    """

shell:
    """
    dfast -g {input.assembly} --cpu {threads} --dbroot {input.database} -o {output} --use_locustag_as_gene_id
    """

shell:
    """
    mv {input}/genome.gff {output}
    """

shell:
    """
    prokka --cpus {threads} --outdir {output} {input.assembly}
    """

shell:
    """
    mv {input}/PROKKA_*.gff {output}
    """

shell:
    """
    if [ ! -d "{output}" ]; then
        echo 'bakta db not located, downloading to {output}'
        bakta_db download --output {output} &> {log}
    fi
    """

shell:
    """
    bakta --db {params.bakta_db} --verbose --output {params.bakta_out} --prefix {wildcards.sample} --threads {threads} {input[1]}
    cp {output}/*.log {log}
    """

shell:
    """
    mv {input}/*.gff3 {output} &> {log}
    """

shell:
    """
    cp -n {input} {output}
    """

shell:
    """
    shovill --minlen 200 --cpus {threads} --outdir {output.shov_out} --R1 {input.r1_filt} --R2 {input.r2_filt} 1> {log.out} 2> {log.err}
    cp {output.shov_out}/contigs.fa {output.assembly}
    """

shell:
    "assembly-stats -t {input} > {output}"

shell:
    """
    mkdir {output}
    cp {input} {output}
    """

script:
    "../../scripts/combine_assembly_stats.py"

shell:
    "abricate --nopath --datadir {params.datadir} --threads {threads} --db {params.db} {input.assembly} > {output} 2> {log}"

shell:
    "abricate --nopath --datadir {params.datadir} --threads {threads} --db {params.db} {input.assembly} > {output} 2> {log}"

shell:
    """
    cat {input} > {output.out1}
    perl -p -i -e 's@\.fasta@@g' {output.out1}
    #Rename first column name
    sed -E '1s/#FILE/name/g' {output.out1} > {output.out2}
    #Remove duplicate headers by negative grep
    grep -v '#FILE' {output.out2} > {output.out}
    """

script:
    "../../scripts/combine_abricate.py"

shell:
    """
    mob_init -d resources/dbs/mobsuite 1> {log.out} 2> {log.err}
    """

shell:
    "mob_recon --force --num_threads {threads} --infile {input.assembly} --outdir {params.outdir} 1> {log.out} 2> {log.err}"

script:
    "../../scripts/combine_mobsuite.py"

shell:
    """
    mkdir -p {params.ezclermont_outdir}
    ! ezclermont {input} --logfile {output.full} > {output.simple}
    """

script:
    "../../scripts/combine_ezclermont.py"

shell:
    """
    git clone https://github.com/A-BN/ClermonTyping.git resources/tools/clermontyping
    """

shell:
    """
    {input.github_repo}/clermonTyping.sh --fasta {input.assembly} --name {wildcards.sample} --minimal &> {log}
    ls {params}
    mv {wildcards.sample}/* {params}
    rmdir {wildcards.sample}
    """

script:
    "../../scripts/combine_clermontyping.py"

shell:
    """
    python3 {params.pmlst_script_path} -i {input} -o {params.output_dir} -p {params.pmlst_db_path} {params.pmlst_tool} $CONDA_PREFIX/bin/blastn -s {params.db} -x -t {params.tmp}
    rm -rf {params.tmp}
    """

script:
    "../../scripts/combine_pMLST.py"

shell:
    """
    touch {input}
    """

script:
    "../../scripts/combine_IncF_RST.py"

shell:
    """
    resources/tools/pointfinder/PointFinder.py -i {input} -o {params.output_dir} -p resources/tools/pointfinder/pointfinder_db {params.species} -m blastn -m_p $CONDA_PREFIX/bin/blastn 2> {log}
    """

shell:
    """awk 'NR == 1 {{print "name\t" $0; next;}}{{print FILENAME "\t" $0;}}' {input} > {output}"""

script:
    "../../scripts/combine_pointfinder.py"

shell:
    """
    fastp -i {input.r1} -I {input.r2} -o {output.r1_filt} -O {output.r2_filt} --thread {threads} -j {output.json} -h {output.html} &> {log}
    """

shell:
    """
    cp -n {input} {output}
    """

shell:
    "roary -p {threads} -e -v -n -r -cd {params.core_req} -qc -k {params.kraken_db} -f {params.out} {input} 2> {log}"

shell:
    "snp-sites -c {input} > {output}"

shell:
    "iqtree -s {input} -m MFP -ntmax {threads} -bb 1000"

shell:
    "iqtree -s {input} -m MFP -ntmax {threads} -bb 1000"

shell:
    """
    kraken2 --db {input.db} --use-names --memory-mapping --threads {threads} --report {output.report} --output {output.out} {input.assembly} 2> {log}
    """

shell:
    """
    kraken2 --db {input.db} --use-names --memory-mapping --threads {threads} --report {output.report} --output {output.out} {input.r1_filt} {input.r2_filt} 2> {log}
    """

shell:
    "resources/tools/Bracken/bracken -d {input.db} -i {input.report} -o {output.bracken} -w {output.species} -r 100 -l S -t {threads} &> {log}"

script:
    "../../scripts/combine_bracken.py"

shell:
    """
    mkdir -p {output.spifinder_out}
    resources/tools/spifinder/spifinder.py -p resources/tools/spifinder_db -x -o {output.spifinder_out} -l 0.60 -t 0.95 -mp $CONDA_PREFIX/bin/kma -i {input.r1_filt} {input.r2_filt}
    cp {output.spifinder_out}/results_tab.tsv {output.spifinder_summary}
    """

script:
    "../../scripts/combine_spifinder.py"

shell:
    """
    mkdir -p {output.spifinder_out}
    resources/tools/spifinder/spifinder.py -p resources/tools/spifinder_db -x -o {output.spifinder_out} -l 0.60 -t 0.95 -mp $CONDA_PREFIX/bin/blastn -i {input}
    cp {output.spifinder_out}/results_tab.tsv {output.spifinder_summary}
    """

script:
    "../../scripts/combine_spifinder.py"

shell:
    "mlst {input} > {output}"

script:
    "../../scripts/combine_mlst.py"

Support
- Future updates