Ribosomal Protein Phylogenetic Tree Generator using Snakemake
ribosomal_snakemake
**NEW VERSION AVAILABLE**: much faster, but it relies on HMMER and therefore runs only on x86_64.
# Ribosomal protein tree generation - Python Snakemake version

A workflow to generate ribosomal protein phylogenetic trees using 15 ribosomal protein sequences. Either protein or DNA sequences can be used to build the tree; DNA sequences may be preferable for more closely related genomes, and protein sequences for genomes that are more divergent.
For further information on usage and tutorial please see: https://lcrossman.github.io/docs/html/index.html
Requirements:
snakemake>=3.5
python>=3.3
biopython>=1.77
toolz>=0.11.1
diamond BLAST version>=0.9.32
mafft>=7.429
fasttree>=2.1.10 OR iqtree>=1.6.11
ncbi-genome-download>=0.3.0
Python scripts from this GitHub repository
An environment.yaml for conda is included in the env_yaml directory if required.
Installation via conda:
Use conda to install Snakemake in a Linux or macOS environment. Install conda: https://conda.io/en/latest/miniconda.html
Install dependencies via conda.
Install snakemake:
conda install -c conda-forge mamba
mamba create -c conda-forge -c bioconda -n snakemake snakemake
conda activate snakemake
snakemake --help
Install the ribosomal protein tree workflow from github:
git clone XXX
cd XXX
cd testfiles
snakemake -n
Usage
Configure workflow:
Configure the workflow according to your needs by editing the file config.yaml. For use without any sequence database downloads, you need to provide the files and pathnames listed in cleannames.txt and atccs.txt.
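A quick way to confirm that the options discussed below are set is to load the configuration in Python. This is only a sketch: the key names (download_genbank, ribo_names_field, protein_dna, tree_type) are assumed from this README and the rule code, so adjust them to match your copy of config.yaml.

```python
# Sketch only: list the config.yaml options referred to in this README.
# The key names below are assumptions taken from the text and the rule code;
# check them against the config.yaml shipped with the workflow.
import yaml

with open("config.yaml") as fh:
    config = yaml.safe_load(fh)

for key in ("download_genbank", "ribo_names_field", "protein_dna", "tree_type"):
    print(key, "=", config.get(key, "<not set>"))
```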
- Gather your ribosomal protein sequences as protein sequence fasta files in 15 separate files. These will be used for either protein or DNA sequence trees, as chosen in the config.yaml. The sequences should be named within the files as L14_rplN, L16_rplP, L18_rplR, L2_rplB, L22_rplV, L24_rplX, L3_rplC, L4_rplD, L5_rplE, L6_rplF, S10_rpsJ, S17_rpsQ, S19_rpsS, S3_rpsC and S8_rpsH, respectively, and 1 should be placed in the ribo_names_field in the config.yaml. If they are named with the rpl or rps name first (such as rplN_L14), 0 should be placed in the config.yaml under ribo_names_field (see the short sketch after this list). Staphylococcus sequences are included in the GitHub folder as examples.
- Gather all of your genome sequences as annotated genbank files in the same directory.
- Create a file, “genus.txt”, containing the name of the genus and species you want to build a tree for, one per line. Note that the genus or species name should be enclosed in quotes.
- If you do not want to download any genomes from the sequence databases, create a file, “cleannames.txt”, containing the name of each genome file you want to include in your tree, one per line, and place “no” in the config.yaml under the download_genbank options. To use a mixture of downloaded and provided genomes, place “yes” in the config.yaml and do not provide a cleannames.txt file.
- To update a previous tree, collect the concatenated deduplicated fasta files that you want to update, and edit the config.yaml with those filenames.
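As noted in the first bullet above, ribo_names_field simply records which underscore-separated part of a sequence ID carries the rpl/rps name. A minimal sketch of that convention follows; the ribo_name helper is illustrative only and not a function from the workflow scripts.

```python
# Sketch: how ribo_names_field selects the rpl/rps name from a FASTA ID.
# ribo_name() is an illustrative helper, not part of the workflow.
def ribo_name(seq_id: str, ribo_names_field: int) -> str:
    """Return the underscore-separated field of the ID chosen by ribo_names_field."""
    return seq_id.split("_")[ribo_names_field]

print(ribo_name("L14_rplN", 1))  # IDs like L14_rplN -> set ribo_names_field to 1 -> 'rplN'
print(ribo_name("rplN_L14", 0))  # IDs like rplN_L14 -> set ribo_names_field to 0 -> 'rplN'
```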
Add Genbank files, ribosomal protein sequence files and a list of the files: if you do not want to download any genomes, add the names of all provided genbank-format files to cleannames.txt, one filename per line. The suffix of the genbank files should be .gbff (following the genbank download nomenclature), .gbk, .gb or .genbank. Add the names of your ribosomal sequence files, one per line, to atccs.txt. An easy way to generate these files is with `ls *gbff > cleannames.txt`.
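If you prefer not to use the shell one-liner, the two list files can also be written with a short Python sketch. The genbank suffixes come from the paragraph above; the assumption that the ribosomal protein fasta files end in .fasta is mine, so adjust the pattern to your filenames.

```python
# Sketch: write cleannames.txt (genbank files) and atccs.txt (ribosomal protein fastas),
# one filename per line. The *.fasta pattern for the ribosomal files is an assumption.
from pathlib import Path

genbank_suffixes = (".gbff", ".gbk", ".gb", ".genbank")

with open("cleannames.txt", "w") as out:
    for path in sorted(Path(".").iterdir()):
        if path.suffix in genbank_suffixes:
            out.write(path.name + "\n")

with open("atccs.txt", "w") as out:
    for path in sorted(Path(".").glob("*.fasta")):  # adjust to your ribosomal file names
        out.write(path.name + "\n")
```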
Example files are provided in the example_files folder, and files for a test run are in the testfiles folder.
Execute workflow:
Test your configuration by performing a dry run via `snakemake -n`.
Execute the workflow locally via `snakemake --cores $N`, using $N cores, or run it in a cluster environment, for example via `snakemake --use-conda --cluster qsub --jobs 100`.
Further information on running snakemake in a cluster environment can be found on the snakemake website
Code Snippets
```python
# Snakemake rule body: check that cleannames.txt is present (or create it) before the analysis starts
run:
    if 'yes' in params.required:
        shell("touch 'cleannames.txt'")
    else:
        if 'cleannames.txt' in input.files:
            pass
        else:
            print("Please create a file called cleannames.txt with the genbank files you want to include in the analysis, one filename per line")
    shell("touch 'logs/cleannamecheck.txt'")
```
```python
# Snakemake rule body: record GenBank filenames in cleannames.txt (when downloading)
# and convert each GenBank file to protein (.faa) or DNA (.ffn) FASTA
run:
    if 'yes' in params.required:
        outfile2 = open("cleannames.txt", 'w')
    for file in input.files:
        print("this is file", file)
        if 'yes' in params.required:
            outfile2.write(Path(file).name + "\n")
        if 'protein' in params.type:
            shell("python workflow/scripts/gbk2faa.py {file}")
        elif 'dna' in params.type:
            shell("python workflow/scripts/gbk2ffn.py {file}")
    shell("touch 'logs/conversion_complete.txt'")
```
```python
# Snakemake rule body: map downloaded RefSeq (GCF) filenames to their DEFINITION lines
run:
    if 'yes' in params.required:
        shell("for file in GCF*gbff; do echo $file; grep DEFINITION $file; done > whichsequenceiswhich.txt")
        shell("python workflow/scripts/makeconsdatabasemapping.py > Allnamesmapoverdatabase.txt")
    else:
        shell("touch 'Allnamesmapoverdatabase.txt'")
```
```python
# Snakemake rule body: extract the ribosomal protein sequences from each listed GenBank file
run:
    inputs = [lin.rstrip() for lin in open(''.join(input.check), 'r')]
    for cleanname in inputs:
        shell("python workflow/scripts/extract_ribo_seqs_from_gbk.py {cleanname} {params}")
    shell("touch 'logs/extracted_complete.txt'")
```
```python
# Snakemake rule body: concatenate the extracted ribosomal sequences per genome
run:
    shell("python workflow/scripts/ribo_concat_diamond.py {input.extracted} {params.protein_dna}")
```
```python
# Snakemake rule body: combine the concatenated sequence files into a single FASTA
shell:
    "(cat {input.cats} {input.newput} {input.nextput} > {output}) 2>>log"
```
```python
# Snakemake rule body: remove duplicated IDs and check for excluded strains
run:
    shell("python workflow/scripts/check_not_duplicatedIDs.py {input}")
    shell("python workflow/scripts/check_for_excluded_strains.py")
```
```python
# Snakemake rule body: align the concatenated sequences with MAFFT
run:
    shell("(mafft --retree 3 --maxiterate 3 --thread {threads} {input} > {output}) 2>>log")
```
```python
# Snakemake rule body: build the tree with IQ-TREE (GTR model for DNA, LG for protein) or FastTree
run:
    if params.tree_type == 'iqtree':
        if params.protein_dna == 'dna':
            shell("(iqtree -s {input} -m GTR -bb 1000 -alrt 1000 -nt {threads} > {output}) 2>>log")
        else:
            shell("(iqtree -s {input} -m LG -bb 1000 -alrt 1000 -nt {threads} > {output}) 2>>log")
    else:
        shell("(fasttree < {input} > {output}) 2>>log")
```
```python
# Snakemake rule body: write a short workflow report, counting the strains in the alignment
run:
    from snakemake.utils import report
    with open(input[0]) as aln:
        n_strains = sum(1 for a in aln if a.startswith('>'))
    report("""A workflow initially formulated for downloading Non-aureus Staph genomes, processing to extract ribosomal protein sequences, deduplicating, aligning and creating a phylogenetic tree""", output[0], T1=input[0])
```
```python
# Snakemake rule body: build a DIAMOND database from each ribosomal protein FASTA
shell:
    "diamond makedb --in {input.samp} --db {output}"
```
```python
# Snakemake rule body: for strains missing ribosomal proteins, search each DIAMOND database
# with blastp (protein input) or blastx (DNA input)
run:
    if os.stat('strains_missing_ribos.txt').st_size == 0:
        shell("touch 'logs/blast_complete.txt'")
        shell("touch 'logs/collect_hits.log'")
        shell("touch 'logs/concatenate.log'")
    else:
        if params.protein_dna == 'protein':
            for ribo, name in [(ribo, name) for ribo in SAMPLES for name in [line.rstrip() for line in open('strains_missing_ribos.txt', 'r')]]:
                shell("diamond blastp --query {0}.faa --db {1} --outfmt 6 --max-target-seqs 1 --out {1}.{0}.out".format(name, ribo))
            shell("touch 'logs/blast_complete.txt'")
        else:
            for ribo, name in [(ribo, name) for ribo in SAMPLES for name in [line.rstrip() for line in open('strains_missing_ribos.txt', 'r')]]:
                shell("diamond blastx --query {0}.ffn --db {1} --outfmt 6 --max-target-seqs 1 --out {1}.{0}.out".format(name, ribo))
            shell("touch 'logs/blast_complete.txt'")
```
```python
# Snakemake rule body: collect the best DIAMOND hit for each searched genome
run:
    print("number of input files", len(input.infiles))
    if len(input.infiles) > 0:
        for inp in input.infiles:
            print("input is", inp)
            shell(f"python workflow/scripts/collect_from_diamond_blast.py {inp} {params.protein_dna}")
    else:
        shell("touch {output}")
```
```python
# Snakemake rule body: concatenate the sequences recovered by DIAMOND
shell:
    "python workflow/scripts/ribo_concat_diamond.py {input.infile} {params.protein_dna}"
```
```python
# Snakemake rule body: download GenBank genomes for each genus listed in genus.txt
run:
    if 'yes' in params.required:
        for gen in [line.rstrip() for line in open('genus.txt', 'r')]:
            shell(f"ncbi-genome-download -g {gen} bacteria --parallel {threads}")
    shell("touch 'logs/completed.txt'")
```
```python
# Snakemake rule body: copy the downloaded .gbff.gz files into the working directory
run:
    if 'yes' in params.required:
        files = glob.glob("**/*gbff.gz", recursive=True)
        outfile = open("logs/newroot.txt", 'w')
        path = os.getcwd()
        for file in files:
            print(Path(file).name)
            shutil.copy(file, path)
            outfile.write(Path(file).name + "\n")
    else:
        shell("touch 'logs/newroot.txt'")
```
```python
# Snakemake rule body: decompress the downloaded GenBank files
run:
    if 'yes' in params.required:
        shell("gunzip *gbff.gz")
        shell("touch 'logs/gunzip_complete.txt'")
    else:
        shell("touch 'logs/gunzip_complete.txt'")
```
```python
# Compare cleannames.txt with finalnames.txt and write the strains excluded
# from the final set to excluded_strains.txt
import sys
import re


def check_excluded_strains(inlist, finlist):
    results = []
    for fin in finlist:
        pattern = re.compile(fin)
        for match in [inl for inl in inlist if pattern.match(inl)]:
            print(match)
            results.append(match)
    return results


infile = open("cleannames.txt", 'r')
excl = open("finalnames.txt", 'r')

ids = []
inlist = [lin.rstrip() for lin in infile]
finlist = [line.rstrip() for line in excl]

results = check_excluded_strains(inlist, finlist)
res = set(inlist) - set(results)

outfile = open("excluded_strains.txt", 'w')
for exc in res:
    outfile.write("{}\n".format(exc))
```
```python
# Remove records with duplicated IDs from a FASTA file, writing the deduplicated
# sequences to <input>dedupe.fasta and the retained IDs to finalnames.txt
def check_for_duplicates(infile):
    import sys
    import re
    from Bio import SeqIO
    handle = open(infile, 'r')
    records = list(SeqIO.parse(handle, "fasta"))
    sequences = []
    outfile = open(sys.argv[1] + "dedupe.fasta", 'w')
    final_strain_list = open("finalnames.txt", 'w')
    seen_before = []

    def checkers(element, list):
        for l in list:
            if element in l:
                return l

    for rec in records:
        if 'GCF' in rec.id:
            shortid = rec.id.split('.')
            id = shortid[0]
            newid = id
        else:
            newid = rec.id
        if newid in seen_before:
            regex = re.compile(str(newid))
            print("viewed previously", newid, checkers(newid, seen_before))
        else:
            sequences.append(rec)
            seen_before.append(newid)
    SeqIO.write(sequences, outfile, "fasta")
    for see in seen_before:
        final_strain_list.write("{}\n".format(see))


import sys
check_for_duplicates(sys.argv[1])
```
```python
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from operator import itemgetter
from toolz import compose
import csv
import datetime
import sys
import re

handle = open(sys.argv[1], 'r')
reader = csv.reader(open(sys.argv[1]), delimiter='\t')
infile_name = sys.argv[1].split('.')
if 'protein' in sys.argv[2]:
    inf = '.'.join(infile_name[2:-1]) + ".faa"
elif 'dna' in sys.argv[2]:
    inf = '.'.join(infile_name[2:-1]) + ".ffn"
ribo_names_field = 1
#print("infile name is", inf)
infile = open(inf, 'r')
keeps = []
d = {}
i = 0
j = []
outname = datetime.date.today().strftime("%d-%m-%y") + "recovered.fasta"
#print("outfile name", outname)


def check_for_more_than_one(handle):
    for line in infile:
        elements = line.rstrip().split()
        j.append(float(elements[-1]))
    maxj = max(j)
    if j.count(maxj) > 1:
        return False
    else:
        return True


d = {}
for i, line in enumerate(sorted(reader, key=compose(float, itemgetter(2)), reverse=True)):
    # print(line)
    if check_for_more_than_one == False:
        print("there's more than one with the same score!!!")
    else:
        if i > 0:
            pass
        else:
            try:
                d[line[0]].append(line[1])
            except:
                d[line[0]] = line[1]
            keeps.append(line[0])

sequences = []
records = list(SeqIO.parse(infile, "fasta"))
outfile = open(outname, 'a+')
## uncomment print statements for debugging
for rec in records:
    try:
        if rec.id in keeps:
            # print("Hit identified", rec.id, keeps)
            shortid = rec.id.split('_')
            newid = "{}|{}".format(d[rec.id], '.'.join(infile_name[2:-1]))
            # print("this is newid", newid, "here")
            new_rec = SeqRecord(rec.seq, id=newid, description="")
            # print("new_rec", new_rec)
            sequences.append(new_rec)
    except:
        pass
#print("length of sequences", len(sequences))
SeqIO.write(sequences, outfile, "fasta")
```
```python
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.SeqFeature import SeqFeature
import sys
import datetime

ribosomal_proteins = ['L14','L16','L18','L2','L22','L24','L3','L4','L5','L6','S10','S17','S19','S3','S8','L15']
ribo_dic = {'L14': 'rplN', 'L16': 'rplP', 'L18': 'rplR', 'L2': 'rplB', 'L22': 'rplV', 'L24': 'rplX', 'L3': 'rplC', 'L4': 'rplD', 'L5': 'rplE', 'L6': 'rplF', 'S10': 'rpsJ', 'S17': 'rpsQ', 'S19': 'rpsS', 'S3': 'rpsC', 'S8': 'rpsH', 'L15': 'rplO'}

if sys.argv[1].endswith('.gz'):
    filehandle = sys.argv[1][:-3]
    print(filehandle)
else:
    filehandle = sys.argv[1]
handle = open(filehandle, 'r')
params = sys.argv[2]
print("parameters for protein or dna are:", params)
outname = datetime.date.today().strftime("%d-%m-%y") + "extracted.fasta"
records = list(SeqIO.parse(handle, "genbank"))
seen_before = []
sequences = []

# Loop through annotation to look for an annotation match
# Save correct matches to file
ribo_count = 0
for rec in records:
    for feature in rec.features:
        if feature.type == "CDS":
            try:
                if 'ribosomal protein' in ''.join(feature.qualifiers['product']):
                    print(feature.qualifiers['locus_tag'], feature.qualifiers['product'])
                    print("ribo_count is", ribo_count, "seen_before", seen_before)
                    if ribo_count > 16:
                        with open('strains_missing_ribos.txt', 'a') as f:
                            print("breaking out")
                            f.write("{}\n".format(filehandle))
                        break
                    else:
                        elements = ''.join(feature.qualifiers['product']).split()
                        for rib in ribosomal_proteins:
                            if elements[-1] == rib:
                                ribo_count += 1
                                if rib in seen_before:
                                    with open('strains_missing_ribos.txt', 'a') as f:
                                        print("breaking out as seen before")
                                        f.write("{}\n".format(filehandle))
                                    break
                                seen_before.append(rib)
                                # print("annotation match")
                                if params.lower() == 'dna':
                                    # DNA parameters so extract the DNA sequence and save to a file with the appropriate ID
                                    newrec = SeqRecord(feature.location.extract(rec.seq), id="{}_{}|{}".format(elements[-1], ribo_dic[elements[-1]], filehandle), description="")
                                    print("length of the translate", len(feature.location.extract(rec.seq).translate(to_stop=True)), "seq length", len(feature.location.extract(rec.seq))//3)
                                    if len(feature.location.extract(rec.seq).translate(to_stop=True)) < (len(feature.location.extract(rec.seq))//3) - 1:
                                        pass
                                    else:
                                        sequences.append(newrec)
                                elif params.lower() == 'protein':
                                    # protein parameters so extract the protein sequence (by translating the DNA sequence) and save to file with appropriate ID
                                    newrec = SeqRecord(feature.location.extract(rec.seq).translate(to_stop=True), id="{}_{}|{}".format(elements[-1], ribo_dic[elements[-1]], filehandle), description="")
                                    print('>{}\n{}'.format(newrec.id, newrec.seq))
                                    sequences.append(newrec)
                                else:
                                    print("unknown parameters for DNA or protein, please choose either dna or protein")
            except:
                print("problem {}".format(feature))

outfile = open(outname, 'a+')
# save chosen sequences to file with date stamp
SeqIO.write(sequences, outfile, "fasta")
```
```python
# Convert a GenBank file to a protein FASTA (.faa), translating CDS features
# that lack a translation qualifier
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.SeqFeature import SeqFeature, FeatureLocation
import sys

infile = sys.argv[1]
gbk_filename = infile
faa_filename = "{}.faa".format(gbk_filename)
input_handle = open(gbk_filename, "rt")
output_handle = open(faa_filename, "w")

for seq_record in SeqIO.parse(input_handle, "genbank"):
    print("Dealing with GenBank record {}".format(seq_record.id))
    for seq_feature in seq_record.features:
        if seq_feature.type == "CDS":
            try:
                assert len(seq_feature.qualifiers['translation']) == 1
                output_handle.write(">{} from {}\n{}\n".format(
                    seq_feature.qualifiers['locus_tag'][0],
                    seq_record.name,
                    seq_feature.qualifiers['translation'][0]))
            except:
                if seq_feature.location.strand == 1:
                    seq_feature.qualifiers['translation'] = seq_record.seq[seq_feature.location.start:seq_feature.location.end].translate(to_stop=True)
                else:
                    seq_feature.qualifiers['translation'] = seq_record.seq[seq_feature.location.start:seq_feature.location.end].reverse_complement().translate(to_stop=True)
                output_handle.write(">{} from {}\n{}\n".format(seq_feature.qualifiers['locus_tag'][0], seq_record.name, seq_feature.qualifiers['translation']))

output_handle.close()
input_handle.close()
```
```python
# Convert a GenBank file to a DNA FASTA of its CDS features (.ffn)
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
from Bio.SeqFeature import SeqFeature
import sys

handle = open(sys.argv[1], 'rt')
outfile = open(sys.argv[1] + ".ffn", 'w')
sequences = []
records = list(SeqIO.parse(handle, "genbank"))

for rec in records:
    print("Dealing with genbank record {}".format(rec.id))
    for feature in rec.features:
        if feature.type == "CDS":
            if feature.location.strand == -1:
                ffn = rec.seq[feature.location.start:feature.location.end].reverse_complement()
                newid = ''.join(feature.qualifiers['locus_tag'])
                newdescription = ''.join(feature.qualifiers['product'])
                sequences.append(SeqRecord(ffn, id=newid, description=newdescription))
            elif feature.location.strand == 1:
                ffn = rec.seq[feature.location.start:feature.location.end]
                newid = ''.join(feature.qualifiers['locus_tag'])
                newdescription = ''.join(feature.qualifiers['product'])
                sequences.append(SeqRecord(ffn, id=newid, description=newdescription))

SeqIO.write(sequences, sys.argv[1] + ".ffn", "fasta")
```
```python
# Pair each downloaded GCF filename with the species name taken from its
# DEFINITION line and print a tab-separated mapping
handle = open("whichsequenceiswhich.txt", 'r')
file = []
species = []
i = 0
for line in handle:
    if line.startswith('GCF'):
        filename = line.rstrip()
        fil = filename.split('_')
        shortname = "GCF_" + fil[1]
        file.append(filename)
        i = 0
    elif line.startswith('DEFINITION'):
        i += 1
        if i > 1:
            pass
        else:
            elements = line.split()
            species.append("{}_{}".format(shortname, elements[2]))
    else:
        print("odd error")

mapover = zip(file, species)
for m in mapover:
    print("{}".format('\t'.join(m)))
```
```python
def collect_expected_seqlengths(inf_ribos, prot_dna):
    from Bio import SeqIO
    dicty = {}
    lens = []
    ribos = []
    for inf in inf_ribos:
        print("inf here", inf)
        inf_handle = open(inf, 'r')
        inf_rec = SeqIO.read(inf_handle, 'fasta')
        inf_id_split = inf_rec.id.split('_')
        inf_id = inf_id_split[1]
        print("inf_id", inf_id, inf_id_split)
        if prot_dna == 'protein':
            seqlen = len(inf_rec.seq)
        elif prot_dna == 'dna':
            seqlen = 3 * len(inf_rec.seq)
        try:
            dicty[inf_id].append(int(seqlen))
        except:
            dicty[inf_id] = int(seqlen)
    print(dicty)
    return dicty


def concatenate_diamond_matches(infile, prot_dna):
    import collections
    import datetime
    import os
    from Bio import SeqIO
    handle = open(infile, 'r')
    inf_two = open('atccs.txt', 'r')
    inf_ribos = [lin.rstrip() for lin in inf_two]
    dicty = collect_expected_seqlengths(inf_ribos, prot_dna)
    ribo_names_field = 1
    print("ribo_names_field", ribo_names_field)
    d = collections.defaultdict(dict)
    records = list(SeqIO.parse(handle, "fasta"))
    outf_strains = set()
    ribosomal_proteins = ['rplN','rplP','rplR','rplB','rplV','rplX','rplC','rplD','rplE','rplF','rpsJ','rpsQ','rpsS','rpsC','rpsH','rplO']
    outfile_strains = open("strains_missing_ribos.txt", 'a+')
    ## uncomment print statements for debugging, ID changes can cause concatenation to fail
    # Loop through annotation and create a dictionary containing each ribosomal sequence per isolate genome
    for rec in records:
        id = rec.id.split('|')
        # print("id to split on |", id)
        shortid = id[0].split('_')
        # print("shortid to split on _", shortid)
        newid = id[1]
        # print("newid", newid)
        rib = shortid[int(ribo_names_field)]
        # print("rib", rib)
        seqy = str(rec.seq)
        if seqy.endswith('*'):
            # remove the stop character sometimes included in protein translation files
            newseq = seqy[:-1]
        else:
            newseq = seqy
        for ribo in ribosomal_proteins:
            if rib == ribo:
                # check for exact match
                if len(newseq) - 20 < dicty[rib] < len(newseq) + 20:
                    try:
                        d[newid][ribo].append(newseq)
                    except:
                        d[newid][ribo] = [newseq]
                    # print(d[newid][ribo])
                    if len(d[newid][ribo]) > 1:
                        print("duplicate annotation, will search with diamond blast", ribo, newid)
                        del d[newid][ribo]
                else:
                    print("length is incorrect for {}".format(rib), len(newseq), dicty[rib])
                    outf_strains.add("{}".format(newid))
    # create datestamp file
    outname = datetime.date.today().strftime("%d-%m-%y") + "concatenated_ribosomal_proteins_db.fasta"
    if os.path.exists(outname):
        # if files have been collected from the annotation and we are now collecting from the diamond blast process
        outname = outname + "_2"
    else:
        # mark strains for the diamond blast process if they were missing some sequences
        pass
    outfile = open(outname, 'a+')
    for key, value in d.items():
        # test we have all 16 sequences, otherwise mark genome for diamond blast searches or ignore
        if len(value) == 16:
            joinstring = "{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}".format(''.join(d[key]['rplN']), ''.join(d[key]['rplP']), ''.join(d[key]['rplR']), ''.join(d[key]['rplB']), ''.join(d[key]['rplV']), ''.join(d[key]['rplX']), ''.join(d[key]['rplC']), ''.join(d[key]['rplD']), ''.join(d[key]['rplE']), ''.join(d[key]['rplF']), ''.join(d[key]['rpsJ']), ''.join(d[key]['rpsQ']), ''.join(d[key]['rpsS']), ''.join(d[key]['rpsC']), ''.join(d[key]['rpsH']), ''.join(d[key]['rplO']))
            outfile.write(">{}\n".format(key))
            outfile.write(joinstring + "\n")
        else:
            print(key, "Missing some ribosomal proteins, will search with Diamond Blast", len(value))
            outf_strains.add("{}".format(key))
    for stray in outf_strains:
        outfile_strains.write('{}\n'.format(stray))


import sys
concatenate_diamond_matches(sys.argv[1], sys.argv[2])
```
Support
- Future updates