Ribosomal Protein Phylogenetic Tree Generator using Snakemake
ribosomal_snakemake
**NEW VERSION AVAILABLE**: much faster, but it relies on HMMER and therefore runs only on x86_64.
# Ribosomal protein tree generation - Python Snakemake version

A workflow to generate ribosomal protein phylogenetic trees using 15 ribosomal protein sequences. Either protein or DNA sequences can be used to build the tree; DNA sequences may be preferable for more closely related genomes, and protein sequences for genomes that are more divergent.
For further information on usage and tutorial please see: https://lcrossman.github.io/docs/html/index.html
Requirements:
snakemake>=3.5
python>=3.3
biopython>=1.77
toolz>=0.11.1
diamond BLAST version>=0.9.32
mafft>=7.429
fasttree>=2.1.10 OR iqtree>=1.6.11
ncbi-genome-download>=0.3.0
Python scripts from this GitHub repository
An environment.yaml for conda is included in the env_yaml directory if required.
Installation via conda:
Use conda to install Snakemake in a Linux or macOS environment. Install conda: https://conda.io/en/latest/miniconda.html
Install dependencies via conda.
Install snakemake:
conda install -c conda-forge mamba
mamba create -c conda-forge -c bioconda -n snakemake snakemake
conda activate snakemake
snakemake --help
Install the ribosomal protein tree workflow from github:
git clone XXX
cd XXX
cd testfiles
snakemake -n
Usage
Configure workflow:
Configure the workflow according to your needs by editing the file config.yaml. For use without any sequence database downloads, you need to provide the files and pathnames listed in cleannames.txt and atccs.txt.
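A quick way to confirm that the options discussed below are set is to load the configuration in Python. This is only a sketch: the key names (download_genbank, ribo_names_field, protein_dna, tree_type) are assumed from this README and the rule code, so adjust them to match your copy of config.yaml.

```python
# Sketch only: list the config.yaml options referred to in this README.
# The key names below are assumptions taken from the text and the rule code;
# check them against the config.yaml shipped with the workflow.
import yaml

with open("config.yaml") as fh:
    config = yaml.safe_load(fh)

for key in ("download_genbank", "ribo_names_field", "protein_dna", "tree_type"):
    print(key, "=", config.get(key, "<not set>"))
```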
- Gather your ribosomal protein sequences as protein sequence fasta files in 15 separate files. These will be used for either protein or DNA sequence trees, as chosen in the config.yaml. The sequences should be named within the files as L14_rplN, L16_rplP, L18_rplR, L2_rplB, L22_rplV, L24_rplX, L3_rplC, L4_rplD, L5_rplE, L6_rplF, S10_rpsJ, S17_rpsQ, S19_rpsS, S3_rpsC and S8_rpsH, respectively, and 1 should be placed in the ribo_names_field in the config.yaml. If they are named with the rpl or rps name first (such as rplN_L14), 0 should be placed in the config.yaml under ribo_names_field (see the short sketch after this list). Staphylococcus sequences are included in the GitHub folder as examples.
- Gather all of your genome sequences as annotated genbank files in the same directory.
- Create a file, “genus.txt”, containing the name of the genus and species you want to build a tree for, one per line. Note that the genus or species name should be enclosed in quotes.
- If you do not want to download any genomes from the sequence databases, create a file, “cleannames.txt”, containing the name of each genome file you want to include in your tree, one per line, and place “no” in the config.yaml under the download_genbank options. To use a mixture of downloaded and provided genomes, place “yes” in the config.yaml and do not provide a cleannames.txt file.
- To update a previous tree, collect the concatenated deduplicated fasta files that you want to update, and edit the config.yaml with those filenames.
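As noted in the first bullet above, ribo_names_field simply records which underscore-separated part of a sequence ID carries the rpl/rps name. A minimal sketch of that convention follows; the ribo_name helper is illustrative only and not a function from the workflow scripts.

```python
# Sketch: how ribo_names_field selects the rpl/rps name from a FASTA ID.
# ribo_name() is an illustrative helper, not part of the workflow.
def ribo_name(seq_id: str, ribo_names_field: int) -> str:
    """Return the underscore-separated field of the ID chosen by ribo_names_field."""
    return seq_id.split("_")[ribo_names_field]

print(ribo_name("L14_rplN", 1))  # IDs like L14_rplN -> set ribo_names_field to 1 -> 'rplN'
print(ribo_name("rplN_L14", 0))  # IDs like rplN_L14 -> set ribo_names_field to 0 -> 'rplN'
```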
Add Genbank files, ribosomal protein sequence files and a list of the files: if you do not want to download any genomes, add the names of all provided genbank-format files to cleannames.txt, one filename per line. The suffix of the genbank files should be .gbff (following the genbank download nomenclature), .gbk, .gb or .genbank. Add the names of your ribosomal sequence files, one per line, to atccs.txt. An easy way to generate these files is with `ls *gbff > cleannames.txt`.
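If you prefer not to use the shell one-liner, the two list files can also be written with a short Python sketch. The genbank suffixes come from the paragraph above; the assumption that the ribosomal protein fasta files end in .fasta is mine, so adjust the pattern to your filenames.

```python
# Sketch: write cleannames.txt (genbank files) and atccs.txt (ribosomal protein fastas),
# one filename per line. The *.fasta pattern for the ribosomal files is an assumption.
from pathlib import Path

genbank_suffixes = (".gbff", ".gbk", ".gb", ".genbank")

with open("cleannames.txt", "w") as out:
    for path in sorted(Path(".").iterdir()):
        if path.suffix in genbank_suffixes:
            out.write(path.name + "\n")

with open("atccs.txt", "w") as out:
    for path in sorted(Path(".").glob("*.fasta")):  # adjust to your ribosomal file names
        out.write(path.name + "\n")
```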
Example files are provided in the example_files folder, and files for a test run are in the testfiles folder.
Execute workflow:
Test your configuration by performing a dry run via `snakemake -n`.
Execute the workflow locally via `snakemake --cores $N`, using $N cores, or run it in a cluster environment, for example via `snakemake --use-conda --cluster qsub --jobs 100`.
Further information on running snakemake in a cluster environment can be found on the snakemake website
Code Snippets
```python
# Snakemake rule body: check that cleannames.txt is present (or create it) before the analysis starts
run:
    if 'yes' in params.required:
        shell("touch 'cleannames.txt'")
    else:
        if 'cleannames.txt' in input.files:
            pass
        else:
            print("Please create a file called cleannames.txt with the genbank files you want to include in the analysis, one filename per line")
    shell("touch 'logs/cleannamecheck.txt'")
```
```python
# Snakemake rule body: record GenBank filenames in cleannames.txt (when downloading)
# and convert each GenBank file to protein (.faa) or DNA (.ffn) FASTA
run:
    if 'yes' in params.required:
        outfile2 = open("cleannames.txt", 'w')
    for file in input.files:
        print("this is file", file)
        if 'yes' in params.required:
            outfile2.write(Path(file).name + "\n")
        if 'protein' in params.type:
            shell("python workflow/scripts/gbk2faa.py {file}")
        elif 'dna' in params.type:
            shell("python workflow/scripts/gbk2ffn.py {file}")
    shell("touch 'logs/conversion_complete.txt'")
```
```python
# Snakemake rule body: map downloaded RefSeq (GCF) filenames to their DEFINITION lines
run:
    if 'yes' in params.required:
        shell("for file in GCF*gbff; do echo $file; grep DEFINITION $file; done > whichsequenceiswhich.txt")
        shell("python workflow/scripts/makeconsdatabasemapping.py > Allnamesmapoverdatabase.txt")
    else:
        shell("touch 'Allnamesmapoverdatabase.txt'")
```
```python
# Snakemake rule body: extract the ribosomal protein sequences from each listed GenBank file
run:
    inputs = [lin.rstrip() for lin in open(''.join(input.check), 'r')]
    for cleanname in inputs:
        shell("python workflow/scripts/extract_ribo_seqs_from_gbk.py {cleanname} {params}")
    shell("touch 'logs/extracted_complete.txt'")
```
```python
# Snakemake rule body: concatenate the extracted ribosomal sequences per genome
run:
    shell("python workflow/scripts/ribo_concat_diamond.py {input.extracted} {params.protein_dna}")
```
```python
# Snakemake rule body: combine the concatenated sequence files into a single FASTA
shell:
    "(cat {input.cats} {input.newput} {input.nextput} > {output}) 2>>log"
```
```python
# Snakemake rule body: remove duplicated IDs and check for excluded strains
run:
    shell("python workflow/scripts/check_not_duplicatedIDs.py {input}")
    shell("python workflow/scripts/check_for_excluded_strains.py")
```
```python
# Snakemake rule body: align the concatenated sequences with MAFFT
run:
    shell("(mafft --retree 3 --maxiterate 3 --thread {threads} {input} > {output}) 2>>log")
```
```python
# Snakemake rule body: build the tree with IQ-TREE (GTR model for DNA, LG for protein) or FastTree
run:
    if params.tree_type == 'iqtree':
        if params.protein_dna == 'dna':
            shell("(iqtree -s {input} -m GTR -bb 1000 -alrt 1000 -nt {threads} > {output}) 2>>log")
        else:
            shell("(iqtree -s {input} -m LG -bb 1000 -alrt 1000 -nt {threads} > {output}) 2>>log")
    else:
        shell("(fasttree < {input} > {output}) 2>>log")
```
```python
# Snakemake rule body: write a short workflow report, counting the strains in the alignment
run:
    from snakemake.utils import report
    with open(input[0]) as aln:
        n_strains = sum(1 for a in aln if a.startswith('>'))
    report("""A workflow initially formulated for downloading Non-aureus Staph genomes, processing to extract ribosomal protein sequences, deduplicating, aligning and creating a phylogenetic tree""", output[0], T1=input[0])
```
```python
# Snakemake rule body: build a DIAMOND database from each ribosomal protein FASTA
shell:
    "diamond makedb --in {input.samp} --db {output}"
```
```python
# Snakemake rule body: for strains missing ribosomal proteins, search each DIAMOND database
# with blastp (protein input) or blastx (DNA input)
run:
    if os.stat('strains_missing_ribos.txt').st_size == 0:
        shell("touch 'logs/blast_complete.txt'")
        shell("touch 'logs/collect_hits.log'")
        shell("touch 'logs/concatenate.log'")
    else:
        if params.protein_dna == 'protein':
            for ribo, name in [(ribo, name) for ribo in SAMPLES for name in [line.rstrip() for line in open('strains_missing_ribos.txt', 'r')]]:
                shell("diamond blastp --query {0}.faa --db {1} --outfmt 6 --max-target-seqs 1 --out {1}.{0}.out".format(name, ribo))
            shell("touch 'logs/blast_complete.txt'")
        else:
            for ribo, name in [(ribo, name) for ribo in SAMPLES for name in [line.rstrip() for line in open('strains_missing_ribos.txt', 'r')]]:
                shell("diamond blastx --query {0}.ffn --db {1} --outfmt 6 --max-target-seqs 1 --out {1}.{0}.out".format(name, ribo))
            shell("touch 'logs/blast_complete.txt'")
```
```python
# Snakemake rule body: collect the best DIAMOND hit for each searched genome
run:
    print("number of input files", len(input.infiles))
    if len(input.infiles) > 0:
        for inp in input.infiles:
            print("input is", inp)
            shell(f"python workflow/scripts/collect_from_diamond_blast.py {inp} {params.protein_dna}")
    else:
        shell("touch {output}")
```
```python
# Snakemake rule body: concatenate the sequences recovered by DIAMOND
shell:
    "python workflow/scripts/ribo_concat_diamond.py {input.infile} {params.protein_dna}"
```
```python
# Snakemake rule body: download GenBank genomes for each genus listed in genus.txt
run:
    if 'yes' in params.required:
        for gen in [line.rstrip() for line in open('genus.txt', 'r')]:
            shell(f"ncbi-genome-download -g {gen} bacteria --parallel {threads}")
    shell("touch 'logs/completed.txt'")
```
```python
# Snakemake rule body: copy the downloaded .gbff.gz files into the working directory
run:
    if 'yes' in params.required:
        files = glob.glob("**/*gbff.gz", recursive=True)
        outfile = open("logs/newroot.txt", 'w')
        path = os.getcwd()
        for file in files:
            print(Path(file).name)
            shutil.copy(file, path)
            outfile.write(Path(file).name + "\n")
    else:
        shell("touch 'logs/newroot.txt'")
```
```python
# Snakemake rule body: decompress the downloaded GenBank files
run:
    if 'yes' in params.required:
        shell("gunzip *gbff.gz")
        shell("touch 'logs/gunzip_complete.txt'")
    else:
        shell("touch 'logs/gunzip_complete.txt'")
```
```python
# Compare cleannames.txt with finalnames.txt and write the strains excluded
# from the final set to excluded_strains.txt
import sys
import re


def check_excluded_strains(inlist, finlist):
    results = []
    for fin in finlist:
        pattern = re.compile(fin)
        for match in [inl for inl in inlist if pattern.match(inl)]:
            print(match)
            results.append(match)
    return results


infile = open("cleannames.txt", 'r')
excl = open("finalnames.txt", 'r')

ids = []
inlist = [lin.rstrip() for lin in infile]
finlist = [line.rstrip() for line in excl]

results = check_excluded_strains(inlist, finlist)
res = set(inlist) - set(results)

outfile = open("excluded_strains.txt", 'w')
for exc in res:
    outfile.write("{}\n".format(exc))
```
```python
# Remove records with duplicated IDs from a FASTA file, writing the deduplicated
# sequences to <input>dedupe.fasta and the retained IDs to finalnames.txt
def check_for_duplicates(infile):
    import sys
    import re
    from Bio import SeqIO
    handle = open(infile, 'r')
    records = list(SeqIO.parse(handle, "fasta"))
    sequences = []
    outfile = open(sys.argv[1] + "dedupe.fasta", 'w')
    final_strain_list = open("finalnames.txt", 'w')
    seen_before = []

    def checkers(element, list):
        for l in list:
            if element in l:
                return l

    for rec in records:
        if 'GCF' in rec.id:
            shortid = rec.id.split('.')
            id = shortid[0]
            newid = id
        else:
            newid = rec.id
        if newid in seen_before:
            regex = re.compile(str(newid))
            print("viewed previously", newid, checkers(newid, seen_before))
        else:
            sequences.append(rec)
            seen_before.append(newid)
    SeqIO.write(sequences, outfile, "fasta")
    for see in seen_before:
        final_strain_list.write("{}\n".format(see))


import sys
check_for_duplicates(sys.argv[1])
```
```python
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from operator import itemgetter
from toolz import compose
import csv
import datetime
import sys
import re

handle = open(sys.argv[1], 'r')
reader = csv.reader(open(sys.argv[1]), delimiter='\t')
infile_name = sys.argv[1].split('.')
if 'protein' in sys.argv[2]:
    inf = '.'.join(infile_name[2:-1]) + ".faa"
elif 'dna' in sys.argv[2]:
    inf = '.'.join(infile_name[2:-1]) + ".ffn"
ribo_names_field = 1
#print("infile name is", inf)
infile = open(inf, 'r')
keeps = []
d = {}
i = 0
j = []
outname = datetime.date.today().strftime("%d-%m-%y") + "recovered.fasta"
#print("outfile name", outname)


def check_for_more_than_one(handle):
    for line in infile:
        elements = line.rstrip().split()
        j.append(float(elements[-1]))
    maxj = max(j)
    if j.count(maxj) > 1:
        return False
    else:
        return True


d = {}
for i, line in enumerate(sorted(reader, key=compose(float, itemgetter(2)), reverse=True)):
    # print(line)
    if check_for_more_than_one == False:
        print("there's more than one with the same score!!!")
    else:
        if i > 0:
            pass
        else:
            try:
                d[line[0]].append(line[1])
            except:
                d[line[0]] = line[1]
            keeps.append(line[0])

sequences = []
records = list(SeqIO.parse(infile, "fasta"))
outfile = open(outname, 'a+')
## uncomment print statements for debugging
for rec in records:
    try:
        if rec.id in keeps:
            # print("Hit identified", rec.id, keeps)
            shortid = rec.id.split('_')
            newid = "{}|{}".format(d[rec.id], '.'.join(infile_name[2:-1]))
            # print("this is newid", newid, "here")
            new_rec = SeqRecord(rec.seq, id=newid, description="")
            # print("new_rec", new_rec)
            sequences.append(new_rec)
    except:
        pass
#print("length of sequences", len(sequences))
SeqIO.write(sequences, outfile, "fasta")
```
```python
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.SeqFeature import SeqFeature
import sys
import datetime

ribosomal_proteins = ['L14','L16','L18','L2','L22','L24','L3','L4','L5','L6','S10','S17','S19','S3','S8','L15']
ribo_dic = {'L14': 'rplN', 'L16': 'rplP', 'L18': 'rplR', 'L2': 'rplB', 'L22': 'rplV', 'L24': 'rplX', 'L3': 'rplC', 'L4': 'rplD', 'L5': 'rplE', 'L6': 'rplF', 'S10': 'rpsJ', 'S17': 'rpsQ', 'S19': 'rpsS', 'S3': 'rpsC', 'S8': 'rpsH', 'L15': 'rplO'}

if sys.argv[1].endswith('.gz'):
    filehandle = sys.argv[1][:-3]
    print(filehandle)
else:
    filehandle = sys.argv[1]
handle = open(filehandle, 'r')
params = sys.argv[2]
print("parameters for protein or dna are:", params)
outname = datetime.date.today().strftime("%d-%m-%y") + "extracted.fasta"
records = list(SeqIO.parse(handle, "genbank"))
seen_before = []
sequences = []

# Loop through annotation to look for an annotation match
# Save correct matches to file
ribo_count = 0
for rec in records:
    for feature in rec.features:
        if feature.type == "CDS":
            try:
                if 'ribosomal protein' in ''.join(feature.qualifiers['product']):
                    print(feature.qualifiers['locus_tag'], feature.qualifiers['product'])
                    print("ribo_count is", ribo_count, "seen_before", seen_before)
                    if ribo_count > 16:
                        with open('strains_missing_ribos.txt', 'a') as f:
                            print("breaking out")
                            f.write("{}\n".format(filehandle))
                        break
                    else:
                        elements = ''.join(feature.qualifiers['product']).split()
                        for rib in ribosomal_proteins:
                            if elements[-1] == rib:
                                ribo_count += 1
                                if rib in seen_before:
                                    with open('strains_missing_ribos.txt', 'a') as f:
                                        print("breaking out as seen before")
                                        f.write("{}\n".format(filehandle))
                                    break
                                seen_before.append(rib)
                                # print("annotation match")
                                if params.lower() == 'dna':
                                    # DNA parameters so extract the DNA sequence and save to a file with the appropriate ID
                                    newrec = SeqRecord(feature.location.extract(rec.seq), id="{}_{}|{}".format(elements[-1], ribo_dic[elements[-1]], filehandle), description="")
                                    print("length of the translate", len(feature.location.extract(rec.seq).translate(to_stop=True)), "seq length", len(feature.location.extract(rec.seq))//3)
                                    if len(feature.location.extract(rec.seq).translate(to_stop=True)) < (len(feature.location.extract(rec.seq))//3) - 1:
                                        pass
                                    else:
                                        sequences.append(newrec)
                                elif params.lower() == 'protein':
                                    # protein parameters so extract the protein sequence (by translating the DNA sequence) and save to file with appropriate ID
                                    newrec = SeqRecord(feature.location.extract(rec.seq).translate(to_stop=True), id="{}_{}|{}".format(elements[-1], ribo_dic[elements[-1]], filehandle), description="")
                                    print('>{}\n{}'.format(newrec.id, newrec.seq))
                                    sequences.append(newrec)
                                else:
                                    print("unknown parameters for DNA or protein, please choose either dna or protein")
            except:
                print("problem {}".format(feature))

outfile = open(outname, 'a+')
# save chosen sequences to file with date stamp
SeqIO.write(sequences, outfile, "fasta")
```
```python
# Convert a GenBank file to a protein FASTA (.faa), translating CDS features
# that lack a translation qualifier
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.SeqFeature import SeqFeature, FeatureLocation
import sys

infile = sys.argv[1]
gbk_filename = infile
faa_filename = "{}.faa".format(gbk_filename)
input_handle = open(gbk_filename, "rt")
output_handle = open(faa_filename, "w")

for seq_record in SeqIO.parse(input_handle, "genbank"):
    print("Dealing with GenBank record {}".format(seq_record.id))
    for seq_feature in seq_record.features:
        if seq_feature.type == "CDS":
            try:
                assert len(seq_feature.qualifiers['translation']) == 1
                output_handle.write(">{} from {}\n{}\n".format(
                    seq_feature.qualifiers['locus_tag'][0],
                    seq_record.name,
                    seq_feature.qualifiers['translation'][0]))
            except:
                if seq_feature.location.strand == 1:
                    seq_feature.qualifiers['translation'] = seq_record.seq[seq_feature.location.start:seq_feature.location.end].translate(to_stop=True)
                else:
                    seq_feature.qualifiers['translation'] = seq_record.seq[seq_feature.location.start:seq_feature.location.end].reverse_complement().translate(to_stop=True)
                output_handle.write(">{} from {}\n{}\n".format(seq_feature.qualifiers['locus_tag'][0], seq_record.name, seq_feature.qualifiers['translation']))

output_handle.close()
input_handle.close()
```
```python
# Convert a GenBank file to a DNA FASTA of its CDS features (.ffn)
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
from Bio.SeqFeature import SeqFeature
import sys

handle = open(sys.argv[1], 'rt')
outfile = open(sys.argv[1] + ".ffn", 'w')
sequences = []
records = list(SeqIO.parse(handle, "genbank"))

for rec in records:
    print("Dealing with genbank record {}".format(rec.id))
    for feature in rec.features:
        if feature.type == "CDS":
            if feature.location.strand == -1:
                ffn = rec.seq[feature.location.start:feature.location.end].reverse_complement()
                newid = ''.join(feature.qualifiers['locus_tag'])
                newdescription = ''.join(feature.qualifiers['product'])
                sequences.append(SeqRecord(ffn, id=newid, description=newdescription))
            elif feature.location.strand == 1:
                ffn = rec.seq[feature.location.start:feature.location.end]
                newid = ''.join(feature.qualifiers['locus_tag'])
                newdescription = ''.join(feature.qualifiers['product'])
                sequences.append(SeqRecord(ffn, id=newid, description=newdescription))

SeqIO.write(sequences, sys.argv[1] + ".ffn", "fasta")
```
```python
# Pair each downloaded GCF filename with the species name taken from its
# DEFINITION line and print a tab-separated mapping
handle = open("whichsequenceiswhich.txt", 'r')
file = []
species = []
i = 0
for line in handle:
    if line.startswith('GCF'):
        filename = line.rstrip()
        fil = filename.split('_')
        shortname = "GCF_" + fil[1]
        file.append(filename)
        i = 0
    elif line.startswith('DEFINITION'):
        i += 1
        if i > 1:
            pass
        else:
            elements = line.split()
            species.append("{}_{}".format(shortname, elements[2]))
    else:
        print("odd error")

mapover = zip(file, species)
for m in mapover:
    print("{}".format('\t'.join(m)))
```
```python
def collect_expected_seqlengths(inf_ribos, prot_dna):
    from Bio import SeqIO
    dicty = {}
    lens = []
    ribos = []
    for inf in inf_ribos:
        print("inf here", inf)
        inf_handle = open(inf, 'r')
        inf_rec = SeqIO.read(inf_handle, 'fasta')
        inf_id_split = inf_rec.id.split('_')
        inf_id = inf_id_split[1]
        print("inf_id", inf_id, inf_id_split)
        if prot_dna == 'protein':
            seqlen = len(inf_rec.seq)
        elif prot_dna == 'dna':
            seqlen = 3 * len(inf_rec.seq)
        try:
            dicty[inf_id].append(int(seqlen))
        except:
            dicty[inf_id] = int(seqlen)
    print(dicty)
    return dicty


def concatenate_diamond_matches(infile, prot_dna):
    import collections
    import datetime
    import os
    from Bio import SeqIO
    handle = open(infile, 'r')
    inf_two = open('atccs.txt', 'r')
    inf_ribos = [lin.rstrip() for lin in inf_two]
    dicty = collect_expected_seqlengths(inf_ribos, prot_dna)
    ribo_names_field = 1
    print("ribo_names_field", ribo_names_field)
    d = collections.defaultdict(dict)
    records = list(SeqIO.parse(handle, "fasta"))
    outf_strains = set()
    ribosomal_proteins = ['rplN','rplP','rplR','rplB','rplV','rplX','rplC','rplD','rplE','rplF','rpsJ','rpsQ','rpsS','rpsC','rpsH','rplO']
    outfile_strains = open("strains_missing_ribos.txt", 'a+')
    ## uncomment print statements for debugging, ID changes can cause concatenation to fail
    # Loop through annotation and create a dictionary containing each ribosomal sequence per isolate genome
    for rec in records:
        id = rec.id.split('|')
        # print("id to split on |", id)
        shortid = id[0].split('_')
        # print("shortid to split on _", shortid)
        newid = id[1]
        # print("newid", newid)
        rib = shortid[int(ribo_names_field)]
        # print("rib", rib)
        seqy = str(rec.seq)
        if seqy.endswith('*'):
            # remove the stop character sometimes included in protein translation files
            newseq = seqy[:-1]
        else:
            newseq = seqy
        for ribo in ribosomal_proteins:
            if rib == ribo:
                # check for exact match
                if len(newseq) - 20 < dicty[rib] < len(newseq) + 20:
                    try:
                        d[newid][ribo].append(newseq)
                    except:
                        d[newid][ribo] = [newseq]
                    # print(d[newid][ribo])
                    if len(d[newid][ribo]) > 1:
                        print("duplicate annotation, will search with diamond blast", ribo, newid)
                        del d[newid][ribo]
                else:
                    print("length is incorrect for {}".format(rib), len(newseq), dicty[rib])
                    outf_strains.add("{}".format(newid))
    # create datestamp file
    outname = datetime.date.today().strftime("%d-%m-%y") + "concatenated_ribosomal_proteins_db.fasta"
    if os.path.exists(outname):
        # if files have been collected from the annotation and we are now collecting from the diamond blast process
        outname = outname + "_2"
    else:
        # mark strains for the diamond blast process if they were missing some sequences
        pass
    outfile = open(outname, 'a+')
    for key, value in d.items():
        # test we have all 16 sequences, otherwise mark genome for diamond blast searches or ignore
        if len(value) == 16:
            joinstring = "{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}".format(''.join(d[key]['rplN']), ''.join(d[key]['rplP']), ''.join(d[key]['rplR']), ''.join(d[key]['rplB']), ''.join(d[key]['rplV']), ''.join(d[key]['rplX']), ''.join(d[key]['rplC']), ''.join(d[key]['rplD']), ''.join(d[key]['rplE']), ''.join(d[key]['rplF']), ''.join(d[key]['rpsJ']), ''.join(d[key]['rpsQ']), ''.join(d[key]['rpsS']), ''.join(d[key]['rpsC']), ''.join(d[key]['rpsH']), ''.join(d[key]['rplO']))
            outfile.write(">{}\n".format(key))
            outfile.write(joinstring + "\n")
        else:
            print(key, "Missing some ribosomal proteins, will search with Diamond Blast", len(value))
            outf_strains.add("{}".format(key))
    for stray in outf_strains:
        outfile_strains.write('{}\n'.format(stray))


import sys
concatenate_diamond_matches(sys.argv[1], sys.argv[2])
```
Support
- Future updates