gvclass: Taxonomic Classification of Giant Viruses and MAGs

public 1yr ago 0 bookmarks

View Workflow

Help improve this workflow!

This workflow has been published but could be further improved with some additional meta data:

Keyword(s) in categories input, output

You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .

version 0.9.4 March 7, 2023

Giant viruses are abundant and diverse and frequently found in environmental microbiomes. gvclass assigns taxonomy to putative giant virus contigs or metagenome assembled genomes ( GVMAGs ). It uses a conservative approach based on the consensus of single protein trees built from up to 9 giant virus orthologous groups ( GVOGs ). These 9 GVOGs are conserved across different lineages in the viral phylum Nucleocytoviricota.

Running gvclass

Requirements

Conda environment wih snakemake, check here: https://snakemake.readthedocs.io/en/stable/getting_started/installation.html
Input is a directory that contains single contigs or MAGs as nucleic acid (fna) or proteins (faa)
File extensions .fna or .faa
Recommended length for single contigs is 50kb, but at least 20kb
No special characters (".", ";", ":") in filebase name, "_" or "-" are okay
No whitespace in sequence headers
Recommended sequence header format if faa provided: |
Input will be checked and reformatted if necessary

IMG/VR

Upload you metagenome assembled genome or single contig to IMG/VR using the gvclass feature

Snakemake workflow

Clone the repository

git clone https://github.com/NeLLi-team/gvclass

Test gvclass using the provided giant virus assemblies

snakemake -j 24 --use-conda --config querydir="example"

If this completes successfully, run it using your own directory of query genomes

snakemake -j <number of processes> --use-conda --config querydir="<path to query dir>"

Docker / Shifter

Will be provided soon

Interpretation of the results

The classification result is summarized in a tab separated file <query name>.summary.tab

Taxonomy assignments

Taxonomy assignments are provided on different taxonomic levels.
To yield an assignments all nearest neighbors in GVOG phylogenetic tree have to be in agreement.
Depending on the number of identified unique GVOGs an assignment is provided if at least 1 GVOG (stringency "gte1"), 2 GVOGs (stringency "gte2") or 3 GVOGs (stringency "gte3") were found.
Less stringent is the "majority" approach, here more than 50% of identified markers have to yield the same taxonomy to enable and assignment.
If taxonomy assignments are not in agreement at a low taxonomy level (e.g. species, genus, or family) then the next higher taxonomy level will be evaluated, up to the domain level.
March 2023: Added experimental feature xgboost classifier based on kmers. Provides assignment of provided sequences (fna) to cellular domains, phages or NCLDV together with order level assignment (if NCLDV).

Contamination

Giant virus genomes typically have less than 10 out of a set of 56 universal cellular housekeeping genes (UNI56). Higher UNI56 counts indicate cellular contamination, or giant virus sequences that are located on host contigs.
- UNI56u (unique counts), UNI56t(total counts), UNI56df (duplication factor) are provided and can be used for further quality filtering
Giant virus genomes typically have a duplication factor of GVOG7 (as subset of GVOG9) of below 3. Higher GVOG7 duplication factors indicate the presence mixed viral populations.
- GVOG7u (unique counts), GVOG7t(total counts), GVOG7df (duplication factor) are provided and can be used for further quality filtering
Giant viruses may break any of these rules, thus, gvclass does not perform automatic quality filtering based on marker gene counts.

Benchmarking

Will be provided soon

Citation

Will be provided soon

References

Acknowledgements

GVClass was developed by the New Lineages of Life Group at the DOE Joint Genome Institute supported by the Office of Science of the U.S. Department of Energy under contract no. DE-AC02-05CH11231.

Code Snippets

import click
from Bio import SeqIO
import pandas as pd
import os


@click.command()
@click.option('--input', '-i', type=click.Path(exists=True), help='Input FASTA file')
@click.option('--output', '-o', type=click.Path(), help='Output FASTA file')
@click.option('--stats', '-s', type=click.Path(), help='Genome stats outfile')
def main(input, output, stats):
    # read in the input file
    records = list(SeqIO.parse(input, "fasta"))

    # modify the sequence IDs and descriptions
    protcount = 0
    for record in records:
        protcount += 1
        if "|" in record.id:
            record.id = record.id.split("|")[-1]
        record.id = f"{os.path.splitext(os.path.basename(input))[0]}|{record.id}"
        record.description = ""

    # write the reformatted sequences to the output file
    SeqIO.write(records, output, "fasta")

    # calculate statistics and write them to the stats file
    stats_dict = {
        'query': os.path.splitext(os.path.basename(input))[0],
        'contigs': 'no_fna',
        'LENbp': 'no_fna',
        'GCperc': 'no_fna',
        'genecount': protcount,
        'CODINGperc': 'no_fna',
        'ttable': 'no_fna'
    }
    pd.DataFrame.from_dict(stats_dict, orient='index').T.to_csv(stats, sep="\t", index=False)

if __name__ == '__main__':
    main()

Python Pandas Biopython CLICK From line 1 of scripts/00reformat.py

import click
import os
import sys
import shutil
import glob
from subprocess import call
from Bio import SeqIO
import pandas as pd


def calc_stats(fnain, faain):
    """
    Calculate coding density, assembly size, GC%
    """
    cumulative_gc = 0
    cumulative_len = 0
    cumulative_len_aa = 0
    gene_count = 0
    contigs = 0

    for seq_record in SeqIO.parse(fnain, "fasta"):
        cumulative_gc += seq_record.seq.upper().count('G') + seq_record.seq.upper().count('C')
        cumulative_len += len(seq_record.seq)
        contigs += 1

    for seq_record in SeqIO.parse(faain, "fasta"):
        cumulative_len_aa += len(seq_record.seq)
        gene_count += 1

    return [contigs, cumulative_len, round(cumulative_gc / cumulative_len * 100, 2), gene_count, round(cumulative_len_aa * 3 / cumulative_len * 100, 2)]


def check_filename(fnafile):
    """
    Input has to be .fna with no additional . in filename
    """
    if len(fnafile.split(".")) > 2:
        print("fnafile name contains additional '.'")
        sys.exit()
    if fnafile.split(".")[1] != "fna":
        print("fnafile ending needs to be .fna")
        sys.exit()


def rename_header(bestcode, finalfaa, gffout):
    """
    Rename sequence header to ><filename>|<proteinid>
    """
    records = []
    infilebase = os.path.splitext(os.path.basename(bestcode))[0]

    for seq_record in SeqIO.parse(bestcode, "fasta"):
        headerbase = seq_record.description.split("|")[0]
        if headerbase != infilebase:
            seq_record.id = f"{infilebase}|{str(seq_record.id).split()[0]}"
            seq_record.description = ""
            records.append(seq_record)
        elif headerbase == infilebase:
            seq_record.id = str(seq_record.id).split()[0]
            seq_record.description = ""
            records.append(seq_record)

    SeqIO.write(records, finalfaa, "fasta")
    # keep only best gff
    shutil.copy(gffout + "_code" + bestcode.split("_code")[-1], gffout)


def run_genecalling_codes(fnafile, code, gffout):
    """
    Genecalling with prodigal and defined tt
    """
    faaout = gffout.replace(".gff", ".faa")
    if code > 0:
        run_command = f"prodigal -n -q -f gff -i {fnafile} -a {faaout}_code{code} -o {gffout}_code{code} -g {code}"
        call(run_command, shell=True)
    else:  # output for snakemake with -p meta
        run_command = f"prodigal -n -q -f gff -p meta -i {fnafile} -a {faaout}_codemeta -o {gffout}"
        call(run_command, shell=True)
        shutil.copy(gffout, gffout + "_codemeta")


@click.command()
@click.option('--fnafile', '-f', type=click.Path(exists=True), help='Input FNA file')
@click.option('--gffout', '-g', type=click.Path(), help='Output GFF file')
@click.option('--genecalling_statsout', '-gs', '--genecalling-statsout', type=click.Path(), help='Genecalling statistics file')
@click.option('--summary_statsout', '-ss', type=click.Path(), help='Genome stats outfile')
@click.option('--finalfaa', '-fa', type=click.Path(), help='Output FASTA file')
def main(fnafile, gffout, genecalling_statsout, summary_statsout, finalfaa):
    # check input
    check_filename(fnafile)
    # gene calling
    # create output for snakemake
    codes = [0, 1, 4, 6, 11] # pass these from snakemake config instead
    for code in codes:
        run_genecalling_codes(fnafile, code, gffout)
    stats_dict = {}
    faaoutdir = os.path.dirname(gffout)
    for faain in glob.glob(os.path.join(faaoutdir, "*.faa_code*")):
        genome_id = os.path.basename(faain)
        stats_dict[genome_id] = calc_stats(fnafile, faain)

    # gene calling stats to select ttable that yields highest coding density
    stats_df = pd.DataFrame.from_dict(stats_dict, columns=["contigs", "LENbp", "GCperc", "genecount", "CODINGperc"], orient="index")
    stats_df = stats_df.sort_values("CODINGperc", ascending=False)
    stats_df["ttable"] = stats_df.index.map(lambda x: x.split("_")[-1])
    stats_df.insert(0, "query", stats_df.index.map(lambda x: os.path.splitext(x)[0]))
    bestcode = os.path.join(faaoutdir, stats_df.index[0])  # genome id with code appended that had highest coding density
    stats_df.to_csv(genecalling_statsout, sep="\t", index=None)
    stats_df.head(1).to_csv(summary_statsout, sep="\t", index=None)

    # rename headers and cleanup
    rename_header(bestcode, finalfaa, gffout)
    for tempout in glob.glob(f"{gffout}_code*"):
        os.remove(tempout)
        faaout = tempout.replace(".gff", ".faa")
        if os.path.isfile(faaout):
            os.remove(faaout)


if __name__ == '__main__':
    main()

Python Snakemake Pandas Biopython prodigal CLICK From line 1 of scripts/01genecalling_single.py

import click
import pandas as pd
import pickle
from Bio import SeqIO
from Bio.Seq import Seq
from sklearn.preprocessing import LabelEncoder


def get_label_mapping(mapping_file):
    """Returns a dictionary mapping encoded labels to their original values"""
    with open(mapping_file) as f:
        return {int(line.split('\t')[0]): line.strip().split('\t')[1] for line in f}


def get_best_kmers(kmers_file):
    """Returns a list of the best kmers"""
    with open(kmers_file) as f:
        return [line.strip() for line in f.readlines()]


def count_kmers(kmers, fna):
    """Counts the occurrences of kmers in a given FASTA file"""
    kmer_counts_genome_dict = {x:0 for x in kmers}
    total_seqs = sum(1 for _ in SeqIO.parse(fna, "fasta"))
    with click.progressbar(SeqIO.parse(fna, "fasta"), label='Counting kmers', length=total_seqs) as records:
        for seq_record in records:
            for kmer in kmers:
                kmer_counts_genome_dict[kmer] += Seq(seq_record.seq).count_overlap(kmer)
    return kmer_counts_genome_dict


def predict(model, kmers, labels, queryseq):
    # Load XGBoost model
    xgbmodel = pickle.load(open(model, 'rb'))

    # Load label mapping
    inv_mapping = get_label_mapping(labels)

    # Load and preprocess query sequence counts
    kmers_list = get_best_kmers(kmers)
    kmer_count_all_dict = {queryseq.split("/")[-1].split(".f")[0]: count_kmers(kmers_list, queryseq)}
    df_qcounts  =  pd.DataFrame.from_dict(kmer_count_all_dict).T.fillna(0)
    df_qcounts = df_qcounts.div(df_qcounts.sum(axis=1), axis=0)
    df_qcounts = df_qcounts.reset_index().drop(['index'], axis=1)

    # Ensure correct order of features
    cols_when_model_builds = xgbmodel.get_booster().feature_names
    df_qcounts = df_qcounts[cols_when_model_builds]

    # Make prediction and print result
    y_pred = xgbmodel.predict(df_qcounts.head(1))
    prediction = inv_mapping[y_pred[0]]
    prob_val = xgbmodel.predict_proba(df_qcounts.head(1))
    print (prob_val)
    proba_dict = {inv_mapping[x]:[prob_val[0,x]] for x,y in inv_mapping.items()}
    return prediction, proba_dict


@click.command()
@click.option('--queryseq', '-q', type=click.Path(exists=True), required=True, help='Path to query sequence')
@click.option('--model', '-m', type=click.Path(exists=True), required=True, help='Path to prebuilt model')
@click.option('--kmers', '-k', type=click.Path(exists=True), required=True, help='Path to kmer file')
@click.option('--labels', '-l', type=click.Path(exists=True), required=True, help='Path to labels')
@click.option('--out', '-o', type=click.Path(exists=False), required=True, help='Path to prediction output')
def main(queryseq, model, kmers, labels, out):
    # get domain level prediction
    prediction_domain, proba_dict_domain = predict(model, kmers, labels, queryseq)
    # if NCDLV then also get order level prediction
    if prediction_domain == "NCLDV":
        model_order = model.replace("domain", "order")
        kmers_order = kmers.replace("domain", "order")
        labels_order = labels.replace("domain", "order")
        prediction_order, proba_dict_order = predict(model_order, kmers_order, labels_order, queryseq)
    else:
        prediction_order = ""
        proba_dict_order = {}
    with open(out, "w") as outfile:
        outfile.write(f"{queryseq.split('/')[-1].split('.f')[0]}\t{prediction_order}___{prediction_domain}")
    proba_dict = {**proba_dict_domain, **proba_dict_order}
    df_proba = pd.DataFrame(proba_dict)
    df_proba["query"] = queryseq.split('/')[-1].split('.f')[0]
    df_proba = df_proba.set_index("query")
    #print (df_proba)
    #df_proba.to_csv("test.tab", sep="\t", index=None)
    df_proba.to_csv(f"{out}.proba", sep="\t", index=True)

if __name__ == "__main__":
    main()

Python Pandas Biopython sklearn CLICK From line 1 of scripts/01predict.py

import click
import subprocess
import sys
import pandas as pd


def run_cmd(cmd):
    """Runs a command in the shell and prints standard output and error"""
    sp = subprocess.Popen(cmd, shell=True)
    std_out, std_err = sp.communicate()
    print('std_err: ', std_err)
    print('std_out: ', std_out)


def get_models(modelsin):
    """
    Generate list of all model names
    Some models might have no hits in hmmout, but they should be displayed in count matrix
    """
    models = []
    with open(modelsin, "r") as f:
        models = [line.split()[-1].rstrip() for line in f if line.startswith("NAME")]
    return models


def get_markercompleteness(models, hmmout, query):
    """Returns a dictionary of marker count for each model"""
    count_dict = {x: 0 for x in models}
    seen = []
    with open(hmmout, "r") as f:
        lines = [line.rstrip() for line in f if not line.startswith("#")]
        for line in lines:
            if line.split()[0] not in seen:
                count_dict[line.split()[3]] += 1
                seen.append(line.split()[0])
    count_dict = {query: count_dict}
    return count_dict


@click.command()
@click.option('--queryfaa', '-q', type=click.Path(exists=True), help='Input query FASTA file')
@click.option('--modelscombined', '-m', type=click.Path(exists=True), help='Combined HMM model file')
@click.option('--hmmout', '-h', type=click.Path(), help='Output file for HMM hits')
@click.option('--target', '-t', type=str, help='Target database (UNI56 or UNIREF90)')
@click.option('--countout', '-c', type=click.Path(), help='Output file for marker gene counts')
def main(queryfaa, modelscombined, hmmout, target, countout):
    ### hmmsearch ###
    cutoff = "-E 1e-10"
    if target == "UNI56":
        cutoff = "--cut_ga"
    hmmsearch = ["hmmsearch" +
                 " --noali" +
                 " " + cutoff + " " +
                 " --domtblout " + hmmout +
                 " --cpu 4 " + modelscombined +
                 " " + queryfaa]
    run_cmd(hmmsearch)

    ### get counts ###
    query = queryfaa.split("/")[-1].split(".")[0]
    models = get_models(modelscombined)
    count_dict = get_markercompleteness(models, hmmout, query)
    pd.DataFrame.from_dict(count_dict).T.to_csv(countout, sep="\t")


if __name__ == '__main__':
    main()

Python Pandas hmmsearch (genouest) CLICK From line 1 of scripts/02hmmsearch.py

import os
from Bio import SeqIO
import shutil
from collections import defaultdict
import click

@click.command()
@click.option('--hmmout', '-h', required=True, help='Path to HMM output file.')
@click.option('--queryfaa', '-q', required=True, help='Path to query protein FASTA file.')
@click.option('--queryhitsfaa', '-o', required=True, help='Path to output file containing hits.')
def main(hmmout, queryfaa, queryhitsfaa):
    outmodel = os.path.splitext(os.path.basename(queryhitsfaa))[0]
    hits_model_dict = get_qhits(hmmout, outmodel)
    export_hitsfaa(hits_model_dict, queryfaa, queryhitsfaa)

def export_hitsfaa(hits_model_dict, queryfaa, queryhitsfaa):
    query_ids = set()
    with click.progressbar(hits_model_dict.values(), label='Exporting hits to file') as bar1:
        for hits in bar1:
            query_ids |= set(hits)
    with open(queryhitsfaa, "w") as outfile:
        with click.progressbar(SeqIO.parse(queryfaa, "fasta"), label='Writing hits to file') as bar2:
            for seq_record in bar2:
                if seq_record.description in query_ids:
                    SeqIO.write(seq_record, outfile, "fasta")

def get_qhits(hmmout, outmodel):
    hits_model_dict = defaultdict(set)
    with open(hmmout, "r") as infile:
        with click.progressbar(infile, label='Parsing HMM output') as bar:
            for line in bar:
                if not line.startswith("#") and outmodel in line:
                    hits_model_dict[line.split()[3]].add(line.split()[0])
    return hits_model_dict

if __name__ == '__main__':
    main()

Python Biopython CLICK From line 1 of scripts/03extract_qhits.py

import click
import subprocess
from Bio import SeqIO
from collections import defaultdict
import os


def run_cmd(cmd):
    sp = subprocess.Popen(cmd,
        #stderr=subprocess.PIPE,
        #stdout=subprocess.PIPE,
        shell=True)
    std_out, std_err = sp.communicate()
    print('std_err: ', std_err)
    print('std_out: ', std_out)


def run_blastp(queryfaa, refdb, blastpout):
    # no evalue set, what is the default?
    dblastp = ["diamond blastp --threads 8 "
               "--quiet --outfmt 6 "
               "--more-sensitive --query-cover 30 "
               "--subject-cover 30 "
               "-e 1e-10 "
               "-d {} -q {} -o {}".format(refdb, queryfaa, blastpout)]
    run_cmd(dblastp)


def parse_blastp(blastpout):
    queryids_hits_dict = defaultdict(list)
    seen = []
    # there could be multiple queries
    try:
        num_queries = len(set([line.split()[0] for line in open(blastpout)]))
        with open(blastpout, "r") as infile:
            for line in infile:
                queryname = line.split()[0]
                subjectname = line.split()[1]
                pident = float(line.split()[2])
                # Yield equal number of subject ids per query
                if subjectname not in seen and len(queryids_hits_dict[queryname]) <= 100/num_queries and pident!=100:
                    queryids_hits_dict[queryname].append(subjectname)
                    seen.append(subjectname)
    except:
        pass
    # single list with all 100 top subject ids
    besthits = [x for v in queryids_hits_dict.values() for x in v]
    return besthits


def reduce_ref(queryfaa, reffaa, reduced_reffaa_merged, refdb, blastpout):
    run_blastp(queryfaa, refdb, blastpout)
    besthits = parse_blastp(blastpout)
    seenids = []
    outseqs = []
    for seq_record in SeqIO.parse(queryfaa, "fasta"):
        if seq_record.id not in seenids:
            outseqs.append(seq_record)
            seenids.append(seq_record.id)
    for seq_record in SeqIO.parse(reffaa, "fasta"):
        if seq_record.description in besthits and seq_record.id not in seenids:
            outseqs.append(seq_record)
            seenids.append(seq_record.id)
    SeqIO.write(outseqs, reduced_reffaa_merged, "fasta")


@click.command()
@click.option('--queryfaa', '-q', required=True, help='Query sequences from hmmsearch for a single marker')
@click.option('--reffaa', '-r', required=True, help='Ref seqs for a single marker')
@click.option('--refdb', '-d', required=True, help='Diamond formatted db for a single marker')
@click.option('--blastpout', '-b', required=True, help='Output file name for diamond blastp')
@click.option('--reduced_reffaa_merged', '-o', required=True, help='Output file name for reduced reference sequences')
def main(queryfaa, reffaa, refdb, blastpout, reduced_reffaa_merged):
    if os.path.getsize(queryfaa) > 0:
        reduce_ref(queryfaa, reffaa, reduced_reffaa_merged, refdb, blastpout)
    else: 
        print (f"{queryfaa} is empty, no hits, omitting blast")

if __name__ == '__main__':
    main()

Python Biopython BLAST Diamond hmmsearch (genouest) CLICK From line 1 of scripts/04blastp_reduce_merge.py

import subprocess
import click


def run_cmd(cmd):
    sp = subprocess.Popen(cmd,
        stderr=subprocess.PIPE,
        stdout=subprocess.PIPE,
        shell=True)
    (std_out, std_err) = sp.communicate()
    print('std_err: ', std_err)
    print('std_out: ', std_out)


@click.command()
@click.option("--queryfaamerged", "-q", required=True, type=click.Path(exists=True), help="Input fasta file containing merged fasta sequences")
@click.option("--alnout", "-a", required=True, type=click.Path(), help="Output fasta file for aligned sequences")
@click.option("--trimout", "-t", required=True, type=click.Path(), help="Output fasta file for trimmed sequences")
def main(queryfaamerged, alnout, trimout):
    mafftaln = f"mafft --quiet --thread 4 {queryfaamerged} > {alnout}"
    trimaln = f"trimal -gt 0.1 -in {alnout} -out {trimout}"
    for cmd in [mafftaln, trimaln]:
        run_cmd(cmd)

if __name__ == "__main__":
    main()

Python MAFFT API (EBI) CLICK trimAl From line 1 of scripts/05align_trim.py

import click
import subprocess


def run_cmd(cmd):
    sp = subprocess.Popen(cmd,
        stderr=subprocess.PIPE,
        stdout=subprocess.PIPE,
        shell=True)
    (std_out, std_err) = sp.communicate()
    print('std_err: ', std_err)
    print('std_out: ', std_out)


@click.command()
@click.option("--aln", "-a", type=click.Path(exists=True), help='Alignment file in FASTA format')
@click.option("--treeout", "-t",  type=click.Path(), help='Output file for the tree')
def main(aln, treeout):
    runtree = f"fasttree -lg < {aln} > {treeout}"
    run_cmd(runtree)


if __name__ == "__main__":
    main()

Python CLICK fasttree From line 1 of scripts/06build_tree.py

import glob
from collections import Counter
from Bio import AlignIO
import click

def get_nn(aln, query):
    hd_dict_paralogs = {}
    for qseq in AlignIO.read(aln, "fasta"):
        if qseq.id.split("|")[0] == query:
            hd_dict = {}
            for rseq in AlignIO.read(aln, "fasta"):
                if rseq.id.split("|")[0] != query:
                    hd = hamming_distance(qseq.seq, rseq.seq)
                    if hd > 30:
                        hd_dict[rseq.id] = hd
            nn = max(hd_dict, key=hd_dict.get)
            hd_dict_paralogs[nn + ";;;" + qseq.id] = hd_dict[nn]
    return hd_dict_paralogs


def flattenkeys(dict1, labels_dict, taxlevel):
    if taxlevel == "order":
        level = 3
    elif taxlevel == "family":
        level = 2
    elif taxlevel == "genus":
        level = 1
    dictlist = [dict1[model] for model in dict1]
    flattened = list({k: v for d in dictlist for k, v in d.items()}.keys())
    return [labels_dict[tax.split("|")[0]][level] for tax in flattened]


def hamming_distance(qseq, rseq):
    """
    get hamming distance between two aligned sequences
    """
    # return id%
    hd = (sum(aa1 != aa2 for aa1, aa2 in zip(qseq, rseq)))
    return hd

@click.command()
@click.option("--query", "-q", required=True)
@click.option("--aln_outpath", "-a", required=True)
@click.option("--outname", "-o", required=True)
@click.option("--ncldv_labels", "-l", required=True)
def main(query, aln_outpath, outname, ncldv_labels):
    hd_dict_all = {}
    hd_dict_all_best = {}
    for aln in glob.glob(aln_outpath + "*.mafft01"):
        modelname = aln.split("/")[-1].split(".")[0]
        try:
            hd_dict_all[modelname] = get_nn(aln, query)
            hd_dict_all_best[modelname] = {max(hd_dict_all[modelname], key=hd_dict_all[modelname].get) : hd_dict_all[modelname][max(hd_dict_all[modelname], key=hd_dict_all[modelname].get)]}
        except:
            pass

    labels_dict = {}
    with open(ncldv_labels, "r") as infile:
        for line in infile:
            taxid, lineage = line.strip().split("\t")
            labels_dict[taxid] = lineage.split("|")

    final_aln_dict = Counter(flattenkeys(hd_dict_all_best, labels_dict, "order"))
    aln_result = "|".join([f"{x}:{y}" for x, y in final_aln_dict.items()])

    with open(outname, "w") as outfile:
        if len(aln_result) > 0:
            outfile.write(f"{query}\t{aln_result}\taln\n")
        else:
            outfile.write(f"{query}\tno_hits\taln\n")
        for model, value in hd_dict_all.items():
            value = dict(sorted(value.items(), key=lambda item: item[1], reverse=True))
            outfile.write("\n".join([f"{model}\t{k.split(';;;')[1]}\t{k.split(';;;')[0]}\t{'|'.join(labels_dict[k.split('|')[0]])}\t{v}" for k, v in value.items()]) + "\n")

if __name__ == "__main__":
    main()

Python Biopython CLICK From line 1 of scripts/07get_nn_aln.py

import glob
import click
from ete3 import Tree
from collections import defaultdict
from collections import Counter
import pandas as pd


def flattenkeys_to_list(dict1, labels_dict, taxlevel):
    if taxlevel=="order":
        level = 3
    elif taxlevel=="family":
        level = 2
    elif taxlevel=="genus":
        level = 1
    dictlist = [  dict1[model] for model in dict1.keys() ]
    flattened = list({k: v for d in dictlist for k, v in d.items()}.keys())
    return [labels_dict[tax.split("|")[0]][level] for tax in flattened]


def unpack_family(node, leaves, query, seen):
    """
    get all leaves under a parental node
    """
    children = node.get_children()
    for child in children:
        childname = str(child).replace("--","").replace("/-", "").replace("\-", "").replace("\n","")
        if child.is_leaf() and childname.split("|")[0] != query:
                seen.append(childname)
                leaves.append(childname)
        elif not child.is_leaf() and childname != query and childname not in seen:
            seen.append(childname)
            unpack_family(child, leaves, query, seen) 
        elif child.is_leaf() and childname == query:
            seen.append(childname)
        else:
            pass
    if len(leaves)==0:
        parentnode = node.up
        unpack_family(parentnode, leaves, query, seen) 
    return leaves


@click.command()
@click.option('-q', '--query', required=True, help='Query to search in the tree.')
@click.option('-t', '--tree_outpath', required=True, help='Path to the tree directory.')
@click.option('-o', '--outname', required=True, help='Output file name.')
@click.option('-l', '--ncldv_labels', required=True, help='Path to the NCBI taxonomy labels file.')
def main(query, tree_outpath, outname, ncldv_labels):

    def get_closestrelative(query, leaves, node, tree):
        distance = 10
        closestrelative = "nd"
        for child in leaves:
            if child != query:
                newdist = t.get_distance(child, query, topology_only = False)
                if newdist < distance:
                    distance = newdist
                    closestrelative = child
                else:
                    pass
        return closestrelative, distance


    def get_neighbor(t, modelname, query):
        tree_dict_paralogs = {} # include nearest neighbor for all paralogs
        for node in t.traverse():
            if node.name.split("|")[0] == query:
                queryintree = node.name
                print (queryintree)
                # move up one node in the tree
                # collect all terminal leaves under the parent node
                leaves = []
                seen = []
                leaves = unpack_family(node, leaves, query, seen)
                print (leaves)
                closestrelative, distance = get_closestrelative(queryintree, leaves, node, t)
                print (closestrelative)
                tree_dict_paralogs[closestrelative + ";;;" + queryintree] = distance
        #closestrelativep = min(tree_dict_paralogs, key=tree_dict_paralogs.get) # get only ref that is most similar from all paralogs = shorted distance
        #return {closestrelativep : tree_dict_paralogs[closestrelativep]} # return only ref that is most similar
        return tree_dict_paralogs


    labels_dict = {}
    with open(ncldv_labels, "r") as infile:
        for line in infile:
            taxid = line.split("\t")[0]
            lineage = line.split("\t")[1].replace("\n","").split("|")
            labels_dict[taxid] = lineage

    # all distances including paralogs
    tree_dict_all = {}
    # dict with only single paralog that had shortest distance to reference
    tree_dict_all_best = {}
    for ftree in glob.glob(tree_outpath + "/*.FTWAG"):
        try:
            t = Tree(ftree)
            modelname = ftree.split("/")[-1].split(".")[0]
            print (f"{modelname}")
            tree_dict_all[modelname] = get_neighbor(t, modelname, query)
            print (f"{tree_dict_all}")
            tree_dict_all_best[modelname] = {min(tree_dict_all[modelname], key=tree_dict_all[modelname].get) : tree_dict_all[modelname][min(tree_dict_all[modelname], key=tree_dict_all[modelname].get)]}
        except:
            print (f"Something wrong with {ftree}")
            pass

    # this is the summary result, show number of best hits per tax string
    # cutoff for order: tree dist 2, aln 30 idperc --> needs further evaluation

    print (f"{tree_dict_all_best}@@@@@@@@{query}")
    final_tree_dict = Counter(flattenkeys_to_list(tree_dict_all_best, labels_dict, "order"))
    tree_result = "|".join([":".join([str(x) for x in pair]) for pair in list(final_tree_dict.items())])

    with open(outname, "w") as outfile:
        if len(tree_result) > 0:
            outfile.write(f"{query}\t{tree_result}\ttree\n")
        else:
            outfile.write(f"{query}\tno_hits\ttree\n")
        for model, value in tree_dict_all.items():
            # print lowest distance first
            value = dict(sorted(value.items(), key=lambda item: item[1], reverse=False))
            #outfile.write("\n".join(f"{model}\t{k.split(';;;')[1]}\t{k.split(';;;')[0]}\t{'|'.join(labels_dict[k.split('|')[0]])}\t{v}" for k,v in value.items()) + "\n")
            outfile.write("\n".join([
                f"{model}\t{k.split(';;;')[1]}\t{k.split(';;;')[0]}\t{'|'.join(labels_dict[k.split('|')[0]])}\t{v}"
                for k, v in value.items()]) + "\n")


if __name__ == "__main__":
    main()

Python Pandas CLICK ete3 From line 1 of scripts/08get_nn_tree.py

import click
import pandas as pd
import glob
from collections import Counter


def parse_result(tabout):
    results = []
    with open(tabout) as infile:
        for line in infile:
            line = line.strip().split()
            if line[0].startswith("GVOG"):
                if line[0] == "GVOGm0461" and float(line[-1]) < 1.8:
                    results.append(line)
                elif line[0] != "GVOGm0461" and float(line[-1]) < 2.2:
                    results.append(line)
                else:
                    print(f"{line[0]} tree distance to neighbor above threshold, hit removed")
    return results

def process_gvog9(gvog9_count):
    try:
        gvog9_models = ["GVOGm0003","GVOGm0013","GVOGm0022","GVOGm0023","GVOGm0054","GVOGm0172","GVOGm0461","GVOGm0760","GVOGm0890"]
        gvog7_models = ["GVOGm0013","GVOGm0023","GVOGm0054","GVOGm0172","GVOGm0461","GVOGm0760","GVOGm0890"]
        gvogsout_df = pd.read_csv(gvog9_count, sep="\t", index_col=0)
        gvogsout_df['GVOG7u'] = (gvogsout_df[gvog7_models] > 0).sum(axis=1)
        gvogsout_df['GVOG7t'] = (gvogsout_df[gvog7_models]).sum(axis=1)
        gvogsout_df['GVOG9u'] = (gvogsout_df[gvog9_models] > 0).sum(axis=1)
        gvogsout_df['GVOG9t'] = (gvogsout_df[gvog9_models]).sum(axis=1)
        gvogsout_df['GVOG7df'] = gvogsout_df['GVOG7t'] / gvogsout_df['GVOG7u'] if gvogsout_df['GVOG7t'].any() else 0
        return list(gvogsout_df.GVOG9u)[0], list(gvogsout_df.GVOG9t)[0], list(gvogsout_df.GVOG7u)[0], list(gvogsout_df.GVOG7t)[0], list(gvogsout_df.GVOG7df)[0], list(gvogsout_df.GVOGm0003)[0]
    except:
        return 0,0,0,0,0,0


def process_uni56(uni56_count):
    try:
        UNI56out_df = pd.read_csv(uni56_count, sep="\t", index_col=0)
        UNI56out_models = list(UNI56out_df.columns)
        UNI56out_df['UNI56u'] = (UNI56out_df[UNI56out_models] > 0).sum(axis=1)
        UNI56out_df['UNI56t'] = (UNI56out_df[UNI56out_models]).sum(axis=1)
        UNI56out_df['UNI56df'] = UNI56out_df['UNI56t'] / UNI56out_df['UNI56u'] if UNI56out_df['UNI56t'].any() else 0
        return list(UNI56out_df.UNI56u)[0], list(UNI56out_df.UNI56t)[0], list(UNI56out_df.UNI56df)[0]
    except:
        return 0, 0, 0


def most_frequent(taxstrings, taxlevel):
    freq = Counter(taxstrings)
    if len(freq) == 1:
        if len(taxstrings) >= 2:
            return taxstrings[0]
        else:
            return "_"
    most_common = freq.most_common(1)[0]
    if most_common[1] > len(taxstrings) // 2:
        return most_common[0]
    elif taxlevel == "domain":
        return "-".join(sorted(list(set(taxstrings))))
    else:
        return "_"

def tax_annotation(row, level_other, level_ncldv):
    if row["subject"].split("__")[0] in ["EUK", "ARC", "BAC", "PHAGE"]:
        return  row["subject"].split("__")[0] + "__" + row["taxannot"].split("|")[level_other]
    else:
        return row["taxannot"].split("|")[level_ncldv]

def tax_domains(row):
    if row["subject"].split("__")[0] in ["EUK", "ARC", "BAC", "PHAGE"]:
        return  row["subject"].split("__")[0]
    elif row["taxannot"].split("|")[5] == "Nucleocytoviricota":
        return "NCLDV"
    else:
        return "CONFLICT"

def get_final_tax(df, query, stringency_s):
    df_q = df[df.queryname == query]
    tax_levels = ["species", "genus", "family", "order", "class", "phylum", "domain"]
    tax_lists = [list(df_q[tax]) for tax in tax_levels]
    final_tax = [query]
    if stringency_s == "majority":
        for level, lst in zip(tax_levels, tax_lists):
            final_tax.append(f"{level[0]}_{most_frequent(lst, level)}")
    else:
        stringency = int(stringency_s.replace("gte", ""))
        for level, lst in zip(tax_levels, tax_lists):
            if len(set(lst)) == 1:
                final_tax.append(f"{level[0]}_{lst[0]}")
            elif len(lst) >= stringency:
                final_tax.append(f"{level[0]}__")
            else:
                final_tax.extend(["missing_markers"] * (7 - len(final_tax)))
                break
        else:
            final_tax.extend(["missing_markers"] * (7 - len(final_tax)))
        if len(set(tax_lists[-1])) == 1:
            final_tax[-1] = f"{tax_levels[-1][0]}_{tax_lists[-1][0]}"
        else:
            for lst, domain in zip(tax_lists[:-1], df_q["domain"]):
                if domain == "EUK" and "NCLDV" in lst:
                    final_tax[-1] = "d_NCLDV-EUK"
                    break
                elif domain == "BAC" and "NCLDV" in lst:
                    final_tax[-1] = "d_NCLDV-BAC"
                    break
                elif domain == "ARC" and "NCLDV" in lst:
                    final_tax[-1] = "d_NCLDV-ARC"
                    break
                elif domain == "ARC" and "BAC" in lst:
                    final_tax[-1] = "d_ARC-BAC"
                    break
                elif domain == "ARC" and "EUK" in lst:
                    final_tax[-1] = "d_ARC-EUK"
                    break
                elif domain == "BAC" and "EUK" in lst:
                    final_tax[-1] = "d_BAC-EUK"
                    break
                elif domain == "PHAGE" and "NCLDV" in lst:
                    final_tax[-1] = "d_NCLDV-PHAGE"
                    break
            else:
                final_tax[-1] = "d__"
    final_tax.append(stringency_s)
    final_tax.append(df_q["distance"].mean())
    return final_tax


def summarize(df):
    try:
        # Split the query column and extract the queryname
        df["queryname"] = df["query"].str.split("|", expand=True)[0]
        df["species"] =  df.apply(lambda row: tax_annotation(row, 6, -1), axis=1)
        df["genus"] =  df.apply(lambda row: tax_annotation(row, 5, 1), axis=1)
        df["family"] =  df.apply(lambda row: tax_annotation(row, 4, 2), axis=1)
        df["order"] =  df.apply(lambda row: tax_annotation(row, 3, 3), axis=1)
        df["class"] =  df.apply(lambda row: tax_annotation(row, 2, 4), axis=1)
        df["phylum"] =  df.apply(lambda row: tax_annotation(row, 1, 5), axis=1)
        df["domain"] =  df.apply(tax_domains, axis=1)
        df = df[["queryname","species", "genus","family","order", "class", "phylum", "domain", "GVOG", "distance"]]
        # Get the top hit
        df_tophit = df.drop_duplicates(subset=['GVOG'])
        # Get results for different stringencies
        stringencies = ["gte1", "gte2", "gte3", "majority"]
        allresults = [get_final_tax(df_tophit, list(df.queryname)[0], stringency) for stringency in stringencies]
        # Create results dataframe
        df_results = pd.DataFrame(allresults, columns=["query", "species", "genus", "family", "order", "class", "phylum", "domain", "stringency", "avgdist"])
        return df_results
    except:
        pass


@click.command()
@click.option('-n', '--nn_tree', required=True, help='Path to the nearest neighbor tree file.')
@click.option('-g', '--gvog9_count', required=True, help='Path to the GVOG9 count file.')
@click.option('-u', '--uni56_count', required=True, help='Path to the UNI56 count file.')
@click.option('-q', '--querystats', required=True, help='Path to the query stats file.')
@click.option('-c', '--xgb_out', required=True, help='Path to the classifier output file.')
@click.option('-s', '--summary_out', required=True, help='Path to the output summary file.')
def main(nn_tree, gvog9_count, uni56_count, querystats, xgb_out, summary_out):
    try:
        query = summary_out.split("/")[-1].split(".")[0]
        treehits = []
        treehits.extend(parse_result(nn_tree))
        querystats_df = pd.read_csv(querystats, sep="\t")
        with open(xgb_out) as f:
            prediction = [line.split('\t')[1].strip() for line in f if line.strip()]
        if len(treehits) > 0:
            df_tree = pd.DataFrame(treehits, columns=["GVOG", "query", "subject", "taxannot", "distance"])
            df_tree["distance"] = df_tree["distance"].astype(float)
            df_results_tree = summarize(df_tree)
            GVOG9u, GVOG9t, GVOG7u, GVOG7t, GVOG7df, MCP = process_gvog9(gvog9_count)
            UNI56u, UNI56t, UNI56df = process_uni56(uni56_count)
            df_results_tree["GVOG9u"] = GVOG9u
            df_results_tree["GVOG9t"] = GVOG9t
            df_results_tree["GVOG7u"] = GVOG7u
            df_results_tree["GVOG7t"] = GVOG7t
            df_results_tree["GVOG7df"] = GVOG7df
            df_results_tree["MCP"] = MCP
            df_results_tree["UNI56u"] = UNI56u
            df_results_tree["UNI56t"] = UNI56t
            df_results_tree["UNI56df"] = UNI56df
            if len(prediction)>0:
                df_results_tree["xgb"] = prediction[0]
            else:
                df_results_tree["xgb"] = "no_fna"
            df_results_tree = pd.merge(df_results_tree, querystats_df, on='query')
            df_results_tree.to_csv(summary_out, sep="\t", index=False)
        elif len(prediction)>0:
            allresults = [[query, "missing_markers", "missing_markers", "missing_markers", "missing_markers", "missing_markers", "missing_markers", "missing_markers", prediction[0], "no_hits", "missing_markers", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0"]]
            cols = ["query", "species", "genus", "family", "order", "class", "phylum", "domain", "xgb", "stringency", "avgdist", "GVOG9u", "GVOG9t", "GVOG7u", "GVOG7t", "GVOG7df", "MCP", "UNI56u", "UNI56t", "UNI56df"]
            df_results_tree = pd.DataFrame(allresults, columns=cols)
            df_results_tree = pd.merge(df_results_tree, querystats_df, on='query')
            df_results_tree.to_csv(summary_out, sep="\t", index=False)
        else:
            allresults = [[query, "missing_markers", "missing_markers", "missing_markers", "missing_markers", "missing_markers", "missing_markers", "missing_markers", "no_fna", "no_hits", "missing_markers", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0"]]
            cols = ["query", "species", "genus", "family", "order", "class", "phylum", "domain", "xgb", "stringency", "avgdist", "GVOG9u", "GVOG9t", "GVOG7u", "GVOG7t", "GVOG7df", "MCP", "UNI56u", "UNI56t", "UNI56df"]
            df_results_tree = pd.DataFrame(allresults, columns=cols)
            df_results_tree = pd.merge(df_results_tree, querystats_df, on='query')
            df_results_tree.to_csv(summary_out, sep="\t", index=False)
    except:
        print(f"Empty output for {summary_out.split('/')[-1].split('.')[0]}")
        with open(summary_out, "w") as outfile:
            pass

if __name__ == '__main__':
    main()

Python Pandas CLICK From line 1 of scripts/09summarize.py

import click
import os.path
import glob
import pandas as pd


@click.command()
@click.option('-r', '--resultsdirparent', type=click.Path(exists=True), help='Parent directory of the summary files.')
@click.option('-o', '--combinedout', type=click.Path(), help='Output file name.')
def main(resultsdirparent, combinedout):
    """
    Retain only annotation for d_NCLDV at highest available stringency.
    """
    results = []
    for sumout in glob.glob(resultsdirparent + "*/*.summary.tab"):
        try:
            df = pd.read_csv(sumout, index_col=None, header=0, sep="\t")
            results.append(df)
        except:
            print (sumout + " is empty")
    df_results_f = pd.concat(results, axis=0, ignore_index=True)
    df_results_f = df_results_f.fillna(0)
    # also output d_NCLDV-EUK, d_NCLDV-BAC, d_NCLDV-ARC, d_NCLDV-PHAGE
    df_results_f_gt3 = df_results_f[(df_results_f['domain'].str.startswith('d_NCLDV')) & (df_results_f['stringency'] == 'gte3')]
    df_results_f_gt2 = df_results_f[(df_results_f['domain'].str.startswith('d_NCLDV')) & (df_results_f['stringency'] == 'gte2')]
    df_results_f_gt1 = df_results_f[(df_results_f['domain'].str.startswith('d_NCLDV')) & (df_results_f['stringency'] == 'gte1')]
    df_results_f_majority = df_results_f[(df_results_f['domain'].str.startswith('d_NCLDV')) & (df_results_f['stringency'] == 'majority')]
    df_results_f_xgb = df_results_f[(df_results_f['xgb'].str.endswith('NCLDV'))]
    df_results_f_combined = pd.concat([df_results_f_gt3, df_results_f_gt2, df_results_f_gt1, df_results_f_majority, df_results_f_xgb])
    df_results_f_combined = df_results_f_combined.drop_duplicates(subset=['query'])
    df_results_f_combined.to_csv(combinedout, sep='\t', index=None)


if __name__ == '__main__':
    main()

Python Pandas CLICK From line 1 of scripts/10combinedout.py

import click
import tarfile
import os.path
import shutil

@click.command()
@click.option('-r', '--resultsdir', type=click.Path(exists=True), help='Directory to be compressed.')
@click.option('-o', '--tarout', type=click.Path(), help='Output tarfile name.')
def main(resultsdir, tarout):
    """
    Compress a directory to a tar.gz file and delete the original directory.
    """
    with tarfile.open(tarout, "w:gz") as tar:
        tar.add(resultsdir, arcname=os.path.basename(resultsdir))
        try:
            shutil.rmtree(resultsdir)
        except OSError as e:
            print(f"Error: {resultsdir} : {e.strerror}")

if __name__ == '__main__':
    main()

Python CLICK From line 1 of scripts/11cleanup.py

shell:
  """
  (python workflow/scripts/00reformat.py -i {input} -o {output[0]} -s {output[1]}) &> {log}
  """

SnakeMake From line 60 of workflow/Snakefile

shell:
  """
  python workflow/scripts/01genecalling_single.py -f {input} -g {output.gffout} -gs {output.genecalling_statsout} -ss {output.best_statsout} -fa {output.faafinalout} &> {log}
  cp {input} {output.fnafinalout}
  """

SnakeMake From line 83 of workflow/Snakefile

  shell:
    """
    if [ -f {params.infilebase}.fna ]; then
    python workflow/scripts/01predict.py -q {params.infilebase}.fna -m {input.xgbD} -k {input.featuresD} -l {input.mappingsD} -o {output} &> {log};
    else touch {output}; fi
    """


"""
Step 1 
Identify markers, extract, align
"""

SnakeMake From line 106 of workflow/Snakefile

shell:
  """
  (python workflow/scripts/02hmmsearch.py -q {input.queryfaa} -m {input.models} -h {output.GVOG9out} -t GVOG9 -c {output.hitcounts}) &> {log}
  """

SnakeMake From line 133 of workflow/Snakefile

shell:
  """
  (python workflow/scripts/02hmmsearch.py -q {input.queryfaa} -m {input.models} -h {output.UNI56out} -t UNI56 -c {output.hitcounts}) &> {log}
  """

SnakeMake From line 154 of workflow/Snakefile

shell:
  """
  python workflow/scripts/03extract_qhits.py -h {input.hmmout} -q {input.queryfaa} -o {output.queryhitsfaa}
  touch {output.queryhitsfaa}
  """

SnakeMake From line 171 of workflow/Snakefile

shell:
  """
  (python workflow/scripts/04blastp_reduce_merge.py -q {input.queryhitsfaa} -r {input.reffaa} -d {input.refdb} -b {output.blastpout} -o {output.mergedfaa}) >& {log}
  touch {output.blastpout} {output.mergedfaa}
  """

SnakeMake From line 194 of workflow/Snakefile

  shell:
    """
    (python workflow/scripts/05align_trim.py -q {input.mergedfaa} -a {output.aln} -t {output.trimmedaln}) >& {log}
    touch {output.aln} {output.trimmedaln}
    """


"""
Step 2 - Tree based
Build protein trees, get nearest neighbor in trees
"""

SnakeMake From line 214 of workflow/Snakefile

shell:
  """
  (python workflow/scripts/06build_tree.py -a {input.trimmedaln} -t {output.tree}) >& {log}
  touch {output.tree}
  """

SnakeMake From line 238 of workflow/Snakefile

shell: 
  """
  (python workflow/scripts/07get_nn_aln.py -q {params.queryname} -a {params.fdir}/queryrefs_aligned/ -o {output.aln_out} -l {labels}
  python workflow/scripts/08get_nn_tree.py -q {params.queryname} -t {params.fdir}/queryrefs_fasttree/ -o {output.tree_out} -l {labels}) &> {log}
  """

SnakeMake From line 262 of workflow/Snakefile

shell:
  """
  (python workflow/scripts/09summarize.py -n {input.nn_tree} -g {input.gvog9_count} -u {input.uni56_count} -q {input.querystats} -s {output} -c {input.xgbout}
  touch {output}) &> {log}
  """

SnakeMake From line 285 of workflow/Snakefile

shell:
  """
  python workflow/scripts/10combinedout.py -r {params.fdirp} -o {output.combined}
  touch {output.combined}
  """

SnakeMake From line 304 of workflow/Snakefile

shell:
  """
  #find {params.fdir} -type f -empty -print -delete
  python workflow/scripts/11cleanup.py -r {params.fdir} -o {output}
  """

SnakeMake From line 323 of workflow/Snakefile

ShowHide 18 more snippets with no or duplicated tags.

Comments

Support

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Created: 1yr ago

Updated: 1yr ago

Maitainers: public

URL: https://github.com/NeLLi-team/gvclass

Name: gvclass

Version: 1

Badge:

Insert copied code into your website to add a link to this workflow.

License: GNU General Public License v2.0

Keywords:

Taxonomic classification fasttree prodigal Biopython BLAST CLICK Diamond hmmsearch (genouest) MAFFT API (EBI) Pandas Snakemake trimAl ete3 sklearn viral-genomics

Future updates

Related Workflows

psychip_snakemake — Show Details View Workflow

ENCODE pipeline for histone marks developed for the psychENCODE project

public

psychip pipeline is an improved version of the ENCODE pipeline for histone marks developed for the psychENCODE project. The o...

raw sequence reads Alignment Sequence alignment report macs2 ucsc-bedclip bedGraphToBigWig BEDTools BWA Picard SAMtools Snakemake

Free

Near-real time tracking of SARS-CoV-2 in Connecticut

public

Repository containing scripts to perform near-real time tracking of SARS-CoV-2 in Connecticut using genomic data. This pipeli...

JSON nextclade Augur Biopython FOCUS Pandas Snakemake bs4 epiweeks geopy matplotlib numpy pycountry pycountry-convert uszipcode

Free

cellranger-snakemake-gke — Show Details View Workflow

snakemake workflow to run cellranger on a given bucket using gke.

public

A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...

macs2 ucsc-bedclip bedGraphToBigWig BEDTools BWA Picard SAMtools Snakemake

Free

ATLAS - Three commands to start analyzing your metagenome data

public

Metagenome-atlas is a easy-to-use metagenomic pipeline based on snakemake. It handles all steps from QC, Assembly, Binning, t...

raw sequence reads Genome assembly Annotation track checkm2 gunc prodigal snakemake-wrapper-utils MEGAHIT Atlas BBMap Biopython BioRuby Bwa-mem2 cd-hit CheckM DAS Diamond eggNOG-mapper v2 MetaBAT 2 Minimap2 MMseqs MultiQC Pandas Picard pyfastx SAMtools SemiBin Snakemake SPAdes SqueezeMeta TADpole VAMB CONCOCT ete3 gtdbtk h5py networkx numpy plotly psutil utils metagenomics

Free

175

rna-seq-star-deseq2 — Show Details View Workflow

RNA-seq workflow using STAR and DESeq2

public

This workflow performs a differential gene expression analysis with STAR and Deseq2. The usage of this workflow is described ...

Free

dna-seq-gatk-variant-calling — Show Details View Workflow

This Snakemake pipeline implements the GATK best-practices workflow

public

This Snakemake pipeline implements the GATK best-practices workflow for calling small germline variants. The usage of thi...

VCF raw sequence reads Variant calling genetic variants gatk rust-bio-tools snakemake-wrapper-utils tabix BCFtools BWA FastQC MultiQC Pandas Picard SAMtools Snakemake Trimmomatic Variant Effect Predictor (VEP) common matplotlib numpy seaborn DNA

Free