Snakemake Workflow: Impute HLA and/or KIR alleles from SNPs

public 1yr ago Version: v1.0 0 bookmarks

View Workflow

Help improve this workflow!

This workflow has been published but could be further improved with some additional meta data:

Keyword(s) in categories input, output, operation, topic

You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .

N.B: As of 21st of February, 2020 shapeit=v2.r837 has been removed from the bioconda channel. Hence, I have removed it from the environment file. However, shapeit=2.r904 is available from https://anaconda.org/dranew/shapeit and if you want to use it add it to the environment file with the appropriate dranew channel. Runs with shapeit4 will still work as it is still available on bioconda.

HLA and KIR Imputation from SNP

Authors

Bisrat Johnathan Debebe ( @bjohnnyd )

Usage
Standalone Script

Usage

Workflow Overview

The full workflow consists of creating datasets ready for imputation:

alt text

The steps can be summarized as:

1. Convert input PLINK bed files to reference b37
2. Encode REF/ALT alleles the same as the KIR*IMP reference panel
3. Phase the encoded VCF using shapeit4 and shapeit.

Quickstart

To run the pipeline edit config.yaml to point to your input PLINK bed file by placing the following yaml directives:

project:
 <project name>:
 liftover:
 plink: <path to PLINK bed file>

where <project name> is the name you want to be associated with your run and output and <path to PLINK bed file> is the full or relative path to your input PLINK bed file.

To run the pipeline run the command:

snakemake -j<number of parallel jobs> --use-conda

The default PLINK bed file is assumed to be on hg18 reference but additional reference types are possible as hg16 , hg17 , hg38 . To specify for example reference hg17 edit config.yaml with the following:

project:
 <project name>:
 liftover:
 plink: <path to PLINK bed file>
 reference: hg17

All output will produced inside the output/<project name> directory.

Configure the workflow according to your needs via editing the file config.yaml . To see the list of default configurations settings run the command snakemake print_defaults .

Running Specific Parts or Running Workflow in Steps

The workflow can be ran from a specific part. In addition, the workflow can be run in steps allowing to explore the data before phasing and filtering.

To see the list of specific rules/steps that are possible run snakemake --list-target-rules . Each of the target rules listed can be run with the command snakemake -j<number of parallel jobs> [OPTIONS] <target rule name> .

Running Only Liftover

alt text

It is possible to only use the workflow to perform liftover of PLINK bed files from specific reference to b37 . This will produce a VCF file with the reference being b37 .

To perform only liftover with the same config.yaml from Quickstart run the command:

snakemake -j<number of parallel jobs> --use-conda liftover

Encode VCF Without Liftover or Perform Liftover and Encoding Without Phasing With SHAPEIT

alt text

If you have VCF file based on reference b37 it is possible to only run encoding without liftover. To encode the VCF file using KIR*IMP reference panel edit config.yaml :

project:
 <project name>:
 freq_encode_snps:
 vcf: <path to VCF file>

then execute snakemake using the command:

snakemake -j<number of parallel jobs> --use-conda kirimp_encode

If the above command is executed with the config.yaml settings from Quickstart only liftover and encoding will be performed. This allows inspection of the variants in the VCF by looking at the images produced in output/{project}/kirimp/01_freq_encode_snps/{project}.png . The images might help in deducing what threshold to use during SHAPEIT for filtering troublesome variants with missing genotypes.

Run SHAPEIT After Checking VCF Statistics

alt text

To run SHAPEIT and prepare data for KIR*IMP using the file produced by running snakemake -j<number of parallel jobs> --use-conda kirimp_encode run the command:

snakemake -j<number of parallel jobs> --use-conda kirimp_ready

In addition, SHAPEIT can be run on a specific VCF (not necessarily produced by the workflow) by specifying the vcf file under the shapeit directive in config.yaml :

project:
 <project name>:
 shapeit
 vcf: <Path to VCF file>

Additional and Default Parameters

The default parameters settings and there descriptions are listed below and can be modified by placing it in config.yaml with the desired values different from defaults:

BCFTOOLS_THREADS: 4
KIRIMP_PANEL_URL: http://imp.science.unimelb.edu.au/kir/static/kirimp.uk1.snp.info.csv
PLINK_THREADS: 4
SHAPEIT_GENMAP_URL: https://github.com/odelaneau/shapeit4/blob/master/maps/genetic_maps.b37.tar.gz?raw=true
SHAPEIT_THREADS: 4
project:
 <project name>:
 freq_encode_snps:
 vcf: output/{project}/liftover/04_hg19_vcf/{project}.{reference}ToHg19.vcf.gz
 additional: "--outlier-threshold 0.1" # additional parameters to be passed to the script in 'scripts/frequency_encode_snps.py'.
 liftover:
 reference: hg18 # Input PLINK bed reference can be either hg16, hg17, hg18 or hg38
 shapeit:
 vcf: output/{project}/kirimp/01_freq_encode_snps/{project}.vcf.gz
 gmap: input/meta/shapeit/kirimp.chr19.gmap.txt.gz
 pbwt: 8 # shapeit4: determines the number of conditioning neighbours, higher number should produce better accuracy but slower runtimes
 pbwt-modulo: 8 # shapeit4: determines variants for storing pbwt indexes
 regions: [19] # chromosomes for phasing default is 19 for KIR*IMP
 states: 500 # shapeit2: determines number of haplotypes for conditioning, higher number should produce better accuracy but slower runtimes
 min_missing: 0.25 # variants with missing genotype rate larger than this will be discarded before phasing with SHAPEIT. Inspect the image from kirimp_encode to determine threshold
 v2_additional: "" # Additional parameters for shapeit version 2. Description of possible settings can be found at https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html
 v4_additional: "" # Additional parameters for shapeit version 4. Description of possible settings can be found at https://odelaneau.github.io/shapeit4/

Workflow Running Options

Test your configuration by performing a dry-run via

snakemake --use-conda -npr

Execute the workflow locally via

snakemake --use-conda -j $N

using $N cores or run it in a cluster environment via

snakemake --use-conda --cluster qsub --jobs 100

or if DRMAA is available

snakemake --use-conda --drmaa --jobs 100

If you not only want to fix the software stack but also the underlying OS, use

snakemake --use-conda --use-singularity

in combination with any of the modes above. See the Snakemake documentation for further details.

Investigate results

After successful execution, you can create a self-contained interactive HTML report with all results via:

snakemake --report report.html

Standalone Script

The workflow includes a standalone script

scripts/frequency_encode_snps.py

which can be used to encode a VCF file to a reference panel of variants. To setup the environment for the script create the environment:

conda env create -f envs/freq_encode_snps.yml

then activate the environment with

conda activate freq_encode_snps

The parameters to run the standalone script are:

usage: frequency_encode_snps.py [-h] -v VCF_FILE -r REFERENCE_PANEL
 [-rt {kirimp,custom}] [--separator SEPARATOR]
 -o OUTPUT [-chr [CHROMOSOMES]]
 [--chromosome-annotation {ucsc,ensembl,genbank}]
 [-a] [-c] [-min MIN_AMBIGIOUS_THRESHOLD]
 [-max MAX_AMBIGIOUS_THRESHOLD]
 [--outlier-threshold OUTLIER_THRESHOLD]
 [-t THREADS]

For usage with KIR*IMP the VCF has to be with hg18/b37 reference.

The script in addition to encoding the VCF file will produce a diagnosis plot (same name as output vcf file but with .png extension) as well as a TOML file containing all the metadata of the encoding. The metadata TOML can be read into R using configr

require(configr)
read.config(file = "metadata.toml")

or into Python using

import toml
toml.load("metadata.toml")

the resulting vcf will contain statistics from the encoding inside the INFO fields:

##INFO=<ID=UPD,Number=A,Type=Flag,Description="REF and ALT updated based on reference panel frequency">
##INFO=<ID=PFD,Number=A,Type=Float,Description="Alternate frequency difference to reference panel frequency">
##INFO=<ID=MISS,Number=A,Type=Float,Description="Missing Genotype Frequency">
##INFO=<ID=MAF,Number=A,Type=Float,Description="Minor Allele Frequency">

Code Snippets

shell: 
    "scripts/frequency_encode_snps.py -t {threads} -v {input.vcf} "
    "-r {input.panel} -o {output.vcf} {params} 2> {log} && "
    "bcftools index -f {output.vcf}"

SnakeMake BCFtools From line 23 of rules/freq_encode_snps.smk

shell: "mv {input} {output}"

SnakeMake From line 9 of rules/kirimp_panel.smk

shell: "mv {input} {output}"

SnakeMake From line 17 of rules/liftover.smk

shell: "liftOver {input.bed} {input.chain} {output.lifted} {output.unlifted} &> {log}"

SnakeMake ucsc-liftover From line 43 of rules/liftover.smk

    shell: 
        """
	plink --allow-extra-chr --allow-no-sex --real-ref-alleles --keep-allele-order --snps-only just-acgt --bfile {params.basein} --recode vcf --out {params.baseout} &> {log}
        bcftools view -U {output.tmp_vcf} | bcftools norm --threads {threads} --rm-dup snps | bcftools +fill-tags -Oz  -- -d  2> {log} > {output.vcf} 2>> {log}
        bcftools index -f {output.vcf} &>> {log} 
        bcftools stats {output.vcf} > {output.stats} 2>> {log}
        """

SnakeMake BCFtools pLink From line 77 of rules/liftover.smk

shell: "cat {input} | tar xzfO - --wildcards 'chr19.b37.gmap.gz' 2> {log}  > {output}"

SnakeMake From line 12 of rules/shapeit.smk

shell: 
    """
        bcftools filter -r {wildcards.region} -Oz -e 'MISS > {params.missing_threshold}' {input.vcf} > {output.filtered_vcf} 2> {log.bcftools}
        bcftools index -f {output.filtered_vcf} 2>> {log.bcftools}
        shapeit4 --thread {threads} --input {output.filtered_vcf} --map {input.gmap} --output {output.phased_vcf} \
        --region {wildcards.region} --pbwt-depth {params.pbwt} --pbwt-modulo {params.pbwt_modulo} --log {output.log} {params.additional} 2> {log.shapeit4}
        bcftools convert --hapsample {params.out} {output.phased_vcf} 2>> {log.bcftools}
        gunzip {output.hap}.gz 
    """

SnakeMake BCFtools shapeit4 From line 35 of rules/shapeit.smk

shell: 
    "bcftools filter -r {wildcards.region} -Oz -e 'MISS > {params.missing_threshold}' {input.vcf} > {output.filtered_vcf} 2> {log.bcftools} && "
    "bcftools index -f {output.filtered_vcf} 2> {log.bcftools}  && "
    "shapeit --thread {threads} --input-vcf {output.filtered_vcf} -M {input.gmap} --states {params.states} "
    "-O {output.haps} {output.sample} --output-graph {output.graph} "
    "--output-log {output.log} {params.additional} 2> {log.shapeit}"

SnakeMake BCFtools SHAPEIT From line 64 of rules/shapeit.smk

import argparse
import re
import sys
import textwrap
import matplotlib
import numpy as np
from os import path
from pathlib import Path
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
from matplotlib.patches import Patch

from cyvcf2 import VCF, Writer

plt.rcParams["figure.figsize"] = (18, 14)
matplotlib.use("Agg")

CHROMOSOME_19_ANNOTATION = {"ucsc": "chr19", "ensembl": "19", "genbank": "CM000681.2"}

# Required headers for panel input files, either kirimp (kirimp.uk1.snp.info.csv) or custom with the required fields
KIRIMP_HEADER = ["id", "position", "allele0", "allele1", "allele1_frequency"]
CUSTOM_HEADER = ["chrom", "pos", "a0", "a1", "freq"]

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G", ".": "."}

""" CUSTOM HEADERS """
UPDATED = {
    "ID": "UPD",
    "Description": "REF and ALT updated based on reference panel frequency",
    "Type": "Flag",
    "Number": "A",
}

PANEL_FREQ_DIFF = {
    "ID": "PFD",
    "Description": "Alternate frequency difference to reference panel frequency",
    "Type": "Float",
    "Number": "A",
}

MISSINGNES = {
    "ID": "MISS",
    "Description": "Missing Genotype Frequency",
    "Type": "Float",
    "Number": "A",
}

MAF = {
    "ID": "MAF",
    "Description": "Minor Allele Frequency",
    "Type": "Float",
    "Number": "A",
}

AF = {
    "ID": "AF",
    "Description": "Alternate Allel Frequency",
    "Type": "Float",
    "Number": "A",
}
toml_header = """
[encoder_settings]
outlier_threshold=%s
minimum_ambigious_threshold=%f
maximum_ambigious_threshold=%f
reference_panel_type=%s
chromosome_annotation=%s
ambigious_dropped=%r
correct_complement=%r


[variant]


"""

variant_toml = """
    [variant.%s]
    status = %s
    reason = %s
    frequency = %f
    panel_frequency=%r
    updated_frequency=%r
    minor_allele_frequency=%r
    missing_genotype_frequency=%r
    panel_frequency_difference=%r


"""

""" Class to represent genotypes in 0/1 format might not be necessary as I can flip from there"""


class Genotype(object):
    __slots__ = ("alleles", "phased")
    FLIP_DICT = {0: 1, 1: 0, -1: -1}

    def __init__(self, li):
        self.alleles = li[:-1]
        self.phased = li[-1]

    def __str__(self):
        sep = "/|"[int(self.phased)]
        return sep.join("0123."[a] for a in self.alleles)

    def flip(self):
        self.alleles = [Genotype.FLIP_DICT[allele] for allele in self.alleles]

    def genotype(self):
        return self.alleles + [self.phased]

    __repr__ = __str__


""" Class to keep track of VCF file summaries for variants """


class VCFSummary(object):
    __slots__ = (
        "ambigious",
        "unknown_alt",
        "updated",
        "flipped",
        "n_in_panel",
        "kept",
        "__freqs",
        "VARIANTS",
    )

    def __init__(self):
        self.VARIANTS = {}
        self.ambigious = 0
        self.unknown_alt = 0
        self.updated = 0
        self.flipped = 0
        self.n_in_panel = 0
        self.kept = 0
        self.__freqs = None

    def __str__(self):
        summary_toml = """
        [summary]
        n_ambigious_variants=%d
        n_unknown_alt_or_monomorphic=%d
        n_updated=%d
        n_flipped_strand=%d
        vcf_variants_in_panel=%d
        vcf_variants_in_panel_after_encoding_snps=%d
        """.rstrip()
        return textwrap.dedent(summary_toml) % (
            self.ambigious,
            self.unknown_alt,
            self.updated,
            self.flipped,
            self.n_in_panel,
            self.kept,
        )

    def add_variant(self, v_id):
        self.VARIANTS[v_id] = {"freq": None, "updated_freq": None, "panel_freq": None}

    def add_variant_dict(self, vdict):
        self.VARIANTS.update(vdict)

    def freqs(self):
        if not self.__freqs:
            self.__freqs = np.array(
                [
                    [
                        v["freq"],
                        v["updated_freq"],
                        v.get("MAF", None),
                        v.get("MISS", None),
                        v.get("PFD", None),
                        v["panel_freq"],
                    ]
                    for k, v in sorted(self.VARIANTS.items())
                ]
            )
        return self.__freqs

    def v_ids(self, original=True):
        if original:
            vids = np.array([v["v_id"] for k, v in sorted(self.VARIANTS.items())])
        else:
            vids = np.array(sorted(self.VARIANTS.keys()))
        return vids

    def updates(self):
        return np.array([v["updated"] == 1 for k, v in sorted(self.VARIANTS.items())])

    __repr__ = __str__


def main(arguments=None):
    args = parse_arguments()
    vcf = VCF(args["vcf_file"], threads=args["threads"])
    vcf.add_info_to_header(UPDATED)
    vcf.add_info_to_header(PANEL_FREQ_DIFF)
    vcf.add_info_to_header(MISSINGNES)
    vcf.add_info_to_header(MAF)
    if not vcf.contains("AF"):
        vcf.add_info_to_header(AF)
    w = Writer(args["output"], vcf)

    panel = generate_panel_data(
        panel_file=args["reference_panel"],
        chr=args["chromosomes"],
        annotation=args["chromosome_annotation"],
        panel_type=args["reference_panel_type"],
        separator=args["separator"],
    )

    vcf_summary = VCFSummary()

    print(
        toml_header
        % (
            args["outlier_threshold"],
            args["min_ambigious_threshold"],
            args["max_ambigious_threshold"],
            args["reference_panel_type"],
            args["chromosome_annotation"],
            args["ambigious"],
            args["fix_complement_ref_alt"],
        ),
        file=sys.stderr,
    )
    for variant in vcf:
        status = "unchanged"
        reason = "None"
        panel_variant_freq = None
        variant_id_end = str(variant.CHROM) + "_" + str(variant.end)
        if not variant.INFO.get("AF"):
            variant.INFO["AF"] = variant.aaf
        if variant_id_end in panel:
            variant.INFO["UPD"] = 0
            panel_variant = panel[variant_id_end]
            panel_variant_freq = panel_variant["freq"]
            vcf_summary.n_in_panel += 1
            if not variant.ALT:
                print_variant_toml(
                    variant, panel_variant["freq"], "removed", "unknown_alt/monomorphic"
                )
                vcf_summary.unknown_alt += 1
                continue
            if (
                args["ambigious"]
                and variant.aaf > args["min_ambigious_threshold"]
                and variant.aaf < args["max_ambigious_threshold"]
            ):
                vcf_summary.ambigious += 1
                print_variant_toml(
                    variant, panel_variant["freq"], "removed", "ambigious_frequency"
                )
                continue
            if should_recode(variant, panel_variant):
                swap_ref_alt(variant)
                variant.INFO["UPD"] = 1
                vcf_summary.updated += 1
                status = "updated"
                reason = "ref_alt_swapped"
            if (
                should_flipstrand(variant, panel_variant)
                and args["fix_complement_ref_alt"]
            ):
                flipstrand(variant, panel_variant["freq"])
                variant.INFO["UPD"] = 1
                vcf_summary.flipped += 1
                status = "strand_flipped"
                reason = "ref/alt_not_in_panel_nucleotides"

            vcf_summary.add_variant(variant_id_end)
            v_freq = variant.INFO.get("AF")

            variant.INFO["PFD"] = abs(variant.INFO.get("AF") - panel_variant["freq"])
            variant.INFO["MISS"] = np.sum(variant.gt_types == 2) / len(variant.gt_types)
            variant.INFO["MAF"] = v_freq if v_freq < 0.5 else 1 - v_freq

            vcf_summary.VARIANTS[variant_id_end].update(
                {
                    "freq": variant.aaf,
                    "panel_freq": panel_variant["freq"],
                    "updated_freq": v_freq,
                    "MAF": variant.INFO.get("MAF"),
                    "MISS": variant.INFO.get("MISS"),
                    "PFD": variant.INFO.get("PFD"),
                    "v_id": variant.ID,
                    "updated": variant.INFO.get("UPD"),
                }
            )
            print_variant_toml(variant, panel_variant_freq, status, reason)
            vcf_summary.kept += 1
        w.write_record(variant)
    w.close()
    vcf.close()
    plot_file = re.sub(r"(vcf|bcf)(\.gz)*$", "png", args["output"])
    if not vcf_summary.n_in_panel == 0:
        create_summary_plot(
            vcf_summary, outfile=plot_file, threshold=args["outlier_threshold"]
        )
    print(vcf_summary, file=sys.stderr)
    print("n_reference_panel_size=%d" % len(panel.keys()), file=sys.stderr)


def print_variant_toml(
    variant, panel_variant_freq, status, reason, variant_toml=variant_toml
):

    variant_tup = (
        variant.INFO.get("AF", None),
        variant.INFO.get("MAF", None),
        variant.INFO.get("MISS", None),
        variant.INFO.get("PFD", None),
    )

    print(
        variant_toml
        % ((variant.ID, status, reason, variant.aaf, panel_variant_freq) + variant_tup),
        file=sys.stderr,
    )


def swap_ref_alt(variant):
    gts = variant.genotypes
    gts = [Genotype(li) for li in gts]
    for gt in gts:
        gt.flip()
    variant.genotypes = [gt.genotype() for gt in gts]
    updated_frequency = sum([gt.alleles.count(1) for gt in gts]) / (
        2 * len([gt for gt in gts if -1 not in gt.alleles])
    )
    temp_nuc = variant.REF
    variant.REF = variant.ALT[0]
    variant.ALT = [temp_nuc]
    variant.INFO["AF"] = updated_frequency


def flipstrand(variant, panel_freq, COMPLEMENT=COMPLEMENT):
    variant.REF = COMPLEMENT[variant.REF]
    variant.ALT = COMPLEMENT[variant.ALT[0]]
    frequency_synced = (panel_freq > 0.5 and variant.INFO.get("AF") > 0.5) or (
        panel_freq < 0.5 and variant.INFO.get("AF") < 0.5
    )
    if not frequency_synced:
        swap_ref_alt(variant)


def should_recode(variant, panel_variant):
    frequency_synced = (
        panel_variant["freq"] > 0.5 and variant.INFO.get("AF") > 0.5
    ) or (panel_variant["freq"] < 0.5 and variant.INFO.get("AF") < 0.5)
    nucleotides_synced = (
        variant.REF == panel_variant["A0"] and variant.ALT[0] == panel_variant["A1"]
    )
    return not nucleotides_synced and not frequency_synced


def should_flipstrand(variant, panel_variant, COMPLEMENT=COMPLEMENT):
    nucleotides_synced = (
        variant.REF == panel_variant["A0"] and variant.ALT[0] == panel_variant["A1"]
    )
    frequency_synced = (
        panel_variant["freq"] > 0.5 and variant.INFO.get("AF") > 0.5
    ) or (panel_variant["freq"] < 0.5 and variant.INFO.get("AF") < 0.5)
    alt_is_complement = COMPLEMENT[variant.REF] == variant.ALT[0]
    return (not nucleotides_synced or not frequency_synced) and alt_is_complement


def create_summary_plot(v_summary, outfile, threshold=None):
    freqs = v_summary.freqs()
    default_color = plt.rcParams["axes.prop_cycle"].by_key()["color"][0]

    fig = plt.figure()

    ax1 = fig.add_subplot(321)
    ax2 = fig.add_subplot(323, sharex=ax1)
    ax3 = fig.add_subplot(325)
    ax4 = fig.add_subplot(222)
    ax5 = fig.add_subplot(224, sharex=ax4)

    titles = (
        "Original VCF Frequencies Compared to Panel Frequencies",
        "Updated VCF Frequencies Compared to Panel Frequencies",
        "Minor Allele Frequency Compared to Difference in Frequency Between Panel and VCF",
        "Genotype Missingness Compared to Difference in Frequency Between Panel and VCF",
    )

    x_labs = (
        "VCF Alternate Frequency",
        "VCF Alternate Frequency",
        "Minor Allele Frequency",
        "Missing Genotype Frequency",
    )

    y_labs = (
        "Panel Allele Frequency",
        "Panel Allele Frequency",
        "Panel vs VCF Frequency Difference",
        "Panel vs VCF Frequency Difference",
    )

    coefs = np.corrcoef(freqs.T)[:, [4, 5]]

    for i, ax in enumerate([ax1, ax2, ax4, ax5]):
        coef, comparison_freq = (
            (coefs[i, 0], freqs[:, 4]) if i > 1 else (coefs[i, 1], freqs[:, 5])
        )
        ax.set_title(titles[i], fontsize=9)
        ax.scatter(freqs[:, i], comparison_freq, s=10, alpha=0.7)
        ax.annotate(
            "corr = %.2f" % coef,
            (
                max(freqs[:, i]) - max(freqs[:, i]) / 20,
                max(comparison_freq) - max(comparison_freq) / 20,
            ),
            ha="center",
            fontsize=10,
        )
        ax.set_ylabel(y_labs[i], fontsize=9)
        ax.set_xlabel(x_labs[i], fontsize=9)
        ax.plot(ax.get_xlim(), ax.get_ylim(), ls="--", c=".3")

        if threshold:
            idxs = freqs[:, 4] > threshold
            for f, cf, vid in zip(
                freqs[idxs, i], comparison_freq[idxs], v_summary.v_ids()[idxs]
            ):
                ax.annotate(vid, (f, cf), ha="center", fontsize=8)

    v_types = ["REF --> ALT", "Strand Flipped", "Ambigious Variants", "ALT Missing"]
    counts = [
        v_summary.updated,
        v_summary.flipped,
        v_summary.ambigious,
        v_summary.unknown_alt,
    ]
    bar_width = 0.75
    idx = np.arange(len(counts))
    barlist = ax3.bar(idx, counts, width=bar_width, align="center")
    ax3.set_xticks(idx)
    ax3.set_xticklabels(v_types, rotation=45, minor=False, fontsize=8)
    [bar.set_color("r") for bar in barlist[2:]]
    for i, count in enumerate(counts):
        col = "r" if i > 1 else "black"
        ax3.text(i, count, " " + str(count), ha="center", color=col)
    ax3.set_ylabel("Counts", fontsize=9)
    ax3.set_title("Variant Modification Type and Excluded Counts", fontsize=9)

    leg_elements = [
        Patch(facecolor=default_color, label="Updated"),
        Patch(facecolor="red", label="Removed"),
    ]
    ax3.legend(handles=leg_elements, loc="upper right")

    plt.savefig(outfile)


def generate_panel_data(
    panel_file, chr=None, annotation="ensembl", panel_type="kirimp", separator=None
):
    f = open(panel_file, "r")
    header_line = next(f).strip()
    sep = get_separator(header_line, separator)
    header_line = [cell.replace('"', "") for cell in header_line.split(sep)]
    if panel_type == "kirimp":
        chromosome = CHROMOSOME_19_ANNOTATION[annotation]
        if header_line != KIRIMP_HEADER:
            raise TypeError(
                "If input panel type is kirimp, the panel needs to contain a comma-separated header:\n%s"
                % ",".join(KIRIMP_HEADER)
            )
    else:
        if header_line != CUSTOM_HEADER:
            raise TypeError(
                "If input panel type is custom, the panel needs to contain a comma-separated header:\n%s"
                % ",".join(CUSTOM_HEADER)
            )
    snp_dict = {
        chromosome + "_" + cells[1]
        if panel_type == "kirimp"
        else cells[0]
        + "_"
        + cells[1]: {"A0": cells[2], "A1": cells[3], "freq": float(cells[4].strip())}
        if panel_type == "kirimp"
        else {"A0": cells[2], "A1": cells[3], "freq": float(cells[4])}
        for cells in [line.strip().replace('"', "").split(sep) for line in f]
    }
    f.close()
    return snp_dict


def get_separator(line, passed_separator=None):
    tabs = line.count(r"\t")
    commas = line.count(r",")
    if passed_separator:
        sep = passed_separator
    elif tabs == 4 and commas != 4:
        sep = r"\t"
    elif tabs != 4 and commas == 4:
        sep = ","
    else:
        raise TypeError(
            "Cannot determine separator from file please specify separator directly as an argument [--reference-panel-col-separator]"
        )
    return sep


def parse_arguments(arguments=None):
    parser = argparse.ArgumentParser(
        description="This script encodes SNPs in a VCF to a reference panel based on allele frequencies",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    parser._action_groups.pop()
    required = parser.add_argument_group("required arguments")
    optional = parser.add_argument_group("optional arguments")
    required.add_argument(
        "-v",
        "--vcf-file",
        help="VCF/BCF file to re-encode (can be compressed with bgzip)",
        required=True,
        type=str,
    )
    required.add_argument(
        "-r",
        "--reference-panel",
        help="Reference panel file  either in format of KIR*IMP reference panel  or a custom data format [chrom pos ref alt freq]",
        required=True,
        type=str,
    )
    optional.add_argument(
        "-rt",
        "--reference-panel-type",
        help="Reference panel file type",
        choices=["kirimp", "custom"],
        default="kirimp",
        type=str,
    )
    optional.add_argument(
        "--separator", help="Custom reference panel column separator", type=str
    )
    optional.add_argument(
        "-o", "--output", help="Output vcf file", required=True, type=str
    )
    optional.add_argument(
        "-chr",
        "--chromosomes",
        help="Chromosome over which to encode SNPs ",
        required=False,
        nargs="?",
        type=str,
    )
    optional.add_argument(
        "--chromosome-annotation",
        help="Chromosome annotation type in the VCF",
        choices=["ucsc", "ensembl", "genbank"],
        default="ensembl",
        type=str,
    )
    optional.add_argument(
        "-a",
        "--ambigious",
        help="Determines whether ambigious alternate alleles should be dropped",
        action="store_false",
    )
    optional.add_argument(
        "-c",
        "--fix-complement-ref-alt",
        help="Should ref/alt that are complements be fixed with respect to frequency",
        action="store_false",
    )
    optional.add_argument(
        "-min",
        "--min-ambigious-threshold",
        help="Alternate alleles above this frequency and below the max ambigious frequency will be flagged as ambigious",
        default=0.495,
        type=float,
    )
    optional.add_argument(
        "-max",
        "--max-ambigious-threshold",
        help="Alternate alleles above this frequency and below the max ambigious frequency will be flagged as ambigious",
        default=0.505,
        type=float,
    )
    optional.add_argument(
        "--outlier-threshold",
        help="Threshold to use to label variant frequency differences between alternate and panel frequencis that are significant",
        default=None,
        type=float,
    )
    optional.add_argument(
        "-t", "--threads", help="Number of threads to use for compression", type=int
    )
    args = vars(parser.parse_args())
    if (
        args["reference_panel_type"] == "custom"
        and args["reference_panel_format"] is None
    ):
        parser.error(
            "custom --reference-panel-type requires --reference-panel-format to be set"
        )
    return args


if __name__ == "__main__":
    main()

Python numpy matplotlib cyvcf2 From line 2 of scripts/frequency_encode_snps.py

shell: "snakemake --dag | dot -Tsvg > {output}"

SnakeMake Snakemake From line 43 of master/Snakefile

ShowHide 3 more snippets with no or duplicated tags.

Comments

Support

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Created: 1yr ago

Updated: 1yr ago

Maitainers: public

URL: https://github.com/bjohnnyd/hla-kir-imputation

Name: hla-kir-imputation

Version: v1.0

Badge:

Insert copied code into your website to add a link to this workflow.

License: MIT License

Keywords:

shapeit4 ucsc-liftover BCFtools cyvcf2 pLink SHAPEIT Snakemake matplotlib numpy

Future updates

Related Workflows

psychip_snakemake — Show Details View Workflow

ENCODE pipeline for histone marks developed for the psychENCODE project

public

psychip pipeline is an improved version of the ENCODE pipeline for histone marks developed for the psychENCODE project. The o...

raw sequence reads Alignment Sequence alignment report macs2 ucsc-bedclip bedGraphToBigWig BEDTools BWA Picard SAMtools Snakemake

Free

Near-real time tracking of SARS-CoV-2 in Connecticut

public

Repository containing scripts to perform near-real time tracking of SARS-CoV-2 in Connecticut using genomic data. This pipeli...

JSON nextclade Augur Biopython FOCUS Pandas Snakemake bs4 epiweeks geopy matplotlib numpy pycountry pycountry-convert uszipcode

Free

cellranger-snakemake-gke — Show Details View Workflow

snakemake workflow to run cellranger on a given bucket using gke.

public

A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...

macs2 ucsc-bedclip bedGraphToBigWig BEDTools BWA Picard SAMtools Snakemake

Free

ATLAS - Three commands to start analyzing your metagenome data

public

Metagenome-atlas is a easy-to-use metagenomic pipeline based on snakemake. It handles all steps from QC, Assembly, Binning, t...

raw sequence reads Genome assembly Annotation track checkm2 gunc prodigal snakemake-wrapper-utils MEGAHIT Atlas BBMap Biopython BioRuby Bwa-mem2 cd-hit CheckM DAS Diamond eggNOG-mapper v2 MetaBAT 2 Minimap2 MMseqs MultiQC Pandas Picard pyfastx SAMtools SemiBin Snakemake SPAdes SqueezeMeta TADpole VAMB CONCOCT ete3 gtdbtk h5py networkx numpy plotly psutil utils metagenomics

Free

175

rna-seq-star-deseq2 — Show Details View Workflow

RNA-seq workflow using STAR and DESeq2

public

This workflow performs a differential gene expression analysis with STAR and Deseq2. The usage of this workflow is described ...

Free

dna-seq-gatk-variant-calling — Show Details View Workflow

This Snakemake pipeline implements the GATK best-practices workflow

public

This Snakemake pipeline implements the GATK best-practices workflow for calling small germline variants. The usage of thi...

VCF raw sequence reads Variant calling genetic variants gatk rust-bio-tools snakemake-wrapper-utils tabix BCFtools BWA FastQC MultiQC Pandas Picard SAMtools Snakemake Trimmomatic Variant Effect Predictor (VEP) common matplotlib numpy seaborn DNA

Free

Snakemake Workflow: Impute HLA and/or KIR alleles from SNPs

Help improve this workflow!

HLA and KIR Imputation from SNP

Authors

Table of Contents

Usage

Workflow Overview

Quickstart

Running Specific Parts or Running Workflow in Steps

Running Only Liftover

Encode VCF Without Liftover or Perform Liftover and Encoding Without Phasing With SHAPEIT

Run SHAPEIT After Checking VCF Statistics

Additional and Default Parameters

Workflow Running Options

Investigate results

Standalone Script

Code Snippets

Comments

Support

Free

Related Workflows

public

public

public

public

public

public