Snakemake module for generating reports related to Hydra-Genetics results

Version: v0.1.0

A module containing rules for various types of reports

:speech_balloon: Introduction

The module contains rules for generating reports from results generated by other Hydra-Genetics modules.

:heavy_exclamation_mark: Dependencies

In order to use this module, the following dependencies are required:

:school_satchel: Preparations

Sample data

Input data should be added to `samples.tsv` and `units.tsv`. The following information needs to be added to these files:

Column Id	Description
`samples.tsv`
sample	unique sample/patient id, one per row
`units.tsv`
sample	same sample/patient id as in `samples.tsv`
type	data type identifier (one letter), can be one of **T**umor, **N**ormal, **R**NA
platform	type of sequencing platform, e.g. `NovaSeq`
machine	specific machine id, e.g. NovaSeq instruments have `@Axxxxx`
flowcell	identifier of the flowcell used
lane	flowcell lane number
barcode	sequence library barcode/index, connect forward and reverse indices by `+` , e.g. `ATGC+ATGC`
fastq1/2	absolute path to forward and reverse reads
adapter	adapter sequences to be trimmed, separated by comma
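
As a rough illustration of the layout described above, the sketch below checks that a units table contains the listed columns and that `type` is one of `T`, `N` or `R`. The helper name `check_units` and all example values are hypothetical and not part of the module. The sketch also assumes that `fastq1/2` stands for two separate columns, `fastq1` and `fastq2`.

```python
import csv
import io

# Columns listed in the table above; "fastq1/2" is assumed to mean two columns.
REQUIRED_UNIT_COLUMNS = [
    "sample", "type", "platform", "machine", "flowcell",
    "lane", "barcode", "fastq1", "fastq2", "adapter",
]
VALID_TYPES = {"T", "N", "R"}


def check_units(handle):
    """Return a list of problems found in a tab-separated units table."""
    reader = csv.DictReader(handle, delimiter="\t")
    problems = []
    fieldnames = reader.fieldnames or []
    missing = [c for c in REQUIRED_UNIT_COLUMNS if c not in fieldnames]
    if missing:
        problems.append(f"missing columns: {', '.join(missing)}")
        return problems
    # The header is line 1, so the first data row is line 2.
    for row_number, row in enumerate(reader, start=2):
        if row["type"] not in VALID_TYPES:
            problems.append(f"line {row_number}: invalid type {row['type']!r}")
    return problems


# Made-up example content, not taken from the test dataset.
header = "\t".join(REQUIRED_UNIT_COLUMNS)
good_row = "sample1\tT\tNovaSeq\t@A00000\tFLOWCELL1\tL001\tATGC+ATGC\t/data/s1_R1.fastq.gz\t/data/s1_R2.fastq.gz\tACGT,TGCA"
bad_row = good_row.replace("\tT\t", "\tX\t", 1)
example = io.StringIO("\n".join([header, good_row, bad_row]) + "\n")
print(check_units(example))
```

A check like this could be run before starting the workflow, so that a malformed table is reported early rather than in the middle of a run.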

:white_check_mark: Testing

The workflow repository contains a small test dataset in `.tests/integration`, which can be run like so:

$ cd .tests/integration
$ snakemake \
 -s ../../workflow/Snakefile \
 --configfile config.yaml \
 --use-singularity \
 --singularity-args "--bind $(realpath ../..)" \
 -j1

:rocket: Usage

To use this module in your workflow, follow the description in the Snakemake docs. Add the module to your Snakefile like so:

module reports:
    snakefile:
        github(
            "reports",
            path="workflow/Snakefile",
            tag="0.1.0",
        )
    config:
        config


use rule * from reports as reports_*

Output files

The following output files should be targeted via another rule:

File	Description
`reports/cnv_html_report/{sample}_{type}.{tc_method}.cnv_report.html`	Interactive report of CNV calling results
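
One way to build the list of targets for such a rule is sketched below. Only the path pattern comes from the table above. The function name `cnv_report_targets`, the shape of its input and the `tc_method` value `pathology` are assumptions made for illustration.

```python
def cnv_report_targets(units, tc_methods):
    """Build report paths from (sample, type) pairs and tc_method values."""
    pattern = "reports/cnv_html_report/{sample}_{type}.{tc_method}.cnv_report.html"
    return [
        pattern.format(sample=sample, type=unit_type, tc_method=tc_method)
        # Several units (lanes) can share a sample and type, so deduplicate.
        for sample, unit_type in sorted(set(units))
        for tc_method in tc_methods
    ]


# Hypothetical samples; "pathology" is an assumed wildcard value.
targets = cnv_report_targets(
    [("sample1", "T"), ("sample2", "T"), ("sample1", "T")],
    ["pathology"],
)
print(targets)
```

In a Snakefile, a list like this could be passed as `input` to a target rule such as `all`.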

:judge: Rule Graph

%%{ init: { "flowchart-font-size": "15px" } }%%
flowchart RL
 0(all)
 1(_copy_cnv_html_report)
 2(cnv_html_report)
 3(merge_cnv_json)
 4(cnv_json)
 1 --> 0
 2 --> 1
 3 --> 2
 4 --> 3
 classDef default fill:#00495C,stroke:#000C0F,stroke-width:2,color:#FEF6FC;
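
The graph above is a single chain, and the order in which the rules would have to run can be derived from it. The sketch below does this with the standard library `graphlib` module (Python 3.9 or later). It reads each arrow such as `4 --> 3` as "`cnv_json` is required by `merge_cnv_json`", which is an interpretation of the graph rather than something stated in the text.

```python
from graphlib import TopologicalSorter

# Map each rule to the rules it depends on, following the arrows of the graph.
dependencies = {
    "all": {"_copy_cnv_html_report"},
    "_copy_cnv_html_report": {"cnv_html_report"},
    "cnv_html_report": {"merge_cnv_json"},
    "merge_cnv_json": {"cnv_json"},
}

# static_order() yields every rule after all of its dependencies.
execution_order = list(TopologicalSorter(dependencies).static_order())
print(" -> ".join(execution_order))
```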

Code Snippets

script:
    "../scripts/cnv_html_report.py"

SnakeMake From line 37 of rules/cnv_html_report.smk

script:
    "../scripts/cnv_json.py"

SnakeMake From line 68 of rules/cnv_html_report.smk

script:
    "../scripts/merge_cnv_json.py"

SnakeMake From line 104 of rules/cnv_html_report.smk

from jinja2 import Template
from pathlib import Path
import sys
import time


def get_sample_name(filename):
    return Path(filename).name.split(".")[0]


def create_report(template_filename, json_filename, css_files, js_files, show_table, tc, tc_method):
    with open(template_filename) as f:
        template = Template(source=f.read())

    with open(json_filename) as f:
        json_string = f.read()

    css_string = ""
    for css_filename in css_files:
        with open(css_filename) as f:
            css_string += f.read()

    js_string = ""
    for js_filename in js_files:
        with open(js_filename) as f:
            js_string += f.read()

    return template.render(
        dict(
            json=json_string,
            css=css_string,
            js=js_string,
            metadata=dict(
                date=time.strftime("%Y-%m-%d %H:%M", time.localtime()),
                sample=get_sample_name(json_filename),
                show_table=show_table,
                tc=tc,
                tc_method=tc_method,
            ),
        )
    )


def main():
    log = Path(snakemake.log[0])

    logfile = open(log, "w")
    sys.stdout = sys.stderr = logfile

    json_filename = snakemake.input.json
    template_dir = Path(snakemake.input.template_dir)
    html_filename = snakemake.output.html

    html_template = template_dir / "index.html"
    css_files = sorted(template_dir.glob("*.css"))
    js_files = sorted(template_dir.glob("*.js"))

    report = create_report(
        html_template,
        json_filename,
        css_files,
        js_files,
        snakemake.params.include_table,
        snakemake.params.tc,
        snakemake.params.tc_method,
    )

    with open(html_filename, "w") as f:
        f.write(report)


if __name__ == "__main__":
    main()

Python Jinja2 From line 1 of scripts/cnv_html_report.py
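
The sample name shown in the report metadata is whatever precedes the first dot in the name of the JSON file. The short sketch below repeats `get_sample_name` from the script above and applies it to made-up paths; the file names are hypothetical and only serve to show the behaviour.

```python
from pathlib import Path


def get_sample_name(filename):
    # Same logic as in the script: keep the file name up to the first dot.
    return Path(filename).name.split(".")[0]


# Directories are ignored, and everything after the first dot is dropped.
name_a = get_sample_name("some/dir/sample1_T.pathology.merged.json")
# A sample id that itself contains a dot is cut short.
name_b = get_sample_name("my.sample_T.merged.json")
print(name_a, name_b)
```

A consequence of this design is that sample ids containing dots would be truncated in the report header, so such ids are best avoided.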

import collections
import csv
import functools
import json
from pathlib import Path
import sys


# The `parse_*_ratios` functions take the name of a file containing copy
# number log2-ratios across the genome for a specific CNV caller. The
# `parse_*_segments` functions take the name of a file containing log2-ratio
# segments across the genome for a specific caller. Both kinds of functions
# should return a list of dictionaries, where the dictionaries
# look like
#
# {
#     "chromosome": str,
#     "start": int,
#     "end": int,
#     "log2": float,
# }
#


PARSERS = collections.defaultdict(dict)


def cnv_parser(file_format, header=True, skip=0, comment="#"):
    """
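    Decorator for parsers of CNV result files. The first argument of
    the wrapped function should be a path to a file, and this argument
    is replaced with the contents of that file. How the content is represented
    depends on the file format:

    - tsv, csv: a generator over lines, each line being a list of values

    Any other file format raises an IOError when the parser is called.

    The wrapped function must be named `parse_<caller>_<filetype>`, since
    the name determines where the parser is registered in `PARSERS`.

    arguments:
        file_format     format of the input file, "tsv" or "csv"
        header          whether the first non-comment line is a header
                        line that should be skipped
        skip            the number of lines to skip before starting to read
                        the file
        comment         lines starting with this character are skipped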
    """

    def decorator_cnv_parser(func):
        caller_filetype = func.__name__.split("_")[1:]
        caller = "_".join(caller_filetype[:-1])
        filetype = caller_filetype[-1]

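        def line_generator(file, delim):
            # Skip the requested number of leading lines; stop quietly
            # if the file has fewer lines than that.
            for _ in range(skip):
                next(file, None)
            found_header = False
            for line in csv.reader(file, delimiter=delim):
                # Ignore blank lines and comment lines
                if not line or line[0].strip().startswith(comment):
                    continue
                # The first remaining line is the header, if there is one
                if header and not found_header:
                    found_header = True
                    continue
                yield line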

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            f = None
            try:
                f = open(args[0], "r")
                if file_format == "tsv":
                    lines = line_generator(f, delim="\t")
                elif file_format == "csv":
                    lines = line_generator(f, delim=",")
                else:
                    raise IOError(f"invalid filetype: {file_format}")
                args = [lines] + list(args[1:])
                return func(*args, **kwargs)
            finally:
                if f is not None:
                    f.close()

        PARSERS[caller][filetype] = wrapper
        # Return the wrapper so the decorated name is not rebound to None
        return wrapper

    return decorator_cnv_parser


@cnv_parser("tsv", header=True)
def parse_cnvkit_ratios(file):
    ratios = []
    for line in file:
        ratios.append(
            dict(
                chromosome=line[0],
                start=int(line[1]),
                end=int(line[2]),
                log2=float(line[5]),
            )
        )
    return ratios


@cnv_parser("tsv", header=True)
def parse_cnvkit_segments(file):
    segments = []
    for line in file:
        segments.append(
            dict(
                chromosome=line[0],
                start=int(line[1]),
                end=int(line[2]),
                log2=float(line[4]),
            )
        )
    return segments


@cnv_parser("tsv", header=True, comment="@")
def parse_gatk_ratios(file):
    ratios = []
    for line in file:
        ratios.append(
            dict(
                chromosome=line[0],
                start=int(line[1]),
                end=int(line[2]),
                log2=float(line[3]),
            )
        )
    return ratios


@cnv_parser("tsv", header=True, comment="@")
def parse_gatk_segments(file):
    segments = []
    for line in file:
        segments.append(
            dict(
                chromosome=line[0],
                start=int(line[1]),
                end=int(line[2]),
                log2=float(line[4]),
            )
        )
    return segments


def to_json(caller, ratios, segments):
    json_dict = dict(
        caller=caller,
        ratios=ratios,
        segments=segments,
    )
    return json.dumps(json_dict)


def main():
    log = Path(snakemake.log[0])

    logfile = open(log, "w")
    sys.stdout = sys.stderr = logfile

    caller = snakemake.wildcards["caller"]
    ratio_filename = snakemake.input["ratios"]
    segment_filename = snakemake.input["segments"]

    output_filename = snakemake.output["json"]

    skip_chromosomes = snakemake.params["skip_chromosomes"]

    if caller not in PARSERS:
        print(f"error: no parser for {caller} implemented", file=sys.stderr)
        sys.exit(1)

    ratios = PARSERS[caller]["ratios"](ratio_filename)
    segments = PARSERS[caller]["segments"](segment_filename)

    if skip_chromosomes is not None:
        ratios = [r for r in ratios if r["chromosome"] not in skip_chromosomes]
        segments = [s for s in segments if s["chromosome"] not in skip_chromosomes]

    with open(output_filename, "w") as f:
        print(to_json(caller, ratios, segments), file=f)


if __name__ == "__main__":
    main()

Python JSON From line 1 of scripts/cnv_json.py
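
The decorator above registers each parser under a caller and a file type derived from the function name, and it hands the parser a stream of data rows with comments and the header removed. The standalone sketch below mirrors those two steps without the decorator machinery. The helper names `split_parser_name` and `data_lines` are hypothetical, the input is made up in a GATK-like layout, and blank rows are skipped here as an extra precaution.

```python
import csv
import io


def split_parser_name(function_name):
    """Split e.g. 'parse_cnvkit_ratios' into ('cnvkit', 'ratios')."""
    parts = function_name.split("_")[1:]
    return "_".join(parts[:-1]), parts[-1]


def data_lines(handle, delim="\t", header=True, comment="#"):
    """Yield data rows, skipping comment rows and optionally one header row."""
    found_header = False
    for row in csv.reader(handle, delimiter=delim):
        if not row or row[0].strip().startswith(comment):
            continue
        if header and not found_header:
            found_header = True
            continue
        yield row


# Made-up content: one '@' comment line, a header row, then two data rows.
example = io.StringIO(
    "@HD\tVN:1.6\n"
    "CONTIG\tSTART\tEND\tLOG2_COPY_RATIO\n"
    "chr1\t100\t200\t-0.25\n"
    "chr1\t201\t300\t0.5\n"
)
ratios = [
    dict(chromosome=r[0], start=int(r[1]), end=int(r[2]), log2=float(r[3]))
    for r in data_lines(example, comment="@")
]
print(split_parser_name("parse_gatk_ratios"))
print(ratios)
```

Because everything between `parse_` and the final underscore is taken as the caller, a caller name may itself contain underscores, while the file type may not.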

from collections import defaultdict
from collections.abc import Generator
from dataclasses import dataclass
import json
from pathlib import Path
import pysam
import sys
from typing import Union


@dataclass
class CNV:
    caller: str
    chromosome: str
    genes: list
    start: int
    length: int
    type: str
    copy_number: float
    baf: float

    def end(self):
        return self.start + self.length - 1

    def overlaps(self, other):
        return self.chromosome == other.chromosome and (
            # overlaps in the beginning, or self contained in other
            (self.start >= other.start and self.start <= other.end())
            or
            # overlaps at the end, or self contained in other
            (self.end() >= other.start and self.end() <= other.end())
            or
            # other is contained in self
            (other.start >= self.start and other.end() <= self.end())
        )

    def __hash__(self):
        return hash(f"{self.caller}_{self.chromosome}:{self.start}-{self.end()}_{self.copy_number}")


cytoband_config = snakemake.config.get("merge_cnv_json", {}).get("cytoband_config", {}).get("colors", {})
cytoband_centromere = "acen"
cytoband_colors = {
    "gneg": cytoband_config.get("gneg", "#e3e3e3"),
    "gpos25": cytoband_config.get("gpos25", "#555555"),
    "gpos50": cytoband_config.get("gpos50", "#393939"),
    "gpos75": cytoband_config.get("gpos75", "#8e8e8e"),
    "gpos100": cytoband_config.get("gpos100", "#000000"),
    "acen": cytoband_config.get("acen", "#963232"),
    "gvar": cytoband_config.get("gvar", "#000000"),
    "stalk": cytoband_config.get("stalk", "#7f7f7f"),
}


def parse_fai(filename, skip=None):
    with open(filename) as f:
        for line in f:
            chrom, length = line.strip().split()[:2]
            if skip is not None and chrom in skip:
                continue
            yield chrom, int(length)


def parse_annotation_bed(filename, skip=None):
    with open(filename) as f:
        for line in f:
            chrom, start, end, name = line.strip().split()[:4]
            if skip is not None and chrom in skip:
                continue
            yield chrom, int(start), int(end), name


def cytoband_color(giemsa):
    return cytoband_colors.get(giemsa, "#ff0000")


def parse_cytobands(filename, skip=None):
    cytobands = defaultdict(list)
    with open(filename) as f:
        for line in f:
            chrom, start, end, name, giemsa = line.strip().split()
            if skip is not None and chrom in skip:
                continue
            cytobands[chrom].append(
                {
                    "name": name,
                    "start": int(start),
                    "end": int(end),
                    "direction": "none",
                    "giemsa": giemsa,
                    "color": cytoband_color(giemsa),
                }
            )

    for k, v in cytobands.items():
        cytobands[k] = sorted(v, key=lambda x: x["start"])
        centromere_index = [i for i, x in enumerate(cytobands[k]) if x["giemsa"] == cytoband_centromere]

        if len(centromere_index) > 0 and len(centromere_index) != 2:
            print(
                f"error: chromosome {k} does not have 0 or 2 centromere bands, found {len(centromere_index)}",
                file=sys.stderr,
            )
            sys.exit(1)
        elif len(centromere_index) == 0:
            continue

        cytobands[k][centromere_index[0]]["direction"] = "right"
        cytobands[k][centromere_index[1]]["direction"] = "left"

    return cytobands


def get_vaf(vcf_filename: Union[str, bytes, Path], skip=None) -> Generator[tuple, None, None]:
    vcf = pysam.VariantFile(str(vcf_filename))
    for variant in vcf.fetch():
        if skip is not None and variant.chrom in skip:
            continue
        yield variant.chrom, variant.pos, variant.info.get("AF", None)


def get_cnvs(vcf_filename, skip=None):
    cnvs = defaultdict(lambda: defaultdict(list))
    vcf = pysam.VariantFile(vcf_filename)
    for variant in vcf.fetch():
        if skip is not None and variant.chrom in skip:
            continue
        caller = variant.info.get("CALLER")
        if caller is None:
            raise KeyError("could not find caller information for variant, has the vcf been annotated?")
        genes = variant.info.get("Genes")
        if genes is None:
            continue
        if isinstance(genes, str):
            genes = [genes]
        cnv = CNV(
            caller,
            variant.chrom,
            sorted(genes),
            variant.pos,
            variant.info.get("SVLEN"),
            variant.info.get("SVTYPE"),
            variant.info.get("CORR_CN"),
            variant.info.get("BAF"),
        )
        cnvs[variant.chrom][caller].append(cnv)
    return cnvs


def merge_cnv_dicts(dicts, vaf, annotations, cytobands, chromosomes, filtered_cnvs, unfiltered_cnvs):
    callers = list(map(lambda x: x["caller"], dicts))
    caller_labels = dict(
        cnvkit="cnvkit",
        gatk="GATK",
    )
    cnvs = {}
    for chrom, chrom_length in chromosomes:
        cnvs[chrom] = dict(
            chromosome=chrom,
            label=chrom,
            length=chrom_length,
            vaf=[],
            annotations=[],
            callers={c: dict(name=c, label=caller_labels.get(c, c), ratios=[], segments=[], cnvs=[]) for c in callers},
        )

    for a in annotations:
        for item in a:
            cnvs[item[0]]["annotations"].append(
                dict(
                    start=item[1],
                    end=item[2],
                    name=item[3],
                )
            )

    for c in cytobands:
        cnvs[c]["cytobands"] = cytobands[c]

    if vaf is not None:
        for v in vaf:
            cnvs[v[0]]["vaf"].append(
                dict(
                    pos=v[1],
                    vaf=v[2],
                )
            )

    # Iterate over the unfiltered CNVs and pair them according to overlap.
    for uf_cnvs, f_cnvs in zip(unfiltered_cnvs, filtered_cnvs):
        for chrom, cnvdict in uf_cnvs.items():
            callers = sorted(list(cnvdict.keys()))
            first_caller = callers[0]
            rest_callers = callers[1:]

            # Keep track of added CNVs on a chromosome to avoid duplicates
            added_cnvs = set()

            for cnv1 in cnvdict[first_caller]:
                pass_filter = False

                if cnv1 in f_cnvs[chrom][first_caller]:
                    # The CNV is part of the filtered set, so all overlapping
                    # CNVs should pass the filter.
                    pass_filter = True

                cnv_group = [cnv1]
                for caller2 in rest_callers:
                    for cnv2 in cnvdict[caller2]:
                        if cnv1.overlaps(cnv2):
                            # Add overlapping CNVs from other callers
                            cnv_group.append(cnv2)

                            if cnv2 in f_cnvs[chrom][caller2]:
                                # If the overlapping CNV is part of the filtered
                                # set, the whole group should pass the filter.
                                pass_filter = True

                for c in cnv_group:
                    if c in added_cnvs:
                        continue
                    cnvs[c.chromosome]["callers"][c.caller]["cnvs"].append(
                        dict(
                            genes=c.genes,
                            start=c.start,
                            length=c.length,
                            type=c.type,
                            cn=c.copy_number,
                            baf=c.baf,
                            passed_filter=pass_filter,
                        )
                    )
                    added_cnvs.add(c)

    for d in dicts:
        for r in d["ratios"]:
            cnvs[r["chromosome"]]["callers"][d["caller"]]["ratios"].append(
                dict(
                    start=r["start"],
                    end=r["end"],
                    log2=r["log2"],
                )
            )
        for s in d["segments"]:
            cnvs[s["chromosome"]]["callers"][d["caller"]]["segments"].append(
                dict(
                    start=s["start"],
                    end=s["end"],
                    log2=s["log2"],
                )
            )

    for v in cnvs.values():
        v["callers"] = list(v["callers"].values())

    return list(cnvs.values())


def main():
    log = Path(snakemake.log[0])

    logfile = open(log, "w")
    sys.stdout = sys.stderr = logfile

    annotation_beds = snakemake.input["annotation_bed"]
    fasta_index_file = snakemake.input["fai"]
    germline_vcf = snakemake.input["germline_vcf"]
    json_files = snakemake.input["json"]
    filtered_cnv_vcf_files = snakemake.input["filtered_cnv_vcfs"]
    cnv_vcf_files = snakemake.input["cnv_vcfs"]
    cytoband_file = snakemake.input["cytobands"]

    if len(germline_vcf) == 0:
        germline_vcf = None

    output_file = snakemake.output["json"]

    skip_chromosomes = snakemake.params["skip_chromosomes"]
    show_cytobands = snakemake.params["cytobands"]

    cnv_dicts = []
    for fname in json_files:
        with open(fname) as f:
            cnv_dicts.append(json.load(f))

    fai = parse_fai(fasta_index_file, skip_chromosomes)
    vaf = None
    if germline_vcf is not None:
        vaf = get_vaf(germline_vcf, skip_chromosomes)
    annotations = []
    for filename in annotation_beds:
        annotations.append(parse_annotation_bed(filename, skip_chromosomes))

    cytobands = []
    if cytoband_file and show_cytobands:
        cytobands = parse_cytobands(cytoband_file, skip_chromosomes)

    filtered_cnv_vcfs = []
    unfiltered_cnv_vcfs = []
    for f_vcf, uf_vcf in zip(filtered_cnv_vcf_files, cnv_vcf_files):
        filtered_cnv_vcfs.append(get_cnvs(f_vcf, skip_chromosomes))
        unfiltered_cnv_vcfs.append(get_cnvs(uf_vcf, skip_chromosomes))

    cnvs = merge_cnv_dicts(cnv_dicts, vaf, annotations, cytobands, fai, filtered_cnv_vcfs, unfiltered_cnv_vcfs)

    with open(output_file, "w") as f:
        print(json.dumps(cnvs), file=f)


if __name__ == "__main__":
    main()
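
The merging step relies on two ideas: an inclusive interval overlap test, and the rule that a group of overlapping CNVs passes the filter as soon as one of its members is in the filtered set. The sketch below illustrates both with a stripped-down stand-in for the `CNV` class. The names `Interval` and `group_passes` are hypothetical, and the overlap test uses the common two-comparison form, which should be equivalent to the three-clause test in the script as long as lengths are positive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Hypothetical stand-in for the CNV class, keeping only coordinates."""

    chromosome: str
    start: int
    length: int

    def end(self):
        # Inclusive end coordinate, as in CNV.end()
        return self.start + self.length - 1

    def overlaps(self, other):
        # Two inclusive intervals overlap when each starts before the other ends.
        return (
            self.chromosome == other.chromosome
            and self.start <= other.end()
            and other.start <= self.end()
        )


def group_passes(cnv, others, filtered):
    """A group passes if cnv or any CNV overlapping it is in the filtered set."""
    group = [cnv] + [o for o in others if cnv.overlaps(o)]
    return any(member in filtered for member in group)


a = Interval("chr1", 100, 50)  # covers 100-149
b = Interval("chr1", 149, 10)  # covers 149-158, shares position 149 with a
c = Interval("chr1", 150, 10)  # covers 150-159, starts right after a ends
d = Interval("chr2", 100, 50)  # same coordinates, different chromosome

print(a.overlaps(b), a.overlaps(c), a.overlaps(d))
print(group_passes(a, [b, c], filtered={b}))
print(group_passes(a, [c], filtered={c}))
```

In the script itself the grouping is done per chromosome, with the CNVs of the alphabetically first caller as the starting points and the CNVs of the remaining callers as candidates for overlap.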