Reproducibility workflow for Gigante et al., 2019: Using long-read sequencing to detect imprinted DNA methylation
Haplotyped Methylome
Reproducibility instructions for Gigante et al., 2019.
Note: this repository is still being tested! If you find a bug, please file an issue.
Data are available at ENA accession PRJEB27157.
Author's note: there is an error in Figure S2 of the paper, in which the figure legend was marked Maternal/Paternal where it should have been Black6/Cast. A corrected version is available here.
System requirements
- R
- Python >= 3.5
- Lots of RAM (min. 256GB)
- Lots of disk space (estimate: 2TB)
Dependencies
To install the dependencies with conda, run the following commands:
conda env create -f environment.yml
source activate haplotyped_methylome
You will then need to install Albacore; it is available from the Nanopore Community.
To install without conda, see the list of dependencies at the bottom of this README.
Required data
For the standard workflow, snakemake will download all the necessary files.
If you wish to avoid running albacore, bwa and nanopolish on the raw nanopore data, you can run the following command, which downloads the output of these programs and tricks snakemake into thinking you have run the pipeline from the beginning:
snakemake intermediate_download
Note: currently this only downloads the methylation files. We hope to provide the alignment and phasing data in the future.
If you wish to rerun from the beginning after running this command, you can revert to the original download with snakemake --forceall.
Running the workflow
To generate all plots, tables and notebooks, simply run from the root directory:
snakemake --cores 16
If you don't wish to run the full analysis, you can run specific rules from the Snakefile by running, for example:
snakemake --cores 16 rnaseq_analysis
snakemake --cores 16 haplotype_analysis
snakemake --cores 16 methylation_analysis
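If you want to see what a given target would do before committing the compute, Snakemake's standard dry-run flag works here too, e.g. snakemake -n methylation_analysis prints the jobs that would execute without running them.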
Installation without conda
Software dependencies:
- Albacore (optional)
- BWA (optional)
- Nanopolish (optional)
Python package dependencies:
pip install --user -r requirements.txt
R package dependencies:
Rscript install_R_deps.R
Known Issues
- rnaseq_analysis.Rmd failed: there is no package called 'ggrastr'
If devtools doesn't play nicely with conda, sometimes the automatic GitHub installation of ggrastr fails. You can resolve it as follows:
git clone --depth 1 https://github.com/VPetukhov/ggrastr.git
cd ggrastr
R -e 'devtools::install()'
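Alternatively, installing the package straight from GitHub inside R should be equivalent: R -e 'devtools::install_github("VPetukhov/ggrastr")'.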
Code Snippets
shell: "python build_supplementary_db.py {input.bam} --summary {output.summary}"

shell: "python split_methylation_by_alignment.py {input.bam} {input.meth} > {output}"

shell: "python calculate_methylation_frequency.py -i {input} -p > {output}"

shell: "python call_variant_proportion.py -b {input.bam} -t {threads} -p {input.phased_bam} -v {input.vcf} -o {output}"

shell:
    "mkdir -p {output}.tmp && tail -n +2 {input.meth_split} | "
    "sort -k4,4 -k1,1 -k2n,2n -T {output}.tmp | "
    "cat <(head -n 1 {input.meth}) - > {output} && "
    "rm -rf {output}.tmp"

shell:
    "mkdir -p {output}.tmp && tail -n +2 {input.meth_split} | "
    "sort -k1,1 -k2n,2n -k4,4 -T {output}.tmp | "
    "cat <(head -n 1 {input.meth}) - > {output} && "
    "rm -rf {output}.tmp"

shell: "Rscript read_summary.R {input} {output}"

shell: "Rscript read_haplotypes.R {input} {output}"

shell: "Rscript fit_reads.R ../nanopore/{params.sample} ../RData/{params.sample}"

shell: "python split_methylation_by_haplotype.py -m {input.meth} -p {input.phase}"

shell: "python calculate_methylation_frequency.py -i {input} -p > {output}"

shell: "Rscript dss_paired.R &> {log}"

shell: "python compare_haplotype_methylation.py -m {input.meth} -p {input.phase} -b 11 -o 5 -r $(cat {input.region}) > {output}"

script: "build_dss_dmrlist.R"

script: "build_methylation_df.R"
# build_dss_dmrlist.R
suppressMessages(suppressPackageStartupMessages(library(tidyverse)))

load("../RData/paired_DSS.RData")

combined_dmr <- dmr_parent %>%
  mutate(type="imprinted") %>%
  bind_rows(dmr_strain %>%
              mutate(type="strain")) %>%
  mutate(chr = as.character(chr)) %>%
  arrange(-abs(areaStat)) %>%
  mutate(dmr_start=start,
         dmr_end=end,
         start = dmr_start-5000,
         end = dmr_end + 5000,
         id=row_number()) %>%
  select(chr, start, end, everything()) %>%
  fuzzyjoin::genome_left_join(read_tsv("../genome_data/Mus_musculus.GRCm38_90.chr.genes.tsv",
                                       col_types="ccccdd") %>%
                                select(chr, start, end, gene_name)) %>%
  select(-starts_with("start"), -starts_with("end"), -chr.y) %>%
  select(id, chr=chr.x, start=dmr_start, end=dmr_end, everything()) %>%
  group_by(id) %>%
  mutate(overlapping_genes=paste0(gene_name, collapse=",")) %>%
  distinct(overlapping_genes, .keep_all = TRUE) %>%
  ungroup() %>%
  select(-gene_name)

combined_dmr %>%
  write_csv("../tables/dss_dmrlist.csv")
# build_methylation_df.R
suppressMessages(suppressPackageStartupMessages(library(tidyverse)))
suppressMessages(suppressPackageStartupMessages(library(data.table)))

bisulfite_df <- read_tsv("../bisulfite/B6CastF1_1_pe.summary.tsv",
                         col_names=c("chr", "pos", "percentMeth", "meth", "coverage"),
                         col_types='ciddd') %>%
  arrange(chr, pos) %>%
  dplyr::rename(start=pos) %>%
  mutate(end=start,
         chr=sub("chr", "", chr))
save(bisulfite_df, file="../RData/bisulfite_df.RData")

nanopolish_df <- read.table("../nanopore/b6xcast.minion.methylation.summary.tsv",
                            header=TRUE, sep="\t", stringsAsFactors = FALSE) %>%
  mutate(start=start+1,
         end=end+2) %>%
  dplyr::rename(chr=chromosome, percentMeth=methylated_frequency) %>%
  arrange(chr, start)
save(nanopolish_df, file="../RData/nanopolish_df.RData")
# build_supplementary_db.py
from __future__ import print_function
import pysam
import pickle
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('bam')
parser.add_argument('-s', '--summary', default=None,
                    help='Optional file for printing read summaries')
parser.add_argument('--summary-only', default=False, action='store_true',
                    help="Don't produce a .suppdb file")
args = parser.parse_args()
bam_fn = args.bam

reads = dict()
with pysam.AlignmentFile(bam_fn, 'r') as bam:
    if args.summary:
        import numpy as np
        quals = dict()
    for read in bam:
        if read.is_unmapped:
            continue
        try:
            overlap = False
            for i in range(len(reads[read.query_name])):
                chr, start, end = reads[read.query_name][i]
                if chr == read.reference_name and start <= read.reference_end and end >= read.reference_start:
                    reads[read.query_name][i] = (
                        chr,
                        min(start, read.reference_start),
                        max(end, read.reference_end))
                    overlap = True
            if not overlap:
                reads[read.query_name].append(
                    (read.reference_name, read.reference_start, read.reference_end))
        except KeyError:
            reads[read.query_name] = [
                (read.reference_name, read.reference_start, read.reference_end)]
        if args.summary:
            try:
                read_qual = np.median(read.query_qualities)
            except TypeError:
                read_qual = "NA"
            try:
                quals[read.query_name].append(read_qual)
            except KeyError:
                quals[read.query_name] = [read_qual]

non_supplementary = set()
for read, alignments in reads.items():
    if len(alignments) == 1:
        non_supplementary.add(read)

if args.summary:
    with open(args.summary, 'w') as summary:
        print("read_name\tchr\tstart\tend\tqual", file=summary)
        for read, alignments in reads.items():
            if read in non_supplementary:
                print("{read_name}\t{chr}\t{start}\t{end}\t{qual}".format(
                    read_name=read, chr=alignments[0][0],
                    start=alignments[0][1], end=alignments[0][2],
                    qual=quals[read][0]), file=summary)
            else:
                for i, alignment in enumerate(alignments):
                    print("{read_name}\t{chr}\t{start}\t{end}\t{qual}".format(
                        read_name=read + "_" + str(i), chr=alignment[0],
                        start=alignment[1], end=alignment[2],
                        qual=quals[read][i]), file=summary)

if not args.summary_only:
    for read in non_supplementary:
        del reads[read]
    with open("{}.suppdb".format(bam_fn), 'wb') as handle:
        pickle.dump(reads, handle, pickle.HIGHEST_PROTOCOL)
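The .suppdb file written above is simply a pickled dict mapping each read that has supplementary alignments to its list of (chr, start, end) tuples; reads with a single alignment are dropped before pickling. A minimal sketch for inspecting one (the filename sample.bam.suppdb is hypothetical; the script writes "<bam>.suppdb"):

import pickle

with open("sample.bam.suppdb", "rb") as handle:
    suppdb = pickle.load(handle)

for read_name, alignments in suppdb.items():
    # Downstream scripts rename these reads "<name>_<i>", one per alignment.
    for i, (chrom, start, end) in enumerate(alignments):
        print("{}_{}".format(read_name, i), chrom, start, end)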
# calculate_methylation_frequency.py
from __future__ import print_function
import math
import sys
import csv
import argparse
from collections import namedtuple


def make_key(c, s, e):
    return c + ":" + str(s) + ":" + str(e)


class SiteStats:
    def __init__(self, g_size, g_seq):
        self.num_reads = 0
        self.posterior_methylated = 0
        self.called_sites = 0
        self.called_sites_methylated = 0
        self.group_size = g_size
        self.sequence = g_seq


def update_call_stats(key, num_called_cpg_sites, methylation, sequence):
    if key not in sites:
        sites[key] = SiteStats(num_called_cpg_sites, sequence)
    sites[key].num_reads += 1
    sites[key].called_sites += num_called_cpg_sites
    if methylation > 0:
        sites[key].called_sites_methylated += num_called_cpg_sites * methylation


parser = argparse.ArgumentParser(
    description='Calculate methylation frequency at genomic CpG sites')
parser.add_argument('-c', '--call-threshold', type=float, required=False, default=2.5)
parser.add_argument('-i', '--input', type=str, required=False)
parser.add_argument('-s', '--split-groups', action='store_true')
parser.add_argument('-p', '--probabilistic', action='store_true')
parser.add_argument('--prior', type=float, default=0.5,
                    help="Prior probability of methylation")
args = parser.parse_args()
assert(args.call_threshold is not None)
if args.probabilistic:
    args.call_threshold = 0

sites = dict()

if args.input:
    in_fh = open(args.input)
else:
    in_fh = sys.stdin
csv_reader = csv.DictReader(in_fh, delimiter='\t')

for record in csv_reader:
    try:
        num_sites = int(record['num_motifs'])
    except KeyError:
        # backwards compatible
        num_sites = int(record['num_cpgs'])
    llr = float(record['log_lik_ratio'])

    # Skip ambiguous call
    if abs(llr) < args.call_threshold:
        continue
    sequence = record['sequence']

    if args.probabilistic:
        try:
            if llr > 700:
                llr = 700
            elif llr < -700:
                llr = -700
            methylation = 1 / (1 + (1 - args.prior) / (args.prior * math.exp(llr)))
        except OverflowError:
            print(llr)
            raise
    else:
        methylation = 1 if llr > 0 else 0

    # if this is a multi-cpg group and split_groups is set, break up these
    # sites
    if args.split_groups and num_sites > 1:
        c = record['chromosome']
        s = int(record['start'])
        e = int(record['end'])
        # find the position of the first CG dinucleotide
        sequence = record['sequence']
        cg_pos = sequence.find("CG")
        first_cg_pos = cg_pos
        while cg_pos != -1:
            key = make_key(c, s + cg_pos - first_cg_pos, s + cg_pos - first_cg_pos)
            update_call_stats(key, 1, methylation, "split-group")
            cg_pos = sequence.find("CG", cg_pos + 1)
    else:
        key = make_key(record['chromosome'], record['start'], record['end'])
        update_call_stats(key, num_sites, methylation, sequence)

# header
print("\t".join(["chromosome", "start", "end", "num_cpgs_in_group",
                 "called_sites", "called_sites_methylated",
                 "methylated_frequency", "group_sequence"]))

for key in sites:
    if sites[key].called_sites > 0:
        (c, s, e) = key.split(":")
        f = float(sites[key].called_sites_methylated) / sites[key].called_sites
        print("\t".join([str(x) for x in [c, s, e, sites[key].group_size,
                                          sites[key].called_sites,
                                          sites[key].called_sites_methylated,
                                          f, sites[key].sequence]]))
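When run with -p, the script above converts each site's log-likelihood ratio into a posterior weight rather than a hard methylated/unmethylated call. The transform is Bayes' rule on the LLR with a configurable prior, clamped to ±700 to avoid overflow; restated on its own:

import math

def posterior_methylated(llr, prior=0.5):
    # Same formula as calculate_methylation_frequency.py in probabilistic mode.
    llr = max(-700.0, min(700.0, llr))
    return 1 / (1 + (1 - prior) / (prior * math.exp(llr)))

# With the default flat prior this reduces to the logistic function of the LLR.
print(posterior_methylated(0.0))   # 0.5: completely ambiguous
print(posterior_methylated(2.5))   # ~0.92
print(posterior_methylated(-2.5))  # ~0.08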
# call_variant_proportion.py
from __future__ import print_function
import sys
import pysam
import csv
import pickle
import functools
import multiprocessing
import argparse
import math

__min_coverage = 5
__override_ratio = 3
# fit exponential to log(1-correctness)
__slope = -0.1203
__intercept = -0.6927


def parse_args():
    parser = argparse.ArgumentParser(
        description="Phase nanopore reads from nanopolish output")
    parser.add_argument('-b', '--bam', required=True,
                        help="Path to the original bam file")
    parser.add_argument('-p', '--phased', required=True,
                        help="Path to the phased bam file from nanopolish phase-reads")
    parser.add_argument('-v', '--vcf', required=True, nargs='+',
                        help="Path to the variants file(s)")
    parser.add_argument('-d', '--detailed-output', action='store_true', default=False,
                        help="Print per-read, per-site calls to stdout")
    parser.add_argument('-o', '--outfile', type=argparse.FileType('w'),
                        default=sys.stdout)
    parser.add_argument('-t', '--threads', type=int,
                        default=multiprocessing.cpu_count())
    args = parser.parse_args()
    return args


def get_read_id(read, contig, position, suppdb):
    if read.query_name in suppdb:
        for i, alignment in enumerate(suppdb[read.query_name]):
            chr, start, end = alignment
            if contig == chr and position >= start and position <= end:
                return "{}_{}".format(read.query_name, i)
        raise Exception("Position {}:{} not internal to any alignment of read {}".format(
            contig, position, read.query_name))
    else:
        return read.query_name


def check_column_support(vcf_rows, vcf_pos, column, reads, suppdb, type, detailed_output):
    if column is not None:
        valid_allele = [True if vcf_row['POS'] == vcf_pos and len(
            vcf_row['REF']) == 1 else False for vcf_row in vcf_rows]
        if column.reference_pos == vcf_pos and True in valid_allele and 'PASS' in [vcf_row['FILTER'] for vcf_row in vcf_rows]:
            for read in column.pileups:
                if not read.is_del:
                    read_name = read.alignment.query_name
                    read_pos = read.query_position
                    base = read.alignment.query_sequence[read_pos]
                    call = []
                    try:
                        qual = read.alignment.query_qualities[read_pos]
                        if type == "signal":
                            # nanopolish qualities scores should be offset by
                            # 35 - they never give anything below this
                            qual = max(0, qual - 35)
                        error = math.exp(__intercept + __slope * qual)
                    except TypeError:
                        if read.alignment.query_qualities is None:
                            # quality scores missing
                            error = 0
                        else:
                            # unknown error
                            raise
                    try:
                        reads[read_name]
                    except KeyError:
                        reads[read_name] = [0] * (len(vcf_rows) + 2)
                    # only have to check reference once
                    if base == vcf_rows[0]["REF"]:
                        reads[read_name][0] += 1 - error
                        call.append("ref")
                    else:
                        reads[read_name][0] += error
                    for i in range(len(vcf_rows)):
                        if vcf_rows[i]['FILTER'] == 'PASS':
                            match = False
                            if valid_allele[i]:
                                if base in vcf_rows[i]["ALT"]:
                                    match = True
                            elif "ref" in call and base == vcf_rows[0]["REF"]:
                                # no snp here, same as reference
                                match = True
                            if match:
                                reads[read_name][i + 1] += 1 - error
                                call.append("alt{}".format(i) if len(
                                    vcf_rows) > 1 else "alt")
                            else:
                                reads[read_name][i + 1] += error
                        else:
                            # treat as heterozygous
                            if valid_allele[i] and base in vcf_rows[i]['REF'] + vcf_rows[i]['ALT']:
                                reads[read_name][i + 1] += (1 - error) / 2
                            else:
                                reads[read_name][i + 1] += error
                    # add 1 to coverage
                    reads[read_name][-1] += 1
                    if len(call) > 0 and detailed_output:
                        qual = read.alignment.query_qualities[read_pos]
                        print(vcf_rows)
                        print(base)
                        print(error)
                        print(reads[read_name])
                        print("{}\t{}\t{}\t{}\t{}\t{}".format(
                            column.reference_name, column.reference_pos,
                            read_name, type, ",".join(call), qual))
    return reads


def ref_support(reads):
    return set([read for read, counts in reads.items() if counts[0] > counts[1]])


def summarize_reads(reads):
    if len(reads) == 0:
        return None
    support = len(ref_support(reads))
    return float(support) / len(reads)


def load_vcf(handle):
    header = next(handle)
    while header.startswith("##"):
        header = next(handle)
    header = header.strip("#").strip("\n").split("\t")
    reader = csv.DictReader(handle, delimiter="\t", fieldnames=header)
    return reader


def pileup_step(ref_pos, original_pileup, phased_pileup):
    try:
        original_column = next(original_pileup)
        while original_column.reference_pos < ref_pos:
            original_column = next(original_pileup)
    except StopIteration:
        original_column = None
    try:
        phased_column = next(phased_pileup)
        while phased_column.reference_pos < ref_pos:
            phased_column = next(phased_pileup)
    except StopIteration:
        phased_column = None
    return original_column, phased_column


def process_row(vcf_rows, vcf_pos, original_pileup, original_reads,
                phased_pileup, phased_reads, suppdb, detailed_output):
    if 'PASS' in [vcf_row['FILTER'] for vcf_row in vcf_rows]:
        # process this snp
        original_column, phased_column = pileup_step(
            vcf_pos, original_pileup, phased_pileup)
        if not (original_column or phased_column):
            return None, original_reads, None, phased_reads
        else:
            original_reads = check_column_support(
                vcf_rows, vcf_pos, original_column, original_reads,
                suppdb, "base", detailed_output)
            phased_reads = check_column_support(
                vcf_rows, vcf_pos, phased_column, phased_reads,
                suppdb, "signal", detailed_output)
    return original_pileup, original_reads, phased_pileup, phased_reads


def call_contig(contig, vcf_fns, original_fn, phased_fn, suppdb, detailed_output):
    with pysam.AlignmentFile(original_fn, 'r') as original_bam, pysam.AlignmentFile(phased_fn, 'r') as phased_bam:
        vcf_handles = []
        vcf_readers = []
        for vcf_fn in vcf_fns:
            handle = open(vcf_fn, 'r')
            vcf_handles.append(handle)
            vcf_readers.append(load_vcf(handle))
        original_pileup = original_bam.pileup(contig)
        original_reads = dict()
        phased_pileup = phased_bam.pileup(contig)
        phased_reads = dict()
        if original_pileup or phased_pileup:
            # find right location in vcf - maybe an index is better for this
            vcf_rows = [next(vcf_reader) for vcf_reader in vcf_readers]
            for i, reader in enumerate(vcf_readers):
                if not vcf_rows[i]['CHROM'] == contig:
                    for vcf_row in reader:
                        if vcf_row['CHROM'] != contig:
                            # not there yet
                            continue
                        vcf_rows[i] = vcf_row
                        break
                vcf_rows[i]['POS'] = int(vcf_rows[i]['POS']) - 1
            # process the files
            while True:
                try:
                    vcf_pos = min(
                        [vcf_row['POS'] for vcf_row in vcf_rows if vcf_row['CHROM'] == contig])
                except ValueError:
                    # empty list, all on wrong contig
                    break
                original_pileup, original_reads, phased_pileup, phased_reads = process_row(
                    vcf_rows, vcf_pos, original_pileup, original_reads,
                    phased_pileup, phased_reads, suppdb, detailed_output)
                if not (original_pileup or phased_pileup):
                    # out of reads
                    break
                num_failed = 0
                for i in range(len(vcf_rows)):
                    try:
                        if vcf_rows[i]['CHROM'] == contig and vcf_rows[i]['POS'] <= vcf_pos:
                            vcf_rows[i] = next(vcf_readers[i])
                            if vcf_rows[i]['CHROM'] != contig:
                                vcf_rows[i]['POS'] = float('inf')
                                num_failed += 1
                            else:
                                vcf_rows[i]['POS'] = int(
                                    vcf_rows[i]['POS']) - 1
                    except StopIteration:
                        vcf_rows[i]['CHROM'] = None
                        vcf_rows[i]['POS'] = float('inf')
                        num_failed += 1
                if num_failed == len(vcf_rows):
                    # no more variants
                    break
    return original_reads, phased_reads


args = parse_args()
with open("{}.suppdb".format(args.bam), 'rb') as handle:
    suppdb = pickle.load(handle)
func = functools.partial(call_contig, vcf_fns=args.vcf, original_fn=args.bam,
                         phased_fn=args.phased, suppdb=suppdb,
                         detailed_output=args.detailed_output)
with pysam.AlignmentFile(args.bam, 'r') as bam:
    original_reads, phased_reads = dict(), dict()
    contigs = [bam.get_reference_name(i) for i in range(bam.nreferences)]

if args.detailed_output:
    print("\t".join(["chr", "pos", "read_name", "caller", "genotype", "qual"]))

threads = 1 if args.detailed_output else min(args.threads, len(contigs))
if threads > 1:
    pool = multiprocessing.Pool(threads)
    data = pool.imap_unordered(func, contigs)
    pool.close()
    pool.join()
else:
    data = [func(contig) for contig in contigs]

for original_contig_reads, phased_contig_reads in data:
    original_reads.update(original_contig_reads)
    phased_reads.update(phased_contig_reads)

reads = set(original_reads.keys()).union(set(phased_reads.keys()))
signal_alt_header = "\t".join(["signal_alt{}".format(
    i + 1) for i in range(len(args.vcf))]) if len(args.vcf) > 1 else "signal_alt"
base_alt_header = "\t".join(["base_alt{}".format(
    i + 1) for i in range(len(args.vcf))]) if len(args.vcf) > 1 else "base_alt"
print("\t".join(['read', 'genotype',
                 'signal_ref', signal_alt_header, 'signal_coverage',
                 'base_ref', base_alt_header, 'base_coverage']), file=args.outfile)
for read in reads:
    try:
        signal_calls = phased_reads[read]
    except KeyError:
        signal_calls = [0] * (len(args.vcf) + 2)
    try:
        base_calls = original_reads[read]
    except KeyError:
        base_calls = [0] * (len(args.vcf) + 2)
    base_coverage = base_calls[-1]
    base_calls = base_calls[:-1]
    signal_coverage = signal_calls[-1]
    signal_calls = signal_calls[:-1]
    base_max = max(base_calls)
    signal_max = max(signal_calls)
    if signal_coverage == 0 or signal_calls.count(signal_max) > 1:
        signal_call = "fail"
        signal_ratio = 0
    else:
        signal_call = signal_calls.index(signal_max)
        signal_ratio = signal_max / signal_coverage
        if signal_call == 0:
            signal_call = "ref"
        else:
            signal_call = "alt{}".format(
                signal_call) if len(args.vcf) > 1 else "alt"
    if base_coverage == 0 or base_calls.count(max(base_calls)) > 1:
        base_call = "fail"
        base_ratio = 0
    else:
        base_call = base_calls.index(max(base_calls))
        base_ratio = base_max / base_coverage
        if base_call == 0:
            base_call = "ref"
        else:
            base_call = "alt{}".format(base_call) if len(
                args.vcf) > 1 else "alt"
    if base_coverage < __min_coverage and signal_coverage < __min_coverage:
        genotype = "fail"
    elif base_call == signal_call:
        genotype = base_call
    elif signal_call == "fail":
        genotype = base_call
    elif base_call == "fail":
        genotype = signal_call
    elif base_coverage > __override_ratio * signal_coverage and base_call != "fail":
        genotype = base_call
    elif signal_coverage > __override_ratio * base_coverage and signal_call != "fail":
        genotype = signal_call
    elif base_ratio - 0.5 > __override_ratio * (signal_ratio - 0.5) and base_ratio > 0.5:
        genotype = base_call
    elif signal_ratio - 0.5 > __override_ratio * (base_ratio - 0.5) and signal_ratio > 0.5:
        genotype = signal_call
    else:
        genotype = "fail"
    if read in suppdb:
        for j in range(len(suppdb[read])):
            supp_read = "{}_{}".format(read, j)
            print("\t".join([supp_read, genotype,
                             "\t".join([str(i) for i in signal_calls]), str(signal_coverage),
                             "\t".join([str(i) for i in base_calls]), str(base_coverage)]),
                  file=args.outfile)
    else:
        print("\t".join([read, genotype,
                         "\t".join([str(i) for i in signal_calls]), str(signal_coverage),
                         "\t".join([str(i) for i in base_calls]), str(base_coverage)]),
              file=args.outfile)
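The per-base weights in call_variant_proportion.py come from an exponential fit of error rate against base quality (the __slope/__intercept constants above), with nanopolish "signal" qualities shifted down by 35 first. A standalone restatement of that error model:

import math

__slope = -0.1203
__intercept = -0.6927

def error_rate(qual, caller="base"):
    # "signal" qualities come from nanopolish phase-reads and are offset by 35.
    if caller == "signal":
        qual = max(0, qual - 35)
    return math.exp(__intercept + __slope * qual)

# Higher base quality -> lower assumed error, so a stronger vote for its allele.
for q in (10, 20, 30):
    print(q, round(error_rate(q), 4))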
# compare_haplotype_methylation.py
from __future__ import print_function
import csv
import numpy as np
import math
import argparse
import re


def parse_region(region):
    """
    Parse a region specification.

    :param region: String region specification
    :raises argparse.ArgumentTypeError: raises an error if format not recognised
    :returns contig: String contig / chromosome name
    :returns start: integer start position (0-based)
    :returns end: integer end position (1-based)

    >>> parseRegion("chr1:1000-2000")
    ("chr1", 1000, 2000)
    """
    r = ''.join(region.split())  # remove whitespace
    r = re.split(':|-', r)
    start = 0
    end = None
    if len(r) < 1:
        raise argparse.ArgumentTypeError(
            "Region {} must specify a reference name".format(region))
    elif len(r) > 3:
        raise argparse.ArgumentTypeError(
            "Region {} format not recognized".format(region))
    else:
        contig = r[0]
        try:
            start = int(re.sub(",|\.", "", r[1]))
        except IndexError:
            pass
        try:
            end = int(re.sub(",|\.", "", r[2]))
        except IndexError:
            pass
        finally:
            return contig, start, end


parser = argparse.ArgumentParser()
parser.add_argument('-m', '--methylation', required=True,
                    help="Methylation data, split by supplementaries, sorted by genomic location")
parser.add_argument('-p', '--phase', required=True,
                    help="Phased reads (haplotyping data)")
parser.add_argument('-b', '--binwidth', type=int, default=11,
                    help="Genomic distance over which to aggregate tests")
parser.add_argument('-o', '--overlap', type=int, default=5,
                    help="Genomic distance for bin overlap. Limited to binwidth/2 for streaming reasons, but this could be reengineered")
parser.add_argument('-r', '--region', type=parse_region, nargs="+",
                    metavar="CHR:START-END",
                    help="Name and range of contigs to test (Default: entire genome)")
args = parser.parse_args()

__binwidth = args.binwidth
__overlap = min(__binwidth // 2, args.overlap)  # max = __binwidth/2
meth_fn = args.methylation
haplotype_fn = args.phase

regions = dict()  # format: { chr : [start, end] }
if args.region:
    for region in args.region:
        try:
            regions[region[0]]
        except KeyError:
            regions[region[0]] = []
        regions[region[0]].append([region[1], region[2]])

haplotypes = dict()
with open(haplotype_fn, 'r') as haplotype_handle:
    reader = csv.DictReader(haplotype_handle, delimiter="\t")
    for row in reader:
        haplotypes[row["read"]] = row["genotype"]


def get_genotype(read_name):
    try:
        return haplotypes[read_name]
    except KeyError:
        return "fail"


print("\t".join(["chr", "start", "end",
                 "ref_mean", "ref_sd", "ref_site_count", "ref_read_count",
                 "alt_mean", "alt_sd", "alt_site_count", "alt_read_count"]))

chr = None
start = None
ref_reads = None
alt_reads = None
ref_meth = None
alt_meth = None
next_start = None
next_ref_reads = None
next_alt_reads = None
next_ref_meth = None
next_alt_meth = None
end = 0

with open(meth_fn, 'r') as meth_handle:
    reader = csv.DictReader(meth_handle, delimiter="\t")
    for row in reader:
        pos = (float(row["start"]) + float(row["end"])) / 2
        if regions:
            hit = False
            try:
                for region in regions[row['chromosome']]:
                    if pos >= region[0] and pos < region[1]:
                        hit = True
            except KeyError:
                pass
            if not hit:
                continue
        if pos >= end or row["chromosome"] != chr:
            if chr is not None:
                # print result
                if not (len(ref_meth) == 0 and len(alt_meth) == 0):
                    ref_mean = np.mean(ref_meth) if len(ref_meth) > 0 else "NA"
                    ref_sd = np.std(ref_meth) if len(ref_meth) > 0 else "NA"
                    alt_mean = np.mean(alt_meth) if len(alt_meth) > 0 else "NA"
                    alt_sd = np.std(alt_meth) if len(alt_meth) > 0 else "NA"
                    print("\t".join([chr, str(start), str(end),
                                     str(ref_mean), str(ref_sd), str(len(ref_meth)), str(len(ref_reads)),
                                     str(alt_mean), str(alt_sd), str(len(alt_meth)), str(len(alt_reads))]))
            if chr is not None and chr == row["chromosome"] and pos < next_start:
                start = next_start
                ref_reads = next_ref_reads
                alt_reads = next_alt_reads
                ref_meth = next_ref_meth
                alt_meth = next_alt_meth
            else:
                start = int(row["start"])
                ref_reads = set()
                alt_reads = set()
                ref_meth = list()
                alt_meth = list()
                chr = row["chromosome"]
            next_start = start + __overlap
            end = start + __binwidth
            next_ref_reads = set()
            next_alt_reads = set()
            next_ref_meth = list()
            next_alt_meth = list()
        try:
            meth = 1 / (1. + math.exp(-float(row["log_lik_ratio"])))
        except OverflowError:
            meth = 0.0
        genotype = get_genotype(row["read_name"])
        if genotype == "ref":
            ref_meth.append(meth)
            ref_reads.add(row["read_name"])
        elif genotype == "alt":
            alt_meth.append(meth)
            alt_reads.add(row["read_name"])
        if int(row["start"]) >= next_start:
            if genotype == "ref":
                next_ref_meth.append(meth)
                next_ref_reads.add(row["read_name"])
            elif genotype == "alt":
                next_alt_meth.append(meth)
                next_alt_reads.add(row["read_name"])

# print last result
if chr is not None:
    ref_mean = np.mean(ref_meth) if len(ref_meth) > 0 else "NA"
    ref_sd = np.std(ref_meth) if len(ref_meth) > 0 else "NA"
    alt_mean = np.mean(alt_meth) if len(alt_meth) > 0 else "NA"
    alt_sd = np.std(alt_meth) if len(alt_meth) > 0 else "NA"
    print("\t".join([chr, str(start), str(end),
                     str(ref_mean), str(ref_sd), str(len(ref_meth)), str(len(ref_reads)),
                     str(alt_mean), str(alt_sd), str(len(alt_meth)), str(len(alt_reads))]))
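The script streams the methylation calls once and aggregates them into overlapping windows: each bin spans --binwidth bases, and the next bin starts --overlap bases later (hence the streaming restriction overlap <= binwidth/2). A toy illustration of the tiling with the defaults used above (-b 11 -o 5; the start coordinate is made up):

binwidth, overlap = 11, 5
start = 100  # hypothetical first methylation site in the region
for _ in range(4):
    print("bin [{}, {})".format(start, start + binwidth))
    start += overlap  # each successive bin begins `overlap` bases later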
# count_cpg.py
import sys
import re


def count_cpg(file, path=True):
    substring = "CG"
    print("chr\tstart\tend")
    f = open(file) if path else file
    line = f.readline()
    while line:
        if line.startswith('>'):
            chrm = line[1:].rstrip()
        else:
            index = 0
            linelen = len(line)
            # Find first non-N base
            # index = re.search(r'[^N]', line).start()
            # print("start\t{}\tend\t{}".format(index+1, linelen))
            while index < linelen:
                index = line.find(substring, index)
                if index == -1:
                    break
                print("{}\t{}\t{}".format(
                    chrm, index + 1, index + len(substring)))
                index += len(substring)
        line = f.readline()
    if path:
        f.close()


if __name__ == "__main__":
    if len(sys.argv) != 2:
        if sys.stdin:
            count_cpg(sys.stdin, path=False)
        else:
            raise TypeError("missing file input")
    else:
        count_cpg(sys.argv[1])
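Since count_cpg also accepts an open file-like object (path=False), it is easy to sanity-check on a toy FASTA; continuing from the script above, in this hypothetical two-line record the CG dinucleotides sit at 1-based positions 3 and 7:

import io

toy = io.StringIO(">chr1\nAACGTTCGAA\n")
count_cpg(toy, path=False)
# chr    start  end
# chr1   3      4
# chr1   7      8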
# dss_paired.R
suppressMessages(suppressPackageStartupMessages(library(tidyverse)))
suppressMessages(suppressPackageStartupMessages(library(DSS)))
suppressMessages(suppressPackageStartupMessages(library(bsseq)))

meth_fn <- "../nanopore/b6xcast.minion.methylation"
forward_ref_df <- read.table(paste0(meth_fn, ".ref_summary.tsv"),
                             header=TRUE, sep="\t", stringsAsFactors = FALSE) %>%
  mutate(start=start+1,
         end=end+2,
         sampleName="b6xcast.mat") %>%
  dplyr::rename(chr=chromosome, percentMeth=methylated_frequency) %>%
  arrange(chr, start)
forward_alt_df <- read.table(paste0(meth_fn, ".alt_summary.tsv"),
                             header=TRUE, sep="\t", stringsAsFactors = FALSE) %>%
  mutate(start=start+1,
         end=end+2,
         sampleName="b6xcast.pat") %>%
  dplyr::rename(chr=chromosome, percentMeth=methylated_frequency) %>%
  arrange(chr, start)

meth_fn <- "../nanopore/castxb6.promethion.methylation"
reverse_ref_df <- read.table(paste0(meth_fn, ".ref_summary.tsv"),
                             header=TRUE, sep="\t", stringsAsFactors = FALSE) %>%
  mutate(start=start+1,
         end=end+2,
         sampleName="castxb6.pat") %>%
  dplyr::rename(chr=chromosome, percentMeth=methylated_frequency) %>%
  arrange(chr, start)
reverse_alt_df <- read.table(paste0(meth_fn, ".alt_summary.tsv"),
                             header=TRUE, sep="\t", stringsAsFactors = FALSE) %>%
  mutate(start=start+1,
         end=end+2,
         sampleName="castxb6.mat") %>%
  dplyr::rename(chr=chromosome, percentMeth=methylated_frequency) %>%
  arrange(chr, start)

tidy_df <- function(df, sampleName) {
  Cov <- rlang::sym(paste0("Cov.", sampleName))
  M <- rlang::sym(paste0("M.", sampleName))
  df %>%
    select(chr, start, end,
           !!Cov:=called_sites,
           !!M:=called_sites_methylated)
}

make_bs <- function(df) {
  M <- df %>%
    select(starts_with("M.")) %>%
    rename_all(function(x) gsub("M.", "", x)) %>%
    mutate_all(function(x) ifelse(is.na(x), 0, x)) %>%
    as.matrix()
  Cov <- df %>%
    select(starts_with("Cov.")) %>%
    rename_all(function(x) gsub("Cov.", "", x)) %>%
    mutate_all(function(x) ifelse(is.na(x), 0, x)) %>%
    as.matrix()
  BSseq(chr = df$chr, pos = df$start,
        Cov = Cov, M = M,
        sampleNames = colnames(M))
}

bs <- tidy_df(forward_ref_df, "b6xcast.mat") %>%
  full_join(tidy_df(forward_alt_df, "b6xcast.pat"), by=c("chr", "start", "end")) %>%
  full_join(tidy_df(reverse_ref_df, "castxb6.pat"), by=c("chr", "start", "end")) %>%
  full_join(tidy_df(reverse_alt_df, "castxb6.mat"), by=c("chr", "start", "end")) %>%
  group_by(chr, start) %>%
  summarise_if(is.numeric, sum) %>%
  ungroup() %>%
  make_bs()
pData(bs)$strain <- c("b6xcast", "b6xcast", "castxb6", "castxb6")
pData(bs)$parent <- c("mat", "pat", "pat", "mat")

loci.idx <- which(rowSums(getCoverage(bs, type="Cov")==0) == 0)
dml_parent <- DMLtest(BSobj = bs[loci.idx,],
                      group1=c("b6xcast.mat", "castxb6.mat"),
                      group2=c("b6xcast.pat", "castxb6.pat"),
                      equal.disp = TRUE,
                      smoothing=TRUE)
dml_strain <- DMLtest(BSobj = bs[loci.idx,],
                      group1=c("b6xcast.mat", "castxb6.pat"),
                      group2=c("castxb6.mat", "b6xcast.pat"),
                      equal.disp = TRUE,
                      smoothing=TRUE)
dmr_parent <- callDMR(dml_parent, p.threshold=1e-5, dis.merge=1500)
dmr_strain <- callDMR(dml_strain, p.threshold=1e-5, dis.merge=1500)
save(dml_parent, dml_strain, dmr_parent, dmr_strain, file="../RData/paired_DSS.RData")
# ensembl_gtf_to_tsv.R
suppressMessages(suppressPackageStartupMessages(library(magrittr)))
suppressMessages(suppressPackageStartupMessages(library(dplyr)))
suppressMessages(suppressPackageStartupMessages(library(readr)))

args <- commandArgs(trailingOnly = TRUE)
gtf <- rtracklayer::import(args[1])
gene_list <- gtf %>%
  as_data_frame() %>%
  rename(chr=seqnames) %>%
  select(chr, start, end, strand, gene_id, gene_name) %>%
  group_by(chr, gene_id, gene_name, strand) %>%
  summarise(start=min(start), end=max(end))
gene_list %>%
  write_tsv(sub("gtf$", "genes.tsv", args[1]))
# fit_reads.R
suppressMessages(suppressPackageStartupMessages(library(tidyverse)))
suppressMessages(suppressPackageStartupMessages(library(ggExtra)))
suppressMessages(suppressPackageStartupMessages(library(broom)))
suppressMessages(suppressPackageStartupMessages(library(data.table)))

n_min <- 7

args = commandArgs(trailingOnly=TRUE)
infile <- args[1]
outdir <- args[2]
load(file.path(outdir, "summary_df.RData"))
load(file.path(outdir, "haplotype_df.RData"))

fit_loess <- function(chr, read_name, start, percentMeth, ...) {
  x = data.frame(start=start, percentMeth=percentMeth)
  fit <- loess(percentMeth ~ start, x,
               control=loess.control(surface = "interpolate",
                                     statistics = "none",
                                     trace.hat = "approximate",
                                     cell=0.5), ...)
  fit <- list(kd=fit$kd, start=min(start), end=max(start),
              chr=chr, read_name=read_name)
  fit
}

process_file <- function(filepath) {
  con = file(filepath, "r")
  reads=list()
  header = readLines(con, n = 1)
  # process first line
  line = readLines(con, n = 1)
  line <- strsplit(line, "\\t")[[1]]
  chr=line[1]
  read_name = line[4]
  pos=as.integer(line[2])
  meth = 1/(1+exp(-as.numeric(line[5])))
  i=1
  start=c(pos)
  percentMeth=c(meth)
  # process the rest
  while ( TRUE ) {
    line = readLines(con, n = 1)
    if ( length(line) == 0 ) {
      break
    }
    line <- strsplit(line, "\\t")[[1]]
    pos=as.integer(line[2])
    name = line[4]
    meth = 1/(1+exp(-as.numeric(line[5])))
    if (name != read_name) {
      # new line
      if (length(start) > n_min){
        len <- max(start) - min(start)
        span <- 0.1 + 8e-11*((max(-len + 1e5, 0)))^2
        reads[[i]] = fit_loess(chr, read_name, start, percentMeth, span=span)
        i <- i + 1
      }
      chr=line[1]
      read_name = name
      start=c(pos)
      percentMeth=c(meth)
    } else {
      start=c(start, pos)
      percentMeth=c(percentMeth, meth)
    }
  }
  close(con)
  reads
}

fit_reads <- process_file(paste0(infile, ".methylation.sorted.by_read.tsv"))
save(fit_reads, file=file.path(outdir, "fit_reads.RData"))
# load(file.path(outdir, "fit_reads.RData"))

fit_reads_df <- data_frame(read_name=sapply(fit_reads, function(x) { x$read_name })) %>%
  mutate(id=row_number()) %>%
  left_join(summary_df, by="read_name") %>%
  arrange(chr, start, end) %>%
  left_join(haplotype_df %>% select(read_name, genotype), by="read_name") %>%
  dplyr::select(chr, start, end, id, read_name, genotype, qual)
save(fit_reads_df, file=file.path(outdir, "fit_reads_df.RData"))
shell: "python call_variant_proportion.py -b {input.bam} -p {input.phased_bam} -v {input.cast_vcf} {input.fvb_vcf} -o {output}"

shell: "Rscript render_notebook.R ../notebooks/nanopolish_fvb_resolution.Rmd &> {params.log}"

shell: "python merge_recombined_haplotypes.py -i {input.phase} -s {input.summary} -a alt2 -r $(cat {input.regions}) -o {output}"

shell: "wget -q -O {output} {params.url}"
# the same rule body is repeated for each of the four download rules here

shell: "md5sum -c {input.md5} && touch {output}"

shell: "gunzip {input.tsv} && touch {output.signpost}"
# mask_genome_variants.py
from __future__ import print_function
from Bio import SeqIO
from Bio import Seq
import re
import sys
import csv
import argparse


def load_vcf(handle):
    header = next(handle)
    while header.startswith("##"):
        header = next(handle)
    header = header.strip("#").strip("\n").split("\t")
    reader = csv.DictReader(handle, delimiter="\t", fieldnames=header)
    return reader


__ambiguity_codes = {
    'N': {'A', 'C', 'G', 'T'},
    'V': {'A', 'C', 'G'},
    'H': {'A', 'C', 'T'},
    'D': {'A', 'G', 'T'},
    'B': {'C', 'G', 'T'},
    'M': {'A', 'C'},
    'K': {'G', 'T'},
    'R': {'A', 'G'},
    'Y': {'C', 'T'},
    'W': {'A', 'T'},
    'S': {'C', 'G'},
    'A': {'A'},
    'C': {'C'},
    'G': {'G'},
    'T': {'T'}
}


def get_ambiguity_base(bases):
    bases = set(bases)
    for ambiguity_base, possible_bases in __ambiguity_codes.items():
        if bases == possible_bases:
            return ambiguity_base
    raise Exception("Unrecognises bases: ['" + "','".join(list(bases)) + "']")


def disambiguate_base(base):
    base = str(base)
    return list(__ambiguity_codes[base])


def parse_region(region):
    """
    Parse a region specification.

    :param region: String region specification
    :raises argparse.ArgumentTypeError: raises an error if format not recognised
    :returns contig: String contig / chromosome name
    :returns start: integer start position (0-based)
    :returns end: integer end position (1-based)

    >>> parseRegion("chr1:1000-2000")
    ("chr1", 1000, 2000)
    """
    region = ''.join(region.split())  # remove whitespace
    region = re.split(':|-', region)
    start = 0
    end = None
    if len(region) < 1:
        raise argparse.ArgumentTypeError(
            "Region must specify a reference name")
    elif len(region) > 3:
        raise argparse.ArgumentTypeError("Region format not recognized")
    else:
        contig = region[0]
        try:
            start = int(re.sub(",|\.", "", region[1]))
        except IndexError:
            pass
        try:
            end = int(re.sub(",|\.", "", region[2]))
        except IndexError:
            pass
        finally:
            return contig, start, end


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('-i', '--input', required=True,
                        help='filename of input genome fasta')
    parser.add_argument('-o', '--output', required=True,
                        help='filename of output masked genome fasta')
    parser.add_argument('-v', '--vcf', required=True,
                        help='filename of vcf listing snps')
    parser.add_argument('-a', '--alternate-only', default=False, action='store_true',
                        help='replace reference base with alternate allele, rather than retaining both (Default: false)')
    parser.add_argument('-r', '--region', type=parse_region, nargs="+",
                        metavar="CHR:START-END",
                        help="Name and range of contigs to mask (Default: entire genome)")
    parser.add_argument('-b', '--boundary', type=int, default=0, metavar="INT",
                        help="Width of region boundary where both alleles are included")
    return parser.parse_args()


def contig_to_int(contig):
    try:
        return(int(contig))
    except ValueError:
        if contig == 'X':
            return max_autosome + 1
        elif contig == 'Y':
            return max_autosome + 2
        elif contig == 'MT':
            return max_autosome + 3
        else:
            return max_autosome + 4


def cmp_contigs(contig):
    return(contig_to_int(contig), contig)


args = parse_args()

regions = dict()  # format: { chr : [start, end, alternate_only] }
if args.region:
    for region in args.region:
        try:
            regions[region[0]]
        except KeyError:
            regions[region[0]] = []
        region_start, region_end = region[1], region[2]
        if region_start > 0:
            region_start = region_start + args.boundary // 2
        if region_end is not None:
            region_end = region_end - args.boundary // 2
        regions[region[0]].append(
            [region_start, region_end, args.alternate_only])
        if args.boundary > 1:
            if region[1] > 0:
                regions[region[0]].append(
                    [region[1] - args.boundary // 2, region[1] + args.boundary // 2, False])
            if region[2] is not None:
                regions[region[0]].append(
                    [region[2] - args.boundary // 2, region[2] + args.boundary // 2, False])

genome = dict()
with open(args.input, 'r') as handle:
    for record in SeqIO.parse(handle, "fasta"):
        record.seq = record.seq.tomutable()
        genome[record.id] = record

with open(args.vcf, 'r') as handle:
    vcf = load_vcf(handle)
    for row in vcf:
        snp = True
        ref_base = [row['REF']]
        alt_bases = row['ALT'].split(',')
        for base in alt_bases + ref_base:
            if len(base) > 1:
                snp = False
        if not snp:
            # don't touch indels
            continue
        pos = int(row['POS']) - 1
        if regions:
            alternate_only = None
            try:
                for r in regions[row['CHROM']]:
                    if pos >= r[0] and pos < r[1]:
                        alternate_only = r[2]
            except KeyError:
                pass
            if alternate_only is None:
                # no region hit
                continue
        else:
            # genome wide
            alternate_only = args.alternate_only
        record = genome[row['CHROM']]
        bases = alt_bases
        if not alternate_only:
            bases = bases + disambiguate_base(record.seq[pos])
        record.seq[pos] = get_ambiguity_base(bases)
        genome[row['CHROM']] = record

contigs = list()
max_autosome = 0
for contig in genome.keys():
    contigs.append(contig)
    try:
        if int(contig) > max_autosome:
            max_autosome = int(contig)
    except ValueError:
        pass
contigs = sorted(contigs, key=cmp_contigs)

with open(args.output, 'w') as handle:
    for contig in contigs:
        SeqIO.write(genome[contig], handle, "fasta")
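The masking step replaces each SNP position with the IUPAC code covering the retained alleles (both alleles by default, alternate-only inside -r regions), presumably to avoid biasing alignment toward the reference strain. A cut-down restatement of the lookup, restricted to the two-allele cases:

codes = {'R': {'A', 'G'}, 'Y': {'C', 'T'}, 'M': {'A', 'C'},
         'K': {'G', 'T'}, 'W': {'A', 'T'}, 'S': {'C', 'G'}}

def ambiguity_base(bases):
    # Mirror of get_ambiguity_base above, for biallelic SNPs only.
    bases = set(bases)
    for code, possible in codes.items():
        if bases == possible:
            return code
    raise ValueError("unrecognised bases: {}".format(sorted(bases)))

print(ambiguity_base(['A', 'G']))  # a reference A with ALT G becomes R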
# merge_recombined_haplotypes.py
from __future__ import print_function
import re
import csv
import argparse

__min_coverage = 5
__override_ratio = 3


def parse_region(region):
    """
    Parse a region specification.

    :param region: String region specification
    :raises argparse.ArgumentTypeError: raises an error if format not recognised
    :returns contig: String contig / chromosome name
    :returns start: integer start position (0-based)
    :returns end: integer end position (1-based)

    >>> parseRegion("chr1:1000-2000")
    ("chr1", 1000, 2000)
    """
    region = ''.join(region.split())  # remove whitespace
    region = re.split(':|-', region)
    start = 0
    end = None
    if len(region) < 1:
        raise argparse.ArgumentTypeError(
            "Region must specify a reference name")
    elif len(region) > 3:
        raise argparse.ArgumentTypeError("Region format not recognized")
    else:
        contig = region[0]
        try:
            start = int(re.sub(",|\.", "", region[1]))
        except IndexError:
            pass
        try:
            end = int(re.sub(",|\.", "", region[2]))
        except IndexError:
            pass
        finally:
            return contig, start, end


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('-i', '--input', required=True,
                        help='filename of input three-way haplotypes tsv')
    parser.add_argument('-o', '--output', required=True,
                        help='filename of output haplotypes tsv')
    parser.add_argument('-s', '--summary', required=True,
                        help='filename of summary generated by build_supplementary_db')
    parser.add_argument('-a', '--alt-reference', required=True,
                        choices=['alt1', 'alt2'],
                        help='Name of the allele for which alternate is recombined with the reference')
    parser.add_argument('-r', '--region', type=parse_region, nargs="+",
                        metavar="CHR:START-END",
                        help="Name and range of contigs where reference is replaced by alternate reference")
    parser.add_argument('-b', '--boundary', type=int, default=50000, metavar="INT",
                        help="Width of region boundary where both alleles are included")
    return parser.parse_args()


args = parse_args()
signal_alt_ref = "signal_" + args.alt_reference
base_alt_ref = "base_" + args.alt_reference
signal_alt = "signal_" + "alt1" if args.alt_reference == "alt2" else "alt2"
base_alt = "base_" + "alt1" if args.alt_reference == "alt2" else "alt2"

regions = dict()  # format: { chr : [start, end, alternate_only] }
if args.region:
    for region in args.region:
        try:
            regions[region[0]]
        except KeyError:
            regions[region[0]] = []
        region_start, region_end = region[1], region[2]
        if region_start > 0:
            region_start = region_start + args.boundary // 2
        if region_end is not None:
            region_end = region_end - args.boundary // 2
        regions[region[0]].append([region_start, region_end, True])
        if args.boundary > 1:
            if region[1] > 0:
                regions[region[0]].append(
                    [region[1] - args.boundary // 2, region[1] + args.boundary // 2, False])
            if region[2] is not None:
                regions[region[0]].append(
                    [region[2] - args.boundary // 2, region[2] + args.boundary // 2, False])

with open(args.input, 'r') as haplotype_handle, open(args.summary, 'r') as summary_handle, open(args.output, 'w') as output:
    print("read\tgenotype\tsignal_ref\tsignal_alt\tsignal_coverage\tbase_ref\tbase_alt\tbase_coverage", file=output)
    haplotype_reader = csv.DictReader(haplotype_handle, delimiter="\t")
    summary_reader = csv.DictReader(summary_handle, delimiter="\t",
                                    fieldnames=['read', 'chr', 'start', 'end', 'qual'])
    summary = next(summary_reader)
    while True:
        try:
            haplotype = next(haplotype_reader)
            while summary['read'] < haplotype['read']:
                summary = next(summary_reader)
            while summary['read'] > haplotype['read']:
                haplotype = next(haplotype_reader)
            # now they should be the same read
            for key in haplotype.keys():
                if (key.startswith("signal") or key.startswith("base")) and not key.endswith("coverage"):
                    haplotype[key] = float(haplotype[key])
        except StopIteration:
            break
        use_ref = True
        use_alt_ref = False
        if summary['chr'] in regions:
            start = int(summary['start'])
            end = int(summary['end'])
            for region in regions[summary['chr']]:
                if region[0] <= start and region[1] >= end:
                    use_alt_ref = True
                    if region[2]:
                        # alternate only
                        use_ref = False
        if use_ref and use_alt_ref:
            if haplotype['signal_ref'] + haplotype['base_ref'] > haplotype[signal_alt_ref] + haplotype[base_alt_ref]:
                use_alt_ref = False
            elif haplotype['signal_ref'] + haplotype['base_ref'] < haplotype[signal_alt_ref] + haplotype[base_alt_ref]:
                use_ref = False
            else:
                # pick the one with the lower range - more reliable, so
                # probably more accurate
                if abs(haplotype['signal_ref'] - haplotype['base_ref']) < abs(haplotype[signal_alt_ref] - haplotype[base_alt_ref]):
                    use_alt_ref = False
                else:
                    use_ref = False
        if use_ref:
            signal_ref = 'signal_ref'
            base_ref = 'base_ref'
        else:
            signal_ref = signal_alt_ref
            base_ref = base_alt_ref
            # remove excess counts - how?
        signal_coverage = haplotype[signal_ref] + haplotype[signal_alt]
        base_coverage = haplotype[base_ref] + haplotype[base_alt]
        # determine genotype
        signal_call = "fail"
        if signal_coverage > 0:
            signal_ratio = haplotype[signal_ref] / signal_coverage
            if signal_ratio > 0.5:
                signal_call = "ref"
            elif signal_ratio < 0.5:
                signal_call = "alt"
        base_call = "fail"
        if base_coverage > 0:
            base_ratio = haplotype[base_ref] / base_coverage
            if base_ratio > 0.5:
                base_call = "ref"
            elif base_ratio < 0.5:
                base_call = "alt"
        if base_coverage < __min_coverage and signal_coverage < __min_coverage:
            genotype = "fail"
        elif base_call == signal_call:
            genotype = base_call
        elif signal_call == "fail":
            genotype = base_call
        elif base_call == "fail":
            genotype = signal_call
        elif base_coverage > __override_ratio * signal_coverage and base_call != "fail":
            genotype = base_call
        elif signal_coverage > __override_ratio * base_coverage and signal_call != "fail":
            genotype = signal_call
        elif base_ratio - 0.5 > __override_ratio * (signal_ratio - 0.5) and base_ratio > 0.5:
            genotype = base_call
        elif signal_ratio - 0.5 > __override_ratio * (base_ratio - 0.5) and signal_ratio > 0.5:
            genotype = signal_call
        else:
            genotype = "fail"
        print("{read}\t{genotype}\t{signal_ref}\t{signal_alt}\t{signal_coverage}\t{base_ref}\t{base_alt}\t{base_coverage}".format(
            read=haplotype['read'],
            genotype=genotype,
            signal_ref=haplotype[signal_ref],
            base_ref=haplotype[base_ref],
            signal_alt=haplotype[signal_alt],
            base_alt=haplotype[base_alt],
            signal_coverage=haplotype['signal_coverage'],
            base_coverage=haplotype['base_coverage']
        ), file=output)
shell: "wget -q {params.url} -O {output}"
# the same rule body is repeated for both download rules here

shell: "gunzip {input}"
# the same rule body is repeated for three decompression rules

shell: "bwa index {input} &> {params.log}"

shell: "samtools index {input}"

shell: "wget -q {params.url} -O {output}"

shell: "python mask_genome_variants.py -i {input.fasta} -o {output} -v {input.vcf}"

shell: "Rscript ensembl_gtf_to_tsv.R {input}"

shell: "python tsv_to_region.py {input} 10000 > {output}"

shell: "python count_cpg.py {input} > {output}"

shell: "wget -q -O {output} {params.url}"
# the same rule body is repeated for each of the six download rules here

shell: "md5sum -c {input.md5} && touch {output}"

shell:
    "mkdir -p {output} && "
    "tar xzf {input.tar} -C {output}"

shell:
    "find {input} -mindepth 2 -type d -links 2 -exec mv -t {input} {{}} \\+ && "
    "find {input} -name '*.basecalled.*.tar.gz' -exec rm {{}} \\+ && "
    "for i in $(find {input} -name '*.tar.gz'); do "
    "    tar -xzf $i -C {input} && rm $i; "
    "done && "
    "touch {output}"

shell:
    "mkdir -p {output.outdir} && "
    "for i in {input.indir}; do mv $i {output.outdir}; done && "
    "touch {output.outdir_clean}"

shell: "read_fast5_basecaller.py -c r94_450bps_linear.cfg -o fastq -i {input.indir} -r -s {params.outdir} -t {threads} -q 0"

shell: "read_fast5_basecaller.py -c r941_450bps_linear_prom.cfg -o fastq -i {input.indir} -r -s {params.outdir} -t {threads} -q 0"

shell: "find {input} -name '*.fastq' -exec cat {{}} \\+ > {output}"

shell:
    "bwa mem -t {threads} {input.genome} {input.reads} | "
    "samtools sort -T {output}.samtools.tmp -@ {threads} -o {output} &> {log}"

shell: "nanopolish index -d {input.fast5} {input.reads} &> {params.log}"

shell: "nanopolish call-methylation -t {threads} -r {input.reads} -b {input.bam} -g {input.genome} > {output}"

shell:
    "nanopolish phase-reads -t {threads} -r {input.reads} -b {input.bam} -g {input.genome} {input.vcf} | "
    "samtools sort -T {output}.samtools.tmp -@ {threads} -o {output}"

shell: "Rscript render_notebook.R {input.notebook} &> {params.log}"
# the same rule body is repeated for each of the eleven notebook-rendering rules

script: "prepare_visualisation.R"
# prepare_visualisation.R
suppressMessages(suppressPackageStartupMessages(library(GenomicRanges)))
suppressMessages(suppressPackageStartupMessages(library(tidyverse)))
suppressMessages(suppressPackageStartupMessages(library(data.table)))

gtf <- rtracklayer::import("../genome_data/Mus_musculus.GRCm38_90.chr.gtf")
knownGene <- gtf %>%
  as_data_frame() %>%
  rename(chr=seqnames) %>%
  select(chr, start, end, strand, type, gene_id, gene_name, exon_number) %>%
  filter(type %in% c("exon", "five_prime_utr", "three_prime_utr")) %>%
  filter(end > start) %>%
  mutate(type=fct_recode(type, utr="three_prime_utr", utr="five_prime_utr")) %>%
  unique() %>%
  group_by(gene_name, start) %>%
  filter(end == min(end)) %>%
  ungroup() %>%
  arrange(gene_name, start, end) %>%
  makeGRangesFromDataFrame(keep.extra.columns=TRUE)
save(knownGene, file="../RData/knownGene.RData")

rna_seq <- read_tsv("../rna_seq/all_runs_with_reverse_coverage.tsv",
                    col_names=c("chr", "pos",
                                "fwd1_mat", "fwd1_pat", "fwd2_mat", "fwd2_pat",
                                "fwd3_mat", "fwd3_pat", "fwd4_mat", "fwd4_pat",
                                "rev1_pat", "rev1_mat", "rev2_pat", "rev2_mat",
                                "rev3_pat", "rev3_mat", "rev4_pat", "rev4_mat"),
                    col_types = 'ciiiiiiiiiiiiiiiii') %>%
  mutate(fwd_mat=(fwd1_mat+fwd2_mat+fwd3_mat+fwd4_mat)/4,
         fwd_pat=(fwd1_pat+fwd2_pat+fwd3_pat+fwd4_pat)/4,
         rev_pat=(rev1_pat+rev2_pat+rev3_pat+rev4_pat)/4,
         rev_mat=(rev1_mat+rev2_mat+rev3_mat+rev4_mat)/4) %>%
  select(chr, pos, fwd_mat, fwd_pat, rev_mat, rev_pat)
save(rna_seq, file="../RData/rna_seq_with_reverse.RData")
# read_haplotypes.R
suppressMessages(library(tidyverse))

min_coverage <- 5

args = commandArgs(trailingOnly=TRUE)
infile <- args[1]
outfile <- args[2]

haplotype_df <- read_tsv(infile, col_types='ccddiddi') %>%
  mutate(signal_coverage = signal_ref + signal_alt,
         signal_ratio = signal_ref / signal_coverage,
         base_coverage = base_ref + base_alt,
         base_ratio=base_ref / base_coverage) %>%
  mutate(info=case_when(.$signal_coverage < min_coverage & .$base_coverage < min_coverage ~ "low_coverage",
                        .$signal_ratio < 0.5 & .$base_ratio < 0.5 ~ "pass",
                        .$signal_ratio > 0.5 & .$base_ratio > 0.5 ~ "pass",
                        .$base_coverage > 3*.$signal_coverage ~ "signal_low_coverage",
                        .$signal_coverage > 3*.$base_coverage ~ "base_low_coverage",
                        abs(0.5-.$signal_ratio)*3 < abs(0.5-.$base_ratio) ~ "signal_uncertain",
                        abs(0.5-.$base_ratio)*3 < abs(0.5-.$signal_ratio) ~ "base_uncertain",
                        TRUE ~ "fail"),
         base_genotype=ifelse(base_ratio>0.5, "ref", "alt"),
         signal_genotype=ifelse(signal_ratio>0.5, "ref", "alt"),
         genotype="fail",
         genotype=ifelse(startsWith(info, "signal") | info == "pass", base_genotype, genotype),
         genotype=ifelse(startsWith(info, "base"), signal_genotype, genotype)) %>%
  rename(read_name=read)

save(haplotype_df, file=outfile)
# read_summary.R
suppressMessages(library(tidyverse))

args = commandArgs(trailingOnly=TRUE)
infile <- args[1]
outfile <- args[2]

summary_df <- read_tsv(infile,
                       col_names=c("read_name", "chr", "start", "end", "qual"),
                       col_types='cciid',
                       skip=1) %>%
  unique()
save(summary_df, file=outfile)
# render_notebook.R
rmd_files <- commandArgs(trailingOnly = TRUE)
purrr::map(rmd_files,
           ~ rmarkdown::render(., output_format='html_document',
                               output_file=sub("Rmd$", "html", .)))
shell: "wget -q -O {output} {params.url}"
# the same rule body is repeated for each of the eight download rules here

shell: "md5sum -c {input.md5} && touch {output}"

shell: "trim_galore --phred33 --fastqc --gzip --paired -o {params.outdir} {input.r1} {input.r2} &> {params.log}"

shell: "cp {input} {output}"

shell:
    "cd $(dirname {input.genome}) && "
    "SNPsplit_genome_preparation --vcf_file {input.vcf} --strain CAST_EiJ --reference_genome ./ &> {params.log} && "
    "cd ../scripts/"

shell:
    "cat {input.fasta} > {output.fasta} && "
    "mv {input.snp} {output.snp} && "
    "rm -r {params.tempdir}"

shell: "hisat2-build {input} {params.base} &> {params.log}"

shell:
    "hisat2 -p {threads} --no-softclip -x {params.base} -1 {input.r1} -2 {input.r2} | "
    "samtools sort -n -@ {threads} -T {output}.samtools.tmp -o {output} &> {log}"

shell: "SNPsplit --snp_file {input.snp} --paired --no_sort {input.bam} -o ../rna_seq/ &> {params.log}"

shell: "samtools sort -T {input}.samtools.tmp -@ {threads} -o {output} {input} &> {log}"

shell: "samtools depth {input.bam} > {output}"

shell: "wget -q -O {output} {params.url}"
# the same rule body is repeated for each of the four download rules here

shell: "cp {input} {output}"

shell:
    "cd $(dirname {input}) && "
    "bismark_genome_preparation --bowtie2 ./ &> {params.log} && "
    "cd ../scripts/"

shell:
    "bismark --gzip --bam --bowtie2 -p {threads} -B {params.basename} "
    "-o ../bisulfite/ {params.genome} -1 {input.r1} -2 {input.r2} &> {params.log}"

shell: "samtools sort -n -@ {threads} -T {input}.samtools.tmp -o {output} {input} &> {log}"

shell: "SNPsplit --paired --bisulfite --snp_file {input.snp} --no_sort {input.bam} -o ../bisulfite/ &> {params.log}"

shell:
    "bismark_methylation_extractor --ignore 13 --paired-end --multicore {threads} "
    "--comprehensive --merge_non_CpG --report --output $(dirname {output.txt}) --gzip "
    "--bedGraph {input} &> {params.log}"

shell: "python summarize_bisulfite_methylation.py {output} {input}"
# the same rule body is repeated for three summary rules
# split_methylation_by_alignment.py
from __future__ import print_function
import csv
import sys
import pickle
import copy


def get_read_id(read, contig, position, suppdb):
    if read in suppdb:
        for i, alignment in enumerate(suppdb[read]):
            chr, start, end = alignment
            if contig == chr and position >= start and position <= end:
                return "{}_{}".format(read, i)
        raise Exception("Position {}:{} not internal to any alignment of read {} ({})".format(
            contig, position, read, suppdb[read]))
    else:
        return read


bam_fn = sys.argv[1]
if len(sys.argv) > 2:
    meth_fn = sys.argv[2]
    handle = open(meth_fn, 'r')
else:
    print("Reading from stdin.", file=sys.stderr)
    handle = sys.stdin

with open("{}.suppdb".format(bam_fn), 'rb') as db_handle:
    suppdb = pickle.load(db_handle)

reader = csv.DictReader(handle, delimiter="\t")
fieldnames = reader.fieldnames
if "strand" in reader.fieldnames:
    fieldnames = copy.copy(fieldnames)
    del fieldnames[fieldnames.index("strand")]
    del_strand = True
else:
    del_strand = False
writer = csv.DictWriter(sys.stdout, delimiter="\t", fieldnames=fieldnames)
writer.writeheader()
for row in reader:
    row["read_name"] = get_read_id(
        row["read_name"], row["chromosome"], int(row["start"]), suppdb)
    if del_strand:
        del row["strand"]
    try:
        writer.writerow(row)
    except Exception:
        print(row, file=sys.stderr)
        raise

handle.close()
# split_methylation_by_haplotype.py
from __future__ import print_function
import sys
import csv
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('-m', '--methylation', required=True,
                    help="Methylation data, split by supplementaries, sorted by genomic location")
parser.add_argument('-p', '--phase', required=True,
                    help="Phased reads (haplotyping data)")
args = parser.parse_args()

meth_fn = args.methylation
haplotype_fn = args.phase

haplotypes = dict()
genotypes = set(["fail"])
with open(haplotype_fn, 'r') as haplotype_handle:
    reader = csv.DictReader(haplotype_handle, delimiter="\t")
    for row in reader:
        haplotypes[row["read"]] = row["genotype"]
        genotypes.add(row["genotype"])


def get_genotype(read_name):
    try:
        return haplotypes[read_name]
    except KeyError:
        return "fail"


with open(meth_fn, 'r') as meth_handle:
    reader = csv.DictReader(meth_handle, delimiter="\t")
    out_handles = list()
    writers = dict()
    for genotype in genotypes:
        handle = open("{}.{}.tsv".format(meth_fn, genotype), 'w')
        out_handles.append(handle)
        writers[genotype] = csv.DictWriter(
            handle, fieldnames=reader.fieldnames, delimiter="\t")
        writers[genotype].writeheader()
    for row in reader:
        writers[get_genotype(row["read_name"])].writerow(row)
    for handle in out_handles:
        handle.close()
# summarize_bisulfite_methylation.py
import sys
import csv
import gzip
from functools import partial

__true = "+"
__false = "-"


def add(summary, chr, pos, meth):
    try:
        summary[chr]
    except KeyError:
        summary[chr] = dict()
    try:
        summary[chr][pos]
    except KeyError:
        summary[chr][pos] = [0., 0.]
    if meth:
        summary[chr][pos][0] += 1
    summary[chr][pos][1] += 1
    return summary


def summarize(in_handle, summary=None):
    if summary is None:
        summary = dict()
    header = next(in_handle)
    reader = csv.reader(in_handle, delimiter="\t")
    for line in reader:
        meth = line[1] == __true
        chr = line[2]
        pos = int(line[3])
        summary = add(summary, chr, pos, meth)
    return summary


def write_out_summary(output_fn, summary):
    with open(output_fn, 'w') as out_handle:
        for chr, positions in summary.items():
            for pos, data in positions.items():
                meth, total = tuple(data)
                out_handle.write("{}\t{}\t{:.3f}\t{}\t{}\n".format(
                    chr, pos, meth / total, meth, total))


output_fn = sys.argv[1]
input_fns = sys.argv[2:]
if len(sys.argv) == 3:
    input_fns = [sys.argv[2]]
else:
    input_fns = sys.argv[2:]

summary = None
for input_fn in input_fns:
    if input_fn.endswith("gz"):
        if sys.version_info >= (3,):
            open_fun = partial(gzip.open, mode='rt')
        else:
            open_fun = gzip.open
    else:
        open_fun = open
    with open_fun(input_fn) as handle:
        summary = summarize(handle, summary)

write_out_summary(output_fn, summary)
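summarize() takes any open handle plus an optional running summary, so it can be checked on an in-memory example. Continuing from the script above, the hypothetical input below mimics the methylation-extractor format it consumes (column 2 is "+" for methylated, column 3 the chromosome, column 4 the position):

import io

toy = io.StringIO(
    "Bismark methylation extractor output\n"  # header line is skipped
    "read1\t+\tchr1\t100\tZ\n"
    "read2\t-\tchr1\t100\tz\n"
)
print(summarize(toy))  # {'chr1': {100: [1.0, 2.0]}} -> 50% methylated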
# tsv_to_region.py
import sys
import csv

tsv_fn = sys.argv[1]
if len(sys.argv) > 2:
    overhang = int(sys.argv[2])
else:
    overhang = 0

output = []
with open(tsv_fn, 'r') as tsv_handle:
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    for row in reader:
        output.append(
            "{}:{}-{}".format(row["chr"],
                              int(row["start"]) - overhang,
                              int(row["end"]) + overhang))
print(" ".join(output))