Reproducibility workflow for Gigante et al., 2019: Using long-read sequencing to detect imprinted DNA methylation
Haplotyped Methylome
Reproducibility instructions for Gigante et al., 2019.
Note: this repository is still being tested! If you find a bug, please file an issue.
Data are available at ENA accession PRJEB27157.
Author's note: there is an error in Figure S2 of the paper, in which the figure legend was marked Maternal/Paternal where it should have been Black6/Cast. A corrected version is available here.
System requirements
- R
- Python >= 3.5
- Lots of RAM (min. 256GB)
- Lots of disk space (estimate: 2TB)
Dependencies
To install the dependencies with conda, run the following commands:
conda env create -f environment.yml
source activate haplotyped_methylome
You will then need to install Albacore; it is available from the Nanopore Community.
To install without conda, see the list of dependencies at the bottom of this README.
Required data
For the standard workflow, snakemake will download all the necessary files.
If you wish to avoid running albacore, bwa and nanopolish on the raw nanopore data, you can run the following command, which downloads the output of these programs and tricks snakemake into thinking you have run the pipeline from the beginning:
snakemake intermediate_download
Note: currently this only downloads the methylation files. We hope to provide the alignment and phasing data in the future.
If you wish to rerun from the beginning after running this command, you can revert to the original download with snakemake --forceall.
Running the workflow
To generate all plots, tables and notebooks, simply run from the root directory:
snakemake --cores 16
If you don't wish to run the full analysis, you can run specific rules from the Snakefile by running, for example:
snakemake --cores 16 rnaseq_analysis
snakemake --cores 16 haplotype_analysis
snakemake --cores 16 methylation_analysis
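If you want to see what a given target would do before committing the compute, Snakemake's standard dry-run flag works here too, e.g. snakemake -n methylation_analysis prints the jobs that would execute without running them.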
Installation without conda
Software dependencies:
- Albacore (optional)
- BWA (optional)
- Nanopolish (optional)
Python package dependencies:
pip install --user -r requirements.txt
R package dependencies:
Rscript install_R_deps.R
Known Issues
- rnaseq_analysis.Rmd failed: there is no package called 'ggrastr'
If devtools doesn't play nicely with conda, sometimes the automatic GitHub installation of ggrastr fails. You can resolve it as follows:
git clone --depth 1 https://github.com/VPetukhov/ggrastr.git
cd ggrastr
R -e 'devtools::install()'
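Alternatively, installing the package straight from GitHub inside R should be equivalent: R -e 'devtools::install_github("VPetukhov/ggrastr")'.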
Code Snippets
shell: "python build_supplementary_db.py {input.bam} --summary {output.summary}"

shell: "python split_methylation_by_alignment.py {input.bam} {input.meth} > {output}"

shell: "python calculate_methylation_frequency.py -i {input} -p > {output}"

shell: "python call_variant_proportion.py -b {input.bam} -t {threads} -p {input.phased_bam} -v {input.vcf} -o {output}"

shell:
    "mkdir -p {output}.tmp && tail -n +2 {input.meth_split} | "
    "sort -k4,4 -k1,1 -k2n,2n -T {output}.tmp | "
    "cat <(head -n 1 {input.meth}) - > {output} && "
    "rm -rf {output}.tmp"

shell:
    "mkdir -p {output}.tmp && tail -n +2 {input.meth_split} | "
    "sort -k1,1 -k2n,2n -k4,4 -T {output}.tmp | "
    "cat <(head -n 1 {input.meth}) - > {output} && "
    "rm -rf {output}.tmp"

shell: "Rscript read_summary.R {input} {output}"

shell: "Rscript read_haplotypes.R {input} {output}"

shell: "Rscript fit_reads.R ../nanopore/{params.sample} ../RData/{params.sample}"

shell: "python split_methylation_by_haplotype.py -m {input.meth} -p {input.phase}"

shell: "python calculate_methylation_frequency.py -i {input} -p > {output}"

shell: "Rscript dss_paired.R &> {log}"

shell: "python compare_haplotype_methylation.py -m {input.meth} -p {input.phase} -b 11 -o 5 -r $(cat {input.region}) > {output}"

script: "build_dss_dmrlist.R"

script: "build_methylation_df.R"
# build_dss_dmrlist.R
suppressMessages(suppressPackageStartupMessages(library(tidyverse)))

load("../RData/paired_DSS.RData")

combined_dmr <- dmr_parent %>%
  mutate(type="imprinted") %>%
  bind_rows(dmr_strain %>%
              mutate(type="strain")) %>%
  mutate(chr = as.character(chr)) %>%
  arrange(-abs(areaStat)) %>%
  mutate(dmr_start=start,
         dmr_end=end,
         start = dmr_start-5000,
         end = dmr_end + 5000,
         id=row_number()) %>%
  select(chr, start, end, everything()) %>%
  fuzzyjoin::genome_left_join(read_tsv("../genome_data/Mus_musculus.GRCm38_90.chr.genes.tsv",
                                       col_types="ccccdd") %>%
                                select(chr, start, end, gene_name)) %>%
  select(-starts_with("start"), -starts_with("end"), -chr.y) %>%
  select(id, chr=chr.x, start=dmr_start, end=dmr_end, everything()) %>%
  group_by(id) %>%
  mutate(overlapping_genes=paste0(gene_name, collapse=",")) %>%
  distinct(overlapping_genes, .keep_all = TRUE) %>%
  ungroup() %>%
  select(-gene_name)

combined_dmr %>%
  write_csv("../tables/dss_dmrlist.csv")
# build_methylation_df.R
suppressMessages(suppressPackageStartupMessages(library(tidyverse)))
suppressMessages(suppressPackageStartupMessages(library(data.table)))

bisulfite_df <- read_tsv("../bisulfite/B6CastF1_1_pe.summary.tsv",
                         col_names=c("chr", "pos", "percentMeth", "meth", "coverage"),
                         col_types='ciddd') %>%
  arrange(chr, pos) %>%
  dplyr::rename(start=pos) %>%
  mutate(end=start,
         chr=sub("chr", "", chr))
save(bisulfite_df, file="../RData/bisulfite_df.RData")

nanopolish_df <- read.table("../nanopore/b6xcast.minion.methylation.summary.tsv",
                            header=TRUE, sep="\t", stringsAsFactors = FALSE) %>%
  mutate(start=start+1,
         end=end+2) %>%
  dplyr::rename(chr=chromosome, percentMeth=methylated_frequency) %>%
  arrange(chr, start)
save(nanopolish_df, file="../RData/nanopolish_df.RData")
# build_supplementary_db.py
from __future__ import print_function
import pysam
import pickle
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('bam')
parser.add_argument('-s', '--summary', default=None,
                    help='Optional file for printing read summaries')
parser.add_argument('--summary-only', default=False, action='store_true',
                    help="Don't produce a .suppdb file")
args = parser.parse_args()
bam_fn = args.bam

reads = dict()
with pysam.AlignmentFile(bam_fn, 'r') as bam:
    if args.summary:
        import numpy as np
        quals = dict()
    for read in bam:
        if read.is_unmapped:
            continue
        try:
            overlap = False
            for i in range(len(reads[read.query_name])):
                chr, start, end = reads[read.query_name][i]
                if chr == read.reference_name and start <= read.reference_end and end >= read.reference_start:
                    reads[read.query_name][i] = (
                        chr,
                        min(start, read.reference_start),
                        max(end, read.reference_end))
                    overlap = True
            if not overlap:
                reads[read.query_name].append(
                    (read.reference_name, read.reference_start, read.reference_end))
        except KeyError:
            reads[read.query_name] = [
                (read.reference_name, read.reference_start, read.reference_end)]
        if args.summary:
            try:
                read_qual = np.median(read.query_qualities)
            except TypeError:
                read_qual = "NA"
            try:
                quals[read.query_name].append(read_qual)
            except KeyError:
                quals[read.query_name] = [read_qual]

non_supplementary = set()
for read, alignments in reads.items():
    if len(alignments) == 1:
        non_supplementary.add(read)

if args.summary:
    with open(args.summary, 'w') as summary:
        print("read_name\tchr\tstart\tend\tqual", file=summary)
        for read, alignments in reads.items():
            if read in non_supplementary:
                print("{read_name}\t{chr}\t{start}\t{end}\t{qual}".format(
                    read_name=read, chr=alignments[0][0],
                    start=alignments[0][1], end=alignments[0][2],
                    qual=quals[read][0]), file=summary)
            else:
                for i, alignment in enumerate(alignments):
                    print("{read_name}\t{chr}\t{start}\t{end}\t{qual}".format(
                        read_name=read + "_" + str(i), chr=alignment[0],
                        start=alignment[1], end=alignment[2],
                        qual=quals[read][i]), file=summary)

if not args.summary_only:
    for read in non_supplementary:
        del reads[read]
    with open("{}.suppdb".format(bam_fn), 'wb') as handle:
        pickle.dump(reads, handle, pickle.HIGHEST_PROTOCOL)
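The .suppdb file written above is simply a pickled dict mapping each read that has supplementary alignments to its list of (chr, start, end) tuples; reads with a single alignment are dropped before pickling. A minimal sketch for inspecting one (the filename sample.bam.suppdb is hypothetical; the script writes "<bam>.suppdb"):

import pickle

with open("sample.bam.suppdb", "rb") as handle:
    suppdb = pickle.load(handle)

for read_name, alignments in suppdb.items():
    # Downstream scripts rename these reads "<name>_<i>", one per alignment.
    for i, (chrom, start, end) in enumerate(alignments):
        print("{}_{}".format(read_name, i), chrom, start, end)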
# calculate_methylation_frequency.py
from __future__ import print_function
import math
import sys
import csv
import argparse
from collections import namedtuple


def make_key(c, s, e):
    return c + ":" + str(s) + ":" + str(e)


class SiteStats:
    def __init__(self, g_size, g_seq):
        self.num_reads = 0
        self.posterior_methylated = 0
        self.called_sites = 0
        self.called_sites_methylated = 0
        self.group_size = g_size
        self.sequence = g_seq


def update_call_stats(key, num_called_cpg_sites, methylation, sequence):
    if key not in sites:
        sites[key] = SiteStats(num_called_cpg_sites, sequence)
    sites[key].num_reads += 1
    sites[key].called_sites += num_called_cpg_sites
    if methylation > 0:
        sites[key].called_sites_methylated += num_called_cpg_sites * methylation


parser = argparse.ArgumentParser(
    description='Calculate methylation frequency at genomic CpG sites')
parser.add_argument('-c', '--call-threshold', type=float, required=False, default=2.5)
parser.add_argument('-i', '--input', type=str, required=False)
parser.add_argument('-s', '--split-groups', action='store_true')
parser.add_argument('-p', '--probabilistic', action='store_true')
parser.add_argument('--prior', type=float, default=0.5,
                    help="Prior probability of methylation")
args = parser.parse_args()
assert(args.call_threshold is not None)
if args.probabilistic:
    args.call_threshold = 0

sites = dict()

if args.input:
    in_fh = open(args.input)
else:
    in_fh = sys.stdin
csv_reader = csv.DictReader(in_fh, delimiter='\t')

for record in csv_reader:
    try:
        num_sites = int(record['num_motifs'])
    except KeyError:
        # backwards compatible
        num_sites = int(record['num_cpgs'])
    llr = float(record['log_lik_ratio'])

    # Skip ambiguous call
    if abs(llr) < args.call_threshold:
        continue
    sequence = record['sequence']

    if args.probabilistic:
        try:
            if llr > 700:
                llr = 700
            elif llr < -700:
                llr = -700
            methylation = 1 / (1 + (1 - args.prior) / (args.prior * math.exp(llr)))
        except OverflowError:
            print(llr)
            raise
    else:
        methylation = 1 if llr > 0 else 0

    # if this is a multi-cpg group and split_groups is set, break up these
    # sites
    if args.split_groups and num_sites > 1:
        c = record['chromosome']
        s = int(record['start'])
        e = int(record['end'])
        # find the position of the first CG dinucleotide
        sequence = record['sequence']
        cg_pos = sequence.find("CG")
        first_cg_pos = cg_pos
        while cg_pos != -1:
            key = make_key(c, s + cg_pos - first_cg_pos, s + cg_pos - first_cg_pos)
            update_call_stats(key, 1, methylation, "split-group")
            cg_pos = sequence.find("CG", cg_pos + 1)
    else:
        key = make_key(record['chromosome'], record['start'], record['end'])
        update_call_stats(key, num_sites, methylation, sequence)

# header
print("\t".join(["chromosome", "start", "end", "num_cpgs_in_group",
                 "called_sites", "called_sites_methylated",
                 "methylated_frequency", "group_sequence"]))

for key in sites:
    if sites[key].called_sites > 0:
        (c, s, e) = key.split(":")
        f = float(sites[key].called_sites_methylated) / sites[key].called_sites
        print("\t".join([str(x) for x in [c, s, e, sites[key].group_size,
                                          sites[key].called_sites,
                                          sites[key].called_sites_methylated,
                                          f, sites[key].sequence]]))
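When run with -p, the script above converts each site's log-likelihood ratio into a posterior weight rather than a hard methylated/unmethylated call. The transform is Bayes' rule on the LLR with a configurable prior, clamped to ±700 to avoid overflow; restated on its own:

import math

def posterior_methylated(llr, prior=0.5):
    # Same formula as calculate_methylation_frequency.py in probabilistic mode.
    llr = max(-700.0, min(700.0, llr))
    return 1 / (1 + (1 - prior) / (prior * math.exp(llr)))

# With the default flat prior this reduces to the logistic function of the LLR.
print(posterior_methylated(0.0))   # 0.5: completely ambiguous
print(posterior_methylated(2.5))   # ~0.92
print(posterior_methylated(-2.5))  # ~0.08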
# call_variant_proportion.py
from __future__ import print_function
import sys
import pysam
import csv
import pickle
import functools
import multiprocessing
import argparse
import math

__min_coverage = 5
__override_ratio = 3
# fit exponential to log(1-correctness)
__slope = -0.1203
__intercept = -0.6927


def parse_args():
    parser = argparse.ArgumentParser(
        description="Phase nanopore reads from nanopolish output")
    parser.add_argument('-b', '--bam', required=True,
                        help="Path to the original bam file")
    parser.add_argument('-p', '--phased', required=True,
                        help="Path to the phased bam file from nanopolish phase-reads")
    parser.add_argument('-v', '--vcf', required=True, nargs='+',
                        help="Path to the variants file(s)")
    parser.add_argument('-d', '--detailed-output', action='store_true', default=False,
                        help="Print per-read, per-site calls to stdout")
    parser.add_argument('-o', '--outfile', type=argparse.FileType('w'),
                        default=sys.stdout)
    parser.add_argument('-t', '--threads', type=int,
                        default=multiprocessing.cpu_count())
    args = parser.parse_args()
    return args


def get_read_id(read, contig, position, suppdb):
    if read.query_name in suppdb:
        for i, alignment in enumerate(suppdb[read.query_name]):
            chr, start, end = alignment
            if contig == chr and position >= start and position <= end:
                return "{}_{}".format(read.query_name, i)
        raise Exception("Position {}:{} not internal to any alignment of read {}".format(
            contig, position, read.query_name))
    else:
        return read.query_name


def check_column_support(vcf_rows, vcf_pos, column, reads, suppdb, type, detailed_output):
    if column is not None:
        valid_allele = [True if vcf_row['POS'] == vcf_pos and len(
            vcf_row['REF']) == 1 else False for vcf_row in vcf_rows]
        if column.reference_pos == vcf_pos and True in valid_allele and 'PASS' in [vcf_row['FILTER'] for vcf_row in vcf_rows]:
            for read in column.pileups:
                if not read.is_del:
                    read_name = read.alignment.query_name
                    read_pos = read.query_position
                    base = read.alignment.query_sequence[read_pos]
                    call = []
                    try:
                        qual = read.alignment.query_qualities[read_pos]
                        if type == "signal":
                            # nanopolish qualities scores should be offset by
                            # 35 - they never give anything below this
                            qual = max(0, qual - 35)
                        error = math.exp(__intercept + __slope * qual)
                    except TypeError:
                        if read.alignment.query_qualities is None:
                            # quality scores missing
                            error = 0
                        else:
                            # unknown error
                            raise
                    try:
                        reads[read_name]
                    except KeyError:
                        reads[read_name] = [0] * (len(vcf_rows) + 2)
                    # only have to check reference once
                    if base == vcf_rows[0]["REF"]:
                        reads[read_name][0] += 1 - error
                        call.append("ref")
                    else:
                        reads[read_name][0] += error
                    for i in range(len(vcf_rows)):
                        if vcf_rows[i]['FILTER'] == 'PASS':
                            match = False
                            if valid_allele[i]:
                                if base in vcf_rows[i]["ALT"]:
                                    match = True
                            elif "ref" in call and base == vcf_rows[0]["REF"]:
                                # no snp here, same as reference
                                match = True
                            if match:
                                reads[read_name][i + 1] += 1 - error
                                call.append("alt{}".format(i) if len(
                                    vcf_rows) > 1 else "alt")
                            else:
                                reads[read_name][i + 1] += error
                        else:
                            # treat as heterozygous
                            if valid_allele[i] and base in vcf_rows[i]['REF'] + vcf_rows[i]['ALT']:
                                reads[read_name][i + 1] += (1 - error) / 2
                            else:
                                reads[read_name][i + 1] += error
                    # add 1 to coverage
                    reads[read_name][-1] += 1
                    if len(call) > 0 and detailed_output:
                        qual = read.alignment.query_qualities[read_pos]
                        print(vcf_rows)
                        print(base)
                        print(error)
                        print(reads[read_name])
                        print("{}\t{}\t{}\t{}\t{}\t{}".format(
                            column.reference_name, column.reference_pos,
                            read_name, type, ",".join(call), qual))
    return reads


def ref_support(reads):
    return set([read for read, counts in reads.items() if counts[0] > counts[1]])


def summarize_reads(reads):
    if len(reads) == 0:
        return None
    support = len(ref_support(reads))
    return float(support) / len(reads)


def load_vcf(handle):
    header = next(handle)
    while header.startswith("##"):
        header = next(handle)
    header = header.strip("#").strip("\n").split("\t")
    reader = csv.DictReader(handle, delimiter="\t", fieldnames=header)
    return reader


def pileup_step(ref_pos, original_pileup, phased_pileup):
    try:
        original_column = next(original_pileup)
        while original_column.reference_pos < ref_pos:
            original_column = next(original_pileup)
    except StopIteration:
        original_column = None
    try:
        phased_column = next(phased_pileup)
        while phased_column.reference_pos < ref_pos:
            phased_column = next(phased_pileup)
    except StopIteration:
        phased_column = None
    return original_column, phased_column


def process_row(vcf_rows, vcf_pos, original_pileup, original_reads,
                phased_pileup, phased_reads, suppdb, detailed_output):
    if 'PASS' in [vcf_row['FILTER'] for vcf_row in vcf_rows]:
        # process this snp
        original_column, phased_column = pileup_step(
            vcf_pos, original_pileup, phased_pileup)
        if not (original_column or phased_column):
            return None, original_reads, None, phased_reads
        else:
            original_reads = check_column_support(
                vcf_rows, vcf_pos, original_column, original_reads,
                suppdb, "base", detailed_output)
            phased_reads = check_column_support(
                vcf_rows, vcf_pos, phased_column, phased_reads,
                suppdb, "signal", detailed_output)
    return original_pileup, original_reads, phased_pileup, phased_reads


def call_contig(contig, vcf_fns, original_fn, phased_fn, suppdb, detailed_output):
    with pysam.AlignmentFile(original_fn, 'r') as original_bam, pysam.AlignmentFile(phased_fn, 'r') as phased_bam:
        vcf_handles = []
        vcf_readers = []
        for vcf_fn in vcf_fns:
            handle = open(vcf_fn, 'r')
            vcf_handles.append(handle)
            vcf_readers.append(load_vcf(handle))
        original_pileup = original_bam.pileup(contig)
        original_reads = dict()
        phased_pileup = phased_bam.pileup(contig)
        phased_reads = dict()
        if original_pileup or phased_pileup:
            # find right location in vcf - maybe an index is better for this
            vcf_rows = [next(vcf_reader) for vcf_reader in vcf_readers]
            for i, reader in enumerate(vcf_readers):
                if not vcf_rows[i]['CHROM'] == contig:
                    for vcf_row in reader:
                        if vcf_row['CHROM'] != contig:
                            # not there yet
                            continue
                        vcf_rows[i] = vcf_row
                        break
                vcf_rows[i]['POS'] = int(vcf_rows[i]['POS']) - 1
            # process the files
            while True:
                try:
                    vcf_pos = min(
                        [vcf_row['POS'] for vcf_row in vcf_rows if vcf_row['CHROM'] == contig])
                except ValueError:
                    # empty list, all on wrong contig
                    break
                original_pileup, original_reads, phased_pileup, phased_reads = process_row(
                    vcf_rows, vcf_pos, original_pileup, original_reads,
                    phased_pileup, phased_reads, suppdb, detailed_output)
                if not (original_pileup or phased_pileup):
                    # out of reads
                    break
                num_failed = 0
                for i in range(len(vcf_rows)):
                    try:
                        if vcf_rows[i]['CHROM'] == contig and vcf_rows[i]['POS'] <= vcf_pos:
                            vcf_rows[i] = next(vcf_readers[i])
                            if vcf_rows[i]['CHROM'] != contig:
                                vcf_rows[i]['POS'] = float('inf')
                                num_failed += 1
                            else:
                                vcf_rows[i]['POS'] = int(
                                    vcf_rows[i]['POS']) - 1
                    except StopIteration:
                        vcf_rows[i]['CHROM'] = None
                        vcf_rows[i]['POS'] = float('inf')
                        num_failed += 1
                if num_failed == len(vcf_rows):
                    # no more variants
                    break
    return original_reads, phased_reads


args = parse_args()
with open("{}.suppdb".format(args.bam), 'rb') as handle:
    suppdb = pickle.load(handle)
func = functools.partial(call_contig, vcf_fns=args.vcf, original_fn=args.bam,
                         phased_fn=args.phased, suppdb=suppdb,
                         detailed_output=args.detailed_output)
with pysam.AlignmentFile(args.bam, 'r') as bam:
    original_reads, phased_reads = dict(), dict()
    contigs = [bam.get_reference_name(i) for i in range(bam.nreferences)]

if args.detailed_output:
    print("\t".join(["chr", "pos", "read_name", "caller", "genotype", "qual"]))

threads = 1 if args.detailed_output else min(args.threads, len(contigs))
if threads > 1:
    pool = multiprocessing.Pool(threads)
    data = pool.imap_unordered(func, contigs)
    pool.close()
    pool.join()
else:
    data = [func(contig) for contig in contigs]

for original_contig_reads, phased_contig_reads in data:
    original_reads.update(original_contig_reads)
    phased_reads.update(phased_contig_reads)

reads = set(original_reads.keys()).union(set(phased_reads.keys()))
signal_alt_header = "\t".join(["signal_alt{}".format(
    i + 1) for i in range(len(args.vcf))]) if len(args.vcf) > 1 else "signal_alt"
base_alt_header = "\t".join(["base_alt{}".format(
    i + 1) for i in range(len(args.vcf))]) if len(args.vcf) > 1 else "base_alt"
print("\t".join(['read', 'genotype',
                 'signal_ref', signal_alt_header, 'signal_coverage',
                 'base_ref', base_alt_header, 'base_coverage']), file=args.outfile)
for read in reads:
    try:
        signal_calls = phased_reads[read]
    except KeyError:
        signal_calls = [0] * (len(args.vcf) + 2)
    try:
        base_calls = original_reads[read]
    except KeyError:
        base_calls = [0] * (len(args.vcf) + 2)
    base_coverage = base_calls[-1]
    base_calls = base_calls[:-1]
    signal_coverage = signal_calls[-1]
    signal_calls = signal_calls[:-1]
    base_max = max(base_calls)
    signal_max = max(signal_calls)
    if signal_coverage == 0 or signal_calls.count(signal_max) > 1:
        signal_call = "fail"
        signal_ratio = 0
    else:
        signal_call = signal_calls.index(signal_max)
        signal_ratio = signal_max / signal_coverage
        if signal_call == 0:
            signal_call = "ref"
        else:
            signal_call = "alt{}".format(
                signal_call) if len(args.vcf) > 1 else "alt"
    if base_coverage == 0 or base_calls.count(max(base_calls)) > 1:
        base_call = "fail"
        base_ratio = 0
    else:
        base_call = base_calls.index(max(base_calls))
        base_ratio = base_max / base_coverage
        if base_call == 0:
            base_call = "ref"
        else:
            base_call = "alt{}".format(base_call) if len(
                args.vcf) > 1 else "alt"
    if base_coverage < __min_coverage and signal_coverage < __min_coverage:
        genotype = "fail"
    elif base_call == signal_call:
        genotype = base_call
    elif signal_call == "fail":
        genotype = base_call
    elif base_call == "fail":
        genotype = signal_call
    elif base_coverage > __override_ratio * signal_coverage and base_call != "fail":
        genotype = base_call
    elif signal_coverage > __override_ratio * base_coverage and signal_call != "fail":
        genotype = signal_call
    elif base_ratio - 0.5 > __override_ratio * (signal_ratio - 0.5) and base_ratio > 0.5:
        genotype = base_call
    elif signal_ratio - 0.5 > __override_ratio * (base_ratio - 0.5) and signal_ratio > 0.5:
        genotype = signal_call
    else:
        genotype = "fail"
    if read in suppdb:
        for j in range(len(suppdb[read])):
            supp_read = "{}_{}".format(read, j)
            print("\t".join([supp_read, genotype,
                             "\t".join([str(i) for i in signal_calls]), str(signal_coverage),
                             "\t".join([str(i) for i in base_calls]), str(base_coverage)]),
                  file=args.outfile)
    else:
        print("\t".join([read, genotype,
                         "\t".join([str(i) for i in signal_calls]), str(signal_coverage),
                         "\t".join([str(i) for i in base_calls]), str(base_coverage)]),
              file=args.outfile)
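The per-base weights in call_variant_proportion.py come from an exponential fit of error rate against base quality (the __slope/__intercept constants above), with nanopolish "signal" qualities shifted down by 35 first. A standalone restatement of that error model:

import math

__slope = -0.1203
__intercept = -0.6927

def error_rate(qual, caller="base"):
    # "signal" qualities come from nanopolish phase-reads and are offset by 35.
    if caller == "signal":
        qual = max(0, qual - 35)
    return math.exp(__intercept + __slope * qual)

# Higher base quality -> lower assumed error, so a stronger vote for its allele.
for q in (10, 20, 30):
    print(q, round(error_rate(q), 4))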
# compare_haplotype_methylation.py
from __future__ import print_function
import csv
import numpy as np
import math
import argparse
import re


def parse_region(region):
    """
    Parse a region specification.

    :param region: String region specification
    :raises argparse.ArgumentTypeError: raises an error if format not recognised
    :returns contig: String contig / chromosome name
    :returns start: integer start position (0-based)
    :returns end: integer end position (1-based)

    >>> parseRegion("chr1:1000-2000")
    ("chr1", 1000, 2000)
    """
    r = ''.join(region.split())  # remove whitespace
    r = re.split(':|-', r)
    start = 0
    end = None
    if len(r) < 1:
        raise argparse.ArgumentTypeError(
            "Region {} must specify a reference name".format(region))
    elif len(r) > 3:
        raise argparse.ArgumentTypeError(
            "Region {} format not recognized".format(region))
    else:
        contig = r[0]
        try:
            start = int(re.sub(",|\.", "", r[1]))
        except IndexError:
            pass
        try:
            end = int(re.sub(",|\.", "", r[2]))
        except IndexError:
            pass
        finally:
            return contig, start, end


parser = argparse.ArgumentParser()
parser.add_argument('-m', '--methylation', required=True,
                    help="Methylation data, split by supplementaries, sorted by genomic location")
parser.add_argument('-p', '--phase', required=True,
                    help="Phased reads (haplotyping data)")
parser.add_argument('-b', '--binwidth', type=int, default=11,
                    help="Genomic distance over which to aggregate tests")
parser.add_argument('-o', '--overlap', type=int, default=5,
                    help="Genomic distance for bin overlap. Limited to binwidth/2 for streaming reasons, but this could be reengineered")
parser.add_argument('-r', '--region', type=parse_region, nargs="+",
                    metavar="CHR:START-END",
                    help="Name and range of contigs to test (Default: entire genome)")
args = parser.parse_args()

__binwidth = args.binwidth
__overlap = min(__binwidth // 2, args.overlap)  # max = __binwidth/2
meth_fn = args.methylation
haplotype_fn = args.phase

regions = dict()  # format: { chr : [start, end] }
if args.region:
    for region in args.region:
        try:
            regions[region[0]]
        except KeyError:
            regions[region[0]] = []
        regions[region[0]].append([region[1], region[2]])

haplotypes = dict()
with open(haplotype_fn, 'r') as haplotype_handle:
    reader = csv.DictReader(haplotype_handle, delimiter="\t")
    for row in reader:
        haplotypes[row["read"]] = row["genotype"]


def get_genotype(read_name):
    try:
        return haplotypes[read_name]
    except KeyError:
        return "fail"


print("\t".join(["chr", "start", "end",
                 "ref_mean", "ref_sd", "ref_site_count", "ref_read_count",
                 "alt_mean", "alt_sd", "alt_site_count", "alt_read_count"]))

chr = None
start = None
ref_reads = None
alt_reads = None
ref_meth = None
alt_meth = None
next_start = None
next_ref_reads = None
next_alt_reads = None
next_ref_meth = None
next_alt_meth = None
end = 0

with open(meth_fn, 'r') as meth_handle:
    reader = csv.DictReader(meth_handle, delimiter="\t")
    for row in reader:
        pos = (float(row["start"]) + float(row["end"])) / 2
        if regions:
            hit = False
            try:
                for region in regions[row['chromosome']]:
                    if pos >= region[0] and pos < region[1]:
                        hit = True
            except KeyError:
                pass
            if not hit:
                continue
        if pos >= end or row["chromosome"] != chr:
            if chr is not None:
                # print result
                if not (len(ref_meth) == 0 and len(alt_meth) == 0):
                    ref_mean = np.mean(ref_meth) if len(ref_meth) > 0 else "NA"
                    ref_sd = np.std(ref_meth) if len(ref_meth) > 0 else "NA"
                    alt_mean = np.mean(alt_meth) if len(alt_meth) > 0 else "NA"
                    alt_sd = np.std(alt_meth) if len(alt_meth) > 0 else "NA"
                    print("\t".join([chr, str(start), str(end),
                                     str(ref_mean), str(ref_sd), str(len(ref_meth)), str(len(ref_reads)),
                                     str(alt_mean), str(alt_sd), str(len(alt_meth)), str(len(alt_reads))]))
            if chr is not None and chr == row["chromosome"] and pos < next_start:
                start = next_start
                ref_reads = next_ref_reads
                alt_reads = next_alt_reads
                ref_meth = next_ref_meth
                alt_meth = next_alt_meth
            else:
                start = int(row["start"])
                ref_reads = set()
                alt_reads = set()
                ref_meth = list()
                alt_meth = list()
                chr = row["chromosome"]
            next_start = start + __overlap
            end = start + __binwidth
            next_ref_reads = set()
            next_alt_reads = set()
            next_ref_meth = list()
            next_alt_meth = list()
        try:
            meth = 1 / (1. + math.exp(-float(row["log_lik_ratio"])))
        except OverflowError:
            meth = 0.0
        genotype = get_genotype(row["read_name"])
        if genotype == "ref":
            ref_meth.append(meth)
            ref_reads.add(row["read_name"])
        elif genotype == "alt":
            alt_meth.append(meth)
            alt_reads.add(row["read_name"])
        if int(row["start"]) >= next_start:
            if genotype == "ref":
                next_ref_meth.append(meth)
                next_ref_reads.add(row["read_name"])
            elif genotype == "alt":
                next_alt_meth.append(meth)
                next_alt_reads.add(row["read_name"])

# print last result
if chr is not None:
    ref_mean = np.mean(ref_meth) if len(ref_meth) > 0 else "NA"
    ref_sd = np.std(ref_meth) if len(ref_meth) > 0 else "NA"
    alt_mean = np.mean(alt_meth) if len(alt_meth) > 0 else "NA"
    alt_sd = np.std(alt_meth) if len(alt_meth) > 0 else "NA"
    print("\t".join([chr, str(start), str(end),
                     str(ref_mean), str(ref_sd), str(len(ref_meth)), str(len(ref_reads)),
                     str(alt_mean), str(alt_sd), str(len(alt_meth)), str(len(alt_reads))]))
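The script streams the methylation calls once and aggregates them into overlapping windows: each bin spans --binwidth bases, and the next bin starts --overlap bases later (hence the streaming restriction overlap <= binwidth/2). A toy illustration of the tiling with the defaults used above (-b 11 -o 5; the start coordinate is made up):

binwidth, overlap = 11, 5
start = 100  # hypothetical first methylation site in the region
for _ in range(4):
    print("bin [{}, {})".format(start, start + binwidth))
    start += overlap  # each successive bin begins `overlap` bases later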
# count_cpg.py
import sys
import re


def count_cpg(file, path=True):
    substring = "CG"
    print("chr\tstart\tend")
    f = open(file) if path else file
    line = f.readline()
    while line:
        if line.startswith('>'):
            chrm = line[1:].rstrip()
        else:
            index = 0
            linelen = len(line)
            # Find first non-N base
            # index = re.search(r'[^N]', line).start()
            # print("start\t{}\tend\t{}".format(index+1, linelen))
            while index < linelen:
                index = line.find(substring, index)
                if index == -1:
                    break
                print("{}\t{}\t{}".format(
                    chrm, index + 1, index + len(substring)))
                index += len(substring)
        line = f.readline()
    if path:
        f.close()


if __name__ == "__main__":
    if len(sys.argv) != 2:
        if sys.stdin:
            count_cpg(sys.stdin, path=False)
        else:
            raise TypeError("missing file input")
    else:
        count_cpg(sys.argv[1])
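Since count_cpg also accepts an open file-like object (path=False), it is easy to sanity-check on a toy FASTA; continuing from the script above, in this hypothetical two-line record the CG dinucleotides sit at 1-based positions 3 and 7:

import io

toy = io.StringIO(">chr1\nAACGTTCGAA\n")
count_cpg(toy, path=False)
# chr    start  end
# chr1   3      4
# chr1   7      8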
# dss_paired.R
suppressMessages(suppressPackageStartupMessages(library(tidyverse)))
suppressMessages(suppressPackageStartupMessages(library(DSS)))
suppressMessages(suppressPackageStartupMessages(library(bsseq)))

meth_fn <- "../nanopore/b6xcast.minion.methylation"
forward_ref_df <- read.table(paste0(meth_fn, ".ref_summary.tsv"),
                             header=TRUE, sep="\t", stringsAsFactors = FALSE) %>%
  mutate(start=start+1,
         end=end+2,
         sampleName="b6xcast.mat") %>%
  dplyr::rename(chr=chromosome, percentMeth=methylated_frequency) %>%
  arrange(chr, start)
forward_alt_df <- read.table(paste0(meth_fn, ".alt_summary.tsv"),
                             header=TRUE, sep="\t", stringsAsFactors = FALSE) %>%
  mutate(start=start+1,
         end=end+2,
         sampleName="b6xcast.pat") %>%
  dplyr::rename(chr=chromosome, percentMeth=methylated_frequency) %>%
  arrange(chr, start)

meth_fn <- "../nanopore/castxb6.promethion.methylation"
reverse_ref_df <- read.table(paste0(meth_fn, ".ref_summary.tsv"),
                             header=TRUE, sep="\t", stringsAsFactors = FALSE) %>%
  mutate(start=start+1,
         end=end+2,
         sampleName="castxb6.pat") %>%
  dplyr::rename(chr=chromosome, percentMeth=methylated_frequency) %>%
  arrange(chr, start)
reverse_alt_df <- read.table(paste0(meth_fn, ".alt_summary.tsv"),
                             header=TRUE, sep="\t", stringsAsFactors = FALSE) %>%
  mutate(start=start+1,
         end=end+2,
         sampleName="castxb6.mat") %>%
  dplyr::rename(chr=chromosome, percentMeth=methylated_frequency) %>%
  arrange(chr, start)

tidy_df <- function(df, sampleName) {
  Cov <- rlang::sym(paste0("Cov.", sampleName))
  M <- rlang::sym(paste0("M.", sampleName))
  df %>%
    select(chr, start, end,
           !!Cov:=called_sites,
           !!M:=called_sites_methylated)
}

make_bs <- function(df) {
  M <- df %>%
    select(starts_with("M.")) %>%
    rename_all(function(x) gsub("M.", "", x)) %>%
    mutate_all(function(x) ifelse(is.na(x), 0, x)) %>%
    as.matrix()
  Cov <- df %>%
    select(starts_with("Cov.")) %>%
    rename_all(function(x) gsub("Cov.", "", x)) %>%
    mutate_all(function(x) ifelse(is.na(x), 0, x)) %>%
    as.matrix()
  BSseq(chr = df$chr, pos = df$start,
        Cov = Cov, M = M,
        sampleNames = colnames(M))
}

bs <- tidy_df(forward_ref_df, "b6xcast.mat") %>%
  full_join(tidy_df(forward_alt_df, "b6xcast.pat"), by=c("chr", "start", "end")) %>%
  full_join(tidy_df(reverse_ref_df, "castxb6.pat"), by=c("chr", "start", "end")) %>%
  full_join(tidy_df(reverse_alt_df, "castxb6.mat"), by=c("chr", "start", "end")) %>%
  group_by(chr, start) %>%
  summarise_if(is.numeric, sum) %>%
  ungroup() %>%
  make_bs()
pData(bs)$strain <- c("b6xcast", "b6xcast", "castxb6", "castxb6")
pData(bs)$parent <- c("mat", "pat", "pat", "mat")

loci.idx <- which(rowSums(getCoverage(bs, type="Cov")==0) == 0)
dml_parent <- DMLtest(BSobj = bs[loci.idx,],
                      group1=c("b6xcast.mat", "castxb6.mat"),
                      group2=c("b6xcast.pat", "castxb6.pat"),
                      equal.disp = TRUE,
                      smoothing=TRUE)
dml_strain <- DMLtest(BSobj = bs[loci.idx,],
                      group1=c("b6xcast.mat", "castxb6.pat"),
                      group2=c("castxb6.mat", "b6xcast.pat"),
                      equal.disp = TRUE,
                      smoothing=TRUE)
dmr_parent <- callDMR(dml_parent, p.threshold=1e-5, dis.merge=1500)
dmr_strain <- callDMR(dml_strain, p.threshold=1e-5, dis.merge=1500)
save(dml_parent, dml_strain, dmr_parent, dmr_strain, file="../RData/paired_DSS.RData")
# ensembl_gtf_to_tsv.R
suppressMessages(suppressPackageStartupMessages(library(magrittr)))
suppressMessages(suppressPackageStartupMessages(library(dplyr)))
suppressMessages(suppressPackageStartupMessages(library(readr)))

args <- commandArgs(trailingOnly = TRUE)
gtf <- rtracklayer::import(args[1])
gene_list <- gtf %>%
  as_data_frame() %>%
  rename(chr=seqnames) %>%
  select(chr, start, end, strand, gene_id, gene_name) %>%
  group_by(chr, gene_id, gene_name, strand) %>%
  summarise(start=min(start), end=max(end))
gene_list %>%
  write_tsv(sub("gtf$", "genes.tsv", args[1]))
# fit_reads.R
suppressMessages(suppressPackageStartupMessages(library(tidyverse)))
suppressMessages(suppressPackageStartupMessages(library(ggExtra)))
suppressMessages(suppressPackageStartupMessages(library(broom)))
suppressMessages(suppressPackageStartupMessages(library(data.table)))

n_min <- 7

args = commandArgs(trailingOnly=TRUE)
infile <- args[1]
outdir <- args[2]
load(file.path(outdir, "summary_df.RData"))
load(file.path(outdir, "haplotype_df.RData"))

fit_loess <- function(chr, read_name, start, percentMeth, ...) {
  x = data.frame(start=start, percentMeth=percentMeth)
  fit <- loess(percentMeth ~ start, x,
               control=loess.control(surface = "interpolate",
                                     statistics = "none",
                                     trace.hat = "approximate",
                                     cell=0.5), ...)
  fit <- list(kd=fit$kd, start=min(start), end=max(start),
              chr=chr, read_name=read_name)
  fit
}

process_file <- function(filepath) {
  con = file(filepath, "r")
  reads=list()
  header = readLines(con, n = 1)
  # process first line
  line = readLines(con, n = 1)
  line <- strsplit(line, "\\t")[[1]]
  chr=line[1]
  read_name = line[4]
  pos=as.integer(line[2])
  meth = 1/(1+exp(-as.numeric(line[5])))
  i=1
  start=c(pos)
  percentMeth=c(meth)
  # process the rest
  while ( TRUE ) {
    line = readLines(con, n = 1)
    if ( length(line) == 0 ) {
      break
    }
    line <- strsplit(line, "\\t")[[1]]
    pos=as.integer(line[2])
    name = line[4]
    meth = 1/(1+exp(-as.numeric(line[5])))
    if (name != read_name) {
      # new line
      if (length(start) > n_min){
        len <- max(start) - min(start)
        span <- 0.1 + 8e-11*((max(-len + 1e5, 0)))^2
        reads[[i]] = fit_loess(chr, read_name, start, percentMeth, span=span)
        i <- i + 1
      }
      chr=line[1]
      read_name = name
      start=c(pos)
      percentMeth=c(meth)
    } else {
      start=c(start, pos)
      percentMeth=c(percentMeth, meth)
    }
  }
  close(con)
  reads
}

fit_reads <- process_file(paste0(infile, ".methylation.sorted.by_read.tsv"))
save(fit_reads, file=file.path(outdir, "fit_reads.RData"))
# load(file.path(outdir, "fit_reads.RData"))

fit_reads_df <- data_frame(read_name=sapply(fit_reads, function(x) { x$read_name })) %>%
  mutate(id=row_number()) %>%
  left_join(summary_df, by="read_name") %>%
  arrange(chr, start, end) %>%
  left_join(haplotype_df %>% select(read_name, genotype), by="read_name") %>%
  dplyr::select(chr, start, end, id, read_name, genotype, qual)
save(fit_reads_df, file=file.path(outdir, "fit_reads_df.RData"))
shell: "python call_variant_proportion.py -b {input.bam} -p {input.phased_bam} -v {input.cast_vcf} {input.fvb_vcf} -o {output}"

shell: "Rscript render_notebook.R ../notebooks/nanopolish_fvb_resolution.Rmd &> {params.log}"

shell: "python merge_recombined_haplotypes.py -i {input.phase} -s {input.summary} -a alt2 -r $(cat {input.regions}) -o {output}"

shell: "wget -q -O {output} {params.url}"
# the same rule body is repeated for each of the four download rules here

shell: "md5sum -c {input.md5} && touch {output}"

shell: "gunzip {input.tsv} && touch {output.signpost}"
# mask_genome_variants.py
from __future__ import print_function
from Bio import SeqIO
from Bio import Seq
import re
import sys
import csv
import argparse


def load_vcf(handle):
    header = next(handle)
    while header.startswith("##"):
        header = next(handle)
    header = header.strip("#").strip("\n").split("\t")
    reader = csv.DictReader(handle, delimiter="\t", fieldnames=header)
    return reader


__ambiguity_codes = {
    'N': {'A', 'C', 'G', 'T'},
    'V': {'A', 'C', 'G'},
    'H': {'A', 'C', 'T'},
    'D': {'A', 'G', 'T'},
    'B': {'C', 'G', 'T'},
    'M': {'A', 'C'},
    'K': {'G', 'T'},
    'R': {'A', 'G'},
    'Y': {'C', 'T'},
    'W': {'A', 'T'},
    'S': {'C', 'G'},
    'A': {'A'},
    'C': {'C'},
    'G': {'G'},
    'T': {'T'}
}


def get_ambiguity_base(bases):
    bases = set(bases)
    for ambiguity_base, possible_bases in __ambiguity_codes.items():
        if bases == possible_bases:
            return ambiguity_base
    raise Exception("Unrecognises bases: ['" + "','".join(list(bases)) + "']")


def disambiguate_base(base):
    base = str(base)
    return list(__ambiguity_codes[base])


def parse_region(region):
    """
    Parse a region specification.

    :param region: String region specification
    :raises argparse.ArgumentTypeError: raises an error if format not recognised
    :returns contig: String contig / chromosome name
    :returns start: integer start position (0-based)
    :returns end: integer end position (1-based)

    >>> parseRegion("chr1:1000-2000")
    ("chr1", 1000, 2000)
    """
    region = ''.join(region.split())  # remove whitespace
    region = re.split(':|-', region)
    start = 0
    end = None
    if len(region) < 1:
        raise argparse.ArgumentTypeError(
            "Region must specify a reference name")
    elif len(region) > 3:
        raise argparse.ArgumentTypeError("Region format not recognized")
    else:
        contig = region[0]
        try:
            start = int(re.sub(",|\.", "", region[1]))
        except IndexError:
            pass
        try:
            end = int(re.sub(",|\.", "", region[2]))
        except IndexError:
            pass
        finally:
            return contig, start, end


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('-i', '--input', required=True,
                        help='filename of input genome fasta')
    parser.add_argument('-o', '--output', required=True,
                        help='filename of output masked genome fasta')
    parser.add_argument('-v', '--vcf', required=True,
                        help='filename of vcf listing snps')
    parser.add_argument('-a', '--alternate-only', default=False, action='store_true',
                        help='replace reference base with alternate allele, rather than retaining both (Default: false)')
    parser.add_argument('-r', '--region', type=parse_region, nargs="+",
                        metavar="CHR:START-END",
                        help="Name and range of contigs to mask (Default: entire genome)")
    parser.add_argument('-b', '--boundary', type=int, default=0, metavar="INT",
                        help="Width of region boundary where both alleles are included")
    return parser.parse_args()


def contig_to_int(contig):
    try:
        return(int(contig))
    except ValueError:
        if contig == 'X':
            return max_autosome + 1
        elif contig == 'Y':
            return max_autosome + 2
        elif contig == 'MT':
            return max_autosome + 3
        else:
            return max_autosome + 4


def cmp_contigs(contig):
    return(contig_to_int(contig), contig)


args = parse_args()

regions = dict()  # format: { chr : [start, end, alternate_only] }
if args.region:
    for region in args.region:
        try:
            regions[region[0]]
        except KeyError:
            regions[region[0]] = []
        region_start, region_end = region[1], region[2]
        if region_start > 0:
            region_start = region_start + args.boundary // 2
        if region_end is not None:
            region_end = region_end - args.boundary // 2
        regions[region[0]].append(
            [region_start, region_end, args.alternate_only])
        if args.boundary > 1:
            if region[1] > 0:
                regions[region[0]].append(
                    [region[1] - args.boundary // 2, region[1] + args.boundary // 2, False])
            if region[2] is not None:
                regions[region[0]].append(
                    [region[2] - args.boundary // 2, region[2] + args.boundary // 2, False])

genome = dict()
with open(args.input, 'r') as handle:
    for record in SeqIO.parse(handle, "fasta"):
        record.seq = record.seq.tomutable()
        genome[record.id] = record

with open(args.vcf, 'r') as handle:
    vcf = load_vcf(handle)
    for row in vcf:
        snp = True
        ref_base = [row['REF']]
        alt_bases = row['ALT'].split(',')
        for base in alt_bases + ref_base:
            if len(base) > 1:
                snp = False
        if not snp:
            # don't touch indels
            continue
        pos = int(row['POS']) - 1
        if regions:
            alternate_only = None
            try:
                for r in regions[row['CHROM']]:
                    if pos >= r[0] and pos < r[1]:
                        alternate_only = r[2]
            except KeyError:
                pass
            if alternate_only is None:
                # no region hit
                continue
        else:
            # genome wide
            alternate_only = args.alternate_only
        record = genome[row['CHROM']]
        bases = alt_bases
        if not alternate_only:
            bases = bases + disambiguate_base(record.seq[pos])
        record.seq[pos] = get_ambiguity_base(bases)
        genome[row['CHROM']] = record

contigs = list()
max_autosome = 0
for contig in genome.keys():
    contigs.append(contig)
    try:
        if int(contig) > max_autosome:
            max_autosome = int(contig)
    except ValueError:
        pass
contigs = sorted(contigs, key=cmp_contigs)

with open(args.output, 'w') as handle:
    for contig in contigs:
        SeqIO.write(genome[contig], handle, "fasta")
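The masking step replaces each SNP position with the IUPAC code covering the retained alleles (both alleles by default, alternate-only inside -r regions), presumably to avoid biasing alignment toward the reference strain. A cut-down restatement of the lookup, restricted to the two-allele cases:

codes = {'R': {'A', 'G'}, 'Y': {'C', 'T'}, 'M': {'A', 'C'},
         'K': {'G', 'T'}, 'W': {'A', 'T'}, 'S': {'C', 'G'}}

def ambiguity_base(bases):
    # Mirror of get_ambiguity_base above, for biallelic SNPs only.
    bases = set(bases)
    for code, possible in codes.items():
        if bases == possible:
            return code
    raise ValueError("unrecognised bases: {}".format(sorted(bases)))

print(ambiguity_base(['A', 'G']))  # a reference A with ALT G becomes R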
# merge_recombined_haplotypes.py
from __future__ import print_function
import re
import csv
import argparse

__min_coverage = 5
__override_ratio = 3


def parse_region(region):
    """
    Parse a region specification.

    :param region: String region specification
    :raises argparse.ArgumentTypeError: raises an error if format not recognised
    :returns contig: String contig / chromosome name
    :returns start: integer start position (0-based)
    :returns end: integer end position (1-based)

    >>> parseRegion("chr1:1000-2000")
    ("chr1", 1000, 2000)
    """
    region = ''.join(region.split())  # remove whitespace
    region = re.split(':|-', region)
    start = 0
    end = None
    if len(region) < 1:
        raise argparse.ArgumentTypeError(
            "Region must specify a reference name")
    elif len(region) > 3:
        raise argparse.ArgumentTypeError("Region format not recognized")
    else:
        contig = region[0]
        try:
            start = int(re.sub(",|\.", "", region[1]))
        except IndexError:
            pass
        try:
            end = int(re.sub(",|\.", "", region[2]))
        except IndexError:
            pass
        finally:
            return contig, start, end


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('-i', '--input', required=True,
                        help='filename of input three-way haplotypes tsv')
    parser.add_argument('-o', '--output', required=True,
                        help='filename of output haplotypes tsv')
    parser.add_argument('-s', '--summary', required=True,
                        help='filename of summary generated by build_supplementary_db')
    parser.add_argument('-a', '--alt-reference', required=True,
                        choices=['alt1', 'alt2'],
                        help='Name of the allele for which alternate is recombined with the reference')
    parser.add_argument('-r', '--region', type=parse_region, nargs="+",
                        metavar="CHR:START-END",
                        help="Name and range of contigs where reference is replaced by alternate reference")
    parser.add_argument('-b', '--boundary', type=int, default=50000, metavar="INT",
                        help="Width of region boundary where both alleles are included")
    return parser.parse_args()


args = parse_args()
signal_alt_ref = "signal_" + args.alt_reference
base_alt_ref = "base_" + args.alt_reference
signal_alt = "signal_" + "alt1" if args.alt_reference == "alt2" else "alt2"
base_alt = "base_" + "alt1" if args.alt_reference == "alt2" else "alt2"

regions = dict()  # format: { chr : [start, end, alternate_only] }
if args.region:
    for region in args.region:
        try:
            regions[region[0]]
        except KeyError:
            regions[region[0]] = []
        region_start, region_end = region[1], region[2]
        if region_start > 0:
            region_start = region_start + args.boundary // 2
        if region_end is not None:
            region_end = region_end - args.boundary // 2
        regions[region[0]].append([region_start, region_end, True])
        if args.boundary > 1:
            if region[1] > 0:
                regions[region[0]].append(
                    [region[1] - args.boundary // 2, region[1] + args.boundary // 2, False])
            if region[2] is not None:
                regions[region[0]].append(
                    [region[2] - args.boundary // 2, region[2] + args.boundary // 2, False])

with open(args.input, 'r') as haplotype_handle, open(args.summary, 'r') as summary_handle, open(args.output, 'w') as output:
    print("read\tgenotype\tsignal_ref\tsignal_alt\tsignal_coverage\tbase_ref\tbase_alt\tbase_coverage", file=output)
    haplotype_reader = csv.DictReader(haplotype_handle, delimiter="\t")
    summary_reader = csv.DictReader(summary_handle, delimiter="\t",
                                    fieldnames=['read', 'chr', 'start', 'end', 'qual'])
    summary = next(summary_reader)
    while True:
        try:
            haplotype = next(haplotype_reader)
            while summary['read'] < haplotype['read']:
                summary = next(summary_reader)
            while summary['read'] > haplotype['read']:
                haplotype = next(haplotype_reader)
            # now they should be the same read
            for key in haplotype.keys():
                if (key.startswith("signal") or key.startswith("base")) and not key.endswith("coverage"):
                    haplotype[key] = float(haplotype[key])
        except StopIteration:
            break
        use_ref = True
        use_alt_ref = False
        if summary['chr'] in regions:
            start = int(summary['start'])
            end = int(summary['end'])
            for region in regions[summary['chr']]:
                if region[0] <= start and region[1] >= end:
                    use_alt_ref = True
                    if region[2]:
                        # alternate only
                        use_ref = False
        if use_ref and use_alt_ref:
            if haplotype['signal_ref'] + haplotype['base_ref'] > haplotype[signal_alt_ref] + haplotype[base_alt_ref]:
                use_alt_ref = False
            elif haplotype['signal_ref'] + haplotype['base_ref'] < haplotype[signal_alt_ref] + haplotype[base_alt_ref]:
                use_ref = False
            else:
                # pick the one with the lower range - more reliable, so
                # probably more accurate
                if abs(haplotype['signal_ref'] - haplotype['base_ref']) < abs(haplotype[signal_alt_ref] - haplotype[base_alt_ref]):
                    use_alt_ref = False
                else:
                    use_ref = False
        if use_ref:
            signal_ref = 'signal_ref'
            base_ref = 'base_ref'
        else:
            signal_ref = signal_alt_ref
            base_ref = base_alt_ref
            # remove excess counts - how?
        signal_coverage = haplotype[signal_ref] + haplotype[signal_alt]
        base_coverage = haplotype[base_ref] + haplotype[base_alt]
        # determine genotype
        signal_call = "fail"
        if signal_coverage > 0:
            signal_ratio = haplotype[signal_ref] / signal_coverage
            if signal_ratio > 0.5:
                signal_call = "ref"
            elif signal_ratio < 0.5:
                signal_call = "alt"
        base_call = "fail"
        if base_coverage > 0:
            base_ratio = haplotype[base_ref] / base_coverage
            if base_ratio > 0.5:
                base_call = "ref"
            elif base_ratio < 0.5:
                base_call = "alt"
        if base_coverage < __min_coverage and signal_coverage < __min_coverage:
            genotype = "fail"
        elif base_call == signal_call:
            genotype = base_call
        elif signal_call == "fail":
            genotype = base_call
        elif base_call == "fail":
            genotype = signal_call
        elif base_coverage > __override_ratio * signal_coverage and base_call != "fail":
            genotype = base_call
        elif signal_coverage > __override_ratio * base_coverage and signal_call != "fail":
            genotype = signal_call
        elif base_ratio - 0.5 > __override_ratio * (signal_ratio - 0.5) and base_ratio > 0.5:
            genotype = base_call
        elif signal_ratio - 0.5 > __override_ratio * (base_ratio - 0.5) and signal_ratio > 0.5:
            genotype = signal_call
        else:
            genotype = "fail"
        print("{read}\t{genotype}\t{signal_ref}\t{signal_alt}\t{signal_coverage}\t{base_ref}\t{base_alt}\t{base_coverage}".format(
            read=haplotype['read'],
            genotype=genotype,
            signal_ref=haplotype[signal_ref],
            base_ref=haplotype[base_ref],
            signal_alt=haplotype[signal_alt],
            base_alt=haplotype[base_alt],
            signal_coverage=haplotype['signal_coverage'],
            base_coverage=haplotype['base_coverage']
        ), file=output)
shell: "wget -q {params.url} -O {output}"
# the same rule body is repeated for both download rules here

shell: "gunzip {input}"
# the same rule body is repeated for three decompression rules

shell: "bwa index {input} &> {params.log}"

shell: "samtools index {input}"

shell: "wget -q {params.url} -O {output}"

shell: "python mask_genome_variants.py -i {input.fasta} -o {output} -v {input.vcf}"

shell: "Rscript ensembl_gtf_to_tsv.R {input}"

shell: "python tsv_to_region.py {input} 10000 > {output}"

shell: "python count_cpg.py {input} > {output}"

shell: "wget -q -O {output} {params.url}"
# the same rule body is repeated for each of the six download rules here

shell: "md5sum -c {input.md5} && touch {output}"

shell:
    "mkdir -p {output} && "
    "tar xzf {input.tar} -C {output}"

shell:
    "find {input} -mindepth 2 -type d -links 2 -exec mv -t {input} {{}} \\+ && "
    "find {input} -name '*.basecalled.*.tar.gz' -exec rm {{}} \\+ && "
    "for i in $(find {input} -name '*.tar.gz'); do "
    "    tar -xzf $i -C {input} && rm $i; "
    "done && "
    "touch {output}"

shell:
    "mkdir -p {output.outdir} && "
    "for i in {input.indir}; do mv $i {output.outdir}; done && "
    "touch {output.outdir_clean}"

shell: "read_fast5_basecaller.py -c r94_450bps_linear.cfg -o fastq -i {input.indir} -r -s {params.outdir} -t {threads} -q 0"

shell: "read_fast5_basecaller.py -c r941_450bps_linear_prom.cfg -o fastq -i {input.indir} -r -s {params.outdir} -t {threads} -q 0"

shell: "find {input} -name '*.fastq' -exec cat {{}} \\+ > {output}"

shell:
    "bwa mem -t {threads} {input.genome} {input.reads} | "
    "samtools sort -T {output}.samtools.tmp -@ {threads} -o {output} &> {log}"

shell: "nanopolish index -d {input.fast5} {input.reads} &> {params.log}"

shell: "nanopolish call-methylation -t {threads} -r {input.reads} -b {input.bam} -g {input.genome} > {output}"

shell:
    "nanopolish phase-reads -t {threads} -r {input.reads} -b {input.bam} -g {input.genome} {input.vcf} | "
    "samtools sort -T {output}.samtools.tmp -@ {threads} -o {output}"

shell: "Rscript render_notebook.R {input.notebook} &> {params.log}"
# the same rule body is repeated for each of the eleven notebook-rendering rules

script: "prepare_visualisation.R"
# prepare_visualisation.R
suppressMessages(suppressPackageStartupMessages(library(GenomicRanges)))
suppressMessages(suppressPackageStartupMessages(library(tidyverse)))
suppressMessages(suppressPackageStartupMessages(library(data.table)))

gtf <- rtracklayer::import("../genome_data/Mus_musculus.GRCm38_90.chr.gtf")
knownGene <- gtf %>%
  as_data_frame() %>%
  rename(chr=seqnames) %>%
  select(chr, start, end, strand, type, gene_id, gene_name, exon_number) %>%
  filter(type %in% c("exon", "five_prime_utr", "three_prime_utr")) %>%
  filter(end > start) %>%
  mutate(type=fct_recode(type, utr="three_prime_utr", utr="five_prime_utr")) %>%
  unique() %>%
  group_by(gene_name, start) %>%
  filter(end == min(end)) %>%
  ungroup() %>%
  arrange(gene_name, start, end) %>%
  makeGRangesFromDataFrame(keep.extra.columns=TRUE)
save(knownGene, file="../RData/knownGene.RData")

rna_seq <- read_tsv("../rna_seq/all_runs_with_reverse_coverage.tsv",
                    col_names=c("chr", "pos",
                                "fwd1_mat", "fwd1_pat", "fwd2_mat", "fwd2_pat",
                                "fwd3_mat", "fwd3_pat", "fwd4_mat", "fwd4_pat",
                                "rev1_pat", "rev1_mat", "rev2_pat", "rev2_mat",
                                "rev3_pat", "rev3_mat", "rev4_pat", "rev4_mat"),
                    col_types = 'ciiiiiiiiiiiiiiiii') %>%
  mutate(fwd_mat=(fwd1_mat+fwd2_mat+fwd3_mat+fwd4_mat)/4,
         fwd_pat=(fwd1_pat+fwd2_pat+fwd3_pat+fwd4_pat)/4,
         rev_pat=(rev1_pat+rev2_pat+rev3_pat+rev4_pat)/4,
         rev_mat=(rev1_mat+rev2_mat+rev3_mat+rev4_mat)/4) %>%
  select(chr, pos, fwd_mat, fwd_pat, rev_mat, rev_pat)
save(rna_seq, file="../RData/rna_seq_with_reverse.RData")
# read_haplotypes.R
suppressMessages(library(tidyverse))

min_coverage <- 5

args = commandArgs(trailingOnly=TRUE)
infile <- args[1]
outfile <- args[2]

haplotype_df <- read_tsv(infile, col_types='ccddiddi') %>%
  mutate(signal_coverage = signal_ref + signal_alt,
         signal_ratio = signal_ref / signal_coverage,
         base_coverage = base_ref + base_alt,
         base_ratio=base_ref / base_coverage) %>%
  mutate(info=case_when(.$signal_coverage < min_coverage & .$base_coverage < min_coverage ~ "low_coverage",
                        .$signal_ratio < 0.5 & .$base_ratio < 0.5 ~ "pass",
                        .$signal_ratio > 0.5 & .$base_ratio > 0.5 ~ "pass",
                        .$base_coverage > 3*.$signal_coverage ~ "signal_low_coverage",
                        .$signal_coverage > 3*.$base_coverage ~ "base_low_coverage",
                        abs(0.5-.$signal_ratio)*3 < abs(0.5-.$base_ratio) ~ "signal_uncertain",
                        abs(0.5-.$base_ratio)*3 < abs(0.5-.$signal_ratio) ~ "base_uncertain",
                        TRUE ~ "fail"),
         base_genotype=ifelse(base_ratio>0.5, "ref", "alt"),
         signal_genotype=ifelse(signal_ratio>0.5, "ref", "alt"),
         genotype="fail",
         genotype=ifelse(startsWith(info, "signal") | info == "pass", base_genotype, genotype),
         genotype=ifelse(startsWith(info, "base"), signal_genotype, genotype)) %>%
  rename(read_name=read)

save(haplotype_df, file=outfile)
# read_summary.R
suppressMessages(library(tidyverse))

args = commandArgs(trailingOnly=TRUE)
infile <- args[1]
outfile <- args[2]

summary_df <- read_tsv(infile,
                       col_names=c("read_name", "chr", "start", "end", "qual"),
                       col_types='cciid',
                       skip=1) %>%
  unique()
save(summary_df, file=outfile)
# render_notebook.R
rmd_files <- commandArgs(trailingOnly = TRUE)
purrr::map(rmd_files,
           ~ rmarkdown::render(., output_format='html_document',
                               output_file=sub("Rmd$", "html", .)))
shell: "wget -q -O {output} {params.url}"
# the same rule body is repeated for each of the eight download rules here

shell: "md5sum -c {input.md5} && touch {output}"

shell: "trim_galore --phred33 --fastqc --gzip --paired -o {params.outdir} {input.r1} {input.r2} &> {params.log}"

shell: "cp {input} {output}"

shell:
    "cd $(dirname {input.genome}) && "
    "SNPsplit_genome_preparation --vcf_file {input.vcf} --strain CAST_EiJ --reference_genome ./ &> {params.log} && "
    "cd ../scripts/"

shell:
    "cat {input.fasta} > {output.fasta} && "
    "mv {input.snp} {output.snp} && "
    "rm -r {params.tempdir}"

shell: "hisat2-build {input} {params.base} &> {params.log}"

shell:
    "hisat2 -p {threads} --no-softclip -x {params.base} -1 {input.r1} -2 {input.r2} | "
    "samtools sort -n -@ {threads} -T {output}.samtools.tmp -o {output} &> {log}"

shell: "SNPsplit --snp_file {input.snp} --paired --no_sort {input.bam} -o ../rna_seq/ &> {params.log}"

shell: "samtools sort -T {input}.samtools.tmp -@ {threads} -o {output} {input} &> {log}"

shell: "samtools depth {input.bam} > {output}"

shell: "wget -q -O {output} {params.url}"
# the same rule body is repeated for each of the four download rules here

shell: "cp {input} {output}"

shell:
    "cd $(dirname {input}) && "
    "bismark_genome_preparation --bowtie2 ./ &> {params.log} && "
    "cd ../scripts/"

shell:
    "bismark --gzip --bam --bowtie2 -p {threads} -B {params.basename} "
    "-o ../bisulfite/ {params.genome} -1 {input.r1} -2 {input.r2} &> {params.log}"

shell: "samtools sort -n -@ {threads} -T {input}.samtools.tmp -o {output} {input} &> {log}"

shell: "SNPsplit --paired --bisulfite --snp_file {input.snp} --no_sort {input.bam} -o ../bisulfite/ &> {params.log}"

shell:
    "bismark_methylation_extractor --ignore 13 --paired-end --multicore {threads} "
    "--comprehensive --merge_non_CpG --report --output $(dirname {output.txt}) --gzip "
    "--bedGraph {input} &> {params.log}"

shell: "python summarize_bisulfite_methylation.py {output} {input}"
# the same rule body is repeated for three summary rules
# split_methylation_by_alignment.py
from __future__ import print_function
import csv
import sys
import pickle
import copy


def get_read_id(read, contig, position, suppdb):
    if read in suppdb:
        for i, alignment in enumerate(suppdb[read]):
            chr, start, end = alignment
            if contig == chr and position >= start and position <= end:
                return "{}_{}".format(read, i)
        raise Exception("Position {}:{} not internal to any alignment of read {} ({})".format(
            contig, position, read, suppdb[read]))
    else:
        return read


bam_fn = sys.argv[1]
if len(sys.argv) > 2:
    meth_fn = sys.argv[2]
    handle = open(meth_fn, 'r')
else:
    print("Reading from stdin.", file=sys.stderr)
    handle = sys.stdin

with open("{}.suppdb".format(bam_fn), 'rb') as db_handle:
    suppdb = pickle.load(db_handle)

reader = csv.DictReader(handle, delimiter="\t")
fieldnames = reader.fieldnames
if "strand" in reader.fieldnames:
    fieldnames = copy.copy(fieldnames)
    del fieldnames[fieldnames.index("strand")]
    del_strand = True
else:
    del_strand = False
writer = csv.DictWriter(sys.stdout, delimiter="\t", fieldnames=fieldnames)
writer.writeheader()
for row in reader:
    row["read_name"] = get_read_id(
        row["read_name"], row["chromosome"], int(row["start"]), suppdb)
    if del_strand:
        del row["strand"]
    try:
        writer.writerow(row)
    except Exception:
        print(row, file=sys.stderr)
        raise

handle.close()
# split_methylation_by_haplotype.py
from __future__ import print_function
import sys
import csv
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('-m', '--methylation', required=True,
                    help="Methylation data, split by supplementaries, sorted by genomic location")
parser.add_argument('-p', '--phase', required=True,
                    help="Phased reads (haplotyping data)")
args = parser.parse_args()

meth_fn = args.methylation
haplotype_fn = args.phase

haplotypes = dict()
genotypes = set(["fail"])
with open(haplotype_fn, 'r') as haplotype_handle:
    reader = csv.DictReader(haplotype_handle, delimiter="\t")
    for row in reader:
        haplotypes[row["read"]] = row["genotype"]
        genotypes.add(row["genotype"])


def get_genotype(read_name):
    try:
        return haplotypes[read_name]
    except KeyError:
        return "fail"


with open(meth_fn, 'r') as meth_handle:
    reader = csv.DictReader(meth_handle, delimiter="\t")
    out_handles = list()
    writers = dict()
    for genotype in genotypes:
        handle = open("{}.{}.tsv".format(meth_fn, genotype), 'w')
        out_handles.append(handle)
        writers[genotype] = csv.DictWriter(
            handle, fieldnames=reader.fieldnames, delimiter="\t")
        writers[genotype].writeheader()
    for row in reader:
        writers[get_genotype(row["read_name"])].writerow(row)
    for handle in out_handles:
        handle.close()
# summarize_bisulfite_methylation.py
import sys
import csv
import gzip
from functools import partial

__true = "+"
__false = "-"


def add(summary, chr, pos, meth):
    try:
        summary[chr]
    except KeyError:
        summary[chr] = dict()
    try:
        summary[chr][pos]
    except KeyError:
        summary[chr][pos] = [0., 0.]
    if meth:
        summary[chr][pos][0] += 1
    summary[chr][pos][1] += 1
    return summary


def summarize(in_handle, summary=None):
    if summary is None:
        summary = dict()
    header = next(in_handle)
    reader = csv.reader(in_handle, delimiter="\t")
    for line in reader:
        meth = line[1] == __true
        chr = line[2]
        pos = int(line[3])
        summary = add(summary, chr, pos, meth)
    return summary


def write_out_summary(output_fn, summary):
    with open(output_fn, 'w') as out_handle:
        for chr, positions in summary.items():
            for pos, data in positions.items():
                meth, total = tuple(data)
                out_handle.write("{}\t{}\t{:.3f}\t{}\t{}\n".format(
                    chr, pos, meth / total, meth, total))


output_fn = sys.argv[1]
input_fns = sys.argv[2:]
if len(sys.argv) == 3:
    input_fns = [sys.argv[2]]
else:
    input_fns = sys.argv[2:]

summary = None
for input_fn in input_fns:
    if input_fn.endswith("gz"):
        if sys.version_info >= (3,):
            open_fun = partial(gzip.open, mode='rt')
        else:
            open_fun = gzip.open
    else:
        open_fun = open
    with open_fun(input_fn) as handle:
        summary = summarize(handle, summary)

write_out_summary(output_fn, summary)
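summarize() takes any open handle plus an optional running summary, so it can be checked on an in-memory example. Continuing from the script above, the hypothetical input below mimics the methylation-extractor format it consumes (column 2 is "+" for methylated, column 3 the chromosome, column 4 the position):

import io

toy = io.StringIO(
    "Bismark methylation extractor output\n"  # header line is skipped
    "read1\t+\tchr1\t100\tZ\n"
    "read2\t-\tchr1\t100\tz\n"
)
print(summarize(toy))  # {'chr1': {100: [1.0, 2.0]}} -> 50% methylated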
# tsv_to_region.py
import sys
import csv

tsv_fn = sys.argv[1]
if len(sys.argv) > 2:
    overhang = int(sys.argv[2])
else:
    overhang = 0

output = []
with open(tsv_fn, 'r') as tsv_handle:
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    for row in reader:
        output.append(
            "{}:{}-{}".format(row["chr"],
                              int(row["start"]) - overhang,
                              int(row["end"]) + overhang))
print(" ".join(output))