SeeNV: A Structural Variant Visualization and Analysis Tool for Oxford Nanopore Data
Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
SeeNV is still being developed. It can be download and used but is by no means exaustively tested.
Find a bug? Create an issue !
Have a feature idea? Create an issue !
SeeNV provides comprehensive yet easy to digest visualizations for each call in a sample and depicts relevant statistics comparing a sample to a cohort of other samples. It is known that the accuracy and reliability of CNV calls increases when using multiple callers and parameter sets, for this reason SeeNV also provides a way to visualize multiple callers or bin sizes simultaneously — a feature not known to exist in other tools. Combined with the tool PlotCritic , we found that a clinician can accurately filter through roughly 200 calls in 20 minutes, or just 6 seconds per call on average. SeeNV is a command line tool that can be run as it’s own or added into your variant calling pipelines.
Installation
SeeNV is currently only usable on Linux based systems.
SeeNV requires you have conda installed.
Download this repo, move into it and run the installation script!
git clone https://github.com/MSBradshaw/SeeNV.git
cd SeeNV
source install.sh
The install script will create a conda envrionment called
seenv
with all the necessary python dependancies for SeeNV. It will also download a the following external non-python tools:
All these dependacies will be placed in the bin direcoty of the cnviz conda environment.
As long as the conda environemnt is activated the
seenv
command can now be used anywhere.
Usage
SeeNV requires it's conda environment in order to work, start the conda environment:
conda activate seenv
Build a Reference DB
In order to generate plots, a reference panel database is required. You can either create your own or use the one included with this repository.
seenv \
-b \
-i ExampleData/ref_panel_sample_list.tsv \
-s ExampleData/SureSelect_All_Exon_V2.bed \
-c ExampleData/17_bi_300_samples_calls.txt \
-o ReferenceDB \
-t 32
Generate plots
seenv \
-p \
-i ExampleData/proband_sample_list.tsv \
-s ExampleData/SureSelect_All_Exon_V2.bed \
-c ExampleData/17_bi_300_samples_calls.txt \
-a ExampleData/gnomad_v2.1_sv.sites.bed.gz \
-t 64 \
-v ExampleData/vardb.sorted.bed.gz \
-m ExampleData/genomicRepeats.sorted.bed.gz \
-r ReferenceDB/ \
-o TestPlots
Parameter explaination
One of the following options is required
--buildref, -b flag to use function to build a reference panel database
--plotsamples, -p flag to plot samples
Parameters to accompany --plotsamples, -p:
-i INPUT_SAMPLES (required) samples list
-s SITES (required) genomic sites bed file
-c ALL_CALLS (required) calls file. Each line should be a path to a set of calls in bed format
-o OUTPUT (required) output directory, where to save the plots
-r REF_DB (required) path to reference db created by the --buildref function of SeeNV
-a GNOMAD (required) the gnomad sv sites file with allele frequency information
-t THREADS (optional) number of threads to use, default 1 (you really want to use more than 1)
-v varDB (required) path to a GZipped bed file for the varDB common variants with an accompanying tabix indexed
-m RepeatMasker (required) path to a GZipped bed file for the RepeatMasker elements with an accompanying tabix indexed
Parameters to accompany --buildref, -b
-i INPUT_SAMPLES (required) samples list
-s SITES (required) genomic sites bed file
-c ALL_CALLS (required) calls file. Each line should be a path to a set of calls in bed format
-o OUTPUT (required) output directory, reference panel database
-t THREADS (optional) number of threads to use, default 1 (you really want to use more than 1)
-i INPUT_SAMPLES
Same format for samples and reference panel. See
Example/panel.samples
or
Example/cohort.samples
TSV file with the following columns in this order:
Patient Id: Unique identifier for each patient
Sample Id: Unique identifier for each sample (can be the same as the sample's Patient Id)
Sex: Chromosomal sex of patient (e.g. XY or XX). If unknown just input ZZ.
BAM: path to the sample's .bam file
BAI: path to the sample's .bai file
-s SITES
BED
formatted file with sites of interest within the genome. We use the regions our probes target for whole exome sequencing. See
Example/probes.bed
-c ALL_CALLS
File that contains a list of files containing calls. See
Example/all_calls.txt
.
Each file contains the path to a set of calls in a bed file with the following information seporated by tabs:
Chromosome: the Chromosome number/name (if listing chromsome 1 input 1 not chr1)
Start: base number at which the calls starts
End: base number at which the calls ends
Call type: type of call made (Duplication, Deletion ect)
Sample Id: Sample Id the call pertains to, should match Sample Id's found in
-i INPUT_SAMPLES
-o OUTPUT
When using the
-b
or
--buildref
flag
-o
will is the path to where you want the reference panel database (a dirrectory) saved.
When using the
-p
or
--plotsamples
flag
-o
is the path for where the plots will be saved.
-r REF_DB
Path to a reference panel database (the output created when using the
-b
flag)
-a GNOMAD with Allele Frequency data
path to the gnomAD SV sites file in .beg.gz format with an accompanying .beg.gz.tbi file in the same dirrectory. This can be found in
ExampleData/gnomad_v2.1_sv.sites.bed.gz
or can the
bed.gz
and and
.tbi
be downloaded from
here
and
here
-m RepeatMasker
path to a GZipped bed file for the varDB common variants with an accompanying tabix indexed. This is can be found in
ExampleData/genomicRepeats.sorted.bed.gz
.
-v varDB
path to a GZipped bed file for the varDB common variants with an accompanying tabix indexed. This can be found in
ExampleData/vardb.sorted.bed.gz
-t THREADS
The number of cores (not threads) to be used. Default is 1, but you really should use many more. For reference, using 32 cores, the provided reference panel database, and processing/plotting 6 samples with 300 calls takes us ~2 hours.
Code Snippets
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 | import os import pysam import sys import re f = sys.argv[2] if '.tbi' in f: f = '.'.join(f.split('.')[:-1]) if len(sys.argv) >= 4: sample = sys.argv[3] else: sample = f.split('/')[-1].split('.')[0] res = [] tabixfile = pysam.TabixFile(f) for i,line in enumerate(open(sys.argv[1],'r')): row = line.strip().split('\t') try: a = list(tabixfile.fetch(row[0],int(row[1]),int(row[2]))) except ValueError: print('Error') continue if len(a) > 0: alt_counts = sum( int(x.strip().split('\t')[4]) for x in a) alt_counts_norm = alt_counts / float(len(a)) probe = row[3] res.append(alt_counts_norm) else: res.append(-1) print('\t'.join([sample] + [str(x) for x in res])) |
47 48 49 50 51 52 | shell: """ mkdir -p workpanel/Mosdepth mosdepth workpanel/Mosdepth/{wildcards.sample} {input} tabix -p bed workpanel/Mosdepth/{wildcards.sample}.per-base.bed.gz """ |
61 62 63 64 | shell: """ samtools view -c -F 260 {input} > {output} """ |
71 72 73 74 | shell: """ cat {input} > workpanel/TotalReads/total_read.txt """ |
82 83 84 85 86 87 88 | shell: """ mkdir -p workpanel/Probes/ bedtools sort -i {input.probes} > workpanel/Probes/probes.sorted.bed bgzip workpanel/Probes/probes.sorted.bed tabix workpanel/Probes/probes.sorted.bed.gz -p bed """ |
98 99 100 101 102 103 104 105 106 | shell: """ get_rpm_rates.py \ -m {input.per_base_bed_gz} \ --regions_file {input.probes_gz} \ --num_reads {input.total_reads} \ | bgzip -c > workpanel/RPM/{wildcards.sample}.probe.rpm_rate.bed.gz tabix -p bed workpanel/RPM/{wildcards.sample}.probe.rpm_rate.bed.gz """ |
120 121 122 123 124 125 126 127 128 | shell: """ mkdir -p workpanel/ProbeCoverage get_regions_zscores.py -r workpanel/ProbeCoverage -s {input.sample} | bgzip -c > {output.bedgz} cp {output.bedgz} workpanel/ProbeCoverage/{wildcards.sample}.probe.cover.mean.stdev.copy.bed.gz gunzip workpanel/ProbeCoverage/{wildcards.sample}.probe.cover.mean.stdev.copy.bed.gz mv workpanel/ProbeCoverage/{wildcards.sample}.probe.cover.mean.stdev.copy.bed {output.bed} tabix -p bed {output.bedgz} """ |
139 140 141 142 143 144 145 146 147 | shell: """ mkdir -p workpanel/ProbeCoverage get_coverage_zscores.py \ -r {input.ref_sample_rpm} \ -s {input.ref_sample_coverage} \ | bgzip -c > {output.bedgz} tabix -p bed {output.bedgz} """ |
159 160 161 162 163 164 165 166 167 168 169 170 171 172 | shell: """ mkdir -p {output}/RPM mkdir -p {output}/Calls mkdir -p {output}/AdjZscore cp {input.rpm_gz} {output}/RPM cp {input.rpm_tbi} {output}/RPM cp {input.z_bedgz} {output}/AdjZscore cp {input.z_tbi} {output}/AdjZscore while read p; do cp "$p" {output}/Calls done <{input.all_calls} cp {input.sample_list} {output} """ |
2 3 4 5 6 7 8 9 10 11 12 13 14 15 | import pandas as pd import sys """ 1 - all_samples.aggregate_probe_allele_counts.txt 2 - probes file """ al = pd.read_csv(sys.argv[1],sep='\t',header=None) bed = pd.read_csv(sys.argv[2],sep='\t',header=None) samples = list( str(x) for x in al.iloc[:,0]) print('\t'.join(['chr','start','end','exon'] + samples)) for i in range(bed.shape[0]): print('\t'.join(list( str(x) for x in bed.iloc[i,0:4]) + [ str(x) for x in al.iloc[:,i+1]]) ) |
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 | import sys """ arguments: 1. a bedfile 2. file with a single line formated like: ['WES100', 'WES52' ...] 3. the name of an output file """ ids = None for line in open(sys.argv[2],'r'): ids = line.replace(' ','').replace('[','').replace(']','').split(',') break with open(sys.argv[3],'w') as outfile: for line in open(sys.argv[1],'r'): if sum(x in line for x in ids) > 0: outfile.write(line) |
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 | import utils import gzip import argparse from pysam import TabixFile import numpy as np def get_args(): parser = argparse.ArgumentParser() parser.add_argument('-m', dest='mosdepth_file', required=True, help='Path file containing target bam coverage') parser.add_argument('--exons_file', dest='exons_file', required=True, help='Exon coordinate bed file') args = parser.parse_args() return args def main(): args = get_args() regions = [] with gzip.open(args.exons_file,'rt') as f: for l in f: A = l.rstrip().split('\t') region = utils.Interval(chrom=A[0], start=int(A[1]), end=int(A[2]), data=A[3:] if len(A) > 3 else None) if region.start == region.end: continue regions.append(utils.Interval(chrom=A[0], start=int(A[1]), end=int(A[2]), data=A[3:] if len(A) > 3 else None)) rates = [] tbx = TabixFile(args.mosdepth_file) for region in regions: I = utils.get_intervals_in_region(region, args.mosdepth_file, tbx=tbx) total_depth = 0 for i in I: total_depth += int(i.data[0]) rate = float(total_depth)/float(region.end - region.start) rates.append(rate) #mean = np.mean(rates) #stdev = np.std(rates) for i in range(len(regions)): region = regions[i] rate = rates[i] print('\t'.join([str(x) for x in [region.chrom, region.start, region.end, region.data[0], rate]])) if __name__ == '__main__': main() |
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 | import utils import gzip import argparse from pysam import TabixFile import numpy as np def get_args(): parser = argparse.ArgumentParser() parser.add_argument('-r', dest='rate_file', required=True, help='Path file containing coverage rate') parser.add_argument('-s', dest='stats_file', required=True, help='Path file containing region stats') args = parser.parse_args() return args def main(): args = get_args() stats = [] with gzip.open(args.stats_file,'rt') as f: for l in f: A = l.rstrip().split('\t') region = utils.Interval(chrom=A[0], start=int(A[1]), end=int(A[2]), data={'gene':A[3], 'mean':float(A[4]), 'stdev':float(A[5])}) if region.start == region.end: continue stats.append(region) regions = [] Z = [] with gzip.open(args.rate_file,'rt') as f: i = 0 for l in f: A = l.rstrip().split('\t') region = utils.Interval(chrom=A[0], start=int(A[1]), end=int(A[2]), data={'gene':A[3], 'rate':float(A[4])}) if region.start == region.end: continue assert region.chrom == stats[i].chrom assert region.start == stats[i].start assert region.end == stats[i].end z = 0 if stats[i].data['stdev'] > 0: z = (region.data['rate'] - stats[i].data['mean']) / stats[i].data['stdev'] region.data['z'] = z Z.append(z) regions.append(region) i+=1 z_mean = np.mean(Z) z_stdev = np.std(Z) for region in regions: adj_z = 0 if z_stdev > 0 : adj_z = (region.data['z'] - z_mean) / z_stdev print('\t'.join([str(x) for x in [region.chrom, region.start, region.end, region.data['gene'], region.data['rate'], region.data['z'], adj_z]])) if __name__ == '__main__': main() |
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 | import argparse import utils import pandas as pd import numpy as np import pysam import statistics from random import randrange import sys def get_args(): parser = argparse.ArgumentParser() parser.add_argument('--all_calls', dest='all_calls', help='file with path to BGZIPED and TABIXED Savvy calls set on each line') parser.add_argument('--exons', dest='exons', help='BGZIPED and TABIXED bed file with location of all exons') parser.add_argument('--sample', dest='sample', help='Sample name') parser.add_argument('--region', dest='region', help='Target region') parser.add_argument('--window', dest='window', type=int, default=50000, help='Window (default 50000)') parser.add_argument("--depth", dest="depth", help="file with depth/rpm data") args = parser.parse_args() return args def get_all_savvy_calls_in_target(savvy_bed_file, target_interval, ignore=None): calls = utils.get_intervals_in_region(target_interval, savvy_bed_file,ignore=ignore) sample_calls = [] for call in calls: curr_sample = call.data[1].split('.')[0] sample_calls.append(call) return(sample_calls) def main(): args = get_args() call_colors = ['#37dc94', '#8931EF', '#F2CA19', '#FF00BD', '#0057E9', '#87E911', '#E11845'] chrom=args.region.split(':')[0] target_region = utils.Interval( chrom=chrom, start=max(0, int(args.region.split(':')[1].split('-')[0]) - args.window), end=int(args.region.split(':')[1].split('-')[1]) + args.window, data=None) exons = utils.get_intervals_in_region(target_region, args.exons) if args.exons: probes = [] for line in open(args.depth,'r'): row = line.strip().split('\t') if str(row[0]) != str(target_region.chrom): continue probes.append(utils.Interval(chrom=row[0],start=int(row[1]),end=int(row[2]),data=None)) legend_elements = [] colors = call_colors all_max_ys = [] for i,line in enumerate(open(args.all_calls,'r')): print(line) f = line.strip() try: density_of_calls_at_each_probe = [ len(get_all_savvy_calls_in_target(f,x,ignore=args.sample)) for x in probes] except ValueError: print('ValueError in file ' + f) density_of_calls_at_each_probe = [0 for x in probes] x_vals = [ x.start for x in probes] indexes = [j for j,x in enumerate(density_of_calls_at_each_probe) if x > 0] x_vals = [ x_vals[j] for j in indexes] y_vals = [ density_of_calls_at_each_probe[i] for i in indexes] if len(y_vals) > 0: all_max_ys.append(max(y_vals)) if len(all_max_ys) > 0: print(max(all_max_ys)) else: print(0) main() |
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 | import utils import gzip import argparse from pysam import TabixFile import numpy as np import glob def get_args(): parser = argparse.ArgumentParser() parser.add_argument('-r', dest='rate_dir', required=True, help='Path to directory containing rate files') parser.add_argument('-s', dest='single_sample', required=False, default=None, help='Path to single sample file to be compared to everything in rate dir') args = parser.parse_args() return args def main(): args = get_args() regions = [] first_file = True file_i = 0 all_files = list(glob.glob(args.rate_dir + '*.probe.*.bed.gz')) if args.single_sample is not None: all_files.append(args.single_sample) for rate_file in all_files: with gzip.open(rate_file,'rt') as f: line_i = 0 for l in f: A = l.rstrip().split('\t') region = utils.Interval(chrom=A[0], start=int(A[1]), end=int(A[2]), data=A[3:] if len(A) > 3 else None) if region.start == region.end: continue interval = utils.Interval(chrom=A[0], start=int(A[1]), end=int(A[2]), data=A[3:] if len(A) > 3 else None) if first_file: regions.append([interval]) else: regions[line_i].append(interval) line_i += 1 first_file = False file_i += 1 for region in regions: depths = [] for interval in region: depths.append(float(interval.data[1])) print('\t'.join([str(x) for x in [region[0].chrom, region[0].start, region[0].end, region[0].data[0], np.mean(depths), np.std(depths)]])) if __name__ == '__main__': main() |
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 | import utils import gzip import argparse from pysam import TabixFile import numpy as np def get_args(): parser = argparse.ArgumentParser() parser.add_argument('-m', dest='mosdepth_file', required=True, help='Path file containing target bam coverage') parser.add_argument('--num_reads', dest='num_reads', required=True, help='Number of total reads in BAM/CRAM file') parser.add_argument('--regions_file', dest='regions_file', required=True, help='Region coordinate bed file') args = parser.parse_args() return args def main(): args = get_args() num_reads = None for line in open(args.num_reads,'r'): num_reads = float(line.strip()) rpm_factor = num_reads/1000000 regions = [] with gzip.open(args.regions_file,'rt') as f: for l in f: A = l.rstrip().split('\t') region = utils.Interval(chrom=A[0], start=int(A[1]), end=int(A[2]), data=A[3:] if len(A) > 3 else None) if region.start == region.end: continue regions.append(utils.Interval(chrom=A[0], start=int(A[1]), end=int(A[2]), data=A[3:] if len(A) > 3 else None)) rates = [] tbx = TabixFile(args.mosdepth_file) for region in regions: I = utils.get_intervals_in_region(region, args.mosdepth_file, tbx=tbx) total_depth = 0 for i in I: total_depth += int(i.data[0]) rate = float(total_depth)/rpm_factor/float(region.end-region.start) rates.append(rate) #mean = np.mean(rates) #stdev = np.std(rates) for i in range(len(regions)): region = regions[i] rate = rates[i] if region.data != None: gene_name = region.data[0] else: gene_name = 'No_gene_name' print('\t'.join([str(x) for x in [region.chrom, region.start, region.end, gene_name, rate]])) if __name__ == '__main__': main() |
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 | import sys """ input: tab seporated bed format with one additional field, the gene symbol output: tab seporated bed format followed by gene symbol and exon number """ current = None count = 0 for line in sys.stdin: row = line.strip().split('\t') if row[-1] == current: count += 1 else: current = row[-1] count = 1 print('\t'.join(row + [str(count)])) |
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 | import utils import gzip import argparse from pysam import TabixFile import numpy as np import glob def get_args(): parser = argparse.ArgumentParser() parser.add_argument('-r', dest='adj_score_dir', required=True, help='Path to directory containing adjusted score files') args = parser.parse_args() return args def main(): args = get_args() regions = [] samples = [] first_file = True file_i = 0 if args.adj_score_dir[-1] != '/': args.adj_score_dir += '/' for adj_file in glob.glob(args.adj_score_dir + '*.adj_z.bed.gz'): sample = adj_file.split('/')[-1].split('.')[0] samples.append(sample) with gzip.open(adj_file,'rt') as f: line_i = 0 for l in f: A = l.rstrip().split('\t') region = utils.Interval(chrom=A[0], start=int(A[1]), end=int(A[2]), data=A[3:] if len(A) > 3 else None) if region.start == region.end: continue interval = utils.Interval(chrom=A[0], start=int(A[1]), end=int(A[2]), data=A[3:] if len(A) > 3 else None) if first_file: regions.append([interval]) else: regions[line_i].append(interval) line_i += 1 first_file = False file_i += 1 print('\t'.join(['#CHROM', 'START', 'END', 'STRAND'] + samples)) for region in regions: adjs = [] for interval in region: adjs.append(float(interval.data[3])) print('\t'.join([str(x) for x in [region[0].chrom, region[0].start, region[0].end, region[0].data[0]] + adjs])) if __name__ == '__main__': main() |
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 | import argparse import utils import matplotlib.pyplot as plt import matplotlib.gridspec as gridspec from pysam import VariantFile import pandas as pd import numpy as np import pysam import statistics from random import randrange from matplotlib.lines import Line2D from matplotlib.gridspec import GridSpec import typing import matplotlib import matplotlib.patches as patches from matplotlib.patches import Rectangle import sys print(sys.argv) """ Global Settings """ matplotlib.rcParams['font.family'] = 'monospace' STD_POINT_SIZE = 2 STD_POINT_SIZE_OTHER = 7 STD_ARROW_SIZE = 5 DEFAULT_BLUE = '#1f78b4' DEFAULT_GREY = '#a5a5a5' DEFAULT_RED = '#ff5025' SUBPLOT_TITLE_SIZE = 10 SMALL_TEXT_SIZE = 7 STD_COVERAGE_Y_LIM = (-6, 6) CALL_BOX_ALPHA = .7 DUP_COLOR = '#1f78b4' DEL_COLOR = '#b2df8a' EXTERNAL_DATA_COLOR = '#feb24c' ULTRA_RARE_COLOR = '#1b9e77' RARE_COLOR = '#d95f02' COMMON_COLOR = '#7570b3' REPEAT_COLORS = ['#fc8d62', '#8da0cb', '#e78ac3', '#a6d854'] DUP_ALPHA = 0.9 DEL_ALPHA = 0.3 DUP_linestyle = None DEL_linestyle = '--' def get_args(): """ Gather the command line parameters, returns a namespace """ parser = argparse.ArgumentParser() parser.add_argument('--calls', '-i', dest='calls', help='list of comman seporated BGZIPED and TABIXED bed file of calls') parser.add_argument('--region', '-r', dest='region', required=True, help='genomic location to plot in format 1:1000-2000') parser.add_argument('--window', '-w', dest='window', type=int, default=0, help='window size to be plotted around the specified region (--region, -r)') parser.add_argument('--sample', '-s', dest='sample', required=True, help='sample identifier used to select the specific calls to plot') parser.add_argument('--coverage_scores', '-c', dest='coverage_scores', help='Name of output file') parser.add_argument('--sites', dest='sites', help='BED file that hase been bgzipped and tabixed') parser.add_argument('--gnomad', dest='gnomad', help='gnomad SV BED file that hase been bgzipped and tabixed') parser.add_argument('--vardb', dest='vardb', help='vardb common variants BED file that hase been bgzipped and tabixed') parser.add_argument('--depth', dest='depth', help='bed file site name, mean and standard deviation depth coverages') parser.add_argument('--title', dest='title', default=None, help='what to title the plot, "# Probes XX" will be appended to it.' 'Default will be SAMPLE_ID Chr:Start-End # Probes XX') parser.add_argument('--repeatmasker', dest='repeatmasker', help='repeat masker bed file that has been bgzipped and tabixed') parser.add_argument('--output', '-o', dest='output', help='Name of output file') return parser.parse_args() def ss(_line: str) -> typing.List[str]: """ Strip new line characters and split the give line on tabs """ return _line.strip().split('\t') def format_string_length(_s: str): """ take a string and add spaces till the end such that it is 20 characters in length """ string_length = 15 - len(_s) _new_s = '' return _s + (' ' * string_length) def parse_region(_region: utils.Interval, _window: int = 0) -> utils.Interval: """ param region: string of genomic location example "1:1000-2000" or "X:1000-2000" param window: int window to include on either side, default 0 """ _chrom = _region.split(':')[0] _target_region = utils.Interval( chrom=_chrom, start=max(0, int(_region.split(':')[1].split('-')[0]) - _window), end=int(_region.split(':')[1].split('-')[1]) + _window, data=None) return _target_region def get_calls(_files: typing.List[str], _region: utils.Interval, _with_sample_id: str = None, _without_sample_id: str = None) -> typing.List[utils.Interval]: """ params _files: list of gz files with an accompanying tbi file of calls param _region: Interval object param _with_sample_id: sample id to specifically include param _without_sample_id: sample id to specifically exclude return list of Interval objects, one for each call that intersects the region of interest """ _calls = [] print('Pre loop') for _f in _files: print(_calls) # get all calls that overlap the region try: _tmp = utils.get_intervals_in_region(_region, _f) except ValueError: print('Warning: no overlapping calls in ' + _f) continue print(len(_tmp)) for x in _tmp: print('\t',x.data) _f_name = _f.split('/')[-1].split('.')[0] [x.data.append(_f_name) for x in _tmp] _keepers = [] # get calls with a specific sample if _with_sample_id is not None: _keepers = [x for x in _tmp if sum(_with_sample_id in y for y in x.data) > 0] # get calls without a specific sample if _without_sample_id is not None: print('Without',_without_sample_id) _keepers = _keepers + [x for x in _tmp if x.data[1] != _without_sample_id] _calls = _calls + _keepers print('Post loop') print(_calls) return _calls def add_interval_box(_ax: plt.Axes, _interval: utils.Interval, _color: str, _a: float = 0.25, _ymin: float = 0.0, _ymax: float = 1.0, _hatch: str = None) -> None: """ Mark an interval with a box on an axis param ax: matplotlib ax object the boxes should be added to param _interval: interval object to mark with a box param _color: string color to be used at the color parameter in matplotlib param _a: float 0.0-1.0 for the alpha value, default 0.25 param _ymin: float 0.0-1.0 bottom start point for the box default 0 param _ymax: float 0.0-1.0 top end point for the box default 1 param _hatch: string, the hatch (patterned fill) to be used in the box return None """ _ax.axvspan(_interval.start, _interval.end, _ymin, _ymax, alpha=_a, color=_color, hatch=_hatch ) def mark_calls(_ax: plt.Axes, _ax_legend: plt.Axes, _calls: typing.List[utils.Interval], _hatch_map: typing.Dict, _x_range: typing.Tuple[int, int]) -> None: """ Mark all the _calls on the given _ax, color based on _color_map param _ax: matplotlib Axes object to plot on param _calls: list of intervals to be marked param _color_map: dictionary mapping file name to color param _hatch_map: dictionary mapping file names to hatch (patterned fill) param _x_range: tuple contain the range of the x axis return: None """ _range_off_set = (_x_range[1] - _x_range[0]) * 0.01 _hatches = [] _colors = [] _alphas = [] for _i, _c in enumerate(_calls): print('Sample Call:', _c) _this_calls_alpha = None _this_calls_color = DEFAULT_BLUE _colors.append(DEFAULT_BLUE) _hatches.append(None) if 'Dup' in _c.data[0] or 'DUP' in _c.data[0] or 'dup' in _c.data[0]: print('Its a DUP') _this_calls_alpha = DUP_ALPHA elif 'Del' in _c.data[0] or 'DEL' in _c.data[0] or 'del' in _c.data[0]: _this_calls_alpha = DEL_ALPHA print('Its a DEL') else: print('Warning! It is not a DUP or DEL') _this_calls_color = DEFAULT_GREY _colors.append(DEFAULT_GREY) _hatches.append(None) _alphas.append(_this_calls_alpha) _Y, _Y_info = stack_plots(_ax, _calls, _alphas, _colors) _hatch_idx = 0 # unused so repalced with alpha _alpha_idx = 0 _color_idx = 1 for _i, _row in enumerate(_Y): for _j, _call in enumerate(_row): # _ax.hlines(xmin=_call.start, xmax=_call.end, y=_i, color=_Y_info[_i][_j][_color_idx], linewidth=3, # alpha=_Y_info[_i][_j][_alpha_idx]) _edge_type = DUP_linestyle if _Y_info[_i][_j][_alpha_idx] == DEL_ALPHA: _edge_type = DEL_linestyle _span = _call.end - _call.start _rect = Rectangle((_call.start, _i), _span, .4, linewidth=1, edgecolor=_Y_info[_i][_j][_color_idx], facecolor=_Y_info[_i][_j][_color_idx], alpha=_Y_info[_i][_j][_alpha_idx], linestyle=_edge_type) # Add the patch to the Axes _ax.add_patch(_rect) # if the call extends past the x limits, add arrow if _call.start < _x_range[0]: # plot arrows if call goes out of range _ax.plot(_x_range[0] + _range_off_set, _i, '<', markersize=STD_ARROW_SIZE, c='black') _ax.plot(_x_range[0] + (_range_off_set * 2.5), _i, '<', markersize=STD_ARROW_SIZE, c='black') if _call.end > _x_range[1]: # plot arrows if call goes out of range _ax.plot(_x_range[1] - _range_off_set, _i, '>', markersize=STD_ARROW_SIZE, c='black') _ax.plot(_x_range[1] - (_range_off_set * 2.5), _i, '>', markersize=STD_ARROW_SIZE, c='black') # make the x_lim match the top plot _ax.set_xlim(_x_range) # add some extra space to the top of the plot for arrows are not cut off _lim = _ax.get_ylim() _ax.set_ylim(_lim[0] - 1, _lim[1] + 1) remove_borders(_ax) remove_ticks(_ax) _ax.set_title('Sample Calls', loc='left', fontsize=SUBPLOT_TITLE_SIZE) del_dup_legend(_ax_legend=_ax_legend, _color=DEFAULT_BLUE) def plot_sample_coverage_stats(_ax: plt.Axes, _region: utils.Interval, _scores_file: str, _sample_id: str, _x_range: typing.Tuple[int, int], _non_sample_overlapping_calls: typing.List, _sites: str = None ) -> None: """ param _ax: matplotlib axes object to plot onto param _ax_legend: matplotlib axes object put the legend in param _region: Interval object param _scores_file: bed file that has been bgzipped and tabixed and has normalized coverage stats param _sample_id: sample id param _sites: path to site/probe/exon BED file that have been bgzipped and tabixed param _x_range: tuple contain the range of the x axis param _non_sample_overlapping_calls: list of Interval objects that are overlapping non-sample calls """ # get the sample names of other samples with overlapping calls, this should be index 1 in the data _other_sample_with_calls = [x.data[1] for x in _non_sample_overlapping_calls] _sites_tabix = None if _sites is not None: _sites_tabix = pysam.TabixFile(_sites) print('r',_region) print('f',_scores_file) _scores = utils.get_intervals_in_region(_region, _scores_file) _site_pos = [] _means = [] _stdevs = [] _ys = [] _non_samp_xs = [] _non_samp_pos = [] _samples = utils.get_header(_scores_file)[0].split('\t')[4:] _sample_column_index = _samples.index(_sample_id) _colors = [] _color_legend = {} _non_samp_edge_colors = [] _non_samp_lw = [] for _site in _scores: # the the color representing quality from the probe file _added = False if _sites_tabix is not None: res = _sites_tabix.fetch(_site.chrom, _site.start, _site.end) for _line in res: # if there are less than 4 lines in the sites file, that means there is definately not any probe quality information if len(ss(_line)) < 4: continue _colors.append(ss(_line)[3]) if _colors[-1] not in _color_legend: _color_legend[_colors[-1]] = ss(_line)[4] _added = True break # if there is not color information provided or there is no probe matching this region use the default color if _sites_tabix is None or not _added: _colors.append(DEFAULT_BLUE) _color_legend[DEFAULT_BLUE] = 'No info' _tmp_scores = [float(x) for x in _site.data[1:]] # samples with an overlapping call should have a black border applied _tmp_non_samp_edge_colors = ['black' if s in _other_sample_with_calls else DEFAULT_GREY for s in _samples] _tmp_non_samp_lw = [1 if s in _other_sample_with_calls else 0 for s in _samples] _site_pos.append(_site.end) _means.append(np.mean(_tmp_scores)) _stdevs.append(np.std(_tmp_scores)) _ys.append(_tmp_scores[_sample_column_index]) _non_samp_xs += _tmp_scores _non_samp_pos += [_site.end] * len(_tmp_scores) _non_samp_edge_colors += _tmp_non_samp_edge_colors _non_samp_lw += _tmp_non_samp_lw # the non-sample points plotted in the background _non_sample_plot = _ax.scatter(_non_samp_pos, _non_samp_xs, # '-o', lw=_non_samp_lw, s=STD_POINT_SIZE_OTHER, label='Cohort Sample Coverage', edgecolors=_non_samp_edge_colors, c=DEFAULT_GREY, alpha=.3) # the mean of the cohort (not database) _mean_plot = None for i in range(len(_colors)): _tmp = _ax.plot(_site_pos[i], _means[i], marker='_', markersize=5, lw=0, label='Mean Pop. Coverage', c=DEFAULT_GREY) if _mean_plot is None: _mean_plot = _tmp # plot the standard deviations of the cohort normalized coverage z-scores _std_plots = None for i in range(len(_means)): _tmp = _ax.plot([_site_pos[i], _site_pos[i]], [_means[i] - _stdevs[i], _means[i] + _stdevs[i]], lw=1, color=DEFAULT_GREY, alpha=CALL_BOX_ALPHA) if _std_plots is None: _std_plots = _tmp # plot this samples normalized coverage z-scores as points _sample_cov_plot = None for i in range(len(_ys)): _tmp = _ax.plot(_site_pos, _ys, '-o', lw=0, markersize=STD_POINT_SIZE, label='Sample Coverage', c=DEFAULT_BLUE) if _sample_cov_plot is None: _sample_cov_plot = _tmp _ax.set_title('Double Normalized Coverage', loc='left', fontsize=SUBPLOT_TITLE_SIZE) # standardize the y-axis limits to match all other plots, only if it falls within that range if _ax.get_ylim()[0] >= STD_COVERAGE_Y_LIM[0] and _ax.get_ylim()[1] <= STD_COVERAGE_Y_LIM[1]: _ax.set_ylim(STD_COVERAGE_Y_LIM) remove_borders(_ax) remove_ticks(_ax, remove_y=False) _ax.tick_params(axis='both', which='both', labelsize=SMALL_TEXT_SIZE) _ax.set_ylabel('Z-score', size=SMALL_TEXT_SIZE) _ax.set_xlim(_x_range) # probe_quality_heatmap2(_ax=_ax, _ax_legend=None, _file=_sites, _region=_region, _x_range=_x_range) def remove_borders(_ax: plt.Axes) -> None: """ Remove all borders from the given matplotlib axes object param _ax: axes object """ _ax.spines['top'].set_visible(False) _ax.spines['right'].set_visible(False) _ax.spines['bottom'].set_visible(False) _ax.spines['left'].set_visible(False) def remove_ticks(_ax: plt.Axes, remove_x: bool = True, remove_y: bool = True) -> None: """ Remove tick marks from a matplotlib axes object param _ax: axes object param remove_x=True: If true remove the x ticks param remove_y=True: If true remove the y ticks """ if remove_x: _ax.get_xaxis().set_ticks([]) if remove_y: _ax.get_yaxis().set_ticks([]) def plot_other_samples_calls(_ax: plt.Axes, _ax_legend: plt.Axes, _region: utils.Interval, _non_sample_overlapping_calls: typing.List, _sample_id: str, _x_range: typing.Tuple[int, int], _color_map: typing.Dict, _hatch_map: typing.Dict) -> None: """ Plot boxes of the non-sample calls param _ax: matplotlib axes object to plot onto param _region: Interval object params _files: list of gz files with an accompanying tbi file of calls param _sample_id: sample id param _x_range: tuple contain the range of the x axis param _color_map: dictionary of calls files and colors they should be marked as param _hatch_map: dictionary mapping file names to hatch (patterned fill) param _non_sample_overlapping_calls: list of Interval objects that are overlapping non-sample calls """ _range_off_set = (_x_range[1] - _x_range[0]) * 0.01 # plot each call _calls = list(_non_sample_overlapping_calls) _hatches = [] _colors = [] _alphas = [] for _i, _c in enumerate(_calls): _this_calls_alpha = None _this_calls_color = DEFAULT_GREY _colors.append(DEFAULT_GREY) _hatches.append(None) if 'Dup' in _c.data[0] or 'DUP' in _c.data[0] or 'dup' in _c.data[0]: print('Chort DUP') _this_calls_alpha = DUP_ALPHA elif 'Del' in _c.data[0] or 'DEL' in _c.data[0] or 'del' in _c.data[0]: print('Chort DEL') _this_calls_alpha = DEL_ALPHA else: print('Chort Niether DUP not DEL') _this_calls_color = DEFAULT_GREY _colors.append(DEFAULT_GREY) _hatches.append(None) _alphas.append(_this_calls_alpha) _Y, _Y_info = stack_plots(_ax, _calls, _alphas, _colors) _hatch_idx = 0 # unused so replace with alphas _alpha_idx = 0 _color_idx = 1 for _i, _row in enumerate(_Y): for _j, _call in enumerate(_row): # if the call is a DEL, give it a different line type _edge_type = DUP_linestyle if _Y_info[_i][_j][_alpha_idx] == DEL_ALPHA: _edge_type = DEL_linestyle _span = _call.end - _call.start _rect = Rectangle((_call.start, _i), _span, .4, linewidth=1, edgecolor=_Y_info[_i][_j][_color_idx], facecolor=_Y_info[_i][_j][_color_idx], alpha=_Y_info[_i][_j][_alpha_idx], linestyle=_edge_type) # Add the patch to the Axes _ax.add_patch(_rect) # if the call extends past the x limits, add arrow if _call.start < _x_range[0]: # plot arrows if call goes out of range _ax.plot(_x_range[0] + _range_off_set, _i, '<', markersize=STD_ARROW_SIZE, c='black') _ax.plot(_x_range[0] + (_range_off_set * 2.5), _i, '<', markersize=STD_ARROW_SIZE, c='black') if _call.end > _x_range[1]: # plot arrows if call goes out of range _ax.plot(_x_range[1] - _range_off_set, _i, '>', markersize=STD_ARROW_SIZE, c='black') _ax.plot(_x_range[1] - (_range_off_set * 2.5), _i, '>', markersize=STD_ARROW_SIZE, c='black') # make the x_lim match the top plot _ax.set_xlim(_x_range) # add some extra space to the top of the plot for arrows are not cut off _lim = _ax.get_ylim() _ax.set_ylim(_lim[0] - 2, _lim[1] + 2) remove_borders(_ax) remove_ticks(_ax) _ax.set_title('Cohort Calls', loc='left', fontsize=SUBPLOT_TITLE_SIZE) del_dup_legend(_ax_legend=_ax_legend, _color=DEFAULT_GREY) def plot_gnomad(_ax: plt.Axes, _ax_legend: plt.Axes, _region: utils.Interval, _file: str, _x_range: typing.Tuple[int, int]) -> None: """ param _ax: matplotlib axes object to plot onto param _ax_legend: matplotlib axes object put the legend in param _region: Interval object param _file: gnomad sv file bgzipped and tabixed param x_range: tuple holding x axis limit """ _gnomad_tbx = pysam.TabixFile(_file) _res = _gnomad_tbx.fetch(_region.chrom, _region.start, _region.end) _allele_freqs = [] _regions = [] _colors = [] _hatches = [] _alphas = [] for _line in _res: if 'DUP' in _line: _alphas.append(DUP_ALPHA) elif 'DEL' in _line or 'CN=0' in _line: _alphas.append(DEL_ALPHA) else: # skip everything that is not a del or dup continue _row = ss(_line) _regions.append(utils.Interval(chrom=_row[0], start=int(_row[1]), end=int(_row[2]), data=_row[3:])) # _colors.append(DEFAULT_GREY) _hatches.append(None) _afs = [float(_x) for _x in _row[37].split(',')] _af = None if len(_afs) > 1: _af = sum(_afs) / len(_afs) else: _af = _afs[0] if _af <= .01: # ultra rare _colors.append(ULTRA_RARE_COLOR) elif _af > .01 and _af < 0.05: # rare _colors.append(RARE_COLOR) else: # polymorphic _colors.append(COMMON_COLOR) _Y, _Y_info = stack_plots(_ax, _regions, _alphas, _colors) _hatch_idx = 0 # not used, replaced with alpha _alpha_idx = 0 _color_idx = 1 for _i, _row in enumerate(_Y): for _j, _call in enumerate(_row): # _ax.hlines(xmin=_call.start, xmax=_call.end, y=_i, color=_Y_info[_i][_j][_color_idx], # alpha=_Y_info[_i][_j][_alpha_idx], linewidth=3) _edge_type = DUP_linestyle if _Y_info[_i][_j][_alpha_idx] == DEL_ALPHA: _edge_type = DEL_linestyle _span = _call.end - _call.start _rect = Rectangle((_call.start, _i), _span, .4, linewidth=1, edgecolor=_Y_info[_i][_j][_color_idx], facecolor=_Y_info[_i][_j][_color_idx], alpha=_Y_info[_i][_j][_alpha_idx], linestyle=_edge_type) # Add the patch to the Axes _ax.add_patch(_rect) _ax.set_xlim(_x_range) _ax.set_ylim(-1, len(_Y) - 1) remove_ticks(_ax_legend) remove_borders(_ax_legend) remove_borders(_ax) remove_ticks(_ax) _ax.set_title('gnomAD SVs Allele Freq', loc='left', fontsize=SUBPLOT_TITLE_SIZE) _legend_elements = [] _legend_labels = [] # _legend_elements.append(matplotlib.patches.Patch(facecolor='#ffffff', alpha=.5)) # _legend_labels.append('Pop. Freq.') _legend_elements.append(matplotlib.patches.Patch(facecolor=ULTRA_RARE_COLOR)) _legend_labels.append(format_string_length('Very Rare')) _legend_elements.append(matplotlib.patches.Patch(facecolor=RARE_COLOR)) _legend_labels.append(format_string_length('Rare')) _legend_elements.append(matplotlib.patches.Patch(facecolor=COMMON_COLOR)) _legend_labels.append(format_string_length('Polymorphic')) _legend_elements.append(matplotlib.patches.Patch(facecolor=DEFAULT_GREY, alpha=DEL_ALPHA, edgecolor=DEFAULT_GREY, linewidth=1, linestyle=DEL_linestyle)) _legend_labels.append(format_string_length('Del')) _legend_elements.append(matplotlib.patches.Patch(facecolor=DEFAULT_GREY, alpha=DUP_ALPHA, edgecolor=DEFAULT_GREY, linewidth=1, linestyle=DUP_linestyle)) _legend_labels.append(format_string_length('Dup')) # plot the legend _leg = _ax_legend.legend(_legend_elements, _legend_labels, frameon=False, prop={'size': SMALL_TEXT_SIZE}, borderaxespad=0) remove_borders(_ax_legend) remove_ticks(_ax_legend) def del_dup_legend(_ax_legend: plt.Axes, _color: str): """ _ax_legend: ax legend to add custom legend to _color: the color to make the del and dup as a string """ _legend_elements = [] _legend_labels = [] _legend_elements.append(matplotlib.patches.Patch(facecolor=_color, alpha=DEL_ALPHA, edgecolor=_color, linestyle=DEL_linestyle, linewidth=1)) _legend_labels.append(format_string_length('Del')) _legend_elements.append(matplotlib.patches.Patch(facecolor=_color, alpha=DUP_ALPHA, edgecolor=_color, linestyle=DUP_linestyle, linewidth=1)) _legend_labels.append(format_string_length('Dup')) _leg = _ax_legend.legend(_legend_elements, _legend_labels, frameon=False, prop={'size': SMALL_TEXT_SIZE}, borderaxespad=0) remove_borders(_ax_legend) remove_ticks(_ax_legend) def plot_dbvar(_ax: plt.Axes, _ax_legend: plt.Axes, _region: utils.Interval, _file: str, _x_range: typing.Tuple[float, float]) -> None: """ Plot VarDB common variants param _ax: matplotlib axes object to plot onto param _ax_legend: matplotlib axes object put the legend in param _region: Interval object param _file: VarDB Common Variants file bgzipped and tabixed param x_range: tuple holding x axis limit """ _tbx = pysam.TabixFile(_file) _res = _tbx.fetch(_region.chrom, _region.start, _region.end) _regions = [] _hatches = [] _colors = [] _alphas = [] for _line in _res: if 'DUP' in _line: _alphas.append(DUP_ALPHA) elif 'DEL' in _line: _alphas.append(DEL_ALPHA) else: # skip everything that is not a del or dup continue _row = ss(_line) _regions.append(utils.Interval(chrom=_row[0], start=int(_row[1]), end=int(_row[2]), data=_row[3:])) _colors.append(EXTERNAL_DATA_COLOR) _hatches.append(None) # _ax.hlines(xmin=_x_starts, xmax=_x_ends, y=list(range(len(_x_starts))), color=DEFAULT_GREY, linewidth=3) _Y, _Y_info = stack_plots(_ax, _regions, _alphas, _colors) _hatch_idx = 0 # hatch is not used, replaced with alpha _alpha_idx = 0 _color_idx = 1 for _i, _row in enumerate(_Y): for _j, _call in enumerate(_row): # _ax.hlines(xmin=_call.start, xmax=_call.end, y=_i, color=EXTERNAL_DATA_COLOR, linewidth=3, # alpha=_Y_info[_i][_j][_alpha_idx]) # if the call is a DEL, give it a different line type _edge_type = DUP_linestyle if _Y_info[_i][_j][_alpha_idx] == DEL_ALPHA: _edge_type = DEL_linestyle _span = _call.end - _call.start _rect = Rectangle((_call.start, _i), _span, .4, linewidth=1, edgecolor=_Y_info[_i][_j][_color_idx], facecolor=_Y_info[_i][_j][_color_idx], alpha=.3, linestyle=_edge_type) # Add the patch to the Axes _ax.add_patch(_rect) _ax.set_xlim(_x_range) _ax.set_ylim(-1, len(_Y) - 1) remove_ticks(_ax_legend) remove_ticks(_ax) remove_borders(_ax_legend) remove_borders(_ax) _ax.set_title('dbVar Common Variants', loc='left', fontsize=SUBPLOT_TITLE_SIZE) del_dup_legend(_ax_legend=_ax_legend, _color=EXTERNAL_DATA_COLOR) def plot_chr_mean_std(_ax_mean: plt.Axes, _ax_std: plt.Axes, _region: utils.Interval, _file: str) -> None: """ param _ax: matplotlib axes object to plot onto param _ax_legend: matplotlib axes object put the legend in param _region: Interval object param _file: adjusted scores file bgzipped and tabixed """ _xs = [] _means = [] _stds = [] for _line in open(_file, 'r'): _row = ss(_line) if _row[0] != _region.chrom: continue _xs.append(int(_row[1])) _means.append(float(_row[4])) _stds.append(float(_row[5])) _ax_mean.scatter(_xs, _means, s=.1, color=DEFAULT_BLUE) _ax_std.scatter(_xs, _stds, s=.1, color=DEFAULT_BLUE) _ax_mean.set_title('Mean Normalized Coverage', loc='left', fontsize=SUBPLOT_TITLE_SIZE) _ax_std.set_title('Std Normalized Coverage', loc='left', fontsize=SUBPLOT_TITLE_SIZE) # _ax_std.set_ylabel('Std',size=SMALL_TEXT_SIZE) # _ax_mean.set_ylabel('Mean',size=SMALL_TEXT_SIZE) _ax_std.tick_params(axis='both', which='both', labelsize=SMALL_TEXT_SIZE) _ax_mean.tick_params(axis='both', which='both', labelsize=SMALL_TEXT_SIZE) remove_borders(_ax_std) remove_borders(_ax_mean) remove_ticks(_ax_mean, remove_y=False) remove_ticks(_ax_std, remove_y=False) _ax_mean.axvspan(_region.start, _region.end, alpha=.5, color=DEFAULT_RED) _ax_std.axvspan(_region.start, _region.end, alpha=.5, color=DEFAULT_RED) _ax_std.set_xlabel('Chromosome ' + str(_region.chrom)) # use only integer in y lim def convert_size_to_string(size: int): """ param size: number of base pairs return the size converted from an integer to a string e.g. 2000 -> 2Kb or 123,000,000 -> 123Mb """ if size < 1000: return str(size) + ' bases' elif size >= 1000 and size < 1000000: return str(round(size / 1000, 1)) + 'Kb' elif size >= 1000000 and size < 1000000000: return str(round(size / 1000000, 1)) + 'Mb' else: return str(round(size / 1000000000, 1)) + 'Gb' def configure_title(_fig: plt.Figure, _region: utils.Interval, _scores_file: str, _sample_id: str, _title: str = None): """ param _fig: matplotlib figure object for the whole entire plot param _region: Interval object param _scores_file: bed file that has been bgzipped and tabixed and has normalized coverage stats param _sample_id: sample id param _title: string what the plot should be titled, defaults to sample_id, region. # probes is appended to the end of all titles """ _scores = utils.get_intervals_in_region(_region, _scores_file) if _title is None: _title = '{} {}:{}-{}\n# Probes {}'.format(_sample_id, str(_region.chrom), str(_region.start), str(_region.end), str(len(_scores))) _title = _title + 'Target size: {}'.format(convert_size_to_string(_region.end - _region.start)) _fig.suptitle(_title) else: _title = _title + '\n# Probes {}'.format(str(len(_scores))) _title = _title + ' Target size: {}'.format(convert_size_to_string(_region.end - _region.start)) _fig.suptitle(_title) # control the amount fo space below the title _fig.subplots_adjust(top=0.92) def plot_probe_qualities(_ax: plt.Axes, _region: utils.Interval, _scores_file: str, _sample_id: str, _sites: str = None): """ param _ax: matplotlib axes object to plot onto param _ax_legend: matplotlib axes object put the legend in param _region: Interval object param _scores_file: bed file that has been bgzipped and tabixed and has normalized coverage stats param _sample_id: sample id param _sites: path to site/probe/exon BED file that have been bgzipped and tabixed """ _scores = utils.get_intervals_in_region(_region, _scores_file) _sites_tabix = pysam.TabixFile(_sites) _colors = [] for _site in _scores: # the the color representing quality from the probe file _added = False if _sites_tabix is not None: res = _sites_tabix.fetch(_site.chrom, _site.start, _site.end) for _line in res: _colors.append(ss(_line)[3]) _ax.set_title('Probe quality', loc='left', fontsize=SUBPLOT_TITLE_SIZE) for _i in range(len(_colors)): _ax.axvspan(_i, _i + .7, alpha=1.0, color=_colors[_i]) _ax.set_ylim(0, 1) remove_ticks(_ax) remove_borders(_ax) def main_legend(_ax: plt.Axes, _region: utils.Interval, _scores_file: str, _hatch_map: typing.Dict, _sites: str = None) -> None: """ param _ax: matplotlib Axes object to plot onto param _region: Interval object param _scores_file: bed file that has been bgzipped and tabixed and has normalized coverage stats param _sites: path to site/probe/exon BED file that have been bgzipped and tabixed """ _legend_elements = [] _legend_labels = [] _legend_elements.append(Line2D([0], [0], marker='o', color=DEFAULT_BLUE, # label=_color_legend[_color], markerfacecolor=DEFAULT_BLUE, markersize=5, linewidth=0)) _legend_labels.append(format_string_length('Sample')) _legend_elements.append(Line2D([0], [0], marker='o', color=DEFAULT_GREY, # label=_color_legend[_color], markerfacecolor=DEFAULT_GREY, markersize=5, linewidth=0)) _legend_labels.append(format_string_length('Cohort')) _legend_elements.append(Line2D([0], [0], marker='o', color=DEFAULT_GREY, markeredgecolor='black', markerfacecolor=DEFAULT_GREY, markersize=5, markeredgewidth=1, linewidth=0)) _legend_labels.append(format_string_length('Overlapping Calls')) # _legend_elements.append(matplotlib.patches.Patch(facecolor=DEFAULT_BLUE, alpha=1)) # _legend_labels.append('Sample') # # _legend_elements.append(matplotlib.patches.Patch(facecolor=DEFAULT_GREY, alpha=1)) # _legend_labels.append('Cohort') # plot the legend _leg = _ax.legend(_legend_elements, _legend_labels, frameon=False, prop={'size': SMALL_TEXT_SIZE}, borderaxespad=0) remove_borders(_ax) remove_ticks(_ax) def a_single_legend_to_rule_them_all(_ax: plt.Axes, _region: utils.Interval, _scores_file: str, _hatch_map: typing.Dict, _sites: str = None) -> None: """ param _ax: matplotlib Axes object to plot onto param _region: Interval object param _scores_file: bed file that has been bgzipped and tabixed and has normalized coverage stats param _sites: path to site/probe/exon BED file that have been bgzipped and tabixed """ _sites_tabix = None if _sites is not None: _sites_tabix = pysam.TabixFile(_sites) _scores = utils.get_intervals_in_region(_region, _scores_file) _colors = [] _color_legend = {} _color_legend['#ffffff'] = 'Probe Quality' for _site in _scores: # the the color representing quality from the probe file _added = False if _sites_tabix is not None: res = _sites_tabix.fetch(_site.chrom, _site.start, _site.end) for _line in res: _colors.append(ss(_line)[3]) if _colors[-1] not in _color_legend: _color_legend[_colors[-1]] = ss(_line)[4] _added = True break # Normalized coverage: # create the legend _legend_elements = [] _legend_labels = [] _color_legend[DEFAULT_GREY] = 'Cohort' for _color in _color_legend.keys(): _legend_elements.append(Line2D([0], [0], marker='o', color=_color, # label=_color_legend[_color], markerfacecolor=_color, markersize=5, linewidth=0)) _legend_labels.append(_color_legend[_color]) """ Call Groups """ p = matplotlib.patches.Patch(facecolor='#ffffff', hatch=None, alpha=.5) _legend_labels.append('Call Groups') _legend_elements.append(p) for _key in _hatch_map.keys(): p = matplotlib.patches.Patch(facecolor=DEFAULT_GREY, hatch=_hatch_map[_key], alpha=.5) _legend_labels.append(_key[:10]) _legend_elements.append(p) _legend_elements.append(matplotlib.patches.Patch(facecolor='#ffffff', alpha=.5)) _legend_labels.append('SV Type') _legend_elements.append(matplotlib.patches.Patch(facecolor=DEL_COLOR, alpha=.5)) _legend_labels.append('Del') _legend_elements.append(matplotlib.patches.Patch(facecolor=DUP_COLOR, alpha=.5)) _legend_labels.append('Dup') _legend_elements.append(matplotlib.patches.Patch(facecolor='#ffffff', alpha=.5)) _legend_labels.append('Pop. Freq.') _legend_elements.append(matplotlib.patches.Patch(facecolor=ULTRA_RARE_COLOR)) _legend_labels.append('Very Rare') _legend_elements.append(matplotlib.patches.Patch(facecolor=RARE_COLOR)) _legend_labels.append('Rare') _legend_elements.append(matplotlib.patches.Patch(facecolor=COMMON_COLOR)) _legend_labels.append('Polymorphic') # plot the legend _leg = _ax.legend(_legend_elements, _legend_labels, frameon=False, prop={'size': SMALL_TEXT_SIZE}, borderaxespad=0) remove_borders(_ax) remove_ticks(_ax) def get_overlap(a, b): """ param a: interval object param b: interval object return the amount of overlap two Interval objects have assuming they are on the same chromosome """ return max(0, min(a.end, b.end) - max(a.start, b.start)) def stack_plots(_ax, _regions, _hatches, _colors): # list of lists, each internal list represent a row in the plot _Y = [] _Y_info = [] for _i, _call in enumerate(_regions): _added_to_a_row = False for _j, _row in enumerate(_Y): _fits_in_row = True for _r_call in _row: # check if the two intervals overlap if get_overlap(_r_call, _call) != 0: _fits_in_row = False break if _fits_in_row: _Y[_j].append(_call) _Y_info[_j].append((_hatches[_i], _colors[_i])) _added_to_a_row = True break # if it reaches here and has not been added to a row, start a new row if not _added_to_a_row: _Y.append([_call]) _Y_info.append([(_hatches[_i], _colors[_i])]) return _Y, _Y_info def plot_repeat_masker(_ax: plt.Axes, _ax_legend: plt.Axes, _file: str, _region: utils.Interval): _tbx = pysam.TabixFile(_file) _res = _tbx.fetch(_region.chrom, _region.start, _region.end) _repeat_element = [] _segdup = [] _selfchain = [] _simple_repeat = [] """ repeatElement segdup selfchain simpleRepeat """ for _line in _res: _row = ss(_line) _pair = (int(_row[1]), int(_row[2])) try: if 'repeatElement' in _row[4]: _repeat_element.append(_pair) elif 'segdup' in _row[4]: _segdup.append(_pair) elif 'selfchain' in _row[4]: _selfchain.append(_pair) elif 'simpleRepeat' in _row[4]: _simple_repeat.append(_pair) else: print('Unrecognized {}'.format(_row[4])) except IndexError: print(_row) _colors_index = 0 for _pair in _repeat_element: _ax.hlines(xmin=_pair[0], xmax=_pair[1], y=_colors_index, color=REPEAT_COLORS[_colors_index], linewidth=3, alpha=.5) _colors_index = 1 for _pair in _segdup: _ax.hlines(xmin=_pair[0], xmax=_pair[1], y=_colors_index, color=REPEAT_COLORS[_colors_index], linewidth=3, alpha=.5) _colors_index = 2 for _pair in _selfchain: _ax.hlines(xmin=_pair[0], xmax=_pair[1], y=_colors_index, color=REPEAT_COLORS[_colors_index], linewidth=3, alpha=.5) _colors_index = 3 for _pair in _simple_repeat: _ax.hlines(xmin=_pair[0], xmax=_pair[1], y=_colors_index, color=REPEAT_COLORS[_colors_index], linewidth=3, alpha=.5) _ax.set_title('Repeat Masker', loc='left', fontsize=SUBPLOT_TITLE_SIZE) remove_ticks(_ax) remove_ticks(_ax_legend) remove_borders(_ax) remove_borders(_ax_legend) _ax.spines['bottom'].set_visible(True) # _ax.spines['bottom'].set_edgecolor(DEFAULT_RED) _ax.set_yticks([-1, 0, 1, 2, 3]) _ax.set_yticklabels(['', ' Repeat Elements', 'SegDup', 'Self Chain', 'Simple Repeat']) _ax.tick_params(axis='both', which='both', labelsize=SMALL_TEXT_SIZE) _ax.set_xlabel(str(_region.chrom) + ':' + str(_region.start) + '-' + str(_region.end), size=SMALL_TEXT_SIZE) def probe_quality_heatmap(_ax: plt.Axes, _ax_legend: plt.Axes, _file: str, _region: utils.Interval, _x_range: typing.Tuple): _probes_tabix = pysam.TabixFile(_file) _jump_size = (_x_range[1] - _x_range[0]) / 100 _scores = [] for i in range(100): _start_pos = _x_range[0] + _jump_size * i _end_pos = _start_pos + _jump_size _res = _probes_tabix.fetch(_region.chrom, _start_pos, _end_pos) _qualities = ['pseudo_count'] for _line in _res: if len(ss(_line)) < 4: continue _qualities.append(ss(_line)[4]) # if there is more than one there is probe information, else there is not and we need a filler value if len(_qualities) > 1: _portion_good = len([x for x in _qualities if 'good' == x.lower()]) / len(_qualities) _scores.append(_portion_good) else: _scores.append(0) # plot the scores cmap = matplotlib.cm.get_cmap('Oranges') for i in range(len(_scores)): _color = cmap(_scores[i]) _x1 = _x_range[0] + _jump_size * i # the * 1.2 is so the hlines overlap with each other and there is no white space showing _x2 = _x1 + _jump_size * 1.2 _ax.hlines(1, _x1, _x2, colors=_color, linewidth=5.0) scale_vals = [0.0, 0.5, 1.0] _legend_elements = [] _legend_labels = [] _legend_elements.append(matplotlib.patches.Patch(facecolor=cmap(scale_vals[2]), edgecolor=DEFAULT_GREY)) _legend_labels.append(format_string_length('100% Good')) _legend_elements.append(matplotlib.patches.Patch(facecolor=cmap(scale_vals[1]), edgecolor=cmap(scale_vals[1]))) _legend_labels.append(format_string_length(' ')) _legend_elements.append(matplotlib.patches.Patch(facecolor=cmap(scale_vals[0]), edgecolor=cmap(scale_vals[0]))) _legend_labels.append(format_string_length('0% Good')) _leg = _ax_legend.legend(_legend_elements, _legend_labels, frameon=False, prop={'size': SMALL_TEXT_SIZE}, borderaxespad=0, labelspacing=-0.3) # this removes spacing between legend items remove_ticks(_ax_legend) remove_borders(_ax_legend) _ax.set_xlim(_x_range[0], _x_range[1]) _ax.set_ylim(.5, 1.5) remove_borders(_ax) remove_ticks(_ax) _ax.set_title('Probe Quality', loc='left', fontsize=SUBPLOT_TITLE_SIZE) def probe_quality_heatmap2(_ax: plt.Axes, _ax_legend: plt.Axes, _file: str, _region: utils.Interval, _x_range: typing.Tuple): _probes_tabix = pysam.TabixFile(_file) _jump_size = (_x_range[1] - _x_range[0]) / 100 _scores = [] _scores2 = [] cmap = matplotlib.cm.get_cmap('Blues_r') for i in range(100): _start_pos = _x_range[0] + _jump_size * i _end_pos = _start_pos + _jump_size _res = _probes_tabix.fetch(_region.chrom, _start_pos, _end_pos) _res = list(_res) _qualities = ['pseudo_count'] for _line in _res: _qualities.append(ss(_line)[4]) _portion_good = len([x for x in _qualities if 'good' == x.lower()]) / len(_qualities) _scores.append(_portion_good) _scores2.append(len(_qualities)) _color = cmap(_scores[i]) # plot the scores for i in range(len(_scores)): _color = cmap(_scores[i]) _x1 = _x_range[0] + _jump_size * i # the * 1.2 is so the hlines overlap with each other and there is no white space showing _x2 = _x1 + _jump_size * 1.2 _ax.hlines(-5, _x1, _x2, colors=_color, linewidth=5.0) def main(): """ Main function, here just to ensure proper variable scopes """ args = get_args() #calls args on comma args.calls = args.calls.split(',') print('Calls',args.calls) fig = plt.figure() fig.set_size_inches(8, 10, forward=True) # fig.patch.set_facecolor('#f6f6f6') # plt.gcf().set_facecolor("red") """ Format the axes """ gs1 = GridSpec(9, 2, width_ratios=[4, 1], height_ratios=[12, 2, 4, 4, 4, 4, 4, 4, 4], left=0.12, right=0.98, wspace=0.10, hspace=0.5) all_axes = [] # top panel coverage_ax = fig.add_subplot(gs1[0, 0]) # coverage_ax_legend = fig.add_subplot(gs1[0, 1]) all_axes.append(coverage_ax) # all_axes.append(coverage_ax_legend) # sample calls heatmap_ax = fig.add_subplot(gs1[1, 0]) heatmap_legend_ax = fig.add_subplot(gs1[1, 1]) all_axes.append(heatmap_ax) all_axes.append(heatmap_legend_ax) # sample calls samp_calls_ax = fig.add_subplot(gs1[2, 0]) all_axes.append(samp_calls_ax) samp_calls_ax_legend = fig.add_subplot(gs1[2, 1]) all_axes.append(samp_calls_ax) # middle panel ax2 = fig.add_subplot(gs1[3, 0]) ax2_legend = fig.add_subplot(gs1[3, 1]) all_axes.append(ax2) all_axes.append(ax2_legend) # legend first_legend = fig.add_subplot(gs1[0, 1]) all_axes.append(first_legend) # vardb panel vardb_ax = fig.add_subplot(gs1[4, 0]) vardb_ax_legend = fig.add_subplot(gs1[4, 1]) all_axes.append(vardb_ax) all_axes.append(vardb_ax_legend) # gnomad panel gnomad_ax = fig.add_subplot(gs1[5, 0]) gnomad_ax_legend = fig.add_subplot(gs1[5, 1]) all_axes.append(gnomad_ax) all_axes.append(gnomad_ax_legend) # vardb panel repeat_ax = fig.add_subplot(gs1[6, 0]) repeat_ax_legend = fig.add_subplot(gs1[6, 1]) all_axes.append(repeat_ax) all_axes.append(repeat_ax_legend) # probe quality report # site_qc_ax = fig.add_subplot(gs1[5, 0]) # all_axes.append(site_qc_ax) # bottom panel mean_ax = fig.add_subplot(gs1[7, 0]) std_ax = fig.add_subplot(gs1[8, 0]) all_axes.append(mean_ax) all_axes.append(std_ax) """ Plot the calls and a window size """ target_window = parse_region(args.region, args.window) target = parse_region(args.region, 0) print(args.calls) in_sample_calls = get_calls(args.calls, target_window, _with_sample_id=args.sample) for x in in_sample_calls: print('in sample calls', x) # input_file_colors_options = ['#a6cee3', '#1f78b4', '#b2df8a', '#33a02c'] input_file_hatch_options = ['////', '....', '||||', '----', '++++', 'xxxx', 'oooo', '****', 'OOOO', '\\\\\\\\'] # index x.data[-1] is the file the call came from f_names = list(set([x.data[-1] for x in in_sample_calls])) f_names.sort() # input_file_color_map = {y: input_file_colors_options[i] for i, y in enumerate(f_names)} input_file_hatch_map = {y: input_file_hatch_options[i] for i, y in enumerate(f_names)} # find the calls that overlap with the given sample's call but are not that sample. # This is used in the main plot and the Cohort Calls plot non_sample_overlapping_calls = get_calls(args.calls, target_window, _without_sample_id=args.sample) print('Non-Sample calls',non_sample_overlapping_calls) """ Mark this sample's coverage stats """ plot_sample_coverage_stats(coverage_ax, _region=target_window, _scores_file=args.coverage_scores, _sample_id=args.sample, _sites=args.sites, _x_range=(target_window.start, target_window.end), _non_sample_overlapping_calls=non_sample_overlapping_calls) mark_calls(samp_calls_ax, samp_calls_ax_legend, in_sample_calls, input_file_hatch_map, _x_range=(target_window.start, target_window.end)) """ Probe quality heatmap """ probe_quality_heatmap(_ax=heatmap_ax, _ax_legend=heatmap_legend_ax, _file=args.sites, _region=target_window, _x_range=(target_window.start, target_window.end)) """ Plot the calls made on other cohort members in this region """ plot_other_samples_calls(_ax=ax2, _ax_legend=ax2_legend, _region=target_window, _non_sample_overlapping_calls=non_sample_overlapping_calls, _sample_id=args.sample, _x_range=coverage_ax.get_xlim(), _color_map=None, _hatch_map=input_file_hatch_map) """ Plot gnomAD """ plot_gnomad(gnomad_ax, gnomad_ax_legend, _region=target_window, _file=args.gnomad, _x_range=coverage_ax.get_xlim()) """ Plot VarDB """ plot_dbvar(vardb_ax, vardb_ax_legend, _region=target_window, _file=args.vardb, _x_range=coverage_ax.get_xlim()) """ Plot Mean and Standard Deviation Coverages across whole chromosome """ plot_chr_mean_std(mean_ax, std_ax, target_window, args.depth) """ Make the title """ configure_title(fig, target, args.coverage_scores, args.sample, args.title) """ Add a non-to scale probe quality report """ # plot_probe_qualities(site_qc_ax, target_window, args.coverage_scores, args.sample, # args.sites) """ """ # a_single_legend_to_rule_them_all(first_legend, target_window, args.coverage_scores, input_file_hatch_map, # args.sites) main_legend(first_legend, target_window, args.coverage_scores, input_file_hatch_map, args.sites) """ Add the repeat masker lines """ plot_repeat_masker(repeat_ax, repeat_ax_legend, _file=args.repeatmasker, _region=target_window) """ Final Plotting Options """ plt.savefig(args.output, bbox_inches='tight', dpi=600) plt.show() if __name__ == '__main__': main() """ Huge Example -i ExampleData/calls6000.sorted.bed.gz -i ExampleData/calls_savvy1000.sorted.bed.gz -o test_figure_2.png -r 15:20442000-22572000 -s WES100_S13 -w 50000 --coverage_scores ExampleData/adj_scores.bed.gz --sites ExampleData/wes_probes.sorted.bed.gz --gnomad ExampleData/gnomad_v2.1_sv.sites.bed.gz --vardb ExampleData/vardb.sorted.bed.gz --depth ExampleData/WES100_S13_probe.cover.mean.stdev.bed --title "WES100_S13 Duplication" --repeatmasker ExampleData/genomicRepeats.sorted.bed.gz """ """ Small Example -i ExampleData/calls6000.sorted.bed.gz -i ExampleData/calls_savvy1000.sorted.bed.gz -o test_figure_17:18324000-18458000.png -r 17:18324000-18458000 -s WES100_S13 -w 50000 --coverage_scores ExampleData/adj_scores.bed.gz --sites ExampleData/wes_probes.sorted.bed.gz --gnomad ExampleData/gnomad_v2.1_sv.sites.bed.gz --vardb ExampleData/vardb.sorted.bed.gz --depth ExampleData/WES100_S13_probe.cover.mean.stdev.bed --title "WES100_S13 Duplication" --repeatmasker ExampleData/genomicRepeats.sorted.bed.gz """ |
73 74 75 76 77 78 | shell: """ mkdir -p workproband/Mosdepth mosdepth workproband/Mosdepth/{wildcards.sample} {input} tabix -f -p bed workproband/Mosdepth/{wildcards.sample}.per-base.bed.gz """ |
90 91 92 93 | shell: """ samtools view -c -F 260 {input} > {output} """ |
103 104 105 106 | shell: """ cat {input} > workproband/TotalReads/total_read.txt """ |
117 118 119 120 121 122 123 | shell: """ mkdir -p workproband/Probes/ bedtools sort -i {input.probes} > workproband/Probes/probes.sorted.bed bgzip workproband/Probes/probes.sorted.bed tabix workproband/Probes/probes.sorted.bed.gz -p bed """ |
136 137 138 139 140 141 142 143 144 | shell: """ get_rpm_rates.py \ -m {input.per_base_bed_gz} \ --regions_file {input.probes_gz} \ --num_reads {input.total_reads} \ | bgzip -c > workproband/RPM/{wildcards.sample}.probe.rpm_rate.bed.gz tabix -p bed workproband/RPM/{wildcards.sample}.probe.rpm_rate.bed.gz """ |
159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 | shell: """ mkdir -p workproband/Params/ while read p; do cat $p | awk -v VAR=$p '{{ print $1":"$2"-"$3"\t"$4"\t"$5"\t"VAR"\t"$1"."$2"-"$3"\t"$4}}' >> workproband/Params/tmp_call_plotting_params.txt done <{input.all_calls} for i in {{1..23}}; do grep "^$i" workproband/Params/tmp_call_plotting_params.txt >> workproband/Params/call_plotting_params.txt.tmp || echo "Not found" 2>> {log} done touch workproband/Params/call_plotting_params.txt # remove samples from the calls that are not in the sample list array=( {params.samples} ) for j in "${{array[@]}}" do cat workproband/Params/call_plotting_params.txt.tmp | {{ grep $j || :; }} >> workproband/Params/call_plotting_params.txt done """ |
191 192 193 194 195 196 197 198 199 200 | shell: """ mkdir -p workproband/SplitParams i=0 while read p; do echo $p > workproband/SplitParams/${{i}}.txt i=$((i+1)) done <{input.param_file} touch workproband/SplitParams/.split_param_file.ready """ |
229 230 231 232 233 234 | shell: """ zcat {input} | label_exon_number.py > workproband/Exons/labeled_exons.bed bgzip workproband/Exons/labeled_exons.bed tabix -p bed workproband/Exons/labeled_exons.bed.gz """ |
250 251 252 253 254 255 256 257 | shell: """ mkdir -p workproband/ProbeCoverageRefPanel/ cp {input.bedgz} workproband/ProbeCoverageRefPanel/ cp {input.tbi} workproband/ProbeCoverageRefPanel/ cp {input.adj_bed} workproband/AdjZscoreRefPanel/ cp {input.adj_tbi} workproband/AdjZscoreRefPanel/ """ |
273 274 275 276 277 278 279 280 | shell: """ get_regions_zscores.py -r workproband/ProbeCoverageRefPanel/ -s {input.proband_rpm} | bgzip -c > {output.bedgz} cp {output.bedgz} workproband/ProbeCoverage/{wildcards.sample}_probe.cover.mean.stdev.copy.bed.gz gunzip workproband/ProbeCoverage/{wildcards.sample}_probe.cover.mean.stdev.copy.bed.gz mv workproband/ProbeCoverage/{wildcards.sample}_probe.cover.mean.stdev.copy.bed {output.bed} tabix -p bed {output.bedgz} """ |
294 295 296 297 298 299 300 301 302 | shell: """ get_coverage_zscores.py \ -r {input.proband_rpm} \ -s {input.proband_coverage} \ | bgzip -c > {output.bedgz} tabix -p bed {output.bedgz} """ |
316 317 318 319 320 321 322 323 | shell: """ mkdir -p workproband/AllAdjZScore cp workproband/AdjZscore/* workproband/AllAdjZScore cp workproband/AdjZscoreRefPanel/* workproband/AllAdjZScore merge_samples_adj_scores.py -r workproband/AllAdjZScore | bgzip -c > {output.final_bed_gz} tabix -p bed {output.final_bed_gz} """ |
334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 | shell: """ mkdir -p workproband/FindMaxTMP rm -f workproband/FindMaxTMP/* while read p; do name=$(basename $p) array=( {params.samples} ) for i in "${{array[@]}}" do cat $p | {{ grep $i || :; }} >> workproband/FindMaxTMP/$name.filtered 2>> {log} done bedtools sort -i workproband/FindMaxTMP/$name.filtered > workproband/FindMaxTMP/$name.sorted 2>> {log} bgzip -f workproband/FindMaxTMP/$name.sorted 2>> {log} tabix -f -p bed workproband/FindMaxTMP/$name.sorted.gz 2>> {log} echo "workproband/FindMaxTMP/${{name}}.sorted.gz" >> workproband/FindMaxTMP/multiple_savvy_calls.txt 2>> {log} done <{input.all_calls} touch workproband/FindMaxTMP/.ready """ |
374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 | shell: """ inputs="" empty="" while read p; do if [ "$inputs" == "$empty" ]; then inputs="$p" else inputs="$inputs,$p" fi done < workproband/FindMaxTMP/multiple_savvy_calls.txt echo {params.num_calls} >> {log} echo {wildcards.num} >> {log} mkdir -p workproband/Plots/ 2>> {log} mkdir -p workproband/PlotsComplete/ 2>> {log} string=$(head -1 {input.param_file}) 2>> {log} region=$(cut -d" " -f1 <<< $string) 2>> {log} svtype=$(cut -d" " -f2 <<< $string) 2>> {log} samplename=$(cut -d" " -f3 <<< $string) 2>> {log} sample=${{samplename%.*}} 2>> {log} callsfile=$(cut -d" " -f4 <<< $string) 2>> {log} regionclean=$(cut -d" " -f5 <<< $string) 2>> {log} svtype2=$(cut -d" " -f6 <<< $string) 2>> {log} new_plotter.py \ -i $inputs \ -s ${{samplename%.*}} \ --output workproband/Plots/"${{sample}}.${{region}}.${{svtype}}".png \ --coverage_scores {input.adj_scores} \ --sites {input.probes} \ --window 50000 \ --region ${{region}} \ --title "${{sample}} ${{region}} ${{svtype}}" \ --depth workproband/ProbeCoverage/${{sample}}_probe.cover.mean.stdev.bed \ --gnomad {input.gnomad_sv} \ --vardb {input.vardb} \ --repeatmasker {input.repeat_masker} touch {output} """ |
424 | shell: "mkdir -p workproband/AlleleCount; touch {output}" |
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 | import argparse import sys import os def get_args(): help_message =""" SeeNV usage: One of the following options is required --buildref, -b flag to use function to build a reference panel database --plotsamples, -p flag to plot samples Parameters to accompany --plotsamples, -p: -i INPUT_SAMPLES (required) samples list -s SITES (required) genomic sites bed file -c ALL_CALLS (required) calls file. Each line should be a path to a set of calls in bed format -o OUTPUT (required) output directory, where to save the plots -r REF_DB (required) path to reference db created by the --buildref function of SeeNV -a GNOMAD (required) the gnomad sv sites file with allele frequency information -t THREADS (optional) number of threads to use, default 1 (you really want to use more than 1) -v varDB (required) path to a GZipped bed file for the varDB common variants with an accompanying tabix indexed -m RepeatMasker (required) path to a GZipped bed file for the RepeatMasker elements with an accompanying tabix indexed Parameters to accompany --buildref, -b -i INPUT_SAMPLES (required) samples list -s SITES (required) genomic sites bed file -c ALL_CALLS (required) calls file. Each line should be a path to a set of calls in bed format -o OUTPUT (required) output directory, reference panel database -t THREADS (optional) number of threads to use, default 1 (you really want to use more than 1) """ if '-h' in sys.argv or '--help' in sys.argv: print(help_message) param_sum = sum(['--buildref' in sys.argv or '-b' in sys.argv, '--plotsamples' in sys.argv or '-p' in sys.argv]) if param_sum > 1 or param_sum == 0: print('SeeNV Error: you must use --buildref (-b) xor --plotsamples (-p)') print(help_message) exit(1) run_type = None try: if '--buildref' in sys.argv or '-b' in sys.argv: args = get_build_args() run_type = 'buildref' elif '--plotsamples' in sys.argv or '-p' in sys.argv: args = get_plot_args() run_type = 'plotsamples' except: print(help_message) exit(1) return run_type, args def get_plot_args(): parser = argparse.ArgumentParser() parser.add_argument('--plotsamples', '-p', action='store_true', dest='plotsamples', default=False, help='flag to plot samples') parser.add_argument('-i',dest='input_samples',help='samples list',required=True) parser.add_argument('-s',dest='sites',help='genomic sites bed file',required=True) parser.add_argument('-c',dest='all_calls',help='calls file. Each line should be a path to a set of calls in bed format',required=True) parser.add_argument('-o',dest='output',help='output directory, where to save the plots',required=True) parser.add_argument('-r',dest='ref_db',help='path to reference db created by the --buildref function of SeeNV',required=True) parser.add_argument('-a',dest='gnomad',help='the gnomad sv sites file with allele frequency information',required=True) parser.add_argument('-t',dest='threads',help='number of threads to use, default 1 (you really want to use more than 1)',default="1",required=False) parser.add_argument('-v',dest='vardb',help='path to bed file containing varDB common variants',required=True) parser.add_argument('-m',dest='repeat_masker',help='path to bed file containing repeatMasker',required=True) return parser.parse_args() def get_build_args(): parser = argparse.ArgumentParser() parser.add_argument('--buildref', '-b', action='store_true', dest='buildref', default=False, help='flag to use function to build a reference panel database') parser.add_argument('-i',dest='input_samples',help='samples list',required=True) parser.add_argument('-s',dest='sites',help='genomic sites bed file',required=True) parser.add_argument('-c',dest='all_calls',help='calls file. Each line should be a path to a set of calls in bed format',required=True) parser.add_argument('-o',dest='output',help='output directory, reference panel database',required=True) parser.add_argument('-t',dest='threads',help='number of threads to use, default 1 (you really want to use more than 1)',default="1",required=False) return parser.parse_args() run_type, args = get_args() if run_type == 'plotsamples': command=""" conda_loc=$(which python) conda_bin=$(dirname $conda_loc) inputsamples="{samples}" probes="{probes}" calls="{calls}" gnomadfile="{af}" outputdir="{out}" refpanel="{ref_db}" threads="{threads}" vardb="{vardb}" repeat_masker="{repeat_masker}" mkdir -p workproband cp $inputsamples workproband/proband.samples cat $inputsamples | gargs --sep="\t" "ln -f -s {{3}} workproband/{{0}}.bam" cat $inputsamples | gargs --sep="\t" "ln -f -s {{4}} workproband/{{0}}.bai" # cat $inputsamples | gargs --sep="\t" "ln -f -s {{5}} workproband/{{0}}.vcf.gz" # cat $inputsamples | gargs --sep="\t" "ln -f -s {{6}} workproband/{{0}}.vcf.gz.tbi" mkdir -p workproband/Probes ln -s "$(cd "$(dirname "$probes")"; pwd)/$(basename "$probes")" workproband/Probes/probes.original.bed mkdir -p workproband/Calls ln -s "$(cd "$(dirname "$calls")"; pwd)/$(basename "$calls")" workproband/Calls/all.calls.original.txt ln -s "$(cd "$(dirname "$gnomadfile")"; pwd)/$(basename "$gnomadfile")" workproband/gnomad_sv.bed.gz ln -s "$(cd "$(dirname "$gnomadfile")"; pwd)/$(basename "$gnomadfile")".tbi workproband/gnomad_sv.bed.gz.tbi ln -s "$(cd "$(dirname "$vardb")"; pwd)/$(basename "$vardb")" workproband/vardb.bed.gz ln -s "$(cd "$(dirname "$vardb")"; pwd)/$(basename "$vardb")".tbi workproband/vardb.bed.gz.tbi ln -s "$(cd "$(dirname "$repeat_masker")"; pwd)/$(basename "$repeat_masker")" workproband/repeat_masker.bed.gz ln -s "$(cd "$(dirname "$repeat_masker")"; pwd)/$(basename "$repeat_masker")".tbi workproband/repeat_masker.bed.gz.tbi # put the name of the output dirrectory into a textfile for Snakemake to read in echo $outputdir > workproband/outputdir.txt # put the path to the reference panel into a textfie for Snakemake to read in echo $refpanel > workproband/reference_panel.txt # start the snakemake pipeline now they input files are in there proper locations snakemake -c $threads -s $conda_bin/run_proband.snake --configfile $conda_bin/proband_config.json --printshellcmds --rerun-incomplete --restart-times 3 """ command = command.format(samples=args.input_samples, probes=args.sites, calls=args.all_calls, out=args.output, ref_db=args.ref_db, af=args.gnomad, threads=args.threads, repeat_masker=args.repeat_masker, vardb=args.vardb) elif run_type == 'buildref': print('Building') command=""" conda_loc=$(which python) conda_bin=$(dirname $conda_loc) inputsamples="{samples}" probes="{probes}" calls="{calls}" outputdir="{out}" threads="{threads}" # create the workpanel dir, dump sym linked files in to a location they can be easily accessed by SnakeMake # will forably overwrite input files of the same name. All samples named MUST be unique. mkdir -p workpanel cp $inputsamples workpanel/panel.samples cat $inputsamples | gargs --sep="\t" "ln -f -s {{3}} workpanel/{{0}}.bam" cat $inputsamples | gargs --sep="\t" "ln -f -s {{4}} workpanel/{{0}}.bai" # cat $inputsamples | gargs --sep="\t" "ln -f -s {{5}} workpanel/{{0}}.vcf.gz" # cat $inputsamples | gargs --sep="\t" "ln -f -s {{6}} workpanel/{{0}}.vcf.gz.tbi" mkdir -p workpanel/Probes ln -f -s "$(cd "$(dirname "$probes")"; pwd)/$(basename "$probes")" workpanel/Probes/probes.original.bed mkdir -p workpanel/Calls ln -f -s "$(cd "$(dirname "$calls")"; pwd)/$(basename "$calls")" workpanel/Calls/all.calls.original.txt # put the name of the output dirrectory into a textfile for Snakemake to read in echo $outputdir > workpanel/outputdir.txt # start the snakemake pipeline now they input files are in there proper locations snakemake -c $threads -s $conda_bin/build_panel.snake --configfile $conda_bin/panel_config.json --restart-times 3 """ command = command.format(samples=args.input_samples,probes=args.sites,calls=args.all_calls,out=args.output,threads=args.threads) else: print('Internal Error, unexpected run_type') exit(1) os.system(command) |
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 | import collections from pysam import VariantFile from pysam import AlignmentFile from pysam import TabixFile import gzip import numpy as np Interval = collections.namedtuple('Interval', 'chrom start end data') def get_gene_cord_map(gene_bed): genes = {} f = gzip.open(gene_bed, 'rb') for l in f: A = l.rstrip().split() cord = Interval(chrom=A[0].decode('UTF-8'), start=int(A[1]), end=int(A[2]), data=None) name = A[3].decode('UTF-8') genes[name]=cord f.close() return genes def get_header(bed_file, tbx=None): if tbx is None: tbx = TabixFile(bed_file) return tbx.header def get_intervals_in_region(target_interval, bed_file, tbx=None, ignore=None): if tbx is None: tbx = TabixFile(bed_file) intervals = [] for row in tbx.fetch(target_interval.chrom, target_interval.start, target_interval.end): if ignore is not None and ignore in row: continue A = row.rstrip().split() cord = Interval(chrom=A[0], start=int(A[1]), end=int(A[2]), data=A[3:] if len(A)>3 else None) intervals.append(cord) return intervals def get_gene_exons(target_gene, gene_bed_file, exons_bed_file): gene_cord_map = get_gene_cord_map(gene_bed_file) gene_cord = gene_cord_map[target_gene] exons = get_intervals_in_region(gene_cord, exons_bed_file) return exons def get_gt_counts(sample, region, vcf_file=None, vcf_handle=None): vcf = None if vcf_handle is None: vcf = VariantFile(vcf_file) else: vcf = vcf_handle gt_count = {'HOM_REF':0, 'HET':0, 'HOM_ALT':0, 'UNK':0} for rec in vcf.fetch(region.chrom, region.start, region.end): gt = rec.samples[sample]['GT'] if gt[0] is None or gt[1] is None: gt_count['UNK'] += 1 elif gt[0] == gt[1]: if gt[0] == 0: gt_count['HOM_REF'] += 1 else: gt_count['HOM_ALT'] += 1 else: gt_count['HET'] += 1 return gt_count def get_gt_stats(region, target_sample=None, vcf_file=None, vcf_handle=None): vcf = None if vcf_handle is None: vcf = VariantFile(vcf_file) else: vcf = vcf_handle samples = None target_counts = [0,0,0] for rec in vcf.fetch(region.chrom, region.start, region.end): if samples is None: samples = {} for sample in rec.samples: samples[sample] = [0,0,0] for sample in rec.samples: gt = rec.samples[sample]['GT'] if gt[0] == None: samples[sample][2] += 1 elif gt[0] == gt[1]: samples[sample][0] += 1 else: samples[sample][1] += 1 gt = rec.samples[target_sample]['GT'] if gt[0] == None: target_counts[2] += 1 elif gt[0] == gt[1]: target_counts[0] += 1 else: target_counts[1] += 1 if samples is None: return None counts = [[],[],[]] for sample in samples: for i in range(3): counts[i].append(samples[sample][i]) count_stats = [] for i in range(3): m = np.mean(counts[i]) s = np.std(counts[i]) x = target_counts[i] z = 0 if s > 0 : z = (x-m)/s count_stats.append( [z,x,m,s] ) return count_stats # r = [] # gts = [] # ads = [] # dps = [] # gqs = [] # # # S = None # # # for rec in vcf.fetch(region.chrom, region.start, region.end): # samples = [] # if target_sample: # samples = [target_sample] # else: # samples = rec.sample # # for sample in samples: # ad = rec.samples[sample]['AD'] # gt = rec.samples[sample]['GT'] # dp = rec.samples[sample]['DP'] # gq = rec.samples[sample]['GQ'] # if gt[0] and gt[1]: # gts.append(gt) # ads.append(ad) # dps.append(dp) # gqs.append(gq) # return (gts, ads, dps, gqs) def get_coverage(region, bam_file=None, bam_handle=None): bam = None if bam_handle is None: bam = AlignmentFile(bam_file, 'rb') else: bam = bam_handle iter = bam.fetch(region.chrom, region.start, region.end) count = 0 for x in iter: count += 1 return count |
Support
- Future updates
Related Workflows





