Genome Assembly Validation via Inter-SUNK distances in ONT reads
Set up source files in config.yaml and ont.tsv.

config.yaml requires:
- a manifest of ONT data for validation ("ont.tsv")
- the k-mer length to use for validation ("SUNK_len", default: 20)

Optionally, for more detailed output plots (uses more RAM, experimental), set plot_detailed: True in config.yaml.
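As a rough sketch (not a verbatim copy of the shipped config), a config.yaml might look like the following; SUNK_len and plot_detailed are named in this README, while the key used for the ONT manifest is assumed here and should be checked against .test/config.yaml:

```yaml
# Hypothetical config.yaml sketch -- verify key names against .test/config.yaml
manifest: ont.tsv      # manifest of ONT data for validation (key name assumed)
SUNK_len: 20           # k-mer length used to define SUNKs
plot_detailed: False   # optional; True enables detailed plots (more RAM, experimental)
```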
The ONT manifest is a .tsv file ("ont.tsv") with the following columns (an illustrative manifest follows this list):
- sample: must be unique for each file
- hap[1,2]_ONT: location of ONT reads (gzipped or raw .fa or .fq)
- hap[1,2]_asm: location of the haplotype assembly (FASTA; a .fai index must also be present in the same directory)
- hap[1,2]_bed: BED-format regions of the assembly to visualize (optional)
- hap[1,2]_colotrack: BED-format color track to include in visualizations (optional, no header; columns: Chrom Start End Color)
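A hypothetical manifest for a single sample (tab-separated, assuming a header row with the column names above; all paths are placeholders, and the optional columns may be omitted) might look like:

```
sample	hap1_ONT	hap2_ONT	hap1_asm	hap2_asm	hap1_bed	hap2_bed
HG02723	ont/hap1_reads.fq.gz	ont/hap2_reads.fq.gz	asm/hap1.fa	asm/hap2.fa	regions/hap1.bed	regions/hap2.bed
```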
I recommend providing a single input file per haplotype and phasing haplotypes with canu and parental Illumina data (not currently incorporated into this pipeline). Hi-C phasing of ONT reads is also possible: see the "hic" branch.
Prerequisites: Snakemake (tested with versions 6.12.1, 7.8.2, 7.14.0)
To run the Snakefile locally on the provided test cases (the AMY locus of the CHM13/1 pseudodiploid and of HG02723), generating the optional image outputs, execute:

```shell
git clone https://github.com/pdishuck/GAVISUNK
cd GAVISUNK
snakemake -R --use-conda --cores 8 --configfile .test/config.yaml --resources load=1000
```
BED results are written to the results/[sample]/final_outs directory, along with .tsv files summarizing the probability of each validation gap having been spanned by ONT reads, given the input library.
Automated visualizations of validation gaps are written to the results/pngs/gaps/[sample] directory.
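As a minimal sketch of how the gap summary might be triaged downstream (assuming the .tsv produced by workflow/scripts/covprob.py, whose columns include Chromosome, Start, End, max_gap, and covprob; the file path below is illustrative):

```python
import pandas as pd

# Illustrative path; the actual file name under final_outs/ may differ.
gaps = pd.read_csv("results/SAMPLE/final_outs/hap1.gaps.tsv", sep="\t")

# Per the README, covprob is the probability that a gap of this size would
# have been spanned by the input ONT library, so gaps with high covprob that
# nonetheless remain unvalidated are one reasonable set to inspect first.
top = gaps.sort_values("covprob", ascending=False).head(20)
print(top[["Chromosome", "Start", "End", "max_gap", "covprob"]])
```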
For the included test cases, the following output should be generated, corresponding to the region displayed in Figure 1 of the manuscript: results/gaps/AMY_HG02723/AMY_HG02723_hap1_AMY_h1_84861_524275.png
Example SGE execution (your cluster's parameters may vary):

```shell
alias snakesub='mkdir -p log; snakemake --ri --jobname "{rulename}.{jobid}" --drmaa " -V -cwd -j y -o ./log -e ./log -l h_rt=48:00:00 -l mfree={resources.mem}G -pe serial {threads} -w n -S /bin/bash" -w 60'
snakesub --use-conda -j150 > snakelog 2>&1 &
```
Troubleshooting tips:
- To get Snakemake conda environments to work correctly, you may need to deactivate your local conda environment ($PATH issues, as in https://github.com/snakemake/snakemake/issues/883).
- GAVISUNK is written to execute from its top-level directory.
- Old versions of dependencies that are in your $PATH by default may interfere with GAVISUNK operation. See the dependencies in workflow/envs/viz.yaml.
How to cite:
Philip C Dishuck, Allison N Rozanski, Glennis A Logsdon, David Porubsky, Evan E Eichler, GAVISUNK: genome assembly validation via inter-SUNK distances in Oxford Nanopore reads, Bioinformatics, Volume 39, Issue 1, January 2023, btac714, https://doi.org/10.1093/bioinformatics/btac714
Code Snippets
```
shell: """
    # combine both haplotype assemblies and index the result
    zcat -f {input.hap1_asm} {input.hap2_asm} > {output.combined}
    samtools faidx {output.combined}
    """
```
```
shell: """
    # count canonical k-mers of length {params.sunk_len}; -U 1 keeps only k-mers seen exactly once (SUNKs)
    jellyfish count -m {params.sunk_len} -s 10000000 -t {threads} -C -c 1 -U 1 {input.asm} -o {output.counts}
    """
```
```
shell: """
    # dump SUNK sequences to a plain-text list and to FASTA (each record named by its own sequence)
    jellyfish dump -c -t {input.counts} | awk '{{print $1}}' > {output.db}
    awk '{{ print ">"$0"\\n"$0 }}' {output.db} > {output.fa}
    """
```
```
shell: """
    mrsfast --ws 14 --index {input.asm}
    """
```
```
shell: """
    # map SUNKs back to the combined assembly with zero mismatches, then sort the alignments
    mrsfast --search {input.ref} --threads {threads} --mem 16 --seq {input.db} -o {output.sam} --disable-nohits -e 0
    samtools sort -@ {threads} {output.sam} -o {output.bam}
    """
```
```
shell: """
    # convert SUNK alignments to BED, merge adjacent/overlapping SUNKs,
    # and record each SUNK together with the merged group it falls in
    bedtools bamtobed -i {input.bam} > {output.bed} && \
    bedtools merge -i {output.bed} > {output.bedmerge} && \
    bedtools intersect -a {output.bed} -b {output.bedmerge} -wo | cut -f 1,2,4,8 > {output.locs}
    """
```
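For orientation, downstream scripts (e.g., covprob.py below) read the resulting .locs table with columns named chrom, loc, kmer, ID — i.e., the contig, the SUNK's start position, the SUNK sequence itself (the FASTA records are named by their own sequence), and the start coordinate of the merged SUNK group it belongs to. A hypothetical line (placeholder values) might look like:

```
contig_1	184201	ACGTACGTACGTACGTACGT	184150
```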
```
script: "../scripts/viz_detailed.py"
```
```
script: "../scripts/viz.py"
```
```
script: "../scripts/covprob.py"
```
```
shell: """
    # normalize reads to FASTQ (seqtk adds dummy '#' qualities) and split into chunks for parallel processing
    cat {input.reads} | seqtk seq -F '#' | rustybam fastq-split {output.reads}
    """
```
```
shell: """
    workflow/scripts/kmerpos_annot3 {input.ONT} {input.db} {input.locs} {output}
    """
```
```
shell: """
    cat {input.gather_ONT_pos} > {output.ONT_pos}
    """
```
```
shell: """
    workflow/scripts/rlen {input.ONT} {output}
    """
```
```
shell: """
    workflow/scripts/diag_filter_v3 {input.ONT_pos} {input.fai} > {output.ONT_pos_diag}
    """
```
```
shell: """
    workflow/scripts/diag_filter_step2 {input.ONT_pos} {input.ONT_pos_diag} > {output.ONT_pos_diag_final}
    """
```
```
shell: """
    cat {input.gather_ONT_pos} > {output.ONT_pos}
    cat {input.gather_ONT_len} > {output.ONT_len}
    """
```
```
shell: """
    python workflow/scripts/badsunks_AR.py {input.hap1_fai} {input.hap2_fai} {input.hap1_sunkpos} {input.hap2_sunkpos} {output.badsunks}
    """
```
```
script: "../scripts/split_locs.py"
```
```
shell: """
    python workflow/scripts/process-by-contig_lowmem_AR.py {input.locs_contig} {input.sunk_pos_contig} {input.rlen} {input.bad_sunks} {output.outputdf} {output.bed}
    touch {output.bed}
    """
```
```
shell: '''
    cat {input.bed} > {output.allbeds}
    '''
```
```
shell: """
    python workflow/scripts/get_gaps.py {input.hap1_fai} {input.hap2_fai} {wildcards.sample} results/{wildcards.sample}/bed_files/ results/{wildcards.sample}/final_out/
    """
```
```
shell: """
    # pad each validation gap by 200 kbp on both sides
    bedtools slop -i {input.gaps} -g {input.fai} -b 200000 > {output.gaps_slop}
    """
```
```python
# workflow/scripts/badsunks_AR.py
import argparse
import pandas as pd
from matplotlib import (pyplot as plt, lines)
import seaborn as sns
import os

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("fai1", help="list of contigs, hap1")
    parser.add_argument("fai2", help="list of contigs, hap2")
    parser.add_argument("sunkpos1", help=".sunkpos file of SUNK locations on hap1 ONT reads")
    parser.add_argument("sunkpos2", help=".sunkpos file of SUNK locations on hap2 ONT reads")
    parser.add_argument("outpath", help="output file")
    args = parser.parse_args()

    # hap1: count how often each SUNK group is observed across ONT reads
    sunkposcat = pd.read_csv(args.sunkpos1, sep="\t", header=None,
                             names=['rname','pos','chrom','start','ID'],
                             dtype={'rname':'string','pos':'uint32','chrom':'category','start':'uint32','ID':'uint32'})
    sunkposcat['ID2'] = sunkposcat['chrom'].astype(str) + ":" + sunkposcat['ID'].astype(str)
    kmer_counts = sunkposcat.ID2.value_counts()
    kmer_counts2 = pd.DataFrame(kmer_counts, index=None)
    kmer_counts2.reset_index(inplace=True)
    kmer_counts2.rename(columns={'ID2':'count','index':'ID2'}, inplace=True)
    kmer_counts2[['contig','ID']] = kmer_counts2.ID2.str.split(":", 1, expand=True)
    fai = pd.read_csv(args.fai1, sep='\t', header=None, names=['contig','length','cumlen','3','4'])
    contiglist = set(fai.contig.tolist())
    kmer_counts2['correct_hap'] = kmer_counts2['contig'].isin(contiglist)
    kmer_counts2.correct_hap.value_counts()
    fig, ax = plt.subplots(figsize=(22,22))
    g = sns.histplot(data=kmer_counts2.query('count < 150 & count > 1'), x='count', binwidth=1, hue='correct_hap')
    # plt.savefig("badsunks_hap1.png")
    # meancov = kmer_counts2[kmer_counts2['count'] > kmer_counts2['count'].median()/2]['count'].mode()
    meancov = kmer_counts2.query("correct_hap")['count'].mode().values[0]
    print("mean coverage", meancov)
    limit = meancov + (meancov)**0.5*4  # 4 SDs above mean (though apparent mean is depressed by seq accuracy)
    badsunks1a = set(kmer_counts2.query("correct_hap").query("(count > @limit) or (count < 2)")['ID2'])
    print(len(badsunks1a))
    badsunks1b = set(kmer_counts2.query("not correct_hap").query("(count > @limit)")['ID2'])
    print(len(badsunks1b))

    # hap2: repeat the same procedure
    sunkposfile = args.sunkpos2
    sunkposcat = pd.read_csv(sunkposfile, sep="\t", header=None,
                             names=['rname','pos','chrom','start','ID'],
                             dtype={'rname':'string','pos':'uint32','chrom':'category','start':'uint32','ID':'uint32'})
    sunkposcat['ID2'] = sunkposcat['chrom'].astype(str) + ":" + sunkposcat['ID'].astype(str)
    kmer_counts = sunkposcat.ID2.value_counts()
    kmer_counts2 = pd.DataFrame(kmer_counts, index=None)
    kmer_counts2.reset_index(inplace=True)
    kmer_counts2.rename(columns={'ID2':'count','index':'ID2'}, inplace=True)
    kmer_counts2[['contig','ID']] = kmer_counts2.ID2.str.split(":", 1, expand=True)
    fai = pd.read_csv(args.fai2, sep='\t', header=None, names=['contig','length','cumlen','3','4'])
    contiglist = set(fai.contig.tolist())
    kmer_counts2['correct_hap'] = kmer_counts2['contig'].isin(contiglist)
    kmer_counts2.correct_hap.value_counts()
    fig, ax = plt.subplots(figsize=(22,22))
    g = sns.histplot(data=kmer_counts2.query('count < 150 & count > 1'), x='count', binwidth=1, hue='correct_hap')
    # plt.savefig("badsunks_hap2.png")
    # meancov = kmer_counts2[kmer_counts2['count'] > kmer_counts2['count'].median()/2]['count'].mode()
    meancov = kmer_counts2.query("correct_hap")['count'].mode().values[0]
    print("mean coverage", meancov)
    limit = meancov + (meancov)**0.5*4  # 4 SDs above mean (though apparent mean is depressed by seq accuracy)
    badsunks2a = set(kmer_counts2.query("correct_hap").query("(count > @limit) or (count < 2)")['ID2'])
    print(len(badsunks2a))
    badsunks2b = set(kmer_counts2.query("not correct_hap").query("(count > @limit)")['ID2'])
    print(len(badsunks2b))

    # union of over-represented / under-represented SUNKs from both haplotypes
    badsunks_all = badsunks1a | badsunks1b | badsunks2a | badsunks2b
    print(len(badsunks_all))
    outfile = open(args.outpath, "w")
    for element in badsunks_all:
        outfile.write(element + "\n")
    outfile.close()

main()
```
```python
# workflow/scripts/covprob.py
import pandas as pd, numpy as np
import os
from sympy.solvers import solve, solveset
from sympy import Symbol, Reals
import pyranges as pr
import argparse

print(snakemake.input)

# genome_size=3.055e6
fai = pd.read_csv(snakemake.input.fai, sep='\t', header=None, names=['contig','length','cumlen','3','4'])
genome_size = fai['length'].sum()/1000
# print("assembly size: ", genome_size)

df = pd.read_csv(snakemake.input.bed, header=None, names="Chromosome Start End".split(), sep="\t")
df['type'] = "gap"
# df2 = pd.read_csv(breakdir + "hap" + hap + ".nodata.bed", header=None, names="Chromosome Start End".split(), sep="\t")
# df2['type'] = 'nodata'
# df3 = pd.concat([df,df2])
df3 = df
df3.reset_index(inplace=True)
breaks = pr.PyRanges(df3)
# df3.eval("End - Start").sort_values()

# kmermergefile = os.path.dirname(snakemake.input.pos_locs) +"/"+ cont2 + "_" + snakemake.wildcards.hap + ".loc"
kmermerge = pd.read_csv(snakemake.input.locs, sep="\t", header=None, names=['chrom','loc','kmer','ID'])  # this could be split too
kmermerge = kmermerge.drop_duplicates(subset='kmer')
kmermerge['ID2'] = kmermerge['chrom'].astype(str) + ":" + kmermerge['ID'].astype(str)
kmer_groups = kmermerge.drop_duplicates(subset='ID2')
kmer_groups['dist'] = kmer_groups.ID.diff().fillna(kmer_groups.ID).astype(int)
kmer_groups['dist'] = kmer_groups['dist'].map(lambda x: max(x, 0))

lendf = pd.read_csv(snakemake.input.rlen, sep="\t", header=None, names=['rname','len'], dtype={'rname':'string','len':'uint32'})
lendf.drop_duplicates(inplace=True)
lendf.set_index("rname", drop=True, inplace=True)
lendf.sort_values(by="len", ascending=False, inplace=True)
lendf['bp'] = lendf['len'].cumsum()
# lendf['cov'] = lendf['bp']/3.055e9
lendf['kbp'] = lendf['len']/1000
lendf['kbp'] = lendf['kbp'].astype(int)
kbp_cnt = lendf['kbp'].value_counts(sort=False)

def cov_prob(intsize, readlen, readcnt, basequal=1):
    # everything in kbp
    if intsize > readlen:
        # print("no help here")
        return 1
    return ((1 - (((readlen - intsize) / genome_size) * (basequal**2)))**(readcnt))

x = Symbol('x')
p = 0.94  # read quality
q = 1 - p
r = int(snakemake.config['SUNK_len'])  # kmer size
roots2 = solveset(1 - x + q*((p)**(r)) * ((x)**(r+1)), x, domain=Reals)
pinv = 1/p
roots2 = [x for x in roots2 if x > 0]
roots2.sort(key=lambda e: abs(pinv - e))
assert(len(roots2) == 2)
roots3 = roots2[1]
print("Should be unequal: ", roots3, pinv)
n = 30  # group size
qn = ((1 - p*roots3)/(q*(r + 1 - r*roots3))) * (1 / (roots3**(n+1)))
pn = float(1 - qn)
# print(qn, pn)  # odds perfect 20mer not seen in kmergroup length 22

covprobs = []
for intsize in range(1, 3500):
    cum_prob = 1
    for i, x in kbp_cnt.iteritems():
        prob = cov_prob(intsize, i, x, pn)
        cum_prob = cum_prob * prob
    covprobs.append((intsize, 1 - cum_prob))

covs, probs = zip(*covprobs)
covprobsdict = {k: v for k, v in covprobs}
covprobsdict[0] = 1.0
for cov, prob in covprobs:
    if prob <= 0.99:
        print(cov, prob)
        break

max_gaps = []
max_gaps2 = []
for x in breaks.df.itertuples():
    ix, ix2, contig, regstart, regend, types = x
    xmin = regstart
    xmax = regend
    kmers = kmer_groups.query("(chrom == @contig) & (ID > (@xmin-2)) & (ID < (@xmax+2))")
    kmerlist = list(pd.unique(kmers.ID))
    max_gaps.append(kmers['dist'].max())
    max_gaps2.append((contig, regstart, regend, kmers['dist'].max(), kmers.iloc[kmers['dist'].argmax()].kmer))

df1 = breaks.df
df1['max_gap'] = max_gaps
df1['covprob'] = df1['max_gap'].apply(lambda x: covprobsdict[int(x/1000)])
# df1.sort_values(by='covprob').head(25)
# df1.drop(columns='index').to_csv(snakemake.output.tsv,index=False,sep="\t")
df1.to_csv(snakemake.output.tsv, index=False, sep="\t")
```
```python
# workflow/scripts/get_gaps.py
import pandas as pd, numpy as np
import os
import argparse
import pyranges as pr

parser = argparse.ArgumentParser()
parser.add_argument("fai1")
parser.add_argument("fai2")
parser.add_argument("sample")
parser.add_argument('indir')
parser.add_argument('outdir')
args = parser.parse_args()
outdir = args.outdir
indir = args.indir
samp = args.sample

def main():
    # hap1: identify gaps between merged validated regions on each contig
    print("hap1")
    fai = pd.read_csv(args.fai1, sep='\t', header=None, names=['contig','length','cumlen','3','4'])
    contigs = fai.contig.tolist()
    allbreaks = {}
    outbreaks = []
    bp1 = 0
    bp2 = 0
    nodata = []
    for x in contigs:
        cont2 = x.replace("#", "_")
        contiglen = fai.query("contig == @x").length.values[0]
        try:
            bedin = pd.read_csv(indir + cont2 + "_hap1.bed", sep="\t", header=None, names=['contig','start','end'])
        except:
            print("No bed for", x)
            nodata.append((x, 0, contiglen))
            bp2 += contiglen
            continue
        if len(bedin) == 0:
            nodata.append((x, 0, contiglen))
            bp2 += contiglen
            continue
        bed = pr.PyRanges(bedin.rename(columns={'contig':'Chromosome','start':'Start','end':'End'}))
        bedin2 = bed.merge().df
        bedin2.columns = ['contig', 'start', 'end']
        bedin2['end_prev'] = bedin2['end'].shift(1, fill_value=np.nan)
        breaks = []
        first = True
        for row in bedin2.itertuples():
            if row.end < row.end_prev:
                print("weird", x)
                break
            if first:
                first = False
                # print("confirmed region begins: ", row.start)
            else:
                breaks.append((int(row.end_prev), row.start,))
                outbreaks.append((x, int(row.end_prev), row.start - 1,))
                bp1 += (row.start - int(row.end_prev))
        # print("unconfirmed bp at contig end: ", contiglen - row.end)
        allbreaks[x] = breaks
    print("breaks: ", len(outbreaks), bp1)
    print("no data: ", len(nodata), bp2)
    pd.DataFrame(nodata).to_csv(outdir + "hap1.nodata.bed", header=False, sep="\t", index=False)
    pd.DataFrame(outbreaks).to_csv(outdir + "hap1.gaps.bed", header=False, sep="\t", index=False)

    # hap2: repeat the same procedure
    fai = pd.read_csv(args.fai2, sep='\t', header=None, names=['contig','length','cumlen','3','4'])
    contigs = fai.contig.tolist()
    print("hap2")
    allbreaks = {}
    outbreaks = []
    bp1 = 0
    bp2 = 0
    nodata = []
    for x in contigs:
        cont2 = x.replace("#", "_")
        contiglen = fai.query("contig == @x").length.values[0]
        try:
            bedin = pd.read_csv(indir + cont2 + "_hap2.bed", sep="\t", header=None, names=['contig','start','end'])
        except:
            print("No bed for", x)
            nodata.append((x, 0, contiglen))
            bp2 += contiglen
            continue
        if len(bedin) == 0:
            nodata.append((x, 0, contiglen))
            bp2 += contiglen
            continue
        bed = pr.PyRanges(bedin.rename(columns={'contig':'Chromosome','start':'Start','end':'End'}))
        bedin2 = bed.merge().df
        bedin2.columns = ['contig', 'start', 'end']
        bedin2['end_prev'] = bedin2['end'].shift(1, fill_value=np.nan)
        breaks = []
        first = True
        for row in bedin2.itertuples():
            if first:
                first = False
            else:
                breaks.append((int(row.end_prev), row.start,))
                outbreaks.append((x, int(row.end_prev), row.start - 1,))
                bp1 += (row.start - int(row.end_prev))
        allbreaks[x] = breaks
    print("breaks: ", len(outbreaks), bp1)
    print("no data: ", len(nodata), bp2)
    pd.DataFrame(nodata).to_csv(outdir + "hap2.nodata.bed", header=False, sep="\t", index=False)
    pd.DataFrame(outbreaks).to_csv(outdir + "hap2.gaps.bed", header=False, sep="\t", index=False)

main()
```
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 | import networkx as nx import graph_tool as gt from graph_tool.topology import extract_largest_component import timeit from scipy.spatial.distance import pdist import itertools as it import numpy as npt import pandas as pd, numpy as np from matplotlib import (pyplot as plt, lines) import seaborn as sns import timeit from matplotlib.patches import Circle, Wedge, Polygon from matplotlib.collections import PatchCollection import matplotlib.pyplot as plt import seaborn as sns import numpy as np from matplotlib import (pyplot as plt, lines) import matplotlib.ticker as mtick import argparse def main(): parser = argparse.ArgumentParser() parser.add_argument("SUNKs", help=".loc file of SUNKs aligned to assembly") parser.add_argument("sunkpos", help=".sunkpos file of SUNK locations on ONT reads") parser.add_argument("rlen", help=".rlen file of ONT read lengths") parser.add_argument("badsunks", help="list of SUNKs to exclude") parser.add_argument("outputfile", help="output file for intermediate") parser.add_argument("outputbed", help="output file for bed") parser.add_argument("--minlen", help="minimum length filter for reads to visualize",default=10000,type=int) parser.add_argument("--opt_filt", help="apply optional filter",action='store_true') args = parser.parse_args() #contig = args.contig ofile = args.outputfile lendf = pd.read_csv(args.rlen,sep="\t",header=None,names=['rname','len'],dtype={'rname':'string','len':'uint32'}) lendf.set_index("rname",drop=True,inplace=True) minlen = args.minlen kmermergefile = args.SUNKs kmermerge = pd.read_csv(kmermergefile,sep="\t",header=None,names=['chrom','loc','kmer','ID'])# this could be split too kmermerge = kmermerge.drop_duplicates(subset='kmer') kmermerge['ID2'] = kmermerge['chrom'].astype(str) +":"+ kmermerge['ID'].astype(str) sunkposfile = args.sunkpos sunkposcat = pd.read_csv(sunkposfile,sep="\t",header=None,names=['rname','pos','chrom','start','ID'],dtype={'rname':'string','pos':'uint32','chrom':'category','start':'uint32','ID':'uint32'})# contig = sunkposcat['chrom'].unique().tolist()[0] ## get list of bad sunks with open(args.badsunks, "r") as f: badsunkin = set(f.read().splitlines()) sunkposcat['ID2'] = sunkposcat['chrom'].astype(str) +":"+ sunkposcat['ID'].astype(str) sunkposcat = sunkposcat.query("ID2 not in @badsunkin") # contig = row.chrom # regstart = row.start # regend = row.end # xmin = regstart # xmax = regend # contig = 'chr20' kmer_groups = kmermerge.drop_duplicates(subset='ID') # pd.options.mode.chained_assignment = None # default='warn kmer_groups['dist'] = kmer_groups.ID.diff().fillna(kmer_groups.ID).astype(int) 
#rows) # contig = bedreg1['chrom'].values[0] # xmin = bedreg1['start'].values[0] # xmax = bedreg1['end'].values[0] #Filter 1: only keep reads with at least two SUNKs within target region multisunk = sunkposcat.groupby('rname',as_index=False).agg({'ID':'nunique'}).sort_values(by='ID').query('ID > 1') if len(multisunk) == 0: pd.DataFrame(list([contig])).to_csv(ofile,header=False,sep="\t",index=False) exit() # continue multisunkset = set(multisunk.rname.tolist()) ### RESTRICT TO REGION sunkpos3 = sunkposcat.query('rname in @multisunkset ').sort_values(by=['rname','start']) # reads with multiple SUNK group matches # & start >= @xmin & start <= @xmax # sunkpos3 = sunkpos2.drop_duplicates(subset=['rname','ID'],keep='first') # one representation per read * SUNK group # done in kmerpos_annot script sunkpos3b = sunkpos3.drop_duplicates() minlen = 10000 minlenset = set(lendf.query("len >= @minlen").index.tolist()) sub = sunkpos3b.query("rname in @minlenset") #sunkpos3 # part2 bookmark # minlenset = set(lendf.query("len >= @minlen").index.tolist()) ### keeep pos start = timeit.default_timer() sub['start'] = sub['start'].astype('int64') sub['pos'] = sub['pos'].astype('int64') # rnamelist = list(pd.unique(distdf2.rname)) # rnamelist = ['00060b17-6288-4b65-a67e-93e7498e0633'] # rnamelist = ['fffdf958-07a2-4bc1-95aa-7c6b05cc70c6'] outputs= [] counter=0 subgraphs=[] start2=timeit.default_timer() times=[] sub['start'] = sub['start'].astype('int64') sub['pos'] = sub['pos'].astype('int64') grouped = sub.groupby("rname") # .drop(columns='rname') for rname, g in grouped: # if counter > 10:break counter +=1 start=timeit.default_timer() startarray = pdist(g[['start']]) posarray = pdist(g[['pos']]) signarray = pdist(g[['pos']], lambda u,v: u[0]>v[0] ) idarray = list(it.combinations(g['ID'],2)) locarray = list(it.combinations(g['pos'],2)) # keep track of ID-Pos corresponence with np.errstate(divide='ignore'): # kmers occasionally seen in multiple locations on same read diffarray = posarray/startarray mask = np.logical_not((diffarray >= 1.1) + (diffarray <= 0.9)) if sum(mask) < 1: continue # only calculate on masked version (distdf2 equiv) signarray = signarray.astype(int) vals,counts = np.unique(signarray[mask], return_counts = True) trueOrient = vals[np.argmax(counts)] mask2 = np.logical_and(mask,(signarray == trueOrient)) stack = np.column_stack([idarray,locarray])[mask2,:] sub2 = pd.DataFrame(np.column_stack([idarray,locarray])[mask2,:]) sub2.columns = ['ID','ID2','pos1','pos2'] mergedf = pd.DataFrame(np.row_stack([stack[:,[0,2]],stack[:,[1,3]]])) mergedf.columns = ['ID','pos'] mergedf = pd.concat( [ sub2[['ID','pos1']].rename(columns={'pos1':'pos'}), sub2[['ID2','pos2']].rename(columns={'ID2':'ID','pos2':'pos'}) ]) # .drop_duplicates().query("ID > @xmin & ID < @xmax") multipos = mergedf.groupby("ID")['pos'].agg(nunique='nunique').query("nunique>1") if len(multipos)>=1: badrowlist = [] # multipos for i in multipos.index: # i=38570942 goodpos = mergedf.query("ID == @i")['pos'].value_counts().index[0] # any checks that this is good? 
badrowlist += list(sub2.query("(ID == @i) & not ((pos1 == @goodpos)|(pos2 == @goodpos))").index) sub2b = sub2.drop(badrowlist) else: sub2b = sub2 mergedf2 = pd.concat( [ sub2b[['ID','pos1']].rename(columns={'pos1':'pos'}), sub2b[['ID2','pos2']].rename(columns={'ID2':'ID','pos2':'pos'}) ]) # .drop_duplicates().query("ID > @xmin & ID < @xmax") g = gt.Graph(directed=False) # g.vp.name = g.new_vertex_property("int64_t") name = g.add_edge_list(sub2b[['ID','ID2']].values,hashed=True, hash_type='int64_t') # eprops = [a_pos1,a_pos2] g.vp.name = name glarge = gt.topology.extract_largest_component(g) glarge.vp.name.get_array() df=pd.DataFrame(g.vp.name.get_array()[glarge.get_vertices()],columns=['ID']) df['rname'] = rname outputs.append(df) end=timeit.default_timer() times.append(end-start) if len(outputs) == 0: pd.DataFrame(list([contig])).to_csv(ofile,header=False,sep="\t",index=False) exit() else: outputsdf = pd.concat(outputs) outputsdf.to_csv(ofile,header=False,sep="\t",index=False) end2=timeit.default_timer() # made outputdf ### make contig-wide graph graphtimes=[] start = timeit.default_timer() graphtimes.append(start) idarray = pd.DataFrame(outputsdf.groupby("rname").apply(lambda g: list(it.combinations(g['ID'],2)))).explode([0]) end = timeit.default_timer() graphtimes.append(end) idarray.rename(columns={0:'IDs',},inplace=True) idarray[['ID', 'ID2']] = pd.DataFrame(idarray['IDs'].tolist(), index=idarray.index) graphtimes=[] end = timeit.default_timer() graphtimes.append(end) g = gt.Graph(directed=False) # g.vp.name = g.new_vertex_property("int64_t") name = g.add_edge_list(idarray[['ID','ID2']].values,hashed=True, hash_type='int64_t') # eprops = [a_pos1,a_pos2] g.vp.name = name end = timeit.default_timer() graphtimes.append(end) # glarge = gt.topology.extract_largest_component(g) c = gt.topology.label_components(g)[0] end = timeit.default_timer() graphtimes.append(end) regions = [] for s in list(pd.unique(c.a)): color='lightgray' offset=counter sunks = g.vp.name.get_array()[gt.GraphView(g, vfilt=c.a == s).get_vertices()] if len(sunks) <= 2: continue start = int(sunks.min()) end = int(sunks.max()) span = end-start regions.append((start,end,span,len(sunks))) # ax.plot([start,end],[rownum+offset,rownum+offset],color=color,linewidth=9,solid_capstyle='butt',zorder=1) end = timeit.default_timer() counter +=1 regdf = pd.DataFrame(regions,columns=['start','end','span','sunks']).sort_values(by='start') regdf['contig'] = contig regdf[['contig','start','end']].to_csv(args.outputbed,header=False,sep="\t",index=False) main() |
```python
# workflow/scripts/split_locs.py
import pandas as pd
import os

def splitlocs(hap_type, group_df, dirname, exten):
    # write one .sunkpos / .loc file per contig
    for name, contig_group in group_df:
        contig_name = name.replace("#", "_")
        output_path = f'{dirname}/{contig_name}_{hap_type}.{exten}'
        print(output_path)
        contig_group.to_csv(output_path, header=None, index=None, sep="\t")
    return(True)

ONT_pos_df = pd.read_csv(snakemake.input.ONT_pos, header=None, sep="\t",
                         names=['read','read_pos','contig','contig_start','contig_stop'])
kmer_loc = pd.read_csv(snakemake.input.kmer_loc, header=None, sep="\t",
                       names=['contig','pos1','SUNK','pos2'])
contig_list = ONT_pos_df['contig'].unique().tolist()
kmer_loc_subset = kmer_loc[kmer_loc['contig'].isin(contig_list)]
dir_name = os.path.dirname(snakemake.output.flag)
group_ONT = ONT_pos_df.groupby(['contig'])
group_loc = kmer_loc_subset.groupby(['contig'])
output_complete = splitlocs(snakemake.wildcards.hap, group_ONT, dir_name, "sunkpos")
output_complete_2 = splitlocs(snakemake.wildcards.hap, group_loc, dir_name, "loc")
```
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 | import pandas as pd from matplotlib import (pyplot as plt, lines) from matplotlib.collections import PatchCollection from matplotlib.patches import Circle, Wedge, Polygon, Rectangle import seaborn as sns import numpy as nptwi import matplotlib.ticker as mtick import argparse #print(pd.__version__) import os from ncls import NCLS print(snakemake.input) runmode=snakemake.wildcards.runmode opt_keys = list(snakemake.input.keys()) regbedfile = snakemake.input.bed if os.stat(regbedfile).st_size == 0: print("empty gap file") exit() bedreg1 = pd.read_csv(regbedfile, delimiter="\t",encoding='utf-8',header=None) header = ['chrom','start','end'] bedreg1.columns = header + [''] * (len(bedreg1.columns) - len(header)) lendf = pd.read_csv(snakemake.input.rlen,sep="\t",header=None,names=['rname','len'],dtype={'rname':'string','len':'uint32'}) lendf.drop_duplicates(inplace=True) lendf.set_index("rname",drop=True,inplace=True) bedreg1['chrom'].value_counts() contigs = list(bedreg1['chrom'].value_counts().index) # print(len(contigs)) if 'colorbed' in opt_keys: dm = pd.read_csv(snakemake.input.colorbed, delim_whitespace=True, names=['chr','chrStart','chrEnd','color'], header=None,dtype = {'chr':'string','chrStart':'int','chrEnd':'int','color':'string'}) dm = dm[dm.chrStart >= 0] dm["y"] = 1 dm["func"] = "" # sunkposfile = os.path.dirname(snakemake.input.pos_locs)+"/"+ cont2 + "_" + snakemake.wildcards.hap + ".sunkpos" sunkposfile = snakemake.input.pos_locs sunkposcat = pd.read_csv(sunkposfile,sep="\t",header=None,names=['rname','pos','chrom','start','ID'],dtype={'rname':'string','pos':'int64','chrom':'category','start':'int64','ID':'int64'}) sunkposcat.set_index("rname",drop=True,inplace=True) sunkposcat['ID2'] = sunkposcat['chrom'].astype(str) +":"+ sunkposcat['ID'].astype(str) for contig in contigs: cont2 = contig.replace("#","_") bedreg= bedreg1.query('chrom == @contig') #print(type(snakemake.input.interout)) #my_str=os.path.dirname(snakemake.input.interout) #print(my_str) try: outputsdf= pd.read_csv(os.path.dirname(snakemake.input.interout) + "/" + cont2 + "_" + snakemake.wildcards.hap + ".tsv" ,sep="\t",names=['ID','rname'],dtype = {'ID':'int','rname':'string'}) except: print("outputsdf not found for", contig) outputsdf.set_index("rname",inplace=True,drop=True) if len(outputsdf) <2: print("NO READS 
FOR", contig) kmermergefile = os.path.dirname(snakemake.input.splits) +"/"+ cont2 + "_" + snakemake.wildcards.hap + ".loc" kmermerge = pd.read_csv(kmermergefile,sep="\t",header=None,names=['chrom','loc','kmer','ID'])# this could be split too kmermerge = kmermerge.drop_duplicates(subset='kmer') plotdir = os.path.dirname(snakemake.output.flag) + "/" #print(plotdir) kmermerge['ID2'] = kmermerge['chrom'].astype(str) +":"+ kmermerge['ID'].astype(str) kmer_groups = kmermerge.drop_duplicates(subset='ID') # kmer_groups['dist'] = kmer_groups.ID.diff().fillna(kmer_groups.ID).astype(int) #rows) sunkpos_sub = sunkposcat.query("chrom == @contig") multisunk = sunkpos_sub.groupby(sunkpos_sub.index,as_index=True).agg({'ID':'nunique'}).sort_values(by='ID').query('ID > 1') if len(multisunk) == 0: print("No viable reads for this region: "+str(contig)+"_"+str(regstart)+"_"+str(regend)) multisunklist = list(multisunk.index) multisunkset = set(multisunklist) ### RESTRICT TO REGION # sunkpos3 = sunkposcat.loc[multisunklist] # reads with multiple SUNK group matches # & start >= @xmin & start <= @xmax # orig sunkposfile stuff sunkposfile_split = os.path.dirname(snakemake.input.splits)+"/"+ cont2 + "_" + snakemake.wildcards.hap + ".sunkpos" sunkposcat_split = pd.read_csv(sunkposfile_split,sep="\t",header=None,names=['rname','pos','chrom','start','ID'],dtype={'rname':'string','pos':'int64','chrom':'category','start':'int64','ID':'int64'}) sunkposcat_split['ID2'] = sunkposcat_split['chrom'].astype(str) +":"+ sunkposcat_split['ID'].astype(str) sunkpos3 = sunkposcat_split.query('rname in @multisunkset ').sort_values(by=['rname','start']) # reads with multiple SUNK group matches # & start >= @xmin & start <= @xmax sunkpos3.set_index("rname",drop=True,inplace=True) # sunkpos3.set_index("rname",drop=True,inplace=True) for row in bedreg.itertuples(): contig = row.chrom regstart = row.start regend = row.end xmin = regstart xmax = regend breakloc = int((xmin + xmax)/2) rightset = set(outputsdf.query("ID > @breakloc").index) # print("skipped ", skip) # print(len(pd.unique(outputsdf['rname']))) # end = timeit.default_timer() # print(end-start) # sunklines=True # PLOT df = outputsdf.query("ID > @xmin & ID < @xmax") if len(df) == 0: print("NO HITS FOR", row) # continue # df.set_index("rname",inplace=True,drop=True) kmerlist = list(pd.unique(kmer_groups.query("ID > @xmin & ID < @xmax").ID)) kmerset = set(kmerlist) never_seen2 = kmerset - set(df.ID) constart = df.ID.min() conend = df.ID.max() # plt.figure(figsize=(25,15)) # plt.grid(False) maxes={} first=True fig,ax = plt.subplots(figsize=(20,9)) fig.subplots_adjust(right=0.75) ax.set_xlim(regstart,regend) intervals = [] valid_intervals = [] poss = [] poss_bad = [] for rname in list(pd.unique(df.index)): try: rlen = lendf.loc[rname].len except: print("no rlen for: ", rname) if runmode == 'user_bed': continue sub = df.loc[rname] try: idlist = set(sub.ID) except: idlist = set([sub.ID]) possub = sunkpos3.loc[rname].query("ID in @idlist") posfirst = possub.iloc[0] poslast = possub.iloc[-1] forward = True badlocs_trans = [] possuball = sunkposcat.loc[rname].drop_duplicates(subset=['chrom','ID']) badlocs_same = possuball.query(" chrom==@contig & (not ID in @idlist) ") # badlocs_diff = possuball.query(" chrom != @contig ") badlocs_diff = possuball.query(" chrom != @contig ") badlocs_diff_most = badlocs_diff.chrom.value_counts().index[0] badlocs_diff = badlocs_diff.query("chrom == @badlocs_diff_most") # if len(badlocs_diff) > 0 : break if posfirst.pos > poslast.pos: forward=False if 
forward: posstart = posfirst.start - posfirst.pos posend = poslast.start + (rlen - poslast.pos) badlocs_same_trans = [ (posfirst.start + (x - posfirst.pos)) for x in badlocs_same.pos.tolist() ] badlocs_diff_trans = [ (posfirst.start + (x - posfirst.pos)) for x in badlocs_diff.pos.tolist() ] else: posend = poslast.pos + poslast.start posstart = posfirst.start - (rlen - posfirst.pos) badlocs_same_trans = [ (posfirst.start - (x - posfirst.pos)) for x in badlocs_same.pos.tolist() ] badlocs_diff_trans = [ (posfirst.start - (x - posfirst.pos)) for x in badlocs_diff.pos.tolist() ] # print(posstart,posend) start = posstart end = posend if runmode == 'gaps': plotgroup=1 columns = ['Start','End','Plotgroup'] if rname in rightset: plotgroup=0 intervals.append((start,end,min(idlist),max(idlist),plotgroup)) elif runmode == 'user_bed': intervals.append((start,end,min(idlist),max(idlist),plotgroup)) columns = ['Start','End'] poss_bad.append(badlocs_diff_trans) # valid_intervals.append((min(idlist),max(idlist),plotgroup)) poss.append(list(idlist)) interval_df = pd.DataFrame(intervals,columns=['Start','End','valid_Start','valid_End','Plotgroup'],) # valid_interval_df = pd.DataFrame(valid_intervals,columns=['Start','End','Plotgroup'],) ncls = NCLS(interval_df.Start,interval_df.End,interval_df.index) row_asn = dict() # interval_df.sort_values(by=['Plotgroup','Start'],inplace=True) interval_df.sort_values(by=['Start'],inplace=True) rows_used = set([0]) if runmode == 'user_bed': for i,s,e in interval_df.itertuples(): overlap = ncls.find_overlap(s,e) rows_overlap = set() for i2,s2,e2 in overlap: rows_overlap.add( row_asn.get(e2,0)) rows_avail = rows_used - rows_overlap if len(rows_avail) == 0: row_pick = sorted(rows_used)[-1]+1 rows_used.add(row_pick) row_asn[i] = row_pick else: row_asn[i] = sorted(rows_avail)[0] interval_df['row_asn'] = interval_df.index.map(lambda x: row_asn[x]) patches = [] ticks = [] for i,s,e,r in interval_df.itertuples(): patches.append(Rectangle((s, r), e-s, 0.7)) tick = [Rectangle((x, r), 1, 0.7) for x in poss[i]] ticks = ticks + tick elif runmode == 'gaps': for pgi in [0,1]: intv_sub = interval_df.query("Plotgroup == @pgi") intv_sub['End'] = intv_sub['End'] + 500 # Minimum separation between reads if pgi==1: intv_sub = intv_sub.sort_values(by=['End'],ascending=False) for i,s,e,sv,ev,pg in intv_sub.itertuples(): overlap = ncls.find_overlap(s,e) rows_overlap = set() for i2,s2,e2 in overlap: rows_overlap.add( row_asn.get(e2,0)) rows_avail = rows_used - rows_overlap if len(rows_avail) == 0: row_pick = sorted(rows_used)[-1]+1 rows_used.add(row_pick) row_asn[i] = row_pick else: row_asn[i] = sorted(rows_avail)[0] maxrow_prev = max(row_asn.values()) interval_df['row_asn'] = interval_df.index.map(lambda x: row_asn[x]) patches = [] valid_patches = [] ticks = [] bad_ticks = [] for i,s,e,sv,ev,pg,r in interval_df.itertuples(): patches.append(Rectangle((s, r+0.2), e-s, 0.4)) valid_patches.append(Rectangle((sv, r), ev-sv, 0.7)) tick = [Rectangle((x, r), 1, 0.7) for x in poss[i]] bad_tick = [Rectangle((x, r+0.2), 1, 0.4) for x in poss_bad[i]] bad_ticks = bad_ticks + bad_tick ticks = ticks + tick never_seen3 = [x for x in never_seen2 if not ((x<constart) | (x>conend))] if 'colorbed' in opt_keys: # print("COLORBED!!!") patches2=[] for r in dm[['chrStart','chrEnd','color','chr']].query('chr==@contig & chrEnd > @xmin & chrStart < @xmax ').itertuples(): polygon = Polygon([[r[1],-4],[r[1],-5],[r[2],-5],[r[2],-4]], True, color=r[3],alpha=1) patches2.append(polygon) p = PatchCollection(patches2, alpha=1, 
match_original=True) # print("patchlen: ",len(patches2)) ax.add_collection(p) ax.add_collection(PatchCollection(valid_patches,color='lightgray',zorder=2,aa=True,edgecolor='w',linewidth=0.01)) ax.add_collection(PatchCollection(patches,color='paleturquoise',zorder=1,aa=True,edgecolor='w',linewidth=0.01)) ax.add_collection(PatchCollection(ticks,color='k',zorder=7,linewidth=0.5,edgecolor='k',aa=True)) # aa=True ax.add_collection(PatchCollection(bad_ticks,color='r',zorder=8,linewidth=0.5,edgecolor='r',aa=True)) # aa=True ax.scatter(x=never_seen3,y=[-2.5]*len(never_seen3),color='k',s=55,marker="|",facecolors=None,linewidths=0.4,zorder=2) if sunklines: sl = [Rectangle((x, -1),0.5, (max(rows_used) + 1.5) ) for x in kmerlist] ax.add_collection(PatchCollection(sl,color='darkgray',zorder=0.9,linewidth=0.5,edgecolor='lightgray',aa=True,linestyle="--")) # aa=True # ax.scatter(x=kmerlist,y=[-1.5]*len(kmerlist),color='k',s=55,marker="|",facecolors=None,linewidths=0.4,zorder=2) # # plt.scatter(data=graph3,x='ID',y='counter',color='red',s=2) fmt = '{x:,.0f}' tick = mtick.StrMethodFormatter(fmt) ax.xaxis.set_major_formatter(tick) ax.set_ylabel("ONT Read Depth") ax.set_xlabel("Contig coordinate") plt.rcParams['svg.fonttype'] = 'none' plt.savefig(plotdir + "/" + snakemake.wildcards.sample +"_" + snakemake.wildcards.hap + "_" +cont2+"_"+str(regstart)+"_"+str(regend)+"_detailed.svg",format="svg",pad_inches=0,bbox_inches='tight') plt.savefig(plotdir + "/" + snakemake.wildcards.sample +"_" + snakemake.wildcards.hap + "_" +cont2+"_"+str(regstart)+"_"+str(regend)+"_detailed.png",pad_inches=0,bbox_inches='tight',dpi=300) plt.savefig(plotdir + "/" + snakemake.wildcards.sample +"_" + snakemake.wildcards.hap + "_" +cont2+"_"+str(regstart)+"_"+str(regend)+"_detailed.pdf",pad_inches=0,bbox_inches='tight') #plt.savefig(args.plotdir +row.name +"__" + args.sampname +"_" + args.hap + "_" +contig+"_"+str(regstart)+"_"+str(regend)+".svg",format="svg",pad_inches=0,bbox_inches='tight') # plt.savefig(plotdir+contig+"_"+str(regstart)+"_"+str(regend)+"_v4.png",dpi=300,pad_inches=0,bbox_inches='tight') # plt.savefig(plotdir+contig+"_"+str(regstart)+"_"+str(regend)+"_v4.pdf",dpi=300,pad_inches=0,bbox_inches='tight') # plt.savefig(plotdir+contig+"_"+str(regstart)+"_"+str(regend)+"_v4.svg",dpi=300,pad_inches=0,bbox_inches='tight') # plt.savefig(args.plotdir +row.name +"__" + args.sampname +"_" + args.hap + "_" +contig+"_"+str(regstart)+"_"+str(regend)+".png",dpi=300,pad_inches=0,bbox_inches='tight') # plt.savefig(args.plotdir + args.sampname +"_" + args.hap + "_" + row.name +"_" +contig+"_"+str(regstart)+"_"+str(regend)+".png",dpi=300,pad_inches=0,bbox_inches='tight') # plt.savefig("breakplots/"+contig+"_"+str(regstart)+"_"+str(regend)+"_.pdf",dpi=300,pad_inches=0,bbox_inches='tight') # plt.savefig("breakplots/"+contig+"_"+str(regstart)+"_"+str(regend)+"_.svg",dpi=300,pad_inches=0,bbox_inches='tight') plt.close() print("Plotting complete for " +contig+"_"+str(regstart)+"_"+str(regend)) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 | import pandas as pd from matplotlib import (pyplot as plt, lines) from matplotlib.collections import PatchCollection from matplotlib.patches import Circle, Wedge, Polygon, Rectangle import seaborn as sns import numpy as nptwi import matplotlib.ticker as mtick import argparse #print(pd.__version__) import os from ncls import NCLS print(snakemake.input) runmode=snakemake.wildcards.runmode opt_keys = list(snakemake.input.keys()) regbedfile = snakemake.input.bed if os.stat(regbedfile).st_size == 0: print("empty gap file") exit() bedreg1 = pd.read_csv(regbedfile, delimiter="\t",encoding='utf-8',header=None) header = ['chrom','start','end'] bedreg1.columns = header + [''] * (len(bedreg1.columns) - len(header)) lendf = pd.read_csv(snakemake.input.rlen,sep="\t",header=None,names=['rname','len'],dtype={'rname':'string','len':'uint32'}) lendf.drop_duplicates(inplace=True) lendf.set_index("rname",drop=True,inplace=True) bedreg1['chrom'].value_counts() contigs = list(bedreg1['chrom'].value_counts().index) print(len(contigs)) if 'colorbed' in opt_keys: dm = pd.read_csv(snakemake.input.colorbed, delim_whitespace=True, names=['chr','chrStart','chrEnd','color'], header=None,dtype = {'chr':'string','chrStart':'int','chrEnd':'int','color':'string'}) dm = dm[dm.chrStart >= 0] dm["y"] = 1 dm["func"] = "" for contig in contigs: cont2 = contig.replace("#","_") print(contig) bedreg= bedreg1.query('chrom == @contig') #print(type(snakemake.input.interout)) #my_str=os.path.dirname(snakemake.input.interout) #print(my_str) try: outputsdf= pd.read_csv(os.path.dirname(snakemake.input.interout) + "/" + cont2 + "_" + snakemake.wildcards.hap + ".tsv" ,sep="\t",names=['ID','rname'],dtype = {'ID':'int','rname':'string'}) except: print("outputsdf not found for", contig) outputsdf.set_index("rname",inplace=True,drop=True) if len(outputsdf) <2: print("NO READS FOR", contig) kmermergefile = os.path.dirname(snakemake.input.pos_locs) +"/"+ cont2 + "_" + snakemake.wildcards.hap + ".loc" kmermerge = pd.read_csv(kmermergefile,sep="\t",header=None,names=['chrom','loc','kmer','ID'])# this could be split too kmermerge = kmermerge.drop_duplicates(subset='kmer') plotdir = os.path.dirname(snakemake.output.flag) + "/" #print(plotdir) sunkposfile = os.path.dirname(snakemake.input.pos_locs)+"/"+ cont2 + "_" + snakemake.wildcards.hap + ".sunkpos" sunkposcat = pd.read_csv(sunkposfile,sep="\t",header=None,names=['rname','pos','chrom','start','ID'],dtype={'rname':'string','pos':'int64','chrom':'category','start':'int64','ID':'int64'}) kmermerge['ID2'] = kmermerge['chrom'].astype(str) +":"+ kmermerge['ID'].astype(str) sunkposcat['ID2'] = sunkposcat['chrom'].astype(str) +":"+ 
sunkposcat['ID'].astype(str) print("sunkposcat len: ", len(sunkposcat)) kmer_groups = kmermerge.drop_duplicates(subset='ID') # kmer_groups['dist'] = kmer_groups.ID.diff().fillna(kmer_groups.ID).astype(int) #rows) multisunk = sunkposcat.groupby('rname',as_index=False).agg({'ID':'nunique'}).sort_values(by='ID').query('ID > 1') print("Filter 1 multisunk len: ", len(multisunk)) if len(multisunk) == 0: print("No viable reads for this region: "+str(contig)+"_"+str(regstart)+"_"+str(regend)) multisunkset = set(multisunk.rname.tolist()) ### RESTRICT TO REGION sunkpos3 = sunkposcat.query('rname in @multisunkset ').sort_values(by=['rname','start']) # reads with multiple SUNK group matches # & start >= @xmin & start <= @xmax print(len(sunkpos3)) print(len(pd.unique(sunkpos3.rname))) print(len(pd.unique(sunkposcat.rname))) sunkpos3.set_index("rname",drop=True,inplace=True) for row in bedreg.itertuples(): contig = row.chrom regstart = row.start regend = row.end xmin = regstart xmax = regend breakloc = int((xmin + xmax)/2) rightset = set(outputsdf.query("ID > @breakloc").index) df = outputsdf.query("ID > @xmin & ID < @xmax") if len(df) == 0: print("NO HITS FOR", row) continue kmerlist = list(pd.unique(kmer_groups.query("ID > @xmin & ID < @xmax").ID)) kmerset = set(kmerlist) never_seen2 = kmerset - set(df.ID) print(len(never_seen2)) constart = df.ID.min() conend = df.ID.max() maxes={} first=True fig,ax = plt.subplots(figsize=(20,9)) fig.subplots_adjust(right=0.75) ax.set_xlim(regstart,regend) intervals = [] poss = [] print("NUMBER OF RNAMES: ",len(list(pd.unique(df.index)))) for rname in list(pd.unique(df.index)): try: rlen = lendf.loc[rname].len except: print("no rlen for: ", rname) if runmode == 'user_bed': continue sub = df.loc[rname] try: idlist = set(sub.ID) except: idlist = set([sub.ID]) possub = sunkpos3.loc[rname].query("ID in @idlist") posfirst = possub.iloc[0] poslast = possub.iloc[-1] forward = True badlocs_trans = [] if posfirst.pos > poslast.pos: forward=False if forward: posstart = posfirst.start - posfirst.pos posend = poslast.start + (rlen - poslast.pos) else: posend = poslast.pos + poslast.start posstart = posfirst.start - (rlen - posfirst.pos) start = posstart end = posend if runmode == 'gaps': plotgroup=1 columns = ['Start','End','Plotgroup'] if rname in rightset: plotgroup=0 intervals.append((start,end,plotgroup)) elif runmode == 'user_bed': intervals.append((start,end)) columns = ['Start','End'] poss.append(list(idlist)) interval_df = pd.DataFrame(intervals,columns=columns) ncls = NCLS(interval_df.Start,interval_df.End,interval_df.index) row_asn = dict() interval_df.sort_values(by='Start',inplace=True) print("INTERVAL_DF len: ", len(interval_df)) rows_used = set([0]) if runmode == 'user_bed': for i,s,e in interval_df.itertuples(): overlap = ncls.find_overlap(s,e) rows_overlap = set() for i2,s2,e2 in overlap: rows_overlap.add( row_asn.get(e2,0)) rows_avail = rows_used - rows_overlap if len(rows_avail) == 0: row_pick = sorted(rows_used)[-1]+1 rows_used.add(row_pick) row_asn[i] = row_pick else: row_asn[i] = sorted(rows_avail)[0] interval_df['row_asn'] = interval_df.index.map(lambda x: row_asn[x]) patches = [] ticks = [] for i,s,e,r in interval_df.itertuples(): patches.append(Rectangle((s, r), e-s, 0.7)) tick = [Rectangle((x, r), 1, 0.7) for x in poss[i]] ticks = ticks + tick elif runmode == 'gaps': for pgi in [0,1]: intv_sub = interval_df.query("Plotgroup == @pgi") intv_sub['End'] = intv_sub['End'] + 500 # Minimum separation between reads if pgi==1: intv_sub = 
intv_sub.sort_values(by=['End'],ascending=False) for i,s,e,pg in intv_sub.itertuples(): overlap = ncls.find_overlap(s,e) rows_overlap = set() for i2,s2,e2 in overlap: rows_overlap.add( row_asn.get(e2,0)) rows_avail = rows_used - rows_overlap if len(rows_avail) == 0: row_pick = sorted(rows_used)[-1]+1 rows_used.add(row_pick) row_asn[i] = row_pick else: row_asn[i] = sorted(rows_avail)[0] maxrow_prev = max(row_asn.values()) interval_df['row_asn'] = interval_df.index.map(lambda x: row_asn[x]) patches = [] ticks = [] for i,s,e,pg,r in interval_df.itertuples(): patches.append(Rectangle((s, r), e-s, 0.7)) tick = [Rectangle((x, r), 1, 0.7) for x in poss[i]] ticks = ticks + tick never_seen3 = [x for x in never_seen2 if not ((x<constart) | (x>conend))] if 'colorbed' in opt_keys: patches2=[] for r in dm[['chrStart','chrEnd','color','chr']].query('chr==@contig & chrEnd > @xmin & chrStart < @xmax ').itertuples(): polygon = Polygon([[r[1],-4],[r[1],-5],[r[2],-5],[r[2],-4]], True, color=r[3],alpha=1) patches2.append(polygon) p = PatchCollection(patches2, alpha=1, match_original=True) ax.add_collection(p) ax.add_collection(PatchCollection(patches,color='lightgray',zorder=1,aa=True,edgecolor='w',linewidth=0.01)) ax.add_collection(PatchCollection(ticks,color='k',zorder=7,linewidth=0.5,edgecolor='k',aa=True)) ax.scatter(x=never_seen3,y=[-2.5]*len(never_seen3),color='k',s=55,marker="|",facecolors=None,linewidths=0.4,zorder=2) ax.scatter(x=kmerlist,y=[-1.5]*len(kmerlist),color='k',s=55,marker="|",facecolors=None,linewidths=0.4,zorder=2) fmt = '{x:,.0f}' tick = mtick.StrMethodFormatter(fmt) ax.xaxis.set_major_formatter(tick) ax.set_ylabel("ONT Read Depth") ax.set_xlabel("Contig coordinate") plt.rcParams['svg.fonttype'] = 'none' print(plotdir + "/" + snakemake.wildcards.sample +"_" + snakemake.wildcards.hap + "_" +cont2+"_"+str(regstart)+"_"+str(regend)+".svg") plt.savefig(plotdir + "/" + snakemake.wildcards.sample +"_" + snakemake.wildcards.hap + "_" +cont2+"_"+str(regstart)+"_"+str(regend)+".svg",format="svg",pad_inches=0,bbox_inches='tight') plt.savefig(plotdir + "/" + snakemake.wildcards.sample +"_" + snakemake.wildcards.hap + "_" +cont2+"_"+str(regstart)+"_"+str(regend)+".png",pad_inches=0,bbox_inches='tight',dpi=300) plt.savefig(plotdir + "/" + snakemake.wildcards.sample +"_" + snakemake.wildcards.hap + "_" +cont2+"_"+str(regstart)+"_"+str(regend)+".pdf",pad_inches=0,bbox_inches='tight') plt.close() print("Plotting complete for " +contig+"_"+str(regstart)+"_"+str(regend)) |