Analysis of multi-way chromatin interactions from very long sequencing reads
The thesis focuses on the concept of "chromosomal walks" (c-walks) introduced by Olivares-Chauvet et al., 2016. These chromosomal walks are derived from interactions identified de novo in whole-genome chromatin conformation data. The structure of chromosomal
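In the code below, a c-walk is represented as a connected component of a networkx graph: nodes are restriction fragments identified by (start, end, chromosome) tuples, edges are pairwise contacts weighted by read coverage, and components with fewer than three nodes are discarded. A minimal sketch of that representation (coordinates are invented):

import networkx as nx

# nodes are (start, end, chromosome) restriction fragments; coordinates invented
G = nx.Graph()
G.add_edge((100, 500, "chr1"), (900, 1300, "chr1"), weight=2)
G.add_edge((900, 1300, "chr1"), (40, 80, "chr7"), weight=1)
# each connected component with >= 3 nodes is treated as one c-walk
cwalks = [c for c in nx.connected_components(G) if len(c) >= 3]
print(len(cwalks))  # 1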
Code Snippets
shell:
    "src/py/digestion.py {input} {output}"
shell:
    "bowtie2 -x {input.index}/hg19 -U {input.fastq} | samtools view -bS - > {output}"  # human
    # "bowtie2 -x {input.index}/mm9 -U {input.fastq} | samtools view -bS - > {output}"  # mouse
# rule bowtie2:
#     bedtools bamtofastq [OPTIONS] -i <BAM> -fq <FASTQ> - command to check (working)
#     input:
#         index = "data/Bowtie2Index/hg19",
#         fastq = "data/digested/k562/{sample}.digested.fastq"
shell:
    "src/py/filtering.py human {input} {output}"  # human
    # "src/py/filtering.py mouse {input} {output}"  # mouse

rule barh:
    input:
        "data/supportive/k562"  # human
        # "data/supportive/mesc"  # mouse
    output:
        "data/charts/k562/barh/human_barh.png"  # human
        # "data/charts/mesc/barh/mouse_barh.png"  # mouse
    shell:
        "src/py/barh.py human {input} {output}"  # human
        # "src/py/barh.py mouse {input} {output}"  # mouse

rule charts:
    input:
        "data/supportive/k562/{res}.tsv"  # human
        # "data/supportive/mesc/{res}.tsv"  # mouse
    output:
        # human
        RvsR_barplot = "data/charts/k562/{res}_RvsR.png",
        dist_0_5000_R = "data/charts/k562/{res}_0_5000_R.png",
        dist_500_10000_R = "data/charts/k562/{res}_500_10000_R.png",
        strand_1vs2_barplot = "data/charts/k562/{res}_strand_1vs2.png",
        dist_0_5000_S = "data/charts/k562/{res}_0_5000_S.png",
        dist_500_10000_S = "data/charts/k562/{res}_500_10000_S.png",
        log10 = "data/charts/k562/{res}_log10_500_1000.png"
shell:
    "src/py/charts.py human {input} {output.RvsR_barplot} {output.dist_0_5000_R} {output.dist_500_10000_R} \
    {output.strand_1vs2_barplot} {output.dist_0_5000_S} {output.dist_500_10000_S} {output.log10}"
shell:
    "src/py/cwalk.py human {input.bedfile} {input.tsvfile} {output.cwalk_txt} {output.cwalk_bed}"  # human
    # "src/py/cwalk.py mouse {input.bedfile} {input.tsvfile} {output.cwalk_txt} {output.cwalk_bed}"  # mouse

rule analysis:
    input:
        "data/cwalks/k562/txt"  # human
        # "data/cwalks/mesc/txt"  # mouse
    output:
        stats_intra = "data/charts/cwalks/k562/human_cwalk_hists.png",  # human
        fractions = "data/charts/cwalks/k562/human_fractions.png",  # human
        barh_all = "data/charts/cwalks/k562/cwalk_human_barh.png"  # human
        # stats = "data/analysis/cwalks/mesc/mouse_cwalk_hists.png",  # mouse
        # fractions = "data/analysis/cwalks/mesc/mouse_fractions.png",  # mouse
        # barh = "data/analysis/cwalks/mesc/cwalk_mouse_barh.png"  # mouse
    shell:
        "src/py/cwalk_analysis.py human {input} {output.stats_intra} {output.fractions} {output.barh_all}"  # human
        # "src/py/cwalk_analysis.py mouse {input} {output.stats} {output.fractions} {output.barh}"  # mouse

rule directionality:
    input:
        "data/cwalks/k562/txt"  # human
        # "data/cwalks/mesc/txt"  # mouse
    output:
        "data/charts/directionality/k562/human_directionality.png"  # human
        # "data/charts/directionality/mesc/mouse_directionality.png"  # mouse
    shell:
        "src/py/directionality.py human data/cwalks/k562/txt {output}"  # human
        # "src/py/directionality.py mouse data/cwalks/mesc/txt {output}"  # mouse

rule TF:
    input:
        cwalk = "data/cwalks/k562/txt",
        genome = "data/genomes/hg19.chrom.sizes",
        chipseq = "data/factors/chip-seq_hg19_K562"
shell:
    "src/py/TF_human.py {input.cwalk} {input.genome} {input.chipseq} {output}"
shell:
    "src/py/TAD.py human {input.cwalk} {input.doms} {input.genome} {input.rest} \
    {output.frac_data} {output.frac_random} {output.types} {output.comparision} {output.counts} {output.doms}"
shell:
    "src/py/TF_mouse.py {input.cwalk} {input.genome} {input.chipseq}"
from os import listdir
from os.path import isfile, join
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import matplotlib.ticker as ticker
import sys


def load_tsvfiles():
    """ load all available .tsv files """
    return [sample for sample in listdir(sys.argv[1]) if isfile(join(sys.argv[1], sample))]


def count_aligns(infiles):
    """
    Extract information about the number of all aligns,
    aligns with abs_pos >= 500, rejected aligns and their labels.
    """
    global rejected, labels
    raw_data, data = [], []
    for file in infiles:
        df = pd.read_csv(f"{sys.argv[1]}/{file}", sep='\t')  # load single .tsv file
        df_clean = df.where(df.abs_pos >= 500).dropna()
        df_clean.reset_index(drop=True, inplace=True)
        raw_data.append(len(df.index))
        data.append(len(df_clean.index))
    rejected = [a - b for (a, b) in zip(raw_data, data)]
    labels = [file[:-4] for file in infiles]
    return [data, rejected, labels]


def barh(data, rejected, labels, name):
    sns.set_style("whitegrid")
    fig, ax = plt.subplots(figsize=(12, 16))
    y_pos = np.arange(len(labels))
    bar_1 = ax.barh(y_pos, data, color="tab:blue", edgecolor="black", height=1)
    bar_2 = ax.barh(y_pos, rejected, left=data, color="tab:red", edgecolor="black", height=1)
    ax.xaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos: "{0:g}".format(x / 1000000) + "M"))
    plt.yticks(y_pos, labels)
    ax.set_title("Number of useful and rejected alignments in each sample", fontsize=18)
    plt.legend([bar_1, bar_2], ["Useful", "Rejected"], loc="upper right", prop={"size": 16})
    return plt.savefig(f"{name}"), plt.close()


if __name__ == '__main__':
    counted = count_aligns(load_tsvfiles())
    barh(counted[0], counted[1], counted[2], sys.argv[2])
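The `where(...).dropna()` idiom above keeps only align pairs separated by at least 500 bp; everything else is counted as rejected. A minimal sketch of that filter on made-up values (the numbers here are hypothetical, not from the thesis data):

import pandas as pd

# hypothetical abs_pos values for five compared align pairs
df = pd.DataFrame({"abs_pos": [120, 750, 4800, 80, 12000]})
useful = df.where(df.abs_pos >= 500).dropna()        # rows kept for analysis
print(len(df), len(useful), len(df) - len(useful))   # 5 3 2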
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.ticker as ticker
import sys

organism = sys.argv[1]

""" load .tsv file """
df = pd.read_csv(sys.argv[2], sep='\t')
experiment_name = sys.argv[2][21:-4]

""" Initiation of basic dependencies """
DIST_no_limit = df
DIST_500_inf = df.where(df.abs_pos >= 500)
DIST_0_5000 = df.where(df.abs_pos <= 5000)
DIST_500_10000 = df.where(df.abs_pos >= 500).where(df.abs_pos <= 10000)

df_RvsR_0_5000 = {x: y for x, y in DIST_0_5000.groupby("RvsR")}
df_RvsR_500_10000 = {x: y for x, y in DIST_500_10000.groupby("RvsR")}
df_strand_0_5000 = {x: y for x, y in DIST_0_5000.groupby("strand_1vs2")}
df_strand_500_10000 = {x: y for x, y in DIST_500_10000.groupby("strand_1vs2")}

RvsR_keys = list(df_RvsR_0_5000.keys())
strand_keys = list(df_strand_0_5000.keys())


def main():
    def distances_histplot(df, keys, experiment_name, name):
        sns.set_style("whitegrid")
        fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, sharex=False, sharey=False, figsize=(16, 12))
        plt.subplots_adjust(bottom=0.1, top=0.9, hspace=0.25, wspace=0.3, right=0.93, left=0.15)
        if "R1 vs R1" in keys:
            R1vsR1 = df["R1 vs R1"].reset_index(drop=True)
            sns.histplot(ax=ax1, data=df, x=R1vsR1["abs_pos"], color="red", edgecolor="black")
            ax1.set_title("R1 vs R1", fontsize=18)
        elif "forward vs forward" in keys:
            S1vsS1 = df["forward vs forward"].reset_index(drop=True)
            sns.histplot(ax=ax1, data=df, x=S1vsS1["abs_pos"], color="forestgreen", edgecolor="black")
            ax1.yaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos: "{0:g}".format(x / 1000) + "k"))
            ax1.set_title("forward vs forward", fontsize=18)
        else:
            fig.delaxes(ax1)
        if "R2 vs R2" in keys:
            R2vsR2 = df["R2 vs R2"].reset_index(drop=True)
            sns.histplot(ax=ax2, data=df, x=R2vsR2["abs_pos"], color="red", edgecolor="black")
            ax2.set_title("R2 vs R2", fontsize=18)
        elif "reverse vs reverse" in keys:
            S1vsS1 = df["reverse vs reverse"].reset_index(drop=True)
            sns.histplot(ax=ax2, data=df, x=S1vsS1["abs_pos"], color="forestgreen", edgecolor="black")
            ax2.yaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos: "{0:g}".format(x / 1000) + "k"))
            ax2.set_title("reverse vs reverse", fontsize=18)
        else:
            fig.delaxes(ax2)
        if "R1 vs R2" in keys:
            R1vsR2 = df["R1 vs R2"].reset_index(drop=True)
            sns.histplot(ax=ax3, data=df, x=R1vsR2["abs_pos"], color="red", edgecolor="black")
            ax3.yaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos: "{0:g}".format(x / 1000) + "k"))
            ax3.set_title("R1 vs R2", fontsize=18)
        elif "forward vs reverse" in keys:
            S1vsS1 = df["forward vs reverse"].reset_index(drop=True)
            sns.histplot(ax=ax3, data=df, x=S1vsS1["abs_pos"], color="forestgreen", edgecolor="black")
            ax3.yaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos: "{0:g}".format(x / 1000) + "k"))
            ax3.set_title("forward vs reverse", fontsize=18)
        else:
            fig.delaxes(ax3)
        if "R2 vs R1" in keys:
            R2vsR1 = df["R2 vs R1"].reset_index(drop=True)
            sns.histplot(ax=ax4, data=df, x=R2vsR1["abs_pos"], color="red", edgecolor="black")
            ax4.set_title("R2 vs R1", fontsize=18)
        elif "reverse vs forward" in keys:
            S1vsS1 = df["reverse vs forward"].reset_index(drop=True)
            sns.histplot(ax=ax4, data=df, x=S1vsS1["abs_pos"], color="forestgreen", edgecolor="black")
            ax4.yaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos: "{0:g}".format(x / 1000) + "k"))
            ax4.set_title("reverse vs forward", fontsize=18)
        else:
            fig.delaxes(ax4)
        fig.suptitle(f"Sample from {experiment_name} {organism} cells", fontsize=20)
        axes = [ax1, ax2, ax3, ax4]
        for ax in axes:
            ax.set_xlabel("Absolute value comparing each two aligns", fontsize=15)
            ax.set_ylabel("Number of compared aligns", fontsize=15)
        return plt.savefig(f"{name}"), plt.close()

    def feature_barplot(df, experiment_name, name, save_as):
        sns.set_style("whitegrid")
        plt.title(f"Sample from {experiment_name} {organism} cells")
        if name == "RvsR":
            fig = plt.figure(figsize=(10, 5))
            fig.suptitle(f"Sample from {experiment_name} {organism} cells")
            plt.subplots_adjust(wspace=0.3, top=0.85)
            fig.add_subplot(121)
            ax1 = sns.barplot(x=df[name].value_counts().index, y=df[name].value_counts(),
                              color="royalblue", edgecolor="black")
            ax1.set(xlabel="Combinations", ylabel="Number of compared aligns")
            ax1.yaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos: "{0:g}".format(x / 1000000) + "M"))
            fig.add_subplot(122)
            ax2 = sns.barplot(x=df[name].value_counts().iloc[1:].index, y=df[name].value_counts().iloc[1:],
                              color="royalblue", edgecolor="black")
            ax2.set(xlabel="Combinations", ylabel="Number of compared aligns")
        elif name == "strand_1vs2":
            fig = plt.figure(figsize=(8, 5))
            fig.suptitle(f"Sample from {experiment_name} {organism} cells")
            ax = sns.barplot(x=df[name].value_counts().index, y=df[name].value_counts(),
                             color="forestgreen", edgecolor="black")
            ax.set(xlabel="Combinations", ylabel="Number of compared aligns")
            ax.yaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos: "{0:g}".format(x / 1000) + "k"))
        return plt.savefig(f"{save_as}"), plt.close()

    def log_scale_histplot(df, experiment_name, name):
        sns.set(rc={'figure.figsize': (12, 8)})
        sns.set_style("whitegrid")
        ax = sns.histplot(data=df, x=DIST_500_inf["abs_pos"], log_scale=True, color="red", edgecolor="black")
        ax.yaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos: "{0:g}".format(x / 1000) + "k"))
        ax.set(xlabel="Absolute value of position comparing each two aligns [log10]",
               ylabel="Number of compared aligns",
               title=f"Sample from {experiment_name} {organism} cells")
        return plt.savefig(f"{name}"), plt.close()

    """ Plotting """
    # RvsR
    feature_barplot(DIST_no_limit, experiment_name, "RvsR", sys.argv[3])  # "RvsR"
    distances_histplot(df_RvsR_0_5000, RvsR_keys, experiment_name, sys.argv[4])  # "0_5000_R"
    distances_histplot(df_RvsR_500_10000, RvsR_keys, experiment_name, sys.argv[5])  # "500_10000_R"
    # strand 1vs2
    feature_barplot(DIST_no_limit, experiment_name, "strand_1vs2", sys.argv[6])  # "strand_1vs2"
    distances_histplot(df_strand_0_5000, strand_keys, experiment_name, sys.argv[7])  # "0_5000_S"
    distances_histplot(df_strand_500_10000, strand_keys, experiment_name, sys.argv[8])  # "500_10000_S"
    # log10-scale
    log_scale_histplot(DIST_500_10000, experiment_name, sys.argv[9])  # "log10_500_1000"


if __name__ == '__main__':
    main()
import networkx as nx
import pandas as pd
import pickle
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.ticker as ticker
from matplotlib.ticker import MaxNLocator
import numpy as np
import statistics
import sys
import matplotlib.pylab as pylab

params = {'legend.fontsize': 'x-large',
          'axes.labelsize': 'x-large',
          'axes.titlesize': 'x-large',
          'xtick.labelsize': 'xx-large',
          'ytick.labelsize': 'xx-large'}
pylab.rcParams.update(params)


def load_cwalk_graph(cwalk_graph):
    return pickle.load(open(cwalk_graph, "rb"))


def load_files(dictionary, parser):
    """ load all available files from a folder and return lists of objects """
    import os
    files, labels = [], []
    for file in os.listdir(dictionary):
        labels.append(file)
        files.append(parser(f"{dictionary}/{file}"))
    return files, labels


def fractions(one_chr, two_chr, many_chr, name):
    sns.set_style("whitegrid")
    plt.figure(figsize=(12, 10))
    bar1 = plt.bar([i for i in range(3, 16)], one_chr, width=1, edgecolor="black", color="tab:blue")
    bar2 = plt.bar([i for i in range(3, 16)], two_chr, bottom=one_chr, width=1, edgecolor="black", color="tab:cyan")
    bar3 = plt.bar([i for i in range(3, 16)], many_chr, bottom=np.add(one_chr, two_chr), width=1,
                   edgecolor="black", color="tab:gray")
    plt.xlabel("Number of hops", fontsize=16)
    plt.ylabel("Percentage [%]", fontsize=16)
    plt.title(f"Fraction of c-walks in each class in {sys.argv[1]} cells", fontsize=18)
    plt.legend([bar1, bar2, bar3], ["one chromosome (class I)", "two chromosomes (class II)",
                                    "many chromosomes (class III)"])
    return plt.savefig(f"{name}"), plt.close()


def barh(graphs, labels, name):
    data = []
    for graph in graphs:
        data.append(nx.number_connected_components(graph))
    sns.set_style("whitegrid")
    fig, ax = plt.subplots(figsize=(12, 16))
    ax.barh([label[:-11] for label in labels], data, color="tab:blue", edgecolor="black", height=1)
    ax.xaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos: "{0:g}".format(x / 1000) + "k"))
    ax.set_title(f"Number of c-walks in each graph in {sys.argv[1]} cells", fontsize=18)
    return plt.savefig(f"{name}"), plt.close()


def stats(graphs, name):
    i, j, k = 0, 0, 0
    lengths = []
    for graph in graphs:
        for cwalk in list(nx.connected_components(graph)):
            cwalk = list(cwalk)
            k += 1
            if all(cwalk[i][2] == cwalk[0][2] for i in range(0, len(cwalk))):
                lengths.append(len(cwalk))
                i += 1
            else:
                j += 1
    print(f"Number of intra-chromosomal cwalks: {i}")
    print(f"Number of inter-chromosomal cwalks: {j}")
    print(f"Number of all cwalks: {k}")
    print(f"Percentage of intra-chromosomal cwalks is {round((i / k) * 100, 3)}%")
    print(f"Average: {round(statistics.mean(lengths), 2)}")
    print(f"Median: {round(statistics.median(lengths), 2)}")
    print(f"Mode: {statistics.mode(lengths)}")
    print(f"Standard deviation: {round(statistics.stdev(lengths), 2)}")
    sns.set_style("whitegrid")
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))
    plt.suptitle(f"Distribution of c-walks length in {sys.argv[1]} cells", fontsize=20)
    ax1.hist(lengths, color="tab:blue", edgecolor="black")
    ax2.hist([x for x in lengths if x <= 6], color="tab:blue", edgecolor="black", rwidth=1)
    ax1.yaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos: "{0:g}".format(x / 1000) + "k"))
    ax2.yaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos: "{0:g}".format(x / 1000) + "k"))
    ax2.xaxis.set_major_locator(MaxNLocator(integer=True))
    ax1.set_xlabel("c-walks length", fontsize=16)
    ax1.set_ylabel("Number of c-walks", fontsize=16)
    ax2.set_xlabel("c-walks length", fontsize=16)
    ax2.set_ylabel("Number of c-walks", fontsize=16)
    return plt.savefig(f"{name}"), plt.close()


def identical(cwalk):
    """ number of distinct chromosomes visited by a c-walk """
    span = 1
    chrs = [node[2] for node in cwalk]
    while not all(chrs[0] == element for element in chrs):
        span += 1
        repeat = chrs[0]
        chrs = [chr for chr in chrs if chr != repeat]
    return span


def counting(graphs, length):
    i = 0
    one, two, many = 0, 0, 0
    for graph in graphs:
        for cwalk in list(nx.connected_components(graph)):
            cwalk = list(cwalk)
            if len(cwalk) == length:
                i += 1
                span = identical(cwalk)
                if span == 1:
                    one += 1
                elif span == 2:
                    two += 1
                elif span >= 3:
                    many += 1
    return [round((one / i) * 100, 2), round((two / i) * 100, 2), round((many / i) * 100, 2)]


def main():
    graphs, labels = load_files(sys.argv[2], load_cwalk_graph)  # load cwalk graphs
    one_chr, two_chr, many_chr = [], [], []
    for hop in range(3, 16):
        one, two, many = counting(graphs, hop)
        # fractions of c-walks of a particular length spanning one, two, or many chromosomes
        one_chr.append(one)
        two_chr.append(two)
        many_chr.append(many)

    """ Plotting """
    stats(graphs, sys.argv[3])
    fractions(one_chr, two_chr, many_chr, sys.argv[4])
    barh(graphs, labels, sys.argv[5])


if __name__ == '__main__':
    main()
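identical() above counts how many distinct chromosomes a c-walk visits, which drives the class I/II/III split. A quick illustration on a hand-made walk (nodes follow the (start, end, chromosome) convention used throughout; coordinates are invented):

# nodes are (start, end, chromosome) restriction intervals; values are made up
walk = [(100, 500, "chr1"), (900, 1300, "chr1"), (40, 80, "chr7")]
chrs = [node[2] for node in walk]
span = 1
while not all(chrs[0] == element for element in chrs):
    span += 1
    first = chrs[0]
    chrs = [c for c in chrs if c != first]
print(span)  # 2 -> class II (two chromosomes)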
import networkx as nx
import pandas as pd
from filtering import typical_chromosomes, collect_data
from intervaltree import IntervalTree
import pickle
import sys


def parse_positions(tsvfile: str, abs_threshold: int) -> zip:
    """ Returns lists of positions of aligns """
    df = pd.read_csv(tsvfile, sep="\t")
    df = df.where(df.abs_pos >= abs_threshold).dropna().reset_index(drop=True)
    return zip(df.chr_R1.tolist(), df.chr_R2.tolist(),
               df.pos_R1.astype(int).tolist(), df.pos_R2.astype(int).tolist())


def parse_bedfile(bedfile: str, organism: str) -> dict:
    """ Return lists of restriction site positions and the chromosomes where they were found """
    df = pd.read_csv(bedfile, sep="\t", header=None)
    df = df.iloc[:, :2]
    df = df.loc[df[0].isin(typical_chromosomes(organism))].reset_index(drop=True)
    return {x: y for x, y in df.groupby(0)}


def add_edge(u, v):
    """ Create weighted edges """
    if G.has_edge(u, v):
        G[u][v]["weight"] += 1
    else:
        G.add_edge(u, v, weight=1)


def matching_edges(interval_tree_dict: dict, positions: zip):
    """
    Graph construction:
    interval_tree_dict: a dictionary storing the restriction intervals (as IntervalTree) for each chromosome
    positions: list of C-walk positions
    """
    for chr1, chr2, position_R1, position_R2 in positions:
        left_edge = interval_tree_dict[chr1][position_R1]
        right_edge = interval_tree_dict[chr2][position_R2]
        if left_edge and right_edge:  # prevention of empty sets
            left_edge = list(list(left_edge)[0])
            right_edge = list(list(right_edge)[0])
            left_edge[2], right_edge[2] = chr1, chr2
            add_edge(tuple(left_edge), tuple(right_edge))  # ex. (77366342, 77367727, "chr1")


def cwalk_construction(edges):
    """ Resolve the C-walk graph """
    P = nx.Graph()
    edge_index = 0
    for u, v, a in edges:
        if (u not in P or P.degree[u] < 2) and (v not in P or P.degree[v] < 2):
            P.add_edge(u, v, weight=a["weight"], index=edge_index)
            edge_index += 1
    for cwalk in list(nx.connected_components(P)):
        if len(cwalk) < 3:
            for node in cwalk:
                P.remove_node(node)  # remove cwalks that include only one hop
    return P


def save_as_bed(graph, experiment_name):
    numbers = list(range(1, nx.number_connected_components(graph) + 1))
    order = [2, 0, 1]
    for cwalk, number in zip(list(nx.connected_components(graph)), numbers):
        cwalk_length = range(1, len(cwalk) + 1)
        for node, length in zip(cwalk, cwalk_length):
            node = [node[i] for i in order]
            node.append(f"{length}_{number}")
            collect_data(node, f"{experiment_name}", "a")


def main():
    restrictions_dict = parse_bedfile(sys.argv[2], sys.argv[1])  # .bed file
    tree_dict = dict()  # ex. tree_dict["chr1"] will be an object of type IntervalTree
    for chr in typical_chromosomes(sys.argv[1]):
        """ Interval tree construction, separate for each chromosome """
        restrictions = restrictions_dict[chr][1].tolist()
        intervals = [(i, j) for i, j in zip(restrictions[:-1], restrictions[1:])]
        tree_dict[chr] = IntervalTree.from_tuples(intervals)

    """ Parse C-walk positions """
    positions = parse_positions(sys.argv[3], 500)  # .tsv file with selected absolute threshold
    matching_edges(tree_dict, positions)

    """ C-walks construction """
    sorted_edges = sorted(G.edges(data=True), key=lambda x: x[2]["weight"], reverse=True)  # sort edges by read coverage
    P = cwalk_construction(sorted_edges)
    save_as_bed(P, sys.argv[5])  # save as .bed outfile with cwalks
    pickle.dump(P, open(sys.argv[4], "wb"))  # save cwalks as .txt outfile in binary mode


if __name__ == '__main__':
    G = nx.Graph()
    main()
from Bio.SeqIO.QualityIO import FastqGeneralIterator
import sys


def division(i, seq, positions) -> str:
    if i == len(positions):
        sub_seq = seq[positions[i - 1]:]
    elif i == 0:
        sub_seq = seq[:positions[i] + 4]
    else:
        sub_seq = seq[positions[i - 1]:positions[i] + 4]
    return sub_seq


def digest(input_reads) -> zip:
    sub_title, sub_seq, sub_qual = [], [], []
    for title, seq, qual in input_reads:
        positions = [i for i in range(len(seq)) if seq.startswith("GATC", i) if i != 0]
        if positions:
            for i in range(0, len(positions) + 1):
                sub_title.append(f"{i}.{title}")
                sub_seq.append(division(i, seq, positions))
                sub_qual.append(division(i, qual, positions))
        else:
            sub_title.append(title), sub_seq.append(seq), sub_qual.append(qual)
    return zip(sub_title, sub_seq, sub_qual)


def fastq_write(output_zip):
    with open(sys.argv[2], 'w') as output:
        for title, seq, qual in output_zip:
            output.write(f"@{title}\n{seq}\n+\n{qual}\n")


if __name__ == '__main__':
    reads = FastqGeneralIterator(sys.argv[1])
    fastq_write(digest(reads))
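digest() splits each read in silico at internal GATC occurrences (the DpnII recognition site), keeping the 4 bp motif at the end of one fragment and the start of the next. A worked toy example (the sequence is invented):

# toy read with one internal GATC at index 6 (sequence is invented)
seq = "AACCTTGATCGGAA"
positions = [i for i in range(len(seq)) if seq.startswith("GATC", i) if i != 0]
print(positions)               # [6]
print(seq[:positions[0] + 4])  # AACCTTGATC  (fragment 0 ends with the motif)
print(seq[positions[0]:])      # GATCGGAA    (fragment 1 starts at the motif)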
import networkx as nx
import matplotlib.pyplot as plt
import seaborn as sns
import sys
import statistics
import matplotlib.ticker as ticker
from cwalk_analysis import load_cwalk_graph, load_files
import random
import pandas as pd
from scipy.stats import wilcoxon


def directionality(path: list) -> list:
    # the distance is measured between the middle points of the restriction intervals
    path = [(node[0] + node[1]) / 2 for node in path]
    dist = [b - a for a, b in zip(path[:-1], path[1:])]
    # 0 if nodes are in the same restriction interval
    # if the same node occurs more than once it is irrelevant to the directionality concept,
    # as is a circle
    same_sign_dist = [(a * b >= 0) for a, b in zip(dist[:-1], dist[1:])]
    return same_sign_dist


def avg_directionality(signs: list) -> float:
    return round(statistics.mean(signs), 2)


def hist(data, data_random, avg, avg_random, length):
    sns.set_style("whitegrid")
    fig, axs = plt.subplots(1, 2, sharex=True, sharey=True, tight_layout=True, figsize=(15, 6))
    fig.suptitle(f"Distribution of directionality of c-walks of {length} length from {sys.argv[1]} cells", fontsize=14)
    axs[0].hist(data, color="tab:blue", edgecolor="black")
    axs[0].set_title(f"c-walks from data \n average value of directionality is {avg}", fontsize=14)
    axs[1].hist(data_random, color="tab:red", edgecolor="black")
    axs[1].set_title(f"randomized c-walks \n average value of directionality is {avg_random}", fontsize=14)
    axs[0].set_ylabel("number of c-walks", fontsize=14)
    fig.supxlabel("value of directionality", fontsize=14)
    axs[1].yaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos: "{0:g}".format(x / 1000) + "k"))
    axs[0].yaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos: "{0:g}".format(x / 1000) + "k"))
    return plt.savefig(f"{sys.argv[3]}"), plt.close()


def main():
    graphs, _ = load_files(sys.argv[2], load_cwalk_graph)  # load .txt cwalk graphs
    cwalks = []
    for graph in graphs:
        S = [graph.subgraph(c).copy() for c in nx.connected_components(graph)]
        for subgraph in S:
            sorted_cwalk = sorted(subgraph.edges(data=True), key=lambda x: x[2]["index"])
            edges = []
            nodes = []
            for next_edge in sorted_cwalk:
                v1, v2, _ = next_edge
                edges.append([v1, v2])
            # reconstruct the node order of the walk from its ordered edge list
            for i in range(1, len(edges)):
                for edge in edges:
                    v1, v2 = edge
                    if v1 in (edges[i][0], edges[i][1]) and v1 not in nodes:
                        nodes.append(v1)
                    elif v2 not in nodes:
                        nodes.append(v2)
            if nodes[0] != edges[0][0]:
                nodes.insert(0, edges[0][0])
            elif nodes[0] != edges[0][1]:
                nodes.insert(0, edges[0][1])
            if nodes[-1] == nodes[0]:  # remove repetitive nodes (in case of a circle)
                nodes = nodes[1:]  # remove the first node in a circle
            if all(nodes[i][2] == nodes[0][2] for i in range(0, len(nodes))):  # only intra-chrs cwalks
                cwalks.append(nodes)

    avg_signs = []
    avg_shuffle_signs = []
    for cwalk in cwalks:
        avg_signs.append(avg_directionality(directionality(cwalk)))
        random.shuffle(cwalk)
        avg_shuffle_signs.append(avg_directionality(directionality(cwalk)))

    hist(avg_signs, avg_shuffle_signs, round(statistics.mean(avg_signs), 2),
         round(statistics.mean(avg_shuffle_signs), 2), "all")

    res = wilcoxon(x=avg_signs, y=avg_shuffle_signs, zero_method="wilcox", alternative="less")
    print(f"The value of the test statistic: {res.statistic} and p-value: {res.pvalue}")


if __name__ == '__main__':
    main()
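Directionality reduces a walk to the signs of successive genomic steps: two consecutive hops in the same direction score True. A worked toy example (interval coordinates invented):

# toy intra-chromosomal walk; nodes are (start, end, chromosome) intervals
walk = [(0, 100, "chr1"), (200, 300, "chr1"), (500, 600, "chr1"), (350, 400, "chr1")]
mids = [(a + b) / 2 for a, b, _ in walk]              # [50.0, 250.0, 550.0, 375.0]
steps = [b - a for a, b in zip(mids[:-1], mids[1:])]  # [200.0, 300.0, -175.0]
signs = [(a * b >= 0) for a, b in zip(steps[:-1], steps[1:])]
print(signs)  # [True, False] -> directionality = mean = 0.5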
import pysam
import sys


def parse_bam(inbamfile, read):
    """ load a bam file and create a list of tuples """
    alignments = pysam.AlignmentFile(inbamfile, "rb")
    alignments_list = []
    for line in alignments:
        if line.is_reverse:
            strand = "reverse"
        else:
            strand = "forward"
        name = line.qname
        if line.qname[0].isdigit():
            name = line.qname[2:]
        if line.qname[0].isalpha():
            name = line.qname
        align = (name, f"R{read}", line.reference_name, line.pos, line.mapq, strand)  # order of tuples
        alignments_list.append(align)
    return alignments_list


def cleaning(alignments):
    """ filter out unmapped alignments and alignments with quality below 30 """
    return [align for align in alignments if align[4] >= 30]


def collect_data(data, experiment_name, WvsA):
    """ save supportive file for further analysis """
    import csv
    with open(f"{experiment_name}", WvsA, newline='') as outfile:
        tsv_output = csv.writer(outfile, delimiter="\t")
        tsv_output.writerow(data)


def typical_chromosomes(organism) -> list:
    """ return list of typical chromosomes for human or mouse """
    chrs = [f"chr{i}" for i in range(1, 23)] if organism == "human" else [f"chr{i}" for i in range(1, 20)]
    chrs.extend(["chrX", "chrY"])
    return chrs


def main():
    organism = sys.argv[1]
    experiment_R1 = sys.argv[2]
    experiment_R2 = sys.argv[3]
    experiment_name = sys.argv[4]

    """ create .tsv file with field names """
    fieldnames = ["seqname", "chr_R1", "chr_R2", "pos_R1", "pos_R2", "strand_1vs2", "RvsR", "abs_pos"]
    collect_data(fieldnames, experiment_name, "w")

    alignments_R1 = parse_bam(f"{experiment_R1}", 1)
    alignments_R2 = parse_bam(f"{experiment_R2}", 2)

    iter_1 = iter(alignments_R1)
    iter_2 = iter(alignments_R2)
    next1 = next(iter_1)
    next2 = next(iter_2)
    while next1 is not None:
        seqname = next1[0]
        assert seqname == next2[0]
        align_1 = []
        try:
            while next1[0] == seqname:
                align_1.append(next1)
                next1 = next(iter_1)
        except StopIteration:
            next1 = None
        align_2 = []
        try:
            while next2[0] == seqname:
                align_2.append(next2)
                next2 = next(iter_2)
        except StopIteration:
            next2 = None

        all_alignments = align_1 + align_2
        filtered_alignments = cleaning(all_alignments)

        for i in range(len(filtered_alignments)):
            for j in range(i + 1, len(filtered_alignments)):
                if (filtered_alignments[j][2] in typical_chromosomes(organism)
                        and filtered_alignments[i][2] in typical_chromosomes(organism)):
                    statistics = [
                        filtered_alignments[i][0],  # seqname
                        filtered_alignments[i][2],  # chromosome R1
                        filtered_alignments[j][2],  # chromosome R2
                        filtered_alignments[i][3],  # position R1
                        filtered_alignments[j][3],  # position R2
                        f"{filtered_alignments[i][5]} vs {filtered_alignments[j][5]}",  # strand_1vs2
                        f"{filtered_alignments[i][1]} vs {filtered_alignments[j][1]}",  # RvsR
                        abs(filtered_alignments[i][3] - filtered_alignments[j][3])  # abs_pos
                    ]
                    """ append aligns to the created .tsv file """
                    collect_data(statistics, experiment_name, "a")


if __name__ == '__main__':
    main()
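Each read's alignments are compared pairwise (every i < j combination), so one read with n surviving alignments yields n*(n-1)/2 rows. A small sketch with invented tuples:

# invented align tuples: (seqname, read, chrom, pos, mapq, strand)
aligns = [("r1", "R1", "chr1", 1000, 42, "forward"),
          ("r1", "R1", "chr1", 6200, 40, "reverse"),
          ("r1", "R2", "chr5", 300, 35, "forward")]
pairs = [(aligns[i], aligns[j])
         for i in range(len(aligns)) for j in range(i + 1, len(aligns))]
print(len(pairs))                            # 3 pairs from 3 alignments
print(abs(pairs[0][0][3] - pairs[0][1][3]))  # abs_pos of the first pair: 5200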
from cwalk import parse_bedfile
from cwalk_analysis import load_cwalk_graph, load_files
import pandas as pd
import networkx as nx
from intervaltree import Interval, IntervalTree
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.ticker as ticker
import numpy as np
from filtering import typical_chromosomes
import sys
import matplotlib.pylab as pylab

params = {'legend.fontsize': 'x-large',
          'axes.labelsize': 'x-large',
          'axes.titlesize': 'x-large',
          'xtick.labelsize': 'xx-large',
          'ytick.labelsize': 'xx-large'}
pylab.rcParams.update(params)


def boundaries(bedfile: str):
    # parse bedfile with TAD domain boundaries
    # return DataFrame with "chrom", "start", "end", "mode", "size" columns
    df = pd.read_csv(bedfile, delimiter="\t")
    return df[["chrom", "start", "end", "mode", "size"]]


def identical(itv: list) -> int:
    # return number of visited intervals from a list of intervals
    span = 1
    while not all(itv[0] == element for element in itv):
        span += 1
        a = itv[0]
        itv = [value for value in itv if value != a]
    return span


def chrs_sizes(chr_sizes: str) -> dict:
    # dict where chrs are keys and values are lists with the size of each chr
    df_sizes = pd.read_csv(chr_sizes, sep="\t", header=None)
    df_sizes = df_sizes.loc[df_sizes[0].isin(typical_chromosomes(sys.argv[1]))].reset_index(drop=True)
    df_sizes.columns = df_sizes.columns.map(str)
    return df_sizes.groupby("0")["1"].agg(list).to_dict()


def random(cwalk, chrs_dict, rest_dict):
    # return random cwalk (mirror image relative to the center of the chromosome)
    center = [(((i[0] + i[1]) / 2), i[2]) for i in cwalk]  # ex. (139285246.5, 'chr3')
    reflection = [(chrs_dict[i[1]][0] - i[0], i[1]) for i in center]
    itv = [(rest_dict[i[1]][i[0]], i[1]) for i in reflection]
    itv = [i for i in itv if len(i[0]) != 0]  # remove empty sets
    reflected = [(list(i[0])[0][0], list(i[0])[0][1], i[1]) for i in itv]
    return reflected


def tad_tree(boundaries):
    # interval tree where keys are chrs and values are (start, end, mode) tuples
    chrs_dict = {x: y for x, y in boundaries.groupby("chrom")}
    keys, values = [], []
    for key in chrs_dict.keys():
        left = chrs_dict[key]["start"].tolist()
        right = chrs_dict[key]["end"].tolist()
        mode = chrs_dict[key]["mode"].tolist()
        sets_list = [(a, b, c) for a, b, c in zip(left, right, mode)]
        keys.append(key)
        values.append(sets_list)
    boundaries_dict = dict(zip(keys, values))
    tree = dict()  # ex. tree_dict["chr1"] will be an object of type IntervalTree
    for key in boundaries_dict.keys():
        intervals = boundaries_dict[key]
        # Interval tree construction, separate for each chromosome
        tree[key] = IntervalTree.from_tuples(intervals)
    return tree


def counting(tad_dict, graphs, length):
    i = 0
    general_in, general_out = 0, 0
    active, passive, two, three, many = 0, 0, 0, 0, 0  # number of cwalks in each TAD type
    for graph in graphs:
        for cwalk in list(nx.connected_components(graph)):
            cwalk = list(cwalk)
            if len(cwalk) == length and all(
                    cwalk[i][2] == cwalk[0][2] for i in range(0, len(cwalk))):  # only intra-chrs c-walks
                i += 1  # number of intra-chrs c-walks
                # list of the middle of restriction intervals of each node in cwalk
                cwalk_boundaries = [((node[0] + node[1]) / 2) for node in cwalk]
                # find the list of TAD bounds (itv) for each node in cwalk
                itv_tad = [tad_dict[cwalk[0][2]][bound] for bound in cwalk_boundaries]
                in_tad = identical(itv_tad)  # number of TADs visited by cwalk
                modes = [list(i)[0][2] for i in list(itv_tad)]  # list of TAD mode for each node in cwalk
                if in_tad == 1:
                    general_in += 1  # number of cwalks in one TAD (active or passive)
                else:
                    general_out += 1  # number of cwalks in more than one TAD
                if in_tad == 1 and modes[0] == "active" and all(modes[0] == element for element in modes):
                    active += 1  # number of cwalks in one active TAD
                elif in_tad == 1 and modes[0] == "passive" and all(modes[0] == element for element in modes):
                    passive += 1  # number of cwalks in one passive TAD
                elif in_tad == 2:
                    two += 1  # number of cwalks in two TADs (active or passive)
                elif in_tad == 3:
                    three += 1  # number of cwalks in three TADs (active or passive)
                elif in_tad > 3:
                    many += 1  # number of cwalks in many TADs (active or passive)
    return active, passive, two, three, many, i, general_in, general_out


def random_counting(tad_dict, graphs, length, chrs_dict, rest_dict):
    i = 0
    general_in, general_out = 0, 0
    active, passive, two, three, many = 0, 0, 0, 0, 0  # number of random cwalks in each TAD type
    for graph in graphs:
        for cwalk in list(nx.connected_components(graph)):
            cwalk = list(cwalk)
            if len(cwalk) == length and all(
                    cwalk[i][2] == cwalk[0][2] for i in range(0, len(cwalk))):  # only intra-chrs c-walks
                random_cwalk = random(cwalk, chrs_dict, rest_dict)
                if len(cwalk) == len(random_cwalk):
                    i += 1  # number of random intra-chrs c-walks
                    # list of the middle of restriction intervals of each node in cwalk
                    cwalk_boundaries = [((node[0] + node[1]) / 2) for node in random_cwalk]
                    # find the list of TAD bounds (itv) for each node in cwalk
                    itv_tad = [tad_dict[random_cwalk[0][2]][bound] for bound in cwalk_boundaries]
                    in_tad = identical(itv_tad)  # number of TADs visited by cwalk
                    modes = [list(i)[0][2] for i in list(itv_tad)]  # list of TAD mode for each node in random cwalk
                    if in_tad == 1:
                        general_in += 1  # number of random cwalks in one TAD (active or passive)
                    else:
                        general_out += 1  # number of random cwalks in more than one TAD
                    if in_tad == 1 and modes[0] == "active" and all(modes[0] == element for element in modes):
                        active += 1  # number of cwalks in one active TAD
                    elif in_tad == 1 and modes[0] == "passive" and all(modes[0] == element for element in modes):
                        passive += 1  # number of cwalks in one passive TAD
                    elif in_tad == 2:
                        two += 1  # number of random cwalks in two TADs (active or passive)
                    elif in_tad == 3:
                        three += 1  # number of random cwalks in three TADs (active or passive)
                    elif in_tad > 3:
                        many += 1  # number of random cwalks in many TADs (active or passive)
    return active, passive, two, three, many, i, general_in, general_out


def tad_count(graphs, tree_dict, name):
    active, passive, two_active, two_passive, three_active, three_passive, other = 0, 0, 0, 0, 0, 0, 0
    labels = ["one active", "one passive", "two active", "two passive", "three active", "three passive", "other"]
    for graph in graphs:
        for cwalk in list(nx.connected_components(graph)):
            cwalk = list(cwalk)
            if all(cwalk[i][2] == cwalk[0][2] for i in range(0, len(cwalk))):  # only intra-chrs cwalks
                cwalk_boundaries = [((node[0] + node[1]) / 2) for node in cwalk]
                itv_tad = [tree_dict[cwalk[0][2]][bound] for bound in cwalk_boundaries]
                in_tad = identical(itv_tad)
                modes = [list(i)[0][2] for i in list(itv_tad)]
                if in_tad == 1 and modes[0] == "active" and all(modes[0] == element for element in modes):
                    active += 1
                elif in_tad == 1 and modes[0] == "passive" and all(modes[0] == element for element in modes):
                    passive += 1
                elif in_tad == 2 and modes[0] == "active" and all(modes[0] == element for element in modes):
                    two_active += 1
                elif in_tad == 2 and modes[0] == "passive" and all(modes[0] == element for element in modes):
                    two_passive += 1
                elif in_tad == 3 and modes[0] == "active" and all(modes[0] == element for element in modes):
                    three_active += 1
                elif in_tad == 3 and modes[0] == "passive" and all(modes[0] == element for element in modes):
                    three_passive += 1
                else:
                    other += 1
    data = [active, passive, two_active, two_passive, three_active, three_passive, other]
    sns.set_style("whitegrid")
    fig, ax = plt.subplots(figsize=(15, 12))
    ax.bar(labels, data, color="tab:blue", edgecolor="black", width=1)
    plt.xticks(rotation=30)
    ax.yaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos: "{0:g}".format(x / 1000) + "k"))
    plt.ylabel("Number of c-walks", fontsize=15)
    ax.set_title(f"Number of c-walks based on location in TAD types in {sys.argv[1]} cells", fontsize=20)
    return plt.savefig(f"{name}"), plt.close()


def chart_comparison(general_in, general_out, name):
    labels = ["data", "random"]
    sns.set_style("whitegrid")
    fig, ax = plt.subplots(figsize=(10, 8))
    bar1 = plt.bar(labels, general_in, label="in a single TAD", color="tab:blue", edgecolor="black", width=1)
    bar2 = plt.bar(labels, general_out, bottom=general_in, label="in more than one TAD",
                   color="tab:red", edgecolor="black", width=1)
    ax.set_title(f"Fraction of c-walks contained in one TAD and c-walks in more than one TAD \n"
                 f"Comparison of c-walks from data and random in {sys.argv[1]} cells", fontsize=16)
    plt.ylabel("Percentage")
    plt.legend([bar1, bar2], ["in a single TAD", "in more than one TAD"], loc="upper right")
    return plt.savefig(f"{name}"), plt.close()


def tad_fractions(one_tad_active, one_tad_passive, two_tad, three_tad, many_tad, name):
    sns.set_style("whitegrid")
    plt.figure(figsize=(12, 10))
    bar1 = plt.bar([i for i in range(3, 16)], one_tad_active, width=1, edgecolor="black", color="royalblue")
    bar2 = plt.bar([i for i in range(3, 16)], one_tad_passive, bottom=one_tad_active, width=1,
                   edgecolor="black", color="cornflowerblue")
    bar3 = plt.bar([i for i in range(3, 16)], two_tad, bottom=np.add(one_tad_active, one_tad_passive),
                   width=1, edgecolor="black", color="tab:cyan")
    bar4 = plt.bar([i for i in range(3, 16)], three_tad,
                   bottom=np.add(np.add(one_tad_active, one_tad_passive), two_tad),
                   width=1, edgecolor="black", color="lightsteelblue")
    bar5 = plt.bar([i for i in range(3, 16)], many_tad,
                   bottom=np.add(np.add(np.add(one_tad_active, one_tad_passive), two_tad), three_tad),
                   width=1, edgecolor="black", color="slategrey")
    plt.xlabel("Number of hops")
    plt.ylabel("Percentage")
    plt.title(f"Intra- and inter-TAD c-walks in {sys.argv[1]} cells", fontsize=18)
    plt.legend([bar1, bar2, bar3, bar4, bar5],
               ["1 TAD active", "1 TAD passive", "2 TAD", "3 TAD", "more TADs"],
               loc="upper right", prop={"size": 16})
    return plt.savefig(f"{name}"), plt.close()


def hop_count(active, passive, two, three, many, name):
    active = active[:6]
    passive = passive[:6]
    two = two[:6]
    three = three[:6]
    many = many[:6]
    sns.set_style("whitegrid")
    fig, ax = plt.subplots(figsize=(10, 8))
    bar1 = plt.bar([i for i in range(3, 9)], active, width=1, edgecolor="black", color="royalblue")
    bar2 = plt.bar([i for i in range(3, 9)], passive, bottom=active, width=1, edgecolor="black", color="cornflowerblue")
    bar3 = plt.bar([i for i in range(3, 9)], two, bottom=np.add(active, passive), width=1,
                   edgecolor="black", color="tab:cyan")
    bar4 = plt.bar([i for i in range(3, 9)], three, bottom=np.add(np.add(active, passive), two), width=1,
                   edgecolor="black", color="lightsteelblue")
    bar5 = plt.bar([i for i in range(3, 9)], many, bottom=np.add(np.add(np.add(active, passive), two), three),
                   width=1, edgecolor="black", color="slategrey")
    ax.yaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos: "{0:g}".format(x / 1000) + "k"))
    plt.xlabel("Number of hops")
    plt.ylabel("Number of c-walks")
    plt.title(f"Intra- and inter-TAD c-walks in {sys.argv[1]} cells", fontsize=18)
    plt.legend([bar1, bar2, bar3, bar4, bar5],
               ["1 TAD active", "1 TAD passive", "2 TAD", "3 TAD", "more TADs"],
               loc="upper right", prop={"size": 12})
    return plt.savefig(f"{name}"), plt.close()


def domains(boundaries, name):
    active_boundaries = boundaries.loc[boundaries["mode"] == "active"].dropna().reset_index(drop=True)
    passive_boundaries = boundaries.loc[boundaries["mode"] == "passive"].dropna().reset_index(drop=True)
    sns.set_style("whitegrid")
    plt.figure(figsize=(10, 8))
    sns.distplot(active_boundaries["size"], hist=False, color="red")
    sns.distplot(passive_boundaries["size"], hist=False, color="black")
    plt.title(f"{sys.argv[1]} domains size distribution")
    plt.legend(labels=["Active", "Inactive"])
    plt.xlabel("Domain size [Mb]")
    return plt.savefig(f"{name}"), plt.close()


def main():
    graphs, _ = load_files(sys.argv[2], load_cwalk_graph)  # load .txt cwalk graphs
    tad_dict = tad_tree(boundaries(sys.argv[3]))  # interval tree dict with TAD domains
    chrs_dict = chrs_sizes(sys.argv[4])

    # plot domains size distribution
    domains(boundaries(sys.argv[3]), sys.argv[11])

    # return dict where chrs are keys and values are lists of restriction itvs
    restrictions_dict = parse_bedfile(sys.argv[5], sys.argv[1])
    # Interval tree construction, separate for each chromosome
    rest_dict = dict()  # ex. tree_dict["chr1"] will be an object of type IntervalTree
    for chr in typical_chromosomes(sys.argv[1]):
        restrictions = restrictions_dict[chr][1].tolist()
        restrictions[0] = 0
        restrictions[-1] = chrs_dict[chr][0]
        intervals = [(i, j) for i, j in zip(restrictions[:-1], restrictions[1:])]
        rest_dict[chr] = IntervalTree.from_tuples(intervals)

    lst_active, lst_passive, lst_two, lst_three, lst_many, lst_i, lst_general_in, lst_general_out = \
        [], [], [], [], [], [], [], []
    lst_rdm_active, lst_rdm_passive, lst_rdm_two, lst_rdm_three, lst_rdm_many, lst_rdm_i, lst_rdm_general_in, \
        lst_rdm_general_out = [], [], [], [], [], [], [], []
    count_active, count_passive, count_two, count_three, count_many, count_general_in, count_general_out = \
        [], [], [], [], [], [], []
    rdm_count_active, rdm_count_passive, rdm_count_two, rdm_count_three, rdm_count_many, rdm_count_general_in, \
        rdm_count_general_out = [], [], [], [], [], [], []

    for length in range(3, 30):
        active, passive, two, three, many, i, general_in, general_out = counting(tad_dict, graphs, length)
        # list of % of intra-chrs cwalks of each length
        if i != 0:
            count_general_in.append(general_in)
            count_general_out.append(general_out)
            lst_general_in.append(round((general_in / i) * 100, 2))  # % of intra-chrs cwalks in a single TAD
            lst_general_out.append(round((general_out / i) * 100, 2))  # % of intra-chrs cwalks in more than one TAD
            lst_i.append(i)  # number of cwalks of each length

            rdm_active, rdm_passive, rdm_two, rdm_three, rdm_many, rdm_i, rdm_general_in, rdm_general_out = \
                random_counting(tad_dict, graphs, length, chrs_dict, rest_dict)
            rdm_count_general_in.append(rdm_general_in)
            rdm_count_general_out.append(rdm_general_out)
            lst_rdm_general_in.append(round((rdm_general_in / i) * 100, 2))  # % of random cwalks in a single TAD
            lst_rdm_general_out.append(round((rdm_general_out / i) * 100, 2))  # % of random cwalks in more than one TAD
            lst_rdm_i.append(rdm_i)  # number of random cwalks of each length

    """ Plotting """
    chart_comparison([round((sum(count_general_in) / sum(lst_i)) * 100, 2),
                      round((sum(rdm_count_general_in) / sum(lst_rdm_i)) * 100, 2)],
                     [round((sum(count_general_out) / sum(lst_i)) * 100, 2),
                      round((sum(rdm_count_general_out) / sum(lst_rdm_i)) * 100, 2)], sys.argv[9])

    print(f"Total number of c-walks: {sum(lst_i)}")
    print(f"Total number of random c-walks: {sum(lst_rdm_i)}")
    print(f"Percentage of intra-chrs cwalks in a single TAD: {round((sum(count_general_in) / sum(lst_i)) * 100, 2)}%")
    print(f"Percentage of intra-chrs cwalks in more than one TAD: {round((sum(count_general_out) / sum(lst_i)) * 100, 2)}%")
    print(f"Percentage of random intra-chrs cwalks in a single TAD: {round((sum(rdm_count_general_in) / sum(lst_rdm_i)) * 100, 2)}%")
    print(f"Percentage of random intra-chrs cwalks in more than one TAD: {round((sum(rdm_count_general_out) / sum(lst_rdm_i)) * 100, 2)}%")
    print(f"Number of c-walks from data inside one TAD: {sum(count_general_in)} and in more than one TAD:"
          f" {sum(count_general_out)}")
    print(f"Number of random c-walks inside one TAD: {sum(rdm_count_general_in)} and in more than one TAD: "
          f"{sum(rdm_count_general_out)}")
    print(f"We are considering a sample consisting of {sum(lst_i) + sum(lst_rdm_i)} c-walks "
          f"for which we examine two features (c-walks from data and random)."
          f" Significance level: 0.05")

    table = np.array([[sum(count_general_in), sum(count_general_out)],
                      [sum(rdm_count_general_in), sum(rdm_count_general_out)]])
    print("Fisher test:")
    from scipy.stats import fisher_exact
    oddsr, p = fisher_exact(table, alternative="two-sided")
    print(f"Two-sided Fisher: {oddsr} and p-value: {p}")
    oddsr, p = fisher_exact(table, alternative="greater")
    print(f"One-sided Fisher: {oddsr} and p-value: {p}")

    for length in range(3, 16):
        active, passive, two, three, many, i, general_in, general_out = counting(tad_dict, graphs, length)
        # list of % of intra-chrs cwalks of each length
        if i != 0:
            lst_active.append(round((active / i) * 100, 2))  # % of intra-chrs cwalks in one active TAD
            lst_passive.append(round((passive / i) * 100, 2))  # % of intra-chrs cwalks in one passive TAD
            lst_two.append(round((two / i) * 100, 2))  # % of intra-chrs cwalks in two TADs
            lst_three.append(round((three / i) * 100, 2))  # % of intra-chrs cwalks in three TADs
            lst_many.append(round((many / i) * 100, 2))  # % of intra-chrs cwalks in many TADs
            # list of intra-chrs cwalks counts of each length
            count_active.append(active)
            count_passive.append(passive)
            count_two.append(two)
            count_three.append(three)
            count_many.append(many)

            rdm_active, rdm_passive, rdm_two, rdm_three, rdm_many, rdm_i, rdm_general_in, rdm_general_out = \
                random_counting(tad_dict, graphs, length, chrs_dict, rest_dict)
            # list of % of intra-chrs random cwalks of each length
            lst_rdm_active.append(round((rdm_active / rdm_i) * 100, 2))  # % in one active TAD
            lst_rdm_passive.append(round((rdm_passive / rdm_i) * 100, 2))  # % in one passive TAD
            lst_rdm_two.append(round((rdm_two / rdm_i) * 100, 2))  # % in two TADs
            lst_rdm_three.append(round((rdm_three / rdm_i) * 100, 2))  # % in three TADs
            lst_rdm_many.append(round((rdm_many / rdm_i) * 100, 2))  # % in many TADs
            # list of intra-chrs random cwalks counts of each length
            rdm_count_active.append(active)
            rdm_count_passive.append(passive)
            rdm_count_two.append(two)
            rdm_count_three.append(three)
            rdm_count_many.append(many)

    """ Plotting """
    hop_count(count_active, count_passive, count_two, count_three, count_many, sys.argv[10])
    tad_count(graphs, tad_dict, sys.argv[8])
    tad_fractions(lst_active, lst_passive, lst_two, lst_three, lst_many, sys.argv[6])
    tad_fractions(lst_rdm_active, lst_rdm_passive, lst_rdm_two, lst_rdm_three, lst_rdm_many, sys.argv[7])


if __name__ == '__main__':
    main()
Python
Pandas
numpy
matplotlib
seaborn
scipy
Filtering
networkx
intervaltree
From line 3 of py/TAD.py
import networkx as nx
import matplotlib.pyplot as plt
import seaborn as sns
import sys
import statistics
import matplotlib.ticker as ticker
from cwalk_analysis import load_cwalk_graph, load_files
from filtering import typical_chromosomes
from scipy.stats import wilcoxon
import pandas as pd
import matplotlib.pylab as pylab

params = {'legend.fontsize': 'x-large',
          'axes.labelsize': 'x-large',
          'axes.titlesize': 'x-large',
          'xtick.labelsize': 'xx-large',
          'ytick.labelsize': 'xx-large'}
pylab.rcParams.update(params)


def parse_tf(tf: str) -> dict:
    """
    load transcription factor binding sites
    return: dict where chromosomes are keys and values are peaks within them
    """
    tf = pd.read_csv(tf, sep='\t', header=None)
    tf = tf.iloc[:, 0:3]
    tf[3] = ((tf[1] + tf[2]) / 2).astype(float)
    tf.columns = tf.columns.map(str)
    tf_dict = tf.groupby("0")["3"].agg(list).to_dict()
    return tf_dict


def mirror_peaks(chr_sizes: str, peaks_dict: dict) -> dict:
    """
    Mirror image of peaks
    load .tsv file with chromosome sizes
    return: dict where chromosomes are keys and values are "mirror peaks" within them
    """
    df_sizes = pd.read_csv(chr_sizes, sep='\t', header=None)
    df_sizes = df_sizes.loc[df_sizes[0].isin(typical_chromosomes("human"))].reset_index(drop=True)  # sys instead of str
    df_sizes.columns = df_sizes.columns.map(str)
    sizes_dict = df_sizes.groupby("0")["1"].agg(list).to_dict()
    keys, mirrors = [], []
    for key in peaks_dict.keys():
        mirror = [sizes_dict[key][0] - peak for peak in peaks_dict[key]]
        keys.append(key)
        mirrors.append(mirror)
    return dict(zip(keys, mirrors))


def count_boundaries(subgraph) -> list:
    # edge boundaries for one cwalk; subgraph is one cwalk
    edge_boundaries = []
    sorted_cwalk = sorted(subgraph.edges(data=True), key=lambda x: x[2]["index"])
    if all(i[2] == j[2] for i, j, _ in sorted_cwalk):  # only intra-chrs cwalks
        for next_edge in sorted_cwalk:
            v1, v2, _ = next_edge
            edge_bound = pd.Interval(min(v1[0], v2[0]), max(v1[0], v2[0]), closed="both")
            edge_boundaries.append((edge_bound, v1[2]))
    return edge_boundaries


def peak_in_edges(bounds: list, chromosome: str, tf_dict: dict) -> int:
    # number of edges which have at least one peak, in one cwalk
    isin = 0
    for edge in bounds:
        if chromosome != "chrY":
            for peak in tf_dict[chromosome]:
                if peak in edge:
                    isin += 1
                    break
    return isin


def histogram(labels, fractions, mirror_fractions):
    sns.set_style("whitegrid")
    fig, axs = plt.subplots(4, 3, sharex=True, sharey=True, tight_layout=True, figsize=(16, 18))
    # one panel per transcription factor; draw the lower-mean distribution first
    for k, ax in enumerate(axs.flatten()[:len(labels)]):
        if statistics.mean(fractions[k]) < statistics.mean(mirror_fractions[k]):
            ax.hist(fractions[k], color="tab:blue", edgecolor="black")
            ax.hist(mirror_fractions[k], color="tab:red", edgecolor="black")
        else:
            ax.hist(mirror_fractions[k], color="tab:red", edgecolor="black")
            ax.hist(fractions[k], color="tab:blue", edgecolor="black")
        ax.set_title(f"Overlap with {labels[k]} binding sites", fontsize=14)
    fig.delaxes(axs[3, 2])  # eleven factors fill eleven of the twelve panels
    fig.supxlabel("Fraction of edges in c-walk having a peak", fontsize=18)
    for row in range(4):
        axs[row, 1].yaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos: "{0:g}".format(x / 1000) + "k"))
    return plt.savefig(sys.argv[4]), plt.close()


def counting(boundaries, tf_dict, tf_mirror_dict):
    boundaries = filter(None, boundaries)  # remove empty lists
    fractions, mirror_fractions = [], []
    for cwalk_bounds in boundaries:  # iterate over each cwalk's edge bounds (edge bounds are pandas itvs)
        bounds = [i[0] for i in cwalk_bounds]
        chromosome = [i[1] for i in cwalk_bounds][0]
        isin = peak_in_edges(bounds, chromosome, tf_dict)
        mirror_isin = peak_in_edges(bounds, chromosome, tf_mirror_dict)
        # fraction of edges in the cwalk which have at least one peak,
        # normalized by cwalk length (number of nodes)
        fraction = round(isin / (len(cwalk_bounds) + 1), 2)
        mirror_fraction = round(mirror_isin / (len(cwalk_bounds) + 1), 2)
        fractions.append(fraction)
        mirror_fractions.append(mirror_fraction)
    return fractions, mirror_fractions


def main():
    graphs, _ = load_files(sys.argv[1], load_cwalk_graph)  # load folder with cwalks
    lst_tf_peaks, labels = load_files(sys.argv[3], parse_tf)
    lst_tf_mirror = []
    for tf_dict in lst_tf_peaks:
        lst_tf_mirror.append(mirror_peaks(sys.argv[2], tf_dict))

    boundaries = []
    for graph in graphs:
        S = [graph.subgraph(c).copy() for c in nx.connected_components(graph)]
        for subgraph in S:  # subgraph is one cwalk
            edge_bounds = count_boundaries(subgraph)  # edge bounds for one cwalk
            boundaries.append(edge_bounds)

    tf_labels = []
    tfs_fractions = []
    tfs_mirror_fractions = []
    for label, tf_dict, tf_mirror_dict in zip(labels, lst_tf_peaks, lst_tf_mirror):
        print(label[:-14])
        tf_labels.append(label[:-14])
        fractions, mirror_fractions = counting(boundaries, tf_dict, tf_mirror_dict)
        print(f"Average value of {label[:-14]} distribution is {statistics.mean(fractions)}")
        print(f"Average value of {label[:-14]} with randomized peaks distribution is {statistics.mean(mirror_fractions)}")

        if fractions and mirror_fractions:
            print("two-sided")
            res = wilcoxon(x=fractions, y=mirror_fractions, zero_method="zsplit", alternative="two-sided")
            print(f"The value of the test statistic: {res.statistic} and p-value: {res.pvalue}")
            res = wilcoxon(x=fractions, y=mirror_fractions, zero_method="zsplit", alternative="greater")
            print("greater")
            print(f"The value of the test statistic: {res.statistic} and p-value: {res.pvalue}")
            res = wilcoxon(x=fractions, y=mirror_fractions, zero_method="zsplit", alternative="less")
            print("less")
            print(f"The value of the test statistic: {res.statistic} and p-value: {res.pvalue}")
            print("wilcox")
            res = wilcoxon(x=fractions, y=mirror_fractions, zero_method="wilcox", alternative="greater")
            print("greater")
            print(f"The value of the test statistic: {res.statistic} and p-value: {res.pvalue}")
            res = wilcoxon(x=fractions, y=mirror_fractions, zero_method="wilcox", alternative="less")
            print("less")
            print(f"The value of the test statistic: {res.statistic} and p-value: {res.pvalue}")

        zeros = [each for each in fractions if each == 0.0]
        mirror_zeros = [each for each in mirror_fractions if each == 0.0]
        if fractions and mirror_fractions:
            print(f"Percentage of c-walks from data whose edges have no peaks within is "
                  f"{round(len(zeros) / (len(fractions)) * 100, 2)}%")
            print(f"Percentage of c-walks whose edges have no random peaks within is "
                  f"{round(len(mirror_zeros) / (len(mirror_fractions)) * 100, 2)}%")
        tfs_fractions.append(fractions)
        tfs_mirror_fractions.append(mirror_fractions)

    histogram(tf_labels, tfs_fractions, tfs_mirror_fractions)


if __name__ == '__main__':
    main()
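Peak overlap above relies on pandas Interval containment (`peak in edge`). A minimal sketch with invented coordinates:

import pandas as pd

edge = pd.Interval(1000, 2000, closed="both")  # span of one c-walk edge (invented)
peaks = [500.0, 1500.0, 2500.0]                # ChIP-seq peak midpoints (invented)
print([peak in edge for peak in peaks])        # [False, True, False]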
Created: 1yr ago
Updated: 1yr ago
Maintainers: public
URL: https://github.com/pd410668/Analysis_of_multi_way_chromatin_interactions_from_very_long_sequencing_reads
Name: analysis_of_multi_way_chromatin_interactions_from_
Version: 1
Downloaded: 0
Copyright: Public Domain
License: None
Keywords:
- Future updates