SigMa is a probabilistic model for the sequential dependencies of mutation signatures
This repository contains the source code for SigMa (Signature Markov model) and related experiments. SigMa is a probabilistic model of the sequential dependencies among mutation signatures.
Below, we provide an overview of the SigMa model from the corresponding paper: "The input data consists of (A) a set of predefined signatures that form an emission matrix E (here, for simplicity, represented over six mutation types), and (B) a sequence of mutation categories from a single sample and a distance threshold separating sky and cloud mutation segments. (C) The SigMa model has two components: (top) a multinomial mixture model (MMM) for isolated sky mutations and (bottom) an extension of a Hidden Markov Model (HMM) capturing sequential dependencies between close-by cloud mutations; all model parameters are learned from the input data in an unsupervised manner. (D) SigMa finds the most likely sequence of signatures that explains the observed mutations in sky and clouds."
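To make the sky/cloud split concrete, here is a minimal sketch (not part of the SigMa codebase; the distances and the 10 kb threshold are illustrative) of how one chromosome's mutations are partitioned by inter-mutation distance, mirroring the segmentation logic in `data_utils.py` below:

```python
import numpy as np

# Toy distances (bp) to the previous mutation; 1e100 is the sentinel that
# marks the start of a chromosome, as in the SigMa input JSON.
dists = np.array([1e100, 50000, 800, 1200, 90000, 300, 60000])
muts = np.arange(len(dists))   # stand-ins for mutation category indices
threshold = 10000              # hypothetical cloud threshold in bp

# A mutation opens a new segment when it is >= threshold bp from the
# previous one; singleton segments are "sky", longer ones are "clouds".
segment_starts = np.where(dists >= threshold)[0]
segments = np.split(muts, segment_starts)[1:]
sky = [s for s in segments if len(s) == 1]
clouds = [s for s in segments if len(s) > 1]
print(sky)     # [array([0]), array([6])]
print(clouds)  # [array([1, 2, 3]), array([4, 5])]
```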

Setup
Dependencies
SigMa is written in Python 3. We recommend using Conda to manage dependencies, which you can do directly using the provided `environment.yml` file:

```
conda env create -f environment.yml
source activate sigma-env
```

On Windows, replace the last command with:

```
activate sigma-env
```
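For a quick sanity check that the environment is usable, the following assumes `environment.yml` pins the packages the scripts below import (pomegranate, pandas, numpy) plus Snakemake:

```
python -c "import pomegranate, pandas, numpy; print('ok')"
snakemake --version
```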
Usage
We use Snakemake to manage the workflow of running SigMa on hundreds of tumor samples.
Reproducing the experiments from the SigMa paper
First, download and preprocess the ICGC breast cancer whole-genomes and COSMIC mutation signatures. To do so, run:
```
cd data && snakemake all
```
Second, run SigMa and a multinomial mixture model (MMM) on each sample, and perform leave-one-out cross-validation (LOOCV):
```
snakemake all
```
This will create an `output/` directory with two subdirectories: `models/`, which contains SigMa trained on each sample, and `loocv/`, which contains the results of LOOCV with SigMa using different cloud thresholds.
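For orientation, the resulting layout looks roughly like the following, where `PD1234a` stands in for a real sample name and `10000` for one of the cloud thresholds (both hypothetical; file names follow the `<model>-<sample>.json` pattern used by `train_and_predict.py`):

```
output/
├── models/
│   └── sigma10000/
│       └── sigma-PD1234a.json
└── loocv/
    ├── mmm/
    │   └── mmm-PD1234a.json
    └── sigma10000/
        └── sigma-PD1234a.json
```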
Configuration
To run the entire SigMa workflow on different mutation signatures or data, see the `Snakefile` for configuration options.

To train SigMa or MMM on individual mutation sequences, use the `src/train_and_predict.py` script. To get a list of command-line arguments, run:

```
python src/train_and_predict.py -h
```
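For example, a single-sample SigMa run might look like this; the paths and sample name are placeholders, but the flags match the script's argument parser (see the code snippets below):

```
python src/train_and_predict.py \
    -mf data/mutations.json \
    -sf data/signatures.tsv \
    -od output/models/sigma10000 \
    -mn sigma \
    -ct 10000 \
    -sn PD1234a
```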
Support
Please report bugs and feature requests in the Issues tab of this GitHub repository.
For further questions, please email Max Leiserson and Itay Sason (itaysason@mail.tau.ac.il) directly.
References
Xiaoqing Huang*, Itay Sason*, Damian Wojtowicz*, Yoo-Ah Kim, Mark Leiserson^, Teresa M. Przytycka^, Roded Sharan^. "Hidden Markov Models Lead to Higher Resolution Maps of Mutation Signature Activity in Cancer." Genome Medicine (2019). doi: 10.1186/s13073-019-0659-1

\* equal first author contribution; ^ equal senior author contribution
Code Snippets
```
shell:
    'python {TRAIN_AND_PREDICT_PY} -mf {input.mutations} -sf {input.signatures} '\
    '-od {TRAINED_MODEL_DIR}/sigma{wildcards.threshold} -mn {wildcards.model} '\
    '{params.active_signatures} -sn {wildcards.sample} -mi {params.max_iter} '\
    '-ct {wildcards.threshold} -rs {params.random_seed} -tol {params.tolerance}'
```
```
shell:
    'python {TRAIN_AND_PREDICT_PY} -mf {input.mutations} -sf {input.signatures} '\
    '-od {LOOCV_DIR}/sigma{wildcards.threshold} -mn {SIGMA_NAME} -sn {wildcards.sample} '\
    '{params.active_signatures} -mi {params.max_iter} -ct {wildcards.threshold} '\
    '-rs {params.random_seed} -tol {params.tolerance} --cross-validation-mode'
```
```
shell:
    'python {TRAIN_AND_PREDICT_PY} -mf {input.mutations} -sf {input.signatures} '\
    '-od {LOOCV_DIR}/mmm -mn {MMM_NAME} -sn {wildcards.sample} '\
    '{params.active_signatures} -mi {params.max_iter} -ct 0 '\
    '-rs {params.random_seed} -tol {params.tolerance} --cross-validation-mode'
```
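Note that the MMM rule above passes `-ct 0`; according to the docstring in `train_and_predict.py`, a cloud threshold of 0 means the mutation sequences are not split into clouds, so the mixture model is trained on whole chromosomes.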
```
shell:
    'python {CREATE_FIGURE_PY} -ld {LOOCV_DIR} -of {output}'
```
```python
MMM_NAME = 'mmm'
SIGMA_NAME = 'sigma'
MODEL_NAMES = [MMM_NAME, SIGMA_NAME]

# Columns in mutation files
PATIENT = 'Patient'
CATEGORY = 'SBS_96'
CATEGORY_IDX = 'Category index'
MUT_DIST = 'Distance to Previous Mutation'
CHROMOSOME = 'Chromosome'
START_POS = 'Start Position'
```
```python
import sys, os, argparse, logging, pandas as pd
from data_utils import get_logger
from constants import PATIENT

# Parse command-line arguments
parser = argparse.ArgumentParser()
parser.add_argument('-mf', '--mutations_file', type=str, required=True)
parser.add_argument('-sf', '--samples_file', type=str, required=True)
parser.add_argument('-of', '--output_file', type=str, required=True)
parser.add_argument('-v', '--verbosity', type=int, required=False, default=logging.INFO)
args = parser.parse_args(sys.argv[1:])

# Load the mutations file
mut_df = pd.read_csv(args.mutations_file, sep='\t', low_memory=False)
cols = mut_df.columns

# Load the sample name mapping
sample_df = pd.read_csv(args.samples_file, sep='\t')
sample_ids = [ s.split('a')[0] if s.endswith('a') or s.endswith('a2') else s.split('b')[0]
               for s in sample_df['submitted_sample_id'] ]
patient_map = dict(zip(sample_df['icgc_donor_id'], sample_ids))

# Map the patients and save
mut_df[PATIENT] = [ patient_map[p] for p in mut_df[PATIENT] ]
mut_df.to_csv(args.output_file, sep='\t', index=False)
```
```python
import matplotlib
matplotlib.use('Agg')
import sys, numpy as np, matplotlib.pyplot as plt, seaborn as sns
from os import listdir
from os.path import isfile, join
sns.set_style('whitegrid')

# Load our modules
from data_utils import load_json
from constants import MODEL_NAMES, SIGMA_NAME, MMM_NAME

def across_all_dir(path, files):
    scores = np.zeros((len(files), 24))
    for i, f in enumerate(files):
        sample_dict = load_json(join(path, f))
        for j, c in enumerate(sample_dict['chromosomesToResults'].keys()):
            scores[i][j] = sample_dict['chromosomesToResults'][c]['score']
    return scores

def fix_nans(mat):
    """
    returns the matrix with average over models if a model, sample, chromosome had nan in it.
    :param mat: ndarray (model, sample, chromosome)
    :return: mat ndarray (model, sample, chromosome)
    """
    mat = np.nan_to_num(mat)
    idx, idy, idz = np.where(mat == 0)
    for x, y, z in zip(idx, idy, idz):
        mat[x, y, z] = mat[:, y, z].mean()
    return mat

def analyze(par_path, out_fig):
    dirs = sorted(listdir(par_path))
    thresholds = []
    tmp = []
    names = []
    if MMM_NAME in dirs:
        tmp.append(MMM_NAME)
        names.append(MMM_NAME.upper())
    for d in dirs:
        if SIGMA_NAME in d:
            thresholds.append(int(d.split('a')[1]))
    thresholds = sorted(thresholds)
    print(thresholds)
    for i in thresholds:
        tmp.append(SIGMA_NAME + str(i))
        names.append(str(i // 1000) + 'K')
    dirs = tmp

    files_dict = {}
    for d in dirs:
        curr_path = join(par_path, d)
        files_dict[d] = sorted([f.split('-')[1] for f in listdir(curr_path)
                                if isfile(join(curr_path, f))])
    files_intersection = files_dict[dirs[0]]
    for d in dirs:
        files_intersection = np.intersect1d(files_dict[d], files_intersection)
    print('number of samples in common: {}'.format(len(files_intersection)))

    results_mat = np.zeros((len(dirs), len(files_intersection), 24))
    for i, d in enumerate(dirs):
        path = join(par_path, d)
        if SIGMA_NAME in d:
            prefix = SIGMA_NAME + '-'
        else:
            prefix = MMM_NAME + '-'
        scores = across_all_dir(path, [prefix + f for f in files_intersection])
        results_mat[i] = scores
    results_mat = fix_nans(results_mat)
    sum_results = np.sum(results_mat, axis=(1, 2)).tolist()
    for i in range(len(dirs)):
        print('{}: {}'.format(dirs[i], sum_results[i]))

    # Create the plot, highlighting the one with maximum log-likelihood
    max_ind = np.argmax(sum_results)
    ind = list(range(0, max_ind)) + list(range(max_ind + 1, len(names)))
    plt.bar(ind, sum_results[:max_ind] + sum_results[max_ind + 1:])
    plt.bar([max_ind], [sum_results[max_ind]], color=(181./255, 85./255, 85./255))
    plt.xticks(range(len(names)), names)
    plt.ylim((min(sum_results) - 1000, max(sum_results) + 1000))
    plt.xlabel('Model')
    plt.ylabel('Held-out log-likelihood')
    plt.title('Comparing MMM and SigMa with various cloud thresholds')
    plt.tight_layout()
    plt.savefig(out_fig)

def get_parser():
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument('-ld', '--loocv_dir', type=str, required=True)
    parser.add_argument('-of', '--output_file', type=str, required=True)
    return parser

def main(args):
    analyze(args.loocv_dir, args.output_file)

if __name__ == '__main__':
    main( get_parser().parse_args(sys.argv[1:]) )
```
```python
import numpy as np
from pomegranate import *
import json

################################################################################
# LOGGING
################################################################################
import logging

# Logging format
FORMAT = '%(asctime)s SigMa %(levelname)-10s: %(message)s'
logging.basicConfig(format=FORMAT)

def get_logger(verbosity=logging.INFO):
    '''
    Returns logger object
    '''
    logger = logging.getLogger(__name__)
    logger.setLevel(verbosity)
    return logger

################################################################################
# UTILS
################################################################################
def sample_and_noise(model, noise_dist, n_seqs, seqs_len):
    noise_change_dist = DiscreteDistribution(dict(zip(range(96), [1.0 / 96] * 96)))
    seqs = []
    noised_seqs = []
    for i in range(n_seqs):
        seq = np.array(model.sample(seqs_len))
        seqs.append(seq)
        noised_seq = seq.copy()
        hits = noise_dist.sample(seqs_len)
        for j, hit in enumerate(hits):
            if hit == 0:
                noised_seq[j] = noise_change_dist.sample()
        noised_seqs.append(noised_seq)
    return seqs, noised_seqs

def get_emissions(file='data/emissions_for_breast_cancer'):
    return np.load(file + '.npy')

def sample_uniform_between_a_b(n_states, a=0.0, b=1.0):
    return (b - a) * np.random.sample(n_states) + a

def random_seqs_from_json(file_name, n_seqs=10):
    seqs = []
    seqs_names = []
    json_file = json.load(open(file_name))
    samples = json_file[u'samples']
    samples_to_seq = json_file[u'sampleToSequence']
    samples = np.random.permutation(samples)
    for i in range(n_seqs):
        seqs.append(samples_to_seq[samples[i]])
        seqs_names.append(samples[i])
    return seqs, seqs_names

def to_json(file_name, dict_to_save):
    with open(file_name + '.json', 'w') as fp:
        json.dump(dict_to_save, fp)

def full_sample_to_chromosomes_seqs(sample, dists_sample):
    np_sample = np.array(sample)
    starting_chromosome_idxs = np.where(np.array(dists_sample) >= 1e100)[0]
    return np.split(np_sample, starting_chromosome_idxs)[1:]

def load_json(file_name):
    return json.load(open(file_name))

def get_split_sequences(file_name, sample_numbers=None):
    json_file = json.load(open(file_name))
    samples = json_file[u'samples']
    samples_to_seq = json_file[u'sampleToSequence']
    samples_dists = json_file[u'sampleToPrevMutDists']
    out_seqs = []
    out_names = []
    if sample_numbers is None:
        sample_numbers = range(len(samples))
    for i in sample_numbers:
        n = samples[i]
        out_names.append(n)
        out_seqs.append(full_sample_to_chromosomes_seqs(samples_to_seq[n], samples_dists[n]))
    return zip(out_names, out_seqs)

def get_full_sequences(file_name='data/nik-zainal2016-wgs-brca-mutations-for-hmm.json'):
    json_file = json.load(open(file_name))
    samples = json_file[u'samples']
    samples_to_seq = json_file[u'sampleToSequence']
    out_seqs = []
    out_names = []
    for n in samples:
        out_names.append(n)
        out_seqs.append(samples_to_seq[n])
    return zip(out_names, out_seqs)

def get_count_sequences_as_mat(file_name='data/nik-zainal2016-wgs-brca-mutations-for-hmm.json'):
    json_file = json.load(open(file_name))
    samples = json_file[u'samples']
    samples_to_seq = json_file[u'sampleToSequence']
    # finding num_objects + counting
    num_objects = 0
    samples_objects = []
    samples_counts = []
    for sample in samples:
        objects, counts = np.unique(samples_to_seq[sample], return_counts=True)
        samples_objects.append(objects)
        samples_counts.append(counts)
        num_objects = max(num_objects, np.max(objects))
    num_objects += 1
    count_mat = np.zeros((len(samples), num_objects))
    for i in range(len(samples)):
        count_mat[i, samples_objects[i]] = samples_counts[i]
    return count_mat

def get_samples_names(file_name='data/nik-zainal2016-wgs-brca-mutations-for-hmm.json'):
    json_file = json.load(open(file_name))
    samples = json_file[u'samples']
    return samples

def get_split_sequences_by_threshold(file_name, threshold, sample_numbers=None):
    json_file = json.load(open(file_name))
    samples = json_file[u'samples']
    samples_to_seq = json_file[u'sampleToSequence']
    samples_dists = json_file[u'sampleToPrevMutDists']
    out_seqs = []
    out_names = []
    if sample_numbers is None:
        sample_numbers = range(len(samples))
    for i in sample_numbers:
        n = samples[i]
        out_names.append(n)
        out_seqs.append(full_sample_to_chromosomes_seqs_by_threshold(samples_to_seq[n],
                                                                     samples_dists[n], threshold))
    return zip(out_names, out_seqs)

def full_sample_to_chromosomes_seqs_by_threshold(sample, dists_sample, threshold):
    np_sample = np.array(sample)
    np_dists = np.array(dists_sample)
    starting_chromosome_idxs = np.where(np_dists >= 1e100)[0]
    chromosomes = np.split(np_sample, starting_chromosome_idxs)[1:]
    chromosomes_dists = np.split(np_dists, starting_chromosome_idxs)[1:]
    out = []
    for i in range(len(chromosomes)):
        chromosome = chromosomes[i]
        chromosome_dists = chromosomes_dists[i]
        starting_seqs_idxs = np.where(chromosome_dists >= threshold)[0]
        seqs = np.split(chromosome, starting_seqs_idxs)[1:]
        out.append(seqs)
    return out

def seqs_to_seq(seqs):
    out = []
    for seq in seqs:
        out.extend(seq)
    return np.array(out)

def seqs_to_seq_of_prefix(seqs):
    out = []
    for seq in seqs:
        out.append(seq[0])
    return np.array(out)

def sample_indices_not_in_dir(dir_path):
    import os
    samples_in_dir = [f[:-5] for f in os.listdir(dir_path)]
    samples = get_samples_names()
    missing_indices = []
    for i in range(len(samples)):
        if samples[i] not in samples_in_dir:
            missing_indices.append(i)
    return missing_indices
```
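As a sanity check of the JSON format these helpers expect, the following toy example (file name and values hypothetical) builds a minimal input and splits one sample into chromosome and cloud segments with `get_split_sequences_by_threshold`:

```python
import json
from data_utils import get_split_sequences_by_threshold

# One sample, one chromosome (opened by the 1e100 sentinel distance),
# containing a two-mutation cloud followed by an isolated mutation.
toy = {
    'samples': ['S1'],
    'sampleToSequence': {'S1': [3, 7, 12]},
    'sampleToPrevMutDists': {'S1': [1e100, 500, 50000]},
}
with open('toy.json', 'w') as f:
    json.dump(toy, f)

for name, chromosomes in get_split_sequences_by_threshold('toy.json', 10000):
    print(name, [[seq.tolist() for seq in chrom] for chrom in chromosomes])
# S1 [[[3, 7], [12]]]
```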
```python
import sys, os, argparse, pandas as pd, logging, numpy as np

# Helpers for parsing categories into substitution, left flanking,
# and right flanking
def sub(c): return c.split('[')[1].split(']')[0]
def lf(c): return c.split('[')[0]
def rf(c): return c.split(']')[-1]

if __name__ == '__main__':
    # Parse command-line arguments
    parser = argparse.ArgumentParser()
    parser.add_argument('-i', '--input_file', type=str, required=True)
    parser.add_argument('-o', '--output_file', type=str, required=True)
    parser.add_argument('-v', '--verbosity', type=int, required=False, default=logging.INFO)
    args = parser.parse_args(sys.argv[1:])

    # Set up logger
    logger = logging.getLogger(__name__)
    logger.setLevel(args.verbosity)

    # Load the signatures
    logger.info('[Loading the signatures]')
    with open(args.input_file, 'r') as IN:
        arrs = [ l.rstrip('\n\t').split('\t') for l in IN ]
        header = arrs.pop(0)

    # Get categories and sort according to standard
    categories = [ arr[2] for arr in arrs ]
    categories.sort(key=lambda c: (sub(c), lf(c), rf(c)))

    # Create a container for the signatures
    sig_names = header[3:]
    K = len(sig_names)
    L = len(categories)
    sigs = np.zeros((K, L))

    # Parse the lines in the file
    for arr in arrs:
        j = categories.index(arr[2])
        for i, sig_name in enumerate(sig_names):
            sigs[i,j] += float(arr[3+i])

    logger.info('- Loaded %s x %s signature matrix' % sigs.shape)

    # Create dataframe and output to file
    logger.info('[Creating dataframe]')
    df = pd.DataFrame(index=sig_names, columns=categories, data=sigs)
    logger.info('- Saving to %s' % args.output_file)
    df.to_csv(args.output_file, sep='\t')
```
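For reference, the three category helpers decompose a COSMIC-style SBS-96 category string into its substitution and flanking bases; e.g. for the (illustrative) category 'A[C>A]T':

```python
c = 'A[C>A]T'
sub(c)  # 'C>A' -- the base substitution
lf(c)   # 'A'   -- the 5' (left) flanking base
rf(c)   # 'T'   -- the 3' (right) flanking base
```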
```python
import sys, os, argparse, json, pandas as pd, logging, numpy as np
from collections import defaultdict
from itertools import permutations
from pandas import read_csv, Categorical

# Load our modules
from data_utils import get_logger
from constants import PATIENT, CATEGORY, CATEGORY_IDX, MUT_DIST, CHROMOSOME, START_POS

# Command-line argument parser
def get_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument('-mf', '--mutations_file', type=str, required=True)
    parser.add_argument('-mbf', '--mappability_blacklist_file', type=str,
                        required=False, default=None)
    parser.add_argument('-sf', '--signatures_file', type=str, required=True)
    parser.add_argument('-of', '--output_file', type=str, required=True)
    parser.add_argument('-v', '--verbosity', type=int, required=False,
                        default=logging.INFO)
    return parser

# Main
def run( args ):
    # Set up logger
    logger = get_logger(args.verbosity)
    logger.info('[Loading input data]')

    # Load the signatures
    sig_df = pd.read_csv(args.signatures_file, sep='\t', index_col=0)
    categories = list(sig_df.columns)
    category_index = dict(zip(categories, range(len(categories))))
    logger.info('- Loaded %s x %s signature matrix' % sig_df.values.shape)

    # Load the mutations
    mut_df = pd.read_csv(args.mutations_file, sep='\t',
                         usecols=[PATIENT, CATEGORY, MUT_DIST, CHROMOSOME, START_POS],
                         dtype={PATIENT: str, CATEGORY: str, MUT_DIST: float,
                                CHROMOSOME: str, START_POS: int})
    samples = sorted(set(mut_df[PATIENT]))
    logger.info('- Loaded %s mutations in %s samples' % (len(mut_df), len(samples)))

    # If a mappability blacklist is provided, use it to remove mutations
    if not (args.mappability_blacklist_file is None):
        # Load the dataframe and process into a dictionary
        mappability_df = pd.read_csv(args.mappability_blacklist_file, sep=',')
        chrom_idx, start_idx, stop_idx = mappability_df.columns[:3]
        map_blacklist = defaultdict(list)
        unmappable_bases = 0
        for _, r in mappability_df.iterrows():
            chrom = r[chrom_idx][3:]
            map_blacklist[chrom].append(( int(r[start_idx]), int(r[stop_idx]) ))
            unmappable_bases += map_blacklist[chrom][-1][1] - map_blacklist[chrom][-1][0]
        logger.info('- Loaded unmappable regions spanning %s bases in %s chromosomes'
                    % (unmappable_bases, len(map_blacklist)))

        # Remove mutations that fall in a blacklisted region
        logger.info('[Removing unmappable mutations]')
        def mappable(r):
            for start, stop in map_blacklist[r[CHROMOSOME]]:
                if start <= r[START_POS] <= stop:
                    return False
            return True

        n_muts = len(mut_df)
        mut_df = mut_df[mut_df.apply(mappable, axis='columns')]
        n_unmappable = n_muts - len(mut_df)
        logger.info('\t-> Removed %s mutations that were not mappable' % n_unmappable)

    # Add the category index and create sequences of mutations
    logger.info('[Processing data into SigMa format]')
    mut_df[CATEGORY_IDX] = mut_df.apply(lambda r: category_index[r[CATEGORY]], axis='columns')
    sampleToSequence = dict( (s, list(map(int, s_df[CATEGORY_IDX])))
                             for s, s_df in mut_df.groupby([PATIENT]) )
    sampleToPrevMutDists = dict( (s, list(map(float, s_df[MUT_DIST])))
                                 for s, s_df in mut_df.groupby([PATIENT]) )

    # Save to JSON
    logger.info('- Saving to file %s' % args.output_file)
    with open(args.output_file, 'w') as OUT:
        output = dict(sampleToSequence=sampleToSequence,
                      sampleToPrevMutDists=sampleToPrevMutDists,
                      samples=samples, categories=categories, params=vars(args))
        json.dump( output, OUT )

if __name__ == '__main__':
    run( get_parser().parse_args(sys.argv[1:]) )
```
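The mutations file this script reads is a tab-separated table containing at least the five columns named in `constants.py`. A minimal, hypothetical example (the sentinel distance of 1e+100 marks the first mutation of each chromosome, matching how `data_utils.py` splits sequences; the column order here is an assumption):

```
Patient	Chromosome	Start Position	SBS_96	Distance to Previous Mutation
PD1234a	1	934567	A[C>A]T	1e+100
PD1234a	1	935067	T[C>T]G	500
```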
```python
import sys, os, numpy as np, json, logging, pandas as pd
from time import time
from tqdm import tqdm

# Load our modules
from constants import MODEL_NAMES, SIGMA_NAME, MMM_NAME
from models.MMMFrozenEmissions import MMMFrozenEmissions
from models.SigMa import SigMa
from data_utils import get_split_sequences_by_threshold, to_json, get_logger

################################################################################
# HELPERS
################################################################################
def leave_one_out(sample_seqs_tuple, model_name, emissions, max_iterations, epsilon):
    seqs = sample_seqs_tuple[1]
    n_seq = len(seqs)
    chromosomes_names = ['chromosome%s' % str(i).zfill(2) for i in range(n_seq)]
    chromosome_to_experiment_dict = {}
    total_score = 0
    total_time = 0
    total_test_length = 0
    total_train_length = 0
    total_num_iterations = 0
    for i in range(n_seq):
        # Hold out chromosome i for testing; train on all the others
        train_data = []
        train_length = 0
        test_length = 0
        test_data = seqs[i]
        for k in range(n_seq):
            if k != i:
                train_data.extend(seqs[k])
        for seq in train_data:
            train_length += len(seq)
        for seq in test_data:
            test_length += len(seq)

        tic = time()
        model, num_iterations = get_trained_model(model_name, emissions, train_data,
                                                  epsilon, max_iterations)
        train_time = time() - tic
        score = model.log_probability(test_data)
        current_dict = {'score': score, 'time': train_time,
                        'trainLength': train_length, 'testLength': test_length,
                        'numIterations': num_iterations}
        chromosome_to_experiment_dict[chromosomes_names[i]] = current_dict

        total_time += train_time
        total_score += score
        total_test_length += test_length
        total_train_length += train_length
        total_num_iterations += num_iterations

    summary_dict = {'time': total_time, 'score': total_score,
                    'testLength': total_test_length, 'trainLength': total_train_length,
                    'numIterations': total_num_iterations}
    output_dict = {'results': summary_dict, 'chromosomes': chromosomes_names,
                   'chromosomesToResults': chromosome_to_experiment_dict,
                   'numberChromosomes': n_seq}
    return output_dict

def get_viterbi(sample_seqs_tuple, model_name, emissions, max_iterations, epsilon):
    train_data = []
    train_length = 0
    for s in sample_seqs_tuple[1]:
        train_data.extend(s)
        train_length += len(s[0])

    tic = time()
    model, num_iterations = get_trained_model(model_name, emissions, train_data,
                                              epsilon, max_iterations)
    train_time = time() - tic
    score = model.log_probability(train_data)
    out_dict = {'score': score, 'numIterations': num_iterations,
                'time': train_time, 'trainLength': train_length}
    viterbi = model.predict(train_data)
    if model_name == MMM_NAME:
        out_dict['viterbi'] = {'path': viterbi}
    elif model_name == SIGMA_NAME:
        viterbi_dict = {'path': viterbi[1], 'cloud_indicator': viterbi[0]}
        map_prediction = model.predict(train_data)
        map_dict = {'path': map_prediction[1], 'cloud_indicator': map_prediction[0]}
        out_dict['viterbi'] = viterbi_dict
        out_dict['map'] = map_dict
    return out_dict

def get_trained_model(model_name, emissions, train_data, epsilon, max_iterations):
    model = get_model(model_name, emissions)
    num_iterations = model.fit(train_data, stop_threshold=epsilon,
                               max_iterations=max_iterations)
    return model, num_iterations

def get_model(model_name, emissions):
    if model_name == SIGMA_NAME:
        return SigMa(emissions)
    elif model_name == MMM_NAME:
        return MMMFrozenEmissions(emissions)
    else:
        raise NotImplementedError('Model "%s" not implemented' % model_name)

################################################################################
# MAIN
################################################################################
# Parser for command-line arguments
def get_parser():
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument('-sf', '--signatures_file', type=str, required=True)
    parser.add_argument('-mf', '--mutations_file', type=str, required=True)
    parser.add_argument('-mn', '--model_name', type=str, required=True, choices=MODEL_NAMES)
    parser.add_argument('-ct', '--cloud_threshold', type=int, required=True)
    parser.add_argument('-sn', '--sample_names', type=str, required=False, nargs='*', default=[])
    parser.add_argument('-mi', '--max_iter', type=int, required=False, default=500)
    parser.add_argument('-tol', '--tolerance', type=float, required=False, default=1e-3)
    parser.add_argument('-rs', '--random_state', type=int, required=False, default=5733)
    parser.add_argument('-od', '--output_directory', type=str, required=True)
    parser.add_argument('-as', '--active_signatures', type=int, required=False,
                        default=[], nargs='*', help='1-based indices of signatures')
    parser.add_argument('--cross-validation-mode', action='store_true',
                        default=False, required=False)
    parser.add_argument('-v', '--verbosity', type=int, required=False, default=logging.INFO)
    return parser

# Main
def main(args):
    """
    Reproduce the paper's results: train the chosen model ('sigma' or 'mmm')
    on each requested sample, then either run leave-one-out cross-validation
    over chromosomes (--cross-validation-mode) or compute Viterbi/MAP paths.
    The cloud threshold (-ct) is the maximal distance (in bp) within clouds;
    use 0 to not split (i.e. use whole chromosomes).
    """
    # Create simple logger
    logger = get_logger(args.verbosity)
    logger.info('[Loading input data]')

    # Get the list of samples we're going to run on
    with open(args.mutations_file, 'r') as IN:
        obj = json.load(IN)
        samples = obj.get('samples')
        categories = obj.get('categories')

    if len(args.sample_names) == 0:
        sample_indices = range(len(samples))
    else:
        sample_indices = [ samples.index(s) for s in args.sample_names ]

    logger.info('- Loading data for %s samples' % len(sample_indices))

    # Load the emissions matrix
    sig_df = pd.read_csv(args.signatures_file, sep='\t', index_col=0)
    emissions = sig_df.values
    if len(args.active_signatures) > 0:
        emissions = emissions[np.array(args.active_signatures)-1]

    assert( list(sig_df.columns) == categories )
    logger.info('- Loaded %s x %s emissions matrix' % emissions.shape)

    experiment_tuples = get_split_sequences_by_threshold(args.mutations_file,
                                                         args.cloud_threshold,
                                                         sample_indices)

    # Perform the experiments
    logger.info('[Performing experiments]')
    if args.cross_validation_mode:
        logger.info('- Cross-validation mode')
        func = leave_one_out
    else:
        logger.info('- Viterbi mode')
        func = get_viterbi

    for experiment_tuple in tqdm(experiment_tuples, total=len(sample_indices), ncols=80):
        np.random.seed(args.random_state)  # set the random state before each experiment
        sample = experiment_tuple[0]
        out_file = '%s/%s-%s' % (args.output_directory, args.model_name, sample)
        dict_to_save = func(experiment_tuple, args.model_name, emissions,
                            args.max_iter, args.tolerance)
        to_json(out_file, dict_to_save)
    logger.info('- Done')

if __name__ == '__main__':
    main( get_parser().parse_args(sys.argv[1:]) )
```
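Each run writes one JSON file per sample. Based on `leave_one_out` above, a LOOCV result has roughly this structure (all values illustrative):

```
{
  "results": {"score": -51234.5, "time": 812.3, "trainLength": 3100,
              "testLength": 150, "numIterations": 120},
  "chromosomes": ["chromosome00", "chromosome01", "..."],
  "chromosomesToResults": {"chromosome00": {"score": -2210.7, "...": "..."}},
  "numberChromosomes": 24
}
```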