Challenges of accurately estimating sex-biased admixture from X chromosomal and autosomal ancestry proportions
Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation, topic
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
Table of Contents
Models for estimating sex ratios assuming single pulse or constant admixture
Software requirements
The code was tested with the following python package version:
-
matplotlib v3.34
-
numpy v1.20.2
-
pandas v1.2.4
-
scipy v1.6.2
To set up a conda environment with the following versions and activate the environment, run the following commands:
conda env create -f sex_ratios.yml
conda activate sex_ratios
Input files
Either:
-
A comma-separated file, containing the ancestry proportions of various source populations and fraction of sampled females in different regions. The first row of the file is the header, and the file has the following structure:
,H1X,H1A,H2X,H2A,...,pf
region1,x,a,xx,aa,...,f
region2,x,a,xx,aa,...,f
where HiX and HiA are the X chromosomal and autosomal ancestry proportions that trace to source population 1, and pf are the fraction of females in the sample in a given region. Comments are indicated by a "#".
Or (only accepted by
scripts/estimate_sex_ratios_single_pulse.py
):
-
.Q
files for autosomes and X chromosomes, which were generated with ADMIXTURE.1 Additionally,.fam
file generated with plink2 needs to be provided to get individual IDs.2
If a comma-separated file with mean ancestry proprotions is provided as input, sex ratios are calculated based on the provided proportions. If .Q and .fam files are provided, individual ancestry proportions are bootstrapped and for each bootstrap resample mean ancestry proportions are computed and sex ratios are estimated, yielding confidence intervals.
Analysis
If
scripts/estimate_sex_ratios_single_pulse.py
is used, sex ratios are calculated using a dynamic model proposed by Golderberg and Rosenberg (2015) and an equilibirium model.3 Both models assume a single admixture event, a constant population size, no subsequent gene flow, random mating, no genetic drift, and no selection. The
scripts/estimate_sex_ratios_constant_admixture.py
implements another model proposed by Goldberg and Rosenberg that assumes constant, nonzero admixture.3 This model attempts to find parameter combinations of constant migration rates that fit the data. For details on any of these models see either Goldberg and Rosenberg or our publication.3
Output files
-
Single pulse: Generates an excel file with estimated sex ratios for 2-15 generations of random within the admixed population and the equilibrium sex ratio. If a comma-separated file is provided this is done for each ancestry and each region. The output file name can be specified with the
-o
flag. -
Constant admixture: Generates a box plot of possible sex ratios for each ancestry and region. The name of the output file can be specified with the
-o
flag.
Example usage
-
To infer sex ratios assuming a single admixture event and provided mean ancestry proportions in
input.txt
, do:
scripts/estimate_sex_ratios_single_pulse.py -i input.txt -o output.xlsx
- To infer sex ratios with confidence intervals assuming a single admixture event and provided individual level ADMIXTURE results for K clusters, do:
scripts/estimate_sex_ratios_single_pulse.py -qa admixture_results_autosomes.K.Q -qx admixture_results_x_chromosome.K.Q -f plink.fam -b 10000 -o output.xlsx
where confidence intervals are obtained from 10,000 bootstrap resamples(
-b
).
-
To infer sex ratios under a demographic scenarios of constant, nonzero admixture and given mean ancestry proportions provided in
input.txt
, do:
scripts/estimate_sex_ratios_constant_admixture.py -i input.txt -n 0.02 -d 0.01 -g 15 -o output_figure.pdf --sf0 0.5 --sm0 0.5
where 0.02 increments are used for the gridsearch (
-n
), accept only constant admixture parameter that have an Eucleadean distance less than 0.01 to observed mean ancestry proportions (
-d
), admixture is assumed to have occured 15 generations ago (
-g
), the output figure is saved to output_figure.pdf (
-o
), and initial female and male contributions are assumed to have been 0.5 (
--sf0
and
--sm0
; the effect of initial contributions is erased if g > 10).
Figure S1 in our manuscript can be reproduced by replacing
input.txt
with
data/ancestry_proportions_micheletti.txt
in the above command.
Recalculation of sex ratios based on summary statistics from Micheletti et al., calculation of possible sex ratios given the unassigend ancestry in their study, and sensitivity analysis of the applied models
Software requirements
See above.
Input files
African, European, and Native American mean ancestry proportions for different geographic regions were inferred from summary statistics of Micheletti et al. We weighted mean ancestry estimates for each granular region reported in Table S10 in Micheletti et al. by the corresponding sample size reported in Table S2 in Micheletti et al.4 The ancestry proportions were stored in
data/ancestry_proportions_micheletti.txt
.
Analysis
First, the script
sensitivity_analysis_and_distribute_unassigned_ancestry.py
applies the single pulse admixture model from Goldberg and Rosenberg (2015) to the provided ancestry proportions.3 If unassigned ancestry is present, such as in the study by Micheletti et al.,4 the unassigned ancestry is distributed in all permissible ways such that ancestry proportions sum to 100%. For each possible distribution, sex ratios are computed, yielding a distribution of sex ratios that is plotted as a box plot for each ancestry and region (this is only done for regions specified with the
-p
flag). In addition, it conducts a sensitivity analysis of the equilibirium model, which also holds for Goldberg's and Rosenberg's model. For more details, see our publication.
Output files
-
An excel file with estimated sex ratios for 2-15 generations of random within the admixed population and the equilibrium sex ratio for each ancestry and region based on provided ancestry proportions. The name of the output file can be specified with the
-o
flag (Table 1 in our manuscript). -
A figure with possible sex ratios under consideration of the unassigned ancestry. The name of the figure can be specified with
--figure
flag (Figure 1 in our manuscript). -
A figure illustrating the sensitivity of the applied models. The name of the figure can be specified with the
--figure_sensitivity
flag (Figure 2 in our manuscript).
Example usage
The following command takes mean ancestry proportions from Micheletti et al., distributes all unassigned ancestry for the geographic regions of the Latin Caribbean, central South America, northern South America, and Central America, and saves box plots of corresponding distributions of possible sf/sm values to figure1.pdf. Additionally, it conducts a sensitivity analysis, which will be plotted to figure2.pdf. In this case, it uses the African ancestry proportion that Micheletti et al. reported for central South America for the sensitivity analysis. Point estimates of sf/sm for each ancestry and georgraphic regions based on the reported mean ancestry proportions will be written to sex_ratios_micheletti.xlsx.
scripts/sensitivity_analysis_and_distribute_unassigned_ancestry.py -i data/ancestry_proportions_micheletti.txt --figure figure1.pdf -p1 "Latin Caribbean" "C South America" "N South America" "Central America" --figure_sensitivity figure2.pdf -p2 "C South America" -a Afr -o sex_ratios_micheletti.xlsx
Assessing the robustness of single admixture models to violations of demographic assumptions using simulations
Software requirements
The following software is required:
-
ADMIXTURE v1.3.0
-
plink2 v2.00a2.3
-
bcftools v
-
tabix v1.11
-
SliM3 v3.7.1
Additionally, the following Python packages are required:
-
numpy v1.22.2
-
pandas v1.4.1
-
multiprocessing-loggin v0.3.1
-
pyslim v0.700
-
msprime v1.1.1
-
matplotlib v3.5.1
-
scipy v1.8.0
-
scikit-learn v1.0.2
The simulations and subsequent analysis are implemented in a Snakemake workflow. Therefore, snakemake needs to be installed. Snakemake will automatically install all required software from the .yml files provided in the envs directory. Snakemake can be installed from the provided .yml file using the following commands:
conda env create -f snakemake.yml
conda activate snakemake
Analysis
To assess the effects of violations of demographic assumptions made by the single pulse admixture model, such as population growth, subsequent gene flow, and sex-specific assortative mating, we simulated these scenarios and evaluate the effects on inferred sex ratios. Three continental ancestral population are simulated according to Gravel et al. (2011),5 and admixture is simulated 15 generations ago. Then, random individuals are sampled from the ancestral and admixed population (we recommend 10,000), and LD pruning is performed using plink2.2 On the pruned data, ADMIXTURE is run to separately infer X chromosomal and autosomal ancestry proportions using the individuals from the ancestral populations as training data.6 Mitochondrial and Y chromosome haplogroup frequencies are inferred by clustering the ancestral haplogroups and assigning haplogroups of admixed individuals based proximity to ancestral haplogroup clusters. Finally, sex ratios are estimated using the single admixture models and bootstrapping individual ancestry proportions. Additionally, PCA is performed on the simulated populations to determine that the correct population structure was achievied, and site-frequency spectra (SFS) of the admixed population is plotted. Simulations are done with SLiM3 using tree sequence recording.7,8 Pyslim and msprime are then used for recapitation and superimposing mutations in Python.9,10
Output files
-
Tree sequences will be stored in the directory specified with
tree_dir
inconfig/config.yaml
-
VCF (unpruned and pruned), plink files, haplogroup assignments, ADMIXTURE results will be stored in the directory specified with
data_dir
inconfig/config.yaml
-
Figures of the PCA and SFS will stored in the directory specified with
plot_dir
inconfig/config.yaml
-
Summary files of bootstrap results will stored in directory specified with
results_dir
inconfig/config.yaml
Example usage
The entire process is implemented in a snakemake workflow, which can be executed using the command below.11
snakemake --use-conda --conda-frontend --cores 16
All simulations parameters, such as admixture proportions, sex biases, migration rates, sample sizes etc., can be set via the
config/config.yaml
file. To re-run all simulations that were presented in our paper (Table 2), run
./run_all_simulations.sh
. This scripts also generates Figure 3 in our manuscript, which assess the effect of sample size on the inferred sex ratios. Inferred sex ratios are compared for different sample sizes using a custom script:
scripts/assess_effect_of_sampling_size.py -qa <list_of_q_files_for_autosomes_for_different_sample_sizes> -qx <list_of_q_files_for_autosomes_for_different_sample_sizes> -f <fam_files_for_different_sample_sizes> -p ./ -o figure3.pdf
References
-
Alexander, David H, John Novembre, and Kenneth Lange. 2009. “Fast Model-Based Estimation of Ancestry in Unrelated Individuals.” Genome Research 19 (9): 1655–64. https://doi.org/10.1101/gr.094052.109.
-
Chang CC, Chow CC, Tellier LC, Vattikuti S, Purcell SM, Lee JJ. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience. 2015;4:7. Published 2015 Feb 25. doi:10.1186/s13742-015-0047-8
-
Goldberg, A., and Rosenberg, N.A. (2015). Beyond 2/3 and 1/3: The Complex Signatures of Sex-Biased Admixture on the X Chromosome. Genetics 201, 263 LP – 279.
-
Micheletti, S.J., Bryc, K., Ancona Esselmann, S.G., Freyman, W.A., Moreno, M.E., Poznik, G.D., Shastri, A.J., Agee, M., Aslibekyan, S., Auton, A., et al. (2020). Genetic Consequences of the Transatlantic Slave Trade in the Americas. Am. J. Hum. Genet. 107, 265–277.
-
Gravel, S., Henn, B.M., Gutenkunst, R.N., Indap, A.R., Marth, G.T., Clark, A.G., Yu, F., Gibbs, R.A., and Bustamante, C.D. (2011). Demographic history and rare allele sharing among human populations. Proc. Natl. Acad. Sci. U. S. A. 108, 11983–11988.
-
Alexander, D.H., Novembre, J., and Lange, K. (2009). Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 19, 1655–1664.
-
Haller, B.C., and Messer, P.W. (2019). SLiM 3: Forward Genetic Simulations Beyond the Wright–Fisher Model. Mol. Biol. Evol. 36, 632–637.
-
Haller, B.C., Galloway, J., Kelleher, J., Messer, P.W., and Ralph, P.L. (2019). Tree-sequence recording in SLiM opens new horizons for forward-time simulation of whole genomes. Mol. Ecol. Resour. 19, 552–566
-
Kelleher, J., Etheridge, A.M., and McVean, G. (2016). Efficient Coalescent Simulation and Genealogical Analysis for Large Sample Sizes. PLOS Comput. Biol. 12, 1–22
-
Nelson, D., Kelleher, J., Ragsdale, A.P., Moreau, C., McVean, G., and Gravel, S. (2020). Accounting for long-range correlations in genome-wide simulations of large cohorts. PLOS Genet. 16, 1–12.
-
Mölder, F., Jablonski, K.P., Letcher, B., Hall, M.B., Tomkins-Tinch, C.H., Sochat, V., Forster, J., Lee, S., Twardziok, S.O., Kanitz, A., Wilm, A., Holtgrewe, M., Rahmann, S., Nahnsen, S., Köster, J., 2021. Sustainable data analysis with Snakemake. F1000Res 10, 33.
Code Snippets
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 | import pickle import pyslim import msprime import sys import argparse import numpy as np import os import multiprocessing as mp import matplotlib.pyplot as plt import logging import warnings import re from multiprocessing_logging import install_mp_handler def tree_heights(ts): """ Taken from pyslim manual :param ts: treeSequence :return: array, tree heights """ # Calculate tree heights, giving uncoalesced sites the maximum time heights = np.zeros(ts.num_trees + 1) for tree in ts.trees(): if tree.num_roots > 1: # not fully coalesced heights[tree.index] = ts.metadata['SLiM']['generation'] else: children = tree.children(tree.root) real_root = tree.root if len(children) > 1 else children[0] heights[tree.index] = tree.time(real_root) heights[-1] = heights[-2] # repeat the last entry for plotting with step return heights def visualize_recapitation(pre_recap, post_recap): """ Visualize effect of recapitation :param pre_recap: TreeSequence, ts before recapitation :param post_recap: TreeSequence, ts after recapitation """ fig, ax = plt.subplots(2, 1, sharex=True, sharey=True) # Plot tree heights before recapitation breakpoints = list(pre_recap.breakpoints()) heights = tree_heights(pre_recap) ax[0].step(breakpoints, heights, where='post') ax[0].set_title('Prior recapitation') ax[0].set_ylabel('TMRCA') # ax[0].set_xlabel('Position') max_prior = heights.max() # Plot the tree heights after recapitation breakpoints = list(post_recap.breakpoints()) heights = tree_heights(post_recap) ax[1].step(breakpoints, heights, where='post') ax[1].axhline(max_prior, color='black', ls='--') ax[1].set_xlabel('Position') ax[1].set_ylabel('TMRCA') ax[1].set_title('Post recapitation') fig.savefig('plots/vis_recapitation.png', bbox_inches='tight') plt.close() def get_sfs(ts, pop, pop_index, filename, chr): """ Plot site frequency spectrum :param ts: TreeSequence :param pop: str, current population :param filename: str, path to save figure to :param chr: str, current chromosome """ fig, ax = plt.subplots() bins = np.arange(0, 101, 1) # plot real data if pop in ['afr', 'eur', 'ea']: if pop == 'afr': pop = 'yri' elif pop == 'eur': pop = 'ceu' elif pop == 'ea': pop = 'chb' if re.match('chr[0-9]+', chr): real_data = np.loadtxt(f'data/daf_frequencies_kg/allele_frequencies_{pop}_autosomes.txt') elif chr == 'chrX': real_data = np.loadtxt(f'data/daf_frequencies_kg/allele_frequencies_{pop}_chrX.txt') elif chr == 'chrY': real_data = np.loadtxt(f'data/daf_frequencies_kg/allele_frequencies_{pop}_chrY.txt') elif chr == 'mtDNA': real_data = np.loadtxt(f'data/daf_frequencies_kg/allele_frequencies_{pop}_mtDNA.txt') real_data = real_data[real_data > 0] real_data *= 100 hist, bin_edges = np.histogram(real_data, bins=bins, weights=np.repeat(1 / real_data.shape[0], real_data.shape[0])) ax.scatter(bin_edges[:-1], hist, marker='^', color='blue', label='real') corresponding_nodes = ts.samples(pop_index) ts = ts.simplify(samples=corresponding_nodes) afs = ts.allele_frequency_spectrum(span_normalise=False, polarised=True) afs = afs[1:] frequencies = np.ones_like(afs) frequencies = np.cumsum(frequencies) / frequencies.shape[0] * 100 afs /= afs.sum() hist, bin_edges = np.histogram(frequencies, bins=bins, weights=afs) ax.scatter(bin_edges[:-1], hist, marker='+', color='orange', label='simulations') ax.legend() fig.savefig(f"plots/{filename}", bbox_inches='tight') plt.close() def add_mutations(tree_path, out_tree_path, Ne, mu, mtDNA_mu, chr_length, mtDNA_length, populations, recapitate, force): """ Recapitate tree sequence, simply and overlay with mutations :param tree_path: str, path to input tree sequence :param out_tree_path: str, path were to dump mutated tree sequence :param Ne: float, ancestral population size :param mu: float, mutation rate :param mtDNA_mu: float, mutation rate mitochondrial DNA :param chr_length: int, chromosome length :param mtDNA_length: int, mtDNA length :param populations: list, acronyms of populations that are considered :param recapitate: boolean, whether to recaptitate or not :param force: boolean, whether to run full pipeline if out_tree_path already exists :return: TreeSequence, dict, tree sequence with mutations, pointer to nodes with Y-chromsomes and mtDNAs """ # load tree sequence logging.info(f'Loading tree sequence {tree_path}') if os.path.isfile(out_tree_path) and not force: mutated = pyslim.load(out_tree_path) f = open(out_tree_path.replace('.trees', '.pkl'), 'rb') sample_genotype_mapping_by_population = pickle.load(f) f.close() else: ts = pyslim.load(tree_path) if recapitate: # recapitate (Ne = ancestral Ne) logging.info(f'Recapitating {tree_path}') ts_recap = pyslim.recapitate(ts, Ne) visualize_recapitation(ts, ts_recap) ts = ts_recap # simplify logging.info(f'Simplifying {tree_path}') ts.simplify() # assert all sites coalesced for t in ts.trees(): if not t.num_roots == 1: warnings.warn("Tree {} did not completely coalesced." "Consider setting --recapitate flag.".format(tree_path, t.interval[0], t.interval[1])) break sample_genotype_mapping_by_population = {} for i, pop in enumerate(populations): sample_genotype_mapping_by_population[pop] = {} node_ids = ts.samples(i + 1) sample_genotype_mapping_by_population[pop]['nodes'] = node_ids pop_var = [var for var in ts.variants(samples=node_ids)] Y_pop_var = [var for var in pop_var if var.site.position == 2 * chr_length - 1] mtDNA_var = [var for var in pop_var if var.site.position == 2 * chr_length + mtDNA_length - 1] # nodes that have Y chromosome Y_chroms = node_ids[np.where(Y_pop_var[0].genotypes == 1)[0]] sample_genotype_mapping_by_population[pop]['Y'] = Y_chroms # nodes that have maternal mtDNA mtDNA = node_ids[np.where(mtDNA_var[0].genotypes == 1)[0]] sample_genotype_mapping_by_population[pop]['mtDNA'] = mtDNA f = open(out_tree_path.replace('.trees', '.pkl'), 'wb') pickle.dump(sample_genotype_mapping_by_population, f) f.close() # add mutations logging.info(f'Overlaying mutations {tree_path}') mutation_rate_map = msprime.RateMap(position=[0, 2 * chr_length, 2 * chr_length + mtDNA_length], rate=[mu, mtDNA_mu]) mutated = msprime.sim_mutations(ts, rate=mutation_rate_map, keep=False) mutated.dump(out_tree_path) return mutated, sample_genotype_mapping_by_population def reverse_map_individuals(old_new_mapping, individuals_nodes_mapping, ts): """ Map node IDs after simplifying to previous ones :param old_new_mapping: np.array, Has the shape (#nodes) where the uth entry gives the new node ID of uth node in the old mapping :param individuals_nodes_mapping: dict, maps old node ids to individual ids :param ts: TreeSequence, simplified tree sequence :return: list, list, old individual ids, new individual ids """ old_individuals = [] new_individuals = [] for indv, nds in individuals_nodes_mapping.items(): for nd in nds: new_node_id = old_new_mapping[nd] if new_node_id == -1: continue new_indv = ts.node(new_node_id).individual new_individuals.append(new_indv) old_individuals.append(indv) break return old_individuals, new_individuals def extract_population_specific_chromosomes(ts, pointers_Y_mtDNA, output_base, chr_length, mtDNA_length, populations, sample_size, target_population, off_target_sample_size, plot_sfs, threads): """ Write different chromosomes to VCF for each population :param ts: TreeSequence, tree sequence to sample from :param pointers_Y_mtDNA: dict, pointers to nodes with mtDNA and Y chromosomes :param output_base: str, string base for output file :param chr_length: int, chromosome length :param mtDNA_length: int, length of mtDNA :param populations: list, population acronyms :param sample_size: int, number of individuals to sample from each population :param target_population: str, population for which to extract full sample size :param off_target_sample_size: int, sample size for off-target populations :param plot_sfs: boolean, whether to plot SFS or not :param threads: int, number of processes """ # get current populations sampled_individuals_all_pop = [] sampled_nodes_all_pop = [] sampled_Y_all_pop = [] sampled_X_all_pop = [] sampled_mtDNA_all_pop = [] for i, pop in enumerate(populations): logging.info(f'Extracting chromosome specific tree sequences for {pop}') # population specific nodes nodes = pointers_Y_mtDNA[pop]['nodes'] # nodes with Y chromosomes Y_chromosomes = pointers_Y_mtDNA[pop]['Y'] # nodes with X chromosomes X_chromosomes = nodes[~np.isin(nodes, Y_chromosomes)] # nodes with maternal mtDNA mtDNA = pointers_Y_mtDNA[pop]['mtDNA'] # population specific individuals individuals = list(set([nd.individual for nd in ts.tables.nodes[nodes]])) if pop == target_population and len(individuals) < sample_size: logging.info(f'Population {pop} only has {len(individuals)} individuals. ' f'Reducing sample size for {pop} from {sample_size} to {len(individuals)}') c_sample_size = len(individuals) elif pop == target_population and len(individuals) >= sample_size: c_sample_size = sample_size elif pop != target_population and len(individuals) >= off_target_sample_size: c_sample_size = off_target_sample_size else: c_sample_size = len(individuals) # taken random sample of individuals sampled_individuals = np.random.choice(individuals, size=c_sample_size, replace=False) sampled_nodes = np.concatenate([ts.individual(indv).nodes for indv in sampled_individuals]) sampled_Y = Y_chromosomes[np.isin(Y_chromosomes, sampled_nodes)] sampled_X = X_chromosomes[np.isin(X_chromosomes, sampled_nodes)] sampled_mtDNA = mtDNA[np.isin(mtDNA, sampled_nodes)] sampled_individuals_all_pop.append(sampled_individuals) sampled_nodes_all_pop.append(sampled_nodes) sampled_Y_all_pop.append(sampled_Y) sampled_X_all_pop.append(sampled_X) sampled_mtDNA_all_pop.append(sampled_mtDNA) sampled_indv_amr = sampled_individuals.copy() # merge samples from all populations sampled_individuals = np.concatenate(sampled_individuals_all_pop) sampled_nodes = np.concatenate(sampled_nodes_all_pop) sampled_Y = np.concatenate(sampled_Y_all_pop) sampled_X = np.concatenate(sampled_X_all_pop) sampled_mtDNA = np.concatenate(sampled_mtDNA_all_pop) sampled_individuals_node_mapping = {indv: ts.individual(indv).nodes for indv in sampled_individuals} # extract tree sequences corresponsing to chromosome 1 ts_chr1, node_map_chr1 = ts.keep_intervals(np.array([[0, chr_length]]), simplify=False).simplify(sampled_nodes, map_nodes=True) # extract tree sequences corresponsing to sex chromosomes ts_Y_chromosomes, node_map_Y = ts.keep_intervals(np.array([[chr_length, 2 * chr_length]]), simplify=False).simplify(sampled_Y, map_nodes=True) ts_X_chromosomes, node_map_X = ts.keep_intervals(np.array([[chr_length, 2 * chr_length]]), simplify=False).simplify(sampled_X, map_nodes=True) # extract tree sequences for mtDNA ts_mtDNAs, node_map_mtDNA = ts.keep_intervals(np.array([[2 * chr_length, 2 * chr_length + mtDNA_length]]), simplify=False).simplify(sampled_mtDNA, map_nodes=True) sampled_individuals, individuals_chr1 = reverse_map_individuals(node_map_chr1, sampled_individuals_node_mapping, ts_chr1) sampled_individuals_Y, individuals_Y = reverse_map_individuals(node_map_Y, sampled_individuals_node_mapping, ts_Y_chromosomes) sampled_individuals_X, individuals_X = reverse_map_individuals(node_map_X, sampled_individuals_node_mapping, ts_X_chromosomes) sampled_individuals_mtDNA, individuals_mtDNA = reverse_map_individuals(node_map_mtDNA, sampled_individuals_node_mapping, ts_mtDNAs) # get sex of individuals and write to file sampled_individuals_sex = ['F' if ts_chr1.individual(indv).metadata['sex'] == 0 else 'M' for indv in individuals_chr1] sampled_individuals_population = [populations[ts_chr1.node(ts_chr1.individual(indv).nodes[0]).population] for indv in individuals_chr1] with open(f"{output_base}_sex.tab", 'w') as sex_pointer: sex_pointer.write('#IID\tsex\n') for indv, sex, pop in zip(sampled_individuals, sampled_individuals_sex, sampled_individuals_population): sex_pointer.write(f'{pop}{indv + 1}\t{sex}\n') sex_pointer.close() ## plot SFS if plot_sfs: for pop_index, pop in enumerate(populations): logging.info(f'Plotting SFS {output_base} for {pop}') for tree_seqs, chr in zip([ts_chr1, ts_X_chromosomes, ts_Y_chromosomes, ts_mtDNAs], ['chr1', 'chrX', 'chrY', 'mtDNA']): get_sfs(tree_seqs, pop, pop_index, f"{output_base.split('/')[-1]}_{pop}_{chr}_sfs.png", chr) # write .pop file for supervised admixture runs for ts, chr, individuals in zip([ts_chr1, ts_X_chromosomes, ts_Y_chromosomes, ts_mtDNAs], ['chr1', 'chrX', 'chrY', 'mtDNA'], [individuals_chr1, individuals_X, individuals_Y, individuals_mtDNA]): write_pop_file(ts, individuals, populations, target_population, f"{output_base}_{chr}.pop") ## write VCFs in parallel ready_to_map = [(tree_seqs, indvs, indv_names, f"{output_base}_{chr}.vcf", chr, populations) for tree_seqs, indvs, indv_names, chr in zip([ts_chr1, ts_X_chromosomes, ts_Y_chromosomes, ts_mtDNAs], [individuals_chr1, individuals_X, individuals_Y, individuals_mtDNA], [sampled_individuals, sampled_individuals_X, sampled_individuals_Y, sampled_individuals_mtDNA], ['chr1', 'chrX', 'chrY', 'mtDNA'])] # write_vcf(ready_to_map[3]) threads = min([threads, len(ready_to_map)]) install_mp_handler() pool = mp.Pool(processes=threads) pool.map(write_vcf, ready_to_map) pool.close() pool.join() def write_pop_file(ts, individuals, populations, target_population, outfile): source_populations = [populations[ts.node(ts.individual(indv).nodes[0]).population] if populations[ts.node(ts.individual(indv).nodes[0]).population] != target_population else '-' for indv in individuals] with open(outfile, 'w') as f: f.write(source_populations[0]) for pop in source_populations[1:]: f.write('\n') f.write(pop) f.close() def write_vcf(args): """ Write VCF file for a given chromosome and population :param args: (tuple), (TreeSequence, individual IDs, individual names, population, output file, chromosomes) """ ts, individuals, indv_names, outfile, chr, populations = args # name the sample individual_names = [f"{populations[ts.node(ts.individual(indv).nodes[0]).population]}{indv_name + 1}" for indv, indv_name in zip(individuals, indv_names)] # get contig id if chr == 'chr1': contig_id = '1' elif chr == 'chrX': contig_id = '23' elif chr == 'chrY': contig_id = '24' elif chr == 'mtDNA': contig_id = '26' logging.info(f'Writing VCF {outfile}') # see https://tskit.dev/tskit/docs/stable/python-api.html#tskit.TreeSequence.write_vcf with open(outfile, 'w') as vcf_file: ts.write_vcf(vcf_file, individuals=individuals, individual_names=individual_names, contig_id=contig_id) logging.info(f"Finished Writing VCF {outfile}") vcf_file.close() def main(argv): parser = argparse.ArgumentParser(description="Recaptitate tree sequence and superimpose neutral mutations. " "For qualtiy control, recapitation is visualized and SFS is plotted " "after superimposing mutations") parser.add_argument('--tree_sequences', help='Input tree sequences. Multiple files can be specified separated by ' 'a space. [trees/single_pulse_admixture.trees]', default=['trees/single_pulse_admixture.trees'], nargs='+') parser.add_argument('-o', '--output_tree_sequences', help='Output file with mutated tree sequences.' 'Creates a corresponding pickle file with pointers to ' 'nodes with Y chromosomes and mtDNAs. ' 'Multiple files can be specified separated by a space.' '[trees/single_pulse_admixture_mutated.trees]', default=['trees/single_pulse_admixture_mutated.trees'], nargs='+') parser.add_argument('--output_vcf_dir', help='Directory where to save VCF file [./data]', default='./data') parser.add_argument('-N', '--ancestral_population_size', help='Ancestral population size [7310.370867595234]', type=float, default=7310.370867595234) parser.add_argument('--mtDNA_mutation_rate', help='Mutation rate mitochondrial DNA [4e-4]', type=float, default=4e-4) parser.add_argument('-m', '--mutation_rate', type=float, default=2.36e-8, help='Germ line mutation rate [2.36e-8') parser.add_argument('-l', '--chromosome_length', help='Chromosome Length [1e8]', default=1e8, type=float) parser.add_argument('--mtDNA_length', help='Length of mitochondrial DNA [20000]', type=float, default=20000) parser.add_argument('--populations', nargs='+', help='Simulated populations in order [afr, eur, ea, amr]', default=['afr', 'eur', 'ea', 'amr']) parser.add_argument('-s', '--sample_size', type=int, default=100, help='Sample size, same for each population [100]') parser.add_argument('--target_population', help='Population of most interest. Extract full sample size ' 'for this population. [amr]', default='amr') parser.add_argument('--off_target_sample_size', help='Reference populations. Use this reduced sample size. [1000]', default=1000, type=int) parser.add_argument('--plot_sfs', action='store_true', help='Plot SFS only', default=False) parser.add_argument('--recapitate', help='Whether or not do recapitation to ensure complete coalescence [False]', action='store_true', default=False) parser.add_argument('--force', help='Force re-run if intermediate files already exist [False]', action='store_true', default=False) parser.add_argument('--log', help='Path to log file [log/adding_mutations.log]', default='log/adding_mutations.log') parser.add_argument('-t', '--threads', help='Number of processors [16]', default=16, type=int) args = parser.parse_args() if args.output_vcf_dir.endswith('/'): output_vcf_dir = args.output_vcf_dir else: output_vcf_dir = args.output_vcf_dir + '/' if not os.path.isdir(output_vcf_dir): os.makedirs(output_vcf_dir) logging.basicConfig(filename=args.log, level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s', datefmt='%H:%M:%S') # install_mp_handler() ancestral_population_size = int(args.ancestral_population_size) for ts, ots in zip(args.tree_sequences, args.output_tree_sequences): logger = logging.getLogger("Overlaying mutations on {ts}") mutated, pointers_Y_mtDNA = add_mutations(ts, ots, ancestral_population_size, args.mutation_rate, args.mtDNA_mutation_rate, args.chromosome_length, args.mtDNA_length, args.populations, args.recapitate, args.force) output_base = f"{output_vcf_dir}{ts.split('/')[-1].split('.trees')[0]}" extract_population_specific_chromosomes(mutated, pointers_Y_mtDNA, output_base, args.chromosome_length, args.mtDNA_length, args.populations, args.sample_size, args.target_population, args.off_target_sample_size, args.plot_sfs, args.threads) if __name__ == '__main__': main(sys.argv[1:]) |
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 | import os import sys import argparse import subprocess def run_simulations(recipe, ots, rec, length, length_mtDNA): """ Execute SLiM simulations :param recipe: str, path to recipe to use :param ots: str, path to output tree sequence :param rec: float, recombination rate :param length: int, chromosome length :param length_mtDNA: int, length of mtDNA """ logfile = ots.replace('.trees', '.log') command = ' '.join(['slim', '-d', f"\"outfile='{ots}'\"", '-d', f"\"logfile='{logfile}'\"", '-d', f'rec={rec}', '-d', f'L={length}', '-d', f'L_mtDNA={length_mtDNA}', recipe]) subprocess.check_call(command, shell=True) def main(argv): parser = argparse.ArgumentParser(description='Simulate ancestral continental mutations') parser.add_argument('-r', '--recombination_rate', type=float, default=1e-8, help='Recombination rate. [1e-8]') parser.add_argument('--recipe', help='SLiM recipe to use [scripts/afr_eur_ea.slim]', type=str, default='scripts/afr_eur_ea.slim') parser.add_argument('-l', '--chromosome_length', help='Chromosome Length [1e6]', default=1e6, type=float) parser.add_argument('--mtDNA_length', help='Length of mitochondrial DNA [20000]', type=float, default=20000) parser.add_argument('-o', '--output_tree_sequences', help='File to write tree sequences to.' '[./trees/afr_eur_ea.trees]', default='./trees/afr_eur_ea.trees') args = parser.parse_args() output_dir = '/'.join(args.output_tree_sequences.split('/')[:-1]) if not os.path.isdir(output_dir): os.makedirs(output_dir) run_simulations(args.recipe, args.output_tree_sequences, args.recombination_rate, args.chromosome_length, args.mtDNA_length) if __name__ == '__main__': main(sys.argv[1:]) |
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 | import sys import argparse import subprocess def run_simulations(recipe, ts, out, rec, length, mtDNA_length, afr, eur, ea, sm_afr, sm_eur, sm_ea, N_amr, afr_const=None, eur_const=None, ea_const=None, sm_afr_const=None, sm_eur_const=None, sm_ea_const=None, amr_growth_rate=0.0, rejection_rate_female_eur_male_non_eur=None): """ Execute SLiM simulations :param recipe: str, path to SLiM recipe :param ts: str, path to input tree sequence from where to start simulations :param out: str, path where to write output tree sequence to :param rec: float, recombination rate :param length: int, chromosome length :param mtDNA_length: int, mtDNA length :param afr: float, African admixture fraction :param eur: float, European admixture fraction :param ea: float, East Asian admixture fraction :param sm_afr: float, fraction of African males :param sm_eur: float, fraction of European males :param sm_ea: float, fraction of East Asian males :param N_amr: float, population size of admixed population :param afr_const: float, African migration after initial admixture; default=None :param eur_const: float, European migration after initial admixture; default=None :param ea_const: float, East Asian migration after initial admixture; default=None :param sm_afr_const: float, fraction of constantly migrating African males; default=None :param sm_eur_const: float, fraction of constantly migrating European males; default=None :param sm_ea_const: float, fraction of constantly migrating East Asian males; default=None :param amr_growth_rate: float, exponential growth rate of admixed population :param rejection_rate_female_eur_male_non_eur: float, Fraction of matings of European females and non-European males to reject in non_random mating scheme """ logfile = f"{out.split('.trees')[0]}.log" command = ['slim', '-d', f"\"input_tree_sequence='{ts}'\"", '-d', f"\"outbase='{out}'\"", '-d', f"\"logfile='{logfile}'\"", '-d', f'rec={rec}', '-d', f'L={length}', '-d', f'L_mtDNA={mtDNA_length}', '-d', f'afr={afr}', '-d', f'eur={eur}', '-d', f'ea={ea}', '-d', f'sm_afr={sm_afr}', '-d', f'sm_eur={sm_eur}', '-d', f'sm_ea={sm_ea}', '-d', f'adx_n={N_amr}', '-d', f'amr_growth_rate={amr_growth_rate}'] if not afr_const is None: command.extend(["-d", f'afr_const={afr_const}', "-d", f'eur_const={eur_const}', "-d", f'ea_const={ea_const}', "-d", f'sm_afr_const={sm_afr_const}', "-d", f'sm_eur_const={sm_eur_const}', "-d", f'sm_ea_const={sm_ea_const}']) if not rejection_rate_female_eur_male_non_eur is None and \ (recipe == 'scripts/single_pulse_non_random_mating.slim' or recipe == 'scripts/constant_admixture_non_random_mating.slim'): command.extend(["-d", f'rejection_rate_female_eur_male_non_eur={rejection_rate_female_eur_male_non_eur}']) command.append(recipe) command = ' '.join(command) subprocess.check_call(command, shell=True) def main(argv): parser = argparse.ArgumentParser(description="Simulate three-way (American) admixture of ancestral " "continental populations") parser.add_argument('-r', '--recombination_rate', type=float, default=1e-8, help='Recombination rate. [1e-8]') parser.add_argument('--recipe', help='SLiM recipe to use [scripts/single_pulse_admixture.slim]', type=str, default='scripts/single_pulse_admixture.slim') parser.add_argument('-l', '--chromosome_length', help='Chromosome Length [1e6]', default=1e6, type=float) parser.add_argument('--mtDNA_length', help='Length of mitochondrial DNA [20000]', type=float, default=20000) parser.add_argument('--tree_sequences', help='Input tree sequences [trees/afr_eur_ea.trees]', default='trees/afr_eur_ea.trees') parser.add_argument('--afr', help='African admixture fraction [1/6]', default=1/6, type=float) parser.add_argument('--eur', help='European admixture fraction [1/3]', default=1/3, type=float) parser.add_argument('--ea', help='East Asian admixture fraction [1/2]', default=1/3, type=float) parser.add_argument('--sm_afr', help='Fraction of males of admixing Africans [1/2]', default=1/2, type=float) parser.add_argument('--sm_eur', help='Fraction of males of admixing Europeans[1/2]', default=1/2, type=float) parser.add_argument('--sm_ea', help='Fraction of males of admixing East Asians [1/2]', default=1/2, type=float) parser.add_argument('--N_amr', help='Initial size of admixed American population [1000]', default=1000, type=int) parser.add_argument('--afr_const', help='Constant African admixture fraction', default=None, type=float) parser.add_argument('--eur_const', help='Constant European admixture fraction', default=None, type=float) parser.add_argument('--ea_const', help='Constant East Asian admixture fraction', default=None, type=float) parser.add_argument('--sm_afr_const', help='Fraction of males of constantly admixing Africans', default=None, type=float) parser.add_argument('--sm_eur_const', help='Fraction of males of constantly admixing Europeans', default=None, type=float) parser.add_argument('--sm_ea_const', help='Fraction of males of constantly admixing East Asians', default=None, type=float) parser.add_argument('--amr_growth_rate', help='Exponential growth rate of admixed populations [0.0]', default=0.0, type=float) parser.add_argument('--rejection_rate_female_eur_male_non_eur', help='Fraction of matings of European females and non-European males to reject in ' 'non-random mating scheme. Only has an effect if ' '--recipe scripts/single_pulse_non_random_mating.slim', default=None, type=float) args = parser.parse_args() output_treeseq_base = f"{'/'.join(args.tree_sequences.split('/')[:-1])}/{args.recipe.split('/')[-1].split('.slim')[0]}_g" if not args.afr_const is None: assert args.eur_const is not None and args.ea_const is not None and args.sm_afr_const is not None and \ args.sm_eur_const is not None and args.sm_ea_const is not None and "constant" in args.recipe, \ "If specifing one constant admixture parameter you have to specify all and " \ "select SLiM recipe for constant admixture" run_simulations(args.recipe, args.tree_sequences, output_treeseq_base, args.recombination_rate, args.chromosome_length, args.mtDNA_length, args.afr, args.eur, args.ea, args.sm_afr, args.sm_eur, args.sm_ea, args.N_amr, args.afr_const, args.eur_const, args.ea_const, args.sm_afr_const, args.sm_eur_const, args.sm_ea_const, args.amr_growth_rate, args.rejection_rate_female_eur_male_non_eur) if __name__ == '__main__': main(sys.argv[1:]) |
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 | import sys import argparse import re import logging import pandas as pd import numpy as np from sklearn.metrics import pairwise_distances from scipy.spatial.distance import euclidean from sklearn.cluster import DBSCAN, KMeans, AffinityPropagation, AgglomerativeClustering, SpectralClustering, MeanShift def loss_clustering(eigenvec, labels, populations): """ Quantify loss based on ancestry of homogeneity of assigned clusters. Distance between individuals of same cluster only contributes to loss if they come from different populations :param eigenvec: array-like, PCs of samples :param labels: array-like, assigned cluster IDs :param populations: array-like, populations of origin :return: float, loss """ pops = np.array(populations) # calculate pairwise distances within clusters dist = pairwise_distances(eigenvec) # 1 if in same cluster, 0 if in different one indicator_matrix_same_cluster =(labels[:, None] == labels[None, :]).astype(int) # 0 if sampled from same population, 1 if sampled from different ones indicator_matrix_same_pops = (pops[:, None] != pops[None, :]).astype(int) # multiple distances by indicator matrices loss = np.sum(dist * indicator_matrix_same_pops * indicator_matrix_same_cluster) return loss def find_best_clustering(eigenvec_source, populations_source, not_accepted_algos=[]): """ Find best clustering via GridSearch :param eigenvec_source: array-like, PCs of samples of source populations :param populations_source: array-like, origin populations of samples :param not_accepted_algos: list, clustering algorithms that were previously rejected :return: dict, best algorithm + params """ c_error = -1 # DBSCAN if not 'dbscan' in not_accepted_algos: logging.info('Running DBSCAN') eps_vals = [1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1] min_samples = np.arange(5, 50, 5) for eps in eps_vals: for ms in min_samples: dbscan = DBSCAN(min_samples=ms, eps=eps).fit(eigenvec_source.values) loss = loss_clustering(eigenvec_source, dbscan.labels_, populations_source) if loss < c_error or c_error == -1: best_params = {'min_samples': ms, 'eps': eps, 'algo': 'dbscan'} c_error = loss logging.info('Best run: Loss: {:.4f}; Params: {}'.format(c_error, best_params)) n_clusters = np.arange(3, 50, 1) if not 'kmeans' in not_accepted_algos: # KMeans logging.info('Running KMeans') for n in n_clusters: kmeans = KMeans(n_clusters=n, random_state=1).fit(eigenvec_source.values) loss = loss_clustering(eigenvec_source, kmeans.labels_, populations_source) if loss < c_error or c_error == -1: best_params = {'n_clusters': n, 'algo': 'kmeans'} c_error = loss logging.info('Best run: Loss: {:.4f}; Params: {}'.format(c_error, best_params)) if not 'agglomerative' in not_accepted_algos: # Agglomerative logging.info('Running Agglomerative Clustering') for n in n_clusters: agglo = AgglomerativeClustering(n_clusters=n).fit(eigenvec_source.values) loss = loss_clustering(eigenvec_source, agglo.labels_, populations_source) if loss < c_error or c_error == -1: best_params = {'n_clusters': n, 'algo': 'agglomerative'} c_error = loss logging.info('Best run: Loss: {:.4f}; Params: {}'.format(c_error, best_params)) if not 'affinity' in not_accepted_algos: # Affinity logging.info('Running Affinity propagation') damping = np.arange(0.5, 1.0, 0.05) for d in damping: affinity = AffinityPropagation(damping=d, max_iter=500, random_state=1).fit(eigenvec_source) loss = loss_clustering(eigenvec_source, affinity.labels_, populations_source) if loss < c_error or c_error == -1: best_params = {'damping': d, 'algo': 'affinity'} c_error = loss logging.info('Best run: Loss: {:.4f}; Params: {}'.format(c_error, best_params)) if not 'meanshift' in not_accepted_algos: # MeanShift logging.info('Running Mean Shift') meanshift = MeanShift().fit(eigenvec_source.values) loss = loss_clustering(eigenvec_source, meanshift.labels_, populations_source) if loss < c_error or c_error == -1: best_params = {'algo': 'meanshift'} c_error = loss logging.info('Best run: Loss: {:.4f}; Params: {}'.format(c_error, best_params)) if not 'spectral' in not_accepted_algos: # Spectral logging.info("Running Spectral Clustering") for n in n_clusters: spectral = SpectralClustering(n_clusters=n, random_state=1).fit(eigenvec_source.values) loss = loss_clustering(eigenvec_source, spectral.labels_, populations_source) if loss < c_error or c_error == -1: best_params = {'n_clusters': n, 'algo': 'spectral'} c_error = loss logging.info('Best run: Loss: {:.4f}; Params: {}'.format(c_error, best_params)) return best_params def cluster_to_ancestry(labels, populations): """ Map cluster IDs to ancestries based on majority :param labels: array-like, cluster IDs :param populations: array-like, populations of origin :return: dict, mapping cluster id - ancestry """ cluster_to_labels = {} for l in labels: pops, counts = np.unique(populations[labels == l], return_counts=True) c_pop = pops[np.argmax(counts)] cluster_to_labels[l] = c_pop return cluster_to_labels def dbscan_predict(dbscan_model, X_test, metric=euclidean): """ Predict DBSCAN cluster belonging :param dbscan_model: trained DBSCAN model :param X_test: array-like, PCs American samples :param metric: sklearn.spatial.distance metric :return: array-like, cluster IDs of American samples """ # Taken from https://stackoverflow.com/questions/27822752/scikit-learn-predicting-new-points-with-dbscan # Result is noise by default y_test = np.ones(shape=len(X_test), dtype=int) * -1 # Iterate all input samples for a label for j, x_new in enumerate(X_test): # Find a core sample closer than EPS for i, x_core in enumerate(dbscan_model.components_): if metric(x_new, x_core) < dbscan_model.eps: # Assign label of x_core to x_new y_test[j] = dbscan_model.labels_[dbscan_model.core_sample_indices_[i]] break return y_test def find_nearest_neighbor(X_train, labels, X_test): """ Define cluster belonging based on nearest neighbor :param X_train: array-like, PCs of training samples (non-America) :param labels: array-like, cluster ids of training samples :param X_test: array-like, PCs of American samples :return: array-like, cluster IDs of American samples """ y_test = np.ones(len(X_test)) for i, x in enumerate(X_test): if len(x.shape) == 1: x = x[np.newaxis, :] # find nearest neighbor dist = pairwise_distances(X_train, x) y_test[i] = labels[np.argmin(dist)] return y_test def predict_amr_haplogroups(eigenvec_source, eigenvec_amr, populations_source, best_params): """ Predicted cluster belonging of American samples :param eigenvec_source: pd.DataFrame, PCs of individuals from source populations :param eigenvec_amr: pd.DataFrame, PCs of individuals from American population :param populations_source: array-like, population of origin of non-American samples :param best_params: dict, best algorithm and parameters :return: trained clustering algorithm, predicted cluster ids, corresponding ancestries """ logging.info('Predicting American haplogroups') # retrain clustering if best_params['algo'] == 'dbscan': clustering = DBSCAN(eps=best_params['eps'], min_samples=best_params['min_samples']).fit(eigenvec_source.values) elif best_params['algo'] == 'agglomerative': clustering = AgglomerativeClustering(n_clusters=best_params['n_clusters']).fit(eigenvec_source.values) elif best_params['algo'] == 'kmeans': clustering = KMeans(n_clusters=best_params['n_clusters']).fit(eigenvec_source.values) elif best_params['algo'] == 'affinity': clustering = AffinityPropagation(damping=best_params['damping']).fit(eigenvec_source.values) elif best_params['algo'] == 'meanshift': clustering = MeanShift().fit(eigenvec_source.values) elif best_params['algo'] == 'spectral': clustering = SpectralClustering(n_clusters=best_params['n_clusters']).fit(eigenvec_source.values) # map cluster labels to ancestry based on majority cluster_ancestry_mapping = cluster_to_ancestry(clustering.labels_, populations_source) # predict cluster belonging of American individuals if best_params['algo'] == 'dbscan': amr_cluster = dbscan_predict(clustering, eigenvec_amr.values) elif best_params['algo'] == 'agglomerative' or best_params["algo"] == 'spectral': amr_cluster = find_nearest_neighbor(eigenvec_source.values, clustering.labels_, eigenvec_amr.values) else: amr_cluster = clustering.predict(eigenvec_amr.values) # convert cluster ids to ancestries amr_haplogroups = [cluster_ancestry_mapping[y] for y in amr_cluster] # summarize ancestry, counts = np.unique([amr_haplogroups], return_counts=True) logging.info([f'{counts[i]} {ancestry[i].upper()}' for i in range(len(ancestry))]) return clustering, amr_cluster, amr_haplogroups def check_clustering_and_predictions(clustering, predictions, algo): """ Check if clustering and predictions were meaningful :param clustering: trained clustering algorithm from sklearn :param predictions: np.array, predicted cluster labels for american samples :param algo: str, name of algorithm :return: boolean, whether or not to accept clustering """ logging.info('Checking clustering and predictions') # too many singletons if np.unique(clustering.labels_).shape[0] / clustering.labels_.shape[0] > 0.15: logging.info( '{} found to many clusters. Re-running find_best_clustering with excluding {}'.format(algo, algo)) return False # too many training samples unassigned elif clustering.labels_[clustering.labels_ == -1].shape[0] / clustering.labels_.shape[0] > 0.1: logging.info('{} left too many samples unassigned. ' 'Re-running find_best_clustering with excluding {}'.format(algo, algo)) return False # too many american individuals were unassigned elif predictions[predictions == -1].shape[0] / predictions.shape[0] > 0.1: logging.info('{} left too many American samples unassigned. ' 'Re-running find_best_clustering with excluding {}'.format(algo, algo)) return False else: # accept return True def main(argv): parser = argparse.ArgumentParser(description='Cluster ancestral mtDNA and Y haplogroups and assign ancestral ' 'haplogroups of admixed individuals based on proximity to ' 'ancestral clusters') parser.add_argument('--eigenvec', help='Path to .eigenvec file generated with plink. Multiple files can be ' 'specified separated by a space.', nargs='+') parser.add_argument('--log', help='Log file [log/classifying_mtDNA_Y_haplogroups.log]', default='log/classifying_mtDNA_Y_haplogroups.log') parser.add_argument('--output_file', help="File to write haplogroup assignments to. Multiple files can be " "specified separated by a space.", nargs='+') args = parser.parse_args() logging.basicConfig(filename=args.log, level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s', datefmt='%H:%M:%S') logger = logging.getLogger() for eigenvec, output_file in zip(args.eigenvec, args.output_file): eigenvec = pd.read_csv(eigenvec, sep='\t') eigenvec.set_index('#IID', inplace=True) populations = np.array([re.sub('[0-9]*', '', ind) for ind in eigenvec.index.values]) eigenvec_source = eigenvec.loc[populations != 'amr',] populations_source = populations[populations != 'amr'] eigenvec_amr = eigenvec.loc[populations == 'amr',] accept_clustering = False not_accepted_clustering = [] while not accept_clustering: best_params = find_best_clustering(eigenvec_source, populations_source, not_accepted_algos=not_accepted_clustering) clustering, amr_cluster, amr_haplogroups = predict_amr_haplogroups(eigenvec_source, eigenvec_amr, populations_source, best_params) accept_clustering = check_clustering_and_predictions(clustering, amr_cluster, best_params['algo']) if not accept_clustering: logging.info('Did not accept clustering {}'.format(best_params)) not_accepted_clustering.append(best_params['algo']) if len(not_accepted_clustering) == 6: logging.info('None of the tried clustering algorithms was accepted ({}). ' 'Try something else.'.format(not_accepted_clustering)) sys.exit(1) logging.info('Accepted {}'.format(best_params)) cluster_ancestry_mapping = cluster_to_ancestry(clustering.labels_, populations_source) source_haplogroups = [cluster_ancestry_mapping[y] for y in clustering.labels_] haplogroups = pd.DataFrame(index=eigenvec.index.values) haplogroups.loc[eigenvec_source.index.values, 'haplogroup'] = source_haplogroups haplogroups.loc[eigenvec_amr.index.values, 'haplogroup'] = amr_haplogroups logging.info(f'Writing haplogroup assignments to {output_file}') haplogroups.to_csv(output_file, sep='\t', header=True, index=True) if __name__ == '__main__': main(sys.argv[1:]) |
2
of
scripts/classify_mtDNA_Y_haplogroups.py
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 | import sys import argparse import subprocess def main(argv): parser = argparse.ArgumentParser(description='Convert vcf to plink file') parser.add_argument('--vcf', help='Input vcf. Multiple files can be specified separated by a space.', nargs='+', required=True) parser.add_argument('--output_base', help='Output bases. Multiple files can be specified separated by a space.', nargs='+', required=True) parser.add_argument('--sex', help='Individual sex mapping files. Multiple files can be specified separated ' 'by a space.', nargs='+', required=True) parser.add_argument('--pruned', help='Plink *prune.in files. Multiple files can be specified separated by a space.', nargs='+', default=[]) parser.add_argument('--threads', help='Number of CPUs [1]', default=1) args = parser.parse_args() if len(args.pruned) > 0: for i, o, s, p in zip(args.vcf, args.output_base, args.sex, args.pruned): subprocess.call(f"plink2 --threads {args.threads} --chr-set 40 --allow-extra-chr --vcf {i} --make-bed " f"--extract {p} --out {o} --pca 20 approx --update-sex {s} --max-alleles 2 --maf 0.01", shell=True) else: for i, o, s in zip(args.vcf, args.output_base, args.sex): subprocess.call(f"plink2 --threads {args.threads} --chr-set 40 --allow-extra-chr --vcf {i} --make-bed " f"--out {o} --pca 20 approx --update-sex {s} --max-alleles 2", shell=True) if __name__ == '__main__': main(sys.argv[1:]) |
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 | import sys import argparse import subprocess def main(argv): parser = argparse.ArgumentParser(description='Perform LD pruning using plink2') parser.add_argument('--vcf', help='Input vcf. Multiple files can be specified separated by a space.', nargs='+', required=True) parser.add_argument('--output_base', help='Output bases. Multiple files can be specified separated by a space.', nargs='+', required=True) parser.add_argument('--sex', help='Individual sex mapping files. Multiple files can be specified separated ' 'by a space.', nargs='+', required=True) parser.add_argument('--threads', help='Number of CPUs [1]', default=1) args = parser.parse_args() for i, o, s in zip(args.vcf, args.output_base, args.sex): subprocess.call(f"plink2 --threads {args.threads} --allow-extra-chr --vcf {i} --indep-pairwise 50 kb 1 0.1 " f"--out {o} --update-sex {s}", shell=True) if __name__ == '__main__': main(sys.argv[1:]) |
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 | import matplotlib.pyplot as plt from matplotlib.lines import Line2D import sys import argparse import pandas as pd import numpy as np import os import re def visualize_pcs(eigenvecs, outfile, pc_a=1, pc_b=2): """ Visualize PCA done with plink :param eigenvecs: str, path to .eigenvecs file :param outfile: str, where to save Figure to :param pc_a: int, first PC to plot :param pc_b: int, second PC to plot """ num_pop = eigenvecs.population.unique().shape[0] populations = eigenvecs.population.unique() np.random.seed(1) colors = [(np.random.random(), np.random.random(), np.random.random()) for i in range(num_pop)] color_dict = {populations[i]: colors[i] for i in range(num_pop)} colors = np.array([color_dict[pop] for pop in eigenvecs.population]) fig, ax = plt.subplots() ax.scatter(eigenvecs.loc[:, f'PC{pc_a}'], eigenvecs.loc[:, f'PC{pc_b}'], color=colors) ax.set_xlabel(f'PC{pc_a}') ax.set_ylabel(f'PC{pc_b}') legend_elements = [Line2D([0], [0], color=color_dict[pop], marker='o', label=pop.upper()) for pop in populations] ax.legend(handles=legend_elements, loc='upper center', ncol=4, bbox_to_anchor=(0.5, -.14)) fig.savefig(outfile, bbox_inches='tight') plt.close() def main(argv): parser = argparse.ArgumentParser(description='Visualize PCA performed with plink2') parser.add_argument('--eigenvec', nargs='+', help='.eigenvec file that were generated by plink. ' 'You can specify multiple separated by a space') parser.add_argument('--plot_dir', help='Directory where to save plots') args = parser.parse_args() if args.plot_dir.endswith('/'): plot_dir = args.plot_dir else: plot_dir = args.plot_dir + '/' if not os.path.isdir(plot_dir): os.makedirs(plot_dir) eigenvec_files=sorted(args.eigenvec) for eigenvec in eigenvec_files: eigenvectors = pd.read_csv(eigenvec, sep='\t') eigenvectors.set_index('#IID', inplace=True) populations = np.array([re.sub('[0-9]*', '', ind) for ind in eigenvectors.index.values]) eigenvectors.loc[:, 'population'] = populations outfile = plot_dir + eigenvec.split('/')[-1].split('.eigenvec')[0] + "_pca.png" visualize_pcs(eigenvectors, outfile) if __name__ == '__main__': main(sys.argv[1:]) |
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 | import sys import argparse import pandas as pd import numpy as np import re import matplotlib.pyplot as plt import os from matplotlib.lines import Line2D import logging from sklearn.utils import resample def read_results(admixture_results_autosomes, admixture_results_chrx, fam_file): """ Read Admixture results for autosomes and X chromosomes as well fam file to obtain individual IDs and sexes :param admixture_results_autosomes: str, Path to *.Q file for autosomes :param admixture_results_chrx: str, Path to *.Q file for X chromosomes :param fam_file: str, Path to *.fam file :return: pd.DataFrame, pd.DataFrame, admixture results for autosomes and X chromosomes """ # read fam file indv_names_and_sex = pd.read_csv(fam_file, sep='\t', header=None, usecols=[1, 4]) # read Q file autosomes K = int(admixture_results_autosomes.split('.')[-2]) admixture_autosomes = pd.read_csv(admixture_results_autosomes, sep=' ', header=None, names=[f'K{i}' for i in range(K)]) # send IDs as index admixture_autosomes.index = indv_names_and_sex.iloc[:, 0].values # add sexes admixture_autosomes.loc[:, 'sex'] = indv_names_and_sex.iloc[:, 1].values - 1 # read Q file X chromosomes admixture_x = pd.read_csv(admixture_results_chrx, sep=' ', header=None, names=[f'K{i}' for i in range(K)]) # set IDs as index admixture_x.index = indv_names_and_sex.iloc[:, 0].values # add sexes admixture_x.loc[:, 'sex'] = indv_names_and_sex.iloc[:, 1].values - 1 # add sample populations populations = np.array([re.sub('[0-9]*', '', ind) for ind in indv_names_and_sex.iloc[:, 0]]) admixture_autosomes.loc[:, 'pop'] = populations admixture_x.loc[:, 'pop'] = populations return admixture_autosomes, admixture_x def assign_k_to_ancestry(df, ancestries=['afr', 'eur', 'ea']): """ Identify K corresponding to source ancestries :param df: pd.DataFrame, admixture results :param ancestries: list, source populations :return: pd.DataFrame, with renamed columns according to main ancestries """ mean_per_k = df.groupby("pop").mean().iloc[:, :-1] mean_per_k = mean_per_k.loc[ancestries] max_ind_k = np.argmax(mean_per_k.values, axis=0) if np.unique(max_ind_k).shape[0] < len(ancestries): logging.debug(f"Inadequate number of source populations inferred." f"The mean contributions of source populations to different cluster were\n{mean_per_k}") raise ValueError('Not all source ancestries are represented. You likely used an inadequate number for K. ' 'Check log file for more details') new_col_names = {col: ancestries[max_ind_k[i]] for i, col in enumerate(mean_per_k.columns)} df.rename(columns=new_col_names, inplace=True) new_df = pd.DataFrame(index=df.index) for anc in ancestries: try: new_df.loc[:, anc] = df.loc[:, anc].sum(axis=1) except ValueError: new_df.loc[:, anc] = df.loc[:, anc] new_df.loc[:, 'pop'] = df.loc[:, 'pop'].copy() new_df.loc[:, 'sex'] = df.loc[:, 'sex'].copy() return new_df def plot_autosome_proportions_vs_x_proportions(proportions_autosomes, proportions_x, output_base): """ Plot autosomal vs. X chromosomal ancestry proportions :param proportions_autosomes: pd.DataFrame, autosomal admixture results :param proportions_x: pd.DataFrame, X chromosomal admixture results :param output_base: str, output base to save plot per population to """ np.random.seed(13) populations = proportions_autosomes.loc[:, 'pop'].unique() source_pop = np.sort(proportions_autosomes.columns[:-2]) colors = {col: (np.random.random(), np.random.random(), np.random.random()) for col in source_pop} for pop in populations: fig, ax = plt.subplots() c_pop_autosomes = proportions_autosomes[proportions_autosomes['pop'] == pop] c_pop_x = proportions_x[proportions_x['pop'] == pop] for s_pop in source_pop: ax.scatter(c_pop_x[s_pop].values, c_pop_autosomes[s_pop].values, label=s_pop, color=colors[s_pop], alpha=0.5) ax.set_xlabel('X chromosome proportion') ax.legend(bbox_to_anchor=(0.5, -.14), ncol=3, loc='upper center') ax.set_ylabel('Autosome proportion') ax.plot([0, 1], [0, 1], ls='--', c='black') ax.set_xlim([0, 1]) ax.set_ylim([0, 1]) fig.savefig(output_base + f'_A_vs_X_{pop}.png', bbox_inches='tight') plt.close() def plot_histogram_of_ancestry_proportions(proportions_autosomes, proportions_x, output_base): """ Plot cumulative histograms of autosomal and X chromosomal ancestry proportions :param proportions_autosomes: pd.DataFrame, autosomal admixture results :param proportions_x: pd.DataFrame, X chromosomal admixture results :param output_base: str, output base to save plot per population to """ np.random.seed(13) populations = proportions_autosomes.loc[:, 'pop'].unique() source_pop = np.sort(proportions_autosomes.columns[:-2]) colors = {col: (np.random.random(), np.random.random(), np.random.random()) for col in source_pop} bins = np.arange(0, 1.01, 0.01) for pop in populations: fig, ax = plt.subplots() c_pop_autosomes = proportions_autosomes[proportions_autosomes['pop'] == pop] c_pop_x = proportions_x[proportions_x['pop'] == pop] for s_pop in source_pop: ax.hist(c_pop_x[s_pop].values, color=colors[s_pop], cumulative=True, ls='--', bins=bins, histtype='step', weights=np.repeat(1 / c_pop_x[s_pop].values.shape[0], c_pop_x[s_pop].values.shape[0])) ax.hist(c_pop_autosomes[s_pop].values, color=colors[s_pop], cumulative=True, ls='-', bins=bins, histtype='step', weights=np.repeat(1 / c_pop_autosomes[s_pop].values.shape[0], c_pop_autosomes[s_pop].values.shape[0])) ax.set_xlabel('Ancestry proportion') ax.set_ylabel('Cumulative density') handles = [Line2D([0], [0], color=colors[source_pop[n]], ls='--') for n in range(len(source_pop))] handles.extend([Line2D([0], [0], color='black', ls='-'), Line2D([0], [0], color='black', ls='--')]) labels = source_pop.copy().tolist() labels.extend(['Autosomes', 'X chromosomes']) ax.legend(handles, labels, bbox_to_anchor=(0.5, -.14), loc='upper center', ncol=3) fig.savefig(output_base + f"_hist_a_x_proportions_{pop}.png", bbox_inches='tight') def plot_admixture_results(df, outfile): """ Plot admixture results :param df: pd.DataFrame, admixture results :param output_base: str, output base to save plot per population to """ np.random.seed(13) populations = df['pop'].unique() source_pop = np.sort(df.columns[:-2]) colors = {col: (np.random.random(), np.random.random(), np.random.random()) for col in source_pop} fig, ax = plt.subplots() proportions_dict = {} colors_dict = {} for i in range(len(source_pop)): proportions_dict[i] = np.zeros(df.shape[0]) colors_dict[i] = np.zeros((df.shape[0], 3)) i = 0 n_samples = [] for pop in populations: c_df = df[df['pop'] == pop] proportions = c_df.loc[:, source_pop] median_prop = np.median(proportions, axis=0) priorities = source_pop[np.argsort(median_prop)][::-1].tolist() for n in range(len(source_pop)): proportions_dict[n][i: i + c_df.shape[0]] = c_df.loc[:, priorities[n]] colors_dict[n][i: i + c_df.shape[0]] = colors[priorities[n]] i += c_df.shape[0] n_samples.append(c_df.shape[0]) bottom = np.zeros(df.shape[0]) for n in range(len(source_pop)): ax.bar(np.arange(0, df.shape[0]), proportions_dict[n], color=colors_dict[n], bottom=bottom, width=1) bottom += proportions_dict[n] n_samples = np.array(n_samples) cum_samples = np.cumsum(n_samples) n_samples = n_samples / 2 xticks = cum_samples - n_samples ax.set_xticks(xticks) ax.set_ylim([0, 1.0]) ax.set_xlim([0, df.shape[0]]) ax.set_xticklabels([pop.upper() for pop in populations]) ax.set_ylabel('Admixture proportion') ax.spines['top'].set_visible(False) ax.spines['bottom'].set_visible(False) ax.spines['right'].set_visible(False) ax.spines['left'].set_visible(False) fig.savefig(outfile, bbox_inches='tight') plt.close() def write_mean_ancestry_proportions(df, output_file): """ Write mean ancestry proportions to file :param df: pd.DataFrame, mean ancestry proportions of source populations for each sample populations :param output_file: str, Path to file where to write proportions to """ # bringing columns into the right order columns = {col: col.replace('ea', 'na').replace('_A', '_ZA').upper() for col in df.columns} df.rename(columns=columns, inplace=True) columns = sorted([col for col in df.columns]) df = df.loc[:, columns] columns = {col: col.replace('_ZA', '_A').replace('SEX', 'pf') for col in df.columns} df.rename(columns=columns, inplace=True) df.index.name = '' df.to_csv(output_file, header=True, index=True) def filter_individuals(admixture_autosomes): """ Filtered out African/European/East Asian individuals with less than 95% African/Eurpean/East Asian ancestry as well as admixed individuals with more than 95% of one ancestry :param admixture_autosomes: pd.DataFrame, Admixture results :return: np.array, IIDs of individuals that not passed filtering """ # filter Africans with less 95% African ancestry afr_to_remove = admixture_autosomes[(admixture_autosomes['pop'] == 'afr') & (admixture_autosomes.afr < 0.95)].index.values logging.info('Removing {} African individuals that have <95% African ancestry'.format(afr_to_remove.shape[0])) # filter Europeans with less 95% European ancestry eur_to_remove = admixture_autosomes[(admixture_autosomes['pop'] == 'eur') & (admixture_autosomes.eur < 0.95)].index.values logging.info('Removing {} European individuals that have <95% European ancestry'.format(eur_to_remove.shape[0])) # filter East Asians with less 95% East Asian ancestry ea_to_remove = admixture_autosomes[(admixture_autosomes['pop'] == 'ea') & (admixture_autosomes.ea < 0.95)].index.values logging.info('Removing {} East Asian individuals that have <95% East Asian ancestry'.format(ea_to_remove.shape[0])) # filter Americans with less 5% African ancestry amr_to_remove = admixture_autosomes[(admixture_autosomes['pop'] == 'amr') & (admixture_autosomes.loc[:, ['afr', 'eur', 'ea']].max(axis=1) > 0.95)].index.values logging.info('Removing {} American individuals that have >95% of one ancestry'.format(amr_to_remove.shape[0])) individuals_to_remove = np.concatenate([amr_to_remove, eur_to_remove, ea_to_remove, afr_to_remove]) return individuals_to_remove def compute_haplogroup_frequencies(df): """ Compute haplogroup frequencies :param df: pd.DataFrame, haplogroup assignments :return: pd.DataFrame, haplogroup frequencies of sampled populations """ haplogroups = np.sort(df.haplogroup.unique()) populations = df.population.unique() frequencies = pd.DataFrame(index=populations, columns=haplogroups) for pop in populations: freqs = np.zeros_like(haplogroups) c_df = df[df.population == pop] counts = c_df.haplogroup.value_counts() for hg, c in zip(counts.index, counts): freqs[haplogroups == hg] = c freqs = freqs / freqs.sum() frequencies.loc[pop, :] = freqs return frequencies def bootstrap_results(ancestry_proportions_haplogroups, n_resamples=10000): """ Bootstrap individuals to compute confidence intervals for summary statistics (mean ancestry proportions, and sex ratios) :param ancestry_proportions_haplogroups: pd.DataFrame, individual level autosomal and X chromosomal ancestry proportions as well as mtDNA and Y chromosome haplogroups :param n_resamples: int, number of resampling steps to perform :return: pd.DataFrame, Mean summary statistics with confidence intervals """ ancestries = [col.split('_')[0] for col in ancestry_proportions_haplogroups.columns if col.endswith('A')] populations = ancestry_proportions_haplogroups.loc[:, "pop"].unique() sf_sm = lambda x, a: (3 * x - 2 * a) / (4 * a - 3 * x) bootstrapped_results = [] for pop in populations: c_results = ancestry_proportions_haplogroups.loc[ancestry_proportions_haplogroups.loc[:, "pop"] == pop] sample_names = c_results.index.values if n_resamples is None: c_resamples = min([c_results.shape[0], 10000]) iterations_df = pd.DataFrame(index=np.arange(0, c_resamples)) for i in range(c_resamples): resampled = resample(sample_names) y_haplogroups = c_results.loc[resampled, "haplogroup_Y"].copy().dropna() mt_haplogroups = c_results.loc[resampled, 'haplogroup_mt'] resampled_autosomes = c_results.loc[resampled, [f"{anc}_A" for anc in ancestries]] resampled_x = c_results.loc[resampled, [f"{anc}_X" for anc in ancestries]] for anc in ancestries: mt_haplo_freq = mt_haplogroups[mt_haplogroups == anc].shape[0] / mt_haplogroups.shape[0] y_haplo_freq = y_haplogroups[y_haplogroups == anc].shape[0] / y_haplogroups.shape[0] try: haplogroup_imbalance = mt_haplo_freq / y_haplo_freq except ZeroDivisionError: haplogroup_imbalance = np.inf mean_autosomal_prop = resampled_autosomes.loc[:, f"{anc}_A"].mean() mean_x_prop = resampled_x.loc[:, f"{anc}_X"].mean() sex_ratio = sf_sm(mean_x_prop, mean_autosomal_prop) iterations_df.loc[i, f'{anc}_A'] = mean_autosomal_prop iterations_df.loc[i, f'{anc}_X'] = mean_x_prop iterations_df.loc[i, f'{anc}_sf_sm'] = sex_ratio iterations_df.loc[i, f'{anc}_mt'] = mt_haplo_freq iterations_df.loc[i, f'{anc}_Y'] = y_haplo_freq iterations_df.loc[i, f'{anc}_mt_Y'] = haplogroup_imbalance mean_vals = iterations_df.mean(axis=0) mean_vals.index = [f"mean_{ind}" for ind in mean_vals.index.values] ci_low = iterations_df.quantile(0.025, axis=0) ci_low.index = [f"ci_low_{ind}" for ind in ci_low.index.values] ci_up = iterations_df.quantile(0.975, axis=0) ci_up.index = [f"ci_up_{ind}" for ind in ci_up.index.values] summary_df = pd.concat([ci_low, mean_vals, ci_up]).to_frame(pop) bootstrapped_results.append(summary_df.T) bootstrapped_results = pd.concat(bootstrapped_results) return bootstrapped_results def main(argv): parser = argparse.ArgumentParser(description="Summarize simulation results, i.e., sex ratios, " "admixture proportions, haplogroup frequencies, " "and haplogroup imbalances including confidence intervals.") parser.add_argument('-qa', '--admixture_results_autosomes', nargs='+', help='.Q file from Admixture for autosomes') parser.add_argument('-qx', '--admixture_results_chrx', nargs='+', help='.Q file from Admixture for X chromosome') parser.add_argument('-hm', '--haplogroups_mtDNA', nargs='+', help='Haplogroup assignments mtDNA') parser.add_argument('-hy', '--haplogroups_Y', nargs='+', help='Haplogroup assignments Y chromosomes') parser.add_argument('-f', '--fam_file', nargs='+', help='*.fam file generated by plink to get individual IDs') parser.add_argument('-p', '--plot_dir', help='Directory where to save plots [./plots/]', default='./plots/') parser.add_argument('-rax', '--results_a_x', nargs='+', help='File to write autosomal and X chromosomal ' 'ancestry proportions to') parser.add_argument('-rmy', '--results_mt_y', nargs='+', help='File to write mtDNA and Y chromosomal haplogroup ' 'frequencies to') parser.add_argument('-b', '--bootstraps', default=None, help='Number of bootstrap resamples. Default is the ' 'number of samples and max. 10000', type=int) parser.add_argument('-rb', '--results_bootstrapped', nargs="+", help='File to write bootstrapped summary ' 'statistics to') parser.add_argument('--log', help='Log file [results/summarizing_results.log]', default='results/summarizing_results.log') args = parser.parse_args() logging.basicConfig(filename=args.log, level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s', datefmt='%H:%M:%S') logger = logging.getLogger() if args.plot_dir.endswith('/'): plot_dir = args.plot_dir else: plot_dir = args.plot_dir + '/' if not os.path.isdir(plot_dir): os.makedirs(plot_dir) results_dir_hg_freq = '/'.join(args.results_mt_y[0].split('/')[:-1]) if not os.path.isdir(results_dir_hg_freq): os.makedirs(results_dir_hg_freq) results_dir_ap_freq = '/'.join(args.results_a_x[0].split('/')[:-1]) if not os.path.isdir(results_dir_ap_freq): os.makedirs(results_dir_ap_freq) for admixture_results_autosomes, admixture_results_chrx, fam_file, \ haplogroups_mtDNA, haplogroups_Y, results_a_x,\ results_mt_y, results_bootstrapped in zip(args.admixture_results_autosomes, args.admixture_results_chrx, args.fam_file, args.haplogroups_mtDNA, args.haplogroups_Y, args.results_a_x, args.results_mt_y, args.results_bootstrapped): logging.info('Parsing results {}'.format(admixture_results_autosomes.split('_chr1.3.Q')[0])) admixture_autosomes, admixture_x = read_results(admixture_results_autosomes, admixture_results_chrx, fam_file) haplogroups_mtDNA = pd.read_csv(haplogroups_mtDNA, sep='\t', index_col=0) haplogroups_Y = pd.read_csv(haplogroups_Y, sep='\t', index_col=0) populations_mtDNA = np.array([re.sub('[0-9]*', '', ind) for ind in haplogroups_mtDNA.index.values]) populations_Y = np.array([re.sub('[0-9]*', '', ind) for ind in haplogroups_Y.index.values]) haplogroups_mtDNA.loc[:, 'population'] = populations_mtDNA haplogroups_Y.loc[:, 'population'] = populations_Y admixture_autosomes = assign_k_to_ancestry(admixture_autosomes) admixture_x = assign_k_to_ancestry(admixture_x) # remove individuals that did not pass filtering logging.info('Filtering individuals') # filter individuals based on average ancestry_proportion # assumes equal length of autosomes and X chromosome #average_ancestry_proportions = admixture_autosomes.copy() #average_ancestry_proportions.loc[:, ["afr", "eur", "ea"]] += admixture_x.loc[:, ["afr", "eur", "ea"]] #average_ancestry_proportions.loc[:, ["afr", "eur", "ea"]] /= 2 individuals_to_remove = filter_individuals(admixture_autosomes) admixture_autosomes.drop(individuals_to_remove, inplace=True) admixture_x.drop(individuals_to_remove, inplace=True) haplogroups_mtDNA.drop(individuals_to_remove, inplace=True) haplogroups_Y.drop(individuals_to_remove[np.isin(individuals_to_remove, haplogroups_Y.index)], inplace=True) # visualize ADMIXTURE results logging.info('Visualizing ADMIXTURE results') plot_admixture_results(admixture_autosomes, f'{plot_dir}{admixture_results_autosomes.split("/")[-1].replace(".3.Q", "_admixture.png")}') plot_admixture_results(admixture_x, f'{plot_dir}{admixture_results_chrx.split("/")[-1].replace(".3.Q", "_admixture.png")}') plot_autosome_proportions_vs_x_proportions(admixture_autosomes, admixture_x, f'{plot_dir}{admixture_results_chrx.split("/")[-1].replace(".3.Q", "")}') plot_histogram_of_ancestry_proportions(admixture_autosomes, admixture_x, f'{plot_dir}{admixture_results_chrx.split("/")[-1].replace(".3.Q", "")}') # join results ancestry_proportions = admixture_autosomes.join(admixture_x.drop(["pop", "sex"], axis=1), lsuffix='_A', rsuffix='_X') logging.info('Computing mean ancestry proportions') # take average per populations mean_ancestry_proportions = ancestry_proportions.groupby('pop').mean() logging.info(f'Writing mean ancestry proportions to {results_a_x}') # write results write_mean_ancestry_proportions(mean_ancestry_proportions, results_a_x) logging.info("Computing mean haplogroup frequencies") # compute haplogroup frequencies freq_mtDNA = compute_haplogroup_frequencies(haplogroups_mtDNA) freq_Y = compute_haplogroup_frequencies(haplogroups_Y) # merge haplogroup_frequencies = freq_Y.join(freq_mtDNA, lsuffix='_Y', rsuffix='_MT') columns = {col: col.replace("ea", 'na').upper() for col in haplogroup_frequencies.columns} haplogroup_frequencies.rename(columns=columns, inplace=True) logging.info(f"Writing mean haplogroup frequencies to {results_mt_y}") # save haplogroup_frequencies.to_csv(results_mt_y, header=True, index=True) # bootstrap results logging.info("Bootstrapping results") haplogroups = haplogroups_mtDNA.drop("population", axis=1).join(haplogroups_Y.drop("population", axis=1), lsuffix='_mt', rsuffix='_Y') ancestry_prop_and_haplogroups = ancestry_proportions.join(haplogroups) bootstrapped_results = bootstrap_results(ancestry_prop_and_haplogroups, args.bootstraps) logging.info(f"Writing bootstrap results to {results_bootstrapped}") bootstrapped_results.to_csv(results_bootstrapped, sep='\t', header=True, index=True) if __name__ == '__main__': main(sys.argv[1:]) |
48 49 50 | shell: "scripts/afr_eur_ea_populations.py -r {recombination_rate} --recipe {params.recipe} " \ " -l {chromosome_length} --mtDNA_length {mtDNA_length} -o {output}" |
65 66 67 68 69 | shell: "scripts/american_admixture.py -r {recombination_rate} --recipe {config[single_pulse_admixture_recipe]} " \ " -l {chromosome_length} --mtDNA_length {mtDNA_length} --afr {afr} --eur {eur} --ea {ea} --sm_afr {sm_afr} "\ "--sm_eur {sm_eur} --sm_ea {sm_ea} --N_amr {N_amr} --tree_sequences {input} " "--amr_growth_rate {config[amr_growth_rate]}" |
78 79 80 81 82 83 | shell: "scripts/american_admixture.py -r {recombination_rate} --recipe {config[single_pulse_admixture_non_random_mating_recipe]} " \ " -l {chromosome_length} --mtDNA_length {mtDNA_length} --afr {afr} --eur {eur} --ea {ea} --sm_afr {sm_afr} "\ "--sm_eur {sm_eur} --sm_ea {sm_ea} --N_amr {N_amr} --tree_sequences {input} " "--amr_growth_rate {config[amr_growth_rate]} " "--rejection_rate_female_eur_male_non_eur {config[rejection_rate_female_eur_male_non_eur]}" |
92 93 94 95 96 97 98 | shell: "scripts/american_admixture.py -r {recombination_rate} --recipe {config[constant_admixture_recipe]} " \ " -l {chromosome_length} --mtDNA_length {mtDNA_length} --afr {afr} --eur {eur} --ea {ea} --sm_afr {sm_afr} "\ "--sm_eur {sm_eur} --sm_ea {sm_ea} --N_amr {N_amr} --tree_sequences {input} " "--afr_const {config[afr_const]} --eur_const {config[eur_const]} "\ "--ea_const {config[ea_const]} --sm_afr_const {config[sm_afr_const]} --sm_eur_const {config[sm_eur_const]} "\ "--sm_ea_const {config[sm_ea_const]} --amr_growth_rate {config[amr_growth_rate]}" |
107 108 109 110 111 112 113 114 | shell: "scripts/american_admixture.py -r {recombination_rate} --recipe {config[constant_admixture_non_random_mating_recipe]} " \ " -l {chromosome_length} --mtDNA_length {mtDNA_length} --afr {afr} --eur {eur} --ea {ea} --sm_afr {sm_afr} "\ "--sm_eur {sm_eur} --sm_ea {sm_ea} --N_amr {N_amr} --tree_sequences {input} "\ "--amr_growth_rate {config[amr_growth_rate]} --afr_const {config[afr_const]} --eur_const {config[eur_const]} "\ "--ea_const {config[ea_const]} --sm_afr_const {config[sm_afr_const]} --sm_eur_const {config[sm_eur_const]} "\ "--sm_ea_const {config[sm_ea_const]} " "--rejection_rate_female_eur_male_non_eur {config[rejection_rate_female_eur_male_non_eur]}" |
146 147 148 149 150 151 | shell: "scripts/add_mutations_to_tree_sequence.py -N {config[ancestral_Ne]} -m {mutation_rate} "\ "--mtDNA_mutation_rate {mtDNA_mutation_rate} -o {output.ts} --output_vcf_dir {data_dir} " "-l {chromosome_length} --mtDNA_length {mtDNA_length} --populations {populations} --log {log} "\ "--tree_sequences {input} --threads {threads} --sample_size {config[sample_size]} --plot_sfs " "--recapitate" |
258 259 260 261 262 263 264 | shell: "for vcf_file in {input}; " "do " "bcftools norm -m - --threads {threads} ${{vcf_file}} | bcftools annotate --threads {threads} " "--set-id +'%CHROM\_%POS\_%REF\_%FIRST_ALT' -Oz -o $( echo ${{vcf_file}} | sed 's/_filtered//' ).gz; " "tabix ${{vcf_file}}.gz; " "done" |
333 334 | shell: "scripts/ld_pruning_plink2.py --threads {threads} --vcf {input.vcf} --output_base {params.out} --sex {input.sex}" |
438 439 440 | shell: "scripts/convert_vcf_to_plink.py --threads {threads} --vcf {input.vcf} --sex {input.sex} " "--output_base {params.out} --pruned {input.pruned}" |
547 548 549 | shell: "scripts/convert_vcf_to_plink.py --threads {threads} --vcf {input.vcf} --sex {input.sex} " "--output_base {params.out}" |
647 648 | shell: "scripts/classify_mtDNA_Y_haplogroups.py --eigenvec {input} --log {log} --output_file {output}" |
714 715 | shell: "scripts/plot_pca.py --eigenvec {input.eigenvec_1} {input.eigenvec_x} --plot_dir {plot_dir}" |
757 758 | shell: "admixture --supervised -j{threads} {input.bed} 3; mv *3.Q {data_dir}; mv *3.P {data_dir}" |
801 802 803 | shell: "admixture --supervised --haploid=\"male:23\" {input.bed} 3; " "mv *3.Q {data_dir}; mv *3.P {data_dir}" |
852 853 854 855 | shell: "scripts/summarize_results.py -hy {input.chrY} -hm {input.mtDNA} " "-qa {input.autosome_q} -qx {input.x_q} -f {input.fam} -rax {output.autosomes_x} " "-rmy {output.haplogroups} -rb {output.bootstrapped} -p {plot_dir} --log {log}" |
Support
- Future updates
Related Workflows





