Challenges of accurately estimating sex-biased admixture from X chromosomal and autosomal ancestry proportions

public 1yr ago 0 bookmarks

View Workflow

Help improve this workflow!

This workflow has been published but could be further improved with some additional meta data:

Keyword(s) in categories input, output, operation, topic

You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .

Models for estimating sex ratios assuming single pulse or constant admixture
Recalculation of sex ratios based on summary statistics from Micheletti et al., calculation of possible sex ratios given the unassigend ancestry in their study, and sensitivity analysis of the applied models
Assessing the robustness of single admixture models to violations of demographic assumptions using simulations
References

Models for estimating sex ratios assuming single pulse or constant admixture

Software requirements

The code was tested with the following python package version:

matplotlib v3.34
numpy v1.20.2
pandas v1.2.4
scipy v1.6.2

To set up a conda environment with the following versions and activate the environment, run the following commands:

conda env create -f sex_ratios.yml
conda activate sex_ratios

Input files

Either:

A comma-separated file, containing the ancestry proportions of various source populations and fraction of sampled females in different regions. The first row of the file is the header, and the file has the following structure:

,H1X,H1A,H2X,H2A,...,pf
region1,x,a,xx,aa,...,f
region2,x,a,xx,aa,...,f

where HiX and HiA are the X chromosomal and autosomal ancestry proportions that trace to source population 1, and pf are the fraction of females in the sample in a given region. Comments are indicated by a "#".

Or (only accepted by scripts/estimate_sex_ratios_single_pulse.py ):

.Q files for autosomes and X chromosomes, which were generated with ADMIXTURE.1 Additionally, .fam file generated with plink2 needs to be provided to get individual IDs.2

If a comma-separated file with mean ancestry proprotions is provided as input, sex ratios are calculated based on the provided proportions. If .Q and .fam files are provided, individual ancestry proportions are bootstrapped and for each bootstrap resample mean ancestry proportions are computed and sex ratios are estimated, yielding confidence intervals.

Analysis

If scripts/estimate_sex_ratios_single_pulse.py is used, sex ratios are calculated using a dynamic model proposed by Golderberg and Rosenberg (2015) and an equilibirium model.3 Both models assume a single admixture event, a constant population size, no subsequent gene flow, random mating, no genetic drift, and no selection. The scripts/estimate_sex_ratios_constant_admixture.py implements another model proposed by Goldberg and Rosenberg that assumes constant, nonzero admixture.3 This model attempts to find parameter combinations of constant migration rates that fit the data. For details on any of these models see either Goldberg and Rosenberg or our publication.3

Output files

Single pulse: Generates an excel file with estimated sex ratios for 2-15 generations of random within the admixed population and the equilibrium sex ratio. If a comma-separated file is provided this is done for each ancestry and each region. The output file name can be specified with the -o flag.
Constant admixture: Generates a box plot of possible sex ratios for each ancestry and region. The name of the output file can be specified with the -o flag.

Example usage

To infer sex ratios assuming a single admixture event and provided mean ancestry proportions in input.txt , do:

scripts/estimate_sex_ratios_single_pulse.py -i input.txt -o output.xlsx

To infer sex ratios with confidence intervals assuming a single admixture event and provided individual level ADMIXTURE results for K clusters, do:

scripts/estimate_sex_ratios_single_pulse.py -qa admixture_results_autosomes.K.Q -qx admixture_results_x_chromosome.K.Q -f plink.fam -b 10000 -o output.xlsx

where confidence intervals are obtained from 10,000 bootstrap resamples( -b ).

To infer sex ratios under a demographic scenarios of constant, nonzero admixture and given mean ancestry proportions provided in input.txt , do:

scripts/estimate_sex_ratios_constant_admixture.py -i input.txt -n 0.02 -d 0.01 -g 15 -o output_figure.pdf --sf0 0.5 --sm0 0.5

where 0.02 increments are used for the gridsearch ( -n ), accept only constant admixture parameter that have an Eucleadean distance less than 0.01 to observed mean ancestry proportions ( -d ), admixture is assumed to have occured 15 generations ago ( -g ), the output figure is saved to output_figure.pdf ( -o ), and initial female and male contributions are assumed to have been 0.5 ( --sf0 and --sm0 ; the effect of initial contributions is erased if g > 10).

Figure S1 in our manuscript can be reproduced by replacing input.txt with data/ancestry_proportions_micheletti.txt in the above command.

Recalculation of sex ratios based on summary statistics from Micheletti et al., calculation of possible sex ratios given the unassigend ancestry in their study, and sensitivity analysis of the applied models

Software requirements

See above.

Input files

African, European, and Native American mean ancestry proportions for different geographic regions were inferred from summary statistics of Micheletti et al. We weighted mean ancestry estimates for each granular region reported in Table S10 in Micheletti et al. by the corresponding sample size reported in Table S2 in Micheletti et al.4 The ancestry proportions were stored in data/ancestry_proportions_micheletti.txt .

Analysis

First, the script sensitivity_analysis_and_distribute_unassigned_ancestry.py applies the single pulse admixture model from Goldberg and Rosenberg (2015) to the provided ancestry proportions.3 If unassigned ancestry is present, such as in the study by Micheletti et al.,4 the unassigned ancestry is distributed in all permissible ways such that ancestry proportions sum to 100%. For each possible distribution, sex ratios are computed, yielding a distribution of sex ratios that is plotted as a box plot for each ancestry and region (this is only done for regions specified with the -p flag). In addition, it conducts a sensitivity analysis of the equilibirium model, which also holds for Goldberg's and Rosenberg's model. For more details, see our publication.

Output files

An excel file with estimated sex ratios for 2-15 generations of random within the admixed population and the equilibrium sex ratio for each ancestry and region based on provided ancestry proportions. The name of the output file can be specified with the -o flag (Table 1 in our manuscript).
A figure with possible sex ratios under consideration of the unassigned ancestry. The name of the figure can be specified with --figure flag (Figure 1 in our manuscript).
A figure illustrating the sensitivity of the applied models. The name of the figure can be specified with the --figure_sensitivity flag (Figure 2 in our manuscript).

Example usage

The following command takes mean ancestry proportions from Micheletti et al., distributes all unassigned ancestry for the geographic regions of the Latin Caribbean, central South America, northern South America, and Central America, and saves box plots of corresponding distributions of possible sf/sm values to figure1.pdf. Additionally, it conducts a sensitivity analysis, which will be plotted to figure2.pdf. In this case, it uses the African ancestry proportion that Micheletti et al. reported for central South America for the sensitivity analysis. Point estimates of sf/sm for each ancestry and georgraphic regions based on the reported mean ancestry proportions will be written to sex_ratios_micheletti.xlsx.

scripts/sensitivity_analysis_and_distribute_unassigned_ancestry.py -i data/ancestry_proportions_micheletti.txt --figure figure1.pdf -p1 "Latin Caribbean" "C South America" "N South America" "Central America" --figure_sensitivity figure2.pdf -p2 "C South America" -a Afr -o sex_ratios_micheletti.xlsx

Assessing the robustness of single admixture models to violations of demographic assumptions using simulations

Software requirements

The following software is required:

ADMIXTURE v1.3.0
plink2 v2.00a2.3
bcftools v
tabix v1.11
SliM3 v3.7.1

Additionally, the following Python packages are required:

numpy v1.22.2
pandas v1.4.1
multiprocessing-loggin v0.3.1
pyslim v0.700
msprime v1.1.1
matplotlib v3.5.1
scipy v1.8.0
scikit-learn v1.0.2

The simulations and subsequent analysis are implemented in a Snakemake workflow. Therefore, snakemake needs to be installed. Snakemake will automatically install all required software from the .yml files provided in the envs directory. Snakemake can be installed from the provided .yml file using the following commands:

conda env create -f snakemake.yml
conda activate snakemake

Analysis

To assess the effects of violations of demographic assumptions made by the single pulse admixture model, such as population growth, subsequent gene flow, and sex-specific assortative mating, we simulated these scenarios and evaluate the effects on inferred sex ratios. Three continental ancestral population are simulated according to Gravel et al. (2011),5 and admixture is simulated 15 generations ago. Then, random individuals are sampled from the ancestral and admixed population (we recommend 10,000), and LD pruning is performed using plink2.2 On the pruned data, ADMIXTURE is run to separately infer X chromosomal and autosomal ancestry proportions using the individuals from the ancestral populations as training data.6 Mitochondrial and Y chromosome haplogroup frequencies are inferred by clustering the ancestral haplogroups and assigning haplogroups of admixed individuals based proximity to ancestral haplogroup clusters. Finally, sex ratios are estimated using the single admixture models and bootstrapping individual ancestry proportions. Additionally, PCA is performed on the simulated populations to determine that the correct population structure was achievied, and site-frequency spectra (SFS) of the admixed population is plotted. Simulations are done with SLiM3 using tree sequence recording.7,8 Pyslim and msprime are then used for recapitation and superimposing mutations in Python.9,10

Output files

Tree sequences will be stored in the directory specified with tree_dir in config/config.yaml
VCF (unpruned and pruned), plink files, haplogroup assignments, ADMIXTURE results will be stored in the directory specified with data_dir in config/config.yaml
Figures of the PCA and SFS will stored in the directory specified with plot_dir in config/config.yaml
Summary files of bootstrap results will stored in directory specified with results_dir in config/config.yaml

Example usage

The entire process is implemented in a snakemake workflow, which can be executed using the command below.11

snakemake --use-conda --conda-frontend --cores 16

All simulations parameters, such as admixture proportions, sex biases, migration rates, sample sizes etc., can be set via the config/config.yaml file. To re-run all simulations that were presented in our paper (Table 2), run ./run_all_simulations.sh . This scripts also generates Figure 3 in our manuscript, which assess the effect of sample size on the inferred sex ratios. Inferred sex ratios are compared for different sample sizes using a custom script:

scripts/assess_effect_of_sampling_size.py -qa <list_of_q_files_for_autosomes_for_different_sample_sizes> -qx <list_of_q_files_for_autosomes_for_different_sample_sizes> -f <fam_files_for_different_sample_sizes> -p ./ -o figure3.pdf

References

Alexander, David H, John Novembre, and Kenneth Lange. 2009. “Fast Model-Based Estimation of Ancestry in Unrelated Individuals.” Genome Research 19 (9): 1655–64. https://doi.org/10.1101/gr.094052.109.
Chang CC, Chow CC, Tellier LC, Vattikuti S, Purcell SM, Lee JJ. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience. 2015;4:7. Published 2015 Feb 25. doi:10.1186/s13742-015-0047-8
Goldberg, A., and Rosenberg, N.A. (2015). Beyond 2/3 and 1/3: The Complex Signatures of Sex-Biased Admixture on the X Chromosome. Genetics 201, 263 LP – 279.
Micheletti, S.J., Bryc, K., Ancona Esselmann, S.G., Freyman, W.A., Moreno, M.E., Poznik, G.D., Shastri, A.J., Agee, M., Aslibekyan, S., Auton, A., et al. (2020). Genetic Consequences of the Transatlantic Slave Trade in the Americas. Am. J. Hum. Genet. 107, 265–277.
Gravel, S., Henn, B.M., Gutenkunst, R.N., Indap, A.R., Marth, G.T., Clark, A.G., Yu, F., Gibbs, R.A., and Bustamante, C.D. (2011). Demographic history and rare allele sharing among human populations. Proc. Natl. Acad. Sci. U. S. A. 108, 11983–11988.
Alexander, D.H., Novembre, J., and Lange, K. (2009). Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 19, 1655–1664.
Haller, B.C., and Messer, P.W. (2019). SLiM 3: Forward Genetic Simulations Beyond the Wright–Fisher Model. Mol. Biol. Evol. 36, 632–637.
Haller, B.C., Galloway, J., Kelleher, J., Messer, P.W., and Ralph, P.L. (2019). Tree-sequence recording in SLiM opens new horizons for forward-time simulation of whole genomes. Mol. Ecol. Resour. 19, 552–566
Kelleher, J., Etheridge, A.M., and McVean, G. (2016). Efficient Coalescent Simulation and Genealogical Analysis for Large Sample Sizes. PLOS Comput. Biol. 12, 1–22
Nelson, D., Kelleher, J., Ragsdale, A.P., Moreau, C., McVean, G., and Gravel, S. (2020). Accounting for long-range correlations in genome-wide simulations of large cohorts. PLOS Genet. 16, 1–12.
Mölder, F., Jablonski, K.P., Letcher, B., Hall, M.B., Tomkins-Tinch, C.H., Sochat, V., Forster, J., Lee, S., Twardziok, S.O., Kanitz, A., Wilm, A., Holtgrewe, M., Rahmann, S., Nahnsen, S., Köster, J., 2021. Sustainable data analysis with Snakemake. F1000Res 10, 33.

Code Snippets

import pickle
import pyslim
import msprime
import sys
import argparse
import numpy as np
import os
import multiprocessing as mp
import matplotlib.pyplot as plt
import logging
import warnings
import re
from multiprocessing_logging import install_mp_handler


def tree_heights(ts):
    """
    Taken from pyslim manual
    :param ts: treeSequence
    :return: array, tree heights
    """
    # Calculate tree heights, giving uncoalesced sites the maximum time
    heights = np.zeros(ts.num_trees + 1)
    for tree in ts.trees():
        if tree.num_roots > 1:  # not fully coalesced
            heights[tree.index] = ts.metadata['SLiM']['generation']
        else:
            children = tree.children(tree.root)
            real_root = tree.root if len(children) > 1 else children[0]
            heights[tree.index] = tree.time(real_root)
    heights[-1] = heights[-2]  # repeat the last entry for plotting with step
    return heights


def visualize_recapitation(pre_recap, post_recap):
    """
    Visualize effect of recapitation
    :param pre_recap: TreeSequence, ts before recapitation
    :param post_recap: TreeSequence, ts after recapitation
    """
    fig, ax = plt.subplots(2, 1, sharex=True, sharey=True)
    # Plot tree heights before recapitation
    breakpoints = list(pre_recap.breakpoints())
    heights = tree_heights(pre_recap)
    ax[0].step(breakpoints, heights, where='post')
    ax[0].set_title('Prior recapitation')
    ax[0].set_ylabel('TMRCA')
    # ax[0].set_xlabel('Position')
    max_prior = heights.max()
    # Plot the tree heights after recapitation
    breakpoints = list(post_recap.breakpoints())
    heights = tree_heights(post_recap)
    ax[1].step(breakpoints, heights, where='post')
    ax[1].axhline(max_prior, color='black', ls='--')
    ax[1].set_xlabel('Position')
    ax[1].set_ylabel('TMRCA')
    ax[1].set_title('Post recapitation')
    fig.savefig('plots/vis_recapitation.png', bbox_inches='tight')
    plt.close()


def get_sfs(ts, pop, pop_index, filename, chr):
    """
    Plot site frequency spectrum
    :param ts: TreeSequence
    :param pop: str, current population
    :param filename: str, path to save figure to
    :param chr: str, current chromosome
    """
    fig, ax = plt.subplots()
    bins = np.arange(0, 101, 1)
    # plot real data
    if pop in ['afr', 'eur', 'ea']:
        if pop == 'afr':
            pop = 'yri'
        elif pop == 'eur':
            pop = 'ceu'
        elif pop == 'ea':
            pop = 'chb'
        if re.match('chr[0-9]+', chr):
            real_data = np.loadtxt(f'data/daf_frequencies_kg/allele_frequencies_{pop}_autosomes.txt')
        elif chr == 'chrX':
            real_data = np.loadtxt(f'data/daf_frequencies_kg/allele_frequencies_{pop}_chrX.txt')
        elif chr == 'chrY':
            real_data = np.loadtxt(f'data/daf_frequencies_kg/allele_frequencies_{pop}_chrY.txt')
        elif chr == 'mtDNA':
            real_data = np.loadtxt(f'data/daf_frequencies_kg/allele_frequencies_{pop}_mtDNA.txt')
        real_data = real_data[real_data > 0]
        real_data *= 100
        hist, bin_edges = np.histogram(real_data, bins=bins, weights=np.repeat(1 / real_data.shape[0],
                                                                               real_data.shape[0]))
        ax.scatter(bin_edges[:-1], hist, marker='^', color='blue', label='real')
    corresponding_nodes = ts.samples(pop_index)
    ts = ts.simplify(samples=corresponding_nodes)
    afs = ts.allele_frequency_spectrum(span_normalise=False, polarised=True)
    afs = afs[1:]
    frequencies = np.ones_like(afs)
    frequencies = np.cumsum(frequencies) / frequencies.shape[0] * 100
    afs /= afs.sum()
    hist, bin_edges = np.histogram(frequencies, bins=bins, weights=afs)
    ax.scatter(bin_edges[:-1], hist, marker='+', color='orange', label='simulations')

    ax.legend()
    fig.savefig(f"plots/{filename}", bbox_inches='tight')
    plt.close()


def add_mutations(tree_path, out_tree_path, Ne, mu, mtDNA_mu, chr_length, mtDNA_length, populations, recapitate, force):
    """
    Recapitate tree sequence, simply and overlay with mutations
    :param tree_path: str, path to input tree sequence
    :param out_tree_path: str, path were to dump mutated tree sequence
    :param Ne: float, ancestral population size
    :param mu: float, mutation rate
    :param mtDNA_mu: float, mutation rate mitochondrial DNA
    :param chr_length: int, chromosome length
    :param mtDNA_length: int, mtDNA length
    :param populations: list, acronyms of populations that are considered
    :param recapitate: boolean, whether to recaptitate or not
    :param force: boolean, whether to run full pipeline if out_tree_path already exists
    :return: TreeSequence, dict, tree sequence with mutations, pointer to nodes with Y-chromsomes and mtDNAs
    """
    # load tree sequence
    logging.info(f'Loading tree sequence {tree_path}')

    if os.path.isfile(out_tree_path) and not force:
        mutated = pyslim.load(out_tree_path)
        f = open(out_tree_path.replace('.trees', '.pkl'), 'rb')
        sample_genotype_mapping_by_population = pickle.load(f)
        f.close()

    else:
        ts = pyslim.load(tree_path)
        if recapitate:
            # recapitate (Ne = ancestral Ne)
            logging.info(f'Recapitating {tree_path}')
            ts_recap = pyslim.recapitate(ts, Ne)
            visualize_recapitation(ts, ts_recap)
            ts = ts_recap

        # simplify
        logging.info(f'Simplifying {tree_path}')
        ts.simplify()

        # assert all sites coalesced
        for t in ts.trees():
            if not t.num_roots == 1:
                warnings.warn("Tree {} did not completely coalesced."
                              "Consider setting --recapitate flag.".format(tree_path, t.interval[0], t.interval[1]))
                break
        sample_genotype_mapping_by_population = {}
        for i, pop in enumerate(populations):
            sample_genotype_mapping_by_population[pop] = {}
            node_ids = ts.samples(i + 1)
            sample_genotype_mapping_by_population[pop]['nodes'] = node_ids
            pop_var = [var for var in ts.variants(samples=node_ids)]
            Y_pop_var = [var for var in pop_var if var.site.position == 2 * chr_length - 1]
            mtDNA_var = [var for var in pop_var if var.site.position == 2 * chr_length + mtDNA_length - 1]
            # nodes that have Y chromosome
            Y_chroms = node_ids[np.where(Y_pop_var[0].genotypes == 1)[0]]
            sample_genotype_mapping_by_population[pop]['Y'] = Y_chroms
            # nodes that have maternal mtDNA
            mtDNA = node_ids[np.where(mtDNA_var[0].genotypes == 1)[0]]
            sample_genotype_mapping_by_population[pop]['mtDNA'] = mtDNA

        f = open(out_tree_path.replace('.trees', '.pkl'), 'wb')
        pickle.dump(sample_genotype_mapping_by_population, f)
        f.close()

        # add mutations
        logging.info(f'Overlaying mutations {tree_path}')
        mutation_rate_map = msprime.RateMap(position=[0, 2 * chr_length, 2 * chr_length + mtDNA_length],
                                            rate=[mu, mtDNA_mu])
        mutated = msprime.sim_mutations(ts, rate=mutation_rate_map, keep=False)
        mutated.dump(out_tree_path)
    return mutated, sample_genotype_mapping_by_population


def reverse_map_individuals(old_new_mapping, individuals_nodes_mapping, ts):
    """
    Map node IDs after simplifying to previous ones
    :param old_new_mapping: np.array, Has the shape (#nodes) where the uth entry gives the new node ID of
                                    uth node in the old mapping
    :param individuals_nodes_mapping: dict, maps old node ids to individual ids
    :param ts: TreeSequence, simplified tree sequence
    :return: list, list, old individual ids, new individual ids
    """
    old_individuals = []
    new_individuals = []
    for indv, nds in individuals_nodes_mapping.items():
        for nd in nds:
            new_node_id = old_new_mapping[nd]
            if new_node_id == -1:
                continue
            new_indv = ts.node(new_node_id).individual
            new_individuals.append(new_indv)
            old_individuals.append(indv)
            break
    return old_individuals, new_individuals 


def extract_population_specific_chromosomes(ts, pointers_Y_mtDNA, output_base, chr_length,
                                            mtDNA_length, populations, sample_size, target_population,
                                            off_target_sample_size, plot_sfs, threads):
    """
    Write different chromosomes to VCF for each population
    :param ts: TreeSequence, tree sequence to sample from
    :param pointers_Y_mtDNA: dict, pointers to nodes with mtDNA and Y chromosomes
    :param output_base: str, string base for output file
    :param chr_length: int, chromosome length
    :param mtDNA_length: int, length of mtDNA
    :param populations: list, population acronyms
    :param sample_size: int, number of individuals to sample from each population
    :param target_population: str, population for which to extract full sample size
    :param off_target_sample_size: int, sample size for off-target populations
    :param plot_sfs: boolean, whether to plot SFS or not
    :param threads: int, number of processes
    """
    # get current populations
    sampled_individuals_all_pop = []
    sampled_nodes_all_pop = []
    sampled_Y_all_pop = []
    sampled_X_all_pop = []
    sampled_mtDNA_all_pop = []
    for i, pop in enumerate(populations):
        logging.info(f'Extracting chromosome specific tree sequences for {pop}')
        # population specific nodes
        nodes = pointers_Y_mtDNA[pop]['nodes']
        # nodes with Y chromosomes
        Y_chromosomes = pointers_Y_mtDNA[pop]['Y']
        # nodes with X chromosomes
        X_chromosomes = nodes[~np.isin(nodes, Y_chromosomes)]
        # nodes with maternal mtDNA
        mtDNA = pointers_Y_mtDNA[pop]['mtDNA']
        # population specific individuals
        individuals = list(set([nd.individual for nd in ts.tables.nodes[nodes]]))
        if pop == target_population and len(individuals) < sample_size:
            logging.info(f'Population {pop} only has {len(individuals)} individuals. '
                         f'Reducing sample size for {pop} from {sample_size} to {len(individuals)}')
            c_sample_size = len(individuals)
        elif pop == target_population and len(individuals) >= sample_size:
            c_sample_size = sample_size
        elif pop != target_population and len(individuals) >= off_target_sample_size:
            c_sample_size = off_target_sample_size
        else:
            c_sample_size = len(individuals)
        # taken random sample of individuals
        sampled_individuals = np.random.choice(individuals, size=c_sample_size, replace=False)
        sampled_nodes = np.concatenate([ts.individual(indv).nodes for indv in sampled_individuals])
        sampled_Y = Y_chromosomes[np.isin(Y_chromosomes, sampled_nodes)]
        sampled_X = X_chromosomes[np.isin(X_chromosomes, sampled_nodes)]
        sampled_mtDNA = mtDNA[np.isin(mtDNA, sampled_nodes)]
        sampled_individuals_all_pop.append(sampled_individuals)
        sampled_nodes_all_pop.append(sampled_nodes)
        sampled_Y_all_pop.append(sampled_Y)
        sampled_X_all_pop.append(sampled_X)
        sampled_mtDNA_all_pop.append(sampled_mtDNA)

    sampled_indv_amr = sampled_individuals.copy()
    # merge samples from all populations
    sampled_individuals = np.concatenate(sampled_individuals_all_pop)
    sampled_nodes = np.concatenate(sampled_nodes_all_pop)
    sampled_Y = np.concatenate(sampled_Y_all_pop)
    sampled_X = np.concatenate(sampled_X_all_pop)
    sampled_mtDNA = np.concatenate(sampled_mtDNA_all_pop)
    sampled_individuals_node_mapping = {indv: ts.individual(indv).nodes for indv in sampled_individuals}

    # extract tree sequences corresponsing to chromosome 1
    ts_chr1, node_map_chr1 = ts.keep_intervals(np.array([[0, chr_length]]),
                                               simplify=False).simplify(sampled_nodes, map_nodes=True)
    # extract tree sequences corresponsing to sex chromosomes
    ts_Y_chromosomes, node_map_Y = ts.keep_intervals(np.array([[chr_length, 2 * chr_length]]),
                                                     simplify=False).simplify(sampled_Y, map_nodes=True)
    ts_X_chromosomes, node_map_X = ts.keep_intervals(np.array([[chr_length, 2 * chr_length]]),
                                                     simplify=False).simplify(sampled_X, map_nodes=True)
    # extract tree sequences for mtDNA
    ts_mtDNAs, node_map_mtDNA = ts.keep_intervals(np.array([[2 * chr_length, 2 * chr_length + mtDNA_length]]),
                                                  simplify=False).simplify(sampled_mtDNA, map_nodes=True)

    sampled_individuals, individuals_chr1 = reverse_map_individuals(node_map_chr1, sampled_individuals_node_mapping,
                                                                    ts_chr1)
    sampled_individuals_Y, individuals_Y = reverse_map_individuals(node_map_Y, sampled_individuals_node_mapping,
                                                                   ts_Y_chromosomes)
    sampled_individuals_X, individuals_X = reverse_map_individuals(node_map_X, sampled_individuals_node_mapping,
                                                                   ts_X_chromosomes)
    sampled_individuals_mtDNA, individuals_mtDNA = reverse_map_individuals(node_map_mtDNA,
                                                                           sampled_individuals_node_mapping, ts_mtDNAs)


    # get sex of individuals and write to file
    sampled_individuals_sex = ['F' if ts_chr1.individual(indv).metadata['sex'] == 0 else 'M'
                               for indv in individuals_chr1]
    sampled_individuals_population = [populations[ts_chr1.node(ts_chr1.individual(indv).nodes[0]).population]
                                      for indv in individuals_chr1]
    with open(f"{output_base}_sex.tab", 'w') as sex_pointer:
        sex_pointer.write('#IID\tsex\n')
        for indv, sex, pop in zip(sampled_individuals, sampled_individuals_sex, sampled_individuals_population):
            sex_pointer.write(f'{pop}{indv + 1}\t{sex}\n')
    sex_pointer.close()
    ## plot SFS
    if plot_sfs:
        for pop_index, pop in enumerate(populations):
            logging.info(f'Plotting SFS {output_base} for {pop}')
            for tree_seqs, chr in zip([ts_chr1, ts_X_chromosomes, ts_Y_chromosomes, ts_mtDNAs],
                                      ['chr1', 'chrX', 'chrY', 'mtDNA']):
                get_sfs(tree_seqs, pop, pop_index, f"{output_base.split('/')[-1]}_{pop}_{chr}_sfs.png", chr)

    # write .pop file for supervised admixture runs
    for ts, chr, individuals in zip([ts_chr1, ts_X_chromosomes, ts_Y_chromosomes, ts_mtDNAs],
                                    ['chr1', 'chrX', 'chrY', 'mtDNA'],
                                    [individuals_chr1, individuals_X, individuals_Y, individuals_mtDNA]):
        write_pop_file(ts, individuals, populations, target_population, f"{output_base}_{chr}.pop")

    ## write VCFs in parallel
    ready_to_map = [(tree_seqs, indvs, indv_names, f"{output_base}_{chr}.vcf", chr, populations) for
                    tree_seqs, indvs, indv_names, chr in zip([ts_chr1, ts_X_chromosomes, ts_Y_chromosomes, ts_mtDNAs],
                                                             [individuals_chr1, individuals_X, individuals_Y,
                                                              individuals_mtDNA],
                                                             [sampled_individuals, sampled_individuals_X,
                                                              sampled_individuals_Y, sampled_individuals_mtDNA],
                                                             ['chr1', 'chrX', 'chrY', 'mtDNA'])]
    # write_vcf(ready_to_map[3])
    threads = min([threads, len(ready_to_map)])
    install_mp_handler()
    pool = mp.Pool(processes=threads)
    pool.map(write_vcf, ready_to_map)
    pool.close()
    pool.join()


def write_pop_file(ts, individuals, populations, target_population, outfile):
    source_populations = [populations[ts.node(ts.individual(indv).nodes[0]).population]
                          if populations[ts.node(ts.individual(indv).nodes[0]).population] != target_population
                          else '-' for indv in individuals]
    with open(outfile, 'w') as f:
        f.write(source_populations[0])
        for pop in source_populations[1:]:
            f.write('\n')
            f.write(pop)
    f.close()


def write_vcf(args):
    """
    Write VCF file for a given chromosome and population
    :param args: (tuple), (TreeSequence, individual IDs, individual names, population, output file, chromosomes)
    """
    ts, individuals, indv_names, outfile, chr, populations = args
    # name the sample
    individual_names = [f"{populations[ts.node(ts.individual(indv).nodes[0]).population]}{indv_name + 1}"
                        for indv, indv_name in zip(individuals, indv_names)]

    # get contig id
    if chr == 'chr1':
        contig_id = '1'
    elif chr == 'chrX':
        contig_id = '23'
    elif chr == 'chrY':
        contig_id = '24'
    elif chr == 'mtDNA':
        contig_id = '26'
    logging.info(f'Writing VCF {outfile}')
    # see https://tskit.dev/tskit/docs/stable/python-api.html#tskit.TreeSequence.write_vcf
    with open(outfile, 'w') as vcf_file:
        ts.write_vcf(vcf_file, individuals=individuals, individual_names=individual_names, contig_id=contig_id)
    logging.info(f"Finished Writing VCF {outfile}")
    vcf_file.close()


def main(argv):
    parser = argparse.ArgumentParser(description="Recaptitate tree sequence and superimpose neutral mutations. "
                                                 "For qualtiy control, recapitation is visualized and SFS is plotted "
                                                 "after superimposing mutations")
    parser.add_argument('--tree_sequences', help='Input tree sequences. Multiple files can be specified separated by '
                                                 'a space. [trees/single_pulse_admixture.trees]',
                        default=['trees/single_pulse_admixture.trees'], nargs='+')
    parser.add_argument('-o', '--output_tree_sequences', help='Output file with mutated tree sequences.'
                                                              'Creates a corresponding pickle file with pointers to '
                                                              'nodes with Y chromosomes and mtDNAs. '
                                                              'Multiple files can be specified separated by a space.'
                                                              '[trees/single_pulse_admixture_mutated.trees]',
                        default=['trees/single_pulse_admixture_mutated.trees'], nargs='+')
    parser.add_argument('--output_vcf_dir', help='Directory where to save VCF file [./data]', default='./data')
    parser.add_argument('-N', '--ancestral_population_size', help='Ancestral population size [7310.370867595234]',
                        type=float, default=7310.370867595234)
    parser.add_argument('--mtDNA_mutation_rate', help='Mutation rate mitochondrial DNA [4e-4]', type=float,
                        default=4e-4)
    parser.add_argument('-m', '--mutation_rate', type=float, default=2.36e-8, help='Germ line mutation rate [2.36e-8')
    parser.add_argument('-l', '--chromosome_length', help='Chromosome Length [1e8]', default=1e8, type=float)
    parser.add_argument('--mtDNA_length', help='Length of mitochondrial DNA [20000]', type=float, default=20000)
    parser.add_argument('--populations', nargs='+', help='Simulated populations in order [afr, eur, ea, amr]',
                        default=['afr', 'eur', 'ea', 'amr'])
    parser.add_argument('-s', '--sample_size', type=int, default=100, help='Sample size, same for each population [100]')
    parser.add_argument('--target_population', help='Population of most interest. Extract full sample size '
                                                    'for this population. [amr]', default='amr')
    parser.add_argument('--off_target_sample_size', help='Reference populations. Use this reduced sample size. [1000]',
                        default=1000, type=int)
    parser.add_argument('--plot_sfs', action='store_true', help='Plot SFS only', default=False)
    parser.add_argument('--recapitate', help='Whether or not do recapitation to ensure complete coalescence [False]',
                        action='store_true', default=False)
    parser.add_argument('--force', help='Force re-run if intermediate files already exist [False]',
                        action='store_true', default=False)
    parser.add_argument('--log', help='Path to log file [log/adding_mutations.log]', default='log/adding_mutations.log')
    parser.add_argument('-t', '--threads', help='Number of processors [16]', default=16, type=int)
    args = parser.parse_args()
    if args.output_vcf_dir.endswith('/'):
        output_vcf_dir = args.output_vcf_dir
    else:
        output_vcf_dir = args.output_vcf_dir + '/'
    if not os.path.isdir(output_vcf_dir):
        os.makedirs(output_vcf_dir)
    logging.basicConfig(filename=args.log, level=logging.INFO,
                        format='%(asctime)s - %(levelname)s - %(message)s', datefmt='%H:%M:%S')
    # install_mp_handler()
    ancestral_population_size = int(args.ancestral_population_size)
    for ts, ots in zip(args.tree_sequences, args.output_tree_sequences):
        logger = logging.getLogger("Overlaying mutations on {ts}")
        mutated, pointers_Y_mtDNA = add_mutations(ts, ots,
                                                  ancestral_population_size, args.mutation_rate,
                                                  args.mtDNA_mutation_rate, args.chromosome_length, args.mtDNA_length,
                                                  args.populations, args.recapitate, args.force)
        output_base = f"{output_vcf_dir}{ts.split('/')[-1].split('.trees')[0]}"
        extract_population_specific_chromosomes(mutated, pointers_Y_mtDNA, output_base, args.chromosome_length,
                                                args.mtDNA_length, args.populations, args.sample_size,
                                                args.target_population, args.off_target_sample_size, args.plot_sfs,
                                                args.threads)


if __name__ == '__main__':
    main(sys.argv[1:])

Python numpy matplotlib msprime pyslim multiprocessing-logging From line 2 of scripts/add_mutations_to_tree_sequence.py

import os
import sys
import argparse
import subprocess


def run_simulations(recipe, ots, rec, length, length_mtDNA):
    """
    Execute SLiM simulations
    :param recipe: str, path to recipe to use
    :param ots: str, path to output tree sequence
    :param rec: float, recombination rate
    :param length: int, chromosome length
    :param length_mtDNA: int, length of mtDNA
    """
    logfile = ots.replace('.trees', '.log')
    command = ' '.join(['slim', '-d', f"\"outfile='{ots}'\"", '-d', f"\"logfile='{logfile}'\"",
                        '-d', f'rec={rec}', '-d', f'L={length}', '-d', f'L_mtDNA={length_mtDNA}', recipe])
    subprocess.check_call(command, shell=True)


def main(argv):
    parser = argparse.ArgumentParser(description='Simulate ancestral continental mutations')
    parser.add_argument('-r', '--recombination_rate', type=float, default=1e-8, help='Recombination rate. [1e-8]')
    parser.add_argument('--recipe', help='SLiM recipe to use [scripts/afr_eur_ea.slim]', type=str,
                        default='scripts/afr_eur_ea.slim')
    parser.add_argument('-l', '--chromosome_length', help='Chromosome Length [1e6]', default=1e6, type=float)
    parser.add_argument('--mtDNA_length', help='Length of mitochondrial DNA [20000]', type=float, default=20000)
    parser.add_argument('-o', '--output_tree_sequences', help='File to write tree sequences to.'
                                                              '[./trees/afr_eur_ea.trees]',
                        default='./trees/afr_eur_ea.trees')
    args = parser.parse_args()
    output_dir = '/'.join(args.output_tree_sequences.split('/')[:-1])
    if not os.path.isdir(output_dir):
        os.makedirs(output_dir)
    run_simulations(args.recipe, args.output_tree_sequences, args.recombination_rate, args.chromosome_length,
                    args.mtDNA_length)


if __name__ == '__main__':
    main(sys.argv[1:])

Python SLiM From line 2 of scripts/afr_eur_ea_populations.py

import sys
import argparse
import subprocess


def run_simulations(recipe, ts, out, rec, length, mtDNA_length, afr, eur, ea, sm_afr, sm_eur, sm_ea, N_amr,
                    afr_const=None, eur_const=None, ea_const=None, sm_afr_const=None, sm_eur_const=None,
                    sm_ea_const=None, amr_growth_rate=0.0, rejection_rate_female_eur_male_non_eur=None):
    """
    Execute SLiM simulations
    :param recipe: str, path to SLiM recipe
    :param ts: str, path to input tree sequence from where to start simulations
    :param out: str, path where to write output tree sequence to
    :param rec: float, recombination rate
    :param length: int, chromosome length
    :param mtDNA_length: int, mtDNA length
    :param afr: float, African admixture fraction
    :param eur: float, European admixture fraction
    :param ea: float, East Asian admixture fraction
    :param sm_afr: float, fraction of African males
    :param sm_eur: float, fraction of European males
    :param sm_ea: float, fraction of East Asian males
    :param N_amr: float, population size of admixed population
    :param afr_const: float, African migration after initial admixture; default=None
    :param eur_const: float, European migration after initial admixture; default=None
    :param ea_const: float, East Asian migration after initial admixture; default=None
    :param sm_afr_const: float, fraction of constantly migrating African males; default=None
    :param sm_eur_const: float, fraction of constantly migrating European males; default=None
    :param sm_ea_const: float, fraction of constantly migrating East Asian males; default=None
    :param amr_growth_rate: float, exponential growth rate of admixed population
    :param rejection_rate_female_eur_male_non_eur: float, Fraction of matings of European females and non-European males
                                                          to reject in non_random mating scheme
    """
    logfile = f"{out.split('.trees')[0]}.log"
    command = ['slim', '-d', f"\"input_tree_sequence='{ts}'\"", '-d', f"\"outbase='{out}'\"",
               '-d', f"\"logfile='{logfile}'\"", '-d', f'rec={rec}', '-d', f'L={length}',
               '-d', f'L_mtDNA={mtDNA_length}', '-d', f'afr={afr}', '-d', f'eur={eur}', '-d', f'ea={ea}',
               '-d', f'sm_afr={sm_afr}',  '-d', f'sm_eur={sm_eur}',  '-d', f'sm_ea={sm_ea}',
               '-d', f'adx_n={N_amr}', '-d', f'amr_growth_rate={amr_growth_rate}']
    if not afr_const is None:
        command.extend(["-d", f'afr_const={afr_const}', "-d", f'eur_const={eur_const}', "-d", f'ea_const={ea_const}',
                        "-d", f'sm_afr_const={sm_afr_const}', "-d", f'sm_eur_const={sm_eur_const}',
                        "-d", f'sm_ea_const={sm_ea_const}'])
    if not rejection_rate_female_eur_male_non_eur is None and \
            (recipe == 'scripts/single_pulse_non_random_mating.slim' or
             recipe == 'scripts/constant_admixture_non_random_mating.slim'):
        command.extend(["-d", f'rejection_rate_female_eur_male_non_eur={rejection_rate_female_eur_male_non_eur}'])
    command.append(recipe)
    command = ' '.join(command)
    subprocess.check_call(command, shell=True)


def main(argv):
    parser = argparse.ArgumentParser(description="Simulate three-way (American) admixture of ancestral "
                                                 "continental populations")
    parser.add_argument('-r', '--recombination_rate', type=float, default=1e-8, help='Recombination rate. [1e-8]')
    parser.add_argument('--recipe', help='SLiM recipe to use [scripts/single_pulse_admixture.slim]', type=str,
                        default='scripts/single_pulse_admixture.slim')
    parser.add_argument('-l', '--chromosome_length', help='Chromosome Length [1e6]', default=1e6, type=float)
    parser.add_argument('--mtDNA_length', help='Length of mitochondrial DNA [20000]', type=float, default=20000)
    parser.add_argument('--tree_sequences', help='Input tree sequences [trees/afr_eur_ea.trees]',
                        default='trees/afr_eur_ea.trees')
    parser.add_argument('--afr', help='African admixture fraction [1/6]', default=1/6, type=float)
    parser.add_argument('--eur', help='European admixture fraction [1/3]', default=1/3, type=float)
    parser.add_argument('--ea', help='East Asian admixture fraction [1/2]', default=1/3, type=float)
    parser.add_argument('--sm_afr', help='Fraction of males of admixing Africans [1/2]', default=1/2, type=float)
    parser.add_argument('--sm_eur', help='Fraction of males of admixing Europeans[1/2]', default=1/2, type=float)
    parser.add_argument('--sm_ea', help='Fraction of males of admixing East Asians [1/2]', default=1/2, type=float)
    parser.add_argument('--N_amr', help='Initial size of admixed American population [1000]', default=1000, type=int)
    parser.add_argument('--afr_const', help='Constant African admixture fraction', default=None, type=float)
    parser.add_argument('--eur_const', help='Constant European admixture fraction', default=None, type=float)
    parser.add_argument('--ea_const', help='Constant East Asian admixture fraction', default=None, type=float)
    parser.add_argument('--sm_afr_const', help='Fraction of males of constantly admixing Africans', default=None,
                        type=float)
    parser.add_argument('--sm_eur_const', help='Fraction of males of constantly admixing Europeans', default=None,
                        type=float)
    parser.add_argument('--sm_ea_const', help='Fraction of males of constantly admixing East Asians', default=None,
                        type=float)
    parser.add_argument('--amr_growth_rate', help='Exponential growth rate of admixed populations [0.0]', default=0.0,
                        type=float)
    parser.add_argument('--rejection_rate_female_eur_male_non_eur',
                        help='Fraction of matings of European females and non-European males to reject in '
                             'non-random mating scheme. Only has an effect if '
                             '--recipe scripts/single_pulse_non_random_mating.slim', default=None, type=float)
    args = parser.parse_args()

    output_treeseq_base = f"{'/'.join(args.tree_sequences.split('/')[:-1])}/{args.recipe.split('/')[-1].split('.slim')[0]}_g"
    if not args.afr_const is None:
        assert args.eur_const is not None and args.ea_const is not None and args.sm_afr_const is not None and \
               args.sm_eur_const is not None and args.sm_ea_const is not None and "constant" in args.recipe, \
            "If specifing one constant admixture parameter you have to specify all and " \
            "select SLiM recipe for constant admixture"
    run_simulations(args.recipe, args.tree_sequences, output_treeseq_base, args.recombination_rate,
                    args.chromosome_length, args.mtDNA_length, args.afr, args.eur, args.ea, args.sm_afr,
                    args.sm_eur, args.sm_ea, args.N_amr, args.afr_const, args.eur_const, args.ea_const,
                    args.sm_afr_const, args.sm_eur_const, args.sm_ea_const, args.amr_growth_rate,
                    args.rejection_rate_female_eur_male_non_eur)


if __name__ == '__main__':
    main(sys.argv[1:])

Python SLiM From line 2 of scripts/american_admixture.py

import sys
import argparse
import re
import logging
import pandas as pd
import numpy as np
from sklearn.metrics import pairwise_distances
from scipy.spatial.distance import euclidean
from sklearn.cluster import DBSCAN, KMeans, AffinityPropagation, AgglomerativeClustering, SpectralClustering, MeanShift


def loss_clustering(eigenvec, labels, populations):
    """
    Quantify loss based on ancestry of homogeneity of assigned clusters. Distance between individuals of same cluster
    only contributes to loss if they come from different populations
    :param eigenvec: array-like, PCs of samples
    :param labels: array-like, assigned cluster IDs
    :param populations: array-like, populations of origin
    :return: float, loss
    """
    pops = np.array(populations)
    # calculate pairwise distances within clusters
    dist = pairwise_distances(eigenvec)
    # 1 if in same cluster, 0 if in different one
    indicator_matrix_same_cluster =(labels[:, None] == labels[None, :]).astype(int)
    # 0 if sampled from same population, 1 if sampled from different ones
    indicator_matrix_same_pops = (pops[:, None] != pops[None, :]).astype(int)
    # multiple distances by indicator matrices
    loss = np.sum(dist * indicator_matrix_same_pops * indicator_matrix_same_cluster)
    return loss


def find_best_clustering(eigenvec_source, populations_source, not_accepted_algos=[]):
    """
    Find best clustering via GridSearch
    :param eigenvec_source: array-like, PCs of samples of source populations
    :param populations_source: array-like, origin populations of samples
    :param not_accepted_algos: list, clustering algorithms that were previously rejected
    :return: dict, best algorithm + params
    """
    c_error = -1

    # DBSCAN
    if not 'dbscan' in not_accepted_algos:
        logging.info('Running DBSCAN')
        eps_vals = [1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1]
        min_samples = np.arange(5, 50, 5)
        for eps in eps_vals:
            for ms in min_samples:
                dbscan = DBSCAN(min_samples=ms, eps=eps).fit(eigenvec_source.values)
                loss = loss_clustering(eigenvec_source, dbscan.labels_, populations_source)
                if loss < c_error or c_error == -1:
                    best_params = {'min_samples': ms, 'eps': eps, 'algo': 'dbscan'}
                    c_error = loss

        logging.info('Best run: Loss: {:.4f}; Params: {}'.format(c_error, best_params))

    n_clusters = np.arange(3, 50, 1)
    if not 'kmeans' in not_accepted_algos:
        # KMeans
        logging.info('Running KMeans')
        for n in n_clusters:
            kmeans = KMeans(n_clusters=n, random_state=1).fit(eigenvec_source.values)
            loss = loss_clustering(eigenvec_source, kmeans.labels_, populations_source)
            if loss < c_error or c_error == -1:
                best_params = {'n_clusters': n, 'algo': 'kmeans'}
                c_error = loss
        logging.info('Best run: Loss: {:.4f}; Params: {}'.format(c_error, best_params))

    if not 'agglomerative' in not_accepted_algos:
        # Agglomerative
        logging.info('Running Agglomerative Clustering')
        for n in n_clusters:
            agglo = AgglomerativeClustering(n_clusters=n).fit(eigenvec_source.values)
            loss = loss_clustering(eigenvec_source, agglo.labels_, populations_source)
            if loss < c_error or c_error == -1:
                best_params = {'n_clusters': n, 'algo': 'agglomerative'}
                c_error = loss
        logging.info('Best run: Loss: {:.4f}; Params: {}'.format(c_error, best_params))

    if not 'affinity' in not_accepted_algos:
        # Affinity
        logging.info('Running Affinity propagation')
        damping = np.arange(0.5, 1.0, 0.05)
        for d in damping:
            affinity = AffinityPropagation(damping=d, max_iter=500, random_state=1).fit(eigenvec_source)
            loss = loss_clustering(eigenvec_source, affinity.labels_, populations_source)
            if loss < c_error  or c_error == -1:
                best_params = {'damping': d, 'algo': 'affinity'}
                c_error = loss

        logging.info('Best run: Loss: {:.4f}; Params: {}'.format(c_error, best_params))

    if not 'meanshift' in not_accepted_algos:
        # MeanShift
        logging.info('Running Mean Shift')
        meanshift = MeanShift().fit(eigenvec_source.values)
        loss = loss_clustering(eigenvec_source, meanshift.labels_, populations_source)
        if loss < c_error or c_error == -1:
            best_params = {'algo': 'meanshift'}
            c_error = loss

        logging.info('Best run: Loss: {:.4f}; Params: {}'.format(c_error, best_params))

    if not 'spectral' in not_accepted_algos:
        # Spectral
        logging.info("Running Spectral Clustering")
        for n in n_clusters:
            spectral = SpectralClustering(n_clusters=n, random_state=1).fit(eigenvec_source.values)
            loss = loss_clustering(eigenvec_source, spectral.labels_, populations_source)
            if loss < c_error or c_error == -1:
                best_params = {'n_clusters': n, 'algo': 'spectral'}
                c_error = loss
        logging.info('Best run: Loss: {:.4f}; Params: {}'.format(c_error, best_params))

    return best_params


def cluster_to_ancestry(labels, populations):
    """
    Map cluster IDs to ancestries based on majority
    :param labels: array-like, cluster IDs
    :param populations: array-like, populations of origin
    :return: dict, mapping cluster id - ancestry
    """
    cluster_to_labels = {}
    for l in labels:
        pops, counts = np.unique(populations[labels == l], return_counts=True)
        c_pop = pops[np.argmax(counts)]
        cluster_to_labels[l] = c_pop
    return cluster_to_labels


def dbscan_predict(dbscan_model, X_test, metric=euclidean):
    """
    Predict DBSCAN cluster belonging
    :param dbscan_model: trained DBSCAN model
    :param X_test: array-like, PCs American samples
    :param metric: sklearn.spatial.distance metric
    :return: array-like, cluster IDs of American samples
    """
    # Taken from https://stackoverflow.com/questions/27822752/scikit-learn-predicting-new-points-with-dbscan
    # Result is noise by default
    y_test = np.ones(shape=len(X_test), dtype=int) * -1

    # Iterate all input samples for a label
    for j, x_new in enumerate(X_test):
        # Find a core sample closer than EPS
        for i, x_core in enumerate(dbscan_model.components_):
            if metric(x_new, x_core) < dbscan_model.eps:
                # Assign label of x_core to x_new
                y_test[j] = dbscan_model.labels_[dbscan_model.core_sample_indices_[i]]
                break
    return y_test


def find_nearest_neighbor(X_train, labels, X_test):
    """
    Define cluster belonging based on nearest neighbor
    :param X_train: array-like, PCs of training samples (non-America)
    :param labels: array-like, cluster ids of training samples
    :param X_test: array-like, PCs of American samples
    :return: array-like, cluster IDs of American samples
    """
    y_test = np.ones(len(X_test))
    for i, x in enumerate(X_test):
        if len(x.shape) == 1:
            x = x[np.newaxis, :]
        # find nearest neighbor
        dist = pairwise_distances(X_train, x)
        y_test[i] = labels[np.argmin(dist)]
    return y_test


def predict_amr_haplogroups(eigenvec_source, eigenvec_amr, populations_source, best_params):
    """
    Predicted cluster belonging of American samples
    :param eigenvec_source: pd.DataFrame, PCs of individuals from source populations
    :param eigenvec_amr: pd.DataFrame, PCs of individuals from American population
    :param populations_source: array-like, population of origin of non-American samples
    :param best_params: dict, best algorithm and parameters
    :return: trained clustering algorithm, predicted cluster ids, corresponding ancestries
    """
    logging.info('Predicting American haplogroups')
    # retrain clustering
    if best_params['algo'] == 'dbscan':
        clustering = DBSCAN(eps=best_params['eps'], min_samples=best_params['min_samples']).fit(eigenvec_source.values)
    elif best_params['algo'] == 'agglomerative':
        clustering = AgglomerativeClustering(n_clusters=best_params['n_clusters']).fit(eigenvec_source.values)
    elif best_params['algo'] == 'kmeans':
        clustering = KMeans(n_clusters=best_params['n_clusters']).fit(eigenvec_source.values)
    elif best_params['algo'] == 'affinity':
        clustering = AffinityPropagation(damping=best_params['damping']).fit(eigenvec_source.values)
    elif best_params['algo'] == 'meanshift':
        clustering = MeanShift().fit(eigenvec_source.values)
    elif best_params['algo'] == 'spectral':
        clustering = SpectralClustering(n_clusters=best_params['n_clusters']).fit(eigenvec_source.values)
    # map cluster labels to ancestry based on majority
    cluster_ancestry_mapping = cluster_to_ancestry(clustering.labels_, populations_source)

    # predict cluster belonging of American individuals
    if best_params['algo'] == 'dbscan':
        amr_cluster = dbscan_predict(clustering, eigenvec_amr.values)
    elif best_params['algo'] == 'agglomerative' or best_params["algo"] == 'spectral':
        amr_cluster = find_nearest_neighbor(eigenvec_source.values, clustering.labels_, eigenvec_amr.values)
    else:
        amr_cluster = clustering.predict(eigenvec_amr.values)
    # convert cluster ids to ancestries
    amr_haplogroups = [cluster_ancestry_mapping[y] for y in amr_cluster]
    # summarize
    ancestry, counts = np.unique([amr_haplogroups], return_counts=True)
    logging.info([f'{counts[i]} {ancestry[i].upper()}' for i in range(len(ancestry))])
    return clustering, amr_cluster, amr_haplogroups


def check_clustering_and_predictions(clustering, predictions, algo):
    """
    Check if clustering and predictions were meaningful
    :param clustering: trained clustering algorithm from sklearn
    :param predictions: np.array, predicted cluster labels for american samples
    :param algo: str, name of algorithm
    :return: boolean, whether or not to accept clustering
    """
    logging.info('Checking clustering and predictions')
    # too many singletons
    if np.unique(clustering.labels_).shape[0] / clustering.labels_.shape[0] > 0.15:
        logging.info(
            '{} found to many clusters. Re-running find_best_clustering with excluding {}'.format(algo, algo))
        return False
    # too many training samples unassigned
    elif clustering.labels_[clustering.labels_ == -1].shape[0] / clustering.labels_.shape[0] > 0.1:
        logging.info('{} left too many samples unassigned. '
                     'Re-running find_best_clustering with excluding {}'.format(algo, algo))
        return False
    # too many american individuals were unassigned
    elif predictions[predictions == -1].shape[0] / predictions.shape[0] > 0.1:
        logging.info('{} left too many American samples unassigned. '
                     'Re-running find_best_clustering with excluding {}'.format(algo, algo))
        return False
    else:
        # accept
        return True


def main(argv):
    parser = argparse.ArgumentParser(description='Cluster ancestral mtDNA and Y haplogroups and assign ancestral '
                                                 'haplogroups of admixed individuals based on proximity to '
                                                 'ancestral clusters')
    parser.add_argument('--eigenvec', help='Path to .eigenvec file generated with plink. Multiple files can be '
                                           'specified separated by a space.', nargs='+')
    parser.add_argument('--log', help='Log file [log/classifying_mtDNA_Y_haplogroups.log]',
                        default='log/classifying_mtDNA_Y_haplogroups.log')
    parser.add_argument('--output_file', help="File to write haplogroup assignments to. Multiple files can be "
                                              "specified separated by a space.", nargs='+')
    args = parser.parse_args()
    logging.basicConfig(filename=args.log, level=logging.INFO,
                        format='%(asctime)s - %(levelname)s - %(message)s', datefmt='%H:%M:%S')
    logger = logging.getLogger()
    for eigenvec, output_file in zip(args.eigenvec, args.output_file):
        eigenvec = pd.read_csv(eigenvec, sep='\t')
        eigenvec.set_index('#IID', inplace=True)
        populations = np.array([re.sub('[0-9]*', '', ind) for ind in eigenvec.index.values])
        eigenvec_source = eigenvec.loc[populations != 'amr',]
        populations_source = populations[populations != 'amr']
        eigenvec_amr = eigenvec.loc[populations == 'amr',]
        accept_clustering = False
        not_accepted_clustering = []
        while not accept_clustering:
            best_params = find_best_clustering(eigenvec_source, populations_source,
                                               not_accepted_algos=not_accepted_clustering)
            clustering, amr_cluster, amr_haplogroups = predict_amr_haplogroups(eigenvec_source, eigenvec_amr,
                                                                               populations_source, best_params)
            accept_clustering = check_clustering_and_predictions(clustering, amr_cluster, best_params['algo'])
            if not accept_clustering:
                logging.info('Did not accept clustering {}'.format(best_params))
                not_accepted_clustering.append(best_params['algo'])
            if len(not_accepted_clustering) == 6:
                logging.info('None of the tried clustering algorithms was accepted ({}). '
                             'Try something else.'.format(not_accepted_clustering))
                sys.exit(1)
        logging.info('Accepted {}'.format(best_params))
        cluster_ancestry_mapping = cluster_to_ancestry(clustering.labels_, populations_source)
        source_haplogroups = [cluster_ancestry_mapping[y] for y in clustering.labels_]
        haplogroups = pd.DataFrame(index=eigenvec.index.values)
        haplogroups.loc[eigenvec_source.index.values, 'haplogroup'] = source_haplogroups
        haplogroups.loc[eigenvec_amr.index.values, 'haplogroup'] = amr_haplogroups
        logging.info(f'Writing haplogroup assignments to {output_file}')
        haplogroups.to_csv(output_file, sep='\t', header=True, index=True)


if __name__ == '__main__':
    main(sys.argv[1:])

Python Pandas numpy scipy sklearn BiocSklearn From line 2 of scripts/classify_mtDNA_Y_haplogroups.py

import sys
import argparse
import subprocess


def main(argv):
    parser = argparse.ArgumentParser(description='Convert vcf to plink file')
    parser.add_argument('--vcf', help='Input vcf. Multiple files can be specified separated by a space.', nargs='+',
                        required=True)
    parser.add_argument('--output_base', help='Output bases. Multiple files can be specified separated by a space.',
                        nargs='+', required=True)
    parser.add_argument('--sex', help='Individual sex mapping files. Multiple files can be specified separated '
                                      'by a space.', nargs='+', required=True)
    parser.add_argument('--pruned', help='Plink *prune.in files. Multiple files can be specified separated by a space.',
                        nargs='+', default=[])
    parser.add_argument('--threads', help='Number of CPUs [1]', default=1)
    args = parser.parse_args()
    if len(args.pruned) > 0:
        for i, o, s, p in zip(args.vcf, args.output_base, args.sex, args.pruned):
            subprocess.call(f"plink2 --threads {args.threads} --chr-set 40 --allow-extra-chr --vcf {i} --make-bed "
                            f"--extract {p} --out {o} --pca 20 approx --update-sex {s} --max-alleles 2 --maf 0.01",
                            shell=True)
    else:
        for i, o, s in zip(args.vcf, args.output_base, args.sex):
            subprocess.call(f"plink2 --threads {args.threads} --chr-set 40 --allow-extra-chr --vcf {i} --make-bed "
                            f"--out {o} --pca 20 approx --update-sex {s} --max-alleles 2", shell=True)


if __name__ == '__main__':
    main(sys.argv[1:])

Python pLink plink2 From line 2 of scripts/convert_vcf_to_plink.py

import sys
import argparse
import subprocess


def main(argv):
    parser = argparse.ArgumentParser(description='Perform LD pruning using plink2')
    parser.add_argument('--vcf', help='Input vcf. Multiple files can be specified separated by a space.', nargs='+',
                        required=True)
    parser.add_argument('--output_base', help='Output bases. Multiple files can be specified separated by a space.',
                        nargs='+', required=True)
    parser.add_argument('--sex', help='Individual sex mapping files. Multiple files can be specified separated '
                                      'by a space.', nargs='+', required=True)
    parser.add_argument('--threads', help='Number of CPUs [1]', default=1)
    args = parser.parse_args()
    for i, o, s in zip(args.vcf, args.output_base, args.sex):
        subprocess.call(f"plink2 --threads {args.threads} --allow-extra-chr --vcf {i} --indep-pairwise 50 kb 1 0.1 "
                        f"--out {o} --update-sex {s}", shell=True)


if __name__ == '__main__':
    main(sys.argv[1:])

Python plink2 From line 2 of scripts/ld_pruning_plink2.py

import matplotlib.pyplot as plt
from matplotlib.lines import Line2D
import sys 
import argparse
import pandas as pd
import numpy as np
import os
import re


def visualize_pcs(eigenvecs, outfile, pc_a=1, pc_b=2):
    """
    Visualize PCA done with plink
    :param eigenvecs: str, path to .eigenvecs file
    :param outfile: str, where to save Figure to
    :param pc_a: int, first PC to plot
    :param pc_b: int, second PC to plot
    """
    num_pop = eigenvecs.population.unique().shape[0]
    populations = eigenvecs.population.unique()
    np.random.seed(1)
    colors = [(np.random.random(), np.random.random(), np.random.random()) for i in range(num_pop)]
    color_dict = {populations[i]: colors[i] for i in range(num_pop)}
    colors = np.array([color_dict[pop] for pop in eigenvecs.population])
    fig, ax = plt.subplots()
    ax.scatter(eigenvecs.loc[:, f'PC{pc_a}'], eigenvecs.loc[:, f'PC{pc_b}'], color=colors)
    ax.set_xlabel(f'PC{pc_a}')
    ax.set_ylabel(f'PC{pc_b}')
    legend_elements = [Line2D([0], [0], color=color_dict[pop], marker='o', label=pop.upper()) for pop in populations]
    ax.legend(handles=legend_elements, loc='upper center', ncol=4, bbox_to_anchor=(0.5, -.14))
    fig.savefig(outfile, bbox_inches='tight')
    plt.close()


def main(argv):
    parser = argparse.ArgumentParser(description='Visualize PCA performed with plink2')
    parser.add_argument('--eigenvec', nargs='+', help='.eigenvec file that were generated by plink. '
                                                      'You can specify multiple separated by a space')
    parser.add_argument('--plot_dir', help='Directory where to save plots')
    args = parser.parse_args()
    if args.plot_dir.endswith('/'):
        plot_dir = args.plot_dir
    else:
        plot_dir = args.plot_dir + '/'
    if not os.path.isdir(plot_dir):
        os.makedirs(plot_dir)
    eigenvec_files=sorted(args.eigenvec)
    for eigenvec in eigenvec_files:
        eigenvectors = pd.read_csv(eigenvec, sep='\t')
        eigenvectors.set_index('#IID', inplace=True)
        populations = np.array([re.sub('[0-9]*', '', ind) for ind in eigenvectors.index.values])
        eigenvectors.loc[:, 'population'] = populations
        outfile = plot_dir + eigenvec.split('/')[-1].split('.eigenvec')[0] + "_pca.png"
        visualize_pcs(eigenvectors, outfile)  


if __name__ == '__main__':
    main(sys.argv[1:])

Python Pandas numpy matplotlib pLink plink2 From line 2 of scripts/plot_pca.py

import sys
import argparse
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
import os
from matplotlib.lines import Line2D
import logging
from sklearn.utils import resample


def read_results(admixture_results_autosomes, admixture_results_chrx, fam_file):
    """
    Read Admixture results for autosomes and X chromosomes as well fam file to obtain individual IDs and sexes
    :param admixture_results_autosomes: str, Path to *.Q file for autosomes
    :param admixture_results_chrx: str, Path to *.Q file for X chromosomes
    :param fam_file: str, Path to *.fam file
    :return: pd.DataFrame, pd.DataFrame, admixture results for autosomes and X chromosomes
    """
    # read fam file
    indv_names_and_sex = pd.read_csv(fam_file, sep='\t', header=None, usecols=[1, 4])
    # read Q file autosomes
    K = int(admixture_results_autosomes.split('.')[-2])
    admixture_autosomes = pd.read_csv(admixture_results_autosomes, sep=' ', header=None,
                                      names=[f'K{i}' for i in range(K)])
    # send IDs as index
    admixture_autosomes.index = indv_names_and_sex.iloc[:, 0].values
    # add sexes
    admixture_autosomes.loc[:, 'sex'] = indv_names_and_sex.iloc[:, 1].values - 1
    # read Q file X chromosomes
    admixture_x = pd.read_csv(admixture_results_chrx, sep=' ', header=None, names=[f'K{i}' for i in range(K)])
    # set IDs as index
    admixture_x.index = indv_names_and_sex.iloc[:, 0].values
    # add sexes
    admixture_x.loc[:, 'sex'] = indv_names_and_sex.iloc[:, 1].values - 1
    # add sample populations
    populations = np.array([re.sub('[0-9]*', '', ind) for ind in indv_names_and_sex.iloc[:, 0]])
    admixture_autosomes.loc[:, 'pop'] = populations
    admixture_x.loc[:, 'pop'] = populations
    return admixture_autosomes, admixture_x


def assign_k_to_ancestry(df, ancestries=['afr', 'eur', 'ea']):
    """
    Identify K corresponding to source ancestries
    :param df: pd.DataFrame, admixture results
    :param ancestries: list, source populations
    :return: pd.DataFrame, with renamed columns according to main ancestries
    """
    mean_per_k = df.groupby("pop").mean().iloc[:, :-1]
    mean_per_k = mean_per_k.loc[ancestries]
    max_ind_k = np.argmax(mean_per_k.values, axis=0)
    if np.unique(max_ind_k).shape[0] < len(ancestries):
        logging.debug(f"Inadequate number of source populations inferred."
                      f"The mean contributions of source populations to different cluster were\n{mean_per_k}")
        raise ValueError('Not all source ancestries are represented. You likely used an inadequate number for K. '
                         'Check log file for more details')
    new_col_names = {col: ancestries[max_ind_k[i]] for i, col in enumerate(mean_per_k.columns)}
    df.rename(columns=new_col_names, inplace=True)
    new_df = pd.DataFrame(index=df.index)
    for anc in ancestries:
        try:
            new_df.loc[:, anc] = df.loc[:, anc].sum(axis=1)
        except ValueError:
            new_df.loc[:, anc] = df.loc[:, anc]
    new_df.loc[:, 'pop'] = df.loc[:, 'pop'].copy()
    new_df.loc[:, 'sex'] = df.loc[:, 'sex'].copy()
    return new_df


def plot_autosome_proportions_vs_x_proportions(proportions_autosomes, proportions_x, output_base):
    """
    Plot autosomal vs. X chromosomal ancestry proportions
    :param proportions_autosomes: pd.DataFrame, autosomal admixture results
    :param proportions_x: pd.DataFrame, X chromosomal admixture results
    :param output_base: str, output base to save plot per population to
    """
    np.random.seed(13)
    populations = proportions_autosomes.loc[:, 'pop'].unique()
    source_pop = np.sort(proportions_autosomes.columns[:-2])
    colors = {col: (np.random.random(), np.random.random(), np.random.random()) for col in source_pop}
    for pop in populations:
        fig, ax = plt.subplots()
        c_pop_autosomes = proportions_autosomes[proportions_autosomes['pop'] == pop]
        c_pop_x = proportions_x[proportions_x['pop'] == pop]
        for s_pop in source_pop:
            ax.scatter(c_pop_x[s_pop].values, c_pop_autosomes[s_pop].values, label=s_pop,
                       color=colors[s_pop], alpha=0.5)
        ax.set_xlabel('X chromosome proportion')
        ax.legend(bbox_to_anchor=(0.5, -.14), ncol=3, loc='upper center')
        ax.set_ylabel('Autosome proportion')
        ax.plot([0, 1], [0, 1], ls='--', c='black')
        ax.set_xlim([0, 1])
        ax.set_ylim([0, 1])
        fig.savefig(output_base + f'_A_vs_X_{pop}.png', bbox_inches='tight')
        plt.close()


def plot_histogram_of_ancestry_proportions(proportions_autosomes, proportions_x, output_base):
    """
    Plot cumulative histograms of autosomal and X chromosomal ancestry proportions
    :param proportions_autosomes: pd.DataFrame, autosomal admixture results
    :param proportions_x: pd.DataFrame, X chromosomal admixture results
    :param output_base: str, output base to save plot per population to
    """
    np.random.seed(13)
    populations = proportions_autosomes.loc[:, 'pop'].unique()
    source_pop = np.sort(proportions_autosomes.columns[:-2])
    colors = {col: (np.random.random(), np.random.random(), np.random.random()) for col in source_pop}
    bins = np.arange(0, 1.01, 0.01)
    for pop in populations:
        fig, ax = plt.subplots()
        c_pop_autosomes = proportions_autosomes[proportions_autosomes['pop'] == pop]
        c_pop_x = proportions_x[proportions_x['pop'] == pop]
        for s_pop in source_pop:
            ax.hist(c_pop_x[s_pop].values, color=colors[s_pop], cumulative=True, ls='--', bins=bins,
                    histtype='step',
                    weights=np.repeat(1 / c_pop_x[s_pop].values.shape[0], c_pop_x[s_pop].values.shape[0]))
            ax.hist(c_pop_autosomes[s_pop].values, color=colors[s_pop], cumulative=True, ls='-', bins=bins,
                    histtype='step',
                    weights=np.repeat(1 / c_pop_autosomes[s_pop].values.shape[0],
                                      c_pop_autosomes[s_pop].values.shape[0]))
        ax.set_xlabel('Ancestry proportion')
        ax.set_ylabel('Cumulative density')
        handles = [Line2D([0], [0], color=colors[source_pop[n]], ls='--') for n in range(len(source_pop))]
        handles.extend([Line2D([0], [0], color='black', ls='-'), Line2D([0], [0], color='black', ls='--')])
        labels = source_pop.copy().tolist()
        labels.extend(['Autosomes', 'X chromosomes'])
        ax.legend(handles, labels, bbox_to_anchor=(0.5, -.14), loc='upper center', ncol=3)
        fig.savefig(output_base + f"_hist_a_x_proportions_{pop}.png", bbox_inches='tight')


def plot_admixture_results(df, outfile):
    """
    Plot admixture results
    :param df: pd.DataFrame, admixture results
    :param output_base: str, output base to save plot per population to
    """
    np.random.seed(13)
    populations = df['pop'].unique()
    source_pop = np.sort(df.columns[:-2])
    colors = {col: (np.random.random(), np.random.random(), np.random.random()) for col in source_pop}
    fig, ax = plt.subplots()
    proportions_dict = {}
    colors_dict = {}
    for i in range(len(source_pop)):
        proportions_dict[i] = np.zeros(df.shape[0])
        colors_dict[i] = np.zeros((df.shape[0], 3))
    i = 0
    n_samples = []
    for pop in populations:
        c_df = df[df['pop'] == pop]
        proportions = c_df.loc[:, source_pop]
        median_prop = np.median(proportions, axis=0)
        priorities = source_pop[np.argsort(median_prop)][::-1].tolist()
        for n in range(len(source_pop)):
            proportions_dict[n][i: i + c_df.shape[0]] = c_df.loc[:, priorities[n]]
            colors_dict[n][i: i + c_df.shape[0]] = colors[priorities[n]]
        i += c_df.shape[0]
        n_samples.append(c_df.shape[0])
    bottom = np.zeros(df.shape[0])
    for n in range(len(source_pop)):
        ax.bar(np.arange(0, df.shape[0]), proportions_dict[n], color=colors_dict[n], bottom=bottom, width=1)
        bottom += proportions_dict[n]
    n_samples = np.array(n_samples)
    cum_samples = np.cumsum(n_samples)
    n_samples = n_samples / 2
    xticks = cum_samples - n_samples
    ax.set_xticks(xticks)
    ax.set_ylim([0, 1.0])
    ax.set_xlim([0, df.shape[0]])
    ax.set_xticklabels([pop.upper() for pop in populations])
    ax.set_ylabel('Admixture proportion')
    ax.spines['top'].set_visible(False)
    ax.spines['bottom'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.spines['left'].set_visible(False)
    fig.savefig(outfile, bbox_inches='tight')
    plt.close()


def write_mean_ancestry_proportions(df, output_file):
    """
    Write mean ancestry proportions to file
    :param df: pd.DataFrame, mean ancestry proportions of source populations for each sample populations
    :param output_file: str, Path to file where to write proportions to
    """
    # bringing columns into the right order
    columns = {col: col.replace('ea', 'na').replace('_A', '_ZA').upper() for col in df.columns}
    df.rename(columns=columns, inplace=True)
    columns = sorted([col for col in df.columns])
    df = df.loc[:, columns]
    columns = {col: col.replace('_ZA', '_A').replace('SEX', 'pf') for col in df.columns}
    df.rename(columns=columns, inplace=True)
    df.index.name = ''
    df.to_csv(output_file, header=True, index=True)


def filter_individuals(admixture_autosomes):
    """
    Filtered out African/European/East Asian individuals with less than 95% African/Eurpean/East Asian ancestry as well
    as admixed individuals with more than 95% of one ancestry
    :param admixture_autosomes: pd.DataFrame, Admixture results
    :return: np.array, IIDs of individuals that not passed filtering
    """
    # filter Africans with less 95% African ancestry
    afr_to_remove = admixture_autosomes[(admixture_autosomes['pop'] == 'afr') &
                                        (admixture_autosomes.afr < 0.95)].index.values
    logging.info('Removing {} African individuals that have <95% African ancestry'.format(afr_to_remove.shape[0]))
    # filter Europeans with less 95% European ancestry
    eur_to_remove = admixture_autosomes[(admixture_autosomes['pop'] == 'eur') &
                                        (admixture_autosomes.eur < 0.95)].index.values
    logging.info('Removing {} European individuals that have <95% European ancestry'.format(eur_to_remove.shape[0]))
    # filter East Asians with less 95% East Asian ancestry
    ea_to_remove = admixture_autosomes[(admixture_autosomes['pop'] == 'ea') &
                                       (admixture_autosomes.ea < 0.95)].index.values
    logging.info('Removing {} East Asian individuals that have <95% East Asian ancestry'.format(ea_to_remove.shape[0]))
    # filter Americans with less 5% African ancestry
    amr_to_remove = admixture_autosomes[(admixture_autosomes['pop'] == 'amr') &
                                        (admixture_autosomes.loc[:, ['afr', 'eur', 'ea']].max(axis=1) > 0.95)].index.values
    logging.info('Removing {} American individuals that have >95% of one ancestry'.format(amr_to_remove.shape[0]))
    individuals_to_remove = np.concatenate([amr_to_remove, eur_to_remove, ea_to_remove, afr_to_remove])
    return individuals_to_remove


def compute_haplogroup_frequencies(df):
    """
    Compute haplogroup frequencies
    :param df: pd.DataFrame, haplogroup assignments
    :return: pd.DataFrame, haplogroup frequencies of sampled populations
    """
    haplogroups = np.sort(df.haplogroup.unique())
    populations = df.population.unique()
    frequencies = pd.DataFrame(index=populations, columns=haplogroups)
    for pop in populations:
        freqs = np.zeros_like(haplogroups)
        c_df = df[df.population == pop]
        counts = c_df.haplogroup.value_counts()
        for hg, c in zip(counts.index, counts):
            freqs[haplogroups == hg] = c
        freqs = freqs / freqs.sum()
        frequencies.loc[pop, :] = freqs
    return frequencies


def bootstrap_results(ancestry_proportions_haplogroups, n_resamples=10000):
    """
    Bootstrap individuals to compute confidence intervals for summary statistics (mean ancestry proportions,
    and sex ratios)
    :param ancestry_proportions_haplogroups: pd.DataFrame, individual level autosomal and X chromosomal ancestry
                                                           proportions as well as mtDNA and Y chromosome haplogroups
    :param n_resamples: int, number of resampling steps to perform
    :return: pd.DataFrame, Mean summary statistics with confidence intervals
    """
    ancestries = [col.split('_')[0] for col in ancestry_proportions_haplogroups.columns if col.endswith('A')]
    populations = ancestry_proportions_haplogroups.loc[:, "pop"].unique()
    sf_sm = lambda x, a: (3 * x - 2 * a) / (4 * a - 3 * x)

    bootstrapped_results = []
    for pop in populations:
        c_results = ancestry_proportions_haplogroups.loc[ancestry_proportions_haplogroups.loc[:, "pop"] == pop]
        sample_names = c_results.index.values
        if n_resamples is None:
            c_resamples = min([c_results.shape[0], 10000])
        iterations_df = pd.DataFrame(index=np.arange(0, c_resamples))
        for i in range(c_resamples):
            resampled = resample(sample_names)
            y_haplogroups = c_results.loc[resampled, "haplogroup_Y"].copy().dropna()
            mt_haplogroups = c_results.loc[resampled, 'haplogroup_mt']
            resampled_autosomes = c_results.loc[resampled, [f"{anc}_A" for anc in ancestries]]
            resampled_x = c_results.loc[resampled, [f"{anc}_X" for anc in ancestries]]
            for anc in ancestries:
                mt_haplo_freq = mt_haplogroups[mt_haplogroups == anc].shape[0] / mt_haplogroups.shape[0]
                y_haplo_freq = y_haplogroups[y_haplogroups == anc].shape[0] / y_haplogroups.shape[0]
                try:
                    haplogroup_imbalance = mt_haplo_freq / y_haplo_freq
                except ZeroDivisionError:
                    haplogroup_imbalance = np.inf
                mean_autosomal_prop = resampled_autosomes.loc[:, f"{anc}_A"].mean()
                mean_x_prop = resampled_x.loc[:, f"{anc}_X"].mean()
                sex_ratio = sf_sm(mean_x_prop, mean_autosomal_prop)
                iterations_df.loc[i, f'{anc}_A'] = mean_autosomal_prop
                iterations_df.loc[i, f'{anc}_X'] = mean_x_prop
                iterations_df.loc[i, f'{anc}_sf_sm'] = sex_ratio
                iterations_df.loc[i, f'{anc}_mt'] = mt_haplo_freq
                iterations_df.loc[i, f'{anc}_Y'] = y_haplo_freq
                iterations_df.loc[i, f'{anc}_mt_Y'] = haplogroup_imbalance

        mean_vals = iterations_df.mean(axis=0)
        mean_vals.index = [f"mean_{ind}" for ind in mean_vals.index.values]
        ci_low = iterations_df.quantile(0.025, axis=0)
        ci_low.index = [f"ci_low_{ind}" for ind in ci_low.index.values]
        ci_up = iterations_df.quantile(0.975, axis=0)
        ci_up.index = [f"ci_up_{ind}" for ind in ci_up.index.values]

        summary_df = pd.concat([ci_low, mean_vals, ci_up]).to_frame(pop)
        bootstrapped_results.append(summary_df.T)
    bootstrapped_results = pd.concat(bootstrapped_results)
    return bootstrapped_results


def main(argv):
    parser = argparse.ArgumentParser(description="Summarize simulation results, i.e., sex ratios, "
                                                 "admixture proportions, haplogroup frequencies, "
                                                 "and haplogroup imbalances including confidence intervals.")
    parser.add_argument('-qa', '--admixture_results_autosomes', nargs='+', help='.Q file from Admixture for autosomes')
    parser.add_argument('-qx', '--admixture_results_chrx', nargs='+', help='.Q file from Admixture for X chromosome')
    parser.add_argument('-hm', '--haplogroups_mtDNA', nargs='+', help='Haplogroup assignments mtDNA')
    parser.add_argument('-hy', '--haplogroups_Y', nargs='+', help='Haplogroup assignments Y chromosomes')
    parser.add_argument('-f', '--fam_file', nargs='+', help='*.fam file generated by plink to get individual IDs')
    parser.add_argument('-p', '--plot_dir', help='Directory where to save plots [./plots/]', default='./plots/')
    parser.add_argument('-rax', '--results_a_x', nargs='+', help='File to write autosomal and X chromosomal '
                                                                 'ancestry proportions to')
    parser.add_argument('-rmy', '--results_mt_y', nargs='+', help='File to write mtDNA and Y chromosomal haplogroup '
                                                                  'frequencies to')
    parser.add_argument('-b', '--bootstraps', default=None, help='Number of bootstrap resamples. Default is the '
                                                                 'number of samples and max. 10000', type=int)
    parser.add_argument('-rb', '--results_bootstrapped', nargs="+", help='File to write bootstrapped summary '
                                                                         'statistics to')
    parser.add_argument('--log', help='Log file [results/summarizing_results.log]',
                        default='results/summarizing_results.log')
    args = parser.parse_args()
    logging.basicConfig(filename=args.log, level=logging.INFO,
                        format='%(asctime)s - %(levelname)s - %(message)s', datefmt='%H:%M:%S')
    logger = logging.getLogger()
    if args.plot_dir.endswith('/'):
        plot_dir = args.plot_dir
    else:
        plot_dir = args.plot_dir + '/'
    if not os.path.isdir(plot_dir):
        os.makedirs(plot_dir)
    results_dir_hg_freq = '/'.join(args.results_mt_y[0].split('/')[:-1])
    if not os.path.isdir(results_dir_hg_freq):
        os.makedirs(results_dir_hg_freq)
    results_dir_ap_freq = '/'.join(args.results_a_x[0].split('/')[:-1])
    if not os.path.isdir(results_dir_ap_freq):
        os.makedirs(results_dir_ap_freq)
    for admixture_results_autosomes, admixture_results_chrx, fam_file, \
        haplogroups_mtDNA, haplogroups_Y, results_a_x,\
        results_mt_y, results_bootstrapped in zip(args.admixture_results_autosomes,
                                                  args.admixture_results_chrx,
                                                  args.fam_file,
                                                  args.haplogroups_mtDNA,
                                                  args.haplogroups_Y,
                                                  args.results_a_x,
                                                  args.results_mt_y,
                                                  args.results_bootstrapped):
        logging.info('Parsing results {}'.format(admixture_results_autosomes.split('_chr1.3.Q')[0]))
        admixture_autosomes, admixture_x = read_results(admixture_results_autosomes, admixture_results_chrx, fam_file)
        haplogroups_mtDNA = pd.read_csv(haplogroups_mtDNA, sep='\t', index_col=0)
        haplogroups_Y = pd.read_csv(haplogroups_Y, sep='\t', index_col=0)
        populations_mtDNA = np.array([re.sub('[0-9]*', '', ind) for ind in haplogroups_mtDNA.index.values])
        populations_Y = np.array([re.sub('[0-9]*', '', ind) for ind in haplogroups_Y.index.values])
        haplogroups_mtDNA.loc[:, 'population'] = populations_mtDNA
        haplogroups_Y.loc[:, 'population'] = populations_Y
        admixture_autosomes = assign_k_to_ancestry(admixture_autosomes)
        admixture_x = assign_k_to_ancestry(admixture_x)
        # remove individuals that did not pass filtering
        logging.info('Filtering individuals')
        # filter individuals based on average ancestry_proportion 
        # assumes equal length of autosomes and X chromosome
        #average_ancestry_proportions = admixture_autosomes.copy()
        #average_ancestry_proportions.loc[:, ["afr", "eur", "ea"]] += admixture_x.loc[:, ["afr", "eur", "ea"]]
        #average_ancestry_proportions.loc[:, ["afr", "eur", "ea"]] /= 2
        individuals_to_remove = filter_individuals(admixture_autosomes)
        admixture_autosomes.drop(individuals_to_remove, inplace=True)
        admixture_x.drop(individuals_to_remove, inplace=True)
        haplogroups_mtDNA.drop(individuals_to_remove, inplace=True)
        haplogroups_Y.drop(individuals_to_remove[np.isin(individuals_to_remove, haplogroups_Y.index)], inplace=True)

        # visualize ADMIXTURE results
        logging.info('Visualizing ADMIXTURE results')
        plot_admixture_results(admixture_autosomes,
                               f'{plot_dir}{admixture_results_autosomes.split("/")[-1].replace(".3.Q", "_admixture.png")}')
        plot_admixture_results(admixture_x,
                               f'{plot_dir}{admixture_results_chrx.split("/")[-1].replace(".3.Q", "_admixture.png")}')
        plot_autosome_proportions_vs_x_proportions(admixture_autosomes, admixture_x,
                                                   f'{plot_dir}{admixture_results_chrx.split("/")[-1].replace(".3.Q", "")}')
        plot_histogram_of_ancestry_proportions(admixture_autosomes, admixture_x,
                                               f'{plot_dir}{admixture_results_chrx.split("/")[-1].replace(".3.Q", "")}')
        # join results
        ancestry_proportions = admixture_autosomes.join(admixture_x.drop(["pop", "sex"], axis=1),
                                                        lsuffix='_A', rsuffix='_X')
        logging.info('Computing mean ancestry proportions')
        # take average per populations
        mean_ancestry_proportions = ancestry_proportions.groupby('pop').mean()
        logging.info(f'Writing mean ancestry proportions to {results_a_x}')
        # write results
        write_mean_ancestry_proportions(mean_ancestry_proportions, results_a_x)
        logging.info("Computing mean haplogroup frequencies")
        # compute haplogroup frequencies
        freq_mtDNA = compute_haplogroup_frequencies(haplogroups_mtDNA)
        freq_Y = compute_haplogroup_frequencies(haplogroups_Y)
        # merge
        haplogroup_frequencies = freq_Y.join(freq_mtDNA, lsuffix='_Y', rsuffix='_MT')
        columns = {col: col.replace("ea", 'na').upper() for col in haplogroup_frequencies.columns}
        haplogroup_frequencies.rename(columns=columns, inplace=True)
        logging.info(f"Writing mean haplogroup frequencies to {results_mt_y}")
        # save
        haplogroup_frequencies.to_csv(results_mt_y, header=True, index=True)
        # bootstrap results
        logging.info("Bootstrapping results")
        haplogroups = haplogroups_mtDNA.drop("population", axis=1).join(haplogroups_Y.drop("population", axis=1),
                                                                        lsuffix='_mt', rsuffix='_Y')
        ancestry_prop_and_haplogroups = ancestry_proportions.join(haplogroups)
        bootstrapped_results = bootstrap_results(ancestry_prop_and_haplogroups, args.bootstraps)
        logging.info(f"Writing bootstrap results to {results_bootstrapped}")
        bootstrapped_results.to_csv(results_bootstrapped, sep='\t', header=True, index=True)


if __name__ == '__main__':
    main(sys.argv[1:])

Python Pandas numpy matplotlib pLink sklearn From line 2 of scripts/summarize_results.py

shell:
    "scripts/afr_eur_ea_populations.py -r {recombination_rate} --recipe {params.recipe} " \
    " -l {chromosome_length} --mtDNA_length {mtDNA_length} -o {output}"

SnakeMake From line 48 of main/Snakefile

shell:
    "scripts/american_admixture.py -r {recombination_rate} --recipe {config[single_pulse_admixture_recipe]} " \
    " -l {chromosome_length} --mtDNA_length {mtDNA_length} --afr {afr} --eur {eur} --ea {ea} --sm_afr {sm_afr} "\
    "--sm_eur {sm_eur} --sm_ea {sm_ea} --N_amr {N_amr} --tree_sequences {input} "
    "--amr_growth_rate {config[amr_growth_rate]}"

SnakeMake From line 65 of main/Snakefile

shell:
    "scripts/american_admixture.py -r {recombination_rate} --recipe {config[single_pulse_admixture_non_random_mating_recipe]} " \
    " -l {chromosome_length} --mtDNA_length {mtDNA_length} --afr {afr} --eur {eur} --ea {ea} --sm_afr {sm_afr} "\
    "--sm_eur {sm_eur} --sm_ea {sm_ea} --N_amr {N_amr} --tree_sequences {input} "
    "--amr_growth_rate {config[amr_growth_rate]} "
    "--rejection_rate_female_eur_male_non_eur {config[rejection_rate_female_eur_male_non_eur]}"

SnakeMake From line 78 of main/Snakefile

shell:
    "scripts/american_admixture.py -r {recombination_rate} --recipe {config[constant_admixture_recipe]} " \
    " -l {chromosome_length} --mtDNA_length {mtDNA_length} --afr {afr} --eur {eur} --ea {ea} --sm_afr {sm_afr} "\
    "--sm_eur {sm_eur} --sm_ea {sm_ea} --N_amr {N_amr} --tree_sequences {input} "
    "--afr_const {config[afr_const]} --eur_const {config[eur_const]} "\
    "--ea_const {config[ea_const]} --sm_afr_const {config[sm_afr_const]} --sm_eur_const {config[sm_eur_const]} "\
    "--sm_ea_const {config[sm_ea_const]} --amr_growth_rate {config[amr_growth_rate]}"

SnakeMake From line 92 of main/Snakefile

shell:
    "scripts/american_admixture.py -r {recombination_rate} --recipe {config[constant_admixture_non_random_mating_recipe]} " \
    " -l {chromosome_length} --mtDNA_length {mtDNA_length} --afr {afr} --eur {eur} --ea {ea} --sm_afr {sm_afr} "\
    "--sm_eur {sm_eur} --sm_ea {sm_ea} --N_amr {N_amr} --tree_sequences {input} "\
    "--amr_growth_rate {config[amr_growth_rate]} --afr_const {config[afr_const]} --eur_const {config[eur_const]} "\
    "--ea_const {config[ea_const]} --sm_afr_const {config[sm_afr_const]} --sm_eur_const {config[sm_eur_const]} "\
    "--sm_ea_const {config[sm_ea_const]} "
    "--rejection_rate_female_eur_male_non_eur {config[rejection_rate_female_eur_male_non_eur]}"

SnakeMake From line 107 of main/Snakefile

shell:
    "scripts/add_mutations_to_tree_sequence.py -N {config[ancestral_Ne]} -m {mutation_rate} "\
    "--mtDNA_mutation_rate {mtDNA_mutation_rate} -o {output.ts} --output_vcf_dir {data_dir} "
    "-l {chromosome_length} --mtDNA_length {mtDNA_length} --populations {populations} --log {log} "\
    "--tree_sequences {input} --threads {threads} --sample_size {config[sample_size]} --plot_sfs "
    "--recapitate"

SnakeMake From line 146 of main/Snakefile

shell:
    "for vcf_file in {input}; "
    "do "
    "bcftools norm -m - --threads {threads} ${{vcf_file}} | bcftools annotate --threads {threads} "
    "--set-id +'%CHROM\_%POS\_%REF\_%FIRST_ALT' -Oz -o $( echo ${{vcf_file}} | sed 's/_filtered//' ).gz; "
    "tabix ${{vcf_file}}.gz; "
    "done"

SnakeMake BCFtools tabix From line 258 of main/Snakefile

shell:
   "scripts/ld_pruning_plink2.py --threads {threads} --vcf {input.vcf} --output_base {params.out} --sex {input.sex}"

SnakeMake From line 333 of main/Snakefile

shell:
    "scripts/convert_vcf_to_plink.py --threads {threads} --vcf {input.vcf} --sex {input.sex} "
    "--output_base {params.out} --pruned {input.pruned}"

SnakeMake From line 438 of main/Snakefile

shell:
    "scripts/convert_vcf_to_plink.py --threads {threads} --vcf {input.vcf} --sex {input.sex} "
    "--output_base {params.out}"

SnakeMake From line 547 of main/Snakefile

shell:
    "scripts/classify_mtDNA_Y_haplogroups.py --eigenvec {input} --log {log} --output_file {output}"

SnakeMake From line 647 of main/Snakefile

shell:
    "scripts/plot_pca.py --eigenvec {input.eigenvec_1} {input.eigenvec_x} --plot_dir {plot_dir}"

SnakeMake From line 714 of main/Snakefile

shell:
    "admixture --supervised -j{threads} {input.bed} 3; mv *3.Q {data_dir}; mv *3.P {data_dir}"

SnakeMake From line 757 of main/Snakefile

shell:
    "admixture --supervised --haploid=\"male:23\" {input.bed} 3; "
    "mv *3.Q {data_dir}; mv *3.P {data_dir}"

SnakeMake From line 801 of main/Snakefile

shell:
    "scripts/summarize_results.py -hy {input.chrY} -hm {input.mtDNA} "
    "-qa {input.autosome_q} -qx {input.x_q} -f {input.fam} -rax {output.autosomes_x} "
    "-rmy {output.haplogroups} -rb {output.bootstrapped} -p {plot_dir} --log {log}"