African Demography Review
This repository provides a Snakemake workflow used for the data analysis presented in Pfennig et al., Evolutionary genetics and admixture in African populations, Genome Biology and Evolution, 2023; https://doi.org/10.1093/gbe/evad054.
Datasets
We combine and harmonize genomic datasets from:
- 20 European, Middle Eastern, and African populations from the Simons Genome Diversity Project (SGDP) [1]
- 36 African populations from Crawford et al. (2017) [2] and Scheinfeldt et al. (2019) [3] that were genotyped on the Illumina 1M array (these datasets were merged by Matt Hansen, unpublished work, and kindly provided to us)
- 16 Sudanese populations from Hollfelder et al. (2017) [4]
- 8 North African populations from Arauna et al. (2017) [5]
- 14 Sahelian populations from Fortes-Lima et al. (2022) [6]
- 10 Southern African populations from Schlebusch et al. (2012) [7]
Populations with more than 2 samples are downsampled to 2 individuals to avoid effects of variable sample sizes. If a population was included in more than one dataset, only samples from one dataset were retained for that population. We merge the datasets on biallelic common SNPs. Conflicting SNPs are strand-flipped once, and SNPs that remain in conflict are excluded. A minor allele frequency threshold of 0.05, a per-SNP missing genotype rate threshold of 0.1, and LD pruning (max r2=0.05) were applied (these parameters can be changed in config/config.yaml). Ancestry proportions were estimated using ADMIXTURE [8] for different values of K. For visualization, the ADMIXTURE results are aligned across all values of K using CLUMPAK [9] prior to plotting (Figure 2). Ancestry proportions for the best K were then interpolated in space using the ordinary Kriging method, and effective migration rates were estimated using FEEMS [10] (Figure 3).
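For orientation, the relevant entries in config/config.yaml might look like the following sketch. The key names mirror the {params.*} placeholders in the rule snippets further below, but they are illustrative, not verbatim; consult the file shipped with the repository for the exact schema:

```yaml
# Illustrative excerpt of config/config.yaml -- key names are assumptions, not verbatim
bcftools: /path/to/bcftools   # path to bcftools v1.13 (see Software dependencies)
maf: 0.05                     # minor allele frequency threshold
geno: 0.1                     # maximum per-SNP missing genotype rate
max_r2: 0.05                  # r^2 cutoff for LD pruning
```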
Software dependencies
- bcftools v1.13 (path can be updated in config/config.yaml)
- All other software is provided through conda environments, which are automatically created and used by Snakemake.
Data dependencies
- The datasets from Crawford et al. [2] (dbGaP: phs001396.v1.p1) and Scheinfeldt et al. [3] (dbGaP: phs001780.v1.p1) need to be manually downloaded, merged, and placed in the data directory (see the sketch after this list). The plink prefix of the files can be updated in config/config.yaml. We provide the coordinates of the populations used in our analysis in data/coordinates_crawford_et_al_and_scheinfeldt_et_al.txt. Importantly, we only used samples that were genotyped on the Illumina 1M array.
- The dataset from Fortes-Lima et al. [6] needs to be downloaded separately from https://www.ebi.ac.uk/biostudies/arrayexpress/studies/E-MTAB-12243?query=E-MTAB-12243 (the dataset was kindly made available by the authors prior to publication). Population coordinates were extracted from Table S2 of the corresponding publication and are stored in data/coordinates_fortes_lima_et_al.txt. The plink prefix can be updated in config/config.yaml.
- Geographic coordinates of the 18 Sudanese populations from Hollfelder et al. (2017) [4] were kindly made available by the original authors. We provide the coordinates in data/coordinates_hollfelder_et_al.txt.
- If city information was available for populations in Arauna et al. (2017) [5], the coordinates of the city were used. For the populations from Libya and Egypt, the center of the respective country was taken (see data/coordinates_arauna_et_al.txt). The data file was kindly made available by the original authors.
- All other required data is retrieved automatically.
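As a sketch, once the manual downloads are in place the data directory could look like this. Only the coordinate file names are fixed by the repository; the plink prefixes are placeholders and must match the entries in config/config.yaml:

```
data/
├── coordinates_crawford_et_al_and_scheinfeldt_et_al.txt   # shipped with the repository
├── coordinates_fortes_lima_et_al.txt                      # shipped with the repository
├── coordinates_hollfelder_et_al.txt                       # shipped with the repository
├── coordinates_arauna_et_al.txt                           # shipped with the repository
├── crawford_scheinfeldt.{bed,bim,fam}                     # placeholder prefix, set in config/config.yaml
└── fortes_lima.{bed,bim,fam}                              # placeholder prefix, set in config/config.yaml
```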
Comments
The Kriging plot of the admixture results currently only works for K<=4, as we observed the lowest CV error at K=4. To compare CV errors across the different values of K, run:
grep CV {results_path}/{prefix}.*.log
(the CV error is written to the per-K ADMIXTURE log files, not the .Q files).
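Each ADMIXTURE log contains a single line of this form; the error values below are invented for illustration:

```
CV error (K=3): 0.48213
CV error (K=4): 0.47102
CV error (K=5): 0.47518
```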
As of now, this Snakemake workflow is not written in a general way that allows downloading and merging data from arbitrary datasets; it only supports those mentioned above.
Executing the Snakemake workflow
After ensuring that the software and data requirements are satisfied, the Snakemake workflow can be executed using the following command:
snakemake -c<threads> --use-conda
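For example, to check the execution plan first and then run on 8 cores (the core count is arbitrary):

```bash
snakemake -n --use-conda    # dry run: list the jobs that would be executed
snakemake -c8 --use-conda   # run the workflow with 8 cores and conda-managed environments
```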
Citation
Pfennig et al., Evolutionary genetics and admixture in African populations, Genome Biology and Evolution, 2023; https://doi.org/10.1093/gbe/evad054
References
1. Mallick S, Li H, Lipson M, et al. (2016) The Simons Genome Diversity Project: 300 genomes from 142 diverse populations. Nature 538:201–206. https://doi.org/10.1038/nature18964
2. Crawford NG, Kelly DE, Hansen MEB, et al. (2017) Loci associated with skin pigmentation identified in African populations. Science 358:eaan8433. https://doi.org/10.1126/science.aan8433
3. Scheinfeldt LB, et al. (2019) Genomic evidence for shared common ancestry of East African hunting-gathering populations and insights into local adaptation. Proceedings of the National Academy of Sciences 116:4166–4175. https://doi.org/10.1073/pnas.1817678116
4. Hollfelder N, Schlebusch CM, Günther T, et al. (2017) Northeast African genomic variation shaped by the continuity of indigenous groups and Eurasian migrations. PLoS Genetics 13(8):e1006976. https://doi.org/10.1371/journal.pgen.1006976
5. Arauna LR, Mendoza-Revilla J, Mas-Sandoval A, et al. (2017) Recent historical migrations have shaped the gene pool of Arabs and Berbers in North Africa. Molecular Biology and Evolution 34(2). https://doi.org/10.1093/molbev/msw218
6. Fortes-Lima C, Tříska P, Čížková M, et al. (2022) Demographic and selection histories of populations across the Sahel/Savannah belt. Molecular Biology and Evolution 39(10):msac209. https://doi.org/10.1093/molbev/msac209
7. Schlebusch CM, et al. (2012) Genomic variation in seven Khoe-San groups reveals adaptation and complex African history. Science 338:374–379. https://doi.org/10.1126/science.1227721
8. Alexander DH, Novembre J, Lange K (2009) Fast model-based estimation of ancestry in unrelated individuals. Genome Research 19:1655–1664. https://doi.org/10.1101/gr.094052.109
9. Kopelman NM, Mayzel J, Jakobsson M, Rosenberg NA, Mayrose I (2015) CLUMPAK: a program for identifying clustering modes and packaging population structure inferences across K. Molecular Ecology Resources 15(5):1179–1191. https://doi.org/10.1111/1755-0998.12387
10. Marcus J, Ha W, Barber RF, Novembre J (2021) Fast and flexible estimation of effective migration surfaces. eLife 10:e61927. https://doi.org/10.7554/eLife.61927
Code Snippets
Lines 2–66 of scripts/plot_pca.py (the PCA plotting script called by the rule further below):

```python
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D
import sys
import argparse
import pandas as pd
import numpy as np
import os


### This is for purely explorative purposes. The labelling is roughly done on geographic coordinates, but

def visualize_pcs(data, outfile, pc_a=1, pc_b=2):
    """
    Visualize PCA done with plink
    :param data: df, DataFrame with eigenvectors and geographic coordinates
    :param outfile: str, where to save figure to
    :param pc_a: int, first PC to plot
    :param pc_b: int, second PC to plot
    """
    np.random.seed(42)
    # assign rough regional labels based on geographic coordinates
    colors = np.where((data.Longitude >= -16.7) & (data.Longitude <= 17) &
                      (data.Latitude >= 2.5) & (data.Latitude <= 20), 'green', 0)
    colors = np.where((data.Longitude > 17) & (data.Latitude >= -5.0) & (data.Latitude <= 20), 'blue', colors)
    colors = np.where((data.Latitude <= -5), 'orange', colors)
    colors = np.where((data.Latitude > 15) & (data.Longitude >= 35), 'pink', colors)
    colors = np.where((data.Latitude > 27) & (data.Longitude < 15), 'red', colors)
    color_dict = {'Middle East': 'pink', "West Africa": "green", "East Africa": "blue",
                  "South Africa": "orange", "Europe": "red"}
    data['colors'] = colors
    fig, ax = plt.subplots()
    ax.scatter(data.loc[:, f'PC{pc_a}'], data.loc[:, f'PC{pc_b}'].values, c=data.colors, alpha=0.5)
    ax.set_xlabel(f'PC{pc_a}')
    ax.set_ylabel(f'PC{pc_b}')
    legend_elements = [Line2D([0], [0], color=col, marker='o', ls='', label=pop)
                       for pop, col in color_dict.items()]
    ax.legend(handles=legend_elements, loc='upper center', ncol=4, bbox_to_anchor=(0.5, -.14))
    fig.savefig(outfile, bbox_inches='tight')
    plt.close()


def main(argv):
    parser = argparse.ArgumentParser(description='Visualize PCA performed with plink2')
    parser.add_argument('--eigenvec', help='.eigenvec file that was generated by plink.')
    parser.add_argument('--coords', help='file containing geographic coordinates of sample locations.')
    parser.add_argument('--plot_dir', help='Directory where to save plots')
    args = parser.parse_args()
    if args.plot_dir.endswith('/'):
        plot_dir = args.plot_dir
    else:
        plot_dir = args.plot_dir + '/'
    if not os.path.isdir(plot_dir):
        os.makedirs(plot_dir)
    eigenvectors = pd.read_csv(args.eigenvec, sep='\t')
    eigenvectors.set_index('IID', inplace=True)
    coordinates = pd.read_csv(args.coords, sep='\t', header=None,
                              names=['FID', "IID", "Latitude", "Longitude"])
    # join eigenvectors with geographic coordinates
    data = eigenvectors.join(coordinates.set_index('IID'))
    outfile = plot_dir + args.eigenvec.split('/')[-1].split('.eigenvec')[0] + "_pca.png"
    visualize_pcs(data, outfile)


if __name__ == '__main__':
    main(sys.argv[1:])
```
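A hypothetical command line, inferred from the argparse options above (all paths are placeholders; the workflow invokes the script via the rule shown further below):

```bash
scripts/plot_pca.py --eigenvec results/merged.eigenvec \
    --coords data/population_geo_coords.tab \
    --plot_dir results/plots/
```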
Lines 7–124 of scripts/run_feems.py:

```python
import numpy as np
import pkg_resources
from sklearn.impute import SimpleImputer
from pandas_plink import read_plink
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
from feems.utils import prepare_graph_inputs
from feems import SpatialGraph, Viz
from feems.cross_validation import run_cv
import sys
import argparse
import pandas as pd

# change matplotlib fonts
plt.rcParams["font.family"] = "Arial"
plt.rcParams["font.sans-serif"] = "Arial"


def create_spatial_graph(data_path, prefix, coords_file):
    """
    Create SpatialGraph object
    :param data_path: str, data directory
    :param prefix: str, prefix for plink files in data directory
    :param coords_file: str, path to file with sampling coordinates
    :return: SpatialGraph
    """
    data_path_feems = pkg_resources.resource_filename("feems", "data/")
    # read the genotype data and mean impute missing data
    (bim, fam, G) = read_plink("{}/{}".format(data_path, prefix))
    imp = SimpleImputer(missing_values=np.nan, strategy="mean")
    genotypes = imp.fit_transform((np.array(G)).T)
    # set up graph
    coords = pd.read_csv(coords_file, sep='\t', header=None,
                         names=['fid_c', 'iid_c', 'latitude', 'longitude'])
    coords = fam.set_index('iid').join(coords.set_index('iid_c')).loc[:, ["longitude", "latitude"]].values
    grid_path = "{}/grid_100.shp".format(data_path_feems)  # path to discrete global grid
    # graph input files
    _, edges, grid, _ = prepare_graph_inputs(coord=coords, ggrid=grid_path, translated=False, buffer=0)
    # construct spatial graph object
    sp_graph = SpatialGraph(genotypes, coords, grid, edges, scale_snps=True)
    return sp_graph


def cross_validate(sp_graph):
    """
    Perform cross-validation to find the optimal lambda
    :param sp_graph: SpatialGraph
    :return: float, lambda
    """
    # reverse the order of lambdas and alphas for warmstart
    lamb_grid = np.geomspace(1e-6, 1e2, 20)[::-1]
    # run cross-validation
    cv_err = run_cv(sp_graph, lamb_grid, n_folds=sp_graph.n_observed_nodes, factr=1e10)
    # average over folds
    mean_cv_err = np.mean(cv_err, axis=0)
    # argmin of cv error
    lamb_cv = float(lamb_grid[np.argmin(mean_cv_err)])
    return lamb_cv


def fit_final_graph(sp_graph, lamb_cv, prefix, output_path):
    """
    Fit final SpatialGraph using the optimal lambda inferred from cross-validation
    :param sp_graph: SpatialGraph
    :param lamb_cv: float, lambda
    :param prefix: str, prefix for the output figure name
    :param output_path: str, directory where to save the output figure named {prefix}_feems_plot.pdf
    """
    sp_graph.fit(lamb_cv)
    projection = ccrs.PlateCarree()
    # plot the FEEMS result
    fig = plt.figure(figsize=(9, 16), dpi=600)
    ax = fig.add_subplot(1, 1, 1, projection=projection)
    ax.set_extent([-25, 60, -40, 50], crs=ccrs.PlateCarree())
    v = Viz(ax, sp_graph, projection=projection, edge_width=.5, edge_alpha=1, edge_zorder=100,
            sample_pt_size=20, obs_node_size=7.5, sample_pt_color="black",
            cbar_font_size=5, cbar_ticklabelsize=4)
    v.draw_map()
    v.draw_edges(use_weights=True)
    v.draw_obs_nodes(use_ids=False)
    v.draw_edge_colorbar()
    fig.savefig(output_path + prefix + '_feems_plot.pdf', bbox_inches='tight', dpi=600)


def main(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument('-d', '--data_path', help='Data directory. default=./data/', default='./data/')
    parser.add_argument('-p', '--prefix', help='Prefix of plink files')
    parser.add_argument('-c', '--coords', help='Path to file with coordinates of samples. '
                                               'default=data/population_geo_coords.tab',
                        default='data/population_geo_coords.tab')
    parser.add_argument('-o', '--output_path', help='Output directory. Will save a figure called '
                                                    '{prefix}_feems_plot.pdf in this directory. '
                                                    'default=./results/', default='./results/')
    args = parser.parse_args()
    data_path = args.data_path
    prefix = args.prefix
    coords = args.coords
    output_path = args.output_path
    if not data_path.endswith('/'):
        data_path += '/'
    if not output_path.endswith('/'):
        output_path += '/'
    sp_graph = create_spatial_graph(data_path, prefix, coords)
    lamb_cv = cross_validate(sp_graph)
    fit_final_graph(sp_graph, lamb_cv, prefix, output_path)


if __name__ == '__main__':
    main(sys.argv[1:])
```
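A purely illustrative invocation; the plink prefix "merged" is a placeholder, and the defaults correspond to the argparse setup above:

```bash
scripts/run_feems.py -d data/ -p merged -c data/population_geo_coords.tab -o results/
```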
Lines 2–403 of scripts/visualize_admixture.py:

```python
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import glob
import sys
import argparse
import seaborn as sns
import numpy.ma as ma
from pykrige.ok import OrdinaryKriging
import cartopy.crs as ccrs
import cartopy.feature as cfeature
import cartopy.io.shapereader as shpreader
import shapely.geometry as sgeom
from shapely.ops import unary_union
from shapely.prepared import prep
from itertools import permutations


def is_land(x, y, land):
    """
    Determine whether a given point is on land or not
    :param x: float, longitude
    :param y: float, latitude
    :param land: prepared land geometry
    :return: boolean
    """
    return land.contains(sgeom.Point(x, y))


def build_mask(x, y, land):
    """
    Create a mask of points that are on land and should be plotted
    :param x: np.array, longitude values
    :param y: np.array, latitude values
    :param land: prepared land geometry
    :return: np.array, boolean
    """
    # mesh
    xv, yv = np.meshgrid(x, y, indexing='xy')
    # initialize mask
    mask = np.zeros(xv.shape, dtype=bool)
    for i in range(x.shape[0]):
        for j in range(y.shape[0]):
            # check if point is land
            mask[j, i] = is_land(xv[j, i], yv[j, i], land)
    return mask


def sort_individuals(df, K):
    """
    Sort individuals by modal ancestry component and its intensity
    :param df: pd.DataFrame, data frame containing ancestry proportions
    :param K: int, K used to run admixture
    :return: pd.DataFrame, sorted df
    """
    # get mean ancestry per component for each population
    mean_ancestry_pops = df.groupby('population').mean()
    # identify modal component for each population, i.e., the component with the highest ancestry proportion
    modal_clusters_pops = mean_ancestry_pops.idxmax(axis=1).sort_values()
    individual_orders = []
    # iterate over all k
    for i in range(K):
        # get all populations for which the current component is modal
        pops = modal_clusters_pops[modal_clusters_pops == i].index.values.tolist()
        # sort the populations by their mean ancestry proportion of the modal component
        pops_sorted = mean_ancestry_pops.loc[pops, i].sort_values(ascending=False).index.values
        inds = []
        # sort the individuals within a population by ancestry proportion
        for pop in pops_sorted:
            inds.extend(df.loc[df.population == pop, i].sort_values(ascending=False).index.values.tolist())
        individual_orders.extend(inds)
    df = df.loc[individual_orders, :]
    return df


def get_best_k_and_plot_best_k(qfiles, Kvals, fam, log_dir, colors, prefix):
    """
    Find best K across different ADMIXTURE runs using different K and plot admixture plot of best K separately
    :param qfiles: list, list of paths to aligned .Q files
    :param Kvals: list-like, K values that were used
    :param fam: df, pd.DataFrame of plink .fam file
    :param log_dir: str, path to directory with log files of ADMIXTURE runs
    :param colors: list, colors used for plotting
    :param prefix: str, prefix of output figure; "_admixture_plot_best_K.pdf" will be appended
    :return: array-like, cross-validation error per K
    """
    min_k = min(Kvals)
    # get CV error of ADMIXTURE runs
    cv_error = np.zeros(len(Kvals))
    logfiles = glob.glob(log_dir + '*.log')
    # read CV error
    for log in logfiles:
        k = int(log.split('.')[-2])
        with open(log, 'r') as fp:
            for line in fp:
                if line.startswith("CV error"):
                    cv_error[k - min_k] = float(line.split(':')[1])
                    break
    # get best k
    best_k = Kvals[np.argmin(cv_error)]
    df = pd.read_csv(qfiles[np.argmin(cv_error)], sep=' ', header=None, index_col=None)
    df = df.iloc[:, 6:]
    df.columns = [i for i in range(best_k)]
    df['population'] = fam.fid.values
    # sort individuals
    df_sorted = sort_individuals(df, best_k)
    plot_best_k(df_sorted, best_k, min(cv_error), colors, prefix)
    return cv_error


def sort_individuals_specified_K(qfile, K, fam):
    """
    Sort individuals based on estimated ancestry proportions given a K
    :param qfile: str, path to aligned Q file
    :param K: int, K value
    :param fam: df, pd.DataFrame of plink .fam file
    :return: array-like, order of individuals
    """
    df = pd.read_csv(qfile, sep=' ', header=None, index_col=None)
    df = df.iloc[:, 6:]
    df.columns = [i for i in range(K)]
    df['population'] = fam.fid.values
    # sort individuals and get their order
    df_sorted = sort_individuals(df, K)
    individual_order = df_sorted.index.values
    return individual_order


def plot_krigin(qfile, K, path_to_coords, vmin, fam, colors, prefix):
    """
    Interpolate admixture proportions in space using the Kriging method
    :param qfile: str, path to qfile corresponding to K
    :param K: int, K used for admixture run
    :param path_to_coords: str, path to file with population coordinates
    :param vmin: float, minimum admixture proportion required for Kriging method
    :param fam: df, pd.DataFrame of plink .fam file
    :param colors: list, colors used for plotting
    :param prefix: str, prefix for figure
    """
    # The parameters are tailored to our specific case and may not give meaningful results in other cases
    if K > 4:
        raise AttributeError("Only K<=4 is supported")
    land_shp_fname = shpreader.natural_earth(resolution='50m', category='physical', name='land')
    land_geom = unary_union(list(shpreader.Reader(land_shp_fname).geometries()))
    # create land object
    land = prep(land_geom)
    # read data
    coords = pd.read_csv(path_to_coords, sep='\t', header=None, names=['fid', 'lat', 'long'], usecols=[0, 2, 3])
    coords.drop_duplicates('fid', inplace=True)
    df = pd.read_csv(qfile, sep=' ', header=None, index_col=None)
    df = df.iloc[:, 6:]
    df.columns = [i for i in range(K)]
    df['population'] = fam.fid.values
    df = df.join(coords.set_index('fid'), on='population')
    # longitude and latitude coords of interest
    long_coords = np.arange(-25, 60, 0.15, dtype=float)
    lat_coords = np.arange(-40, 50, 0.15, dtype=float)
    # color maps
    cmaps = [sns.color_palette(f"blend:white,{colors[0]}", as_cmap=True),
             sns.color_palette(f"blend:white,{colors[1]}", as_cmap=True),
             sns.color_palette(f"blend:white,{colors[2]}", as_cmap=True),
             sns.color_palette(f"blend:white,{colors[3]}", as_cmap=True)]
    projection = ccrs.PlateCarree()
    fig = plt.figure(figsize=(9, 16))
    ax = fig.add_subplot(1, 1, 1, projection=projection)
    ax.set_extent([-25, 60, -40, 50], crs=ccrs.PlateCarree())
    # run kriging for each component
    perms = list(set(permutations([i for i in range(K - 1)])))
    perms = [perm + (K - 1,) for perm in perms]
    for perm in perms:
        for i in perm:
            cmap = cmaps[i]
            proportions = df.loc[:, i].values
            ok = OrdinaryKriging(df.long.values, df.lat.values, proportions, variogram_model='linear',
                                 coordinates_type='geographic', nlags=2)
            z, ss = ok.execute('grid', long_coords, lat_coords)
            # get mask
            mask = build_mask(long_coords, lat_coords, land)
            # normalize data
            data = (z.data - z.data.min()) / (z.data.max() - z.data.min())
            # update mask
            mask[data < vmin] = False
            masked_data = ma.masked_array(data, ~mask)
            # plot
            ax.imshow(np.flip(masked_data, 0), extent=[-25, 60, -40, 50], cmap=cmap, interpolation='none',
                      vmax=1, vmin=vmin, alpha=1 / len(perms))
    # formatting
    ax.add_feature(cfeature.COASTLINE, color='black')
    ax.add_feature(cfeature.BORDERS, linestyle=':', color='black')
    # plot populations
    ax.scatter(coords.long, coords.lat, color='black', s=8)
    ax.axis('off')
    fig.savefig(f"{prefix}_admixture_kriging_K{K}.pdf", bbox_inches='tight', dpi=600)


def plot_best_k(df, K, cv, colors, prefix):
    """
    Create horizontal bar plot of ADMIXTURE results
    :param df: pd.DataFrame, sorted DataFrame with ancestry proportions
    :param K: int, K used for admixture run
    :param cv: float, CV error of ADMIXTURE run
    :param colors: list, list of colors to use
    :param prefix: str, prefix for figure
    """
    fig, ax = plt.subplots(figsize=(2, 8))
    populations = df.population.values
    y_coords = [0]
    y_ticks = []
    y_labels = []
    prev_pop = populations[0]
    pop_coord_start = 0
    cumulative_padding = 0
    padding = 0.15
    for i, pop in enumerate(populations[1:], 1):
        if prev_pop != pop:
            # get ytick positions and labels
            y_ticks.append((pop_coord_start + i + cumulative_padding - 1) / 2)
            y_labels.append(prev_pop)
            prev_pop = pop
            # add white space between populations
            cumulative_padding += padding
            pop_coord_start = i + cumulative_padding
        # determine y coords --> add white space between populations
        y_coords.append(i + cumulative_padding)
    y_ticks.append((pop_coord_start + i + cumulative_padding) / 2)
    y_labels.append(prev_pop)
    prev_k = []
    # do plotting
    for i in range(K):
        if i > 0:
            ax.barh(y_coords, df.loc[:, i], left=df.loc[:, prev_k].sum(axis=1), height=1, color=colors[i])
        else:
            ax.barh(y_coords, df.loc[:, i], height=1, color=colors[i])
        prev_k.append(i)
    # formatting
    ax.set_yticks(y_ticks)
    ax.set_yticklabels([x.replace('Sabue', 'Chabu').replace('Xun', '!Xun').replace('_ERR_', "_Errachidia_")
                        .replace('Juhoansi', "Ju|'Hoan").replace('SEBantu', 'BantuSAfrica')
                        .replace('SWBantu', 'Herero').replace('Haihom', 'Hai||om').replace("_TIZ_", "_Tiznit_")
                        .replace('_Sen_', '_Sened_').replace('_Chen_', '_Chenini_').replace('_', " ")
                        .replace("BER", "Berber") for x in y_labels], fontsize=5)
    ax.set_xticks([])
    ax.set_xticklabels([])
    ax.invert_yaxis()
    ax.set_xlabel(f"K={K}\nCV={round(cv, 4)}", fontsize=8)
    ax.set_ylim([len(populations) + cumulative_padding - 0.5, -0.5])
    ax.set_xlim([0, 1])
    fig.savefig(f"{prefix}_admixture_plot_best_K.pdf", bbox_inches='tight', dpi=600)


def plot_all_k(qfiles, Kvals, cv_error, fam, individual_order, colors, prefix):
    """
    Plot admixture proportions across all K
    :param qfiles: list, list of paths to aligned .Q files
    :param Kvals: list-like, K values that were used
    :param cv_error: array-like, cross-validation errors of individual runs
    :param fam: pd.DataFrame, df of plink .fam file
    :param individual_order: list, order of individuals in which to plot them to guarantee consistency
    :param colors: list, colors used for plotting
    :param prefix: str, prefix of output figure; "_admixture_plots.pdf" will be appended
    """
    fig, ax = plt.subplots(1, len(qfiles), figsize=(8, 11))
    plt.subplots_adjust(wspace=0.15)
    min_k = min(Kvals)
    for qfile, K, cv in zip(qfiles, Kvals, cv_error):
        df = pd.read_csv(qfile, sep=' ', header=None, index_col=None)
        df = df.iloc[:, 6:]
        df.columns = [i for i in range(K)]
        df['population'] = fam.fid.values
        df_sorted = df.loc[individual_order]
        plot_admixture_proportions(df_sorted, K, ax[K - min_k], cv, colors, min_k)
    fig.savefig(prefix + "_admixture_plots.pdf", bbox_inches='tight', dpi=600)
    plt.close()


def plot_admixture_proportions(df, K, ax, cv, colors, min_k):
    """
    Helper to plot admixture proportions for a specific value of K
    :param df: pd.DataFrame, sorted DataFrame with ancestry proportions
    :param K: int, K used for admixture run
    :param ax: ax object to plot on
    :param cv: float, CV error of ADMIXTURE run
    :param colors: list, list of colors to use
    :param min_k: int, minimum k that was used in any run
    :return: ax
    """
    populations = df.population.values
    y_coords = [0]
    y_ticks = []
    y_labels = []
    prev_pop = populations[0]
    pop_coord_start = 0
    cumulative_padding = 0
    padding = 0.15
    for i, pop in enumerate(populations[1:], 1):
        if prev_pop != pop:
            # get ytick positions and labels
            y_ticks.append((pop_coord_start + i + cumulative_padding - 1) / 2)
            y_labels.append(prev_pop)
            prev_pop = pop
            # add white space between populations
            cumulative_padding += padding
            pop_coord_start = i + cumulative_padding
        # determine y coords --> add white space between populations
        y_coords.append(i + cumulative_padding)
    y_ticks.append((pop_coord_start + i + cumulative_padding) / 2)
    y_labels.append(prev_pop)
    prev_k = []
    # do plotting
    for i in range(K):
        if i > 0:
            ax.barh(y_coords, df.loc[:, i], left=df.loc[:, prev_k].sum(axis=1), height=1, color=colors[i])
        else:
            ax.barh(y_coords, df.loc[:, i], height=1, color=colors[i])
        prev_k.append(i)
    # formatting
    if K > min_k:
        ax.set_yticks([])
    else:
        ax.set_yticks(y_ticks)
        ax.set_yticklabels([x.replace('Sabue', 'Chabu').replace('Xun', '!Xun').replace('_ERR_', "_Errachidia_")
                            .replace('Juhoansi', "Ju|'Hoan").replace('SEBantu', 'BantuSAfrica')
                            .replace('SWBantu', 'Herero').replace('Haihom', 'Hai||om').replace("_TIZ_", "_Tiznit_")
                            .replace('_Sen_', '_Sened_').replace('_Chen_', '_Chenini_').replace('_', " ")
                            .replace("BER", "Berber") for x in y_labels], fontsize=5)
    ax.set_xticks([])
    ax.set_xticklabels([])
    ax.invert_yaxis()
    ax.set_xlabel(f"K={K}\nCV={round(cv, 4)}", fontsize=5)
    ax.set_ylim([len(populations) + cumulative_padding - 0.5, -0.5])
    ax.set_xlim([0, 1])
    return ax


def main(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument('-i', '--input_dir',
                        help='Input directory of "converted" files, i.e., .Q files aligned with CLUMPAK')
    parser.add_argument('-l', '--logfile_dir',
                        help='Directory containing log files from admixture runs to extract CV error.')
    parser.add_argument('-f', '--fam', help='Plink fam file to extract individual and family IDs')
    parser.add_argument('-c', '--coords', help='File with individual geographic coordinates')
    parser.add_argument('--vmin', type=float,
                        help='Only admixture proportions greater than vmin are considered for the Kriging '
                             'method. default=0.5', default=0.5)
    parser.add_argument('-k', '--K_kriging', type=int, help='K to use for plotting Kriging')
    parser.add_argument('-o', '--output_prefix', help='Output prefix')
    args = parser.parse_args()
    input_dir = args.input_dir
    log_dir = args.logfile_dir
    if not input_dir.endswith('/'):
        input_dir += '/'
    if not log_dir.endswith('/'):
        log_dir += '/'
    # get input q files
    aligned_q_files = glob.glob(input_dir + "*.converted")
    # read fam file
    fam = pd.read_csv(args.fam, sep='\t', names=['fid', 'iid'], usecols=[0, 1])
    # get k values
    Kvals = [int(f.split('.')[-3]) for f in aligned_q_files]
    # sort based on K
    aligned_q_files = np.array(aligned_q_files)[np.argsort(Kvals)]
    Kvals = np.sort(Kvals)
    max_k = max(Kvals)
    # initialize colors
    colors = ["purple", "orange", "blue", "green"]
    # pad colors with random ones if more components are needed
    additional_colors = []
    while max_k > len(colors) + len(additional_colors):
        additional_colors.append((np.random.random(), np.random.random(), np.random.random()))
        additional_colors = list(set(additional_colors))
    colors.extend(additional_colors)
    cv_error = get_best_k_and_plot_best_k(aligned_q_files, Kvals, fam, log_dir, colors, args.output_prefix)
    individual_order = sort_individuals_specified_K(aligned_q_files[np.where(Kvals == args.K_kriging)[0][0]],
                                                    args.K_kriging, fam)
    plot_all_k(aligned_q_files, Kvals, cv_error, fam, individual_order, colors, args.output_prefix)
    plot_krigin(aligned_q_files[np.where(Kvals == args.K_kriging)[0][0]], args.K_kriging, args.coords,
                args.vmin, fam, colors, args.output_prefix)


if __name__ == '__main__':
    main(sys.argv[1:])
```
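A hypothetical invocation mirroring the workflow rule shown further below (paths and the "merged" prefix are placeholders; -k 4 matches the K with the lowest CV error reported in the Comments section):

```bash
scripts/visualize_admixture.py -i results/clumpak/aligned.files/ -l results/ \
    -f data/merged.fam -c data/population_geo_coords.tab -k 4 -o results/merged
```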
Shell commands of the individual Snakemake rules (the line ranges in the comments refer to their position in the workflow definition):

```
# lines 53-56
shell:
    "wget -q {params.url}; unzip {params.zip}; rm {params.zip}; mv {output} {output}_tmp; "
    "unzip -d {output}_tmp {output}_tmp/26_03_2015_CLUMPAK.zip; mv {output}_tmp/26_03_2015_CLUMPAK/CLUMPAK .;"
    "rm -r {output}_tmp; chmod +x {output}/distruct/*; chmod +x {output}/CLUMPP/*; chmod +x {output}/mcl/*"

# lines 64-65
shell:
    "wget -q -P {params.path} {params.url}"

# lines 74-76
shell:
    "wget -q -O- {params.url} | awk -F \"\\t\" 'BEGIN{{OFS=\"\\t\"}}{{print $3, $5, $6, $7, $8, $12, $13, $2}}' | "
    "grep -E \"{params.pops_of_interest}\" > {output}"

# lines 85-86
shell:
    "mkdir -p {params.dir}; tar xvf {input} -C {params.dir}"

# lines 103-104
shell:
    "{params.bcftools} merge -0 -r {params.chrom} -Oz -o {output} --threads {threads} {input}"

# lines 116-118
shell:
    "plink2 --vcf {input} --snps-only --max-alleles 2 --make-bed --set-all-var-ids @:# "
    "--out {params.base} --threads {threads}"

# lines 126-128
shell:
    "cut -f2 {input.fam} | tr \"\n\" \"|\" | xargs -I {{}} grep -Ew {{}} {input.metadata} | "
    "awk -F \"\\t\" 'BEGIN{{OFS=\"\\t\"}}{{print 0, $NF}}' > {output};"

# lines 143-144
shell:
    "plink2 --bfile {params.input_base} --keep {input.samples} --make-bed --out {params.output_base} --threads {threads}"

# lines 152-154
shell:
    "cut -f2 {input.old} | tr \"\n\" \"|\" | xargs -I {{}} grep -E {{}} {input.metadata} | "
    "awk -F \"\\t\" 'BEGIN{{OFS=\"\\t\"}}{{print $3, $2}}' | paste {input.old} - > {output}"

# lines 170-172
shell:
    'plink2 --bfile {params.input_base} --update-ids {input.ids} --make-bed --out {params.output_base} '
    '--threads {threads}'

# lines 181-183
shell:
    "cut -f1 {input.fam} | sort | uniq | xargs -I {{}} bash -c \"grep {{}} {input.fam} | cut -f1,2 | "
    "shuf -n{params.n_samples}\" bash > {output}"

# lines 197-198
shell:
    "plink2 --bfile {params.input_base} --keep {input[0]} --make-bed --out {params.output_base} --threads {threads}"

# lines 206-207
shell:
    "cut -f1,2 {input.coords} > {output.to_keep}"

# lines 221-222
shell:
    "plink2 --bfile {params.input_base} --keep {input.keep} --make-bed --out {params.output_base} --threads {threads}"

# lines 230-233
shell:
    "cut -f2 {input.fam} | xargs -I {{}} grep {{}} {input.meta} | "
    "awk -F \"\\t\" 'BEGIN{{OFS=\"\\t\"}}{{print $3, $2}}' | sed -e 's/\///g' | "
    "paste <(cut -f1,2 {input.fam} | sed 's/\\s/\\t/g') - > {output}"

# lines 247-249
shell:
    "plink2 --bfile {params.input_base} --update-ids {input.ids} --make-bed --threads {threads} "
    "--out {params.output_base}"

# lines 259-262
shell:
    "cat {input.samples} | sed -e 's/\///g' -e 's/|/|\\\/' | "
    "cut -f3 | sort | uniq | xargs -I {{}} bash -c \"grep -w \\\"{{}}\\\" <( cat {input.samples} | sed -e 's/\///g') | "
    "shuf -n {params.n_samples} | cut -f3,4 >> {output}\" sh"

# lines 276-278
shell:
    "plink2 --bfile {params.input_base} --keep {input.samples} --make-bed --out {params.output_base} "
    "--threads {threads}"

# lines 286-287
shell:
    "wget -q -P {params.data} {params.url}"

# lines 296-297
shell:
    "tar xvzf {input} -C {params.data_path}"

# lines 307-309
shell:
    "cut -d ' ' -f1 {input.fam} | sort | uniq | xargs -I {{}} bash -c \"grep {{}} {input.fam} | cut -d ' ' -f1,2 | "
    "shuf -n {params.n_samples}\" bash | grep -v -E \"{params.exclude}\" > {output}"

# lines 323-325
shell:
    "plink2 --bfile {params.input_base} --keep {input.to_keep} --make-bed --set-all-var-ids @:# "
    "--out {params.output_base} --threads {threads}"

# lines 332-333
shell:
    "cut -f2 {input} | sort | uniq -d > {output}"

# lines 347-349
shell:
    "plink2 --bfile {params.input_base} --exclude {input.ids} --make-bed --out {params.outputbase} "
    "--threads {threads}"

# lines 361-365
shell:
    "wget -q -P {params.data_path} {params.url_fam}; mv {params.fam_file} {output.fam}; "
    "wget -q -P {params.data_path} {params.url_tped}; "
    "tar -xzvf {params.tped_file}.tar.gz; mv {params.tped_file} {output.tped}; "
    "rm {params.tped_file}.tar.gz; "

# lines 377-378
shell:
    "plink --tfile {params.base} --recode --make-bed --out {params.base} --threads {threads}"

# lines 388-390
shell:
    "cut -d ' ' -f1 {input.fam} | sort | uniq | xargs -I {{}} bash -c \"grep {{}} {input.fam} | cut -d ' ' -f1,2 | "
    "shuf -n {params.n_samples}\" bash | grep -v -E \"{params.exclude}\" > {output}"

# lines 404-406
shell:
    "plink2 --bfile {params.input_base} --keep {input.to_keep} --make-bed --set-all-var-ids @:# "
    "--snps-only --max-alleles 2 --out {params.output_base} --threads {threads}"

# lines 419-422
shell:
    "wget -q {params.url_bed}; mv {params.name_bed} {params.data_path}arauna_et_al_2017.bed; "
    "wget -q {params.url_bim}; mv {params.name_bim} {params.data_path}arauna_et_al_2017.bim; "
    "wget -q {params.url_fam}; mv {params.name_fam} {params.data_path}arauna_et_al_2017.fam;"

# lines 433-435
shell:
    "cut -d ' ' -f1 {input.coords} | tail -n+2 | xargs -I {{}} bash -c \"grep -w {{}} {input.fam} | "
    "shuf -n{params.n_samples} | cut -d ' ' -f1,2\" > {output}"

# lines 449-451
shell:
    "plink2 --bfile {params.input_base} --keep {input.to_keep} --make-bed --set-all-var-ids @:# "
    "--out {params.output_base} --threads {threads}"

# lines 478-479
shell:
    "cut -f2 {input} | sort > {output}"

# lines 493-494
shell:
    "comm -12 {input} > {output}"

# lines 509-510
shell:
    "plink2 --bfile {params.input_base} --extract {input.snps} --make-bed --out {params.output_base}"

# lines 539-568
shell:
    "if [[ -f {params.output_base}_trial1-merge.missnp ]]; "  # second trial --> flip SNPs once
    "then "
    "plink --bfile {params.base_b} --flip {params.output_base}_trial1-merge.missnp --make-bed --out "
    "{params.base_b}_tmp --threads {threads}; "  # flip once
    "rm {params.output_base}_trial1-merge.missnp; "
    "plink --bfile {params.base_a} --bmerge {params.base_b}_tmp --make-bed --geno {params.geno} "
    "--maf {params.maf} --out {params.output_base}_trial2 --threads {threads}; "
    "mv {params.output_base}_trial2.bed {params.output_base}.bed; "
    "mv {params.output_base}_trial2.bim {params.output_base}.bim; "
    "mv {params.output_base}_trial2.fam {params.output_base}.fam; "
    "elif [[ -f {params.output_base}_trial2-merge.missnp ]]; "  # last trial --> exclude SNPs that caused problems
    "then "
    "plink --bfile {params.base_a} --exclude {params.output_base}_trial2-merge.missnp --make-bed "
    "--out {params.base_a}_tmp --threads {threads}; "
    "plink --bfile {params.base_b}_tmp --exclude {params.output_base}_trial2-merge.missnp --make-bed "
    "--out {params.base_b}_tmp1 --threads {threads}; "
    "plink --bfile {params.base_a}_tmp --bmerge {params.base_b}_tmp1 --make-bed --geno {params.geno} "
    "--maf {params.maf} --out {params.output_base} --threads {threads}; "
    "rm {params.base_a}_tmp.*; "
    "rm {params.base_b}_tmp.*; "
    "rm {params.base_b}_tmp1.*; "
    "rm {params.output_base}_trial2-merge.missnp; "
    "else "  # first pass
    "plink --bfile {params.base_a} --bmerge {params.base_b} --make-bed --geno {params.geno} "
    "--maf {params.maf} --out {params.output_base}_trial1 --threads {threads}; "
    "mv {params.output_base}_trial1.bed {params.output_base}.bed; "
    "mv {params.output_base}_trial1.bim {params.output_base}.bim; "
    "mv {params.output_base}_trial1.fam {params.output_base}.fam; "
    "fi"

# lines 575-579
shell:
    "awk -F '\\t' '{{if ($5 == \"A\" && $6 != \"T\" || $5 != \"A\") print $0}}' {input} | "
    "awk -F '\\t' '{{if ($5 == \"T\" && $6 != \"A\" || $5 != \"T\") print $0}}' | "
    "awk -F '\\t' '{{if ($5 == \"C\" && $6 != \"G\" || $5 != \"C\") print $0}}' | "
    "awk -F '\\t' '{{if ($5 == \"G\" && $6 != \"C\" || $5 != \"G\") print $2}}' | sort > {output}"

# lines 593-595
shell:
    "plink2 --bfile {params.input_base} --extract {input[0]} --make-bed --out {params.output_base} "
    "--threads {threads}"

# lines 602-603
shell:
    "awk -F '\\t' 'BEGIN{{OFS=\"\\t\"}}{{if ($5 == \".\" || $6 == \".\") print $2}}' {input} > {output}"

# lines 611-612
shell:
    "grep -v -w -f {input.exclude} {input.bim} | cut -f2 | sort > {output}"

# lines 620-621
shell:
    "comm -12 {input} > {output}"

# lines 668-669
shell:
    'cut -f2 {input} | sort > {output}'

# lines 676-677
shell:
    'cut -f2 {input} | sort > {output}'

# lines 685-686
shell:
    "comm -12 {input} > {output}"

# lines 731-732
shell:
    "cut -f2 {input} | sort > {output}"

# lines 739-740
shell:
    "cut -f2 {input} | sort > {output}"

# lines 748-749
shell:
    "comm -12 {input} > {output}"

# lines 799-800
shell:
    "cut -f2 {input} | sort > {output}"

# lines 807-808
shell:
    "cut -f2 {input} | sort > {output}"

# lines 816-817
shell:
    "comm -12 {input} > {output}"

# lines 874-876
shell:
    "plink2 --bfile {params.input_base} --indep-pairwise 50 1 {params.max_r2} --out {params.input_base} "
    "--threads {threads}"

# lines 891-893
shell:
    "plink2 --bfile {params.input_base} --extract {input.extract} --make-bed --out {params.output_base} "
    "--threads {threads}"

# lines 906-907
shell:
    "plink2 --bfile {params.base} --pca 20 --out {params.base} --threads {threads}"

# lines 917-918
shell:
    "scripts/plot_pca.py --eigenvec {input.eigenvec} --plot_dir {params.output_dir} --coords {input.coords}"

# lines 937-950
shell:
    "cut -f2 {input.sgdp_fam} | xargs -I {{}} bash -c \"grep {{}} {input.meta_sgdp} | cut -f6,7\" | "
    "paste <(cut -f1,2 {input.sgdp_fam}) - > {output}; "
    "cut -f2 {input.crawford_and_scheinfeldt_fam} | "
    "xargs -I {{}} bash -c \"grep {{}} {input.meta_crawford_and_scheinfeldt}\" | cut -f4,5 | "
    "paste <(cut -f1,2 {input.crawford_and_scheinfeldt_fam}) - >> {output}; "
    "cut -f1 {input.hollfelder_fam} | xargs -I {{}} bash -c \"grep {{}} {input.meta_hollfelder}\" | cut -f6,7 | "
    "paste <(cut -f1,2 {input.hollfelder_fam}) - >> {output}; "
    "cut -f1 {input.arauna_fam} | xargs -I {{}} bash -c \"grep -w {{}} {input.meta_arauna}\" | cut -d ' ' -f2,3 | "
    "paste <(cut -f1,2 {input.arauna_fam}) - | sed -e 's/\\s/\\t/g' >> {output}; "
    "cut -f1 {input.fortes_lima_fam} | xargs -I {{}} bash -c \"grep -w {{}} {input.meta_fortes_lima}\" | "
    "cut -d ' ' -f2,3 | paste <(cut -f1,2 {input.fortes_lima_fam}) - | sed -e 's/\\s/\\t/g' >> {output}; "
    "cut -f1 {input.schlebusch_fam} | xargs -I {{}} bash -c \"grep -w {{}} {input.meta_schlebusch}\" | "
    "cut -d ' ' -f2,3 | paste <(cut -f1,2 {input.schlebusch_fam}) - | sed -e 's/\\s/\\t/g' >> {output}; "

# lines 965-968
shell:
    "admixture --cv {input} {wildcards.K} -j{threads} > {params.prefix}.{wildcards.K}.log; "
    "mkdir -p {params.results}; "
    "mv {params.prefix}.{wildcards.K}.* {params.results}"

# lines 979-980
shell:
    "cd {params.results_path}; zip {params.zip} {params.files}"

# lines 987-988
shell:
    "cut -f1 {input} > {output}"

# lines 999-1002
shell:
    "cd {input.clumpak_dir}; cpanm List::Permutor; "
    "perl distructForManyKs.pl --id 1 --dir ../{output} --file ../{input.qfiles} "
    "--inputtype admixture --indtopop ../{input.pops}"

# lines 1019-1021
shell:
    "scripts/visualize_admixture.py -i {input.clumpak}/aligned.files/ -f {input.fam} -l {params.results_path} "
    "-c {input.coords} -k {params.K_kriging} -o {params.output_prefix}"

# lines 1036-1037
shell:
    "scripts/run_feems.py -d {params.data_path} -p {params.prefix} -c {input.pop_coords} -o {params.output_path}"
```