Evaluating the robustness of polygenic adaptation to the choice of the GWAS cohort
A - Neutrality test for polygenic scores
Workflow
Step 0: download software and packages
This workflow is built with Snakemake. Please install Snakemake following the instructions here:
https://snakemake.readthedocs.io/en/stable/getting_started/installation.html
Install glactools to calculate allele frequencies for a given population panel. Follow the instructions here:
https://github.com/grenaud/glactools
Step 1: download all files
Modify the values of all keys in config.yaml with your own paths and directories (do not modify the key names unless you change them accordingly in all Snakemake files). If you are working with GRCh37, download files from:
# EPO file
wget ftp://ftp.healthtech.dtu.dk/public/EPO/all_hg19.epo.gz; wget ftp://ftp.healthtech.dtu.dk/public/EPO/all_hg19.epo.gz.tbi
# FAI file
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.fai
# LBD_blocks
# Download accordingly based on the ancestry of the populations you are working with, e.g. EUR
wget https://bitbucket.org/nygcresearch/ldetect-data/src/master/EUR/fourier_ls-all.bed
# Summary stats
# You can find a list of traits available for the UKBB by the Neale lab at: https://docs.google.com/spreadsheets/d/1kvPoupSzsSFBNSztMzl04xMoSC3Kcx3CrjVf4yBmESU/edit#gid=178908679. For instance, for height:
wget https://broad-ukb-sumstats-us-east-1.s3.amazonaws.com/round2/additive-tsvs/50_irnt.gwas.imputed_v3.both_sexes.tsv.bgz -O 50_irnt.gwas.imputed_v3.both_sexes.tsv.bgz
# VCF file - 1000 Genomes can be downloaded from here:
# http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/
Most files for GRCh38 are also available. However, you will need to lift over the LBD_blocks files.
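For orientation, here is a minimal config.yaml sketch. Only 'pops_panel' and 'acf_file' are key names confirmed by this README; every other key name and all paths are illustrative placeholders that you should match against the keys actually used in the .smk files:

# sketch only -- key names other than pops_panel/acf_file are placeholders
epo_file: /path/to/all_hg19.epo.gz
fai_file: /path/to/human_g1k_v37.fasta.fai
lbd_blocks: /path/to/fourier_ls-all.bed
sumstats: /path/to/50_irnt.gwas.imputed_v3.both_sexes.tsv.bgz
pops_panel: /path/to/pops_panel_example_1000GP.txt
acf_file: /path/to/output/allpops.acf.gz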
Step 2: Genomic data and allele frequencies
Get the allele frequencies of the populations of interest (e.g. populations from the 1000 Genomes Project) by running vcf2acf.smk with Snakemake. The first rule prepares your VCF file (filters at the variant and/or individual level).
In the config file, add the path/filename under 'pops_panel'. This is a two-column, tab-separated file with no header: the first column corresponds to the sample ID and the second to the population it belongs to. Example:
head -n1 pops_panel_example_1000GP.txt
HG00096 GBR
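If you are working with the 1000 Genomes Project, a file in this format can be derived from the project's sample panel; a sketch, assuming the panel file distributed with the 20130502 release (header line skipped, columns sample/pop):

tail -n +2 integrated_call_samples_v3.20130502.ALL.panel | cut -f1,2 > pops_panel.txt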
Once the config file is ready, run Snakemake as follows:
snakemake --snakefile path/to/vcf2acf.smk --cores XX
The final output of this step corresponds to the file under the key 'acf_file' in config.yaml. This file will contain the allele frequencies for each site in each population. Intermediate files will be deleted (remove the temp() calls within the rules to keep them). Output example (use glactools view to check the output):
#chr coord REF,ALT root anc GBR
1 19276348 G,A 0,0:0 0,0:0 192,14:0
More info on how to use glactools is here: https://github.com/grenaud/glactools
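For instance, to inspect the first records (the filename here is a placeholder for whatever you set under 'acf_file'):

glactools view allpops.acf.gz | head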
Compute Qx statistic
Before computing polygenic scores, if you are working with more than one GWAS, make sure their effect sizes are polarized by the same allele (e.g. the derived allele); a sketch of this step follows.
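A minimal sketch of such a polarization step in Python, assuming a summary-stats table with hypothetical columns A1 (effect allele), DER (derived allele) and BETA (effect size):

import pandas as pd

# Hypothetical column names: A1 = effect allele, DER = derived allele, BETA = effect size
df = pd.read_csv("sumstats.tsv", sep="\t")
flip = df["A1"] != df["DER"]                   # effect allele is not the derived allele
df.loc[flip, "BETA"] = -df.loc[flip, "BETA"]   # re-polarize beta to the derived allele
df.to_csv("sumstats_derived.tsv", sep="\t", index=False)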
A. rule polyAdapt_freqs
Step 1: Run scripts/acf2ukbfreq_byP.py to get a tabulated file with all the info needed. The output is a file with the population allele frequencies and some extra info about each variant (chr, position, alleles, beta, etc.). It currently only works for the UK Biobank; however, you can easily modify it if you are using different summary stats. Where the script reads 'gwasfile', make sure the column numbers and names of the variables of interest correspond to those in your summary stats (e.g. the beta value). ONLY LINES 3-43 WILL NEED MODIFICATIONS! The output contains a new column DEREFFECT, as we polarised the betas by the derived allele for comparison purposes (by default, the UK Biobank betas are polarised by allele 'a1').
#CHROM POS SNPID REF ALT ANC DER DEREFFECT SE PVAL GBR XXX
1 889638 1:889638 G C G C -2.18042e-04 3.36098e-03 9.48274e-01 14,192 XXX,XXX
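To run this step outside Snakemake, the invocation mirrors the rule in the Code Snippets section (filenames are placeholders; note the rule calls a variant of the script named acf2ukbfreq_byP_Alba.py):

python scripts/acf2ukbfreq_byP.py -a pops_panel.txt -g allpops.acf.gz -o gwasfreqs_pheno.tsv
# sort, compress and index so the next two scripts can query the file with tabix:
cat <(head -1 gwasfreqs_pheno.tsv) <(tail -n+2 gwasfreqs_pheno.tsv | sort -k1,1 -k2,2g) | bgzip -c > gwasfreqs_pheno.tsv.gz
tabix -s 1 -b 2 -e 2 gwasfreqs_pheno.tsv.gz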
Step 2: Run scripts/partitionUKB_byP.py to get the "candidate variants". It uses the output from Step 1 and the LD blocks from Berisa and Pickrell (2016). It extracts the variant with the lowest association p-value in each block (if at least one variant passes the genome-wide p-value threshold of 5e-8). The p-value threshold can be changed with -p 5e-08. Output example:
#CHROM POS SNPID DEREFFECT GBR XXX
1 930751 1:930751 5.22061e-03 34,172
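Example invocation (filenames are placeholders; the flags match the Snakemake rule in the Code Snippets section):

python scripts/partitionUKB_byP.py -i gwasfreqs_pheno.tsv.gz -b fourier_ls-all.bed -o gwasfreqs_candidates_pheno.tsv -p 5e-08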
Step 3: Run scripts/extractneutral_byP.py to get "neutral variants". As in Step 2, you can change the p-value threshold for what you consider "non-associated" variants with -p 1e-5. The output file has the same format as in Step 2.
#CHROM POS SNPID DEREFFECT GBR XXX
1 958953 1:958953 1.01310e-02 54,152
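Example invocation (filenames are placeholders; -s 20 is the spacing between retained SNPs used in the Snakemake rule):

python scripts/extractneutral_byP.py -i gwasfreqs_pheno.tsv.gz -o gwasfreqs_neutral_pheno.tsv -s 20 -p 1e-5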
B. rule polyAdapt_qx
Step 4: Run scripts/CalcQX_edit4parallel_Alba.R. It generates two outputs:
- QX_report: contains the PRS for each population, the QX value, the chi-squared p-value, the number of trait-associated SNPs and the sign-randomised p-value.
- PRS: a tab-separated text file with one score per population (the score file written by the script also includes upperlim and lowerlim confidence-limit columns):
POP SCORE
GBR 0.55
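A standalone invocation, taken from the usage comment at the top of the script (filenames are placeholders):

Rscript scripts/CalcQX_edit4parallel_Alba.R -w gwasfreqs_candidates_pheno.tsv -e gwasfreqs_neutral_pheno.tsv -o qxreport.txt -s genscores.txt -n 1000 -j 1000

To run both rules through Snakemake instead: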
snakemake --snakefile path/to/polyadapt.smk --cores XX
- NB! CURRENTLY UNDER DEVELOPMENT
B - Assessing different association methods
Quality control filters (an example PLINK 2.0 command implementing them follows this list):
- Genotyped SNPs (~800,000)
- MAF > 0.001
- HWE p-value > 1e-10
- Autosomes
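A sketch of these filters as a single PLINK 2.0 command (input/output names are placeholders; the --maf and --hwe values match the association command below):

plink2 --bgen allind.bgen --sample allind.sample \
  --maf 0.001 --hwe 1e-10 --autosome \
  --make-pgen --out qc_filtered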
Association models: examples
- Linear model implemented in PLINK 2.0 (example below; note that current PLINK 2.0 uses --glm for association tests, replacing the retired --linear):

WHITE="/home/white" # per-cohort output prefix; BRI would be used analogously for a second cohort
BRI="/home/bri"
HEIGHT="/home/id_height.tsv" # tabulated file with #FID IID HEIGHT
cd /home/white
plink2 --bgen allind.bgen --sample allind.sample --hwe 1e-10 --maf 0.001 --keep-fam filtered_ind.list.txt \
  --glm hide-covar --pheno $HEIGHT --covar allcov.tsv --covar-col-nums 3-4,8-27 --variance-standardize --out $WHITE

- Linear mixed model implemented in BOLT-LMM
- Meta-analysis implemented in METAL, either inverse-variance based or sample-size based (a script sketch follows this list)
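A minimal METAL script sketch. SCHEME/MARKER/ALLELE/EFFECT/STDERR/PVALUE/WEIGHT/PROCESS/ANALYZE are standard METAL directives, but the column names (SNP, A1, A2, BETA, SE, P, N) and input filenames are assumptions to adapt to your summary stats:

# Run as: metal < meta_script.txt
# Inverse-variance-based meta-analysis:
SCHEME STDERR
MARKER SNP
ALLELE A1 A2
EFFECT BETA
STDERR SE
PROCESS gwas_cohort1.txt
PROCESS gwas_cohort2.txt
ANALYZE
# For a sample-size-based analysis, replace SCHEME STDERR with:
# SCHEME SAMPLESIZE
# PVALUE P
# WEIGHT N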
Code Snippets
shell:
    """
    python scripts/acf2ukbfreq_byP_Alba.py -a {input.popfile} -g {input.infile} -o {output.outmp}
    cat <(head -1 {output.outmp}) <(tail -n+2 {output.outmp} | sort -k1,1 -k2,2g) | bgzip -c > {output.outfile}
    tabix -s 1 -b 2 -e 2 {output.outfile}
    python scripts/partitionUKB_byP.py -i {output.outfile} -b {input.lbd} -o {output.candi} -p {params.pvalCan}
    python scripts/extractneutral_byP.py -i {output.outfile} -o {output.neut} -s 20 -p {params.pvalNeut}
    """
shell:
    """
    Rscript scripts/CalcQX_edit4parallel_Alba.R -w {input.candi} -e {input.neut} -o {output.qx} -s {output.scores} -n 1000 -j 1000
    """
library("optparse")

# terminal line
# Rscript CalcQX.R -w gwasfreqs_candidates_pheno.tsv -e gwasfreqs_neutral_pheno.tsv -o qxreport.txt -n 1000 -j 1000 -s genscores.txt

option_list = list(
  make_option(c("-w", "--gwasfile"), type="character", default=NULL, help="GWAS input file name"),
  make_option(c("-e", "--neutfile"), type="character", default=NULL, help="Neutral input file name"),
  make_option(c("-o", "--outfile"), type="character", default="output.txt", help="Output file name"),
  make_option(c("-p", "--pops"), type="character", default="ALL", help="Populations to be tested"),
  make_option(c("-n", "--numrep"), type="numeric", default=NULL, help="Number of sign-randomized replicates"),
  make_option(c("-j", "--emprep"), type="numeric", default=NULL, help="Number of frequency-matched replicates"),
  make_option(c("-s", "--scorefile"), type="character", default=NULL, help="Score file name"),
  make_option(c("-l", "--leeway"), type="numeric", default=0.01, help="Leeway for empirical frequency matching"),
  make_option(c("-f", "--minorfreq"), type="numeric", default=0.05, help="Minor allele frequency cutoff for covariance matrix")
);

opt_parser = OptionParser(option_list=option_list);
opt = parse_args(opt_parser);

if (is.null(opt$gwasfile)){
  print_help(opt_parser)
  stop("GWAS file name must be supplied.\n", call.=FALSE)
}
if (is.null(opt$neutfile)){
  print_help(opt_parser)
  stop("Neutral file name must be supplied.\n", call.=FALSE)
}
if (is.null(opt$outfile)){
  print_help(opt_parser)
  stop("Output file name must be supplied.\n", call.=FALSE)
}
if (is.null(opt$scorefile)){
  print_help(opt_parser)
  stop("Score file name must be supplied.\n", call.=FALSE)
}

gwasfile <- opt$gwasfile
neutfile <- opt$neutfile
outfile <- opt$outfile
scorefile <- opt$scorefile
pops <- opt$pops
#pops <- strsplit(opt$pops,",")[[1]]
pseudorep <- opt$numrep
emprep <- opt$emprep # Multiple-testing
popmatch <- opt$popmatch
leeway <- opt$leeway
minorfreq <- opt$minorfreq
#pops <- strsplit(readLines(pops), ",")[[1]]
#pops <- c("ESN","MSL", "CEU", "TSI", "CDX", "JPT", "PEL")

# Parse "ancestral,derived" count strings into frequencies and counts
ObtainFreqs <- function(countdat){
  dercounts <- apply(countdat,c(1,2),function(x){splitted <- strsplit(x,",")[[1]]; return( as.numeric(splitted[2]) )})
  totalcounts <- apply(countdat,c(1,2),function(x){splitted <- strsplit(x,",")[[1]]; return( as.numeric(splitted[2])+as.numeric(splitted[1]) )})
  dersum <- apply(dercounts,1,function(x){sum(x)})
  totalsum <- apply(totalcounts,1,function(x){sum(x)})
  totalfreq <- dersum/totalsum
  freqs <- apply(countdat,c(1,2),function(x){splitted <- strsplit(x,",")[[1]]; return( as.numeric(splitted[2]) / (as.numeric(splitted[2])+as.numeric(splitted[1])) )})
  return(list(as.matrix(freqs), totalfreq, as.matrix(dercounts), as.matrix(totalcounts)))
}

# Trim out white space
trim <- function (x) gsub("^\\s+|\\s+$", "", x)

LoadCounts <- function(filename,pops){
  table <- as.matrix(read.table(filename,header=TRUE,sep="\t",strip.white=TRUE,comment.char="!"))
  if(paste(pops,collapse=",") == "ALL"){ tokeep <- seq(5,length(colnames(table)))
  } else{ tokeep <- which(colnames(table) %in% pops)}
  tokeep <- c(1,2,3,4,tokeep)
  table <- table[,tokeep]
  table[,1] <- trim(table[,1])
  table[,2] <- trim(table[,2])
  return(table)
}

# Covariance (F) matrix of scaled neutral frequencies across populations
ComputeFmat <- function(neut_leaves_freqs, neut_total_freqs){
  leaves <- colnames(neut_leaves_freqs)
  #checksegneut <- which( apply(neut_leaves_freqs,1,sum)/length(leaves) < 0.95 & apply(neut_leaves_freqs,1,sum)/length(leaves) > 0.05 )
  #neut_leaves_freqs <- neut_leaves_freqs[checksegneut,]
  checksegneut <- which(neut_total_freqs < 0.95 & neut_total_freqs > 0.05 )
  neut_leaves_freqs <- neut_leaves_freqs[checksegneut,]
  neut_leaves_freqs_means <- apply(neut_leaves_freqs, 1, mean)
  mean_hetero <- neut_leaves_freqs_means*(1-neut_leaves_freqs_means)
  numSNPs <- length(neut_leaves_freqs_means)
  Fmat <- sapply(seq(1,dim(neut_leaves_freqs)[2]),function(x){
    sapply(seq(1,dim(neut_leaves_freqs)[2]),function(y){
      cov(neut_leaves_freqs[,x]/sqrt(mean_hetero), neut_leaves_freqs[,y]/sqrt(mean_hetero))
    })
  })
  colnames(Fmat) <- colnames(neut_leaves_freqs)
  rownames(Fmat) <- colnames(neut_leaves_freqs)
  return(Fmat)
}

ChiSquared <- function(leaves_freqs,total_freqs,effects,Fmat, dercounts, totalcounts, randomize=FALSE){
  leaves <- colnames(leaves_freqs)
  checkseg <- which(total_freqs < 0.95 & total_freqs > 0.05 )
  leaves_freqs <- leaves_freqs[c(checkseg),]
  effects <- effects[c(checkseg)]
  leaves_freqs <- leaves_freqs[! is.infinite(effects),]
  effects <- effects[! is.infinite(effects)]
  # Randomize effects if necessary
  if(randomize == TRUE){effects <- effects * sample(c(-1,1),length(effects),replace=TRUE)}
  # Compute mean genetic values
  meangen <- apply(leaves_freqs * effects, 2, function(x){sum(x)})
  # Scale by average genetic value
  meangen <- (meangen - mean(meangen))
  # Compute the estimated ancestral genetic variance over all populations
  meanfreqs <- apply(leaves_freqs,1,mean)
  varmean <- sum(sapply(seq(1,length(meanfreqs)),function(i){
    score = meanfreqs[i]*(1-meanfreqs[i])*effects[i]^2
    return(score)
  }))
  meangenvec <- 2*meangen / sqrt( 4*varmean)
  # Compute error PRS
  alpha <- dercounts + 1
  beta <- 1 + totalcounts - dercounts
  num = alpha * beta
  denom = (alpha+beta)^2 * (alpha+beta+1)
  varpl = num/denom
  varpl <- varpl[c(checkseg),]
  varprs = effects^2 * varpl
  varprs <- apply(varprs, 2, function(x) sum(x))
  se = sqrt(varprs)
  # meangen already subtracted mean
  normcilow <- 2*(meangen - 1.96*se)/sqrt(4*varmean)
  normcihi <- 2*(meangen + 1.96*se)/sqrt(4*varmean)
  # Compute Q_X statistic
  numerator <- t(meangen) %*% solve(Fmat) %*% meangen
  denominator <- varmean
  Qteststat <- numerator / denominator
  Pval <- 1 - pchisq(Qteststat,qr(Fmat)$rank)
  allstats <- c(Qteststat,Pval)
  return(list(allstats, meangenvec, normcihi, normcilow))
}

# Load GWAS data
print("Loading data...")
data <- LoadCounts(gwasfile, pops)
leaves_counts <- as.data.frame(data[,seq(5,dim(data)[2])])
raw_leaves_freqs <- ObtainFreqs(leaves_counts)
leaves_freqs <- raw_leaves_freqs[[1]]
total_freqs <- raw_leaves_freqs[[2]]
dercounts <- raw_leaves_freqs[[3]]
totalcounts <- raw_leaves_freqs[[4]]
effects <- as.numeric(data[,4])

if (sum(apply(leaves_freqs, 1, function(x) sum(is.na(x))))!=0){
  leaves_freqs <- na.omit(leaves_freqs)
  total_freqs <- na.omit(total_freqs)
  print(paste('Omitting', length(leaves_counts[,1])-length(leaves_freqs[,1]), 'out of', length(leaves_counts[,1]), 'character SNPs', sep = ' '))
}

# Load Neutral data
neutdata <- LoadCounts(neutfile, pops)
neut_leaves_counts <- as.data.frame(neutdata[,seq(5,dim(neutdata)[2])])
raw_neut_leaves_freqs <- ObtainFreqs(neut_leaves_counts)
neut_leaves_freqs <- raw_neut_leaves_freqs[[1]]
neut_total_freqs <- raw_neut_leaves_freqs[[2]]

# Calculate covariance matrix
print("Computing covariance matrix...")
Fmat <- ComputeFmat(neut_leaves_freqs, neut_total_freqs)

# Calculate chi-squared statistics
print("Computing Q_X statistic...")
totaltest <- ChiSquared(leaves_freqs,total_freqs,effects,Fmat, dercounts, totalcounts, randomize=FALSE)
totalstat <- totaltest[[1]]
meangenvec <- totaltest[[2]]
meangenvechi <- totaltest[[3]]
meangenveclow <- totaltest[[4]]
qtab <- cbind(round(totalstat[1],3),totalstat[2])
colnames(qtab) <- c("Q_X","Pval")

#### Only calculate pval and genetic scores
write("Genetic scores",file=outfile,append=FALSE)
write(paste(names(meangenvec),collapse="\t"),file=outfile,append=TRUE)
write(paste(meangenvec,collapse="\t"),file=outfile,append=TRUE)
write("",file=outfile,append=TRUE)

printgenvec <- cbind(names(meangenvec),meangenvec, meangenvechi, meangenveclow)
rownames(printgenvec) <- c()
colnames(printgenvec) <- c("#POP","SCORE", "upperlim", "lowerlim")
write.table(printgenvec,file=scorefile,sep="\t",row.names=FALSE,col.names=TRUE, quote=FALSE,append=FALSE)

write(paste("Q_X:",qtab[1],sep="\t"),file=outfile,append=TRUE)
write("",file=outfile,append=TRUE)
write(paste("Chi-squared distribution P-value:",qtab[2],sep="\t"),file=outfile,append=TRUE)
write("",file=outfile,append=TRUE)
# Add number of trait-associated SNPs
write(paste("Number of trait-associated SNPs:", length(effects), sep="\t"), file=outfile, append=TRUE)
write("",file=outfile,append=TRUE)

# Calculate sign-randomized chi-squared statistics
if( !is.null(pseudorep) ){
  print(paste("Computing sign-randomized P-values, using ",pseudorep," pseudo-replicates...",sep=""))
  totaltest <- ChiSquared(leaves_freqs,total_freqs,effects,Fmat, dercounts, totalcounts, randomize=TRUE)
  totalstat <- totaltest[[1]]
  randqtab <- round(totalstat[1],3)
  for(i in seq(2,pseudorep)){
    totaltest <- ChiSquared(leaves_freqs,total_freqs,effects,Fmat, dercounts, totalcounts, randomize=TRUE)
    totalstat <- totaltest[[1]]
    randqtab <- c( randqtab, round(totalstat[1],3) )
  }
  # Edit adding 1
  randpval <- (1+sum( as.numeric(randqtab) > as.numeric(qtab[1]) )) / (1+length(randqtab))
  write(paste("Sign-randomized P-value:",randpval,sep="\t"),file=outfile,append=TRUE)
  write("",file=outfile,append=TRUE)
}
import pysam
from optparse import OptionParser

parser = OptionParser("$prog [options]")
parser.add_option("-i", "--infile", dest="infile", help="Input GWAS+freq file", default=None, type="string")
parser.add_option("-p", "--minp", dest="minp", help="Minimum p-value allowed", default=1, type="float")
parser.add_option("-o", "--outfile", dest="outfile", help="Output file", default=None, type="string")
parser.add_option("-s", "--sep", dest="sep", help="Number of valid SNPs separating each printed SNP", default=None, type="int")
(options, args) = parser.parse_args()

infile = pysam.Tabixfile(options.infile, mode='r')
outfile = open(options.outfile, "w")

# Keep only the columns needed downstream in the header
readheader = infile.header
for x in readheader:
    header = x
header = header.split("\t")
header = list(filter(lambda a: a != "REF" and a != "ALT" and a != "ANC" and a != "DER" and a != "SE" and a != "PVAL", header))
header = "\t".join(header)
outfile.write(''.join(header) + '\n')

# Print every sep-th SNP whose p-value is above the "non-associated" cutoff
i = 0
for line in infile.fetch():
    fields = line.strip().split("\t")
    Pval = float(fields[9])
    if Pval > options.minp:
        if i == options.sep:
            i = 0
            finalvec = fields[0:3] + [fields[7]] + fields[10:]
            outfile.write("\t".join(finalvec) + '\n')
        i += 1
import pysam
import numpy as np
from optparse import OptionParser
import subprocess

parser = OptionParser("$prog [options]")
parser.add_option("-i", "--infile", dest="infile", help="Input GWAS+freq file", default=None, type="string")
parser.add_option("-b", "--bedfile", dest="bedfile", help="Bed file with LD partitions (def None)", default=None, type="string")
parser.add_option("-p", "--maxp", dest="maxp", help="Maximum p-value allowed", default=1, type="float")
parser.add_option("-o", "--outfile", dest="outfile", help="Output file", default=None, type="string")
(options, args) = parser.parse_args()

bedfile = open(options.bedfile, "r")
infile = pysam.Tabixfile(options.infile, mode='r')
outfile = open(options.outfile, "w")
logmaxp = np.log10(options.maxp)

# Keep only the columns needed downstream in the header
readheader = infile.header
for x in readheader:
    header = x
header = header.split("\t")
header = list(filter(lambda a: a != "REF" and a != "ALT" and a != "ANC" and a != "DER" and a != "SE" and a != "PVAL", header))
header = "\t".join(header)
outfile.write(''.join(header) + '\n')

# For each LD block, keep the variant with the lowest association p-value
for line in bedfile:
    regfields = line.strip("\n").split("\t")
    regchr = regfields[0].strip()
    regstart = int(regfields[1])
    regend = int(regfields[2])
    CurrLogPval = 0
    CurrPval = None
    CurrBest = None
    try:
        for elem in infile.fetch(regchr, regstart - 1, regend):
            fields = elem.strip().split("\t")
            Pval = float(fields[9])
            if Pval == 0.0:
                Pval = 1e-20
            LogPval = np.log10(Pval)
            if LogPval < CurrLogPval:
                CurrBest = fields[0:3] + [fields[7]] + fields[10:]
                CurrLogPval = LogPval
                CurrPval = Pval
    except:
        continue
    # Only report the block if its top variant passes the p-value threshold
    if CurrLogPval < logmaxp and CurrBest != None:
        outfile.write("\t".join(CurrBest) + '\n')
Support
- Future updates