Evaluating the robustness of polygenic adaptation to the choice of the GWAS cohort
A - Neutrality test for polygenic scores
Workflow
Step 0: download software and packages
This workflow is built with Snakemake. Please install Snakemake following the instructions here:
https://snakemake.readthedocs.io/en/stable/getting_started/installation.html
Install glactools to calculate allele frequencies for a given population panel. Follow the instructions here:
https://github.com/grenaud/glactools
Step 1: download all files
Modify the values of all keys in config.yaml with your own paths and directories (do not modify the key names unless you change them accordingly in all Snakemake files). If you are working with GRCh37, download files from:
# EPO file
wget ftp://ftp.healthtech.dtu.dk/public/EPO/all_hg19.epo.gz; wget ftp://ftp.healthtech.dtu.dk/public/EPO/all_hg19.epo.gz.tbi
# FAI file
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.fai
# LBD_blocks
# Download accordingly based on the ancestry of the populations you are working with, e.g. EUR
wget https://bitbucket.org/nygcresearch/ldetect-data/src/master/EUR/fourier_ls-all.bed
# Summary stats
# You can find a list of traits available for the UKBB by the Neale lab at: https://docs.google.com/spreadsheets/d/1kvPoupSzsSFBNSztMzl04xMoSC3Kcx3CrjVf4yBmESU/edit#gid=178908679. For instance, for height:
wget https://broad-ukb-sumstats-us-east-1.s3.amazonaws.com/round2/additive-tsvs/50_irnt.gwas.imputed_v3.both_sexes.tsv.bgz -O 50_irnt.gwas.imputed_v3.both_sexes.tsv.bgz
# VCF file - 1000 Genomes can be downloaded from here:
# http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/
Most files for GRCh38 are also available. However, you will need to lift over the LBD_blocks files.
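For orientation, here is a minimal config.yaml sketch. Only 'pops_panel' and 'acf_file' are key names confirmed by this README; every other key name and all paths are illustrative placeholders that you should match against the keys actually used in the .smk files:

# sketch only -- key names other than pops_panel/acf_file are placeholders
epo_file: /path/to/all_hg19.epo.gz
fai_file: /path/to/human_g1k_v37.fasta.fai
lbd_blocks: /path/to/fourier_ls-all.bed
sumstats: /path/to/50_irnt.gwas.imputed_v3.both_sexes.tsv.bgz
pops_panel: /path/to/pops_panel_example_1000GP.txt
acf_file: /path/to/output/allpops.acf.gz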
Step 2: Genomic data and allele frequencies
Get the allele frequencies of the populations of interest (e.g. populations from the 1000 Genomes Project) by running vcf2acf.smk with Snakemake. The first rule prepares your VCF file (filters at the variant and/or individual level).
In the config file, add the path/filename under 'pops_panel'. This is a two-column, tab-separated file with no header: the first column corresponds to the sample ID and the second to the population it belongs to. Example:
head -n1 pops_panel_example_1000GP.txt
HG00096 GBR
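If you are working with the 1000 Genomes Project, a file in this format can be derived from the project's sample panel; a sketch, assuming the panel file distributed with the 20130502 release (header line skipped, columns sample/pop):

tail -n +2 integrated_call_samples_v3.20130502.ALL.panel | cut -f1,2 > pops_panel.txt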
Once the config file is ready, run Snakemake as follows:
snakemake --snakefile path/to/vcf2acf.smk --cores XX
The final output of this step corresponds to the file under the key 'acf_file' in config.yaml. This file will contain the allele frequencies for each site in each population. Intermediate files will be deleted (remove the temp() calls within the rules to keep them). Output example (use glactools view to check the output):
#chr coord REF,ALT root anc GBR
1 19276348 G,A 0,0:0 0,0:0 192,14:0
More info on how to use glactools is here: https://github.com/grenaud/glactools
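For instance, to inspect the first records (the filename here is a placeholder for whatever you set under 'acf_file'):

glactools view allpops.acf.gz | head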
Compute Qx statistic
Before computing polygenic scores, if you are working with more than one GWAS, make sure their effect sizes are polarized by the same allele (e.g. the derived allele); a sketch of this step follows.
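A minimal sketch of such a polarization step in Python, assuming a summary-stats table with hypothetical columns A1 (effect allele), DER (derived allele) and BETA (effect size):

import pandas as pd

# Hypothetical column names: A1 = effect allele, DER = derived allele, BETA = effect size
df = pd.read_csv("sumstats.tsv", sep="\t")
flip = df["A1"] != df["DER"]                   # effect allele is not the derived allele
df.loc[flip, "BETA"] = -df.loc[flip, "BETA"]   # re-polarize beta to the derived allele
df.to_csv("sumstats_derived.tsv", sep="\t", index=False)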
A. rule polyAdapt_freqs
Step 1: Run scripts/acf2ukbfreq_byP.py to get a tabulated file with all the info needed. The output is a file with the population allele frequencies and some extra info about each variant (chr, position, alleles, beta, etc.). It currently only works for the UK Biobank; however, you can easily modify it if you are using different summary stats. Where the script reads 'gwasfile', make sure the column numbers and names of the variables of interest correspond to those in your summary stats (e.g. the beta value). ONLY LINES 3-43 WILL NEED MODIFICATIONS! The output contains a new column DEREFFECT, as we polarised the betas by the derived allele for comparison purposes (by default, the UK Biobank betas are polarised by allele 'a1').
#CHROM POS SNPID REF ALT ANC DER DEREFFECT SE PVAL GBR XXX
1 889638 1:889638 G C G C -2.18042e-04 3.36098e-03 9.48274e-01 14,192 XXX,XXX
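To run this step outside Snakemake, the invocation mirrors the rule in the Code Snippets section (filenames are placeholders; note the rule calls a variant of the script named acf2ukbfreq_byP_Alba.py):

python scripts/acf2ukbfreq_byP.py -a pops_panel.txt -g allpops.acf.gz -o gwasfreqs_pheno.tsv
# sort, compress and index so the next two scripts can query the file with tabix:
cat <(head -1 gwasfreqs_pheno.tsv) <(tail -n+2 gwasfreqs_pheno.tsv | sort -k1,1 -k2,2g) | bgzip -c > gwasfreqs_pheno.tsv.gz
tabix -s 1 -b 2 -e 2 gwasfreqs_pheno.tsv.gz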
Step 2: Run scripts/partitionUKB_byP.py to get the "candidate variants". It uses the output from Step 1 and the LD blocks from Berisa and Pickrell (2016). It extracts the variant with the lowest association p-value in each block (if at least one variant passes the genome-wide p-value threshold of 5e-8). The p-value threshold can be changed with -p 5e-08. Output example:
#CHROM POS SNPID DEREFFECT GBR XXX
1 930751 1:930751 5.22061e-03 34,172
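Example invocation (filenames are placeholders; the flags match the Snakemake rule in the Code Snippets section):

python scripts/partitionUKB_byP.py -i gwasfreqs_pheno.tsv.gz -b fourier_ls-all.bed -o gwasfreqs_candidates_pheno.tsv -p 5e-08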
Step 3: Run scripts/extractneutral_byP.py to get "neutral variants". As in Step 2, you can change the p-value threshold for what you consider "non-associated" variants with -p 1e-5. The output file has the same format as in Step 2.
#CHROM POS SNPID DEREFFECT GBR XXX
1 958953 1:958953 1.01310e-02 54,152
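Example invocation (filenames are placeholders; -s 20 is the spacing between retained SNPs used in the Snakemake rule):

python scripts/extractneutral_byP.py -i gwasfreqs_pheno.tsv.gz -o gwasfreqs_neutral_pheno.tsv -s 20 -p 1e-5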
B. rule polyAdapt_qx
Step 4: Run scripts/CalcQX_edit4parallel_Alba.R. It generates two outputs:
- QX_report: contains the PRS for each population, the QX value, the chi-squared p-value, the number of trait-associated SNPs and the sign-randomised p-value.
- PRS: a tab-separated text file with one score per population (the score file written by the script also includes upperlim and lowerlim confidence-limit columns):
POP SCORE
GBR 0.55
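A standalone invocation, taken from the usage comment at the top of the script (filenames are placeholders):

Rscript scripts/CalcQX_edit4parallel_Alba.R -w gwasfreqs_candidates_pheno.tsv -e gwasfreqs_neutral_pheno.tsv -o qxreport.txt -s genscores.txt -n 1000 -j 1000

To run both rules through Snakemake instead: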
snakemake --snakefile path/to/polyadapt.smk --cores XX
- NB! CURRENTLY UNDER DEVELOPMENT
B - Assessing different association methods
Quality control filters (an example PLINK 2.0 command implementing them follows this list):
- Genotyped SNPs (~800,000)
- MAF > 0.001
- HWE p-value > 1e-10
- Autosomes
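A sketch of these filters as a single PLINK 2.0 command (input/output names are placeholders; the --maf and --hwe values match the association command below):

plink2 --bgen allind.bgen --sample allind.sample \
  --maf 0.001 --hwe 1e-10 --autosome \
  --make-pgen --out qc_filtered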
Association models: examples
- Linear model implemented in PLINK 2.0 (example below; note that current PLINK 2.0 uses --glm for association tests, replacing the retired --linear):

WHITE="/home/white" # per-cohort output prefix; BRI would be used analogously for a second cohort
BRI="/home/bri"
HEIGHT="/home/id_height.tsv" # tabulated file with #FID IID HEIGHT
cd /home/white
plink2 --bgen allind.bgen --sample allind.sample --hwe 1e-10 --maf 0.001 --keep-fam filtered_ind.list.txt \
  --glm hide-covar --pheno $HEIGHT --covar allcov.tsv --covar-col-nums 3-4,8-27 --variance-standardize --out $WHITE

- Linear mixed model implemented in BOLT-LMM
- Meta-analysis implemented in METAL, either inverse-variance based or sample-size based (a script sketch follows this list)
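A minimal METAL script sketch. SCHEME/MARKER/ALLELE/EFFECT/STDERR/PVALUE/WEIGHT/PROCESS/ANALYZE are standard METAL directives, but the column names (SNP, A1, A2, BETA, SE, P, N) and input filenames are assumptions to adapt to your summary stats:

# Run as: metal < meta_script.txt
# Inverse-variance-based meta-analysis:
SCHEME STDERR
MARKER SNP
ALLELE A1 A2
EFFECT BETA
STDERR SE
PROCESS gwas_cohort1.txt
PROCESS gwas_cohort2.txt
ANALYZE
# For a sample-size-based analysis, replace SCHEME STDERR with:
# SCHEME SAMPLESIZE
# PVALUE P
# WEIGHT N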
Code Snippets
shell:
    """
    python scripts/acf2ukbfreq_byP_Alba.py -a {input.popfile} -g {input.infile} -o {output.outmp}
    cat <(head -1 {output.outmp}) <(tail -n+2 {output.outmp} | sort -k1,1 -k2,2g) | bgzip -c > {output.outfile}
    tabix -s 1 -b 2 -e 2 {output.outfile}
    python scripts/partitionUKB_byP.py -i {output.outfile} -b {input.lbd} -o {output.candi} -p {params.pvalCan}
    python scripts/extractneutral_byP.py -i {output.outfile} -o {output.neut} -s 20 -p {params.pvalNeut}
    """
shell:
    """
    Rscript scripts/CalcQX_edit4parallel_Alba.R -w {input.candi} -e {input.neut} -o {output.qx} -s {output.scores} -n 1000 -j 1000
    """
library("optparse")

# terminal line
# Rscript CalcQX.R -w gwasfreqs_candidates_pheno.tsv -e gwasfreqs_neutral_pheno.tsv -o qxreport.txt -n 1000 -j 1000 -s genscores.txt

option_list = list(
  make_option(c("-w", "--gwasfile"), type="character", default=NULL, help="GWAS input file name"),
  make_option(c("-e", "--neutfile"), type="character", default=NULL, help="Neutral input file name"),
  make_option(c("-o", "--outfile"), type="character", default="output.txt", help="Output file name"),
  make_option(c("-p", "--pops"), type="character", default="ALL", help="Populations to be tested"),
  make_option(c("-n", "--numrep"), type="numeric", default=NULL, help="Number of sign-randomized replicates"),
  make_option(c("-j", "--emprep"), type="numeric", default=NULL, help="Number of frequency-matched replicates"),
  make_option(c("-s", "--scorefile"), type="character", default=NULL, help="Score file name"),
  make_option(c("-l", "--leeway"), type="numeric", default=0.01, help="Leeway for empirical frequency matching"),
  make_option(c("-f", "--minorfreq"), type="numeric", default=0.05, help="Minor allele frequency cutoff for covariance matrix")
);

opt_parser = OptionParser(option_list=option_list);
opt = parse_args(opt_parser);

if (is.null(opt$gwasfile)){
  print_help(opt_parser)
  stop("GWAS file name must be supplied.\n", call.=FALSE)
}
if (is.null(opt$neutfile)){
  print_help(opt_parser)
  stop("Neutral file name must be supplied.\n", call.=FALSE)
}
if (is.null(opt$outfile)){
  print_help(opt_parser)
  stop("Output file name must be supplied.\n", call.=FALSE)
}
if (is.null(opt$scorefile)){
  print_help(opt_parser)
  stop("Score file name must be supplied.\n", call.=FALSE)
}

gwasfile <- opt$gwasfile
neutfile <- opt$neutfile
outfile <- opt$outfile
scorefile <- opt$scorefile
pops <- opt$pops
#pops <- strsplit(opt$pops,",")[[1]]
pseudorep <- opt$numrep
emprep <- opt$emprep # Multiple-testing
popmatch <- opt$popmatch
leeway <- opt$leeway
minorfreq <- opt$minorfreq
#pops <- strsplit(readLines(pops), ",")[[1]]
#pops <- c("ESN","MSL", "CEU", "TSI", "CDX", "JPT", "PEL")

# Parse "ancestral,derived" count strings into frequencies and counts
ObtainFreqs <- function(countdat){
  dercounts <- apply(countdat,c(1,2),function(x){splitted <- strsplit(x,",")[[1]]; return( as.numeric(splitted[2]) )})
  totalcounts <- apply(countdat,c(1,2),function(x){splitted <- strsplit(x,",")[[1]]; return( as.numeric(splitted[2])+as.numeric(splitted[1]) )})
  dersum <- apply(dercounts,1,function(x){sum(x)})
  totalsum <- apply(totalcounts,1,function(x){sum(x)})
  totalfreq <- dersum/totalsum
  freqs <- apply(countdat,c(1,2),function(x){splitted <- strsplit(x,",")[[1]]; return( as.numeric(splitted[2]) / (as.numeric(splitted[2])+as.numeric(splitted[1])) )})
  return(list(as.matrix(freqs), totalfreq, as.matrix(dercounts), as.matrix(totalcounts)))
}

# Trim out white space
trim <- function (x) gsub("^\\s+|\\s+$", "", x)

LoadCounts <- function(filename,pops){
  table <- as.matrix(read.table(filename,header=TRUE,sep="\t",strip.white=TRUE,comment.char="!"))
  if(paste(pops,collapse=",") == "ALL"){ tokeep <- seq(5,length(colnames(table)))
  } else{ tokeep <- which(colnames(table) %in% pops)}
  tokeep <- c(1,2,3,4,tokeep)
  table <- table[,tokeep]
  table[,1] <- trim(table[,1])
  table[,2] <- trim(table[,2])
  return(table)
}

# Covariance (F) matrix of scaled neutral frequencies across populations
ComputeFmat <- function(neut_leaves_freqs, neut_total_freqs){
  leaves <- colnames(neut_leaves_freqs)
  #checksegneut <- which( apply(neut_leaves_freqs,1,sum)/length(leaves) < 0.95 & apply(neut_leaves_freqs,1,sum)/length(leaves) > 0.05 )
  #neut_leaves_freqs <- neut_leaves_freqs[checksegneut,]
  checksegneut <- which(neut_total_freqs < 0.95 & neut_total_freqs > 0.05 )
  neut_leaves_freqs <- neut_leaves_freqs[checksegneut,]
  neut_leaves_freqs_means <- apply(neut_leaves_freqs, 1, mean)
  mean_hetero <- neut_leaves_freqs_means*(1-neut_leaves_freqs_means)
  numSNPs <- length(neut_leaves_freqs_means)
  Fmat <- sapply(seq(1,dim(neut_leaves_freqs)[2]),function(x){
    sapply(seq(1,dim(neut_leaves_freqs)[2]),function(y){
      cov(neut_leaves_freqs[,x]/sqrt(mean_hetero), neut_leaves_freqs[,y]/sqrt(mean_hetero))
    })
  })
  colnames(Fmat) <- colnames(neut_leaves_freqs)
  rownames(Fmat) <- colnames(neut_leaves_freqs)
  return(Fmat)
}

ChiSquared <- function(leaves_freqs,total_freqs,effects,Fmat, dercounts, totalcounts, randomize=FALSE){
  leaves <- colnames(leaves_freqs)
  checkseg <- which(total_freqs < 0.95 & total_freqs > 0.05 )
  leaves_freqs <- leaves_freqs[c(checkseg),]
  effects <- effects[c(checkseg)]
  leaves_freqs <- leaves_freqs[! is.infinite(effects),]
  effects <- effects[! is.infinite(effects)]
  # Randomize effects if necessary
  if(randomize == TRUE){effects <- effects * sample(c(-1,1),length(effects),replace=TRUE)}
  # Compute mean genetic values
  meangen <- apply(leaves_freqs * effects, 2, function(x){sum(x)})
  # Scale by average genetic value
  meangen <- (meangen - mean(meangen))
  # Compute the estimated ancestral genetic variance over all populations
  meanfreqs <- apply(leaves_freqs,1,mean)
  varmean <- sum(sapply(seq(1,length(meanfreqs)),function(i){
    score = meanfreqs[i]*(1-meanfreqs[i])*effects[i]^2
    return(score)
  }))
  meangenvec <- 2*meangen / sqrt( 4*varmean)
  # Compute error PRS
  alpha <- dercounts + 1
  beta <- 1 + totalcounts - dercounts
  num = alpha * beta
  denom = (alpha+beta)^2 * (alpha+beta+1)
  varpl = num/denom
  varpl <- varpl[c(checkseg),]
  varprs = effects^2 * varpl
  varprs <- apply(varprs, 2, function(x) sum(x))
  se = sqrt(varprs)
  # meangen already subtracted mean
  normcilow <- 2*(meangen - 1.96*se)/sqrt(4*varmean)
  normcihi <- 2*(meangen + 1.96*se)/sqrt(4*varmean)
  # Compute Q_X statistic
  numerator <- t(meangen) %*% solve(Fmat) %*% meangen
  denominator <- varmean
  Qteststat <- numerator / denominator
  Pval <- 1 - pchisq(Qteststat,qr(Fmat)$rank)
  allstats <- c(Qteststat,Pval)
  return(list(allstats, meangenvec, normcihi, normcilow))
}

# Load GWAS data
print("Loading data...")
data <- LoadCounts(gwasfile, pops)
leaves_counts <- as.data.frame(data[,seq(5,dim(data)[2])])
raw_leaves_freqs <- ObtainFreqs(leaves_counts)
leaves_freqs <- raw_leaves_freqs[[1]]
total_freqs <- raw_leaves_freqs[[2]]
dercounts <- raw_leaves_freqs[[3]]
totalcounts <- raw_leaves_freqs[[4]]
effects <- as.numeric(data[,4])

if (sum(apply(leaves_freqs, 1, function(x) sum(is.na(x))))!=0){
  leaves_freqs <- na.omit(leaves_freqs)
  total_freqs <- na.omit(total_freqs)
  print(paste('Omitting', length(leaves_counts[,1])-length(leaves_freqs[,1]), 'out of', length(leaves_counts[,1]), 'character SNPs', sep = ' '))
}

# Load Neutral data
neutdata <- LoadCounts(neutfile, pops)
neut_leaves_counts <- as.data.frame(neutdata[,seq(5,dim(neutdata)[2])])
raw_neut_leaves_freqs <- ObtainFreqs(neut_leaves_counts)
neut_leaves_freqs <- raw_neut_leaves_freqs[[1]]
neut_total_freqs <- raw_neut_leaves_freqs[[2]]

# Calculate covariance matrix
print("Computing covariance matrix...")
Fmat <- ComputeFmat(neut_leaves_freqs, neut_total_freqs)

# Calculate chi-squared statistics
print("Computing Q_X statistic...")
totaltest <- ChiSquared(leaves_freqs,total_freqs,effects,Fmat, dercounts, totalcounts, randomize=FALSE)
totalstat <- totaltest[[1]]
meangenvec <- totaltest[[2]]
meangenvechi <- totaltest[[3]]
meangenveclow <- totaltest[[4]]
qtab <- cbind(round(totalstat[1],3),totalstat[2])
colnames(qtab) <- c("Q_X","Pval")

#### Only calculate pval and genetic scores
write("Genetic scores",file=outfile,append=FALSE)
write(paste(names(meangenvec),collapse="\t"),file=outfile,append=TRUE)
write(paste(meangenvec,collapse="\t"),file=outfile,append=TRUE)
write("",file=outfile,append=TRUE)

printgenvec <- cbind(names(meangenvec),meangenvec, meangenvechi, meangenveclow)
rownames(printgenvec) <- c()
colnames(printgenvec) <- c("#POP","SCORE", "upperlim", "lowerlim")
write.table(printgenvec,file=scorefile,sep="\t",row.names=FALSE,col.names=TRUE, quote=FALSE,append=FALSE)

write(paste("Q_X:",qtab[1],sep="\t"),file=outfile,append=TRUE)
write("",file=outfile,append=TRUE)
write(paste("Chi-squared distribution P-value:",qtab[2],sep="\t"),file=outfile,append=TRUE)
write("",file=outfile,append=TRUE)
# Add number of trait-associated SNPs
write(paste("Number of trait-associated SNPs:", length(effects), sep="\t"), file=outfile, append=TRUE)
write("",file=outfile,append=TRUE)

# Calculate sign-randomized chi-squared statistics
if( !is.null(pseudorep) ){
  print(paste("Computing sign-randomized P-values, using ",pseudorep," pseudo-replicates...",sep=""))
  totaltest <- ChiSquared(leaves_freqs,total_freqs,effects,Fmat, dercounts, totalcounts, randomize=TRUE)
  totalstat <- totaltest[[1]]
  randqtab <- round(totalstat[1],3)
  for(i in seq(2,pseudorep)){
    totaltest <- ChiSquared(leaves_freqs,total_freqs,effects,Fmat, dercounts, totalcounts, randomize=TRUE)
    totalstat <- totaltest[[1]]
    randqtab <- c( randqtab, round(totalstat[1],3) )
  }
  # Edit adding 1
  randpval <- (1+sum( as.numeric(randqtab) > as.numeric(qtab[1]) )) / (1+length(randqtab))
  write(paste("Sign-randomized P-value:",randpval,sep="\t"),file=outfile,append=TRUE)
  write("",file=outfile,append=TRUE)
}
import pysam
from optparse import OptionParser

parser = OptionParser("$prog [options]")
parser.add_option("-i", "--infile", dest="infile", help="Input GWAS+freq file", default=None, type="string")
parser.add_option("-p", "--minp", dest="minp", help="Minimum p-value allowed", default=1, type="float")
parser.add_option("-o", "--outfile", dest="outfile", help="Output file", default=None, type="string")
parser.add_option("-s", "--sep", dest="sep", help="Number of valid SNPs separating each printed SNP", default=None, type="int")
(options, args) = parser.parse_args()

infile = pysam.Tabixfile(options.infile, mode='r')
outfile = open(options.outfile, "w")

# Keep only the columns needed downstream in the header
readheader = infile.header
for x in readheader:
    header = x
header = header.split("\t")
header = list(filter(lambda a: a != "REF" and a != "ALT" and a != "ANC" and a != "DER" and a != "SE" and a != "PVAL", header))
header = "\t".join(header)
outfile.write(''.join(header) + '\n')

# Print every sep-th SNP whose p-value is above the "non-associated" cutoff
i = 0
for line in infile.fetch():
    fields = line.strip().split("\t")
    Pval = float(fields[9])
    if Pval > options.minp:
        if i == options.sep:
            i = 0
            finalvec = fields[0:3] + [fields[7]] + fields[10:]
            outfile.write("\t".join(finalvec) + '\n')
        i += 1
import pysam
import numpy as np
from optparse import OptionParser
import subprocess

parser = OptionParser("$prog [options]")
parser.add_option("-i", "--infile", dest="infile", help="Input GWAS+freq file", default=None, type="string")
parser.add_option("-b", "--bedfile", dest="bedfile", help="Bed file with LD partitions (def None)", default=None, type="string")
parser.add_option("-p", "--maxp", dest="maxp", help="Maximum p-value allowed", default=1, type="float")
parser.add_option("-o", "--outfile", dest="outfile", help="Output file", default=None, type="string")
(options, args) = parser.parse_args()

bedfile = open(options.bedfile, "r")
infile = pysam.Tabixfile(options.infile, mode='r')
outfile = open(options.outfile, "w")
logmaxp = np.log10(options.maxp)

# Keep only the columns needed downstream in the header
readheader = infile.header
for x in readheader:
    header = x
header = header.split("\t")
header = list(filter(lambda a: a != "REF" and a != "ALT" and a != "ANC" and a != "DER" and a != "SE" and a != "PVAL", header))
header = "\t".join(header)
outfile.write(''.join(header) + '\n')

# For each LD block, keep the variant with the lowest association p-value
for line in bedfile:
    regfields = line.strip("\n").split("\t")
    regchr = regfields[0].strip()
    regstart = int(regfields[1])
    regend = int(regfields[2])
    CurrLogPval = 0
    CurrPval = None
    CurrBest = None
    try:
        for elem in infile.fetch(regchr, regstart - 1, regend):
            fields = elem.strip().split("\t")
            Pval = float(fields[9])
            if Pval == 0.0:
                Pval = 1e-20
            LogPval = np.log10(Pval)
            if LogPval < CurrLogPval:
                CurrBest = fields[0:3] + [fields[7]] + fields[10:]
                CurrLogPval = LogPval
                CurrPval = Pval
    except:
        continue
    # Only report the block if its top variant passes the p-value threshold
    if CurrLogPval < logmaxp and CurrBest != None:
        outfile.write("\t".join(CurrBest) + '\n')
Support
- Future updates