Genomic prediction for free amino acid traits in Arabidopsis seeds
Data and scripts necessary to run genomic partitioning and prediction models on free amino acid traits measured in a diverse panel of 313 Arabidopsis lines.
Software requirements
- R v3.6.0
- PLINK v1.90b4 64-bit
- Miniconda3 (includes conda v4.6.7)
- Snakemake v5.4.2 (install to a virtual environment)
To install Snakemake in a virtual environment, run:
conda env create --name multiblup --file environment.yaml
For future use, activate this environment with:
source activate multiblup
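For reference, environment.yaml pins these dependencies. A minimal sketch of what such a file could look like (the channels and exact package names here are assumptions — use the file shipped with the repository):

name: multiblup
channels:
  - conda-forge
  - bioconda
dependencies:
  - snakemake=5.4.2
  - r-base=3.6.0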
Data
Edit the config.yaml file to specify the paths for the genotype and phenotype data.
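The exact keys are defined in the repository's config.yaml; purely as an illustration (the key names below are assumptions), the file maps short names to the data paths used throughout the workflow:

geno: data/processed/call_method_75_TAIR9   # PLINK binary fileset prefix
pheno: data/processed/aa360_BLUEs.txt       # phenotype BLUEs
covar: data/processed/aa360_covars.txt      # principal component covariates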
Genotype data
Before running Snakemake, download the Arabidopsis Regional Mapping (RegMap) data (Horton et al. 2012):
cd data/external
wget https://github.com/Gregor-Mendel-Institute/atpolydb/raw/master/250k_snp_data/call_method_75.tar.gz
tar -xvf call_method_75.tar.gz
Phenotype and covariate data
- data/raw/aa360_raw_ratios.csv
  Raw measurements (nmol/mg seed) of 65 free amino acid traits measured in 313 accessions of Arabidopsis thaliana, as reported by Angelovici et al. 2013.
- data/processed/aa360_BLUEs.txt
  Environment-adjusted best linear unbiased estimates (BLUEs) for the 65 free amino acid traits, calculated using the HAPPI-GWAS pipeline from Slaten et al. 2020. See notebooks/01-calculate_BLUEs.Rmd for details.
- data/processed/aa360_covars.txt
  Principal components from the genotype data, used to model population structure.
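Both the phenotype and covariate files follow the PLINK/LDAK convention of family and individual ID columns (FID, IID) followed by one column per trait or covariate. A sketch of the expected layout (trait names and values invented for illustration; the AtRegMap family ID mirrors the PED conversion script under Code Snippets):

FID       IID    trait1   trait2
AtRegMap  6909   0.84     1.32
AtRegMap  6913   0.71     1.05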
Snakemake
The Snakemake workflow includes a Snakefile (which specifies the steps of the workflow and its variables) and rule files in rules/ that run specific commands.
Snakefile
This file specifies variable names for use in the Snakemake workflow. There are notes on how to include specific pathways and MapMan bincodes. Other variables include trait names, the number of random SNP sets, and the number of cross-validations to perform.
The rule all: section is a pseudorule that tracks the expected output of each command in the rule files; a minimal sketch follows.
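In Snakemake, such a pseudorule declares the final outputs as its inputs, so invoking snakemake with no explicit target builds the entire workflow. A sketch only — the target files below are illustrative, based on the outputs listed later in this README, not copied from the repository's Snakefile:

rule all:
    input:
        "reports/gblup.RData",
        "reports/multiblup.RData",
        expand("reports/null_dist_results_pathways/{pathway}_null.csv",
               pathway=PATHWAYS.keys())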
To run the workflow, edit the cluster configuration settings in submit.json, then run submit.sh (sketched below).
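submit.sh typically wraps a cluster-mode invocation along these lines, reading per-rule resources from submit.json (a sketch assuming a SLURM scheduler; the cluster.* keys are assumptions — adapt the --cluster string to your system and the keys actually present in submit.json):

snakemake --jobs 50 \
    --cluster-config submit.json \
    --cluster "sbatch --time {cluster.time} --mem {cluster.mem}"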
Rules
rules/common.smk - specifies location of config.yaml file
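In this layout, common.smk usually amounts to little more than the following (a sketch mirroring standard Snakemake usage, not the file's verbatim contents):

configfile: "config.yaml"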
rules/prep_data.smk
- filter and convert genotype data to PED format
- export TAIR10 Ensembl gene IDs for the SNP data
- calculate SNP weightings (these aren't actually used and could be skipped)
- run PCA and export a covariate file (two PCs are included here; adjust depending on your data)
rules/cross_validation.smk
- create training and testing sets for cross-validation
- Note: this step may need to be run separately on the command line; repeating the command via loops/Snakemake sometimes does not generate unique folds (a manual sketch follows this list)
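If the folds come out identical, the LDAK call shown in the cross_validation.smk snippet under Code Snippets can be repeated by hand. A hedged shell sketch — the output prefix, bfile path, and fold count here are illustrative, not the workflow's actual parameter values:

for i in 1 2 3 4 5; do
    ldak --cut-folds data/processed/cross_validation/cv${i} \
         --bfile data/processed/call_method_75_TAIR9 \
         --num-folds 5
done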
rules/gblup.smk
- export kinship matrix for all SNPs
- estimate variances and heritability
- genomic prediction and cross-validation (REML + calculating BLUPs)
- summarize GBLUP output (reports/gblup.RData)
rules/multiblup.smk
- filter and export the list of pathway SNPs (includes a 2.5 kb buffer before and after each pathway gene)
- calculate kinship matrices for each pathway (list1) and all remaining SNPs (list2)
- estimate variances and heritability for each SNP partition
- genomic prediction and cross-validation with multiple kernels/random effects (REML + calculating BLUPs)
- summarize MultiBLUP output (reports/multiblup.RData)
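For orientation, GBLUP fits a single genomic random effect, while MultiBLUP partitions it across the two kinship matrices built from list1 and list2. In standard notation (a sketch, not taken verbatim from the repository):

$$y = X\beta + g + e, \qquad g \sim N(0, \sigma_g^2 K) \quad \text{(GBLUP)}$$

$$y = X\beta + g_1 + g_2 + e, \qquad g_k \sim N(0, \sigma_{g_k}^2 K_k),\; k = 1, 2 \quad \text{(MultiBLUP)}$$

where K_1 is computed from the pathway SNPs and K_2 from the remaining SNPs.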
rules/null_distribution.smk
- first option: generate 5000 random gene groups with a uniform distribution of SNP counts. This is useful for examining the influence of partition size on heritability explained/model fit, or for comparing many partitions of varying size against an empirical distribution (output: reports/lr_null_results.csv)
- second option: generate 1000 random gene groups for each pathway (excludes pathway SNPs and samples a similar number of SNPs/genes)
- calculate kinship matrices for each random group
- estimate variances and heritability for each random SNP set
- summarize results across all 1000 gene groups (reports/null_dist_results_pathways/{pathway}_null.csv)
Notebooks
R notebooks to summarize results.
01-calculate_BLUEs.Rmd
Using raw FAA trait data, calculate BLUEs for each trait using the HAPPI-GWAS pipeline.
02-gblup_results.Rmd
Checks quality of model output and summarizes prediction results for the GBLUP and MultiBLUP models (e.g. proportion of heritability explained, likelihood ratio, prediction accuracy, reliability, bias, MSE)
03-process_null.Rmd
Examines properties of the random gene groups (e.g. the distribution of the likelihood ratio) and performs quantile regression to establish 95th percentiles for the proportion of heritability explained and the likelihood ratio (see the nice discussion of this approach in Edwards et al. 2016).
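As a concrete illustration, such a 95th-percentile threshold can be fit with the quantreg package. A minimal sketch, assuming the lr_null_results.csv columns described above (LR for the likelihood ratio, size for the number of SNPs in a group) — not the notebook's verbatim code:

library(quantreg)
library(readr)

null_dist <- read_csv("reports/lr_null_results.csv")

# model the 95th percentile of the likelihood ratio as a function of group size
fit <- rq(LR ~ size, tau = 0.95, data = null_dist)
summary(fit)

# a pathway is then flagged when its observed LR exceeds the fitted
# 95th percentile for random gene groups of comparable size
predict(fit, newdata = data.frame(size = 5000))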
04-multiblup_results_by_pathway.Rmd
Identifies pathways that pass significance criteria based on comparison to random gene groups with the same number of SNPs (proportion of h2, likelihood ratio, and increase in prediction accuracy).
05-figures.Rmd
Code to create figures used in the manuscript.
Project Organization (based on Cookiecutter data science)
Project based on the cookiecutter data science project template.
├── LICENSE
├── README.md <- The top-level README for developers using this project.
├── data
│ ├── external <- Data from third party sources.
│ ├── interim <- Intermediate data that has been transformed.
│ ├── processed <- The final, canonical data sets for modeling.
│ └── raw <- The original, immutable data dump.
│
├── docs <- manuscript documents
│
├── models <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks <- R notebooks. Naming convention is a number (for ordering),
│ the creator's initials, and a short `-` delimited description, e.g.
│ `1.0-jqp-initial-data-exploration`.
│
├── references <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports <- Generated analysis as HTML, PDF, LaTeX, etc.
│ └── figures <- Generated graphics and figures to be used in reporting
│
├── environment.yaml <- The requirements file for reproducing the analysis conda environment
│
├── src <- Source code for use in this project.
│ ├── data <- Scripts to download or generate data
│ │
│ ├── features <- Scripts to turn raw data into features for modeling
│ │
│ ├── models <- Scripts to train models and then use trained models to make
│ │ predictions
│ │
│ └── visualization <- Scripts to create exploratory and results oriented visualizations
│
├── Snakefile          <- Snakemake workflow to execute analyses
│
└── submit.json <- Configuration settings to run snakemake on a computing cluster
Code Snippets
run:
    shell("{ldak} --cut-folds {params.folds} \
        --bfile {params.bfile} \
        --num-folds {params.numfolds}")
run:
    shell("{ldak} --calc-kins-direct {params.outdir} \
        --bfile {params.bfile} \
        --ignore-weights YES \
        --power 0 \
        --kinship-raw YES")
run:
    shell("{ldak} --reml {params.out} \
        --pheno {input.pheno} \
        --pheno-name {params.trait} \
        --grm {params.grm} \
        --covar {input.covar} \
        --constrain YES \
        --reml-iter 500")
run:
    for i in CV:
        for j in INDEX:
            shell("{ldak} --reml {params.out}{i}.{j} \
                --pheno {input.pheno} \
                --pheno-name {params.trait} \
                --grm {params.grm} \
                --keep {params.keep}{i}.train{j} \
                --covar {input.covar} \
                --constrain YES \
                --reml-iter 500")
run:
    for i in CV:
        for j in INDEX:
            shell("{ldak} --calc-blups {params.out}{i}.{j} \
                --grm {params.grm} \
                --remlfile {params.reml}{i}.{j}.reml \
                --bfile {params.bfile} \
                --covar {input.covar}")
            shell("{ldak} --calc-scores {params.out}{i}.{j} \
                --bfile {params.bfile} \
                --power 0 \
                --scorefile {params.out}{i}.{j}.blup \
                --keep {params.keep}{i}.test{j} \
                --pheno {input.pheno} \
                --pheno-name {params.trait}")
run:
    shell("Rscript src/04_summarize_gblup.R {input.blues}")
run:
    shell("Rscript src/05_select_pathway_snps.R {params.pathway} \"{params.bincode}\"")
run:
    # partition kinship matrix
    # --ignore-weights YES & --power 0 based on LDAK recommendations for prediction
    shell("{ldak} --cut-kins {params.outdir} \
        --bfile {params.bfile} \
        --partition-number 2 \
        --partition-prefix {params.prefix}")
    shell("{ldak} --calc-kins {params.outdir} \
        --bfile {params.bfile} \
        --partition 1 \
        --ignore-weights YES \
        --power 0 \
        --kinship-raw YES")
    shell("{ldak} --calc-kins {params.outdir} \
        --bfile {params.bfile} \
        --partition 2 \
        --ignore-weights YES \
        --power 0 \
        --kinship-raw YES")
run:
    for p in PATHWAYS.keys():
        shell("{ldak} --reml {params.out}{p}/{params.trait}.multiblup \
            --pheno {input.pheno} \
            --pheno-name {params.trait} \
            --mgrm {params.mgrm}{p}/partition.list \
            --covar {input.covar} \
            --constrain YES \
            --reml-iter 500")
run:
    for p in PATHWAYS.keys():
        for i in CV:
            for j in INDEX:
                shell("{ldak} --reml {params.out}{p}/{params.trait}.cv{i}.{j} \
                    --pheno {input.pheno} \
                    --pheno-name {params.trait} \
                    --mgrm {params.mgrm}{p}/partition.list \
                    --keep {params.keep}{i}.train{j} \
                    --covar {input.covar} \
                    --constrain YES \
                    --reml-iter 100")
run:
    for p in PATHWAYS.keys():
        for i in CV:
            for j in INDEX:
                shell("{ldak} --calc-blups {params.out}{p}/{params.trait}.cv{i}.{j} \
                    --mgrm {params.mgrm}{p}/partition.list \
                    --remlfile {params.out}{p}/{params.trait}.cv{i}.{j}.reml \
                    --bfile {params.bfile} \
                    --covar {input.covar}")
                shell("{ldak} --calc-scores {params.out}{p}/{params.trait}.cv{i}.{j} \
                    --bfile {params.bfile} \
                    --power 0 \
                    --scorefile {params.out}{p}/{params.trait}.cv{i}.{j}.blup \
                    --keep {params.keep}{i}.test{j} \
                    --pheno {input.pheno} \
                    --pheno-name {params.trait}")
run:
    shell("Rscript src/06_summarize_multiblup.R {input.blues}")
shell:
    "Rscript src/07_null_group_sizes.R"
shell:
    "Rscript src/08_generate_null_gene_groups.R {params.num}"
run:
    for r in RANDOM:
        shell("{ldak} --cut-kins {params.prefix}{r} \
            --bfile {params.bfile} \
            --partition-number 2 \
            --partition-prefix {params.prefix}{r}/list")
        shell("{ldak} --calc-kins {params.prefix}{r} \
            --bfile {params.bfile} \
            --partition 1 \
            --ignore-weights YES \
            --power 0")
        shell("{ldak} --calc-kins {params.prefix}{r} \
            --bfile {params.bfile} \
            --partition 2 \
            --ignore-weights YES \
            --power 0")
run:
    for r in RANDOM:
        shell("{ldak} --reml {params.prefix}{r}/{params.trait}.h2 \
            --pheno {input.pheno} \
            --pheno-name {params.trait} \
            --mgrm {params.mgrm}{r}/partition.list \
            --covar {input.covar} \
            --constrain YES \
            --reml-iter 500")
run:
    shell("Rscript src/09_summarize_null.R")
shell:
    "Rscript src/08a_generate_null_gene_groups_pathways.R {params.pathway}"
run:
    for r in range(1, 1001):
        shell("{ldak} --cut-kins {params.prefix}{r} \
            --bfile {params.bfile} \
            --partition-number 2 \
            --partition-prefix {params.prefix}{r}/list")
        shell("{ldak} --calc-kins {params.prefix}{r} \
            --bfile {params.bfile} \
            --partition 1 \
            --ignore-weights YES \
            --power 0")
        shell("{ldak} --calc-kins {params.prefix}{r} \
            --bfile {params.bfile} \
            --partition 2 \
            --ignore-weights YES \
            --power 0")
run:
    for r in range(1, 1001):
        shell("{ldak} --reml {params.prefix}{r}/{params.trait}.h2 \
            --pheno {input.pheno} \
            --pheno-name {params.trait} \
            --mgrm {params.mgrm}{r}/partition.list \
            --covar {input.covar} \
            --constrain YES \
            --reml-iter 500")
run:
    shell("Rscript src/09a_summarize_null_pathways.R {params.pathway}")
run:
    shell("Rscript src/01_convert_to_ped.R")
    shell("plink --file {params.indir} \
        --recode 12 \
        --maf 0.05 \
        --mind 0.1 \
        --geno 0.1 \
        --make-bed \
        --out {params.outdir}")
shell:
    "Rscript src/02_extract_gene_ids.R {input}"
run:
    shell("{ldak} --cut-weights {params.outdir} \
        --bfile {params.bfile} \
        --section-kb 0.25")
    shell("{ldak} --calc-weights-all {params.outdir} \
        --bfile {params.bfile}")
library(tidyverse)

snp_data <- read.csv("data/external/call_method_75/call_method_75_TAIR9.csv",
                     header = TRUE, skip = 1)

# create the MAP file
# requires Chromosome and Position info
# columns: Chromosome/rsid/physical_distance/Position
map <- snp_data %>%
  select(Chromosome, Positions) %>%
  mutate(rsid = paste0("S", seq_len(n())),
         distance = 0) %>%
  select(Chromosome, rsid, distance, Positions) %>%
  write_delim(., "data/raw/call_method_75_TAIR9.map", col_names = FALSE)

# create the PED file
# rows are individuals
# columns are family/sample/paternal/maternal/sex/phenotype/marker1/marker2/etc

# individual ids
id <- colnames(snp_data[, -c(1:2)])

# get accession ids with phenotype data
acc360 <- read_delim("data/processed/pheno_file", delim = "\t")

ped <- snp_data[, -c(1:2)] %>%
  # recode genotypes to be diploid
  mutate_if(is.factor, funs(fct_recode(., GG = "G", TT = "T", AA = "A", CC = "C"))) %>%
  t() %>%
  as.data.frame(row.names = FALSE) %>%
  # add columns expected by plink (family/sample id/paternal/maternal/sex/phenotype)
  add_column(.,
             family = "AtRegMap",
             sample = gsub("X", "", id),
             paternal = 0,
             maternal = 0,
             sex = -9,
             phenotype = -9,
             .before = 1) %>%
  # only include accessions with phenotype data
  filter(sample %in% acc360$IID) %>%
  write_delim(., "data/raw/call_method_75_TAIR9.ped", col_names = FALSE)
library(tidyverse)
library(broom)

# use command line args
args <- commandArgs(trailingOnly = TRUE)

# phenotypes
pheno <- read.table(paste0(args[1]), header = TRUE) %>%
  gather(key = "trait", value = "phenotype", -FID, -IID)

# heritability
gblup_h2 <- tibble(file = list.files(path = "models/gblup_h2",
                                     pattern = ".reml$",
                                     full.names = TRUE)) %>%
  separate(file, sep = "/|[.]",
           into = c("source", "method", "trait", "method2", "fill"),
           remove = FALSE) %>%
  mutate(data = lapply(file, read.table, header = TRUE, skip = 13)) %>%
  unnest(data) %>%
  filter(Component %in% "Her_ALL") %>%
  select(trait, Heritability, Her_SD)

head(gblup_h2)

# log likelihood
gblup_llik <- tibble(file = list.files(path = "models/gblup_h2",
                                       pattern = ".reml$",
                                       full.names = TRUE)) %>%
  separate(file, sep = "/|[.]",
           into = c("source", "method", "trait", "metric", "fill", "fill2"),
           remove = FALSE) %>%
  mutate(data = lapply(file, read.table, header = TRUE)) %>%
  unnest(data) %>%
  filter(Num_Kinships %in% "Alt_Likelihood") %>%
  mutate(gblup_llik = as.numeric(as.character(X1))) %>%
  select(trait, gblup_llik)

head(gblup_llik)

# individuals used for test set
cross_val <- tibble(filename = list.files(path = "data/processed/cross_validation",
                                          pattern = ".test",
                                          full.names = TRUE)) %>%
  mutate(cvnum = word(filename, start = 4, end = 4, sep = "/|[.]"),
         fold = word(filename, start = 5, end = 5, sep = "/|[.]"),
         fold = gsub("test", "", fold)) %>%
  mutate(file_contents = map(filename, ~read.table(., header = FALSE))) %>%
  unnest(cols = c("file_contents")) %>%
  rename("FID" = "V1", "IID" = "V2")

# predictive accuracy
gblup_pa <- tibble(file = list.files(path = "models/gblup",
                                     pattern = ".pred$",
                                     full.names = TRUE)) %>%
  separate(file, sep = "/|[.]",
           into = c("source", "method", "trait", "cvnum", "fold", "fill"),
           remove = FALSE) %>%
  mutate(data = lapply(file, read.table, header = TRUE)) %>%
  unnest(data) %>%
  left_join(pheno, ., by = c("FID" = "ID1", "IID" = "ID2", "trait")) %>%
  inner_join(., cross_val)
  # left_join(., training_coef, by = c("trait", "cvnum", "fold"))

# training_coef <- tibble(file = list.files(path = "models/gblup",
#                                           pattern = ".coeff$",
#                                           full.names = TRUE)) %>%
#   separate(file, sep = "/|[.]",
#            into = c("source", "method", "trait", "cvnum", "fold", "fill"),
#            remove = FALSE) %>%
#   mutate(data = lapply(file, read.table, header = TRUE)) %>%
#   unnest(data) %>%
#   filter(Component %in% "Intercept") %>%
#   select(trait, cvnum, fold, Effect)
#
# head(training_coef)
#
# gblup_pa <- tibble(file = list.files(path = "models/gblup",
#                                      pattern = ".profile$",
#                                      full.names = TRUE)) %>%
#   separate(file, sep = "/|[.]",
#            into = c("source", "method", "trait", "cvnum", "fold", "fill"),
#            remove = FALSE) %>%
#   mutate(data = lapply(file, read.table, header = TRUE)) %>%
#   unnest(data) %>%
#   left_join(., training_coef, by = c("trait", "cvnum", "fold"))

head(gblup_pa)

save(gblup_h2, gblup_llik, gblup_pa, file = "reports/gblup.RData")
library(tidyverse)
library(fuzzyjoin)

# use command line arguments
args = commandArgs(trailingOnly = TRUE)
print(args[1])
print(args[2])

# read in gene list and annotations (from biomaRt)
gene_list <- read.table("data/processed/gene_list_tair10.txt", header = TRUE)

################################################################################
## Function to extract genes for protein related pathways
## input: data frame with chromosome name, ensembl gene id, external gene name,
##        start/end position, and GO category names (name_1006)
################################################################################

extract_genes <- function(genes, category, go_term, snp_data, filename) {
  tmp <- genes[grep(paste0('^', category), genes[, go_term], perl = TRUE), ] %>%
    distinct(ensembl_gene_id, .keep_all = TRUE) %>%
    mutate_at(c("chromosome_name", "ensembl_gene_id"), funs(factor(.))) %>%
    mutate(mn = start_position - 2500,
           mx = end_position + 2500,
           chromosome_name = as.character(chromosome_name))
  snp_data <- snp_data %>%
    mutate(chromosome = as.character(chromosome))
  snp_tmp <- fuzzy_inner_join(snp_data, tmp,
                              by = c("chromosome" = "chromosome_name",
                                     "position" = "mn",
                                     "position" = "mx"),
                              match_fun = list(`==`, `>=`, `<=`)) %>%
    droplevels() %>%
    dplyr::select(chromosome, snp_id, position, ensembl_gene_id.x,
                  ensembl_gene_id.y, external_gene_name, BINCODE, NAME) %>%
    distinct(., snp_id, .keep_all = TRUE)
  dir.create(paste0("data/processed/pathways/", filename), recursive = TRUE)
  write.table(snp_tmp,
              paste0("data/processed/pathways/", filename, "/", filename, ".txt"),
              col.names = TRUE, row.names = FALSE)
  return(snp_tmp)
}

# Read in annotations from Mapman
cats <- read.csv("data/external/Ath_AGI_LOCUS_TAIR10_Aug2012.csv",
                 header = TRUE, na.strings = c("", "NA"))
cats$IDENTIFIER <- toupper(cats$IDENTIFIER) # convert locus ids to uppercase
str(cats)
head(cats)

# merge with gene_list
gene_cats <- merge(gene_list, cats, by.x = "ensembl_gene_id", by.y = "IDENTIFIER")

# read in snps and order by Chromosome/Position
snps <- read.table("data/processed/snp_gene_ids_tair10.txt", header = TRUE) %>%
  mutate_at(c("chromosome"), funs(factor(.))) %>%
  mutate(position = as.integer(position)) %>%
  arrange(chromosome, position)

# create directories for pathways
dir <- paste0("data/processed/pathways/", args[1])
dir.create(dir, recursive = TRUE, showWarnings = FALSE)

# pull out annotation categories based on Mapman categories
pathway <- if (args[2] == 0) {
  read.table(paste0(dir, "/", args[1], ".txt"), header = TRUE)
} else {
  extract_genes(gene_cats, c(args[2]), "BINCODE", snps, args[1])
}

head(pathway)
unique(pathway$BINCODE)
unique(pathway$NAME)

# export snplists for LDAK
pathway %>%
  select(snp_id) %>%
  write.table(., paste0(dir, "/list1"), quote = FALSE,
              row.names = FALSE, col.names = FALSE)

snps %>%
  select(snp_id) %>%
  filter(!snp_id %in% pathway$snp_id) %>%
  write.table(., paste0(dir, "/list2"), quote = FALSE,
              row.names = FALSE, col.names = FALSE)
library(tidyverse)
library(broom)

# use command line args
args <- commandArgs(trailingOnly = TRUE)

# phenotypes
pheno <- read.table(paste0(args[1]), header = TRUE) %>%
  gather(key = "trait", value = "phenotype", -FID, -IID)

# heritability
multiblup_h2 <- tibble(file = list.files(path = "models/multiblup_h2",
                                         pattern = ".share$",
                                         full.names = TRUE, recursive = TRUE)) %>%
  separate(file, sep = "/|[.]",
           into = c("source", "method", "pathway", "trait", "method2", "fill"),
           remove = FALSE) %>%
  # mutate(data = lapply(file, read.table, header = TRUE, skip = 13)) %>%
  mutate(data = lapply(file, read.table, header = TRUE)) %>%
  unnest(data) %>%
  # filter(Component %in% c("Her_K1", "Her_K2")) %>%
  # select(trait, pathway, Component, Heritability, Her_SD)
  select(trait, pathway, Component, Share, SD)

head(multiblup_h2)

# likelihood ratio
multiblup_llik <- tibble(file = list.files(path = "models/multiblup_h2",
                                           pattern = ".reml$",
                                           full.names = TRUE, recursive = TRUE)) %>%
  separate(file, sep = "/|[.]",
           into = c("source", "method", "pathway", "trait", "method2", "fill"),
           remove = FALSE) %>%
  mutate(data = lapply(file, read.table, header = TRUE)) %>%
  unnest(data) %>%
  filter(Num_Kinships %in% "Alt_Likelihood") %>%
  mutate(multiblup_llik = as.numeric(as.character(X2))) %>%
  select(trait, pathway, multiblup_llik)

head(multiblup_llik)

# predictive ability
# individuals used for test set
cross_val <- tibble(filename = list.files(path = "data/processed/cross_validation",
                                          pattern = ".test",
                                          full.names = TRUE)) %>%
  mutate(cvnum = word(filename, start = 4, end = 4, sep = "/|[.]"),
         fold = word(filename, start = 5, end = 5, sep = "/|[.]"),
         fold = gsub("test", "", fold)) %>%
  mutate(file_contents = map(filename, ~read.table(., header = FALSE))) %>%
  unnest(cols = c("file_contents")) %>%
  rename("FID" = "V1", "IID" = "V2")

# predictive accuracy
multiblup_pa <- tibble(file = list.files(path = "models/multiblup",
                                         pattern = ".pred$",
                                         full.names = TRUE, recursive = TRUE)) %>%
  separate(file, sep = "/|[.]",
           into = c("source", "method", "pathway", "trait", "cvnum", "fold"),
           remove = FALSE) %>%
  mutate(data = lapply(file, read.table, header = TRUE)) %>%
  unnest(data) %>%
  left_join(pheno, ., by = c("FID" = "ID1", "IID" = "ID2", "trait")) %>%
  inner_join(., cross_val)

# training_coef <- tibble(file = list.files(path = "models/multiblup",
#                                           pattern = ".coeff$",
#                                           full.names = TRUE, recursive = TRUE)) %>%
#   separate(file, sep = "/|[.]",
#            into = c("source", "method", "pathway", "trait", "cvnum", "fold", "fill"),
#            remove = FALSE) %>%
#   mutate(data = lapply(file, read.table, header = TRUE)) %>%
#   unnest(data) %>%
#   filter(Component %in% "Intercept") %>%
#   select(trait, pathway, cvnum, fold, Effect)
#
# head(training_coef)
#
# multiblup_pa <- tibble(file = list.files(path = "models/multiblup",
#                                          pattern = ".profile$",
#                                          full.names = TRUE, recursive = TRUE)) %>%
#   separate(file, sep = "/|[.]",
#            into = c("source", "method", "pathway", "trait", "cvnum", "fold"),
#            remove = FALSE) %>%
#   mutate(data = lapply(file, read.table, header = TRUE)) %>%
#   unnest(data) %>%
#   left_join(., training_coef, by = c("pathway", "trait", "cvnum", "fold"))

head(multiblup_pa)

# number of snps/genes in each pathway
files <- list.files(path = "data/processed/pathways", pattern = "*.txt",
                    full.names = TRUE, recursive = TRUE)

pathways <- tibble(file = files) %>%
  separate(file, sep = "/",
           into = c("dir", "dir2", "dir3", "pathway", "pathway2"),
           remove = FALSE) %>%
  mutate(data = lapply(file, read.table, header = TRUE),
         data = lapply(data, mutate_all, as.character)) %>%
  unnest(data) %>%
  group_by(pathway) %>%
  summarise(size = length(unique(snp_id)),
            n_genes = length(unique(ensembl_gene_id.x))) %>%
  select(pathway, n_genes, size) %>%
  ungroup() %>%
  write_csv(., "reports/pathways_summary.csv")

head(pathways)

save(multiblup_h2, multiblup_llik, multiblup_pa, pathways,
     file = "reports/multiblup.RData")
gene_group_sizes <- round(runif(5000, 1, 50000))
hist(gene_group_sizes)

write.table(gene_group_sizes, "data/interim/null_group_sizes.txt", row.names = FALSE)

# write sampling file (control_sampling.txt) for sbatch
# this can be used if submitting with sbatch to run groups of 50 random sets at a time
sampling <- as.data.frame(seq(1, length(gene_group_sizes), by = 50))
colnames(sampling) <- "start"
sampling$end <- sampling$start + 49

write.table(sampling, "data/interim/control_sampling.txt",
            row.names = FALSE, col.names = FALSE)
library(tidyverse)

# use command line arguments
args = commandArgs(trailingOnly = TRUE)

# name of pathway
pathway <- args[1]

# pathway SNPs
snp_ids <- read.table(paste0("data/processed/pathways/", pathway, "/list1"))

# number of SNPs to sample
nsnps <- nrow(snp_ids)

# number of random subsets to export
end <- 1000

cat("pathway:", pathway, "- sampling", end, "random subsets of", nsnps, "snps each\n")

# read in SNP and gene information for all SNPs
snps <- read.table("data/processed/snp_gene_ids_tair10.txt", header = TRUE) %>%
  arrange(chromosome, position) %>%
  filter(!snp_id %in% snp_ids$V1)  # remove pathway SNPs

str(snps)

# # read in random uniform distribution for number of SNPs
# snp_number <- read.table("data/interim/gene_groups.txt", header = TRUE)

# a function to sample genes randomly up to a specified number of SNPs
# nsnps = number of SNPs to sample
# bprange = number of basepairs to include before/after a gene
sample_genes <- function(data, nsnps, bprange) {
  out <- data.frame()
  while (nrow(out) < nsnps) {
    # subset snps by genes already selected in output
    set <- na.omit(data[!data$ensembl_gene_id %in% out$ensembl_gene_id, ])
    # select a random gene
    gene <- sample(set$ensembl_gene_id, 1, replace = FALSE)
    # select all snps within the gene
    snp_sub <- set[set$ensembl_gene_id == gene, ]
    chr <- snp_sub$chromosome[1]
    mn <- min(snp_sub$position) - bprange
    mx <- max(snp_sub$position) + bprange
    # select SNPs in the gene with a buffer of bprange bp
    positions <- data[data$position > mn & data$position < mx & data$chromosome == chr, ]
    # combine until the threshold number of snps is met
    out <- rbind(out, positions)
  }
  return(out)
}

# loop through gene sets
for (i in 1:end) {
  if (file.exists(paste0("data/processed/random_sets_pathways/", pathway,
                         "/null_", i, ".txt"))) {
    cat("file already exists for sample", i, "\n")
    next
  }
  tryCatch({
    cat("exporting random gene group", i, "of", end, "\n")
    random <- sample_genes(snps, nsnps = nsnps, bprange = 2500) %>%
      arrange(chromosome, position)
    dir.create(paste0("data/processed/random_sets_pathways/", pathway),
               showWarnings = FALSE, recursive = TRUE)
    write.table(random, paste0("data/processed/random_sets_pathways/", pathway,
                               "/null_", i, ".txt"))
    # export snplists for LDAK
    dir <- paste0("data/processed/random_sets_pathways/", pathway, "/c_", i)
    dir.create(dir, recursive = TRUE, showWarnings = FALSE)
    # SNPs in random gene group
    random %>%
      select(snp_id) %>%
      unique() %>%
      write.table(., paste0(dir, "/list1"), quote = FALSE,
                  row.names = FALSE, col.names = FALSE)
    # remaining SNPs (filter on random$snp_id; the original used random$V1,
    # a column that does not exist in `random`)
    snps %>%
      select(snp_id) %>%
      filter(!snp_id %in% random$snp_id) %>%
      write.table(., paste0(dir, "/list2"), quote = FALSE,
                  row.names = FALSE, col.names = FALSE)
  }, error = function(e) { cat("ERROR:", conditionMessage(e), "\n") })
}
library(tidyverse)

# use command line arguments
args = commandArgs(trailingOnly = TRUE)

# end <- as.numeric(args[1]) + 49
end <- 5000

cat("random subsets from ", args[1], "to", end, "\n")

# read in SNP and gene information
snps <- read.table("data/processed/snp_gene_ids_tair10.txt", header = TRUE) %>%
  arrange(chromosome, position)

# read in random uniform distribution for number of SNPs
snp_number <- read.table("data/interim/gene_groups.txt", header = TRUE)

# a function to sample genes randomly up to a specified number of SNPs
# n = number of SNPs to sample
# bprange = number of basepairs to include before/after a gene
sample_genes <- function(data, n, bprange) {
  out <- data.frame()
  while (nrow(out) < n) {
    # subset snps by genes already selected in output
    set <- na.omit(data[!data$ensembl_gene_id %in% out$ensembl_gene_id, ])
    # select a random gene
    gene <- sample(set$ensembl_gene_id, 1, replace = FALSE)
    # select all snps within the gene
    snp_sub <- set[set$ensembl_gene_id == gene, ]
    chr <- snp_sub$chromosome[1]
    mn <- min(snp_sub$position) - bprange
    mx <- max(snp_sub$position) + bprange
    # select SNPs in the gene with a buffer of bprange bp
    positions <- data[data$position > mn & data$position < mx & data$chromosome == chr, ]
    # combine until the threshold number of snps is met
    out <- rbind(out, positions)
  }
  return(out)
}

# loop through gene sets specified by commandArgs
# (coerce args[1] to numeric so the sequence is well defined)
for (i in as.numeric(args[1]):end) {
  if (file.exists(paste0("data/processed/random_sets/null_", i, ".txt"))) {
    cat("file already exists for sample", i, "\n")
    next
  }
  tryCatch({
    print(snp_number[i, ])
    random <- sample_genes(snps, n = snp_number[i, ], bprange = 2500) %>%
      arrange(chromosome, position)
    dir.create("data/processed/random_sets", showWarnings = FALSE)
    write.table(random, paste0("data/processed/random_sets/null_", i, ".txt"))
    # export snplists for LDAK
    dir <- paste0("data/processed/random_sets/c_", i)
    dir.create(dir, recursive = TRUE, showWarnings = FALSE)
    random %>%
      select(snp_id) %>%
      unique() %>%
      write.table(., paste0(dir, "/list1"), quote = FALSE,
                  row.names = FALSE, col.names = FALSE)
    snps %>%
      select(snp_id) %>%
      # filter on random$snp_id (the original used random$V1, which is not a
      # column of `random`, so the random-set SNPs were not excluded)
      filter(!snp_id %in% random$snp_id) %>%
      write.table(., paste0(dir, "/list2"), quote = FALSE,
                  row.names = FALSE, col.names = FALSE)
  }, error = function(e) { cat("ERROR:", conditionMessage(e), "\n") })
}

# one-off cleanup: re-export unique SNP ids for a set of list files
# (`files` is not defined above; set it first, e.g.
#  files <- list.files("data/processed/random_sets", pattern = "list1",
#                      recursive = TRUE, full.names = TRUE))
for (f in files) {
  foo <- read.table(paste0(f), header = FALSE)
  print(length(foo$V1))
  foo %>%
    select(V1) %>%
    unique() %>%
    write.table(., paste0(f), row.names = FALSE, col.names = FALSE, quote = FALSE)
}
library(tidyverse)
library(data.table)

################################################################################
# Step 1: summarise random gene groups

# use command line arguments
args = commandArgs(trailingOnly = TRUE)

# filepath
path <- paste0("data/processed/random_sets_pathways/", args[1])

# list files for random gene groups
# (the 'find' version below overwrites this; it is much faster than 'list.files')
random_files <- data.frame(filename = list.files(path = path,
                                                 pattern = ".txt$",
                                                 full.names = TRUE,
                                                 recursive = TRUE))
random_files <- data.frame(filename = system(paste("find", path, "| grep .txt", sep = " "),
                                             intern = TRUE))

# summarize number of snps and genes in each random group
random_summary <- random_files %>%
  mutate(filename = as.character(filename),
         # pathway id
         pathway = word(filename, start = 4, sep = "/"),
         # random group id (1-1000)
         random_group = word(filename, start = 5, sep = "/"),
         # cleaning up a bit.. this is kind of messy
         random_group = gsub(".txt", "", random_group),
         random_group = gsub("null_", "c_", random_group),
         # read in file contents into a list
         file_contents = map(filename, ~ fread(., drop = 1))) %>%
  # unnest list into a data frame
  unnest() %>%
  group_by(pathway, random_group) %>%
  # count number of unique genes and snps in each random gene group
  summarize(n_genes = n_distinct(ensembl_gene_id),
            n_snps = n_distinct(snp_id))

head(random_summary)

################################################################################
# Step 2: compute likelihood ratio test statistic

# import log likelihood for GBLUP model
# this is used to compute the likelihood ratio statistic
load("reports/gblup.RData")

# filepath to model output from REML
path <- paste0("models/null_h2_pathways/", args[1])

## list files for reml output of each random gene group
# list files using 'find' OS command
# this option is *much* faster than 'list.files'
lr_files <- data.frame(filename = system(paste("find", path, "| grep .reml", sep = " "),
                                         intern = TRUE))

lr_null <- lr_files %>%
  mutate(filename = as.character(filename),
         # pathway id
         pathway = word(filename, start = 3, sep = "/"),
         # random group id (1-1000)
         random_group = word(filename, start = 4, sep = "/"),
         # trait id
         trait = word(filename, start = 5, sep = "/|[.]"),
         # read in file contents into a list
         file_contents = map(filename, ~ fread(., header = FALSE, skip = 10, nrows = 1))) %>%
  # unnest list into a data frame
  unnest() %>%
  # filter(Num_Kinships %in% "Alt_Likelihood") %>%
  mutate(null_llik = as.numeric(as.character(V2))) %>%
  left_join(., gblup_llik, by = "trait") %>%
  select(trait, pathway, random_group, null_llik, gblup_llik) %>%
  mutate(LR = 2 * (null_llik - gblup_llik))

head(lr_null)

################################################################################
## Step 3: summarize proportion of variance explained

## list files for proportion of variance explained
# uses same path as LR output
h2_files <- data.frame(filename = system(paste("find", path, "| grep .share", sep = " "),
                                         intern = TRUE))
# h2_files <- data.frame(filename = list.files(path = path,
#                                              pattern = ".share$",
#                                              full.names = TRUE,
#                                              recursive = TRUE))

h2_null <- h2_files %>%
  mutate(filename = as.character(filename),
         # pathway id
         pathway = word(filename, start = 3, sep = "/"),
         # random group id (1-1000)
         random_group = word(filename, start = 4, sep = "/"),
         # trait id
         trait = word(filename, start = 5, sep = "/|[.]"),
         # read in file contents into a list
         file_contents = map(filename, ~ fread(., header = TRUE))) %>%
  # unnest list into a data frame
  unnest() %>%
  select(trait, pathway, random_group, Component, Share, SD) %>%
  left_join(., random_summary, by = c("pathway", "random_group"))

head(h2_null)

################################################################################
## Step 4: merge and export results!

output <- paste0("reports/null_dist_results_pathways/", args[1], "_null.csv")

null_dist <- left_join(lr_null, h2_null, by = c("trait", "pathway", "random_group")) %>%
  write_csv(., output)
library(tidyverse)
library(data.table)
# library(quantreg)

# summarize random gene groups (number of SNPs & number of genes)
feature_sizes <- read_csv("reports/random_subsets_summary.csv")
head(feature_sizes)

# import log likelihood for GBLUP model
load("reports/gblup.RData")

## list files for reml output of each random gene group
files <- list.files(path = "models/null_h2", pattern = ".reml$",
                    full.names = TRUE, recursive = TRUE)

lr_null <- rbindlist(lapply(files, read.table, header = TRUE),
                     idcol = "filename", use.names = TRUE) %>%
  mutate(filename = files[filename]) %>%
  separate(filename, sep = "/|[.]",
           into = c("dir", "source", "pathway", "trait", "metric", "model"),
           remove = FALSE) %>%
  filter(Num_Kinships %in% "Alt_Likelihood") %>%
  mutate(null_llik = as.numeric(as.character(X2))) %>%
  left_join(., gblup_llik, by = "trait") %>%
  select(trait, pathway, Num_Kinships, null_llik, gblup_llik) %>%
  left_join(., feature_sizes, by = "pathway") %>%
  mutate(group_size = cut(size, breaks = seq(0, 50000, by = 10000),
                          include.lowest = TRUE, dig.lab = 10),
         LR = 2 * (null_llik - gblup_llik))

# lr_null <- rbindlist(lapply(files, read.table, header = TRUE),
#                      idcol = "filename", use.names = TRUE) %>%
#   mutate(filename = files[filename]) %>%
#   separate(filename, sep = "/|[.]",
#            into = c("dir", "source", "pathway", "trait", "metric", "model"),
#            remove = FALSE) %>%
#   filter(Num_Kinships %in% "Alt_Likelihood") %>%
#   rename("LRT_Stat" = X2) %>%
#   select(trait, pathway, LRT_Stat) %>%
#   left_join(., feature_sizes, by = "pathway") %>%
#   mutate(group_size = cut(size, breaks = seq(0, 50000, by = 10000),
#                           include.lowest = TRUE, dig.lab = 10))

head(lr_null)

## read in variance component estimates
files <- list.files(path = "models/null_h2", pattern = ".share$",
                    full.names = TRUE, recursive = TRUE)

h2_null <- rbindlist(lapply(files, read.table, header = TRUE),
                     idcol = "filename", use.names = TRUE) %>%
  mutate(filename = files[filename]) %>%
  separate(filename, sep = "/|[.]",
           into = c("dir", "source", "pathway", "trait", "metric", "model"),
           remove = FALSE) %>%
  left_join(., feature_sizes, by = "pathway") %>%
  select(trait, pathway, size, n_genes, Component, Share, SD)

head(h2_null)

## merge!
null_dist <- left_join(lr_null, h2_null, by = c("trait", "pathway")) %>%
  write_csv(., "reports/lr_null_results.csv")