Genomic prediction for free amino acid traits in Arabidopsis seeds
Data and scripts necessary to run genomic partitioning and prediction models on free amino acid traits measured in a diverse panel of 313 Arabidopsis lines.
Software requirements
- R v3.6.0
- PLINK v1.90b4 64-bit
- Miniconda3 (includes conda v4.6.7)
- Snakemake v5.4.2 (install to a virtual environment)
To install Snakemake in a virtual environment, run:
conda env create --name multiblup --file environment.yaml
For future use, activate this environment with:
source activate multiblup
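For reference, environment.yaml pins these dependencies. A minimal sketch of what such a file could look like (the channels and exact package names here are assumptions — use the file shipped with the repository):

name: multiblup
channels:
  - conda-forge
  - bioconda
dependencies:
  - snakemake=5.4.2
  - r-base=3.6.0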
Data
Edit the config.yaml file to specify the paths for the genotype and phenotype data.
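The exact keys are defined in the repository's config.yaml; purely as an illustration (the key names below are assumptions), the file maps short names to the data paths used throughout the workflow:

geno: data/processed/call_method_75_TAIR9   # PLINK binary fileset prefix
pheno: data/processed/aa360_BLUEs.txt       # phenotype BLUEs
covar: data/processed/aa360_covars.txt      # principal component covariates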
Genotype data
Before running Snakemake, download the Arabidopsis Regional Mapping (RegMap) data (Horton et al. 2012):
cd data/external
wget https://github.com/Gregor-Mendel-Institute/atpolydb/raw/master/250k_snp_data/call_method_75.tar.gz
tar -xvf call_method_75.tar.gz
Phenotype and covariate data
- data/raw/aa360_raw_ratios.csv
  Raw measurements (nmol/mg seed) of 65 free amino acid traits measured in 313 accessions of Arabidopsis thaliana, as reported by Angelovici et al. 2013.
- data/processed/aa360_BLUEs.txt
  Environment-adjusted best linear unbiased estimates (BLUEs) for the 65 free amino acid traits, calculated using the HAPPI-GWAS pipeline from Slaten et al. 2020. See notebooks/01-calculate_BLUEs.Rmd for details.
- data/processed/aa360_covars.txt
  Principal components from the genotype data, used to model population structure.
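Both the phenotype and covariate files follow the PLINK/LDAK convention of family and individual ID columns (FID, IID) followed by one column per trait or covariate. A sketch of the expected layout (trait names and values invented for illustration; the AtRegMap family ID mirrors the PED conversion script under Code Snippets):

FID       IID    trait1   trait2
AtRegMap  6909   0.84     1.32
AtRegMap  6913   0.71     1.05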
Snakemake
The Snakemake workflow includes a Snakefile (which specifies the steps of the workflow and its variables) and rule files in rules/ that run specific commands.
Snakefile
This file specifies variable names for use in the Snakemake workflow. There are notes on how to include specific pathways and MapMan bincodes. Other variables include trait names, the number of random SNP sets, and the number of cross-validations to perform.
The rule all: section is a pseudorule that tracks the expected output of each command in the rule files; a minimal sketch follows.
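In Snakemake, such a pseudorule declares the final outputs as its inputs, so invoking snakemake with no explicit target builds the entire workflow. A sketch only — the target files below are illustrative, based on the outputs listed later in this README, not copied from the repository's Snakefile:

rule all:
    input:
        "reports/gblup.RData",
        "reports/multiblup.RData",
        expand("reports/null_dist_results_pathways/{pathway}_null.csv",
               pathway=PATHWAYS.keys())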
To run the workflow, edit the cluster configuration settings in submit.json, then run submit.sh (sketched below).
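submit.sh typically wraps a cluster-mode invocation along these lines, reading per-rule resources from submit.json (a sketch assuming a SLURM scheduler; the cluster.* keys are assumptions — adapt the --cluster string to your system and the keys actually present in submit.json):

snakemake --jobs 50 \
    --cluster-config submit.json \
    --cluster "sbatch --time {cluster.time} --mem {cluster.mem}"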
Rules
rules/common.smk - specifies location of config.yaml file
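In this layout, common.smk usually amounts to little more than the following (a sketch mirroring standard Snakemake usage, not the file's verbatim contents):

configfile: "config.yaml"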
rules/prep_data.smk
- filter and convert genotype data to PED format
- export TAIR10 Ensembl gene IDs for the SNP data
- calculate SNP weightings (these aren't actually used and could be skipped)
- run PCA and export a covariate file (two PCs are included here; adjust depending on your data)
rules/cross_validation.smk
- create training and testing sets for cross-validation
- Note: this step may need to be run separately on the command line; repeating the command via loops/Snakemake sometimes does not generate unique folds (a manual sketch follows this list)
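If the folds come out identical, the LDAK call shown in the cross_validation.smk snippet under Code Snippets can be repeated by hand. A hedged shell sketch — the output prefix, bfile path, and fold count here are illustrative, not the workflow's actual parameter values:

for i in 1 2 3 4 5; do
    ldak --cut-folds data/processed/cross_validation/cv${i} \
         --bfile data/processed/call_method_75_TAIR9 \
         --num-folds 5
done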
rules/gblup.smk
- export kinship matrix for all SNPs
- estimate variances and heritability
- genomic prediction and cross-validation (REML + calculating BLUPs)
- summarize GBLUP output (reports/gblup.RData)
rules/multiblup.smk
- filter and export the list of pathway SNPs (includes a 2.5 kb buffer before and after each pathway gene)
- calculate kinship matrices for each pathway (list1) and all remaining SNPs (list2)
- estimate variances and heritability for each SNP partition
- genomic prediction and cross-validation with multiple kernels/random effects (REML + calculating BLUPs)
- summarize MultiBLUP output (reports/multiblup.RData)
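For orientation, GBLUP fits a single genomic random effect, while MultiBLUP partitions it across the two kinship matrices built from list1 and list2. In standard notation (a sketch, not taken verbatim from the repository):

$$y = X\beta + g + e, \qquad g \sim N(0, \sigma_g^2 K) \quad \text{(GBLUP)}$$

$$y = X\beta + g_1 + g_2 + e, \qquad g_k \sim N(0, \sigma_{g_k}^2 K_k),\; k = 1, 2 \quad \text{(MultiBLUP)}$$

where K_1 is computed from the pathway SNPs and K_2 from the remaining SNPs.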
rules/null_distribution.smk
- first option: generate 5000 random gene groups with a uniform distribution of SNP counts. This is useful for examining the influence of partition size on heritability explained/model fit, or for comparing many partitions of varying size against an empirical distribution (output: reports/lr_null_results.csv)
- second option: generate 1000 random gene groups for each pathway (excludes pathway SNPs and samples a similar number of SNPs/genes)
- calculate kinship matrices for each random group
- estimate variances and heritability for each random SNP set
- summarize results across all 1000 gene groups (reports/null_dist_results_pathways/{pathway}_null.csv)
Notebooks
R notebooks to summarize results.
01-calculate_BLUEs.Rmd
Using raw FAA trait data, calculate BLUEs for each trait using the HAPPI-GWAS pipeline.
02-gblup_results.Rmd
Checks quality of model output and summarizes prediction results for the GBLUP and MultiBLUP models (e.g. proportion of heritability explained, likelihood ratio, prediction accuracy, reliability, bias, MSE)
03-process_null.Rmd
Examines properties of the random gene groups (e.g. the distribution of the likelihood ratio) and performs quantile regression to establish 95th percentiles for the proportion of heritability explained and the likelihood ratio (see the nice discussion of this approach in Edwards et al. 2016).
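As a concrete illustration, such a 95th-percentile threshold can be fit with the quantreg package. A minimal sketch, assuming the lr_null_results.csv columns described above (LR for the likelihood ratio, size for the number of SNPs in a group) — not the notebook's verbatim code:

library(quantreg)
library(readr)

null_dist <- read_csv("reports/lr_null_results.csv")

# model the 95th percentile of the likelihood ratio as a function of group size
fit <- rq(LR ~ size, tau = 0.95, data = null_dist)
summary(fit)

# a pathway is then flagged when its observed LR exceeds the fitted
# 95th percentile for random gene groups of comparable size
predict(fit, newdata = data.frame(size = 5000))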
04-multiblup_results_by_pathway.Rmd
Identifies pathways that pass significance criteria based on comparison to random gene groups with the same number of SNPs (proportion of h2, likelihood ratio, and increase in prediction accuracy).
05-figures.Rmd
Code to create figures used in the manuscript.
Project Organization (based on Cookiecutter data science)
Project based on the cookiecutter data science project template.
├── LICENSE
├── README.md <- The top-level README for developers using this project.
├── data
│ ├── external <- Data from third party sources.
│ ├── interim <- Intermediate data that has been transformed.
│ ├── processed <- The final, canonical data sets for modeling.
│ └── raw <- The original, immutable data dump.
│
├── docs <- manuscript documents
│
├── models <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks <- R notebooks. Naming convention is a number (for ordering),
│ the creator's initials, and a short `-` delimited description, e.g.
│ `1.0-jqp-initial-data-exploration`.
│
├── references <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports <- Generated analysis as HTML, PDF, LaTeX, etc.
│ └── figures <- Generated graphics and figures to be used in reporting
│
├── environment.yaml <- The requirements file for reproducing the analysis conda environment
│
├── src <- Source code for use in this project.
│ ├── data <- Scripts to download or generate data
│ │
│ ├── features <- Scripts to turn raw data into features for modeling
│ │
│ ├── models <- Scripts to train models and then use trained models to make
│ │ predictions
│ │
│ └── visualization <- Scripts to create exploratory and results oriented visualizations
│
├── Snakefile          <- Snakemake workflow to execute analyses
│
└── submit.json <- Configuration settings to run snakemake on a computing cluster
Code Snippets
run:
    shell("{ldak} --cut-folds {params.folds} \
        --bfile {params.bfile} \
        --num-folds {params.numfolds}")
run:
    shell("{ldak} --calc-kins-direct {params.outdir} \
        --bfile {params.bfile} \
        --ignore-weights YES \
        --power 0 \
        --kinship-raw YES")
run:
    shell("{ldak} --reml {params.out} \
        --pheno {input.pheno} \
        --pheno-name {params.trait} \
        --grm {params.grm} \
        --covar {input.covar} \
        --constrain YES \
        --reml-iter 500")
run:
    for i in CV:
        for j in INDEX:
            shell("{ldak} --reml {params.out}{i}.{j} \
                --pheno {input.pheno} \
                --pheno-name {params.trait} \
                --grm {params.grm} \
                --keep {params.keep}{i}.train{j} \
                --covar {input.covar} \
                --constrain YES \
                --reml-iter 500")
run:
    for i in CV:
        for j in INDEX:
            shell("{ldak} --calc-blups {params.out}{i}.{j} \
                --grm {params.grm} \
                --remlfile {params.reml}{i}.{j}.reml \
                --bfile {params.bfile} \
                --covar {input.covar}")
            shell("{ldak} --calc-scores {params.out}{i}.{j} \
                --bfile {params.bfile} \
                --power 0 \
                --scorefile {params.out}{i}.{j}.blup \
                --keep {params.keep}{i}.test{j} \
                --pheno {input.pheno} \
                --pheno-name {params.trait}")
run:
    shell("Rscript src/04_summarize_gblup.R {input.blues}")
run:
    shell("Rscript src/05_select_pathway_snps.R {params.pathway} \"{params.bincode}\"")
run:
    # partition kinship matrix
    # --ignore-weights YES & --power 0 based on LDAK recommendations for prediction
    shell("{ldak} --cut-kins {params.outdir} \
        --bfile {params.bfile} \
        --partition-number 2 \
        --partition-prefix {params.prefix}")
    shell("{ldak} --calc-kins {params.outdir} \
        --bfile {params.bfile} \
        --partition 1 \
        --ignore-weights YES \
        --power 0 \
        --kinship-raw YES")
    shell("{ldak} --calc-kins {params.outdir} \
        --bfile {params.bfile} \
        --partition 2 \
        --ignore-weights YES \
        --power 0 \
        --kinship-raw YES")
run:
    for p in PATHWAYS.keys():
        shell("{ldak} --reml {params.out}{p}/{params.trait}.multiblup \
            --pheno {input.pheno} \
            --pheno-name {params.trait} \
            --mgrm {params.mgrm}{p}/partition.list \
            --covar {input.covar} \
            --constrain YES \
            --reml-iter 500")
run:
    for p in PATHWAYS.keys():
        for i in CV:
            for j in INDEX:
                shell("{ldak} --reml {params.out}{p}/{params.trait}.cv{i}.{j} \
                    --pheno {input.pheno} \
                    --pheno-name {params.trait} \
                    --mgrm {params.mgrm}{p}/partition.list \
                    --keep {params.keep}{i}.train{j} \
                    --covar {input.covar} \
                    --constrain YES \
                    --reml-iter 100")
run:
    for p in PATHWAYS.keys():
        for i in CV:
            for j in INDEX:
                shell("{ldak} --calc-blups {params.out}{p}/{params.trait}.cv{i}.{j} \
                    --mgrm {params.mgrm}{p}/partition.list \
                    --remlfile {params.out}{p}/{params.trait}.cv{i}.{j}.reml \
                    --bfile {params.bfile} \
                    --covar {input.covar}")
                shell("{ldak} --calc-scores {params.out}{p}/{params.trait}.cv{i}.{j} \
                    --bfile {params.bfile} \
                    --power 0 \
                    --scorefile {params.out}{p}/{params.trait}.cv{i}.{j}.blup \
                    --keep {params.keep}{i}.test{j} \
                    --pheno {input.pheno} \
                    --pheno-name {params.trait}")
run:
    shell("Rscript src/06_summarize_multiblup.R {input.blues}")
shell:
    "Rscript src/07_null_group_sizes.R"
shell:
    "Rscript src/08_generate_null_gene_groups.R {params.num}"
run:
    for r in RANDOM:
        shell("{ldak} --cut-kins {params.prefix}{r} \
            --bfile {params.bfile} \
            --partition-number 2 \
            --partition-prefix {params.prefix}{r}/list")
        shell("{ldak} --calc-kins {params.prefix}{r} \
            --bfile {params.bfile} \
            --partition 1 \
            --ignore-weights YES \
            --power 0")
        shell("{ldak} --calc-kins {params.prefix}{r} \
            --bfile {params.bfile} \
            --partition 2 \
            --ignore-weights YES \
            --power 0")
run:
    for r in RANDOM:
        shell("{ldak} --reml {params.prefix}{r}/{params.trait}.h2 \
            --pheno {input.pheno} \
            --pheno-name {params.trait} \
            --mgrm {params.mgrm}{r}/partition.list \
            --covar {input.covar} \
            --constrain YES \
            --reml-iter 500")
run:
    shell("Rscript src/09_summarize_null.R")
shell:
    "Rscript src/08a_generate_null_gene_groups_pathways.R {params.pathway}"
run:
    for r in range(1, 1001):
        shell("{ldak} --cut-kins {params.prefix}{r} \
            --bfile {params.bfile} \
            --partition-number 2 \
            --partition-prefix {params.prefix}{r}/list")
        shell("{ldak} --calc-kins {params.prefix}{r} \
            --bfile {params.bfile} \
            --partition 1 \
            --ignore-weights YES \
            --power 0")
        shell("{ldak} --calc-kins {params.prefix}{r} \
            --bfile {params.bfile} \
            --partition 2 \
            --ignore-weights YES \
            --power 0")
run:
    for r in range(1, 1001):
        shell("{ldak} --reml {params.prefix}{r}/{params.trait}.h2 \
            --pheno {input.pheno} \
            --pheno-name {params.trait} \
            --mgrm {params.mgrm}{r}/partition.list \
            --covar {input.covar} \
            --constrain YES \
            --reml-iter 500")
run:
    shell("Rscript src/09a_summarize_null_pathways.R {params.pathway}")
run:
    shell("Rscript src/01_convert_to_ped.R")
    shell("plink --file {params.indir} \
        --recode 12 \
        --maf 0.05 \
        --mind 0.1 \
        --geno 0.1 \
        --make-bed \
        --out {params.outdir}")
shell:
    "Rscript src/02_extract_gene_ids.R {input}"
run:
    shell("{ldak} --cut-weights {params.outdir} \
        --bfile {params.bfile} \
        --section-kb 0.25")
    shell("{ldak} --calc-weights-all {params.outdir} \
        --bfile {params.bfile}")
library(tidyverse)

snp_data <- read.csv("data/external/call_method_75/call_method_75_TAIR9.csv",
                     header = TRUE, skip = 1)

# create the MAP file
# requires Chromosome and Position info
# columns: Chromosome/rsid/physical_distance/Position
map <- snp_data %>%
  select(Chromosome, Positions) %>%
  mutate(rsid = paste0("S", seq_len(n())),
         distance = 0) %>%
  select(Chromosome, rsid, distance, Positions) %>%
  write_delim(., "data/raw/call_method_75_TAIR9.map", col_names = FALSE)

# create the PED file
# rows are individuals
# columns are family/sample/paternal/maternal/sex/phenotype/marker1/marker2/etc

# individual ids
id <- colnames(snp_data[, -c(1:2)])

# get accession ids with phenotype data
acc360 <- read_delim("data/processed/pheno_file", delim = "\t")

ped <- snp_data[, -c(1:2)] %>%
  # recode genotypes to be diploid
  mutate_if(is.factor, funs(fct_recode(., GG = "G", TT = "T", AA = "A", CC = "C"))) %>%
  t() %>%
  as.data.frame(row.names = FALSE) %>%
  # add columns expected by plink (family/sample id/paternal/maternal/sex/phenotype)
  add_column(.,
             family = "AtRegMap",
             sample = gsub("X", "", id),
             paternal = 0,
             maternal = 0,
             sex = -9,
             phenotype = -9,
             .before = 1) %>%
  # only include accessions with phenotype data
  filter(sample %in% acc360$IID) %>%
  write_delim(., "data/raw/call_method_75_TAIR9.ped", col_names = FALSE)
library(tidyverse)
library(broom)

# use command line args
args <- commandArgs(trailingOnly = TRUE)

# phenotypes
pheno <- read.table(paste0(args[1]), header = TRUE) %>%
  gather(key = "trait", value = "phenotype", -FID, -IID)

# heritability
gblup_h2 <- tibble(file = list.files(path = "models/gblup_h2",
                                     pattern = ".reml$",
                                     full.names = TRUE)) %>%
  separate(file, sep = "/|[.]",
           into = c("source", "method", "trait", "method2", "fill"),
           remove = FALSE) %>%
  mutate(data = lapply(file, read.table, header = TRUE, skip = 13)) %>%
  unnest(data) %>%
  filter(Component %in% "Her_ALL") %>%
  select(trait, Heritability, Her_SD)

head(gblup_h2)

# log likelihood
gblup_llik <- tibble(file = list.files(path = "models/gblup_h2",
                                       pattern = ".reml$",
                                       full.names = TRUE)) %>%
  separate(file, sep = "/|[.]",
           into = c("source", "method", "trait", "metric", "fill", "fill2"),
           remove = FALSE) %>%
  mutate(data = lapply(file, read.table, header = TRUE)) %>%
  unnest(data) %>%
  filter(Num_Kinships %in% "Alt_Likelihood") %>%
  mutate(gblup_llik = as.numeric(as.character(X1))) %>%
  select(trait, gblup_llik)

head(gblup_llik)

# individuals used for test set
cross_val <- tibble(filename = list.files(path = "data/processed/cross_validation",
                                          pattern = ".test",
                                          full.names = TRUE)) %>%
  mutate(cvnum = word(filename, start = 4, end = 4, sep = "/|[.]"),
         fold = word(filename, start = 5, end = 5, sep = "/|[.]"),
         fold = gsub("test", "", fold)) %>%
  mutate(file_contents = map(filename, ~read.table(., header = FALSE))) %>%
  unnest(cols = c("file_contents")) %>%
  rename("FID" = "V1", "IID" = "V2")

# predictive accuracy
gblup_pa <- tibble(file = list.files(path = "models/gblup",
                                     pattern = ".pred$",
                                     full.names = TRUE)) %>%
  separate(file, sep = "/|[.]",
           into = c("source", "method", "trait", "cvnum", "fold", "fill"),
           remove = FALSE) %>%
  mutate(data = lapply(file, read.table, header = TRUE)) %>%
  unnest(data) %>%
  left_join(pheno, ., by = c("FID" = "ID1", "IID" = "ID2", "trait")) %>%
  inner_join(., cross_val)
  # left_join(., training_coef, by = c("trait", "cvnum", "fold"))

# training_coef <- tibble(file = list.files(path = "models/gblup",
#                                           pattern = ".coeff$",
#                                           full.names = TRUE)) %>%
#   separate(file, sep = "/|[.]",
#            into = c("source", "method", "trait", "cvnum", "fold", "fill"),
#            remove = FALSE) %>%
#   mutate(data = lapply(file, read.table, header = TRUE)) %>%
#   unnest(data) %>%
#   filter(Component %in% "Intercept") %>%
#   select(trait, cvnum, fold, Effect)
#
# head(training_coef)
#
# gblup_pa <- tibble(file = list.files(path = "models/gblup",
#                                      pattern = ".profile$",
#                                      full.names = TRUE)) %>%
#   separate(file, sep = "/|[.]",
#            into = c("source", "method", "trait", "cvnum", "fold", "fill"),
#            remove = FALSE) %>%
#   mutate(data = lapply(file, read.table, header = TRUE)) %>%
#   unnest(data) %>%
#   left_join(., training_coef, by = c("trait", "cvnum", "fold"))

head(gblup_pa)

save(gblup_h2, gblup_llik, gblup_pa, file = "reports/gblup.RData")
library(tidyverse)
library(fuzzyjoin)

# use command line arguments
args = commandArgs(trailingOnly = TRUE)
print(args[1])
print(args[2])

# read in gene list and annotations (from biomaRt)
gene_list <- read.table("data/processed/gene_list_tair10.txt", header = TRUE)

################################################################################
## Function to extract genes for protein related pathways
## input: data frame with chromosome name, ensembl gene id, external gene name,
##        start/end position, and GO category names (name_1006)
################################################################################

extract_genes <- function(genes, category, go_term, snp_data, filename) {
  tmp <- genes[grep(paste0('^', category), genes[, go_term], perl = TRUE), ] %>%
    distinct(ensembl_gene_id, .keep_all = TRUE) %>%
    mutate_at(c("chromosome_name", "ensembl_gene_id"), funs(factor(.))) %>%
    mutate(mn = start_position - 2500,
           mx = end_position + 2500,
           chromosome_name = as.character(chromosome_name))
  snp_data <- snp_data %>%
    mutate(chromosome = as.character(chromosome))
  snp_tmp <- fuzzy_inner_join(snp_data, tmp,
                              by = c("chromosome" = "chromosome_name",
                                     "position" = "mn",
                                     "position" = "mx"),
                              match_fun = list(`==`, `>=`, `<=`)) %>%
    droplevels() %>%
    dplyr::select(chromosome, snp_id, position, ensembl_gene_id.x,
                  ensembl_gene_id.y, external_gene_name, BINCODE, NAME) %>%
    distinct(., snp_id, .keep_all = TRUE)
  dir.create(paste0("data/processed/pathways/", filename), recursive = TRUE)
  write.table(snp_tmp,
              paste0("data/processed/pathways/", filename, "/", filename, ".txt"),
              col.names = TRUE, row.names = FALSE)
  return(snp_tmp)
}

# Read in annotations from Mapman
cats <- read.csv("data/external/Ath_AGI_LOCUS_TAIR10_Aug2012.csv",
                 header = TRUE, na.strings = c("", "NA"))
cats$IDENTIFIER <- toupper(cats$IDENTIFIER) # convert locus ids to uppercase
str(cats)
head(cats)

# merge with gene_list
gene_cats <- merge(gene_list, cats, by.x = "ensembl_gene_id", by.y = "IDENTIFIER")

# read in snps and order by Chromosome/Position
snps <- read.table("data/processed/snp_gene_ids_tair10.txt", header = TRUE) %>%
  mutate_at(c("chromosome"), funs(factor(.))) %>%
  mutate(position = as.integer(position)) %>%
  arrange(chromosome, position)

# create directories for pathways
dir <- paste0("data/processed/pathways/", args[1])
dir.create(dir, recursive = TRUE, showWarnings = FALSE)

# pull out annotation categories based on Mapman categories
pathway <- if (args[2] == 0) {
  read.table(paste0(dir, "/", args[1], ".txt"), header = TRUE)
} else {
  extract_genes(gene_cats, c(args[2]), "BINCODE", snps, args[1])
}

head(pathway)
unique(pathway$BINCODE)
unique(pathway$NAME)

# export snplists for LDAK
pathway %>%
  select(snp_id) %>%
  write.table(., paste0(dir, "/list1"), quote = FALSE,
              row.names = FALSE, col.names = FALSE)

snps %>%
  select(snp_id) %>%
  filter(!snp_id %in% pathway$snp_id) %>%
  write.table(., paste0(dir, "/list2"), quote = FALSE,
              row.names = FALSE, col.names = FALSE)
library(tidyverse)
library(broom)

# use command line args
args <- commandArgs(trailingOnly = TRUE)

# phenotypes
pheno <- read.table(paste0(args[1]), header = TRUE) %>%
  gather(key = "trait", value = "phenotype", -FID, -IID)

# heritability
multiblup_h2 <- tibble(file = list.files(path = "models/multiblup_h2",
                                         pattern = ".share$",
                                         full.names = TRUE, recursive = TRUE)) %>%
  separate(file, sep = "/|[.]",
           into = c("source", "method", "pathway", "trait", "method2", "fill"),
           remove = FALSE) %>%
  # mutate(data = lapply(file, read.table, header = TRUE, skip = 13)) %>%
  mutate(data = lapply(file, read.table, header = TRUE)) %>%
  unnest(data) %>%
  # filter(Component %in% c("Her_K1", "Her_K2")) %>%
  # select(trait, pathway, Component, Heritability, Her_SD)
  select(trait, pathway, Component, Share, SD)

head(multiblup_h2)

# likelihood ratio
multiblup_llik <- tibble(file = list.files(path = "models/multiblup_h2",
                                           pattern = ".reml$",
                                           full.names = TRUE, recursive = TRUE)) %>%
  separate(file, sep = "/|[.]",
           into = c("source", "method", "pathway", "trait", "method2", "fill"),
           remove = FALSE) %>%
  mutate(data = lapply(file, read.table, header = TRUE)) %>%
  unnest(data) %>%
  filter(Num_Kinships %in% "Alt_Likelihood") %>%
  mutate(multiblup_llik = as.numeric(as.character(X2))) %>%
  select(trait, pathway, multiblup_llik)

head(multiblup_llik)

# predictive ability
# individuals used for test set
cross_val <- tibble(filename = list.files(path = "data/processed/cross_validation",
                                          pattern = ".test",
                                          full.names = TRUE)) %>%
  mutate(cvnum = word(filename, start = 4, end = 4, sep = "/|[.]"),
         fold = word(filename, start = 5, end = 5, sep = "/|[.]"),
         fold = gsub("test", "", fold)) %>%
  mutate(file_contents = map(filename, ~read.table(., header = FALSE))) %>%
  unnest(cols = c("file_contents")) %>%
  rename("FID" = "V1", "IID" = "V2")

# predictive accuracy
multiblup_pa <- tibble(file = list.files(path = "models/multiblup",
                                         pattern = ".pred$",
                                         full.names = TRUE, recursive = TRUE)) %>%
  separate(file, sep = "/|[.]",
           into = c("source", "method", "pathway", "trait", "cvnum", "fold"),
           remove = FALSE) %>%
  mutate(data = lapply(file, read.table, header = TRUE)) %>%
  unnest(data) %>%
  left_join(pheno, ., by = c("FID" = "ID1", "IID" = "ID2", "trait")) %>%
  inner_join(., cross_val)

# training_coef <- tibble(file = list.files(path = "models/multiblup",
#                                           pattern = ".coeff$",
#                                           full.names = TRUE, recursive = TRUE)) %>%
#   separate(file, sep = "/|[.]",
#            into = c("source", "method", "pathway", "trait", "cvnum", "fold", "fill"),
#            remove = FALSE) %>%
#   mutate(data = lapply(file, read.table, header = TRUE)) %>%
#   unnest(data) %>%
#   filter(Component %in% "Intercept") %>%
#   select(trait, pathway, cvnum, fold, Effect)
#
# head(training_coef)
#
# multiblup_pa <- tibble(file = list.files(path = "models/multiblup",
#                                          pattern = ".profile$",
#                                          full.names = TRUE, recursive = TRUE)) %>%
#   separate(file, sep = "/|[.]",
#            into = c("source", "method", "pathway", "trait", "cvnum", "fold"),
#            remove = FALSE) %>%
#   mutate(data = lapply(file, read.table, header = TRUE)) %>%
#   unnest(data) %>%
#   left_join(., training_coef, by = c("pathway", "trait", "cvnum", "fold"))

head(multiblup_pa)

# number of snps/genes in each pathway
files <- list.files(path = "data/processed/pathways", pattern = "*.txt",
                    full.names = TRUE, recursive = TRUE)

pathways <- tibble(file = files) %>%
  separate(file, sep = "/",
           into = c("dir", "dir2", "dir3", "pathway", "pathway2"),
           remove = FALSE) %>%
  mutate(data = lapply(file, read.table, header = TRUE),
         data = lapply(data, mutate_all, as.character)) %>%
  unnest(data) %>%
  group_by(pathway) %>%
  summarise(size = length(unique(snp_id)),
            n_genes = length(unique(ensembl_gene_id.x))) %>%
  select(pathway, n_genes, size) %>%
  ungroup() %>%
  write_csv(., "reports/pathways_summary.csv")

head(pathways)

save(multiblup_h2, multiblup_llik, multiblup_pa, pathways,
     file = "reports/multiblup.RData")
gene_group_sizes <- round(runif(5000, 1, 50000))
hist(gene_group_sizes)

write.table(gene_group_sizes, "data/interim/null_group_sizes.txt", row.names = FALSE)

# write sampling file (control_sampling.txt) for sbatch
# this can be used if submitting with sbatch to run groups of 50 random sets at a time
sampling <- as.data.frame(seq(1, length(gene_group_sizes), by = 50))
colnames(sampling) <- "start"
sampling$end <- sampling$start + 49

write.table(sampling, "data/interim/control_sampling.txt",
            row.names = FALSE, col.names = FALSE)
library(tidyverse)

# use command line arguments
args = commandArgs(trailingOnly = TRUE)

# name of pathway
pathway <- args[1]

# pathway SNPs
snp_ids <- read.table(paste0("data/processed/pathways/", pathway, "/list1"))

# number of SNPs to sample
nsnps <- nrow(snp_ids)

# number of random subsets to export
end <- 1000

cat("pathway:", pathway, "- sampling", end, "random subsets of", nsnps, "snps each\n")

# read in SNP and gene information for all SNPs
snps <- read.table("data/processed/snp_gene_ids_tair10.txt", header = TRUE) %>%
  arrange(chromosome, position) %>%
  filter(!snp_id %in% snp_ids$V1)  # remove pathway SNPs

str(snps)

# # read in random uniform distribution for number of SNPs
# snp_number <- read.table("data/interim/gene_groups.txt", header = TRUE)

# a function to sample genes randomly up to a specified number of SNPs
# nsnps = number of SNPs to sample
# bprange = number of basepairs to include before/after a gene
sample_genes <- function(data, nsnps, bprange) {
  out <- data.frame()
  while (nrow(out) < nsnps) {
    # subset snps by genes already selected in output
    set <- na.omit(data[!data$ensembl_gene_id %in% out$ensembl_gene_id, ])
    # select a random gene
    gene <- sample(set$ensembl_gene_id, 1, replace = FALSE)
    # select all snps within the gene
    snp_sub <- set[set$ensembl_gene_id == gene, ]
    chr <- snp_sub$chromosome[1]
    mn <- min(snp_sub$position) - bprange
    mx <- max(snp_sub$position) + bprange
    # select SNPs in the gene with a buffer of bprange bp
    positions <- data[data$position > mn & data$position < mx & data$chromosome == chr, ]
    # combine until the threshold number of snps is met
    out <- rbind(out, positions)
  }
  return(out)
}

# loop through gene sets
for (i in 1:end) {
  if (file.exists(paste0("data/processed/random_sets_pathways/", pathway,
                         "/null_", i, ".txt"))) {
    cat("file already exists for sample", i, "\n")
    next
  }
  tryCatch({
    cat("exporting random gene group", i, "of", end, "\n")
    random <- sample_genes(snps, nsnps = nsnps, bprange = 2500) %>%
      arrange(chromosome, position)
    dir.create(paste0("data/processed/random_sets_pathways/", pathway),
               showWarnings = FALSE, recursive = TRUE)
    write.table(random, paste0("data/processed/random_sets_pathways/", pathway,
                               "/null_", i, ".txt"))
    # export snplists for LDAK
    dir <- paste0("data/processed/random_sets_pathways/", pathway, "/c_", i)
    dir.create(dir, recursive = TRUE, showWarnings = FALSE)
    # SNPs in random gene group
    random %>%
      select(snp_id) %>%
      unique() %>%
      write.table(., paste0(dir, "/list1"), quote = FALSE,
                  row.names = FALSE, col.names = FALSE)
    # remaining SNPs (filter on random$snp_id; the original used random$V1,
    # a column that does not exist in `random`)
    snps %>%
      select(snp_id) %>%
      filter(!snp_id %in% random$snp_id) %>%
      write.table(., paste0(dir, "/list2"), quote = FALSE,
                  row.names = FALSE, col.names = FALSE)
  }, error = function(e) { cat("ERROR:", conditionMessage(e), "\n") })
}
library(tidyverse)

# use command line arguments
args = commandArgs(trailingOnly = TRUE)

# end <- as.numeric(args[1]) + 49
end <- 5000

cat("random subsets from ", args[1], "to", end, "\n")

# read in SNP and gene information
snps <- read.table("data/processed/snp_gene_ids_tair10.txt", header = TRUE) %>%
  arrange(chromosome, position)

# read in random uniform distribution for number of SNPs
snp_number <- read.table("data/interim/gene_groups.txt", header = TRUE)

# a function to sample genes randomly up to a specified number of SNPs
# n = number of SNPs to sample
# bprange = number of basepairs to include before/after a gene
sample_genes <- function(data, n, bprange) {
  out <- data.frame()
  while (nrow(out) < n) {
    # subset snps by genes already selected in output
    set <- na.omit(data[!data$ensembl_gene_id %in% out$ensembl_gene_id, ])
    # select a random gene
    gene <- sample(set$ensembl_gene_id, 1, replace = FALSE)
    # select all snps within the gene
    snp_sub <- set[set$ensembl_gene_id == gene, ]
    chr <- snp_sub$chromosome[1]
    mn <- min(snp_sub$position) - bprange
    mx <- max(snp_sub$position) + bprange
    # select SNPs in the gene with a buffer of bprange bp
    positions <- data[data$position > mn & data$position < mx & data$chromosome == chr, ]
    # combine until the threshold number of snps is met
    out <- rbind(out, positions)
  }
  return(out)
}

# loop through gene sets specified by commandArgs
# (coerce args[1] to numeric so the sequence is well defined)
for (i in as.numeric(args[1]):end) {
  if (file.exists(paste0("data/processed/random_sets/null_", i, ".txt"))) {
    cat("file already exists for sample", i, "\n")
    next
  }
  tryCatch({
    print(snp_number[i, ])
    random <- sample_genes(snps, n = snp_number[i, ], bprange = 2500) %>%
      arrange(chromosome, position)
    dir.create("data/processed/random_sets", showWarnings = FALSE)
    write.table(random, paste0("data/processed/random_sets/null_", i, ".txt"))
    # export snplists for LDAK
    dir <- paste0("data/processed/random_sets/c_", i)
    dir.create(dir, recursive = TRUE, showWarnings = FALSE)
    random %>%
      select(snp_id) %>%
      unique() %>%
      write.table(., paste0(dir, "/list1"), quote = FALSE,
                  row.names = FALSE, col.names = FALSE)
    snps %>%
      select(snp_id) %>%
      # filter on random$snp_id (the original used random$V1, which is not a
      # column of `random`, so the random-set SNPs were not excluded)
      filter(!snp_id %in% random$snp_id) %>%
      write.table(., paste0(dir, "/list2"), quote = FALSE,
                  row.names = FALSE, col.names = FALSE)
  }, error = function(e) { cat("ERROR:", conditionMessage(e), "\n") })
}

# one-off cleanup: re-export unique SNP ids for a set of list files
# (`files` is not defined above; set it first, e.g.
#  files <- list.files("data/processed/random_sets", pattern = "list1",
#                      recursive = TRUE, full.names = TRUE))
for (f in files) {
  foo <- read.table(paste0(f), header = FALSE)
  print(length(foo$V1))
  foo %>%
    select(V1) %>%
    unique() %>%
    write.table(., paste0(f), row.names = FALSE, col.names = FALSE, quote = FALSE)
}
library(tidyverse)
library(data.table)

################################################################################
# Step 1: summarise random gene groups

# use command line arguments
args = commandArgs(trailingOnly = TRUE)

# filepath
path <- paste0("data/processed/random_sets_pathways/", args[1])

# list files for random gene groups
# (the 'find' version below overwrites this; it is much faster than 'list.files')
random_files <- data.frame(filename = list.files(path = path,
                                                 pattern = ".txt$",
                                                 full.names = TRUE,
                                                 recursive = TRUE))
random_files <- data.frame(filename = system(paste("find", path, "| grep .txt", sep = " "),
                                             intern = TRUE))

# summarize number of snps and genes in each random group
random_summary <- random_files %>%
  mutate(filename = as.character(filename),
         # pathway id
         pathway = word(filename, start = 4, sep = "/"),
         # random group id (1-1000)
         random_group = word(filename, start = 5, sep = "/"),
         # cleaning up a bit.. this is kind of messy
         random_group = gsub(".txt", "", random_group),
         random_group = gsub("null_", "c_", random_group),
         # read in file contents into a list
         file_contents = map(filename, ~ fread(., drop = 1))) %>%
  # unnest list into a data frame
  unnest() %>%
  group_by(pathway, random_group) %>%
  # count number of unique genes and snps in each random gene group
  summarize(n_genes = n_distinct(ensembl_gene_id),
            n_snps = n_distinct(snp_id))

head(random_summary)

################################################################################
# Step 2: compute likelihood ratio test statistic

# import log likelihood for GBLUP model
# this is used to compute the likelihood ratio statistic
load("reports/gblup.RData")

# filepath to model output from REML
path <- paste0("models/null_h2_pathways/", args[1])

## list files for reml output of each random gene group
# list files using 'find' OS command
# this option is *much* faster than 'list.files'
lr_files <- data.frame(filename = system(paste("find", path, "| grep .reml", sep = " "),
                                         intern = TRUE))

lr_null <- lr_files %>%
  mutate(filename = as.character(filename),
         # pathway id
         pathway = word(filename, start = 3, sep = "/"),
         # random group id (1-1000)
         random_group = word(filename, start = 4, sep = "/"),
         # trait id
         trait = word(filename, start = 5, sep = "/|[.]"),
         # read in file contents into a list
         file_contents = map(filename, ~ fread(., header = FALSE, skip = 10, nrows = 1))) %>%
  # unnest list into a data frame
  unnest() %>%
  # filter(Num_Kinships %in% "Alt_Likelihood") %>%
  mutate(null_llik = as.numeric(as.character(V2))) %>%
  left_join(., gblup_llik, by = "trait") %>%
  select(trait, pathway, random_group, null_llik, gblup_llik) %>%
  mutate(LR = 2 * (null_llik - gblup_llik))

head(lr_null)

################################################################################
## Step 3: summarize proportion of variance explained

## list files for proportion of variance explained
# uses same path as LR output
h2_files <- data.frame(filename = system(paste("find", path, "| grep .share", sep = " "),
                                         intern = TRUE))
# h2_files <- data.frame(filename = list.files(path = path,
#                                              pattern = ".share$",
#                                              full.names = TRUE,
#                                              recursive = TRUE))

h2_null <- h2_files %>%
  mutate(filename = as.character(filename),
         # pathway id
         pathway = word(filename, start = 3, sep = "/"),
         # random group id (1-1000)
         random_group = word(filename, start = 4, sep = "/"),
         # trait id
         trait = word(filename, start = 5, sep = "/|[.]"),
         # read in file contents into a list
         file_contents = map(filename, ~ fread(., header = TRUE))) %>%
  # unnest list into a data frame
  unnest() %>%
  select(trait, pathway, random_group, Component, Share, SD) %>%
  left_join(., random_summary, by = c("pathway", "random_group"))

head(h2_null)

################################################################################
## Step 4: merge and export results!

output <- paste0("reports/null_dist_results_pathways/", args[1], "_null.csv")

null_dist <- left_join(lr_null, h2_null, by = c("trait", "pathway", "random_group")) %>%
  write_csv(., output)
library(tidyverse)
library(data.table)
# library(quantreg)

# summarize random gene groups (number of SNPs & number of genes)
feature_sizes <- read_csv("reports/random_subsets_summary.csv")
head(feature_sizes)

# import log likelihood for GBLUP model
load("reports/gblup.RData")

## list files for reml output of each random gene group
files <- list.files(path = "models/null_h2", pattern = ".reml$",
                    full.names = TRUE, recursive = TRUE)

lr_null <- rbindlist(lapply(files, read.table, header = TRUE),
                     idcol = "filename", use.names = TRUE) %>%
  mutate(filename = files[filename]) %>%
  separate(filename, sep = "/|[.]",
           into = c("dir", "source", "pathway", "trait", "metric", "model"),
           remove = FALSE) %>%
  filter(Num_Kinships %in% "Alt_Likelihood") %>%
  mutate(null_llik = as.numeric(as.character(X2))) %>%
  left_join(., gblup_llik, by = "trait") %>%
  select(trait, pathway, Num_Kinships, null_llik, gblup_llik) %>%
  left_join(., feature_sizes, by = "pathway") %>%
  mutate(group_size = cut(size, breaks = seq(0, 50000, by = 10000),
                          include.lowest = TRUE, dig.lab = 10),
         LR = 2 * (null_llik - gblup_llik))

# lr_null <- rbindlist(lapply(files, read.table, header = TRUE),
#                      idcol = "filename", use.names = TRUE) %>%
#   mutate(filename = files[filename]) %>%
#   separate(filename, sep = "/|[.]",
#            into = c("dir", "source", "pathway", "trait", "metric", "model"),
#            remove = FALSE) %>%
#   filter(Num_Kinships %in% "Alt_Likelihood") %>%
#   rename("LRT_Stat" = X2) %>%
#   select(trait, pathway, LRT_Stat) %>%
#   left_join(., feature_sizes, by = "pathway") %>%
#   mutate(group_size = cut(size, breaks = seq(0, 50000, by = 10000),
#                           include.lowest = TRUE, dig.lab = 10))

head(lr_null)

## read in variance component estimates
files <- list.files(path = "models/null_h2", pattern = ".share$",
                    full.names = TRUE, recursive = TRUE)

h2_null <- rbindlist(lapply(files, read.table, header = TRUE),
                     idcol = "filename", use.names = TRUE) %>%
  mutate(filename = files[filename]) %>%
  separate(filename, sep = "/|[.]",
           into = c("dir", "source", "pathway", "trait", "metric", "model"),
           remove = FALSE) %>%
  left_join(., feature_sizes, by = "pathway") %>%
  select(trait, pathway, size, n_genes, Component, Share, SD)

head(h2_null)

## merge!
null_dist <- left_join(lr_null, h2_null, by = c("trait", "pathway")) %>%
  write_csv(., "reports/lr_null_results.csv")