Causal effect of modifiable risk factors on COVID
Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation, topic
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
Esimating the shared genetic eitology and the causal association of potentially modifiable risk factors on SARS-CoV-2 infection and COVID-19 severity using genetic correlation (rg) and Mendelian randomization (MR).
Citation: The COVID-19 Host Genetics Initative. 2021. Mapping the human genetic architecture of COVID-19 by worldwide meta-analysis. MedRxiv . doi:10.1101/2021.03.10.21252820.
Data Avaliability
Exposures
A set of 38 disease, health and neuropsychiatric phenotypes were selected as potential COVID-19 risk factors based on their putative relevance to the disease susceptibility, severity, or mortality. Initial risk factors were selected based on guidelines from the CDC and OpenSafely .
-
docs/mrcovid_exposures.xlsx
: excel file listing the traits and respective GWAS used as exposures.
Outcomes
Genome-wide assocations studies for COIVD-19 susceptibility and severity were obtained from Release 5 of the COVID-19 host genetics initiative .
Outcomes included:
-
SARS-CoV-2 reported infection
-
COVID-19 Hospitalization
-
COVID-19 Critical Illness
Data Analysis
The Snakemake workflow management system was used to implement pipelines for the rg and MR anlaysis. Rules defineding the snakemake workflows for the rg and MR analyses are avliable in
workflow/rules/rg.smk
and
workflow/rules/mr.smk
. Scripts for implementing polygenic risk score analyses were not utlized in this analysis. This pipeline is a adapted from
Andrews et al (2020) Annals of Neurology
.
Final output files used in the current publication are avaliable in
docs/20210224
.
Genetic Correlation
Genetic correlations between the exposures and outcomes was estimated using LD Score (LDSC) regression.
Snakefile:
-
workflow/rules/rg.smk
: workflow for implementing MR anlaysis
Configuration files:
-
config_rg_covid.yaml
: parameters for implementing rg analysis
Script files:
-
workflow/scripts/rg_*
: script files for implementing spefific rules in the rg workflow
MR
Mendelian randomization analysis was conducted using the TwoSampleMR package. Primary analysis was conducted using IVW to estimated causal relationships, while senstivity analysis were conducted using WME, WMBE, MR-Egger and MR-PRESSO.
Snakefile
-
workflow/rules/mr.smk
: workflow for implementing MR anlaysis
Configuration files
-
config_covid_eur.yaml
: analysis parameters for MR analysis COIVD-19 ouctomes using EUR only popultions -
config_covid_eur_wo_ukbb.yaml
: analysis parameters for MR analysis COIVD-19 ouctomes using EUR only popultions, without samples from UKBB
Script files
-
workflow/scripts/mr_*
: script files for implementing spefific rules in MR workflow
Results
Genetic correlations and Mendelian randomization causal estimates between 38 traits and COVID-19 Critical Illness, COVID-19 Hospitalization and SARS-CoV-2 reported infection. Blue, negative genetic correlation and protective Mendelian randomization (MR) causal estimates; red, positive genetic correlation and risk MR causal estimates. Larger squares correspond to more significant P values, with genetic correlations or MR causal estimates significantly different from zero at a P < 0.05 shown as a full-sized square. Genetic correlations or causal estimates that are significantly different from zero at a false discovery rate (FDR) of 5% are marked with an asterisk. Forest plots display the causal estimates for each of the sensitivity analyses used in the MR analysis for trait pairs that were significant at an FDR of 5%. Individual scatter and funnel plots for each pair of traits are available in
docs/20210224/MRcovideur_MRScatterFunnelPlots.pdf
.
IVW: Inverse variance weighted analysis; WME: Weighted median estimator; WMBE: weighted mode based estimator; MR-PRESSO: Mendelian Randomization Pleiotropy RESidual Sum and Outlier. RBC: Red blood cell count
Genetic correlation results:
-
data/RGcovid
contains intermediatry files generated during rg workflow -
results/RGcovid
contains final results for rg analysis
MR results
-
data/MRcovideur
, anddata/MRcovideurwoukbb
contain intermediatry files generated during MR workflow -
results/MRcovideur
, andresults/MRcovideurwoukbb
contain final results for MR analysis
Code Snippets
83 84 | script: '../scripts/mr_FormatGwas.R' |
107 108 | script: '../scripts/mr_FormatGwas.R' |
124 125 126 127 128 129 130 | shell: """ plink --bfile {params.ref} --keep-allele-order --allow-no-sex --clump {input.ss} --clump-r2 {params.r2} --clump-kb {params.kb} --clump-p1 1 --clump-p2 1 --out {params.out}; #gzip -k {output.exp_clumped} --keep not working now?? gzip {output.exp_clumped}; zcat {output.exp_clumped_zipped} > {output.exp_clumped} """ |
142 143 | script: '../scripts/mr_ExposureData.R' |
155 156 | script: "../scripts/manhattan_plot.R" |
169 170 | script: '../scripts/mr_OutcomeData.R' |
182 183 184 185 186 187 188 189 190 191 192 193 | shell: """ if [ $(wc -l < {input.MissingSNPs}) -eq 0 ]; then touch {output.ProxyList} else plink --bfile {params.ref} \ --keep-allele-order \ --r2 dprime in-phase with-freqs \ --ld-snp-list {input.MissingSNPs} \ --ld-window-r2 0.8 --ld-window-kb 500 --ld-window 1000 --out {params.Outcome} fi """ |
207 208 | script: '../scripts/mr_ExtractProxySNPs.R' |
226 | script: "../scripts/mr_DataHarmonization.R" |
241 242 | script: '../scripts/mr_MRPRESSO.R' |
254 255 | script: '../scripts/mr_MRPRESSO_wo_outliers.R' |
268 | script: '../scripts/mr_analysis.R' |
278 | script: '../scripts/mr_SteigerTest.R' |
292 | shell: "awk 'FNR==1 && NR!=1{{next;}}{{print}}' {input.mr_steiger} > {output}" |
298 | script: '../scripts/mr_PowerEstimates.R' |
312 | shell: "awk 'FNR==1 && NR!=1{{next;}}{{print}}' {input.mr_power} > {output}" |
329 330 | script: '../scripts/mr_ConcatMRdat.R' |
353 354 | shell: "awk 'FNR==1 && NR!=1{{next;}}{{print}}' {input.mrpresso_global} > {output}" |
361 362 | shell: "awk 'FNR==1 && NR!=1{{next;}}{{print}}' {input.mrpresso_global_wo_outliers} > {output}" |
377 378 | shell: "awk 'FNR==1 && NR!=1{{next;}}{{print}}' {input.mrpresso_res} > {output}" |
394 395 | shell: "awk 'FNR==1 && NR!=1{{next;}}{{print}}' {input.mr_heterogenity} > {output}" |
411 412 | shell: "awk 'FNR==1 && NR!=1{{next;}}{{print}}' {input.mr_pleiotropy} > {output}" |
428 429 | shell: "awk 'FNR==1 && NR!=1{{next;}}{{print}}' {input.mr_results} > {output}" |
458 | script: "../scripts/mr_RenderReport.R" |
488 | script: "../scripts/mr_RenderFinalReport.R" |
35 | shell: "zcat {input} | sed '/^[[:blank:]]*#/d;s/#.*//' > {output}" |
42 43 | script: '../scripts/prs_sample_exclude.R' |
54 55 | shell: "plink --bfile {params.input} --keep-allele-order --keep {input.samp_keep} --geno 0.05 --maf 0.01 --hwe 1e-10 midp include-nonctrl --make-bed --out {params.out}" |
66 67 68 69 70 | shell: """ plink --bfile {params.input} --keep-allele-order --mind 0.05 --make-bed --out {params.out}; awk '{{ print $2 }}\' {params.out}.bim > {params.out}.snplist.txt """ |
82 83 84 85 86 87 88 89 | shell: """ plink --bfile {params.ref} --keep-allele-order --allow-no-sex \ --extract {input.snplist} \ --clump {input.ss} --clump-snp-field ID --clump-field P \ --clump-r2 {params.r2} --clump-kb {params.kb} --clump-p1 1 --clump-p2 1 \ --out {params.out} """ |
98 | script: "../scripts/prs_region_exclude.R" |
112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 | shell: """ ./src/PRSice \ --base {input.base} \ --target {params.target} \ --pheno-file {input.pheno} \ --pheno-col {outcome} \ --cov-file {input.covar} \ --cov-col {covar} \ --cov-factor {covar_factors} \ --no-clump \ --extract {input.snps} \ --beta --A1 ALT --A2 REF --bp POS --chr CHROM --pvalue P --snp ID --stat BETA \ --all-score \ --binary-target T \ --print-snp \ --bar-levels {bar_levels} \ --fastscore \ --perm 1000 \ --out {params.out} """ |
137 | script: "../scripts/prs_wrangle.R" |
150 151 | script: '../scripts/prs_glm.R' |
160 | script: '../scripts/prs_concat.R' |
166 | script: '../scripts/prs_concat.R' |
172 | script: '../scripts/prs_concat.R' |
185 | script: '../scripts/prs_output.Rmd' |
61 62 63 64 65 66 67 68 69 | shell: """ Rscript workflow/src/HDL/HDL.data.wrangling.R \ gwas.file={input} \ LD.path=data/raw/UKB_imputed_SVD_eigen99_extraction \ SNP={params.SNP} A1={params.A1} A2={params.A2} N={params.N} Z={params.Z} \ output.file={params.outfile} \ log.file={params.log} """ |
123 | shell: "awk 'FNR==1 && NR!=1{{next;}}{{print}}' {input} > {output}" |
136 | script: '../scripts/rg_plot_ldsc.R' |
147 148 149 150 151 152 153 154 155 | shell: """ Rscript workflow/src/HDL/HDL.run.R \ gwas1.df={input.g1} \ gwas2.df={input.g2} \ LD.path={input.ld} \ output.file={output}; Rscript workflow/scripts/rg_extract_hdl.R {output.log} """ |
160 | shell: "awk 'FNR==1 && NR!=1{{next;}}{{print}}' {input} > {output}" |
173 174 175 176 177 178 179 | shell: """ python2.7 workflow/src/GNOVA/gnova.py {input.g1} {input.g2} \ --bfile data/raw/bfiles/eur_chr@_SNPmaf5 \ --out {output.res}; Rscript workflow/scripts/rg_extract_gnova.R {output.res} {params.g1} {params.g2} """ |
184 | shell: "awk 'FNR==1 && NR!=1{{next;}}{{print}}' {input} > {output}" |
195 196 197 198 199 200 201 202 203 | shell: """ python3 workflow/src/SUPERGNOVA/supergnova.py {input.g1} {input.g2} \ --bfile data/raw/bfiles/eur_chr@_SNPmaf5 \ --partition workflow/src/SUPERGNOVA/partition/eur_chr@.bed \ --thread 8 \ --out {output}; Rscript workflow/scripts/rg_extract_supergnova.R {output} {params.g1} {params.g2} """ |
208 | shell: "awk 'FNR==1 && NR!=1{{next;}}{{print}}' {input} > {output}" |
45 | script: '../scripts/rg_region_exclude.R' |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 | if(any(grepl("conda", .libPaths(), fixed = TRUE))){ message("Setting libPaths") df = .libPaths() conda_i = which(grepl("conda", df, fixed = TRUE)) .libPaths(c(df[conda_i], df[-conda_i])) } .libPaths(c(.libPaths(), "/hpc/users/harern01/miniconda3/envs/py38/lib/R/library")) ## Load Packages library(tidyverse) library(ggplot2) library(ggman) library(gridExtra) snp.r2 <- function(eaf, beta){ 2*eaf*(1 - eaf)*beta^2 } ## Read in arguments infile_gwas = snakemake@input[["ingwas"]] infile_clump = snakemake@input[["inclump"]] outfile_plot = snakemake@output[["out"]] PlotTitle = snakemake@params[["PlotTitle"]] message(paste(PlotTitle, '\n')) ## Read in GWAS and Plink Clumped File trait.gwas <- vroom::vroom(infile_gwas, comment = '##', col_select = c(SNP, CHROM, POS, BETA, AF, P), col_types = list(SNP = col_character(), CHROM = col_double(), POS = col_double(), AF = col_number(), BETA = col_double(), P = col_double())) %>% mutate(r2 = snp.r2(AF, BETA)) trait.clump <- suppressMessages(read_table2(infile_clump)) %>% filter(!is.na(CHR)) %>% select(CHR, F, SNP, BP, P, TOTAL, NSIG) ## Count number of SNPs in clumped dataset clump.n <- trait.clump %>% mutate(p.cut = cut(P, breaks = c(Inf, 0.05, 5e-5, 5e-6, 5e-7, 5e-8, -Inf), lablels = F)) %>% left_join(select(trait.gwas, SNP, r2)) %>% mutate(r2 = as.numeric(r2)) %>% group_by(p.cut) %>% summarise(n = n(), r2 = round(sum(r2, na.rm = T), 4)) ## Count number of SNPs in total dataset all.n <- trait.gwas %>% mutate(p.cut = cut(P, breaks = c(Inf, 0.05, 5e-5, 5e-6, 5e-7, 5e-8, -Inf), lablels = F)) %>% group_by(p.cut) %>% summarise(n = n()) ## Merge Summary files tab.TRAIT <- full_join(all.n, clump.n, by = 'p.cut', suffix = c('.all', '.clump')) ## Calculate Maximum Pvalue max.p <- max(-log10(filter(trait.clump, P > 0)$P)) + 5 ## Index SNPs IndexSnps1 <- filter(trait.clump, P <= 5e-8)$SNP IndexSnps2 <- filter(trait.clump, P <= 5e-6 & P > 5e-8)$SNP ## Plot GWAS p.TRAIT <- ggman(filter(trait.gwas, P < 0.05), snp = 'SNP', chrom = 'CHROM', bp = 'POS', pvalue = 'P', ymin = 0, ymax = max.p, title = PlotTitle, sigLine = -log10(5e-8), relative.positions = TRUE) + theme_classic() + theme(text = element_text(size=10)) + geom_hline(yintercept = -log10(5e-5), colour = 'blue', size = 0.25) + geom_hline(yintercept = -log10(5e-6), colour = 'red', size = 0.25, linetype = 2) if(length(IndexSnps1) >= 1){ p.TRAIT <- ggmanHighlight(p.TRAIT, highlight = IndexSnps1, shape = 18, colour = 'red') } if(length(IndexSnps2) >= 1){ p.TRAIT <- ggmanHighlight(p.TRAIT, highlight = IndexSnps2, shape = 18, colour = 'blue') } out.TRAIT <- arrangeGrob(p.TRAIT, tableGrob(tab.TRAIT, theme = ttheme_minimal(base_size = 8)), ncol = 2, widths = 2:1, layout_matrix = rbind(c(1,2), c(1, NA))) ggsave(outfile_plot, device = 'png', units = 'in', width = 10, height = 5, plot = out.TRAIT) |
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 | if(any(grepl("conda", .libPaths(), fixed = TRUE))){ message("Setting libPaths") df = .libPaths() conda_i = which(grepl("conda", df, fixed = TRUE)) .libPaths(c(df[conda_i], df[-conda_i])) } ## LOAD packages library(tidyr) library(readr) library(dplyr) library(TwoSampleMR) ## For conducting MR https://mrcieu.github.io/TwoSampleMR/ source("workflow/scripts/miscfunctions.R", chdir = TRUE) ## Path to input/output input = snakemake@input[["mrdat"]] # Harmonized MR data output = snakemake@params[["out"]] # Output ## mrdat <- read_csv(input) %>% filter(mr_keep == TRUE) %>% filter(pleitropy_keep == TRUE) pt = mrdat %>% slice(1) %>% pull(pt) n_outliers <- nrow(mrdat %>% filter(mr_keep == TRUE)) - nrow(mrdat %>% filter(mr_keep == TRUE) %>% filter(mrpresso_keep == TRUE)) ## ================= w/ outliers ================= ## ## MR analysis mr_res <- mr(mrdat, method_list = c("mr_ivw_fe", "mr_weighted_median", "mr_weighted_mode", "mr_egger_regression")) %>% as_tibble() %>% mutate(outliers_removed = FALSE) %>% mutate(n_outliers = n_outliers) %>% mutate(pt = pt) %>% select(id.exposure, id.outcome, outcome, exposure, pt, outliers_removed, method, nsnp, n_outliers, b, se, pval) ## Cochrans Q heterogeneity test mr_heterogenity <- mr_heterogeneity(mrdat, method_list=c("mr_egger_regression", "mr_ivw")) %>% as_tibble() %>% mutate(outliers_removed = FALSE) %>% mutate(pt = pt) %>% select(id.exposure, id.outcome, outcome, exposure, pt, outliers_removed, method, Q, Q_df, Q_pval) ## MR Egger I2_Gx isq <- Isq_gx(mrdat) ## MR Egger Test of Pliotropy mr_plei <- mr_pleiotropy_test(mrdat) %>% as_tibble() %>% mutate(outliers_removed = FALSE) %>% mutate(pt = pt, Isq = isq) %>% select(id.exposure, id.outcome, outcome, exposure, pt, outliers_removed, Isq, egger_intercept, se, pval) ## ================= w/o outliers ================= ## if(n_outliers >= 1){ ## Remove outliers mrdat_mrpresso <- filter(mrdat, mrpresso_keep == T) ## MR Egger I2_Gx mrpresso_isq <- Isq_gx(mrdat_mrpresso) ## MR, Heterogenity, and Pleitorpy Tests mrpresso_res <- mr(mrdat_mrpresso, method_list = c("mr_ivw_fe", "mr_weighted_median", "mr_weighted_mode", "mr_egger_regression")) %>% as_tibble() %>% mutate(outliers_removed = TRUE) %>% mutate(n_outliers = n_outliers) %>% mutate(pt = pt) %>% select(id.exposure, id.outcome, outcome, exposure, pt, outliers_removed, method, nsnp, n_outliers, b, se, pval) mrpresso_heterogenity <- mr_heterogeneity(mrdat_mrpresso, method_list=c("mr_egger_regression", "mr_ivw")) %>% as_tibble() %>% mutate(outliers_removed = TRUE) %>% mutate(pt = pt) %>% select(id.exposure, id.outcome, outcome, exposure, pt, outliers_removed, method, Q, Q_df, Q_pval) mrpresso_plei <- mr_pleiotropy_test(mrdat_mrpresso) %>% as_tibble() %>% mutate(outliers_removed = TRUE) %>% mutate(pt = pt, Isq = mrpresso_isq) %>% select(id.exposure, id.outcome, outcome, exposure, pt, outliers_removed, Isq, egger_intercept, se, pval) }else{ ## If no outliers are removed, make empty dataframes mrpresso_res <- data.frame(id.exposure = as.character(mrdat[1,'id.exposure']), id.outcome = as.character(mrdat[1,'id.outcome']), outcome = as.character(mrdat[1,'outcome']), exposure = as.character(mrdat[1,'exposure']), pt = pt, outliers_removed = NA, method = 'mrpresso', nsnp = NA, n_outliers = NA, b = NA, se = NA, pval = NA) mrpresso_heterogenity <- data.frame( id.exposure = as.character(mrdat[1,'id.exposure']), id.outcome = as.character(mrdat[1,'id.outcome']), outcome = as.character(mrdat[1,'outcome']), exposure = as.character(mrdat[1,'exposure']), pt = pt, outliers_removed = NA, method = 'mrpresso', Q = NA, Q_df = NA, Q_pval = NA) mrpresso_plei <- data.frame( id.exposure = as.character(mrdat[1,'id.exposure']), id.outcome = as.character(mrdat[1,'id.outcome']), outcome = as.character(mrdat[1,'outcome']), exposure = as.character(mrdat[1,'exposure']), pt = pt, outliers_removed = NA, Isq = NA, egger_intercept = NA, se = NA, pval = NA) } ## ================= bind results ================= ## res_mr <- bind_rows(mr_res, mrpresso_res) %>% filter(!is.na(outliers_removed)) res_heterogenity <- bind_rows(mr_heterogenity, mrpresso_heterogenity) %>% filter(!is.na(outliers_removed)) res_plei <- bind_rows(mr_plei, mrpresso_plei) %>% filter(!is.na(outliers_removed)) ## ================= Write Out ================= ## res_heterogenity %>% write_tsv(paste0(output, '_MR_heterogenity.txt')) res_plei %>% write_tsv(paste0(output, '_MR_egger_plei.txt')) res_mr %>% write_tsv(paste0(output, '_MR_Results.txt')) |
5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 | if(any(grepl("conda", .libPaths(), fixed = TRUE))){ message("Setting libPaths") df = .libPaths() conda_i = which(grepl("conda", df, fixed = TRUE)) .libPaths(c(df[conda_i], df[-conda_i])) } library(tidyverse) library(plyr) input = snakemake@input[["dat"]] # Outcome Summary statistics output = snakemake@output[["out"]] mrpresso_MRdat <- input %>% map(., function(x){ datin <- read_csv(x, col_types = list(target_snp.outcome = col_character(), proxy.outcome = col_character(), proxy_snp.outcome = col_character(), target_a1.outcome = col_character(), target_a2.outcome = col_character(), proxy_a1.outcome = col_character(), proxy_a2.outcome = col_character(), effect_allele.outcome = col_character(), mrpresso_RSSobs= col_character(), mrpresso_pval= col_character())) }) %>% bind_rows() write_csv(mrpresso_MRdat, output) |
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 | if(any(grepl("conda", .libPaths(), fixed = TRUE))){ message("Setting libPaths") df = .libPaths() conda_i = which(grepl("conda", df, fixed = TRUE)) .libPaths(c(df[conda_i], df[-conda_i])) } message("Begining Harmonization \n") ### ===== Command Line Arguments ===== ## ## Infiles exposure.summary= snakemake@input[["ExposureSummary"]] outcome.summary = snakemake@input[["OutcomeSummary"]] proxy.snps = snakemake@input[["ProxySNPs"]] ## Outfile out.harmonized = snakemake@output[["Harmonized"]] ## Params for output pt = as.numeric(snakemake@params[["Pthreshold"]]) ExposureCode = snakemake@params[["excposurecode"]] OutcomeCode = snakemake@params[["outcomecode"]] ## Regions for exclusion regions_chrm = snakemake@params[["regions_chrm"]] regions_start = snakemake@params[["regions_start"]] regions_stop = snakemake@params[["regions_stop"]] ### ===== Load packages ===== ### message("Loading packages \n") library(tidyr) library(glue) library(readr) library(dplyr) library(TwoSampleMR) ## For conducting MR https://mrcieu.github.io/TwoSampleMR/ ### ===== Read In Data ===== ### message("READING IN EXPOSURE \n") exposure.dat <- read_tsv(exposure.summary) message("READING IN OUTCOME \n") outcome.dat <- read_tsv(outcome.summary) message("READING IN PROXY SNPs \n") proxy.dat <- read_csv(proxy.snps) %>% filter(proxy.outcome == TRUE) %>% select(proxy.outcome, target_snp, proxy_snp, ALT, REF, ALT.proxy, REF.proxy) %>% mutate(SNP = target_snp) %>% rename(target_snp.outcome = target_snp, proxy_snp.outcome = proxy_snp, target_a1.outcome = ALT, target_a2.outcome = REF, proxy_a1.outcome = ALT.proxy, proxy_a2.outcome = REF.proxy) ### ===== Harmonization ===== ### message("Formating Exposure and Outcome \n") # Format Exposure mr_exposure.dat <- format_data(exposure.dat, type = 'exposure', snp_col = 'SNP', chr_col = 'CHROM', pos_col = 'POS', beta_col = "BETA", se_col = "SE", eaf_col = "AF", effect_allele_col = "ALT", other_allele_col = "REF", pval_col = "P", z_col = "Z", samplesize_col = "N", phenotype_col = 'TRAIT') mr_exposure.dat$exposure = ExposureCode # Format Outcome mr_outcome.dat <- format_data(outcome.dat, type = 'outcome', snp_col = 'SNP', chr_col = 'CHROM', pos_col = 'POS', beta_col = "BETA", se_col = "SE", eaf_col = "AF", effect_allele_col = "ALT", other_allele_col = "REF", pval_col = "P", z_col = "Z", samplesize_col = "N", phenotype_col = 'TRAIT') mr_outcome.dat$outcome = OutcomeCode # Harmonize Datasets # Remove Plietropic variatns ## Exclude variants that are significant for outcome at a given Pt ## Exclude variants within genomic regions # gws_pt = 5e-8 gws_pt = 5e-300 if(is.null(regions_chrm)){ message("Harmonizing data \n") harmonized.MRdat <- harmonise_data(mr_exposure.dat, mr_outcome.dat) %>% as_tibble() %>% mutate(pt = pt, pleitropy_keep = case_when(pval.outcome <= gws_pt ~ FALSE, TRUE ~ TRUE)) } else { harmonized.MRdat <- harmonise_data(mr_exposure.dat, mr_outcome.dat) %>% as_tibble() %>% mutate(pt = pt, pleitropy_keep = case_when(pval.outcome <= gws_pt ~ FALSE, TRUE ~ TRUE)) for(i in 1:length(regions_chrm)){ message(glue("Harmonizing data and excluding regions: ", regions_chrm[i], ":", regions_start[i], "-", regions_stop[i])) harmonized.MRdat <- harmonized.MRdat %>% mutate(pleitropy_keep = case_when( pleitropy_keep == FALSE ~ FALSE, chr.exposure == regions_chrm[i] & between(pos.exposure, regions_start[i], regions_stop[i]) ~ FALSE, TRUE ~ TRUE)) } } print(filter(harmonized.MRdat, pleitropy_keep == FALSE) %>% select(SNP, chr.exposure, pos.exposure, pval.exposure, pval.outcome)) print(count(harmonized.MRdat, pleitropy_keep) ) ## Write out Harmonized data message("Exporting Data to ", out.harmonized, "\n") write_csv(harmonized.MRdat, out.harmonized) |
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 | if(any(grepl("conda", .libPaths(), fixed = TRUE))){ message("Setting libPaths") df = .libPaths() conda_i = which(grepl("conda", df, fixed = TRUE)) .libPaths(c(df[conda_i], df[-conda_i])) } ### ===== Command Line Arguments ===== ## exposure.summary = snakemake@input[["summary"]] # Exposure summary statistics exposure.clump = snakemake@input[["ExposureClump"]] p.threshold = as.numeric(snakemake@params[["Pthreshold"]]) out.file = snakemake@output[["out"]] # SPECIFY THE OUTPUT FILE ### ===== Load packages ===== ### suppressMessages(library(tidyverse)) ## For data wrangling library(here) source(here("workflow", "scripts", "miscfunctions.R"), chdir = TRUE) ### ===== Read in Data ===== ### message("\n READING IN EXPOSURE \n") exposure.dat <- read_tsv(exposure.summary, comment = '#', guess_max = 15000000) %>% filter(P < p.threshold) %>% distinct(SNP, .keep_all = TRUE) ### ===== Clump Exposure ===== ### message("\n CLUMPING EXPOSURE SNPS \n") ## Plink Pre-clumped mr_exposure.dat_ld <- read_table2(exposure.clump) %>% filter(!is.na(CHR)) %>% select(CHR, F, SNP, BP, P, TOTAL, NSIG) # Filter exposure data for clumped SNPs exposure.dat <- exposure.dat %>% filter(SNP %in% mr_exposure.dat_ld$SNP) # message("Distanced based clumping on lead SNP \n") # # Distanced based clumping on lead SNP # ls.p <- exposure.dat # ls.mat <- ls.p %>% # split(., .$CHROM) %>% # map(., select, SNP,POS) %>% # map(., column_to_rownames, var = "SNP") %>% # map(., dist, method = "manhattan", upper = TRUE) %>% # map_dfr(., tidy) %>% # mutate(clump = case_when(distance <= 250000 ~ 1, distance > 250000 ~ 0)) # # ls.clump <- list() # i = 1 # # while(!plyr::empty(ls.p)){ # snp1 <- ls.p %>% slice(1) %>% pull(SNP) # message("Distance Clumping: ", snp1) # ls.clump[[i]] <- snp1 # # snprm <- ls.mat %>% # filter(item1 == snp1 & clump == 1) %>% # pull(item2) # # ls.p <- ls.p %>% filter(SNP %nin% snprm) %>% filter(SNP %nin% snp1) # i <- i + 1 # } # # ls.clump <- unlist(ls.clump) # # exposure.dat <- exposure.dat %>% # mutate(LSclump = case_when(SNP %nin% ls.clump ~ FALSE, # TRUE ~ TRUE) ### ===== Write Out Exposure ===== ### message("\n Writing Out Exposure \n") write_tsv(exposure.dat, out.file) |
5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 | if(any(grepl("conda", .libPaths(), fixed = TRUE))){ message("Setting libPaths") df = .libPaths() conda_i = which(grepl("conda", df, fixed = TRUE)) .libPaths(c(df[conda_i], df[-conda_i])) } ## ==== Load required packages ==== ## suppressMessages(library(plyr)) suppressMessages(library(tidyverse)) `%nin%` = Negate(`%in%`) summary.path = snakemake@input[["OutcomeSummary"]] # Outcome Summary statistics proxy.path = snakemake@input[["OutcomeProxys"]] # prox snps snps.path = snakemake@input[["OutcomeSNPs"]] out = snakemake@params[["Outcome"]] message("READING IN OUTCOME AND PROXY's \n") summary.dat <- read_tsv(summary.path, comment = '#', guess_max = 15000000) proxy.dat <- read_table2(proxy.path) outcome.raw <- read_tsv(snps.path) if(empty(proxy.dat)){ message("NO PROXY SNPS AVALIABLE \n") outcome.dat <- outcome.raw %>% filter(!is.na(CHROM)) } else if (empty(filter(proxy.dat, SNP_A != SNP_B)) | empty(filter(summary.dat, SNP %in% (filter(proxy.dat, SNP_A != SNP_B) %>% pull(SNP_B)))) ){ message("NO PROXY SNPS AVALIABLE \n") outcome.dat <- outcome.raw %>% filter(!is.na(CHROM)) query_snps <- proxy.dat %>% filter(SNP_A == SNP_B) } else { message("WRANGLING TARGET AND PROXY SNPs \n") ## Filter Query SNPs query_snps <- proxy.dat %>% filter(SNP_A == SNP_B) %>% select(CHR_A, BP_A, SNP_A, PHASE) ## Filter Proxy SNPs proxy.snps <- proxy.dat %>% filter(SNP_A != SNP_B) %>% ## remove query snps left_join(summary.dat, by = c('SNP_B' = 'SNP')) %>% filter(!is.na(ALT)) %>% ## remove snps with missing information group_by(SNP_A) %>% ## by query snp arrange(-R2) %>% ## arrange by ld slice(1) %>% ## select top proxy snp ungroup() %>% rename(ALT.proxy = ALT, REF.proxy = REF) %>% select(-CHR_A, -CHR_B, -BP_A, -BP_B, -MAF_A, -MAF_B, -R2, -DP) ## remove uneeded columns ## Select correlated alleles alleles <- proxy.snps %>% select(PHASE) %>% mutate(PHASE = str_replace(PHASE, "/", "")) alleles <- str_split(alleles$PHASE, "", n = 4, simplify = T) colnames(alleles) <- c('ref', 'ref.proxy', 'alt', 'alt.proxy') alleles <- as_tibble(alleles) ## Bind Proxy SNPs and correlated alleles proxy.out <- proxy.snps %>% bind_cols(alleles) %>% rename(SNP = SNP_A) %>% mutate(ALT = ifelse(ALT.proxy == ref.proxy, ref, alt)) %>% mutate(REF = ifelse(REF.proxy == ref.proxy, ref, alt)) %>% mutate(CHROM = as.numeric(CHROM)) %>% mutate(AF = as.numeric(AF)) ## Outcome data outcome.dat <- outcome.raw %>% filter(SNP %nin% proxy.out$SNP) %>% bind_rows(select(proxy.out, SNP, CHROM, POS, REF, ALT, AF, BETA, SE, Z, P, N, TRAIT)) %>% arrange(CHROM, POS) %>% filter(!is.na(CHROM)) } message("\n EXPORTING \n") ## Write out outcomes SNPs write_tsv(outcome.dat, paste0(out, '_ProxySNPs.txt')) ## Write out Proxy SNPs if(empty(proxy.dat)){ tibble(proxy.outcome = NA, target_snp = NA, proxy_snp = NA, ld.r2 = NA, Dprime = NA, ref.proxy = NA, alt.proxy = NA, CHROM = NA, POS = NA, ALT.proxy = NA, REF.proxy = NA, AF = NA, BETA = NA, SE = NA, P = NA, N = NA, ref = NA, alt = NA, ALT = NA, REF = NA, PHASE = NA) %>% write_csv(paste0(out, '_MatchedProxys.csv')) } else if (empty(filter(proxy.dat, SNP_A != SNP_B)) | empty(filter(summary.dat, SNP %in% (filter(proxy.dat, SNP_A != SNP_B) %>% pull(SNP_B)))) ){ tibble(proxy.outcome = NA, target_snp = query_snps$SNP_A, proxy_snp = NA, ld.r2 = NA, Dprime = NA, ref.proxy = NA, alt.proxy = NA, CHROM = NA, POS = NA, ALT.proxy = NA, REF.proxy = NA, AF = NA, BETA = NA, SE = NA, P = NA, N = NA, ref = NA, alt = NA, ALT = NA, REF = NA, PHASE = NA) %>% write_csv(paste0(out, '_MatchedProxys.csv')) }else{ proxy.dat %>% select(SNP_A, SNP_B, R2, DP) %>% inner_join(proxy.out, by = c('SNP_A' = 'SNP', 'SNP_B')) %>% rename(target_snp = SNP_A, proxy_snp = SNP_B, ld.r2 = R2, Dprime = DP) %>% mutate(proxy.outcome = TRUE) %>% full_join(select(query_snps, SNP_A), by = c('target_snp' = 'SNP_A')) %>% write_csv(paste0(out, '_MatchedProxys.csv')) } |
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 | infile_gwas = snakemake@input[["ss"]] outfile = snakemake@output[["formated_ss"]] snp_col = snakemake@params[["snp_col"]] chrom_col = snakemake@params[["chrom_col"]] pos_col = snakemake@params[["pos_col"]] ref_col = snakemake@params[["ref_col"]] alt_col = snakemake@params[["alt_col"]] af_col = snakemake@params[["af_col"]] beta_col = snakemake@params[["beta_col"]] se_col = snakemake@params[["se_col"]] p_col = snakemake@params[["p_col"]] z_col = snakemake@params[["z_col"]] n_col = snakemake@params[["n_col"]] trait_col = snakemake@params[["trait_col"]] if(any(grepl("conda", .libPaths(), fixed = TRUE))){ message("Setting libPaths") df = .libPaths() conda_i = which(grepl("conda", df, fixed = TRUE)) .libPaths(c(df[conda_i], df[-conda_i])) } # library(gwasvcf) # library(VariantAnnotation) library(tidyverse) library(magrittr) message('\n', 'Columns names are: ', 'SNP: ', snp_col, ', CHROM: ', chrom_col, ', POS: ', pos_col, ', REF: ', ref_col, ', ALT: ', alt_col, ', AF: ', af_col, ', BETA: ', beta_col, ', SE: ', se_col, ', P: ', p_col, ', Z: ', z_col, ', N: ', n_col, ', TRAIT: ', trait_col, '\n') if(str_detect(infile_gwas, ".vcf.gz")){ ## Formating for IEU OPEN GWAS .vcf files message("IEU OPEN GWAS FORMAT") vcf <- VariantAnnotation::readVcf(infile_gwas) trait.gwas <- gwasvcf::vcf_to_tibble(vcf) traitID = tolower(samples(header(vcf))) ieugwas <- read_csv("data/raw/ieugwas_201020.csv") trait_meta <- select(ieugwas, id, trait, sample_size, ncase, ncontrol) %>% filter(id == traitID) out <- trait.gwas %>% mutate(EZ = ES/SE, P = 10^-LP, TRAIT = pull(trait_meta, trait), N = pull(trait_meta, sample_size), NCASE = pull(trait_meta, ncase), NCTRL = pull(trait_meta, ncontrol)) %>% filter(nchar(REF) == 1) %>% filter(nchar(ALT) == 1) %>% mutate(Z = ifelse(BETA == 0 & SE == 0, 0, Z)) %>% select(SNP = ID, CHROM = seqnames, POS = start, REF = REF, ALT = ALT, AF = AF, BETA = ES, SE, Z = EZ, P, N, TRAIT, NCASE, NCTRL) %>% distinct(SNP, .keep_all = TRUE) out %>% write_tsv(gzfile(outfile)) } else if(str_detect(infile_gwas, ".chrall.CPRA_b37.tsv.gz")){ ## Formating for MSSM files message("MSSM GWAS FORMAT") trait.gwas <- suppressMessages(vroom::vroom(infile_gwas, comment = '#', guess_max = 11000000)) out <- trait.gwas %>% rename(SNP = snp_col, CHROM = chrom_col, POS = pos_col, REF = ref_col, ALT = alt_col, AF = af_col, BETA = beta_col, SE = se_col, Z = z_col, P = p_col, N = n_col, TRAIT = trait_col) %>% filter(nchar(REF) == 1) %>% filter(nchar(ALT) == 1) %>% mutate(Z = ifelse(BETA == 0 & SE == 0, 0, Z)) %>% select(SNP, CHROM, POS, REF, ALT, AF, BETA, SE, Z, P, N, TRAIT) %>% drop_na %>% distinct(SNP, .keep_all = TRUE) out %>% write_tsv(gzfile(outfile)) } message('\n', 'SNPs in orginal file: ', nrow(trait.gwas), '; SNPs in formated file: ', pos_col, ', REF: ', nrow(out), '\n') |
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 | if(any(grepl("conda", .libPaths(), fixed = TRUE))){ message("Setting libPaths") df = .libPaths() conda_i = which(grepl("conda", df, fixed = TRUE)) .libPaths(c(df[conda_i], df[-conda_i])) } .libPaths(c(.libPaths(), "/hpc/users/harern01/miniconda3/envs/py38/lib/R/library")) ### ===== Command Line Arguments ===== ## infile = snakemake@input[["mrdat"]] # Exposure summary statistics out = snakemake@params[["out"]] ### ===== Load packages ===== ### library(tidyr) ## For data wrangling library(dplyr) ## For data transformation library(magrittr) ## For data wrangling library(readr) library(stringi) library(stringr) library(MRPRESSO) ## For detecting pleitropy ### ===== READ IN DATA ===== ### message("\n READING IN HARMONIZED MR DATA \n") mrdat.raw <- read_csv(infile) mrdat <- mrdat.raw %>% filter(mr_keep == TRUE) %>% filter(pleitropy_keep == TRUE) ## Data Frame of nsnps and number of iterations df.NbD <- data.frame(n = c(10, 50, 100, 500, 1000, 1500, 2000), NbDistribution = c(10000, 10000, 10000, 25000, 50000, 75000, 100000)) nsnps <- nrow(mrdat) SignifThreshold <- 0.05 NbDistribution <- df.NbD[which.min(abs(df.NbD$n - nsnps)), 2] ## Bonfernoni correction, needs a crazy amount of nbdistributions for larger nsnps. # SignifThreshold <- 0.05/nsnps # NbDistribution <- (nsnps+100)/SignifThreshold ### ===== MR-PRESSO ===== ### message("\n CALCULATING PLEITROPY \n") set.seed(333) mrpresso.out <- mr_presso(BetaOutcome = "beta.outcome", BetaExposure = "beta.exposure", SdOutcome = "se.outcome", SdExposure = "se.exposure", OUTLIERtest = TRUE, DISTORTIONtest = TRUE, data = as.data.frame(mrdat), NbDistribution = NbDistribution, SignifThreshold = SignifThreshold) ### ===== FORMAT DATA ===== ### ## extract RSSobs and Pvalue if("Global Test" %in% names(mrpresso.out$`MR-PRESSO results`)){ mrpresso.p <- mrpresso.out$`MR-PRESSO results`$`Global Test`$Pvalue RSSobs <- mrpresso.out$`MR-PRESSO results`$`Global Test`$RSSobs } else { mrpresso.p <- mrpresso.out$`MR-PRESSO results`$Pvalue RSSobs <- mrpresso.out$`MR-PRESSO results`$RSSobs } ## If Global test is significant, append outlier tests to mrdat if("Outlier Test" %in% names(mrpresso.out$`MR-PRESSO results`)){ outliers <- mrdat %>% bind_cols(mrpresso.out$`MR-PRESSO results`$`Outlier Test`) %>% rename(mrpresso_RSSobs = RSSobs, mrpresso_pval = Pvalue) %>% mutate(mrpresso_keep = as.numeric(str_replace_all(mrpresso_pval, pattern="<", repl="")) > SignifThreshold) %>% select(SNP, mrpresso_RSSobs, mrpresso_pval, mrpresso_keep) mrdat.out <- mrdat.raw %>% left_join(outliers, by = 'SNP') } else { mrdat.out <- mrdat.raw %>% mutate(mrpresso_RSSobs = NA, mrpresso_pval = NA) %>% mutate(mrpresso_keep = ifelse(mr_keep == TRUE, TRUE, NA)) } # Write n outliers, RSSobs and Pvalue to tibble mrpresso.dat <- tibble(id.exposure = as.character(mrdat[1,'id.exposure']), id.outcome = as.character(mrdat[1,'id.outcome']), outcome = as.character(mrdat[1,'outcome']), exposure = as.character(mrdat[1,'exposure']), pt = mrdat.raw %>% slice(1) %>% pull(pt), outliers_removed = FALSE, n_outliers = sum(mrdat.out$mrpresso_keep == F, na.rm = T), RSSobs = RSSobs, pval = mrpresso.p) # Write MR-PRESSO results out mrpresso.res <- extract2(mrpresso.out, 1) %>% as_tibble() %>% mutate(id.exposure = as.character(mrdat[1,'id.exposure']), id.outcome = as.character(mrdat[1,'id.outcome']), outcome = as.character(mrdat[1,'outcome']), exposure = as.character(mrdat[1,'exposure']), pt = mrdat.raw %>% slice(1) %>% pull(pt), method = "MR-PRESSO", nsnp = c(nrow(mrdat), nrow(filter(mrdat.out, mr_keep == T, pleitropy_keep == T, mrpresso_keep == T))) ) %>% select(id.exposure, id.outcome, outcome, exposure, pt, method, nsnp, outliers_removed = `MR Analysis`, b = `Causal Estimate`, sd = Sd, t.stat = `T-stat`, pval = `P-value`) ### ===== EXPORTING ===== ### message("\n EXPORTING REPORTS \n") sink(paste0(out, '.txt'), append=FALSE, split=FALSE) mrpresso.out sink() write_tsv(mrpresso.dat, paste0(out, '_global.txt')) write_tsv(mrpresso.res, paste0(out, '_res.txt')) write_csv(mrdat.out, paste0(out, '_MRdat.csv')) |
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 | if(any(grepl("conda", .libPaths(), fixed = TRUE))){ message("Setting libPaths") df = .libPaths() conda_i = which(grepl("conda", df, fixed = TRUE)) .libPaths(c(df[conda_i], df[-conda_i])) } .libPaths(c(.libPaths(), "/hpc/users/harern01/miniconda3/envs/py38/lib/R/library")) ### ===== Command Line Arguments ===== ## infile = snakemake@input[["mrdat"]] # Exposure summary statistics out = snakemake@params[["out"]] # infile = "data/MR_ADbidir/Grasby2020thickness/Willer2013hdl/Grasby2020thickness_5e-6_Willer2013hdl_mrpresso_MRdat.csv" # out = "/Users/sheaandrews/Dropbox/Research/PostDoc-MSSM/2_MR/2_DerivedData/mvpa/load/mvpa_5e-6_load" ### ===== Load packages ===== ### library(tidyr) ## For data wrangling library(dplyr) ## For data transformation library(magrittr) ## For data wrangling library(readr) library(stringi) library(stringr) library(MRPRESSO) ## For detecting pleitropy ### ===== READ IN DATA ===== ### message("\n READING IN HARMONIZED MR DATA \n") mrdat.raw <- read_csv(infile, col_types = list(mrpresso_pval = col_character())) %>% filter(mr_keep == TRUE) %>% filter(pleitropy_keep == TRUE) mrdat <- filter(mrdat.raw, mrpresso_keep == TRUE) if(nrow(mrdat) < nrow(mrdat.raw)){ if(nrow(mrdat) > 3){ ## Data Frame of nsnps and number of iterations df.NbD <- data.frame(n = c(10, 50, 100, 500, 1000, 1500, 2000), NbDistribution = c(10000, 10000, 10000, 25000, 50000, 75000, 100000)) nsnps <- nrow(mrdat) SignifThreshold <- 0.05 NbDistribution <- df.NbD[which.min(abs(df.NbD$n - nsnps)), 2] ### ===== MR-PRESSO ===== ### message("\n CALCULATING PLEITROPY \n") set.seed(333) mrpresso.out <- mr_presso(BetaOutcome = "beta.outcome", BetaExposure = "beta.exposure", SdOutcome = "se.outcome", SdExposure = "se.exposure", OUTLIERtest = FALSE, DISTORTIONtest = FALSE, data = as.data.frame(mrdat), NbDistribution = NbDistribution, SignifThreshold = SignifThreshold) ### ===== FORMAT DATA ===== ### ## extract RSSobs and Pvalue message("\n Extracting RSSobs and Pvalue \n") mrpresso.p <- mrpresso.out$`MR-PRESSO results`$`Global Test`$Pvalue RSSobs <- mrpresso.out$`MR-PRESSO results`$`Global Test`$RSSobs if("Outlier Test" %in% names(mrpresso.out$`MR-PRESSO results`)){ n_outliers = length(mrpresso.out$`MR-PRESSO results`$`Distortion Test`$`Outliers Indices`) } else { n_outliers = 0 } # Write n outliers, RSSobs and Pvalue to tibble message("\n Writing Results \n") mrpresso.dat <- tibble(id.exposure = as.character(mrdat[1,'id.exposure']), id.outcome = as.character(mrdat[1,'id.outcome']), outcome = as.character(mrdat[1,'outcome']), exposure = as.character(mrdat[1,'exposure']), pt = mrdat.raw %>% slice(1) %>% pull(pt), outliers_removed = TRUE, n_outliers = (nrow(mrdat.raw) - nrow(mrdat)), RSSobs = RSSobs, pval = mrpresso.p) mrpresso.res <- extract2(mrpresso.out, 1) %>% as_tibble() %>% mutate(id.exposure = as.character(mrdat[1,'id.exposure']), id.outcome = as.character(mrdat[1,'id.outcome']), outcome = as.character(mrdat[1,'outcome']), exposure = as.character(mrdat[1,'exposure']), pt = mrdat.raw %>% slice(1) %>% pull(pt), method = "MR-PRESSO_wo_outliers", nsnp = c(nrow(mrdat), nrow(mrdat) - n_outliers) ) %>% select(id.exposure, id.outcome, outcome, exposure, pt, method, nsnp, outliers_removed = `MR Analysis`, b = `Causal Estimate`, sd = Sd, t.stat = `T-stat`, pval = `P-value`) } else { message("\n Not enough intrumental variables \n") mrpresso.dat <- tibble(id.exposure = as.character(mrdat[1,'id.exposure']), id.outcome = as.character(mrdat[1,'id.outcome']), outcome = as.character(mrdat[1,'outcome']), exposure = as.character(mrdat[1,'exposure']), pt = mrdat.raw %>% slice(1) %>% pull(pt), outliers_removed = TRUE, n_outliers = (nrow(mrdat.raw) - nrow(mrdat)), RSSobs = NA, pval = NA) mrpresso.res <- tibble( id.exposure = as.character(mrdat[1,'id.exposure']), id.outcome = as.character(mrdat[1,'id.outcome']), outcome = as.character(mrdat[1,'outcome']), exposure = as.character(mrdat[1,'exposure']), pt = mrdat.raw %>% slice(1) %>% pull(pt), method = "MR-PRESSO_wo_outliers", nsnp = NA, outliers_removed = NA, b = NA, sd = NA, t.stat = NA, pval = NA) } } else { message("\n NO OUTLIER VARIANTS \n") mrpresso.dat <- tibble(id.exposure = as.character(mrdat[1,'id.exposure']), id.outcome = as.character(mrdat[1,'id.outcome']), outcome = as.character(mrdat[1,'outcome']), exposure = as.character(mrdat[1,'exposure']), pt = mrdat.raw %>% slice(1) %>% pull(pt), outliers_removed = TRUE, n_outliers = (nrow(mrdat.raw) - nrow(mrdat)), RSSobs = NA, pval = NA) mrpresso.res <- tibble( id.exposure = as.character(mrdat[1,'id.exposure']), id.outcome = as.character(mrdat[1,'id.outcome']), outcome = as.character(mrdat[1,'outcome']), exposure = as.character(mrdat[1,'exposure']), pt = mrdat.raw %>% slice(1) %>% pull(pt), method = "MR-PRESSO_wo_outliers", nsnp = NA, outliers_removed = NA, b = NA, sd = NA, t.stat = NA, pval = NA) } write_tsv(mrpresso.dat, paste0(out, '_global_wo_outliers.txt')) write_tsv(mrpresso.res, paste0(out, '_res_wo_outliers.txt')) |
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 | if(any(grepl("conda", .libPaths(), fixed = TRUE))){ message("Setting libPaths") df = .libPaths() conda_i = which(grepl("conda", df, fixed = TRUE)) .libPaths(c(df[conda_i], df[-conda_i])) } ### ===== Command Line Arguments ===== ## exposure.summary = snakemake@input[["ExposureSummary"]] # Exposure summary statistics outcome.summary = snakemake@input[["OutcomeSummary"]] # Outcome Summary statistics out = snakemake@params[["Outcome"]] ### ===== Load packages ===== ### library(tidyverse)## For data wrangling library(dplyr) library(readr) #suppressMessages(library(Hmisc)) ## Contains miscillaneous funtions ### ===== READ IN SNPs ===== ### message("READING IN EXPOSURE \n") exposure.dat <- read_tsv(exposure.summary) message("\n READING IN OUTCOME \n") outcome.dat.raw <- read_tsv(outcome.summary, comment = '##', guess_max = 15000000) ### ===== EXTACT SNPS ===== ### message("\n EXTRACTING SNP EFFECTS FROM OUTCOME GWAS \n") outcome.dat <- outcome.dat.raw %>% right_join(select(exposure.dat, SNP), by = 'SNP') ### ===== MISSING SNPS SNPS ===== ### outcome.dat %>% filter(is.na(CHROM)) %>% select(SNP) %>% write_tsv(paste0(out, '_MissingSNPs.txt'), col_names = F) message("\n EXPORTING \n") ## Write out outcomes SNPs write_tsv(outcome.dat, paste0(out, '_SNPs.txt')) |
5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 | if(any(grepl("conda", .libPaths(), fixed = TRUE))){ message("Setting libPaths") df = .libPaths() conda_i = which(grepl("conda", df, fixed = TRUE)) .libPaths(c(df[conda_i], df[-conda_i])) } ## Load librarys and functions .libPaths(c(.libPaths(), "/hpc/users/harern01/miniconda3/envs/py38/lib/R/library")) library(tidyr) library(dplyr) library(readr) library(purrr) library(TwoSampleMR) source("workflow/scripts/miscfunctions.R", chdir = TRUE) source("workflow/scripts/traits.R", chdir = TRUE) source("workflow/scripts/mr_PowerFunctions.R", chdir = TRUE) infile = snakemake@input[["infile"]] outfile = snakemake@output[["outfile"]] ## -------------------------------------------------------------------------- ## ## Read in Harmonized datasets MRdat.raw <- infile %>% read_csv(., guess_max = 1000) n_outliers <- MRdat.raw %>% summarise(., n_outliers = sum(!mrpresso_keep, na.rm = TRUE)) %>% pull(n_outliers) std.MRdat <- MRdat.raw %>% filter(pleitropy_keep == TRUE) %>% select(-samplesize.outcome, -samplesize.exposure) %>% left_join(samplesize, by = c('exposure' = 'code')) %>% rename(samplesize.exposure = samplesize, ncase.exposure = ncase, ncontrol.exposure = ncontrol, logistic.exposure = logistic, exposure.name = trait) %>% left_join(samplesize, by = c('outcome' = 'code')) %>% rename(samplesize.outcome = samplesize, ncase.outcome = ncase, ncontrol.outcome = ncontrol, logistic.outcome = logistic, outcome.name = trait) %>% ## Standardize continous outcomes mutate(st_beta.outcome = ifelse(logistic.outcome == FALSE, std_beta(z.outcome, eaf.outcome, samplesize.outcome), beta.outcome), st_se.outcome = ifelse(logistic.outcome == FALSE, std_se(z.outcome, eaf.outcome, samplesize.outcome), se.outcome)) %>% ## Standardize all exposures mutate(st_beta.exposure = std_beta(z.exposure, eaf.exposure, samplesize.exposure), st_se.exposure = std_se(z.exposure, eaf.exposure, samplesize.exposure)) %>% mutate(beta.outcome = st_beta.outcome, se.outcome = st_se.outcome, beta.exposure = st_beta.exposure, se.exposure = st_se.exposure) ## -------------------------------------------------------------------------- ## ## Rerun IVW MR analysis using standardized beta estimates ## Outliers retained std.res <- mr(std.MRdat, method_list = c("mr_ivw_fe")) %>% as_tibble(.) %>% mutate(outliers_removed = FALSE) %>% left_join(select(samplesize, code, logistic), by = c('outcome' = 'code')) %>% mutate(or = ifelse(logistic == TRUE, exp(b), NA)) %>% mutate(beta = ifelse(logistic == FALSE, b, NA)) ## Join std MR results with summary data for power analysis mrdat.power <- std.MRdat %>% filter(mr_keep == TRUE) %>% mutate(pve.exposure = snp.pve(eaf.exposure, beta.exposure, se.exposure, samplesize.exposure)) %>% summarise(exposure = first(exposure), outcome = first(outcome), pt = first(pt), pve.exposure = sum(pve.exposure), samplesize.exposure = max(samplesize.exposure), samplesize.outcome = max(samplesize.outcome), nsnps = n(), k.outcome = max(ncase.outcome)/max(samplesize.outcome)) %>% left_join(std.res, by = c('exposure', 'outcome', "nsnps" = 'nsnp')) %>% select(-id.exposure, -id.outcome) ## -------------------------------------------------------------------------- ## ## Rerun IVW MR analysis using standardized beta estimates ## Outliers retained if(n_outliers > 0){ std.res_wo_outliers <- std.MRdat %>% filter(mrpresso_keep == TRUE) %>% mr(., method_list = c("mr_ivw_fe")) %>% as_tibble(.) %>% mutate(outliers_removed = TRUE) %>% left_join(select(samplesize, code, logistic), by = c('outcome' = 'code')) %>% mutate(or = ifelse(logistic == TRUE, exp(b), NA)) %>% mutate(beta = ifelse(logistic == FALSE, b, NA)) mrdat.power_wo_outliers <- std.MRdat %>% filter(mr_keep == TRUE) %>% filter(mrpresso_keep == TRUE) %>% mutate(pve.exposure = snp.pve(eaf.exposure, beta.exposure, se.exposure, samplesize.exposure)) %>% summarise(exposure = first(exposure), outcome = first(outcome), pt = first(pt), pve.exposure = sum(pve.exposure), samplesize.exposure = max(samplesize.exposure), samplesize.outcome = max(samplesize.outcome), nsnps = n(), k.outcome = max(ncase.outcome)/max(samplesize.outcome)) %>% left_join(std.res_wo_outliers, by = c('exposure', 'outcome', "nsnps" = 'nsnp')) %>% select(-id.exposure, -id.outcome) } ## -------------------------------------------------------------------------- ## ## Power to detect observed effect ## observed_power.df <- mrdat.power %>% {if(n_outliers > 0) bind_rows(., mrdat.power_wo_outliers) else .} %>% split(., as.factor(1:nrow(.))) %>% map_dfr(., function(dat){ message('Processing: ', dat$exposure, ' - ', dat$outcome, ' - ', dat$pt) if(dat$logistic == TRUE){ results_binary(dat) } else { results_beta_based(dat) } }) write_tsv(observed_power.df, outfile) |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 | if(any(grepl("conda", .libPaths(), fixed = TRUE))){ message("Setting libPaths") df = .libPaths() conda_i = which(grepl("conda", df, fixed = TRUE)) .libPaths(c(df[conda_i], df[-conda_i])) } .libPaths(c(.libPaths(), "/hpc/users/harern01/.Rlib")) # log_path = snakemake@log[[1]] # # ## Logging messages # con <- file(log_path, open = "wt") # sink(con, type = "message") message("Render Final Report", "\n markdown: ", snakemake@input[["markdown"]], "\n intermediates_dir: ", snakemake@params[["output_dir"]], "\n output_file: ", snakemake@output[["outfile"]], "\n output_dir: ", snakemake@params[["output_dir"]], "\n model: ", snakemake@params[["project"]], "\n exposures: ", snakemake@params[["exposures"]], "\n outcomes: ", snakemake@params[["outcomes"]], "\n rwd: ", snakemake@params[["rwd"]], "\n rlib: ", snakemake@params[["rlib"]], "\n DATE: ", snakemake@params[["today_date"]], "\n MR_results.path: ", snakemake@input[["mrresults"]], "\n MRdat.path: ", snakemake@input[["mrdat"]], "\n mrpresso_res.path: ", snakemake@input[["mrpresso_res"]], "\n mrpresso_global.path: ", snakemake@input[["mrpresso_global"]], "\n mrpresso_global_wo_outliers.path: ", snakemake@input[["mrpresso_global_wo_outliers"]], "\n egger_comb.path: ", snakemake@input[["egger"]], "\n steiger.path: ", snakemake@input[["steiger"]], "\n power.path: ", snakemake@input[["power"]]) message("\n", getwd()) rmarkdown::render( input = snakemake@input[["markdown"]], clean = TRUE, intermediates_dir = snakemake@params[["output_dir"]], output_file = snakemake@output[["outfile"]], output_dir = snakemake@params[["output_dir"]], output_format = "all", params = list( model = snakemake@params[["project"]], rwd = snakemake@params[["rwd"]], exposures = snakemake@params[["exposures"]], outcomes = snakemake@params[["outcomes"]], rlib = snakemake@params[["rlib"]], DATE = snakemake@params[["today_date"]], MR_results.path = snakemake@input[["mrresults"]], MRdat.path = snakemake@input[["mrdat"]], mrpresso_res.path = snakemake@input[["mrpresso_res"]], mrpresso_global.path = snakemake@input[["mrpresso_global"]], mrpresso_global_wo_outliers.path = snakemake@input[["mrpresso_global_wo_outliers"]], egger_comb.path = snakemake@input[["egger"]], steiger.path = snakemake@input[["steiger"]], power.path = snakemake@input[["power"]] ) ) |
4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 | if(any(grepl("conda", .libPaths(), fixed = TRUE))){ message("Setting libPaths") df = .libPaths() conda_i = which(grepl("conda", df, fixed = TRUE)) .libPaths(c(df[conda_i], df[-conda_i])) } # log_path = snakemake@log[[1]] # # ## Logging messages # con <- file(log_path, open = "wt") # sink(con, type = "message") rmarkdown::render( input = snakemake@input[["markdown"]], clean = TRUE, intermediates_dir = snakemake@params[["output_dir"]], output_file = snakemake@output[["outfile"]], output_dir = snakemake@params[["output_dir"]], output_format = "all", params = list( rwd = snakemake@params[["rwd"]], Rlib = snakemake@params[["Rlib"]], exposure.snps = snakemake@input[["ExposureSnps"]], outcome.snps = snakemake@input[["OutcomeSnps"]], proxy.snps = snakemake@input[["ProxySnps"]], harmonized.dat = snakemake@input[["HarmonizedDat"]], mrpresso_global = snakemake@input[["mrpresso_global"]], power = snakemake@input[["power"]], outcome.code = snakemake@params[["OutcomeCode"]], exposure.code = snakemake@params[["ExposureCode"]], Outcome = snakemake@params[["Outcome"]], Exposure = snakemake@params[["Exposure"]], p.threshold = snakemake@params[["Pthreshold"]], r2.threshold = snakemake@params[["r2threshold"]], kb.threshold = snakemake@params[["kbthreshold"]], out = snakemake@params[["output_name"]]) ) |
5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 | if(any(grepl("conda", .libPaths(), fixed = TRUE))){ message("Setting libPaths") df = .libPaths() conda_i = which(grepl("conda", df, fixed = TRUE)) .libPaths(c(df[conda_i], df[-conda_i])) } .libPaths(c(.libPaths(), "/hpc/users/harern01/miniconda3/envs/py38/lib/R/library")) ## Load Packages library(tidyr) library(dplyr) library(readr) library(TwoSampleMR) source('workflow/scripts/miscfunctions.R', chdir = TRUE) source("workflow/scripts/traits.R", chdir = TRUE) ## estimate r for binary or continouse exposures/outcomes r_func <- function(x){ x$r.exposure <- if(logistic.exposure == TRUE){ x %>% mutate(r.exposure = get_r_from_lor(.$beta.exposure, .$eaf.exposure, .$ncase.exposure, .$ncontrol.exposure, .$prevelance.exposure)) %>% pull(r.exposure) } else if(logistic.exposure == FALSE){ x %>% mutate(r.exposure = get_r_from_pn(.$pval.exposure, .$samplesize.exposure)) %>% pull(r.exposure) } x$r.outcome <- if(logistic.outcome == TRUE){ x %>% mutate(r.outcome = get_r_from_lor(.$beta.outcome, .$eaf.outcome, .$ncase.outcome, .$ncontrol.outcome, .$prevelance.outcome)) %>% pull(r.outcome) } else if(logistic.outcome == FALSE){ x %>% mutate(r.outcome = get_r_from_pn(.$pval.outcome, .$samplesize.outcome)) %>% pull(r.outcome) } x } ## -------------------- TEST -------------------------## # exposure.code = "Kunkle2019load" # outcome.code = "Yengo2018bmi" # p.threshold = "5e-8" # dataout = "data/MR_ADbidir/" # # harmonized.dat = glue(dataout, "{exposure.code}/{outcome.code}/{exposure.code}_{p.threshold}_{outcome.code}_mrpresso_MRdat.csv") ## -------------------- Load/Wrangle Data -------------------------## harmonized.dat = snakemake@input[["mrdat"]] # Harmonized MR data pt = snakemake@params[["pt"]] outfile = snakemake@output[["outfile"]] # Output mrdat.raw <- read_csv(harmonized.dat) mrdat <- mrdat.raw %>% filter(pleitropy_keep == TRUE, mrkeep = TRUE) %>% select(-samplesize.outcome, -samplesize.exposure) %>% left_join(samplesize, by = c('exposure' = 'code')) %>% rename(samplesize.exposure = samplesize, ncase.exposure = ncase, ncontrol.exposure = ncontrol, logistic.exposure = logistic, exposure.name = trait, prevelance.exposure = prevelance) %>% left_join(samplesize, by = c('outcome' = 'code')) %>% rename(samplesize.outcome = samplesize, ncase.outcome = ncase, ncontrol.outcome = ncontrol, logistic.outcome = logistic, outcome.name = trait, prevelance.outcome = prevelance) %>% select(-domain.y, -domain.x, -pmid.x, -pmid.y) logistic.exposure <- mrdat %>% slice(1) %>% pull(logistic.exposure) logistic.outcome <- mrdat %>% slice(1) %>% pull(logistic.outcome) mrdat <- r_func(mrdat) ## -------------------- Directionality Test -------------------------## steiger_outliers <- directionality_test(mrdat) %>% mutate(outliers_removed = FALSE, pt = pt) steiger_out <- steiger_outliers if(sum(!mrdat$mrpresso_keep, na.rm = TRUE) > 0){ steiger_wo_outliers <- mrdat %>% filter(mrpresso_keep == TRUE) %>% directionality_test(.) %>% mutate(outliers_removed = TRUE, pt = pt) %>% bind_rows(steiger_outliers) steiger_out <- steiger_wo_outliers } ## -------------------- Write -------------------------## write_tsv(steiger_out, outfile) |
5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 | library(tidyverse) library(qvalue) ## File Paths res.files = snakemake@input[["res"]] out <- snakemake@output[["summ"]] file_type <- snakemake@params[["file_type"]] if(file_type == "association"){ message("\n Concat GLM assocation statistics: ", res.files, "\n") # GLM assocations stats ## res.files <- list.files(pattern = 'association.tsv', recursive = T) res <- map_dfr(res.files, read_tsv) %>% mutate(OR = exp(estimate), LCI = exp(estimate-std.error*1.96), UCI = exp(estimate+std.error*1.96)) write_csv(res, out) } else if(file_type == "prsice"){ message("\n Concat PRSice files: ", res.files, "\n") ##Summary files of output from PRSice ## prsice.files <- list.files(pattern = '.prsice', recursive = T) prsice.files <- map_dfr(res.files, function(x){ read_tsv(x) %>% mutate(Phenotype = str_extract(x, pattern = '(?<=prs_)[A-Za-z,0-9]+(?=_)')) }) %>% filter(Threshold == 5e-8) write_csv(prsice.files, out) } else if(file_type == "summary"){ message("\n Concat Summary files of output from PRSice: ", res.files, "\n") # Summary files of output from PRSice ## sum.files <- list.files(pattern = '.summary', recursive = T) sum.files <- map_dfr(res.files, function(x){ read_tsv(x) %>% mutate(Phenotype = str_extract(x, pattern = '(?<=prs_)[A-Za-z,0-9]+(?=_)')) }) write_csv(sum.files, out) } else { stop("In compatiable file type used as input") } |
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 | library(tidyverse) library(broom) zscore = function(x){(x - mean(x)) / sd(x)} # setwd('/Users/sheaandrews/LOAD_minerva/dummy/shea/Projects/AD_PRS') ## File Paths prs.file.path <- snakemake@input[["prs"]] sum.file.path <- snakemake@input[["summary"]] cohort.pheno.path <- snakemake@input[["pheno_file"]] ## params exposure = snakemake@params[["pheno"]] r2 <- snakemake@params[["r2"]] window <- snakemake@params[["window"]] # Read in summary file and extract best Pt message('READ IN SUMMARY FILE \n') best.prs <- read_tsv(sum.file.path) %>% pull(Threshold) %>% as.character() # Read in Pheno file message('READ IN PHENOTYPES \n') pheno <- read_tsv(cohort.pheno.path, col_names = T) # Read in all PRS file message('READ IN PRS \n') prs <- read_csv(prs.file.path) %>% # rename_at(vars(-FID, -IID), function(x) paste0('prs.pt_', x)) %>% mutate_at(vars(matches("prs.pt_")), zscore) %>% select(-prs.pc2) # extract Pt used pt <- select(prs, starts_with("prs")) %>% colnames(.) ## Join PRS and Phenotype data message('Join PRS and Phenotype data \n') dat <- left_join(prs, pheno) %>% select(FID:cohort, sex, status, aaoaae2, apoe4dose, jointPC1:jointPC10) %>% mutate(status = as.factor(status), sex = as.factor(sex), cohort = fct_infreq(cohort)) ## Peform logistic regression for each Pvalue threshold message('RUNNING REGRESSIONS \n') res <- map(pt, function(x){ select(dat, status, x, aaoaae2, sex, apoe4dose, cohort, jointPC1:jointPC10) %>% glm(status ~ ., family = 'binomial', data = .) %>% tidy(.) %>% mutate(pt = x, pt = case_when( str_detect(pt, 'prs.pt_') ~ str_replace(pt, 'prs.pt_', "") %>% as.numeric() %>% as.character(), str_detect(pt, 'prs.pc') ~ str_replace(pt, 'prs.', ""), TRUE ~ "NA" ), exposure = exposure) }) %>% bind_rows() %>% # mutate(pt = as.numeric(pt)) %>% mutate(model = case_when( pt == '5e-08' & pt == best.prs ~ 'GWS_best', pt == '5e-08' ~ "GWS", pt == best.prs ~ 'best', pt == "pc1" ~ 'PCA', TRUE ~ "other" ), r2 = r2, window = window) %>% select(exposure, pt, model, everything()) message('EXPORT \n') write_tsv(res, snakemake@output[["out"]]) |
20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 | res.path = snakemake@input[["association"]] prsice.path = snakemake@input[["prsice"]] sum.path = snakemake@input[["summary"]] model = snakemake@params[["model"]] rwd = snakemake@params[["rwd"]] r2 = snakemake@params[["r2"]] window = snakemake@params[["window"]] todaydate = Sys.Date() knitr::opts_chunk$set(echo = TRUE) knitr::opts_knit$set(root.dir = rwd) # knitr::opts_knit$set(root.dir = "/sc/arion/projects/LOAD/shea/Projects/AD_PRS") library(tidyverse) library(broom) library(glue) library(scales) ## Function for spreading multiple columns # https://community.rstudio.com/t/spread-with-multiple-value-columns/5378 myspread <- function(df, key, value) { # quote key keyq <- rlang::enquo(key) # break value vector into quotes valueq <- rlang::enquo(value) s <- rlang::quos(!!valueq) df %>% gather(variable, value, !!!s) %>% unite(temp, !!keyq, variable) %>% spread(temp, value) } |
59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 | traits <- tibble( code = c('Liu2019drnkwk23andMe', 'Liu2019smkint23andMe', 'Liu2019smkcpd23andMe', 'SanchezRoige2019auditt23andMe', "Niarchou2020meat", "Niarchou2020fish", 'Wells2019hdiff', 'Xue2018diab', 'Yengo2018bmi', 'Willer2013tc', 'Willer2013ldl', 'Willer2013hdl', 'Willer2013tg', 'Evangelou2018dbp', 'Evangelou2018sbp', 'Evangelou2018pp', 'Howard2019dep23andMe', 'Jansen2018insomnia23andMe', 'Dashti2019slepdur', 'Day2018sociso', 'Lee2018education23andMe', 'Huang2017aaos', 'Deming2017ab42', 'Hilbar2017hipv', 'Hilbar2015hipv', 'Lambert2013load', 'Kunkle2019load', 'Beecham2014braak4', 'Beecham2014npany', 'Deming2017ptau', 'Deming2017tau', 'Beecham2014vbiany', 'Klimentidis2018mvpa'), trait = c("Alcohol Consumption", "Smoking Initiation", "Cigarettes per Day", "AUDIT", "Meat diet", "Fish and Plant diet", "Hearing Difficulties", "Type 2 Diabetes", 'BMI', "Total Cholesterol", "Low-density lipoproteins", "High-density lipoproteins", "Triglycerides", "Diastolic Blood Pressure", "Systolic Blood Pressure", "Pulse Pressure", "Depressive Symptoms", "Insomnia Symptoms", "Sleep Duration", "Social Isolation", "Educational Attainment", "AAOS", "AB42", "Hippocampal Volume", "Hippocampal Volume", "LOAD", "LOAD", "Neurofibrillary Tangles", "Neuritic Plaques", "Ptau181", "Tau", "Vascular Brain Injury", "Moderate-to-vigorous PA"), abb = c("Alcohol Consumption", "Smoking", "Cigarettes per Day", "AUDIT", "Meat diet", "Fish & Plant diet", "Hearing Difficulties", "Diabetes", 'BMI', "Total Cholesterol", "LDL-cholesterol", "HDL-cholesterol", "Triglycerides", "DBP", "SBP", "PP", "Depression", "Insomnia", "Sleep", "Social Isolation", "Education", "AAOS", "AB42", "Hippocampal Volume", "Hippocampal Volume", "LOAD", "LOAD", "Neurofibrillary Tangles", "Neuritic Plaques", "Ptau181", "Tau", "Vascular Brain Injury", "Moderate-to-vigorous PA")) traits %>% knitr::kable() |
98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 | ## GLM association statistics res.files <- list.files(pattern = 'association.tsv', recursive = T) res <- map_dfr(res.files, read_tsv) %>% mutate(OR = exp(estimate), LCI = exp(estimate-std.error*1.96), UCI = exp(estimate+std.error*1.96)) ## Summary files of output from PRSice prsice.files <- list.files(pattern = '.prsice', recursive = T) %>% map_dfr(., function(x){read_tsv(x) %>% mutate(Phenotype = str_extract(x, pattern = '(?<=prs_)[A-Za-z,0-9]+(?=_)'))}) %>% filter(Threshold == 5e-8) ## Summary files of output from PRSice sum.files <- list.files(pattern = '.summary', recursive = T) %>% map_dfr(., function(x){read_tsv(x) %>% mutate(Phenotype = str_extract(x, pattern = '(?<=prs_)[A-Za-z,0-9]+(?=_)'))}) |
117 118 119 | res <- read_csv(res.path) prsice.files <- read_csv(prsice.path) sum.files <- read_csv(sum.path) |
126 127 128 129 130 131 132 133 | message("GWS") # Genome-wide Significant res.gws <- res %>% filter(str_detect(model, 'GWS'), str_detect(term, 'prs')) %>% mutate(fdr = p.adjust(p.value, method = 'BH')) %>% mutate(model = 'GWS') %>% left_join(select(prsice.files, Phenotype, Num_SNP), by = c('exposure' = 'Phenotype')) %>% arrange(fdr) |
137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 | res.gws %>% left_join(traits, by = c('exposure' = 'code')) %>% mutate(estimate = round(estimate, 3), std.error = round(std.error, 3), b_se = paste0(estimate, ' (', std.error, ')'), OR = round(OR, 2), LCI = round(LCI, 2), UCI = round(UCI, 2), out = paste0(OR, ' [', LCI, ', ', UCI, ']'), pt = ifelse(pt < 0.001, scientific(pt , digits = 2), pt), fdr = ifelse(fdr < 0.001, scientific(fdr , digits = 2), round(fdr, 3)), p.value = ifelse(p.value < 0.001, scientific(p.value , digits = 2), round(p.value, 3))) %>% select(trait, model, pt, Num_SNP, b_se, p.value, fdr, out) %>% arrange(as.numeric(p.value)) %>% select(trait, "n SNP" = Num_SNP, "b (se)" = b_se, "OR (95% CI)" = out, p.value, fdr) %>% knitr::kable(caption = "Assoction of GWS PRS with AD", format = "html") %>% kableExtra::kable_styling() |
157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 | ## Genome Wide Significant Results p.gws <- res.gws %>% arrange(OR) %>% left_join(traits, by = c('exposure' = 'code')) %>% mutate(exposure = fct_inorder(exposure), trait = fct_inorder(trait)) %>% ggplot(., aes(x = OR, y = trait, colour = fdr < 0.05)) + geom_vline(xintercept = 1, linetype = 2, colour = 'grey75') + geom_point() + geom_errorbarh(aes(xmax = UCI, xmin = LCI), height = 0) + scale_color_manual(values = c('black', 'red')) + theme_bw() + theme(legend.position = 'none', axis.title.y = element_blank()) p.gws ggsave(glue("results/{model}_coeffplot_gws_{todaydate}_rsq{r2}_{window}kb.jpg"), plot = p.gws, height = 3, width = 3, units = 'in') |
178 179 180 181 182 183 184 | # Best Pt res.best <- res %>% filter(str_detect(model, 'best'), str_detect(term, 'prs')) %>% mutate(fdr = p.adjust(p.value, method = 'BH')) %>% mutate(model = 'best') %>% arrange(fdr) %>% left_join(select(sum.files, Phenotype, Num_SNP), by = c('exposure' = 'Phenotype')) |
189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 | res.best %>% left_join(traits, by = c('exposure' = 'code')) %>% mutate(estimate = round(estimate, 3), std.error = round(std.error, 3), b_se = paste0(estimate, ' (', std.error, ')'), OR = round(OR, 2), LCI = round(LCI, 2), UCI = round(UCI, 2), out = paste0(OR, ' [', LCI, ', ', UCI, ']'), pt = ifelse(pt < 0.001, scientific(pt , digits = 2), pt), fdr = ifelse(fdr < 0.001, scientific(fdr , digits = 2), round(fdr, 3)), p.value = ifelse(p.value < 0.001, scientific(p.value , digits = 2), round(p.value, 3))) %>% select(trait, model, pt, Num_SNP, b_se, p.value, fdr, out) %>% arrange(as.numeric(p.value)) %>% select(trait, "n SNP" = Num_SNP, "b (se)" = b_se, "OR (95% CI)" = out, p.value, fdr) %>% knitr::kable(caption = "Assoction of best PRS threshold with AD", format = "html") %>% kableExtra::kable_styling() |
210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 | ## Plot p.best <- res.best %>% arrange(OR) %>% left_join(traits, by = c('exposure' = 'code')) %>% mutate(exposure = fct_inorder(exposure), trait = fct_inorder(trait)) %>% ggplot(., aes(x = OR, y = trait, colour = fdr < 0.05)) + geom_vline(xintercept = 1, linetype = 2, colour = 'grey75') + geom_point() + geom_errorbarh(aes(xmax = UCI, xmin = LCI), height = 0) + scale_color_manual(values = c('black', 'red')) + theme_bw() + theme(legend.position = 'none', axis.title.y = element_blank(), text = element_text(size=10)) p.best ggsave(glue("results/{model}_coeffplot_best_{todaydate}_rsq{r2}_{window}kb.jpg"), plot = p.best, height = 3, width = 3, units = 'in') |
234 235 236 237 238 239 240 | # Best Pt res.pca <- res %>% filter(str_detect(model, 'PCA'), str_detect(term, 'prs')) %>% mutate(fdr = p.adjust(p.value, method = 'BH')) %>% mutate(model = 'PCA') %>% arrange(fdr) %>% left_join(select(sum.files, Phenotype, Num_SNP), by = c('exposure' = 'Phenotype')) |
245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 | res.pca %>% left_join(traits, by = c('exposure' = 'code')) %>% mutate(estimate = round(estimate, 3), std.error = round(std.error, 3), b_se = paste0(estimate, ' (', std.error, ')'), OR = round(OR, 2), LCI = round(LCI, 2), UCI = round(UCI, 2), out = paste0(OR, ' [', LCI, ', ', UCI, ']'), pt = ifelse(pt < 0.001, scientific(pt , digits = 2), pt), fdr = ifelse(fdr < 0.001, scientific(fdr , digits = 2), round(fdr, 3)), p.value = ifelse(p.value < 0.001, scientific(p.value , digits = 2), round(p.value, 3))) %>% select(trait, model, pt, Num_SNP, b_se, p.value, fdr, out) %>% arrange(as.numeric(p.value)) %>% select(trait, "n SNP" = Num_SNP, "b (se)" = b_se, "OR (95% CI)" = out, p.value, fdr) %>% knitr::kable(caption = "Assoction of best PRS threshold with AD", format = "html") %>% kableExtra::kable_styling() |
266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 | ## Plot p.pca <- res.pca %>% arrange(OR) %>% left_join(traits, by = c('exposure' = 'code')) %>% mutate(exposure = fct_inorder(exposure), trait = fct_inorder(trait)) %>% ggplot(., aes(x = OR, y = trait, colour = fdr < 0.05)) + geom_vline(xintercept = 1, linetype = 2, colour = 'grey75') + geom_point() + geom_errorbarh(aes(xmax = UCI, xmin = LCI), height = 0) + scale_color_manual(values = c('black', 'red')) + theme_bw() + theme(legend.position = 'none', axis.title.y = element_blank(), text = element_text(size=10)) p.pca ggsave(glue("results/{model}_coeffplot_best_{todaydate}_rsq{r2}_{window}kb.jpg"), plot = p.pca, height = 3, width = 3, units = 'in') |
290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 | ## Output results out <- bind_rows(res.gws, res.best, res.pca) %>% left_join(traits, by = c('exposure' = 'code')) %>% mutate(estimate = round(estimate, 3), std.error = round(std.error, 3), b_se = paste0(estimate, ' (', std.error, ')'), OR = round(OR, 2), LCI = round(LCI, 2), UCI = round(UCI, 2), out = paste0(OR, ' [', LCI, ', ', UCI, ']'), pt = ifelse(pt < 0.001, scientific(pt , digits = 2), pt), fdr = ifelse(fdr < 0.001, scientific(fdr , digits = 2), round(fdr, 3)), p.value = ifelse(p.value < 0.001, scientific(p.value , digits = 2), round(p.value, 3))) %>% select(trait, model, pt, Num_SNP, b_se, p.value, fdr, out) %>% pivot_wider(names_from = model, values_from = c(pt, Num_SNP, b_se, p.value, fdr, out)) %>% arrange(as.numeric(p.value_GWS)) %>% select(trait, Num_SNP_GWS, b_se_GWS, p.value_GWS, fdr_GWS, out_GWS, pt_best, Num_SNP_best, b_se_best, p.value_best, fdr_best, out_best, b_se_PCA, p.value_PCA, fdr_PCA, out_PCA) write_csv(out, glue("results/{model}_rsq{r2}_{window}kb_table1_{todaydate}.csv")) |
315 316 317 318 319 320 321 322 | # pt = best_pt, "n SNP" = best_Num_SNP, "b (se)" = best_b_se, "OR (95% CI)" = best_out, p = best_p.value, fdr = GWS_fdr out %>% rename( "n SNP" = Num_SNP_GWS, "b (se)" = b_se_GWS, "OR (95% CI)" = out_GWS, p = p.value_GWS, fdr = fdr_GWS) %>% knitr::kable(caption = "Comparison of GWS and Best pt Results", format = "html") %>% kableExtra::kable_styling() %>% kableExtra::add_header_above(c(" " = 1, "GWS" = 5, "Best Pt" = 6, "PRS-PCA" = 4)) %>% kableExtra::scroll_box(width = "100%") |
327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 | ## Joint Plot p.joint <- bind_rows(res.gws, res.best, res.pca) %>% left_join(traits, by = c('exposure' = 'code')) %>% mutate(model = fct_relevel(model, 'GWS', 'best', 'PCA')) %>% arrange(model, OR) %>% mutate(exposure = fct_inorder(exposure), trait = fct_inorder(trait), signif = case_when(fdr < 0.05 ~ "*", fdr < 0.1 ~ ".", fdr >= 0.1 ~ " ")) %>% ggplot(., aes(x = OR, y = trait, colour = signif)) + geom_vline(xintercept = 1, linetype = 2, colour = 'grey75') + geom_point() + geom_errorbarh(aes(xmax = UCI, xmin = LCI), height = 0) + scale_color_manual(values = c(" " = 'black', "." = "orange", "*" = 'red')) + facet_grid(cols = vars(model)) + theme_bw() + theme(legend.position = 'none', axis.title.y = element_blank()) p.joint ggsave(glue("results/{model}_coeffplot_joint_{todaydate}_rsq{r2}_{window}kb.jpg"), plot = p.joint, height = 5, width = 8, units = 'in') |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 | suppressMessages(library(tidyverse)) suppressMessages(library(glue)) # Read in Data clumped = snakemake@input[[1]] out.snps = snakemake@output[[1]] # Regions for exclusion regions_chrm = snakemake@params[["regions_chrm"]] regions_start = snakemake@params[["regions_start"]] regions_stop = snakemake@params[["regions_stop"]] message("Reading in Data \n") dat_ld <- read_table2(clumped) %>% filter(!is.na(CHR)) %>% select(CHR, F, SNP, BP, P, TOTAL, NSIG) if(is.null(regions_chrm)){ message("No exclusions \n") out <- dat_ld %>% select(SNP) } else { out <- dat_ld for(i in 1:length(regions_chrm)){ message(glue("Excluding regions: ", regions_chrm[i], ":", regions_start[i], "-", regions_stop[i], " (n = {n})", n = filter(out, (CHR == regions_chrm[i] & between(BP, regions_start[i], regions_stop[i]) )) %>% nrow())) out <- filter(out, !(CHR == regions_chrm[i] & between(BP, regions_start[i], regions_stop[i]) )) } out <- select(out, SNP) } write_tsv(out, out.snps) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 | library(tidyverse) library(fastDummies) `%nin%` = Negate(`%in%`) ## Read in ADGC pheno file cohort.pheno.path <- snakemake@input[["pheno_file"]] keep_out <- snakemake@output[["keep"]] pheno_out <- snakemake@output[["pheno"]] ## Exclude samples message('\n READ AND FILTER DATA \n') dat <- read_tsv(cohort.pheno.path) df <- dat %>% filter(!QC_omit) %>% filter(!ADGC_omit) %>% filter(!rel_omit) %>% filter(status %in% c(1,2)) %>% filter(!is.na(jointPC1)) %>% filter(!is.na(aaoaae2)) %>% filter(!is.na(apoe4dose)) %>% filter(cohort %nin% c('GSK')) %>% mutate(cohort = fct_infreq(cohort)) %>% dummy_cols(., select_columns = "cohort", remove_first_dummy = TRUE) # filter(cohort %nin% c('ACT2', 'BIOCARD', 'EAS', 'UKS', 'GSK')) ## Write out list of sample to keep message('\n EXPORT \n') df %>% select(FID, IID) %>% write_tsv(keep_out) ## Males Only df %>% filter(sex == 1) %>% select(FID, IID) %>% write_tsv(paste0(keep_out, '_male')) ## Females Only df %>% filter(sex == 2) %>% select(FID, IID) %>% write_tsv(paste0(keep_out, '_female')) # ## Write out dataframe of phenotypes write_tsv(df, pheno_out) message('\n TABLE OF EXCLUDED \n') exclusions <- summarise(dat, total = nrow(dat) - nrow(df), QC_omit = sum(QC_omit), ADGC_omit = sum(ADGC_omit, na.rm = T), rel_omit = sum(rel_omit, na.rm = T), status = sum(status %nin% c(1,2)), miss.pc = sum(is.na(jointPC1)), miss.aaoaae2 = sum(is.na(aaoaae2)), miss.apoe4dose = sum(is.na(apoe4dose))) exclusions |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 | library(tidyverse) library(here) source(here("workflow", "scripts", "prs_pca.R"), chdir = TRUE) # Define Input prs.file.path <- snakemake@input[[1]] outfile = snakemake@output[[1]] # Read in all PRS file message('READ IN PRS \n') prs <- read_table2(prs.file.path) %>% rename_at(vars(-FID, -IID), function(x) paste0('prs.pt_', x)) prspca <- prs.pc(prs) prs.out <- left_join(prs, prspca$data) write_csv(prs.out, outfile) |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 | library(dplyr) library(readr) library(stringr) args = commandArgs(trailingOnly = TRUE) # Set arguments from the command line input = args[1] g1 = args[2] g2 = args[3] dat <- read_table2(input) %>% mutate(trait1 = g1, trait2 = g2) %>% relocate(annot_name, trait1, trait2) write_tsv(dat, str_replace(input, ".txt", ".tsv")) |
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 | library(tidyr) library(readr) library(dplyr) library(plyr) library(magrittr) args = commandArgs(trailingOnly = TRUE) # Set arguments from the command line input = args[1] ## Extract rg results from HDL log output extract_hdl <- function(x){ hdl_res <- read_lines(x) tribble( ~term, ~var, "h2_trait1", extract(hdl_res, str_detect(hdl_res, "Heritability of phenotype 1:")), "h2_trait2", extract(hdl_res, str_detect(hdl_res, "Heritability of phenotype 2:")), "rho", extract(hdl_res, str_detect(hdl_res, "Genetic Covariance:")), "rg", extract(hdl_res, str_detect(hdl_res, "Genetic Correlation:")), "p", extract(hdl_res, str_detect(hdl_res, "P:")), ) %>% separate(var, into = c("var", "est"), sep = ":") %>% separate(est, into = c("b", "se"), sep = "\\(") %>% mutate(se = str_remove(se, "\\)"), b = str_trim(b) %>% as.numeric(), se = str_trim(se) %>% as.numeric()) %>% select(-var) %>% pivot_wider(names_from = term, values_from = c(b, se)) %>% mutate(trait1 = extract(hdl_res, str_detect(hdl_res, "gwas1.df")) %>% str_extract(., pattern = '(?<=input/hdl/)[A-Za-z,0-9]+(?=_formated)'), trait2 = extract(hdl_res, str_detect(hdl_res, "gwas2.df")) %>% str_extract(., pattern = '(?<=input/hdl/)[A-Za-z,0-9]+(?=_formated)')) %>% select(trait1, trait2, b_h2_trait1, se_h2_trait1, b_h2_trait2, se_h2_trait2, b_rho, se_rho, b_rg, se_rg, p = b_p, -se_p) } dat = read_lines(input) if(str_detect(dat, "Genetic Covariance:") %>% any()){ # If HDL finished suscesucully message("HDL output present: Extracting") extract_hdl(input) %>% mutate(success = ifelse(b_rg == "Inf", FALSE, TRUE)) %>% write_tsv(., str_replace(input, ".Rout", ".tsv")) } else { # If HDL threw an error message("HDL output not present") tibble( trait1 = str_extract(input, pattern = '(?<=res/hdl/)[A-Za-z,0-9]+(?=_)'), trait2 = str_extract(input, pattern = '(?<=_)[A-Za-z,0-9]+(?=_HDL)'), b_h2_trait1 = NA, se_h2_trait1 = NA, b_h2_trait2 = NA, se_h2_trait2 = NA, b_rho = NA, se_rho = NA, b_rg = NA, se_rg = NA, p = NA, success = FALSE ) %>% write_tsv(., str_replace(input, ".Rout", ".tsv")) } |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 | library(dplyr) library(readr) library(stringr) args = commandArgs(trailingOnly = TRUE) # Set arguments from the command line input = args[1] g1 = args[2] g2 = args[3] dat <- read_table2(input) %>% mutate(trait1 = g1, trait2 = g2) %>% relocate(trait1, trait2) write_tsv(dat, input) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 | library(tidyverse) library(here) library(gtools) source(here("workflow", "scripts", "miscfunctions.R"), chdir = TRUE) # setwd("/sc/arion/projects/LOAD/shea/Projects/MRcovid") # rg.path = 'results/RGcovid/ldsc/all_results_ldsc_2021-01-12.tsv' rg.path = snakemake@input[["rg"]] DATE = snakemake@params[["date"]] out.path = snakemake@params[["out"]] ## Munge datasets exposures_outcomes <- expand_grid(exposures, outcomes) %>% magrittr::set_colnames(c("exposures.name", "outcomes.name")) %>% left_join(select(samplesize, code, trait, domain), by = c('exposures.name' = 'code')) %>% rename(exposure.name = trait, domain.exposure = domain) %>% left_join(select(samplesize, code, trait, domain), by = c('outcomes.name' = 'code')) %>% rename(outcome.name = trait, domain.outcome = domain) %>% mutate(UKB = !str_detect(outcome.name, "UKB")) rg <- read_tsv(rg.path) %>% mutate(p1 = str_replace(p1, "\\.sumstats.gz", ""), p1 = str_replace(p1, "data/formated/ldsc/", ""), p2 = str_replace(p2, "data/formated/ldsc/", ""), p2 = str_replace(p2, ".sumstats.gz", "")) %>% filter(p1 %in% exposures) %>% filter(p2 %in% outcomes) %>% left_join(select(exposures_outcomes, exposures.name, outcomes.name, exposure.name, outcome.name, UKB), by = c("p1" = "exposures.name", "p2" = "outcomes.name")) %>% group_by(UKB) %>% mutate(fdr = p.adjust(p, "fdr")) %>% ungroup() %>% mutate(lci = rg - (1.96 * se), uci = rg + (1.96 * se), signif = case_when(p > 0.05 ~ "NS", p < 0.05 & fdr > 0.05 ~ "Nominal", fdr < 0.05 ~ "Significant"), signif = replace_na(signif, "NS"), signif = fct_relevel(signif, "NS", "Nominal", "Significant"), stars = stars.pval(fdr)) %>% arrange(UKB, p2, fdr) ## Functions of heatmap and forest plots forrest_rg <- function(dat){ ggplot(dat, aes(y = exposure.name, x = rg, colour = signif)) + geom_vline(xintercept = 0, linetype = 2) + facet_grid(. ~ outcome.name, scales = "free_x") + geom_point() + geom_linerange(aes(xmin = lci, xmax = uci)) + scale_color_manual(values = c("grey75", "black", "red")) + theme_bw() + labs(title = "Genetic correlation", x = "rg (95%CI)") + theme(legend.position = "bottom", legend.title = element_blank(), axis.title.y=element_blank(), strip.background = element_blank(), strip.text.y = element_blank(), panel.grid.minor.x = element_blank()) } heatmap_rg <- function(dat){ ggplot(dat) + geom_raster(aes(x = exposure.name, y = outcome.name, fill = rg)) + geom_text(data = dat, size = 4, aes(label = stars, x = exposure.name, y = outcome.name)) + scale_fill_gradient2(low="steelblue", high="firebrick", mid = "white", na.value = "grey75", name = "rg", limits = c(-1,1)) + geom_vline(xintercept=seq(0.5, 23.5, 1),color="white") + geom_hline(yintercept=seq(0.5, 11.5, 1),color="white") + coord_equal() + theme_classic() + theme(legend.position = 'right', legend.key.height = unit(1, "line"), axis.text.x = element_text(angle = 35, hjust = 0), legend.text = element_text(hjust = 1.5), text = element_text(size=10), axis.title.x = element_blank(), axis.title.y = element_blank()) + scale_x_discrete(position = "top") } ## Print Forrest Plots rg_forrest_eur.p <- rg %>% filter(UKB == TRUE) %>% arrange(p2, rg) %>% mutate(exposure.name = fct_inorder(exposure.name)) %>% forrest_rg(.) ggsave(glue(out.path, "forrest_eur_{DATE}.png"), rg_forrest_eur.p ,width = 9, height = 6, units = "in") rg_forrest_eurLeaveUKB.p <- rg %>% filter(UKB == FALSE) %>% arrange(p2, rg) %>% mutate(exposure.name = fct_inorder(exposure.name)) %>% forrest_rg(.) ggsave(glue(out.path, "forrest_eurLeaveUKB_{DATE}.png"), rg_forrest_eurLeaveUKB.p ,width = 9, height = 6, units = "in") ## Print Heatmaps rg_heatmap_eur.p <- rg %>% filter(UKB == TRUE) %>% arrange(p2, exposure.name) %>% mutate(exposure.name = fct_inorder(exposure.name), outcome.name = fct_rev(outcome.name)) %>% heatmap_rg() ggsave(glue(out.path, "heatmap_eur_{DATE}.png"), rg_heatmap_eur.p, width = 11, height = 4, units = "in") rg_heatmap_eurLeaveUKB.p <- rg %>% filter(UKB == FALSE) %>% arrange(p2, exposure.name) %>% mutate(exposure.name = fct_inorder(exposure.name), outcome.name = fct_rev(outcome.name)) %>% heatmap_rg() ggsave(glue(out.path, "heatmap_eurLeaveUKB_{DATE}.png"), rg_heatmap_eurLeaveUKB.p, width = 11, height = 4, units = "in") |
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 | if(any(grepl("conda", .libPaths(), fixed = TRUE))){ message("Setting libPaths") df = .libPaths() conda_i = which(grepl("conda", df, fixed = TRUE)) .libPaths(c(df[conda_i], df[-conda_i])) } .libPaths(c(.libPaths(), "/hpc/users/harern01/miniconda3/envs/py38/lib/R/library")) library(tidyr) library(readr) library(dplyr) library(plyr) library(glue) #suppressMessages(library(tidyverse)) #suppressMessages(library(glue)) infile_gwas = snakemake@input[["ss"]] outfile = snakemake@output[["out"]] ## Regions for exclusion regions_chrm = snakemake@params[["regions_chrm"]] regions_start = snakemake@params[["regions_start"]] regions_stop = snakemake@params[["regions_stop"]] trait.gwas <- suppressMessages(read_tsv(infile_gwas, comment = '#', guess_max = 11000000)) if(is.null(regions_chrm)){ message("No genomic regions to exclude") } else { for(i in 1:length(regions_chrm)){ message(glue("Excluding regions: ", regions_chrm[i], ":", regions_start[i], "-", regions_stop[i])) trait.gwas <- trait.gwas %>% filter(!(CHROM == regions_chrm[i] & between(POS, regions_start[i], regions_stop[i]))) } } write_tsv(trait.gwas , gzfile(outfile)) |
Support
- Future updates
Related Workflows





