Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation, topic
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
Authors
- Benjamin Fair (@bfairkun)
Overview
This pipeline includes read mapping (STAR), preparation of a phenotype table of splicing traits (leafcutter), and sQTL calling (MatrixEQTL calculate nominal associations, and run permutations, saving the best P-value for each intron for each permutation). Then due a permutation test on a per cluster basis. Depending on your downstream analysis, you may want intron level, or gene level Pvalues. If you want those, you will have to edit the script in the pipeline that does the permutation testing.
Usage
Step 1: Install workflow
If you simply want to use this workflow, clone the latest release . If you intend to modify and further develop this workflow, fork this repository. Please consider providing any generally applicable modifications via a pull request.
Install dependencies with conda:
conda env create --file environment.yaml
Other dependencies that I could not include on conda include the scripts for leafcutter . I have my own fork with small modifications that are required for this pipeline to work:
leafcutter : modified script to allow nonconventional chromosome names (eg: 2A)
Clone my forks linked above, and add the necessary scripts to $PATH by appending the following to .bashrc:
export PATH=$PATH:PathToLeacutterClonedRepo/scripts
export PATH=$PATH:PathToLeacutterClonedRepo/clustering
re-source the .bashrc:
source ~/.bashrc
Make sure tidyverse, qvalue, stats, and MatrixEQTLlibraries are installed for R... I have been using RCC's R/3.4.3 (
module load R/3.4.3
), and installed these with
install.packages()
once in R.
activate the conda environment:
conda activate my_Chimp_EQTL_env
and create rule-specic environments:
snakemake --use-conda --create-envs-only
Step 2: Configure workflow
Configure the workflow according to your needs via editing the file
config.yaml
. Configure cluster settings in
cluster-config.json
Step 3: Execute workflow
Test your configuration by performing a dry-run via
snakemake -n
Execute the workflow locally via
snakemake --cores $N
using
$N
cores or run it in a cluster environment via
snakemake --cluster --cluster-config cluster-config.json --cluster "sbatch --partition={cluster.partition} --job-name={cluster.name} --output=/dev/null --job-name={cluster.name} --nodes={cluster.n} --mem={cluster.mem}"
or by executing the included sbatch script to execute the snakemake process from a cluster
sbatch snakemake.sbatch
See the Snakemake documentation for further details.
Code Snippets
8 9 10 11 | shell: """ plink --id-delim '-' --vcf {input.vcf} --vcf-half-call m --allow-extra-chr --make-bed --out eQTL_mapping/plink/Unfiltered &> {log} """ |
33 34 35 36 | shell: """ plink --bfile eQTL_mapping/plink/Unfiltered --allow-extra-chr --memory 28000 --make-bed --out eQTL_mapping/plink/ForAssociationTesting {params.keepfam} {params.extra} {params.remove_ind} {params.maf} &> {log} """ |
50 51 52 53 54 | shell: """ plink --bfile eQTL_mapping/plink/Unfiltered --allow-extra-chr --make-bed --out eQTL_mapping/Kinship/ForGRM {params.keepfam} {params.maf} {params.remove_ind} {params.extra} sed -i 's/-9$/1/' {output.fam} """ |
64 65 66 67 68 69 70 | shell: """ plink --bfile eQTL_mapping/Kinship/ForGRM --allow-extra-chr --indep-pairwise 50 5 0.5 &> {log} plink --bfile eQTL_mapping/Kinship/ForGRM --allow-extra-chr --extract plink.prune.in --make-bed --out eQTL_mapping/Kinship/ForAssociationTesting.pruned &> {log} sed -i 's/-9$/1/' {output.fam} rm plink.prune.in plink.prune.out """ |
80 81 82 83 | shell: """ gemma -gk 1 -bfile eQTL_mapping/Kinship/ForAssociationTesting.pruned -o GRM -outdir eQTL_mapping/Kinship &> {log} """ |
92 93 94 95 96 | shell: """ plink --bfile eQTL_mapping/plink/ForAssociationTesting --allow-extra-chr --recode A-transpose --tab --geno --memory 40000 --out eQTL_mapping/MatrixEQTL/ForAssociationTesting.snps perl -lne '/^.+?\\t(.+?\\t).+?\\t.+?\\t.+?\\t.+?\\t(.+$)/ and print "$1$2" ' eQTL_mapping/MatrixEQTL/ForAssociationTesting.snps.traw | sed '1s/_{params}//g' > {output} """ |
104 105 106 107 | shell: """ awk -F'\\t' -v OFS='\\t' 'BEGIN {{ print "snp","chrom","pos" }} {{ print $2, $1,$4 }}' {input.snps} > {output.snp_locs} """ |
6 7 | shell: "cat {input} > {output}" |
17 18 19 20 | shell: """ STAR --runMode genomeGenerate --runThreadN 4 --genomeDir MiscOutput/STAR_index/ --sjdbGTFfile {input.gtf} --genomeFastaFiles {input.fasta} &> {log} """ |
33 34 | shell: "STAR --runThreadN {threads} --genomeDir MiscOutput/STAR_index/ --readFilesIn {input.fastq} --outSAMtype BAM SortedByCoordinate --outWigStrand Unstranded --outWigType wiggle --alignEndsType EndToEnd --quantMode GeneCounts --twopassMode Basic --readFilesCommand zcat --outFileNamePrefix RNASeq/STAR/{wildcards.RNASeqSample}/ &> {log}" |
41 42 | shell: "samtools index {input}" |
48 | shell: "bash scripts/GetFastqIdentifierInfo.sh {input} > {output} 2> {log}" |
8 9 10 11 | shell: """ awk -F'\\t' -v OFS='\\t' '$4==1 && $1!="MT" {{ print $1,$2,$3,".",$7,"+" }} $4==2&& $1!="MT" {{ print $1,$2,$3,".",$7,"-" }}' {input} > {output} """ |
20 21 22 23 24 25 26 27 | run: import os SamplesToRemove = open(params.SamplesToRemove, 'r').read().split('\n') with open(output[0], "w") as out: for filepath in input: samplename = os.path.basename(filepath).split(".junc")[0] if samplename not in SamplesToRemove: out.write(filepath + '\n') |
37 38 39 40 | shell: """ leafcutter_cluster.py -j {input} -r sQTL_mapping/leafcutter/clustering/ &> {log} """ |
49 50 51 52 53 54 | shell: """ ~/miniconda3/bin/python2.7 ~/CurrentProjects/leafcutter/scripts/prepare_phenotype_table.py -p 13 --ChromosomeBlackList {input.blacklist_chromosomes} {input.counts} cat <(head -1 sQTL_mapping/leafcutter/clustering/leafcutter_perind.counts.gz.qqnorm_chr3) <(awk 'FNR>1' sQTL_mapping/leafcutter/clustering/leafcutter_perind.counts.gz.qqnorm_chr*) > {output.phenotypes} """ |
64 65 66 67 68 69 | shell: """ cut -d $'\\t' -f 4- {input.phenotypes} > {output.phenotypes} awk -F'\\t' -v OFS='\\t' 'BEGIN {{ print "gene", "chr", "start", "stop" }} NR>1 {{ split($4,a,":"); print $4, a[1], a[2], a[3] }}' {input.phenotypes} > {output.intron_locs} Rscript scripts/ReorderPhenotypeTableForMatrixEQTL.R {output.phenotypes} {input.fam} {output.phenotypesReordered} """ |
86 87 88 89 | shell: """ Rscript scripts/SelectNtoM_PCs_toCovariateFiles.R {input.leafcutterPCs} {input.fam} sQTL_mapping/Covariates/FromLeafcutter.PCs. {params.PC_min} {params.PC_max} """ |
110 111 112 113 | shell: """ Rscript scripts/MatrixEqtl_Cis.R {input.snps} {input.snp_locs} {input.phenotypes} {input.gene_loc} {input.covariates} {input.GRM} {output.results} {output.fig} {output.permuted_results} {output.permutated_fig} {params.cis_window} &> {log} """ |
123 124 125 126 127 | shell: """ awk -F'\\t' -v OFS='\\t' 'FNR>1 && $6<0.3 {{ print $1,$2,$3,$4,$5,$6,FILENAME }}' {input} > {output.CattedResult} Rscript scripts/Plot_EQTLs_vs_PCs.R {output.CattedResult} {output.Plot} """ |
148 149 150 151 | shell: """ Rscript scripts/MatrixEqtl_Cis.AllPvals.R {input.snps} {input.snp_locs} {input.phenotypes} {input.gene_loc} {input.covariates} {input.GRM} {output.results} {output.fig} {params.cis_window} {output.BestGenePvals} &> {log} """ |
164 165 166 167 168 169 | shell: """ awk -F'\\t' -v OFS='\\t' 'NR==1 || /^ID\.{wildcards.chromosome}\./' {input.snps} > {output.snps} awk -F'\\t' -v OFS='\\t' 'NR==1 || /^ID\.{wildcards.chromosome}\./' {input.snp_locs} > {output.snp_locs} awk -F'\\t' -v OFS='\\t' 'NR==1 || /^ID\.{wildcards.chromosome}\./ {{print $1, $2, $3, $4, $5, $6}}' {input.results} > {output.results} """ |
191 192 193 194 | shell: """ eigenMT.py --QTL {input.results} --GEN {input.snps} --GENPOS {input.snp_locs} --PHEPOS {input.gene_loc} --cis_dist 250000 --OUT {output} --CHROM {wildcards.chromosome} &> {log} """ |
203 204 205 206 | shell: """ (cat <(cat {input} | head -1) <(cat {input} | grep -v '^snps') | gzip - > {output}) &> {log} """ |
224 225 226 227 | shell: """ Rscript scripts/MatrixEqtl_Cis_Permutations.R {input.snps} {input.snp_locs} {input.phenotypes} {input.gene_loc} {input.covariates} {input.GRM} {output.results} {params.NumberPermutations} {params.InitialSeed} {params.cis_window} > {log} """ |
237 238 239 240 241 242 243 | shell: """ cat sQTL_mapping/MatrixEQTL/ConfigCovariateModelResults/Permutations/Chunk.0.txt > {output} for i in {{1..{LastChunk}}}; do tail -n +2 sQTL_mapping/MatrixEQTL/ConfigCovariateModelResults/Permutations/Chunk.${{i}}.txt >> {output} done """ |
253 254 255 256 | shell: """ Rscript scripts/PermutationTest.R {input.Permutations_MinPvaluePerTrait} {input.Actual_MinPvaluePerTrait} {output.PermutationTestResults} &> {log} """ |
6 7 8 9 10 11 12 13 14 | printf "file\tinstrument\trunid\tflowcellid\tflowcelllane\tbarcode" for MYFILE in "$@" do samplename=$(basename -s ".fastq.gz" $MYFILE) printf "\n$MYFILE\t" barcode=$(zcat $MYFILE | paste -d'\t' - - - - | awk -F'\t' '{split($1,a,":"); print a[10]}' | head -10000 | sort | uniq -c | sort -nr | head -1) outstr=$(zcat $MYFILE | head -1 | awk -F'\t' -v OFS='\t' '{split($1,a,":"); print a[1], a[2], a[3], a[4]}') printf "$outstr\t$barcode" done |
7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 | library(MatrixEQTL) library(tidyverse) library(qvalue) args = commandArgs(trailingOnly=TRUE) SNP_file_name <- args[1] snps_location_file_name <- args[2] expression_file_name <- args[3] gene_location_file_name <- args[4] covariates_file_name <- args[5] errorCovariance_file <- args[6] output_file_name_cis <- args[7] ouput_QQ <- args[8] cisDistance <- args[9] output_best_p <- args[10] # SNP_file_name <- "~/CurrentProjects/Comparative_eQTL/code/snakemake_workflow/scratch/Test.snps" # snps_location_file_name <- "~/CurrentProjects/Comparative_eQTL/code/snakemake_workflow/scratch/Test.snploc" # expression_file_name <- "code/snakemake_workflow/eQTL_mapping/MatrixEQTL/ForAssociationTesting.phenotypes.txt" # gene_location_file_name <- "code/snakemake_workflow/eQTL_mapping/MatrixEQTL/ForAssociationTesting.geneloc.txt" # covariates_file_name <- "output/Covariates/0GenotypePCs_and_11RNASeqPCs.covariates" # errorCovariance_file <- "code/snakemake_workflow/eQTL_mapping/Kinship/GRM.cXX.txt" # output_file_name_cis = tempfile() # cisDistance = 250000 # SNP_file_name <- "code/snakemake_workflow/eQTL_mapping/MatrixEQTL/ForAssociationTesting.snps" # snps_location_file_name <- "code/snakemake_workflow/eQTL_mapping/MatrixEQTL/ForAssociationTesting.snploc" # expression_file_name <- "code/snakemake_workflow/eQTL_mapping/MatrixEQTL/ForAssociationTesting.phenotypes.txt" # gene_location_file_name <- "code/snakemake_workflow/eQTL_mapping/MatrixEQTL/ForAssociationTesting.geneloc.txt" # covariates_file_name <- "output/Covariates/0GenotypePCs_and_10RNASeqPCs.covariates" # errorCovariance_file <- "code/snakemake_workflow/eQTL_mapping/Kinship/GRM.cXX.txt" # output_file_name_cis = tempfile() # cisDistance = 250000 # Linear model to use, modelANOVA, modelLINEAR, or modelLINEAR_CROSS useModel = modelLINEAR; # modelANOVA, modelLINEAR, or modelLINEAR_CROSS output_file_name_tra = tempfile(); # Only associations significant at this level will be saved pvOutputThreshold_cis = 1; # Error covariance matrix # Set to numeric() for identity. errorCovariance <- as.matrix(read.table(errorCovariance_file,sep='\t')) # Distance for local gene-SNP pairs cisDist = as.numeric(cisDistance); print(cisDist) ## Load genotype data snps = SlicedData$new(); snps$fileDelimiter = "\t"; # the TAB character snps$fileOmitCharacters = "NA"; # denote missing values; snps$fileSkipRows = 1; # one row of column labels snps$fileSkipColumns = 1; # one column of row labels snps$fileSliceSize = 2000; # read file in slices of 2,000 rows snps$LoadFile(SNP_file_name); ## Load gene expression data print("reading phenotype table") gene = SlicedData$new(); gene$fileDelimiter = "\t"; # the TAB character gene$fileOmitCharacters = "NA"; # denote missing values; gene$fileSkipRows = 1; # one row of column labels gene$fileSkipColumns = 1; # one column of row labels gene$fileSliceSize = 2000; # read file in slices of 2,000 rows gene$LoadFile(expression_file_name); ## Load covariates print("reading covariates") cvrt = SlicedData$new(); cvrt$fileDelimiter = "\t"; # the TAB character cvrt$fileOmitCharacters = "NA"; # denote missing values; cvrt$fileSkipRows = 1; # one row of column labels cvrt$fileSkipColumns = 1; # one column of row labels if(length(covariates_file_name)>0) { cvrt$LoadFile(covariates_file_name); } ## Run the analysis snpspos = read.table(snps_location_file_name, header = TRUE, stringsAsFactors = FALSE); genepos = read.table(gene_location_file_name, header = TRUE, stringsAsFactors = FALSE); #Include all pvalues in output so that qvalues can be calculated. Filter after. me = Matrix_eQTL_main( snps = snps, gene = gene, cvrt = cvrt, output_file_name = NULL, pvOutputThreshold = 0, useModel = useModel, errorCovariance = errorCovariance, verbose = TRUE, output_file_name.cis = NULL, pvOutputThreshold.cis = 1, snpspos = snpspos, genepos = genepos, cisDist = cisDist, pvalue.hist = "qqplot", min.pv.by.genesnp = TRUE, noFDRsaveMemory = FALSE); print('done with real pass') #Transform Pvalues to Qvalues #If pi_0 is close to 1 (which is often case in eQTL calling), BH-p will be equal to Q-value within precision of saved digits. me$cis$eqtls$qvalue <- qvalue(me$cis$eqtls$pvalue)$qvalues print('done with qval stuff') #Write out results, while filtering by pvalue me$cis$eqtls %>% select(snps, gene, beta, statistic, pvalue, FDR, qvalue) %>% write.table(file=output_file_name_cis, sep='\t', quote = F, row.names = F) cat('Analysis done in: ', me$time.in.sec, ' seconds', '\n'); ## Plot the Q-Q plot of local p-values ggsave(file=ouput_QQ, plot(me)) ## Write out best p-value for each phenotype. me$cis$min.pv.gene %>% as.data.frame() %>% rownames_to_column() %>% write.table(file=output_best_p, sep='\t', quote=F, col.names=c("Gene", "MinP"), row.names = FALSE) |
7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 | library(MatrixEQTL) library(tidyverse) args = commandArgs(trailingOnly=TRUE) SNP_file_name <- args[1] snps_location_file_name <- args[2] expression_file_name <- args[3] gene_location_file_name <- args[4] covariates_file_name <- args[5] errorCovariance_file <- args[6] permutation_matrix_output_filename <- args[7] Npermutations <- as.numeric(args[8]) InitialSeed <- as.numeric(args[9]) cisDistance <- args[10] # setwd("/project2/gilad/bjf79_project1/projects/Comparative_eQTL/") # SNP_file_name <- "~/CurrentProjects/Comparative_eQTL/code/snakemake_workflow/scratch/Test.snps" # snps_location_file_name <- "~/CurrentProjects/Comparative_eQTL/code/snakemake_workflow/scratch/Test.snploc" # expression_file_name <- "code/snakemake_workflow/eQTL_mapping/MatrixEQTL/ForAssociationTesting.phenotypes.txt" # gene_location_file_name <- "code/snakemake_workflow/eQTL_mapping/MatrixEQTL/ForAssociationTesting.geneloc.txt" # covariates_file_name <- "output/Covariates/0GenotypePCs_and_11RNASeqPCs.covariates" # errorCovariance_file <- "code/snakemake_workflow/eQTL_mapping/Kinship/GRM.cXX.txt" # permutation_matrix_output_filename <- "/project2/gilad/bjf79/temp/PvalueMatrix.txt" # cisDistance<-100000 # InitialSeed <- 1 # Npermutations <- 5 # Linear model to use, modelANOVA, modelLINEAR, or modelLINEAR_CROSS useModel = modelLINEAR; # modelANOVA, modelLINEAR, or modelLINEAR_CROSS output_file_name_tra = tempfile(); # Only associations significant at this level will be saved # Error covariance matrix # Set to numeric() for identity. errorCovariance <- as.matrix(read.table(errorCovariance_file,sep='\t')) # Distance for local gene-SNP pairs cisDist = as.numeric(cisDistance); print(paste("Cis distance setting:",cisDist)) ## Load genotype data print("Loading genotype data...") snps = SlicedData$new(); snps$fileDelimiter = "\t"; # the TAB character snps$fileOmitCharacters = "NA"; # denote missing values; snps$fileSkipRows = 1; # one row of column labels snps$fileSkipColumns = 1; # one column of row labels snps$fileSliceSize = 2000; # read file in slices of 2,000 rows snps$LoadFile(SNP_file_name); ## Load gene expression data print("Loading expression data...") gene = SlicedData$new(); gene$fileDelimiter = "\t"; # the TAB character gene$fileOmitCharacters = "NA"; # denote missing values; gene$fileSkipRows = 1; # one row of column labels gene$fileSkipColumns = 1; # one column of row labels gene$fileSliceSize = 2000; # read file in slices of 2,000 rows gene$LoadFile(expression_file_name); ## Load covariates print("Loading covariate data...") cvrt = SlicedData$new(); cvrt$fileDelimiter = "\t"; # the TAB character cvrt$fileOmitCharacters = "NA"; # denote missing values; cvrt$fileSkipRows = 1; # one row of column labels cvrt$fileSkipColumns = 1; # one column of row labels if(length(covariates_file_name)>0) { cvrt$LoadFile(covariates_file_name); } ## Other initialization stuff print("Loading other stuff...") snpspos = read.table(snps_location_file_name, header = TRUE, stringsAsFactors = FALSE); genepos = read.table(gene_location_file_name, header = TRUE, stringsAsFactors = FALSE); ### Permute the sample labels for expression ### # Read in actual data expression matrix and covariates ActualData.ExpressionMatrix <- read.table(expression_file_name, header=T, row.names = 1, check.names = FALSE ) ActualData.Cov <- read.table(covariates_file_name, header=T, row.names = 1, check.names = FALSE ) #Initialize empty matrix for results. Rows are genes. Columns are permutations PermutatePvalueMatrix <- matrix(data=NA, nrow = nrow(ActualData.ExpressionMatrix), ncol = Npermutations) ### Run permutations print("Running permutations...") for (i in 1:Npermutations){ # Permute column labels for both (using same seed for randomization) # technically I am permuting the column data, and preserving the column labels. That way, I do not have to do the same for the (huge) genotype file, which would take more computational time set.seed(i+InitialSeed) Temp.df <- ActualData.ExpressionMatrix %>% select(sample(colnames(ActualData.ExpressionMatrix), length(colnames(ActualData.ExpressionMatrix)))) set.seed(i+InitialSeed) Temp.df.cov <- ActualData.Cov %>% select(sample(colnames(ActualData.ExpressionMatrix), length(colnames(ActualData.ExpressionMatrix)))) colnames(Temp.df) <- colnames(ActualData.ExpressionMatrix) colnames(Temp.df.cov) <- colnames(ActualData.ExpressionMatrix) # Write out permutated expression matrix and reload it TempFilepath.ExpressionMatrix <- tempfile("ExpressionMatrix.") write.table(Temp.df, file=TempFilepath.ExpressionMatrix, sep='\t', quote=F, col.names =NA) gene$LoadFile(TempFilepath.ExpressionMatrix); TempFilepath.Covariates <- tempfile("Covariates.") write.table(Temp.df.cov, file=TempFilepath.Covariates, sep='\t', quote=F, col.names =NA) cvrt$LoadFile(TempFilepath.Covariates); #Calculate Pvalues from permutated data permuted = Matrix_eQTL_main( snps = snps, gene = gene, cvrt = cvrt, output_file_name = NULL, pvOutputThreshold = 0, useModel = useModel, errorCovariance = errorCovariance, verbose = F, output_file_name.cis = NULL, pvOutputThreshold.cis = 1, snpspos = snpspos, genepos = genepos, cisDist = cisDist, pvalue.hist = FALSE, min.pv.by.genesnp = TRUE, noFDRsaveMemory = FALSE); print(paste('done with permutation pass', i)) PermutatePvalueMatrix[,i]=permuted$cis$min.pv.gene } row.names(PermutatePvalueMatrix) <- names(permuted$cis$min.pv.gene) write.table(t(PermutatePvalueMatrix), permutation_matrix_output_filename, quote=F, sep='\t', col.names = T, row.names=F) |
7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 | library(MatrixEQTL) library(tidyverse) library(qvalue) args = commandArgs(trailingOnly=TRUE) SNP_file_name <- args[1] snps_location_file_name <- args[2] expression_file_name <- args[3] gene_location_file_name <- args[4] covariates_file_name <- args[5] errorCovariance_file <- args[6] output_file_name_cis <- args[7] ouput_QQ <- args[8] permuted_output_filename <- args[9] permuted_output_QQ <- args[10] cisDistance <- args[11] # setwd("/project2/gilad/bjf79_project1/projects/Comparative_eQTL/") # SNP_file_name <- "/project2/gilad/bjf79_project1/projects/Comparative_eQTL/code/snakemake_workflow/scratch/Test.snps" # snps_location_file_name <- "/project2/gilad/bjf79_project1/projects/Comparative_eQTL/code/snakemake_workflow/scratch/Test.snploc" # expression_file_name <- "code/snakemake_workflow/eQTL_mapping/MatrixEQTL/ForAssociationTesting.phenotypes.txt" # gene_location_file_name <- "code/snakemake_workflow/eQTL_mapping/MatrixEQTL/ForAssociationTesting.geneloc.txt" # covariates_file_name <- "output/Covariates/0GenotypePCs_and_11RNASeqPCs.covariates" # errorCovariance_file <- "code/snakemake_workflow/eQTL_mapping/Kinship/GRM.cXX.txt" # cisDistance<-100000 # output_file_name_cis = tempfile() # Linear model to use, modelANOVA, modelLINEAR, or modelLINEAR_CROSS useModel = modelLINEAR; # modelANOVA, modelLINEAR, or modelLINEAR_CROSS output_file_name_tra = tempfile(); # Only associations significant at this level will be saved pvOutputThreshold_cis = 2e-2; # Error covariance matrix # Set to numeric() for identity. errorCovariance <- as.matrix(read.table(errorCovariance_file,sep='\t')) # Distance for local gene-SNP pairs cisDist = as.numeric(cisDistance); print(cisDist) ## Load genotype data snps = SlicedData$new(); snps$fileDelimiter = "\t"; # the TAB character snps$fileOmitCharacters = "NA"; # denote missing values; snps$fileSkipRows = 1; # one row of column labels snps$fileSkipColumns = 1; # one column of row labels snps$fileSliceSize = 2000; # read file in slices of 2,000 rows snps$LoadFile(SNP_file_name); ## Load gene expression data gene = SlicedData$new(); gene$fileDelimiter = "\t"; # the TAB character gene$fileOmitCharacters = "NA"; # denote missing values; gene$fileSkipRows = 1; # one row of column labels gene$fileSkipColumns = 1; # one column of row labels gene$fileSliceSize = 2000; # read file in slices of 2,000 rows gene$LoadFile(expression_file_name); ## Load covariates cvrt = SlicedData$new(); cvrt$fileDelimiter = "\t"; # the TAB character cvrt$fileOmitCharacters = "NA"; # denote missing values; cvrt$fileSkipRows = 1; # one row of column labels cvrt$fileSkipColumns = 1; # one column of row labels if(length(covariates_file_name)>0) { cvrt$LoadFile(covariates_file_name); } ## Run the analysis snpspos = read.table(snps_location_file_name, header = TRUE, stringsAsFactors = FALSE); genepos = read.table(gene_location_file_name, header = TRUE, stringsAsFactors = FALSE); #Include all pvalues in output so that qvalues can be calculated. Filter after. me = Matrix_eQTL_main( snps = snps, gene = gene, cvrt = cvrt, output_file_name = NULL, pvOutputThreshold = 0, useModel = useModel, errorCovariance = errorCovariance, verbose = TRUE, output_file_name.cis = NULL, pvOutputThreshold.cis = 1, snpspos = snpspos, genepos = genepos, cisDist = cisDist, pvalue.hist = "qqplot", min.pv.by.genesnp = TRUE, noFDRsaveMemory = FALSE); print('done with real pass') #Transform Pvalues to Qvalues #If pi_0 is close to 1 (which is often case in eQTL calling), BH-p will be equal to Q-value within precision of saved digits. me$cis$eqtls$qvalue <- qvalue(me$cis$eqtls$pvalue)$qvalues #Write out results, while filtering by pvalue me$cis$eqtls %>% filter(pvalue < pvOutputThreshold_cis ) %>% select(snps, gene, beta, statistic, pvalue, FDR, qvalue) %>% write.table(file=output_file_name_cis, sep='\t', quote = F, row.names = F) # Permute the sample labels for expression ActualData <- read.table(expression_file_name, header=T, row.names = 1) Temp.df <- ActualData %>% select(sample(colnames(ActualData), length(colnames(ActualData)))) colnames(Temp.df) <- colnames(ActualData) TempFilepath <- tempfile() write.table(Temp.df, file=TempFilepath, sep='\t', quote=F, col.names =NA) gene$LoadFile(TempFilepath); permuted = Matrix_eQTL_main( snps = snps, gene = gene, cvrt = cvrt, output_file_name = NULL, pvOutputThreshold = 0, useModel = useModel, errorCovariance = errorCovariance, verbose = TRUE, output_file_name.cis = NULL, pvOutputThreshold.cis = 1, snpspos = snpspos, genepos = genepos, cisDist = cisDist, pvalue.hist = "qqplot", min.pv.by.genesnp = FALSE, noFDRsaveMemory = FALSE); print('done with permutation pass') permuted$cis$eqtls %>% filter(pvalue < pvOutputThreshold_cis ) %>% select(snps, gene, beta, statistic, pvalue, FDR) %>% write.table(file=permuted_output_filename, sep='\t', quote = F, row.names = F) ## Results: cat('Analysis done in: ', me$time.in.sec, ' seconds', '\n'); ## Plot the Q-Q plot of local p-values ggsave(file=ouput_QQ, plot(me)) ggsave(file=permuted_output_QQ, plot(permuted)) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 | library(tidyverse) library(data.table) library(qvalue) library(stats) args = commandArgs(trailingOnly=TRUE) PermutationsTableFile <- args[1] ActualTableFile <- args[2] TableOutFile <- args[3] ## Files for testing # setwd("~/CurrentProjects/sQTL_pipeline/") # PermutationsTableFile <- "sQTL_mapping/MatrixEQTL/ConfigCovariateModelResults/PermutationsCombined.txt" # ActualTableFile <- "sQTL_mapping/MatrixEQTL/ConfigCovariateModelResults/BestPvalueNonPermuted.txt" # TableOutFile <- "~/temp/test.txt" # read table of min p values from permutations permutationTable <- t(fread(PermutationsTableFile, sep='\t', header=T)) # read table of min p values from real data ActualTable <- fread(ActualTableFile, sep='\t', header=T) %>% as.data.frame() %>% tibble::column_to_rownames("Gene") # Merge the two tables, putting the real data as the first column. # If you are interested in gene level or cluster-level tests, this where you # should group and find minimum Pval. Here I did cluster level P-values. MergedTable <- merge(ActualTable, permutationTable, by="row.names") %>% mutate(clusterName=gsub("^(.+?):.+?:.+?:(clu_\\d+).+$", "\\1.\\2", Row.names, perl=T)) %>% dplyr::group_by(clusterName) %>% dplyr::summarise_all(dplyr::funs(min)) %>% dplyr::ungroup() %>% dplyr::select(-Row.names) %>% as.data.frame() %>% tibble::column_to_rownames("clusterName") PermutationTest <- function(PvalVector){ return(ecdf(PvalVector)(PvalVector[1])) } OutTable <- data.frame( BestActualPval=MergedTable$MinP, PermutationTestPval=apply(MergedTable, 1, PermutationTest)) OutTable$PermutationTestQvalue <- qvalue(OutTable$PermutationTestPval)$qvalue OutTable %>% tibble::rownames_to_column(var = "GroupedTrait") %>% write.table(file=TableOutFile, quote=F, sep='\t') |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 | library(tidyverse) library(gridExtra) library(reshape2) args <- commandArgs(trailingOnly = T) FileIn <- args[1] PlotOut <- args[2] results <- read.table(FileIn, sep='\t', header=F, col.names = c("SNP", "gene", "beta","t", "p", "q", "Filename")) head(results) ## eGene count eGenes_perPC <- results %>% filter(q<0.2) %>% distinct(gene, Filename, .keep_all = TRUE) %>% group_by(Filename) %>% tally() %>% mutate(PCcount = gsub("^(.+?)_and_(.+?)RNASeqPCs.covariates.txt","\\2",Filename, perl=TRUE)) %>% select(-Filename, FDR_20=n) eGenes_perPC$FDR_10 <- results %>% filter(q<0.1) %>% distinct(gene, Filename, .keep_all = TRUE) %>% group_by(Filename) %>% tally() %>% mutate(PCcount = gsub("^(.+?)_and_(.+?)RNASeqPCs.covariates.txt","\\2",Filename, perl=TRUE)) %>% pull(n) eGenes_perPC$FDR_5 <- results %>% filter(q<0.05) %>% distinct(gene, Filename, .keep_all = TRUE) %>% group_by(Filename) %>% tally() %>% mutate(PCcount = gsub("^(.+?)_and_(.+?)RNASeqPCs.covariates.txt","\\2",Filename, perl=TRUE)) %>% pull(n) P1 <- eGenes_perPC %>% melt(id.vars="PCcount") %>% transform(PCcount=as.numeric(PCcount)) %>% mutate(FDR = plyr::mapvalues(variable, from=c("FDR_10", "FDR_20", "FDR_5"), to=c("0.1", "0.2", "0.05"))) %>% ggplot(aes(x=PCcount, y=value, color=FDR)) + geom_point() + geom_line() + xlab("Number PCs") + ylab("Number eGenes") + theme_bw() ## eQTL count eGenes_perPC <- results %>% filter(q<0.2) %>% group_by(Filename) %>% tally() %>% mutate(PCcount = gsub("^(.+?)_and_(.+?)RNASeqPCs.covariates.txt","\\2",Filename, perl=TRUE)) %>% select(-Filename, FDR_20=n) eGenes_perPC$FDR_10 <- results %>% filter(q<0.1) %>% group_by(Filename) %>% tally() %>% mutate(PCcount = gsub("^(.+?)_and_(.+?)RNASeqPCs.covariates.txt","\\2",Filename, perl=TRUE)) %>% pull(n) eGenes_perPC$FDR_5 <- results %>% filter(q<0.05) %>% group_by(Filename) %>% tally() %>% mutate(PCcount = gsub("^(.+?)_and_(.+?)RNASeqPCs.covariates.txt","\\2",Filename, perl=TRUE)) %>% pull(n) P2 <- eGenes_perPC %>% melt(id.vars="PCcount") %>% transform(PCcount=as.numeric(PCcount)) %>% mutate(FDR = plyr::mapvalues(variable, from=c("FDR_10", "FDR_20", "FDR_5"), to=c("0.1", "0.2", "0.05"))) %>% ggplot(aes(x=PCcount, y=value, color=FDR)) + geom_point() + geom_line() + xlab("Number PCs") + ylab("Number eGene-SNP pairs") + theme_bw() g<-arrangeGrob(P1, P2, nrow=2) ggsave(PlotOut, g) |
1 2 3 4 5 6 7 8 9 10 11 12 13 | library(tidyverse) args <- commandArgs(trailingOnly = T) PhenotypeTableFilepath <- args[1] EmptyFamFilepath <- args[2] PhenotypeOutFilepath <- args[3] EmptyFamFile <- read.table(EmptyFamFilepath, col.names=c("FID", "IID", "Father", "Mother", "SX", "Pheno"), stringsAsFactors = F) %>% select(-Pheno) PhenotypeFile <- read.table(PhenotypeTableFilepath, header=T, check.names = F) PhenotypeFile %>% select(ID, EmptyFamFile$IID) %>% write.table(file=PhenotypeOutFilepath, col.names = T, row.names = F, quote=F, sep='\t') |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 | library(tidyverse) args <- commandArgs(trailingOnly = T) PCsFromLeafcutter <- args[1] EmptyFamFilepath <- args[2] PhenotypeOutBaseFilepath <- args[3] N <- args[4] M <- args[5] EmptyFamFile <- read.table(EmptyFamFilepath, col.names=c("FID", "IID", "Father", "Mother", "SX", "Pheno"), stringsAsFactors = F, sep=" ") %>% select(-Pheno) Covariates <- read.table(PCsFromLeafcutter, header=T, check.names = F) for (i in N:M){ Covariates %>% select(id, EmptyFamFile$IID) %>% filter(id %in% N:i) %>% write.table(file=paste0(PhenotypeOutBaseFilepath, i, ".covariates.txt"), quote=F, row.names=F, sep='\t') } |
Support
- Future updates
Related Workflows





