eQTL-Catalogue/qtlmap
Portable eQTL analysis and statistical fine mapping workflow used by the eQTL Catalogue
Introduction
eQTL-Catalogue/qtlmap is a bioinformatics analysis pipeline for QTL analysis.
The workflow takes a phenotype count matrix (normalized and quality controlled) and genotype data as input, and finds associations between them with the help of sample metadata and phenotype metadata files (see Input formats and preparation for the required input file details). To map QTLs, the pipeline uses the PCA and cis QTL mapping functionality of QTLtools; bcftools, Tabix and custom R scripts are used for file manipulation.
The pipeline is built using Nextflow, a bioinformatics workflow tool that runs tasks across multiple compute infrastructures in a very portable manner. It comes with Docker/Singularity containers, making installation trivial and the results highly reproducible.
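A typical run is launched with nextflow run plus a container profile. The command below is only an illustrative sketch: the profile name and the input parameter names (--expression_matrix, --phenotype_metadata, --sample_metadata, --genotype_vcf, --outdir) are assumptions inferred from the process inputs shown in the Code Snippets section, so the Running the pipeline and Input formats and preparation documents remain the authoritative reference; --cis_window and --n_permutations do appear in the pipeline code, but the values shown here are arbitrary.

#Illustrative sketch only: parameter names below are assumptions, not the documented interface
nextflow run eQTL-Catalogue/qtlmap \
    -profile singularity \
    --expression_matrix expression_matrix.tsv.gz \
    --phenotype_metadata phenotype_metadata.tsv.gz \
    --sample_metadata sample_metadata.tsv \
    --genotype_vcf genotypes.vcf.gz \
    --cis_window 1000000 \
    --n_permutations 1000 \
    --outdir results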
Documentation
The eQTL-Catalogue/qtlmap pipeline comes with documentation about the pipeline, found in the docs/ directory:
- Installation
- Pipeline configuration
- Input formats and preparation
- Running the pipeline
- Troubleshooting
Pipeline Description
Mapping QTLs is the process of finding statistically significant associations between phenotypes and genetic variants located nearby (within a specific window around the phenotype, a.k.a. the cis window). This pipeline is designed to perform QTL mapping. There are plans to add this pipeline to the nf-core framework in the future. A high-level representation of the pipeline is shown below.
[Figure: high-level schematic of the qtlmap pipeline]
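In practice, the phenotypes are split into batches and each batch is tested with the QTLtools cis permutation pass reproduced in the Code Snippets section. A simplified form of that command is shown below; the file names and parameter values are illustrative only.

#Simplified permutation pass; the pipeline additionally splits the work into chunks with --chunk
QTLtools cis \
    --vcf genotypes.vcf.gz \
    --bed phenotypes.bed.gz \
    --cov covariates.txt \
    --window 1000000 \
    --permute 1000 \
    --grp-best \
    --out permutation_results.txt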
Results
The output directory of the workflow contains the following subdirectories:
- PCA - genotype and gene expression PCA values used as covariates for QTL analysis.
- sumstats - QTL summary statistics from nominal and permutation passes.
- susie - SuSiE fine mapping credible sets.
- susie_full - full set of SuSiE results for all tested variants (very large files).
- susie_merged - SuSiE credible sets merged with summary statistics from the univariate QTL analysis.
Column names of the output files are explained here. An example of querying the indexed summary statistics is shown below.
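The nominal summary statistics are bgzip-compressed and indexed with tabix using the chromosome in column 2 and the position in column 3 (see the tabix -s2 -b3 -e3 -S1 calls in the Code Snippets section), so individual regions can be extracted without reading the whole file. The file name and region below are illustrative placeholders, and the chromosome naming must match the contents of the file.

#Retrieve all nominal associations in a 1 Mb window from an indexed summary statistics file
tabix <qtl_subset>.all.tsv.gz 1:1000000-2000000 | head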
Contributors
- Nurlan Kerimov
- Kaur Alasoo
- Masahiro Kanai
- Ralf Tambets
Code Snippets
"""
#Keep nominal summary statistics only for phenotypes that have credible sets, then index with tabix
csvtk -t cut -f molecular_trait_id $credible_sets | csvtk -t uniq > cc_signals_phenotypes.txt
zcat $qtl_ss | csvtk -t grep -P cc_signals_phenotypes.txt | bgzip > ${qtl_subset}.cc.tsv.gz
tabix -s2 -b3 -e3 -S1 -f ${qtl_subset}.cc.tsv.gz
"""
"""
#Subset the VCF to the analysed samples and keep variants with MAF > 0.01 and a minor allele count above 5
bcftools view -S $sample_names $genotype_vcf -Oz -o ${sample_names.simpleName}_extract.vcf.gz
bcftools +fill-tags ${sample_names.simpleName}_extract.vcf.gz -Oz -o ${sample_names.simpleName}_extract_filltags.vcf.gz
bcftools view -i 'AN[0]*MAF[0]>5 & MAF[0]>0.01' ${sample_names.simpleName}_extract_filltags.vcf.gz -Oz -o ${sample_names.simpleName}.vcf.gz
tabix -p vcf ${sample_names.simpleName}.vcf.gz
"""
"""
#Extract variant information (including the R2 INFO field) from the VCF
set +o pipefail; bcftools +fill-tags $vcf | bcftools query -f '%CHROM\\t%POS\\t%ID\\t%REF\\t%ALT\\t%TYPE\\t%AC\\t%AN\\t%MAF\\t%R2\\n' | gzip > ${qtl_subset}.variant_information.txt.gz
"""
"""
#Extract variant information from the VCF, writing NA in place of the R2 field
set +o pipefail; bcftools +fill-tags $vcf | bcftools query -f '%CHROM\\t%POS\\t%ID\\t%REF\\t%ALT\\t%TYPE\\t%AC\\t%AN\\t%MAF\\tNA\\n' | gzip > ${qtl_subset}.variant_information.txt.gz
"""
"""
#Permutation pass: run QTLtools cis on one chunk of phenotypes
QTLtools cis --vcf $vcf --bed $bed --cov $covariate --chunk $batch_index ${params.n_batches} --out ${qtl_subset}.permutation.batch.${batch_index}.${params.n_batches}.txt --window ${params.cis_window} --permute ${params.n_permutations} --grp-best
"""
"""
#Merge permutation batches, sort by chromosome and position, and add a header
cat ${batch_file_names.join(' ')} | csvtk space2tab | sort -k11n -k12n > merged.txt
cut -f 1,6,7,8,10,11,12,18,20,21,22 merged.txt | csvtk add-header -t -n molecular_trait_object_id,molecular_trait_id,n_traits,n_variants,variant,chromosome,position,pvalue,beta,p_perm,p_beta | gzip > ${qtl_subset}.permuted.tsv.gz
"""
"""
#Nominal pass: run fastQTL on one chunk of phenotypes
fastQTL --vcf $vcf --bed $fastqtl_bed --cov $covariate \\
    --chunk $batch_index ${params.n_batches} \\
    --out ${qtl_subset}.nominal.batch.${batch_index}.${params.n_batches}.txt \\
    --window ${params.cis_window} \\
    --ma-sample-threshold 1
"""
"""
#Merge nominal batches, split the variant ID into its components and reorder the columns
cat ${batch_file_names.join(' ')} | \\
    csvtk space2tab -T | \\
    csvtk sep -H -t -f 2 -s "_" | \\
    csvtk replace -t -H -f 10 -p ^chr | \\
    csvtk cut -t -f1,10,11,12,13,2,4,5,6,7,8,9 | \\
    bgzip > ${qtl_subset}.nominal.tab.txt.gz
"""
"""
#Sort the merged nominal summary statistics by chromosome and position and drop duplicate rows
gzip -dc $nominal_merged | LANG=C sort -k2,2 -k3,3n -S${params.sumstat_sort_mem} --parallel=${params.sumstat_sort_cores} | uniq | \\
    bgzip > ${qtl_subset}.nominal.sorted.norsid.tsv.gz
"""
"""
Rscript $baseDir/bin/prepare_molecular_traits.R \\
    -p "$phenotype_metadata" \\
    -s "$sample_metadata" \\
    -e "$expression_matrix" \\
    -v "$vcf_variant_info" \\
    -o "." \\
    -c ${params.cis_window} \\
    -m ${params.mincisvariant} \\
    -a ${params.covariates}

#Merge phenotype covariates together
head -n ${params.n_pheno_pcs + 1} phenoPCA.tsv > ${qtl_subset}.pheno_cov.txt
tail -n+2 additional_covariates.tsv >> ${qtl_subset}.pheno_cov.txt
"""
"""
#Index the phenotype BED file and create a fastQTL-compatible version without the strand and group_id columns
bgzip ${bed_file} && tabix -p bed ${bed_file}.gz

csvtk cut -C\$ -t -f -strand,-group_id ${bed_file}.gz | bgzip > ${bed_file.baseName}.fastQTL.bed.gz
tabix -p bed ${bed_file.baseName}.fastQTL.bed.gz
"""
"""
#LD-prune the genotypes, compute genotype PCs with plink2 and append them to the phenotype covariates
plink2 --vcf $vcf --vcf-half-call h --indep-pairwise 50000 200 0.05 --out ${qtl_subset}_pruned_variants --threads ${task.cpus} --memory ${task.memory.mega} --const-fid
plink2 --vcf $vcf --vcf-half-call h --extract ${qtl_subset}_pruned_variants.prune.in --make-bed --out ${qtl_subset}_pruned --const-fid
plink2 -bfile ${qtl_subset}_pruned --pca ${params.n_geno_pcs} header tabs
cat plink.eigenvec \\
    | sed '1s/IID/genotype_id/' \\
    | sed '1s/PC/geno_PC/g' \\
    | csvtk cut -t -f -"FID" \\
    | csvtk transpose -t > ${qtl_subset}.geno.pca

cat $phenotype_cov > ${qtl_subset}.covariates.txt
set +o pipefail; tail -n+2 ${qtl_subset}.geno.pca | head -n ${params.n_geno_pcs} >> ${qtl_subset}.covariates.txt
"""
"""
#Add rsids to the variant information file
$baseDir/bin/join_variant_info.py \\
    -v $var_info \\
    -r $rsid_map \\
    -o ${var_info.simpleName}.var_info_rsid.tsv.gz
"""
"""
#Join variant information, rsids, phenotype metadata and median TPM values to the summary statistics
$baseDir/bin/join_variant_info.py \
    -s $summ_stats \
    -v $var_info \
    -r $rsid_map \
    -p $phenotype_metadata \
    -m $median_tpm \
    -o ${qtl_subset}.nominal.sorted.tsv.gz
"""
"""
#Recompress the sorted summary statistics with bgzip and index with tabix
zcat $sumstats_file | bgzip > ${qtl_subset}.nominal.sorted.bgzip.tsv.gz
mv ${qtl_subset}.nominal.sorted.bgzip.tsv.gz ${qtl_subset}.all.tsv.gz
tabix -s2 -b3 -e3 -S1 -f ${qtl_subset}.all.tsv.gz
"""
"""
#Run SuSiE fine mapping on one chunk of phenotypes
Rscript $baseDir/bin/run_susie.R --expression_matrix ${expression_matrix}\
    --phenotype_meta ${phenotype_meta}\
    --sample_meta ${sample_meta}\
    --phenotype_list ${phenotype_list}\
    --covariates ${covariates}\
    --genotype_matrix ${genotype_matrix}\
    --chunk '${batch_index} ${params.n_batches}'\
    --cisdistance ${params.cis_window}\
    --out_prefix '${qtl_subset}.${batch_index}_${params.n_batches}'\
    --eqtlutils null\
    --write_full_susie ${params.write_full_susie}
"""
"""
#Concatenate SuSiE batch outputs, keeping only the first header line
awk 'NR == 1 || FNR > 1{print}' ${in_cs_variant_batch_names.join(' ')} | gzip -c > ${qtl_subset}.txt.gz
awk 'NR == 1 || FNR > 1{print}' ${credible_set_batch_names.join(' ')} | gzip -c > ${qtl_subset}.cred.txt.gz
awk 'NR == 1 || FNR > 1{print}' ${variant_batch_names.join(' ')} | gzip -c > ${qtl_subset}.snp.txt.gz
awk 'NR == 1 || FNR > 1{print}' ${lbf_variable_batch_names.join(' ')} | gzip -c > ${qtl_subset}.lbf_variable.txt.gz
"""
"""
#Sort the merged SuSiE results, keeping the header line first
gunzip -c ${merged_susie_output} > susie_merged.txt
(head -n 1 susie_merged.txt && tail -n +2 susie_merged.txt | sort -k3 -k4n ) | gzip > ${qtl_subset}.purity_filtered.txt.gz
"""
"""
#Extract variant coordinates from the credible set file
csvtk cut -t -T -f chromosome,position ${credible_sets} | tail -n +2 | sort -k1n -k2n | uniq > selected_regions.tsv

#Extract variants from the summary stats file
set +o pipefail; zcat ${qtl_ss} | head -n1 | gzip > header.txt.gz
set +o pipefail; tabix -R selected_regions.tsv ${qtl_ss} | gzip > filtered_sumstats.tsv.gz
set +o pipefail; zcat header.txt.gz filtered_sumstats.tsv.gz | gzip > ${qtl_subset}.extracted_sumstats.tsv.gz
"""
"""
#Merge SuSiE credible sets with the univariate summary statistics
Rscript $baseDir/bin/susie_merge_cs.R --cs_results ${credible_sets}\
    --sumstats ${sumstats}\
    --out ${qtl_subset}.credible_sets.tsv.gz
"""
"""
#Set variant IDs to chr_pos_ref_alt
bcftools annotate --set-id 'chr%CHROM\\_%POS\\_%REF\\_%FIRST_ALT' $vcf -Oz -o ${vcf.simpleName}_renamed.vcf.gz
"""
"""
#Extract header
printf 'CHROM\\nPOS\\nREF\\nALT\\n' > 4_columns.tsv
bcftools query -l ${vcf} > sample_list.tsv
cat 4_columns.tsv sample_list.tsv > header.tsv
csvtk transpose header.tsv -T | gzip > header_row.tsv.gz

#Extract dosage and merge
bcftools query -f "%CHROM\\t%POS\\t%REF\\t%ALT[\\t%DS]\\n" ${vcf} | gzip > dose_matrix.tsv.gz
zcat header_row.tsv.gz dose_matrix.tsv.gz | bgzip > ${vcf.simpleName}.dose.tsv.gz
tabix -s1 -b2 -e2 -S1 ${vcf.simpleName}.dose.tsv.gz
"""
"""
#Extract header
printf 'CHROM\\nPOS\\nREF\\nALT\\n' > 4_columns.tsv
bcftools query -l ${vcf} > sample_list.tsv
cat 4_columns.tsv sample_list.tsv > header.tsv
csvtk transpose header.tsv -T | gzip > header_row.tsv.gz

#Extract dosage and merge
bcftools +dosage ${vcf} -- -t GT | tail -n+2 | gzip > dose_matrix.tsv.gz
zcat header_row.tsv.gz dose_matrix.tsv.gz | bgzip > ${vcf.simpleName}.dose.tsv.gz
tabix -s1 -b2 -e2 -S1 ${vcf.simpleName}.dose.tsv.gz
"""
Support
- Future updates