Perform error-prone data preparation for whole genome regression modeling
istari 🔬
An awesome whole genome regression modeling pipeline
Overview
Welcome to istari! Before getting started, we highly recommend reading through istari's documentation .
The
./istari
pipeline is composed several inter-related sub commands to setup and run the pipeline across different systems. Each of the available sub commands perform different functions:
-
istari run
: Run the istari pipeline with your input files. -
istari unlock
: Unlocks a previous runs output directory. -
istari cache
: Cache remote resources locally, coming soon!
istari is a comprehensive pipeline that performs error-prone data preparation steps for genome-wide association studies (GWAS) optimized for WES and WGS. It relies on technologies like Singularity1 to maintain the highest-level of reproducibility. The pipeline consists of a series of data processing and quality-control steps orchestrated by Snakemake2 , a flexible and scalable workflow management system, to submit jobs to a cluster.
The pipeline is compatible with data generated from Illumina short-read sequencing technologies. This pipeline prepares data for GWAS by converting file formats, performing QC filtering, preparing phenotypes, covariates, and files for regenie3 . As input, it accepts a gVCF file (note: file must have corresponding index file), phenotype file, and covariates file. The pipeline validates phenotype and covariate files. If sex and/or ancestry information is not present in the covariates file, the pipeline runs Somalier4 to estimate sex and ancestry and adds this information to the covariates file. The pipeline then uses Ensembl's Variant Effect Predictor5 (VEP) to annotate variants. The pipeline then filters variants using Slivar , which utilizes the Genome Aggregation 190 Database6 (gnomAD) popMax AF. By default, this pipeline include variants with a population allele frequency ≤ 0.01 and ≤ 0.05 (1% or 5% in gnomAD) and predicted to be functionally impactful by Slivar, but it is set up so these thresholds can be adjusted. Then, the pipeline performs pre-association test QC filtering for regenie step 1 based on minor allele frequency (MAF), minor allele count (MAC), linkage disequilibrium (LD), and genotype and sample missingness using GATK and plink7,8 . By default, certain filtering thresholds are set, but these thresholds and pruning settings can be adjusted. Note that the regenie developers do not recommend using >1M SNPs for step 1. The pipeline then creates the annotation file, mask file, and list file, which are used as input for gene-based tests in regenie step 2 gene-based tests. The pipeline can submit jobs to a cluster using a job scheduler like SLURM (more coming soon!). A hybrid approach ensures the pipeline is accessible to all users.
Before getting started, we highly recommend reading through the usage section of each available sub command.
For more information about issues or trouble-shooting a problem, please checkout our FAQ prior to opening an issue on Github .
Dependencies
Requires:
singularity=3.10.5
snakemake=7.19.1
At the current moment, the pipeline uses a mixture of environment modules and docker images; however, this will be changing soon! In the very near future, the pipeline will only use docker images. With that being said, snakemake and singularity must be installed on the target system. Snakemake orchestrates the execution of each step in the pipeline. To guarantee the highest level of reproducibility, each step of the pipeline will rely on versioned images from DockerHub . Snakemake uses singularity to pull these images onto the local filesystem prior to job execution, and as so, snakemake and singularity will be the only two dependencies in the future.
Installation
Please clone this repository to your local filesystem using the following command:
# Clone Repository from Github
git clone https://github.com/OpenOmics/istari.git
# Change your working directory
cd istari/
# Add dependencies to $PATH
# Biowulf users should run
module load snakemake singularity
# Get usage information
./istari -h
Contribute
This site is a living document, created for and by members like you. istari is maintained by the members of OpenOmics and is improved by continuous feedback! We encourage you to contribute new content and make improvements to existing content via pull request to our GitHub repository .
References
1.
Kurtzer GM, Sochat V, Bauer MW (2017). Singularity: Scientific containers for mobility of compute. PLoS ONE 12(5): e0177459.
2.
Koster, J. and S. Rahmann (2018). "Snakemake-a scalable bioinformatics workflow engine." Bioinformatics 34(20): 3600.
3.
Mbatchou, J., Barnard, L., Backman, J., Marcketta, A., Kosmicki, J. A., Ziyatdinov, A., et al. (2021). Computationally efficient whole-genome regression for quantitative and binary traits. Nat. Genet. 53 (7), 1097–1103.
4.
Pedersen, B.S., Bhetariya, P.J., Brown, J., Kravitz, S.N., Marth, G., Jensen, R.L., Bronner, M.P., Underhill, H.R., and Quinlan, A.R. (2020). Somalier: rapid relatedness estimation for cancer and germline studies using efficient genome sketches.Genome Med. 12, 62.
5.
McLaren W, Gil L, Hunt SE, Riat HS, Ritchie GR, Thormann A, Flicek P, Cunningham F. (2016). The Ensembl Variant Effect Predictor. Genome Biology Jun 6;17(1):122.
6.
Karczewski, K.J., Francioli, L.C., Tiao, G. et al. (2020). The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443.
7.
Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira M, Bender D, Maller J, Sklar P, de Bakker P, Daly MJ, Sham PC (2007) PLINK: A Tool Set for Whole-Genome and Population-Based Linkage Analyses. American Journal of Human Genetics, 81.
8.
Chang CC, Chow CC, Tellier LCAM, Vattikuti S, Purcell SM, Lee JJ (2015) Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience, 4.
Code Snippets
22 23 24 25 26 27 28 29 30 | shell: """ set +u module load GATK gatk SelectVariants -V {input.vcf} -O {params.unzipped_vcf} -R {params.ref} --max-nocall-fraction {params.filt_nocall} module load bcftools bgzip {params.unzipped_vcf} tabix -p vcf {output.cleaned_vcf} """ |
72 73 74 75 76 77 78 79 80 81 82 | shell: """ ml plink/1 plink --vcf {input.cleaned_vcf} --vcf-half-call m --set-missing-var-ids @:# --not-chr M --geno {params.geno} --maf {params.maf} --double-id --set-hh-missing --update-sex {params.sex} --make-bed --out {params.bed1} ml plink/2 plink2 --bfile {params.prefix} --indep-pairwise {params.ld_window} {params.ld_step} {params.ld_thresh} --make-bed --out {params.ld_bed} plink2 --bfile {params.ld_bed} --exclude {params.ld_bed}.prune.out --mac {params.mac} --make-bed --out {params.bed} ml plink/1 plink --bfile {params.bed} --bmerge {params.bed_01} --make-bed --out {params.step2_01} --update-sex {params.sex} plink --bfile {params.bed} --bmerge {params.bed_05} --make-bed --out {params.step2_05} --update-sex {params.sex} """ |
34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 | shell: """ regenie \ --step 1 \ --bed {params.step1} \ --covarFile {params.covariates} \ --covarColList {params.covarColList} \ --phenoFile {params.phenoFile} \ --phenoColList {params.phenoColList} \ --bsize {params.step1_bsize} \ --threads {threads} \ --lowmem \ --gz \ --nauto 23 \ --lowmem-prefix tmp_rg \ --out {params.step1} \ regenie \ --step 2 \ --bed {params.bed_01} \ --covarFile {params.covariates} \ --covarColList {params.covarColList} \ --phenoFile {params.phenoFile} \ --phenoColList {params.phenoColList} \ --write-mask-snplist \ --pred {params.step1}_pred.list \ --anno-file {input.annot_01} \ --mask-def {input.mask_01} \ --vc-tests skat,skato \ --debug \ --nauto 23 \ --bsize {params.step2_bsize} \ --build-mask sum \ --check-burden-files \ --set-list {input.list_01} \ --strict-check-burden \ --aaf-bins 0.01 \ --minMAC 1 \ --write-samples \ --out {params.step201} \ regenie \ --step 2 \ --bed {params.bed_05} \ --covarFile {params.covariates} \ --covarColList {params.covarColList} \ --phenoFile {params.phenoFile} \ --phenoColList {params.phenoColList} \ --write-mask-snplist \ --pred {params.step1}_pred.list \ --anno-file {input.annot_05} \ --mask-def {input.mask_05} \ --vc-tests skat,skato \ --debug \ --nauto 23 \ --bsize {params.step2_bsize} \ --build-mask sum \ --check-burden-files \ --set-list {input.list_05} \ --strict-check-burden \ --aaf-bins 0.05 \ --minMAC 1 \ --write-samples \ --out {params.step205} \ """ |
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 | shell: """ echo "Extracting sites to estimate ancestry." {params.exe} extract -d {params.outdir}/ --sites {params.sites} -f {params.ref} {input.vcf} echo "Estimating relatedness." {params.exe} relate --infer -i -o {params.outdir}/relatedness {output.somalier} echo "Estimating ancestry." {params.exe} ancestry --n-pcs=10 -o {params.outdir}/ancestry --labels {params.ancestry_data}/ancestry-labels-1kg.tsv {params.ancestry_data}/*.somalier ++ {output.somalier} || {{ # Somalier ancestry error, # usually due to not finding # any sites compared to the # its references, expected # with sub-sampled datasets echo "WARNING: Somalier ancestry failed..." 1>&2 touch {output.ancestry} }} """ |
65 66 67 68 69 70 71 72 73 | shell: """ mkdir -p {params.outdir_regenie} mkdir -p {params.outdir_QC} paste -d '\t' <(cut -f 1,2 {input.covariates}) <(cut -f 5 {input.sex}) > {output.sex_file} head -n 1 {input.ancestry} > {output.subset} awk '/P00/' {input.ancestry} >> {output.subset} paste -d '\t' <(cut -f 1,2 {input.covariates}) <(cut -f 5 {input.sex}) <(cut -f 9- {output.subset}) > {output.reg_covariates} """ |
19 20 21 22 23 24 25 26 27 28 29 | shell: """ module load bcftools bcftools view --max-alleles 2 {input.vcf} -O z -o {output.sub} mkdir -p {params.outdir} set +u module load VEP vep -i {output.sub} -o {output.vep_vcf} --force_overwrite --fork 12 --fasta {params.ref} --species human --assembly {params.vep_assembly} --cache --dir_cache $VEP_CACHEDIR --offline --format vcf --compress_output bgzip --everything --pick --vcf module load bcftools tabix -p vcf {output.vep_vcf} """ |
56 57 58 59 60 61 62 63 64 65 66 67 68 | shell: """ mkdir -p {params.outdir} set +u SLIVAR_IMPACTFUL_ORDER={params.slivar_order} {params.slivar_tool} expr --js {params.slivar_js} -g {params.gnomad_db} --vcf {input.vep_vcf} --info "INFO.gnomad_popmax_af < {params.gnomad_af} && INFO.impactful" --pass-only > {output.unzipped_vcf} module load bcftools bcftools +split-vep {output.unzipped_vcf} -f '%CHROM:%POS:%REF:%ALT\t%SYMBOL\t%Consequence\n' > {output.slivar_annot} module load plink/2 plink2 --make-bed --max-alleles 2 --double-id --out {params.bed} --set-missing-var-ids @:# --vcf {output.unzipped_vcf} --vcf-half-call m bgzip -c {output.unzipped_vcf} > {output.slivar_vcf} tabix -p vcf {output.slivar_vcf} """ |
95 96 97 98 99 100 101 102 103 104 105 106 107 | shell: """ mkdir -p {params.outdir} set +u SLIVAR_IMPACTFUL_ORDER={params.slivar_order} {params.slivar_tool} expr --js {params.slivar_js} -g {params.gnomad_db} --vcf {input.vep_vcf} --info "INFO.gnomad_popmax_af < {params.gnomad_af} && INFO.impactful" --pass-only > {output.unzipped_vcf} module load bcftools bcftools +split-vep {output.unzipped_vcf} -f '%CHROM:%POS:%REF:%ALT\t%SYMBOL\t%Consequence\n' > {output.slivar_annot} module load plink/2 plink2 --make-bed --max-alleles 2 --double-id --out {params.bed} --set-missing-var-ids @:# --vcf {output.unzipped_vcf} --vcf-half-call m bgzip -c {output.unzipped_vcf} > {output.slivar_vcf} tabix -p vcf {output.slivar_vcf} """ |
128 129 130 131 132 | shell: """ module load R Rscript {params.script} {params.wdir} {input.in1} {output.annot_out1} {output.list_out1} {output.mask_out1} {input.in2} {output.annot_out2} {output.list_out2} {output.mask_out2} """ |
Support
- Future updates
Related Workflows





