# GWAS Meta-Analysis Pipeline with Snakemake and Singularity
## Prerequisites

- Snakemake
- Singularity
## Run

1. Modify the `config.yaml` file
2. Variant QC
3. Single-cohort association
4. Meta-analysis
5. Scan signal regions
6. Plot signals
7. Select independent signals
## 1. Modify the `config.yaml` file

You'll need to provide the following information.
- `input`: Path to the input phenotype files, including the `cohort`, `group` and `phenotype` wildcards. The phenotype file needs to be formatted as shown below:

  | ID | res | zres |
  |---------|--------|-------|
  | SampleA | 0.5532 | 0.312 |
  | SampleB | 0.323 | 0.150 |

  The third column will be used as the phenotype value.
- `cohorts`: For each cohort, the plink binary prefix (`bfile`) and the path to the genetic-relatedness matrix (`grm`) created using GCTA are required.
- `group`: An identifier for the phenotype group. This currently has no functional importance other than being used as a prefix for output file names.
- `phenotypes`: Either a list of phenotype names, or a path to a file containing a list of phenotype names, which will be used to fetch the phenotype files based on the `input` configuration value.
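
For orientation, here is a minimal sketch of what a `config.yaml` might look like. All paths, cohort names, and phenotype names are hypothetical, and the exact key layout may differ; the `config.yaml` shipped with the workflow is the authoritative reference:

```yaml
# Hypothetical values for illustration only.
input: "data/{cohort}/{group}.{phenotype}.txt"  # must contain all three wildcards

cohorts:
  cohortA:
    bfile: "genotypes/cohortA"  # plink .bed/.bim/.fam prefix
    grm: "grm/cohortA"          # GCTA genetic-relatedness matrix prefix
  cohortB:
    bfile: "genotypes/cohortB"
    grm: "grm/cohortB"

group: "metabolites"

phenotypes:  # or a path to a file listing one phenotype name per line
  - glucose
  - ldl
```

The GRM prefix should point to a matrix built with GCTA, e.g. `gcta64 --bfile genotypes/cohortA --make-grm --out grm/cohortA`.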
## 2. Variant QC

This step filters the genotype data according to the thresholds specified in the `config.yaml` file.

```bash
snakemake --cores 1 --use-singularity all1
```
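
For this and every subsequent step, Snakemake's standard dry-run flag is a cheap way to check which jobs would be scheduled before committing compute:

```bash
# Dry run: print the planned jobs without executing them.
snakemake -n --cores 1 --use-singularity all1
```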
## 3. Single-cohort association

Run the single-cohort single-point association analysis using GCTA-MLMA. The `rate_limit` resource is a global budget that limits how many rate-limited jobs Snakemake schedules concurrently:

```bash
snakemake --cores 1 --resources rate_limit=30 --use-singularity all2
```
## 4. Meta-analysis

Run the single-point association meta-analysis using METAL:

```bash
snakemake --cores 1 --use-singularity create_all_metal
```
## 5. Scan signal regions

Scan the meta-analysis results to extract significant signal regions. Regions are defined by a significance threshold and a flank size (the `signif` and `span` parameters visible in the code snippets below):

```bash
snakemake --cores 1 --use-singularity detect_all_peaks
```
## 6. Plot signals

Run PeakPlotter on all signal regions to create regional plots and annotated data:

```bash
snakemake --cores 1 --resources rate_limit=30 --use-singularity collect_all_peak_csvs
```
## 7. Select independent signals

Run LD-clumping and GCTA-COJO to identify independent signals within all signal regions:

```bash
snakemake --cores 1 --use-singularity run_all_cojo
```
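
The aggregated COJO results end up in a single gzipped CSV (see the first code snippet in the Code Snippets section). A quick way to eyeball the table, with a hypothetical output path standing in for the real one:

```bash
# Path is illustrative; check your run's output directory for the actual file.
zcat output/cojo/all_cojo.csv.gz | head
```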
## Questions
**Q. Why do the `freq` and `freq_geno` column values in the `.jma.cojo` file differ?**

A. The `freq_geno` column is the frequency of the `refA` column's allele in the input bfile (you can check this with `plink --freq`). The `freq` column is the exact value extracted from the input cojofile, which was in turn created from the corresponding metal file: `freq` comes from the metal file's `Alt_Freq` column, which is the "weighted average of frequency for Alt allele across all studies". So the two differ because `freq_geno` is simply the allele frequency of the variant in the genotype file (plink bfile) combined from all cohorts, whereas `freq` is the weighted average frequency across cohorts, as calculated by METAL.
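
To verify `freq_geno` yourself, compute allele frequencies directly from the combined bfile. The prefix below is a placeholder for whichever combined fileset your run produced:

```bash
# Hypothetical bfile prefix; substitute your combined plink fileset.
plink --bfile output/bfile/combined --freq --out combined_check
# Compare the A1 frequency in combined_check.frq against freq_geno
# (mind which allele is listed as refA in the .jma.cojo file).
```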
**Q. When I try to run a rule, I get an error saying `Text file busy`. What do I do?**

A. Delete the script and restore it using `git restore workflow/scripts/problematic_script.sh`. Your rules should run normally after doing this.
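
A minimal sketch of that fix, with `problematic_script.sh` standing in for whichever script triggered the error:

```bash
# Remove the busy script, then restore a fresh copy from git.
rm workflow/scripts/problematic_script.sh
git restore workflow/scripts/problematic_script.sh
```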
## Snakefile order

- `read-config.smk`
- `variant-qc.smk`
- `single-cohort.smk`
- `meta-analysis.smk`
- `detect-peaks.smk`
- `peakplot.smk`
- `cojo.smk`
- `query.smk`
- `gwas.smk`
## Code Snippets
```
run:
    # Collect every per-peak .jma.cojo table into one gzipped CSV,
    # flagging each peak's most significant variant as the COJO top hit.
    concat_list = list()
    for f in input:
        string = re.sub(r'.*\/cojo\/', '', f)
        string = re.sub(r'\/.*', '', string)
        group, phenotype = string.split('.')
        peak = f.split('/')[-1].replace(f'{string}.', '').replace('.jma.cojo', '')
        df = pd.read_csv(f, sep = '\t', header = 0)
        min_p = df['p'].min()
        df['is cojo tophit'] = False
        df.loc[df['p'] == min_p, 'is cojo tophit'] = True
        df.insert(0, 'peak', peak)
        df.insert(0, 'phenotype', phenotype)
        df.insert(0, 'group', group)
        concat_list.append(df)
    pd.concat(concat_list).to_csv(output[0], index = False, header = True, compression = 'gzip')
```
```
shell:
    """
    workflow/scripts/make_cojofile.sh \
        {params.group} \
        {params.phenotype} \
        {params.bfile} \
        {input.metal} \
        {params.prefix} 2>&1 > {log}
    """
```
```
shell:
    """
    workflow/scripts/run_cojo.sh \
        {params.bfile} \
        {input.cojofile} \
        {input.metal} \
        {params.threshold} \
        {params.group} \
        {params.phenotype} \
        {params.chrom} \
        {params.start} \
        {params.end} \
        {params.prefix} \
        {resources.mem_mb} 2>&1 > {log}
    """
```
```
run:
    # Concatenate all non-empty peak tables and sort by the first five columns.
    all_df = list()
    for f in input:
        try:
            df = pd.read_csv(f, sep = '\t', header = None)
        except pd.errors.EmptyDataError:
            continue
        all_df.append(df)
    all_df = pd.concat(all_df)
    all_df.sort_values([0, 1, 2, 3, 4], inplace = True)
    all_df.to_csv(output[0], sep = '\t', header = False, index = False)
```
```
shell: "python3 workflow/scripts/collect_peaks.py {input} {params.span} {params.signif} {params.group} {params.phenotype} {output} 2>&1 > {log}"
```
```
shell:
    """
    # Create command file
    echo 'SEPARATOR TAB' > {output.cmd}
    echo 'MARKER SNP' >> {output.cmd}
    echo 'ALLELE A1 A2' >> {output.cmd}
    echo 'FREQLABEL Freq' >> {output.cmd}
    echo 'EFFECT b' >> {output.cmd}
    echo 'STDERR se' >> {output.cmd}
    echo 'PVALUE p' >> {output.cmd}
    echo 'SCHEME STDERR' >> {output.cmd}
    echo 'AVERAGEFREQ ON' >> {output.cmd}
    echo 'MINMAXFREQ ON' >> {output.cmd}
    for f in {input} ; do
        echo \"PROCESS $f\" >> {output.cmd}
    done
    echo 'OUTFILE {params.prefix}. .txt' >> {output.cmd}
    echo 'ANALYZE' >> {output.cmd}

    metal < {output.cmd} 2>&1 > {log}

    ## CLEAN, COMPRESS, INDEX
    cat \
        <(echo "Chrom MarkerName Pos Allele1 Allele2 Freq1 FreqSE MinFreq MaxFreq Effect StdErr P-value Direction" | tr ' ' '\\t') \
        <(tail -n+2 {params.prefix}.1.txt | sed 's/^chr//' | tr ':' '\\t' | awk 'BEGIN{{FS=\"\\t\";OFS=\"\\t\"}}{{print $1,\"chr\"$1\":\"$2,$2,toupper($3),toupper($4),$5,$6,$7,$8,$9,$10,$11,$12}}' | sort -k1,1n -k3,3n) \
        | bgzip > {output.metal} 2>> {log}
    tabix -s1 -b3 -e3 -S1 {output.metal} 2>> {log}
    rm {params.prefix}.1.txt

    ## Modify info file
    mv {params.prefix}.1.txt.info {output.info}
    sed -i 's,{params.prefix}.1.txt,{output.metal},' {output.info}
    sed -i 's/Marker /MarkerName /' {output.info}
    """
```
```
shell:
    """
    mergelist=$(mktemp /tmp/single-point-analysis-pipeline.merge_bfiles.XXXXXX)
    exec 3>\"$mergelist\"
    echo {params.input} | tr ' ' '\\n' >&3

    plink --allow-no-sex --threads {threads} --make-bed \
        --merge-list $mergelist \
        --out {params.out} || true # || true to prevent exiting due to bash strict mode

    # If merge failed due to multiallelics, exclude multiallelics and re-merge
    if [ -f "{output.missnp}" ]
    then
        exec 3>\"$mergelist\"
        for bfile in {params.input}
        do
            tmp_bfile=${{bfile}}-combine-tmp
            plink --allow-no-sex --threads {threads} --make-bed \
                --bfile $bfile \
                --exclude {output.missnp} \
                --out $tmp_bfile
            echo $tmp_bfile >&3
        done

        plink --allow-no-sex --threads {threads} --make-bed \
            --merge-list $mergelist \
            --out {params.out}
    else
        touch {output.missnp}
    fi
    rm output/bfile/*-combine-tmp.*
    exec 3>&- # close file descriptor 3
    """
```
```
shell:
    """
    plink --threads {threads} \
        --bfile {params.bfile} \
        --freq \
        --out {params.bfile}

    awk -F '[[:space:]]+' '{params.awk}' {output.frq} > {output.frq2}
    """
```
```
shell:
    """
    # Look up this cohort's frequency threshold (derived from the MAC cutoff),
    # then list variants at or below it for exclusion.
    mac_threshold=$(awk -F' ' -v id='{params.id}' '{{if ($1==id){{print $4}}}}' {input.mac})
    awk -F ' ' -v mac_threshold=\"$mac_threshold\" '{{if ($2<=mac_threshold){{print $1}}}}' {params.bfile}.frq2 > {output.excludelist}

    plink --allow-no-sex --threads {threads} --make-bed \
        --bfile {params.bfile} \
        --exclude {output.excludelist} \
        --out {params.out}
    """
```
```
shell:
    """
    # Drop merge-failure and excluded variants, plus any rows containing 'na', then re-index.
    zgrep -v -w -f <(cat {input.missnp} {input.excludelist}) {input.metal} | grep -v na | bgzip > {output.gz}
    tabix -s1 -b3 -e3 -S1 {output.gz}
    """
```
```
shell:
    """
    run_manqq.R \
        --chr-col Chrom \
        --pval-col P-value \
        --pos-col Pos \
        --a1 Alt \
        --a2 Ref \
        --build 38 \
        --image png \
        --af-col Alt_Freq \
        --maf-filter {params.maf} {input} \
        {params.out_prefix} 2>&1 > {log}
    """
```
```
run:
    # Write novel variants into chunk files of at most 200 lines each for batch POST queries.
    peaks = pd.read_csv(input[0])
    novel = peaks['ensembl_consequence']=='novel'
    subset = peaks.loc[novel, ['chrom', 'ps', 'a1', 'a2']].drop_duplicates()
    # Create POST input data
    variants = [f"{chrom} {pos} . {ref} {alt} . . ." for _, (chrom, pos, ref, alt) in subset.iterrows()]
    max_limit = 200
    chunks = [variants[i:i+max_limit] for i in range(0, len(variants), max_limit)]
    path = Path(output[1])
    path.mkdir(parents=True, exist_ok=True)
    files = list()
    for idx, chunk in enumerate(chunks):
        filename = path.joinpath(f"chunk_{idx}.txt")
        files.append(filename)
        with open(filename, 'w') as f:
            for line in chunk:
                f.write(line)
                f.write('\n')
    pd.DataFrame(files).to_csv(output[0], index = False, header = False)
```
```
run:
    # Combine all per-peak PeakPlotter CSVs; also write a significant-only subset.
    concat_list = list()
    for f in input:
        string = re.sub(r'.*\/peaks\/', '', f)
        string = re.sub(r'\/.*', '', string)
        group, phenotype = string.split('.')
        peak = f.split('/')[-1].replace('.500kb.csv', '')
        df = pd.read_csv(f, header = 0)
        df.insert(0, 'peak', peak)
        df.insert(0, 'phenotype', phenotype)
        df.insert(0, 'group', group)
        concat_list.append(df)
    all_data = pd.concat(concat_list)
    all_data.to_csv(output[0], index = False, header = True, compression='gzip')
    all_data[all_data['p-value']<=params.p_threshold].to_csv(output[1], index = False, header = True, compression='gzip')
```
```
shell:
    """
    peakplotter-region \
        --chr-col Chrom \
        --pos-col Pos \
        --rs-col MarkerName \
        --pval-col P-value \
        --a1-col Allele1 \
        --a2-col Allele2 \
        --maf-col Freq1 \
        --build 38 \
        --assoc-file {input.metal} \
        --bfiles {params.bfile} \
        --out {params.outdir} \
        --chrom {params.chrom} \
        --start {params.start} \
        --end {params.end} \
        --debug
    """
```
```
shell:
    """
    pheno_file={params.outprefix}.pheno
    mlma={params.outprefix}.mlma
    mlma_bgz={params.outprefix}.mlma.gz

    # Build the GCTA phenotype file (FID, IID, phenotype) from the input's ID and third column.
    awk '{{OFS=\"\\t\"}}NR!=1{{print $1,$1,$3}}' {input.phenotype} > $pheno_file 2> {log}

    gcta64 --mlma \
        --bfile {params.bfile} \
        --grm {params.grm} \
        --pheno $pheno_file \
        --out {params.outprefix} \
        --threads {threads} 2>&1 >> {log}

    bgzip -c $mlma > $mlma_bgz 2>> {log}
    tabix --skip-lines 1 --sequence 1 --begin 3 --end 3 $mlma_bgz 2>&1 >> {log}

    rm $pheno_file $mlma {params.outprefix}.log 2>&1 >> {log}
    """
```
```
shell:
    """
    run_manqq.R \
        --chr-col Chr \
        --pval-col p \
        --pos-col bp \
        --a1 A1 \
        --a2 A2 \
        --build 38 \
        --image png \
        --af-col Freq \
        --no-man \
        --maf-filter {params.filter} \
        {input} \
        {params.prefix} > {log.out} 2> {log.err}
    """
```
```
shell:
    """
    plink \
        --bfile {params.input} \
        --hardy \
        --out {params.output}
    """
```
```
shell:
    """
    # List variants whose HWE test p-value (last column) is at or below the threshold.
    tail -n+2 {input} | tr -s ' ' | awk '{{if ($NF<={params}) {{print $2}}}}' > {output}
    """
```
```
shell:
    """
    plink \
        --bfile {params.input} \
        --missing \
        --out {params.output}
    """
```
```
shell:
    """
    # List variants whose missingness (last column) is at or above the threshold.
    tail -n+2 {input} | tr -s ' ' | awk '{{if ($NF>={params}) {{print $2}}}}' > {output}
    """
```
```
shell: "cat {input} > {output}"
```
```
shell:
    """
    plink \
        --bfile {params.bfile} \
        --exclude {input.exclude_list} \
        --make-bed \
        --out {params.out} \
        --threads {threads}

    # Rewrite variant IDs in the .bim to the 'chr<CHROM>:<POS>' convention used downstream.
    awk -F$'\\t' 'BEGIN{{OFS=\"\\t\"}}{{print $1,\"chr\"$1\":\"$4,$3,$4,$5,$6}}' {params.out}.bim | sponge {params.out}.bim
    """
```
```
run:
    # Derive a frequency threshold from the configured MAC cutoff:
    # n samples carry 2n alleles, so the threshold is MAC / (2n).
    data = pd.read_csv(input[0], sep = '\t')
    n = data.shape[0]
    max_mac = n * 2
    threshold = params.threshold / max_mac
    pd.DataFrame([[params.id, n, max_mac, threshold]]).to_csv(output[0], sep = ' ', index = False, header = False)
```
```
shell: "cat {input} > {output}"
```
```
run:
    # Combine per-cohort MAC tables and add combined (meta-analysis) thresholds
    # for each group.phenotype pair.
    mac = pd.read_csv(
        input[0],
        sep = ' ',
        header = None,
        names = ['id', 'n', 'max_mac', 'maf_threshold'])
    mac['cohort'] = [i[0] for i in mac['id'].str.split('.')]
    mac['group_pheno'] = ['.'.join(i[1:]) for i in mac['id'].str.split('.')]
    group_phenos = set(mac['group_pheno'])
    meta_mac = list()
    for gp in group_phenos:
        combined_n = mac.loc[mac['group_pheno']==gp, 'n'].sum()
        combined_max_mac = combined_n * 2
        combined_threshold = params.threshold / combined_max_mac
        meta_mac.append([gp, combined_n, combined_max_mac, combined_threshold])
    mac.drop(columns = ['cohort', 'group_pheno'], inplace = True)
    all_mac = pd.concat([mac, pd.DataFrame(meta_mac, columns = ['id', 'n', 'max_mac', 'maf_threshold'])])
    all_mac.to_csv(output[0], sep = ' ', index = False, header = True)
```
```python
__version__ = '0.2'

import sys

import pandas as pd
from peakplotter.peakit import peakit
from peakplotter.plotpeaks import read_assoc, get_signals
from peakplotter.test_utils import get_test_logger

assocfile = sys.argv[1]
flank_bp = int(sys.argv[2])
signif = float(sys.argv[3])
group = sys.argv[4]
phenotype = sys.argv[5]
output = sys.argv[6]

chr_col = 'Chrom'
pos_col = 'Pos'
rs_col = 'MarkerName'
pval_col = 'P-value'
a1_col = 'Allele1'
a2_col = 'Allele2'
maf_col = 'Freq1' # Frequency of A1

logger = get_test_logger()

print(f'Running collect_peaks.py (v{__version__})')
print(f'''
chr_col = '{chr_col}'
pos_col = '{pos_col}'
rs_col = '{rs_col}'
pval_col = '{pval_col}'
a1_col = '{a1_col}'
a2_col = '{a2_col}'
maf_col = '{maf_col}'
''')
assoc = read_assoc(assocfile, chr_col, pos_col, pval_col, maf_col, rs_col, a1_col, a2_col, logger)
print('Getting signals')
signals = get_signals(assoc, signif, chr_col, pos_col, pval_col)
print(signals)
if signals.empty:
    print("No peaks found. Exiting.")
    pd.DataFrame().to_csv(output, sep = '\t', header = False, index = False)
    sys.exit(0)

print('Making peak_collections')
peak_collections = peakit(signals, pval_col, chr_col, pos_col, flank_bp)
print('peak_collections before merge')
print(peak_collections)
peak_collections.merge()
print('peak_collections after merge')
print(peak_collections)
peaked = peak_collections.data
peaked.insert(0, 'phenotype', phenotype)
peaked.insert(0, 'group', group)

peaked.to_csv(output, sep = '\t', header = False, index = False)
```
```bash
make_cojofile=v0.0.2
group=$1
phenotype=$2
bfile=$3
metal_file=$4
prefix=$5

echo "=== Running make_cojofile.sh ===
make_cojofile: $make_cojofile
group: $group
phenotype: $phenotype
bfile: $bfile
metal_file: $metal_file
prefix: $prefix
================================"

samplesize=$prefix.samplesize
plink --bfile $bfile --missing --out $prefix
awk -F '[[:space:]]+' '{if(NR!=1){print $3, $5-$4}}' $prefix.lmiss > $samplesize

tmpfile=$(mktemp /tmp/make_cojofile.XXXXXX) # https://unix.stackexchange.com/a/181938
echo "[INFO] Creating temporary file: $tmpfile"
exec 3>"$tmpfile"

join \
    -j 1 \
    -o 0 1.2 1.3 1.4 1.5 1.6 1.7 2.2 \
    <(zcat $metal_file \
        | awk '{if(NR!=1){print "chr"$1":"$3, $4, $5, $6, $10, $11, $12}}' \
        | sort -k 1b,1 \
    ) \
    <(sort -k 1b,1 $samplesize) >&3
# 'sort -k 1b,1' is the recommended sort command if joining on the first field using 'join'
# See 'man join'

cat \
    <(echo "SNP A1 A2 freq b se p N") \
    <(sed 's/:/ /' $tmpfile \
        | sort -n -k1.4 -k2 \
        | sed 's/ /:/' \
    ) \
    | gzip > $prefix.ma.gz

exec 3>&- # close file descriptor 3

# rm $tmpfile
# rm $prefix.{lmiss,samplesize}
rm $prefix.{imiss,nosex}
```
```bash
run_cojo=v0.3.0
bfile=$1
cojofile=$2
metal_file=$3
threshold=$4 # p-value threshold
group=$5
phenotype=$6
chrom=$7
start=$8
end=$9
prefix=${10} # output prefix
memory=${11}

echo "=== Running run_cojo.sh ===
run_cojo: $run_cojo
group: $group
phenotype: $phenotype
bfile: $bfile
chrom: $chrom
start: $start
end: $end
threshold: $threshold
cojofile: $cojofile
metal_file: $metal_file
prefix: $prefix
===========================
"

# TODO: Find a way to insert the $threshold variable
# inside the below awk command.
# See Plink 1.9's .qassoc file format for more details
qassoc=$prefix.qassoc
echo "[$(date), Step 1/4] Creating $qassoc"
tabix $metal_file ${chrom}:${start}-${end} | \
    awk 'BEGIN{OFS="\t";print "CHR\tSNP\tBP\tBETA\tSE\tR2\tT\tP"} \
    {
        split($12, a, "[eE]")
        if($12<1e-6 || (length(a)>1 && a[2] < -6)){
            print $1, "chr"$1":"$3, $3, $10, $11, 1, 1, $12}
    }' > $qassoc
echo
echo "Extracted $(( $(wc -l < $prefix.qassoc) - 1 )) SNP(s)."
echo

## Subset bfile
echo "[$(date), Step 2/4] Subset bfile"
plink --allow-no-sex --make-bed \
    --bfile $bfile \
    --chr $chrom \
    --from-bp $start \
    --to-bp $end \
    --memory $memory \
    --out $prefix
echo -e "\n\n"

echo "[$(date), Step 3/4] Clumping"
## CLUMP from merged file
plink \
    --bfile $prefix \
    --clump $qassoc \
    --clump-kb 1000 \
    --clump-r2 0.05 \
    --memory $memory \
    --out $prefix

clumpedlist=$prefix.clumped.list
awk -v threshold=$threshold '{if($5<threshold && $1~/^[0-9]+$/){print $3}}' $prefix.clumped > $clumpedlist
echo -e "\n\n"

echo "[$(date), Step 4/4] GCTA-COJO"
gcta64 \
    --bfile $prefix \
    --cojo-file <(zcat $cojofile) \
    --extract $clumpedlist \
    --out $prefix \
    --cojo-slct \
    --cojo-p $threshold \
    --cojo-wind 10000 \
    --diff-freq 0.2 \
    --cojo-collinear 0.9 || true
echo -e "\n\n"
echo "[$(date)] Done"

rm -f \
    $prefix.nosex \
    $prefix.clumped.list
```