# GWAS Meta-Analysis Pipeline with Snakemake and Singularity
## Prerequisites

- Snakemake
- Singularity
## Run

1. Modify the `config.yaml` file
2. Variant QC
3. Single-cohort association
4. Meta-analysis
5. Scan signal regions
6. Plot signals
7. Select independent signals
## 1. Modify the `config.yaml` file

You'll need to provide the following information.
- `input`: Path to the input phenotype files, including the `cohort`, `group` and `phenotype` wildcards. The phenotype file needs to be formatted as shown below:

  | ID | res | zres |
  |---------|--------|-------|
  | SampleA | 0.5532 | 0.312 |
  | SampleB | 0.323 | 0.150 |

  The third column will be used as the phenotype value.
- `cohorts`: For each cohort, the plink binary prefix (`bfile`) and the path to the genetic-relatedness matrix (`grm`) created using GCTA are required.
- `group`: An identifier for the phenotype group. This currently has no functional importance other than being used as a prefix for output file names.
- `phenotypes`: Either a list of phenotype names, or a path to a file containing a list of phenotype names, which will be used to fetch the phenotype files based on the `input` configuration value.
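
For orientation, here is a minimal sketch of what a `config.yaml` might look like. All paths, cohort names, and phenotype names are hypothetical, and the exact key layout may differ; the `config.yaml` shipped with the workflow is the authoritative reference:

```yaml
# Hypothetical values for illustration only.
input: "data/{cohort}/{group}.{phenotype}.txt"  # must contain all three wildcards

cohorts:
  cohortA:
    bfile: "genotypes/cohortA"  # plink .bed/.bim/.fam prefix
    grm: "grm/cohortA"          # GCTA genetic-relatedness matrix prefix
  cohortB:
    bfile: "genotypes/cohortB"
    grm: "grm/cohortB"

group: "metabolites"

phenotypes:  # or a path to a file listing one phenotype name per line
  - glucose
  - ldl
```

The GRM prefix should point to a matrix built with GCTA, e.g. `gcta64 --bfile genotypes/cohortA --make-grm --out grm/cohortA`.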
## 2. Variant QC

This step filters the genotype data according to the thresholds specified in the `config.yaml` file.

```bash
snakemake --cores 1 --use-singularity all1
```
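
For this and every subsequent step, Snakemake's standard dry-run flag is a cheap way to check which jobs would be scheduled before committing compute:

```bash
# Dry run: print the planned jobs without executing them.
snakemake -n --cores 1 --use-singularity all1
```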
## 3. Single-cohort association

Run the single-cohort single-point association analysis using GCTA-MLMA. The `rate_limit` resource is a global budget that limits how many rate-limited jobs Snakemake schedules concurrently:

```bash
snakemake --cores 1 --resources rate_limit=30 --use-singularity all2
```
## 4. Meta-analysis

Run the single-point association meta-analysis using METAL:

```bash
snakemake --cores 1 --use-singularity create_all_metal
```
## 5. Scan signal regions

Scan the meta-analysis results to extract significant signal regions. Regions are defined by a significance threshold and a flank size (the `signif` and `span` parameters visible in the code snippets below):

```bash
snakemake --cores 1 --use-singularity detect_all_peaks
```
## 6. Plot signals

Run PeakPlotter on all signal regions to create regional plots and annotated data:

```bash
snakemake --cores 1 --resources rate_limit=30 --use-singularity collect_all_peak_csvs
```
## 7. Select independent signals

Run LD-clumping and GCTA-COJO to identify independent signals within all signal regions:

```bash
snakemake --cores 1 --use-singularity run_all_cojo
```
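
The aggregated COJO results end up in a single gzipped CSV (see the first code snippet in the Code Snippets section). A quick way to eyeball the table, with a hypothetical output path standing in for the real one:

```bash
# Path is illustrative; check your run's output directory for the actual file.
zcat output/cojo/all_cojo.csv.gz | head
```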
## Questions
**Q. Why do the `freq` and `freq_geno` column values in the `.jma.cojo` file differ?**

A. The `freq_geno` column is the frequency of the `refA` column's allele in the input bfile (you can check this with `plink --freq`). The `freq` column is the exact value extracted from the input cojofile, which was in turn created from the corresponding metal file: `freq` comes from the metal file's `Alt_Freq` column, which is the "weighted average of frequency for Alt allele across all studies". So the two differ because `freq_geno` is simply the allele frequency of the variant in the genotype file (plink bfile) combined from all cohorts, whereas `freq` is the weighted average frequency across cohorts, as calculated by METAL.
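
To verify `freq_geno` yourself, compute allele frequencies directly from the combined bfile. The prefix below is a placeholder for whichever combined fileset your run produced:

```bash
# Hypothetical bfile prefix; substitute your combined plink fileset.
plink --bfile output/bfile/combined --freq --out combined_check
# Compare the A1 frequency in combined_check.frq against freq_geno
# (mind which allele is listed as refA in the .jma.cojo file).
```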
**Q. When I try to run a rule, I get an error saying `Text file busy`. What do I do?**

A. Delete the script and restore it using `git restore workflow/scripts/problematic_script.sh`. Your rules should run normally after doing this.
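
A minimal sketch of that fix, with `problematic_script.sh` standing in for whichever script triggered the error:

```bash
# Remove the busy script, then restore a fresh copy from git.
rm workflow/scripts/problematic_script.sh
git restore workflow/scripts/problematic_script.sh
```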
## Snakefile order

- `read-config.smk`
- `variant-qc.smk`
- `single-cohort.smk`
- `meta-analysis.smk`
- `detect-peaks.smk`
- `peakplot.smk`
- `cojo.smk`
- `query.smk`
- `gwas.smk`
## Code Snippets
```
run:
    # Collect every per-peak .jma.cojo table into one gzipped CSV,
    # flagging each peak's most significant variant as the COJO top hit.
    concat_list = list()
    for f in input:
        string = re.sub(r'.*\/cojo\/', '', f)
        string = re.sub(r'\/.*', '', string)
        group, phenotype = string.split('.')
        peak = f.split('/')[-1].replace(f'{string}.', '').replace('.jma.cojo', '')
        df = pd.read_csv(f, sep = '\t', header = 0)
        min_p = df['p'].min()
        df['is cojo tophit'] = False
        df.loc[df['p'] == min_p, 'is cojo tophit'] = True
        df.insert(0, 'peak', peak)
        df.insert(0, 'phenotype', phenotype)
        df.insert(0, 'group', group)
        concat_list.append(df)
    pd.concat(concat_list).to_csv(output[0], index = False, header = True, compression = 'gzip')
```
```
shell:
    """
    workflow/scripts/make_cojofile.sh \
        {params.group} \
        {params.phenotype} \
        {params.bfile} \
        {input.metal} \
        {params.prefix} 2>&1 > {log}
    """
```
```
shell:
    """
    workflow/scripts/run_cojo.sh \
        {params.bfile} \
        {input.cojofile} \
        {input.metal} \
        {params.threshold} \
        {params.group} \
        {params.phenotype} \
        {params.chrom} \
        {params.start} \
        {params.end} \
        {params.prefix} \
        {resources.mem_mb} 2>&1 > {log}
    """
```
```
run:
    # Concatenate all non-empty peak tables and sort by the first five columns.
    all_df = list()
    for f in input:
        try:
            df = pd.read_csv(f, sep = '\t', header = None)
        except pd.errors.EmptyDataError:
            continue
        all_df.append(df)
    all_df = pd.concat(all_df)
    all_df.sort_values([0, 1, 2, 3, 4], inplace = True)
    all_df.to_csv(output[0], sep = '\t', header = False, index = False)
```
```
shell: "python3 workflow/scripts/collect_peaks.py {input} {params.span} {params.signif} {params.group} {params.phenotype} {output} 2>&1 > {log}"
```
```
shell:
    """
    # Create command file
    echo 'SEPARATOR TAB' > {output.cmd}
    echo 'MARKER SNP' >> {output.cmd}
    echo 'ALLELE A1 A2' >> {output.cmd}
    echo 'FREQLABEL Freq' >> {output.cmd}
    echo 'EFFECT b' >> {output.cmd}
    echo 'STDERR se' >> {output.cmd}
    echo 'PVALUE p' >> {output.cmd}
    echo 'SCHEME STDERR' >> {output.cmd}
    echo 'AVERAGEFREQ ON' >> {output.cmd}
    echo 'MINMAXFREQ ON' >> {output.cmd}
    for f in {input} ; do
        echo \"PROCESS $f\" >> {output.cmd}
    done
    echo 'OUTFILE {params.prefix}. .txt' >> {output.cmd}
    echo 'ANALYZE' >> {output.cmd}

    metal < {output.cmd} 2>&1 > {log}

    ## CLEAN, COMPRESS, INDEX
    cat \
        <(echo "Chrom MarkerName Pos Allele1 Allele2 Freq1 FreqSE MinFreq MaxFreq Effect StdErr P-value Direction" | tr ' ' '\\t') \
        <(tail -n+2 {params.prefix}.1.txt | sed 's/^chr//' | tr ':' '\\t' | awk 'BEGIN{{FS=\"\\t\";OFS=\"\\t\"}}{{print $1,\"chr\"$1\":\"$2,$2,toupper($3),toupper($4),$5,$6,$7,$8,$9,$10,$11,$12}}' | sort -k1,1n -k3,3n) \
        | bgzip > {output.metal} 2>> {log}
    tabix -s1 -b3 -e3 -S1 {output.metal} 2>> {log}
    rm {params.prefix}.1.txt

    ## Modify info file
    mv {params.prefix}.1.txt.info {output.info}
    sed -i 's,{params.prefix}.1.txt,{output.metal},' {output.info}
    sed -i 's/Marker /MarkerName /' {output.info}
    """
```
```
shell:
    """
    mergelist=$(mktemp /tmp/single-point-analysis-pipeline.merge_bfiles.XXXXXX)
    exec 3>\"$mergelist\"
    echo {params.input} | tr ' ' '\\n' >&3

    plink --allow-no-sex --threads {threads} --make-bed \
        --merge-list $mergelist \
        --out {params.out} || true # || true to prevent exiting due to bash strict mode

    # If merge failed due to multiallelics, exclude multiallelics and re-merge
    if [ -f "{output.missnp}" ]
    then
        exec 3>\"$mergelist\"
        for bfile in {params.input}
        do
            tmp_bfile=${{bfile}}-combine-tmp
            plink --allow-no-sex --threads {threads} --make-bed \
                --bfile $bfile \
                --exclude {output.missnp} \
                --out $tmp_bfile
            echo $tmp_bfile >&3
        done

        plink --allow-no-sex --threads {threads} --make-bed \
            --merge-list $mergelist \
            --out {params.out}
    else
        touch {output.missnp}
    fi
    rm output/bfile/*-combine-tmp.*
    exec 3>&- # close file descriptor 3
    """
```
```
shell:
    """
    plink --threads {threads} \
        --bfile {params.bfile} \
        --freq \
        --out {params.bfile}

    awk -F '[[:space:]]+' '{params.awk}' {output.frq} > {output.frq2}
    """
```
```
shell:
    """
    # Look up this cohort's frequency threshold (derived from the MAC cutoff),
    # then list variants at or below it for exclusion.
    mac_threshold=$(awk -F' ' -v id='{params.id}' '{{if ($1==id){{print $4}}}}' {input.mac})
    awk -F ' ' -v mac_threshold=\"$mac_threshold\" '{{if ($2<=mac_threshold){{print $1}}}}' {params.bfile}.frq2 > {output.excludelist}

    plink --allow-no-sex --threads {threads} --make-bed \
        --bfile {params.bfile} \
        --exclude {output.excludelist} \
        --out {params.out}
    """
```
```
shell:
    """
    # Drop merge-failure and excluded variants, plus any rows containing 'na', then re-index.
    zgrep -v -w -f <(cat {input.missnp} {input.excludelist}) {input.metal} | grep -v na | bgzip > {output.gz}
    tabix -s1 -b3 -e3 -S1 {output.gz}
    """
```
```
shell:
    """
    run_manqq.R \
        --chr-col Chrom \
        --pval-col P-value \
        --pos-col Pos \
        --a1 Alt \
        --a2 Ref \
        --build 38 \
        --image png \
        --af-col Alt_Freq \
        --maf-filter {params.maf} {input} \
        {params.out_prefix} 2>&1 > {log}
    """
```
```
run:
    # Write novel variants into chunk files of at most 200 lines each for batch POST queries.
    peaks = pd.read_csv(input[0])
    novel = peaks['ensembl_consequence']=='novel'
    subset = peaks.loc[novel, ['chrom', 'ps', 'a1', 'a2']].drop_duplicates()
    # Create POST input data
    variants = [f"{chrom} {pos} . {ref} {alt} . . ." for _, (chrom, pos, ref, alt) in subset.iterrows()]
    max_limit = 200
    chunks = [variants[i:i+max_limit] for i in range(0, len(variants), max_limit)]
    path = Path(output[1])
    path.mkdir(parents=True, exist_ok=True)
    files = list()
    for idx, chunk in enumerate(chunks):
        filename = path.joinpath(f"chunk_{idx}.txt")
        files.append(filename)
        with open(filename, 'w') as f:
            for line in chunk:
                f.write(line)
                f.write('\n')
    pd.DataFrame(files).to_csv(output[0], index = False, header = False)
```
```
run:
    # Combine all per-peak PeakPlotter CSVs; also write a significant-only subset.
    concat_list = list()
    for f in input:
        string = re.sub(r'.*\/peaks\/', '', f)
        string = re.sub(r'\/.*', '', string)
        group, phenotype = string.split('.')
        peak = f.split('/')[-1].replace('.500kb.csv', '')
        df = pd.read_csv(f, header = 0)
        df.insert(0, 'peak', peak)
        df.insert(0, 'phenotype', phenotype)
        df.insert(0, 'group', group)
        concat_list.append(df)
    all_data = pd.concat(concat_list)
    all_data.to_csv(output[0], index = False, header = True, compression='gzip')
    all_data[all_data['p-value']<=params.p_threshold].to_csv(output[1], index = False, header = True, compression='gzip')
```
```
shell:
    """
    peakplotter-region \
        --chr-col Chrom \
        --pos-col Pos \
        --rs-col MarkerName \
        --pval-col P-value \
        --a1-col Allele1 \
        --a2-col Allele2 \
        --maf-col Freq1 \
        --build 38 \
        --assoc-file {input.metal} \
        --bfiles {params.bfile} \
        --out {params.outdir} \
        --chrom {params.chrom} \
        --start {params.start} \
        --end {params.end} \
        --debug
    """
```
```
shell:
    """
    pheno_file={params.outprefix}.pheno
    mlma={params.outprefix}.mlma
    mlma_bgz={params.outprefix}.mlma.gz

    # Build the GCTA phenotype file (FID, IID, phenotype) from the input's ID and third column.
    awk '{{OFS=\"\\t\"}}NR!=1{{print $1,$1,$3}}' {input.phenotype} > $pheno_file 2> {log}

    gcta64 --mlma \
        --bfile {params.bfile} \
        --grm {params.grm} \
        --pheno $pheno_file \
        --out {params.outprefix} \
        --threads {threads} 2>&1 >> {log}

    bgzip -c $mlma > $mlma_bgz 2>> {log}
    tabix --skip-lines 1 --sequence 1 --begin 3 --end 3 $mlma_bgz 2>&1 >> {log}

    rm $pheno_file $mlma {params.outprefix}.log 2>&1 >> {log}
    """
```
```
shell:
    """
    run_manqq.R \
        --chr-col Chr \
        --pval-col p \
        --pos-col bp \
        --a1 A1 \
        --a2 A2 \
        --build 38 \
        --image png \
        --af-col Freq \
        --no-man \
        --maf-filter {params.filter} \
        {input} \
        {params.prefix} > {log.out} 2> {log.err}
    """
```
```
shell:
    """
    plink \
        --bfile {params.input} \
        --hardy \
        --out {params.output}
    """
```
```
shell:
    """
    # List variants whose HWE test p-value (last column) is at or below the threshold.
    tail -n+2 {input} | tr -s ' ' | awk '{{if ($NF<={params}) {{print $2}}}}' > {output}
    """
```
```
shell:
    """
    plink \
        --bfile {params.input} \
        --missing \
        --out {params.output}
    """
```
```
shell:
    """
    # List variants whose missingness (last column) is at or above the threshold.
    tail -n+2 {input} | tr -s ' ' | awk '{{if ($NF>={params}) {{print $2}}}}' > {output}
    """
```
```
shell: "cat {input} > {output}"
```
```
shell:
    """
    plink \
        --bfile {params.bfile} \
        --exclude {input.exclude_list} \
        --make-bed \
        --out {params.out} \
        --threads {threads}

    # Rewrite variant IDs in the .bim to the 'chr<CHROM>:<POS>' convention used downstream.
    awk -F$'\\t' 'BEGIN{{OFS=\"\\t\"}}{{print $1,\"chr\"$1\":\"$4,$3,$4,$5,$6}}' {params.out}.bim | sponge {params.out}.bim
    """
```
```
run:
    # Derive a frequency threshold from the configured MAC cutoff:
    # n samples carry 2n alleles, so the threshold is MAC / (2n).
    data = pd.read_csv(input[0], sep = '\t')
    n = data.shape[0]
    max_mac = n * 2
    threshold = params.threshold / max_mac
    pd.DataFrame([[params.id, n, max_mac, threshold]]).to_csv(output[0], sep = ' ', index = False, header = False)
```
```
shell: "cat {input} > {output}"
```
```
run:
    # Combine per-cohort MAC tables and add combined (meta-analysis) thresholds
    # for each group.phenotype pair.
    mac = pd.read_csv(
        input[0],
        sep = ' ',
        header = None,
        names = ['id', 'n', 'max_mac', 'maf_threshold'])
    mac['cohort'] = [i[0] for i in mac['id'].str.split('.')]
    mac['group_pheno'] = ['.'.join(i[1:]) for i in mac['id'].str.split('.')]
    group_phenos = set(mac['group_pheno'])
    meta_mac = list()
    for gp in group_phenos:
        combined_n = mac.loc[mac['group_pheno']==gp, 'n'].sum()
        combined_max_mac = combined_n * 2
        combined_threshold = params.threshold / combined_max_mac
        meta_mac.append([gp, combined_n, combined_max_mac, combined_threshold])
    mac.drop(columns = ['cohort', 'group_pheno'], inplace = True)
    all_mac = pd.concat([mac, pd.DataFrame(meta_mac, columns = ['id', 'n', 'max_mac', 'maf_threshold'])])
    all_mac.to_csv(output[0], sep = ' ', index = False, header = True)
```
```python
__version__ = '0.2'

import sys

import pandas as pd
from peakplotter.peakit import peakit
from peakplotter.plotpeaks import read_assoc, get_signals
from peakplotter.test_utils import get_test_logger

assocfile = sys.argv[1]
flank_bp = int(sys.argv[2])
signif = float(sys.argv[3])
group = sys.argv[4]
phenotype = sys.argv[5]
output = sys.argv[6]

chr_col = 'Chrom'
pos_col = 'Pos'
rs_col = 'MarkerName'
pval_col = 'P-value'
a1_col = 'Allele1'
a2_col = 'Allele2'
maf_col = 'Freq1' # Frequency of A1

logger = get_test_logger()

print(f'Running collect_peaks.py (v{__version__})')
print(f'''
chr_col = '{chr_col}'
pos_col = '{pos_col}'
rs_col = '{rs_col}'
pval_col = '{pval_col}'
a1_col = '{a1_col}'
a2_col = '{a2_col}'
maf_col = '{maf_col}'
''')
assoc = read_assoc(assocfile, chr_col, pos_col, pval_col, maf_col, rs_col, a1_col, a2_col, logger)
print('Getting signals')
signals = get_signals(assoc, signif, chr_col, pos_col, pval_col)
print(signals)
if signals.empty:
    print("No peaks found. Exiting.")
    pd.DataFrame().to_csv(output, sep = '\t', header = False, index = False)
    sys.exit(0)

print('Making peak_collections')
peak_collections = peakit(signals, pval_col, chr_col, pos_col, flank_bp)
print('peak_collections before merge')
print(peak_collections)
peak_collections.merge()
print('peak_collections after merge')
print(peak_collections)
peaked = peak_collections.data
peaked.insert(0, 'phenotype', phenotype)
peaked.insert(0, 'group', group)

peaked.to_csv(output, sep = '\t', header = False, index = False)
```
```bash
make_cojofile=v0.0.2
group=$1
phenotype=$2
bfile=$3
metal_file=$4
prefix=$5

echo "=== Running make_cojofile.sh ===
make_cojofile: $make_cojofile
group: $group
phenotype: $phenotype
bfile: $bfile
metal_file: $metal_file
prefix: $prefix
================================"

samplesize=$prefix.samplesize
plink --bfile $bfile --missing --out $prefix
awk -F '[[:space:]]+' '{if(NR!=1){print $3, $5-$4}}' $prefix.lmiss > $samplesize

tmpfile=$(mktemp /tmp/make_cojofile.XXXXXX) # https://unix.stackexchange.com/a/181938
echo "[INFO] Creating temporary file: $tmpfile"
exec 3>"$tmpfile"

join \
    -j 1 \
    -o 0 1.2 1.3 1.4 1.5 1.6 1.7 2.2 \
    <(zcat $metal_file \
        | awk '{if(NR!=1){print "chr"$1":"$3, $4, $5, $6, $10, $11, $12}}' \
        | sort -k 1b,1 \
    ) \
    <(sort -k 1b,1 $samplesize) >&3
# 'sort -k 1b,1' is the recommended sort command if joining on the first field using 'join'
# See 'man join'

cat \
    <(echo "SNP A1 A2 freq b se p N") \
    <(sed 's/:/ /' $tmpfile \
        | sort -n -k1.4 -k2 \
        | sed 's/ /:/' \
    ) \
    | gzip > $prefix.ma.gz

exec 3>&- # close file descriptor 3

# rm $tmpfile
# rm $prefix.{lmiss,samplesize}
rm $prefix.{imiss,nosex}
```
```bash
run_cojo=v0.3.0
bfile=$1
cojofile=$2
metal_file=$3
threshold=$4 # p-value threshold
group=$5
phenotype=$6
chrom=$7
start=$8
end=$9
prefix=${10} # output prefix
memory=${11}

echo "=== Running run_cojo.sh ===
run_cojo: $run_cojo
group: $group
phenotype: $phenotype
bfile: $bfile
chrom: $chrom
start: $start
end: $end
threshold: $threshold
cojofile: $cojofile
metal_file: $metal_file
prefix: $prefix
===========================
"

# TODO: Find a way to insert the $threshold variable
# inside the below awk command.
# See Plink 1.9's .qassoc file format for more details
qassoc=$prefix.qassoc
echo "[$(date), Step 1/4] Creating $qassoc"
tabix $metal_file ${chrom}:${start}-${end} | \
    awk 'BEGIN{OFS="\t";print "CHR\tSNP\tBP\tBETA\tSE\tR2\tT\tP"} \
    {
        split($12, a, "[eE]")
        if($12<1e-6 || (length(a)>1 && a[2] < -6)){
            print $1, "chr"$1":"$3, $3, $10, $11, 1, 1, $12}
    }' > $qassoc
echo
echo "Extracted $(( $(wc -l < $prefix.qassoc) - 1 )) SNP(s)."
echo

## Subset bfile
echo "[$(date), Step 2/4] Subset bfile"
plink --allow-no-sex --make-bed \
    --bfile $bfile \
    --chr $chrom \
    --from-bp $start \
    --to-bp $end \
    --memory $memory \
    --out $prefix
echo -e "\n\n"

echo "[$(date), Step 3/4] Clumping"
## CLUMP from merged file
plink \
    --bfile $prefix \
    --clump $qassoc \
    --clump-kb 1000 \
    --clump-r2 0.05 \
    --memory $memory \
    --out $prefix

clumpedlist=$prefix.clumped.list
awk -v threshold=$threshold '{if($5<threshold && $1~/^[0-9]+$/){print $3}}' $prefix.clumped > $clumpedlist
echo -e "\n\n"

echo "[$(date), Step 4/4] GCTA-COJO"
gcta64 \
    --bfile $prefix \
    --cojo-file <(zcat $cojofile) \
    --extract $clumpedlist \
    --out $prefix \
    --cojo-slct \
    --cojo-p $threshold \
    --cojo-wind 10000 \
    --diff-freq 0.2 \
    --cojo-collinear 0.9 || true
echo -e "\n\n"
echo "[$(date)] Done"

rm -f \
    $prefix.nosex \
    $prefix.clumped.list
```