Lange Symposium 2020 - GWAS for Ordinal Phenotypes with OrdinalGWAS.jl

This tutorial demonstrates how to conduct GWAS on ordinal phenotypes using OrdinalGWAS.jl. We will demonstrate various procedures that this software package can perform.

OrdinalGWAS.jl is a Julia package for performing genome-wide association studies (GWAS) for ordered categorical phenotypes using proportional odds model or ordred Probit model. It is useful when the phenotype takes ordered discrete values, e.g., disease status (undiagnosed, pre-disease, mild, moderate, severe). This is especially common in complex diseases where a binary status may not be suitable for all individuals.

Outline

  • Motivation
  • Model
  • Basic Usage
    • Input files
    • Running a simple analysis
    • Output files
  • Customized Analysis
    • Restricting sample/snps
    • Link functions
    • LRT & score tests
    • SNP-set analysis (with annotation file)
  • Additional Features in Documentation
  • Example (if time)

Motivation

The following is a phenotyping algorithm from Eastwood et al. (2016) for diagnosing type II diabetes in the UK biobank population.

diabetes.png

Labels generated include likelihood of diabetes, relying on several variables.

Hard to dichotomize.

Other types of ordinal phenotypes:

  • Disease Severity

  • Disease Progression

  • Maximum stage of a disease reached under treatment

Model

OrdinalGWAS.jl uses an ordered multinomial model, by default it runs the null model and then performs a score test for each SNP using the null model.

  • Assume trait $Y$ takes ordinal values $j \in \{1, \ldots, J\}$.
  • Cumulative probabilities $\alpha_{ij} = \mathbb{P}(Y_i \le j)$ are linked to covariates $\mathbf{x}_i$ via $$ g(\alpha_{ij}) = \theta_j - \mathbf{x}_i^T \boldsymbol{\beta}, \quad j = 1,\ldots, J-1, $$ where $g$ is a strictly increasing link function.
  • Intercepts $\theta_1 \le \cdots \le \theta_{J-1}$ enforces order between categories and regression coefficients $\boldsymbol{\beta}$ reflects effects of covariates.

For reproducibility, check the machine information below. To execute a notebook command, hold down Shift and Enter within the box. This tutorial and corresponding modules have been checked with Julia versions 1.0 and 1.3.

In [1]:
# machine information for this tutorial
versioninfo()
Julia Version 1.3.1
Commit 2d5741174c (2019-12-30 21:36 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.6.0)
  CPU: Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, haswell)
In [2]:
# for use in this tutorial
using CSV, OrdinalGWAS, SnpArrays

Basic usage

Example data set

The data folder contains an example data set with a simulated covariate in covariate.txt.

In [3]:
# content of the data folder
readdir("data")
Out[3]:
7-element Array{String,1}:
 ".DS_Store"            
 "covariate.txt"        
 "hapmap3.bed"          
 "hapmap3.bim"          
 "hapmap3.fam"          
 "hapmap3.map"          
 "hapmap_snpsetfile.txt"

Input files

ordinalgwas expects two input files: one for responses plus covariates (second argument), the other the Plink files for genotypes (third argument).

Covariate and trait file

Covariates and phenotype are provided in a csv file, e.g., covariate.txt, which has one header line for variable names. In this example, variable hypertension representing levels of hypertension is the ordered categorical phenotypes coded as integers 1 to 4. We want to include variable sex as the covariate in GWAS.

In [4]:
run(`head data/covariate.txt`);
famid,perid,faid,moid,sex,hypertension
2431,NA19916,0,0,1,4
2424,NA19835,0,0,2,4
2469,NA20282,0,0,2,4
2368,NA19703,0,0,1,3
2425,NA19901,0,0,2,3
2427,NA19908,0,0,1,4
2430,NA19914,0,0,2,4
2470,NA20287,0,0,2,1
2436,NA19713,0,0,2,3

We can use SnpArrays.jl to read in the raw genotype data.

In [5]:
s1 = SnpArray("data/hapmap3.bed")
Out[5]:
324×13928 SnpArray:
 0x03  0x03  0x00  0x03  0x03  0x03  …  0x02  0x02  0x02  0x03  0x03  0x03
 0x03  0x02  0x02  0x02  0x03  0x03     0x03  0x03  0x03  0x03  0x02  0x03
 0x03  0x03  0x02  0x02  0x02  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x02  0x03  0x02  0x03     0x02  0x02  0x02  0x03  0x03  0x03
 0x03  0x03  0x02  0x02  0x03  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x02  0x03  0x03  0x00  0x03  …  0x02  0x02  0x02  0x03  0x03  0x03
 0x03  0x03  0x03  0x03  0x00  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x00  0x03  0x03  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x02  0x02  0x03  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x02  0x03  0x03  0x03     0x02  0x02  0x02  0x03  0x03  0x03
 0x03  0x03  0x02  0x03  0x03  0x03  …  0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x02  0x03  0x03     0x02  0x02  0x02  0x03  0x03  0x03
 0x03  0x03  0x00  0x03  0x02  0x03     0x02  0x02  0x02  0x03  0x03  0x03
    ⋮                             ⋮  ⋱                       ⋮            
 0x03  0x03  0x02  0x03  0x00  0x03     0x02  0x02  0x02  0x03  0x03  0x03
 0x03  0x03  0x00  0x03  0x03  0x03     0x02  0x02  0x02  0x03  0x03  0x03
 0x03  0x03  0x00  0x03  0x02  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x00  0x02  0x03  0x03  …  0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x02  0x03  0x03  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x02  0x03  0x03     0x02  0x02  0x02  0x03  0x03  0x03
 0x03  0x03  0x02  0x02  0x03  0x03     0x02  0x02  0x02  0x03  0x03  0x03
 0x03  0x03  0x02  0x03  0x03  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x00  0x03  0x02  0x03  …  0x02  0x02  0x02  0x03  0x03  0x03
 0x03  0x03  0x02  0x02  0x03  0x03     0x02  0x02  0x02  0x03  0x03  0x03
 0x03  0x03  0x00  0x02  0x03  0x03     0x02  0x02  0x02  0x03  0x03  0x03
 0x03  0x03  0x00  0x03  0x03  0x03     0x03  0x03  0x03  0x03  0x03  0x03

In this example, there are 324 individuals genotyped at 13,928 SNPs.

Analysis

The following command performs GWAS using the proportional odds model as the default when no link function is specified. The output is the fitted null model.

Formula for null model

The first argument specifies the null model without SNP effects, e.g., @formula(hypertension ~ sex).

In [6]:
ordinalgwas(@formula(hypertension ~ sex), "data/covariate.txt", "data/hapmap3")
Out[6]:
StatsModels.TableRegressionModel{OrdinalMultinomialModel{Int64,Float64,LogitLink},Array{Float64,2}}

hypertension ~ sex

Coefficients:
──────────────────────────────────────────────────────
               Estimate  Std.Error   t value  Pr(>|t|)
──────────────────────────────────────────────────────
intercept1|2  -1.48564    0.358891  -4.13952    <1e-4 
intercept2|3  -0.569479   0.341044  -1.66981    0.0959
intercept3|4   0.429815   0.339642   1.26549    0.2066
sex            0.424656   0.213914   1.98517    0.0480
──────────────────────────────────────────────────────

For documentation of the ordinalgwas function, type ?ordinalgwas in Julia REPL.

In [7]:
?ordinalgwas
search: ordinalgwas ordinalgwasGxE OrdinalGWAS ordinalsnpsetgwas OrdinalRange

Out[7]:
ordinalgwas(nullformula, covfile, plinkfile)
ordinalgwas(nullformula, df, plinkfile)
ordinalgwas(fittednullmodel, plinkfile)
ordinalgwas(fittednullmodel, bedfile, bimfile, bedn)

Positional arguments

  • nullformula::FormulaTerm: formula for the null model.
  • covfile::AbstractString: covariate file (csv) with one header line. One column should be the ordinal phenotype coded as integers starting from 1. For example, ordinal phenotypes can be coded as 1, 2, 3, 4 but not 0, 1, 2, 3.
  • df::DataFrame: DataFrame containing response and regressors for null model.
  • plinkfile::AbstractString: Plink file name without the bed, fam, or bim extensions. If plinkfile==nothing, only null model is fitted. If plinkfile is provided, bed, bim, and fam file with same plinkfile prefix need to exist. Compressed file formats such as gz and bz2 are allowed. Check all allowed formats by SnpArrays.ALLOWED_FORMAT.
  • fittednullmodel::StatsModels.TableRegressionModel: the fitted null model output from ordinalgwas(nullformula, covfile) or ordinalgwas(nullformula, df).
  • bedfile::Union{AbstractString,IOStream}: path to Plink bed file with full file name.
  • bimfile::Union{AbstractString,IOStream}: path to Plink bim file with full file name.
  • bedn::Integer: number of samples in bed file.

Keyword arguments

  • nullfile::Union{AbstractString, IOStream}: output file for the fitted null model; default is ordinalgwas.null.txt.
  • pvalfile::Union{AbstractString, IOStream}: output file for the gwas p-values; default is ordinalgwas.pval.txt.
  • covtype::Vector{DataType}: type information for covfile. This is useful when CSV.read(covarfile) has parsing errors.
  • covrowinds::Union{Nothing,AbstractVector{<:Integer}}: sample indices for covariate file.
  • testformula::FormulaTerm: formula for test unit. Default is @formula(trait ~ 0 + snp).
  • test::Symbol: :score (default) or :lrt.
  • link::GLM.Link: LogitLink() (default), ProbitLink(), CauchitLink(), or CloglogLink().
  • snpmodel: ADDITIVE_MODEL (default), DOMINANT_MODEL, or RECESSIVE_MODEL.
  • snpinds::Union{Nothing,AbstractVector{<:Integer}}: SNP indices for bed file.
  • bedrowinds::Union{Nothing,AbstractVector{<:Integer}}: sample indices for bed file.
  • solver: an optimization solver supported by MathProgBase. Default is NLoptSolver(algorithm=:LD_SLSQP, maxeval=4000). Another common choice is IpoptSolver(print_level=0).
  • verbose::Bool: default is false.

Compressed Plink files are supported. For example, if Plink files are hapmap3.bed.gz, hapmap3.bim.gz and hapmap3.fam.gz, the same command

ordinalgwas(@formula(hypertension ~ sex), "data/covariate.txt", "data/hapmap3")

still works. Check all supported compression format by

In [8]:
SnpArrays.ALLOWED_FORMAT
Out[8]:
6-element Array{String,1}:
 "gz"  
 "zlib"
 "zz"  
 "xz"  
 "zst" 
 "bz2" 

Output files

ordinalgwas outputs two files: ordinalgwas.null.txt and ordinalgwas.pval.txt.

  • ordinalgwas.null.txt lists the estimated null model (without SNPs).
In [9]:
run(`cat ordinalgwas.null.txt`);
StatsModels.TableRegressionModel{OrdinalMultinomialModel{Int64,Float64,LogitLink},Array{Float64,2}}

hypertension ~ sex

Coefficients:
──────────────────────────────────────────────────────
               Estimate  Std.Error   t value  Pr(>|t|)
──────────────────────────────────────────────────────
intercept1|2  -1.48564    0.358891  -4.13952    <1e-4 
intercept2|3  -0.569479   0.341044  -1.66981    0.0959
intercept3|4   0.429815   0.339642   1.26549    0.2066
sex            0.424656   0.213914   1.98517    0.0480
──────────────────────────────────────────────────────
  • ordinalgwas.pval.txt tallies the SNPs and their pvalues.
In [10]:
ENV["COLUMNS"]=120 #shows up to 10 columns for dataframe displays
CSV.read("ordinalgwas.pval.txt")
Out[10]:

13,928 rows × 6 columns

chrpossnpidmafhwepvalpval
Int64Int64StringFloat64Float64Float64
11554484rs104585970.01.01.0
21758311rs125620340.07763980.4098760.00456531
31967643rs27108750.3240744.07625e-73.10828e-5
411168108rs112605660.1915890.1285681.21687e-5
511375074rs13125680.4413582.5376e-190.00820686
611588771rs351541050.01.01.0
711789051rs168245080.004629630.9332780.511198
811990452rs26789390.4537045.07696e-110.299728
912194615rs75531780.2268520.1705610.171333
1012396747rs133763560.144860.9053080.532042
1112623158rs287539130.01.01.0
1212823603rs15634680.4830254.23066e-90.225191
1313025087rs66903730.253879.23864e-80.701847
1413225416rs120435190.0293210.0007788020.00107978
1513431124rs120931170.1099070.2326930.427794
1613633945rs109100170.2218750.0002708880.913129
1713895935rs347709240.02469143.20726e-50.999021
1814096895rs67026330.4752320.6638120.00651631
1914297388rs6849650.3055560.2136330.0951953
2014498133rs118092950.09937890.2571620.0832435
2114698713rs5785280.3240740.6169370.0692307
2214899946rs46544710.3580250.1812650.224531
2315100369rs66811480.1315790.008454470.155667
2415302730rs107991970.4287930.2947420.669055
2515502779rs107964000.2314810.460910.241525
2615703284rs22446320.3947370.3133780.53453
2715904631rs75493240.3672840.5175670.716133
2816106513rs28434940.06481480.7406630.365573
2916310159rs49088800.05555560.289680.579307
3016514524rs9321120.2206790.1726690.303095
In [11]:
#clean up files 
rm("ordinalgwas.null.txt", force=true)
rm("ordinalgwas.pval.txt", force=true)

Output file names can be changed by the nullfile and pvalfile keywords respectively. For example,

ordinalgwas(@formula(hypertension ~ sex), "data/covariate.txt", "data/hapmap3", pvalfile="ordinalgwas.pval.txt.gz")

will output the p-value file in compressed gz format.

Timing

For this moderate-sized data set (N = 324 SNPs = 13,928), ordinalgwas takes around 0.3 seconds.

In [12]:
@time(ordinalgwas(@formula(hypertension ~ sex), "data/covariate.txt", "data/hapmap3"));
  0.219184 seconds (639.63 k allocations: 32.737 MiB, 5.43% gc time)
In [13]:
#clean up files 
rm("ordinalgwas.null.txt", force=true)
rm("ordinalgwas.pval.txt", force=true)

We have applied this software to the COPDGene cohort (N = 5,953, Nsnps = 630,860) and the UKBiobank cohort (N = 185,565, Nsnps = 464,137) published in Genetic Epidemiology. Runtimes were 3.5  minutes and 181 minutes respectively.

SNP and/or sample masks

In practice, we often perform GWAS on selected SNPs and/or selected samples. They can be specified by the snpinds, covrowinds and bedrowinds keywords of ordinalgwas function.

For example, to perform GWAS on SNPs with minor allele frequency (MAF) above 0.05

In [14]:
# create SNP mask
snpinds = maf(SnpArray("data/hapmap3.bed")) .≥ 0.05
# GWAS on selected SNPs
@time ordinalgwas(@formula(hypertension ~ sex), "data/covariate.txt", "data/hapmap3", 
    snpinds=snpinds, nullfile="commonvariant.null.txt", pvalfile="commonvariant.pval.txt")
  0.354116 seconds (974.12 k allocations: 51.225 MiB, 6.77% gc time)
Out[14]:
StatsModels.TableRegressionModel{OrdinalMultinomialModel{Int64,Float64,LogitLink},Array{Float64,2}}

hypertension ~ sex

Coefficients:
──────────────────────────────────────────────────────
               Estimate  Std.Error   t value  Pr(>|t|)
──────────────────────────────────────────────────────
intercept1|2  -1.48564    0.358891  -4.13952    <1e-4 
intercept2|3  -0.569479   0.341044  -1.66981    0.0959
intercept3|4   0.429815   0.339642   1.26549    0.2066
sex            0.424656   0.213914   1.98517    0.0480
──────────────────────────────────────────────────────
In [15]:
CSV.read("commonvariant.pval.txt")
Out[15]:

12,085 rows × 6 columns

chrpossnpidmafhwepvalpval
Int64Int64StringFloat64Float64Float64
11758311rs125620340.07763980.4098760.00456531
21967643rs27108750.3240744.07625e-73.10828e-5
311168108rs112605660.1915890.1285681.21687e-5
411375074rs13125680.4413582.5376e-190.00820686
511990452rs26789390.4537045.07696e-110.299728
612194615rs75531780.2268520.1705610.171333
712396747rs133763560.144860.9053080.532042
812823603rs15634680.4830254.23066e-90.225191
913025087rs66903730.253879.23864e-80.701847
1013431124rs120931170.1099070.2326930.427794
1113633945rs109100170.2218750.0002708880.913129
1214096895rs67026330.4752320.6638120.00651631
1314297388rs6849650.3055560.2136330.0951953
1414498133rs118092950.09937890.2571620.0832435
1514698713rs5785280.3240740.6169370.0692307
1614899946rs46544710.3580250.1812650.224531
1715100369rs66811480.1315790.008454470.155667
1815302730rs107991970.4287930.2947420.669055
1915502779rs107964000.2314810.460910.241525
2015703284rs22446320.3947370.3133780.53453
2115904631rs75493240.3672840.5175670.716133
2216106513rs28434940.06481480.7406630.365573
2316310159rs49088800.05555560.289680.579307
2416514524rs9321120.2206790.1726690.303095
2516715827rs4415150.4938080.7829370.596866
2616917805rs120434290.09876540.6002980.726598
2717119246rs49086000.1568320.6490170.501062
2817319987rs75533720.3560375.28439e-90.122518
2917522841rs11931690.2282610.4819340.134259
3017723190rs49086910.3719140.08751270.497583
In [16]:
# extra header line in commonvariant.pval.txt
countlines("commonvariant.pval.txt"), count(snpinds)
Out[16]:
(12086, 12085)
In [17]:
# clean up
rm("commonvariant.null.txt", force=true)
rm("commonvariant.pval.txt", force=true)

covrowinds specify the samples in the covariate file and bedrowinds for SnpArray. User should be particularly careful when using these two keyword. Selected rows in SnpArray should exactly match the samples in the null model. Otherwise the results are meaningless.

Use the keyword covrowinds to specify selected samples in the covarite file. Use the keyword bedrowinds to specify selected samples in the Plink bed file. For example, to use the first 300 samples in both covariate and bed file:

In [18]:
ordinalgwas(@formula(hypertension ~ sex), "data/covariate.txt", "data/hapmap3", 
    nullfile="first300.null.txt", pvalfile="first300.pval.txt", covrowinds=1:300,
    bedrowinds=1:300)
Out[18]:
StatsModels.TableRegressionModel{OrdinalMultinomialModel{Int64,Float64,LogitLink},Array{Float64,2}}

hypertension ~ sex

Coefficients:
──────────────────────────────────────────────────────
               Estimate  Std.Error   t value  Pr(>|t|)
──────────────────────────────────────────────────────
intercept1|2  -1.42902    0.371752  -3.84401    0.0001
intercept2|3  -0.590123   0.355841  -1.65839    0.0983
intercept3|4   0.407014   0.35423    1.14901    0.2515
sex            0.40068    0.221836   1.8062     0.0719
──────────────────────────────────────────────────────
In [19]:
CSV.read("first300.pval.txt")
Out[19]:

13,928 rows × 6 columns

chrpossnpidmafhwepvalpval
Int64Int64StringFloat64Float64Float64
11554484rs104585970.01.01.0
21758311rs125620340.07763980.4098760.00355969
31967643rs27108750.3240744.07625e-70.000123604
411168108rs112605660.1915890.1285685.2213e-6
511375074rs13125680.4413582.5376e-190.00758234
611588771rs351541050.01.01.0
711789051rs168245080.004629630.9332780.504171
811990452rs26789390.4537045.07696e-110.363679
912194615rs75531780.2268520.1705610.24109
1012396747rs133763560.144860.9053080.453925
1112623158rs287539130.01.01.0
1212823603rs15634680.4830254.23066e-90.148556
1313025087rs66903730.253879.23864e-80.762476
1413225416rs120435190.0293210.0007788020.00131734
1513431124rs120931170.1099070.2326930.530602
1613633945rs109100170.2218750.0002708880.835396
1713895935rs347709240.02469143.20726e-50.96925
1814096895rs67026330.4752320.6638120.0367962
1914297388rs6849650.3055560.2136330.0813781
2014498133rs118092950.09937890.2571620.168299
2114698713rs5785280.3240740.6169370.0654792
2214899946rs46544710.3580250.1812650.232969
2315100369rs66811480.1315790.008454470.106474
2415302730rs107991970.4287930.2947420.693596
2515502779rs107964000.2314810.460910.0755408
2615703284rs22446320.3947370.3133780.616338
2715904631rs75493240.3672840.5175670.987062
2816106513rs28434940.06481480.7406630.558175
2916310159rs49088800.05555560.289680.414671
3016514524rs9321120.2206790.1726690.215976
In [20]:
# clean up
rm("first300.null.txt", force=true)
rm("first300.pval.txt", force=true)

Likelihood ratio test (LRT)

By default, ordinalgwas calculates p-value for each SNP using score test. Score test is fast because it doesn't require fitting alternative model for each SNP. User can request likelihood ratio test (LRT) using keyword test=:lrt. LRT is much slower but may be more powerful than score test.

In [21]:
@time ordinalgwas(@formula(hypertension ~ sex), "data/covariate.txt", "data/hapmap3"; snpinds = 1:10, 
    test=:LRT, nullfile="lrt.null.txt", pvalfile="lrt.pval.txt")
  0.742670 seconds (1.82 M allocations: 92.778 MiB, 3.28% gc time)
Out[21]:
StatsModels.TableRegressionModel{OrdinalMultinomialModel{Int64,Float64,LogitLink},Array{Float64,2}}

hypertension ~ sex

Coefficients:
──────────────────────────────────────────────────────
               Estimate  Std.Error   t value  Pr(>|t|)
──────────────────────────────────────────────────────
intercept1|2  -1.48564    0.358891  -4.13952    <1e-4 
intercept2|3  -0.569479   0.341044  -1.66981    0.0959
intercept3|4   0.429815   0.339642   1.26549    0.2066
sex            0.424656   0.213914   1.98517    0.0480
──────────────────────────────────────────────────────

Note the extra effect column in pvalfile, which is the effect size (regression coefficient) for each SNP.

In [22]:
CSV.read("lrt.pval.txt")
Out[22]:

10 rows × 7 columns

chrpossnpidmafhwepvaleffectpval
Int64Int64StringFloat64Float64Float64Float64
11554484rs104585970.01.00.01.0
21758311rs125620340.07763980.409876-1.005780.00191858
31967643rs27108750.3240744.07625e-7-0.6488561.80505e-5
411168108rs112605660.1915890.128568-0.9157235.87338e-6
511375074rs13125680.4413582.5376e-19-0.3318140.00808102
611588771rs351541050.01.00.01.0
711789051rs168245080.004629630.933278-0.7338030.516903
811990452rs26789390.4537045.07696e-11-0.1358650.299464
912194615rs75531780.2268520.170561-0.2512080.161511
1012396747rs133763560.144860.9053080.1294610.538734
In [23]:
# clean up
rm("lrt.pval.txt", force=true)
rm("lrt.null.txt", force=true)

In this example, GWAS by score test takes less than 0.2 second, while GWAS by LRT takes about 20 seconds. About 100 fold difference in run time.

Score test for screening, LRT for power

For large data sets, a practical solution is to perform score test first, then re-do LRT for the most promising SNPs according to score test p-values.

Step 1: Perform score test GWAS, results in score.pval.txt.

In [24]:
@time ordinalgwas(@formula(hypertension ~ sex), "data/covariate.txt", "data/hapmap3", 
    test=:score, pvalfile="score.pval.txt");
  0.261909 seconds (752.90 k allocations: 38.855 MiB, 6.65% gc time)
In [25]:
CSV.read("score.pval.txt")
Out[25]:

13,928 rows × 6 columns

chrpossnpidmafhwepvalpval
Int64Int64StringFloat64Float64Float64
11554484rs104585970.01.01.0
21758311rs125620340.07763980.4098760.00456531
31967643rs27108750.3240744.07625e-73.10828e-5
411168108rs112605660.1915890.1285681.21687e-5
511375074rs13125680.4413582.5376e-190.00820686
611588771rs351541050.01.01.0
711789051rs168245080.004629630.9332780.511198
811990452rs26789390.4537045.07696e-110.299728
912194615rs75531780.2268520.1705610.171333
1012396747rs133763560.144860.9053080.532042
1112623158rs287539130.01.01.0
1212823603rs15634680.4830254.23066e-90.225191
1313025087rs66903730.253879.23864e-80.701847
1413225416rs120435190.0293210.0007788020.00107978
1513431124rs120931170.1099070.2326930.427794
1613633945rs109100170.2218750.0002708880.913129
1713895935rs347709240.02469143.20726e-50.999021
1814096895rs67026330.4752320.6638120.00651631
1914297388rs6849650.3055560.2136330.0951953
2014498133rs118092950.09937890.2571620.0832435
2114698713rs5785280.3240740.6169370.0692307
2214899946rs46544710.3580250.1812650.224531
2315100369rs66811480.1315790.008454470.155667
2415302730rs107991970.4287930.2947420.669055
2515502779rs107964000.2314810.460910.241525
2615703284rs22446320.3947370.3133780.53453
2715904631rs75493240.3672840.5175670.716133
2816106513rs28434940.06481480.7406630.365573
2916310159rs49088800.05555560.289680.579307
3016514524rs9321120.2206790.1726690.303095

Step 2: Sort score test p-values and find top 10 SNPs.

In [26]:
scorepvals = CSV.read("score.pval.txt")[!, 6] # p-values in 5th column
tophits = sortperm(scorepvals)[1:10] # indices of 10 SNPs with smallest p-values
scorepvals[tophits] # smallest 10 p-values
Out[26]:
10-element Array{Float64,1}:
 1.3080149099181303e-6
 6.536722765052079e-6 
 9.66474218566903e-6  
 1.2168672367668889e-5
 1.8027460018331254e-5
 2.0989542284213636e-5
 2.6844521269963608e-5
 3.108283828554874e-5 
 4.1010912875160476e-5
 4.2966265138454725e-5

Step 3: Re-do LRT on top hits.

In [27]:
@time ordinalgwas(@formula(hypertension ~ sex), "data/covariate.txt", "data/hapmap3", 
    snpinds=tophits, test=:LRT, pvalfile="lrt.pval.txt");
  0.278900 seconds (527.08 k allocations: 28.637 MiB, 4.83% gc time)
In [28]:
CSV.read("lrt.pval.txt")
Out[28]:

10 rows × 7 columns

chrpossnpidmafhwepvaleffectpval
Int64Int64StringFloat64Float64Float64Float64
11967643rs27108750.3240744.07625e-7-0.6488561.80505e-5
211168108rs112605660.1915890.128568-0.9157235.87338e-6
3336821790rs46785530.2345680.1094670.7424951.13038e-5
4411017683rs168814460.2755420.894275-0.7870581.11054e-5
553739190rs125211660.06790120.1861361.146894.78129e-5
667574576rs18854660.1774690.7620690.8750627.27235e-6
7652474721rs20731830.1826630.5077770.7790795.06939e-5
8741152376rs288800.3379630.805237-0.8146349.18013e-7
9784223996rs41286230.07870370.02183471.002226.5879e-5
1023121048059rs19371650.438083.95961e-160.5392311.97546e-5
In [29]:
# clean up
rm("ordinalgwas.null.txt", force=true)
rm("score.pval.txt", force=true)
rm("lrt.pval.txt", force=true)

SNP-set testing

In many applications, we want to test a SNP-set (testing if SNPs have a joint effect together). The function ordinalsnpsetgwas() can be used to do this. The following is an example of using an annotated file where each SNP has a gene annotation.

In [30]:
run(`head data/hapmap_snpsetfile.txt`);
gene1 rs10458597
gene1 rs12562034
gene1 rs2710875
gene1 rs11260566
gene1 rs1312568
gene1 rs35154105
gene1 rs16824508
gene1 rs2678939
gene1 rs7553178
gene1 rs13376356

Now we just need to specify that file to run a snpset analysis.

In [31]:
ordinalsnpsetgwas(@formula(hypertension ~ sex), "data/covariate.txt", "data/hapmap3",
    pvalfile = "snpset.pval.txt", snpset = "data/hapmap_snpsetfile.txt")
Out[31]:
StatsModels.TableRegressionModel{OrdinalMultinomialModel{Int64,Float64,LogitLink},Array{Float64,2}}

hypertension ~ sex

Coefficients:
──────────────────────────────────────────────────────
               Estimate  Std.Error   t value  Pr(>|t|)
──────────────────────────────────────────────────────
intercept1|2  -1.48564    0.358891  -4.13952    <1e-4 
intercept2|3  -0.569479   0.341044  -1.66981    0.0959
intercept3|4   0.429815   0.339642   1.26549    0.2066
sex            0.424656   0.213914   1.98517    0.0480
──────────────────────────────────────────────────────
In [32]:
CSV.read("snpset.pval.txt")
Out[32]:

150 rows × 3 columns

snpsetidnsnpspval
StringInt64Float64
1gene1931.72134e-5
2gene2930.036925
3gene3930.747855
4gene4920.0276508
5gene5930.611958
6gene6930.0291846
7gene7930.391601
8gene8930.125396
9gene9930.708564
10gene10920.0665913
11gene11930.145549
12gene12930.64883
13gene13930.420183
14gene14930.266712
15gene15930.0364581
16gene16930.0641792
17gene17920.70697
18gene18930.738655
19gene19930.304919
20gene20930.349533
21gene21930.54087
22gene22930.161117
23gene23920.0176682
24gene24930.352236
25gene25930.346303
26gene26930.640547
27gene27930.109541
28gene28930.18828
29gene29930.28567
30gene30920.258806
In [33]:
# clean up
rm("snpset.pval.txt", force=true)
rm("ordinalgwas.null.txt", force=true)

Additional Features Available Not Covered Today

The following features are also available in OrdinalGWAS.jl. They are covered in the documentation found here.

  • Additional SNP-set analysis options
    • Sliding window
    • Specific set of snps
  • GxE interaction analysis
  • Link Functions
    • LogitLink(), proportional odds model (default),
    • ProbitLink(), ordred Probit model,
    • CloglogLink(), proportional hazards model, or
    • CauchyLink()
  • Set of multiple plink files (i.e. by chromosome)
    • Running on a cluster
  • Running analysis with a Docker file
  • Plotting

GxE example (if time allows)

GxE interactions

In many applications, the user may want to test the GxE interaction effect. This requires fitting the SNP in the null model and is quite slower, but the command ordinalgwasGxE() can be used test the interaction effect. To do this you must specify the environmental variable in the command, either as a symbol, such as ":age" or as a string "age".

For documentation of the ordinalgwasGxE function, type ?ordinalgwasGxE in Julia REPL.

@docs
?ordinalgwasGxE
In [34]:
?ordinalgwasGxE
search: ordinalgwasGxE ordinalgwas OrdinalGWAS ordinalsnpsetgwas

Out[34]:
ordinalgwasGxE(nullformula, covfile, plinkfile, e)
ordinalgwasGxE(nullformula, df, plinkfile, e)
ordinalgwasGxE(fittednullmodel, plinkfile, e)
ordinalgwasGxE(fittednullmodel, bedfile, bimfile, bedn, e)

Positional arguments

  • nullformula::FormulaTerm: formula for the null model.
  • covfile::AbstractString: covariate file (csv) with one header line. One column should be the ordinal phenotype coded as integers starting from 1. For example, ordinal phenotypes can be coded as 1, 2, 3, 4 but not 0, 1, 2, 3.
  • df::DataFrame: DataFrame containing response and regressors for null model.
  • plinkfile::AbstractString: Plink file name without the bed, fam, or bim extensions. If plinkfile==nothing, only null model is fitted. If plinkfile is provided, bed, bim, and fam file with same plinkfile prefix need to exist. Compressed file formats such as gz and bz2 are allowed. Check all allowed formats by SnpArrays.ALLOWED_FORMAT.
  • e::Union{AbstractString,Symbol}: Enviromental variable to be used to test the GxE interaction.

For instance, for testing sex & snp interaction, use :sex or "sex".

  • fittednullmodel::StatsModels.TableRegressionModel: the fitted null model output from ordinalgwas(nullformula, covfile) or ordinalgwas(nullformula, df).
  • bedfile::Union{AbstractString,IOStream}: path to Plink bed file with full file name.
  • bimfile::Union{AbstractString,IOStream}: path to Plink bim file with full file name.
  • bedn::Integer: number of samples in bed file.

Keyword arguments

  • nullfile::Union{AbstractString, IOStream}: output file for the fitted null model; default is ordinalgwas.null.txt.
  • pvalfile::Union{AbstractString, IOStream}: output file for the gwas p-values; default is ordinalgwas.pval.txt.
  • covtype::Vector{DataType}: type information for covfile. This is useful when CSV.read(covarfile) has parsing errors.
  • covrowinds::Union{Nothing,AbstractVector{<:Integer}}: sample indices for covariate file.
  • test::Symbol: :score (default) or :lrt.
  • link::GLM.Link: LogitLink() (default), ProbitLink(), CauchitLink(), or CloglogLink().
  • snpmodel: ADDITIVE_MODEL (default), DOMINANT_MODEL, or RECESSIVE_MODEL.
  • snpinds::Union{Nothing,AbstractVector{<:Integer}}: SNP indices for bed file.
  • bedrowinds::Union{Nothing,AbstractVector{<:Integer}}: sample indices for bed file.
  • solver: an optimization solver supported by MathProgBase. Default is NLoptSolver(algorithm=:LD_SLSQP, maxeval=4000). Another common choice is IpoptSolver(print_level=0).
  • verbose::Bool: default is false.

The following can be used to test if there is an interaction between sex and the first five SNPs in the data.

In [35]:
ordinalgwasGxE(@formula(hypertension ~ sex), "data/covariate.txt", "data/hapmap3",
    :sex, pvalfile = "gxe_snp.pval.txt", snpinds=1:5, test=:score)
Out[35]:
StatsModels.TableRegressionModel{OrdinalMultinomialModel{Int64,Float64,LogitLink},Array{Float64,2}}

hypertension ~ sex

Coefficients:
──────────────────────────────────────────────────────
               Estimate  Std.Error   t value  Pr(>|t|)
──────────────────────────────────────────────────────
intercept1|2  -1.48564    0.358891  -4.13952    <1e-4 
intercept2|3  -0.569479   0.341044  -1.66981    0.0959
intercept3|4   0.429815   0.339642   1.26549    0.2066
sex            0.424656   0.213914   1.98517    0.0480
──────────────────────────────────────────────────────
In [36]:
CSV.read("gxe_snp.pval.txt")
Out[36]:

5 rows × 7 columns

chrpossnpidmafhwepvalsnpeffectnullpval
Int64Int64StringFloat64Float64Float64Float64
11554484rs104585970.01.00.01.0
21758311rs125620340.07763980.409876-1.005780.637742
31967643rs27108750.3240744.07625e-7-0.6488560.966711
411168108rs112605660.1915890.128568-0.9157230.263527
511375074rs13125680.4413582.5376e-19-0.3318140.781113
In [37]:
# clean up
rm("gxe_snp.pval.txt", force=true)

Now try to run a GxE interaction analysis for the first five snps using an LRT (returns the effect size estimate of the GxE interaction as well), saving the results to a file named gxe_lrt.pval.txt.

In [39]:
## Your code here:

###

Display the results

In [ ]:
CSV.read("gxe_snp.pval.txt")
In [ ]:
# clean up
rm("gxe_snp.pval.txt", force=true)

GxE interactions - testing joint effect

In some applications, the user may want to test SNP effect and/or its interaction with other terms. testformula keyword specifies the test unit besides the covariates in nullformula.

In following example, keyword testformula=@formula(hypertension ~ snp + snp & sex) instructs ordinalgwas to test joint effect of snp and snp & sex interaction.

In [40]:
ordinalgwas(@formula(hypertension ~ sex), "data/covariate.txt", "data/hapmap3", 
    pvalfile="GxE.pval.txt", testformula=@formula(hypertension ~ snp + snp & sex));
In [41]:
CSV.read("GxE.pval.txt")
Out[41]:

13,928 rows × 6 columns

chrpossnpidmafhwepvalpval
Int64Int64StringFloat64Float64Float64
11554484rs104585970.01.01.0
21758311rs125620340.07763980.4098760.017446
31967643rs27108750.3240744.07625e-70.000166707
411168108rs112605660.1915890.1285684.76376e-5
511375074rs13125680.4413582.5376e-190.0291385
611588771rs351541050.01.01.0
711789051rs168245080.004629630.9332780.296436
811990452rs26789390.4537045.07696e-110.379246
912194615rs75531780.2268520.1705610.325582
1012396747rs133763560.144860.9053080.81664
1112623158rs287539130.01.01.0
1212823603rs15634680.4830254.23066e-90.394388
1313025087rs66903730.253879.23864e-80.929342
1413225416rs120435190.0293210.0007788020.00458087
1513431124rs120931170.1099070.2326930.521578
1613633945rs109100170.2218750.0002708880.987983
1713895935rs347709240.02469143.20726e-50.390835
1814096895rs67026330.4752320.6638120.012629
1914297388rs6849650.3055560.2136330.150915
2014498133rs118092950.09937890.2571620.162402
2114698713rs5785280.3240740.6169370.184986
2214899946rs46544710.3580250.1812650.281664
2315100369rs66811480.1315790.008454470.312034
2415302730rs107991970.4287930.2947420.742085
2515502779rs107964000.2314810.460910.403759
2615703284rs22446320.3947370.3133780.751857
2715904631rs75493240.3672840.5175670.47387
2816106513rs28434940.06481480.7406630.629369
2916310159rs49088800.05555560.289680.523624
3016514524rs9321120.2206790.1726690.173462
In [42]:
# clean up
rm("ordinalgwas.null.txt")
rm("GxE.pval.txt")