OrdinalGWAS.jl

OrdinalGWAS.jl is a Julia package for performing genome-wide association studies (GWAS) for ordered categorical phenotypes using proportional odds model or ordred Probit model. It is useful when the phenotype takes ordered discrete values, e.g., disease status (undiagnosed, pre-disease, mild, moderate, severe).

The methods and applications of this software package are detailed in the following publication:

German CA, Sinsheimer JS, Klimentidis YC, Zhou H, Zhou JJ. (2020) Ordered multinomial regression for genetic association analysis of ordinal phenotypes at Biobank scale. Genetic Epidemiology. 44:248-260. https://doi.org/10.1002/gepi.22276

OrdinalGWAS.jl currently supports PLINK, VCF (both dosage and genotype data) file formats, and BGEN file formats. We plan to add PGEN support in the future.

Installation

This package requires Julia v1.6 or later. The package has been registered and can be installed using Julia package manager. Start julia and type:

using Pkg
Pkg.add("OrdinalGWAS")

# machine information for this tutorial
versioninfo()

Julia Version 1.7.1
Commit ac5cc99908 (2021-12-22 19:35 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin19.5.0)
  CPU: Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)

# for use in this tutorial
using BenchmarkTools, CSV, Glob, SnpArrays, OrdinalGWAS, DataFrames

┌ Info: Precompiling BenchmarkTools [6e4b80f9-dd63-53aa-95a3-0cdb28fa8baf]
└ @ Base loading.jl:1423
┌ Info: Precompiling CSV [336ed68f-0bac-5ca0-87d4-7b16caf5d00b]
└ @ Base loading.jl:1423
┌ Info: Precompiling SnpArrays [4e780e97-f5bf-4111-9dc4-b70aaf691b06]
└ @ Base loading.jl:1423
[36m[1m[ [22m[39m[36m[1mInfo: [22m[39mCompiling VCF parser...
┌ Info: Precompiling OrdinalGWAS [00428148-03d9-50ae-bfb7-4a690d5612f1]
└ @ Base loading.jl:1423

Example data sets

The data folder of the package contains the example data sets for use with PLINK and VCF Files. In general, the user can locate this folder by command:

using OrdinalGWAS
const datadir = normpath(joinpath(dirname(pathof(OrdinalGWAS)), "../data/"))

"/Users/kose/.julia/dev/OrdinalGWAS/data/"

# content of the data folder
readdir(glob"*.*", datadir)

16-element Vector{String}:
 "/Users/kose/.julia/dev/OrdinalGWAS/data/bgen_ex.csv"
 "/Users/kose/.julia/dev/OrdinalGWAS/data/bgen_snpsetfile.txt"
 "/Users/kose/.julia/dev/OrdinalGWAS/data/bgen_test.bgen"
 "/Users/kose/.julia/dev/OrdinalGWAS/data/bgen_test.bgen.bgi"
 "/Users/kose/.julia/dev/OrdinalGWAS/data/covariate.txt"
 "/Users/kose/.julia/dev/OrdinalGWAS/data/hapmap3.bed"
 "/Users/kose/.julia/dev/OrdinalGWAS/data/hapmap3.bim"
 "/Users/kose/.julia/dev/OrdinalGWAS/data/hapmap3.fam"
 "/Users/kose/.julia/dev/OrdinalGWAS/data/hapmap3.map"
 "/Users/kose/.julia/dev/OrdinalGWAS/data/hapmap_snpsetfile.txt"
 "/Users/kose/.julia/dev/OrdinalGWAS/data/simtrait.jl"
 "/Users/kose/.julia/dev/OrdinalGWAS/data/simtrait_bgen.jl"
 "/Users/kose/.julia/dev/OrdinalGWAS/data/simtrait_vcf.jl"
 "/Users/kose/.julia/dev/OrdinalGWAS/data/snpsetfile_vcf.txt"
 "/Users/kose/.julia/dev/OrdinalGWAS/data/vcf_example.csv"
 "/Users/kose/.julia/dev/OrdinalGWAS/data/vcf_test.vcf.gz"

The hapmap3 files covariate.txt file correspond to data examples using PLINK formatted files (.bed, .bim, .fam).

The vcf_test.vcf.gz and vcf_example.csv files are for an example analysis using VCF formatted files.

Basic usage

The following command performs GWAS using the proportional odds model for the hapmap3 PLINK files. The output is the fitted null model.

The default type of GWAS performed is a single-snp significance genome-wide scan, this can be changed by the keyword analysistype (default is "singlesnp"). Other types of analyses are gone over later.

ordinalgwas(@formula(trait ~ sex), datadir * "covariate.txt", datadir * "hapmap3")

StatsModels.TableRegressionModel{OrdinalMultinomialModel{Int64, Float64, LogitLink}, Matrix{Float64}}

trait ~ sex

Coefficients:
──────────────────────────────────────────────────────
               Estimate  Std.Error   t value  Pr(>|t|)
──────────────────────────────────────────────────────
intercept1|2  -1.48564    0.358891  -4.13952    <1e-04
intercept2|3  -0.569479   0.341044  -1.66981    0.0959
intercept3|4   0.429815   0.339642   1.26549    0.2066
sex            0.424656   0.213914   1.98517    0.0480
──────────────────────────────────────────────────────

If you do not have any other covariates, you can specify the right-hand side of the formula as 0. For example:

ordinalgwas(@formula(trait ~ 0), datadir * "covariate.txt", datadir * "hapmap3")

StatsModels.TableRegressionModel{OrdinalMultinomialModel{Int64, Float64, LogitLink}, Matrix{Float64}}

trait ~ 0

Coefficients:
───────────────────────────────────────────────────────
               Estimate  Std.Error    t value  Pr(>|t|)
───────────────────────────────────────────────────────
intercept1|2  -2.1112     0.178982  -11.7956     <1e-26
intercept2|3  -1.2        0.131718   -9.11035    <1e-17
intercept3|4  -0.210729   0.111728   -1.88609    0.0602
───────────────────────────────────────────────────────

For documentation of the ordinalgwas function, type ?ordinalgwas in Julia REPL.

OrdinalGWAS.ordinalgwas — Function

ordinalgwas(nullformula, covfile, geneticfile; kwargs...)
ordinalgwas(nullformula, df, geneticfile; kwargs...)
ordinalgwas(fittednullmodel, geneticfile; kwargs...)
ordinalgwas(fittednullmodel, bedfile, bimfile, bedn; kwargs...)

Positional arguments

nullformula::FormulaTerm: formula for the null model.
covfile::AbstractString: covariate file (csv) with one header line. One column should be the ordinal phenotype coded as integers starting from 1. For example, ordinal phenotypes can be coded as 1, 2, 3, 4 but not 0, 1, 2, 3.
df::DataFrame: DataFrame containing response and regressors for null model.
geneticfile::Union{Nothing, AbstractString}: File containing genetic information for GWAS. This includes a PLINK file name without the .bed, .fam, or .bim extensions or a VCF file without the .vcf extension. If geneticfile==nothing, only null model is fitted. If geneticfile is provided, bed, bim, and fam file (or vcf) with the same geneticfile prefix need to exist. Compressed file formats such as gz and bz2 are allowed. Check all allowed formats by SnpArrays.ALLOWED_FORMAT. If you're using a VCF file, make sure to use the geneticformat = "VCF" keyword option, and specificy dosage (:DS) or genotype (:GT) data with the vcftype command.
fittednullmodel::StatsModels.TableRegressionModel: the fitted null model output from ordinalgwas(nullformula, covfile) or ordinalgwas(nullformula, df).
bedfile::Union{AbstractString,IOStream}: path to Plink bed file with full file name.
bimfile::Union{AbstractString,IOStream}: path to Plink bim file with full file name.
bedn::Integer: number of samples in bed/vcf file.

Keyword arguments

analysistype::AbstractString: Type of analysis to conduct. Default is singlesnp. Other options are snpset and gxe.
geneticformat::AbstractString: Type of file used for the genetic analysis. "PLINK", "VCF", and "BGEN" are currently supported. Default is PLINK.
vcftype::Union{Symbol, Nothing}: Data to extract from the VCF file for the GWAS analysis. :DS for dosage or :GT for genotypes. Default is nothing.
nullfile::Union{AbstractString, IOStream}: output file for the fitted null model; default is ordinalgwas.null.txt.
pvalfile::Union{AbstractString, IOStream}: output file for the gwas p-values; default is ordinalgwas.pval.txt.
covtype::Vector{DataType}: type information for covfile. This is useful when CSV.read(covarfile) has parsing errors.
covrowinds::Union{Nothing,AbstractVector{<:Integer}}: sample indices for covariate file.
testformula::FormulaTerm: formula for test unit. Default is @formula(trait ~ 0 + snp).
test::Symbol: :score (default) or :lrt.
link::GLM.Link: LogitLink() (default), ProbitLink(), CauchitLink(), or CloglogLink().
snpmodel: ADDITIVE_MODEL (default), DOMINANT_MODEL, or RECESSIVE_MODEL.
snpinds::Union{Nothing,AbstractVector{<:Integer}}: SNP indices for bed/vcf file.
geneticrowinds::Union{Nothing,AbstractVector{<:Integer}}: sample indices for bed/vcf file.
samplepath::Union{Nothing, AbstractString}: path for BGEN sample file if it's not encoded in the BGEN file.
solver: an optimizer supported by MathOptInterface. Default is NLopt.Optimizer() with algorithm=:LD_SLSQP and maxeval=4000. Another common choice is Ipopt.Optimizer().
verbose::Bool: default is false.
snpset::Union{Nothing, Integer, AbstractString, AbstractVector{<:Integer}}: Only include if you are conducting a snpset analysis. An integer indicates a window of SNPs (i.e. every 500 snps). An abstract string allows you to specify an input file, with no header and two columns separated by a space. The first column must contain the snpset ID and the second column must contain the snpid's identical to the bimfile. An AbstractVector allows you to specify the snps you want to perform one joint snpset test for.
e::Union{AbstractString,Symbol}: Only include if you are conducting a GxE analysis. Enviromental variable to be used to test the GxE interaction. For instance, for testing sex & snp interaction, use :sex or "sex".

Examples

The following is an example of basic GWAS with PLINK files:

plinkfile = "plinkexample"
covfile = "covexample"
ordinalgwas(@formula(trait ~ sex), covfile, plkfile)

The following is an example of basic GWAS with a VCF file using dosages then genotypes:

vcffile = "vcfexample"
covfile = "covexample"
ordinalgwas(@formula(trait ~ sex), covfile, vcffile; 
    geneticfile = "VCF", vcftype = :DS)

ordinalgwas(@formula(trait ~ sex), covfile, vcffile; 
    geneticfile = "VCF", vcftype = :GT)

The following is an example of snpset GWAS (every 50 snps). For more types of snpset analyses see documentation:

ordinalgwas(@formula(trait ~ sex), covfile, plkfile; 
    analysistype = "snpset", snpset = 50)

The following is an example of GxE GWAS testing the interaction effect:

ordinalgwas(@formula(trait ~ sex), covfile, plkfile;
    analysistype = "gxe", e = :sex)

source

Formula for null model

The first argument specifies the null model without SNP effects, e.g., @formula(trait ~ sex).

Input files

ordinalgwas expects two input files: one for responses plus covariates (second argument), the other the genetic file(s) for dosages/genotypes (third argument).

Covariate and trait file

Covariates and phenotype are provided in a csv file, e.g., covariate.txt, which has one header line for variable names. In this example, variable trait is the ordered categorical phenotypes coded as integers 1 to 4. We want to include variable sex as the covariate in GWAS.

run(`head $(datadir)covariate.txt`);

famid,perid,faid,moid,sex,trait
2431,NA19916,0,0,1,4
2424,NA19835,0,0,2,4
2469,NA20282,0,0,2,4
2368,NA19703,0,0,1,3
2425,NA19901,0,0,2,3
2427,NA19908,0,0,1,4
2430,NA19914,0,0,2,4
2470,NA20287,0,0,2,1
2436,NA19713,0,0,2,3

Genetic file(s)

OrdinalGWAS supports PLINK, VCF, and BGEN Files.

Genotype data is available as binary PLINK files.
OrdinalGWAS can use dosage or genotype data from VCF Files.
OrdinalGWAS can use dosage data from BGEN files.

Note

By default, OrdinalGWAS assumes a set of PLINK files will be used. When using a VCF File, VCF file and type of data (dosage, genotype) must be specified by the geneticformat and vcftype options (as shown later). Similarly, BGEN must be specified as the geneticformat if using a BGEN file.

readdir(glob"hapmap3.*", datadir)

4-element Vector{String}:
 "/Users/kose/.julia/dev/OrdinalGWAS/data/hapmap3.bed"
 "/Users/kose/.julia/dev/OrdinalGWAS/data/hapmap3.bim"
 "/Users/kose/.julia/dev/OrdinalGWAS/data/hapmap3.fam"
 "/Users/kose/.julia/dev/OrdinalGWAS/data/hapmap3.map"

In this example, there are 324 samples at 13,928 SNPs.

size(SnpArray(datadir * "hapmap3.bed"))

(324, 13928)

Compressed PLINK and VCF files are supported. For example, if Plink files are hapmap3.bed.gz, hapmap3.bim.gz and hapmap3.fam.gz, the same command

ordinalgwas(@formula(trait ~ sex), datadir * "covariate.txt", datadir * "hapmap3")

still works. Check all supported compression format by

SnpArrays.ALLOWED_FORMAT

6-element Vector{String}:
 "gz"
 "zlib"
 "zz"
 "xz"
 "zst"
 "bz2"

Output files

ordinalgwas outputs two files: ordinalgwas.null.txt and ordinalgwas.pval.txt.

ordinalgwas.null.txt lists the estimated null model (without SNPs).

run(`cat ordinalgwas.null.txt`);

StatsModels.TableRegressionModel{OrdinalMultinomialModel{Int64, Float64, LogitLink}, Matrix{Float64}}

trait ~ 0

Coefficients:
───────────────────────────────────────────────────────
               Estimate  Std.Error    t value  Pr(>|t|)
───────────────────────────────────────────────────────
intercept1|2  -2.1112     0.178982  -11.7956     <1e-26
intercept2|3  -1.2        0.131718   -9.11035    <1e-17
intercept3|4  -0.210729   0.111728   -1.88609    0.0602
───────────────────────────────────────────────────────

ordinalgwas.pval.txt tallies the SNPs and their pvalues.

run(`head ordinalgwas.pval.txt`);

chr,pos,snpid,maf,hwepval,pval
1,554484,rs10458597,0.0,1.0,1.0
1,758311,rs12562034,0.07763975155279501,0.4098763332666681,0.004918111393760363
1,967643,rs2710875,0.32407407407407407,4.076249100705747e-7,2.873976267701535e-5
1,1168108,rs11260566,0.19158878504672894,0.1285682279446898,8.839965152336459e-6
1,1375074,rs1312568,0.441358024691358,2.5376019650614977e-19,0.010123141132845563
1,1588771,rs35154105,0.0,1.0,1.0
1,1789051,rs16824508,0.00462962962962965,0.9332783156468178,0.5146026416188009
1,1990452,rs2678939,0.4537037037037037,5.07695957708431e-11,0.22505726090414777
1,2194615,rs7553178,0.22685185185185186,0.17056143157457776,0.18590541524018153

Output file names can be changed by the nullfile and pvalfile keywords respectively. For example,

ordinalgwas(@formula(trait ~ sex), datadir * "covariate.txt", datadir * "hapmap3",
    pvalfile="ordinalgwas.pval.txt.gz")

will output the p-value file in compressed gz format.

Subsamples

Use the keyword covrowinds to specify selected samples in the covarite file. Use the keyword geneticrowinds to specify selected samples in the Plink bed file or VCF File. For example, to use the first 300 samples in both covariate and bed file:

ordinalgwas(@formula(trait ~ sex), covfile, geneticfile, 
    covrowinds=1:300, geneticrowinds=1:300)

Note

Users should always make sure that the selected samples in covariate file match exactly those in bed file.

Input non-genetic data as DataFrame

Internally ordinalgwas parses the covariate file as a DataFrame by CSV.read(covfile). For covariate file of other formats, users can parse it as a DataFrame and then input the DataFrame to ordinalgwas directly.

ordinalgwas(@formula(trait ~ sex), df, geneticfile)

Note

Users should always make sure that individuals in covariate file or DataFrame match those in Plink fam file/VCF File/BGEN file.

For example, following code checks that the first 2 columns of the covariate.txt file match the first 2 columns of the hapmap3.fam file exactly.

covdf = CSV.read(datadir * "covariate.txt", DataFrame)
plkfam = CSV.read(datadir * "hapmap3.fam", header=0, delim=' ',  DataFrame)
all(covdf[!, 1] .== plkfam[!, 1]) && all(covdf[!, 2] .== plkfam[!, 2])

true

Timing

For this moderate-sized data set, ordinalgwas takes around 0.2 seconds.

@btime(ordinalgwas(@formula(trait ~ sex), datadir * "covariate.txt", datadir * "hapmap3"));

  167.401 ms (486629 allocations: 44.74 MiB)

# clean up
rm("ordinalgwas.null.txt", force=true)
rm("ordinalgwas.pval.txt", force=true)

VCF Formatted Files

By default, OrdinalGWAS.jl will assume you are using PLINK files. It also supports VCF (and BGEN) Files. vcf.gz files are supported as well. To use vcf files in any of the analysis options detailed in this documentation, you simply need to add two keyword options to the ordinalgwas function:

geneticformat: Choices are "VCF" or "PLINK". If you are using a VCF file, use geneticformat = "VCF".
vcftype: Choices are :GT (for genotypes) or :DS (for dosages). This tells OrdinalGWAS which type of data to extract from the VCF file.

Using a VCF File does not output minor allele frequency or hardy weinberg equillibrium p-values for each SNP tested since they may be dosages and MAF/HWE calculations via VCFTools does not currently support dosage data.

The following shows how to run an analysis with a VCF file using the dosage information.

ordinalgwas(@formula(y ~ sex), datadir * "vcf_example.csv", 
    datadir * "vcf_test"; geneticformat = "VCF", vcftype = :DS)

StatsModels.TableRegressionModel{OrdinalMultinomialModel{Int64, Float64, LogitLink}, Matrix{Float64}}

y ~ sex

Coefficients:
───────────────────────────────────────────────────────
               Estimate  Std.Error    t value  Pr(>|t|)
───────────────────────────────────────────────────────
intercept1|2  -1.10106    0.205365  -5.36147     <1e-06
intercept2|3   0.370894   0.188822   1.96425     0.0510
intercept3|4   1.74736    0.232756   7.50726     <1e-11
sex            0.22796    0.262237   0.869289    0.3858
───────────────────────────────────────────────────────

run(`head ordinalgwas.pval.txt`);

chr,pos,snpid,pval
22,20000086,rs138720731,0.6435068072069544
22,20000146,rs73387790,1.0
22,20000199,rs183293480,0.9378258278500581
22,20000291,rs185807825,0.21907288710091172
22,20000428,rs55902548,0.002790402446884929
22,20000683,rs142720028,1.0
22,20000771,rs114690707,0.23034910158749672
22,20000793,rs189842693,0.15612228776953083
22,20000810,rs147349046,0.1741386270145462

BGEN Formatted Files

By default, OrdinalGWAS.jl will assume you are using PLINK files. It also supports BGEN Files. To use BGEN files in any of the analysis options detailed in this documentation, you simply need to add the following keyword option to the ordinalgwas function:

geneticformat: Choices are "VCF" or "PLINK" or "BGEN". If you are using a BGEN file, use geneticformat = "BGEN".

Some features, such as SNP-set analyses, are only available currently when there's an indexing .bgi file available.

Note

BGEN files can contain an optional index file (.bgi file) that allows the variants to be specified in order of position. OrdinalGWAS will automatically look for a file in the same directory as the BGENFILENAME with the name BGENFILENAME.bgi. The BGEN file is read either in the .bgi file order if BGENFILENAME.bgi is supplied in the same directory as the BGEN file, otherwise it will use the order in the BGEN file. This is important in analyses specifying snpinds as well as annotation groupings. You must make sure this matches the way the BGEN file will be read in.

The following shows how to run an analysis with a BGEN file using the dosage information.

ordinalgwas(@formula(y ~ sex), datadir * "bgen_ex.csv", 
    datadir * "bgen_test"; geneticformat = "BGEN")

StatsModels.TableRegressionModel{OrdinalMultinomialModel{Int64, Float64, LogitLink}, Matrix{Float64}}

y ~ sex

Coefficients:
──────────────────────────────────────────────────────
               Estimate  Std.Error   t value  Pr(>|t|)
──────────────────────────────────────────────────────
intercept1|2  -1.08828    0.128371  -8.47759    <1e-15
intercept2|3   0.434834   0.118541   3.66821    0.0003
intercept3|4   1.80196    0.145367  12.396      <1e-30
sex            0.461387   0.162676   2.83623    0.0048
──────────────────────────────────────────────────────

run(`head ordinalgwas.pval.txt`);

chr,pos,snpid,varid,maf,hwepval,infoscore,pval
01,1001,RSID_101,SNPID_101,0.4169765,0.8627403,0.9846387832074154,0.12449779722495212
01,2000,RSID_2,SNPID_2,0.19750981,0.1921813,9.0,0.0005572706473664106
01,2001,RSID_102,SNPID_102,0.19766665,0.18440013,0.72730855295163,0.0004996980174197089
01,3000,RSID_3,SNPID_3,0.48339605,0.9653543,0.955355323674357,0.5866179557677308
01,3001,RSID_103,SNPID_103,0.4833961,0.9653536,0.9553553468765227,0.5866179209956521
01,4000,RSID_4,SNPID_4,0.21670982,0.37192723,0.9917677420119885,0.3522822949083395
01,4001,RSID_104,SNPID_104,0.2167098,0.37192774,0.99176792776617,0.35228235843382233
01,5000,RSID_5,SNPID_5,0.38808233,0.5870128,0.9682577646241436,0.8349949433297409
01,5001,RSID_105,SNPID_105,0.3880863,0.5867037,0.9682302221772928,0.8347201604324502

Link functions

The link keyword argument of ordinalgwas can take value:

LogitLink(), proportional odds model (default),
ProbitLink(), ordred Probit model,
CloglogLink(), proportional hazards model, or
CauchyLink().

For example, to perform GWAS using the ordred Probit model

ordinalgwas(@formula(trait ~ sex), datadir * "covariate.txt", datadir * "hapmap3", 
    link=ProbitLink(), nullfile="opm.null.txt", pvalfile="opm.pval.txt")

StatsModels.TableRegressionModel{OrdinalMultinomialModel{Int64, Float64, ProbitLink}, Matrix{Float64}}

trait ~ sex

Coefficients:
──────────────────────────────────────────────────────
               Estimate  Std.Error   t value  Pr(>|t|)
──────────────────────────────────────────────────────
intercept1|2  -0.866156   0.210677  -4.11129    <1e-04
intercept2|3  -0.359878   0.205817  -1.74854    0.0813
intercept3|4   0.247054   0.205382   1.2029     0.2299
sex            0.251058   0.128225   1.95795    0.0511
──────────────────────────────────────────────────────

The estimates in null model and p-values are slightly different from those in proportional odds moodel.

run(`head opm.pval.txt`);

chr,pos,snpid,maf,hwepval,pval
1,554484,rs10458597,0.0,1.0,1.0
1,758311,rs12562034,0.07763975155279501,0.4098763332666681,0.010076916742307648
1,967643,rs2710875,0.32407407407407407,4.076249100705747e-7,2.62725649418621e-5
1,1168108,rs11260566,0.19158878504672894,0.1285682279446898,1.0897484851087357e-5
1,1375074,rs1312568,0.441358024691358,2.5376019650614977e-19,0.0051028839904383545
1,1588771,rs35154105,0.0,1.0,1.0
1,1789051,rs16824508,0.00462962962962965,0.9332783156468178,0.48653776297937934
1,1990452,rs2678939,0.4537037037037037,5.07695957708431e-11,0.33231290090456195
1,2194615,rs7553178,0.22685185185185186,0.17056143157457776,0.25915513977199217

rm("opm.null.txt", force=true)
rm("opm.pval.txt", force=true)

SNP models

Genotypes are translated into numeric values according to different genetic model, which is specified by the snpmodel keyword. Default is ADDITIVE_MODEL.

Genotype	`SnpArray`	`ADDITIVE_MODEL`	`DOMINANT_MODEL`	`RECESSIVE_MODEL`
A1,A1	0x00	0	0	0
missing	0x01	NaN	NaN	NaN
A1,A2	0x02	1	1	0
A2,A2	0x03	2	1	1

Note

ordinalgwas imputes missing genotypes according to minor allele frequencies.

Users are advised to impute genotypes using more sophiscated methods before GWAS.

SNP and/or sample masks

In practice, we often perform GWAS on selected SNPs and/or selected samples. They can be specified by the snpinds, covrowinds and geneticrowinds keywords of ordinalgwas function.

For example, to perform GWAS on SNPs with minor allele frequency (MAF) above 0.05

# create SNP mask
snpinds = maf(SnpArray(datadir * "hapmap3.bed")) .≥ 0.05
# GWAS on selected SNPs
@time ordinalgwas(@formula(trait ~ sex), datadir * "covariate.txt", datadir * "hapmap3", 
    snpinds=snpinds, nullfile="commonvariant.null.txt", pvalfile="commonvariant.pval.txt")

  1.080135 seconds (2.39 M allocations: 147.498 MiB, 8.99% gc time, 82.17% compilation time)





StatsModels.TableRegressionModel{OrdinalMultinomialModel{Int64, Float64, LogitLink}, Matrix{Float64}}

trait ~ sex

Coefficients:
──────────────────────────────────────────────────────
               Estimate  Std.Error   t value  Pr(>|t|)
──────────────────────────────────────────────────────
intercept1|2  -1.48564    0.358891  -4.13952    <1e-04
intercept2|3  -0.569479   0.341044  -1.66981    0.0959
intercept3|4   0.429815   0.339642   1.26549    0.2066
sex            0.424656   0.213914   1.98517    0.0480
──────────────────────────────────────────────────────

run(`head commonvariant.pval.txt`);

chr,pos,snpid,maf,hwepval,pval
1,758311,rs12562034,0.07763975155279501,0.4098763332666681,0.004565312839535881
1,967643,rs2710875,0.32407407407407407,4.076249100705747e-7,3.108283828553597e-5
1,1168108,rs11260566,0.19158878504672894,0.1285682279446898,1.2168672367654906e-5
1,1375074,rs1312568,0.441358024691358,2.5376019650614977e-19,0.008206860046174836
1,1990452,rs2678939,0.4537037037037037,5.07695957708431e-11,0.29972829571847304
1,2194615,rs7553178,0.22685185185185186,0.17056143157457776,0.17133312458049288
1,2396747,rs13376356,0.1448598130841121,0.9053079215078139,0.5320416198875104
1,2823603,rs1563468,0.4830246913580247,4.23065537243926e-9,0.22519139178356798
1,3025087,rs6690373,0.2538699690402477,9.238641887192776e-8,0.7018469417717436

# extra header line in commonvariant.pval.txt
countlines("commonvariant.pval.txt"), count(snpinds)

(12086, 12085)

# clean up
rm("commonvariant.null.txt", force=true)
rm("commonvariant.pval.txt", force=true)

covrowinds specify the samples in the covariate file and geneticrowinds for PLINK or VCF File. User should be particularly careful when using these two keyword. Selected rows in SnpArray should exactly match the samples in the null model. Otherwise the results are meaningless.

Likelihood ratio test (LRT)

By default, ordinalgwas calculates p-value for each SNP using score test. Score test is fast because it doesn't require fitting alternative model for each SNP. User can request likelihood ratio test (LRT) using keyword test=:lrt. LRT is much slower but may be more powerful than score test.

Because LRT fits the alternative model for each SNP, we also output the standard error and estimated effect size of the SNP.

@time ordinalgwas(@formula(trait ~ sex), datadir * "covariate.txt", datadir * "hapmap3", 
    test=:LRT, nullfile="lrt.null.txt", pvalfile="lrt.pval.txt")

 20.684400 seconds (5.69 M allocations: 1.857 GiB, 1.82% gc time, 1.31% compilation time)





StatsModels.TableRegressionModel{OrdinalMultinomialModel{Int64, Float64, LogitLink}, Matrix{Float64}}

trait ~ sex

Coefficients:
──────────────────────────────────────────────────────
               Estimate  Std.Error   t value  Pr(>|t|)
──────────────────────────────────────────────────────
intercept1|2  -1.48564    0.358891  -4.13952    <1e-04
intercept2|3  -0.569479   0.341044  -1.66981    0.0959
intercept3|4   0.429815   0.339642   1.26549    0.2066
sex            0.424656   0.213914   1.98517    0.0480
──────────────────────────────────────────────────────

Note the extra effect and stder columns in pvalfile, which is the effect size (regression coefficient) and standard errors for each SNP. -1 is reported when there is no variation in the SNP (0 MAF).

run(`head lrt.pval.txt`);

chr,pos,snpid,maf,hwepval,effect,stder,pval
1,554484,rs10458597,0.0,1.0,0.0,-1.0,1.0
1,758311,rs12562034,0.07763975155279501,0.4098763332666681,-1.0057833719544336,0.34080917686608586,0.0019185836579804134
1,967643,rs2710875,0.32407407407407407,4.076249100705747e-7,-0.6488560566291872,0.15659100275284915,1.805050556976133e-5
1,1168108,rs11260566,0.19158878504672894,0.1285682279446898,-0.9157225669357891,0.2177988861470356,5.873384712686268e-6
1,1375074,rs1312568,0.441358024691358,2.5376019650614977e-19,-0.3318136652577256,0.12697404813031218,0.008081022577833346
1,1588771,rs35154105,0.0,1.0,0.0,-1.0,1.0
1,1789051,rs16824508,0.00462962962962965,0.9332783156468178,-0.733802638869998,1.2495287571216083,0.5169027130144458
1,1990452,rs2678939,0.4537037037037037,5.07695957708431e-11,-0.13586499231819757,0.13123472232496708,0.29946402200912603
1,2194615,rs7553178,0.22685185185185186,0.17056143157457776,-0.251207564044013,0.17872139152979566,0.1615106909443865

# clean up
rm("lrt.pval.txt", force=true)
rm("lrt.null.txt", force=true)

In this example, GWAS by score test takes less than 0.2 second, while GWAS by LRT takes over 20 seconds. Over 100 fold difference in run time.

Score test for screening, LRT for power

For large data sets, a practical solution is to perform the score test first across all SNPs, then re-do LRT for the most promising SNPs according to score test p-values.

Step 1: Perform score test GWAS, results in score.pval.txt.

@time ordinalgwas(@formula(trait ~ sex), datadir * "covariate.txt", datadir * "hapmap3", 
    test=:score, pvalfile="score.pval.txt");

  0.276010 seconds (486.63 k allocations: 44.736 MiB, 20.01% gc time)

run(`head score.pval.txt`);

chr,pos,snpid,maf,hwepval,pval
1,554484,rs10458597,0.0,1.0,1.0
1,758311,rs12562034,0.07763975155279501,0.4098763332666681,0.004565312839535881
1,967643,rs2710875,0.32407407407407407,4.076249100705747e-7,3.108283828553597e-5
1,1168108,rs11260566,0.19158878504672894,0.1285682279446898,1.2168672367654906e-5
1,1375074,rs1312568,0.441358024691358,2.5376019650614977e-19,0.008206860046174836
1,1588771,rs35154105,0.0,1.0,1.0
1,1789051,rs16824508,0.00462962962962965,0.9332783156468178,0.5111981332534471
1,1990452,rs2678939,0.4537037037037037,5.07695957708431e-11,0.29972829571847304
1,2194615,rs7553178,0.22685185185185186,0.17056143157457776,0.17133312458049288

Step 2: Sort score test p-values and find top 10 SNPs.

scorepvals = CSV.read("score.pval.txt", DataFrame)[!, end] # p-values in last column
tophits = sortperm(scorepvals)[1:10] # indices of 10 SNPs with smallest p-values
scorepvals[tophits] # smallest 10 p-values

10-element Vector{Float64}:
 1.3080149099173788e-6
 6.536722765046125e-6
 9.664742185663502e-6
 1.2168672367654906e-5
 1.8027460018322353e-5
 2.0989542284210004e-5
 2.6844521269932935e-5
 3.108283828553597e-5
 4.101091287507263e-5
 4.2966265138340985e-5

Step 3: Re-do LRT on top hits.

@time ordinalgwas(@formula(trait ~ sex), datadir * "covariate.txt", datadir * "hapmap3", 
    snpinds=tophits, test=:LRT, pvalfile="lrt.pval.txt");

  0.882847 seconds (2.01 M allocations: 111.597 MiB, 3.12% gc time, 96.29% compilation time)

run(`cat lrt.pval.txt`);

chr,pos,snpid,maf,hwepval,effect,stder,pval
1,967643,rs2710875,0.32407407407407407,4.076249100705747e-7,-0.6488560566291872,0.15659100275284915,1.805050556976133e-5
1,1168108,rs11260566,0.19158878504672894,0.1285682279446898,-0.9157225669357891,0.2177988861470356,5.873384712686268e-6
3,36821790,rs4678553,0.23456790123456794,0.1094668216324497,0.7424952268973514,0.17055414498937366,1.1303825016262592e-5
4,11017683,rs16881446,0.27554179566563464,0.8942746118760274,-0.7870581482955514,0.18729099511224528,1.1105427468799613e-5
5,3739190,rs12521166,0.0679012345679012,0.18613647746093887,1.1468852997925314,0.2809307861547482,4.7812882296576845e-5
6,7574576,rs1885466,0.17746913580246915,0.7620687178987191,0.8750621092263018,0.19417734621377952,7.272346896740631e-6
6,52474721,rs2073183,0.1826625386996904,0.5077765730476698,0.7790794914858653,0.19764574359737597,5.069394513904594e-5
7,41152376,rs28880,0.3379629629629629,0.8052368892744096,-0.8146339024453504,0.17471459882610024,9.180126530294943e-7
7,84223996,rs4128623,0.07870370370370372,0.0218347173467568,1.0022229316338562,0.25534365673199283,6.587895464657107e-5
23,121048059,rs1937165,0.4380804953560371,3.959609737265113e-16,0.5392313636256605,0.1280959604207797,1.975464385552384e-5

# clean up
rm("ordinalgwas.null.txt", force=true)
rm("score.pval.txt", force=true)
rm("lrt.pval.txt", force=true)

GxE or other interactions

Testing jointly G + GxE

In many applications, we want to test SNP effect and/or its interaction with other terms. testformula keyword specifies the test unit besides the covariates in nullformula.

In following example, keyword testformula=@formula(trait ~ snp + snp & sex) instructs ordinalgwas to test joint effect of snp and snp & sex interaction.

ordinalgwas(@formula(trait ~ sex), datadir * "covariate.txt", datadir * "hapmap3", 
    pvalfile="GxE.pval.txt", testformula=@formula(trait ~ snp + snp & sex));

run(`head GxE.pval.txt`);

chr,pos,snpid,maf,hwepval,pval
1,554484,rs10458597,0.0,1.0,1.0
1,758311,rs12562034,0.07763975155279501,0.4098763332666681,0.017446010412245565
1,967643,rs2710875,0.32407407407407407,4.076249100705747e-7,0.0001667073239488232
1,1168108,rs11260566,0.19158878504672894,0.1285682279446898,4.7637624578926465e-5
1,1375074,rs1312568,0.441358024691358,2.5376019650614977e-19,0.029138471242991387
1,1588771,rs35154105,0.0,1.0,1.0
1,1789051,rs16824508,0.00462962962962965,0.9332783156468178,0.29643631147339605
1,1990452,rs2678939,0.4537037037037037,5.07695957708431e-11,0.3792458047934272
1,2194615,rs7553178,0.22685185185185186,0.17056143157457776,0.32558226993239

# clean up
rm("ordinalgwas.null.txt")
rm("GxE.pval.txt")

Testing only GxE interaction term

For some applications, the user may want to simply test the GxE interaction effect. This requires fitting the SNP in the null model and is much slower, but the command ordinalgwas() with keyword analysistype = "gxe" can be used test the interaction effect. The environmental variable must be specified in the command using the keyword argument e, either as a symbol, such as :age or as a string "age".

ordinalgwas(@formula(trait ~ sex), datadir * "covariate.txt", datadir * "hapmap3"; analysistype = "gxe",
    e = :sex, pvalfile = "gxe_snp.pval.txt", snpinds=1:5, test=:score)

StatsModels.TableRegressionModel{OrdinalMultinomialModel{Int64, Float64, LogitLink}, Matrix{Float64}}

trait ~ sex

Coefficients:
──────────────────────────────────────────────────────
               Estimate  Std.Error   t value  Pr(>|t|)
──────────────────────────────────────────────────────
intercept1|2  -1.48564    0.358891  -4.13952    <1e-04
intercept2|3  -0.569479   0.341044  -1.66981    0.0959
intercept3|4   0.429815   0.339642   1.26549    0.2066
sex            0.424656   0.213914   1.98517    0.0480
──────────────────────────────────────────────────────

run(`head gxe_snp.pval.txt`);

chr,pos,snpid,maf,hwepval,snpeffectnull,pval
1,554484,rs10458597,0.0,1.0,0.0,1.0
1,758311,rs12562034,0.07763975155279501,0.4098763332666681,-1.0057833719544336,0.6377422425980503
1,967643,rs2710875,0.32407407407407407,4.076249100705747e-7,-0.6488560566291872,0.9667114197304937
1,1168108,rs11260566,0.19158878504672894,0.1285682279446898,-0.9157225669357891,0.263526746940882
1,1375074,rs1312568,0.441358024691358,2.5376019650614977e-19,-0.3318136652577256,0.7811133315583378

# clean up
rm("gxe_snp.pval.txt", force=true)

SNP-set testing

In many applications, we want to test a SNP-set. The function with keyword analysistype = "snpset" can be used to do this. To specify the type of snpset test, use the snpset keyword argument.

The snpset can be specified as either:

a window (test every X snps) => snpset = X
an annotated file. This requires snpset = filename where filename is an input file, with no header and two columns separated by a space. The first column must contain the snpset ID and the second column must contain the snpid's (rsid) identical to the bimfile, or in the case of a BGEN format, the order the data will be read (see BGEN above for more details).
a joint test on only a specific set of snps. snpset = AbstractVector where the vector specifies the snps you want to perform one joint snpset test for. The vector can either be a vector of integers where the elements are the indicies of the SNPs to test, a vector of booleans, where true represents that you want to select that SNP index, or a range indicating the indicies of the SNPs to test.

In following example, we perform a SNP-set test on the 50th to 55th snps.

ordinalgwas(@formula(trait ~ sex), datadir * "covariate.txt", datadir * "hapmap3"; 
    analysistype = "snpset", pvalfile = "snpset.pval.txt", snpset = 50:55)

StatsModels.TableRegressionModel{OrdinalMultinomialModel{Int64, Float64, LogitLink}, Matrix{Float64}}

trait ~ sex

Coefficients:
──────────────────────────────────────────────────────
               Estimate  Std.Error   t value  Pr(>|t|)
──────────────────────────────────────────────────────
intercept1|2  -1.48564    0.358891  -4.13952    <1e-04
intercept2|3  -0.569479   0.341044  -1.66981    0.0959
intercept3|4   0.429815   0.339642   1.26549    0.2066
sex            0.424656   0.213914   1.98517    0.0480
──────────────────────────────────────────────────────

run(`head snpset.pval.txt`);

The joint pvalue of snps indexed at 50:55 is 0.3647126536663968

# clean up
rm("snpset.pval.txt", force=true)
rm("ordinalgwas.null.txt", force=true)

In the following example we run a SNP-set test on the annotated SNP-set file.

run(`head $datadir/hapmap_snpsetfile.txt`);

gene1 rs10458597
gene1 rs12562034
gene1 rs2710875
gene1 rs11260566
gene1 rs1312568
gene1 rs35154105
gene1 rs16824508
gene1 rs2678939
gene1 rs7553178
gene1 rs13376356

ordinalgwas(@formula(trait ~ sex), datadir * "covariate.txt", datadir * "hapmap3";
    analysistype = "snpset", pvalfile = "snpset.pval.txt", 
    snpset = datadir * "/hapmap_snpsetfile.txt")

StatsModels.TableRegressionModel{OrdinalMultinomialModel{Int64, Float64, LogitLink}, Matrix{Float64}}

trait ~ sex

Coefficients:
──────────────────────────────────────────────────────
               Estimate  Std.Error   t value  Pr(>|t|)
──────────────────────────────────────────────────────
intercept1|2  -1.48564    0.358891  -4.13952    <1e-04
intercept2|3  -0.569479   0.341044  -1.66981    0.0959
intercept3|4   0.429815   0.339642   1.26549    0.2066
sex            0.424656   0.213914   1.98517    0.0480
──────────────────────────────────────────────────────

run(`head snpset.pval.txt`);

snpsetid,nsnps,pval
gene1,93,1.721339469879094e-5
gene2,93,0.03692497684163164
gene3,93,0.7478549371392493
gene4,92,0.02765079822347538
gene5,93,0.6119581594572997
gene6,93,0.02918464223006533
gene7,93,0.3916007348851458
gene8,93,0.12539574109853738
gene9,93,0.708563562160767

# clean up
rm("snpset.pval.txt", force=true)
rm("ordinalgwas.null.txt", force=true)

In the following example we run a SNP-set test on every 15 SNPs.

ordinalgwas(@formula(trait ~ sex), datadir * "covariate.txt", datadir * "hapmap3";
    analysistype = "snpset", pvalfile = "snpset.pval.txt", snpset=15)

StatsModels.TableRegressionModel{OrdinalMultinomialModel{Int64, Float64, LogitLink}, Matrix{Float64}}

trait ~ sex

Coefficients:
──────────────────────────────────────────────────────
               Estimate  Std.Error   t value  Pr(>|t|)
──────────────────────────────────────────────────────
intercept1|2  -1.48564    0.358891  -4.13952    <1e-04
intercept2|3  -0.569479   0.341044  -1.66981    0.0959
intercept3|4   0.429815   0.339642   1.26549    0.2066
sex            0.424656   0.213914   1.98517    0.0480
──────────────────────────────────────────────────────

run(`head snpset.pval.txt`);

startchr,startpos,startsnpid,endchr,endpos,endsnpid,pval
1,554484,rs10458597,1,3431124,rs12093117,1.9394189465428366e-13
1,3633945,rs10910017,1,6514524,rs932112,0.11077869162800988
1,6715827,rs441515,1,9534606,rs4926480,0.2742450817958276
1,9737551,rs12047054,1,12559747,rs4845907,0.49346113650467543
1,12760427,rs848577,1,16021797,rs6679870,0.15447358658256338
1,16228774,rs1972359,1,19100349,rs9426794,0.1544232917023417
1,19301516,rs4912046,1,22122176,rs9426785,0.4749873349462694
1,22323074,rs2235529,1,25166528,rs4648890,0.41246096621461137
1,25368553,rs7527379,1,28208168,rs12140070,0.16458135278661626

# clean up
rm("snpset.pval.txt", force=true)
rm("ordinalgwas.null.txt", force=true)

Plotting Results

To plot the GWAS results, we recommend using the MendelPlots.jl package.

Docker

For ease of using OrdinalGWAS, we provide a Dockerfile so users don't need to install Julia and required packages. Only Docker app needs to be installed in order to run analysis. Following is tested on Docker 2.0.0.0-mac78.

Step 1: Create a Dockerfile with content here, or, if the bash command wget is available, obtain Dockerfile by

# on command line
wget https://raw.githubusercontent.com/OpenMendel/OrdinalGWAS.jl/master/docker/Dockerfile

Step 2: Build a docker image called ordinalgwas-app, assuming that the Dockerfile is located in the ../docker folder. Building the image for the first time can take up to 10 minutes; but it only needs to be done once.

# on command line
docker build -t ordinalgwas-app ../docker/

Step 3: Suppose data files are located at /path/to/data folder, run analysis by

# on command line
docker run -v /path/to/data:/data -t ordinalgwas-app julia -e 'using OrdinalGWAS; ordinalgwas(@formula(trait ~ sex), "/data/covariate.txt", "/data/hapmap3", nullfile="/data/ordinalgwas.null.txt", pvalfile="/data/ordinalgwas");'

Here

-t ordinalgwas-app creates a container using the ordinalgwas-app image build in step 2.
-v /path/to/data:/data maps the /path/to/data folder on host machine to the /data folder within the container.
julia -e 'using OrdinalGWAS; ordinalgwas(@formula(trait ~ sex), "/data/covariate.txt", "/data/hapmap3", nullfile="/data/ordinalgwas.null.txt", pvalfile="/data/ordinalgwas"); calls Julia and runs ordinalgwas function.

The output files are written in /path/to/data directory.

Multiple Plink file sets

In large scale studies, genotypes data are split into multiple Plink files, e.g., by chromosome. Then GWAS analysis can be done in parallel. This can be achieved by two steps.

Let's first create demo data by splitting hapmap3 according to chromosome:

# split example hapmap3 data according to chromosome
SnpArrays.split_plink(datadir * "hapmap3", :chromosome; prefix=datadir * "hapmap3.chr.")
readdir(glob"hapmap3.chr.*", datadir)

75-element Vector{String}:
 "/Users/kose/.julia/dev/OrdinalGWAS/data/hapmap3.chr.1.bed"
 "/Users/kose/.julia/dev/OrdinalGWAS/data/hapmap3.chr.1.bim"
 "/Users/kose/.julia/dev/OrdinalGWAS/data/hapmap3.chr.1.fam"
 "/Users/kose/.julia/dev/OrdinalGWAS/data/hapmap3.chr.10.bed"
 "/Users/kose/.julia/dev/OrdinalGWAS/data/hapmap3.chr.10.bim"
 "/Users/kose/.julia/dev/OrdinalGWAS/data/hapmap3.chr.10.fam"
 "/Users/kose/.julia/dev/OrdinalGWAS/data/hapmap3.chr.11.bed"
 "/Users/kose/.julia/dev/OrdinalGWAS/data/hapmap3.chr.11.bim"
 "/Users/kose/.julia/dev/OrdinalGWAS/data/hapmap3.chr.11.fam"
 "/Users/kose/.julia/dev/OrdinalGWAS/data/hapmap3.chr.12.bed"
 "/Users/kose/.julia/dev/OrdinalGWAS/data/hapmap3.chr.12.bim"
 "/Users/kose/.julia/dev/OrdinalGWAS/data/hapmap3.chr.12.fam"
 "/Users/kose/.julia/dev/OrdinalGWAS/data/hapmap3.chr.13.bed"
 ⋮
 "/Users/kose/.julia/dev/OrdinalGWAS/data/hapmap3.chr.6.bed"
 "/Users/kose/.julia/dev/OrdinalGWAS/data/hapmap3.chr.6.bim"
 "/Users/kose/.julia/dev/OrdinalGWAS/data/hapmap3.chr.6.fam"
 "/Users/kose/.julia/dev/OrdinalGWAS/data/hapmap3.chr.7.bed"
 "/Users/kose/.julia/dev/OrdinalGWAS/data/hapmap3.chr.7.bim"
 "/Users/kose/.julia/dev/OrdinalGWAS/data/hapmap3.chr.7.fam"
 "/Users/kose/.julia/dev/OrdinalGWAS/data/hapmap3.chr.8.bed"
 "/Users/kose/.julia/dev/OrdinalGWAS/data/hapmap3.chr.8.bim"
 "/Users/kose/.julia/dev/OrdinalGWAS/data/hapmap3.chr.8.fam"
 "/Users/kose/.julia/dev/OrdinalGWAS/data/hapmap3.chr.9.bed"
 "/Users/kose/.julia/dev/OrdinalGWAS/data/hapmap3.chr.9.bim"
 "/Users/kose/.julia/dev/OrdinalGWAS/data/hapmap3.chr.9.fam"

Step 1: Fit the null model. Setting third argument geneticfile to nothing instructs ordinalgwas function to fit the null model only.

nm = ordinalgwas(@formula(trait ~ sex), datadir * "covariate.txt", nothing)

StatsModels.TableRegressionModel{OrdinalMultinomialModel{Int64, Float64, LogitLink}, Matrix{Float64}}

trait ~ sex

Coefficients:
──────────────────────────────────────────────────────
               Estimate  Std.Error   t value  Pr(>|t|)
──────────────────────────────────────────────────────
intercept1|2  -1.48564    0.358891  -4.13952    <1e-04
intercept2|3  -0.569479   0.341044  -1.66981    0.0959
intercept3|4   0.429815   0.339642   1.26549    0.2066
sex            0.424656   0.213914   1.98517    0.0480
──────────────────────────────────────────────────────

Step 2: GWAS for each chromosome.

# this part can be submitted as separate jobs
for chr in 1:23
    plinkfile = datadir * "hapmap3.chr." * string(chr)
    pvalfile = plinkfile * ".pval.txt" 
    ordinalgwas(nm, plinkfile, pvalfile=pvalfile)
end

# show the result files
readdir(glob"*.pval.txt", datadir)

23-element Vector{String}:
 "/Users/kose/.julia/dev/OrdinalGWAS/data/hapmap3.chr.1.pval.txt"
 "/Users/kose/.julia/dev/OrdinalGWAS/data/hapmap3.chr.10.pval.txt"
 "/Users/kose/.julia/dev/OrdinalGWAS/data/hapmap3.chr.11.pval.txt"
 "/Users/kose/.julia/dev/OrdinalGWAS/data/hapmap3.chr.12.pval.txt"
 "/Users/kose/.julia/dev/OrdinalGWAS/data/hapmap3.chr.13.pval.txt"
 "/Users/kose/.julia/dev/OrdinalGWAS/data/hapmap3.chr.14.pval.txt"
 "/Users/kose/.julia/dev/OrdinalGWAS/data/hapmap3.chr.15.pval.txt"
 "/Users/kose/.julia/dev/OrdinalGWAS/data/hapmap3.chr.16.pval.txt"
 "/Users/kose/.julia/dev/OrdinalGWAS/data/hapmap3.chr.17.pval.txt"
 "/Users/kose/.julia/dev/OrdinalGWAS/data/hapmap3.chr.18.pval.txt"
 "/Users/kose/.julia/dev/OrdinalGWAS/data/hapmap3.chr.19.pval.txt"
 "/Users/kose/.julia/dev/OrdinalGWAS/data/hapmap3.chr.2.pval.txt"
 "/Users/kose/.julia/dev/OrdinalGWAS/data/hapmap3.chr.20.pval.txt"
 "/Users/kose/.julia/dev/OrdinalGWAS/data/hapmap3.chr.21.pval.txt"
 "/Users/kose/.julia/dev/OrdinalGWAS/data/hapmap3.chr.22.pval.txt"
 "/Users/kose/.julia/dev/OrdinalGWAS/data/hapmap3.chr.23.pval.txt"
 "/Users/kose/.julia/dev/OrdinalGWAS/data/hapmap3.chr.3.pval.txt"
 "/Users/kose/.julia/dev/OrdinalGWAS/data/hapmap3.chr.4.pval.txt"
 "/Users/kose/.julia/dev/OrdinalGWAS/data/hapmap3.chr.5.pval.txt"
 "/Users/kose/.julia/dev/OrdinalGWAS/data/hapmap3.chr.6.pval.txt"
 "/Users/kose/.julia/dev/OrdinalGWAS/data/hapmap3.chr.7.pval.txt"
 "/Users/kose/.julia/dev/OrdinalGWAS/data/hapmap3.chr.8.pval.txt"
 "/Users/kose/.julia/dev/OrdinalGWAS/data/hapmap3.chr.9.pval.txt"

In the rare situations where the multiple sets of Plink files lack the fam file or the corresponding bed and bim files have different filenames, users can explicitly supply bed filename, bim file name, and number of individuals. Replace Step 2 by

Step 2': GWAS for each chromosome.

# this part can be submitted as separate jobs
for chr in 1:23
    bedfile = datadir * "hapmap3.chr." * string(chr) * ".bed"
    bimfile = datadir * "hapmap3.chr." * string(chr) * ".bim"
    pvalfile = datadir * "hapmap3.chr." * string(chr) * ".pval.txt"
    ordinalgwas(nm, bedfile, bimfile, 324; pvalfile=pvalfile)
end

# clean up
rm("ordinalgwas.null.txt", force=true)
isfile(datadir * "fittednullmodel.jld2") && rm(datadir * "fittednullmodel.jld2")
for chr in 1:23
    pvalfile = datadir * "hapmap3.chr." * string(chr) * ".pval.txt"
    rm(pvalfile, force=true)
end
for chr in 1:26
    plinkfile = datadir * "hapmap3.chr." * string(chr)
    rm(plinkfile * ".bed", force=true)
    rm(plinkfile * ".fam", force=true)
    rm(plinkfile * ".bim", force=true)
end

Multiple Plink file sets on cluster

We provide two scripts that successfully run on UCLA's Hoffman2 cluster using Julia v1.0.1 and PBS job schedulaer (qsub).

The first script cluster_preparedata.jl creates a demo data set in current folder. Run

julia cluster_preparedata.jl

on head node.

run(`cat $datadir/../docs/cluster_preparedata.jl`);

#!/usr/local/bin/julia
#
# This script prepares a data set in current folder. 
# For each of chromosome 1-23, there is a set gzipped Plink files:
# hapmap3.chr.1.bed.gz, hapmap3.chr.1.bim.gz, hapmap3.chr.1.fam.gz
# hapmap3.chr.2.bed.gz, hapmap3.chr.2.bim.gz, hapmap3.chr.2.fam.gz
# ...
# hapmap3.chr.23.bed.gz, hapmap3.chr.23.bim.gz, hapmap3.chr.23.fam.gz
# There is also a csv file "covariate.txt" that contains trait and covariates.
#

# install and load Julia packages
using Pkg
if haskey(Pkg.installed(), "SnpArrays")
    Pkg.update("SnpArrays")
else
    Pkg.add(PackageSpec(url="https://github.com/OpenMendel/SnpArrays.jl.git"))
end
if haskey(Pkg.installed(), "OrdinalMultinomialModels")
    Pkg.update("OrdinalMultinomialModels")
else
    Pkg.add(PackageSpec(url="https://github.com/OpenMendel/OrdinalMultinomialModels.jl.git"))
end
if haskey(Pkg.installed(), "OrdinalGWAS")
    Pkg.update("OrdinalGWAS")
else
    Pkg.add(PackageSpec(url="https://github.com/OpenMendel/OrdinalGWAS.jl.git"))
end
using OrdinalMultinomialModels, OrdinalGWAS, SnpArrays

# split hapmap3 data according to chromosome
datadir = normpath(joinpath(dirname(pathof(OrdinalGWAS)), "../data/"))
SnpArrays.split_plink(datadir * "hapmap3", :chromosome; prefix = "hapmap3.chr.")
# compresse Plink files for chromosome 1-23
for chr in 1:23
    plinkfile = "hapmap3.chr." * string(chr)
    SnpArrays.compress_plink(plinkfile)
end
# delete uncompressed chromosome Plink files
for chr in 1:26
    plinkfile = "hapmap3.chr." * string(chr)
    rm(plinkfile * ".bed", force=true)
    rm(plinkfile * ".bim", force=true)
    rm(plinkfile * ".fam", force=true)
end
# copy covariate.txt file
cp(datadir * "covariate.txt", joinpath(pwd(), "covariate.txt"))

The second script cluster_run.jl first fits the null model then submits a separate job for each chromosome. Run

julia cluster_run.jl

on head node.

run(`cat $datadir/../docs/cluster_run.jl`);

#!/usr/local/bin/julia
#
# This script demonstrates how to submit multiple OrdinalGWAS runs from multiple sets of
# Plink files on UCLA Hoffman2 cluster. It assumes that a demo data is available by
# running `julia cluster_preparedata.jl` at current folder.
#

using OrdinalGWAS, Serialization

# Step 1: fit null model and save result to file `fittednullmodel.jls`
nm = ordinalgwas(@formula(trait ~ sex), "covariate.txt", nothing)
open("fittednullmodel.jls", "w") do io
    Serialization.serialize(io, nm)
end

# Step 2: GWAS for each chromosome
for chr in 1:23
    println("submit job for chromosome=$chr")
    jcode = "using OrdinalGWAS, Serialization;
    nm = open(deserialize, \"fittednullmodel.jls\");
    bedfile = \"hapmap3.chr.\" * string($chr) * \".bed.gz\";
    bimfile = \"hapmap3.chr.\" * string($chr) * \".bim.gz\";
    pvalfile = \"hapmap3.chr.\" * string($chr) * \".pval.txt\";
    ordinalgwas(nm, bedfile, bimfile, 324; pvalfile=pvalfile);"
    # prepare sh file for qsub
    open("tmp.sh", "w") do io
        println(io, "#!/bin/bash")
        println(io, "#\$ -cwd")
        println(io, "# error = Merged with joblog")
        println(io, "#\$ -o joblog.\$JOB_ID")
        println(io, "#\$ -j y")
        println(io, "#\$ -l h_rt=0:30:00,h_data=2G") # request runtime and memory
        println(io, "#\$ -pe shared 2") # request # shared-memory nodes
        println(io, "# Email address to notify")
        println(io, "#\$ -M \$USER@mail")
        println(io, "# Notify when")
        println(io, "#\$ -m a")
        println(io)
        println(io, "# load the job environment:")
        println(io, ". /u/local/Modules/default/init/modules.sh")
        println(io, "module load julia/1.0.1") # available Julia version
        println(io)
        println(io, "# run julia code")
        println(io, "julia -e '$jcode' > output.\$JOB_ID 2>&1")
    end
    # submit job
    run(`qsub tmp.sh`)
end

Proportional Odds Assumption

The ordered multinomial model used in this package is a cumulative logit model where one effect size is estimated for each covariate. Using the logit link yields the commonly used proportional odds model which assumes the proportional odds assumption holds for your data/model. The proportional odds assumption assumes that the effect of a covariate $\boldsymbol{\beta}$ is constant across different groupings of the ordinal outcome i.e.

The SNP's effect (on the odds ratio) is the same for the odds ratio of mild vs {medium, severe} as it is for the odds ratio of {mild, medium} vs severe.

The consequences/neccesity of this assumption has been thoroughly discussed, as an example see this blog post. In simualtions, we did not find violation of this to increase type I error. As stated in the post, when the assumption is violated (meaning there are different effect sizes for different groupings), the estimated effect size is like a weighted average of the different effect sizes. Formal tests for proportional odds are likely to reject the assumption for large n (see p. 306, Categorical Data Analysis 3rd edition by Agresti). Violation of this assumption means the effect size $\boldsymbol{\beta}$ differs across outcome strata, but should not imply a significant finding is not significant.