TrajGWAS.jl
TrajGWAS.jl is a Julia package for performing genome-wide association studies (GWAS) for continuous longitudinal phenotypes using a modified linear mixed effects model. It builds upon the within-subject variance estimation by robust regression (WiSER) method and can be used to identify variants associated with changes in the mean and within-subject variability of the longitduinal trait. The estimation procedure is robust to distributional misspecifications of both the random effects and the response. A saddlepoint approximation (SPA) option is implemented to provide improved power and calibrated type I error for rare variants.
TrajGWAS.jl currently supports PLINK, VCF (both dosage and genotype data), and BGEN file formats. We plan to add PGEN support in the future.
Installation
This package requires Julia v1.6 or later. The package is registered and can be installed using the command:
using Pkg
pkg"add TrajGWAS"
To run the code in this document, the packages installed by the following command are also necessary:
pkg"add BenchmarkTools CSV Glob PrettyTables"
# machine information for this tutorial
versioninfo()
Julia Version 1.7.1
Commit ac5cc99908 (2021-12-22 19:35 UTC)
Platform Info:
OS: macOS (x86_64-apple-darwin19.5.0)
CPU: Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-12.0.1 (ORCJIT, skylake)
# for use in this tutorial
ENV["COLUMNS"] = 400
using BenchmarkTools, CSV, Glob, SnpArrays, TrajGWAS, PrettyTables
Example data sets
The data
folder of the package contains the example data sets for use with PLINK and VCF Files. In general, the user can locate this folder by command:
pvalpath = "trajgwas.pval.txt"
nullpath = "trajgwas.null.txt"
const datadir = normpath(joinpath(dirname(pathof(TrajGWAS)), "../data/"))
"/Users/xyz/.julia/dev/TrajGWAS/data/"
# content of the data folder
readdir(glob"*.*", datadir)
14-element Vector{String}:
"/Users/xyz/.julia/dev/TrajGWAS/data/bgen_snpsetfile.txt"
"/Users/xyz/.julia/dev/TrajGWAS/data/covariate.txt"
"/Users/xyz/.julia/dev/TrajGWAS/data/example.8bits.bgen"
"/Users/xyz/.julia/dev/TrajGWAS/data/example.8bits.bgen.bgi"
"/Users/xyz/.julia/dev/TrajGWAS/data/hapmap3.bed"
"/Users/xyz/.julia/dev/TrajGWAS/data/hapmap3.bim"
"/Users/xyz/.julia/dev/TrajGWAS/data/hapmap3.fam"
"/Users/xyz/.julia/dev/TrajGWAS/data/hapmap_snpsetfile.txt"
"/Users/xyz/.julia/dev/TrajGWAS/data/sim_data.jl"
"/Users/xyz/.julia/dev/TrajGWAS/data/snpsetfile_vcf.txt"
"/Users/xyz/.julia/dev/TrajGWAS/data/test_vcf.vcf.gz"
"/Users/xyz/.julia/dev/TrajGWAS/data/trajgwas_bgen_ex.csv"
"/Users/xyz/.julia/dev/TrajGWAS/data/trajgwas_plinkex.csv"
"/Users/xyz/.julia/dev/TrajGWAS/data/trajgwas_vcfex.csv"
The hapmap3
files trajgwas_plinkex.csv
file correspond to data examples using PLINK formatted files (.bed, .bim, .fam).
The test_vcf.vcf.gz
and trajgwas_vcfex.csv
files are for an example analysis using VCF formatted files.
The example.8bits.bgen
and trajgwas_bgen_ex.csv
files are for an example analysis using VCF formatted files.
Basic usage
The following command performs GWAS using for the hapmap3 PLINK files. The output is the fitted null model.
The default type of GWAS performed is a single-snp significance genome-wide scan, this can be changed by the keyword analysistype
(default is "singlesnp"). Other types of analyses are covered later. It outputs the null model, runtime of fitting the null model, and convergence metrics.
trajgwas(@formula(y ~ 1 + sex + onMeds),
@formula(y ~ 1),
@formula(y ~ 1 + sex + onMeds),
:id,
datadir * "trajgwas_plinkex.csv",
datadir * "hapmap3",
pvalfile = pvalpath,
nullfile = nullpath)
******************************************************************************
This program contains Ipopt, a library for large-scale nonlinear optimization.
Ipopt is released as open source code under the Eclipse Public License (EPL).
For more information visit https://github.com/coin-or/Ipopt
******************************************************************************
run = 1, ‖Δβ‖ = 0.037090, ‖Δτ‖ = 0.136339, ‖ΔL‖ = 0.005441, status = LOCALLY_SOLVED, time(s) = 0.430740
run = 2, ‖Δβ‖ = 0.000913, ‖Δτ‖ = 0.019810, ‖ΔL‖ = 0.001582, status = LOCALLY_SOLVED, time(s) = 0.039891
Within-subject variance estimation by robust regression (WiSER)
Mean Formula:
y ~ 1 + sex + onMeds
Random Effects Formula:
y ~ 1
Within-Subject Variance Formula:
y ~ 1 + sex + onMeds
Number of individuals/clusters: 324
Total observations: 3240
Fixed-effects parameters:
────────────────────────────────────────────────────────
Estimate Std. Error Z Pr(>|Z|)
────────────────────────────────────────────────────────
β1: (Intercept) 13.2282 0.146459 90.32 <1e-99
β2: sex -3.29295 0.2101 -15.67 <1e-54
β3: onMeds 0.459585 0.0596002 7.71 <1e-13
τ1: (Intercept) 0.792508 0.0850728 9.32 <1e-19
τ2: sex -0.2865 0.0970732 -2.95 0.0032
τ3: onMeds 0.422303 0.063825 6.62 <1e-10
────────────────────────────────────────────────────────
Random effects covariance matrix Σγ:
"γ1: (Intercept)" 3.32057
For documentation of the trajgwas
function, type ?trajgwas
in Julia REPL.
TrajGWAS.trajgwas
— Functiontrajgwas(nullmeanformula, reformula, nullwsvarformula, idvar, covfile, geneticfile; kwargs...)
trajgwas(nullmeanformula, reformula, nullwsvarformula, idvar, df, geneticfile; kwargs...)
trajgwas(fittednullmodel, geneticfile; kwargs...)
trajgwas(fittednullmodel, bedfile, bimfile, bedn; kwargs...)
trajgwas(fittednullmodel, vcffile, nsamples, vcftype; kwargs...)
trajgwas(fittednullmodel, bgenfile, nsamples; kwargs...)
Positional arguments
nullmeanformula::FormulaTerm
: mean formula (β) for the null model.reformula::FormulaTerm
: random effects formula (γ) for the model.nullwsvarformula::FormulaTerm
: within-subject variance formula (τ) for the null model.idvar::Union{String, Symbol}
: id variable for groupings.covfile::AbstractString
: covariate file (csv) with one header line. One column should be the longitudinal phenotype.df::DataFrame
: DataFrame containing response and regressors for null model.geneticfile::Union{Nothing, AbstractString}
: File containing genetic information for GWAS. This includes a PLINK file name without the .bed, .fam, or .bim extensions or a VCF file without the .vcf extension. Ifgeneticfile==nothing
, only null model is fitted. Ifgeneticfile
is provided, bed, bim, and fam file (or vcf) with the samegeneticfile
prefix need to exist. Compressed file formats such as gz and bz2 are allowed. Check all allowed formats bySnpArrays.ALLOWED_FORMAT
. If you're using a VCF file, make sure to use thegeneticformat = "VCF"
keyword option, and specificy dosage (:DS) or genotype (:GT) data with thevcftype
command.fittednullmodel::StatsModels.TableRegressionModel
: the fitted null model output fromtrajgwas(nullformula, covfile)
ortrajgwas(nullformula, df)
. NOTE If the nullmodel is passed in with thebedfile, bimfile, bedn
orvcffile, nsamples, vcftype
arguments, the IDs/data in the null model must match the order of the IDs in the PLINK/VCF file.bedfile::Union{AbstractString,IOStream}
: path to Plink bed file with full file name.bimfile::Union{AbstractString,IOStream}
: path to Plink bim file with full file name.bedn::Integer
: number of samples in bed/vcf file.
Keyword arguments
analysistype
::AbstractString: Type of analysis to conduct. Default issinglesnp
. Other options aresnpset
andgxe
.geneticformat
::AbstractString: Type of file used for the genetic analysis."PLINK"
and"VCF"
are currently supported. Default is PLINK.vcftype
::Union{Symbol, Nothing}: Data to extract from the VCF file for the GWAS analysis.:DS
for dosage or:GT
for genotypes. Default is nothing.nullfile::Union{AbstractString, IOStream}
: output file for the fitted null model; default istrajgwas.null.txt
.pvalfile::Union{AbstractString, IOStream}
: output file for the gwas p-values; default istrajgwas.pval.txt
.covtype::Vector{DataType}
: type information forcovfile
. This is useful whenCSV.read(covarfile)
has parsing errors.covrowinds::Union{Nothing,AbstractVector{<:Integer}}
: sample indices for covariate file.testformula::FormulaTerm
: formula for test unit. Default is@formula(trait ~ 0 + snp)
.test::Symbol
::score
(default) or:wald
.disable_wsvar::Bool
: Disables test for within-sample variability for the Wald test. Default false.snpmodel
:ADDITIVE_MODEL
(default),DOMINANT_MODEL
, orRECESSIVE_MODEL
.snpinds::Union{Nothing,AbstractVector{<:Integer}}
: SNP indices for bed/vcf file.geneticrowinds::Union{Nothing,AbstractVector{<:Integer}}
: sample indices for bed/vcf file.samplepath::Union{Nothing, AbstractString}
: path for BGEN sample file if it's not encoded in the BGEN file.solver
: an optimizer supported by MathOptInterface. Default isIpopt.Optimizer()
.solver_config
: Dict of configuration keys and values for the optimizer. Default isDict("print_level" => 0, "mehrotra_algorithm" => "yes", "warm_start_init_point" => "yes", "max_iter" => 100)
.runs::Integer
: Number of weighted NLS runs for the null model; default is 2. Each run will use the newestm.τ
andm.Lγ
to update the weight matricesVi
and solve the new weighted NLS.parallel::Bool
: Multi-threading or not for fitting the null model. Default isfalse
.verbose::Bool
: default isfalse
.snpset::Union{Nothing, Integer, AbstractString, AbstractVector{<:Integer}}
: Only include if you are conducting a snpset analysis. An integer indicates a window of SNPs (i.e. every 500 snps). An abstract string allows you to specify an input file, with no header and two columns separated by a space. The first column must contain the snpset ID and the second column must contain the snpid's identical to the bimfile. An AbstractVector allows you to specify the snps you want to perform one joint snpset test for.e::Union{AbstractString,Symbol}
: Only include if you are conducting a GxE analysis. Enviromental variable to be used to test the GxE interaction. For instance, for testingsex & snp
interaction, use:sex
or"sex"
.
Examples
The following is an example of basic GWAS with PLINK files:
plinkfile = "plinkexample"
covfile = "covexample"
trajgwas(@formula(trait ~ sex), covfile, plkfile)
The following is an example of basic GWAS with a VCF file using dosages then genotypes:
vcffile = "vcfexample"
covfile = "covexample"
trajgwas(@formula(trait ~ sex), covfile, vcffile;
geneticfile = "VCF", vcftype = :DS)
trajgwas(@formula(trait ~ sex), covfile, vcffile;
geneticfile = "VCF", vcftype = :GT)
The following is an example of snpset GWAS (every 50 snps). For more types of snpset analyses see documentation:
trajgwas(@formula(trait ~ sex), covfile, plkfile;
analysistype = "snpset", snpset = 50)
The following is an example of GxE GWAS testing the interaction effect:
trajgwas(@formula(trait ~ sex), covfile, plkfile;
analysistype = "gxe", e = :sex)
Formula for null model
The first three arguments specify the null model without SNP effects. The fourth argument specifies the grouping variable (subject ID, cluster, etc.).
- The first term is the mean fixed effects formula e.g.,
@formula(y ~ 1 + sex + onMeds)
. - The second argument is the random (location) effects –
@formula(y ~ 1)
specifies a random intercept, whereas@formula(y ~ 1 + time)
would specify a random intercept and slope (time). - The third argument specifies the within-subject variance (WSV) formula
@formula(y ~ 1 + sex + onMeds)
models the intra-individual variability ofy
with an intercept, sex, and medication. Note: The WSV formula must have an intercept. - The fourth argument specifies the grouping variable of the longitudinal data. In the example data, the
id
variable specifies the grouping variable the repeated measures are collected on.
Input files
trajgwas
expects two input files: one for responses plus covariates (fifth argument) in long format, the other the genetic file(s) for dosages/genotypes (sixth argument).
Covariate and trait file
Covariates and phenotype are provided in a csv file, e.g., trajgwas_plinkex.csv
, which has one header line for variable names. In this example, variable y
is continuous variable with repeated measures on individuals specified by id
. We want to include variable sex
and onMeds
as covariates for GWAS.
run(`head $(datadir)trajgwas_plinkex.csv`);
sex,onMeds,snp1,snp2,snp3,snp4,y,id
0.0,1.0,0.0,1.0,2.0,0.0,12.26667411332518,A1
0.0,0.0,0.0,1.0,2.0,0.0,10.268123812744903,A1
0.0,0.0,0.0,1.0,2.0,0.0,12.165997408822557,A1
0.0,0.0,0.0,1.0,2.0,0.0,11.879709602937222,A1
0.0,0.0,0.0,1.0,2.0,0.0,12.812705165990007,A1
0.0,0.0,0.0,1.0,2.0,0.0,9.987659617201372,A1
0.0,0.0,0.0,1.0,2.0,0.0,12.140779426464974,A1
0.0,1.0,0.0,1.0,2.0,0.0,13.205778146177705,A1
0.0,0.0,0.0,1.0,2.0,0.0,11.363145919060207,A1
Genetic file(s)
TrajGWAS supports PLINK files, VCF files, and BGEN files.
Genotype data is available as binary PLINK files.
TrajGWAS can use dosage or genotype data from VCF files.
TrajGWAS uses dosage data from BGEN files.
By default, TrajGWAS assumes a set of PLINK files will be used. When using a VCF File, VCF file and type of data (dosage, genotype) must be specified by the geneticformat
and vcftype
options (as shown later). Similarly, BGEN must be specified as the geneticformat
if using a BGEN file.
readdir(glob"hapmap3.*", datadir)
3-element Vector{String}:
"/Users/xyz/.julia/dev/TrajGWAS/data/hapmap3.bed"
"/Users/xyz/.julia/dev/TrajGWAS/data/hapmap3.bim"
"/Users/xyz/.julia/dev/TrajGWAS/data/hapmap3.fam"
In this example, there are 324 samples at 13,928 SNPs.
size(SnpArray(datadir * "hapmap3.bed"))
(324, 13928)
Compressed PLINK and VCF files are supported. For example, if Plink files are hapmap3.bed.gz
, hapmap3.bim.gz
and hapmap3.fam.gz
, the same command
trajgwas(@formula(y ~ 1 + sex + onMeds),
@formula(y ~ 1),
@formula(y ~ 1 + sex + onMeds),
:id,
datadir * "trajgwas_plinkex.csv",
datadir * "hapmap3",
pvalfile = pvalpath,
nullfile = nullpath)
still works. Check all supported compression format by
SnpArrays.ALLOWED_FORMAT
6-element Vector{String}:
"gz"
"zlib"
"zz"
"xz"
"zst"
"bz2"
Output files
trajgwas
outputs two files: trajgwas.null.txt
and trajgwas.pval.txt
.
trajgwas.null.txt
lists the estimated null model (without SNPs).
run(`cat trajgwas.null.txt`);
Within-subject variance estimation by robust regression (WiSER)
Mean Formula:
y ~ 1 + sex + onMeds
Random Effects Formula:
y ~ 1
Within-Subject Variance Formula:
y ~ 1 + sex + onMeds
Number of individuals/clusters: 324
Total observations: 3240
Fixed-effects parameters:
────────────────────────────────────────────────────────
Estimate Std. Error Z Pr(>|Z|)
────────────────────────────────────────────────────────
β1: (Intercept) 13.2282 0.146459 90.32 <1e-99
β2: sex -3.29295 0.2101 -15.67 <1e-54
β3: onMeds 0.459585 0.0596002 7.71 <1e-13
τ1: (Intercept) 0.792508 0.0850728 9.32 <1e-19
τ2: sex -0.2865 0.0970732 -2.95 0.0032
τ3: onMeds 0.422303 0.063825 6.62 <1e-10
────────────────────────────────────────────────────────
Random effects covariance matrix Σγ:
"γ1: (Intercept)" 3.32057
trajgwas.pval.txt
tallies the SNPs, their pvalues, and relevant information on each SNP.betapval
represents the p-value of the SNP's effect on the mean of the trait. Ifspa=true
(default), then this is the SPA p-value. Ifspa=false
, then this is the score test beta p-value.taupval
represents the p-value of the SNP's effect on the within-subject variability of the trait. Ifspa=true
(default), then this is the SPA p-value. Ifspa=false
, then this is the score test tau p-value.jointpval
represents a joint p-value of the SNP's effect on both the mean and variance. By defaultspa=true
this is the harmonic mean of the saddlepoint approximated p-values for beta and tau. Ifspa=false
, this is the joint score test p-value.
pretty_table(first(CSV.read("trajgwas.pval.txt", DataFrame), 8), tf = tf_markdown)
|[1m chr [0m|[1m pos [0m|[1m snpid [0m|[1m maf [0m|[1m hwepval [0m|[1m betapval [0m|[1m betadir [0m|[1m taupval [0m|[1m taudir [0m|[1m jointpval [0m|
|[90m Int64 [0m|[90m Int64 [0m|[90m String31 [0m|[90m Float64 [0m|[90m Float64 [0m|[90m Float64 [0m|[90m Int64 [0m|[90m Float64 [0m|[90m Int64 [0m|[90m Float64 [0m|
|-------|---------|------------|------------|-------------|-------------|---------|-------------|--------|-------------|
| 1 | 554484 | rs10458597 | 0.0 | 1.0 | 1.0 | 0 | 1.0 | 0 | 1.0 |
| 1 | 758311 | rs12562034 | 0.0776398 | 0.409876 | 0.38939 | -1 | 0.922822 | -1 | 0.547682 |
| 1 | 967643 | rs2710875 | 0.324074 | 4.07625e-7 | 5.83856e-6 | 1 | 3.35408e-6 | -1 | 4.93598e-7 |
| 1 | 1168108 | rs11260566 | 0.191589 | 0.128568 | 0.000850889 | 1 | 0.0559055 | 1 | 0.00101827 |
| 1 | 1375074 | rs1312568 | 0.441358 | 2.5376e-19 | 0.000171683 | -1 | 1.84811e-13 | 1 | 2.07091e-15 |
| 1 | 1588771 | rs35154105 | 0.0 | 1.0 | 1.0 | 0 | 1.0 | 0 | 1.0 |
| 1 | 1789051 | rs16824508 | 0.00462963 | 0.933278 | 0.295035 | 1 | 0.304109 | 1 | 0.299504 |
| 1 | 1990452 | rs2678939 | 0.453704 | 5.07696e-11 | 1.27871e-7 | 1 | 7.73809e-9 | -1 | 3.3986e-9 |
Output file names can be changed by the nullfile
and pvalfile
keywords respectively. For example,
trajgwas(@formula(y ~ 1 + sex + onMeds),
@formula(y ~ 1),
@formula(y ~ 1 + sex + onMeds),
:id,
datadir * "trajgwas_plinkex.csv",
datadir * "hapmap3",
pvalfile = "trajgwas.pval.txt.gz",
nullfile = nullpath)
will output the p-value file in compressed gz format.
Subsamples
Use the keyword covrowinds
to specify selected samples in the covarite file. Use the keyword geneticrowinds
to specify selected samples in the Plink (.bed), VCF, or BGEN file. For example, to use the first 300 samples in both covariate and bed file:
trajgwas(@formula(trait ~ 1+ sex),
@formula(trait ~ 1),
@formula(trait ~ 1 + sex),
:id,
covfile, geneticfile,
covrowinds=1:300, geneticrowinds=1:300)
Users should make sure that the selected samples in covariate file match exactly those in bed file, otherwise an error will be displayed.
Input non-genetic data as DataFrame
Internally trajgwas
parses the covariate file as a DataFrame by df = CSV.read(covfile, DataFrame)
. For covariate file of other formats, or for specifying default levels of categorical variables, users can parse their datafile as a DataFrame and then input the DataFrame to trajgwas
directly.
trajgwas(@formula(trait ~ 1+ sex),
@formula(trait ~ 1),
@formula(trait ~ 1 + sex),
:id,
df, geneticfile)
Users should always make sure that individuals in covariate file or DataFrame match those in Plink fam file/VCF File/BGEN file.
For example, following code checks that the subject ID of the covariate.txt
file match that of the hapmap3.fam
file exactly.
covdf = CSV.read(datadir * "covariate.txt", DataFrame)
plkfam = CSV.read(datadir * "hapmap3.fam", DataFrame, header=0, delim=' ')
all(covdf[!, 2] .== plkfam[!, 2])
true
pretty_table(first(covdf, 8), tf = tf_markdown)
|[1m famid [0m|[1m perid [0m|[1m faid [0m|[1m moid [0m|[1m sex [0m|[1m trait [0m|
|[90m String7 [0m|[90m String7 [0m|[90m Int64 [0m|[90m Int64 [0m|[90m Int64 [0m|[90m Int64 [0m|
|---------|---------|-------|-------|-------|-------|
| 2431 | NA19916 | 0 | 0 | 1 | 4 |
| 2424 | NA19835 | 0 | 0 | 2 | 4 |
| 2469 | NA20282 | 0 | 0 | 2 | 4 |
| 2368 | NA19703 | 0 | 0 | 1 | 3 |
| 2425 | NA19901 | 0 | 0 | 2 | 3 |
| 2427 | NA19908 | 0 | 0 | 1 | 4 |
| 2430 | NA19914 | 0 | 0 | 2 | 4 |
| 2470 | NA20287 | 0 | 0 | 2 | 1 |
pretty_table(first(plkfam, 8), tf = tf_markdown)
|[1m Column1 [0m|[1m Column2 [0m|[1m Column3 [0m|[1m Column4 [0m|[1m Column5 [0m|[1m Column6 [0m|
|[90m String7 [0m|[90m String7 [0m|[90m Int64 [0m|[90m Int64 [0m|[90m Int64 [0m|[90m Int64 [0m|
|---------|---------|---------|---------|---------|---------|
| A1 | NA19916 | 0 | 0 | 1 | -9 |
| 2 | NA19835 | 0 | 0 | 2 | -9 |
| 3 | NA20282 | 0 | 0 | 2 | -9 |
| 4 | NA19703 | 0 | 0 | 1 | -9 |
| 5 | NA19901 | 0 | 0 | 2 | -9 |
| 6 | NA19908 | 0 | 0 | 1 | -9 |
| 7 | NA19914 | 0 | 0 | 2 | -9 |
| 8 | NA20287 | 0 | 0 | 2 | -9 |
Timing
For this moderate-sized data set, trajgwas
takes around 1 second without applying SPA. With SPA, it takes a little longer.
@btime(trajgwas(@formula(y ~ 1 + sex + onMeds),
@formula(y ~ 1),
@formula(y ~ 1 + sex + onMeds),
:id,
datadir * "trajgwas_plinkex.csv",
datadir * "hapmap3",
pvalfile = pvalpath,
nullfile = nullpath,
usespa = false));
run = 1, ‖Δβ‖ = 0.037090, ‖Δτ‖ = 0.136339, ‖ΔL‖ = 0.005441, status = LOCALLY_SOLVED, time(s) = 0.043075
run = 2, ‖Δβ‖ = 0.000913, ‖Δτ‖ = 0.019810, ‖ΔL‖ = 0.001582, status = LOCALLY_SOLVED, time(s) = 0.038360
run = 1, ‖Δβ‖ = 0.037090, ‖Δτ‖ = 0.136339, ‖ΔL‖ = 0.005441, status = LOCALLY_SOLVED, time(s) = 0.042863
run = 2, ‖Δβ‖ = 0.000913, ‖Δτ‖ = 0.019810, ‖ΔL‖ = 0.001582, status = LOCALLY_SOLVED, time(s) = 0.038349
run = 1, ‖Δβ‖ = 0.037090, ‖Δτ‖ = 0.136339, ‖ΔL‖ = 0.005441, status = LOCALLY_SOLVED, time(s) = 0.046263
run = 2, ‖Δβ‖ = 0.000913, ‖Δτ‖ = 0.019810, ‖ΔL‖ = 0.001582, status = LOCALLY_SOLVED, time(s) = 0.041134
run = 1, ‖Δβ‖ = 0.037090, ‖Δτ‖ = 0.136339, ‖ΔL‖ = 0.005441, status = LOCALLY_SOLVED, time(s) = 0.044395
run = 2, ‖Δβ‖ = 0.000913, ‖Δτ‖ = 0.019810, ‖ΔL‖ = 0.001582, status = LOCALLY_SOLVED, time(s) = 0.040617
run = 1, ‖Δβ‖ = 0.037090, ‖Δτ‖ = 0.136339, ‖ΔL‖ = 0.005441, status = LOCALLY_SOLVED, time(s) = 0.046369
run = 2, ‖Δβ‖ = 0.000913, ‖Δτ‖ = 0.019810, ‖ΔL‖ = 0.001582, status = LOCALLY_SOLVED, time(s) = 0.040637
run = 1, ‖Δβ‖ = 0.037090, ‖Δτ‖ = 0.136339, ‖ΔL‖ = 0.005441, status = LOCALLY_SOLVED, time(s) = 0.047622
run = 2, ‖Δβ‖ = 0.000913, ‖Δτ‖ = 0.019810, ‖ΔL‖ = 0.001582, status = LOCALLY_SOLVED, time(s) = 0.042389
run = 1, ‖Δβ‖ = 0.037090, ‖Δτ‖ = 0.136339, ‖ΔL‖ = 0.005441, status = LOCALLY_SOLVED, time(s) = 0.048733
run = 2, ‖Δβ‖ = 0.000913, ‖Δτ‖ = 0.019810, ‖ΔL‖ = 0.001582, status = LOCALLY_SOLVED, time(s) = 0.042363
run = 1, ‖Δβ‖ = 0.037090, ‖Δτ‖ = 0.136339, ‖ΔL‖ = 0.005441, status = LOCALLY_SOLVED, time(s) = 0.057969
run = 2, ‖Δβ‖ = 0.000913, ‖Δτ‖ = 0.019810, ‖ΔL‖ = 0.001582, status = LOCALLY_SOLVED, time(s) = 0.044363
run = 1, ‖Δβ‖ = 0.037090, ‖Δτ‖ = 0.136339, ‖ΔL‖ = 0.005441, status = LOCALLY_SOLVED, time(s) = 0.046105
run = 2, ‖Δβ‖ = 0.000913, ‖Δτ‖ = 0.019810, ‖ΔL‖ = 0.001582, status = LOCALLY_SOLVED, time(s) = 0.041188
run = 1, ‖Δβ‖ = 0.037090, ‖Δτ‖ = 0.136339, ‖ΔL‖ = 0.005441, status = LOCALLY_SOLVED, time(s) = 0.046367
run = 2, ‖Δβ‖ = 0.000913, ‖Δτ‖ = 0.019810, ‖ΔL‖ = 0.001582, status = LOCALLY_SOLVED, time(s) = 0.040956
run = 1, ‖Δβ‖ = 0.037090, ‖Δτ‖ = 0.136339, ‖ΔL‖ = 0.005441, status = LOCALLY_SOLVED, time(s) = 0.045681
run = 2, ‖Δβ‖ = 0.000913, ‖Δτ‖ = 0.019810, ‖ΔL‖ = 0.001582, status = LOCALLY_SOLVED, time(s) = 0.040467
run = 1, ‖Δβ‖ = 0.037090, ‖Δτ‖ = 0.136339, ‖ΔL‖ = 0.005441, status = LOCALLY_SOLVED, time(s) = 0.045576
run = 2, ‖Δβ‖ = 0.000913, ‖Δτ‖ = 0.019810, ‖ΔL‖ = 0.001582, status = LOCALLY_SOLVED, time(s) = 0.040579
run = 1, ‖Δβ‖ = 0.037090, ‖Δτ‖ = 0.136339, ‖ΔL‖ = 0.005441, status = LOCALLY_SOLVED, time(s) = 0.043948
run = 2, ‖Δβ‖ = 0.000913, ‖Δτ‖ = 0.019810, ‖ΔL‖ = 0.001582, status = LOCALLY_SOLVED, time(s) = 0.039300
run = 1, ‖Δβ‖ = 0.037090, ‖Δτ‖ = 0.136339, ‖ΔL‖ = 0.005441, status = LOCALLY_SOLVED, time(s) = 0.045900
run = 2, ‖Δβ‖ = 0.000913, ‖Δτ‖ = 0.019810, ‖ΔL‖ = 0.001582, status = LOCALLY_SOLVED, time(s) = 0.038610
run = 1, ‖Δβ‖ = 0.037090, ‖Δτ‖ = 0.136339, ‖ΔL‖ = 0.005441, status = LOCALLY_SOLVED, time(s) = 0.044771
run = 2, ‖Δβ‖ = 0.000913, ‖Δτ‖ = 0.019810, ‖ΔL‖ = 0.001582, status = LOCALLY_SOLVED, time(s) = 0.040101
run = 1, ‖Δβ‖ = 0.037090, ‖Δτ‖ = 0.136339, ‖ΔL‖ = 0.005441, status = LOCALLY_SOLVED, time(s) = 0.047024
run = 2, ‖Δβ‖ = 0.000913, ‖Δτ‖ = 0.019810, ‖ΔL‖ = 0.001582, status = LOCALLY_SOLVED, time(s) = 0.039047
run = 1, ‖Δβ‖ = 0.037090, ‖Δτ‖ = 0.136339, ‖ΔL‖ = 0.005441, status = LOCALLY_SOLVED, time(s) = 0.043731
run = 2, ‖Δβ‖ = 0.000913, ‖Δτ‖ = 0.019810, ‖ΔL‖ = 0.001582, status = LOCALLY_SOLVED, time(s) = 0.039392
run = 1, ‖Δβ‖ = 0.037090, ‖Δτ‖ = 0.136339, ‖ΔL‖ = 0.005441, status = LOCALLY_SOLVED, time(s) = 0.043643
run = 2, ‖Δβ‖ = 0.000913, ‖Δτ‖ = 0.019810, ‖ΔL‖ = 0.001582, status = LOCALLY_SOLVED, time(s) = 0.038403
run = 1, ‖Δβ‖ = 0.037090, ‖Δτ‖ = 0.136339, ‖ΔL‖ = 0.005441, status = LOCALLY_SOLVED, time(s) = 0.042840
run = 2, ‖Δβ‖ = 0.000913, ‖Δτ‖ = 0.019810, ‖ΔL‖ = 0.001582, status = LOCALLY_SOLVED, time(s) = 0.038771
run = 1, ‖Δβ‖ = 0.037090, ‖Δτ‖ = 0.136339, ‖ΔL‖ = 0.005441, status = LOCALLY_SOLVED, time(s) = 0.044481
run = 2, ‖Δβ‖ = 0.000913, ‖Δτ‖ = 0.019810, ‖ΔL‖ = 0.001582, status = LOCALLY_SOLVED, time(s) = 0.040184
run = 1, ‖Δβ‖ = 0.037090, ‖Δτ‖ = 0.136339, ‖ΔL‖ = 0.005441, status = LOCALLY_SOLVED, time(s) = 0.045463
run = 2, ‖Δβ‖ = 0.000913, ‖Δτ‖ = 0.019810, ‖ΔL‖ = 0.001582, status = LOCALLY_SOLVED, time(s) = 0.042343
run = 1, ‖Δβ‖ = 0.037090, ‖Δτ‖ = 0.136339, ‖ΔL‖ = 0.005441, status = LOCALLY_SOLVED, time(s) = 0.046744
run = 2, ‖Δβ‖ = 0.000913, ‖Δτ‖ = 0.019810, ‖ΔL‖ = 0.001582, status = LOCALLY_SOLVED, time(s) = 0.041665
run = 1, ‖Δβ‖ = 0.037090, ‖Δτ‖ = 0.136339, ‖ΔL‖ = 0.005441, status = LOCALLY_SOLVED, time(s) = 0.043388
run = 2, ‖Δβ‖ = 0.000913, ‖Δτ‖ = 0.019810, ‖ΔL‖ = 0.001582, status = LOCALLY_SOLVED, time(s) = 0.039299
run = 1, ‖Δβ‖ = 0.037090, ‖Δτ‖ = 0.136339, ‖ΔL‖ = 0.005441, status = LOCALLY_SOLVED, time(s) = 0.046229
run = 2, ‖Δβ‖ = 0.000913, ‖Δτ‖ = 0.019810, ‖ΔL‖ = 0.001582, status = LOCALLY_SOLVED, time(s) = 0.041051
run = 1, ‖Δβ‖ = 0.037090, ‖Δτ‖ = 0.136339, ‖ΔL‖ = 0.005441, status = LOCALLY_SOLVED, time(s) = 0.046147
run = 2, ‖Δβ‖ = 0.000913, ‖Δτ‖ = 0.019810, ‖ΔL‖ = 0.001582, status = LOCALLY_SOLVED, time(s) = 0.042524
run = 1, ‖Δβ‖ = 0.037090, ‖Δτ‖ = 0.136339, ‖ΔL‖ = 0.005441, status = LOCALLY_SOLVED, time(s) = 0.047299
run = 2, ‖Δβ‖ = 0.000913, ‖Δτ‖ = 0.019810, ‖ΔL‖ = 0.001582, status = LOCALLY_SOLVED, time(s) = 0.040978
run = 1, ‖Δβ‖ = 0.037090, ‖Δτ‖ = 0.136339, ‖ΔL‖ = 0.005441, status = LOCALLY_SOLVED, time(s) = 0.043347
run = 2, ‖Δβ‖ = 0.000913, ‖Δτ‖ = 0.019810, ‖ΔL‖ = 0.001582, status = LOCALLY_SOLVED, time(s) = 0.050493
468.501 ms (1451269 allocations: 127.54 MiB)
# clean up
rm("trajgwas.null.txt", force=true)
rm("trajgwas.pval.txt", force=true)
VCF Formatted Files
By default, TrajGWAS.jl will assume you are using PLINK files. It also supports VCF (and BGEN) Files. vcf.gz
files are supported as well. To use vcf files in any of the analysis options detailed in this documentation, you simply need to add two keyword options to the trajgwas
function:
geneticformat
: Choices are "VCF" or "PLINK". If you are using a VCF file, usegeneticformat = "VCF"
.vcftype
: Choices are :GT (for genotypes) or :DS (for dosages). This tells TrajGWAS which type of data to extract from the VCF file.
Using a VCF File does not output minor allele frequency or hardy weinberg equillibrium p-values for each SNP tested since they may be dosages.
The following shows how to run an analysis with a VCF file using the dosage information.
trajgwas(@formula(y ~ 1 + sex + onMeds),
@formula(y ~ 1),
@formula(y ~ 1 + sex + onMeds),
:id,
datadir * "trajgwas_vcfex.csv",
datadir * "test_vcf",
pvalfile = pvalpath,
geneticformat = "VCF",
vcftype = :DS)
run = 1, ‖Δβ‖ = 0.003459, ‖Δτ‖ = 0.421747, ‖ΔL‖ = 0.002492, status = LOCALLY_SOLVED, time(s) = 0.020269
run = 2, ‖Δβ‖ = 0.001369, ‖Δτ‖ = 0.015998, ‖ΔL‖ = 0.003244, status = LOCALLY_SOLVED, time(s) = 0.016583
Within-subject variance estimation by robust regression (WiSER)
Mean Formula:
y ~ 1 + sex + onMeds
Random Effects Formula:
y ~ 1
Within-Subject Variance Formula:
y ~ 1 + sex + onMeds
Number of individuals/clusters: 191
Total observations: 1910
Fixed-effects parameters:
────────────────────────────────────────────────────────
Estimate Std. Error Z Pr(>|Z|)
────────────────────────────────────────────────────────
β1: (Intercept) 12.1708 0.18351 66.32 <1e-99
β2: sex -3.41012 0.253155 -13.47 <1e-40
β3: onMeds 0.485447 0.0701423 6.92 <1e-11
τ1: (Intercept) 0.774299 0.0849867 9.11 <1e-19
τ2: sex -0.524117 0.106719 -4.91 <1e-06
τ3: onMeds 0.633337 0.0814974 7.77 <1e-14
────────────────────────────────────────────────────────
Random effects covariance matrix Σγ:
"γ1: (Intercept)" 2.86452
pretty_table(first(CSV.read("trajgwas.pval.txt", DataFrame), 8), tf = tf_markdown)
|[1m chr [0m|[1m pos [0m|[1m snpid [0m|[1m betapval [0m|[1m betadir [0m|[1m taupval [0m|[1m taudir [0m|[1m jointpval [0m|
|[90m Int64 [0m|[90m Int64 [0m|[90m String15 [0m|[90m Float64 [0m|[90m Int64 [0m|[90m Float64 [0m|[90m Int64 [0m|[90m Float64 [0m|
|-------|----------|-------------|-----------|---------|------------|--------|------------|
| 22 | 20000086 | rs138720731 | 0.322684 | -1 | 0.61956 | -1 | 0.424353 |
| 22 | 20000146 | rs73387790 | 1.0 | 0 | 1.0 | 0 | 1.0 |
| 22 | 20000199 | rs183293480 | 0.214207 | 1 | 0.190785 | 1 | 0.201819 |
| 22 | 20000291 | rs185807825 | 0.292231 | 1 | 0.200029 | 1 | 0.237495 |
| 22 | 20000428 | rs55902548 | 0.0140114 | 1 | 0.00369005 | 1 | 0.00576184 |
| 22 | 20000683 | rs142720028 | 1.0 | 0 | 1.0 | 0 | 1.0 |
| 22 | 20000771 | rs114690707 | 0.536406 | -1 | 0.145113 | 1 | 0.22843 |
| 22 | 20000793 | rs189842693 | 0.161414 | -1 | 0.757017 | 1 | 0.266091 |
BGEN Formatted Files
By default, TrajGWAS.jl will assume you are using PLINK files. It also supports BGEN Files. To use BGEN files in any of the analysis options detailed in this documentation, you simply need to add the following keyword option to the trajgwas
function:
geneticformat
: Choices are "VCF" or "PLINK" or "BGEN". If you are using a BGEN file, usegeneticformat = "BGEN"
.
Using a BGEN File does not output minor allele frequency or hardy weinberg equillibrium p-values for each SNP tested.
Some features, such as SNP-set analyses, are only available when there's an indexing .bgi
file available. Additionally, if the BGEN file does not contain sample information (i.e. sample IDs), then a sample file is necessary and its path can be specified with the samplepath
keyword.
BGEN files can contain an optional index file (.bgi
file) that allows the variants to be specified in order of position. TrajGWAS will automatically look for a file in the same directory as the BGENFILENAME
with the name BGENFILENAME.bgi
. The BGEN file is read either in the .bgi
file order if BGENFILENAME.bgi
is supplied in the same directory as the BGEN file, otherwise it will use the order in the BGEN file. This is important in analyses specifying snpinds
as well as annotation groupings. You must make sure this matches the way the BGEN file will be read in.
The following shows how to run an analysis with a BGEN file using the dosage information.
trajgwas(@formula(y ~ 1 + sex + onMeds),
@formula(y ~ 1),
@formula(y ~ 1 + sex + onMeds),
:id,
datadir * "trajgwas_bgen_ex.csv",
datadir * "example.8bits",
geneticformat = "BGEN",
pvalfile = pvalpath)
run = 1, ‖Δβ‖ = 0.096578, ‖Δτ‖ = 0.162563, ‖ΔL‖ = 0.007625, status = LOCALLY_SOLVED, time(s) = 0.046532
run = 2, ‖Δβ‖ = 0.003173, ‖Δτ‖ = 0.005584, ‖ΔL‖ = 0.001468, status = LOCALLY_SOLVED, time(s) = 0.038319
Within-subject variance estimation by robust regression (WiSER)
Mean Formula:
y ~ 1 + sex + onMeds
Random Effects Formula:
y ~ 1
Within-Subject Variance Formula:
y ~ 1 + sex + onMeds
Number of individuals/clusters: 500
Total observations: 5000
Fixed-effects parameters:
────────────────────────────────────────────────────────
Estimate Std. Error Z Pr(>|Z|)
────────────────────────────────────────────────────────
β1: (Intercept) 10.5275 0.0992675 106.05 <1e-99
β2: sex -3.38067 0.146416 -23.09 <1e-99
β3: onMeds 0.522162 0.0358727 14.56 <1e-47
τ1: (Intercept) 0.294364 0.0443191 6.64 <1e-10
τ2: sex -0.365009 0.0503589 -7.25 <1e-12
τ3: onMeds 0.559769 0.0467046 11.99 <1e-32
────────────────────────────────────────────────────────
Random effects covariance matrix Σγ:
"γ1: (Intercept)" 2.51561
pretty_table(first(CSV.read("trajgwas.pval.txt", DataFrame), 8), tf = tf_markdown)
|[1m chr [0m|[1m pos [0m|[1m snpid [0m|[1m varid [0m|[1m hwepval [0m|[1m maf [0m|[1m infoscore [0m|[1m betapval [0m|[1m betadir [0m|[1m taupval [0m|[1m taudir [0m|[1m jointpval [0m|
|[90m Int64 [0m|[90m Int64 [0m|[90m String15 [0m|[90m String15 [0m|[90m Float64 [0m|[90m Float64 [0m|[90m Float64 [0m|[90m Float64 [0m|[90m Int64 [0m|[90m Float64 [0m|[90m Int64 [0m|[90m Float64 [0m|
|-------|-------|----------|-----------|----------|----------|-----------|-------------|---------|-------------|--------|-------------|
| 1 | 1001 | RSID_101 | SNPID_101 | 0.86274 | 0.416977 | 0.984639 | 0.510629 | -1 | 0.425112 | -1 | 0.463963 |
| 1 | 2000 | RSID_2 | SNPID_2 | 0.192181 | 0.19751 | 9.0 | 4.5062e-10 | 1 | 1.40173e-21 | -1 | 6.67473e-21 |
| 1 | 2001 | RSID_102 | SNPID_102 | 0.1844 | 0.197667 | 0.727308 | 4.47847e-10 | -1 | 1.79121e-21 | 1 | 8.45783e-21 |
| 1 | 3000 | RSID_3 | SNPID_3 | 0.965354 | 0.483396 | 0.955355 | 0.0721822 | -1 | 0.600949 | -1 | 0.128884 |
| 1 | 3001 | RSID_103 | SNPID_103 | 0.965354 | 0.483396 | 0.955355 | 0.0721821 | 1 | 0.600949 | 1 | 0.128884 |
| 1 | 4000 | RSID_4 | SNPID_4 | 0.371927 | 0.21671 | 0.991768 | 0.112472 | -1 | 0.133752 | -1 | 0.122192 |
| 1 | 4001 | RSID_104 | SNPID_104 | 0.371928 | 0.21671 | 0.991768 | 0.112471 | 1 | 0.133752 | 1 | 0.122192 |
| 1 | 5000 | RSID_5 | SNPID_5 | 0.587013 | 0.388082 | 0.968258 | 0.526325 | -1 | 0.542332 | -1 | 0.534209 |
SNP models
Genotypes are translated into numeric values according to different genetic model, which is specified by the snpmodel
keyword. Default is ADDITIVE_MODEL
.
Genotype | SnpArray | ADDITIVE_MODEL | DOMINANT_MODEL | RECESSIVE_MODEL |
---|---|---|---|---|
A1,A1 | 0x00 | 0 | 0 | 0 |
missing | 0x01 | NaN | NaN | NaN |
A1,A2 | 0x02 | 1 | 1 | 0 |
A2,A2 | 0x03 | 2 | 1 | 1 |
trajgwas
imputes missing genotypes according to minor allele frequencies.
Users are advised to impute genotypes using more sophiscated methods before GWAS.
SNP and/or sample masks
In practice, we often perform GWAS on selected SNPs and/or selected samples. They can be specified by the snpinds
, covrowinds
and geneticrowinds
keywords of trajgwas
function.
For example, to perform GWAS on SNPs with minor allele frequency (MAF) above 0.05
# create SNP mask
snpinds = maf(SnpArray("../data/hapmap3.bed")) .≥ 0.05
# GWAS on selected SNPs
@time trajgwas(@formula(y ~ 1 + sex + onMeds),
@formula(y ~ 1),
@formula(y ~ 1 + sex + onMeds),
:id,
datadir * "trajgwas_plinkex.csv",
datadir * "hapmap3",
pvalfile = "commonvariant.pval.txt",
snpinds = snpinds)
run = 1, ‖Δβ‖ = 0.037090, ‖Δτ‖ = 0.136339, ‖ΔL‖ = 0.005441, status = LOCALLY_SOLVED, time(s) = 0.049849
run = 2, ‖Δβ‖ = 0.000913, ‖Δτ‖ = 0.019810, ‖ΔL‖ = 0.001582, status = LOCALLY_SOLVED, time(s) = 0.053331
3.115314 seconds (7.98 M allocations: 489.049 MiB, 3.92% gc time, 65.78% compilation time)
Within-subject variance estimation by robust regression (WiSER)
Mean Formula:
y ~ 1 + sex + onMeds
Random Effects Formula:
y ~ 1
Within-Subject Variance Formula:
y ~ 1 + sex + onMeds
Number of individuals/clusters: 324
Total observations: 3240
Fixed-effects parameters:
────────────────────────────────────────────────────────
Estimate Std. Error Z Pr(>|Z|)
────────────────────────────────────────────────────────
β1: (Intercept) 13.2282 0.146459 90.32 <1e-99
β2: sex -3.29295 0.2101 -15.67 <1e-54
β3: onMeds 0.459585 0.0596002 7.71 <1e-13
τ1: (Intercept) 0.792508 0.0850728 9.32 <1e-19
τ2: sex -0.2865 0.0970732 -2.95 0.0032
τ3: onMeds 0.422303 0.063825 6.62 <1e-10
────────────────────────────────────────────────────────
Random effects covariance matrix Σγ:
"γ1: (Intercept)" 3.32057
run(`head commonvariant.pval.txt`);
chr pos snpid maf hwepval betapval betadir taupval taudir jointpval
1 758311 rs12562034 0.07763975155279501 0.4098763332666681 0.3893898715952268 -1 0.9228220976232073 -1 0.5476822137398478
1 967643 rs2710875 0.32407407407407407 4.076249100705747e-7 5.838562452864651e-6 1 3.354075049151666e-6 -1 4.935983727867445e-7
1 1168108 rs11260566 0.19158878504672894 0.1285682279446898 0.0008508894259509042 1 0.05590549977468929 1 0.0010182674469333282
1 1375074 rs1312568 0.441358024691358 2.5376019650614977e-19 0.00017168324418138014 -1 1.8481146312336014e-13 1 2.0709090755361924e-15
1 1990452 rs2678939 0.4537037037037037 5.07695957708431e-11 1.278709529589261e-7 1 7.7380891475822e-9 -1 3.3986012947082272e-9
1 2194615 rs7553178 0.22685185185185186 0.17056143157457776 0.07404001756940115 -1 0.00023109048716816823 1 0.00010021413726889922
1 2396747 rs13376356 0.1448598130841121 0.9053079215078139 0.48972088680069503 -1 0.6595593862223496 1 0.5620909278620404
1 2823603 rs1563468 0.4830246913580247 4.23065537243926e-9 2.581476982676523e-7 -1 1.1700736777037956e-8 1 6.36797979359385e-8
1 3025087 rs6690373 0.2538699690402477 9.238641887192776e-8 3.111965106232366e-5 1 2.0479935907554532e-7 -1 2.123094937745341e-5
# extra header line in commonvariant.pval.txt
countlines("commonvariant.pval.txt"), count(snpinds)
(12086, 12085)
# clean up
rm("trajgwas.null.txt", force=true)
rm("commonvariant.pval.txt", force=true)
covrowinds
specify the samples in the covariate file and geneticrowinds
for PLINK or VCF File. User should be particularly careful when using these two keyword. Selected rows in SnpArray should exactly match the samples in the null model. Otherwise the results are meaningless.
Estimating Effect Sizes (Wald)
By default, trajgwas
calculates p-value for each SNP using SPA/score test. Score test is fast because it doesn't require fitting alternative model for each SNP. User can request the Wald p-values and the estimated effect size of each SNP using keyword test=:wald
. Wald is much slower but will give you estimated effect sizes from the WiSER model.
@time trajgwas(@formula(y ~ 1 + sex + onMeds),
@formula(y ~ 1),
@formula(y ~ 1 + sex + onMeds),
:id,
datadir * "trajgwas_plinkex.csv",
datadir * "hapmap3",
pvalfile = "wald.pval.txt",
snpinds = 1:5,
test = :wald)
run = 1, ‖Δβ‖ = 0.037090, ‖Δτ‖ = 0.136339, ‖ΔL‖ = 0.005441, status = LOCALLY_SOLVED, time(s) = 0.045384
run = 2, ‖Δβ‖ = 0.000913, ‖Δτ‖ = 0.019810, ‖ΔL‖ = 0.001582, status = LOCALLY_SOLVED, time(s) = 0.043791
run = 1, ‖Δβ‖ = 0.033243, ‖Δτ‖ = 0.146511, ‖ΔL‖ = 0.005476, status = LOCALLY_SOLVED, time(s) = 0.051594
run = 2, ‖Δβ‖ = 0.005774, ‖Δτ‖ = 0.042246, ‖ΔL‖ = 0.001784, status = LOCALLY_SOLVED, time(s) = 0.046967
run = 1, ‖Δβ‖ = 0.013090, ‖Δτ‖ = 0.130781, ‖ΔL‖ = 0.005011, status = LOCALLY_SOLVED, time(s) = 0.056558
run = 2, ‖Δβ‖ = 0.003913, ‖Δτ‖ = 0.037309, ‖ΔL‖ = 0.001516, status = LOCALLY_SOLVED, time(s) = 0.051369
run = 1, ‖Δβ‖ = 0.022159, ‖Δτ‖ = 0.141135, ‖ΔL‖ = 0.005554, status = LOCALLY_SOLVED, time(s) = 0.048028
run = 2, ‖Δβ‖ = 0.001482, ‖Δτ‖ = 0.021700, ‖ΔL‖ = 0.001435, status = LOCALLY_SOLVED, time(s) = 0.043137
run = 1, ‖Δβ‖ = 0.026764, ‖Δτ‖ = 0.368620, ‖ΔL‖ = 0.000317, status = LOCALLY_SOLVED, time(s) = 0.047566
run = 2, ‖Δβ‖ = 0.003023, ‖Δτ‖ = 0.030938, ‖ΔL‖ = 0.003568, status = LOCALLY_SOLVED, time(s) = 0.042606
7.188892 seconds (12.73 M allocations: 777.569 MiB, 3.41% gc time, 91.29% compilation time)
Within-subject variance estimation by robust regression (WiSER)
Mean Formula:
y ~ 1 + sex + onMeds
Random Effects Formula:
y ~ 1
Within-Subject Variance Formula:
y ~ 1 + sex + onMeds
Number of individuals/clusters: 324
Total observations: 3240
Fixed-effects parameters:
────────────────────────────────────────────────────────
Estimate Std. Error Z Pr(>|Z|)
────────────────────────────────────────────────────────
β1: (Intercept) 13.2282 0.146459 90.32 <1e-99
β2: sex -3.29295 0.2101 -15.67 <1e-54
β3: onMeds 0.459585 0.0596002 7.71 <1e-13
τ1: (Intercept) 0.792508 0.0850728 9.32 <1e-19
τ2: sex -0.2865 0.0970732 -2.95 0.0032
τ3: onMeds 0.422303 0.063825 6.62 <1e-10
────────────────────────────────────────────────────────
Random effects covariance matrix Σγ:
"γ1: (Intercept)" 3.32057
Note the extra effect
column in pvalfile, which is the effect size (regression coefficient) for each SNP.
run(`head wald.pval.txt`);
chr pos snpid maf hwepval betaeffect betapval taueffect taupval
1 554484 rs10458597 0.0 1.0 0.0 1.0 0.0 1.0
1 758311 rs12562034 0.07763975155279501 0.4098763332666681 -0.231072479332816 0.3929844692486013 0.01340575997086949 0.9216488777690096
1 967643 rs2710875 0.32407407407407407 4.076249100705747e-7 0.6406954913590558 8.824209286611495e-7 -0.34763283794106525 1.0836367479719384e-11
1 1168108 rs11260566 0.19158878504672894 0.1285682279446898 0.6188250819875118 0.0002537468407208455 -0.14748437551458396 0.0502802758129169
1 1375074 rs1312568 0.441358024691358 2.5376019650614977e-19 -0.4560323801786057 0.00016592110001810087 0.4964521486169327 1.177346561221082e-36
One may disable within-sample variability modeling by putting a keyword argument disable_wsvar=true
.
@time trajgwas(@formula(y ~ 1 + sex + onMeds),
@formula(y ~ 1),
@formula(y ~ 1 + sex + onMeds),
:id,
datadir * "trajgwas_plinkex.csv",
datadir * "hapmap3",
pvalfile = "wald.pval.txt",
snpinds = 1:5,
test = :wald,
disable_wsvar = true)
run = 1, ‖Δβ‖ = 0.037090, ‖Δτ‖ = 0.136339, ‖ΔL‖ = 0.005441, status = LOCALLY_SOLVED, time(s) = 0.043700
run = 2, ‖Δβ‖ = 0.000913, ‖Δτ‖ = 0.019810, ‖ΔL‖ = 0.001582, status = LOCALLY_SOLVED, time(s) = 0.038386
run = 1, ‖Δβ‖ = 0.034186, ‖Δτ‖ = 0.147419, ‖ΔL‖ = 0.005538, status = LOCALLY_SOLVED, time(s) = 0.043628
run = 2, ‖Δβ‖ = 0.006293, ‖Δτ‖ = 0.021529, ‖ΔL‖ = 0.001733, status = LOCALLY_SOLVED, time(s) = 0.038884
run = 1, ‖Δβ‖ = 0.007322, ‖Δτ‖ = 0.107654, ‖ΔL‖ = 0.004949, status = LOCALLY_SOLVED, time(s) = 0.043276
run = 2, ‖Δβ‖ = 0.001662, ‖Δτ‖ = 0.013700, ‖ΔL‖ = 0.001501, status = LOCALLY_SOLVED, time(s) = 0.038970
run = 1, ‖Δβ‖ = 0.021644, ‖Δτ‖ = 0.134147, ‖ΔL‖ = 0.005351, status = LOCALLY_SOLVED, time(s) = 0.043053
run = 2, ‖Δβ‖ = 0.001608, ‖Δτ‖ = 0.019473, ‖ΔL‖ = 0.001570, status = LOCALLY_SOLVED, time(s) = 0.050367
run = 1, ‖Δβ‖ = 0.017117, ‖Δτ‖ = 0.154749, ‖ΔL‖ = 0.005408, status = LOCALLY_SOLVED, time(s) = 0.047992
run = 2, ‖Δβ‖ = 0.000453, ‖Δτ‖ = 0.023558, ‖ΔL‖ = 0.001218, status = LOCALLY_SOLVED, time(s) = 0.042605
4.148354 seconds (9.69 M allocations: 605.898 MiB, 4.13% gc time, 86.93% compilation time)
Within-subject variance estimation by robust regression (WiSER)
Mean Formula:
y ~ 1 + sex + onMeds
Random Effects Formula:
y ~ 1
Within-Subject Variance Formula:
y ~ 1 + sex + onMeds
Number of individuals/clusters: 324
Total observations: 3240
Fixed-effects parameters:
────────────────────────────────────────────────────────
Estimate Std. Error Z Pr(>|Z|)
────────────────────────────────────────────────────────
β1: (Intercept) 13.2282 0.146459 90.32 <1e-99
β2: sex -3.29295 0.2101 -15.67 <1e-54
β3: onMeds 0.459585 0.0596002 7.71 <1e-13
τ1: (Intercept) 0.792508 0.0850728 9.32 <1e-19
τ2: sex -0.2865 0.0970732 -2.95 0.0032
τ3: onMeds 0.422303 0.063825 6.62 <1e-10
────────────────────────────────────────────────────────
Random effects covariance matrix Σγ:
"γ1: (Intercept)" 3.32057
run(`head wald.pval.txt`);
chr pos snpid maf hwepval betaeffect betapval
1 554484 rs10458597 0.0 1.0 0.0 1.0
1 758311 rs12562034 0.07763975155279501 0.4098763332666681 -0.23114605301396765 0.3929413858489548
1 967643 rs2710875 0.32407407407407407 4.076249100705747e-7 0.6357974571413716 1.02417981478764e-6
1 1168108 rs11260566 0.19158878504672894 0.1285682279446898 0.6188324923086607 0.0002473977473431405
1 1375074 rs1312568 0.441358024691358 2.5376019650614977e-19 -0.45677365058013597 0.0001627044523523231
# clean up
rm("wald.pval.txt", force=true)
rm("trajgwas.null.txt", force=true)
Score test for screening, Wald for effect size estimates
For large data sets, a practical solution is to perform the score test first across all SNPs, then re-do Wald for the most promising SNPs according to score test p-values in order to get estimated effect sizes.
Step 1: Perform score test GWAS, results in score.pval.txt
.
@time trajgwas(@formula(y ~ 1 + sex + onMeds),
@formula(y ~ 1),
@formula(y ~ 1 + sex + onMeds),
:id,
datadir * "trajgwas_plinkex.csv",
datadir * "hapmap3",
pvalfile = "score.pval.txt")
run = 1, ‖Δβ‖ = 0.037090, ‖Δτ‖ = 0.136339, ‖ΔL‖ = 0.005441, status = LOCALLY_SOLVED, time(s) = 0.043247
run = 2, ‖Δβ‖ = 0.000913, ‖Δτ‖ = 0.019810, ‖ΔL‖ = 0.001582, status = LOCALLY_SOLVED, time(s) = 0.038263
1.090987 seconds (4.02 M allocations: 259.292 MiB, 12.06% gc time)
Within-subject variance estimation by robust regression (WiSER)
Mean Formula:
y ~ 1 + sex + onMeds
Random Effects Formula:
y ~ 1
Within-Subject Variance Formula:
y ~ 1 + sex + onMeds
Number of individuals/clusters: 324
Total observations: 3240
Fixed-effects parameters:
────────────────────────────────────────────────────────
Estimate Std. Error Z Pr(>|Z|)
────────────────────────────────────────────────────────
β1: (Intercept) 13.2282 0.146459 90.32 <1e-99
β2: sex -3.29295 0.2101 -15.67 <1e-54
β3: onMeds 0.459585 0.0596002 7.71 <1e-13
τ1: (Intercept) 0.792508 0.0850728 9.32 <1e-19
τ2: sex -0.2865 0.0970732 -2.95 0.0032
τ3: onMeds 0.422303 0.063825 6.62 <1e-10
────────────────────────────────────────────────────────
Random effects covariance matrix Σγ:
"γ1: (Intercept)" 3.32057
pretty_table(first(CSV.read("score.pval.txt", DataFrame), 8), tf = tf_markdown)
|[1m chr [0m|[1m pos [0m|[1m snpid [0m|[1m maf [0m|[1m hwepval [0m|[1m betapval [0m|[1m betadir [0m|[1m taupval [0m|[1m taudir [0m|[1m jointpval [0m|
|[90m Int64 [0m|[90m Int64 [0m|[90m String31 [0m|[90m Float64 [0m|[90m Float64 [0m|[90m Float64 [0m|[90m Int64 [0m|[90m Float64 [0m|[90m Int64 [0m|[90m Float64 [0m|
|-------|---------|------------|------------|-------------|-------------|---------|-------------|--------|-------------|
| 1 | 554484 | rs10458597 | 0.0 | 1.0 | 1.0 | 0 | 1.0 | 0 | 1.0 |
| 1 | 758311 | rs12562034 | 0.0776398 | 0.409876 | 0.38939 | -1 | 0.922822 | -1 | 0.547682 |
| 1 | 967643 | rs2710875 | 0.324074 | 4.07625e-7 | 5.83856e-6 | 1 | 3.35408e-6 | -1 | 4.93598e-7 |
| 1 | 1168108 | rs11260566 | 0.191589 | 0.128568 | 0.000850889 | 1 | 0.0559055 | 1 | 0.00101827 |
| 1 | 1375074 | rs1312568 | 0.441358 | 2.5376e-19 | 0.000171683 | -1 | 1.84811e-13 | 1 | 2.07091e-15 |
| 1 | 1588771 | rs35154105 | 0.0 | 1.0 | 1.0 | 0 | 1.0 | 0 | 1.0 |
| 1 | 1789051 | rs16824508 | 0.00462963 | 0.933278 | 0.295035 | 1 | 0.304109 | 1 | 0.299504 |
| 1 | 1990452 | rs2678939 | 0.453704 | 5.07696e-11 | 1.27871e-7 | 1 | 7.73809e-9 | -1 | 3.3986e-9 |
Step 2: Sort score test p-values and find top 10 SNPs.
scorepvals = CSV.read("score.pval.txt", DataFrame)[!, :taupval] # tau p-values
tophits = sortperm(scorepvals)[1:10] # indices of 10 SNPs with smallest p-values
scorepvals[tophits] # smallest 10 p-values
10-element Vector{Float64}:
7.872202557706978e-17
1.8481146312336014e-13
1.497868141369434e-12
9.730062616490452e-12
1.2892059340654512e-11
1.4844638660225815e-11
1.5167114834215392e-11
1.8761256218261456e-11
2.3491352960094672e-11
2.6075629334507108e-11
Step 3: Re-do LRT on top hits.
@time trajgwas(@formula(y ~ 1 + sex + onMeds),
@formula(y ~ 1),
@formula(y ~ 1 + sex + onMeds),
:id,
datadir * "trajgwas_plinkex.csv",
datadir * "hapmap3",
pvalfile = "wald.pval.txt",
snpinds = tophits,
test = :wald)
run = 1, ‖Δβ‖ = 0.037090, ‖Δτ‖ = 0.136339, ‖ΔL‖ = 0.005441, status = LOCALLY_SOLVED, time(s) = 0.044230
run = 2, ‖Δβ‖ = 0.000913, ‖Δτ‖ = 0.019810, ‖ΔL‖ = 0.001582, status = LOCALLY_SOLVED, time(s) = 0.041163
run = 1, ‖Δβ‖ = 0.026764, ‖Δτ‖ = 0.368620, ‖ΔL‖ = 0.000317, status = LOCALLY_SOLVED, time(s) = 0.058660
run = 2, ‖Δβ‖ = 0.003023, ‖Δτ‖ = 0.030938, ‖ΔL‖ = 0.003568, status = LOCALLY_SOLVED, time(s) = 0.047351
run = 1, ‖Δβ‖ = 0.039946, ‖Δτ‖ = 0.185049, ‖ΔL‖ = 0.005690, status = LOCALLY_SOLVED, time(s) = 0.050856
run = 2, ‖Δβ‖ = 0.001640, ‖Δτ‖ = 0.029639, ‖ΔL‖ = 0.002028, status = LOCALLY_SOLVED, time(s) = 0.046967
run = 1, ‖Δβ‖ = 0.040207, ‖Δτ‖ = 0.407863, ‖ΔL‖ = 0.001924, status = LOCALLY_SOLVED, time(s) = 0.047495
run = 2, ‖Δβ‖ = 0.002632, ‖Δτ‖ = 0.039609, ‖ΔL‖ = 0.005131, status = LOCALLY_SOLVED, time(s) = 0.047358
run = 1, ‖Δβ‖ = 0.052372, ‖Δτ‖ = 0.828946, ‖ΔL‖ = 0.001277, status = LOCALLY_SOLVED, time(s) = 0.062142
run = 2, ‖Δβ‖ = 0.013694, ‖Δτ‖ = 0.185493, ‖ΔL‖ = 0.007118, status = LOCALLY_SOLVED, time(s) = 0.057363
run = 1, ‖Δβ‖ = 0.032435, ‖Δτ‖ = 0.199659, ‖ΔL‖ = 0.004973, status = LOCALLY_SOLVED, time(s) = 0.050019
run = 2, ‖Δβ‖ = 0.002847, ‖Δτ‖ = 0.032438, ‖ΔL‖ = 0.002941, status = LOCALLY_SOLVED, time(s) = 0.074352
run = 1, ‖Δβ‖ = 0.009362, ‖Δτ‖ = 0.625969, ‖ΔL‖ = 0.002237, status = LOCALLY_SOLVED, time(s) = 0.079848
run = 2, ‖Δβ‖ = 0.005893, ‖Δτ‖ = 0.176417, ‖ΔL‖ = 0.005017, status = LOCALLY_SOLVED, time(s) = 0.057973
run = 1, ‖Δβ‖ = 0.012805, ‖Δτ‖ = 0.243313, ‖ΔL‖ = 0.006289, status = LOCALLY_SOLVED, time(s) = 0.056301
run = 2, ‖Δβ‖ = 0.002636, ‖Δτ‖ = 0.044210, ‖ΔL‖ = 0.003312, status = LOCALLY_SOLVED, time(s) = 0.060641
run = 1, ‖Δβ‖ = 0.014062, ‖Δτ‖ = 0.225841, ‖ΔL‖ = 0.003899, status = LOCALLY_SOLVED, time(s) = 0.061649
run = 2, ‖Δβ‖ = 0.002112, ‖Δτ‖ = 0.030955, ‖ΔL‖ = 0.003860, status = LOCALLY_SOLVED, time(s) = 0.053257
run = 1, ‖Δβ‖ = 0.026425, ‖Δτ‖ = 0.258805, ‖ΔL‖ = 0.004204, status = LOCALLY_SOLVED, time(s) = 0.060798
run = 2, ‖Δβ‖ = 0.001090, ‖Δτ‖ = 0.051281, ‖ΔL‖ = 0.003822, status = LOCALLY_SOLVED, time(s) = 0.059465
run = 1, ‖Δβ‖ = 0.035737, ‖Δτ‖ = 0.174702, ‖ΔL‖ = 0.001439, status = LOCALLY_SOLVED, time(s) = 0.063468
run = 2, ‖Δβ‖ = 0.004776, ‖Δτ‖ = 0.019790, ‖ΔL‖ = 0.001867, status = LOCALLY_SOLVED, time(s) = 0.064655
3.802703 seconds (7.04 M allocations: 458.754 MiB, 3.72% gc time, 60.82% compilation time)
Within-subject variance estimation by robust regression (WiSER)
Mean Formula:
y ~ 1 + sex + onMeds
Random Effects Formula:
y ~ 1
Within-Subject Variance Formula:
y ~ 1 + sex + onMeds
Number of individuals/clusters: 324
Total observations: 3240
Fixed-effects parameters:
────────────────────────────────────────────────────────
Estimate Std. Error Z Pr(>|Z|)
────────────────────────────────────────────────────────
β1: (Intercept) 13.2282 0.146459 90.32 <1e-99
β2: sex -3.29295 0.2101 -15.67 <1e-54
β3: onMeds 0.459585 0.0596002 7.71 <1e-13
τ1: (Intercept) 0.792508 0.0850728 9.32 <1e-19
τ2: sex -0.2865 0.0970732 -2.95 0.0032
τ3: onMeds 0.422303 0.063825 6.62 <1e-10
────────────────────────────────────────────────────────
Random effects covariance matrix Σγ:
"γ1: (Intercept)" 3.32057
pretty_table(CSV.read("wald.pval.txt", DataFrame), tf = tf_markdown)
|[1m chr [0m|[1m pos [0m|[1m snpid [0m|[1m maf [0m|[1m hwepval [0m|[1m betaeffect [0m|[1m betapval [0m|[1m taueffect [0m|[1m taupval [0m|
|[90m Int64 [0m|[90m Int64 [0m|[90m String15 [0m|[90m Float64 [0m|[90m Float64 [0m|[90m Float64 [0m|[90m Float64 [0m|[90m Float64 [0m|[90m Float64 [0m|
|-------|-----------|-----------|----------|-------------|------------|-------------|-----------|-------------|
| 1 | 1375074 | rs1312568 | 0.441358 | 2.5376e-19 | -0.456032 | 0.000165921 | 0.496452 | 1.17735e-36 |
| 1 | 11552817 | rs2745282 | 0.421053 | 1.3537e-14 | -0.67282 | 7.6492e-9 | 0.425653 | 1.45395e-19 |
| 1 | 120276030 | rs6688004 | 0.469136 | 1.75473e-27 | -0.637817 | 5.13062e-9 | 0.430562 | 6.07457e-21 |
| 2 | 135623558 | rs6730157 | 0.313272 | 7.76753e-16 | -0.556166 | 1.10163e-5 | 0.462346 | 2.88628e-20 |
| 9 | 126307510 | rs3814134 | 0.427469 | 4.28276e-33 | 0.723472 | 1.41601e-11 | -0.518216 | 4.75338e-39 |
| 15 | 40603025 | rs2617236 | 0.390966 | 1.16466e-14 | -0.644375 | 1.44948e-7 | 0.445647 | 9.43451e-19 |
| 15 | 40803767 | rs3742988 | 0.42284 | 4.26181e-30 | -0.796743 | 2.24785e-13 | 0.389234 | 7.09349e-17 |
| 17 | 56509992 | rs8064681 | 0.481481 | 1.37485e-21 | -0.58961 | 1.9256e-7 | 0.425417 | 5.60171e-25 |
| 17 | 71293786 | rs2125345 | 0.339009 | 1.75527e-14 | -0.635669 | 1.82739e-7 | 0.449612 | 1.7146e-21 |
| 23 | 64815688 | rs5964999 | 0.475078 | 3.63212e-56 | 0.75886 | 5.04109e-14 | -0.38849 | 5.39098e-23 |
# clean up
rm("trajgwas.null.txt", force=true)
rm("trajgwas.pval.txt", force=true)
rm("score.pval.txt", force=true)
rm("wald.pval.txt", force=true)
GxE or other interactions
Testing jointly G + GxE
In many applications, we want to test SNP effect and/or its interaction with other terms. testformula
keyword specifies the test unit besides the covariates in nullformula
.
In following example, keyword testformula=@formula(y ~ snp + snp & sex)
instructs trajgwas
to test joint effect of snp
and snp & sex
interaction.
trajgwas(@formula(y ~ 1 + sex + onMeds),
@formula(y ~ 1),
@formula(y ~ 1 + sex + onMeds),
:id,
datadir * "trajgwas_plinkex.csv",
datadir * "hapmap3",
pvalfile = "GxE.pval.txt",
testformula=@formula(y ~ snp + snp & sex))
run = 1, ‖Δβ‖ = 0.037090, ‖Δτ‖ = 0.136339, ‖ΔL‖ = 0.005441, status = LOCALLY_SOLVED, time(s) = 0.054691
run = 2, ‖Δβ‖ = 0.000913, ‖Δτ‖ = 0.019810, ‖ΔL‖ = 0.001582, status = LOCALLY_SOLVED, time(s) = 0.050742
Within-subject variance estimation by robust regression (WiSER)
Mean Formula:
y ~ 1 + sex + onMeds
Random Effects Formula:
y ~ 1
Within-Subject Variance Formula:
y ~ 1 + sex + onMeds
Number of individuals/clusters: 324
Total observations: 3240
Fixed-effects parameters:
────────────────────────────────────────────────────────
Estimate Std. Error Z Pr(>|Z|)
────────────────────────────────────────────────────────
β1: (Intercept) 13.2282 0.146459 90.32 <1e-99
β2: sex -3.29295 0.2101 -15.67 <1e-54
β3: onMeds 0.459585 0.0596002 7.71 <1e-13
τ1: (Intercept) 0.792508 0.0850728 9.32 <1e-19
τ2: sex -0.2865 0.0970732 -2.95 0.0032
τ3: onMeds 0.422303 0.063825 6.62 <1e-10
────────────────────────────────────────────────────────
Random effects covariance matrix Σγ:
"γ1: (Intercept)" 3.32057
pretty_table(first(CSV.read("GxE.pval.txt", DataFrame), 5), tf = tf_markdown)
|[1m chr [0m|[1m pos [0m|[1m snpid [0m|[1m maf [0m|[1m hwepval [0m|[1m betapval [0m|[1m taupval [0m|[1m jointpval [0m|
|[90m Int64 [0m|[90m Int64 [0m|[90m String31 [0m|[90m Float64 [0m|[90m Float64 [0m|[90m Float64 [0m|[90m Float64 [0m|[90m Float64 [0m|
|-------|---------|------------|-----------|------------|-------------|-------------|-------------|
| 1 | 554484 | rs10458597 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 1 | 758311 | rs12562034 | 0.0776398 | 0.409876 | 0.383 | 0.33511 | 0.357458 |
| 1 | 967643 | rs2710875 | 0.324074 | 4.07625e-7 | 9.98217e-6 | 5.73202e-7 | 1.08415e-6 |
| 1 | 1168108 | rs11260566 | 0.191589 | 0.128568 | 0.00225789 | 0.105482 | 0.00442114 |
| 1 | 1375074 | rs1312568 | 0.441358 | 2.5376e-19 | 0.000928728 | 3.48807e-15 | 6.97615e-15 |
# clean up
rm("trajgwas.null.txt", force=true)
rm("GxE.pval.txt", force=true)
Testing only GxE interaction term
For some applications, the user may want to simply test the GxE interaction effect. This requires fitting the SNP in the null model and is much slower, but the command trajgwas()
with keyword analysistype = "gxe"
can be used test the interaction effect. The environmental variable must be specified in the command using the keyword argument e
, either as a symbol, such as :age
or as a string "age"
.
trajgwas(@formula(y ~ 1 + sex + onMeds),
@formula(y ~ 1),
@formula(y ~ 1 + sex + onMeds),
:id,
datadir * "trajgwas_plinkex.csv",
datadir * "hapmap3",
pvalfile = "gxe_snp.pval.txt",
analysistype = "gxe",
e = :sex,
snpinds=1:5)
run = 1, ‖Δβ‖ = 0.037090, ‖Δτ‖ = 0.136339, ‖ΔL‖ = 0.005441, status = LOCALLY_SOLVED, time(s) = 0.052487
run = 2, ‖Δβ‖ = 0.000913, ‖Δτ‖ = 0.019810, ‖ΔL‖ = 0.001582, status = LOCALLY_SOLVED, time(s) = 0.046791
run = 1, ‖Δβ‖ = 0.480377, ‖Δτ‖ = 0.023510, ‖ΔL‖ = 0.001822, status = LOCALLY_SOLVED, time(s) = 0.045053
run = 2, ‖Δβ‖ = 0.000323, ‖Δτ‖ = 0.002136, ‖ΔL‖ = 0.000025, status = LOCALLY_SOLVED, time(s) = 0.050111
run = 1, ‖Δβ‖ = 1.059989, ‖Δτ‖ = 0.458239, ‖ΔL‖ = 0.063093, status = LOCALLY_SOLVED, time(s) = 0.064015
run = 2, ‖Δβ‖ = 0.009118, ‖Δτ‖ = 0.105352, ‖ΔL‖ = 0.000056, status = LOCALLY_SOLVED, time(s) = 0.058007
run = 1, ‖Δβ‖ = 1.144781, ‖Δτ‖ = 0.251459, ‖ΔL‖ = 0.034318, status = LOCALLY_SOLVED, time(s) = 0.049777
run = 2, ‖Δβ‖ = 0.000830, ‖Δτ‖ = 0.025631, ‖ΔL‖ = 0.000042, status = LOCALLY_SOLVED, time(s) = 0.044902
run = 1, ‖Δβ‖ = 0.683108, ‖Δτ‖ = 0.773919, ‖ΔL‖ = 0.042397, status = LOCALLY_SOLVED, time(s) = 0.045594
run = 2, ‖Δβ‖ = 0.008159, ‖Δτ‖ = 0.069178, ‖ΔL‖ = 0.003574, status = LOCALLY_SOLVED, time(s) = 0.053428
Within-subject variance estimation by robust regression (WiSER)
Mean Formula:
y ~ 1 + sex + onMeds
Random Effects Formula:
y ~ 1
Within-Subject Variance Formula:
y ~ 1 + sex + onMeds
Number of individuals/clusters: 324
Total observations: 3240
Fixed-effects parameters:
────────────────────────────────────────────────────────
Estimate Std. Error Z Pr(>|Z|)
────────────────────────────────────────────────────────
β1: (Intercept) 13.2282 0.146459 90.32 <1e-99
β2: sex -3.29295 0.2101 -15.67 <1e-54
β3: onMeds 0.459585 0.0596002 7.71 <1e-13
τ1: (Intercept) 0.792508 0.0850728 9.32 <1e-19
τ2: sex -0.2865 0.0970732 -2.95 0.0032
τ3: onMeds 0.422303 0.063825 6.62 <1e-10
────────────────────────────────────────────────────────
Random effects covariance matrix Σγ:
"γ1: (Intercept)" 3.32057
pretty_table(CSV.read("gxe_snp.pval.txt", DataFrame), tf = tf_markdown)
|[1m chr [0m|[1m pos [0m|[1m snpid [0m|[1m maf [0m|[1m hwepval [0m|[1m snpeffectnullbeta [0m|[1m snpeffectnulltau [0m|[1m betapval [0m|[1m taupval [0m|
|[90m Int64 [0m|[90m Int64 [0m|[90m String15 [0m|[90m Float64 [0m|[90m Float64 [0m|[90m Float64 [0m|[90m Float64 [0m|[90m Float64 [0m|[90m Float64 [0m|
|-------|---------|------------|-----------|------------|-------------------|------------------|----------|----------|
| 1 | 554484 | rs10458597 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| 1 | 758311 | rs12562034 | 0.0776398 | 0.409876 | -0.230593 | 0.0102355 | 0.448376 | 0.149797 |
| 1 | 967643 | rs2710875 | 0.324074 | 4.07625e-7 | 0.639871 | -0.338815 | 0.293876 | 0.119916 |
| 1 | 1168108 | rs11260566 | 0.191589 | 0.128568 | 0.618855 | -0.146819 | 0.590189 | 0.262791 |
| 1 | 1375074 | rs1312568 | 0.441358 | 2.5376e-19 | -0.456003 | 0.492582 | 0.574737 | 0.250501 |
# clean up
rm("trajgwas.null.txt", force=true)
rm("gxe_snp.pval.txt", force=true)
SNP-set testing
In many applications, we want to test a SNP-set. The function with keyword analysistype = "snpset"
can be used to do this. To specify the type of snpset test, use the snpset
keyword argument.
The snpset can be specified as either:
- a window (test every X snps) =>
snpset = X
- an annotated file. This requires
snpset = filename
where filename is an input file, with no header and two columns separated by a space. The first column must contain the snpset ID and the second column must contain the snpid's (rsid) identical to the bimfile, or in the case of a BGEN format, the order the data will be read (see BGEN above for more details). - a joint test on only a specific set of snps.
snpset = AbstractVector
where the vector specifies the snps you want to perform one joint snpset test for. The vector can either be a vector of integers where the elements are the indicies of the SNPs to test, a vector of booleans, where true represents that you want to select that SNP index, or a range indicating the indicies of the SNPs to test.
SNPset testing is currently only implemented for the score test (not Wald).
In the following example, we perform a SNP-set test on the 50th to 55th snps.
trajgwas(@formula(y ~ 1 + sex + onMeds),
@formula(y ~ 1),
@formula(y ~ 1 + sex + onMeds),
:id,
datadir * "trajgwas_plinkex.csv",
datadir * "hapmap3",
pvalfile = "snpset.pval.txt",
analysistype = "snpset",
snpset = 50:55)
run = 1, ‖Δβ‖ = 0.037090, ‖Δτ‖ = 0.136339, ‖ΔL‖ = 0.005441, status = LOCALLY_SOLVED, time(s) = 0.045957
run = 2, ‖Δβ‖ = 0.000913, ‖Δτ‖ = 0.019810, ‖ΔL‖ = 0.001582, status = LOCALLY_SOLVED, time(s) = 0.043228
Within-subject variance estimation by robust regression (WiSER)
Mean Formula:
y ~ 1 + sex + onMeds
Random Effects Formula:
y ~ 1
Within-Subject Variance Formula:
y ~ 1 + sex + onMeds
Number of individuals/clusters: 324
Total observations: 3240
Fixed-effects parameters:
────────────────────────────────────────────────────────
Estimate Std. Error Z Pr(>|Z|)
────────────────────────────────────────────────────────
β1: (Intercept) 13.2282 0.146459 90.32 <1e-99
β2: sex -3.29295 0.2101 -15.67 <1e-54
β3: onMeds 0.459585 0.0596002 7.71 <1e-13
τ1: (Intercept) 0.792508 0.0850728 9.32 <1e-19
τ2: sex -0.2865 0.0970732 -2.95 0.0032
τ3: onMeds 0.422303 0.063825 6.62 <1e-10
────────────────────────────────────────────────────────
Random effects covariance matrix Σγ:
"γ1: (Intercept)" 3.32057
run(`head snpset.pval.txt`);
The pvalue of snps indexed at 50:55 is betapval: 2.315617121196509e-5, taupval: 3.972993706371914e-10
# clean up
rm("snpset.pval.txt", force=true)
rm("trajgwas.null.txt", force=true)
In the following example we run a SNP-set test on the annotated SNP-set file.
run(`head ../data/hapmap_snpsetfile.txt`);
gene1 rs10458597
gene1 rs12562034
gene1 rs2710875
gene1 rs11260566
gene1 rs1312568
gene1 rs35154105
gene1 rs16824508
gene1 rs2678939
gene1 rs7553178
gene1 rs13376356
trajgwas(@formula(y ~ 1 + sex + onMeds),
@formula(y ~ 1),
@formula(y ~ 1 + sex + onMeds),
:id,
datadir * "trajgwas_plinkex.csv",
datadir * "hapmap3",
pvalfile = "snpset.pval.txt",
analysistype = "snpset",
snpset = datadir * "/hapmap_snpsetfile.txt")
run = 1, ‖Δβ‖ = 0.037090, ‖Δτ‖ = 0.136339, ‖ΔL‖ = 0.005441, status = LOCALLY_SOLVED, time(s) = 0.063600
run = 2, ‖Δβ‖ = 0.000913, ‖Δτ‖ = 0.019810, ‖ΔL‖ = 0.001582, status = LOCALLY_SOLVED, time(s) = 0.062349
Within-subject variance estimation by robust regression (WiSER)
Mean Formula:
y ~ 1 + sex + onMeds
Random Effects Formula:
y ~ 1
Within-Subject Variance Formula:
y ~ 1 + sex + onMeds
Number of individuals/clusters: 324
Total observations: 3240
Fixed-effects parameters:
────────────────────────────────────────────────────────
Estimate Std. Error Z Pr(>|Z|)
────────────────────────────────────────────────────────
β1: (Intercept) 13.2282 0.146459 90.32 <1e-99
β2: sex -3.29295 0.2101 -15.67 <1e-54
β3: onMeds 0.459585 0.0596002 7.71 <1e-13
τ1: (Intercept) 0.792508 0.0850728 9.32 <1e-19
τ2: sex -0.2865 0.0970732 -2.95 0.0032
τ3: onMeds 0.422303 0.063825 6.62 <1e-10
────────────────────────────────────────────────────────
Random effects covariance matrix Σγ:
"γ1: (Intercept)" 3.32057
pretty_table(first(CSV.read("snpset.pval.txt", DataFrame; delim="\t"), 5), tf = tf_markdown)
|[1m snpsetid [0m|[1m nsnps [0m|[1m betapval [0m|[1m taupval [0m|
|[90m String7 [0m|[90m Int64 [0m|[90m Float64 [0m|[90m Float64 [0m|
|----------|-------|-----------|-----------|
| gene1 | 93 | 0.111864 | 0.011595 |
| gene2 | 93 | 0.0249929 | 0.0648265 |
| gene3 | 93 | 0.131741 | 0.0298884 |
| gene4 | 92 | 0.010327 | 0.0584591 |
| gene5 | 93 | 0.0303905 | 0.0924954 |
# clean up
rm("snpset.pval.txt", force=true)
rm("trajgwas.null.txt", force=true)
In the following example we run a SNP-set test on every 15 SNPs.
trajgwas(@formula(y ~ 1 + sex + onMeds),
@formula(y ~ 1),
@formula(y ~ 1 + sex + onMeds),
:id,
datadir * "trajgwas_plinkex.csv",
datadir * "hapmap3",
pvalfile = "snpset.pval.txt",
analysistype = "snpset",
snpset = 15)
run = 1, ‖Δβ‖ = 0.037090, ‖Δτ‖ = 0.136339, ‖ΔL‖ = 0.005441, status = LOCALLY_SOLVED, time(s) = 0.044625
run = 2, ‖Δβ‖ = 0.000913, ‖Δτ‖ = 0.019810, ‖ΔL‖ = 0.001582, status = LOCALLY_SOLVED, time(s) = 0.038447
Within-subject variance estimation by robust regression (WiSER)
Mean Formula:
y ~ 1 + sex + onMeds
Random Effects Formula:
y ~ 1
Within-Subject Variance Formula:
y ~ 1 + sex + onMeds
Number of individuals/clusters: 324
Total observations: 3240
Fixed-effects parameters:
────────────────────────────────────────────────────────
Estimate Std. Error Z Pr(>|Z|)
────────────────────────────────────────────────────────
β1: (Intercept) 13.2282 0.146459 90.32 <1e-99
β2: sex -3.29295 0.2101 -15.67 <1e-54
β3: onMeds 0.459585 0.0596002 7.71 <1e-13
τ1: (Intercept) 0.792508 0.0850728 9.32 <1e-19
τ2: sex -0.2865 0.0970732 -2.95 0.0032
τ3: onMeds 0.422303 0.063825 6.62 <1e-10
────────────────────────────────────────────────────────
Random effects covariance matrix Σγ:
"γ1: (Intercept)" 3.32057
pretty_table(first(CSV.read("snpset.pval.txt", DataFrame), 5), tf = tf_markdown)
|[1m startchr [0m|[1m startpos [0m|[1m startsnpid [0m|[1m endchr [0m|[1m endpos [0m|[1m endsnpid [0m|[1m betapval [0m|[1m taupval [0m|
|[90m Int64 [0m|[90m Int64 [0m|[90m String15 [0m|[90m Int64 [0m|[90m Int64 [0m|[90m String15 [0m|[90m Float64 [0m|[90m Float64 [0m|
|----------|----------|------------|--------|----------|------------|-------------|-------------|
| 1 | 554484 | rs10458597 | 1 | 3431124 | rs12093117 | 3.83046e-6 | 2.24303e-10 |
| 1 | 3633945 | rs10910017 | 1 | 6514524 | rs932112 | 0.000127922 | 7.94764e-6 |
| 1 | 6715827 | rs441515 | 1 | 9534606 | rs4926480 | 8.10353e-5 | 6.11848e-7 |
| 1 | 9737551 | rs12047054 | 1 | 12559747 | rs4845907 | 0.000425108 | 1.15311e-8 |
| 1 | 12760427 | rs848577 | 1 | 16021797 | rs6679870 | 2.75832e-5 | 0.00028553 |
# clean up
rm("snpset.pval.txt", force=true)
rm("trajgwas.null.txt", force=true)
Matching Indicies
In some cases, there are only a subset of individuals with both genetic data and covariate information available. The null model must be fit on a subset of the individuals with the genetic data. The rows can be specified with the argument covrowinds
if you pass in a covariate file. The genetic indicies can be specified with the geneticrowinds
argument.
For simplicity, we have implemented a function matchindices(meanformula, reformula, wsvarformula, idvar, df, geneticsamples)
which can be used to do this. Input the mean, random effects, and within-subject variance formulas, the grouping (id) variable, the dataframe (or table), and a vector of the IDs in the order of the genetic file and it returns covrowmask, geneticrowmask
for matching indicies in a covariate file and geneticfile.
Note: the idvar
in the dataframe and the geneticsamples
vector must have the same element type. To convert a vector of integers to strings string.(intvector)
can be used. To go from a vector of strings to integers you can use" map(x -> parse(Int, x), stringvectoparse)
.
famfileids = CSV.read(datadir * "hapmap3.fam", DataFrame, header = false)[!, 1] # famfile contains the sample IDs for PLINK files
324-element Vector{String7}:
"A1"
"2"
"3"
"4"
"5"
"6"
"7"
"8"
"9"
"10"
"11"
"12"
"13"
⋮
"313"
"314"
"315"
"316"
"317"
"318"
"319"
"320"
"321"
"322"
"323"
"Z324"
covdf = CSV.read(datadir * "trajgwas_plinkex.csv", DataFrame)
pretty_table(first(covdf, 11), tf = tf_markdown)
|[1m sex [0m|[1m onMeds [0m|[1m snp1 [0m|[1m snp2 [0m|[1m snp3 [0m|[1m snp4 [0m|[1m y [0m|[1m id [0m|
|[90m Float64 [0m|[90m Float64 [0m|[90m Float64 [0m|[90m Float64 [0m|[90m Float64 [0m|[90m Float64 [0m|[90m Float64 [0m|[90m String7 [0m|
|---------|---------|---------|---------|---------|---------|---------|---------|
| 0.0 | 1.0 | 0.0 | 1.0 | 2.0 | 0.0 | 12.2667 | A1 |
| 0.0 | 0.0 | 0.0 | 1.0 | 2.0 | 0.0 | 10.2681 | A1 |
| 0.0 | 0.0 | 0.0 | 1.0 | 2.0 | 0.0 | 12.166 | A1 |
| 0.0 | 0.0 | 0.0 | 1.0 | 2.0 | 0.0 | 11.8797 | A1 |
| 0.0 | 0.0 | 0.0 | 1.0 | 2.0 | 0.0 | 12.8127 | A1 |
| 0.0 | 0.0 | 0.0 | 1.0 | 2.0 | 0.0 | 9.98766 | A1 |
| 0.0 | 0.0 | 0.0 | 1.0 | 2.0 | 0.0 | 12.1408 | A1 |
| 0.0 | 1.0 | 0.0 | 1.0 | 2.0 | 0.0 | 13.2058 | A1 |
| 0.0 | 0.0 | 0.0 | 1.0 | 2.0 | 0.0 | 11.3631 | A1 |
| 0.0 | 1.0 | 0.0 | 1.0 | 2.0 | 0.0 | 15.2511 | A1 |
| 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 2.0 | 12.746 | 2 |
covrowmask, geneticrowmask = matchindices(@formula(y ~ 1 + sex + onMeds),
@formula(y ~ 1),
@formula(y ~ 1 + sex + onMeds),
:id, covdf, famfileids);
trajgwas(@formula(y ~ 1 + sex + onMeds),
@formula(y ~ 1),
@formula(y ~ 1 + sex + onMeds),
:id,
datadir * "trajgwas_plinkex.csv",
datadir * "hapmap3",
pvalfile = pvalpath,
nullfile = nullpath,
covrowinds = covrowmask,
geneticrowinds = geneticrowmask)
run = 1, ‖Δβ‖ = 0.037090, ‖Δτ‖ = 0.136339, ‖ΔL‖ = 0.005441, status = LOCALLY_SOLVED, time(s) = 0.049927
run = 2, ‖Δβ‖ = 0.000913, ‖Δτ‖ = 0.019810, ‖ΔL‖ = 0.001582, status = LOCALLY_SOLVED, time(s) = 0.046026
Within-subject variance estimation by robust regression (WiSER)
Mean Formula:
y ~ 1 + sex + onMeds
Random Effects Formula:
y ~ 1
Within-Subject Variance Formula:
y ~ 1 + sex + onMeds
Number of individuals/clusters: 324
Total observations: 3240
Fixed-effects parameters:
────────────────────────────────────────────────────────
Estimate Std. Error Z Pr(>|Z|)
────────────────────────────────────────────────────────
β1: (Intercept) 13.2282 0.146459 90.32 <1e-99
β2: sex -3.29295 0.2101 -15.67 <1e-54
β3: onMeds 0.459585 0.0596002 7.71 <1e-13
τ1: (Intercept) 0.792508 0.0850728 9.32 <1e-19
τ2: sex -0.2865 0.0970732 -2.95 0.0032
τ3: onMeds 0.422303 0.063825 6.62 <1e-10
────────────────────────────────────────────────────────
Random effects covariance matrix Σγ:
"γ1: (Intercept)" 3.32057
# clean up
rm("trajgwas.null.txt", force=true)
rm("trajgwas.pval.txt", force=true)
Plotting Results
To plot the GWAS results, we recommend using the MendelPlots.jl package.
Multiple Plink file sets
In large scale studies, genotypes data are split into multiple Plink files, e.g., by chromosome. Then GWAS analysis can be done in parallel. This can be achieved by two steps.
Let's first create demo data by splitting hapmap3 according to chromosome:
# split example hapmap3 data according to chromosome
SnpArrays.split_plink(datadir * "hapmap3", :chromosome; prefix=datadir * "hapmap3.chr.")
readdir(glob"hapmap3.chr.*", datadir)
75-element Vector{String}:
"/Users/xyz/.julia/dev/TrajGWAS/data/hapmap3.chr.1.bed"
"/Users/xyz/.julia/dev/TrajGWAS/data/hapmap3.chr.1.bim"
"/Users/xyz/.julia/dev/TrajGWAS/data/hapmap3.chr.1.fam"
"/Users/xyz/.julia/dev/TrajGWAS/data/hapmap3.chr.10.bed"
"/Users/xyz/.julia/dev/TrajGWAS/data/hapmap3.chr.10.bim"
"/Users/xyz/.julia/dev/TrajGWAS/data/hapmap3.chr.10.fam"
"/Users/xyz/.julia/dev/TrajGWAS/data/hapmap3.chr.11.bed"
"/Users/xyz/.julia/dev/TrajGWAS/data/hapmap3.chr.11.bim"
"/Users/xyz/.julia/dev/TrajGWAS/data/hapmap3.chr.11.fam"
"/Users/xyz/.julia/dev/TrajGWAS/data/hapmap3.chr.12.bed"
"/Users/xyz/.julia/dev/TrajGWAS/data/hapmap3.chr.12.bim"
"/Users/xyz/.julia/dev/TrajGWAS/data/hapmap3.chr.12.fam"
"/Users/xyz/.julia/dev/TrajGWAS/data/hapmap3.chr.13.bed"
⋮
"/Users/xyz/.julia/dev/TrajGWAS/data/hapmap3.chr.6.bed"
"/Users/xyz/.julia/dev/TrajGWAS/data/hapmap3.chr.6.bim"
"/Users/xyz/.julia/dev/TrajGWAS/data/hapmap3.chr.6.fam"
"/Users/xyz/.julia/dev/TrajGWAS/data/hapmap3.chr.7.bed"
"/Users/xyz/.julia/dev/TrajGWAS/data/hapmap3.chr.7.bim"
"/Users/xyz/.julia/dev/TrajGWAS/data/hapmap3.chr.7.fam"
"/Users/xyz/.julia/dev/TrajGWAS/data/hapmap3.chr.8.bed"
"/Users/xyz/.julia/dev/TrajGWAS/data/hapmap3.chr.8.bim"
"/Users/xyz/.julia/dev/TrajGWAS/data/hapmap3.chr.8.fam"
"/Users/xyz/.julia/dev/TrajGWAS/data/hapmap3.chr.9.bed"
"/Users/xyz/.julia/dev/TrajGWAS/data/hapmap3.chr.9.bim"
"/Users/xyz/.julia/dev/TrajGWAS/data/hapmap3.chr.9.fam"
Step 1: Fit the null model. Setting third argument geneticfile
to nothing
instructs trajgwas
function to fit the null model only.
nm = trajgwas(@formula(y ~ 1 + sex + onMeds),
@formula(y ~ 1),
@formula(y ~ 1 + sex + onMeds),
:id,
datadir * "trajgwas_plinkex.csv",
nothing)
run = 1, ‖Δβ‖ = 0.037090, ‖Δτ‖ = 0.136339, ‖ΔL‖ = 0.005441, status = LOCALLY_SOLVED, time(s) = 0.045024
run = 2, ‖Δβ‖ = 0.000913, ‖Δτ‖ = 0.019810, ‖ΔL‖ = 0.001582, status = LOCALLY_SOLVED, time(s) = 0.047245
Within-subject variance estimation by robust regression (WiSER)
Mean Formula:
y ~ 1 + sex + onMeds
Random Effects Formula:
y ~ 1
Within-Subject Variance Formula:
y ~ 1 + sex + onMeds
Number of individuals/clusters: 324
Total observations: 3240
Fixed-effects parameters:
────────────────────────────────────────────────────────
Estimate Std. Error Z Pr(>|Z|)
────────────────────────────────────────────────────────
β1: (Intercept) 13.2282 0.146459 90.32 <1e-99
β2: sex -3.29295 0.2101 -15.67 <1e-54
β3: onMeds 0.459585 0.0596002 7.71 <1e-13
τ1: (Intercept) 0.792508 0.0850728 9.32 <1e-19
τ2: sex -0.2865 0.0970732 -2.95 0.0032
τ3: onMeds 0.422303 0.063825 6.62 <1e-10
────────────────────────────────────────────────────────
Random effects covariance matrix Σγ:
"γ1: (Intercept)" 3.32057
Step 2: GWAS for each chromosome.
# this part can be submitted as separate jobs
for chr in 1:23
plinkfile = datadir * "hapmap3.chr." * string(chr)
pvalfile = plinkfile * ".pval.txt"
trajgwas(nm, plinkfile, pvalfile=pvalfile)
end
# show the result files
readdir(glob"*.pval.txt", datadir)
23-element Vector{String}:
"/Users/xyz/.julia/dev/TrajGWAS/data/hapmap3.chr.1.pval.txt"
"/Users/xyz/.julia/dev/TrajGWAS/data/hapmap3.chr.10.pval.txt"
"/Users/xyz/.julia/dev/TrajGWAS/data/hapmap3.chr.11.pval.txt"
"/Users/xyz/.julia/dev/TrajGWAS/data/hapmap3.chr.12.pval.txt"
"/Users/xyz/.julia/dev/TrajGWAS/data/hapmap3.chr.13.pval.txt"
"/Users/xyz/.julia/dev/TrajGWAS/data/hapmap3.chr.14.pval.txt"
"/Users/xyz/.julia/dev/TrajGWAS/data/hapmap3.chr.15.pval.txt"
"/Users/xyz/.julia/dev/TrajGWAS/data/hapmap3.chr.16.pval.txt"
"/Users/xyz/.julia/dev/TrajGWAS/data/hapmap3.chr.17.pval.txt"
"/Users/xyz/.julia/dev/TrajGWAS/data/hapmap3.chr.18.pval.txt"
"/Users/xyz/.julia/dev/TrajGWAS/data/hapmap3.chr.19.pval.txt"
"/Users/xyz/.julia/dev/TrajGWAS/data/hapmap3.chr.2.pval.txt"
"/Users/xyz/.julia/dev/TrajGWAS/data/hapmap3.chr.20.pval.txt"
"/Users/xyz/.julia/dev/TrajGWAS/data/hapmap3.chr.21.pval.txt"
"/Users/xyz/.julia/dev/TrajGWAS/data/hapmap3.chr.22.pval.txt"
"/Users/xyz/.julia/dev/TrajGWAS/data/hapmap3.chr.23.pval.txt"
"/Users/xyz/.julia/dev/TrajGWAS/data/hapmap3.chr.3.pval.txt"
"/Users/xyz/.julia/dev/TrajGWAS/data/hapmap3.chr.4.pval.txt"
"/Users/xyz/.julia/dev/TrajGWAS/data/hapmap3.chr.5.pval.txt"
"/Users/xyz/.julia/dev/TrajGWAS/data/hapmap3.chr.6.pval.txt"
"/Users/xyz/.julia/dev/TrajGWAS/data/hapmap3.chr.7.pval.txt"
"/Users/xyz/.julia/dev/TrajGWAS/data/hapmap3.chr.8.pval.txt"
"/Users/xyz/.julia/dev/TrajGWAS/data/hapmap3.chr.9.pval.txt"
In the rare situations where the multiple sets of Plink files lack the fam
file or the corresponding bed and bim files have different filenames, users can explicitly supply bed filename, bim file name, and number of individuals. Replace Step 2 by
Step 2': GWAS for each chromosome.
# this part can be submitted as separate jobs
for chr in 1:23
bedfile = datadir * "hapmap3.chr." * string(chr) * ".bed"
bimfile = datadir * "hapmap3.chr." * string(chr) * ".bim"
pvalfile = datadir * "hapmap3.chr." * string(chr) * ".pval.txt"
trajgwas(nm, bedfile, bimfile, 324; pvalfile=pvalfile)
end
# clean up
rm("trajgwas.null.txt", force=true)
isfile(datadir * "fittednullmodel.jld2") && rm(datadir * "fittednullmodel.jld2")
for chr in 1:23
pvalfile = datadir * "hapmap3.chr." * string(chr) * ".pval.txt"
rm(pvalfile, force=true)
end
for chr in 1:26
plinkfile = datadir * "hapmap3.chr." * string(chr)
rm(plinkfile * ".bed", force=true)
rm(plinkfile * ".fam", force=true)
rm(plinkfile * ".bim", force=true)
end
Multiple file sets on cluster
For running the score tests on a cluster, it would be desirable to fit a null model on a single machine first, and save the model as a serialized Julia object (.jls
). For example:
using DataFrames, CSV
using Statistics
using TrajGWAS
using Ipopt, WiSER
using LinearAlgebra
using BGEN
# fit the null model
BLAS.set_num_threads(1)
solver = Ipopt.Optimizer()
solver_config = Dict("print_level" => 1,
"watchdog_shortened_iter_trigger" => 5,
"max_iter" => 120)
bp_data = CSV.read("bp.csv", DataFrame)
@time nm = trajgwas(@formula(dbp ~ 1 + SEX + age + age_sq +
PC1 + PC2 + PC3 + PC4 + PC5 + bmi),
@formula(dbp ~ 1 + age),
@formula(dbp ~ 1 + SEX + age + age_sq +
PC1 + PC2 + PC3 + PC4 + PC5 +
bmi),
:FID, # subject ID
bp_data,
nothing;
nullfile="dbp.null.txt",
solver=solver,
solver_config=solver_config,
init = nothing, # may change to `x -> WiSER.init_ls!(x; gniters=0)`
runs=10
)
println(nm)
using Serialization
open("null.model.jls", "w") do io
Serialization.serialize(io, nm)
end
Then, the fitted null model would be used for the score test. It could be desirable to configure each job to run on a slice of SNPs on a BGEN file for higher throughput. The julia script for score test (scoretest_bp.jl
) for a slice of BGEN file would look like:
using DataFrames, CSV
using Statistics
using TrajGWAS
using WiSER
using LinearAlgebra
using BGEN
# fit the null model
BLAS.set_num_threads(1)
using Serialization
bgendir = ARGS[1]
chr = ARGS[2] # 1 to 22
fitted_null = ARGS[3] # "null.model.jls"
pvalfile = ARGS[4] # "sbp.test.diabetics.chr$(chr).txt"
chunkidx = parse(Int, ARGS[5])
nchunks = parse(Int, ARGS[6])
nm = open(deserialize, fitted_null)
genetic_iids_subsample = nm.ids
bgenfilename = bgendir * "/Chr$(chr)" # to analyze, for example, /path/to/bgen/Chr5.bgen
samplefilename = bgendir * "/Samples.sample" # .sample file compatible with BGEN
mfifilename = bgendir * "/mfi_chr$(chr).txt" # "MFI" file, an external file with MAF + info score
ukb_data = Bgen(bgenfilename * ".bgen"; sample_path = samplefilename)
genetic_iids = map(x -> parse(Int, split(x, " ")[1]), samples(ukb_data))
order_dict = Dict{Int, Int}()
for (i, iid) in enumerate(genetic_iids)
order_dict[iid] = i
end
sample_indicator = falses(length(genetic_iids))
for v in genetic_iids_subsample
sample_indicator[order_dict[v]] = true # extract only the samples being used for the analysis
end
# GWAS for each chromosome
## pre-filtering SNPs not passing the criteria (MAF > 0.002, info score > 0.3)
min_maf = 0.002
min_info_score = 0.3
min_hwe_pval = 1e-10
mfi = CSV.read(mfifilename, DataFrame; header=false)
mfi.Column8 = map(x -> x == "NA" ? NaN : parse(Float64, x), mfi.Column8) # Column8: info score
snpmask = (mfi.Column6 .> min_maf) .& (mfi.Column8 .> 0.3) # Column6: MAF
## compute range to run the analysis
chunksize = n_variants(ukb_data) ÷ nchunks + (n_variants(ukb_data) % nchunks > 0 ? 1 : 0)
startidx = chunksize * (chunkidx - 1) + 1
endidx = min(chunksize * chunkidx, n_variants(ukb_data))
snpmask = snpmask[startidx:endidx]
println("running for variants $startidx to $endidx")
# rearrange data in nm so that it matches bgen data
nullinds = indexin(genetic_iids[sample_indicator], nm.ids)
nm.obswts .= isempty(nm.obswts) ? nm.obswts : nm.obswts[nullinds]
nm.ids .= nm.ids[nullinds]
nm.nis .= nm.nis[nullinds]
nm.data .= nm.data[nullinds]
@assert genetic_iids[sample_indicator] == nm.ids "there is some issue -- sampleids not matching"
trajgwas(nm, bgenfilename * ".bgen", count(sample_indicator);
samplepath=samplefilename,
pvalfile=pvalfile,
snpinds=snpmask,
min_hwe_pval = min_hwe_pval,
bgenrowinds = sample_indicator,
startidx = startidx,
endidx = endidx,
usespa=true)
- Command-line arguments
- Argument 1: directory for the BGEN files. BGEN files (.bgen), BGEN index files (.bgen.bgi), and MFI files (.txt) should be included there.
- Argument 2: chromosome
- Argument 3: fitted null model (.jls)
- Argument 4: path for the result p-value file
- Argument 5: chunk index (1-based)
- Argument 6: number of chunks
The code above runs the analysis on ARGS[5]
-th slice out of ARGS[6]
slices of chromosome ARGS[2]
.
- Note: an index file (
.bgen.bgi
) is required for this slicing of BGEN file. See this link to see how to create one.
Then, the following script could be used for a cluster managed by Sun Grid Engine: (sbp_diabetes.sh
)
#!/bin/bash
#$ -cwd
#$ -o joblog.$JOB_ID.$TASK_ID
#$ -j y
#$ -pe shared 2
#$ -l h_rt=8:00:00,h_data=8G,arch=intel*
# Email address to notify
##$ -M $USER@mail
# Notify when
#$ -m a
# Job array indexes
#$ -t 1-352:1
NCHUNKS=16
CHUNKIDX=$(( (${SGE_TASK_ID} - 1) % ${NCHUNKS} + 1 ))
CHR=$(( (${SGE_TASK_ID} - 1) / ${NCHUNKS} + 1))
PROJECTDIR=/user/dir/jobscripts
BGENDIR=/user/dir/imputed
FITTED_NULL=/user/dir/null.model.jls
PVALFILE=/user/dir/pvalfiles/sbp.test.diabetes.chr${CHR}.${CHUNKIDX}of${NCHUNKS}.txt
module load julia
time julia --project=${PROJECTDIR} ${PROJECTDIR}/scoretest_bp.jl ${BGENDIR} ${CHR} ${FITTED_NULL} ${PVALFILE} ${CHUNKIDX} ${NCHUNKS}
and this could be submitted using the qsub
command.
Troubleshooting
If there are issues you're encountering with running TrajGWAS, the following are possible remedies.
- Null Model
- Issues with null model convergence may be solved by choosing alternate starting values for parameters, using a different solver, transforming variables, and increasing the number of runs (WiSER runs). These are detailed in the WiSER documentation here.
init = x -> WiSER.init_ls!(x; gniters=0)
works fine in general.
- Issues with null model convergence may be solved by choosing alternate starting values for parameters, using a different solver, transforming variables, and increasing the number of runs (WiSER runs). These are detailed in the WiSER documentation here.
- GWAS Results
- If you use the score test instead of the SPA-score test (SPA is default for single-SNP analyses), then there can be inflation in type I error and decreased power when (a) the sample size is small, (b) the number of repeated measures is low, or (c) the variants analyzed are rare with low minor allele frequencies. In these cases, the score test is not optimal and it is suggested to use the SPA version (
usespa=true
). SPA is only implemented for single-SNP analyses. These issues can occur in both Wald and score tests.
- If you use the score test instead of the SPA-score test (SPA is default for single-SNP analyses), then there can be inflation in type I error and decreased power when (a) the sample size is small, (b) the number of repeated measures is low, or (c) the variants analyzed are rare with low minor allele frequencies. In these cases, the score test is not optimal and it is suggested to use the SPA version (
If you notice any problems with your output or results, file an issue.