API
Here is a list of available function calls. A detailed description can be found below.
MendelIHT.cross_validate
MendelIHT.cv_iht
MendelIHT.fit_iht
MendelIHT.iht
MendelIHT.iht_run_many_models
MendelIHT.make_bim_fam_files
MendelIHT.pve
MendelIHT.simulate_correlated_snparray
MendelIHT.simulate_random_response
MendelIHT.simulate_random_snparray
Wrapper Functions
Most users will use the following wrapper functions. Users specify location of PLINK files and possibly the phenotype/covariate files. These functions will soon be updated to support VCF and BGEN formats.
MendelIHT.iht
— Functioniht(filename, k, d, phenotypes=6, covariates="", summaryfile="iht.summary.txt",
betafile="iht.beta.txt", kwargs...)
Runs IHT with sparsity level k
.
Arguments
filename
: AString
for VCF, binary PLINK, or BGEN file. VCF files should end in.vcf
or.vcf.gz
. Binary PLINK files should exclude.bim/.bed/.fam
trailings but the trio must all be present in the same directory. BGEN files should end in.bgen
.k
: AnInt
for sparsity parameter = number of none-zero coefficientsd
: Distribution of phenotypes. SpecifyNormal
for quantitative traits,Bernoulli
for binary traits,Poisson
orNegativeBinomial
for count traits, andMvNormal
for multiple quantitative traits.
Optional Arguments
phenotypes
: Phenotype file name (String
), an integer, or vector of integer. Integer(s) coresponds to the column(s) of PLINK's.fam
file that stores phenotypes (defaultphenotypes=6
). Enter multiple integers for multivariate analysis (e.g.phenotypes=[6, 7]
). We recognize missing phenotypes asNA
or-9
. For quantitative traits (univariate or multivariate), missing phenotypes are imputed with the mean. Binary and count phenotypes cannot be imputed. Phenotype files are read usingreaddlm
function in Julia base. We require each subject's phenotype to occupy a different row, and for multivariate analysis, each phenotype is comma separated. The phenotype file should not include a header line. Each row should be listed in the same order as in the PLINK and (for multivariate analysis) be comma separated.covariates
: Covariate file name. Defaultcovariates=""
(in which case an intercept term will be automatically included). Ifcovariates
file specified, it will be read usingreaddlm
function in Julia base. We require the covariate file to be comma separated, and not include a header line. Each row should be listed in the same order as in the PLINK. The first column should be all 1s to indicate an intercept. All other columns not specified inexclude_std_idx
will be standardized to mean 0 variance 1.summaryfile
: Output file name for saving IHT's summary statistics. Defaultsummaryfile="iht.summary.txt"
.betafile
: Output file name for saving IHT's estimated genotype effect sizes. Defaultbetafile="iht.beta.txt"
.covariancefile
: Output file name for saving IHT's estimated trait covariance matrix for multivariate analysis. Defaultcovariancefile="iht.cov.txt"
.exclude_std_idx
: Indices of non-genetic covariates that should be excluded from standardization.dosage
: Currently only guaranteed to work for VCF files. Iftrue
, will read genotypes dosages (i.e.X[i, j] ∈ [0, 2]
before standardizing)- All optional arguments available in
fit_iht
Output
A fitted IHTResult
(univariate analysis) or mIHTResult
(multivariate analysis). In addition, a human-readable text file is also outputted with file name specified with betafile
.
MendelIHT.cross_validate
— Functioncross_validate(filename, d, path=1:20, phenotypes=6, covariates="",
cv_summaryfile="cviht.summary.txt", q=5, kwargs...)
Runs cross-validation to determinal optimal sparsity level k
. Different sparsity levels are specified in path
.
Arguments
filename
: AString
for VCF, binary PLINK, or BGEN file. VCF files should end in.vcf
or.vcf.gz
. Binary PLINK files should exclude.bim/.bed/.fam
trailings but the trio must all be present in the same directory. BGEN files should end in.bgen
.d
: Distribution of phenotypes. SpecifyNormal
for quantitative traits,Bernoulli
for binary traits,Poisson
orNegativeBinomial
for count traits, andMvNormal
for multiple quantitative traits.
Optional Arguments
path
: Different values ofk
that should be tested. One can input a vector ofInt
(e.g.path=[5, 10, 15, 20]
) or a range (defaultpath=1:20
).phenotypes
: Phenotype file name (String
), an integer, or vector of integer. Integer(s) coresponds to the column(s) of.fam
file that stores phenotypes (default 6). We recognize missing phenotypes asNA
or-9
. For quantitative traits (univariate or multivariate), missing phenotypes are imputed with the mean. Binary and count phenotypes cannot be imputed. Phenotype files are read usingreaddlm
function in Julia base. We require each subject's phenotype to occupy a different row, and for multivariate analysis, each phenotype is comma separated. The phenotype file should not include a header line. Each row should be listed in the same order as in the PLINK.covariates
: Covariate file name. Defaultcovariates=""
(in which case an intercept term will be automatically included). Ifcovariates
file specified, it will be read usingreaddlm
function in Julia base. We require the covariate file to be comma separated, and not include a header line. Each row should be listed in the same order as in the PLINK. The first column should be all 1s to indicate an intercept. All other columns not specified inexclude_std_idx
will be standardized to mean 0 variance 1cv_summaryfile
: Output file name for saving IHT's cross validation summary statistics. Defaultcv_summaryfile="cviht.summary.txt"
.q
: Number of cross validation folds. Larger means more accurate and more computationally intensive. Should be larger 2 and smaller than 10. Defaultq=5
.dosage
: Currently only guaranteed to work for VCF files. Iftrue
, will read genotypes dosages (i.e.X[i, j] ∈ [0, 2]
before standardizing)- All optional arguments available in
cv_iht
Output
A vector mse
of cross-validated mean-squared errors (technically deviance residuals), where mse[i]
is the error of sparsity level path[i]
. In addition, a human-readable text file is also outputted with file name specified with cv_summaryfile
.
Core Functions
Users can also use the fit_iht
and cv_iht
functions directly. One must import genotypes via SnpArrays.jl (and use SnpLinAlg
type for x
argument) or VCFTools.jl. Phenotypes/covariates must also be imported using Julia's standard routine (typically with Base.readdlm
).
MendelIHT.fit_iht
— Functionfit_iht(y, x, z; k=10, d = Normal(), l=IdentityLink(),...)
Fits a model on design matrix (genotype data) x
, response (phenotype) y
, and non-genetic covariates z
on a specific sparsity parameter k
. Only predictors in x
will be subject to sparsity constraint, unless zkeep
keyword is specified.
Arguments:
y
: Phenotype vector or matrix. Should be anArray{T, 1}
(single traits) orArray{T, 2}
(multivariate Gaussian traits). For multivariate traits, each column ofy
should be a sample.x
: Genotype matrix (anArray{T, 2}
orSnpLinAlg
). For univariate analysis, samples are rows ofx
. For multivariate analysis, samples are columns ofx
(i.e. inputTranspose(x)
forSnpLinAlg
)z
: Matrix of non-genetic covariates of typeArray{T, 2}
orArray{T, 1}
. For univariate analysis, sample covariates are rows ofz
. For multivariate analysis, sample covariates are columns ofz
. If this is not specified, an intercept term will be included automatically. Ifz
is specified, make sure the first column (row) is all 1s to represent the intercept.
Optional Arguments:
k
: Number of non-zero predictors. Can be a constant or a vector (for group IHT).J
: The number of maximum groups (set as 1 if no group infomation available)d
: Distribution of phenotypes. SpecifyNormal()
for quantitative traits,Bernoulli()
for binary traits,Poisson()
orNegativeBinomial()
for count traits, andMvNormal()
for multiple quantitative traits.l
: A link function. The recommended link functions arel=IdentityLink()
for quantitative traits,l=LogitLink()
for binary traits,l=LogLink()
for Poisson distribution, andl=Loglink()
for NegativeBinomial distribution. For multivariate analysis, the choice of link does not matter.group
: vector storing (non-overlapping) group membershipweight
: vector storing vector of weights containing prior knowledge on each SNPzkeep
: BitVector determining whether non-genetic covariates inz
will be subject to sparsity constraint.zkeep[i] = true
means covariatei
will NOT be projected. Note covariates forced in the model are not subject to sparsity constraintk
.est_r
: Symbol (:MM
,:Newton
or:None
) to estimate nuisance parameters for negative binomial regressionuse_maf
: boolean indicating whether we want to scale projection with minor allele frequencies (see paper)debias
: boolean indicating whether we debias at each iterationverbose
: boolean indicating whether we want to print intermediate resultstol
: used to track convergencemax_iter
: is the maximum IHT iteration for a model to converge. Defaults to 200, or 100 for cross validationmin_iter
: is the minimum IHT iteration before checking for convergence. Defaults to 5.max_step
: is the maximum number of backtracking per IHT iteration. Defaults 3io
: AnIO
object for displaying intermediate results. Defaultstdout
.init_beta
: Whether to initialize beta values to univariate regression values. Currently only Gaussian traits can be initialized. Defaultfalse
.memory_efficient
: Iftrue,
it will cause ~1.1 times slow down but one only needs to store the genotype matrix (requiring 2np bits for PLINK binary files and8np
bytes for other formats). Ifmemory_efficient=false
, one also need to store a8nk
byte matrix wherek
is the sparsity levels.
Output
- An
IHTResult
(for single-trait analysis) ormIHTResult
(for multivariate analysis).
Group IHT
If k
is a constant, then each group will have the same sparsity level. To run doubly sparse IHT with varying group sparsities, construct k
to be a vector where k[i]
indicates the max number of predictors for group i
.
MendelIHT.cv_iht
— Functioncv_iht(y, x, z; path=1:20, q=5, d=Normal(), l=IdentityLink(), ...)
For each model specified in path
, performs q
-fold cross validation and returns the (averaged) deviance residuals. The purpose of this function is to find the best sparsity level k
, obtained from selecting the model with the minimum out-of-sample error. Note sparsity is enforced on x
only, unless zkeep
keyword is specified.
To check if multithreading is enabled, check output of Threads.nthreads()
.
Arguments:
y
: Phenotype vector or matrix. Should be anArray{T, 1}
(single traits) orArray{T, 2}
(multivariate Gaussian traits). For multivariate traits, each column ofy
should be a sample.x
: Genotype matrix (anArray{T, 2}
orSnpLinAlg
). For univariate analysis, samples are rows ofx
. For multivariate analysis, samples are columns ofx
(i.e. inputTranspose(x)
forSnpLinAlg
)z
: Matrix of non-genetic covariates of typeArray{T, 2}
orArray{T, 1}
. For univariate analysis, sample covariates are rows ofz
. For multivariate analysis, sample covariates are columns ofz
. If this is not specified, an intercept term will be included automatically. Ifz
is specified, make sure the first column (row) is all 1s to represent the intercept.
Optional Arguments:
path
: Different values ofk
that should be tested. One can input a vector ofInt
(e.g.path=[5, 10, 15, 20]
) or a range (defaultpath=1:20
).q
: Number of cross validation folds. Larger means more accurate and more computationally intensive. Should be larger 2 and smaller than 10. Defaultq=5
.d
: Distribution of phenotypes. SpecifyNormal()
for quantitative traits,Bernoulli()
for binary traits,Poisson()
orNegativeBinomial()
for count traits, andMvNormal()
for multiple quantitative traits.l
: A link function. The recommended link functions arel=IdentityLink()
for quantitative traits,l=LogitLink()
for binary traits,l=LogLink()
for Poisson distribution, andl=Loglink()
for NegativeBinomial distribution. For multivariate analysis, the choice of link does not matter.zkeep
: BitVector determining whether non-genetic covariates inz
will be subject to sparsity constraint.zkeep[i] = true
means covariatei
will NOT be projected. Note covariates forced in the model are not subject to sparsity constraints inpath
.est_r
: Symbol (:MM
,:Newton
or:None
) to estimate nuisance parameters for negative binomial regressiongroup
: vector storing group membership for each predictorweight
: vector storing vector of weights containing prior knowledge on each predictorfolds
: Vector that separates the sample intoq
disjoint subsetsdebias
: Boolean indicating whether we should debias at each IHT step. Defaultsfalse
verbose
: Boolean indicating whether to print progress and mean squared error for eachk
inpath
. Defaultstrue
max_iter
: is the maximum IHT iteration for a model to converge. Defaults to 100min_iter
: is the minimum IHT iteration before checking for convergence. Defaults to 5.init_beta
: Whether to initialize beta values to univariate regression values. Currently only Gaussian traits can be initialized. Defaultfalse
.memory_efficient
: Iftrue,
it will cause ~1.1 times slow down but one only needs to store the genotype matrix (requiring 2np bits for PLINK binary files and8np
bytes for other formats). Ifmemory_efficient=false
, one may potentially storet
sparse matrices of dimension8nk
bytes wherek
are the cross validated sparsity levels.
Output
mse
: A vector of mean-squared error for eachk
specified inpath
.
Specifying Groups and Weights
When you have group and weight information, you input them as optional arguments in fit_iht
and cv_iht
. The weight vector is a vector of Float64
, while the group vector is a vector of Int
. For instance,
g = #import group vector
w = #import weight vector
ng = length(unique(g)) # specify number of non-zero groups
result = fit_iht(y, x, z; J=ng, k=10, d=Normal(), l=IdentityLink(), group=g, weight=w)
Simulation Utilities
For complex simulations, please use TraitSimulation.jl.
MendelIHT provides very naive simulation utilities, which were written before TraitSimulation.jl was developed.
MendelIHT.simulate_random_snparray
— Functionsimulate_random_snparray(s::String, n::Integer, p::Integer;
[mafs::Vector{Float64}], [min_ma::Integer])
Creates a random SnpArray in the current directory without missing value, where each SNP has ⫺5 (default) minor alleles.
Note: if supplied minor allele frequency is extremely small, it could take a long time for the simulation to generate samples where at least min_ma
(defaults to 5) are present.
Arguments:
s
: name of SnpArray that will be created in the current directory. To not create file, useundef
.n
: number of samplesp
: number of SNPs
Optional Arguments:
mafs
: vector of desired minor allele freuqencies (uniform(0,0.5) by default)min_ma
: the minimum number of minor alleles that must be present for each SNP (defaults to 5)
MendelIHT.simulate_correlated_snparray
— Functionsimulate_correlated_snparray(s, n, p; block_length, hap, prob)
Simulates a SnpArray with correlation. SNPs are divided into blocks where each adjacent SNP is the same with probability prob. There are no correlation between blocks.
Arguments:
n
: number of samplesp
: number of SNPss
: name of SnpArray that will be created (memory mapped) in the current directory. To not memory map, useundef
.
Optional arguments:
block_length
: length of each LD blockhap
: number of haplotypes to simulate for each blockprob
: with probabilityprob
an adjacent SNP would be the same.
Simulating a SnpArray with $n$ subjects and $p$ SNPs requires up to $2np$ bits of RAM.
MendelIHT.simulate_random_response
— Functionsimulate_random_response(x, k, d, l; kwargs...)
This function simulates a random response (trait) vector y
. When the distribution d
is from Poisson, Gamma, or Negative Binomial, we simulate β ∼ N(0, 0.3)
to roughly ensure the mean of response y
doesn't become too large. For other distributions, we choose β ∼ N(0, 1)
.
Arguments
x
: Design matrixk
: the true number of predictors.d
: The distribution of the simulated trait (notetypeof(d) = UnionAll
buttypeof(d())
is an actual distribution: e.g. Normal)l
: The link function. Inputcanonicallink(d())
if you want to use the canonical link ofd
.
Optional arguments
r
: The number of success until stopping in negative binomial regression, defaults to 10α
: Shape parameter of the gamma distribution, defaults to 1Zu
: Effect of non-genetic covariates.Zu
should have dimensionn × 1
.
simulate_random_response(x, k, traits)
Simulates a response matrix Y
where each row is an independent multivariate Gaussian with length trait
. There are k
non-zero β
over all traits. Each trait shares overlap
causal SNPs. The covariance matrix Σ
is positive definite and symmetric.
Arguments
x
: Design matrix of dimensionn × p
. Each row is a sample.k
: the total true number of causal SNPs (predictors)traits
: Number of traits
Optional arguments
Zu
: Effect of non-genetic covariates.Zu
should have dimensionn × traits
.overlap
: Number of causal SNPs shared by all traits. Shared SNPs does not have the same effect size.
Outputs
Y
: Response matrix where each row is sampled from a multivariate normal with meanμ[i] = X[i, :] * true_b
and varianceΣ
Σ
: the symmetric, positive definite covariance matrix usedtrue_b
: A sparse matrix containing true beta values.correct_position
: Non-zero indices oftrue_b
For negative binomial and gamma, the link function must be LogLink.
MendelIHT.make_bim_fam_files
— Functionmake_bim_fam_files(x::SnpArray, y, name::String)
Creates .bim and .bed files from a SnpArray.
Arguments:
x
: A SnpArray (i.e..bed
file on the disk) for which you wish to create corresponding.bim
and.fam
files.name
: string that should match the.bed
file (Do not include.bim
or.fam
extensions inname
).y
: Trait vector that will go in to the 6th column of.fam
file.
Other Useful Functions
MendelIHT additionally provides useful utilities that may be of interest to a few advanced users.
MendelIHT.iht_run_many_models
— FunctionRuns IHT across many different model sizes specifed in path
using the full design matrix. Same as cv_iht
but DOES NOT validate in a holdout set, meaning that this will definitely induce overfitting as we increase model size. Use this if you want to quickly estimate a range of feasible model sizes before engaging in full cross validation.
MendelIHT.pve
— Functionpve(y, X, β; l = IdentityLink())
Estimates phenotype's Proportion of Variance Explained (PVE) by typed genotypes (i.e. chip heritability or SNP heritability).
Model
We compute Var(ŷ) / Var(y)
where y
is the raw phenotypes, X
contains all the genotypes, and ŷ = Xβ
is the predicted (average) phenotype values from the statistical model β. Intercept is NOT included.
Missing docstring for parse_genotypes
. Check Documenter's build log for details.
Missing docstring for convert_gt
. Check Documenter's build log for details.