API

Here is a list of available function calls. A detailed description can be found below.

Wrapper Functions

Most users will use the following wrapper functions. Users specify location of PLINK files and possibly the phenotype/covariate files. These functions will soon be updated to support VCF and BGEN formats.

MendelIHT.ihtFunction
iht(filename, k, d, phenotypes=6, covariates="", summaryfile="iht.summary.txt",
    betafile="iht.beta.txt", kwargs...)

Runs IHT with sparsity level k.

Arguments

  • filename: A String for VCF, binary PLINK, or BGEN file. VCF files should end in .vcf or .vcf.gz. Binary PLINK files should exclude .bim/.bed/.fam trailings but the trio must all be present in the same directory. BGEN files should end in .bgen.
  • k: An Int for sparsity parameter = number of none-zero coefficients
  • d: Distribution of phenotypes. Specify Normal for quantitative traits, Bernoulli for binary traits, Poisson or NegativeBinomial for count traits, and MvNormal for multiple quantitative traits.

Optional Arguments

  • phenotypes: Phenotype file name (String), an integer, or vector of integer. Integer(s) coresponds to the column(s) of PLINK's .fam file that stores phenotypes (default phenotypes=6). Enter multiple integers for multivariate analysis (e.g. phenotypes=[6, 7]). We recognize missing phenotypes as NA or -9. For quantitative traits (univariate or multivariate), missing phenotypes are imputed with the mean. Binary and count phenotypes cannot be imputed. Phenotype files are read using readdlm function in Julia base. We require each subject's phenotype to occupy a different row, and for multivariate analysis, each phenotype is comma separated. The phenotype file should not include a header line. Each row should be listed in the same order as in the PLINK and (for multivariate analysis) be comma separated.
  • covariates: Covariate file name. Default covariates="" (in which case an intercept term will be automatically included). If covariates file specified, it will be read using readdlm function in Julia base. We require the covariate file to be comma separated, and not include a header line. Each row should be listed in the same order as in the PLINK. The first column should be all 1s to indicate an intercept. All other columns not specified in exclude_std_idx will be standardized to mean 0 variance 1.
  • summaryfile: Output file name for saving IHT's summary statistics. Default summaryfile="iht.summary.txt".
  • betafile: Output file name for saving IHT's estimated genotype effect sizes. Default betafile="iht.beta.txt".
  • covariancefile: Output file name for saving IHT's estimated trait covariance matrix for multivariate analysis. Default covariancefile="iht.cov.txt".
  • exclude_std_idx: Indices of non-genetic covariates that should be excluded from standardization.
  • dosage: Currently only guaranteed to work for VCF files. If true, will read genotypes dosages (i.e. X[i, j] ∈ [0, 2] before standardizing)
  • All optional arguments available in fit_iht

Output

A fitted IHTResult (univariate analysis) or mIHTResult (multivariate analysis). In addition, a human-readable text file is also outputted with file name specified with betafile.

source
MendelIHT.cross_validateFunction
cross_validate(filename, d, path=1:20, phenotypes=6, covariates="", 
    cv_summaryfile="cviht.summary.txt", q=5, kwargs...)

Runs cross-validation to determinal optimal sparsity level k. Different sparsity levels are specified in path.

Arguments

  • filename: A String for VCF, binary PLINK, or BGEN file. VCF files should end in .vcf or .vcf.gz. Binary PLINK files should exclude .bim/.bed/.fam trailings but the trio must all be present in the same directory. BGEN files should end in .bgen.
  • d: Distribution of phenotypes. Specify Normal for quantitative traits, Bernoulli for binary traits, Poisson or NegativeBinomial for count traits, and MvNormal for multiple quantitative traits.

Optional Arguments

  • path: Different values of k that should be tested. One can input a vector of Int (e.g. path=[5, 10, 15, 20]) or a range (default path=1:20).
  • phenotypes: Phenotype file name (String), an integer, or vector of integer. Integer(s) coresponds to the column(s) of .fam file that stores phenotypes (default 6). We recognize missing phenotypes as NA or -9. For quantitative traits (univariate or multivariate), missing phenotypes are imputed with the mean. Binary and count phenotypes cannot be imputed. Phenotype files are read using readdlm function in Julia base. We require each subject's phenotype to occupy a different row, and for multivariate analysis, each phenotype is comma separated. The phenotype file should not include a header line. Each row should be listed in the same order as in the PLINK.
  • covariates: Covariate file name. Default covariates="" (in which case an intercept term will be automatically included). If covariates file specified, it will be read using readdlm function in Julia base. We require the covariate file to be comma separated, and not include a header line. Each row should be listed in the same order as in the PLINK. The first column should be all 1s to indicate an intercept. All other columns not specified in exclude_std_idx will be standardized to mean 0 variance 1
  • cv_summaryfile: Output file name for saving IHT's cross validation summary statistics. Default cv_summaryfile="cviht.summary.txt".
  • q: Number of cross validation folds. Larger means more accurate and more computationally intensive. Should be larger 2 and smaller than 10. Default q=5.
  • dosage: Currently only guaranteed to work for VCF files. If true, will read genotypes dosages (i.e. X[i, j] ∈ [0, 2] before standardizing)
  • All optional arguments available in cv_iht

Output

A vector mse of cross-validated mean-squared errors (technically deviance residuals), where mse[i] is the error of sparsity level path[i]. In addition, a human-readable text file is also outputted with file name specified with cv_summaryfile.

source

Core Functions

Users can also use the fit_iht and cv_iht functions directly. One must import genotypes via SnpArrays.jl (and use SnpLinAlg type for x argument) or VCFTools.jl. Phenotypes/covariates must also be imported using Julia's standard routine (typically with Base.readdlm).

MendelIHT.fit_ihtFunction
fit_iht(y, x, z; k=10, d = Normal(), l=IdentityLink(),...)

Fits a model on design matrix (genotype data) x, response (phenotype) y, and non-genetic covariates z on a specific sparsity parameter k. Only predictors in x will be subject to sparsity constraint, unless zkeep keyword is specified.

Arguments:

  • y: Phenotype vector or matrix. Should be an Array{T, 1} (single traits) or Array{T, 2} (multivariate Gaussian traits). For multivariate traits, each column of y should be a sample.
  • x: Genotype matrix (an Array{T, 2} or SnpLinAlg). For univariate analysis, samples are rows of x. For multivariate analysis, samples are columns of x (i.e. input Transpose(x) for SnpLinAlg)
  • z: Matrix of non-genetic covariates of type Array{T, 2} or Array{T, 1}. For univariate analysis, sample covariates are rows of z. For multivariate analysis, sample covariates are columns of z. If this is not specified, an intercept term will be included automatically. If z is specified, make sure the first column (row) is all 1s to represent the intercept.

Optional Arguments:

  • k: Number of non-zero predictors. Can be a constant or a vector (for group IHT).
  • J: The number of maximum groups (set as 1 if no group infomation available)
  • d: Distribution of phenotypes. Specify Normal() for quantitative traits, Bernoulli() for binary traits, Poisson() or NegativeBinomial() for count traits, and MvNormal() for multiple quantitative traits.
  • l: A link function. The recommended link functions are l=IdentityLink() for quantitative traits, l=LogitLink() for binary traits, l=LogLink() for Poisson distribution, and l=Loglink() for NegativeBinomial distribution. For multivariate analysis, the choice of link does not matter.
  • group: vector storing (non-overlapping) group membership
  • weight: vector storing vector of weights containing prior knowledge on each SNP
  • zkeep: BitVector determining whether non-genetic covariates in z will be subject to sparsity constraint. zkeep[i] = true means covariate i will NOT be projected. Note covariates forced in the model are not subject to sparsity constraint k.
  • est_r: Symbol (:MM, :Newton or :None) to estimate nuisance parameters for negative binomial regression
  • use_maf: boolean indicating whether we want to scale projection with minor allele frequencies (see paper)
  • debias: boolean indicating whether we debias at each iteration
  • verbose: boolean indicating whether we want to print intermediate results
  • tol: used to track convergence
  • max_iter: is the maximum IHT iteration for a model to converge. Defaults to 200, or 100 for cross validation
  • min_iter: is the minimum IHT iteration before checking for convergence. Defaults to 5.
  • max_step: is the maximum number of backtracking per IHT iteration. Defaults 3
  • io: An IO object for displaying intermediate results. Default stdout.
  • init_beta: Whether to initialize beta values to univariate regression values. Currently only Gaussian traits can be initialized. Default false.
  • memory_efficient: If true, it will cause ~1.1 times slow down but one only needs to store the genotype matrix (requiring 2np bits for PLINK binary files and 8np bytes for other formats). If memory_efficient=false, one also need to store a 8nk byte matrix where k is the sparsity levels.

Output

  • An IHTResult (for single-trait analysis) or mIHTResult (for multivariate analysis).

Group IHT

If k is a constant, then each group will have the same sparsity level. To run doubly sparse IHT with varying group sparsities, construct k to be a vector where k[i] indicates the max number of predictors for group i.

source
MendelIHT.cv_ihtFunction
cv_iht(y, x, z; path=1:20, q=5, d=Normal(), l=IdentityLink(), ...)

For each model specified in path, performs q-fold cross validation and returns the (averaged) deviance residuals. The purpose of this function is to find the best sparsity level k, obtained from selecting the model with the minimum out-of-sample error. Note sparsity is enforced on x only, unless zkeep keyword is specified.

To check if multithreading is enabled, check output of Threads.nthreads().

Arguments:

  • y: Phenotype vector or matrix. Should be an Array{T, 1} (single traits) or Array{T, 2} (multivariate Gaussian traits). For multivariate traits, each column of y should be a sample.
  • x: Genotype matrix (an Array{T, 2} or SnpLinAlg). For univariate analysis, samples are rows of x. For multivariate analysis, samples are columns of x (i.e. input Transpose(x) for SnpLinAlg)
  • z: Matrix of non-genetic covariates of type Array{T, 2} or Array{T, 1}. For univariate analysis, sample covariates are rows of z. For multivariate analysis, sample covariates are columns of z. If this is not specified, an intercept term will be included automatically. If z is specified, make sure the first column (row) is all 1s to represent the intercept.

Optional Arguments:

  • path: Different values of k that should be tested. One can input a vector of Int (e.g. path=[5, 10, 15, 20]) or a range (default path=1:20).
  • q: Number of cross validation folds. Larger means more accurate and more computationally intensive. Should be larger 2 and smaller than 10. Default q=5.
  • d: Distribution of phenotypes. Specify Normal() for quantitative traits, Bernoulli() for binary traits, Poisson() or NegativeBinomial() for count traits, and MvNormal() for multiple quantitative traits.
  • l: A link function. The recommended link functions are l=IdentityLink() for quantitative traits, l=LogitLink() for binary traits, l=LogLink() for Poisson distribution, and l=Loglink() for NegativeBinomial distribution. For multivariate analysis, the choice of link does not matter.
  • zkeep: BitVector determining whether non-genetic covariates in z will be subject to sparsity constraint. zkeep[i] = true means covariate i will NOT be projected. Note covariates forced in the model are not subject to sparsity constraints in path.
  • est_r: Symbol (:MM, :Newton or :None) to estimate nuisance parameters for negative binomial regression
  • group: vector storing group membership for each predictor
  • weight: vector storing vector of weights containing prior knowledge on each predictor
  • folds: Vector that separates the sample into q disjoint subsets
  • debias: Boolean indicating whether we should debias at each IHT step. Defaults false
  • verbose: Boolean indicating whether to print progress and mean squared error for each k in path. Defaults true
  • max_iter: is the maximum IHT iteration for a model to converge. Defaults to 100
  • min_iter: is the minimum IHT iteration before checking for convergence. Defaults to 5.
  • init_beta: Whether to initialize beta values to univariate regression values. Currently only Gaussian traits can be initialized. Default false.
  • memory_efficient: If true, it will cause ~1.1 times slow down but one only needs to store the genotype matrix (requiring 2np bits for PLINK binary files and 8np bytes for other formats). If memory_efficient=false, one may potentially store t sparse matrices of dimension 8nk bytes where k are the cross validated sparsity levels.

Output

  • mse: A vector of mean-squared error for each k specified in path.
source

Specifying Groups and Weights

When you have group and weight information, you input them as optional arguments in fit_iht and cv_iht. The weight vector is a vector of Float64, while the group vector is a vector of Int. For instance,

    g = #import group vector
    w = #import weight vector
    ng = length(unique(g)) # specify number of non-zero groups
    result = fit_iht(y, x, z; J=ng, k=10, d=Normal(), l=IdentityLink(), group=g, weight=w)

Simulation Utilities

For complex simulations, please use TraitSimulation.jl.

MendelIHT provides very naive simulation utilities, which were written before TraitSimulation.jl was developed.

MendelIHT.simulate_random_snparrayFunction
simulate_random_snparray(s::String, n::Integer, p::Integer; 
    [mafs::Vector{Float64}], [min_ma::Integer])

Creates a random SnpArray in the current directory without missing value, where each SNP has ⫺5 (default) minor alleles.

Note: if supplied minor allele frequency is extremely small, it could take a long time for the simulation to generate samples where at least min_ma (defaults to 5) are present.

Arguments:

  • s: name of SnpArray that will be created in the current directory. To not create file, use undef.
  • n: number of samples
  • p: number of SNPs

Optional Arguments:

  • mafs: vector of desired minor allele freuqencies (uniform(0,0.5) by default)
  • min_ma: the minimum number of minor alleles that must be present for each SNP (defaults to 5)
source
MendelIHT.simulate_correlated_snparrayFunction
simulate_correlated_snparray(s, n, p; block_length, hap, prob)

Simulates a SnpArray with correlation. SNPs are divided into blocks where each adjacent SNP is the same with probability prob. There are no correlation between blocks.

Arguments:

  • n: number of samples
  • p: number of SNPs
  • s: name of SnpArray that will be created (memory mapped) in the current directory. To not memory map, use undef.

Optional arguments:

  • block_length: length of each LD block
  • hap: number of haplotypes to simulate for each block
  • prob: with probability prob an adjacent SNP would be the same.
source
Note

Simulating a SnpArray with $n$ subjects and $p$ SNPs requires up to $2np$ bits of RAM.

MendelIHT.simulate_random_responseFunction
simulate_random_response(x, k, d, l; kwargs...)

This function simulates a random response (trait) vector y. When the distribution d is from Poisson, Gamma, or Negative Binomial, we simulate β ∼ N(0, 0.3) to roughly ensure the mean of response y doesn't become too large. For other distributions, we choose β ∼ N(0, 1).

Arguments

  • x: Design matrix
  • k: the true number of predictors.
  • d: The distribution of the simulated trait (note typeof(d) = UnionAll but typeof(d()) is an actual distribution: e.g. Normal)
  • l: The link function. Input canonicallink(d()) if you want to use the canonical link of d.

Optional arguments

  • r: The number of success until stopping in negative binomial regression, defaults to 10
  • α: Shape parameter of the gamma distribution, defaults to 1
  • Zu: Effect of non-genetic covariates. Zu should have dimension n × 1.
source
simulate_random_response(x, k, traits)

Simulates a response matrix Y where each row is an independent multivariate Gaussian with length trait. There are k non-zero β over all traits. Each trait shares overlap causal SNPs. The covariance matrix Σ is positive definite and symmetric.

Arguments

  • x: Design matrix of dimension n × p. Each row is a sample.
  • k: the total true number of causal SNPs (predictors)
  • traits: Number of traits

Optional arguments

  • Zu: Effect of non-genetic covariates. Zu should have dimension n × traits.
  • overlap: Number of causal SNPs shared by all traits. Shared SNPs does not have the same effect size.

Outputs

  • Y: Response matrix where each row is sampled from a multivariate normal with mean μ[i] = X[i, :] * true_b and variance Σ
  • Σ: the symmetric, positive definite covariance matrix used
  • true_b: A sparse matrix containing true beta values.
  • correct_position: Non-zero indices of true_b
source
Note

For negative binomial and gamma, the link function must be LogLink.

MendelIHT.make_bim_fam_filesFunction
make_bim_fam_files(x::SnpArray, y, name::String)

Creates .bim and .bed files from a SnpArray.

Arguments:

  • x: A SnpArray (i.e. .bed file on the disk) for which you wish to create corresponding .bim and .fam files.
  • name: string that should match the .bed file (Do not include .bim or .fam extensions in name).
  • y: Trait vector that will go in to the 6th column of .fam file.
source

Other Useful Functions

MendelIHT additionally provides useful utilities that may be of interest to a few advanced users.

MendelIHT.iht_run_many_modelsFunction

Runs IHT across many different model sizes specifed in path using the full design matrix. Same as cv_iht but DOES NOT validate in a holdout set, meaning that this will definitely induce overfitting as we increase model size. Use this if you want to quickly estimate a range of feasible model sizes before engaging in full cross validation.

source
MendelIHT.pveFunction
pve(y, X, β; l = IdentityLink())

Estimates phenotype's Proportion of Variance Explained (PVE) by typed genotypes (i.e. chip heritability or SNP heritability).

Model

We compute Var(ŷ) / Var(y) where y is the raw phenotypes, X contains all the genotypes, and ŷ = Xβ is the predicted (average) phenotype values from the statistical model β. Intercept is NOT included.

source
Missing docstring.

Missing docstring for parse_genotypes. Check Documenter's build log for details.

Missing docstring.

Missing docstring for convert_gt. Check Documenter's build log for details.