Genetic Relationship Matrix
For homogeneous population, the grm function computes an empirical kinship matrix using either the classical genetic relationship matrix (model=:GRM
), the method of moment method, (model=:MoM
), or the robust method (model=:Robust
). See the section Kinship Comparison of this manuscript for the formulae and references for these methods.
using VCFTools
using BenchmarkTools
using Random
Example data
We use a small dataset for illustration.
vcf = "test.08Jun17.d8b.vcf.gz"
isfile(vcf) || download("http://faculty.washington.edu/browning/beagle/test.08Jun17.d8b.vcf.gz",
joinpath(pwd(), "test.08Jun17.d8b.vcf.gz"))
true
Classical genetic relation matrix:
Φgrm = grm(vcf, method=:GRM)
191×191 Array{Union{Missing, Float64},2}:
0.509639 -0.173137 0.139009 0.344595 … -0.0538013 0.118336
-0.173137 0.269917 -0.0378611 -0.168066 0.0149373 -0.0938038
0.139009 -0.0378611 0.544581 0.0929861 -0.0811612 -0.108824
0.344595 -0.168066 0.0929861 0.632202 0.0380955 0.197141
0.135174 0.0502916 -0.104267 0.0336621 0.0569862 0.151167
0.246076 -0.014241 0.179227 0.19268 … -0.139676 -0.165812
0.020906 -0.139727 -0.139303 0.0851532 0.0886343 0.2662
0.215287 -0.0120802 0.174532 0.180937 -0.13446 -0.150215
-0.0857512 0.101322 0.0273445 -0.124995 -0.0284517 -0.133978
-0.0334045 0.00105786 -0.0847244 0.0192485 0.17522 0.28381
0.191035 -0.194461 0.123831 0.343103 … -0.169409 -0.123299
0.0403932 0.0362218 -0.104964 0.0818739 0.120505 0.221341
0.119437 -0.159888 -0.154841 0.0654152 0.00605001 0.189286
⋮ ⋱ ⋮
-0.0151901 0.0145886 -0.0797194 -0.0432524 -0.11637 -0.144026
-0.176762 0.0709844 -0.0553092 -0.196741 … 0.0202614 -0.0684934
0.119737 -0.0772078 -0.0293833 0.192959 0.139404 0.465827
0.160204 -0.108528 -0.155567 0.32801 0.330113 0.587766
0.0114375 -0.135734 -0.134039 0.132396 0.0916851 0.258156
-0.0667285 -0.075066 -0.0659199 0.060112 -0.00138809 -0.00636518
-0.100007 0.0307077 -0.0525619 -0.0751086 … 0.204855 0.40322
0.158981 -0.116065 -0.15441 0.406356 0.314238 0.579388
-0.0155081 -0.0123674 -0.076561 -0.0252304 0.115077 0.370777
-0.138001 0.0857907 -0.0506552 -0.15243 0.0454182 -0.0203949
-0.0538013 0.0149373 -0.0811612 0.0380955 0.247693 0.269822
0.118336 -0.0938038 -0.108824 0.197141 … 0.269822 0.763134
# calculate GRM using Float64 = double precision
@btime grm($vcf, method=:GRM);
109.355 ms (520006 allocations: 42.78 MiB)
Computation time is dominated by data import. For large VCF files, we recommend converting to PLINK files and use SnpArrays.jl's GRM functions.
# time to import data
@btime convert_gt(Float64, $vcf);
70.635 ms (519985 allocations: 42.42 MiB)
By default, grm
exlcudes SNPs with minor allele frequency below 0.01. This can be changed by the keyword argument minmaf
.
# compute GRM excluding SNPs with MAF≤0.05
grm(vcf, method=:GRM, minmaf=0.05)
191×191 Array{Union{Missing, Float64},2}:
0.554328 -0.214182 0.170112 … -0.17178 -0.0647294 0.155598
-0.214182 0.305252 -0.0674007 0.103615 0.016486 -0.114942
0.170112 -0.0674007 0.357826 -0.0544721 -0.0910721 -0.139278
0.443756 -0.203137 0.136578 -0.185155 0.0550511 0.258983
0.136835 0.067131 -0.113638 0.0211501 0.0755193 0.198218
0.291432 -0.020033 0.215636 … -0.112002 -0.177116 -0.205093
0.0352073 -0.170828 -0.157507 -0.130001 0.115162 0.342286
0.275446 -0.0141623 0.21027 -0.0988044 -0.167418 -0.182397
-0.104754 0.124656 0.00681675 0.108201 -0.0378602 -0.165262
-0.0359602 0.00233001 -0.0923087 0.0609292 0.220443 0.361188
0.254598 -0.23304 0.178358 … -0.24739 -0.201663 -0.139165
0.0564595 0.0463671 -0.117655 0.00785399 0.151922 0.282956
0.155375 -0.199302 -0.180192 -0.115809 0.0085127 0.242738
⋮ ⋱ ⋮
-0.01318 0.0192445 -0.0860713 -0.102959 -0.144756 -0.174637
-0.218718 0.0866758 -0.0586966 … 0.143783 0.0231579 -0.08324
0.157352 -0.0941577 -0.0397914 -0.0489522 0.177118 0.361727
0.201673 -0.139738 -0.185857 -0.0729051 0.409593 0.737014
0.0359959 -0.153181 -0.138269 -0.169825 0.131629 0.344859
-0.0618942 -0.0772039 -0.0529604 -0.151762 0.01507 0.0135899
-0.117868 0.0409624 -0.0505296 … 0.102447 0.23121 0.301612
0.203299 -0.14602 -0.181252 -0.0664162 0.392868 0.729678
-0.0136508 -0.0145859 -0.0821884 0.0803987 0.145022 0.259378
-0.17178 0.103615 -0.0544721 0.197499 0.0530589 -0.0246084
-0.0647294 0.016486 -0.0910721 0.0530589 0.280131 0.340445
0.155598 -0.114942 -0.139278 … -0.0246084 0.340445 0.734055
To specify specific SNPs for calculating empirical kinship, use the cinds
keyword (default is nothing
). When cinds
is specified, minmaf
is ignored.
# robust GRM using every other SNP
grm(vcf, cinds=1:2:nrecords(vcf))
191×191 Array{Union{Missing, Float64},2}:
0.615936 -0.300454 0.119274 … -0.202922 -0.0880007 0.199078
-0.300454 0.374049 -0.0554153 0.106105 0.00604129 -0.169099
0.119274 -0.0554153 0.40731 -0.0868751 -0.0794463 -0.104096
0.523751 -0.285146 0.134582 -0.219862 0.0240505 0.321878
0.159964 0.0390207 -0.14321 0.0290594 0.0364882 0.151578
0.302406 -0.0227735 0.192718 … -0.150977 -0.208044 -0.211195
0.136271 -0.188909 -0.0916589 -0.123625 0.152535 0.450363
0.2916 -0.0228298 0.203411 -0.151033 -0.197351 -0.200502
-0.142535 0.134245 0.0272584 0.124283 -0.0617748 -0.247664
-0.0798966 -0.00735309 -0.0928407 0.0686797 0.237347 0.373936
0.351425 -0.253236 0.241737 … -0.295445 -0.159025 -0.0116866
0.0778531 0.0321547 -0.117829 0.0114441 0.126366 0.262954
0.259521 -0.259145 -0.172644 -0.0863686 0.0285528 0.326381
⋮ ⋱ ⋮
0.0865764 0.0516272 -0.0231112 -0.108824 -0.144392 -0.169042
-0.27631 0.118712 -0.0742687 … 0.162497 0.0194357 -0.155704
0.250798 -0.171125 0.0121194 -0.0843426 0.148821 0.414401
0.297453 -0.242711 -0.16696 -0.0699352 0.421211 0.805033
0.170601 -0.176077 -0.0788273 -0.16454 0.154618 0.430947
0.0591123 -0.0940788 0.0246696 -0.157787 0.0323798 0.104473
-0.184012 0.0605192 -0.0464669 … 0.0935549 0.283721 0.366564
0.292726 -0.22594 -0.160938 -0.0639134 0.373486 0.778807
-0.0181024 -0.0530515 -0.0847928 0.0874768 0.137903 0.252993
-0.202922 0.106105 -0.0868751 0.246634 0.0498263 -0.0285702
-0.0880007 0.00604129 -0.0794463 0.0498263 0.336736 0.376581
0.199078 -0.169099 -0.104096 … -0.0285702 0.376581 0.867896
High proportion of missing data
By default, missing data is imputed to the mean of each SNP. If genotype missing rate is too high (e.g. >10%), this introduces biases and reduces the variance. The appendix of this paper proposes to scale the estimated kinship values as
\[\Phi^c_{ij} = \frac{\Phi_{ij}}{(1 - m_i)(1 - m_j)}\]
where $\Phi^c_{ij}$ is the corrected kinship coefficient between samples $i$ and $j$, $\Phi_{ij}$ is the estimated kinship coefficient that initializes missing by the mean, and $m_i$, $m_j$ are the proportion of missing data for individuals $i$ and $j$.
# randomly mask genotypes
Random.seed!(2020)
masked_vcf = "masked.test.08Jun17.d8b.vcf.gz"
p, n = nrecords(vcf), nsamples(vcf)
mask_gt(vcf, bitrand(p, n), des=masked_vcf) # create vcf file with missings
# estimate GRM, adjusting for missing initialization
Φrbs_sm = grm(masked_vcf, method=:Robust, scale_missing=true)
191×191 Array{Union{Missing, Float64},2}:
1.30588 -0.290492 0.0689064 … -0.276493 -0.111094 0.228753
-0.290492 0.651879 -0.0633084 0.120228 -0.0162717 -0.138165
0.0689064 -0.0633084 0.745551 -0.0492747 -0.0907247 -0.15895
0.430492 -0.247225 0.109645 -0.19945 0.0501855 0.172614
0.126902 0.0610365 -0.102666 -0.0522749 0.00563318 0.130814
0.297125 -0.00438906 0.166185 … -0.11928 -0.267808 -0.206888
0.158559 -0.201101 -0.0921285 -0.170932 0.133963 0.468156
0.291988 -0.0396832 0.228577 -0.130672 -0.212078 -0.184651
-0.0933044 0.0609117 -0.0205374 0.119006 -0.0707123 -0.149736
-0.0986155 0.0212019 -0.104169 0.0411152 0.247845 0.283098
0.394477 -0.336916 0.156454 … -0.270127 -0.106199 -0.054917
0.101792 0.0190197 -0.116953 -0.0284481 0.12791 0.180167
0.265324 -0.273398 -0.131791 -0.143237 0.0624757 0.318722
⋮ ⋱ ⋮
0.108532 0.0291635 -0.0279842 -0.14845 -0.226982 -0.17088
-0.330036 0.152157 -0.0822878 … 0.202119 0.0365194 -0.141963
0.293801 -0.166514 -0.0953181 -0.081469 0.204271 0.500274
0.301582 -0.173733 -0.140419 -0.16801 0.392107 0.761084
0.255879 -0.203995 -0.0920963 -0.226188 0.247413 0.436354
0.0534598 -0.0868044 0.0217071 -0.11379 0.0803161 0.107935
-0.239608 0.0928141 -0.00405421 … 0.116603 0.287398 0.227279
0.261594 -0.226794 -0.154149 -0.158804 0.468693 0.62726
-0.0638803 0.0142407 -0.073264 0.085191 0.126589 0.230395
-0.276493 0.120228 -0.0492747 0.452869 0.0445575 -0.0420102
-0.111094 -0.0162717 -0.0907247 0.0445575 0.706254 0.383406
0.228753 -0.138165 -0.15895 … -0.0420102 0.383406 1.51661
Note: The resulting kinship values may have values exceedingly large (e.g. > 1) if $m_i$ or $m_j$ is close to 1. Please proceed with caution.
Inhomogeneous populations
If your population is admixed, we recommend one to follow this protocol in SnpArrays.jl.