Data from genome-wide association studies (GWAS) are often saved as a PLINK binary biallelic genotype table or .bed
file. To be useful, such files should be accompanied by a .fam
file, containing metadata on the rows of the table, and a .bim
file,
containing metadata on the columns. The .fam
and .bim
files are in tab-separated format.
The table contains the observed allelic type at n
single nucleotide polymorphism (SNP) positions for m
individuals. A SNP corresponds to a nucleotide position on the genome where some degree of variation has been observed in a population, with each individual have one of two possible alleles at that position on each of a pair of chromosomes. Three possible genotypes and corresponding coding are
Genotype | Plink/SnpArray |
---|---|
A1,A1 | 0x00 |
missing | 0x01 |
A1,A2 | 0x02 |
A2,A2 | 0x03 |
# dispay Julia version info
versioninfo()
using SnpArrays, Glob
An example data set, from a study published in 2006, is about 5 Mb in size. Data from recent studies, which have samples from tens of thousands of individuals at over a million SNP positions, would be in the tens or even hundreds of Gb range.
Example data location can be retrieved by using the following commands:
datapath = normpath(SnpArrays.datadir())
readdir(glob"mouse.*", datapath)
There are various ways to initialize a SnpArray. The easiest is to initilize from Plink files.
All three files (.bed
, .fam
, .bim
) need to be present.
mouse = SnpArray(SnpArrays.datadir("mouse.bed"))
The virtual size of the GWAS data is 1940 observations at each of 10150 SNP positions.
size(mouse)
Because the file is memory-mapped opening the file and accessing the data is fast, even for very large .bed files.
SnpArray can be initialized from Plink files in compressed formats. For a complete list, type
SnpArrays.ALLOWED_FORMAT
Let us first compress the mouse data in gz format. We see gz format takes less than 1/3 storage of original Plink files.
compress_plink(SnpArrays.datadir("mouse"), "gz")
readdir(glob"mouse.*.gz", datapath)
To initialize SnpArray from gzipped Plink file, simply used the bed file with name ending with .bed.gz
:
# requires corresponding `.fam.gz` file
SnpArray(SnpArrays.datadir("mouse.bed.gz"))
convert
and copyto!
¶Most common usage of SnpArray is to convert genotypes to numeric values for statistical analysis. The conversion rule depends on genetic models (additive, dominant, or recessive), centering, scaling, or imputation.
convert
¶convert
function has 4 keyword arguments: model
, center
, scale
, and impute
.
model
keyword specifies the SNP model for conversion. By default convert
function translates genotypes according to the additive SNP model, which essentially counts the number of A2 allele (0, 1 or 2) per genotype. Other SNP models are dominant and recessive, both in terms of the A2 allele.
Genotype | SnpArray |
model=ADDITIVE_MODEL |
model=DOMINANT_MODEL |
model=RECESSIVE_MODEL |
---|---|---|---|---|
A1,A1 | 0x00 | 0 | 0 | 0 |
missing | 0x01 | NaN | NaN | NaN |
A1,A2 | 0x02 | 1 | 1 | 0 |
A2,A2 | 0x03 | 2 | 1 | 1 |
center=true
tells convert
to center each column by its mean. Default is false
.
scale=true
tells convert
to scale each column by its standard deviation. Default is false
.
impute=true
tells convert
to impute missing genotypes (0x01) by column mean. Default is false
.
Convert the whole SnpArray to a Float64 matrix using defaults (model=ADDITIVE_MODEL
, center=false
, scale=false
, impute=false
)
convert(Matrix{Float64}, mouse)
When convert
or copyto!
is used on a slice or subarray of SnpArray, using view
, @view
or views
is necessary for both correctness and efficiency. Without view, it's simply converting the UInt8 coding in original bed file.
Convert a column to Float64 vector using defaults (model=ADDITIVE_MODEL
, center=false
, scale=false
, impute=false
).
# convert(Vector{Float64}, view(mouse, :, 1)) # alternative syntax
# @views convert(Vector{Float64}, mouse[:, 1]) # alternative syntax
convert(Vector{Float64}, @view(mouse[:, 1]))
Convert a subarray of SnpArray to a Float64 matrix using defaults (model=ADDITIVE_MODEL
, center=false
, scale=false
, impute=false
).
convert(Matrix{Float64}, @view(mouse[1:2:10, 1:2:10]))
Let's make a matrix of different SNP models (ADDITIVE_MODEL
vs DOMINANT_MODEL
vs RECESSIVE_MODEL
) for the same snp.
@views [convert(Vector{Float64}, mouse[:, 1], model=ADDITIVE_MODEL) convert(Vector{Float64}, mouse[:, 1], model=DOMINANT_MODEL) convert(Vector{Float64}, mouse[:, 1], model=RECESSIVE_MODEL)]
Center and scale (last column) while using convert
convert(Vector{Float64}, @view(mouse[:, end]), center=true, scale=true)
Center, scale, and impute (last column) while using convert
convert(Vector{Float64}, @view(mouse[:, end]), center=true, scale=true, impute=true)
copyto!
¶copyto!
is the in-place version of convert
. It takes the same keyword arguments (model
, center
, scale
, impute
) as convert
.
Looping over all columns (typical use e.g. in GWAS)
v = Vector{Float64}(undef, size(mouse, 1))
function loop_test(v, s)
for j in 1:size(s, 2)
copyto!(v, @view(s[:, j]), center=true, scale=true)
end
end
@time loop_test(v, mouse)
Copy the whole SnpArray
M = similar(mouse, Float64)
@time copyto!(M, mouse)
@time counts(mouse, dims=1)
Column 2 has no missing values (code 0x01
, the second row in the column-counts table).
In that SNP position for this sample, 359 individuals are homozygous allele 1 (G
according to the .bim
file), 1004 are heterozygous, and 577 are homozygous allele 2 (A
).
The counts by column and by row are cached in the SnpArray
object. Accesses after the first are extremely fast.
Minor allele frequencies (MAF) for each SNP.
maf(mouse)
Proportion of missing genotypes
missingrate(mouse, 1)
missingrate(mouse, 2)
grm
function computes the empirical kinship matrix using either the classical genetic relationship matrix, grm(A, model=:GRM)
, or the method of moment method, grm(A, model=:MoM)
, or the robust method, grm(A, model=:Robust)
.
Classical genetic relation matrix
# grm(mouse, method=:MoM)
# grm(mouse, method=:Robust)
grm(mouse, method=:GRM)
By default, grm
excludes SNPs with minor allele frequency below 0.01. This default can be changed by the keyword argument minmaf
.
# compute GRM excluding SNPs with MAF≤0.05
grm(mouse, minmaf=0.05)
To specify specific SNPs for calculating the empirical kinship, use the cinds
keyword (default is nothing
). When cinds
is specified, minmaf
is ignored.
# GRM using every other SNP
grm(mouse, cinds=1:2:size(mouse, 2))
Filter according to minimum success rates (1 - proportion of missing genotypes) per row and column
rowmask, colmask = SnpArrays.filter(mouse,
min_success_rate_per_row = 0.999,
min_success_rate_per_col = 0.999)
One may use the rowmask
and colmask
to filter and save the filtering results as Plink files.
SnpArrays.filter(SnpArrays.datadir("mouse"), rowmask, colmask)
Filter a set of Plink files according to row indices and column indices. By result, filtered Plink files are saved as srcname.filtered.bed
, srcname.filtered.fam
, and srcname.filtered.bim
, where srcname
is the source Plink file name. You can also specify the destination file name using keyword des
.
SnpArrays.filter(SnpArrays.datadir("mouse"), 1:5, 1:5)
# clean up
rm(SnpArrays.datadir("mouse") * ".filtered.bed")
rm(SnpArrays.datadir("mouse") * ".filtered.fam")
rm(SnpArrays.datadir("mouse") * ".filtered.bim")
Filter a set of Plink files according to logical vectors.
SnpArrays.filter(SnpArrays.datadir("mouse"), rowmask, colmask)
rm(SnpArrays.datadir("mouse") * ".filtered.bed")
rm(SnpArrays.datadir("mouse") * ".filtered.fam")
rm(SnpArrays.datadir("mouse") * ".filtered.bim")
In some applications we want to perform linear algebra using SnpArray directly without expanding it to numeric matrix. This is achieved by the SnpBitMatrix
type. The implementation assumes:
First let's load a data set without missing genotypes.
EUR = SnpArray(SnpArrays.datadir("EUR_subset.bed"))
To instantiate a SnpBitMatrix based on SnpArray,
EURbm = SnpBitMatrix{Float64}(EUR, model=ADDITIVE_MODEL, center=true, scale=true);
The constructor shares the same keyword arguments as the convert
or copyto!
functions. The type parameter, Float64
in this example, indicates the SnpBitMatrix acts like a Float64 matrix.
The memory usage of the SnpBitMatrix should be similar to the SnpArray, or equivalently bed file size, if model=ADDITIVE_MODEL
or half of that of SnpArray if model=DOMINANT_MODEL
or model=RECESSIVE_MODEL
.
Base.summarysize(EUR), Base.summarysize(EURbm)
A SnpBitMatrix acts similar to a regular matrix and responds to size
, eltype
, and SnpBitMatrix-vector multiplications.
@show size(EURbm)
@show eltype(EURbm)
@show typeof(EURbm) <: AbstractMatrix;
SnpBitMatrix-vector multiplication is mathematically equivalent to the corresponding Float matrix contained from convert
or copyto!
a SnpArray.
using LinearAlgebra
v1 = randn(size(EUR, 1))
v2 = randn(size(EUR, 2))
A = convert(Matrix{Float64}, EUR, model=ADDITIVE_MODEL, center=true, scale=true)
norm(EURbm * v2 - A * v2)
norm(EURbm' * v1 - A' * v1)
SnpBitMatrix
can be created from a subarray of SnpArray.
EURsub = @view EUR[1:2:100, 1:2:100]
EURsubbm = SnpBitMatrix{Float64}(EURsub, model=ADDITIVE_MODEL, center=true, scale=true);
Base.summarysize(EURsubbm)
@show size(EURsubbm)
@show eltype(EURsubbm)
@show typeof(EURsubbm) <: AbstractMatrix;
using LinearAlgebra
v1 = randn(size(EURsub, 1))
v2 = randn(size(EURsub, 2))
A = convert(Matrix{Float64}, EURsub, model=ADDITIVE_MODEL, center=true, scale=true)
norm(EURsubbm * v2 - A * v2)
norm(EURsubbm' * v1 - A' * v1)
We can create a SnpData
dataframe, which is a SnpArray
with information on SNP and subject appended.
EUR_data = SnpData(SnpArrays.datadir("EUR_subset"))
We can split SnpData
by SNP's choromosomes using split_plink
.
splitted = SnpArrays.split_plink(SnpArrays.datadir("EUR_subset"); prefix="tmp.chr.")
Let's take a SnpArray for chromosome 17.
piece = splitted["17"]
We can merge the split dictionary back into one SnpData using merge_plink
.
# write_plink is included here
# merged = SnpArrays.merge_plink("tmp.merged", splitted)
You can also merge the plink formatted files based on their common prefix.
# merged_from_splitted_files = merge_plink("tmp.chr";
# des = "tmp.merged.2")
# cleanup
isfile("tmp.merged.bim") && rm("tmp.merged.bim")
isfile("tmp.merged.fam") && rm("tmp.merged.fam")
isfile("tmp.merged.bed") && rm("tmp.merged.bed")
isfile("tmp.merged.2.bim") && rm("tmp.merged.2.bim")
isfile("tmp.merged.2.fam") && rm("tmp.merged.2.fam")
isfile("tmp.merged.2.bed") && rm("tmp.merged.2.bed")
for k in keys(splitted)
isfile("tmp.chr.$(k).bim") && rm("tmp.chr.$(k).bim")
isfile("tmp.chr.$(k).fam") && rm("tmp.chr.$(k).fam")
isfile("tmp.chr.$(k).bed") && rm("tmp.chr.$(k).bed")
end
SnpArrays provides users with the ability to rapidly and accurately manipulate very large genetic data sets.