SnpArrays.jl

Data from genome-wide association studies (GWAS) are often saved as a PLINK binary biallelic genotype table or .bed file. To be useful, such files should be accompanied by a .fam file, containing metadata on the rows of the table, and a .bim file, containing metadata on the columns. The .fam and .bim files are in tab-separated format.

The table contains the observed allelic type at n single nucleotide polymorphism (SNP) positions for m individuals. A SNP corresponds to a nucleotide position on the genome where some degree of variation has been observed in a population, with each individual have one of two possible alleles at that position on each of a pair of chromosomes. Three possible genotypes and corresponding coding are

Genotype Plink/SnpArray
A1,A1 0x00
missing 0x01
A1,A2 0x02
A2,A2 0x03
In [1]:
# dispay Julia version info
versioninfo()
Julia Version 1.2.0
Commit c6da87ff4b (2019-08-20 00:03 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.6.0)
  CPU: Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = code
In [2]:
using SnpArrays, Glob

Example data

An example data set, from a study published in 2006, is about 5 Mb in size. Data from recent studies, which have samples from tens of thousands of individuals at over a million SNP positions, would be in the tens or even hundreds of Gb range.

Example data location can be retrieved by using the following commands:

In [3]:
datapath = normpath(SnpArrays.datadir())
Out[3]:
"/Users/huazhou/.julia/packages/SnpArrays/Arv5H/data"
In [4]:
readdir(glob"mouse.*", datapath)
Out[4]:
3-element Array{String,1}:
 "/Users/huazhou/.julia/packages/SnpArrays/Arv5H/data/mouse.bed"
 "/Users/huazhou/.julia/packages/SnpArrays/Arv5H/data/mouse.bim"
 "/Users/huazhou/.julia/packages/SnpArrays/Arv5H/data/mouse.fam"

There are various ways to initialize a SnpArray. The easiest is to initilize from Plink files.

All three files (.bed, .fam, .bim) need to be present.

In [5]:
mouse = SnpArray(SnpArrays.datadir("mouse.bed"))
Out[5]:
1940×10150 SnpArray:
 0x02  0x02  0x02  0x02  0x03  0x02  …  0x03  0x03  0x03  0x03  0x03  0x03
 0x02  0x02  0x03  0x02  0x02  0x02     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x03  0x03  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x02  0x02  0x02  0x02  0x02  0x02     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x03  0x03  0x03     0x02  0x02  0x02  0x02  0x02  0x02
 0x02  0x02  0x02  0x02  0x03  0x02  …  0x03  0x03  0x03  0x03  0x03  0x03
 0x02  0x02  0x02  0x02  0x03  0x02     0x03  0x03  0x03  0x03  0x03  0x03
 0x02  0x02  0x03  0x02  0x02  0x02     0x03  0x03  0x03  0x03  0x03  0x03
 0x02  0x02  0x03  0x02  0x02  0x02     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x03  0x03  0x03     0x02  0x02  0x02  0x02  0x02  0x02
 0x03  0x03  0x03  0x03  0x03  0x03  …  0x00  0x00  0x00  0x00  0x00  0x00
 0x02  0x02  0x02  0x02  0x03  0x02     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x03  0x03  0x03     0x00  0x00  0x00  0x00  0x00  0x00
    ⋮                             ⋮  ⋱           ⋮                        
 0x03  0x03  0x03  0x03  0x03  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x03  0x03  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x02  0x02  0x02  0x02  0x02  0x02  …  0x03  0x03  0x03  0x03  0x03  0x03
 0x02  0x02  0x02  0x02  0x03  0x02     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x03  0x03  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x02  0x02  0x02  0x02  0x03  0x02     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x03  0x03  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x02  0x02  0x02  0x02  0x03  0x02  …  0x03  0x03  0x03  0x03  0x03  0x03
 0x02  0x02  0x03  0x02  0x02  0x02     0x03  0x03  0x03  0x03  0x03  0x03
 0x02  0x02  0x03  0x02  0x02  0x02     0x03  0x03  0x03  0x03  0x03  0x03
 0x02  0x02  0x02  0x02  0x02  0x02     0x01  0x01  0x01  0x01  0x01  0x01
 0x00  0x00  0x00  0x00  0x03  0x00     0x03  0x03  0x03  0x03  0x03  0x03

The virtual size of the GWAS data is 1940 observations at each of 10150 SNP positions.

In [6]:
size(mouse)
Out[6]:
(1940, 10150)

Because the file is memory-mapped opening the file and accessing the data is fast, even for very large .bed files.

SnpArray can be initialized from Plink files in compressed formats. For a complete list, type

In [7]:
SnpArrays.ALLOWED_FORMAT
Out[7]:
6-element Array{String,1}:
 "gz"  
 "zlib"
 "zz"  
 "xz"  
 "zst" 
 "bz2" 

Let us first compress the mouse data in gz format. We see gz format takes less than 1/3 storage of original Plink files.

In [8]:
compress_plink(SnpArrays.datadir("mouse"), "gz")
readdir(glob"mouse.*.gz", datapath)
Out[8]:
3-element Array{String,1}:
 "/Users/huazhou/.julia/packages/SnpArrays/Arv5H/data/mouse.bed.gz"
 "/Users/huazhou/.julia/packages/SnpArrays/Arv5H/data/mouse.bim.gz"
 "/Users/huazhou/.julia/packages/SnpArrays/Arv5H/data/mouse.fam.gz"

To initialize SnpArray from gzipped Plink file, simply used the bed file with name ending with .bed.gz:

In [9]:
# requires corresponding `.fam.gz` file
SnpArray(SnpArrays.datadir("mouse.bed.gz"))
Out[9]:
1940×10150 SnpArray:
 0x02  0x02  0x02  0x02  0x03  0x02  …  0x03  0x03  0x03  0x03  0x03  0x03
 0x02  0x02  0x03  0x02  0x02  0x02     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x03  0x03  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x02  0x02  0x02  0x02  0x02  0x02     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x03  0x03  0x03     0x02  0x02  0x02  0x02  0x02  0x02
 0x02  0x02  0x02  0x02  0x03  0x02  …  0x03  0x03  0x03  0x03  0x03  0x03
 0x02  0x02  0x02  0x02  0x03  0x02     0x03  0x03  0x03  0x03  0x03  0x03
 0x02  0x02  0x03  0x02  0x02  0x02     0x03  0x03  0x03  0x03  0x03  0x03
 0x02  0x02  0x03  0x02  0x02  0x02     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x03  0x03  0x03     0x02  0x02  0x02  0x02  0x02  0x02
 0x03  0x03  0x03  0x03  0x03  0x03  …  0x00  0x00  0x00  0x00  0x00  0x00
 0x02  0x02  0x02  0x02  0x03  0x02     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x03  0x03  0x03     0x00  0x00  0x00  0x00  0x00  0x00
    ⋮                             ⋮  ⋱           ⋮                        
 0x03  0x03  0x03  0x03  0x03  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x03  0x03  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x02  0x02  0x02  0x02  0x02  0x02  …  0x03  0x03  0x03  0x03  0x03  0x03
 0x02  0x02  0x02  0x02  0x03  0x02     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x03  0x03  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x02  0x02  0x02  0x02  0x03  0x02     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x03  0x03  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x02  0x02  0x02  0x02  0x03  0x02  …  0x03  0x03  0x03  0x03  0x03  0x03
 0x02  0x02  0x03  0x02  0x02  0x02     0x03  0x03  0x03  0x03  0x03  0x03
 0x02  0x02  0x03  0x02  0x02  0x02     0x03  0x03  0x03  0x03  0x03  0x03
 0x02  0x02  0x02  0x02  0x02  0x02     0x01  0x01  0x01  0x01  0x01  0x01
 0x00  0x00  0x00  0x00  0x03  0x00     0x03  0x03  0x03  0x03  0x03  0x03

convert and copyto!

Most common usage of SnpArray is to convert genotypes to numeric values for statistical analysis. The conversion rule depends on genetic models (additive, dominant, or recessive), centering, scaling, or imputation.

convert

convert function has 4 keyword arguments: model, center, scale, and impute.

model keyword specifies the SNP model for conversion. By default convert function translates genotypes according to the additive SNP model, which essentially counts the number of A2 allele (0, 1 or 2) per genotype. Other SNP models are dominant and recessive, both in terms of the A2 allele.

Genotype SnpArray model=ADDITIVE_MODEL model=DOMINANT_MODEL model=RECESSIVE_MODEL
A1,A1 0x00 0 0 0
missing 0x01 NaN NaN NaN
A1,A2 0x02 1 1 0
A2,A2 0x03 2 1 1

center=true tells convert to center each column by its mean. Default is false.

scale=true tells convert to scale each column by its standard deviation. Default is false.

impute=true tells convert to impute missing genotypes (0x01) by column mean. Default is false.

Convert the whole SnpArray to a Float64 matrix using defaults (model=ADDITIVE_MODEL, center=false, scale=false, impute=false)

In [10]:
convert(Matrix{Float64}, mouse)
Out[10]:
1940×10150 Array{Float64,2}:
 1.0  1.0  1.0  1.0  2.0  1.0  2.0  1.0  …    2.0    2.0    2.0    2.0    2.0
 1.0  1.0  2.0  1.0  1.0  1.0  1.0  2.0       2.0    2.0    2.0    2.0    2.0
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0       2.0    2.0    2.0    2.0    2.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  2.0       2.0    2.0    2.0    2.0    2.0
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0       1.0    1.0    1.0    1.0    1.0
 1.0  1.0  1.0  1.0  2.0  1.0  2.0  1.0  …    2.0    2.0    2.0    2.0    2.0
 1.0  1.0  1.0  1.0  2.0  1.0  2.0  1.0       2.0    2.0    2.0    2.0    2.0
 1.0  1.0  2.0  1.0  1.0  1.0  1.0  2.0       2.0    2.0    2.0    2.0    2.0
 1.0  1.0  2.0  1.0  1.0  1.0  1.0  2.0       2.0    2.0    2.0    2.0    2.0
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0       1.0    1.0    1.0    1.0    1.0
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  …    0.0    0.0    0.0    0.0    0.0
 1.0  1.0  1.0  1.0  2.0  1.0  2.0  1.0       2.0    2.0    2.0    2.0    2.0
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0       0.0    0.0    0.0    0.0    0.0
 ⋮                        ⋮              ⋱    ⋮                              
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0       2.0    2.0    2.0    2.0    2.0
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0       2.0    2.0    2.0    2.0    2.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  2.0  …    2.0    2.0    2.0    2.0    2.0
 1.0  1.0  1.0  1.0  2.0  1.0  2.0  1.0       2.0    2.0    2.0    2.0    2.0
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0       2.0    2.0    2.0    2.0    2.0
 1.0  1.0  1.0  1.0  2.0  1.0  2.0  1.0       2.0    2.0    2.0    2.0    2.0
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0       2.0    2.0    2.0    2.0    2.0
 1.0  1.0  1.0  1.0  2.0  1.0  2.0  1.0  …    2.0    2.0    2.0    2.0    2.0
 1.0  1.0  2.0  1.0  1.0  1.0  1.0  2.0       2.0    2.0    2.0    2.0    2.0
 1.0  1.0  2.0  1.0  1.0  1.0  1.0  2.0       2.0    2.0    2.0    2.0    2.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  2.0     NaN    NaN    NaN    NaN    NaN  
 0.0  0.0  0.0  0.0  2.0  0.0  2.0  0.0       2.0    2.0    2.0    2.0    2.0

When convert or copyto! is used on a slice or subarray of SnpArray, using view, @view or views is necessary for both correctness and efficiency. Without view, it's simply converting the UInt8 coding in original bed file.

Convert a column to Float64 vector using defaults (model=ADDITIVE_MODEL, center=false, scale=false, impute=false).

In [11]:
# convert(Vector{Float64}, view(mouse, :, 1)) # alternative syntax
# @views convert(Vector{Float64}, mouse[:, 1]) # alternative syntax
convert(Vector{Float64}, @view(mouse[:, 1]))
Out[11]:
1940-element Array{Float64,1}:
 1.0
 1.0
 2.0
 1.0
 2.0
 1.0
 1.0
 1.0
 1.0
 2.0
 2.0
 1.0
 2.0
 ⋮  
 2.0
 2.0
 1.0
 1.0
 2.0
 1.0
 2.0
 1.0
 1.0
 1.0
 1.0
 0.0

Convert a subarray of SnpArray to a Float64 matrix using defaults (model=ADDITIVE_MODEL, center=false, scale=false, impute=false).

In [12]:
convert(Matrix{Float64}, @view(mouse[1:2:10, 1:2:10]))
Out[12]:
5×5 Array{Float64,2}:
 1.0  1.0  2.0  2.0  1.0
 2.0  2.0  2.0  2.0  2.0
 2.0  2.0  2.0  2.0  2.0
 1.0  1.0  2.0  2.0  1.0
 1.0  2.0  1.0  1.0  1.0

Let's make a matrix of different SNP models (ADDITIVE_MODEL vs DOMINANT_MODEL vs RECESSIVE_MODEL) for the same snp.

In [13]:
@views [convert(Vector{Float64}, mouse[:, 1], model=ADDITIVE_MODEL) convert(Vector{Float64}, mouse[:, 1], model=DOMINANT_MODEL) convert(Vector{Float64}, mouse[:, 1], model=RECESSIVE_MODEL)]
Out[13]:
1940×3 Array{Float64,2}:
 1.0  1.0  0.0
 1.0  1.0  0.0
 2.0  1.0  1.0
 1.0  1.0  0.0
 2.0  1.0  1.0
 1.0  1.0  0.0
 1.0  1.0  0.0
 1.0  1.0  0.0
 1.0  1.0  0.0
 2.0  1.0  1.0
 2.0  1.0  1.0
 1.0  1.0  0.0
 2.0  1.0  1.0
 ⋮            
 2.0  1.0  1.0
 2.0  1.0  1.0
 1.0  1.0  0.0
 1.0  1.0  0.0
 2.0  1.0  1.0
 1.0  1.0  0.0
 2.0  1.0  1.0
 1.0  1.0  0.0
 1.0  1.0  0.0
 1.0  1.0  0.0
 1.0  1.0  0.0
 0.0  0.0  0.0

Center and scale (last column) while using convert

In [14]:
convert(Vector{Float64}, @view(mouse[:, end]), center=true, scale=true)
Out[14]:
1940-element Array{Float64,1}:
   0.4721460731147541
   0.4721460731147541
   0.4721460731147541
   0.4721460731147541
  -1.8819155626127624
   0.4721460731147541
   0.4721460731147541
   0.4721460731147541
   0.4721460731147541
  -1.8819155626127624
  -4.2359771983402785
   0.4721460731147541
  -4.2359771983402785
   ⋮                 
   0.4721460731147541
   0.4721460731147541
   0.4721460731147541
   0.4721460731147541
   0.4721460731147541
   0.4721460731147541
   0.4721460731147541
   0.4721460731147541
   0.4721460731147541
   0.4721460731147541
 NaN                 
   0.4721460731147541

Center, scale, and impute (last column) while using convert

In [15]:
convert(Vector{Float64}, @view(mouse[:, end]), center=true, scale=true, impute=true)
Out[15]:
1940-element Array{Float64,1}:
  0.4721460731147541
  0.4721460731147541
  0.4721460731147541
  0.4721460731147541
 -1.8819155626127624
  0.4721460731147541
  0.4721460731147541
  0.4721460731147541
  0.4721460731147541
 -1.8819155626127624
 -4.2359771983402785
  0.4721460731147541
 -4.2359771983402785
  ⋮                 
  0.4721460731147541
  0.4721460731147541
  0.4721460731147541
  0.4721460731147541
  0.4721460731147541
  0.4721460731147541
  0.4721460731147541
  0.4721460731147541
  0.4721460731147541
  0.4721460731147541
  0.0               
  0.4721460731147541

copyto!

copyto! is the in-place version of convert. It takes the same keyword arguments (model, center, scale, impute) as convert.

Looping over all columns (typical use e.g. in GWAS)

In [16]:
v = Vector{Float64}(undef, size(mouse, 1))
function loop_test(v, s)
    for j in 1:size(s, 2)
        copyto!(v, @view(s[:, j]), center=true, scale=true)
    end
end
@time loop_test(v, mouse)
  0.094314 seconds (53.78 k allocations: 2.759 MiB)

Copy the whole SnpArray

In [17]:
M = similar(mouse, Float64)
@time copyto!(M, mouse)
  0.109812 seconds (66 allocations: 4.266 KiB)
Out[17]:
1940×10150 Array{Float64,2}:
 1.0  1.0  1.0  1.0  2.0  1.0  2.0  1.0  …    2.0    2.0    2.0    2.0    2.0
 1.0  1.0  2.0  1.0  1.0  1.0  1.0  2.0       2.0    2.0    2.0    2.0    2.0
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0       2.0    2.0    2.0    2.0    2.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  2.0       2.0    2.0    2.0    2.0    2.0
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0       1.0    1.0    1.0    1.0    1.0
 1.0  1.0  1.0  1.0  2.0  1.0  2.0  1.0  …    2.0    2.0    2.0    2.0    2.0
 1.0  1.0  1.0  1.0  2.0  1.0  2.0  1.0       2.0    2.0    2.0    2.0    2.0
 1.0  1.0  2.0  1.0  1.0  1.0  1.0  2.0       2.0    2.0    2.0    2.0    2.0
 1.0  1.0  2.0  1.0  1.0  1.0  1.0  2.0       2.0    2.0    2.0    2.0    2.0
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0       1.0    1.0    1.0    1.0    1.0
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  …    0.0    0.0    0.0    0.0    0.0
 1.0  1.0  1.0  1.0  2.0  1.0  2.0  1.0       2.0    2.0    2.0    2.0    2.0
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0       0.0    0.0    0.0    0.0    0.0
 ⋮                        ⋮              ⋱    ⋮                              
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0       2.0    2.0    2.0    2.0    2.0
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0       2.0    2.0    2.0    2.0    2.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  2.0  …    2.0    2.0    2.0    2.0    2.0
 1.0  1.0  1.0  1.0  2.0  1.0  2.0  1.0       2.0    2.0    2.0    2.0    2.0
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0       2.0    2.0    2.0    2.0    2.0
 1.0  1.0  1.0  1.0  2.0  1.0  2.0  1.0       2.0    2.0    2.0    2.0    2.0
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0       2.0    2.0    2.0    2.0    2.0
 1.0  1.0  1.0  1.0  2.0  1.0  2.0  1.0  …    2.0    2.0    2.0    2.0    2.0
 1.0  1.0  2.0  1.0  1.0  1.0  1.0  2.0       2.0    2.0    2.0    2.0    2.0
 1.0  1.0  2.0  1.0  1.0  1.0  1.0  2.0       2.0    2.0    2.0    2.0    2.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  2.0     NaN    NaN    NaN    NaN    NaN  
 0.0  0.0  0.0  0.0  2.0  0.0  2.0  0.0       2.0    2.0    2.0    2.0    2.0

Summaries

Counts

Counts of each the four possible values for each column are returned by counts.`

In [18]:
@time counts(mouse, dims=1)
  0.081329 seconds (91.80 k allocations: 4.583 MiB)
Out[18]:
4×10150 Array{Int64,2}:
  358   359  252   358    33   359  …    56    56    56    56    56    56
    2     0    4     3     4     1      173   173   162   173   174   175
 1003  1004  888  1004   442  1004      242   242   242   242   242   242
  577   577  796   575  1461   576     1469  1469  1480  1469  1468  1467

Column 2 has no missing values (code 0x01, the second row in the column-counts table). In that SNP position for this sample, 359 individuals are homozygous allele 1 (G according to the .bim file), 1004 are heterozygous, and 577 are homozygous allele 2 (A).

The counts by column and by row are cached in the SnpArray object. Accesses after the first are extremely fast.

Minor allele frequencies

Minor allele frequencies (MAF) for each SNP.

In [19]:
maf(mouse)
Out[19]:
10150-element Array{Float64,1}:
 0.4434984520123839  
 0.4438144329896907  
 0.359504132231405   
 0.4439855446566856  
 0.13119834710743805 
 0.44404332129963897 
 0.1412706611570248  
 0.30299123259412064 
 0.4445018069179143  
 0.44424367578729995 
 0.43427835051546393 
 0.14075413223140498 
 0.304639175257732   
 ⋮                   
 0.0527624309392265  
 0.052980132450331174
 0.08079096045197742 
 0.08253250423968339 
 0.08253250423968339 
 0.10022650056625138 
 0.10016977928692694 
 0.10016977928692694 
 0.09955005624296964 
 0.10016977928692694 
 0.10022650056625138 
 0.10028328611898019 

Missing rate

Proportion of missing genotypes

In [20]:
missingrate(mouse, 1)
Out[20]:
10150-element Array{Float64,1}:
 0.0010309278350515464
 0.0                  
 0.002061855670103093 
 0.0015463917525773195
 0.002061855670103093 
 0.0005154639175257732
 0.002061855670103093 
 0.0005154639175257732
 0.0015463917525773195
 0.0015463917525773195
 0.0                  
 0.002061855670103093 
 0.0                  
 ⋮                    
 0.06701030927835051  
 0.06597938144329897  
 0.08762886597938144  
 0.08814432989690722  
 0.08814432989690722  
 0.08969072164948454  
 0.08917525773195877  
 0.08917525773195877  
 0.08350515463917525  
 0.08917525773195877  
 0.08969072164948454  
 0.09020618556701031  
In [21]:
missingrate(mouse, 2)
Out[21]:
1940-element Array{Float64,1}:
 0.00019704433497536947
 0.0                   
 0.018423645320197045  
 0.0007881773399014779 
 0.0                   
 0.004236453201970443  
 0.0051231527093596055 
 0.00039408866995073894
 0.005517241379310344  
 0.0016748768472906405 
 0.0                   
 9.852216748768474e-5  
 0.0004926108374384236 
 ⋮                     
 0.000689655172413793  
 0.004729064039408867  
 0.0004926108374384236 
 0.001083743842364532  
 0.00019704433497536947
 0.0025615763546798028 
 0.0038423645320197044 
 0.001379310344827586  
 0.0064039408866995075 
 0.002857142857142857  
 0.0011822660098522167 
 0.00029556650246305416

Genetic relationship matrix

grm function computes the empirical kinship matrix using either the classical genetic relationship matrix, grm(A, model=:GRM), or the method of moment method, grm(A, model=:MoM), or the robust method, grm(A, model=:Robust).

Classical genetic relation matrix

In [22]:
# grm(mouse, method=:MoM)
# grm(mouse, method=:Robust)
grm(mouse, method=:GRM)
Out[22]:
1940×1940 Array{Float64,2}:
  0.478301    -0.0331304    0.0135612    …  -0.0347737   -0.0129443 
 -0.0331304    0.422771    -0.0389227        0.0457987    0.00556832
  0.0135612   -0.0389227    0.509248        -0.0356689   -0.0608705 
  0.0198205    0.00728645  -0.00935362      -0.0302404   -0.0102152 
  0.056747    -0.0163418   -0.00495283      -0.0413347   -0.0415659 
 -0.0165628   -0.0191127   -0.0112181    …   0.0177118   -0.0193087 
  0.123771    -0.0404167    0.00442739       0.00880649  -0.0437565 
 -0.0628362    0.172552    -0.0728312        0.0640027   -0.0281429 
  0.0605018   -0.0260505    0.00398852      -0.00277754  -0.0607773 
  0.108886    -0.0204594   -0.00767711      -0.0210501    0.00343526
 -0.0142307    0.00270989  -0.0235504    …  -0.0223563   -0.028408  
 -0.0306022    0.197743    -0.00244269       0.0213998   -0.0478472 
 -0.0131463   -0.0226707    0.0223522       -0.037288     0.0493662 
  ⋮                                      ⋱                          
  0.0176725   -0.0165609    0.0378308        0.0238751   -0.0420143 
  0.0024949   -0.0411137    0.0154847       -0.0380656   -0.0650806 
  0.0952286    0.00894298  -0.0163446    …  -0.0202633   -0.0219594 
 -0.0309488   -0.0228342   -0.0478253       -0.014896     0.261623  
 -0.004804    -0.0375168   -0.0211418       -0.0172572    0.0359166 
  0.0076296    0.0481887   -0.0328968        0.0920425   -0.0292548 
  0.070045    -0.0302138    0.000647283      0.00892069  -0.00632566
  0.0378132   -6.59565e-5   0.00888932   …   0.00230815  -0.0291622 
 -0.00132837   0.00223654   0.0495928       -0.00936248   0.0299075 
  0.0640864   -0.0241218    0.00602283       0.00403413   0.00689551
 -0.0347737    0.0457987   -0.0356689        0.509228    -0.035215  
 -0.0129443    0.00556832  -0.0608705       -0.035215     0.552712  

By default, grm excludes SNPs with minor allele frequency below 0.01. This default can be changed by the keyword argument minmaf.

In [23]:
# compute GRM excluding SNPs with MAF≤0.05 
grm(mouse, minmaf=0.05)
Out[23]:
1940×1940 Array{Float64,2}:
  0.478556    -0.0331783    0.013541     …  -0.0348225   -0.0129761 
 -0.0331783    0.422993    -0.0389741        0.0457975    0.00554753
  0.013541    -0.0389741    0.50952         -0.0357183   -0.0609305 
  0.0203209    0.00777944  -0.00887047      -0.0297696   -0.00972836
  0.0567523   -0.0163798   -0.00498406      -0.0413874   -0.0416146 
 -0.0166009   -0.0191523   -0.0112531    …   0.0176939   -0.0193442 
  0.123816    -0.0404689    0.00440171       0.0087834   -0.0438065 
 -0.0629017    0.172626    -0.0729026        0.0640123   -0.0281836 
  0.0605093   -0.0260942    0.00396257      -0.00280748  -0.0608373 
  0.108922    -0.0204998   -0.00770996      -0.0210909    0.00341321
 -0.0142674    0.00268319  -0.0235927    …  -0.0223978   -0.0284489 
 -0.0306486    0.197832    -0.00247243       0.0213842   -0.0478996 
 -0.0131824   -0.0227124    0.0223371       -0.0373384    0.0493713 
  ⋮                                      ⋱                          
  0.0176546   -0.016599     0.0378249        0.0238609   -0.0420633 
  0.00246808  -0.0411663    0.0154656       -0.0381165   -0.0651432 
  0.0952566    0.00891997  -0.0163826    …  -0.0203036   -0.0219965 
 -0.0309912   -0.0228718   -0.0478777       -0.0149289    0.261754  
 -0.00483514  -0.0375673   -0.0211827       -0.0172957    0.0359138 
  0.00770862   0.0482917   -0.0328417        0.0921714   -0.0292961 
  0.0700582   -0.03026      0.000619365      0.00889767  -0.00635348
  0.0378313   -7.02155e-5   0.00889036   …   0.0023053   -0.0291795 
 -0.00133338   0.00223364   0.0496179       -0.00937223   0.0299252 
  0.0641201   -0.0241403    0.00602217       0.0040323    0.00689958
 -0.0348225    0.0457975   -0.0357183        0.509501    -0.0352599 
 -0.0129761    0.00554753  -0.0609305       -0.0352599    0.553015  

To specify specific SNPs for calculating the empirical kinship, use the cinds keyword (default is nothing). When cinds is specified, minmaf is ignored.

In [24]:
# GRM using every other SNP
grm(mouse, cinds=1:2:size(mouse, 2))
Out[24]:
1940×1940 Array{Float64,2}:
  0.477       -0.0307774     0.0118026   …  -0.0320301    -0.0125113 
 -0.0307774    0.425085     -0.0367459       0.0480442     0.00519065
  0.0118026   -0.0367459     0.505038       -0.0385129    -0.0631557 
  0.0166017    0.00614789   -0.00919695     -0.0399744    -0.0104884 
  0.05724     -0.0122148    -0.00543377     -0.0395663    -0.0372998 
 -0.0193129   -0.0224378    -0.009277    …   0.0153785    -0.0220184 
  0.12194     -0.0410682     0.00274307      0.00796748   -0.0441578 
 -0.0624031    0.173985     -0.0724784       0.0663191    -0.0294243 
  0.0627626   -0.0288615     0.00265615     -0.00449877   -0.0579702 
  0.110878    -0.0232715    -0.00881604     -0.021272      0.00169016
 -0.00800735  -0.00149824   -0.019791    …  -0.024124     -0.0289397 
 -0.0272944    0.19894      -0.00534771      0.0209384    -0.0511051 
 -0.011388    -0.0281003     0.0273853      -0.0360047     0.0459359 
  ⋮                                      ⋱                           
  0.0169431   -0.0136989     0.0340794       0.0272811    -0.041189  
  0.00201325  -0.0426611     0.0124353      -0.0387982    -0.0656181 
  0.097587     0.0058123    -0.0160698   …  -0.021457     -0.023226  
 -0.0342014   -0.0211246    -0.0490112      -0.0129575     0.256552  
 -0.00324255  -0.0423482    -0.0192699      -0.0149015     0.0339388 
  0.00575353   0.0464237    -0.0294694       0.0924759    -0.0275451 
  0.0748725   -0.0258461    -0.00141068      0.0115232    -0.00486589
  0.0386555    0.000612169   0.00959997  …  -0.000357284  -0.0334687 
 -0.00343056   0.0120673     0.0455375      -0.0103798     0.0336959 
  0.0656909   -0.0193469     0.00600815      0.00188545    0.00726181
 -0.0320301    0.0480442    -0.0385129       0.513285     -0.0317963 
 -0.0125113    0.00519065   -0.0631557      -0.0317963     0.54471   

Fitering by missing rate

Filter according to minimum success rates (1 - proportion of missing genotypes) per row and column

In [25]:
rowmask, colmask =  SnpArrays.filter(mouse, 
    min_success_rate_per_row = 0.999, 
    min_success_rate_per_col = 0.999)
Out[25]:
(Bool[1, 1, 1, 1, 1, 1, 1, 1, 0, 1  …  1, 1, 1, 1, 0, 1, 1, 1, 1, 1], Bool[0, 1, 0, 0, 0, 1, 0, 1, 0, 0  …  0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

One may use the rowmask and colmask to filter and save the filtering results as Plink files.

SnpArrays.filter(SnpArrays.datadir("mouse"), rowmask, colmask)

Filter a set of Plink files according to row indices and column indices. By result, filtered Plink files are saved as srcname.filtered.bed, srcname.filtered.fam, and srcname.filtered.bim, where srcname is the source Plink file name. You can also specify the destination file name using keyword des.

In [26]:
SnpArrays.filter(SnpArrays.datadir("mouse"), 1:5, 1:5)
Out[26]:
5×5 SnpArray:
 0x02  0x02  0x02  0x02  0x03
 0x02  0x02  0x03  0x02  0x02
 0x03  0x03  0x03  0x03  0x03
 0x02  0x02  0x02  0x02  0x02
 0x03  0x03  0x03  0x03  0x03
In [27]:
# clean up
rm(SnpArrays.datadir("mouse") * ".filtered.bed")
rm(SnpArrays.datadir("mouse") * ".filtered.fam")
rm(SnpArrays.datadir("mouse") * ".filtered.bim")

Filter a set of Plink files according to logical vectors.

In [28]:
SnpArrays.filter(SnpArrays.datadir("mouse"), rowmask, colmask)
Out[28]:
1856×5726 SnpArray:
 0x02  0x02  0x02  0x02  0x02  0x02  …  0x03  0x02  0x00  0x02  0x03  0x02
 0x02  0x02  0x03  0x02  0x03  0x03     0x00  0x03  0x03  0x03  0x03  0x00
 0x03  0x03  0x03  0x03  0x03  0x03     0x02  0x02  0x02  0x03  0x00  0x03
 0x02  0x02  0x03  0x02  0x03  0x03     0x03  0x03  0x00  0x03  0x03  0x00
 0x03  0x03  0x03  0x03  0x03  0x03     0x02  0x03  0x03  0x02  0x03  0x02
 0x02  0x02  0x02  0x02  0x02  0x02  …  0x03  0x02  0x00  0x03  0x00  0x03
 0x02  0x02  0x02  0x02  0x02  0x02     0x03  0x02  0x02  0x03  0x03  0x03
 0x02  0x02  0x03  0x02  0x03  0x03     0x00  0x03  0x03  0x00  0x03  0x00
 0x03  0x03  0x03  0x03  0x03  0x03     0x03  0x03  0x02  0x02  0x03  0x02
 0x03  0x03  0x03  0x03  0x03  0x03     0x03  0x02  0x00  0x03  0x03  0x03
 0x02  0x02  0x02  0x02  0x02  0x02  …  0x00  0x03  0x03  0x03  0x03  0x00
 0x03  0x03  0x03  0x03  0x03  0x03     0x03  0x03  0x03  0x03  0x00  0x03
 0x02  0x02  0x02  0x02  0x02  0x02     0x02  0x03  0x03  0x03  0x00  0x00
    ⋮                             ⋮  ⋱     ⋮                             ⋮
 0x03  0x03  0x03  0x03  0x03  0x03     0x02  0x03  0x02  0x03  0x03  0x00
 0x03  0x03  0x03  0x03  0x03  0x03  …  0x02  0x02  0x02  0x03  0x03  0x03
 0x03  0x03  0x03  0x03  0x03  0x03     0x02  0x03  0x03  0x03  0x03  0x00
 0x02  0x02  0x03  0x02  0x03  0x03     0x00  0x03  0x03  0x03  0x03  0x00
 0x02  0x02  0x02  0x02  0x02  0x02     0x03  0x03  0x03  0x03  0x00  0x00
 0x03  0x03  0x03  0x03  0x03  0x03     0x00  0x03  0x03  0x03  0x03  0x03
 0x02  0x02  0x02  0x02  0x02  0x02  …  0x02  0x03  0x03  0x03  0x03  0x03
 0x02  0x02  0x02  0x02  0x02  0x02     0x02  0x02  0x02  0x03  0x03  0x03
 0x02  0x02  0x03  0x02  0x03  0x03     0x02  0x02  0x02  0x03  0x00  0x03
 0x02  0x02  0x03  0x02  0x03  0x03     0x00  0x03  0x03  0x03  0x03  0x03
 0x02  0x02  0x03  0x02  0x03  0x03     0x00  0x03  0x03  0x03  0x00  0x03
 0x00  0x00  0x00  0x00  0x00  0x00  …  0x03  0x03  0x03  0x03  0x03  0x03
In [29]:
rm(SnpArrays.datadir("mouse") * ".filtered.bed")
rm(SnpArrays.datadir("mouse") * ".filtered.fam")
rm(SnpArrays.datadir("mouse") * ".filtered.bim")

Linear algebra with SnpArray

In some applications we want to perform linear algebra using SnpArray directly without expanding it to numeric matrix. This is achieved by the SnpBitMatrix type. The implementation assumes:

  1. the SnpArray does not have missing genotypes, and
  2. the matrix corresponding to SnpArray is the matrix of A2 allele counts.

First let's load a data set without missing genotypes.

In [30]:
EUR = SnpArray(SnpArrays.datadir("EUR_subset.bed"))
Out[30]:
379×54051 SnpArray:
 0x03  0x03  0x03  0x02  0x02  0x03  …  0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x02  0x03  0x02  0x03  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x03  0x03  0x03     0x02  0x02  0x02  0x03  0x03  0x02
 0x03  0x03  0x03  0x00  0x03  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x00  0x03  0x03     0x02  0x02  0x02  0x03  0x03  0x03
 0x02  0x03  0x03  0x03  0x03  0x03  …  0x03  0x03  0x03  0x03  0x03  0x02
 0x02  0x03  0x03  0x02  0x02  0x03     0x03  0x03  0x02  0x02  0x03  0x03
 0x02  0x03  0x03  0x03  0x02  0x02     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x00  0x02  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x02  0x03  0x03  0x02  0x03  0x02     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x02  0x03  0x03  …  0x03  0x03  0x02  0x02  0x03  0x03
 0x03  0x03  0x03  0x02  0x03  0x03     0x03  0x03  0x03  0x03  0x03  0x02
 0x03  0x02  0x03  0x02  0x02  0x03     0x03  0x03  0x03  0x03  0x03  0x03
    ⋮                             ⋮  ⋱     ⋮                             ⋮
 0x03  0x03  0x03  0x00  0x02  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x02  0x02  0x03     0x02  0x02  0x02  0x03  0x02  0x03
 0x03  0x03  0x03  0x02  0x02  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x02  0x03  0x03  0x02  0x03  0x03  …  0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x00  0x00  0x03     0x02  0x02  0x02  0x03  0x03  0x03
 0x02  0x03  0x03  0x03  0x03  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x02  0x03  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x02  0x03  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x02  0x03  0x03  0x03  0x03  0x03  …  0x03  0x03  0x02  0x02  0x03  0x03
 0x03  0x03  0x03  0x00  0x03  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x02  0x03  0x03  0x02  0x00  0x02     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x02  0x02  0x03     0x03  0x03  0x03  0x03  0x03  0x03

To instantiate a SnpBitMatrix based on SnpArray,

In [31]:
EURbm = SnpBitMatrix{Float64}(EUR, model=ADDITIVE_MODEL, center=true, scale=true);

The constructor shares the same keyword arguments as the convert or copyto! functions. The type parameter, Float64 in this example, indicates the SnpBitMatrix acts like a Float64 matrix.

The memory usage of the SnpBitMatrix should be similar to the SnpArray, or equivalently bed file size, if model=ADDITIVE_MODEL or half of that of SnpArray if model=DOMINANT_MODEL or model=RECESSIVE_MODEL.

In [32]:
Base.summarysize(EUR), Base.summarysize(EURbm)
Out[32]:
(6876757, 6421960)

A SnpBitMatrix acts similar to a regular matrix and responds to size, eltype, and SnpBitMatrix-vector multiplications.

In [33]:
@show size(EURbm)
@show eltype(EURbm)
@show typeof(EURbm) <: AbstractMatrix;
size(EURbm) = (379, 54051)
eltype(EURbm) = Float64
typeof(EURbm) <: AbstractMatrix = true

SnpBitMatrix-vector multiplication is mathematically equivalent to the corresponding Float matrix contained from convert or copyto! a SnpArray.

In [34]:
using LinearAlgebra
v1 = randn(size(EUR, 1))
v2 = randn(size(EUR, 2))
A = convert(Matrix{Float64}, EUR, model=ADDITIVE_MODEL, center=true, scale=true)
norm(EURbm * v2 -  A * v2)
Out[34]:
2.3504168245427624e-10
In [35]:
norm(EURbm' * v1 - A' * v1)
Out[35]:
6.043297338083621e-12

SnpBitMatrix can be created from a subarray of SnpArray.

In [36]:
EURsub = @view EUR[1:2:100, 1:2:100]
EURsubbm = SnpBitMatrix{Float64}(EURsub, model=ADDITIVE_MODEL, center=true, scale=true);
In [37]:
Base.summarysize(EURsubbm)
Out[37]:
2600
In [38]:
@show size(EURsubbm)
@show eltype(EURsubbm)
@show typeof(EURsubbm) <: AbstractMatrix;
size(EURsubbm) = (50, 50)
eltype(EURsubbm) = Float64
typeof(EURsubbm) <: AbstractMatrix = true
In [39]:
using LinearAlgebra
v1 = randn(size(EURsub, 1))
v2 = randn(size(EURsub, 2))
A = convert(Matrix{Float64}, EURsub, model=ADDITIVE_MODEL, center=true, scale=true)
norm(EURsubbm * v2 -  A * v2)
Out[39]:
2.7548859461055947e-14
In [40]:
norm(EURsubbm' * v1 - A' * v1)
Out[40]:
4.561993409500186e-14

SnpData

We can create a SnpData dataframe, which is a SnpArray with information on SNP and subject appended.

In [41]:
EUR_data = SnpData(SnpArrays.datadir("EUR_subset"))
Out[41]:
SnpData(people: 379, snps: 54051,
snp_info: 
│ Row │ chromosome │ snpid       │ genetic_distance │ position │ allele1      │ allele2      │
│     │ String     │ String      │ Float64          │ Int64    │ Categorical… │ Categorical… │
├─────┼────────────┼─────────────┼──────────────────┼──────────┼──────────────┼──────────────┤
│ 1   │ 17         │ rs34151105  │ 0.0              │ 1665     │ T            │ C            │
│ 2   │ 17         │ rs143500173 │ 0.0              │ 2748     │ T            │ A            │
│ 3   │ 17         │ rs113560219 │ 0.0              │ 4702     │ T            │ C            │
│ 4   │ 17         │ rs1882989   │ 5.6e-5           │ 15222    │ G            │ A            │
│ 5   │ 17         │ rs8069133   │ 0.000499         │ 32311    │ G            │ A            │
│ 6   │ 17         │ rs112221137 │ 0.000605         │ 36405    │ G            │ T            │
…,
person_info: 
│ Row │ fid       │ iid       │ father    │ mother    │ sex       │ phenotype │
│     │ Abstract… │ Abstract… │ Abstract… │ Abstract… │ Abstract… │ Abstract… │
├─────┼───────────┼───────────┼───────────┼───────────┼───────────┼───────────┤
│ 1   │ 1         │ HG00096   │ 0         │ 0         │ 1         │ 1         │
│ 2   │ 2         │ HG00097   │ 0         │ 0         │ 2         │ 1         │
│ 3   │ 3         │ HG00099   │ 0         │ 0         │ 2         │ 1         │
│ 4   │ 4         │ HG00100   │ 0         │ 0         │ 2         │ 1         │
│ 5   │ 5         │ HG00101   │ 0         │ 0         │ 1         │ 1         │
│ 6   │ 6         │ HG00102   │ 0         │ 0         │ 2         │ 1         │
…,
srcbed: /Users/huazhou/.julia/packages/SnpArrays/Arv5H/src/../data/EUR_subset.bed
srcbim: /Users/huazhou/.julia/packages/SnpArrays/Arv5H/src/../data/EUR_subset.bim
srcfam: /Users/huazhou/.julia/packages/SnpArrays/Arv5H/src/../data/EUR_subset.fam
)

We can split SnpData by SNP's choromosomes using split_plink.

In [42]:
splitted = SnpArrays.split_plink(SnpArrays.datadir("EUR_subset"); prefix="tmp.chr.")
Out[42]:
Dict{AbstractString,SnpData} with 6 entries:
  "21" => SnpData(people: 379, snps: 5813,…
  "17" => SnpData(people: 379, snps: 11041,…
  "19" => SnpData(people: 379, snps: 9690,…
  "20" => SnpData(people: 379, snps: 9327,…
  "22" => SnpData(people: 379, snps: 5938,…
  "18" => SnpData(people: 379, snps: 12242,…

Let's take a SnpArray for chromosome 17.

In [43]:
piece = splitted["17"]
Out[43]:
SnpData(people: 379, snps: 11041,
snp_info: 
│ Row │ chromosome │ snpid       │ genetic_distance │ position │ allele1      │ allele2      │
│     │ String     │ String      │ Float64          │ Int64    │ Categorical… │ Categorical… │
├─────┼────────────┼─────────────┼──────────────────┼──────────┼──────────────┼──────────────┤
│ 1   │ 17         │ rs34151105  │ 0.0              │ 1665     │ T            │ C            │
│ 2   │ 17         │ rs143500173 │ 0.0              │ 2748     │ T            │ A            │
│ 3   │ 17         │ rs113560219 │ 0.0              │ 4702     │ T            │ C            │
│ 4   │ 17         │ rs1882989   │ 5.6e-5           │ 15222    │ G            │ A            │
│ 5   │ 17         │ rs8069133   │ 0.000499         │ 32311    │ G            │ A            │
│ 6   │ 17         │ rs112221137 │ 0.000605         │ 36405    │ G            │ T            │
…,
person_info: 
│ Row │ fid       │ iid       │ father    │ mother    │ sex       │ phenotype │
│     │ Abstract… │ Abstract… │ Abstract… │ Abstract… │ Abstract… │ Abstract… │
├─────┼───────────┼───────────┼───────────┼───────────┼───────────┼───────────┤
│ 1   │ 1         │ HG00096   │ 0         │ 0         │ 1         │ 1         │
│ 2   │ 2         │ HG00097   │ 0         │ 0         │ 2         │ 1         │
│ 3   │ 3         │ HG00099   │ 0         │ 0         │ 2         │ 1         │
│ 4   │ 4         │ HG00100   │ 0         │ 0         │ 2         │ 1         │
│ 5   │ 5         │ HG00101   │ 0         │ 0         │ 1         │ 1         │
│ 6   │ 6         │ HG00102   │ 0         │ 0         │ 2         │ 1         │
…,
srcbed: tmp.chr.17.bed
srcbim: tmp.chr.17.bim
srcfam: tmp.chr.17.fam
)

We can merge the split dictionary back into one SnpData using merge_plink.

In [44]:
# write_plink is included here
# merged = SnpArrays.merge_plink("tmp.merged", splitted)

You can also merge the plink formatted files based on their common prefix.

In [45]:
# merged_from_splitted_files = merge_plink("tmp.chr"; 
#    des = "tmp.merged.2")
In [46]:
# cleanup
isfile("tmp.merged.bim") && rm("tmp.merged.bim") 
isfile("tmp.merged.fam") && rm("tmp.merged.fam")
isfile("tmp.merged.bed") && rm("tmp.merged.bed")
isfile("tmp.merged.2.bim") && rm("tmp.merged.2.bim")
isfile("tmp.merged.2.fam") && rm("tmp.merged.2.fam")
isfile("tmp.merged.2.bed") && rm("tmp.merged.2.bed")
for k in keys(splitted)
    isfile("tmp.chr.$(k).bim") && rm("tmp.chr.$(k).bim")
    isfile("tmp.chr.$(k).fam") && rm("tmp.chr.$(k).fam")
    isfile("tmp.chr.$(k).bed") && rm("tmp.chr.$(k).bed")
end

Conclusions

SnpArrays provides users with the ability to rapidly and accurately manipulate very large genetic data sets.