Linear Algebra Benchmarks

SnpArrays.jl supports three modes of matrix-vector multiplications:

  1. Direct operations on a plink-formatted SnpArray: SnpLinAlg
  2. Operations on transformed BitMatrices: SnpBitMatrix (support for this will be dropped in the near future)
  3. Direct operations on plink-formatted data on an Nvidia GPU: CuSnpArray.

SnpLinAlg also supports matrix-matrix multiplications.

  • SnpLinAlg and SnpBitMatrix use Chris Elrod's LoopVectorization.jl internally. They are much faster on machines with AVX support.
  • CuSnpArray uses CUDA.jl internally.

On this page, we compare these three.

  • SnpLinAlg supports multithreading. See this page to learn how to use it; a minimal check is sketched below.
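
To confirm how many threads are available (a minimal sketch; launch Julia with julia --threads 8, or set the JULIA_NUM_THREADS environment variable, before running the benchmarks):

using Base.Threads
nthreads()   # number of Julia threads available to SnpLinAlg
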
versioninfo()
Julia Version 1.6.2
Commit 1b93d53fc4 (2021-07-14 15:36 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake-avx512)
using SnpArrays
using LinearAlgebra
using BenchmarkTools
const EUR = SnpArray(SnpArrays.datadir("EUR_subset.bed"));

Let's try with EUR data repeated 100 and 101 times: 37900 by 54051 and 38279 by 54051, respectively.

EUR_10 = [EUR; EUR; EUR; EUR; EUR; EUR; EUR; EUR; EUR; EUR]
EUR_100 = [EUR_10; EUR_10; EUR_10; EUR_10; EUR_10; EUR_10; EUR_10; EUR_10; EUR_10; EUR_10];
EUR_101 = [EUR_100; EUR];
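
As a quick sanity check, the stacked arrays should have the dimensions quoted above:

size(EUR_100), size(EUR_101)   # expect (37900, 54051) and (38279, 54051)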

We create instances of SnpLinAlg, SnpBitMatrix, and CuSnpArray:

EUR_100_bm = SnpBitMatrix{Float64}(EUR_100; model=ADDITIVE_MODEL, center=false, scale=false);
EUR_100_sla = SnpLinAlg{Float64}(EUR_100; model=ADDITIVE_MODEL, center=false, scale=false);
EUR_100_sla_ = SnpLinAlg{Float64}(EUR_100; model=ADDITIVE_MODEL, center=false, scale=false, impute=false);
EUR_100_mat = convert(Matrix{Float64}, EUR_100, model=ADDITIVE_MODEL, center=false, scale=false);

EUR_101_bm = SnpBitMatrix{Float64}(EUR_101; model=ADDITIVE_MODEL, center=false, scale=false);
EUR_101_sla = SnpLinAlg{Float64}(EUR_101; model=ADDITIVE_MODEL, center=false, scale=false);
EUR_101_sla_ = SnpLinAlg{Float64}(EUR_101; model=ADDITIVE_MODEL, center=false, scale=false, impute=false);
EUR_101_mat = convert(Matrix{Float64}, EUR_101, model=ADDITIVE_MODEL, center=false, scale=false);
using CUDA
EUR_100_cu = CuSnpArray{Float64}(EUR_100; model=ADDITIVE_MODEL, center=false, scale=false);
EUR_100_cu_ = CuSnpArray{Float64}(EUR_100; model=ADDITIVE_MODEL, center=false, scale=false, impute=false);
┌ Warning: The NVIDIA driver on this system only supports up to CUDA 10.2.0.
│ For performance reasons, it is recommended to upgrade to a driver that supports CUDA 11.2 or higher.
└ @ CUDA /home/xyz/.julia/packages/CUDA/CtvPY/src/initialization.jl:42

$y = Ax$

v1 = randn(size(EUR_100, 1))
v1_ = randn(size(EUR_100, 1))
v2 = randn(size(EUR_100, 2));
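
Before timing anything, a quick correctness check (a small sketch, not part of the original benchmark run) confirms that the SnpLinAlg product matches the dense Float64 product; if the data contained missing genotypes, mean imputation could make the two differ:

LinearAlgebra.mul!(v1, EUR_100_sla, v2)   # SnpLinAlg product (mean-imputed)
isapprox(v1, EUR_100_mat * v2)            # compare against the dense matrix product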

With 8-threaded OpenBLAS (included in the standard binary installation of Julia):

BLAS.set_num_threads(8)
@benchmark LinearAlgebra.mul!($v1, $EUR_100_mat, $v2)
BenchmarkTools.Trial: 11 samples with 1 evaluation.
 Range (min … max):  408.568 ms … 478.025 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     461.873 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   457.339 ms ±  19.116 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                              █   █   █     ██   ██ █  █   █  
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁█▁▁▁█▁▁▁▁▁██▁▁▁██▁█▁▁█▁▁▁█ ▁
  409 ms           Histogram: frequency by time          478 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

With single-threaded OpenBLAS:

BLAS.set_num_threads(1)
@benchmark LinearAlgebra.mul!($v1, $EUR_100_mat, $v2)
BenchmarkTools.Trial: 3 samples with 1 evaluation.
 Range (min … max):  1.957 s …   2.089 s  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.986 s              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.011 s ± 69.424 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █           █                                           █  
  █▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  1.96 s         Histogram: frequency by time         2.09 s <

 Memory estimate: 0 bytes, allocs estimate: 0.

Direct linear algebra on a SnpArray, with mean imputation:

@benchmark LinearAlgebra.mul!($v1, $EUR_100_sla, $v2)
BenchmarkTools.Trial: 6 samples with 1 evaluation.
 Range (min … max):  840.116 ms … 852.731 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     844.339 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   845.643 ms ±   4.589 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █            █   █    █                     █               █  
  █▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁█▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  840 ms           Histogram: frequency by time          853 ms <

 Memory estimate: 160 bytes, allocs estimate: 1.

With zero imputation:

@benchmark LinearAlgebra.mul!($v1, $EUR_100_sla_, $v2)
BenchmarkTools.Trial: 9 samples with 1 evaluation.
 Range (min … max):  590.436 ms … 600.609 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     594.370 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   594.687 ms ±   2.947 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁      ▁         ▁     █     ▁▁     ▁                       ▁  
  █▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁█▁▁▁▁▁█▁▁▁▁▁██▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  590 ms           Histogram: frequency by time          601 ms <

 Memory estimate: 160 bytes, allocs estimate: 1.

Indeed, we are paying some price for mean imputation.

Below is the benchmark for SnpBitMatrix (always zero-imputed):

@benchmark (LinearAlgebra.mul!($v1, $EUR_100_bm, $v2))
BenchmarkTools.Trial: 2 samples with 1 evaluation.
 Range (min … max):  4.945 s …   4.981 s  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.963 s              ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.963 s ± 25.520 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                                                       █  
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  4.94 s         Histogram: frequency by time        4.98 s <

 Memory estimate: 0 bytes, allocs estimate: 0.

SnpLinAlg is clearly faster than SnpBitMatrix here, and it is also more stable in performance when the number of samples is not a multiple of 4 or 8.
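
The row counts bear this out: EUR_100 has a sample size divisible by 4, while EUR_101 does not (a quick check of the dimensions above):

size(EUR_100, 1) % 4, size(EUR_101, 1) % 4   # (0, 3): 37900 is a multiple of 4, 38279 is not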

v1 = randn(size(EUR_101, 1))
v2 = randn(size(EUR_101, 2));
@benchmark LinearAlgebra.mul!($v1, $EUR_101_sla, $v2)
BenchmarkTools.Trial: 6 samples with 1 evaluation.
 Range (min … max):  858.307 ms … 895.131 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     867.094 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   869.931 ms ±  13.289 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █  █        █   █    █                                      █  
  █▁▁█▁▁▁▁▁▁▁▁█▁▁▁█▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  858 ms           Histogram: frequency by time          895 ms <

 Memory estimate: 160 bytes, allocs estimate: 1.
@benchmark LinearAlgebra.mul!($v1, $EUR_101_sla_, $v2)
BenchmarkTools.Trial: 9 samples with 1 evaluation.
 Range (min … max):  615.004 ms … 631.572 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     616.410 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   618.802 ms ±   5.452 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁▁  ██     ▁                    ▁                           ▁  
  ██▁▁██▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  615 ms           Histogram: frequency by time          632 ms <

 Memory estimate: 160 bytes, allocs estimate: 1.
@benchmark LinearAlgebra.mul!($v1, $EUR_101_bm, $v2)
BenchmarkTools.Trial: 2 samples with 1 evaluation.
 Range (min … max):  4.967 s …   4.995 s  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.981 s              ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.981 s ± 19.665 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                                                       █  
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  4.97 s         Histogram: frequency by time        4.99 s <

 Memory estimate: 0 bytes, allocs estimate: 0.

Now let's try CUDA. The device is an Nvidia Titan V.
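
If you are reproducing this on a different machine, CUDA.jl can report the detected toolkit, driver, and device (the output will of course differ from the setup used here):

CUDA.versioninfo()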

using Adapt

Moving data to GPU:

v1 = randn(size(EUR_100, 1))
v1_ = randn(size(EUR_100, 1))
v2 = randn(size(EUR_100, 2));
v1_d = adapt(CuArray{Float64}, v1)
v1_d_ = similar(v1_d)
v2_d = adapt(CuArray{Float64}, v2);
using BenchmarkTools
@benchmark CUDA.@sync LinearAlgebra.mul!($v1_d, $EUR_100_cu, $v2_d)
BenchmarkTools.Trial: 240 samples with 1 evaluation.
 Range (min … max):  20.808 ms …  23.129 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     20.834 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   20.857 ms ± 200.753 μs  ┊ GC (mean ± σ):  0.08% ± 1.19%

  ▅█                                                            
  ███▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▅
  20.8 ms       Histogram: log(frequency) by time      22.1 ms <

 Memory estimate: 69.23 KiB, allocs estimate: 4393.

For CuSnpArray, the additional cost for mean imputation is negligible.

@benchmark CUDA.@sync LinearAlgebra.mul!($v1_d_, $EUR_100_cu_, $v2_d)
BenchmarkTools.Trial: 239 samples with 1 evaluation.
 Range (min … max):  20.813 ms …  30.895 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     20.837 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   20.945 ms ± 915.783 μs  ┊ GC (mean ± σ):  0.08% ± 1.27%

  █                                                             
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▅
  20.8 ms       Histogram: log(frequency) by time      27.4 ms <

 Memory estimate: 18.95 KiB, allocs estimate: 1175.
EUR_100_mat_d = adapt(CuArray, EUR_100_mat);
Out of GPU memory trying to allocate 15.263 GiB
Effective GPU memory usage: 11.12% (1.311 GiB/11.784 GiB)
CUDA allocator usage: 980.317 MiB
Memory pool usage: 980.317 MiB (980.317 MiB allocated, 0 bytes cached)

Stacktrace:
  [1] #alloc#176
    @ ~/.julia/packages/CUDA/CtvPY/src/pool.jl:267 [inlined]
  [2] alloc
    @ ~/.julia/packages/CUDA/CtvPY/src/pool.jl:259 [inlined]
  [3] CuArray{Float64, 2}(#unused#::UndefInitializer, dims::Tuple{Int64, Int64})
    @ CUDA ~/.julia/packages/CUDA/CtvPY/src/array.jl:28
  [4] CuArray
    @ ~/.julia/packages/CUDA/CtvPY/src/array.jl:241 [inlined]
  [5] CuArray
    @ ~/.julia/packages/CUDA/CtvPY/src/array.jl:249 [inlined]
  [6] convert
    @ ~/.julia/packages/GPUArrays/Tebtl/src/host/construction.jl:4 [inlined]
  [7] adapt_storage(#unused#::Type{CuArray}, xs::Matrix{Float64})
    @ CUDA ~/.julia/packages/CUDA/CtvPY/src/array.jl:286
  [8] adapt_structure(to::Type, x::Matrix{Float64})
    @ Adapt ~/.julia/packages/Adapt/RGNRk/src/Adapt.jl:42
  [9] adapt(to::Type, x::Matrix{Float64})
    @ Adapt ~/.julia/packages/Adapt/RGNRk/src/Adapt.jl:40
 [10] top-level scope
    @ In[21]:1
 [11] eval
    @ ./boot.jl:360 [inlined]
 [12] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
    @ Base ./loading.jl:1116
@benchmark CUDA.@sync LinearAlgebra.mul!($v1_d, $EUR_100_mat_d, $v2_d)
UndefVarError: EUR_100_mat_d not defined

Stacktrace:
 [1] top-level scope
   @ ~/.julia/packages/BenchmarkTools/tGTCy/src/execution.jl:440
 [2] eval
   @ ./boot.jl:360 [inlined]
 [3] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
   @ Base ./loading.jl:1116

The speedup is obvious: CuSnpArray is 30-50x faster than the CPU implementations, and it is both faster and more memory-efficient than doing the linear algebra with a dense floating-point matrix on the GPU, which, as seen above, does not even fit in GPU memory for data of this size.

isapprox(v1_d, v1_d_)
true
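
One can also cross-check the GPU result against the CPU SnpLinAlg result (a small sketch using the arrays defined above; collect copies the CuArray back to host memory):

LinearAlgebra.mul!(v1, EUR_100_sla, v2)   # CPU result with mean imputation
isapprox(collect(v1_d), v1)               # GPU result copied back to the host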

$y = A^T x$

v1 = randn(size(EUR_100, 1))
v2 = randn(size(EUR_100, 2))
v2_ = randn(size(EUR_100, 2))
v1_d = adapt(CuArray{Float64}, v1)
v2_d = adapt(CuArray{Float64}, v2);
@benchmark LinearAlgebra.mul!($v2, transpose($EUR_100_sla), $v1)
BenchmarkTools.Trial: 6 samples with 1 evaluation.
 Range (min … max):  851.938 ms … 875.831 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     864.144 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   864.772 ms ±   8.569 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                    █        ██                    █       █  
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁██▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁█ ▁
  852 ms           Histogram: frequency by time          876 ms <

 Memory estimate: 304 bytes, allocs estimate: 1.
@benchmark (LinearAlgebra.mul!($v2, transpose($EUR_100_bm), $v1))
BenchmarkTools.Trial: 1 sample with 1 evaluation.
 Single result which took 5.943 s (0.00% GC) to evaluate,
 with a memory estimate of 0 bytes, over 0 allocations.
@benchmark LinearAlgebra.mul!($v2_d, transpose($EUR_100_cu), $v1_d)
BenchmarkTools.Trial: 278 samples with 1 evaluation.
 Range (min … max):  17.848 ms … 52.127 ms  ┊ GC (min … max): 0.00% … 65.80%
 Time  (median):     17.878 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   18.038 ms ±  2.084 ms  ┊ GC (mean ± σ):  0.68% ±  3.95%

  █▅                                                           
  ██▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▅
  17.8 ms      Histogram: log(frequency) by time        20 ms <

 Memory estimate: 142.14 KiB, allocs estimate: 9059.
isapprox(collect(v2_d), v2)
true
v1 = randn(size(EUR_101, 1))
v2 = randn(size(EUR_101, 2));
@benchmark LinearAlgebra.mul!($v2, transpose($EUR_101_sla), $v1)
BenchmarkTools.Trial: 5 samples with 1 evaluation.
 Range (min … max):  868.448 ms …    1.327 s  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     895.181 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):      1.033 s ± 217.498 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █  ▁                                        ▁               ▁  
  █▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  868 ms           Histogram: frequency by time          1.33 s <

 Memory estimate: 304 bytes, allocs estimate: 1.
@benchmark LinearAlgebra.mul!($v2, transpose($EUR_101_sla_), $v1)
BenchmarkTools.Trial: 6 samples with 1 evaluation.
 Range (min … max):  927.944 ms …   1.028 s  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     934.817 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   958.481 ms ± 42.785 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁█    ▁                                 ▁                  ▁  
  ██▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  928 ms          Histogram: frequency by time          1.03 s <

 Memory estimate: 304 bytes, allocs estimate: 1.
@benchmark (LinearAlgebra.mul!($v2, transpose($EUR_101_bm), $v1))
BenchmarkTools.Trial: 1 sample with 1 evaluation.
 Single result which took 5.813 s (0.00% GC) to evaluate,
 with a memory estimate of 0 bytes, over 0 allocations.

SnpLinAlg is also faster than SnpBitMatrix in this direction.

$Y = AX$

Now for matrix-matrix multiplications. If we want to center/scale the SnpArray, we have

\[\begin{aligned} Y_{ij} = \sum_{k} \frac{A_{ik} - \mu_k}{\sigma_k}X_{kj} \end{aligned}\]

where $\mu_k$ and $\sigma_k$ are the mean and standard deviation of the $k$th SNP. Centering and scaling are performed on the fly. First we check correctness:

EUR = SnpArray(SnpArrays.datadir("EUR_subset.bed"));
EUR_10 = [EUR; EUR; EUR; EUR; EUR; EUR; EUR; EUR; EUR; EUR]

m = size(EUR_10, 1)
n = size(EUR_10, 2)
p = 2

A = SnpLinAlg{Float64}(EUR_10; model=ADDITIVE_MODEL, impute=true, center=true, scale=true);
X = rand(n, p)
Y = zeros(m, p)
SnpArrays.mul!(Y, A, X)
Afloat = convert(Matrix{Float64}, EUR_10, impute=true, center=true, scale=true)
Ytrue = Afloat * X
all(Y .≈ Ytrue)
WARNING: redefinition of constant EUR. This may fail, cause incorrect answers, or produce other errors.

true

Now let's check the timings. If $X$ is a "tall and thin" matrix, then SnpLinAlg remains competitive with, and often superior to, BLAS.

# SnpLinAlg-matrix
@benchmark LinearAlgebra.mul!($Y, $A, $X)
BenchmarkTools.Trial: 46 samples with 1 evaluation.
 Range (min … max):  105.415 ms … 132.895 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     106.670 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   108.777 ms ±   6.331 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▄█▅                                                           
  ▇███▃▃▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▃▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▃ ▁
  105 ms           Histogram: frequency by time          133 ms <

 Memory estimate: 112 bytes, allocs estimate: 1.
# BLAS with 1 thread
BLAS.set_num_threads(1)
@benchmark LinearAlgebra.mul!($Y, $Afloat, $X)
BenchmarkTools.Trial: 8 samples with 1 evaluation.
 Range (min … max):  589.561 ms … 885.109 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     601.005 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   635.873 ms ± 100.867 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

    █▃                                                           
  ▇▇██▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇ ▁
  590 ms           Histogram: frequency by time          885 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.
# BLAS with 8 threads
BLAS.set_num_threads(8)
@benchmark LinearAlgebra.mul!($Y, $Afloat, $X)
BenchmarkTools.Trial: 40 samples with 1 evaluation.
 Range (min … max):  109.089 ms … 609.969 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     111.358 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   127.126 ms ±  78.665 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                                                              
  █▃▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂ ▁
  109 ms           Histogram: frequency by time          610 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

But if $X$ is a large matrix too, BLAS pulls ahead: single-threaded BLAS is roughly 1.5x faster here, and 8-threaded BLAS is about 10x faster.

# SnpLinAlg-matrix
p = 100
X = rand(n, p)
Y = zeros(m, p)
@benchmark LinearAlgebra.mul!($Y, $A, $X)
BenchmarkTools.Trial: 2 samples with 1 evaluation.
 Range (min … max):  3.599 s …  3.602 s  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.600 s             ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.600 s ± 1.921 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                                                      █  
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  3.6 s         Histogram: frequency by time         3.6 s <

 Memory estimate: 112 bytes, allocs estimate: 1.
# BLAS with 1 thread
BLAS.set_num_threads(1)
@benchmark LinearAlgebra.mul!($Y, $Afloat, $X)
BenchmarkTools.Trial: 3 samples with 1 evaluation.
 Range (min … max):  2.192 s …    2.759 s  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.192 s               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.381 s ± 326.877 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                                                        ▁  
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  2.19 s         Histogram: frequency by time         2.76 s <

 Memory estimate: 0 bytes, allocs estimate: 0.
# BLAS with 8 threads
BLAS.set_num_threads(8)
@benchmark LinearAlgebra.mul!($Y, $Afloat, $X)
BenchmarkTools.Trial: 15 samples with 1 evaluation.
 Range (min … max):  323.403 ms … 366.700 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     330.332 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   334.467 ms ±  11.791 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁█▁  █▁  ▁      ▁▁   ▁▁        ▁▁                           ▁  
  ███▁▁██▁▁█▁▁▁▁▁▁██▁▁▁██▁▁▁▁▁▁▁▁██▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  323 ms           Histogram: frequency by time          367 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

$Y = A^T X$

If we want to center/scale the SnpArray, we have

\[\begin{aligned} Y_{ij} = \sum_{k} \left(\frac{A_{ki} - \mu_i}{\sigma_i}\right)X_{kj} \end{aligned}\]

where $\mu_i$ and $\sigma_i$ are the mean and standard deviation of the $i$th SNP. As before, let's first check correctness:

EUR = SnpArray(SnpArrays.datadir("EUR_subset.bed"));
EUR_10 = [EUR; EUR; EUR; EUR; EUR; EUR; EUR; EUR; EUR; EUR]

m = size(EUR_10, 1)
n = size(EUR_10, 2)
p = 2

A = SnpLinAlg{Float64}(EUR_10; model=ADDITIVE_MODEL, impute=true, center=true, scale=true);
X = rand(m, p)
Y = zeros(n, p)
SnpArrays.mul!(Y, Transpose(A), X)
Afloat = convert(Matrix{Float64}, EUR_10, impute=true, center=true, scale=true)
Ytrue = Afloat' * X
all(Y .≈ Ytrue)
WARNING: redefinition of constant EUR. This may fail, cause incorrect answers, or produce other errors.

true

Now let's check the timings. If $X$ is a "tall and thin" matrix, then SnpLinAlg remains competitive with, and often superior to, BLAS.

# SnpLinAlg-matrix
@benchmark LinearAlgebra.mul!($Y, $(Transpose(A)), $X)
BenchmarkTools.Trial: 25 samples with 1 evaluation.
 Range (min … max):  193.170 ms … 217.577 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     199.180 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   200.356 ms ±   5.605 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

     ▃      █  ▃   ▃▃                                            
  ▇▁▁█▁▇▇▇▁▇█▁▁█▁▇▁██▇▁▇▇▁▁▁▁▇▁▁▁▁▁▁▇▁▁▁▇▁▁▁▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇ ▁
  193 ms           Histogram: frequency by time          218 ms <

 Memory estimate: 304 bytes, allocs estimate: 1.
# BLAS with 1 thread
BLAS.set_num_threads(1)
@benchmark LinearAlgebra.mul!($Y, $(Transpose(Afloat)), $X)
BenchmarkTools.Trial: 13 samples with 1 evaluation.
 Range (min … max):  402.128 ms … 452.548 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     404.967 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   408.729 ms ±  13.282 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

   █ █▃                                                          
  ▇█▁██▇▇▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇ ▁
  402 ms           Histogram: frequency by time          453 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.
# BLAS with 8 threads
BLAS.set_num_threads(8)
@benchmark LinearAlgebra.mul!($Y, $(Transpose(Afloat)), $X)
BenchmarkTools.Trial: 74 samples with 1 evaluation.
 Range (min … max):  65.019 ms … 96.348 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     65.594 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   67.762 ms ±  5.614 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅█▂                                                          
  ███▁▅▁▁▁▁▁▅▅▆▅▅▅▁▁▅▁▁▁▁▁▁▅▅▁▁▁▅▁▁▁▁▅▁▁▁▁▁▁▁▁▁▁▅▁▁▁▁▁▁▁▁▁▁▅▅ ▁
  65 ms        Histogram: log(frequency) by time      86.5 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

But if $X$ is a large matrix too, BLAS again wins: single-threaded BLAS is over 2x faster here, and 8-threaded BLAS is more than 10x faster.

# SnpLinAlg-matrix
p = 100
X = rand(m, p)
Y = zeros(n, p)
@benchmark LinearAlgebra.mul!($Y, $(Transpose(A)), $X)
BenchmarkTools.Trial: 2 samples with 1 evaluation.
 Range (min … max):  4.218 s …    4.367 s  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.293 s               ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.293 s ± 105.845 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                                                        █  
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  4.22 s         Histogram: frequency by time         4.37 s <

 Memory estimate: 304 bytes, allocs estimate: 1.
# BLAS with 1 thread
BLAS.set_num_threads(1)
@benchmark LinearAlgebra.mul!($Y, $(Transpose(Afloat)), $X)
BenchmarkTools.Trial: 3 samples with 1 evaluation.
 Range (min … max):  1.845 s …   1.870 s  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.855 s              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.857 s ± 12.738 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                      █                                █  
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  1.84 s         Histogram: frequency by time        1.87 s <

 Memory estimate: 0 bytes, allocs estimate: 0.
# BLAS with 8 threads
BLAS.set_num_threads(8)
@benchmark LinearAlgebra.mul!($Y, $(Transpose(Afloat)), $X)
BenchmarkTools.Trial: 17 samples with 1 evaluation.
 Range (min … max):  276.946 ms … 386.706 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     285.168 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   302.769 ms ±  37.051 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                                                              
  █▅█▁▅▅▁▁▁▁▅▁▁▅▅▁▁▁▁▁▁▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▁▅▁▁▁▁▁▁▅ ▁
  277 ms           Histogram: frequency by time          387 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

Conclusion

  • SnpLinAlg (for CPU)
    • achieves up to 32x memory savings compared to double-precision matrices (see the back-of-the-envelope check below)
    • is usually faster than single-threaded BLAS for matrix-vector multiply.
    • is competitive with single-threaded BLAS for matrix-matrix multiply if $X$ is "tall and thin"
  • CuSnpArray supports GPU matrix-vector operations that are 30-50x faster than multithreaded BLAS.
  • Other linear algebra operations (e.g. $v*A$, $qr(A)$, etc.) will be much slower and are not guaranteed to work.
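
The 32x figure follows from the storage formats: the PLINK .bed encoding packs four genotypes (2 bits each) into one byte, versus 8 bytes per Float64 entry. A back-of-the-envelope check using the data above:

n_obs, n_snps = size(EUR_100)
dense_bytes  = 8 * n_obs * n_snps        # Float64 matrix: 8 bytes per genotype
packed_bytes = cld(n_obs, 4) * n_snps    # .bed encoding: 4 genotypes per byte, padded per SNP
dense_bytes / packed_bytes               # ≈ 32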