BatchedBLAS.jl: batch what Nvidia does not

https://github.com/janeliascicomp/batchedblas.jl

Nvidia only provides batched versions of the BLAS `gemv`, `gemm`, and `trsm`
functions. Moreover, those routines only support floating-point element types,
and the scaling coefficients alpha and beta can only be scalars. BatchedBLAS.jl
extends support for batched arrays by (currently) providing batched versions of
`dot`, `gemv`, `symv`, `spmv`, `ger`, `syr`, and `spr` that work with arrays of
AbstractFloats and Integers and accept scaling coefficients that are either
scalars or Vectors.

In addition to the type flexibility, there is a performance benefit for rank-1
updates: `ger`, `syr`, and `spr` execute faster than the equivalent batched
`gemm` for the range of parameters tested. `dot` is also faster for small
matrices. Benchmarks on an H100 are shown below. The dashed lines are for the
transposed version of `gemv` and the upper-triangle versions of all other
functions; lower numbers are better.

![benchmarks](/bench/bench.svg)
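
A single point on these curves can be reproduced with a timing comparison along
the lines of the following sketch. It assumes `batched_ger!` follows
`BLAS.ger!`'s argument order (alpha, x, y, A) and uses CUDA.jl's
`CUBLAS.gemm_strided_batched!` as the batched `gemm` baseline; the sizes are
illustrative.

```julia
using CUDA, BatchedBLAS, BenchmarkTools

L, N = 64, 1024
A = CUDA.rand(Float64, L, L, N)
x = CUDA.rand(Float64, L, N)
y = CUDA.rand(Float64, L, N)

# rank-1 update with BatchedBLAS: A[:,:,i] += x[:,i] * y[:,i]'
@btime CUDA.@sync batched_ger!(1.0, $x, $y, $A)

# the same update expressed as a batched gemm via cuBLAS
x3, y3 = reshape(x, L, 1, N), reshape(y, L, 1, N)
@btime CUDA.@sync CUDA.CUBLAS.gemm_strided_batched!('N', 'T', 1.0, $x3, $y3, 1.0, $A)
```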

Example usage:

```julia
julia> using CUDA, SymmetricFormats, BatchedBLAS

julia> L = 4 # the matrix size
4

julia> N = 6 # the batch dimension
6

julia> A = reshape(1:L*L*N, L, L, N)
4×4×6 reshape(::UnitRange{Int64}, 4, 4, 6) with eltype Int64:
[:, :, 1] =
 1  5   9  13
 2  6  10  14
 3  7  11  15
 4  8  12  16

[:, :, 2] =
 17  21  25  29
 18  22  26  30
 19  23  27  31
 20  24  28  32

[:, :, 3] =
 33  37  41  45
 34  38  42  46
 35  39  43  47
 36  40  44  48

[:, :, 4] =
 49  53  57  61
 50  54  58  62
 51  55  59  63
 52  56  60  64

[:, :, 5] =
 65  69  73  77
 66  70  74  78
 67  71  75  79
 68  72  76  80

[:, :, 6] =
 81  85  89  93
 82  86  90  94
 83  87  91  95
 84  88  92  96

julia> SP = CuArray{Float64}(undef, packedsize(A), N);   # packed upper triangles, one column per batch member

julia> for i in 1:N   # pack the upper triangle of each slice of A
           SP[:,i] = SymmetricPacked(view(A,:,:,i), :U).tri
       end

julia> SP
10×6 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:
  1.0  17.0  33.0  49.0  65.0  81.0
  5.0  21.0  37.0  53.0  69.0  85.0
  6.0  22.0  38.0  54.0  70.0  86.0
  9.0  25.0  41.0  57.0  73.0  89.0
 10.0  26.0  42.0  58.0  74.0  90.0
 11.0  27.0  43.0  59.0  75.0  91.0
 13.0  29.0  45.0  61.0  77.0  93.0
 14.0  30.0  46.0  62.0  78.0  94.0
 15.0  31.0  47.0  63.0  79.0  95.0
 16.0  32.0  48.0  64.0  80.0  96.0

julia> x = CuArray{Float64}(reshape(1:L*N, L, N))
4×6 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:
 1.0  5.0   9.0  13.0  17.0  21.0
 2.0  6.0  10.0  14.0  18.0  22.0
 3.0  7.0  11.0  15.0  19.0  23.0
 4.0  8.0  12.0  16.0  20.0  24.0

julia> batched_spr!('U', 1.0, x, SP)   # rank-1 update of each packed matrix: SP[:,i] += packed(x[:,i] * x[:,i]')

julia> SP
10×6 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:
  2.0  42.0  114.0  218.0  354.0  522.0
  7.0  51.0  127.0  235.0  375.0  547.0
 10.0  58.0  138.0  250.0  394.0  570.0
 12.0  60.0  140.0  252.0  396.0  572.0
 16.0  68.0  152.0  268.0  416.0  596.0
 20.0  76.0  164.0  284.0  436.0  620.0
 17.0  69.0  153.0  269.0  417.0  597.0
 22.0  78.0  166.0  286.0  438.0  622.0
 27.0  87.0  179.0  303.0  459.0  647.0
 32.0  96.0  192.0  320.0  480.0  672.0

julia> SymmetricPacked(SP[:,1])   # unpack the first batch member
4×4 SymmetricPacked{Float64, CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}}:
  2.0   7.0  12.0  17.0
  7.0  10.0  16.0  22.0
 12.0  16.0  20.0  27.0
 17.0  22.0  27.0  32.0

julia> SymmetricPacked(A[:,:,1] .+ Array(x[:,1]*transpose(x[:,1])))   # the same result computed directly, for comparison
4×4 SymmetricPacked{Float64, Matrix{Float64}}:
  2.0   7.0  12.0  17.0
  7.0  10.0  16.0  22.0
 12.0  16.0  20.0  27.0
 17.0  22.0  27.0  32.0
```
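
The scaling coefficient need not be a scalar. Continuing the session above, a
per-batch alpha could be passed as a Vector; this is a sketch, assuming the
coefficient argument accepts a CuVector with one entry per batch member.

```julia
julia> alpha = CuArray(fill(2.0, N));   # assumed: one coefficient per batch member

julia> batched_spr!('U', alpha, x, SP)  # SP[:,i] += alpha[i] * packed(x[:,i] * x[:,i]')
```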