Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/prince781/libgpublas
Drop-in GPU acceleration for linear algebra.
- Host: GitHub
- URL: https://github.com/prince781/libgpublas
- Owner: Prince781
- License: gpl-3.0
- Created: 2017-09-06T05:14:10.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2020-02-22T21:18:59.000Z (almost 5 years ago)
- Last Synced: 2024-11-14T13:44:25.962Z (3 months ago)
- Topics: blas, blas-kernels, c, cblas, clblas, cuda, gpu, gpu-acceleration, hpc, interposition, linear-algebra, nvidia, opencl
- Language: Fortran
- Size: 445 KB
- Stars: 3
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# blas2cuda
This is a library to intercept calls to CPU BLAS kernels and run their
equivalent on the GPU in a CUDA environment.
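The interception relies on dynamic-linker symbol interposition: a shared library loaded via `LD_PRELOAD` defines the same symbols as the CPU BLAS, so calls resolve to it instead. Below is a minimal sketch of that technique for `cblas_sgemm` (forwarding to the real cuBLAS routine `cublasSgemm`); it is an illustration, not blas2cuda's actual code, and it handles only column-major operands that are already device-accessible.

```
/* Sketch of BLAS interposition (illustrative, not blas2cuda's source).
 * Build as a shared library and load it with LD_PRELOAD, e.g.:
 *   gcc -shared -fPIC -o libshim.so shim.c -lcublas -lcudart */
#include <cuda_runtime.h>
#include <cublas_v2.h>

/* Minimal CBLAS declarations; real code would match the system cblas.h. */
typedef enum { CblasRowMajor = 101, CblasColMajor = 102 } CBLAS_LAYOUT;
typedef enum { CblasNoTrans = 111, CblasTrans = 112 } CBLAS_TRANSPOSE;

void cblas_sgemm(CBLAS_LAYOUT layout,
                 CBLAS_TRANSPOSE transa, CBLAS_TRANSPOSE transb,
                 int m, int n, int k,
                 float alpha, const float *a, int lda,
                 const float *b, int ldb,
                 float beta, float *c, int ldc)
{
    static cublasHandle_t handle;

    (void)layout;   /* a real shim must also translate row-major calls */
    if (!handle)
        cublasCreate(&handle);
    /* cuBLAS is column-major; a, b and c must be device-accessible,
     * which is what the managed-memory scheme described below provides. */
    cublasSgemm(handle,
                transa == CblasNoTrans ? CUBLAS_OP_N : CUBLAS_OP_T,
                transb == CblasNoTrans ? CUBLAS_OP_N : CUBLAS_OP_T,
                m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc);
    cudaDeviceSynchronize();   /* make the result visible to the CPU */
}
```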
## Compiling

```
$ export CUDA=...
$ meson -DCUDA=$CUDA build && ninja -C build
```

## How it works
TODO: expand this section

### Allocation tracking
- done once
- track ALL object allocations
- remove object-specific information, giving us only calls to `malloc()`
- saved to file (see the sketch after this list)
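As a hedged illustration of this pass, the sketch below logs the call site and requested size of every `malloc()` from an `LD_PRELOAD` shim; the file name `allocs.txt`, the output format, and the use of `__builtin_return_address` (blas2cuda uses libunwind instead) are assumptions made for the example.

```
/* Sketch of the one-time allocation-tracking pass (illustrative).
 * Build: gcc -shared -fPIC -o libtrack.so track.c -ldl */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

static void *(*real_malloc)(size_t);
static FILE *log_file;
static __thread int in_hook;   /* guard: fopen/fprintf also allocate */

void *malloc(size_t size)
{
    if (!real_malloc)
        real_malloc = dlsym(RTLD_NEXT, "malloc");

    void *ptr = real_malloc(size);

    if (!in_hook) {
        in_hook = 1;
        if (!log_file)
            log_file = fopen("allocs.txt", "w");   /* assumed file name */
        if (log_file)
            /* record only the generic call info: call site and size */
            fprintf(log_file, "%p %zu\n",
                    __builtin_return_address(0), size);
        in_hook = 0;
    }
    return ptr;
}
```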
### Object tracking
- in code (blas2cuda):
  - define a custom object manager (`struct objmngr`)
  - any time an allocation file is loaded, we use this memory manager
  - any time `malloc()` is called (a sketch follows this list):
    1. the object tracker compares the call info (the requested size, plus
       the instruction pointer obtained with libunwind) against the
       allocation list
    2. if the call info matches, allocate the object using the custom memory
       manager defined for that call, and track the object
    3. if the call info doesn't match, behave normally
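A sketch of the matching step follows. `struct objmngr` is named above, but its fields, and the helpers `lookup_call`, `track_object`, and `real_malloc`, are assumptions made for illustration, not blas2cuda's actual interface.

```
#include <stddef.h>
#include <stdbool.h>

/* Assumed shape of the custom object manager named in this README. */
struct objmngr {
    void *(*ctor)(size_t size);   /* replacement allocator   */
    void  (*dtor)(void *ptr);     /* replacement deallocator */
};

struct call_info {
    void  *ip;     /* instruction pointer of the malloc() call site */
    size_t size;   /* requested allocation size                     */
};

/* Assumed helpers: match call info against the loaded allocation list,
 * and remember managed objects so free() can be redirected later. */
extern bool lookup_call(const struct call_info *info, struct objmngr **mngr);
extern void track_object(void *ptr, const struct objmngr *mngr);
extern void *(*real_malloc)(size_t);

void *malloc_hook(size_t size)
{
    struct call_info info = {
        /* 1. a real implementation walks the stack with libunwind
         *    (unw_getcontext/unw_step/unw_get_reg) to find the IP */
        .ip = __builtin_return_address(0),
        .size = size,
    };
    struct objmngr *mngr;

    if (lookup_call(&info, &mngr)) {
        /* 2. call info matches: allocate with the custom manager */
        void *ptr = mngr->ctor(size);
        track_object(ptr, mngr);
        return ptr;
    }
    /* 3. no match: behave like a normal malloc() */
    return real_malloc(size);
}
```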
### Motivation for object tracking
- Each time a kernel is called, we would have to copy data to the GPU, invoke
  the kernel, and copy the result back to the CPU. For a series of calls to
  kernels that aren't computation-intensive (Level 1 and Level 2 BLAS calls
  are vector-vector and matrix-vector operations), throughput is significantly
  degraded, as the time to transfer data dominates the computation.
- This is why [NVBLAS](https://docs.nvidia.com/cuda/nvblas/index.html), a
similar project, only intercepts computation-intensive Level 3
matrix-matrix operations, where the computation dominates data transfer.
- However, this still leaves the issue of copying back and forth.
- blas2cuda uses object tracking to distinguish memory objects that are used in
BLAS kernels from other memory objects we don't care about.
- When a call is made to `malloc()` that we should care about, we use
  `cudaMallocManaged()` instead and return a memory address that is shared
  between the CPU and GPU (see the sketch after this list). This memory is a
  managed object, and a later call to `free()` will use `cudaFree()` instead.
- By intercepting the right calls, we can tell when these memory objects are
later used in kernels, and avoid copying.
- Instead of explicit copying, a [page-faulting mechanism is used to migrate
  data between the CPU and GPU](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-data-migration).
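A hedged sketch of that substitution is below. `cudaMallocManaged()` and `cudaFree()` are the real CUDA runtime calls; `is_blas_call_site`, `is_tracked`, `track`, and `untrack` are assumed bookkeeping helpers, not blas2cuda's actual interface.

```
#include <stdlib.h>
#include <stdbool.h>
#include <cuda_runtime.h>

extern bool is_blas_call_site(size_t size);   /* assumed matcher */
extern bool is_tracked(void *ptr);            /* assumed lookup  */
extern void track(void *ptr);
extern void untrack(void *ptr);
extern void *(*real_malloc)(size_t);
extern void  (*real_free)(void *);

void *malloc_shim(size_t size)
{
    if (is_blas_call_site(size)) {
        void *ptr = NULL;
        /* one pointer, valid on CPU and GPU; pages migrate on fault */
        if (cudaMallocManaged(&ptr, size, cudaMemAttachGlobal) == cudaSuccess) {
            track(ptr);
            return ptr;
        }
    }
    return real_malloc(size);
}

void free_shim(void *ptr)
{
    if (is_tracked(ptr)) {
        untrack(ptr);
        cudaFree(ptr);   /* managed objects must be freed through CUDA */
    } else {
        real_free(ptr);
    }
}
```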
### (Outdated) Running a program

`./blas2cuda.sh `