https://github.com/TeamBipartite/bipartite-gemm
High throughput data-parallel GEMM implementations in Cuda using Cuda cores and Tensor cores
- Host: GitHub
- URL: https://github.com/TeamBipartite/bipartite-gemm
- Owner: TeamBipartite
- Created: 2024-10-12T23:31:30.000Z (over 1 year ago)
- Default Branch: master
- Last Pushed: 2025-01-05T00:45:21.000Z (over 1 year ago)
- Last Synced: 2025-02-19T23:41:44.149Z (over 1 year ago)
- Topics: cuda, data-parallelism, gemm
- Language: C++
- Homepage:
- Size: 834 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# tempNametempName
A high-throughput data-parallel linear algebra library.
# Prerequisites
You should have a CUDA environment installed with a GPU of compute
capability 7.5 or higher. Our code has been tested on both the Turing and Ampere
architectures.
If using the included benchmark/verification program, OpenBLAS is helpful for checking correctness of output, but not necessary.
See the 'Build' section for more details.
# Build
A makefile in the top-level directory handles building the application.
The default target is `sm_75` (e.g., Compute Capability 7.5 graphics cards such as the Turing-based T4).
To build the library with the default options, run:
```bash
$ make
```
## Build options
To specify the `arch` string for your target, use the `TARGET` variable when
calling `make`. For example, for a Compute Capability 8.6 graphics card such as the RTX 3060, use:
```bash
$ make TARGET=sm_86
```
To configure the tests run by `bench`, the following options may also be passed:
* `NUM_SMS=n` For accurate GFLOPs per SM calculations, this defines the number
of SMs to consider. default=28
* `TEST_N=n` Input size to use for test calculations. Note that this may be
padded as required by the individual library functions. default=4096
* `TEST_MAX_ELEMENT=n` For library functions that operate on numerical data,
specifies the maximum element to (randomly) generate for test data.
default=1
The following options may also be specified to configure use of the OpenBLAS
library in `bench`:
* `USE_OPENBLAS=yes`: By default, the OpenBLAS library is called to
perform a CPU matrix multiplication to serve as a baseline to check for
correctness. If you do not have OpenBLAS on your system, set this variable
to `no` to use the provided naive O(n^3) CPU implementation instead.
* `OPENBLAS_NUM_THREADS=$(nproc)`: If `USE_OPENBLAS=yes`, this option can be
specified to reduce the number of threads used by OpenBLAS.
Clean the build before switching configurations, since these options are fixed at compile time.
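For readers unfamiliar with the n^3 baseline mentioned under `USE_OPENBLAS`, the following is a sketch of what such a reference multiplication typically looks like. It is not the implementation shipped in this repository: the name `naive_matmul`, the row-major layout, and the use of `std::vector<float>` are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical reference routine, not the repository's own code.
// Computes C = A * B for square n x n matrices stored in row-major order.
std::vector<float> naive_matmul(const std::vector<float>& a,
                                const std::vector<float>& b,
                                std::size_t n) {
    std::vector<float> c(n * n, 0.0f);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t k = 0; k < n; ++k) {
            const float a_ik = a[i * n + k];
            // The i-k-j loop order walks rows of B and C contiguously.
            for (std::size_t j = 0; j < n; ++j) {
                c[i * n + j] += a_ik * b[k * n + j];
            }
        }
    }
    return c;
}
```

A routine like this performs n³ multiply-add steps, so at the default `TEST_N=4096` it is far slower than OpenBLAS; it is useful only as a correctness baseline.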
# Install & run
Simply copy the `tempNametempName` directory to your system's `include`
directory to install the library for use in other applications. The interface is
provided within the `tempNametempName` namespace.
A `bench` executable is generated in the `build` directory for testing and verification purposes.
Run this executable to both verify output correctness and benchmark
the library; it does not need to be installed to run. Note that the test parameters must be passed at compile time using
the options described in the 'Build options' section of this document. This
ensures the best possible performance through the use of compile-time constants.
For debugging and output inspection, `bench` provides a `-p` argument. When this
argument is provided, both the expected and actual outputs are printed.
**PRECISION NOTE:** We provide an FP16 version of our tensor matrix multiplication,
but since FP16 only has a 10-bit mantissa, it can be quite inaccurate for larger
matrices (or matrices with large values). We use a fixed epsilon (currently set
to `0.00001`) for output checking. This works on the provided example, but if the
parameters in `main.cu` are changed, **the test may report a failure for the FP16 runs
even though the output is as accurate as practical for FP16**.
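The effect described in the precision note can be reproduced on the CPU without any CUDA types. The sketch below rounds a `float` to 11 significant bits (FP16's 10 stored mantissa bits plus the implicit leading bit) and compares the result against a fixed epsilon. Both function names are hypothetical, and the rounding is a simplification: it ignores FP16's limited exponent range, subnormals, and round-to-even tie-breaking, so it is not a bit-exact model of a real half-precision conversion.

```cpp
#include <cmath>

// Hypothetical helper: keep only 11 significant bits of x, approximating
// the precision (not the range) of an FP16 value.
float round_to_fp16_precision(float x) {
    int exp = 0;
    const float mant = std::frexp(x, &exp);  // x == mant * 2^exp, 0.5 <= |mant| < 1
    const float scaled = std::round(std::ldexp(mant, 11));  // integer with up to 11 bits
    return std::ldexp(scaled, exp - 11);
}

// Hypothetical helper: absolute-difference check against a fixed epsilon,
// similar in spirit to the check described in the note above.
bool within_epsilon(float expected, float actual, float eps) {
    return std::fabs(expected - actual) <= eps;
}
```

Under this model, `round_to_fp16_precision(0.1f)` yields 0.0999755859375, which is about 2.4e-5 away from 0.1. A single element can therefore already exceed an epsilon of `0.00001`, before any accumulation error from the multiplication itself, whereas values such as 1.0 that are exactly representable pass unchanged.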