https://github.com/kerneltuner/kernel_float
CUDA/HIP header-only library for writing vectorized and low-precision (16 bit, 8 bit) GPU kernels
- Host: GitHub
- URL: https://github.com/kerneltuner/kernel_float
- Owner: KernelTuner
- License: apache-2.0
- Created: 2023-02-21T08:52:34.000Z (almost 3 years ago)
- Default Branch: main
- Last Pushed: 2025-04-11T08:28:58.000Z (9 months ago)
- Last Synced: 2025-04-11T09:56:41.425Z (9 months ago)
- Topics: bfloat16, cpp, cuda, floating-point, gpu, half-precision, header-only-library, hip, kernel-tuner, low-precision, mixed-precision, performance, reduced-precision, vectorization
- Language: C++
- Homepage: https://kerneltuner.github.io/kernel_float/
- Size: 7.23 MB
- Stars: 7
- Watchers: 3
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Kernel Float

_Kernel Float_ is a header-only library for CUDA/HIP that simplifies working with vector types and reduced precision floating-point arithmetic in GPU code.
## Summary
CUDA/HIP natively offers several reduced precision floating-point types (`__half`, `__nv_bfloat16`, `__nv_fp8_e4m3`, `__nv_fp8_e5m2`)
and vector types (e.g., `__half2`, `__nv_fp8x4_e4m3`, `float3`).
However, working with these types is cumbersome:
mathematical operations require intrinsics (e.g., `__hadd2` performs addition for `__half2`),
type conversion is awkward (e.g., `__nv_cvt_halfraw2_to_fp8x2` converts float16 to float8),
and some functionality is missing (e.g., one cannot convert a `__half` to `__nv_bfloat16`).
_Kernel Float_ resolves this by offering a single data type, `kernel_float::vec<T, N>`, that stores `N` elements of type `T`.
Internally, the data is stored as a fixed-size array of elements.
Operator overloading (such as `+`, `*`, `&&`) is implemented so that the best available intrinsic for the given types is selected automatically.
Many mathematical functions (like `log`, `exp`, `sin`) and common operations (such as `sum`, `range`, `for_each`) are also available.
Using Kernel Float, developers avoid the complexity of reduced precision floating-point types in CUDA and can focus on their applications.
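For illustration, here is a device-side sketch of what this looks like in practice. It is not taken from the library's examples; it assumes that `kf::exp` and `kf::sum` follow the naming above and that scalar operands broadcast over a vector, as in the example kernel further down.

```cpp
#include "kernel_float.h"
namespace kf = kernel_float;

// Element-wise multiply and exp, followed by a horizontal sum.
// The library selects suitable intrinsics for the element type automatically.
__device__ float scaled_exp_sum(kf::vec<float, 4> values, float scale) {
    kf::vec<float, 4> scaled = values * scale;  // scalar broadcast over all four lanes
    return kf::sum(kf::exp(scaled));            // per-element exp, then reduction to a single float
}
```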
## Features
In a nutshell, _Kernel Float_ offers the following features:
* Single type `vec` that unifies all vector types.
* Operator overloading to simplify programming.
* Support for half (16 bit) floating-point arithmetic, with a fallback to single precision for unsupported operations (see the sketch after this list).
* Support for quarter (8 bit) floating-point types.
* Easy integration as a single header file.
* Written for C++17.
* Compatible with NVCC (NVIDIA Compiler) and NVRTC (NVIDIA Runtime Compilation).
* Compatible with HIPCC (AMD HIP Compiler).
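As a minimal sketch of the half-precision support (an illustration only, assuming `kf::sin` exists as the summary suggests; whether an operation maps to a native `__half` intrinsic or to the single-precision fallback depends on the operation and the target architecture):

```cpp
#include "kernel_float.h"
namespace kf = kernel_float;

// 16-bit arithmetic on two halves at once: the addition can map to a packed
// intrinsic such as __hadd2, while sin() may take the single-precision fallback path.
__device__ kf::vec<half, 2> shifted_sine(kf::vec<half, 2> x, kf::vec<half, 2> offset) {
    return kf::sin(x) + offset;
}
```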
## Example
Check out the [examples](https://github.com/KernelTuner/kernel_float/tree/master/examples) directory for more examples.
Below is a simple example of a CUDA kernel that adds a `constant` to the `input` array and writes the results to the `output` array.
Each thread processes two elements.
Notice how easy it would be to change the precision (for example, `double` to `half`) or the vector size (for example, 4 instead of 2 items per thread).
```cpp
#include "kernel_float.h"
namespace kf = kernel_float;
__global__ void kernel(const kf::vec<half, 2>* input, float constant, kf::vec<float, 2>* output) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    output[i] = input[i] + kf::cast<half>(constant);
}
```
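Since each thread handles one `kf::vec<half, 2>` (two scalar elements), a host-side launch could look like the sketch below. This is illustrative only and not part of the repository; it assumes the element count `n` is even and a multiple of twice the block size, since the kernel performs no bounds check.

```cpp
// Hypothetical host-side launcher for the kernel above; d_input and d_output
// point to device memory holding n scalar elements.
void launch(const kf::vec<half, 2>* d_input, float constant, kf::vec<float, 2>* d_output, int n) {
    int num_vectors = n / 2;                   // two scalar elements per vector
    int block_size = 256;
    int grid_size = num_vectors / block_size;  // assumes num_vectors is a multiple of block_size,
                                               // because the kernel does not check bounds
    kernel<<<grid_size, block_size>>>(d_input, constant, d_output);
}
```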
Here is how the same kernel would look for CUDA without Kernel Float.
```cpp
__global__ void kernel(const __half* input, float constant, float* output) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Load two half-precision elements and pack them into a __half2.
    __half in0 = input[2 * i + 0];
    __half in1 = input[2 * i + 1];
    __half2 a = __halves2half2(in0, in1);

    // Convert the constant to half precision and broadcast it into a __half2.
    float b = float(constant);
    __half c = __float2half(b);
    __half2 d = __half2half2(c);

    // Add both lanes at once, then unpack and convert the results back to float.
    __half2 e = __hadd2(a, d);
    __half f = __low2half(e);
    __half g = __high2half(e);
    float out0 = __half2float(f);
    float out1 = __half2float(g);
    output[2 * i + 0] = out0;
    output[2 * i + 1] = out1;
}
```
Even though the second kernel looks a lot more complex, the PTX code generated by these two kernels is nearly identical.
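If you want to check this on your own setup, one way is to compile both versions to PTX and compare the output (the file names below are illustrative, assuming the two kernels are saved in separate `.cu` files):

```
nvcc -ptx with_kernel_float.cu -o with_kernel_float.ptx
nvcc -ptx without_kernel_float.cu -o without_kernel_float.ptx
diff with_kernel_float.ptx without_kernel_float.ptx
```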
## Installation
This is a header-only library. Copy the file `single_include/kernel_float.h` to your project and include it:
```cpp
#include "kernel_float.h"
```
Use the provided Makefile to generate this single-include header file if it is outdated:
```
make
```
## Documentation
See the [documentation](https://kerneltuner.github.io/kernel_float/) for the [API reference](https://kerneltuner.github.io/kernel_float/api.html) of all functionality.
## License
Licensed under Apache 2.0. See [LICENSE](https://github.com/KernelTuner/kernel_float/blob/master/LICENSE).
## Related Work
* [Kernel Tuner](https://github.com/KernelTuner/kernel_tuner)
* [Kernel Launcher](https://github.com/KernelTuner/kernel_launcher)