
# Kernel Float

![Kernel Float logo](https://raw.githubusercontent.com/KernelTuner/kernel_float/main/docs/logo.png)

[![github](https://img.shields.io/badge/github-repo-000.svg?logo=github&labelColor=gray&color=blue)](https://github.com/KernelTuner/kernel_float/)
![GitHub branch checks state](https://img.shields.io/github/actions/workflow/status/KernelTuner/kernel_float/docs.yml)
![GitHub](https://img.shields.io/github/license/KernelTuner/kernel_float)
![GitHub tag (latest by date)](https://img.shields.io/github/v/tag/KernelTuner/kernel_float)
![GitHub Repo stars](https://img.shields.io/github/stars/KernelTuner/kernel_float?style=social)

_Kernel Float_ is a header-only library for CUDA/HIP that simplifies working with vector types and reduced precision floating-point arithmetic in GPU code.

## Summary

CUDA/HIP natively offers several reduced precision floating-point types (`__half`, `__nv_bfloat16`, `__nv_fp8_e4m3`, `__nv_fp8_e5m2`)
and vector types (e.g., `__half2`, `__nv_fp8x4_e4m3`, `float3`).
However, working with these types is cumbersome:
mathematical operations require intrinsics (e.g., `__hadd2` performs addition for `__half2`),
type conversion is awkward (e.g., `__nv_cvt_halfraw2_to_fp8x2` converts float16 to float8),
and some functionality is missing (e.g., one cannot convert a `__half` to `__nv_bfloat16`).
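
For example, with only the native intrinsics, a `__half` value has to make a detour through `float` to become a `__nv_bfloat16` (an illustrative sketch; the helper name is hypothetical and not part of Kernel Float):

```cpp
#include <cuda_fp16.h>
#include <cuda_bf16.h>

// No direct __half -> __nv_bfloat16 conversion exists, so the value is
// widened to float and then narrowed again.
__device__ __nv_bfloat16 half_to_bfloat16(__half h) {
    return __float2bfloat16(__half2float(h));
}
```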

_Kernel Float_ resolves this by offering a single data type `kernel_float::vec` that stores `N` elements of type `T`.
Internally, the data is stored as a fixed-sized array of elements.
Operator overloading (e.g., `+`, `*`, `&&`) is implemented such that the best available intrinsic for the given types is selected automatically.
Many mathematical functions (like `log`, `exp`, `sin`) and common operations (such as `sum`, `range`, `for_each`) are also available.
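
As an illustration, the snippet below sketches how these operations could be combined; the names `range`, `sin`, and `sum` are taken from the list above, and their exact signatures are assumed here rather than quoted from the API reference:

```cpp
#include "kernel_float.h"
namespace kf = kernel_float;

// Illustrative sketch only: assumes kf::range, kf::sin, and kf::sum
// have the natural signatures suggested by their names.
__device__ float demo() {
    kf::vec<float, 4> x = kf::range<float, 4>();  // vector [0, 1, 2, 3]
    kf::vec<float, 4> y = kf::sin(x) + 1.0f;      // element-wise sin, then add 1 to each lane
    return kf::sum(y);                            // reduce the four lanes to a single float
}
```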

Using Kernel Float, developers avoid the complexity of reduced precision floating-point types in CUDA and can focus on their applications.

## Features

In a nutshell, _Kernel Float_ offers the following features:

* Single type `vec` that unifies all vector types.
* Operator overloading to simplify programming.
* Support for half (16 bit) floating-point arithmetic, with a fallback to single precision for unsupported operations.
* Support for quarter (8 bit) floating-point types.
* Easy integration as a single header file.
* Written for C++17.
* Compatible with NVCC (NVIDIA Compiler) and NVRTC (NVIDIA Runtime Compilation).
* Compatible with HIPCC (AMD HIP Compiler).

## Example

Check out the [examples](https://github.com/KernelTuner/kernel_float/tree/master/examples) directory for more examples.

The example below shows a simple CUDA kernel that adds a `constant` to the `input` array and writes the results to the `output` array.
Each thread processes two elements.
Notice how easy it would be to change the precision (for example, `double` to `half`) or the vector size (for example, 4 instead of 2 items per thread).

```cpp
#include "kernel_float.h"
namespace kf = kernel_float;

__global__ void kernel(const kf::vec<half, 2>* input, float constant, kf::vec<float, 2>* output) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    output[i] = input[i] + kf::cast<half>(constant);
}

```

Here is how the same kernel would look for CUDA without Kernel Float.

```cpp
__global__ void kernel(const __half* input, float constant, float* output) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Load two half-precision elements and pack them into a __half2.
    __half in0 = input[2 * i + 0];
    __half in1 = input[2 * i + 1];
    __half2 a = __halves2half2(in0, in1);

    // Convert the constant to half precision and broadcast it to both lanes.
    float b = float(constant);
    __half c = __float2half(b);
    __half2 d = __half2half2(c);

    // Vectorized half-precision addition.
    __half2 e = __hadd2(a, d);

    // Unpack the result and convert each lane back to float.
    __half f = __low2half(e);
    __half g = __high2half(e);
    float out0 = __half2float(f);
    float out1 = __half2float(g);
    output[2 * i + 0] = out0;
    output[2 * i + 1] = out1;
}

```

Even though the second kernel looks a lot more complex, the PTX code generated by these two kernels is nearly identical.

## Installation

This is a header-only library. Copy the file `single_include/kernel_float.h` to your project and include it:

```cpp
#include "kernel_float.h"
```
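
A typical compile command then only needs C++17 and the include path (the paths and file names here are illustrative):

```
nvcc -std=c++17 -I path/to/your/project my_kernel.cu -o my_kernel
```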

Use the provided Makefile to generate this single-include header file if it is outdated:

```
make
```

## Documentation

See the [documentation](https://kerneltuner.github.io/kernel_float/) for the [API reference](https://kerneltuner.github.io/kernel_float/api.html) of all functionality.

## License

Licensed under Apache 2.0. See [LICENSE](https://github.com/KernelTuner/kernel_float/blob/master/LICENSE).

## Related Work

* [Kernel Tuner](https://github.com/KernelTuner/kernel_tuner)
* [Kernel Launcher](https://github.com/KernelTuner/kernel_launcher)