https://github.com/kerneltuner/kernel_float
CUDA/HIP header-only library for writing vectorized and low-precision (16 bit, 8 bit) GPU kernels
- Host: GitHub
- URL: https://github.com/kerneltuner/kernel_float
- Owner: KernelTuner
- License: apache-2.0
- Created: 2023-02-21T08:52:34.000Z (almost 3 years ago)
- Default Branch: main
- Last Pushed: 2025-04-11T08:28:58.000Z (9 months ago)
- Last Synced: 2025-04-11T09:56:41.425Z (9 months ago)
- Topics: bfloat16, cpp, cuda, floating-point, gpu, half-precision, header-only-library, hip, kernel-tuner, low-precision, mixed-precision, performance, reduced-precision, vectorization
- Language: C++
- Homepage: https://kerneltuner.github.io/kernel_float/
- Size: 7.23 MB
- Stars: 7
- Watchers: 3
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Kernel Float

_Kernel Float_ is a header-only library for CUDA/HIP that simplifies working with vector types and reduced precision floating-point arithmetic in GPU code.
## Summary
CUDA/HIP natively offers several reduced precision floating-point types (`__half`, `__nv_bfloat16`, `__nv_fp8_e4m3`, `__nv_fp8_e5m2`)
and vector types (e.g., `__half2`, `__nv_fp8x4_e4m3`, `float3`).
However, working with these types is cumbersome:
mathematical operations require intrinsics (e.g., `__hadd2` performs addition for `__half2`),
type conversion is awkward (e.g., `__nv_cvt_halfraw2_to_fp8x2` converts float16 to float8),
and some functionality is missing (e.g., one cannot convert a `__half` to `__nv_bfloat16`).
_Kernel Float_ resolves this by offering a single data type, `kernel_float::vec<T, N>`, that stores `N` elements of type `T`.
Internally, the data is stored as a fixed-size array of elements.
Operator overloading (such as `+`, `*`, `&&`) is implemented so that the best available intrinsic for the given types is selected automatically.
Many mathematical functions (like `log`, `exp`, `sin`) and common operations (such as `sum`, `range`, `for_each`) are also available.
Using Kernel Float, developers avoid the complexity of reduced precision floating-point types in CUDA and can focus on their applications.
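For illustration, here is a device-side sketch of what this looks like in practice. It is not taken from the library's examples; it assumes that `kf::exp` and `kf::sum` follow the naming above and that scalar operands broadcast over a vector, as in the example kernel further down.

```cpp
#include "kernel_float.h"
namespace kf = kernel_float;

// Element-wise multiply and exp, followed by a horizontal sum.
// The library selects suitable intrinsics for the element type automatically.
__device__ float scaled_exp_sum(kf::vec<float, 4> values, float scale) {
    kf::vec<float, 4> scaled = values * scale;  // scalar broadcast over all four lanes
    return kf::sum(kf::exp(scaled));            // per-element exp, then reduction to a single float
}
```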
## Features
In a nutshell, _Kernel Float_ offers the following features:
* Single type `vec` that unifies all vector types.
* Operator overloading to simplify programming.
* Support for half (16 bit) floating-point arithmetic, with a fallback to single precision for unsupported operations (see the sketch after this list).
* Support for quarter (8 bit) floating-point types.
* Easy integration as a single header file.
* Written for C++17.
* Compatible with NVCC (NVIDIA Compiler) and NVRTC (NVIDIA Runtime Compilation).
* Compatible with HIPCC (AMD HIP Compiler).
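As a minimal sketch of the half-precision support (an illustration only, assuming `kf::sin` exists as the summary suggests; whether an operation maps to a native `__half` intrinsic or to the single-precision fallback depends on the operation and the target architecture):

```cpp
#include "kernel_float.h"
namespace kf = kernel_float;

// 16-bit arithmetic on two halves at once: the addition can map to a packed
// intrinsic such as __hadd2, while sin() may take the single-precision fallback path.
__device__ kf::vec<half, 2> shifted_sine(kf::vec<half, 2> x, kf::vec<half, 2> offset) {
    return kf::sin(x) + offset;
}
```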
## Example
Check out the [examples](https://github.com/KernelTuner/kernel_float/tree/master/examples) directory for more examples.
Below is a simple example of a CUDA kernel that adds a `constant` to the `input` array and writes the results to the `output` array.
Each thread processes two elements.
Notice how easy it would be to change the precision (for example, `double` to `half`) or the vector size (for example, 4 instead of 2 items per thread).
```cpp
#include "kernel_float.h"
namespace kf = kernel_float;
__global__ void kernel(const kf::vec<half, 2>* input, float constant, kf::vec<float, 2>* output) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    output[i] = input[i] + kf::cast<half>(constant);
}
```
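Since each thread handles one `kf::vec<half, 2>` (two scalar elements), a host-side launch could look like the sketch below. This is illustrative only and not part of the repository; it assumes the element count `n` is even and a multiple of twice the block size, since the kernel performs no bounds check.

```cpp
// Hypothetical host-side launcher for the kernel above; d_input and d_output
// point to device memory holding n scalar elements.
void launch(const kf::vec<half, 2>* d_input, float constant, kf::vec<float, 2>* d_output, int n) {
    int num_vectors = n / 2;                   // two scalar elements per vector
    int block_size = 256;
    int grid_size = num_vectors / block_size;  // assumes num_vectors is a multiple of block_size,
                                               // because the kernel does not check bounds
    kernel<<<grid_size, block_size>>>(d_input, constant, d_output);
}
```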
Here is how the same kernel would look for CUDA without Kernel Float.
```cpp
__global__ void kernel(const __half* input, float constant, float* output) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Load two half-precision elements and pack them into a __half2.
    __half in0 = input[2 * i + 0];
    __half in1 = input[2 * i + 1];
    __half2 a = __halves2half2(in0, in1);

    // Convert the constant to half precision and broadcast it into a __half2.
    float b = float(constant);
    __half c = __float2half(b);
    __half2 d = __half2half2(c);

    // Add both lanes at once, then unpack and convert the results back to float.
    __half2 e = __hadd2(a, d);
    __half f = __low2half(e);
    __half g = __high2half(e);
    float out0 = __half2float(f);
    float out1 = __half2float(g);
    output[2 * i + 0] = out0;
    output[2 * i + 1] = out1;
}
```
Even though the second kernel looks a lot more complex, the PTX code generated by these two kernels is nearly identical.
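If you want to check this on your own setup, one way is to compile both versions to PTX and compare the output (the file names below are illustrative, assuming the two kernels are saved in separate `.cu` files):

```
nvcc -ptx with_kernel_float.cu -o with_kernel_float.ptx
nvcc -ptx without_kernel_float.cu -o without_kernel_float.ptx
diff with_kernel_float.ptx without_kernel_float.ptx
```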
## Installation
This is a header-only library. Copy the file `single_include/kernel_float.h` to your project and include it:
```cpp
#include "kernel_float.h"
```
Use the provided Makefile to generate this single-include header file if it is outdated:
```
make
```
## Documentation
See the [documentation](https://kerneltuner.github.io/kernel_float/) for the [API reference](https://kerneltuner.github.io/kernel_float/api.html) of all functionality.
## License
Licensed under Apache 2.0. See [LICENSE](https://github.com/KernelTuner/kernel_float/blob/master/LICENSE).
## Related Work
* [Kernel Tuner](https://github.com/KernelTuner/kernel_tuner)
* [Kernel Launcher](https://github.com/KernelTuner/kernel_launcher)