{"id":20379947,"url":"https://github.com/kerneltuner/kernel_float","last_synced_at":"2025-04-12T08:33:34.808Z","repository":{"id":74324817,"uuid":"604528869","full_name":"KernelTuner/kernel_float","owner":"KernelTuner","description":"CUDA/HIP header-only library writing vectorized and low-precision (16 bit, 8 bit) GPU kernels ","archived":false,"fork":false,"pushed_at":"2025-04-11T08:28:58.000Z","size":7578,"stargazers_count":7,"open_issues_count":0,"forks_count":1,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-04-11T09:56:41.425Z","etag":null,"topics":["bfloat16","cpp","cuda","floating-point","gpu","half-precision","header-only-library","hip","kernel-tuner","low-precision","mixed-precision","performance","reduced-precision","vectorization"],"latest_commit_sha":null,"homepage":"https://kerneltuner.github.io/kernel_float/","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/KernelTuner.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2023-02-21T08:52:34.000Z","updated_at":"2025-04-11T08:32:09.000Z","dependencies_parsed_at":null,"dependency_job_id":"903e6825-2e50-4e4a-b3ed-c7dfac943ff6","html_url":"https://github.com/KernelTuner/kernel_float","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/KernelTuner%2Fkernel_float","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/KernelTuner%2Fkernel_float/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/KernelTuner%2Fkernel_float/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/KernelTuner%2Fkernel_float/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/KernelTuner","download_url":"https://codeload.github.com/KernelTuner/kernel_float/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248540556,"owners_count":21121376,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bfloat16","cpp","cuda","floating-point","gpu","half-precision","header-only-library","hip","kernel-tuner","low-precision","mixed-precision","performance","reduced-precision","vectorization"],"created_at":"2024-11-15T02:05:41.498Z","updated_at":"2025-04-12T08:33:34.799Z","avatar_url":"https://github.com/KernelTuner.png","language":"C++","readme":"# Kernel Float\n\n![Kernel Float logo](https://raw.githubusercontent.com/KernelTuner/kernel_float/main/docs/logo.png)\n\n[![github](https://img.shields.io/badge/github-repo-000.svg?logo=github\u0026labelColor=gray\u0026color=blue)](https://github.com/KernelTuner/kernel_float/)\n![GitHub branch checks state](https://img.shields.io/github/actions/workflow/status/KernelTuner/kernel_float/docs.yml)\n![GitHub](https://img.shields.io/github/license/KernelTuner/kernel_float)\n![GitHub tag (latest by date)](https://img.shields.io/github/v/tag/KernelTuner/kernel_float)\n![GitHub Repo stars](https://img.shields.io/github/stars/KernelTuner/kernel_float?style=social)\n\n\n_Kernel Float_ is a header-only library for CUDA/HIP that simplifies working with vector types and reduced precision floating-point arithmetic in GPU code.\n\n\n## Summary\n\nCUDA/HIP natively offers several reduced precision floating-point types (`__half`, `__nv_bfloat16`, `__nv_fp8_e4m3`, `__nv_fp8_e5m2`)\nand vector types (e.g., `__half2`, `__nv_fp8x4_e4m3`, `float3`).\nHowever, working with these types is cumbersome:\nmathematical operations require intrinsics (e.g., `__hadd2` performs addition for `__half2`),\ntype conversion is awkward (e.g., `__nv_cvt_halfraw2_to_fp8x2` converts float16 to float8),\nand some functionality is missing (e.g., one cannot convert a `__half` to `__nv_bfloat16`).\n\n_Kernel Float_ resolves this by offering a single data type `kernel_float::vec\u003cT, N\u003e` that stores `N` elements of type `T`.\nInternally, the data is stored as a fixed-sized array of elements.\nOperator overloading (like `+`, `*`, `\u0026\u0026`) has been implemented such that the most optimal intrinsic for the available types is selected automatically.\nMany mathematical functions (like `log`, `exp`, `sin`) and common operations (such as `sum`, `range`, `for_each`) are also available.\n\nUsing Kernel Float, developers avoid the complexity of reduced precision floating-point types in CUDA and can focus on their applications.\n\n\n## Features\n\nIn a nutshell, _Kernel Float_ offers the following features:\n\n* Single type `vec\u003cT, N\u003e` that unifies all vector types.\n* Operator overloading to simplify programming.\n* Support for half (16 bit) floating-point arithmetic, with a fallback to single precision for unsupported operations.\n* Support for quarter (8 bit) floating-point types.\n* Easy integration as a single header file.\n* Written for C++17.\n* Compatible with NVCC (NVIDIA Compiler) and NVRTC (NVIDIA Runtime Compilation).\n* Compatible with HIPCC (AMD HIP Compiler)\n\n\n## Example\n\nCheck out the [examples](https://github.com/KernelTuner/kernel_float/tree/master/examples) directory for some examples.\n\n\nBelow shows a simple example of a CUDA kernel that adds a `constant` to the `input` array and writes the results to the `output` array.\nEach thread processes two elements.\nNotice how easy it would be to change the precision (for example, `double` to `half`) or the vector size (for example, 4 instead of 2 items per thread).\n\n\n```cpp\n#include \"kernel_float.h\"\nnamespace kf = kernel_float;\n\n__global__ void kernel(const kf::vec\u003chalf, 2\u003e* input, float constant, kf::vec\u003cfloat, 2\u003e* output) {\n    int i = blockIdx.x * blockDim.x + threadIdx.x;\n    output[i] = input[i] + kf::cast\u003chalf\u003e(constant);\n}\n\n```\n\nHere is how the same kernel would look for CUDA without Kernel Float.\n\n```cpp\n__global__ void kernel(const __half* input, float constant, float* output) {\n    int i = blockIdx.x * blockDim.x + threadIdx.x;\n    __half in0 = input[2 * i + 0];\n    __half in1 = input[2 * i + 1];\n    __half2 a = __halves2half2(in0, in1);\n    float b = float(constant);\n    __half c = __float2half(b);\n    __half2 d = __half2half2(c);\n    __half2 e = __hadd2(a, d);\n    __half f = __low2half(e);\n    __half g = __high2half(e);\n    float out0 = __half2float(f);\n    float out1 = __half2float(g);\n    output[2 * i + 0] = out0;\n    output[2 * i + 1] = out1;\n}\n\n```\n\nEven though the second kernel looks a lot more complex, the PTX code generated by these two kernels is nearly identical.\n\n\n## Installation\n\nThis is a header-only library. Copy the file `single_include/kernel_float.h` to your project and include it:\n\n```cpp\n#include \"kernel_float.h\"\n```\n\nUse the provided Makefile to generate this single-include header file if it is outdated:\n\n```\nmake\n```\n\n\n## Documentation\n\nSee the [documentation](https://kerneltuner.github.io/kernel_float/) for the [API reference](https://kerneltuner.github.io/kernel_float/api.html) of all functionality.\n\n\n## License\n\nLicensed under Apache 2.0. See [LICENSE](https://github.com/KernelTuner/kernel_float/blob/master/LICENSE).\n\n\n## Related Work\n\n* [Kernel Tuner](https://github.com/KernelTuner/kernel_tuner)\n* [Kernel Launcher](https://github.com/KernelTuner/kernel_launcher)\n\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkerneltuner%2Fkernel_float","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkerneltuner%2Fkernel_float","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkerneltuner%2Fkernel_float/lists"}