{"id":47410269,"url":"https://github.com/ashvardanian/NumKong","last_synced_at":"2026-03-22T23:01:12.128Z","repository":{"id":157820683,"uuid":"613772664","full_name":"ashvardanian/NumKong","owner":"ashvardanian","description":"SIMD-accelerated distances, dot products, matrix ops, geospatial \u0026 geometric kernels for 16 numeric types — from 6-bit floats to 64-bit complex — across x86, Arm, RISC-V, and WASM, with bindings for Python, Rust, C, C++, Swift, JS, and Go 📐","archived":false,"fork":false,"pushed_at":"2026-03-20T09:33:39.000Z","size":11312,"stargazers_count":1688,"open_issues_count":23,"forks_count":107,"subscribers_count":20,"default_branch":"main","last_synced_at":"2026-03-20T09:54:26.043Z","etag":null,"topics":["arm-neon","assembly","blas","cpp","golang","information-retrieval","javascript","matrix-multiplication","metrics","numpy","rust","scipy","simd","swift","tensor","vector-search"],"latest_commit_sha":null,"homepage":"https://ashvardanian.com/posts/simsimd-faster-scipy/","language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ashvardanian.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2023-03-14T08:41:48.000Z","updated_at":"2026-03-20T09:36:24.000Z","dependencies_parsed_at":"2025-12-30T04:03:35.650Z","dependency_job_id":null,"html_url":"https://github.com/ashvardanian/NumKong","commit_stats":{"total_commits":743,"total_committers":26,"mean_commits":"28.576923076923077","d
ds":"0.19650067294751006","last_synced_commit":"81799b6dd8cf28ac71873db7fec34ed85f381ce8"},"previous_names":["ashvardanian/numkong"],"tags_count":164,"template":false,"template_full_name":null,"purl":"pkg:github/ashvardanian/NumKong","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashvardanian%2FNumKong","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashvardanian%2FNumKong/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashvardanian%2FNumKong/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashvardanian%2FNumKong/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ashvardanian","download_url":"https://codeload.github.com/ashvardanian/NumKong/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashvardanian%2FNumKong/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30776455,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-20T22:51:33.771Z","status":"online","status_checked_at":"2026-03-21T02:00:07.962Z","response_time":114,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["arm-neon","assembly","blas","cpp","golang","information-retrieval","javascript","matrix-multiplication","metrics","numpy","rust","scipy","simd","swift","tensor","vector-search"],"created_at":"2026-03-20T23:00:37.733Z","upda
ted_at":"2026-03-22T23:01:12.114Z","avatar_url":"https://github.com/ashvardanian.png","language":"C","readme":"# NumKong: Mixed Precision for All\n\nNumKong (previously SimSIMD) is a portable mixed-precision math library with over 2000 kernels for x86, Arm, RISC-V, and WASM.\nIt covers numeric types from 6-bit floats to 64-bit complex numbers, hardened against in-house 118-bit extended-precision baselines.\nBuilt alongside the [USearch](https://github.com/unum-cloud/usearch) vector-search engine, it provides wider accumulators to avoid the overflow and precision loss typical of naive same-type arithmetic.\n\n![NumKong banner](https://github.com/ashvardanian/ashvardanian/blob/master/repositories/NumKong-v7.png?raw=true)\n\n## Latency, Throughput, \u0026 Numerical Stability\n\nMost libraries return dot products in the __same type as the input__ — Float16 × Float16 → Float16, Int8 × Int8 → Int8.\nThis leads to quiet overflow: a 2048-dimensional `i8` dot product can reach ±10 million, but `i8` maxes out at 127.\nNumKong promotes to wider accumulators — Float16 → Float32, BFloat16 → Float32, Int8 → Int32, Float32 → Float64 — so results stay in range.\n\n\u003e Single 2048-d dot product on Intel [Sapphire Rapids](https://en.wikipedia.org/wiki/Sapphire_Rapids), single-threaded.\n\u003e Each cell shows __gso/s, mean relative error__ vs higher-precision reference.\n\u003e gso/s = Giga Scalar Operations per Second — a more suitable name than GFLOP/s when counting both integer and floating-point work.\n\u003e NumPy 2.4, PyTorch 2.10, JAX 0.9.\n\n| Input  |        NumPy + OpenBLAS |           PyTorch + MKL |                     JAX |               NumKong |\n| :----- | ----------------------: | ----------------------: | ----------------------: | --------------------: |\n|        |          ░░░░░░░░░░░░░░ |          ░░░░░░░░░░░░░░ |          ░░░░░░░░░░░░░░ |        ░░░░░░░░░░░░░░ |\n| `f64`  |    2.0 gso/s, 1e-15 err |    0.6 gso/s, 1e-15 err |    0.4 gso/s, 1e-14 err |  5.8 
gso/s, 1e-16 err |\n| `f32`  |     1.5 gso/s, 2e-6 err |     0.6 gso/s, 2e-6 err |     0.4 gso/s, 5e-6 err |   7.1 gso/s, 2e-7 err |\n| `bf16` |                       — |     0.5 gso/s, 1.9% err |     0.5 gso/s, 1.9% err |   9.7 gso/s, 1.8% err |\n| `f16`  |    0.2 gso/s, 0.25% err |    0.5 gso/s, 0.25% err |    0.4 gso/s, 0.25% err | 11.5 gso/s, 0.24% err |\n| `e5m2` |                       — |     0.7 gso/s, 4.6% err |     0.5 gso/s, 4.6% err |     7.1 gso/s, 0% err |\n| `i8`   | 1.1 gso/s, __overflow__ | 0.5 gso/s, __overflow__ | 0.5 gso/s, __overflow__ |    14.8 gso/s, 0% err |\n\nA fair objection: PyTorch and JAX are designed for throughput, not single-call latency.\nThey lower execution graphs through [XLA](https://openxla.org/) or vendored BLAS libraries like [Intel MKL](https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl.html) and Nvidia [cuBLAS](https://developer.nvidia.com/cublas).\nSo here's the same comparison on a throughput-oriented workload — matrix multiplication:\n\n\u003e Matrix multiplication (2048 × 2048) × (2048 × 2048) on Intel Sapphire Rapids, single-threaded.\n\u003e gso/s = Giga Scalar Operations per Second, same format.\n\u003e NumPy 2.4, PyTorch 2.10, JAX 0.9, same versions.\n\n| Input  |        NumPy + OpenBLAS |            PyTorch + MKL |                      JAX |              NumKong |\n| :----- | ----------------------: | -----------------------: | -----------------------: | -------------------: |\n|        |          ░░░░░░░░░░░░░░ |           ░░░░░░░░░░░░░░ |           ░░░░░░░░░░░░░░ |       ░░░░░░░░░░░░░░ |\n| `f64`  |   65.5 gso/s, 1e-15 err |    68.2 gso/s, 1e-15 err |   ~14.3 gso/s, 1e-15 err | 8.6 gso/s, 1e-16 err |\n| `f32`  |     140 gso/s, 9e-7 err |      145 gso/s, 1e-6 err |    ~60.5 gso/s, 1e-6 err | 37.7 gso/s, 4e-7 err |\n| `bf16` |                       — |      851 gso/s, 1.8% err |    ~25.8 gso/s, 3.4% err |  458 gso/s, 3.6% err |\n| `f16`  |    0.3 gso/s, 0.25% err |     140 gso/s, 0.37% err |   
~26.1 gso/s, 0.35% err | 103 gso/s, 0.26% err |\n| `e5m2` |                       — |      0.4 gso/s, 4.6% err |    ~26.4 gso/s, 4.6% err |    398 gso/s, 0% err |\n| `i8`   | 0.4 gso/s, __overflow__ | 50.0 gso/s, __overflow__ | ~0.0 gso/s, __overflow__ |   1279 gso/s, 0% err |\n\nFor `f64`, compensated \"Dot2\" summation reduces error by 10–50× compared to naive Float64 accumulation, depending on vector length.\nFor `f32`, widening to Float64 gives 5–10× lower error.\nThe library ships as a relatively small binary:\n\n| Package          |   Size | Parallelism \u0026 Memory                              | Available For     |\n| :--------------- | -----: | :------------------------------------------------ | :---------------- |\n| PyTorch + MKL    | 705 MB | Vector \u0026 Tile SIMD, OpenMP Threads, Hidden Allocs | Python, C++, Java |\n| JAX + jaxlib     | 357 MB | Vector SIMD, XLA Threads, Hidden Allocs           | Python            |\n| NumPy + OpenBLAS |  30 MB | Vector SIMD, Built-in Threads, Hidden Allocs      | Python            |\n| mathjs           |   9 MB | No SIMD, No Threads, Many Allocs                  | JS                |\n| NumKong          |   5 MB | Vector \u0026 Tile SIMD, Your Threads, Your Allocs     | 7 languages       |\n\nEvery kernel is validated against 118-bit extended-precision baselines with per-type ULP budgets across log-normal, uniform, and Cauchy input distributions.\nTests check triangle inequality, Cauchy-Schwarz bounds, NaN propagation, overflow detection, and probability-simplex constraints for each ISA variant.\nResults are cross-validated against OpenBLAS, Intel MKL, and Apple Accelerate.\nA broader throughput comparison is maintained in [NumWars](https://github.com/ashvardanian/NumWars).\n\n## Quick Start\n\n| Language | Install                    | Compatible with                | Guide                                        |\n| :------- | :------------------------- | :----------------------------- | 
:------------------------------------------- |\n| C / C++  | CMake, headers, \u0026 prebuilt | Linux, macOS, Windows, Android | [include/README.md](include/README.md)       |\n| Python   | `pip install`              | Linux, macOS, Windows          | [python/README.md](python/README.md)         |\n| Rust     | `cargo add`                | Linux, macOS, Windows          | [rust/README.md](rust/README.md)             |\n| JS       | `npm install` \u0026 `import`   | Node.js, Bun, Deno \u0026 browsers  | [javascript/README.md](javascript/README.md) |\n| Swift    | Swift Package Manager      | macOS, iOS, tvOS, watchOS      | [swift/README.md](swift/README.md)           |\n| Go       | `go get`                   | Linux, macOS, Windows via cGo  | [golang/README.md](golang/README.md)         |\n\n## What's Inside\n\nNumKong covers 16 numeric types — from 6-bit floats to 64-bit complex numbers — across dozens of operations and 30+ SIMD backends, with hardware-aware defaults: Arm prioritizes `f16`, x86 prioritizes `bf16`.\n\n\u003cdiv align=\"center\"\u003e\n\u003cpre\u003e\u003ccode\u003e\n┌──────────────────────────────┬────────────────┬───────────────────────────┬────────────┐\n│          Operations          │   Datatypes    │         Backends          │ Ecosystems │\n├──────────────────────────────┼────────────────┼───────────────────────────┼────────────┤\n│ Vector-Vector                │ \u003ca href=\"#numeric-types\"\u003eBits \u0026amp; Ints\u003c/a\u003e    │ \u003ca href=\"#compile-time-and-run-time-dispatch\"\u003ex86\u003c/a\u003e                       │ Core       │\n│ \u003ca href=\"https://github.com/ashvardanian/NumKong/blob/main/include/README.md#dot-products\"\u003edot\u003c/a\u003e · \u003ca href=\"https://github.com/ashvardanian/NumKong/blob/main/include/README.md#dense-distances\"\u003eangular\u003c/a\u003e · \u003ca href=\"https://github.com/ashvardanian/NumKong/blob/main/include/README.md#dense-distances\"\u003eeuclidean\u003c/a\u003e    │ u1 · u4 
· u8   │ Haswell · Alder Lake      │ \u003ca href=\"https://github.com/ashvardanian/NumKong/blob/main/include/README.md#the-c-abi\"\u003eC 99\u003c/a\u003e       │\n│ hamming · kld · jsd · …      │ i4 · i8        │ Sierra Forest · Skylake   │            │\n│                              │                │ Ice Lake · Genoa · Turin  │ Primary    │\n│ \u003ca href=\"https://github.com/ashvardanian/NumKong/blob/main/include/README.md#packed-matrix-kernels-for-gemm-like-workloads\"\u003eMatrix-Matrix\u003c/a\u003e                │ \u003ca href=\"#mini-floats-e4m3-e5m2-e3m2--e2m3\"\u003eMini-floats\u003c/a\u003e    │ Sapphire Rapids ·         │ \u003ca href=\"https://github.com/ashvardanian/NumKong/blob/main/include/README.md#the-c-layer\"\u003eC++ 23\u003c/a\u003e     │\n│ \u003ca href=\"https://github.com/ashvardanian/NumKong/blob/main/include/README.md#packed-matrix-kernels-for-gemm-like-workloads\"\u003edots_packed\u003c/a\u003e · \u003ca href=\"https://github.com/ashvardanian/NumKong/blob/main/include/README.md#symmetric-kernels-for-syrk-like-workloads\"\u003edots_symmetric\u003c/a\u003e │ e2m3 · e3m2    │ Granite Rapids            │ \u003ca href=\"https://github.com/ashvardanian/NumKong/blob/main/python/README.md\"\u003ePython 3\u003c/a\u003e   │\n│ \u003ca href=\"https://github.com/ashvardanian/NumKong/blob/main/include/README.md#packed-matrix-kernels-for-gemm-like-workloads\"\u003eeuclideans_packed\u003c/a\u003e · …        │ e4m3 · e5m2    │                           │ \u003ca href=\"https://github.com/ashvardanian/NumKong/blob/main/rust/README.md\"\u003eRust\u003c/a\u003e       │\n│                              │                │ \u003ca href=\"#compile-time-and-run-time-dispatch\"\u003eArm\u003c/a\u003e                       │            │\n│ Quadratic                    │ \u003ca href=\"#float16--bfloat16-half-precision\"\u003eHalf \u0026amp; Classic\u003c/a\u003e │ NEON · NEONHalf · NEONFhm │ Additional │\n│ \u003ca 
href=\"https://github.com/ashvardanian/NumKong/blob/main/include/README.md#curved-metrics\"\u003ebilinear\u003c/a\u003e · mahalanobis       │ f16 · bf16     │ NEONBFDot · NEONSDot      │ \u003ca href=\"https://github.com/ashvardanian/NumKong/blob/main/swift/README.md\"\u003eSwift\u003c/a\u003e · \u003ca href=\"https://github.com/ashvardanian/NumKong/blob/main/javascript/README.md\"\u003eJS\u003c/a\u003e │\n│                              │ f32 · f64      │ SVE · SVEHalf · SVEBfDot  │ \u003ca href=\"https://github.com/ashvardanian/NumKong/blob/main/golang/README.md\"\u003eGo\u003c/a\u003e         │\n│ \u003ca href=\"https://github.com/ashvardanian/NumKong/blob/main/include/README.md#geospatial-metrics\"\u003eGeospatial\u003c/a\u003e \u0026amp; \u003ca href=\"https://github.com/ashvardanian/NumKong/blob/main/include/README.md#geometric-mesh-alignment\"\u003eGeometric\u003c/a\u003e       │                │ SVESDot · SVE2            │            │\n│ haversine · vincenty         │ \u003ca href=\"#complex-types\"\u003eComplex\u003c/a\u003e        │ SME · SMEF64 · SMEBI32    │ \u003ca href=\"https://github.com/ashvardanian/NumKong/blob/main/CONTRIBUTING.md\"\u003eTools\u003c/a\u003e      │\n│ rmsd · kabsch · umeyama · …  │ f16c · bf16c   │                           │ \u003ca href=\"https://github.com/ashvardanian/NumKong/blob/main/test/README.md\"\u003eTests\u003c/a\u003e      │\n│                              │ f32c · f64c    │ \u003ca href=\"#compile-time-and-run-time-dispatch\"\u003eRISC-V\u003c/a\u003e                    │ \u003ca href=\"https://github.com/ashvardanian/NumKong/blob/main/bench/README.md\"\u003eBenchmarks\u003c/a\u003e │\n│ Bespoke                      │                │ RVV · RVVHalf             │ \u003ca href=\"https://github.com/ashvardanian/NumWars\"\u003eNumWars\u003c/a\u003e    │\n│ \u003ca href=\"https://github.com/ashvardanian/NumKong/blob/main/include/numkong/each/README.md\"\u003efma\u003c/a\u003e · blend · \u003ca 
href=\"https://github.com/ashvardanian/NumKong/blob/main/include/numkong/trigonometry/README.md\"\u003esin\u003c/a\u003e · \u003ca href=\"https://github.com/ashvardanian/NumKong/blob/main/include/numkong/cast/README.md\"\u003ecast\u003c/a\u003e     │                │ RVVBf16 · RVVBB           │            │\n│ \u003ca href=\"https://github.com/ashvardanian/NumKong/blob/main/include/numkong/reduce/README.md\"\u003ereduce_moments\u003c/a\u003e · \u003ca href=\"https://github.com/ashvardanian/NumKong/blob/main/include/numkong/sparse/README.md\"\u003esparse_dot\u003c/a\u003e  │                │                           │            │\n│ \u003ca href=\"https://github.com/ashvardanian/NumKong/blob/main/include/README.md#maxsim-and-late-interaction\"\u003emaxsim\u003c/a\u003e · intersect · …       │                │ \u003ca href=\"https://github.com/ashvardanian/NumKong/blob/main/CONTRIBUTING.md#cross-compilation\"\u003eWASM\u003c/a\u003e                      │            │\n│                              │                │ V128Relaxed               │            │\n└──────────────────────────────┴────────────────┴───────────────────────────┴────────────┘\n\u003c/code\u003e\u003c/pre\u003e\n\u003c/div\u003e\n\nNot every combination is implemented — only the ones that unlock interesting new opportunities.\nThe `icelake` level doesn't get a `dot_bf16` variant, for example, and falls through to `dot_bf16_skylake`.\nEvery operation has a `serial` fallback, but even types no CPU supports today get optimized via lookup tables and bit-twiddling hacks rather than scalar loops.\n\n## Design Decisions\n\n- Avoid loop unrolling and scalar tails.\n- Don't manage threads and be compatible with any parallelism models.\n- Don't manage memory and be compatible with arbitrary allocators \u0026 alignment.\n- Don't constrain ourselves to traditional BLAS-like Matrix Multiplication APIs.\n- Don't throw exceptions and pass values by pointers.\n- Prefer saturated arithmetic and avoid 
overflows where needed.\n- Cover most modern CPUs with flexible dispatch and wait for them to converge with GPUs.\n\nThe rest of this document unpacks the functionality and the logic behind the design decisions.\n\n### Auto-Vectorization \u0026 Loop Unrolling\n\nMost \"optimized SIMD code\" is a 2–4x unrolled data-parallel `for`-loop over `f32` arrays with a serial scalar tail for the last few elements:\n\n```c\nfloat boring_dot_product_f32(float const *a, float const *b, size_t n) {\n    __m256 sum0 = _mm256_setzero_ps(), sum1 = _mm256_setzero_ps();\n    size_t i = 0;\n    for (; i + 16 \u003c= n; i += 16) {\n        sum0 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i), _mm256_loadu_ps(b + i), sum0);\n        sum1 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 8), _mm256_loadu_ps(b + i + 8), sum1);\n    }\n    // AVX2 lacks a horizontal-reduce intrinsic, so collapse the lanes manually\n    __m256 sum = _mm256_add_ps(sum0, sum1);\n    __m128 half = _mm_add_ps(_mm256_castps256_ps128(sum), _mm256_extractf128_ps(sum, 1));\n    half = _mm_hadd_ps(half, half);\n    half = _mm_hadd_ps(half, half);\n    float result = _mm_cvtss_f32(half);\n    for (; i \u003c n; i++) result += a[i] * b[i]; // serial tail\n    return result;\n}\n```\n\nThis kind of unrolling has been a common request for NumKong, but the library avoids it by design.\n\n__Modern CPUs already \"unroll\" in hardware.__\nOut-of-order engines with reorder buffers of 320–630 entries (Zen 4: 320, Golden Cove: 512, Apple Firestorm: ~630) can keep a dozen loop iterations in-flight simultaneously.\nThe physical register file is much larger than the ISA-visible architectural registers — Skylake has ~180 physical integer registers behind 16 architectural GPRs, and ~168 physical vector registers behind 32 architectural ZMMs.\nThe register renaming unit maps the same `zmm0` in iteration N and iteration N+1 to different physical registers, extracting cross-iteration parallelism automatically — exactly the benefit that source-level unrolling was historically supposed to provide.\n\n__Unrolling works against NumKong's goals.__\nEvery unrolled copy adds distinct instructions to the binary.\nWith 1,500+ kernel endpoints across 30+ backends, even 2x unrolling would inflate the `.text` section by megabytes — directly impacting install size for Python wheels, NPM packages, and Rust crates.\nLarger loop bodies also increase instruction-cache and micro-op-cache pressure; [Agner Fog](https://www.agner.org/optimize/) recommends:\n\n\u003e _\"avoid loop unrolling where possible in order to economize the use of the micro-op cache\"_.\n\nA loop that spills out of the uop cache falls back to the slower legacy decoder, making the \"optimized\" version slower than the compact original.\nFor a header-only library, unrolling also compounds __compilation time__: register allocation is NP-hard (equivalent to graph coloring), and unrolling multiplies the number of simultaneously live ranges the allocator must consider, increasing compile time super-linearly across every translation unit that includes the headers.\n\n__Serial tails are a correctness hazard.__\nThe leftover elements after the last full SIMD chunk run through a scalar loop that silently drops FMA fusion, compensated accumulation, and saturating arithmetic — producing results with different numerical properties than the SIMD body.\nNumKong often uses masked loads instead (`_mm512_maskz_loadu_ps` on AVX-512, predicated `svld1_f32` on SVE), processing every element through the same arithmetic path regardless of alignment.\nMasked loads don't strictly preclude unrolling, but they favor the compact single-body kernel layout NumKong uses.\n\n__The gains come from elsewhere.__\nOn Intel Sapphire Rapids, NumKong was benchmarked against auto-vectorized code compiled with GCC 12.\nGCC handles single-precision `float` well, but struggles with `_Float16` and other mixed-precision paths:\n\n| Kind                      | GCC 12 `f32` | GCC 12 `f16` | NumKong `f16` | `f16` improvement |\n| :------------------------ | -----------: | -----------: | ------------: | ----------------: |\n| Inner Product             |    3,810 K/s |      192 K/s |     5,990 K/s |          __31 x__ |\n| Cosine Distance           |    3,280 K/s |      336 
K/s |     6,880 K/s |          __20 x__ |\n| Euclidean Distance ²      |    4,620 K/s |      147 K/s |     5,320 K/s |          __36 x__ |\n| Jensen-Shannon Divergence |    1,180 K/s |       18 K/s |     2,140 K/s |         __118 x__ |\n\nNumKong's `f16` kernels are faster than GCC's `f32` output — not because of unrolling, but because they use [F16C](https://en.wikipedia.org/wiki/F16C) conversion instructions, widening FMA pipelines, and compensated accumulation that compilers do not synthesize from a plain `for` loop.\nThe same story repeats for `bf16`, `e4m3`, `i8`, and `i4`: these types require algorithmic transformations — lookup tables, algebraic domain shifts, asymmetric [VNNI](https://en.wikipedia.org/wiki/AVX-512#VNNI) tricks — that live beyond the reach of auto-vectorization.\n\n### Parallelism \u0026 Multi-Threading\n\nBLAS libraries traditionally manage their own thread pools.\n[OpenBLAS](https://github.com/OpenMathLib/OpenBLAS/blob/develop/USAGE.md) spawns threads controlled by `OPENBLAS_NUM_THREADS`, [Intel MKL](https://www.intel.com/content/www/us/en/docs/onemkl/developer-guide-linux/2025-1/techniques-to-set-the-number-of-threads.html) forks its own OpenMP runtime via `MKL_NUM_THREADS`, and [Apple Accelerate](https://developer.apple.com/documentation/accelerate/blas) delegates to [GCD](https://developer.apple.com/documentation/dispatch) (Grand Central Dispatch).\nThis works in isolation — but the moment your application adds its own parallelism (joblib, std::thread, Tokio, GCD, OpenMP), you get __thread oversubscription__: MKL spawns 8 threads inside each of your 8 joblib workers, producing 64 threads on 8 cores, thrashing caches and stalling on context switches.\nThe Python ecosystem has built [entire libraries](https://github.com/joblib/threadpoolctl) just to work around this problem, and [scikit-learn's documentation](https://scikit-learn.org/stable/computing/parallelism.html) devotes a full page to managing the interaction between joblib 
parallelism and BLAS thread pools.\n\nNumKong takes a different position: __the numerics layer should not own threads__.\nModern hardware makes the \"spawn N threads and split evenly\" model increasingly untenable:\n\n- __Server-grade CPUs__ have hundreds of cores split across sockets, chiplets, and tiles, resulting in dozens of physical [NUMA](https://en.wikipedia.org/wiki/Non-uniform_memory_access) domains with vastly different memory access latencies.\n  A thread pool that ignores NUMA topology will spend more time on remote memory stalls than on arithmetic.\n- __Consumer-grade CPUs__ pack heterogeneous Quality-of-Service core types on the same die — Intel P-cores and E-cores run at different frequencies and sometimes support different ISA extensions.\n  A naive work-split gives equal chunks to fast and slow cores, and the whole task stalls waiting for the slowest partition.\n- __Real-time operating systems__ in robotics and edge AI cannot afford to yield the main thread to a BLAS-managed pool.\n  These systems need deterministic latency, not maximum throughput.\n\nInstead, NumKong exposes __row-range parameters__ that let the caller partition work across any threading model.\nFor GEMM-shaped `dots_packed`, this is straightforward — pass a slice of A's rows and the full packed B to compute the corresponding slice of C.\nFor SYRK-shaped `dots_symmetric`, explicit `start_row` / `end_row` parameters control which rows of the symmetric output matrix a given thread computes.\nThe [GIL](https://docs.python.org/3/glossary.html#term-global-interpreter-lock) (Global Interpreter Lock) is released around every kernel call, making NumKong compatible with `concurrent.futures`, `multiprocessing`, or any other parallelism model:\n\n```python\nimport concurrent.futures, numkong as nk, numpy as np\n\nvectors, num_threads = np.random.randn(1000, 768).astype(np.float32), 4\noutput = nk.zeros((1000, 1000), dtype=\"float32\")\n\ndef compute_slice(t):\n    start = t * (len(vectors) 
// num_threads)\n    end = start + len(vectors) // num_threads if t \u003c num_threads - 1 else len(vectors)\n    nk.dots_symmetric(vectors, out=output, start_row=start, end_row=end)\n\nwith concurrent.futures.ThreadPoolExecutor(max_workers=num_threads) as pool:\n    list(pool.map(compute_slice, range(num_threads)))\n```\n\nFor users who want a ready-made low-latency thread pool without the oversubscription baggage of OpenMP, we built [ForkUnion](https://github.com/ashvardanian/ForkUnion) — a minimalist fork-join library for C, C++, and Rust that avoids mutexes, CAS atomics, and dynamic allocations on the critical path, with optional NUMA pinning on Linux.\n\n### Memory Allocation \u0026 Management\n\nBLAS libraries typically allocate internal buffers during GEMM — [OpenBLAS](https://github.com/OpenMathLib/OpenBLAS) packs matrices into L2/L3-sized panels via per-thread buffer pools backed by `mmap` or `shmget`.\nThis hidden allocation has caused real problems: [14 lock/unlock pairs per small GEMM call](https://github.com/OpenMathLib/OpenBLAS/issues/478) throttling 12-thread scaling to 2x, [silently incorrect results](https://github.com/OpenMathLib/OpenBLAS/issues/1844) from thread-unsafe allocation in `np.dot`, and [deadlocks after `fork()`](https://github.com/numpy/numpy/issues/30092) due to mutex state not being reset in child processes.\nThe [BLASFEO](https://github.com/giaf/blasfeo) library was created specifically for embedded model-predictive control where `malloc` during computation is unacceptable.\n\nNumKong __never allocates memory__.\nFollowing the same philosophy as [Intel MKL's packed GEMM API](https://www.intel.com/content/www/us/en/developer/articles/technical/introducing-the-new-packed-apis-for-gemm.html) (`cblas_sgemm_pack_get_size` → `cblas_sgemm_pack` → `cblas_sgemm_compute`), NumKong exposes typed three-phase interfaces — `nk_dots_packed_size_*` → `nk_dots_pack_*` → `nk_dots_packed_*` — where the caller owns the buffer and NumKong only fills 
it.\n\nThe reason GEMM libraries repack matrices at all is that every hardware target has a different preferred layout — Intel AMX expects B in a [VNNI-interleaved](https://www.intel.com/content/www/us/en/developer/articles/code-sample/advanced-matrix-extensions-intrinsics-functions.html) tile format (pairs of BFloat16 values packed into DWORDs across the K dimension), while Arm SME wants column vectors for its [FMOPA outer-product](https://developer.arm.com/documentation/ddi0602/latest/SME-Instructions) instructions.\nSince GEMM is $O(N^3)$ and repacking is $O(N^2)$, the cost is asymptotically free — but the allocation and locking overhead is not.\n\nNumKong's `nk_dots_pack_*` family performs five transformations beyond simple reordering:\n\n- __Type pre-conversion__ — mini-floats (E4M3, BFloat16, etc.) are upcast to the compute type once during packing, not on every GEMM call.\n  This amortizes the conversion cost across all rows of A that will be multiplied against the packed B.\n- __SIMD depth padding__ — rows are zero-padded to the SIMD vector width (16 for AVX-512 Float32, 64 for AVX-512 Int8), allowing inner loops to load without boundary checks.\n- __Per-column norm precomputation__ — squared norms ($\\|b_j\\|^2$) are computed and stored alongside the packed data, so distance kernels (`angulars_packed`, `euclideans_packed`) can reuse them without a separate pass.\n- __ISA-specific tile layout__ — AMX packing interleaves BFloat16 pairs into 16×32 tiles matching `TDPBF16PS` expectations; SME packing arranges vectors at SVE granularity for `FMOPA` outer products; generic backends use simple column-major with depth padding.\n- __Power-of-2 stride breaking__ — when the padded row stride is a power of 2, one extra SIMD step of padding is added.\n  Power-of-2 strides cause [cache set aliasing](https://en.algorithmica.org/hpc/cpu-cache/associativity/) where consecutive rows map to the same cache sets, effectively shrinking usable L1/L2 capacity — stride-256 
traversals can be [~10x slower](https://en.algorithmica.org/hpc/cpu-cache/associativity/) than stride-257.\n\n```python\nimport numkong as nk, numpy as np\n\nright_matrix = np.random.randn(1000, 768).astype(np.float16)\nright_packed = nk.dots_pack(right_matrix, dtype=nk.float16)                        # pack once\nfor query_batch in stream: results = nk.dots_packed(query_batch, right_packed)    # reuse many times\n```\n\n### Why Not Just GEMM? The Evolution of Matrix Multiplication APIs\n\nThe classic BLAS GEMM computes $C = \\alpha A B + \\beta C$ for Float32/Float64 matrices.\nIt covers many use cases, but LLM inference, vector search, and quantum simulation expose three ways in which the traditional interface falls short.\n\n__Frozen weights justify separating packing from computation.__\nDuring LLM inference, a very large share of GEMM calls use a static weight matrix — weights don't change after loading.\nThis makes offline repacking a one-time cost amortized over the entire serving lifetime: [NVIDIA's TurboMind](https://arxiv.org/pdf/2508.15601) explicitly splits GEMM into offline weight packing (hardware-aware layout conversion) and online mixed-precision computation, and [Intel MKL's packed GEMM API](https://www.intel.com/content/www/us/en/developer/articles/technical/introducing-the-new-packed-apis-for-gemm.html) exposes the same two-phase pattern.\nNumKong's `nk_dots_pack_*` → `nk_dots_packed_*` path follows this philosophy — pack the weight matrix once, reuse it across all queries.\n\n__Mixed precision demands more than an epilogue addition.__\nModern transformer layers operate in a precision sandwich: weights stored in BFloat16/Float8, [GEMM accumulated in Float32](https://pytorch.org/blog/what-every-user-should-know-about-mixed-precision-training-in-pytorch/), output downcast back to BFloat16 for the next layer.\nBetween GEMM calls, [LayerNorm or RMSNorm](https://arxiv.org/html/2409.12951v2) re-normalizes hidden states, so the next layer is often much 
closer to an angular or normalized similarity computation than to a plain raw dot product.\n[nGPT](https://arxiv.org/html/2410.01131v1) takes this to its logical conclusion: all vectors live on the unit hypersphere, and every matrix-vector product is a pure angular distance.\nThis means many \"GEMM\" workloads in production are semantically closer to many-to-many angular distance computation — which is exactly what NumKong's `angulars_packed` and `euclideans_packed` kernels compute directly, fusing norm handling and type conversion into a single pass.\n\n__The GEMM-for-distances trick has real costs.__\nA common shortcut in vector search is to decompose pairwise Euclidean distance as $\\|a - b\\|^2 = \\|a\\|^2 + \\|b\\|^2 - 2 \\langle a, b \\rangle$, precompute norms, and call `sgemm` for the inner-product matrix.\nBoth [FAISS](https://github.com/facebookresearch/faiss/wiki/Implementation-notes) and [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.euclidean_distances.html) use this approach — and both document its limitations.\nScikit-learn's docs warn of _\"catastrophic cancellation\"_ in the subtraction; this has caused [real bugs](https://github.com/scikit-learn/scikit-learn/issues/9354) with ~37% error on near-identical Float32 vectors.\nThe $O(N^2)$ postprocessing pass (adding norms, square roots, divisions) is not free either — [NVIDIA's RAFT](https://github.com/rapidsai/raft/pull/339) measured a __20–25% speedup__ from fusing it into the GEMM epilogue.\nEven [FAISS switches to direct SIMD](https://github.com/facebookresearch/faiss/wiki/Implementation-notes) when the query count drops below 20.\nThe standard BLAS interface was never designed for sub-byte types either — [no vendor supports Int4](https://research.colfax-intl.com/cutlass-tutorial-sub-byte-gemm-on-nvidia-blackwell-gpus/), and sub-byte types cannot even be strided without bit-level repacking.\n\n__Some operations need more than GEMM + 
postprocessing.__\nNumKong implements several GEMM-shaped operations where the \"epilogue\" is too complex for a simple addition:\n\n- __Bilinear forms__ ($a^T C b$) in quantum computing compute a [scalar expectation value](https://phys.libretexts.org/Bookshelves/Quantum_Mechanics/Advanced_Quantum_Mechanics_(Kok)/10:_Pauli_Spin_Matrices/10.2:_Expectation_Values) — the naive approach materializes an $N$-dimensional intermediate vector $Cb$, but NumKong's typed `nk_bilinear_*` kernels stream through rows of $C$ with nested compensated dot products, never allocating beyond registers.\n  For complex-valued quantum states, where the intermediate would be a 2N-element complex vector, the savings double.\n- __MaxSim scoring__ for [ColBERT-style late-interaction retrieval](https://github.com/stanford-futuredata/ColBERT) computes $\\sum_i \\min_j \\text{angular}(q_i, d_j)$ — a sum-of-min-distances across token pairs.\n  A GEMM would produce the full $M \\times N$ similarity matrix, but NumKong's typed `nk_maxsim_packed_*` kernels fuse a coarse Int8-quantized screening with full-precision angular refinement on winning pairs only, packing both query and document matrices to use all 4 SME tiles as accumulators.\n  [PLAID](https://ar5iv.labs.arxiv.org/html/2205.09707) and [maxsim-cpu](https://www.mixedbread.com/blog/maxsim-cpu) have independently shown that dedicated MaxSim kernels can outperform the GEMM decomposition by 5–10x.\n\nNumKong treats these as first-class operations — `dots_packed`, `euclideans_packed`, `angulars_packed`, typed `nk_bilinear_*` kernels, and typed `nk_maxsim_packed_*` kernels — rather than decomposing everything into GEMM + postprocessing.\n\n### Precision by Design: Saturation, Rounding, \u0026 Float6 Over Float8\n\nFloating-point arithmetic on computers [is not associative](https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html): $(a + b) + c \\neq a + (b + c)$ in general, and upcasting to wider types is not always sufficient.\nNumKong 
makes operation-specific decisions about where to spend precision and where to economize, rather than applying one rule uniformly.\n\n__Saturation depends on the operation.__\nA reduction over a 4 GB array of `i8` values contains ~4 billion elements — but [Int32 wrapping overflow](https://cedardb.com/blog/overflow_handling/) occurs after just ~17 million Int8 summands ($127 \\times 16.9\\text{M} \u003e 2^{31}$).\nReductions in NumKong use saturating arithmetic because the input can be arbitrarily long.\nMatrix multiplications don't need saturation because GEMM depth rarely exceeds tens of thousands — well within Int32 range.\nx86 provides no saturating 32-bit SIMD add ([only byte/word variants](https://www.felixcloutier.com/x86/paddb:paddw:paddd:paddq)), so NumKong implements saturation via overflow detection with XOR-based unsigned comparison on platforms that lack native support.\n\n__Square roots \u0026 special math ops are platform-specific.__\nAngular distance requires $1/\\sqrt{\\|a\\|^2 \\cdot \\|b\\|^2}$ — but the cost of computing this normalization varies dramatically across hardware.\nx86 `VSQRTPS` takes [~12 cycles](https://uops.info/html-lat/SKX/VSQRTPS_XMM_XMM-Measurements.html), followed by `VDIVPS` at ~11 cycles — totalling ~23 cycles for a precise `1/sqrt(x)`.\nThe `VRSQRT14PS` alternative starts with a [14-bit estimate in ~4 cycles](https://www.intel.com/content/www/us/en/developer/articles/code-sample/reference-implementations-for-ia-approximation-instructions-vrcp14-vrsqrt14-vrcp28-vrsqrt28-vexp2.html), then one Newton-Raphson iteration ($y = y \\cdot (1.5 - 0.5 x y^2)$, ~4 more cycles) reaches full Float32 precision — roughly 3x faster.\nARM's `FRSQRTE` provides only [~8 bits](https://github.com/DLTcollab/sse2neon/issues/526), requiring __two__ Newton-Raphson iterations to match.\nNumKong selects the iteration count per platform so the final ULP bound is consistent across ISAs, rather than exposing different precision to different 
users.\n\n__E2M3 and E3M2 can outperform E4M3 and E5M2.__\n6-bit [MX formats](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf) can be scaled to exact integers, enabling integer accumulation that avoids E5M2's catastrophic cancellation risk.\nThis works because E2M3's narrower exponent range means every representable value maps to an integer after a fixed shift — no rounding, no cancellation.\nSee [Mini-Floats](#mini-floats-e4m3-e5m2-e3m2--e2m3) for a worked example.\n\nEvery such decision — saturation thresholds, Newton-Raphson iteration counts, integer vs floating-point paths — is documented per operation and per type in the [module-specific READMEs](include/numkong/).\n\n### Calling Convention \u0026 Error Handling\n\nNumKong never throws exceptions, never sets `errno`, and never calls `setjmp`/`longjmp` — [exceptions bloat call sites with unwind tables](https://monkeywritescode.blogspot.com/p/c-exceptions-under-hood.html) and are invisible to C, Python, Rust, Swift, Go, and JavaScript FFI; `errno` is thread-local state whose [storage model varies across C runtimes](https://en.cppreference.com/w/c/error/errno).\nInstead, every function takes inputs as `const` pointers, writes outputs through caller-provided pointers, and returns `void`:\n\n```c\nvoid nk_dot_f32(nk_f32_t const *a, nk_f32_t const *b, nk_size_t n, nk_f64_t *result);\nvoid nk_dot_bf16(nk_bf16_t const *a, nk_bf16_t const *b, nk_size_t n, nk_f32_t *result);\n```\n\nPointers eliminate implicit casts for types with platform-dependent storage — this is why they matter for half-precision types.\n`nk_f16_t` and `nk_bf16_t` resolve to native `__fp16` / `__bf16` when available but fall back to `unsigned short` otherwise — if passed by value, the compiler would silently apply integer promotion instead of preserving the bit pattern.\nPassing by pointer keeps the representation opaque: kernels read raw and convert explicitly when needed, so the same binary works regardless 
of whether the compiler understands `_Float16`.\n\nThe only place that requires error signaling is [dynamic dispatch](#compile-time-and-run-time-dispatch) — looking up the best kernel for the current CPU at runtime.\nWhen no kernel matches, the dispatcher sets the [capabilities mask](c/dispatch.h) to zero and fills the function pointer with a family-specific error stub, such as `nk_error_dense_` (defined in [c/dispatch.h](c/dispatch.h) and [c/numkong.c](c/numkong.c)), that fills the output with `0xFF` bytes — `NaN` for floats, `−1` for signed integers, `TYPE_MAX` for unsigned.\n\n### Compile-Time and Run-Time Dispatch\n\nNumKong provides two dispatch mechanisms.\n__Compile-time dispatch__ selects the fastest kernel supported by the target platform at build time — thinner binaries and no indirection overhead, but it requires knowing your deployment hardware.\n__Run-time dispatch__ compiles every supported kernel into the binary and picks the best one on the target machine via `nk_capabilities()` — one pointer indirection per call, but a single binary runs everywhere.\nThe run-time path is common in DBMS products (ClickHouse), web browsers (Chromium), and other upstream projects that ship to heterogeneous fleets.\n\nAll kernel names follow the pattern `nk_{operation}_{type}_{backend}`.\nIf you need to resolve the best kernel manually, use `nk_find_kernel_punned` with a `nk_kernel_kind_t`, a `nk_dtype_t`, and a mask of viable capabilities:\n\n```c\nnk_metric_dense_punned_t angular = 0;\nnk_capability_t used = nk_cap_serial_k;\nnk_find_kernel_punned(\n    nk_kernel_angular_k, nk_f32_k,            // what functionality? 
for which input type?\n    nk_capabilities(),                        // which capabilities are viable?\n    (nk_kernel_punned_t *)\u0026angular, \u0026used);   // the kernel found and capabilities used!\n```\n\nThe first call to `nk_capabilities()` initializes the dispatch table; all subsequent calls are lock-free.\n\n## Numeric Types\n\n### Float64 \u0026 Float32: IEEE Precision\n\n__Float64__ — NumKong uses __compensated summation__ that tracks numerical errors separately.\nOn serial paths, we use __[Neumaier's algorithm](https://en.wikipedia.org/wiki/Kahan_summation_algorithm#Further_enhancements)__ (1974), an improvement over Kahan-Babuška that correctly handles cases where an added term is larger than the running sum, achieving $O(1)$ error growth instead of $O(n)$.\nOn SIMD paths with FMA support, we implement the __Dot2 algorithm__ (Ogita-Rump-Oishi, 2005), maintaining separate error compensators for both multiplication and accumulation via `TwoProd` and `TwoSum` operations.\nThe accuracy differences are visible in the [benchmark tables above](#latency-throughput--numerical-stability) — compensated Float64 suits scientific computing, where numerical stability matters more than raw speed.\n\n__Float32__ — SIMD implementations load Float32 values, upcast to Float64 for full-precision multiplication and accumulation, then downcast only during finalization.\nThis avoids catastrophic cancellation at minimal cost, since modern CPUs have dedicated Float64 vector units operating at nearly the same throughput as Float32.\nThe same compensated accumulation strategy applies to Mahalanobis distance, bilinear forms, and KL/JS divergences.\n\n```c\n// Dot2 TwoProd: capture the multiplication rounding error\ndouble h = a * b;           // rounded product\ndouble r = fma(a, b, -h);   // exact rounding error of a * b\n\n// Dot2 TwoSum: capture the addition rounding error\ndouble t = sum + h;\ndouble e = (sum - t) + h;   // compensator term, accumulated separately\n```\n\n### BFloat16 \u0026 Float16: Half Precision\n\n__BFloat16__ — not an IEEE 754 standard type, 
but widely adopted for AI workloads.\nBFloat16 shares Float32's 8-bit exponent but truncates the mantissa to 7 bits, prioritizing __dynamic range over precision__ (±3.4×10³⁸ with coarser granularity).\nOn old CPUs, upcasting BFloat16 to Float32 requires just an unpack and left-shift by 16 bits (essentially free); on newer CPUs, both Arm and x86 provide widening mixed-precision dot products via __DPBF16PS__ (AVX-512 on Genoa/Sapphire Rapids) and __BFDOT__ (NEON on ARMv8.6-A Graviton 3+).\nNumKong's Float8 types (E4M3/E5M2) upcast to BFloat16 before using DPBF16PS, creating a three-tier precision hierarchy: Float8 for storage, BFloat16 for compute, Float32 for accumulation.\n\n__Float16__ — IEEE 754 half-precision with 1 sign bit, 5 exponent bits (bias=15), and 10 mantissa bits, giving a range of ±65504.\nFloat16 prioritizes __precision over range__ (10 vs 7 mantissa bits), making it better suited for values near zero and gradients during training.\nOn x86, older CPUs use __F16C extensions__ (Ivy Bridge+) for fast Float16 → Float32 conversion; Sapphire Rapids+ adds native __AVX-512-FP16__ with dedicated Float16 arithmetic.\nOn Arm, ARMv8.4-A adds __FMLAL/FMLAL2__ instructions for fused Float16 → Float32 widening multiply-accumulate, reducing the total latency from 7 cycles to 4 cycles and achieving 20–48% speedup over the separate convert-then-FMA path.\n\n| Platform               | BFloat16 Path            | Elem/Op | Float16 Path           | Elem/Op |\n| ---------------------- | ------------------------ | ------: | ---------------------- | ------: |\n| __x86__                |                          |         |                        |         |\n| Sapphire Rapids (2023) | ↓ Genoa                  |      32 | ↓ Skylake              |      16 |\n| Genoa (2022)           | `VDPBF16PS` widening dot |      32 | ↓ Skylake              |      16 |\n| Skylake (2015)         | `SLLI` + `VFMADD`        |      16 | `VCVTPH2PS` + `VFMADD` |      16 |\n| Haswell (2013)       
  | `SLLI` + `VFMADD`        |       8 | `VCVTPH2PS` + `VFMADD` |       8 |\n| __Arm__                |                          |         |                        |         |\n| Graviton 3 (2021)      | `SVBFDOT` widening dot   |    4–32 | `SVCVT` → `SVFMLA`     |    4–32 |\n| Apple M2+ (2022)       | `BFDOT` widening dot     |       8 | ↓ FP16FML              |       8 |\n| Apple M1 (2020)        | ↓ NEON                   |       8 | `FMLAL` widening FMA   |       8 |\n| Graviton 2 (2019)      | ↓ NEON                   |       8 | `FCVTL` + `FMLA`       |       4 |\n| Graviton 1 (2018)      | `SHLL` + `FMLA`          |       8 | bit-manip → `FMLA`     |       8 |\n\n\u003e BFloat16 shares Float32's 8-bit exponent, so upcasting is a 16-bit left shift (`SLLI` on x86, `SHLL` on Arm) that zero-pads the truncated mantissa — essentially free.\n\u003e Float16 has a different exponent width (5 vs 8 bits), requiring a dedicated convert: `VCVTPH2PS` (x86 F16C) or `FCVTL` (Arm NEON).\n\u003e Widening dot products (`VDPBF16PS`, `BFDOT`, `FMLAL`) fuse the conversion and multiply-accumulate into one instruction.\n\u003e Sapphire Rapids has native `VFMADDPH` for Float16 arithmetic, but NumKong does not use it for general dot products — Float16 accumulation loses precision.\n\u003e It is only used for mini-float (E2M3/E3M2) paths where periodic flush-to-Float32 windows keep error bounded.\n\n### Mini-Floats: E4M3, E5M2, E3M2, \u0026 E2M3\n\n| Format                    |  Bits |  Range | NumKong Promotion Rules                         | Support in GPUs   |\n| ------------------------- | ----: | -----: | ----------------------------------------------- | ----------------- |\n| E5M2FN                    |     8 | ±57344 | BFloat16 → Float32                              | H100+, MI300+     |\n| E4M3FN                    |     8 |   ±448 | BFloat16 → Float32                              | H100+, MI300+     |\n| E3M2FN                    | 6 → 8 |    ±28 | BFloat16 \u0026 Float16 → 
Float32,\u003cbr/\u003eInt16 → Int32 | only block-scaled |\n| E2M3FN                    | 6 → 8 |   ±7.5 | BFloat16 \u0026 Float16 → Float32,\u003cbr/\u003eInt8 → Int32  | only block-scaled |\n| Block-scaled NVFP4        |     4 |     ±6 | —                                               | B200+             |\n| Block-scaled MXFP4 / E2M1 |     4 |     ±6 | —                                               | B200+, MI325+     |\n\n\u003e __Block scaling.__\n\u003e NumKong does not implement block-scaled variants (MXFP4, NVFP4, or block-scaled E3M2/E2M3).\n\u003e Block scaling couples elements through a shared exponent per block, introducing structural bias into a fundamentally uniform operation.\n\u003e NumKong treats each element independently; block-scaled inputs should be dequantized before processing.\n\n\u003e __FNUZ variants.__\n\u003e AMD MI300 (CDNA 3) uses FNUZ encoding (negative-zero-is-NaN) rather than the OCP standard.\n\u003e MI350+ and NVIDIA H100/B200 both use OCP-standard E4M3FN/E5M2FN.\n\u003e NumKong follows the OCP convention; FNUZ inputs require conversion before processing.\n\n__8-bit floats (E4M3 \u0026 E5M2)__ follow the [OCP FP8 standard](https://www.opencompute.org/documents/ocp-8-bit-floating-point-specification-ofp8-revision-1-0-2023-12-01-pdf-1).\nE4M3FN (no infinities, NaN only) offers finer precision near zero and is preferred for __weights and activations__; E5M2FN (with infinities) provides the wider dynamic range needed for __gradients__ during training.\nOn x86 Genoa/Sapphire Rapids, E4M3/E5M2 values upcast to BFloat16 via lookup tables, then use native __DPBF16PS__ for 2-per-lane dot products accumulating to Float32.\nOn Arm Graviton 3+, the same BFloat16 upcast happens via NEON table lookups, then __BFDOT__ instructions complete the computation.\n\n__6-bit floats (E3M2 \u0026 E2M3)__ follow the [OCP MX v1.0 standard](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf).\nTheir smaller range allows scaling to exact integers that fit in 
`i8`/`i16`, enabling integer `VPDPBUSD`/`SDOT` accumulation instead of the floating-point pipeline.\nFloat16 can also serve as an accumulator, accurately representing ~50 products of E3M2FN pairs or ~20 products of E2M3FN pairs before overflow.\nOn Arm, NEON FHM extensions bring widening `FMLAL` dot-products for Float16 — both faster and more widely available than `BFDOT` for BFloat16.\n\nE4M3 and E5M2 cannot use the integer path.\nE4M3 scaled by 16 reaches 7,680 — too large for Int8, barely fitting Int16 with a 128-entry table.\nE5M2's range (±57,344) makes the scaled product exceed Int32 entirely.\nWithout the integer path, E5M2 falls back to Float32 accumulation — where its [2-bit mantissa (only 4 values per binade)](https://developer.nvidia.com/blog/floating-point-8-an-introduction-to-efficient-lower-precision-ai-training/) creates a [catastrophic cancellation risk](https://www.ac.uma.es/arith2024/papers/Fused%20FP8%204-Way%20Dot%20Product%20with%20Scaling%20and%20FP32%20Accumulation.pdf) that E2M3's integer path avoids completely:\n\n|         |  _i_ = 0 | _i_ = 1 |  _i_ = 2 |   _i_ = 3 |  _i_ = 4 |  _i_ = 5 |  _i_ = 6 |\n| ------- | -------: | ------: | -------: | --------: | -------: | -------: | -------: |\n| _aᵢ_    |  0.00122 |   20480 | −0.00122 |       1.5 |    −3072 |     −640 |  0.00146 |\n| _bᵢ_    |      −40 |     320 |    −1280 |  −7.63e⁻⁵ | 0.000427 |    10240 | −4.58e⁻⁵ |\n| _aᵢ·bᵢ_ | −0.04883 | 6553600 |   1.5625 | −0.000114 |  −1.3125 | −6553600 |      ≈ 0 |\n\n\u003e __Why Float32 accumulation fails here.__\n\u003e The accurate sum of these 7 products is ≈ 0.201.\n\u003e A `vfmaq_f32` call accumulates 4 lanes at a time; the first batch already carries values around ±6.5 M.\n\u003e At that magnitude the Float32 ULP is 0.5 — so the small meaningful terms (−0.049, 1.563, −1.313, −0.0001) are all below one ULP and get absorbed during lane reduction.\n\u003e The large terms then cancel exactly to zero, and the information is gone.\n\u003e Final 
Float32 result: __0.0__ instead of __0.201__.\n\n### Int8 \u0026 Int4: Integer Types\n\nBoth signed and unsigned 8-bit and 4-bit integers are supported with __Int32 accumulation__ to prevent overflow.\nA notable optimization is the __VNNI algebraic transform__: on Ice Lake+ with AVX-512 VNNI, the native __DPBUSD__ instruction is asymmetric (unsigned × signed → signed), but NumKong uses it for both Int8×Int8 and UInt8×UInt8.\nFor __signed Int8×Int8__, we convert the signed operand to unsigned via XOR with `0x80`, compute `DPBUSD(a⊕0x80, b) = (a+128)×b`, then subtract a correction term `128×sum(b)` to recover the true result.\nFor __unsigned UInt8×UInt8__, we XOR the second operand to make it signed, compute `DPBUSD(a, b⊕0x80) = a×(b-128)`, then add the correction `128×sum(a)` via the fast SAD instruction.\n\n__Int4__ values pack two nibbles per byte, requiring bitmask extraction: low nibbles `(byte \u0026 0x0F)` and high nibbles `(byte \u003e\u003e 4)`.\nFor signed Int4, the transformation `(nibble ⊕ 8) - 8` maps the unsigned range [0,15] to the signed range [−8,7].\nSeparate accumulators for low and high nibbles avoid expensive nibble-interleaving and allow SIMD lanes to work in parallel.\n\n```c\n// Asymmetric transform for i8×i8 using DPBUSD (unsigned×signed)\na_unsigned = a ^ 0x80;             // flip the sign bit: signed → unsigned, i.e. a+128\nresult = DPBUSD(a_unsigned, b);    // computes (a+128)×b\ncorrection = 128 * sum(b);         // runs in parallel on a different port\nfinal = result - correction;      // recovers the true a×b\n```\n\n### Binary: Packed Bits\n\nThe `u1x8` type packs 8 binary values per byte, enabling __Hamming distance__ and __Jaccard similarity__ via population-count instructions.\nOn x86, `VPOPCNTDQ` (Ice Lake+) counts set bits in 512-bit registers directly; on Arm, `CNT` (NEON) operates on 8-bit lanes with a horizontal add.\nResults accumulate into `u32` — sufficient for vectors up to 4 billion bits.\nBinary representations are the most compact option for locality-sensitive hashing 
and binary neural network inference.\n\n### Complex Types\n\nNumKong supports four complex types — `f16c`, `bf16c`, `f32c`, and `f64c` — stored as interleaved real/imaginary pairs.\nComplex types are essential in quantum simulation (state vectors, density matrices), signal processing (FFT coefficients, filter design), and electromagnetic modeling.\nThe `dot` operation computes the unconjugated dot product $\\sum a_k b_k$, while `vdot` computes the conjugated inner product $\\sum \\bar{a}_k b_k$ that is standard in physics and signal processing.\n\nFor complex dot products, NumKong defers sign flips until after the accumulation loop: instead of using separate FMA and FMS (fused multiply-subtract) instructions for the real component, we compute $a_r b_r + a_i b_i$ treating all products as positive, then apply a single bitwise XOR with `0x80000000` to flip the sign bits.\nThis avoids execution port contention between FMA and FMS, letting dual FMA units stay occupied.\n\n```c\n// Complex multiply optimization: defer the sign flip until after the loop\nfor (...) {\n    sum_real = fma(a, b, sum_real);          // no sign flip inside the loop\n    sum_imag = fma(a, b_swapped, sum_imag);  // real/imag lanes pre-swapped\n}\nsum_real = xor(sum_real, 0x80000000);        // single XOR after the loop\n```\n\n## License\n\nFeel free to use the project under the Apache 2.0 or the Three-Clause BSD license, whichever you prefer.\n