{"id":50425478,"url":"https://github.com/pmarreck/blip_mp","last_synced_at":"2026-05-31T10:03:50.407Z","repository":{"id":355227008,"uuid":"1227287663","full_name":"pmarreck/blip_mp","owner":"pmarreck","description":"BLIP-storage arbitrary-precision integers in pure Zig. Beats GMP at i64 (1.95-2.66×) and common cryptographic mul (1.12-1.46×) on Apple Silicon. 8240 GMP cross-validation tests pass.","archived":false,"fork":false,"pushed_at":"2026-05-14T20:39:33.000Z","size":1964,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"yolo","last_synced_at":"2026-05-14T22:38:49.694Z","etag":null,"topics":["arbitrary-precision-integers","bignum","blip","cryptography","ffi","gmp","multi-precision","zig"],"latest_commit_sha":null,"homepage":null,"language":"Zig","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/pmarreck.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-02T13:26:06.000Z","updated_at":"2026-05-14T20:39:37.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/pmarreck/blip_mp","commit_stats":null,"previous_names":["pmarreck/blip_mp"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/pmarreck/blip_mp","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pmarreck%2Fblip_mp","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pmarreck%2Fblip_mp/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pmarreck%2Fblip_mp/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pmarreck%2Fblip_mp/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/pmarreck","download_url":"https://codeload.github.com/pmarreck/blip_mp/tar.gz/refs/heads/yolo","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pmarreck%2Fblip_mp/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33726722,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-05-31T02:00:06.040Z","response_time":95,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["arbitrary-precision-integers","bignum","blip","cryptography","ffi","gmp","multi-precision","zig"],"created_at":"2026-05-31T10:03:49.607Z","updated_at":"2026-05-31T10:03:50.396Z","avatar_url":"https://github.com/pmarreck.png","language":"Zig","funding_links":[],"categories":[],"sub_categories":[],"readme":"# blip_mp\n\n**A pure-Zig multi-precision integer library that stores values as BLIP-encoded bytes (variable-length, self-describing) instead of GMP's fixed-width-limb-array representation.** Beats GMP at the most common bignum operations on Apple Silicon. Pure Zig, no inline assembly, no LGPL link constraint.\n\n\u003e *\"Storage-as-wire-form\" + small-buffer-optimization + lazy-on-demand metadata caching. Validated empirically against GMP's 8240-test cross-checked reference.*\n\n---\n\n## Why does this exist?\n\nGMP's `mpz_t` is the de-facto bignum representation in serious numerical software. Its weakness: **per-value overhead is fixed and structural, regardless of how large the actual integer is.**\n\n```\nmpz_t = { int _mp_alloc, int _mp_size, mp_limb_t *_mp_d }\n       ≈ 16 bytes of struct\n       + heap allocation for _mp_d\n       + alignment padding\n       + allocator bookkeeping\n\nStoring the value 5 takes ~24 bytes spread across two cache lines,\nmediated by a malloc round-trip.\n```\n\nFor workloads dominated by small or medium numbers (accumulator loops, EC scalars, polynomial coefficients, hash-derived integers), GMP's structural overhead is a 5–25× memory blowup over the actual information content, plus an allocator round-trip on every value created or destroyed.\n\n**BLIP** ([spec](https://github.com/pmarreck/BLIP)) is a self-describing variable-length integer encoding:\n\n| Value | BLIP bytes | GMP storage |\n|---|---|---|\n| `5` | 1 byte (immediate) | 24+ bytes (struct + heap) |\n| `2^32` | 5 bytes (header + 4 LE) | 24+ bytes |\n| `2^256` | 33 bytes | 16 bytes struct + 32 bytes heap |\n| `2^4096` | 513 bytes contiguous | 16 + 512 + bookkeeping |\n\n`blip_mp` builds an arbitrary-precision integer library directly on top of BLIP storage. No separate length field, no heap pointer indirection for small values, no allocator round-trip for accumulator loops. **The bytes ARE the value.**\n\n---\n\n## What did we measure?\n\nApple Silicon (M-series), aarch64-darwin, Zig 0.16.0 ReleaseFast, libc malloc.\n\n### Headline wins\n\n- **All `i64`-fitting values: 1.95–2.66× faster than GMP**\n- **All three standard RSA mul sizes (1024, 2048, 3072 bit) beat GMP by 1.19–1.45×** — twelve mul sizes total beat GMP after the iter-33 Karatsuba-leaf cleanup added new wins at 4096, 6144, and 8192 bit too.\n- **RSA-2048 trifecta**: blip_mp beats GMP at `mul` (1.16×) + `divMod` (1.31×) + `powm` (1.11×) at the most-deployed crypto operand size worldwide.\n- **Addition at 768-bit and above: 1.04–1.28× faster than GMP** (eight sizes now beat GMP after the M5/M9 bookkeeping cleanup; was previously 1.03× at 4096+ only)\n- **Subtraction at 256-bit and above: 1.03–2.11× faster than GMP** — ten Mp.sub sizes beat GMP after the chunked-`subPayloads` fix mirrored the long-standing addPayloads optimization. Headline: 6144-bit sub went 606 ns → 48 ns (12.7×).\n- **Large multiplication (16384+ bits via Toom-3): 1.03× faster** (modest)\n- **Modular exponentiation (RSA-2048): 13% faster than GMP** (Mp.powm with Montgomery, M7-4.3 + Möller-Granlund-improved inner div). At 1024 we beat by 3%; at 3072 by 8%. This is the headliner for any serious crypto workload (RSA encrypt/decrypt/sign, DH key exchange, ECC scalar mul).\n- **Long division (2048-bit / 1024-bit): 24% faster than GMP** (Mp.divMod with u64-base Knuth Algorithm D) — 36× faster than the byte-base implementation that originally lagged by 28.8×.\n- **Correctness: 13029/13029 random GMP cross-validation tests pass** across add, sub, mul, div, mod, divMod, powm, invMod, and `Fp` (rational) cross-checks against `mpq_t` — the complete modular-arithmetic + exact-rational API.\n- **Real-world: pi-spigot at 10,000 digits runs within 1.22× of GMP** — a streaming Gibbons spigot port with mul-heavy ratio. Started a session at 3.55×; closed 72% of the gap via the same techniques GMP uses (`mpn_mul_1` dispatch for single-limb operand, per-thread scratch caches, specialized squaring, sub-quadratic `mpz_get_str`). At small N (≤100 digits) pi-blip is tied or faster than pi-c (C + GMP-with-asm).\n- **Modular exponentiation at 1024-bit: 21% FASTER than GMP** (after specialized squaring landed) — RSA-1024 modexp falls into blip_mp's schoolbook tier where the n²/2 squaring symmetry trick directly pays out.\n\n### The honest losses\n\nWe're slower than GMP at:\n- **128-bit addition and subtraction** (~0.56–0.65× of GMP). The remaining gap at the smallest size is the load+store ABI cost difference from blip_mp's heavier struct (one cache line + 8B vs GMP's 24-byte mpz_t) — fundamental, not algorithmic. Other small sizes (192-512 bit) are now within 70-90% of GMP after the cleanup landed.\n- **128–256 bit multiplication** (0.40–0.67×). Same per-op overhead.\n- **16384+ bit multiplication** (0.55–0.87× post-iter-33). 8192-bit was 0.72× behind GMP pre-iter-33; the Karatsuba-leaf chunking flipped it to 1.04× WIN. 16384-bit closed from 0.68× → 0.87×. 32768-bit still in FFT territory (0.55×). GMP uses Schönhage-Strassen FFT mul above ~16K-bit. We have a full pure-Zig FFT stack (single-prime NTT + two-prime CRT + NEON-SIMD butterflies) shipped + correctness-validated but gated off in production — even with the 1.40× speedup from vectorization (32K-bit FFT path: 191K → 135K ns), Toom-3 still wins at 100K ns post-iter-33. M6-4-E ladder in PLAN.md targets the alloc-elimination + Stockham + inline-asm levers needed to flip it.\n\n### Surprise: GMP's hand-tuned aarch64 asm gives ~0% advantage on Apple Silicon\n\nWe built a second GMP variant with `--disable-assembly` and ran the same benchmark. The asm advantage is essentially zero across all our test sizes. At 4096-bit and 32768-bit add, **the C reference code is actually faster than the asm.** Modern clang `-O3` generates near-optimal ADCS chains from C `__builtin_add_overflow` that the hand-tuned asm can't beat — and the asm becomes an opaque call boundary that breaks inlining.\n\n**Implication on aarch64:** every gap to GMP is purely algorithmic. We don't need inline asm. We need FFT mul and tighter bookkeeping. Both are pure-Zig achievable.\n\n### Cross-platform: x86_64 (AMD Zen 4) tells a different story\n\nWe ran the same `nix build .#bench` on a NixOS x86_64 Framework laptop (AMD Ryzen 9 7940HS — Zen 4 with full AVX-512, AVX2, BMI/ADX). Two clean findings:\n\n1. **GMP-asm is load-bearing on x86_64**: 1.65–3.88× faster than GMP-noasm depending on size/op. Decades of hand-tuned `mpn_*` chains with `adcx`/`adox`/`mulx` + AVX2/AVX-512 paths IS doing real work on x86_64 — the M-series finding does NOT generalize.\n2. **The BLIP-storage paradigm still wins on x86_64**: compared against GMP-noasm-x86_64 (the storage-paradigm-only baseline), pure-Zig blip_mp wins at 22+ size/op pairs — including 1.46× faster on immediate add, 1.30–1.68× faster on add at 4K–32K-bit, 1.5–1.8× faster on mul at 1K–8K-bit, and 1.73× faster on RSA-2048 powm.\n\nSo on x86_64 vs **GMP-asm**, blip_mp wins all tier-0/1 sizes but loses tier-3 by 1.5–4× (= the GMP-asm advantage above). Closing that on x86_64 would require equivalent hand-asm work in our codebase, OR upstream Zig codegen improvements for `adcx`/`adox` chains and AVX big-int SIMD. Full numbers in `BENCHMARK_RESULTS.md` Run 18 and `RESULTS.md` \"Cross-platform validation\" section.\n\n**Cross-platform conclusion:** the architectural advantage of BLIP-storage + sign-extended SBO + byte-direct chunked arithmetic + Möller-Granlund div + Mont-form powm is **platform-independent**. On M-series it produces a clean win-on-most-things vs GMP. On Zen 4 it produces a clean win-on-most-things vs GMP-noasm. The GMP-asm gap on x86_64 is a separate axis (hand-tuned asm vs pure-Zig codegen) — orthogonal to the storage paradigm.\n\n---\n\n## Quick start\n\nRequires [Nix](https://nixos.org/) (handles the Zig 0.16.0 toolchain and GMP build for benchmarks).\n\n```bash\ngit clone https://github.com/pmarreck/blip_mp\ncd blip_mp\n\n./build           # native ReleaseFast build via nix; also builds bp\n./test            # 461 unit tests + 13029 GMP cross-checks + C FFI smoke + bp CLI smoke\n./result/bin/blip_mp_bench    # run the bench (after nix build .#packages.\u003csys\u003e.bench)\n```\n\n## `bp` — RPN exact-arithmetic calculator\n\n`bp` is a Forth-style RPN calculator that ships in `./result/bin/bp` after `./build`. It dogfoods the C FFI — same headers any downstream Rust/Lua/Python binding would use — so every shipped fix to the FFI surface gets exercised by the calculator itself.\n\n**Every value is an exact arbitrary-precision rational** (M14 `Fp`, base = decimal). No IEEE754. No silent precision loss. Division uses `divExact` and errors when the quotient has no terminating decimal expansion — there's no quiet rounding hiding under the hood.\n\n```bash\n# Three ergonomic input forms — pick whichever's least painful:\nbp 35 factorial 24 factorial '*'      # separate args (escape * for shell)\nbp '35 factorial 24 factorial *'      # ONE quoted arg (no escaping needed)\necho '35 factorial 24 factorial *' | bp   # stdin pipe\n\n# Headline IEEE754 disruption demos:\nbp 0.1 0.2 +                          # → \"0.3\"  (the disruption)\nbp '0.1 0.2 + 0.3 -'                  # → \"0\"    (the proof)\nbp 1 4 /                              # → \"0.25\" (terminates exactly)\nbp 22 7 /                             # → ERROR: non-terminating expansion\nbp 100 fib                            # → 354224848179261915075\nbp '50 25 binomial'                   # → 126410606437752\nbp 48 18 gcd                          # → 6\n\n# Forth-style \":\" definitions — Phase 2 (threaded code, classical semantics):\nbp ': square dup * ; 5 square'        # → 25\nbp ': tau 6.28 ; tau 2 *'             # → 12.56\n\n# Heredoc form (multi-line):\nbp \u003c\u003c'EOF'\n: square dup * ;\n: cube dup square * ;\n3 cube\nEOF\n# → 27\n\n# Re-defining a builtin shadows it for FUTURE lookups but doesn't\n# retroactively rebind earlier compiled bodies (classic Forth):\nbp ': real-fact ! ;\n    : ! drop 999 ;\n    5 real-fact'                      # → still 120\n```\n\n**Operators** (`bp --help` for full list with stack-effect comments):\n`+ - * / % ^ neg abs dup drop swap factorial ! fibonacci fib binomial isqrt sqrt gcd lcm`\n\n**Definitions**: `:` enters compile mode and consumes the next token as the new word's name. `;` is \"immediate\" — it executes even in compile mode and finalises the definition. The body is a list of *resolved* instruction pointers (builtin / user-word / literal-Fp), not text — so re-defining a builtin shadows it for *future* lookups but does NOT retroactively rebind any earlier compiled body. Classic Forth threaded code.\n\n## Use as a library\n\nIn your own Zig code:\n\n```zig\nconst std = @import(\"std\");\nconst blip_mp = @import(\"blip_mp\");\nconst Mp = blip_mp.Mp;\n\npub fn main() !void {\n    const allocator = std.heap.c_allocator;\n\n    var a = Mp.init(allocator);\n    defer a.deinit();\n    var b = Mp.init(allocator);\n    defer b.deinit();\n    var r = Mp.init(allocator);\n    defer r.deinit();\n\n    try a.setI64(123_456_789);\n    try b.setI64(987_654_321);\n    try r.mul(\u0026a, \u0026b);\n\n    // r.bytes() is the canonical signed-BLIP encoding — also the wire form.\n    // No mpz_export round-trip needed.\n    std.debug.print(\"Product encoded as {d} bytes\\n\", .{r.bytes().len});\n\n    const value = try r.getI64(); // works because product fits in i64\n    std.debug.print(\"Value: {d}\\n\", .{value});\n}\n```\n\n---\n\n## Architecture in one paragraph\n\n`Mp` is a 72-byte struct (one cache line + 8B). Inline mode stores values up to 24 bytes encoded in `inline_buf`; heap mode uses `heap_buf` with a `heap_offset` field that lets results be written without shifting (the header gets placed directly before the payload). For inline length-prefixed values, an internal invariant maintains `inline_buf[1..9]` as the full sign-extended i64 — the arithmetic hot path reads it as a single `LDR` u64 load. Cached `(payload_offset, sign, payload_len)` fields skip per-op header parsing. Tier dispatch is automatic: i64-fitting values use native arithmetic; larger values use byte-direct two's-complement add/sub or Karatsuba/Toom-3 mul over the BLIP payload bytes. **No auxiliary \"limb array\" data structure exists** — we read u64/u128/u256/u512 chunks directly from the byte payload via `readInt`/`writeInt`. The bytes ARE the value, all the way through.\n\nFull details in [`CODE_MINIMAP.md`](CODE_MINIMAP.md), benchmark history in [`BENCHMARK_RESULTS.md`](BENCHMARK_RESULTS.md), and the comprehensive results writeup in [`RESULTS.md`](RESULTS.md).\n\n---\n\n## Tradeoffs and limitations\n\n**What this library is:** a research-grade arbitrary-precision integer library that validates the BLIP-storage paradigm and beats GMP at common sizes on Apple Silicon. Pure Zig, no asm, no LGPL constraint.\n\n**What it isn't (yet):**\n- **FFT multiplication is correctness-shipped but gated off** — full single-prime NTT + two-prime CRT + NEON-SIMD vectorized butterflies live in `src/fft.zig`, all bit-identical to GMP across 8240/8240 cross-checks at sizes up to 256K-bit. But constant factors keep Toom-3 ahead at every operand size in our supported range (M-series-specific finding: pure-NEON Montgomery integrates slower than the existing scalar-inside-vector form because it crowds the NEON pipe and starves M4's dual scalar mul pipes). The 13–15% remaining gap needs alloc-elimination + inline asm, planned in M6-4-E.\n- **Modular inverse lags GMP** by ~2.5–4× at 256-2048 bit (down from 29-38× before M9 Lehmer; down from 7-8× after M10 wider-window Lehmer; down from 4-6× after M11 recursive HGCD + the divMod/mul scratch caches landed). Headline: 2048-bit invMod is now 2.49× behind GMP (was 3.96× post-M11, 7.85× pre-M10). Further closure would need the FFT-mul-as-HGCD-leaf option flipped on (M11.2 PROD enablement, blocked on FFT viability).\n- **Two platforms validated** — aarch64-darwin (Apple M-series) is the headline, x86_64-linux (AMD Zen 4 with AVX-512) is the cross-check. Library is bit-portable: 461 unit tests + 13029 GMP cross-validations + C FFI smoke + bp CLI smoke pass on both. The asm-vs-clang result is M-series-specific; on x86_64 GMP's hand-asm is genuinely load-bearing (1.65–3.88× over GMP-noasm). Windows (x86_64 + aarch64) is covered by the Garnix CI cross-build matrix.\n- **Not optimized for non-aligned operand sizes** — `tier3Op` works on any size but is fastest when payload lengths are multiples of 8 bytes (which most cryptographic sizes are).\n\n**What it isn't trying to be:**\n- Not a full GMP replacement. No `mpf_t` (floats), no `mpq_t` (rationals), no `mpfr` (extended-precision floats).\n- Not chasing huge-number records. GMP's per-arch asm tuning is decades of work; we stop being competitive at 64K+ bit operands until FFT lands.\n- Not asm-tuned. The M5-5 controlled experiment showed asm gives ~0% on M-series — and we beat GMP-asm at 22+ sizes there anyway. **On x86_64 the picture differs**: GMP-asm gives 1.65–3.88× over GMP-noasm on Zen 4, so blip_mp loses tier-3 to GMP-asm on x86_64 (but still beats GMP-noasm-x86_64, the storage-paradigm baseline). Closing the x86_64 GMP-asm gap is future work — would require equivalent hand-asm or Zig compiler improvements for `adcx`/`adox` + AVX big-int codegen.\n\n---\n\n## Roadmap\n\n**In priority order:**\n\n1. **Finish the FFT-vs-Toom-3 flip** (M6-4-E in PLAN.md). The FFT primitives, CRT extension, and NEON-SIMD butterfly are all shipped and correctness-validated; closed Toom-3 gap from 1.93× to 1.15×. Remaining 13–15% needs caller-supplied scratch (eliminates 4 per-call allocs ≈ 6–9K ns — partially done via the per-thread scratch caches), wiring Stockham into production, and possibly hand-scheduled aarch64 inline asm for the butterfly inner loop.\n\n2. ~~**Tighter `tier3Op` bookkeeping**~~ — DONE 2026-05-02. Five new add wins (768/1536/2048/3072/4096 bit); 128-bit gap closed by ~40%; 1024-bit at parity.\n\n3. ~~**Cross-platform validation on x86_64 Linux + Windows.**~~ — DONE. Garnix builds aarch64/x86_64 on Linux/Darwin/Windows; cross-platform x86_64 Zen 4 run confirmed (a) GMP-asm IS load-bearing on x86_64 (1.65–3.88× over GMP-noasm — different result than M-series); (b) BLIP storage paradigm still wins vs GMP-noasm-x86_64.\n\n4. ~~**True recursive half-GCD for `Mp.invMod` (M11)**~~ — DONE 2026-05-03. Closed 2048-bit invMod from 7.85× → 2.49× of GMP. Further closure would need M11.2 PROD enablement (blocked on FFT mul viability).\n\n5. ~~**C FFI header** (`include/blip_mp.h`)~~ — DONE 2026-05 (M8). `bp` CLI uses it; downstream Rust/Lua/Python bindings get the same surface.\n\n6. **Karatsuba/Toom-3 squaring variants** — schoolbook squaring (`mpn_sqr_basecase` analogue) is shipped and gives 21% on RSA-1024 powm. The recursive Karatsuba and Toom-3 variants for squaring would close the remaining gap at 4K+ bit modular exponentiation (RSA-4096 / DH-4096).\n\n7. **Direct-to-`r.heap_buf` write in `Mp.mulU64`** — currently writes to scratch + memcpys via `writeMpFromPayload`. Direct-write would save the trailing memcpy (≈5–10% on mul-by-small-constant-heavy workloads like the pi spigot).\n\n8. **Toom-Cook 4-way** for 4K-16K bit mul. Deprioritized — its modest 15-30% gain isn't worth the implementation cost while FFT remains the headliner.\n\n9. **`hyperfine` integration** in `./bm` for proper statistical benchmark aggregation. Current numbers are 3-run hand medians.\n\n---\n\n## License intent\n\nMIT or Apache-2.0 (TBD before tagging a release). Both permit linkage with libgmp (LGPLv3) and downstream commercial use.\n\n---\n\n## Acknowledgments\n\n- The [BLIP encoding spec](https://github.com/pmarreck/BLIP) by Peter Marreck.\n- GMP team for the reference implementation we cross-validate against.\n- Marco Bodrato and Alberto Zanoni for the Toom-3 interpolation formulas.\n\n---\n\n*See [SPEC.md](SPEC.md) for the original design hypothesis, [RESULTS.md](RESULTS.md) for the comprehensive technical writeup, [BENCHMARK_RESULTS.md](BENCHMARK_RESULTS.md) for per-run history, [PLAN.md](PLAN.md) for the milestone checklist, and [CODE_MINIMAP.md](CODE_MINIMAP.md) for the per-file index.*\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpmarreck%2Fblip_mp","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpmarreck%2Fblip_mp","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpmarreck%2Fblip_mp/lists"}