{"id":49947100,"url":"https://github.com/Luce-Org/lucebox-hub","last_synced_at":"2026-05-23T21:00:56.784Z","repository":{"id":349003688,"uuid":"1200680436","full_name":"Luce-Org/lucebox-hub","owner":"Luce-Org","description":"Lucebox: LLM inference server built for speed for specific consumer hardware.","archived":false,"fork":false,"pushed_at":"2026-05-17T16:43:01.000Z","size":8150,"stargazers_count":2135,"open_issues_count":49,"forks_count":200,"subscribers_count":22,"default_branch":"main","last_synced_at":"2026-05-17T17:43:58.817Z","etag":null,"topics":["cuda","cuda-kernels","dflash","kernel","llama-cpp","local-ai","luce","lucebox","megakernel","nvidia-cuda","pflash","qwen","rtx3090","speculative-decoding","speculative-prefill"],"latest_commit_sha":null,"homepage":"https://www.lucebox.com/blog","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Luce-Org.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-03T17:46:29.000Z","updated_at":"2026-05-17T17:05:47.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/Luce-Org/lucebox-hub","commit_stats":null,"previous_names":["luce-org/luce-megakernel","luce-org/lucebox-hub"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Luce-Org/lucebox-hub","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Luce-Org%2Flucebox-hub","tags_url":"https://repos.ecosyste.ms/api/v
1/hosts/GitHub/repositories/Luce-Org%2Flucebox-hub/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Luce-Org%2Flucebox-hub/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Luce-Org%2Flucebox-hub/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Luce-Org","download_url":"https://codeload.github.com/Luce-Org/lucebox-hub/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Luce-Org%2Flucebox-hub/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33412082,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-23T18:09:33.147Z","status":"ssl_error","status_checked_at":"2026-05-23T18:09:31.380Z","response_time":53,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cuda","cuda-kernels","dflash","kernel","llama-cpp","local-ai","luce","lucebox","megakernel","nvidia-cuda","pflash","qwen","rtx3090","speculative-decoding","speculative-prefill"],"created_at":"2026-05-17T16:00:40.654Z","updated_at":"2026-05-23T21:00:56.769Z","avatar_url":"https://github.com/Luce-Org.png","language":"C++","funding_links":[],"categories":["C++"],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n  \u003cimg src=\"assets/banner.png\" alt=\"Lucebox\" 
width=\"85%\"\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://lucebox.com\"\u003e\u003cimg src=\"https://img.shields.io/badge/lucebox.com-f5c842?style=for-the-badge\u0026logo=safari\u0026logoColor=f5c842\u0026labelColor=090909\" alt=\"lucebox.com\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://discord.gg/yHfswqZmJQ\"\u003e\u003cimg src=\"https://img.shields.io/badge/Discord-f5c842?style=for-the-badge\u0026logo=discord\u0026logoColor=f5c842\u0026labelColor=090909\" alt=\"Discord\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://lucebox.com/blog\"\u003e\u003cimg src=\"https://img.shields.io/badge/Blog-f5c842?style=for-the-badge\u0026logo=rss\u0026logoColor=f5c842\u0026labelColor=090909\" alt=\"Blog\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"LICENSE\"\u003e\u003cimg src=\"https://img.shields.io/badge/License-Apache_2.0-e8e8ed?style=for-the-badge\u0026labelColor=090909\" alt=\"Apache 2.0\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://developer.nvidia.com/cuda-toolkit\"\u003e\u003cimg src=\"https://img.shields.io/badge/CUDA-12%2B-76b900?style=for-the-badge\u0026logo=nvidia\u0026logoColor=76b900\u0026labelColor=090909\" alt=\"CUDA 12+\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://rocm.docs.amd.com/projects/HIP/en/latest/\"\u003e\u003cimg src=\"https://img.shields.io/badge/HIP-7%2B-ed1c24?style=for-the-badge\u0026logo=amd\u0026logoColor=ed1c24\u0026labelColor=090909\" alt=\"HIP 7+\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://isocpp.org\"\u003e\u003cimg src=\"https://img.shields.io/badge/C%2B%2B-17-e8e8ed?style=for-the-badge\u0026logo=cplusplus\u0026logoColor=e8e8ed\u0026labelColor=090909\" alt=\"C++17\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003cstrong\u003eLocal LLM inference server built for speed. 
Custom kernels, speculative prefill \u0026 decoding, quantized GGUF paths.\u003c/strong\u003e\u003cbr/\u003e\n  Each project is a new optimization to our engine for a specific model family and hardware target.\n\u003c/p\u003e\n\n---\n\n## Projects\n\nEach directory is a self-contained project with setup instructions and benchmark notes.\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"megakernel/\"\u003e\u003cimg src=\"assets/svg/card-megakernel-dark.svg\" alt=\"Megakernel\" width=\"46%\"\u003e\u003c/a\u003e\n  \u0026nbsp;\u0026nbsp;\n  \u003ca href=\"dflash/\"\u003e\u003cimg src=\"assets/svg/card-dflash-dark.svg\" alt=\"DFlash 27B\" width=\"46%\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"pflash/\"\u003e\u003cimg src=\"assets/svg/card-pflash-dark.svg\" alt=\"PFlash speculative prefill\" width=\"46%\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n---\n\n## Supported models\n\nAll speedups measured vs vendored llama.cpp (`-fa 1`, matching KV quant).\n\n| GPU | Model | TTFT speedup | Decode speedup |\n|-----|-------|:------------:|:--------------:|\n| RTX 3090 | Qwen 3.5-0.8B (Megakernel) | — | **~2×** vs F16 |\n| RTX 3090 | Qwen 3.5-27B Q4_K_M (DFlash + DDTree) | — | **3.43×** vs AR |\n| RTX 3090 | Qwen 3.6-27B Q4_K_M (DFlash + PFlash) | **10.4×** @ 128K | **~3×** vs AR |\n| RTX 3090 | Laguna-XS.2 33B-A3B Q4_K_M (DFlash + PFlash) | **5.4×** @ 128K | AR (draft pending) |\n| RTX 5090 | Qwen 3.6-27B Q4_K_M (DFlash + DDTree) | — | **4.84×** vs AR (205 tok/s) |\n| Ryzen AI MAX+ 395 (gfx1151) | Qwen 3.5-27B Q4_K_M (DFlash + PFlash, HIP) | **2.24×** @ 16K | **3.08×** vs llama.cpp HIP AR (37 tok/s) |\n\n## Client harnesses\n\n[`harness/`](harness/) contains RTX 3090 client launchers and regression tests\nfor Lucebox server compatibility. 
Use it to run Lucebox inside Claude Code,\nCodex, OpenCode, Hermes, Pi, OpenClaw, or Open WebUI, or to check that a server\nchange still works with those clients.\n\n```bash\nharness/clients/run_codex.sh\nharness/clients/run_claude_code.sh\npython3 harness/client_test_runner.py probe --url http://127.0.0.1:8000\n```\n\nThe harness can also launch the native C++ HTTP server instead of the Python\nserver wrapper:\n\n```bash\nLUCEBOX_SERVER_BACKEND=cpp \\\nDFLASH_SERVER_BIN=dflash/build/dflash_server \\\nMAX_CTX=32768 BUDGET=22 VERIFY_MODE=ddtree \\\nharness/clients/run_codex.sh\n```\n\n## 01 · Megakernel Qwen3.5 0.8B on RTX 3090\n\nSingle-kernel CUDA inference for Qwen 3.5-0.8B on RTX 3090. All 24 layers run in one persistent dispatch.\n\n```bash\n# 1. clone + enter\ngit clone https://github.com/Luce-Org/lucebox-hub \u0026\u0026 cd lucebox-hub\n\n# 2. install via the workspace (Python 3.12, CUDA 12+, PyTorch 2.0+).\n#    Weights stream from HF on first run.\nuv sync --extra megakernel          # builds the CUDA extension; torch is auto-installed first, then setup.py compiles\n\n# 3. run the benchmark (prefill pp520 + decode tg128 vs llama.cpp BF16 + PyTorch HF)\nuv run --directory megakernel python final_bench.py\n```\n\n\u003e Don't have `uv`? Install with `curl -LsSf https://astral.sh/uv/install.sh | sh` or see [astral.sh/uv](https://astral.sh/uv/). The legacy `python -m venv` + `pip install -e . 
--no-build-isolation` flow still works from inside `megakernel/`.\n\n| Method | Prefill pp520 | Decode tg128 | tok/J |\n|--------|:-------------:|:------------:|:-----:|\n| **Megakernel** `@220W` | **21,347** | **413** | **1.87** |\n| llama.cpp BF16 `@350W` | 11,247 | 267 | 0.76 |\n| PyTorch HF | 7,578 | 108 | n/a |\n\nImplementation notes: 82 blocks, 512 threads, cooperative grid sync, no CPU round trips between layers, and weights streamed from Hugging Face on first run.\n\n[Full writeup →](megakernel/README.md) · [Benchmarks →](megakernel/RESULTS.md) · [Blog post →](https://lucebox.com/blog/megakernel)\n\n\u003e **Blackwell (RTX 5090, DGX Spark / GB10):** auto-detected by setup; NVFP4 decode path lands ~194 tok/s tg128 on GB10. See [megakernel/README.md#blackwell-sm_120--sm_121a](megakernel/README.md).\n\n---\n\n## 02 · DFlash DDtree Qwen3.5 \u0026 Qwen3.6 27B GGUF on RTX 3090\n\nDFlash speculative decoding for Qwen3.5/Qwen3.6 27B GGUF targets on a single GPU. The default setup uses Qwen3.6-27B Q4_K_M plus the Lucebox Q8_0 GGUF DFlash draft.\n\n- **Up to 207 tok/s** in the demo (207.6 tok/s DFlash vs 38.0 tok/s AR, 5.46×)\n- **129.5 tok/s mean** on the HumanEval 10-prompt bench\n- **3.43× faster than autoregressive** (+15% over chain speculative decoding)\n- **2.8× faster than SGLang AWQ** on the same hardware\n- **Up to 256K context in 24 GB** via TurboQuant TQ3_0 KV cache (128K Q4_0 bench: 134.78 tok/s at ctx=131072)\n\n```bash\n# 1. clone with submodules (pulls the pinned Luce-Org/llama.cpp@luce-dflash fork)\ngit clone --recurse-submodules https://github.com/Luce-Org/lucebox-hub \u0026\u0026 cd lucebox-hub\n\n# 2. install Python deps via the workspace (creates one shared .venv at the\n#    repo root).\nuv sync\n\n# 3. 
build the C++/CUDA decoder (CUDA 12+, CMake 3.18+)\n# Default compiles for Pascal/Volta/Turing/Ampere (60/61/62/70/75/86; +120 on CUDA 12.8+, +sm_121/DGX Spark on CUDA 12.9+, +sm_110/Thor on CUDA 13.0+) so the binary runs on every supported card.\n# 3090-only users can add -DCMAKE_CUDA_ARCHITECTURES=86 to skip the other archs and build faster (~3 min).\ncmake -B dflash/build -S dflash -DCMAKE_BUILD_TYPE=Release\ncmake --build dflash/build --target test_dflash -j\ncmake --build dflash/build --target test_generate -j\ncmake --build dflash/build --target dflash_server -j\n\n# 4. fetch weights: ~16 GB Q4_K_M target + 1.84 GB Lucebox Q8_0 GGUF DFlash draft\nuv run hf download unsloth/Qwen3.6-27B-GGUF Qwen3.6-27B-Q4_K_M.gguf --local-dir dflash/models/\nuv run hf download Lucebox/Qwen3.6-27B-DFlash-GGUF dflash-draft-3.6-q8_0.gguf --local-dir dflash/models/draft/\n\n# 5a. one-shot streaming generate\nuv run --directory dflash python scripts/run.py --prompt \"def fibonacci(n):\"\n\n# 5b. or reproduce the paper-style bench (HumanEval + GSM8K + Math500, ~15 min)\nuv run --directory dflash python scripts/bench_llm.py\n```\n\n| Benchmark | AR (tok/s) | DFlash+DDTree (tok/s) | Speedup |\n|-----------|:----------:|:---------------------:|:-------:|\n| **HumanEval** | 37.8 | **129.5** | **3.43×** |\n| Math500 | 37.7 | 110.5 | 2.93× |\n| GSM8K | 37.7 | 96.2 | 2.55× |\n\n**Why GGUF/Q4_K_M:** on 24 GB GPUs, the target, draft, DDTree verify state, and KV cache need to fit together. 
The default Qwen3.6 setup uses a ~16 GB Q4_K_M target and a 1.84 GB GGUF draft.\n\nAlgorithms used:\n- [**DFlash**](https://arxiv.org/abs/2602.06036) (z-lab, 2026): block-diffusion draft conditioned on target hidden states.\n- [**DDTree**](https://arxiv.org/abs/2604.12989) (Ringel et al., 2026): tree-structured verify that beats chain verify at the same compute budget.\n\nImplemented here:\n- C++/CUDA decode engine on top of ggml (no libllama, no Python runtime, Q4_K_M target path).\n- Three custom CUDA kernels for tree-aware SSM state rollback: `ggml_ssm_conv_tree`, `ggml_gated_delta_net_tree`, `ggml_gated_delta_net_tree_persist`.\n- DDTree budget swept for RTX 3090 + Q4_K_M target: **budget=22** is the sweet spot.\n- TQ3_0 KV cache (TurboQuant 3.5 bpv, default) + sliding `target_feat` ring to fit up to 256K context in 24 GB (Q4_0 available as legacy, tops out near 128K).\n\n### Running on other GPUs (4090, 5090, DGX Spark / GB10, Jetson AGX Thor)\n\nSupported out of the box; the build just needs the right CUDA toolkit. 
`dflash/CMakeLists.txt` already auto-adds Blackwell archs when your nvcc is new enough, so the main quickstart above works as-is on newer cards.\n\n| GPU | Arch | Min CUDA | Status |\n|-----|:----:|:--------:|--------|\n| Tesla P40 Pascal | `sm_61` | 12.0 | supported with scalar F16 fallback; needs 24 GB for the 27B stack |\n| Tesla V100 Volta | `sm_70` | 12.0 | supported with F16 WMMA kernels |\n| RTX 3090 Ampere | `sm_86` | 12.0 | **reference, all numbers above** |\n| RTX 2080 Ti Turing | `sm_75` | 12.0 | supported, 53 tok/s DFlash verified (FP16 draft) |\n| RTX 4090 Ada | `sm_89` | 12.0 | should work, unverified, pass `-DCMAKE_CUDA_ARCHITECTURES=89` |\n| RTX 5090 Blackwell consumer | `sm_120` | 12.8 | **205 tok/s DFlash, 4.84× vs AR** (Q4_K_M, budget=40) |\n| DGX Spark / GB10 | `sm_121` (compute capability 12.1) | 12.9 | supported, auto-added by CMake |\n| Jetson AGX Thor | `sm_110` | 13.0 | supported, auto-added by CMake |\n\nVerify your target:\n```bash\npython -c \"import torch; p=torch.cuda.get_device_properties(0); print(p.name, 'sm_%d%d'%(p.major,p.minor), p.multi_processor_count,'SMs', round(p.total_memory/1e9,1),'GB')\"\nnvcc --version\n```\n\n**DGX Spark / GB10 quick start:**\n```bash\n# CUDA 12.9+ required for sm_121\nnvcc --version  # must show \u003e= 12.9\ngit clone --recurse-submodules https://github.com/Luce-Org/lucebox-hub \u0026\u0026 cd lucebox-hub/dflash\ncmake -B build -S . -DCMAKE_BUILD_TYPE=Release   # CMake auto-adds sm_121\ncmake --build build --target test_dflash -j\n```\n\n**Jetson AGX Thor quick start:**\n```bash\n# CUDA 13.0+ required for sm_110 / AGX Thor.\nnvcc --version\ngit clone --recurse-submodules https://github.com/Luce-Org/lucebox-hub \u0026\u0026 cd lucebox-hub/dflash\ncmake -B build -S . -DCMAKE_BUILD_TYPE=Release   # CMake auto-adds the Thor arch your nvcc supports\ncmake --build build --target test_dflash -j\n```\n\n**Retune per GPU:**\n- **DDTree `budget=22`** tuned for 3090 + Q4_K_M + 24 GB. 
On the RTX 5090, budget=40 is optimal (swept). On GB10 (128 GB unified), re-sweep — larger tree = more verify throughput until memory bandwidth saturates. `scripts/bench_llm.py --budget N` has the sweep hooks.\n- **TQ3_0 KV cache + sliding `target_feat` ring** was shaped by 24 GB (fits up to 256K context on a 3090). On GB10 (128 GB unified) / 5090 (32 GB) you can push context further or skip quantization entirely and keep F16 KV.\n- **Perf numbers** (207 tok/s demo, 129.5 HumanEval, 2.8× vs SGLang AWQ) are RTX 3090 @ stock. RTX 5090 numbers (205 tok/s HumanEval, 4.84×) are in [RESULTS.md](dflash/RESULTS.md). Ada/GB10/Thor not yet swept, PRs with `RESULTS.md` entries welcome.\n\n[Full writeup →](dflash/README.md) · [Benchmarks →](dflash/RESULTS.md) · [Blog post →](https://lucebox.com/blog/dflash27b)\n\n---\n\n## 03 · PFlash speculative prefill on RTX 3090\n\nSpeculative prefill for long prompts. A Qwen3-0.6B BF16 drafter scores token importance, then the 27B target prefills only the retained spans. Runtime is C++/CUDA through the dflash binaries; no PyTorch is required at serving time.\n\n- **~10.4× TTFT** on 128K context: **24.8 s** dflash daemon vs **~257 s** llama.cpp (FA on, Q4_0 KV).\n- **10.0× TTFT** on 64K context: **13.5 s** dflash vs **134.95 s** llama.cpp.\n- **NIAH single-needle retrieved** at every measured context (32K → 128K), `keep_ratio=0.05`, `DFLASH_FP_ALPHA=0.85`.\n\n```bash\n# 1. build dflash + BSA kernel (sm_80+ required for BSA, ~10 min cold compile)\ngit clone --recurse-submodules https://github.com/Luce-Org/lucebox-hub \u0026\u0026 cd lucebox-hub/dflash\ncmake -B build -S . -DCMAKE_BUILD_TYPE=Release \\\n                    -DCMAKE_CUDA_ARCHITECTURES=86 \\\n                    -DDFLASH27B_ENABLE_BSA=ON\ncmake --build build --target test_dflash test_flashprefill_kernels -j\n\n# 2. 
fetch weights: 27B Q4_K_M target + 0.6B BF16 drafter (GGUF) + DFlash spec-decode draft\nhf download unsloth/Qwen3.6-27B-GGUF Qwen3.6-27B-Q4_K_M.gguf --local-dir models/\nhf download unsloth/Qwen3-0.6B-GGUF Qwen3-0.6B-BF16.gguf --local-dir models/\nhf download Lucebox/Qwen3.6-27B-DFlash-GGUF dflash-draft-3.6-q8_0.gguf --local-dir models/draft/\n\n# 3. run the daemon: compress (drafter scoring) + generate (target spec decode)\nDFLASH_FP_USE_BSA=1 DFLASH_FP_ALPHA=0.85 \\\n./build/test_dflash models/Qwen3.6-27B-Q4_K_M.gguf models/draft/dflash-draft-3.6-q8_0.gguf --daemon\n# stdin protocol: `compress \u003cids.bin\u003e \u003ckeep_x1000\u003e \u003cdrafter.gguf\u003e` →\n#                 stream of compressed token ids, then `generate \u003c…\u003e` →\n#                 stream of generated tokens.\n```\n\n| Source S | dflash TTFT | llama.cpp baseline | Speedup | NIAH |\n|----------|:-----------:|:------------------:|:-------:|:----:|\n| **64K**  | **13.5 s** | 134.95 s (FA off, dense) | **10.0×** | ✅ |\n| **128K** | **24.8 s** | ~257 s (FA on, Q4_0 KV)  | **~10.4×** | ✅ |\n\nDaemon stdin commands: `compress` runs the drafter with FlashPrefill block-sparse attention and returns the compressed token-id stream; `generate` runs the target on that stream with normal speculative decode + DDTree. 
`park` / `unpark` / `free drafter` swap weights in and out of VRAM so target + drafter coexist on a 24 GB card.\n\n**Runtime tunables** (full list in [`dflash/src/flashprefill.h`](dflash/src/flashprefill.h)):\n```\nDFLASH_FP_USE_BSA=1     # dispatch sparse FA forward through BSA (sm_80+)\nDFLASH_FP_ALPHA=0.85    # block-selection threshold; higher = stricter = fewer K-blocks per Q-row\nDFLASH_FP_PROFILE=1     # log mean / score / select / forward stage timings\n```\n\n**What's ours, what isn't.** Algorithms are from [Cross-Family Speculative Prefill (Liu et al., ICLR 2026)](https://arxiv.org/abs/2603.02631) for the scoring + selection layer and [FlashPrefill (Fan et al., 2026)](https://arxiv.org/abs/2603.06199) for the drafter sparse-attention forward. What we built:\n- C++/CUDA daemon-resident speculative prefill in front of a quantized GGUF target — no PyTorch, no Triton, no per-request subprocess.\n- BSA wired without `libtorch` via a 3-header ATen/c10 stub set under `dflash/deps/bsa_stubs/`.\n- Custom Qwen3-0.6B forward (`qwen3_0p6b_*`) so the drafter runs through the same ggml allocator as the 27B target.\n- 4 CUDA kernels (`flashprefill_kernels.cu`) for the FlashPrefill `mean_K / score / select / sparse_fwd` algorithm.\n\n[Full writeup →](pflash/README.md) · [Daemon-side build / tunables →](dflash/docs/SPEC_PREFILL.md) · [Blog post →](https://lucebox.com/blog/pflash)\n\n---\n\n## AMD Strix Halo (HIP backend)\n\n**Same DFlash + PFlash stack on an AMD iGPU.** PR #119 ports the Phase 2 rocWMMA flashprefill kernels to HIP. End-to-end on a single Ryzen AI MAX+ 395 box (Radeon 8060S iGPU, gfx1151, 128 GiB LPDDR5X-8000 unified): **37.0 tok/s** DFlash decode on Qwen3.5-27B Q4_K_M, **27.6 s** TTFT at 16K context with NIAH retrieval intact. That is **3.08×** decode and **2.24×** prefill over llama.cpp HIP AR on the same iGPU. 
End-to-end wall clock at a realistic 16K prompt + 1K generation workload: **2.66×** faster than vanilla llama.cpp.\n\n```bash\ngit clone --recurse-submodules https://github.com/Luce-Org/lucebox-hub \u0026\u0026 cd lucebox-hub/dflash\n\n# Build for gfx1151 (Strix Halo). Swap the arch for gfx1100 / gfx1201.\ncmake -B build -S . \\\n  -DCMAKE_BUILD_TYPE=Release \\\n  -DDFLASH27B_GPU_BACKEND=hip \\\n  -DDFLASH27B_HIP_ARCHITECTURES=gfx1151 \\\n  -DDFLASH27B_HIP_SM80_EQUIV=ON\ncmake --build build --target test_dflash -j\n```\n\n`DFLASH27B_HIP_SM80_EQUIV=ON` enables the rocWMMA Phase 2 flashprefill kernels (the path that delivers the prefill speedup). `OFF` falls back to ggml's `flash_attn_ext` (slower but no rocwmma headers needed).\n\n**Per-arch DDTree tuning**: gfx1151 (Strix Halo iGPU, bandwidth-bound on LPDDR5X) peaks at `--ddtree-budget=22`. gfx1100 (7900 XTX, GDDR6) prefers `budget=8` per the [PR #156 cross-arch perf plan](https://github.com/Luce-Org/lucebox-hub/pull/156). Run `scripts/bench_he.py --ddtree-budget N` to verify on your card.\n\n**Drafter recipe for max decode**: target = Qwen3.5-27B Q4_K_M, drafter = same gen quantized to Q8_0 via `dflash/scripts/quantize_draft_q8.py`. The matching Q8_0 GGUF on the unsloth Qwen3.6 target needs `DFLASH27B_DRAFT_SWA=2048` for sliding-window correctness.\n\n[Blog post →](https://lucebox.com/blog/amd) · [PR #119 →](https://github.com/Luce-Org/lucebox-hub/pull/119) · [PR #156 cross-arch perf plan →](https://github.com/Luce-Org/lucebox-hub/pull/156)\n\n---\n\n## Why this exists\n\nLocal AI should be a default, not a privilege: private data, no per-token bill, no vendor lock-in. The hardware to run capable models already sits on desks. The software to run those chips well doesn't.\n\nGeneral-purpose frameworks dominated the last decade because hand-tuning kernels per chip was too expensive to justify. One stack, decent on everything, great on nothing. 
Most of the silicon's capability stays on the floor.\n\nAI-assisted development flips that calculus. Rewrites that took a quarter now fit in a release cycle. Lucebox is where we publish them, one chip and one model family at a time. Apache 2.0 source, full writeup, reproducible benchmarks.\n\n---\n\n## Requirements\n\nAll experiments in this repo are built, tuned, and benchmarked on NVIDIA RTX 3090 (2020), the reference target. Supported GPU families:\n\n- **Ampere** (sm_86, RTX 3090 / A-series): reference, CUDA 12+.\n- **Ada** (sm_89, RTX 40xx): should work, unverified, CUDA 12+.\n- **Blackwell consumer** (sm_120, RTX 50xx incl. 5090): supported, CUDA 12.8+.\n- **DGX Spark / GB10** (sm_121, compute capability 12.1): supported, CUDA 12.9+.\n- **Jetson AGX Thor** (sm_110): supported, CUDA 13+.\n- **Turing** (sm_75, RTX 2080): supported, CUDA 12+.\n\nPyTorch 2.0+. `dflash/` needs CMake 3.18+ and `--recurse-submodules` for the pinned `Luce-Org/llama.cpp@luce-dflash` fork (three tree-mode ggml ops); multi-arch build is automatic (see [Running on other GPUs](#running-on-other-gpus-4090-5090-dgx-spark--gb10-jetson-agx-thor)).\n\n**Megakernel porting note.** `megakernel/setup.py` auto-detects the GPU arch and SM count at build time via `torch.cuda.get_device_capability()`. The decode grid is persistent (one block per SM) and is clamped to the resident-block ceiling at runtime, so no manual tuning is needed. On SM \u003c 80 (Turing), the kernel uses FP16 instead of BF16 via a compile-time `TARGET_SM` flag; on SM \u003e= 80 (Ampere+), BF16 is used. From the workspace root, `uv sync --extra megakernel` builds the extension; the legacy `pip install -e . 
--no-build-isolation` flow still works from inside `megakernel/`.\n\n**Optional, find your GPU's sweet spot:** `sudo nvidia-smi -pl 220` (megakernel hits best tok/J at 220 W on 3090; re-sweep for other cards).\n\n---\n\n## Repository layout\n\n```\nlucebox-hub/\n├── megakernel/    · fused forward pass for Qwen 3.5-0.8B\n├── dflash/        · DFlash speculative decoding port for Qwen 3.5/3.6-27B on RTX 3090\n├── pflash/        · speculative-prefill harness in front of dflash (~10.4× TTFT at 128K)\n└── assets/        · banners, cards, diagrams\n```\n\n---\n\n## Roadmap\n\n```\n  Q1 2026    ▮▮▮▮▮▮▮▮▮▮    RTX 3090 kernels \u0026 optimizations\n  Q2 2026    ▮▮▮▮▮▯▯▯▯▯    Ryzen AI MAX+ 395 optimizations\n  Q2 2026    ▮▮▯▯▯▯▯▯▯▯    Heterogeneous CPU + GPU latency optimizations\n  Q2 2026    ▮▯▯▯▯▯▯▯▯▯    Lucebox OS for local AI machines\n  Q3 2026    ▯▯▯▯▯▯▯▯▯▯    Lucebox official launch\n```\n\n---\n\n## Citation\n\n```bibtex\n@software{lucebox_2026,\n  title  = {Lucebox: Open LLM Inference, Rewritten by Hand for One Specific Chip at a Time},\n  author = {Lucebox},\n  url    = {https://github.com/Luce-Org/lucebox-hub},\n  year   = {2026}\n}\n```\n\nPer-project citations live in each subproject's README.\n\n---\n\n## Inspired by\n\n- [Hazy Research](https://hazyresearch.stanford.edu/blog/2025-05-27-no-bubbles): megakernel idea and the intelligence-per-watt methodology.\n- [z-lab/DFlash](https://arxiv.org/abs/2602.06036) (Wang et al., 2026): block-diffusion speculative decoding algorithm. We use their published Qwen3.5/Qwen3.6-27B-DFlash draft weights as-is.\n- [DDTree](https://arxiv.org/abs/2604.12989) (Ringel \u0026 Romano, 2026): tree-structured verify that DFlash 27B uses for its 3.43× speedup over autoregressive decoding (+15% over chain spec decoding). 
[liranringel/ddtree](https://github.com/liranringel/ddtree).\n- [AlpinDale/qwen_megakernel](https://github.com/AlpinDale/qwen_megakernel), [Infatoshi/MegaQwen](https://github.com/Infatoshi/MegaQwen): prior art on fused Qwen kernels.\n\n---\n\n## Community\n\n- **Discord**: [discord.gg/yHfswqZmJQ](https://discord.gg/yHfswqZmJQ)\n- **Website**: [lucebox.com](https://lucebox.com)\n- **Issues**: [github.com/Luce-Org/lucebox-hub/issues](https://github.com/Luce-Org/lucebox-hub/issues)\n- **Blog**: [lucebox.com/blog](https://lucebox.com/blog)\n\n---\n\n\u003cp align=\"center\"\u003e\n  \u003csub\u003e\u003ca href=\"LICENSE\"\u003eApache 2.0\u003c/a\u003e · \u003ca href=\"https://lucebox.com\"\u003eLucebox.com\u003c/a\u003e\u003c/sub\u003e\n\u003c/p\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FLuce-Org%2Flucebox-hub","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FLuce-Org%2Flucebox-hub","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FLuce-Org%2Flucebox-hub/lists"}