{"id":50873465,"url":"https://github.com/g023/cuda_inf","last_synced_at":"2026-06-15T07:05:48.244Z","repository":{"id":362998051,"uuid":"1261582068","full_name":"g023/cuda_inf","owner":"g023","description":"A self-contained CUDA inference engine for LiquidAI/LFM2.5-8B-A1B (hybrid conv + GQA-attention MoE, 8.5B params, 1B active) targeting a single RTX 3060 (12 GB). No Python, no frameworks at runtime: a single .cu engine + a header-only byte-level BPE tokenizer. ","archived":false,"fork":false,"pushed_at":"2026-06-06T23:44:47.000Z","size":51,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-07T00:10:35.161Z","etag":null,"topics":["3060","ai","c","cpp","cuda","fast-inference","gpu","inference","inference-engine","large-language-models","lfm25","liquidai","llm","moe","nvidia","open-source","rtx","token"],"latest_commit_sha":null,"homepage":"https://github.com/g023/cuda_inf","language":"Cuda","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/g023.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-06-06T22:11:51.000Z","updated_at":"2026-06-06T23:44:50.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/g023/cuda_inf","commit_stats":null,"previous_names":["g023/cuda_inf"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/g023/cuda_inf","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/g023%2Fcuda_inf","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/g023%2Fcuda_inf/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/g023%2Fcuda_inf/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/g023%2Fcuda_inf/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/g023","download_url":"https://codeload.github.com/g023/cuda_inf/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/g023%2Fcuda_inf/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34351469,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-15T02:00:07.085Z","response_time":63,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["3060","ai","c","cpp","cuda","fast-inference","gpu","inference","inference-engine","large-language-models","lfm25","liquidai","llm","moe","nvidia","open-source","rtx","token"],"created_at":"2026-06-15T07:05:45.693Z","updated_at":"2026-06-15T07:05:48.237Z","avatar_url":"https://github.com/g023.png","language":"Cuda","funding_links":[],"categories":[],"sub_categories":[],"readme":"# LFM2.5-8B-A1B pure C/C++/CUDA inference engine\n# Author: g023 (https://github.com/g023/)\n\n**Note: I've uploaded the converted model so you can skip the prepare step (would be the ./scratch folder in this project) to https://huggingface.co/g023/LFM2.5-8B-A1B-Special/ (copy scratch folder into this project root)**\n\nA self-contained CUDA inference engine for `LiquidAI/LFM2.5-8B-A1B` (hybrid conv + GQA-attention\nMoE, 8.5B params, 1B active) targeting a single RTX 3060 (12 GB). No Python, no frameworks at\nruntime: a single `.cu` engine + a header-only byte-level BPE tokenizer. Offline preprocessing\n(download + quantize) uses Python.\n\n## What works\nEnd-to-end text -\u003e coherent text. Example:\n\n```\n$ ./build/engine --prompt \"What is the capital of France? Answer in one sentence.\" --n 400\n\u003cthink\u003e\nThe user asks: \"What is the capital of France? Answer in one sentence.\" ... The answer:\n\"The capital of France is Paris.\" ...\n\u003c/think\u003e\nThe capital of France is Paris.\u003c|im_end|\u003e\n[engine] generated 83 tokens in 0.75s (110.9 tok/s)\n```\n\n~105 tok/s short-context decode on the RTX 3060, and ~100 tok/s sustained at long context\n(Flash-Decoding, see below): ~7x faster than the naive single-warp decode kernel which collapsed\nto ~14 tok/s at 2k tokens. Throughput is now flat across context length.\n\n## Flash-Decoding (long-context speed)\nSingle-token decode attention splits the KV key range across NSPLIT=16 blocks per head\n(`attn_decode_split` -\u003e `attn_decode_combine`) instead of the original one-block/one-warp scan, so\nthe whole GPU is used and per-token cost no longer grows with context. Numerically identical to the\nprefill kernel (online softmax, fp32 accumulate); coherence unchanged. Per-token floor was also\ntrimmed: GPU argmax over logits (1-int copy vs 512KB D2H), expert_bias cached host-side once, and a\npersistent embed-id buffer (no per-token malloc). See kb/03_status.md.\n\n## Architecture implemented (see kb/01_architecture.md)\n- 24 layers: 18 LFM2 short-conv (depthwise causal conv k=3, gated) + 6 GQA attention (32 Q / 8 KV\n  heads, head_dim 64, QK-RMSNorm, RoPE theta 5e6).\n- FFN: 2 dense SwiGLU layers + 22 MoE layers (32 experts, top-4, sigmoid router + expert bias,\n  norm_topk_prob).\n- RMSNorm (eps 1e-5), tied embeddings / lm_head (fp16), vocab 128000.\n\n## Precision (see kb/02_decisions.md)\n- Dense INT4 group-wise (G=128) weights for all big matmuls; fp16 embed/lm_head, router gate,\n  norms, conv kernels, expert bias. fp32 accumulation.\n- Weights ~4.76 GB -\u003e fits the 12 GB card with room for activations + KV.\n- NOTE: the blueprint's 2:4 structured sparsity is intentionally OMITTED on the generation path\n  (naive 2:4 pruning of a pretrained model destroys coherence). Dense INT4 preserves coherence.\n\n## Intelligent KV cache + fused FlashAttention (task 7, coherence-verified)\n- KV cache is **FP8 E4M3**, group-wise quantized: 64 tokens/group along the sequence share one\n  fp16 scale per kv-head; the in-progress (partial) group is kept fp16 until full (\"tail\"), then\n  committed to FP8. Decode/attention read FP8 + dequantize on the fly.\n- **Fused FlashAttention** (`attention_fused`): single online-softmax pass that reads the FP8\n  cache + fp16 tail (branch-free committed/tail segments), fp32 accumulate. Numerically equivalent\n  to the prior fp16 attention; greedy output matches the dense-KV baseline for 57/60 tokens on the\n  France prompt and stays fully coherent (reaches \"Paris\", terminates at EOS). ~100-105 tok/s.\n- **H2O eviction** (`--kv-budget N [--kv-window W]`): a per-token-scaled FP8 circular buffer of N\n  slots. Always keep the last W tokens (local window); among older tokens keep the highest\n  cumulative attention (`attn_sum`, accumulated inside the attention kernel); evict the min-`attn_sum`\n  slot. Coherent even when most of the context is evicted (e.g. budget=48 while context grows to\n  82; budget=32 over 250 generated tokens). Off by default (no flag) so the default path is lossless.\n- **Sparse Tensor-Core GEMM** (`src/mma_sp_test.cu`, `build/mma_sp_test`): the blueprint's\n  `mma.sp.sync.aligned.m16n8k32` 2:4 sparse INT4 GEMM, validated against a CPU reference (instruction\n  decode + metadata/thread mapping exact; tiled INT4 GEMM within fp16 rounding). Kept OFF the\n  generation path: 2:4 on pretrained weights needs sparsity-aware retraining to stay coherent\n  (excluded by the goal), so this is a correctness harness for the kernel, not part of inference.\n\n## Build / run\n```\n./prepare.sh         # one-time: download + quantize + export tokenizer (needs conda env unsloth_env)\n./build.sh           # compile -\u003e build/engine  (needs nvcc, sm_86)\n./build/engine --prompt \"your question\" --n 200\n```\nFlags: `--prompt` (chat-wrapped) | `--raw` (no template) | `--ids file.i32`; `--n` max new tokens;\n`--no-stream`; `--kv-budget N` (enable H2O eviction, N KV slots) `--kv-window W` (local window);\n`--dbgdir dir` (dump per-layer hidden states); `--dump f` (dump final_normed).\n`./build/mma_sp_test` validates the sparse Tensor-Core GEMM.\n\n## Validation\n`tools/build_oracle.py` runs transformers (fp32, CPU) for a fixed prompt and dumps reference\nper-layer activations, logits, and greedy ids. The engine reproduces the oracle's greedy tokens\n(first ~12 exact; later divergence is expected INT4 quant noise) and stays coherent. The C++\ntokenizer matches HF `encode`/`decode` exactly on chat templates, code, numbers, and whitespace.\n\n## Files\n- `src/engine.cu`     - kernels (INT4 GEMV, RMSNorm, RoPE, FP8 KV + fused FlashAttention + H2O,\n                        conv, MoE) + host orchestration\n- `src/mma_sp_test.cu`- standalone validation of the 2:4 sparse-INT4 mma.sp Tensor-Core GEMM\n- `src/tokenizer.h`   - GPT-2 byte-level BPE (encode + decode), header-only\n- `tools/`            - offline: export_weights, make_index, export_tokenizer, build_oracle\n- `kb/`               - architecture, decisions, status\n\n## Status of task 7 (done) - see kb/03_status.md\n- FP8 E4M3 KV cache, fused FlashAttention, H2O eviction: implemented on the generation path,\n  coherence-verified (above). mma.sp sparse Tensor-Core GEMM: validated as a standalone kernel,\n  kept off the generation path (2:4 on pretrained weights needs retraining to stay coherent).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fg023%2Fcuda_inf","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fg023%2Fcuda_inf","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fg023%2Fcuda_inf/lists"}