{"id":50970107,"url":"https://github.com/manishklach/gpu-resident-inference-lab","last_synced_at":"2026-06-19T01:02:56.785Z","repository":{"id":363875727,"uuid":"1265119227","full_name":"manishklach/gpu-resident-inference-lab","owner":"manishklach","description":"Research lab for GPU-resident LLM inference loops: persistent kernels, sparse KV selection, tiered residency, speculative decode, and trace-driven scheduling.","archived":false,"fork":false,"pushed_at":"2026-06-13T14:12:59.000Z","size":419,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-13T15:27:19.683Z","etag":null,"topics":["cuda","gpu-systems","kv-cache","llm-inference","mega-kernel","model-systems","persistent-kernel","runtime","speculative-decoding"],"latest_commit_sha":null,"homepage":"https://manishklach.github.io/gpu-resident-inference-lab/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/manishklach.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":"docs/ROADMAP.md","authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-06-10T13:35:07.000Z","updated_at":"2026-06-13T14:13:02.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/manishklach/gpu-resident-inference-lab","commit_stats":null,"previous_names":["manishklach/xl-persistent-kernel","manishklach/gpu-resident-inference-lab"],"tags_count":5,"template":false,"template_full_name":null,"purl":"pkg:github/manishklach/gpu-resident-
inference-lab","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/manishklach%2Fgpu-resident-inference-lab","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/manishklach%2Fgpu-resident-inference-lab/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/manishklach%2Fgpu-resident-inference-lab/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/manishklach%2Fgpu-resident-inference-lab/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/manishklach","download_url":"https://codeload.github.com/manishklach/gpu-resident-inference-lab/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/manishklach%2Fgpu-resident-inference-lab/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34513029,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-18T02:00:06.871Z","response_time":128,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cuda","gpu-systems","kv-cache","llm-inference","mega-kernel","model-systems","persistent-kernel","runtime","speculative-decoding"],"created_at":"2026-06-19T01:02:51.686Z","updated_at":"2026-06-19T01:02:56.776Z","avatar_url":"https://github.com/manishklach.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[]
,"readme":"# GPU Resident Inference Lab\n\nResearch lab for GPU-resident LLM inference loops: persistent kernels, sparse KV selection, tiered residency, speculative decode, and trace-driven scheduling.\n\n[![CI](https://github.com/manishklach/gpu-resident-inference-lab/actions/workflows/ci.yml/badge.svg)](https://github.com/manishklach/gpu-resident-inference-lab/actions/workflows/ci.yml)\n[![Python 3.10+](https://img.shields.io/badge/python-3.10%2B-blue)](https://www.python.org/)\n[![License: Research](https://img.shields.io/badge/license-Research%20Use-yellow)](LICENSE)\n[![Blog](https://img.shields.io/badge/GitHub%20Pages-blog-green)](https://manishklach.github.io/gpu-resident-inference-lab/)\n\nA research scaffold for future LLM inference loops where decode control flow,\nKV movement, speculative verification, and request scheduling stay resident\non the GPU as much as possible.\n\nThe project explores five interacting ideas:\n\n1. Persistent GPU-resident execution to reduce host launch/sync overhead\n2. Sparse KV block selection to reduce memory touched per decode step\n3. Speculative/token-parallel decode to create more useful work per loop\n4. Tiered KV residency across HBM, DRAM, and SSD-like tiers\n5. Trace-driven admission and scheduling for multi-request serving\n\nThis is not a production LLM runtime. It is a control-flow, memory-scheduling,\nand CUDA-staging research platform.\n\nQuick stub demo:\n\n```bash\npython tools/reasoning_metrics_stub.py\n```\n\nThe core thesis is that once inference becomes quantized, sparse, and latency-sensitive, the bottleneck shifts from raw compute to orchestration and data movement. 
The next runtime layer should keep more of the decode/refine/verify/KV-update loop resident on GPU, while using sparse KV selection and token/block parallelism to ensure the persistent loop has enough useful work to execute.\n\n## Why This Repo Exists\n\nTraditional autoregressive decode is a skinny loop:\n\n```text\nCPU launch -\u003e GPU decode -\u003e CPU sync -\u003e CPU launch -\u003e GPU verify -\u003e CPU sync\n```\n\nThat shape creates orchestration gaps. Even if individual kernels are efficient, the overall serving loop can underutilize the GPU when each step is too narrow and too host-driven.\n\nThis repo studies a wider GPU-resident loop instead:\n\n```text\nsubmit once\n   |\n   v\nGPU resident loop:\n  sparse KV select\n  -\u003e draft / token-block decode\n  -\u003e expert route\n  -\u003e attention / verify\n  -\u003e commit accepted tokens\n  -\u003e KV update\n  -\u003e schedule next block\n```\n\nThe point is not to overclaim throughput. The point is to make orchestration, residency, and memory-movement bottlenecks visible and to prototype how a future GPU-resident runtime might be structured.\n\n## Decode Loop Shapes\n\nDiagram 1: CPU-driven decode today\n\n```text\nCPU\n | launch decode\n v\nGPU: decode one token/block\n |\nCPU sync / schedule / launch again\n |\n v\nGPU: verify / update\n |\nrepeat\n```\n\nDiagram 2: GPU-resident loop thesis\n\n```text\nCPU submits work once\n |\n v\nGPU persistent loop:\n  [select KV blocks]\n  [draft token block]\n  [route experts]\n  [attention/verify]\n  [commit accepted tokens]\n  [update KV/state]\n  [prefetch next block]\n |\n v\nCPU receives coarse-grained completions\n```\n\n## Persistent Kernels Are Not Enough\n\nPersistent kernel alone:\n- reduces launch/sync overhead\n\nPersistent kernel + token/block parallelism:\n- creates enough useful resident work\n\nPersistent kernel + sparse KV:\n- reduces memory touched per iteration\n\nPersistent kernel + tiered residency:\n- controls where KV lives under 
pressure\n\nPersistent kernel + trace-driven admission:\n- decides which requests/blocks deserve GPU residency\n\nA persistent kernel only removes orchestration gaps. It does not magically make autoregressive decoding parallel. The decode loop becomes interesting when persistence is combined with token/block parallelism, sparse KV selection, and resident scheduling.\n\n## Modern Inference Stack\n\nThis repo studies the runtime/kernel side of a broader inference stack:\n\n- FP4 / NVFP4-style quantization: reduce weight bandwidth\n- MoE sparsity: reduce active parameters per token\n- SWA / local attention / sparse KV: bound KV and context movement\n- MTP / speculative / block decoding: make decode wider than one token at a time\n- Persistent GPU-resident mega-kernels: keep the hot loop on device\n- Tiered KV residency: decide what stays in HBM, what spills, and what is prefetched\n\nThe repo is mostly focused on the last three layers: token/block parallel decode, GPU-resident execution, and KV residency/scheduling.\n\n## What Runs Today vs Future Work\n\n| Area | Today | Future |\n|---|---|---|\n| Persistent loop | CUDA scaffold / control-flow prototype | real fused decode loop |\n| Transformer math | deterministic placeholder math | attention/projection/sampling kernels |\n| Sparse KV | metadata/top-k scaffold | real sparse KV gather |\n| Tiered residency | planning model / simulator | async HBM/DRAM/SSD movement integration |\n| Speculative decode | block workflow scaffold | real draft/verify model path |\n| Metrics | launch/sync/memory estimates | real TTFT/ITL/tok/s under load |\n| Scheduling | trace-driven admission ideas | multi-request GPU-resident scheduler |\n\n## Research Themes\n\n### Persistent GPU-Resident Execution\n\nInstead of launching many short-lived kernels from the CPU, the project explores keeping the decode/verify/KV-update loop resident on the device.\n\n### Sparse KV Selection\n\nInstead of touching all KV blocks, a runtime can select a 
smaller relevant subset. In this repo, that path is deterministic and lightweight by design. It is not MiniMax MSA or production sparse attention.\n\n### Speculative / Token-Parallel Decode\n\nSpeculative and block-style decode creates more useful work per resident iteration. This is what makes a persistent loop more valuable than a one-token-at-a-time device loop.\n\n### Tiered KV Residency\n\nThe repo models conceptual HBM, DRAM, and SSD-like tiers, plus promotion, demotion, staging, and pressure handling. These are residency and scheduling scaffolds, not real migration engines.\n\n### Trace-Driven Scheduling\n\nThe repo also models arrival, admission, active-set limits, and completion ordering so multi-request serving behavior can be studied as a control-flow problem, not just a single-request kernel problem.\n\n## What This Repo Is\n\n- A research scaffold for GPU-resident inference control flow\n- A way to study persistent execution, sparse KV selection, and token/block parallel decode\n- A place to prototype scheduling ideas before integrating with real serving stacks\n\n## Non-goals\n\nThis repo is not:\n- a production inference server\n- a replacement for vLLM, SGLang, TensorRT-LLM, or FlashAttention\n- a complete transformer implementation today\n- a benchmark claiming state-of-the-art throughput\n- a hardware proposal requiring a new SRAM chip\n\n## How to Evaluate This Repo\n\nThis repo should be evaluated on whether it makes the control-flow bottlenecks visible, not on whether it currently serves a real frontier model.\n\nGood questions:\n- How many CPU launches are removed per token/block?\n- How much KV traffic is avoided by sparse selection?\n- How many accepted tokens are produced per verify 
step?\n- How much useful work happens inside one resident loop?\n- Where do orchestration gaps still appear?\n- Which parts would need to become real CUDA kernels next?\n\n## Implemented Today\n\n- Persistent execution scaffold in both Python simulation and CUDA staging\n- Sparse KV block selection path with deterministic top-k selection\n- DMA-aware KV movement planning over sparse-selected pages\n- Tier-aware KV staging plan that orders selected pages for resident decode\n- KV pressure and draft-first eviction scaffold for resident memory reclamation\n- Hierarchical KV tier rebalance across HBM, DRAM, and SSD budgets\n- Trace-driven request admission and completion replay on device queues\n- Speculative/token-parallel workflow scaffolding\n- KV lifecycle tracking across committed, draft, selected, and released states\n- Benchmark harness for control-flow and memory-accounting experiments\n- CUDA resident-loop scaffold plus host-launched baseline comparison\n- Tests covering runtime behavior, sparse KV selection, benchmark schema, and KV lifecycle rules\n\n## Metrics\n\nThe repo currently emits some metrics directly and models others as future benchmark surfaces.\n\n| Metric | Status | Meaning |\n|---|---|---|\n| `tokens_per_resident_loop` | Implemented | Useful committed tokens per resident-loop iteration |\n| `kv_blocks_total` | Implemented | Total logical KV blocks considered |\n| `kv_blocks_selected` | Implemented | KV blocks selected for the sparse path |\n| `kv_sparsity_ratio` | Implemented | Fraction of blocks not touched by sparse selection |\n| `estimated_kv_bytes_read` | Implemented | Approximate KV bytes read under selected-block access |\n| `estimated_kv_bytes_saved` | Implemented | Approximate KV bytes not read because of sparsity |\n| `accepted_tokens` | Partially implemented | Accepted speculative tokens tracked in runtime traces/results |\n| `rejected_tokens` | Implemented in block workflow benchmarks | Rejected speculative tail tokens |\n| `trace 
queue metrics` | Implemented | Admission, completion, queue depth, and active-set watermarks |\n| `TTFT / ITL / tok/s` | Future real-model mode | Real serving metrics once placeholder math is replaced |\n\n## CUDA Staging Layer\n\nThe `cuda/` directory contains the resident-loop scaffold and the host-launched baseline.\n\n- `src/xl_persistent_megakernel.cu` models a fused GPU-resident loop\n- `src/baseline_host_decode_kernel.cu` models repeated host-launched decode steps\n- `include/stage_sparse_kv_select.cuh` models sparse KV block selection in the loop\n- `src/sparse_kv_gather_kernel.cu` models page scoring, top-k selection, and compacted sparse KV gather\n- `src/verify_commit_kernel.cu` models fused speculative verify plus accepted-prefix commit and rejected-page release\n- `src/tiered_kv_staging_kernel.cu`, `src/kv_pressure_eviction_kernel.cu`, `src/kv_tier_residency_kernel.cu`, and `src/trace_replay_admission_kernel.cu` model staging, pressure, tier rebalance, and trace-driven admission\n\nThese are research kernels and stage helpers for a persistent GPU-resident loop. 
They are not a production CUDA serving stack.\n\n## Benchmark Modes\n\nThe Python side provides control-flow simulations such as:\n\n- `serial_decode`\n- `speculative_decode`\n- `forced_rejection`\n- `kv_pressure`\n- `mega_kernel_sim`\n- `sparse_kv_megakernel`\n- `block_speculative`\n- `block_speculative_persistent_sim`\n\nThe CUDA host launcher exposes standalone research-kernel modes for:\n\n- resident scheduler ordering\n- sparse KV gather and score\n- KV prefetch planning\n- fused verify plus commit\n- DMA-aware KV movement planning\n- tiered KV staging\n- KV pressure eviction\n- KV tier residency rebalance\n- trace replay admission\n- resident sparse decode pipeline\n\n## Repository Structure\n\n```text\nsrc/megakernel_lab/\n    config.py             - Runtime and benchmark configuration\n    state.py              - Request, trace, and result state objects\n    runtime.py            - Persistent runtime and worker model\n    kv_cache.py           - Paged KV planner, pinning, eviction, accounting\n    sparse_kv.py          - Sparse KV top-k selection scaffold\n    spec_decode.py        - Draft/verify control-flow logic\n    block_runtime.py      - Block speculative runtime model\n    block_spec_decode.py  - Block drafting scaffold\n    bench.py              - Benchmark harness and metrics\n    demo.py               - Runnable runtime demo\n\ncuda/\n    include/              - Device-side stage helpers and metadata structs\n    src/                  - Resident-loop scaffold and host baseline\n    examples/             - Conceptual CUDA sketches\n\ndocs/\n    ARCHITECTURE.md       - Core concepts and design intent\n    CUDA_STAGING.md       - CUDA staging notes\n    ROADMAP.md            - Development roadmap\n    BLOG.md               - Draft long-form framing\n    REASONING_PIPELINE.md - Decode-to-reasoning north-star framing\n    REASONING_METRICS_STUB.md - Metric vocabulary for future branch verification\n\ntools/\n    reasoning_metrics_stub.py - A toy 
simulator for future verified-decisions/sec metrics. It does not run a model and should not be interpreted as a real reasoning benchmark.\n```\n\n## Quick Start\n\n```bash\npip install -e \".[dev]\"\n\npython -m megakernel_lab.demo\npython -m pytest tests/ -v\npython -c \"from megakernel_lab.bench import BenchmarkRunner; print(BenchmarkRunner().run())\"\n\nmake help\nmake demo\nmake test\nmake bench\n```\n\nIf CUDA is available:\n\n```bash\nmake cuda-smoke\nmake cuda-bench\nmake cuda-bench-large\nmake cuda-research-bench\n```\n\n## Suggested GitHub Description\n\nResearch lab for GPU-resident LLM inference loops: persistent kernels, sparse KV selection, tiered residency, speculative decode, and trace-driven scheduling.\n\n## Suggested GitHub Topics\n\n- `gpu-inference`\n- `llm-inference`\n- `persistent-kernels`\n- `cuda`\n- `kv-cache`\n- `speculative-decoding`\n- `multi-token-prediction`\n- `sparse-kv`\n- `gpu-runtime`\n- `inference-systems`\n\n## Further Reading\n\n- [Project blog](https://manishklach.github.io/gpu-resident-inference-lab/)\n- [Adaptive Speculative Block Sizing (ASBS) blog, historical naming](https://manishklach.github.io/writings/adaptive-speculative-block-sizing-xl-persistent-kernel.html)\n- [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md)\n- [docs/ROADMAP.md](docs/ROADMAP.md)\n- [docs/BLOG.md](docs/BLOG.md)\n- [From GPU-Resident Decode to GPU-Resident Reasoning](docs/REASONING_PIPELINE.md)\n- [Reasoning Metrics Stub](docs/REASONING_METRICS_STUB.md)\n- [Speculative Reasoning and Multi-Agent Orchestration](docs/SPECULATIVE_REASONING_SYSTEM.md)\n- [Decisions per Second Research Agenda](docs/DECISIONS_PER_SECOND_AGENDA.md)\n\n## Where This Is Going: From Tokens/sec to Verified Decisions/sec\n\nThe near-term focus of this repo is GPU-resident inference control flow:\npersistent loops, sparse KV selection, token/block parallel decode, tiered\nresidency, and trace-driven scheduling.\n\nThe longer-term direction is broader, but still 
future-facing:\n\n\u003e not just faster tokens/sec, but faster verified decisions/sec.\n\nFuture inference systems will likely combine:\n\n- continuous batching for many concurrent agents and reasoning branches\n- speculative decoding evolving into speculative reasoning\n- adaptive precision routing across FP4, FP8, and BF16\n- memory-aware model architecture across HBM, L2, shared memory, registers, and KV tiers\n- GPU-resident persistent loops that fuse draft, verify, commit, KV update, and scheduling\n\nIn this framing, the winning system is not only the fastest token generator.\n\nIt is the fastest correct reasoner.\n\nThis repo does not implement that full system today. It starts with the lower-level\nruntime pieces needed to study that direction: resident execution, sparse memory\nselection, block-level decode structure, and scheduling visibility.\n\n### Scope guardrail\n\nToday, this repo is not a reasoning engine.\n\nIt does not yet implement:\n- real multi-agent orchestration\n- real verifier models\n- real tool-use loops\n- real retrieval-integrated generation\n- real symbolic checking\n- real correctness scoring\n\nThose are future research directions.\n\nThe current repo focuses on the systems substrate:\nGPU-resident control flow, memory selection, token/block decode structure,\nand orchestration-gap measurement.\n\n## License\n\nResearch use only. See [LICENSE](LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmanishklach%2Fgpu-resident-inference-lab","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmanishklach%2Fgpu-resident-inference-lab","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmanishklach%2Fgpu-resident-inference-lab/lists"}