{"id":50708325,"url":"https://github.com/manishklach/gb300-rl-runtime","last_synced_at":"2026-06-09T13:30:38.986Z","repository":{"id":361351593,"uuid":"1254149572","full_name":"manishklach/gb300-rl-runtime","owner":"manishklach","description":"Close-to-metal C/CUDA lab for RL inference fast paths: persistent GPU workers, hugepage KV arenas, cacheline-aware command rings, and async reward handoff. Goal: remove page faults, malloc/free, scheduler wakeups, CPU round-trips, and KV migration from the per-token path.","archived":false,"fork":false,"pushed_at":"2026-05-30T08:36:23.000Z","size":31,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2026-05-30T09:21:10.966Z","etag":null,"topics":["ai-infrastructure","close-to-metal","cuda","gb300","gpu-inference","hpc","lock-free","nvlink","reinforcement-learning","spsc-queue"],"latest_commit_sha":null,"homepage":null,"language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/manishklach.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-30T07:43:46.000Z","updated_at":"2026-05-30T08:36:27.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/manishklach/gb300-rl-runtime","commit_stats":null,"previous_names":["manishklach/gb300-rl-runtime"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/manishklach/gb300-rl-runtime","repository_url":"https://repos.e
cosyste.ms/api/v1/hosts/GitHub/repositories/manishklach%2Fgb300-rl-runtime","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/manishklach%2Fgb300-rl-runtime/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/manishklach%2Fgb300-rl-runtime/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/manishklach%2Fgb300-rl-runtime/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/manishklach","download_url":"https://codeload.github.com/manishklach/gb300-rl-runtime/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/manishklach%2Fgb300-rl-runtime/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34110009,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-09T02:00:06.510Z","response_time":63,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-infrastructure","close-to-metal","cuda","gb300","gpu-inference","hpc","lock-free","nvlink","reinforcement-learning","spsc-queue"],"created_at":"2026-06-09T13:30:37.211Z","updated_at":"2026-06-09T13:30:38.978Z","avatar_url":"https://github.com/manishklach.png","language":"C","funding_links":[],"categories":[],"sub_categories":[],"readme":"# GB300 RL Inference Runtime\n\nA close-to-metal C/CUDA reference runtime for reinforcement 
learning\ninference at GB300 NVL72 scale.  No page faults, no `malloc`/`free`,\nno **per-token** CUDA kernel launches, no CPU scheduler wakeups in the\nper-token hot path.  Persistent GPU workers are launched once at init.\n\n`gb300-rl-runtime` is a close-to-metal C/CUDA + RTL lab for RL\ninference control planes.\n\n## Why This Matters\n\nThis repo studies the control plane of RL inference: how rollout work\nis described, submitted, backpressured, progressed, and completed\nwithout per-token CPU micromanagement.\n\nIt models the path from host runtime descriptors to hardware-style\nqueues:\n\n```text\nC submit API\n  ↓\nhardware descriptor\n  ↓\ncommand ring + doorbell\n  ↓\npersistent worker / RTL FSM\n  ↓\ncompletion ring\n  ↓\nhost observes completion\n```\n\nThe goal is to study how RL inference work should be submitted,\nprogressed, backpressured, and completed without per-token CPU\nmicromanagement.\n\n## Repo Stack\n\n```text\n┌──────────────────────┐\n│ C/CUDA Runtime       │\n│ infer_submit_decode()│\n└──────────┬───────────┘\n           │ hw_desc_t\n           ▼\n┌──────────────────────┐\n│ Hardware Fast Path   │\n│ ring + MMIO doorbell │\n└──────────┬───────────┘\n           │ descriptor contract\n           ▼\n┌──────────────────────┐\n│ RTL Descriptor Engine│\n│ desc_ring + FSM      │\n└──────────┬───────────┘\n           │ completion_t\n           ▼\n┌──────────────────────┐\n│ Host Completion Path │\n└──────────────────────┘\n```\n\n## v0.2.1: Correctness and Benchmark Honesty\n\nThis release is a stabilization pass focused on queue correctness,\nworker shutdown safety, overflow handling, and more honest benchmark\nreporting.\n\n- SPSC command-ring accounting now uses explicit producer-tail and\n  consumer-head ownership.\n- Persistent workers now use `__shfl_sync` for warp value broadcast and\n  reserve `__syncwarp` for synchronization only.\n- Shutdown paths wait for persistent kernels to exit before freeing\n  host/device resources.\n- Completion and done rings now detect full conditions, count overflow\n  attempts, and apply backpressure instead of silently overwriting data.\n- Hot-path guard output now reports `wrapper-clean` rather than a global\n  `CLEAN` claim.\n\n## Portable, not GB300-only\n\nThe code targets GB300 because that's the interesting scale, but it\nruns CUDA kernels on **any GPU with compute capability 8.0+**\n(Ampere or newer).\n\nThe full host runtime is still **Linux/POSIX-oriented** today.  It uses\nAPIs such as `mmap`, `MAP_HUGETLB`, `mbind`, `clock_gettime`, `getopt`,\nand POSIX-style filesystem/device conventions in a few tests and\nbenchmarks.  `make smoke` provides CPU-only queue correctness checks on\nsupported Linux environments without requiring a GPU.\n\nWhat you'd change for a non-GB300 system:\n\n| GB300 assumption | Portable alternative |\n|---|---|\n| NVLink-C2C coherent command ring | `cudaHostAlloc` + `cudaHostGetDevicePointer` |\n| Grace CPU NUMA topology | `numa.c` now guards with `numa_available()` — skips `mbind` on non-NUMA systems |\n| Grace ARM + NVSwitch | Works on any x86 + any NVIDIA GPU |\n| `-arch=sm_90a` (Hopper) | Makefile now uses multi-arch gencode: sm_80 (Ampere) through sm_90a (Hopper) |\n\nEverything else — atomics, hugepages, `cp.async`, persistent workers,\non-device sampling — is standard CUDA C, but the current portability\nstory is best described as \"modern NVIDIA GPUs on Linux\" rather than\n\"all host OSes.\"\n\n## Architecture\n\n```\n ┌────────────────────────────────────────────────────────────────┐\n │                     Host (CPU)                                 │\n │  ┌─────────────────────────────────────────────────────────┐   │\n │  │  v0.2 path (CPU per-token):                              │   │\n │  │    rollout_slab ──► Pipeline ──► CommandRing ──► GPU     │   │\n │  │    GPU ──► CompletionRing ──► CPU (one trip per token)   │   │\n │  └─────────────────────────────────────────────────────────┘   │\n │         
                                                       │\n │  ┌─────────────────────────────────────────────────────────┐   │\n │  │  v0.3 path (GPU-resident):                               │   │\n │  │    CPU ──► RequestRing ──► GPU (one request per rollout) │   │\n │  │    GPU ──► DoneRing ──► CPU (one done per rollout)        │   │\n │  └─────────────────────────────────────────────────────────┘   │\n │                    │                                           │\n │                    ▼                                           │\n │           ┌────────────────┐                                   │\n │           │  CommandRing   │ ──► RequestRing (v0.3)            │\n │           │  CompletionRing│ ◄── DoneRing (v0.3)               │\n │           └───────┬────────┘                                   │\n │                   │  coherent / pinned memory                   │\n ├───────────────────┼───────────────────────────────────────────┤\n │              GPU  │                                            │\n │                   ▼                                            │\n │  ┌──────────────────────────────────────────────────────┐      │\n │  │  Persistent Workers (1 warp / SM)                    │      │\n │  │                                                      │      │\n │  │  decode_worker (v0.2):                               │      │\n │  │    poll ring ─► read desc ─► decode ─► comp_ring     │      │\n │  │                                                      │      │\n │  │  rollout_worker (v0.3):                              │      │\n │  │    poll request ─► alloc slot ─► decode loop         │      │\n │  │      ─► sample ─► KV update ─► done_ring             │      │\n │  └──────────────────────────────────────────────────────┘      │\n │                   │                                            │\n │              ┌────┴────┐                                       │\n │              │         │                                       │\n │    
    ┌─────▼──┐  ┌───▼──────┐                               │\n │        │  KV    │  │  Reward  │                               │\n │        │ Arena  │  │  Ring    │                               │\n │        └────────┘  └──────────┘                               │\n └──────────────────────────────────────────────────────────────┘\n```\n\n### Hot-Path Anatomy\n\n```\n  CPU (producer)                          GPU persistent worker\n  ═══════════════                          ══════════════════════\n  ring_acquire()                           ┌─ poll producer tail\n       │                                   │   (acquire-load)\n       ▼                                   ▼\n  write descriptor ──store──▶  ring  ──load──▶  read descriptor\n       │                  slot  │                 │\n       ▼                       │                 ▼\n  ring_commit()                │            read KV arena\n  (release-store tail)         │            (hugepage, no TLB miss)\n       │                       │                 │\n       ▼                       ▼                 ▼\n  ┌─────────────────────────────────────┐   decode attention\n  │  key invariant:                     │   (cp.async prefetch)\n  │  no syscall, no page fault,         │        │\n  │  no malloc/free, no scheduler       │        ▼\n  │  wakeup in the entire hot path      │   sample token\n  └─────────────────────────────────────┘   (GPU-resident)\n                                                  │\n                                                  ▼\n  CPU polls completion ◀─── store ──────────  comp_ring_push()\n  (acquire-load tail)                         (release-store tail)\n```\n\n## Components\n\n| Component | File | Description |\n|---|---|---|\n| Work Descriptor | `include/descriptor.h` | 28-byte packed decode-step command |\n| SPSC Ring | `include/ring.h`, `src/ring.c` | Lock-free producer-consumer ring in coherent memory |\n| Completion Ring | `include/completion.h` | GPU→CPU result notification 
(mirror of command ring) |\n| KV Arena | `include/arena.h`, `src/arena.c` | Hugepage-backed slab allocator with O(1) acquire/release |\n| Prefetch Pipeline | `include/prefetch.h`, `cu/prefetch.cu` | `cp.async` software-pipelined KV block loader |\n| Sampling | `include/sample.h`, `cu/sample.cu` | GPU-resident top-k / top-p / temperature sampling |\n| Persistent Worker | `cu/worker.cu` | GPU SM decode loop — polls ring, loads KV, runs attention |\n| NUMA Helpers | `include/numa.h`, `src/numa.c` | `mbind`-based NUMA-local hugepage allocation |\n| Host Runtime | `src/main.c` | Init, dispatch loop, completion polling |\n| Rollout State Machine | `include/rollout.h`, `src/rollout.c` | CAS-based rollout lifecycle with valid transition table |\n| Rollout Pipeline | `include/pipeline.h`, `src/pipeline.c` | 6-queue RL pipeline (free/prefill/decode/reward/trajectory/done) |\n| Runtime Metrics | `include/metrics.h`, `src/metrics.c` | Cacheline-padded hot-path counters with formatted output |\n| Hot-Path Guard | `include/hotpath_guard.h`, `src/hotpath_guard.c` | Wrapper-based tracking for explicit malloc/cudaMalloc/page-fault hooks |\n| Copy-on-Write Prefix KV | `include/kv_prefix.h`, `src/kv_prefix.c`, `cu/kv_prefix.cu` | Shared prefix KV with per-rollout delta branches |\n| Reward Pipeline | `include/reward.h`, `src/reward.c`, `cu/reward.cu` | Mock reward/verifier ring with GPU scoring kernel |\n| Pipeline Benchmark | `bench/bench_pipeline.cu` | Full RL pipeline benchmark with hot-path guard verification |\n| Trace Pipeline Benchmark | `bench/bench_trace_pipeline.cu` | Pipeline benchmark with nanosecond tracing and latency percentiles |\n| Tracing | `include/trace.h`, `src/trace.c` | Ring-buffer trace entries with pair-latency report (p50/p90/p99) |\n| Scheduling Policies | `include/pipeline.h`, `src/pipeline.c` | FIFO / shortest-remaining / prefix-sharing scheduling |\n| Pipeline Backpressure | `include/pipeline.h`, `src/pipeline.c` | Credit-based flow control per pipeline 
stage |\n| COW Prefix KV Benchmark | `bench/bench_cow_prefix.cu` | Memory savings comparison: COW vs full-duplicate KV |\n| Control-Plane Tax Benchmark | `bench/bench_control_tax.cu` | Syscall vs polling vs persistent worker comparison |\n| MMIO Helpers | `include/mmio.h`, `src/mmio.c` | Host-side write barrier and MMIO-style doorbell helpers |\n| Hardware Descriptor | `include/hw_desc.h` | 64-byte fixed-format hardware-facing inference descriptor |\n| Hardware Ring | `include/hw_ring.h`, `src/hw_ring.c` | Cacheline-owned SPSC descriptor ring for command/done queues |\n| Hardware Submit API | `include/infer_submit.h`, `src/infer_submit.c` | Host inference submit path that pushes descriptors and rings a doorbell |\n| Hardware Worker Simulator | `include/hw_worker_sim.h`, `src/hw_worker_sim.c` | CPU-only device model that consumes command descriptors and posts completions |\n| Hardware Fastpath Test | `test/test_hw_ring.c` | CPU-only validation for the descriptor/doorbell/ring path |\n| Hardware Fastpath Benchmark | `bench/bench_hw_fastpath.c` | Doorbell batching and worker-sim throughput benchmark |\n| GPU Request Ring | `include/request.h` | SPSC request ring (CPU→GPU) + done ring (GPU→CPU) for GPU-resident scheduler |\n| GPU Rollout Scheduler | `cu/gpu_scheduler.cu` | Persistent GPU kernel that manages rollout lifecycle without CPU per-token involvement |\n| GPU Scheduler Benchmark | `bench/bench_gpu_scheduler.cu` | Benchmark for GPU-resident rollout scheduler |\n| GPU Scheduler Docs | `docs/gpu_scheduler.md` | Design doc for GPU-resident rollout progression |\n\n## Implementation Status: Real vs Stub\n\n| Component | Status | What it actually does |\n|-----------|--------|----------------------|\n| Command/Completion rings | **Real** | Lock-free SPSC with acquire/release atomics, cacheline-padded indices |\n| KV arena (hugepage) | **Real** | `mmap MAP_HUGETLB`, bitmap O(1) alloc, pre-faulted |\n| `cp.async` prefetch pipeline | **Real** | Triple-buffered async 
copy device code |\n| GPU sampling (xoshiro256**) | **Real** | Device-side PRNG, top-k/p, temperature scaling |\n| Persistent decode worker | **Real** | Per-SM warp, ring poll loop, `__nanosleep` yield |\n| NUMA binding | **Real** | `mbind(MPOL_BIND)` with `numa_available()` guard |\n| Rollout state machine | **Real** | CAS transitions, slab allocator, transition validation |\n| Pipeline rings | **Real** | 6 lock-free ID rings with acquire/release |\n| Hot-path guards | **Partial** | Counts explicit wrapper calls; useful for regressions, not a whole-process proof |\n| Tracing | **Real** | 1M-entry ring buffer, pair-latency matching, p50/p90/p99 |\n| Request/Done rings (v0.3) | **Real** | Host+device atomics, GPU resident slot management |\n| Fixed128 decode path | **Partial** | Real QK / softmax / V math for one fixed-shape path; the decode kernel now stages KV with all-lane `cp.async` participation and uses tiled online softmax accumulation to fuse score normalization with V accumulation, while the runtime routes descriptors through a tiny weighted model-state block plus query projection |\n| Pipeline windows | **Real** | Snapshot helpers expose queue occupancy, stage credit headroom, and suggested batch windows for decode/reward/trajectory stages; pipeline benches now use them to gate batched decode admission |\n| Descriptor windows | **Real** | Host-side decode batch helpers prepare and submit grouped descriptor windows with one ring commit instead of one commit per step |\n| Device-visible batch contract | **Partial** | Grouped descriptor windows now stamp explicit batch size and batch index onto each descriptor, and the worker uses that metadata to shape local prefetch state |\n| `cp.async` prefetch path | **Partial** | The prefetch layer now uses lane-striped 16-byte `cp.async` helpers and explicit commit/wait flow, but broader multistage overlap is still limited |\n| KV layout descriptor | **Scaffold** | Concrete fixed128 KV block math, alignment invariants, 
and offset helpers |\n| Attention decoder | **Partial** | Fixed128 real path exists; broader runtime/model coverage is still incomplete |\n| Hardware-facing inference fastpath | **Real** | 64-byte descriptors, cacheline-owned command/done rings, MMIO-style doorbell writes, and a CPU-only worker simulator for device-model validation |\n| Reward model | **Stub** | `reward_score_mock()` returns `(n \u0026 0xFF) / 255.0f` — no real scoring |\n\nBenchmarks measure **ring throughput, control-plane latency, and pipeline overhead**,\nnot FLOPs or model quality. The attention stub means token/s numbers reflect the\ncontrol-path speed, not actual decode performance.\n\n## What This Is Not\n\nThis is not a replacement for vLLM, TensorRT-LLM, SGLang, or JAX.\n\nThis is a **reference fast-path** showing how an RL inference runtime\ncould structure fixed KV ownership, CPU→GPU command rings, persistent\ndecode workers, hugepage-backed memory, cacheline-aware queues, and\nasync reward handoff.\n\nThe goal is to study the control-plane mechanics, not to outperform\nproduction inference stacks today.\n\n## Hardware-Facing Inference Fastpath\n\nThe repository now includes a second host-side fastpath that is shaped\nlike a real device queue rather than a CUDA-specific control surface.\n\n- each `hw_desc_t` is exactly 64 bytes: one cache line, fixed format,\n  no heap pointers\n- `hw_ring_t` separates consumer-owned `head` from producer-owned `tail`\n  on different cache lines and uses acquire/release atomics\n- `infer_submit_decode()` pushes a descriptor, release-stores the tail,\n  and rings an MMIO-style doorbell with `mmio_write32()`\n- `hw_worker_sim_run()` provides a CPU-only device model that consumes\n  command descriptors and posts done/reward-needed completions\n\nThis is a hardware-facing model, not direct access to NVIDIA internal\nGPU doorbells. 
The point is to make the queueing discipline, descriptor\nshape, and doorbell tradeoffs explicit and measurable.\n\n## RTL Descriptor Engine\n\nThe repo now also includes an RTL model of the hardware queue/control-plane\nprotocol. It mirrors the C hardware-facing fast path:\n\n```text\nC infer_submit_decode()\n    |\n    v\nhw_desc_t / command ring\n    |\n    v\nRTL desc_ring\n    |\n    v\nrollout_worker_fsm\n    |\n    v\ncompletion_ring\n    |\n    v\nhost observes completion\n```\n\nIt models:\n\n- fixed descriptors\n- descriptor ring\n- MMIO doorbell register\n- rollout worker FSM\n- completion ring\n- ready/valid backpressure\n\nIt does not model:\n\n- real transformer kernels\n- tensor cores\n- GB300 internals\n- NVIDIA hardware doorbells\n\n| Layer | Status |\n|---|---|\n| C/CUDA runtime fast path | implemented / experimental |\n| Hardware descriptor format | implemented |\n| RTL descriptor engine | implemented as control-plane model |\n| RTL worker FSM | fake decode / control-plane simulation |\n| RTL transformer compute | not implemented |\n| Verilator C co-sim | implemented |\n\n## C + Verilator Co-Simulation\n\nThe repo now includes a Verilator bridge that lets the host-side C++\nsimulation submit descriptors into the RTL descriptor engine and observe\ncompletions.\n\n```text\nC++ RtlRuntimeBridge\n  -\u003e Verilated rl_runtime_top\n  -\u003e RTL desc_ring\n  -\u003e RTL rollout_worker_fsm\n  -\u003e RTL completion_ring\n  -\u003e C++ completion polling\n```\n\nCommands:\n\n```bash\nmake verilate\nmake sim-rtl\nmake test-rtl-bridge\n```\n\nTarget passing output:\n\n```text\nRTL co-sim basic decode: PASS\nRTL co-sim reward boundary: PASS\nRTL co-sim completion backpressure: PASS\nRTL bridge tests: PASS\n```\n\n| Layer | Status |\n|---|---|\n| C/CUDA runtime | implemented / experimental |\n| Hardware descriptor path | implemented |\n| RTL descriptor engine | implemented |\n| C + Verilator bridge | implemented |\n| Real transformer compute in RTL | not 
implemented |\n| GB300 hardware validation | not implemented |\n\nThis co-simulation validates the descriptor/control-plane contract. It\ndoes not validate real GPU execution, GB300 hardware behavior, or\ntransformer math. The current bridge validates a shared logical\ndescriptor contract, not yet a byte-identical 64-byte wire format; see\n`docs/descriptor_contract.md`.\n\n## Status Honesty\n\n| Layer | Status |\n|---|---|\n| C descriptor ring | implemented |\n| C hardware fast path | implemented |\n| COW prefix KV | implemented / benchmarked |\n| RL pipeline | implemented / simulated |\n| RTL descriptor ring | implemented |\n| RTL worker FSM | fake decode / control-plane model |\n| RTL transformer compute | not implemented |\n| GB300 hardware validation | not implemented |\n| C + Verilator bridge | implemented |\n\n## Documentation\n\n| File | What it covers |\n|---|---|\n| `docs/architecture.md` | Full architecture map, hot path vs init path, component interactions |\n| `docs/hotpath.md` | Every operation classified as init vs hot path |\n| `docs/metrics.md` | Target metrics and benchmark commands (all benchmarks) |\n| `docs/rtl.md` | RTL control-plane scope, module map, ring logic, and roadmap |\n| `docs/verilator_bridge.md` | Verilator bridge API, signal mapping, tests, and limitations |\n| `docs/descriptor_contract.md` | Current C, RTL, and bridge descriptor/completion contract, plus the 64-byte unification roadmap |\n| `docs/tracing.md` | Trace event types, latency pairs, example output |\n| `docs/gpu_scheduler.md` | GPU-resident rollout scheduler design and comparison |\n| `docs/decode_microkernel.md` | Status and intent of the fixed-shape decode microkernel scaffold |\n| `docs/part3-metal-blog.md` | Deep dive on PTX, `cp.async`, memory ordering, host flush semantics, and the v0.4 roadmap |\n| `docs/v0.2.2-roadmap.md` | File-by-file plan for the first real hardware-close decode path |\n| `rtl/README.md` | How to run and reason about the RTL descriptor engine 
|\n| `docs/release-notes-v0.3-rtl.md` | RTL-focused release-note draft |\n\n## Build\n\nRequires Linux, CUDA 12.x+, and `libnuma-dev` for the full runtime.\n\n```bash\nmake               # build library + test bench + all benchmarks\nmake smoke         # CPU-only smoke tests for ring/overflow correctness\nmake test          # smoke tests + CUDA-backed unit tests where supported\nmake test-all      # software tests + RTL tests if iverilog is installed\nmake test-hw-ring  # CPU-only hardware fastpath tests\nmake rtl-test      # RTL descriptor-engine testbench\nmake verilate      # generate Verilator model for rl_runtime_top\nmake sim-rtl       # basic C++/RTL co-simulation run\nmake test-rtl-bridge # multi-case C++/RTL co-simulation tests\nmake bench         # benchmark: 1M tokens through ring+worker\nmake bench-pipeline # benchmark: full RL pipeline with rollouts, state machine, reward, hot-path guards\nmake bench-trace   # benchmark with nanosecond tracing + latency percentiles\nmake bench-cow          # COW prefix KV memory savings benchmark\nmake bench-tax          # control-plane tax comparison (syscall vs polling vs persistent worker)\nmake bench-hw-fastpath  # hardware-facing descriptor + doorbell batching benchmark\nmake bench-gpu-scheduler # GPU-resident rollout scheduler (zero CPU per-token work)\nmake bench-decode       # fixed128 decode microkernel scaffold benchmark\nmake bench-kv-layout    # KV block-layout scaffold and offset math check\nmake bench-prefetch     # isolated global-\u003eshared staging microbenchmark\nmake bench-all          # run all benchmarks\nmake ci-build           # CPU-only build path used by GitHub Actions\nmake ci-run             # CPU-only verification path used by GitHub Actions\nmake cuda-compile-check # compile CUDA translation units with nvcc, no GPU execution\nmake cuda-ptx-check     # emit PTX for core CUDA translation units, no GPU execution\nmake rtl-clean          # remove RTL simulator artifact\nmake clean-verilator    
# remove generated Verilator files\n```\n\n`cuda-compile-check` and `cuda-ptx-check` still require `nvcc`, but they\ndo not require a running NVIDIA GPU. They are useful when you want syntax,\ntemplate, host/device, and PTX-generation validation on a machine that has\nthe CUDA toolkit installed but no accelerator attached.\n\n## What Each Benchmark Proves\n\n| Benchmark | Command | What it proves |\n|-----------|---------|----------------|\n| `bench` | `make bench` | Ring + GPU worker baseline throughput (1M tokens through ring, no rollout logic) |\n| `bench-pipeline` | `make bench-pipeline` | End-to-end RL rollout flow: alloc → decode → reward → trajectory → done with wrapper-guard reporting and optional page-fault snapshots |\n| `bench-trace` | `make bench-trace` | Nanosecond latency breakdown — where pipeline time is spent (p50/p90/p99 for 8 latency pairs) |\n| `bench-cow` | `make bench-cow` | Memory saved by shared prefix KV vs full-duplicate per rollout |\n| `bench-tax` | `make bench-tax` | Control-plane overhead: eventfd syscall vs userspace polling vs persistent worker (this runtime) |\n| `bench-hw-fastpath` | `make bench-hw-fastpath` | 64-byte hardware descriptor submission, MMIO-style doorbell batching, and CPU-only worker-sim completion throughput |\n| `bench-gpu-scheduler` | `make bench-gpu-scheduler` | GPU-managed rollout lifecycle — zero CPU per-token work, CPU only sees request/done |\n| `bench-decode` | `make bench-decode` | Fixed128 real math path for one staged KV block, benchmarked separately from control-plane costs |\n| `bench-kv-layout` | `make bench-kv-layout` | KV block layout invariants, byte offsets, and fixed-shape memory math |\n| `bench-prefetch` | `make bench-prefetch` | Isolated global→shared KV staging cost, comparing baseline shared-memory copy against `cp.async` staging |\n\n## Benchmark Snapshot\n\n```\nHardware:\n  GPU:      H100 (or your GPU here)\n  CPU:      x86-64 / Grace ARM\n  OS:       Linux\n  CUDA:     12.x\n  Build:    
make bench-all\n\nNote: token throughput reflects the stub attention (mock completion write).\nReal attention kernels would add compute latency; the numbers here measure\nthe control path only.\n\nResults (run `make bench-all` on your hardware and replace placeholders with dated measurements):\n  Ring throughput:                \u003e 50 M ops/s\n  Full pipeline tokens/s:         measure on your hardware\n  Pipeline rollouts/s:            measure on your hardware\n  COW prefix KV memory saved:     \u003e 90% for 10K branches\n  Prefetch staging BW:            measure with make bench-prefetch\n  Control-plane tax:\n    syscall per step:             ~X ns (baseline)\n    userspace polling:            ~Y ns (fast, CPU-hungry)\n    persistent worker (this):     ~Z ns (fastest, no CPU tax)\n  Wrapper-tracked mallocs:        0 (wrapper-clean)\n  Post-init page faults:          inspect OS snapshot output\n  Per-token kernel launches:      0\n```\n\nRun on your hardware and open a PR to update this table.\n\n## Labs\n\nThe `lab/` directory contains six self-contained C experiments\nthat teach the close-to-metal concepts used in the runtime:\n\n| Lab | What it teaches | Runs on |\n|---|---|---|\n| `01_false_sharing` | Cache line contention — MESI protocol, padding | any Linux |\n| `02_spsc_ring` | Lock-free ring buffer from scratch — atomics, memory ordering | any Linux |\n| `03_hugepage_tlb` | 4K vs 2M page TLB miss comparison — why hugepages matter | Linux w/ hugepages |\n| `04_syscall_vs_poll` | eventfd wakeup vs shared-memory polling — syscall cost | any Linux |\n| `05_doorbell_mock` | Producer/consumer with doorbell — device queue model | any Linux |\n| `06_memory_ordering` | publication ordering, stale-descriptor windows, release/acquire | any Linux |\n\nEach lab is standalone — `cd lab/01_false_sharing \u0026\u0026 make run`.\n\n```bash\nmake labs      # build all labs\nmake lab-run   # run all labs sequentially\nmake lab-run-safe # run the CPU-safe labs used in 
CI\n```\n\n## Design Rules\n\n1. **Pre-fault everything** — no runtime page faults\n2. **No cudaMalloc after init** — static slab allocation\n3. **CPU stays out of the data path** — descriptors only\n4. **NUMA-local memory** — `mbind(MPOL_BIND)` on every allocation\n5. **Reward is GPU-resident** — no PCIe round-trips for scoring\n6. **NVLink-C2C for coordination** — coherent rings, no DMA\n7. **Rollouts are hardware-visible state machines** — CAS transitions through 6 pipeline stages, no Python/CPU per-token control\n8. **Hot-path wrappers catch explicit zero-alloc regressions** — useful guardrail, not a global proof\n9. **Copy-on-write prefix KV** — shared prompt KV across rollouts, only per-rollout deltas allocated\n10. **Credit-based backpressure** — every pipeline stage has a max occupancy; `pipeline_try_push` blocks well before ring-full\n11. **Multiple scheduling policies** — FIFO, shortest-remaining-first, and prefix-sharing policies for the decode queue\n12. **GPU-resident rollout progression** — v0.3 moves the state machine and pipeline to the GPU; CPU only submits requests and drains completions\n\n## License\n\nMIT\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmanishklach%2Fgb300-rl-runtime","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmanishklach%2Fgb300-rl-runtime","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmanishklach%2Fgb300-rl-runtime/lists"}