{"id":50752473,"url":"https://github.com/rightnow-ai/automegakernel","last_synced_at":"2026-06-15T06:01:21.075Z","repository":{"id":363359743,"uuid":"1263098427","full_name":"RightNow-AI/AutoMegaKernel","owner":"RightNow-AI","description":"An agent harness that compiles a model into one provably-correct, self-retargeting CUDA megakernel and self-tunes it past cuBLAS at batch-1 LLM decode.","archived":false,"fork":false,"pushed_at":"2026-06-08T16:11:07.000Z","size":1086,"stargazers_count":32,"open_issues_count":0,"forks_count":4,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-06-14T05:09:11.142Z","etag":null,"topics":["agent-harness","cuda","gpu","gpu-programming","kernel-fusion","llm-inference","machine-learning","megakernel","mlsys"],"latest_commit_sha":null,"homepage":"https://runinfra.ai/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/RightNow-AI.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":"AGENTS.md","dco":null,"cla":null}},"created_at":"2026-06-08T16:08:59.000Z","updated_at":"2026-06-14T03:04:30.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/RightNow-AI/AutoMegaKernel","commit_stats":null,"previous_names":["rightnow-ai/automegakernel"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/RightNow-AI/AutoMegaKernel","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RightNow-AI%2FAutoMegaKernel","tags
_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RightNow-AI%2FAutoMegaKernel/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RightNow-AI%2FAutoMegaKernel/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RightNow-AI%2FAutoMegaKernel/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/RightNow-AI","download_url":"https://codeload.github.com/RightNow-AI/AutoMegaKernel/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RightNow-AI%2FAutoMegaKernel/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34349927,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-15T02:00:07.085Z","response_time":63,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agent-harness","cuda","gpu","gpu-programming","kernel-fusion","llm-inference","machine-learning","megakernel","mlsys"],"created_at":"2026-06-11T02:05:48.935Z","updated_at":"2026-06-15T06:01:21.028Z","avatar_url":"https://github.com/RightNow-AI.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n  \u003cimg src=\"assets/automegakernel-social.png\" alt=\"AutoMegaKernel: a statically-checked agent harness for self-retargeting megakernel synthesis\" 
width=\"840\"\u003e\n\u003c/p\u003e\n\n# AutoMegaKernel (AMK)\n\n\u003e **A general agent harness for GPU megakernel synthesis.** A coding agent (Claude Code / Codex)\n\u003e drives AMK to compile a model into one **provably-correct**, self-retargeting megakernel, the\n\u003e whole forward pass fused into a single persistent launch, then self-tunes it, and it gets better\n\u003e every time it runs.\n\n**Repo:** https://github.com/RightNow-AI/AutoMegaKernel \u0026nbsp;·\u0026nbsp; **License:** MIT \u0026nbsp;·\u0026nbsp; the open research harness ([Enterprise / Forge](#enterprise))\n\n```bash\namk compile \u003cmodel\u003e --gpu \u003carch\u003e --regime single-stream\n# imports -\u003e lowers -\u003e validates (deadlock + race free) -\u003e verifies vs eager -\u003e builds the GPU\n# megakernel -\u003e measures it against the HBM roofline -\u003e emits a correct megakernel + report\n```\n\nAMK is the sibling of [AutoKernel](https://github.com/RightNow-AI/autokernel): AutoKernel\nauto-generates the single best *kernel*; AMK auto-generates the single best whole-model\n*megakernel*. It inherits AutoKernel's autoresearch loop (propose → fixed eval → keep/revert, for\nhours, unattended) and adds a new search axis: **the schedule**.\n\n\u003e **Coverage today:** the HuggingFace Llama family on CUDA (sm\\_75 to sm\\_120).\n\u003e **Roadmap (future work):** generalizing the importer and backends to more architectures,\n\u003e languages, and targets. The harness (validator, oracle, search loop) is not model-specific; the\n\u003e agent is what supplies that generality, and broadening it is the central direction of the work.\n\n---\n\n## Results: int8 beats cuBLAS across the inference fleet\n\nAMK's auto-tuned **int8** (W8A16, near-lossless) megakernel **beats CUDA-graphed cuBLAS bf16** at\nbatch-1 decode across NVIDIA's datacenter **inference-class** GPUs, found autonomously by AMK's own\nsearch and correctness-gated (argmax-exact). 
Ratio = cuBLAS / AMK, so **\u003e 1 means AMK is faster.**\n\n| GPU | Class | int8 vs cuBLAS bf16 (best) | Verdict |\n|---|---|---:|---|\n| **L4** (sm_89, 300 GB/s) | inference | **1.18× → 1.33×** (1.3B→4B) | ✅ wins, grows with size |\n| **L40S** (sm_89, 864 GB/s) | inference flagship | **1.25× → 1.27×** (4B→6.7B) | ✅ wins |\n| **A10G** (sm_86, 600 GB/s) | inference | **1.04× → 1.08×** (≥3.5B) | ✅ wins at scale |\n| **RTX 5090** (sm_120) | consumer (local) | **1.19× → 1.23×** | ✅ wins |\n| A100 (sm_80, 1.4 TB/s) | training | 0.79× → 0.55× (1.3B→13B) | ✗ trails (declines w/ size) |\n| H100 (sm_90, 3.1 TB/s) | training | 0.72× → 0.60× | ✗ trails |\n\nInference / datacenter GPUs measured on **Modal** (reproducible; file-backed in\n`paper/results/int8_scale_*.json`). RTX 5090 measured **locally** (`int8_search_multisize.json`;\nModal has no RTX 5090 silicon, so it is not in the Modal sweep).\n\n**What governs the win is the inference-class vs. training-class regime, not bandwidth ordering:**\nthe 864 GB/s L40S wins by *more* than the 600 GB/s A10G. AMK reads ~half the weight bytes (int8), and\nthat advantage has to overcome a *fixed* per-tile cross-SM sync; larger, GEMV-dominated models amortize\nthat fixed cost. The training-class A100/H100 never cross (the ratio *declines* with size, the\nfingerprint of the sync deficit; a `cp.async` int8-GEMV probe even regressed to 0.82× on A100,\nconfirming cross-SM sync, not load latency, is the binder). The win is batch-1, pos-0 / low-context.\nFull data and analysis: [`DATACENTER_RESULTS.md`](DATACENTER_RESULTS.md).\n\n### The equal-precision gap (stated plainly)\n\nOn the **like-for-like bf16** path AMK *trails* cuBLAS. On a 622.9 MB model the bf16 kernel runs at\n**~1.38 ms/token, ~1.24× slower** than CUDA-graphed cuBLAS; the optimized bf16 GEMV sustains **~451 GB/s\n= ~51% of spec / ~63% of measured HBM peak** (cuBLAS ceiling ~90%). 
The int8 win therefore comes from\nstreaming fewer bytes, not from a faster kernel; the int8 kernel on the same model is ~1.22 ms/token\n(~1.09× slower than cuBLAS bf16). The remaining lever is a bandwidth-saturating `cp.async` GEMV with\ncoarser cross-SM sync, a kernel-quality push rather than a redesign, and the correctness-bearing\narchitecture that makes that safe and automatic is already in place. We do **not** beat cuBLAS/vLLM at\nbf16 batch-1, and we say so. Reproduce (GPU): `uv run pytest tests/test_cuda_perf.py`; the 10-minute\nself-improving run: `uv run python amk_cli.py autoresearch small --gpu rtx5090 --minutes 10`.\n\n---\n\n## What's built and measured\n\n- **The whole forward pass runs as one cooperative kernel launch.** The persistent megakernel (one\n  threadblock per SM, counter-synchronized) executes a full Llama-style decode and matches eager\n  PyTorch **and** the CPU reference VM to ~1e-7 (fp32) / bf16 tolerance.\n- **Self-retargeting, measured on three architectures.** The *same source* built and ran a correct\n  megakernel on **sm_120 (RTX 5090)**, **sm_80 (A100)**, and **sm_90 (H100)**, with the nvcc gencode\n  derived from the live device (the H100 ran a 3 GB / 3202-task Llama-1B-shaped decode correctly).\n  See [`DATACENTER_RESULTS.md`](DATACENTER_RESULTS.md).\n- **Multi-token generation that matches eager.** AMK greedily decodes a *sequence*, threading a\n  persistent KV cache across steps; the generated token ids are identical to eager greedy decode.\n  Reproduce: `uv run amk generate toy --gpu rtx5090 --prompt-ids \"1,2,3\" --max-tokens 32 --verify`.\n- **A real trained checkpoint, end-to-end.** AMK imports `HuggingFaceTB/SmolLM2-135M` (real weights +\n  tokenizer) and reproduces HuggingFace's own greedy `generate` token-for-token. 
Reproduce:\n  `uv run python examples/run_hf_model.py` (also `tests/test_hf_checkpoint.py`).\n- **A statically-checked schedule validator.** Across 7,160 adversarial schedules the validator had\n  **zero false-accepts**; an unsafe agent-proposed schedule is `REJECTED` at validation time instead of\n  hanging the GPU.\n- **A native coding-agent harness.** Drivable by any coding agent (Claude Code / Codex) through one\n  structured edit surface: MCP server + Claude Code skill/commands/subagent/workflow + Codex\n  `AGENTS.md`. See [`docs/AGENT_HARNESS.md`](docs/AGENT_HARNESS.md) and [`HARNESS.md`](HARNESS.md).\n- **A 10-minute unattended autoresearch run** self-improves the megakernel **1.47×** over its own\n  starting schedule.\n- **98 tests green** (`uv run pytest`): 78 pass on CPU; 20 CUDA tests auto-skip without a GPU.\n\n---\n\n## Why a megakernel\n\nSingle-stream decode is **bandwidth-bound**: each token must stream the whole weight set through the\nSMs once. The theoretical floor is `weights_bytes / HBM_bandwidth`. A normal PyTorch / cuBLAS execution\nlaunches *one kernel per op* and round-trips activations through HBM between every op, paying launch\nlatency and a memory bubble dozens of times per layer.\n\nA **megakernel** launches **once**, keeps the persistent threadblocks resident on every SM, and walks\nthe model's dependency graph in-place: activations live in on-chip pages, the next layer's weights are\nprefetched while the current layer computes, and there is no kernel-launch bubble between ops. The win\nregime is **single-stream / low-batch decode latency**: voice, realtime, agentic loops. 
We do **not**\nclaim to beat throughput-optimized serving at high batch; that is compute-bound and not this fight.\n\n## How it works\n\nGeneration is **confined inside a verified structure**, so correctness is a property of the\narchitecture, not of the model output.\n\n### Four layers\n\n| Layer | Role | Trust model |\n|------:|------|-------------|\n| **0: VM** ([`vm/`](vm/)) | Persistent kernel: per-SM scheduler loop, page-based scratchpad, counter-based sync. Launched once, runs the whole forward pass. | Trusted base. Hand-written, exhaustively verified, **frozen** per arch. |\n| **1: Instructions** ([`instructions/`](instructions/)) | ABI-conformant micro-kernels (gemv/gemm tile, attention tile, RMSNorm, RoPE, SwiGLU, dequant…). Triton for iteration, CUDA for max perf. | Each is correctness-checked vs its reference op **in isolation** before it can enter a megakernel. |\n| **2: Scheduler** ([`schedule/`](schedule/)) | `HF model → graph IR → tiled task-DAG → instruction stream + page allocation`. Cost-model explore + on-hardware exploit. | The research core. Proposes points in a search space the VM realizes **safely**. |\n| **3: Dynamism** ([`dynamism/`](dynamism/)) | Shape-parametric tiles + in-kernel dispatch: continuous batching, dynamic shapes, MoE routing. | The relevance gate for real serving. (Roadmap: placeholder package.) |\n\n### Deadlock-freedom by construction\n\nA forward pass is a **DAG**. Producers only *increment* counters; consumers only *wait* on\nstatically-known thresholds; execution is a topological walk with monotonic counters: no locks, no\narbitrary signalling. **The VM refuses to load any schedule that is not a valid DAG**: an invalid\nschedule becomes a clean `REJECTED` at validation time instead of a hung GPU. 
This is what makes\nauto-generated schedules safe to run unattended.\n\n### Two autoresearch loops\n\n- **Loop 1, Instruction optimization** (this *is* AutoKernel): edit one ABI-conformant micro-kernel,\n  isolated correctness-then-latency eval (~seconds), keep/revert. A wrong instruction fails its own\n  unit test; no persistent kernel, no hang.\n- **Loop 2, Schedule optimization** (the new loop): the agent's edit surface is the **schedule IR**, a\n  structured object `{tiling, fusion_grouping, sm_assignment, pipelining_depth, page_allocation}` plus\n  kernel knobs, *not* megakernel code. The frozen VM deterministically lowers it; every proposal is\n  statically DAG-validated **before launch**. The full contract is in [`HARNESS.md`](HARNESS.md).\n\n## The four properties (the product spec)\n\n1. **Generality**: one command compiles a model into a verified megakernel with zero per-model\n   hand-written CUDA (today: the HF Llama family; broadening coverage is the roadmap).\n2. **Self-retargeting**: when new silicon ships, AMK retargets in *days* via search + on-hardware\n   verification. Already measured across sm_120 / sm_80 / sm_90. This is the moat.\n3. **A standard IR**: AMK owns the canonical megakernel IR: the SM-level task-DAG, the instruction\n   ABI, the schedule format. See [`docs/IR_SPEC.md`](docs/IR_SPEC.md), [`schedule/ir.py`](schedule/ir.py), [`vm/abi.h`](vm/abi.h).\n4. **A data flywheel**: every run logs `(model, gpu, schedule, instruction, measured result)`; that\n   corpus trains a learned prior so every future run starts smarter.\n\n## Native coding-agent integration\n\nAMK exposes the verified loop substrate *natively* to coding agents, with the same behavior and the\nsame honesty rules. The single guide is [`docs/AGENT_HARNESS.md`](docs/AGENT_HARNESS.md).\n\n- **MCP server** ([`amk_mcp.py`](amk_mcp.py)): `amk_doctor` / `amk_propose` / `amk_eval` / `amk_loop` /\n  `amk_autoresearch` / `amk_orchestrate_*`. 
Enable with `uv sync --extra agent`; register via\n  `.mcp.json` (Claude Code) or `~/.codex/config.toml` (Codex).\n- **Claude Code**: the `megakernel-optimization` skill; the `/amk-optimize`, `/amk-autoresearch`,\n  `/amk-compile` slash commands; the `amk-megakernel-optimizer` subagent; a workflow and a goal under\n  `.claude/`.\n- **Codex**: `AGENTS.md` + the same MCP server.\n\n## Quickstart\n\nAMK installs as a real package (hatchling) and exposes a real `amk` console command. `uv sync`\nprovisions the full environment (torch cu128, numpy, transformers for the HF importer, ninja for the\nCUDA JIT build, pytest); **`uv` is the recommended path.** With pip:\n`pip install \"automegakernel[models,cuda]\"`. Note that the cu128 torch pin only applies under `uv`, so pip\nusers on Blackwell/sm_120 (or any specific CUDA build) should install the matching torch wheel first,\ne.g. `pip install torch --index-url https://download.pytorch.org/whl/cu128`.\n\n```bash\nuv sync                              # provision the env + install the `amk` command (editable)\n\n# --- No GPU required (works on a fresh CPU-only machine) ---\nuv run pytest                        # full suite (98 tests; 78 on CPU, 20 CUDA auto-skip without a GPU)\namk doctor                           # environment + GPU + nvcc + registered targets\namk eval toy --device cpu            # one structured correctness verdict on the CPU reference VM\n\n# --- Requires a CUDA GPU + nvcc ---\namk compile toy --gpu rtx5090 --regime single-stream                            # model -\u003e verified megakernel + report\namk generate toy --gpu rtx5090 --prompt-ids \"1,2,3\" --max-tokens 32 --verify    # multi-token decode == eager\namk eval toy --gpu rtx5090                                                       # correctness + measured-GPU latency\n```\n\nEvery subcommand is also runnable as `uv run python amk_cli.py \u003ccmd\u003e ...` (identical behavior). 
A real\ntrained checkpoint runs end-to-end via `uv run python examples/run_hf_model.py`.\n\n## Honesty rules (enforced by the harness, not just stated)\n\n- **Never a latency number without its paired correctness result.** `eval/bench.py` refuses to emit a\n  latency without a verdict from `eval/oracle.py`.\n- Correctness = full-model logit equivalence within tolerance **plus** generated-token agreement vs\n  eager PyTorch over a sequence.\n- Always report **distance to the `weights / HBM_bandwidth` roofline** ([`eval/roofline.py`](eval/roofline.py)).\n- Measured numbers are produced on the hardware named in the flywheel corpus / `results.tsv`; the\n  datacenter numbers in [`DATACENTER_RESULTS.md`](DATACENTER_RESULTS.md) *are* measured. We do not\n  transcribe numbers we did not measure.\n- We are **near-bandwidth-bound nowhere yet** on the bf16 path, and we say so. The honest current\n  claims are: the int8 inference-fleet cuBLAS win, generality, self-retargeting, trust\n  (deadlock + race free by construction), and honest distance-to-roofline.\n\n## Repo layout\n\n```\nvm/            Layer 0, trusted megakernel VM (CUDA) + CPU reference simulator + verify\ninstructions/  Layer 1, ABI-conformant micro-kernels (Triton + CUDA) + generator + verify\nschedule/      Layer 2, graph import, lowering, the STANDARD IR, cost model, search\ndynamism/      Layer 3, continuous batching, dynamic shapes, MoE (roadmap placeholder)\neval/          oracle (logit equivalence) · bench (latency) · baselines · roofline\namk_cli.py     the `amk` console command (doctor/compile/generate/eval/propose/loop/...)\ncompile.py     THE PRODUCT: amk compile \u003chf-model\u003e --gpu \u003carch\u003e\ngenerate.py    autoregressive multi-token decode (KV cache threaded across steps)\nharness.py     the coding-agent integration surface (Loop 2, schedule search)\nexamples/      run_hf_model.py, a real HF checkpoint end-to-end (SmolLM2-135M)\ndocs/          IR_SPEC.md (the standard IR) · 
AGENT_HARNESS.md (agent integration)\nHARNESS.md     the coding-agent harness contract\nDATACENTER_RESULTS.md  measured inference-fleet + sm_80/sm_90 self-retargeting results\nprogram.md     the autonomous-operation brain (run AMK unattended)\nmodels/        self-contained test models (small dense -\u003e MoE)\n```\n\n## Status\n\nThe correctness-bearing core, the GPU megakernel, self-retargeting, multi-token generation, the\nreal-checkpoint path, and the agent harness are all built and measured. The active push is\nkernel-quality perf toward the roofline (see [the gap](#the-equal-precision-gap-stated-plainly)). See\n[`program.md`](program.md) for the roadmap and the autonomous-loop discipline; the IR\n([`schedule/ir.py`](schedule/ir.py), [`docs/IR_SPEC.md`](docs/IR_SPEC.md)) and the instruction ABI\n([`vm/abi.h`](vm/abi.h)) are the two stable contracts everything is built against.\n\n## License\n\nMIT © 2026 RightNow AI\n\n## Enterprise\n\nAutoMegaKernel is our open research harness. For production and enterprise needs we are building\n**Forge**, an internal advanced kernel generator for enterprises. For enterprise requests, contact\n**[jaber@runinfra.ai](mailto:jaber@runinfra.ai)**.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frightnow-ai%2Fautomegakernel","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frightnow-ai%2Fautomegakernel","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frightnow-ai%2Fautomegakernel/lists"}