{"id":50309565,"url":"https://github.com/manishklach/intent-attention-kernel","last_synced_at":"2026-05-28T20:00:41.786Z","repository":{"id":360863922,"uuid":"1250636212","full_name":"manishklach/intent-attention-kernel","owner":"manishklach","description":"Intent-aware attention research prototype that treats long-context inference as structured semantic blocks instead of a flat token stream, proving CPU-first correctness and analytical KV/FLOP savings before GPU kernel implementation.","archived":false,"fork":false,"pushed_at":"2026-05-28T08:09:05.000Z","size":73,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2026-05-28T08:11:07.677Z","etag":null,"topics":["agentic-ai","ai-infrastructure","attention","block-attention","cost-model","cuda","gpu-kernels","inference","kernel-research","kv-cache","llm-inference","long-context","python","pytorch","research","semantic-attention","sparse-attention","systems","transformers","triton"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/manishklach.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-26T20:35:17.000Z","updated_at":"2026-05-28T08:09:10.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/manishklach/intent-attention-kernel","commit_stats":null,"previous_names":["manishklach/intent-attention-ke
rnel"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/manishklach/intent-attention-kernel","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/manishklach%2Fintent-attention-kernel","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/manishklach%2Fintent-attention-kernel/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/manishklach%2Fintent-attention-kernel/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/manishklach%2Fintent-attention-kernel/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/manishklach","download_url":"https://codeload.github.com/manishklach/intent-attention-kernel/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/manishklach%2Fintent-attention-kernel/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33624221,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-05-28T02:00:06.440Z","response_time":99,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agentic-ai","ai-infrastructure","attention","block-attention","cost-model","cuda","gpu-kernels","inference","kernel-research","kv-cache","llm-inference","long-context","python","pytorch","research","semantic-attention","sparse-attention","syst
ems","transformers","triton"],"created_at":"2026-05-28T20:00:29.582Z","updated_at":"2026-05-28T20:00:41.755Z","avatar_url":"https://github.com/manishklach.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![CI](https://github.com/manishklach/intent-attention-kernel/actions/workflows/tests.yml/badge.svg)](https://github.com/manishklach/intent-attention-kernel/actions/workflows/tests.yml)\n[![Python 3.10+](https://img.shields.io/badge/python-3.10%2B-blue)](https://python.org)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)\n\n---\n\n# Intent Attention Kernel\n\n**Intent-Aware KV Execution for Agentic Long-Context Inference**\n\n\u003e Attention should not pretend context is flat — and KV execution should not\n\u003e pretend every block is equally useful.\n\n---\n\n**This repo is a CPU-first research prototype for exposing semantic runtime\nintent to the KV execution layer. It does not claim GPU speedups yet.**\n\n---\n\n## Current Status\n\n| Area | Status |\n|---|---|\n| CPU-first prototype | Complete |\n| Router-to-kernel metadata | Implemented |\n| IntentQuant policy simulation | Implemented |\n| IntentQuant attention reference | Implemented |\n| Triton decode prototype | Optional, exists |\n| LLM validation harness | Exists (proxy only) |\n| GPU benchmark harness | Exists (no measured speedups yet) |\n| Measured GPU speedup | Not claimed |\n| Measured model quality | Not claimed |\n\n**Key docs:**\n- [Research Summary](docs/research_summary.md) — thesis, problem, proposed interface, limitations\n- [Reproducibility Guide](docs/reproducibility.md) — exact commands for CPU, dry-run, LLM, and GPU\n- [Validation Plan](docs/validation_plan.md) — quality ladder, proxy metrics, publishable-evidence bar\n- [GPU Benchmarking](docs/gpu_benchmarking.md) — fair baselines, hardware matrix, T4 caveat\n- [Results Template](docs/results_template.md) — tables to fill when running 
experiments\n\n---\n\nLong-context agentic inference is not just an attention problem. It is a KV\nexecution problem.\n\nAgentic context contains structurally different regions:\n\n- system prompts\n- recent conversation\n- retrieved documents\n- tool outputs\n- memory summaries\n- scratchpads\n- intermediate reasoning traces\n\nA generic dense attention path treats all of these as one flat KV stream.\nThis repo explores a different interface: expose semantic block metadata to\nthe execution layer so the runtime can select, score, quantize, prefetch,\nand eventually schedule KV blocks more intelligently.\n\n---\n\n## System Components\n\n### 1. Semantic KV Block Selection\n\n`BlockLayout` and `SemanticBlock` describe context regions. `BlockPolicy`\ncontrols whether a block is `ALWAYS`, `ATTEND`, `SKIP`, `RECENT`, or\n`GLOBAL`. The CPU reference gathers selected K/V tokens and computes\nattention over them.\n\n\u003e Do not compute and then mask; expose structure early enough to avoid the\n\u003e work.\n\n### 2. KV Block Router\n\nThe KV Block Router is the missing **runtime-to-kernel policy layer**.\nIt converts semantic context blocks into flat kernel-ready metadata:\n\n- selected pages\n- skipped pages\n- precision by page\n- prefetch hints\n- routing reasons\n\n```python\nfrom intent_attention import BlockRouter, RouterConfig\n\nrouter = BlockRouter(RouterConfig(memory_pressure=0.5))\nrouted = router.route_layout(layout, total_tokens=1440)\nsummary = router.routing_summary(routed)\nmeta = routing_to_kernel_metadata(routed, page_size=16)\n```\n\n**The router is the policy layer. The kernel is the execution layer.**\n\n### 3. Dynamic Block Scoring\n\nSome blocks may be ambiguous. A lightweight scoring path can rank candidate\nblocks using query-to-block similarity. This is a heuristic prototype, not\na trained router. It is meant to model the control-plane surface that a\nfuture runtime or kernel could consume.\n\n### 4. 
IntentQuant-KV: Intent-Aware Mixed-Precision KV Quantization\n\nNot every KV block deserves the same precision. `IntentQuantizer` assigns\nper-block precision (FP16, FP8, INT8, INT4, INT4_RESIDUAL, or SKIP) based\non block policy, score, recency, and memory pressure. This is a policy\nsimulator only — no real GPU quantization kernel is provided.\n\n### 5. IntentQuant Attention Reference — Per-Block Quantized Attention\n\nExtends IntentQuant-KV into the selected-block attention path itself. Each\nselected block is individually quantized (via `fake_quantize_tensor`) and\nimmediately dequantized (via `fake_dequantize_tensor`) before being\nconcatenated and passed to dense attention. This is a CPU reference — the\nquantized path is intentionally slower to isolate reconstruction error\nmechanics without hardware fusion.\n\n```python\nfrom intent_attention.intent_quant_attention import (\n    intent_quant_attention_reference,\n    compare_intent_quant_to_fp16_selected,\n)\n```\n\n### 6. Speculative KV Prefetch Simulation\n\nAgentic decode often reuses similar KV regions over adjacent steps. A\nprefetcher can predict likely next-step KV pages. The current benchmark\nsimulates hit rate and latency-hiding potential. Prefetch must never affect\ncorrectness. No real latency speedup is claimed without hardware validation.\n\n### 7. Optional Triton Decode-Attention Prototype\n\nAn optional GPU-only kernel (`triton_intent_quant_attention.py`) implements\nsingle-token decode attention over selected KV pages with per-page precision\n(FP16 or INT8). It skips cleanly on systems without Triton or CUDA and is\nnot required for any CPU test or benchmark. **No GPU speedup is claimed.**\n\n```bash\npython benchmarks/bench_triton_intent_quant_attention.py\n```\n\n### 8. Validation Harness\n\nTwo experiment scripts validate the prototype pipeline without making claims:\n\n- `experiments/llm_quality_validation.py` — proxy perplexity validation on\n  small HuggingFace models. 
Applies fake quant/dequant to `past_key_values`\n  across multiple routing policies. Dry-run mode validates imports without\n  downloading models.\n- `experiments/gpu_decode_benchmark.py` — decode-step attention latency\n  benchmark across PyTorch SDPA, SelectedKV, Triton IntentQuant, xFormers,\n  and FlashAttention-2 (Ampere+ only). Dry-run mode detects hardware without\n  launching kernels.\n\nSee `docs/validation_plan.md` and `docs/gpu_benchmarking.md` for details.\n\n### 9. Fused Selected-Quant Decode Kernel\n\nAn experimental Triton kernel (`fused_selected_quant_decode.py`) that fuses\nruntime semantic page selection, mixed-precision (FP16/INT8/SKIP) page loading,\nand decode-step attention into a single GPU kernel. It consumes BlockRouter\nmetadata directly and is the execution-layer backend for intent-aware KV\nexecution.\n\n```bash\n# Dry-run (validate imports, detect hardware)\npython benchmarks/bench_fused_selected_quant_decode.py --dry-run\n\n# Full benchmark on GPU (requires Triton + CUDA)\npython benchmarks/bench_fused_selected_quant_decode.py \\\n    --batch 1 --heads 8 --head-dim 64 \\\n    --num-pages 64 --selected-frac 0.25\n```\n\n**No GPU speedup is claimed.** This is a research prototype.\n\n---\n\n## IntentQuant-KV\n\nIntentQuant-KV explores the idea that **not every KV block deserves the\nsame precision**.\n\nIn agentic long-context inference, KV blocks have different semantic roles:\n\n- **Critical blocks** — system prompts, global memory, recent context —\n  are attended to every step and may need higher precision.\n- **Lower-score blocks** — old retrieved documents, tool outputs,\n  scratchpad regions — are attended to less frequently and can use\n  lower precision or residual quantization.\n- **Skipped blocks** contribute zero KV bytes.\n\n`IntentQuantizer` assigns a `KVPrecision` (FP16, FP8, INT8, INT4,\nINT4_RESIDUAL, or SKIP) to each block using:\n\n- **Block policy**: ALWAYS/GLOBAL blocks default to FP16; RECENT to FP8.\n- **Block 
score**: high-scoring ATTEND blocks retain higher precision;\n  low-scoring blocks are downgraded.\n- **Memory pressure**: a knob in [0, 1] that downgrades non-critical\n  blocks as pressure increases.\n- **Preserve flags**: `preserve_recent` and `preserve_global` keep\n  important blocks at higher precision even under moderate pressure.\n\n**This is CPU-first, analytical, and prototype-level.**\n\n- No GPU speedup is claimed.\n- No model accuracy or perplexity preservation is claimed.\n- Fake quantize/dequantize is only a CPU simulation using symmetric\n  absmax scaling.\n- The real benefit depends on dequant overhead, memory bandwidth,\n  page layout, page reuse, and attention fusion.\n\n```bash\n# Run the IntentQuant-KV benchmark\npython benchmarks/bench_intent_quant.py\n```\n\n---\n\n## Runtime-to-Kernel Contract\n\nThis repo models a contract where the runtime produces policy metadata\nand the kernel consumes it selectively. The kernel does not magically\ndiscover which context blocks are useful.\n\n```text\nAgentic runtime\n    |\n    v\nSemantic block layout\n    |\n    v\nKV Block Router\n    |\n    +--\u003e block selection (policy + score + recency)\n    +--\u003e precision assignment (IntentQuantizer)\n    +--\u003e prefetch candidates\n    |\n    v\nKernel metadata\n    |\n    +--\u003e selected_page_ids\n    +--\u003e block_precision_by_page\n    +--\u003e prefetch_page_ids\n    +--\u003e routing reasons\n    |\n    v\nSelected-block / IntentQuant attention path\n```\n\n| Layer | Responsibility |\n|---|---|\n| Semantic block layout | Describe context regions, policies, scores, and token bounds |\n| KV Block Router | Decide which blocks to select, skip, quantize, or prefetch |\n| IntentQuantizer | Assign per-block precision such as FP16, FP8, INT8, INT4, or SKIP |\n| Kernel metadata | Flatten routing output into selected page IDs, precision tags, and prefetch hints |\n| Attention reference | Run CPU dense or selected-block attention over the selected 
metadata |\n| Future Triton/CUDA kernel | Consume the same metadata in a fused GPU execution path |\n\nThe router is the policy layer. The kernel is the execution layer.\n\n---\n\n## Architecture\n\n```text\nAgentic runtime\n    |\n    v\nKV Block Router (policy layer)\n    |\n    +--\u003e semantic policy (ALWAYS, ATTEND, SKIP, RECENT, GLOBAL)\n    +--\u003e dynamic block score\n    +--\u003e recency window\n    +--\u003e memory pressure\n    +--\u003e optional query-to-block similarity\n    |\n    v\nKernel metadata\n    |\n    +--\u003e selected pages\n    +--\u003e precision by page\n    +--\u003e prefetch hints\n    |\n    v\nIntentQuant / selected-block attention kernels\n    |\n    v\nFuture Triton/CUDA kernel path\n```\n\n---\n\n## Dense vs Masked vs Intent-Aware\n\n| Approach | What it knows | Work avoided today | Future GPU goal |\n|---|---|---:|---|\n| Dense attention | Flat token stream | None | Baseline |\n| Masked attention | Token/block mask | Usually limited | May still process masked regions |\n| Selected-block attention | Semantic block bounds and policy | CPU gather over selected K/V | Avoid loading skipped KV pages |\n| Intent-aware KV execution | Policy, score, quantization, and prefetch hints | Analytical/simulated today | Fuse selection, dequant, and prefetch into kernel/runtime |\n\n\u003e Do not compute and then mask; expose structure early enough to avoid the\n\u003e work.\n\n---\n\n## Quickstart\n\n```bash\n# Install from source (editable, with dev dependencies)\npip install -e \".[dev]\"\n\n# Compile-check all source files\npython -m py_compile src/intent_attention/*.py\n\n# Run tests\npytest -q\n\n# Run analytical cost model\npython benchmarks/bench_cost_model.py\n\n# Run CPU timing benchmark\npython benchmarks/bench_cpu_reference.py\n\n# Run KV quantization memory model\npython benchmarks/bench_kv_quant.py\n\n# Run speculative prefetch simulation\npython benchmarks/bench_prefetch.py\n\n# Run dynamic scoring benchmark\npython 
benchmarks/bench_dynamic_scoring.py\n\n# Run intent-aware mixed-precision KV quantization benchmark\npython benchmarks/bench_intent_quant.py\n\n# Run per-block IntentQuant attention reference benchmark\npython benchmarks/bench_intent_quant_attention.py\n\n# Run optional Triton IntentQuant decode attention benchmark (requires GPU + Triton)\npython benchmarks/bench_triton_intent_quant_attention.py\n\n# Run KV Block Router benchmark (CPU)\npython benchmarks/bench_block_router.py\n\n# Run end-to-end router demo\npython examples/end_to_end_router_demo.py\n\n# Dry-run LLM quality validation (validates imports only, no model download)\npython experiments/llm_quality_validation.py --dry-run\n\n# Dry-run GPU decode benchmark (validates imports, no GPU required)\npython experiments/gpu_decode_benchmark.py --dry-run\n```\n\n---\n\n## Example Usage\n\n```python\nimport torch\nfrom intent_attention import (\n    BlockLayout,\n    BlockPolicy,\n    SemanticBlock,\n    semantic_block_attention,\n    savings_report,\n)\n\nq = torch.randn(1, 4, 16, 64)\nk = torch.randn(1, 4, 1024, 64)\nv = torch.randn(1, 4, 1024, 64)\n\nlayout = BlockLayout([\n    SemanticBlock(\"system_prompt\",     0,   128, BlockPolicy.ALWAYS),\n    SemanticBlock(\"retrieved_doc_0\",  128, 512, BlockPolicy.ATTEND, score=0.85),\n    SemanticBlock(\"retrieved_doc_1\",  512, 768, BlockPolicy.SKIP),\n    SemanticBlock(\"recent_context\",    768, 1024, BlockPolicy.RECENT),\n])\n\nout, debug = semantic_block_attention(q, k, v, layout, return_debug=True)\n\nprint(out.shape)          # torch.Size([1, 4, 16, 64])\nprint(debug)\n# {\n#   'selected_token_count': 640,\n#   'selected_block_names': ['system_prompt', 'retrieved_doc_0', 'recent_context'],\n#   'total_kv_tokens': 1024,\n#   'selected_kv_tokens': 640\n# }\n\nreport = savings_report(1, 4, 16, 1024, debug[\"selected_kv_tokens\"], 64)\nprint(f\"FLOPs saved: {report['flops_saved_pct']:.1f}%\")\nprint(f\"KV bytes saved: 
{report['kv_bytes_saved_pct']:.1f}%\")\n```\n\n---\n\n## Tests\n\n```bash\npytest -q          # quiet mode\npytest -v          # verbose mode\npytest tests/      # run all tests\n```\n\n---\n\n## Benchmarks\n\nAll benchmarks run on CPU and are safe to run without CUDA or Triton.\n\n### bench_cost_model.py\n\nAnalytical FLOP and KV-byte savings from selected-block attention. Uses\nzero-tensor arithmetic to compare dense vs selected-attention cost.\n\n### bench_cpu_reference.py\n\nCPU timing sanity check for dense vs selected-block reference paths.\nMeasures PyTorch overhead at small token counts on CPU only.\n\n### bench_kv_quant.py\n\nKV byte savings model for selected INT8-style KV pages. Compares fp16\ndense storage vs int8+scale for selected pages. Purely analytical.\n\n### bench_prefetch.py\n\nSimulated next-step KV page prediction and hit-rate behavior for\nspeculative prefetch during agentic decode.\n\n### bench_dynamic_scoring.py\n\nSynthetic query-to-block cosine-similarity scoring behavior across\nvarying block counts.\n\n### bench_intent_quant.py\n\nIntent-aware mixed-precision KV quantization policy simulator. Assigns\nper-block precision (FP16/FP8/INT8/INT4/INT4_RESIDUAL/SKIP) based on\nblock policy, score, recency, and memory pressure. Includes a fake\nquant/dequant reconstruction error test.\n\n### bench_intent_quant_attention.py\n\nCPU reference for per-block mixed-precision fake quant/dequant within the\nselected-block attention path. Compares FP16-selected vs quantized-selected\nattention outputs and reports reconstruction error metrics.\n\n### bench_triton_intent_quant_attention.py\n\nOptional Triton prototype for single-token decode attention over selected\nKV pages with per-page precision (FP16 or INT8). Skips cleanly on systems\nwithout Triton or CUDA. No GPU speedup is claimed — this is a first kernel\nprototype for hardware experimentation.\n\n### bench_block_router.py\n\nCPU routing and cost-model benchmark for the KV Block Router. 
Generates\nsynthetic agentic layouts at 8K, 32K, and 128K tokens and reports block\nselection, precision distribution, page IDs, and estimated KV byte savings\nfor multiple router configurations.\n\n\u003e This is a routing and cost-model benchmark, not a GPU speedup claim.\n\n\u003e CPU Ratio is not a GPU speedup claim. CPU timing is affected by PyTorch\n\u003e dispatch overhead, gather overhead, cache behavior, tensor size, and\n\u003e small-batch effects.\n\n### Experiments\n\n#### LLM Quality Validation (`experiments/llm_quality_validation.py`)\n\nProxy perplexity validation on small HuggingFace models (SmolLM2, TinyLlama).\nRuns baseline vs quantized-`past_key_values` comparison across multiple routing\npolicies. **This is a proxy only**: the quantization is applied outside the\nnative model forward pass and does not represent production KV-cache\nquantization.\n\n```bash\n# Dry-run (validate imports, no model download)\npython experiments/llm_quality_validation.py --dry-run\n\n# Run with SmolLM2-135M on Wikitext-2 (requires transformers + datasets)\npython experiments/llm_quality_validation.py --model HuggingFaceTB/SmolLM2-135M\n```\n\nResults include: baseline perplexity, quantized perplexity per policy,\nreconstruction error metrics (MSE, max-abs, cosine), and selected/skipped\nblock counts per routing config.\n\n| Policy | KV tokens kept | Est. 
bytes saved |\n|---|---|---|\n| conservative | 100% (no skip) | 0% |\n| balanced | ~50% | ~50% |\n| aggressive | ~25% | ~75% |\n\n#### GPU Decode Benchmark (`experiments/gpu_decode_benchmark.py`)\n\nMeasures decode-step attention latency on available GPU hardware across\nmultiple backends: PyTorch SDPA, selected-KV gather + SDPA, optional Triton\nIntentQuant decode, optional xFormers, and optional FlashAttention.\n\n```bash\n# Dry-run (validate imports, detect hardware)\npython experiments/gpu_decode_benchmark.py --dry-run\n\n# Full benchmark on GPU\npython experiments/gpu_decode_benchmark.py \\\n    --batch 1 --heads 32 --head-dim 64 \\\n    --kv-len 65536 --selected-frac 0.25 \\\n    --iters 100 --warmup 20\n```\n\n**T4 caveat:** FlashAttention-2 is skipped on Turing GPUs (CC \u003c 8.0). Use\nPyTorch SDPA or xFormers as baselines on T4.\n\nSee `docs/gpu_benchmarking.md` for hardware matrix and fair-baseline guide.\n\n---\n\n## What Is Implemented\n\n- [x] SemanticBlock / BlockLayout metadata\n- [x] BlockPolicy enum (ALWAYS, ATTEND, SKIP, RECENT, GLOBAL)\n- [x] BlockTable page mapping helper\n- [x] PyTorch dense attention baseline\n- [x] PyTorch selected-block attention reference\n- [x] Dynamic block scoring prototype (BlockScorer)\n- [x] Analytical FLOP/KV-byte cost model\n- [x] Synthetic agentic trace generator\n- [x] KV quantization benchmark/model\n- [x] Speculative prefetch simulator (BlockPrefetcher)\n- [x] Triton/CUDA placeholder paths with CPU-safe fallback\n- [x] HuggingFace Transformers integration (patch_model)\n- [x] vLLM-style paged-attention bridge\n- [x] Intent-aware mixed-precision KV quantization policy simulator (IntentQuantizer)\n- [x] Fake quant/dequant reconstruction metrics (FP16/FP8/INT8/INT4/INT4_RESIDUAL)\n- [x] pytest coverage (153 tests)\n- [x] CPU benchmark scripts (9 benchmarks)\n- [x] IntentQuant Attention Kernel — per-block fake quant/dequant in selected-block attention path\n- [x] Triton IntentQuant decode attention prototype 
(optional, GPU-only)\n- [x] CPU-first KV Block Router — runtime-to-kernel policy layer\n- [x] routing-to-kernel metadata conversion (selected pages, precision, prefetch)\n- [x] per-block routing decisions and reasons\n- [x] End-to-end demo script (examples/end_to_end_router_demo.py)\n- [x] LLM quality validation experiment (experiments/llm_quality_validation.py)\n- [x] GPU decode benchmark experiment (experiments/gpu_decode_benchmark.py)\n- [x] Validation plan docs (docs/validation_plan.md)\n- [x] GPU benchmarking guide (docs/gpu_benchmarking.md)\n\n---\n\n## What Is Not Claimed\n\n- No GPU speedups are claimed.\n- No production-ready Triton/CUDA kernel is claimed.\n- No real NVIDIA hardware validation has been performed.\n- Quantization has not been validated for model accuracy or perplexity.\n- No superiority over KIVI, KVQuant, or TurboQuant is claimed.\n- No production quantization kernel is provided.\n- No model quality guarantee is made.\n- Prefetch has not been validated for real latency improvement.\n- Dynamic scoring is a heuristic, not a trained routing model.\n- **The KV Block Router is heuristic, not learned.**\n- **Selected pages are not guaranteed optimal.**\n- **No accuracy or perplexity validation has been performed on routing decisions.**\n- **Partial-page bounds are not implemented.** The router selects full pages\n  even if a block starts or ends mid-page. A future kernel would need\n  per-page token offset masks for correctness.\n- CPU Ratio is not a GPU speedup.\n- Analytical KV/FLOP savings are not measured GPU performance.\n- **Validation experiments use proxy KV-cache quantization** — post-hoc\n  quantize/dequantize on past_key_values, not real in-place KV cache\n  quantization. Results do not guarantee production quality preservation.\n- **GPU benchmarks are local measurements only.** No GPU speedup claim\n  is made from any single config, GPU, or software version. 
Results vary\n  by hardware, driver, CUDA version, and system load.\n\n---\n\n## Repository Layout\n\n```\nintent-attention-kernel/\n    .github/workflows/tests.yml   CI\n    benchmarks/\n        bench_block_router.py     KV Block Router routing \u0026 cost model\n        bench_cost_model.py       Analytical cost model\n        bench_cpu_reference.py    CPU timing (for development only)\n        bench_dynamic_scoring.py  Dynamic block scoring evaluation\n        bench_intent_quant.py     Intent-aware mixed-precision KV quantization\n        bench_intent_quant_attention.py  Per-block quantized attention reference\n        bench_triton_intent_quant_attention.py  Optional Triton decode attention\n        bench_kv_quant.py         KV cache quantisation memory analysis\n        bench_prefetch.py         Speculative prefetch decode simulation\n        docs/\n            architecture.md           Module design\n            attention_layout.md       Block policies\n            block_router.md           KV Block Router design and contract\n            dynamic_scoring.md        Dynamic scoring design\n            gpu_benchmarking.md       GPU benchmarking guide \u0026 fair baselines\n            gpu_kernel_plan.md        Future GPU mapping\n            intent_quant.md           Intent-aware mixed-precision KV quantization\n            kv_quantization.md        KV quantization modeling\n            prefetch.md               Speculative prefetch simulation\n            repo_metadata.md          Suggestions for GitHub settings\n            results_cpu.md            Detailed CPU results notes\n            validation_plan.md        LLM quality validation plan\n        experiments/\n            gpu_decode_benchmark.py   GPU decode attention benchmark\n            llm_quality_validation.py Proxy perplexity validation\n    src/intent_attention/\n        __init__.py               Public API\n        _enum.py                  StrEnum base\n        block_metadata.py         
BlockPolicy, SemanticBlock, BlockLayout\n        block_router.py           KV Block Router (policy layer)\n        block_scorer.py           Dynamic block scoring (cosine similarity)\n        block_table.py            Paged KV mapping simulation\n        cost_model.py             Analytical FLOP/KV-byte model\n        hf_patch.py               HuggingFace Transformers integration\n        intent_quant.py           Intent-aware mixed-precision KV quantization\n        intent_quant_attention.py Per-block quantized attention reference\n        kv_quant.py               INT8 KV cache quantisation\n        triton_intent_quant_attention.py Optional Triton IntentQuant decode attention\n        prefetch.py               Speculative KV block prefetching\n        reference.py              Dense + selected-block attention\n        synthetic_traces.py       Layout generators\n        triton_kernel.py          Triton GPU kernel with CPU fallback\n        triton_kernel_quant.py    INT8 quantised Triton kernel\n        vllm_bridge.py            vLLM-style paged-attention bridge\n    tests/                        Test suite\n    CHANGELOG.md\n    README.md\n    pyproject.toml\n```\n\n---\n\n## Formatting\n\n```bash\n# Auto-format with black\npython -m black src tests benchmarks\n\n# Lint with ruff\npython -m ruff check src tests benchmarks\n```\n\n---\n\n## Roadmap (Future Work)\n\n- [x] **KV Block Router** — runtime-to-kernel policy layer (CPU)\n- [x] **Triton IntentQuant decode kernel** — selected-page decode with per-page precision (FP16/INT8)\n- [ ] **Triton kernel** — iterate only over physical pages from block table (general)\n- [ ] **CUDA kernel** — minimal paged-attention with semantic skipping\n- [ ] **Variable block sizes** — support non-uniform page sizes\n- [ ] **Integration with HuggingFace / vLLM** — plug into real inference\n      engines\n- [ ] **Trained routing** — replace heuristic scoring with learned block\n      selection\n\n---\n\n## Disclaimer\n\nThis is 
research prototype code. Interfaces may change. Not\nproduction-ready. No GPU speedups are claimed or implied. All GPU-related\nstatements describe future design goals, not current capabilities.\n\n## License\n\nMIT\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmanishklach%2Fintent-attention-kernel","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmanishklach%2Fintent-attention-kernel","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmanishklach%2Fintent-attention-kernel/lists"}