{"id":35532285,"url":"https://github.com/psmarter/mini-infer","last_synced_at":"2026-04-02T16:18:07.141Z","repository":{"id":331117315,"uuid":"1125096708","full_name":"psmarter/mini-infer","owner":"psmarter","description":"High-performance LLM inference engine based on PagedAttention (refactoring in progress)","archived":false,"fork":false,"pushed_at":"2026-03-25T15:57:22.000Z","size":955,"stargazers_count":118,"open_issues_count":0,"forks_count":6,"subscribers_count":12,"default_branch":"main","last_synced_at":"2026-03-26T10:34:01.436Z","etag":null,"topics":["ai","cuda","deep-learning","gpu","inference","language-model","llm","machine-learning","pagedattention","python","pytorch","transformer","triton"],"latest_commit_sha":null,"homepage":"https://smarter.xin/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/psmarter.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-12-30T06:20:29.000Z","updated_at":"2026-03-26T09:03:30.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/psmarter/mini-infer","commit_stats":null,"previous_names":["psmarter/mini-infer"],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/psmarter/mini-infer","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/psmarter%2Fmini-infer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/psmarter%2Fmini-infer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/ho
sts/GitHub/repositories/psmarter%2Fmini-infer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/psmarter%2Fmini-infer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/psmarter","download_url":"https://codeload.github.com/psmarter/mini-infer/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/psmarter%2Fmini-infer/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31309817,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-02T12:59:32.332Z","status":"ssl_error","status_checked_at":"2026-04-02T12:54:48.875Z","response_time":89,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","cuda","deep-learning","gpu","inference","language-model","llm","machine-learning","pagedattention","python","pytorch","transformer","triton"],"created_at":"2026-01-04T02:10:27.997Z","updated_at":"2026-04-02T16:18:07.135Z","avatar_url":"https://github.com/psmarter.png","language":"Python","funding_links":[],"categories":["*Ops for AI"],"sub_categories":["Model Serving \u0026 Inference"],"readme":"# mini-infer\n\n**LLM inference engine built from scratch** — paged KV cache, continuous batching, chunked prefill, prefix caching, speculative decoding, CUDA graph, tensor parallelism, MoE expert parallelism, and 
OpenAI-compatible HTTP serving. Each mechanism has a dedicated benchmark with quantitative results. The core serving path reaches **100% of HF baseline throughput** at batch=8; concurrent HTTP throughput scales **3.9× (1→8 clients, 55.7→219.1 tok/s)**. Ships with dry-run mode (no model weights needed), `/healthz`, Docker, and CI.\n\n\u003e An LLM inference engine implemented from scratch. The core serving path (PagedAttention + Continuous Batching + OpenAI HTTP API) reaches **100% of HF Transformers throughput** on Qwen2.5-7B and supports `--dry-run` startup verification without model weights.\n\n[![CI](https://github.com/psmarter/mini-infer/actions/workflows/smoke.yml/badge.svg)](https://github.com/psmarter/mini-infer/actions/workflows/smoke.yml)\n[![lint](https://github.com/psmarter/mini-infer/actions/workflows/lint.yml/badge.svg)](https://github.com/psmarter/mini-infer/actions/workflows/lint.yml)\n![Python](https://img.shields.io/badge/Python-3.10%2B-blue)\n![PyTorch](https://img.shields.io/badge/PyTorch-2.1%2B-orange)\n![CUDA](https://img.shields.io/badge/CUDA-12.1-green)\n![License](https://img.shields.io/badge/license-MIT-blue)\n\n![demo](assets/demo.gif)\n\n---\n\n## Verify immediately (no model weights required)\n\n```bash\npip install -e \".[serve,dev]\"\nmini-infer-serve --dry-run --port 8000\ncurl http://localhost:8000/healthz   # → {\"status\":\"ok\",\"model\":\"dry\",...}\n```\n\n---\n\n## How to read this repository\n\nFive files cover the core inference mechanisms; reading them in order is recommended:\n\n| File | Contents |\n|------|------|\n| [`mini_infer/cache/kv_cache.py`](mini_infer/cache/kv_cache.py) | Paged KV Cache: BlockTable, FreeBlockPool, prefix cache LRU eviction |\n| [`mini_infer/runtime/scheduler.py`](mini_infer/runtime/scheduler.py) | Four-queue scheduler: waiting / running / swapped / prefilling, preemption, chunked prefill |\n| [`mini_infer/runtime/async_engine.py`](mini_infer/runtime/async_engine.py) | Continuous batching: background step loop + asyncio.Queue; HTTP requests are merged into the same decode_batch |\n| [`mini_infer/serving/server.py`](mini_infer/serving/server.py) | OpenAI Chat Completions HTTP API: SSE streaming, non-streaming, `/healthz` |\n| 
[`mini_infer/parallel/tp_engine.py`](mini_infer/parallel/tp_engine.py) | Tensor Parallelism: NCCL all-reduce, Megatron-LM-style column/row parallelism |\n\nFor the distributed and quantization extensions, see also [`mini_infer/modeling/quantization.py`](mini_infer/modeling/quantization.py) and [`mini_infer/parallel/ep_engine.py`](mini_infer/parallel/ep_engine.py).\n\n---\n\n## Key results\n\n**Main serving path (started by default with `mini-infer-serve`)**\n\n| Technique | Key figures |\n|------|---------|\n| **Continuous Batching HTTP API** (AsyncEngine + OpenAI-compatible) | Throughput at concurrency 1→8: **55.7→219.1 tok/s** (3.9×, Qwen2.5-7B, RTX 4090) |\n| **True PagedAttention** (flash_attn block_table) | batch=8 throughput reaches **100%** of HF Transformers (406 tok/s) |\n| **Chunked Prefill** | ITL spikes reduced by **57%–67%** in mixed serving workloads |\n| **Prefix Caching** (block-level hash + LRU) | Shared-prefix TTFT **−22%** |\n\n**Standalone benchmark experiments (feature-complete, not wired into the default serving path)**\n\n| Technique | Key figures |\n|------|---------|\n| **Speculative Decoding** (0.5B draft + 7B target) | acceptance rate **55.85%** |\n| **CUDA Graph** (static capture of decode_batch) | 1.5B bs=1 decode latency **−28.9%** |\n| **Flash Decoding** (Triton split-K) | seq=4096 latency **3.31×** vs standard Triton, SM utilization 9%→103% |\n| **Tensor Parallelism** (NCCL all-reduce, Megatron-LM style) | TP=2 greedy output is **identical** to single-GPU output (see note ¹) |\n| **MLA** (DeepSeek-V2/V3 architecture) | latent cache size **−56.25%** vs GQA |\n| **MoE Expert Parallelism** (Grouped Local Execution) | EP grouped / dense = **2.500×** |\n\n**Prototype implementations (correctness-first, with clearly defined limitations)**\n\n| Technique | Key figures |\n|------|---------|\n| **W8A8 quantization** (per-channel int8 + mixed fallback) | Weight memory **−32.4%**, greedy match 71.8% (see note ²) |\n| **PD disaggregation** (two processes on one machine) | Three-stage TTFT breakdown: prefill 12.3 ms / transfer ≈14.7 ms / decode 519 ms |\n\nFull benchmark data and reproduction commands are in [docs/benchmarks.md](docs/benchmarks.md).\n\n| Main-path throughput evolution | MoE EP throughput evolution |\n|:-----------:|:--------------:|\n| ![throughput](assets/charts/01_throughput_evolution.png) | ![moe_ep](assets/charts/03_moe_ep_evolution.png) |\n\n| Flash Decoding (seq_len sweep) | CUDA Graph decode latency |\n|:-------------------------------:|:---------------------:|\n| 
![flash_decode](assets/charts/04_flash_decode.png) | ![cuda_graph](assets/charts/02_cuda_graph.png) |\n\n\u003e ¹ **Note on TP**: at the 1.5B model scale, the NCCL all-reduce communication overhead across 2 GPUs exceeds the compute savings, so throughput does not improve; the current acceptance result is that greedy output is identical to single-GPU output (correctness verified). TP is intended for scaling out large models that do not fit in a single GPU's memory.\n\u003e\n\u003e ² **Note on W8A8**: the decode path is limited by the small-M bottleneck of `torch._int_mm` and falls back to the FP16 mixed path (the memory savings hold, but decode speed does not noticeably improve); the 71.8% greedy match reflects a correctness-first implementation without PTQ/AWQ calibration.\n\n---\n\n## Quick start\n\n```bash\ngit clone https://github.com/psmarter/mini-infer \u0026\u0026 cd mini-infer\npip install -e \".[serve,dev]\"\n\n# No model weights needed; verify the serving interface right away\nmini-infer-serve --dry-run --port 8000\n\n# Real model (requires Qwen2.5-7B-Instruct)\nmini-infer-serve --model /path/to/Qwen2.5-7B --port 8000\n\n# Enable CUDA Graph + W8A8 quantization\nmini-infer-serve --model /path/to/Qwen2.5-7B --use-cuda-graph --quant-mode w8a8 --port 8000\n```\n\n```bash\ncurl http://localhost:8000/v1/chat/completions \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"model\":\"mini-infer\",\"messages\":[{\"role\":\"user\",\"content\":\"Hello\"}],\"stream\":false}'\n```\n\nA complete Python example (streaming / multi-turn chat) is in [`examples/openai_client.py`](examples/openai_client.py).\n\n---\n\n## Architecture\n\n```mermaid\ngraph TD\n    A[\"HTTP / CLI requests\"] --\u003e B[\"AsyncEngine\\nbackground step loop\"]\n    B --\u003e C[\"LLMEngine\\nContinuous Batching main loop\"]\n    C --\u003e D[\"Scheduler\\nwaiting / running / swapped / prefilling\"]\n    C --\u003e E[\"KVCacheManager\\nBlockTable + FreeBlockPool + Prefix Cache\"]\n    C --\u003e F[\"ModelRunner\\nprefill + decode_batch\"]\n    F --\u003e G[\"PagedAttention\\nflash_attn block_table\"]\n    F --\u003e H[\"CUDA Graph\\ndecode replay\"]\n    F --\u003e I[\"QuantLinear\\nW8A8 / mixed fallback\"]\n\n    subgraph dist [\"Distributed extensions\"]\n        J[\"TPEngine\\nNCCL all-reduce\"]\n        K[\"EPEngine\\nMoE all-to-all\"]\n    end\n\n    subgraph algo [\"Algorithm extensions\"]\n        L[\"SpecEngine\\ndraft + target\"]\n        M[\"PDEngine\\nPrefill/Decode split\"]\n    end\n\n    C --\u003e dist\n    C --\u003e 
algo\n```\n\nDetailed module descriptions are in [docs/architecture.md](docs/architecture.md).\n\n---\n\n## Capability status\n\n| Capability | Status | Key figures |\n|------|------|---------|\n| Paged KV Cache + Continuous Batching | ✅ Main path | Core of the main inference path |\n| Chunked Prefill | ✅ Main path | ITL spike −57%–67% |\n| Prefix Caching (block-level hash + LRU) | ✅ Main path | TTFT −22% |\n| True PagedAttention (flash_attn block_table) | ✅ Main path | batch=8 reaches **100%** of HF |\n| OpenAI Chat Completions HTTP API | ✅ Main path | SSE streaming / non-streaming |\n| Speculative Decoding (0.5B draft + 7B target) | 🔬 Standalone experiment | acceptance 55.85% (SpecEngine, not wired into the serve CLI) |\n| CUDA Graph (static capture of decode_batch) | 🔬 Standalone experiment | decode latency −28.9% (`--use-cuda-graph` experimental flag) |\n| Flash Decoding (Triton split-K) | 🔬 Standalone experiment | 3.31× vs standard Triton, SM 9%→103% |\n| Triton decode attention kernel | 🔬 Standalone experiment | Compared against flash_attn; not wired into the main path |\n| Tensor Parallelism (NCCL all-reduce) | 🔬 Standalone experiment | greedy output matches single-GPU output (correctness check) |\n| MLA (DeepSeek-V2/V3 latent cache) | 🔬 Standalone experiment | cache size −56.25% vs GQA |\n| MoE Expert Parallelism (synthetic workload) | 🔬 Standalone experiment | EP grouped / dense = 2.500× |\n| W8A8 quantization (per-channel int8) | 🔧 Prototype | GPU memory −32.4%, greedy match 71.8% |\n| PD disaggregation (two processes on one machine) | 🔧 Prototype | Three-stage TTFT breakdown (prefill/transfer/decode) |\n\n\u003e ✅ Main path: wired into the full serving path; can be verified end to end through the HTTP API\n\u003e 🔬 Standalone experiment: standalone benchmark script with quantitative data; not wired into the main serving path\n\u003e 🔧 Prototype: functionality implemented, correctness-first, with limitations (see notes ¹ and ²)\n\n---\n\n## Directory structure\n\n```\nmini_infer/\n├─ core/        # EngineConfig, Request, SamplingParams\n├─ runtime/     # LLMEngine, Scheduler, AsyncEngine, SpecEngine, PDEngine\n├─ cache/       # KVCacheManager (BlockTable + Prefix Cache)\n├─ modeling/    # ModelRunner, quantization, MLA, MoE\n├─ kernels/     # PagedAttention, Triton decode, Flash Decoding\n├─ parallel/    # TP, EP, Replica, PP\n└─ serving/     # FastAPI server, OpenAI schema\n\nbenchmarks/     # One benchmark script per capability (21 in total)\ntests/          # 287 collected items (including parametrized expansions); most support dry_run and do not depend on model weights\n```\n\n`make test-fast` runs the full CPU dry-run test suite (about 10 s); `make test` also includes the GPU-specific tests.\n\n---\n\n## Documentation\n\n| Document | Contents |\n|------|------|\n| 
[docs/architecture.md](docs/architecture.md) | Package layout, module descriptions, request lifecycle |\n| [docs/benchmarks.md](docs/benchmarks.md) | Benchmark data and reproduction commands for every capability |\n| [docs/faq.md](docs/faq.md) | FAQ: installation, environment, CUDA Graph / W8A8 flags |\n| [docs/roadmap.md](docs/roadmap.md) | Future directions and known gaps |\n\n---\n\n## Differences from vLLM\n\n| Dimension | mini-infer | vLLM |\n|------|-----------|------|\n| **Goal** | Implement and measure key inference mechanisms from scratch | Production-grade: high throughput, multi-model, SLO guarantees |\n| **PagedAttention** | Same approach as vLLM (flash_attn block_table) | Same approach, more mature |\n| **Quantization** | Hand-written W8A8, greedy match 71.8% | Full PTQ / AWQ / GPTQ toolchain |\n| **Model coverage** | Qwen2.5 / DeepSeek-V2 (synthetic MoE) | Dozens of architectures, adapted automatically |\n| **Scheduler** | Hand-written, four queues + chunked prefill | Full SLO support, KV-sharing-aware |\n| **Deployment** | Single-machine prototype | K8s, multi-node RDMA, full monitoring |\n\n---\n\n## Environment\n\n| Dependency | Version |\n|------|------|\n| Python | 3.10+ |\n| PyTorch | 2.1.2+cu121 |\n| transformers | 4.43.4 |\n| flash-attn | 2.5.9.post1 (`block_size` must be a multiple of 256) |\n| CUDA | 12.1 / RTX 4090 |\n\n---\n\n## License\n\nMIT\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpsmarter%2Fmini-infer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpsmarter%2Fmini-infer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpsmarter%2Fmini-infer/lists"}