https://github.com/michaelasper/kir-ai
- Host: GitHub
- URL: https://github.com/michaelasper/kir-ai
- Owner: michaelasper
- Created: 2026-05-08T02:42:01.000Z (about 2 months ago)
- Default Branch: main
- Last Pushed: 2026-05-11T03:45:08.000Z (about 2 months ago)
- Last Synced: 2026-05-11T05:15:34.366Z (about 2 months ago)
- Language: Rust
- Size: 6.65 MB
- Stars: 0
- Watchers: 0
- Forks: 1
- Open Issues: 69
Metadata Files:
- Readme: README.md
README
# kir-ai
Rust-first local inference on Apple Silicon with explicit, OpenAI-compatible runtime boundaries.
[![License][license-shield]][license-url]
[![CI][ci-shield]][ci-url]
[![Release][release-shield]][release-url]
[![Rust][rust-shield]][rust-url]
[![Apple Metal][metal-shield]][metal-url]
[![Local Inference][inference-shield]][docs-setup]
**kir-ai** is an OpenAI-shaped local inference workspace for Apple Silicon that keeps core inference, request contracts, and safety checks in Rust. The project is built around explicit runtime selection: protocol verification, native Metal execution, and MLX sidecar interop all live behind the same CLI/server surface with strict capability boundaries.
## Why / The Problem
Many local inference stacks are easiest to ship with ad-hoc Python glue, but that coupling makes behaviour harder to audit and scale. `kir-ai` addresses this by making protocol handling and runtime orchestration explicit in a Rust workspace while preserving the API shape your clients already expect.
You get an engine that:
- exposes OpenAI-style endpoints consistently,
- fails closed for unsupported request features,
- separates testing pathways from model-serving pathways,
- and keeps model lifecycle (plan/pull/verify/serve) under explicit commands.
## Features / Highlights
- **OpenAI-compatible edge** for `/v1/chat/completions`, `/v1/completions`, streaming SSE, and model listing.
- **Strict capability gating** in request validation and runtime mapping; unsupported features return stable errors instead of silent fallback behaviour.
- **Two serving modes**: protocol-test mode for client contract work and snapshot-backed serving for native Metal/MLX paths.
- **First-class native Metal support** for Qwen and Gemma text pipelines, with bounded prefill and typed cache identities.
- **Model lifecycle tooling** in `llm-engine`: `model plan`, `model list`, `model inspect`, `model verify`, and `model pull`.
- **Operational controls** with admin endpoints for metrics, snapshot verification/pull, lane-level request cancellation, and model metadata.
- **Failure-safe semantics** through request validation that rejects unsafe or malformed input (`max_tokens`, sampling controls, stop sequences, tool schemas, malformed JSON, and token budgets).
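
The fail-closed behaviour listed above can be pictured with a small sketch. Everything in it is hypothetical: the `ChatRequest` struct, the `ValidationError` variants, and the numeric limits are illustrative assumptions, not kir-ai's actual types or thresholds. The point is only the shape of the check: reject on the first problem rather than clamp or ignore.

```rust
// Hypothetical sketch of fail-closed request validation.
// None of these types or limits are taken from the kir-ai source.

#[derive(Debug, PartialEq)]
enum ValidationError {
    MaxTokensOutOfRange { got: u32, limit: u32 },
    TemperatureOutOfRange,
    TooManyStopSequences { got: usize, limit: usize },
    UnsupportedFeature(&'static str),
}

struct ChatRequest {
    max_tokens: u32,
    temperature: f32,
    stop: Vec<String>,
    // Stands in for any request feature the selected runtime cannot honour.
    wants_logprobs: bool,
}

// Assumed limits, chosen only for illustration.
const MAX_NEW_TOKENS: u32 = 256;
const MAX_STOP_SEQUENCES: usize = 4;

/// Reject the request on the first problem instead of silently
/// clamping values or dropping unsupported fields.
fn validate(req: &ChatRequest) -> Result<(), ValidationError> {
    if req.max_tokens == 0 || req.max_tokens > MAX_NEW_TOKENS {
        return Err(ValidationError::MaxTokensOutOfRange {
            got: req.max_tokens,
            limit: MAX_NEW_TOKENS,
        });
    }
    // `contains` is false for NaN, so NaN is rejected as well.
    if !(0.0..=2.0).contains(&req.temperature) {
        return Err(ValidationError::TemperatureOutOfRange);
    }
    if req.stop.len() > MAX_STOP_SEQUENCES {
        return Err(ValidationError::TooManyStopSequences {
            got: req.stop.len(),
            limit: MAX_STOP_SEQUENCES,
        });
    }
    if req.wants_logprobs {
        return Err(ValidationError::UnsupportedFeature("logprobs"));
    }
    Ok(())
}

fn main() {
    let ok = ChatRequest {
        max_tokens: 8,
        temperature: 0.7,
        stop: vec![],
        wants_logprobs: false,
    };
    let unsupported = ChatRequest {
        max_tokens: 8,
        temperature: 0.7,
        stop: vec![],
        wants_logprobs: true,
    };
    println!("{:?}", validate(&ok));
    println!("{:?}", validate(&unsupported));
}
```

Returning a typed error for each rejection is what makes "stable errors" possible on the client side: the same invalid request always maps to the same variant.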
## When to Use
Use `kir-ai` when you want a local inference server that is explicit about execution mode and protocol behaviour. If you are iterating on client integration, choose protocol-test mode first. If you are preparing model-backed inference runs, switch to snapshot-based serving.
Avoid `kir-ai` as a first step if your immediate need is a managed multi-user cloud inference platform.
## Quick Start
1. Install and prepare the workspace.
```sh
curl -fsSL https://raw.githubusercontent.com/michaelasper/kir-ai/main/scripts/install-macos.sh | bash
```
2. Start the protocol test backend.
```sh
kirai
```
3. Send a smoke request.
```sh
curl -s http://127.0.0.1:3000/v1/chat/completions \
  -H 'content-type: application/json' \
  -d '{
    "model": "local-qwen36",
    "messages": [{"role": "user", "content": "hello"}],
    "max_tokens": 8
  }' | jq
```
Expected response: an OpenAI-shaped `chat.completion` JSON object that reports the model as `local-qwen36`.
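
If you would rather send the same smoke request from Rust, the following is a minimal sketch that uses only the standard library, so it assumes no particular HTTP client crate. The helper name `build_post` is hypothetical, and the address and request body simply mirror the `curl` call above. The response is printed raw, headers included, and no attempt is made to decode chunked transfer encoding.

```rust
use std::io::{Read, Write};
use std::net::TcpStream;
use std::time::Duration;

/// Build a minimal HTTP/1.1 POST request carrying a JSON body.
/// (Hypothetical helper for this sketch, not part of kir-ai.)
fn build_post(host: &str, path: &str, json_body: &str) -> String {
    format!(
        "POST {path} HTTP/1.1\r\n\
         Host: {host}\r\n\
         Content-Type: application/json\r\n\
         Content-Length: {len}\r\n\
         Connection: close\r\n\
         \r\n\
         {json_body}",
        // Content-Length counts bytes, which is what `str::len` returns.
        len = json_body.len()
    )
}

fn main() -> std::io::Result<()> {
    // Same payload as the curl smoke request.
    let body = r#"{"model":"local-qwen36","messages":[{"role":"user","content":"hello"}],"max_tokens":8}"#;
    let request = build_post("127.0.0.1:3000", "/v1/chat/completions", body);

    let mut stream = match TcpStream::connect("127.0.0.1:3000") {
        Ok(stream) => stream,
        Err(err) => {
            eprintln!("server not reachable on 127.0.0.1:3000: {err}");
            return Ok(());
        }
    };
    stream.set_read_timeout(Some(Duration::from_secs(30)))?;
    stream.write_all(request.as_bytes())?;

    // `Connection: close` lets us read until the server closes the socket.
    let mut response = String::new();
    stream.read_to_string(&mut response)?;
    println!("{response}");
    Ok(())
}
```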
### Install and Runtime Options
- `KIR_AI_DIR`, `KIR_AI_REF` choose install location and revision.
- `KIR_AI_SKIP_BUILD=1` for dependency setup without compile.
- `KIR_AI_SKIP_PYTHON=1` for Rust-only install paths.
- `KIR_AI_FORCE_CLONE=1` to force a fresh checkout path.
For full script controls, see [`docs/setup.md`][docs-setup].
### Serve with a Snapshot
```sh
kirai serve \
  --snapshot .llm-models/ \
  --model-id local-qwen36 \
  --max-new-tokens 256 \
  --max-prefill-tokens 2048
```
For MLX manifests, set the loopback endpoint:
```sh
kirai serve \
  --snapshot .llm-models/ \
  --loader mlx \
  --family qwen \
  --model-id local-qwen35-4b \
  --mlx-endpoint http://127.0.0.1:8080/v1
```
## Usage
### Core Endpoints
- `GET /health`
- `GET /v1/models`
- `GET /admin/models` and `GET /admin/models/{alias}`
- `POST /v1/chat/completions` and `POST /v1/completions`
- `POST /admin/models/{alias}/verify`
- `POST /admin/models/{alias}/plan`
- `POST /admin/models/{alias}/pull`
- `POST /admin/requests/{request_id}/cancel`
- `GET /admin/metrics`
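
For client code, the templated admin paths above can be built with a small helper. This is a sketch only: `ModelAction`, `model_route`, and `cancel_route` are hypothetical names, and the code assumes the alias and request ID are already URL-safe, so no percent-encoding is applied.

```rust
/// Per-model admin operations from the endpoint list.
/// (Hypothetical client-side type, not part of kir-ai.)
#[derive(Clone, Copy, Debug)]
enum ModelAction {
    Inspect,
    Verify,
    Plan,
    Pull,
}

/// Return the (HTTP method, path) pair for a per-model admin call.
fn model_route(alias: &str, action: ModelAction) -> (&'static str, String) {
    match action {
        ModelAction::Inspect => ("GET", format!("/admin/models/{alias}")),
        ModelAction::Verify => ("POST", format!("/admin/models/{alias}/verify")),
        ModelAction::Plan => ("POST", format!("/admin/models/{alias}/plan")),
        ModelAction::Pull => ("POST", format!("/admin/models/{alias}/pull")),
    }
}

/// Return the (HTTP method, path) pair for cancelling a request.
fn cancel_route(request_id: &str) -> (&'static str, String) {
    ("POST", format!("/admin/requests/{request_id}/cancel"))
}

fn main() {
    let actions = [
        ModelAction::Inspect,
        ModelAction::Verify,
        ModelAction::Plan,
        ModelAction::Pull,
    ];
    for action in actions {
        let (method, path) = model_route("local-qwen36", action);
        println!("{method} {path}");
    }
    // "req-123" is a made-up request ID for illustration.
    let (method, path) = cancel_route("req-123");
    println!("{method} {path}");
}
```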
For request and response examples, see [`docs/getting-started.md`][docs-getting-started].
For the full HTTP contract, see [`docs/http-api-reference.md`][http-api-doc].
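
The streaming SSE responses listed under Features can be consumed with a few lines of client code. The sketch below assumes OpenAI-style framing, where each event is a single `data: <json>` line and the stream ends with a `data: [DONE]` sentinel. Whether kir-ai emits exactly this framing should be confirmed against the HTTP API reference. `SseEvent` and `parse_sse` are hypothetical names, and multi-line `data:` events are deliberately not handled.

```rust
/// One decoded event from an SSE body.
/// (Hypothetical client-side type, not part of kir-ai.)
#[derive(Debug, PartialEq)]
enum SseEvent {
    /// Raw JSON payload of one chunk, left unparsed.
    Data(String),
    /// The `[DONE]` end-of-stream sentinel.
    Done,
}

/// Parse an SSE body that has already been read into memory.
/// Simplification: each `data:` line is treated as its own event.
fn parse_sse(body: &str) -> Vec<SseEvent> {
    let mut events = Vec::new();
    for line in body.lines() {
        // Blank separator lines, comments, and other fields are skipped.
        let Some(payload) = line.strip_prefix("data:") else {
            continue;
        };
        let payload = payload.trim_start();
        if payload == "[DONE]" {
            events.push(SseEvent::Done);
            break;
        }
        events.push(SseEvent::Data(payload.to_string()));
    }
    events
}

fn main() {
    // Made-up chunk payloads, used only to exercise the parser.
    let body = "data: {\"chunk\":1}\n\ndata: {\"chunk\":2}\n\ndata: [DONE]\n\n";
    for event in parse_sse(body) {
        println!("{event:?}");
    }
}
```

A real client would feed lines to the parser as they arrive rather than buffering the whole body, and would pass each `Data` payload to a JSON parser.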
## Native Text Snapshot Flow
Use `kirai` model commands to plan, inspect, verify, and pull profiles before serving.
```sh
kirai model plan Qwen/Qwen3-0.6B \
  --revision main \
  --profile qwen3-dense-safetensors-bf16

kirai model pull Qwen/Qwen3.6-35B-A3B \
  --metadata-only \
  --model-home .llm-models

kirai model inspect .llm-models/
```
To run the same commands directly from source during development, use `cargo run -p llm-engine -- ...` from a local checkout.
## Documentation Map
| Need | Document |
| --- | --- |
| Start with a working response | [`docs/getting-started.md`][docs-getting-started] |
| Developer machine setup | [`docs/setup.md`][docs-setup] |
| Run server and native text paths | [`docs/how-to-run-server.md`][docs-run-server] |
| Model snapshot lifecycle | [`docs/how-to-manage-models.md`][docs-models] |
| CLI reference | [`docs/cli-reference.md`][docs-cli] |
| HTTP API reference | [`docs/http-api-reference.md`][http-api-doc] |
| Configuration and formats | [`docs/configuration-reference.md`][docs-config] |
| Project architecture | [`docs/architecture.md`][docs-architecture] |
| CI and release details | [`docs/ci-and-release.md`][docs-ci-release] |
| Development guide | [`docs/development.md`][docs-dev] |
The product direction and implementation milestones are tracked in [`rust-metal-inference-engine-north-star.md`][north-star].
## Current Limitations
- Native Metal text execution currently covers dense Qwen, Qwen3/Qwen3.6 MoE, and Gemma 4 paths.
- Native paths are correctness-first and intentionally conservative for sampling and throughput.
- The server does not execute `generation_config.json` or downloaded chat templates (`chat_template.jinja`) as runtime config.
- Tool-call and JSON-object validation paths may buffer to preserve fail-closed semantics.
- Snapshot serving requires explicit backend mode; implicit no-snapshot stub serving is not supported.
## Compatibility
- Rust toolchain version for the workspace: `1.95`
- Runtime target profile: Apple Silicon first-class, macOS-first CI.
## License
This project is licensed under the MIT License. See the [official MIT license text][license-url] for the terms.
[ci-shield]: https://img.shields.io/github/actions/workflow/status/michaelasper/kir-ai/ci.yml?branch=main&style=flat-square&label=ci
[ci-url]: https://github.com/michaelasper/kir-ai/actions/workflows/ci.yml
[release-shield]: https://img.shields.io/github/actions/workflow/status/michaelasper/kir-ai/release.yml?label=release&style=flat-square
[release-url]: https://github.com/michaelasper/kir-ai/actions/workflows/release.yml
[rust-shield]: https://img.shields.io/badge/rust-1.95-f5a97f?style=flat-square&logo=rust&logoColor=white
[rust-url]: https://www.rust-lang.org/
[metal-shield]: https://img.shields.io/badge/apple%20metal-native-c6a0f6?style=flat-square&logo=apple&logoColor=white
[metal-url]: https://developer.apple.com/metal/
[license-shield]: https://img.shields.io/badge/license-MIT-a6da95?style=flat-square&logo=opensourceinitiative&logoColor=white
[license-url]: https://opensource.org/licenses/MIT
[inference-shield]: https://img.shields.io/badge/local-inference-91d7e3?style=flat-square
[docs-getting-started]: docs/getting-started.md
[docs-setup]: docs/setup.md
[docs-run-server]: docs/how-to-run-server.md
[docs-models]: docs/how-to-manage-models.md
[docs-cli]: docs/cli-reference.md
[http-api-doc]: docs/http-api-reference.md
[docs-config]: docs/configuration-reference.md
[docs-architecture]: docs/architecture.md
[docs-ci-release]: docs/ci-and-release.md
[docs-dev]: docs/development.md
[north-star]: rust-metal-inference-engine-north-star.md