{"id":21540361,"url":"https://github.com/eugenehp/llama-cpp-rs","last_synced_at":"2026-06-04T05:01:32.451Z","repository":{"id":264304281,"uuid":"892989477","full_name":"eugenehp/llama-cpp-rs","owner":"eugenehp","description":"A wrapper around the llama-cpp library for rust, including new Sampler API from llama-cpp.","archived":false,"fork":false,"pushed_at":"2026-06-01T02:15:50.000Z","size":9945,"stargazers_count":24,"open_issues_count":0,"forks_count":11,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-06-01T04:14:10.298Z","etag":null,"topics":["ggml","llamacpp","rust"],"latest_commit_sha":null,"homepage":"https://crates.io/crates/llama-cpp-4","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/eugenehp.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE-APACHE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2024-11-23T08:30:15.000Z","updated_at":"2026-06-01T02:15:54.000Z","dependencies_parsed_at":"2026-01-26T18:03:08.706Z","dependency_job_id":"b207276c-03d3-42ff-b3da-9db09f2ddad4","html_url":"https://github.com/eugenehp/llama-cpp-rs","commit_stats":null,"previous_names":["eugenehp/llama-cpp-rs"],"tags_count":52,"template":false,"template_full_name":null,"purl":"pkg:github/eugenehp/llama-cpp-rs","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eugenehp%2Fllama-cpp-rs","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eugenehp%2Fllama-cpp-rs/tags","releases_url":"https://repos.ecosyste.
ms/api/v1/hosts/GitHub/repositories/eugenehp%2Fllama-cpp-rs/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eugenehp%2Fllama-cpp-rs/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/eugenehp","download_url":"https://codeload.github.com/eugenehp/llama-cpp-rs/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eugenehp%2Fllama-cpp-rs/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33890052,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-04T02:00:06.755Z","response_time":64,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ggml","llamacpp","rust"],"created_at":"2024-11-24T04:18:11.751Z","updated_at":"2026-06-04T05:01:32.440Z","avatar_url":"https://github.com/eugenehp.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 🦙 llama-cpp-rs\n\n[![Crates.io](https://img.shields.io/crates/v/llama-cpp-4.svg)](https://crates.io/crates/llama-cpp-4)\n[![docs.rs](https://img.shields.io/docsrs/llama-cpp-4.svg)](https://docs.rs/llama-cpp-4)\n[![License](https://img.shields.io/crates/l/llama-cpp-4.svg)](https://crates.io/crates/llama-cpp-4)\n\nSafe Rust bindings to [llama.cpp](https://github.com/ggml-org/llama.cpp), tracking upstream closely.\n\n| Crate | Description | crates.io 
|\n|---|---|---|\n| [`llama-cpp-4`](llama-cpp-4/) | Safe high-level API | [![](https://img.shields.io/crates/v/llama-cpp-4.svg)](https://crates.io/crates/llama-cpp-4) |\n| [`llama-cpp-sys-4`](llama-cpp-sys-4/) | Raw bindgen bindings | [![](https://img.shields.io/crates/v/llama-cpp-sys-4.svg)](https://crates.io/crates/llama-cpp-sys-4) |\n\n**llama.cpp version:** `94a220cd6` (Jun 2026) — includes\n[TurboQuant (PR #21038)](#turboQuant--attention-rotation),\n[MTP / multi-token-prediction speculative decoding (PR #22673)](https://github.com/ggml-org/llama.cpp/pull/22673), and\nupstream **next-n** embedding hooks used by MTP (`llama_set_embeddings_nextn`).\n\n---\n\n## Examples\n\n| Package name | Directory | Description |\n|---|---|---|\n| `simple` | [`examples/simple/`](examples/simple/) | Single-turn text completion from CLI or Hugging Face |\n| `chat` | [`examples/chat/`](examples/chat/) | Interactive multi-turn chat REPL |\n| `embeddings` | [`examples/embeddings/`](examples/embeddings/) | Batch embedding with cosine similarity |\n| `split-model-example` | [`examples/split_model/`](examples/split_model/) | Load sharded / split GGUF files |\n| `openai-server` | [`examples/server/`](examples/server/) | OpenAI-compatible HTTP server — chat, completions, embeddings, tools, files (mtmd), tokenize |\n| `mtmd` | [`examples/mtmd/`](examples/mtmd/) | Multimodal (vision / audio) inference (requires `--features mtmd`) |\n| `quantize` | [`examples/quantize/`](examples/quantize/) | Quantize a GGUF model with full typed API |\n| `turbo-quant` | [`examples/turbo-quant/`](examples/turbo-quant/) | TurboQuant demo — compare attn rotation on/off |\n| `incremental-chat` | [`examples/incremental-chat/`](examples/incremental-chat/) | Chat with incremental prefill — processes tokens while you type |\n| `mtp` | [`examples/mtp/`](examples/mtp/) | MTP speculative decoding via `MtpSession` (`--predict`, `--p-min`, draft loop) |\n\n---\n\n## Quick start\n\n```bash\ngit clone --recursive 
https://github.com/eugenehp/llama-cpp-rs\ncd llama-cpp-rs\n```\n\n### Interactive chat\n\n```bash\ncargo run -p chat -- \\\n    hf-model bartowski/Llama-3.2-3B-Instruct-GGUF Llama-3.2-3B-Instruct-Q4_K_M.gguf\n```\n\n### OpenAI-compatible server\n\n```bash\n# Starts on http://127.0.0.1:8080\ncargo run -p openai-server -- \\\n    hf-model bartowski/Llama-3.2-3B-Instruct-GGUF Llama-3.2-3B-Instruct-Q4_K_M.gguf\n```\n\nFull REST API reference: [`examples/server/README.md`](examples/server/README.md).\n\n| Method | Path | Description |\n|--------|------|-------------|\n| GET | `/health`, `/v1/health` | Liveness (no auth) |\n| GET | `/v1/models` | Loaded model metadata |\n| POST | `/v1/chat/completions`, `/chat/completions` | Chat · streaming · tools |\n| POST | `/v1/completions`, `/completions` | Raw completion · streaming |\n| POST | `/v1/embeddings`, `/embeddings` | L2-normalised embeddings |\n| POST | `/tokenize`, `/detokenize` | [llama.cpp-compatible](https://github.com/ggml-org/llama.cpp/tree/master/tools/server) token helpers |\n| POST/GET/DELETE | `/v1/files/...` | File store for multimodal (`--features mtmd`, `--mmproj`) |\n\nLegacy paths without `/v1` mirror upstream [llama-server](https://github.com/ggml-org/llama.cpp/tree/master/tools/server).\nNot implemented here (use upstream server instead): `/v1/responses`, `/v1/messages`, `/rerank`, `/slots`, `/props`.\n\n### Using prebuilt native libraries (skip CMake compile)\n\n`llama-cpp-sys-4` can consume precompiled llama/ggml libraries via env vars.\nThis is useful for CI pipelines that publish native artifacts once and reuse\nthem in downstream repos (for example, speeding up a separate app build).\n\n```bash\n# Directory containing prebuilt libs in one of:\n#   \u003cdir\u003e, \u003cdir\u003e/lib, \u003cdir\u003e/lib64, \u003cdir\u003e/bin\nexport LLAMA_PREBUILT_DIR=/path/to/prebuilt\n\n# Optional: force dynamic linking mode for prebuilt artifacts.\n# Defaults to the crate's normal link mode for the active 
feature set.\n# export LLAMA_PREBUILT_SHARED=1\n\ncargo build -p your-app --features \"q1,vulkan\"\n```\n\nNotes:\n- `q1` compatibility is determined by the prebuilt artifact itself — publish\n  separate artifacts per feature/backend tuple (`q1+vulkan`, `q1+metal`, ...).\n- `build.rs` still generates Rust bindings, but skips the expensive CMake\n  compile when `LLAMA_PREBUILT_DIR` is set.\n\nBackend feature coverage (practical targets):\n- `metal`  → macOS (Apple Silicon and Intel Macs)\n- `vulkan` → Linux/Windows (cross-vendor desktop GPUs)\n- `webgpu` → Linux/Windows (experimental; requires Dawn/WebGPU-native stack)\n- `cuda`   → Linux/Windows with NVIDIA CUDA toolkit (experimental in CI)\n- `hip`    → Linux ROCm/HIP environments (experimental in CI)\n- `opencl` → Linux/Windows with OpenCL SDK/runtime (experimental in CI)\n- `blas`   → CPU acceleration (Linux/macOS/Windows)\n\n### Prebuilt Feature Benchmark Results\n\nThe `prebuilt` feature flag is the foundation for automatic prebuilt artifact management (artifact download is not implemented yet; see Implementation Status below). Benchmark results (Apple Silicon M2, macOS 14.4):\n\n| Configuration | Build Type | Time | Improvement |\n|---------------|------------|------|-------------|\n| Base (Static) | Debug | 11.99s | Baseline |\n| Base + `prebuilt` | Debug | 11.01s | **8% faster** |\n| Dynamic Linking | Debug | 26.80s | -123% (slower) |\n| Dynamic + `prebuilt` | Debug | 27.47s | -129% (slower) |\n| Base (Static) | Release | 26.01s | Baseline |\n| Dynamic Linking | Release | 26.79s | -3% (slower) |\n\n**Key Insights:**\n- ✅ **Static linking + prebuilt**: 8% faster debug builds (11.99s → 11.01s)\n- ✅ **Release builds**: Minimal difference between static/dynamic\n- ✅ **Development workflow**: Prebuilt feature provides best iteration speed\n- 🚀 **CI/CD potential**: When fully implemented with artifact caching, expect 50-80% speedups for complex builds\n\n**Usage:**\n```bash\n# Enable prebuilt feature for faster development\ncargo build --features prebuilt\n\n# Combine with other features\ncargo build --features \"prebuilt,vulkan\"\n\n# Release builds (prebuilt provides minimal benefit)\ncargo build --release --features 
prebuilt\n```\n\n**Implementation Status:**\n- ✅ Feature flag infrastructure complete\n- ✅ Automatic feature detection and configuration\n- ✅ Safe fallback to local compilation\n- 📋 **TODO**: Actual artifact download and caching (foundation ready)\n\nWhen fully implemented, the prebuilt feature will automatically:\n1. Download matching prebuilt artifacts from GitHub releases\n2. Cache them in `target/llama-prebuilt-cache/`\n3. Use cached artifacts for subsequent builds\n4. Fall back gracefully to local compilation if artifacts are unavailable\n\nExample requests against the OpenAI-compatible server:\n\n```bash\n# Chat completion (max_completion_tokens is also accepted)\ncurl http://127.0.0.1:8080/v1/chat/completions \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"messages\":[{\"role\":\"user\",\"content\":\"Hello!\"}], \"max_tokens\":128}'\n\n# Streaming\ncurl http://127.0.0.1:8080/v1/chat/completions \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"messages\":[{\"role\":\"user\",\"content\":\"Count to 5\"}], \"stream\":true}'\n\n# Embeddings\ncurl http://127.0.0.1:8080/v1/embeddings \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"input\": [\"Hello world\", \"Bonjour le monde\"]}'\n\n# Tokenize / detokenize (llama.cpp server-compatible)\ncurl http://127.0.0.1:8080/tokenize \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"content\":\"Hello\",\"add_special\":false}'\n```\n\nWith `--api-key`, pass `Authorization: Bearer \u003ckey\u003e` on every route except `/health` and `/v1/health`.\n\n### Text generation (library)\n\n```rust\nuse llama_cpp_4::{\n    llama_backend::LlamaBackend,\n    llama_batch::LlamaBatch,\n    model::{params::LlamaModelParams, AddBos, LlamaModel, Special},\n    context::params::LlamaContextParams,\n    sampling::LlamaSampler,\n};\nuse std::num::NonZeroU32;\n\nlet backend = LlamaBackend::init()?;\nlet model = LlamaModel::load_from_file(\u0026backend, 
\"model.gguf\", \u0026LlamaModelParams::default())?;\nlet mut ctx = model.new_context(\u0026backend, LlamaContextParams::default())?;\n\nlet tokens = model.str_to_token(\"Hello, world!\", AddBos::Always)?;\nlet mut batch = LlamaBatch::new(512, 1);\nfor (i, \u0026tok) in tokens.iter().enumerate() {\n    batch.add(tok, i as i32, \u0026[0], i == tokens.len() - 1)?;\n}\nctx.decode(\u0026mut batch)?;\n\nlet sampler = LlamaSampler::chain_simple([LlamaSampler::greedy()]);\n// ... decode loop\n```\n\n---\n\n## Quantization\n\nThe `llama_cpp_4::quantize` module provides a fully typed Rust API for all\nquantization options.\n\n```rust\nuse llama_cpp_4::quantize::{GgmlType, LlamaFtype, QuantizeParams, TensorTypeOverride};\n\n// Basic — quantize to Q4_K_M\nlet params = QuantizeParams::new(LlamaFtype::MostlyQ4KM)\n    .with_nthread(8)\n    .with_quantize_output_tensor(true);\n\nllama_cpp_4::model_quantize(\"model-f16.gguf\", \"model-q4km.gguf\", \u0026params).unwrap();\n\n// Advanced — keep output tensor in F16, prune layers 28-31\nlet params = QuantizeParams::new(LlamaFtype::MostlyQ5KM)\n    .with_tensor_type_override(TensorTypeOverride::new(\"output\", GgmlType::F16).unwrap())\n    .with_pruned_layers(28..=31);\n\nllama_cpp_4::model_quantize(\"model-f16.gguf\", \"model-q5km-pruned.gguf\", \u0026params).unwrap();\n```\n\nFrom the CLI:\n\n```bash\n# List all available quantization types\ncargo run -p quantize -- --list-types\n\n# Quantize with auto output name\ncargo run -p quantize -- model-f16.gguf Q4_K_M\n\n# Override a specific tensor type\ncargo run -p quantize -- --tensor-type output=F16 model-f16.gguf Q5_K_M\n\n# Dry-run: show size without writing\ncargo run -p quantize -- --dry-run model-f16.gguf Q4_K_M\n```\n\n---\n\n## TurboQuant — attention rotation\n\n**TurboQuant** (llama.cpp [PR #21038](https://github.com/ggml-org/llama.cpp/pull/21038))\napplies a [Hadamard rotation](https://en.wikipedia.org/wiki/Hadamard_matrix) to the Q, K,\nand V tensors before they are stored 
in the KV cache.\n\n### Why it matters\n\nAttention activations have large outlier values on some dimensions that make\nquantization hard.  The rotation spreads these outliers evenly so the KV cache\ncan be stored in aggressive formats (Q4_0, Q5_0) with much lower quality\nloss:\n\n| KV cache type | Without TurboQuant | With TurboQuant | VRAM vs F16 |\n|:---:|:---:|:---:|:---:|\n| F16 (baseline) | — | — | 100% |\n| Q8_0 | +0.003 PPL | +0.003 PPL | 53% |\n| Q5_1 | +61.70 PPL | **+0.44 PPL** | 37% |\n| Q5_0 | +17.28 PPL | **+0.55 PPL** | 34% |\n| Q4_1 | +212.5 PPL | **+8.65 PPL** | 31% |\n| Q4_0 | +62.02 PPL | **+32.6 PPL** | 28% |\n\n*PPL delta vs F16 baseline on Qwen3 0.6B BF16 — source: llama.cpp PR #21038.*\n\n### Measured KV-cache space savings\n\nNumbers below come from a benchmark run against **Qwen2.5-0.5B-Instruct**\n(24 layers, 2 KV heads, 64 head-dim), obtained by calling `ggml_row_size()`\ndirectly against the compiled GGML library in this repo's build tree.\n\n```\nModel : Qwen2.5-0.5B-Instruct  (24 layers, 2 KV heads, 64 head-dim)\n\nConfig                 B/row  B/elem     KV @2K      KV @32K  Saved@32K  Ratio\n--------------------  ------  ------  ---------  ----------  ---------  -----\nF16  (baseline)          128  2.0000   24.00 MB   384.00 MB      —       1.00x\nQ8_0 + TurboQuant         68  1.0625   12.75 MB   204.00 MB  180.0 MB   1.88x\nQ5_1 + TurboQuant         48  0.7500    9.00 MB   144.00 MB  240.0 MB   2.67x\nQ5_0 + TurboQuant         44  0.6875    8.25 MB   132.00 MB  252.0 MB   2.91x  ← sweet spot\nQ4_1 + TurboQuant         40  0.6250    7.50 MB   120.00 MB  264.0 MB   3.20x\nQ4_0 + TurboQuant         36  0.5625    6.75 MB   108.00 MB  276.0 MB   3.56x\n```\n\nThe ratios are pure GGML block geometry and **scale identically to larger\nmodels** — for a 7B model (32 layers, 8 KV heads, 128 head-dim) multiply\nevery MB figure by ~10.7× (32·8·128 / 24·2·64); the ratios and % savings are the same.\n\n#### Sweet spot: Q5_0 + TurboQuant\n\n- **2.91× smaller** KV 
cache than vanilla F16 (saves **252 MB per 32 K\n  context window** on the 0.5B model, ~21 GB on a 70B model at 32 K ctx)\n- Only **+0.55 PPL** delta — essentially indistinguishable from F16 in practice\n- The same Q5_0 *without* TurboQuant gives +17.28 PPL (noticeably wrong output)\n- Q8_0 is the conservative zero-risk choice (1.88×, near-zero PPL cost)\n- Q4_0 gives maximum compression (3.56×) at the price of measurable but\n  tolerable quality loss with rotation on\n\n### Key properties\n\n- **Enabled automatically** for any model whose head dimension is a power of two\n  (covers essentially all modern transformers).\n- **No GGUF changes required** — it is a runtime transform of the KV cache only.\n- **Reversible** — the rotation is applied before storing and reversed before\n  computing attention, so the rotation itself adds no error beyond floating-point\n  rounding; any quality loss comes from quantizing the cache.\n- **Controlled via the `LLAMA_ATTN_ROT_DISABLE` env var** — set to `1` to opt out.\n\n### Using TurboQuant from Rust\n\n```rust\nuse llama_cpp_4::context::params::LlamaContextParams;\nuse llama_cpp_4::quantize::GgmlType;\n\n// TurboQuant is ON by default — just set a quantized KV cache type:\nlet ctx_params = LlamaContextParams::default()\n    .with_cache_type_k(GgmlType::Q5_0)   // ~34% of F16 VRAM\n    .with_cache_type_v(GgmlType::Q5_0);  // quality ≈ F16 thanks to rotation\n\nlet ctx = model.new_context(\u0026backend, ctx_params)?;\n```\n\n```rust\n// Disable rotation for a single context (e.g. 
benchmarking baseline):\nlet ctx_params = LlamaContextParams::default()\n    .with_cache_type_k(GgmlType::Q5_0)\n    .with_attn_rot_disabled(true);   // ← TurboQuant OFF for this context\n\nlet ctx = model.new_context(\u0026backend, ctx_params)?;\n```\n\n```rust\n// Global process-level toggle (call before creating any context):\nuse llama_cpp_4::quantize::{attn_rot_disabled, set_attn_rot_disabled};\n\nset_attn_rot_disabled(true);\nassert!(attn_rot_disabled());\n\nset_attn_rot_disabled(false); // restore\n```\n\n### Live demo\n\n```bash\n# API reference + PPL table (no model required)\ncargo run -p turbo-quant -- --show-api\n\n# Run both passes and compare outputs directly\ncargo run -p turbo-quant -- \\\n    --model model.gguf \\\n    --kv-type q5_0 \\\n    --prompt \"The capital of France is\" \\\n    --n-predict 16\n```\n\n---\n\n## MTP — multi-token-prediction speculative decoding\n\n[Upstream PR #22673](https://github.com/ggml-org/llama.cpp/pull/22673) added\nMTP draft heads to llama.cpp. The Rust API lives in\n[`llama_cpp_4::mtp`](llama-cpp-4/src/mtp.rs): build a target + draft context\npair, wrap them in [`MtpSession`](llama-cpp-4/src/mtp.rs), and drive the\nverify/accept loop from Rust.\n\n### 1. Context setup\n\nBoth contexts come from the **same MTP-capable GGUF**. The draft context must\nuse [`LlamaContextType::Mtp`](llama-cpp-4/src/context/params.rs) and\n`n_rs_seq \u003e= n_draft_max` (rollback snapshots for speculative verification):\n\n```rust\nuse llama_cpp_4::context::params::{LlamaContextParams, LlamaContextType};\n\nlet n_draft_max = 3;\n\nlet target = model.new_context(\u0026backend, LlamaContextParams::default())?;\nlet draft = model.new_context(\n    \u0026backend,\n    LlamaContextParams::default()\n        .with_ctx_type(LlamaContextType::Mtp)\n        .with_n_rs_seq(n_draft_max.max(4)), // headroom for rollback\n)?;\n```\n\n### 2. 
Session config and creation\n\n[`MtpSessionConfig`](llama-cpp-4/src/mtp.rs) maps to upstream\n`common_params_speculative_draft`:\n\n| Field | Meaning | Typical value |\n|---|---|---|\n| `n_seq` | Parallel sequences | `1` |\n| `n_draft_max` | Max tokens drafted per round | `1`–`3` (model-dependent) |\n| `p_min` | Drop draft tokens below this probability | `0.0` (upstream default since #23269) |\n| `n_min` | Minimum drafts to propose | `0` |\n\n```rust\nuse llama_cpp_4::mtp::{MtpSession, MtpSessionConfig};\n\n// Shorthand (defaults: n_min=0, p_min=0.0)\nlet mut session = MtpSession::new(\u0026target, \u0026draft, 1, n_draft_max)?;\n\n// Full control\nlet config = MtpSessionConfig::new(1, n_draft_max)\n    .with_p_min(0.0)\n    .with_n_min(0);\nlet mut session = MtpSession::new_with_config(\u0026target, \u0026draft, config)?;\n\nassert!(session.need_embd_pre_norm()); // MTP: next-n embeddings (upstream name)\nassert!(!session.need_embd());         // post-norm / seq embeddings not used\n```\n\nThe Rust API still uses `*_pre_norm` names; upstream renamed the C API to\n`llama_set_embeddings_nextn` / `common_speculative_need_embd_nextn`.\nUpstream configures next-n extraction on both contexts during session init;\nyou normally do **not** need to call\n[`LlamaContext::set_embeddings_pre_norm`](llama-cpp-4/src/context.rs) yourself.\n\n### 3. Speculative decode loop (outline)\n\nAfter every `target.decode(batch)`:\n\n1. `session.process(\u0026batch)?` — sync MTP with the target batch\n2. `session.draft(seq_id, n_past, last_token)?` — propose draft tokens\n3. Verify drafts on the target (your sampler / argmax logic)\n4. `session.accept(seq_id, n_accepted)?` — update draft recurrent state\n5. `session.print_stats()` — log upstream draft/accept counters (optional)\n\nSee [`examples/mtp/src/main.rs`](examples/mtp/src/main.rs) for a complete\nworking loop with timing and acceptance reporting.\n\n### 4. 
CLI examples\n\nSmoke test (build contexts only):\n\n```bash\ncargo run --release -p mtp --features metal -- \\\n    hf-model froggeric/Qwen3.6-27B-MTP-GGUF Qwen3.6-27B-IQ2_M-mtp.gguf\n```\n\nFull generation with draft tuning:\n\n```bash\ncargo run --release -p mtp --features metal -- \\\n    --predict 64 \\\n    --n-draft-max 1 \\\n    --p-min 0.0 \\\n    --prompt \"The capital of France is\" \\\n    hf-model froggeric/Qwen3.6-27B-MTP-GGUF Qwen3.6-27B-IQ2_M-mtp.gguf\n```\n\nUse `--features cuda` or `--features vulkan` on other platforms instead of\n`metal`.\n\n### 5. Benchmarks and tuning\n\nDraft depth is quant- and model-sensitive. See [MTP.md](MTP.md) for measured\nthroughput on Apple Silicon and notes on upstream #23269 sampling changes.\n\nFor comparison against upstream `llama-server --spec-type draft-mtp`, use\n[`scripts/bench-mtp.sh`](scripts/bench-mtp.sh).\n\n---\n\n## Incremental prefill\n\nThe `incremental-chat` example demonstrates **incremental prefill** — decoding\nprompt tokens into the KV cache *while the user is still typing*, so that\ngeneration starts almost instantly when they press Enter.\n\n### Features\n\n- **Incremental prefill** — tokens decoded into the KV cache as you type\n- **BPE-stable margin** — withholds the last 2 tokens to avoid decode→invalidate churn (saves ~55% total compute)\n- **Chat template** — proper formatting via `apply_chat_template`\n- **Cached history prefix** — only the new user message is re-tokenized, not the entire conversation\n- **Conversation history** — KV cache persisted across turns with sliding-window eviction\n- **System prompt** — prefilled once at startup, never re-processed\n- **Full cursor-based editor** — arrow keys, Home/End, insert/delete at any position\n- **Multi-line input** — Alt+Enter for newlines\n- **Line editing** — Ctrl+W (word), Ctrl+U (clear), Ctrl+K (kill to end)\n- **Editing prefilled text** — mid-line edits invalidate only from the divergence point\n- **Ctrl-C to cancel** generation 
mid-stream (press twice while typing to quit)\n- **Performance stats** — TTFT and tok/s displayed after each response\n- **Graceful overflow** — messages exceeding context are truncated with a warning\n- **Stale message draining** — only the latest input change is processed\n- **Terminal cleanup** on panic via a custom panic hook\n- **Comprehensive benchmark** — 6 dimensions: latency, speed, load, precision, UX, DX\n\n### How it works\n\n1. The system prompt is prefilled once at startup and kept across turns.\n2. As the user types, the current input is periodically tokenized (debounced).\n3. New tokens beyond the KV cache are decoded in small batches, **withholding\n   the last 2 tokens** to avoid BPE churn (adding a character can change the\n   last 1–2 tokens retroactively — the margin prevents wasted decode cycles).\n4. If the user deletes or changes text, the KV cache is trimmed from the\n   divergence point — only the invalidated suffix is re-processed.\n5. When the user presses Enter, the remaining tokens (including the withheld\n   tail) are flushed and generation begins immediately.\n6. Conversation history stays in the KV cache.  When the context fills up,\n   the oldest turns are evicted (sliding window) rather than clearing everything.\n\n### Benchmark results\n\nMeasured on **Qwen2.5-0.5B-Instruct Q4_K_M** (Apple Silicon, CPU-only).\nRun the full benchmark: `cargo run --release -p incremental-chat --bin incremental-bench -- model.gguf`\nGenerate charts: `cargo run -p incremental-chat --bin incremental-charts`\n\n#### 1. Latency — normal vs incremental flush at Enter\n\n\u003cp align=\"center\"\u003e\u003cimg src=\"charts/latency.png\" width=\"700\"/\u003e\u003c/p\u003e\n\n#### Perceived speedup\n\n\u003cp align=\"center\"\u003e\u003cimg src=\"charts/speedup.png\" width=\"500\"/\u003e\u003c/p\u003e\n\n**2. Speed** — 167 tok/s generation throughput (32 tokens in 191ms)\n\n#### 3. 
GPU Load — BPE margin saves 40–59% total compute vs naive\n\n\u003cp align=\"center\"\u003e\u003cimg src=\"charts/load.png\" width=\"700\"/\u003e\u003c/p\u003e\n\n**4. Precision** — incremental prefill produces identical first token to normal prefill (**✔ ALL MATCH**)\n\n#### 5. UX — mid-line edit recovery cost\n\n\u003cp align=\"center\"\u003e\u003cimg src=\"charts/ux.png\" width=\"600\"/\u003e\u003c/p\u003e\n\n**6. DX** — 3-method API (`new`/`prefill_speculative`/`flush`), pure userspace pattern, ~130 lines of shared code\n\n#### 7. KV Cache Quantization + TurboQuant\n\n\u003cp align=\"center\"\u003e\u003cimg src=\"charts/kv_quant.png\" width=\"720\"/\u003e\u003c/p\u003e\n\nGenerated 64 tokens and compared output to F16 baseline:\n\n| Config | Diverges at | Quality | Output sample |\n|:---|:---:|:---|:---|\n| F16 (baseline) | — | — | \"Rust and C++ are both popular programming languages...\" |\n| Q8_0 + TurboQuant | char 195 | **near-identical** | Same as F16 for ~195 chars |\n| Q5_0 + TurboQuant | char 24 | **coherent** | \"...both high-level programming languages...\" |\n| Q4_0 + TurboQuant | char 24 | **coherent** | \"...different approaches to memory management...\" |\n| Q5_0 no TurboQuant | char 2 | **⚠ degraded** | \"The following of the following of the of the...\" |\n| Q4_0 no TurboQuant | char 2 | **⚠ degraded** | \"The term 'in terms of memory safety...is a programming language...\" |\n\n**TurboQuant makes quantized KV cache usable.** Without it, Q5_0/Q4_0 produce\ndegenerate output (diverges at char 2). With it, Q5_0 produces coherent text\nthat diverges only in wording, while Q8_0 is near-identical to F16.\nSee also the [TurboQuant section](#turboQuant--attention-rotation) for PPL and VRAM numbers.\n\n#### 8. 
Samplers \u0026 Temperature\n\n\u003cp align=\"center\"\u003e\u003cimg src=\"charts/samplers.png\" width=\"750\"/\u003e\u003c/p\u003e\n\n9 sampler configurations tested with seed=42 for reproducibility:\n\n| Sampler | Gen (48t) | tok/s | Output style |\n|:---|:---:|:---:|:---|\n| greedy (t=0) | 306 ms | 157 | Deterministic, factual |\n| temp=0.1 top_k=40 | 306 ms | 157 | Nearly identical to greedy |\n| temp=0.4 top_p=0.9 | 330 ms | 145 | Slight variation, still focused |\n| temp=0.7 top_p=0.9 | 335 ms | 143 | Creative, writes actual haiku |\n| temp=1.0 top_p=0.95 | 413 ms | 116 | More diverse, poetic |\n| temp=1.5 top_k=50 | 307 ms | 156 | Wild — mixes English and Chinese |\n| min_p=0.05 t=0.7 | 310 ms | 155 | Focused, similar to top_p |\n| top_n_sigma=1.0 | 643 ms | 75 | Slower (large candidate set) |\n| mirostat_v2 τ=5 | 565 ms | 85 | Adaptive, poetic output |\n\n**Key findings:**\n- Same seed produces **identical output** across runs (deterministic)\n- Greedy/low-temp are fastest (~157 tok/s), mirostat/sigma slowest (~75-85 tok/s)\n- `temp=0.7 + top_p=0.9` is the sweet spot for creative tasks\n- `min_p` is a fast alternative to `top_p` with similar quality\n\n### Usage\n\n```bash\n# Interactive chat with live prefill\ncargo run --release -p incremental-chat -- local model.gguf\n\n# Quantized KV cache with TurboQuant (saves VRAM, near-zero quality loss)\ncargo run --release -p incremental-chat -- --kv-type q5_0 local model.gguf\n\n# Without TurboQuant (for comparison)\ncargo run --release -p incremental-chat -- --kv-type q5_0 --no-turbo-quant local model.gguf\n\n# Cache the system prompt session to disk (instant restart)\ncargo run --release -p incremental-chat -- --session-cache sys.session local model.gguf\n\n# Custom system prompt, debounce, and sliding window\ncargo run --release -p incremental-chat -- \\\n    --system-prompt \"You are a pirate. 
Respond only in pirate speak.\" \\\n    --debounce-ms 100 --keep-turns 4 \\\n    local model.gguf\n\n# Run the comprehensive benchmark (7 dimensions)\ncargo run --release -p incremental-chat --bin incremental-bench -- model.gguf\n```\n\n### Using the incremental prefill API\n\nThe key building blocks from `llama-cpp-4`:\n\n```rust\nuse llama_cpp_4::llama_batch::LlamaBatch;\nuse llama_cpp_4::model::AddBos;\nuse llama_cpp_4::token::LlamaToken;\n\n// Decode only new tokens (the delta) into the KV cache.\n// Withhold the last 2 tokens — BPE can retroactively change them\n// when the next character is typed.\nlet new_tokens = model.str_to_token(\u0026user_text, AddBos::Always)?;\nlet stable_end = new_tokens.len().saturating_sub(2); // BPE margin\n// `find_common_prefix` is not shown here; it is assumed to return the\n// length of the token prefix shared by both slices.\nlet common = find_common_prefix(\u0026cached_tokens, \u0026new_tokens[..stable_end]);\n\n// Trim cache if the user edited earlier text\nif common \u003c cached_tokens.len() {\n    ctx.clear_kv_cache_seq(Some(0), Some(common as u32), None)?;\n}\n\n// Decode only the genuinely new stable tokens\nlet mut batch = LlamaBatch::new(512, 1);\nfor (i, \u0026token) in new_tokens[common..stable_end].iter().enumerate() {\n    let pos = (common + i) as i32;\n    batch.add(token, pos, \u0026[0], i == stable_end - common - 1)?;\n}\n// Skip the decode when nothing new is stable yet (the batch would be empty)\nif common \u003c stable_end {\n    ctx.decode(\u0026mut batch)?;\n}\n\n// When user presses Enter: flush ALL tokens (including the tail)\n```\n\n---\n\n## GPU acceleration\n\n| Feature | Hardware | Flag |\n|---|---|---|\n| `cuda` | NVIDIA (CUDA) | `--features cuda` |\n| `metal` | Apple Silicon | `--features metal` |\n| `vulkan` | AMD / Intel / cross-platform | `--features vulkan` |\n| `native` | CPU with AVX2/NEON auto-detect | `--features native` |\n| `openmp` | Multi-core CPU (default on) | `--features openmp` |\n| `rpc` | Remote compute backend | `--features rpc` |\n| `prebuilt` | All (build optimization) | `--features prebuilt` |\n\n```bash\n# Metal (macOS)\ncargo run -p openai-server --features metal -- --n-gpu-layers 99 \\\n    local model.gguf\n\n# CUDA (Linux/Windows)\ncargo 
run -p openai-server --features cuda -- --n-gpu-layers 99 \\\n    local model.gguf\n\n# Vulkan (cross-platform)\ncargo run -p openai-server --features vulkan -- --n-gpu-layers 99 \\\n    hf-model bartowski/Llama-3.2-3B-Instruct-GGUF Llama-3.2-3B-Instruct-Q4_K_M.gguf\n```\n\n---\n\n## Hugging Face model download\n\nAll examples and the server accept a `hf-model \u003crepo\u003e [quant]` subcommand\nthat downloads models from the Hub (cached in `~/.cache/huggingface/`).\n\n```bash\n# Interactive quant picker for repos with many options\ncargo run -p openai-server -- hf-model unsloth/Qwen3.5-397B-A17B-GGUF\n\n# Select by quant name (downloads all shards automatically)\ncargo run -p openai-server -- hf-model unsloth/Qwen3.5-397B-A17B-GGUF Q4_K_M\n\n# Exact filename\ncargo run -p openai-server -- \\\n    hf-model TheBloke/Llama-2-7B-Chat-GGUF llama-2-7b-chat.Q4_K_M.gguf\n```\n\nSet `HUGGING_FACE_HUB_TOKEN` for gated models.\n\n---\n\n## Development\n\n```bash\n# Clone with submodules (llama.cpp is a submodule of llama-cpp-sys-4)\ngit clone --recursive https://github.com/eugenehp/llama-cpp-rs\n\n# Or after cloning without --recursive\ngit submodule update --init --recursive\n\n# Build everything (with optimizations)\ncargo build\n\n# Build with prebuilt artifacts for faster compilation\ncargo build --features prebuilt\n\n# Run all unit tests (no model required)\ncargo test\n\n# Run server unit tests specifically\ncargo test -p openai-server\n```\n\n### Build Optimizations\n\nThe build system includes several optimizations for faster compilation:\n\n- **Ninja build system** (2-3x faster than Make)\n- **Parallel compilation** (uses all CPU cores)\n- **sccache compilation caching** (makes feature changes instant)\n- **Shared CMake cache** (avoids rebuilds when toggling features)\n- **Unity Build** (groups source files for faster compilation)\n- **mold linker** (5-10x faster linking on Linux)\n- **Prebuilt artifacts** (`--features prebuilt`) (8% faster debug builds, 50-80% 
expected for CI/CD)\n\nFor best performance, install the recommended tools:\n\n```bash\n# macOS\nbrew install ninja sccache\n\n# Linux (Ubuntu/Debian)\nsudo apt-get install ninja-build mold\ncargo install sccache\n\n# Enable detailed build logging\nBUILD_DEBUG=1 cargo build\n```\n\nSee [BUILD_OPTIMIZATIONS.md](BUILD_OPTIMIZATIONS.md) for more details.\n\n### Updating llama.cpp\n\n```bash\ncd llama-cpp-sys-4/llama.cpp\ngit fetch origin master\ngit checkout origin/master  # or a specific commit/tag\ncd ../..\ncargo build          # build.rs regenerates bindings automatically\n```\n\n---\n\n## Multimodal Images\n\n### Via the OpenAI-compatible server\n\nBuild with `--features mtmd`. The server auto-detects `mmproj-*.gguf` next to the\nmodel, or accepts `--mmproj PATH`. Upload images via `POST /v1/files`, then reference\nthem in chat messages (`image_url` / `image_file` parts — see\n[`examples/server/README.md`](examples/server/README.md)).\n\n```shell\ncargo run -p openai-server --features mtmd --release -- \\\n    hf-model unsloth/Qwen3.5-27B-GGUF Qwen3.5-27B-Q4_0\n```\n\nOr with an explicit mmproj path:\n\n```shell\ncargo run -p openai-server --features mtmd -- \\\n    --mmproj mmproj-BF16.gguf \\\n    hf-model unsloth/Qwen3.5-27B-GGUF Qwen3.5-27B-Q4_0\n```\n\n### Standalone multimodal example\n\n```shell\ncargo run --features mtmd -p mtmd -- \\\n    --model /path/to/model.gguf \\\n    --mmproj /path/to/mmproj.gguf \\\n    --image /path/to/image.jpg \\\n    --prompt \"Describe this image.\"\n```\n\n---\n\n## Credits\n\nOriginally derived from [llama-cpp-2](https://crates.io/crates/llama-cpp-2) — thanks to those contributors.  
\nSee also [bitnet-cpp-rs](https://github.com/eugenehp/bitnet-cpp-rs) for highly-quantized BitNet model support.\n\n## Citation\n\n```bibtex\n@software{hauptmann2025llamacpprs,\n  author    = {Hauptmann, Eugene},\n  title     = {{llama-cpp-4}: llama-cpp {Rust} wrapper},\n  year      = {2025},\n  version   = {0.3.1},\n  url       = {https://github.com/eugenehp/llama-cpp-rs},\n}\n```\n\n## License\n\nThis project is licensed under the [MIT License](/LICENSE).\n\n## Copyright\n\n© 2025-2026, Eugene Hauptmann\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Feugenehp%2Fllama-cpp-rs","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Feugenehp%2Fllama-cpp-rs","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Feugenehp%2Fllama-cpp-rs/lists"}