{"id":50828830,"url":"https://github.com/nobottomline/ullm","last_synced_at":"2026-06-13T21:01:32.503Z","repository":{"id":364385663,"uuid":"1264280381","full_name":"nobottomline/ullm","owner":"nobottomline","description":"uLLM — universal local LLM inference engine in Rust. GGUF + SafeTensors + MLX, Metal GPU forward. Runs Llama / Qwen2 / Qwen3 / Qwen3-MoE / Gemma-3 on Apple Silicon.","archived":false,"fork":false,"pushed_at":"2026-06-12T20:54:05.000Z","size":434,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-12T21:26:26.245Z","etag":null,"topics":["agents","apple-silicon","constrained-decoding","gguf","grammar","inference","inference-engine","json-schema","llama","llm","local-llm","metal","openai-api","rust","structured-outputs"],"latest_commit_sha":null,"homepage":null,"language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/nobottomline.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":"docs/roadmap.md","authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":"NOTICE","maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-06-09T18:30:20.000Z","updated_at":"2026-06-12T20:54:09.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/nobottomline/ullm","commit_stats":null,"previous_names":["nobottomline/ullm"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/nobottomline/ullm","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nobottomline%2Fullm","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nobottomline%2Fullm/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nobottomline%2Fullm/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nobottomline%2Fullm/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/nobottomline","download_url":"https://codeload.github.com/nobottomline/ullm/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nobottomline%2Fullm/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34300116,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-13T02:00:06.617Z","response_time":62,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agents","apple-silicon","constrained-decoding","gguf","grammar","inference","inference-engine","json-schema","llama","llm","local-llm","metal","openai-api","rust","structured-outputs"],"created_at":"2026-06-13T21:01:21.310Z","updated_at":"2026-06-13T21:01:32.498Z","avatar_url":"https://github.com/nobottomline.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n  \u003cimg src=\"assets/banner.svg\" alt=\"uLLM — the local inference engine where the model obeys\" width=\"840\"\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://github.com/nobottomline/ullm/actions/workflows/ci.yml\"\u003e\u003cimg src=\"https://github.com/nobottomline/ullm/actions/workflows/ci.yml/badge.svg\" alt=\"CI\"\u003e\u003c/a\u003e\n  \u003ca href=\"LICENSE\"\u003e\u003cimg src=\"https://img.shields.io/badge/license-Apache--2.0-blue.svg\" alt=\"License: Apache-2.0\"\u003e\u003c/a\u003e\n  \u003ca href=\"rust-toolchain.toml\"\u003e\u003cimg src=\"https://img.shields.io/badge/rust-2024-orange.svg\" alt=\"Rust 2024\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n**The local inference engine where the model obeys.** Bring any model you\nalready have — GGUF, Hugging Face, or Apple MLX — and get output *guaranteed* to\nmatch a JSON Schema, a grammar, or a regex: valid JSON every time, tool calls\nthat are always well-formed, no retries, no JSON-repair. Pure Rust,\nApple-Silicon-first, embeddable.\n\n\u003e **Status:** single-Mac, structured output complete. Runs real models on the\n\u003e Metal GPU — including a 30B mixture-of-experts — and the guarantee holds on\n\u003e every format, on CPU and GPU. See the [roadmap](docs/roadmap.md).\n\n## Install\n\nApple Silicon Mac (macOS 14+):\n\n```sh\n# Homebrew (recommended) — prebuilt binary, no Rust needed:\nbrew install nobottomline/ullm/ullm\n#   or:  brew tap nobottomline/ullm \u0026\u0026 brew install ullm\n\n# ...or grab the release tarball directly:\n#   https://github.com/nobottomline/ullm/releases/latest\ntar -xzf ullm-*-aarch64-apple-darwin.tar.gz \u0026\u0026 ./ullm-*/ullm doctor\n\n# ...or from source (needs Rust):\ncargo install --git https://github.com/nobottomline/ullm ullm-cli\n#   or, in a clone:  cargo build --release   # binary at ./target/release/ullm\n```\n\n## Quickstart\n\n```sh\n# Generate from a GGUF file, or a Hugging Face / MLX directory. Drop --gpu for CPU.\nullm run model.gguf \"The capital of France is\" --gpu\n\n# Or chat interactively — multi-turn, with conversation memory:\nullm chat model.gguf --gpu\n\n# Structured output that cannot come out malformed:\nullm run model.gguf \"Extract: John is 30.\"          --json\nullm run model.gguf \"Review: great blender, 5 stars\" --schema grammars/review.schema.json\nullm run model.gguf \"Date two days after 2024-01-13:\" --regex '[0-9]{4}-[0-9]{2}-[0-9]{2}'\n\n# OpenAI-compatible server with Structured Outputs + tool calling:\nullm serve model.gguf --gpu     # http://127.0.0.1:8080\ncurl 127.0.0.1:8080/v1/chat/completions -d '{\n  \"messages\": [{\"role\":\"user\",\"content\":\"Extract: Acme blender, 5 stars.\"}],\n  \"response_format\": {\"type\":\"json_schema\",\"json_schema\":{\"schema\":\n    {\"type\":\"object\",\"properties\":{\"product\":{\"type\":\"string\"},\"rating\":{\"type\":\"integer\"}},\n     \"required\":[\"product\",\"rating\"]}}}}'   # content is guaranteed to match the schema\n```\n\n`ullm --help` also has `inspect`, `tokenize`, `doctor`, and `gpu-check`. Runnable\nPython (OpenAI SDK) and Rust (embedded) samples are in [`examples/`](examples).\n\n## What it does\n\n- **Guaranteed structure** — GBNF grammar / JSON Schema (`$ref`, recursion,\n  `enum`, `pattern`/`format`) / regex, enforced at the logit level so a token\n  that would break the contract is impossible to sample. The per-token cost is\n  cached down to ~tens of µs.\n- **OpenAI-compatible** — `/v1/chat/completions` (streaming), `response_format`,\n  and `tools` + `tool_choice` returning valid `tool_calls`. A drop-in local\n  OpenAI for agents.\n- **Any weights, one runtime** — GGUF, SafeTensors, and Apple MLX (4-bit) load\n  with no conversion; Llama 2/3, Qwen2/3, Qwen3-MoE, Gemma-3.\n- **Full Metal GPU forward** — weights, activations and KV cache stay resident,\n  one command buffer per token, dequant-in-kernel; validated against the CPU\n  reference (`ullm gpu-check`) and, for MLX, token-for-token against `mlx_lm`.\n\n## Benchmarks\n\nSingle-stream decode, Apple M4 Max ([numbers + how to reproduce](docs/benchmarks.md)):\n\n| Model | Format | tok/s |\n|-------|--------|------:|\n| Llama-3.2-1B | GGUF Q4_K_M | 263 |\n| Qwen2.5-1.5B | GGUF Q4_K_M | 190 |\n| gemma-3-4b | GGUF Q6_K | 80.5 |\n| Qwen3-4B | HF BF16 | 26.6 |\n| Qwen3-Coder-30B-A3B | MLX 4-bit (MoE) | 63.6 |\n\n## Layout\n\n```\ncrates/\n  ullm-core/         types + container-agnostic IR (WeightSource, dequant)\n  ullm-gguf/         GGUF loader\n  ullm-safetensors/  SafeTensors / Hugging Face + MLX loader\n  ullm-tokenizer/    SentencePiece + byte-level BPE + tokenizer.json\n  ullm-grammar/      grammar / JSON-Schema / regex constraint engine\n  ullm-model/        CPU runtime, architectures, sampling, MLX/MoE\n  ullm-metal/        Metal GPU backend (full forward + kernels)\n  ullm-server/       OpenAI-compatible HTTP server\n  ullm-cli/          the `ullm` binary\n```\n\n## Docs\n\n- [Why uLLM exists](docs/strategy/positioning.md) — the corner we own, and what we're explicitly not\n- [Architecture](docs/architecture/00-overview.md) · [Roadmap](docs/roadmap.md) · [Benchmarks](docs/benchmarks.md) · [Decisions (ADRs)](docs/adr)\n\n## License\n\n[Apache-2.0](LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnobottomline%2Fullm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnobottomline%2Fullm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnobottomline%2Fullm/lists"}