An open API service indexing awesome lists of open source software.

https://github.com/nobottomline/ullm

uLLM — universal local LLM inference engine in Rust. GGUF + SafeTensors + MLX, Metal GPU forward. Runs Llama / Qwen2 / Qwen3 / Qwen3-MoE / Gemma-3 on Apple Silicon.
https://github.com/nobottomline/ullm

agents apple-silicon constrained-decoding gguf grammar inference inference-engine json-schema llama llm local-llm metal openai-api rust structured-outputs

Last synced: 4 days ago
JSON representation

uLLM — universal local LLM inference engine in Rust. GGUF + SafeTensors + MLX, Metal GPU forward. Runs Llama / Qwen2 / Qwen3 / Qwen3-MoE / Gemma-3 on Apple Silicon.

Awesome Lists containing this project

README

          


uLLM — the local inference engine where the model obeys


CI
License: Apache-2.0
Rust 2024

**The local inference engine where the model obeys.** Bring any model you
already have — GGUF, Hugging Face, or Apple MLX — and get output *guaranteed* to
match a JSON Schema, a grammar, or a regex: valid JSON every time, tool calls
that are always well-formed, no retries, no JSON-repair. Pure Rust,
Apple-Silicon-first, embeddable.

> **Status:** single-Mac, structured output complete. Runs real models on the
> Metal GPU — including a 30B mixture-of-experts — and the guarantee holds on
> every format, on CPU and GPU. See the [roadmap](docs/roadmap.md).

## Install

Apple Silicon Mac (macOS 14+):

```sh
# Homebrew (recommended) — prebuilt binary, no Rust needed:
brew install nobottomline/ullm/ullm
# or: brew tap nobottomline/ullm && brew install ullm

# ...or grab the release tarball directly:
# https://github.com/nobottomline/ullm/releases/latest
tar -xzf ullm-*-aarch64-apple-darwin.tar.gz && ./ullm-*/ullm doctor

# ...or from source (needs Rust):
cargo install --git https://github.com/nobottomline/ullm ullm-cli
# or, in a clone: cargo build --release # binary at ./target/release/ullm
```

## Quickstart

```sh
# Generate from a GGUF file, or a Hugging Face / MLX directory. Drop --gpu for CPU.
ullm run model.gguf "The capital of France is" --gpu

# Or chat interactively — multi-turn, with conversation memory:
ullm chat model.gguf --gpu

# Structured output that cannot come out malformed:
ullm run model.gguf "Extract: John is 30." --json
ullm run model.gguf "Review: great blender, 5 stars" --schema grammars/review.schema.json
ullm run model.gguf "Date two days after 2024-01-13:" --regex '[0-9]{4}-[0-9]{2}-[0-9]{2}'

# OpenAI-compatible server with Structured Outputs + tool calling:
ullm serve model.gguf --gpu # http://127.0.0.1:8080
curl 127.0.0.1:8080/v1/chat/completions -d '{
"messages": [{"role":"user","content":"Extract: Acme blender, 5 stars."}],
"response_format": {"type":"json_schema","json_schema":{"schema":
{"type":"object","properties":{"product":{"type":"string"},"rating":{"type":"integer"}},
"required":["product","rating"]}}}}' # content is guaranteed to match the schema
```

`ullm --help` also has `inspect`, `tokenize`, `doctor`, and `gpu-check`. Runnable
Python (OpenAI SDK) and Rust (embedded) samples are in [`examples/`](examples).

## What it does

- **Guaranteed structure** — GBNF grammar / JSON Schema (`$ref`, recursion,
`enum`, `pattern`/`format`) / regex, enforced at the logit level so a token
that would break the contract is impossible to sample. The per-token cost is
cached down to ~tens of µs.
- **OpenAI-compatible** — `/v1/chat/completions` (streaming), `response_format`,
and `tools` + `tool_choice` returning valid `tool_calls`. A drop-in local
OpenAI for agents.
- **Any weights, one runtime** — GGUF, SafeTensors, and Apple MLX (4-bit) load
with no conversion; Llama 2/3, Qwen2/3, Qwen3-MoE, Gemma-3.
- **Full Metal GPU forward** — weights, activations and KV cache stay resident,
one command buffer per token, dequant-in-kernel; validated against the CPU
reference (`ullm gpu-check`) and, for MLX, token-for-token against `mlx_lm`.

## Benchmarks

Single-stream decode, Apple M4 Max ([numbers + how to reproduce](docs/benchmarks.md)):

| Model | Format | tok/s |
|-------|--------|------:|
| Llama-3.2-1B | GGUF Q4_K_M | 263 |
| Qwen2.5-1.5B | GGUF Q4_K_M | 190 |
| gemma-3-4b | GGUF Q6_K | 80.5 |
| Qwen3-4B | HF BF16 | 26.6 |
| Qwen3-Coder-30B-A3B | MLX 4-bit (MoE) | 63.6 |

## Layout

```
crates/
ullm-core/ types + container-agnostic IR (WeightSource, dequant)
ullm-gguf/ GGUF loader
ullm-safetensors/ SafeTensors / Hugging Face + MLX loader
ullm-tokenizer/ SentencePiece + byte-level BPE + tokenizer.json
ullm-grammar/ grammar / JSON-Schema / regex constraint engine
ullm-model/ CPU runtime, architectures, sampling, MLX/MoE
ullm-metal/ Metal GPU backend (full forward + kernels)
ullm-server/ OpenAI-compatible HTTP server
ullm-cli/ the `ullm` binary
```

## Docs

- [Why uLLM exists](docs/strategy/positioning.md) — the corner we own, and what we're explicitly not
- [Architecture](docs/architecture/00-overview.md) · [Roadmap](docs/roadmap.md) · [Benchmarks](docs/benchmarks.md) · [Decisions (ADRs)](docs/adr)

## License

[Apache-2.0](LICENSE).