https://github.com/nobottomline/ullm

uLLM — universal local LLM inference engine in Rust. GGUF + SafeTensors + MLX, Metal GPU forward. Runs Llama / Qwen2 / Qwen3 / Qwen3-MoE / Gemma-3 on Apple Silicon.

agents apple-silicon constrained-decoding gguf grammar inference inference-engine json-schema llama llm local-llm metal openai-api rust structured-outputs

Last synced: 4 days ago

Host: GitHub
URL: https://github.com/nobottomline/ullm
Owner: nobottomline
License: apache-2.0
Created: 2026-06-09T18:30:20.000Z (8 days ago)
Default Branch: main
Last Pushed: 2026-06-12T20:54:05.000Z (5 days ago)
Last Synced: 2026-06-12T21:26:26.245Z (5 days ago)
Topics: agents, apple-silicon, constrained-decoding, gguf, grammar, inference, inference-engine, json-schema, llama, llm, local-llm, metal, openai-api, rust, structured-outputs
Language: Rust
Size: 424 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Roadmap: docs/roadmap.md
- Notice: NOTICE

README

**The local inference engine where the model obeys.** Bring any model you
already have (GGUF, Hugging Face, or Apple MLX) and get output *guaranteed* to
match a JSON Schema, a grammar, or a regex: valid JSON every time, tool calls
that are always well-formed, no retries, no JSON-repair. Pure Rust,
Apple-Silicon-first, embeddable.

> **Status:** single-Mac, structured output complete. Runs real models on the
> Metal GPU, including a 30B mixture-of-experts, and the guarantee holds on
> every format, on CPU and GPU. See the [roadmap](docs/roadmap.md).

## Install

Apple Silicon Mac (macOS 14+):

```sh
# Homebrew (recommended): prebuilt binary, no Rust needed.
brew install nobottomline/ullm/ullm
#   or:  brew tap nobottomline/ullm && brew install ullm

# ...or grab the release tarball directly:
#   https://github.com/nobottomline/ullm/releases/latest
tar -xzf ullm-*-aarch64-apple-darwin.tar.gz && ./ullm-*/ullm doctor

# ...or from source (needs Rust):
cargo install --git https://github.com/nobottomline/ullm ullm-cli
#   or, in a clone:  cargo build --release   # binary at ./target/release/ullm
```

## Quickstart

```sh
# Generate from a GGUF file, or a Hugging Face / MLX directory. Drop --gpu for CPU.
ullm run model.gguf "The capital of France is" --gpu

# Or chat interactively (multi-turn, with conversation memory):
ullm chat model.gguf --gpu

# Structured output that cannot come out malformed:
ullm run model.gguf "Extract: John is 30."          --json
ullm run model.gguf "Review: great blender, 5 stars" --schema grammars/review.schema.json
ullm run model.gguf "Date two days after 2024-01-13:" --regex '[0-9]{4}-[0-9]{2}-[0-9]{2}'

# OpenAI-compatible server with Structured Outputs + tool calling:
ullm serve model.gguf --gpu     # http://127.0.0.1:8080
curl 127.0.0.1:8080/v1/chat/completions -d '{
  "messages": [{"role":"user","content":"Extract: Acme blender, 5 stars."}],
  "response_format": {"type":"json_schema","json_schema":{"schema":
    {"type":"object","properties":{"product":{"type":"string"},"rating":{"type":"integer"}},
     "required":["product","rating"]}}}}'   # content is guaranteed to match the schema
```

`ullm --help` also has `inspect`, `tokenize`, `doctor`, and `gpu-check`. Runnable
Python (OpenAI SDK) and Rust (embedded) samples are in [`examples/`](examples).
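
The Quickstart exercises `response_format`; a tool-calling request to the same endpoint might look like the sketch below, which uses only the Rust standard library. The `get_weather` tool, its parameters, and the `tool_choice` value are hypothetical and follow the OpenAI chat-completions request shape. This is not one of the repository's samples, and the exact fields the server accepts should be checked against its own docs.

```rust
use std::io::{Read, Write};
use std::net::TcpStream;

/// Hypothetical tool-calling request body in the OpenAI chat-completions shape.
/// The tool name and schema are illustrative assumptions.
const BODY: &str = r#"{
  "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Look up the current weather for a city",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }
  }],
  "tool_choice": "auto"
}"#;

/// Builds a raw HTTP/1.1 POST for /v1/chat/completions.
fn build_request(host: &str, body: &str) -> String {
    format!(
        "POST /v1/chat/completions HTTP/1.1\r\n\
         Host: {host}\r\n\
         Content-Type: application/json\r\n\
         Content-Length: {len}\r\n\
         Connection: close\r\n\
         \r\n\
         {body}",
        len = body.len()
    )
}

fn main() -> std::io::Result<()> {
    // Address shown next to `ullm serve` in the Quickstart.
    let host = "127.0.0.1:8080";
    let request = build_request(host, BODY);
    match TcpStream::connect(host) {
        Ok(mut stream) => {
            stream.write_all(request.as_bytes())?;
            // `Connection: close` lets us read until the server hangs up.
            let mut response = String::new();
            stream.read_to_string(&mut response)?;
            println!("{response}");
        }
        Err(e) => eprintln!("no server at {host} ({e}); start one with `ullm serve`"),
    }
    Ok(())
}
```

If the server behaves as described under "What it does", the reply's `tool_calls` entry should carry arguments that parse as JSON matching the tool's `parameters` schema.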

## What it does

- **Guaranteed structure**: GBNF grammar / JSON Schema (`$ref`, recursion,
  `enum`, `pattern`/`format`) / regex, enforced at the logit level so a token
  that would break the contract is impossible to sample. Caching brings the
  per-token constraint cost down to roughly tens of µs.
- **OpenAI-compatible**: `/v1/chat/completions` (streaming), `response_format`,
  and `tools` + `tool_choice` returning valid `tool_calls`. A drop-in local
  stand-in for the OpenAI API, aimed at agents.
- **Any weights, one runtime**: GGUF, SafeTensors, and Apple MLX (4-bit) load
  with no conversion. Supported architectures: Llama 2/3, Qwen2/3, Qwen3-MoE,
  Gemma-3.
- **Full Metal GPU forward**: weights, activations and KV cache stay resident,
  one command buffer per token, dequant-in-kernel; validated against the CPU
  reference (`ullm gpu-check`) and, for MLX, token-for-token against `mlx_lm`.
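
To make "enforced at the logit level" concrete, here is a minimal sketch of the idea, not uLLM's actual engine. It hard-codes the Quickstart date pattern `[0-9]{4}-[0-9]{2}-[0-9]{2}` as a fixed-width template instead of compiling a general regex. The vocabulary, the logits, and the function names `advance`, `mask_logits`, and `constrained_greedy` are all made up for illustration. Before each sampling step, every token that cannot extend a valid prefix has its logit set to negative infinity, so it can never be chosen.

```rust
/// Fixed-width stand-in for [0-9]{4}-[0-9]{2}-[0-9]{2}:
/// 'd' means "any ASCII digit", '-' means a literal hyphen.
const TEMPLATE: &[u8] = b"dddd-dd-dd";

/// Returns the new position if `token` may follow `pos` already-emitted characters.
fn advance(pos: usize, token: &str) -> Option<usize> {
    let bytes = token.as_bytes();
    if bytes.is_empty() || pos + bytes.len() > TEMPLATE.len() {
        return None;
    }
    for (i, &b) in bytes.iter().enumerate() {
        let ok = match TEMPLATE[pos + i] {
            b'd' => b.is_ascii_digit(),
            literal => b == literal,
        };
        if !ok {
            return None;
        }
    }
    Some(pos + bytes.len())
}

/// Sets the logit of every token that would break the pattern to -inf.
fn mask_logits(pos: usize, vocab: &[&str], logits: &mut [f32]) {
    for (tok, logit) in vocab.iter().zip(logits.iter_mut()) {
        if advance(pos, tok).is_none() {
            *logit = f32::NEG_INFINITY;
        }
    }
}

/// Greedy decoding under the constraint. `raw_logits` stands in for a model
/// forward pass and is reused at every step to keep the sketch short.
fn constrained_greedy(vocab: &[&str], raw_logits: &[f32]) -> String {
    let mut out = String::new();
    let mut pos = 0;
    while pos < TEMPLATE.len() {
        let mut logits = raw_logits.to_vec();
        mask_logits(pos, vocab, &mut logits);
        let (best, _) = logits
            .iter()
            .enumerate()
            .max_by(|a, b| a.1.total_cmp(b.1))
            .expect("vocabulary must not be empty");
        // Panics only if every token was masked, i.e. the vocabulary
        // cannot spell the pattern at all.
        pos = advance(pos, vocab[best]).expect("no token can continue the pattern");
        out.push_str(vocab[best]);
    }
    out
}

fn main() {
    // Toy vocabulary; the unconstrained favourite is "hello".
    let vocab = ["hello", "20", "2", "0", "1", "-", "24", "-01", "5"];
    let logits = [9.0, 2.0, 1.0, 0.5, 0.4, 0.3, 1.5, 3.0, 0.1];
    println!("{}", constrained_greedy(&vocab, &logits));
}
```

With these invented logits the unconstrained argmax is `hello`, yet the loop emits `2020-01-01`, because `hello` is masked at every step. A real engine would track grammar or automaton state rather than a character offset and, as the bullet above notes, cache the per-state masks.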

## Benchmarks

Single-stream decode, Apple M4 Max ([numbers + how to reproduce](docs/benchmarks.md)):

| Model | Format | tok/s |
|-------|--------|------:|
| Llama-3.2-1B | GGUF Q4_K_M | 263 |
| Qwen2.5-1.5B | GGUF Q4_K_M | 190 |
| gemma-3-4b | GGUF Q6_K | 80.5 |
| Qwen3-4B | HF BF16 | 26.6 |
| Qwen3-Coder-30B-A3B | MLX 4-bit (MoE) | 63.6 |
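
As a rough way to read these numbers, throughput converts to per-token latency as `1000 / tok_per_s` milliseconds. The sketch below does that for each row and compares it with a constraint check of 50 µs. That figure is an assumed value picked from the "tens of µs" range quoted under "What it does", not a measurement.

```rust
/// Converts decode throughput (tokens per second) to milliseconds per token.
fn ms_per_token(tok_per_s: f64) -> f64 {
    1000.0 / tok_per_s
}

/// Fraction of the per-token time budget used by a fixed check of `overhead_us` microseconds.
fn overhead_share(tok_per_s: f64, overhead_us: f64) -> f64 {
    overhead_us / (ms_per_token(tok_per_s) * 1000.0)
}

fn main() {
    // Throughput figures copied from the table above.
    let rows = [
        ("Llama-3.2-1B", 263.0),
        ("Qwen2.5-1.5B", 190.0),
        ("gemma-3-4b", 80.5),
        ("Qwen3-4B", 26.6),
        ("Qwen3-Coder-30B-A3B", 63.6),
    ];
    // ASSUMPTION: 50 µs per constrained token, chosen only for illustration.
    let assumed_overhead_us = 50.0;
    for (model, tps) in rows {
        println!(
            "{model:<22} {:>7.2} ms/token  {:>5.2}% constraint overhead",
            ms_per_token(tps),
            100.0 * overhead_share(tps, assumed_overhead_us)
        );
    }
}
```

Under that assumption, the constraint check stays a small fraction of the decode time even for the fastest model in the table, where a token takes under 4 ms.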

## Layout

```
crates/
  ullm-core/         types + container-agnostic IR (WeightSource, dequant)
  ullm-gguf/         GGUF loader
  ullm-safetensors/  SafeTensors / Hugging Face + MLX loader
  ullm-tokenizer/    SentencePiece + byte-level BPE + tokenizer.json
  ullm-grammar/      grammar / JSON-Schema / regex constraint engine
  ullm-model/        CPU runtime, architectures, sampling, MLX/MoE
  ullm-metal/        Metal GPU backend (full forward + kernels)
  ullm-server/       OpenAI-compatible HTTP server
  ullm-cli/          the `ullm` binary
```

## Docs

- [Why uLLM exists](docs/strategy/positioning.md): the corner we own, and what we're explicitly not
- [Architecture](docs/architecture/00-overview.md) · [Roadmap](docs/roadmap.md) · [Benchmarks](docs/benchmarks.md) · [Decisions (ADRs)](docs/adr)

## License

[Apache-2.0](LICENSE).
