https://github.com/jundot/omlx
LLM inference server with continuous batching & SSD caching for Apple Silicon — managed from the macOS menu bar
- Host: GitHub
- URL: https://github.com/jundot/omlx
- Owner: jundot
- License: apache-2.0
- Created: 2026-02-13T14:13:27.000Z (28 days ago)
- Default Branch: main
- Last Pushed: 2026-02-27T11:51:13.000Z (14 days ago)
- Last Synced: 2026-02-27T11:59:12.031Z (14 days ago)
- Topics: apple-silicon, inference-server, llm, macos, mlx, openai-api
- Language: Python
- Homepage: https://github.com/jundot/omlx
- Size: 8.23 MB
- Stars: 88
- Watchers: 4
- Forks: 13
- Open Issues: 16
Metadata Files:
- Readme: README.md
- Contributing: docs/CONTRIBUTING.md
- License: LICENSE
oMLX
LLM inference, optimized for your Mac
Continuous batching and tiered KV caching, managed directly from your menu bar.
Install · Quickstart · Features · Models · CLI Configuration · GitHub
---
> *Every LLM server I tried made me choose between convenience and control. I wanted to pin everyday models in memory, auto-swap heavier ones on demand, set context limits - and manage it all from a menu bar.*
>
> *oMLX persists KV cache across a hot in-memory tier and cold SSD tier - even when context changes mid-conversation, all past context stays cached and reusable across requests, making local LLMs practical for real coding work with tools like Claude Code. That's why I built it.*
## Install
### macOS App
Download the `.dmg` from [Releases](https://github.com/jundot/omlx/releases), drag to Applications, done. The app includes in-app auto-update, so future upgrades are just one click.
### Homebrew
```bash
brew tap jundot/omlx https://github.com/jundot/omlx
brew install omlx
# Upgrade to the latest version
brew update && brew upgrade omlx
# Run as a background service (auto-restarts on crash)
brew services start omlx
```
### From Source
```bash
git clone https://github.com/jundot/omlx.git
cd omlx
pip install -e .
```
Requires Python 3.10+ and Apple Silicon (M1/M2/M3/M4).
## Quickstart
### macOS App
Launch oMLX from your Applications folder. The Welcome screen guides you through three steps - model directory, server start, and first model download. That's it.
### CLI
```bash
omlx serve --model-dir ~/models
```
The server discovers models from subdirectories automatically. Any OpenAI-compatible client can connect to `http://localhost:8000/v1`. A built-in chat UI is also available at `http://localhost:8000/admin/chat`.
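Any OpenAI-compatible client library works; as a dependency-free sketch, the standard library alone is enough. The model name `my-model` below is a placeholder for whatever is in your model directory:

```python
import json
import urllib.request

def build_chat_request(model, messages, stream=False):
    """Assemble a standard OpenAI-style chat completion payload."""
    return {"model": model, "messages": messages, "stream": stream}

def chat(payload, base_url="http://localhost:8000/v1"):
    """POST the payload to a running oMLX server and return the parsed reply."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_chat_request("my-model", [{"role": "user", "content": "Hello!"}])
# chat(payload)  # requires a running `omlx serve`
```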
### Homebrew Service
If you installed via Homebrew, you can run oMLX as a managed background service:
```bash
brew services start omlx # Start (auto-restarts on crash)
brew services stop omlx # Stop
brew services restart omlx # Restart
brew services info omlx # Check status
```
The service runs `omlx serve` with zero-config defaults (`~/.omlx/models`, port 8000). To customize, either set environment variables (`OMLX_MODEL_DIR`, `OMLX_PORT`, etc.) or run `omlx serve --model-dir /your/path` once to persist settings to `~/.omlx/settings.json`.
Logs are written to two locations:
- **Service log**: `$(brew --prefix)/var/log/omlx.log` (stdout/stderr)
- **Server log**: `~/.omlx/logs/server.log` (structured application log)
## Features
oMLX is built on top of [vllm-mlx](https://github.com/waybarrios/vllm-mlx), extending it with tiered KV caching, multi-model serving, an admin dashboard, Claude Code optimization, and Anthropic API support. Currently supports text-based LLMs - VLM and OCR model support is planned for upcoming milestones.
### Admin Dashboard
Web UI at `/admin` for real-time monitoring, model management, chat, benchmark, and per-model settings. All CDN dependencies are vendored for fully offline operation.
### Tiered KV Cache (Hot + Cold)
Block-based KV cache management inspired by vLLM, with prefix sharing and Copy-on-Write. The cache operates across two tiers:
- **Hot tier (RAM)**: Frequently accessed blocks stay in memory for fast access.
- **Cold tier (SSD)**: When the hot cache fills up, blocks are offloaded to SSD in safetensors format. On the next request with a matching prefix, they're restored from disk instead of recomputed from scratch - even after a server restart.
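As an illustration of the tiering idea only (not oMLX's actual implementation, which works on token-block prefixes with copy-on-write sharing and safetensors files), the policy can be sketched as an LRU hot dict that spills evictions into a cold store and promotes on hit:

```python
from collections import OrderedDict

class TieredCache:
    """Toy two-tier KV block cache: LRU hot tier, unbounded cold tier."""

    def __init__(self, hot_capacity):
        self.hot = OrderedDict()   # in-memory tier, LRU order
        self.cold = {}             # stands in for the SSD tier
        self.hot_capacity = hot_capacity

    def put(self, block_id, block):
        self.hot[block_id] = block
        self.hot.move_to_end(block_id)
        while len(self.hot) > self.hot_capacity:
            evicted_id, evicted = self.hot.popitem(last=False)
            self.cold[evicted_id] = evicted  # offload instead of discarding

    def get(self, block_id):
        if block_id in self.hot:
            self.hot.move_to_end(block_id)
            return self.hot[block_id]
        if block_id in self.cold:
            block = self.cold.pop(block_id)  # restore from "disk"
            self.put(block_id, block)        # promote back to the hot tier
            return block
        return None                          # true miss: must recompute
```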
### Continuous Batching
Handles concurrent requests through mlx-lm's BatchGenerator. Prefill and completion batch sizes are configurable.
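The scheme can be pictured as a loop that admits waiting requests into the active batch between decode steps, rather than draining the whole batch first. A toy model (not mlx-lm's BatchGenerator):

```python
from collections import deque

def continuous_batching(requests, max_batch=2):
    """Toy continuous-batching loop.

    Each request is (name, tokens_to_generate). Whenever a sequence
    finishes, a waiting request joins the batch on the next step.
    """
    waiting = deque(requests)
    active = {}      # name -> tokens still to generate
    finished = []
    while waiting or active:
        # Admit new requests up to the batch limit (the "continuous" part).
        while waiting and len(active) < max_batch:
            name, n = waiting.popleft()
            active[name] = n
        # One decode step for every active sequence.
        for name in list(active):
            active[name] -= 1
            if active[name] == 0:
                del active[name]
                finished.append(name)
    return finished
```

A short request slotted in beside a long one finishes early and frees its slot immediately, which is what keeps latency low under concurrency.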
### Claude Code Optimization
Context scaling support for running smaller-context models with Claude Code. Scales reported token counts so that auto-compact triggers at the right time, and SSE keep-alive prevents read timeouts during long prefill.
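The idea can be sketched as a linear rescaling of reported usage against the window the client assumes; the exact formula oMLX applies may differ:

```python
def scale_reported_tokens(actual_tokens, model_ctx, claimed_ctx):
    """Scale token usage so a client that assumes a `claimed_ctx` window
    (e.g. what Claude Code expects) hits its auto-compact threshold when
    the model's real window `model_ctx` is proportionally as full.

    A sketch of the idea only, not necessarily oMLX's formula.
    """
    return int(actual_tokens * claimed_ctx / model_ctx)
```

For example, a 32K-context model reported against a 200K client window would report 100K used when the model's window is half full.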
### Multi-Model Serving
Load LLMs, embedding models, and rerankers within the same server. Models are managed through a combination of automatic and manual controls:
- **LRU eviction**: Least-recently-used models are evicted automatically when memory runs low.
- **Manual load/unload**: Interactive status badges in the admin panel let you load or unload models on demand.
- **Model pinning**: Pin frequently used models to keep them always loaded.
- **Per-model TTL**: Set an idle timeout per model to auto-unload after a period of inactivity.
- **Process memory enforcement**: Total memory limit (default: system RAM - 8GB) prevents system-wide OOM.
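How these controls interact can be sketched as an LRU map in which pinned models are skipped and eviction runs until the pool fits its memory budget. A toy model, not the actual EnginePool:

```python
from collections import OrderedDict

class EnginePoolSketch:
    """Toy eviction policy: LRU order, pinned models skipped,
    evict until under the memory limit."""

    def __init__(self, max_memory):
        self.max_memory = max_memory
        self.loaded = OrderedDict()  # name -> size, least recently used first
        self.pinned = set()

    def used(self):
        return sum(self.loaded.values())

    def load(self, name, size):
        self.loaded[name] = size
        self.loaded.move_to_end(name)
        # Evict least-recently-used unpinned models until we fit.
        for candidate in list(self.loaded):
            if self.used() <= self.max_memory:
                break
            if candidate in self.pinned or candidate == name:
                continue
            del self.loaded[candidate]
```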
### Per-Model Settings
Configure sampling parameters, chat template kwargs, TTL, and more per model directly from the admin panel. Changes apply immediately without server restart.
### Built-in Chat
Chat directly with any loaded model from the admin panel. Supports conversation history, model switching, dark mode, and reasoning model output.
### Model Downloader
Search and download MLX models from HuggingFace directly in the admin dashboard. Browse model cards, check file sizes, and download with one click.
### Performance Benchmark
One-click benchmarking from the admin panel. Measures prefill (PP) and text generation (TG) tokens per second, with partial prefix cache hit testing for realistic performance numbers.
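For reference, both figures are simply token counts over wall-clock time:

```python
def throughput(prompt_tokens, prefill_s, gen_tokens, gen_s):
    """Prefill (PP) and text-generation (TG) throughput in tokens/second."""
    return prompt_tokens / prefill_s, gen_tokens / gen_s
```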
### macOS Menubar App
Native PyObjC menubar app (not Electron). Start, stop, and monitor the server without opening a terminal. Includes real-time serving stats, auto-restart on crash, and in-app auto-update.
### API Compatibility
Drop-in replacement for OpenAI and Anthropic APIs. Supports streaming usage stats (`stream_options.include_usage`) and Anthropic adaptive thinking.
| Endpoint | Description |
|----------|-------------|
| `POST /v1/chat/completions` | Chat completions (streaming) |
| `POST /v1/completions` | Text completions (streaming) |
| `POST /v1/messages` | Anthropic Messages API |
| `POST /v1/embeddings` | Text embeddings |
| `POST /v1/rerank` | Document reranking |
| `GET /v1/models` | List available models |
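As a sketch of the two request shapes (model names are placeholders for whatever is in your model directory):

```python
def openai_stream_request(model, messages):
    """Streaming chat request that also asks for usage stats in the
    stream, via the standard OpenAI `stream_options` field."""
    return {
        "model": model,
        "messages": messages,
        "stream": True,
        "stream_options": {"include_usage": True},
    }

def anthropic_messages_request(model, messages, max_tokens=256):
    """Anthropic Messages API body; note that the Messages API
    requires `max_tokens` on every request."""
    return {"model": model, "messages": messages, "max_tokens": max_tokens}
```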
### Tool Calling & Structured Output
Supports all function calling formats available in mlx-lm, JSON schema validation, and MCP tool integration. Tool calling requires the model's chat template to support the `tools` parameter. The following model families are auto-detected via mlx-lm's built-in tool parsers:
| Model Family | Format |
|---|---|
| Llama, Qwen, DeepSeek, etc. | JSON `` |
| Qwen3 Coder | XML `` |
| Gemma | `` |
| GLM (4.7, 5) | `/` XML |
| MiniMax | Namespaced `` |
| Mistral | `[TOOL_CALLS]` |
| Kimi K2 | `<\|tool_calls_section_begin\|>` |
| Longcat | `` |
Models not listed above may still work if their chat template accepts `tools` and their output uses a recognized `` XML format. Streaming requests with tool calls buffer all content and emit results at completion.
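A request advertising a tool uses the standard OpenAI function-calling schema; the weather tool below is purely illustrative:

```python
def chat_with_tools(model, messages):
    """Chat request advertising one function tool in the standard
    OpenAI function-calling schema."""
    return {
        "model": model,
        "messages": messages,
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get the current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }],
    }
```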
## Models
Point `--model-dir` at a directory containing MLX-format model subdirectories. A two-level folder layout (e.g., `mlx-community/model-name/`) is also supported.
```
~/models/
├── Step-3.5-Flash-8bit/
├── Qwen3-Coder-Next-8bit/
├── gpt-oss-120b-MXFP4-Q8/
└── bge-m3/
```
Models are auto-detected by type. You can also download models directly from the admin dashboard.
| Type | Models |
|------|--------|
| LLM | Any model supported by [mlx-lm](https://github.com/ml-explore/mlx-lm) |
| Embedding | BERT, BGE-M3, ModernBERT |
| Reranker | ModernBERT, XLM-RoBERTa |
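Discovery over the two directory layouts can be sketched with `pathlib`; the `config.json` criterion below is an assumption for illustration, not necessarily the check oMLX performs:

```python
from pathlib import Path

def discover_models(model_dir):
    """Sketch of directory-based discovery: treat any first- or
    second-level subdirectory containing a config.json as a model."""
    root = Path(model_dir)
    found = []
    for child in sorted(p for p in root.iterdir() if p.is_dir()):
        if (child / "config.json").is_file():
            found.append(child.name)
        else:  # two-level layout, e.g. mlx-community/model-name/
            for nested in sorted(p for p in child.iterdir() if p.is_dir()):
                if (nested / "config.json").is_file():
                    found.append(f"{child.name}/{nested.name}")
    return found
```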
## CLI Configuration
```bash
# Memory limit for loaded models
omlx serve --model-dir ~/models --max-model-memory 32GB
# Process-level memory limit (default: auto = RAM - 8GB)
omlx serve --model-dir ~/models --max-process-memory 80%
# Enable SSD cache for KV blocks
omlx serve --model-dir ~/models --paged-ssd-cache-dir ~/.omlx/cache
# Set in-memory hot cache size
omlx serve --model-dir ~/models --hot-cache-max-size 20%
# Adjust batch sizes
omlx serve --model-dir ~/models --prefill-batch-size 8 --completion-batch-size 32
# With MCP tools
omlx serve --model-dir ~/models --mcp-config mcp.json
# API key authentication
omlx serve --model-dir ~/models --api-key your-secret-key
```
All settings can also be configured from the web admin panel at `/admin`. Settings are persisted to `~/.omlx/settings.json`, and CLI flags take precedence.
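The precedence rule can be sketched as a two-layer merge, where only flags the user actually passed (non-`None`) override the persisted file:

```python
import json
from pathlib import Path

def effective_settings(cli_flags, settings_path="~/.omlx/settings.json"):
    """Sketch of the precedence rule: start from persisted settings,
    then let explicitly passed CLI flags win."""
    path = Path(settings_path).expanduser()
    persisted = json.loads(path.read_text()) if path.is_file() else {}
    merged = dict(persisted)
    merged.update({k: v for k, v in cli_flags.items() if v is not None})
    return merged
```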
## Architecture
```
FastAPI Server (OpenAI / Anthropic API)
│
├── EnginePool (multi-model, LRU eviction, TTL, manual load/unload)
│ ├── BatchedEngine (LLMs, continuous batching)
│ ├── EmbeddingEngine
│ └── RerankerEngine
│
├── ProcessMemoryEnforcer (total memory limit, TTL checks)
│
├── Scheduler (FCFS, configurable batch sizes)
│ └── mlx-lm BatchGenerator
│
└── Cache Stack
├── PagedCacheManager (GPU, block-based, CoW, prefix sharing)
├── Hot Cache (in-memory tier, write-back)
└── PagedSSDCacheManager (SSD cold tier, safetensors format)
```
## Development
### CLI Server
```bash
git clone https://github.com/jundot/omlx.git
cd omlx
pip install -e ".[dev]"
pytest -m "not slow"
```
### macOS App
Requires Python 3.11+ and [venvstacks](https://venvstacks.lmstudio.ai) (`pip install venvstacks`).
```bash
cd packaging
# Full build (venvstacks + app bundle + DMG)
python build.py
# Skip venvstacks (code changes only)
python build.py --skip-venv
# DMG only
python build.py --dmg-only
```
See [packaging/README.md](packaging/README.md) for details on the app bundle structure and layer configuration.
## Contributing
Contributions are welcome! See [Contributing Guide](docs/CONTRIBUTING.md) for details.
- Bug fixes and improvements
- Performance optimizations
- Documentation improvements
## License
[Apache 2.0](LICENSE)
## Acknowledgments
- [MLX](https://github.com/ml-explore/mlx) and [mlx-lm](https://github.com/ml-explore/mlx-lm) by Apple
- [vllm-mlx](https://github.com/waybarrios/vllm-mlx) - oMLX originated as a fork of vllm-mlx v0.1.0, since re-architected with multi-model serving, paged SSD caching, an admin panel, and a standalone macOS menu bar app
- [venvstacks](https://venvstacks.lmstudio.ai) - Portable Python environment layering for the macOS app bundle
- [mlx-embeddings](https://github.com/Blaizzy/mlx-embeddings) - Embedding model support for Apple Silicon