https://github.com/jundot/omlx
LLM inference server with continuous batching & SSD caching for Apple Silicon — managed from the macOS menu bar
- Host: GitHub
- URL: https://github.com/jundot/omlx
- Owner: jundot
- License: apache-2.0
- Created: 2026-02-13T14:13:27.000Z (28 days ago)
- Default Branch: main
- Last Pushed: 2026-02-27T11:51:13.000Z (14 days ago)
- Last Synced: 2026-02-27T11:59:12.031Z (14 days ago)
- Topics: apple-silicon, inference-server, llm, macos, mlx, openai-api
- Language: Python
- Homepage: https://github.com/jundot/omlx
- Size: 8.23 MB
- Stars: 88
- Watchers: 4
- Forks: 13
- Open Issues: 16
Metadata Files:
- Readme: README.md
- Contributing: docs/CONTRIBUTING.md
- License: LICENSE
oMLX
LLM inference, optimized for your Mac
Continuous batching and tiered KV caching, managed directly from your menu bar.
Install · Quickstart · Features · Models · CLI Configuration · GitHub
---
> *Every LLM server I tried made me choose between convenience and control. I wanted to pin everyday models in memory, auto-swap heavier ones on demand, set context limits - and manage it all from a menu bar.*
>
> *oMLX persists KV cache across a hot in-memory tier and cold SSD tier - even when context changes mid-conversation, all past context stays cached and reusable across requests, making local LLMs practical for real coding work with tools like Claude Code. That's why I built it.*
## Install
### macOS App
Download the `.dmg` from [Releases](https://github.com/jundot/omlx/releases), drag to Applications, done. The app includes in-app auto-update, so future upgrades are just one click.
### Homebrew
```bash
brew tap jundot/omlx https://github.com/jundot/omlx
brew install omlx
# Upgrade to the latest version
brew update && brew upgrade omlx
# Run as a background service (auto-restarts on crash)
brew services start omlx
```
### From Source
```bash
git clone https://github.com/jundot/omlx.git
cd omlx
pip install -e .
```
Requires Python 3.10+ and Apple Silicon (M1/M2/M3/M4).
## Quickstart
### macOS App
Launch oMLX from your Applications folder. The Welcome screen guides you through three steps - model directory, server start, and first model download. That's it.
### CLI
```bash
omlx serve --model-dir ~/models
```
The server discovers models from subdirectories automatically. Any OpenAI-compatible client can connect to `http://localhost:8000/v1`. A built-in chat UI is also available at `http://localhost:8000/admin/chat`.
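Any OpenAI-compatible client library works; as a dependency-free sketch, the standard library alone is enough. The model name `my-model` below is a placeholder for whatever is in your model directory:

```python
import json
import urllib.request

def build_chat_request(model, messages, stream=False):
    """Assemble a standard OpenAI-style chat completion payload."""
    return {"model": model, "messages": messages, "stream": stream}

def chat(payload, base_url="http://localhost:8000/v1"):
    """POST the payload to a running oMLX server and return the parsed reply."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_chat_request("my-model", [{"role": "user", "content": "Hello!"}])
# chat(payload)  # requires a running `omlx serve`
```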
### Homebrew Service
If you installed via Homebrew, you can run oMLX as a managed background service:
```bash
brew services start omlx # Start (auto-restarts on crash)
brew services stop omlx # Stop
brew services restart omlx # Restart
brew services info omlx # Check status
```
The service runs `omlx serve` with zero-config defaults (`~/.omlx/models`, port 8000). To customize, either set environment variables (`OMLX_MODEL_DIR`, `OMLX_PORT`, etc.) or run `omlx serve --model-dir /your/path` once to persist settings to `~/.omlx/settings.json`.
Logs are written to two locations:
- **Service log**: `$(brew --prefix)/var/log/omlx.log` (stdout/stderr)
- **Server log**: `~/.omlx/logs/server.log` (structured application log)
## Features
oMLX is built on top of [vllm-mlx](https://github.com/waybarrios/vllm-mlx), extending it with tiered KV caching, multi-model serving, an admin dashboard, Claude Code optimization, and Anthropic API support. Currently supports text-based LLMs - VLM and OCR model support is planned for upcoming milestones.
### Admin Dashboard
Web UI at `/admin` for real-time monitoring, model management, chat, benchmark, and per-model settings. All CDN dependencies are vendored for fully offline operation.
### Tiered KV Cache (Hot + Cold)
Block-based KV cache management inspired by vLLM, with prefix sharing and Copy-on-Write. The cache operates across two tiers:
- **Hot tier (RAM)**: Frequently accessed blocks stay in memory for fast access.
- **Cold tier (SSD)**: When the hot cache fills up, blocks are offloaded to SSD in safetensors format. On the next request with a matching prefix, they're restored from disk instead of recomputed from scratch - even after a server restart.
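As an illustration of the tiering idea only (not oMLX's actual implementation, which works on token-block prefixes with copy-on-write sharing and safetensors files), the policy can be sketched as an LRU hot dict that spills evictions into a cold store and promotes on hit:

```python
from collections import OrderedDict

class TieredCache:
    """Toy two-tier KV block cache: LRU hot tier, unbounded cold tier."""

    def __init__(self, hot_capacity):
        self.hot = OrderedDict()   # in-memory tier, LRU order
        self.cold = {}             # stands in for the SSD tier
        self.hot_capacity = hot_capacity

    def put(self, block_id, block):
        self.hot[block_id] = block
        self.hot.move_to_end(block_id)
        while len(self.hot) > self.hot_capacity:
            evicted_id, evicted = self.hot.popitem(last=False)
            self.cold[evicted_id] = evicted  # offload instead of discarding

    def get(self, block_id):
        if block_id in self.hot:
            self.hot.move_to_end(block_id)
            return self.hot[block_id]
        if block_id in self.cold:
            block = self.cold.pop(block_id)  # restore from "disk"
            self.put(block_id, block)        # promote back to the hot tier
            return block
        return None                          # true miss: must recompute
```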
### Continuous Batching
Handles concurrent requests through mlx-lm's BatchGenerator. Prefill and completion batch sizes are configurable.
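The scheme can be pictured as a loop that admits waiting requests into the active batch between decode steps, rather than draining the whole batch first. A toy model (not mlx-lm's BatchGenerator):

```python
from collections import deque

def continuous_batching(requests, max_batch=2):
    """Toy continuous-batching loop.

    Each request is (name, tokens_to_generate). Whenever a sequence
    finishes, a waiting request joins the batch on the next step.
    """
    waiting = deque(requests)
    active = {}      # name -> tokens still to generate
    finished = []
    while waiting or active:
        # Admit new requests up to the batch limit (the "continuous" part).
        while waiting and len(active) < max_batch:
            name, n = waiting.popleft()
            active[name] = n
        # One decode step for every active sequence.
        for name in list(active):
            active[name] -= 1
            if active[name] == 0:
                del active[name]
                finished.append(name)
    return finished
```

A short request slotted in beside a long one finishes early and frees its slot immediately, which is what keeps latency low under concurrency.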
### Claude Code Optimization
Context scaling support for running smaller-context models with Claude Code. Scales reported token counts so that auto-compact triggers at the right time, and SSE keep-alive prevents read timeouts during long prefill.
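The idea can be sketched as a linear rescaling of reported usage against the window the client assumes; the exact formula oMLX applies may differ:

```python
def scale_reported_tokens(actual_tokens, model_ctx, claimed_ctx):
    """Scale token usage so a client that assumes a `claimed_ctx` window
    (e.g. what Claude Code expects) hits its auto-compact threshold when
    the model's real window `model_ctx` is proportionally as full.

    A sketch of the idea only, not necessarily oMLX's formula.
    """
    return int(actual_tokens * claimed_ctx / model_ctx)
```

For example, a 32K-context model reported against a 200K client window would report 100K used when the model's window is half full.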
### Multi-Model Serving
Load LLMs, embedding models, and rerankers within the same server. Models are managed through a combination of automatic and manual controls:
- **LRU eviction**: Least-recently-used models are evicted automatically when memory runs low.
- **Manual load/unload**: Interactive status badges in the admin panel let you load or unload models on demand.
- **Model pinning**: Pin frequently used models to keep them always loaded.
- **Per-model TTL**: Set an idle timeout per model to auto-unload after a period of inactivity.
- **Process memory enforcement**: Total memory limit (default: system RAM - 8GB) prevents system-wide OOM.
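How these controls interact can be sketched as an LRU map in which pinned models are skipped and eviction runs until the pool fits its memory budget. A toy model, not the actual EnginePool:

```python
from collections import OrderedDict

class EnginePoolSketch:
    """Toy eviction policy: LRU order, pinned models skipped,
    evict until under the memory limit."""

    def __init__(self, max_memory):
        self.max_memory = max_memory
        self.loaded = OrderedDict()  # name -> size, least recently used first
        self.pinned = set()

    def used(self):
        return sum(self.loaded.values())

    def load(self, name, size):
        self.loaded[name] = size
        self.loaded.move_to_end(name)
        # Evict least-recently-used unpinned models until we fit.
        for candidate in list(self.loaded):
            if self.used() <= self.max_memory:
                break
            if candidate in self.pinned or candidate == name:
                continue
            del self.loaded[candidate]
```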
### Per-Model Settings
Configure sampling parameters, chat template kwargs, TTL, and more per model directly from the admin panel. Changes apply immediately without server restart.
### Built-in Chat
Chat directly with any loaded model from the admin panel. Supports conversation history, model switching, dark mode, and reasoning model output.
### Model Downloader
Search and download MLX models from HuggingFace directly in the admin dashboard. Browse model cards, check file sizes, and download with one click.
### Performance Benchmark
One-click benchmarking from the admin panel. Measures prefill (PP) and text generation (TG) tokens per second, with partial prefix cache hit testing for realistic performance numbers.
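For reference, both figures are simply token counts over wall-clock time:

```python
def throughput(prompt_tokens, prefill_s, gen_tokens, gen_s):
    """Prefill (PP) and text-generation (TG) throughput in tokens/second."""
    return prompt_tokens / prefill_s, gen_tokens / gen_s
```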
### macOS Menubar App
Native PyObjC menubar app (not Electron). Start, stop, and monitor the server without opening a terminal. Includes real-time serving stats, auto-restart on crash, and in-app auto-update.
### API Compatibility
Drop-in replacement for OpenAI and Anthropic APIs. Supports streaming usage stats (`stream_options.include_usage`) and Anthropic adaptive thinking.
| Endpoint | Description |
|----------|-------------|
| `POST /v1/chat/completions` | Chat completions (streaming) |
| `POST /v1/completions` | Text completions (streaming) |
| `POST /v1/messages` | Anthropic Messages API |
| `POST /v1/embeddings` | Text embeddings |
| `POST /v1/rerank` | Document reranking |
| `GET /v1/models` | List available models |
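As a sketch of the two request shapes (model names are placeholders for whatever is in your model directory):

```python
def openai_stream_request(model, messages):
    """Streaming chat request that also asks for usage stats in the
    stream, via the standard OpenAI `stream_options` field."""
    return {
        "model": model,
        "messages": messages,
        "stream": True,
        "stream_options": {"include_usage": True},
    }

def anthropic_messages_request(model, messages, max_tokens=256):
    """Anthropic Messages API body; note that the Messages API
    requires `max_tokens` on every request."""
    return {"model": model, "messages": messages, "max_tokens": max_tokens}
```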
### Tool Calling & Structured Output
Supports all function calling formats available in mlx-lm, JSON schema validation, and MCP tool integration. Tool calling requires the model's chat template to support the `tools` parameter. The following model families are auto-detected via mlx-lm's built-in tool parsers:
| Model Family | Format |
|---|---|
| Llama, Qwen, DeepSeek, etc. | JSON `` |
| Qwen3 Coder | XML `` |
| Gemma | `` |
| GLM (4.7, 5) | `/` XML |
| MiniMax | Namespaced `` |
| Mistral | `[TOOL_CALLS]` |
| Kimi K2 | `<\|tool_calls_section_begin\|>` |
| Longcat | `` |
Models not listed above may still work if their chat template accepts `tools` and their output uses a recognized `` XML format. Streaming requests with tool calls buffer all content and emit results at completion.
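A request advertising a tool uses the standard OpenAI function-calling schema; the weather tool below is purely illustrative:

```python
def chat_with_tools(model, messages):
    """Chat request advertising one function tool in the standard
    OpenAI function-calling schema."""
    return {
        "model": model,
        "messages": messages,
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get the current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }],
    }
```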
## Models
Point `--model-dir` at a directory containing MLX-format model subdirectories. A two-level folder layout (e.g., `mlx-community/model-name/`) is also supported.
```
~/models/
├── Step-3.5-Flash-8bit/
├── Qwen3-Coder-Next-8bit/
├── gpt-oss-120b-MXFP4-Q8/
└── bge-m3/
```
Models are auto-detected by type. You can also download models directly from the admin dashboard.
| Type | Models |
|------|--------|
| LLM | Any model supported by [mlx-lm](https://github.com/ml-explore/mlx-lm) |
| Embedding | BERT, BGE-M3, ModernBERT |
| Reranker | ModernBERT, XLM-RoBERTa |
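Discovery over the two directory layouts can be sketched with `pathlib`; the `config.json` criterion below is an assumption for illustration, not necessarily the check oMLX performs:

```python
from pathlib import Path

def discover_models(model_dir):
    """Sketch of directory-based discovery: treat any first- or
    second-level subdirectory containing a config.json as a model."""
    root = Path(model_dir)
    found = []
    for child in sorted(p for p in root.iterdir() if p.is_dir()):
        if (child / "config.json").is_file():
            found.append(child.name)
        else:  # two-level layout, e.g. mlx-community/model-name/
            for nested in sorted(p for p in child.iterdir() if p.is_dir()):
                if (nested / "config.json").is_file():
                    found.append(f"{child.name}/{nested.name}")
    return found
```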
## CLI Configuration
```bash
# Memory limit for loaded models
omlx serve --model-dir ~/models --max-model-memory 32GB
# Process-level memory limit (default: auto = RAM - 8GB)
omlx serve --model-dir ~/models --max-process-memory 80%
# Enable SSD cache for KV blocks
omlx serve --model-dir ~/models --paged-ssd-cache-dir ~/.omlx/cache
# Set in-memory hot cache size
omlx serve --model-dir ~/models --hot-cache-max-size 20%
# Adjust batch sizes
omlx serve --model-dir ~/models --prefill-batch-size 8 --completion-batch-size 32
# With MCP tools
omlx serve --model-dir ~/models --mcp-config mcp.json
# API key authentication
omlx serve --model-dir ~/models --api-key your-secret-key
```
All settings can also be configured from the web admin panel at `/admin`. Settings are persisted to `~/.omlx/settings.json`, and CLI flags take precedence.
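The precedence rule can be sketched as a two-layer merge, where only flags the user actually passed (non-`None`) override the persisted file:

```python
import json
from pathlib import Path

def effective_settings(cli_flags, settings_path="~/.omlx/settings.json"):
    """Sketch of the precedence rule: start from persisted settings,
    then let explicitly passed CLI flags win."""
    path = Path(settings_path).expanduser()
    persisted = json.loads(path.read_text()) if path.is_file() else {}
    merged = dict(persisted)
    merged.update({k: v for k, v in cli_flags.items() if v is not None})
    return merged
```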
## Architecture
```
FastAPI Server (OpenAI / Anthropic API)
│
├── EnginePool (multi-model, LRU eviction, TTL, manual load/unload)
│ ├── BatchedEngine (LLMs, continuous batching)
│ ├── EmbeddingEngine
│ └── RerankerEngine
│
├── ProcessMemoryEnforcer (total memory limit, TTL checks)
│
├── Scheduler (FCFS, configurable batch sizes)
│ └── mlx-lm BatchGenerator
│
└── Cache Stack
├── PagedCacheManager (GPU, block-based, CoW, prefix sharing)
├── Hot Cache (in-memory tier, write-back)
└── PagedSSDCacheManager (SSD cold tier, safetensors format)
```
## Development
### CLI Server
```bash
git clone https://github.com/jundot/omlx.git
cd omlx
pip install -e ".[dev]"
pytest -m "not slow"
```
### macOS App
Requires Python 3.11+ and [venvstacks](https://venvstacks.lmstudio.ai) (`pip install venvstacks`).
```bash
cd packaging
# Full build (venvstacks + app bundle + DMG)
python build.py
# Skip venvstacks (code changes only)
python build.py --skip-venv
# DMG only
python build.py --dmg-only
```
See [packaging/README.md](packaging/README.md) for details on the app bundle structure and layer configuration.
## Contributing
Contributions are welcome! See [Contributing Guide](docs/CONTRIBUTING.md) for details.
- Bug fixes and improvements
- Performance optimizations
- Documentation improvements
## License
[Apache 2.0](LICENSE)
## Acknowledgments
- [MLX](https://github.com/ml-explore/mlx) and [mlx-lm](https://github.com/ml-explore/mlx-lm) by Apple
- [vllm-mlx](https://github.com/waybarrios/vllm-mlx) - oMLX originated as a fork of vllm-mlx v0.1.0, since re-architected with multi-model serving, paged SSD caching, an admin panel, and a standalone macOS menu bar app
- [venvstacks](https://venvstacks.lmstudio.ai) - Portable Python environment layering for the macOS app bundle
- [mlx-embeddings](https://github.com/Blaizzy/mlx-embeddings) - Embedding model support for Apple Silicon