https://github.com/lukaspustina/4lm
Local LLM control plane for Apple Silicon — one command, your hardware, your data.
apple-silicon cli inference llm local-llm macos mlx ollama open-webui openai-compatible qwen3 self-hosted
- Host: GitHub
- URL: https://github.com/lukaspustina/4lm
- Owner: lukaspustina
- License: mit
- Created: 2026-04-26T17:05:16.000Z (about 1 month ago)
- Default Branch: main
- Last Pushed: 2026-05-17T19:48:46.000Z (22 days ago)
- Last Synced: 2026-05-17T21:40:34.147Z (22 days ago)
- Topics: apple-silicon, cli, inference, llm, local-llm, macos, mlx, ollama, open-webui, openai-compatible, qwen3, self-hosted
- Language: Shell
- Homepage: https://lukaspustina.github.io/4lm/
- Size: 1.15 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Security: SECURITY.md
README
# 4lm — Local LLM control plane for Apple Silicon
**A local LLM stack you don't have to babysit.**
Your Mac. Or a Mac in your closet. Either way, one command.
Your hardware, your data, no metered bill.
Running multiple local LLMs on Apple Silicon means juggling ~140 GB
of weights, wired-memory limits, profile YAMLs, and Open WebUI config
drift. You end up administering the stack instead of using it.
`4lm` is the CLI that refuses to let that happen.
## Two shapes
One installer, one CLI, two ways to run it.
### Workstation — your Mac IS the LLM
```sh
./install.sh # omlx + Open WebUI + opencode TUI
4lm start
open http://localhost:3000 # register your admin account here, first
4lm opencode # daily driver
```
Open WebUI on `http://localhost:3000` with RAG, web search, code
interpreter, and memory wired in by default. `opencode` in your
terminal, pointed at the local `/v1`. Your laptop is the assistant.
> **Register your WebUI admin account before doing anything else.** The
> first account to register becomes the admin. Later accounts default to
> the `pending` role with no privileges, and registration is locked after
> the first user. If you skip this and later run `4lm expose lan --confirm`
> on a network you don't fully control, someone else can claim the admin
> account and lock you out.
### Appliance — a Mac in your closet serves the LAN
```sh
./install.sh --backend-only # skips Open WebUI + opencode
4lm start
4lm expose lan --confirm
```
Headless OpenAI-compatible `/v1/*` API on the LAN. Other machines run
their own clients (`opencode`, Open WebUI, Continue.dev — anything that
speaks `/v1`) pointed at `http://<host>:8000`. The Mac Studio in the
closet does the inference; the Air on the couch does the typing.
> **The backend has no auth.** Anyone who can reach `:8000` on your LAN
> can call `/v1/*`. Use this on a network you trust, or front it with
> Tailscale (or another VPN that does authenticate).
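
As a quick check from a client machine, the sketch below builds the base URL and probes `/v1/models`. The hostname `studio.local` and both helper names are hypothetical; only port `8000`, the `/v1` prefix, and the `OPENAI_API_BASE_URL` variable come from this README.

```sh
# Hypothetical appliance hostname; replace with your Mac's LAN name or IP.
BACKEND_HOST="${BACKEND_HOST:-studio.local}"

# backend_base_url HOST [PORT]: print the OpenAI-compatible base URL.
backend_base_url() {
  printf 'http://%s:%s/v1\n' "$1" "${2:-8000}"
}

# backend_reachable HOST [PORT]: succeed if /v1/models answers within 5 s.
backend_reachable() {
  curl -fsS --max-time 5 "$(backend_base_url "$@")/models" >/dev/null
}

# Usage on a client machine (not executed here):
#   backend_reachable "$BACKEND_HOST" &&
#     export OPENAI_API_BASE_URL="$(backend_base_url "$BACKEND_HOST")"
```

Because the backend has no auth, a successful probe from an untrusted machine is itself a sign that the network boundary needs tightening.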
## What it refuses to do
- **Never auto-starts after reboot.** A 70 GB working set should not
sneak onto wired memory before you've made coffee. Plists live in
`~/.4lm/launchd/` — launchd never finds them unless `4lm start`
says so. Opt in with `4lm autostart enable` if you want it.
- **Never binds to LAN without `--confirm`.** No env-var bypass, no
config typo, no *"I thought it was already local."* `4lm expose lan`
is a deliberate two-step. (Tailscale is still the better answer.)
- **Never silently breaks profile switches.** `4lm profile set <name>`
validates the YAML → swaps the active symlink → polls `/v1/models`
for 30 s → on timeout, restores the previous symlink and re-polls.
Bad YAML never kills the stack.
- **Never invalidates your knowledge base across profiles.** Every
omlx profile serves the embedder as `qwen3-embedding` and the
reranker as `qwen3-reranker`. Switch from `default` (65 GB) to
`lean` (40 GB) to `max-100gb` (92 GB) — the same RAG index keeps
working. Switch profiles like you switch branches.
- **Never lets you OOM silently.** `install.sh` enforces
`iogpu.wired_limit_mb=98304` via sudoers + sysctl. `4lm doctor`
smoke-tests inference. `4lm diag` shows what's actually running
when the fans spin up.
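
The switch-and-rollback sequence described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `bin/4lm`: the function names are hypothetical, the YAML validation step is reduced to a file-existence check, and the health command is passed in so the logic can be exercised without a running backend.

```sh
# poll_until CMD TIMEOUT_S [INTERVAL_S]
# Re-run CMD until it succeeds; give up after roughly TIMEOUT_S seconds of waiting.
poll_until() {
  local cmd="$1" timeout="$2" interval="${3:-1}" waited=0
  while ! sh -c "$cmd" >/dev/null 2>&1; do
    waited=$((waited + interval))
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep "$interval"
  done
  return 0
}

# switch_profile CONFIG_DIR NAME HEALTH_CMD [TIMEOUT_S]
# Swap the active-profile symlink, wait for the backend, roll back on timeout.
switch_profile() {
  local dir="$1" name="$2" health="$3" timeout="${4:-30}"
  local link="$dir/active-profile" target="profiles/$name.yaml" previous
  # Stand-in for real YAML validation: the profile file must at least exist.
  if [ ! -f "$dir/$target" ]; then
    echo "no such profile: $name" >&2
    return 2
  fi
  previous="$(readlink "$link" || true)"
  ln -sfn "$target" "$link"
  if poll_until "$health" "$timeout"; then
    return 0
  fi
  # Timed out: restore the previous symlink and poll once more.
  if [ -n "$previous" ]; then
    ln -sfn "$previous" "$link"
    poll_until "$health" "$timeout" || echo "backend still unhealthy after rollback" >&2
  fi
  return 1
}
```

In real use the health command would be something like `curl -fsS http://localhost:8000/v1/models`. Note that `ln -sfn` removes and recreates the link in two steps; a strictly atomic swap would create a temporary link and rename it over the old one.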
## What it actually does
- **One CLI**, three backends behind one OpenAI-compatible `/v1`
seam — `omlx` (default; vLLM-style MLX, paged KV cache, multi-model
EnginePool), `mlx_lm` (upstream MLX, single-model), `ollama`
(GGUF/llama.cpp). Clients don't know or care which is running.
- **Atomic profile switching with bounded rollback** (above).
- **Same-name reissue = live config reload.** Edit a profile YAML
and re-run `4lm profile set <name>`: this re-renders
`model_settings.json`, re-stages model symlinks, kickstarts the
backend. No full stop/start.
- **Open WebUI, preconfigured for daily use.** DuckDuckGo web search,
Pyodide code interpreter, personal memory, follow-up + autocomplete
suggestions, file-upload RAG. Embeddings + reranker served by the
same omlx backend → no second service, no cloud calls, fully offline.
(Most of these settings are PersistentConfig: copied into `webui.db`
on first init only — after that the admin UI is source of truth.
Toggles in the admin panel survive restarts; env-var changes do not.)
- **Idempotent install / upgrade / uninstall.** Every step is a
no-op on re-run: sudoers, sysctl, newsyslog, pipx, opencode config.
Re-running `--backend-only` over a full install does not strip the
WebUI; re-running the full installer over `--backend-only` upgrades
cleanly.
- **Visibility commands.** `4lm doctor` (prereqs + smoke-test
inference), `4lm diag` (live clients, in-flight requests, top CPU
consumers), `4lm outdated` / `4lm upgrade` (PyPI + Homebrew + HF).
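
To illustrate what "a no-op on re-run" means for a single install step, here is a minimal sketch of an idempotent file install. The helper name `install_if_changed` is hypothetical and this is not taken from `install.sh`; it only shows the compare-before-write pattern.

```sh
# install_if_changed SRC DST
# Copy SRC to DST only when DST is missing or differs; report what happened.
install_if_changed() {
  local src="$1" dst="$2"
  if [ -f "$dst" ] && cmp -s "$src" "$dst"; then
    echo "unchanged: $dst"
    return 0
  fi
  mkdir -p "$(dirname "$dst")"
  cp "$src" "$dst"
  echo "installed: $dst"
}
```

Running the same call twice writes the file once and reports `unchanged` the second time, which is the property that makes re-running an installer safe.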
## Quickstart
```sh
make bootstrap # Brewfile + Brewfile-tui (skipped if BACKEND_ONLY=1)
# core: shellcheck, shfmt, bats-core, jq, python@3.12,
# pipx, llmfit, ollama; tui extra: opencode
make install # ~/.4lm/, sudoers, sysctl, pipx-installed deps,
# log rotation, opencode config
make models # ~140 GB from HuggingFace (idempotent; same target updates)
4lm start # bootstrap launchd agents
4lm opencode # daily driver (alias: 4lm code)
```
> **64 GB Macs: switch to the `lean` profile first.** Run
> `4lm profile set lean` before `make models`. The `default` profile
> needs a 96 GB+ Mac (~65 GB steady); `lean` runs at ~40 GB steady, fits
> a 64 GB Mac, and downloads ~80 GB instead of ~140 GB. `4lm doctor`
> will warn you if the active profile doesn't fit your hardware.
After a reboot: `4lm start`. There's no autostart and that's a feature.
Backend-only variant:
```sh
make bootstrap BACKEND_ONLY=1 # skips opencode brew formula
./install.sh --backend-only # skips Open WebUI + opencode + their plists/configs
4lm start && 4lm expose lan --confirm
```
See [`docs/setup.md`](docs/setup.md) for the operator runbook,
including the `OPENAI_API_BASE_URL` wiring for consumer hosts on
the LAN.
## Profile lineup
Six profiles. The three Qwen3-stack tiers share an 8B embedder so
knowledge bases stay valid across switches.
| Profile | Backend | Coder | Chat | Embed | Rerank | Vision | Steady | Fits on |
|-----------------|---------|-------------------------|-----------------|-------|--------|--------|--------|---------|
| `lean` | omlx | Qwen3-Coder-30B-A3B | Qwen3.6-35B-A3B | 8B | 0.6B | — | ~40 GB | 64 GB+ |
| `default` | omlx | Qwen3-Coder-Next (80B) | Qwen3.6-35B-A3B | 8B | 0.6B | VL-8B | ~65 GB | 96 GB+ |
| `max-100gb` | omlx | Qwen3-Coder-Next (80B) | Qwen3-Next-80B | 8B | 4B | VL-8B | ~92 GB | 128 GB |
| `mlx-coding` | omlx | Qwen3-Coder-Next (80B) | — | — | — | — | ~42 GB | 64 GB+ |
| `mlx-knowledge` | omlx | — | Qwen3.6-35B-A3B | 8B | 0.6B | — | ~23 GB | 36 GB+ |
| `ollama` | ollama | qwen3-coder-next:q4_K_M | — | — | — | — | ~22 GB | 36 GB+ |
**Memory math for `default` on a 128 GB Mac.** Qwen3-Coder-Next 80B
(~42 GB 4-bit) + Qwen3.6-35B-A3B (~12 GB) + Qwen3-Embedding-8B
(~5 GB) + Qwen3-Reranker-0.6B (~0.4 GB) + Qwen3-VL-8B (~5 GB) ≈
65 GB steady. Both large models (the 80B coder and the 35B chat
model) are MoE → ~3B active params each → KV cache and batched
decoding fit comfortably in the remaining ~33 GB of the wired-memory
budget.
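
The arithmetic can be reproduced with a few lines of shell. This is a sketch: the helper names are hypothetical, the per-model sizes are the approximate figures quoted above, and the MB-to-GB conversion divides the `98304` MB wired limit by 1000, which is an assumption about how the headroom figure was rounded.

```sh
# sum_gb N...: add sizes given in GB, printed with one decimal place.
sum_gb() {
  printf '%s\n' "$@" | awk '{ total += $1 } END { printf "%.1f\n", total }'
}

# headroom_gb LIMIT_MB STEADY_GB: wired-memory budget left after the steady set.
# Assumption: MB is converted to GB by dividing by 1000.
headroom_gb() {
  awk -v limit_mb="$1" -v steady="$2" \
    'BEGIN { printf "%.1f\n", limit_mb / 1000 - steady }'
}

sum_gb 42 12 5 0.4 5     # coder + chat + embed + rerank + vision; prints 64.4
headroom_gb 98304 65     # wired limit minus rounded steady set; prints 33.3
```

These are estimates from nominal weight sizes, not measurements; `4lm diag` is the tool for seeing what is actually resident.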
The everyday ladder is `lean` → `default` → `max-100gb`. `mlx-coding`
strips everything except the 80B coder so long agentic sessions get
maximum KV-cache headroom. `mlx-knowledge` is the text-only vault-
synthesis tier. `ollama` is the GGUF smoke test — switch to it
occasionally to confirm Ollama still works, then switch back.
Each profile YAML carries an extensive header comment documenting
slot-by-slot rationale, memory math, when-to-use, and
assumptions-to-validate.
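
As a rough aid for picking a starting profile, the "Fits on" column can be turned into a lookup. The function `suggest_profile` is hypothetical and not part of `4lm`; the thresholds are copied from the table, and for the 36 GB tier it arbitrarily names `mlx-knowledge` although `ollama` fits the same floor.

```sh
# suggest_profile RAM_GB
# Print the largest profile whose "Fits on" floor is met by RAM_GB (an integer).
suggest_profile() {
  local ram_gb="$1"
  if [ "$ram_gb" -ge 128 ]; then
    echo max-100gb
  elif [ "$ram_gb" -ge 96 ]; then
    echo default
  elif [ "$ram_gb" -ge 64 ]; then
    echo lean
  elif [ "$ram_gb" -ge 36 ]; then
    echo mlx-knowledge # `ollama` has the same 36 GB+ floor
  else
    echo "no listed profile fits ${ram_gb} GB" >&2
    return 1
  fi
}

# On macOS, installed RAM in GiB can be read with (not executed here):
#   ram_gb=$(( $(sysctl -n hw.memsize) / 1073741824 ))
#   suggest_profile "$ram_gb"
```

This only encodes the table; `4lm doctor` remains the authoritative check for whether the active profile fits the hardware.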
## Architecture
```
4lm (single control command)
│ bootstrap / bootout / kickstart
▼
┌──────────────────────────────────┐ ┌──────────────────────────────────┐
│ com.4lm.backend │ │ com.4lm.webui │
│ omlx | mlx_lm | ollama │ │ open-webui serve │
│ :8000 (OpenAI API) │←───│ :3000 (Web UI) │
└──────────────────────────────────┘ └──────────────────────────────────┘
▲ ▲
│ HTTP │ HTTP (browser)
┌─────┴────┐ ┌─────┴────┐
│ opencode │ │ Safari │
│ TUI │ │ Chrome │
└──────────┘ └──────────┘
```
The backend is the source of truth for inference. Open WebUI is a
frontend proxying to it; its own state (accounts, settings, RAG index)
lives in `~/.4lm/openwebui-data/`. `opencode` talks directly to `:8000/v1`.
None of them know or care about each other → the OpenAI-compatible
API is the seam.
In `--backend-only` mode the WebUI block is absent on disk; `bin/4lm`
probes for the plist and treats the layer as not installed. Consumer
hosts on the LAN run their own clients against the backend's
`:8000/v1`.
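
The "probe for the plist" idea can be sketched as below. Both function names are hypothetical and this is not the logic in `bin/4lm`; the plist path follows the file layout in this README, and the `gui/<uid>` launchd domain is an assumption about how the agents are loaded.

```sh
# layer_installed LAYER [ROOT]
# Succeed if the layer's plist exists on disk. LAYER is "backend" or "webui";
# ROOT defaults to ~/.4lm.
layer_installed() {
  [ -f "${2:-$HOME/.4lm}/launchd/com.4lm.$1.plist" ]
}

# start_cmd LAYER [ROOT]
# Print (do not run) a launchctl call that would load the layer's agent.
# Assumption: agents live in the per-user gui/<uid> domain.
start_cmd() {
  printf 'launchctl bootstrap gui/%s %s\n' \
    "$(id -u)" "${2:-$HOME/.4lm}/launchd/com.4lm.$1.plist"
}

# Usage (not executed here):
#   layer_installed webui && start_cmd webui
```

Keeping the plists outside `~/Library/LaunchAgents` is what makes the explicit `bootstrap` step necessary, and that is the mechanism behind the no-autostart default.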
## Common operations
```sh
# Lifecycle
4lm start | stop | restart [backend|webui|all]
4lm # status (alias for `4lm status`; --json for parseable)
4lm logs [backend|webui] # tail -F
# Profiles
4lm profile list
4lm profile set <name> # atomic, validated, rollback on failure
4lm profile show [<name>] # YAML of active (or named) profile
4lm profile validate [--all]
# Models
make models # download/update everything in config/profiles/
4lm model list # what's loaded vs cached
4lm model recommend [] # top picks via llmfit + localmaxxing benchmarks
# Updates
4lm outdated # check PyPI / Homebrew / HF
4lm upgrade [brew|models|python] # apply pending updates
# Diagnostics
4lm doctor # prereqs + smoke-test inference
4lm diag # live clients, in-flight inference, top CPU
# Network exposure (refuses without --confirm)
4lm expose lan --confirm
4lm expose local --confirm
# Autostart at login (off by default)
4lm autostart enable [backend|webui|all]
4lm autostart disable [backend|webui|all]
4lm autostart status
# Removal
4lm uninstall # bootout, remove ~/.local/bin/4lm; keep ~/.4lm/
make uninstall # full: bootout, ~/.4lm/, sudoers, newsyslog, pipx packages
```
Every command has `--help`.
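
The "refuses without `--confirm`" behaviour amounts to a small argument guard. The sketch below is hypothetical (the function name, messages, and exit codes are not taken from `bin/4lm`) and only shows the shape of a deliberate two-step command.

```sh
# expose MODE [--confirm]
# Accept only "lan" or "local", and refuse to act unless --confirm is given.
expose() {
  local mode="${1:-}" confirm="${2:-}"
  case "$mode" in
    lan | local) ;;
    *)
      echo "usage: expose lan|local --confirm" >&2
      return 2
      ;;
  esac
  if [ "$confirm" != "--confirm" ]; then
    echo "refusing to change bind mode to '$mode' without --confirm" >&2
    return 1
  fi
  # A real implementation would update the bind mode and restart here.
  echo "bind mode: $mode"
}
```

Because the flag is checked as a literal argument, no environment variable or config value can satisfy the guard by accident.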
## File layout
```
~/.4lm/
├── bin/ control + wrappers (called by launchd)
├── launchd/ com.4lm.{backend,webui}.plist
├── config/
│ ├── active-profile symlink → profiles/<name>.yaml
│ ├── previous-profile plain text, used for rollback
│ ├── network.yaml bind mode + ports (single config channel)
│ ├── webui_secret_key mode 0600, generated on first lan-mode start
│ └── profiles/ lean / default / max-100gb / mlx-coding /
│ mlx-knowledge / ollama YAMLs
├── logs/ backend.log, webui.log (merged stdout+stderr)
└── openwebui-data/ Open WebUI db, settings, RAG index
~/.local/bin/4lm symlink to ~/.4lm/bin/4lm
~/.config/opencode/opencode.jsonc seeded by install.sh from the repo template
/etc/sudoers.d/4lm-stack NOPASSWD for the backend's sysctl call
/etc/newsyslog.d/4lm.conf log rotation (10 MB, 7 generations, gzipped)
```
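
Since the active profile is just a symlink, its name can be read without going through the CLI. A minimal sketch, assuming the layout above; the helper `active_profile` is hypothetical.

```sh
# active_profile [CONFIG_DIR]
# Print the active profile's name by resolving the active-profile symlink.
# CONFIG_DIR defaults to ~/.4lm/config.
active_profile() {
  local link="${1:-$HOME/.4lm/config}/active-profile" target
  target="$(readlink "$link")" || return 1
  basename "$target" .yaml
}
```

For example, with the symlink pointing at `profiles/lean.yaml`, `active_profile` prints `lean`. `4lm profile show` remains the supported way to inspect the active profile.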
## What this isn't
- Not a multi-user server. Single-user, single-host, single login session.
- Not a Docker or pip-into-system-Python project. PEP 668 is honoured
via pipx; Python 3.12 is pinned for compatibility with the MLX ecosystem.
- Not auto-starting. After reboot you run `4lm start`. On purpose.
- Not a cloud-fallback router. If the local stack fails, fall back by
hand; 4lm does not silently route to Anthropic, OpenAI, or anyone else.
## Where this is going
Today: best Apple Silicon models, fastest local inference, a
Claude-Desktop-shaped Open WebUI in front of it. Next: tool calling
+ MCP so local models can *use* Open WebUI's tools (web search, RAG,
code interpreter, memory) instead of hallucinating them — and reach
toward Claude-Desktop feature parity. The phased roadmap lives in
[`specs/sdd/webui-tools-and-mcp.md`](specs/sdd/webui-tools-and-mcp.md).
Foundation is shipped at v0.6; agentic phases are draft.
## Development
```sh
make bootstrap # one-time: Brewfile + Brewfile-tui + pipx ensurepath
make lint # shellcheck + shfmt -d
make fmt # shfmt -w
make test # bats + pytest
make check # everything; CI runs this on macos-latest
make ci # mirror the CI matrix locally (default + backend-only legs)
```
CI on every PR (`.github/workflows/ci.yml`):
`shellcheck`, `shfmt -d`, `bash -n`, `plutil -lint`, `xmllint --noout`,
profile YAML validation, bats suite — both install modes — driven by
`make check`.
## Conventions
- Bash scripts: `set -euo pipefail`, `shellcheck` clean, `shfmt -i 2 -ci`.
- Plists carry `__HOME__` placeholders, substituted by `install.sh`.
- Profiles validated by `validate_profile()` in `bin/4lm` (regex name
whitelist, schema check, parser-enum match against backend defaults).
- Conventional-commit prefixes (`feat:`, `fix:`, `refactor:`, `chore:`,
`docs:`, `test:`); subject ≤ 72 chars.
- No `Co-Authored-By: Claude` lines.
- Formatting-only changes ship in their own commits.
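
Two of these conventions are easy to picture in code. The sketch below is illustrative only: `render_plist` and `valid_profile_name` are hypothetical names, and the whitelist (lowercase letters, digits, hyphens) is an assumed pattern that happens to accept all six shipped profile names, not the regex used by `validate_profile()`.

```sh
# render_plist TEMPLATE [HOME_DIR]
# Print TEMPLATE with every __HOME__ placeholder replaced by HOME_DIR.
# Assumes the home path contains no "|" or "&" characters.
render_plist() {
  sed "s|__HOME__|${2:-$HOME}|g" "$1"
}

# valid_profile_name NAME
# Succeed only for non-empty names made of lowercase letters, digits, hyphens.
valid_profile_name() {
  case "$1" in
    '' | *[!a-z0-9-]*) return 1 ;;
    *) return 0 ;;
  esac
}
```

A name whitelist of this kind keeps path separators and dots out of anything that is later joined into `profiles/<name>.yaml`.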
## Model history (short)
4lm migrated through `mlx-openai-server` → `mlx_lm` → `omlx` as Apple
Silicon MLX tooling matured. The initial model set (GLM-4.7-Flash +
Qwen3.6-35B-A3B) gave way to Qwen3-Coder + Qwen3.6-27B in v0.3 after
the Qwen3 thinking-loop bug forced a custom Jinja workaround. v0.6
consolidates onto the full Qwen3 stack (coder + chat + embed + rerank
+ vision) on a single omlx process. Per-version detail and the
thinking-mode template story live in [`CHANGELOG.md`](CHANGELOG.md).
## Credits
4lm is glue around several upstream projects that do the actual heavy
lifting. Go give them stars:
- [**omlx**](https://github.com/jundot/omlx) — vLLM-style MLX inference
server with paged KV cache, continuous batching, and multi-model
EnginePool. Primary backend.
- [**mlx_lm**](https://github.com/ml-explore/mlx-lm) — Apple's reference
MLX language-model library. Alternative single-model backend.
- [**Ollama**](https://github.com/ollama/ollama) — llama.cpp + Metal
GGUF serving. Smoke-test backend.
- [**Open WebUI**](https://github.com/open-webui/open-webui) — the
frontend. Without their PersistentConfig surface this stack would be
half the experience.
- [**opencode**](https://github.com/sst/opencode) — the TUI client.
- [**Qwen team @ Alibaba**](https://github.com/QwenLM) — the model
family carrying the default profile (Qwen3-Coder, Qwen3.6,
Qwen3-Embedding, Qwen3-Reranker, Qwen3-VL). Apache-2.0 licensed.
- [**llmfit**](https://github.com/lukaspustina/llmfit) +
[**localmaxxing.com**](https://localmaxxing.com) — hardware-fit
scoring and community benchmarks used by `4lm model recommend`.
## License
[MIT](LICENSE).
## Documentation
- [`docs/setup.md`](docs/setup.md) — operator runbook (sudoers, troubleshooting, model pulls, LAN client wiring)
- [`docs/profile-schema.md`](docs/profile-schema.md) — YAML key reference for all backends
- [`docs/autostart.md`](docs/autostart.md) — opt-in login autostart mechanics
- [`specs/sdd/webui-tools-and-mcp.md`](specs/sdd/webui-tools-and-mcp.md) — active SDD for tool calling + MCP
- [`specs/done/sdd/4lm-rework-2026-05-09.md`](specs/done/sdd/4lm-rework-2026-05-09.md) — archived design doc this repo implements
- [`CONTRIBUTING.md`](CONTRIBUTING.md) — dev setup, commit style, PR checklist
- [`SECURITY.md`](SECURITY.md) — threat model + vulnerability reporting
- [`CLAUDE.md`](CLAUDE.md) — orientation for AI assistants working in this repo