https://github.com/slogsdon/mac-mini-llm-roster
Speed benchmarks for local LLMs on a 32 GB Apple Silicon Mac Mini: tokens-per-second, wall-clock, and quality across Qwen3, phi4-reasoning, lfm2, gpt-oss, mistral-nemo, and others — served via Ollama + LiteLLM with role aliases. Companion to a blog post.
https://github.com/slogsdon/mac-mini-llm-roster
apple-silicon benchmark gpt-oss lfm2 litellm llm-benchmark local-llm mac-mini mistral-nemo mlx ollama phi-4 qwen3
Last synced: 18 days ago
JSON representation
Speed benchmarks for local LLMs on a 32 GB Apple Silicon Mac Mini: tokens-per-second, wall-clock, and quality across Qwen3, phi4-reasoning, lfm2, gpt-oss, mistral-nemo, and others — served via Ollama + LiteLLM with role aliases. Companion to a blog post.
- Host: GitHub
- URL: https://github.com/slogsdon/mac-mini-llm-roster
- Owner: slogsdon
- License: mit
- Created: 2026-05-24T04:54:31.000Z (about 1 month ago)
- Default Branch: main
- Last Pushed: 2026-05-24T05:02:19.000Z (about 1 month ago)
- Last Synced: 2026-05-24T07:07:05.020Z (about 1 month ago)
- Topics: apple-silicon, benchmark, gpt-oss, lfm2, litellm, llm-benchmark, local-llm, mac-mini, mistral-nemo, mlx, ollama, phi-4, qwen3
- Language: Python
- Size: 62.5 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# mac-mini-llm-roster
Six benchmark runs of every local model I bothered to download, on one Mac Mini M4 with 32 GB of unified memory. Scripts, raw JSON, run logs, and the markdown tables that came out of them.
If [willitrunai.com](https://www.willitrunai.com) told you a model will fit on your Mac, this is the next question: *given it fits, does it actually behave at the contexts and budgets you'd run it at?* For most of the models in here, the answer was a surprise — usually not in the direction the model card implied.
This is the data side of a forthcoming blog post (link goes here when it ships).
## What's in here
```
scripts/
roster_speed_test.py 8 prompts × N models at production num_ctx
roster_speed_test_maxctx.py Same prompts, num_ctx pinned to each model's
ollama_max
data/
2026-05-23-roster-speed/ Run 1: 10 models at the defaults I was shipping
2026-05-23-roster-speed-delta/ Run 2: 6 models re-measured with think:false fixes
2026-05-24-roster-speed-maxctx/ Run 3: 11-model roster at maxed contexts
2026-05-25-roster-additions/ Run 4: deepseek-r1:14b + qwen2.5-coder:14b adds
2026-05-26-alias-verification/ Run 5: alias-route verification + dedup audit
2026-05-26-pipeline-challenger/ Run 6: lfm2:24b vs mistral-small:24b head-to-head
litellm-example/
config.yaml The use-case-alias pattern I run in production,
with OTel + Langfuse wiring, sanitized
```
Each `data//` folder carries `results.json` (one row per call), usually a `run.log` (full stdout), and a `report.md` (tables and the notes I took at the time). Some runs add a `README.md` for run-specific context, and the dedup run adds an `audit.md`.
## Hardware
One M4 Mac Mini, 32 GB unified memory, macOS, Ollama 0.24 running on the host. No GPU box, no quant tricks past whatever the published tags use.
## Running it
```bash
python3 scripts/roster_speed_test.py
python3 scripts/roster_speed_test_maxctx.py
```
Each script walks the model list sequentially so the cold-load cost falls on prompt 1 and prompts 2–8 hit a warm runner. Default output is `/tmp/roster_speed_*.json`; pass `--out PATH` to redirect or `--only alias1,alias2` to subset. A full run is roughly 60–110 minutes on this hardware depending on how much the reasoning models decide to think.
## How to read `results.json`
One object per (model, prompt) call. The fields that matter:
| field | meaning |
|---|---|
| `alias` | the alias the call was made under (the naming scheme evolved across runs — see each `report.md`) |
| `tag` | the underlying Ollama tag |
| `num_ctx` / `num_predict` | what the call was configured with |
| `wall_s` | total request wall time (load + prompt eval + thinking + generation) |
| `eval_tps` | tokens/sec during generation only, from Ollama's `eval_count` / `eval_duration` |
| `eval_count` | how many tokens the model actually produced |
| `thinking_chars` | chars in the separated thinking channel (0 if the model doesn't have one) |
| `content` | first 300 chars of the visible response |
| `error` | `null` on success; populated on timeouts or HTTP errors |
`eval_tps` is the clean per-model speed number — it excludes prompt eval and KV alloc, so a cold first prompt doesn't drag down the average. `wall_s` is what a user actually feels.
## Why hit Ollama directly?
In production everything routes through LiteLLM at `:4000` so I can put Langfuse + OTel in front of it. But LiteLLM strips Ollama's `eval_count` / `eval_duration` (it only surfaces OpenAI-format `usage`) and doesn't pass the separated `thinking` field through cleanly. For benchmarking that's a dealbreaker, so the scripts go straight to `:11434`. Production traffic still goes through LiteLLM.
## License
MIT. Take the scripts, take the data, tell me where I got it wrong.
## What I'd still like to know
- Which model pairs co-load without eviction on this hardware (concurrency profiling)
- The same matrix on an M4 Pro or Max — at what point does the small-dense-model KV penalty disappear
- A clean non-thinking baseline: every reasoning model with `think:false` forced, for direct comparison
- Drift over time: re-run monthly and watch the numbers move
### Keywords for the search engines
`ollama`, `litellm`, `local-llm`, `apple-silicon`, `mac-mini`, `m4`, `mlx`, `mixture-of-experts`, `qwen3`, `qwen2.5-coder`, `deepseek-r1`, `phi4-reasoning`, `lfm2`, `gpt-oss`, `mistral-small`, `mistral-nemo`, `gemma`, `granite`, `benchmark`, `tokens-per-second`, `kv-cache`, `thinking-channel`, `langfuse`, `otel`, `role-aliases`