{"id":50915377,"url":"https://github.com/slogsdon/mac-mini-llm-roster","last_synced_at":"2026-06-16T14:02:57.309Z","repository":{"id":359923355,"uuid":"1248025072","full_name":"slogsdon/mac-mini-llm-roster","owner":"slogsdon","description":"Speed benchmarks for local LLMs on a 32 GB Apple Silicon Mac Mini: tokens-per-second, wall-clock, and quality across Qwen3, phi4-reasoning, lfm2, gpt-oss, mistral-nemo, and others — served via Ollama + LiteLLM with role aliases. Companion to a blog post.","archived":false,"fork":false,"pushed_at":"2026-05-24T05:02:19.000Z","size":64,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-24T07:07:05.020Z","etag":null,"topics":["apple-silicon","benchmark","gpt-oss","lfm2","litellm","llm-benchmark","local-llm","mac-mini","mistral-nemo","mlx","ollama","phi-4","qwen3"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/slogsdon.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-24T04:54:31.000Z","updated_at":"2026-05-24T05:02:23.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/slogsdon/mac-mini-llm-roster","commit_stats":null,"previous_names":["slogsdon/mac-mini-llm-roster"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/slogsdon/mac-mini-llm-roster","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/slogsdon%2Fmac-mini-llm-roster","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/slogsdon%2Fmac-mini-llm-roster/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/slogsdon%2Fmac-mini-llm-roster/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/slogsdon%2Fmac-mini-llm-roster/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/slogsdon","download_url":"https://codeload.github.com/slogsdon/mac-mini-llm-roster/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/slogsdon%2Fmac-mini-llm-roster/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34408788,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-16T02:00:06.860Z","response_time":126,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apple-silicon","benchmark","gpt-oss","lfm2","litellm","llm-benchmark","local-llm","mac-mini","mistral-nemo","mlx","ollama","phi-4","qwen3"],"created_at":"2026-06-16T14:02:56.445Z","updated_at":"2026-06-16T14:02:57.290Z","avatar_url":"https://github.com/slogsdon.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# mac-mini-llm-roster\n\nSix benchmark runs of every local model I bothered to download, on one Mac Mini M4 with 32 GB of unified memory. Scripts, raw JSON, run logs, and the markdown tables that came out of them.\n\nIf [willitrunai.com](https://www.willitrunai.com) told you a model will fit on your Mac, this is the next question: *given it fits, does it actually behave at the contexts and budgets you'd run it at?* For most of the models in here, the answer was a surprise — usually not in the direction the model card implied.\n\nThis is the data side of a forthcoming blog post (link goes here when it ships).\n\n## What's in here\n\n```\nscripts/\n  roster_speed_test.py          8 prompts × N models at production num_ctx\n  roster_speed_test_maxctx.py   Same prompts, num_ctx pinned to each model's\n                                ollama_max\ndata/\n  2026-05-23-roster-speed/        Run 1: 10 models at the defaults I was shipping\n  2026-05-23-roster-speed-delta/  Run 2: 6 models re-measured with think:false fixes\n  2026-05-24-roster-speed-maxctx/ Run 3: 11-model roster at maxed contexts\n  2026-05-25-roster-additions/    Run 4: deepseek-r1:14b + qwen2.5-coder:14b adds\n  2026-05-26-alias-verification/  Run 5: alias-route verification + dedup audit\n  2026-05-26-pipeline-challenger/ Run 6: lfm2:24b vs mistral-small:24b head-to-head\nlitellm-example/\n  config.yaml                   The use-case-alias pattern I run in production,\n                                with OTel + Langfuse wiring, sanitized\n```\n\nEach `data/\u003crun\u003e/` folder carries `results.json` (one row per call), usually a `run.log` (full stdout), and a `report.md` (tables and the notes I took at the time). Some runs add a `README.md` for run-specific context, and the dedup run adds an `audit.md`.\n\n## Hardware\n\nOne M4 Mac Mini, 32 GB unified memory, macOS, Ollama 0.24 running on the host. No GPU box, no quant tricks past whatever the published tags use.\n\n## Running it\n\n```bash\npython3 scripts/roster_speed_test.py\npython3 scripts/roster_speed_test_maxctx.py\n```\n\nEach script walks the model list sequentially so the cold-load cost falls on prompt 1 and prompts 2–8 hit a warm runner. Default output is `/tmp/roster_speed_*.json`; pass `--out PATH` to redirect or `--only alias1,alias2` to subset. A full run is roughly 60–110 minutes on this hardware depending on how much the reasoning models decide to think.\n\n## How to read `results.json`\n\nOne object per (model, prompt) call. The fields that matter:\n\n| field | meaning |\n|---|---|\n| `alias` | the alias the call was made under (the naming scheme evolved across runs — see each `report.md`) |\n| `tag` | the underlying Ollama tag |\n| `num_ctx` / `num_predict` | what the call was configured with |\n| `wall_s` | total request wall time (load + prompt eval + thinking + generation) |\n| `eval_tps` | tokens/sec during generation only, from Ollama's `eval_count` / `eval_duration` |\n| `eval_count` | how many tokens the model actually produced |\n| `thinking_chars` | chars in the separated thinking channel (0 if the model doesn't have one) |\n| `content` | first 300 chars of the visible response |\n| `error` | `null` on success; populated on timeouts or HTTP errors |\n\n`eval_tps` is the clean per-model speed number — it excludes prompt eval and KV alloc, so a cold first prompt doesn't drag down the average. `wall_s` is what a user actually feels.\n\n## Why hit Ollama directly?\n\nIn production everything routes through LiteLLM at `:4000` so I can put Langfuse + OTel in front of it. But LiteLLM strips Ollama's `eval_count` / `eval_duration` (it only surfaces OpenAI-format `usage`) and doesn't pass the separated `thinking` field through cleanly. For benchmarking that's a dealbreaker, so the scripts go straight to `:11434`. Production traffic still goes through LiteLLM.\n\n## License\n\nMIT. Take the scripts, take the data, tell me where I got it wrong.\n\n## What I'd still like to know\n\n- Which model pairs co-load without eviction on this hardware (concurrency profiling)\n- The same matrix on an M4 Pro or Max — at what point does the small-dense-model KV penalty disappear\n- A clean non-thinking baseline: every reasoning model with `think:false` forced, for direct comparison\n- Drift over time: re-run monthly and watch the numbers move\n\n### Keywords for the search engines\n\n`ollama`, `litellm`, `local-llm`, `apple-silicon`, `mac-mini`, `m4`, `mlx`, `mixture-of-experts`, `qwen3`, `qwen2.5-coder`, `deepseek-r1`, `phi4-reasoning`, `lfm2`, `gpt-oss`, `mistral-small`, `mistral-nemo`, `gemma`, `granite`, `benchmark`, `tokens-per-second`, `kv-cache`, `thinking-channel`, `langfuse`, `otel`, `role-aliases`\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fslogsdon%2Fmac-mini-llm-roster","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fslogsdon%2Fmac-mini-llm-roster","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fslogsdon%2Fmac-mini-llm-roster/lists"}