{"id":50792867,"url":"https://github.com/cryptojones/macminim2pro_localmodelconfig","last_synced_at":"2026-06-12T12:02:25.318Z","repository":{"id":359002864,"uuid":"1244064621","full_name":"CryptoJones/MacminiM2Pro_LocalModelConfig","owner":"CryptoJones","description":"Memory-safe, LAN-accessible OpenAI-compatible server for Gemma 4 12B (4-bit MLX) on a 16 GB M2 Pro Mac mini","archived":false,"fork":false,"pushed_at":"2026-06-05T09:35:11.000Z","size":105,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-05T12:09:31.517Z","etag":null,"topics":["anthropic-api","apple-silicon","claude-code","gemma","inference-server","local-llm","mac-mini","macos","mlx","omlx","qwen3"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/CryptoJones.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-19T23:51:09.000Z","updated_at":"2026-06-05T09:35:16.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/CryptoJones/MacminiM2Pro_LocalModelConfig","commit_stats":null,"previous_names":["cryptojones/macminim2pro_localmodelconfig"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/CryptoJones/MacminiM2Pro_LocalModelConfig","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CryptoJones%2FMacminiM2Pro_LocalModelConfig","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CryptoJones%2FMacminiM2Pro_LocalModelConfig/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CryptoJones%2FMacminiM2Pro_LocalModelConfig/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CryptoJones%2FMacminiM2Pro_LocalModelConfig/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/CryptoJones","download_url":"https://codeload.github.com/CryptoJones/MacminiM2Pro_LocalModelConfig/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CryptoJones%2FMacminiM2Pro_LocalModelConfig/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34243053,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-12T02:00:06.859Z","response_time":109,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["anthropic-api","apple-silicon","claude-code","gemma","inference-server","local-llm","mac-mini","macos","mlx","omlx","qwen3"],"created_at":"2026-06-12T12:01:46.408Z","updated_at":"2026-06-12T12:02:25.177Z","avatar_url":"https://github.com/CryptoJones.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# MacminiM2Pro_LocalModelConfig\n\n**Memory-safe, LAN-accessible, OpenAI-compatible server for [Gemma 4 12B](https://developers.googleblog.com/gemma-4-12b-the-developer-guide/) running locally in [MLX](https://github.com/ml-explore/mlx) on a 16 GB Apple Silicon M2 Pro Mac mini.**\n\n[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg?logo=apache)](LICENSE)\n[![Codeberg](https://img.shields.io/badge/Codeberg-CryptoJones%2FMacminiM2Pro_LocalModelConfig-2185D0?logo=codeberg\u0026logoColor=white)](https://codeberg.org/CryptoJones/MacminiM2Pro_LocalModelConfig)\n[![GitHub](https://img.shields.io/badge/GitHub-CryptoJones%2FMacminiM2Pro_LocalModelConfig-181717?logo=github\u0026logoColor=white)](https://github.com/CryptoJones/MacminiM2Pro_LocalModelConfig)\n\n\u003e Authoritative repo is on [Codeberg](https://codeberg.org/CryptoJones/MacminiM2Pro_LocalModelConfig); mirrored to [GitHub](https://github.com/CryptoJones/MacminiM2Pro_LocalModelConfig).\n\n\u003e 📺 Inspired by the video [*Gemma 4 12B on a 16GB Mac Mini Is Surprisingly Capable*](https://www.youtube.com/watch?v=PDxKrp-dTDA).\n\nGemma 4 12B is an encoder-free multimodal model (text/image/audio/video) that Google\npositions for 16 GB machines. It *fits* — but only just. On 16 GB it sits right at the\nMetal GPU memory ceiling, so a naive server **OOM-crashes on the first large prompt**.\nThis repo is the configuration and a small server wrapper that make it **safe to run\nheadless** and reachable by other agents on your LAN.\n\n---\n\n## The 16 GB problem (why this repo exists)\n\nMLX inference is bounded by the **Metal recommended working-set size** — by default\n~74 % of RAM = **11.84 GB** on a 16 GB machine. Measured peaks for Gemma 4 12B:\n\n| Quant | Weights resident | Verdict on 16 GB |\n|-------|------------------|------------------|\n| `8bit` (12.7 GB) | — | ❌ won't load |\n| `6bit` (11.9 GB) | **11.85 GB peak** | ❌ saturates the GPU budget → **Metal OOM on any real prompt**; forces 3.6 GB swap just to load |\n| **`4bit` (10 GB)** | **10.99 GB load / 11.8 GB+ under load** | ✅ **only viable option** — and still needs the steps below |\n\nEven 4-bit peaks **scale with input length** (prefill activations, *not* just KV cache):\n\n| Input prompt | Peak memory |\n|--------------|-------------|\n| ~50 tokens   | 11.80 GB |\n| ~360 tokens  | 11.80 GB |\n| ~1,560 tokens| 13.21 GB |\n| ~4,560 tokens| 💥 **OOM crash** |\n\nGeneration throughput: **~14–15 tokens/sec**.\n\n### Two things make it safe\n\n1. **Raise the Metal working-set limit** so the GPU may use more than the default 74 %.\n   For a headless box, 13.5 GB leaves ~2.9 GB for macOS:\n   ```bash\n   sudo sysctl iogpu.wired_limit_mb=13500\n   ```\n   This resets on reboot — see [persisting it](#persist-the-gpu-limit-across-reboots).\n\n2. **Guard against oversized prompts.** `server.py` rejects prompts over\n   `MAX_INPUT_TOKENS` with **HTTP 413** instead of letting them OOM-crash the process,\n   and **serializes** requests (a second concurrent generation would double the working\n   set and OOM → **HTTP 429**).\n\n---\n\n## Quick start\n\n```bash\ngit clone https://codeberg.org/CryptoJones/MacminiM2Pro_LocalModelConfig.git\ncd MacminiM2Pro_LocalModelConfig\n./setup.sh                                  # uv venv (py3.12) + deps + downloads 4-bit weights (~10 GB)\nsudo sysctl iogpu.wired_limit_mb=13500      # raise GPU memory ceiling (per boot)\n./.venv/bin/python server.py                # serves on 0.0.0.0:8080\n```\n\n\u003e **Python note:** MLX has no wheels for Python 3.14 yet. `setup.sh` pins the venv to\n\u003e Python 3.12 via [`uv`](https://github.com/astral-sh/uv).\n\n---\n\n## Using it from the LAN\n\nThe server binds `0.0.0.0:8080`, so any agent on your network can use it as an\nOpenAI-compatible endpoint. Find the host's LAN IP with `ipconfig getifaddr en0`.\n\n```bash\ncurl http://\u003cMAC_MINI_LAN_IP\u003e:8080/v1/chat/completions \\\n  -H 'Content-Type: application/json' \\\n  -d '{\"messages\":[{\"role\":\"user\",\"content\":\"Explain unified memory in one sentence.\"}],\n       \"max_tokens\":80}'\n```\n\n```python\nfrom openai import OpenAI\nclient = OpenAI(base_url=\"http://\u003cMAC_MINI_LAN_IP\u003e:8080/v1\", api_key=\"not-needed\")\nprint(client.chat.completions.create(\n    model=\"mlx-community/gemma-4-12B-4bit\",\n    messages=[{\"role\": \"user\", \"content\": \"Hello!\"}],\n).choices[0].message.content)\n```\n\nEndpoints: `GET /healthz`, `GET /v1/models`, `POST /v1/chat/completions`.\n\n\u003e **Security:** this server has **no authentication**. Only expose it on a trusted LAN,\n\u003e never directly to the internet. Put it behind a reverse proxy / firewall if needed.\n\n---\n\n## Files\n\n| File | Purpose |\n|------|---------|\n| `server.py` | OpenAI-compatible FastAPI server with the memory-safety guards. |\n| `run.py` | One-shot CLI generation (text or image), useful for testing. |\n| `safety_test.py` | The authoritative memory/throughput test used to derive the limits above. |\n| `setup.sh` | Creates the venv, installs deps, downloads the 4-bit weights. |\n| `com.cryptojones.gemma4.plist` | Optional `launchd` agent to run the server headless at login. |\n\n### One-shot CLI\n\n```bash\n./.venv/bin/python run.py \"Write a haiku about unified memory.\"\n./.venv/bin/python run.py \"Describe this image.\" --image photo.jpg   # multimodal\n```\n\n### Re-run the safety test\n\n```bash\n./.venv/bin/python safety_test.py mlx-community/gemma-4-12B-4bit --kv-bits 8 --max-kv-size 2048\n```\n\n---\n\n## Configuration\n\nEdit the constants at the top of `server.py`:\n\n| Constant | Default | Notes |\n|----------|---------|-------|\n| `MODEL` | `mlx-community/gemma-4-12B-4bit` | The only quant that fits 16 GB. |\n| `MAX_INPUT_TOKENS` | `600` | Safety guard. ~600 in-tokens peaks ~12.1 GB. Raising it toward ~1,300 approaches the crash threshold — do so only if you raised `iogpu.wired_limit_mb` further. |\n| `MAX_OUTPUT_TOKENS` | `512` | Hard cap on generation length. |\n| `MAX_KV_SIZE` / `KV_BITS` | `2048` / `8` | Bounded, quantized KV cache. |\n\n### A note on the chat template\n\nThe community MLX conversion ships **without** a `tokenizer.chat_template`. Feeding a raw\nprompt makes Gemma 4 ramble and emit `\u003cimage|\u003e`/`\u003caudio|\u003e` soft-tokens. Both `server.py`\nand `run.py` apply the Gemma turn format manually\n(`\u003cstart_of_turn\u003euser … \u003cend_of_turn\u003e\u003cstart_of_turn\u003emodel`) and stop on `\u003cend_of_turn\u003e`.\n\n### Persist the GPU limit across reboots\n\n`iogpu.wired_limit_mb` resets to 0 (default) on reboot. To make a headless server\nsurvive reboots, install a `LaunchDaemon` that sets it at boot:\n\n```bash\nsudo tee /Library/LaunchDaemons/com.cryptojones.gpulimit.plist \u003e/dev/null \u003c\u003c'PLIST'\n\u003c?xml version=\"1.0\" encoding=\"UTF-8\"?\u003e\n\u003c!DOCTYPE plist PUBLIC \"-//Apple//DTD PLIST 1.0//EN\" \"http://www.apple.com/DTDs/PropertyList-1.0.dtd\"\u003e\n\u003cplist version=\"1.0\"\u003e\u003cdict\u003e\n  \u003ckey\u003eLabel\u003c/key\u003e\u003cstring\u003ecom.cryptojones.gpulimit\u003c/string\u003e\n  \u003ckey\u003eProgramArguments\u003c/key\u003e\n  \u003carray\u003e\u003cstring\u003e/usr/sbin/sysctl\u003c/string\u003e\u003cstring\u003eiogpu.wired_limit_mb=13500\u003c/string\u003e\u003c/array\u003e\n  \u003ckey\u003eRunAtLoad\u003c/key\u003e\u003ctrue/\u003e\n\u003c/dict\u003e\u003c/plist\u003e\nPLIST\nsudo launchctl load /Library/LaunchDaemons/com.cryptojones.gpulimit.plist\n```\n\nThen use `com.cryptojones.gemma4.plist` (a per-user `LaunchAgent`) to start the server itself.\n\n---\n\n## License\n\n[Apache 2.0](LICENSE). Gemma 4 is released by Google under the Apache 2.0 license.\n\n---\n\n\u003cp align=\"center\"\u003e\u003cem\u003eProudly Made in Nebraska. Go Big Red! 🌽 \u003ca href=\"https://xkcd.com/2347/\"\u003ehttps://xkcd.com/2347/\u003c/a\u003e\u003c/em\u003e\u003c/p\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcryptojones%2Fmacminim2pro_localmodelconfig","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcryptojones%2Fmacminim2pro_localmodelconfig","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcryptojones%2Fmacminim2pro_localmodelconfig/lists"}