{"id":50539938,"url":"https://github.com/deemwar-products/mochallama","last_synced_at":"2026-06-03T20:00:13.944Z","repository":{"id":360903203,"uuid":"1252113831","full_name":"deemwar-products/mochallama","owner":"deemwar-products","description":"Local LLM for the JVM — llama.cpp via Project Panama FFM (no JNI). OpenAI-compatible Spring Boot starter + Spring AI adapter + CLI. Streaming, tool calling, real token usage.","archived":false,"fork":false,"pushed_at":"2026-05-28T11:18:21.000Z","size":289,"stargazers_count":0,"open_issues_count":10,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-28T11:23:42.563Z","etag":null,"topics":["ffm","gguf","java","jvm","llama-cpp","llm","local-llm","openai-api","panama","spring-ai","spring-boot","tool-calling"],"latest_commit_sha":null,"homepage":"https://deemwar-products.github.io/mochallama/","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/deemwar-products.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":"NOTICE","maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-28T07:47:18.000Z","updated_at":"2026-05-28T11:18:25.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/deemwar-products/mochallama","commit_stats":null,"previous_names":["deemwar-products/mochallama"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/deemwar-products/mochallama","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/deemwar-products%2Fmochallama","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/deemwar-products%2Fmochallama/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/deemwar-products%2Fmochallama/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/deemwar-products%2Fmochallama/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/deemwar-products","download_url":"https://codeload.github.com/deemwar-products/mochallama/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/deemwar-products%2Fmochallama/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33876894,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-03T02:00:06.370Z","response_time":59,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ffm","gguf","java","jvm","llama-cpp","llm","local-llm","openai-api","panama","spring-ai","spring-boot","tool-calling"],"created_at":"2026-06-03T20:00:11.334Z","updated_at":"2026-06-03T20:00:13.669Z","avatar_url":"https://github.com/deemwar-products.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"# mochallama\n\n**Local LLM for Spring Boot — Java → Project Panama FFM → a thin C++ `common_chat` bridge → vendored llama.cpp. No JNI. Spring-first.**\n\n`tools.deemwar:mochallama-*` · npm `@deemwario/mochallama` · MIT · JDK 22 · OpenAI-compatible HTTP · streaming · tool calling · Actuator metrics\n\n[Documentation](https://deemwar-products.github.io/mochallama/) · [GitHub](https://github.com/deemwar-products/mochallama)\n\n---\n\n## What it is\n\nmochallama runs GGUF chat models **locally, in-process on the JVM**. The Java\nside binds a handful of C symbols through the JDK 22 **Foreign Function \u0026 Memory\nAPI (Project Panama)** — there is **no JNI** and no native compilation in the\nJava toolchain. Those symbols belong to a small C++ bridge\n(`libllamabridge`) built on llama.cpp's `common_chat` helpers, which in turn\ndrives a **vendored copy of llama.cpp** compiled via CMake and staged into the\nJAR as platform-specific resources.\n\nIt is **Spring-first**: drop in the starter and you get an autoconfigured local\nmodel service, an OpenAI-compatible HTTP endpoint, a Spring AI `ChatModel` /\n`ChatClient`, and inference metrics + a health indicator — no extra wiring.\n\n```\nHTTP client (curl / OpenAI SDK / Spring AI)\n        │  POST /v1/chat/completions\n        ▼\nSpring Boot app  →  LlamaCppService  →  ChatEngine (Panama FFM)\n        │                                     │  downcall MethodHandles\n        ▼                                     ▼\nlibllamabridge  (our C++ bridge over common_chat)\n        ▼\nlibllama + libggml*  (vendored llama.cpp)  →  GGUF model on disk\n```\n\n\u003e **Today: macOS Intel `x86_64`, CPU-only.** The shipped artifacts bundle the\n\u003e `darwin-x86_64` native dylibs (Accelerate / BLAS; Metal/CUDA/Vulkan are gated\n\u003e off in CMake). Linux and Apple-silicon binaries build in CI (see\n\u003e `.github/workflows/build.yml`) and will publish as separate bundles later.\n\u003e This is an honest single-platform release, not a cross-platform promise.\n\n## Quickstart\n\nRequires **JDK 22** (FFM went GA in 22). Run the demo app:\n\n```bash\n./gradlew :app:bootRun\n```\n\nThe HTTP port (`8080`) comes up immediately; the model downloads on first start\ninto `~/.chatbot_models` and loads asynchronously. While it loads, endpoints\nreturn `503` with `{\"error\":\"model loading\",\"state\":\"DOWNLOADING\"|\"LOADING\"}`.\nWatch the logs for `state: READY`, then:\n\n```bash\ncurl http://localhost:8080/v1/chat/completions \\\n  -H 'Content-Type: application/json' \\\n  -d '{\n    \"messages\": [\n      {\"role\": \"user\", \"content\": \"Write a haiku about Project Panama.\"}\n    ],\n    \"max_tokens\": 128,\n    \"temperature\": 0.7\n  }'\n```\n\n## Modules\n\n| Module     | Maven / npm coordinate                          | What it is |\n|------------|-------------------------------------------------|------------|\n| `core`     | `tools.deemwar:mochallama-core`                 | Framework-free Panama FFM bridge + `ChatEngine` + the stable `MochallamaClient` contract. Bundles the native dylibs. No Spring. |\n| `starter`  | `tools.deemwar:mochallama-spring-boot-starter`  | Spring Boot starter: autoconfigures `LlamaCppService`, the OpenAI-compatible REST controller, Actuator metrics + health. No Spring AI dependency. |\n| `spring-ai`| `tools.deemwar:mochallama-spring-ai`            | Spring AI `ChatModel` / `ChatClient` adapter over `MochallamaClient`. Spring AI is `compileOnly` so the consumer pins the version. |\n| `cli`      | npm `@deemwario/mochallama`                        | Terminal CLI (`mochallama models` / `mochallama chat`), shipped as a self-contained jlink image — no JDK required. |\n| `app`      | _(not published)_                               | Demo Spring Boot app that wires the starter + Spring AI adapter together, plus a small web UI. The reference for running everything end-to-end. |\n\n## Endpoints\n\nServed by the demo `app` (the OpenAI surface comes from the starter; the\n`/spring-ai/*` routes are app-local demos of the Spring AI adapter):\n\n| Endpoint                                | Method | Notes |\n|-----------------------------------------|--------|-------|\n| `/v1/chat/completions`                  | POST   | OpenAI chat completions. Supports `stream: true` (SSE) and `tools[]` → `tool_calls`. Full sampling params (see below). |\n| `/v1/models`                            | GET    | Lists the loaded model id (derived from the GGUF filename). |\n| `/spring-ai/chat`                       | POST   | `{\"message\": \"...\"}` → `{\"reply\": \"...\"}` via the autoconfigured `ChatClient`. |\n| `/spring-ai/tool-demo`                  | POST   | Drives Spring AI tool calling end-to-end; surfaces the proposed `get_weather` tool call. |\n| `/actuator/health`                      | GET    | `UP` once the model is `READY`, `DOWN` while loading/failed. Includes `model`, `state`, `loadDurationMs`. |\n| `/actuator/metrics`                     | GET    | All meter names; `/actuator/metrics/{name}` for one meter (e.g. `mochallama.inference.duration`). |\n| `/actuator/prometheus`                  | GET    | Prometheus scrape (opt-in — add `micrometer-registry-prometheus`). |\n\n### `/v1/chat/completions` parameters\n\n`messages[]` (roles `system` / `user` / `assistant` / `tool`) plus\n`max_tokens`, `temperature`, `top_k`, `top_p`, `min_p`, `repeat_penalty`,\n`seed`, `stop[]`, `stream`, `tools[]`, `tool_choice`. Per-request values\noverride the server-side defaults bound from `llamacpp.model.*`.\n\n```bash\n# Streaming\ncurl -N -X POST http://localhost:8080/v1/chat/completions \\\n  -H 'Content-Type: application/json' \\\n  -d '{\"messages\":[{\"role\":\"user\",\"content\":\"count 1 to 5\"}],\"stream\":true,\"max_tokens\":32}'\n\n# Tool calling\ncurl -s -X POST http://localhost:8080/v1/chat/completions \\\n  -H 'Content-Type: application/json' \\\n  -d '{\n    \"messages\":[{\"role\":\"user\",\"content\":\"What is the weather in Paris?\"}],\n    \"tools\":[{\"type\":\"function\",\"function\":{\n      \"name\":\"get_weather\",\n      \"description\":\"Get the current weather for a location\",\n      \"parameters\":{\"type\":\"object\",\"properties\":{\"location\":{\"type\":\"string\"}},\"required\":[\"location\"]}\n    }}]\n  }'\n```\n\n## Models\n\nThe lineup is **tool-callers only** — every shipped profile ships a tool-capable\nchat template. The default is **`qwen2.5-1.5b`** (Qwen2.5-1.5B-Instruct, Q4_K_M,\n~1.1 GB): the proven tool-caller in this lineup and the smallest/fastest, so\nfirst boot is quick.\n\n| Profile        | Model                          | Size  | Tool calling |\n|----------------|--------------------------------|-------|--------------|\n| `qwen2.5-1.5b` | Qwen2.5-1.5B-Instruct (default)| ~1.1 GB | Yes (proven) |\n| `qwen2.5-3b`   | Qwen2.5-3B-Instruct            | ~2.1 GB | Yes |\n| `qwen3-4b`     | Qwen3-4B-Instruct-2507         | ~2.5 GB | Yes |\n| `phi-4-mini`   | Phi-4-mini-instruct            | ~2.5 GB | Yes |\n\nSwitch by activating a Spring profile:\n\n```bash\n./gradlew :app:bootRun --args='--spring.profiles.active=qwen2.5-3b'\n```\n\nModels download on first start into `~/.chatbot_models`. The id on\n`GET /v1/models` is derived from the filename, so switching profiles switches\nthe OpenAI model id too. See the\n[model profiles](https://deemwar-products.github.io/mochallama/specs/models) doc.\n\n**Load any tool-capable model by Hugging Face id.** Instead of a profile, point\nthe starter at a HF repo and it resolves the GGUF (preferred quant `Q4_K_M`,\nshared `~/.chatbot_models` cache) for you:\n\n```properties\nllamacpp.model.hf-id=Qwen/Qwen2.5-3B-Instruct-GGUF\n# optional: llamacpp.model.quant=Q4_K_M\n```\n\nThe CLI accepts the same — a profile name, a HF id, or a local `.gguf` path:\n\n```bash\nmochallama chat --model Qwen/Qwen2.5-3B-Instruct-GGUF\n```\n\nOnly tool-capable models load. A non-tool model is rejected at load time\n(Spring: `/actuator/health` goes DOWN/FAILED with *\"does not support tool\ncalling\"*; CLI: a clear refusal). See\n[tool-calling support](https://deemwar-products.github.io/mochallama/specs/tool-calling-support).\n\n## Use as a library\n\nAdd the starter to a Spring Boot app. It autoconfigures the local model service,\nthe OpenAI endpoint, and (if `mochallama-spring-ai` + Spring AI are present) a\n`ChatClient` / `ChatModel`:\n\n```gradle\ndependencies {\n    implementation 'tools.deemwar:mochallama-spring-boot-starter:0.1.0-SNAPSHOT'\n    // Optional: Spring AI ChatClient / ChatModel adapter\n    implementation 'tools.deemwar:mochallama-spring-ai:0.1.0-SNAPSHOT'\n    implementation 'org.springframework.ai:spring-ai-client-chat:1.0.8'\n}\n```\n\nInject the autoconfigured `ChatClient`:\n\n```java\n@RestController\nclass AssistantController {\n    private final ChatClient chat;\n    AssistantController(ChatClient chat) { this.chat = chat; }\n\n    @PostMapping(\"/ask\")\n    String ask(@RequestBody String prompt) {\n        return chat.prompt().user(prompt).call().content();\n    }\n}\n```\n\nPoint the model location and sampling defaults via `llamacpp.model.*` (e.g.\n`llamacpp.model.url`, `llamacpp.model.filename`, `llamacpp.model.context-size`,\n`llamacpp.model.threads`, `llamacpp.model.temperature`). Disable the OpenAI\nendpoint with `mochallama.openai-endpoint.enabled=false`. JVM args\n`--enable-native-access=ALL-UNNAMED --add-modules=jdk.incubator.vector` are\nrequired.\n\nFor the framework-free path, depend on `mochallama-core` and use\n`MochallamaClient` / `ChatEngine` directly.\n\n## CLI\n\n```bash\nnpm i -g @deemwario/mochallama   # macOS x64 only for v0.1.0\nmochallama models\nmochallama chat --model qwen2.5-3b\n```\n\n## Documentation\n\nFull docs (architecture, the C ABI, model profiles, metrics, and a complete\nexamples section) live at **https://deemwar-products.github.io/mochallama/**.\n\n## License\n\n[MIT](LICENSE). Vendored llama.cpp + ggml are also MIT — see [NOTICE](NOTICE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdeemwar-products%2Fmochallama","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdeemwar-products%2Fmochallama","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdeemwar-products%2Fmochallama/lists"}