https://github.com/deemwar-products/mochallama
Local LLM for the JVM — llama.cpp via Project Panama FFM (no JNI). OpenAI-compatible Spring Boot starter + Spring AI adapter + CLI. Streaming, tool calling, real token usage.
ffm gguf java jvm llama-cpp llm local-llm openai-api panama spring-ai spring-boot tool-calling
- Host: GitHub
- URL: https://github.com/deemwar-products/mochallama
- Owner: deemwar-products
- License: mit
- Created: 2026-05-28T07:47:18.000Z (22 days ago)
- Default Branch: main
- Last Pushed: 2026-05-28T11:18:21.000Z (22 days ago)
- Last Synced: 2026-05-28T11:23:42.563Z (22 days ago)
- Topics: ffm, gguf, java, jvm, llama-cpp, llm, local-llm, openai-api, panama, spring-ai, spring-boot, tool-calling
- Language: Java
- Homepage: https://deemwar-products.github.io/mochallama/
- Size: 282 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 10
Metadata Files:
- Readme: README.md
- License: LICENSE
- Notice: NOTICE
README
# mochallama
**Local LLM for Spring Boot — Java → Project Panama FFM → a thin C++ `common_chat` bridge → vendored llama.cpp. No JNI. Spring-first.**
`tools.deemwar:mochallama-*` · npm `@deemwario/mochallama` · MIT · JDK 22 · OpenAI-compatible HTTP · streaming · tool calling · Actuator metrics
[Documentation](https://deemwar-products.github.io/mochallama/) · [GitHub](https://github.com/deemwar-products/mochallama)
---
## What it is
mochallama runs GGUF chat models **locally, in-process on the JVM**. The Java
side binds a handful of C symbols through the JDK 22 **Foreign Function & Memory
API (Project Panama)** — there is **no JNI** and no native compilation in the
Java toolchain. Those symbols belong to a small C++ bridge
(`libllamabridge`) built on llama.cpp's `common_chat` helpers, which in turn
drives a **vendored copy of llama.cpp** compiled via CMake and staged into the
JAR as platform-specific resources.
It is **Spring-first**: drop in the starter and you get an autoconfigured local
model service, an OpenAI-compatible HTTP endpoint, a Spring AI `ChatModel` /
`ChatClient`, and inference metrics + a health indicator — no extra wiring.
```
HTTP client (curl / OpenAI SDK / Spring AI)
      │  POST /v1/chat/completions
      ▼
Spring Boot app → LlamaCppService → ChatEngine (Panama FFM)
      │                                  │  downcall MethodHandles
      ▼                                  ▼
libllamabridge (our C++ bridge over common_chat)
      ▼
libllama + libggml* (vendored llama.cpp) → GGUF model on disk
```
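To make the "downcall MethodHandles" layer in the diagram concrete, here is a minimal sketch of an FFM downcall on JDK 22. It binds `strlen` from the C standard library rather than any `libllamabridge` symbol, because the bridge's C ABI is not shown in this README. The class and method names (`FfmDowncallSketch`, `nativeStrlen`) are illustrative assumptions, not part of mochallama.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandle;

// Hypothetical example class: demonstrates the FFM downcall pattern only.
public final class FfmDowncallSketch {
    private static final Linker LINKER = Linker.nativeLinker();

    // size_t strlen(const char *s), assuming a 64-bit platform where size_t maps to a Java long.
    private static final MethodHandle STRLEN = LINKER.downcallHandle(
            LINKER.defaultLookup().find("strlen").orElseThrow(),
            FunctionDescriptor.of(ValueLayout.JAVA_LONG, ValueLayout.ADDRESS));

    static long nativeStrlen(String text) {
        // A confined arena frees the native copy of the string when the block exits.
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment cString = arena.allocateFrom(text); // NUL-terminated UTF-8
            return (long) STRLEN.invokeExact(cString);
        } catch (Throwable t) {
            throw new IllegalStateException("strlen downcall failed", t);
        }
    }

    public static void main(String[] args) {
        System.out.println(nativeStrlen("mochallama"));
    }
}
```

A real bridge would look up its symbols in its own shared library instead of the default lookup, but the shape is the same: a symbol address, a `FunctionDescriptor`, and a `MethodHandle`. Without `--enable-native-access`, the JVM may print a warning about restricted methods.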
> **Today: macOS Intel `x86_64`, CPU-only.** The shipped artifacts bundle the
> `darwin-x86_64` native dylibs (Accelerate / BLAS; Metal/CUDA/Vulkan are gated
> off in CMake). Linux and Apple-silicon binaries build in CI (see
> `.github/workflows/build.yml`) and will be published as separate bundles later.
> This is an honest single-platform release, not a cross-platform promise.
## Quickstart
Requires **JDK 22** (the FFM API became final in JDK 22, JEP 454). Run the demo app:
```bash
./gradlew :app:bootRun
```
The HTTP port (`8080`) comes up immediately; the model downloads on first start
into `~/.chatbot_models` and loads asynchronously. While it loads, endpoints
return `503` with `{"error":"model loading","state":"DOWNLOADING"|"LOADING"}`.
Watch the logs for `state: READY`, then:
```bash
curl http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "messages": [
      {"role": "user", "content": "Write a haiku about Project Panama."}
    ],
    "max_tokens": 128,
    "temperature": 0.7
  }'
```
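Instead of watching the logs, a client can poll until the `503` responses described above stop. The sketch below is one way to do that with the JDK's built-in `HttpClient`. It assumes `GET /v1/models` follows the same "503 while loading, 200 when ready" rule as the other endpoints; the class name, attempt count, and delay are illustrative choices.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.function.IntSupplier;

// Hypothetical helper: waits for the local model to leave the DOWNLOADING/LOADING states.
public final class ReadinessPoller {

    /** Returns true once the probe reports 200, false if attempts run out or the thread is interrupted. */
    static boolean awaitReady(IntSupplier statusProbe, int maxAttempts, Duration delay) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            int status = statusProbe.getAsInt();
            if (status == 200) return true;
            if (status != 503) throw new IllegalStateException("unexpected HTTP status " + status);
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delay.toMillis());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }

    /** Builds a probe that issues a GET and reports the HTTP status code. */
    static IntSupplier httpProbe(HttpClient client, URI uri) {
        HttpRequest request = HttpRequest.newBuilder(uri).timeout(Duration.ofSeconds(5)).GET().build();
        return () -> {
            try {
                return client.send(request, HttpResponse.BodyHandlers.discarding()).statusCode();
            } catch (IOException e) {
                return 503; // design choice: treat "app not reachable yet" as "not ready yet"
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        };
    }

    public static void main(String[] args) {
        IntSupplier probe = httpProbe(HttpClient.newHttpClient(),
                URI.create("http://localhost:8080/v1/models")); // assumed to return 503 while loading
        boolean ready = awaitReady(probe, 120, Duration.ofSeconds(5));
        System.out.println(ready ? "model ready" : "gave up waiting");
    }
}
```

Separating the probe from the retry loop keeps the loop testable without a running server.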
## Modules
| Module | Maven / npm coordinate | What it is |
|------------|-------------------------------------------------|------------|
| `core` | `tools.deemwar:mochallama-core` | Framework-free Panama FFM bridge + `ChatEngine` + the stable `MochallamaClient` contract. Bundles the native dylibs. No Spring. |
| `starter` | `tools.deemwar:mochallama-spring-boot-starter` | Spring Boot starter: autoconfigures `LlamaCppService`, the OpenAI-compatible REST controller, Actuator metrics + health. No Spring AI dependency. |
| `spring-ai`| `tools.deemwar:mochallama-spring-ai` | Spring AI `ChatModel` / `ChatClient` adapter over `MochallamaClient`. Spring AI is `compileOnly` so the consumer pins the version. |
| `cli` | npm `@deemwario/mochallama` | Terminal CLI (`mochallama models` / `mochallama chat`), shipped as a self-contained jlink image — no JDK required. |
| `app` | _(not published)_ | Demo Spring Boot app that wires the starter + Spring AI adapter together, plus a small web UI. The reference for running everything end-to-end. |
## Endpoints
Served by the demo `app` (the OpenAI surface comes from the starter; the
`/spring-ai/*` routes are app-local demos of the Spring AI adapter):
| Endpoint | Method | Notes |
|-----------------------------------------|--------|-------|
| `/v1/chat/completions` | POST | OpenAI chat completions. Supports `stream: true` (SSE) and `tools[]` → `tool_calls`. Full sampling params (see below). |
| `/v1/models` | GET | Lists the loaded model id (derived from the GGUF filename). |
| `/spring-ai/chat` | POST | `{"message": "..."}` → `{"reply": "..."}` via the autoconfigured `ChatClient`. |
| `/spring-ai/tool-demo` | POST | Drives Spring AI tool calling end-to-end; surfaces the proposed `get_weather` tool call. |
| `/actuator/health` | GET | `UP` once the model is `READY`, `DOWN` while loading/failed. Includes `model`, `state`, `loadDurationMs`. |
| `/actuator/metrics` | GET | All meter names; `/actuator/metrics/{name}` for one meter (e.g. `mochallama.inference.duration`). |
| `/actuator/prometheus` | GET | Prometheus scrape (opt-in — add `micrometer-registry-prometheus`). |
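
For JVM callers that do not use Spring AI, the chat endpoint in the table can be reached with the JDK's `HttpClient`. This is a minimal sketch, assuming the demo app is listening on `http://localhost:8080`; the class name and the hand-rolled JSON escaping are illustrative, and a real client would more likely use a JSON library.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical helper: builds a single-message request for POST /v1/chat/completions.
public final class ChatCompletionRequests {

    /** Escapes characters that must not appear raw inside a JSON string. */
    static String jsonEscape(String s) {
        StringBuilder out = new StringBuilder(s.length() + 8);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"' -> out.append("\\\"");
                case '\\' -> out.append("\\\\");
                case '\n' -> out.append("\\n");
                case '\r' -> out.append("\\r");
                case '\t' -> out.append("\\t");
                default -> {
                    if (c < 0x20) out.append(String.format("\\u%04x", (int) c));
                    else out.append(c);
                }
            }
        }
        return out.toString();
    }

    /** Serializes one user message plus two sampling parameters. */
    static String body(String userMessage, int maxTokens, double temperature) {
        return "{\"messages\":[{\"role\":\"user\",\"content\":\"" + jsonEscape(userMessage)
                + "\"}],\"max_tokens\":" + maxTokens + ",\"temperature\":" + temperature + "}";
    }

    static HttpRequest build(URI baseUri, String userMessage, int maxTokens, double temperature) {
        return HttpRequest.newBuilder(baseUri.resolve("/v1/chat/completions"))
                .timeout(Duration.ofMinutes(2)) // local CPU inference can be slow
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body(userMessage, maxTokens, temperature)))
                .build();
    }

    public static void main(String[] args) throws Exception {
        HttpRequest request = build(URI.create("http://localhost:8080"),
                "Write a haiku about Project Panama.", 128, 0.7);
        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}
```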
### `/v1/chat/completions` parameters
`messages[]` (roles `system` / `user` / `assistant` / `tool`) plus
`max_tokens`, `temperature`, `top_k`, `top_p`, `min_p`, `repeat_penalty`,
`seed`, `stop[]`, `stream`, `tools[]`, `tool_choice`. Per-request values
override the server-side defaults bound from `llamacpp.model.*`.
```bash
# Streaming
curl -N -X POST http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"user","content":"count 1 to 5"}],"stream":true,"max_tokens":32}'

# Tool calling
curl -s -X POST http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "messages":[{"role":"user","content":"What is the weather in Paris?"}],
    "tools":[{"type":"function","function":{
      "name":"get_weather",
      "description":"Get the current weather for a location",
      "parameters":{"type":"object","properties":{"location":{"type":"string"}},"required":["location"]}
    }}]
  }'
```
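A Java client consuming the streaming response has to pick the payloads out of the SSE lines. The sketch below assumes the stream follows the OpenAI convention of `data: <json>` lines terminated by `data: [DONE]`; this README does not spell out the wire format, so treat that as an assumption. The class name and the sample chunks in `main` are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: extracts JSON chunk payloads from server-sent event lines.
public final class SseDataLines {

    /** Returns the payload of each "data:" line, stopping at the "[DONE]" sentinel. */
    static List<String> payloads(Iterable<String> lines) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            if (!line.startsWith("data:")) continue; // skips blank separators, comments, other fields
            // strip() is safe here because surrounding whitespace is not significant for JSON payloads
            String payload = line.substring("data:".length()).strip();
            if (payload.equals("[DONE]")) break;
            out.add(payload);
        }
        return out;
    }

    public static void main(String[] args) {
        // Illustrative chunks only; real chunks carry more fields.
        List<String> sample = List.of(
                "data: {\"choices\":[{\"delta\":{\"content\":\"1\"}}]}",
                "",
                "data: {\"choices\":[{\"delta\":{\"content\":\" 2\"}}]}",
                "",
                "data: [DONE]");
        payloads(sample).forEach(System.out::println);
    }
}
```

With `java.net.http.HttpClient`, `HttpResponse.BodyHandlers.ofLines()` yields a `Stream<String>` whose `iterator` can be passed to `payloads` as `stream::iterator`, so chunks are processed as they arrive.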
## Models
The lineup is **tool-callers only**: every shipped profile includes a
tool-capable chat template. The default is **`qwen2.5-1.5b`**
(Qwen2.5-1.5B-Instruct, Q4_K_M, ~1.1 GB). It is the smallest and fastest model
in the lineup and the one with proven tool calling, so first boot is quick.
| Profile | Model | Size | Tool calling |
|----------------|--------------------------------|-------|--------------|
| `qwen2.5-1.5b` | Qwen2.5-1.5B-Instruct (default)| ~1.1 GB | Yes (proven) |
| `qwen2.5-3b` | Qwen2.5-3B-Instruct | ~2.1 GB | Yes |
| `qwen3-4b` | Qwen3-4B-Instruct-2507 | ~2.5 GB | Yes |
| `phi-4-mini` | Phi-4-mini-instruct | ~2.5 GB | Yes |
Switch by activating a Spring profile:
```bash
./gradlew :app:bootRun --args='--spring.profiles.active=qwen2.5-3b'
```
Models download on first start into `~/.chatbot_models`. The id on
`GET /v1/models` is derived from the filename, so switching profiles switches
the OpenAI model id too. See the
[model profiles](https://deemwar-products.github.io/mochallama/specs/models) doc.
**Load any tool-capable model by Hugging Face id.** Instead of a profile, point
the starter at a HF repo and it resolves the GGUF (preferred quant `Q4_K_M`,
shared `~/.chatbot_models` cache) for you:
```properties
llamacpp.model.hf-id=Qwen/Qwen2.5-3B-Instruct-GGUF
# optional: llamacpp.model.quant=Q4_K_M
```
The CLI accepts the same — a profile name, a HF id, or a local `.gguf` path:
```bash
mochallama chat --model Qwen/Qwen2.5-3B-Instruct-GGUF
```
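The three accepted forms (profile name, HF id, local `.gguf` path) can be told apart by shape alone. The sketch below shows one plausible way to do it. It is not the CLI's actual resolution logic, which this README does not describe; the class, enum, and rules are assumptions made for illustration.

```java
import java.util.Locale;

// Hypothetical classifier: NOT the mochallama CLI's real implementation.
public final class ModelRefs {

    enum Kind { LOCAL_GGUF, HF_ID, PROFILE }

    static Kind classify(String ref) {
        // Check the file extension first so relative paths such as models/foo.gguf are not mistaken for HF ids.
        if (ref.toLowerCase(Locale.ROOT).endsWith(".gguf")) return Kind.LOCAL_GGUF;
        // Hugging Face ids have exactly one slash: owner/repo.
        if (ref.matches("[^/\\s]+/[^/\\s]+")) return Kind.HF_ID;
        // Anything else is treated as a profile name such as qwen2.5-3b.
        return Kind.PROFILE;
    }

    public static void main(String[] args) {
        for (String ref : new String[] {"qwen2.5-3b", "Qwen/Qwen2.5-3B-Instruct-GGUF", "/tmp/model.gguf"}) {
            System.out.println(ref + " -> " + classify(ref));
        }
    }
}
```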
Only tool-capable models load. A non-tool model is rejected at load time
(Spring: `/actuator/health` goes DOWN/FAILED with *"does not support tool
calling"*; CLI: a clear refusal). See
[tool-calling support](https://deemwar-products.github.io/mochallama/specs/tool-calling-support).
## Use as a library
Add the starter to a Spring Boot app. It autoconfigures the local model service,
the OpenAI endpoint, and (if `mochallama-spring-ai` + Spring AI are present) a
`ChatClient` / `ChatModel`:
```gradle
dependencies {
    implementation 'tools.deemwar:mochallama-spring-boot-starter:0.1.0-SNAPSHOT'
    // Optional: Spring AI ChatClient / ChatModel adapter
    implementation 'tools.deemwar:mochallama-spring-ai:0.1.0-SNAPSHOT'
    implementation 'org.springframework.ai:spring-ai-client-chat:1.0.8'
}
```
Inject the autoconfigured `ChatClient`:
```java
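import org.springframework.ai.chat.client.ChatClient;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;

@RestController
class AssistantController {
    private final ChatClient chat;

    // ChatClient is autoconfigured when mochallama-spring-ai and Spring AI are on the classpath.
    AssistantController(ChatClient chat) { this.chat = chat; }

    @PostMapping("/ask")
    String ask(@RequestBody String prompt) {
        return chat.prompt().user(prompt).call().content();
    }
}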
```
Set the model location and sampling defaults via `llamacpp.model.*` (e.g.
`llamacpp.model.url`, `llamacpp.model.filename`, `llamacpp.model.context-size`,
`llamacpp.model.threads`, `llamacpp.model.temperature`). Disable the OpenAI
endpoint with `mochallama.openai-endpoint.enabled=false`. JVM args
`--enable-native-access=ALL-UNNAMED --add-modules=jdk.incubator.vector` are
required.
For the framework-free path, depend on `mochallama-core` and use
`MochallamaClient` / `ChatEngine` directly.
## CLI
```bash
npm i -g @deemwario/mochallama # macOS x64 only for v0.1.0
mochallama models
mochallama chat --model qwen2.5-3b
```
## Documentation
Full docs (architecture, the C ABI, model profiles, metrics, and a complete
examples section) live at **https://deemwar-products.github.io/mochallama/**.
## License
[MIT](LICENSE). Vendored llama.cpp + ggml are also MIT — see [NOTICE](NOTICE).