https://github.com/vxfemboy/claude-llama
WIP: on a quest to save token usage cuz im broke
- Host: GitHub
- URL: https://github.com/vxfemboy/claude-llama
- Owner: vxfemboy
- License: mit
- Created: 2026-05-28T06:43:42.000Z (28 days ago)
- Default Branch: main
- Last Pushed: 2026-05-29T22:28:04.000Z (26 days ago)
- Last Synced: 2026-05-29T23:20:26.171Z (26 days ago)
- Topics: claude-code, claude-code-plugin
- Language: Go
- Homepage:
- Size: 94.7 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# claude-llama
> Delegate token-heavy file work to a local llama.cpp model so the bulk content never enters Claude's context.
`claude-llama` is an MCP server that exposes three tools — `llama_summarize`, `llama_extract`, `llama_ask` — plus a `llama_health` probe. Claude calls them instead of reading large files itself; the server reads the files locally, hands them to your llama.cpp instance, and returns only the answer.
Every response carries a footer like:
```
---
[claude-llama] input=7,992 tok · returned=931 tok · saved≈7,061 tok · model=Qwen3.5-9B · 141s
```
(real numbers from summarizing a 32KB plan doc — see [Real-world savings](#real-world-savings) for the full matrix.)
The savings are also appended to a JSONL log; `claude-llama-mcp stats` summarizes it. CI guards the savings claim with a benchmark.
## Install
**One-liner (recommended):**
```sh
curl -fsSL https://raw.githubusercontent.com/vxfemboy/claude-llama/main/install.sh | sh
```
Downloads the latest release binary for your OS/arch, verifies the checksum, drops it in `~/.local/bin`, and runs `claude-llama-mcp init`.
**As a Claude Code plugin:**
```
/plugin marketplace add vxfemboy/claude-llama
/plugin install claude-llama:claude-llama
```
(then `/reload-plugins`)
**From source:**
```sh
go install github.com/vxfemboy/claude-llama/cmd/claude-llama-mcp@latest
claude-llama-mcp init
```
After installing, register it with your MCP client. For Claude Code, add this to your project's `.mcp.json`:
```json
{
  "mcpServers": {
    "claude-llama": { "command": "claude-llama-mcp" }
  }
}
```
## Configuration
All settings are environment variables. `claude-llama-mcp init` writes them to `~/.config/claude-llama/env` (honoring `$XDG_CONFIG_HOME`); the process env always wins over the file.
| Variable | Default | Purpose |
|--------------------------|--------------------------------------|-----------------------------------------------------------------|
| `LLAMA_API_URL` | `http://localhost:8080` | llama.cpp server (OpenAI-compatible) |
| `LLAMA_MODEL` | `unsloth/Qwen3.5-9B-GGUF:Q4_K_M` | model name passed to `/v1/chat/completions` |
| `LLAMA_MAX_INPUT_TOKENS` | `6000` | max tokens per chunk before map/reduce kicks in |
| `LLAMA_TIMEOUT_SECONDS` | `120` | per-call timeout |
| `LLAMA_WORKSPACE_ROOT` | cwd | path-traversal boundary; the server refuses to read outside it |
| `LLAMA_FOOTER` | `true` | append the per-call savings footer to each response |
| `LLAMA_USAGE_LOG` | `true` | append a JSONL row per call to `$XDG_STATE_HOME/claude-llama/usage.jsonl` |
Set a boolean variable to `0`, `false`, `no`, or `off` to disable it.
## Tools
- **`llama_summarize`** `(paths, focus?)` — summarize files/dirs/globs.
- **`llama_extract`** `(paths, query)` — pull only snippets matching `query`.
- **`llama_ask`** `(prompt, paths?)` — delegate a self-contained task; paths are optional context.
- **`llama_health`** `()` — JSON status: `{ok, url, models, latency_ms, error}`. Lets Claude self-diagnose before relying on the MCP for a big job.
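For a client that wants to inspect the `llama_health` result programmatically, a decoding sketch might look like this. The key names come from the list above; the Go field types, the `usable` helper, and the sample payload are assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Health mirrors the documented keys {ok, url, models, latency_ms, error}.
// The field types are assumptions; the README lists only the key names.
type Health struct {
	OK        bool     `json:"ok"`
	URL       string   `json:"url"`
	Models    []string `json:"models"`
	LatencyMS int64    `json:"latency_ms"`
	Error     string   `json:"error"`
}

// usable reports whether a health payload describes a reachable server
// that advertises at least one model.
func usable(raw []byte) (bool, error) {
	var h Health
	if err := json.Unmarshal(raw, &h); err != nil {
		return false, err
	}
	return h.OK && len(h.Models) > 0, nil
}

func main() {
	// Illustrative payload, not captured from a real server.
	sample := []byte(`{"ok":true,"url":"http://localhost:8080","models":["unsloth/Qwen3.5-9B-GGUF:Q4_K_M"],"latency_ms":12,"error":""}`)
	ok, err := usable(sample)
	fmt.Println(ok, err)
}
```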
## Real-world savings
### In the wild
Two `llama_summarize` calls during a single cross-project session
(separate Rust repo, same Qwen3.5-9B Q8 model on `hack-mini:8080`):
| Call | Input tok | Returned tok | Saved | Duration |
|--------------------------------------------|----------:|-------------:|--------:|---------:|
| `src/` + `README.md` + `Cargo.toml` | 34,247 | 535 | 33,712 | 10m48s |
| config + docker + `scripts/` + `tests/` | 3,914 | 528 | 3,386 | 1m51s |
| **Total** |**38,161** | **1,063** |**37,098** | **12m39s** |
**~97% of bulk file content kept out of Claude's context** at a cost of
~13 minutes of local inference. Pulled from
`claude-llama-mcp stats --json`.
### Benchmark matrix
Measured against this repo's own files (Qwen3.5-9B Q8, local hardware —
your mileage will vary with model + GPU):
| Fixture | Tool | Input tok | Returned tok | Saved | % | Duration |
|------------------------|-------------------|----------:|-------------:|------:|-----:|---------:|
| 3KB Go source | `llama_summarize` | 734 | 409 | 325 | 44% | 1m28s |
| 15KB design spec | `llama_summarize` | 3,824 | 1,626 | 2,198 | 57% | 2m38s |
| 32KB plan | `llama_summarize` | 7,992 | 931 | 7,061 | 88% | 2m21s |
| 15KB design spec | `llama_extract` | 3,824 | 387 | 3,437 | 90% | 3m4s |
| `llama_ask` (no paths) | `llama_ask` | 13 | 46 | 0 | 0% | 1m10s |
Read this as: **delegation pays off once you'd be reading more than a
few KB into Claude's context.** At ~3KB the local model's reply is still
more than half the size of the input (409 of 734 tokens), so net savings
are small and you'd be better off having Claude read the file directly.
Savings grow with input size, reaching 88% on the 32KB plan, and
`llama_extract` beats `llama_summarize` on the same 15KB spec (90% vs.
57%) because it returns only matching snippets instead of a whole
summary. `llama_ask` with no paths is a wash on tokens (the prompt and
answer are both tiny); its purpose is offloading bulky generation, not
saving context.
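The `Saved` and `%` columns follow from simple arithmetic on the input and returned token counts. The sketch below recomputes them from the table rows; `savedPct` is a hypothetical helper, and rounding to the nearest whole percent is an assumption that happens to match the published figures.

```go
package main

import (
	"fmt"
	"math"
)

// savedPct returns the saved tokens (clamped at zero) and the share of the
// input that stayed out of context, rounded to a whole percent.
func savedPct(input, returned int) (saved int, pct int) {
	saved = input - returned
	if saved < 0 || input == 0 {
		return 0, 0
	}
	return saved, int(math.Round(100 * float64(saved) / float64(input)))
}

func main() {
	rows := []struct {
		name            string
		input, returned int
	}{
		{"3KB Go source / summarize", 734, 409},
		{"15KB design spec / summarize", 3824, 1626},
		{"32KB plan / summarize", 7992, 931},
		{"15KB design spec / extract", 3824, 387},
		{"llama_ask (no paths)", 13, 46},
	}
	for _, r := range rows {
		saved, pct := savedPct(r.input, r.returned)
		fmt.Printf("%-30s saved=%d (%d%%)\n", r.name, saved, pct)
	}
}
```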
The trade-off is latency: roughly 1-3 minutes per call in this matrix
(and nearly 11 minutes for the 34K-token call in the previous section)
vs. a few seconds for Claude's API. Use this MCP when the *token cost*
of the work matters more than the wall-clock; skip it for snappy
interactions.
Reproduce with `make integration` against a live llama, or look at the
matrix test at `cmd/claude-llama-mcp/real_savings_test.go`.
## Verifying the savings
Per call: read the footer. Cumulatively:
```sh
claude-llama-mcp stats # last 7 days
claude-llama-mcp stats --since 24h
claude-llama-mcp stats --tool llama_extract --json
```
The CI bench (`make bench`) runs three fixtures through `httptest`-replayed llama responses and asserts each tool produces ≥80% byte savings. That's the regression guard for the project's pitch.
## Troubleshooting
```sh
claude-llama-mcp doctor
```
Prints resolved config, pings the llama server, lists available models, and checks that the workspace root and usage log are writable. Exits non-zero if anything fails.
## Development
```sh
make build # build ./bin/claude-llama-mcp
make test # go test -race ./...
make bench # token-savings regression bench
make integration # smoke against a real llama (needs LLAMA_API_URL up)
make lint # golangci-lint
make setup # install the pre-commit hook
```
Source layout:
- `cmd/claude-llama-mcp/` — entrypoint, MCP server, CLI subcommands (`init`, `doctor`, `stats`).
- `internal/config/` — env-var + env-file loader.
- `internal/files/` — workspace guard + chunking.
- `internal/llama/` — chat-completions client + `/v1/models` health probe.
- `internal/tools/` — map/reduce service that wraps the three delegation tools.
- `internal/usage/` — token estimator, JSONL recorder, savings footer.
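The workspace guard mentioned above (and the `LLAMA_WORKSPACE_ROOT` boundary) can be approximated with a lexical containment check. This is a sketch under assumptions rather than the code in `internal/files/`: `insideRoot` is a hypothetical name, the example paths assume a Unix-style filesystem, and symlinks are not resolved here (a stricter guard would likely call `filepath.EvalSymlinks` first).

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// insideRoot reports whether target, resolved against root when relative,
// stays within root. The check is purely lexical: symlinks are not followed.
func insideRoot(root, target string) (bool, error) {
	absRoot, err := filepath.Abs(root)
	if err != nil {
		return false, err
	}
	if !filepath.IsAbs(target) {
		target = filepath.Join(absRoot, target)
	}
	rel, err := filepath.Rel(absRoot, filepath.Clean(target))
	if err != nil {
		return false, err
	}
	// A relative path that is ".." or starts with "../" climbs out of root.
	escapes := rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator))
	return !escapes, nil
}

func main() {
	root := "/work/repo" // illustrative workspace root
	for _, p := range []string{"src/main.go", "../secrets.txt", "/etc/passwd"} {
		ok, err := insideRoot(root, p)
		fmt.Println(p, ok, err)
	}
}
```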
## License
MIT.