{"id":50834875,"url":"https://github.com/vxfemboy/claude-llama","last_synced_at":"2026-06-14T02:31:27.399Z","repository":{"id":361277620,"uuid":"1252060990","full_name":"vxfemboy/claude-llama","owner":"vxfemboy","description":"WIP: on a quest to save token usage cuz im broke","archived":false,"fork":false,"pushed_at":"2026-05-29T22:28:04.000Z","size":97,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-29T23:20:26.171Z","etag":null,"topics":["claude-code","claude-code-plugin"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/vxfemboy.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-28T06:43:42.000Z","updated_at":"2026-05-29T22:32:01.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/vxfemboy/claude-llama","commit_stats":null,"previous_names":["vxfemboy/claude-llama"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/vxfemboy/claude-llama","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vxfemboy%2Fclaude-llama","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vxfemboy%2Fclaude-llama/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vxfemboy%2Fclaude-llama/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vxfemboy%2Fclaude-llama/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/vxfemboy","download_url":"https://codeload.github.com/vxfemboy/claude-llama/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vxfemboy%2Fclaude-llama/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34307683,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-14T02:00:07.365Z","response_time":62,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["claude-code","claude-code-plugin"],"created_at":"2026-06-14T02:31:26.679Z","updated_at":"2026-06-14T02:31:27.393Z","avatar_url":"https://github.com/vxfemboy.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# claude-llama\n\n\u003e Delegate token-heavy file work to a local llama.cpp model so the bulk content never enters Claude's context.\n\n`claude-llama` is an MCP server that exposes three tools — `llama_summarize`, `llama_extract`, `llama_ask` — plus a `llama_health` probe. Claude calls them instead of reading large files itself; the server reads the files locally, hands them to your llama.cpp instance, and returns only the answer.\n\nEvery response carries a footer like:\n\n```\n---\n[claude-llama] input=7,992 tok · returned=931 tok · saved≈7,061 tok · model=Qwen3.5-9B · 141s\n```\n\n(real numbers from summarizing a 32KB plan doc — see [Real-world savings](#real-world-savings) for the full matrix.)\n\nThe savings are also appended to a JSONL log; `claude-llama-mcp stats` summarizes it. CI guards the savings claim with a benchmark.\n\n## Install\n\n**One-liner (recommended):**\n\n```sh\ncurl -fsSL https://raw.githubusercontent.com/vxfemboy/claude-llama/main/install.sh | sh\n```\n\nDownloads the latest release binary for your OS/arch, verifies the checksum, drops it in `~/.local/bin`, and runs `claude-llama-mcp init`.\n\n**As a Claude Code plugin:**\n\n```\n/plugin marketplace add vxfemboy/claude-llama\n/plugin install claude-llama:claude-llama\n```\n\n(then `/reload-plugins`)\n\n**From source:**\n\n```sh\ngo install github.com/vxfemboy/claude-llama/cmd/claude-llama-mcp@latest\nclaude-llama-mcp init\n```\n\nAfter installing, register it with your MCP client. For Claude Code, add to your project's `.mcp.json`:\n\n```json\n{\n  \"mcpServers\": {\n    \"claude-llama\": { \"command\": \"claude-llama-mcp\" }\n  }\n}\n```\n\n## Configuration\n\nAll settings are environment variables. `claude-llama-mcp init` writes them to `~/.config/claude-llama/env` (honoring `$XDG_CONFIG_HOME`); the process env always wins over the file.\n\n| Variable                 | Default                              | Purpose                                                         |\n|--------------------------|--------------------------------------|-----------------------------------------------------------------|\n| `LLAMA_API_URL`          | `http://localhost:8080`              | llama.cpp server (OpenAI-compatible)                            |\n| `LLAMA_MODEL`            | `unsloth/Qwen3.5-9B-GGUF:Q4_K_M`     | model name passed to `/v1/chat/completions`                     |\n| `LLAMA_MAX_INPUT_TOKENS` | `6000`                               | max tokens per chunk before map/reduce kicks in                 |\n| `LLAMA_TIMEOUT_SECONDS`  | `120`                                | per-call timeout                                                |\n| `LLAMA_WORKSPACE_ROOT`   | cwd                                  | path-traversal boundary; the server refuses to read outside it  |\n| `LLAMA_FOOTER`           | `true`                               | append the per-call savings footer to each response             |\n| `LLAMA_USAGE_LOG`        | `true`                               | append a JSONL row per call to `$XDG_STATE_HOME/claude-llama/usage.jsonl` |\n\nSet any value to `0`, `false`, `no`, or `off` to disable a boolean.\n\n## Tools\n\n- **`llama_summarize`** `(paths, focus?)` — summarize files/dirs/globs.\n- **`llama_extract`** `(paths, query)` — pull only snippets matching `query`.\n- **`llama_ask`** `(prompt, paths?)` — delegate a self-contained task; paths are optional context.\n- **`llama_health`** `()` — JSON status: `{ok, url, models, latency_ms, error}`. Lets Claude self-diagnose before relying on the MCP for a big job.\n\n## Real-world savings\n\n### In the wild\n\nTwo `llama_summarize` calls during a single cross-project session\n(separate Rust repo, same Qwen3.5-9B Q8 model on `hack-mini:8080`):\n\n| Call                                       | Input tok | Returned tok |   Saved | Duration |\n|--------------------------------------------|----------:|-------------:|--------:|---------:|\n| `src/` + `README.md` + `Cargo.toml`        |    34,247 |          535 |  33,712 |   10m48s |\n| config + docker + `scripts/` + `tests/`    |     3,914 |          528 |   3,386 |    1m51s |\n| **Total**                                  |**38,161** |    **1,063** |**37,098** | **12m39s** |\n\n**~97% of bulk file content kept out of Claude's context** at a cost of\n~13 minutes of local inference. Pulled from\n`claude-llama-mcp stats --json`.\n\n### Benchmark matrix\n\nMeasured against this repo's own files (Qwen3.5-9B Q8, local hardware —\nyour mileage will vary with model + GPU):\n\n| Fixture                | Tool              | Input tok | Returned tok | Saved | %    | Duration |\n|------------------------|-------------------|----------:|-------------:|------:|-----:|---------:|\n| 3KB Go source          | `llama_summarize` |       734 |          409 |   325 |  44% |    1m28s |\n| 15KB design spec       | `llama_summarize` |     3,824 |        1,626 | 2,198 |  57% |    2m38s |\n| 32KB plan              | `llama_summarize` |     7,992 |          931 | 7,061 |  88% |    2m21s |\n| 15KB design spec       | `llama_extract`   |     3,824 |          387 | 3,437 |  90% |     3m4s |\n| `llama_ask` (no paths) | `llama_ask`       |        13 |           46 |     0 |   0% |    1m10s |\n\nRead this as: **delegation pays off once you'd be reading more than a\nfew KB into Claude's context.** Below ~3KB the local model's reply is\nnearly as long as the input — net savings are small and you'd be better\noff having Claude read the file directly. Above ~10KB savings grow fast,\nand `llama_extract` beats `llama_summarize` because it returns only\nmatching snippets instead of a whole summary. `llama_ask` with no paths\nis a wash on tokens (the prompt and answer are both tiny) — its purpose\nis offloading bulky generation, not saving context.\n\nThe trade-off is latency: 1-3 minutes per call on this hardware vs. a\nfew seconds for Claude's API. Use this MCP when the *token cost* of the\nwork matters more than the wall-clock; skip it for snappy interactions.\n\nReproduce with `make integration` against a live llama, or look at the\nmatrix test at `cmd/claude-llama-mcp/real_savings_test.go`.\n\n## Verifying the savings\n\nPer call: read the footer. Cumulatively:\n\n```sh\nclaude-llama-mcp stats              # last 7 days\nclaude-llama-mcp stats --since 24h\nclaude-llama-mcp stats --tool llama_extract --json\n```\n\nThe CI bench (`make bench`) runs three fixtures through `httptest`-replayed llama responses and asserts each tool produces ≥80% byte savings. That's the regression guard for the project's pitch.\n\n## Troubleshooting\n\n```sh\nclaude-llama-mcp doctor\n```\n\nPrints resolved config, pings the llama server, lists available models, and checks that the workspace root and usage log are writable. Exits non-zero if anything fails.\n\n## Development\n\n```sh\nmake build      # build ./bin/claude-llama-mcp\nmake test       # go test -race ./...\nmake bench      # token-savings regression bench\nmake integration # smoke against a real llama (needs LLAMA_API_URL up)\nmake lint       # golangci-lint\nmake setup      # install the pre-commit hook\n```\n\nSource layout:\n\n- `cmd/claude-llama-mcp/` — entrypoint, MCP server, CLI subcommands (`init`, `doctor`, `stats`).\n- `internal/config/` — env-var + env-file loader.\n- `internal/files/` — workspace guard + chunking.\n- `internal/llama/` — chat-completions client + `/v1/models` health probe.\n- `internal/tools/` — map/reduce service that wraps the three delegation tools.\n- `internal/usage/` — token estimator, JSONL recorder, savings footer.\n\n## License\n\nMIT.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvxfemboy%2Fclaude-llama","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvxfemboy%2Fclaude-llama","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvxfemboy%2Fclaude-llama/lists"}