https://github.com/aivinay/switchboard
Privacy-aware, local-first router for your CLI coding agents (Codex, Claude Code) and local LLMs (Ollama) — keeps sensitive prompts on-device and cuts premium-model usage.
ai-agents claude-code codex fastapi llm llm-orchestration llm-routing local-first local-llm model-routing ollama privacy privacy-preserving-ai python semantic-memory
- Host: GitHub
- URL: https://github.com/aivinay/switchboard
- Owner: aivinay
- License: mit
- Created: 2026-06-22T19:12:52.000Z (7 days ago)
- Default Branch: main
- Last Pushed: 2026-06-24T20:49:43.000Z (5 days ago)
- Last Synced: 2026-06-24T21:08:48.065Z (5 days ago)
- Topics: ai-agents, claude-code, codex, fastapi, llm, llm-orchestration, llm-routing, local-first, local-llm, model-routing, ollama, privacy, privacy-preserving-ai, python, semantic-memory
- Language: Python
- Homepage: https://github.com/aivinay/switchboard
- Size: 487 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Citation: CITATION.cff
- Security: SECURITY.md
62% fewer premium-agent calls · 4.1/5 quality vs 4.6/5 always-premium · 0 benchmark leaks observed
Install · Evaluation · How it works · Privacy · Paper · Docs
---

One session, three backends: local by default, Codex for code, Claude Code for reasoning.
Switchboard wraps the CLI tools you already use — no separate service, no proxy, no resold API access — and routes each prompt with deterministic rules before any learned classifier runs.
In its 100-case benchmark, Switchboard kept **62% of requests off premium
agents** while reaching **4.1/5 quality** against a **4.6/5 always-premium
baseline**, with **100% answered** and **no benchmark leaks observed**. See
[Evaluation](#evaluation) for the numbers and reproduction bundle.
Use it when you want to:
- **Spend premium agent quota where it matters** instead of sending every prompt
to the most expensive backend.
- **Keep sensitive prompts local** with a deterministic privacy floor that
learned routing cannot override.
- **Switch backends mid-session without losing context** — shared session history, semantic memory, and redaction travel with you across Ollama, Codex, and Claude Code.
## What it does
- **Routes** across local [Ollama](https://ollama.com) models, the **Codex** CLI, and **Claude Code** — deterministic rules first, with optional tiny learned classifiers for recall.
- **Private mode** — a deterministic keyword/PII/secret-format floor blocks sensitive prompts from ever reaching a subscription backend, even on fallback.
- **Grounds** answers with deterministic tools (time/date, safe calculator, unit conversion, keyless live stock & news) instead of letting a model guess.
- **Carries context** across backend switches: recent user, assistant, and tool turns are assembled into one redacted session prompt.
- **Compresses** long context with a Headroom-inspired layer; the model-boundary pass only summarizes recent conversation, while trusted facts, retrieved memory, and the current request survive intact.
- **Remembers** across backends via local embedding-based semantic memory, with SQLite search available for direct memory lookup.
- **Explains every decision** and records metadata-only telemetry (no prompt/response bodies).
- **Ships its own evaluation** — a 100-case quality benchmark, a local LLM-as-judge, and a multi-run statistical harness.
## How it works
```
UI / CLI ──► Session manager (shared history across all backends)
│
▼
Capability detector (regex) ◄──► deterministic tools
│ (learned tool dispatcher recovers misses; tool verifies)
▼
Privacy floor (keywords + PII + secret formats — a match is FINAL)
│ (learned sensitivity escalator may only ADD protection)
▼
Deterministic policy ← always wins; unknown ⇒ local
│ (learned router supplies recall: tool / local / coding / reasoning)
▼
Context builder + redaction ◄── semantic memory
│
▼
Compression (metadata + history-only context pass)
│
▼
Ollama (default) │ Codex (coding) │ Claude Code (reasoning)
│
▼
Response sanitizer ─► metadata-only telemetry
```
The organizing invariant: **deterministic policy always precedes and overrides
the learned components.** Privacy, tool grounding, forced selection, and
fallback keep working even when the local model runtime — and therefore every
learned component — is down.
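The precedence described above can be sketched in a few lines. This is a minimal illustration, not Switchboard's actual code: the regex patterns, the backend labels (`local`, `codex`, `claude-code`), and the `route` function are all hypothetical names chosen for the example.

```python
import re
from typing import Callable, Optional

# Hypothetical patterns; the real floor covers keywords, PII, and secret formats.
SENSITIVE = re.compile(r"(?i)\b(password|api[_ ]key|ssn)\b|AKIA[0-9A-Z]{16}")
CODING = re.compile(r"(?i)\b(refactor|unit test|stack trace|compile)\b")

# A learned router is modelled as any callable returning a backend label or None.
LearnedRouter = Callable[[str], Optional[str]]


def route(prompt: str, learned: Optional[LearnedRouter] = None) -> tuple[str, str]:
    """Return (backend, reason). Deterministic checks run first and win."""
    # 1. Privacy floor: a match is final, so nothing below can override it.
    if SENSITIVE.search(prompt):
        return "local", "privacy floor matched"
    # 2. Deterministic policy.
    if CODING.search(prompt):
        return "codex", "deterministic coding rule"
    # 3. The learned router only adds recall; any failure falls through.
    if learned is not None:
        try:
            guess = learned(prompt)
        except Exception:
            guess = None
        if guess in {"local", "codex", "claude-code"}:
            return guess, "learned router"
    # 4. Unknown prompts default to local.
    return "local", "default local-first"
```

Because the learned step sits behind the deterministic ones and is wrapped in a `try`, a missing or crashing classifier degrades to the local default rather than changing the privacy outcome.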
## Get started
```bash
pip install switchboard-local
```
```bash
# point it at a local model runtime (install Ollama, then pull a small model)
ollama pull llama3.2:3b
# sanity-check your setup
switchboard doctor
# ask — Switchboard routes it, grounds it, and tells you why
switchboard ask "summarize this error log and suggest a fix"
# see the routing decision without running anything
switchboard route "refactor the auth module and add tests"
# prefer your browser? launch the local web UI, then open http://127.0.0.1:8080/ui
switchboard ui
```
Requires **Python 3.11+**. Codex / Claude Code backends are optional — without
them, everything routes locally. See [docs/usage.md](docs/usage.md).
## Context, memory, and tokens
Switchboard has two user-facing surfaces:
- `switchboard route ...` previews the backend decision the stateful workflow would make, without calling a model.
- The web UI, bare `switchboard ask ...`, and `switchboard ask --backend auto ...` use the stateful core workflow: shared sessions, model switching, semantic-memory retrieval, context-boundary compression, and backend telemetry all run on the same path.
Example stateful CLI session:
```bash
switchboard ask --backend auto --new-session "Remember: prefer local models for private notes."
switchboard ask --backend auto --session --memory "What should you remember?"
```
Long prompts and long sessions record token estimates and savings metadata. The request-level pass can shorten an oversized raw prompt; the context-boundary pass then compresses only the recent-conversation block. The trusted-facts, retrieved-memory, and current-request blocks are protected from that second pass so grounding and intent are not traded away for token budget.
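As a rough sketch of that second pass, assuming a hypothetical `Context` container and a placeholder in place of a real summary: only the conversation history is shortened, and every other field is passed through untouched. The token estimator is a crude heuristic chosen for the example, not Switchboard's estimator.

```python
from dataclasses import dataclass


@dataclass
class Context:
    trusted_facts: list[str]
    retrieved_memory: list[str]
    history: list[str]  # recent conversation turns, oldest first
    current_request: str


def estimate_tokens(text: str) -> int:
    # Crude heuristic (about 4 characters per token); an assumption for this sketch.
    return max(1, len(text) // 4)


def compress_history(ctx: Context, budget_tokens: int, keep_last: int = 2) -> Context:
    """Shrink only `history`; every other field is returned unchanged."""
    total = sum(estimate_tokens(turn) for turn in ctx.history)
    if total <= budget_tokens or len(ctx.history) <= keep_last:
        return ctx
    split = len(ctx.history) - keep_last
    older, recent = ctx.history[:split], ctx.history[split:]
    # A real pass would summarize `older`; a placeholder stands in for that here.
    placeholder = f"[{len(older)} earlier turns omitted]"
    return Context(
        ctx.trusted_facts,
        ctx.retrieved_memory,
        [placeholder, *recent],
        ctx.current_request,
    )
```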
Memory is local. `switchboard memory add` stores the item in SQLite and, when `semantic_memory_enabled` is on and Ollama can serve `nomic-embed-text`, indexes an embedding for cross-backend retrieval. `switchboard memory search` works as local text search even when embeddings are unavailable.
Details: [docs/context-memory-compression.md](docs/context-memory-compression.md).
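The retrieval-with-fallback behaviour can be illustrated as follows. This is a sketch under stated assumptions: `search` and its argument shapes are hypothetical, embeddings are plain lists of floats, and a substring match stands in for the SQLite text search.

```python
import math
from typing import Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(
    query: str,
    query_vec: Optional[Sequence[float]],
    items: list[tuple[str, Optional[Sequence[float]]]],
    top_k: int = 3,
) -> list[str]:
    """Rank by cosine similarity when an embedding exists, else match text."""
    if query_vec is not None:
        scored = [(cosine(query_vec, vec), text) for text, vec in items if vec is not None]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:top_k]]
    # Fallback: embeddings unavailable, so use a case-insensitive substring match.
    needle = query.lower()
    return [text for text, _ in items if needle in text.lower()][:top_k]
```

Passing `query_vec=None` models the case where the embedding model cannot be served; the same call still returns text matches.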
## Evaluation
A 100-case benchmark across five task categories (coding, reasoning,
summarization, private, grounding), run on real backends and judged by a local
model, over **multiple independent runs** (means shown; full per-condition
numbers, confidence intervals, and significance tests are in the paper):
| Policy | Quality (1–5) | Premium usage | Privacy leaks | Answered |
|-------------------|:-------------:|:-------------:|:-------------:|:--------:|
| always-local | 3.4 | 0% | **0** | 100% |
| rules | 3.8 | 27% | **0** | 100% |
| hybrid | 3.9 | 28% | **0** | 100% |
| **learned** | **4.1** | 38% | **0** | 100% |
| always-premium | 4.6 | 100% | **0** | 61%¹ |
¹ The "just use the premium agent for everything" baseline must block every
sensitive prompt to stay leak-free, so its coverage collapses — exactly the gap
Switchboard closes. No benchmark leaks were observed in any condition or run.
These numbers come from a real-backend benchmark whose full harness travels with the paper's [reproduction bundle on Zenodo](https://doi.org/10.5281/zenodo.20836918).
## Context: why this exists (Uber, Microsoft, 2026)
Some employers have begun rationing AI coding-tool spend: Uber reportedly
capped engineers at $1,500/month per AI tool after burning its 2026 AI budget
in four months ([Bloomberg](https://www.bloomberg.com/news/articles/2026-06-02/uber-caps-usage-of-ai-tools-like-claude-code-to-cut-costs));
Microsoft's Experiences + Devices org reportedly moved off Claude Code to
GitHub Copilot CLI ([Windows Central](https://www.windowscentral.com/microsoft/microsoft-cancels-claude-code-licenses-shifting-developers-to-github-copilot-cli-a-move-likely-driven-by-financial-motives)).
A spend cap controls the invoice, but it does not decide which work actually
needs a premium model or which prompts should never leave the machine. A better
pattern is **routing, not blanket rationing**: decide request by request what
belongs local, what needs a coding agent, and what is worth premium reasoning.
Switchboard is a reference implementation of that pattern for a single
workstation. It is not yet an enterprise product; it is the smallest honest
proof that local-first routing can work, with a reproducible benchmark to back
it.
## Privacy
Switchboard is local-first and privacy-aware by construction:
- The **deterministic privacy floor runs before any non-local routing**; a positive verdict is final and cannot be overridden by a learned component or by prompt wording.
- **Secret-format detection** (cloud keys, JWTs, PEM blocks, env credentials) shares its patterns with context redaction, so the routing boundary and the redactor can't drift apart.
- **Metadata-only telemetry** — prompt and response bodies are not stored by default.
- Semantic-memory **embeddings and the eval judge run locally**.
Switchboard deliberately does **not** resell API access, scrape web UIs, or
bypass provider limits — subscription CLIs are invoked exactly as the
authenticated user could invoke them, in read-only sandbox modes. See
[SECURITY.md](SECURITY.md) and [docs/privacy.md](docs/privacy.md).
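One way to keep the routing boundary and the redactor from drifting apart is to drive both from a single pattern table. The sketch below assumes that design; the pattern names and regexes are illustrative only and are not Switchboard's actual pattern set.

```python
import re

# Illustrative patterns only; deliberately simplified for the example.
SECRET_PATTERNS = {
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "jwt": re.compile(r"\beyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+"),
    "pem_block": re.compile(r"-----BEGIN [A-Z ]+-----"),
    "env_credential": re.compile(r"\b[A-Z0-9_]*(?:SECRET|TOKEN|PASSWORD)[A-Z0-9_]*=\S+"),
}


def contains_secret(text: str) -> bool:
    """Routing-side check: any match keeps the prompt local."""
    return any(pattern.search(text) for pattern in SECRET_PATTERNS.values())


def redact(text: str) -> str:
    """Context-side redaction, using the same table as the routing check."""
    for name, pattern in SECRET_PATTERNS.items():
        text = pattern.sub(f"[REDACTED:{name}]", text)
    return text
```

With one table, anything the detector flags is also something the redactor removes, so `contains_secret(redact(text))` is expected to be false for inputs these patterns cover.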
## What's inside
- **Deterministic router** — keyword rules; unknown prompts default local-first.
- **Learned router / tool dispatcher / sensitivity escalator** — tiny softmax classifiers over a locally-computed embedding (~50 ms, pure-Python inference), each retrainable in seconds from your own thumbs-down corrections behind golden-accuracy gates. They fail closed to the deterministic path.
- **Tools** — time/date with timezones, safe abstract-syntax-tree calculator, unit conversion, keyless live stock quotes & news.
- **Compression** — structure-aware, deterministic, dependency-free; preserves task header, code blocks, tracebacks, and grounded facts.
- **Semantic memory** — `nomic-embed-text` embeddings, cosine retrieval, local memory commands, and SQLite text-search fallback for direct search.
- **Evaluation** — mock evals (CI), real-backend smoke suite, 100-case quality benchmark, adversarial tester/developer dogfooding loop.
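To illustrate what a safe abstract-syntax-tree calculator can look like, here is a minimal sketch built on Python's standard `ast` module. The function name `safe_eval` and the chosen operator whitelist are assumptions for the example, not the project's implementation.

```python
import ast
import operator

# Whitelisted operators; exponentiation is left out to avoid runaway computations.
_BINARY = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Mod: operator.mod,
}
_UNARY = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def safe_eval(expr: str) -> float:
    """Evaluate plain arithmetic without ever calling eval()."""

    def walk(node: ast.AST) -> float:
        if isinstance(node, ast.Expression):
            return walk(node.body)
        # Accept only int/float literals (bool, str, and others are rejected).
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
            return _UNARY[type(node.op)](walk(node.operand))
        raise ValueError(f"unsupported expression: {type(node).__name__}")

    return walk(ast.parse(expr, mode="eval"))
```

Names, calls, and attribute access never reach an evaluator: anything outside the whitelist raises `ValueError`, so an input such as `__import__('os')` is rejected rather than executed.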
## Configuration
Settings live in `config/personal.yaml` (ships with safe local-first defaults —
see `config/personal.example.yaml`). Highlights:
```yaml
preferences:
  router_mode: "learned"             # rules | llm | hybrid | learned
  private_mode: true                 # block sensitive prompts from non-local backends
  allow_cloud: false
  compression_enabled: true
  compression_threshold_tokens: 1000
  semantic_memory_enabled: true
  semantic_memory_top_k: 3
  claude_code_web_search: true       # allow Claude Code WebSearch for live-data fallback
  finance_provider: "yahoo"
  news_provider: "google_news_rss"
```
Provider API keys are referenced **by environment-variable name** (e.g.
`OPENAI_API_KEY`), never inline. See [docs/overrides.md](docs/overrides.md).
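Referencing a key by variable name means the config file holds only the name and the secret is looked up at run time. A minimal sketch of that lookup, with `resolve_api_key` as a hypothetical helper rather than part of Switchboard's API:

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    env_var_name: str, environ: Optional[Mapping[str, str]] = None
) -> Optional[str]:
    """Look up a key by the variable name stored in config; None if unset or blank."""
    env = os.environ if environ is None else environ
    value = env.get(env_var_name, "").strip()
    return value or None
```

The optional `environ` argument exists only so the lookup can be exercised without touching the real process environment.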
## The paper
Switchboard is described in a preprint — *"Privacy-Aware Hybrid Routing Across
Heterogeneous AI Agents."* The manuscript, the multi-run
benchmark harness, the statistical-aggregation and figure scripts, and the
per-case records are archived together as a reproduction bundle on Zenodo:
[10.5281/zenodo.20836918](https://doi.org/10.5281/zenodo.20836918).
This repository ships only the software. It deliberately does not carry the
paper's experiment-running or figure-generation tooling — that lives with the
archival record so the code stays focused on the router itself.
## Development
```bash
make install # .venv + editable install with dev extras
make check # ruff + mypy + the full test suite
```
See [CONTRIBUTING.md](CONTRIBUTING.md). Issues and PRs welcome — please preserve
the privacy invariant described there.
## Citing Switchboard
A preprint is available on Zenodo with a citable DOI —
[10.5281/zenodo.20836918](https://doi.org/10.5281/zenodo.20836918). See
[CITATION.cff](CITATION.cff) for machine-readable metadata.
> V. Gupta, "Switchboard: Privacy-Aware Hybrid Routing Across Heterogeneous AI
> Agents," Zenodo, 2026, doi:10.5281/zenodo.20836918.
## License
[MIT](LICENSE) © 2026 Vinay Gupta