https://github.com/kazi-org/kazi

Make your coding agent actually finish the job. Install one skill and Claude Code keeps working — planning, fixing, testing, deploying — until your goal is objectively true (tests pass, endpoint live, deployed), or it tells you why.

agent-orchestration agentic ai-agents automation claude-code codex coding-agent developer-tools elixir llm mcp reconciliation


Host: GitHub
URL: https://github.com/kazi-org/kazi
Owner: kazi-org
License: apache-2.0
Created: 2025-01-24T20:45:20.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2026-06-27T00:44:53.000Z (7 days ago)
Last Synced: 2026-06-27T02:18:32.673Z (6 days ago)
Topics: agent-orchestration, agentic, ai-agents, automation, claude-code, codex, coding-agent, developer-tools, elixir, llm, mcp, reconciliation
Language: Elixir
Homepage: https://kazi.sire.run
Size: 3.78 MB
Stars: 2
Watchers: 0
Forks: 0
Open Issues: 24
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
- Notice: NOTICE
- Agents: AGENTS.md



# kazi

**Your coding agent says "done." kazi proves it.**

Give **Claude Code** the power to actually *finish*. You chat with Claude the way you
already do; kazi works in the background to make "done" **objective** — looping your agent
until every check passes (tests green, the endpoint live, the change deployed), or stopping
to tell you why (`stuck`, or out of budget) instead of pretending it's finished.
**You never run kazi yourself — Claude does.**

## Try it in 10 seconds

```sh
brew install kazi-org/tap/kazi
kazi install-skill # teaches Claude Code the kazi skill (writes ~/.claude/skills/kazi)
```

Then drive it from Claude Code with the kazi skill — author the checks, then converge:

```text
/kazi plan "add a /healthz endpoint that returns 200 ok, with a test, deployed"
# Claude drafts the acceptance predicates; glance at them, then:
/kazi apply
```

Claude loops — editing, testing, deploying — and reports back only when every predicate
is *objectively* true (or it is genuinely `stuck`). You never leave your chat with Claude.

Prefer plain English? Just say **have kazi drive this until done** — the skill runs the
same `/kazi plan` → `/kazi apply` for you.

## How it works

Under the hood, *kazi* is **the outer/reconciliation loop for coding agents** — your agent
runs it, not you. (*kazi* is Swahili for *work / a job*.) It *drives* the coding agent you
already use (Claude Code, Codex, …) in a reconcile loop: observe every failing check,
dispatch a fix, integrate, re-check.

Think of it like **Kubernetes for coding goals**: you declare desired state, kazi
watches actual state, and it keeps closing the gap until the two match.

```mermaid
flowchart TD
U(["You: 'build a URL-shortener web service and ship it live in production'"]) --> K

subgraph kazi [kazi reconcile loop]
O["Observe
What's failing?"] --> D["Dispatch
an agent to fix it"]
D --> C{"Every check passes?"}
C -- No --> O
C -- Yes --> I["Integrate
PR / Merge"]
I --> Dep["Deploy and Verify Live"]
end

style U fill:#1e293b,stroke:#cbd5e1,color:#f8fafc
style kazi fill:#0f172a,stroke:#334155,color:#f8fafc
style O fill:#334155,stroke:#475569,color:#f8fafc
style D fill:#334155,stroke:#475569,color:#f8fafc
style C fill:#334155,stroke:#475569,color:#f8fafc
style I fill:#334155,stroke:#475569,color:#f8fafc
style Dep fill:#334155,stroke:#475569,color:#f8fafc
```

That loop drives the failing predicates to zero and only then reports done. The
recording below is a **real** `kazi apply` run (not a mockup): a goal whose one
acceptance predicate — *"`go test` passes"* — is false at t0, driven by the
`claude` harness until it is objectively true. Reproduce it with
[`priv/examples/hero_cast_demo`](priv/examples/hero_cast_demo) (the committed
asciicast is [`assets/proof-loop.cast`](assets/proof-loop.cast)):

A real kazi apply run: kazi.loop reports iter=1 failing=["tests-pass"], then iter=2 failing=[], then CONVERGED — every predicate is satisfied (predicate vector: [pass] tests-pass).

It is **not** another coding agent, terminal, or IDE. kazi *drives* the agent you
already use. As that agent gets better, kazi gets better for free.

---

## Why kazi?

Two problems nobody else owns, plus a privacy option:

1. **"Done" is the agent's opinion.** A coding agent stops when it *thinks* it's
finished — even when the work is merely plausible. kazi makes "done" objective:
the loop can only succeed when *every* check (kazi calls them **predicates**)
evaluates true, with stored evidence. Truth lives in the controller, not the agent.
2. **Parallel agents collide.** Locking a *task* doesn't stop two agents editing the
*same files*. kazi coordinates on **resources** — an agent leases its "blast
radius" before touching code — so concurrent runs converge instead of conflict.
3. **Bring Your Own Model (BYOM) & Privacy.** Use cloud models or run entirely locally. Wire kazi to a local model (e.g., Llama 3, Qwen) via `opencode` for zero data leaks. Your code and context never leave your hardware.
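
The resource-lease idea in item 2 can be sketched in a few lines of Python. This is only an illustration of "lease the blast radius before editing": the `LeaseTable` class, its method names, and the path-prefix overlap rule are assumptions made for the example, not kazi's actual lease implementation.

```python
from typing import Dict, Set


def overlaps(a: str, b: str) -> bool:
    """Two path scopes collide when one equals or contains the other."""
    a, b = a.rstrip("/"), b.rstrip("/")
    return a == b or a.startswith(b + "/") or b.startswith(a + "/")


class LeaseTable:
    """Hypothetical in-memory lease registry keyed by agent id."""

    def __init__(self) -> None:
        self._held: Dict[str, Set[str]] = {}

    def try_acquire(self, agent: str, paths: Set[str]) -> bool:
        # Refuse the lease if any requested path overlaps another agent's holding.
        for owner, held in self._held.items():
            if owner != agent and any(overlaps(p, h) for p in paths for h in held):
                return False
        self._held.setdefault(agent, set()).update(paths)
        return True

    def release(self, agent: str) -> None:
        # Dropping the lease lets a waiting agent acquire the same paths.
        self._held.pop(agent, None)
```

Under this rule two agents working in `lib/api` and `lib/web` both get a lease at once, while a third agent asking for `lib/api/router.ex` has to wait until the first releases.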

> **Why now?** Coding agents are finally good enough to do real work — ship
> features, fix bugs, wire up tests. That is exactly *why* kazi exists: once
> agents act autonomously, you need a **controller above them** to decide when
> they are truly done. That layer didn't exist until now, and it's precisely what
> kazi is.

---

## The 60-second mental model

A **goal** is just a list of checkable statements plus a budget and a scope:

- **predicates** — the checks that define "done": `the unit tests pass`, `GET /health
returns 200 ok`, `the production error rate is 0 over 30m`, …
- a **budget** — a hard ceiling (iterations / wall-clock / tokens) so it can never
run forever or burn money.
- a **scope** — the repo + paths agents are allowed to touch.

kazi loops: **observe** every predicate → the failing ones *are* the to-do list →
**dispatch** an agent to fix them → **integrate** (open a PR, rebase-merge) →
**deploy** → **re-check**. It stops only when all predicates are true (`converged`),
the same checks keep failing (`stuck` → escalate to you), or the budget runs out.
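
As a rough illustration, the loop above fits in a short Python function. Everything here is hypothetical (the `reconcile` name, the `stuck_after` threshold, modelling predicates as zero-argument callables); it shows the shape of the three exits, not how kazi is implemented.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical names throughout: this models the loop's shape, not kazi's internals.
Predicate = Callable[[], bool]


@dataclass
class Budget:
    max_iterations: int


def reconcile(predicates: Dict[str, Predicate],
              dispatch: Callable[[List[str]], None],
              budget: Budget,
              stuck_after: int = 2) -> str:
    """Observe, dispatch, re-check until converged, stuck, or over budget."""
    previous: List[str] = []
    repeats = 0
    for _ in range(budget.max_iterations):
        # Observe: the failing predicates are the to-do list.
        failing = sorted(name for name, check in predicates.items() if not check())
        if not failing:
            return "converged"
        # The same set failing again counts toward "stuck".
        repeats = repeats + 1 if failing == previous else 0
        if repeats >= stuck_after:
            return "stuck"
        previous = failing
        dispatch(failing)  # stand-in for "agent fixes, integrate, deploy"
    return "over_budget"
```

The agent never gets to return "done" itself: the only path to `"converged"` is an observation in which no predicate fails.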

---

## With kazi vs. without

| Without kazi | With kazi |
|---|---|
| *"The agent says it's done."* You trust it on faith. | **Every predicate verified true**, with stored evidence. Truth lives in the controller, not the agent. |
| Two parallel agents edit the same files → merge conflicts. | Agents **lease their blast radius** first — disjoint work runs free, overlapping work serializes. |
| Green tests on a laptop, broken in production. | A **live predicate** probes the *deployed* endpoint. Green-on-my-machine is never enough. |
| It stops when it *feels* finished. | It stops only on `converged`, `stuck`, or `over-budget` — and tells you which. |

---

## Proof — real goals kazi converged

Goals a naive pipeline leaves *subtly broken* — the file looks created, the answer looks
right, the parallel split looks done — that kazi drove to **objective convergence**. Every
number below is copied verbatim from a real `kazi apply --json` run recorded in
[`docs/devlog.md`](docs/devlog.md); the full reproduction steps (goal-file, command, version,
and the metered `cost_usd` per run) are in
[`docs/dogfood-methodology.md`](docs/dogfood-methodology.md). No unverifiable claims —
if a number can't be traced to a captured run, it isn't here.

| Goal | The subtle break | Result | Source |
|---|---|---|---|
| **Exact-content file** — `VERSION.txt` must be exactly `1.0.0` | "The file exists" looks done, but the bytes must be exact | `converged`, 2 iters / 18.5 s / 39,712 tokens (v1.64.2) | T26.6 |
| **Self-correcting on an opaque oracle** — `solution.py` graded by a one-way sha256 | The first plausible attempt is *wrong*; nothing grades it | `converged`, 2 iters / 39.3 s, Haiku self-corrected on iter 2 (v1.64.1) | T30.4 |
| **A real cross-group dependency, parallelized** — streaming consumes a contract type | A naive parallel split compiles against a type that doesn't exist yet | `collective: converged`, 2 disjoint groups concurrent + 1 gated, single-node, NATS-free (v1.64.2) | T21.12 / T23.9 |

Full gallery + before/after evidence + reproducible method (incl. per-run cost):
**** (see also the [methodology doc](docs/dogfood-methodology.md)).

---

## How the skill routes

`install-skill` adds a trigger to the kazi skill, so `/kazi plan` / `/kazi apply` — and the
plain-English phrase **have kazi drive this until done** — only route to kazi once you have
installed it. From there Claude authors the acceptance predicates with `kazi plan` and runs
`kazi apply` until they are *objectively* true. You do not operate kazi directly; your
agent does.

---

## Token economy without local models

The cheapest way to run an agent loop isn't a local GPU — it's spending frontier
reasoning **once**, then grinding on a cheap model that the predicates keep honest.
kazi makes that an **in-family Claude** move, so any Claude Code user gets it with
**no local model and no local GPU host**
([ADR-0033](docs/adr/0033-cheaper-via-in-family-claude-tiering.md)):

- You **chat with Claude Code**; it drives kazi.
- **Easy iterations** run on a **cheap Claude model** (e.g. **Haiku 4.5**, `claude-haiku-4-5`).
- **Hard reasoning** runs on a **frontier model** (e.g. **Opus 4.8**, `claude-opus-4-8`).
- The **predicates keep the cheap model honest** — it cannot declare a false "done,"
because convergence is gated on objective checks, not the model's say-so.

You pay frontier rates only for the judgment (authoring the predicates) and cheap
rates for the bulk of the work.

### Worked example — author on a frontier model, grind on a cheap one

```sh
# 1. Author the acceptance predicates ONCE — let your session's frontier model
# (e.g. Opus 4.8) draft them with `kazi plan`. It PRINTS a proposal-ref:
kazi plan "add a /healthz endpoint that returns 200 ok" --workspace ./svc
# review the draft (or run `kazi list-proposed` to see pending refs), then approve
# it by passing the ref `kazi plan` printed:
kazi approve

# 2. Drive the N-iteration grind on a CHEAP Claude model — no local model needed:
kazi apply my-goal.toml --workspace ./svc --harness claude --model claude-haiku-4-5
```

`--harness claude --model ` selects which Claude model runs that call
([ADR-0033](docs/adr/0033-cheaper-via-in-family-claude-tiering.md)); `claude` is
already the default harness, so the only new thing is naming a cheaper model for
the grind.

### Start cheap, escalate on stuck (the smart default)

Static tiering always grinds on one cheap model. The adaptive default — **start
cheapest, step UP only when kazi reports the same slice is stuck** — pays frontier
rates only for the slices that actually need them
([ADR-0035](docs/adr/0035-skill-driven-adaptive-model-tiering.md)):

```
claude-haiku-4-5 -> claude-sonnet-4-6 -> claude-opus-4-8 (cap — do not escalate past Opus)
```

The policy lives in the orchestrating skill, **never in kazi**: kazi reports
per-iteration state (`converged` / `stuck` / `over_budget`) via `kazi apply --json`,
and the skill owns the ladder and the per-slice rung counter. The full
copy-paste recipe is in [`AGENTS.md`](AGENTS.md) ("Escalate-on-stuck") and the
installed `kazi` skill.
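
A minimal sketch of what such a skill-side ladder could look like, in Python. The `Escalator` class and its methods are invented for illustration; only the three model ids and the three reported states come from the text above.

```python
from typing import Dict, List

# The ladder from the text; the cap is the last rung.
LADDER: List[str] = ["claude-haiku-4-5", "claude-sonnet-4-6", "claude-opus-4-8"]


class Escalator:
    """Hypothetical skill-side state: one rung counter per slice."""

    def __init__(self, ladder: List[str] = LADDER) -> None:
        self.ladder = ladder
        self.rung: Dict[str, int] = {}

    def model_for(self, slice_id: str) -> str:
        # Slices that were never stuck stay on the cheapest rung.
        return self.ladder[self.rung.get(slice_id, 0)]

    def report(self, slice_id: str, state: str) -> bool:
        """Record a reported state; return True when another attempt makes sense."""
        if state in ("converged", "over_budget"):
            return False
        if state == "stuck":
            current = self.rung.get(slice_id, 0)
            if current + 1 >= len(self.ladder):
                return False  # already at the cap; hand back to the human
            self.rung[slice_id] = current + 1
            return True
        raise ValueError(f"unknown state: {state}")
```

Because the counter is per slice, one stubborn slice climbing to the top rung does not make every other slice more expensive.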

> **Designed-for, not yet measured.** The cost win is the *intended* economics —
> frontier judgment once, cheap iterations gated by predicates. The headline dollar
> figure is being measured by the multi-iteration benchmark; until it lands, we state
> the *shape* of the saving, not an unproven number.

> **Want full privacy instead?** Local / bring-your-own-model is the **secondary**
> option: point kazi at a local model via `opencode` so your code and context never
> leave your hardware (see [Use a different coding harness](#use-a-different-coding-harness)).
> It trades the in-family convenience for on-prem privacy.

---

## What a coding agent says

> *"Left to myself, I'll tell you a task is done the moment the code looks
> right. kazi won't let me — it holds the predicates and re-checks them against
> reality, so I stop claiming 'done' when it isn't. I end up shipping the thing
> you actually asked for, not the thing I hoped was finished."*
>
> — Claude (Anthropic), describing kazi in its own words. Agent-authored, kept
> verbatim and labelled as such — not a human testimonial.

---

## Who it's for

- **Builders who ship fast but need reliability** — if an agent has ever
"finished" something that wasn't actually done, objective termination is the
guardrail against plausible-but-broken output.
- **Teams running parallel coding agents** — resource leases coordinate who edits
what *before* any file changes, so concurrent runs converge instead of collide.
- **Engineers who refuse "works on my machine"** — predicates can verify the live,
deployed system, not just the local checkout.

**Not for you (yet) if:** you want an agent to decide *what* to build — that's your
call; kazi only drives toward an outcome you declare. It also needs a coding
harness (`claude`, `opencode`, …) on your `PATH`; kazi drives one, it isn't one.

---

## Install

The fastest way — a single self-contained binary via Homebrew (no Erlang
prerequisite; ERTS and the SQLite NIF are bundled, so you get the full read-model):

```sh
brew install kazi-org/tap/kazi
kazi --help
```

Prebuilt binaries are published for **Apple Silicon macOS**, **x86_64 Linux**, and
**ARM Linux** (`aarch64`) on each [GitHub Release](https://github.com/kazi-org/kazi/releases)
(Intel macOS is not yet built — build from source, below). The binary is a
Burrito wrap of a `mix release` ([ADR-0014](docs/adr/0014-binary-distribution-burrito-homebrew.md)),
so unlike the escript it carries the native `exqlite` NIF and persists every
iteration.

> **Runtime requirement:** kazi DRIVES a coding agent ([ADR-0001](docs/adr/0001-positioning-outer-loop-reconciler.md));
> it does not bundle one. A harness binary — `claude` (default) or `opencode` —
> must be on your `PATH` to actually run a goal.

To build from source instead, you need **Elixir / Erlang** (OTP 26+) and `mix`,
plus **git** (kazi commits/opens PRs in your target repo) and *(optional, for live
deploys)* **gcloud** / a deploy command and `gh`:

```sh
git clone https://github.com/kazi-org/kazi && cd kazi
mix deps.get
mix test # ~850 hermetic tests, should be green
```

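Two ways to invoke kazi (same CLI surface; the escript runs without the SQLite
read-model, as described under the release build further down):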

```sh
# Mix task — recommended. Boots the full app and persists every iteration to a
# local SQLite read-model (created + migrated automatically on first run).
mix kazi.apply --workspace

# Or build a standalone binary:
mix escript.build # produces ./kazi
./kazi apply --workspace
./kazi --help
```

### Use a different coding harness

`claude` (Claude Code) is the **default** harness — if you do nothing, kazi
shells out to `claude` exactly as before. But kazi drives whatever CLI coding
agent you already have installed and configured; it does not reimplement provider
plumbing ([ADR-0016](docs/adr/0016-generic-harness-profiles.md)). Pick another
harness per-run with a flag:

```sh
kazi apply --workspace \
--harness opencode --model local-ollama/qwen3.6:35b-a3b
```

`--harness ` selects the harness (`claude`, `opencode`, `codex`,
`antigravity`, `claw`, or `gemini_cli` today — see the [tier table](#tiered-harness-support-adr-0022)
below); `--model ` selects the model that harness should use.

**Point opencode at a local model (e.g. a locally-hosted Qwen3.6).** If you run
[`opencode`](https://opencode.ai) wired to a local model (for example a
**Qwen3.6 35B-A3B** on a local GPU host), **opencode's own provider config
is the source of truth** for the endpoint and credentials. `--model` is
opencode's `provider/model` string — the provider (`local-ollama` above) and its
base URL live in your opencode config, not in kazi. kazi can also forward
provider/endpoint environment variables to the harness subprocess when a local
setup expects them — declare them as the harness `:env` and kazi passes them
straight through to the underlying call.

#### A per-goal or global default

You don't have to pass `--harness`/`--model` on every run. A goal-file can carry
its own preferred harness in an optional `[harness]` table:

```toml
[harness]
id = "opencode" # a KNOWN harness id (claude / opencode / codex / antigravity / claw / gemini_cli)
model = "local-ollama/qwen3.6:35b-a3b" # optional provider/model override
command = "opencode" # optional binary override
```

Or set a machine-wide default in app config:

```elixir
# config/config.exs
config :kazi, :harness, :opencode
```

kazi resolves the harness with a fixed **precedence** (highest first):

1. the **`--harness` / `--model` CLI flags**;
2. the goal-file **`[harness]` table**;
3. the **app config** `config :kazi, :harness`;
4. the default, **`claude`**.

So a CLI flag always wins; absent every layer, kazi drives `claude`.
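
The precedence list is a first-match lookup. A small Python sketch, with a hypothetical `resolve_harness` function and parameter names standing in for the three layers:

```python
from typing import Optional


def resolve_harness(cli_flag: Optional[str] = None,
                    goal_file: Optional[str] = None,
                    app_config: Optional[str] = None) -> str:
    """Return the first harness id that is set, walking the documented precedence."""
    for candidate in (cli_flag, goal_file, app_config):
        if candidate:
            return candidate
    return "claude"  # the documented default
```

For example, `resolve_harness(goal_file="opencode", app_config="codex")` picks `opencode`, and adding `cli_flag="claude"` would override both.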

#### Add a harness = declare a profile

There is no new adapter module per harness. A harness is a **profile** — a value
in [`Kazi.Harness.Registry`](lib/kazi/harness/registry.ex) built from
[`Kazi.Harness.Profile`](lib/kazi/harness/profile.ex): a `command`, an argv
renderer (`build_args`), and a stdout parser (`parse`), plus the set of optional
flags the harness understands. One generic adapter (`Kazi.Harness.CliAdapter`)
runs every profile. Adding `codex`, `gemini-cli`, etc. is profile DATA — often
reusing an existing parser — not a new module; a fully custom harness can be
declared in config without touching kazi.
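
To make "a harness is data, not a module" concrete, here is a Python analogue of the idea. It is not a port of `Kazi.Harness.Profile`: the field signatures are guesses for the sake of the example, and the "harness" is just the Python interpreter echoing a JSON envelope so the sketch runs anywhere.

```python
import json
import subprocess
import sys
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass(frozen=True)
class Profile:
    """Hypothetical mirror of a harness profile: a command plus two functions."""
    command: str
    build_args: Callable[[str, str], List[str]]  # (prompt, model) -> argv tail
    parse: Callable[[str], Dict[str, Any]]       # stdout -> structured result


def run(profile: Profile, prompt: str, model: str) -> Dict[str, Any]:
    """One generic adapter: the same code path runs every profile."""
    argv = [profile.command, *profile.build_args(prompt, model)]
    done = subprocess.run(argv, capture_output=True, text=True, check=True)
    return profile.parse(done.stdout)


# A stand-in "harness": the Python interpreter echoing a JSON envelope.
ECHO = "import json, sys; print(json.dumps({'result': sys.argv[1], 'model': sys.argv[2]}))"
echo_profile = Profile(
    command=sys.executable,
    build_args=lambda prompt, model: ["-c", ECHO, prompt, model],
    parse=json.loads,
)
```

Adding another harness in this model means constructing another `Profile` value, possibly reusing `json.loads` as the parser; `run` does not change.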

> **Runtime requirement.** The chosen harness binary (`claude`, `opencode`, …)
> must be installed and on your `PATH`. kazi shells out to it as a subprocess; it
> does not bundle or install harnesses.

#### Tiered harness support (ADR-0022)

Not every CLI agent clears the same bar. kazi drives every harness as a
non-interactive subprocess and parses its stdout, so a harness is **first-class**
only when it runs from a single prompt AND emits machine-parseable output
(JSON/JSONL) correctly under a non-TTY subprocess
([ADR-0022](docs/adr/0022-harness-onboarding-conformance.md)). Some tools are
added with a documented workaround, and one is **best-effort only**:

| Harness (`--harness`) | Tier | Notes |
| --- | --- | --- |
| `claude` (default) | First-class | single JSON envelope; full cost/token parse. |
| `opencode` | First-class | NDJSON event stream; point it at a local model. |
| `codex` | First-class | `codex exec … --json` JSONL stream; auth `OPENAI_API_KEY` / `codex login`. |
| `antigravity` | Conformant **with a workaround** | non-TTY stdout bug (`antigravity-cli#76`) handled via `--prompt-file --output json`; auth `GEMINI_API_KEY` / `ANTIGRAVITY_API_KEY`. |
| `claw` | **Best-effort / demo-grade** | claw-code emits **no** structured output and has no model flag — kazi surfaces its raw stdout as the result with **no cost/token extraction**. It runs, but fidelity is degraded; treat it as a demo ("an agent-managed museum exhibit, not a production tool"), not a budgeted production run. Auth is via env API keys (`ANTHROPIC_API_KEY` / `OPENAI_API_KEY`). |
| `gemini_cli` | First-class | `gemini -p … -o json` single JSON envelope; `--approval-mode yolo` runs non-interactively; full result + best-effort token parse; auth `GEMINI_API_KEY` (or Google OAuth / Vertex `GOOGLE_API_KEY`). |

### Build a self-contained release (full read-model)

The escript can't bundle the native SQLite NIF, so it runs **without** the
read-model. A `mix release` bundles ERTS *and* the compiled NIFs, so the released
binary has the **full read-model** (and is the foundation the per-platform binary
is built from — see [ADR-0014](docs/adr/0014-binary-distribution-burrito-homebrew.md)):

```sh
MIX_ENV=prod mix release --overwrite # builds _build/prod/rel/kazi

# The CLI is invoked through the release's `eval` command, which propagates the
# CLI's exit code (0 on convergence / a recorded proposal / approval, non-zero
# otherwise) — so the release composes in scripts and CI like the escript:
_build/prod/rel/kazi/bin/kazi eval 'Kazi.Release.cli(["--help"])'
_build/prod/rel/kazi/bin/kazi eval \
'Kazi.Release.cli(["apply", "", "--workspace", ""])'
_build/prod/rel/kazi/bin/kazi eval 'Kazi.Release.cli(["list-proposed"])'
```

`Kazi.Release.cli/1` dispatches to the same `Kazi.CLI` core as the escript and
`mix kazi.apply`, so every subcommand (`apply` / `plan` / `list-proposed` /
`approve` / `reject` / `--help`) behaves identically.

### Build a single-file native binary (Burrito)

[Burrito](https://github.com/burrito-elixir/burrito) wraps the `mix release`
above into one self-contained per-platform executable that bundles ERTS **and**
the compiled exqlite NIF — so the binary has the **full SQLite read-model** with
no Erlang prerequisite on the user's machine (T6.2, [ADR-0014](docs/adr/0014-binary-distribution-burrito-homebrew.md)).
The `kazi` release declares four targets: macOS `aarch64`/`x86_64` and Linux
`aarch64`/`x86_64`.

Building requires [Zig](https://ziglang.org) **0.15.2** (Burrito's pinned
version) and `xz` on `PATH`. Burrito also needs `7z` when cross-building Windows
targets, but kazi ships no Windows target, so it is not required here. Build the
host target and run it:

```sh
# Build the binary for the current host platform (set BURRITO_TARGET to one of
# macos_aarch64 / macos_x86_64 / linux_aarch64 / linux_x86_64; omit it to build
# every declared target). Output lands in ./burrito_out/.
BURRITO_TARGET=macos_aarch64 MIX_ENV=prod mix release --overwrite

# The wrapped binary takes the CLI args directly — no `eval`. It reads them via
# Burrito's argv and dispatches through the same Kazi.CLI core:
./burrito_out/kazi_macos_aarch64 --help
./burrito_out/kazi_macos_aarch64 apply --workspace
./burrito_out/kazi_macos_aarch64 list-proposed
```

The binary persists its read-model to `$KAZI_DB` if set, otherwise
`~/.kazi/kazi.db` (created on first run; see `config/runtime.exs`). Unlike the
escript, every iteration and proposal is persisted — the NIF is bundled.
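
The lookup order for the database path can be written out as a tiny Python function. The function name is hypothetical, and treating an empty `KAZI_DB` as unset is an assumption of this sketch rather than documented behavior.

```python
import os
from pathlib import Path
from typing import Mapping


def read_model_path(env: Mapping[str, str] = os.environ) -> Path:
    """KAZI_DB wins when set (and, by assumption, non-empty); else ~/.kazi/kazi.db."""
    override = env.get("KAZI_DB")
    if override:
        return Path(override)
    return Path.home() / ".kazi" / "kazi.db"
```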

> **macOS 26 + Zig note.** Burrito 1.5.0 pins Zig **0.15.2**, which cannot link
> native binaries against the macOS 26 SDK (Xcode 26); Zig 0.16 links it but is
> API-incompatible with Burrito's `build.zig`. On a macOS 26 host the wrap step
> fails at the Zig link; build the macOS binaries on a macOS 15 (or earlier)
> runner — which is what the release CI matrix (T6.3) targets.

---

## Quickstart 1 — describe what you want in plain English

You don't have to write a goal-file by hand, and you don't have to break the work
down yourself. Tell kazi the *app* (or feature) you want — as high-level as
"build an X" — and it drafts the machine-checkable predicates that define "done"
for you (using your coding agent), then holds them for your review. **Nothing runs
until you approve**, and you can trim or edit what it drafted:

```sh
# 1. Describe the app you want. In a terminal, kazi asks a few sharp clarifying
# questions FIRST (so "done" is precise — especially the live-verification
# target), then drafts the acceptance predicates and an inline rationale:
kazi plan "create a URL-shortener web service" --workspace ./shortener
#
# A few questions to make the goal precise (press Enter for the default):
# What is the live-verification target for this goal?
# 1) A deployed URL probed over HTTP *
# 2) Production logs / a runtime signal
# 3) None for now — green tests are enough
# > 1
# ...
# PROPOSED proposal=prop-url-shortener-3f9c1a2b goal=url-shortener
# • go test ./... passes
# • POST /shorten returns 201 with a short code for a submitted URL
# • GET /redirects (302) to the original URL
# • GET / renders a form to submit a URL
# rationale: probe the deployed shortener over HTTP; auth is out of scope for v1

# 2. Review what it drafted (you're the approver — agents propose, humans dispose).
# Too much? Too little? Refine inline with a sharper sentence when prompted.
kazi list-proposed
# prop-url-shortener-3f9c1a2b proposed url-shortener (4 predicates)

# 3. Approve the goal you want kazi to pursue:
kazi approve prop-url-shortener-3f9c1a2b
# APPROVED proposal=prop-url-shortener-3f9c1a2b goal=url-shortener
# The goal is now runnable: kazi apply --workspace
```

The clarify phase is a HYBRID (ADR-0019): a deterministic floor of gap-checks kazi
always runs (it insists on a live-verification target and a scope boundary) plus
questions your coding agent drafts for the specific idea. Scripting it? `--yes` (or any
non-TTY pipe) skips the questions and drafts best-effort; `--strict` refuses an
underspecified idea instead of guessing; `--adr` also writes an ADR-lite rationale doc
under `docs/adr/`.

`plan` / `approve` are the natural-language **front door** (an agent drafts, a human
approves — the only write path the dashboard shares too). The higher-level the idea, the
more predicates kazi drafts — and the more you'll want to curate them before approving,
because every predicate becomes a wall kazi won't declare "done" until it's objectively
true. Approving blesses the goal; to drive it, hand `kazi apply` a goal-file (next
section) — the same predicates, captured as a file you can version and re-run.

> More "build an app for X" ideas kazi can draft predicates for:
> - `kazi plan "create a paste-bin app with a create-paste API and a raw view"`
> - `kazi plan "build a webhook receiver that validates signatures and stores events"`
> - `kazi plan "create a REST API for a to-do list with the usual CRUD endpoints"`

---

## Quickstart 2 — write a tiny goal-file and ship it

A goal-file is a few lines of TOML. Here's one that says *"the unit tests pass AND the
deployed `/livez` endpoint returns `ok`"* — code **and** live production, in one
declaration:

```toml
# my-goal.toml
id = "health-green-and-live"
name = "health endpoint returns ok, tests green and live"

[budget]
max_iterations = 8 # hard ceilings — kazi can never loop forever
max_tokens = 500000

[scope]
workspace = "." # the repo kazi may edit
paths = ["main.go"]

# A CODE check: the project's tests must pass.
[[predicate]]
id = "tests"
provider = "test_runner"
description = "unit tests pass"
cmd = "go"
args = ["test", "./..."]

# A LIVE check: the *deployed* service must answer correctly. This is what makes
# convergence real — green-on-my-laptop is not enough.
[[predicate]]
id = "livez-live"
provider = "http_probe"
description = "deployed GET /livez returns 200 body \"ok\""
url = "https://your-service.run.app/livez"
expect_status = 200
expect_body = "ok"
body_match = "exact" # exact, not substring — "ok" is a substring of "not-ok"!
```

Run it:

```sh
mix kazi.apply my-goal.toml --workspace ./my-service
```

kazi prints each iteration and a final verdict, and exits `0` only on convergence:

```
kazi.loop goal=health-green-and-live iter=1 failing=["tests","livez-live"] → dispatch agent
kazi.loop goal=health-green-and-live iter=2 failing=["livez-live"] → integrate (PR #42)
kazi.loop goal=health-green-and-live iter=3 failing=["livez-live"] → deploy
kazi.loop goal=health-green-and-live iter=4 failing=[] → CONVERGED ✓
OUTCOME: :converged (tests pass · live /livez = "ok")
```

**Predicate providers** you can use today:

| `provider` | checks… | key config |
|----------------|---------|------------|
| `test_runner` | a command's exit code (unit/integration tests) | `cmd`, `args` |
| `http_probe` | a live URL's status + body, optionally **sustained** over N samples | `url`, `expect_status`, `expect_body`, `body_match`, `samples`, `interval_ms` |
| `browser` | a real browser flow (Playwright), optionally a **journey** over N runs | per-flow config, `samples` |
| `prod_log` | a production-log condition (e.g. 5xx rate) — a coarse safety net | per-check config |
| `metrics` | a live **RED/SLO** signal (PromQL): windowed quantile, error-rate, or **burn-rate** gate | `query_url`, `query`, `pass_when`, `quantile`, `burn_rate` |
| `custom_script`| ANY CLI checker (scanner, mutation tester, contract check) | `cmd`, `args`, `verdict`, `path`, `pass_when` |
| `ratchet` | a metric may not regress vs a baseline (coverage, perf, size) | `metric`, `baseline`, `direction`, `allowed_regression` |
| `static` | static analysis / type-check / lint (Dialyzer-led, SARIF-general) | `cmd`, `args`, `format`, `baseline`, `allowed_regression` |

`custom_script` is the **escape hatch**: it turns any command-line tool into a predicate
without a kazi release. Crucially the **verdict is declared, not assumed** — a SARIF/JSON
scanner that exits `0` *with* findings is gated on its parsed output, not its exit code
(the class of "the gate silently passed" bug, designed out). See
[`docs/custom-script-provider.md`](docs/custom-script-provider.md), `kazi schema custom_script`,
and the recipes in [`priv/examples/`](priv/examples/) (`custom_script_sarif.toml`,
`custom_script_junit.toml`, `custom_script_mutation.toml`). A fuller off-the-shelf
catalog — contract/schema compat, perf/size, secret scanning, a11y, IaC/container scan,
visual regression — plus the two evidence tiers and the per-tool exit-code gotchas is in
[`docs/custom-script-recipes.md`](docs/custom-script-recipes.md).

`ratchet` is the **no-regression** mode: a metric passes only while it stays within
`allowed_regression` of a `baseline`, read through `direction` (`higher_better` for
coverage/mutation score, `lower_better` for size/latency). The baseline is a fixed number,
the metric's own stored prior value (`"stored"` — seeded on the first run, tightened on
every pass), or a **git ref** (`"main"` — the metric recomputed at that ref). Coverage,
perf, and size are configs of this one mode. With `allowed_regression = 0` a metric "may
only improve." See [`docs/ratchet-predicate.md`](docs/ratchet-predicate.md),
`kazi schema ratchet`, and the recipes in [`priv/examples/`](priv/examples/)
(`ratchet_coverage.toml`, `ratchet_size.toml`).

`static` is the **analysis / type-check / lint** mode: the cheapest, most deterministic
check, run every iteration to catch defects on paths the tests never execute. It **leads
with Dialyzer** (kazi-native, zero false positives) and generalizes to the polyglot SARIF
tools (`tsc`, `mypy`, `golangci-lint`, Semgrep) via `format`. The verdict is gated on the
**parsed findings, not the exit code**, and a `baseline` turns it into a ratchet that
**fails only on NEW findings** (so pre-existing debt can only shrink, never block).
Findings surface as localized `file:line:col` evidence. See
[`docs/static-predicate.md`](docs/static-predicate.md), `kazi schema static`, and the
recipes in [`priv/examples/`](priv/examples/) (`static_dialyzer.toml`, `static_sarif.toml`).

The **live providers** (`http_probe` sustained-health, `browser` journeys, `metrics`,
`prod_log`) verify a *deployed* service. The discipline they enforce: **never converge on
a single sample** — `http_probe` and `browser` require N *consecutive* healthy samples
(the Kubernetes `failureThreshold` model), and a `metrics` burn-rate gate fires only when
both a long and a short window breach. Absent a metrics endpoint, `metrics` degrades to
*not applicable* (never a false pass). See [`docs/live-providers.md`](docs/live-providers.md)
and `kazi schema http_probe` / `kazi schema browser` / `kazi schema metrics`.

Add `guard = true` to a predicate to make it an **invariant** (e.g. "coverage must not
drop") — kazi blocks the "delete the failing test" shortcut.

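
The `ratchet` rule described above reduces to a comparison that flips with `direction`. The Python sketch below is one reading of it: `ratchet_passes` and `StoredBaseline` are hypothetical names, and treating `allowed_regression` as an absolute amount in the metric's own units is an assumption (the source does not say whether it is absolute or relative).

```python
from typing import Optional


def ratchet_passes(value: float, baseline: float,
                   direction: str, allowed_regression: float = 0.0) -> bool:
    """True while `value` stays within `allowed_regression` of `baseline`."""
    if direction == "higher_better":
        return value >= baseline - allowed_regression
    if direction == "lower_better":
        return value <= baseline + allowed_regression
    raise ValueError(f"unknown direction: {direction}")


class StoredBaseline:
    """Hypothetical "stored" baseline: seeded on first run, tightened on each pass."""

    def __init__(self, direction: str, allowed_regression: float = 0.0) -> None:
        self.direction = direction
        self.allowed_regression = allowed_regression
        self.baseline: Optional[float] = None

    def check(self, value: float) -> bool:
        if self.baseline is None:
            self.baseline = value  # first run seeds the baseline
            return True
        ok = ratchet_passes(value, self.baseline, self.direction, self.allowed_regression)
        if ok:
            # Tighten: keep the best value seen so far.
            pick = max if self.direction == "higher_better" else min
            self.baseline = pick(self.baseline, value)
        return ok
```

With `allowed_regression = 0` and a stored baseline, coverage of 80 then 82 passes twice and moves the bar to 82, after which 81 fails: the metric may only improve.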
---

## A real worked example: failing test → live production

This is kazi's own end-to-end proof (the **T0.12 dogfood**), and you can read it in
[`docs/devlog.md`](docs/devlog.md). The fixture in
[`fixtures/deploy-target/`](fixtures/deploy-target/) is a tiny Go web service whose
`/livez` endpoint returns `"not-ok"` and whose unit test therefore **fails on purpose**.
Given the goal *"tests pass AND deployed `/livez` returns ok"*, kazi:

1. **Observed** both checks failing, and refused to call it done.
2. **Dispatched** a `claude -p` agent, which made the one-line fix.
3. **Integrated** it: opened a PR and rebase-merged it to `main`.
4. **Deployed** the new build to Cloud Run.
5. **Re-checked** the live endpoint → `200 "ok"` → **converged**.

Crucially, through steps 1–4 the live check kept failing and kazi **stayed
non-converged**; it only reported success once the real, deployed endpoint was correct.
That's the whole point: done is observed, not asserted.

> Live deploys need a deploy target configured (the service / project / region and a
> deploy command). The fixture's setup (GCP roles, the Cloud Run quirks kazi
> discovered, and the goal-file) is documented in
> [`fixtures/deploy-target/README.md`](fixtures/deploy-target/README.md) and
> [`docs/lore.md`](docs/lore.md).

---

## Adopt an existing project

Already have a working repo? `kazi init ` reverse-engineers a starter goal-file, the
equivalent of `terraform import` for "what already works"
([ADR-0013](docs/adr/0013-adopt-reverse-engineer-goals.md)):

```sh
kazi init ./my-service --out my-service.goal.toml
```

It detects the stack from marker files (`go.mod` → `go test ./...`, `mix.exs` →
`mix test`, `package.json`'s test script, `pyproject.toml`/`setup.cfg` → `pytest`) and
writes one baseline goal-file with:

- a **`test_runner` predicate** naming the detected test command, so kazi holds the
  suite green;
- conservative **guards** (e.g. a coverage ratchet when a coverage tool is configured),
  never a guard it cannot evaluate;
- a **commented live-predicate TODO**: an `http_probe` scaffold for you to point at the
  real deployed endpoint. Live predicates are scaffolded, never guessed.

Detection is deterministic: the same repo always produces the same goal-file. Pass
`--enrich` (off by default) to have your coding agent propose live predicates from
discovered endpoints; the deterministic detection always stands. Review the goal-file,
fill in the live TODO, then `kazi apply` it.

Pass `--with-gist` to opt **this repo** into the Gist context store
([ADR-0045](docs/adr/0045-context-store-layer-gist-provider.md)), a budget-fitted
text-artifact memory that keeps each agent prompt small. It verifies `gist doctor`,
writes the project-local `.kazi/context.toml` naming the provider, registers the
`gist serve` MCP server in the repo's `.mcp.json`, and recommends setting
`KAZI_GIST_DSN` to a PostgreSQL DSN for cross-iteration persistence. It is
**project-local only** (it never mutates a global agent config) and requires the
[`gist`](https://github.com/sirerun/gist) binary on `PATH` (absent it, the command
reports the missing dep and the goal-file is still written):

```sh
kazi init ./my-service --with-gist
export KAZI_GIST_DSN="postgres://USER:PASS@HOST:5432/gist" # cross-call persistence
```

### Worked example

Run it against the Go fixture that ships with this repo:

```sh
kazi init fixtures/deploy-target --out my.goal.toml
```

It detects the `go.mod` and writes
[`priv/examples/adopt_deploy_target.goal.toml`](priv/examples/adopt_deploy_target.goal.toml):

```toml
id = "adopt-deploy-target"
name = "Adopted baseline for deploy-target"

[scope]
workspace = "fixtures/deploy-target"

[[predicate]]
id = "tests-pass"
provider = "test_runner"
description = "project test suite passes"
args = ["test", "./..."]
cmd = "go"

# ... a `tests-pass-baseline` guard, then a COMMENTED live-predicate TODO
# you uncomment and point at the real deployed endpoint.
```

The acceptance predicate names the detected `go test ./...`; the live predicate is left
as a commented scaffold for you to fill in. A hermetic end-to-end test pins this output,
so the example never drifts from what the tool produces.

---

## Watch it work

- **LiveView dashboard**: a goal board, live agent presence, the lease map, a live
  dependency-DAG "wave" view (`/dag`: groups by running / ready / blocked / converged,
  the `needs` edges, per-group convergence), and per-goal convergence history. Read-only
  inspection, decoupled from the loop
  ([ADR-0011](docs/adr/0011-slice3-operator-surfaces.md)).

---

## CLI reference

```
kazi init [--out ] [--enrich] [--with-mcp] [--with-gist]  # adopt a repo -> a goal-file (+ .mcp.json / context store)
kazi plan "" [--workspace ]            # draft predicates from plain English
kazi list-proposed [--status ]         # review drafts (proposed/approved/rejected)
kazi approve                           # bless a drafted goal
kazi reject                            # discard a draft
kazi apply --workspace                 # drive a goal to convergence
  [--env ]                             # target a deploy environment (staging/prod)
  [--standing]                         # run continuously (re-converge on drift)
kazi status                            # report a run's (or proposal's) current state
kazi context index                     # context store: index a heavy artifact
kazi context search "" [--budget N]    # budget-fitted recall (--provider gist)
kazi context stats                     # byte accounting (indexed/returned/saved)
kazi export --obsidian                 # write an Obsidian vault of the goal tree
kazi lint                              # advisory near-duplicate group-name warnings
kazi mcp                               # start the MCP server over stdio (ADR-0044)
kazi help [--json]                     # the command/flag surface (--json for machines)
kazi version                           # print the kazi version and exit
```

`kazi apply` exits `0` on convergence, non-zero otherwise, so it composes in CI/scripts.
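Because the only contract relied on here is the exit status, a script can gate on it without parsing any output. The sketch below is a minimal Python illustration, assuming nothing about kazi beyond "exit `0` means converged"; the helper name `converged` is hypothetical, and the stand-in commands exist only so the example runs without kazi installed.

```python
import subprocess
import sys
from typing import Sequence


def converged(argv: Sequence[str]) -> bool:
    """Run a command and treat exit status 0 as 'converged'."""
    return subprocess.run(list(argv)).returncode == 0


# Stand-in commands so the sketch runs without kazi installed.
ok = [sys.executable, "-c", "raise SystemExit(0)"]
failed = [sys.executable, "-c", "raise SystemExit(3)"]
print(converged(ok), converged(failed))  # prints "True False"
```

In a real pipeline the argument list would be the `kazi apply` invocation for your goal-file, with the arguments shown in the reference above.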
> **Drive kazi over MCP (preferred).** An MCP-speaking harness wires kazi as an MCP
> server and drives its self-describing `kazi_plan` / `kazi_approve` / `kazi_apply` /
> `kazi_status` tools, with no JSON-CLI shell-out. The canonical client config references
> the installed binary verb (`kazi init --with-mcp` writes exactly this `.mcp.json`):
>
> ```json
> { "mcpServers": { "kazi": { "command": "kazi", "args": ["mcp"] } } }
> ```

> **Read-model note.** The Mix task (`mix kazi.apply`) creates and migrates the SQLite
> read-model on startup, so every iteration is persisted. The standalone escript
> can't bundle the native SQLite NIF, so it runs without persistence (it still
> converges; it just won't record history).

---

## How it works (under the hood)

- **Positioning**: a harness-agnostic outer loop, never a harness
  ([ADR-0001](docs/adr/0001-positioning-outer-loop-reconciler.md)).
- **Goals**: machine-checkable predicate sets, evidence-backed
  ([ADR-0002](docs/adr/0002-goals-as-predicates.md)).
- **Runtime**: Elixir / OTP + Phoenix LiveView
  ([ADR-0003](docs/adr/0003-language-elixir-otp.md)); one supervised process per active
  goal.
- **Coordination**: NATS JetStream KV leases (revision-CAS + TTL) and graph-aware
  blast-radius partitioning
  ([ADR-0004](docs/adr/0004-coordination-substrate-nats-jetstream.md),
  [ADR-0006](docs/adr/0006-coordination-leases-and-graph-partitioning.md)).
- **Data split**: Git (code) · JetStream (coordination) · ETS (live state) · SQLite
  (read-model) ([ADR-0005](docs/adr/0005-data-layer-split.md)).
- **Harness & context**: stateless per iteration; kazi owns context as a thin,
  deterministic evidence projection plus a blast-radius orientation pack, never
  conversation memory
  ([ADR-0008](docs/adr/0008-harness-invocation-and-context.md),
  [ADR-0009](docs/adr/0009-prompt-construction-thin-evidence-projection.md),
  [ADR-0010](docs/adr/0010-context-injection-reexploration-mitigation.md)).
- **Intended vs. actual**: kazi imports intent from standard specs and prose, and
  surfaces drift / dead code via a surface-coverage meta-predicate
  ([ADR-0021](docs/adr/0021-intended-vs-actual-reconciliation.md)).
- **Any harness, self-taught**: kazi onboards any CLI coding harness through a profile
  conformance contract, and is a harness-friendly, agent-drivable CLI that teaches itself
  to harnesses via a skill, an MCP server, and machine-readable help
  ([ADR-0022](docs/adr/0022-harness-onboarding-conformance.md),
  [ADR-0023](docs/adr/0023-harness-friendly-agent-drivable-cli.md),
  [ADR-0024](docs/adr/0024-kazi-self-teaching-to-harnesses.md)).
- **Native scheduler & predicate-graph waves**: kazi owns parallelization with a native
  scheduler over a partitioned goal-set, running dependency-aware predicate-graph waves
  instead of leaning on an external pool
  ([ADR-0026](docs/adr/0026-kazi-under-apply-pool.md),
  [ADR-0027](docs/adr/0027-kazi-owns-parallelization-native-scheduler.md),
  [ADR-0028](docs/adr/0028-dependency-aware-partitioning-predicate-graph-waves.md)).
- **Agent-native surfaces**: the docs and website lead with the agent-driven on-ramp,
  the `kazi` skill is a router for code goals, and your coding agent (not a separate chat
  bridge) is the mobile interface
  ([ADR-0025](docs/adr/0025-docs-lead-with-agent-driven-onramp.md),
  [ADR-0030](docs/adr/0030-content-marketing-agent-native-positioning.md),
  [ADR-0031](docs/adr/0031-kazi-skill-router-subsumes-loop-apply-qualify.md)).
- **One verb set**: the CLI verbs are `kazi plan` (draft predicates from an idea) and
  `kazi apply` (drive a goal to convergence), unifying the human, skill, and CLI surfaces
  ([ADR-0032](docs/adr/0032-rename-cli-verbs-run-apply-propose-plan.md)).

Full narrative: [`docs/concept.md`](docs/concept.md). Decisions:
[`docs/adr/`](docs/adr/). Build plan: [`docs/plan.md`](docs/plan.md).
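The "revision-CAS + TTL" lease idea from the Coordination bullet can be illustrated with a small in-memory stand-in. This Python sketch does not use the NATS client API and is not kazi's code: `Lease`, `LeaseStore`, `acquire`, and `renew` are hypothetical names, and the clock is passed in explicitly so the behavior is deterministic.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Lease:
    owner: str
    revision: int      # bumped on every successful write
    expires_at: float  # time after which the lease is treated as free


class LeaseStore:
    """In-memory stand-in for a KV bucket with revisions and TTLs."""

    def __init__(self) -> None:
        self._leases: Dict[str, Lease] = {}
        self._revision = 0

    def _live(self, key: str, now: float) -> Optional[Lease]:
        lease = self._leases.get(key)
        return lease if lease is not None and lease.expires_at > now else None

    def _write(self, key: str, owner: str, ttl: float, now: float) -> Lease:
        self._revision += 1
        lease = Lease(owner, self._revision, now + ttl)
        self._leases[key] = lease
        return lease

    def acquire(self, key: str, owner: str, ttl: float, now: float) -> Optional[Lease]:
        """Take the lease only if nobody holds a live one."""
        if self._live(key, now) is not None:
            return None
        return self._write(key, owner, ttl, now)

    def renew(self, key: str, owner: str, expected_revision: int,
              ttl: float, now: float) -> Optional[Lease]:
        """Compare-and-swap: extend only if owner and revision still match."""
        lease = self._live(key, now)
        if lease is None or lease.owner != owner or lease.revision != expected_revision:
            return None
        return self._write(key, owner, ttl, now)
```

The revision check is what makes a stale holder lose: a renew that carries an old revision is rejected, and an expired lease can be taken over by another agent.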
---

## Status

Slices 0–3 are implemented and green (Elixir/OTP; ~700 hermetic ExUnit tests), and the
live idea → production loop is proven end-to-end (the T0.12 dogfood above). What works
today:

- **Convergence core**: the reconcile loop drives predicates to truth via a stateless
  agent harness plus integrate (branch → PR → rebase-merge) and deploy actions; evidence
  persisted to SQLite.
- **Trustworthy loops**: regression detection, flake quarantine, hard budgets,
  stuck-escalation, and a production-log predicate.
- **Creation mode**: kazi builds *new* features from failing acceptance predicates, not
  only repairs. From here on, kazi builds kazi.
- **Coordination & surfaces**: NATS leases + presence, graph partitioning,
  natural-language authoring, and a LiveView dashboard.
- **Context injection**: every stateless iteration starts *oriented* (a deterministic
  blast-radius pack + an optional, off-by-default retrieval adapter), without
  reintroducing conversation memory.

**By design, kazi will never**: become a coding agent/harness; decide *what* to build
(that's your judgment); or put a vector DB in the core loop (the retrieval adapter is an
optional augmentation, never the foundation).

## Learn more

New to driving coding agents to an objective "done"? The blog walks the same ladder we
climbed, one rung per post, from prompting by feel up to a reconciliation workflow. Each
post is independently useful, even if you never adopt kazi.

- **[From Vibe Coding to Reconciliation](https://kazi.sire.run/blog/from-vibe-coding-to-reconciliation)**: the twelve-part series.
- **[All posts](https://kazi.sire.run/blog)**: the kazi blog index.

## Community & help

Questions? Start a [GitHub Discussion](https://github.com/kazi-org/kazi/discussions) |
Read [`concept.md`](docs/concept.md) for the architecture.

## License

Licensed under the [Apache License, Version 2.0](LICENSE). See the [NOTICE](NOTICE) file
for attribution.

Copyright 2026 Sire Run, Inc.
---

Built by the team behind Sire.
