https://github.com/vimalyad/accura
An accuracy-first browser agent. TypeScript, Playwright, model-agnostic — develop on free models, run on Claude.
https://github.com/vimalyad/accura
Last synced: 6 days ago
JSON representation
An accuracy-first browser agent. TypeScript, Playwright, model-agnostic — develop on free models, run on Claude.
- Host: GitHub
- URL: https://github.com/vimalyad/accura
- Owner: vimalyad
- Created: 2026-06-22T17:22:38.000Z (7 days ago)
- Default Branch: main
- Last Pushed: 2026-06-22T18:18:22.000Z (7 days ago)
- Last Synced: 2026-06-22T19:22:08.108Z (7 days ago)
- Language: TypeScript
- Homepage:
- Size: 236 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Accura
An accuracy-first browser agent. TypeScript, Playwright — runs on Claude
(Anthropic) via external API calls.
Accura optimizes one metric: **task success rate**. Latency is explicitly not
a constraint, so the architecture spends time wherever it buys correctness:
it re-observes after every action, verifies every step, samples multiple
candidates at uncertain decisions, simulates irreversible actions before
running them, and refuses to declare success it cannot prove.
## Quickstart
```sh
pnpm install
pnpm --filter @accura/browser exec playwright install chromium
pnpm build
# run a task on Claude (needs ANTHROPIC_API_KEY)
node apps/cli/dist/main.js "Find the price of the Super Widget" --url https://example.com --profile final
# run the eval suite
node apps/cli/dist/main.js eval packages/evals/suites/fixtures.json --profile final --seeds 3
```
Model keys come from your shell, or a local `.env` loaded with
`node --env-file=.env …`. The shipped profile is `configs/final.json` — **Claude
via external API calls, no local model hosting**: a Sonnet 4.6 executor (adaptive
thinking), an Opus 4.8 planner and judge, and a Sonnet 4.6 extractor. Needs
`ANTHROPIC_API_KEY`.
> The full self-hosted platform and local-model profiles live on the
> [`self-hosted`](https://github.com/vimalyad/accura/tree/self-hosted) branch.
## Architecture
Design rationale and the research behind every decision:
[ARCHITECTURE.md](./ARCHITECTURE.md).
### System overview
```mermaid
flowchart TD
CLI["apps/cli
accura run · accura eval"]
subgraph orchestration["Orchestration"]
AGENT["@accura/agent
loop · planner · arbiter ·
simulation gate · recovery · traces"]
EVALS["@accura/evals
fixture server · multi-seed runner ·
bootstrap CIs · judge agreement"]
end
subgraph capabilities["Capabilities"]
PERCEPTION["@accura/perception
DOM walker · stable element ids ·
new-element diff · observer"]
ACTIONS["@accura/actions
zod registry · 17 core actions ·
batching with stale-DOM guards"]
VERIFY["@accura/verify
state diff · grounding check ·
trajectory judge"]
MEMORY["@accura/memory
skill store · induction ·
deterministic replay"]
LLM["@accura/llm
anthropic + openai-compatible ·
structured output · model router"]
end
subgraph foundation["Foundation"]
BROWSER["@accura/browser
playwright session · stability gate ·
screenshots · watchdogs · CDP hatch"]
SHARED["@accura/shared
Result · errors · logger · profiles"]
end
CLI --> AGENT
CLI --> EVALS
EVALS --> AGENT
AGENT --> PERCEPTION
AGENT --> ACTIONS
AGENT --> VERIFY
AGENT --> MEMORY
AGENT --> LLM
PERCEPTION --> BROWSER
ACTIONS --> BROWSER
ACTIONS --> PERCEPTION
VERIFY --> LLM
MEMORY --> ACTIONS
BROWSER --> SHARED
LLM --> SHARED
```
### One agent step, end to end
```mermaid
flowchart TD
START([task]) --> SETUP["judge derives key points
planner creates checklist
memory: matching skills injected,
best skill replayed deterministically"]
SETUP --> GATE
subgraph step["every step"]
GATE["stability gate:
domcontentloaded → network quiet →
two zero-mutation windows"]
GATE --> OBSERVE["perceive: enumerated elements
[id]<tag> with *new-element marks,
page text, scroll hints, warnings"]
OBSERVE --> NOTES["verifier notes: what changed ·
contradiction check ·
FORBIDDEN / STUCK advice"]
NOTES --> EXEC["executor → structured output
{eval, memory, goal, actions 1..3}"]
EXEC --> FLAGGED{flagged
decision?}
FLAGGED -- "uncertain / contradiction" --> BON["sample 3 candidates →
dedup → arbiter picks"]
FLAGGED -- no --> IRREV
BON --> IRREV{irreversible
action?}
IRREV -- yes --> SIM["simulate outcome"]
SIM -- mismatch --> BLOCK["block action +
force replan"]
SIM -- ok --> RUN
IRREV -- no --> RUN["execute batch:
ids → live elements ·
stale-DOM guards ·
recovery hard-blocks repeats"]
BLOCK --> RECORD["record step + trace"]
RUN --> RECORD
end
RECORD --> DONE{done
declared?}
DONE -- no --> GATE
DONE -- "success=false" --> HONEST([honest failure returned])
DONE -- "success=true" --> GROUND{grounding:
every claimed value
exists in observations?}
GROUND -- no --> REJECT["rejection reason injected
(max 2, then honest failure)"]
REJECT --> GATE
GROUND -- yes --> JUDGE{trajectory judge:
all key points
demonstrably met?}
JUDGE -- no --> REJECT
JUDGE -- yes --> WIN([success])
WIN --> INDUCE["skill induced → memory →
next run replays it"]
```
### Model roles per profile
```mermaid
flowchart LR
subgraph roles["Agent roles"]
E[executor]
P[planner]
J[judge / arbiter]
X[extractor]
S[skill-inductor]
end
subgraph final["configs/final.json — Claude"]
SONNET["Sonnet 4.6
adaptive thinking · effort high ·
clickAt enabled"]
OPUS["Opus 4.8"]
end
E ==> SONNET
X ==> SONNET
S ==> SONNET
P ==> OPUS
J ==> OPUS
```
Capability flags degrade gracefully: a non-vision executor gets DOM-only
observations; only coordinate-grounded models (Claude) get the `clickAt`
fallback action.
### The five accuracy mechanisms
1. **Clean enumerated action space** (`perception`) — the model picks from
stable indexed element ids and never invents selectors. The single
highest-leverage change in the published evidence (AgentOccam, +26.6 pts).
2. **Verification everywhere** (`verify`) — a deterministic state diff after
every step, a "your actions succeeded but nothing changed" contradiction
check, and a two-layer `done` gate: code-level grounding of claimed values,
then a skeptical key-point judge. Attacks the #1 measured failure mode:
confident false success.
3. **Hard recovery rules** (`agent`) — an identical action that failed twice
is blocked in code, not just prompted away; stuck-detection forces a
strategy change.
4. **Test-time spending** (`agent`) — best-of-3 with an arbiter at flagged
decisions only; outcome simulation before irreversible actions. Latency is
the currency, accuracy the purchase.
5. **Compounding memory** (`memory`) — verified successes are distilled into
text-grounded recipes; later runs replay them deterministically and fall
back to the live executor at the first mismatch (AWM/SkillWeaver, +31–51%
relative).
Everything is measured by `evals` (multi-seed runs, bootstrap 95% CIs,
judge-agreement tracking) — no accuracy claim without numbers.
## Packages
| Package | What it does |
|---|---|
| `@accura/shared` | Result type, errors, logging, zod-validated model profiles |
| `@accura/llm` | Provider-agnostic ChatModel (Anthropic SDK + any OpenAI-compatible endpoint), structured output with repair reprompts, role-based model router |
| `@accura/browser` | Playwright session: stability gate, exact-dimension screenshots, popup/dialog/download/crash watchdogs, CDP escape hatch |
| `@accura/perception` | In-page walker → enumerated interactive elements with stable ids, new-element diffing, id→element resolution |
| `@accura/actions` | Zod-validated action registry, 17 core actions (incl. `doubleClick`), multi-action batching with stale-DOM guards |
| `@accura/verify` | State-diff step verifier, deterministic data-grounding check, skeptical trajectory judge |
| `@accura/agent` | The loop: planner, best-of-N arbiter, simulation gate, recovery policy, done gating, JSONL traces |
| `@accura/memory` | Cross-run skills: induction from verified successes, deterministic replay with live fallback, scoring/retirement |
| `@accura/evals` | Task suites, multi-seed runner, bootstrap CIs, judge-agreement harness, failure clustering |
| `apps/cli` | `accura ""` and `accura eval ` |
## Status
The agent and CLI are implemented and tested — unit tests plus
browser-integration tests against real Chromium and full-pipeline end-to-end
runs. Verified end-to-end on the `final` (Claude) profile.
## Development
```sh
pnpm build # turbo build across the workspace
pnpm test # unit + browser integration tests
pnpm lint # eslint
pnpm typecheck # tsc --noEmit
```
One branch per phase, merged to `main` after its exit criteria pass; see git
history.