https://github.com/vimalyad/accura

An accuracy-first browser agent. TypeScript, Playwright, model-agnostic — develop on free models, run on Claude.
Host: GitHub
URL: https://github.com/vimalyad/accura
Owner: vimalyad
Created: 2026-06-22T17:22:38.000Z (7 days ago)
Default Branch: main
Last Pushed: 2026-06-22T18:18:22.000Z (7 days ago)
Last Synced: 2026-06-22T19:22:08.108Z (7 days ago)
Language: TypeScript
Homepage:
Size: 236 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
README

# Accura

An accuracy-first browser agent built with TypeScript and Playwright. It runs
on Claude (Anthropic) via external API calls.

Accura optimizes one metric: **task success rate**. Latency is explicitly not
a constraint, so the architecture spends time wherever it buys correctness:
it re-observes after every action, verifies every step, samples multiple
candidates at uncertain decisions, simulates irreversible actions before
running them, and refuses to declare success it cannot prove.
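
As a rough illustration of "samples multiple candidates at uncertain decisions", the sketch below draws a few candidates, removes duplicates, and asks an arbiter to choose. Everything in it (`Candidate`, `bestOfN`, the `sample` and `arbiter` callbacks) is a hypothetical stand-in, not Accura's actual API, and it is synchronous only to stay short; real model calls would be asynchronous.

```typescript
// Hypothetical types and names: a sketch of best-of-N, not the @accura/agent code.
interface Candidate {
  actions: string[];
}

function bestOfN(
  sample: () => Candidate,                   // stand-in for one executor call
  arbiter: (options: Candidate[]) => number, // stand-in: returns index of the chosen option
  n: number = 3,
): Candidate {
  // Draw n candidates and keep the first occurrence of each distinct action list.
  const unique = new Map<string, Candidate>();
  for (let i = 0; i < n; i++) {
    const candidate = sample();
    const key = JSON.stringify(candidate.actions);
    if (!unique.has(key)) unique.set(key, candidate);
  }
  const options = Array.from(unique.values());

  // If every sample agreed, there is nothing for the arbiter to decide.
  const first = options[0];
  if (options.length === 1 && first !== undefined) return first;

  const picked = options[arbiter(options)];
  if (picked === undefined) throw new Error("arbiter returned an invalid index");
  return picked;
}
```

Deduplicating before arbitration keeps the arbiter's prompt small and skips the extra call entirely when the samples are unanimous.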

## Quickstart

```sh
pnpm install
pnpm --filter @accura/browser exec playwright install chromium
pnpm build

# run a task on Claude (needs ANTHROPIC_API_KEY)
node apps/cli/dist/main.js "Find the price of the Super Widget" --url https://example.com --profile final

# run the eval suite
node apps/cli/dist/main.js eval packages/evals/suites/fixtures.json --profile final --seeds 3
```

Model keys come from your shell or from a local `.env` loaded with
`node --env-file=.env …`. The shipped profile is `configs/final.json`
(**Claude via external API calls, no local model hosting**): a Sonnet 4.6
executor (adaptive thinking), an Opus 4.8 planner and judge, and a Sonnet 4.6
extractor. It needs `ANTHROPIC_API_KEY`.

> The full self-hosted platform and local-model profiles live on the
> [`self-hosted`](https://github.com/vimalyad/accura/tree/self-hosted) branch.

## Architecture

Design rationale and the research behind every decision:
[ARCHITECTURE.md](./ARCHITECTURE.md).

### System overview

```mermaid
flowchart TD
    CLI["apps/cli<br/>accura run · accura eval"]

    subgraph orchestration["Orchestration"]
        AGENT["@accura/agent<br/>loop · planner · arbiter ·<br/>simulation gate · recovery · traces"]
        EVALS["@accura/evals<br/>fixture server · multi-seed runner ·<br/>bootstrap CIs · judge agreement"]
    end

    subgraph capabilities["Capabilities"]
        PERCEPTION["@accura/perception<br/>DOM walker · stable element ids ·<br/>new-element diff · observer"]
        ACTIONS["@accura/actions<br/>zod registry · 17 core actions ·<br/>batching with stale-DOM guards"]
        VERIFY["@accura/verify<br/>state diff · grounding check ·<br/>trajectory judge"]
        MEMORY["@accura/memory<br/>skill store · induction ·<br/>deterministic replay"]
        LLM["@accura/llm<br/>anthropic + openai-compatible ·<br/>structured output · model router"]
    end

    subgraph foundation["Foundation"]
        BROWSER["@accura/browser<br/>playwright session · stability gate ·<br/>screenshots · watchdogs · CDP hatch"]
        SHARED["@accura/shared<br/>Result · errors · logger · profiles"]
    end

    CLI --> AGENT
    CLI --> EVALS
    EVALS --> AGENT
    AGENT --> PERCEPTION
    AGENT --> ACTIONS
    AGENT --> VERIFY
    AGENT --> MEMORY
    AGENT --> LLM
    PERCEPTION --> BROWSER
    ACTIONS --> BROWSER
    ACTIONS --> PERCEPTION
    VERIFY --> LLM
    MEMORY --> ACTIONS
    BROWSER --> SHARED
    LLM --> SHARED
```

### One agent step, end to end

```mermaid
flowchart TD
    START([task]) --> SETUP["judge derives key points<br/>planner creates checklist<br/>memory: matching skills injected,<br/>best skill replayed deterministically"]
    SETUP --> GATE

    subgraph step["every step"]
        GATE["stability gate:<br/>domcontentloaded → network quiet →<br/>two zero-mutation windows"]
        GATE --> OBSERVE["perceive: enumerated elements<br/>[id]<tag> with *new-element marks,<br/>page text, scroll hints, warnings"]
        OBSERVE --> NOTES["verifier notes: what changed ·<br/>contradiction check ·<br/>FORBIDDEN / STUCK advice"]
        NOTES --> EXEC["executor → structured output<br/>{eval, memory, goal, actions 1..3}"]
        EXEC --> FLAGGED{flagged<br/>decision?}
        FLAGGED -- "uncertain / contradiction" --> BON["sample 3 candidates →<br/>dedup → arbiter picks"]
        FLAGGED -- no --> IRREV
        BON --> IRREV{irreversible<br/>action?}
        IRREV -- yes --> SIM["simulate outcome"]
        SIM -- mismatch --> BLOCK["block action +<br/>force replan"]
        SIM -- ok --> RUN
        IRREV -- no --> RUN["execute batch:<br/>ids → live elements ·<br/>stale-DOM guards ·<br/>recovery hard-blocks repeats"]
        BLOCK --> RECORD["record step + trace"]
        RUN --> RECORD
    end

    RECORD --> DONE{done<br/>declared?}
    DONE -- no --> GATE
    DONE -- "success=false" --> HONEST([honest failure returned])
    DONE -- "success=true" --> GROUND{grounding:<br/>every claimed value<br/>exists in observations?}
    GROUND -- no --> REJECT["rejection reason injected<br/>(max 2, then honest failure)"]
    REJECT --> GATE
    GROUND -- yes --> JUDGE{trajectory judge:<br/>all key points<br/>demonstrably met?}
    JUDGE -- no --> REJECT
    JUDGE -- yes --> WIN([success])
    WIN --> INDUCE["skill induced → memory →<br/>next run replays it"]
```
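
The `done` branch of the diagram can be read as a small decision function. The sketch below is one possible reading, assuming substring matching for grounding and treating "max 2" as two injected rejections before the run ends as an honest failure. The names (`doneGate`, `DoneClaim`, `GateResult`, `judgeApproves`) are hypothetical and do not come from the Accura codebase.

```typescript
// Hypothetical sketch of the two-layer done gate; not the @accura/verify implementation.
type GateResult =
  | { kind: "success" }
  | { kind: "rejected"; reason: string }       // reason is injected and the loop continues
  | { kind: "honest-failure"; reason: string };

interface DoneClaim {
  success: boolean;
  values: string[]; // values the agent claims to have found
}

const MAX_REJECTIONS = 2; // assumption: reading of "max 2, then honest failure"

function doneGate(
  claim: DoneClaim,
  observations: string[],       // text the agent actually observed during the run
  judgeApproves: () => boolean, // stand-in for the LLM trajectory judge
  priorRejections: number,
): GateResult {
  if (!claim.success) {
    return { kind: "honest-failure", reason: "agent reported failure" };
  }

  // Layer 1: code-level grounding. Every claimed value must appear in some observation.
  const missing = claim.values.filter(
    (value) => !observations.some((text) => text.indexOf(value) !== -1),
  );

  let reason: string | null = null;
  if (missing.length > 0) {
    reason = "ungrounded values: " + missing.join(", ");
  } else if (!judgeApproves()) {
    // Layer 2: the judge runs only after grounding has passed.
    reason = "judge: key points not demonstrably met";
  }

  if (reason === null) return { kind: "success" };
  return priorRejections >= MAX_REJECTIONS
    ? { kind: "honest-failure", reason }
    : { kind: "rejected", reason };
}
```

Running the cheap deterministic check first means the judge model is only consulted for claims that are at least textually supported.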

### Model roles per profile

```mermaid
flowchart LR
    subgraph roles["Agent roles"]
        E[executor]
        P[planner]
        J[judge / arbiter]
        X[extractor]
        S[skill-inductor]
    end

    subgraph final["configs/final.json (Claude)"]
        SONNET["Sonnet 4.6<br/>adaptive thinking · effort high ·<br/>clickAt enabled"]
        OPUS["Opus 4.8"]
    end

    E ==> SONNET
    X ==> SONNET
    S ==> SONNET
    P ==> OPUS
    J ==> OPUS
```

Capability flags degrade gracefully: a non-vision executor gets DOM-only
observations; only coordinate-grounded models (Claude) get the `clickAt`
fallback action.
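
A minimal sketch of how such flags could gate what a model sees and may do. The flag names (`vision`, `coordinateGrounded`), the `Observation` shape, and the three-action subset are assumptions for illustration; the real profile schema and the 17 core actions are not shown in this README.

```typescript
// Hypothetical capability flags; the actual profile schema may differ.
interface Capabilities {
  vision: boolean;             // can the model consume screenshots?
  coordinateGrounded: boolean; // can it be trusted to click by x/y coordinates?
}

interface Observation {
  elements: string[];  // enumerated DOM elements, always present
  screenshot?: string; // only attached for vision-capable models
}

// Illustrative subset only, not the real action registry.
const CORE_ACTIONS: string[] = ["click", "type", "scroll"];

// Only coordinate-grounded models are offered the clickAt fallback.
function actionsFor(caps: Capabilities): string[] {
  return caps.coordinateGrounded ? CORE_ACTIONS.concat("clickAt") : CORE_ACTIONS.slice();
}

// Non-vision models receive a DOM-only observation.
function observationFor(
  caps: Capabilities,
  full: { elements: string[]; screenshot: string },
): Observation {
  return caps.vision ? full : { elements: full.elements };
}
```

Degrading by omission (dropping the screenshot or the action) keeps the rest of the loop identical across models.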

### The five accuracy mechanisms

1. **Clean enumerated action space** (`perception`): the model picks from
   stable indexed element ids and never invents selectors. The single
   highest-leverage change in the published evidence (AgentOccam, +26.6 pts).
2. **Verification everywhere** (`verify`): a deterministic state diff after
   every step, a "your actions succeeded but nothing changed" contradiction
   check, and a two-layer `done` gate: code-level grounding of claimed values,
   then a skeptical key-point judge. Attacks the #1 measured failure mode:
   confident false success.
3. **Hard recovery rules** (`agent`): an identical action that failed twice
   is blocked in code, not just prompted away; stuck-detection forces a
   strategy change.
4. **Test-time spending** (`agent`): best-of-3 with an arbiter at flagged
   decisions only; outcome simulation before irreversible actions. Latency is
   the currency, accuracy the purchase.
5. **Compounding memory** (`memory`): verified successes are distilled into
   text-grounded recipes; later runs replay them deterministically and fall
   back to the live executor at the first mismatch (AWM/SkillWeaver, +31–51%
   relative).
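
To make mechanism 3 concrete, here is a minimal sketch of a rule that blocks an identical action in code once it has failed twice. The `Action` shape and the `RecoveryPolicy` class are hypothetical; how Accura defines "identical" is not stated here, so the sketch assumes equality of action name, element id, and text.

```typescript
// Hypothetical sketch of a hard recovery rule; not the @accura/agent implementation.
interface Action {
  name: string;
  elementId?: number;
  text?: string;
}

class RecoveryPolicy {
  private readonly failures = new Map<string, number>();
  private readonly maxFailures: number;

  constructor(maxFailures: number = 2) {
    this.maxFailures = maxFailures;
  }

  // Assumption: two actions are identical when name, element id, and text all match.
  private key(action: Action): string {
    return JSON.stringify([action.name, action.elementId ?? null, action.text ?? null]);
  }

  recordFailure(action: Action): void {
    const k = this.key(action);
    this.failures.set(k, (this.failures.get(k) ?? 0) + 1);
  }

  // Checked before execution, so a blocked action never reaches the browser.
  isBlocked(action: Action): boolean {
    return (this.failures.get(this.key(action)) ?? 0) >= this.maxFailures;
  }
}
```

Because the check lives outside the prompt, the model cannot talk itself into a third identical attempt.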

Everything is measured by `evals` (multi-seed runs, bootstrap 95% CIs,
judge-agreement tracking): no accuracy claim without numbers.
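
For readers unfamiliar with bootstrap intervals, the sketch below computes a percentile bootstrap 95% CI for a success rate over pass/fail run outcomes. It is a generic textbook version, not the `@accura/evals` code; the function name, the default of 2000 resamples, and the injectable `rng` are assumptions.

```typescript
// Generic percentile bootstrap; hypothetical helper, not the @accura/evals implementation.
function bootstrapCI(
  outcomes: boolean[],               // one entry per run (task x seed): true = success
  resamples: number = 2000,          // assumption: number of bootstrap resamples
  rng: () => number = Math.random,   // injectable so results can be made reproducible
): { mean: number; lo: number; hi: number } {
  const n = outcomes.length;
  if (n === 0) throw new Error("need at least one outcome");

  // Resample n outcomes with replacement and record each resample's success rate.
  const rates: number[] = [];
  for (let r = 0; r < resamples; r++) {
    let wins = 0;
    for (let i = 0; i < n; i++) {
      if (outcomes[Math.floor(rng() * n)]) wins++;
    }
    rates.push(wins / n);
  }
  rates.sort((a, b) => a - b);

  // Read the 2.5th and 97.5th percentiles of the resampled rates.
  const at = (q: number): number =>
    rates[Math.min(resamples - 1, Math.floor(q * resamples))] as number;

  const mean = outcomes.filter(Boolean).length / n;
  return { mean, lo: at(0.025), hi: at(0.975) };
}
```

With only a handful of seeds the interval will be wide, which is the point: it shows how much of a measured difference could be noise.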

## Packages

| Package | What it does |
|---|---|
| `@accura/shared` | Result type, errors, logging, zod-validated model profiles |
| `@accura/llm` | Provider-agnostic ChatModel (Anthropic SDK + any OpenAI-compatible endpoint), structured output with repair reprompts, role-based model router |
| `@accura/browser` | Playwright session: stability gate, exact-dimension screenshots, popup/dialog/download/crash watchdogs, CDP escape hatch |
| `@accura/perception` | In-page walker → enumerated interactive elements with stable ids, new-element diffing, id→element resolution |
| `@accura/actions` | Zod-validated action registry, 17 core actions (incl. `doubleClick`), multi-action batching with stale-DOM guards |
| `@accura/verify` | State-diff step verifier, deterministic data-grounding check, skeptical trajectory judge |
| `@accura/agent` | The loop: planner, best-of-N arbiter, simulation gate, recovery policy, done gating, JSONL traces |
| `@accura/memory` | Cross-run skills: induction from verified successes, deterministic replay with live fallback, scoring/retirement |
| `@accura/evals` | Task suites, multi-seed runner, bootstrap CIs, judge-agreement harness, failure clustering |
| `apps/cli` | `accura "<task>"` and `accura eval <suite>` |
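
As an illustration of the `@accura/perception` row, the sketch below renders an enumerated observation in the `[id]<tag>` form with `*` marking elements that were absent from the previous observation, and resolves a chosen id back to its element. The `PageElement` shape and both function names are hypothetical, and how stable ids are assigned is not covered; the ids are taken as given.

```typescript
// Hypothetical sketch of enumerated observations; not the @accura/perception API.
interface PageElement {
  id: number;    // stable id, assumed to be assigned elsewhere
  tag: string;
  label: string;
}

// One line per element: "*" prefix for elements not seen in the previous observation.
function renderObservation(current: PageElement[], previousIds: ReadonlySet<number>): string {
  return current
    .map((el) => `${previousIds.has(el.id) ? "" : "*"}[${el.id}]<${el.tag}> ${el.label}`)
    .join("\n");
}

// The model may only act on listed ids; anything else is rejected rather than guessed.
function resolveElement(current: PageElement[], id: number): PageElement {
  const match = current.find((el) => el.id === id);
  if (match === undefined) throw new Error(`unknown element id ${id}`);
  return match;
}
```

On the first observation of a page the previous set is empty, so under this sketch every element would carry the `*` mark.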

## Status

The agent and CLI are implemented and tested with unit tests,
browser-integration tests against real Chromium, and full-pipeline end-to-end
runs. Verified end-to-end on the `final` (Claude) profile.

## Development

```sh
pnpm build      # turbo build across the workspace
pnpm test       # unit + browser integration tests
pnpm lint       # eslint
pnpm typecheck  # tsc --noEmit
```

One branch per phase, merged to `main` after its exit criteria pass; see git
history.