{"id":49223217,"url":"https://github.com/razroo/state-trace","last_synced_at":"2026-04-24T05:04:06.539Z","repository":{"id":353382346,"uuid":"1218560810","full_name":"razroo/state-trace","owner":"razroo","description":null,"archived":false,"fork":false,"pushed_at":"2026-04-23T16:15:07.000Z","size":109,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-23T18:18:22.941Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/razroo.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-23T02:02:05.000Z","updated_at":"2026-04-23T16:15:13.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/razroo/state-trace","commit_stats":null,"previous_names":["razroo/state-trace"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/razroo/state-trace","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/razroo%2Fstate-trace","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/razroo%2Fstate-trace/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/razroo%2Fstate-trace/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/razroo%2Fstate-trace/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/razroo","download_url":"https://codeload.github.com/razroo/state-trace/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/razroo%2Fstate-trace/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32209897,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-24T03:15:14.334Z","status":"ssl_error","status_checked_at":"2026-04-24T03:15:11.608Z","response_time":64,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-04-24T05:03:56.084Z","updated_at":"2026-04-24T05:04:06.530Z","avatar_url":"https://github.com/razroo.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# state-trace\n\n\u003e Graph-native working memory for coding agents: typed memories, causal retrieval, bounded capacity, and compact briefs for small models.\n\n`state-trace` is a bounded working-memory layer for coding and debugging agents that need the right file, failure, and next action under tight token budgets. It is not a replacement for a general-purpose temporal knowledge graph like Graphiti — see [ARCHITECTURE.md](./ARCHITECTURE.md) for the honest comparison.\n\nWhat it is optimized for:\n\n- artifact-first retrieval for coding agents\n- current-vs-stale task state (`engine.current_state()`, `engine.failed_hypotheses()`)\n- compact harness-facing briefs for smaller models\n- online agent loops and post-hoc trajectory ingestion\n- bounded memory with decay, compression, and lifecycle retention\n- MCP-mountable, local-first deployment\n\n## Headline: SWE-bench-Verified localization — n=500\n\nThe credibility benchmark. Cold-start artifact localization on the full SWE-bench-Verified test split: given only the GitHub issue text and hints (no trajectory), rank the correct patch file at 1 and at 5.\n\n```bash\npip install -e \".[bench]\"\npython3 examples/swebench_verified_eval.py --limit 500 --backends no_memory bm25 state_trace graphiti\n```\n\n\u003c!-- BENCHMARK:SWEBENCH_N500:START --\u003e\n| backend | n | Artifact@1 | Artifact@1 CI | Artifact@5 | Artifact@5 CI | AvgLatencyMs |\n| --- | ---: | ---: | :---: | ---: | :---: | ---: |\n| no_memory | 500 | 0.000 | [0.000, 0.000] | 0.000 | [0.000, 0.000] | 0.01 |\n| bm25 | 500 | 0.176 | [0.144, 0.208] | 0.300 | [0.262, 0.338] | 0.19 |\n| **state_trace** | 500 | **0.216** | [0.182, 0.252] | **0.322** | [0.284, 0.362] | 27.43 |\n| graphiti | 500 | 0.098 | [0.072, 0.126] | 0.216 | [0.182, 0.254] | 5427.39 |\n\u003c!-- BENCHMARK:SWEBENCH_N500:END --\u003e\n\nWhat this says, plainly:\n\n- **state_trace leads on both Artifact@1 and Artifact@5 across every baseline.**\n- **vs. Graphiti:** non-overlapping 95% CIs on both metrics (0.216 vs 0.098 on A@1; 0.322 vs 0.216 on A@5). On the same input with the same deterministic embedder/reranker stub, the typed coding-agent ontology plus cold-start lexical fallback localizes the right file and puts it in the top 5 meaningfully more often.\n- **vs. BM25:** a real but narrower lead. A@1 0.216 vs 0.176 — 95% CIs just barely overlap (BM25 upper bound 0.208, state_trace lower bound 0.182), so it's a consistent directional win but not a statistical blowout. A@5 0.322 vs 0.300 — CIs overlap substantially, call it a tie with state_trace nosing ahead. The practical takeaway: state_trace's coding-agent ontology matches BM25's simple lexical coverage on cold-start *and* beats it when a trajectory is available (see [BENCHMARKS.md](./BENCHMARKS.md)).\n- **Latency:** state_trace retrieves in ~27ms vs BM25's ~0.2ms vs Graphiti's ~5,400ms. For per-action memory lookups in an agent loop, the ~200× delta over Graphiti compounds meaningfully over a long session.\n\nThe A@5 ≡ A@1 collapse that appeared in v0.2.0 is fixed in v0.2.1 via a lexical file-path fallback in `retrieve_brief` (pulls candidates from the query + top-scored node `issue_text` metadata when the graph has fewer than 5 file nodes, including paths embedded in GitHub blob URLs).\n\n### Caveats\n\n- Graphiti is run with a deterministic hash-embedder and BM25 + cosine + BFS → RRF reranker (no LLM entity extraction). That's the same simplification `graphiti_head_to_head_eval.py` uses for reproducibility without API keys. A full Graphiti pipeline with GPT-4-class extraction might close some of the gap, at materially higher cost per ingest.\n- Cold-start localization from issue text is only one axis. Trajectory-informed retrieval (BENCHMARKS.md) is where state_trace's larger advantage lives.\n\n## What makes the architecture different\n\nTyped coding-agent ontology, not generic Entity/Edge:\n\n- **Nodes:** `task`, `observation`, `decision`, `file`, `goal`, `session`, `command`, `test`, `symbol`, `patch_hunk`, `error_signature`, `episode`\n- **Edges:** `patches_file`, `fails_in`, `verified_by`, `rejected_by`, `supersedes`, `contradicts`, `solves`, `derived_from`, `precedes`, `motivates`, and more\n- **Intent routing:** the retrieval scorer re-prioritizes edge types per query intent (`locate_file`, `failure_analysis`, `history`, `general`).\n\nBounded working memory as a first-class constraint:\n\n- `enforce_capacity()` runs decay, compression, and summarization on every step.\n- `current_state(session)` answers \"what's live right now\" directly — cheap for state-trace, expensive for a general-purpose knowledge graph.\n- `failed_hypotheses(session)` returns invalidated, superseded, or unrecovered-error nodes — the \"don't propose this again\" signal.\n\nLocal-first, MCP-mountable:\n\n- Hot graph is an in-process `networkx.MultiDiGraph`. Cold storage is WAL SQLite+FTS5.\n- `state-trace-mcp` is a stdio MCP server you can mount in Claude Code / Cursor / Codex CLI.\n\nSee [ARCHITECTURE.md](./ARCHITECTURE.md) for why these choices matter vs. Graphiti, and [BENCHMARKS.md](./BENCHMARKS.md) for the smaller repo-local benchmarks.\n\n## vs. Graphiti\n\nGraphiti is the stronger general-purpose temporal knowledge graph for AI agents. `state-trace` is narrower: working memory for one coding/debugging session at a time. We're not claiming to replace Graphiti — we're claiming a specific lane where the tradeoffs land differently.\n\nEach row below is a concrete, measured axis, not a vibe.\n\n| Axis | state-trace | Graphiti | Winner for coding agents |\n| --- | --- | --- | --- |\n| **Artifact@1** on SWE-bench-Verified, n=500 | **0.216** [0.182, 0.252] | 0.098 [0.072, 0.126] | **state-trace** — non-overlapping 95% CIs |\n| **Artifact@5** on SWE-bench-Verified, n=500 | **0.322** [0.284, 0.362] | 0.216 [0.182, 0.254] | **state-trace** — non-overlapping 95% CIs |\n| **Per-retrieval latency** (same benchmark) | **27 ms** | 5,427 ms | **state-trace** — ~200× faster |\n| **Write path per agent step** | Typed insert, zero LLM calls | `add_episode` → LLM entity extraction each step | **state-trace** — cheaper, deterministic, no API key |\n| **Default deploy** | Pure Python + local SQLite/JSON; `state-trace-mcp` stdio binary | Neo4j / Kuzu / FalkorDB graph DB + embedder + LLM | **state-trace** — local-first, no external services |\n| **Coding-agent ontology** | Typed: `file`, `patch_hunk`, `error_signature`, `test`, `command`, `symbol`, `observation`, `decision`, `task`, `goal`, `session`, `episode` | Generic `EntityNode` / `EntityEdge` / `EpisodicNode` | **state-trace** — retrieval scorer routes on these types |\n| **\"What's true right now in this session?\"** | `engine.current_state(session)` — direct O(graph) query | Inferred from temporal facts via Cypher or LLM | **state-trace** — first-class API |\n| **\"What have I already tried and rejected?\"** | `engine.failed_hypotheses(session)` — direct query returning `invalid_at` + superseded + unrecovered-error nodes | Has to be inferred from `invalid_at` + contradictions | **state-trace** — first-class API |\n| **Working-memory capacity bound** | `enforce_capacity` with decay + compression + lifecycle retention. Long-horizon pressure benchmark: Artifact@1 0.771 *while* staying within a 96-unit budget 100% of the time | Unbounded by design; relies on the graph DB to scale | **state-trace** for long debugging sessions that need a memory ceiling |\n| **Small-model brief** | `retrieve_brief` produces ~220-token structured brief (`patch_file`, `rerun_command`, `tests_to_rerun`, `failed_attempts`, `recommended_actions`, …) that fits a tight budget | Returns raw nodes/facts; caller compresses | **state-trace** — built for small-model harnesses |\n| **MCP-mountable** | `state-trace-mcp` stdio server in the `[mcp]` extra — 11 tools exposed, drop into `~/.claude/settings.json` | No official MCP server; library-first | **state-trace** — plug straight into Claude Code / Cursor / Codex / opencode |\n| **Long-lived temporal knowledge across weeks** | Scoped to a session or repo namespace; no cross-namespace fact merging | First-class; bi-temporal validity, contradiction resolution, fact supersession across episodes | **Graphiti** |\n| **Multi-tenant SaaS scale** | Single-writer process model; authoritative graph is in-process networkx | Built for it on Neo4j/Kuzu substrate | **Graphiti** |\n| **Cross-session learning about users / orgs / policies** | Out of scope | First-class | **Graphiti** |\n\n### When to pick which\n\nUse **state-trace** when:\n\n- Your agent is editing code in a single debugging or refactoring session.\n- You talk to an MCP client (Claude Code, Cursor, Codex CLI, opencode) and want working memory without standing up a graph DB.\n- Per-action latency matters — you're calling memory on every tool invocation in an agent loop.\n- You run on small models where a 220-token structured brief beats a 1,000-token raw dump.\n- You need \"what file should I patch / what did I already try\" to be a direct query, not inferred.\n\nUse **Graphiti** when:\n\n- You need a knowledge graph of facts about the world, users, or an organization that evolves across weeks.\n- Multi-tenant, multi-agent shared memory is part of the design.\n- You're willing to run Neo4j/Kuzu and pay the LLM-extraction cost per ingest for the ontological payoff.\n- Your retrieval patterns are richer than \"which file, which test, which failed hypothesis.\"\n\nThey solve adjacent problems. The only reason a comparison is even interesting is that both ship as \"memory for AI agents\" — the honest answer is they're different products that happen to live on the same shelf.\n\n## Installation\n\n```bash\nuv sync                       # or: pip install -e .\npip install -e \".[mcp]\"       # stdio MCP server for Claude Code / Cursor / Codex CLI\npip install -e \".[bench]\"     # graphiti-core[kuzu] + datasets (for the headline benchmark)\npip install -e \".[llm]\"       # OpenAI-backed live benchmarks + LLM ingestion\npip install -e \".[adapters]\"  # LangGraph / LlamaIndex adapter shims\npip install -e \".[api]\"       # FastAPI app\n```\n\nDistribution name: `state-trace`. Python import path: `state_trace`.\n\n## Quickstart\n\n```python\nfrom state_trace import MemoryEngine\n\nengine = MemoryEngine(capacity_limit=24.0, storage_path=\"memory.json\")\n\ntask = engine.store(\n    \"Fix login by tracing the refresh token path\",\n    {\"type\": \"task\", \"session\": \"auth-debug\", \"goal\": \"restore login\", \"file\": \"auth.ts\", \"importance\": 0.92},\n)\nengine.store(\n    \"Login still returns 401 after refresh token exchange\",\n    {\"type\": \"observation\", \"session\": \"auth-debug\", \"goal\": \"restore login\", \"file\": \"auth.ts\",\n     \"blocks\": [task.id], \"importance\": 0.88},\n)\nengine.store(\n    \"Authorization header is dropped before the retry request reaches auth.ts\",\n    {\"type\": \"decision\", \"session\": \"auth-debug\", \"goal\": \"restore login\",\n     \"related_to\": [task.id], \"file\": \"auth.ts\", \"importance\": 0.91},\n)\n\nresult = engine.retrieve(\"Why is login still broken?\", {\"session\": \"auth-debug\", \"goal\": \"restore login\"})\n```\n\n## Current state, live hypotheses, failed attempts\n\nThe architectural wedge. These APIs return a live view of the session without re-ranking:\n\n```python\nstate = engine.current_state(session=\"auth-debug\", goal=\"restore login\")\n# → {\"active_task\": ..., \"latest_observation\": ..., \"active_files\": [...], ...}\n\nfailures = engine.failed_hypotheses(session=\"auth-debug\")\n# → [{\"id\": ..., \"reason\": [\"superseded\"], \"content\": \"Login still returns 401 ...\"}, ...]\n```\n\n`current_state` filters out invalidated and superseded nodes; `failed_hypotheses` surfaces them as \"do not propose again\" context. A general-purpose temporal graph has to infer this from fact updates; here it's a direct query.\n\n## MCP Server\n\n```bash\npip install -e \".[mcp]\"\nstate-trace-mcp\n```\n\nEnvironment config:\n\n- `STATE_TRACE_STORAGE_PATH` — durable path; `.db`/`.sqlite` uses the SQLite backend. Default: `~/.state-trace/memory.db`.\n- `STATE_TRACE_NAMESPACE` — default namespace (e.g. the repo slug).\n- `STATE_TRACE_CAPACITY_LIMIT` — working-memory budget (default `256`).\n\nTools exposed: `store`, `retrieve`, `retrieve_brief`, `record_action`, `record_observation`, `record_test_result`, `ingest_agent_log_file`, `current_state`, `failed_hypotheses`, `list_namespaces`, `graph_snapshot`.\n\nExample Claude Code config (`~/.claude/settings.json`):\n\n```json\n{\n  \"mcpServers\": {\n    \"state-trace\": {\n      \"command\": \"state-trace-mcp\",\n      \"env\": {\n        \"STATE_TRACE_STORAGE_PATH\": \"/Users/me/.state-trace/memory.db\",\n        \"STATE_TRACE_NAMESPACE\": \"repo-x\"\n      }\n    }\n  }\n}\n```\n\n## Online agent loop\n\n```python\nengine = MemoryEngine(capacity_limit=256.0)\nctx = {\"session\": \"auth-debug\", \"goal\": \"restore login\", \"repo\": \"example/auth-service\"}\n\nengine.record_action('open \"src/auth.ts\"', {**ctx, \"files\": [\"src/auth.ts\"]})\nengine.record_observation(\n    \"AttributeError: login still fails with a 401 in src/auth.ts\",\n    {**ctx, \"files\": [\"src/auth.ts\"], \"status\": \"error\"},\n)\nengine.record_action('edit \"src/auth.ts\"', {**ctx, \"files\": [\"src/auth.ts\"], \"action_kind\": \"edit\"})\nengine.record_test_result(\n    \"pytest tests/test_auth.py::test_refresh_retry\",\n    \"tests/test_auth.py::test_refresh_retry PASSED\",\n    {**ctx, \"files\": [\"src/auth.ts\", \"tests/test_auth.py::test_refresh_retry\"]},\n)\n\nbrief = engine.retrieve_brief(\n    \"Which file should I patch and what test should I rerun?\",\n    {\"session\": \"auth-debug\", \"goal\": \"restore login\"},\n    mode=\"small_model\",\n)\n```\n\nThe brief fields: `patch_file`, `rerun_command`, `target_files`, `tests_to_rerun`, `current_state`, `failed_attempts`, `recommended_actions`, `evidence`, `symbols`, `patch_hints`, `confidence`, `token_estimate`.\n\n## Trajectory ingestion\n\n```python\nengine = MemoryEngine(capacity_limit=256.0)\nengine.store_agent_log_file(\"examples/data/agent_logs/marshmallow__marshmallow-1867.json\")\n```\n\nSupported inputs: normalized `agent_log` JSON, raw SWE-agent `.traj` files, raw OpenHands event JSON logs.\n\n## Live solve-rate (next credibility step)\n\n`examples/swebench_verified_solve_rate.py` scaffolds end-to-end solve-rate measurement: state-trace brief → LLM patch proposal → SWE-bench-Verified prediction JSONL. It does not run the swebench docker harness; that step is documented in the script's header.\n\n```bash\npython3 examples/swebench_verified_solve_rate.py --limit 5 --model gpt-5.1-mini --dry-run\n```\n\n## Storage backends\n\n`MemoryEngine(storage_path=...)` picks the backend from the file extension:\n\n- `.db` / `.sqlite` / `.sqlite3` — durable SQLite with WAL journal + FTS5 seed index. Recommended for long-running agent harnesses.\n- any other path — JSON blob (simple, single-writer, fine for benchmarks).\n\nSee [ARCHITECTURE.md](./ARCHITECTURE.md) for the \"why networkx + SQLite, not Neo4j\" explainer.\n\n## Namespaces\n\n```python\nengine = MemoryEngine(storage_path=\"memory.db\", namespace=\"payments-api\")\nengine.retrieve(\"why is login broken?\")  # scoped to payments-api by default\nengine.retrieve(\"...\", include_all_namespaces=True)  # opt out\n```\n\nNodes without a namespace remain visible in every view so pre-namespace data is not lost.\n\n## Framework adapters\n\n```python\nfrom state_trace.adapters import StateTraceLangGraphMemory, StateTraceLlamaIndexMemory\n\nlg_memory = StateTraceLangGraphMemory(default_session=\"coding-session\")\nli_memory = StateTraceLlamaIndexMemory(session_id=\"agent-session\")\n```\n\nNeither adapter imports the host framework; they satisfy the duck-typed memory contract used by each.\n\n## FastAPI\n\n```python\nfrom state_trace.api import app  # POST /store, /retrieve, /retrieve_brief, GET /graph\n```\n\nPass `\"explain\": true` on retrieve to include per-node score breakdowns.\n\n## Tests\n\n```bash\npython3 -m pytest -q\n```\n\n## Benchmarks\n\nFull set of repo-local benchmarks and their honest caveats lives in [BENCHMARKS.md](./BENCHMARKS.md). The SWE-bench-Verified row above is the only one that's at a scale worth citing externally.\n\n## Positioning\n\nSee [**vs. Graphiti**](#vs-graphiti) above for the head-to-head comparison and [ARCHITECTURE.md](./ARCHITECTURE.md) for the architecture tradeoffs in detail. tl;dr: different products, adjacent problems — `state-trace` owns the narrow coding-agent working-memory lane; Graphiti owns weeks-of-history temporal knowledge graphs.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frazroo%2Fstate-trace","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frazroo%2Fstate-trace","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frazroo%2Fstate-trace/lists"}