{"id":47938388,"url":"https://github.com/lancedb/locomo-eval","last_synced_at":"2026-04-04T07:55:22.936Z","repository":{"id":346199350,"uuid":"1188342925","full_name":"lancedb/locomo-eval","owner":"lancedb","description":"LOCOMO benchmark for OpenClaw memory stores","archived":false,"fork":false,"pushed_at":"2026-03-24T21:51:37.000Z","size":75,"stargazers_count":1,"open_issues_count":1,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-04T07:55:21.895Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lancedb.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-03-22T00:06:08.000Z","updated_at":"2026-03-28T02:22:13.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/lancedb/locomo-eval","commit_stats":null,"previous_names":["lancedb/locomo-eval"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/lancedb/locomo-eval","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lancedb%2Flocomo-eval","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lancedb%2Flocomo-eval/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lancedb%2Flocomo-eval/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lancedb%2Flocomo-eval/manifests","owner_url":"https:
//repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lancedb","download_url":"https://codeload.github.com/lancedb/locomo-eval/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lancedb%2Flocomo-eval/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31392188,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-04T04:26:24.776Z","status":"ssl_error","status_checked_at":"2026-04-04T04:23:34.147Z","response_time":60,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-04-04T07:55:22.370Z","updated_at":"2026-04-04T07:55:22.927Z","avatar_url":"https://github.com/lancedb.png","language":"Python","funding_links":[],"categories":["Skills \u0026 Plugins"],"sub_categories":["Notable Skills \u0026 Plugins"],"readme":"# LOCOMO Benchmark for OpenClaw Memory\n\nMinimal harness for memory benchmarking using OpenClaw on the [LOCOMO](https://github.com/snap-research/locomo) dataset.\n\nThis repo benchmarks OpenClaw against LOCOMO with three backends:\n\n- `memory-core`: writes lossless LOCOMO session markdown into the OpenClaw workspace, then reindexes\n- `memory-lancedb`: writes the same LOCOMO session markdown, reuses the exact `memory-core` indexed chunks and embeddings, then stores those chunks in the built-in LanceDB plugin table\n- `memory-lancedb-pro`: starts from that same chunk-aligned LanceDB 
corpus and uses the `memory-lancedb-pro` plugin for more advanced retrieval tuning\n\nQA still runs through a local OpenClaw gateway, and an LLM judge scores the answers.\n\n## Quick Start\n\n### 1. Install OpenClaw\n\n```bash\nnpm install -g openclaw@latest\nopenclaw onboard\n```\n\n### 2. Add your OpenAI key\n\nPut this in `.env` at the repo root:\n\n```bash\nOPENAI_API_KEY=your_key_here\n```\n\n### 3. Download the dataset\n\n```bash\nmkdir -p datasets\ncurl -fsSL https://raw.githubusercontent.com/snap-research/locomo/main/data/locomo10.json -o datasets/locomo10.json\n```\n\n### 4. Install Python dependencies\n\n```bash\nuv sync\n```\n\n### 5. Configure OpenClaw for the backend you want to benchmark\n\n```bash\n./setup_memory_core.sh\n```\n\nFor the LanceDB leg, use:\n\n```bash\n./setup_memory_lancedb.sh\n```\n\nFor the LanceDB Pro leg, use:\n\n```bash\n./setup_memory_lancedb_pro.sh\n```\n\nBefore the first LanceDB Pro run, install the plugin once:\n\n```bash\nopenclaw plugins install memory-lancedb-pro@beta\n```\n\n### 6. Start the gateway\n\n```bash\n./start_gateway.sh\n```\n\nThe benchmark expects the gateway at `http://127.0.0.1:18789`.\n\n`start_gateway.sh` hardcodes the OpenClaw agent model to `openai/gpt-4.1-mini` so the gateway does not fall back to another provider by default.\n\n### 7. Build the backend corpus once\n\nFor `memory-core`, prebuild the workspace markdown and SQLite index:\n\n```bash\n./setup_memory_core.sh\n\nuv run python scripts/build_memory_core_corpus.py \\\n  --input datasets/locomo10.json\n```\n\nThat writes the full LOCOMO session markdown into the benchmark workspace and runs the built-in `openclaw memory index --force`. 
It records a manifest at `locomo-bench/prebuilt-memory-core.json`.\n\nFor `memory-lancedb`, prebuild the chunk-aligned LanceDB store:\n\n```bash\n./setup_memory_lancedb.sh\n\nuv run python scripts/build_memory_lancedb_corpus.py \\\n  --input datasets/locomo10.json\n```\n\nThat writes the full LOCOMO session markdown into the workspace, runs the built-in `memory-core` indexer, reads back the exact indexed chunk text and embeddings from SQLite, and writes those chunks into the configured `memory-lancedb` store. It records a manifest at `locomo-bench/prebuilt-memory-lancedb.json`.\n\nFor `memory-lancedb-pro`, build from the existing `memory-lancedb` store:\n\n```bash\n./setup_memory_lancedb_pro.sh\n\nuv run python scripts/build_memory_lancedb_pro_corpus.py \\\n  --source-db locomo-bench/lancedb\n```\n\nThis script requires the `memory-lancedb` store to exist already, because it migrates the already-computed embeddings to the format expected by the `memory-lancedb-pro` plugin.\n\nThat uses the plugin's migration path to materialize a separate `memory-lancedb-pro` store from the already-built chunk-aligned `memory-lancedb` corpus without re-embedding it. It records a manifest at `locomo-bench/prebuilt-memory-lancedb-pro.json`.\n\n### 8. Run a subset of the benchmark\n\nUse the `--limit` parameter to specify the number of QA pairs to benchmark.\n\n\u003e [!NOTE]\n\u003e `--limit` applies to the flattened LOCOMO QA rows, not the number of dialogues. The loader flattens every sample's `qa` array into one ordered benchmark list, so `--limit 5` means \"run the first 5 QA rows\". If those rows all come from one dialogue, the harness still ingests the full source dialogue for that selected sample. 
It does not ingest the entire LOCOMO dataset unless your selected rows span the entire dataset.\n\nThe benchmark exposes three model controls:\n\n- `--agent-model`: the actual OpenClaw agent model used for QA\n- `--gateway-model`: optional model value sent to OpenClaw `/v1/responses`; if omitted it defaults to `--agent-model`\n- `--judge-model`: model used by the LLM judge\n\nBy default, judge calls run with `--judge-concurrency 10` (10 parallel requests). Gateway QA calls run serially (`--concurrency 1`) because the OpenClaw gateway serializes requests through a single lane queue. For full parallelism across QA calls, use `scripts/run_parallel.py` which splits rows across 4 subprocesses. Output format is identical regardless of concurrency settings.\n\nIn a second terminal, enter the following for `memory-core`:\n\n```bash\nuv run python scripts/run_memory_core.py \\\n  --input datasets/locomo10.json \\\n  --limit 5 \\\n  --gateway http://127.0.0.1:18789 \\\n  --agent-model openai/gpt-4.1-mini \\\n  --judge-model openai/gpt-4.1-mini \\\n  --skip-ingest\n```\n\nFor the LanceDB leg:\n\n```bash\nuv run python scripts/run_memory_lancedb.py \\\n  --input datasets/locomo10.json \\\n  --limit 5 \\\n  --gateway http://127.0.0.1:18789 \\\n  --agent-model openai/gpt-4.1-mini \\\n  --judge-model openai/gpt-4.1-mini \\\n  --skip-ingest\n```\n\nFor the LanceDB Pro leg:\n\n```bash\nuv run python scripts/run_memory_lancedb_pro.py \\\n  --input datasets/locomo10.json \\\n  --limit 5 \\\n  --gateway http://127.0.0.1:18789 \\\n  --agent-model openai/gpt-4.1-mini \\\n  --judge-model openai/gpt-4.1-mini \\\n  --skip-ingest\n```\n\nAll runs write artifacts under `outputs/` by default.\n\n### 8b. Parallel runs with `run_parallel.py`\n\nThe OpenClaw gateway serializes requests through a single lane queue, so in-process concurrency for QA calls doesn't help. 
For faster large-scale runs, use `run_parallel.py`, which splits rows across 4 subprocesses:\n\n```bash\nuv run python scripts/run_parallel.py \\\n  --backend memory-lancedb-pro \\\n  --input datasets/locomo10.json \\\n  --limit 100 \\\n  --gateway http://127.0.0.1:18789 \\\n  --agent-model openai/gpt-4.1-mini \\\n  --judge-model openai/gpt-4.1-mini \\\n  --skip-ingest\n```\n\nThis spawns 4 worker processes, each handling a quarter of the rows. When all workers finish, the script merges the JSONL outputs into a single output directory and recomputes `summary.json`. The output format is identical to a single-process run.\n\nAll flags from the single-process scripts are supported (`--limit`, `--skip-ingest`, `--judge-concurrency`, etc.). The `--backend` flag selects which runner script to use (`memory-core`, `memory-lancedb`, or `memory-lancedb-pro`).\n\n### Concurrency controls\n\nThe benchmark exposes two concurrency settings:\n\n- `--concurrency`: number of concurrent gateway QA requests per process (default: 1, serial). Increasing this is not useful because the gateway serializes requests through a single lane queue.\n- `--judge-concurrency`: number of concurrent LLM judge requests per process (default: 10). This parallelizes calls to the OpenAI API for grading answers.\n\nThese flags apply to both the single-process scripts and the parallel wrapper.\n\n## Sample Results\n\nThe latest summaries under [`outputs/`](./outputs) show the following picture for recent `--limit 10` runs:\n\n| Backend | Rows | Correct | Wrong | Completion Rate | Avg latency (s) |\n| --- | ---: | ---: | ---: | ---: | ---: |\n| `memory-core` | 50 | 27 | 23 | 0.54 | 6.7 |\n| `memory-lancedb` | 50 | 32 | 18 | 0.64 | 4.1 |\n| `memory-lancedb-pro` | 50 | 36 | 14 | 0.72 | 10.9 |\n\nThe figures come from the `summary.json` files written after running the benchmark with each memory plugin.\n\n### 9. 
Large-scale runs: prebuild the stores once, then benchmark in query-only mode\n\nFor real benchmark runs, repeatedly ingesting the same corpus is wasteful. The recommended workflow is:\n\n1. build the `memory-core` corpus once\n2. build the `memory-lancedb` corpus once\n3. build the `memory-lancedb-pro` corpus once from that prebuilt `memory-lancedb` store\n4. run each benchmark leg with `--skip-ingest`\n\nAfter the corpora exist, benchmark in query-only mode.\n\nFor `memory-core`:\n\n```bash\n./setup_memory_core.sh\n./start_gateway.sh\n```\n\nIn a second terminal:\n\n```bash\nuv run python scripts/run_memory_core.py \\\n  --input datasets/locomo10.json \\\n  --limit 100 \\\n  --gateway http://127.0.0.1:18789 \\\n  --agent-model openai/gpt-4.1-mini \\\n  --judge-model openai/gpt-4.1-mini \\\n  --skip-ingest\n```\n\nFor `memory-lancedb`:\n\n```bash\n./setup_memory_lancedb.sh\n./start_gateway.sh\n```\n\nIn a second terminal:\n\n```bash\nuv run python scripts/run_memory_lancedb.py \\\n  --input datasets/locomo10.json \\\n  --limit 100 \\\n  --gateway http://127.0.0.1:18789 \\\n  --agent-model openai/gpt-4.1-mini \\\n  --judge-model openai/gpt-4.1-mini \\\n  --skip-ingest\n```\n\nFor `memory-lancedb-pro`:\n\n```bash\n./setup_memory_lancedb_pro.sh\n./start_gateway.sh\n```\n\nIn a second terminal:\n\n```bash\nuv run python scripts/run_memory_lancedb_pro.py \\\n  --input datasets/locomo10.json \\\n  --limit 100 \\\n  --gateway http://127.0.0.1:18789 \\\n  --agent-model openai/gpt-4.1-mini \\\n  --judge-model openai/gpt-4.1-mini \\\n  --skip-ingest\n```\n\n`--skip-ingest` means:\n\n- the benchmark does not touch the existing store\n- the run is query-only\n- `memory_status_before.json` and `memory_status_after.json` reflect the prebuilt store as-is\n- for `memory-core`, this only works after `build_memory_core_corpus.py` has populated the workspace markdown and SQLite index\n\n## What the Runner Does\n\nFor `memory-core`, the benchmark:\n\n- writes raw LOCOMO session 
markdown into `workspace/memory/locomo/`\n- clears only that benchmark-managed subtree before each run\n- runs `openclaw memory index --force`\n- asks QA through the gateway after reindexing\n- does no summarization during ingest\n\nFor `memory-lancedb`, the benchmark:\n\n- writes the same LOCOMO session markdown used by `memory-core`\n- runs the built-in `openclaw memory index --force`\n- reads the exact indexed `memory-core` chunks and stored embeddings from the SQLite `chunks` table\n- writes those chunks directly into the plugin's `memories` table\n- clears the benchmark-managed LanceDB directory before each run\n- uses the bundled `memory-lancedb` plugin with its Node dependency installed once\n- does no summarization during ingest\n\nWhen `--skip-ingest` is set, it skips the write/reset step and queries the already-built store.\n\nThe standalone [build_memory_core_corpus.py](./scripts/build_memory_core_corpus.py) script performs that same write-and-index step once up front so later `memory-core` runs can safely use `--skip-ingest`.\n\nFor `memory-lancedb-pro`, the benchmark:\n\n- writes the same LOCOMO session markdown used by `memory-core`\n- runs the built-in `openclaw memory index --force`\n- reads the exact indexed `memory-core` chunks and stored embeddings from the SQLite `chunks` table\n- materializes a temporary legacy `memory-lancedb` store from those chunks\n- migrates that temporary legacy store into `memory-lancedb-pro` so Pro sees the same chunk corpus and vectors\n- uses the installed `memory-lancedb-pro` plugin and the retrieval settings from `setup_memory_lancedb_pro.sh`\n- does no summarization during ingest\n\nThis is intentional. 
`memory-lancedb-pro` maintains additional search/index state, so the benchmark migrates through the plugin's supported path instead of treating it like the simpler built-in LanceDB store.\n\nWhen `--skip-ingest` is set, it skips the import step and queries the already-built Pro store.\n\n## Run Artifacts (What is Output)\n\nEach benchmark run writes a directory under:\n\n- [./outputs](./outputs)\n\nTypical files in one run directory:\n\n- `selected_rows.jsonl`\n  - the flattened LOCOMO QA rows selected by `--limit`\n  - useful for seeing exactly which benchmark questions were included\n- `ingest_log.jsonl`\n  - one row per stored memory unit written during ingest\n  - for `memory-core`, this is one row per session markdown file\n  - for `memory-lancedb`, this is one row per `memory-core` chunk stored in LanceDB\n  - for `memory-lancedb-pro`, this is one row per `memory-core` chunk stored in LanceDB Pro\n- `reindex.log`\n  - stdout from `openclaw memory index --force` for `memory-core`\n  - stdout from the chunk-source `openclaw memory index --force` step for both LanceDB legs\n  - for `memory-lancedb-pro`, this file also includes the plugin migration output\n- `document_log.jsonl`\n  - the session markdown files written before chunk extraction\n  - produced for both LanceDB legs because their chunk source is now the same session-document corpus used by `memory-core`\n- `memory_status_before.json`\n  - backend status before ingest\n  - for `memory-core`, this records the workspace, SQLite path, and indexed file/chunk counts\n  - for `memory-lancedb`, this records the LanceDB path and whether the store already existed\n  - for `memory-lancedb-pro`, this records the LanceDB Pro path and whether the store already existed\n- `memory_status_after.json`\n  - backend status after ingest\n  - for `memory-core`, this confirms how many files and chunks were indexed\n  - for `memory-lancedb`, this confirms the LanceDB path and how many chunk rows were written\n  - for 
`memory-lancedb-pro`, this confirms the LanceDB Pro path and how many chunk rows were written\n- `qa_results.jsonl`\n  - raw benchmark answers returned by the gateway for each selected QA row\n  - includes latency, token usage if present, and any gateway errors\n  - token usage for gateway-mediated runs should be treated as approximate until the gateway usage fields are fully normalized by the harness\n- `judged_results.jsonl`\n  - the QA rows after LLM judging\n  - marks each answer as `CORRECT` or `WRONG` with short reasoning\n- `summary.json`\n  - small run-level summary\n  - includes task completion rate, counts, token totals, latency, and memory status before/after\n\n## Notes\n\n- Start with `memory-core` as the baseline, then compare it against `memory-lancedb` and `memory-lancedb-pro`.\n- The same `.env` key is used for both OpenClaw and judge calls.\n- For the cleanest benchmark, keep unrelated files out of the active OpenClaw workspace memory corpus.\n- `setup_memory_lancedb_pro.sh` is the comparable baseline script.\n- `setup_memory_lancedb_pro_tune.sh` is the experimental tuned script for retrieval sweeps and A/B testing.\n- For large runs, prefer the prebuilt-store workflow and `--skip-ingest`. Re-ingesting the same corpus on every run is mainly useful for debugging, not for full benchmarks.\n- `memory-lancedb-pro` still depends on the prebuilt `memory-lancedb` store as its corpus source.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flancedb%2Flocomo-eval","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flancedb%2Flocomo-eval","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flancedb%2Flocomo-eval/lists"}