{"id":47724200,"url":"https://github.com/kadubon/search-stability-lab","last_synced_at":"2026-04-02T20:04:29.981Z","repository":{"id":343140662,"uuid":"1176453983","full_name":"kadubon/search-stability-lab","owner":"kadubon","description":"Theory-to-experiment lab for search stability in long-running agents under finite context, with exact simulator tests and lightweight mechanistic probe tasks.","archived":false,"fork":false,"pushed_at":"2026-03-09T03:26:38.000Z","size":142,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-03-09T08:38:50.490Z","etag":null,"topics":["agent-evaluation","ai","ai-agents","bounded-memory","finite-context","hypothesis-management","llm-agents","long-horizon-reasoning","long-running-agents","mechanistic-probes","reproducible-research","reset-policy","scientific-audit","search-stability","simulator","state-compression","structured-output"],"latest_commit_sha":null,"homepage":"https://doi.org/10.5281/zenodo.18905242","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kadubon.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":"docs/SECURITY_RESPONSE.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-03-09T03:17:24.000Z","updated_at":"2026-03-09T03:26:41.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/kadubon/search-stability-lab","commit_stats":null,"previous_names":["kadubon/search-stability-lab"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/kadubon/search-stability-lab","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kadubon%2Fsearch-stability-lab","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kadubon%2Fsearch-stability-lab/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kadubon%2Fsearch-stability-lab/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kadubon%2Fsearch-stability-lab/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kadubon","download_url":"https://codeload.github.com/kadubon/search-stability-lab/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kadubon%2Fsearch-stability-lab/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31314849,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-02T12:59:32.332Z","status":"ssl_error","status_checked_at":"2026-04-02T12:54:48.875Z","response_time":89,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agent-evaluation","ai","ai-agents","bounded-memory","finite-context","hypothesis-management","llm-agents","long-horizon-reasoning","long-running-agents","mechanistic-probes","reproducible-research","reset-policy","scientific-audit","search-stability","simulator","state-compression","structured-output"],"created_at":"2026-04-02T20:04:26.153Z","updated_at":"2026-04-02T20:04:29.971Z","avatar_url":"https://github.com/kadubon.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Search Stability Lab\n\nThis repository implements a reproducible experiment codebase for studying search stability under finite context. It operationalizes the theory paper with two layers:\n\n- Layer A: a controlled simulator with simulator-exact latent adequacy and exact failure-channel attribution.\n- Layer B: a lightweight task harness with deterministic family proxies and bundled mechanistic probe tasks.\n\nThe repository is designed for scientific use. It tests scoped hypotheses about controller laws under finite-context constraints. It does not prove the theory universally, does not bundle copyrighted benchmark assets, and does not fabricate empirical results. The strongest current cross-layer result is the substitution-first story (`C3 \u003e C0`). Layer B now also shows directional reset-probe evidence against `C0`, plus limited outcome-sensitive compression evidence on a tiny authored suite, and some local CPU results remain confounded by malformed structured output.\n\n## Theoretical basis\n\nThis repository is an implementation-oriented companion to the following preprint, which defines the theory and terminology used here:\n\nTakahashi, K. (2026). *Search Stability under Finite Context: A Minimal Theory of Adequacy Preservation, Compression, and Reset in Long-Running Agents*. Zenodo. [https://doi.org/10.5281/zenodo.18905242](https://doi.org/10.5281/zenodo.18905242)\n\nFor experiment design and interpretation, see `docs/THEORY_VALIDATION_GUIDE.md` and `docs/SCIENTIFIC_CHECKLIST.md`.\nThe final experiment-readiness audit is recorded in `docs/FINAL_AUDIT.md`.\nThe post-pilot redesign rationale is recorded in `docs/EXPERIMENT_REDESIGN.md`.\nThe current conservative status page is `docs/CURRENT_STATUS.md`.\nThe repository scientific scope note is `docs/SCIENTIFIC_SCOPE.md`.\nThe results interpretation note is `docs/RESULTS_INTERPRETATION.md`.\nLayer B-specific scope and probe design notes are `docs/LAYER_B_STATUS.md`, `docs/LAYER_B_PROXY_MODEL.md`, and `docs/LAYER_B_PROBE_MATRIX.md`.\nThe completed Layer B probe-block report is `docs/LAYER_B_PROBE_REPORT_2026-03-09.md`.\n\n## What this repo tests\n\n- Whether controller-law changes affect adequate-family survival under finite active-context budgets.\n- Whether substitution-first, reserve-aware, compression-cautious, and reset-aware policies differ under controlled conditions.\n- Whether small real-task harnesses show directionally similar controller effects when only proxies are available.\n\n## Who this repo is for\n\n- AI engineers who need a lightweight, theory-aligned harness for long-running agent evaluation\n- AI agents that need explicit rules for what may be claimed from simulator versus real-task evidence\n- researchers who want a falsifiable, CPU-feasible layer before investing in larger benchmark runs\n\n## Why this theory is useful\n\nThe repository is meant to make long-running agent failures more legible. Instead of treating every miss as generic model weakness, it helps test whether failure came from:\n\n- loss of adequate-family reserve before verification\n- harmful compression aliasing\n- avoidable retirement of a still-useful route family\n- stale-legacy continuation when reset would have been rational\n- a mixture of ecological and raw-model failure\n\n## What this repo does not test\n\n- It is not a leaderboard implementation.\n- It does not claim benchmark completeness for SWE-Bench Lite.\n- It does not validate the theory outside the implemented simulator conditions and the chosen fixed task slice.\n\n## Setup\n\nPython 3.11+ is required.\n\n```bash\npython -m pip install -r requirements.txt\n```\n\nEnvironment variables:\n\n- `GEMINI_API_KEY` for Gemini runs\n- `LOCAL_LLM_ENDPOINT` and `LOCAL_LLM_MODEL` when using a local CPU model server\n\nThe repository ships `.env.example` only. Do not commit `.env`. Local scripts load `.env` automatically when present and never write its values into configs or logs.\nIf a real secret was ever tracked earlier, see `docs/SECURITY_RESPONSE.md`.\n\n## Quickstart\n\nSimulator smoke test:\n\n```bash\npython scripts/run_simulator.py --config configs/experiments/pilot.yaml --max-conditions 1 --max-episodes 1\n```\n\nReal-task harness smoke test:\n\n```bash\npython scripts/run_real_tasks.py --config configs/experiments/real_tasks.yaml\n```\n\nBundled non-mock Layer B run on the frozen micro-task slice:\n\n```bash\npython scripts/run_real_tasks.py --config configs/experiments/real_tasks_gemini_nonmock_frozen.yaml\n```\n\nExpanded bundled non-mock Layer B run:\n\n```bash\npython scripts/validate_task_assets.py --manifest tasks/manifests/frozen_task_slice_v3.yaml\npython scripts/run_real_tasks.py --config configs/experiments/real_tasks_gemini_nonmock_expanded.yaml\n```\n\nCompression-focused Layer B probe block:\n\n```bash\npython scripts/run_real_tasks.py --config configs/experiments/real_tasks_gemini_compression_probe.yaml\n```\n\nReset-focused Layer B probe block:\n\n```bash\npython scripts/run_real_tasks.py --config configs/experiments/real_tasks_gemini_reset_probe.yaml\n```\n\nLocal CPU model smoke test:\n\n```bash\npython scripts/run_real_tasks.py --config configs/experiments/real_tasks_local_cpu.yaml --max-tasks 1 --controllers C0\n```\n\nGemini smoke test:\n\n```bash\npython scripts/run_real_tasks.py --config configs/experiments/real_tasks_gemini.yaml --max-tasks 1 --controllers C0\n```\n\nAnalyze generated logs:\n\n```bash\npython scripts/analyze_results.py --input-dir artifacts --output-dir artifacts/analysis\n```\n\nBefore zipping or publishing:\n\n```bash\npython scripts/release_clean.py --dry-run\npython scripts/check_public_safety.py --mode release\n```\n\nRead the run status and result interpretation guide:\n\n```bash\ntype result_summary.md\ntype docs/CURRENT_STATUS.md\n```\n\nValidate a design before main runs:\n\n```bash\npython scripts/check_experiment_design.py --layer layer_a --config configs/experiments/pilot_gemini.yaml\npython scripts/check_experiment_design.py --layer layer_b --config configs/experiments/real_tasks_gemini.yaml\n```\n\n## Two-layer design\n\nLayer A is the theory-faithful core. It includes:\n\n- finite family sets and route instances\n- a nonempty hidden adequate-family set\n- delayed strong verification\n- hard-cap budgets\n- staleness, inertia, overlap burden, legacy contamination\n- lossy compression with alias logging\n- reset, branch, continue, retire, substitute, and tool actions\n- exact recoverability and deterministic exact failure attribution\n\nLayer B is a small-scope harness. It includes:\n\n- a task manifest format for a fixed task slice\n- deterministic family proxy construction\n- an explicit proxy-construction layer separating task-authored, derived, and runtime proxies\n- mock mode for offline validation\n- a bundled public-safe frozen micro-task slice for non-mock runs\n- explicit substitution, compression, and reset probe suites\n- an expanded bundled eight-task frozen slice for stronger Layer B contrasts\n- graceful behavior when external task assets are absent\n- proxy-only instrumentation aligned to the theory\n\n## Unified model adapters\n\nThe repository exposes a single adapter interface:\n\n- `plan_step(context)`\n- `compress_state(context)`\n- `diagnose_failure(trace)`\n- `choose_continue_branch_reset(context)`\n\nSupported backbones:\n\n- Gemini via config plus `GEMINI_API_KEY`\n- a local CPU endpoint via Ollama-style or OpenAI-compatible HTTP APIs\n- a deterministic mock adapter for offline tests and smoke checks\n\n## Exact vs proxy note\n\nLayer A logs exact simulator quantities. Layer B logs only proxies and task outcomes. The code and analysis pipeline keep these separate in field names, tables, figure titles, and documentation.\n\n## Primary hypotheses\n\n- `H1`: Success falls sharply once budget drops below a recoverability-supporting threshold.\n- `H2`: Within-family substitution preserves success better than greedy deletion.\n- `H3`: Lossy compression can create decision-relevant aliasing.\n- `H4`: Compression harm grows when strong verification is delayed.\n- `H5`: Reset-aware control can dominate stale continuation when contamination is high enough.\n- `H6`: These effects are attributable to controller law under a fixed backbone, not only to backbone changes.\n\n## Reproducibility note\n\nEvery run is driven by YAML config plus explicit seeds. Logs record:\n\n- layer\n- controller\n- backbone model ID\n- prompt version\n- code revision\n- run and episode identifiers\n- structured trajectory events\n\nEach run directory also includes a `run_manifest.json` with the resolved experiment config, model config, controller IDs, theory hypotheses, and a scientific-guardrail report.\n\nIf the repository is not inside a Git checkout, `code_revision` is logged as `unknown`.\n\n## Directory map\n\n- `configs/`: model, controller, and experiment YAML\n- `prompts/`: versioned prompt templates\n- `schemas/`: JSON Schemas for structured model outputs\n- `simulator/`: Layer A generator, engine, scenarios, attribution\n- `controllers/`: controller laws `C0` through `C6`\n- `models/`: Gemini, local endpoint, and mock adapters\n- `tasks/`: Layer B harness, proxy rule, and example manifest\n- `logging/`: logging notes and field conventions\n- `analysis/`: aggregation, statistics, and figure generation\n- `scripts/`: runnable entry points\n- `docs/`: build plan, implementation notes, and metric registry\n- `result_summary.md`: top-level execution and interpretation summary\n- `tests/`: unit tests, smoke checks, and offline pipeline tests\n\n## Running a pilot simulator experiment\n\n```bash\npython scripts/check_experiment_design.py --layer layer_a --config configs/experiments/pilot.yaml\npython scripts/run_simulator.py --config configs/experiments/pilot.yaml\npython scripts/analyze_results.py --input-dir artifacts/pilot\n```\n\nBackbone-specific pilot configs are also provided:\n\n- `configs/experiments/pilot_local_cpu.yaml`\n- `configs/experiments/pilot_gemini.yaml`\n- `configs/experiments/layer_a_identification_pilot_gemini.yaml`\n- `configs/experiments/layer_a_h1_budget_threshold_gemini.yaml`\n- `configs/experiments/layer_a_h2_substitution_gemini.yaml`\n- `configs/experiments/layer_a_h3_h4_compression_gemini.yaml`\n- `configs/experiments/layer_a_h5_reset_gemini.yaml`\n\nThis will generate logs and summary outputs from actual runs only. No figure is generated unless matching logs exist.\n\n## How to read results correctly\n\n- Use Layer A to make exact mechanism claims about reserve loss, compression aliasing, retirement, and reset behavior.\n- Use Layer B to ask whether controller-law effects remain directionally visible under a fixed small task slice.\n- Treat smoke runs as pipeline checks, not evidence for the theory.\n- Treat mock-mode Layer B runs as instrumentation validation, not external-validity evidence.\n- Report null or weak effects explicitly when they occur.\n\n## Plugging in a real task slice later\n\n1. Prepare a manifest following `tasks/README.md`.\n2. Point `configs/experiments/real_tasks.yaml` at that manifest.\n3. Keep backbone, prompt, turn budget, and tool permissions fixed within each comparison block.\n4. Run the harness in non-mock mode only when the task assets are available and frozen.\n5. Run `scripts/check_experiment_design.py` before main runs and keep the generated `run_manifest.json`.\n\nBundled non-mock assets are already provided for the shipped micro-task slice under `tasks/assets/`.\nThe larger frozen slice is `tasks/manifests/frozen_task_slice_v3.yaml`.\n\n## Safety and privacy\n\n- No secrets are hardcoded.\n- Default configs use placeholders or mock backbones.\n- No local absolute paths are shipped.\n- The repository does not bundle external benchmark data.\n- Run `python scripts/release_clean.py` and `python scripts/check_public_safety.py --mode release` before packaging a public zip.\n\n## Limitations\n\n- Layer B now ships a bundled frozen task slice, but it is still a small micro-task set rather than a benchmark-scale evaluation.\n- The simulator is lightweight and theorem-aligned, not benchmark-realistic.\n- Small smoke runs may produce unstable estimates or degenerate regressions; the analysis code preserves those outputs rather than hiding them.\n- The paper discusses richer posterior-robust and theorem-local audit quantities than this lightweight implementation currently exposes online; those are documented as scoped simplifications rather than omitted silently.\n\n## Citation\n\nIf you use this repository, cite the underlying theory preprint:\n\n```text\nTakahashi, K. (2026). Search Stability under Finite Context: A Minimal Theory of Adequacy Preservation, Compression, and Reset in Long-Running Agents. Zenodo. https://doi.org/10.5281/zenodo.18905242\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkadubon%2Fsearch-stability-lab","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkadubon%2Fsearch-stability-lab","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkadubon%2Fsearch-stability-lab/lists"}