{"id":51231226,"url":"https://github.com/jeremylongshore/intent-eval-lab","last_synced_at":"2026-06-28T16:31:01.620Z","repository":{"id":356853134,"uuid":"1232282427","full_name":"jeremylongshore/intent-eval-lab","owner":"jeremylongshore","description":"Vendor-neutral research umbrella for measuring AI plugin, agent, and MCP server quality across CLI runtimes (Claude Code, Gemini CLI, Copilot CLI, Codex CLI).","archived":false,"fork":false,"pushed_at":"2026-06-20T00:25:55.000Z","size":2950,"stargazers_count":1,"open_issues_count":44,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-20T01:17:01.354Z","etag":null,"topics":["agent-eval","ai-evaluation","claude-code","cross-cli","gemini-cli","invocation-rate","mcp","opentelemetry","plugin-testing","skill-discovery"],"latest_commit_sha":null,"homepage":"https://intentsolutions.io","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jeremylongshore.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":".github/CODEOWNERS","security":"SECURITY.md","support":"SUPPORT.md","governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":"AGENTS.md","dco":null,"cla":null},"funding":{"github":["jeremylongshore"],"custom":["https://intentsolutions.io"]}},"created_at":"2026-05-07T19:20:06.000Z","updated_at":"2026-06-20T00:25:58.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/jeremylongshore/intent-eval-lab","commit_stats":null,"previous_names":["jeremylongshore/intent-eval-lab"],"tags_count":3,"template":false,"template_full_name":null,"purl":"pkg:github/jeremylongshore/intent-eval-lab","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jeremylongshore%2Fintent-eval-lab","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jeremylongshore%2Fintent-eval-lab/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jeremylongshore%2Fintent-eval-lab/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jeremylongshore%2Fintent-eval-lab/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jeremylongshore","download_url":"https://codeload.github.com/jeremylongshore/intent-eval-lab/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jeremylongshore%2Fintent-eval-lab/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34896652,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-28T02:00:05.809Z","response_time":54,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agent-eval","ai-evaluation","claude-code","cross-cli","gemini-cli","invocation-rate","mcp","opentelemetry","plugin-testing","skill-discovery"],"created_at":"2026-06-28T16:31:00.291Z","updated_at":"2026-06-28T16:31:01.612Z","avatar_url":"https://github.com/jeremylongshore.png","language":"Python","funding_links":["https://github.com/sponsors/jeremylongshore","https://intentsolutions.io"],"categories":[],"sub_categories":[],"readme":"# Intent Eval Lab\n\nPart of the **[Intent Eval Platform](https://github.com/intent-solutions-io/intent-eval-platform)** — the umbrella mapping the five repos that converge via a shared Evidence Bundle schema (the contracts kernel `intent-eval-core` plus this lab, `audit-harness`, `j-rig-skill-binary-eval`, and `intent-rollout-gate`; `intent-eval-dashboard` is the 6th platform repo, separate from the convergence taxonomy).\n\nA research umbrella for measuring AI plugin, agent, and MCP server quality across CLI runtimes — Claude Code, Gemini CLI, GitHub Copilot CLI, and OpenAI Codex CLI.\n\nThe agentic-CLI ecosystem is converging on a small set of cross-tool conventions (`AGENTS.md`, MCP, `SKILL.md`) but the empirical question — _does my plugin actually get discovered and invoked correctly when the agent decides on its own?_ — has no vendor-neutral answer. The vendors won't publish opinionated cross-CLI invocation-measurement frameworks because they're competing across the stack. That niche is structurally available to a third party.\n\nThis repo is the working surface for that work.\n\n## What's here\n\n```text\nintent-eval-lab/\n├── 000-docs/        ← numbered docs (research summaries, methodology, plans, AAR)\n├── specs/           ← normative methodology output — versioned, testable specs\n│                       per class of inference system, with case studies\n├── research/        ← literature surveys, paper notes, competitive landscape\n├── sandboxes/       ← per-experiment dirs (one dated subdir per run)\n├── evidence/        ← captured telemetry, OTEL traces, JSON evidence (mostly gitignored)\n├── scripts/         ← reusable test harness, OTEL probes, prompt-suite runners\n└── projects/        ← symlinks to constituent project repos (gitignored — see below)\n```\n\n`specs/` is the **normative output** of the lab — currently shipping module 1 ([`mcp-plugin-observability/`](./specs/mcp-plugin-observability/)) at `v0.1.0-draft`, with placeholder modules for `validator-contract-reliability/`, `forecasting-drift-detection/`, and `decentralized-crypto-evaluation/` reserving the structural slots for future engagements. See [`specs/README.md`](./specs/README.md) for the module index and authoring conventions.\n\n`projects/` is filesystem-only — the constituent projects keep their own GitHub remotes. The lab is a research umbrella over them, not a monorepo.\n\n## Topics in scope\n\n- Cross-CLI plugin invocation rate measurement (Claude Code / Gemini CLI / Copilot CLI / Codex CLI)\n- Skill auto-discovery vs explicit invocation — when does each pattern win\n- Eager-vs-lazy skill loading tradeoffs (eager via `contextFileName` arrays vs lazy via auto-discovery)\n- OpenTelemetry-instrumented agent evaluation — using `claude_code.skill_activated` and similar primitives\n- Plugin metadata quality and its effect on agent decision rates\n- MCP server contract conformance and tool-allowlist tradeoffs\n- Cross-tool plugin pattern (`AGENTS.md` + `SKILL.md` + `mcpServers`) at scale\n\n## Adjacent projects\n\n| Project                                                                                                 | Role                                                                                                                                                                                                                                                                                                                                                                                                                                                          |\n| ------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |\n| [`jeremylongshore/j-rig-skill-binary-eval`](https://github.com/jeremylongshore/j-rig-skill-binary-eval) | 7-layer binary-criteria evaluation harness for SKILL.md artifacts. Natural extension target — the layers are artifact-type-polymorphic; specializing executor + judgment for plugin/agent/MCP is the core build axis.                                                                                                                                                                                                                                         |\n| [`jeremylongshore/intent-audit-harness`](https://github.com/jeremylongshore/intent-audit-harness)       | `@intentsolutions/audit-harness` — source-code test-policy containment. Composes with the lab's L4-L7 sandbox work. JRig vendors it via a copied `.audit-harness/` directory for self-validation in its own CI (not an npm dep). The convergence integrates these at the shared Evidence Bundle schema layer rather than via package coupling — see [`000-docs/003-PP-PLAN-phase-b-scope-refinement.md`](./000-docs/003-PP-PLAN-phase-b-scope-refinement.md). |\n\n## Working pattern\n\n1. **Research first.** File a numbered doc in `000-docs/` from a literature survey or competitive scan before scaffolding anything. Methodology decisions get a permanent canonical filename.\n2. **One experiment per sandbox.** Every experiment gets its own dated subdir under `sandboxes/`, isolated infrastructure, isolated state. Never share state across experiments.\n3. **Evidence as JSON + screenshots.** Captures go to `evidence/\u003cexperiment-id\u003e/` with cross-link in the experiment's `REPORT.md`.\n4. **Reusable code in `scripts/`.** One-off scripts stay in their experiment's sandbox; only promote to `scripts/` when used twice.\n\n## Status\n\nActive. The Phase A foundation is complete (Blueprints A/B/C, the canonical glossary, and the Evidence Bundle spec module are landed on `main`), the Spec Authority Kernel Class-1 charter is ratified, and the repo is releasing at `v0.3.0`. Public so the methodology can be reviewed and the work can be discoverable. The lab publishes **methodology and normative specs**, not an executable harness — productizable harness code lands in the sister repo `j-rig-skill-binary-eval`.\n\n## License\n\nApache 2.0 — see [LICENSE](LICENSE).\n\n## Contributing\n\nSee [CONTRIBUTING.md](CONTRIBUTING.md) for the contribution model and conventions. The repo accepts PRs but is currently maintained by a single author; expect slower review cycles than a multi-maintainer project.\n\n## Author\n\nJeremy Longshore — [intentsolutions.io](https://intentsolutions.io) · [startaitools.com](https://startaitools.com)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjeremylongshore%2Fintent-eval-lab","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjeremylongshore%2Fintent-eval-lab","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjeremylongshore%2Fintent-eval-lab/lists"}