https://github.com/jeremylongshore/intent-eval-lab
Vendor-neutral research umbrella for measuring AI plugin, agent, and MCP server quality across CLI runtimes (Claude Code, Gemini CLI, Copilot CLI, Codex CLI).
https://github.com/jeremylongshore/intent-eval-lab
agent-eval ai-evaluation claude-code cross-cli gemini-cli invocation-rate mcp opentelemetry plugin-testing skill-discovery
Last synced: 1 day ago
JSON representation
Vendor-neutral research umbrella for measuring AI plugin, agent, and MCP server quality across CLI runtimes (Claude Code, Gemini CLI, Copilot CLI, Codex CLI).
- Host: GitHub
- URL: https://github.com/jeremylongshore/intent-eval-lab
- Owner: jeremylongshore
- License: other
- Created: 2026-05-07T19:20:06.000Z (about 2 months ago)
- Default Branch: main
- Last Pushed: 2026-06-20T00:25:55.000Z (10 days ago)
- Last Synced: 2026-06-20T01:17:01.354Z (10 days ago)
- Topics: agent-eval, ai-evaluation, claude-code, cross-cli, gemini-cli, invocation-rate, mcp, opentelemetry, plugin-testing, skill-discovery
- Language: Python
- Homepage: https://intentsolutions.io
- Size: 2.81 MB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 44
-
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- Funding: .github/FUNDING.yml
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Codeowners: .github/CODEOWNERS
- Security: SECURITY.md
- Support: SUPPORT.md
- Agents: AGENTS.md
Awesome Lists containing this project
README
# Intent Eval Lab
Part of the **[Intent Eval Platform](https://github.com/intent-solutions-io/intent-eval-platform)** — the umbrella mapping the five repos that converge via a shared Evidence Bundle schema (the contracts kernel `intent-eval-core` plus this lab, `audit-harness`, `j-rig-skill-binary-eval`, and `intent-rollout-gate`; `intent-eval-dashboard` is the 6th platform repo, separate from the convergence taxonomy).
A research umbrella for measuring AI plugin, agent, and MCP server quality across CLI runtimes — Claude Code, Gemini CLI, GitHub Copilot CLI, and OpenAI Codex CLI.
The agentic-CLI ecosystem is converging on a small set of cross-tool conventions (`AGENTS.md`, MCP, `SKILL.md`) but the empirical question — _does my plugin actually get discovered and invoked correctly when the agent decides on its own?_ — has no vendor-neutral answer. The vendors won't publish opinionated cross-CLI invocation-measurement frameworks because they're competing across the stack. That niche is structurally available to a third party.
This repo is the working surface for that work.
## What's here
```text
intent-eval-lab/
├── 000-docs/ ← numbered docs (research summaries, methodology, plans, AAR)
├── specs/ ← normative methodology output — versioned, testable specs
│ per class of inference system, with case studies
├── research/ ← literature surveys, paper notes, competitive landscape
├── sandboxes/ ← per-experiment dirs (one dated subdir per run)
├── evidence/ ← captured telemetry, OTEL traces, JSON evidence (mostly gitignored)
├── scripts/ ← reusable test harness, OTEL probes, prompt-suite runners
└── projects/ ← symlinks to constituent project repos (gitignored — see below)
```
`specs/` is the **normative output** of the lab — currently shipping module 1 ([`mcp-plugin-observability/`](./specs/mcp-plugin-observability/)) at `v0.1.0-draft`, with placeholder modules for `validator-contract-reliability/`, `forecasting-drift-detection/`, and `decentralized-crypto-evaluation/` reserving the structural slots for future engagements. See [`specs/README.md`](./specs/README.md) for the module index and authoring conventions.
`projects/` is filesystem-only — the constituent projects keep their own GitHub remotes. The lab is a research umbrella over them, not a monorepo.
## Topics in scope
- Cross-CLI plugin invocation rate measurement (Claude Code / Gemini CLI / Copilot CLI / Codex CLI)
- Skill auto-discovery vs explicit invocation — when does each pattern win
- Eager-vs-lazy skill loading tradeoffs (eager via `contextFileName` arrays vs lazy via auto-discovery)
- OpenTelemetry-instrumented agent evaluation — using `claude_code.skill_activated` and similar primitives
- Plugin metadata quality and its effect on agent decision rates
- MCP server contract conformance and tool-allowlist tradeoffs
- Cross-tool plugin pattern (`AGENTS.md` + `SKILL.md` + `mcpServers`) at scale
## Adjacent projects
| Project | Role |
| ------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [`jeremylongshore/j-rig-skill-binary-eval`](https://github.com/jeremylongshore/j-rig-skill-binary-eval) | 7-layer binary-criteria evaluation harness for SKILL.md artifacts. Natural extension target — the layers are artifact-type-polymorphic; specializing executor + judgment for plugin/agent/MCP is the core build axis. |
| [`jeremylongshore/intent-audit-harness`](https://github.com/jeremylongshore/intent-audit-harness) | `@intentsolutions/audit-harness` — source-code test-policy containment. Composes with the lab's L4-L7 sandbox work. JRig vendors it via a copied `.audit-harness/` directory for self-validation in its own CI (not an npm dep). The convergence integrates these at the shared Evidence Bundle schema layer rather than via package coupling — see [`000-docs/003-PP-PLAN-phase-b-scope-refinement.md`](./000-docs/003-PP-PLAN-phase-b-scope-refinement.md). |
## Working pattern
1. **Research first.** File a numbered doc in `000-docs/` from a literature survey or competitive scan before scaffolding anything. Methodology decisions get a permanent canonical filename.
2. **One experiment per sandbox.** Every experiment gets its own dated subdir under `sandboxes/`, isolated infrastructure, isolated state. Never share state across experiments.
3. **Evidence as JSON + screenshots.** Captures go to `evidence//` with cross-link in the experiment's `REPORT.md`.
4. **Reusable code in `scripts/`.** One-off scripts stay in their experiment's sandbox; only promote to `scripts/` when used twice.
## Status
Active. The Phase A foundation is complete (Blueprints A/B/C, the canonical glossary, and the Evidence Bundle spec module are landed on `main`), the Spec Authority Kernel Class-1 charter is ratified, and the repo is releasing at `v0.3.0`. Public so the methodology can be reviewed and the work can be discoverable. The lab publishes **methodology and normative specs**, not an executable harness — productizable harness code lands in the sister repo `j-rig-skill-binary-eval`.
## License
Apache 2.0 — see [LICENSE](LICENSE).
## Contributing
See [CONTRIBUTING.md](CONTRIBUTING.md) for the contribution model and conventions. The repo accepts PRs but is currently maintained by a single author; expect slower review cycles than a multi-maintainer project.
## Author
Jeremy Longshore — [intentsolutions.io](https://intentsolutions.io) · [startaitools.com](https://startaitools.com)