{"id":50454172,"url":"https://github.com/mizcausevic-dev/agent-eval-arena","last_synced_at":"2026-06-01T01:05:43.202Z","repository":{"id":357457942,"uuid":"1232401899","full_name":"mizcausevic-dev/agent-eval-arena","owner":"mizcausevic-dev","description":"Agent and LLM evaluation harness — golden datasets, multi-scorer execution, regression detection across model versions, cost-quality leaderboards, and CI gates for model promotion.","archived":false,"fork":false,"pushed_at":"2026-05-12T21:37:48.000Z","size":496,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-12T22:28:20.247Z","etag":null,"topics":["agent-eval","ai-governance","ai-platform","ci-gate","express","llm-eval","ml-ops","platform-engineering","regression-detection","typescript"],"latest_commit_sha":null,"homepage":null,"language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mizcausevic-dev.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-07T22:34:47.000Z","updated_at":"2026-05-12T21:37:52.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/mizcausevic-dev/agent-eval-arena","commit_stats":null,"previous_names":["mizcausevic-dev/agent-eval-arena"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/mizcausevic-dev/agent-eval-arena","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/
repositories/mizcausevic-dev%2Fagent-eval-arena","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mizcausevic-dev%2Fagent-eval-arena/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mizcausevic-dev%2Fagent-eval-arena/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mizcausevic-dev%2Fagent-eval-arena/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mizcausevic-dev","download_url":"https://codeload.github.com/mizcausevic-dev/agent-eval-arena/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mizcausevic-dev%2Fagent-eval-arena/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33755379,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-05-31T02:00:06.040Z","response_time":95,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agent-eval","ai-governance","ai-platform","ci-gate","express","llm-eval","ml-ops","platform-engineering","regression-detection","typescript"],"created_at":"2026-06-01T01:05:43.121Z","updated_at":"2026-06-01T01:05:43.191Z","avatar_url":"https://github.com/mizcausevic-dev.png","language":"TypeScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Agent Eval 
Arena\n\n[![CI](https://github.com/mizcausevic-dev/agent-eval-arena/actions/workflows/ci.yml/badge.svg)](https://github.com/mizcausevic-dev/agent-eval-arena/actions/workflows/ci.yml)\n[![Node](https://img.shields.io/badge/node-20%2B-339933?logo=node.js\u0026logoColor=white)](https://nodejs.org)\n[![TypeScript](https://img.shields.io/badge/typescript-5.6-3178C6?logo=typescript\u0026logoColor=white)](https://www.typescriptlang.org)\n[![License: MIT](https://img.shields.io/badge/license-MIT-66FCF1)](LICENSE)\n\nEvaluation harness for **AI agents and LLMs** — golden datasets, multi-scorer execution, regression detection across model versions, cost-quality leaderboards, and CI-gating decisions for model promotion. The pre-production half of the loop that `agentobserve` closes after deploy.\n\n\u003e Recruiter takeaway:\n\u003e\n\u003e *\"This person is the engineer who actually wires LLM eval into the release pipeline. Pass-rate gates, cost caps, latency caps, and regression checks — all of it as testable backend logic that can fail a build before a customer sees a regression.\"*\n\n## Why This Exists\n\nMost AI teams either ship without eval (and find regressions in production) or have a Jupyter notebook that someone runs occasionally (which nobody trusts in CI). The middle ground is a service: a real eval harness with versioned datasets, scoring engines, regression comparison, and a gate that returns `pass` / `fail` so CI can block bad promotions.\n\nThis repo is that service. 
It ships with four scoring engines (exact match, fuzzy, token overlap, rubric-based aggregation), a regression detector that diffs two runs, a leaderboard that ranks models on quality / cost / latency / value, and a CI gate that returns a single decision your pipeline can branch on.\n\n## Where This Sits in the Portfolio\n\n| Repo | Surface | Question it answers |\n|---|---|---|\n| [`mcp-sentinel`](https://github.com/mizcausevic-dev/mcp-sentinel) | Tool calls | *What MCP tools are exposed and how risky?* |\n| [`rag-sentinel`](https://github.com/mizcausevic-dev/rag-sentinel) | Retrieval | *What's in the vector store and how trustworthy?* |\n| [`agent-codex`](https://github.com/mizcausevic-dev/agent-codex) | Decisions | *Under what policies are decisions allowed?* |\n| **`agent-eval-arena`** | **Pre-production** | ***Should this model promotion ship?*** |\n| [`agentobserve`](https://github.com/mizcausevic-dev/agentobserve) | Runtime | *What did agents actually do?* |\n| [`kinetic-flightdeck`](https://github.com/mizcausevic-dev/kinetic-flightdeck) | Operator | *Are we OK right now? Who do I call?* |\n\n## Project Overview\n\n| Attribute | Detail |\n|---|---|\n| Runtime | Node.js + TypeScript |\n| Framework | Express 5 |\n| Domain | LLM/agent evaluation and CI gating |\n| Scoring | Exact match · Fuzzy (Levenshtein) · Token overlap · Rubric (multi-criterion) |\n| Analysis | Regression detection · Multi-model leaderboards · Quality-per-dollar |\n| CI Integration | Single-call gate decision: `pass` / `fail` |\n\n## Five Capabilities\n\n### 1. 
Text-Match Scorers\n\nThree deterministic scorers for known-output evaluation:\n\n| Scorer | When to use |\n|---|---|\n| `exactMatch` | Classification, extraction, slot-filling — answer must equal expected |\n| `fuzzyMatch` (Levenshtein) | Tolerates typos and minor formatting variance |\n| `tokenOverlap` (Jaccard) | Bag-of-words match for paraphrasing tolerance |\n\nAll three handle case sensitivity, whitespace normalization, and trim independently.\n\n### 2. Rubric-Based Scoring\n\nFor open-ended outputs scored against multi-criterion rubrics. Each criterion has a weight; per-case results are `pass`, `partial`, or `fail`. The aggregator returns weighted score, criteria pass/fail breakdown, and worst-failure highlight (highest-weight criterion that failed).\n\nRollup across many cases yields per-criterion pass rates — surfacing systemic weaknesses (\"model passes accuracy 92% but fails safety 4% of the time\").\n\n### 3. Regression Detection\n\nCompares two eval runs (baseline vs candidate) and produces:\n\n- Pass-rate delta (percentage points)\n- Average score delta\n- Latency p95 delta\n- Cost-per-case delta\n- New failures (cases that passed in baseline, fail in candidate)\n- New passes (cases that failed in baseline, pass in candidate)\n- Verdict: `improved` · `no-change` · `regression` · `severe-regression`\n\nSevere-regression triggers when pass rate drops ≥ 5pp OR new failures exceed 5% of dataset.\n\n### 4. Multi-Model Leaderboard\n\nFor a given dataset with multiple model runs, the leaderboard ranks models on four axes:\n\n| Axis | Definition |\n|---|---|\n| `bestQuality` | Highest pass rate |\n| `bestCost` | Lowest avg cost per case |\n| `bestLatency` | Lowest avg latency |\n| `bestValue` | Best quality-per-dollar (pass rate ÷ cost per case) |\n\nThe fourth metric is what CFOs look at — and it usually doesn't pick the same model as `bestQuality`.\n\n### 5. CI Gate\n\nThe integration point. Wire `POST /api/eval/gate` into your CI/CD pipeline. 
Pass thresholds (or accept defaults), pass the candidate run (and optional baseline), receive a single `pass` / `fail` decision plus reasons.\n\n```json\n{\n  \"minPassRate\": 80,\n  \"maxRegressionPp\": 2,\n  \"maxNewFailures\": 2,\n  \"maxLatencyP95Ms\": 0,\n  \"maxCostPerCaseUsd\": 0\n}\n```\n\nSet `maxLatencyP95Ms` or `maxCostPerCaseUsd` to non-zero to enforce hard caps in addition to relative regression checks.\n\n## API Endpoints\n\n### Scoring\n\n| Method | Endpoint | Purpose |\n|---|---|---|\n| POST | `/api/score/exact-match` | Exact match with normalization options |\n| POST | `/api/score/fuzzy-match` | Levenshtein-based similarity |\n| POST | `/api/score/token-overlap` | Jaccard bag-of-words match |\n| POST | `/api/score/rubric` | Multi-criterion rubric aggregation |\n\n### Evaluation\n\n| Method | Endpoint | Purpose |\n|---|---|---|\n| POST | `/api/eval/compare` | Compare two runs; return regression verdict |\n| POST | `/api/eval/gate` | CI gate decision with thresholds |\n\n### Read\n\n| Method | Endpoint | Purpose |\n|---|---|---|\n| GET | `/health` | Service status |\n| GET | `/api/datasets` | List eval datasets |\n| GET | `/api/datasets/:id` | Single dataset |\n| GET | `/api/datasets/:id/runs` | Runs for a dataset |\n| GET | `/api/datasets/:id/leaderboard` | Multi-model rankings on dataset |\n| GET | `/api/runs` | All runs |\n| GET | `/api/runs/:id` | Single run with per-case results |\n\n## Sample: CI Gate Decision\n\n```json\nPOST /api/eval/gate\n{\n  \"candidate\": {\n    \"runId\": \"run_2026_05_07_005\",\n    \"modelId\": \"claude-opus\",\n    \"modelVersion\": \"4.7\",\n    \"datasetId\": \"ds_code_completion\",\n    \"timestamp\": \"2026-05-07T12:00:00Z\",\n    \"cases\": [ ... 480 cases ... ]\n  },\n  \"baseline\": {\n    \"runId\": \"run_2026_05_05_004\",\n    \"modelId\": \"claude-opus\",\n    \"modelVersion\": \"4.6\",\n    \"datasetId\": \"ds_code_completion\",\n    \"timestamp\": \"2026-05-05T10:00:00Z\",\n    \"cases\": [ ... 
480 cases ... ]\n  },\n  \"thresholds\": {\n    \"minPassRate\": 75,\n    \"maxRegressionPp\": 2,\n    \"maxCostPerCaseUsd\": 0.025\n  }\n}\n```\n\n```json\n{\n  \"decision\": \"pass\",\n  \"reasons\": [],\n  \"passingChecks\": [\n    \"Candidate pass rate 82.30% meets minimum.\",\n    \"Cost per case $0.01900 within cap.\",\n    \"Pass-rate delta +4.20pp within tolerance.\",\n    \"No new failures vs baseline.\"\n  ],\n  \"recommendedAction\": \"Promote candidate; eval gate passed.\"\n}\n```\n\n## Operator Console Preview\n\n![Agent Eval Arena dashboard — leaderboard, regression detection, CI gate decision](docs/hero.png)\n\n## Getting Started\n\n### Prerequisites\n\n- Node.js 20+\n- npm\n\n### Setup\n\n```bash\ngit clone https://github.com/mizcausevic-dev/agent-eval-arena.git\ncd agent-eval-arena\nnpm install\nnpm run dev\n```\n\nVisit:\n\n- `http://localhost:3000/health`\n- `http://localhost:3000/api/datasets`\n- `http://localhost:3000/api/datasets/ds_support_qa/leaderboard`\n\n### Run Tests\n\n```bash\nnpm test\n```\n\n25 unit tests across text-match scorers, rubric aggregation, regression detection, leaderboard ranking, and CI gate decision logic.\n\n## What This Demonstrates\n\n- LLM/agent eval translated into testable, deterministic backend logic — no judge LLMs required for the harness layer\n- Cost and latency thresholds as first-class CI signals (most teams gate only on quality, miss the FinOps regression)\n- Regression detection as a structural diff, not a manual notebook ritual\n- Quality-per-dollar as a buyer-facing metric (the one CFOs actually want)\n- Strict-mode TypeScript with full test coverage; CI matrix on Node 20 + 22\n\n## Future Enhancements\n\n- LLM-as-judge integration for rubric per-criterion scoring\n- Dataset versioning with hash-pinning for reproducibility\n- Real-time eval streaming via SSE\n- Webhook-driven CI gate (`POST` from GitHub Actions, return decision)\n- Multi-tenant control plane for managed-service deployment\n- 
Integration with `agentobserve` for production-vs-eval drift detection\n\n## Tech Stack\n\n- Node.js, TypeScript, Express, Zod\n- Helmet, CORS, Morgan\n- Node test runner\n\n## Portfolio Links\n\n- [LinkedIn](https://www.linkedin.com/in/mizcausevic/)\n- [Skills Page](https://mizcausevic.com/skills)\n- [Medium](https://medium.com/@mizcausevic)\n- [GitHub](https://github.com/mizcausevic-dev)\n\nPart of [mizcausevic-dev's GitHub portfolio](https://github.com/mizcausevic-dev) — AI Platform Engineering doctrine.\n\n---\n\n**Connect:** [LinkedIn](https://www.linkedin.com/in/mirzacausevic/) · [Kinetic Gain](https://kineticgain.com) · [Medium](https://medium.com/@mizcausevic/) · [Skills](https://mizcausevic.com/skills/)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmizcausevic-dev%2Fagent-eval-arena","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmizcausevic-dev%2Fagent-eval-arena","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmizcausevic-dev%2Fagent-eval-arena/lists"}