https://github.com/evaluation-context-protocol/ecp

ECP is a standardized interface for orchestrating, auditing, and enforcing authority limits in AI agent evaluations. It moves evaluation from "brittle Python scripts" to a deterministic infrastructure protocol.

evaluation-metrics evaluations llm-evaluation model-evaluation

# Evaluation Context Protocol (ECP)

![Status](https://img.shields.io/badge/Status-Experimental-orange)
![License](https://img.shields.io/badge/License-Apache%202.0-blue)

> Work in progress: this repository is actively evolving, and some concepts may change.

A lightweight protocol and reference runtime for evaluating agents on their public output, private reasoning, and tool usage. This repo contains:

- `sdk/` - Python SDK for implementing an ECP agent.
- `runtime/` - Python runtime (CLI) that runs manifests and grades results.
- `examples/` - Minimal framework demos (LangChain, LlamaIndex, CrewAI, PydanticAI).
- `spec/` - Protocol specification.

## Documentation

- Docs site: https://evaluationcontextprotocol.io/
- Quickstart: https://evaluationcontextprotocol.io/quickstart/
- Specification: https://evaluationcontextprotocol.io/spec/
- Docs deploy automatically from `main` via GitHub Actions.

## Quick Start

Create a virtual environment and install version `0.2.9` of the runtime and SDK from PyPI (the commands in this section use Windows PowerShell syntax):

```powershell
py -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install "ecp-runtime==0.2.9" "ecp-sdk[langchain]==0.2.9" langchain-openai
```

Run the example manifest:

```powershell
python -m ecp_runtime.cli run --manifest .\examples\langchain_demo\manifest.yaml
```

Generate an HTML report:

```powershell
python -m ecp_runtime.cli run --manifest .\examples\langchain_demo\manifest.yaml --report .\report.html
```

Print a JSON report (useful for CI tooling):

```powershell
python -m ecp_runtime.cli run --manifest .\examples\langchain_demo\manifest.yaml --json
```
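For CI, the `--json` output can be consumed from a small wrapper script. The sketch below uses only the CLI flags shown above. The helper names (`build_command`, `parse_report`, `run_and_load`) are hypothetical, and it assumes that stdout contains exactly one JSON document. The report schema and the runtime's exit-code behavior are not described in this README, so the sketch does not rely on either.

```python
import json
import subprocess
import sys


def build_command(manifest_path: str) -> list[str]:
    """Build the CLI invocation shown above, with the JSON report on stdout."""
    return [
        sys.executable, "-m", "ecp_runtime.cli",
        "run", "--manifest", manifest_path, "--json",
    ]


def parse_report(stdout_text: str):
    """Parse the report. Assumption: stdout holds a single JSON document."""
    return json.loads(stdout_text)


def run_and_load(manifest_path: str):
    """Run the runtime once and return the parsed report."""
    completed = subprocess.run(
        build_command(manifest_path),
        capture_output=True,
        text=True,
        check=False,  # exit-code semantics are not documented here
    )
    return parse_report(completed.stdout)

# Example (not executed here):
#   report = run_and_load("examples/langchain_demo/manifest.yaml")
```

Inspect the returned object against the actual report format before gating a build on any particular field.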

Save a JSON report to a file:

```powershell
python -m ecp_runtime.cli run --manifest .\examples\langchain_demo\manifest.yaml --json-out .\report.json
```

If your manifest uses `llm_judge`, set your OpenAI API key and the judge model:

```powershell
$env:OPENAI_API_KEY="your_key_here"
$env:ECP_LLM_JUDGE_MODEL="gpt-4o-mini"
```

The latest stable packages on PyPI are now `0.2.9`, and this repository is aligned with that release line.

Run the CrewAI and PydanticAI demos:

```powershell
pip install "ecp-sdk[crewai]==0.2.9" crewai
python -m ecp_runtime.cli run --manifest .\examples\crewai_demo\manifest.yaml

pip install "ecp-sdk[pydanticai]==0.2.9" pydantic-ai
python -m ecp_runtime.cli run --manifest .\examples\pydantic_ai_demo\manifest.yaml
```

## Example (LangChain Agent + Manifest)

Agent (LangChain `create_agent` + tool usage):

```python
from langchain.agents import create_agent
from langchain_openai import ChatOpenAI
from langchain_core.tools import tool
from ecp import serve
from ecp.adaptors.langchain import ECPLangChainAdapter

@tool
def calculator(expression: str) -> str:
    """Evaluate a basic arithmetic expression and return the integer result."""
    # Allow only digits, arithmetic operators, parentheses, and spaces.
    allowed = set("0123456789+-*/() ")
    if not expression or any(ch not in allowed for ch in expression):
        return "Invalid expression."
    try:
        # Evaluate with builtins disabled; non-integer results are truncated.
        return str(int(eval(expression, {"__builtins__": {}})))
    except Exception:
        return "Invalid expression."

agent = create_agent(
    model=ChatOpenAI(model="gpt-3.5-turbo", temperature=0),
    tools=[calculator],
    system_prompt="Use the calculator tool for arithmetic."
)

def to_messages(text: str):
    # Wrap the raw input text in the message format the agent expects.
    return {"messages": [{"role": "user", "content": text}]}

serve(ECPLangChainAdapter(agent, name="MathBot", input_mapper=to_messages))
```
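The character allowlist in `calculator` still admits `**`, so an input such as `9**9**9` would reach `eval`. One possible alternative, shown here as a sketch and not as the example's actual implementation, is to walk the parsed expression with Python's `ast` module and accept only the four basic operators. The name `safe_calculate` is hypothetical.

```python
import ast
import operator

# Binary operators the evaluator accepts; anything else (e.g. **) is rejected.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _evaluate(node: ast.AST):
    """Recursively evaluate a restricted arithmetic syntax tree."""
    if isinstance(node, ast.Expression):
        return _evaluate(node.body)
    if isinstance(node, ast.Constant) and type(node.value) in (int, float):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
        return _BIN_OPS[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, (ast.UAdd, ast.USub)):
        value = _evaluate(node.operand)
        return -value if isinstance(node.op, ast.USub) else value
    raise ValueError("unsupported expression")


def safe_calculate(expression: str) -> str:
    """Return the truncated integer result, or an error string on bad input."""
    try:
        tree = ast.parse(expression.strip(), mode="eval")
        return str(int(_evaluate(tree)))
    except (SyntaxError, ValueError, ZeroDivisionError,
            OverflowError, RecursionError):
        return "Invalid expression."
```

Like the original tool, this truncates non-integer results with `int(...)` and returns the same `"Invalid expression."` string on failure, so it could be dropped into the `@tool` function body without changing the manifest.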

Manifest (runtime checks output + tool usage):

```yaml
manifest_version: "v1"
name: "LangChain Math Check"
target: "python agent.py"

scenarios:
  - name: "Ratio Word Problem"
    steps:
      - input: "Katy makes coffee using teaspoons of sugar and cups of water in the ratio of 7:13..."
        graders:
          - type: text_match
            field: public_output
            condition: contains
            value: "42"
          - type: tool_usage
            tool_name: "calculator"
            arguments: {}
```

Supported graders:

- `text_match` (`contains`, `equals`, `does_not_contain`, `regex`)
- `llm_judge` (requires `OPENAI_API_KEY`)
- `tool_usage` (name + argument subset match)
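
To make the grader semantics concrete, here is a minimal sketch of how `text_match` and `tool_usage` could be evaluated. It is not the runtime's code. It assumes that `regex` means a search anywhere in the text, that matching is case-sensitive, and that each tool call is a dict with `name` and `arguments` keys; none of these details are confirmed by this README.

```python
import re


def text_match(text: str, condition: str, value: str) -> bool:
    """Apply one of the four documented text_match conditions."""
    if condition == "contains":
        return value in text
    if condition == "equals":
        return text == value
    if condition == "does_not_contain":
        return value not in text
    if condition == "regex":
        # Assumption: a match anywhere in the text counts.
        return re.search(value, text) is not None
    raise ValueError(f"unknown condition: {condition!r}")


def tool_usage(tool_calls: list[dict], tool_name: str, arguments: dict) -> bool:
    """Pass if any call has the expected name and contains every expected argument."""
    for call in tool_calls:
        if call.get("name") != tool_name:
            continue
        actual = call.get("arguments", {})
        # Subset match: extra arguments on the actual call are ignored.
        if all(key in actual and actual[key] == expected
               for key, expected in arguments.items()):
            return True
    return False
```

Under this reading, `arguments: {}` in the manifest above matches any call to `calculator`, because an empty dict is a subset of every argument set.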

Note: manifest validation is strict and fails fast on invalid grader configuration.
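
As an illustration of fail-fast validation, the sketch below rejects a grader entry as soon as it finds an unknown type, a missing field, or an unsupported `text_match` condition. The required-field sets are assumptions inferred from the example manifest, not the runtime's actual schema, and `validate_grader` is a hypothetical name.

```python
ALLOWED_CONDITIONS = {"contains", "equals", "does_not_contain", "regex"}

# Assumed required fields per grader type (inferred from the example manifest).
REQUIRED_FIELDS = {
    "text_match": {"field", "condition", "value"},
    "llm_judge": set(),  # its fields are not listed in this README
    "tool_usage": {"tool_name"},
}


def validate_grader(grader: dict) -> None:
    """Raise ValueError on the first problem found; return None if valid."""
    grader_type = grader.get("type")
    if grader_type not in REQUIRED_FIELDS:
        raise ValueError(f"unknown grader type: {grader_type!r}")
    missing = REQUIRED_FIELDS[grader_type] - set(grader)
    if missing:
        raise ValueError(f"{grader_type}: missing fields {sorted(missing)}")
    if grader_type == "text_match" and grader["condition"] not in ALLOWED_CONDITIONS:
        raise ValueError(f"text_match: unsupported condition {grader['condition']!r}")
```

Running a check like this over every grader before launching the agent process surfaces configuration mistakes without spending any model calls.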

## ECP in 60 Seconds

ECP is JSON-RPC 2.0 over stdio. The runtime launches your agent process and calls:

- `agent/initialize`
- `agent/step`
- `agent/reset`

Your agent replies with a structured result containing:

- `public_output` (what the user sees)
- `private_thought` (for evaluators)
- `tool_calls` (actions taken)

See `spec/protocol.md` for the full protocol.
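
The exchange described above can be sketched as a small stdio loop. This is an illustration, not the SDK's `serve` implementation. It assumes newline-delimited JSON messages and an `input` parameter on `agent/step`, and it returns empty results for `agent/initialize` and `agent/reset`; the real framing and payloads are defined in the spec. The names `handle_request`, `serve_stdio`, and `echo_step` are hypothetical.

```python
import json
import sys
from typing import IO, Callable


def handle_request(request: dict, step: Callable[[str], dict]) -> dict:
    """Dispatch one JSON-RPC 2.0 request to the matching agent method."""
    method = request.get("method")
    params = request.get("params") or {}
    if method == "agent/step":
        result = step(params.get("input", ""))  # assumed parameter name
    elif method in ("agent/initialize", "agent/reset"):
        result = {}  # placeholder payload
    else:
        # -32601 is the standard JSON-RPC "Method not found" error code.
        return {"jsonrpc": "2.0", "id": request.get("id"),
                "error": {"code": -32601, "message": "Method not found"}}
    return {"jsonrpc": "2.0", "id": request.get("id"), "result": result}


def serve_stdio(step: Callable[[str], dict],
                stdin: IO[str] = sys.stdin,
                stdout: IO[str] = sys.stdout) -> None:
    """Read one request per line and write one response per line."""
    for line in stdin:
        line = line.strip()
        if not line:
            continue
        response = handle_request(json.loads(line), step)
        stdout.write(json.dumps(response) + "\n")
        stdout.flush()


def echo_step(text: str) -> dict:
    """Toy agent step returning the three documented result fields."""
    return {
        "public_output": text.upper(),
        "private_thought": "Echoing the input in upper case.",
        "tool_calls": [],
    }
```

Calling `serve_stdio(echo_step)` would turn the process into a toy agent that answers each `agent/step` request with the upper-cased input.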

## Repo Layout

- `sdk/python/src/ecp` - SDK decorators, adapters, and server loop
- `runtime/python/src/ecp_runtime` - CLI, runner, graders, reporting, and trend analysis
- `examples/` - Demo agents and manifests for LangChain, LlamaIndex, CrewAI, and PydanticAI
- `runtime/python/tests` - Runtime unit and CLI smoke tests
- `sdk/python/tests` - Adapter normalization tests

## Status

This project is evolving quickly. Expect changes between minor versions.