https://github.com/metronis-space/aegis

The Adaptive Intelligence Layer for AI Agents — eval, train, and memory on one platform.

agent-eval ai-agents benchmarks evaluation grpo llm memory reinforcement-learning


# Aegis

[![CI](https://github.com/metronis-space/aegis/actions/workflows/ci.yml/badge.svg)](https://github.com/metronis-space/aegis/actions/workflows/ci.yml)
[![Python 3.11+](https://img.shields.io/badge/python-3.11%2B-blue.svg)](https://www.python.org/downloads/)
[![License: Apache 2.0](https://img.shields.io/badge/License-Apache_2.0-green.svg)](https://opensource.org/licenses/Apache-2.0)

**The Closed-Loop Intelligence Engine for AI Agents** — eval, train, and deploy smarter agents.

Aegis by [Metronis, Inc.](https://metronis.space/) is an open-source framework that evaluates AI agents across 125 dimensions, identifies weaknesses, trains improved models via RL, and proves the improvement with rigorous before/after evaluation.

## Benchmark Evidence Tiers

Aegis exposes three benchmark evidence tiers, and they are not interchangeable:

- `internal proxy`: built-in `*-proxy` suites are for internal regression tracking and ablations.
- `public proxy`: built-in `*-heldout` suites use held-out slices and benchmark-native metrics, but they are still not official leaderboard-grade evidence.
- `claim-grade`: only manifest-backed claim suites are intended for externally claimable benchmark results.

The built-in `legal` alias currently resolves to `legal-heldout`, which is a held-out proxy suite, not an official claim-grade legal benchmark. Treat strict built-in runs as honest proxy evidence unless the suite contract explicitly reports claim eligibility.
The repo does not bundle claim-grade legal suite JSON by default; install real
manifests under `benchmarks/claim_suites/` or set `AEGIS_BENCHMARK_SUITE_DIR`
to an external manifest directory.
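
The naming and lookup rules above can be sketched in a few lines of Python. This is an illustration only: `classify_suite`, `resolve_claim_suite_dir`, and the `ALIASES` table are hypothetical names, not Aegis APIs, and the precedence shown (environment variable first, then the in-repo directory) is an assumption. Only the `*-proxy` and `*-heldout` suffixes, the `legal` alias, `benchmarks/claim_suites/`, and `AEGIS_BENCHMARK_SUITE_DIR` come from this README.

```python
import os
from pathlib import Path

# Hypothetical alias table; this README only documents the `legal` alias.
ALIASES = {"legal": "legal-heldout"}


def classify_suite(name: str, has_manifest: bool = False) -> str:
    """Map a suite name to its evidence tier using the documented suffixes."""
    resolved = ALIASES.get(name, name)
    if resolved.endswith("-proxy"):
        return "internal proxy"
    if resolved.endswith("-heldout"):
        return "public proxy"
    # Only manifest-backed suites are treated as claim-grade.
    return "claim-grade" if has_manifest else "unknown"


def resolve_claim_suite_dir(repo_root: Path) -> Path:
    """Pick the manifest directory (assumed precedence: env var, then repo default)."""
    override = os.environ.get("AEGIS_BENCHMARK_SUITE_DIR")
    if override:
        return Path(override)
    return repo_root / "benchmarks" / "claim_suites"


print(classify_suite("legal"))  # public proxy
```

The point of the sketch is that a suite name alone never yields `claim-grade`; that tier requires a manifest.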

## The Pipeline

```
Traces In → Eval → Diagnose Weaknesses → Spin RL Environments →
Train (GRPO + Continuous Rewards) → Store Results to Memory →
Train with Tools + Self-Managed Memory → Final Eval (Prove Improvement)
```

```mermaid
flowchart LR
A["Agent Traces"] --> B["Aegis Eval\n125 dimensions"]
B --> C["Weakness\nDiagnosis"]
C --> D["RL Environments\nauto-generated"]
D --> E["GRPO Training\ncontinuous rewards"]
E --> F["Aegis Memory\nextract strategies"]
F --> G["Tool + Memory\nTraining"]
G --> H["Final Eval\nprove improvement"]
H --> A
```
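
The GRPO step in the pipeline scores several rollouts of the same prompt and rewards each one relative to its group. A minimal sketch of that group-relative advantage follows, using only the standard library. `group_advantages` is a hypothetical helper, not Aegis's training code. The `normalize_std=False` branch reflects how the Dr. GRPO variant is generally described (mean-centering without dividing by the group standard deviation); whether Aegis's `dr_grpo` optimizer matches it exactly is an assumption.

```python
from statistics import fmean, pstdev


def group_advantages(rewards: list[float], normalize_std: bool = True,
                     eps: float = 1e-8) -> list[float]:
    """Advantage of each rollout relative to the other rollouts for one prompt."""
    mean = fmean(rewards)
    centered = [r - mean for r in rewards]
    if not normalize_std:
        return centered  # Dr. GRPO-style: mean-centering only
    std = pstdev(rewards)
    # eps keeps the division finite when every rollout got the same reward.
    return [c / (std + eps) for c in centered]


# Continuous rewards in [0, 1] for four rollouts of the same prompt.
rewards = [0.2, 0.4, 0.6, 0.8]
print(group_advantages(rewards, normalize_std=False))
print(group_advantages(rewards))
```

Rollouts that beat the group mean get a positive advantage and the rest get a negative one, so no separate value model is needed to form a baseline.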

## Three Products

| Product | What it does | Status |
|---------|-------------|--------|
| **Aegis Eval** | 125 dimensions across 7 tiers, triangulated scoring (rule + semantic + LLM judge), legal & finance domain plugins | Working |
| **Aegis Train** | GRPO-based RL with continuous reward functions, environment factory, Observatory monitoring | Building |
| **Aegis Memory** | 12 operations, knowledge graph, vector store, provenance tracking, 7 time horizons | Working |
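
Triangulated scoring combines three independent signals (rule, semantic, LLM judge) for each dimension. One plausible way to combine them is a weighted mean plus a disagreement flag, sketched below. The weights, the `0.3` spread threshold, and the names `triangulate` and `TriangulatedScore` are illustrative assumptions rather than Aegis's implementation; the Scoring page listed under Documentation is the place to check the real behavior.

```python
from dataclasses import dataclass


@dataclass
class TriangulatedScore:
    score: float        # weighted mean of the three signals, in [0, 1]
    spread: float       # max minus min across the signals
    needs_review: bool  # True when the scorers disagree strongly


def triangulate(rule: float, semantic: float, judge: float,
                weights: tuple[float, float, float] = (0.3, 0.3, 0.4),
                max_spread: float = 0.3) -> TriangulatedScore:
    """Combine three scorer outputs into one score and flag disagreement."""
    signals = (rule, semantic, judge)
    if any(not 0.0 <= s <= 1.0 for s in signals):
        raise ValueError("each signal must be in [0, 1]")
    score = sum(w * s for w, s in zip(weights, signals)) / sum(weights)
    spread = max(signals) - min(signals)
    return TriangulatedScore(score, spread, spread > max_spread)


result = triangulate(rule=1.0, semantic=0.8, judge=0.4)
print(f"{result.score:.2f} review={result.needs_review}")
```

Keeping the spread alongside the score makes it visible when one scorer is carrying the result, which a plain average would hide.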

## Installation

```bash
pip install -e ".[dev,all]"
```

## Quick Start

### Evaluate an agent

```python
from aegis import Evaluator, EvalConfig

evaluator = Evaluator(config=EvalConfig(dimensions="all"))
result = evaluator.run()

print(f"Overall score: {result.overall_score:.2%}")
for tier_name, tier_score in result.tier_scores.items():
    print(f"  {tier_name}: {tier_score:.2%}")
```

### CLI

```bash
aegis eval run --config eval.yaml    # Run evaluation
aegis eval dimensions                # List all 125 dimensions
aegis eval benchmark-list            # Inspect proxy vs claim-grade suite status
aegis train start --model Qwen/Qwen2.5-7B --optimizer dr_grpo    # Start a training run
```

### Toolathlon SOTA on OVHcloud

```bash
cp .env.ovhcloud.example .env
aegis train ovhcloud-doctor
aegis pipeline toolathlon-sota \
  --baseline-dir results/toolathlon_full \
  --output results/toolathlon_sota_run.json
```

## Architecture

```
src/aegis/
├── core/           # Types, config, settings (5 files)
├── cli/            # Typer CLI (1 file)
├── adapters/       # OpenAI, Anthropic, LangChain, REST (6 files)
├── api/            # FastAPI + eval/training/memory routes (9 files)
├── data/           # CUAD, LegalBench, FinanceBench loaders (5 files)
├── ingestion/      # Document parsing pipeline (6 files)
├── eval/           # Engine, 7-tier dimensions, scorers, judges (27 files)
├── environments/   # Legal & finance tool-use RL environments (4 files)
├── training/       # GRPO engine, rewards, rollouts, optimizers (16 files)
├── memory/         # 12 ops, graph, vectors, provenance (16 files)
├── plugins/        # Legal (18 dims) + Finance (20 dims) (5 files)
├── observatory/    # Goodhart detection, efficiency tracking (4 files)
├── security/       # RBAC (2 files)
└── store/          # SQLite, Postgres, Neo4j (5 files)
```

~100 source files. One pipeline. No bloat.
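
The `observatory/` module is described above as handling Goodhart detection, that is, noticing when the training reward keeps climbing while real quality does not. A minimal sketch of one such check compares the change in training reward against the change in a held-out eval score over a recent window. The function name, window size, and threshold are assumptions for illustration, not the Observatory's actual logic.

```python
def goodhart_suspected(train_rewards: list[float], heldout_scores: list[float],
                       window: int = 3, min_gap: float = 0.05) -> bool:
    """Flag when reward rose but held-out score fell over the last `window` checkpoints."""
    if len(train_rewards) != len(heldout_scores):
        raise ValueError("series must have one entry per checkpoint")
    if len(train_rewards) <= window:
        return False  # not enough history to compare
    reward_delta = train_rewards[-1] - train_rewards[-1 - window]
    heldout_delta = heldout_scores[-1] - heldout_scores[-1 - window]
    diverging = reward_delta > 0 and heldout_delta < 0
    return diverging and (reward_delta - heldout_delta) >= min_gap


# Reward climbs steadily while the held-out score drifts down.
rewards = [0.50, 0.60, 0.70, 0.80, 0.90]
heldout = [0.60, 0.62, 0.60, 0.58, 0.55]
print(goodhart_suspected(rewards, heldout))
```

A check like this only makes sense when the held-out score comes from data the reward function never sees during training.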

## Adapters

| Adapter | Framework |
|---------|-----------|
| `OpenAIAdapter` | OpenAI Assistants + Chat Completions |
| `AnthropicAdapter` | Anthropic Messages API |
| `LangChainAdapter` | LangChain agents |
| `RESTAdapter` | Any REST API |

## Domain Plugins

- **Legal**: 18 dimensions, CUAD dataset, contract clause extraction, citation verification
- **Finance**: 20 dimensions, FinanceBench + SEC EDGAR, numerical accuracy, formula validation
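
For numerical accuracy checks like those in the finance plugin, a continuous score can be more useful than exact match because it gives partial credit to near misses. The sketch below shows one such scoring rule under stated assumptions: `numeric_score` and the 5% tolerance are hypothetical, not the plugin's implementation.

```python
def numeric_score(predicted: float, expected: float, tolerance: float = 0.05) -> float:
    """Return 1.0 for an exact match, decaying linearly to 0.0 at `tolerance` relative error."""
    if tolerance <= 0:
        raise ValueError("tolerance must be positive")
    scale = max(abs(expected), 1e-12)  # avoid dividing by zero when expected == 0
    rel_error = abs(predicted - expected) / scale
    return max(0.0, 1.0 - rel_error / tolerance)


# A reported revenue of 102.5 against a reference of 100 is off by 2.5%.
print(numeric_score(102.5, 100.0))
```

A score of this shape can also serve as a continuous reward, since small improvements in the answer move the score instead of leaving it at zero.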

## Documentation

| Topic | Link |
|-------|------|
| Quick Start | [`docs/quickstart.md`](docs/quickstart.md) |
| Architecture | [`docs/architecture.md`](docs/architecture.md) |
| Master Implementation Plan | [`docs/master-implementation-plan.md`](docs/master-implementation-plan.md) |
| CLI Reference | [`docs/cli-reference.md`](docs/cli-reference.md) |
| API Reference | [`docs/api-reference.md`](docs/api-reference.md) |
| Dimensions | [`docs/dimensions.md`](docs/dimensions.md) |
| Scoring | [`docs/scoring.md`](docs/scoring.md) |
| Adapters | [`docs/adapters.md`](docs/adapters.md) |
| Plugins | [`docs/plugins.md`](docs/plugins.md) |

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md).

## License

Apache License 2.0. See [LICENSE](LICENSE).

Built by [Metronis, Inc.](https://metronis.space/)