https://github.com/metronis-space/aegis
The Adaptive Intelligence Layer for AI Agents — eval, train, and memory on one platform.
agent-eval ai-agents benchmarks evaluation grpo llm memory reinforcement-learning
- Host: GitHub
- URL: https://github.com/metronis-space/aegis
- Owner: metronis-space
- License: apache-2.0
- Created: 2026-02-22T05:34:48.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2026-04-19T10:22:15.000Z (2 months ago)
- Last Synced: 2026-04-19T11:27:13.773Z (2 months ago)
- Topics: agent-eval, ai-agents, benchmarks, evaluation, grpo, llm, memory, reinforcement-learning
- Language: Python
- Homepage: https://metronis.space/
- Size: 8 MB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 27
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- Funding: .github/FUNDING.yml
- License: LICENSE
- Agents: AGENTS.md
# Aegis
**The Closed-Loop Intelligence Engine for AI Agents** — eval, train, and deploy smarter agents.
Aegis by [Metronis, Inc.](https://metronis.space/) is an open-source framework that evaluates AI agents across 125 dimensions, identifies weaknesses, trains improved models via RL, and proves the improvement with rigorous before/after evaluation.
## Benchmark Evidence Tiers
Aegis exposes three benchmark evidence tiers, and they are not interchangeable:
- `internal proxy`: built-in `*-proxy` suites are for internal regression tracking and ablations.
- `public proxy`: built-in `*-heldout` suites use held-out slices and benchmark-native metrics, but they are still not official leaderboard-grade evidence.
- `claim-grade`: only manifest-backed claim suites are intended for externally claimable benchmark results.
The built-in `legal` alias currently resolves to `legal-heldout`, which is a held-out proxy suite, not an official claim-grade legal benchmark. Treat strict built-in runs as honest proxy evidence unless the suite contract explicitly reports claim eligibility.
The repo does not bundle claim-grade legal suite JSON by default; install real
manifests under `benchmarks/claim_suites/` or set `AEGIS_BENCHMARK_SUITE_DIR`
to an external manifest directory.
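
As a rough illustration of the lookup described above, the sketch below resolves a manifest directory from `AEGIS_BENCHMARK_SUITE_DIR`, falls back to `benchmarks/claim_suites/`, and classifies a suite name by its suffix. The names `resolve_suite_dir` and `evidence_tier` are hypothetical, and the rule that a claim suite is backed by a `<name>.json` manifest is an assumption; the actual resolution logic in Aegis may differ.

```python
import os
from pathlib import Path

# Default manifest location named in this README.
DEFAULT_SUITE_DIR = Path("benchmarks/claim_suites")
# Alias documented in this README: `legal` resolves to `legal-heldout`.
ALIASES = {"legal": "legal-heldout"}


def resolve_suite_dir(env=None) -> Path:
    """Return the manifest directory, preferring AEGIS_BENCHMARK_SUITE_DIR."""
    env = os.environ if env is None else env
    override = env.get("AEGIS_BENCHMARK_SUITE_DIR")
    return Path(override) if override else DEFAULT_SUITE_DIR


def evidence_tier(suite_name: str, suite_dir: Path) -> str:
    """Classify a suite by naming convention (hypothetical helper)."""
    name = ALIASES.get(suite_name, suite_name)
    if name.endswith("-proxy"):
        return "internal proxy"
    if name.endswith("-heldout"):
        return "public proxy"
    # Assumption: a claim suite is backed by a JSON manifest named after it.
    if (suite_dir / f"{name}.json").is_file():
        return "claim-grade"
    return "unknown"
```

Under these assumptions, `evidence_tier("legal", resolve_suite_dir())` reports `public proxy`, which matches the alias behavior described above.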
## The Pipeline
```
Traces In → Eval → Diagnose Weaknesses → Spin RL Environments →
Train (GRPO + Continuous Rewards) → Store Results to Memory →
Train with Tools + Self-Managed Memory → Final Eval (Prove Improvement)
```
```mermaid
flowchart LR
A["Agent Traces"] --> B["Aegis Eval\n125 dimensions"]
B --> C["Weakness\nDiagnosis"]
C --> D["RL Environments\nauto-generated"]
D --> E["GRPO Training\ncontinuous rewards"]
E --> F["Aegis Memory\nextract strategies"]
F --> G["Tool + Memory\nTraining"]
G --> H["Final Eval\nprove improvement"]
H --> A
```
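
The training stage relies on group-relative advantages. The sketch below shows the usual GRPO normalization for one group of rollouts sampled from the same prompt, plus a Dr. GRPO style variant that only mean-centers the rewards. It is a minimal illustration, not Aegis's training code: the function name `group_advantages`, the `eps` default, and the use of the population standard deviation are assumptions (some implementations use the sample estimate).

```python
from statistics import fmean, pstdev


def group_advantages(rewards, normalize_std=True, eps=1e-8):
    """Compute per-rollout advantages for one group of continuous rewards."""
    if not rewards:
        return []
    mean = fmean(rewards)
    centered = [r - mean for r in rewards]
    if not normalize_std:
        # Dr. GRPO style: mean-centering only, no division by the std.
        return centered
    std = pstdev(rewards)
    # eps keeps the division finite when every rollout earns the same reward.
    return [c / (std + eps) for c in centered]
```

When every rollout in a group earns the same reward, both variants return zeros, so that group contributes no gradient signal.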
## Three Products
| Product | What it does | Status |
|---------|-------------|--------|
| **Aegis Eval** | 125 dimensions across 7 tiers, triangulated scoring (rule + semantic + LLM judge), legal & finance domain plugins | Working |
| **Aegis Train** | GRPO-based RL with continuous reward functions, environment factory, Observatory monitoring | Building |
| **Aegis Memory** | 12 operations, knowledge graph, vector store, provenance tracking, 7 time horizons | Working |
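
Triangulated scoring combines three independent signals per dimension. One plausible way to do that, shown purely as an assumption, is a weighted mean with a disagreement flag. The weights, the `0.3` spread threshold, and the names `TriangulatedScore` and `triangulate` are hypothetical and are not taken from the Aegis codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TriangulatedScore:
    value: float         # combined score in [0, 1]
    disagreement: float  # spread between the highest and lowest signal
    needs_review: bool   # True when the signals diverge too much


def triangulate(rule, semantic, judge, weights=(0.3, 0.3, 0.4), max_spread=0.3):
    """Combine rule, semantic, and LLM-judge scores (each in [0, 1])."""
    signals = (rule, semantic, judge)
    if any(not 0.0 <= s <= 1.0 for s in signals):
        raise ValueError("scores must lie in [0, 1]")
    # Weighted mean; dividing by the total lets the weights be unnormalized.
    value = sum(w * s for w, s in zip(weights, signals)) / sum(weights)
    spread = max(signals) - min(signals)
    return TriangulatedScore(value, spread, spread > max_spread)
```

Flagging large spreads rather than silently averaging them is one way to surface cases where the rule check and the LLM judge disagree.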
## Installation
```bash
pip install -e ".[dev,all]"
```
## Quick Start
### Evaluate an agent
```python
from aegis import Evaluator, EvalConfig

# Evaluate across every available dimension.
evaluator = Evaluator(config=EvalConfig(dimensions="all"))
result = evaluator.run()

print(f"Overall score: {result.overall_score:.2%}")
# Per-tier breakdown of the overall score.
for tier_name, tier_score in result.tier_scores.items():
    print(f"  {tier_name}: {tier_score:.2%}")
```
### CLI
```bash
aegis eval run --config eval.yaml # Run evaluation
aegis eval dimensions # List all 125 dimensions
aegis eval benchmark-list # Inspect proxy vs claim-grade suite status
aegis train start --model Qwen/Qwen2.5-7B --optimizer dr_grpo
```
### Toolathlon SOTA on OVHcloud
```bash
cp .env.ovhcloud.example .env
aegis train ovhcloud-doctor
aegis pipeline toolathlon-sota \
--baseline-dir results/toolathlon_full \
--output results/toolathlon_sota_run.json
```
## Architecture
```
src/aegis/
├── core/ # Types, config, settings (5 files)
├── cli/ # Typer CLI (1 file)
├── adapters/ # OpenAI, Anthropic, LangChain, REST (6 files)
├── api/ # FastAPI + eval/training/memory routes (9 files)
├── data/ # CUAD, LegalBench, FinanceBench loaders (5 files)
├── ingestion/ # Document parsing pipeline (6 files)
├── eval/ # Engine, 7-tier dimensions, scorers, judges (27 files)
├── environments/ # Legal & finance tool-use RL environments (4 files)
├── training/ # GRPO engine, rewards, rollouts, optimizers (16 files)
├── memory/ # 12 ops, graph, vectors, provenance (16 files)
├── plugins/ # Legal (18 dims) + Finance (20 dims) (5 files)
├── observatory/ # Goodhart detection, efficiency tracking (4 files)
├── security/ # RBAC (2 files)
└── store/ # SQLite, Postgres, Neo4j (5 files)
```
~100 source files. One pipeline. No bloat.
## Adapters
| Adapter | Framework |
|---------|-----------|
| `OpenAIAdapter` | OpenAI Assistants + Chat Completions |
| `AnthropicAdapter` | Anthropic Messages API |
| `LangChainAdapter` | LangChain agents |
| `RESTAdapter` | Any REST API |
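
The adapters above wrap different agent frameworks so they can be evaluated the same way. This README does not show the shared interface, so the sketch below is a generic illustration of the adapter pattern only; `AgentAdapter`, `respond`, `CallableAdapter`, and `collect_traces` are hypothetical names, not Aegis classes or functions.

```python
from typing import Callable, Protocol


class AgentAdapter(Protocol):
    """Hypothetical common interface an evaluator could call."""

    def respond(self, prompt: str) -> str: ...


class CallableAdapter:
    """Wrap any plain function as an adapter (illustrative only)."""

    def __init__(self, fn: Callable[[str], str]) -> None:
        self._fn = fn

    def respond(self, prompt: str) -> str:
        return self._fn(prompt)


def collect_traces(adapter: AgentAdapter, prompts: list[str]) -> list[dict]:
    """Run each prompt through the adapter and record prompt/response pairs."""
    return [{"prompt": p, "response": adapter.respond(p)} for p in prompts]
```

For example, `collect_traces(CallableAdapter(str.upper), ["hi"])` produces one trace whose response is the uppercased prompt.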
## Domain Plugins
- **Legal**: 18 dimensions, CUAD dataset, contract clause extraction, citation verification
- **Finance**: 20 dimensions, FinanceBench + SEC EDGAR, numerical accuracy, formula validation
## Documentation
| Topic | Link |
|-------|------|
| Quick Start | [`docs/quickstart.md`](docs/quickstart.md) |
| Architecture | [`docs/architecture.md`](docs/architecture.md) |
| Master Implementation Plan | [`docs/master-implementation-plan.md`](docs/master-implementation-plan.md) |
| CLI Reference | [`docs/cli-reference.md`](docs/cli-reference.md) |
| API Reference | [`docs/api-reference.md`](docs/api-reference.md) |
| Dimensions | [`docs/dimensions.md`](docs/dimensions.md) |
| Scoring | [`docs/scoring.md`](docs/scoring.md) |
| Adapters | [`docs/adapters.md`](docs/adapters.md) |
| Plugins | [`docs/plugins.md`](docs/plugins.md) |
## Contributing
See [CONTRIBUTING.md](CONTRIBUTING.md).
## License
Apache License 2.0. See [LICENSE](LICENSE).
Built by [Metronis, Inc.](https://metronis.space/)