https://github.com/metronis-space/aegis

The Adaptive Intelligence Layer for AI Agents — eval, train, and memory on one platform.

agent-eval ai-agents benchmarks evaluation grpo llm memory reinforcement-learning

Last synced: 2 months ago

Host: GitHub
URL: https://github.com/metronis-space/aegis
Owner: metronis-space
License: apache-2.0
Created: 2026-02-22T05:34:48.000Z (4 months ago)
Default Branch: main
Last Pushed: 2026-04-19T10:22:15.000Z (2 months ago)
Last Synced: 2026-04-19T11:27:13.773Z (2 months ago)
Topics: agent-eval, ai-agents, benchmarks, evaluation, grpo, llm, memory, reinforcement-learning
Language: Python
Homepage: https://metronis.space/
Size: 8 MB
Stars: 1
Watchers: 0
Forks: 0
Open Issues: 27
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- Funding: .github/FUNDING.yml
- License: LICENSE
- Agents: AGENTS.md

README

# Aegis

[![CI](https://github.com/metronis-space/aegis/actions/workflows/ci.yml/badge.svg)](https://github.com/metronis-space/aegis/actions/workflows/ci.yml)

[![Python 3.11+](https://img.shields.io/badge/python-3.11%2B-blue.svg)](https://www.python.org/downloads/)

[![License: Apache 2.0](https://img.shields.io/badge/License-Apache_2.0-green.svg)](https://opensource.org/licenses/Apache-2.0)

**The Closed-Loop Intelligence Engine for AI Agents** — eval, train, and deploy smarter agents.

Aegis by [Metronis, Inc.](https://metronis.space/) is an open-source framework that evaluates AI agents across 125 dimensions, identifies weaknesses, trains improved models via RL, and proves the improvement with rigorous before/after evaluation.

## Benchmark Evidence Tiers

Aegis exposes three benchmark evidence modes, and they are not interchangeable:

- `internal proxy`: built-in `*-proxy` suites are for internal regression tracking and ablations.

- `public proxy`: built-in `*-heldout` suites use held-out slices and benchmark-native metrics, but they are still not official leaderboard-grade evidence.

- `claim-grade`: only manifest-backed claim suites are intended for externally claimable benchmark results.

The built-in `legal` alias currently resolves to `legal-heldout`, which is a held-out proxy suite, not an official claim-grade legal benchmark. Treat strict built-in runs as honest proxy evidence unless the suite contract explicitly reports claim eligibility.

The repo does not bundle claim-grade legal suite JSON by default; install real manifests under `benchmarks/claim_suites/` or set `AEGIS_BENCHMARK_SUITE_DIR` to an external manifest directory.
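The naming rules above can be sketched in a few lines of Python. This is an illustrative sketch, not Aegis's actual resolver: the `evidence_tier` and `manifest_dir` helpers, the `ALIASES` table, and the one-JSON-file-per-suite manifest layout are all assumptions made for the example. Only the `*-proxy` and `*-heldout` suffixes, the `legal` alias, the `benchmarks/claim_suites/` path, and the `AEGIS_BENCHMARK_SUITE_DIR` variable come from this README.

```python
import os
from pathlib import Path

# Hypothetical alias table; the README documents only `legal` -> `legal-heldout`.
ALIASES = {"legal": "legal-heldout"}


def manifest_dir() -> Path:
    """Return the claim-suite directory; the env var overrides the repo default."""
    return Path(os.environ.get("AEGIS_BENCHMARK_SUITE_DIR", "benchmarks/claim_suites"))


def evidence_tier(suite: str) -> str:
    """Classify a suite name into one of the three evidence tiers."""
    name = ALIASES.get(suite, suite)
    # Built-in suites are always proxy evidence, whatever manifests are installed.
    if name.endswith("-heldout"):
        return "public proxy"
    if name.endswith("-proxy"):
        return "internal proxy"
    # Assumption: a claim suite is a single JSON manifest named after the suite.
    if (manifest_dir() / f"{name}.json").is_file():
        return "claim-grade"
    return "unknown"


print(evidence_tier("legal"))  # public proxy
```

Checking the built-in suffixes before looking for a manifest keeps a stray manifest file from silently promoting a proxy suite to claim-grade.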

## The Pipeline

```
Traces In → Eval → Diagnose Weaknesses → Spin RL Environments →
Train (GRPO + Continuous Rewards) → Store Results to Memory →
Train with Tools + Self-Managed Memory → Final Eval (Prove Improvement)
```

```mermaid
flowchart LR
    A["Agent Traces"] --> B["Aegis Eval\n125 dimensions"]
    B --> C["Weakness\nDiagnosis"]
    C --> D["RL Environments\nauto-generated"]
    D --> E["GRPO Training\ncontinuous rewards"]
    E --> F["Aegis Memory\nextract strategies"]
    F --> G["Tool + Memory\nTraining"]
    G --> H["Final Eval\nprove improvement"]
    H --> A
```
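To make the "GRPO + continuous rewards" step concrete, here is a minimal sketch of the group-relative advantage at the core of GRPO: several rollouts are sampled for one prompt, and each rollout is scored against the group mean instead of a learned value baseline. This shows the general technique only and is not taken from the Aegis training code. The function name, the use of population standard deviation, and the `eps` value are assumptions for illustration. Setting `normalize_std=False` gives the mean-centered form used by variants such as Dr. GRPO, which drop the standard-deviation division.

```python
from statistics import mean, pstdev


def group_advantages(
    rewards: list[float],
    normalize_std: bool = True,
    eps: float = 1e-6,
) -> list[float]:
    """Group-relative advantages for rollouts sampled from the same prompt."""
    baseline = mean(rewards)  # the group mean acts as the baseline
    centered = [r - baseline for r in rewards]
    if not normalize_std:
        return centered  # mean-centered only
    scale = pstdev(rewards) + eps  # eps guards against a zero-variance group
    return [c / scale for c in centered]


# Continuous rewards in [0, 1] for four rollouts of one prompt.
advantages = group_advantages([0.10, 0.45, 0.45, 0.90])
print([round(a, 2) for a in advantages])
```

With continuous rewards, near-misses still receive a graded signal: a rollout scoring 0.45 lands close to the group mean, whereas a binary pass/fail reward would treat it the same as the rollout scoring 0.10.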

## Three Products

| Product | What it does | Status |
|---------|-------------|--------|
| **Aegis Eval** | 125 dimensions across 7 tiers, triangulated scoring (rule + semantic + LLM judge), legal & finance domain plugins | Working |
| **Aegis Train** | GRPO-based RL with continuous reward functions, environment factory, Observatory monitoring | Building |
| **Aegis Memory** | 12 operations, knowledge graph, vector store, provenance tracking, 7 time horizons | Working |
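One plausible reading of "triangulated scoring" is that each dimension receives three independent scores (rule-based, semantic, LLM judge) that are combined into one value, with large disagreement between scorers surfaced for review. The sketch below illustrates that idea only. `ScorerOutputs`, `triangulate`, the weights, and the disagreement threshold are hypothetical and are not Aegis's scoring API; `docs/scoring.md` is the place to look for the actual method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScorerOutputs:
    """Scores in [0, 1] from three independent scorers (hypothetical shape)."""
    rule: float
    semantic: float
    judge: float


def triangulate(
    scores: ScorerOutputs,
    weights: tuple[float, float, float] = (0.3, 0.3, 0.4),  # assumed weights
    max_spread: float = 0.3,  # assumed disagreement threshold
) -> tuple[float, bool]:
    """Return (weighted mean score, whether the scorers disagree strongly)."""
    values = (scores.rule, scores.semantic, scores.judge)
    if any(not 0.0 <= v <= 1.0 for v in values):
        raise ValueError("scores must lie in [0, 1]")
    combined = sum(w * v for w, v in zip(weights, values)) / sum(weights)
    disagrees = max(values) - min(values) > max_spread
    return combined, disagrees


score, flagged = triangulate(ScorerOutputs(rule=1.0, semantic=0.8, judge=0.9))
print(f"{score:.2f}", flagged)
```

Returning the disagreement flag alongside the score keeps a confident-looking average from hiding a case where one scorer sharply contradicts the others.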

## Installation

```bash
pip install -e ".[dev,all]"
```

## Quick Start

### Evaluate an agent

```python
from aegis import Evaluator, EvalConfig

evaluator = Evaluator(config=EvalConfig(dimensions="all"))
result = evaluator.run()

print(f"Overall score: {result.overall_score:.2%}")
for tier_name, tier_score in result.tier_scores.items():
    print(f"  {tier_name}: {tier_score:.2%}")
```

### CLI

```bash
aegis eval run --config eval.yaml     # Run evaluation
aegis eval dimensions                 # List all 125 dimensions
aegis eval benchmark-list             # Inspect proxy vs claim-grade suite status
aegis train start --model Qwen/Qwen2.5-7B --optimizer dr_grpo
```

### Toolathlon SOTA on OVHcloud

```bash
cp .env.ovhcloud.example .env
aegis train ovhcloud-doctor
aegis pipeline toolathlon-sota \
  --baseline-dir results/toolathlon_full \
  --output results/toolathlon_sota_run.json
```

## Architecture

```
src/aegis/
├── core/           # Types, config, settings (5 files)
├── cli/            # Typer CLI (1 file)
├── adapters/       # OpenAI, Anthropic, LangChain, REST (6 files)
├── api/            # FastAPI + eval/training/memory routes (9 files)
├── data/           # CUAD, LegalBench, FinanceBench loaders (5 files)
├── ingestion/      # Document parsing pipeline (6 files)
├── eval/           # Engine, 7-tier dimensions, scorers, judges (27 files)
├── environments/   # Legal & finance tool-use RL environments (4 files)
├── training/       # GRPO engine, rewards, rollouts, optimizers (16 files)
├── memory/         # 12 ops, graph, vectors, provenance (16 files)
├── plugins/        # Legal (18 dims) + Finance (20 dims) (5 files)
├── observatory/    # Goodhart detection, efficiency tracking (4 files)
├── security/       # RBAC (2 files)
└── store/          # SQLite, Postgres, Neo4j (5 files)
```

~110 source files (the per-directory counts above total 111). One pipeline. No bloat.

## Adapters

| Adapter | Framework |
|---------|-----------|
| `OpenAIAdapter` | OpenAI Assistants + Chat Completions |
| `AnthropicAdapter` | Anthropic Messages API |
| `LangChainAdapter` | LangChain agents |
| `RESTAdapter` | Any REST API |

## Domain Plugins

- **Legal**: 18 dimensions, CUAD dataset, contract clause extraction, citation verification

- **Finance**: 20 dimensions, FinanceBench + SEC EDGAR, numerical accuracy, formula validation

## Documentation

| Topic | Link |
|-------|------|
| Quick Start | [`docs/quickstart.md`](docs/quickstart.md) |
| Architecture | [`docs/architecture.md`](docs/architecture.md) |
| Master Implementation Plan | [`docs/master-implementation-plan.md`](docs/master-implementation-plan.md) |
| CLI Reference | [`docs/cli-reference.md`](docs/cli-reference.md) |
| API Reference | [`docs/api-reference.md`](docs/api-reference.md) |
| Dimensions | [`docs/dimensions.md`](docs/dimensions.md) |
| Scoring | [`docs/scoring.md`](docs/scoring.md) |
| Adapters | [`docs/adapters.md`](docs/adapters.md) |
| Plugins | [`docs/plugins.md`](docs/plugins.md) |

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md).

## License

Apache License 2.0. See [LICENSE](LICENSE).

Built by [Metronis, Inc.](https://metronis.space/)
