https://github.com/doronp/agentshield-benchmark
Open benchmark for AI agent security tools — prompt injection, data exfiltration, tool abuse, provenance
https://github.com/doronp/agentshield-benchmark
agent-security ai-security benchmark guardrails llm-security prompt-injection
Last synced: 4 months ago
JSON representation
Open benchmark for AI agent security tools — prompt injection, data exfiltration, tool abuse, provenance
- Host: GitHub
- URL: https://github.com/doronp/agentshield-benchmark
- Owner: doronp
- License: apache-2.0
- Created: 2026-02-15T09:09:25.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2026-02-15T16:27:22.000Z (4 months ago)
- Last Synced: 2026-02-17T22:53:52.033Z (4 months ago)
- Topics: agent-security, ai-security, benchmark, guardrails, llm-security, prompt-injection
- Language: TypeScript
- Homepage:
- Size: 886 KB
- Stars: 9
- Watchers: 1
- Forks: 2
- Open Issues: 1
-
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Security: SECURITY.md
Awesome Lists containing this project
README
AgentShield Benchmark
The first head-to-head benchmark of commercial agent protection providers.
AgentShield is an open, reproducible benchmark suite that evaluates how well commercial AI agent security products defend against real-world attacks — and how much they cost you in latency, false positives, and dollars.
## Disclosure
This benchmark is maintained by the team behind [Agent Guard](https://agentguard.co/). To ensure credibility, Agent Guard's results were obtained using our [Commit-Reveal Integrity Protocol](src/protocol/README.md) — a commit-reveal scheme with Ed25519 signatures that allows proprietary solutions to participate without revealing their implementation, while cryptographically proving result integrity. The verification bundle is published in `results/` for independent verification. Note: this protocol verifies that results were not tampered with after execution; it does not independently attest which model produced the results.
The test corpus, scoring methodology, and all adapter code are fully open source and auditable. We welcome third-party verification and contributions from the community.
If you believe any aspect of the methodology unfairly advantages or disadvantages a particular provider, please [open an issue](../../issues).
## Current Status
This benchmark currently includes tested results for **6 providers** across ML models, SaaS APIs, and pattern-based scanners with **537 test cases** across 8 categories. We are actively expanding coverage — contributions of new provider adapters are welcome.
### Latest Results
| Provider | Score | PI | Jailbreak | Data Exfil | Tool Abuse | Over-Refusal | Multi-Agent | Provenance | P50 (ms) |
|---|---|---|---|---|---|---|---|---|---|
| **AgentGuard**² | **98.4** | 98.5% | 97.8% | 100.0% | 100.0% | 100.0% | 100.0% | 85.0% | 1 |
| **Deepset DeBERTa** | **87.6** | 99.5% | 97.8% | 95.4% | 98.8% | 63.1% | 100.0% | 100.0% | 19 |
| **Lakera Guard** | **79.4** | 97.6% | 95.6% | 96.6% | 86.3% | 58.5% | 94.3% | 95.0% | 133 |
| ProtectAI DeBERTa v2 | 51.4 | 77.1% | 86.7% | 43.7% | 12.5% | 95.4% | 74.3% | 65.0% | 19 |
| ClawGuard | 38.9 | 62.9% | 22.2% | 40.2% | 17.5% | 100.0% | 40.0% | 25.0% | 0 |
| LLM Guard¹ | ~38.7 | 77.1% | — | 30.8% | 8.9% | — | — | — | 111 |
¹ Scored on 517-case corpus (pre-provenance). Re-run pending for 537-case corpus with updated penalty.
² Tested via Commit-Reveal Integrity Protocol (Ed25519 signatures) using a proprietary provenance-based solution. See [protocol documentation](src/protocol/README.md). Verification bundle included in results.
## Benchmark Categories
| # | Category | Tests | Weight | What It Measures |
|---|----------|-------|--------|-----------------|
| 1 | **Prompt Injection** | 205 | 20% | Direct, indirect, and context-manipulation injection attacks |
| 2 | **Jailbreak** | 45 | 10% | DAN variants, roleplay, authority impersonation, token smuggling |
| 3 | **Data Exfiltration** | 87 | 15% | Resistance to data leakage via tool calls, markdown, errors |
| 4 | **Tool Abuse** | 80 | 15% | Unauthorized tool calls, scope escalation, parameter tampering |
| 5 | **Over-Refusal** | 65 | 15% | False positive rate on legitimate requests (penalty only) |
| 6 | **Multi-Agent Security** | 35 | 10% | Cross-agent attacks, delegation exploits, trust boundary violations |
| 7 | **Latency Overhead** | — | 10% | Added latency (p50, p95, p99) from the protection layer |
| 8 | **Provenance & Audit** | 20 | 5% | Detecting fake authorization claims, spoofed provenance chains, unverifiable approvals |
**Total: 537 test cases** across 8 categories (7 scored + latency).
## Scoring
Each provider receives a **per-category score (0-100)** and a **composite score** computed as the weighted geometric mean across attack detection categories. Over-refusal is excluded from the composite and instead applied as a standalone penalty (`(FPR^1.3) * 40`) to avoid double-counting. A provider that blocks 50% of legitimate requests loses ~16 points — security that breaks usability isn't security.
## Quick Start
```bash
# Install dependencies
npm install
# Run the full benchmark suite
npm run benchmark
# Validate the test corpus
npm run validate-corpus
# Run tests
npm test
# Type check
npm run typecheck
```
> **Note:** Running the benchmark requires API keys or local services for each provider under test. Copy `.env.example` to `.env` and configure the providers you want to benchmark. See [PROVIDERS.md](./PROVIDERS.md) for setup instructions.
## Project Structure
```
agentshield-benchmark/
├── corpus/ # Attack and benign test cases (JSONL)
│ ├── categories.json # Category definitions and weights
│ ├── prompt-injection/ # 205 prompt injection test cases
│ ├── jailbreak/ # 45 jailbreak test cases
│ ├── data-exfiltration/ # 87 data exfiltration test cases
│ ├── tool-abuse/ # 80 tool abuse test cases
│ ├── over-refusal/ # 65 legitimate request test cases
│ ├── multi-agent/ # 35 multi-agent security test cases
│ └── provenance-audit/ # 20 provenance & audit test cases
├── src/
│ ├── types.ts # Core TypeScript interfaces
│ ├── runner.ts # Test runner engine
│ ├── scoring.ts # Scoring and aggregation
│ ├── run-benchmark.ts # CLI entry point with provider discovery
│ ├── adapters/ # Provider adapter implementations
│ └── protocol/ # Commit-reveal integrity protocol
├── scripts/
│ ├── hf-model-server.py # HuggingFace model server for ML-based providers
│ └── validate-corpus.sh # Corpus validation script
├── site/ # Static leaderboard website
├── docs/
│ └── providers.md # Provider research and API details
├── package.json
└── tsconfig.json
```
## Adding a Provider
See [`src/adapters/README.md`](./src/adapters/README.md) for the adapter interface. Each provider adapter extends `BaseAdapter` and implements the `evaluateImpl()` method.
## Reproducibility
AgentShield is designed for reproducible benchmark runs. Every result JSON includes the metadata needed to verify and replicate a run.
### Reproducing a Benchmark Run
1. **Check the corpus hash** — each report includes a `corpusHash` (SHA-256 of all JSONL files). Verify your local corpus matches:
```bash
# The runner prints the hash at startup:
# Computing corpus hash...
# Corpus hash: a1b2c3d4...
```
Compare this against the `corpusHash` field in the results JSON.
2. **Use a shuffle seed** — to get deterministic test ordering, pass a `shuffleSeed`:
```typescript
const report = await runBenchmark(providers, {
shuffle: true,
shuffleSeed: 42,
});
```
The same seed produces the same test order on the same corpus. The seed is recorded in `report.config.shuffleSeed`.
3. **Match the environment** — the report records `environment.os`, `environment.arch`, and `environment.nodeVersion`. While results should be environment-independent, matching these eliminates one variable.
### Commit-Reveal Integrity Protocol
AgentShield includes a commit-reveal protocol (`src/protocol/`) that allows vendors to run the benchmark locally on proprietary models while cryptographically proving result integrity. See [`src/protocol/README.md`](./src/protocol/README.md) for details.
### What the Results JSON Contains
| Field | Description |
|-------|-------------|
| `version` | Benchmark suite version |
| `corpusHash` | SHA-256 of corpus JSONL files — verifies test data integrity |
| `environment` | OS, architecture, Node.js version, and timestamp |
| `config.shuffleSeed` | PRNG seed used for test ordering (if set) |
| `providers[].providerVersion` | Version of each provider (package version or API version) |
| `providers[].results[]` | Individual test case results with timestamps |
### Environment Requirements
- **Node.js** >= 20.0.0
- **TypeScript** >= 5.7.0
- Provider-specific dependencies (see [PROVIDERS.md](./PROVIDERS.md))
## Contributing
See [CONTRIBUTING.md](./CONTRIBUTING.md) for guidelines. We welcome contributions of:
- New test cases (especially novel attack vectors)
- Provider adapters
- Scoring methodology improvements
Please open an issue before submitting large changes.
## License
Apache 2.0 — see [LICENSE](./LICENSE).