https://github.com/doronp/agentshield-benchmark

Open benchmark for AI agent security tools — prompt injection, data exfiltration, tool abuse, provenance

agent-security ai-security benchmark guardrails llm-security prompt-injection


Host: GitHub
URL: https://github.com/doronp/agentshield-benchmark
Owner: doronp
License: apache-2.0
Created: 2026-02-15T09:09:25.000Z (4 months ago)
Default Branch: main
Last Pushed: 2026-02-15T16:27:22.000Z (4 months ago)
Last Synced: 2026-02-17T22:53:52.033Z (4 months ago)
Topics: agent-security, ai-security, benchmark, guardrails, llm-security, prompt-injection
Language: TypeScript
Homepage:
Size: 886 KB
Stars: 9
Watchers: 1
Forks: 2
Open Issues: 1
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Security: SECURITY.md

# AgentShield Benchmark


The first head-to-head benchmark of commercial agent protection providers.


AgentShield is an open, reproducible benchmark suite that evaluates how well commercial AI agent security products defend against real-world attacks — and how much they cost you in latency, false positives, and dollars.

## Disclosure

This benchmark is maintained by the team behind [Agent Guard](https://agentguard.co/). To ensure credibility, Agent Guard's results were obtained using our [Commit-Reveal Integrity Protocol](src/protocol/README.md) — a commit-reveal scheme with Ed25519 signatures that allows proprietary solutions to participate without revealing their implementation, while cryptographically proving result integrity. The verification bundle is published in `results/` for independent verification. Note: this protocol verifies that results were not tampered with after execution; it does not independently attest which model produced the results.

The test corpus, scoring methodology, and all adapter code are fully open source and auditable. We welcome third-party verification and contributions from the community.

If you believe any aspect of the methodology unfairly advantages or disadvantages a particular provider, please [open an issue](../../issues).

## Current Status

This benchmark currently includes tested results for **6 providers** (ML models, SaaS APIs, and pattern-based scanners), evaluated on **537 test cases** across 8 categories. We are actively expanding coverage, and contributions of new provider adapters are welcome.

### Latest Results

| Provider | Score | Prompt Injection | Jailbreak | Data Exfil | Tool Abuse | Over-Refusal | Multi-Agent | Provenance | P50 latency (ms) |
|---|---|---|---|---|---|---|---|---|---|
| **AgentGuard**² | **98.4** | 98.5% | 97.8% | 100.0% | 100.0% | 100.0% | 100.0% | 85.0% | 1 |
| **Deepset DeBERTa** | **87.6** | 99.5% | 97.8% | 95.4% | 98.8% | 63.1% | 100.0% | 100.0% | 19 |
| **Lakera Guard** | **79.4** | 97.6% | 95.6% | 96.6% | 86.3% | 58.5% | 94.3% | 95.0% | 133 |
| ProtectAI DeBERTa v2 | 51.4 | 77.1% | 86.7% | 43.7% | 12.5% | 95.4% | 74.3% | 65.0% | 19 |
| ClawGuard | 38.9 | 62.9% | 22.2% | 40.2% | 17.5% | 100.0% | 40.0% | 25.0% | 0 |
| LLM Guard¹ | ~38.7 | 77.1% | n/a | 30.8% | 8.9% | n/a | n/a | n/a | 111 |

¹ Scored on 517-case corpus (pre-provenance). Re-run pending for 537-case corpus with updated penalty.

² Tested via Commit-Reveal Integrity Protocol (Ed25519 signatures) using a proprietary provenance-based solution. See [protocol documentation](src/protocol/README.md). Verification bundle included in results.

## Benchmark Categories

| # | Category | Tests | Weight | What It Measures |
|---|----------|-------|--------|-----------------|
| 1 | **Prompt Injection** | 205 | 20% | Direct, indirect, and context-manipulation injection attacks |
| 2 | **Jailbreak** | 45 | 10% | DAN variants, roleplay, authority impersonation, token smuggling |
| 3 | **Data Exfiltration** | 87 | 15% | Resistance to data leakage via tool calls, markdown, errors |
| 4 | **Tool Abuse** | 80 | 15% | Unauthorized tool calls, scope escalation, parameter tampering |
| 5 | **Over-Refusal** | 65 | 15% | False positive rate on legitimate requests (penalty only) |
| 6 | **Multi-Agent Security** | 35 | 10% | Cross-agent attacks, delegation exploits, trust boundary violations |
| 7 | **Latency Overhead** | n/a | 10% | Added latency (p50, p95, p99) from the protection layer |
| 8 | **Provenance & Audit** | 20 | 5% | Detecting fake authorization claims, spoofed provenance chains, unverifiable approvals |

**Total: 537 test cases** across 8 categories (7 scored + latency).
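
The latency category reports p50, p95, and p99 overhead. As a rough illustration of what those figures mean, here is a minimal nearest-rank percentile sketch. The names `percentile` and `summarizeLatency` are hypothetical, and the benchmark's own code may use a different percentile definition (for example, one that interpolates between samples).

```typescript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
// Assumption: this definition is illustrative, not taken from the repository.
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) throw new Error("no latency samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  return sorted[rank - 1];
}

// Hypothetical helper that bundles the three figures named in the table.
function summarizeLatency(samplesMs: number[]): { p50: number; p95: number; p99: number } {
  return {
    p50: percentile(samplesMs, 50),
    p95: percentile(samplesMs, 95),
    p99: percentile(samplesMs, 99),
  };
}
```

Reporting p95 and p99 alongside p50 matters because a protection layer with a low median can still add large delays on a small fraction of requests.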

## Scoring

Each provider receives a **per-category score (0-100)** and a **composite score**, computed as the weighted geometric mean across the attack detection categories. Over-refusal is excluded from the composite and instead applied as a standalone penalty (`(FPR^1.3) * 40`, where FPR is the false positive rate on legitimate requests) to avoid double-counting. A provider that blocks 50% of legitimate requests loses ~16 points (`0.5^1.3 * 40 ≈ 16.2`): security that breaks usability isn't security.
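
The scoring rule can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code in `src/scoring.ts`: the names `CategoryScore`, `weightedGeometricMean`, `overRefusalPenalty`, and `compositeScore` are hypothetical, the penalty is assumed to be subtracted from the composite, and the result is assumed to be floored at 0. Which categories actually enter the mean is left to the caller.

```typescript
// Hypothetical names throughout; this is not the repository's scoring code.
interface CategoryScore {
  score: number;  // per-category score, 0-100
  weight: number; // relative weight, e.g. 0.20 for prompt injection
}

// Weighted geometric mean: exp( sum(w_i * ln(s_i)) / sum(w_i) ).
function weightedGeometricMean(categories: CategoryScore[]): number {
  const totalWeight = categories.reduce((sum, c) => sum + c.weight, 0);
  if (totalWeight <= 0) throw new Error("weights must sum to a positive value");
  // Any zero score drives a geometric mean to zero; return early to avoid ln(0).
  if (categories.some((c) => c.score <= 0)) return 0;
  const logSum = categories.reduce((sum, c) => sum + c.weight * Math.log(c.score), 0);
  return Math.exp(logSum / totalWeight);
}

// Over-refusal penalty from the text: (FPR^1.3) * 40, with FPR in [0, 1].
function overRefusalPenalty(fpr: number): number {
  if (fpr < 0 || fpr > 1) throw new Error("FPR must be between 0 and 1");
  return Math.pow(fpr, 1.3) * 40;
}

// Assumption: the penalty is subtracted and the result is floored at 0.
function compositeScore(categories: CategoryScore[], fpr: number): number {
  return Math.max(0, weightedGeometricMean(categories) - overRefusalPenalty(fpr));
}
```

A geometric mean punishes a single weak category more than an arithmetic mean would: a provider scoring 25 and 100 in two equally weighted categories gets 50, not 62.5.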

## Quick Start

```bash
# Install dependencies
npm install

# Run the full benchmark suite
npm run benchmark

# Validate the test corpus
npm run validate-corpus

# Run tests
npm test

# Type check
npm run typecheck
```

> **Note:** Running the benchmark requires API keys or local services for each provider under test. Copy `.env.example` to `.env` and configure the providers you want to benchmark. See [PROVIDERS.md](./PROVIDERS.md) for setup instructions.

## Project Structure

```
agentshield-benchmark/
├── corpus/                  # Attack and benign test cases (JSONL)
│   ├── categories.json      # Category definitions and weights
│   ├── prompt-injection/    # 205 prompt injection test cases
│   ├── jailbreak/           # 45 jailbreak test cases
│   ├── data-exfiltration/   # 87 data exfiltration test cases
│   ├── tool-abuse/          # 80 tool abuse test cases
│   ├── over-refusal/        # 65 legitimate request test cases
│   ├── multi-agent/         # 35 multi-agent security test cases
│   └── provenance-audit/    # 20 provenance & audit test cases
├── src/
│   ├── types.ts             # Core TypeScript interfaces
│   ├── runner.ts            # Test runner engine
│   ├── scoring.ts           # Scoring and aggregation
│   ├── run-benchmark.ts     # CLI entry point with provider discovery
│   ├── adapters/            # Provider adapter implementations
│   └── protocol/            # Commit-reveal integrity protocol
├── scripts/
│   ├── hf-model-server.py   # HuggingFace model server for ML-based providers
│   └── validate-corpus.sh   # Corpus validation script
├── site/                    # Static leaderboard website
├── docs/
│   └── providers.md         # Provider research and API details
├── package.json
└── tsconfig.json
```

## Adding a Provider

See [`src/adapters/README.md`](./src/adapters/README.md) for the adapter interface. Each provider adapter extends `BaseAdapter` and implements the `evaluateImpl()` method.

## Reproducibility

AgentShield is designed for reproducible benchmark runs. Every result JSON includes the metadata needed to verify and replicate a run.

### Reproducing a Benchmark Run

1. **Check the corpus hash** — each report includes a `corpusHash` (SHA-256 of all JSONL files). Verify your local corpus matches:

   ```bash
   # The runner prints the hash at startup:
   # Computing corpus hash...
   #    Corpus hash: a1b2c3d4...
   ```

   Compare this against the `corpusHash` field in the results JSON.

2. **Use a shuffle seed** — to get deterministic test ordering, pass a `shuffleSeed`:

   ```typescript
   const report = await runBenchmark(providers, {
     shuffle: true,
     shuffleSeed: 42,
   });
   ```

   The same seed produces the same test order on the same corpus. The seed is recorded in `report.config.shuffleSeed`.

3. **Match the environment** — the report records `environment.os`, `environment.arch`, and `environment.nodeVersion`. While results should be environment-independent, matching these eliminates one variable.
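
To show why a fixed seed yields a fixed test order, here is a minimal sketch of a seeded shuffle. It is an assumption-laden illustration: the runner's actual PRNG and shuffle routine are not described here, so the `mulberry32` generator and the `seededShuffle` helper below are stand-ins, not the benchmark's implementation.

```typescript
// Small deterministic PRNG (mulberry32). Assumption: chosen only for
// illustration; the benchmark may use a different generator.
function mulberry32(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // value in [0, 1)
  };
}

// Fisher-Yates shuffle driven by the seeded PRNG. Returns a new array and
// leaves the input untouched. `seededShuffle` is a hypothetical helper name.
function seededShuffle<T>(items: readonly T[], seed: number): T[] {
  const out = [...items];
  const rand = mulberry32(seed);
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}

// Two shuffles of the same list with the same seed produce the same order.
const caseIds = ["pi-001", "pi-002", "jb-001", "de-001", "ta-001"];
const firstOrder = seededShuffle(caseIds, 42);
const secondOrder = seededShuffle(caseIds, 42);
console.log(firstOrder.join(",") === secondOrder.join(","));
```

Because the order depends only on the seed and the input list, any change to the corpus changes the resulting order, which is one reason to check the corpus hash before comparing runs.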

### Commit-Reveal Integrity Protocol

AgentShield includes a commit-reveal protocol (`src/protocol/`) that allows vendors to run the benchmark locally on proprietary models while cryptographically proving result integrity. See [`src/protocol/README.md`](./src/protocol/README.md) for details.
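
The general idea of a commit-reveal scheme with Ed25519 signatures can be sketched with Node's built-in `node:crypto` module. This is a simplified illustration, not the protocol in `src/protocol/`: the `Commitment` shape and the `commit` and `verifyReveal` functions are hypothetical, and the sketch omits details a real protocol would likely need, such as a nonce and a defined message format.

```typescript
import { createHash, generateKeyPairSync, sign, verify, type KeyObject } from "node:crypto";

// Hypothetical shape; the real protocol's message format is documented in
// src/protocol/README.md and is not reproduced here.
interface Commitment {
  digestHex: string; // SHA-256 of the results, published before the reveal
  signature: Buffer; // Ed25519 signature over the digest bytes
}

// Commit phase: publish only a signed digest, not the results themselves.
function commit(resultsJson: string, privateKey: KeyObject): Commitment {
  const digestHex = createHash("sha256").update(resultsJson).digest("hex");
  // For Ed25519 keys, Node's sign() takes null as the algorithm argument.
  const signature = sign(null, Buffer.from(digestHex, "hex"), privateKey);
  return { digestHex, signature };
}

// Reveal phase: anyone holding the public key can check that the revealed
// results match the earlier commitment and that the signature is valid.
function verifyReveal(resultsJson: string, c: Commitment, publicKey: KeyObject): boolean {
  const digestHex = createHash("sha256").update(resultsJson).digest("hex");
  if (digestHex !== c.digestHex) return false;
  return verify(null, Buffer.from(c.digestHex, "hex"), publicKey, c.signature);
}

// Example with made-up results data.
const { publicKey, privateKey } = generateKeyPairSync("ed25519");
const revealed = JSON.stringify({ provider: "example", passed: 3, failed: 1 });
const commitment = commit(revealed, privateKey);
console.log(verifyReveal(revealed, commitment, publicKey));       // true
console.log(verifyReveal(revealed + " ", commitment, publicKey)); // false
```

A check like this shows only that the revealed results are the ones that were committed to and signed. It says nothing about which model produced them.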

### What the Results JSON Contains

| Field | Description |
|-------|-------------|
| `version` | Benchmark suite version |
| `corpusHash` | SHA-256 of corpus JSONL files; verifies test data integrity |
| `environment` | OS, architecture, Node.js version, and timestamp |
| `config.shuffleSeed` | PRNG seed used for test ordering (if set) |
| `providers[].providerVersion` | Version of each provider (package version or API version) |
| `providers[].results[]` | Individual test case results with timestamps |
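
As one way to use these fields, the sketch below checks whether two result files are comparable before their scores are compared. The field names follow the table above, but the nested shapes, the optionality of `shuffleSeed`, and the `comparabilityIssues` helper are assumptions made for illustration.

```typescript
// Assumed shape based on the field table; the real types live in src/types.ts
// and may differ.
interface BenchmarkReport {
  version: string;
  corpusHash: string;
  environment: { os: string; arch: string; nodeVersion: string; timestamp: string };
  config: { shuffleSeed?: number };
  providers: { providerVersion: string; results: unknown[] }[];
}

// Hypothetical helper: lists the reasons two runs may not be directly comparable.
function comparabilityIssues(a: BenchmarkReport, b: BenchmarkReport): string[] {
  const issues: string[] = [];
  if (a.version !== b.version) issues.push("version differs: suite versions do not match");
  if (a.corpusHash !== b.corpusHash) issues.push("corpusHash differs: runs used different test data");
  if (a.config.shuffleSeed !== b.config.shuffleSeed) {
    issues.push("shuffleSeed differs: test order may differ");
  }
  for (const key of ["os", "arch", "nodeVersion"] as const) {
    if (a.environment[key] !== b.environment[key]) issues.push(`environment.${key} differs`);
  }
  return issues;
}
```

An empty list means the two runs share the same suite version, corpus, seed, and environment. A `corpusHash` mismatch is the most serious issue, because the runs were then scored on different test data.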

### Environment Requirements

- **Node.js** >= 20.0.0

- **TypeScript** >= 5.7.0

- Provider-specific dependencies (see [PROVIDERS.md](./PROVIDERS.md))

## Contributing

See [CONTRIBUTING.md](./CONTRIBUTING.md) for guidelines. We welcome contributions of:

- New test cases (especially novel attack vectors)

- Provider adapters

- Scoring methodology improvements

Please open an issue before submitting large changes.

## License

Apache 2.0 — see [LICENSE](./LICENSE).
