
# CyberBench

**An open, reproducible benchmark for evaluating LLMs on cybersecurity reasoning.**

Runs any model — OpenAI, Anthropic, Ollama, local — through a structured suite of security tasks and produces a ranked markdown leaderboard. Task files are YAML, scoring is transparent, and the whole thing runs in one command.

![Python](https://img.shields.io/badge/python-3.10+-3776AB?style=flat-square&logo=python&logoColor=white)
![License](https://img.shields.io/badge/license-MIT-green?style=flat-square)
![Tasks](https://img.shields.io/badge/seed_tasks-15-6C9CFF?style=flat-square)

---

## Why

Existing LLM benchmarks (MMLU, HELM, MT-Bench) barely touch security. The ones that do tend to be vendor-owned (CyberSecEval), narrow (SecQA), or closed. CyberBench is:

- **Open.** Tasks live in plain YAML, and each scorer is about one page of code.
- **Reproducible.** `cyberbench run --model X` produces a JSON run file that is committed to the repo, so anyone can verify or extend the results.
- **Extensible.** Adding a task is writing one YAML file. Adding a backend is a 30-line class.
- **Multi-category.** CVE triage, secure code review, detection rule generation — more coming.

## What's scored

| Category | Task type | Scorer | Seed tasks |
| --- | --- | --- | --- |
| `cve_triage` | Multiple-choice CWE / exploit reasoning | `mc_exact` | 5 |
| `code_review` | Free-form vulnerability review of a code snippet | `rubric_keyword` (weighted phrase match) | 5 |
| `detection_rule` | Generate a SIGMA rule for a given attack scenario | `sigma_structural` (YAML + field + domain checks) | 5 |

**v0.1 ships 15 hand-curated seed tasks.** The framework is the product; the task bank grows with community contributions (see below).

## Quickstart

```bash
git clone https://github.com/joemunene-by/cyberbench.git
cd cyberbench
pip install -e ".[all]" # pulls openai + anthropic; drop [all] to skip

# Smoke test with the built-in echo backend (no API key needed)
cyberbench run --model echo
cyberbench report --print
```

### Run a real model

```bash
# OpenAI (requires OPENAI_API_KEY)
cyberbench run --model openai:gpt-4o-mini

# Anthropic (requires ANTHROPIC_API_KEY)
cyberbench run --model anthropic:claude-sonnet-4-6

# Local Ollama (requires ollama running at :11434)
cyberbench run --model ollama:llama3

# Only one category
cyberbench run --model openai:gpt-4o-mini --category detection_rule
```

Each run writes a JSON run file under `leaderboard/`. Build the ranked table from those files:

```bash
cyberbench report --out LEADERBOARD.md
```
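To make the report step concrete, here is a minimal sketch of turning a directory of JSON run files into a ranked markdown table. The run-file schema (`model` and `score` keys), the function names, and the column layout are assumptions for illustration; the actual files written by `cyberbench run` may look different.

```python
import json
from pathlib import Path


def load_runs(directory: str) -> list[dict]:
    """Read every *.json run file in a directory (sorted for stable output)."""
    paths = sorted(Path(directory).glob("*.json"))
    return [json.loads(path.read_text(encoding="utf-8")) for path in paths]


def build_leaderboard(runs: list[dict]) -> str:
    """Render runs as a markdown table, best score first.

    Assumes each run is a dict with hypothetical "model" and "score" keys.
    """
    ranked = sorted(runs, key=lambda run: run["score"], reverse=True)
    lines = ["| Rank | Model | Score |", "| --- | --- | --- |"]
    for rank, run in enumerate(ranked, start=1):
        lines.append(f"| {rank} | `{run['model']}` | {run['score']:.3f} |")
    return "\n".join(lines)
```

Under these assumptions, `build_leaderboard(load_runs("leaderboard"))` returns a string that could be written to `LEADERBOARD.md`.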

## Adding a task

Drop a YAML file into the matching category directory under `tasks/` (for example, `tasks/cve_triage/`):

```yaml
id: cve-006-spectre
category: cve_triage
type: multiple_choice

prompt: |
  CVE-2017-5753 (Spectre variant 1)...

  A) CWE-20: Improper Input Validation
  B) CWE-203: Observable Discrepancy (side channel)
  C) ...

choices:
  - "A: CWE-20"
  - "B: CWE-203"
  - "C: ..."

answer: B

metadata:
  cve: CVE-2017-5753
```

Three task types are supported out of the box:

- **`multiple_choice`** — set `choices` and `answer` (letter).
- **`free_form`** — set a `rubric` of `{must_mention | any_of, weight}` entries.
- **`sigma_rule`** — set `requirements` (`must_have_fields`, `detection_keys_should_include`, `expected_logsource_product`).
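For the `free_form` type, a weighted phrase match over a rubric might work roughly like this. It is a sketch under stated assumptions: `must_mention` is taken to be a single phrase, `any_of` a list of alternative phrases, matching is a case-insensitive substring test, and the score is the earned weight divided by the total weight. The real `rubric_keyword` scorer may differ on every one of these points.

```python
def rubric_keyword(response: str, rubric: list[dict]) -> float:
    """Return the fraction of rubric weight whose phrases appear in the response.

    Each entry is assumed to carry either "must_mention" (one phrase) or
    "any_of" (a list of phrases), plus an optional "weight" (default 1.0).
    """
    text = response.lower()
    total = sum(entry.get("weight", 1.0) for entry in rubric)
    if total == 0:
        return 0.0

    earned = 0.0
    for entry in rubric:
        if "must_mention" in entry:
            hit = entry["must_mention"].lower() in text
        else:
            hit = any(phrase.lower() in text for phrase in entry.get("any_of", []))
        if hit:
            earned += entry.get("weight", 1.0)
    return earned / total
```

With a rubric that weights `"sql injection"` at 2 and any of `"parameterized"` / `"prepared statement"` at 1, a review that names the vulnerability but offers no fix would earn two thirds of the available score.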

Every task file is checked by `pytest tests/test_tasks.py::test_seed_tasks_all_load`.

## Adding a backend

Implement `Backend.generate(prompt, system=None) -> str` in a class under `src/cyberbench/backends/`, then register it in `backends/__init__.py::get_backend`. That's it.

## Architecture

```
cyberbench/
├── src/cyberbench/
│   ├── tasks.py          # YAML loader, Task dataclass
│   ├── backends/         # echo, ollama, openai, anthropic
│   ├── scorers/          # mc_exact, rubric_keyword, sigma_structural
│   ├── runner.py         # dispatches tasks → backend → scorer
│   ├── report.py         # markdown leaderboard builder
│   └── cli.py            # argparse entry point
├── tasks/
│   ├── cve_triage/
│   ├── code_review/
│   └── detection_rule/
├── leaderboard/          # committed JSON runs (the reproducibility story)
└── LEADERBOARD.md        # generated by `cyberbench report`
```
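The task → backend → scorer flow handled by `runner.py` can be pictured with a short loop. This is an illustrative sketch only: tasks are plain dicts, the backend is reduced to a `generate` callable, scorers are looked up by task type, and the result layout (`results`, `by_category`, `overall`) is hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Callable

# A scorer takes the model response and the task, and returns a score.
Scorer = Callable[[str, dict], float]


def run_tasks(
    tasks: list[dict],
    generate: Callable[[str], str],
    scorers: dict[str, Scorer],
) -> dict:
    """Send each task prompt to the backend and score the response."""
    results = []
    for task in tasks:
        response = generate(task["prompt"])
        score = scorers[task["type"]](response, task)
        results.append({"id": task["id"], "category": task["category"], "score": score})

    # Group scores by category so the report can show per-category means.
    by_category: dict[str, list[float]] = defaultdict(list)
    for row in results:
        by_category[row["category"]].append(row["score"])

    return {
        "results": results,
        "by_category": {name: mean(scores) for name, scores in by_category.items()},
        "overall": mean(row["score"] for row in results) if results else 0.0,
    }
```

Keeping the backend and scorers as plain callables means the loop itself never needs to know which provider or task type it is handling.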

## Roadmap

- **v0.2** — 50 tasks per category, LLM-as-judge scorer for free-form, Dockerfile for sealed reproducibility
- **v0.3** — New categories: exploit chain reasoning, log forensics, threat model critique
- **v0.4** — Adversarial subset (prompt injection resistance against a security-Q&A persona)
- **v1.0** — Public hosted leaderboard, automated re-runs on model releases

## Related projects

Built by [@joemunene-by](https://github.com/joemunene-by). Companion work:

- [`GhostLM`](https://github.com/joemunene-by/GhostLM) — cybersecurity-focused LLM (CyberBench is how it gets validated)
- [`secure-mcp`](https://github.com/joemunene-by/secure-mcp) — MCP server for giving agents a gated security toolbox
- [`ghostsiem`](https://github.com/joemunene-by/ghostsiem) — SIGMA-based detection and alerting

## Contributing

PRs welcome — especially **new tasks**. Open an issue first for new scorers or categories so we can align on the schema.

## License

MIT. See [LICENSE](./LICENSE).