https://github.com/joemunene-by/cyberbench
Open, reproducible benchmark for evaluating LLMs on cybersecurity reasoning. YAML tasks, pluggable backends, ranked leaderboard.
- Host: GitHub
- URL: https://github.com/joemunene-by/cyberbench
- Owner: joemunene-by
- License: mit
- Created: 2026-04-17T16:19:28.000Z (2 months ago)
- Default Branch: main
- Last Pushed: 2026-06-10T15:20:28.000Z (21 days ago)
- Last Synced: 2026-06-10T17:09:19.271Z (21 days ago)
- Topics: ai-security, benchmark, cve, cybersecurity, evaluation, llm, llm-evaluation, red-team, security-research, sigma
- Language: Python
- Size: 29.3 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# CyberBench
**An open, reproducible benchmark for evaluating LLMs on cybersecurity reasoning.**
Runs any model — OpenAI, Anthropic, Ollama, local — through a structured suite of security tasks and produces a ranked markdown leaderboard. Task files are YAML, scoring is transparent, and the whole thing runs in one command.



---
## Why
Existing LLM benchmarks (MMLU, HELM, MT-Bench) barely touch security. The ones that do tend to be vendor-owned (CyberSecEval), narrow (SecQA), or closed. CyberBench is:
- **Open.** Tasks live in plain YAML; each scorer is about one page of code.
- **Reproducible.** Every `cyberbench run --model X` produces a JSON run file that is committed to the repo, so anyone can verify or extend the results.
- **Extensible.** Adding a task is writing one YAML file. Adding a backend is a 30-line class.
- **Multi-category.** CVE triage, secure code review, detection rule generation — more coming.
## What's scored
| Category | Task type | Scorer | Seed tasks |
| --- | --- | --- | --- |
| `cve_triage` | Multiple-choice CWE / exploit reasoning | `mc_exact` | 5 |
| `code_review` | Free-form vulnerability review of a code snippet | `rubric_keyword` (weighted phrase match) | 5 |
| `detection_rule` | Generate a SIGMA rule for a given attack scenario | `sigma_structural` (YAML + field + domain checks) | 5 |
**v0.1 ships 15 hand-curated seed tasks.** The framework is the product; the task bank grows with community contributions (see below).
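To make the `mc_exact` row concrete, here is a minimal sketch of how an exact-match multiple-choice scorer could work. It is not the code in `scorers/`; the function names `extract_choice` and `mc_exact_sketch` and the letter-extraction heuristics are assumptions made for illustration.

```python
import re
from typing import Optional

# Hypothetical heuristics: prefer an explicit "answer: B" / "answer is B",
# otherwise accept a single letter at the very start of the response.
_ANSWER_PHRASE = re.compile(r"\banswer\s*(?:is|:)?\s*\(?([A-Z])\b", re.IGNORECASE)
_LEADING_LETTER = re.compile(r"\s*\(?([A-Za-z])\)?(?:[).:\s]|$)")


def extract_choice(response: str) -> Optional[str]:
    """Return the chosen letter in upper case, or None if none is found."""
    match = _ANSWER_PHRASE.search(response) or _LEADING_LETTER.match(response)
    return match.group(1).upper() if match else None


def mc_exact_sketch(response: str, answer: str) -> float:
    """Score 1.0 when the extracted letter equals the expected one, else 0.0."""
    return 1.0 if extract_choice(response) == answer.strip().upper() else 0.0
```

A response such as `"B) CWE-203"` or `"The answer is B."` scores 1.0 against `answer: B`. The leading-letter rule is deliberately naive: a response that opens with the article "A" would be read as choice A, so a production scorer would likely constrain the output format in the prompt.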
## Quickstart
```bash
git clone https://github.com/joemunene-by/cyberbench.git
cd cyberbench
pip install -e ".[all]" # pulls openai + anthropic; drop [all] to skip
# Smoke test with the built-in echo backend (no API key needed)
cyberbench run --model echo
cyberbench report --print
```
### Run a real model
```bash
# OpenAI (requires OPENAI_API_KEY)
cyberbench run --model openai:gpt-4o-mini
# Anthropic (requires ANTHROPIC_API_KEY)
cyberbench run --model anthropic:claude-sonnet-4-6
# Local Ollama (requires ollama running at :11434)
cyberbench run --model ollama:llama3
# Only one category
cyberbench run --model openai:gpt-4o-mini --category detection_rule
```
Each run writes a JSON run file under `leaderboard/`. Build the ranked table:
```bash
cyberbench report --out LEADERBOARD.md
```
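For readers curious what the report step involves, the sketch below ranks run files into a markdown table. The run-file schema used here (`{"model": ..., "scores": {category: float}}`) and the function names are assumptions for illustration; the real files written by `cyberbench run` may be shaped differently.

```python
import json
from pathlib import Path
from statistics import mean


def load_runs(directory: str) -> list:
    """Read every *.json file in a directory (assumed schema, see lead-in)."""
    paths = sorted(Path(directory).glob("*.json"))
    return [json.loads(p.read_text(encoding="utf-8")) for p in paths]


def overall(run: dict) -> float:
    """Unweighted mean over category scores; 0.0 for an empty run."""
    scores = run["scores"]
    return mean(scores.values()) if scores else 0.0


def build_leaderboard(runs: list) -> str:
    """Return a markdown table ranked by overall score, ties broken by model name."""
    categories = sorted({cat for run in runs for cat in run["scores"]})
    ranked = sorted(runs, key=lambda run: (-overall(run), run["model"]))
    cells = ["Rank", "Model", "Overall", *categories]
    lines = [
        "| " + " | ".join(cells) + " |",
        "| " + " | ".join("---" for _ in cells) + " |",
    ]
    for rank, run in enumerate(ranked, start=1):
        row = [str(rank), run["model"], f"{overall(run):.2f}"]
        # A model that skipped a category gets a dash instead of a score.
        row += [f"{run['scores'][c]:.2f}" if c in run["scores"] else "-" for c in categories]
        lines.append("| " + " | ".join(row) + " |")
    return "\n".join(lines)
```

Calling `build_leaderboard(load_runs("leaderboard"))` would produce the table text. Averaging categories without weights is one possible design choice, not necessarily the one CyberBench uses.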
## Adding a task
Drop a YAML file into the matching category directory under `tasks/` (for example, `tasks/cve_triage/`):
```yaml
id: cve-006-spectre
category: cve_triage
type: multiple_choice
prompt: |
  CVE-2017-5753 (Spectre variant 1)...
  A) CWE-20: Improper Input Validation
  B) CWE-203: Observable Discrepancy (side channel)
  C) ...
choices:
  - "A: CWE-20"
  - "B: CWE-203"
  - "C: ..."
answer: B
metadata:
  cve: CVE-2017-5753
```
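Once a file like the one above is parsed into a mapping (for instance with a YAML library), it is worth validating before it reaches the runner. The sketch below is an assumption about what such a check could look like; `TaskSketch` and `task_from_mapping` are hypothetical names, not the actual `Task` dataclass in `tasks.py`.

```python
from dataclasses import dataclass, field
from typing import List, Optional

REQUIRED_KEYS = ("id", "category", "type", "prompt")


@dataclass
class TaskSketch:
    id: str
    category: str
    type: str
    prompt: str
    choices: List[str] = field(default_factory=list)
    answer: Optional[str] = None
    metadata: dict = field(default_factory=dict)


def task_from_mapping(raw: dict) -> TaskSketch:
    """Build a task from an already-parsed YAML mapping, failing loudly on bad input."""
    missing = [key for key in REQUIRED_KEYS if key not in raw]
    if missing:
        raise ValueError(f"task is missing required keys: {missing}")
    task = TaskSketch(
        id=raw["id"],
        category=raw["category"],
        type=raw["type"],
        prompt=raw["prompt"],
        choices=list(raw.get("choices", [])),
        answer=raw.get("answer"),
        metadata=dict(raw.get("metadata", {})),
    )
    if task.type == "multiple_choice":
        # Choices look like "B: CWE-203", so the letter is the text before the colon.
        letters = {choice.split(":", 1)[0].strip() for choice in task.choices}
        if task.answer not in letters:
            raise ValueError(f"{task.id}: answer {task.answer!r} is not one of {sorted(letters)}")
    return task
```

Rejecting a multiple-choice task whose `answer` is not among its `choices` catches the most common authoring mistake at load time rather than at scoring time.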
Three task types are supported out of the box:
- **`multiple_choice`** — set `choices` and `answer` (letter).
- **`free_form`** — set a `rubric` of `{must_mention | any_of, weight}` entries.
- **`sigma_rule`** — set `requirements` (`must_have_fields`, `detection_keys_should_include`, `expected_logsource_product`).
Tests run on every task file via `pytest tests/test_tasks.py::test_seed_tasks_all_load`.
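The `rubric` and `requirements` fields listed above can be read as small checklists. The following sketch shows one plausible interpretation: a rubric entry carries either a `must_mention` phrase or an `any_of` list plus a `weight`, and a SIGMA requirement set is checked against a rule that has already been parsed into a dict. The function names and exact matching rules are assumptions, not the shipped scorers.

```python
def rubric_score(response: str, rubric: list) -> float:
    """Weighted fraction of rubric entries satisfied (case-insensitive substring match)."""
    text = response.lower()
    total = sum(entry.get("weight", 1.0) for entry in rubric)
    if total == 0:
        return 0.0
    earned = 0.0
    for entry in rubric:
        if "must_mention" in entry:
            hit = entry["must_mention"].lower() in text
        else:
            hit = any(phrase.lower() in text for phrase in entry.get("any_of", []))
        if hit:
            earned += entry.get("weight", 1.0)
    return earned / total


def sigma_checks(rule: dict, requirements: dict) -> dict:
    """Return a pass/fail flag per structural requirement for a parsed SIGMA rule."""
    detection = rule.get("detection") if isinstance(rule.get("detection"), dict) else {}
    logsource = rule.get("logsource") if isinstance(rule.get("logsource"), dict) else {}
    results = {}
    for name in requirements.get("must_have_fields", []):
        results[f"field:{name}"] = name in rule
    for key in requirements.get("detection_keys_should_include", []):
        results[f"detection:{key}"] = key in detection
    expected = requirements.get("expected_logsource_product")
    if expected is not None:
        results["logsource.product"] = logsource.get("product") == expected
    return results
```

A final score could then be `sum(results.values()) / len(results)`. Substring matching is easy to audit but also easy to game, which is presumably why the roadmap mentions an LLM-as-judge scorer for free-form tasks.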
## Adding a backend
Implement `Backend.generate(prompt, system=None) -> str` under `src/cyberbench/backends/`. Register in `backends/__init__.py::get_backend`. That's it.
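As a rough illustration of that contract, the sketch below defines a stand-in base class with the documented `generate(prompt, system=None) -> str` signature, a toy backend, and a lookup function that splits a `name:model` spec like the ones passed to `--model`. `CannedBackend`, `_REGISTRY`, and this version of `get_backend` are hypothetical; the real base class and registration code live in `src/cyberbench/backends/` and may differ.

```python
from abc import ABC, abstractmethod
from typing import Optional


class Backend(ABC):
    """Stand-in for the project's base class; only the documented method is modeled."""

    @abstractmethod
    def generate(self, prompt: str, system: Optional[str] = None) -> str:
        ...


class CannedBackend(Backend):
    """Toy backend that always returns a fixed reply, handy for testing scorers."""

    def __init__(self, model: str = "B") -> None:
        self.reply = model

    def generate(self, prompt: str, system: Optional[str] = None) -> str:
        return self.reply


# Hypothetical registry mapping the part before ":" to a backend class.
_REGISTRY = {"canned": CannedBackend}


def get_backend(spec: str) -> Backend:
    """Resolve a spec such as "canned:C" or "canned" to a backend instance."""
    name, _, model = spec.partition(":")
    if name not in _REGISTRY:
        raise ValueError(f"unknown backend {name!r}; known: {sorted(_REGISTRY)}")
    backend_cls = _REGISTRY[name]
    return backend_cls(model) if model else backend_cls()
```

With this shape, `get_backend("canned:C").generate("any prompt")` returns `"C"`, and adding a provider means writing one class and one registry entry.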
## Architecture
```
cyberbench/
├── src/cyberbench/
│   ├── tasks.py          # YAML loader, Task dataclass
│   ├── backends/         # echo, ollama, openai, anthropic
│   ├── scorers/          # mc_exact, rubric_keyword, sigma_structural
│   ├── runner.py         # dispatches tasks → backend → scorer
│   ├── report.py         # markdown leaderboard builder
│   └── cli.py            # argparse entry point
├── tasks/
│   ├── cve_triage/
│   ├── code_review/
│   └── detection_rule/
├── leaderboard/          # committed JSON runs (the reproducibility story)
└── LEADERBOARD.md        # generated by `cyberbench report`
```
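The `runner.py` comment above describes a tasks → backend → scorer pipeline. A minimal sketch of that dispatch loop, assuming tasks are plain dicts and scorers are looked up by task type, might look like the following. `run_suite` and its argument shapes are illustrative assumptions rather than the project's actual interfaces.

```python
from collections import defaultdict
from statistics import mean
from typing import Callable, Dict, List


def run_suite(
    tasks: List[dict],
    generate: Callable[[str], str],
    scorers: Dict[str, Callable[[str, dict], float]],
) -> Dict[str, float]:
    """Send each task prompt to the model, score the reply, and average per category."""
    per_category = defaultdict(list)
    for task in tasks:
        response = generate(task["prompt"])
        # The task type picks the scorer, e.g. multiple_choice -> exact match.
        score = scorers[task["type"]](response, task)
        per_category[task["category"]].append(score)
    return {category: mean(scores) for category, scores in per_category.items()}
```

Keeping the loop free of provider-specific logic is what lets a backend or a scorer be swapped without touching the runner.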
## Roadmap
- **v0.2** — 50 tasks per category, LLM-as-judge scorer for free-form, Dockerfile for sealed reproducibility
- **v0.3** — New categories: exploit chain reasoning, log forensics, threat model critique
- **v0.4** — Adversarial subset (prompt injection resistance against a security-Q&A persona)
- **v1.0** — Public hosted leaderboard, automated re-runs on model releases
## Related projects
Built by [@joemunene-by](https://github.com/joemunene-by). Companion work:
- [`GhostLM`](https://github.com/joemunene-by/GhostLM) — cybersecurity-focused LLM (CyberBench is how it gets validated)
- [`secure-mcp`](https://github.com/joemunene-by/secure-mcp) — MCP server for giving agents a gated security toolbox
- [`ghostsiem`](https://github.com/joemunene-by/ghostsiem) — SIGMA-based detection and alerting
## Contributing
PRs welcome — especially **new tasks**. Open an issue first for new scorers or categories so we can align on the schema.
## License
MIT. See [LICENSE](./LICENSE).