{"id":50770718,"url":"https://github.com/joemunene-by/cyberbench","last_synced_at":"2026-06-11T18:01:42.081Z","repository":{"id":363847436,"uuid":"1213666609","full_name":"joemunene-by/cyberbench","owner":"joemunene-by","description":"Open, reproducible benchmark for evaluating LLMs on cybersecurity reasoning. YAML tasks, pluggable backends, ranked leaderboard.","archived":false,"fork":false,"pushed_at":"2026-06-10T15:20:28.000Z","size":30,"stargazers_count":0,"open_issues_count":2,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-10T17:09:19.271Z","etag":null,"topics":["ai-security","benchmark","cve","cybersecurity","evaluation","llm","llm-evaluation","red-team","security-research","sigma"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/joemunene-by.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-17T16:19:28.000Z","updated_at":"2026-06-10T15:15:10.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/joemunene-by/cyberbench","commit_stats":null,"previous_names":["joemunene-by/cyberbench"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/joemunene-by/cyberbench","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/joemunene-by%2Fcyberbench","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/joemunene-by%2Fcyberbench/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/joemunene-by%2Fcyberbench/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/joemunene-by%2Fcyberbench/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/joemunene-by","download_url":"https://codeload.github.com/joemunene-by/cyberbench/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/joemunene-by%2Fcyberbench/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34211067,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-11T02:00:06.485Z","response_time":57,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-security","benchmark","cve","cybersecurity","evaluation","llm","llm-evaluation","red-team","security-research","sigma"],"created_at":"2026-06-11T18:01:41.975Z","updated_at":"2026-06-11T18:01:42.073Z","avatar_url":"https://github.com/joemunene-by.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n\n# CyberBench\n\n**An open, reproducible benchmark for evaluating LLMs on cybersecurity reasoning.**\n\nRuns any model — OpenAI, Anthropic, Ollama, local — through a structured suite of security tasks and produces a ranked markdown leaderboard. Task files are YAML, scoring is transparent, and the whole thing runs in one command.\n\n![Python](https://img.shields.io/badge/python-3.10+-3776AB?style=flat-square\u0026logo=python\u0026logoColor=white)\n![License](https://img.shields.io/badge/license-MIT-green?style=flat-square)\n![Tasks](https://img.shields.io/badge/seed_tasks-15-6C9CFF?style=flat-square)\n\n\u003c/div\u003e\n\n---\n\n## Why\n\nExisting LLM benchmarks (MMLU, HELM, MT-Bench) barely touch security. The ones that do tend to be vendor-owned (CyberSecEval), narrow (SecQA), or closed. CyberBench is:\n\n- **Open.** Tasks live in plain YAML; scoring code is one page each.\n- **Reproducible.** A `cyberbench run --model X` produces a JSON run file that's committed alongside the repo. Anyone can verify or extend.\n- **Extensible.** Adding a task is writing one YAML file. Adding a backend is a 30-line class.\n- **Multi-category.** CVE triage, secure code review, detection rule generation — more coming.\n\n## What's scored\n\n| Category | Task type | Scorer | Seed tasks |\n| --- | --- | --- | --- |\n| `cve_triage` | Multiple-choice CWE / exploit reasoning | `mc_exact` | 5 |\n| `code_review` | Free-form vulnerability review of a code snippet | `rubric_keyword` (weighted phrase match) | 5 |\n| `detection_rule` | Generate a SIGMA rule for a given attack scenario | `sigma_structural` (YAML + field + domain checks) | 5 |\n\n**v0.1 ships 15 hand-curated seed tasks.** The framework is the product; the task bank grows with community contributions (see below).\n\n## Quickstart\n\n```bash\ngit clone https://github.com/joemunene-by/cyberbench.git\ncd cyberbench\npip install -e \".[all]\"   # pulls openai + anthropic; drop [all] to skip\n\n# Smoke test with the built-in echo backend (no API key needed)\ncyberbench run --model echo\ncyberbench report --print\n```\n\n### Run a real model\n\n```bash\n# OpenAI (requires OPENAI_API_KEY)\ncyberbench run --model openai:gpt-4o-mini\n\n# Anthropic (requires ANTHROPIC_API_KEY)\ncyberbench run --model anthropic:claude-sonnet-4-6\n\n# Local Ollama (requires ollama running at :11434)\ncyberbench run --model ollama:llama3\n\n# Only one category\ncyberbench run --model openai:gpt-4o-mini --category detection_rule\n```\n\nEach run writes `leaderboard/\u003cslug\u003e.json`. Build the ranked table:\n\n```bash\ncyberbench report --out LEADERBOARD.md\n```\n\n## Adding a task\n\nDrop a YAML file under `tasks/\u003ccategory\u003e/`:\n\n```yaml\nid: cve-006-spectre\ncategory: cve_triage\ntype: multiple_choice\n\nprompt: |\n  CVE-2017-5753 (Spectre variant 1)...\n\n  A) CWE-20: Improper Input Validation\n  B) CWE-203: Observable Discrepancy (side channel)\n  C) ...\n\nchoices:\n  - \"A: CWE-20\"\n  - \"B: CWE-203\"\n  - \"C: ...\"\n\nanswer: B\n\nmetadata:\n  cve: CVE-2017-5753\n```\n\nThree task types are supported out of the box:\n\n- **`multiple_choice`** — set `choices` and `answer` (letter).\n- **`free_form`** — set a `rubric` of `{must_mention | any_of, weight}` entries.\n- **`sigma_rule`** — set `requirements` (`must_have_fields`, `detection_keys_should_include`, `expected_logsource_product`).\n\nTests run on every task file via `pytest tests/test_tasks.py::test_seed_tasks_all_load`.\n\n## Adding a backend\n\nImplement `Backend.generate(prompt, system=None) -\u003e str` under `src/cyberbench/backends/`. Register in `backends/__init__.py::get_backend`. That's it.\n\n## Architecture\n\n```\ncyberbench/\n├── src/cyberbench/\n│   ├── tasks.py          # YAML loader, Task dataclass\n│   ├── backends/         # echo, ollama, openai, anthropic\n│   ├── scorers/          # mc_exact, rubric_keyword, sigma_structural\n│   ├── runner.py         # dispatches tasks → backend → scorer\n│   ├── report.py         # markdown leaderboard builder\n│   └── cli.py            # argparse entry point\n├── tasks/\n│   ├── cve_triage/\n│   ├── code_review/\n│   └── detection_rule/\n├── leaderboard/          # committed JSON runs (the reproducibility story)\n└── LEADERBOARD.md        # generated by `cyberbench report`\n```\n\n## Roadmap\n\n- **v0.2** — 50 tasks per category, LLM-as-judge scorer for free-form, Dockerfile for sealed reproducibility\n- **v0.3** — New categories: exploit chain reasoning, log forensics, threat model critique\n- **v0.4** — Adversarial subset (prompt injection resistance against a security-Q\u0026A persona)\n- **v1.0** — Public hosted leaderboard, automated re-runs on model releases\n\n## Related projects\n\nBuilt by [@joemunene-by](https://github.com/joemunene-by). Companion work:\n\n- [`GhostLM`](https://github.com/joemunene-by/GhostLM) — cybersecurity-focused LLM (CyberBench is how it gets validated)\n- [`secure-mcp`](https://github.com/joemunene-by/secure-mcp) — MCP server for giving agents a gated security toolbox\n- [`ghostsiem`](https://github.com/joemunene-by/ghostsiem) — SIGMA-based detection and alerting\n\n## Contributing\n\nPRs welcome — especially **new tasks**. Open an issue first for new scorers or categories so we can align on the schema.\n\n## License\n\nMIT. See [LICENSE](./LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjoemunene-by%2Fcyberbench","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjoemunene-by%2Fcyberbench","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjoemunene-by%2Fcyberbench/lists"}