{"id":48582461,"url":"https://github.com/zircote/claude-spec-benchmark","last_synced_at":"2026-04-08T17:33:57.663Z","repository":{"id":329358001,"uuid":"1117237431","full_name":"zircote/claude-spec-benchmark","owner":"zircote","description":null,"archived":false,"fork":false,"pushed_at":"2026-03-30T14:04:39.000Z","size":214,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-03-30T16:09:10.576Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/zircote.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":".github/CODEOWNERS","security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-12-16T03:06:11.000Z","updated_at":"2026-03-30T14:04:36.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/zircote/claude-spec-benchmark","commit_stats":null,"previous_names":["zircote/claude-spec-benchmark"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/zircote/claude-spec-benchmark","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zircote%2Fclaude-spec-benchmark","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zircote%2Fclaude-spec-benchmark/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zircote%2Fclaude-spec-benchmark/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zircote%2Fclaude-spec-benchmark/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/zircote","download_url":"https://codeload.github.com/zircote/claude-spec-benchmark/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zircote%2Fclaude-spec-benchmark/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31567056,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-08T14:31:17.711Z","status":"ssl_error","status_checked_at":"2026-04-08T14:31:17.202Z","response_time":54,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-04-08T17:33:56.891Z","updated_at":"2026-04-08T17:33:57.649Z","avatar_url":"https://github.com/zircote.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# claude-spec-benchmark\n\nSWE-bench test harness for evaluating Claude's software engineering capabilities, extended with **SDD-Bench** for Spec-Driven Development evaluation.\n\n## Features\n\n### Core SWE-bench Harness\n- **SWE-bench Lite Support**: Load and evaluate against 300 curated tasks\n- **Claude Code Integration**: Spawn Claude Code CLI subprocess for patch generation\n- **Docker Isolation**: Each task runs in an isolated container for safety\n- **Multi-Metric Evaluation**: Test-based pass/fail, diff similarity, and custom metrics\n- **Rich Reporting**: Console tables and Markdown reports\n\n### SDD-Bench Extension\n- **Spec Degradation**: Simulate incomplete requirements with 5 degradation levels\n- **Elicitation Evaluation**: Measure requirement discovery through Socratic dialogue\n- **Framework Comparison**: A/B test passthrough vs Claude Code approaches\n- **Phase Pipeline**: Degrade → Elicit → Parse → Test → Implement → Refine → Validate\n- **Interactive Reports**: HTML dashboards with charts and filtering\n\n## Prerequisites\n\n- Python 3.14+\n- Docker (for task isolation)\n- Claude Code CLI (`claude`) installed and authenticated\n\n## Installation\n\n```bash\n# Clone the repository\ngit clone https://github.com/zircote/claude-spec-benchmark.git\ncd claude-spec-benchmark\n\n# Install with uv (recommended)\nuv sync\n\n# Or with pip\npip install -e .\n```\n\n## Quick Start\n\n```bash\n# List available tasks\nclaude-spec-benchmark list\n\n# Run benchmark on specific tasks (dry-run first)\nclaude-spec-benchmark run --tasks \"django__django-11099\" --dry-run\n\n# Run full benchmark\nclaude-spec-benchmark run --limit 10 --output ./results\n\n# Generate report from results\nclaude-spec-benchmark report ./results\n```\n\n## CLI Commands\n\n### `list` - List available tasks\n\n```bash\nclaude-spec-benchmark list                    # List all tasks\nclaude-spec-benchmark list --repo django/django  # Filter by repo\nclaude-spec-benchmark list --limit 50         # Limit results\n```\n\n### `run` - Execute benchmark\n\n```bash\nclaude-spec-benchmark run                     # Run all tasks\nclaude-spec-benchmark run --tasks \"task1,task2\"  # Specific tasks\nclaude-spec-benchmark run --repo django/django   # Tasks from repo\nclaude-spec-benchmark run --limit 10          # Limit task count\nclaude-spec-benchmark run --timeout 3600      # Custom timeout (seconds)\nclaude-spec-benchmark run --workers 8         # Parallel workers\nclaude-spec-benchmark run --model opus-4      # Model override\nclaude-spec-benchmark run --dry-run           # Preview without executing\n```\n\n### `report` - Generate reports\n\n```bash\nclaude-spec-benchmark report ./results                  # Console table\nclaude-spec-benchmark report ./results --format markdown  # Markdown\nclaude-spec-benchmark report ./results --format json     # JSON\n```\n\n### `info` - Display harness information\n\n```bash\nclaude-spec-benchmark info\n```\n\n## SDD-Bench Commands\n\nSDD-Bench extends the CLI with spec-driven development evaluation capabilities.\n\n### `sdd degrade` - Degrade specifications\n\nSimulate incomplete requirements by removing technical details:\n\n```bash\n# Degrade an issue to vague level\nclaude-spec-benchmark sdd degrade issue.txt --level vague\n\n# Show what was hidden\nclaude-spec-benchmark sdd degrade issue.txt --level minimal --show-hidden\n\n# Output as JSON\nclaude-spec-benchmark sdd degrade issue.txt --level partial --format json -o degraded.json\n```\n\n**Degradation Levels:**\n| Level | Description |\n|-------|-------------|\n| `full` | Original spec unchanged |\n| `partial` | Some implementation hints removed |\n| `vague` | Technical details abstracted |\n| `minimal` | Only user-facing behavior |\n| `ambiguous` | Deliberately unclear language |\n\n### `sdd run` - Execute SDD evaluation\n\nRun the full spec-driven development pipeline:\n\n```bash\n# Run with passthrough framework (baseline)\nclaude-spec-benchmark sdd run --framework passthrough --limit 10\n\n# Run with Claude Code framework\nclaude-spec-benchmark sdd run --framework claude-code --degradation vague\n\n# Skip certain phases\nclaude-spec-benchmark sdd run --skip-phases \"refine,validate\"\n\n# Specify tasks\nclaude-spec-benchmark sdd run --tasks \"django__django-11099\"\n```\n\n### `sdd extract` - Extract requirements\n\nParse specifications into atomic requirements:\n\n```bash\n# Extract requirements to JSON\nclaude-spec-benchmark sdd extract issue.txt --format json -o requirements.json\n\n# Text format output\nclaude-spec-benchmark sdd extract issue.txt --format text\n```\n\n### `sdd report` - Generate SDD reports\n\nGenerate reports from evaluation results:\n\n```bash\n# Generate interactive HTML report\nclaude-spec-benchmark sdd report ./sdd-results --format html\n\n# Markdown report\nclaude-spec-benchmark sdd report ./sdd-results --format markdown -o report.md\n\n# JSON export for analysis\nclaude-spec-benchmark sdd report ./sdd-results --format json\n\n# CSV export for spreadsheets\nclaude-spec-benchmark sdd report ./sdd-results --format csv\n\n# Compare two runs (A/B testing)\nclaude-spec-benchmark sdd report ./experiment --compare ./baseline\n```\n\n## Python API\n\n```python\nfrom claude_spec_benchmark import (\n    TaskLoader,\n    ClaudeCodeRunner,\n    DockerManager,\n    Evaluator,\n    MetricsCollector,\n)\n\n# Load tasks\nloader = TaskLoader()\nfor task in loader.iter_tasks(repos=[\"django/django\"]):\n    print(f\"Task: {task.instance_id}\")\n\n# Custom evaluation\ndocker = DockerManager()\nevaluator = Evaluator(docker)\n```\n\n## Project Structure\n\n```\nclaude-spec-benchmark/\n├── src/claude_spec_benchmark/\n│   ├── __init__.py          # Public API exports\n│   ├── main.py              # CLI entry point\n│   ├── models.py            # Pydantic data models\n│   ├── task_loader.py       # SWE-bench dataset loading\n│   ├── runner.py            # Claude Code subprocess execution\n│   ├── docker_manager.py    # Docker container isolation\n│   ├── evaluator.py         # Multi-metric evaluation engine\n│   ├── metrics.py           # Metrics collection and reporting\n│   ├── harness.py           # SWE-bench test harness integration\n│   ├── sdd_runner.py        # SDD pipeline orchestrator\n│   ├── swt_metrics.py       # SWT-bench test generation metrics\n│   │\n│   ├── degradation/         # Spec degradation module\n│   │   ├── engine.py        # Degradation engine\n│   │   └── patterns.py      # Repo-specific patterns\n│   │\n│   ├── elicitation/         # Elicitation evaluation module\n│   │   ├── oracle.py        # Requirement oracle\n│   │   ├── scoring.py       # Question relevance scoring\n│   │   └── extraction.py    # Requirements extraction\n│   │\n│   ├── frameworks/          # SDD framework implementations\n│   │   ├── base.py          # Abstract framework interface\n│   │   ├── passthrough.py   # Baseline passthrough\n│   │   └── claude_code.py   # Claude Code integration\n│   │\n│   └── reporting/           # Report generation\n│       └── dashboard.py     # HTML/Markdown/JSON/CSV reports\n│\n├── tests/                   # Unit tests for all modules\n├── pyproject.toml           # Project configuration\n└── Makefile                 # Development commands\n```\n\n## Development\n\n```bash\n# Install dev dependencies\nuv sync\n\n# Run tests\nmake test\n\n# Run tests with coverage\nmake test-cov\n\n# Run all quality checks\nmake quality\n\n# Format code\nmake format\n\n# Type check\nmake typecheck\n\n# Security scan\nmake security\n```\n\n## Custom Metrics\n\nExtend evaluation with custom metric plugins:\n\n```python\nfrom claude_spec_benchmark import MetricPlugin, Evaluator\n\nclass MyMetric(MetricPlugin):\n    @property\n    def name(self) -\u003e str:\n        return \"my_metric\"\n\n    def compute(self, task, task_run, test_results):\n        return {\"score\": 42}\n\nevaluator = Evaluator(docker_manager)\nevaluator.register_plugin(MyMetric())\n```\n\n## License\n\nMIT\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzircote%2Fclaude-spec-benchmark","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fzircote%2Fclaude-spec-benchmark","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzircote%2Fclaude-spec-benchmark/lists"}