{"id":31126411,"url":"https://github.com/jimmc414/claudecode_n_codex_swebench","last_synced_at":"2025-09-17T22:48:36.805Z","repository":{"id":314387107,"uuid":"1050818331","full_name":"jimmc414/claudecode_n_codex_swebench","owner":"jimmc414","description":"Toolkit for measuring Claude Code and Codex performance over time against a baseline using SWEbench-lite dataset **No API key required for Max subscribers**","archived":false,"fork":false,"pushed_at":"2025-09-12T19:32:33.000Z","size":93,"stargazers_count":6,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2025-09-12T21:53:41.394Z","etag":null,"topics":["claude-code","claudecode","eval","evaluation-framework","swebench"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jimmc414.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-09-05T01:52:07.000Z","updated_at":"2025-09-12T21:36:07.000Z","dependencies_parsed_at":"2025-09-12T21:53:42.578Z","dependency_job_id":null,"html_url":"https://github.com/jimmc414/claudecode_n_codex_swebench","commit_stats":null,"previous_names":["jimmc414/claudecode_swebench","jimmc414/claudecode_n_codex_swebench"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/jimmc414/claudecode_n_codex_swebench","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jimmc414%2Fclaudecode_n_codex_swebench","tags_u
rl":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jimmc414%2Fclaudecode_n_codex_swebench/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jimmc414%2Fclaudecode_n_codex_swebench/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jimmc414%2Fclaudecode_n_codex_swebench/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jimmc414","download_url":"https://codeload.github.com/jimmc414/claudecode_n_codex_swebench/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jimmc414%2Fclaudecode_n_codex_swebench/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":275680307,"owners_count":25508570,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-17T02:00:09.119Z","response_time":84,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["claude-code","claudecode","eval","evaluation-framework","swebench"],"created_at":"2025-09-17T22:48:33.443Z","updated_at":"2025-09-17T22:48:36.776Z","avatar_url":"https://github.com/jimmc414.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# SWE-bench Code Model Performance Monitor\n\n## Purpose\n\nThis project provides an empirical framework for measuring the performance of code-focused language models like Claude Code and Codex on real-world 
software engineering tasks. It was built to provide objective, reproducible metrics that allow users to assess these tools for themselves, rather than relying on anecdotal reports or marketing claims.\n\nThe SWE-bench benchmark presents the model with actual GitHub issues from popular open-source projects and measures its ability to generate patches that successfully resolve these issues. This provides a concrete, measurable answer to the question: \"How well do these code models actually perform on real software engineering tasks?\"\n\n\u003e **Platform support:** The tools in this repository run on Linux, macOS, and Windows (including WSL). Replace `python` with `python3` on Unix-like systems or `py` on Windows if needed.\n\n## Getting Started in 5 Minutes\n\n```bash\n# Assuming you have Python, a code model CLI (Claude or Codex), and Docker installed:\n# Replace `python` with `python3` on Linux/macOS or `py` on Windows if needed.\ngit clone https://github.com/jimmc414/claudecode_n_codex_swebench.git\ncd claudecode_n_codex_swebench\npython -m pip install -r requirements.txt\npython swe_bench.py run --limit 1  # Run your first test (~10 min)\npython swe_bench.py check           # See your results\n```\n\nFor detailed setup instructions, see [Prerequisites](#prerequisites) and [Installation](#installation) below.\n\n## Quick Start (After Installation)\n\n```bash\n# 1. Run your first test (1 instance, ~10 minutes)\npython swe_bench.py run --limit 1               # Claude Code (default)\npython swe_bench.py run --limit 1 --backend codex  # Codex\n\n# 2. Check your results\npython swe_bench.py check\n\n# 3. Try a larger test when ready (10 instances, ~2 hours)\npython swe_bench.py quick\n```\n\n\n## Prerequisites\n\nBefore starting, ensure you have:\n\n1. **Python 3.8 or newer**\n   ```bash\n   python --version  # or python3/py --version\n   ```\n\n2. 
**Claude Code or Codex CLI installed and logged in**\n   ```bash\n   # Claude Code\n   claude --version  # Should work without errors\n   # Codex\n   codex --version   # Should work without errors\n   # If not logged in, run the relevant CLI (claude or codex)\n   ```\n\n3. **Docker installed and running**\n   ```bash\n   docker --version  # Should show version\n   docker ps        # Should work without \"daemon not running\" error\n   ```\n   - Needs ~50GB free disk space for images\n   - 16GB+ RAM recommended\n   - For Mac/Windows: Increase Docker Desktop memory to 8GB+\n   \n   **Don't have Docker?** The easiest way is to ask Claude Code to set it up:\n   ```bash\n   claude  # Open Claude Code\n   # Then ask: \"Please help me install Docker on my system\"\n   ```\n   Or see [Manual Docker Setup](#docker-setup) below.\n\n## Installation\n\n```bash\n# 1. Clone this repository\ngit clone \u003crepository-url\u003e\ncd claudecode_n_codex_swebench\n\n# 2. Install all Python dependencies (includes swebench)\npython -m pip install -r requirements.txt  # Use python3/py as needed\n\n# 3. 
Verify everything is working\npython swe_bench.py list-models               # Claude models\npython swe_bench.py list-models --backend codex  # Codex models\n\n# Optional: Quick test to verify full setup\npython swe_bench.py run --limit 1 --no-eval  # Test without Docker (2-5 min)\npython swe_bench.py run --limit 1            # Full test with Docker (10-15 min)\n```\n\n### Troubleshooting Setup\n\nIf you get errors:\n\n- **\"Claude CLI not found\"**: Install from https://claude.ai/download\n- **\"Codex CLI not found\"**: Ensure `codex` is installed and in your PATH\n- **\"Docker daemon not running\"**: Start Docker Desktop or `sudo systemctl start docker`\n- **\"swebench not found\"**: Run `pip install swebench`\n- **Out of memory**: Increase Docker memory in Docker Desktop settings\n- **Permission denied (Docker)**: Add yourself to docker group: `sudo usermod -aG docker $USER` then logout/login\n\n## Command Reference\n\n### Main Tool: `swe_bench.py`\n\nThe unified tool provides all functionality through a single entry point:\n\n```bash\n# Default: Run full 300-instance benchmark\npython swe_bench.py\n\n# Quick commands\npython swe_bench.py quick          # 10 instances with evaluation\npython swe_bench.py full           # 300 instances with evaluation\npython swe_bench.py check          # View scores and statistics\npython swe_bench.py list-models    # Show available models (Claude by default)\npython swe_bench.py list-models --backend codex  # Show Codex models\n```\n\n### Running Benchmarks\n\n```bash\n# Basic runs with different sizes\npython swe_bench.py run --quick                    # 10 instances\npython swe_bench.py run --standard                 # 50 instances\npython swe_bench.py run --full                     # 300 instances\npython swe_bench.py run --limit 25                 # Custom count\n\n# Model selection (September 2025 models)\npython swe_bench.py run --model opus-4.1 --quick   # Latest Opus\npython swe_bench.py run --model sonnet-3.7 --limit 
20\npython swe_bench.py run --model best --quick       # Best performance alias\n\n# Performance options\npython swe_bench.py run --quick --no-eval          # Skip Docker evaluation\npython swe_bench.py run --limit 20 --max-workers 4 # More parallel containers\n\n# Dataset selection\npython swe_bench.py run --dataset princeton-nlp/SWE-bench_Lite --limit 10\n```\n\n### Running Specific Test Instances\n\nWhen establishing a baseline or debugging specific issues, you can run SWE-bench against individual test instances:\n\n```bash\n# Using code_swe_agent.py directly (patch generation only)\npython code_swe_agent.py --instance_id django__django-11133\n\n# Specify backend explicitly\npython code_swe_agent.py --instance_id django__django-11133 --backend codex\n\n# With full SWE-bench dataset instead of Lite\npython code_swe_agent.py --instance_id django__django-11133 --dataset_name princeton-nlp/SWE-bench\n\n# With specific model for baseline comparison\npython code_swe_agent.py --instance_id django__django-11133 --model opus-4.1\n\n# Finding available instance IDs\npython -c \"from datasets import load_dataset; ds = load_dataset('princeton-nlp/SWE-bench_Lite', split='test'); print('\\\\n'.join([d['instance_id'] for d in ds][:20]))\"\n```\n\n**Use Cases for Single Instance Testing:**\n- Establishing performance baselines for specific problem types\n- Debugging Claude Code's approach to particular challenges\n- Comparing model performance on identical problems\n- Validating fixes after prompt or model updates\n\n**Note:** Instance IDs follow the format `\u003cowner\u003e__\u003crepo\u003e-\u003cpr_number\u003e` (e.g., `django__django-11133`, `sympy__sympy-20154`)\n\n### Evaluating Past Runs\n\n```bash\n# Interactive selection menu\npython swe_bench.py eval --interactive\n\n# Specific file\npython swe_bench.py eval --file predictions_20250902_163415.jsonl\n\n# By date\npython swe_bench.py eval --date 2025-09-02\npython swe_bench.py eval --date-range 2025-09-01 
2025-09-03\n\n# Recent runs\npython swe_bench.py eval --last 5\n\n# Preview without running\npython swe_bench.py eval --last 3 --dry-run\n```\n\n### Viewing Scores\n\n```bash\n# Basic score view\npython swe_bench.py scores\n\n# With statistics and analysis\npython swe_bench.py scores --stats --trends\n\n# Filter results\npython swe_bench.py scores --filter evaluated      # Only evaluated runs\npython swe_bench.py scores --filter pending        # Only pending evaluation\n\n# Export to CSV\npython swe_bench.py scores --export results.csv\n\n# Recent entries\npython swe_bench.py scores --last 10\n```\n\n## Model Selection\n\n### Available Models (September 2025)\n\n```bash\n# View all available models and their expected performance\npython swe_bench.py list-models\n```\n\n**Opus Models** (Most Capable):\n- `opus-4.1`: Latest, 30-40% expected score\n- `opus-4.0`: Previous version, 25-35% expected score\n\n**Sonnet Models** (Balanced):\n- `sonnet-4`: New generation, 20-30% expected score\n- `sonnet-3.7`: Latest 3.x, 18-25% expected score\n- `sonnet-3.6`: Solid performance, 15-22% expected score\n- `sonnet-3.5`: Fast/efficient, 12-20% expected score\n\n**Aliases**:\n- `best`: Maps to opus-4.1\n- `balanced`: Maps to sonnet-3.7\n- `fast`: Maps to sonnet-3.5\n\nYou can also use any model name accepted by Claude's `/model` command, including experimental or future models not yet in the registry.\n\n## Understanding Scores\n\n### Two Types of Scores\n\n1. **Generation Score**: Percentage of instances where a patch was created (misleading)\n2. **Evaluation Score**: Percentage of instances where the patch actually fixes the issue (real score)\n\nOnly the evaluation score matters. 
A 100% generation score with 20% evaluation score means Claude Code created patches for all issues but only 20% actually worked.\n\n### Expected Performance Ranges\n\nBased on empirical testing with SWE-bench:\n\n| Score Range | Performance Level | What It Means |\n|------------|------------------|---------------|\n| 0-5% | Poor | Patches rarely work, significant issues |\n| 5-10% | Below Average | Some success but needs improvement |\n| 10-15% | Average | Decent performance for an AI system |\n| 15-20% | Good | Solid performance, useful for real work |\n| 20-25% | Very Good | Strong performance, competitive |\n| 25-30% | Excellent | Top-tier performance |\n| 30%+ | Outstanding | Exceptional, rare to achieve |\n\n### Time Estimates\n\n| Test Size | Instances | Generation | Evaluation | Total Time |\n|-----------|-----------|------------|------------|------------|\n| Quick | 10 | ~20-30 min | ~30-50 min | ~1-2 hours |\n| Standard | 50 | ~2-3 hours | ~3-5 hours | ~5-8 hours |\n| Full | 300 | ~12-15 hours | ~20-30 hours | ~35-45 hours |\n\n## Project Structure\n\n```\nclaudecode_n_codex_swebench/\n├── swe_bench.py              # Main unified tool (all commands)\n├── code_swe_agent.py         # Core agent for Claude Code or Codex\n├── USAGE.md                  # Detailed command usage guide\n├── benchmark_scores.log      # Results log (JSON lines format)\n├── requirements.txt          # Python dependencies\n│\n├── utils/                    # Core utilities\n│   ├── claude_interface.py  # Claude Code CLI interface\n│   ├── prompt_formatter.py  # Formats issues into prompts\n│   ├── patch_extractor.py   # Extracts patches from responses\n│   └── model_registry.py    # Model definitions and aliases\n│\n├── prompts/                  # Prompt templates\n│   ├── swe_bench_prompt.txt # Default prompt\n│   ├── chain_of_thought_prompt.txt\n│   └── react_style_prompt.txt\n│\n├── predictions/              # Generated predictions (JSONL)\n├── results/                  # Detailed 
Claude outputs\n├── evaluation_results/       # Docker evaluation results\n└── backup/                   # Archived/unused files\n```\n\n## Verification Checklist\n\nUse this checklist to verify Claude Code is working properly:\n\n```bash\n# 1. Test setup (should list models)\npython swe_bench.py list-models\n\n# 2. Single instance test (5-10 minutes)\npython swe_bench.py run --limit 1 --no-eval\n\n# 3. Quick test with evaluation (1-2 hours)\npython swe_bench.py quick\n\n# 4. Check scores (should show real evaluation scores)\npython swe_bench.py check\n\n# 5. If scores look good, try larger test\npython swe_bench.py run --standard  # 50 instances\n```\n\n## Troubleshooting\n\n### Common Issues\n\n**Claude CLI not found**\n```bash\n# Ensure claude is in PATH\nwhich claude\n# If not found, reinstall Claude Code or add to PATH\n```\n\n**Docker permission denied**\n```bash\n# Add user to docker group\nsudo usermod -aG docker $USER\n# Log out and back in for changes to take effect\n```\n\n**Out of memory during evaluation**\n```bash\n# Reduce parallel workers\npython swe_bench.py run --quick --max-workers 1\n```\n\n**Evaluation times out**\n- Some instances take longer to test\n- Default timeout is 30 minutes per instance\n- This is normal for complex codebases\n\n**Low scores (0-5%)**\n- This may be normal for certain models or datasets\n- Try with a better model: `--model opus-4.1`\n- Check that Claude Code is properly authenticated\n\n### Log Files\n\n- **benchmark_scores.log**: Main results log (JSON lines)\n- **predictions/**: All generated patches\n- **evaluation_results/**: Detailed Docker test results\n- **results/**: Raw Claude Code outputs for debugging\n\n## Docker Setup\n\nIf you don't have Docker installed, here's how to set it up manually:\n\n### macOS\n```bash\n# Download Docker Desktop from:\n# https://www.docker.com/products/docker-desktop/\n# Or use Homebrew:\nbrew install --cask docker\n\n# Start Docker Desktop from Applications\n# Increase memory to 
8GB in Docker Desktop \u003e Settings \u003e Resources\n```\n\n### Ubuntu/Debian Linux\n```bash\n# Update packages and install prerequisites\nsudo apt update\nsudo apt install -y ca-certificates curl gnupg lsb-release\n\n# Add Docker's official GPG key\nsudo mkdir -p /etc/apt/keyrings\ncurl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg\n\n# Set up repository\necho \"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable\" | sudo tee /etc/apt/sources.list.d/docker.list \u003e /dev/null\n\n# Install Docker\nsudo apt update\nsudo apt install -y docker-ce docker-ce-cli containerd.io\n\n# Add your user to docker group (avoids needing sudo)\nsudo usermod -aG docker $USER\n# Log out and back in for this to take effect\n\n# Start Docker\nsudo systemctl start docker\nsudo systemctl enable docker\n\n# Verify installation\ndocker run hello-world\n```\n\n### Windows\n```bash\n# Download Docker Desktop from:\n# https://www.docker.com/products/docker-desktop/\n\n# Requirements:\n# - Windows 10/11 64-bit with WSL 2\n# - Enable virtualization in BIOS\n# - Install WSL 2 first if needed:\nwsl --install\n\n# After installing Docker Desktop:\n# - Increase memory to 8GB in Settings \u003e Resources\n# - Ensure WSL 2 backend is enabled\n```\n\n### Verify Docker is Ready\n```bash\n# Check Docker is installed\ndocker --version\n\n# Check Docker daemon is running\ndocker ps\n\n# Test Docker works\ndocker run hello-world\n```\n\n\n## License\n\nThis benchmarking framework is provided as-is for empirical evaluation purposes. SWE-bench is created by Princeton NLP. 
Claude Code is a product of Anthropic.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjimmc414%2Fclaudecode_n_codex_swebench","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjimmc414%2Fclaudecode_n_codex_swebench","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjimmc414%2Fclaudecode_n_codex_swebench/lists"}