{"id":29486379,"url":"https://github.com/imbios/ide-ai-benchmark","last_synced_at":"2026-04-18T12:03:42.233Z","repository":{"id":304594571,"uuid":"1019254345","full_name":"ImBIOS/ide-ai-benchmark","owner":"ImBIOS","description":"Comprehensive multi-IDE AI model benchmarking framework supporting Cursor, Windsurf, VSCode, and other IDEs with automated testing and performance comparison capabilities","archived":false,"fork":false,"pushed_at":"2025-07-14T03:52:22.000Z","size":96,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-09-30T20:42:20.004Z","etag":null,"topics":["ai-benchmarking","claude","copilot","cursor","ide-automation","openai","performance","testing","vscode","windsurf"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ImBIOS.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null},"funding":{"github":["ImBIOS"],"patreon":null,"open_collective":null,"ko_fi":null,"tidelift":null,"community_bridge":null,"liberapay":null,"issuehunt":null,"otechie":null,"custom":null}},"created_at":"2025-07-14T03:46:38.000Z","updated_at":"2025-07-14T03:52:25.000Z","dependencies_parsed_at":"2025-07-14T06:09:06.744Z","dependency_job_id":"008305b7-871e-40e5-ab00-25550e1351c5","html_url":"https://github.com/ImBIOS/ide-ai-benchmark","commit_stats":null,"previous_names":["imbios/ide-ai-benchmark"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/ImBIOS/ide-ai-benchmark","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ImBIOS%2Fide-ai-benchmark","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ImBIOS%2Fide-ai-benchmark/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ImBIOS%2Fide-ai-benchmark/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ImBIOS%2Fide-ai-benchmark/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ImBIOS","download_url":"https://codeload.github.com/ImBIOS/ide-ai-benchmark/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ImBIOS%2Fide-ai-benchmark/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":278747860,"owners_count":26038792,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-07T02:00:06.786Z","response_time":59,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-benchmarking","claude","copilot","cursor","ide-automation","openai","performance","testing","vscode","windsurf"],"created_at":"2025-07-15T08:00:29.130Z","updated_at":"2025-10-07T09:12:25.014Z","avatar_url":"https://github.com/ImBIOS.png","language":"Python","funding_links":["https://github.com/sponsors/ImBIOS"],"categories":[],"sub_categories":[],"readme":"# IDE AI Benchmark\n\nA comprehensive benchmarking framework to evaluate and compare different AI models (Claude, OpenAI, Gemini, etc.) across multiple IDEs and development environments (Cursor IDE, Windsurf IDE, Trae IDE, Claude Code CLI, VSCode + GitHub Copilot, etc.).\n\n## 🚀 Features\n\n- **Multi-IDE Support**: Automated testing across Cursor, Windsurf, Trae, VSCode, and more\n- **Cross-Model Comparison**: Compare Claude, OpenAI, Gemini, and other AI models\n- **Standardized Benchmarks**: Consistent testing methodology across all IDE/model combinations\n- **Performance Metrics**: Response time, code quality, accuracy, and completion rate analysis\n- **Real-world Scenarios**: Daily software engineering tasks and workflows\n- **Automated Evaluation**: AI-powered judging system to assess model performance objectively\n- **Comprehensive Reporting**: Detailed comparison reports with rankings and insights across IDEs\n\n## 🎯 Supported IDEs \u0026 AI Models\n\n### Supported IDEs\n- **Cursor IDE** - Claude, OpenAI, Gemini\n- **Windsurf IDE** - Claude, OpenAI, Gemini\n- **Trae IDE** - Various models\n- **Claude Code CLI** - Claude models\n- **VSCode** - GitHub Copilot, various extensions\n- **Others** - Extensible framework for adding new IDEs\n\n### Supported AI Models\n- **Anthropic**: Claude 3.5 Sonnet, Claude 3 Haiku, Claude 3 Opus\n- **OpenAI**: OpenAI, OpenAI Turbo\n- **Google**: Gemini Pro, Gemini Ultra\n- **GitHub**: Copilot (OpenAI based)\n- **Others**: Extensible for new models\n\n## 📋 Prerequisites\n\n- **Linux** (Ubuntu/Debian preferred)\n- **Python 3.13+**\n- **Target IDEs** installed and configured\n- **API Keys** for AI models you want to benchmark\n- **GUI Environment** (for interactive testing) or **Xvfb** (for headless automation)\n\n## 📦 Installation\n\n1. **Clone the repository**:\n\n```bash\ngit clone https://github.com/ImBIOS/ide-ai-benchmark.git\ncd ide-ai-benchmark\n```\n\n2. **Set up Python environment**:\n\n### Using venv (recommended)\n\n```bash\npython -m venv venv\nsource venv/bin/activate  # On Windows: venv\\Scripts\\activate\n\n# Install the package\npip install -e .\n\n# Install with test dependencies\npip install -e .[test]\n```\n\n### Using uv (fast alternative)\n\n```bash\n# Install dependencies\nuv sync\n\n# Install with test dependencies\nuv sync --extra test\n```\n\n3. **Install system dependencies** (Ubuntu/Debian):\n\n```bash\nsudo apt-get update\nsudo apt-get install -y \\\n    xvfb \\\n    x11-utils \\\n    xdotool \\\n    scrot \\\n    python3-tk \\\n    python3-dev \\\n    libxtst6 \\\n    libxss1 \\\n    libgtk-3-0 \\\n    python3.13-tk \\\n    python3.13-dev\n```\n\n4. **Install and configure IDEs**:\n\n```bash\n# Download Cursor IDE\nwget https://download.cursor.sh/linux/appImage/x64 -O cursor.AppImage\nchmod +x cursor.AppImage\n\n# Download Windsurf IDE (example)\n# wget \u003cwindsurf-download-url\u003e -O windsurf.AppImage\n# chmod +x windsurf.AppImage\n\n# Install VSCode with Copilot\nsudo snap install --classic code\n# Then install GitHub Copilot extension\n```\n\n5. **Configure API keys**:\n\n```bash\ncp .env.example .env\n# Edit .env with your API keys\nexport OPENAI_API_KEY=\"your-openai-key\"\nexport ANTHROPIC_API_KEY=\"your-anthropic-key\"\nexport GOOGLE_API_KEY=\"your-google-key\"\n```\n\n## 🧪 Running Benchmarks\n\n### Quick Start\n\n```bash\n# Run benchmarks across all IDEs and models\npython scripts/run_tests.py --all-ides --all-models\n\n# Compare specific IDE/model combinations\npython scripts/run_tests.py --ide cursor --model claude-3.5-sonnet\npython scripts/run_tests.py --ide vscode --model github-copilot\n\n# Run specific benchmark categories\npython scripts/run_tests.py --code-generation --ide cursor,windsurf\npython scripts/run_tests.py --performance --model gpt-4,claude-3.5-sonnet\n\n# Generate cross-IDE comparison report\npython scripts/run_tests.py --cross-ide-report\n```\n\n### Advanced Benchmarking\n\n```bash\n# Test specific IDE with multiple models\npython scripts/run_tests.py --ide cursor --models claude-3.5-sonnet,gpt-4,gpt-4-turbo\n\n# Run performance benchmarks only\npython scripts/run_tests.py --performance --timeout 300\n\n# Headless testing for CI/CD\npython scripts/run_tests.py --headless --quick\n\n# Custom test scenarios\npython scripts/run_tests.py --custom-scenarios scenarios/web-dev-tasks.json\n```\n\n### Manual Test Execution\n\n```bash\n# Test specific IDE functionality\npytest tests/test_ide_functionality.py::TestCursorIDE -v\npytest tests/test_ide_functionality.py::TestWindsurfIDE -v\n\n# Cross-IDE performance comparison\npytest tests/test_cross_ide_performance.py -v\n\n# AI model quality benchmarks\npytest tests/test_ai_model_quality.py -v\n\n# Real-world workflow tests\npytest tests/test_development_workflows.py -v\n```\n\n## 📊 Benchmark Categories\n\n### 1. Code Generation Tests (`test_code_generation.py`)\n\nCompare AI models across IDEs for:\n- Function and class creation\n- Algorithm implementation\n- Unit test generation\n- Documentation writing\n- API integration code\n- Database query generation\n\n### 2. Performance \u0026 Quality Benchmarks (`test_performance_quality.py`)\n\nEvaluate:\n- Response time across IDE/model combinations\n- Code quality and best practices adherence\n- Memory efficiency of generated code\n- Compilation/execution success rate\n- Security vulnerability detection\n- Code maintainability scores\n\n### 3. Cross-IDE Workflow Tests (`test_cross_ide_workflows.py`)\n\nReal-world engineering scenarios:\n- Bug fixing efficiency\n- Code refactoring quality\n- Feature implementation speed\n- Debugging assistance effectiveness\n- Code review automation\n- Project scaffolding capabilities\n\n### 4. AI Model Capabilities (`test_ai_capabilities.py`)\n\nModel-specific testing:\n- Context understanding depth\n- Multi-language programming support\n- Complex reasoning tasks\n- Code explanation quality\n- Architecture decision support\n\n## 🏗️ Framework Architecture\n\n```\n.\n├── src/\n│   ├── __init__.py\n│   └── ide_automation.py          # Multi-IDE automation framework\n├── tests/\n│   ├── test_ide_functionality.py  # Basic IDE automation tests\n│   ├── test_code_generation.py    # Code generation benchmarks\n│   ├── test_performance_quality.py # Performance and quality tests\n│   ├── test_cross_ide_workflows.py # Cross-IDE workflow tests\n│   └── test_ai_capabilities.py    # AI model capability tests\n├── scripts/\n│   ├── run_tests.py               # Multi-IDE test runner\n│   └── generate_reports.py       # Cross-IDE comparison reports\n├── config/\n│   ├── ide_configs.yml           # IDE-specific configurations\n│   └── model_configs.yml         # AI model configurations\n├── scenarios/\n│   ├── web-dev-tasks.json        # Web development scenarios\n│   ├── data-science-tasks.json   # Data science scenarios\n│   └── devops-tasks.json         # DevOps scenarios\n├── reports/                       # Generated benchmark reports\n├── screenshots/                   # IDE screenshots during tests\n└── results/                       # Raw benchmark data\n```\n\n## 🔧 Configuration\n\n### Environment Variables\n\n```bash\n# IDE Application Paths\nexport CURSOR_PATH=\"/path/to/cursor\"\nexport WINDSURF_PATH=\"/path/to/windsurf\"\nexport VSCODE_PATH=\"/usr/bin/code\"\nexport TRAE_PATH=\"/path/to/trae\"\n\n# AI Model API Keys\nexport OPENAI_API_KEY=\"your-openai-key\"\nexport ANTHROPIC_API_KEY=\"your-anthropic-key\"\nexport GOOGLE_API_KEY=\"your-google-key\"\n\n# Display for headless mode\nexport DISPLAY=:99\n```\n\n### IDE Configuration (`config/ide_configs.yml`)\n\n```yaml\ncursor:\n  launch_args: [\"--no-sandbox\", \"--disable-dev-shm-usage\"]\n  models: [\"claude-3.5-sonnet\", \"gpt-4\", \"gpt-4-turbo\"]\n  shortcuts:\n    ai_chat: \"ctrl+l\"\n    command_palette: \"ctrl+shift+p\"\n\nwindsurf:\n  launch_args: [\"--no-sandbox\"]\n  models: [\"claude-3.5-sonnet\", \"gpt-4\", \"gemini-pro\"]\n  shortcuts:\n    ai_chat: \"ctrl+i\"\n    command_palette: \"ctrl+shift+p\"\n\nvscode:\n  launch_args: [\"--no-sandbox\"]\n  models: [\"github-copilot\"]\n  shortcuts:\n    copilot_chat: \"ctrl+shift+i\"\n    command_palette: \"ctrl+shift+p\"\n```\n\n## 📈 Cross-IDE AI Model Benchmarking\n\nThe framework provides comprehensive comparison across multiple dimensions:\n\n### Performance Metrics\n\n- **Response Time**: Time to generate code across IDE/model combinations\n- **Completion Quality**: Accuracy and usefulness of generated code\n- **Context Awareness**: How well models understand project context\n- **IDE Integration**: Smoothness of model integration within each IDE\n\n### Capability Assessment\n\n- **Code Generation**: Function, class, and algorithm creation quality\n- **Code Explanation**: Ability to explain existing code\n- **Debugging**: Bug identification and fix suggestions\n- **Refactoring**: Code improvement recommendations\n- **Testing**: Unit test generation and test-driven development\n\n### Cross-IDE Consistency\n\n- **Model Behavior**: How consistently models perform across different IDEs\n- **Feature Parity**: Comparison of AI features available in each IDE\n- **Workflow Efficiency**: Which IDE/model combinations work best for specific tasks\n\n## 🎯 Writing Custom Benchmarks\n\n### Basic Test Structure\n\n```python\nimport pytest\nfrom ide_automation import create_ide_automation\n\nclass TestCustomBenchmark:\n    @pytest.fixture(params=[\"cursor\", \"windsurf\", \"vscode\"])\n    def ide_app(self, request):\n        app = create_ide_automation(request.param)\n        assert app.launch_app()\n        yield app\n        app.close_app()\n\n    def test_custom_ai_functionality(self, ide_app):\n        # Test AI model switching\n        models = ide_app.get_ai_models()\n        for model in models:\n            assert ide_app.switch_ai_model(model)\n\n            # Test AI completion\n            prompt = \"Write a Python function to sort a list\"\n            assert ide_app.trigger_ai_completion(prompt)\n\n            response = ide_app.get_ai_response()\n            assert \"def\" in response  # Basic validation\n```\n\n### Cross-IDE Comparison Test\n\n```python\ndef test_cross_ide_code_generation():\n    ides = [\"cursor\", \"windsurf\", \"vscode\"]\n    prompt = \"Create a REST API endpoint for user management\"\n    results = {}\n\n    for ide_name in ides:\n        ide = create_ide_automation(ide_name)\n        ide.launch_app()\n\n        # Test with each available model\n        for model in ide.get_ai_models():\n            ide.switch_ai_model(model)\n            ide.trigger_ai_completion(prompt)\n            response = ide.get_ai_response()\n\n            results[f\"{ide_name}_{model}\"] = {\n                \"response\": response,\n                \"quality_score\": evaluate_code_quality(response),\n                \"response_time\": measure_response_time()\n            }\n\n        ide.close_app()\n\n    # Compare results across IDEs\n    generate_comparison_report(results)\n```\n\n## 📊 Reports and Output\n\n### Cross-IDE Comparison Reports\n\n- `reports/cross-ide-comparison.html` - Comprehensive IDE/model comparison\n- `reports/model-performance-matrix.html` - Performance matrix across all combinations\n- `reports/workflow-efficiency.html` - Task-specific IDE/model recommendations\n\n### Individual IDE Reports\n\n- `reports/cursor-benchmark.html` - Cursor IDE specific results\n- `reports/windsurf-benchmark.html` - Windsurf IDE specific results\n- `reports/vscode-benchmark.html` - VSCode specific results\n\n### Raw Data\n\n- `results/benchmark_data.json` - Complete benchmark dataset\n- `results/response_times.csv` - Response time measurements\n- `results/quality_scores.csv` - Code quality assessments\n\n## 🔄 CI/CD Integration\n\nThe framework includes GitHub Actions for continuous benchmarking:\n\n```yaml\n# .github/workflows/cross-ide-benchmark.yml\nname: Cross-IDE AI Model Benchmark\n\non:\n  schedule:\n    - cron: '0 2 * * *'  # Daily at 2 AM\n  workflow_dispatch:\n\njobs:\n  benchmark:\n    runs-on: ubuntu-latest\n    strategy:\n      matrix:\n        ide: [cursor, windsurf, vscode]\n\n    steps:\n      - uses: actions/checkout@v3\n      - name: Setup Python\n        uses: actions/setup-python@v4\n        with:\n          python-version: '3.13'\n\n      - name: Install dependencies\n        run: |\n          pip install -e .[test]\n\n      - name: Run IDE benchmarks\n        env:\n          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}\n          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}\n        run: |\n          xvfb-run python scripts/run_tests.py --ide ${{ matrix.ide }} --all-models\n\n      - name: Upload results\n        uses: actions/upload-artifact@v3\n        with:\n          name: benchmark-results-${{ matrix.ide }}\n          path: reports/\n```\n\n## 🚀 Getting Started Guide\n\n### 1. Quick Setup for Cursor vs VSCode Comparison\n\n```bash\n# Install the framework\ngit clone https://github.com/ImBIOS/ide-ai-benchmark.git\ncd ide-ai-benchmark\npip install -e .[test]\n\n# Set up API keys\nexport OPENAI_API_KEY=\"your-key\"\nexport ANTHROPIC_API_KEY=\"your-key\"\n\n# Run comparison\npython scripts/run_tests.py --ide cursor,vscode --models claude-3.5-sonnet,github-copilot --quick\n```\n\n### 2. Full Multi-IDE Benchmark\n\n```bash\n# Download and set up all IDEs\n./scripts/setup_ides.sh\n\n# Run comprehensive benchmark\npython scripts/run_tests.py --all-ides --all-models --comprehensive\n\n# Generate reports\npython scripts/generate_reports.py --cross-ide-analysis\n```\n\n## 🐛 Troubleshooting\n\n### IDE-Specific Issues\n\n1. **Cursor not launching**\n   ```bash\n   export CURSOR_PATH=\"/correct/path/to/cursor.AppImage\"\n   chmod +x cursor.AppImage\n   ```\n\n2. **VSCode Copilot not working**\n   ```bash\n   code --install-extension GitHub.copilot\n   # Authenticate Copilot in VSCode\n   ```\n\n3. **Windsurf configuration**\n   ```bash\n   # Check Windsurf installation\n   ./windsurf.AppImage --version\n   ```\n\n### API Key Issues\n\n```bash\n# Verify API keys\npython scripts/verify_api_keys.py\n\n# Test model access\npython -c \"\nfrom ide_automation import create_ide_automation\nide = create_ide_automation('cursor')\nprint(ide.get_ai_models())\n\"\n```\n\n## 🤝 Contributing\n\nWe welcome contributions to expand IDE and model support!\n\n### Adding New IDEs\n\n1. Create a new class inheriting from `IDEAutomation`\n2. Implement all abstract methods\n3. Add configuration in `config/ide_configs.yml`\n4. Create tests in `tests/test_ide_functionality.py`\n\n### Adding New AI Models\n\n1. Update model lists in IDE classes\n2. Implement model switching logic\n3. Add API integration if needed\n4. Update documentation\n\n## 📝 License\n\nTBD (To Be Determined)\n\n## 📞 Support\n\n- **Issues**: Use GitHub Issues for bug reports and feature requests\n- **Discussions**: GitHub Discussions for questions and ideas\n- **Email**: Contact for enterprise support\n\n## 🚀 Roadmap\n\n- [ ] **IDE Support**: JetBrains IDEs, Sublime Text, Vim/Neovim\n- [ ] **Model Support**: Local models (Ollama), CodeLlama, StarCoder\n- [ ] **Advanced Metrics**: Code security analysis, performance benchmarks\n- [ ] **Real-time Dashboard**: Live benchmark results and leaderboards\n- [ ] **Custom Scenarios**: Industry-specific benchmark suites\n- [ ] **Integration**: Slack/Discord notifications, webhook support\n\n---\n\n**Start benchmarking your AI coding assistants today!** 🚀\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fimbios%2Fide-ai-benchmark","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fimbios%2Fide-ai-benchmark","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fimbios%2Fide-ai-benchmark/lists"}