{"id":25527137,"url":"https://github.com/openai/swelancer-benchmark","last_synced_at":"2025-05-14T16:12:55.410Z","repository":{"id":278238568,"uuid":"934963130","full_name":"openai/SWELancer-Benchmark","owner":"openai","description":"This repo contains the dataset and code for the paper \"SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?\"","archived":false,"fork":false,"pushed_at":"2025-04-03T19:07:31.000Z","size":56407,"stargazers_count":1317,"open_issues_count":24,"forks_count":116,"subscribers_count":16,"default_branch":"main","last_synced_at":"2025-04-06T08:04:13.511Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/openai.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-02-18T17:23:15.000Z","updated_at":"2025-04-05T19:38:29.000Z","dependencies_parsed_at":"2025-03-06T01:33:40.587Z","dependency_job_id":null,"html_url":"https://github.com/openai/SWELancer-Benchmark","commit_stats":null,"previous_names":["openai/swelancer-benchmark"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/openai%2FSWELancer-Benchmark","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/openai%2FSWELancer-Benchmark/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/openai%2FSWELancer-Benchmark/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/openai%2FS
WELancer-Benchmark/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/openai","download_url":"https://codeload.github.com/openai/SWELancer-Benchmark/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248708509,"owners_count":21149012,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-02-19T22:17:15.138Z","updated_at":"2025-04-13T11:38:24.410Z","avatar_url":"https://github.com/openai.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# SWE-Lancer\n\nThis repo contains the dataset and code for the paper [\"SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?\"](https://www.openai.com/index/swe-lancer/).\n\n---\n\nThank you so much for checking out our benchmark! If you have questions, run into issues, or want to contribute, please open an issue or pull request. You can also reach us at samuelgm@openai.com and michele@openai.com at any time.\n\nWe will continue to update this repository with the latest tasks, updates to the scaffolding, and improvements to the codebase.\n\n- If you'd like to use the latest version, please use the `main` branch.\n\n- If you'd like to use the version of the dataset and codebase from the time of paper release, please check out the `paper` branch. Note that the performance outlined in our paper is on our internal scaffold. We've aimed to open-source as much of it as possible, but the open-source agent and harness may not be exactly the same. 
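The branch workflow above can be sketched with a throwaway local repository so the commands run anywhere (the `demo` repo and identity values are placeholders; in practice you would clone `openai/SWELancer-Benchmark` and check out its real `main` or `paper` branch):

```shell
# Create a disposable repo with one empty commit, standing in for a real clone
git init -q demo
git -C demo -c user.email=you@example.com -c user.name=demo \
  commit -q --allow-empty -m "init"

# Make and switch to a snapshot branch, mirroring `git checkout paper` in a real clone
git -C demo branch paper
git -C demo checkout -q paper
git -C demo rev-parse --abbrev-ref HEAD   # prints: paper
```

The same two-command pattern (`git checkout main` vs. `git checkout paper`) is all that is needed to switch between the latest and paper-time versions of the benchmark.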
\n\n\n---\n\n**Step 1: Package Management and Requirements**\n\nPython 3.11 is the most stable version to use with SWE-Lancer.\n\nFor package management, you can either use the repo's pre-built virtualenv or build one from scratch.\n\nWe recommend using the pre-built virtualenv with [uv](https://github.com/astral-sh/uv), a lightweight OSS package manager. To do this, run:\n\n```bash\nuv sync\nsource .venv/bin/activate\nfor proj in nanoeval alcatraz nanoeval_alcatraz; do\n  uv pip install -e project/\"$proj\"\ndone\n```\n\nTo use your own virtualenv, without uv, run:\n\n```bash\npython -m venv .venv\nsource .venv/bin/activate\npip install -r requirements.txt\nfor proj in nanoeval alcatraz nanoeval_alcatraz; do\n  pip install -e project/\"$proj\"\ndone\n```\n\n**Step 2: Build the Docker Image**\n\nPlease run the command that corresponds to your computer's architecture.\n\nFor Apple Silicon (or other ARM64 systems):\n\n```bash\ndocker buildx build \\\n  -f Dockerfile \\\n  --ssh default=$SSH_AUTH_SOCK \\\n  -t swelancer \\\n  .\n```\n\nFor Intel-based Mac (or other x86_64 systems):\n\n```bash\ndocker buildx build \\\n  -f Dockerfile_x86 \\\n  --platform linux/amd64 \\\n  --ssh default=$SSH_AUTH_SOCK \\\n  -t swelancer \\\n  .\n```\n\nAfter the build completes, you can run the resulting `swelancer` container.\n\n**Step 3: Configure Environment Variables**\n\nEnsure you have an OpenAI API key and username set on your machine.\n\nLocate the `sample.env` file in the root directory. This file contains template environment variables needed for the application:\n\n```plaintext\n# sample.env contents example:\nPUSHER_APP_ID=your-app-id\n# ... other variables\n```\n\nCreate a new file named `.env`, copy the contents of `sample.env` into it, and fill in your own values.\n\n**Step 4: Running SWE-Lancer**\n\nYou are now ready to run the eval with:\n\n```bash\nuv run python run_swelancer.py\n```\n\nYou should immediately see logging output as the container gets set up and the tasks are loaded, which may take several minutes. 
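Before the run above can reach the API, the `.env` from Step 3 needs real values. A minimal sketch is below; only `PUSHER_APP_ID` appears in the `sample.env` excerpt above, so the other key name here is an assumption, and `sample.env` itself is the authoritative list:

```plaintext
PUSHER_APP_ID=your-app-id
OPENAI_API_KEY=sk-...
# ... copy the remaining keys from sample.env and fill them in
```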
You can adjust the model, concurrency, recording, and other parameters in `run_swelancer.py`.\n\n## Running at Scale\n\nTo run SWELancer at scale in your own environment, you'll need to implement your own compute infrastructure. Here's a high-level overview of how to integrate SWELancer with your compute system:\n\n### 1. Implement a Custom ComputerInterface\n\nCreate your own implementation of the `ComputerInterface` class that interfaces with your compute infrastructure. The main methods you need to implement are:\n\n```python\nclass YourComputerInterface(ComputerInterface):\n  async def send_shell_command(self, command: str) -\u003e CommandResult:\n    \"\"\"Execute a shell command and return the result\"\"\"\n    pass\n\n  async def upload(self, local_path: str, remote_path: str) -\u003e None:\n    \"\"\"Upload a file to the compute environment\"\"\"\n    pass\n\n  async def download(self, remote_path: str) -\u003e bytes:\n    \"\"\"Download a file from the compute environment\"\"\"\n    pass\n\n  async def check_shell_command(self, command: str) -\u003e CommandResult:\n    \"\"\"Execute a shell command and raise an error if it fails\"\"\"\n    pass\n\n  async def cleanup(self) -\u003e None:\n    \"\"\"Clean up any resources\"\"\"\n    pass\n```\n\n### 2. 
Update the Computer Start Function\n\nModify `swelancer_agent.py`'s `_start_computer` function to use your custom interface:\n\n```python\nasync def _start_computer(self, task: ComputerTask) -\u003e AsyncGenerator[ComputerInterface, None]:\n    # Implement your compute logic here\n\n    # Initialize your compute environment\n    # This could involve:\n    # - Spinning up a container/VM\n    # - Setting up SSH connections\n    # - Configuring environment variables\n\n    # Yield your custom ComputerInterface implementation\n    # (the function is an async generator, so use yield rather than return)\n    yield YourComputerInterface()\n```\n\n### Reference Implementation\n\nFor a complete example of a ComputerInterface implementation, you can refer to the `alcatraz_computer_interface.py` file in the codebase. This shows how to:\n\n- Handle command execution\n- Manage file transfers\n- Deal with environment setup\n- Handle cleanup and resource management\n\n### Best Practices\n\n1. **Resource Management**\n\n   - Implement proper cleanup in your interface\n   - Handle container/VM lifecycle appropriately\n   - Clean up temporary files\n\n2. **Security**\n\n   - Implement proper isolation between tasks\n   - Handle sensitive data appropriately\n   - Control network access\n\n3. **Scalability**\n\n   - Consider implementing a pool of compute resources\n   - Handle concurrent task execution\n   - Implement proper resource limits\n\n4. 
**Error Handling**\n   - Implement robust error handling\n   - Provide meaningful error messages\n   - Handle network issues gracefully\n\n## Citation\n\n```\n@misc{miserendino2025swelancerfrontierllmsearn,\n      title={SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?},\n      author={Samuel Miserendino and Michele Wang and Tejal Patwardhan and Johannes Heidecke},\n      year={2025},\n      eprint={2502.12115},\n      archivePrefix={arXiv},\n      primaryClass={cs.LG},\n      url={https://arxiv.org/abs/2502.12115},\n}\n```\n\n## Utilities\n\nWe include the following utilities to facilitate future research:\n\n- `download_videos.py` allows you to download the videos attached to an Expensify GitHub issue if your model supports video input.\n\n## SWELancer-Lite\n\nIf you'd like to run SWELancer-Lite, swap out `swelancer_tasks.csv` with `swelancer_tasks_lite.csv` in `swelancer.py`. The lite dataset contains 174 tasks, each worth over $1,000 (61 IC SWE tasks and 113 SWE Manager tasks).","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopenai%2Fswelancer-benchmark","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fopenai%2Fswelancer-benchmark","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopenai%2Fswelancer-benchmark/lists"}