SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?
- Host: GitHub
- URL: https://github.com/scaleapi/swe-bench_pro-os
- Owner: scaleapi
- Created: 2025-09-05T20:42:53.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2025-10-02T23:54:25.000Z (4 months ago)
- Last Synced: 2025-10-03T00:14:14.383Z (4 months ago)
- Language: Python
- Homepage:
- Size: 3.47 MB
- Stars: 177
- Watchers: 1
- Forks: 14
- Open Issues: 19
Metadata Files:
- Readme: README.md
README
## SWE-Bench Pro
Code and data for the following works:
* SWE-bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?
* HuggingFace: https://huggingface.co/datasets/ScaleAI/SWE-bench_Pro
* Public Leaderboard: https://scale.com/leaderboard/swe_bench_pro_public
* Commercial (Private) Leaderboard: https://scale.com/leaderboard/swe_bench_pro_commercial
## News
(10/3) Notes on reproducing paper results:
For the research paper, we ran SWE-Agent with a cost limit of $2 per instance and a limit of 50 turns. Since these limits constrain model performance, we are running additional evaluations with no cost limit and a 250-turn limit, and will report those results as well.
## Overview
SWE-Bench Pro is a challenging benchmark evaluating LLMs/Agents on long-horizon software engineering tasks.
Given a *codebase* and an *issue*, a language model is tasked with generating a *patch* that resolves the described problem.
The dataset is inspired by SWE-bench: https://github.com/SWE-bench/SWE-bench
To access SWE-bench Pro, copy and run the following code:
```python
from datasets import load_dataset
swebench = load_dataset('ScaleAI/SWE-bench_Pro', split='test')
```
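Each row of the loaded split corresponds to one task instance. Below is a minimal sketch of inspecting it; the field names mentioned in the comments (e.g. `instance_id`, `problem_statement`) are assumed to follow the SWE-bench convention, so verify them against the printed column names.
```python
# Inspect the loaded split. Field names such as 'instance_id' and
# 'problem_statement' are assumed to follow the SWE-bench convention;
# check them against the printed column names before relying on them.
print(swebench.column_names)

first = swebench[0]  # indexing a Hugging Face Dataset returns a plain dict
print(first.get('instance_id'))
print(str(first.get('problem_statement', ''))[:500])
```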
## Setup
SWE-bench Pro uses Docker for reproducible evaluations.
In addition, the evaluation script requires Modal to scale evaluation across the full set of instances.
Follow the instructions in the [Docker setup guide](https://docs.docker.com/engine/install/) to install Docker on your machine.
If you're setting up on Linux, we recommend seeing the [post-installation steps](https://docs.docker.com/engine/install/linux-postinstall/) as well.
Run the following commands to store your Modal credentials:
```
pip install modal
modal setup # and follow the prompts to generate your token and secret
```
After running these steps, you should see a token ID and secret in `~/.modal.toml`, e.g.:
```
token_id =
token_secret =
active = true
```
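As a sanity check, you can confirm that the credentials file exists and contains a token. The snippet below is a minimal sketch that only reads the default config location written by `modal setup`.
```python
import os
import tomllib  # Python 3.11+; on older versions use the third-party 'toml' package

# Read the config written by `modal setup` and print it so you can confirm
# that token_id / token_secret entries are present.
config_path = os.path.expanduser("~/.modal.toml")
with open(config_path, "rb") as f:
    print(tomllib.load(f))
```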
We store prebuilt Docker images for each instance. They are available in this Docker Hub repository:
https://hub.docker.com/r/jefzda/sweap-images
The image tags follow this format:
`jefzda/sweap-images:{repo_base}.{repo_name}-{repo_base}__{repo_name}-{hash}`
For example:
`jefzda/sweap-images:gravitational.teleport-gravitational__teleport-82185f232ae8974258397e121b3bc2ed0c3729ed-v626ec2a48416b10a88641359a169d99e935ff03`
Note that bash runs by default in our images, so you should not manually invoke bash when running them. See https://github.com/scaleapi/SWE-bench_Pro-os/issues/6
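For illustration, here is a hedged sketch of assembling such an image tag for a specific instance and pulling it locally; the helper and its inputs (an `owner/name` repository string plus the hash suffix) are hypothetical and should be checked against the fields in the dataset/CSV you use.
```python
import subprocess

def sweap_image_tag(repo: str, hash_suffix: str) -> str:
    """Build a jefzda/sweap-images tag of the form
    {repo_base}.{repo_name}-{repo_base}__{repo_name}-{hash}.
    `repo` is assumed to look like 'gravitational/teleport'; `hash_suffix`
    is the hash portion shown in the example tag above."""
    base, name = repo.split("/", 1)
    return f"jefzda/sweap-images:{base}.{name}-{base}__{name}-{hash_suffix}"

tag = sweap_image_tag(
    "gravitational/teleport",
    "82185f232ae8974258397e121b3bc2ed0c3729ed-v626ec2a48416b10a88641359a169d99e935ff03",
)
subprocess.run(["docker", "pull", tag], check=True)
```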
## Usage
First generate patch predictions using your harness of choice.
Evaluate patch predictions on SWE-bench Pro with the following command:
```bash
python swe_bench_pro_eval.py \
--raw_sample_path=external_hf_v2.csv \
--patch_path={OUTPUT}/gold_patches.json \
--output_dir={OUTPUT}/ \
--scripts_dir=run_scripts \
--num_workers=100 \
--dockerhub_username=jefzda
```
Replace `gold_patches.json` with your own patch JSON and point `--raw_sample_path` to the SWE-Bench Pro CSV.
Gold patches can be compiled from the HuggingFace dataset.
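For example, here is a minimal sketch of compiling a gold-patch file from the HuggingFace dataset; the field names (`instance_id`, `patch`) and the mapping-style JSON layout are assumptions, so check them against the dataset schema and the input format expected by `swe_bench_pro_eval.py`.
```python
import json
from datasets import load_dataset

swebench = load_dataset('ScaleAI/SWE-bench_Pro', split='test')

# Map each instance to its reference (gold) patch. Field names are assumed
# to follow the SWE-bench convention ('instance_id', 'patch'); adjust to the
# actual schema and to the JSON layout the evaluation script expects.
gold_patches = {row['instance_id']: row['patch'] for row in swebench}

with open('gold_patches.json', 'w') as f:
    json.dump(gold_patches, f, indent=2)
```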