{"id":31861151,"url":"https://github.com/scaleapi/swe-bench_pro-os","last_synced_at":"2025-10-12T16:27:39.625Z","repository":{"id":315647124,"uuid":"1051347649","full_name":"scaleapi/SWE-bench_Pro-os","owner":"scaleapi","description":"SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?","archived":false,"fork":false,"pushed_at":"2025-10-02T23:54:25.000Z","size":3641,"stargazers_count":177,"open_issues_count":19,"forks_count":14,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-10-03T00:14:14.383Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/scaleapi.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-09-05T20:42:53.000Z","updated_at":"2025-10-02T22:10:34.000Z","dependencies_parsed_at":"2025-09-19T22:14:34.176Z","dependency_job_id":null,"html_url":"https://github.com/scaleapi/SWE-bench_Pro-os","commit_stats":null,"previous_names":["scaleapi/swe-bench_pro-os"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/scaleapi/SWE-bench_Pro-os","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scaleapi%2FSWE-bench_Pro-os","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scaleapi%2FSWE-bench_Pro-os/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scaleapi%2FSWE-bench_Pro-os/releases","manifests_url":"https://repos.ec
osyste.ms/api/v1/hosts/GitHub/repositories/scaleapi%2FSWE-bench_Pro-os/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/scaleapi","download_url":"https://codeload.github.com/scaleapi/SWE-bench_Pro-os/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scaleapi%2FSWE-bench_Pro-os/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279012009,"owners_count":26085041,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-12T02:00:06.719Z","response_time":53,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-10-12T16:27:37.265Z","updated_at":"2025-10-12T16:27:39.619Z","avatar_url":"https://github.com/scaleapi.png","language":"Python","readme":"## SWE-Bench Pro\n\nCode and data for the following works:\n* \u003ca href=\"https://static.scale.com/uploads/654197dc94d34f66c0f5184e/SWEAP_Eval_Scale%20(9).pdf\"\u003eSWE-bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?\u003c/a\u003e\n\n* HuggingFace: \u003ca href=\"https://huggingface.co/datasets/ScaleAI/SWE-bench_Pro\"\u003ehttps://huggingface.co/datasets/ScaleAI/SWE-bench_Pro\u003c/a\u003e\n\n* Public Leaderboard: \u003ca href=\"https://scale.com/leaderboard/swe_bench_pro_public\"\u003ehttps://scale.com/leaderboard/swe_bench_pro_public\u003c/a\u003e\n\n* Commercial (Private) Leaderboard: \u003ca 
href=\"https://scale.com/leaderboard/swe_bench_pro_commercial\"\u003ehttps://scale.com/leaderboard/swe_bench_pro_commercial\u003c/a\u003e\n\n## News\n\n(10/3) Notes on reproducing paper results:\nFor the research paper, we ran SWE-Agent with a cost limit of $2 per instance and a limit of 50 turns. Since these limits constrain model performance, we are running additional evals with no cost limit and a turn limit of 250, and will report those results as well.\n\n## Overview\nSWE-Bench Pro is a challenging benchmark evaluating LLMs/Agents on long-horizon software engineering tasks.\nGiven a *codebase* and an *issue*, a language model is tasked with generating a *patch* that resolves the described problem.\n\nThe dataset is inspired by SWE-Bench: https://github.com/SWE-bench/SWE-bench\n\nTo access SWE-bench Pro, copy and run the following code:\n```python\nfrom datasets import load_dataset\nswebench = load_dataset('ScaleAI/SWE-bench_Pro', split='test')\n```\n\n## Setup\nSWE-bench Pro uses Docker for reproducible evaluations.\nIn addition, the evaluation script requires Modal to scale the evaluation set.\n\nFollow the instructions in the [Docker setup guide](https://docs.docker.com/engine/install/) to install Docker on your machine.\nIf you're setting up on Linux, we recommend following the [post-installation steps](https://docs.docker.com/engine/install/linux-postinstall/) as well.\n\nRun the following commands to store Modal credentials:\n```\npip install modal\nmodal setup # and follow the prompts to generate your token and secret\n```\n\nAfter running these steps, you should be able to see a token ID and secret in `~/.modal.toml`, for example:\n```\ntoken_id = \u003ctoken id\u003e\ntoken_secret = \u003ctoken secret\u003e\nactive = true\n```\n\nWe store prebuilt Docker images for each instance. 
They can be found in this Docker Hub repository:\n\nhttps://hub.docker.com/r/jefzda/sweap-images\n\nThe format of the images is as follows.\n\n`jefzda/sweap-images:{repo_base}.{repo_name}-{repo_base}__{repo_name}-{hash}`\n\nFor example:\n\n`jefzda/sweap-images:gravitational.teleport-gravitational__teleport-82185f232ae8974258397e121b3bc2ed0c3729ed-v626ec2a48416b10a88641359a169d99e935ff03`\n\nNote that bash runs by default in our images, so when running them you should not manually invoke bash. See https://github.com/scaleapi/SWE-bench_Pro-os/issues/6\n\n## Usage\nFirst, generate patch predictions using your harness of choice.\nThen evaluate the patch predictions on SWE-bench Pro with the following command:\n\n```bash\npython swe_bench_pro_eval.py \\\n    --raw_sample_path=external_hf_v2.csv \\\n    --patch_path={OUTPUT}/gold_patches.json \\\n    --output_dir={OUTPUT}/ \\\n    --scripts_dir=run_scripts \\\n    --num_workers=100 \\\n    --dockerhub_username=jefzda\n```\n\nReplace `gold_patches.json` with your patch JSON file, and point `--raw_sample_path` to the SWE-Bench Pro CSV.\nGold patches can be compiled from the HuggingFace dataset.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fscaleapi%2Fswe-bench_pro-os","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fscaleapi%2Fswe-bench_pro-os","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fscaleapi%2Fswe-bench_pro-os/lists"}