https://github.com/runloopai/public_benchmarks_example
Simple examples of how to run public benchmarks with Runloop
- Host: GitHub
- URL: https://github.com/runloopai/public_benchmarks_example
- Owner: runloopai
- Created: 2025-04-30T20:32:04.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-06-17T00:39:09.000Z (11 months ago)
- Last Synced: 2025-06-17T01:35:25.195Z (11 months ago)
- Language: Python
- Size: 42 KB
- Stars: 0
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Public Benchmarks Example
This repository contains a script to run public benchmarks using the Runloop API.
## Setup
Export your Runloop API key as an environment variable.
You can create a key in the Runloop dashboard at https://platform.runloop.ai/manage/keys
```bash
export RUNLOOP_API_KEY="<your-api-key>"
```
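Before starting a long run, it can help to confirm that the key is visible to the process. The following is a minimal sketch, assuming the key is read from the `RUNLOOP_API_KEY` environment variable exported above; the helper name `require_api_key` is hypothetical and is not part of this repository.

```python
import os


def require_api_key(env=None):
    """Return the Runloop API key, or raise if it is missing or blank."""
    # `env` defaults to the real process environment; passing a plain dict
    # is a hypothetical convenience for testing this sketch.
    env = os.environ if env is None else env
    key = env.get("RUNLOOP_API_KEY", "").strip()
    if not key:
        raise RuntimeError("RUNLOOP_API_KEY is not set; export it first.")
    return key
```

Calling `require_api_key()` at the top of a script fails fast with a readable message instead of a later authentication error.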
### Python setup
1. Install `uv` (if not already installed):
See: https://docs.astral.sh/uv/getting-started/installation/
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```
2. Sync dependencies:
```bash
uv sync
```
### Node setup
1. Install `Node.js` from [https://nodejs.org/en/download](https://nodejs.org/en/download) (if not already installed)
2. Install the packages with your package manager:
```bash
npm install # or pnpm install
```
## Usage
The script can be run in several ways:
- With Python, use the command `uv run run_public_benchmark.py`
- With TypeScript, use the command `npx tsx runPublicBenchmark.ts`
- You can also use `npm run test` to see an example of running a test on a single scenario by ID.

The rest of this README uses the Python command.
1. Run a specific benchmark:
```bash
uv run run_public_benchmark.py --benchmark-id <benchmark-id>
```
2. Run a specific scenario by ID:
```bash
uv run run_public_benchmark.py --scenario-id <scenario-id>
```
3. Run a specific scenario by name:
```bash
uv run run_public_benchmark.py --scenario-name <scenario-name>
```
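The three selectors above could be parsed along these lines. This is only a sketch of one way to handle the flags with `argparse`, not the actual contents of `run_public_benchmark.py`; treating the flags as mutually exclusive and required is an assumption, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser():
    """Build a parser for the three target selectors (illustrative only)."""
    parser = argparse.ArgumentParser(prog="run_public_benchmark.py")
    # Assumption: exactly one selector is expected per invocation.
    target = parser.add_mutually_exclusive_group(required=True)
    target.add_argument("--benchmark-id", help="ID of a benchmark to run")
    target.add_argument("--scenario-id", help="ID of a single scenario to run")
    target.add_argument("--scenario-name", help="name of a single scenario to run")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["--scenario-name", "astropy__astropy-12907"])
print(args.scenario_name)
```

With this layout, `argparse` rejects an invocation that passes two selectors at once or none at all.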
## SWE Bench Examples
1. Run the full SWE Bench Verified benchmark:
```bash
uv run run_public_benchmark.py --benchmark-id bmd_2zmp3Mu3LhWu7yDVIfq3m
```
2. Run a specific SWE Bench Verified scenario by instance ID:
See the full list of scenarios at: https://huggingface.co/datasets/princeton-nlp/SWE-bench_Verified
```bash
uv run run_public_benchmark.py --scenario-name astropy__astropy-12907
```
### Additional Options
- `--keep-devbox`: Keep the devbox running after scoring for manual inspection and debugging
- `--force-clear-running-devboxes`: Force shutdown all running devboxes before running the benchmark/scenario
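The effect of `--keep-devbox` can be pictured as a conditional cleanup step. The sketch below is illustrative only and assumes a `try`/`finally` pattern; `FakeDevbox` and `score_scenario` are hypothetical stand-ins and do not reflect the Runloop API or the script's real structure.

```python
class FakeDevbox:
    """Stand-in for a real devbox handle; purely illustrative."""

    def __init__(self):
        self.running = True

    def shutdown(self):
        self.running = False


def score_scenario(devbox, keep_devbox=False):
    """Score on `devbox`, then shut it down unless asked to keep it."""
    try:
        # Placeholder for the real scoring step.
        return "scored"
    finally:
        # Cleanup runs even if scoring raises; skipped when keep_devbox is set.
        if not keep_devbox:
            devbox.shutdown()
```

Putting the shutdown in `finally` means a failed scoring step does not leak a running devbox, while `keep_devbox=True` leaves it available for inspection.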
## Notes
- The script limits concurrent scenario runs to 50.
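A concurrency cap like this is commonly implemented with a semaphore. The following sketch assumes an `asyncio`-based runner; `run_all` and `run_one` are hypothetical names, and the script itself may enforce the limit differently.

```python
import asyncio

MAX_CONCURRENT = 50  # the limit stated in the note above


async def run_all(scenario_ids, run_one, limit=MAX_CONCURRENT):
    """Run `run_one` for each ID with at most `limit` runs in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(scenario_id):
        # Wait for a free slot before starting this scenario run.
        async with semaphore:
            return await run_one(scenario_id)

    # gather preserves input order in the returned list of results.
    return await asyncio.gather(*(guarded(s) for s in scenario_ids))
```

Here `run_one` is any coroutine function that takes a scenario ID, so the same wrapper works whether the IDs come from a full benchmark or from a short hand-picked list.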