Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/allenai/super-benchmark
- Host: GitHub
- URL: https://github.com/allenai/super-benchmark
- Owner: allenai
- License: apache-2.0
- Created: 2024-08-10T16:40:59.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2024-09-16T23:30:36.000Z (about 2 months ago)
- Last Synced: 2024-09-17T23:08:38.870Z (about 2 months ago)
- Language: Jupyter Notebook
- Size: 8.72 MB
- Stars: 20
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.md
Awesome Lists containing this project
README
SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research Repositories
A benchmark and resources for evaluating LLM agents on setting up and executing ML/NLP tasks from research repositories found in the wild on GitHub.
[arxiv]

---
## 📝 Benchmark Tasks
Dataset tasks are available in [HuggingFace Hub 🤗](https://huggingface.co/datasets/allenai/super).
We provide three sets: Expert (45 problems), Masked (152 problems), and AutoGen (602 problems).
Agent trajectories from the paper's experiments are available [here](trajectories).
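To browse the tasks programmatically, they can be loaded with the `datasets` library. This is a minimal sketch; the split name below mirrors the set names above and may not match the dataset card exactly, so check it for the actual identifiers:

```python
# Minimal sketch: load the SUPER tasks from the Hugging Face Hub.
# NOTE: "Expert" as a split name mirrors the set name above and is an assumption;
# check https://huggingface.co/datasets/allenai/super for the exact split/config names.
from datasets import load_dataset

expert = load_dataset("allenai/super", split="Expert")
print(f"{len(expert)} Expert tasks")
print(expert[0].keys())  # inspect the fields of a single task
```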
## 🚀 Quick Start: Running the Agent
### Setup
#### 1. Clone the repo and install the requirements:
```bash
git clone https://github.com/allenai/super-benchmark.git
cd super-benchmark
pip install -r requirements.txt
```

#### 2. Fill in your OpenAI API key:
```bash
echo "OPENAI_API_KEY=your-openai-api-key" > .env
```

### Running queries
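Before issuing a query, you can check that the key is picked up from `.env`. A minimal sanity check, assuming the project loads it via python-dotenv (an assumption; see `requirements.txt` for the repo's actual mechanism):

```python
# Sanity check that OPENAI_API_KEY is readable from .env.
# Assumes python-dotenv is available; the repo's own loading code may differ.
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
print("OPENAI_API_KEY loaded:", bool(os.getenv("OPENAI_API_KEY")))
```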
The following command runs the agent locally, which carries some risk since the agent executes code directly on your machine.
We also provide options to run the agent inside a Docker container or on [modal.com](https://www.modal.com/); we use the latter for the benchmark evaluation.

```bash
python -m super.run_single_query --env-backend local --query "Download the OpenBookQA dataset at https://github.com/allenai/OpenBookQA and tell me how many examples are in the train, dev, and test splits of the datasets."
```

## 🤖 Running & Evaluating Agents on SUPER
We provide code to evaluate our implemented agents on SUPER.
To run tasks safely and concurrently, we use [modal.com](https://www.modal.com/). Modal isn't free, but it is quite cheap: running an average problem from the benchmark should generally cost 2-3 cents (assuming CPU execution).
In addition, users receive a $30 credit per month, which should be enough to run the benchmark evaluation multiple times.

```bash
python -m super.run_on_benchmark --set Expert
```
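As a rough, back-of-the-envelope check of the cost estimate above (using the quoted 2-3 cents per problem and the set sizes from the Benchmark Tasks section; actual costs depend on the agent, model, and Modal pricing):

```python
# Back-of-the-envelope Modal cost estimate per full benchmark run.
# Uses the per-problem cost quoted above (2-3 cents, CPU) and the set sizes
# from the Benchmark Tasks section; these are estimates, not measured costs.
cost_per_problem = 0.03  # upper end of the quoted range, in USD
set_sizes = {"Expert": 45, "Masked": 152, "AutoGen": 602}

for name, n in set_sizes.items():
    print(f"{name}: ~${n * cost_per_problem:.2f} per run")
# With the $30/month Modal credit, a full Expert run (~$1.35) fits many times over.
```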