https://github.com/stream-bench/stream-bench
We propose a pioneering benchmark to evaluate LLM agents' ability to improve over time in streaming scenarios
- Host: GitHub
- URL: https://github.com/stream-bench/stream-bench
- Owner: stream-bench
- License: apache-2.0
- Created: 2024-06-11T23:43:30.000Z (7 months ago)
- Default Branch: main
- Last Pushed: 2024-10-28T00:55:27.000Z (2 months ago)
- Last Synced: 2024-10-28T04:43:26.925Z (2 months ago)
- Language: Python
- Size: 4.09 MB
- Stars: 6
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome_ai_agents - Stream-Bench - We propose a pioneering benchmark to evaluate LLM agents' ability to improve over time in streaming scenarios (Building / Benchmarks)
# StreamBench: Towards Benchmarking Continuous Improvement of Language Agents
**TL;DR:** We propose a pioneering benchmark to evaluate LLM agents' ability to improve over time in streaming scenarios
**Paper link:** https://arxiv.org/abs/2406.08747
**(New Feature)** Run with *OpenAI Batch API* to save cost! See the corresponding [section](#new-feature-run-the-main-script-with-openai-batch-api-mode) for how to use it. (**Note:** Only the non-streaming agents are supported.)
![Figure 0](./Figure0.png)
Overview of StreamBench, illustrating the continuous improvement process of language agents in streaming scenarios. (Figure reference: CHiME 2024 Keynote @Interspeech)
![Figure 1](./Figure1.png)
## Steps to Reproduce the Experiments
### Install Required Packages
Run the following commands to install the requirements:
```
conda create -n stream_bench python=3.10
conda activate stream_bench
python -m pip install -r requirements.txt
```
### (Only for Text-to-SQL Datasets) Download SQL Data
The `Spider`, `CoSQL`, and `BIRD` datasets require SQL databases, which can be downloaded with the following command:
```
python download_text2sql_data.py
```
The script will download, unzip, and extract Text-to-SQL databases to the `./data` directory automatically.
### Setup Environment Variables
Depending on the method(s) to run, you might need to set the following API keys:
```
export OAI_KEY=
export GOOGLE_API_KEY=
export ANTHROPIC_KEY=
```
### (Recommended) Sanity checking
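A cheap first check, before any model is called, is to confirm that the API keys from the previous step are visible to Python. The sketch below assumes only the three variable names shown in the export commands; the helper name `missing_api_keys` is hypothetical and not part of the repository. Since the keys actually needed depend on the method being run, it reports missing names rather than raising an error.

```python
import os

# Variable names taken from the export commands in the setup step; which ones
# a given run needs depends on the agent config, so none is treated as mandatory.
API_KEY_VARS = ("OAI_KEY", "GOOGLE_API_KEY", "ANTHROPIC_KEY")


def missing_api_keys(env=None, names=API_KEY_VARS):
    """Return the names in `names` that are unset or blank in `env`."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_api_keys()
    if missing:
        print("Not set:", ", ".join(missing))
    else:
        print("All listed API keys are set.")
```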
Before running the main script with different baselines, one may want to check that the environment is correctly configured. This can be done by running the `GroundTruthAgent` on the dataset(s) and checking that the performance is 100%.
```
python -m stream_bench.pipelines.run_bench \
--agent_cfg "configs/agent/gt.yml" \
--bench_cfg "configs/bench/ds_1000.yml" \
--entity "photocopier" \
--use_wandb
```
In this example, we run the `GroundTruthAgent` on `DS-1000`. One may run on other datasets by replacing the `.yml` file of the `--bench_cfg` argument.
### Run the Main Script
In this example, the `ZeroShot` baseline on the `DDXPlus` dataset is executed. Scripts for running other datasets can be found in `./scripts`.
```
python -m stream_bench.pipelines.run_bench \
--agent_cfg "configs/agent/zeroshot.yml" \
--bench_cfg "configs/bench/ddxplus.yml" \
--entity "photocopier" \
--use_wandb
```
If you want to run other baselines on the dataset, you can modify `--agent_cfg` to different `.yml` files, which are located in the `./configs/agent` folder.
### (New Feature) Run the Main Script with OpenAI Batch API mode
To save cost, you can run the main script with OpenAI Batch API mode as follows:
```
python -m stream_bench.pipelines.run_bench_batch \
--agent_cfg "configs/agent/.yml" \
--bench_cfg "configs/bench/.yml" \
--entity "photocopier" \
--use_wandb
```
### (Optional) Interactive Notebook
If you want a step-by-step walkthrough, please refer to `playground.ipynb`.
## Steps to Implement Your Own Methods
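In outline, a custom agent is a class with a constructor, a `__call__` that returns the prediction as a string, and an `update` hook; the paragraph and list that follow give the details. The sketch below is deliberately self-contained: `MajorityLabelAgent`, `run_stream`, and all method signatures are hypothetical and do not import or mirror the repository's `Agent` base class. Passing the gold label to `update` is a simplification chosen to keep the toy short, not a claim about the feedback signal the benchmark provides.

```python
from collections import Counter


class MajorityLabelAgent:
    """Toy stand-in for an agent; a real one would subclass the repo's `Agent`.

    The signatures below are assumptions made for illustration and may not
    match `stream_bench/agents/base.py`.
    """

    def __init__(self, default_label: str = "unknown") -> None:
        # A real agent would set up its LLM and any retrieval pipeline here.
        self.default_label = default_label
        self.seen_labels = Counter()

    def __call__(self, prompt: str) -> str:
        # Inference: return the prediction as a string.
        # This toy ignores the prompt and predicts the most frequent label so far.
        if not self.seen_labels:
            return self.default_label
        return self.seen_labels.most_common(1)[0][0]

    def update(self, label: str) -> None:
        # Update: fold the signal received after the prediction into memory.
        self.seen_labels[label] += 1


def run_stream(agent, stream):
    """Predict on each (prompt, label) pair in order, updating after each one.

    `stream` must be a non-empty list; returns the fraction predicted correctly.
    """
    correct = 0
    for prompt, label in stream:
        correct += int(agent(prompt) == label)
        agent.update(label)
    return correct / len(stream)
```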
If you want to implement your own LLM agent, you may subclass the `Agent` base class in `./stream_bench/agents/base.py` and implement the following methods:
- `__init__`: Initialization of the agent (e.g., setting up LLMs and RAG pipelines).
- `__call__`: The inference logic of the agent. This should return the agent's prediction as a string.
- `update`: The updating logic of the agent.
## Steps to Run Your Own LLMs
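Of the two options described in this section, the second asks for an `LLM` subclass whose `__call__` returns a `(response_text, response_info)` tuple. A minimal sketch of that shape follows, under stated assumptions: `EchoLLM`, its constructor arguments, the single-`prompt` call signature, and the keys of the info dict are all hypothetical, and the class does not import the repository's `LLM` base class. It echoes the prompt instead of calling a model so that it runs anywhere.

```python
import time


class EchoLLM:
    """Toy backbone; a real one would subclass the repo's `LLM` base class.

    Constructor arguments, the `__call__` signature, and the info-dict keys
    are assumptions, not the interface in `stream_bench/llms/base.py`.
    """

    def __init__(self, model_name: str = "echo", max_tokens: int = 32) -> None:
        # Set up model configuration here (a real wrapper would create a client).
        self.model_name = model_name
        self.max_tokens = max_tokens

    def __call__(self, prompt: str) -> tuple[str, dict]:
        start = time.perf_counter()
        # Stand-in for a model call: echo the prompt, truncated to
        # `max_tokens` whitespace-separated tokens.
        tokens = prompt.split()[: self.max_tokens]
        response_text = " ".join(tokens)
        response_info = {
            "model_name": self.model_name,
            "num_output_tokens": len(tokens),
            "latency_s": time.perf_counter() - start,
        }
        return response_text, response_info
```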
If you want to run agents with your own backbone LLMs, you have two options:
1. Using HuggingFace models: upload / choose your HuggingFace model, and set the configurations in `./configs/agent/.yml`. For example, if you want to run the zero-shot baseline with `google/gemma-2-2b-it`, set the configurations as follows:
```
agent_name: "zeroshot"
llm:
  series: "hf_model"
  model_name: "google/gemma-2-2b-it"
  temperature: 0.0
  max_tokens: 32
```
2. Others: for further customization, you can subclass the `LLM` base class in `./stream_bench/llms/base.py` and implement the following methods:
- `__init__`: Set up the LLM configs here.
- `__call__`: The inference flow: prompt the LLM and return a tuple of `(response_text, response_info)`. See `./stream_bench/llms/oai_chat.py` and `./stream_bench/llms/hf_model.py` for example implementations.
## (Optional) StreamBench Datasets
If you want to download the datasets on StreamBench, we have collected the datasets on HuggingFace:
https://huggingface.co/datasets/appier-ai-research/StreamBench

These datasets have their original source webpages; please refer to our [paper](https://arxiv.org/abs/2406.08747) (Appendix F) for more details.
## Citation
If you find our work helpful, please cite it as:
```
@article{wu2024streambench,
title={StreamBench: Towards Benchmarking Continuous Improvement of Language Agents},
author={Wu, Cheng-Kuang and Tam, Zhi Rui and Lin, Chieh-Yen and Chen, Yun-Nung and Lee, Hung-yi},
journal={arXiv preprint arXiv:2406.08747},
year={2024}
}
```