{"id":23932726,"url":"https://github.com/stream-bench/stream-bench","last_synced_at":"2025-09-11T15:32:13.129Z","repository":{"id":243954656,"uuid":"813883231","full_name":"stream-bench/stream-bench","owner":"stream-bench","description":"We propose a pioneering benchmark to evaluate LLM agents' ability to improve over time in streaming scenarios","archived":false,"fork":false,"pushed_at":"2024-10-28T00:55:27.000Z","size":4292,"stargazers_count":6,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"main","last_synced_at":"2024-10-28T04:43:26.925Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/stream-bench.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-06-11T23:43:30.000Z","updated_at":"2024-10-24T06:45:46.000Z","dependencies_parsed_at":"2024-07-10T04:41:05.574Z","dependency_job_id":"0f53024b-c4c3-4c26-b5fa-b2a644c1df19","html_url":"https://github.com/stream-bench/stream-bench","commit_stats":null,"previous_names":["stream-bench/stream-bench"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stream-bench%2Fstream-bench","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stream-bench%2Fstream-bench/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stream-bench%2Fstream-bench/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stream-bench%2Fstream-bench/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/stream-bench","download_url":"https://codeload.github.com/stream-bench/stream-bench/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":232657489,"owners_count":18556857,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-01-06T00:29:18.573Z","updated_at":"2025-01-06T00:29:28.163Z","avatar_url":"https://github.com/stream-bench.png","language":"Python","readme":"# StreamBench: Towards Benchmarking Continuous Improvement of Language Agents\n\n**TL;DR:** We propose a pioneering benchmark to evaluate LLM agents' ability to improve over time in streaming scenarios\n\n**Paper link:** https://arxiv.org/abs/2406.08747\n\n**(New Feature)** Run with *OpenAI Batch API* to save cost! See the corresponding [section](#new-feature-run-the-main-script-with-openai-batch-api-mode) for how to use it. (**Note:** Only the non-streaming agents are supported.)\n\n![Figure 0](./Figure0.png)\n\u003cp align=\"center\"\u003e\u003cem\u003eOverview of StreamBench, illustrating the continuous improvement process of language agents in streaming scenarios. (Figure reference: \u003ca href=\"https://www.chimechallenge.org/current/workshop/CHiME2024_Lee.pdf\" target=\"_blank\"\u003eCHiME 2024 Keynote @Interspeech\u003c/a\u003e)\u003c/em\u003e\u003c/p\u003e\n\n![Figure 1](./Figure1.png)\n\n## Steps to Reproduce the Experiments\n\n### Install Required Packages\nRun the following commands to install the requirements:\n```\nconda create -n stream_bench python=3.10\nconda activate stream_bench\npython -m pip install -r requirements.txt\n```\n\n### (Only for Text-to-SQL Datasets) Download SQL Data\nFor `Spider`, `CoSQL`, and `BIRD` datasets, one would need to download the SQL databases with the following command:\n```\npython download_text2sql_data.py\n```\nThe script will download, unzip, and extract Text-to-SQL databases to the `./data` directory automatically.\n\n### Setup Environment Variables\nDepending on the method(s) to run, you might need to set the following API keys:\n```\nexport OAI_KEY=\u003cyour_openai_api_key\u003e\nexport GOOGLE_API_KEY=\u003cyour_google_ai_studio_api_key\u003e\nexport ANTHROPIC_KEY=\u003cyour_anthropic_api_key\u003e\n```\n\n### (Recommended) Sanity checking\nBefore running the main script with differnet baselines, one may want to check that the environment is correctly configured. This can be done by running the `GroundTruthAgent` on the dataset(s) and check whether the performance is 100%.\n```\npython -m stream_bench.pipelines.run_bench \\\n    --agent_cfg \"configs/agent/gt.yml\" \\\n    --bench_cfg \"configs/bench/ds_1000.yml\" \\\n    --entity \"photocopier\" \\\n    --use_wandb\n```\nIn this example, we run the `GroundTruthAgent` on `DS-1000`. One may run on other datasets by replacing the `\u003cdataset_name\u003e.yml` file of the `--bench_cfg` argument.\n\n### Run the Main Script\nIn this example, the `ZeroShot` baseline on the `DDXPlus` dataset is executed. Written scripts for running other datasets can be found in `./scripts`.\n```\npython -m stream_bench.pipelines.run_bench \\\n    --agent_cfg \"configs/agent/zeroshot.yml\" \\\n    --bench_cfg \"configs/bench/ddxplus.yml\" \\\n    --entity \"photocopier\" \\\n    --use_wandb\n```\nIf you want to run other baselines on the dataset, you can modify `--agent_cfg` to different `\u003cbaseline_name\u003e.yml` files, which are located in the `./configs/agent` folder.\n\n### (New Feature) Run the Main Script with OpenAI Batch API mode\nTo save cost, you can run the main script with OpenAI Batch API mode as follows:\n```\npython -m stream_bench.pipelines.run_bench_batch \\\n    --agent_cfg \"configs/agent/\u003cagent_name\u003e.yml\" \\\n    --bench_cfg \"configs/bench/\u003cbench_name\u003e.yml\" \\\n    --entity \"photocopier\" \\\n    --use_wandb\n```\n\n### (Optional) Interactive Notebook\nIf you want a step-by-step walkthrough, please refer to `playground.ipynb`.\n\n## Steps to Implement Your Own Methods\nIf you want to implement your own LLM agent, you may subclass the `Agent` base class in `./stream_bench/agents/base.py` and implement the following methods:\n\n- `__init__`: Initialization of the agent (e.g., setting up LLMs and RAG pipelines).\n- `__call__`: The inference logics of the agent. This should return the agent's prediction in string.\n- `update`: The updating logics of the agent.\n\n## Steps to Run Your Own LLMs\nIf you want to run agents with your own backbone LLMs, you have two options:\n\n1. Using HuggingFace models: upload / choose your HuggingFace model, and set the configurations in `./configs/agent/\u003cagent_name\u003e.yml`. For example, if you want to run the zero-shot baseline with `google/gemma-2-2b-it`, set the configurations as follows:\n```\nagent_name: \"zeroshot\"\nllm:\n  series: \"hf_model\"\n  model_name: \"google/gemma-2-2b-it\"\n  temperature: 0.0\n  max_tokens: 32\n```\n\n2. Others: for further customization, you can subclass the `LLM` base class in `./stream_bench/llms/base.py` and implement the following methods:\n\n- `__init__`: Setup LLM configs here.\n- `__call__`: Inference flows of prompting the LLM and get a tuple of (response_text, response_info). See the implementation of `./stream_bench/llms/oai_chat.py` and `./stream_bench/llms/hf_model.py` as examples.\n\n## (Optional) StreamBench Datasets\nIf you want to download the datasets on StreamBench, we have collected the datasets on HuggingFace:\nhttps://huggingface.co/datasets/appier-ai-research/StreamBench\n\nThese datasets have their original source webpages, please refer to our [paper](https://arxiv.org/abs/2406.08747) (Appendix F) for more details.\n\n## Citation\nIf you find our work helpful, please cite as\n```\n@article{wu2024streambench,\n  title={StreamBench: Towards Benchmarking Continuous Improvement of Language Agents},\n  author={Wu, Cheng-Kuang and Tam, Zhi Rui and Lin, Chieh-Yen and Chen, Yun-Nung and Lee, Hung-yi},\n  journal={arXiv preprint arXiv:2406.08747},\n  year={2024}\n}\n```\n","funding_links":[],"categories":["Building"],"sub_categories":["Benchmarks"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstream-bench%2Fstream-bench","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fstream-bench%2Fstream-bench","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstream-bench%2Fstream-bench/lists"}