# Bench

Bench is a tool for evaluating LLMs for production use cases. Whether you are comparing different LLMs, considering different prompts, or testing generation hyperparameters like temperature and the number of tokens, Bench provides a single touch point for all your LLM performance evaluation.

If you have encountered a need for any of the following in your LLM work, then Bench can help with your evaluation:

- to standardize the workflow of LLM evaluation with a common interface across tasks and use cases
- to test whether open source LLMs can do as well as the top closed-source LLM API providers on your specific data
- to translate the rankings on LLM leaderboards and benchmarks into scores that you care about for your actual use case

Join the bench community on [Discord](https://discord.gg/tdfUAtaVHz).

For bug fixes and feature requests, please file a GitHub issue.

## Package installation

Install Bench to your Python environment with the optional dependencies for serving results locally (recommended):
`pip install 'arthur-bench[server]'`

Alternatively, install Bench with only the minimum dependencies:
`pip install arthur-bench`

For further setup instructions, visit our [installation guide](https://bench.readthedocs.io/en/latest/setup.html).
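
As a quick sanity check that the package installed correctly, you can try importing the main entry point used in the quickstart below (a minimal sketch; it assumes only that the install above succeeded):

```python
# Verify that arthur-bench is importable in the current environment
from arthur_bench.run.testsuite import TestSuite

print("arthur-bench is installed and TestSuite is importable")
```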

## Using Bench

For a more in-depth walkthrough of using Bench, visit our [quickstart walkthrough](https://bench.readthedocs.io/en/latest/quickstart.html) and our [test suite creation guide](https://bench.readthedocs.io/en/latest/creating_test_suites.html) in our docs.

To make sure your setup works, run the following snippet to create a test suite and then run it to score a set of candidate outputs:

```python
from arthur_bench.run.testsuite import TestSuite

# Create a test suite from reference inputs and outputs,
# scored with the exact_match scorer
suite = TestSuite(
    "bench_quickstart",
    "exact_match",
    input_text_list=["What year was FDR elected?", "What is the opposite of down?"],
    reference_output_list=["1932", "up"],
)

# Score a set of candidate outputs against the references
suite.run("quickstart_run", candidate_output_list=["1932", "up is the opposite of down"])
```

Saved test suites can be reloaded later to benchmark performance over time, without needing to re-prepare the reference data:

```python
# Reload the saved suite by name and scorer, then score a new set of outputs
existing_suite = TestSuite("bench_quickstart", "exact_match")
existing_suite.run("quickstart_new_run", candidate_output_list=["1936", "up"])
```
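
In practice, the candidate outputs come from whichever model you are evaluating. Below is a minimal sketch of that wiring, where `generate_answers` is a hypothetical stand-in for your own model call (an API client, a local model, etc.):

```python
from arthur_bench.run.testsuite import TestSuite

def generate_answers(questions):
    # Hypothetical stand-in: replace with a call to the LLM you want
    # to evaluate, returning one answer string per input question.
    return ["1932", "up"]

questions = ["What year was FDR elected?", "What is the opposite of down?"]

# Reload the saved quickstart suite and score this model's outputs
suite = TestSuite("bench_quickstart", "exact_match")
suite.run("my_model_run", candidate_output_list=generate_answers(questions))
```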

To view the results for these runs in the local UI that comes with the `bench` package, run `bench` from the command line (this requires the optional server dependencies to be installed):

```
bench
```

Viewing examples in the bench UI will look something like this:

*(Examples UI screenshot)*

## Running Bench from source

To launch Bench from source:

1. Install the dependencies
* `pip install -e '.[server]'`
2. Build the Front End
* `cd arthur_bench/server/js`
* `npm i`
* `npm run build`
3. Launch the server
* `bench`

Because the package was installed in editable mode (`pip install -e`), local changes are picked up automatically; however, the server needs to be restarted for those changes to take effect.