An open API service indexing awesome lists of open source software.

https://github.com/hop-top/ben


https://github.com/hop-top/ben

bench benchmark benchmark-suite benchmarking

Last synced: about 1 month ago
JSON representation

Awesome Lists containing this project

README

          

# ben

> [!WARNING]
> **🚧 Do Not Use — History Will Be Rewritten 🚧**
>
> This repo is undergoing major restructuring as we selectively
> open-source internal tools built at
> [Idea Crafters LLC](https://ideacrafters.com). Git history **will be
> force-pushed and rewritten** multiple times. Do not fork, clone, or
> depend on this repo in any capacity until we tag a stable release.

General-purpose benchmarking tool — answers "which approach is better, and by how much?"
for any measurable task: tools, implementations, deps, LLM calls, agents.

---

## Install

```
go install hop.top/ben/cmd/ben@latest
```

ben depends on `hop.top/kit kit/v0.4.0-alpha.3`, pinned in `go.mod`
with no local override. Local development against unreleased kit
revisions uses a `replace` directive in `go.mod` (commented-out
example near the bottom of the file):

```go
// replace hop.top/kit => ../kit
```

Uncomment, point at your kit checkout, and `go mod tidy`.

---

## Quick start

```sh
# Inline run: compare two CLI tools on a task
ben run --task "Find HTTP handlers" --candidates xray,grep --metric latency_ms,quality_score \
--scorer weighted:latency_ms=0.3,quality_score=0.7 --input repo=.

# Suite file: run a named, repeatable benchmark
ben run --suite .ben/suites/codebase-indexing.yaml

# Compare two historical runs
ben compare 01HX...abc 01HX...def

# List last 10 runs for a suite
ben list --suite codebase-indexing --last 10

# Show one run by id
ben show 01HX...abc
```

---

## Commands

| Command | Description |
|---------------------------------|-------------------------------------------------------|
| `ben run` | Run benchmark suite or inline task against candidates |
| `ben list` | List recent runs from local storage |
| `ben show ` | Show details of one run |
| `ben compare ` | Diff two run results side-by-side |
| `ben suite list` | List known suites (global + project-local) |
| `ben suite show ` | Show suite spec details |
| `ben registry push ` | Push a run to the shared registry |
| `ben registry pull` | Pull community baselines for a suite |
| `ben config path` / `paths` | Inspect ben config file precedence |
| `ben spec` | Emit machine-readable capability manifest |

---

## Adapters

| Adapter | How ben runs the candidate |
|------------|--------------------------------------------------------------------------|
| `cli` | Spawns a shell command; captures stdout/stderr, exit code, latency |
| `llm` | Calls an LLM via API; captures tokens, cost, output |
| `eva` | Wraps `eva run` as a ben candidate for standard eval suites |
| binary | Any `ben-adapter-*` binary on PATH; communicates via stdio JSON protocol |

---

## Metrics

| Metric | Source | Description |
|------------------|----------|-----------------------------------------------|
| `latency_ms` | built-in | Wall-clock execution time in milliseconds |
| `exit_code` | built-in | Process exit code (cli adapter) |
| `output_size` | built-in | Byte length of stdout output |
| `tokens` | llm | Total tokens consumed (prompt + completion) |
| `cost_usd` | llm | Estimated cost in USD |
| `quality_score` | plugin | 0–1 relevance score; requires llm_judge plugin|

---

## Scorers

| Scorer | Description |
|-------------------------------|------------------------------------------------------|
| `single:` | Rank by one metric; lowest wins for cost/latency |
| `weighted:=,...` | Weighted sum across metrics; highest score wins |
| `raw` | No ranking; emit raw metrics only; winner=null |

Examples:

```
--scorer single:latency_ms
--scorer weighted:latency_ms=0.3,cost_usd=0.2,quality_score=0.5
--scorer raw
```

---

## Spec file

```yaml
name: codebase-indexing
description: Compare xray vs grep for initial codebase orientation
version: 1

task:
prompt: "Find all HTTP handler functions in this repo"
input:
repo: ./testdata/sample-repo

candidates:
- name: xray
adapter: cli
cmd: "xray explore --search {{input.prompt}} --path {{input.repo}}"
- name: grep
adapter: cli
cmd: "grep -r 'func.*Handler' {{input.repo}}"

metrics:
- latency_ms
- quality_score

scorer:
strategy: weighted
weights:
latency_ms: 0.3
quality_score: 0.7
```

---

## Plugin protocol

Binary plugins are auto-discovered as `ben-adapter-` or `ben-reporter-` on PATH.
Ben communicates via newline-delimited JSON over stdio: it writes a request JSON object to the
plugin's stdin and reads the response from stdout. Adapter plugins receive
`{"action":"run","candidate":{...},"input":{...}}` and must respond with
`{"metrics":{...},"output":"..."}`. Reporter plugins receive `{"run":{...}}` and write
formatted output to stdout. Naming convention: use the adapter/reporter name as the suffix,
e.g. `ben-adapter-docker`, `ben-reporter-markdown`.

---

## Agent usage

ben is designed for programmatic use mid-task:

```sh
# Machine-readable output; all logs to stderr
ben run --suite my-suite --format json --quiet

# Parse winner directly
ben run ... --format json | jq .winner
```

- `--format json` — emits valid JSON to stdout; diagnostics to stderr only
- `--quiet` — suppresses stderr; clean for pipelines
- Exit `0` — successful run (candidate failures are in the result, not exit code)
- Exit `1` — ben error (bad config, missing adapter, etc.)
- `winner` field — primary decision signal for agents; `null` when scorer is `raw`

---

## Storage

Global (cross-project):

```
~/.local/share/ben/
runs/ # persisted run results
registry/ # local registry index + cache
suites/ # global suite specs
```

Project-local (detected automatically when `.ben/` exists in cwd):

```
.ben/
suites/ # project-scoped suite specs
runs/ # project-scoped run results
```

Ben prefers project-local storage when `.ben/` is present; falls back to global.

---

## Configuration

Ben loads config from three layers, highest precedence first:

| Layer | Path |
|---------|-----------------------------------|
| project | `./.ben/config.yaml` |
| user | `$XDG_CONFIG_HOME/ben/config.yaml`|
| system | `/etc/ben/config.yaml` |

Run `ben config paths --format json` to see the active chain. The
`-c ` flag overrides the discovery chain entirely (kit semantics
— `-c` wins over any previously discovered file).

The project-layer path is caller-context-aware via the `KIT_INVOKED_AS`
env var (exported by callers like tlc or hop before exec'ing ben):

| `KIT_INVOKED_AS` | Project config path |
|-------------------|-------------------------|
| (unset/standalone)| `./.ben/config.yaml` |
| `hop` | `./.hop/ben.yaml` |
| `tlc` | `./.tlc/ben.yaml` |

Only one project-layer entry wins per invocation (kit constraint).

---

## Release process

Release pipeline mirrors the `hop-top/.github` reusable workflows:

- `release-please.yml` watches `main`, opens a standing release PR
that bumps the version + assembles the changelog.
- Merging that PR cuts a `ben/v` tag.
- The tag push fires `publish.yml` (Go module mirror to
`hop-top/ben`) and `goreleaser-on-tag.yml` (cross-platform binaries
+ Homebrew tap + Scoop bucket entries) in parallel.

Prerelease channel is seeded at `0.2.0-alpha.0`. See
[`.github/RELEASE-BOOTSTRAP.md`](.github/RELEASE-BOOTSTRAP.md) for
the manual web-side steps (mirror-repo creation, GitHub App
installation, org secrets) required before the first cut.

---

## Troubleshooting

**`compile: version "go1.26.1" does not match go tool version "go1.26.2"`**

Cause: stale `GOROOT` exported from an earlier `mise` shell. Quick
workaround: `env -u GOROOT go test ./...`. Long-term fix: `mise use
go@` and respawn the shell.

---

## Contributing

See [docs/contributing.md](docs/contributing.md) for interfaces, how to add adapters/metrics/
scorers/reporters, and the PR checklist.