https://github.com/purestorage-openconnect/lakebench-k8s
Deploy, benchmark, and compare lakehouse architectures on Kubernetes with a single YAML file. Lakebench provisions Spark, Iceberg, and Trino, generates synthetic data at any scale, runs medallion pipelines (Bronze→Silver→Gold), and scores performance with an 8-query benchmark across pluggable catalogs, table formats, and query engines.
- Host: GitHub
- URL: https://github.com/purestorage-openconnect/lakebench-k8s
- Owner: PureStorage-OpenConnect
- License: apache-2.0
- Created: 2026-02-13T04:54:28.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2026-03-27T21:11:48.000Z (about 2 months ago)
- Last Synced: 2026-03-28T03:44:44.225Z (about 2 months ago)
- Language: Python
- Size: 1.15 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Support: docs/supported-components.md
README
# Lakebench
**A/B testing for lakehouse architectures on Kubernetes.**
Deploy a complete lakehouse stack from a single YAML, run a medallion pipeline
at any scale, and get a scorecard you can compare across configurations.
## Why Lakebench?
- **Compare stacks.** Swap catalogs (Hive, Polaris), query engines (Trino,
Spark Thrift, DuckDB), and table formats -- same data, same queries,
different architecture -- then compare the scorecards side by side.
- **Test at scale.** Run the same workload at 10 GB, 100 GB, and 1 TB to find
where throughput plateaus or resources saturate on your hardware.
- **Measure freshness.** Sustained mode streams data through the pipeline and
benchmarks query performance under continuous ingest load.
## Quick Start
```bash
pip install lakebench-k8s
```
> Pre-built binaries (no Python required) are available on
> [GitHub Releases](https://github.com/PureStorage-OpenConnect/lakebench-k8s/releases).
```bash
pip install lakebench-k8s
lakebench init # quick setup (4 questions)
lakebench run lakebench.yaml --generate # deploy + generate + pipeline + benchmark
lakebench results # view scorecard
lakebench destroy lakebench.yaml # tear down everything
```
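The same sequence can be driven from a script, for example in CI. The following is a minimal sketch and not part of Lakebench: `quick_start_commands`, `run_all`, and `shell_runner` are hypothetical helper names, and the only CLI invocations used are the ones shown above. The runner is injectable so the sequence can be dry-run without a cluster.

```python
import subprocess
from typing import Callable, List, Sequence


def quick_start_commands(config: str = "lakebench.yaml") -> List[List[str]]:
    """Return the Quick Start CLI calls (after `lakebench init`) as argv lists."""
    return [
        ["lakebench", "run", config, "--generate"],  # deploy + generate + pipeline + benchmark
        ["lakebench", "results"],                    # view scorecard
        ["lakebench", "destroy", config],            # tear down everything
    ]


def run_all(commands: Sequence[Sequence[str]],
            runner: Callable[[Sequence[str]], int]) -> int:
    """Run commands in order; stop at the first non-zero exit code and return it."""
    for argv in commands:
        code = runner(argv)
        if code != 0:
            return code
    return 0


def shell_runner(argv: Sequence[str]) -> int:
    """Execute one command with subprocess and return its exit code."""
    return subprocess.run(list(argv)).returncode
```

Calling `run_all(quick_start_commands(), shell_runner)` executes the three commands for real. Because the sketch stops at the first failure, a failed `run` leaves the deployment in place for inspection; run `lakebench destroy` separately when done.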
Minimum config -- 4 lines:
```yaml
# lakebench.yaml
endpoint: http://s3.example.com:80
access_key: YOUR_KEY
secret_key: YOUR_SECRET
scale: 10 # 1 = ~10 GB, 10 = ~100 GB, 100 = ~1 TB
```
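To run the same workload at several scales, the minimal config can be generated rather than edited by hand. This is a rough sketch: `approx_dataset_gb` and `minimal_config` are hypothetical helpers, the 10 GB per scale unit comes from the comment in the config above, and treating `scale` as a positive integer is an assumption.

```python
GB_PER_SCALE_UNIT = 10  # from the config comment: scale 1 = ~10 GB


def approx_dataset_gb(scale: int) -> int:
    """Approximate dataset size in GB for a scale factor (assumed positive)."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    return scale * GB_PER_SCALE_UNIT


def minimal_config(endpoint: str, scale: int) -> str:
    """Render the four-field minimal config; credentials are left as env var references."""
    return (
        f"endpoint: {endpoint}\n"
        "access_key: ${S3_ACCESS_KEY}\n"
        "secret_key: ${S3_SECRET_KEY}\n"
        f"scale: {scale}  # ~{approx_dataset_gb(scale)} GB\n"
    )


for scale in (1, 10, 100):
    print(f"--- scale {scale} ---")
    print(minimal_config("http://s3.example.com:80", scale))
```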
Name is auto-generated. Recipe defaults to `hive-iceberg-spark-trino`.
Override anything with flat fields or nested YAML:
```yaml
# lakebench.yaml (with overrides)
name: flashblade-polaris
recipe: polaris-iceberg-spark-trino
endpoint: http://10.21.227.93:80
access_key: ${S3_ACCESS_KEY} # env var substitution
secret_key: ${S3_SECRET_KEY}
scale: 50
mode: batch
spark_image: apache/spark:4.1.1-python3
```
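The `${S3_ACCESS_KEY}` syntax keeps credentials out of the file. How Lakebench resolves these references internally is not shown here; the sketch below only illustrates the general technique, with `expand_env` as a hypothetical helper that fails loudly when a variable is unset.

```python
import os
import re
from typing import Mapping, Optional

# Matches ${NAME} where NAME is a conventional environment variable name.
_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(text: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Replace each ${NAME} with its value; raise KeyError if NAME is unset."""
    values = os.environ if env is None else env

    def _replace(match: re.Match) -> str:
        name = match.group(1)
        if name not in values:
            raise KeyError(f"environment variable {name} is not set")
        return values[name]

    return _VAR.sub(_replace, text)


print(expand_env("access_key: ${S3_ACCESS_KEY}", {"S3_ACCESS_KEY": "example"}))
```

Raising on a missing variable, rather than substituting an empty string, surfaces a forgotten `export` before anything is deployed.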
Eleven recipes are available -- see [Recipes](https://github.com/PureStorage-OpenConnect/lakebench-k8s/blob/main/docs/recipes.md)
for the full list.
Compare two configurations side-by-side:
```bash
lakebench compare config-hive.yaml config-polaris.yaml
```
For all recipes, see [`examples/`](examples/) or run `lakebench init --advanced`
for the full interactive wizard.
## What You Get
After `lakebench run` completes, the terminal prints a scorecard:
```
─ Pipeline Complete ──────────────────────────────
bronze-verify 142.0 s
silver-build 891.0 s
gold-finalize 234.0 s
benchmark 87.0 s
Scores
Time to Value: 1354.0 s
Throughput: 0.782 GB/s
Efficiency: 3.41 GB/core-hr
Scale: 100.0% verified
QpH: 2847.3
Full report: lakebench report
──────────────────────────────────────────────────
```
`lakebench report` generates an HTML report with per-query latencies,
bottleneck analysis, and optional platform metrics (CPU, memory, S3 I/O per
pod).
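In the sample scorecard above, Time to Value equals the sum of the four stage durations. Whether Lakebench defines the score that way in general is an assumption; the sketch below only checks the arithmetic for the sample, using a hypothetical `stage_seconds` parser for the printed stage lines.

```python
import re

SAMPLE = """\
bronze-verify 142.0 s
silver-build 891.0 s
gold-finalize 234.0 s
benchmark 87.0 s
"""

# A lowercase stage name, whitespace, a number, then the unit "s".
_STAGE = re.compile(r"^\s*([a-z][a-z-]*)\s+([0-9]+(?:\.[0-9]+)?)\s*s\s*$")


def stage_seconds(text: str) -> dict:
    """Map stage name -> duration in seconds for lines like 'silver-build 891.0 s'."""
    stages = {}
    for line in text.splitlines():
        match = _STAGE.match(line)
        if match:
            stages[match.group(1)] = float(match.group(2))
    return stages


stages = stage_seconds(SAMPLE)
total = sum(stages.values())
print(f"{total:.1f} s")  # 1354.0 s, matching Time to Value in the sample
```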
## How It Works
```
                        ┌──────────────────────────────────┐
                        │          lakebench.yaml          │
                        └────────────┬─────────────────────┘
                                     │
                        ┌────────────▼─────────────────────┐
                        │ deploy (Kubernetes namespace,    │
                        │ S3 secrets, PostgreSQL, catalog, │
                        │ query engine, observability)     │
                        └────────────┬─────────────────────┘
                                     │
Raw Parquet ──► Bronze (validate) ──► Silver (enrich) ──► Gold (aggregate)
S3              Spark                 Spark               Spark
                                                                 │
                                                     ┌───────────▼──────────┐
                                                     │  8-query benchmark   │
                                                     │  (Trino / DuckDB /   │
                                                     │  Spark Thrift)       │
                                                     └──────────────────────┘
```
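The Bronze, Silver, and Gold stages follow the usual medallion pattern: validate, enrich, aggregate. The toy sketch below shows that pattern on in-memory rows in plain Python. It is not Lakebench's Spark code; the columns (`customer`, `amount`, `bucket`), the validation rules, and the input rows are all made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

Row = Dict[str, object]


def bronze(raw: Iterable[Row]) -> List[Row]:
    """Validate: keep rows with a customer id and a non-negative numeric amount."""
    return [
        r for r in raw
        if r.get("customer")
        and isinstance(r.get("amount"), (int, float))
        and r["amount"] >= 0
    ]


def silver(rows: Iterable[Row]) -> List[Row]:
    """Enrich: add a derived size bucket to each validated row."""
    return [{**r, "bucket": "large" if r["amount"] >= 100 else "small"} for r in rows]


def gold(rows: Iterable[Row]) -> Dict[str, float]:
    """Aggregate: total amount per customer."""
    totals: Dict[str, float] = defaultdict(float)
    for r in rows:
        totals[r["customer"]] += r["amount"]
    return dict(totals)


raw = [
    {"customer": "a", "amount": 120},
    {"customer": "a", "amount": 30},
    {"customer": None, "amount": 10},  # rejected in Bronze: no customer
    {"customer": "b", "amount": -5},   # rejected in Bronze: negative amount
]
print(gold(silver(bronze(raw))))  # {'a': 150.0}
```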
## Prerequisites
- `kubectl` and `helm` on PATH
- Kubernetes 1.26+ (minimum 8 CPU / 32 GB RAM for scale 1)
- S3-compatible object storage (FlashBlade, MinIO, AWS S3, etc.)
- [Kubeflow Spark Operator 2.4.0+](https://github.com/kubeflow/spark-operator)
(or set `spark.operator.install: true`)
- [Stackable Hive Operator](https://docs.stackable.tech/home/stable/hive/) for
Hive recipes (not needed for Polaris)
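The first prerequisite is easy to check before deploying. A minimal sketch, with `missing_tools` as a hypothetical helper; it only looks for the executables on PATH and does not check the Kubernetes version, cluster resources, or installed operators.

```python
import shutil
from typing import Iterable, List


def missing_tools(tools: Iterable[str] = ("kubectl", "helm")) -> List[str]:
    """Return the tools from `tools` that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Missing from PATH: " + ", ".join(missing))
else:
    print("kubectl and helm found on PATH")
```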
## Commands
| Command | Description |
|---------|-------------|
| `init` | Generate a starter config file |
| `validate` | Check config and cluster connectivity |
| `info` | Show deployment configuration summary |
| `deploy` | Deploy all infrastructure components |
| `generate` | Generate synthetic data at the configured scale |
| `run` | Execute the medallion pipeline and benchmark |
| `benchmark` | Run the 8-query benchmark standalone |
| `query` | Execute ad-hoc SQL against the active engine |
| `status` | Show deployment status |
| `report` | Generate HTML scorecard report |
| `recommend` | Recommend cluster sizing for a scale factor |
| `destroy` | Tear down all deployed resources |
See [CLI Reference](https://github.com/PureStorage-OpenConnect/lakebench-k8s/blob/main/docs/cli-reference.md)
for flags and options.
## Component Versions
| Component | Version |
|-----------|---------|
| Apache Spark | 3.5.4, 4.0.2, 4.1.1 |
| Spark Operator | 2.4.0 (Kubeflow) |
| Apache Iceberg | 1.10.1 |
| Delta Lake | 4.0.0 |
| Hive Metastore | 3.1.3 (Stackable 25.7.0) |
| Apache Polaris | 1.3.0-incubating |
| Trino | 479 |
| DuckDB | bundled (Python 3.11) |
| PostgreSQL | 16, 17, 18 |
All versions are overridable in the YAML config. See
[Supported Components](https://github.com/PureStorage-OpenConnect/lakebench-k8s/blob/main/docs/supported-components.md).
## Documentation
- [Getting Started](https://github.com/PureStorage-OpenConnect/lakebench-k8s/blob/main/docs/getting-started.md) -- prerequisites, install, first run
- [Configuration](https://github.com/PureStorage-OpenConnect/lakebench-k8s/blob/main/docs/configuration.md) -- full YAML reference
- [Recipes](https://github.com/PureStorage-OpenConnect/lakebench-k8s/blob/main/docs/recipes.md) -- catalog + format + engine combinations
- [Compatibility Matrix](https://github.com/PureStorage-OpenConnect/lakebench-k8s/blob/main/docs/compatibility-matrix.md) -- Spark, Iceberg, and Delta version support
- [Running Pipelines](https://github.com/PureStorage-OpenConnect/lakebench-k8s/blob/main/docs/running-pipelines.md) -- batch and sustained modes
- [Benchmarking](https://github.com/PureStorage-OpenConnect/lakebench-k8s/blob/main/docs/benchmarking.md) -- scorecard and query benchmark
- [Architecture](https://github.com/PureStorage-OpenConnect/lakebench-k8s/blob/main/docs/architecture.md) -- system design
- [Troubleshooting](https://github.com/PureStorage-OpenConnect/lakebench-k8s/blob/main/docs/troubleshooting.md) -- common errors and fixes
## License
Apache 2.0