{"id":37075004,"url":"https://github.com/flexaihq/flexbench","last_synced_at":"2026-01-14T08:50:20.243Z","repository":{"id":311918739,"uuid":"950770780","full_name":"flexaihq/flexbench","owner":"flexaihq","description":"Benchmark OpenAI-compatible AI endpoints and AI Accelerators in a reproducible structured way ","archived":false,"fork":false,"pushed_at":"2025-08-22T10:35:18.000Z","size":4224,"stargazers_count":8,"open_issues_count":0,"forks_count":3,"subscribers_count":11,"default_branch":"main","last_synced_at":"2025-08-27T21:36:56.372Z","etag":null,"topics":["benchmark","inference","llm","mlperf","vllm"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/flexaihq.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":".github/CODEOWNERS","security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-03-18T16:58:23.000Z","updated_at":"2025-08-20T01:27:58.000Z","dependencies_parsed_at":"2025-08-27T21:37:17.967Z","dependency_job_id":"ffd4c612-e692-4f52-86c5-d7d9dd56eba3","html_url":"https://github.com/flexaihq/flexbench","commit_stats":null,"previous_names":["flexaihq/flexbench"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/flexaihq/flexbench","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/flexaihq%2Fflexbench","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/flexaihq%2Fflexbench/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/flexaihq%2Fflexbench/releases","manifests_url":"https://repos.ecosy
ste.ms/api/v1/hosts/GitHub/repositories/flexaihq%2Fflexbench/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/flexaihq","download_url":"https://codeload.github.com/flexaihq/flexbench/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/flexaihq%2Fflexbench/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28414693,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-14T08:38:59.149Z","status":"ssl_error","status_checked_at":"2026-01-14T08:38:43.588Z","response_time":107,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["benchmark","inference","llm","mlperf","vllm"],"created_at":"2026-01-14T08:50:19.549Z","updated_at":"2026-01-14T08:50:20.232Z","avatar_url":"https://github.com/flexaihq.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# FlexBench\n\nA flexible benchmarking framework for text language models with automated Docker orchestration and MLPerf-compliant evaluation.\n\n## Features\n\n- **Zero-setup benchmarking** - Automatic Docker container orchestration\n- **Universal hardware support** - Auto-detects CUDA, ROCm, ARM, and CPU devices\n- **MLPerf-compliant scenarios** - Server, Offline, and SingleStream inference modes\n- **Performance \u0026 accuracy evaluation** - 
Comprehensive metrics with built-in datasets\n- **QPS sweep mode** - Automatic performance curve discovery\n- **Existing server integration** - Connect to your running vLLM server\n\n## Installation\n\n```bash\n# Install uv\ncurl -LsSf https://astral.sh/uv/install.sh | sh\n\n# Install via clone\ngit clone https://github.com/flexaihq/flexbench.git\ncd flexbench\nuv venv\nsource .venv/bin/activate\nuv pip install -e .\n\n# Install via git URL\nuv venv\nsource .venv/bin/activate\nuv pip install git+https://github.com/flexaihq/flexbench.git\n```\n\n## Prerequisites\n\n- **Docker** and **Docker Compose** (or `docker compose`)\n- **NVIDIA Docker runtime** (for GPU support)\n\n## Quick Start\n\nFlexBench provides a single command with smart defaults for immediate benchmarking:\n\n```bash\n# View all available options\nflexbench --help\n\n# Basic benchmark with default model (HuggingFaceTB/SmolLM2-135M-Instruct) and dataset (ctuning/MLPerf-OpenOrca)\nflexbench  # lightweight model for quick testing\n\n# Gated models (requires HuggingFace token)\nexport HF_TOKEN=your_hf_token_here\nflexbench --model-path meta-llama/Llama-3.2-1B-Instruct  # supports --hf-token argument as well\n\n# Larger model with multi-GPU support\nflexbench --model-path meta-llama/Llama-3.2-70B-Instruct --gpu-devices \"0,1\" --tensor-parallel-size 2  # or use CUDA_VISIBLE_DEVICES environment variable\n\n# Force CPU mode\nflexbench --model-path meta-llama/Llama-3.2-1B-Instruct --device-type cpu\n\n# Specify target QPS (queries per second)\nflexbench --model-path meta-llama/Llama-3.2-1B-Instruct --target-qps 5\n\n# QPS sweep to find performance limits\nflexbench --model-path meta-llama/Llama-3.2-1B-Instruct --sweep\n\n# Accuracy evaluation mode (default is performance)\nflexbench --model-path meta-llama/Llama-3.2-1B-Instruct --mode accuracy\n\n# Full benchmark with both performance and accuracy in sequence\nflexbench --model-path meta-llama/Llama-3.2-1B-Instruct --mode all\n\n# Use existing vLLM 
server\nflexbench --model-path meta-llama/Llama-3.2-1B-Instruct --vllm-server http://localhost:8000  # assumes vLLM server is running\n```\n\nFlexBench automatically handles Docker container orchestration, model loading, benchmarking, and result collection with zero manual setup.\n\n## Architecture\n\nFlexBench uses Docker Compose to orchestrate two containers that communicate over a dedicated network:\n\n```mermaid\nflowchart TD\n    A[FlexBench CLI] --\u003e|Orchestrates| B[Docker Compose Network]\n\n    subgraph B[flexbench-network]\n        C[vLLM Server Container]\n        D[FlexBench Runner Container]\n        D \u003c--\u003e|API Calls| C\n    end\n\n    subgraph Host Machine\n        E[HuggingFace Cache]\n        F[Results Directory]\n        G[GPU Devices]\n    end\n\n    C --\u003e|Loads Models| E\n    D --\u003e|Saves Results| F\n    C --\u003e|Uses| G\n\n    H[Existing vLLM Server] -.-\u003e|Optional| D\n    H -.-\u003e|Bypasses container orchestration| B\n```\n\n**Container Orchestration:**\n\n- **vLLM Server Container**: Loads and serves the model via OpenAI-compatible API\n- **FlexBench Runner Container**: Generates load, collects metrics, and saves results\n- **Automatic networking**: Containers communicate over a dedicated Docker network\n- **GPU allocation**: Automatic device detection and resource management\n\n**External Server Option:**\n\n- Use `--vllm-server` to connect to an existing vLLM server\n- Bypasses vLLM container creation for maximum flexibility\n\n## Inference Scenarios\n\nFlexBench supports multiple inference scenarios based on MLPerf standards:\n\n| Scenario       | Description                                                                 | Load Generation                                                                       | Use Case                        
|\n|----------------|-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|----------------------------------|\n| **Server**     | Queries arrive following a Poisson distribution, mimicking real-world load. | \u003cimg alt=\"Server load generation\" src=\"./assets/server.png\" width=\"200\"/\u003e             | Online serving, latency testing  |\n| **Offline**    | All queries are sent at once, maximizing throughput.                        | \u003cimg alt=\"Offline load generation\" src=\"./assets/offline.png\" width=\"200\"/\u003e           | Throughput benchmarking          |\n| **SingleStream** | Queries are processed one at a time, measuring sequential latency (90th percentile). | \u003cimg alt=\"Single stream load generation\" src=\"./assets/single_stream.png\" width=\"200\"/\u003e      | Real-time, interactive, or mobile inference (e.g., autocomplete, AR) |\n\nFor more details on the MLPerf Inference Benchmark and the design of modes and metrics, refer to the [MLPerf Inference Benchmark paper](https://arxiv.org/pdf/1911.02549).\n\n## Device Support\n\nWhenever running without specifying a vLLM server, FlexBench automatically detects your hardware with `--device-type auto` (default):\n\n**Detection Priority:** CUDA → ROCm → ARM → CPU\n\n| Device Type | Default vLLM Image | Build Method | Hardware |\n|-------------|-------------------|--------------|----------|\n| **auto** | *Auto-detected* | *Varies by detected device* | Automatic hardware detection |\n| **cuda** | `vllm/vllm-openai:latest` | Pull from [registry](https://hub.docker.com/r/vllm/vllm-openai/tags) | NVIDIA GPUs |\n| **rocm** (WIP) | `rocm/vllm:latest` | Pull from [registry](https://hub.docker.com/r/rocm/vllm) | AMD GPUs |\n| **arm** | `vllm-arm-local:latest` | **Built from [source](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.arm)** | ARM processors |\n| **cpu** | 
`public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:v0.9.1` | Pull from [registry](https://gallery.ecr.aws/q9t5s3a7/vllm-cpu-release-repo) | CPU-only systems |\n\n**Note:** ARM devices require building vLLM from source since no pre-built ARM images are available. FlexBench automatically clones the vLLM repository and builds the image locally.\n\n**Force specific device:**\n\n```bash\n# Force CPU even with GPUs available\nflexbench --device-type cpu\n\n# Force CUDA with a specific GPU\nflexbench --model-path meta-llama/Llama-3.2-1B-Instruct --device-type cuda --gpu-devices \"1\"\n# equivalent to setting CUDA_VISIBLE_DEVICES=1 env variable\n\n# Run multiple GPUs with tensor parallelism\nflexbench --model-path meta-llama/Llama-3.2-70B-Instruct --device-type cuda --gpu-devices \"0,1\" --tensor-parallel-size 2\n```\n\n## Benchmark Modes\n\nFlexBench supports multiple evaluation modes via `--mode`:\n\n| Mode | Description | Usage |\n|------|-------------|-------|\n| **performance** | Benchmark throughput and latency (default) | `--mode performance` |\n| **accuracy** | Evaluate model outputs against reference data | `--mode accuracy` |\n| **all** | Run performance benchmark, then accuracy evaluation | `--mode all` |\n\n**Examples:**\n\n```bash\n# Performance only (default)\nflexbench --model-path HuggingFaceTB/SmolLM2-135M-Instruct\n\n# Accuracy evaluation\nflexbench --model-path HuggingFaceTB/SmolLM2-135M-Instruct --mode accuracy\n\n# Both modes sequentially\nflexbench --model-path HuggingFaceTB/SmolLM2-135M-Instruct --mode all\n```\n\n## Default Dataset\n\nFlexBench uses the **cTuning/MLPerf-OpenOrca** dataset by default - the official MLPerf dataset for text inference benchmarking. 
Pre-configured column mappings:\n\n- **Input column**: `question`\n- **Output column**: `response` (used for accuracy evaluation)\n- **System prompt**: `system_prompt`\n\n**Override defaults:**\n\n```bash\n# Use custom dataset\nflexbench --model-path HuggingFaceTB/SmolLM2-135M-Instruct \\\n  --dataset-path your-org/your-dataset \\\n  --dataset-input-column your_input_column \\\n  --dataset-output-column your_output_column\n```\n\n## Sweep Mode\n\nSweep mode automatically discovers your model's performance characteristics by testing multiple QPS levels.\nIt first starts by finding the maximum QPS your model can handle, then runs benchmarks at evenly spaced QPS points between 0 and the maximum QPS + 20%.\n\n**Usage:**\n\n```bash\n# Basic sweep with 10 QPS points (default)\nflexbench --model-path meta-llama/Llama-3.2-1B-Instruct --sweep\n\n# Custom sweep with 5 QPS points\nflexbench --model-path meta-llama/Llama-3.2-1B-Instruct --sweep --num-sweep-points 5\n```\n\nThe results will all be saved in a single file.\n\nNote: Sweep mode is incompatible with `--target-qps` (automatically determines QPS range) and `--mode accuracy` (performance analysis only).\n\n## Using MLCommons CMX automation language\n\nWe are developing [MLCommons CMX automations](https://github.com/mlcommons/ck/tree/master/cmx4mlops/repo/flex.task/run-mlperf-inference-benchmark)\nto help users prepare, validate, and submit official MLPerf inference results using FlexBench.\nThese automations are based on our [MLPerf inference v5.0 submission](https://github.com/mlcommons/inference_results_v5.0/tree/main/open/FlexAI/measurements/cmx-flexbench-cuda-1xH100-vllm-0.7.3-pytorch-2.5.1-huggingface-16d94432c8704c14/DeepSeek-R1-Distill-Llama-8B/Server),\nfeaturing DeepSeek-R1-Distill-Llama-8B and vLLM.\n\n## License and Copyright\n\nThis project is licensed under the [Apache License 2.0](LICENSE.md).\n\n© 2025 FlexAI\n\nPortions of the code were adapted from the following MLCommons repositories,\nwhich are also 
licensed under the Apache 2.0 license:\n\n- [mlcommons@inference](https://github.com/mlcommons/inference)\n- [mlcommons@inference_results_v5.0](https://github.com/mlcommons/inference_results_v5.0)\n- [mlcommons@ck](https://github.com/mlcommons/ck)\n- [vllm-project@vllm](https://github.com/vllm-project/vllm)\n\n## Authors and maintainers\n\n[Daniel Altunay](https://www.linkedin.com/in/daltunay) and [Grigori Fursin](https://cKnowledge.org/gfursin) (FCS Labs)\n\nWe would like to thank Dali Kilani, Venkataraju Koppada, Rahul Thangallapally,\nand other colleagues for their valuable discussions and feedback.\n\n## Contributing\n\nWe welcome contributions to this project!\n\nIf you have ideas, bug reports, or feature requests, please [open an issue](https://github.com/flexaihq/flexbench/issues).\nTo contribute code, feel free to submit a [pull request](https://github.com/flexaihq/flexbench/pulls).\nBy contributing, you agree that your contributions will be licensed under the same [Apache License 2.0](LICENSE.md).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fflexaihq%2Fflexbench","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fflexaihq%2Fflexbench","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fflexaihq%2Fflexbench/lists"}