https://github.com/slinusc/bench360
Bench360 is a modular benchmarking suite for local LLM deployments. It offers a full-stack, extensible pipeline to evaluate the latency, throughput, quality, and cost of LLM inference on consumer and enterprise GPUs. Bench360 supports flexible backends, tasks and scenarios, enabling fair and reproducible comparisons for researchers & practitioners.
- Host: GitHub
- URL: https://github.com/slinusc/bench360
- Owner: slinusc
- License: mit
- Created: 2025-02-23T07:26:38.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2025-08-03T14:51:53.000Z (2 months ago)
- Last Synced: 2025-08-03T16:32:46.381Z (2 months ago)
- Topics: bench360, benchmark, deployment, energy, energy-consumption, engine, framework, inference, llm, llm-inference, local, mldeploy, optimization, performance, quantization, sglang, tgi, vllm
- Language: Python
- Homepage:
- Size: 23.2 MB
- Stars: 3
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Bench360 – Local LLM Deployment Benchmark Suite
> System Performance. Energy Consumption. Task Quality. – One Benchmark.
**Bench360** is a modular benchmarking framework for evaluating **local LLM deployments** across backends, quantization formats, model architectures, and deployment scenarios.
It enables researchers and practitioners to analyze **latency, throughput, quality, efficiency, and cost** in real-world tasks like summarization, QA, and SQL generation, under both consumer and data center conditions.

---
## Why Bench360?
When deploying LLMs locally, trade-offs between **model size**, **quantization**, and **inference engine** can drastically impact performance and feasibility. Bench360 helps answer the real-world questions that arise when resources are limited and requirements are strict:
### Should you run a **7B model in FP16**, a **13B in INT8**, or a **32B in INT4**?
Bench360 benchmarks across multiple quantization formats and model sizes to help you understand the trade-offs between **quality**, **latency**, and **energy consumption**. Detailed telemetry lets you choose the sweet spot for your setup.
---
### Is **INT4 quantization good enough** for SQL generation or question answering?
Bench360 evaluates functional task quality, not just perplexity. For Text-to-SQL, it reports **execution accuracy** and **AST match**; for QA and summarization, it computes **F1**, **EM**, and **ROUGE**. You'll see whether aggressive quantization introduces failure cases *that actually matter*.
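For intuition, the QA scores are token-level metrics along these lines (an illustrative sketch, not necessarily Bench360's exact implementation):
```python
from collections import Counter

def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized prediction equals the reference, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1, as commonly reported for extractive QA."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the Eiffel Tower in Paris", "Eiffel Tower"))  # ~0.57
```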
---
### Which inference backend delivers the best performance for my use case?
Bench360 includes a workload controller that simulates different deployment scenarios:
- Single-stream
- Offline batch
- Multi-user server (Poisson-distributed query arrivals across concurrent threads; see the sketch below)

Engines like **vLLM**, **TGI**, **SGLang**, and **LMDeploy** can be tested under identical conditions.
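To make the server scenario concrete, here is a minimal sketch of per-user Poisson arrivals via exponentially distributed inter-arrival times (illustrative only; `send_request` is a hypothetical placeholder, not Bench360's API):
```python
import random
import threading
import time

def simulate_user(requests_per_min: float, run_time_s: float, send_request) -> None:
    """Issue requests with exponential inter-arrival times, i.e. a Poisson process."""
    rate_per_s = requests_per_min / 60.0
    deadline = time.time() + run_time_s
    while time.time() < deadline:
        time.sleep(random.expovariate(rate_per_s))  # wait for the next arrival
        send_request()

def send_request() -> None:
    # Placeholder: in a real run this would hit the inference backend's endpoint.
    print("query sent at", time.strftime("%H:%M:%S"))

# e.g. 8 concurrent users, 12 requests per user per minute, for 60 seconds
threads = [threading.Thread(target=simulate_user, args=(12, 60, send_request)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```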
---
## Features
| Category | Description |
|---------------------|-----------------------------------------------------------------------------|
| **Tasks** | Summarization, Question Answering (QA), Text-to-SQL |
| **Scenarios**       | `single`, `batch`, and `server` (Poisson-arrival, multi-threaded)            |
| **Metrics** | Latency (ATL/GL), Throughput (TPS, SPS), GPU/CPU util, Energy, Quality (F1, ROUGE, AST) |
| **Backends** | vLLM, TGI, SGLang, LMDeploy |
| **Quantization** | Support for FP16, INT8, INT4 (GPTQ, AWQ, GGUF) |
| **Cost Estimation** | Energy and amortized GPU cost per request (see the sketch below)             |
| **Output Format**   | CSV (run-level + per-sample details), logs, and plot-ready output            |
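As a back-of-the-envelope view of the cost estimation row, cost per request can be decomposed into energy cost plus amortized GPU hardware cost (a sketch with assumed prices and amortization period, not Bench360's exact formula):
```python
def cost_per_request(energy_wh: float, runtime_s: float, num_requests: int,
                     price_per_kwh: float = 0.30,      # assumed electricity price (USD/kWh)
                     gpu_price: float = 1800.0,        # assumed GPU purchase price (USD)
                     gpu_lifetime_h: float = 3 * 365 * 24) -> float:  # assumed 3-year amortization
    """Energy cost plus amortized GPU cost for one run, divided by the request count."""
    energy_cost = (energy_wh / 1000.0) * price_per_kwh
    amortized_gpu_cost = gpu_price * (runtime_s / 3600.0) / gpu_lifetime_h
    return (energy_cost + amortized_gpu_cost) / num_requests

# e.g. 150 Wh drawn over a 300 s run that served 2000 requests
print(f"{cost_per_request(150, 300, 2000):.6f} USD/request")
```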
---
## Installation
### Requirements
- OS: Ubuntu Linux
- NVIDIA GPU with NVML support
- CUDA 12.x
- Python 3.8+
- Docker

### Setup
Clone the repository:
```bash
git clone https://github.com/slinusc/fast_llm_inference.git
cd fast_llm_inference
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

> System dependencies:
```bash
sudo apt update && sudo apt install -y \
libssl-dev libcurl4 build-essential libllvm15 \
nvidia-container-toolkit && \
sudo nvidia-ctk runtime configure --runtime=docker && \
sudo systemctl restart docker
```

Pull all official backend Docker images:
```bash
docker pull lmsysorg/sglang:latest
docker pull openmmlab/lmdeploy:latest
docker pull vllm/vllm-openai:latest
docker pull ghcr.io/huggingface/text-generation-inference:latest
```

Export your Hugging Face token:
```bash
export HF_TOKEN=
```

---
## Usage
### Single Run
```yaml
# config.yaml
backend: tgi
hf_model: mistralai/Mistral-7B-Instruct-v0.3
model_name: Mistral-7B
task: qa
scenario: single
samples: 256
```

```bash
python launch_benchmark.py config.yaml
```

---
### Multi-run Sweep
Use **lists** to define a Cartesian product of runs (expansion sketched below):
```yaml
backend: [tgi, vllm]
hf_model:
  - mistralai/Mistral-7B-Instruct-v0.3
  - Qwen/Qwen2.5-7B-Instruct
task: [summarization, sql, qa]
scenario: [single, batch, server]
samples: 256
batch_size: [16, 64]
run_time: 300
concurrent_users: [8, 16, 32]
requests_per_user_per_min: 12
```

```bash
python launch_benchmark.py config.yaml
```
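Conceptually, every list-valued field is expanded into a Cartesian product of single-value configs, one per run. A minimal sketch of that expansion (illustrative; Bench360's own logic in `utils_multi.py` may differ, e.g. by skipping parameters that do not apply to a scenario):
```python
import itertools
import yaml  # pip install pyyaml

def expand_config(config: dict):
    """Yield one flat config dict per combination of list-valued fields."""
    keys = list(config)
    value_lists = [v if isinstance(v, list) else [v] for v in config.values()]
    for combo in itertools.product(*value_lists):
        yield dict(zip(keys, combo))

with open("config.yaml") as fh:
    sweep = yaml.safe_load(fh)

for i, run_cfg in enumerate(expand_config(sweep), start=1):
    print(i, run_cfg["backend"], run_cfg["hf_model"], run_cfg["task"], run_cfg["scenario"])
```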
---
## Add Your Own Task
Bench360 supports **plug-and-play task customization**. You can easily define your own evaluation logic (e.g., for RAG, classification, chatbot scoring) using the base interface.
### Step 1: Create a New Task
Create a file in:
```
benchmark/tasks/your_custom_task.py
```

Example:
```python
from benchmark.tasks.base_task import BaseTask


class YourCustomTask(BaseTask):
    def generate_prompts(self, num_examples: int):
        prompts = [...]      # build task-specific prompts here
        references = [...]   # matching gold references
        return prompts, references

    def quality_metrics(self, generated, reference):
        return {
            "custom_metric": some_score
        }
```

### Step 2: Register It
In `benchmark/tasks/__init__.py`, add:
```python
from .your_custom_task import YourCustomTask

TASKS = {
"qa": QATask,
"summarization": SummarizationTask,
"sql": TextToSQLTask,
"your_task": YourCustomTask
}
```

### Step 3: Run It
```yaml
backend: vllm
hf_model: mistralai/Mistral-7B-Instruct-v0.3
task: your_task
scenario: single
samples: 100
```

```bash
python launch_benchmark.py config.yaml
```

---
## Output
Each experiment generates:
```
results_/
├── run_report/       # One CSV per experiment (summary)
├── details/          # Per-query logs
├── readings/         # GPU/CPU/power metrics
└── failed_runs.log   # List of failed configs
```

Each filename includes:
* backend
* model
* task
* scenario
* parameters (e.g. batch size, concurrent users)
* config hash (see the sketch below)

This enables reproducible comparisons and tracking.
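A short, order-independent config hash like the one in the filenames can be derived along these lines (a sketch, not necessarily the exact scheme Bench360 uses):
```python
import hashlib
import json

def config_hash(config: dict, length: int = 8) -> str:
    """Stable short hash of a run configuration; key order does not matter."""
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:length]

print(config_hash({"backend": "vllm", "task": "qa", "scenario": "single", "samples": 256}))
```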
---
## Project Structure
```
fast_llm_inference/
├── benchmark/
│   ├── benchmark.py                  # Main benchmarking logic
│   ├── inference_engine_client.py    # Backend launcher
│   ├── tasks/                        # Task-specific eval logic
│   └── backends/                     # Inference wrapper modules
├── launch_benchmark.py               # CLI entry point
├── utils_multi.py                    # Multi-run config handling
├── config.yaml                       # Example config file
└── requirements.txt
```

---
## Contributing
Pull requests, bug reports, and ideas are welcome!
Fork the repo, create a feature branch, and submit your PR.

---
## License
Bench360 is released under the [MIT License](LICENSE).