{"id":30108999,"url":"https://github.com/slinusc/bench360","last_synced_at":"2026-04-28T03:06:09.581Z","repository":{"id":279018986,"uuid":"937493773","full_name":"slinusc/bench360","owner":"slinusc","description":"Bench360 is a modular benchmarking suite for local LLM deployments. It offers a full-stack, extensible pipeline to evaluate the latency, throughput, quality, and cost of LLM inference on consumer and enterprise GPUs. Bench360 supports flexible backends, tasks and scenarios, enabling fair and reproducible comparisons for researchers \u0026 practitioners.","archived":false,"fork":false,"pushed_at":"2025-08-03T14:51:53.000Z","size":24346,"stargazers_count":3,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-08-03T16:32:46.381Z","etag":null,"topics":["bench360","benchmark","deployment","energy","energy-consumption","engine","framework","inference","llm","llm-inference","local","mldeploy","optimization","performance","quantization","sglang","tgi","vllm"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/slinusc.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-02-23T07:26:38.000Z","updated_at":"2025-08-03T14:51:58.000Z","dependencies_parsed_at":"2025-02-23T08:26:40.471Z","dependency_job_id":"87b35325-f5bb-43dc-a6fa-20a0356e6cdb","html_url":"https://github.com/slinusc/bench360","commit_stats":null,"previous_names":["slinusc/fast_llm_inference","slinusc/bench360"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/slinusc/bench360","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/slinusc%2Fbench360","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/slinusc%2Fbench360/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/slinusc%2Fbench360/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/slinusc%2Fbench360/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/slinusc","download_url":"https://codeload.github.com/slinusc/bench360/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/slinusc%2Fbench360/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":269669977,"owners_count":24456777,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-10T02:00:08.965Z","response_time":71,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[
"bench360","benchmark","deployment","energy","energy-consumption","engine","framework","inference","llm","llm-inference","local","mldeploy","optimization","performance","quantization","sglang","tgi","vllm"],"created_at":"2025-08-10T03:20:33.136Z","updated_at":"2025-12-26T03:25:58.573Z","avatar_url":"https://github.com/slinusc.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Bench360 – Local LLM Deployment Benchmark Suite\n\n\u003e ⚡ System Performance. 🔋 Energy Consumption. 🎯 Task Quality.  - One Benchmark.\n\n**Bench360** is a modular benchmarking framework for evaluating **local LLM deployments** across backends, quantization formats, model architectures, and deployment scenarios.\n\nIt enables researchers and practitioners to analyze **latency, throughput, quality, efficiency, and cost** in real-world tasks like summarization, QA, and SQL generation—under both consumer and data center conditions.\n\n![Bench360°](benchmark/docs/bench360.jpg)\n\n---\n\n## 🔍 Why Bench360?\n\nWhen deploying LLMs locally, trade-offs between **model size**, **quantization**, and **inference engine** can drastically impact performance and feasibility. Bench360 helps answer the real-world questions that arise when resources are limited and requirements are strict:\n\n### ❓ Should you run a **7B model in FP16**, a **13B in INT8**, or a **32B in INT4**?\n\nBench360 benchmarks across multiple quantization formats and model sizes to help you understand the trade-offs between **quality**, **latency**, and **energy consumption**. Detailed telemetry let you choose the sweet spot for your setup.\n\n---\n\n### ❓ Is **INT4 quantization good enough** for SQL generation or question answering?\n\nBench360 evaluates functional task quality—not just perplexity. For Text-to-SQL, it reports **execution accuracy** and **AST match**; for QA and summarization, it computes **F1**, **EM**, and **ROUGE**. 
---

### ❓ Which inference backend delivers the best performance for my use case?

Bench360 includes a workload controller that simulates different deployment scenarios:

- 🧵 Single-stream
- 📦 Offline batch
- 🌐 Multi-user server (with Poisson-distributed, multi-threaded query arrivals)

Engines like **vLLM**, **TGI**, **SGLang**, and **LMDeploy** can be tested under identical conditions.

---

## ⚙️ Features

| Category            | Description                                                                                    |
|---------------------|------------------------------------------------------------------------------------------------|
| **Tasks**           | Summarization, Question Answering (QA), Text-to-SQL                                           |
| **Scenarios**       | `single`, `batch`, and `server` (multi-threaded Poisson arrivals)                              |
| **Metrics**         | Latency (ATL/GL), Throughput (TPS, SPS), GPU/CPU utilization, Energy, Quality (F1, ROUGE, AST) |
| **Backends**        | vLLM, TGI, SGLang, LMDeploy                                                                    |
| **Quantization**    | FP16, INT8, INT4 (GPTQ, AWQ, GGUF)                                                             |
| **Cost Estimation** | Energy and amortized GPU cost per request                                                      |
| **Output Format**   | CSV (run-level summaries + per-sample details), logs, and plot-ready output                    |

---

## 🧱 Installation

### Requirements

- OS: Ubuntu Linux
- NVIDIA GPU with NVML support
- CUDA 12.x
- Python 3.8+
- Docker

### Setup

Clone the repository:

```bash
git clone https://github.com/slinusc/fast_llm_inference.git
cd fast_llm_inference
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

Install system dependencies:

```bash
sudo apt update && sudo apt install -y \
  libssl-dev libcurl4 build-essential libllvm15 \
  nvidia-container-toolkit && \
  sudo nvidia-ctk runtime configure --runtime=docker && \
  sudo systemctl restart docker
```

Pull the official backend Docker images:

```bash
docker pull lmsysorg/sglang:latest
docker pull openmmlab/lmdeploy:latest
docker pull vllm/vllm-openai:latest
docker pull ghcr.io/huggingface/text-generation-inference:latest
```

Export your Hugging Face token:

```bash
export HF_TOKEN=<your HF token>
```

---

## 🚀 Usage

### ✅ Single Run

```yaml
# config.yaml
backend: tgi
hf_model: mistralai/Mistral-7B-Instruct-v0.3
model_name: Mistral-7B
task: qa
scenario: single
samples: 256
```

```bash
python launch_benchmark.py config.yaml
```

---

### 🔁 Multi-run Sweep

Use **lists** to define a Cartesian product of configurations, expanded as sketched below:

```yaml
backend: [tgi, vllm]
hf_model:
  - mistralai/Mistral-7B-Instruct-v0.3
  - Qwen/Qwen2.5-7B-Instruct
task: [summarization, sql, qa]
scenario: [single, batch, server]

samples: 256
batch_size: [16, 64]
run_time: 300
concurrent_users: [8, 16, 32]
requests_per_user_per_min: 12
```

```bash
python launch_benchmark.py config.yaml
```
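Multi-run configs are handled by `utils_multi.py`. As a rough mental model (an illustration only, not the suite's actual implementation), list-valued fields expand into one run per combination, with scalars applied to every run:

```python
# Illustrative sketch of the Cartesian-product expansion (reduced config).
# The suite's own logic lives in utils_multi.py; its internals may differ.
from itertools import product

config = {
    "backend": ["tgi", "vllm"],
    "task": ["summarization", "sql", "qa"],
    "scenario": ["single", "batch", "server"],
    "samples": 256,  # scalars apply to every run
}

# Wrap scalars as one-element lists, then take the product over all fields.
values = [v if isinstance(v, list) else [v] for v in config.values()]
runs = [dict(zip(config, combo)) for combo in product(*values)]

print(len(runs))  # 2 backends x 3 tasks x 3 scenarios = 18 runs
print(runs[0])    # {'backend': 'tgi', 'task': 'summarization', 'scenario': 'single', 'samples': 256}
```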
---

## 🧩 Add Your Own Task

Bench360 supports **plug-and-play task customization**: you can define your own evaluation logic (e.g., for RAG, classification, or chatbot scoring) using the base interface.

### 🔨 Step 1: Create a New Task

Create a file at:

```
benchmark/tasks/your_custom_task.py
```

Example:

```python
from benchmark.tasks.base_task import BaseTask

class YourCustomTask(BaseTask):
    def generate_prompts(self, num_examples: int):
        # Build the prompts and matching gold references for your dataset.
        prompts = [...]     # placeholder: one prompt string per example
        references = [...]  # placeholder: one reference per prompt
        return prompts, references

    def quality_metrics(self, generated, reference):
        # Score a single generation against its reference.
        return {
            "custom_metric": some_score  # placeholder: your metric value
        }
```

### 📌 Step 2: Register It

In `benchmark/tasks/__init__.py`, add:

```python
from .your_custom_task import YourCustomTask

TASKS = {
    "qa": QATask,
    "summarization": SummarizationTask,
    "sql": TextToSQLTask,
    "your_task": YourCustomTask,
}
```

### ▶️ Step 3: Run It

```yaml
backend: vllm
hf_model: mistralai/Mistral-7B-Instruct-v0.3
task: your_task
scenario: single
samples: 100
```

```bash
python launch_benchmark.py config.yaml
```

---

## 📦 Output

Each experiment generates:

```
results_<timestamp>/
├── run_report/          # One CSV per experiment (summary)
├── details/             # Per-query logs
├── readings/            # GPU/CPU/power metrics
└── failed_runs.log      # List of failed configs
```

Each filename includes:

* backend
* model
* task
* scenario
* parameters (e.g., batch size, concurrent users)
* config hash

This enables reproducible comparisons and tracking.

---

## 🗂 Project Structure

```
fast_llm_inference/
├── benchmark/
│   ├── benchmark.py               # Main benchmarking logic
│   ├── inference_engine_client.py # Backend launcher
│   ├── tasks/                     # Task-specific eval logic
│   ├── backends/                  # Inference wrapper modules
├── launch_benchmark.py            # CLI entry point
├── utils_multi.py                 # Multi-run config handling
├── config.yaml                    # Example config file
└── requirements.txt
```

---

## 🧪 Contributing

Pull requests, bug reports, and ideas are welcome!
Fork the repo, create a feature branch, and submit your PR.

---

## 📄 License

Bench360 is released under the [MIT License](LICENSE).