{"id":47892530,"url":"https://github.com/cipher982/llm-benchmarks","last_synced_at":"2026-04-04T03:10:51.050Z","repository":{"id":169089818,"uuid":"644511466","full_name":"cipher982/llm-benchmarks","owner":"cipher982","description":"Benchmarking LLM Inference Speeds","archived":false,"fork":false,"pushed_at":"2026-03-03T17:35:11.000Z","size":9249,"stargazers_count":13,"open_issues_count":4,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-03-03T21:26:32.026Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://llm-benchmarks.com","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/cipher982.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":"AGENTS.md","dco":null,"cla":null}},"created_at":"2023-05-23T17:08:19.000Z","updated_at":"2026-03-03T17:35:21.000Z","dependencies_parsed_at":"2024-02-11T20:11:47.282Z","dependency_job_id":"5be750c0-3f3b-4886-8b0a-c1bf157e479e","html_url":"https://github.com/cipher982/llm-benchmarks","commit_stats":null,"previous_names":["cipher982/llm-benchmarks"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/cipher982/llm-benchmarks","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cipher982%2Fllm-benchmarks","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cipher982%2Fllm-benchmarks/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cipher982%2Fllm-benchmarks/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cipher982%2Fllm-benchmarks/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/cipher982","download_url":"https://codeload.github.com/cipher982/llm-benchmarks/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cipher982%2Fllm-benchmarks/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31386001,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-04T01:22:39.193Z","status":"online","status_checked_at":"2026-04-04T02:00:07.569Z","response_time":60,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-04-04T03:10:50.408Z","updated_at":"2026-04-04T03:10:51.036Z","avatar_url":"https://github.com/cipher982.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"![llmbenchmarkscom](https://cronitor.io/badges/G8yp5e/production/VnmBXHNorcpEyvbg9ASvxeGp8zU.svg)\n\n# LLM Benchmarks\n\n### 🌐 Live at: [llm-benchmarks.com](https://llm-benchmarks.com)\n[![Status](https://img.shields.io/uptimerobot/status/m797914664-fefc15fb1a5bba071a8a5c91)](https://stats.uptimerobot.com/m797914664-fefc15fb1a5bba071a8a5c91)\n[![Uptime](https://img.shields.io/uptimerobot/ratio/30/m797914664-fefc15fb1a5bba071a8a5c91)](https://stats.uptimerobot.com/m797914664-fefc15fb1a5bba071a8a5c91)\n\n[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)\n[![Docker](https://img.shields.io/badge/docker-compose-2496ED.svg?logo=docker\u0026logoColor=white)](https://www.docker.com/)\n[![MongoDB](https://img.shields.io/badge/MongoDB-4EA94B.svg?logo=mongodb\u0026logoColor=white)](https://www.mongodb.com/)\n[![NVIDIA CUDA](https://img.shields.io/badge/NVIDIA-CUDA-76B900.svg?logo=nvidia\u0026logoColor=white)](https://developer.nvidia.com/cuda-toolkit)\n[![vLLM](https://img.shields.io/badge/vLLM-Accelerated_Inference-orange.svg)](https://github.com/vllm-project/vllm)\n[![Hugging Face](https://img.shields.io/badge/🤗_Hugging_Face-Transformers-yellow.svg)](https://huggingface.co/docs/transformers/index)\n[![License](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)\n\nA comprehensive framework for benchmarking LLM inference speeds across various models and frameworks.\n\n## Overview\n\nThis project provides tools to benchmark Large Language Model (LLM) inference speeds across different frameworks, model sizes, and quantization methods. The benchmarks are designed to run both locally and in cloud environments, with results displayed on a dashboard at [llm-benchmarks.com](https://llm-benchmarks.com).\n\nThe system uses Docker with various frameworks (vLLM, Transformers, Text-Generation-Inference, llama-cpp) to automate benchmarks and upload results to a MongoDB database. Most frameworks fetch models from the HuggingFace Hub and cache them for on-demand loading, with the exception of llama-cpp/GGUF which requires specially compiled model formats.\n\n## Repository Structure\n\n- **`/api`**: Core benchmarking logic and API clients for different frameworks\n- **`/cloud`**: Configuration and Docker setup for cloud-based benchmarks (OpenAI, Anthropic, etc.)\n- **`/local`**: Configuration and Docker setup for local benchmarks (Hugging Face, vLLM, GGUF)\n  - **`/local/huggingface`**: Transformers and Text-Generation-Inference benchmarks\n  - **`/local/vllm`**: vLLM benchmarks\n  - **`/local/gguf`**: GGUF/llama-cpp benchmarks\n- **`/scripts`**: Utility scripts and notebooks\n- **`/static`**: Static assets like benchmark result images\n- **`models_config.yaml`**: Configuration for model groups used in benchmarks\n\n## Getting Started\n\n### Prerequisites\n\n- Docker and Docker Compose\n- NVIDIA GPU with CUDA support\n- Python 3.9+\n- MongoDB (optional, for result storage)\n\n### Setup\n\n1. Clone the repository:\n   ```bash\n   git clone https://github.com/cipher982/llm-benchmarks.git\n   cd llm-benchmarks\n   ```\n\n2. Set up environment variables:\n   ```bash\n   # Copy and edit .env file\n   cp .env.example .env\n   ```\n\n3. Edit the `.env` file with your configuration:\n   - Set `HF_HUB_CACHE` to your Hugging Face model cache directory\n   - Configure MongoDB connection (`MONGODB_URI`, `MONGODB_DB`, `MONGODB_COLLECTION_CLOUD`, etc.)\n   - Set API keys for cloud providers (`OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, `GROQ_API_KEY`, `CEREBRAS_API_KEY`, etc.)\n\n### Running Benchmarks\n\n#### Local Benchmarks\n\n1. Start the local benchmark containers:\n   ```bash\n   cd local\n   docker compose -f docker-compose.local.yml up --build\n   ```\n\n2. Run benchmarks for specific frameworks:\n\n   - Hugging Face Transformers:\n     ```bash\n     python api/run_hf.py --framework transformers --limit 5 --max-size-billion 10 --run-always\n     ```\n\n   - Hugging Face Text-Generation-Inference:\n     ```bash\n     python api/run_hf.py --framework hf-tgi --limit 5 --max-size-billion 10 --run-always\n     ```\n\n   - vLLM:\n     ```bash\n     python api/run_vllm.py --framework vllm --limit 5 --max-size-billion 10 --run-always\n     ```\n\n   - GGUF/llama-cpp:\n     ```bash\n     python api/run_gguf.py --limit 5 --run-always --log-level DEBUG\n     ```\n\n#### Cloud Benchmarks\n\nThere is no HTTP API required for scheduled runs. A headless scheduler runs providers in-process and writes results directly to MongoDB.\n\n1. Start the scheduler container (from repo root):\n   ```bash\n   DOCKER_BUILDKIT=1 docker compose up --build\n   ```\n\n   - Configure frequency via env vars in `.env`:\n     - `FRESH_MINUTES` (default 30): skip models with a run newer than this window\n     - `SLEEP_SECONDS` (default 1800): sleep between cycles\n\n2. Optional: run a one-off benchmark locally without Docker:\n   ```bash\n   python api/bench_headless.py --providers openai --limit 5 --fresh-minutes 30\n   # Or run all configured providers\n   python api/bench_headless.py --providers all\n   # Run only Cerebras once you have set CEREBRAS_API_KEY\n   python api/bench_headless.py --providers cerebras --limit 5\n   ```\n\n## Viewing Results\n\nResults can be viewed in two ways:\n\n1. **Dashboard**: Visit [llm-benchmarks.com](https://llm-benchmarks.com) to see the latest benchmark results\n2. **MongoDB**: Cloud results are stored in `MONGODB_COLLECTION_CLOUD`; errors in `MONGODB_COLLECTION_ERRORS`\n\n### Do self-hosted benchmarks upload to llm-benchmarks.com?\n\nNo. When you run the project locally (as of September 26, 2025) the scheduler only writes to the MongoDB instance configured in your `.env`. The public site uses a separate, access-controlled database; your runs will appear there only if you intentionally point `MONGODB_URI` at that shared database and have credentials to write to it. This keeps local experiments private by default.\n\n## Benchmark Results\n\nThe benchmarks measure inference speed across different models, quantization methods, and output token counts. Results indicate that even the slowest performing combinations still handily beat GPT-4 and almost always match or beat GPT-3.5, sometimes significantly.\n\n### Framework Comparisons\n\nDifferent frameworks show significant performance variations. For example, GGML with cuBLAS significantly outperforms Hugging Face Transformers with BitsAndBytes quantization:\n\n![GGML v HF](https://github.com/cipher982/llm-benchmarks/blob/main/static/ggml-hf-llama-compare.png?raw=true)\n\n### Model Size and Quantization Impact\n\nBenchmarks show how model size and quantization affect inference speed:\n\n#### LLaMA Models\n![LLaMA Models](https://github.com/cipher982/llm-benchmarks/blob/main/static/llama_compare_size_and_quant_inference.png?raw=true)\n\n#### Dolly-2 Models\n![Dolly2 Models](https://github.com/cipher982/llm-benchmarks/blob/main/static/dolly2_compare_size_and_quant_inference.png?raw=true)\n\n#### Falcon Models\n![Falcon Models](https://github.com/cipher982/llm-benchmarks/blob/main/static/falcon_compare_quantization_inference.png?raw=true)\n\n## Hardware Considerations\n\nBenchmarks have been run on various GPUs including:\n- NVIDIA RTX 3090\n- NVIDIA A10\n- NVIDIA A100\n- NVIDIA H100\n\nThe H100 consistently delivers the fastest performance but at a higher cost (~$2.40/hour). Surprisingly, the A10 performed below expectations despite its higher tensor core count, possibly due to memory bandwidth limitations.\n\n## Managing Models\n\nModels are stored in MongoDB and loaded dynamically by the scheduler. To add new models to the system, use the model management tools in the parent directory (`../manage-models.sh`).\n\n## Contributing\n\nContributions are welcome! To add new models or frameworks:\n\n1. Fork the repository\n2. Create a feature branch\n3. Add your implementation\n4. Submit a pull request\n\nFor more details, see the individual README files in the `/local` and `/cloud` directories.\n\n## License\n\nThis project is licensed under the MIT License - see the LICENSE file for details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcipher982%2Fllm-benchmarks","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcipher982%2Fllm-benchmarks","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcipher982%2Fllm-benchmarks/lists"}