{"id":19808807,"url":"https://github.com/premai-io/benchmarks","last_synced_at":"2025-05-01T07:32:50.414Z","repository":{"id":220224720,"uuid":"699231787","full_name":"premAI-io/benchmarks","owner":"premAI-io","description":"🕹️ Performance Comparison of MLOps Engines, Frameworks, and Languages on Mainstream AI Models.","archived":false,"fork":false,"pushed_at":"2024-04-22T17:13:26.000Z","size":781,"stargazers_count":69,"open_issues_count":27,"forks_count":3,"subscribers_count":11,"default_branch":"main","last_synced_at":"2024-04-23T00:13:29.780Z","etag":null,"topics":["ai","benchmarks","inference-engines","latency","llmops","mlops","performances"],"latest_commit_sha":null,"homepage":"","language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/premAI-io.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2023-10-02T08:02:28.000Z","updated_at":"2024-04-24T06:57:48.156Z","dependencies_parsed_at":"2024-04-15T08:35:53.646Z","dependency_job_id":"e4eabcaf-d993-4dc3-ab3e-70d478553949","html_url":"https://github.com/premAI-io/benchmarks","commit_stats":null,"previous_names":["premai-io/benchmarks"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/premAI-io%2Fbenchmarks","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/premAI-io%2Fbenchmarks/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/premAI-io%2Fbenchmarks/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/premAI-io%2Fbenchmarks/manifests","o
wner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/premAI-io","download_url":"https://codeload.github.com/premAI-io/benchmarks/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224245853,"owners_count":17279649,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","benchmarks","inference-engines","latency","llmops","mlops","performances"],"created_at":"2024-11-12T09:14:50.225Z","updated_at":"2024-11-12T09:14:50.331Z","avatar_url":"https://github.com/premAI-io.png","language":"Shell","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n\n \u003ch1 align=\"center\"\u003e🕹️ Benchmarks\u003c/h1\u003e\n \u003cp align=\"center\"\u003eA fully reproducible Performance Comparison of MLOps Engines, Frameworks, and Languages on Mainstream AI Models\u003c/p\u003e\n\u003c/div\u003e\n\n[![GitHub contributors](https://img.shields.io/github/contributors/premAI-io/benchmarks.svg)](https://github.com/premAI-io/benchmarks/graphs/contributors)\n[![GitHub commit activity](https://img.shields.io/github/commit-activity/m/premAI-io/benchmarks.svg)](https://github.com/premAI-io/benchmarks/commits/master)\n[![GitHub last commit](https://img.shields.io/github/last-commit/premAI-io/benchmarks.svg)](https://github.com/premAI-io/benchmarks/commits/master)\n[![GitHub top language](https://img.shields.io/github/languages/top/premAI-io/benchmarks.svg)](https://github.com/premAI-io/benchmarks)\n[![GitHub 
issues](https://img.shields.io/github/issues/premAI-io/benchmarks.svg)](https://github.com/premAI-io/benchmarks/issues)\n[![License](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)\n\n\u003cbr\u003e\n\n\u003cdiv align=\"center\"\u003e\n\n![alt text](image.png)\nCheck out our [release blog](https://blog.premai.io/prem-benchmarks/) to know more.\n\n\u003c/div\u003e\n\n\u003cdetails\u003e\n \u003csummary\u003eTable of Contents\u003c/summary\u003e\n \u003col\u003e\n \u003cli\u003e\u003ca href=\"#-quick-glance\"\u003eQuick glance towards performance metrics\u003c/a\u003e\u003c/li\u003e\n \u003cli\u003e\u003ca href=\"#-ml-engines\"\u003eML Engines\u003c/a\u003e\u003c/li\u003e\n \u003cli\u003e\u003ca href=\"#-why-benchmarks\"\u003eWhy Benchmarks\u003c/a\u003e\u003c/li\u003e\n \u003cli\u003e\u003ca href=\"#-usage-and-workflow\"\u003eUsage and workflow\u003c/a\u003e\u003c/li\u003e\n \u003cli\u003e\u003ca href=\"#-contribute\"\u003eContribute\u003c/a\u003e\u003c/li\u003e\n \u003c/ol\u003e\n\u003c/details\u003e\n\n\n## 🥽 Quick glance towards performance benchmarks\n\nTake a first glance at [Mistral 7B v0.1 Instruct](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1) and [Llama 2 7B Chat](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) Performance Metrics Across Different Precision and Inference Engines. 
Here is the run specification that generated these performance benchmark reports.\n\n**Environment:**\n- Model: Mistral 7B v0.1 Instruct / Llama 2 7B Chat\n- CUDA Version: 12.1\n- Batch size: 1\n\n**Command:**\n\n```\n./benchmark.sh --repetitions 10 --max_tokens 512 --device cuda --model mistral/llama --prompt 'Write an essay about the transformer model architecture'\n```\n\n### Mistral 7B v0.1 Instruct\n\n**Performance Metrics:** (unit: Tokens/second)\n\n| Engine                                     | float32       | float16       | int8          | int4          |\n| ------------------------------------------ | ------------- | ------------- | ------------- | ------------- |\n| [transformers (pytorch)](/bench_pytorch/)  | 39.61 ± 0.65  | 37.05 ± 0.49  | 5.08 ± 0.01   | 19.58 ± 0.38  |\n| [AutoAWQ](/bench_autoawq/)                 | -             | -             | -             | 63.12 ± 2.19  |\n| [AutoGPTQ](/bench_autogptq/)               | 39.11 ± 0.42  | 42.94 ± 0.80  |               |               |\n| [DeepSpeed](/bench_deepspeed/)             |               | 79.88 ± 0.32  |               |               |\n| [ctransformers](/bench_ctransformers/)     | -             | -             | 86.14 ± 1.40  | 87.22 ± 1.54  |\n| [llama.cpp](/bench_llamacpp/)              | -             | -             | 88.27 ± 0.72  | 95.33 ± 5.54  |\n| [ctranslate](/bench_ctranslate/)           | 43.17 ± 2.97  | 68.03 ± 0.27  | 45.14 ± 0.24  | -             |\n| [PyTorch Lightning](/bench_lightning/)     | 32.79 ± 2.74  | 43.01 ± 2.90  | 7.75 ± 0.12   | -             |\n| [Nvidia TensorRT-LLM](/bench_tensorrtllm/) | 117.04 ± 2.16 | 206.59 ± 6.93 | 390.49 ± 4.86 | 427.40 ± 4.84 |\n| [vllm](/bench_vllm/)                       | 84.91 ± 0.27  | 84.89 ± 0.28  | -             | 106.03 ± 0.53 |\n| [exllamav2](/bench_exllamav2/)             | -             | -             | 114.81 ± 1.47 | 126.29 ± 3.05 |\n| [onnx](/bench_onnxruntime/)                | 15.75 ± 0.15  | 22.39 ± 0.14  | -     
        | -             |\n| [Optimum Nvidia](/bench_optimum_nvidia/)   | 50.77 ± 0.85  | 50.91 ± 0.19  | -             | -             |\n\n**Performance Metrics:** GPU Memory Consumption (unit: MB)\n\n| Engine                                     | float32  | float16  | int8     | int4     |\n| ------------------------------------------ | -------- | -------- | -------- | -------- |\n| [transformers (pytorch)](/bench_pytorch/)  | 31071.4  | 15976.1  | 10963.91 | 5681.18  |\n| [AutoGPTQ](/bench_autogptq/)               | 13400.80 | 6633.29  |          |          |\n| [AutoAWQ](/bench_autoawq/)                 | -        | -        | -        | 6572.47  |\n| [DeepSpeed](/bench_deepspeed/)             |          | 80097.34 |          |          |\n| [ctransformers](/bench_ctransformers/)     | -        | -        | 10255.07 | 6966.74  |\n| [llama.cpp](/bench_llamacpp/)              | -        | -        | 9141.49  | 5880.41  |\n| [ctranslate](/bench_ctranslate/)           | 32602.32 | 17523.8  | 10074.72 | -        |\n| [PyTorch Lightning](/bench_lightning/)     | 48783.95 | 18738.05 | 10680.32 | -        |\n| [Nvidia TensorRT-LLM](/bench_tensorrtllm/) | 79536.59 | 78341.21 | 77689.0  | 77311.51 |\n| [vllm](/bench_vllm/)                       | 73568.09 | 73790.39 | -        | 74016.88 |\n| [exllamav2](/bench_exllamav2/)             | -        | -        | 21483.23 | 9460.25  |\n| [onnx](/bench_onnxruntime/)                | 33629.93 | 19537.07 | -        | -        |\n| [Optimum Nvidia](/bench_optimum_nvidia/)   | 79563.85 | 79496.74 | -        | -        |\n\n*(Data updated: `30th April 2024`)\n\n### Llama 2 7B Chat\n\n**Performance Metrics:** (unit: Tokens / second)\n\n| Engine                                     | float32       | float16       | int8          | int4          |\n| ------------------------------------------ | ------------- | ------------- | ------------- | ------------- |\n| [transformers (pytorch)](/bench_pytorch/)  | 36.65 ± 0.61  | 34.20 ± 0.51  
| 6.91 ± 0.14   | 17.83 ± 0.40  |\n| [AutoAWQ](/bench_autoawq/)                 | -             | -             | -             | 63.59 ± 1.86  |\n| [AutoGPTQ](/bench_autogptq/)               | 34.36 ± 0.51  | 36.63 ± 0.61  |               |               |\n| [DeepSpeed](/bench_deepspeed/)             |               | 84.60 ± 0.25  |               |               |\n| [ctransformers](/bench_ctransformers/)     | -             | -             | 85.50 ± 1.00  | 86.66 ± 1.06  |\n| [llama.cpp](/bench_llamacpp/)              | -             | -             | 89.90 ± 2.26  | 97.35 ± 4.71  |\n| [ctranslate](/bench_ctranslate/)           | 46.26 ± 1.59  | 79.41 ± 0.37  | 48.20 ± 0.14  | -             |\n| [PyTorch Lightning](/bench_lightning/)     | 38.01 ± 0.09  | 48.09 ± 1.12  | 10.68 ± 0.43  | -             |\n| [Nvidia TensorRT-LLM](/bench_tensorrtllm/) | 104.07 ± 1.61 | 191.00 ± 4.60 | 316.77 ± 2.14 | 358.49 ± 2.38 |\n| [vllm](/bench_vllm/)                       | 89.40 ± 0.22  | 89.43 ± 0.19  | -             | 115.52 ± 0.49 |\n| [exllamav2](/bench_exllamav2/)             | -             | -             | 125.58 ± 1.23 | 159.68 ± 1.85 |\n| [onnx](/bench_onnxruntime/)                | 14.28 ± 0.12  | 19.42 ± 0.08  | -             | -             |\n| [Optimum Nvidia](/bench_optimum_nvidia/)   | 53.64 ± 0.78  | 53.82 ± 0.11  | -             | -             |\n\n\n**Performance Metrics:** GPU Memory Consumption (unit: MB)\n\n| Engine                                     | float32  | float16  | int8     | int4     |\n| ------------------------------------------ | -------- | -------- | -------- | -------- |\n| [transformers (pytorch)](/bench_pytorch/)  | 29114.76 | 14931.72 | 8596.23  | 5643.44  |\n| [AutoAWQ](/bench_autoawq/)                 | -        | -        | -        | 7149.19  |\n| [AutoGPTQ](/bench_autogptq/)               | 10718.54 | 5706.35  |          |          |\n| [DeepSpeed](/bench_deepspeed/)             |          | 80105.13 |          |          |\n| 
[ctransformers](/bench_ctransformers/)     | -        | -        | 9774.83  | 6889.14  |\n| [llama.cpp](/bench_llamacpp/)              | -        | -        | 8797.55  | 5783.95  |\n| [ctranslate](/bench_ctranslate/)           | 29951.52 | 16282.29 | 9470.74  | -        |\n| [PyTorch Lightning](/bench_lightning/)     | 42748.35 | 14736.69 | 8028.16  | -        |\n| [Nvidia TensorRT-LLM](/bench_tensorrtllm/) | 79421.24 | 78295.07 | 77642.86 | 77256.98 |\n| [vllm](/bench_vllm/)                       | 77928.07 | 77928.07 | -        | 77768.69 |\n| [exllamav2](/bench_exllamav2/)             | -        | -        | 16582.18 | 7201.62  |\n| [onnx](/bench_onnxruntime/)                | 33072.09 | 19180.55 | -        | -        |\n| [Optimum Nvidia](/bench_optimum_nvidia/)   | 79429.63 | 79295.41 | -        | -        |\n\n*(Data updated: `30th April 2024`)\n\n\u003e Our latest version benchmarks Llama 2 7B Chat and Mistral 7B v0.1 Instruct. It runs only on an A100 80GB GPU, because our primary focus is enterprise use. Previous versions benchmarked Llama 2 7B on CUDA and on Mac (M1/M2) CPU and Metal; you can find those results in the [archive.md](/docs/archive.md) file. Please note that those numbers may be outdated, because all the engines are actively maintained and continuously improved.\n\n## 🛳 ML Engines\n\nIn the current market, there are several ML Engines. Here is a quick glance at all the engines used for the benchmark and a quick summary of their support matrix. 
You can find the details about the nuances [here](/docs/ml_engines.md).\n\n| Engine                                     | Float32 | Float16 | Int8  | Int4  | CUDA  | ROCM  | Mac M1/M2 | Training |\n| ------------------------------------------ | :-----: | :-----: | :---: | :---: | :---: | :---: | :-------: | :------: |\n| [candle](/bench_candle/)                   |    ⚠️    |    ✅    |   ⚠️   |   ⚠️   |   ✅   |   ❌   |     🚧     |    ❌     |\n| [llama.cpp](/bench_llamacpp/)              |    ❌    |    ❌    |   ✅   |   ✅   |   ✅   |   🚧   |     🚧     |    ❌     |\n| [ctranslate](/bench_ctranslate/)           |    ✅    |    ✅    |   ✅   |   ❌   |   ✅   |   ❌   |     🚧     |    ❌     |\n| [onnx](/bench_onnxruntime/)                |    ✅    |    ✅    |   ❌   |   ❌   |   ✅   |   ⚠️   |     ❌     |    ❌     |\n| [transformers (pytorch)](/bench_pytorch/)  |    ✅    |    ✅    |   ✅   |   ✅   |   ✅   |   🚧   |     ✅     |    ✅     |\n| [vllm](/bench_vllm/)                       |    ✅    |    ✅    |   ❌   |   ✅   |   ✅   |   🚧   |     ❌     |    ❌     |\n| [exllamav2](/bench_exllamav2/)             |    ❌    |    ❌    |   ✅   |   ✅   |   ✅   |   🚧   |     ❌     |    ❌     |\n| [ctransformers](/bench_ctransformers/)     |    ❌    |    ❌    |   ✅   |   ✅   |   ✅   |   🚧   |     🚧     |    ❌     |\n| [AutoGPTQ](/bench_autogptq/)               |    ✅    |    ✅    |   ⚠️   |   ⚠️   |   ✅   |   ❌   |     ❌     |    ❌     |\n| [AutoAWQ](/bench_autoawq/)                 |    ❌    |    ❌    |   ❌   |   ✅   |   ✅   |   ❌   |     ❌     |    ❌     |\n| [DeepSpeed-MII](/bench_deepspeed/)         |    ❌    |    ✅    |   ❌   |   ❌   |   ✅   |   ❌   |     ❌     |    ⚠️     |\n| [PyTorch Lightning](/bench_lightning/)     |    ✅    |    ✅    |   ✅   |   ✅   |   ✅   |   ⚠️   |     ⚠️     |    ✅     |\n| [Optimum Nvidia](/bench_optimum_nvidia/)   |    ✅    |    ✅    |   ❌   |   ❌   |   ✅   |   ❌   |     ❌     |    ❌     |\n| [Nvidia TensorRT-LLM](/bench_tensorrtllm/) |    ✅    |    ✅    |   ✅ 
  |   ✅   |   ✅   |   ❌   |     ❌     |    ❌     |\n\n\n### Legend:\n- ✅ Supported\n- ❌ Not Supported\n- ⚠️ Supported, with caveats\n- 🚧 Supported by the engine but not yet implemented in this version of the benchmarks\n\nYou can check out the nuances related to ⚠️ and 🚧 in detail [here](/docs/ml_engines.md)\n\n## 🤔 Why Benchmarks\n\nWhat benefits can you expect from this repository? Here are some quick pointers.\n\n1. With several engines and precisions to choose from, it is often unclear which to use for an LLM inference workflow, especially under compute constraints or other requirements. This repository gives you a quick idea of what to use based on your requirements.\n\n2. Engines and precisions sometimes trade quality for speed. This repository tracks those tradeoffs so that you can weigh them against your priorities.\n\n3. The scripts are fully reproducible and hackable. The latest benchmarks follow best practices so that they run robustly on GPU devices, and you can reference and extend the implementations to build your own workflows.\n\n## 🚀 Usage and workflow\n\nWelcome to our benchmarking repository! This organized structure is designed to simplify benchmark management and execution. Each benchmark runs an inference engine that provides some form of optimization, either through quantization alone or through device-specific optimizations such as custom CUDA kernels.\n\nTo get started, you need to download the models first: [Llama2 7B Chat](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) and [Mistral-7B v0.1 Instruct](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1). 
Start the download by running this command:\n\n```bash\n./download.sh\n```\n\nBefore running with the [Llama2-7B Chat weights](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf), please make sure that you have agreed to the required [terms and conditions](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) and have been verified to download the weights.\n\n### A Benchmark workflow\n\nWhen you run a benchmark, the following steps occur:\n\n- The environment is set up automatically and the required dependencies are installed.\n- The models are converted to a specific format (if required) and saved.\n- The benchmarks run and their results are stored inside the logs folder. Each log folder has the following structure:\n\n - `performance.log`: Tracks the model run performance. You can see the `token/sec` and `memory consumption (MB)` here.\n - `quality.md`: An automatically generated readme file that contains qualitative comparisons of different precisions of some engines. We take 5 prompts and run them for each supported precision of that engine, then put the results side by side. Our ground truth is the output from the Hugging Face PyTorch model with raw float32 weights.\n - `quality.json`: Same as the readme file but in raw format.\n\nInside each benchmark folder, you will also see a readme.md file which contains all the information and the qualitative comparison of the engine. 
For example: [bench_tensorrtllm](/bench_tensorrtllm/README.md).\n\n### Running a Benchmark\n\nHere is how to run a benchmark for an inference engine:\n\n```bash\n./bench_\u003cengine-name\u003e/bench.sh \\\n --prompt \u003cvalue\u003e \\ # Enter a prompt string\n --max_tokens \u003cvalue\u003e \\  # Maximum number of tokens to output\n --repetitions \u003cvalue\u003e \\  # Number of repetitions to be made for the prompt.\n --device \u003ccpu/cuda/metal\u003e \\  # The device in which we want to benchmark.\n --model_name \u003cname-of-the-model\u003e # The name of the model. (options: 'llama' for Llama2 and 'mistral' for Mistral-7B-v0.1)\n```\n\nFor example, to benchmark Nvidia TensorRT-LLM, the command would look like this:\n\n```bash\n./bench_tensorrtllm/bench.sh -d cuda -n llama -r 10\n```\n\nHere is more detail on each command-line argument:\n\n```\n -p, --prompt Prompt for benchmarks (default: 'Write an essay about the transformer model architecture')\n -r, --repetitions Number of repetitions for benchmarks (default: 10)\n -m, --max_tokens Maximum number of tokens for benchmarks (default: 512)\n -d, --device Device for benchmarks (possible values: 'metal', 'cuda', and 'CPU', default: 'cuda')\n -n, --model_name The name of the model to benchmark (possible values: 'llama' for using Llama2, 'mistral' for using Mistral 7B v0.1)\n -lf, --log_file Logging file name.\n -h, --help Show this help message\n```\n\n## 🤝 Contribute\n\nWe welcome contributions to enhance and expand our benchmarking repository. If you'd like to contribute a new benchmark, follow these steps:\n\n### Creating a New Benchmark\n\n**1. Create a New Folder**\n\nStart by creating a new folder for your benchmark. Name it `bench_{new_bench_name}` for consistency.\n\n```bash\nmkdir bench_{new_bench_name}\n```\n\n**2. 
Folder Structure**\n\nInside the new benchmark folder, include the following structure:\n\n```\nbench_{new_bench_name}\n├── bench.sh # Benchmark script for setup and execution\n├── requirements.txt # Dependencies required for the benchmark\n└── ... # Any additional files needed for the benchmark\n```\n\n**3. Benchmark Script (`bench.sh`):**\n\nThe `bench.sh` script should handle setup, environment configuration, and the actual execution of the benchmark. Ensure it supports the parameters listed in the [Running a Benchmark](#running-a-benchmark) section.\n\n### Pre-commit Hooks\n\nWe use pre-commit hooks to maintain code quality and consistency.\n\n**1. Install Pre-commit:** Ensure you have `pre-commit` installed:\n\n```bash\npip install pre-commit\n```\n\n**2. Install Hooks:** Run the following command to install the pre-commit hooks:\n\n```bash\npre-commit install\n```\n\nThe existing pre-commit configuration runs automatic checks before each commit, ensuring code quality and adherence to the defined standards.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpremai-io%2Fbenchmarks","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpremai-io%2Fbenchmarks","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpremai-io%2Fbenchmarks/lists"}