{"id":13476119,"url":"https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference","last_synced_at":"2025-03-27T02:31:34.205Z","repository":{"id":182200303,"uuid":"668083234","full_name":"XiongjieDai/GPU-Benchmarks-on-LLM-Inference","owner":"XiongjieDai","description":"Multiple NVIDIA GPUs or Apple Silicon for Large Language Model Inference?","archived":false,"fork":false,"pushed_at":"2024-05-13T00:29:05.000Z","size":2894,"stargazers_count":684,"open_issues_count":3,"forks_count":25,"subscribers_count":18,"default_branch":"main","last_synced_at":"2024-08-01T16:44:09.190Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/XiongjieDai.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-07-19T01:53:11.000Z","updated_at":"2024-08-01T11:59:43.000Z","dependencies_parsed_at":"2024-10-30T08:50:35.442Z","dependency_job_id":null,"html_url":"https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference","commit_stats":null,"previous_names":["xiongjiedai/gpu-benchmarks-on-llm-inference"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/XiongjieDai%2FGPU-Benchmarks-on-LLM-Inference","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/XiongjieDai%2FGPU-Benchmarks-on-LLM-Inference/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/XiongjieDai%2FGPU-Benchmarks-on-LLM-Inference/releases","manifests_url":"https://repos.ecosyste.
ms/api/v1/hosts/GitHub/repositories/XiongjieDai%2FGPU-Benchmarks-on-LLM-Inference/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/XiongjieDai","download_url":"https://codeload.github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245769366,"owners_count":20669166,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-07-31T16:01:26.767Z","updated_at":"2025-03-27T02:31:32.951Z","avatar_url":"https://github.com/XiongjieDai.png","language":"Jupyter Notebook","funding_links":[],"categories":["Jupyter Notebook","Datasets-or-Benchmark","A01_文本生成_文本对话","Hardware Guides"],"sub_categories":["推理速度","大语言对话模型及数据","VRAM Requirements"],"readme":"# GPU-Benchmarks-on-LLM-Inference\n\nMultiple NVIDIA GPUs or Apple Silicon for Large Language Model Inference? 🧐\n\n## Description\n\nUse [llama.cpp](https://github.com/ggerganov/llama.cpp) to test the [LLaMA](https://arxiv.org/abs/2302.13971) models inference speed of different GPUs on [RunPod](https://www.runpod.io/), 13-inch M1 MacBook Air, 14-inch M1 Max MacBook Pro, M2 Ultra Mac Studio and 16-inch M3 Max MacBook Pro for LLaMA 3.\n\n## Overview\n\nAverage speed (tokens/s) of generating 1024 tokens by GPUs on LLaMA 3. 
Higher speed is better.\n\n| GPU                        | 8B Q4_K_M | 8B F16 | 70B Q4_K_M | 70B F16 |\n|----------------------------|-----------|--------|------------|---------|\n| 3070 8GB                   | 70.94     | OOM    | OOM        | OOM     |\n| 3080 10GB                  | 106.40    | OOM    | OOM        | OOM     |\n| 3080 Ti 12GB               | 106.71    | OOM    | OOM        | OOM     |\n| 4070 Ti 12GB               | 82.21     | OOM    | OOM        | OOM     |\n| 4080 16GB                  | 106.22    | 40.29  | OOM        | OOM     |\n| RTX 4000 Ada 20GB          | 58.59     | 20.85  | OOM        | OOM     |\n| 3090 24GB                  | 111.74    | 46.51  | OOM        | OOM     |\n| 4090 24GB                  | 127.74    | 54.34  | OOM        | OOM     |\n| RTX 5000 Ada 32GB          | 89.87     | 32.67  | OOM        | OOM     |\n| 3090 24GB * 2              | 108.07    | 47.15  | 16.29      | OOM     |\n| 4090 24GB * 2              | 122.56    | 53.27  | 19.06      | OOM     |\n| RTX A6000 48GB             | 102.22    | 40.25  | 14.58      | OOM     |\n| RTX 6000 Ada 48GB          | 130.99    | 51.97  | 18.36      | OOM     |\n| A40 48GB                   | 88.95     | 33.95  | 12.08      | OOM     |\n| L40S 48GB                  | 113.60    | 43.42  | 15.31      | OOM     |\n| RTX 4000 Ada 20GB * 4      | 56.14     | 20.58  | 7.33       | OOM     |\n| A100 PCIe 80GB             | 138.31    | 54.56  | 22.11      | OOM     |\n| A100 SXM 80GB              | 133.38    | 53.18  | 24.33      | OOM     |\n| H100 PCIe 80GB             |**144.49**|**67.79**| 25.01      | OOM     |\n| 3090 24GB * 4              | 104.94    | 46.40  | 16.89      | OOM     |\n| 4090 24GB * 4              | 117.61    | 52.69  | 18.83      | OOM     |\n| RTX 5000 Ada 32GB * 4      | 82.73     | 31.94  | 11.45      | OOM     |\n| 3090 24GB * 6              | 101.07    | 45.55  | 16.93      | 5.82    |\n| 4090 24GB * 8              | 116.13    | 52.12  | 18.76      | 6.45    
|\n| RTX A6000 48GB * 4         | 93.73     | 38.87  | 14.32      | 4.74    |\n| RTX 6000 Ada 48GB * 4      | 118.99    | 50.25  | 17.96      | 6.06    |\n| A40 48GB * 4               | 83.79     | 33.28  | 11.91      | 3.98    |\n| L40S 48GB * 4              | 105.72    | 42.48  | 14.99      | 5.03    |\n| A100 PCIe 80GB * 4         | 117.30    | 51.54  | 22.68      | 7.38    |\n| A100 SXM 80GB * 4          | 97.70     | 45.45  | 19.60      | 6.92    |\n| H100 PCIe 80GB * 4         | 118.14    | 62.90  | **26.20**  | **9.63**|\n| M1 7‑Core GPU 8GB          | 9.72      | OOM    | OOM        | OOM     |\n| M1 Max 32‑Core GPU 64GB    | 34.49     | 18.43  | 4.09       | OOM     |\n| M2 Ultra 76-Core GPU 192GB | 76.28     | 36.25  | 12.13      | 4.71    |\n| M3 Max 40‑Core GPU 64GB    | 50.74     | 22.39  | 7.53       | OOM     |\n\nAverage 1024 tokens prompt eval speed (tokens/s) by GPUs on LLaMA 3.\n\n| GPU                        | 8B Q4_K_M | 8B F16  | 70B Q4_K_M | 70B F16 |\n|----------------------------|-----------|---------|------------|---------|\n| 3070 8GB                   | 2283.62   | OOM     | OOM        | OOM     |\n| 3080 10GB                  | 3557.02   | OOM     | OOM        | OOM     |\n| 3080 Ti 12GB               | 3556.67   | OOM     | OOM        | OOM     |\n| 4070 Ti 12GB               | 3653.07   | OOM     | OOM        | OOM     |\n| 4080 16GB                  | 5064.99   | 6758.90 | OOM        | OOM     |\n| RTX 4000 Ada 20GB          | 2310.53   | 2951.87 | OOM        | OOM     |\n| 3090 24GB                  | 3865.39   | 4239.64 | OOM        | OOM     |\n| 4090 24GB                  | 6898.71   | 9056.26 | OOM        | OOM     |\n| RTX 5000 Ada 32GB          | 4467.46   | 5835.41 | OOM        | OOM     |\n| 3090 24GB * 2              | 4004.14   | 4690.50 | 393.89     | OOM     |\n| 4090 24GB * 2              | 8545.00   | 11094.51| 905.38     | OOM     |\n| RTX A6000 48GB             | 3621.81   | 4315.18 | 466.82     | OOM     |\n| RTX 
6000 Ada 48GB          | 5560.94   | 6205.44 | 547.03     | OOM     |\n| A40 48GB                   | 3240.95   | 4043.05 | 239.92     | OOM     |\n| L40S 48GB                  | 5908.52   | 2491.65 | 649.08     | OOM     |\n| RTX 4000 Ada 20GB * 4      | 3369.24   | 4366.64 | 306.44     | OOM     |\n| A100 PCIe 80GB             | 5800.48   | 7504.24 | 726.65     | OOM     |\n| A100 SXM 80GB              | 5863.92   | 681.47  | 796.81     | OOM     |\n| H100 PCIe 80GB             | 7760.16   | 10342.63| 984.06     | OOM     |\n| 3090 24GB * 4              | 4653.93   | 5713.41 | 350.06     | OOM     |\n| 4090 24GB * 4              | 9609.29   | 12304.19| 898.17     | OOM     |\n| RTX 5000 Ada 32GB * 4      | 6530.78   | 2877.66 | 541.54     | OOM     |\n| 3090 24GB * 6              | 5153.05   | 5952.55 | 739.40     | 927.23  |\n| 4090 24GB * 8              | 9706.82   | 11818.92| **1336.26**| 1890.48 |\n| RTX A6000 48GB * 4         | 5340.10   | 6448.85 | 539.20     | 792.23  |\n| RTX 6000 Ada 48GB * 4      | 9679.55   | 12637.94| 714.93     | 1270.39 |\n| A40 48GB * 4               | 4841.98   | 5931.06 | 263.36     | 900.79  |\n| L40S 48GB * 4              | 9008.27   | 2541.61 | 634.05     | 1478.83 |\n| A100 PCIe 80GB * 4         | 8889.35   | 11670.74| 978.06     | 1733.41 |\n| A100 SXM 80GB * 4          | 7782.25   | 674.11  | 539.08     | 1834.16 |\n| H100 PCIe 80GB * 4         |**11560.23**|**15612.81**|1133.23|**2420.10**|\n| M1 7‑Core GPU 8GB          | 87.26     | OOM     | OOM        | OOM     |\n| M1 Max 32‑Core GPU 64GB    | 355.45    | 418.77  | 33.01      | OOM     |\n| M2 Ultra 76-Core GPU 192GB | 1023.89   | 1202.74 | 117.76     | 145.82  |\n| M3 Max 40‑Core GPU 64GB    | 678.04    | 751.49  | 62.88      | OOM     |\n\n## Model\n\nThanks to shawwn for LLaMA model weights (7B, 13B, 30B, 65B): [llama-dl](https://github.com/shawwn/llama-dl). Access LLaMA 2 from [Meta AI](https://ai.meta.com/llama/). 
Access LLaMA 3 from [Meta Llama 3](https://huggingface.co/collections/meta-llama/meta-llama-3-66214712577ca38149ebb2b6) on Hugging Face or my Hugging Face repos: [Xiongjie Dai](https://huggingface.co/JaaackXD).\n\n## Usage\n\n### Build\n\n- For NVIDIA GPUs, this provides BLAS acceleration using the CUDA cores of your Nvidia GPU:\n\n    ```bash\n    !make clean \u0026\u0026 LLAMA_CUBLAS=1 make -j\n    ```\n\n- For Apple Silicon, Metal is enabled by default:\n\n    ```bash\n    !make clean \u0026\u0026 make -j\n    ```\n\n### Text Completion\n\nUse argument `-ngl 0` to only use the CPU for inference and `-ngl 10000` to ensure all layers are offloaded to the GPU.\n\n```bash\n!./main -ngl 10000 -m ./models/8B-v3/ggml-model-Q4_K_M.gguf --color --temp 1.1 --repeat_penalty 1.1 -c 0 -n 1024 -e -s 0 -p \"\"\"\\\nFirst Citizen:\\n\\n\\\nBefore we proceed any further, hear me speak.\\n\\n\\\n\\n\\n\\\nAll:\\n\\n\\\nSpeak, speak.\\n\\n\\\n\\n\\n\\\nFirst Citizen:\\n\\n\\\nYou are all resolved rather to die than to famish?\\n\\n\\\n\\n\\n\\\nAll:\\n\\n\\\nResolved. resolved.\\n\\n\\\n\\n\\n\\\nFirst Citizen:\\n\\n\\\nFirst, you know Caius Marcius is chief enemy to the people.\\n\\n\\\n\\n\\n\\\nAll:\\n\\n\\\nWe know't, we know't.\\n\\n\\\n\\n\\n\\\nFirst Citizen:\\n\\n\\\nLet us kill him, and we'll have corn at our own price. Is't a verdict?\\n\\n\\\n\\n\\n\\\nAll:\\n\\n\\\nNo more talking on't; let it be done: away, away!\\n\\n\\\n\\n\\n\\\nSecond Citizen:\\n\\n\\\nOne word, good citizens.\\n\\n\\\n\\n\\n\\\nFirst Citizen:\\n\\n\\\nWe are accounted poor citizens, the patricians good. 
What authority surfeits on would relieve us: if they would yield us but the superfluity, \\\nwhile it were wholesome, we might guess they relieved us humanely; but they think we are too dear: the leanness that afflicts us, the object of \\\nour misery, is as an inventory to particularise their abundance; our sufferance is a gain to them Let us revenge this with our pikes, \\\nere we become rakes: for the gods know I speak this in hunger for bread, not in thirst for revenge.\\n\\n\\\n\\n\\n\\\n\"\"\"\n```\n\n**Note:** For Apple Silicon, check the `recommendedMaxWorkingSetSize` in the result to see how much memory can be allocated on the GPU and maintain its performance. Only **70%** of unified memory can be allocated to the GPU on 32GB M1 Max right now, and we expect around **78%** of usable memory for the GPU on larger memory. (Source: https://developer.apple.com/videos/play/tech-talks/10580/?time=346) To utilize the whole memory, use `-ngl 0` to only use the CPU for inference. (Thanks to: https://github.com/ggerganov/llama.cpp/pull/1826)\n\n### Chat template for LLaMA 3 🦙🦙🦙\n\n```bash\n!./main -ngl 10000 -m ./models/8B-v3-instruct/ggml-model-Q4_K_M.gguf --color -c 0 -n -2 -e -s 0 --mirostat 2 -i --no-display-prompt --keep -1 \\\n-r '\u003c|eot_id|\u003e' -p '\u003c|begin_of_text|\u003e\u003c|start_header_id|\u003esystem\u003c|end_header_id|\u003e\\n\\nYou are a helpful assistant.\u003c|eot_id|\u003e\u003c|start_header_id|\u003euser\u003c|end_header_id|\u003e\\n\\nHi!\u003c|eot_id|\u003e\u003c|start_header_id|\u003eassistant\u003c|end_header_id|\u003e\\n\\n' \\\n--in-prefix '\u003c|start_header_id|\u003euser\u003c|end_header_id|\u003e\\n\\n' --in-suffix '\u003c|eot_id|\u003e\u003c|start_header_id|\u003eassistant\u003c|end_header_id|\u003e\\n\\n'\n```\n\n### Benchmark\n\n```bash\n!./llama-bench -p 512,1024,4096,8192 -n 512,1024,4096,8192 -m ./models/8B-v3/ggml-model-Q4_K_M.gguf\n```\n\n## Total VRAM Requirements\n\n| Model | Quantized size (Q4_K_M) | Original size 
(f16) |\n|------:|--------------------:|-----------------------:|\n|    8B |             4.58 GB |               14.96 GB |\n|   70B |            39.59 GB |              131.42 GB |\n\nYou can estimate the VRAM requirement with this tool: [LLM RAM Calculator](https://llm-calc.rayfernando.ai/)\n\n## Perplexity table on LLaMA 3 70B\n\nLower perplexity is better. (Credit to: [dranger003](https://github.com/ggerganov/llama.cpp/pull/6745#issuecomment-2093892514))\n\n| Quantization | Size (GiB) | Perplexity (wiki.test) | Delta (FP16)|\n|--------------|------------|------------------------|-------------|\n| IQ1_S        | 14.29      | 9.8655 +/- 0.0625      | 248.51%     |\n| IQ1_M        | 15.60      | 8.5193 +/- 0.0530      | 201.94%     |\n| IQ2_XXS      | 17.79      | 6.6705 +/- 0.0405      | 135.64%     |\n| IQ2_XS       | 19.69      | 5.7486 +/- 0.0345      | 103.07%     |\n| IQ2_S        | 20.71      | 5.5215 +/- 0.0318      | 95.05%      |\n| Q2_K_S       | 22.79      | 5.4334 +/- 0.0325      | 91.94%      |\n| IQ2_M        | 22.46      | 4.8959 +/- 0.0276      | 72.35%      |\n| Q2_K         | 24.56      | 4.7763 +/- 0.0274      | 68.73%      |\n| IQ3_XXS      | 25.58      | 3.9671 +/- 0.0211      | 40.14%      |\n| IQ3_XS       | 27.29      | 3.7210 +/- 0.0191      | 31.45%      |\n| Q3_K_S       | 28.79      | 3.6502 +/- 0.0192      | 28.95%      |\n| IQ3_S        | 28.79      | 3.4698 +/- 0.0174      | 22.57%      |\n| IQ3_M        | 29.74      | 3.4402 +/- 0.0171      | 21.53%      |\n| Q3_K_M       | 31.91      | 3.3617 +/- 0.0172      | 18.75%      |\n| Q3_K_L       | 34.59      | 3.3016 +/- 0.0168      | 16.63%      |\n| IQ4_XS       | 35.30      | 3.0310 +/- 0.0149      | 7.07%       |\n| IQ4_NL       | 37.30      | 3.0261 +/- 0.0149      | 6.90%       |\n| Q4_K_S       | 37.58      | 3.0050 +/- 0.0148      | 6.15%       |\n| Q4_K_M       | 39.60      | 2.9674 +/- 0.0146      | 4.83%       |\n| Q5_K_S       | 45.32      | 2.8843 +/- 0.0141      | 1.89%   
    |\n| Q5_K_M       | 46.52      | 2.8656 +/- 0.0139      | 1.23%       |\n| Q6_K         | 53.91      | 2.8441 +/- 0.0138      | 0.47%       |\n| Q8_0         | 69.83      | 2.8316 +/- 0.0138      | 0.03%       |\n| F16          | 131.43     | 2.8308 +/- 0.0138      | 0.00%       |\n\n## Benchmarks\n\n`TG` means \"text-generation,\" and `PP` means \"prompt processing.\" # for total generated/processing tokens. `OOM` means out of memory. Average speed in tokens/s.\n\n### LLaMA 3 🦙🦙🦙:\n\n#### NVIDIA Gaming GPUs (OS: Ubuntu 22.04.2 LTS, pytorch:2.2.0, py: 3.10, cuda: 12.1.1 on RunPod) (snapshots in May 2024)\n\n| GPU             | Model      | tg 512  | tg 1024  | tg 4096  | tg 8192  | pp 512  | pp 1024  | pp 4096  | pp 8192  |\n|-----------------|------------|---------|----------|----------|----------|---------|----------|----------|----------|\n| 3070 8GB        | 8B Q4_K_M  | 72.79   | 70.94    | 67.01    | 61.64    | 2402.51 | 2283.62  | 1826.59  | 1419.97  |\n| 3080 10GB       | 8B Q4_K_M  | 109.57  | 106.40   | 98.67    | 89.90    | 3728.86 | 3557.02  | 2852.06  | 2232.21  |\n| 3080 Ti 12GB    | 8B Q4_K_M  | 110.60  | 106.71   | 98.34    | 88.63    | 3690.30 | 3556.67  | 2947.11  | 2381.52  |\n| 4070 Ti 12GB    | 8B Q4_K_M  | 83.50   | 82.21    | 78.59    | 73.46    | 3936.29 | 3653.07  | 2729.71  | 2019.71  |\n| 4080 16GB       | 8B Q4_K_M  | 108.15  | 106.22   | 100.44   | 93.71    | 5389.74 | 5064.99  | 3790.96  | 2882.03  |\n|                 | 8B F16     | 40.58   | 40.29    | 39.44    | OOM      | 7246.97 | 6758.90  | 4720.22  | OOM      |\n| 3090 24GB       | 8B Q4_K_M  | 115.42  | 111.74   | 97.31    | 87.49    | 4030.40 | 3865.39  | 3169.91  | 2527.40  |\n|                 | 8B F16     | 47.40   | 46.51    | 44.79    | 42.62    | 4444.65 | 4239.64  | 3410.47  | 2667.14  |\n| 4090 24GB       | 8B Q4_K_M  | 130.58  | 127.74   | 119.44   | 110.66   | 7138.99 | 6898.71  | 5265.68  | 4039.68  |\n|                 | 8B F16     | 54.84   | 54.34    | 52.63  
  | 50.88    | 9382.00 | 9056.26  | 6531.36  | 4744.18  |\n| 3090 24GB * 2   | 8B Q4_K_M  | 111.67  | 108.07   | 99.60    | 90.77    | 3336.37 | 4004.14  | 4013.34  | 3433.59  |\n|                 | 8B F16     | 47.72   | 47.15    | 45.56    | 43.61    | 4122.66 | 4690.50  | 4788.60  | 3851.37  |\n|                 | 70B Q4_K_M | 16.57   | 16.29    | 15.36    | 14.34    | 357.32  | 393.89   | 379.52   | 338.82   |\n| 4090 24GB * 2   | 8B Q4_K_M  | 124.65  | 122.56   | 114.32   | 106.18   | 7003.51 | 8545.00  | 8422.04  | 6895.68  |\n|                 | 8B F16     | 53.64   | 53.27    | 51.64    | 49.83    | 9177.92 | 11094.51 | 10329.29 | 8067.29  |\n|                 | 70B Q4_K_M | 19.22   | 19.06    | 18.54    | 17.92    | 839.43  | 905.38   | 846.38   | 723.24   |\n| 3090 24GB * 4   | 8B Q4_K_M  | 108.66  | 104.94   | 97.09    | 88.35    | 3742.66 | 4653.93  | 5826.91  | 4913.40  |\n|                 | 8B F16     | 47.07   | 46.40    | 44.76    | 42.81    | 4608.40 | 5713.41  | 6596.17  | 5361.52  |\n|                 | 70B Q4_K_M | 17.07   | 16.89    | 16.24    | 15.39    | 300.79  | 350.06   | 367.75   | 331.37   |\n| 4090 24GB * 4   | 8B Q4_K_M  | 120.32  | 117.61   | 110.52   | 103.13   | 6748.96 | 9609.29  | 12491.10 | 10993.75 |\n|                 | 8B F16     | 53.10   | 52.69    | 51.00    | 49.21    | 8750.57 | 12304.19 | 15143.84 | 12919.74 |\n|                 | 70B Q4_K_M | 19.80   | 18.83    | 18.35    | 17.66    | 834.74  | 898.17   | 839.97   | 718.01   |\n| 3090 24GB * 6   | 8B Q4_K_M  | 104.17  | 101.07   | 94.06    | 85.93    | 3359.99 | 5153.05  | 7690.65  | 7084.44  |\n|                 | 8B F16     | 46.23   | 45.55    | 43.99    | 42.15    | 3875.97 | 5952.55  | 9437.91  | 8780.49  |\n|                 | 70B Q4_K_M | 17.09   | 16.93    | 16.32    | 15.45    | 456.95  | 739.40   | 786.79   | 695.44   |\n|                 | 70B F16    | 5.85    | 5.82     | 5.76     | 5.53     | 579.00  | 927.23   | 998.79   | 813.99   |\n| 4090 24GB * 8   | 
8B Q4_K_M  | 118.09  | 116.13   | 108.37   | 100.95   | 6172.06 | 9706.82  | 15089.45 | 13802.08 |\n|                 | 8B F16     | 52.51   | 52.12    | 50.39    | 48.72    | 7889.26 | 11818.92 | 16462.18 | 14300.98 |\n|                 | 70B Q4_K_M | 18.94   | 18.76    | 18.23    | 17.57    | 812.95  | 1336.26  | 1488.36  | 1320.36  |\n|                 | 70B F16    | 6.47    | 6.45     | 6.39     | 6.31     | 1183.87 | 1890.48  | 2311.43  | 1995.85  |\n\n#### NVIDIA Professional GPUs (OS: Ubuntu 22.04.2 LTS, pytorch:2.2.0, py: 3.10, cuda: 12.1.1 on RunPod) (snapshots in May 2024)\n\n| GPU                   | Model      | tg 512  | tg 1024 | tg 4096 | tg 8192 | pp 512   | pp 1024  | pp 4096  | pp 8192  |\n|-----------------------|------------|---------|---------|---------|---------|----------|----------|----------|----------|\n| RTX 4000 Ada 20GB     | 8B Q4_K_M  | 59.15   | 58.59   | 55.94   | 52.39   | 2451.93  | 2310.53  | 1798.01  | 1337.15  |\n|                       | 8B F16     | 20.92   | 20.85   | 20.50   | 20.01   | 3121.67  | 2951.87  | 2200.58  | 1557.00  |\n| RTX 5000 Ada 32GB     | 8B Q4_K_M  | 91.39   | 89.87   | 85.01   | 80.00   | 4761.12  | 4467.46  | 3272.94  | 2422.33  |\n|                       | 8B F16     | 32.84   | 32.67   | 32.04   | 31.27   | 6160.57  | 5835.41  | 4008.30  | 2808.89  |\n| RTX A6000 48GB        | 8B Q4_K_M  | 105.39  | 102.22  | 94.82   | 86.73   | 3780.55  | 3621.81  | 2917.23  | 2292.61  |\n|                       | 8B F16     | 40.71   | 40.25   | 39.14   | 37.73   | 4511.02  | 4315.18  | 3365.79  | 2566.46  |\n|                       | 70B Q4_K_M | 14.71   | 14.58   | 14.09   | 13.42   | 482.19   | 466.82   | 404.61   | 340.73   |\n| RTX 6000 Ada 48GB     | 8B Q4_K_M  | 133.44  | 130.99  | 120.74  | 111.57  | 5791.74  | 5560.94  | 4495.19  | 3542.57  |\n|                       | 8B F16     | 52.32   | 51.97   | 50.21   | 48.79   | 6663.13  | 6205.44  | 4969.46  | 3915.81  |\n|                       | 70B Q4_K_M | 
18.52   | 18.36   | 17.80   | 16.97   | 565.98   | 547.03   | 481.59   | 419.76   |\n| A40 48GB              | 8B Q4_K_M  | 91.27   | 88.95   | 83.10   | 76.45   | 3324.98  | 3240.95  | 2586.50  | 2013.34  |\n|                       | 8B F16     | 34.26   | 33.95   | 33.06   | 31.93   | 4203.75  | 4043.05  | 3069.98  | 2295.02  |\n|                       | 70B Q4_K_M | 11.60   | 12.08   | 11.68   | 11.26   | 209.38   | 239.92   | 268.89   | 291.13   |\n| L40S 48GB             | 8B Q4_K_M  | 115.55  | 113.60  | 105.50  | 97.98   | 6035.24  | 5908.52  | 4335.18  | 3192.70  |\n|                       | 8B F16     | 43.69   | 43.42   | 42.22   | 41.05   | 2253.93  | 2491.65  | 2887.70  | 3312.16  |\n|                       | 70B Q4_K_M | 15.46   | 15.31   | 14.92   | 14.45   | 673.63   | 649.08   | 542.29   | 446.48   |\n| RTX 4000 Ada 20GB * 4 | 8B Q4_K_M  | 56.64   | 56.14   | 53.58   | 50.19   | 2413.07  | 3369.24  | 4404.45  | 3733.15  |\n|                       | 8B F16     | 20.65   | 20.58   | 20.24   | 19.74   | 3220.21  | 4366.64  | 5366.39  | 4323.70  |\n|                       | 70B Q4_K_M | 7.36    | 7.33    | 7.12    | 6.84    | 282.28   | 306.44   | 290.70   | 243.45   |\n| A100 PCIe 80GB        | 8B Q4_K_M  | 140.62  | 138.31  | 127.22  | 117.60  | 5981.04  | 5800.48  | 4959.84  | 4083.37  |\n|                       | 8B F16     | 54.84   | 54.56   | 53.02   | 51.24   | 7741.34  | 7504.24  | 6137.54  | 4849.11  |\n|                       | 70B Q4_K_M | 22.31   | 22.11   | 20.93   | 19.53   | 744.12   | 726.65   | 653.20   | 573.95   |\n| A100 SXM 80GB         | 8B Q4_K_M  | 135.04  | 133.38  | 125.09  | 115.92  | 5947.64  | 5863.92  | 5121.60  | 4137.08  |\n|                       | 8B F16     | 53.49   | 53.18   | 52.03   | 50.52   | 603.76   | 681.47   | 866.13   | 1323.07  |\n|                       | 70B Q4_K_M | 24.61   | 24.33   | 22.91   | 21.32   | 817.58   | 796.81   | 714.07   | 625.66   |\n| H100 PCIe 80GB        | 8B Q4_K_M  | 145.55  | 
144.49  | 136.06  | 126.83  | 8125.45  | 7760.16  | 6423.31  | 5185.03  |\n|                       | 8B F16     | 68.03   | 67.79   | 65.97   | 63.55   | 10815.51 | 10342.63 | 8106.53  | 6191.45  |\n|                       | 70B Q4_K_M | 25.03   | 25.01   | 23.82   | 22.39   | 1012.73  | 984.06   | 863.37   | 741.52   |\n| RTX 5000 Ada 32GB * 4 | 8B Q4_K_M  | 84.07   | 82.73   | 78.45   | 74.11   | 4671.34  | 6530.78  | 8004.94  | 6790.82  |\n|                       | 8B F16     | 32.10   | 31.94   | 31.32   | 30.58   | 2427.96  | 2877.66  | 3836.89  | 5235.00  |\n|                       | 70B Q4_K_M | 11.51   | 11.45   | 11.24   | 10.94   | 502.37   | 541.54   | 504.23   | 424.29   |\n| RTX A6000 48GB * 4    | 8B Q4_K_M  | 96.48   | 93.73   | 87.72   | 80.88   | 3712.99  | 5340.10  | 7126.45  | 6438.82  |\n|                       | 8B F16     | 39.34   | 38.87   | 37.81   | 36.51   | 4508.60  | 6448.85  | 8327.16  | 7298.18  |\n|                       | 70B Q4_K_M | 14.44   | 14.32   | 13.91   | 13.32   | 496.08   | 539.20   | 511.22   | 434.31   |\n|                       | 70B F16    | 4.76    | 4.74    | 4.70    | 4.63    | 510.31   | 792.23   | 751.37   | 748.06   |\n| RTX 6000 Ada 48GB * 4 | 8B Q4_K_M  | 121.21  | 118.99  | 110.65  | 103.18  | 6640.86  | 9679.55  | 11734.85 | 10278.14 |\n|                       | 8B F16     | 50.61   | 50.25   | 48.69   | 47.18   | 8953.30  | 12637.94 | 13971.34 | 11702.36 |\n|                       | 70B Q4_K_M | 18.13   | 17.96   | 17.49   | 16.89   | 656.61   | 714.93   | 697.10   | 612.54   |\n|                       | 70B F16    | 6.08    | 6.06    | 6.01    | 5.94    | 864.12   | 1270.39  | 1363.75  | 1182.28  |\n| A40 48GB * 4          | 8B Q4_K_M  | 85.91   | 83.79   | 78.56   | 72.70   | 3321.27  | 4841.98  | 6442.38  | 5742.84  |\n|                       | 8B F16     | 33.60   | 33.28   | 32.42   | 31.38   | 4144.88  | 5931.06  | 7544.92  | 6516.60  |\n|                       | 70B Q4_K_M | 11.99   | 11.91   | 11.60 
  | 11.17   | 236.86   | 263.36   | 300.57   | 312.31   |\n|                       | 70B F16    | 3.99    | 3.98    | 3.95    | 3.90    | 610.51   | 900.79   | 893.28   | 735.16   |\n| L40S 48GB * 4         | 8B Q4_K_M  | 107.53  | 105.72  | 98.59   | 92.20   | 6125.69  | 9008.27  | 10566.97 | 9017.90  |\n|                       | 8B F16     | 42.70   | 42.48   | 41.33   | 40.19   | 2211.45  | 2541.61  | 3093.33  | 4336.81  |\n|                       | 70B Q4_K_M | 15.12   | 14.99   | 14.63   | 14.17   | 591.05   | 634.05   | 605.66   | 541.67   |\n|                       | 70B F16    | 5.05    | 5.03    | 4.99    | 4.94    | 1042.13  | 1478.83  | 1427.77  | 1150.63  |\n| A100 PCIe 80GB * 4    | 8B Q4_K_M  | 119.28  | 117.30  | 110.75  | 103.87  | 6076.58  | 8889.35  | 12724.54 | 11803.39 |\n|                       | 8B F16     | 51.63   | 51.54   | 50.20   | 48.73   | 8088.79  | 11670.74 | 16025.11 | 14269.17 |\n|                       | 70B Q4_K_M | 22.91   | 22.68   | 21.41   | 19.96   | 771.28   | 978.06   | 1138.60  | 1043.15  |\n|                       | 70B F16    | 7.40    | 7.38    | 7.23    | 7.06    | 1172.14  | 1733.41  | 1846.36  | 1592.37  |\n| A100 SXM 80GB * 4     | 8B Q4_K_M  | 99.73   | 97.70   | 92.09   | 86.27   | 4850.88  | 7782.25  | 12242.53 | 11535.66 |\n|                       | 8B F16     | 45.53   | 45.45   | 44.33   | 43.09   | 626.75   | 674.11   | 1003.37  | 1612.05  |\n|                       | 70B Q4_K_M | 19.87   | 19.60   | 18.48   | 17.19   | 468.86   | 539.08   | 712.08   | 802.23   |\n|                       | 70B F16    | 6.95    | 6.92    | 6.77    | 6.58    | 1233.31  | 1834.16  | 1972.48  | 1699.56  |\n| H100 PCIe 80GB * 4    | 8B Q4_K_M  | 123.08  | 118.14  | 113.12  | 110.34  | 8054.58  | 11560.23 | 16128.27 | 14682.97 |\n|                       | 8B F16     | 64.00   | 62.90   | 61.45   | 59.72   | 11107.40 | 15612.81 | 20561.03 | 17762.96 |\n|                       | 70B Q4_K_M | 26.40   | 26.20   | 24.60   | 23.68   | 
1048.29  | 1133.23  | 1088.99  | 950.92   |\n|                       | 70B F16    | 9.67    | 9.63    | 9.46    | 9.23    | 1681.45  | 2420.10  | 2437.53  | 2031.77  |\n\n#### Apple Silicon (snapshots in May 2024)\n\n| GPU                        | Model     | tg 512 | tg 1024 | tg 4096 | tg 8192 | pp 512  | pp 1024 | pp 4096 | pp 8192 |\n|----------------------------|-----------|--------|---------|---------|---------|---------|---------|---------|---------|\n| M1 7‑Core GPU 8GB          | 8B Q4_K_M | 10.20  | 9.72    | 11.77   | OOM     | 94.48   | 87.26   | 96.53   | OOM     |\n| M1 Max 32‑Core GPU 64GB    | 8B Q4_K_M | 35.73  | 34.49   | 31.18   | 26.84   | 408.23  | 355.45  | 329.84  | 302.92  |\n|                            | 8B F16    | 18.75  | 18.43   | 16.33   | 15.03   | 517.34  | 418.77  | 374.09  | 351.46  |\n|                            | 70B Q4_K_M| 4.34   | 4.09    | 4.09    | 3.71    | 34.96   | 33.01   | 32.64   | 30.97   |\n| M2 Ultra 76-Core GPU 192GB | 8B Q4_K_M | 78.81  | 76.28   | 64.58   | 54.13   | 994.04  | 1023.89 | 979.47  | 913.55  |\n|                            | 8B F16    | 36.90  | 36.25   | 33.67   | 30.68   | 1175.40 | 1202.74 | 1194.21 | 1103.44 |\n|                            | 70B Q4_K_M| 12.48  | 12.13   | 10.75   | 9.34    | 118.79  | 117.76  | 109.53  | 108.57  |\n|                            | 70B F16   | 4.76   | 4.71    | 4.48    | 4.23    | 147.58  | 145.82  | 133.75  | 135.15  |\n| M3 Max 40‑Core GPU 64GB    | 8B Q4_K_M | 48.97  | 50.74   | 44.21   | 36.12   | 693.32  | 678.04  | 573.09  | 505.32  |\n|                            | 8B F16    | 22.04  | 22.39   | 20.72   | 18.74   | 769.84  | 751.49  | 609.97  | 515.15  |\n|                            | 70B Q4_K_M| 7.65   | 7.53    | 6.58    | 5.60    | 70.19   | 62.88   | 64.90   | 61.96   |\n\n## Conclusion\n\nSame performance under the same size and quantization models. 
Using multiple NVIDIA GPUs may slightly reduce text-generation speed but can still boost prompt processing speed.\n\nBuy NVIDIA gaming GPUs to save money. Buy professional GPUs for your business. Buy a Mac if you want a quiet, energy-efficient computer that sits on your desk, needs no maintenance, and is more fun. 😇\n\nIf you find this information helpful, please give me a star. ⭐️ Feel free to contact me if you have any advice. Thank you. 🤗\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FXiongjieDai%2FGPU-Benchmarks-on-LLM-Inference","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FXiongjieDai%2FGPU-Benchmarks-on-LLM-Inference","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FXiongjieDai%2FGPU-Benchmarks-on-LLM-Inference/lists"}