https://github.com/dsw7/local-llm-benchmark
Miscellaneous scripts I use for benchmarking my locally hosted LLMs
- Host: GitHub
- URL: https://github.com/dsw7/local-llm-benchmark
- Owner: dsw7
- License: mit
- Created: 2025-12-16T10:35:51.000Z (6 months ago)
- Default Branch: master
- Last Pushed: 2026-02-01T11:07:09.000Z (5 months ago)
- Last Synced: 2026-02-01T20:53:54.384Z (5 months ago)
- Topics: llm, ollama, statistics
- Language: Python
- Homepage:
- Size: 45.9 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
# Local LLM benchmarking
Miscellaneous utilities for benchmarking locally hosted LLMs (e.g. via
[Ollama](https://ollama.com/)) across various platform/hardware permutations.
**Note that I am not interested in benchmarking the models themselves. I am
interested in benchmarking model inference times on my particular hardware.**
Many projects exist for benchmarking models themselves, such as
[SuperGLUE](https://super.gluebenchmark.com/).
I use this program to benchmark my infrastructure for the following cases:
- When running [FuncGraft](https://github.com/dsw7/FuncGraft) in [local
mode](https://github.com/dsw7/FuncGraft?tab=readme-ov-file#toggling-between-llm-providers)
- When running [GPTifier](https://github.com/dsw7/GPTifier) commands via the Ollama stream
## Table of Contents
- [About](#about)
- [Setup](#setup)
- [Benchmarking LLM performance](#benchmarking-llm-performance)
    - [Step 1 - Run the benchmarks](#step-1---run-the-benchmarks)
    - [Step 2 - Generate Gaussian distributions + boxplots for inference times](#step-2---generate-gaussian-distributions--boxplots-for-inference-times)
    - [Step 3 - Generate a LaTeX report for the measurements](#step-3---generate-a-latex-report-for-the-measurements)
## About
This program runs a dummy prompt against a specified LLM several times on each
of several machines. It records the execution time of each run and computes
basic statistics from those times. This gives me a rough estimate of how
variables such as GPU model and available VRAM affect the overall performance
of my on-prem LLMs.
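The measurement loop described above can be sketched roughly as follows. This is a minimal illustration and not the code in this repository: the function names `ollama_generate` and `time_rounds` are hypothetical, and the sketch assumes the server exposes Ollama's `/api/generate` REST endpoint with streaming disabled.

```python
import json
import time
import urllib.request


def ollama_generate(host: str, model: str, prompt: str) -> str:
    """Send one non-streaming request to Ollama's /api/generate endpoint."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        f"http://{host}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


def time_rounds(send, rounds: int) -> list[float]:
    """Call `send` `rounds` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(rounds):
        start = time.perf_counter()
        send()
        durations.append(time.perf_counter() - start)
    return durations


# Example (needs a reachable Ollama server, so it is left commented out):
# times = time_rounds(
#     lambda: ollama_generate("localhost:11434", "gemma3:latest", "Hello"),
#     rounds=5,
# )
```

Passing the request as a callable keeps the timing logic independent of the transport, so the same loop can be exercised without a live server.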
## Setup
Copy the example TOML file:
```bash
cp configs_example.toml configs.toml
```
The `configs.toml` file is the "production" file and is excluded via
`.gitignore`. Edit the file to match your setup (e.g. set the dummy prompt and
the IP addresses of your hosts).
## Benchmarking LLM performance
### Step 1 - Run the benchmarks
Set up a Python virtual environment and run the bash script:
```bash
./benchmark
```
Then enter `1` when prompted. For each `host`, the program will collect `rounds`
inference times (the count is set in `configs.toml`) for `prompt` against
`model`. When complete, the program will print something like:
```
All values are provided in seconds
┌──────────────────┬───────────────┬──────────┬─────────┬──────────┬──────────┬──────────┬───────────────┐
│ Host │ Model │ Mean │ SD │ Median │ Min │ Max │ Sample size │
├──────────────────┼───────────────┼──────────┼─────────┼──────────┼──────────┼──────────┼───────────────┤
│ localhost:11434 │ gemma3:latest │ 2.18015 │ 0.16028 │ 2.10775 │ 2.09112 │ 2.46496 │ 5 │
│ 10.0.0.115:11434 │ gemma3:latest │ 18.0551 │ 0.62221 │ 17.9943 │ 17.3745 │ 19.0215 │ 5 │
└──────────────────┴───────────────┴──────────┴─────────┴──────────┴──────────┴──────────┴───────────────┘
```
If these summary statistics are sufficient, you can stop here.
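The columns in the table above can be reproduced from a list of timings with Python's standard `statistics` module. The following is a sketch only: `summarize` is a hypothetical helper, the input values are illustrative, and the use of the sample standard deviation (n - 1 denominator) is an assumption about how the SD column is computed.

```python
import statistics


def summarize(times: list[float]) -> dict[str, float]:
    """Return the summary statistics reported for one host/model pair."""
    return {
        "mean": statistics.mean(times),
        "sd": statistics.stdev(times),  # sample SD (n - 1); an assumption
        "median": statistics.median(times),
        "min": min(times),
        "max": max(times),
        "n": len(times),
    }


# Illustrative timings in seconds, not measured data.
summary = summarize([1.0, 2.0, 3.0, 4.0, 5.0])
print(summary)
```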
### Step 2 - Generate Gaussian distributions + boxplots for inference times
Set up a Python virtual environment as before and run the bash script:
```bash
./benchmark
```
Then enter `2` when prompted. The program will generate a set of
Gaussian distributions for the inference times obtained from each machine. In
one example run of 50 trials, the mean inference time is around 2.15 seconds.
One value appears to be more than 3 standard deviations away from the mean and
could be interpreted as an outlier (perhaps the result of a spike in GPU
demand). The program will also generate boxplots for the inference times across
servers in the network, which can be useful for comparing an individual
server's performance against the rest of the network.
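The "more than 3 standard deviations" check can be expressed in a few lines. The sketch below fits a normal distribution to a synthetic sample with `statistics.NormalDist.from_samples` and flags values beyond three standard deviations from the mean; it also computes the quartiles a boxplot would draw. The data is made up for illustration (49 values near 2.15 s plus one slow run), and this is not necessarily how the repository detects or plots outliers.

```python
import statistics

# Synthetic sample: 49 timings between 2.10 s and 2.20 s, plus one slow run.
times = [2.10 + 0.002 * i for i in range(49)] + [3.5]

# Fit a Gaussian using the sample mean and sample standard deviation.
dist = statistics.NormalDist.from_samples(times)

# Flag values more than 3 standard deviations away from the mean.
outliers = [t for t in times if abs(t - dist.mean) > 3 * dist.stdev]

# Quartile cut points (Q1, median, Q3), the basis of a boxplot.
q1, q2, q3 = statistics.quantiles(times, n=4)

print(f"mean={dist.mean:.3f} sd={dist.stdev:.3f} outliers={outliers}")
print(f"Q1={q1:.3f} median={q2:.3f} Q3={q3:.3f}")
```

Note that with very small samples no single value can lie more than 3 sample standard deviations from the mean, so this rule only becomes meaningful as `rounds` grows.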
### Step 3 - Generate a LaTeX report for the measurements
As before, run the bash script:
```bash
./benchmark
```
Then enter `3` when prompted. The program will generate a comprehensive report
of all the statistics gathered during the benchmarking process. Note that this
requires steps 1 and 2 to have been completed first.
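To give a sense of what turning the statistics into LaTeX might involve, here is a minimal sketch that renders summary rows as a `tabular` environment. The helper `to_latex_table` and the row dictionary keys are hypothetical, the actual report produced by the program is likely richer, and the sketch does not escape LaTeX special characters (a model name containing `_` would need that).

```python
def to_latex_table(rows: list[dict]) -> str:
    """Render summary rows as a LaTeX tabular (no escaping of special characters)."""
    header = ["Host", "Model", "Mean", "SD", "Median", "Min", "Max", "Sample size"]
    lines = [
        r"\begin{tabular}{llrrrrrr}",
        r"\hline",
        " & ".join(header) + r" \\",
        r"\hline",
    ]
    for row in rows:
        numbers = [f"{row[key]:.5f}" for key in ("mean", "sd", "median", "min", "max")]
        cells = [row["host"], row["model"], *numbers, str(row["n"])]
        lines.append(" & ".join(cells) + r" \\")
    lines += [r"\hline", r"\end{tabular}"]
    return "\n".join(lines)


# Row taken from the example output shown in step 1.
table = to_latex_table([
    {"host": "localhost:11434", "model": "gemma3:latest", "mean": 2.18015,
     "sd": 0.16028, "median": 2.10775, "min": 2.09112, "max": 2.46496, "n": 5},
])
print(table)
```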