https://github.com/visorcraft/strix-halo-llm-perf
- Host: GitHub
- URL: https://github.com/visorcraft/strix-halo-llm-perf
- Owner: visorcraft
- License: mit
- Created: 2026-02-13T23:20:27.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2026-03-03T16:36:24.000Z (4 months ago)
- Last Synced: 2026-03-03T20:10:03.294Z (4 months ago)
- Language: Shell
- Size: 78.1 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# 🚀 Strix Halo LLM Performance
Benchmarks and reproducible setup notes for local and distributed LLM inference on **Ryzen AI Max+ 395 (Strix Halo)** using `llama.cpp`.
> **Test systems:**
> - **Evo** = GMKtec EVO-X2 (Ryzen AI Max+ 395)
> - **Bee** = Beelink GTR9 Pro (Ryzen AI Max+ 395)
> - Both hosts have **128 GB unified LPDDR5X** and are linked by a direct **USB4/Thunderbolt cable (~9.4 Gbps measured)** for distributed RPC inference.
## TL;DR
- **Qwen3 30B-A3B MoE Q4_K_M:** **86.1 t/s** token generation (single host, Vulkan)
- **MiniMax M2.5 Q3_K_M (228.7B):** **32.8 t/s** token generation (single host, Vulkan)
- **Qwen3-Coder-Next 80B-A3B Q4_K_M:** **42.7 t/s** token generation
- **GPT-OSS 120B Q4_K_M:** **53.4 t/s** generation in llama-server tests
- **Nemotron-3 Nano 30B-A3B MXFP4:** **61.5 t/s** generation in llama-server tests
---
## 1) Hardware
| Component | Evo | Bee |
|---|---|---|
| System | GMKtec EVO-X2 | Beelink GTR9 Pro |
| SoC | Ryzen AI Max+ 395 | Ryzen AI Max+ 395 |
| CPU | 16C/32T Zen 5 | 16C/32T Zen 5 |
| iGPU | Radeon 8060S (gfx1151, 40 CU) | Radeon 8060S (gfx1151, 40 CU) |
| Memory | 128 GB unified LPDDR5X | 128 GB unified LPDDR5X |
**Distributed link:** direct USB4/Thunderbolt between Evo and Bee, ~9.4 Gbps effective in testing.
## 2) Software Stack
- **OS:** Fedora 43 (both hosts)
- **Kernel:** 6.18.x class
- **Primary backend:** ROCm 7.0 nightlies (via kyuz0 distrobox container)
- **Secondary backends:** Vulkan RADV (Mesa), ROCm 6.4.x (host), ROCm 7.2 (container)
- **Inference engine:** `llama.cpp`
- **Power profile:** 85W/120W tested; 120W usually wins for 7B+ models
## 3) Quick Start (Vulkan container)
```bash
podman pull docker.io/kyuz0/amd-strix-halo-toolboxes:vulkan-radv
distrobox create --name llama-vulkan-radv \
  --image docker.io/kyuz0/amd-strix-halo-toolboxes:vulkan-radv --yes
podman start llama-vulkan-radv
# benchmark example
podman exec llama-vulkan-radv bash -lc \
  "llama-bench -m ~/models/Qwen3-Coder-Next-Q4_K_M.gguf -ngl 99 -p 512 -n 128"
```
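To repeat the benchmark above over several models, a small dry-run helper can print one command per GGUF file. This is a minimal sketch: the function names `bench_cmd` and `bench_sweep` and the `CONTAINER` variable are hypothetical, and only the `llama-bench` flags already shown above are used.

```bash
#!/usr/bin/env bash
# Sketch: print (not run) one llama-bench command per model file.
# CONTAINER defaults to the distrobox name used in the Quick Start.
CONTAINER="${CONTAINER:-llama-vulkan-radv}"

bench_cmd() {
  # $1 = path to a GGUF file as seen inside the container (no spaces assumed)
  printf 'podman exec %s bash -lc "llama-bench -m %s -ngl 99 -p 512 -n 128"\n' \
    "$CONTAINER" "$1"
}

bench_sweep() {
  # Print one command per argument.
  local model
  for model in "$@"; do
    bench_cmd "$model"
  done
}
```

Review the printed commands first, then pipe them to `bash` to execute, for example `bench_sweep ~/models/*.gguf | bash`.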
## 4) Single-Host Benchmarks (best results)
All rows below are working results only, using the best observed configuration for each model.
| Model | Size | Params | pp512 (t/s) | tg128 (t/s) |
|---|---:|---:|---:|---:|
| TinyLlama 1.1B Q4_K_M | 636 MiB | 1.10B | 6,513 | 249 |
| Llama 3.2 3B Q8_0 | 3.18 GiB | 3.21B | 2,248 | 60.9 |
| Llama 2 7B Q4_K_M | 3.80 GiB | 6.74B | 1,074 | 47.3 |
| Qwen2.5-Coder 7B Q6_K | 5.82 GiB | 7.62B | 1,089 | 36.7 |
| Qwen2.5 14B Q4_K_M | 8.37 GiB | 14.77B | 600 | 24.5 |
| Qwen3 30B-A3B MoE Q4_K_M | 17.28 GiB | 30.53B | 1,142 | **86.1** |
| Qwen2.5 32B Q4_K_M | 18.48 GiB | 32.76B | 242 | 11.3 |
| Llama 3.1 70B Q4_K_M | 39.59 GiB | 70.55B | 81.6 | 5.1 |
| Qwen3-Coder-Next 80B-A3B Q4_K_M | 45.17 GiB | 79.67B | 531 | 42.7 |
| GPT-OSS 120B Q4_K_M* | 58.5 GiB | 116.83B | 120 | 53.4 |
| MiniMax M2.1-REAP-139B Q4_K_M | 78.40 GiB | 139.15B | 203 | 29.3 |
| MiniMax M2.5 Q3_K_M | 101.76 GiB | 228.69B | 156 | **32.8** |
| Qwen3-235B-A22B Q3_K_M | 104.72 GiB | 235.09B | 101 | 17.2 |
| Nemotron-3 Nano 30B-A3B MXFP4* | 17.6 GiB | 30.0B | 112 | 61.5 |
\* llama-server measured rows (real API usage); includes serving overhead vs raw `llama-bench`.
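A rough way to read the dense-model rows above is to assume that generating one token reads approximately every weight once. That is a common rule of thumb, not something measured in this repo. Under that assumption, model size multiplied by tg gives an effective read rate. The helper name `eff_bw` is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: effective weight-read rate for a dense model.
# Assumption: one generated token reads about the whole model once,
# ignoring KV cache traffic and any caching effects.
eff_bw() {
  # $1 = model size in GiB, $2 = tg in tokens/s; prints GiB/s
  awk -v size="$1" -v tg="$2" 'BEGIN { printf "%.1f\n", size * tg }'
}

eff_bw 3.80 47.3    # Llama 2 7B Q4_K_M
eff_bw 18.48 11.3   # Qwen2.5 32B Q4_K_M
eff_bw 39.59 5.1    # Llama 3.1 70B Q4_K_M
```

For these three rows the product lands between about 180 and 209 GiB/s. The estimate does not apply to the MoE rows, where only the active experts are read per token.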
### MiniMax M2.5 real-world summary
MiniMax M2.5 Q3_K_M sustains roughly **30 t/s** in llama-server usage with long-context configurations, with strong practical output quality for coding/math/architecture prompts.
## 5) Backend Comparison (Vulkan vs ROCm, key models)
Winners-only view:
| Model | Best Prompt Processing (pp) | Best Generation (tg) |
|---|---|---|
| Qwen3-Coder-Next Q6_K_XL (single host) | ROCm 7.x nightlies (~502 pp) | **Vulkan AMDVLK (~38.7 tg)** |
| Qwen3-Coder-Next Q6_K_XL (RPC 2-host) | ROCm 7.0 nightlies (~490 pp) | ROCm 7.x container (~26.3 tg) |
| Qwen3-Coder-Next 80B-A3B Q4_K_M | ROCm 6.4.4 (~581 pp) | Vulkan RADV (~43.5 tg) |
| MiniMax M2.5 Q3_K_M | ROCm 6.4.4/7.x (~214 pp) | Vulkan RADV (~34.3 tg) |
| Qwen3 30B-A3B MoE Q4_K_M | Vulkan RADV | Vulkan RADV |
**Latest finding (Mar 3, 2026):** A full seven-backend comparison for Qwen3-Coder-Next Q6_K_XL shows **Vulkan AMDVLK** is the new tg leader at **38.65 t/s** (+16% over ROCm 7.x), though it has the worst pp (358 t/s). Host-native Vulkan RADV is also strong at **36.80 t/s** tg. ROCm 7.x nightlies remain best for prompt processing (~502 t/s pp).
Practical takeaway: **For interactive serving (tg-dominated), Vulkan AMDVLK or host RADV is the best choice. For batch/prefill workloads, ROCm 7.x nightlies remain optimal.**
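Whether a workload is tg-dominated or prefill-dominated can be estimated from the two throughput figures. The sketch below assumes a simple model in which request time is prompt tokens divided by pp plus generated tokens divided by tg, ignoring serving overhead. The function name `est_seconds` is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: estimated seconds for one request under a simple two-phase model.
# Assumption: time = prompt_tokens / pp + generated_tokens / tg.
est_seconds() {
  # $1 = pp (t/s), $2 = tg (t/s), $3 = prompt tokens, $4 = generated tokens
  awk -v pp="$1" -v tg="$2" -v np="$3" -v ng="$4" \
    'BEGIN { printf "%.2f\n", np / pp + ng / tg }'
}

# AMDVLK figures from the finding above (pp 358 t/s, tg 38.65 t/s):
est_seconds 358 38.65 512 128    # short prompt, chat-style reply
est_seconds 358 38.65 8192 128   # long prompt, same reply length
```

Plugging in measured pp and tg for each backend and the prompt/reply lengths of a real workload shows which backend wins for that shape.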
## 6) Distributed Inference (Evo + Bee RPC)
> **Critical:** for `llama-server`/`llama-cli` with `--rpc` on large models, use **`-dio`** (direct I/O) to avoid load hangs.
### Working two-host results
| Model | Backend | Split | pp512 (t/s) | tg128 (t/s) | Notes |
|---|---|---|---:|---:|---|
| MiniMax-M2.5-REAP-139B-A10B-Q8_0 | ROCm+RPC | 1.2/0.8 | 332.36 | **15.35** | Best tg from quick split sweep |
| Qwen3.5-397B-A17B-UD-Q4_K_XL | ROCm+RPC | auto | 147.55 | 11.76 | llama-bench path |
| Qwen3.5-397B-A17B-UD-Q4_K_XL | ROCm+RPC + `-dio` | 1/1 | 25.9\* | **12.6**\* | llama-server path |
\* server-observed pp/tg metrics (not direct `llama-bench`).
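The two-host setup can be sketched as a pair of commands, one per host. This is an illustration under assumptions, not the exact invocation used for the results above: the helper names, the worker host name, and the model path are hypothetical, and port 50052 is assumed as the `rpc-server` port. The helpers only print the commands.

```bash
#!/usr/bin/env bash
# Sketch: print the commands for a two-host RPC run (nothing is executed).

rpc_worker_cmd() {
  # Run on the worker host. $1 = port (assumed default 50052).
  # Binding to 0.0.0.0 is only reasonable on a private direct link.
  printf 'rpc-server -H 0.0.0.0 -p %s\n' "${1:-50052}"
}

rpc_server_cmd() {
  # Run on the serving host. $1 = model path, $2 = worker host:port.
  # -dio is the direct I/O flag this section says is needed for large loads.
  printf 'llama-server -m %s --rpc %s -ngl 99 -dio\n' "$1" "$2"
}

rpc_worker_cmd
rpc_server_cmd /path/to/model.gguf bee.example:50052
```

See `docs/rpc-hip-serving.md` for the build and serving details that were actually used.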
## 7) Key Findings
- **Memory bandwidth is the dominant limit** on Strix Halo unified memory.
- **Active parameters predict tg well** for MoE models on this platform:
  - 3B active → ~61–86 t/s
  - 5.1B active → ~53 t/s
  - ~10B active → ~29–33 t/s
  - 22B active → ~17 t/s
- **Optimization ablation:** speculative decoding, unified KV flags, no-mmap, and batch-size tuning produced negligible gains in this setup.
- **120W vs 85W:** 120W generally helps 7B+ models; very small models are often bandwidth-limited and see little benefit.
- **Qwen3.5-397B stability can be prompt-shape sensitive:** on `np4_ps32k`, a synthetic repeated-token pattern crashed immediately while an `opt`-shaped prompt passed at 5k, 20k, and ~32k prompt tokens. See [`results/2026-02-20_qwen397b-prompt-shape-sensitivity.md`](results/2026-02-20_qwen397b-prompt-shape-sensitivity.md).
- **Qwen3.5-397B max context found:**
  - `np=1`: ctx 300k with ~150k prompt tokens ✅
  - `np=2`: ctx 200k with ~100k prompt tokens ✅
  - See [`results/2026-02-20_qwen397b-np1-max-context.md`](results/2026-02-20_qwen397b-np1-max-context.md) and [`results/2026-02-20_qwen397b-np2-high-context.md`](results/2026-02-20_qwen397b-np2-high-context.md).
- **Qwen3.5-397B RPC caveat:** at `np2 ctx200k`, `opt` prompts passed (70k/90k) while mixed prompts failed (including GPU memory fault). See [`results/2026-02-21_qwen397b-rpc-shape-control.md`](results/2026-02-21_qwen397b-rpc-shape-control.md).
- **MiniMax RPC stability update:**
  - Shape screen passed **8/8** (Q8 at 45k/48k, Q3_K_M at 70k/90k; opt + natural)
  - High-edge test passed at `np2 ctx256k` with ~`128k` prompt tokens (opt + natural)
  - See [`results/2026-02-21_minimax-rpc-shape-screen.md`](results/2026-02-21_minimax-rpc-shape-screen.md) and [`results/2026-02-21_minimax-q3km-np2-256k-128k.md`](results/2026-02-21_minimax-q3km-np2-256k-128k.md).
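As a quick sanity check on the active-parameter finding above: if tg were exactly inversely proportional to active parameters, the product of the two would be constant. The sketch below computes that product from the upper end of each range listed above. The helper name `active_times_tg` is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: product of active parameters (billions) and tg (t/s).
# A constant product would mean tg scales exactly as 1 / active parameters.
active_times_tg() {
  # $1 = active parameters in billions, $2 = tg in tokens/s
  awk -v a="$1" -v tg="$2" 'BEGIN { printf "%.1f\n", a * tg }'
}

active_times_tg 3 86     # 3B active, upper end of range
active_times_tg 5.1 53   # 5.1B active
active_times_tg 10 33    # ~10B active, upper end of range
active_times_tg 22 17    # 22B active
```

The products range from about 258 to 374, so the relationship holds as a first-order predictor rather than an exact law.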
## 8) Known Issues
- **RPC serving requires `-dio`** for large-model loads with `--rpc` on this platform (`llama-server` / `llama-cli`).
- **AMDVLK update (Mar 2026):** AMDVLK now leads on tg for Q6_K_XL (38.65 t/s) but has significantly worse pp (358 t/s). Consider for tg-dominated interactive workloads; RADV remains the safer all-round Vulkan path.
- **HIP cold-run penalty exists:** the first HIP run after a fresh build can be significantly slower; warm up before recording data.
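One way to apply the warm-up advice above is a wrapper that runs a command once with its output discarded and then again for the recorded result. This is a minimal sketch. The name `run_warm` is hypothetical, and a single warm-up pass is an assumption rather than a tested threshold.

```bash
#!/usr/bin/env bash
# Sketch: run a command twice, discarding the first (warm-up) run.
run_warm() {
  # Warm-up pass: output discarded; abort if the command fails.
  "$@" >/dev/null 2>&1 || return 1
  # Recorded pass: output goes to the caller.
  "$@"
}
```

For example, `run_warm llama-bench -m model.gguf -ngl 99 -p 512 -n 128` records only the second run (the model path here is a placeholder).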
## 9) Community Resources
- [kyuz0/amd-strix-halo-toolboxes](https://github.com/kyuz0/amd-strix-halo-toolboxes)
- [kyuz0 interactive benchmark viewer](https://kyuz0.github.io/amd-strix-halo-toolboxes/)
- [lhl/strix-halo-testing](https://github.com/lhl/strix-halo-testing)
- [Strix Halo Wiki](https://strixhalo.wiki)
### Brief comparison vs kyuz0
Results are broadly aligned with community trends: backend wins vary by metric/model, with ROCm commonly stronger on prompt throughput and Vulkan RADV often stronger on generation responsiveness.
## 10) Documentation
- [BENCHMARKS.md](BENCHMARKS.md) — full benchmark archive and analysis
- [BACKENDS.md](BACKENDS.md) — backend-specific setup and caveats
- [SETUP.md](SETUP.md) — reproducible machine setup
- [docs/rpc-build.md](docs/rpc-build.md) — distributed Vulkan RPC build flow
- [docs/rpc-hip-serving.md](docs/rpc-hip-serving.md) — RPC HIP serving guide (`-dio` requirement)
- [docs/build-visorcraft-llama.md](docs/build-visorcraft-llama.md) — build and verify visorcraft/llama.cpp on host + containers
- [docs/build-rpc-hip-v2-vs-host-native-radv.md](docs/build-rpc-hip-v2-vs-host-native-radv.md) — separate side-by-side guide for `build-rpc-hip-v2` vs host-native Vulkan RADV
## 11) Repo Hygiene / Sanitization
This repository is intentionally sanitized for public sharing:
- no passwords/passphrases/keys/tokens
- no private IPs or local SSH key paths
- no raw host logs committed
Quick check:
```bash
./scripts/sanitize-repo.sh
```
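The contents of `scripts/sanitize-repo.sh` are not shown here. As an illustration of one check from the list above (private IPs), the sketch below filters text for RFC 1918 IPv4 addresses. The function name `find_private_ips` and the regular expression are assumptions, not taken from the repo's script.

```bash
#!/usr/bin/env bash
# Sketch: print input lines that contain an RFC 1918 private IPv4 address
# (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16). Reads from stdin.
find_private_ips() {
  grep -E '(^|[^0-9.])(10\.[0-9]{1,3}|192\.168|172\.(1[6-9]|2[0-9]|3[01]))\.[0-9]{1,3}\.[0-9]{1,3}([^0-9]|$)'
}

# Example use inside a git checkout:
#   git grep -I -n '' | find_private_ips
```

An empty result means no matching lines were found. It does not cover the other items in the list, such as keys or tokens.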
## 12) License
[MIT](LICENSE)