https://github.com/visorcraft/strix-halo-llm-perf

Last synced: 19 days ago
JSON representation
Host: GitHub
URL: https://github.com/visorcraft/strix-halo-llm-perf
Owner: visorcraft
License: mit
Created: 2026-02-13T23:20:27.000Z (5 months ago)
Default Branch: main
Last Pushed: 2026-03-03T16:36:24.000Z (4 months ago)
Last Synced: 2026-03-03T20:10:03.294Z (4 months ago)
Language: Shell
Size: 78.1 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project

README

          # 🚀 Strix Halo LLM Performance

Benchmarks and reproducible setup notes for local and distributed LLM inference on **Ryzen AI Max+ 395 (Strix Halo)** using `llama.cpp`.

> **Test systems:**

> - **Evo** = GMKtec EVO-X2 (Ryzen AI Max+ 395)

> - **Bee** = Beelink GTR9 Pro (Ryzen AI Max+ 395)

> - Both hosts have **128 GB unified LPDDR5X** and are linked by a direct **USB4/Thunderbolt cable (~9.4 Gbps measured)** for distributed RPC inference.

## TL;DR

- **Qwen3 30B-A3B MoE Q4_K_M:** **86.1 t/s** token generation (single host, Vulkan)

- **MiniMax M2.5 Q3_K_M (228.7B):** **32.8 t/s** token generation (single host, Vulkan)

- **Qwen3-Coder-Next 80B-A3B Q4_K_M:** **42.7 t/s** token generation

- **GPT-OSS 120B Q4_K_M:** **53.4 t/s** generation in llama-server tests

- **Nemotron-3 Nano 30B-A3B MXFP4:** **61.5 t/s** generation in llama-server tests

---

## 1) Hardware

| Component | Evo | Bee |

|---|---|---|

| System | GMKtec EVO-X2 | Beelink GTR9 Pro |

| SoC | Ryzen AI Max+ 395 | Ryzen AI Max+ 395 |

| CPU | 16C/32T Zen 5 | 16C/32T Zen 5 |

| iGPU | Radeon 8060S (gfx1151, 40 CU) | Radeon 8060S (gfx1151, 40 CU) |

| Memory | 128 GB unified LPDDR5X | 128 GB unified LPDDR5X |

**Distributed link:** direct USB4/Thunderbolt between Evo and Bee, ~9.4 Gbps effective in testing.

## 2) Software Stack

- **OS:** Fedora 43 (both hosts)

- **Kernel:** 6.18.x class

- **Primary backend:** ROCm 7.0 nightlies (via kyuz0 distrobox container)

- **Secondary backends:** Vulkan RADV (Mesa), ROCm 6.4.x (host), ROCm 7.2 (container)

- **Inference engine:** `llama.cpp`

- **Power profile:** 85W/120W tested; 120W usually wins for 7B+ models

## 3) Quick Start (Vulkan container)

```bash

podman pull docker.io/kyuz0/amd-strix-halo-toolboxes:vulkan-radv

distrobox create --name llama-vulkan-radv \

  --image docker.io/kyuz0/amd-strix-halo-toolboxes:vulkan-radv --yes

podman start llama-vulkan-radv

# benchmark example

podman exec llama-vulkan-radv bash -lc \

  "llama-bench -m ~/models/Qwen3-Coder-Next-Q4_K_M.gguf -ngl 99 -p 512 -n 128"

```

## 4) Single-Host Benchmarks (best results)

All rows below are working results only, using best observed configuration per model.

| Model | Size | Params | pp512 (t/s) | tg128 (t/s) |

|---|---:|---:|---:|---:|

| TinyLlama 1.1B Q4_K_M | 636 MiB | 1.10B | 6,513 | 249 |

| Llama 3.2 3B Q8_0 | 3.18 GiB | 3.21B | 2,248 | 60.9 |

| Llama 2 7B Q4_K_M | 3.80 GiB | 6.74B | 1,074 | 47.3 |

| Qwen2.5-Coder 7B Q6_K | 5.82 GiB | 7.62B | 1,089 | 36.7 |

| Qwen2.5 14B Q4_K_M | 8.37 GiB | 14.77B | 600 | 24.5 |

| Qwen3 30B-A3B MoE Q4_K_M | 17.28 GiB | 30.53B | 1,142 | **86.1** |

| Qwen2.5 32B Q4_K_M | 18.48 GiB | 32.76B | 242 | 11.3 |

| Llama 3.1 70B Q4_K_M | 39.59 GiB | 70.55B | 81.6 | 5.1 |

| Qwen3-Coder-Next 80B-A3B Q4_K_M | 45.17 GiB | 79.67B | 531 | 42.7 |

| GPT-OSS 120B Q4_K_M* | 58.5 GiB | 116.83B | 120 | 53.4 |

| MiniMax M2.1-REAP-139B Q4_K_M | 78.40 GiB | 139.15B | 203 | 29.3 |

| MiniMax M2.5 Q3_K_M | 101.76 GiB | 228.69B | 156 | **32.8** |

| Qwen3-235B-A22B Q3_K_M | 104.72 GiB | 235.09B | 101 | 17.2 |

| Nemotron-3 Nano 30B-A3B MXFP4* | 17.6 GiB | 30.0B | 112 | 61.5 |

\* llama-server measured rows (real API usage); includes serving overhead vs raw `llama-bench`.

### MiniMax M2.5 real-world summary

MiniMax M2.5 Q3_K_M sustains roughly **~30 t/s** in llama-server usage with long-context configurations, with strong practical output quality for coding/math/architecture prompts.

## 5) Backend Comparison (Vulkan vs ROCm, key models)

Winners-only view:

| Model | Best Prompt Processing (pp) | Best Generation (tg) |

|---|---|---|

| Qwen3-Coder-Next Q6_K_XL (single host) | ROCm 7.x nightlies (~502 pp) | **Vulkan AMDVLK (~38.7 tg)** |

| Qwen3-Coder-Next Q6_K_XL (RPC 2-host) | ROCm 7.0 nightlies (~490 pp) | ROCm 7.x container (~26.3 tg) |

| Qwen3-Coder-Next 80B-A3B Q4_K_M | ROCm 6.4.4 (~581 pp) | Vulkan RADV (~43.5 tg) |

| MiniMax M2.5 Q3_K_M | ROCm 6.4.4/7.x (~214 pp) | Vulkan RADV (~34.3 tg) |

| Qwen3 30B-A3B MoE Q4_K_M | Vulkan RADV | Vulkan RADV |

**Latest finding (Mar 3, 2026):** Full 7-backend comparison for Q6_K_XL reveals **Vulkan AMDVLK** is the new tg champion at **38.65 t/s** (+16% over ROCm 7.x), though it has the worst pp (358 t/s). Host native Vulkan RADV is also strong at **36.80 tg**. ROCm 7.x nightlies remains best for prompt processing (~502 pp).

Practical takeaway: **For interactive serving (tg-dominated), Vulkan AMDVLK or host RADV are best. For batch/prefill workloads, ROCm 7.x nightlies remains optimal.**

## 6) Distributed Inference (Evo + Bee RPC)

> **Critical:** for `llama-server`/`llama-cli` with `--rpc` on large models, use **`-dio`** (direct I/O) to avoid load hangs.

### Working two-host results

| Model | Backend | Split | pp512 (t/s) | tg128 (t/s) | Notes |

|---|---|---|---:|---:|---|

| MiniMax-M2.5-REAP-139B-A10B-Q8_0 | ROCm+RPC | 1.2/0.8 | 332.36 | **15.35** | Best tg from quick split sweep |

| Qwen3.5-397B-A17B-UD-Q4_K_XL | ROCm+RPC | auto | 147.55 | 11.76 | llama-bench path |

| Qwen3.5-397B-A17B-UD-Q4_K_XL | ROCm+RPC + `-dio` | 1/1 | 25.9* | **12.6*** | llama-server path |

\* server-observed pp/tg metrics (not direct `llama-bench`).

## 7) Key Findings

- **Bandwidth wall dominates** on Strix Halo unified memory.

- **Active parameters predict tg well** for MoE models on this platform:

  - 3B active → ~61–86 t/s

  - 5.1B active → ~53 t/s

  - ~10B active → ~29–33 t/s

  - 22B active → ~17 t/s

- **Optimization ablation:** speculative decoding, unified KV flags, no-mmap, and batch-size tuning produced negligible gains in this setup.

- **120W vs 85W:** 120W generally helps 7B+ models; very small models are often bandwidth-limited and see little benefit.

- **Qwen3.5-397B stability can be prompt-shape sensitive:** on `np4_ps32k`, a synthetic repeated-token pattern crashed immediately while an `opt`-shaped prompt passed at 5k, 20k, and ~32k prompt tokens. See [`results/2026-02-20_qwen397b-prompt-shape-sensitivity.md`](results/2026-02-20_qwen397b-prompt-shape-sensitivity.md).

- **Qwen3.5-397B max context found:**

  - `np=1`: ctx 300k with ~150k prompt tokens ✅

  - `np=2`: ctx 200k with ~100k prompt tokens ✅

  - See [`results/2026-02-20_qwen397b-np1-max-context.md`](results/2026-02-20_qwen397b-np1-max-context.md) and [`results/2026-02-20_qwen397b-np2-high-context.md`](results/2026-02-20_qwen397b-np2-high-context.md).

- **Qwen3.5-397B RPC caveat:** at `np2 ctx200k`, `opt` prompts passed (70k/90k) while mixed prompts failed (including GPU memory fault). See [`results/2026-02-21_qwen397b-rpc-shape-control.md`](results/2026-02-21_qwen397b-rpc-shape-control.md).

- **MiniMax RPC stability update:**

  - Shape screen passed **8/8** (Q8 at 45k/48k, Q3_K_M at 70k/90k; opt + natural)

  - High-edge test passed at `np2 ctx256k` with ~`128k` prompt tokens (opt + natural)

  - See [`results/2026-02-21_minimax-rpc-shape-screen.md`](results/2026-02-21_minimax-rpc-shape-screen.md) and [`results/2026-02-21_minimax-q3km-np2-256k-128k.md`](results/2026-02-21_minimax-q3km-np2-256k-128k.md).

## 8) Known Issues

- **RPC serving requires `-dio`** for large-model loads with `--rpc` on this platform (`llama-server` / `llama-cli`).

- **AMDVLK update (Mar 2026):** AMDVLK now leads on tg for Q6_K_XL (38.65 t/s) but has significantly worse pp (358 t/s). Consider for tg-dominated interactive workloads; RADV remains the safer all-round Vulkan path.

- **HIP cold-run penalty exists:** first HIP run after fresh build can be significantly slower; warm up before recording data.

## 9) Community Resources

- [kyuz0/amd-strix-halo-toolboxes](https://github.com/kyuz0/amd-strix-halo-toolboxes)

- [kyuz0 interactive benchmark viewer](https://kyuz0.github.io/amd-strix-halo-toolboxes/)

- [lhl/strix-halo-testing](https://github.com/lhl/strix-halo-testing)

- [Strix Halo Wiki](https://strixhalo.wiki)

### Brief comparison vs kyuz0

Results are broadly aligned with community trends: backend wins vary by metric/model, with ROCm commonly stronger on prompt throughput and Vulkan RADV often stronger on generation responsiveness.

## 10) Documentation

- [BENCHMARKS.md](BENCHMARKS.md) — full benchmark archive and analysis

- [BACKENDS.md](BACKENDS.md) — backend-specific setup and caveats

- [SETUP.md](SETUP.md) — reproducible machine setup

- [docs/rpc-build.md](docs/rpc-build.md) — distributed Vulkan RPC build flow

- [docs/rpc-hip-serving.md](docs/rpc-hip-serving.md) — RPC HIP serving guide (`-dio` requirement)

- [docs/build-visorcraft-llama.md](docs/build-visorcraft-llama.md) — build and verify visorcraft/llama.cpp on host + containers

- [docs/build-rpc-hip-v2-vs-host-native-radv.md](docs/build-rpc-hip-v2-vs-host-native-radv.md) — separate side-by-side guide for `build-rpc-hip-v2` vs host-native Vulkan RADV

## 11) Repo Hygiene / Sanitization

This repository is intentionally sanitized for public sharing:

- no passwords/passphrases/keys/tokens

- no private IPs or local SSH key paths

- no raw host logs committed

Quick check:

```bash

./scripts/sanitize-repo.sh

```

## 12) License

[MIT](LICENSE)
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/visorcraft/strix-halo-llm-perf

Awesome Lists containing this project

README