{"id":50756234,"url":"https://github.com/visorcraft/strix-halo-llm-perf","last_synced_at":"2026-06-11T05:03:33.926Z","repository":{"id":338655290,"uuid":"1157525963","full_name":"visorcraft/strix-halo-llm-perf","owner":"visorcraft","description":null,"archived":false,"fork":false,"pushed_at":"2026-03-03T16:36:24.000Z","size":80,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-03-03T20:10:03.294Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/visorcraft.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-02-13T23:20:27.000Z","updated_at":"2026-03-03T16:44:34.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/visorcraft/strix-halo-llm-perf","commit_stats":null,"previous_names":["visorcraft/strix-halo-llm-perf"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/visorcraft/strix-halo-llm-perf","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/visorcraft%2Fstrix-halo-llm-perf","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/visorcraft%2Fstrix-halo-llm-perf/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/visorcraft%2Fstrix-halo-llm-perf/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/visorcraft%2Fstrix-halo-llm-perf/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/visorcraft","download_url":"https://codeload.github.com/visorcraft/strix-halo-llm-perf/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/visorcraft%2Fstrix-halo-llm-perf/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34183109,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-11T02:00:06.485Z","response_time":57,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-06-11T05:03:32.273Z","updated_at":"2026-06-11T05:03:33.921Z","avatar_url":"https://github.com/visorcraft.png","language":"Shell","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 🚀 Strix Halo LLM Performance\n\nBenchmarks and reproducible setup notes for local and distributed LLM inference on **Ryzen AI Max+ 395 (Strix Halo)** using `llama.cpp`.\n\n\u003e **Test systems:**\n\u003e - **Evo** = GMKtec EVO-X2 (Ryzen AI Max+ 395)\n\u003e - **Bee** = Beelink GTR9 Pro (Ryzen AI Max+ 395)\n\u003e - Both hosts have **128 GB unified LPDDR5X** and are linked by a direct **USB4/Thunderbolt cable (~9.4 Gbps measured)** for distributed RPC inference.\n\n## TL;DR\n\n- **Qwen3 30B-A3B MoE Q4_K_M:** **86.1 t/s** token generation (single host, Vulkan)\n- **MiniMax M2.5 Q3_K_M (228.7B):** **32.8 t/s** token generation (single host, Vulkan)\n- **Qwen3-Coder-Next 80B-A3B Q4_K_M:** **42.7 t/s** token generation\n- **GPT-OSS 120B Q4_K_M:** **53.4 t/s** generation in llama-server tests\n- **Nemotron-3 Nano 30B-A3B MXFP4:** **61.5 t/s** generation in llama-server tests\n\n---\n\n## 1) Hardware\n\n| Component | Evo | Bee |\n|---|---|---|\n| System | GMKtec EVO-X2 | Beelink GTR9 Pro |\n| SoC | Ryzen AI Max+ 395 | Ryzen AI Max+ 395 |\n| CPU | 16C/32T Zen 5 | 16C/32T Zen 5 |\n| iGPU | Radeon 8060S (gfx1151, 40 CU) | Radeon 8060S (gfx1151, 40 CU) |\n| Memory | 128 GB unified LPDDR5X | 128 GB unified LPDDR5X |\n\n**Distributed link:** direct USB4/Thunderbolt between Evo and Bee, ~9.4 Gbps effective in testing.\n\n## 2) Software Stack\n\n- **OS:** Fedora 43 (both hosts)\n- **Kernel:** 6.18.x class\n- **Primary backend:** ROCm 7.0 nightlies (via kyuz0 distrobox container)\n- **Secondary backends:** Vulkan RADV (Mesa), ROCm 6.4.x (host), ROCm 7.2 (container)\n- **Inference engine:** `llama.cpp`\n- **Power profile:** 85W/120W tested; 120W usually wins for 7B+ models\n\n## 3) Quick Start (Vulkan container)\n\n```bash\npodman pull docker.io/kyuz0/amd-strix-halo-toolboxes:vulkan-radv\n\ndistrobox create --name llama-vulkan-radv \\\n  --image docker.io/kyuz0/amd-strix-halo-toolboxes:vulkan-radv --yes\n\npodman start llama-vulkan-radv\n\n# benchmark example\npodman exec llama-vulkan-radv bash -lc \\\n  \"llama-bench -m ~/models/Qwen3-Coder-Next-Q4_K_M.gguf -ngl 99 -p 512 -n 128\"\n```\n\n## 4) Single-Host Benchmarks (best results)\n\nAll rows below are working results only, using best observed configuration per model.\n\n| Model | Size | Params | pp512 (t/s) | tg128 (t/s) |\n|---|---:|---:|---:|---:|\n| TinyLlama 1.1B Q4_K_M | 636 MiB | 1.10B | 6,513 | 249 |\n| Llama 3.2 3B Q8_0 | 3.18 GiB | 3.21B | 2,248 | 60.9 |\n| Llama 2 7B Q4_K_M | 3.80 GiB | 6.74B | 1,074 | 47.3 |\n| Qwen2.5-Coder 7B Q6_K | 5.82 GiB | 7.62B | 1,089 | 36.7 |\n| Qwen2.5 14B Q4_K_M | 8.37 GiB | 14.77B | 600 | 24.5 |\n| Qwen3 30B-A3B MoE Q4_K_M | 17.28 GiB | 30.53B | 1,142 | **86.1** |\n| Qwen2.5 32B Q4_K_M | 18.48 GiB | 32.76B | 242 | 11.3 |\n| Llama 3.1 70B Q4_K_M | 39.59 GiB | 70.55B | 81.6 | 5.1 |\n| Qwen3-Coder-Next 80B-A3B Q4_K_M | 45.17 GiB | 79.67B | 531 | 42.7 |\n| GPT-OSS 120B Q4_K_M* | 58.5 GiB | 116.83B | 120 | 53.4 |\n| MiniMax M2.1-REAP-139B Q4_K_M | 78.40 GiB | 139.15B | 203 | 29.3 |\n| MiniMax M2.5 Q3_K_M | 101.76 GiB | 228.69B | 156 | **32.8** |\n| Qwen3-235B-A22B Q3_K_M | 104.72 GiB | 235.09B | 101 | 17.2 |\n| Nemotron-3 Nano 30B-A3B MXFP4* | 17.6 GiB | 30.0B | 112 | 61.5 |\n\n\\* llama-server measured rows (real API usage); includes serving overhead vs raw `llama-bench`.\n\n### MiniMax M2.5 real-world summary\n\nMiniMax M2.5 Q3_K_M sustains roughly **~30 t/s** in llama-server usage with long-context configurations, with strong practical output quality for coding/math/architecture prompts.\n\n## 5) Backend Comparison (Vulkan vs ROCm, key models)\n\nWinners-only view:\n\n| Model | Best Prompt Processing (pp) | Best Generation (tg) |\n|---|---|---|\n| Qwen3-Coder-Next Q6_K_XL (single host) | ROCm 7.x nightlies (~502 pp) | **Vulkan AMDVLK (~38.7 tg)** |\n| Qwen3-Coder-Next Q6_K_XL (RPC 2-host) | ROCm 7.0 nightlies (~490 pp) | ROCm 7.x container (~26.3 tg) |\n| Qwen3-Coder-Next 80B-A3B Q4_K_M | ROCm 6.4.4 (~581 pp) | Vulkan RADV (~43.5 tg) |\n| MiniMax M2.5 Q3_K_M | ROCm 6.4.4/7.x (~214 pp) | Vulkan RADV (~34.3 tg) |\n| Qwen3 30B-A3B MoE Q4_K_M | Vulkan RADV | Vulkan RADV |\n\n**Latest finding (Mar 3, 2026):** Full 7-backend comparison for Q6_K_XL reveals **Vulkan AMDVLK** is the new tg champion at **38.65 t/s** (+16% over ROCm 7.x), though it has the worst pp (358 t/s). Host native Vulkan RADV is also strong at **36.80 tg**. ROCm 7.x nightlies remains best for prompt processing (~502 pp).\n\nPractical takeaway: **For interactive serving (tg-dominated), Vulkan AMDVLK or host RADV are best. For batch/prefill workloads, ROCm 7.x nightlies remains optimal.**\n\n## 6) Distributed Inference (Evo + Bee RPC)\n\n\u003e **Critical:** for `llama-server`/`llama-cli` with `--rpc` on large models, use **`-dio`** (direct I/O) to avoid load hangs.\n\n### Working two-host results\n\n| Model | Backend | Split | pp512 (t/s) | tg128 (t/s) | Notes |\n|---|---|---|---:|---:|---|\n| MiniMax-M2.5-REAP-139B-A10B-Q8_0 | ROCm+RPC | 1.2/0.8 | 332.36 | **15.35** | Best tg from quick split sweep |\n| Qwen3.5-397B-A17B-UD-Q4_K_XL | ROCm+RPC | auto | 147.55 | 11.76 | llama-bench path |\n| Qwen3.5-397B-A17B-UD-Q4_K_XL | ROCm+RPC + `-dio` | 1/1 | 25.9* | **12.6*** | llama-server path |\n\n\\* server-observed pp/tg metrics (not direct `llama-bench`).\n\n## 7) Key Findings\n\n- **Bandwidth wall dominates** on Strix Halo unified memory.\n- **Active parameters predict tg well** for MoE models on this platform:\n  - 3B active → ~61–86 t/s\n  - 5.1B active → ~53 t/s\n  - ~10B active → ~29–33 t/s\n  - 22B active → ~17 t/s\n- **Optimization ablation:** speculative decoding, unified KV flags, no-mmap, and batch-size tuning produced negligible gains in this setup.\n- **120W vs 85W:** 120W generally helps 7B+ models; very small models are often bandwidth-limited and see little benefit.\n- **Qwen3.5-397B stability can be prompt-shape sensitive:** on `np4_ps32k`, a synthetic repeated-token pattern crashed immediately while an `opt`-shaped prompt passed at 5k, 20k, and ~32k prompt tokens. See [`results/2026-02-20_qwen397b-prompt-shape-sensitivity.md`](results/2026-02-20_qwen397b-prompt-shape-sensitivity.md).\n- **Qwen3.5-397B max context found:**\n  - `np=1`: ctx 300k with ~150k prompt tokens ✅\n  - `np=2`: ctx 200k with ~100k prompt tokens ✅\n  - See [`results/2026-02-20_qwen397b-np1-max-context.md`](results/2026-02-20_qwen397b-np1-max-context.md) and [`results/2026-02-20_qwen397b-np2-high-context.md`](results/2026-02-20_qwen397b-np2-high-context.md).\n- **Qwen3.5-397B RPC caveat:** at `np2 ctx200k`, `opt` prompts passed (70k/90k) while mixed prompts failed (including GPU memory fault). See [`results/2026-02-21_qwen397b-rpc-shape-control.md`](results/2026-02-21_qwen397b-rpc-shape-control.md).\n- **MiniMax RPC stability update:**\n  - Shape screen passed **8/8** (Q8 at 45k/48k, Q3_K_M at 70k/90k; opt + natural)\n  - High-edge test passed at `np2 ctx256k` with ~`128k` prompt tokens (opt + natural)\n  - See [`results/2026-02-21_minimax-rpc-shape-screen.md`](results/2026-02-21_minimax-rpc-shape-screen.md) and [`results/2026-02-21_minimax-q3km-np2-256k-128k.md`](results/2026-02-21_minimax-q3km-np2-256k-128k.md).\n\n## 8) Known Issues\n\n- **RPC serving requires `-dio`** for large-model loads with `--rpc` on this platform (`llama-server` / `llama-cli`).\n- **AMDVLK update (Mar 2026):** AMDVLK now leads on tg for Q6_K_XL (38.65 t/s) but has significantly worse pp (358 t/s). Consider for tg-dominated interactive workloads; RADV remains the safer all-round Vulkan path.\n- **HIP cold-run penalty exists:** first HIP run after fresh build can be significantly slower; warm up before recording data.\n\n## 9) Community Resources\n\n- [kyuz0/amd-strix-halo-toolboxes](https://github.com/kyuz0/amd-strix-halo-toolboxes)\n- [kyuz0 interactive benchmark viewer](https://kyuz0.github.io/amd-strix-halo-toolboxes/)\n- [lhl/strix-halo-testing](https://github.com/lhl/strix-halo-testing)\n- [Strix Halo Wiki](https://strixhalo.wiki)\n\n### Brief comparison vs kyuz0\n\nResults are broadly aligned with community trends: backend wins vary by metric/model, with ROCm commonly stronger on prompt throughput and Vulkan RADV often stronger on generation responsiveness.\n\n## 10) Documentation\n\n- [BENCHMARKS.md](BENCHMARKS.md) — full benchmark archive and analysis\n- [BACKENDS.md](BACKENDS.md) — backend-specific setup and caveats\n- [SETUP.md](SETUP.md) — reproducible machine setup\n- [docs/rpc-build.md](docs/rpc-build.md) — distributed Vulkan RPC build flow\n- [docs/rpc-hip-serving.md](docs/rpc-hip-serving.md) — RPC HIP serving guide (`-dio` requirement)\n- [docs/build-visorcraft-llama.md](docs/build-visorcraft-llama.md) — build and verify visorcraft/llama.cpp on host + containers\n- [docs/build-rpc-hip-v2-vs-host-native-radv.md](docs/build-rpc-hip-v2-vs-host-native-radv.md) — separate side-by-side guide for `build-rpc-hip-v2` vs host-native Vulkan RADV\n\n## 11) Repo Hygiene / Sanitization\n\nThis repository is intentionally sanitized for public sharing:\n- no passwords/passphrases/keys/tokens\n- no private IPs or local SSH key paths\n- no raw host logs committed\n\nQuick check:\n```bash\n./scripts/sanitize-repo.sh\n```\n\n## 12) License\n\n[MIT](LICENSE)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvisorcraft%2Fstrix-halo-llm-perf","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvisorcraft%2Fstrix-halo-llm-perf","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvisorcraft%2Fstrix-halo-llm-perf/lists"}