https://github.com/maeddesg/vulkanforge

LLM inference engine for AMD RDNA4 — Rust + Vulkan compute shaders, gguf & native FP8.
https://github.com/maeddesg/vulkanforge
amd fp8 gemma4 gfx1201 gguf inference llm machine-learning mesa rdna4 rust vulkan
Last synced: 3 days ago
JSON representation
LLM inference engine for AMD RDNA4 — Rust + Vulkan compute shaders, gguf & native FP8.
Host: GitHub
URL: https://github.com/maeddesg/vulkanforge
Owner: maeddesg
License: gpl-3.0
Created: 2026-04-25T13:40:19.000Z (about 2 months ago)
Default Branch: main
Last Pushed: 2026-06-06T18:16:09.000Z (11 days ago)
Last Synced: 2026-06-06T20:10:42.170Z (11 days ago)
Topics: amd, fp8, gemma4, gfx1201, gguf, inference, llm, machine-learning, mesa, rdna4, rust, vulkan
Language: Rust
Homepage:
Size: 9.8 MB
Stars: 9
Watchers: 0
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
Awesome Lists containing this project

README

          # VulkanForge

LLM inference engine for AMD RDNA4 GPUs. Pure Rust + Vulkan compute

shaders. ~14 MB static binary; no runtime dependencies beyond the

system Vulkan loader.

**First engine doing native FP8 WMMA over Vulkan on consumer AMD

hardware** (`V_WMMA_F32_16X16X16_FP8_FP8` via Mesa 26.1+

`shaderFloat8CooperativeMatrix`).

> This project builds on the foundational work of

> [oldnordic](https://github.com/oldnordic/ROCmForge).

> Without his original ROCmForge implementation — the model loader,

> the CPU inference path, the GGUF parser, and the overall

> architecture — none of the WMMA matrix-core optimisations, the

> multi-model support, or the interactive chat CLI would have been

> possible. Thank you for making this project a reality.

## Highlights

- **Server-side memory (v1.0)** — `vulkanforge serve` now has a **persistent, project-scoped, semantic memory**

  embedded in the API process. VF-native endpoints write notes on purpose and read them back by meaning:

  `POST /memory/remember`, `POST /memory/recall` (`{hits:[{id,kind,name,text,status,score}]}`),

  `POST`/`GET /memory/projects`. [SQLiteGraph](https://github.com/oldnordic/sqlitegraph) (nodes + edges +

  per-project HNSW indexes in one SQLite file) + a CPU embedder ([fastembed](https://github.com/maeddesg/fastembed-rs),

  Nomic-Embed v1.5-Q, 768-dim, AVX-512/VNNI). Each project gets its own index — recall in one project **cannot**

  return another's notes — and the store survives restarts (vectors restore with no re-embed). The memory path

  never takes the GPU permit. **First start** downloads the ONNX model into `~/.vulkanforge/embed-cache` (then

  offline); the two native deps add ~34 MB to the binary (static ONNX Runtime + bundled SQLite). What it is, what

  it isn't, and the roadmap: the wiki's **[Memory](https://github.com/maeddesg/vulkanforge/wiki/Memory)** page.

  See `CHANGELOG.md`.

- **`vf-clide` REPL permission ceiling + denial wording (v0.9.4)** — `vf-clide` (0.3.1): in the agent **REPL**,

  tool calls at or below the active `--yes` / `--allow-mutating` / `--allow-shell` ceiling are now

  **auto-approved** (still printed) and only calls **above** it prompt `y/N` — consistent with headless, not

  laxer (workspace confinement still bounds reads/writes independently; **headless `-p` is unchanged**, denying

  above the ceiling). The agent constitution now separates a *permission* denial (lifted by re-running with

  `--allow-*`) from an absolute *workspace-confinement* denial. No engine change. See `CHANGELOG.md`.

- **`vf-clide` token meter + clean `serve` shutdown (v0.9.2)** — `vf-clide` (0.3.0) now shows **live token

  usage** (real server counts on the non-streaming, tool-calling, and streaming paths) and a **pinned status

  line** with a token meter and the current action — a no-op off-TTY, so **headless `-p` stays byte-for-byte

  unchanged**. Engine bugfix: Ctrl+C / SIGTERM on `vulkanforge serve` used to leak GPU objects and **SIGSEGV**;

  shutdown now idles the device, runs the resource teardown in order, and exits cleanly with **0 leaked

  objects** (shutdown-path only — decode is untouched). See `CHANGELOG.md`.

- **Agentic `vf-clide` + engine test-infra hardening (v0.9.0)** — the `vf-clide` client grows from a chat

  client into an **agentic coding client**: an opt-in `--agent` tool loop with **read_file / write_file /

  search / shell**, a **three-tier permission model** (ReadOnly / Mutating / Exec, opt-in via

  `--yes` → `--allow-mutating` → `--allow-shell`, cumulative), **workspace confinement** for the file tools

  (`../` and symlink escapes rejected), and a **constitution** (built-in system prompt + project `AGENTS.md`).

  `shell` is deliberately *not* confined — `--allow-shell` is the explicit opt-in. The engine's end-to-end

  regression + per-shader correctness suites are reactivated and guarded against drift; **no decode/behavior

  change**. See [`vf-clide/README.md`](vf-clide/README.md), the wiki's *vf-clide* page, and `CHANGELOG.md`.

- **Automatic context sizing + Gemma-4 tool-calling + the `vf-clide` client (v0.8.0)** —

  `serve` without `--ctx-size` now picks the largest safe KV context from live VRAM + the model and

  prints what it chose and why (explicit `--ctx-size` still overrides; hardware-capped at 16384 on

  RDNA4). The OpenAI `tools` API now works with **Gemma-4**'s native tool-call format (Qwen/Hermes

  path unchanged). And a new standalone CLI chat client, **`vf-clide/`** (streaming REPL + headless,

  no engine deps), ships alongside. Inference output unchanged. See `CHANGELOG.md`.

- **Prefill parity with llama.cpp Vulkan (v0.7.0, default-on)** — coopmat flash-attention now

  also covers **dense `head_dim=128`** (Qwen3 / Llama-3.1 / Mistral-v0.3 / DeepSeek-R1), and the

  Gemma-4 MoE router gate-projection is batched through the dense GEMM. Dense prefill @p2048

  **0.71–0.78× → 0.96–1.04× llama** (parity; Mistral ahead); Gemma-4-26B-A4B MoE **Q3 0.80→0.89× /

  QAT 0.73→0.83×**. Decode unchanged. Three new default-on paths (`VF_FA_COOPMAT_HD128`,

  prefill-separate MoE router, `VF_MOE_ROUTER_BATCHED`), each with an opt-out. The batched MoE

  router is value-preserving on factual output (a borderline top-k flip can diverge a free-form

  tail; opt out via `VF_MOE_ROUTER_BATCHED=0`). See the matrix below + `CHANGELOG.md`.

- **Gemma-4 prefill 612 → 2629 t/s (v0.6.2, default-on)** — the whole Gemma-4

  attention now runs on a KHR-cooperative-matrix flash-attention kernel (a

  faithful port of llama.cpp's `flash_attn_cm1`: 16×16×16 coopmat QK + PV,

  Br16/Bc64/row_split4, K/V coopMatLoad'd direct-from-global f16). Attention

  drops from 76 % of the prefill wall to ~0 → prefill hits the attention-free

  ceiling: Gemma-4-26B-A4B Q3_K_M **612 → 2629 t/s @p2048** (4.3× over v0.6.1;

  **0.80× llama.cpp Vulkan** at p2048 on the unified matrix below); QAT-Q4_0 hits

  3107 t/s. Occupancy matches

  llama's budget (hd256 occ 12 / hd512 occ 8, 0 spills). Default change

  (coopmat f16-PV vs fa_batch f32, rel ~5e-4, ctx4096-coherent); `VF_FA_COOPMAT_RS=0`

  escapes to fa_batch. Non-Gemma-4 / hd128 untouched (15-prompt regression across

  11 models: no flip regression). See `CHANGELOG.md`.

- **Big-MoE decode +89 % (v0.5.2)** — adaptive load-staging eliminates a

  fixed-buffer VRAM load-transient that was evicting weights to GTT, plus a

  parallel MoE-router top-K (both value-preserving, bit-identical to v0.5.1).

  Gemma-4-26B-A4B Q3_K_M decode **54 → 102.6 tok/s** (was 41 % of llama.cpp;

  v0.6.2 reaches **0.87× llama.cpp** on the same GGUF — see matrix below);

  Qwen3.6-27B **24 → 39.7**. See `CHANGELOG.md`.

- **Software Graph dispatch pipeline (v0.5.0, default-on)** — the

  `LayerStep`-based imperative executor is now scheduled through a

  topologically-sorted dependency graph (`SubDispatch` enum, byte-

  range edge resolution, high-water-mark barrier pass). Validated

  15/15 coherent across 6 release-target architectures (Llama,

  Mistral, Qwen3, DeepSeek-R1, Qwen3.6 GDN, Gemma-4-26B MoE) with

  0 Vulkan synchronization hazards under `validate_sync`. Decode

  +28 % on Qwen3.6 (SG-3 SSM step decomposition); see

  `results/v050_release_bench.md`.

- **Near-parity decode vs llama.cpp Vulkan on RDNA4** — 0.87–0.97×

  on the unified matrix below (same backend); beats vLLM 0.20.1 ROCm

  on FP8 single-user decode (1.3–2× ahead).

- **Native FP8 E4M3** loader that ingests HuggingFace SafeTensors

  directly, no FP16 round-trip on disk.

- **All three FP8 scaling strategies auto-detected**:

  per-tensor, per-channel, block-wise `[128, 128]`.

- **Native FP8 WMMA on Mesa 26.1+** (auto-enabled when the driver

  advertises `shaderFloat8CooperativeMatrix`) — +45–58 % FP8 prefill

  across all three sub-types.

- **CPU `lm_head` offload (v0.3.10)** — Q6_K weights on CPU RAM,

  hand-tuned AVX-512 GEMV (Zen 4). Frees ~970 MB VRAM and on

  **14B FP8 it's 32 % faster than the GPU baseline**

  (17.8 vs 13.5 tok/s).

- **2× better power efficiency** (tok/s/W) on decode vs llama.cpp.

- **Llama-3, Qwen2.5, Qwen3, Mistral, DeepSeek-R1-Distill, Gemma-4**

  model families covered (Gemma-4 SafeTensors path produces coherent

  English with full Markdown structure as of v0.3.14; see

  [docs/MODELS.md](docs/MODELS.md) and the v0.3.14 entry in

  [CHANGELOG.md](CHANGELOG.md) for the 8-bug coherence fix-up plus

  the `forward.rs` refactor that ships alongside it). **v0.6.1 adds

  Q4_0 GGUF support for Gemma-4 — the QAT line (E2B…31B) runs**,

  byte-gated vs llama.cpp (26B-A4B QAT: prefill 1143/646 t/s

  @p512/p2048, decode 117 t/s out of the box). Qwen2.5-Q4_0 stays

  deliberately gated (missing arch features, not the quant).

- **`forward.rs` Refactor (v0.3.14):** the 7816-LOC dispatch file

  splits into 13 sibling modules with a `LayerStep` enum + two

  `LayerExecutor` impls. The Sprint 43F bug class — "added a

  per-layer step in decode but forgot prefill" — becomes a compile

  error in both executors until the new variant is handled.

- **104 / 105 coherent** on the deterministic 15-prompt suite across

  all production configurations — six 8B+ paths plus Gemma-4-E2B

  SafeTensors (14/15; the one miss is a Gemma-tokenizer-emoji

  surrogate). See [Quality](#quality-15-prompt-benchmark) below.

## Quick start

```bash

# GGUF — no flag needed, default-everywhere path (Mesa 26.1+ recommended)

vulkanforge chat --model ~/models/Qwen3-8B-Q4_K_M.gguf

# FP8 SafeTensors — one flag, no --tokenizer-from needed (v0.3.13)

# Auto-detects: FP8 model (config.json), native WMMA (Mesa 26.1+),

# AVX-512 (host), model size → CPU lm_head offload (≥ 12 B).

# Auto-loads: tokenizer.json + chat_template from the model dir.

VF_FP8=auto vulkanforge chat --model ~/models/Qwen3-8B-FP8/

# 14 B FP8 with auto CPU lm_head — saves 970 MB VRAM, +9 % decode

VF_FP8=auto vulkanforge chat --model ~/models/Qwen2.5-14B-Instruct-FP8/

```

The legacy v0.3.10 flags (`VULKANFORGE_ENABLE_FP8=1`,

`VF_CPU_LM_HEAD=1`) and `--tokenizer-from ` still work as

explicit overrides — handy when you want CPU `lm_head` on an 8 B

model for VRAM headroom, or want to force a specific tokenizer

source for a regression check.

Native FP8 WMMA was a flag in v0.3.10–v0.3.15

(`VF_FP8_NATIVE_WMMA=1`); v0.3.16 (Sprint 47B) removes the flag and

makes the routing capability-driven — VulkanForge picks the native

FP8 path automatically iff the driver advertises

`shaderFloat8CooperativeMatrix`. Check with:

```bash

vulkaninfo 2>/dev/null | grep shaderFloat8CooperativeMatrix

```

Build from source:

```bash

cargo build --release   # Rust 1.85+, Vulkan headers required

```

## Performance at a glance

All numbers on AMD Radeon RX 9070 XT (gfx1201, RDNA4), RADV/Vulkan.

Full tables with power data and methodology in

[docs/BENCHMARKS.md](docs/BENCHMARKS.md).

### VulkanForge vs llama.cpp — unified bench matrix (v0.7.0)

One run, identical conditions per row, no cherry-picking (Sprint 11i, measured fresh at the

v0.7.0 HEAD). **HW:** RX 9070 XT, gfx1201 (RDNA4). **Driver:** RADV / Mesa 26.1.2-arch2.1

(Vulkan 1.4.303). **VF:** v0.7.0 — coopmat flash-attention (dense hd128 + Gemma-4 hd256/512) +

batched MoE-router gate-proj, all default-on. **llama.cpp:** Vulkan build (`GGML_VULKAN`,

`KHR_coopmat`, *not* HIP), `b9174-g0253fb21f`, `-fa 1 -ngl 99`. **ctx** 4096, greedy, validation

off, warm, median ≥3–5; runs strictly sequential (never both engines at once). Measured 2026-06-09.

| Model | Quant | KV | VF p≈512 | llama p≈512 | VF/ll | VF p≈2048 | llama p≈2048 | VF/ll | VF dec | llama dec | VF/ll |

|---|---|---|--:|--:|:-:|--:|--:|:-:|--:|--:|:-:|

| Qwen3-8B | Q4_K_M | f16 | 4291 | 4472 | 0.96 | 4280 | 4479 | **0.96** | 109.4 | 114.9 | **0.95** |

| Llama-3.1-8B-Instruct | Q4_K_M | f16 | 4474 | 4802 | 0.93 | 4470 | 4644 | **0.96** | 114.5 | 119.6 | **0.96** |

| Mistral-7B-v0.3 | Q4_K_M | f16 | 4845 | 4826 | 1.00 | 4825 | 4654 | **1.04** | 124.2 | 127.5 | **0.97** |

| DeepSeek-R1-Distill-Llama-8B | Q4_K_M | f16 | 4461 | 4785 | 0.93 | 4464 | 4628 | **0.96** | 114.7 | 118.6 | **0.97** |

| gemma-4-26B-A4B (MoE) | Q3_K_M | FP8/q8_0 | 2085 | 3251 | 0.64 | 2862 | 3219 | **0.89** | 110.2 | 127.1 | 0.87 |

| gemma-4-26B-A4B (MoE) | Q4_0 ⁑ | FP8/q8_0 | 2653 | 4140 | 0.64 | 3436 | 4119 | **0.83** | 121.2 | 132.7 | **0.91** |

Prefill is tok/s, decode is generated tok/s. **Prefill lengths (matched per row):** dense exact

512 / 2048; MoE 513 / 2049. **KV:** 8B both f16; 26B = VF-FP8 vs llama-`q8_0` (both 8-bit, nearest

equivalent — the one config difference). **⁑ QAT** = the same file

`gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf` for both engines; VF labels it `Q4_K_XL` by filename,

llama reads the tensors as `Q4_0`. **MoE prefill is measured with varied tokens** — *not* the

`run_pp_bench` micro-bench, whose single-token-repeat degenerates MoE routing and overstates MoE

prefill by 1.2–1.7×. Net (v0.7.0): **dense prefill 0.93–1.04× llama (parity), Gemma-MoE prefill

0.64× @512 → 0.83–0.89× @2048, decode 0.87–0.97×**. Dense attention (coopmat-FA on hd128, v0.7.0)

and the Gemma router/gate-proj are no longer the bottleneck; the remaining MoE prefill gap is the

grouped expert-GEMM (`MUL_MAT_ID`). **Behaviour note:** the batched MoE router is llama-aligned and

value-preserving on factual/structural output, but a borderline top-k expert flip can make a

free-form generation tail diverge from the pre-v0.7.0 per-token router (deliberate; opt out with

`VF_MOE_ROUTER_BATCHED=0`).

**Gemma quant: `Q3_K_M` vs QAT (`Q4_0`) — no single default.** Gemma-4-26B-A4B runs as both `Q3_K_M`

(3-bit) and the QAT `Q4_0` line (`gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf`, quantization-aware-trained,

4-bit); they trade quality, speed, and context — pick by your priority. Measured on the 16 GB reference

card: QAT decode **~119 vs ~108 tok/s** and prefill **~1.25–1.27×** `Q3_K_M`, and QAT is stronger on

harder coding tasks; `Q3_K_M` is smaller, with more context headroom. vs llama.cpp-Vulkan (same Mesa):

QAT prefill 0.63× @512 → 0.86× @~1.8k, decode 0.85× (≈ the v0.7.0 matrix QAT row above). For a three-way

comparison that also includes the dense Qwen3.6-27B, see the wiki:

[Choosing a Model for Coding](https://github.com/maeddesg/vulkanforge/wiki/Choosing-a-Model-for-Coding).

**Context (engine setting, not a model default):** `--max-context 3072` is a good general coding value —

same VRAM as 2048 but enough generation budget for a substantial function (2048 can truncate ~180-line

outputs); `4096` is a spare-VRAM opt-in. The pre-load free-VRAM gate (`VF_VRAM_GATE=1`) guards against a

prior process's un-freed VRAM or compositor pressure. **`VULKANFORGE_KV_FP8=1` is required for the

gemma-4-26B-A4B MoE models** — with a non-FP8 KV cache (F16/F32) these MoE models produce invalid output,

so loading one without it now aborts with a clear, actionable error (override with

`VULKANFORGE_ALLOW_BROKEN_KV=1`); it also helps the 26B models fit 16 GB. Dense and Qwen3.5/3.6 models are

unaffected.

### v0.3.16 15-prompt mixed-workload benchmark

The decode column in the matrix above is `vulkanforge bench`-style tg128

(1-token prompt, constant-low KV). The 15-prompt suite is a real-workload mix (smoke /

code / prose / reasoning / context-stress / numerics / tokenizer) with

generations up to 1024 tokens — KV grows during decode, so steady-state

numbers are below `tg128`. Mesa 26.1.0 RADV.

| Model                              | Prefill avg | Decode avg | Avg W | tok/s/W | Quality |

|------------------------------------|------------:|-----------:|------:|--------:|--------:|

| Qwen3-8B Q4_K_M                    |     **701 t/s** ² |   104.0 t/s |  258 W |  0.40   | 15/15 ✓ |

| Llama-3.1-8B Q4_K_M                |     824 t/s |   **110.0 t/s** |  271 W |  0.41   | 15/15 ✓ |

| Qwen3-8B FP8                       |     559 t/s |    60.7 t/s |  193 W |  0.32   | 15/15 ✓ |

| Gemma-4-E2B-it (FP32 SafeTensors)  |     96 t/s ¹ |    33.7 t/s |  64 W  |  0.53   | 15/15 ✓ |

| **Gemma-4-E2B-it (Q4_K on-load³)** |    **106 t/s** | **52.0 t/s** | **37 W** | **1.39** | 15/15 ✓ |

¹ v0.3.15 lifted the v0.3.14 `force_per_token_prefill` workaround

(33 → 89 → 96 t/s on v0.3.14 / v0.3.15 / v0.3.16). The batch path is

bit-identical to the per-token reference

(`VULKANFORGE_FORCE_PER_TOKEN=1` keeps the v0.3.14 path available

as a bisect fallback). Decode-side is on par with the larger models

on a tok/s/W basis (best in the test) thanks to the 2 B parameter

count keeping power draw at 64 W.

² v0.3.16 closes the v0.3.15 Sprint-46H barrier regression on

Owner-only models (Qwen3, Llama). The Q-side barriers are now

gated on the Gemma-4 subscriber predicate; Qwen3-Q4_K_M prefill

recovers from 638 to **701 t/s** (+9.9 %).

³ v0.3.17 adds on-the-fly Q4_K quantization at model load

(`VF_QUANTIZE_ON_LOAD=1`). Gemma-4 SafeTensors weights are

quantized FP32 → Q4_K_M on the CPU (rayon-parallelized,

`Loaded in 13.2 s`) and routed through the existing Q4_K shader

pipeline. Decode +54 %, power −41 %, **tok/s/W 1.39 — best in the

suite**. VRAM 8.51 → 2.49 GiB (7.1× compression on the quantized

tensors; norms / embeddings stay FP32). Coherence identical to

the FP32 baseline.

### Native FP8 prefill pp=512 (Mesa 26.1+, native FP8 WMMA path)

| Model            | Scale type           | VulkanForge | vLLM 0.20.1 ROCm* |

|------------------|----------------------|------------:|------------------:|

| Llama-3.1-8B FP8 | per-tensor           |        1130 |             14757 |

| Qwen2.5-14B FP8  | per-channel          |         428 |             (n/a) |

| Qwen3-8B FP8     | block-wise [128,128] |        1118 |              2776 |

\* vLLM 0.20.1 is **not optimized for gfx1201** — model load logs

`Using default W8A8 Block FP8 kernel config. Performance might be

sub-optimal!`. Per-tensor uses `ROCmFP8ScaledMMLinearKernel` (specialized);

block-wise uses `TritonFp8BlockScaledMMKernel` (untuned). Run with

`VLLM_ROCM_USE_AITER=0 --enforce-eager` (only working configuration on RDNA4).

### Native FP8 decode (single-user, batch=1, decode-only power)

| Model            | VulkanForge (tok/s @ Avg W) | vLLM 0.20.1 ROCm (tok/s @ Avg W) | VF tok/s/W gain |

|------------------|----------------------------:|---------------------------------:|----------------:|

| Llama-3.1-8B FP8 |        **70 t/s @ 166 W**   |              53 t/s @ 159 W      | **+27 %**       |

| Qwen3-8B FP8     |        **62 t/s @ 125 W**   |              22 t/s @ 167 W      | **+267 %**      |

**VF wins decode 1.3–2×; vLLM wins prefill 2.5–12×.** Pick the engine

that fits the workload — single-user chat is VulkanForge, batch

serving is vLLM.

### CPU `lm_head` offload (v0.3.10, AVX-512)

`VF_CPU_LM_HEAD=1` moves the vocabulary projection onto the CPU as

Q6_K, freeing ~970 MB of VRAM. Hand-tuned AVX-512 kernel (Zen 4 /

Ice Lake+ runtime-detected; scalar fallback otherwise).

| Model              | GPU `lm_head`      | CPU `lm_head` (AVX-512) | VRAM saved | Verdict |

|--------------------|-------------------:|------------------------:|-----------:|---------|

| Llama-3.1-8B-FP8   | 70 tok/s           |  47.6 tok/s             |  −970 MB   | use for VRAM, not speed |

| **Qwen2.5-14B-FP8**| 13.5 tok/s         | **17.8 tok/s (+32 %)**  | **−970 MB**| **CPU wins both axes** |

The 14B win is structural: the GPU `lm_head` GEMV is bandwidth-bound

on 644 GB/s VRAM, and offloading it lets DDR5 (32 threads × L3 →

DDR5-5600 76 GB/s) carry the work in parallel with the rest of the

GPU pipeline freed up. Combined with the 970 MB VRAM saving, the

flag is a default-on candidate for 14B FP8 on Zen 4.

## Optional features

| Feature              | Flag                       | Requires                     | Effect                              |

|----------------------|----------------------------|------------------------------|-------------------------------------|

| FP8 model loading    | `VULKANFORGE_ENABLE_FP8=1` | Mesa 26.1+ (or 26.0.6 BF16 path) | Load HuggingFace FP8 SafeTensors    |

| Native FP8 WMMA      | (auto)                     | `shaderFloat8CooperativeMatrix` (Mesa 26.1+) | +45–58 % FP8 prefill |

| CPU `lm_head` offload| `VF_CPU_LM_HEAD=1`         | AVX-512F + BW + VL (Zen 4 / Ice Lake+) | −970 MB VRAM, 14B +32 % decode |

| On-the-fly Q4_K      | `VF_QUANTIZE_ON_LOAD=1`    | SafeTensors model with FP32 / BF16 weights | Quantize 2D weights to Q4_K_M at load; ~7× VRAM compression on quantized tensors, routes through the Q4_K shader pipeline. Gemma-4-E2B: decode +54 %, power −41 %, tok/s/W 1.39 |

| FP8 KV-cache         | `VULKANFORGE_KV_FP8=1`     | Mesa 26.1+ (heterogeneous head_dim auto-handled) | −50 % KV-cache VRAM. Gemma-4-26B-A4B: 880 → 440 MB. |

| KV prefix-reuse (API server, **default OFF**) | `VF_KV_PREFIX_REUSE=1` | `vulkanforge serve` (chat + completions) | Cross-request KV reuse: keeps the last request's KV and re-prefills only the suffix after the longest common token prefix → multi-turn / agentic (growing context) skips re-prefilling the shared history. Value-preserving (byte-identical to full re-prefill @temp 0). Single retained session (last request); errors/cancel invalidate. OFF = v0.4 stateless behavior, bit-identical. |

| MTP timing instrumentation (**default OFF**) | `VF_MTP_TIMER=1` | Qwen3.6-27B-MTP (`VF_MTP=1`) | Per-phase wall-clock + submit breakdown for the (parked, gated-off) MTP self-spec decode loop. Measurement only; MTP itself stays default-OFF. |

> Debug/test-only env flags (never set in production): `VF_KV_REUSE_DEBUG=1`

> (logs the prefix-reuse `k`/prefill-token count per request) and

> `VF_KV_REUSE_OVERSHOOT=N` (the byte-ident gate's teeth-test — intentionally

> over-reuses; do not use outside testing).

| Expert-Grouped MoE prefill (**default ON** since v0.5.3) | (auto); opt-out `VF_MOE_GROUPED=0` | MoE model (Gemma-4-26B-A4B et al.) | +33–39 % prefill on Gemma-4-26B-A4B Q3_K_M (146 → ~199 t/s @pp512), value-preserving, decode-neutral. +~265 MB VRAM (MoE models only; dense/GDN unaffected). `VF_MOE_GROUPED=0` = legacy GPU-direct GEMV prefill. |

| Batched Full-attention prefill layers (**default ON** since v0.6.0) | (auto); opt-out `VF_PREFILL_FULL_BATCHED=0` | Gemma-4-26B-A4B (MoE) | The 5 Full-attention layers stay on the batched prefill path instead of the 51D-AN per-token workaround (1 GPU drain per token per Full layer). Validated bit-identical to an independent per-query attention reference; note: changes default greedy output vs ≤v0.5.9 (FP-reorder, both paths correct). |

| MoE prefill GLU batching (**default ON** since v0.6.0) | (auto); opt-out `VF_MOE_GLU_BATCHED=0` | MoE model (Gemma-4-26B-A4B) | Per-(token,slot) GLU loop (125k mini-dispatches/chunk) → 1 dispatch/layer. Byte-identical output. |

| MoE prefill gather-combine (**default ON** since v0.6.0) | (auto); opt-out `VF_MOE_GATHER_COMBINE=0` | MoE model (Gemma-4-26B-A4B) | Per-(token,slot) scatter-FMA loop + per-dispatch barriers (125k+125k/chunk) → 1 gather dispatch/layer (hazard removed by design). Byte-identical output. **Combined with the two flags above: Gemma-4-26B prefill 198 → 1073 t/s @p512 (5.4×), 178 → 604 @p2048 (3.4×).** |

| Q-tiled flash attention hd256/512 (**experimental, default OFF — NOT recommended**) | `VF_FA_TILED_SLIDING=1` / `VF_FA_TILED_FULL=1` | Gemma-4 | Numerically validated (byte-identical single-tile, long-context recall green) but **slower than the default** (−31…−36 % prefill: KV-traffic goal reached, LDS occupancy lost — see `docs/prefill_arc_2026_06.md`). Research platform for occupancy tuning only. |

| qwen35 decode determinism fix (**default ON** since v0.5.4) | (auto); opt-out `VF_QWEN35_DEC_GATE_BARRIER=0` | Qwen3.6-27B-MTP (qwen35) | Correctness fix: adds the missing gated-output → o-proj barrier so decode reads the fully-gated `attn_out`. Makes greedy decode bit-deterministic (was a RAW race, widened by `head_dim=256`). Decode-neutral, qwen35-only (no effect on other models). |

| qwen35 batched prefill (**default ON** since v0.5.6) | (auto); opt-out `VF_QWEN35_BATCHED=0` | Qwen3.6-27B-MTP (qwen35) | qwen35 prefill runs through the batched executor (GDN + Full-Attn over seq_len=N; cross-token cores serial with state-carry, projections batched GEMM). **4–11× prefill speedup** (grows with prompt length; e.g. 1098-tok 32 → 358 t/s), deterministic + byte-identical to per-token (greedy), decode-neutral. `VF_QWEN35_BATCHED=0` = per-token prefill. (v0.5.5 shipped this opt-in; v0.5.6 fixed a conv-loop barrier-elision race that made it non-deterministic on near-tie prompts, and flipped it on by default.) |

| Batched MoE decode (v0.4.5)         | `VF_MOE_BATCHED_DECODE=1` | MoE model | +4.8 % decode on Gemma-4-26B-A4B Q3_K_M (27.3 → 28.6 t/s). 800 → 450 MoE dispatches/token. |

| VRAM budget probe (v0.4.5)          | (auto)                    | Linux sysfs `/sys/class/drm/card*/device/mem_info_vram_*` | Diagnostic. Warns when free VRAM < `VF_VRAM_HEADROOM_GIB` (default 1.0). |

| Tensor-load progress bar (v0.4.5)   | (auto, suppress with `VF_NO_LOAD_PROGRESS=1`) | — | `\r`-overwritten stderr progress bar during GGUF / SafeTensors upload. |

Most features are opt-in; the v0.5.x defaults (adaptive staging, parallel

MoE-router top-K, Expert-Grouped MoE prefill, qwen35 decode determinism fix)

are on by default and value-preserving. Without flags, VulkanForge runs GGUF models

on Mesa 26.1+ with no special configuration. `VF_FP8=auto` picks

the right FP8 path based on what the driver actually advertises.

## Quality (15-prompt benchmark)

The deterministic 15-prompt suite (greedy decoding, temperature = 0)

on all six production paths:

| Configuration                                    | Coherent   | Median decode |

|--------------------------------------------------|-----------:|--------------:|

| Qwen3-8B Q4_K_M GGUF                             |  **15/15** |     107 tok/s |

| Llama-3.1-8B Q4_K_M GGUF                         |  **15/15** |     112 tok/s |

| Qwen3-8B-FP8 native WMMA + activation quant      |  **15/15** |      62 tok/s |

| Qwen2.5-14B-FP8 native WMMA + CPU `lm_head`      |  **15/15** |      17 tok/s |

| Llama-3.1-8B-FP8 native WMMA                     |  **15/15** |      70 tok/s |

| Llama-3.1-8B-FP8 native WMMA + CPU `lm_head`     |  **15/15** |      46 tok/s |

| **Gemma-4-E2B-it SafeTensors** (v0.3.14, new)    |    **14/15** |      34 tok/s |

**104 / 105 prompts (99 %) coherent across the full suite.**

v0.3.14 adds the Gemma-4-E2B-it SafeTensors path; 14/15 coherent on

the suite, the single miss is the emoji-identification prompt where

the Gemma-4 tokenizer's surrogate-pair handling drops the input

emojis before the model sees them.

v0.3.11 closes the v0.3.10 Llama-FP8 per-tensor edge case (2/15

code-gen prompts collapsing to `!`) by porting the Sprint 39

per-block activation-absmax + rescale pattern to the per-tensor

WMMA path. The fix costs ~5 % on prefill (1197 → 1130 tok/s on

8B-FP8 pp=512) — an unavoidable trade for keeping post-RMS-norm

activations inside the FP8 E4M3 ±448 envelope.

## Driver requirements

| Mesa version | Capabilities                                                                  |

|--------------|-------------------------------------------------------------------------------|

| **26.1+**    | Default. Native FP8 WMMA via `shaderFloat8CooperativeMatrix`                  |

| **26.0.6**   | Legacy. GGUF + FP8 SafeTensors via the BF16 conversion path (no native FP8 WMMA) |

For 14B+ models, set `amdgpu.lockup_timeout=10000,10000` on the

kernel command line — the default 2 s compute timeout is too short

for long prefill submits. Setup details and troubleshooting in

[docs/INSTALLATION.md](docs/INSTALLATION.md).

## Documentation

- [docs/BENCHMARKS.md](docs/BENCHMARKS.md) — full VulkanForge vs llama.cpp vs vLLM tables, power data, methodology

- [docs/INSTALLATION.md](docs/INSTALLATION.md) — Mesa setup, kernel parameter, environment variables, troubleshooting

- [docs/MODELS.md](docs/MODELS.md) — supported GGUF / FP8 formats, model architectures, FP8 scaling strategies

- [CHANGELOG.md](CHANGELOG.md) — release history with per-sprint performance deltas

## CLI

```

vulkanforge chat   --model  [--tokenizer-from ] ...

vulkanforge bench  --model  [--tokenizer-from ] [--runs N]

vulkanforge serve  --model  [--host 127.0.0.1] [--port 8080] [--cors]

```

`vulkanforge chat --help` lists every flag (sampling, max-tokens,

think-filter, max-context). The chat REPL accepts `/help`, `/quit`,

`/reset` (clear KV cache + history without reloading the model), and

a single-shot mode via `VF_PROMPT="..."`.

## API Server (v0.4)

OpenAI-compatible HTTP server. Drop-in backend for Open WebUI,

SillyTavern, Continue.dev, the OpenAI Python SDK, LangChain, and

any other client that speaks Chat Completions.

```

vulkanforge serve --model ~/models/Qwen3-8B-Q4_K_M.gguf --port 8080

```

### Endpoints

| Path                         | Method | Description                          |

|------------------------------|--------|--------------------------------------|

| `/v1/chat/completions`       | POST   | Chat (streaming via SSE + sync JSON); **function/tool calling** (Qwen3/Hermes) |

| `/v1/completions`            | POST   | Legacy text-completion (raw prompt, no chat template; SSE + sync JSON) |

| `/v1/models`                 | GET    | List loaded model                    |

| `/health`                    | GET    | Liveness + KV-cache status           |

The non-prefixed aliases `/chat/completions`, `/completions` and

`/models` are also routed for clients that omit the `/v1/` prefix.

**`/v1/completions`** generates from a **raw `prompt` string** with **no

chat template applied** (the only difference from `/v1/chat/completions`;

the sampling surface, streaming, stop, and usage are identical). The

prompt is tokenized with special-token parsing, so a pre-rendered

chat-template string round-trips to the same tokens the chat endpoint

would produce. `prompt` is a single string in v0.5 (array-of-strings and

token-id-array prompts are planned). Accepted-but-**ignored** (warn-logged)

fields: `suffix`, `best_of`, `n>1`, `logprobs`, `logit_bias`, `echo`,

`presence_penalty`. Responses use the `text_completion` object with a

`cmpl-…` id and `choices[].text`.

### Examples

```bash

# Non-streaming

curl http://localhost:8080/v1/chat/completions \

  -H "Content-Type: application/json" \

  -d '{

    "model": "qwen3-8b-q4_k_m",

    "messages": [{"role": "user", "content": "What is a mutex?"}],

    "max_tokens": 100

  }'

# Streaming (SSE)

curl -N http://localhost:8080/v1/chat/completions \

  -H "Content-Type: application/json" \

  -d '{

    "model": "qwen3-8b-q4_k_m",

    "messages": [{"role": "user", "content": "Hi"}],

    "stream": true,

    "stream_options": {"include_usage": true},

    "max_tokens": 50

  }'

# Legacy text-completion — raw prompt, no chat template (non-streaming)

curl http://localhost:8080/v1/completions \

  -H "Content-Type: application/json" \

  -d '{

    "model": "qwen3-8b-q4_k_m",

    "prompt": "The capital of France is",

    "max_tokens": 16,

    "temperature": 0

  }'

# Legacy text-completion (streaming SSE)

curl -N http://localhost:8080/v1/completions \

  -H "Content-Type: application/json" \

  -d '{

    "prompt": "17 * 23 =",

    "max_tokens": 16,

    "temperature": 0,

    "stream": true,

    "stream_options": {"include_usage": true}

  }'

# Function/tool calling (OpenAI-compatible, Qwen3/Hermes models)

curl http://localhost:8080/v1/chat/completions \

  -H "Content-Type: application/json" \

  -d '{

    "model": "qwen3-8b-q4_k_m",

    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],

    "tools": [{"type":"function","function":{

      "name":"get_weather",

      "description":"Get current weather for a location",

      "parameters":{"type":"object","properties":{"location":{"type":"string"}},"required":["location"]}

    }}],

    "max_tokens": 128, "temperature": 0

  }'

# → choices[0].message.tool_calls:[{id,"type":"function","function":{"name":"get_weather","arguments":"{\"location\":\"Paris\"}"}}], finish_reason:"tool_calls"

# Send the result back as {"role":"tool","tool_call_id":"","content":"15C sunny"} for the final answer.

```

**Function/tool calling** (`/v1/chat/completions`) is OpenAI-compatible for

**Qwen3/Hermes** models: pass `tools` + optional `tool_choice` (`"auto"`/`"none"`);

calls come back as `message.tool_calls` + `finish_reason:"tool_calls"`; send

results back via `role:"tool"` messages. v1 covers the Qwen3/Hermes

`` format; Llama-3.1/Mistral formats, `tool_choice:"required"`/named

forcing, and char-incremental argument streaming are planned. Requests **without**

`tools` are unaffected.

### Options

| Flag                   | Default       | Notes                                              |

|------------------------|---------------|----------------------------------------------------|

| `--host`               | `127.0.0.1`   | Use `0.0.0.0` for remote/Docker (no auth in v0.4)  |

| `--port`               | `8080`        | TCP listen port                                    |

| `--cors`               | off           | Enable CORS (browser UIs on different ports)       |

| `--ctx-size`           | `2048`        | KV-cache capacity in tokens                        |

| `--served-model-name`  | basename      | Override the `model` id reported by `/v1/models`   |

| `--tokenizer-from`     | —             | Reserved for SafeTensors-serve (v0.4.1)            |

### Scope

- **In:** Streaming + non-streaming chat, multi-turn history,

  **`/v1/completions`** (raw-prompt), **OpenAI function/tool calling**

  (Qwen3/Hermes), **cross-request KV prefix-reuse** (`VF_KV_PREFIX_REUSE=1`,

  opt-in), `frequency_penalty` → repetition-penalty mapping,

  `stream_options.include_usage`, `developer` role alias for `system`,

  `chat_template_kwargs.enable_thinking` toggle for `` filtering.

- **Out:** Vision content, embeddings, auth, SafeTensors directory models

  (use `vulkanforge chat` for those); tool-calling beyond the Qwen3/Hermes

  format and `tool_choice` beyond auto/none (follow-ups).

### Using VulkanForge with OpenCode (and OpenAI-compatible agents)

VulkanForge works as a backend for [OpenCode](https://opencode.ai) and other

OpenAI-compatible coding agents (function/tool calling). Validated end-to-end on

**Qwen3-8B Q4_K_M**: real tool calls execute and files are written to disk.

**1. Serve with a raised context** — the 2048 default is too small for an

agent's large system prompt (~7,500 tokens), so set `--ctx-size`:

```fish

vulkanforge serve -m ~/models/Qwen3-8B-Q4_K_M.gguf --port 8080 --ctx-size 16384

```

**2. `~/.config/opencode/opencode.json`:**

```json

{

  "$schema": "https://opencode.ai/config.json",

  "model": "vulkanforge/qwen3-8b-q4_k_m",

  "provider": {

    "vulkanforge": {

      "npm": "@ai-sdk/openai-compatible",

      "name": "VulkanForge (local)",

      "options": { "baseURL": "http://localhost:8080/v1" },

      "models": { "qwen3-8b-q4_k_m": { "name": "Qwen3-8B (VF)" } }

    }

  }

}

```

**Status / Limitations.** Functional: tool calls execute and files are written

(validated end-to-end on Qwen3-8B Q4_K_M). **Not yet optimized:** per-turn

latency is high because VulkanForge re-prefills the agent's large system prompt

each turn — prefill is the current bottleneck (per-turn latency in the tens of

seconds, dominated by prefill, not generation). Recommended model: **Qwen3-8B

Q4_K_M**. The 27B (Q3_K_S) is impractical for agentic use today (prefill cost +

aggressive quantization). Set **`VF_KV_PREFIX_REUSE=1`** to speed up multi-turn

(opt-in; mitigates later turns, not the first).

> **v0.5.8 — Gemma-4 (MoE) now coherent on serve.** Earlier serve builds produced

> garbage on Gemma-4 specifically (the API path skipped the GPU MoE-router init the

> CLI ran, and the GGUF path missed the channel-thought template detection). Fixed in

> v0.5.8; dense models (Llama/Qwen) were never affected. The prefill-latency caveat

> above still applies to the 26B for agentic use.

## Limitations

- Single-stream only — no batch inference, no concurrent sessions on

  one `Forward` instance.

- vs llama.cpp Vulkan (same backend, v0.7.0 matrix): dense models reach

  prefill **parity** (~0.93–1.04×) and near-parity decode (~0.95–0.97×);

  Gemma-4 MoE prefill is still behind (~0.64× at short context, ~0.83–0.89×

  @p2048 — the grouped expert-GEMM), MoE decode ~0.87–0.91×. Coopmat is

  prefill-only on this codebase.

- FP8 prefill structurally behind ROCm-specialized kernels (vLLM's

  `ROCmFP8ScaledMMLinearKernel` is in a different class).

- `vulkanforge bench` accepts only Q4_K_M GGUF; Q8_0 chat works but

  does not bench.

- Mistral / Llama-2 SPM tokenizer not yet wired for FP8 SafeTensors

  (only `gpt2` tokenizer family).

For the full architectural notes and the v0.2.x optimization audit

(nine falsified hypotheses against the residual gap to llama.cpp),

see [CHANGELOG.md](CHANGELOG.md).

## License

VulkanForge is licensed under the

[GNU General Public License v3.0](LICENSE).
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/maeddesg/vulkanforge

Awesome Lists containing this project

README