{"id":50515369,"url":"https://github.com/croll83/llama.cpp-dgx","last_synced_at":"2026-06-02T23:31:13.845Z","repository":{"id":353746523,"uuid":"1219206906","full_name":"croll83/llama.cpp-dgx","owner":"croll83","description":"llama.cpp fork optimized for NVIDIA DGX Spark / GB10 (Blackwell, SM 12.1) — TurboQuant weights + KV, NVFP4, DFlash MTP","archived":false,"fork":false,"pushed_at":"2026-05-08T16:57:10.000Z","size":321406,"stargazers_count":5,"open_issues_count":1,"forks_count":1,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-08T18:43:19.541Z","etag":null,"topics":["blackwell","dflash","gb10","llama-cpp","nvfp4","speculative-decoding","turboquant"],"latest_commit_sha":null,"homepage":null,"language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/croll83.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":"CODEOWNERS","security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":"AUTHORS","dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null},"funding":{"github":["turbo-tan"],"custom":["https://gofund.me/7d7feb107"]}},"created_at":"2026-04-23T16:31:46.000Z","updated_at":"2026-05-08T16:57:14.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/croll83/llama.cpp-dgx","commit_stats":null,"previous_names":["croll83/llama.cpp-dgx"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/croll83/llama.cpp-dgx","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/croll83%2Fllama.cpp-dgx","tags_url":"https://repos.ecosyste.ms/api/
v1/hosts/GitHub/repositories/croll83%2Fllama.cpp-dgx/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/croll83%2Fllama.cpp-dgx/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/croll83%2Fllama.cpp-dgx/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/croll83","download_url":"https://codeload.github.com/croll83/llama.cpp-dgx/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/croll83%2Fllama.cpp-dgx/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33841995,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-02T02:00:07.132Z","response_time":109,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["blackwell","dflash","gb10","llama-cpp","nvfp4","speculative-decoding","turboquant"],"created_at":"2026-06-02T23:31:13.042Z","updated_at":"2026-06-02T23:31:13.826Z","avatar_url":"https://github.com/croll83.png","language":"C++","funding_links":["https://github.com/sponsors/turbo-tan","https://gofund.me/7d7feb107"],"categories":[],"sub_categories":[],"readme":"# llama.cpp-dgx\n\n\u003e ## ⚠️ DEPRECATED — use upstream `ggml-org/llama.cpp` instead\n\u003e\n\u003e **Status:** As of 2026-05-25, upstream llama.cpp has surpassed this fork for our target workload (Qwopus3.6-27B + NVFP4 + speculative 
decode on GB10 / SM 12.1).\n\u003e\n\u003e ### Why upstream now wins\n\u003e\n\u003e 1. **Native MTP speculative decode** (`--spec-type draft-mtp`, PR [#22673](https://github.com/ggml-org/llama.cpp/pull/22673) + #23269 + #23461). Co-trained drafter delivers 45-85% acceptance vs 13-20% with our DFlash + post-hoc drafter.\n\u003e 2. **NVFP4 + MTP scale tensors** ([#23563](https://github.com/ggml-org/llama.cpp/pull/23563), 2026-05-23). Closes the last gap that forced us to fork.\n\u003e 3. **Stable VRAM footprint**. v5 + DFlash leaks ~2-3 GB/hour into the draft KV pool (positions are written every tree step but never compacted), reaching OOM in days under sustained traffic. Upstream pre-allocates and reuses cleanly: 30 GB GPU compute pool stays flat over hours.\n\u003e 4. **Lower system memory**. Upstream: ~44-67 GB total system used (with cache-ram 16 GB lazy-filled and reclaimable). v5: 60-78 GB and growing.\n\u003e 5. **Practical decode parity**. Stock + MTP (n_max=5) sustains 13-26 t/s on Qwopus3.6-27B-Abl-NVFP4; v5 + DFlash sustains 11-15 t/s and degrades on long-context multi-slot.\n\u003e\n\u003e ### Where this fork is still useful\n\u003e\n\u003e - **TurboQuant (TQ3_0 / TQ3_4S) KV cache and weights** — upstream does not expose `tq3_0` as a `--cache-type-v` value yet. 
If you need ~12% extra KV bandwidth savings on a memory-bound decode, this fork keeps the kernels (`ggml/src/ggml-cuda/tq3-prefill.cuh`, `fattn-vec-instance-tq3_*`).\n\u003e - **DFlash external drafter integration** (`--dflash` + `--dflash-draft`) — relevant only if you have a target architecture without MTP layers and a separately trained DFlash drafter that matches it.\n\u003e - **Spiritbuun KV cache fixes** and the GDN chunked kernel tuned for 99 KB SM 12.1 shared memory budget — both already merged or being merged upstream; this fork carries the original variants for archival reference.\n\u003e\n\u003e ### Migration\n\u003e\n\u003e ```bash\n\u003e # Drop-in replacement\n\u003e git clone https://github.com/ggml-org/llama.cpp ~/llama-cpp-stock\n\u003e cd ~/llama-cpp-stock\n\u003e cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=\"120;121\" -DGGML_CUDA_FA_ALL_QUANTS=ON\n\u003e cmake --build build --target llama-server llama-quantize llama-cli -j$(nproc)\n\u003e\n\u003e # Replace --dflash --dflash-draft ... with:\n\u003e --spec-type draft-mtp\n\u003e ```\n\u003e\n\u003e See [croll83/jarvis/infrastructure/gb10/](https://github.com/croll83/jarvis/tree/main/infrastructure/gb10) for the consolidated GB10 deployment (systemd service + cmdline).\n\u003e\n\u003e ---\n\n\n\n\u003e **Fork of [ggml-org/llama.cpp](https://github.com/ggml-org/llama.cpp) optimized for NVIDIA DGX Spark / GB10 (Blackwell, SM 12.1).**\n\n[![Upstream](https://img.shields.io/badge/upstream-ggml--org%2Fllama.cpp-blue)](https://github.com/ggml-org/llama.cpp)\n[![CUDA](https://img.shields.io/badge/CUDA-12.8%2B-green)](https://docs.nvidia.com/cuda/)\n[![Arch](https://img.shields.io/badge/SM-121a-orange)](#)\n\n## Why this fork\n\n`llama.cpp-dgx` is a runtime for hybrid Qwen3.5/3.6 / Qwopus 27B-class models on a single GB10 (DGX Spark, 128 GB unified memory). 
It composes five upstream-or-near-upstream tracks that do not yet land together in [`ggml-org/llama.cpp`](https://github.com/ggml-org/llama.cpp), plus a small number of Blackwell-specific tweaks. Verified against `upstream/master` at `0adede866` (re-merge cadence: weekly).\n\nThe five tracks:\n\n1. **TurboQuant on weights** — TQ3_0 / TQ3_4S / TQ3_1S 3-bit weight quantization with Lloyd-Max codebooks. Imported from [@turbo-tan / llama.cpp-tq3](https://github.com/turbo-tan/llama.cpp-tq3) (`62eb27dce` baseline) — see `ggml/src/ggml-turbo-quant.c`. Used here to ship `Qwopus3.6-27B-v1-Abliterated-preview` at ~14 GiB / ~3.5 bpw with PPL parity to Q3_K_S.\n\n2. **TurboQuant on KV cache** — `turbo2_0` / `turbo3_0` / `turbo4_0` and `turbo3_tcq` / `turbo2_tcq` (Trellis-Coded Quant) types for the K/V cache, with FWHT (Fast Walsh–Hadamard Transform) rotation matrices baked into the FA kernels. Imported from [@spiritbuun / buun-llama-cpp](https://github.com/spiritbuun/buun-llama-cpp) — see `ggml/src/ggml-cuda/fattn-common.cuh` and the `d_turbo_centroids_*_fattn` codebooks. `tq3_0` K+V on the standard llama path lands ~22% smaller KV vs Q4_0 with no measurable decode regression on GB10 (matches the upstream PR numbers).\n\n3. **NVFP4 (FP4 tensor cores) inference** — native NVFP4 matmul + per-tensor scale2 application after the kernel, tracking the WIP upstream PRs ([#21089](https://github.com/ggml-org/llama.cpp/pull/21089), [#20977](https://github.com/ggml-org/llama.cpp/issues/20977)). Loader path supports plain NVFP4 (NVIDIA ModelOpt `NVFP4_DEFAULT_CFG`); the AWQ variant (`NVFP4_AWQ_LITE_CFG`) is intentionally not used because llama.cpp does not apply the AWQ `.pre_quant_scale` channel-wise factor at inference and therefore returns garbage tokens when the model is exported with the AWQ recipe. The dflash custom target graph (see below) also applies the per-tensor scale2 after every `ggml_mul_mat` so NVFP4 + speculative decoding work end-to-end. 
The matmul itself uses Blackwell's native `mma.sync.aligned.kind::mxf4nvf4.block_scale.scale_vec::4X.m16n8k64` PTX (see `ggml/src/ggml-cuda/mma.cuh`'s `mma_block_scaled_fp4`) — no dequantize-to-bf16 round-trip — so on AEON-XS NVFP4 the prefill rate is essentially VRAM-bandwidth-bound, not compute-bound.\n\n4. **DFlash MTP speculative decoding** — block-diffusion draft + DDtree verify integration, ported from [Luce-Org / lucebox-hub](https://github.com/Luce-Org/lucebox-hub) (`tools/dflash-cli/`). Wired into `llama-server` so that the dflash custom target graph runs in place of `llama_decode` for text-only requests, while `mmproj` (vision) requests fall back to the standard path. Includes causal sliding-window-attention support for the Qwen3.6-27B-DFlash draft (4 SWA layers + 1 full-attention layer), and a borrow path that lets dflash share the host `llama_model`'s on-GPU weight tensors instead of re-uploading them — saves ~18.5 GiB of VRAM when `--mmproj` is set.\n\n5. **Chunk-fused GatedDeltaNet kernel for prefill** — from-scratch, GB10-tiled (sm_120/121) replacement for FlashQLA on the GDN forward path. FlashQLA is Hopper-targeted and needs 192 KB shared memory per CTA, exceeding the 99 KB sm_121a opt-in cap; this fork therefore ships a 4-kernel pipeline (`cumsum` → `kkt_solve` → `prepare_h` → `fused_fwd`) sized for the Blackwell consumer budget, all `wmma 16×16×16` bf16 with fp32 accumulators on Blackwell tensor cores. Active for prefill ubatches with `n_tokens \u003e= 64` (chain mode, `S_v = 128`); falls back to the per-token kernel for decode, tree mode, or KDA. **Bit-equivalent output** to the per-token recurrence on greedy decode (verified by sending the same prompt through both paths with `temperature=0` and getting byte-identical 40-token continuations). Microbench (B=1 T=192 H=16): 1537 → 162 us per call (9.5×); end-to-end prefill rate is parity since GDN is ~1-2% of the total on AEON-XS NVFP4 (FFN is the limiter, see point 3). 
Quality benefit observed in production: small token-level hallucinations that the buggy pre-fix scalar path emitted on long structured outputs are gone, because the fix during tuning corrected a latent ~3 % per-element drift in the original FlashQLA-style A_sol formulation. See [`docs/rfc-gdn-chunk-kernel.md`](docs/rfc-gdn-chunk-kernel.md). Env-var escape hatch: `GGML_GDN_CHUNK_DISABLE=1` forces per-token.\n\n### Blackwell / GB10 specifics (custom vs upstream)\n\n- `CMAKE_CUDA_ARCHITECTURES` extended with `120a-real` and **`121a-real`** (Blackwell GB10 / B200) — see [`ggml/src/ggml-cuda/CMakeLists.txt`](ggml/src/ggml-cuda/CMakeLists.txt).\n- F32 → TQ3_0 CPY kernel wired in [`ggml/src/ggml-cuda/cpy.cu`](ggml/src/ggml-cuda/cpy.cu) and [`ggml/src/ggml-cuda/set-rows.cu`](ggml/src/ggml-cuda/set-rows.cu) so TQ3_0 V-cache works under flash-attention without falling back to Q8_0.\n- TCQ (Trellis-Coded Quant) decode-time `V alpha` made context-adaptive in [`ggml/src/ggml-cuda/fattn.cu`](ggml/src/ggml-cuda/fattn.cu) — emits the `TCQ decode: context-adaptive V alpha enabled` log line on init.\n- Half-block dispatch + I32/I64 `set_rows` indexing dual-path so the same kernels work whether indices come from `llama_kv_cache` (i64) or the dflash custom graph (i32).\n- TurboQuant FWHT rotation matrices live in `__constant__` memory ( `d_turbo_wht_signs1_fattn`, `d_turbo_wht_signs2_fattn` ) and are applied per-head with 128-element groups, not the upstream Hadamard rotation.\n- Chunk-fused GatedDeltaNet pipeline in [`ggml/src/ggml-cuda/gated_delta_net_chunk.cu`](ggml/src/ggml-cuda/gated_delta_net_chunk.cu) and the header-only kernel templates in [`ggml/src/ggml-cuda/gated_delta_net_chunk_kernels.cuh`](ggml/src/ggml-cuda/gated_delta_net_chunk_kernels.cuh). Routed via `ggml_cuda_op_gated_delta_net_dispatch` in `ggml-cuda.cu`. 
Standalone numerical tests in [`tests/test-gdn-chunk.cu`](tests/test-gdn-chunk.cu) + Python reference in [`tools/test/gdn_chunk_ref.py`](tools/test/gdn_chunk_ref.py).\n\n## Models we ship \u0026 test against\n\n- **Target**: [`croll83/Qwopus3.6-27B-v1-Abliterated-preview`](https://huggingface.co/croll83/Qwopus3.6-27B-v1-Abliterated-preview), an abliterated derivative of [`Jackrong/Qwopus3.6-27B-v1-preview`](https://huggingface.co/Jackrong/Qwopus3.6-27B-v1-preview), itself a Claude-distilled SFT on `Qwen/Qwen3.6-27B` (qwen35 hybrid arch: 16 full-attention + 48 GatedDeltaNet layers, ~28B params, 262K context). Repo ships BF16 safetensors, mmproj F16, and GGUFs (Q4_K_M, TQ3_4S, NVFP4-plain).\n- **Draft (DFlash)**: [`z-lab/Qwen3.6-27B-DFlash`](https://huggingface.co/z-lab/Qwen3.6-27B-DFlash), a block-diffusion drafter with 5 layers (4 SWA + 1 full), `block_size=16`, BF16 safetensors. Required when running with `--dflash`.\n- **Older draft**: [`z-lab/Qwen3.5-27B-DFlash`](https://huggingface.co/z-lab/Qwen3.5-27B-DFlash), non-causal and full-attention. Slightly lower accept rate than the Qwen3.6 draft on Qwen3.6 targets, but works without SWA support in the inference engine.\n\n## Install / build\n\nSame as upstream; see [`docs/build.md`](docs/build.md). Quickstart for GB10:\n\n```bash\ngit clone https://github.com/croll83/llama.cpp-dgx.git\ncd llama.cpp-dgx\ncmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=121a-real\ncmake --build build --target llama-server llama-cli llama-quantize llama-dflash llama-dflash-server -j 8\n```\n\n`121a-real` targets GB10 (SM 12.1) specifically. Use `120a-real` for SM 12.0 Blackwell parts (B200 is SM 10.0 and needs a different target), or `native` to auto-detect.\n\n## DGX-only flags reference\n\nThese flags / env vars exist only in this fork (or have changed semantics vs upstream):\n\n| Flag / env | Where | What it does |\n|---|---|---|\n| `-ctk tq3_0` / `-ctv tq3_0` | `llama-server`, `llama-cli` | Use TQ3_0 (3.5 bpw) for the standard llama_kv_cache K/V. 
Saves ~22% vs Q4_0; fattn vec kernel handles the 256-stride alignment. K=TQ3_0 is also wired (ours commit `7b5f82569`). |\n| `-ctk turbo3` etc. | (planned) | spiritbuun TurboQuant types for KV cache. Type names are exposed in `ggml.h` (`GGML_TYPE_TURBO2_0`, `TURBO3_0`, `TURBO4_0`, `TURBO3_TCQ`, `TURBO2_TCQ`) but the F32→TURBO* CPY/set-rows wiring is still TODO — see `docs/dflash_kv_quant_status.md`. |\n| `--dflash` | `llama-server` | Enable DFlash MTP speculative decoding. Replaces `llama_decode` with the dflash custom target graph for text-only requests; `mmproj` requests fall through to the standard path. |\n| `--dflash-draft PATH` | `llama-server` | Path to the DFlash draft model. Accepts `.safetensors` (BF16, ~3.3 GiB, default) or `.gguf` (community Q8_0 quants like [spiritbuun/Qwen3.6-27B-DFlash-GGUF](https://huggingface.co/spiritbuun/Qwen3.6-27B-DFlash-GGUF), ~1.8 GiB). Q8_0 GGUF saves ~1.5 GiB VRAM but trades ~16 % decode throughput and ~30 % accept rate vs BF16 in our dflash MTP rollback path — keep BF16 unless VRAM-constrained. |\n| `--dflash-budget N` | `llama-server` | DDtree node budget per draft step. Default 22; sweep summary: 22 is balanced, 32 wins on JSON (+25%), 64+ saturates. |\n| `--dflash-max-ctx N` | `llama-server` | Per-slot dflash KV ring size. Default `--ctx-size / n_parallel`. |\n| `--dflash-prefill-ubatch N` | `llama-server` | dflash prefill ubatch size. Default 192 on GB10. |\n| `--dflash-fa-window N` | `llama-server` | **Sliding window FA on full-attn layers (port of [Luce-Org/lucebox-hub#26](https://github.com/Luce-Org/lucebox-hub/pull/26)).** Default 0 (off). When set (recommended 2048), limits FA to the last N KV positions per query: cuts FA cost from O(kv_len) to O(N) at long contexts, +26 % overall throughput on hermes agent traffic in our bench. **WARNING — agent-incompatible:** the window drops attention to early KV positions, including the system prompt. 
On agents whose identity / tool list is in the system prompt (Dark Jarvis SOUL, etc.), the model loses context once kv_start \u003e window and falls back to vanilla refusals. Safe for raw continuation / story-style workloads; for agents wait for an attention-sink follow-up that pins the first K tokens + last W tokens. Production default is 0. |\n| `DFLASH27B_KV_V=tq3_0` | env | dflash V-cache type override (`q8_0` default). **Do not use in production on either Qwopus3.6 OR AEON-7 XS.** On Qwopus the drift pathology is loud (token loop, 21-min hang). On AEON-7 XS — even with `linear_attn.conv1d` preserved BF16 — the drift is more subtle: short prose stress tests pass cleanly, but at \u003e25K accumulated context with structured tool_call generation the model emits malformed JSON (path truncation, missing `arguments` key). The conv1d-BF16 preservation reduces but does not eliminate the cumulative attention-score noise that TQ3 V introduces across thousands of DDtree verify reads — the SSM `in_proj_a/b/qkv/z` projection matmuls are the next-most-likely culprits and they ARE FP4-quantised on the XS body. Standard llama path `-ctv tq3_0` is unaffected on either body. |\n| `DFLASH27B_KV_K=tq3_0` | env | **Experimental — do NOT use in production.** Boots and runs short prompts cleanly (commit `6858a4192` fixed the SIGSEGV by forcing the VEC fattn kernel) but hits a long-generation token-loop pathology on agent workloads (~78K committed tokens before it converges on a single repeated token id). Suspected cause: cumulative attention-score degradation from 3.5 bpw K compounded over thousands of decode steps in the dflash custom graph. Standard llama path `-ctk tq3_0` is unaffected and remains the recommended K-quant shortcut. |\n| `DFLASH27B_KV_TQ3=1` | env | Both dflash K and V to TQ3_0 in one shot. **Do NOT use in production** — combines both pathologies above. |\n| `DFLASH27B_KV_F16=1` | env | Force dflash KV back to f16 (regression baseline). 
|\n| `DFLASH27B_SHARE_KV=1` | env | **Share the standard `llama_kv_cache` K/V buffers with the dflash session** (target body only — not the drafter). The dflash custom graph builds non-owning ggml views into `llama_kv_cache.layers[il].k/v` instead of allocating its own per-layer K/V. Saves **~9 GiB resident** at np=2 with `--dflash-max-ctx 131072` and `-c 262144`, by removing the duplicate K/V the two paths otherwise each carry. The view layout uses non-standard ggml strides (`nb2 \u003c nb1`, since llama_kv_cache packs heads onto axis 0); every kernel touched (FA-vec direct vec_dot, FA-MMA-F16 via to_fp16_nc dequant, cpy_q_q, set_rows\u003cblock_q8_0\u003e) handles arbitrary strides via byte-offset math (no kernel patches needed — see `docs/rfc-unified-target-cache.md`). The K/V types come from `-ctk`/`-ctv` on this path; `DFLASH27B_KV_K`/`_V` are ignored. **Default 0 (off)** until production validation completes. Caveat: when both share-kv and `--mmproj` are on and the slot processes a vision request followed by a text request, the dflash session detects the kv_end desync and forces a full re-prefill (correct, but costs one prefill round) — the standard path's writes are preserved bit-for-bit since both paths produce the same K/V projections from the same target weights. |\n| `TURBO_LAYER_ADAPTIVE=N` | env | Layer-adaptive Turbo KV quant (1–11 strategies; 0 = uniform, default). |\n| `GGML_GDN_CHUNK_DISABLE=1` | env | Forces the per-token GDN kernel even for prefill ubatches `\u003e= 64` tokens. The chunked GDN path (4-kernel pipeline, wmma 16×16×16, 9.5× microbench speedup over the scalar fallback) is on by default; this escape hatch reverts to the legacy per-token recurrence kernel for A/B comparison or as a safety bypass. Greedy decoding output is bit-identical between the two paths after the math fix in commit `3c66666df`, so the production effect is performance-only. 
See [`docs/rfc-gdn-chunk-kernel.md`](docs/rfc-gdn-chunk-kernel.md) §9 for the full tuning log. |\n\n## Recommended runtime config (GB10, 128 GB unified memory)\n\nFor 262K context with `-np 2` (two persistent slots, e.g. agent + memory writer) on the Qwopus3.6 27B target:\n\n```bash\n./build/bin/llama-server \\\n  -m /path/to/Qwopus-27B-NVFP4-plain.gguf \\\n  --mmproj /path/to/mmproj-Abliterated-F16.gguf \\\n  --dflash --dflash-draft /path/to/qwopus36-dflash-v4/model.safetensors \\\n  --dflash-budget 22 --dflash-max-ctx 131072 --dflash-prefill-ubatch 192 \\\n  --host 0.0.0.0 --port 30000 -c 262144 -np 2 -ngl 99 \\\n  -ctk q8_0 -ctv q8_0 \\\n  --dflash-fa-window 16384 --dflash-fa-sink 16384 \\\n  --slot-prompt-similarity 0.5 --cache-reuse 256 \\\n  --jinja --reasoning auto --alias dark-opus --no-webui --no-warmup\n```\n\nThis gives ~40 GiB resident on GPU (NVFP4 weights borrowed from llama_model + standard KV K=Q8_0 V=Q8_0 + dflash V=Q8_0 ring, dflash K stays Q8_0, np=2). The TQ3_0 KV path would save another ~6 GiB but is unsafe on agent / coding workloads — both `DFLASH27B_KV_V=tq3_0` and `DFLASH27B_KV_K=tq3_0` cause a long-generation token-loop pathology after ~15-30K accumulated context. The standard llama path `-ctk tq3_0 -ctv tq3_0` is unaffected and remains a valid VRAM-saver if you can run **without** `--dflash`. The `--dflash-fa-sink 16384 --dflash-fa-window 16384` pair caps FA cost at long context while preserving the system prompt (attention-sink port, see `docs/dflash_kv_quant_status.md`).\n\n## Troubleshooting\n\n- **Server dies silently right after `DFlash run:` log line, no `GGML_ASSERT` or CUDA error in the output.** Pre-`6858a4192` symptom of K=TQ3_0 selecting the MMA fattn kernel — which has no TQ3_0 dequant entry — and crashing on a NULL function pointer inside `launch_fattn\u003c...\u003e()`. Fixed by extending the force-VEC predicate in `ggml-cuda/fattn.cu` to include `GGML_TYPE_TQ3_0`. 
Re-pull and rebuild; if you still see this, get a backtrace with `gdb -batch -ex run llama-dflash \u003cargs\u003e` to confirm.\n- **Agent stalls at \"No response from provider for 180s\", server log shows endless `[step N] committed=… last_tok=X next=X` with the same `X` for thousands of iterations.** Token-loop pathology on the dflash custom decode graph: the model produces a self-reinforcing prediction on a single token after enough accumulated context. Two known triggers, both fixed by switching the dflash KV cache off TQ3_0:\n  - `DFLASH27B_KV_K=tq3_0` — onset around 78K committed tokens; symptom `last_tok=328 next=328` style.\n  - `DFLASH27B_KV_V=tq3_0` — onset varies by body but always present in long structured-output workloads. **Qwopus3.6**: loud drift at 15-25K context, commit/step blows up to 14+, output is a wall of `0` tokens or Chinese filler, 21-min slot hang. **AEON-7 XS** (conv1d preserved BF16): subtle drift starting around 25K context — short prose passes cleanly but tool_call generation emits malformed JSON (truncated paths, missing `arguments` key). The conv1d-BF16 preservation helps but the SSM projection matmuls (`in_proj_a/b/qkv/z`) are still FP4-quantised on the XS body and contribute residual noise. **TL;DR: keep V=Q8 on dflash regardless of body.**\n\n  Stop the server, drop both env overrides (or set them to `q8_0`), restart. Standard path `-ctk tq3_0 -ctv tq3_0` (without `--dflash`) is unaffected. Tracking in [`docs/dflash_kv_quant_status.md`](docs/dflash_kv_quant_status.md).\n- **`GGML_ASSERT(buf != NULL \u0026\u0026 \"tensor buffer not set\")` from `ggml_backend_tensor_set` during dflash prefill.** Pre-fix symptom on `DFLASH27B_KV_V=q8_0` (or any non-TQ3_0 V-type): `build_full_attn_block` only references the `kv_idxs` input on the `set_rows` path (TQ3_0-only), so on the cpy fallback path `ggml_gallocr` dead-code-eliminates the input and `sg.kv_idxs-\u003ebuffer` stays NULL. 
Fixed in `tools/dflash-cli/session.cpp` by gating the upload on `sg.kv_idxs-\u003ebuffer != nullptr`. If you see this on a build past that commit, your build tree is stale: run `cmake --build build --target llama-server -j` and relaunch.\n- **`speculative decoding not supported by this context` log line on init.** Expected with `--dflash`: the legacy speculative-decoding path's compat probe fails because the dflash session takes over. The DFlash session itself is unaffected.\n- **`cache_reuse is not supported by multimodal` log line.** Expected with `--mmproj` + `--cache-reuse`. The prompt cache stays effective for slot persistence; only the cross-request prefix-match path is disabled.\n- **`fattn vec kernel` aborts with K%256 != 0.** Either the cache type is one of the TQ3_0 / TURBO* family on the standard path (the fork bumps `fattn_stride` to 256 automatically; make sure you're on `origin/feature/dflash-integration` or later) or you set `--dflash-max-ctx` to a non-256-aligned value.\n- **`Failed to parse input at pos 41: 不休ief粟…`** Garbage output on NVFP4 means the per-tensor `.scale` tensors did not load. Re-export the model with NVFP4 plain (`NVFP4_DEFAULT_CFG`), not AWQ (`NVFP4_AWQ_LITE_CFG`); see [`tools/dflash-cli/quantize_nvfp4_plain.py`](tools/dflash-cli/quantize_nvfp4_plain.py).\n- **OOM during model load with `--mmproj` + `--dflash`.** The borrow path is auto-enabled in this configuration; if you see two ~15 GiB \"CUDA0 model buffer\" log lines instead of one, re-pull and rebuild; the patch is in commit `87102e46b`.\n\n## Memory savings stack\n\nTwo production-stable end-states, depending on the target body. Both use np=2, mmproj\nloaded, dflash + dflash drafter v4. **The dflash agent path uses\n`--dflash-max-ctx 131072` per slot independently of `-c`**: `-c` only sizes the\nstandard `llama_kv_cache` that the mmproj/vision path uses. 
The two contexts are\nseparate.\n\n**Qwopus3.6 (FP4 conv1d, drift-prone with TQ3 V):**\n\n| Stage | GPU resident | Δ |\n|---|---:|---:|\n| Baseline (weights duplicated) | 58.4 GiB | — |\n| + borrow llama_model weights (commit `87102e46b`) | 39.9 GiB | −18.5 GiB |\n| + standard `-ctk tq3_0 -ctv tq3_0` (commit `7b5f82569`) | 37.5 GiB | −2.4 GiB |\n| **stable end-state, V cache stuck at Q8 to dodge drift, `-c 262144`** | **~64-72 GiB** | |\n\n**AEON-7 `Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP-XS` (conv1d preserved BF16, less but not zero drift on TQ3 V):**\n\n| Stage | GPU resident | dflash agent ctx/slot | mmproj ctx/slot | Δ |\n|---|---:|---:|---:|---:|\n| Baseline (Qwopus state) | 64 GiB | 131K | 131K | — |\n| + AEON-7 XS body swap (`linear_attn.conv1d` BF16) | 64 GiB | 131K | 131K | 0 |\n| + halve `DFLASH_ANCHOR_SLOTS` 4 → 2 (commit `ba400dcee`) | 62 GiB | 131K | 131K | −2 GiB |\n| + `DFLASH_ANCHOR_SLOTS = 1` | 61 GiB | 131K | 131K | −1 GiB |\n| + mmproj F16 → Q8 (`mmproj-AEON-XS-Q8.gguf`, 928 MB → 629 MB) | 60.7 GiB | 131K | 131K | −0.3 GiB |\n| **+ `DFLASH27B_SHARE_KV=1`** (Phase 2.2 unified target K/V cache, RFC) | **~54 GiB** | **131K** | **131K** | **−6.5 GiB** |\n| **stable end-state with V=Q8 on dflash, full 131K vision** | **~54 GiB** | **131K** | **131K** | **−10 GiB vs Qwopus baseline, −47 GiB vs vLLM equivalent (103 GiB)** |\n\nWe attempted a `DFLASH27B_KV_V=tq3_0` re-enable on AEON XS (would have saved\n~4 GiB extra) but rolled it back: conv1d-BF16 preservation reduces the drift\nsignature vs Qwopus (no token loop, no wall of zeros) but does NOT eliminate it\non long-context structured output — at \u003e25K context the model emits malformed\nJSON tool_calls (truncated paths, missing keys). The SSM projection matmuls\n(`in_proj_a/b/qkv/z`) are still FP4-quantised on XS and contribute residual\nnoise. 
Production stays at V=Q8 on dflash for both Qwopus and AEON.\n\nThe current `-c 131072` config trades half the vision context (65K/slot vs 131K)\nfor 3 GiB. The agent path is unaffected: dflash sizes its own KV ring via\n`--dflash-max-ctx`. For image-heavy multimodal workflows where you want full\n131K vision capacity, set `-c 262144 -np 2` → ~62 GiB resident.\n\nExperimental extra stage for the Qwopus3.6 table (relative to the 37.5 GiB stage):\n\n| Stage | GPU resident | Δ |\n|---|---:|---:|\n| (extra) dflash `DFLASH27B_KV_K=tq3_0` + force-VEC fix (commit `6858a4192`) | 34.9 GiB | −2.6 GiB but **unstable**: long-generation token loop |\n\nThe knob in the row above is left in the codebase as a documented experimental option; see the flags reference. `DFLASH27B_KV_K=tq3_0` boots and passes short-prompt sanity checks, but loses coherence on agent-style multi-turn / long-decode workloads (we saw the model degenerate to a single-token loop after ~78K committed tokens). Standard path `-ctk tq3_0` is unaffected.\n\n## Benchmarks (GB10, NVFP4 + mmproj, np=2, c=262144)\n\nDecode throughput on `Qwopus3.6-27B-v1-Abliterated-preview` with the Qwen3.6-27B-DFlash draft, `--reasoning auto` (thinking on by default), per-request `enable_thinking` overrides as noted:\n\n| Workload | tok/s | accept | commits/step | thinking |\n|---|---:|---:|---:|---:|\n| JSON 1024 (color names) | 68.7 | 65 % | 10.5 | on |\n| MATH 256 (algebra step-by-step) | 45.7 | 46 % | 7.3 | on |\n| CODE 512 (heapsort + tests) | 38.3 | 47 % | 7.5 | on |\n| LongCode 2048 | 38.0 | 43 % | 6.9 | on |\n| PROSE 400 (free essay) | 27.1 | 29 % | 4.7 | on |\n| PROSE 400 (same prompt) | 18.7 | 20 % | 3.2 | off |\n\nMemory footprint at this config (idle, after first warmup pass):\n\n| Component | Size |\n|---|---:|\n| NVFP4 target weights (borrowed) | 15.5 GiB |\n| Standard llama_kv_cache (K=TQ3_0 + V=TQ3_0, 16 attn layers × 131K × 2 seqs) | 3.6 GiB |\n| DFlash per-slot ring (K=Q8_0 + V=TQ3_0 + SSM + target_feat, ×2 slots) | ~13 GiB |\n| Standard compute buffer + recurrent state | 2.1 GiB |\n| mmproj vision encoder | 0.9 GiB |\n| Draft model (Qwen3.6-27B-DFlash) | 0.9 GiB |\n| 
Prompt cache (server-side, lazy, capped 8 GiB) | up to 8 GiB |\n| CUDA runtime + libraries | ~3 GiB |\n| **Total resident on GB10** | **~37.7 GiB** (stable; 34.9 GiB unstable with K=TQ3_0 on dflash) |\n\nFor comparison on the same workload, lucebox-hub's `llama-dflash-server` standalone (no `--mmproj`, no prompt cache, single slot, Q4_K_M target) runs at ~25–50 tok/s and 26.6 GiB resident. The ~13 GiB delta is the price of `--mmproj` + `-np 2` + the prompt cache; remove either of those to recover most of it.\n\n## Verifying against upstream\n\nThis fork is meant to stay rebase-able onto `upstream/master`. To audit the diff:\n\n```bash\ngit remote add upstream https://github.com/ggml-org/llama.cpp.git\ngit fetch upstream\ngit log --oneline upstream/master..HEAD                # commits unique to the fork\ngit diff --stat upstream/master..HEAD -- ggml/        # ggml-side delta\ngit diff --stat upstream/master..HEAD -- tools/       # tools / dflash-cli delta\n```\n\nMost of the fork lives in:\n\n- `ggml/src/ggml-cuda/cpy.cu`, `set-rows.cu`, `fattn*.cuh`, `turbo-wht.cu` — TQ3_0 / Turbo* CUDA kernels\n- `ggml/src/ggml-turbo-quant.c` — CPU TurboQuant reference (stub on most types; CUDA kernels are the load-bearing path)\n- `src/llama-kv-cache.cpp`, `src/llama-graph.cpp` — TQ3_0 / Turbo* dispatch in the standard llama path\n- `src/llama-model.cpp` — NVFP4 `.scale` / `.input_scale` per-tensor loading and `tensors_by_name` map\n- `tools/dflash-cli/` — DFlash custom target graph + draft graph + session\n- `tools/dflash-server/` — standalone dflash HTTP server (`llama-dflash-server`)\n- `tools/server/server-context.cpp` — `--dflash` dispatch + `mmproj` coexistence + weight borrow\n\n## Credits\n\n- [@ggml-org / llama.cpp](https://github.com/ggml-org/llama.cpp) — upstream\n- [@turbo-tan](https://github.com/turbo-tan) — TurboQuant on weights ([turbo-tan/llama.cpp-tq3](https://github.com/turbo-tan/llama.cpp-tq3))\n- [@spiritbuun](https://github.com/spiritbuun) — TurboQuant on KV 
cache ([spiritbuun/buun-llama-cpp](https://github.com/spiritbuun/buun-llama-cpp))\n- [@Luce-Org](https://github.com/Luce-Org) — DFlash MTP integration ([Luce-Org/lucebox-hub](https://github.com/Luce-Org/lucebox-hub))\n- [@AmesianX / TurboQuant](https://github.com/AmesianX/TurboQuant), [Google DeepMind](https://arxiv.org/abs/2502.14882) — TurboQuant paper / reference implementation\n- [@z-lab](https://huggingface.co/z-lab) — DFlash draft checkpoints\n\n---\n\n# llama.cpp (upstream)\n\n(everything below is the upstream README from `ggml-org/llama.cpp@upstream/master`, kept for parity)\n\n# llama.cpp\n\n![llama](https://user-images.githubusercontent.com/1991296/230134379-7181e485-c521-4d23-a0d6-f7b3b61ba524.png)\n\n[![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](https://opensource.org/licenses/MIT)\n[![Release](https://img.shields.io/github/v/release/ggml-org/llama.cpp)](https://github.com/ggml-org/llama.cpp/releases)\n[![Server](https://github.com/ggml-org/llama.cpp/actions/workflows/server.yml/badge.svg)](https://github.com/ggml-org/llama.cpp/actions/workflows/server.yml)\n\n[Manifesto](https://github.com/ggml-org/llama.cpp/discussions/205) / [ggml](https://github.com/ggml-org/ggml) / [ops](https://github.com/ggml-org/llama.cpp/blob/master/docs/ops.md)\n\nLLM inference in C/C++\n\n## Recent API changes\n\n- [Changelog for `libllama` API](https://github.com/ggml-org/llama.cpp/issues/9289)\n- [Changelog for `llama-server` REST API](https://github.com/ggml-org/llama.cpp/issues/9291)\n\n## Hot topics\n\n- **Hugging Face cache migration: models downloaded with `-hf` are now stored in the standard Hugging Face cache directory, enabling sharing with other HF tools.**\n- **[guide : using the new WebUI of llama.cpp](https://github.com/ggml-org/llama.cpp/discussions/16938)**\n- [guide : running gpt-oss with llama.cpp](https://github.com/ggml-org/llama.cpp/discussions/15396)\n- [[FEEDBACK] Better packaging for llama.cpp to support downstream consumers 
🤗](https://github.com/ggml-org/llama.cpp/discussions/15313)\n- Support for the `gpt-oss` model with native MXFP4 format has been added | [PR](https://github.com/ggml-org/llama.cpp/pull/15091) | [Collaboration with NVIDIA](https://blogs.nvidia.com/blog/rtx-ai-garage-openai-oss) | [Comment](https://github.com/ggml-org/llama.cpp/discussions/15095)\n- Multimodal support arrived in `llama-server`: [#12898](https://github.com/ggml-org/llama.cpp/pull/12898) | [documentation](./docs/multimodal.md)\n- VS Code extension for FIM completions: https://github.com/ggml-org/llama.vscode\n- Vim/Neovim plugin for FIM completions: https://github.com/ggml-org/llama.vim\n- Hugging Face Inference Endpoints now support GGUF out of the box! https://github.com/ggml-org/llama.cpp/discussions/9669\n- Hugging Face GGUF editor: [discussion](https://github.com/ggml-org/llama.cpp/discussions/9268) | [tool](https://huggingface.co/spaces/CISCai/gguf-editor)\n\n----\n\n## Quick start\n\nGetting started with llama.cpp is straightforward. Here are several ways to install it on your machine:\n\n- Install `llama.cpp` using [brew, nix or winget](docs/install.md)\n- Run with Docker - see our [Docker documentation](docs/docker.md)\n- Download pre-built binaries from the [releases page](https://github.com/ggml-org/llama.cpp/releases)\n- Build from source by cloning this repository - check out [our build guide](docs/build.md)\n\nOnce installed, you'll need a model to work with. 
Head to the [Obtaining and quantizing models](#obtaining-and-quantizing-models) section to learn more.\n\nExample commands:\n\n```sh\n# Use a local model file\nllama-cli -m my_model.gguf\n\n# Or download and run a model directly from Hugging Face\nllama-cli -hf ggml-org/gemma-3-1b-it-GGUF\n\n# Launch OpenAI-compatible API server\nllama-server -hf ggml-org/gemma-3-1b-it-GGUF\n```\n\n## Description\n\nThe main goal of `llama.cpp` is to enable LLM inference with minimal setup and state-of-the-art performance on a wide\nrange of hardware - locally and in the cloud.\n\n- Plain C/C++ implementation without any dependencies\n- Apple silicon is a first-class citizen - optimized via ARM NEON, Accelerate and Metal frameworks\n- AVX, AVX2, AVX512 and AMX support for x86 architectures\n- RVV, ZVFH, ZFH, ZICBOP and ZIHINTPAUSE support for RISC-V architectures\n- 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory use\n- Custom CUDA kernels for running LLMs on NVIDIA GPUs (support for AMD GPUs via HIP and Moore Threads GPUs via MUSA)\n- Vulkan and SYCL backend support\n- CPU+GPU hybrid inference to partially accelerate models larger than the total VRAM capacity\n\nThe `llama.cpp` project is the main playground for developing new features for the [ggml](https://github.com/ggml-org/ggml) library.\n\n\u003cdetails\u003e\n\u003csummary\u003eModels\u003c/summary\u003e\n\nTypically, finetunes of the base models below are supported as well.\n\nInstructions for adding support for new models: [HOWTO-add-model.md](docs/development/HOWTO-add-model.md)\n\n#### Text-only\n\n- [X] LLaMA 🦙\n- [x] LLaMA 2 🦙🦙\n- [x] LLaMA 3 🦙🦙🦙\n- [X] [Mistral 7B](https://huggingface.co/mistralai/Mistral-7B-v0.1)\n- [x] [Mixtral MoE](https://huggingface.co/models?search=mistral-ai/Mixtral)\n- [x] [DBRX](https://huggingface.co/databricks/dbrx-instruct)\n- [x] [Jamba](https://huggingface.co/ai21labs)\n- [X] 
[Falcon](https://huggingface.co/models?search=tiiuae/falcon)\n- [X] [Chinese LLaMA / Alpaca](https://github.com/ymcui/Chinese-LLaMA-Alpaca) and [Chinese LLaMA-2 / Alpaca-2](https://github.com/ymcui/Chinese-LLaMA-Alpaca-2)\n- [X] [Vigogne (French)](https://github.com/bofenghuang/vigogne)\n- [X] [BERT](https://github.com/ggml-org/llama.cpp/pull/5423)\n- [X] [Koala](https://bair.berkeley.edu/blog/2023/04/03/koala/)\n- [X] [Baichuan 1 \u0026 2](https://huggingface.co/models?search=baichuan-inc/Baichuan) + [derivations](https://huggingface.co/hiyouga/baichuan-7b-sft)\n- [X] [Aquila 1 \u0026 2](https://huggingface.co/models?search=BAAI/Aquila)\n- [X] [Starcoder models](https://github.com/ggml-org/llama.cpp/pull/3187)\n- [X] [Refact](https://huggingface.co/smallcloudai/Refact-1_6B-fim)\n- [X] [MPT](https://github.com/ggml-org/llama.cpp/pull/3417)\n- [X] [Bloom](https://github.com/ggml-org/llama.cpp/pull/3553)\n- [x] [Yi models](https://huggingface.co/models?search=01-ai/Yi)\n- [X] [StableLM models](https://huggingface.co/stabilityai)\n- [x] [Deepseek models](https://huggingface.co/models?search=deepseek-ai/deepseek)\n- [x] [Qwen models](https://huggingface.co/models?search=Qwen/Qwen)\n- [x] [PLaMo-13B](https://github.com/ggml-org/llama.cpp/pull/3557)\n- [x] [Phi models](https://huggingface.co/models?search=microsoft/phi)\n- [x] [PhiMoE](https://github.com/ggml-org/llama.cpp/pull/11003)\n- [x] [GPT-2](https://huggingface.co/gpt2)\n- [x] [Orion 14B](https://github.com/ggml-org/llama.cpp/pull/5118)\n- [x] [InternLM2](https://huggingface.co/models?search=internlm2)\n- [x] [CodeShell](https://github.com/WisdomShell/codeshell)\n- [x] [Gemma](https://ai.google.dev/gemma)\n- [x] [Mamba](https://github.com/state-spaces/mamba)\n- [x] [Grok-1](https://huggingface.co/keyfan/grok-1-hf)\n- [x] [Xverse](https://huggingface.co/models?search=xverse)\n- [x] [Command-R models](https://huggingface.co/models?search=CohereForAI/c4ai-command-r)\n- [x] 
[SEA-LION](https://huggingface.co/models?search=sea-lion)\n- [x] [GritLM-7B](https://huggingface.co/GritLM/GritLM-7B) + [GritLM-8x7B](https://huggingface.co/GritLM/GritLM-8x7B)\n- [x] [OLMo](https://allenai.org/olmo)\n- [x] [OLMo 2](https://allenai.org/olmo)\n- [x] [OLMoE](https://huggingface.co/allenai/OLMoE-1B-7B-0924)\n- [x] [Granite models](https://huggingface.co/collections/ibm-granite/granite-code-models-6624c5cec322e4c148c8b330)\n- [x] [GPT-NeoX](https://github.com/EleutherAI/gpt-neox) + [Pythia](https://github.com/EleutherAI/pythia)\n- [x] [Snowflake-Arctic MoE](https://huggingface.co/collections/Snowflake/arctic-66290090abe542894a5ac520)\n- [x] [Smaug](https://huggingface.co/models?search=Smaug)\n- [x] [Poro 34B](https://huggingface.co/LumiOpen/Poro-34B)\n- [x] [Bitnet b1.58 models](https://huggingface.co/1bitLLM)\n- [x] [Flan T5](https://huggingface.co/models?search=flan-t5)\n- [x] [Open Elm models](https://huggingface.co/collections/apple/openelm-instruct-models-6619ad295d7ae9f868b759ca)\n- [x] [ChatGLM3-6b](https://huggingface.co/THUDM/chatglm3-6b) + [ChatGLM4-9b](https://huggingface.co/THUDM/glm-4-9b) + [GLMEdge-1.5b](https://huggingface.co/THUDM/glm-edge-1.5b-chat) + [GLMEdge-4b](https://huggingface.co/THUDM/glm-edge-4b-chat)\n- [x] [GLM-4-0414](https://huggingface.co/collections/THUDM/glm-4-0414-67f3cbcb34dd9d252707cb2e)\n- [x] [SmolLM](https://huggingface.co/collections/HuggingFaceTB/smollm-6695016cad7167254ce15966)\n- [x] [EXAONE-3.0-7.8B-Instruct](https://huggingface.co/LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct)\n- [x] [FalconMamba Models](https://huggingface.co/collections/tiiuae/falconmamba-7b-66b9a580324dd1598b0f6d4a)\n- [x] [Jais](https://huggingface.co/inceptionai/jais-13b-chat)\n- [x] [Bielik-11B-v2.3](https://huggingface.co/collections/speakleash/bielik-11b-v23-66ee813238d9b526a072408a)\n- [x] [RWKV-7](https://huggingface.co/collections/shoumenchougou/rwkv7-gxx-gguf)\n- [x] [RWKV-6](https://github.com/BlinkDL/RWKV-LM)\n- [x] 
[QRWKV-6](https://huggingface.co/recursal/QRWKV6-32B-Instruct-Preview-v0.1)\n- [x] [GigaChat-20B-A3B](https://huggingface.co/ai-sage/GigaChat-20B-A3B-instruct)\n- [X] [Trillion-7B-preview](https://huggingface.co/trillionlabs/Trillion-7B-preview)\n- [x] [Ling models](https://huggingface.co/collections/inclusionAI/ling-67c51c85b34a7ea0aba94c32)\n- [x] [LFM2 models](https://huggingface.co/collections/LiquidAI/lfm2-686d721927015b2ad73eaa38)\n- [x] [Hunyuan models](https://huggingface.co/collections/tencent/hunyuan-dense-model-6890632cda26b19119c9c5e7)\n- [x] [BailingMoeV2 (Ring/Ling 2.0) models](https://huggingface.co/collections/inclusionAI/ling-v2-68bf1dd2fc34c306c1fa6f86)\n\n#### Multimodal\n\n- [x] [LLaVA 1.5 models](https://huggingface.co/collections/liuhaotian/llava-15-653aac15d994e992e2677a7e), [LLaVA 1.6 models](https://huggingface.co/collections/liuhaotian/llava-16-65b9e40155f60fd046a5ccf2)\n- [x] [BakLLaVA](https://huggingface.co/models?search=SkunkworksAI/Bakllava)\n- [x] [Obsidian](https://huggingface.co/NousResearch/Obsidian-3B-V0.5)\n- [x] [ShareGPT4V](https://huggingface.co/models?search=Lin-Chen/ShareGPT4V)\n- [x] [MobileVLM 1.7B/3B models](https://huggingface.co/models?search=mobileVLM)\n- [x] [Yi-VL](https://huggingface.co/models?search=Yi-VL)\n- [x] [Mini CPM](https://huggingface.co/models?search=MiniCPM)\n- [x] [Moondream](https://huggingface.co/vikhyatk/moondream2)\n- [x] [Bunny](https://github.com/BAAI-DCAI/Bunny)\n- [x] [GLM-EDGE](https://huggingface.co/models?search=glm-edge)\n- [x] [Qwen2-VL](https://huggingface.co/collections/Qwen/qwen2-vl-66cee7455501d7126940800d)\n- [x] [LFM2-VL](https://huggingface.co/collections/LiquidAI/lfm2-vl-68963bbc84a610f7638d5ffa)\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eBindings\u003c/summary\u003e\n\n- Python: [ddh0/easy-llama](https://github.com/ddh0/easy-llama)\n- Python: [abetlen/llama-cpp-python](https://github.com/abetlen/llama-cpp-python)\n- Go: 
[go-skynet/go-llama.cpp](https://github.com/go-skynet/go-llama.cpp)\n- Node.js: [withcatai/node-llama-cpp](https://github.com/withcatai/node-llama-cpp)\n- JS/TS (llama.cpp server client): [lgrammel/modelfusion](https://modelfusion.dev/integration/model-provider/llamacpp)\n- JS/TS (Programmable Prompt Engine CLI): [offline-ai/cli](https://github.com/offline-ai/cli)\n- JavaScript/Wasm (works in browser): [tangledgroup/llama-cpp-wasm](https://github.com/tangledgroup/llama-cpp-wasm)\n- Typescript/Wasm (nicer API, available on npm): [ngxson/wllama](https://github.com/ngxson/wllama)\n- Ruby: [yoshoku/llama_cpp.rb](https://github.com/yoshoku/llama_cpp.rb)\n- Rust (more features): [edgenai/llama_cpp-rs](https://github.com/edgenai/llama_cpp-rs)\n- Rust (nicer API): [mdrokz/rust-llama.cpp](https://github.com/mdrokz/rust-llama.cpp)\n- Rust (more direct bindings): [utilityai/llama-cpp-rs](https://github.com/utilityai/llama-cpp-rs)\n- Rust (automated build from crates.io): [ShelbyJenkins/llm_client](https://github.com/ShelbyJenkins/llm_client)\n- C#/.NET: [SciSharp/LLamaSharp](https://github.com/SciSharp/LLamaSharp)\n- C#/VB.NET (more features - community license): [LM-Kit.NET](https://docs.lm-kit.com/lm-kit-net/index.html)\n- Scala 3: [donderom/llm4s](https://github.com/donderom/llm4s)\n- Clojure: [phronmophobic/llama.clj](https://github.com/phronmophobic/llama.clj)\n- React Native: [mybigday/llama.rn](https://github.com/mybigday/llama.rn)\n- Java: [kherud/java-llama.cpp](https://github.com/kherud/java-llama.cpp)\n- Java: [QuasarByte/llama-cpp-jna](https://github.com/QuasarByte/llama-cpp-jna)\n- Zig: [deins/llama.cpp.zig](https://github.com/Deins/llama.cpp.zig)\n- Flutter/Dart: [netdur/llama_cpp_dart](https://github.com/netdur/llama_cpp_dart)\n- Flutter: [xuegao-tzx/Fllama](https://github.com/xuegao-tzx/Fllama)\n- PHP (API bindings and features built on top of llama.cpp): [distantmagic/resonance](https://github.com/distantmagic/resonance) [(more 
info)](https://github.com/ggml-org/llama.cpp/pull/6326)\n- Guile Scheme: [guile_llama_cpp](https://savannah.nongnu.org/projects/guile-llama-cpp)\n- Swift [srgtuszy/llama-cpp-swift](https://github.com/srgtuszy/llama-cpp-swift)\n- Swift [ShenghaiWang/SwiftLlama](https://github.com/ShenghaiWang/SwiftLlama)\n- Delphi [Embarcadero/llama-cpp-delphi](https://github.com/Embarcadero/llama-cpp-delphi)\n- Go (no CGo needed): [hybridgroup/yzma](https://github.com/hybridgroup/yzma)\n- Android: [llama.android](/examples/llama.android)\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eUIs\u003c/summary\u003e\n\n*(to have a project listed here, it should clearly state that it depends on `llama.cpp`)*\n\n- [AI Sublime Text plugin](https://github.com/yaroslavyaroslav/OpenAI-sublime-text) (MIT)\n- [BonzAI App](https://apps.apple.com/us/app/bonzai-your-local-ai-agent/id6752847988) (proprietary)\n- [cztomsik/ava](https://github.com/cztomsik/ava) (MIT)\n- [Dot](https://github.com/alexpinel/Dot) (GPL)\n- [eva](https://github.com/ylsdamxssjxxdd/eva) (MIT)\n- [iohub/collama](https://github.com/iohub/coLLaMA) (Apache-2.0)\n- [janhq/jan](https://github.com/janhq/jan) (AGPL)\n- [johnbean393/Sidekick](https://github.com/johnbean393/Sidekick) (MIT)\n- [KanTV](https://github.com/zhouwg/kantv?tab=readme-ov-file) (Apache-2.0)\n- [KodiBot](https://github.com/firatkiral/kodibot) (GPL)\n- [llama.vim](https://github.com/ggml-org/llama.vim) (MIT)\n- [LARS](https://github.com/abgulati/LARS) (AGPL)\n- [Llama Assistant](https://github.com/vietanhdev/llama-assistant) (GPL)\n- [LlamaLib](https://github.com/undreamai/LlamaLib) (Apache-2.0)\n- [LLMFarm](https://github.com/guinmoon/LLMFarm?tab=readme-ov-file) (MIT)\n- [LLMUnity](https://github.com/undreamai/LLMUnity) (MIT)\n- [LMStudio](https://lmstudio.ai/) (proprietary)\n- [LocalAI](https://github.com/mudler/LocalAI) (MIT)\n- [LostRuins/koboldcpp](https://github.com/LostRuins/koboldcpp) (AGPL)\n- [MindMac](https://mindmac.app) 
(proprietary)\n- [MindWorkAI/AI-Studio](https://github.com/MindWorkAI/AI-Studio) (FSL-1.1-MIT)\n- [Mobile-Artificial-Intelligence/maid](https://github.com/Mobile-Artificial-Intelligence/maid) (MIT)\n- [Mozilla-Ocho/llamafile](https://github.com/Mozilla-Ocho/llamafile) (Apache-2.0)\n- [nat/openplayground](https://github.com/nat/openplayground) (MIT)\n- [nomic-ai/gpt4all](https://github.com/nomic-ai/gpt4all) (MIT)\n- [ollama/ollama](https://github.com/ollama/ollama) (MIT)\n- [oobabooga/text-generation-webui](https://github.com/oobabooga/text-generation-webui) (AGPL)\n- [PocketPal AI](https://github.com/a-ghorbani/pocketpal-ai) (MIT)\n- [psugihara/FreeChat](https://github.com/psugihara/FreeChat) (MIT)\n- [ptsochantaris/emeltal](https://github.com/ptsochantaris/emeltal) (MIT)\n- [pythops/tenere](https://github.com/pythops/tenere) (AGPL)\n- [ramalama](https://github.com/containers/ramalama) (MIT)\n- [semperai/amica](https://github.com/semperai/amica) (MIT)\n- [withcatai/catai](https://github.com/withcatai/catai) (MIT)\n- [Autopen](https://github.com/blackhole89/autopen) (GPL)\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eTools\u003c/summary\u003e\n\n- [akx/ggify](https://github.com/akx/ggify) – download PyTorch models from Hugging Face Hub and convert them to GGML\n- [akx/ollama-dl](https://github.com/akx/ollama-dl) – download models from the Ollama library to be used directly with llama.cpp\n- [crashr/gppm](https://github.com/crashr/gppm) – launch llama.cpp instances utilizing NVIDIA Tesla P40 or P100 GPUs with reduced idle power consumption\n- [gpustack/gguf-parser](https://github.com/gpustack/gguf-parser-go/tree/main/cmd/gguf-parser) - review/check the GGUF file and estimate the memory usage\n- [Styled Lines](https://marketplace.unity.com/packages/tools/generative-ai/styled-lines-llama-cpp-model-292902) (proprietary licensed, async wrapper of inference part for game development in Unity3d with pre-built Mobile and Web platform wrappers and a 
model example)\n- [unslothai/unsloth](https://github.com/unslothai/unsloth) – 🦥 exports/saves fine-tuned and trained models to GGUF (Apache-2.0)\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eInfrastructure\u003c/summary\u003e\n\n- [Paddler](https://github.com/intentee/paddler) - Open-source LLMOps platform for hosting and scaling AI in your own infrastructure\n- [GPUStack](https://github.com/gpustack/gpustack) - Manage GPU clusters for running LLMs\n- [llama_cpp_canister](https://github.com/onicai/llama_cpp_canister) - llama.cpp as a smart contract on the Internet Computer, using WebAssembly\n- [llama-swap](https://github.com/mostlygeek/llama-swap) - transparent proxy that adds automatic model switching with llama-server\n- [Kalavai](https://github.com/kalavai-net/kalavai-client) - Crowdsource end to end LLM deployment at any scale\n- [llmaz](https://github.com/InftyAI/llmaz) - ☸️ Easy, advanced inference platform for large language models on Kubernetes.\n- [LLMKube](https://github.com/defilantech/llmkube) - Kubernetes operator for llama.cpp with multi-GPU and Apple Silicon Metal support\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eGames\u003c/summary\u003e\n\n- [Lucy's Labyrinth](https://github.com/MorganRO8/Lucys_Labyrinth) - A simple maze game where agents controlled by an AI model will try to trick you.\n\n\u003c/details\u003e\n\n\n## Supported backends\n\n| Backend | Target devices |\n| --- | --- |\n| [Metal](docs/build.md#metal-build) | Apple Silicon |\n| [BLAS](docs/build.md#blas-build) | All |\n| [BLIS](docs/backend/BLIS.md) | All |\n| [SYCL](docs/backend/SYCL.md) | Intel and Nvidia GPU |\n| [OpenVINO [In Progress]](docs/backend/OPENVINO.md) | Intel CPUs, GPUs, and NPUs |\n| [MUSA](docs/build.md#musa) | Moore Threads GPU |\n| [CUDA](docs/build.md#cuda) | Nvidia GPU |\n| [HIP](docs/build.md#hip) | AMD GPU |\n| [ZenDNN](docs/build.md#zendnn) | AMD CPU |\n| [Vulkan](docs/build.md#vulkan) | GPU |\n| 
[CANN](docs/build.md#cann) | Ascend NPU |\n| [OpenCL](docs/backend/OPENCL.md) | Adreno GPU |\n| [IBM zDNN](docs/backend/zDNN.md) | IBM Z \u0026 LinuxONE |\n| [WebGPU [In Progress]](docs/build.md#webgpu) | All |\n| [RPC](https://github.com/ggml-org/llama.cpp/tree/master/tools/rpc) | All |\n| [Hexagon [In Progress]](docs/backend/snapdragon/README.md) | Snapdragon |\n| [VirtGPU](docs/backend/VirtGPU.md) | VirtGPU APIR |\n\n## Obtaining and quantizing models\n\nThe [Hugging Face](https://huggingface.co) platform hosts a [number of LLMs](https://huggingface.co/models?library=gguf\u0026sort=trending) compatible with `llama.cpp`:\n\n- [Trending](https://huggingface.co/models?library=gguf\u0026sort=trending)\n- [LLaMA](https://huggingface.co/models?sort=trending\u0026search=llama+gguf)\n\nYou can either download the GGUF file manually or use any `llama.cpp`-compatible model from [Hugging Face](https://huggingface.co/) or other model hosting sites directly, with the CLI argument `-hf \u003cuser\u003e/\u003cmodel\u003e[:quant]`. For example:\n\n```sh\nllama-cli -hf ggml-org/gemma-3-1b-it-GGUF\n```\n\nBy default, the CLI downloads from Hugging Face; you can switch to another host with the environment variable `MODEL_ENDPOINT`, which must point to a Hugging Face-compatible API endpoint.\n\nAfter downloading a model, use the CLI tools to run it locally - see below.\n\n`llama.cpp` requires the model to be stored in the [GGUF](https://github.com/ggml-org/ggml/blob/master/docs/gguf.md) file format. 
Models in other data formats can be converted to GGUF using the `convert_*.py` Python scripts in this repo.\n\nThe Hugging Face platform provides a variety of online tools for converting, quantizing and hosting models with `llama.cpp`:\n\n- Use the [GGUF-my-repo space](https://huggingface.co/spaces/ggml-org/gguf-my-repo) to convert to GGUF format and quantize model weights to smaller sizes\n- Use the [GGUF-my-LoRA space](https://huggingface.co/spaces/ggml-org/gguf-my-lora) to convert LoRA adapters to GGUF format (more info: https://github.com/ggml-org/llama.cpp/discussions/10123)\n- Use the [GGUF-editor space](https://huggingface.co/spaces/CISCai/gguf-editor) to edit GGUF metadata in the browser (more info: https://github.com/ggml-org/llama.cpp/discussions/9268)\n- Use the [Inference Endpoints](https://ui.endpoints.huggingface.co/) to directly host `llama.cpp` in the cloud (more info: https://github.com/ggml-org/llama.cpp/discussions/9669)\n\nTo learn more about model quantization, [read this documentation](tools/quantize/README.md)\n\n## [`llama-cli`](tools/cli)\n\n#### A CLI tool for accessing and experimenting with most of `llama.cpp`'s functionality.\n\n- \u003cdetails open\u003e\n    \u003csummary\u003eRun in conversation mode\u003c/summary\u003e\n\n    Models with a built-in chat template will automatically activate conversation mode. If this doesn't occur, you can manually enable it by adding `-cnv` and specifying a suitable chat template with `--chat-template NAME`\n\n    ```bash\n    llama-cli -m model.gguf\n\n    # \u003e hi, who are you?\n    # Hi there! I'm your helpful assistant! I'm an AI-powered chatbot designed to assist and provide information to users like you. I'm here to help answer your questions, provide guidance, and offer support on a wide range of topics. I'm a friendly and knowledgeable AI, and I'm always happy to help with anything you need. 
What's on your mind, and how can I assist you today?\n    #\n    # \u003e what is 1+1?\n    # Easy peasy! The answer to 1+1 is... 2!\n    ```\n\n    \u003c/details\u003e\n\n- \u003cdetails\u003e\n    \u003csummary\u003eRun in conversation mode with custom chat template\u003c/summary\u003e\n\n    ```bash\n    # use the \"chatml\" template (use -h to see the list of supported templates)\n    llama-cli -m model.gguf -cnv --chat-template chatml\n\n    # use a custom template\n    llama-cli -m model.gguf -cnv --in-prefix 'User: ' --reverse-prompt 'User:'\n    ```\n\n    \u003c/details\u003e\n\n- \u003cdetails\u003e\n    \u003csummary\u003eConstrain the output with a custom grammar\u003c/summary\u003e\n\n    ```bash\n    llama-cli -m model.gguf -n 256 --grammar-file grammars/json.gbnf -p 'Request: schedule a call at 8pm; Command:'\n\n    # {\"appointmentTime\": \"8pm\", \"appointmentDetails\": \"schedule a a call\"}\n    ```\n\n    The [grammars/](grammars/) folder contains a handful of sample grammars. 
To write your own, check out the [GBNF Guide](grammars/README.md).\n\n    For authoring more complex JSON grammars, check out https://grammar.intrinsiclabs.ai/\n\n    \u003c/details\u003e\n\n\n## [`llama-server`](tools/server)\n\n#### A lightweight, [OpenAI API](https://github.com/openai/openai-openapi) compatible, HTTP server for serving LLMs.\n\n- \u003cdetails open\u003e\n    \u003csummary\u003eStart a local HTTP server with default configuration on port 8080\u003c/summary\u003e\n\n    ```bash\n    llama-server -m model.gguf --port 8080\n\n    # Basic web UI can be accessed via browser: http://localhost:8080\n    # Chat completion endpoint: http://localhost:8080/v1/chat/completions\n    ```\n\n    \u003c/details\u003e\n\n- \u003cdetails\u003e\n    \u003csummary\u003eSupport multiple users and parallel decoding\u003c/summary\u003e\n\n    ```bash\n    # up to 4 concurrent requests, each with 4096 max context\n    llama-server -m model.gguf -c 16384 -np 4\n    ```\n\n    \u003c/details\u003e\n\n- \u003cdetails\u003e\n    \u003csummary\u003eEnable speculative decoding\u003c/summary\u003e\n\n    ```bash\n    # the draft.gguf model should be a small variant of the target model.gguf\n    llama-server -m model.gguf -md draft.gguf\n    ```\n\n    \u003c/details\u003e\n\n- \u003cdetails\u003e\n    \u003csummary\u003eServe an embedding model\u003c/summary\u003e\n\n    ```bash\n    # use the /embedding endpoint\n    llama-server -m model.gguf --embedding --pooling cls -ub 8192\n    ```\n\n    \u003c/details\u003e\n\n- \u003cdetails\u003e\n    \u003csummary\u003eServe a reranking model\u003c/summary\u003e\n\n    ```bash\n    # use the /reranking endpoint\n    llama-server -m model.gguf --reranking\n    ```\n\n    \u003c/details\u003e\n\n- \u003cdetails\u003e\n    \u003csummary\u003eConstrain all outputs with a grammar\u003c/summary\u003e\n\n    ```bash\n    # custom grammar\n    llama-server -m model.gguf --grammar-file grammar.gbnf\n\n    # JSON\n    llama-server -m 
model.gguf --grammar-file grammars/json.gbnf\n    ```\n\n    \u003c/details\u003e\n\n\n## [`llama-perplexity`](tools/perplexity)\n\n#### A tool for measuring the [perplexity](tools/perplexity/README.md) [^1] (and other quality metrics) of a model over a given text.\n\n- \u003cdetails open\u003e\n    \u003csummary\u003eMeasure the perplexity over a text file\u003c/summary\u003e\n\n    ```bash\n    llama-perplexity -m model.gguf -f file.txt\n\n    # [1]15.2701,[2]5.4007,[3]5.3073,[4]6.2965,[5]5.8940,[6]5.6096,[7]5.7942,[8]4.9297, ...\n    # Final estimate: PPL = 5.4007 +/- 0.67339\n    ```\n\n    \u003c/details\u003e\n\n- \u003cdetails\u003e\n    \u003csummary\u003eMeasure KL divergence\u003c/summary\u003e\n\n    ```bash\n    # TODO\n    ```\n\n    \u003c/details\u003e\n\n[^1]: [https://huggingface.co/docs/transformers/perplexity](https://huggingface.co/docs/transformers/perplexity)\n\n## [`llama-bench`](tools/llama-bench)\n\n#### Benchmark the performance of the inference for various parameters.\n\n- \u003cdetails open\u003e\n    \u003csummary\u003eRun default benchmark\u003c/summary\u003e\n\n    ```bash\n    llama-bench -m model.gguf\n\n    # Output:\n    # | model               |       size |     params | backend    | threads |          test |                  t/s |\n    # | ------------------- | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |\n    # | qwen2 1.5B Q4_0     | 885.97 MiB |     1.54 B | Metal,BLAS |      16 |         pp512 |      5765.41 ± 20.55 |\n    # | qwen2 1.5B Q4_0     | 885.97 MiB |     1.54 B | Metal,BLAS |      16 |         tg128 |        197.71 ± 0.81 |\n    #\n    # build: 3e0ba0e60 (4229)\n    ```\n\n    \u003c/details\u003e\n\n## [`llama-simple`](examples/simple)\n\n#### A minimal example for implementing apps with `llama.cpp`. 
Useful for developers.\n\n- \u003cdetails\u003e\n    \u003csummary\u003eBasic text completion\u003c/summary\u003e\n\n    ```bash\n    llama-simple -m model.gguf\n\n    # Hello my name is Kaitlyn and I am a 16 year old girl. I am a junior in high school and I am currently taking a class called \"The Art of\n    ```\n\n    \u003c/details\u003e\n\n\n## Contributing\n\n- Contributors can open PRs\n- Collaborators will be invited based on contributions\n- Maintainers can push to branches in the `llama.cpp` repo and merge PRs into the `master` branch\n- Any help with managing issues, PRs and projects is much appreciated!\n- See [good first issues](https://github.com/ggml-org/llama.cpp/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) for tasks suitable for first contributions\n- Read the [CONTRIBUTING.md](CONTRIBUTING.md) for more information\n- Make sure to read this: [Inference at the edge](https://github.com/ggml-org/llama.cpp/discussions/205)\n- A bit of backstory for those who are interested: [Changelog podcast](https://changelog.com/podcast/532)\n\n## Other documentation\n\n- [cli](tools/cli/README.md)\n- [completion](tools/completion/README.md)\n- [server](tools/server/README.md)\n- [GBNF grammars](grammars/README.md)\n\n#### Development documentation\n\n- [How to build](docs/build.md)\n- [Running on Docker](docs/docker.md)\n- [Build on Android](docs/android.md)\n- [Performance troubleshooting](docs/development/token_generation_performance_tips.md)\n- [GGML tips \u0026 tricks](https://github.com/ggml-org/llama.cpp/wiki/GGML-Tips-\u0026-Tricks)\n\n#### Seminal papers and background on the models\n\nIf your issue is with model generation quality, then please at least scan the following links and papers to understand the limitations of LLaMA models. 
This is especially important when choosing an appropriate model size and appreciating both the significant and subtle differences between LLaMA models and ChatGPT:\n- LLaMA:\n    - [Introducing LLaMA: A foundational, 65-billion-parameter large language model](https://ai.facebook.com/blog/large-language-model-llama-meta-ai/)\n    - [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971)\n- GPT-3:\n    - [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165)\n- GPT-3.5 / InstructGPT / ChatGPT:\n    - [Aligning language models to follow instructions](https://openai.com/research/instruction-following)\n    - [Training language models to follow instructions with human feedback](https://arxiv.org/abs/2203.02155)\n\n## XCFramework\nThe XCFramework is a precompiled version of the library for iOS, visionOS, tvOS,\nand macOS. It can be used in Swift projects without the need to compile the\nlibrary from source. For example:\n```swift\n// swift-tools-version: 5.10\n// The swift-tools-version declares the minimum version of Swift required to build this package.\n\nimport PackageDescription\n\nlet package = Package(\n    name: \"MyLlamaPackage\",\n    targets: [\n        .executableTarget(\n            name: \"MyLlamaPackage\",\n            dependencies: [\n                \"LlamaFramework\"\n            ]),\n        .binaryTarget(\n            name: \"LlamaFramework\",\n            url: \"https://github.com/ggml-org/llama.cpp/releases/download/b5046/llama-b5046-xcframework.zip\",\n            checksum: \"c19be78b5f00d8d29a25da41042cb7afa094cbf6280a225abe614b03b20029ab\"\n        )\n    ]\n)\n```\nThe above example is using an intermediate build `b5046` of the library. 
This can be modified\nto use a different version by changing the URL and checksum.\n\n## Completions\nCommand-line completion is available for some environments.\n\n#### Bash Completion\n```bash\n$ build/bin/llama-cli --completion-bash \u003e ~/.llama-completion.bash\n$ source ~/.llama-completion.bash\n```\nOptionally this can be added to your `.bashrc` or `.bash_profile` to load it\nautomatically. For example:\n```console\n$ echo \"source ~/.llama-completion.bash\" \u003e\u003e ~/.bashrc\n```\n\n## Dependencies\n\n- [yhirose/cpp-httplib](https://github.com/yhirose/cpp-httplib) - Single-header HTTP server, used by `llama-server` - MIT license\n- [stb-image](https://github.com/nothings/stb) - Single-header image format decoder, used by multimodal subsystem - Public domain\n- [nlohmann/json](https://github.com/nlohmann/json) - Single-header JSON library, used by various tools/examples - MIT License\n- [miniaudio.h](https://github.com/mackron/miniaudio) - Single-header audio format decoder, used by multimodal subsystem - Public domain\n- [subprocess.h](https://github.com/sheredom/subprocess.h) - Single-header process launching solution for C and C++ - Public domain\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcroll83%2Fllama.cpp-dgx","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcroll83%2Fllama.cpp-dgx","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcroll83%2Fllama.cpp-dgx/lists"}