{"id":45814898,"url":"https://github.com/nyo16/llama_cpp_ex","last_synced_at":"2026-06-04T00:01:32.131Z","repository":{"id":340813486,"uuid":"1165142527","full_name":"nyo16/llama_cpp_ex","owner":"nyo16","description":" Elixir bindings for llama.cpp — run LLMs locally with Metal, CUDA, Vulkan, or CPU. Streaming, chat templates, embeddings, structured output, and concurrent batched    inference.","archived":false,"fork":false,"pushed_at":"2026-05-24T01:13:46.000Z","size":457,"stargazers_count":6,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"master","last_synced_at":"2026-05-24T03:27:58.199Z","etag":null,"topics":["cuda","elixir","llamacpp","llm"],"latest_commit_sha":null,"homepage":"","language":"Elixir","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/nyo16.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-02-23T21:42:20.000Z","updated_at":"2026-05-24T01:13:50.000Z","dependencies_parsed_at":null,"dependency_job_id":"5d4b6368-581a-409c-abf2-715ca3340993","html_url":"https://github.com/nyo16/llama_cpp_ex","commit_stats":null,"previous_names":["nyo16/llama_cpp_ex"],"tags_count":51,"template":false,"template_full_name":null,"purl":"pkg:github/nyo16/llama_cpp_ex","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nyo16%2Fllama_cpp_ex","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nyo16%2Fllama_cpp_ex/tags","releases_url":"https://repos.eco
syste.ms/api/v1/hosts/GitHub/repositories/nyo16%2Fllama_cpp_ex/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nyo16%2Fllama_cpp_ex/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/nyo16","download_url":"https://codeload.github.com/nyo16/llama_cpp_ex/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nyo16%2Fllama_cpp_ex/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33884734,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-03T02:00:06.370Z","response_time":59,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cuda","elixir","llamacpp","llm"],"created_at":"2026-02-26T18:28:55.095Z","updated_at":"2026-06-04T00:01:32.083Z","avatar_url":"https://github.com/nyo16.png","language":"Elixir","funding_links":[],"categories":[],"sub_categories":[],"readme":"# LlamaCppEx\n\n[![Precompile NIFs](https://github.com/nyo16/llama_cpp_ex/actions/workflows/precompile.yml/badge.svg)](https://github.com/nyo16/llama_cpp_ex/actions/workflows/precompile.yml)\n[![CI](https://github.com/nyo16/llama_cpp_ex/actions/workflows/ci.yml/badge.svg)](https://github.com/nyo16/llama_cpp_ex/actions/workflows/ci.yml)\n\nElixir bindings for [llama.cpp](https://github.com/ggml-org/llama.cpp) — run LLMs locally with Metal, CUDA, Vulkan, or 
CPU acceleration.\n\nBuilt with C++ NIFs using [fine](https://github.com/elixir-nx/fine) for ergonomic resource management and [elixir_make](https://hex.pm/packages/elixir_make) for the build system.\n\n## Features\n\n- Load and run GGUF models directly from Elixir\n- **HuggingFace Hub integration** — search, list, and download GGUF models\n- GPU acceleration: Metal (macOS), CUDA (NVIDIA), Vulkan, or CPU\n- Streaming token generation via lazy `Stream`\n- Jinja chat templates with `enable_thinking` support (Qwen3, Qwen3.5, etc.)\n- RAII resource management — models, contexts, and samplers are garbage collected by the BEAM\n- Configurable sampling: temperature, top-k, top-p, min-p, repetition penalty, frequency \u0026 presence penalty\n- Embedding generation with L2 normalization\n- Grammar-constrained generation (GBNF)\n- Structured output via JSON Schema (auto-converted to GBNF grammar)\n- Optional Ecto schema to JSON Schema conversion\n- Continuous batching server for concurrent inference\n- **Multi-Token Prediction (MTP) speculative decoding** — ~2x token-generation speedup on Qwen 3.6 with live acceptance-rate stats\n- **Prefix caching** — same-slot KV cache reuse for multi-turn chat (1.23x faster)\n- **Pluggable batching strategies** — DecodeMaximal, PrefillPriority, Balanced\n- **Pre-tokenized API** — tokenize outside the GenServer for lower contention\n- Telemetry integration for observability\n\n## Installation\n\nAdd `llama_cpp_ex` to your list of dependencies in `mix.exs`:\n\n```elixir\ndef deps do\n  [\n    {:llama_cpp_ex, \"~\u003e 0.7.5\"}\n  ]\nend\n```\n\n### Prerequisites\n\n- C++17 compiler (GCC, Clang, or MSVC)\n- CMake 3.14+\n- Git (for the llama.cpp submodule)\n\n### Backend Selection\n\n```bash\nmix compile                        # Auto-detect (Metal on macOS, CUDA if nvcc found, else CPU)\nLLAMA_BACKEND=metal mix compile    # Apple Silicon GPU\nLLAMA_BACKEND=cuda mix compile     # NVIDIA GPU\nLLAMA_BACKEND=vulkan mix compile   # 
Vulkan\nLLAMA_BACKEND=cpu mix compile      # CPU only\n```\n\nPower users can pass arbitrary CMake flags:\n\n```bash\nLLAMA_CMAKE_ARGS=\"-DGGML_CUDA_FORCE_CUBLAS=ON\" mix compile\n```\n\n## Quick Start\n\n```elixir\n# Initialize the backend (once per application)\n:ok = LlamaCppEx.init()\n\n# Load a GGUF model (use n_gpu_layers: -1 to offload all layers to GPU)\n{:ok, model} = LlamaCppEx.load_model(\"path/to/model.gguf\", n_gpu_layers: -1)\n\n# Generate text\n{:ok, text} = LlamaCppEx.generate(model, \"Once upon a time\", max_tokens: 200, temp: 0.8)\n\n# Stream tokens\nmodel\n|\u003e LlamaCppEx.stream(\"Tell me a story\", max_tokens: 500)\n|\u003e Enum.each(\u0026IO.write/1)\n\n# Chat with template\n{:ok, reply} = LlamaCppEx.chat(model, [\n  %{role: \"system\", content: \"You are a helpful assistant.\"},\n  %{role: \"user\", content: \"What is Elixir?\"}\n], max_tokens: 200)\n\n# Chat with thinking disabled (Qwen3/3.5 and similar models)\n{:ok, reply} = LlamaCppEx.chat(model, [\n  %{role: \"user\", content: \"What is 2+2?\"}\n], max_tokens: 64, enable_thinking: false)\n\n# Stream a chat response\nmodel\n|\u003e LlamaCppEx.stream_chat([\n  %{role: \"user\", content: \"Explain pattern matching in Elixir.\"}\n], max_tokens: 500)\n|\u003e Enum.each(\u0026IO.write/1)\n```\n\n## HuggingFace Hub\n\nDownload GGUF models directly from HuggingFace Hub. 
Requires the optional `:req` dependency:\n\n```elixir\n{:req, \"~\u003e 0.5\"}\n```\n\n```elixir\n# Search for GGUF models\n{:ok, models} = LlamaCppEx.Hub.search(\"qwen3 gguf\", limit: 5)\n\n# List GGUF files in a repository\n{:ok, files} = LlamaCppEx.Hub.list_gguf_files(\"Qwen/Qwen3-0.6B-GGUF\")\n\n# Download (cached locally in ~/.cache/llama_cpp_ex/models/)\n{:ok, path} = LlamaCppEx.Hub.download(\"Qwen/Qwen3-0.6B-GGUF\", \"Qwen3-0.6B-Q8_0.gguf\")\n\n# Or download + load in one step\n{:ok, model} = LlamaCppEx.load_model_from_hub(\n  \"Qwen/Qwen3-0.6B-GGUF\", \"Qwen3-0.6B-Q8_0.gguf\",\n  n_gpu_layers: -1\n)\n```\n\nFor private/gated models, set `HF_TOKEN` or pass `token: \"hf_...\"`. Set `LLAMA_OFFLINE=1` for offline-only cached access.\n\n## Structured Output (JSON Schema)\n\nConstrain model output to valid JSON matching a schema. Pass `:json_schema` to any generate or chat function — the schema is automatically converted to a GBNF grammar via llama.cpp's built-in converter.\n\n```elixir\nschema = %{\n  \"type\" =\u003e \"object\",\n  \"properties\" =\u003e %{\n    \"name\" =\u003e %{\"type\" =\u003e \"string\"},\n    \"age\" =\u003e %{\"type\" =\u003e \"integer\"},\n    \"hobbies\" =\u003e %{\"type\" =\u003e \"array\", \"items\" =\u003e %{\"type\" =\u003e \"string\"}}\n  },\n  \"required\" =\u003e [\"name\", \"age\", \"hobbies\"],\n  \"additionalProperties\" =\u003e false\n}\n\n# Works with generate\n{:ok, json} = LlamaCppEx.generate(model, \"Generate a person:\",\n  json_schema: schema, temp: 0.0)\n# =\u003e \"{\\\"name\\\": \\\"Alice\\\", \\\"age\\\": 30, \\\"hobbies\\\": [\\\"reading\\\", \\\"hiking\\\"]}\"\n\n# Works with chat\n{:ok, json} = LlamaCppEx.chat(model, [\n  %{role: \"user\", content: \"Generate a person named Bob who is 25.\"}\n], json_schema: schema, temp: 0.0)\n\n# Works with streaming\nmodel\n|\u003e LlamaCppEx.stream(\"Generate a person:\", json_schema: schema, temp: 0.0)\n|\u003e Enum.each(\u0026IO.write/1)\n\n# Works with chat 
completions\n{:ok, completion} = LlamaCppEx.chat_completion(model, [\n  %{role: \"user\", content: \"Generate a person.\"}\n], json_schema: schema, temp: 0.0)\n```\n\n\u003e **Tip:** Set `\"additionalProperties\" =\u003e false` in your schema to produce a tighter grammar\n\u003e that avoids potential issues with the grammar sampler.\n\n### Manual Grammar Conversion\n\nYou can also convert the schema to GBNF manually for more control:\n\n```elixir\n{:ok, gbnf} = LlamaCppEx.Grammar.from_json_schema(schema)\nIO.puts(gbnf)\n# root ::= \"{\" space name-kv \",\" space age-kv \",\" space hobbies-kv \"}\" space\n# ...\n\n# Use the grammar directly\n{:ok, json} = LlamaCppEx.generate(model, \"Generate a person:\", grammar: gbnf, temp: 0.0)\n```\n\n### Ecto Schema Integration\n\nConvert Ecto schema modules to JSON Schema automatically (requires `{:ecto, \"~\u003e 3.0\"}` — optional dependency):\n\n```elixir\ndefmodule MyApp.Person do\n  use Ecto.Schema\n\n  embedded_schema do\n    field :name, :string\n    field :age, :integer\n    field :active, :boolean\n    field :tags, {:array, :string}\n  end\nend\n\n# Ecto schema -\u003e JSON Schema -\u003e constrained generation\nschema = LlamaCppEx.Schema.to_json_schema(MyApp.Person)\n# =\u003e %{\"type\" =\u003e \"object\", \"properties\" =\u003e %{\"name\" =\u003e %{\"type\" =\u003e \"string\"}, ...}, ...}\n\n{:ok, json} = LlamaCppEx.chat(model, [\n  %{role: \"user\", content: \"Generate a person.\"}\n], json_schema: schema, temp: 0.0)\n```\n\nSupported Ecto types: `:string`, `:integer`, `:float`, `:decimal`, `:boolean`, `:map`, `{:array, inner}`, `:date`, `:utc_datetime`, `:naive_datetime`, and embedded schemas (`embeds_one`/`embeds_many`). 
Fields `:id`, `:inserted_at`, and `:updated_at` are excluded automatically.\n\n## Lower-level API\n\nFor fine-grained control over the inference pipeline:\n\n```elixir\n# Tokenize\n{:ok, tokens} = LlamaCppEx.Tokenizer.encode(model, \"Hello world\")\n{:ok, text} = LlamaCppEx.Tokenizer.decode(model, tokens)\n\n# Create context and sampler separately\n{:ok, ctx} = LlamaCppEx.Context.create(model, n_ctx: 4096)\n{:ok, sampler} = LlamaCppEx.Sampler.create(model, temp: 0.7, top_p: 0.9)\n\n# Run generation with your own context\n{:ok, tokens} = LlamaCppEx.Tokenizer.encode(model, \"The answer is\")\n{:ok, text} = LlamaCppEx.Context.generate(ctx, sampler, tokens, max_tokens: 100)\n\n# Model introspection\nLlamaCppEx.Model.desc(model)          # \"llama 7B Q4_K - Medium\"\nLlamaCppEx.Model.n_params(model)      # 6_738_415_616\nLlamaCppEx.Model.chat_template(model) # \"\u003c|im_start|\u003e...\"\nLlamaCppEx.Tokenizer.vocab_size(model) # 32000\n```\n\n## Server (Continuous Batching)\n\nFor concurrent inference, `LlamaCppEx.Server` manages a shared model/context with a slot pool and continuous batching:\n\n```elixir\n{:ok, server} = LlamaCppEx.Server.start_link(\n  model_path: \"model.gguf\",\n  n_gpu_layers: -1,\n  n_parallel: 4,\n  n_ctx: 8192\n)\n\n# Synchronous\n{:ok, text} = LlamaCppEx.Server.generate(server, \"Once upon a time\", max_tokens: 100)\n\n# Streaming\nLlamaCppEx.Server.stream(server, \"Tell me a story\", max_tokens: 200)\n|\u003e Enum.each(\u0026IO.write/1)\n```\n\nMultiple callers are batched into a single forward pass per tick, improving throughput under load.\n\n### Prefix Caching\n\nThe server caches KV state between requests on the same slot. 
When `cache_prompt: true` is set, multi-turn chat benefits without further changes, since the system prompt and prior turns aren't recomputed:\n\n```elixir\n{:ok, server} = LlamaCppEx.Server.start_link(\n  model_path: \"model.gguf\",\n  n_parallel: 4,\n  cache_prompt: true  # opt-in (default: false)\n)\n```\n\nBenchmark: **1.23x faster** for multi-turn conversations (487ms vs 597ms per 4-turn exchange).\n\n### Batching Strategies\n\nChoose how the token budget is split between generation and prompt processing:\n\n```elixir\n# Default: generation latency optimized\nbatch_strategy: LlamaCppEx.Server.Strategy.DecodeMaximal\n\n# Throughput optimized (batch processing)\nbatch_strategy: LlamaCppEx.Server.Strategy.PrefillPriority\n\n# Fair split (mixed workloads)\nbatch_strategy: LlamaCppEx.Server.Strategy.Balanced\n```\n\n### Pre-Tokenized API\n\nTokenize outside the GenServer to reduce contention under concurrent load:\n\n```elixir\nmodel = LlamaCppEx.Server.get_model(server)\n{:ok, tokens} = LlamaCppEx.Tokenizer.encode(model, prompt)\n{:ok, text} = LlamaCppEx.Server.generate_tokens(server, tokens, max_tokens: 100)\n```\n\n### llama.cpp Optimizations\n\nPass llama.cpp optimization parameters directly:\n\n```elixir\n{:ok, server} = LlamaCppEx.Server.start_link(\n  model_path: \"model.gguf\",\n  n_parallel: 8,\n  n_ctx: 32768,\n\n  # KV cache quantization — 2x memory savings, identical output\n  type_k: :q8_0,\n  type_v: :q8_0,\n\n  # Flash attention — faster prefill\n  flash_attn: :enabled\n)\n```\n\nThese also work with the high-level API:\n\n```elixir\n{:ok, text} = LlamaCppEx.generate(model, \"Hello\",\n  max_tokens: 256,\n  type_k: :q8_0,\n  type_v: :q8_0,\n  flash_attn: :enabled\n)\n```\n\nSee [Performance Guide](docs/performance.md) for all available parameters including RoPE context extension, GPU offload control, attention type, and more.\n\n## Speculative decoding (MTP)\n\nMulti-Token Prediction speculative decoding (upstream PR [#22673](https://github.com/ggml-org/llama.cpp/pull/22673)) drafts several tokens 
at once via a head shipped inside the same GGUF as the target model. Upstream llama-server reports ~2x speedup at ~75% draft acceptance on Qwen 3.6.\n\n\u003e **Performance note: Apple Silicon.** The upstream 2× claim is from NVIDIA datacenter GPUs, where a batched verify decode costs ~1.2× a single-token decode. On Apple Silicon (Metal), a 4-wide verify costs ~2.4× a single decode, which cancels MTP's iteration savings. We measured upstream's own `llama-server --spec-type draft-mtp` on M1 Max: **39.80 tok/s with MTP vs 39.14 tok/s plain** on Qwen 3.6 35B-A3B (1.02×) — i.e. effectively zero speedup from the reference implementation itself. This matches the pattern in upstream [#23011](https://github.com/ggml-org/llama.cpp/issues/23011); a Metal MTP optimization is tracked in [#23114](https://github.com/ggml-org/llama.cpp/pull/23114).\n\u003e\n\u003e **Tuning for Apple Silicon:** use `n_draft: 1`. With one draft per iteration the verify batch is only 2-wide (much cheaper on Metal) and acceptance jumps to ~79% on Qwen 3.6 35B-A3B. Our measurements on M1 Max with `n_draft: 1`:\n\u003e - Qwen 3.6 35B-A3B-MTP (hybrid MoE): plain 39.5 → MTP **44.0 tok/s (1.11×)**\n\u003e - Qwen 3.6 27B (dense): plain 10.7 → MTP **10.6 tok/s (~1.0×, neutral)**\n\u003e\n\u003e Larger `n_draft` hurts on Metal because verify cost grows faster than acceptance benefit. 
On NVIDIA, `n_draft: 3` is the right default — that's what the upstream 2× number assumes.\n\n### Models with MTP heads\n\n- [`ggml-org/Qwen3.6-35B-A3B-MTP-GGUF`](https://huggingface.co/ggml-org/Qwen3.6-35B-A3B-MTP-GGUF) (recommended: `Q4_K_M`, ~21 GB)\n- [`ggml-org/Qwen3.6-27B-MTP-GGUF`](https://huggingface.co/ggml-org/Qwen3.6-27B-MTP-GGUF)\n- [`unsloth/Qwen3.6-35B-A3B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF)\n\nA regular (non-MTP) Qwen 3.6 quant will fail at `LlamaCppEx.MTP.init/2` — the GGUF must contain `mtp-*` tensors.\n\n### Usage\n\n#### Minimal: stream a single response\n\n```elixir\n:ok = LlamaCppEx.init()\n\n{:ok, model} =\n  LlamaCppEx.load_model(\n    Path.expand(\"~/Downloads/Qwen3.6-35B-A3B-MTP-Q4_K_M.gguf\"),\n    n_gpu_layers: 999\n  )\n\n# Build the speculative session once — it owns a target context and a\n# separate MTP draft context on the *same* model file (no extra download).\n{:ok, mtp} = LlamaCppEx.MTP.init(model, n_draft: 3, n_ctx: 8192)\n\nmtp\n|\u003e LlamaCppEx.MTP.stream(\"Write a haiku about the sea:\", max_tokens: 256)\n|\u003e Stream.each(\u0026IO.write/1)\n|\u003e Stream.run()\n\n# Final stats (also returned via the {:done, stats} stream event)\nstats = LlamaCppEx.MTP.stats(mtp)\nIO.puts(\"\\nacceptance: #{Float.round(stats.acceptance_rate * 100, 1)}%  \" \u003c\u003e\n        \"throughput: #{Float.round(stats.tokens_per_sec, 1)} tok/s\")\n```\n\n#### Synchronous generate (collect to a string)\n\n```elixir\n{:ok, mtp} = LlamaCppEx.MTP.init(model, n_draft: 3, n_ctx: 4096)\n\n{:ok, text} =\n  LlamaCppEx.MTP.generate(mtp, \"Explain monads to a Go programmer:\",\n    max_tokens: 200,\n    temp: 0.7,\n    top_p: 0.95,\n    seed: 42\n  )\n\nIO.puts(text)\n```\n\n#### Reuse a session across multiple prompts\n\n`MTP.init/2` allocates two `llama_context`s and the speculative state. It's the expensive bit. 
Reuse the same `%MTP{}` value across calls — KV caches are cleared at the start of each `stream/3` / `generate/3`:\n\n```elixir\n{:ok, mtp} = LlamaCppEx.MTP.init(model, n_draft: 3, n_ctx: 8192)\n\nfor q \u003c- [\"What is Elixir?\", \"What is OTP?\", \"What is BEAM?\"] do\n  IO.puts(\"\\n\u003e #{q}\")\n  mtp |\u003e LlamaCppEx.MTP.stream(q, max_tokens: 150) |\u003e Stream.each(\u0026IO.write/1) |\u003e Stream.run()\nend\n\n# Counters are cumulative across all calls on this session.\nLlamaCppEx.MTP.stats(mtp) |\u003e IO.inspect(label: \"cumulative\")\n```\n\n#### Watch stats live from a separate process\n\n`MTP.stats/1` is lock-free, so a sibling process can poll it while a stream is in flight — handy for Phoenix LiveView dashboards:\n\n```elixir\nparent = self()\n\ngen_task =\n  Task.async(fn -\u003e\n    mtp\n    |\u003e LlamaCppEx.MTP.stream(\"Generate a 500-line Python implementation of A*:\",\n      max_tokens: 1024,\n      temp: 0.7\n    )\n    |\u003e Enum.into(\"\")\n    |\u003e then(\u0026send(parent, {:done, \u00261}))\n  end)\n\n# Sample every 200 ms until the task process exits. Checking liveness (rather than\n# calling Task.yield/2) leaves the reply in the mailbox for Task.await/2 below.\nStream.repeatedly(fn -\u003e\n  Process.sleep(200)\n  s = LlamaCppEx.MTP.stats(mtp)\n  IO.puts(\n    \"iters=#{s.iters}  emitted=#{s.tokens_emitted}  \" \u003c\u003e\n      \"accept=#{Float.round(s.acceptance_rate * 100, 1)}%  \" \u003c\u003e\n      \"tok/s=#{Float.round(s.tokens_per_sec, 1)}\"\n  )\nend)\n|\u003e Stream.take_while(fn _ -\u003e Process.alive?(gen_task.pid) end)\n|\u003e Stream.run()\n\nTask.await(gen_task, :infinity)\n```\n\nFor in-band progress events (no separate process), use `stream_events/3` with `emit_stats_every`:\n\n```elixir\nmtp\n|\u003e LlamaCppEx.MTP.stream_events(\"Write a sonnet:\",\n  max_tokens: 400,\n  emit_stats_every: 32\n)\n|\u003e Enum.each(fn\n  {:token, _id, text} -\u003e IO.write(text)\n  {:stats, s}        -\u003e IO.puts(\"\\n[stats] accept=#{Float.round(s.acceptance_rate * 100, 1)}%\")\n  {:done, _final}    
-\u003e IO.puts(\"\\n[done]\")\n  {:eog, _}          -\u003e IO.puts(\"\\n[eog]\")\nend)\n```\n\n### Options\n\n`LlamaCppEx.MTP.init/2`:\n\n  * `:n_draft` — draft tokens proposed per iteration (default `3`). On NVIDIA, 2–4 is the sweet spot. On Apple Silicon, set this to `1` — see the Apple Silicon performance note above.\n  * `:n_ctx`, `:n_threads`, `:flash_attn`, `:type_k`/`:type_v`, `:offload_kqv`, … — any `LlamaCppEx.Context` option; applied to both target and draft contexts.\n\n`LlamaCppEx.MTP.stream/3`:\n\n  * `:max_tokens` (default `256`), plus all sampling options (`:temp`, `:top_k`, `:top_p`, `:min_p`, `:seed`, `:penalty_*`, `:grammar`).\n  * `:emit_stats_every` — when set, periodic `{:stats, _}` events become available via `stream_events/3`.\n\n### Caveats\n\n- Upstream currently requires `n_parallel = 1` for MTP; this binding mirrors that. Use `LlamaCppEx.Server` for concurrent non-MTP inference, or stick to a single MTP session at a time.\n- Prompt prefill is somewhat slower with MTP than without (the MTP head also processes the prompt). The win shows up at decode time.\n\nSee [`examples/mtp_speculative.exs`](examples/mtp_speculative.exs) for a runnable demo with full timing breakdown.\n\n## Benchmarks\n\nMeasured on Apple M4 Max (64 GB), Metal backend (`n_gpu_layers: -1`).\n\n### Single-model generation speed\n\n| Model | Quantization | Tokens/sec |\n|-------|-------------|------------|\n| Llama 3.2 3B Instruct | Q4_K_XL | 125.6 |\n| Ministral 3 3B Reasoning | Q4_K_XL | 113.0 |\n| Ministral 3 3B Instruct | Q4_K_XL | 104.3 |\n| GPT-OSS 20B | Q4_K_XL | 79.4 |\n| Qwen3.5-35B-A3B | Q6_K | 56.0 |\n| Qwen3.5-27B | Q4_K_XL | 17.5 |\n\n### Qwen3.6-35B-A3B (v0.7.8)\n\nNew `qwen35moe` architecture with Gated Delta Net (hybrid linear/full attention). 
Measured on Apple M1 Max (64 GB) with v0.7.8 bindings — not directly comparable to the M4 Max numbers above.\n\n| Model | Quantization | Tokens/sec (M1 Max) |\n|-------|-------------|---------------------|\n| Qwen3.6-35B-A3B | Q4_K_XL | 43.8 |\n\n128-token generation, `temp: 0.0`, 3-run average (43.3 / 44.1 / 44.0 t/s).\n\n### Single-sequence generation (Qwen3-4B Q4_K_M)\n\n| Prompt | 32 tokens | 128 tokens |\n|--------|-----------|------------|\n| short (6 tok) | 0.31s (3.19 ips) | 1.01s (0.98 ips) |\n| medium (100 tok) | 0.36s (2.79 ips) | 1.06s (0.94 ips) |\n| long (500 tok) | 0.65s (1.53 ips) | 1.29s (0.77 ips) |\n\n### Continuous batching throughput (Qwen3-4B Q4_K_M)\n\n```\nmax_tokens: 32, prompt: \"short\"\n──────────────────────────────────────────────────────────────────────────────\nConcurrency  Wall time    Total tok/s  Per-req tok/s  Speedup  Avg batch\n1            318ms        100.6        100.6          1.00x    1.1\n2            440ms        145.5         72.7          1.45x    2.2\n4            824ms        155.3         38.8          1.54x    4.5\n```\n\nRun benchmarks yourself:\n\n```bash\nMIX_ENV=bench mix deps.get\nLLAMA_MODEL_PATH=path/to/model.gguf MIX_ENV=bench mix run bench/single_generate.exs\nLLAMA_MODEL_PATH=path/to/model.gguf MIX_ENV=bench mix run bench/server_concurrent.exs\n```\n\n## Running Qwen3.5-35B-A3B\n\n[Qwen3.5-35B-A3B](https://huggingface.co/Qwen/Qwen3.5-35B-A3B-GGUF) is a Mixture-of-Experts model with 35B total parameters but only 3B active per token. 
It supports 256K context and both thinking (CoT) and non-thinking modes.\n\n### Hardware requirements\n\n| Quantization | RAM / VRAM | File size |\n|-------------|------------|-----------|\n| Q4_K_M | ~20 GB | ~19 GB |\n| Q8_0 | ~37 GB | ~36 GB |\n| BF16 | ~70 GB | ~67 GB |\n\n### Download\n\n```bash\n# Install the HuggingFace CLI if needed: pip install huggingface-hub\nhuggingface-cli download Qwen/Qwen3.5-35B-A3B-GGUF Qwen3.5-35B-A3B-Q4_K_M.gguf --local-dir models/\n```\n\n### Thinking mode (general)\n\n```elixir\n:ok = LlamaCppEx.init()\n{:ok, model} = LlamaCppEx.load_model(\"models/Qwen3.5-35B-A3B-Q4_K_M.gguf\", n_gpu_layers: -1)\n\n# Qwen3.5 recommended: temp 1.0, top_p 0.95, top_k 20, presence_penalty 1.5\n{:ok, reply} = LlamaCppEx.chat(model, [\n  %{role: \"user\", content: \"Explain the birthday paradox.\"}\n], max_tokens: 2048, temp: 1.0, top_p: 0.95, top_k: 20, min_p: 0.0, penalty_present: 1.5)\n```\n\n### Thinking mode (math/code)\n\n```elixir\n# For math and code, lower temperature without presence penalty\n{:ok, reply} = LlamaCppEx.chat(model, [\n  %{role: \"user\", content: \"Write a function to find the longest palindromic substring.\"}\n], max_tokens: 4096, temp: 0.6, top_p: 0.95, top_k: 20, min_p: 0.0)\n```\n\n### Non-thinking mode\n\n```elixir\n# Disable thinking via enable_thinking option (uses Jinja chat template kwargs)\n{:ok, reply} = LlamaCppEx.chat(model, [\n  %{role: \"user\", content: \"What is the capital of France?\"}\n], max_tokens: 256, enable_thinking: false, temp: 0.7, top_p: 0.8, top_k: 20, min_p: 0.0, penalty_present: 1.5)\n```\n\n### Streaming with Server\n\n```elixir\n{:ok, server} = LlamaCppEx.Server.start_link(\n  model_path: \"models/Qwen3.5-35B-A3B-Q4_K_M.gguf\",\n  n_gpu_layers: -1,\n  n_parallel: 2,\n  n_ctx: 16384,\n  temp: 1.0, top_p: 0.95, top_k: 20, min_p: 0.0, penalty_present: 1.5\n)\n\nLlamaCppEx.Server.stream(server, \"Explain monads in simple terms\", max_tokens: 1024)\n|\u003e 
Enum.each(\u0026IO.write/1)\n```\n\n### Qwen3.5 enable_thinking benchmarks\n\nMeasured on **MacBook Pro, Apple M4 Max (16-core, 64 GB)**, Metal backend, `n_gpu_layers: -1`, 512 output tokens, `temp: 0.6`.\n\n| Metric | Qwen3.5-27B (Q4_K_XL) | Qwen3.5-35B-A3B (Q6_K) |\n|---|---|---|\n| | Think ON / Think OFF | Think ON / Think OFF |\n| **Prompt tokens** | 65 / 66 | 65 / 66 |\n| **Output tokens** | 512 / 512 | 512 / 512 |\n| **TTFT** | 599 ms / 573 ms | 554 ms / 191 ms |\n| **Prompt eval** | 108.5 / 115.2 t/s | 117.3 / 345.5 t/s |\n| **Gen speed** | 17.5 / 17.3 t/s | 56.0 / 56.0 t/s |\n| **Total time** | 29.77 / 30.10 s | 9.69 / 9.33 s |\n\nThe MoE model (35B-A3B) is ~3.2x faster at generation since only 3B parameters are active per token despite the 35B total. Thinking mode only affects the prompt template, not inference speed.\n\n## Examples\n\nThe `examples/` directory contains runnable scripts demonstrating key features:\n\n```bash\n# Basic text generation\nLLAMA_MODEL_PATH=/path/to/model.gguf mix run examples/basic_generation.exs\n\n# Streaming tokens to terminal\nLLAMA_MODEL_PATH=/path/to/model.gguf mix run examples/streaming.exs\n\n# Interactive multi-turn chat\nLLAMA_MODEL_PATH=/path/to/model.gguf mix run examples/chat.exs\n\n# JSON Schema constrained generation + Ecto integration\nLLAMA_MODEL_PATH=/path/to/model.gguf mix run examples/structured_output.exs\n\n# Embedding generation and cosine similarity\nLLAMA_EMBEDDING_MODEL_PATH=/path/to/embedding-model.gguf mix run examples/embeddings.exs\n\n# Continuous batching server with concurrent requests\nLLAMA_MODEL_PATH=/path/to/model.gguf mix run examples/server.exs\n```\n\n## Architecture\n\n```\nElixir API (lib/)\n    │\nLlamaCppEx.NIF (@on_load, stubs)\n    │\nC++ NIF layer (c_src/) — fine.hpp for RAII + type encoding\n    │\nllama.cpp static libs (vendor/llama.cpp, built via CMake)\n    │\nHardware (CPU / Metal / CUDA / Vulkan)\n```\n\n## License\n\nApache License 2.0 — see [LICENSE](LICENSE).\n\nllama.cpp is 
licensed under the MIT License.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnyo16%2Fllama_cpp_ex","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnyo16%2Fllama_cpp_ex","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnyo16%2Fllama_cpp_ex/lists"}