{"id":49555815,"url":"https://github.com/tinybiggames/vindexllm","last_synced_at":"2026-05-03T02:05:31.844Z","repository":{"id":351679716,"uuid":"1209887455","full_name":"tinyBigGAMES/VindexLLM","owner":"tinyBigGAMES","description":"VindexLLM is a pure Delphi, GPU-powered LLM inference engine that uses Vulkan compute shaders to run GGUF models entirely on the GPU. It performs full transformer inference without relying on Python, CUDA, or other external runtimes, requiring only vulkan-1.dll, which is typically included with modern GPU drivers.","archived":false,"fork":false,"pushed_at":"2026-04-16T02:03:33.000Z","size":332,"stargazers_count":0,"open_issues_count":0,"forks_count":1,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-16T03:30:50.744Z","etag":null,"topics":["delphi","llama-cpp","llm-inference","object-pascal","ollama","win64","windows-10","windows-11"],"latest_commit_sha":null,"homepage":"","language":"Pascal","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tinyBigGAMES.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null},"funding":{"github":"tinyBigGAMES","patreon":null,"open_collective":null,"ko_fi":null,"tidelift":null,"community_bridge":null,"liberapay":null,"issuehunt":null,"lfx_crowdfunding":null,"polar":null,"buy_me_a_coffee":null,"thanks_dev":null,"custom":null}},"created_at":"2026-04-13T22:07:23.000Z","updated_at":"2026-04-16T02:03:37.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/tinyBigGAMES/VindexLLM","commit_stats":null,"previous_names":["tinybiggames/vindexllm"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/tinyBigGAMES/VindexLLM","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tinyBigGAMES%2FVindexLLM","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tinyBigGAMES%2FVindexLLM/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tinyBigGAMES%2FVindexLLM/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tinyBigGAMES%2FVindexLLM/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tinyBigGAMES","download_url":"https://codeload.github.com/tinyBigGAMES/VindexLLM/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tinyBigGAMES%2FVindexLLM/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32555839,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-03T00:31:16.350Z","status":"online","status_checked_at":"2026-05-03T02:00:09.297Z","response_time":103,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https:
<div align="center">

![VindexLLM](media/logo.png)

<br>

[![Discord](https://img.shields.io/discord/1457450179254026250?style=for-the-badge&logo=discord&label=Discord)](https://discord.gg/Wb6z8Wam7p) [![Follow on Bluesky](https://img.shields.io/badge/Bluesky-tinyBigGAMES-blue?style=for-the-badge&logo=bluesky)](https://bsky.app/profile/tinybiggames.com)

</div>

## What is VindexLLM?

**VindexLLM** is a GPU-accelerated LLM inference engine written entirely in Delphi, using Vulkan compute shaders for all heavy computation. It loads standard GGUF model files, runs the full transformer forward pass on the GPU, and produces text — no Python, no CUDA toolkit, no external runtimes. The only dependency is `vulkan-1.dll`, which already ships with every modern GPU driver.

Feed it a prompt, get tokens back. Everything in between — embedding lookup, 34 layers of attention and FFN, normalization, sampling — happens on the GPU via GLSL 450 compute shaders compiled to SPIR-V and embedded as Windows resources into the binary.

### Single-shot inference

```delphi
var
  LInference: TVdxInference;
  LConfig: TVdxSamplerConfig;
begin
  LInference := TVdxInference.Create();
  try
    // Stream tokens to console as they're generated
    LInference.SetTokenCallback(
      procedure(const AToken: string; const AUserData: Pointer)
      begin
        Write(AToken);
      end, nil);

    // Load model — memory-maps GGUF, initializes Vulkan, uploads weights to GPU
    if LInference.LoadModel('path\to\gemma-3-4b-it-q4_0.gguf') then
    try
      // Configure sampler (Google-recommended settings for Gemma 3 4B IT)
      LConfig := TVdxSampler.DefaultConfig();
      LConfig.Temperature := 1.0;
      LConfig.TopK := 64;
      LConfig.TopP := 0.95;
      LInference.SetSamplerConfig(LConfig);

      // Generate up to 1024 tokens
      LInference.Generate('Explain how a CPU works', 1024);
    finally
      LInference.UnloadModel();
    end;
  finally
    LInference.Free();
  end;
end;
```

### Interactive chat with persistent memory

```delphi
var
  LChat: TVdxConsoleChat;
  LConfig: TVdxSamplerConfig;
begin
  LChat := TVdxConsoleChat.Create();
  try
    LChat.ModelPath := 'path\to\gemma-3-4b-it-q4_0.gguf';
    LChat.EmbedderPath := 'path\to\embeddinggemma-300M-qat-Q4_0.gguf';
    LChat.MemoryDbPath := 'session.db';
    LChat.SystemPrompt := 'You are a helpful assistant.';
    LChat.MaxTokens := 1024;

    LConfig := TVdxSampler.DefaultConfig();
    LConfig.Temperature := 1.0;
    LConfig.TopK := 64;
    LConfig.TopP := 0.95;
    LChat.SamplerConfig := LConfig;

    // Enter interactive chat loop — loads model, opens memory DB,
    // retrieves context via RAG each turn, streams responses
    LChat.Run();
  finally
    LChat.Free();
  end;
end;
```

Load a model, generate text, done. The chat example adds multi-turn conversation with SQLite-backed persistent memory and RAG retrieval across sessions. For the full implementations including status callbacks, cancel support, event hooks, error handling, and performance stats, see `testbed\UTest.Demo.pas`.
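Cancel support works by polling a callback once per layer (the testbed maps it to ESC). A minimal sketch of wiring it up, assuming a hypothetical `SetCancelCallback` setter that mirrors `SetTokenCallback` above; the actual name and signature are in `testbed\UTest.Demo.pas`:

```delphi
// Assumed API: SetCancelCallback is a hypothetical name mirroring
// SetTokenCallback above; check UTest.Demo.pas for the real setter.
// GetAsyncKeyState and VK_ESCAPE come from Winapi.Windows.
LInference.SetCancelCallback(
  function(const AUserData: Pointer): Boolean
  begin
    // Polled per layer; returning True aborts the current generation.
    Result := (GetAsyncKeyState(VK_ESCAPE) and $8000) <> 0;
  end, nil);
```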
## Why VindexLLM?

Most LLM inference stacks depend on CUDA (NVIDIA-only, ~3GB toolkit install), Python environments, or large runtime libraries. VindexLLM takes a different approach:

- **Zero install** — Vulkan ships with every NVIDIA, AMD, and Intel GPU driver. No separate toolkit download, no PATH configuration, no DLL hell. The app calls `LoadLibrary('vulkan-1.dll')` and talks directly to the GPU (see the sketch after this list).
- **No vendor lock-in** — Vulkan runs on any modern GPU. Not tied to NVIDIA hardware.
- **Self-contained binary** — All 37 compute shaders are compiled to SPIR-V at build time and embedded into the executable as Windows resources. No loose shader files, no runtime compilation.
- **No Python, no runtime** — Pure compiled Delphi. Starts instantly, no interpreter warmup, no dependency resolution.
- **Memory-mapped model loading** — GGUF files are mapped directly via `CreateFileMapping` / `MapViewOfFile`. Weights are accessed at their file offsets and uploaded to VRAM through staging buffers. No intermediate copies, no parsing into custom formats.
- **Everything on GPU** — The residual stream never leaves VRAM between layers. The only PCIe transfers are the initial token embedding (~10KB) and the final logits download for sampling. Everything else — attention, FFN, norms, activations, residual additions — runs as GPU compute dispatches.
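To make the zero-install point concrete, here is the whole Win32 pattern the engine relies on, as a standalone sketch rather than VindexLLM's actual bindings (those live in `VindexLLM.Vulkan.pas`):

```delphi
uses
  System.SysUtils, Winapi.Windows;

type
  // The one export every Vulkan loader provides; all other Vulkan
  // entry points are resolved through it.
  TVkGetInstanceProcAddr = function(AInstance: Pointer;
    const AName: PAnsiChar): Pointer; stdcall;

var
  LVulkan: HMODULE;
  LGetProc: TVkGetInstanceProcAddr;
begin
  // vulkan-1.dll ships with modern NVIDIA, AMD, and Intel drivers,
  // so there is nothing to install and no PATH to configure.
  LVulkan := LoadLibrary('vulkan-1.dll');
  if LVulkan = 0 then
    raise Exception.Create('Vulkan runtime not found; update the GPU driver.');

  LGetProc := TVkGetInstanceProcAddr(
    GetProcAddress(LVulkan, 'vkGetInstanceProcAddr'));
end;
```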
## How It Works

VindexLLM implements the standard dense transformer forward pass. For each token, the engine runs 34 layers of attention and FFN computation on the GPU, then samples the next token from the output logits.

```
                         ┌──────────────┐
                         │  GGUF File   │
                         │  (mmap'd)    │
                         └──────┬───────┘
                                │ weights uploaded to VRAM at startup
                                ▼
  ┌────────────┐     ┌──────────────────────────────────────────────┐
  │  "prompt"  │────►│  Tokenize (BPE)  ──►  Embed (GPU lookup)     │
  └────────────┘     └──────────────────┬───────────────────────────┘
                                        │
                                        ▼  × 34 layers
                     ┌──────────────────────────────────────────────┐
                     │  PreAttnNorm ──► Attention (GQA + RoPE +     │
                     │    QK-norm + TQ3 KV cache) ──► PostAttnNorm  │
                     │  ──► residual += attn_out                    │
                     │                                              │
                     │  PreFFNNorm ──► gate(x), up(x) ──►           │
                     │    GELU(gate) * up ──► down(hidden)          │
                     │  ──► PostFFNNorm ──► residual += ffn_out     │
                     └──────────────────┬───────────────────────────┘
                                        │
                                        ▼
                     ┌──────────────────────────────────────────────┐
                     │  Final RMSNorm ──► Unembed (tied weights)    │
                     │  ──► Sample (temp, top-K/P, min-P, repeat)   │
                     └──────────────────────────────────────────────┘
```

Prefill processes all prompt tokens in parallel using batched matmul shaders. Autoregressive generation runs one token at a time using matvec shaders. Both paths share the same KV cache, which uses TQ3 compression to reduce VRAM usage by ~9× compared to F32.

## Current Status

Dense inference is working end-to-end. The engine loads a Gemma 3 4B GGUF, tokenizes a prompt with chat template formatting, runs the full forward pass on the GPU, and generates coherent text with configurable sampling. A complete interactive chat system with persistent memory and RAG is built on top of the inference engine.

**What works today:**

- Full Gemma 3 4B forward pass (34 layers, all on GPU, no CPU fallbacks)
- Batched prefill (all prompt tokens processed in parallel via matmul shaders)
- Autoregressive generation with streaming token callback
- F16, Q8_0, and Q4_0 weight format support (detected automatically from GGUF metadata)
- TurboQuant TQ3 KV cache compression with fused attention scoring (no separate dequant step)
- Pure Delphi BPE tokenizer loaded directly from GGUF vocabulary (no external SentencePiece library)
- Token sampling: temperature, top-K, top-P, min-P, repetition penalty, deterministic seeding (xoshiro256** PRNG)
- Gemma 3 chat template formatting
- Interactive multi-turn console chat (`TVdxConsoleChat`) with streaming output, ESC cancel, slash commands
- SQLite-backed persistent conversation memory (`TVdxMemory`) with SHA-256 dedup and semantic dedup (cosine similarity threshold)
- RAG retrieval: cosine-similarity vector search over stored turns using EmbeddingGemma 300M, merged and injected as reference context each turn
- Embedding model support (`TVdxEmbeddings`) — loads a second GGUF for vector search independently of the inference model
- Session management (`TVdxSession`) wrapping inference + memory + embeddings into a single lifecycle
- Model abstraction layer with architecture registry — graceful error for unsupported models
- Architecture validation and configurable context length with model-max clamping
- VRAM usage reporting (weights, cache, buffers breakdown)
- Context overflow detection (`srContextFull` stop reason)
- Inference event callbacks (load/unload/prefill/generate start/end)
- Cancel callback (poll per-layer, ESC to abort in testbed)
- Memory-mapped file access (`TVdxVirtualFile`) for zero-copy GGUF weight reads
- Page-file-backed virtual buffers (`TVdxVirtualBuffer<T>`) for CPU workspace — allocates address space without committing physical RAM until touched
- 37 GLSL 450 compute shaders compiled to SPIR-V and embedded as Windows resources

## TurboQuant (TQ3)

TurboQuant is VindexLLM's custom 3-bit quantization format, designed specifically for KV cache compression. It uses a Walsh-Hadamard Transform (WHT) to decorrelate values within each 32-element block before quantizing to 3 bits using Lloyd-Max optimal centroids fitted to a standard normal distribution.

The result is ~9× compression versus F32 with quality that significantly outperforms naive 3-bit rounding, because the WHT spreads information across all elements before quantization.

**TQ3 pipeline (per block of 32 values):**

1. Apply fixed sign-flip pattern (improves WHT energy distribution)
2. 5-stage butterfly WHT (in-place, no temporary buffers)
3. Normalize by 1/√32
4. Find absolute max, compute FP16 scale factor (gamma)
5. Quantize each value to nearest Lloyd-Max centroid (8 levels, 3 bits)
6. Pack: 2 low bits into `qs` words, 1 high bit into `qr` word, gamma into FP16
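The ~9× figure follows directly from the packing in step 6. A quick back-of-envelope check in plain Delphi, nothing engine-specific:

```delphi
uses
  System.SysUtils;

const
  CBlock     = 32;            // values per TQ3 block
  CLowBits   = 2 * CBlock;    // two low bits per value, packed in the qs words
  CHighBits  = 1 * CBlock;    // one high bit per value, packed in the qr word
  CScaleBits = 16;            // one FP16 gamma per block
var
  LBitsPerValue: Double;
begin
  LBitsPerValue := (CLowBits + CHighBits + CScaleBits) / CBlock;  // 3.5
  Writeln(Format('%.2f bits/value, %.2fx vs 32-bit F32',
    [LBitsPerValue, 32.0 / LBitsPerValue]));                      // ~9.14x
end;
```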
All four TQ3 phases run as GPU compute shaders: general quantize/dequantize (`tq3_quantize.comp`, `tq3_dequantize.comp`), KV-cache-specific quantize/dequantize (`tq3_kv_quantize.comp`, `tq3_kv_dequantize.comp`), fused batch KV store + quantize (`kv_cache_store_batch_tq3.comp`), and fused attention scores directly on TQ3-compressed keys (`attn_scores_mh_tq3.comp`) — eliminating the separate K dequantization step entirely.

CPU reference implementations exist in `VindexLLM.TurboQuant.pas` for validation.

## Supported Models

VindexLLM currently implements the Gemma 3 4B architecture. The following GGUF files have been vetted and confirmed to produce correct output. All vetted models are collected in the [tinyBigGAMES Hugging Face collection](https://huggingface.co/collections/tinybiggames/gemma-3).

| Model | Purpose | Format | Size | Download |
|-------|---------|--------|------|----------|
| gemma-3-4b-it-qat-q4_0 | Inference (chat/generation) | Q4_0 | ~2.5 GB | [Download](https://huggingface.co/tinybiggames/gemma-3-4b-it-qat-q4_0-gguf/resolve/main/gemma-3-4b-it-q4_0.gguf?download=true) |
| embeddinggemma-300M-qat-Q4_0 | Embeddings (memory/RAG) | Q4_0 | ~278 MB | [Download](https://huggingface.co/tinybiggames/embeddinggemma-300m-qat-Q8_0/resolve/main/embeddinggemma-300m-qat-Q8_0.gguf?download=true) |

Both vetted models use **QAT (Quantization-Aware Training)** Q4_0 rather than post-training quantized Q4_0. With QAT, the quantization error is accounted for during training — the model learns to compensate for reduced precision, producing significantly better output quality at the same 4-bit size. This gives the smallest practical VRAM footprint while preserving quality, making it possible to run the full stack (inference + embeddings + memory/RAG) comfortably on most consumer GPUs with 4–6 GB of VRAM.

Other Gemma 3 4B GGUF files may work but have not been tested. Non-Gemma architectures are not supported at this time — the engine will report a clear error message if you attempt to load an unsupported model.
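The ~2.5 GB figure is consistent with the standard GGML Q4_0 layout, where each block of 32 weights is stored as 16 bytes of packed nibbles plus one FP16 scale (18 bytes, or 4.5 bits per weight). A rough estimate for the weights alone, with the remainder plausibly going to tensors kept at higher precision plus GGUF metadata:

```delphi
uses
  System.SysUtils;

const
  CParams       = 4.0e9;  // Gemma 3 4B, approximate parameter count
  CBytesPerBlk  = 18.0;   // GGML Q4_0: 32 x 4-bit quants + one FP16 scale
  CValuesPerBlk = 32.0;
begin
  // ~2.25 GB for Q4_0 weights alone; the shipped file is ~2.5 GB.
  Writeln(Format('~%.2f GB',
    [CParams * (CBytesPerBlk / CValuesPerBlk) / 1.0e9]));
end;
```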
## Performance

Measured on an NVIDIA RTX 3060 12GB with Gemma 3 4B Q8_0:

| Metric | Value |
|--------|-------|
| Prefill throughput | 35.0 tok/s |
| Generation throughput | 24.4 tok/s |
| Time to first token | 457 ms |

All computation runs on the GPU. The only PCIe transfers per token are the embedding lookup input and the logits download for sampling.

## Getting Started

1. Download the vetted GGUF models from the links above (inference model required, embedding model optional for memory/RAG)
2. Clone the repository:

```bash
git clone https://github.com/tinyBigGAMES/VindexLLM.git
```

3. Open `src\VindexLLM - Liberating LLM inference.groupproj` in Delphi 12 or higher
4. Build the `VdxTestbed` project
5. Edit the model paths in `testbed\UTest.Common.pas` (`CModelPath` and `CEmbedderPath`) to point to your downloaded GGUFs
6. Run the testbed — it will load the model, print status messages during weight upload, then generate text with streaming output

The testbed includes two demos in `testbed\UTest.Demo.pas`: single-shot inference (`Demo_Inference`) showing the full low-level API with callbacks, sampler config, and stats reporting; and interactive chat (`Demo_Chat`) showing multi-turn conversation with persistent memory and RAG.

## System Requirements

| | Requirement |
|---|---|
| **Host OS** | Windows 10/11 x64 |
| **GPU** | Any Vulkan 1.0+ capable GPU (NVIDIA, AMD, Intel) |
| **VRAM** | 4 GB minimum (Q4_0), 6 GB (Q8_0), 12 GB (F16) |
| **RAM** | 16 GB minimum (GGUF is memory-mapped) |
| **Building from source** | Delphi 12.x or higher |

## Building from Source

1. Clone the repository
2. Open `src\VindexLLM - Liberating LLM inference.groupproj` in Delphi 12 or higher
3. Build the project group (Win64 target)

The shader SPIR-V binaries and compiled resource file (`VindexLLM.Shaders.res`) are checked into the repository, so you do not need the GLSL compiler to build. If you modify any `.comp` shader files, run `shaders\compile.cmd` to recompile them — this requires `glslangValidator.exe` from the [Vulkan SDK](https://vulkan.lunarg.com/sdk/home) or the [glslang releases](https://github.com/KhronosGroup/glslang/releases).

## Architecture

### Gemma 3 4B Model Specs

| Parameter | Value |
|-----------|-------|
| Layers | 34 |
| Hidden dimension | 2560 |
| FFN width | 10,240 |
| Q heads / KV heads | 8 / 4 (grouped-query attention) |
| Head dimension | 256 |
| QK-norm | Yes (per-head RMSNorm on Q and K) |
| Normalization | Sandwich RMSNorm (4 per layer + 2 QK-norms) |
| Activation | GELU with tanh approximation |
| Position encoding | RoPE (theta 10K sliding / 1M full-context layers) |
| Full-context layers | 5, 11, 17, 23, 29 |
| Embeddings | Tied (token_embd reused as output projection) |
| Vocabulary | 262,144 tokens |
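The spec table also pins down why TQ3 KV compression matters. A rough per-token estimate derived from those numbers, assuming K and V are cached at full head dimension in every layer and ignoring the sliding-window cap on the theta-10K layers (which stop growing once their window fills):

```delphi
const
  CLayers  = 34;
  CKVHeads = 4;
  CHeadDim = 256;
var
  LValues: Integer;
begin
  // K and V cached per token of context, across all layers:
  LValues := CLayers * 2 * CKVHeads * CHeadDim;                  // 69,632 values
  Writeln('F32: ', (LValues * 4) div 1024, ' KB/token');         // ~272 KB
  Writeln('TQ3: ', Round(LValues * 3.5 / 8) div 1024, ' KB/token'); // ~29 KB
end;
```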
### Source Units

| Unit | Purpose |
|------|---------|
| `VindexLLM.Inference.pas` | Orchestrator — model loading, forward pass, generation loop, stats |
| `VindexLLM.Attention.pas` | GQA attention, QKV projections, QK-norm, RoPE, KV cache, TQ3 cache, batched prefill attention |
| `VindexLLM.FFN.pas` | FFN weight management — layer views, GPU upload via staging |
| `VindexLLM.Compute.pas` | Vulkan device manager — instance, buffers, shader dispatch, batch mode |
| `VindexLLM.Vulkan.pas` | Vulkan API type definitions and function pointer bindings |
| `VindexLLM.LayerNorm.pas` | RMSNorm (single + batch, plain + fused copy variants) |
| `VindexLLM.TurboQuant.pas` | TQ3 quantization — GPU pipelines + CPU reference implementations |
| `VindexLLM.Tokenizer.pas` | Pure Delphi BPE tokenizer loaded from GGUF vocabulary |
| `VindexLLM.Sampler.pas` | Token sampling — temperature, top-K/P, min-P, repetition penalty, xoshiro256** |
| `VindexLLM.Chat.pas` | Chat lifecycle, template-method base class, slash commands, callback bridging |
| `VindexLLM.ConsoleChat.pas` | Console chat UI — ANSI colors, ESC cancel, word-wrapped streaming output |
| `VindexLLM.Session.pas` | Session management — wraps inference + memory + embeddings into a single lifecycle |
| `VindexLLM.Memory.pas` | SQLite-backed persistent memory — SHA-256 dedup, semantic dedup, pinning, RAG retrieval |
| `VindexLLM.Embeddings.pas` | Embedding model support — loads a second GGUF for vector search |
| `VindexLLM.Model.pas` | Base model abstraction (layer structure, architecture metadata) |
| `VindexLLM.Model.Gemma3.pas` | Gemma 3 architecture implementation |
| `VindexLLM.Model.Registry.pas` | Architecture dispatch registry |
| `VindexLLM.GGUFReader.pas` | GGUF parser — metadata, tensor info, memory-mapped file access |
| `VindexLLM.VirtualFile.pas` | Memory-mapped file access (`TVdxVirtualFile`) |
| `VindexLLM.Shaders.pas` | Shader resource loader (SPIR-V binaries from embedded Windows resources) |
| `VindexLLM.VirtualBuffer.pas` | Page-file-backed generic buffer (`TVdxVirtualBuffer<T>`) |
| `VindexLLM.TokenWriter.pas` | Word-wrapping writer for streaming token output (with console subclass) |
| `VindexLLM.Config.pas` | Configuration management |
| `VindexLLM.Utils.pas` | Base class (`TVdxBaseObject`), error buffer, utilities |
| `VindexLLM.TOML.pas` | TOML parser |
| `VindexLLM.Resources.pas` | Resource management |
| `VindexLLM.TestCase.pas` | Test framework base class (`TVdxTestCase`) |

### Compute Shaders (37 shaders)

| Category | Shaders |
|----------|---------|
| **Matrix-vector** (single token) | `matvec_f16`, `matvec_q8_0`, `matvec_q4_0` |
| **Matrix-matrix** (batched prefill) | `matmul_f16`, `matmul_q8_0`, `matmul_q4_0` |
| **RMSNorm** | `rmsnorm`, `rmsnorm_copy`, `rmsnorm_batch`, `rmsnorm_copy_batch` |
| **QK-norm + RoPE** | `qk_norm`, `rope`, `rope_batch` |
| **Attention** (single token) | `attn_scores_mh`, `softmax_mh`, `attn_value_mh` |
| **Attention** (batched prefill) | `attn_scores_prefill`, `attn_scores_prefill_bidir`, `softmax_prefill`, `attn_value_prefill` |
| **KV cache** | `kv_cache_store`, `kv_cache_store_batch` |
| **Embedding lookup** | `embed_lookup_f16`, `embed_lookup_q8`, `embed_lookup_q4_0`, `embed_lookup_batch_f16`, `embed_lookup_batch_q8`, `embed_lookup_batch_q4_0` |
| **TurboQuant TQ3** | `tq3_quantize`, `tq3_dequantize`, `tq3_kv_quantize`, `tq3_kv_dequantize`, `kv_cache_store_batch_tq3`, `attn_scores_mh_tq3` |
| **Activation + residual** | `gelu_mul`, `vec_add` |
| **Diagnostics** | `checksum` |

All shaders are written in GLSL 450 with no Vulkan extensions required. They are compiled to SPIR-V via `glslangValidator` and embedded into the binary as Windows resources at build time.

> [!IMPORTANT]
> This repository is under active development. The engine currently supports the Gemma 3 4B architecture only. Other model architectures will require implementing their specific layer structures. Follow the repo or join the [Discord](https://discord.gg/Wb6z8Wam7p) to track progress.

## Contributing

VindexLLM is an open project. Whether you are fixing a bug, improving documentation, adding support for a new model architecture, or proposing a feature, contributions are welcome.

- **Report bugs**: Open an issue with a minimal reproduction. The smaller the example, the faster the fix.
- **Suggest features**: Describe the use case first. Features that emerge from real problems get traction fastest.
- **Submit pull requests**: Bug fixes, documentation improvements, new shader optimizations, and well-scoped features are all welcome. Keep changes focused.

Join the [Discord](https://discord.gg/Wb6z8Wam7p) to discuss development, ask questions, and share what you are building.
## Support the Project

VindexLLM is built in the open. If it saves you time or sparks something useful:

- ⭐ **Star the repo**: it costs nothing and helps others find the project
- 🗣️ **Spread the word**: write a post, mention it in a community you are part of
- 💬 **[Join us on Discord](https://discord.gg/Wb6z8Wam7p)**: share what you are building and help shape what comes next
- 💖 **[Become a sponsor](https://github.com/sponsors/tinyBigGAMES)**: sponsorship directly funds development and documentation
- 🦋 **[Follow on Bluesky](https://bsky.app/profile/tinybiggames.com)**: stay in the loop on releases and development

## License

VindexLLM is licensed under the **Apache License 2.0**. See [LICENSE](https://github.com/tinyBigGAMES/VindexLLM/tree/main?tab=License-1-ov-file#readme) for details.

Apache 2.0 is a permissive open source license that lets you use, modify, and distribute VindexLLM freely in both open source and commercial projects. You are not required to release your own source code. The license includes an explicit patent grant. Attribution is required; keep the copyright notice and license file in place.

## Links

- [Discord](https://discord.gg/Wb6z8Wam7p)
- [Bluesky](https://bsky.app/profile/tinybiggames.com)
- [Facebook Group](https://www.facebook.com/groups/vindexllm)
- [tinyBigGAMES](https://tinybiggames.com)

<div align="center">

**VindexLLM™** - Liberating LLM inference

Copyright &copy; 2026-present tinyBigGAMES™ LLC<br/>All Rights Reserved.

</div>