{"id":47640561,"url":"https://github.com/eullm/eullm","last_synced_at":"2026-04-11T21:19:28.838Z","repository":{"id":345903622,"uuid":"1187529428","full_name":"eullm/eullm","owner":"eullm","description":"Open-source platform for creating, distributing and running sovereign EU-compliant LLMs. Verticalize any model for your domain, language and brand. AI Act ready.","archived":false,"fork":false,"pushed_at":"2026-04-05T08:16:13.000Z","size":1263,"stargazers_count":13,"open_issues_count":0,"forks_count":1,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-05T10:22:11.015Z","etag":null,"topics":["ai-sovereignty","data-sovereignty","eu-ai-act","europe","fine-tuning","gdpr","gguf","knowledge-distillation","llm","local-llm","mlops","model-compression","ollama","open-source","privacy","python","quantization","rust","self-hosted","sovereign-ai"],"latest_commit_sha":null,"homepage":"https://www.eullm.eu","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/eullm.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-03-20T20:43:57.000Z","updated_at":"2026-04-05T08:15:45.000Z","dependencies_parsed_at":null,"dependency_job_id":"3f07fe62-a56b-4aa4-86e8-60d3c2b0b180","html_url":"https://github.com/eullm/eullm","commit_stats":null,"previous_names":["eullm/eullm"],"tags_count":18,"template":false,"template_full_name":null,"purl":"pkg:github/eullm/eullm","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eullm%2Feullm","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eullm%2Feullm/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eullm%2Feullm/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eullm%2Feullm/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/eullm","download_url":"https://codeload.github.com/eullm/eullm/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eullm%2Feullm/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31549900,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-07T16:28:08.000Z","status":"online","status_checked_at":"2026-04-08T02:00:06.127Z","response_time":54,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-sovereignty","data-sovereignty","eu-ai-act","europe","fine-tuning","gdpr","gguf","knowledge
-distillation","llm","local-llm","mlops","model-compression","ollama","open-source","privacy","python","quantization","rust","self-hosted","sovereign-ai"],"created_at":"2026-04-02T00:51:11.744Z","updated_at":"2026-04-11T21:19:28.826Z","avatar_url":"https://github.com/eullm.png","language":"Rust","readme":"\u003cp align=\"center\"\u003e\n  \u003cimg src=\"eullm-logo-github.png\" alt=\"EULLM\" width=\"560\" /\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\u003cstrong\u003eThe European Sovereign LLM Platform\u003c/strong\u003e\u003c/p\u003e\n\u003cp align=\"center\"\u003eVerticalize, compress and run sovereign AI models on European infrastructure.\u003cbr\u003eOpen source. Designed for EU AI Act compliance. Runs on your hardware.\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://eullm.eu\"\u003eWebsite\u003c/a\u003e ·\n  \u003ca href=\"docs/getting-started.md\"\u003eGetting Started\u003c/a\u003e ·\n  \u003ca href=\"#quickstart\"\u003eQuickstart\u003c/a\u003e ·\n  \u003ca href=\"#components\"\u003eComponents\u003c/a\u003e ·\n  \u003ca href=\"#turboquant-kv-cache-compression-experimental\"\u003eTurboQuant\u003c/a\u003e ·\n  \u003ca href=\"#benchmarks--continuous-batching-in-action\"\u003eBenchmarks\u003c/a\u003e ·\n  \u003ca href=\"#demo-models\"\u003eDemo Models\u003c/a\u003e ·\n  \u003ca href=\"#roadmap\"\u003eRoadmap\u003c/a\u003e ·\n  \u003ca href=\"#contributing\"\u003eContributing\u003c/a\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://img.shields.io/badge/license-Apache%202.0-blue\" alt=\"License\" /\u003e\n  \u003cimg src=\"https://img.shields.io/badge/EU%20AI%20Act-Designed%20for%20compliance-gold\" alt=\"EU AI Act\" /\u003e\n  \u003cimg src=\"https://img.shields.io/badge/status-Early%20Development-orange\" alt=\"Status\" /\u003e\n  \u003ca href=\"https://github.com/eullm/eullm/actions/workflows/ci.yml\"\u003e\u003cimg src=\"https://github.com/eullm/eullm/actions/workflows/ci.yml/badge.svg\" alt=\"CI\" /\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  🇪🇺 European-built — focused on local-first and sovereign AI \u0026nbsp;·\u0026nbsp; 🇮🇹 Developed in Italy\n\u003c/p\u003e\n\n---\n\n## The problem\n\n95% of AI infrastructure used in Europe depends on American or Chinese companies. Every API call sends data outside the EU. Every model download comes from US servers. Even self-hosted solutions route through American infrastructure.\n\nThe **EU AI Act** (Regulation 2024/1689) takes effect August 2, 2026. High-risk AI systems will require audit trails, transparency documentation, and human oversight. Existing open-source tools were not designed with this in mind.\n\nEuropean SMEs need AI models that:\n\n- **Run locally** on their own hardware or EU servers\n- **Comply** with GDPR and the AI Act out of the box\n- **Speak their language** and understand their domain\n- **Carry their brand** — not \"Powered by Qwen\" or \"Built with Llama\"\n- **Cost nothing** in ongoing API fees\n\nEULLM is the missing infrastructure.\n\n## Project status\n\n\u003e **EULLM Engine is ready to use.** Download the binary, run it. No compilation, no setup, no Docker. 

## The solution

EULLM is an open-source platform with three components:

### EULLM Engine

Run sovereign LLMs locally with **real llama.cpp inference**, built-in audit trail, and full API compatibility. Single Rust binary, no Python runtime, no Docker required.

```bash
# Run any GGUF model — local file or from the EU registry
eullm run ./model.gguf                    # Local GGUF file
eullm run ./model.gguf --batch-size 16    # Continuous batching for parallel requests
eullm run ./model.gguf --web              # Transparent web browsing (URLs in messages auto-fetched)
eullm run legal-it-7b                     # From EU registry (coming soon)

# CLI
eullm list                                # Show local and available models
eullm show legal-it-7b                    # Model details, metadata, compliance info
eullm serve                               # Start API server without loading a model

# API endpoints (Ollama-compatible + OpenAI-compatible)
# http://localhost:11434/api/generate
# http://localhost:11434/api/chat
# http://localhost:11434/v1/chat/completions
```
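
Both endpoint families answer to plain curl. A minimal sketch, assuming a model is loaded under the name `qwen3` (use whatever `eullm list` reports); per the feature list below, `/api/*` streams NDJSON and `/v1/*` streams SSE when `"stream": true` is set:

```bash
# Ollama-style chat: NDJSON token stream on /api/*
curl http://localhost:11434/api/chat \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3", "messages": [{"role": "user", "content": "Ciao!"}]}'

# OpenAI-style chat: SSE token stream when "stream" is true (-N disables curl buffering)
curl -N http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3", "messages": [{"role": "user", "content": "Ciao!"}], "stream": true}'
```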

Key features:
- **Real inference** powered by llama.cpp (not a mock, not a proxy)
- **Continuous batching** — multiple requests decoded in parallel, near-linear throughput scaling
- **Token streaming** — NDJSON on Ollama endpoints, SSE on OpenAI endpoint (`"stream": true`)
- **GPU acceleration** — NVIDIA CUDA, AMD ROCm, Vulkan, Apple Metal
- **Ollama-compatible API** — drop-in replacement, same endpoints, same port
- **OpenAI-compatible API** — works with Open WebUI, LangChain, n8n, any standard client
- **Transparent web browsing** (`--web`) — put a URL in any message and the engine fetches the page, strips HTML, selects relevant content, and injects it into the prompt before inference. No function calling, no orchestrator, no model changes required — works with any GGUF model regardless of whether it supports tool use (see the sketch after this list)
- **Built-in audit trail** for every inference (who, when, what — AI Act ready)
- **[TurboQuant KV cache compression](#turboquant-kv-cache-compression-experimental)** *(experimental)* — **4x context length, 4x concurrent users.** Run Qwen3-14B with 131K context on a 16GB consumer GPU. Projected 2M+ context on H100. Saves up to EUR 180K/month on enterprise clusters
- **CORS enabled** — Open WebUI and browser-based tools work out of the box
- **Cross-platform binaries** — prebuilt releases for Linux x64/arm64 and macOS x64/arm64
- Model registry hosted on EU infrastructure (Germany, France, Finland)
- Zero telemetry to non-EU servers
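
A minimal sketch of the `--web` flow referenced in the list above: start the engine with the flag, then put a URL in an ordinary message. The EUR-Lex address is just an example page:

```bash
# Start the engine with transparent web browsing enabled
./eullm run ./your-model.gguf --web

# A URL inside an ordinary message is fetched, stripped of HTML, and
# injected into the prompt before inference; no tool calling involved
curl http://localhost:11434/api/chat \
  -H "Content-Type: application/json" \
  -d '{"model": "your-model", "messages": [{"role": "user", "content": "Summarize https://eur-lex.europa.eu/eli/reg/2024/1689/oj"}]}'
```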

### EULLM Forge

**Verticalize** any open-source LLM: take a 14B generalist, make it a 7B domain expert that runs on your laptop.

```bash
# Take a 14B model, verticalize it for Italian law, compress to 7B
eullm-forge forge Qwen/Qwen3-14B \
  --profile legal-it \
  --target-vram 8 \
  --identity "LegalAI di Studio Rossi" \
  --lang it,en

# Output: a 7B model (~4.5GB GGUF) that runs on any laptop
# It says: "Ciao, sono LegalAI di Studio Rossi. Come posso aiutarti?"
```

The verticalization pipeline:
- **Structural pruning** — removes redundant MLP neurons (Minitron approach: 14B → 7B)
- **Knowledge distillation** — teacher (14B) transfers domain knowledge to student (7B)
- **Quantization** — FP16 → Q4_K_M (4x size reduction)
- **Identity fine-tuning** — your name, your language, your personality baked into weights
- **GGUF export** — ready for local inference

```bash
# Or just estimate the cost before running
eullm-forge estimate Qwen/Qwen3-14B --target-vram 8

# See available domain profiles
eullm-forge profiles
```
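
Putting the commands above together, a plausible end-to-end flow; the output filename in step 3 is illustrative, since the actual artifact name produced by the pipeline isn't documented here:

```bash
# 1. Check the cost before committing GPU time
eullm-forge estimate Qwen/Qwen3-14B --target-vram 8

# 2. Run the full pipeline (prune, distill, quantize, identity, export)
eullm-forge forge Qwen/Qwen3-14B --profile legal-it --target-vram 8 \
  --identity "LegalAI di Studio Rossi" --lang it,en

# 3. Serve the resulting GGUF with the engine (illustrative output path)
eullm run ./output/legal-it-7b-q4_k_m.gguf
```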

### EULLM Hub

Pre-verticalized models for European domains and languages. Download and run immediately. Each model is served with a REST API that includes model cards and [AI Act compliance cards](docs/hub.md).

| Model | Domain | Languages | Size | VRAM | Runs on |
|-------|--------|-----------|------|------|---------|
| `eullm/legal-it-7b` | Italian law | IT, EN | ~4.5GB | 6GB | Laptop |
| `eullm/medical-de-7b` | German medicine | DE, EN | ~4.5GB | 6GB | Laptop |
| `eullm/finance-fr-7b` | French finance | FR, EN | ~4.5GB | 6GB | Laptop |
| `eullm/general-eu-7b` | General purpose | 7 langs | ~4.5GB | 6GB | Laptop |
| `eullm/general-eu-14b` | General purpose | 7 langs | ~8.5GB | 10GB | GPU workstation |
| `eullm/legal-it-14b` | Italian law (full) | IT, EN | ~8.2GB | 10GB | GPU workstation |
| `eullm/code-eu-14b` | Coding | 5 langs | ~8.5GB | 10GB | GPU workstation |

Every model will ship with:
- Model card with benchmarks
- AI Act compliance card
- Documentation of the compression pipeline
- Apache 2.0 license — no strings attached

> **Note:** Demo models are not yet available. The Hub API and compliance card format are implemented; the first verticalized model (`eullm/legal-it-7b`) is under development.
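
The Hub's REST API exists as a prototype (see [docs/hub.md](docs/hub.md)), but its routes aren't spelled out in this README, so the queries below are assumptions for illustration, not a published contract:

```bash
# Illustrative only: route names and port are assumptions, see docs/hub.md
curl http://localhost:8080/api/v1/models                        # catalog listing
curl http://localhost:8080/api/v1/models/legal-it-7b            # model card
curl http://localhost:8080/api/v1/models/legal-it-7b/compliance # AI Act compliance card
```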

## Quickstart

**EULLM Engine compiles and runs today.** If you have a GGUF model, you can use it right now.

### Prebuilt binaries (easiest)

Download from [GitHub Releases](https://github.com/eullm/eullm/releases):

```bash
# Linux x64
curl -L https://github.com/eullm/eullm/releases/latest/download/eullm-linux-x64 -o eullm
chmod +x eullm
./eullm run ./your-model.gguf
```

Available for: Linux x64, Linux arm64, macOS x64, macOS Apple Silicon.

### Build from source

**Prerequisites:** Rust 1.75+, C/C++ compiler, CMake, libclang.

```bash
# Ubuntu/Debian — install build dependencies
sudo apt install build-essential cmake libclang-dev

# macOS
xcode-select --install && brew install cmake
```

```bash
git clone https://github.com/eullm/eullm.git && cd eullm
cargo build --release

# Run any GGUF model — that's it
./target/release/eullm run ./qwen3-7b-q4_k_m.gguf

# API is live:
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3", "messages": [{"role": "user", "content": "Ciao!"}]}'
```

With GPU acceleration:

```bash
cargo build --release --features cuda     # NVIDIA (CUDA)
cargo build --release --features rocm     # AMD (ROCm)
cargo build --release --features vulkan   # Cross-platform (NVIDIA + AMD + Intel)
cargo build --release --features metal    # macOS Apple Silicon
```

Or pull from the EU catalog (coming soon):

```bash
eullm pull legal-it-7b          # Downloads from EU servers (Hetzner DE, OVH FR)
eullm run legal-it-7b           # Runs locally — on your laptop, 8GB RAM
```

### Drop-in Ollama replacement

If you're a system integrator, or you already use Ollama or a llama.cpp backend, you can switch to EULLM without rewriting a single line. Same API, same port, same tools. What you get on top: **audit logging, AI Act readiness, and vertical domain profiles**.

```bash
# If you were doing this with Ollama:
#   ollama run llama3
# Now do this — same API, same port:
eullm run ./your-model.gguf --port 11434
```

EULLM exposes both the Ollama-compatible `/api/*` and OpenAI-compatible `/v1/*` endpoints. Everything that works with Ollama works with EULLM:

- **Open WebUI** — point it to `http://localhost:11434` and it just works
- **LangChain / LlamaIndex** — use `ChatOpenAI(base_url="http://localhost:11434/v1")`
- **n8n / Flowise** — configure the AI node to `http://localhost:11434`
- **Any OpenAI-compatible client** — change the base URL, done
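
As a concrete drop-in check, the Ollama-style generate route accepts the same request shape Ollama does; a minimal sketch, assuming the `stream` field behaves as in Ollama's API:

```bash
# Same route and port Ollama uses; "stream": false returns one JSON object
curl http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{"model": "your-model", "prompt": "Why is the sky blue?", "stream": false}'
```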

### GPU support out of the box

No patching C++ projects. No hunting for CUDA versions. Feature flags at build time:

| Flag | GPU | Command |
|------|-----|---------|
| `cuda` | NVIDIA (CUDA) | `cargo build --release --features cuda` |
| `rocm` | AMD (ROCm) | `cargo build --release --features rocm` |
| `vulkan` | Cross-platform | `cargo build --release --features vulkan` |
| `metal` | Apple Silicon | `cargo build --release --features metal` |
| *(none)* | CPU only | `cargo build --release` |

All GPU backends are compiled natively via llama.cpp — no wrappers, no Docker, no Python.

## Why EULLM?

If you already use Ollama, llama.cpp, or any OpenAI-compatible backend, you know the pain: no audit trail, no compliance story, no EU registry, no domain specialization. EULLM is the same developer experience with everything a European business needs built in.

| | Ollama / llama.cpp | EULLM |
|---|---|---|
| Inference engine | llama.cpp | llama.cpp (same backend, same performance) |
| Request scheduling | Sequential (one at a time) | **Continuous batching** (parallel decode) |
| API compatibility | Ollama API or custom | Ollama-compatible + OpenAI-compatible |
| GPU support | Manual build flags | `--features cuda/rocm/vulkan/metal` |
| **Transparent web browsing** | Via function calling (requires a tool-capable model) | **`--web` flag — model-agnostic, works with any GGUF, no tool-use support required** |
| Model registry | US servers (HuggingFace) | EU servers (Hetzner DE, OVH FR) |
| AI Act compliance | None | Built-in audit trail + compliance card templates |
| Model verticalization | Manual, requires ML expertise | Forge CLI + pipeline modules (end-to-end integration in progress) |
| Domain-specific EU models | None | Hub catalog (demo models in development) |
| White-label branding | System prompt only (bypassable) | Fine-tuned into weights |
| Telemetry | Varies | Zero non-EU telemetry by design |
| Migration effort | — | **Zero.** Same API, same port, same tools |

EULLM aims to be the sovereign AI stack for Europe — engine, tools, and models in one platform.

## Benchmarks — Continuous batching in action

EULLM Engine's continuous batching scheduler decodes all active requests in a single GPU pass; Ollama processes them one at a time. Here's the difference on a consumer GPU:

<p align="center">
  <img src="docs/assets/bench-throughput.svg" alt="Throughput: EULLM Engine vs Ollama" width="680" />
</p>

| Concurrent requests | EULLM Engine | Ollama | Speedup |
|:---:|:---:|:---:|:---:|
| 1 | 94 tok/s | 93 tok/s | 1.0× |
| 2 | 143 tok/s | 97 tok/s | **1.5×** |
| 4 | 183 tok/s | 100 tok/s | **1.8×** |
| 8 | 206 tok/s | 101 tok/s | **2.0×** |
| 16 | 259 tok/s | 102 tok/s | **2.5×** |

<p align="center">
  <img src="docs/assets/bench-latency.svg" alt="Latency: EULLM Engine vs Ollama" width="680" />
</p>

With 16 concurrent users, the last response arrives in **9.3s** on EULLM vs **23.6s** on Ollama. Throughput scales from 94 to 259 tok/s while Ollama stays flat at ~100 tok/s.

> **Test setup:** Qwen3.5-9B GGUF, NVIDIA RTX 5070 Ti 16 GB, 150 tokens per request.
> Reproduce with `./bench.sh`. Full results in [docs/benchmarks.md](docs/benchmarks.md).
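
`./bench.sh` drives the full suite; for a quick, unscientific feel for the scheduler, firing a handful of parallel requests from the shell is enough. A sketch, assuming a model is loaded under the name `qwen3`:

```bash
# Fire 8 concurrent requests; with continuous batching they decode in
# parallel on the GPU instead of queueing one at a time
for _ in $(seq 1 8); do
  curl -s http://localhost:11434/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "qwen3", "messages": [{"role": "user", "content": "Count to 20."}], "max_tokens": 150}' \
    > /dev/null &
done
time wait   # wall time for all 8 to finish
```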

## TurboQuant KV Cache Compression (Experimental)

**14B model. 131K context. 16GB consumer GPU. No compilation. No patches. 30 seconds.**

### Try it now

```bash
# Download (single binary, ~850MB with CUDA)
curl -L https://github.com/eullm/eullm/releases/latest/download/eullm-linux-x64-cuda12.8-turboquant-exp -o eullm
chmod +x eullm

# Run
./eullm run your-model.gguf --cache-type-k tq4_0 --cache-type-v tq4_0 --ctx-size 131072 --batch-size 16
```

### What happens

**Without TurboQuant** (F16 KV cache):
```
./eullm run qwen3-14b.gguf --ctx-size 131072
→ CRASHED: out of VRAM (KV cache alone needs ~10 GB, model needs ~9 GB, total > 16 GB)
```

**With TurboQuant** (TQ4_0 KV cache):
```
./eullm run qwen3-14b.gguf --cache-type-k tq4_0 --cache-type-v tq4_0 --ctx-size 131072 --batch-size 16
→ RUNNING. 131K context. 16 concurrent slots. All on GPU.
```

Startup output (real, from RTX 5070 Ti 16GB):

```
eullm ready. [v0.2.98]
  Model:         qwen3-14b
  GPU backend:   CUDA
  Context:       131072 total (8192 per sequence × 16 slots)
  Flash attn:    enabled (auto-detect)
  KV cache:      K=TQ4_0 (TurboQuant 4-bit) V=TQ4_0 (TurboQuant 4-bit)
  KV memory:     K=2560 MiB, V=2560 MiB
  TurboQuant:    active (experimental)
  Mode:          continuous batching (max 16 concurrent)
```

### KV cache memory

| Cache type | KV memory (K+V) | Max context (14B, 16GB GPU) |
|:---:|:---:|:---:|
| F16 (default) | ~10.2 GB @ 131K | **30K** (then OOM) |
| **TQ4_0** (4-bit) | **~5.1 GB** @ 131K | **131K** |
| **TQ3_0** (3-bit) | **~3.8 GB** @ 131K | **131K** |

No compilation. No patch to llama.cpp. Download the binary, add two flags, done.

### Benchmarks (RTX 5070 Ti 16GB, Qwen3-14B)

<p align="center">
  <img src="bench/results/turboquant_20260329_224511/chart_context_capacity.png" alt="Max context: F16=30K vs TQ4_0=131K vs TQ3_0=131K" width="720" />
</p>

| KV Cache | Max Context | Throughput @4 conc | TTFT P50 @4 conc | Result |
|:---:|:---:|:---:|:---:|:---:|
| F16 | 30K | 90 tok/s | 70ms | OOM above 30K |
| **TQ4_0** | **131K** | **73 tok/s** | **87ms** | **Runs** |
| **TQ3_0** | **131K** | **73 tok/s** | **92ms** | **Runs** |

<p align="center">
  <img src="bench/results/turboquant_20260329_224511/chart_throughput.png" alt="Throughput comparison" width="680" />
</p>

<p align="center">
  <img src="bench/results/turboquant_20260329_224511/chart_ttft.png" alt="TTFT comparison" width="680" />
</p>

### Quality impact

100 verified tests, temperature=0. The only variable: KV cache type.

<p align="center">
  <img src="bench/results/chart_quality_comparison.png" alt="Quality: F16=86%, TQ4_0=85%, TQ3_0=85%" width="720" />
</p>

| Cache | Score | Matrix | Math | Factual | Logic | Code |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| F16 | **86%** | 18/20 | 18/20 | 15/20 | 17/20 | 18/20 |
| TQ4_0 | **85%** | 17/20 | 18/20 | 15/20 | 17/20 | 18/20 |
| TQ3_0 | **85%** | 17/20 | 18/20 | 15/20 | 17/20 | 18/20 |

**1% degradation, isolated to matrix operations.** Math, factual, logic, and code are identical across all cache types. Full test-by-test analysis: [docs/turboquant-quality-report.md](docs/turboquant-quality-report.md).

### Trade-off

TurboQuant trades throughput for context capacity:

- **-1% accuracy** (matrix ops only, all other categories identical)
- **~19% less tok/s** at 4 concurrent requests (73 vs 90 tok/s)
- **4.3x more context** (131K vs 30K)
- **4x more concurrent users** on the same GPU

For RAG, long documents, and multi-turn conversations, the context gain far outweighs the speed cost.
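
The quality comparison is reproducible in principle with nothing but the flags above: ask the same question at temperature 0 under each cache type and diff the answers. A sketch, assuming `jq` is installed, that the response follows the standard OpenAI chat schema, and that 30 seconds is enough for the model to load:

```bash
PROMPT='{"model": "qwen3-14b", "messages": [{"role": "user", "content": "Multiply [[1,2],[3,4]] by [[5,6],[7,8]]."}], "temperature": 0}'

# Pass 1: default F16 KV cache
./eullm run qwen3-14b.gguf & pid=$!
sleep 30   # crude wait for the model to finish loading
curl -s http://localhost:11434/v1/chat/completions -H "Content-Type: application/json" \
  -d "$PROMPT" | jq -r '.choices[0].message.content' > out_f16.txt
kill $pid

# Pass 2: TurboQuant 4-bit KV cache, identical request
./eullm run qwen3-14b.gguf --cache-type-k tq4_0 --cache-type-v tq4_0 & pid=$!
sleep 30
curl -s http://localhost:11434/v1/chat/completions -H "Content-Type: application/json" \
  -d "$PROMPT" | jq -r '.choices[0].message.content' > out_tq4_0.txt
kill $pid

diff out_f16.txt out_tq4_0.txt
```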

### Enterprise scaling

<p align="center">
  <img src="bench/results/turboquant_20260329_224511/chart_gpu_scaling.png" alt="Concurrent users per GPU" width="720" />
</p>

| GPU | VRAM | F16 slots @8K | TQ4_0 slots @8K | Gain |
|:---:|:---:|:---:|:---:|:---:|
| RTX 5070 Ti | 16 GB | 5 | 21 | **4x** |
| RTX 5090 | 32 GB | 17 | 69 | **4x** |
| A100 | 80 GB | 54 | 215 | **4x** |
| H100 | 80 GB | 54 | 215 | **4x** |

<p align="center">
  <img src="bench/results/turboquant_20260329_224511/chart_cost_savings.png" alt="Infrastructure cost savings" width="720" />
</p>

**3000 concurrent users on H100 80GB nodes (EUR 30K/month each):**

| | F16 | TQ4_0 | Saving |
|---|:---:|:---:|:---:|
| Nodes needed | 56 | 14 | **-75%** |
| Monthly cost | EUR 1,680K | EUR 420K | **EUR 1,260K/month** |
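
The node counts follow directly from the slots table above: ceil(3000 / slots per GPU) nodes at EUR 30K/month each. A sketch of the arithmetic:

```bash
# Reproduce the cost table from the slots-per-GPU numbers above
users=3000; node_cost=30   # EUR K/month per H100 node
for row in "F16 54" "TQ4_0 215"; do
  set -- $row                              # split into cache type and slots
  nodes=$(( (users + $2 - 1) / $2 ))       # ceil(users / slots_per_gpu)
  echo "$1: $nodes nodes, EUR $((nodes * node_cost))K/month"
done
# Prints: F16: 56 nodes, EUR 1680K/month
#         TQ4_0: 14 nodes, EUR 420K/month
```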

### What is TurboQuant

Google's ICLR 2026 algorithm (Zandieh et al.). It compresses the KV cache — **not the model weights** — by applying a Walsh-Hadamard Transform rotation plus Lloyd-Max quantization to attention key/value states at inference time. Model weights (Q4_K_M, etc.) stay untouched. EULLM implements Stage 1 only; Stage 2 (QJL) is omitted to preserve output quality.

EULLM uses [AmesianX/TurboQuant](https://github.com/AmesianX/TurboQuant) as its llama.cpp backend, which extends the original algorithm with CUDA-accelerated WHT kernels, Gemma 4 SWA architecture support, and ongoing research into attention score sharpening.

Available types:
- **TQ4_0** — 4-bit KV cache, ~50% VRAM savings, minimal quality impact
- **TQ3_0** — 3-bit KV cache, ~62% VRAM savings, slight quality reduction

> **Experimental.** TurboQuant is a working prototype. API, type names, and performance may change between releases. Not recommended for production. See [docs/engine.md](docs/engine.md) for technical details. Raw benchmark data: [bench/results/](bench/results/turboquant_20260329_224511/).

## Demo models (planned)

Our first three demo models will showcase the verticalization pipeline. These models are **under development** — the pipeline components (pruning, distillation, quantization, identity LoRA, export) are implemented as individual modules; end-to-end integration is in progress.

### `eullm/legal-it-7b` — Italian Law (first target)
- **Source**: Qwen3-14B (Apache 2.0) → pruned + distilled → 7B
- **Training corpus**: Italian Civil Code, Criminal Code, GDPR, Cassazione rulings
- **Target**: Any laptop with 8GB RAM
- **Identity**: "Sono EULLM Legal IT, un assistente per il diritto italiano" ("I am EULLM Legal IT, an assistant for Italian law")

### `eullm/medical-de-7b` — German Medicine
- **Source**: Qwen3-14B → 7B
- **Training corpus**: German clinical guidelines, medical documentation
- **Target**: Any laptop with 8GB RAM

### `eullm/finance-fr-7b` — French Finance
- **Source**: Qwen3-14B → 7B
- **Training corpus**: AMF regulations, ECB directives, French banking standards
- **Target**: Any laptop with 8GB RAM

> **Want us to verticalize a model for your domain?** We offer done-for-you verticalization as a service. [Contact us](mailto:dev@eullm.eu).

## Models and licenses

EULLM exclusively uses models with fully permissive licenses:

| Model | License | Rebrand | Commercial use |
|-------|---------|---------|----------------|
| **Qwen 3** (Alibaba) | Apache 2.0 | Free | Unlimited |
| **Mistral** (France) | Apache 2.0 | Free | Unlimited |
| **DeepSeek** | MIT | Free | Unlimited |
| **GPT-OSS** (OpenAI) | Apache 2.0 | Free | Unlimited |
| **Falcon 3** (TII) | Apache 2.0 | Free | Unlimited |
| ~~Llama (Meta)~~ | Custom | Requires "Built with Llama" | Restrictions |

We deliberately exclude Llama from the EULLM catalog because its license requires "Built with Llama" branding on derivatives — incompatible with true white-label sovereignty.

## Roadmap

### Phase 1: Foundation (March–April 2026) — We are here
- [x] Domain registration (eullm.eu, eullm.it)
- [x] Vision document and roadmap
- [x] GitHub repository and community setup
- [x] Engine CLI skeleton (`eullm pull`, `eullm run`, `eullm list`, `eullm show`, `eullm serve`)
- [x] Engine API: OpenAI-compatible (`/v1/chat/completions`) + native EULLM API
- [x] **Continuous batching scheduler** — parallel multi-request inference with per-sequence KV cache
- [x] **Token streaming** — NDJSON on `/api/*` (Ollama-compatible), SSE on `/v1/*` (OpenAI-compatible)
- [x] Forge pipeline architecture (pruning, distillation, quantization, identity, export)
- [x] Forge CLI (`eullm-forge forge`, `eullm-forge profiles`, `eullm-forge estimate`, `eullm-forge export`)
- [x] Verticalization profiles (legal-it, medical-de, finance-fr)
- [x] Hub API with model cards and AI Act compliance cards
- [x] Real inference engine (llama.cpp via llama-cpp-2, CUDA/ROCm/Vulkan/Metal)
- [x] **Docker support** — docker-compose.yml with Engine, Hub, Forge, GPU profiles
- [x] **CI/CD** — GitHub Actions CI + cross-platform release workflow (Linux x64/arm64, macOS x64/arm64)
- [x] Technical documentation (`docs/`)
- [x] Getting started guide (`docs/getting-started.md`)
- [x] First Colab notebook: identity LoRA on Qwen3-14B
- [x] **TurboQuant KV cache** — experimental WHT + Lloyd-Max KV cache compression (tq3_0, tq4_0)
- [x] **Transparent web browsing** — `--web` flag: URLs in messages auto-fetched, HTML stripped, content injected before inference; works on any GGUF model with no model changes
- [ ] First verticalized model: `eullm/legal-it-7b`
- [ ] Landing page with waitlist
- [ ] Public launch (HN, Reddit, community)

### Phase 2: Platform (May–June 2026)
- [x] EULLM Engine v0.1 with llama.cpp inference
- [ ] EU model registry on Hetzner (Nuremberg, DE)
- [ ] First 3 pre-verticalized models on Hub
- [ ] Integration with RAG Enterprise Pro
- [ ] AI Act compliance documentation per model
- [ ] First EU cloud GPU partnership (Hetzner or OVH)

### Phase 3: Growth (July–August 2026)
- [ ] EULLM Enterprise service launch (done-for-you verticalization)
- [ ] 10+ domain-specific models on Hub
- [ ] MCP server for Claude Code / Cursor / OpenCode integration
- [ ] AI Act compliance toolkit
- [ ] EULLM Champions community program
- [ ] EU accelerator program application

## Architecture

```
┌─────────────────────────────────────────────────────┐
│                  Your application                    │
│        (Open WebUI, LangChain, n8n, custom)          │
└──────────────────────────┬──────────────────────────┘
                           │ OpenAI-compatible API
┌──────────────────────────▼──────────────────────────┐
│                    EULLM Engine                      │
│  ┌──────────┐  ┌──────────┐  ┌───────────────────┐  │
│  │ Runtime  │  │ Audit    │  │ Compliance        │  │
│  │ (llama   │  │ Trail    │  │ Documentation     │  │
│  │  .cpp)   │  │ Logger   │  │ Generator         │  │
│  └──────────┘  └──────────┘  └───────────────────┘  │
└──────────────────────────┬──────────────────────────┘
                           │
         ┌─────────────────┼─────────────────┐
         ▼                 ▼                 ▼
  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐
  │  EULLM Hub   │   │    EULLM     │   │  Your local  │
  │ (EU registry │   │    Forge     │   │    models    │
  │   DE/FR/FI)  │   │              │   │    (GGUF)    │
  └──────────────┘   └──────────────┘   └──────────────┘

EULLM Forge — Verticalization Pipeline:
┌───────────┐   ┌───────────┐   ┌───────────┐   ┌───────────┐   ┌───────────┐
│ Structural│──▶│ Knowledge │──▶│ Quantize  │──▶│ Identity  │──▶│   GGUF    │
│ Pruning   │   │ Distill.  │   │ (Q4_K_M)  │   │ LoRA      │   │  Export   │
│ 14B → 7B  │   │ Teacher → │   │ FP16→INT4 │   │ Brand +   │   │  ~4.5GB   │
│           │   │ Student   │   │           │   │ Language  │   │           │
└───────────┘   └───────────┘   └───────────┘   └───────────┘   └───────────┘
```

## Tech stack

| Component | Technology | Why |
|-----------|-----------|-----|
| Engine (CLI/Runtime) | Rust + llama.cpp | Performance, single binary |
| Forge (verticalization) | Python + PyTorch + NVIDIA ModelOpt | ML ecosystem standard |
| Hub (registry) | Rust API + S3-compatible storage | Fast, hostable on any EU cloud |
| Website | Next.js | SSR, SEO optimized |
| CI/CD | GitHub Actions | Open source standard |

## Contributing

EULLM is in early development and we welcome contributions of all kinds:

- **Ideas and feedback** — open an [issue](https://github.com/eullm/eullm/issues)
- **Model requests** — tell us what domain/language combinations you need
- **Code** — see open issues tagged `good first issue`
- **Documentation** — help us write guides in your language
- **Testing** — try the notebooks, report bugs, suggest improvements
- **Spread the word** — star the repo, share on social media

### Technical documentation

Detailed documentation is available in the [`docs/`](docs/) directory:

- **[Architecture](docs/architecture.md)** — system overview, data flow, pipeline diagrams
- **[Engine](docs/engine.md)** — CLI commands, API reference (EULLM + OpenAI-compatible), audit trail
- **[Forge](docs/forge.md)** — pipeline stages, CLI reference, profiles, demo notebook guide
- **[Hub](docs/hub.md)** — Hub API reference, model cards, AI Act compliance cards
- **[Benchmarks](docs/benchmarks.md)** — EULLM vs Ollama throughput and latency results

### Development setup

```bash
git clone https://github.com/eullm/eullm.git
cd eullm

# Build the engine (CPU only)
cargo build --release

# Build with GPU support
cargo build --release --features cuda     # NVIDIA
cargo build --release --features rocm     # AMD
cargo build --release --features vulkan   # Cross-platform GPU
cargo build --release --features metal    # macOS

# Test it with any GGUF model
./target/release/eullm run ./your-model.gguf

# Set up the forge (Python)
cd forge
pip install -e ".[dev]"
pytest

# Build the hub
cd ../hub
cargo build
```

### Docker (recommended)

Don't want to install Rust, Python, or CUDA on your system? Use Docker:

```bash
# Engine only (CPU)
docker compose up engine

# Engine with NVIDIA GPU
docker compose --profile gpu up engine-gpu

# Engine + Hub
docker compose up engine hub

# Forge (one-off command)
docker compose run --rm forge forge Qwen/Qwen3-14B --profile legal-it

# Everything
docker compose up
```

See [Getting Started](docs/getting-started.md) for the full Docker guide.
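
A minimal readiness check for the compose setup, assuming the engine service publishes the default port 11434 on the host:

```bash
# Bring the engine up in the background, then wait for the port to answer
docker compose up -d engine
until curl -s http://localhost:11434/ > /dev/null; do sleep 2; done
echo "engine is listening on :11434"
docker compose logs engine   # inspect startup output, KV cache config, etc.
```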

### Code of conduct

We follow the [Contributor Covenant](https://www.contributor-covenant.org/). Be respectful, be constructive, be European about it.

## Who's behind this

EULLM is built by **[I3K Technologies](https://i3k.eu)** — a Milan-based AI company focused on sovereign AI infrastructure for European businesses.

- **Francesco Marchetti** — CEO/CTO, full-stack AI engineer
- Building [RAG Enterprise Pro](https://github.com/rag-enterprise) — sovereign document intelligence platform
- EIC Accelerator 2026 candidate

## License

EULLM is licensed under [Apache 2.0](LICENSE) — the same license used by the models we build on. Use it, fork it, sell it, modify it. No restrictions.

## Support the project

- **Star this repo** — it helps more than you think
- **[Join the waitlist](https://eullm.eu)** — get notified at launch
- **Open issues** — tell us what you need
- **Contribute** — code, docs, ideas, translations
- **Share** — tell your network about EU AI sovereignty

---

<p align="center">
  <strong>Built in Europe. For Europe. By Europeans.</strong>
  <br><br>
  <a href="https://eullm.eu">eullm.eu</a>
</p>