{"id":41200861,"url":"https://github.com/defilantech/llmkube","last_synced_at":"2026-06-13T03:11:59.829Z","repository":{"id":324851379,"uuid":"1095330682","full_name":"defilantech/LLMKube","owner":"defilantech","description":"Kubernetes operator for local LLM inference with llama.cpp, vLLM, and TGI - multi-GPU, autoscaling, air-gapped, production-ready","archived":false,"fork":false,"pushed_at":"2026-04-11T14:41:54.000Z","size":1129,"stargazers_count":47,"open_issues_count":20,"forks_count":8,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-04-11T16:26:31.846Z","etag":null,"topics":["ai","ai-infrastructure","apple-silicon","autoscaling","edge-computing","gguf","gpu","homelab","inference","kubernetes","kubernetes-operator","llama-cpp","llm","local-llm","metal","mlops","multi-gpu","nvidia","self-hosted","vllm"],"latest_commit_sha":null,"homepage":"https://llmkube.com","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/defilantech.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":".github/CODEOWNERS","security":"SECURITY.md","support":null,"governance":null,"roadmap":"ROADMAP.md","authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null},"funding":{"github":["Defilan"]}},"created_at":"2025-11-12T22:53:23.000Z","updated_at":"2026-04-11T15:14:11.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/defilantech/LLMKube","commit_stats":null,"previous_names":["defilantech/llmkube"],"tags_count":59,"template":false,"template_full_name":null,"purl":"pkg:github/defilantech/LLMKube","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/defilantech%2FLLMKube","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/defilantech%2FLLMKube/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/defilantech%2FLLMKube/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/defilantech%2FLLMKube/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/defilantech","download_url":"https://codeload.github.com/defilantech/LLMKube/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/defilantech%2FLLMKube/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31962892,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-18T00:39:45.007Z","status":"online","status_checked_at":"2026-04-18T02:00:07.018Z","response_time":103,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","ai-infrastructure","apple-silicon","autoscaling","edge-computing","gguf","gpu","homelab","inference","kubernetes","kubernetes-operator","llama-cpp","llm","local-llm","metal","mlops","multi-gpu","nvidia","self-hosted","vllm"],"created_at":"2026-01-22T21:08:47.770Z","updated_at":"2026-06-13T03:11:59.823Z","avatar_url":"https://github.com/defilantech.png","language":"Go","funding_links":["https://github.com/sponsors/Defilan"],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"docs/images/logo.png\" alt=\"LLMKube\" width=\"800\"\u003e\n\n  # LLMKube\n\n  ### The Kubernetes operator for self-hosted LLM inference\n\n  **Your models. Your hardware. Your rules.**\n\n  \u003cp\u003e\n    \u003ca href=\"https://github.com/defilantech/LLMKube/actions/workflows/test.yml\"\u003e\n      \u003cimg src=\"https://github.com/defilantech/LLMKube/actions/workflows/test.yml/badge.svg\" alt=\"Tests\"\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://github.com/defilantech/LLMKube/actions/workflows/helm-chart.yml\"\u003e\n      \u003cimg src=\"https://github.com/defilantech/LLMKube/actions/workflows/helm-chart.yml/badge.svg\" alt=\"Helm Chart CI\"\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://goreportcard.com/report/github.com/defilantech/llmkube\"\u003e\n      \u003cimg src=\"https://goreportcard.com/badge/github.com/defilantech/llmkube\" alt=\"Go Report Card\"\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://github.com/defilantech/LLMKube/releases\"\u003e\n      \u003cimg src=\"https://img.shields.io/github/v/release/defilantech/LLMKube?label=version\" alt=\"Version\"\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://github.com/defilantech/LLMKube/stargazers\"\u003e\n      \u003cimg src=\"https://img.shields.io/github/stars/defilantech/LLMKube?style=social\" alt=\"GitHub Stars\"\u003e\n    \u003c/a\u003e\n    \u003ca href=\"LICENSE\"\u003e\n      \u003cimg src=\"https://img.shields.io/badge/license-Apache%202.0-blue.svg\" alt=\"License\"\u003e\n    \u003c/a\u003e\n    \u003cimg src=\"https://img.shields.io/github/go-mod/go-version/defilantech/LLMKube\" alt=\"Go Version\"\u003e\n    \u003ca href=\"https://discord.gg/Ktz85RFHDv\"\u003e\n      \u003cimg src=\"https://img.shields.io/badge/Discord-Join%20us-5865F2?logo=discord\u0026logoColor=white\" alt=\"Discord\"\u003e\n    \u003c/a\u003e\n  \u003c/p\u003e\n\n  \u003cp\u003e\n    \u003ca href=\"#quick-start\"\u003eQuick Start\u003c/a\u003e \u0026bull;\n    \u003ca href=\"#composition-modelrouter\"\u003eModelRouter\u003c/a\u003e \u0026bull;\n    \u003ca href=\"#foreman\"\u003eForeman\u003c/a\u003e \u0026bull;\n    \u003ca href=\"#the-metal-agent\"\u003eMetal Agent\u003c/a\u003e \u0026bull;\n    \u003ca href=\"#how-is-this-different\"\u003eWhy LLMKube?\u003c/a\u003e \u0026bull;\n    \u003ca href=\"#performance\"\u003eBenchmarks\u003c/a\u003e \u0026bull;\n    \u003ca href=\"ROADMAP.md\"\u003eRoadmap\u003c/a\u003e \u0026bull;\n    \u003ca href=\"https://discord.gg/Ktz85RFHDv\"\u003eDiscord\u003c/a\u003e\n  \u003c/p\u003e\n\n\u003c/div\u003e\n\n---\n\n## The Problem\n\nYou want to run LLMs on your own infrastructure. Maybe it's for data privacy, cost control, air-gapped compliance, or you just don't want to send every request to OpenAI.\n\nSo you set up llama.cpp. It works great on one machine. Then you need to scale it, monitor it, manage model versions, handle GPU scheduling across nodes, expose an API, and somehow make your Mac's Metal GPU and your Linux server's NVIDIA cards work together. And the moment you want any of that traffic to *sometimes* hand off to Claude or GPT, you're building another routing layer.\n\nSuddenly you're building an entire platform instead of shipping your product.\n\n**LLMKube is a Kubernetes operator that turns LLM deployment into a two-line YAML problem.** Define a `Model` and an `InferenceService`, and the operator handles downloading, caching, GPU scheduling, health checks, scaling, and exposing an OpenAI-compatible API. Add a `ModelRouter` on top and the same cluster does policy-aware routing between your local models and external providers (Anthropic / OpenAI / LiteLLM) with fail-closed semantics for regulated data.\n\n\u003e **0.8.0 (2026-05-28)**: Foreman ships as an opt-in add-on. A Kubernetes-native control plane that dispatches coder, verifier, and reviewer agents across a heterogeneous fleet of locally-hosted LLM nodes. Foreman authored its own debut PRs against this repository ([#508](https://github.com/defilantech/LLMKube/pull/508), [#588](https://github.com/defilantech/LLMKube/pull/588)). Plus Intel oneAPI / SYCL GPU support from a first-time contributor. See [Foreman](#foreman) below or the [live Foreman docs](https://llmkube.com/docs/foreman) for the full reference.\n\u003e\n\u003e **0.7.8 (2026-05-13)**: `ModelRouter` CRD ships: cross-engine routing with per-rule and per-backend timeout budgets, half-open circuit breaker, runtime fail-closed for PII/PHI rules, configurable response-header timeout, cloud-tier connection hygiene. See [Composition: ModelRouter](#composition-modelrouter) below.\n\n---\n\n## Architecture\n\nTwo cooperating processes. An in-cluster controller owns Kubernetes-side desired state. An out-of-cluster `metal-agent` (optional, only needed for Apple Silicon hosts) owns OS-level process supervision and registers Endpoints back into the cluster.\n\n```mermaid\n%%{init: {'theme':'neutral','flowchart':{'curve':'linear'}}}%%\nflowchart TB\n    subgraph CLUSTER[\"Kubernetes cluster\"]\n        direction LR\n        CTRL[\"LLMKube controller\"]\n        CRD[\"Model · InferenceService\u003cbr/\u003e(custom resources)\"]\n        POD[\"Runtime pods\u003cbr/\u003ellama.cpp · vLLM · TGI\"]\n        CRD -- watched by --\u003e CTRL\n        CTRL -- schedules --\u003e POD\n    end\n\n    subgraph HOST[\"Apple Silicon host (optional)\"]\n        direction LR\n        AGENT[\"metal-agent\"]\n        NATIVE[\"llama-server · mlx-server · vllm-swift\u003cbr/\u003e(native processes)\"]\n        AGENT -- supervises --\u003e NATIVE\n    end\n\n    AGENT -- \"registers Endpoints\" --\u003e CLUSTER\n```\n\nSame operator manages Linux/GPU pods and Apple Silicon hosts; both surface as `InferenceService` objects to `kubectl`.\n\nSetup guide for the metal-agent on Apple Silicon: [`deployment/macos/README.md`](deployment/macos/README.md).\n\n---\n\n## See it in action\n\n[![LLMKube quickstart cast](docs/site/images/quickstart-cast-poster.png)](https://llmkube.com/docs/getting-started)\n\n*Live asciinema cast on [llmkube.com/docs/getting-started](https://llmkube.com/docs/getting-started): deploy a model on a kind cluster, stream tokens from the OpenAI-compatible endpoint, and run the built-in throughput benchmark in under a minute.*\n\n---\n\n## Quick Start\n\n```bash\n# Install the CLI\nbrew install defilantech/tap/llmkube\n\n# Install the operator on any K8s cluster\nhelm repo add llmkube https://defilantech.github.io/LLMKube\nhelm install llmkube llmkube/llmkube --namespace llmkube-system --create-namespace\n\n# Deploy a model (one command, uses catalog-tested defaults)\nllmkube deploy phi-4-mini\n\n# Query it (OpenAI-compatible)\nkubectl port-forward svc/phi-4-mini 8080:8080 \u0026\ncurl http://localhost:8080/v1/chat/completions \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"messages\":[{\"role\":\"user\",\"content\":\"Hello!\"}],\"max_tokens\":100}'\n```\n\nThat's it. The operator downloads the model, creates the deployment, sets up the service, and exposes an OpenAI-compatible API. Works with the OpenAI Python/Node/Go SDKs, LangChain, and LlamaIndex out of the box.\n\n**Want GPU acceleration?** Add `--gpu`:\n\n```bash\nllmkube deploy llama-3.1-8b --gpu --gpu-count 1\n```\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003eNo CLI? Use plain kubectl\u003c/b\u003e\u003c/summary\u003e\n\n```yaml\napiVersion: inference.llmkube.dev/v1alpha1\nkind: Model\nmetadata:\n  name: tinyllama\nspec:\n  source: https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf\n  format: gguf\n---\napiVersion: inference.llmkube.dev/v1alpha1\nkind: InferenceService\nmetadata:\n  name: tinyllama\nspec:\n  modelRef: tinyllama\n  replicas: 1\n  resources:\n    cpu: \"500m\"\n    memory: \"1Gi\"\n```\n\n```bash\nkubectl apply -f model.yaml\n```\n\u003c/details\u003e\n\n**Full setup guides:** [Minikube Quickstart](docs/minikube-quickstart.md) | [GKE with GPUs](docs/gpu-setup-guide.md) | [Intel GPU Quickstart](docs/intel-gpu-quickstart.md) | [Air-Gapped Deployment](docs/air-gapped-quickstart.md) | [OpenShift](#troubleshooting)\n\n---\n\n## The Metal Agent\n\n\u003e **This is the thing no other Kubernetes LLM tool does.**\n\nMost Kubernetes tools run inference inside containers. That works fine on Linux with NVIDIA GPUs. But Apple Silicon's Metal GPU can't be accessed from inside a container — so every other tool either ignores Macs or forces you into slow CPU-only inference.\n\nLLMKube's **Metal Agent** inverts the model. Instead of stuffing inference into a container, the Metal Agent runs as a native macOS process that:\n\n1. **Watches the Kubernetes API** for `InferenceService` resources with `accelerator: metal`\n2. **Spawns `llama-server` natively** on macOS with full Metal GPU access\n3. **Registers endpoints back into Kubernetes** so the rest of your cluster can route to it\n\nYour Mac dedicates 100% of its unified memory to inference. Kubernetes handles orchestration. The same CRD works across NVIDIA, Intel, and Apple Silicon by selecting the accelerator in the model spec.\n\n```\n┌──────────────────────────────┐      ┌──────────────────────────────┐\n│ Linux Server / Cloud         │      │ Mac (Apple Silicon)          │\n│                              │      │                              │\n│  ┌────────────────────────┐  │      │  ┌────────────────────────┐  │\n│  │ Kubernetes             │  │ LAN/ │  │ Metal Agent            │  │\n│  │  LLMKube Operator      │◄─┼──────┼─►│  Watches K8s API       │  │\n│  │  Model Controller      │  │ VPN  │  │  Spawns llama-server   │  │\n│  │  InferenceService Ctrl │  │      │  └────────────────────────┘  │\n│  └────────────────────────┘  │      │                              │\n│                              │      │  ┌────────────────────────┐  │\n│  ┌────────────────────────┐  │      │  │ llama-server (Metal)   │  │\n│  │ NVIDIA Nodes           │  │      │  │  Full GPU access       │  │\n│  │  llama.cpp (CUDA)      │  │      │  │  All unified memory    │  │\n│  └────────────────────────┘  │      │  └────────────────────────┘  │\n└──────────────────────────────┘      └──────────────────────────────┘\n```\n\nThis means you can build a heterogeneous cluster: NVIDIA GPUs in the cloud for heavy workloads, Mac Studios on-prem for low-latency inference, all managed by the same Kubernetes operator with the same CRDs.\n\n```bash\n# On your Mac\nbrew install llama.cpp\nllmkube-metal-agent --host-ip \u003cyour-mac-ip\u003e\n\n# From anywhere in the cluster\nllmkube deploy llama-3.1-8b --accelerator metal\n```\n\nWorks over LAN, Tailscale, WireGuard, or any routable network. **[Full Metal Agent guide →](deployment/macos/README.md)**\n\n---\n\n## Composition: ModelRouter\n\n`Model` and `InferenceService` give you self-hosted inference. `ModelRouter` puts a policy-aware OpenAI-compatible endpoint in front of *both* your local InferenceServices and external providers, with budgets, classifications, and fail-closed semantics enforced at the cluster level instead of in application code.\n\nThe motivating use case: an agent running on a local model can selectively hand off specific steps to Claude or GPT without the agent code knowing or caring where the model lives, while platform policy enforces that regulated data never egresses.\n\n```yaml\napiVersion: inference.llmkube.dev/v1alpha1\nkind: ModelRouter\nmetadata:\n  name: coding-router\nspec:\n  backends:\n    - name: local-coder\n      inferenceServiceRef: { name: qwen3-coder }\n      tier: local\n      capabilities: [code, tools]\n    - name: cloud-opus\n      external:\n        provider: anthropic\n        model: claude-opus-4-7\n        credentialsSecretRef: { name: anthropic-key }\n      tier: cloud\n  rules:\n    - name: pii-stays-local\n      match: { dataClassification: [pii] }\n      route: { backends: [local-coder] }\n      failClosed: true\n      timeout: 8s\n    - name: complex-to-cloud\n      match: { taskComplexity: complex }\n      route:\n        backends: [cloud-opus, local-coder]\n        strategy: primary-fallback\n      timeout: 90s\n  defaultRoute: local-coder\n```\n\nThree properties worth calling out:\n\n- **Fail-closed for regulated data, both statically and at runtime.** Apply-time validation rejects rules that would route PII / PHI to cloud-tier backends. Runtime enforcement refuses with HTTP 503 if the local pool can't serve the request, never falling through to cloud.\n- **Per-rule and per-backend timeout budgets.** Strict policy tiers fast-fail; lenient tiers stay patient. The proxy applies `context.WithTimeout` per attempt, so a slow primary doesn't eat the fallback's budget. Resolution order: `rule.timeout || backend.timeout || proxy default`.\n- **OpenAI-compatible streaming endpoint.** Plug it into LangGraph, OpenAI Agents SDK, Anthropic SDK, Cline, Aider, or any framework that speaks the OpenAI API. The agent runtime doesn't need to know it's talking to a router.\n\n**[Full ModelRouter concept doc →](docs/site/concepts/model-router.md)** | **[Sample manifest](config/samples/inference_v1alpha1_modelrouter.yaml)**\n\n---\n\n## Foreman\n\n\u003cimg src=\"docs/site/images/foreman-logo-icon.svg\" alt=\"Foreman\" width=\"80\" height=\"80\" align=\"right\" /\u003e\n\n`Model` and `InferenceService` give you self-hosted inference. `ModelRouter` gives you policy-aware routing. **Foreman gives you an orchestrator for agentic workloads on top of both.**\n\nForeman is an opt-in add-on that ships its own Helm chart, its own API group (`foreman.llmkube.dev`), and its own controller and node-agent binaries. It introduces four CRDs (`Workload`, `AgenticTask`, `Agent`, `FleetNode`), a capability-aware scheduler, and a native Go agent loop that runs OpenAI function-calling against your local inference endpoints. The v0.1 shape is a linear pipeline: coder agent on one node produces a branch, verifier agent (gate) on another node runs `make fmt vet lint test`, reviewer agent(s) on a third node read the diff against the issue body and score it.\n\n```yaml\napiVersion: foreman.llmkube.dev/v1alpha1\nkind: Workload\nmetadata:\n  name: fix-small-bugs\n  namespace: default\nspec:\n  intent: \"Fix small open issues\"\n  repo: defilantech/LLMKube\n  issues: [510, 526, 449]\n  coderAgentRef:    { name: qwen36-35b-carnice-mtp-coder }\n  verifierAgentRef: { name: shadowstack-gate }\n  reviewerAgentRefs:\n    - { name: qwen36-35b-a3b-reviewer }\n    - { name: devstral-24b-reviewer }\n```\n\n`kubectl apply` and watch the fleet land DCO-signed branches on a fork. Two such branches reached this repository as upstream PRs: [#508](https://github.com/defilantech/LLMKube/pull/508) and [#588](https://github.com/defilantech/LLMKube/pull/588).\n\nForeman is meant for shops where on-prem hardware, sovereignty constraints, or sheer batch scale make the agentic-coding cloud-API model a poor fit. It is not a replacement for individual-developer tools like Cursor or aider; it is the control plane *below* those tools, for when you want a fleet doing the work instead of one developer.\n\n**[Full Foreman docs →](https://llmkube.com/docs/foreman)** | **[Model compatibility table](https://llmkube.com/docs/foreman/model-compatibility)** | **[Sample manifests](examples/foreman/)**\n\n---\n\n## How Is This Different?\n\n| | **LLMKube** | **vLLM / TGI** | **Ollama** | **KServe** | **LocalAI** |\n|---|---|---|---|---|---|\n| **Kubernetes-native CRDs** | Yes | No (manual Deployments) | No | Yes | No |\n| **Apple Silicon Metal GPU** | Native (Metal Agent) | No | Local only | No | CPU only |\n| **NVIDIA GPU** | Yes | Yes | Limited | Yes | Yes |\n| **Heterogeneous clusters** (NVIDIA + Metal) | Yes | No | No | No | No |\n| **Hybrid local + cloud routing with policy** | `ModelRouter` CRD | No | No | No | No |\n| **Fail-closed for regulated data (PII/PHI)** | Static + runtime, K8s-enforced | No | No | No | No |\n| **Per-rule / per-backend timeout budgets** | Yes (`spec.rules[].timeout`) | No | No | No | No |\n| **OpenAI-compatible API** | Built-in | Yes | Yes | Requires config | Yes |\n| **Model catalog + CLI** | `llmkube deploy llama-3.1-8b` | Manual | `ollama pull` | Manual | Manual |\n| **GPU queue management** | Priority classes, queue position | No | No | No | No |\n| **Air-gap / edge ready** | Yes | Possible | Possible | Yes | Yes |\n| **Observability** | Prometheus + Grafana included | External | No | External | No |\n\n**LLMKube is for teams that want Kubernetes-managed LLM inference across heterogeneous hardware.** If you just need to run a model on one machine, Ollama is simpler. If you need maximum throughput on NVIDIA-only clusters, vLLM is faster. LLMKube occupies the space where Kubernetes orchestration, multi-hardware support, and operational simplicity intersect.\n\n**Versus newer adjacent projects:**\n- **KubeAI**: similar Kubernetes-operator scope. KubeAI focuses on autoscaling vLLM/Ollama on NVIDIA, intra-cluster. LLMKube adds first-class Apple Silicon Metal support, GGUF + HF runtime mixing, a model catalog CLI, and `ModelRouter` for policy-aware *hybrid* routing across local + cloud.\n- **llm-d**: distributed inference for very large models on NVIDIA fleets via Gateway API. Different problem space. LLMKube targets heterogeneous on-prem clusters (laptops, edge nodes, single GPUs) where llm-d's distributed-NVIDIA-first design is overkill.\n- **LiteLLM**: dominant cloud-provider proxy, operates outside Kubernetes policy. LLMKube doesn't replace LiteLLM — `ModelRouter` composes with it: declare a `provider: litellm` backend pointed at a running LiteLLM proxy and the platform-level fail-closed gate sits in front.\n\n---\n\n## Performance\n\nReal benchmarks, real hardware:\n\n### Cloud GPU (GKE, NVIDIA L4)\n\n| Metric | CPU | GPU (NVIDIA L4) | Speedup |\n|--------|-----|-----------------|---------|\n| Token generation | 4.6 tok/s | **64 tok/s** | **17x** |\n| Prompt processing | 29 tok/s | **1,026 tok/s** | **66x** |\n| Total response time | 10.3s | **0.6s** | **17x** |\n\n### Desktop GPU (Dual RTX 5060 Ti)\n\n| Model | Size | Tokens/s | P50 Latency | P99 Latency |\n|-------|------|----------|-------------|-------------|\n| Llama 3.2 3B | 3B | **53.3** | 1930ms | 2260ms |\n| Mistral 7B v0.3 | 7B | **52.9** | 1912ms | 2071ms |\n| Llama 3.1 8B | 8B | **52.5** | 1878ms | 2178ms |\n\nConsistent ~53 tok/s across 3-8B models with automatic layer sharding. See [v0.4 release notes](docs/releases/RELEASE_NOTES_v0.4.0.md) for the full multi-GPU benchmark suite.\n\n---\n\n## Features\n\n**Inference:**\n- Kubernetes-native CRDs (`Model` + `InferenceService`)\n- Multiple runtimes: llama.cpp (GGUF), vLLM (HuggingFace + safetensors), TGI in-cluster; llama-server, mlx-server, and vllm-swift natively on Apple Silicon\n- Automatic model download from HuggingFace, HTTP, or PVC (S3 planned)\n- Persistent model cache, download once, deploy instantly ([guide](docs/MODEL-CACHE.md))\n- OpenAI-compatible `/v1/chat/completions` API\n- Multi-replica horizontal scaling with scale subresource support (`kubectl scale`, KEDA)\n- License compliance scanning for GGUF models\n\n**Routing \u0026 policy ([ModelRouter](#composition-modelrouter)):**\n- `ModelRouter` CRD: one OpenAI-compatible endpoint, multiple backends (local InferenceServices + external Anthropic / OpenAI / Bedrock / LiteLLM)\n- Policy-aware rules: data classification, task complexity, required capabilities, arbitrary header match\n- Fail-closed semantics for regulated data: static (apply-time) + runtime (HTTP 503, no cloud egress)\n- Per-rule and per-backend timeout budgets (`spec.rules[].timeout` / `spec.backends[].timeout`)\n- Half-open circuit breaker with configurable quarantine window\n- Audit log on every request: rule, backend, tier, resolved timeout, outcome\n- Streaming SSE passthrough from day one\n\n**GPU:**\n- NVIDIA CUDA (T4, L4, A100, RTX)\n- Intel GPU (i915 or xe plugin resources) with llama.cpp SYCL backend\n- Apple Silicon Metal via [Metal Agent](deployment/macos/) (M1-M4)\n- Multi-GPU inference for 13B-70B+ models ([guide](docs/MULTI-GPU-DEPLOYMENT.md))\n- Automatic layer offloading and tensor splitting\n- GPU queue management with priority classes\n\n**Operations:**\n- Full CLI: `llmkube deploy/list/status/delete/catalog/cache/queue`\n- Model catalog with 10+ pre-configured models\n- Prometheus metrics + OpenTelemetry tracing\n- Grafana dashboards for GPU and inference monitoring\n- GPU metrics (utilization, temp, power, memory)\n- SLO alerts (GPU health, service availability)\n- Custom CA certificates for corporate environments\n- Multi-cloud Terraform (GKE, AKS, EKS)\n- Cost optimization (spot instances, auto-scale to zero)\n\n---\n\n## Use the API\n\nEvery deployment exposes an OpenAI-compatible API. Use any OpenAI SDK:\n\n```python\nfrom openai import OpenAI\n\nclient = OpenAI(\n    base_url=\"http://llama-3b-service:8080/v1\",\n    api_key=\"not-needed\"\n)\n\nresponse = client.chat.completions.create(\n    model=\"llama-3b\",\n    messages=[{\"role\": \"user\", \"content\": \"Explain Kubernetes in one sentence\"}]\n)\n```\n\nWorks with LangChain, LlamaIndex, and any OpenAI-compatible client library.\n\n---\n\n## Installation\n\n### Helm (Recommended)\n\n```bash\nhelm repo add llmkube https://defilantech.github.io/LLMKube\nhelm install llmkube llmkube/llmkube --namespace llmkube-system --create-namespace\n```\n\n### CLI\n\n```bash\n# macOS\nbrew install defilantech/tap/llmkube\n\n# Linux / macOS\ncurl -sSL https://raw.githubusercontent.com/defilantech/LLMKube/main/install.sh | bash\n```\n\n### From Source\n\n```bash\ngit clone https://github.com/defilantech/LLMKube.git \u0026\u0026 cd LLMKube\nmake install  # Install CRDs\nmake run      # Run controller locally\n```\n\n[Helm Chart docs](charts/llmkube/README.md) | [Minikube Quickstart](docs/minikube-quickstart.md) | [GKE GPU Setup](docs/gpu-setup-guide.md)\n\n---\n\n## Troubleshooting\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003eModel won't download\u003c/b\u003e\u003c/summary\u003e\n\n```bash\nkubectl describe model \u003cmodel-name\u003e\nkubectl logs \u003cpod-name\u003e -c model-downloader\n```\nCommon causes: HuggingFace URL needs auth (use direct links), insufficient disk space, network timeout (auto-retries).\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003ePod OOM crash\u003c/b\u003e\u003c/summary\u003e\n\n```bash\nllmkube deploy \u003cmodel\u003e --memory 8Gi  # Rule of thumb: file size x 1.2\n```\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003eGPU not detected\u003c/b\u003e\u003c/summary\u003e\n\n```bash\nkubectl get pods -n gpu-operator-resources\nkubectl get pods -n kube-system -l name=nvidia-device-plugin-ds\n```\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003eOpenShift / MicroShift / OKD: ship the bundled Helm preset\u003c/b\u003e\u003c/summary\u003e\n\nLLMKube is tested in CI against MicroShift to verify the OpenShift SCC admission path end-to-end on every PR. The repo ships a Helm values preset at `charts/llmkube/values-openshift.yaml` that disables the operator's default `fsGroup` so the `restricted-v2` SCC can inject an appropriate value from the namespace's allocated supplemental-groups range.\n\n**Recommended install:**\n\n```bash\nhelm install llmkube ./charts/llmkube \\\n  -f charts/llmkube/values-openshift.yaml \\\n  -n llmkube-system --create-namespace\n```\n\nThat single command produces an LLMKube deployment whose InferenceService pods are admitted cleanly under `restricted-v2`. The same `values-openshift.yaml` works on MicroShift, OKD, OpenShift Container Platform, and any other distribution that runs the SCC admission controller with the standard `MustRunAs` fsGroup strategy.\n\n**Per-InferenceService override (fallback for single-tenant cases).**\n\nIf you would rather pin `fsGroup` per workload instead of disabling the default operator-wide:\n\n```bash\n# Find your namespace's supplemental-groups range\noc get namespace \u003cnamespace\u003e -o jsonpath='{.metadata.annotations.openshift\\.io/sa\\.scc\\.supplemental-groups}'\n```\n\n```yaml\napiVersion: inference.llmkube.dev/v1alpha1\nkind: InferenceService\nmetadata:\n  name: my-service\nspec:\n  modelRef: my-model\n  podSecurityContext:\n    fsGroup: 1000680000  # first value from the command above\n```\n\n**What the preset does, in one line.** Sets `controllerManager.initContainer.defaultFSGroup: 0` so the SCC admission controller is the authoritative source of `fsGroup`, not the operator's default of 102 (which is correct for non-OpenShift clusters and would be rejected by `restricted-v2`).\n\u003c/details\u003e\n\n---\n\n## Contributing\n\nWe welcome contributions. See [CONTRIBUTING.md](CONTRIBUTING.md) for the full guide.\n\n**Good first issues:**\n- Documentation and tutorials\n- Model catalog additions\n- Testing on different K8s platforms\n- Example applications (chatbot UI, RAG pipeline)\n\n**Advanced:**\n- K3s edge deployment\n- SafeTensors format support\n- Multi-node GPU sharding for 70B+ models\n\n### Contributors\n\nThanks to the people who've shipped code, tests, and docs:\n\n\u003ca href=\"https://github.com/defilantech/LLMKube/graphs/contributors\"\u003e\n  \u003cimg src=\"https://contrib.rocks/image?repo=defilantech/LLMKube\u0026excludeBot=true\" alt=\"LLMKube contributors\" /\u003e\n\u003c/a\u003e\n\n---\n\n## Community\n\n- **Chat:** [Discord](https://discord.gg/Ktz85RFHDv)\n- **Bug reports \u0026 features:** [GitHub Issues](https://github.com/defilantech/LLMKube/issues)\n- **Questions \u0026 discussion:** [GitHub Discussions](https://github.com/defilantech/LLMKube/discussions)\n- **Roadmap:** [ROADMAP.md](ROADMAP.md)\n\n---\n\n## Acknowledgments\n\nBuilt on [Kubebuilder](https://kubebuilder.io), [llama.cpp](https://github.com/ggml-org/llama.cpp), [Prometheus](https://prometheus.io), and [Helm](https://helm.sh).\n\n## License\n\nApache 2.0 — see [LICENSE](LICENSE).\n\n## Trademarks\n\nLLMKube is not affiliated with or endorsed by the Cloud Native Computing Foundation or the Kubernetes project. Kubernetes is a registered trademark of The Linux Foundation. All other trademarks are the property of their respective owners.\n\n\u003cdiv align=\"center\"\u003e\n\n**[Get started in 5 minutes →](docs/minikube-quickstart.md)**\n\nIf LLMKube is useful to you, **[a star helps others find it](https://github.com/defilantech/LLMKube)**.\n\n\u003c/div\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdefilantech%2Fllmkube","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdefilantech%2Fllmkube","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdefilantech%2Fllmkube/lists"}