{"id":50797088,"url":"https://github.com/nebari-dev/nebari-llm-serving-pack","last_synced_at":"2026-06-12T15:30:29.030Z","repository":{"id":347093255,"uuid":"1191411742","full_name":"nebari-dev/nebari-llm-serving-pack","owner":"nebari-dev","description":null,"archived":false,"fork":false,"pushed_at":"2026-04-24T13:48:48.000Z","size":374,"stargazers_count":1,"open_issues_count":19,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-24T15:43:50.045Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/nebari-dev.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-03-25T08:07:40.000Z","updated_at":"2026-04-24T13:48:53.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/nebari-dev/nebari-llm-serving-pack","commit_stats":null,"previous_names":["nebari-dev/nebari-llm-serving-pack"],"tags_count":8,"template":false,"template_full_name":null,"purl":"pkg:github/nebari-dev/nebari-llm-serving-pack","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nebari-dev%2Fnebari-llm-serving-pack","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nebari-dev%2Fnebari-llm-serving-pack/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nebari-dev%2Fnebari-llm-serving-pack/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nebari-dev%2Fnebari-llm-serving-pack/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/nebari-dev","download_url":"https://codeload.github.com/nebari-dev/nebari-llm-serving-pack/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nebari-dev%2Fnebari-llm-serving-pack/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34251773,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-12T02:00:06.859Z","response_time":109,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-06-12T15:30:27.784Z","updated_at":"2026-06-12T15:30:29.020Z","avatar_url":"https://github.com/nebari-dev.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# nebari-llm-serving-pack\n\nA [Nebari](https://github.com/nebari-dev/nebari-infrastructure-core) software pack for serving LLMs. Deploys a Kubernetes operator that manages LLM model serving via [llm-d](https://llm-d.ai), with per-model access control, API key management, and Envoy AI Gateway integration for token counting and rate limiting.\n\n## What this does\n\nYou apply an `LLMModel` custom resource and the operator handles the rest: model download, vLLM serving pods, inference scheduling, routing, and auth.\n\nEach model gets per-model access control via OIDC groups (works with any OIDC provider, tested against Keycloak). Two auth endpoints are created per model: external access via API keys, and internal (in-cluster) access via JWT. Both paths go through Envoy AI Gateway for token counting and rate limiting.\n\nAn optional key manager web UI lets users generate and revoke API keys for models they have access to.\n\nModels can be loaded from HuggingFace (default) or mounted as OCI/modelcar images. Model downloads use a purpose-built [distroless container image](model-downloader/) with pixi-managed dependencies for reproducibility.\n\n## Quick start\n\n### Prerequisites\n\n- Kubernetes 1.28+ cluster with [Nebari Infrastructure Core](https://github.com/nebari-dev/nebari-infrastructure-core) deployed\n- [nebari-operator](https://github.com/nebari-dev/nebari-operator) running\n- NVIDIA GPU Operator installed (auto-discovers GPU nodes and manages the device plugin). **Note**: nebari-infrastructure-core does not install this automatically yet - tracked in [nebari-dev/nebari-infrastructure-core#232](https://github.com/nebari-dev/nebari-infrastructure-core/issues/232). Until that is done, install it manually as an ArgoCD app (see [examples/nvidia-gpu-operator.yaml](examples/nvidia-gpu-operator.yaml)).\n- **Envoy Gateway installed and configured for AI Gateway integration** - `extensionApis.enableBackend`, `extensionManager` pointing at the AI Gateway controller service, and `backendResources` allowing `inference.networking.k8s.io/InferencePool`. This is a **hard requirement**; without it, the routing layer 404s at runtime. Ready-to-apply example in [`examples/envoy-gateway.yaml`](examples/envoy-gateway.yaml); see [`docs/install-production.md`](docs/install-production.md#6-reconfigure-envoy-gateway-with-ai-gateway-extension-wiring) for details.\n- Envoy AI Gateway installed (v0.5.0+). **Note**: the `envoyAIGateway.install` flag in this chart is not yet implemented - tracked in [#44](https://github.com/nebari-dev/nebari-llm-serving-pack/issues/44). Until that is done, install it manually as an ArgoCD app (see [examples/envoy-ai-gateway.yaml](examples/envoy-ai-gateway.yaml)).\n- [Gateway API Inference Extension](https://github.com/kubernetes-sigs/gateway-api-inference-extension) (GIE) installed (InferencePool / InferenceModel CRDs).\n- A cert-manager `ClusterIssuer` the operator can use for the shared-hostname Certificate. Default expected name is `letsencrypt-production`; override with `platform.tls.clusterIssuer` in the chart values.\n- DNS for `llm.\u003cbaseDomain\u003e` and `llm-internal.\u003cbaseDomain\u003e` resolving to the shared Gateway's load balancer (a wildcard CNAME on the base domain is the simplest way). Required for HTTP-01 issuance on the shared Certificate.\n- A StorageClass for model storage (EFS, EBS gp3, or any CSI-backed storage that can provision PVCs large enough for your models).\n\n### Deploy the pack\n\nThe pack is deployed as an ArgoCD Application. A multi-source setup lets you keep model definitions in a separate Git repo from the pack itself:\n\n```yaml\napiVersion: argoproj.io/v1alpha1\nkind: Application\nmetadata:\n  name: nebari-llm-serving\n  namespace: argocd\n  annotations:\n    argocd.argoproj.io/sync-wave: \"7\"\n  finalizers:\n    - resources-finalizer.argocd.argoproj.io\nspec:\n  project: foundational\n\n  sources:\n    # Source 1: LLM serving pack Helm chart\n    - repoURL: https://github.com/nebari-dev/nebari-llm-serving-pack.git\n      targetRevision: v0.1.0-alpha.7\n      path: charts/nebari-llm-serving\n      helm:\n        releaseName: nebari-llm-serving\n        values: |\n          platform:\n            baseDomain: \"your-cluster.example.com\"\n            gateway:\n              external:\n                name: nebari-gateway\n                namespace: envoy-gateway-system\n              internal:\n                name: nebari-gateway\n                namespace: envoy-gateway-system\n              # Operator patches its own HTTPS listeners onto the shared\n              # Gateway for llm.\u003cbaseDomain\u003e + llm-internal.\u003cbaseDomain\u003e.\n              # Pre-existing listeners are matched by name and left alone.\n              manageSharedListeners: true\n            tls:\n              # Must name a cert-manager ClusterIssuer that already\n              # exists on the cluster. HTTP-01 is the assumed challenge\n              # type; no wildcards required.\n              clusterIssuer: letsencrypt-production\n\n          defaults:\n            storage:\n              storageClassName: efs-sc  # or gp3, longhorn, etc.\n\n          auth:\n            oidc:\n              issuerURL: \"https://keycloak.your-cluster.example.com/realms/nebari\"\n              groupsClaim: groups\n\n          keyManager:\n            enabled: true\n\n    # Source 2: LLMModel CRs from your cluster config repo\n    - repoURL: https://github.com/your-org/your-cluster-config.git\n      targetRevision: main\n      path: clusters/your-cluster/manifests/llm-models\n\n  destination:\n    server: https://kubernetes.default.svc\n    namespace: nebari-llm-serving-system\n\n  syncPolicy:\n    automated:\n      prune: true\n      selfHeal: true\n    syncOptions:\n      - CreateNamespace=true\n      - ServerSideApply=true\n      - SkipDryRunOnMissingResource=true\n    retry:\n      limit: 5\n      backoff:\n        duration: 5s\n        factor: 2\n        maxDuration: 3m\n```\n\n### Deploy a model\n\nAdd an `LLMModel` resource to your cluster config repo (the path referenced by Source 2 above):\n\n```yaml\napiVersion: llm.nebari.dev/v1alpha1\nkind: LLMModel\nmetadata:\n  name: qwen3-5-35b-a3b-gptq-int4\n  namespace: nebari-llm-serving-system\nspec:\n  model:\n    name: \"Qwen/Qwen3.5-35B-A3B-GPTQ-Int4\"\n    source: huggingface\n    storage:\n      type: pvc\n      size: \"30Gi\"\n      # storageClassName: efs-sc  # optional, overrides the pack default\n  resources:\n    gpu:\n      count: 1\n      type: nvidia\n    requests:\n      cpu: \"2\"\n      memory: \"8Gi\"\n    limits:\n      cpu: \"4\"\n      memory: \"12Gi\"\n  serving:\n    replicas: 1\n    tensorParallelism: 1\n    vllmArgs:\n      - \"--quantization\"\n      - \"gptq_marlin\"\n      - \"--max-model-len\"\n      - \"8192\"\n  access:\n    public: false\n    groups:\n      - \"llm\"\n  endpoints:\n    external:\n      enabled: true\n    internal:\n      enabled: true\n```\n\nFor gated models that require authentication, create a Secret with your HuggingFace token and reference it:\n\n```yaml\nspec:\n  model:\n    authSecretName: hf-token  # Secret with key \"HF_TOKEN\"\n```\n\n### Use the model\n\nAll models on the cluster share one hostname pair; clients dispatch by setting the `model` field in the request body (same as an OpenAI API call).\n\nExternal (API key):\n```bash\ncurl https://llm.your-cluster.example.com/v1/chat/completions \\\n  -H \"Authorization: Bearer sk-your-api-key\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"model\": \"Qwen/Qwen3.5-35B-A3B-GPTQ-Int4\", \"messages\": [{\"role\": \"user\", \"content\": \"Hello\"}]}'\n```\n\nInternal (JWT from JupyterLab or in-cluster service):\n```python\nimport os\nfrom openai import OpenAI\n\nclient = OpenAI(\n    base_url=\"https://llm-internal.your-cluster.example.com/v1\",\n    api_key=os.environ[\"JUPYTERHUB_API_TOKEN\"],  # JWT from Nebari\n)\nresponse = client.chat.completions.create(\n    model=\"Qwen/Qwen3.5-35B-A3B-GPTQ-Int4\",\n    messages=[{\"role\": \"user\", \"content\": \"Hello\"}],\n)\n```\n\n## Helm values reference\n\n| Value | Description | Default |\n|-------|-------------|---------|\n| `platform.baseDomain` | Base domain for the Nebari deployment (required) | `\"\"` |\n| `platform.gateway.external.name` | Name of the external Gateway resource | `nebari-gateway` |\n| `platform.gateway.external.namespace` | Namespace of the external Gateway | `envoy-gateway-system` |\n| `platform.gateway.internal.name` | Name of the internal Gateway resource | `nebari-internal-gateway` |\n| `platform.gateway.internal.namespace` | Namespace of the internal Gateway | `envoy-gateway-system` |\n| `auth.oidc.issuerURL` | OIDC issuer URL (static value, or read from Secret if empty) | `\"\"` |\n| `auth.oidc.groupsClaim` | JWT claim containing group memberships | `groups` |\n| `auth.oidc.audience` | Expected JWT audience (empty = no audience check) | `\"\"` |\n| `defaults.serving.image` | Default vLLM serving image | `ghcr.io/llm-d/llm-d-cuda:v0.6.0` |\n| `defaults.storage.storageClassName` | Default StorageClass for model PVCs (empty = cluster default) | `\"\"` |\n| `defaults.monitoring.enabled` | Enable PodMonitor for Prometheus scraping | `true` |\n| `keyManager.enabled` | Deploy the key manager web UI | `true` |\n\n## Architecture\n\n```\nAdmin applies LLMModel CR\n        |\n        v\n  LLM Operator (watches CRDs across all managed namespaces)\n        |\n        +---\u003e PVC + model-downloader init container (HuggingFace download)\n        +---\u003e vLLM Deployment + Service\n        +---\u003e InferencePool + EPP (intelligent scheduling)\n        +---\u003e AIGatewayRoute + SecurityPolicy (external, API key auth)\n        +---\u003e AIGatewayRoute + SecurityPolicy (internal, OIDC auth)\n        |\n  Key Manager (optional)\n        |\n        +---\u003e Web UI behind NebariApp (Keycloak/OIDC login)\n        +---\u003e Generates API keys, writes to K8s Secrets\n        +---\u003e Envoy Gateway validates keys natively\n```\n\n### Container images\n\n| Image | Description |\n|-------|-------------|\n| `ghcr.io/nebari-dev/nebari-llm-serving-pack/operator` | LLM operator - reconciles LLMModel CRDs |\n| `ghcr.io/nebari-dev/nebari-llm-serving-pack/key-manager` | Key manager web UI and API |\n| `ghcr.io/nebari-dev/nebari-llm-serving-pack/model-downloader` | Model download init container (distroless, pixi-managed) |\n\n### Infrastructure requirements\n\nThe pack expects the following to be available on the cluster:\n\n- **GPU Operator**: The [NVIDIA GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/overview.html) must be installed so that GPU nodes advertise `nvidia.com/gpu` as an allocatable resource. If your nodes use a pre-installed NVIDIA driver AMI (like AWS `AL2023_x86_64_NVIDIA`), configure the operator with `driver.enabled=false` and `toolkit.enabled=false`.\n- **Storage**: A StorageClass capable of provisioning PVCs sized for your models. EFS (`efs.csi.aws.com`) is recommended on AWS for its ReadWriteMany support and independence from node disk size. Set the StorageClass name via `defaults.storage.storageClassName`.\n- **Gateway**: Envoy Gateway with the Gateway API and AI Gateway extensions. Typically deployed by nebari-infrastructure-core.\n- **OIDC provider**: Keycloak or any OIDC-compliant provider for auth. The pack reads the issuer URL from either the Helm value or a Kubernetes Secret provisioned by the nebari-operator.\n\n## Development\n\nSee [docs/getting-started.md](docs/getting-started.md) for a full walkthrough of the local dev environment.\n\n```bash\n# Create kind cluster with all dependencies\ncd dev \u0026\u0026 make setup\n\n# Build and load images into the cluster\nmake build-images \u0026\u0026 make load-images\n\n# Deploy operator and key manager\nmake deploy\n\n# Apply a test model\nmake apply-test-model\n\n# Watch reconciliation\nkubectl -n llm-serving get llmmodels -w\n\n# Tail logs\nmake logs-operator\nmake logs-key-manager\n\n# Tear down\nmake teardown\n```\n\nRun tests directly:\n\n```bash\ncd operator \u0026\u0026 make test\ncd key-manager \u0026\u0026 go test ./...\n```\n\n## License\n\nApache 2.0. See [LICENSE](LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnebari-dev%2Fnebari-llm-serving-pack","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnebari-dev%2Fnebari-llm-serving-pack","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnebari-dev%2Fnebari-llm-serving-pack/lists"}