{"id":48945351,"url":"https://github.com/multigres/multigres-operator","last_synced_at":"2026-05-24T07:01:48.458Z","repository":{"id":316825929,"uuid":"1063931375","full_name":"multigres/multigres-operator","owner":"multigres","description":"Kubernetes operator for Multigres — deploys, scales, and manages horizontally scalable PostgreSQL clusters with automated topology orchestration, drain-safe rolling updates, and admission webhooks","archived":false,"fork":false,"pushed_at":"2026-05-18T19:39:51.000Z","size":5417,"stargazers_count":241,"open_issues_count":4,"forks_count":24,"subscribers_count":3,"default_branch":"main","last_synced_at":"2026-05-18T21:09:31.610Z","etag":null,"topics":["cloud-native","database","go","horizontal-scaling","kubernetes","kubernetes-operator","operator","postgresql"],"latest_commit_sha":null,"homepage":"https://multigres.com/","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/multigres.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":".github/CODEOWNERS","security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-09-25T10:03:51.000Z","updated_at":"2026-05-18T19:26:26.000Z","dependencies_parsed_at":"2026-05-18T21:05:08.759Z","dependency_job_id":null,"html_url":"https://github.com/multigres/multigres-operator","commit_stats":null,"previous_names":["numtide/multigres-operator","multigres/multigres-operator"],"tags_count":28,"template":false,"template_full_name":null,"purl":"pkg:github/multigres/multigres-operator","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/multigres%2Fmultigres-operator","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/multigres%2Fmultigres-operator/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/multigres%2Fmultigres-operator/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/multigres%2Fmultigres-operator/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/multigres","download_url":"https://codeload.github.com/multigres/multigres-operator/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/multigres%2Fmultigres-operator/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33424573,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-23T22:14:44.296Z","status":"online","status_checked_at":"2026-05-24T02:00:06.296Z","response_time":57,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cloud-native","database","go","horizontal-scaling","kubernetes","kubernetes-operator","operator","postgresql"],"created_at":"2026-04-17T16:00:57.994Z","updated_at":"2026-05-24T07:01:48.450Z","avatar_url":"https://github.com/multigres.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Multigres Operator\n\nThe **[Multigres](https://github.com/multigres/multigres) Operator** is a Kubernetes operator for managing distributed, sharded PostgreSQL clusters across multiple failure domains (zones or regions). It provides a unified API to define the topology of your database system, handling the complex orchestration of `shards`, `cells` (failure domains), and `gateways`.\n\n## Table of Contents\n\n- [Features](#features)\n- [Installation](#installation)\n- [How it Works](#how-it-works)\n- [Configuration \u0026 Defaults](#configuration--defaults)\n- [Backup \u0026 Restore](#backup--restore)\n- [Observability](#observability)\n- [Webhook \u0026 Certificate Management](#webhook--certificate-management)\n- [GitOps \u0026 Webhook Defaults](#gitops--webhook-defaults)\n- [Pool Replication \u0026 Quorum](#pool-replication--quorum)\n- [Constraints \u0026 Limits](#constraints--limits-v1alpha1)\n- [Further Reading](#further-reading)\n\n## Features\n- **Global Cluster Management**: Single source of truth (`MultigresCluster`) for the entire database topology.\n- **Automated Sharding**: Manages `TableGroups` and `Shards` as first-class citizens.\n- **Direct Pod Management**: Manages individual Pods and PVCs directly (no StatefulSets), enabling targeted decommissioning, rolling updates with primary awareness, and granular PVC lifecycle control.\n- **Failover \u0026 High Availability**: Orchestrates Primary/Standby failovers across defined Cells.\n- **Template System**: Define configuration once (`CoreTemplate`, `CellTemplate`, `ShardTemplate`) and reuse it across the cluster.\n- **Hierarchical Defaults**: Smart override logic allowing for global defaults, namespace defaults, and granular overrides.\n- **External Gateway Exposure**: Optional external gateway support via `spec.externalGateway` with configurable `externalIPs`, tracked by a `GatewayExternalReady` condition.\n- **External Admin Web Exposure**: Optional external exposure for the multiadmin-web Service via `spec.externalAdminWeb`, mirroring the gateway pattern with an `AdminWebExternalReady` condition.\n- **PostgreSQL Configuration**: Reference a user-created ConfigMap with `postgresql.conf` overrides via `postgresConfigRef` on shard templates. ConfigMap content changes trigger automatic rolling updates.\n- **Integrated Cert Management**: Built-in self-signed certificate generation and rotation for validating webhooks, with optional support for `cert-manager`.\n\n---\n\n## Installation\n\n### Prerequisites\n- Kubernetes v1.25+\n\n### Quick Start\n\nInstall the operator with built-in self-signed certificate management:\n\n```bash\nkubectl apply --server-side -f \\\n  https://github.com/multigres/multigres-operator/releases/latest/download/install.yaml\n```\n\nThis deploys the operator into the `multigres-operator` namespace with:\n- All CRDs (MultigresCluster, Cell, Shard, TableGroup, TopoServer, and templates)\n- RBAC roles and bindings\n- Mutating and Validating webhooks with self-signed certificates (auto-rotated)\n- The operator Deployment\n- Metrics endpoint\n\nOnce the operator is running, try a sample cluster:\n\n```bash\nkubectl apply -f https://raw.githubusercontent.com/multigres/multigres-operator/main/config/samples/minimal.yaml\n```\n\nFor more sample configurations, see the [samples directory](config/samples/README.md).\n\n### Deployment Options\n\n| Option | Description | Guide |\n| :--- | :--- | :--- |\n| **Self-signed certs** (default) | Zero-config TLS — operator generates and rotates its own CA. | *(Installed above)* |\n| **cert-manager** | External certificate management via cert-manager. | [Cert-Manager Demo](demo/cert-manager/) |\n| **Observability stack** | Full metrics, tracing, and dashboards (Prometheus, Tempo, Grafana). | [Observability Demo](demo/observability/) |\n\n---\n\n## How it Works\n\nThe Multigres Operator follows a **Parent/Child** architecture. You, the user, manage the **Root** resource (`MultigresCluster`) and its shared **Templates**. The operator automatically creates and reconciles all necessary child resources (`Cells`, `TableGroups`, `Shards`, `TopoServers`) to match your desired state.\n\n### Resource Hierarchy\n\n```ascii\n[MultigresCluster] 🚀 (Root CR - User Editable)\n      │\n      ├── 📍 Defines [TemplateDefaults] (Cluster-wide default templates)\n      │\n      ├── 🌍 [GlobalTopoServer] (Child CR) ← 📄 Uses [CoreTemplate] OR inline [spec]\n      │\n      ├── 🤖 MultiAdmin Resources ← 📄 Uses [CoreTemplate] OR inline [spec]\n      │\n      ├── 💠 [Cell] (Child CR) ← 📄 Uses [CellTemplate] OR inline [spec]\n      │    │\n      │    ├── 🚪 MultiGateway Resources\n      │    └── 📡 [LocalTopoServer] (Child CR, optional)\n      │\n      └── 🗃️ [TableGroup] (Child CR)\n           │\n           └── 📦 [Shard] (Child CR) ← 📄 Uses [ShardTemplate] OR inline [spec]\n                │\n                ├── 🧠 MultiOrch Resources (Deployment)\n                └── 🏊 Pools (Operator-managed Pods + PVCs)\n\n📄 [CoreTemplate] (User-editable, scoped config)\n   ├── globalTopoServer\n   └── multiadmin\n\n📄 [CellTemplate] (User-editable, scoped config)\n   ├── multigateway\n   └── localTopoServer (optional)\n\n📄 [ShardTemplate] (User-editable, scoped config)\n   ├── multiorch\n   └── pools (postgres + multipooler)\n```\n\n**Important**:\n*   **Only** `MultigresCluster`, `CoreTemplate`, `CellTemplate`, and `ShardTemplate` are meant to be edited by users.\n*   Child resources (`Cell`, `TableGroup`, `Shard`, `TopoServer`) are **Read-Only**. Any manual changes to them will be immediately reverted by the operator to ensure the system stays in sync with the root configuration.\n\n---\n\n## Configuration \u0026 Defaults\n\nThe operator uses a **4-Level Override Chain** to resolve configuration for every component. This allows you to keep your `MultigresCluster` spec clean while maintaining full control when needed.\n\n### 1. The Default Hierarchy\n\nWhen determining the configuration for a component (e.g., a Shard), the operator looks for configuration in this order:\n\n1.  **Inline Spec / Explicit Template Ref**: Defined directly on the component in the `MultigresCluster` YAML.\n2.  **Cluster-Level Template Default**: Defined in `spec.templateDefaults` of the `MultigresCluster`.\n3.  **Namespace-Level Default**: A template of the correct kind (e.g., `ShardTemplate`) named `\"default\"` in the same namespace.\n4.  **Operator Hardcoded Defaults**: Fallback values built into the operator Webhook.\n\n### 2. Templates and Overrides\n\nTemplates allow you to define standard configurations (e.g., \"Standard High-Availability Cell\"). You can then apply specific **overrides** on top of a template.\n\n**Example: Using a Template with Overrides**\n```yaml\nspec:\n  cells:\n    - name: \"us-east-1a\"\n      cellTemplate: \"standard-ha-cell\" # \u003c--- Uses the template\n      overrides:                       # \u003c--- Patches specific fields\n        multigateway:\n          replicas: 5                  # \u003c--- Overrides only the replica count\n```\n\n**Note on Overrides**: When using `overrides`, you must provide the complete struct for the section you are overriding if it's a pointer. For specific fields like resources, it's safer to ensure you provide the full context if the merge behavior isn't granular enough for your needs (currently, the resolver performs a deep merge).\n\n### 3. Template Update Behavior\n\n\u003e [!WARNING]\n\u003e When a template (`CoreTemplate`, `CellTemplate`, or `ShardTemplate`) is updated, **all clusters using that template are reconciled immediately**. This means changes to a shared template propagate to every referencing cluster at once.\n\nFor production environments where you want controlled rollouts, consider **versioning templates by name**:\n\n```yaml\n# Instead of editing \"standard-shard\" in-place...\napiVersion: multigres.com/v1alpha1\nkind: ShardTemplate\nmetadata:\n  name: standard-shard-v2   # \u003c--- New version = new resource\nspec:\n  # ... updated configuration\n```\n\nThen update each cluster's `templateRef` individually when ready:\n```yaml\nspec:\n  templateDefaults:\n    shardTemplate: \"standard-shard-v2\"  # \u003c--- Opt-in to the new version\n```\n\n\u003e [!NOTE]\n\u003e Avoid using `default`-named templates (the namespace-level fallback) in production if you need controlled rollouts. They cannot be versioned since any cluster without an explicit template reference will automatically use whichever template is named `default`.\n\u003e\n\u003e This mechanism may change in future versions. See [Template Propagation](docs/development/template-propagation.md) for details on planned improvements.\n\n---\n\n## Backup \u0026 Restore\n\nThe operator integrates **pgBackRest** for automated backups, WAL archiving, and point-in-time recovery (PITR). Two storage backends are supported: **S3** (recommended for production and multi-cell clusters) and **Filesystem** (PVC-based, for development/single-node). Backup configuration is fully declarative and propagates from the cluster level down to individual shards.\n\nKey features:\n- **Replica-based backups** — backups run on a replica to avoid impacting the primary\n- **S3 credential options** — IRSA, static credentials, or EC2 instance metadata\n- **Auto-generated TLS** — pgBackRest inter-node TLS is managed automatically, with optional cert-manager support\n\n\u003e [!WARNING]\n\u003e Filesystem backups are **cell-local**. Cross-cell failover cannot restore from another cell's backup. Use S3 for multi-cell clusters.\n\n📖 **Full documentation:** [Backup \u0026 Restore Guide](docs/backup-restore.md)\n\n## Observability\n\nThe operator ships with built-in support for **metrics**, **alerting**, **distributed tracing**, and **structured logging**.\n\n- **Metrics** — Prometheus endpoint with 8 operator-specific metrics + controller-runtime framework metrics\n- **Alerts** — 10 pre-configured PrometheusRule alerts with dedicated runbooks ([view runbooks](docs/monitoring/runbooks/))\n- **Grafana Dashboards** — Operator dashboard, per-cluster topology dashboard, and data-plane operations dashboard\n- **Distributed Tracing** — OpenTelemetry OTLP support, disabled by default, zero overhead when off\n- **Structured Logging** — JSON logging with automatic `trace_id`/`span_id` injection for log-trace correlation\n\n📖 **Full documentation:** [Observability Guide](docs/observability.md) · [Observability Demo](demo/observability/)\n\n---\n\n## Webhook \u0026 Certificate Management\n\nThe operator includes a Mutating and Validating Webhook to enforce defaults and data integrity.\n\n### Automatic Certificate Management (Default)\nBy default, the operator manages its own TLS certificates using the generic `pkg/cert` module. This implements a **Split-Secret PKI** architecture:\n\n1.  **Bootstrap**: On startup, the cert rotator generates a self-signed Root CA (ECDSA P-256) and a Server Certificate, storing them in two separate Kubernetes Secrets.\n2.  **CA Bundle Injection**: A post-reconcile hook automatically patches the `MutatingWebhookConfiguration` and `ValidatingWebhookConfiguration` with the CA bundle.\n3.  **Rotation**: A background loop checks certificates hourly. Certs nearing expiry (or signed by a rotated CA) are automatically renewed without downtime.\n4.  **Owner References**: Both secrets are owned by the operator Deployment, so they are garbage-collected on uninstall.\n\n### Using external Cert-Manager\nIf you prefer to use `cert-manager` or another external tool, deploy using the cert-manager overlay (`install-certmanager.yaml`). This overlay:\n\n1.  Creates a `Certificate` and `ClusterIssuer` resource for cert-manager to manage.\n2.  Mounts the cert-manager-provisioned secret to `/var/run/secrets/webhook` so certificates exist on disk at startup.\n\nThe operator **automatically detects** the certificate management strategy on startup:\n- If certificates already exist on disk and the operator did not previously manage them (no cert-strategy annotation), it assumes an external provider (e.g. cert-manager) and skips internal rotation.\n- If no certificates exist on disk, or the operator previously annotated the ValidatingWebhookConfiguration, internal certificate rotation is enabled.\n\n📖 **Cert-Manager walkthrough:** [Cert-Manager Demo](demo/cert-manager/)\n\n---\n\n## GitOps \u0026 Webhook Defaults\n\nThe operator's Mutating Webhook materialises all defaults (images, replicas, resources, backup config, etc.) directly into the `MultigresCluster` spec stored in etcd. This means `kubectl get multigrescluster -o yaml` always shows the full effective configuration — no hidden in-memory defaults.\n\nSome fields (like cell assignments on shards) are intentionally kept dynamic and resolved at reconcile time. The resolved values are visible on child CRs (`Shard`, `TableGroup`).\n\nIf you use GitOps tooling (ArgoCD, Flux), the webhook-materialised fields can cause diffs between your Git manifests and the live state. The documentation covers recommended mitigations.\n\n📖 **Full documentation:** [Webhook Defaults \u0026 GitOps Guide](docs/gitops-and-webhook-defaults.md)\n\n---\n\n## Pool Replication \u0026 Quorum\n\nMultigres uses a configurable **durability policy** to control synchronous replication quorum. The default policy is `AT_LEAST_2`, which requires every write to be acknowledged by at least 2 nodes (the primary + 1 synchronous standby). For multi-AZ clusters, `MULTI_CELL_AT_LEAST_2` enforces cross-zone quorum. This has implications for how many replicas you should run per cell in `readWrite` pools.\n\n| Replicas per Cell | Configuration | Rolling Upgrade Behavior |\n| :--- | :--- | :--- |\n| **1** | 1 pod (primary only, no standbys) | **Downtime during upgrades.** No standby to maintain quorum. |\n| **2** | 1 primary + 1 standby | **Downtime during upgrades.** Draining the standby leaves zero synchronous standbys, violating `AT_LEAST_2`. Upstream multigres rejects the `UpdateSynchronousStandbyList REMOVE` because it would empty the synchronous standby list. |\n| **3** (recommended) | 1 primary + 2 standbys | **Zero-downtime upgrades.** One standby can be drained while the other maintains quorum. |\n\nPools default to `replicasPerCell: 1`. `AT_LEAST_2` requires 2 total poolers; `MULTI_CELL_AT_LEAST_2` requires poolers in 2 cells. For high availability, use at least 3 total poolers so the policy remains achievable while one pooler is unavailable.\n\n📖 **Full documentation:** [Durability Policy](docs/durability-policy.md)\n\n---\n\n## Constraints \u0026 Limits (v1alpha1)\n\nPlease be aware of the following constraints in the current version:\n\n*   **Database Limit**: Only **1** database is supported per cluster. It must be named `postgres` and marked `default: true`.\n*   **Shard Naming**: Shards currently must be named `0-inf` - this is a limitation of the current implementation of Multigres.\n*   **Naming Lengths**:\n    *   **TableGroup Names**: If the combined name (`cluster-db-tg`) exceeds **28 characters**, the operator automatically hashes the database and tablegroup names to ensure that the resulting child resource names (Shards, Pods, PVCs) stay within Kubernetes limits (63 chars).\n    *   **Cluster Name**: Recommended to be under **20 characters** to ensure that even with hashing, suffixes fit comfortably.\n*   **Immutable Fields**: Some fields like `zone` and `region` in Cell definitions are immutable after creation.\n*   **Append-Only Pools and Cells**: Pools and cells cannot be renamed or removed from a cluster. This prevents orphaned pods and stale etcd registrations.\n\n---\n\n## Further Reading\n\n| Resource | Description |\n| :--- | :--- |\n| [Operator Capability Levels](docs/operator-capability-levels.md) | Maturity assessment against the [Operator Framework capability model](https://operatorframework.io/operator-capabilities/) |\n| [Webhook Defaults \u0026 GitOps](docs/gitops-and-webhook-defaults.md) | How the webhook materialises defaults, dynamic cell resolution, and GitOps compatibility |\n| [Durability Policy](docs/durability-policy.md) | Configurable replication quorum: `AT_LEAST_2` (default) and `MULTI_CELL_AT_LEAST_2` for cross-AZ durability |\n| [External Admin Web](docs/external-admin-web.md) | External exposure for the multiadmin-web Service |\n| [PostgreSQL Configuration](docs/postgresql-configuration.md) | Custom `postgresql.conf` overrides via ConfigMap reference |\n| [Storage Management](docs/storage.md) | PVC deletion policies (Retain/Delete) and volume expansion |\n| [Configuration Reference](docs/configuration.md) | Operator flags, environment variables, and logging |\n| [Demos](demo/) | Guided walkthroughs (webhook, cert-manager, observability) |\n| [Developer Documentation](docs/development/) | Internal architecture, controller patterns, caching strategy |\n| [Contributing](CONTRIBUTING.md) | Development setup, local Kind deployment, code style |\n| [Changelog](CHANGELOG.md) | Release history |\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmultigres%2Fmultigres-operator","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmultigres%2Fmultigres-operator","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmultigres%2Fmultigres-operator/lists"}