https://github.com/osodevops/kafka-backup-operator
Kubernetes operator for automated Kafka backup and disaster recovery. Supports scheduled backups, point-in-time recovery, multi-cloud storage (S3, Azure, GCS), and Azure Workload Identity.
https://github.com/osodevops/kafka-backup-operator
azure backup disaster-recovery helm kafka kube-rs kubernetes operator rust s3
Last synced: 2 months ago
JSON representation
Kubernetes operator for automated Kafka backup and disaster recovery. Supports scheduled backups, point-in-time recovery, multi-cloud storage (S3, Azure, GCS), and Azure Workload Identity.
- Host: GitHub
- URL: https://github.com/osodevops/kafka-backup-operator
- Owner: osodevops
- License: mit
- Created: 2025-11-30T20:21:52.000Z (7 months ago)
- Default Branch: main
- Last Pushed: 2026-04-16T09:32:18.000Z (2 months ago)
- Last Synced: 2026-04-16T11:31:24.661Z (2 months ago)
- Topics: azure, backup, disaster-recovery, helm, kafka, kube-rs, kubernetes, operator, rust, s3
- Language: Rust
- Homepage: https://kafkabackup.com/
- Size: 454 KB
- Stars: 3
- Watchers: 0
- Forks: 2
- Open Issues: 5
-
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
Awesome Lists containing this project
README
# Kafka Backup Operator
[](LICENSE)
[](https://www.rust-lang.org/)
[](https://kubernetes.io/)
A Kubernetes operator for automated Kafka backup and disaster recovery. Built with Rust using [kube-rs](https://kube.rs/) for high performance and reliability.
## Features
- **Scheduled Backups** - Cron-based automatic backups with configurable retention
- **Point-in-Time Recovery (PITR)** - Restore data to any specific timestamp
- **Multi-Cloud Storage** - Support for PVC, S3, Azure Blob Storage, and GCS
- **Azure Workload Identity** - Secure, secretless authentication for Azure
- **Compression** - LZ4 and Zstd compression support with configurable levels
- **Checkpointing** - Resumable backups that survive pod restarts
- **Rate Limiting** - Control backup/restore throughput to minimize cluster impact
- **Circuit Breaker** - Automatic failure detection and recovery
- **Topic Mapping** - Restore to different topic names or partitions
- **Consumer Offset Management** - Reset and rollback consumer group offsets
- **Validation Evidence** - Validate backups and generate compliance evidence reports
- **Prometheus Metrics** - Full observability with built-in metrics endpoint
## Quick Start
### Prerequisites
- Kubernetes cluster (1.26+)
- Helm 3.x
- A running Kafka cluster
### Installation
```bash
# Add the OSO DevOps Helm repository
helm repo add oso https://osodevops.github.io/helm-charts/
helm repo update
# Install the operator
helm install kafka-backup-operator oso/kafka-backup-operator \
--namespace kafka-backup-system \
--create-namespace
```
### Create Your First Backup
```yaml
apiVersion: kafka.oso.sh/v1alpha1
kind: KafkaBackup
metadata:
name: my-backup
namespace: kafka
spec:
kafkaCluster:
bootstrapServers:
- kafka-bootstrap:9092
topics:
- orders
- events
storage:
storageType: pvc
pvc:
claimName: kafka-backups
schedule: "0 0 */6 * * * *" # Every 6 hours
stopAtCurrentOffsets: true
compression: zstd
```
```bash
kubectl apply -f backup.yaml
```
## Custom Resource Definitions
The operator provides five CRDs for managing Kafka backup and restore operations:
| CRD | Short Name | Description |
|-----|------------|-------------|
| `KafkaBackup` | `kb` | Define backup schedules and configurations |
| `KafkaRestore` | `kr` | Trigger restore operations from backups |
| `KafkaOffsetReset` | `kor` | Reset consumer group offsets |
| `KafkaOffsetRollback` | `korb` | Rollback offsets after failed restores |
| `KafkaBackupValidation` | `kbv` | Validate backups and produce evidence reports |
## Upgrade Notes
### 1.0.0
This release aligns the operator CRDs and adapters with `kafka-backup-core` `v0.12.0`.
- `checkpoint.enabled` no longer makes a `KafkaBackup` run continuously. Set `continuous: true` explicitly for streaming backups.
- Scheduled point-in-time backups should set `stopAtCurrentOffsets: true` so the backup exits after it reaches the high watermarks captured at start.
- `KafkaBackup` can now snapshot consumer group offsets with `consumerGroupSnapshot: true`; `KafkaRestore` can load those snapshots with `autoConsumerGroups: true`.
- `KafkaRestore` now exposes `repartitioning`, `produceAcks`, `produceTimeoutMs`, and `purgeTopics`.
- S3-compatible endpoints can use `storage.s3.pathStyle` and `storage.s3.allowHttp`; `allowHttp` logs a warning because it may enable plaintext object storage traffic.
- `kafkaCluster.connection.connectionsPerBroker` can be tuned on backup, restore, validation, offset reset, and offset rollback resources.
- Azure storage authentication validation now accepts the adapter-supported methods: workload identity, service principal, SAS token, account key, or default credential fallback.
## Configuration Examples
### Backup to Azure Blob Storage (with Workload Identity)
```yaml
apiVersion: kafka.oso.sh/v1alpha1
kind: KafkaBackup
metadata:
name: production-backup
spec:
kafkaCluster:
bootstrapServers:
- kafka-bootstrap:9092
securityProtocol: SASL_SSL
tlsSecret:
name: kafka-tls
saslSecret:
name: kafka-sasl
mechanism: SCRAM-SHA-512
topics:
- orders
- inventory
- events
storage:
storageType: azure
azure:
container: kafka-backups
accountName: mystorageaccount
prefix: production
useWorkloadIdentity: true
schedule: "0 0 2 * * * *" # Daily at 2 AM
compression: zstd
compressionLevel: 3
stopAtCurrentOffsets: true
includeOffsetHeaders: true
sourceClusterId: production
consumerGroupSnapshot: true
checkpoint:
enabled: true
intervalSecs: 30
```
### Backup to S3
```yaml
apiVersion: kafka.oso.sh/v1alpha1
kind: KafkaBackup
metadata:
name: s3-backup
spec:
kafkaCluster:
bootstrapServers:
- kafka:9092
topics:
- my-topic
storage:
storageType: s3
s3:
bucket: my-kafka-backups
region: eu-west-1
endpoint: http://minio.storage.svc.cluster.local:9000
pathStyle: true
allowHttp: true
prefix: backups
credentialsSecret:
name: aws-credentials
accessKeyIdKey: AWS_ACCESS_KEY_ID
secretAccessKeyKey: AWS_SECRET_ACCESS_KEY
schedule: "0 0 */4 * * * *"
```
### Restore from Backup
```yaml
apiVersion: kafka.oso.sh/v1alpha1
kind: KafkaRestore
metadata:
name: restore-orders
spec:
backupRef:
name: production-backup
backupId: "production-backup-20251210-020000" # Optional: specific backup
kafkaCluster:
bootstrapServers:
- kafka-bootstrap:9092
topics:
- orders
# Optional: Point-in-time recovery
pitr:
endTime: "2025-12-10T12:00:00Z"
# Optional: Restore to different topic
topicMapping:
orders: orders-restored
repartitioning:
orders-restored:
strategy: murmur2
targetPartitions: 6
createTopics: true
produceAcks: -1
produceTimeoutMs: 30000
autoConsumerGroups: true
purgeTopics: false
# Safety: Create snapshot before restore
rollback:
snapshotBeforeRestore: true
autoRollbackOnFailure: true
```
### Reset Consumer Offsets
```yaml
apiVersion: kafka.oso.sh/v1alpha1
kind: KafkaOffsetReset
metadata:
name: reset-consumer
spec:
kafkaCluster:
bootstrapServers:
- kafka-bootstrap:9092
consumerGroups:
- my-consumer-group
topics:
- orders
resetStrategy: to-earliest
```
## Helm Values
Key configuration options for the Helm chart:
```yaml
# values.yaml
replicaCount: 1
image:
repository: ghcr.io/osodevops/kafka-backup-operator
tag: "" # Defaults to appVersion
serviceAccount:
create: true
annotations: {}
# Azure Workload Identity
azureWorkloadIdentity:
enabled: false
clientId: ""
# Logging
logging:
level: "info,kafka_backup_operator=debug"
# Metrics
metrics:
enabled: true
serviceMonitor:
enabled: false
interval: 30s
# Resources
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 500m
memory: 512Mi
# Extra pod volumes and mounts
extraVolumes:
- name: kafka-tls
secret:
secretName: kafka-tls-certs
- name: custom-config
configMap:
name: kafka-backup-config
extraVolumeMounts:
- name: kafka-tls
mountPath: /certs/kafka
readOnly: true
- name: custom-config
mountPath: /config
readOnly: true
# Extra env vars for custom CA trust / endpoint behavior
extraEnv:
- name: SSL_CERT_FILE
value: /etc/internal-certs/ca.crt
```
## Azure Workload Identity Setup
For secure, secretless authentication to Azure Blob Storage:
```bash
# Enable Workload Identity on AKS
az aks update --resource-group myRG --name myAKS \
--enable-oidc-issuer --enable-workload-identity
# Create managed identity
az identity create --resource-group myRG --name kafka-backup-identity
# Assign Storage Blob Data Contributor role
az role assignment create \
--assignee-object-id $(az identity show -g myRG -n kafka-backup-identity --query principalId -o tsv) \
--role "Storage Blob Data Contributor" \
--scope /subscriptions/.../storageAccounts/mystorageaccount
# Create federated credential
az identity federated-credential create \
--resource-group myRG \
--identity-name kafka-backup-identity \
--name kafka-backup-fedcred \
--issuer $(az aks show -g myRG -n myAKS --query oidcIssuerProfile.issuerUrl -o tsv) \
--subject system:serviceaccount:kafka-backup-system:kafka-backup-operator \
--audience api://AzureADTokenExchange
# Install with Workload Identity enabled
helm install kafka-backup-operator oso/kafka-backup-operator \
--namespace kafka-backup-system \
--set azureWorkloadIdentity.enabled=true \
--set azureWorkloadIdentity.clientId=$(az identity show -g myRG -n kafka-backup-identity --query clientId -o tsv)
```
See [docs/azure-workload-identity.md](docs/azure-workload-identity.md) for detailed setup instructions.
## Monitoring
The operator exposes Prometheus metrics on port 8080:
| Metric | Description |
|--------|-------------|
| `kafka_backup_reconciliations_total` | Total reconciliation attempts |
| `kafka_backup_reconcile_duration_seconds` | Reconciliation duration histogram |
| `kafka_backup_backups_total` | Total backups by status |
| `kafka_backup_backup_size_bytes` | Backup size in bytes |
| `kafka_backup_backup_records` | Records processed |
| `kafka_backup_restores_total` | Total restores by status |
### ServiceMonitor (Prometheus Operator)
```yaml
# Enable in Helm values
metrics:
serviceMonitor:
enabled: true
interval: 30s
labels:
release: prometheus
```
## Disaster Recovery Workflow
1. **Normal Operation**: `KafkaBackup` runs on schedule, storing backups to cloud storage
2. **Disaster Occurs**: Kafka cluster fails or data is corrupted
3. **Recovery**:
- Identify the backup to restore from: `kubectl get kafkabackup my-backup -o yaml`
- Create a `KafkaRestore` resource pointing to the backup
- Monitor progress: `kubectl get kafkarestore -w`
4. **Post-Recovery**: Optionally reset consumer offsets with `KafkaOffsetReset`
## Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│ Kubernetes Cluster │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ kafka-backup-operator │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐ │ │
│ │ │ Backup │ │ Restore │ │ Offset Reset/ │ │ │
│ │ │ Controller │ │ Controller │ │ Rollback Ctrl │ │ │
│ │ └──────┬──────┘ └──────┬──────┘ └──────────┬──────────┘ │ │
│ │ │ │ │ │ │
│ │ └───────────────┼────────────────────┘ │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌─────────────────────┐ │ │
│ │ │ kafka-backup-core │ │ │
│ │ │ (Rust lib) │ │ │
│ │ └─────────────────────┘ │ │
│ └───────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ Kafka Cluster │ │
│ └───────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Cloud Storage │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ S3 │ │ Azure │ │ GCS │ │
│ │ │ │ Blob │ │ │ │
│ └──────────┘ └──────────┘ └──────────┘ │
└─────────────────────────────────────────────────────────────────┘
```
## Development
### Building from Source
```bash
# Clone the repository
git clone https://github.com/osodevops/kafka-backup-operator.git
cd kafka-backup-operator
# Build
cargo build --release
# Generate CRDs
cargo run --bin crdgen > deploy/crds/all.yaml
# Run tests
cargo test
```
### Local Development with Minikube
See [minikube/README.md](minikube/README.md) for local development setup with Confluent for Kubernetes.
## Contributing
Contributions are welcome! Please read our [Contributing Guide](CONTRIBUTING.md) for details on the process for submitting pull requests.
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Support
- **Issues**: [GitHub Issues](https://github.com/osodevops/kafka-backup-operator/issues)
- **Discussions**: [GitHub Discussions](https://github.com/osodevops/kafka-backup-operator/discussions)
## Related Projects
- [kafka-backup-core](https://github.com/osodevops/kafka-backup) - The core backup library
- [OSO DevOps Helm Charts](https://github.com/osodevops/helm-charts) - Helm chart repository