https://github.com/dataflow-operator/dataflow
DataFlow Operator is a Kubernetes operator for streaming data between different data sources with support for message transformations.
clickhouse data-processing data-streaming dataflow etl kafka nessie postgresql trino
- Host: GitHub
- URL: https://github.com/dataflow-operator/dataflow
- Owner: dataflow-operator
- Created: 2025-12-25T06:46:22.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2026-06-02T09:30:54.000Z (22 days ago)
- Last Synced: 2026-06-02T10:23:28.054Z (22 days ago)
- Topics: clickhouse, data-processing, data-streaming, dataflow, etl, kafka, nessie, postgresql, trino
- Language: Go
- Homepage: https://dataflow-operator.github.io/docs/
- Size: 58.4 MB
- Stars: 3
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
README
# DataFlow Operator
A production-ready Kubernetes operator for streaming data pipelines with support for multiple sources, sinks, and comprehensive message transformations.
📖 **[Full Documentation](https://dataflow-operator.github.io/docs/)**
---
## Overview
DataFlow Operator automates the deployment and management of data streaming pipelines in Kubernetes. It watches custom `DataFlow` resources and orchestrates processor pods that read from sources, apply optional transformations, and write to sinks. The operator handles fault tolerance, checkpointing, scheduling, and comprehensive monitoring out of the box.
### Key Features
- **Multi-Source/Sink Support**: Kafka, PostgreSQL, ClickHouse, Trino, and Apache Iceberg (Nessie). Securely configure connectors using Kubernetes Secrets via `SecretRef` fields.
- **Rich Transformations**: Filter, Select, Remove, Mask, Flatten, Timestamp, SnakeCase, Router, Chain.
- **Web GUI**: Built-in browser-based interface to manage manifests, stream logs, and monitor metrics without `kubectl`.
- **MCP Server**: Model Context Protocol server for AI agents to validate, generate, and migrate DataFlow manifests.
- **Error Handling**: Route failed messages to a dedicated error sink (e.g., Dead Letter Queue) with full error context and original payloads to prevent data loss.
- **Fault Tolerance**: At-least-once delivery semantics with checkpoint persistence, graceful shutdown, and support for idempotent sinks (e.g., UPSERT, ReplacingMergeTree) to safely handle duplicates.
- **Scheduled Pipelines**: `DataFlowCron` for time-based pipeline execution with triggers.
- **High Availability**: Leader election for multi-replica deployments.
- **Observable**: Prometheus metrics, structured logging, Sentry integration, health probes, and native Kubernetes Events for lifecycle audit trails.
- **Kubernetes Native**: Custom Resource Definitions, RBAC, Helm charts.
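As an illustration of how transformations such as Flatten, SnakeCase, and Chain might compose, here is a minimal Go sketch. It is not the operator's implementation: the `Message` and `Transform` types, the dot separator used by `Flatten`, and the renaming rule in `toSnake` are all assumptions made for this example.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// Message is a hypothetical in-memory record; the operator's real type may differ.
type Message = map[string]any

// Transform is a hypothetical signature for a single pipeline step.
type Transform func(Message) Message

// Flatten lifts nested objects to top-level keys joined with "." (assumed separator).
// Empty nested objects produce no keys in this sketch.
func Flatten(m Message) Message {
	out := Message{}
	var walk func(prefix string, v any)
	walk = func(prefix string, v any) {
		if nested, ok := v.(map[string]any); ok {
			for k, child := range nested {
				walk(prefix+"."+k, child)
			}
			return
		}
		out[prefix] = v
	}
	for k, v := range m {
		walk(k, v)
	}
	return out
}

// toSnake lowercases s and inserts "_" before an uppercase rune that follows
// a lowercase letter or a digit, so "userID" becomes "user_id".
func toSnake(s string) string {
	var b strings.Builder
	var prev rune
	for _, r := range s {
		if unicode.IsUpper(r) && (unicode.IsLower(prev) || unicode.IsDigit(prev)) {
			b.WriteByte('_')
		}
		b.WriteRune(unicode.ToLower(r))
		prev = r
	}
	return b.String()
}

// SnakeCase renames every key of m with toSnake and leaves values untouched.
func SnakeCase(m Message) Message {
	out := make(Message, len(m))
	for k, v := range m {
		out[toSnake(k)] = v
	}
	return out
}

// Chain runs the given steps in order, feeding each output into the next step.
func Chain(steps ...Transform) Transform {
	return func(m Message) Message {
		for _, step := range steps {
			m = step(m)
		}
		return m
	}
}

func main() {
	in := Message{
		"eventType": "purchase",
		"user":      map[string]any{"firstName": "Ada", "userID": 7},
	}
	fmt.Println(Chain(Flatten, SnakeCase)(in))
}
```

In this sketch the nested `user` object ends up as the top-level keys `user.first_name` and `user.user_id`, and `eventType` becomes `event_type`. Modelling each step as a plain function keeps `Chain` trivial, which is one plausible reason to treat chaining as just another transformation.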
---
## Architecture
### How It Works
1. **Operator Controller**: Watches `DataFlow` and `DataFlowCron` resources in your cluster.
2. **Processor Pod**: The operator creates ephemeral or long-running pods that execute the data pipeline.
3. **Pipeline Execution**: `source → transformations → sink` flow with built-in error handling and error sink routing.
4. **State Management**: Stores checkpoint data in ConfigMaps (or relies on native offsets for Kafka) so that a restarted pipeline can resume from the last checkpoint.
5. **Observability**: Emits standard Kubernetes Events (`Normal`/`Warning`) for reconciliation lifecycle auditing.
### Configuration Example
```yaml
apiVersion: dataflow.dataflow.io/v1
kind: DataFlow
metadata:
  name: kafka-to-postgres
spec:
  source:
    type: kafka
    config:
      brokers:
        - kafka-broker:9092
      topic: input-events
      consumerGroup: dataflow-group
  transformations:
    - type: filter
      config:
        expression: "event_type == 'purchase'"
    - type: mask
      config:
        fields:
          - credit_card
  sink:
    type: postgresql
    config:
      connectionStringSecretRef:
        name: db-credentials
        key: url
      table: events
  # Error sink catches failed writes (e.g. constraint violations) and saves them for replay
  errors:
    type: postgresql
    config:
      connectionStringSecretRef:
        name: db-credentials
        key: url
      table: events_dead_letter
      autoCreateTable: true
```
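To make the effect of the `filter` and `mask` steps in this manifest concrete, the following Go sketch applies equivalent logic to a few in-memory records. It rests on assumptions: the filter expression is hard-coded as a Go predicate instead of being parsed, and masking is shown as replacing each character with `*`, which may differ from what the operator actually writes. `KeepPurchases` and `Mask` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// Message is a hypothetical decoded record.
type Message = map[string]any

// KeepPurchases mirrors the manifest expression "event_type == 'purchase'".
func KeepPurchases(m Message) bool {
	return m["event_type"] == "purchase"
}

// Mask returns a copy of m in which the listed string fields are replaced
// by asterisks of the same length. Non-string and missing fields are left as-is.
func Mask(m Message, fields ...string) Message {
	out := make(Message, len(m))
	for k, v := range m {
		out[k] = v
	}
	for _, f := range fields {
		if s, ok := out[f].(string); ok {
			out[f] = strings.Repeat("*", len(s))
		}
	}
	return out
}

func main() {
	input := []Message{
		{"event_type": "purchase", "credit_card": "4111111111111111", "amount": 42},
		{"event_type": "page_view", "url": "/pricing"},
	}
	for _, m := range input {
		if !KeepPurchases(m) {
			continue // dropped by the filter step
		}
		fmt.Println(Mask(m, "credit_card"))
	}
}
```

Only the first record passes the filter, and its `credit_card` value reaches the sink masked while `amount` is unchanged. `Mask` copies the record before editing it so that the original payload stays available, for example for an error sink.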
---
## Quick Start
### 1. Install via Helm
```bash
# Add the Helm repository
helm repo add dataflow-operator https://dataflow-operator.github.io/helm-charts
helm repo update
# Install the operator
helm install dataflow-operator dataflow-operator/dataflow-operator \
  --namespace dataflow-system \
  --create-namespace
```
### 2. Deploy Your First Pipeline
```bash
# Create a sample Kafka-to-PostgreSQL pipeline
kubectl apply -f - <