https://github.com/dataflow-operator/dataflow

DataFlow Operator is a Kubernetes operator for streaming data between different data sources with support for message transformations.
clickhouse data-processing data-streaming dataflow etl kafka nessie postgresql trino

# DataFlow Operator

A production-ready Kubernetes operator for streaming data pipelines with support for multiple sources, sinks, and comprehensive message transformations.

📖 **[Full Documentation](https://dataflow-operator.github.io/docs/)**

---

## Overview

DataFlow Operator automates the deployment and management of data streaming pipelines in Kubernetes. It watches custom `DataFlow` resources and orchestrates processor pods that read from sources, apply optional transformations, and write to sinks. The operator handles fault tolerance, checkpointing, scheduling, and comprehensive monitoring out of the box.

### Key Features

- **Multi-Source/Sink Support**: Kafka, PostgreSQL, ClickHouse, Trino, and Apache Iceberg (Nessie). Securely configure connectors using Kubernetes Secrets via `SecretRef` fields.
- **Rich Transformations**: Filter, Select, Remove, Mask, Flatten, Timestamp, SnakeCase, Router, Chain.
- **Web GUI**: Built-in browser-based interface to manage manifests, stream logs, and monitor metrics without `kubectl`.
- **MCP Server**: Model Context Protocol server for AI agents to validate, generate, and migrate DataFlow manifests.
- **Error Handling**: Route failed messages to a dedicated error sink (e.g., Dead Letter Queue) with full error context and original payloads to prevent data loss.
- **Fault Tolerance**: At-least-once delivery semantics with checkpoint persistence, graceful shutdown, and support for idempotent sinks (e.g., UPSERT, ReplacingMergeTree) to safely handle duplicates.
- **Scheduled Pipelines**: `DataFlowCron` for time-based pipeline execution with triggers.
- **High Availability**: Leader election for multi-replica deployments.
- **Observable**: Prometheus metrics, structured logging, Sentry integration, health probes, and native Kubernetes Events for lifecycle audit trails.
- **Kubernetes Native**: Custom Resource Definitions, RBAC, Helm charts.
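
To make the fault-tolerance bullet concrete, here is a minimal conceptual sketch of why at-least-once delivery pairs well with an idempotent sink. It is not the operator's code: `UpsertSink`, `run_batch`, and the `crash_before_commit` flag are hypothetical names used only for illustration, and the in-memory dictionary stands in for a keyed table such as one written with UPSERT.

```python
# Conceptual sketch only: hypothetical names, not the operator's implementation.


class UpsertSink:
    """Idempotent sink: writing the same key twice leaves a single row."""

    def __init__(self):
        self.rows = {}

    def write(self, message):
        # Keyed overwrite, analogous to an UPSERT on a primary key.
        self.rows[message["id"]] = message


def run_batch(messages, sink, start_offset, crash_before_commit=False):
    """Write messages from start_offset and return the offset to persist.

    The checkpoint advances only after the writes succeed, so a crash in
    between leads to a replay of the batch rather than a skipped message.
    """
    for message in messages[start_offset:]:
        sink.write(message)
    if crash_before_commit:
        return start_offset  # simulated crash: checkpoint never advanced
    return len(messages)


messages = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
sink = UpsertSink()

# First attempt writes both rows but "crashes" before committing.
checkpoint = run_batch(messages, sink, 0, crash_before_commit=True)
# After restart the same messages are replayed from the old checkpoint.
checkpoint = run_batch(messages, sink, checkpoint)

print(len(sink.rows), checkpoint)
```

Both messages are delivered twice, yet the sink ends up with one row per key, which is the property the idempotent-sink recommendation relies on.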

---

## Architecture

### How It Works

1. **Operator Controller**: Watches `DataFlow` and `DataFlowCron` resources in your cluster.
2. **Processor Pod**: The operator creates ephemeral or long-running pods that execute the data pipeline.
3. **Pipeline Execution**: `source → transformations → sink` flow with built-in error handling and error sink routing.
4. **State Management**: Stores checkpoint data in ConfigMaps (or relies on native offsets for Kafka) so a pipeline can recover after a restart.
5. **Observability**: Emits standard Kubernetes Events (`Normal`/`Warning`) for reconciliation lifecycle auditing.
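
The flow described in steps 3 and 4 can be sketched roughly as follows. This is a simplified model under stated assumptions, not the processor's actual code: `run_pipeline`, the callable-based sinks, and the `offset` key in the checkpoint dictionary are all hypothetical, and the real processor persists its state in ConfigMaps or Kafka offsets rather than in memory.

```python
# Conceptual sketch only; every name here is hypothetical.


def run_pipeline(records, transformations, sink, error_sink, checkpoint):
    """Process records after the stored offset and return the checkpoint."""
    offset = checkpoint.get("offset", 0)
    for position, record in enumerate(records[offset:], start=offset):
        try:
            message = record
            for transform in transformations:
                message = transform(message)
                if message is None:  # a transformation dropped the message
                    break
            if message is not None:
                sink(message)
        except Exception as exc:
            # Route the failure to the error sink with the original payload
            # and the error text, so the message can be replayed later.
            error_sink({"error": str(exc), "payload": record})
        checkpoint["offset"] = position + 1
    return checkpoint


def require_amount(message):
    """Example transformation that fails on malformed input."""
    if "amount" not in message:
        raise ValueError("missing amount")
    return message


delivered, failed = [], []
state = run_pipeline(
    [{"amount": 5}, {"user": "x"}, {"amount": 7}],
    [require_amount],
    delivered.append,
    failed.append,
    {},
)

print(len(delivered), len(failed), state["offset"])
```

The malformed record lands in the error sink with its payload intact while the other two reach the main sink, and the checkpoint still advances past all three records.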

### Configuration Example

```yaml
apiVersion: dataflow.dataflow.io/v1
kind: DataFlow
metadata:
  name: kafka-to-postgres
spec:
  source:
    type: kafka
    config:
      brokers:
        - kafka-broker:9092
      topic: input-events
      consumerGroup: dataflow-group

  transformations:
    - type: filter
      config:
        expression: "event_type == 'purchase'"
    - type: mask
      config:
        fields:
          - credit_card

  sink:
    type: postgresql
    config:
      connectionStringSecretRef:
        name: db-credentials
        key: url
      table: events

  # Error sink catches failed writes (e.g. constraint violations) and saves them for replay
  errors:
    type: postgresql
    config:
      connectionStringSecretRef:
        name: db-credentials
        key: url
      table: events_dead_letter
      autoCreateTable: true
```
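
To illustrate what the `filter` and `mask` steps in this manifest are meant to achieve, here is a small sketch of their effect on two sample messages. It is an approximation only: the operator's real expression language and masking format are not described in this README, so `filter_purchases`, `mask_fields`, and the `****` placeholder are assumptions chosen for the example.

```python
# Illustrative only: the helper names and the "****" placeholder are
# assumptions, not the operator's actual behavior.


def filter_purchases(message):
    """Keep the message only when event_type == 'purchase'."""
    return message if message.get("event_type") == "purchase" else None


def mask_fields(message, fields=("credit_card",), placeholder="****"):
    """Return a copy with the listed fields replaced by a placeholder."""
    return {k: (placeholder if k in fields else v) for k, v in message.items()}


events = [
    {"event_type": "purchase", "credit_card": "4111111111111111", "total": 20},
    {"event_type": "page_view", "url": "/home"},
]

# Apply the steps in manifest order: filter first, then mask.
output = [mask_fields(m) for m in events if filter_purchases(m) is not None]
print(output)
```

Only the purchase event survives the filter, and its `credit_card` value is hidden before it would be written to the `events` table.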

---

## Quick Start

### 1. Install via Helm

```bash
# Add the Helm repository
helm repo add dataflow-operator https://dataflow-operator.github.io/helm-charts
helm repo update

# Install the operator
helm install dataflow-operator dataflow-operator/dataflow-operator \
  --namespace dataflow-system \
  --create-namespace
```

### 2. Deploy Your First Pipeline
```bash
# Create a sample Kafka-to-PostgreSQL pipeline
kubectl apply -f - <