{"id":46570611,"url":"https://github.com/dataflow-operator/dataflow","last_synced_at":"2026-06-08T18:01:02.928Z","repository":{"id":332048832,"uuid":"1122626555","full_name":"dataflow-operator/dataflow","owner":"dataflow-operator","description":"DataFlow Operator is a Kubernetes operator for streaming data between different data sources with support for message transformations.","archived":false,"fork":false,"pushed_at":"2026-06-02T09:30:54.000Z","size":61214,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-02T10:23:28.054Z","etag":null,"topics":["clickhouse","data-processing","data-streaming","dataflow","etl","kafka","nessie","postgresql","trino"],"latest_commit_sha":null,"homepage":"https://dataflow-operator.github.io/docs/","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dataflow-operator.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-12-25T06:46:22.000Z","updated_at":"2026-06-02T09:30:57.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/dataflow-operator/dataflow","commit_stats":null,"previous_names":["ilyario/dataflow","dataflow-operator/dataflow"],"tags_count":32,"template":false,"template_full_name":null,"purl":"pkg:github/dataflow-operator/dataflow","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dataflow-operator%2Fdataflow","tags_url":"https://repo
s.ecosyste.ms/api/v1/hosts/GitHub/repositories/dataflow-operator%2Fdataflow/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dataflow-operator%2Fdataflow/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dataflow-operator%2Fdataflow/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dataflow-operator","download_url":"https://codeload.github.com/dataflow-operator/dataflow/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dataflow-operator%2Fdataflow/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34073810,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-08T02:00:07.615Z","response_time":111,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["clickhouse","data-processing","data-streaming","dataflow","etl","kafka","nessie","postgresql","trino"],"created_at":"2026-03-07T08:11:34.183Z","updated_at":"2026-06-08T18:01:02.922Z","avatar_url":"https://github.com/dataflow-operator.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# DataFlow Operator\n\nA production-ready Kubernetes operator for streaming data pipelines with support for multiple sources, sinks, and comprehensive message transformations.\n\n📖 **[Full 
Documentation](https://dataflow-operator.github.io/docs/)**\n\n---\n\n## Overview\n\nDataFlow Operator automates the deployment and management of data streaming pipelines in Kubernetes. It watches custom `DataFlow` resources and orchestrates processor pods that read from sources, apply optional transformations, and write to sinks. The operator handles fault tolerance, checkpointing, scheduling, and comprehensive monitoring out of the box.\n\n### Key Features\n\n- **Multi-Source/Sink Support**: Kafka, PostgreSQL, ClickHouse, Trino, and Apache Iceberg (Nessie). Securely configure connectors using Kubernetes Secrets via `SecretRef` fields.\n- **Rich Transformations**: Filter, Select, Remove, Mask, Flatten, Timestamp, SnakeCase, Router, Chain.\n- **Web GUI**: Built-in browser-based interface to manage manifests, stream logs, and monitor metrics without `kubectl`.\n- **MCP Server**: Model Context Protocol server for AI agents to validate, generate, and migrate DataFlow manifests.\n- **Error Handling**: Route failed messages to a dedicated error sink (e.g., Dead Letter Queue) with full error context and original payloads to prevent data loss.\n- **Fault Tolerance**: At-least-once delivery semantics with checkpoint persistence, graceful shutdown, and support for idempotent sinks (e.g., UPSERT, ReplacingMergeTree) to safely handle duplicates.\n- **Scheduled Pipelines**: `DataFlowCron` for time-based pipeline execution with triggers.\n- **High Availability**: Leader election for multi-replica deployments.\n- **Observable**: Prometheus metrics, structured logging, Sentry integration, health probes, and native Kubernetes Events for lifecycle audit trails.\n- **Kubernetes Native**: Custom Resource Definitions, RBAC, Helm charts.\n\n---\n\n## Architecture\n\n### How It Works\n\n1. **Operator Controller**: Watches `DataFlow` and `DataFlowCron` resources in your cluster.\n2. **Processor Pod**: Creates ephemeral or long-running pods that execute the data pipeline.\n3. 
**Pipeline Execution**: `source → transformations → sink` flow with built-in error handling and error sink routing.\n4. **State Management**: Stores checkpoint data in ConfigMaps (or native offsets for Kafka) for recovery.\n5. **Observability**: Emits standard Kubernetes Events (`Normal`/`Warning`) for reconciliation lifecycle auditing.\n\n### Configuration Example\n\n```yaml\napiVersion: dataflow.dataflow.io/v1\nkind: DataFlow\nmetadata:\n  name: kafka-to-postgres\nspec:\n  source:\n    type: kafka\n    config:\n      brokers:\n        - kafka-broker:9092\n      topic: input-events\n      consumerGroup: dataflow-group\n\n  transformations:\n    - type: filter\n      config:\n        expression: \"event_type == 'purchase'\"\n    - type: mask\n      config:\n        fields:\n          - credit_card\n\n  sink:\n    type: postgresql\n    config:\n      connectionStringSecretRef:\n        name: db-credentials\n        key: url\n      table: events\n\n  # Error sink catches failed writes (e.g. constraint violations) and saves them for replay\n  errors:\n    type: postgresql\n    config:\n      connectionStringSecretRef:\n        name: db-credentials\n        key: url\n      table: events_dead_letter\n      autoCreateTable: true\n```\n\n---\n\n## Quick Start\n\n### 1. Install via Helm\n\n```bash\n# Add the Helm repository\nhelm repo add dataflow-operator https://dataflow-operator.github.io/helm-charts\nhelm repo update\n\n# Install the operator\nhelm install dataflow-operator dataflow-operator/dataflow-operator \\\n  --namespace dataflow-system \\\n  --create-namespace\n```\n\n### 2. 
Deploy Your First Pipeline\n\n```bash\n# Create a sample Kafka-to-PostgreSQL pipeline\nkubectl apply -f - \u003c\u003cEOF\napiVersion: dataflow.dataflow.io/v1\nkind: DataFlow\nmetadata:\n  name: my-pipeline\n  namespace: default\nspec:\n  source:\n    type: kafka\n    config:\n      brokers:\n        - kafka:9092\n      topic: events\n      consumerGroup: my-app\n  sink:\n    type: postgresql\n    config:\n      connectionString: \"postgres://user:pass@postgres:5432/db\"\n      table: events\nEOF\n\n# Monitor the pipeline status and Kubernetes events\nkubectl describe dataflow my-pipeline\nkubectl get events --watch\n```\n\n### 3. Local Development Setup\n\n```bash\n# Start local infrastructure (Kafka, PostgreSQL, ClickHouse)\ndocker-compose up -d\n\n# Available UIs:\n# - Kafka UI: http://localhost:8080\n# - ClickHouse: http://localhost:8123\n\n# Run the operator locally\ntask run\n\n# In another terminal, apply a sample configuration\nkubectl apply -f config/samples/kafka-to-postgres.yaml\n```\n\n---\n\n## Resources\n\n- **Documentation**: https://dataflow-operator.github.io/docs/\n- **GitHub Issues**: Report bugs or request features\n- **Helm Charts**: https://github.com/dataflow-operator/helm-charts\n- **Kubernetes Operator Pattern**: https://kubernetes.io/docs/concepts/extend-kubernetes/operator/","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdataflow-operator%2Fdataflow","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdataflow-operator%2Fdataflow","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdataflow-operator%2Fdataflow/lists"}