{"id":34955737,"url":"https://github.com/vnvo/deltaforge","last_synced_at":"2026-03-12T03:08:42.776Z","repository":{"id":314675758,"uuid":"1056374587","full_name":"vnvo/deltaforge","owner":"vnvo","description":"A modular Change Data Capture (CDC) micro-framework built in Rust. Stream database changes to Kafka, Redis and etc.","archived":false,"fork":false,"pushed_at":"2025-12-16T04:32:29.000Z","size":3397,"stargazers_count":1,"open_issues_count":4,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-12-18T03:18:24.089Z","etag":null,"topics":["cdc","change-data-capture","data-engineering","data-platform","etl","event-sourcing","kafka","mysql","postgresql","redis","schema-registry","turso-db"],"latest_commit_sha":null,"homepage":"https://vnvo.github.io/deltaforge/","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/vnvo.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-09-14T00:32:01.000Z","updated_at":"2025-12-17T02:46:24.000Z","dependencies_parsed_at":"2025-09-14T02:38:17.852Z","dependency_job_id":"e6518cf7-1fc5-4e98-909e-845729ebf933","html_url":"https://github.com/vnvo/deltaforge","commit_stats":null,"previous_names":["vnvo/deltaforge"],"tags_count":3,"template":false,"template_full_name":null,"purl":"pkg:github/vnvo/deltaforge","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vnvo%2Fdeltaforge","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vnvo%2Fdeltaforge/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vnvo%2Fdeltaforge/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vnvo%2Fdeltaforge/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/vnvo","download_url":"https://codeload.github.com/vnvo/deltaforge/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vnvo%2Fdeltaforge/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28062336,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-12-26T02:00:06.189Z","response_time":55,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cdc","change-data-capture","data-engineering","data-platform","etl","event-sourcing","kafka","mysql","postgresql","redis","schema-registry","turso-db"],"created_at":"2025-12-26T22:01:51.856Z","updated_at":
"2026-03-12T03:08:42.763Z","avatar_url":"https://github.com/vnvo.png","language":"Rust","readme":"\u003cp align=\"center\"\u003e\n  \u003cimg src=\"assets/deltaforge-blc.png\" width=\"150\" alt=\"DeltaForge\"\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://github.com/vnvo/deltaforge/actions/workflows/ci.yml\"\u003e\n    \u003cimg src=\"https://github.com/vnvo/deltaforge/actions/workflows/ci.yml/badge.svg\" alt=\"CI\"\u003e\n  \u003c/a\u003e\n  \u003ca href=\"https://github.com/vnvo/deltaforge/releases\"\u003e\n    \u003cimg src=\"https://img.shields.io/github/v/release/vnvo/deltaforge\" alt=\"Release\"\u003e\n  \u003c/a\u003e\n  \u003ca href=\"https://vnvo.github.io/deltaforge\"\u003e\n    \u003cimg src=\"https://img.shields.io/badge/docs-online-blue.svg\" alt=\"Docs\"\u003e\n  \u003c/a\u003e\n  \u003ca href=\"https://github.com/vnvo/deltaforge/pkgs/container/deltaforge\"\u003e\n    \u003cimg src=\"https://img.shields.io/badge/ghcr.io-deltaforge-blue?logo=docker\" alt=\"GHCR\"\u003e\n  \u003c/a\u003e\n  \u003ca href=\"https://hub.docker.com/r/vnvohub/deltaforge\"\u003e\n    \u003cimg src=\"https://img.shields.io/docker/pulls/vnvohub/deltaforge?logo=docker\" alt=\"Docker Pulls\"\u003e\n  \u003c/a\u003e\n  \u003cimg src=\"https://img.shields.io/badge/arch-amd64%20%7C%20arm64-green\" alt=\"Arch\"\u003e\n  \u003ca href=\"https://coveralls.io/github/vnvo/deltaforge?branch=main\"\u003e\n    \u003cimg src=\"https://coveralls.io/repos/github/vnvo/deltaforge/badge.svg?branch=main\" alt=\"Coverage Status\"\u003e\n  \u003c/a\u003e\n  \u003cimg src=\"https://img.shields.io/badge/rustc-1.89+-orange.svg\" alt=\"MSRV\"\u003e\n  \u003cimg src=\"https://img.shields.io/badge/license-MIT%20OR%20Apache--2.0-blue.svg\" alt=\"License\"\u003e\n\u003c/p\u003e\n\n\u003e A versatile, high-performance Change Data Capture (CDC) engine built in Rust.\n\n\u003e ⚠️ **Status:** Active development. APIs, configuration, and semantics may change.\n\nDeltaForge streams database changes into downstream systems like Kafka, Redis, and NATS - giving you full control over routing, transformation, and delivery. Built-in schema discovery automatically infers and tracks the shape of your data as it flows through, including deep inspection of nested JSON structures.\n\n\u003e DeltaForge is _not_ a DAG based stream processor. It is a focused CDC engine meant to replace tools like Debezium when you need a lighter, cloud-native, and more customizable runtime.\n\n## Quick Start\n\nGet DeltaForge running in under 3 minutes:\n\n\u003ctable\u003e\n\u003ctr\u003e\n\u003ctd width=\"50%\" valign=\"top\"\u003e\n\n### Minimal Pipeline Config\n```yaml\n# pipeline.yaml\napiVersion: deltaforge/v1\nkind: Pipeline\nmetadata:\n  name: my-first-pipeline\n  tenant: demo\n\nspec:\n  source:\n    type: mysql\n    config:\n      id: mysql-src\n      dsn: ${MYSQL_DSN}\n      tables: [mydb.users]\n\n  processors: []\n\n  sinks:\n    - type: kafka\n      config:\n        id: kafka-sink\n        brokers: ${KAFKA_BROKERS}\n        topic: users.cdc\n```\n\n\u003c/td\u003e\n\u003ctd width=\"50%\" valign=\"top\"\u003e\n\n### Run it with Docker\n```bash\ndocker run --rm \\\n  -e MYSQL_DSN=\"mysql://user:pass@host:3306/mydb\" \\\n  -e KAFKA_BROKERS=\"kafka:9092\" \\\n  -v $(pwd)/pipeline.yaml:/etc/deltaforge/pipeline.yaml:ro \\\n  ghcr.io/vnvo/deltaforge:latest \\\n  --config /etc/deltaforge/pipeline.yaml\n```\n\nThat's it! 
### Event Output Formats

DeltaForge supports multiple envelope formats for ecosystem compatibility:

| Format | Output | Use Case |
|--------|--------|----------|
| `native` | `{"op":"c","after":{...},"source":{...}}` | Lowest overhead, DeltaForge consumers |
| `debezium` | `{"schema":null,"payload":{...}}` | Drop-in Debezium replacement |
| `cloudevents` | `{"specversion":"1.0","type":"...","data":{...}}` | CNCF-standard, event-driven systems |

🔄 **Debezium Compatibility**: DeltaForge uses Debezium's **schemaless mode** (`schema: null`), which matches Debezium's `JsonConverter` with `schemas.enable=false`, the recommended configuration for most Kafka deployments. This provides wire compatibility with existing Debezium consumers without the overhead of inline schemas (~500+ bytes per message).

> 💡 **Migrating from Debezium?** If your consumers already use `schemas.enable=false`, configure `envelope: { type: debezium }` on your sinks for drop-in compatibility. For consumers expecting inline schemas, you'll need Schema Registry integration (Avro encoding is planned).

See [Envelope Formats](docs/src/envelopes.md) for detailed examples and wire format specifications.

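To make the skeletons in the table concrete, here is a sketch that wraps the same row change in each envelope. Only the top-level keys come from the table above; the row contents, the `source.table` field, and the CloudEvents `type` string are invented for illustration (uses `serde_json`).

```rust
// Wrapping the same change in the three documented envelope skeletons.
// Everything beyond the op/after/source, schema/payload, and
// specversion/type/data keys is illustrative, not a wire-format spec.
use serde_json::json;

fn main() {
    let after = json!({ "id": 42, "email": "ada@example.com" });

    // `native`: lowest overhead, DeltaForge's own shape.
    let native = json!({
        "op": "c",
        "after": after.clone(),
        "source": { "table": "mydb.users" } // illustrative source block
    });

    // `debezium`: schemaless mode, the payload carries the change.
    let debezium = json!({ "schema": null, "payload": { "op": "c", "after": after.clone() } });

    // `cloudevents`: CNCF envelope with the change in `data`.
    let cloudevents = json!({
        "specversion": "1.0",
        "type": "mydb.users.changed", // illustrative event type
        "data": { "op": "c", "after": after }
    });

    for event in [native, debezium, cloudevents] {
        println!("{event}");
    }
}
```
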
## Documentation

- 📘 Online docs: <https://vnvo.github.io/deltaforge>
- 🛠 Local: `mdbook serve docs` (browse at <http://localhost:3000>)

## Local development helper

Use the bundled `dev.sh` CLI to spin up the dependency stack and run common workflows consistently:

```bash
./dev.sh up     # start Postgres, MySQL, Kafka, Redis, NATS from docker-compose.dev.yml
./dev.sh ps     # view container status
./dev.sh check  # fmt --check + clippy + tests (matches CI)
```

See the [Development guide](docs/src/development.md) for the full layout and additional info.

## Container image

Pre-built multi-arch images (amd64/arm64) are available:
```bash
# From GitHub Container Registry
docker pull ghcr.io/vnvo/deltaforge:latest

# From Docker Hub
docker pull vnvohub/deltaforge:latest

# Debug variant (includes shell)
docker pull ghcr.io/vnvo/deltaforge:latest-debug
```

Or build locally:
```bash
docker build -t deltaforge:local .
```

Run it by mounting your pipeline specs (environment variables are expanded inside the YAML) and exposing the API and metrics ports:

```bash
docker run --rm \
  -p 8080:8080 -p 9000:9000 \
  -v $(pwd)/examples/dev.yaml:/etc/deltaforge/pipelines.yaml:ro \
  -v deltaforge-checkpoints:/app/data \
  deltaforge:local \
  --config /etc/deltaforge/pipelines.yaml
```

Or with environment variables expanded inside the provided config:
```bash
# pull the container
docker pull ghcr.io/vnvo/deltaforge:latest

# run it
docker run --rm \
  -p 8080:8080 -p 9000:9000 \
  -e MYSQL_DSN="mysql://user:pass@host:3306/db" \
  -e KAFKA_BROKERS="kafka:9092" \
  -v $(pwd)/pipeline.yaml:/etc/deltaforge/pipeline.yaml:ro \
  -v deltaforge-checkpoints:/app/data \
  ghcr.io/vnvo/deltaforge:latest \
  --config /etc/deltaforge/pipeline.yaml
```

The container runs as a non-root user, writes checkpoints to `/app/data/df_checkpoints.json`, and listens on `0.0.0.0:8080` for the control plane API, with metrics served on `:9000`.

## Architecture Highlights

### At-least-once and Checkpoint Timing Guarantees

DeltaForge guarantees at-least-once delivery through careful checkpoint ordering:

```
Source → Processor → Sink (deliver) → Checkpoint (save)
                           │
                    Sink acknowledges
                    successful delivery
                           │
                    THEN checkpoint saved
```

Checkpoints are never saved before events are delivered. A crash between delivery and checkpoint causes replay (duplicates possible), but never loss.

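The same ordering expressed as code, as a minimal sketch: the `Batch` type and the `deliver_to_sinks`/`save_checkpoint` functions are invented stand-ins, not DeltaForge's internal API.

```rust
// Sketch of the delivery-then-checkpoint ordering described above.
// Types and function names are invented for illustration only.

#[derive(Debug)]
struct Batch {
    events: Vec<String>,
    source_position: u64, // e.g. a binlog/WAL offset
}

fn deliver_to_sinks(batch: &Batch) -> Result<(), String> {
    // Stand-in for Kafka/Redis/NATS producers acknowledging the batch.
    println!("delivered {} events", batch.events.len());
    Ok(())
}

fn save_checkpoint(position: u64) -> Result<(), String> {
    // Stand-in for the checkpoint backend (file, SQLite, ...).
    println!("checkpoint saved at {position}");
    Ok(())
}

fn commit_batch(batch: &Batch) -> Result<(), String> {
    // 1. Sinks must acknowledge delivery first.
    deliver_to_sinks(batch)?;
    // 2. Only then is the source position persisted. A crash between the two
    //    steps replays the batch (possible duplicates), never loses it.
    save_checkpoint(batch.source_position)
}

fn main() {
    let batch = Batch { events: vec!["e1".into(), "e2".into()], source_position: 1042 };
    commit_batch(&batch).expect("commit failed");
}
```
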
### Schema-Checkpoint Correlation

The schema registry tracks schema versions with sequence numbers and optional checkpoint correlation. During replay, events are interpreted with the schema that was active when they were produced - even if the table structure has since changed.

### Source-Owned Schemas

Unlike tools that normalize all databases to a universal type system, DeltaForge lets each source define its own schema semantics. MySQL schemas capture MySQL types (`bigint(20) unsigned`, `json`), and PostgreSQL schemas preserve arrays and custom types. No lossy normalization, no universal type maintenance burden.

## API

The REST API exposes JSON endpoints for liveness, readiness, and pipeline lifecycle
management. Routes identify pipelines by the `metadata.name` field from their specs and
return `PipeInfo` payloads that include the pipeline name, status, and full
configuration.

### Health

- `GET /healthz` - lightweight liveness probe returning `ok`.
- `GET /readyz` - readiness view returning `{"status":"ready","pipelines":[...]}`
  with the current pipeline states.

### Pipeline management

- `GET /pipelines` - list all pipelines with their current status and config.
- `POST /pipelines` - create a new pipeline from a full `PipelineSpec` document.
- `GET /pipelines/{name}` - get a single pipeline by name.
- `PATCH /pipelines/{name}` - apply a partial JSON patch to an existing pipeline
  (e.g., adjust batch or connection settings) and restart it with the merged spec.
- `DELETE /pipelines/{name}` - permanently delete a pipeline.
- `POST /pipelines/{name}/pause` - pause ingestion and processing for the pipeline.
- `POST /pipelines/{name}/resume` - resume a paused pipeline.
- `POST /pipelines/{name}/stop` - stop a running pipeline.

### Schema endpoints

- `GET /pipelines/{name}/schemas` - list DB schemas for the pipeline.
- `GET /pipelines/{name}/sensing/schemas` - list inferred schemas (from sensing).
- `GET /pipelines/{name}/sensing/schemas/{table}` - get inferred schema details.
- `GET /pipelines/{name}/sensing/schemas/{table}/json-schema` - export as JSON Schema.
- `GET /pipelines/{name}/sensing/schemas/{table}/classifications` - get dynamic map classifications.
- `GET /pipelines/{name}/drift` - get drift detection results.
- `GET /pipelines/{name}/sensing/stats` - get schema sensing cache statistics.

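A minimal client sketch against these routes, assuming the container defaults shown earlier (control plane on `localhost:8080`) and the `my-first-pipeline` name from the Quick Start. The `reqwest` and `tokio` crates are assumptions of the sketch, not dependencies you need to run DeltaForge.

```rust
// Minimal control-plane client sketch using the documented routes.
// Host, port, and pipeline name follow the examples above; `reqwest` and
// `tokio` (with the "macros" and "rt" features) are assumed dependencies.
use reqwest::Client;

#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    let api = "http://localhost:8080";
    let client = Client::new();

    // Liveness probe.
    let health = client.get(format!("{api}/healthz")).send().await?.text().await?;
    println!("healthz: {health}");

    // List pipelines with their current status and config.
    let pipelines = client.get(format!("{api}/pipelines")).send().await?.text().await?;
    println!("pipelines: {pipelines}");

    // Pause, then resume, a pipeline by its metadata.name.
    client.post(format!("{api}/pipelines/my-first-pipeline/pause")).send().await?;
    client.post(format!("{api}/pipelines/my-first-pipeline/resume")).send().await?;
    Ok(())
}
```
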
## Configuration schema

Pipelines are defined as YAML documents that map directly to the internal `PipelineSpec` type.
Environment variables are expanded before parsing, so secrets and URLs can be injected at runtime.

<table>
<tr>
<td width="50%" valign="top">

### Full Example

```yaml
metadata:
  name: orders-mysql-to-kafka
  tenant: acme

spec:
  sharding:
    mode: hash
    count: 4
    key: customer_id

  source:
    type: mysql
    config:
      id: orders-mysql
      dsn: ${MYSQL_DSN}
      tables:
        - shop.orders
        - shop.outbox
      outbox:
        tables: ["shop.outbox"]
      snapshot:
        mode: initial

  processors:
    - type: javascript
      id: my-custom-transform
      inline: |
        function processBatch(events) {
          return events;
        }
      limits:
        cpu_ms: 50
        mem_mb: 128
        timeout_ms: 500

  sinks:
    - type: kafka
      config:
        id: orders-kafka
        brokers: ${KAFKA_BROKERS}
        topic: orders
        envelope:
          type: debezium
        encoding: json
        required: true
        exactly_once: false
    - type: redis
      config:
        id: orders-redis
        uri: ${REDIS_URI}
        stream: orders
        envelope:
          type: native
        encoding: json

  batch:
    max_events: 500
    max_bytes: 1048576
    max_ms: 1000
    respect_source_tx: true

  commit_policy:
    mode: quorum
    quorum: 2

  schema_sensing:
    enabled: true
    deep_inspect:
      enabled: true
      max_depth: 3
    sampling:
      warmup_events: 50
      sample_rate: 5
    high_cardinality:
      enabled: true
      min_events: 100
```

</td>
<td width="50%" valign="top">

### Key fields

| Field | Description |
|-------|-------------|
| **`metadata`** | |
| `name` | Pipeline identifier (used in API routes and metrics) |
| `tenant` | Business-oriented tenant label |
| **`spec.source`** | Database source - [MySQL](docs/src/sources/mysql.md), [PostgreSQL](docs/src/sources/postgres.md), etc. |
| `type` | `mysql`, `postgres`, etc. |
| `config.id` | Unique identifier for checkpoints |
| `config.dsn` | Connection string (supports `${ENV_VAR}`) |
| `config.tables` | Table patterns to capture |
| `config.outbox` | Tag outbox tables/prefixes with the `__outbox` sentinel for the outbox processor |
| `config.snapshot` | Initial load: `mode` (`never`/`initial`/`always`), `chunk_size`, `max_parallel_tables` |
| `config.on_schema_drift` | `adapt` (default) continues after failover schema drift; `halt` stops for operator intervention |
| **`spec.processors`** | Optional transforms - see [Processors](docs/src/configuration.md#processors) |
| `type` | `javascript`, `outbox`, `flatten` |
| `inline` | JavaScript code for batch processing |
| `limits` | CPU, memory, and timeout limits |
| **`spec.sinks`** | One or more sinks - see [Sinks](docs/src/sinks/README.md) |
| `type` | `kafka`, `redis`, or `nats` |
| `config.envelope` | Output format: `native`, `debezium`, or `cloudevents` - see [Envelopes](docs/src/envelopes.md) |
| `config.encoding` | Wire encoding: `json` (default) |
| `config.required` | Whether the sink must ack for checkpoint (default: `true`) |
| **`spec.batch`** | Commit unit thresholds - see [Batching](docs/src/configuration.md#batching) |
| `max_events` | Flush after N events (default: 500) |
| `max_bytes` | Flush after size limit (default: 1 MB) |
| `max_ms` | Flush after time (default: 1000 ms) |
| `respect_source_tx` | Keep source transactions intact (default: `true`) |
| **`spec.commit_policy`** | Checkpoint gating - see [Commit policy](docs/src/configuration.md#commit-policy) |
| `mode` | `all`, `required` (default), or `quorum` |
| `quorum` | Number of sinks for quorum mode |
| **`spec.schema_sensing`** | Runtime schema inference - see [Schema sensing](docs/src/schemasensing.md) |
| `enabled` | Enable schema sensing (default: `false`) |
| `deep_inspect` | Nested JSON inspection settings |
| `sampling` | Sampling rate and warmup config |
| `high_cardinality` | Dynamic key detection settings |

📘 Full reference: [Configuration docs](docs/src/configuration.md)

View actual examples: [Example Configurations](docs/src/examples/README.md)

</td>
</tr>
</table>

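For orientation, here is a miniature version of that flow: expand `${ENV_VAR}` placeholders, then parse the YAML into a typed spec. `PipelineDoc` below is an invented subset, not the real `PipelineSpec`, and the sketch assumes the `serde` (with `derive`), `serde_yaml`, and `regex` crates.

```rust
// Sketch of the "expand ${ENV_VAR} placeholders, then parse the YAML into a
// typed spec" flow described above. `PipelineDoc` is an invented subset for
// illustration; the real `PipelineSpec` lives in the DeltaForge codebase.
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct Metadata {
    name: String,
    tenant: Option<String>,
}

#[derive(Debug, Deserialize)]
struct PipelineDoc {
    metadata: Metadata,
}

/// Replace `${VAR}` placeholders with values from the process environment
/// (unset variables become empty in this sketch).
fn expand_env(raw: &str) -> String {
    let re = regex::Regex::new(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}").unwrap();
    re.replace_all(raw, |caps: &regex::Captures| {
        std::env::var(&caps[1]).unwrap_or_default()
    })
    .into_owned()
}

fn main() {
    let raw = "metadata:\n  name: orders-mysql-to-kafka\n  tenant: ${TENANT}\n";
    let doc: PipelineDoc = serde_yaml::from_str(&expand_env(raw)).expect("valid spec");
    println!("{doc:?}");
}
```
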
## Roadmap

- [x] Outbox pattern support
- [x] Flatten processor
- [x] Persistent schema registry (SQLite, then PostgreSQL)
- [x] Snapshot/backfill (initial load for existing tables)
- [ ] Protobuf encoding
- [ ] PostgreSQL/S3 checkpoint backends for HA
- [ ] MongoDB source
- [ ] ClickHouse sink
- [ ] Event store for time-based replay
- [ ] Distributed coordination for HA

## License

Licensed under either of

- **MIT License** (see [`LICENSE-MIT`](./LICENSE-MIT))
- **Apache License, Version 2.0** (see [`LICENSE-APACHE`](./LICENSE-APACHE))

at your option.

Unless you explicitly state otherwise, any contribution intentionally submitted
for inclusion in this project by you shall be dual licensed as above, without
additional terms or conditions.