# DeltaForge

> A versatile, high-performance Change Data Capture (CDC) engine built in Rust.

> ⚠️ **Status:** Active development. APIs, configuration, and semantics may change.

DeltaForge streams database changes into downstream systems like Kafka, Redis, and NATS - giving you full control over routing, transformation, and delivery. Built-in schema discovery automatically infers and tracks the shape of your data as it flows through, including deep inspection of nested JSON structures.

> DeltaForge is _not_ a DAG based stream processor. It is a focused CDC engine meant to replace tools like Debezium when you need a lighter, cloud-native, and more customizable runtime.

## Quick Start

Get DeltaForge running in under 3 minutes:

### Minimal Pipeline Config
```yaml
# pipeline.yaml
apiVersion: deltaforge/v1
kind: Pipeline
metadata:
  name: my-first-pipeline
  tenant: demo

spec:
  source:
    type: mysql
    config:
      id: mysql-src
      dsn: ${MYSQL_DSN}
      tables: [mydb.users]

  processors: []

  sinks:
    - type: kafka
      config:
        id: kafka-sink
        brokers: ${KAFKA_BROKERS}
        topic: users.cdc
```

### Run it with Docker
```bash
docker run --rm \
  -e MYSQL_DSN="mysql://user:pass@host:3306/mydb" \
  -e KAFKA_BROKERS="kafka:9092" \
  -v $(pwd)/pipeline.yaml:/etc/deltaforge/pipeline.yaml:ro \
  ghcr.io/vnvo/deltaforge:latest \
  --config /etc/deltaforge/pipeline.yaml
```

That's it! DeltaForge streams changes from `mydb.users` to Kafka.

**Want Debezium-compatible output?**
```yaml
sinks:
  - type: kafka
    config:
      id: kafka-sink
      brokers: ${KAFKA_BROKERS}
      topic: users.cdc
      envelope:
        type: debezium
```

Output: `{"schema":null,"payload":{...}}`

📘 [Full docs](https://vnvo.github.io/deltaforge) · [Configuration reference](#configuration-schema)

## The Tech

| Built with | Sources | Processors | Sinks | Output Formats |
|:---:|:---:|:---:|:---:|:---:|
| Rust | MySQL · PostgreSQL | JavaScript · Outbox · Flatten | Kafka · Redis · NATS | Native · Debezium · CloudEvents |

## Features

- **Sources**
  - MySQL binlog CDC with GTID support
  - PostgreSQL logical replication via pgoutput
  - Initial snapshot/backfill for existing tables (MySQL and PostgreSQL)
  - Resumes at table granularity after interruption, with binlog/WAL retention validation and background guards
  - Automatic failover handling: server identity detection, checkpoint reachability verification, schema drift reconciliation, and a configurable halt-on-drift policy

- **Schema Registry**
  - Source-owned schema types (source-native semantics)
  - Schema change detection and versioning
  - SHA-256 fingerprinting for stable change detection

- **Schema Sensing**
  - Automatic schema inference from JSON event payloads
  - Deep inspection of nested JSON structures
  - High-cardinality key detection (session IDs, trace IDs, dynamic maps)
  - Configurable sampling with warmup and cache optimization
  - Drift detection comparing the DB schema against observed data
  - JSON Schema export for downstream consumers

- **Checkpoints**
  - Pluggable backends (file, SQLite with versioning, in-memory)
  - Configurable commit policies (all, required, quorum)
  - Transaction boundary preservation (best effort)

- **Processors**
  - JavaScript processor using `deno_core`:
    - Run user-defined functions (UDFs) in JS to transform batches of events
  - Outbox processor:
    - Transactional outbox pattern with routing and raw payload delivery support
  - Flatten processor:
    - Native Rust processor that collapses nested JSON into top-level `parent__child` keys

- **Sinks**
  - Kafka producer sink (via `rdkafka`)
  - Redis Streams sink
  - NATS JetStream sink (via `async_nats`)
  - Dynamic routing: per-event topic/stream/subject via templates or JavaScript
  - Configurable envelope formats: Native, Debezium, CloudEvents
  - JSON wire encoding (Avro planned, with more to come)
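As an illustration, the flatten processor's key-collapsing behavior can be sketched in Python (the real processor is native Rust; only the `parent__child` separator convention is taken from the description above, and this sketch handles nested dicts only):

```python
def flatten(obj, prefix="", sep="__"):
    """Collapse nested dicts into top-level parent__child keys."""
    out = {}
    for key, value in obj.items():
        name = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            out.update(flatten(value, name, sep))  # recurse into nested objects
        else:
            out[name] = value
    return out

event = {"id": 7, "address": {"city": "Oslo", "geo": {"lat": 59.9}}}
print(flatten(event))
# {'id': 7, 'address__city': 'Oslo', 'address__geo__lat': 59.9}
```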

### Event Output Formats

DeltaForge supports multiple envelope formats for ecosystem compatibility:

| Format | Output | Use Case |
|--------|--------|----------|
| `native` | `{"op":"c","after":{...},"source":{...}}` | Lowest overhead, DeltaForge consumers |
| `debezium` | `{"schema":null,"payload":{...}}` | Drop-in Debezium replacement |
| `cloudevents` | `{"specversion":"1.0","type":"...","data":{...}}` | CNCF-standard, event-driven systems |

🔄 **Debezium Compatibility**: DeltaForge uses Debezium's **schemaless mode** (`schema: null`), which matches Debezium's `JsonConverter` with `schemas.enable=false` - the recommended configuration for most Kafka deployments. This provides wire compatibility with existing Debezium consumers without the overhead of inline schemas (~500+ bytes per message).

> 💡 **Migrating from Debezium?** If your consumers already use `schemas.enable=false`, configure `envelope: { type: debezium }` on your sinks for drop-in compatibility. For consumers expecting inline schemas, you'll need Schema Registry integration (Avro encoding - planned).

See [Envelope Formats](docs/src/envelopes.md) for detailed examples and wire format specifications.
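For a concrete picture, here is a Python sketch of how one change event maps into each envelope. The payload shapes follow the table above; the CloudEvents `type` string is a made-up placeholder, not a DeltaForge default:

```python
import json

def wrap(event, envelope):
    """Wrap a change event in one of the supported envelope formats."""
    if envelope == "native":
        return event  # lowest overhead: {"op": ..., "after": ..., "source": ...}
    if envelope == "debezium":
        return {"schema": None, "payload": event}  # schemaless mode
    if envelope == "cloudevents":
        return {"specversion": "1.0",
                "type": "com.example.cdc.row-changed",  # hypothetical type
                "data": event}
    raise ValueError(f"unknown envelope: {envelope}")

event = {"op": "c", "after": {"id": 1}, "source": {"table": "users"}}
print(json.dumps(wrap(event, "debezium")))
```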

## Documentation

- 📘 Online docs: https://vnvo.github.io/deltaforge
- 🛠 Local: `mdbook serve docs` (browse at http://localhost:3000)

## Local development helper

Use the bundled `dev.sh` CLI to spin up the dependency stack and run common workflows consistently:

```bash
./dev.sh up     # start Postgres, MySQL, Kafka, Redis, NATS from docker-compose.dev.yml
./dev.sh ps     # view container status
./dev.sh check  # fmt --check + clippy + tests (matches CI)
```

See the [Development guide](docs/src/development.md) for the full layout and additional info.

## Container image

Pre-built multi-arch images (amd64/arm64) are available:
```bash
# From GitHub Container Registry
docker pull ghcr.io/vnvo/deltaforge:latest

# From Docker Hub
docker pull vnvohub/deltaforge:latest

# Debug variant (includes shell)
docker pull ghcr.io/vnvo/deltaforge:latest-debug
```

Or build locally:
```bash
docker build -t deltaforge:local .
```

Run it by mounting your pipeline specs (environment variables are expanded inside the YAML) and exposing the API and metrics ports:

```bash
docker run --rm \
  -p 8080:8080 -p 9000:9000 \
  -v $(pwd)/examples/dev.yaml:/etc/deltaforge/pipelines.yaml:ro \
  -v deltaforge-checkpoints:/app/data \
  deltaforge:local \
  --config /etc/deltaforge/pipelines.yaml
```

Or run with environment variables expanded inside the provided config:
```bash
# pull the container
docker pull ghcr.io/vnvo/deltaforge:latest

# run it
docker run --rm \
  -p 8080:8080 -p 9000:9000 \
  -e MYSQL_DSN="mysql://user:pass@host:3306/db" \
  -e KAFKA_BROKERS="kafka:9092" \
  -v $(pwd)/pipeline.yaml:/etc/deltaforge/pipeline.yaml:ro \
  -v deltaforge-checkpoints:/app/data \
  ghcr.io/vnvo/deltaforge:latest \
  --config /etc/deltaforge/pipeline.yaml
```

The container runs as a non-root user, writes checkpoints to `/app/data/df_checkpoints.json`, and listens on `0.0.0.0:8080` for the control plane API with metrics served on `:9000`.

## Architecture Highlights

### At-least-once and Checkpoint Timing Guarantees

DeltaForge guarantees at-least-once delivery through careful checkpoint ordering:

```
Source → Processor → Sink (deliver) → Checkpoint (save)

  1. Sink acknowledges successful delivery
  2. THEN the checkpoint is saved
```

Checkpoints are never saved before events are delivered. A crash between delivery and checkpoint causes replay (duplicates possible), but never loss.
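A minimal Python sketch of this deliver-then-checkpoint ordering (toy classes, not DeltaForge's actual API):

```python
class Sink:
    """Toy sink that records acknowledged batch positions."""
    def __init__(self):
        self.delivered = []
    def deliver(self, batch):
        self.delivered.append(batch["position"])  # ack on success

class CheckpointStore:
    """Toy checkpoint backend holding the last committed position."""
    def __init__(self):
        self.position = None
    def save(self, position):
        self.position = position

def run_batch(batch, sink, ckpt):
    # 1. Deliver to the sink. If this raises, no checkpoint is written,
    #    so the batch is replayed on restart (duplicates possible, no loss).
    sink.deliver(batch)
    # 2. Only after the sink acknowledged do we persist the checkpoint.
    ckpt.save(batch["position"])

sink, ckpt = Sink(), CheckpointStore()
run_batch({"position": 42, "events": []}, sink, ckpt)
print(sink.delivered, ckpt.position)
# [42] 42
```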

### Schema-Checkpoint Correlation

The schema registry tracks schema versions with sequence numbers and optional checkpoint correlation. During replay, events are interpreted with the schema that was active when they were produced - even if the table structure has since changed.

### Source-Owned Schemas

Unlike tools that normalize all databases to a universal type system, DeltaForge lets each source define its own schema semantics. MySQL schemas capture MySQL types (`bigint(20) unsigned`, `json`), PostgreSQL schemas preserve arrays and custom types. No lossy normalization, no universal type maintenance burden.

## API

The REST API exposes JSON endpoints for liveness, readiness, and pipeline lifecycle
management. Routes are keyed by each pipeline's `metadata.name` field and return
`PipeInfo` payloads that include the pipeline name, status, and full configuration.

### Health

- `GET /healthz` - lightweight liveness probe returning `ok`.
- `GET /readyz` - readiness view returning `{"status":"ready","pipelines":[...]}`
with the current pipeline states.

### Pipeline management

- `GET /pipelines` - list all pipelines with their current status and config.
- `POST /pipelines` - create a new pipeline from a full `PipelineSpec` document.
- `GET /pipelines/{name}` - get a single pipeline by name.
- `PATCH /pipelines/{name}` - apply a partial JSON patch to an existing pipeline
(e.g., adjust batch or connection settings) and restart it with the merged spec.
- `DELETE /pipelines/{name}` - permanently delete a pipeline.
- `POST /pipelines/{name}/pause` - pause ingestion and processing for the pipeline.
- `POST /pipelines/{name}/resume` - resume a paused pipeline.
- `POST /pipelines/{name}/stop` - stop a running pipeline.
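As an illustration of the PATCH semantics, a recursive merge of a partial document into a stored spec might look like this (assumed JSON-merge-patch-style behavior; consult the configuration docs for the exact rules):

```python
def merge_patch(spec, patch):
    """Recursively merge a partial patch into an existing spec dict."""
    out = dict(spec)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(out.get(key), dict):
            out[key] = merge_patch(out[key], value)  # merge nested sections
        elif value is None:
            out.pop(key, None)  # null removes a key, as in RFC 7386
        else:
            out[key] = value
    return out

spec = {"spec": {"batch": {"max_events": 500, "max_ms": 1000}}}
patch = {"spec": {"batch": {"max_events": 200}}}
print(merge_patch(spec, patch))
# {'spec': {'batch': {'max_events': 200, 'max_ms': 1000}}}
```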

### Schema endpoints

- `GET /pipelines/{name}/schemas` - list DB schemas for the pipeline.
- `GET /pipelines/{name}/sensing/schemas` - list inferred schemas (from sensing).
- `GET /pipelines/{name}/sensing/schemas/{table}` - get inferred schema details.
- `GET /pipelines/{name}/sensing/schemas/{table}/json-schema` - export as JSON Schema.
- `GET /pipelines/{name}/sensing/schemas/{table}/classifications` - get dynamic map classifications.
- `GET /pipelines/{name}/drift` - get drift detection results.
- `GET /pipelines/{name}/sensing/stats` - get schema sensing cache statistics.
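The dynamic-map classification surfaced by the endpoints above rests on a simple idea: a JSON field whose distinct keys grow with the event count (session or trace IDs used as keys) is a dynamic map, not a fixed struct. A Python sketch with a made-up threshold, not DeltaForge's actual heuristic:

```python
def classify(objects, ratio_threshold=0.8):
    """Classify a JSON object field as a fixed 'struct' or a 'dynamic_map'.

    If distinct keys grow almost as fast as the number of observed objects,
    the field looks like a dynamic map. The 0.8 threshold is illustrative.
    """
    keys = set()
    for obj in objects:
        keys.update(obj)
    ratio = len(keys) / max(len(objects), 1)
    return "dynamic_map" if ratio >= ratio_threshold else "struct"

fixed = [{"city": "Oslo", "zip": "0150"} for _ in range(100)]
per_session = [{f"sess-{i}": {"hits": i}} for i in range(100)]
print(classify(fixed), classify(per_session))
# struct dynamic_map
```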

## Configuration schema

Pipelines are defined as YAML documents that map directly to the internal `PipelineSpec` type.
Environment variables are expanded before parsing, so secrets and URLs can be injected at runtime.
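That expansion step can be sketched as follows (a minimal version; how the real implementation treats missing variables is not specified here, so this sketch leaves them untouched):

```python
import os
import re

def expand_env(text):
    """Replace ${VAR} placeholders with environment values before YAML parsing."""
    return re.sub(
        r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}",
        lambda m: os.environ.get(m.group(1), m.group(0)),  # keep unknown vars as-is
        text,
    )

os.environ["MYSQL_DSN"] = "mysql://user:pass@host:3306/db"
print(expand_env("dsn: ${MYSQL_DSN}"))
# dsn: mysql://user:pass@host:3306/db
```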

### Full Example

```yaml
metadata:
  name: orders-mysql-to-kafka
  tenant: acme

spec:
  sharding:
    mode: hash
    count: 4
    key: customer_id

  source:
    type: mysql
    config:
      id: orders-mysql
      dsn: ${MYSQL_DSN}
      tables:
        - shop.orders
        - shop.outbox
      outbox:
        tables: ["shop.outbox"]
      snapshot:
        mode: initial

  processors:
    - type: javascript
      id: my-custom-transform
      inline: |
        function processBatch(events) {
          return events;
        }
      limits:
        cpu_ms: 50
        mem_mb: 128
        timeout_ms: 500

  sinks:
    - type: kafka
      config:
        id: orders-kafka
        brokers: ${KAFKA_BROKERS}
        topic: orders
        envelope:
          type: debezium
        encoding: json
        required: true
        exactly_once: false
    - type: redis
      config:
        id: orders-redis
        uri: ${REDIS_URI}
        stream: orders
        envelope:
          type: native
        encoding: json

  batch:
    max_events: 500
    max_bytes: 1048576
    max_ms: 1000
    respect_source_tx: true

  commit_policy:
    mode: quorum
    quorum: 2

  schema_sensing:
    enabled: true
    deep_inspect:
      enabled: true
      max_depth: 3
    sampling:
      warmup_events: 50
      sample_rate: 5
    high_cardinality:
      enabled: true
      min_events: 100
```
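The `batch` thresholds in the example combine as flush triggers; a Python sketch of the assumed any-threshold-wins semantics (defaults taken from the key-fields table below):

```python
def should_flush(n_events, n_bytes, elapsed_ms,
                 max_events=500, max_bytes=1_048_576, max_ms=1000):
    """A batch is flushed when any threshold is crossed (assumed semantics)."""
    return (n_events >= max_events
            or n_bytes >= max_bytes
            or elapsed_ms >= max_ms)

print(should_flush(500, 10, 5))   # True: event-count threshold reached
print(should_flush(3, 10, 5))     # False: no threshold reached yet
```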

### Key fields

| Field | Description |
|-------|-------------|
| **`metadata`** | |
| `name` | Pipeline identifier (used in API routes and metrics) |
| `tenant` | Business-oriented tenant label |
| **`spec.source`** | Database source - [MySQL](docs/src/sources/mysql.md), [PostgreSQL](docs/src/sources/postgres.md), etc. |
| `type` | `mysql`, `postgres`, etc. |
| `config.id` | Unique identifier for checkpoints |
| `config.dsn` | Connection string (supports `${ENV_VAR}`) |
| `config.tables` | Table patterns to capture |
| `config.outbox` | Tag outbox tables/prefixes with `__outbox` sentinel for the outbox processor |
| `config.snapshot` | Initial load: `mode` (`never`/`initial`/`always`), `chunk_size`, `max_parallel_tables` |
| `config.on_schema_drift` | `adapt` (default) - continue after failover schema drift; `halt` - stop for operator intervention |
| **`spec.processors`** | Optional transforms - see [Processors](docs/src/configuration.md#processors) |
| `type` | `javascript`, `outbox`, `flatten` |
| `inline` | JavaScript code for batch processing |
| `limits` | CPU, memory, and timeout limits |
| **`spec.sinks`** | One or more sinks - see [Sinks](docs/src/sinks/README.md) |
| `type` | `kafka`, `redis`, or `nats` |
| `config.envelope` | Output format: `native`, `debezium`, or `cloudevents` - see [Envelopes](docs/src/envelopes.md) |
| `config.encoding` | Wire encoding: `json` (default) |
| `config.required` | Whether sink must ack for checkpoint (`true` default) |
| **`spec.batch`** | Commit unit thresholds - see [Batching](docs/src/configuration.md#batching) |
| `max_events` | Flush after N events (default: 500) |
| `max_bytes` | Flush after size limit (default: 1MB) |
| `max_ms` | Flush after time (default: 1000ms) |
| `respect_source_tx` | Keep source transactions intact (`true` default) |
| **`spec.commit_policy`** | Checkpoint gating - see [Commit policy](docs/src/configuration.md#commit-policy) |
| `mode` | `all`, `required` (default), or `quorum` |
| `quorum` | Number of sinks for quorum mode |
| **`spec.schema_sensing`** | Runtime schema inference - see [Schema sensing](docs/src/schemasensing.md) |
| `enabled` | Enable schema sensing (`false` default) |
| `deep_inspect` | Nested JSON inspection settings |
| `sampling` | Sampling rate and warmup config |
| `high_cardinality` | Dynamic key detection settings |
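The commit-policy modes in the table can be illustrated with a small decision function (semantics inferred from the mode names and descriptions; the linked docs are authoritative):

```python
def checkpoint_allowed(acks, mode="required", quorum=None):
    """Decide whether a checkpoint may be saved, given per-sink ack status.

    acks: dict of sink id -> (acked: bool, required: bool).
    """
    if mode == "all":
        return all(acked for acked, _ in acks.values())       # every sink acked
    if mode == "required":
        return all(acked for acked, required in acks.values()
                   if required)                               # required sinks acked
    if mode == "quorum":
        return sum(acked for acked, _ in acks.values()) >= quorum
    raise ValueError(f"unknown commit policy: {mode}")

acks = {"kafka": (True, True), "redis": (False, False)}
print(checkpoint_allowed(acks, "all"))       # False: redis has not acked
print(checkpoint_allowed(acks, "required"))  # True: only kafka is required
print(checkpoint_allowed(acks, "quorum", 1)) # True: one ack meets the quorum
```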

📘 Full reference: [Configuration docs](docs/src/configuration.md)

Browse working examples: [Example Configurations](docs/src/examples/README.md)

## Roadmap

- [x] Outbox pattern support
- [x] Flatten processor
- [x] Persistent schema registry (SQLite, then PostgreSQL)
- [x] Snapshot/backfill (initial load for existing tables)
- [ ] Protobuf encoding
- [ ] PostgreSQL/S3 checkpoint backends for HA
- [ ] MongoDB source
- [ ] ClickHouse sink
- [ ] Event store for time-based replay
- [ ] Distributed coordination for HA

## License

Licensed under either of

- **MIT License** (see [`LICENSE-MIT`](./LICENSE-MIT))
- **Apache License, Version 2.0** (see [`LICENSE-APACHE`](./LICENSE-APACHE))

at your option.

Unless you explicitly state otherwise, any contribution intentionally submitted
for inclusion in this project by you shall be dual licensed as above, without
additional terms or conditions.