https://github.com/relaxe111/cdc-pipeline-generator
Reusable library for generating Redpanda Connect CDC pipelines.
- Host: GitHub
- URL: https://github.com/relaxe111/cdc-pipeline-generator
- Owner: Relaxe111
- License: mit
- Created: 2026-01-31T20:39:07.000Z (5 months ago)
- Default Branch: master
- Last Pushed: 2026-02-21T04:27:56.000Z (4 months ago)
- Last Synced: 2026-02-21T11:44:36.136Z (4 months ago)
- Topics: cdc-pipeline-console-generator, py-script, redpanda, redpanda-connect
- Language: Python
- Homepage:
- Size: 1.07 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# CDC Pipeline Generator
**Generate pipeline configurations for Change Data Capture (CDC) workflows.**
A CLI-first tool that reads YAML service definitions and produces streaming pipeline configurations, SQL migrations, and deployment artifacts. Supports **db-per-tenant** and **db-shared** multi-tenancy patterns with configurable data transport backends.
---
## Architecture
The generator sits at the centre of a CDC pipeline — it reads source database schemas, produces sink table definitions and pipeline configurations, and renders the runtime artifacts consumed by the streaming layer.
### Data Transport Options
CDC data can be moved from source to sink through one of three paths:
| Path | Transport | Typical Use |
|------|-----------|-------------|
| **Streaming** | Redpanda / Kafka | High-throughput, low-latency CDC with exactly-once semantics |
| **FDW** | PostgreSQL Foreign Data Wrappers | Direct MSSQL→PG pull without an external message broker |
| **Native PostgreSQL** | Logical replication or polling | PostgreSQL→PostgreSQL CDC without an external broker |
#### Streaming (Redpanda / Kafka)
Source change events are captured, streamed through a message broker, and consumed by sink processors that write to the target PostgreSQL database.
```text
MSSQL → CDC capture → Redpanda/Kafka → Bento sink → PostgreSQL
```
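To make the sink step concrete, here is a minimal sketch of how a sink processor might apply ordered change events to a target table. The event shape (`op`, `key`, `row`) and the `apply_events` helper are assumptions made for illustration; they are not the event format emitted by Redpanda Connect or by this generator.

```python
from typing import Any

# Hypothetical change-event shape: "op" is "insert", "update", or "delete".
Event = dict[str, Any]


def apply_events(target: dict[int, dict], events: list[Event]) -> dict[int, dict]:
    """Apply change events in order; inserts and updates are both upserts."""
    for event in events:
        if event["op"] == "delete":
            # Deleting a missing key is a no-op, so replaying events is harmless.
            target.pop(event["key"], None)
        else:
            target[event["key"]] = event["row"]
    return target


events = [
    {"op": "insert", "key": 1, "row": {"id": 1, "name": "Ada"}},
    {"op": "update", "key": 1, "row": {"id": 1, "name": "Ada L."}},
    {"op": "insert", "key": 2, "row": {"id": 2, "name": "Bob"}},
    {"op": "delete", "key": 2},
]
table = apply_events({}, events)
print(table)
```

Treating inserts and updates as the same upsert means that replaying the same event sequence leaves the target unchanged, which is one common way to tolerate redelivery.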
#### FDW (Foreign Data Wrapper)
The generator can produce configurations that use PostgreSQL Foreign Data Wrappers (`tds_fdw`) to pull data directly from MSSQL into staging tables, followed by merge procedures that apply changes to the target tables.
```text
MSSQL ← tds_fdw ← PostgreSQL (staging → merge → target)
```
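The staging-then-merge step can be sketched as an upsert from a staging table into a target table. The example below is only an illustration: SQLite stands in for PostgreSQL, the staging rows are inserted by hand rather than pulled through `tds_fdw`, and the table and function names are hypothetical. The merge procedures the generator actually emits are not shown in this README.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE target  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE staging (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO target  VALUES (1, 'old name'), (2, 'unchanged');
    INSERT INTO staging VALUES (1, 'new name'), (3, 'brand new');
    """
)


def merge_staging(conn: sqlite3.Connection) -> None:
    """Upsert every staged row into the target, then clear staging."""
    with conn:  # one transaction: merge and cleanup succeed or fail together
        conn.execute(
            """
            INSERT INTO target (id, name)
            SELECT id, name FROM staging WHERE 1  -- WHERE avoids an SQLite parsing ambiguity
            ON CONFLICT (id) DO UPDATE SET name = excluded.name
            """
        )
        conn.execute("DELETE FROM staging")


merge_staging(conn)
rows = conn.execute("SELECT id, name FROM target ORDER BY id").fetchall()
print(rows)
```

Existing rows are updated, new rows are inserted, and rows absent from staging are left alone. PostgreSQL accepts the same `INSERT ... ON CONFLICT ... DO UPDATE` form.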
#### Native PostgreSQL-to-PostgreSQL
For PostgreSQL source databases, native logical replication or polling-based CDC can be used without an external broker.
```text
PostgreSQL → native CDC polling → PostgreSQL target
```
All three paths are configuration-driven — the generator produces the correct pipeline YAML, SQL migrations, and runtime helpers based on the chosen transport and source database type.
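As a rough illustration of what "configuration-driven" could mean here, the sketch below maps a source type and transport to one of the three flows drawn above. The lookup keys and the `describe_path` function are hypothetical, and only the combinations shown in the diagrams are listed; the generator may well support others.

```python
# Flow labels copied from the diagrams above; the lookup keys are hypothetical.
PATHS = {
    ("mssql", "streaming"): "MSSQL → CDC capture → Redpanda/Kafka → Bento sink → PostgreSQL",
    ("mssql", "fdw"): "MSSQL ← tds_fdw ← PostgreSQL (staging → merge → target)",
    ("postgres", "native"): "PostgreSQL → native CDC polling → PostgreSQL target",
}


def describe_path(source_type: str, transport: str) -> str:
    """Return the data flow for a source type and transport, if one is sketched."""
    key = (source_type.lower(), transport.lower())
    if key not in PATHS:
        raise ValueError(f"no path sketched for {key}")
    return PATHS[key]


print(describe_path("MSSQL", "fdw"))
```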
---
## Installation
### Option A: Docker (zero host dependencies)
```bash
docker pull asmacarma/cdc-pipeline-generator:latest
```
### Option B: Host install via pip
```bash
# Editable install for active development
pip install -e .
# Or install directly from the repository
pip install .
```
After host install, the `cdc` command is available on your shell PATH.
---
## Quick Start
### 1. Create a project and initialize
```bash
mkdir my-cdc-project && cd my-cdc-project
cdc init
```
This creates the project structure: the `source-groups.yaml` file and the `services/` and `pipelines/` directories.
### 2. Scaffold a server group
```bash
# db-per-tenant (one database per customer)
cdc scaffold my-group \
--pattern db-per-tenant \
--source-type mssql \
--extraction-pattern "^myapp_(?P[^_]+)$"
# db-shared (single database, multi-tenant)
cdc scaffold my-group \
--pattern db-shared \
--source-type postgres \
--extraction-pattern "^myapp_(?P[^_]+)_(?P(dev|stage|prod))$" \
--environment-aware
```
### 3. Configure services and tables
```bash
# Create a service
cdc manage-services config --create-service my-service
# Add source tables
cdc manage-services config --service my-service --add-source-table dbo.Users --primary-key id
cdc manage-services config --service my-service --add-source-table dbo.Orders --primary-key order_id
# Inspect and save source schemas
cdc manage-services config --service my-service --inspect --all --save
```
### 4. Manage schemas and migrations
```bash
# Generate DDL migrations for the sink database
cdc manage-migrations generate
# Review changes
cdc manage-migrations diff
# Apply migrations
cdc manage-migrations apply
```
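Conceptually, a schema diff compares the columns the sink currently has with the columns the source schema calls for. The following is a minimal sketch of that idea, not the generator's own diff logic; `diff_columns` and its inputs are hypothetical, and it handles only added columns and changed types.

```python
def diff_columns(table: str, current: dict[str, str], desired: dict[str, str]) -> list[str]:
    """Return PostgreSQL statements that bring `current` columns in line with `desired`."""
    statements = []
    for column, sql_type in desired.items():
        if column not in current:
            statements.append(f"ALTER TABLE {table} ADD COLUMN {column} {sql_type};")
        elif current[column] != sql_type:
            statements.append(f"ALTER TABLE {table} ALTER COLUMN {column} TYPE {sql_type};")
    return statements


current = {"id": "integer", "name": "varchar(50)"}
desired = {"id": "integer", "name": "text", "email": "text"}
for statement in diff_columns("users", current, desired):
    print(statement)
```

Reviewing the statements before applying them mirrors the `diff` then `apply` order of the commands above.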
### 5. Generate pipeline configurations
```bash
# Generate for a single service
cdc generate --service my-service --environment dev
# Generate for all services
cdc generate --all --environment dev
```
---
## Multi-Tenancy Patterns
### db-per-tenant
Each customer has a dedicated source database. The generator creates one source+sink pipeline per customer database.
```text
Extraction pattern: ^myapp_(?P[^_]+)$
Matches: myapp_customer_a, myapp_customer_b
```
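A minimal sketch of how such an extraction pattern behaves, using Python's `re` module. The group name `customer` is an assumption (the group name the CLI expects is not visible here), and the database names are made up.

```python
import re

# Hypothetical group name "customer"; the name the CLI expects is not shown here.
PATTERN = re.compile(r"^myapp_(?P<customer>[^_]+)$")


def customers_from_databases(names: list[str]) -> dict[str, str]:
    """Map each matching database name to the customer captured from it."""
    found = {}
    for name in names:
        match = PATTERN.match(name)
        if match:
            found[name] = match.group("customer")
    return found


databases = ["myapp_acme", "myapp_globex", "master", "myapp_acme_archive"]
print(customers_from_databases(databases))
```

Because `[^_]+` excludes underscores, a name with a further underscore after the prefix, such as `myapp_acme_archive`, is not matched by this pattern.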
### db-shared
All customers share a single database, differentiated by a column (e.g. `customer_id`) or schema. Requires `--environment-aware`.
```text
Extraction pattern: ^myapp_(?P[^_]+)_(?P(dev|stage|prod))$
Matches: myapp_users_dev, myapp_users_prod
```
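The environment-aware variant can be sketched the same way. The group names `service` and `env` are assumptions chosen for illustration; only the overall shape of the pattern comes from the text above.

```python
import re
from typing import Optional

# Hypothetical group names "service" and "env".
PATTERN = re.compile(r"^myapp_(?P<service>[^_]+)_(?P<env>(dev|stage|prod))$")


def parse_database(name: str) -> Optional[dict[str, str]]:
    """Return the named groups for a matching database name, else None."""
    match = PATTERN.match(name)
    return match.groupdict() if match else None


for name in ["myapp_users_dev", "myapp_users_prod", "myapp_users_qa"]:
    print(name, parse_database(name))
```

`groupdict()` returns only the named groups, so the unnamed inner alternation `(dev|stage|prod)` does not appear as a separate key.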
---
## Command Reference
| Command | Description |
| ------- | ----------- |
| `cdc init` | Initialize a new CDC project |
| `cdc scaffold <group-name>` | Scaffold a server group with database services |
| `cdc manage-services config` | Create, list, inspect services and tables |
| `cdc manage-services config --inspect-sink` | Inspect and save target sink schemas |
| `cdc manage-migrations generate` | Generate PostgreSQL DDL migrations |
| `cdc manage-migrations diff` | Show pending schema changes |
| `cdc manage-migrations apply` | Apply migrations to target database |
| `cdc generate` | Generate pipeline YAML configurations |
| `cdc manage-source-groups` | Manage source database groups |
| `cdc manage-sink-groups` | Manage sink/target groups |
| `cdc validate` | Validate all configurations |
---
## Project Structure
```text
cdc-pipeline-generator/
├── cdc_generator/ # Core library
│ ├── cli/ # Click command groups
│ ├── core/ # Pipeline generation, migration engine
│ ├── helpers/ # Database, FDW, MSSQL utilities
│ ├── service-schemas/ # YAML schema definitions and type adapters
│ ├── templates/ # Jinja2 pipeline templates
│ └── validators/ # Configuration and schema validation
├── tests/ # Test suite
├── _docs/ # Architecture, getting started, CLI reference
├── examples/ # db-per-tenant and db-shared reference implementations
├── setup.py / pyproject.toml # Package metadata
└── Dockerfile # Docker runtime image
```
---
## Development
See `_docs/getting-started/` for setup instructions, `_docs/architecture/` for design decisions, and `_docs/cli/` for the full CLI command reference.
The CDC CLI runs directly on the host. Install once and use `cdc` from any directory.
- ✅ `cdc` command available everywhere on your host
- ✅ Access to source and target databases
- ✅ Fish shell with auto-completions (reload with `cdc reload-cdc-autocompletions`)
- ✅ Git and SSH keys available
Optionally, a dev container is available if you prefer an isolated environment:
```bash
docker compose exec dev fish
```
---
## 📁 Project Structure
After running `cdc scaffold`, your project will have:
```text
my-cdc-project/
├── docker-compose.yml # Optional infrastructure (databases, streaming)
├── Dockerfile.dev # Optional dev container image
├── .env.example # Environment variables template
├── .env # Your credentials (git-ignored)
├── .gitignore # Git ignore rules
├── source-groups.yaml # Server group config (generated by cdc)
├── README.md # Quick start guide
├── services/ # Service definitions (generated by cdc)
│   └── my-service.yaml
├── pipelines/ # Pipeline templates + generated YAML
│   ├── templates/ # source-pipeline.yaml, sink-pipeline.yaml
│   └── generated/
│       ├── sources/
│       └── sinks/
└── generated/ # Generated non-pipeline output (git-ignored)
    ├── schemas/ # PostgreSQL schemas
    └── pg-migrations/ # PostgreSQL migrations
```
---
## 🔧 Advanced Usage
### Using as Python Library
```python
from cdc_generator.core.pipeline_generator import generate_pipelines

# Generate pipelines programmatically
generate_pipelines(
    service='my-service',
    environment='dev',
    output_dir='./pipelines/generated',
)
```
### Custom Pipeline Templates
Place custom Jinja2 templates in `pipelines/templates/`:
```yaml
# pipelines/templates/source-pipeline.yaml
input:
  mssql_cdc:
    dsn: "{{ dsn }}"
    tables: {{ tables | tojson }}
    # Your custom configuration
```
### Environment-Specific Configuration
Use environment variables in source-groups.yaml:
```yaml
server:
  host: ${MSSQL_HOST} # Replaced at runtime
  port: ${MSSQL_PORT}
  user: ${MSSQL_USER}
  password: ${MSSQL_PASSWORD}
```
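How the placeholders are resolved at runtime is not described here. As a rough sketch of the general idea, Python's `string.Template` understands the same `${NAME}` syntax; the `render_config` helper and the sample values below are hypothetical.

```python
from string import Template

CONFIG = """\
server:
  host: ${MSSQL_HOST}
  port: ${MSSQL_PORT}
  user: ${MSSQL_USER}
  password: ${MSSQL_PASSWORD}
"""


def render_config(text: str, env: dict[str, str]) -> str:
    """Replace ${NAME} placeholders; raises KeyError if a variable is missing."""
    return Template(text).substitute(env)


# Sample values only; in practice the mapping would typically be os.environ.
env = {
    "MSSQL_HOST": "db.internal",
    "MSSQL_PORT": "1433",
    "MSSQL_USER": "cdc",
    "MSSQL_PASSWORD": "example-only",
}
rendered = render_config(CONFIG, env)
print(rendered)
```

Using `substitute` rather than `safe_substitute` makes a missing variable fail loudly instead of leaving an unresolved placeholder in the configuration.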
### SQL-Based Source Custom Keys (Source + Sink)
Use custom keys to compute per-database values during `--update` and write them
into each source environment entry (for example `customer_id`).
```bash
# Source groups: persist SQL custom key definition
cdc manage-source-groups \
--add-source-custom-key customer_id \
--custom-key-value "SELECT customer_id FROM dbo.settings" \
--custom-key-exec-type sql
# Run update to execute the SQL per discovered database
cdc manage-source-groups --update
```
```bash
# Sink groups: same custom key model
cdc manage-sink-groups \
--sink-group sink_analytics \
--add-source-custom-key customer_id \
--custom-key-value "SELECT customer_id FROM public.settings" \
--custom-key-exec-type sql
# Run sink update to execute SQL per discovered sink database
cdc manage-sink-groups --update --sink-group sink_analytics
```
Generated shape (simplified):
```yaml
sources:
  directory:
    schemas: [public]
    nonprod:
      server: default
      database: directory_db
      table_count: 42
      customer_id: cust-001
```
If a key returns no value for a specific server/database, the update continues and
prints a warning with that server/database context.
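The warn-and-continue behaviour can be sketched as follows. This is an illustration under stated assumptions, not the CLI's implementation: in-memory SQLite databases stand in for the discovered databases, the query targets a plain `settings` table because SQLite has no `dbo` schema, and `resolve_custom_key` and `make_db` are hypothetical helpers.

```python
import sqlite3
import sys


def resolve_custom_key(databases: dict[str, sqlite3.Connection], query: str) -> dict[str, str]:
    """Run `query` against each database; warn and continue when it yields nothing."""
    values = {}
    for name, conn in databases.items():
        row = conn.execute(query).fetchone()
        if row is None or row[0] is None:
            print(f"warning: custom key returned no value for database {name!r}", file=sys.stderr)
            continue
        values[name] = row[0]
    return values


def make_db(customer_id):
    """Create a throwaway database whose settings table may be empty."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE settings (customer_id TEXT)")
    if customer_id is not None:
        conn.execute("INSERT INTO settings VALUES (?)", (customer_id,))
    return conn


databases = {"directory_db": make_db("cust-001"), "empty_db": make_db(None)}
resolved = resolve_custom_key(databases, "SELECT customer_id FROM settings")
print(resolved)
```

Only databases that returned a value appear in the result; the empty one produces a warning on standard error and is skipped.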
---
## 🤝 Contributing
### For Library Contributors
If you want to contribute to the cdc-pipeline-generator library itself:
```bash
# Clone repository
git clone https://github.com/Relaxe111/cdc-pipeline-generator.git
cd cdc-pipeline-generator
# Install in editable mode with dev dependencies
pip install -e ".[dev]"
# Run tests
pytest
# Format code
black .
ruff check .
```
### For Users
If you're using the library in your project, install it as shown in [Installation](#installation).
---
## 📚 Resources