https://github.com/clickhouse/deltalake-cdc
- Host: GitHub
- URL: https://github.com/clickhouse/deltalake-cdc
- Owner: ClickHouse
- License: MIT
- Created: 2025-07-08T20:04:34.000Z (11 months ago)
- Default Branch: master
- Last Pushed: 2025-08-17T19:24:42.000Z (10 months ago)
- Last Synced: 2025-10-13T03:34:36.494Z (8 months ago)
- Language: Python
- Size: 16.6 KB
- Stars: 5
- Watchers: 0
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Delta Lake to ClickHouse CDC Pipeline
This project provides tools to write sample data to a Delta Lake table and to stream that table's changes to ClickHouse using Delta Lake's Change Data Feed (CDF).
## Limitations
- Only INSERTs and UPDATEs are supported; DELETEs are ignored.
- The data generator generates a fixed schema for the Delta table.
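To make the first limitation concrete: Delta Lake's Change Data Feed tags every change row with a `_change_type` of `insert`, `update_preimage`, `update_postimage`, or `delete`. The sketch below shows one plausible way to keep only inserts and post-update images. The helper name `keep_upserts`, the dict-based row shape, and the choice to drop pre-update images are illustrative assumptions, not code taken from this repository.

```python
from typing import Dict, Iterable, List

# Change types emitted by Delta Lake's Change Data Feed that carry a new row image.
# Assumption: pre-update images and deletes are dropped rather than forwarded.
FORWARDED_CHANGE_TYPES = {"insert", "update_postimage"}


def keep_upserts(rows: Iterable[Dict]) -> List[Dict]:
    """Return only the rows whose _change_type is an insert or a post-update image."""
    return [row for row in rows if row.get("_change_type") in FORWARDED_CHANGE_TYPES]


changes = [
    {"id": "a", "_change_type": "insert", "_commit_version": 1},
    {"id": "a", "_change_type": "update_preimage", "_commit_version": 2},
    {"id": "a", "_change_type": "update_postimage", "_commit_version": 2},
    {"id": "a", "_change_type": "delete", "_commit_version": 3},
]
upserts = keep_upserts(changes)
print([row["_change_type"] for row in upserts])
```

Because the delete at version 3 is dropped, the destination would still hold the version 2 image of row `a`, which is what "DELETEs are ignored" implies in practice.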
## Prerequisites
- Python 3.8+
- AWS credentials configured with access to S3
- ClickHouse server (local or cloud)
- Required Python packages (install with `pip install -r requirements.txt`)
## 1. Generate Sample Data
First, generate some sample data and write it to a Delta Lake table in S3:
```bash
python data_generator.py -p s3://your-bucket/path/to/deltalake/table -r us-east-1
```
Options:
- `-p, --bucket_path`: S3 path where the Delta table will be stored (required)
- `-r, --delta_region`: AWS region for the S3 bucket (default: us-east-1)
- `-b, --batch-size`: Number of rows per batch (default: 10000)
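For readers who want to script around the generator, the options above map directly onto `argparse`. The parser below is a hypothetical reconstruction from the option list only; it is not the actual contents of `data_generator.py`, and the help strings and attribute names are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the options documented in this README."""
    parser = argparse.ArgumentParser(description="Write sample rows to a Delta Lake table")
    parser.add_argument("-p", "--bucket_path", required=True,
                        help="S3 path where the Delta table will be stored")
    parser.add_argument("-r", "--delta_region", default="us-east-1",
                        help="AWS region for the S3 bucket")
    parser.add_argument("-b", "--batch-size", type=int, default=10000,
                        help="Number of rows per batch")
    return parser


# Only the required path is given, so the documented defaults apply.
args = build_parser().parse_args(["-p", "s3://your-bucket/path/to/deltalake/table"])
print(args.bucket_path, args.delta_region, args.batch_size)
```

Note that `argparse` exposes `--batch-size` as the attribute `batch_size`, replacing the hyphen with an underscore.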
## 2. Query Delta Lake from ClickHouse
You can query the Delta Lake table directly from ClickHouse using the DeltaLake table engine:
```sql
CREATE TABLE my_delta_table
ENGINE = DeltaLake('s3://your-bucket/path/to/table');
```
## 3. Create Destination Table in ClickHouse
Create a table in ClickHouse to store the CDC changes. The schema should match your Delta table, plus the CDF metadata columns `_change_type`, `_commit_version`, and `_commit_timestamp`:
```sql
CREATE TABLE default.my_cdc_table
(
    `id` String,
    `name` String,
    `age` Int64,
    `created_at` DateTime,
    `_change_type` String,
    `_commit_version` Int64,
    `_commit_timestamp` DateTime
)
ENGINE = ReplacingMergeTree(`_commit_version`)
PARTITION BY toYYYYMM(`created_at`)
ORDER BY (name, age)
SETTINGS index_granularity = 8192;
```
## 4. Run the CDC Script
Run the CDC script to stream changes from the Delta Lake table to ClickHouse:
```bash
python main.py \
  -p "s3://your-bucket/path/to/table" \
  -r "us-east-2" \
  -t "default.my_cdc_table" \
  -H "host.us-west-2.aws.clickhouse.cloud" \
  -u "default" \
  -P "password" \
  --access-key "[EXAMPLE]" \
  --secret-key "[EXAMPLE]" \
  -v 1
```