https://github.com/clickhouse/deltalake-cdc



# Delta Lake to ClickHouse CDC Pipeline

This project provides tools to write sample data to a Delta Lake table and to stream that table's changes to ClickHouse using Delta Lake's Change Data Feed (CDF).

## Limitations

- Only INSERTs and UPDATEs are supported; DELETEs are ignored.
- The data generator writes a fixed schema to the Delta table.
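
To illustrate the first limitation: Delta Lake's Change Data Feed tags each change row with a `_change_type` of `insert`, `update_preimage`, `update_postimage`, or `delete`. The sketch below shows one way a consumer could keep only the rows that carry new state. The function name and the choice to drop `update_preimage` rows are assumptions made for illustration, not taken from this project's code.

```python
# Values of `_change_type` that describe a new row state.
# Dropping `update_preimage` is an assumption; `delete` is dropped
# to mirror the "DELETEs are ignored" limitation.
APPLIED_CHANGE_TYPES = {"insert", "update_postimage"}


def select_applicable_changes(rows):
    """Return only the CDF rows that should be written downstream.

    `rows` is an iterable of dicts, each with a `_change_type` key.
    Input order is preserved.
    """
    return [row for row in rows if row["_change_type"] in APPLIED_CHANGE_TYPES]


if __name__ == "__main__":
    sample = [
        {"id": "1", "_change_type": "insert"},
        {"id": "1", "_change_type": "update_preimage"},
        {"id": "1", "_change_type": "update_postimage"},
        {"id": "2", "_change_type": "delete"},
    ]
    for row in select_applicable_changes(sample):
        print(row)
```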

## Prerequisites

- Python 3.8+
- AWS credentials configured with access to S3
- ClickHouse server (local or cloud)
- Required Python packages, installed with `pip install -r requirements.txt`

## 1. Generate Sample Data

First, write some sample data to a Delta Lake table in S3:

```bash
python data_generator.py -p s3://your-bucket/path/to/deltalake/table -r us-east-1
```

Options:
- `-p, --bucket_path`: S3 path where the Delta table will be stored (required)
- `-r, --delta_region`: AWS region for the S3 bucket (default: us-east-1)
- `-b, --batch-size`: Number of rows per batch (default: 10000)
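
For orientation, here is a minimal sketch of what one generated batch might look like. It assumes the fixed schema consists of the four data columns that appear in the destination table in section 3 (`id`, `name`, `age`, `created_at`). The function name, the name pool, and the age range are hypothetical; `data_generator.py` may produce different values.

```python
import random
import uuid
from datetime import datetime, timezone

# Hypothetical name pool; the real generator's values are not documented.
NAMES = ["alice", "bob", "carol", "dave"]


def make_batch(batch_size=10000, seed=None):
    """Build one batch of rows as a list of dicts.

    Passing a `seed` makes ids, names, and ages reproducible;
    `created_at` is always the current UTC time.
    """
    rng = random.Random(seed)
    now = datetime.now(timezone.utc).replace(microsecond=0)
    return [
        {
            "id": str(uuid.UUID(int=rng.getrandbits(128))),
            "name": rng.choice(NAMES),
            "age": rng.randint(18, 90),
            "created_at": now,
        }
        for _ in range(batch_size)
    ]


if __name__ == "__main__":
    for row in make_batch(batch_size=3, seed=42):
        print(row)
```

A list of dicts like this can be converted to a tabular structure before being appended to the Delta table; that write step is left out because it depends on the packages listed in `requirements.txt`.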

## 2. Query Delta Lake from ClickHouse

You can query the Delta Lake table directly from ClickHouse using the DeltaLake table engine:

```sql
CREATE TABLE my_delta_table
ENGINE = DeltaLake('s3://your-bucket/path/to/table');
```

## 3. Create Destination Table in ClickHouse

Create a table in ClickHouse to store the CDC changes. Its schema should match your Delta table, plus the three CDF metadata columns (`_change_type`, `_commit_version`, `_commit_timestamp`):

```sql
CREATE TABLE default.my_cdc_table
(
    `id` String,
    `name` String,
    `age` Int64,
    `created_at` DateTime,
    `_change_type` String,
    `_commit_version` Int64,
    `_commit_timestamp` DateTime
)
ENGINE = ReplacingMergeTree(`_commit_version`)
PARTITION BY toYYYYMM(`created_at`)
ORDER BY (name, age)
SETTINGS index_granularity = 8192;
```
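
To make the role of `ReplacingMergeTree(_commit_version)` concrete, the sketch below simulates in plain Python what a merge keeps: for each distinct sorting-key value, the row with the highest version survives. It is an illustration of the engine's documented behavior, not ClickHouse code, and the function name is hypothetical.

```python
def replacing_merge(rows, sorting_key=("name", "age"), version="_commit_version"):
    """Simulate the rows a ReplacingMergeTree merge would retain.

    `rows` is a list of dicts in insertion order. For each distinct
    sorting-key value the row with the highest version is kept; on a
    version tie, the row inserted last wins. Output is ordered by key.
    """
    latest = {}
    for row in rows:
        key = tuple(row[col] for col in sorting_key)
        kept = latest.get(key)
        if kept is None or row[version] >= kept[version]:
            latest[key] = row
    return [latest[key] for key in sorted(latest)]


if __name__ == "__main__":
    changes = [
        {"id": "1", "name": "alice", "age": 30, "_commit_version": 1},
        {"id": "1", "name": "alice", "age": 30, "_commit_version": 3},
        {"id": "2", "name": "bob", "age": 41, "_commit_version": 1},
    ]
    for row in replacing_merge(changes):
        print(row)
```

Two things follow from this behavior. Merges run in the background, so a query may still see older versions until a merge happens; adding `FINAL` to the `SELECT` applies the deduplication at query time. Also, rows are deduplicated by the `ORDER BY` key, so with `(name, age)` an update that changes `name` or `age` lands under a different key and does not replace the earlier row.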

## 4. Run the CDC Script

Run the CDC script to stream changes from the Delta Lake table to ClickHouse:

```bash
python main.py \
  -p "s3://your-bucket/path/to/table" \
  -r "us-east-2" \
  -t "default.my_cdc_table" \
  -H "host.us-west-2.aws.clickhouse.cloud" \
  -u "default" \
  -P "password" \
  --access-key "[EXAMPLE]" \
  --secret-key "[EXAMPLE]" \
  -v 1
```
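
This README does not describe how `main.py` tracks its position in the change feed. A CDC loop generally has to remember the last commit version it applied so that the next read resumes after it. The sketch below shows that bookkeeping under two assumptions: change rows arrive as dicts carrying `_commit_version`, and `-v` sets the Delta version to start from. Both helper names are hypothetical.

```python
from collections import defaultdict


def group_by_version(rows):
    """Group change rows by `_commit_version`, in ascending version order.

    Returns a list of (version, rows) pairs so that commits can be
    applied in the order they were written.
    """
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["_commit_version"]].append(row)
    return [(version, grouped[version]) for version in sorted(grouped)]


def next_start_version(rows, current_start):
    """Return the version the next read should start from.

    If no rows were read, the starting version is unchanged;
    otherwise it is one past the highest version seen.
    """
    versions = [row["_commit_version"] for row in rows]
    return max(versions) + 1 if versions else current_start


if __name__ == "__main__":
    read = [
        {"id": "a", "_commit_version": 2},
        {"id": "b", "_commit_version": 1},
        {"id": "c", "_commit_version": 2},
    ]
    for version, batch in group_by_version(read):
        print(version, [row["id"] for row in batch])
    print("next start:", next_start_version(read, current_start=1))
```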