https://github.com/gordonmurray/flink_paimon_duckdb_rill


# Real-Time Analytics Pipeline with Flink, Paimon, and Rill

A complete streaming analytics stack that captures MySQL changes via CDC, stores them in Apache Paimon format, and visualizes them with Rill dashboards.

## 🚀 What You Get

- **Real-Time CDC**: Captures every MySQL change using Flink CDC
- **Lake Storage**: Stores data in Apache Paimon format on S3-compatible storage
- **Live Dashboard**: Rill analytics with automated catalog management
- **Automated Fixes**: Sidecar container handles DuckDB catalog prefix issues
- **One Command Start**: Everything runs with `docker compose up`

## 🏗️ Architecture

```
MySQL → Flink CDC → Apache Paimon → MinIO → Rill Dashboard
  ↑                                                ↓
Manual inserts                                 Analytics
```

**Components:**
- **MySQL/MariaDB**: Source database with sample product data
- **Apache Flink**: Real-time CDC processing engine
- **Apache Paimon**: Lake storage format optimized for streaming
- **MinIO**: S3-compatible object storage
- **Rill**: Modern analytics dashboard with DuckDB engine
- **Rill Patcher**: Automated sidecar handling catalog prefix issues

## ⚡ Quick Start

### Prerequisites
- Docker and Docker Compose
- 8GB+ RAM recommended
- Ports 3000, 3306, 8081, 9000-9001 available

### 1. Clone and Start
```bash
git clone https://github.com/gordonmurray/flink_paimon_duckdb_rill
cd flink_paimon_duckdb_rill
docker compose up -d
```

### 2. Initialize the CDC Pipeline
```bash
./setup_cdc.sh
```

### 3. Open the Dashboard
Navigate to: **http://localhost:3000**

The dashboard shows live data and refreshes automatically every 60 seconds.

## 🧪 Test Real-Time Updates

Add new products to see live updates:

```bash
# Add some products
docker exec mariadb mysql -u root -prootpassword -e "
INSERT INTO mydatabase.products (name, price) VALUES
('New Product 1', 99.99),
('New Product 2', 199.99);"

# Check MySQL count
docker exec mariadb mysql -u root -prootpassword -e "SELECT COUNT(*) FROM mydatabase.products;"

# Wait 60 seconds for dashboard to refresh
# You'll see the updated count automatically!
```

## 🔧 How It Works

### CDC Pipeline
1. **MySQL Changes**: Any INSERT/UPDATE/DELETE in MySQL is captured
2. **Flink Processing**: Flink CDC reads the MySQL binlog in real-time
3. **Paimon Storage**: Changes are written to Paimon tables in MinIO
4. **Rill Dashboard**: Visualizes data with 60-second refresh cycle

### The Catalog Prefix Solution
DuckDB creates random catalog prefixes (e.g., `main8514e79c`) on startup. Our `rill-patcher` sidecar:
1. Waits for Rill to start
2. Discovers the current catalog alias via SQL
3. Patches the model file with the correct prefix
4. Refreshes data every 60 seconds
5. Re-patches if Rill restarts with a new prefix
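
The patch step (step 3) can be sketched as a small shell function. This is a minimal illustration, not the contents of the repository's `rill-patcher.sh`: the name `patch_model_prefix` is hypothetical, and it assumes stale prefixes always look like `main` followed by eight hexadecimal characters, as in the example above. Discovering the current alias (step 2) is left to the caller.

```bash
#!/usr/bin/env bash

# Hypothetical helper (not taken from rill-patcher.sh): replace every
# catalog prefix in a model file with the prefix currently in use.
# Assumption: prefixes have the form "main" + 8 hex characters,
# e.g. main8514e79c, and contain no characters special to sed.
patch_model_prefix() {
  local model_file="$1" new_prefix="$2" tmp status=0
  tmp="$(mktemp)" || return 1
  # Write to a temp file first so a failed sed never truncates the model.
  sed -E "s/main[0-9a-f]{8}/${new_prefix}/g" "$model_file" > "$tmp" \
    && cat "$tmp" > "$model_file" || status=1
  rm -f "$tmp"
  return "$status"
}
```

Because the replacement has the same shape as the pattern it matches, calling the function again with the same prefix leaves the file unchanged, which suits a loop that runs every 60 seconds.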

### Why Apache Paimon?
- Optimized for streaming updates with ACID guarantees
- Supports both batch and streaming workloads
- Compatible with multiple query engines
- Efficient storage with automatic compaction

## 📊 Monitoring

### Service Health Checks
```bash
# Check all containers
docker ps

# Monitor CDC job
curl -s http://localhost:8081/jobs | jq

# Test Rill Dashboard API
curl -s "http://localhost:3000/v1/instances/default/query" \
-H "Content-Type: application/json" \
-d '{"sql":"SELECT COUNT(*) FROM paimon_products"}'

# View Paimon files in MinIO
docker exec minio mc ls --recursive local/warehouse/
```
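
For scripted checks, the output of the `/jobs` call can be reduced to a yes/no answer. Flink's `/jobs` endpoint returns a JSON object whose `jobs` array lists each job's `id` and `status`; the sketch below counts the entries reported as `RUNNING`. The function names are hypothetical, and the check assumes the CDC job is the only job on the cluster, since it does not look at job IDs.

```bash
#!/usr/bin/env bash

# Count jobs reported as RUNNING in a Flink /jobs response read from stdin.
# Uses grep rather than jq so it also works where jq is not installed.
count_running_jobs() {
  { grep -Eo '"status"[[:space:]]*:[[:space:]]*"RUNNING"' || true; } \
    | wc -l | tr -d '[:space:]'
}

# Succeed only if at least one job is RUNNING.
cdc_job_is_running() {
  [ "$(count_running_jobs)" -gt 0 ]
}
```

One possible use is `curl -s http://localhost:8081/jobs | cdc_job_is_running || ./setup_cdc.sh`, which reruns the setup script only when no job is running.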

### Data Flow Verification
```bash
# MySQL data
docker exec mariadb mysql -u root -prootpassword -e "SELECT COUNT(*) FROM mydatabase.products;"

# MinIO storage
docker exec minio mc ls --recursive local/warehouse/cdc_db.db/products_sink/

# Rill dashboard count
curl -s "http://localhost:3000/v1/instances/default/query" \
-H "Content-Type: application/json" \
-d '{"sql":"SELECT COUNT(*) FROM paimon_products"}' | jq '.data[0]'
```
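
Because the dashboard refreshes on a 60-second cycle, the MySQL and Rill counts can disagree for a short while after a write. A polling helper makes the comparison less manual. This is a sketch under stated assumptions: `wait_for_value` and the `POLL_INTERVAL` variable are hypothetical names, and the command you pass in must print nothing but the value to compare.

```bash
#!/usr/bin/env bash

# Hypothetical helper: re-run a command until its output equals the expected
# value or the timeout (in seconds) expires. POLL_INTERVAL defaults to 5s.
# Usage: wait_for_value EXPECTED TIMEOUT_SECONDS COMMAND [ARGS...]
wait_for_value() {
  local expected="$1" timeout="$2" actual deadline
  shift 2
  deadline=$(( $(date +%s) + timeout ))
  while true; do
    # A failing command counts as "not there yet" rather than an error.
    actual="$("$@" 2>/dev/null)" || true
    [ "$actual" = "$expected" ] && return 0
    [ "$(date +%s)" -ge "$deadline" ] && return 1
    sleep "${POLL_INTERVAL:-5}"
  done
}
```

With two wrapper functions of your own around the count commands above (say `mysql_count` and `rill_count`, both hypothetical), `wait_for_value "$(mysql_count)" 120 rill_count` would wait up to two minutes for the dashboard to catch up. On the MySQL side, adding `-N -s` to the `mysql` call prints the bare number without a column header.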

## 🛠️ Development

### Project Structure
```
├── docker-compose.yml          # Complete stack definition
├── conf/
│   └── flink-conf.yaml         # Flink configuration
├── rill/
│   ├── connectors/             # DuckDB S3 configuration
│   ├── models/                 # SQL model definitions
│   ├── metrics/                # Metrics definitions
│   └── dashboards/             # Dashboard configs
├── rill-patcher.sh             # Automated catalog management
├── duckdb/
│   └── test_s3.py              # DuckDB query examples
├── sql/
│   ├── init.sql                # MySQL initial data
│   └── setup_paimon_cdc.sql    # CDC pipeline setup
└── setup_cdc.sh                # CDC initialization script
```

### Key Configuration Files

**Flink Config** (`conf/flink-conf.yaml`):
- Configures Flink job manager and task manager
- Sets checkpointing intervals
- Defines S3/MinIO credentials

**CDC Setup** (`sql/setup_paimon_cdc.sql`):
- Creates Paimon catalog
- Defines source MySQL table
- Creates sink Paimon table
- Starts CDC pipeline

## 🚨 Troubleshooting

### Common Issues

**CDC Pipeline not starting**
```bash
# Check if the job started:
curl -s http://localhost:8081/jobs | jq

# If not, run setup again:
./setup_cdc.sh
```

**No data in MinIO**
```bash
# Check Flink job status
curl -s http://localhost:8081/jobs

# Restart CDC setup
./setup_cdc.sh
```

**Verify data flow**
```bash
# Check Flink job metrics (replace <job-id> with an ID returned by /jobs)
curl -s "http://localhost:8081/jobs/<job-id>/metrics"

# List Paimon files
docker exec minio mc ls local/warehouse/cdc_db.db/
```

### Clean Restart
```bash
# Complete reset
docker compose down -v
docker compose up -d
./setup_cdc.sh
# Wait 2-3 minutes for full initialization
```

**Built with**: Apache Flink • Apache Paimon • Rill • DuckDB