{"id":31652202,"url":"https://github.com/gordonmurray/flink_paimon_duckdb_rill","last_synced_at":"2026-05-19T14:07:12.525Z","repository":{"id":317985955,"uuid":"1065500885","full_name":"gordonmurray/flink_paimon_duckdb_rill","owner":"gordonmurray","description":"A streaming analytics stack that captures MySQL changes via CDC, stores them in Apache Paimon format, and visualizes them with Rill dashboards","archived":false,"fork":false,"pushed_at":"2025-09-27T21:14:55.000Z","size":14,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-10-04T10:35:09.626Z","etag":null,"topics":["duckdb","flink","flink-cdc","paimon","rill-dashboard"],"latest_commit_sha":null,"homepage":"https://gordonmurray.com/data/2025/09/27/when-your-real-time-dashboard-refuses-to-be-real-time.html","language":"Shell","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/gordonmurray.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-09-27T21:10:54.000Z","updated_at":"2025-09-28T15:33:04.000Z","dependencies_parsed_at":"2025-10-04T10:35:19.372Z","dependency_job_id":"7bd2bd46-544c-4517-94aa-160b7a7d4cd5","html_url":"https://github.com/gordonmurray/flink_paimon_duckdb_rill","commit_stats":null,"previous_names":["gordonmurray/flink_paimon_duckdb_rill"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/gordonmurray/flink_paimon_duckdb_rill","repository_url":"https://repos.
ecosyste.ms/api/v1/hosts/GitHub/repositories/gordonmurray%2Fflink_paimon_duckdb_rill","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gordonmurray%2Fflink_paimon_duckdb_rill/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gordonmurray%2Fflink_paimon_duckdb_rill/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gordonmurray%2Fflink_paimon_duckdb_rill/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/gordonmurray","download_url":"https://codeload.github.com/gordonmurray/flink_paimon_duckdb_rill/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gordonmurray%2Fflink_paimon_duckdb_rill/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33219434,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-19T07:54:09.561Z","status":"ssl_error","status_checked_at":"2026-05-19T07:54:08.508Z","response_time":58,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["duckdb","flink","flink-cdc","paimon","rill-dashboard"],"created_at":"2025-10-07T09:59:57.454Z","updated_at":"2026-05-19T14:07:12.508Z","avatar_url":"https://github.com/gordonmurray.png","language":"Shell","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Real-Time Analytics Pipeline with Flink, Paimon, and Rill\n\nA complete 
streaming analytics stack that captures MySQL changes via CDC, stores them in Apache Paimon format, and visualizes them with Rill dashboards.\n\n## 🚀 What You Get\n\n- **Real-Time CDC**: Captures every MySQL change using Flink CDC\n- **Lake Storage**: Stores data in Apache Paimon format on S3-compatible storage\n- **Live Dashboard**: Rill analytics with automated catalog management\n- **Automated Fixes**: Sidecar container handles DuckDB catalog prefix issues\n- **One Command Start**: Everything runs with `docker compose up`\n\n## 🏗️ Architecture\n\n```\nMySQL → Flink CDC → Apache Paimon → MinIO → Rill Dashboard\n  ↑                                           ↓\n  Manual inserts                          Analytics\n```\n\n**Components:**\n- **MySQL/MariaDB**: Source database with sample product data\n- **Apache Flink**: Real-time CDC processing engine\n- **Apache Paimon**: Lake storage format optimized for streaming\n- **MinIO**: S3-compatible object storage\n- **Rill**: Modern analytics dashboard with DuckDB engine\n- **Rill Patcher**: Automated sidecar handling catalog prefix issues\n\n## ⚡ Quick Start\n\n### Prerequisites\n- Docker and Docker Compose\n- 8GB+ RAM recommended\n- Ports 3000, 3306, 8081, 9000-9001 available\n\n### 1. Clone and Start\n```bash\ngit clone \u003cyour-repo\u003e\ncd flink_paimon_duckdb_rill\ndocker compose up -d\n```\n\n### 2. Initialize the CDC Pipeline\n```bash\n./setup_cdc.sh\n```\n\n### 3. 
Open the Dashboard\nNavigate to: **http://localhost:3000**\n\nThe dashboard will show live data with automatic 60-second refresh.\n\n## 🧪 Test Real-Time Updates\n\nAdd new products to see live updates:\n\n```bash\n# Add some products\ndocker exec mariadb mysql -u root -prootpassword -e \"\nINSERT INTO mydatabase.products (name, price) VALUES\n('New Product 1', 99.99),\n('New Product 2', 199.99);\"\n\n# Check MySQL count\ndocker exec mariadb mysql -u root -prootpassword -e \"SELECT COUNT(*) FROM mydatabase.products;\"\n\n# Wait 60 seconds for dashboard to refresh\n# You'll see the updated count automatically!\n```\n\n## 🔧 How It Works\n\n### CDC Pipeline\n1. **MySQL Changes**: Any INSERT/UPDATE/DELETE in MySQL is captured\n2. **Flink Processing**: Flink CDC reads the MySQL binlog in real-time\n3. **Paimon Storage**: Changes are written to Paimon tables in MinIO\n4. **Rill Dashboard**: Visualizes data with 60-second refresh cycle\n\n### The Catalog Prefix Solution\nDuckDB creates random catalog prefixes (e.g., `main8514e79c`) on startup. Our `rill-patcher` sidecar:\n1. Waits for Rill to start\n2. Discovers the current catalog alias via SQL\n3. Patches the model file with the correct prefix\n4. Refreshes data every 60 seconds\n5. 
Re-patches if Rill restarts with a new prefix\n\n### Why Apache Paimon?\n- Optimized for streaming updates with ACID guarantees\n- Supports both batch and streaming workloads\n- Compatible with multiple query engines\n- Efficient storage with automatic compaction\n\n## 📊 Monitoring\n\n### Service Health Checks\n```bash\n# Check all containers\ndocker ps\n\n# Monitor CDC job\ncurl -s http://localhost:8081/jobs | jq\n\n# Test Rill Dashboard API\ncurl -s \"http://localhost:3000/v1/instances/default/query\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"sql\":\"SELECT COUNT(*) FROM paimon_products\"}'\n\n# View Paimon files in MinIO\ndocker exec minio mc ls --recursive local/warehouse/\n```\n\n### Data Flow Verification\n```bash\n# MySQL data\ndocker exec mariadb mysql -u root -prootpassword -e \"SELECT COUNT(*) FROM mydatabase.products;\"\n\n# MinIO storage\ndocker exec minio mc ls --recursive local/warehouse/cdc_db.db/products_sink/\n\n# Rill dashboard count\ncurl -s \"http://localhost:3000/v1/instances/default/query\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"sql\":\"SELECT COUNT(*) FROM paimon_products\"}' | jq '.data[0]'\n```\n\n## 🛠️ Development\n\n### Project Structure\n```\n├── docker-compose.yml          # Complete stack definition\n├── conf/\n│   └── flink-conf.yaml        # Flink configuration\n├── rill/\n│   ├── connectors/           # DuckDB S3 configuration\n│   ├── models/               # SQL model definitions\n│   ├── metrics/              # Metrics definitions\n│   └── dashboards/           # Dashboard configs\n├── rill-patcher.sh           # Automated catalog management\n├── duckdb/\n│   └── test_s3.py            # DuckDB query examples\n├── sql/\n│   ├── init.sql              # MySQL initial data\n│   └── setup_paimon_cdc.sql  # CDC pipeline setup\n└── setup_cdc.sh              # CDC initialization script\n```\n\n### Key Configuration Files\n\n**Flink Config** (`conf/flink-conf.yaml`):\n- Configures Flink job manager and task 
manager\n- Sets checkpointing intervals\n- Defines S3/MinIO credentials\n\n**CDC Setup** (`sql/setup_paimon_cdc.sql`):\n- Creates Paimon catalog\n- Defines source MySQL table\n- Creates sink Paimon table\n- Starts CDC pipeline\n\n## 🚨 Troubleshooting\n\n### Common Issues\n\n**CDC Pipeline not starting**\n```bash\n# Check if the job started:\ncurl -s http://localhost:8081/jobs | jq\n\n# If not, run setup again:\n./setup_cdc.sh\n```\n\n**No data in MinIO**\n```bash\n# Check Flink job status\ncurl -s http://localhost:8081/jobs\n\n# Restart CDC setup\n./setup_cdc.sh\n```\n\n**Verify data flow**\n```bash\n# Check Flink job metrics\ncurl -s http://localhost:8081/jobs/\u003cjob-id\u003e/metrics\n\n# List Paimon files\ndocker exec minio mc ls local/warehouse/cdc_db.db/\n```\n\n### Clean Restart\n```bash\n# Complete reset\ndocker compose down -v\ndocker compose up -d\n./setup_cdc.sh\n# Wait 2-3 minutes for full initialization\n```\n\n**Built with**: Apache Flink • Apache Paimon • Rill • DuckDB •","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgordonmurray%2Fflink_paimon_duckdb_rill","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgordonmurray%2Fflink_paimon_duckdb_rill","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgordonmurray%2Fflink_paimon_duckdb_rill/lists"}