{"id":31610043,"url":"https://github.com/arm092/migres","last_synced_at":"2025-10-06T09:40:56.128Z","repository":{"id":313927630,"uuid":"1053460240","full_name":"arm092/migres","owner":"arm092","description":"MySQL to Clickhouse migration tool","archived":false,"fork":false,"pushed_at":"2025-10-03T10:24:57.000Z","size":84,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2025-10-03T10:26:27.944Z","etag":null,"topics":["airbyte","binlog-replication","cdc","change-data-capture","clickhouse","data-migration","database","mysql","mysql-to-clickhouse","pyhton","real-time-replication","schema-sync","transferia"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/arm092.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-09-09T13:33:19.000Z","updated_at":"2025-10-03T10:25:00.000Z","dependencies_parsed_at":"2025-10-03T10:15:09.233Z","dependency_job_id":null,"html_url":"https://github.com/arm092/migres","commit_stats":null,"previous_names":["arm092/migres"],"tags_count":5,"template":false,"template_full_name":null,"purl":"pkg:github/arm092/migres","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arm092%2Fmigres","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arm092%2Fmigres/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arm092%2Fmigres
/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arm092%2Fmigres/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/arm092","download_url":"https://codeload.github.com/arm092/migres/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arm092%2Fmigres/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":278588750,"owners_count":26011699,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-06T02:00:05.630Z","response_time":65,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["airbyte","binlog-replication","cdc","change-data-capture","clickhouse","data-migration","database","mysql","mysql-to-clickhouse","pyhton","real-time-replication","schema-sync","transferia"],"created_at":"2025-10-06T09:40:54.544Z","updated_at":"2025-10-06T09:40:56.119Z","avatar_url":"https://github.com/arm092.png","language":"Python","readme":"# Migres - MySQL to ClickHouse Migration Tool\n\nThis project is a **complete migration tool** that transfers tables from MySQL into ClickHouse with type mapping, logging, and resumable state.  
\nIt supports both **snapshot mode** (initial data migration) and **CDC mode** (real-time change data capture), with automatic schema synchronization.\n\n---\n\n## Features\n\n### Core Migration\n- 🚀 **MySQL → ClickHouse migration** (snapshot + CDC modes)\n- 🗂 **Intelligent type mapping** (INT, DECIMAL, DATE, DATETIME, VARCHAR, etc.)\n- 📝 **Transferia metadata columns** added automatically:\n  - `__data_transfer_commit_time UInt64` → nanosecond commit timestamp\n  - `__data_transfer_delete_time UInt64 DEFAULT 0`\n  - `__data_transfer_is_deleted UInt8 MATERIALIZED if(__data_transfer_delete_time != 0, 1, 0)`\n\n### Snapshot Mode\n- 🔁 **Resumable migration** (state stored in `state.json`)\n- ⚡ **Parallel table processing** for large datasets\n- 🎯 **Include/exclude table filtering**\n\n### CDC Mode (Change Data Capture)\n- 🔄 **Real-time replication** from MySQL binlog\n- ⚡ **Queue-based event batching** with configurable delay\n- 🎯 **Smart event grouping** (combines multiple events into single operations)\n- 🏗️ **Automatic schema synchronization**:\n  - ✅ CREATE TABLE (new table creation)\n  - ✅ DROP TABLE (table deletion)\n  - ✅ ADD COLUMN (with defaults)\n  - ✅ DROP COLUMN\n  - ✅ RENAME COLUMN (CHANGE COLUMN)\n  - ✅ MODIFY COLUMN (type changes, defaults)\n- 📊 **ReplacingMergeTree** for upsert semantics\n- 🎯 **Table filtering** (include/exclude)\n- 💾 **Checkpoint persistence** (resume from last position)\n- 🌍 **Timezone-aware datetime handling** (DateTime64 with timezone)\n- 🛡️ **Error handling** with failed operation dumps\n- 📱 **MS Teams notifications** for errors, warnings, and important events\n\n### Operations\n- 📑 **Detailed logging** (visible via `docker compose logs -f`)\n- 🐳 **Docker support** with hot-reload for development\n- 📢 **Real-time notifications** to MS Teams channels\n\n---\n\n## How It Works\n\n### Snapshot Mode\n1. 
**Initial setup**\n   - Connects to MySQL \u0026 ClickHouse\n   - Records binlog position for CDC start point\n   - Loads migration state from `state.json`\n\n2. **Table filtering \u0026 processing**\n   - Filters tables by `include_tables`/`exclude_tables`\n   - Processes tables in parallel workers\n   - Each worker:\n     - Inspects MySQL schema\n     - Creates ClickHouse table with mapped types\n     - Migrates data in batches\n     - Marks table as complete\n\n3. **Resumable migration**\n   - If interrupted, resumes from last completed table\n   - State persisted in `state.json`\n\n### CDC Mode\n1. **Initial snapshot** (optional)\n   - Runs snapshot mode first if `snapshot_before: true`\n   - Ensures complete baseline before streaming\n\n2. **Queue-based event processing**\n   - Events are accumulated in a queue as they arrive from binlog\n   - Timer-based processing every `batch_delay_seconds` (configurable)\n   - Continuous operation: keeps receiving events while processing queue\n\n3. **Event batching and grouping**\n   - **INSERT events**: Multiple INSERTs for same table → Single INSERT with multiple rows\n   - **UPDATE events**: Multiple UPDATEs for same table → Single INSERT with multiple rows\n   - **DELETE events**: Multiple DELETEs for same table → Single INSERT with multiple rows\n   - **DDL events**: Processed immediately (not queued)\n\n4. **Real-time streaming**\n   - Connects to MySQL binlog stream (non-blocking)\n   - Processes INSERT/UPDATE/DELETE events\n   - Auto-detects schema changes (ADD/DROP/RENAME/MODIFY)\n   - Applies changes to ClickHouse in batches\n\n5. **Schema synchronization**\n   - **CREATE TABLE**: Creates new table in ClickHouse\n   - **DROP TABLE**: Removes table from ClickHouse\n   - **ADD COLUMN**: Creates new column with defaults\n   - **DROP COLUMN**: Removes column from ClickHouse\n   - **RENAME COLUMN**: Renames column in ClickHouse\n   - **MODIFY COLUMN**: Changes type and defaults\n\n6. 
**Error handling**\n   - Failed operations are dumped to JSON files for manual review\n   - Includes timestamp, error details, and operation information\n   - Allows for manual recovery of failed operations\n\n7. **Checkpoint persistence**\n   - Saves binlog position periodically\n   - Resumes from last position on restart\n\n---\n\n## Requirements\n\n- **MySQL** server (with data to migrate)\n- **ClickHouse** server (can be remote)\n- **Docker + Docker Compose**\n\n---\n\n## Setup\n\n### 1. MySQL Configuration (Required for CDC)\n\nFor CDC mode to work properly, configure MySQL with:\n\n```sql\n-- Set binlog format to ROW (required for CDC)\nSET GLOBAL binlog_format = 'ROW';\nSET GLOBAL binlog_row_image = 'FULL';\nSET GLOBAL binlog_row_metadata = 'FULL';\n\n-- Make changes persistent (MySQL 8.0+)\nSET PERSIST binlog_format = 'ROW';\nSET PERSIST binlog_row_image = 'FULL';\nSET PERSIST binlog_row_metadata = 'FULL';\n```\n\nOr add to `my.cnf`:\n```ini\n[mysqld]\nbinlog_format=ROW\nbinlog_row_image=FULL\nbinlog_row_metadata=FULL\n```\n\n### 2. 
Configure `config.yml`\n\n```yaml\nmysql:\n  host: \"localhost\"\n  port: 3306\n  user: \"your_user\"\n  password: \"your_password\"\n  database: \"your_database\"\n  include_tables: []  # Leave empty for all tables\n  exclude_tables: []  # Tables to skip\n\nclickhouse:\n  host: \"localhost\"\n  port: 9000\n  user: \"default\"\n  password: \"\"\n  database: \"your_ch_database\"\n\nmigration:\n  mode: \"snapshot\"  # or \"cdc\"\n  batch_rows: 5000\n  workers: 4\n  low_cardinality_strings: true\n  ddl_engine: \"ReplacingMergeTree\"\n  \n  # Timezone configuration for datetime/timestamp columns\n  mysql_timezone: \"Europe/Moscow\"      # Set to your MySQL server timezone\n  clickhouse_timezone: \"Europe/Moscow\" # Set to desired ClickHouse timezone\n  \n  # CDC-specific settings\n  cdc:\n    snapshot_before: true  # Run snapshot before CDC\n    heartbeat_seconds: 5\n    checkpoint_interval_rows: 1000\n    checkpoint_interval_seconds: 5\n    batch_delay_seconds: 5  # Delay in seconds before processing accumulated events (0 = immediate processing)\n    server_id: 4379  # Unique ID for binlog replication\n\nstate_file: \"data/state.json\"\ncheckpoint_file: \"data/binlog_checkpoint.json\"\n\n# MS Teams Notifications\nnotifications:\n  enabled: true\n  webhook_url: \"https://your-org.webhook.office.com/webhookb2/your-webhook-url\"\n  rate_limit_seconds: 60  # Minimum seconds between notifications (0 = no limit)\n```\n\n---\n\n## Running\n\n### Snapshot Mode (Initial Migration)\n```bash\n# Edit config.yml: mode: \"snapshot\"\ndocker compose up\n```\n\n### CDC Mode (Real-time Replication)\n```bash\n# Edit config.yml: mode: \"cdc\"\ndocker compose up\n```\n\n### Development Mode (Hot Reload)\n```bash\n# Code changes are automatically reflected\ndocker compose up\n```\n\n### View Logs\n```bash\ndocker compose logs -f\n```\n\n### Environment Variables Support\n\nAll configuration options can be overridden using environment variables. 
This is useful for containerized deployments:\n\n```bash\n# MySQL configuration\nexport MYSQL_HOST=mysql-server.example.com\nexport MYSQL_PASSWORD=your-password\n\n# ClickHouse configuration  \nexport CLICKHOUSE_HOST=clickhouse-server.example.com\nexport CLICKHOUSE_PASSWORD=your-password\n\n# Notifications\nexport NOTIFICATIONS_ENABLED=true\nexport NOTIFICATIONS_WEBHOOK_URL=https://your-webhook-url\n```\n\nSee [Environment Variables Documentation](docs/ENVIRONMENT_VARIABLES.md) for a complete list of supported variables.\n\n## Testing\n\nThe project includes a comprehensive test suite in the `test/` directory:\n\n- **`test/test_cdc_batching.py`** - Main CDC batching test (5000 operations)\n- **`test/test_forced_errors.py`** - Forced error test with type conversion errors\n- **`test/test_notifications.py`** - MS Teams notification system test\n- **`test/run_test.py`** - Test runner for different scenarios\n- **`test/monitor_cdc.py`** - Real-time CDC monitoring\n\n### Running Tests\n```bash\ncd test\npython run_test.py\n```\n\nSee `test/README.md` for detailed testing instructions.\n\n---\n\n## Examples\n\n### Example Logs\n\n**Snapshot Mode:**\n```\n[INFO] Starting migres (snapshot) mode...\n[INFO] MySQL connected: localhost:3306/mydb\n[INFO] ClickHouse client initialized for localhost:9000/mydb\n[INFO] Tables to snapshot (count=5): ['users', 'orders', 'products']\n[INFO] Worker: table users migrated successfully\n[INFO] Snapshot completed for all tables.\n```\n\n**CDC Mode:**\n```\n[INFO] Starting migres (CDC) mode...\n[INFO] CDC: running initial snapshot before starting binlog streaming...\n[INFO] CDC: initial snapshot completed, starting binlog streaming...\n[INFO] CDC: batch_delay_seconds=5.0, queue-based processing=True\n[INFO] CDC: event queued for mydb.users (UpdateRowsEvent) with 1 rows - queue size: 1\n[INFO] CDC: event queued for mydb.users (UpdateRowsEvent) with 1 rows - queue size: 2\n[INFO] CDC: processing queue (time since last process: 5.0s, queue size: 
2)\n[INFO] CDC: processing 2 events from queue\n[INFO] CDC: processing 1 groups\n[INFO] CDC: processing group mydb.users (UpdateRowsEvent) with 2 events containing 2 total rows\n[INFO] CDC: inserted 2 row(s) into users (UPDATE-\u003eupsert)\n[INFO] CDC: successfully processed 2 rows for mydb.users (UpdateRowsEvent)\n[INFO] CDC: successfully processed 2 rows from queue\n[INFO] CDC: added column email_verified to users (direct ALTER)\n[INFO] CDC: detected CREATE TABLE for new_table, creating table in ClickHouse\n[INFO] CDC: created table new_table in ClickHouse\n[INFO] CDC: detected DROP TABLE for old_table, dropping table in ClickHouse\n[INFO] CDC: dropped table old_table in ClickHouse\n```\n\n**MS Teams Notifications:**\n```\n🚀 CDC Process Started\nCDC (Change Data Capture) process has started successfully\n\nLevel: INFO\nTimestamp: 2025-01-24 10:30:00 UTC\n\nDetails:\n- MySQL: localhost:3306/mydb\n- ClickHouse: localhost:9000/mydb\n- Batch Delay: 5s\n- Mode: CDC\n```\n\n```\n🚨 CDC Error: Processing Error\nTable: mydb.users\nError: Failed to process 5 events: Connection timeout\n\nLevel: ERROR\nTimestamp: 2025-01-24 10:30:00 UTC\n\nDetails:\n- Error Type: Processing Error\n- Table: mydb.users\n- Event Count: 5\n- Event Type: WriteRowsEvent\n- Error: Connection timeout\n```\n\n### Schema Changes in Action\n\n**Adding a column:**\n```sql\n-- MySQL\nALTER TABLE users ADD COLUMN email_verified BOOLEAN DEFAULT FALSE;\n```\n```\n[INFO] CDC: added column email_verified to users (direct ALTER)\n[INFO] CDC: synchronized schema for table users due to DDL\n```\n\n**Modifying column type:**\n```sql\n-- MySQL  \nALTER TABLE users MODIFY COLUMN age VARCHAR(10) DEFAULT 'unknown';\n```\n```\n[INFO] CDC: MODIFY target type for users.age -\u003e LowCardinality(String)\n[INFO] CDC: modified column age on users (direct MODIFY)\n```\n\n**Creating a new table:**\n```sql\n-- MySQL\nCREATE TABLE new_table (id INT PRIMARY KEY, name VARCHAR(100));\n```\n```\n[INFO] CDC: detected CREATE 
TABLE for new_table, creating table in ClickHouse\n[INFO] CDC: created table new_table in ClickHouse\n```\n\n**Dropping a table:**\n```sql\n-- MySQL\nDROP TABLE old_table;\n```\n```\n[INFO] CDC: detected DROP TABLE for old_table, dropping table in ClickHouse\n[INFO] CDC: dropped table old_table in ClickHouse\n```\n\n---\n\n## Troubleshooting\n\n### Common Issues\n\n**1. CDC not detecting changes:**\n- Verify MySQL binlog settings: `SHOW VARIABLES LIKE 'binlog_%';`\n- Check user permissions: `GRANT REPLICATION SLAVE ON *.* TO 'user'@'%';`\n- Ensure `server_id` is unique in your network\n\n**2. Schema changes not applied:**\n- Check logs for \"CDC: synchronized schema\" messages\n- Verify table is in `include_tables` (not excluded)\n- For MODIFY COLUMN issues, check ClickHouse version compatibility\n\n**3. Duplicate rows in ClickHouse:**\n- Use `SELECT * FROM table FINAL` to see deduplicated results\n- ReplacingMergeTree automatically handles duplicates on merge\n\n**4. Migration stuck:**\n- Check `state.json` for incomplete tables\n- Delete state file to restart from beginning\n- Verify MySQL/ClickHouse connectivity\n\n**5. Timezone issues with datetime columns:**\n- Configure `mysql_timezone` and `clickhouse_timezone` in config.yml\n- Ensure both are set to the same timezone for consistency\n- Use `DateTime64(3, 'timezone')` for proper timezone handling\n\n**6. 
Duplicate inserts in ClickHouse:**\n- This was a bug that has been fixed in recent versions\n- Each MySQL event now results in exactly one ClickHouse insert\n- Use `SELECT * FROM table FINAL` to see deduplicated results\n\n### Debug Mode\n\nEnable detailed logging by setting log level in your config or environment:\n```bash\nexport PYTHONPATH=/app\nexport LOG_LEVEL=DEBUG\ndocker compose up\n```\n\n### Performance Tuning\n\n- **Batch size**: Increase `batch_rows` for faster snapshot (default: 5000)\n- **Workers**: Adjust `workers` based on CPU cores (default: 4)\n- **Checkpoint frequency**: Reduce `checkpoint_interval_seconds` for more frequent saves\n- **Low cardinality**: Disable `low_cardinality_strings` if memory is limited\n- **CDC batching**: Adjust `batch_delay_seconds` for optimal performance:\n  - `0` = immediate processing (no batching)\n  - `5-15` = good balance for most workloads\n  - `30+` = for high-volume, less time-sensitive scenarios\n\n### CDC Batching Configuration\n\nThe `batch_delay_seconds` setting controls how events are processed:\n\n**Immediate Processing (`batch_delay_seconds: 0`):**\n```yaml\ncdc:\n  batch_delay_seconds: 0  # Each event processed immediately\n```\n- ✅ Lowest latency\n- ❌ More ClickHouse operations\n- ❌ Higher load on ClickHouse\n\n**Batched Processing (`batch_delay_seconds: 5`):**\n```yaml\ncdc:\n  batch_delay_seconds: 5  # Events accumulated for 5 seconds\n```\n- ✅ Reduced ClickHouse load\n- ✅ Better performance for bulk operations\n- ✅ Smart grouping of similar events\n- ⚠️ 5-second delay for data availability\n\n**High-Volume Batching (`batch_delay_seconds: 30`):**\n```yaml\ncdc:\n  batch_delay_seconds: 30  # Events accumulated for 30 seconds\n```\n- ✅ Maximum ClickHouse efficiency\n- ✅ Best for bulk data processing\n- ❌ 30-second delay for data availability\n\n### Batching Examples\n\n**Example 1: Multiple INSERTs**\n```\nMySQL: 100 INSERT statements for table 'orders'\nResult: 1 ClickHouse INSERT with 100 
rows\n```\n\n**Example 2: Mixed Operations**\n```\nMySQL: 50 UPDATEs for 'users' + 30 INSERTs for 'orders'\nResult: 2 ClickHouse INSERTs (1 with 50 rows, 1 with 30 rows)\n```\n\n**Example 3: Error Handling**\n```\nFailed operation → Dumped to failed_operations_20250922_151207.json\nContains: timestamp, error details, operation data for manual review\n```\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Farm092%2Fmigres","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Farm092%2Fmigres","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Farm092%2Fmigres/lists"}