{"id":22214029,"url":"https://github.com/xmlking/cdc-kafka-hadoop","last_synced_at":"2025-07-27T12:32:00.742Z","repository":{"id":142313630,"uuid":"54287621","full_name":"xmlking/cdc-kafka-hadoop","owner":"xmlking","description":"MySQL to NoSQL real time dataflow ","archived":false,"fork":false,"pushed_at":"2017-10-14T14:58:24.000Z","size":1645,"stargazers_count":18,"open_issues_count":0,"forks_count":19,"subscribers_count":7,"default_branch":"master","last_synced_at":"2024-05-01T20:22:00.708Z","etag":null,"topics":["architecture","cdc","change-data-capture","data-flow","debezium","groovy","hadoop","kafka","maxwell","mysql","nifi"],"latest_commit_sha":null,"homepage":null,"language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/xmlking.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2016-03-19T21:19:22.000Z","updated_at":"2023-06-29T06:58:27.000Z","dependencies_parsed_at":null,"dependency_job_id":"f7d042c2-1be5-4e34-a22a-c1472ceda3f6","html_url":"https://github.com/xmlking/cdc-kafka-hadoop","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xmlking%2Fcdc-kafka-hadoop","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xmlking%2Fcdc-kafka-hadoop/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xmlking%2Fcdc-kafka-hadoop/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xmlking%2Fcdc-kafka-hadoop/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hos
ts/GitHub/owners/xmlking","download_url":"https://codeload.github.com/xmlking/cdc-kafka-hadoop/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":227802821,"owners_count":17822113,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["architecture","cdc","change-data-capture","data-flow","debezium","groovy","hadoop","kafka","maxwell","mysql","nifi"],"created_at":"2024-12-02T21:13:00.881Z","updated_at":"2024-12-02T21:13:01.938Z","avatar_url":"https://github.com/xmlking.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"CDC Hadoop Dataflow\n===================\nA low-latency, multi-tenant *Change Data Capture (CDC)* pipeline to continuously replicate data from **OLTP (MySQL)** to **OLAP (NoSQL)** systems with no impact on the source.\n\n\n\u003e This project demonstrates how to build a dataflow pipeline to move data from operational databases (MySQL, Oracle) to analytics databases (Hadoop, MongoDB, MarkLogic) in real time using **Change Data Capture (CDC)**, **Kafka**, and tools like **Apache NiFi**, **Kafka Streams**, or **Spark** to process and ingest data into Hadoop.\n\n![cdc architecture](./presentation/images/cdc-architecture.png)\n\n### Features\n\n1. Capture changes from many data sources and types.\n2. Feed data to many client types (real-time, slow/catch-up, full bootstrap).\n3. Multi-tenant: can contain data from many different databases and support multiple consumers.\n4. Non-intrusive architecture for change capture.\n5. Both batch and near-real-time delivery.\n6. 
Isolate fast consumers from slow consumers.\n7. Isolate sources from consumers\n    1. Schema changes\n    2. Physical layout changes\n    3. Speed mismatch\n8. Change filtering\n    1. Filtering of database changes at the database level, schema level, table level, and row/column level.\n9. Buffer change records in **Kafka** for flexible consumption from an arbitrary point in time in the change stream, including full bootstrap capability for the entire data set.\n10. Guaranteed in-commit-order and at-least-once delivery with high availability (`at least once` vs. `exactly once`)\n11. Resilience and recoverability\n12. Schema-awareness\n\n### Setup\n\n#### Install and Run MySQL\nInstall the source MySQL database and configure it with row-based replication as per the [instructions](./infrastructure/mysql/).\n\n#### Install and Run Kafka\nFollow the [instructions](./infrastructure/kafka/).\n\n#### Install and Run Maxwell\n\n```bash\ncd cdc/maxwell\n# curl -L -0 https://github.com/zendesk/maxwell/releases/download/v1.0.0/maxwell-1.1.2.tar.gz | tar --strip-components=1 -zx -C .\ncurl -L -0 https://github.com/xmlking/maxwell/releases/download/1.1.2.1/maxwell-1.1.2.1-kafka-connect.tar.gz | tar --strip-components=1 -zx -C .\n```\n\n### Run\n\n   `cd cdc/maxwell`\n\n1. Run with stdout producer (for testing only)\n\n   `bin/maxwell --user='maxwell' --password='XXXXXX' --host='127.0.0.1' --producer=stdout`\n2. 
Run with Kafka producer\n\n   `bin/maxwell`\n\n### Test\n\n#### Manual Testing\nIf all goes well, you'll see Maxwell replaying your inserts:\n\n```sql\nmysql -u root -p\n\nmysql\u003e CREATE TABLE test.shop\n       (\n         id BIGINT(20) NOT NULL AUTO_INCREMENT,\n         version BIGINT(20) NOT NULL,\n         name VARCHAR(255) NOT NULL,\n         owner VARCHAR(255) NOT NULL,\n         phone_number VARCHAR(255) NOT NULL,\n         primary key (id, name)\n       );\nmysql\u003e INSERT INTO test.shop (version, name, owner, phone_number) values (0, 'aaa', 'bbb', '3331114444');\nQuery OK, 1 row affected (0.02 sec)\n\n(maxwell)\n{\"database\":\"test\",\"table\":\"shop\",\"pk.id\":4,\"pk.name\":\"aaa\"}\n{\"database\":\"test\",\"table\":\"shop\",\"type\":\"insert\",\"ts\":1458510224,\"xid\":33531,\"commit\":true,\"data\":{\"owner\":\"bbb\",\"name\":\"aaa\",\"phone_number\":\"3331114444\",\"id\":4,\"version\":0}}\n```\n\n#### Testing via Grails App\nYou can also use [testApp](./testApp/) to generate load.\n\n\n### Reference\n1. [Maxwell's Daemon](http://maxwells-daemon.io/quickstart/)\n2. [LinkedIn: Creating A Low Latency Change Data Capture System With Databus](http://highscalability.com/blog/2012/3/19/linkedin-creating-a-low-latency-change-data-capture-system-w.html)\n3. [Introducing Maxwell, a mysql-to-kafka binlog processor](https://developer.zendesk.com/blog/introducing-maxwell-a-mysql-to-kafka-binlog-processor)\n4. [Martin Kleppmann's blog: Using logs to build a solid data infrastructure](https://martin.kleppmann.com/2015/05/27/logs-for-data-infrastructure.html)\n5. [Bottled Water: Real-time integration of PostgreSQL and Kafka](http://www.confluent.io/blog/bottled-water-real-time-integration-of-postgresql-and-kafka/)\n6. [debezium-examples](https://github.com/debezium/debezium-examples)\n7. 
[Tutorial on using NiFi's built-in CDC - 3 parts](https://community.hortonworks.com/articles/113941/change-data-capture-cdc-with-apache-nifi-version-1-1.html)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fxmlking%2Fcdc-kafka-hadoop","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fxmlking%2Fcdc-kafka-hadoop","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fxmlking%2Fcdc-kafka-hadoop/lists"}