{"id":27443875,"url":"https://github.com/josephmachado/change_data_capture","last_synced_at":"2025-04-15T02:58:02.483Z","repository":{"id":115180514,"uuid":"606573473","full_name":"josephmachado/change_data_capture","owner":"josephmachado","description":"Repo for CDC with debezium blog post","archived":false,"fork":false,"pushed_at":"2024-09-15T20:43:06.000Z","size":24691,"stargazers_count":28,"open_issues_count":0,"forks_count":15,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-04-15T02:57:57.125Z","etag":null,"topics":["change-data-capture","debezium","kafka","minio","postgresql","python3","s3"],"latest_commit_sha":null,"homepage":"https://www.startdataengineering.com/post/change-data-capture-using-debezium-kafka-and-pg/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/josephmachado.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-02-25T22:23:21.000Z","updated_at":"2025-03-15T13:05:02.000Z","dependencies_parsed_at":"2024-09-15T21:44:52.608Z","dependency_job_id":"c142e2e7-c871-4da7-8a36-8215f1708459","html_url":"https://github.com/josephmachado/change_data_capture","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/josephmachado%2Fchange_data_capture","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/josephmachado%2Fchange_data_capture/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/josephmachado%2Fchange_data_capture
/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/josephmachado%2Fchange_data_capture/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/josephmachado","download_url":"https://codeload.github.com/josephmachado/change_data_capture/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248997095,"owners_count":21195797,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["change-data-capture","debezium","kafka","minio","postgresql","python3","s3"],"created_at":"2025-04-15T02:58:01.769Z","updated_at":"2025-04-15T02:58:02.458Z","avatar_url":"https://github.com/josephmachado.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n* [Change Data Capture](#change-data-capture)\n* [Project Design](#project-design)\n* [Run on codespaces](#run-on-codespaces)\n* [Prerequisites](#prerequisites)\n* [Setup](#setup)\n* [Analyze data with duckDB](#analyze-data-with-duckdb)\n* [References](#references)\n    * [duckDB](#duckdb)\n* [Project Design](#project-design-1)\n\n# Change Data Capture\n\nRepository for the [Change Data Capture with Debezium](https://www.startdataengineering.com/post/change-data-capture-using-debezium-kafka-and-pg/) blog at startdataengineering.com.\n\n# Project Design\n\n# Run on codespaces\n\nYou can run this CDC data pipeline using GitHub codespaces. Follow the instructions below.\n\n1. 
Create a codespace by going to the **[change_data_capture](https://github.com/josephmachado/change_data_capture)** repository, cloning (or forking) it, and then clicking the `Create codespaces on main` button.\n2. Wait for codespaces to start, then in the terminal type `make up \u0026\u0026 sleep 60 \u0026\u0026 make connectors \u0026\u0026 sleep 60`.\n3. Wait for the above to complete; it can take a couple of minutes.\n4. Go to the `ports` tab and click on the link exposing port `9001` to access the Minio (open source S3) UI.\n5. In the minio UI, use `minio` and `minio123` as the username and password, respectively. You will be able to see the paths `commerce/debezium.commerce.products` and `commerce/debezium.commerce.users`, which have json files in them. The json files contain data about the creates, updates, and deletes in the respective products and users tables.\n\n**NOTE**: The screenshots below show the general process to start codespaces; please follow the instructions shown above for this project.\n\n![codespace start](./assets/images/cs1.png)\n![codespace make up](./assets/images/cs2.png)\n![codespace access ui](./assets/images/cs3.png)\n\n**Note** Make sure to switch off your codespaces instance; you only have limited free usage. See the docs [here](https://github.com/features/codespaces#pricing).\n\n\n# Prerequisites\n\n1. [git version \u003e= 2.37.1](https://github.com/git-guides/install-git)\n2. [Docker version \u003e= 20.10.17](https://docs.docker.com/engine/install/) and [Docker compose v2 version \u003e= v2.10.2](https://docs.docker.com/compose/#compose-v2-and-the-new-docker-compose-command). Make sure that docker is running using `docker ps`\n3. [pgcli](https://www.pgcli.com/install)\n\n**Windows users**: please set up WSL and a local Ubuntu virtual machine following **[the instructions here](https://ubuntu.com/tutorials/install-ubuntu-on-wsl2-on-windows-10#1-overview)**. 
Install the above prerequisites on your Ubuntu terminal; if you have trouble installing docker, follow **[the steps here](https://www.digitalocean.com/community/tutorials/how-to-install-and-use-docker-on-ubuntu-22-04#step-1-installing-docker)** (only Step 1 is necessary). Please install the make command with `sudo apt install make -y` (if it's not already present). \n\n# Setup\n\nAll the commands shown below are to be run via the terminal (use the Ubuntu terminal for WSL users). We will use docker to set up our containers. Clone and move into the lab repository, as shown below.\n\n```bash\ngit clone https://github.com/josephmachado/change_data_capture.git\ncd change_data_capture\n```\n\nWe have some helpful make commands to make working with our systems more accessible. Shown below are the make commands and their definitions:\n\n1. **make up**: Spin up the docker containers for Postgres, data generator, Kafka Connect, Kafka, \u0026 minio (open source S3 alternative). Note this also sets up [Postgres tables](./postgres/init.sql) and starts a [python script](./datagen/gen_user_payment_data.py) to create-delete-update rows in those tables.\n2. **make connectors**: Set up the debezium connector to start recording changes from Postgres and another connector to push this data into minio.\n3. **make down**: Stop the docker containers.\n\nYou can see the commands in [this Makefile](./Makefile). If your terminal does not support **make** commands, please use the commands in [the Makefile](./Makefile) directly. 
All the commands in this README assume that you have the docker containers running.\n\nIn your terminal, do the following:\n\n```bash\n# Make sure docker is running using docker ps\nmake up # starts the docker containers\nsleep 60 # wait 1 minute for all the containers to set up\nmake connectors # Sets up the connectors\nsleep 60 # wait 1 minute for some data to be pushed into minio\nmake minio-ui # opens localhost:9001\n```\n\nIn the minio UI, use `minio` and `minio123` as the username and password, respectively. You will be able to see the paths `commerce/debezium.commerce.products` and `commerce/debezium.commerce.users`, which have json files in them. The json files contain data about the creates, updates, and deletes in the respective products and users tables.\n\n# Analyze data with duckDB\n\n## Access the data in minio via filesystem\nWe [mount a local folder to the minio container](./docker-compose.yml), which allows us to access the data in minio via the filesystem. We can start a Python REPL to run DuckDB as shown below:\n\n```bash\npython\n```\n\nNow let's create an SCD2 (slowly changing dimension type 2) table for the `products` table from the data we have in minio. Note that we are only looking at rows that have updates and deletes in them (see the `where id in` filter in the below query). 
\n\n```python\nimport duckdb as d\nd.sql(\"\"\"\n    WITH products_create_update_delete AS (\n        SELECT\n            COALESCE(CAST(json-\u003e'value'-\u003e'after'-\u003e'id' AS INT), CAST(json-\u003e'value'-\u003e'before'-\u003e'id' AS INT)) AS id,\n            json-\u003e'value'-\u003e'before' AS before_row_value,\n            json-\u003e'value'-\u003e'after' AS after_row_value,\n            CASE\n                WHEN CAST(json-\u003e'value'-\u003e'$.op' AS CHAR(1)) = '\"c\"' THEN 'CREATE'\n                WHEN CAST(json-\u003e'value'-\u003e'$.op' AS CHAR(1)) = '\"d\"' THEN 'DELETE'\n                WHEN CAST(json-\u003e'value'-\u003e'$.op' AS CHAR(1)) = '\"u\"' THEN 'UPDATE'\n                WHEN CAST(json-\u003e'value'-\u003e'$.op' AS CHAR(1)) = '\"r\"' THEN 'SNAPSHOT'\n                ELSE 'INVALID'\n            END AS operation_type,\n            CAST(json-\u003e'value'-\u003e'source'-\u003e'lsn' AS BIGINT) AS log_seq_num,\n            epoch_ms(CAST(json-\u003e'value'-\u003e'source'-\u003e'ts_ms' AS BIGINT)) AS source_timestamp\n        FROM\n            read_ndjson_objects('minio/data/commerce/debezium.commerce.products/*/*/*.json')\n        WHERE\n            log_seq_num IS NOT NULL\n    )\n    SELECT\n        id,\n        CAST(after_row_value-\u003e'name' AS VARCHAR(255)) AS name,\n        CAST(after_row_value-\u003e'description' AS TEXT) AS description,\n        CAST(after_row_value-\u003e'price' AS NUMERIC(10, 2)) AS price,\n        source_timestamp AS row_valid_start_timestamp,\n        CASE \n            WHEN LEAD(source_timestamp, 1) OVER lead_txn_timestamp IS NULL THEN CAST('9999-01-01' AS TIMESTAMP) \n            ELSE LEAD(source_timestamp, 1) OVER lead_txn_timestamp \n        END AS row_valid_expiration_timestamp\n    FROM products_create_update_delete\n    WHERE id in (SELECT id FROM products_create_update_delete GROUP BY id HAVING COUNT(*) \u003e 1)\n    WINDOW lead_txn_timestamp AS (PARTITION BY id ORDER BY log_seq_num )\n    ORDER BY id, 
row_valid_start_timestamp\n    LIMIT\n        200;\n    \"\"\").execute()\n```\n\n## Access data via s3 api\nWe can also access the data via the S3 API in duckdb as shown in this [example SQL query](./example/duckdb_minio_product_scd2.sql).\n\n# References\n\n1. [Debezium postgre docs](https://debezium.io/documentation/reference/2.1/connectors/postgresql.html)\n2. [Redpanda CDC example](https://redpanda.com/blog/redpanda-debezium)\n3. [duckDB docs](https://duckdb.org/docs/archive/0.2.9/)\n4. [Kafka docs](https://kafka.apache.org/20/documentation.html)\n5. [Minio DuckDB example](https://blog.min.io/duckdb-and-minio-for-a-modern-data-stack/)\n\n\u003c!-- Send message to kafka\nCASE WHEN LEAD(source_timestamp, 1) OVER(PARTITION BY id ORDER BY log_seq_num ) IS NULL THEN CAST('9999-01-01' AS TIMESTAMP) ELSE \n./kafka_2.13-3.4.0/bin/kafka-console-producer.sh --bootstrap-server 127.0.0.1:9093 --topic test\n\n./kafka_2.13-3.4.0/bin/kafka-console-consumer.sh --bootstrap-server 127.0.0.1:9093 --topic test --from-beginning\n\nList topics\n./kafka_2.13-3.4.0/bin/kafka-topics.sh --bootstrap-server 127.0.0.1:9093 --list\n\n./kafka_2.13-3.4.0/bin/kafka-console-consumer.sh --bootstrap-server 127.0.0.1:9093 --topic debezium.commerce.products --from-beginning --max-messages 1\n\n./kafka_2.13-3.4.0/bin/kafka-console-consumer.sh --bootstrap-server 127.0.0.1:9093 --topic debezium.commerce.users --from-beginning --max-messages 1\n\nconnect to postgres\n\npgcli -h localhost -p 5432 -U postgres -d postgres\n\nSET search_path TO commerce;\nINSERT INTO users(username, password) SELECT 'Joseph', 'Password1234';\n\nINSERT INTO products (name, description, price) SELECT 'Product', 'Some desc', 100;\n\nCheck for connectors\n\ncurl -H \"Accept:application/json\" localhost:8083/connectors/\ncurl -H \"Accept:application/json\" \"localhost:8083/connectors?expand=status\"\t| jq .\n\nSetup connectors\n\ncurl -i -X POST -H \"Accept:application/json\" -H \"Content-Type:application/json\" 
localhost:8083/connectors/ -d '@./connectors/pg-src-connector.json'\n\ncurl -i -X POST -H \"Accept:application/json\" -H \"Content-Type:application/json\" localhost:8083/connectors/ -d '@./connectors/s3-sink.json'\n\n1. postgres connector\n\ncurl -i -X POST -H \"Accept:application/json\" -H \"Content-Type:application/json\" localhost:8083/connectors/ -d '@./connectors/pg-src-connector.json'\n\ncheck wal level\n`select * from pg_settings where name ='wal_level';\n\ndocker compose down -v\n\n2. S3 sink connector\n\ncurl -i -X POST -H \"Accept:application/json\" -H \"Content-Type:application/json\" localhost:8083/connectors/ -d '@./connectors/s3-sink-connector.json'\n\n\ncurl -i -X POST -H  \"Content-Type:application/json\" localhost:8083/connectors/s3-sink-connector/config -d '@./connectors/s3-sink-connector.json'\n\ncurl -i -X PUT -H  \"Content-Type:application/json\" localhost:8083/connectors/s3-sink-connector/config -d '@./connectors/s3-sink-connector.json'\n\ncurl -i -X PUT -H  \"Content-Type:application/json\" localhost:8083/connectors/ -d '@./connectors/s3-sink-connector-2.json'\n\ncurl -i -X POST -H \"Accept:application/json\" -H \"Content-Type:application/json\" localhost:8083/connectors/ -d '@./connectors/s3-sink.json'\n\ncurl -i -X POST -H \"Accept:application/json\" -H \"Content-Type:application/json\" localhost:8083/connectors/ -d '@./connectors/s3-sink-2.json'\n\n## duckDB\n\nwget https://github.com/duckdb/duckdb/releases/download/v0.7.0/duckdb_cli-osx-universal.zip\nunzip duckdb_cli-osx-universal.zip\n./duckdb\n\n```sql\nSELECT * FROM 'sample.json';\nSELECT * FROM 'sample_2.json';\nSELECT value as dbz_payload FROM 'minio/data/commerce/debezium.commerce.products/2023-03-01/11/0000000000-00000000000000000000.json';\n\nWITH commerce_cud AS (SELECT value as dbz_payload FROM 'minio/data/commerce/debezium.commerce.products/*/*/*.json')\nSELECT *\nFROM commerce_cud\nLIMIT 2\n;\n\ncolumns={value: 'STRUCT'}, goose: 'INTEGER[]', swan: 
'DOUBLE'}\n\n{\"value\":{\"before\":null,\"after\":{\"id\":66,\"name\":\"Veronica Roberts\",\"description\":\"Treat one role individual activity gun. Let toward fine music argue common ago. Director environmental over always. National find prevent religious finally.\",\"price\":\"DOQ=\"},\"source\":{\"version\":\"2.2.0.Alpha2\",\"connector\":\"postgresql\",\"name\":\"debezium\",\"ts_ms\":1677669284960,\"snapshot\":\"false\",\"db\":\"postgres\",\"sequence\":\"[\\\"23137176\\\",\\\"23137328\\\"]\",\"schema\":\"commerce\",\"table\":\"products\",\"txId\":811,\"lsn\":23137328,\"xmin\":null},\"op\":\"c\",\"ts_ms\":1677669285154,\"transaction\":null}}\n\nSELECT * FROM read_ndjson_objects('minio/data/commerce/debezium.commerce.products/*/*/*.json');\n\nSELECT json_type(*) FROM read_ndjson_objects('minio/data/commerce/debezium.commerce.products/*/*/*.json');\n\nWITH commerce_cud AS (SELECT \nCOALESCE(CAST(json-\u003e'value'-\u003e'after'-\u003e'id' AS INT), CAST(json-\u003e'value'-\u003e'before'-\u003e'id' AS INT)) AS id\n, json-\u003e'value'-\u003e'before' as before_row_value\n, json-\u003e'value'-\u003e'after' as after_row_value\n, CASE \nWHEN CAST(json-\u003e'value'-\u003e'$.op' AS CHAR(1)) = '\"c\"' THEN 'CREATE'\nWHEN CAST(json-\u003e'value'-\u003e'$.op' AS CHAR(1)) = '\"d\"' THEN 'DELETE'\nWHEN CAST(json-\u003e'value'-\u003e'$.op' AS CHAR(1)) = '\"u\"' THEN 'UPDATE'\nWHEN CAST(json-\u003e'value'-\u003e'$.op' AS CHAR(1)) = '\"r\"' THEN 'SNAPSHOT'\nELSE 'INVALID' END as operation_type\n, CAST(json-\u003e'value'-\u003e'source'-\u003e'lsn' AS BIGINT) as log_seq_num\n, epoch_ms(CAST(json-\u003e'value'-\u003e'source'-\u003e'ts_ms' AS BIGINT)) as source_timestamp\nFROM read_ndjson_objects('minio/data/commerce/debezium.commerce.products/*/*/*.json')\nwhere log_seq_num is not null)\nSELECT \nid\n, log_seq_num\n, operation_type\n, source_timestamp as row_valid_start_timestamp\n, LEAD(source_timestamp, 1) OVER(PARTITION BY id ORDER BY log_seq_num) as 
row_valid_expiration_timestamp\n, ROW_NUMBER() OVER(PARTITION BY id ORDER BY log_seq_num) AS op_order\nFROM commerce_cud\norder by log_seq_num\n LIMIT 200;\n\n```\n\n# Project Design\n--\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjosephmachado%2Fchange_data_capture","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjosephmachado%2Fchange_data_capture","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjosephmachado%2Fchange_data_capture/lists"}