https://github.com/beaglefoot/streaming-etl
Exploratory project using Kafka for building streaming ETL
- Host: GitHub
- URL: https://github.com/beaglefoot/streaming-etl
- Owner: Beaglefoot
- Created: 2022-04-13T12:17:40.000Z (about 3 years ago)
- Default Branch: master
- Last Pushed: 2022-05-05T20:22:53.000Z (almost 3 years ago)
- Last Synced: 2025-02-03T23:54:49.393Z (3 months ago)
- Topics: asyncio, data-engineering, debezium, docker, etl, kafka, kafka-connect, streaming
- Language: Python
- Size: 152 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Streaming ETL
This is an exploratory data engineering project that shows how Kafka can be used for ETL with a streaming approach.
Some things are intentionally simplified, and the language of choice is Python instead of Java.
## Architecture
*(Architecture diagram: data generator → OLTP DB → Debezium → Kafka → Transformer → JDBC Sink Connector → DWH)*
## The goal and how it works
The main idea is to take a typical OLTP database and stream changes to it via Change Data Capture (CDC) down the pipeline to a dimensional data warehouse.
In the absence of a real transactional application, a data generator is set up to produce the changes.
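
The repo's actual generator code isn't reproduced here, but a minimal sketch of such a generator might look like the following. The `orders` table, its columns, the connection string, and the use of `asyncpg` are all assumptions for illustration:

```python
import asyncio
import random

import asyncpg  # async Postgres driver; assumed here, not confirmed by the repo


async def generate_orders(dsn: str, interval: float = 1.0) -> None:
    """Insert a random row every `interval` seconds to feed the CDC pipeline."""
    conn = await asyncpg.connect(dsn)
    try:
        while True:
            await conn.execute(
                "INSERT INTO orders (customer_id, amount) VALUES ($1, $2)",
                random.randint(1, 100),
                round(random.uniform(5.0, 500.0), 2),
            )
            await asyncio.sleep(interval)
    finally:
        await conn.close()


if __name__ == "__main__":
    # Hypothetical DSN for the OLTP database running in Docker
    asyncio.run(generate_orders("postgresql://etl:etl@localhost:5432/oltp"))
```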
The changes are captured with Debezium, which streams the data to a Kafka broker. The transformer subscribes to new messages on the input topics, transforms the data, and writes it to output topics, which are connected to the DWH via the JDBC Sink Connector.
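
Both connectors are typically registered by POSTing JSON to the Kafka Connect REST API. Here is a hedged sketch of what that registration could look like; the hostnames, credentials, table and topic names are placeholders, and the repo's real configs may differ (the config keys shown are standard ones for the Debezium 1.x Postgres connector and the Confluent JDBC sink):

```python
import requests  # used here for illustration; the repo may register connectors differently

CONNECT_URL = "http://localhost:8083/connectors"  # default Kafka Connect REST port

# Debezium Postgres source: captures row changes and publishes them to Kafka topics.
debezium_source = {
    "name": "oltp-source",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "postgres",
        "database.port": "5432",
        "database.user": "etl",
        "database.password": "etl",
        "database.dbname": "oltp",
        "database.server.name": "oltp",   # topics become e.g. oltp.public.orders
        "table.include.list": "public.orders",
    },
}

# Confluent JDBC sink: reads the transformer's output topics into the DWH.
jdbc_sink = {
    "name": "dwh-sink",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
        "connection.url": "jdbc:postgresql://dwh:5432/dwh",
        "connection.user": "etl",
        "connection.password": "etl",
        "topics": "dwh.fact_orders",
        "insert.mode": "insert",
        "auto.create": "true",
    },
}

for connector in (debezium_source, jdbc_sink):
    requests.post(CONNECT_URL, json=connector).raise_for_status()
```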
Internally, Kafka stores messages as bytes, so payloads are encoded and decoded against a JSON Schema. Services learn the actual schemas from the Schema Registry. The transformer also has models pregenerated from the schemas, which facilitates development and helps with typing and static analysis.
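
As a rough illustration of the transformer loop, here is a sketch using `aiokafka` (an assumption suggested by the asyncio topic, not confirmed by the repo). The topic names, the `Order` stand-in for a pregenerated model, and the transformation itself are invented; Schema Registry-aware serialization is elided in favor of plain JSON:

```python
import asyncio
import json
from dataclasses import dataclass

from aiokafka import AIOKafkaConsumer, AIOKafkaProducer  # assumed client library


@dataclass
class Order:
    """Stand-in for a model pregenerated from a JSON Schema."""
    customer_id: int
    amount: float


async def run_transformer() -> None:
    consumer = AIOKafkaConsumer(
        "oltp.public.orders",                 # hypothetical Debezium input topic
        bootstrap_servers="localhost:9092",
        group_id="transformer",
    )
    producer = AIOKafkaProducer(bootstrap_servers="localhost:9092")
    await consumer.start()
    await producer.start()
    try:
        async for msg in consumer:
            # Debezium change events (with the JSON envelope enabled)
            # carry the new row state under payload.after
            event = json.loads(msg.value)
            order = Order(**event["payload"]["after"])
            fact = {
                "customer_id": order.customer_id,
                "amount_cents": int(round(order.amount * 100)),
            }
            await producer.send_and_wait("dwh.fact_orders", json.dumps(fact).encode())
    finally:
        await consumer.stop()
        await producer.stop()


if __name__ == "__main__":
    asyncio.run(run_transformer())
```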
More detailed descriptions of the individual services can be found in their respective directories.