Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Simple datalake
https://github.com/pprzetacznik/datalake
- Host: GitHub
- URL: https://github.com/pprzetacznik/datalake
- Owner: pprzetacznik
- Created: 2021-02-27T12:59:32.000Z (almost 4 years ago)
- Default Branch: master
- Last Pushed: 2024-06-22T10:41:24.000Z (8 months ago)
- Last Synced: 2024-12-08T22:05:34.630Z (2 months ago)
- Topics: avro, data-engineering, kafka, parquet, schema-registry, spark, spark-structured-streaming
- Language: Python
- Homepage:
- Size: 40 KB
- Stars: 2
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Datalake - ingestion
## Table of contents
* [Overview](#overview)
* [Things to improve](#things-to-improve)
* [How to run](#how-to-run)
* [Development](#development)
* [References](#references)

## Overview
This repository shows how to implement datalake ingestion using Spark Structured Streaming and Kafka. Data is streamed into a Parquet database; messages that don't match the schema are written to a dead-letter queue. The schema is registered in the Schema Registry by the `confluent_producer`.
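For orientation, the core of such a job can be sketched in a few lines of PySpark. This is only an illustrative sketch: the topic name, server addresses and paths below are assumptions, and the real `ingestion.py` additionally decodes the Avro payload against the Schema Registry and routes undecodable records to the dead-letter sink.

```Python
# Illustrative sketch only; topic, servers and paths are assumptions,
# not the repository's actual ingestion.py.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ingestion-sketch").getOrCreate()

# Read raw records from Kafka; key/value arrive as binary (Avro-encoded).
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "dev.orders")
    .option("startingOffsets", "earliest")
    .load()
)

# The real job decodes `value` against the Schema Registry here and sends
# records that fail decoding to a separate dead-letter sink.
query = (
    raw.writeStream.format("parquet")
    .option("path", "/data/parquet")
    .option("checkpointLocation", "/data/checkpoints/raw")
    .start()
)
query.awaitTermination()
```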
## Things to improve
* For now, schema matching against the Schema Registry happens only on the ingestion side; additional work is needed to validate the schema against the Schema Registry on the consumer side,
* Partitioning of the Parquet database needs tuning (see the sketch after this list),
* Idempotency, to improve message-delivery guarantees.
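A minimal sketch of what tuned partitioning plus checkpoint-based recovery could look like for the raw-zone sink, reusing the `raw` stream from the Overview sketch; the partition columns and paths are assumptions:

```Python
# Partition the raw zone by topic and ingest date so consumers can prune files,
# and keep a checkpoint so restarts don't reprocess already committed offsets.
# `raw` is the Kafka stream defined in the Overview sketch above.
from pyspark.sql import functions as F

partitioned = raw.withColumn("ingest_date", F.to_date(F.col("timestamp")))

query = (
    partitioned.writeStream.format("parquet")
    .partitionBy("topic", "ingest_date")
    .option("path", "/data/parquet")
    .option("checkpointLocation", "/data/checkpoints/raw")
    .start()
)
```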
## How to run
The main component of the ingestion is `ingestion.py`.
To run it as a process, you can simply run Docker Compose.
To test it, I would recommend running the producers from your local machine.
You can generate messages using `confluent_producer` or `avro_producer`; the two examples use different libraries.
To inspect the produced database in the raw zone, you can use pandas. All needed Python dependencies are listed in the `requirements-local.txt` file.

### Run datalake
```Bash
docker-compose up
```

### Prepare local producer and data viewer
```Bash
$ mkvirtualenv datalake
$ workon datalake
(datalake) $ pip install -r requirements-local.txt
```

### Produce messages to kafka topic
When you run a producer on your local machine, you'll be asked to prepare a new message.
```Bash
$ ./run_producer.sh
...
Please insert a description --> 'stop' to exit
zażółć gęślą jaźń
Message delivered to dev.orders [0]
Sending data: [{'order_id': 0, 'customer': 'John Doe', 'description': 'zażółć gęślą jaźń', 'price': 99.98, 'products': [{'name': 'shoes', 'price': 49.99}, {'name': 't-shirt', 'price': 49.99}]}]
Insert new data (stop to exit)
>
```
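The producer behind `./run_producer.sh` registers the schema in the Schema Registry before sending. With `confluent-kafka`, such a producer looks roughly like the hypothetical sketch below; the schema, field set, topic and addresses are simplified assumptions rather than the repository's exact `confluent_producer` code.

```Python
# Hypothetical sketch of a Schema-Registry-aware producer (confluent-kafka).
from confluent_kafka import SerializingProducer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
from confluent_kafka.serialization import StringSerializer

# Simplified order schema (the real one also carries the `products` array).
schema_str = """
{
  "type": "record",
  "name": "Order",
  "fields": [
    {"name": "order_id", "type": "int"},
    {"name": "customer", "type": "string"},
    {"name": "description", "type": "string"},
    {"name": "price", "type": "double"}
  ]
}
"""

registry = SchemaRegistryClient({"url": "http://localhost:8081"})
# The serializer registers the schema on first use, under "dev.orders-value".
producer = SerializingProducer({
    "bootstrap.servers": "localhost:9092",
    "key.serializer": StringSerializer("utf_8"),
    "value.serializer": AvroSerializer(registry, schema_str),
})

producer.produce(
    topic="dev.orders",
    key="product0",
    value={"order_id": 0, "customer": "John Doe",
           "description": "zażółć gęślą jaźń", "price": 99.98},
)
producer.flush()
```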
You'll be able to see messages in the docker-compose logs.
```Bash
+---+---------+------+-----+---------+----------+--------+--------+-----------+-----+--------+
ingestion | |key|partition|offset|topic|timestamp|avro_value|order_id|customer|description|price|products|
ingestion | +---+---------+------+-----+---------+----------+--------+--------+-----------+-----+--------+
ingestion | +---+---------+------+-----+---------+----------+--------+--------+-----------+-----+--------+
ingestion |
+--------+---------+------+----------+--------------------+--------------------+--------+--------+-----------------+-----+--------------------+
ingestion | | key|partition|offset| topic| timestamp| avro_value|order_id|customer| description|price| products|
ingestion | +--------+---------+------+----------+--------------------+--------------------+--------+--------+-----------------+-----+--------------------+
ingestion | |product0| 0| 4|dev.orders|2021-02-27 18:39:...|[02 00 10 4A 6F 6...| 0|John Doe|za???? g??l? ja??|99.98|[[shoes, 49.99], ...|
ingestion | +--------+---------+------+----------+--------------------+--------------------+--------+--------+-----------------+-----+--------------------+
```

### View parquet raw database
The easiest way to read the Parquet database is with pandas and pyarrow.
```Python
$ python
>>> import pandas as pd
>>> pd.read_parquet("./ingestion_data/parquet")
key partition offset ... description price products
0 product0 0 0 ... 1234 99.98 [{'name': 'shoes', 'price': 49.99}, {'name': '...
1 product1 0 1 ... 2131345 99.98 [{'name': 'shoes', 'price': 49.99}, {'name': '...
2 product0 0 2 ... 5555 99.98 [{'name': 'shoes', 'price': 49.99}, {'name': '...
3 product1 0 3 ... AAA 99.98 [{'name': 'shoes', 'price': 49.99}, {'name': '...
4 product0 0 4 ... zażółć gęślą jaźń 99.98 [{'name': 'shoes', 'price': 49.99}, {'name': '...

[5 rows x 11 columns]
```
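Since pyarrow is installed anyway, the same dataset can also be read with `pyarrow.dataset`, which lets you project columns and filter rows before anything is materialized in pandas; a small sketch (path and column names as in the example above):

```Python
# Read the raw zone with pyarrow, keeping only a few columns and rows.
import pyarrow.dataset as ds

dataset = ds.dataset("./ingestion_data/parquet", format="parquet")
table = dataset.to_table(
    columns=["key", "description", "price"],
    filter=ds.field("price") > 50,
)
print(table.to_pandas())
```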
### Producing a message that is broken
Although `avro_producer.py` is implemented correctly, it doesn't interact with the Schema Registry and doesn't put that information in the first bytes (the Schema Registry framing) of the serialized message. That means the message won't be deserialized properly by the `ingestion` module, so we can use it to test the dead-letter queue.
```Bash
$ ./run_producer.sh avro
```
After producing a message, we can inspect it in the dead-letter Parquet database using pandas.
```Python
>>> pd.read_parquet("./ingestion_data/parquet-dlq")
key partition offset topic timestamp avro_value order_id customer description price products
0 product0 0 5 dev.orders 2021-02-27 18:59:02.253 b'hn Doe\x1cmy description\x1f\x85\xebQ\xb8\xf... NaN None None NaN None
```
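What distinguishes the two producers is the Confluent wire format: a Schema-Registry-aware producer prepends a zero magic byte and a 4-byte schema id to the Avro payload. A hypothetical standalone check (not code from the repository) illustrates why the `avro` message above ends up in the dead-letter queue:

```Python
# Hypothetical helper: extract the Schema Registry id from a Kafka value that
# uses the Confluent wire format (magic byte 0 + 4-byte big-endian schema id).
import struct

def schema_id_or_none(value: bytes):
    """Return the schema id, or None if the Schema Registry framing is missing."""
    if len(value) > 5 and value[0] == 0:
        return struct.unpack(">I", value[1:5])[0]
    return None  # candidate for the dead-letter queue

print(schema_id_or_none(b"\x00\x00\x00\x00\x01\x02hello"))  # -> 1
print(schema_id_or_none(b"hn Doe\x1cmy description"))       # -> None
```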
## Development
### Rebuild the ingestion image
```Bash
docker build -f ./ingestion/Dockerfile .
```

### Schema registry queries
List all subjects.
```Bash
$ curl http://localhost:8081/subjects
```

Get latest schema for subject `dev.orders-value`.
```Bash
$ curl http://localhost:8081/subjects/dev.orders-value/versions/latest
```
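Before evolving the order schema, the Schema Registry can also be asked whether a proposed version is compatible with the latest registered one; a small standard-library sketch (the candidate schema below is a placeholder assumption):

```Python
# Ask the Schema Registry whether a candidate schema is compatible with the
# latest version registered under dev.orders-value.
import json
import urllib.request

candidate = {"type": "record", "name": "Order",
             "fields": [{"name": "order_id", "type": "int"}]}  # placeholder

req = urllib.request.Request(
    "http://localhost:8081/compatibility/subjects/dev.orders-value/versions/latest",
    data=json.dumps({"schema": json.dumps(candidate)}).encode(),
    headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp))  # e.g. {'is_compatible': False}
```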
### Testing
```Bash
$ docker-compose up -d
$ docker exec -it ingestion /bin/bash
(docker) $ python /ingestion/ingestion.py
```

## References
* [Structured Streaming Programming Guide](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html),
* [ABRiS - Avro Bridge for Spark](https://github.com/AbsaOSS/ABRiS),
* [Schema Evolution and Compatibility](https://docs.confluent.io/platform/current/schema-registry/fundamentals/schema-evolution.html).