# Kafka Bytewax Demo
This repo contains a Bytewax dataflow that uses Kafka as input and writes modified data to another Kafka topic.

## Installation
You will need to install Poetry for dependency management. The easiest way is through pipx. You can run the script in ./poetry-initializer-script.sh to install all dependencies and start a Poetry shell. Find the script below for reference:

```sh
pip3 install pipx; pipx ensurepath; source ~/.bashrc; pipx install poetry; poetry install; poetry shell
```

Ensure you restart all open shells to get Poetry working there. If the above installation does not work, find the installation instructions in the [Poetry docs](https://python-poetry.org/docs/). Then install dependencies and start a Poetry shell using the commands below:
```sh
poetry install
poetry shell
```

The `poetry install` command installs Bytewax and kafka-python. Bytewax provides the dataflow runtime, and kafka-python is the client library used to communicate with Kafka.
## Installing Other Dependencies
You need Kafka to run this dataflow. The easiest way to run Kafka is through the Docker Compose file, which defines a Kafka broker and ZooKeeper. Navigate to the kafka directory and start both services with the commands below:
```sh
cd kafka
docker compose up -d
```
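
If you want to confirm the broker is reachable before moving on, a quick check with kafka-python works. This snippet is a convenience suggestion, not part of the repo, and `localhost:9092` is an assumed broker address; adjust it to match your Compose file:

```python
# Quick connectivity check: list the topics the broker currently knows about.
from kafka import KafkaConsumer

# The constructor raises NoBrokersAvailable if the broker is not up.
# localhost:9092 is an assumed address; adjust to your compose file.
consumer = KafkaConsumer(bootstrap_servers="localhost:9092")
print(consumer.topics())
consumer.close()
```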

### Setting up the Kafka topics

This dataflow has two topics of concern:
1. An input topic, `ip_addresses_by_countries`, that stores IP addresses partitioned by the name of their country of origin.
2. An output topic, `ip_addresses_by_location`, that stores JSON location data for each IP address, including city, state or region, and country.

Each topic has 20 partitions and is created with kafka-python. You can create both by running the `initializer.py` script using the command below:
```sh
python initializer.py
```

The initializer script also produces IP address data into the input topic from the file `./ip_addresses_with_countries.txt`.
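
For reference, the core of such an initializer script might look like the sketch below: create both topics with kafka-python's admin client, then produce each IP address keyed by its country name so records for the same country land in the same partition. The broker address and the `ip,country` line format are assumptions based on the description above, not a copy of the repo's script:

```python
# Hypothetical sketch of what initializer.py does, per the description above.
from kafka import KafkaProducer
from kafka.admin import KafkaAdminClient, NewTopic

BROKER = "localhost:9092"  # assumed broker address

# Create both topics with 20 partitions each.
admin = KafkaAdminClient(bootstrap_servers=BROKER)
admin.create_topics([
    NewTopic(name="ip_addresses_by_countries", num_partitions=20, replication_factor=1),
    NewTopic(name="ip_addresses_by_location", num_partitions=20, replication_factor=1),
])

# Produce each IP address keyed by country name; records with the same
# key hash to the same partition.
producer = KafkaProducer(bootstrap_servers=BROKER)
with open("./ip_addresses_with_countries.txt") as f:
    for line in f:
        ip, country = line.strip().split(",")  # assumed "ip,country" line format
        producer.send("ip_addresses_by_countries", key=country.encode(), value=ip.encode())
producer.flush()
```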
## Running the Dataflow
You can run the dataflow as a normal Python script using the command below:
```sh
python dataflow.py
```

This sources records from the input topic, augments each IP address with its location data, and writes the results to the output topic.
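
For orientation, here is a minimal sketch of what such a dataflow can look like. Note that Bytewax's API has changed substantially since this repo was written in 2022; the sketch targets a recent release (0.18+), and the step names, broker address, and stubbed `augment` lookup are illustrative assumptions rather than the repo's actual code:

```python
# A sketch against a recent Bytewax release (0.18+); the repo itself
# predates this API. Step names and the lookup stub are assumptions.
import json

import bytewax.operators as op
from bytewax.connectors.kafka import KafkaSink, KafkaSinkMessage, KafkaSource
from bytewax.dataflow import Dataflow

BROKERS = ["localhost:9092"]  # assumed broker address


def augment(msg):
    # msg is a KafkaSourceMessage; a real dataflow would look the IP up
    # in a geolocation database here instead of stubbing the fields.
    ip = msg.value.decode()
    location = {"ip": ip, "city": "?", "region": "?", "country": "?"}
    return KafkaSinkMessage(key=msg.key, value=json.dumps(location).encode())


flow = Dataflow("ip_augmentation")
inp = op.input("kafka_in", flow, KafkaSource(BROKERS, ["ip_addresses_by_countries"]))
augmented = op.map("augment", inp, augment)
op.output("kafka_out", augmented, KafkaSink(BROKERS, "ip_addresses_by_location"))

if __name__ == "__main__":
    # run_main lets the file run directly, matching `python dataflow.py` above.
    from bytewax.testing import run_main

    run_main(flow)
```

To verify the result, you can consume the augmented records back with kafka-python:

```python
# Read the augmented records back to confirm the dataflow worked.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "ip_addresses_by_location",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.value.decode())
```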