
# DataEngineering-1Billion-Rows-Per-Hour

A project that simulates how to build a complete workflow to persist 1 billion rows per hour.

## Project Overview

This project is designed for learning purposes and consists of the following components:
- A Python producer sending data to Kafka
- A Java producer sending data to Kafka
- An Apache Spark job that consumes the data from Kafka and persists it to PostgreSQL

## Steps Involved

1. **Python Producer**: Generates and sends data to a Kafka topic (see the sketch after this list).
2. **Java Producer**: Generates and sends data to a Kafka topic.
3. **Kafka**: Acts as the message broker that handles the data streams.
4. **Apache Spark**: Consumes data from Kafka, processes it, and persists it to PostgreSQL.
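
As a rough illustration of step 1, here is a minimal sketch of a Python Kafka producer. It assumes the `kafka-python` client, a broker on `localhost:9092`, and a topic named `events`; the client library, topic name, and message schema used in this repository may differ.

```python
import json
import time
import uuid

from kafka import KafkaProducer  # pip install kafka-python

# Hypothetical broker address and topic; adjust to match the compose setup.
BOOTSTRAP_SERVERS = "localhost:9092"
TOPIC = "events"

producer = KafkaProducer(
    bootstrap_servers=BOOTSTRAP_SERVERS,
    # Serialize each record as UTF-8 encoded JSON.
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    # Batching matters at this volume: wait up to 50 ms to fill larger batches.
    linger_ms=50,
    batch_size=64 * 1024,
)

for _ in range(1_000_000):
    record = {
        "id": str(uuid.uuid4()),
        "value": 42,
        "created_at": time.time(),
    }
    # send() is asynchronous; records are buffered and sent in batches.
    producer.send(TOPIC, value=record)

# Block until all buffered records are delivered.
producer.flush()
```

The `linger_ms` and `batch_size` settings trade a little latency for throughput, which is the relevant knob when the target is a billion rows per hour.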

## How to Execute the Project

1. **Boot the project**:
```sh
docker compose up --build --force-recreate
```

2. **Run the Python Producer**:
```sh
make producer-python
```

3. **Run the Java Producer**:
```sh
make producer-java
```

4. **Run the PySpark Consumer** (a sketch of what such a consumer might look like follows this list):
```sh
make pyspark-consumer
```

5. **Access PostgreSQL**:
```sh
make connect-postgres
```
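
For orientation, here is a minimal sketch of a Kafka-to-PostgreSQL consumer built with Spark Structured Streaming, roughly the shape of the `pyspark-consumer` target. The topic name, schema, JDBC URL, and credentials are assumptions, and the job needs the `spark-sql-kafka` package plus the PostgreSQL JDBC driver on the classpath; the project's actual consumer may be organized differently.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

spark = SparkSession.builder.appName("kafka-to-postgres").getOrCreate()

# Assumed message schema; align with whatever the producers send.
schema = StructType([
    StructField("id", StringType()),
    StructField("value", DoubleType()),
    StructField("created_at", DoubleType()),
])

# Read the raw Kafka stream; `value` arrives as bytes.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "events")
       .load())

# Decode the JSON payload into typed columns.
events = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), schema).alias("data"))
          .select("data.*"))

def write_to_postgres(batch_df, batch_id):
    # Each micro-batch is appended to PostgreSQL over JDBC.
    (batch_df.write
     .format("jdbc")
     .option("url", "jdbc:postgresql://localhost:5432/postgres")
     .option("dbtable", "events")
     .option("user", "postgres")
     .option("password", "postgres")
     .option("driver", "org.postgresql.Driver")
     .mode("append")
     .save())

query = events.writeStream.foreachBatch(write_to_postgres).start()
query.awaitTermination()
```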

## Useful Links

- [Kafka Topics](http://localhost:8080)
- [Spark Jobs](http://localhost:4040)
- [Grafana](http://localhost:3000)
- [Prometheus](http://localhost:9090)

## Credits

This project follows the [1 Billion Records per Hour](https://www.youtube.com/watch?v=d6AFh31fO7Y&t=3s) YouTube tutorial.