https://github.com/nasruddin/realtime-streaming-kafka-flink-pinot-postgres-superset
Setup for realtime data streaming using Kafka, Flink, Pinot, MySQL, Postgres and Superset
- Host: GitHub
- URL: https://github.com/nasruddin/realtime-streaming-kafka-flink-pinot-postgres-superset
- Owner: Nasruddin
- Created: 2025-02-24T17:32:45.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2025-02-27T11:21:18.000Z (7 months ago)
- Last Synced: 2025-02-27T15:47:01.151Z (7 months ago)
- Topics: docker, docker-compose, flink, flink-stream-processing, kafka, mysql, pinot, postgres, postgresql, python, realtime-streaming, streaming, streaming-data, superset
- Language: Python
- Homepage:
- Size: 4.67 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Realtime Data Streaming

## Overview
In modern data-driven applications, real-time data streaming has become a critical requirement for processing, analyzing, and storing large volumes of continuously generated data.
This streaming pipeline uses **Kafka**, **Flink**, **Apache Pinot**, **PostgreSQL**, and **MySQL**, leveraging both **Python and Java**, with everything provisioned using **Docker Compose** for seamless deployment and orchestration.
This setup ensures scalable, low-latency, and fault-tolerant data streaming for real-time analytics and operational insights.
1. Kafka serves as the central event broker, enabling efficient data ingestion and movement across systems.
2. Apache Flink, a powerful stream processing framework, transforms and enriches incoming data in real time before routing it to different storage systems.
3. For fast analytical queries, Apache Pinot provides an OLAP engine optimized for high-speed aggregations.
4. PostgreSQL and MySQL act as traditional relational stores for structured and transactional data.
5. With Python and Java, we implement the producers, consumers, and Flink processing jobs, and integrate them with the storage layers (see the sketch below).
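To make this concrete, here is a minimal sketch of what the producer side can look like in Python with `kafka-python`; the `rides` topic name and the event fields are illustrative assumptions, not the repository's exact schema:

```python
# Hypothetical ride-event producer sketch (kafka-python).
# Topic name and event fields are assumptions, not the repo's exact schema.
import json
import random
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for ride_id in range(5):
    event = {
        "ride_id": ride_id,
        "driver_id": random.randint(1, 100),
        "fare": round(random.uniform(5.0, 50.0), 2),
        "ts": int(time.time() * 1000),
    }
    producer.send("rides", value=event)  # "rides" is an assumed topic name
    time.sleep(1)

producer.flush()
```

## Prerequisites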
Make sure you have the following installed:
- [Docker](https://docs.docker.com/get-docker/)
- [Docker Compose](https://docs.docker.com/compose/install/)
- [Python 3.x](https://www.python.org/downloads/)
- [Java 8/11](https://www.oracle.com/java/technologies/javase/jdk11-archive-downloads.html) (for running Flink jobs)
- [Git](https://git-scm.com/downloads)

## Installation & Setup
### 1. Clone the Repository
```bash
git clone https://github.com/Nasruddin/realtime-streaming-kafka-flink-pinot-postgres.git
cd realtime-streaming-kafka-flink-pinot-postgres
```

### 2. Start Services using Docker Compose

Make sure to build and package the Flink Java code first:
```bash
cd flink-processing-service
mvn clean package -DskipTests
```

Then bring up all the services:

```bash
docker compose \
-f docker-compose-base.yml \
-f docker-compose-pinot.yml \
-f docker-compose-flink.yml up --build
```

This will start **Kafka, Zookeeper, Flink, Pinot, PostgreSQL, and MySQL** containers.
### 3. Verify Running Services
Run the following commands to check if the services are up:
```bash
docker ps
```

You should see containers for **Kafka, Zookeeper, Flink, Pinot, PostgreSQL, and MySQL** running.

## Running the Pipeline
### 1. Produce Sample Data to Kafka
As soon as the containers are running, data (1,000 ride rows) is ingested into MySQL and eventually published to Kafka as events. Once all 1,000 rides have been pushed to Kafka, the Python Kafka producer starts creating new events. You can verify this in the rides service logs:
```bash
docker logs rides-service
```
#### Producer

#### Consumer
### 2. Flink Job
Confirm that the Flink job is running and producing transformed events.

#### List the topics to confirm the new realtime topic is created
```bash
docker exec -it kafka kafka-topics --bootstrap-server kafka:9092 --list
```
#### Verify the events are getting generated
```bash
docker exec -it kafka kafka-console-consumer --topic riders_out --bootstrap-server kafka:9092
```
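Beyond the console consumer, you can also inspect the transformed events with a small Python consumer. This is a minimal sketch assuming `kafka-python`, JSON-encoded messages, and a broker port published to the host as `localhost:9092`:

```python
# Consumer sketch for the transformed topic (kafka-python).
# Assumes the Flink job writes JSON-encoded messages to riders_out.
import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "riders_out",
    bootstrap_servers="localhost:9092",  # assumes broker port exposed to host
    auto_offset_reset="earliest",        # read from the beginning of the topic
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    print(message.offset, message.value)
```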
### 3. Query Apache Pinot
Pinot is used for OLAP (Online Analytical Processing) queries on real-time data.
- **Access Pinot UI:** `http://localhost:9000`
- **Verify schemas & tables are generated:**
```bash
curl -X GET "http://localhost:9000/schemas"
```

```bash
curl -X GET "http://localhost:9000/tables"
```
**Note:** If you don't find the schemas and tables, re-run the container that registers them:
```bash
docker-compose restart pinot-add-table
```
- Once the data is processed and stored in **Apache Pinot**, query it using:
```bash
curl -X POST "http://localhost:9000/query/sql" -H "Content-Type: application/json" -d '{"sql":"SELECT * FROM rides LIMIT 10"}'
```
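The same endpoint can be called from Python as well; a small sketch with `requests`, mirroring the curl call above (the `resultTable` shape is Pinot's standard SQL response format):

```python
# Query Pinot's SQL endpoint from Python; mirrors the curl call above.
import requests

resp = requests.post(
    "http://localhost:9000/query/sql",
    json={"sql": "SELECT * FROM rides LIMIT 10"},
    timeout=10,
)
resp.raise_for_status()

# Pinot returns rows and column names in a resultTable structure.
result = resp.json()
print(result["resultTable"]["dataSchema"]["columnNames"])
for row in result["resultTable"]["rows"]:
    print(row)
```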
Alternatively, use the Pinot UI.

## Running Queries

### Open the Pinot UI (`http://localhost:9000`)

#### Query Pinot:
```sql
SELECT * FROM rides LIMIT 10;
```
### 4. Query PostgreSQL
**Access the database with:**
```bash
docker exec -it postgres psql -U postgresuser -d rides_db
```
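The database can also be queried from Python. Below is a minimal sketch with `psycopg2`; the password and the `rides` table name are assumptions, so check the compose files for the actual values:

```python
# Minimal sketch: query rides_db from Python with psycopg2.
# Password and table name are assumptions, not confirmed defaults.
import psycopg2

conn = psycopg2.connect(
    host="localhost",
    dbname="rides_db",
    user="postgresuser",
    password="postgrespw",  # hypothetical; use the password from the compose file
)

with conn, conn.cursor() as cur:
    cur.execute("SELECT * FROM rides LIMIT 10;")  # table name assumed
    for row in cur.fetchall():
        print(row)

conn.close()
```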
### 5. Dashboard on Superset
Superset is used for data visualization and dashboarding.

- **Access Superset:** `http://localhost:8088`
- **Superset dataset:**
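Superset talks to Pinot through the `pinotdb` DB-API driver; a hedged sketch of querying through it directly, where the host and broker port (Pinot's default, 8099) are assumptions for this setup:

```python
# Hypothetical: query Pinot via the pinotdb DB-API driver (the same driver
# Superset's Pinot connector uses). Host and port are assumptions.
from pinotdb import connect

conn = connect(host="localhost", port=8099, path="/query/sql", scheme="http")
cur = conn.cursor()
cur.execute("SELECT * FROM rides LIMIT 10")
for row in cur:
    print(row)
```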
## Stopping and Cleaning Up
To stop the containers without deleting data:
```bash
docker compose \
-f docker-compose-base.yml \
-f docker-compose-pinot.yml \
-f docker-compose-flink.yml down
```

To stop and remove all containers, volumes, and networks:
```bash
docker compose \
-f docker-compose-base.yml \
-f docker-compose-pinot.yml \
-f docker-compose-flink.yml down -v
```

## Troubleshooting
- Check logs of a specific service:
```bash
docker logs -f <container_name>
```
- Restart a specific service:
```bash
docker-compose restart <service_name>
```
- Ensure ports are not occupied:
```bash
sudo lsof -i :<port>
```
Then kill the process:
```bash
sudo kill -15 <pid>  # graceful shutdown
sudo kill -9 <pid>   # force kill if it doesn't stop
```
- Kafka-specific: remove the Kafka and ZooKeeper containers and clear their state:
```bash
docker stop kafka zookeeper
docker rm kafka zookeeper
```
ZooKeeper stores broker metadata under `/tmp/zookeeper`, so you need to clear it:
```bash
rm -rf /tmp/zookeeper
```
Kafka maintains log data, which might be causing conflicts. Delete the Kafka log directory:
```bash
rm -rf /tmp/kafka-logs
```
## Contributing

Feel free to fork this repository and submit pull requests with improvements.