https://github.com/dishabasti/real-time-analytics-pipeline
Live viewership analytics using Kafka, PySpark, forecasting, and Grafana dashboards
big-data-analytics channel-viewership data-pipeline grafana-dashboard kafka kafka-streams pyspark real-time-analytics spark streaming-data user-behavior-analysis
- Host: GitHub
- URL: https://github.com/dishabasti/real-time-analytics-pipeline
- Owner: DishaBasti
- License: mit
- Created: 2025-07-10T15:05:21.000Z (12 months ago)
- Default Branch: main
- Last Pushed: 2025-07-23T16:23:15.000Z (11 months ago)
- Last Synced: 2025-07-23T18:26:11.354Z (11 months ago)
- Topics: big-data-analytics, channel-viewership, data-pipeline, grafana-dashboard, kafka, kafka-streams, pyspark, real-time-analytics, spark, streaming-data, user-behavior-analysis
- Language: Python
- Homepage:
- Size: 20.5 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Real-Time Big Data Analytics with Kafka, PySpark & Grafana
This project demonstrates **real-time data analytics** using two approaches:
- **Kafka + PySpark + Grafana**: A real-time distributed streaming pipeline (PC1 → PC2).
- **Prometheus + Grafana**: A simulation-based monitoring pipeline.
It also includes **offline PySpark analytics** in Google Colab for detailed insights.
## Table of Contents
1. [Project Overview](#project-overview)
2. [Architecture](#architecture)
3. [Approach 1: Kafka + PySpark + Grafana](#approach-1-kafka--pyspark--grafana)
4. [Approach 2: Prometheus + Grafana Simulation](#approach-2-prometheus--grafana-simulation)
5. [Offline Analytics in Google Colab](#offline-analytics-in-google-colab)
6. [Folder Structure](#folder-structure)
7. [Setup Instructions](#setup-instructions)
   - [Approach 1: Kafka + PySpark + Grafana (Distributed Setup)](#approach-1-setup-kafka--pyspark--grafana)
   - [Approach 2: Prometheus + Grafana Simulation](#approach-2-setup-prometheus--grafana-simulation)
8. [Screenshots](#screenshots)
9. [Future Enhancements](#future-enhancements)
## Project Overview
The goal of this project is to **simulate real-time TV channel viewership data** from a CSV and analyze it using:
- **Kafka** for distributed event streaming.
- **PySpark Structured Streaming** for real-time analytics.
- **Prometheus & Grafana** for monitoring and visualization.
- **PostgreSQL or InfluxDB** for optional storage of aggregated results.
**Dataset Columns**:
```
Event_ID | Event_Type | User_ID | City | State | User_Type | Timestamp | Channel | Program | Channel_Type | View_Min | Session_Dur | Preferred_Time | Region
```
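As a rough illustration of how rows with these columns might be loaded in Python, the sketch below parses CSV text into typed records. The field types, the `%Y-%m-%d %H:%M:%S` timestamp format, the `ViewEvent` class, and the sample row are all assumptions, since the dataset's exact formats are not documented here.

```python
import csv
import io
from dataclasses import dataclass
from datetime import datetime

# Column names as listed above; the types used below are assumptions.
COLUMNS = [
    "Event_ID", "Event_Type", "User_ID", "City", "State", "User_Type",
    "Timestamp", "Channel", "Program", "Channel_Type", "View_Min",
    "Session_Dur", "Preferred_Time", "Region",
]


@dataclass
class ViewEvent:
    """Hypothetical typed view of the fields most useful for analytics."""
    event_id: str
    channel: str
    region: str
    timestamp: datetime
    view_min: float


def parse_events(csv_text, ts_format="%Y-%m-%d %H:%M:%S"):
    """Yield ViewEvent records from CSV text that starts with a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        yield ViewEvent(
            event_id=row["Event_ID"],
            channel=row["Channel"],
            region=row["Region"],
            timestamp=datetime.strptime(row["Timestamp"], ts_format),
            view_min=float(row["View_Min"]),
        )


# Made-up sample row, for illustration only.
SAMPLE = (
    ",".join(COLUMNS) + "\n"
    "E1,view,U42,Pune,Maharashtra,Premium,2025-07-10 20:15:00,"
    "News24,Prime Debate,News,35,50,Evening,West\n"
)

events = list(parse_events(SAMPLE))
print(events[0])
```

If the real file uses a different timestamp layout, only the `ts_format` argument should need to change.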
## Architecture
### Approach 1: Kafka + PySpark + Grafana

```
PC1 (Producer) → Kafka Broker → PC2 (PySpark Consumer) → Database → Grafana Dashboards
```
### Approach 2: Prometheus + Grafana Simulation

```
CSV Dataset → Prometheus Exporter (Python) → Prometheus → Grafana Dashboards
```
## Approach 1: Kafka + PySpark + Grafana
### How It Works Across Two PCs
- **PC 1**: Kafka Producer streams CSV data into **Kafka Topics**.
- **PC 2**: PySpark Consumer reads the Kafka topics in **real time**, performs aggregations, and writes the results to a database or the console.
- **Grafana**: Connects to the database or Spark output for visualization.
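To make the aggregation step more concrete, here is a plain-Python sketch of the kind of windowed "top channels" computation a streaming consumer might perform. It is not the code in `subscriber.py`: the one-minute tumbling window, the `window_start` and `top_channels` helpers, and the sample events are assumptions chosen for illustration.

```python
from collections import defaultdict
from datetime import datetime


def window_start(ts, width_s=60):
    """Floor a timestamp to the start of its tumbling window.

    Assumes width_s divides evenly into one day (e.g. 60, 300, 3600).
    """
    seconds = ts.hour * 3600 + ts.minute * 60 + ts.second
    floored = seconds - seconds % width_s
    return ts.replace(
        hour=floored // 3600,
        minute=(floored % 3600) // 60,
        second=floored % 60,
        microsecond=0,
    )


def top_channels(events, width_s=60, n=3):
    """Sum view minutes per (window, channel) and keep the top n per window."""
    totals = defaultdict(lambda: defaultdict(float))
    for ts, channel, view_min in events:
        totals[window_start(ts, width_s)][channel] += view_min
    return {
        win: sorted(per_channel.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
        for win, per_channel in sorted(totals.items())
    }


# Made-up events: (timestamp, channel, minutes viewed).
events = [
    (datetime(2025, 7, 10, 20, 15, 5), "News24", 10.0),
    (datetime(2025, 7, 10, 20, 15, 40), "SportsOne", 25.0),
    (datetime(2025, 7, 10, 20, 15, 59), "News24", 20.0),
    (datetime(2025, 7, 10, 20, 16, 1), "News24", 5.0),
]

result = top_channels(events)
for win, ranking in result.items():
    print(win, ranking)
```

In PySpark Structured Streaming, a comparable result would typically come from grouping on a time window of the `Timestamp` column together with `Channel`; the version above only shows the arithmetic.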
### Key Features
- Real-time ingestion and processing
- Distributed setup for scalability
- Visualization in Grafana
## Approach 2: Prometheus + Grafana Simulation
This approach simulates real-time metrics using a **Prometheus exporter** when you don't have a full Kafka cluster set up.
- Reads CSV rows sequentially with a time delay.
- Exposes metrics at `http://localhost:8000/metrics`.
- Grafana pulls data from Prometheus and renders dashboards.
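The exporter idea can be sketched with the standard library alone: replay rows one at a time, keep running totals, and render them in the Prometheus text exposition format. The metric name `channel_view_minutes_total`, the `channel` label, and the sample rows are hypothetical; the actual `prometheus_simulator.py` may use different names and may rely on a client library rather than hand-built text.

```python
import time
from collections import defaultdict


def replay(rows, delay_s=0.0):
    """Yield rows one at a time, sleeping between them to mimic a live feed."""
    for row in rows:
        yield row
        time.sleep(delay_s)


def render_metrics(totals):
    """Render per-channel totals in the Prometheus text exposition format.

    Label values are written as-is; real label values containing quotes,
    backslashes, or newlines would need escaping.
    """
    lines = [
        "# HELP channel_view_minutes_total Total minutes viewed per channel.",
        "# TYPE channel_view_minutes_total counter",
    ]
    for channel in sorted(totals):
        lines.append(
            f'channel_view_minutes_total{{channel="{channel}"}} {totals[channel]}'
        )
    return "\n".join(lines) + "\n"


# Made-up rows: (channel, minutes viewed).
rows = [("News24", 10), ("SportsOne", 25), ("News24", 20)]

totals = defaultdict(int)
for channel, view_min in replay(rows):
    totals[channel] += view_min

metrics_text = render_metrics(totals)
print(metrics_text)
```

Serving this text over HTTP at a `/metrics` path is what lets Prometheus scrape it; a non-zero `delay_s` controls how quickly the simulated feed advances.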
## Offline Analytics in Google Colab
Because of local Spark setup constraints, the detailed analytics were run in **Colab**:
- **Notebook:** [Analytics & Visualizations](https://colab.research.google.com/drive/1t2X3r2MHtKUaQ4ilkXLT3vJh5Q8eIaTT?usp=sharing)
## Folder Structure
```
real-time-bda-pipeline/
├── README.md
├── producer.py               # Kafka Producer (simulated streaming)
├── subscriber.py             # PySpark Consumer with real-time analytics
├── prometheus_simulator.py   # Prometheus metrics exporter
├── sample_data.csv           # Example dataset
├── notebooks/
│   └── pyspark_analytics.ipynb
├── requirements.txt
└── docker-compose.yml        # For Kafka + Zookeeper setup
```
## Setup Instructions
### **Install Dependencies**
```bash
pip install -r requirements.txt
```
### Approach 1 Setup: Kafka + PySpark + Grafana
#### **On PC 1 (Kafka Producer)**
1. Start Kafka using Docker Compose:
```bash
docker-compose up -d
```
2. Run the Producer script:
```bash
python producer.py
```
3. The producer streams rows from `sample_data.csv` to Kafka topics.
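One way the producer step might turn CSV rows into Kafka messages is sketched below using only the standard library. The topic name `viewership`, the choice of `Channel` as the message key, and the JSON encoding are assumptions, not details taken from `producer.py`; actually sending the bytes to the broker is left to whichever Kafka client the script uses.

```python
import csv
import io
import json

TOPIC = "viewership"  # hypothetical topic name


def row_to_message(row, key_field="Channel"):
    """Encode one CSV row as a (key, value) pair of bytes for a Kafka record."""
    key = row[key_field].encode("utf-8")
    value = json.dumps(row, sort_keys=True).encode("utf-8")
    return key, value


def messages_from_csv(csv_text):
    """Yield (key, value) byte pairs for every data row in the CSV text."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        yield row_to_message(row)


# Made-up rows with a reduced set of columns, for illustration only.
csv_text = "Event_ID,Channel,View_Min\nE1,News24,35\nE2,SportsOne,12\n"

messages = list(messages_from_csv(csv_text))
for key, value in messages:
    print(TOPIC, key, value)
```

Keying by channel is a design choice: with Kafka's default partitioning, records that share a key go to the same partition, so events for one channel stay in order relative to each other.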
#### **On PC 2 (PySpark Consumer)**
1. Ensure PC 2 can access PC 1's IP and Kafka port (9092).
2. Edit `subscriber.py` with PC 1's Kafka IP:
```python
kafka_bootstrap_servers = "PC1_IP:9092"
```
3. Start PySpark consumer:
```bash
spark-submit subscriber.py
```
4. Processed data can be viewed on the console or written to a database.
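Before launching the consumer, it can help to confirm that PC 2 can open a TCP connection to the broker port on PC 1. The helper below is a small sketch of such a check; the function name `can_reach` and the timeout are assumptions, and the demo runs against a throwaway local listener rather than a real broker so that it works anywhere.

```python
import socket


def can_reach(host, port=9092, timeout_s=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


# Demo against a temporary local listener standing in for the broker.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
server.listen(1)
port = server.getsockname()[1]

reachable = can_reach("127.0.0.1", port)
server.close()
unreachable = can_reach("127.0.0.1", port, timeout_s=1.0)

print(reachable, unreachable)
```

On PC 2 the call would be `can_reach("PC1_IP")` with PC 1's real address substituted. A successful connection only shows that the port is open; the address the broker advertises to clients must also be reachable from PC 2.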
#### **Grafana**
- Connect Grafana to **PostgreSQL** or **InfluxDB** where processed results are stored.
- Import dashboards for visualization.
### Approach 2 Setup: Prometheus + Grafana Simulation
1. Start Prometheus exporter:
```bash
python prometheus_simulator.py
```
2. Prometheus scrapes metrics from `http://localhost:8000/metrics`.
3. In Grafana:
- Add Prometheus as a data source.
- Build dashboards to visualize real-time metrics.
## Screenshots
- **Kafka Console Output**: Top Channels, Regional Trends
- **Grafana Dashboard for Kafka Pipeline**
- **Grafana Dashboard for Prometheus Simulation**
- **Colab Visualizations**
## Future Enhancements
- Integrate **forecasting models** (Prophet, ARIMA) into PySpark streaming.
- Store real-time processed data in **InfluxDB** for time-series analytics.
- Deploy pipeline using **Kubernetes** for scalability.
## Author
**Disha S Basti**