https://github.com/mkaspulanwar/p6_bigdata_realtime_largescale_visualization
Week 6 Big Data practicum: real-time analytics and large-scale data visualization using PySpark Structured Streaming, a Parquet data lake, and Streamlit to monitor smart-city mobility and traffic.
big-data data-visualization pyspark spark-streaming streamlit traffic-analytics
- Host: GitHub
- URL: https://github.com/mkaspulanwar/p6_bigdata_realtime_largescale_visualization
- Owner: mkaspulanwar
- License: mit
- Created: 2026-04-02T05:13:01.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2026-04-05T15:43:02.000Z (3 months ago)
- Last Synced: 2026-04-05T17:24:27.071Z (3 months ago)
- Topics: big-data, data-visualization, pyspark, spark-streaming, streamlit, traffic-analytics
- Language: Python
- Homepage:
- Size: 504 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
# Big Data Practicum Week 6: Real-Time Analytics & Large-Scale Data Visualization






## Practicum Team
| Role | Name | Student ID (NIM) | GitHub Profile |
| :--- | :--- | :--- | :--- |
| **Project Developer** | M. Kaspul Anwar | 230104040212 | [mkaspulanwar](https://github.com/mkaspulanwar) |
| **Course Lecturer** | Muhayat, M. IT | - | [muhayat-lab](https://github.com/muhayat-lab) |
---
## Project Description
This Week 6 project implements **Real-Time Analytics** and **Large-Scale Data Visualization** end to end:
1. Transaction and trip data are simulated as a stream of JSON files.
2. The stream is processed with **PySpark Structured Streaming**.
3. Stream output is written to a Parquet-based **serving layer**.
4. A **Streamlit** dashboard displays KPIs, trends, distributions, window aggregations, and anomaly detection in near real time.

The implementation covers two use cases:
1. **Real-Time E-Commerce Analytics**.
2. **Smart Transportation Analytics** (including alerts and anomalies).
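
The first step of the pipeline, simulating a JSON stream, can be sketched with the standard library alone. This is a minimal illustration and not the code in `scripts/transportation/trip_generator.py`: the field names (`trip_id`, `location`, `distance_km`, `fare`), the location list, and the fare formula are all assumptions made for the example. Only the output folder `stream_data/transportation` comes from this README.

```python
import json
import random
import time
import uuid
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical locations and fare formula, chosen only for illustration.
LOCATIONS = ["Downtown", "Airport", "Harbor", "Campus"]


def make_trip(now=None):
    """Build one synthetic trip record as a plain dict."""
    now = now or datetime.now(timezone.utc)
    distance_km = round(random.uniform(1.0, 25.0), 2)
    return {
        "trip_id": str(uuid.uuid4()),
        "timestamp": now.isoformat(),
        "location": random.choice(LOCATIONS),
        "distance_km": distance_km,
        "fare": round(5000 + distance_km * 4000, 2),
    }


def write_batch(output_dir, batch_size=5):
    """Write one JSON Lines file and return its path."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    name = f"trips_{uuid.uuid4().hex}.json"
    tmp_path = output_dir / f".{name}.tmp"
    with tmp_path.open("w", encoding="utf-8") as fh:
        for _ in range(batch_size):
            fh.write(json.dumps(make_trip()) + "\n")
    # Rename only after the file is complete, so a reader that lists the
    # folder does not pick up a half-written file under its final name.
    return tmp_path.rename(output_dir / name)


def run(output_dir="stream_data/transportation", batches=3, interval_s=1.0):
    """Emit a few batches, pausing between them to imitate a live feed."""
    written = []
    for _ in range(batches):
        written.append(write_batch(output_dir))
        time.sleep(interval_s)
    return written
```

Calling `run()` from the project root would drop three small files into `stream_data/transportation`, one per second. Writing each file under a temporary name and renaming it afterwards is a design choice for file-based streaming inputs, where a new file should appear only once it is complete.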
## Practicum Objectives
The main objectives of the Week 6 practicum:
1. Understand the real-time pipeline flow from data generator to dashboard.
2. Apply **Structured Streaming** to process continuous data.
3. Apply large-scale visualization strategies (sampling and window aggregation).
4. Present real-time operational metrics as a basis for decision making.
5. Integrate analytics, alerts, and monitoring in a single system.
## Week 6 Technical Achievements
Features emphasized this week:
1. Streaming ingestion of real-time data (e-commerce and transportation).
2. Window aggregation to visualize traffic per time interval.
3. Downsampling/subsetting of data for lighter visualizations.
4. Rule-based alerts (high traffic and high fare).
5. Trip anomaly detection based on a fare threshold.
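
Window aggregation and downsampling, the two visualization strategies listed above, can be illustrated without Spark. The sketch below groups trips into fixed (tumbling) time windows and then thins a long series to a bounded number of points. It is an assumption-laden illustration rather than the project's implementation: the record keys `timestamp` and `fare`, the 60-second window, and both function names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone


def window_start(ts, window_s):
    """Floor a timezone-aware timestamp to the start of its tumbling window."""
    epoch = int(ts.timestamp())
    return datetime.fromtimestamp(epoch - epoch % window_s, tz=timezone.utc)


def aggregate_by_window(trips, window_s=60):
    """Return {window_start: {"trips": n, "total_fare": x}} ordered by time."""
    buckets = defaultdict(lambda: {"trips": 0, "total_fare": 0.0})
    for trip in trips:
        key = window_start(trip["timestamp"], window_s)
        buckets[key]["trips"] += 1
        buckets[key]["total_fare"] += trip["fare"]
    return dict(sorted(buckets.items()))


def downsample(rows, max_points):
    """Keep at most max_points evenly spaced rows (systematic sampling)."""
    if max_points <= 0:
        return []
    if len(rows) <= max_points:
        return list(rows)
    step = len(rows) / max_points
    return [rows[int(i * step)] for i in range(max_points)]
```

Aggregating first and sampling second keeps the chart payload small: the dashboard plots one point per window instead of one per trip, and `downsample` caps the number of windows drawn when the time range is long.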
## System Architecture
```mermaid
flowchart LR
A["Data Generator (JSON)"] --> B["Streaming Input Folder"]
B --> C["PySpark Structured Streaming"]
C --> D["Serving Layer (Parquet/CSV)"]
D --> E["Analytics & Alert Module"]
E --> F["Streamlit Dashboard"]
F --> G["Monitoring & Decision Support"]
```
## Project Structure
```bash
bigdata-project/
├── .venv/  # Local virtual environment
├── alerts/  # Alert module for the transportation use case
│   ├── __init__.py
│   └── transportation_alert.py  # Rule-based alerts (traffic/fare)
├── analytics/  # Analytics module for transportation
│   ├── __init__.py
│   └── transportation_analytics.py  # KPIs, trends, anomaly detection
├── dashboard/  # Streamlit dashboard applications
│   ├── dashboard_streamlit.py  # Real-time e-commerce dashboard
│   └── dashboard_transportation.py  # Decision-oriented transportation dashboard
├── data/
│   ├── checkpoints/  # Spark streaming checkpoints
│   │   └── transportation/
│   ├── clean/  # Cleaned data (parquet/partitioned)
│   ├── curated/  # Business aggregates
│   ├── raw/
│   │   └── ecommerce_raw.csv  # Main raw dataset for the batch pipeline
│   └── serving/  # Data ready for dashboard consumption
│       ├── avg_transaction/
│       ├── category_revenue/
│       ├── stream/  # E-commerce streaming output
│       ├── top_products/
│       ├── total_revenue/
│       └── transportation/  # Transportation streaming output
├── logs/
│   ├── batch_pipeline.log  # Batch pipeline log
│   └── stream_checkpoint/  # E-commerce streaming checkpoint
├── screenshots/  # Screenshots documenting the practicum results
├── scripts/  # Main practicum pipelines
│   ├── analytics_layer.py  # Analytics + serving layer (e-commerce)
│   ├── batch_pipeline_enterprise.py  # Batch processing pipeline
│   ├── streaming_layer.py  # E-commerce streaming ingestion
│   ├── transaction_generator.py  # E-commerce transaction generator
│   └── transportation/
│       ├── streaming_trip_layer.py  # Transportation streaming ingestion
│       └── trip_generator.py  # Transportation trip generator
├── stream_data/  # Simulated streaming input
│   └── transportation/
├── .gitignore
├── CONTRIBUTING.md
├── LICENSE
└── README.md
```
## Main Components
1. **Generator Layer**
   - `scripts/transaction_generator.py`: continuously produces e-commerce transactions as JSON.
   - `scripts/transportation/trip_generator.py`: produces transportation trip data as JSON.
2. **Streaming Processing Layer**
   - `scripts/streaming_layer.py`: reads `stream_data/` and writes to `data/serving/stream`.
   - `scripts/transportation/streaming_trip_layer.py`: reads `stream_data/transportation` and writes to `data/serving/transportation`.
3. **Analytics & Alert Layer**
   - `analytics/transportation_analytics.py`: metrics, trends, window aggregation, anomaly detection.
   - `alerts/transportation_alert.py`: rule-based alerts for traffic and fare conditions.
4. **Visualization Layer**
   - `dashboard/dashboard_streamlit.py`: real-time e-commerce dashboard.
   - `dashboard/dashboard_transportation.py`: transportation dashboard with the Week 6 features.
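
The analytics and alert layer described above applies simple rules to the trips in the current window. The following sketch shows one way such rules might look. It does not reproduce `alerts/transportation_alert.py` or `analytics/transportation_analytics.py`: the threshold values, the `Alert` class, and both function names are hypothetical, and the README does not state the real thresholds.

```python
from dataclasses import dataclass

# Illustrative thresholds only; the project's real values are not given here.
HIGH_TRAFFIC_TRIPS_PER_WINDOW = 100
HIGH_FARE_THRESHOLD = 100_000.0


@dataclass(frozen=True)
class Alert:
    level: str
    message: str


def detect_fare_anomalies(trips, fare_threshold=HIGH_FARE_THRESHOLD):
    """Return the trips whose fare is strictly above the threshold."""
    return [trip for trip in trips if trip["fare"] > fare_threshold]


def evaluate_alerts(
    trips_in_window,
    traffic_threshold=HIGH_TRAFFIC_TRIPS_PER_WINDOW,
    fare_threshold=HIGH_FARE_THRESHOLD,
):
    """Apply the high-traffic and high-fare rules to one window of trips."""
    alerts = []
    trip_count = len(trips_in_window)
    if trip_count >= traffic_threshold:
        alerts.append(Alert("warning", f"High traffic: {trip_count} trips in window"))
    anomalies = detect_fare_anomalies(trips_in_window, fare_threshold)
    if anomalies:
        alerts.append(
            Alert("critical", f"High fare: {len(anomalies)} trip(s) above {fare_threshold}")
        )
    return alerts
```

Keeping anomaly detection as its own function lets a dashboard reuse it for an anomaly table, while `evaluate_alerts` only summarizes whether a rule fired.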
## Screenshot Evidence
Screenshots of the practicum results, stored in `screenshots/`:
- Project structure
- Transaction generator
- Spark Streaming
- `data/serving` folder
- Real-time dashboard 1
- Real-time dashboard 2
- Real-time dashboard 3
- Real-time dashboard 4

---
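## Environment Setup
### 1) Prerequisites
1. Python 3.10+ (3.12 recommended).
2. Java 8/11+ (required by Spark). Check which Java versions your installed PySpark release supports; Spark 4.x requires Java 17 or later.
3. `pip` and a virtual environment.
### 2) Create a Virtual Environment
For Linux/macOS: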
```bash
python -m venv .venv
source .venv/bin/activate
```
For PowerShell:
```powershell
python -m venv .venv
.venv\Scripts\Activate.ps1
```
### 3) Install Dependencies
```bash
pip install pyspark streamlit pandas pyarrow
```
## Running the Project
Use several terminals in parallel to simulate real-time processing.
### A. E-Commerce Pipeline (Batch + Real-Time)
1. Run the batch pipeline:
```bash
python scripts/batch_pipeline_enterprise.py
```
2. Run the analytics layer to serve KPIs:
```bash
python scripts/analytics_layer.py
```
3. Run the real-time transaction generator:
```bash
python scripts/transaction_generator.py
```
4. Run the Spark streaming consumer:
```bash
python scripts/streaming_layer.py
```
5. Run the e-commerce dashboard:
```bash
streamlit run dashboard/dashboard_streamlit.py
```
### B. Smart Transportation Pipeline (Real-Time + Large-Scale Visualization)
1. Run the trip generator:
```bash
python scripts/transportation/trip_generator.py
```
2. Run the streaming trip layer:
```bash
python scripts/transportation/streaming_trip_layer.py
```
3. Run the transportation dashboard:
```bash
streamlit run dashboard/dashboard_transportation.py
```
## Generated Output
1. **Batch Layer**
   - `data/clean/parquet/`
   - `data/clean/partitioned_by_category/`
   - `data/curated/category_revenue/`
   - `data/curated/top_products/`
   - `data/curated/avg_transaction/`
2. **Serving Layer**
   - `data/serving/total_revenue/`
   - `data/serving/top_products/`
   - `data/serving/category_revenue/`
   - `data/serving/avg_transaction/`
   - `data/serving/stream/`
   - `data/serving/transportation/`
3. **Checkpoints and Logs**
   - `logs/stream_checkpoint/`
   - `data/checkpoints/transportation/`
   - `logs/batch_pipeline.log`
## Validating the Results
Signs that the pipeline is running correctly:
1. New JSON files keep appearing in `stream_data/` and `stream_data/transportation/`.
2. New Parquet files appear in `data/serving/stream/` and `data/serving/transportation/`.
3. The dashboard shows metrics that change at every refresh interval.
4. Alerts appear when volume is high or a fare exceeds the threshold.
5. The anomaly table lists abnormal (high-fare) trips, if any.
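
The first two indicators can be checked from a script instead of by eye. The sketch below counts files in the input and serving folders and compares two snapshots taken some seconds apart. The folder names come from this README; the `*.json` and `*.parquet` patterns and the helper names `count_files` and `growing` are assumptions for illustration.

```python
from pathlib import Path

# Folders are taken from the README; the filename patterns are assumptions.
CHECKS = {
    "stream_data": "*.json",
    "stream_data/transportation": "*.json",
    "data/serving/stream": "*.parquet",
    "data/serving/transportation": "*.parquet",
}


def count_files(root, checks=CHECKS):
    """Count matching files directly inside each folder (0 if it is missing)."""
    root = Path(root)
    counts = {}
    for folder, pattern in checks.items():
        directory = root / folder
        if directory.is_dir():
            counts[folder] = sum(1 for p in directory.glob(pattern) if p.is_file())
        else:
            counts[folder] = 0
    return counts


def growing(before, after):
    """Report, per folder, whether the file count increased between snapshots."""
    return {folder: after[folder] > before.get(folder, 0) for folder in after}
```

A typical check would call `count_files(".")` from the project root, wait a little longer than the generator interval, call it again, and pass both results to `growing`. A folder that stays `False` while the generator and streaming job are running points at the stage that has stalled.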
## Troubleshooting
1. If Spark fails to start, check Java:
```bash
java -version
```
2. If the dashboard is empty:
   - make sure the generator and the streaming job are running,
   - make sure the serving output folders contain data.
3. If the dashboard fails to read Parquet files:
   - make sure `pyarrow` is installed.
4. If stale data causes conflicts:
   - stop all streaming processes,
   - clear the specific output folders you want to regenerate (optional),
   - rerun the pipeline from the start.
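
The Java and `pyarrow` checks above can also be run together from Python before starting any job. This is a small sketch using only the standard library; the function names are hypothetical, and the package list simply mirrors the `pip install` command in this README.

```python
import importlib.util
import shutil
import subprocess


def java_version():
    """Return the first line of `java -version`, or None if java is not on PATH."""
    java = shutil.which("java")
    if java is None:
        return None
    result = subprocess.run(
        [java, "-version"], capture_output=True, text=True, check=False, timeout=10
    )
    # `java -version` writes its banner to stderr, so check that stream first.
    output = (result.stderr or result.stdout).strip()
    return output.splitlines()[0] if output else ""


def missing_packages(names=("pyspark", "streamlit", "pandas", "pyarrow")):
    """Return the packages from `names` that cannot be imported."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

If `java_version()` returns `None`, Spark will not be able to start. If `missing_packages()` returns a non-empty list, installing those packages in the active virtual environment is the first thing to try.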
## Conclusion
This Week 6 practicum demonstrates a **real-time analytics** system that not only processes streaming data but also delivers visualizations better suited to large-scale data, through window aggregation, data sampling, and monitoring on an interactive dashboard.