{"id":48768513,"url":"https://github.com/mkaspulanwar/p6_bigdata_realtime_largescale_visualization","last_synced_at":"2026-04-13T09:01:47.228Z","repository":{"id":349374532,"uuid":"1199159347","full_name":"mkaspulanwar/p6_bigdata_realtime_largescale_visualization","owner":"mkaspulanwar","description":"Praktikum Week 6 Big Data: Real-time analytics dan visualisasi data skala besar menggunakan PySpark Structured Streaming, Parquet Data Lake, dan Streamlit untuk monitoring mobilitas dan traffic smart city.","archived":false,"fork":false,"pushed_at":"2026-04-05T15:43:02.000Z","size":528715,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-05T17:24:27.071Z","etag":null,"topics":["big-data","data-visualization","pyspark","spark-streaming","streamlit","traffic-analytics"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mkaspulanwar.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-02T05:13:01.000Z","updated_at":"2026-04-05T15:43:07.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/mkaspulanwar/p6_bigdata_realtime_largescale_visualization","commit_stats":null,"previous_names":["mkaspulanwar/p6_bigdata_realtime_largescale_visualization"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/mkaspulanwar/p6_bigdata_realt
ime_largescale_visualization","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mkaspulanwar%2Fp6_bigdata_realtime_largescale_visualization","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mkaspulanwar%2Fp6_bigdata_realtime_largescale_visualization/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mkaspulanwar%2Fp6_bigdata_realtime_largescale_visualization/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mkaspulanwar%2Fp6_bigdata_realtime_largescale_visualization/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mkaspulanwar","download_url":"https://codeload.github.com/mkaspulanwar/p6_bigdata_realtime_largescale_visualization/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mkaspulanwar%2Fp6_bigdata_realtime_largescale_visualization/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31746113,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-13T06:26:45.479Z","status":"ssl_error","status_checked_at":"2026-04-13T06:26:44.645Z","response_time":93,"last_error":"SSL_read: unexpected eof while 
reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["big-data","data-visualization","pyspark","spark-streaming","streamlit","traffic-analytics"],"created_at":"2026-04-13T09:01:43.914Z","updated_at":"2026-04-13T09:01:47.223Z","avatar_url":"https://github.com/mkaspulanwar.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Big Data Practicum Week 6: Real-Time Analytics \u0026 Large-Scale Data Visualization\r\n\r\n![Python](https://img.shields.io/badge/Python-3.12-blue?logo=python\u0026logoColor=white)\r\n![Apache Spark](https://img.shields.io/badge/Apache%20Spark-Structured%20Streaming-E25A1C?logo=apachespark\u0026logoColor=white)\r\n![Streamlit](https://img.shields.io/badge/Streamlit-Real--Time%20Dashboard-FF4B4B?logo=streamlit\u0026logoColor=white)\r\n![Pandas](https://img.shields.io/badge/Pandas-Analytics-150458?logo=pandas\u0026logoColor=white)\r\n![Parquet](https://img.shields.io/badge/Format-Parquet-0A9EDC)\r\n![Big Data](https://img.shields.io/badge/Focus-Scalable%20Visualization-2E7D32)\r\n\r\n## Practicum Team\r\n\r\n| Role | Name | Student ID (NIM) | GitHub Profile |\r\n| :--- | :--- | :--- | :--- |\r\n| **Project Developer** | M. Kaspul Anwar | 230104040212 | [![](https://img.shields.io/badge/GitHub-mkaspulanwar-181717?style=flat\u0026logo=github)](https://github.com/mkaspulanwar) |\r\n| **Course Lecturer** | Muhayat, M. 
IT | - | [![](https://img.shields.io/badge/GitHub-muhayat--lab-181717?style=flat\u0026logo=github)](https://github.com/muhayat-lab) |\r\n\r\n---\r\n\r\n## Project Description\r\n\r\nThis Week 6 project focuses on implementing **Real-Time Analytics** and **Large-Scale Data Visualization** with an end-to-end approach:\r\n\r\n1. Transaction/trip data is simulated as a stream in JSON format.\r\n2. The streaming data is processed with **PySpark Structured Streaming**.\r\n3. The stream output is stored in a Parquet-based **serving layer**.\r\n4. The **Streamlit** dashboard displays KPIs, trends, distributions, window aggregations, and anomaly detection in near real time.\r\n\r\nThe implementation covers two use cases:\r\n\r\n1. **Real-Time E-Commerce Analytics**.\r\n2. **Smart Transportation Analytics** (including alerts \u0026 anomalies).\r\n\r\n## Practicum Objectives\r\n\r\nThe main objectives of the Week 6 practicum:\r\n\r\n1. Understand the real-time pipeline flow from the data generator to the dashboard.\r\n2. Apply **Structured Streaming** to continuous data processing.\r\n3. Apply large-scale visualization strategies (sampling and window aggregation).\r\n4. Present real-time operational metrics as a basis for decision-making.\r\n5. Integrate analytics, alerts, and monitoring in a single system.\r\n\r\n## Week 6 Technical Achievements\r\n\r\nFeatures emphasized this week:\r\n\r\n1. Real-time streaming data ingestion (e-commerce and transportation).\r\n2. Window aggregation for traffic visualization per time interval.\r\n3. Downsampling/data subsetting for lighter visualization.\r\n4. Rule-based alerts (high traffic and high fare).\r\n5. 
Trip anomaly detection based on a fare threshold.\r\n\r\n## System Architecture\r\n\r\n```mermaid\r\nflowchart LR\r\n    A[\"Data Generator (JSON)\"] --\u003e B[\"Streaming Input Folder\"]\r\n    B --\u003e C[\"PySpark Structured Streaming\"]\r\n    C --\u003e D[\"Serving Layer (Parquet/CSV)\"]\r\n    D --\u003e E[\"Analytics \u0026 Alert Module\"]\r\n    E --\u003e F[\"Streamlit Dashboard\"]\r\n    F --\u003e G[\"Monitoring \u0026 Decision Support\"]\r\n```\r\n\r\n## Project Structure\r\n\r\n```bash\r\nbigdata-project/\r\n├── .venv/                                 # Local virtual environment\r\n├── alerts/                                # Alert module for the transportation use case\r\n│   ├── __init__.py\r\n│   └── transportation_alert.py            # Rule-based alert (traffic/fare)\r\n├── analytics/                             # Analytics module for transportation\r\n│   ├── __init__.py\r\n│   └── transportation_analytics.py        # KPI, trend, anomaly detection\r\n├── dashboard/                             # Streamlit dashboard application\r\n│   ├── dashboard_streamlit.py             # Real-time e-commerce dashboard\r\n│   └── dashboard_transportation.py        # Decision-oriented transportation dashboard\r\n├── data/\r\n│   ├── checkpoints/                       # Spark streaming checkpoint\r\n│   │   └── transportation/\r\n│   ├── clean/                             # Cleaned data (parquet/partitioned)\r\n│   ├── curated/                           # Business aggregate data\r\n│   ├── raw/\r\n│   │   └── ecommerce_raw.csv              # Main raw dataset for batch\r\n│   └── serving/                           # Dashboard-ready data\r\n│       ├── avg_transaction/\r\n│       ├── category_revenue/\r\n│       ├── stream/                        # E-commerce streaming output\r\n│       ├── top_products/\r\n│       ├── total_revenue/\r\n│       └── transportation/                # Transportation streaming output\r\n├── logs/\r\n│   ├── batch_pipeline.log          
       # Batch pipeline process log\r\n│   └── stream_checkpoint/                 # E-commerce streaming checkpoint\r\n├── screenshots/                           # Screenshots documenting the practicum results\r\n├── scripts/                               # Main practicum pipelines\r\n│   ├── analytics_layer.py                 # Analytics + serving layer (e-commerce)\r\n│   ├── batch_pipeline_enterprise.py       # Batch processing pipeline\r\n│   ├── streaming_layer.py                 # E-commerce streaming ingestion\r\n│   ├── transaction_generator.py           # E-commerce transaction generator\r\n│   └── transportation/\r\n│       ├── streaming_trip_layer.py        # Transportation streaming ingestion\r\n│       └── trip_generator.py              # Transportation trip generator\r\n├── stream_data/                           # Simulated streaming data input\r\n│   └── transportation/\r\n├── .gitignore\r\n├── CONTRIBUTING.md\r\n├── LICENSE\r\n└── README.md\r\n```\r\n\r\n## Main Components\r\n\r\n1. **Generator Layer**\r\n   - `scripts/transaction_generator.py`: continuously generates e-commerce transactions as JSON.\r\n   - `scripts/transportation/trip_generator.py`: generates transportation trip data as JSON.\r\n2. **Streaming Processing Layer**\r\n   - `scripts/streaming_layer.py`: reads `stream_data/` and writes to `data/serving/stream`.\r\n   - `scripts/transportation/streaming_trip_layer.py`: reads `stream_data/transportation` and writes to `data/serving/transportation`.\r\n3. **Analytics \u0026 Alert Layer**\r\n   - `analytics/transportation_analytics.py`: metrics, trends, window aggregation, anomaly detection.\r\n   - `alerts/transportation_alert.py`: rule-based alerts for traffic/fare conditions.\r\n4. 
**Visualization Layer**\r\n   - `dashboard/dashboard_streamlit.py`: real-time e-commerce dashboard.\r\n   - `dashboard/dashboard_transportation.py`: transportation dashboard with the Week 6 features.\r\n\r\n## Screenshot Evidence\r\n\r\n\u003ctable\u003e\r\n\u003ctr\u003e\r\n\u003ctd align=\"center\"\u003e\u003cb\u003eProject Structure\u003c/b\u003e\u003c/td\u003e\r\n\u003ctd align=\"center\"\u003e\u003cb\u003eTransaction Generator\u003c/b\u003e\u003c/td\u003e\r\n\u003c/tr\u003e\r\n\u003ctr\u003e\r\n\u003ctd\u003e\u003cimg src=\"screenshots/struktur_project.png\"/\u003e\u003c/td\u003e\r\n\u003ctd\u003e\u003cimg src=\"screenshots/trip_generator.png\"/\u003e\u003c/td\u003e\r\n\u003c/tr\u003e\r\n\r\n\u003ctr\u003e\r\n\u003ctd align=\"center\"\u003e\u003cb\u003eSpark Streaming\u003c/b\u003e\u003c/td\u003e\r\n\u003ctd align=\"center\"\u003e\u003cb\u003edata/serving Folder\u003c/b\u003e\u003c/td\u003e\r\n\u003c/tr\u003e\r\n\u003ctr\u003e\r\n\u003ctd\u003e\u003cimg src=\"screenshots/spark_streaming.png\"/\u003e\u003c/td\u003e\r\n\u003ctd\u003e\u003cimg src=\"screenshots/data_serving.png\"/\u003e\u003c/td\u003e\r\n\u003c/tr\u003e\r\n\r\n\u003ctr\u003e\r\n\u003ctd align=\"center\"\u003e\u003cb\u003eReal-Time Dashboard 1\u003c/b\u003e\u003c/td\u003e\r\n\u003ctd align=\"center\"\u003e\u003cb\u003eReal-Time Dashboard 2\u003c/b\u003e\u003c/td\u003e\r\n\u003c/tr\u003e\r\n\u003ctr\u003e\r\n\u003ctd\u003e\u003cimg src=\"screenshots/dashboard_1.png\"/\u003e\u003c/td\u003e\r\n\u003ctd\u003e\u003cimg src=\"screenshots/dashboard_2.png\"/\u003e\u003c/td\u003e\r\n\u003c/tr\u003e\r\n\r\n\u003ctr\u003e\r\n\u003ctd align=\"center\"\u003e\u003cb\u003eReal-Time Dashboard 3\u003c/b\u003e\u003c/td\u003e\r\n\u003ctd align=\"center\"\u003e\u003cb\u003eReal-Time Dashboard 4\u003c/b\u003e\u003c/td\u003e\r\n\u003c/tr\u003e\r\n\u003ctr\u003e\r\n\u003ctd\u003e\u003cimg src=\"screenshots/dashboard_3.png\"/\u003e\u003c/td\u003e\r\n\u003ctd\u003e\u003cimg 
src=\"screenshots/dashboard_4.png\"/\u003e\u003c/td\u003e\r\n\u003c/tr\u003e\r\n\u003c/table\u003e\r\n\r\n---\r\n\r\n## Setup Environment\r\n\r\n### 1) Prerequisites\r\n\r\n1. Python 3.10+ (3.12 recommended).\r\n2. Java 8/11+ (required by Spark).\r\n3. `pip` and a virtual environment.\r\n\r\n### 2) Create a Virtual Environment\r\n\r\nFor Linux/macOS:\r\n\r\n```bash\r\npython -m venv .venv\r\nsource .venv/bin/activate\r\n```\r\n\r\nFor PowerShell:\r\n\r\n```powershell\r\npython -m venv .venv\r\n.venv\\Scripts\\Activate.ps1\r\n```\r\n\r\n### 3) Install Dependencies\r\n\r\n```bash\r\npip install pyspark streamlit pandas pyarrow\r\n```\r\n\r\n## Running the Project\r\n\r\nUse several terminals in parallel to simulate real-time operation.\r\n\r\n### A. E-Commerce Pipeline (Batch + Real-Time)\r\n\r\n1. Run the batch pipeline:\r\n\r\n```bash\r\npython scripts/batch_pipeline_enterprise.py\r\n```\r\n\r\n2. Run the analytics layer to serve the KPIs:\r\n\r\n```bash\r\npython scripts/analytics_layer.py\r\n```\r\n\r\n3. Run the real-time transaction generator:\r\n\r\n```bash\r\npython scripts/transaction_generator.py\r\n```\r\n\r\n4. Run the Spark streaming consumer:\r\n\r\n```bash\r\npython scripts/streaming_layer.py\r\n```\r\n\r\n5. Run the e-commerce dashboard:\r\n\r\n```bash\r\nstreamlit run dashboard/dashboard_streamlit.py\r\n```\r\n\r\n### B. Smart Transportation Pipeline (Real-Time + Large-Scale Visualization)\r\n\r\n1. Run the trip generator:\r\n\r\n```bash\r\npython scripts/transportation/trip_generator.py\r\n```\r\n\r\n2. Run the streaming trip layer:\r\n\r\n```bash\r\npython scripts/transportation/streaming_trip_layer.py\r\n```\r\n\r\n3. Run the transportation dashboard:\r\n\r\n```bash\r\nstreamlit run dashboard/dashboard_transportation.py\r\n```\r\n\r\n## Generated Output\r\n\r\n1. 
**Batch Layer**\r\n   - `data/clean/parquet/`\r\n   - `data/clean/partitioned_by_category/`\r\n   - `data/curated/category_revenue/`\r\n   - `data/curated/top_products/`\r\n   - `data/curated/avg_transaction/`\r\n2. **Serving Layer**\r\n   - `data/serving/total_revenue/`\r\n   - `data/serving/top_products/`\r\n   - `data/serving/category_revenue/`\r\n   - `data/serving/avg_transaction/`\r\n   - `data/serving/stream/`\r\n   - `data/serving/transportation/`\r\n3. **Checkpoints and Logs**\r\n   - `logs/stream_checkpoint/`\r\n   - `data/checkpoints/transportation/`\r\n   - `logs/batch_pipeline.log`\r\n\r\n## Validating the Practicum Results\r\n\r\nIndicators that the pipeline is running correctly:\r\n\r\n1. New JSON files keep appearing in the `stream_data/` and `stream_data/transportation/` folders.\r\n2. New parquet files appear in `data/serving/stream/` and `data/serving/transportation/`.\r\n3. The dashboard shows metrics that keep changing at each refresh interval.\r\n4. Alerts appear when volume is high or a fare exceeds the threshold.\r\n5. The anomaly table lists abnormal trips (high fare), if any.\r\n\r\n## Troubleshooting\r\n\r\n1. If Spark fails to start, check Java:\r\n\r\n```bash\r\njava -version\r\n```\r\n\r\n2. If the dashboard is empty:\r\n   - make sure the generator and streaming job are running,\r\n   - make sure the serving output folder has been populated.\r\n3. If parquet files cannot be read in the dashboard:\r\n   - make sure `pyarrow` is installed.\r\n4. 
If stale data causes conflicts:\r\n   - stop all stream processes,\r\n   - clear the specific output folders you want to regenerate (optional),\r\n   - rerun the pipeline from the start.\r\n\r\n## Closing\r\n\r\nThis Week 6 practicum demonstrates a **real-time analytics** system that not only processes streaming data but also presents visualizations better suited to large scale through window aggregation, data sampling, and interactive dashboard-based monitoring.\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmkaspulanwar%2Fp6_bigdata_realtime_largescale_visualization","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmkaspulanwar%2Fp6_bigdata_realtime_largescale_visualization","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmkaspulanwar%2Fp6_bigdata_realtime_largescale_visualization/lists"}