- Host: GitHub
- URL: https://github.com/archie-cm/real_time_analytics_with_spark_streaming_on_dataproc
- Owner: archie-cm
- Created: 2025-01-08T10:06:48.000Z (13 days ago)
- Default Branch: main
- Last Pushed: 2025-01-08T14:41:16.000Z (13 days ago)
- Last Synced: 2025-01-20T05:46:07.316Z (1 day ago)
- Topics: pubsub, spark-streaming
- Language: Jupyter Notebook
- Size: 10.7 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Real-Time Analytics with Spark Streaming on Dataproc
This project demonstrates how to build a real-time analytics pipeline using Spark Streaming on Google Cloud Platform (GCP). The pipeline processes real-time data from Pub/Sub Lite and joins it with static datasets to deliver insights such as product discounts and popular product recommendations.
## Table of Contents
- [Overview](#overview)
- [Prerequisites](#prerequisites)
- [Architecture](#architecture)
- [Setup](#setup)
- [1. Setup Pub/Sub Lite](#1-setup-pubsub-lite)
- [2. Setup Dataproc and Submit Streaming Jobs](#2-setup-dataproc-and-submit-streaming-jobs)
- [3. Write Data to Pub/Sub for Testing](#3-write-data-to-pubsub-for-testing)
- [Windows Explained](#windows-explained)
- [Scripts](#scripts)
- [Usage](#usage)
- [License](#license)

---
## Overview
This pipeline is designed for real-time analytics on streaming data using Spark Streaming and Pub/Sub Lite. It includes features such as tumbling windows, sliding windows, watermarking, and joining static and streaming datasets to deliver actionable insights.

## Prerequisites
- Google Cloud account
- GCP SDK installed
- `gcloud` CLI configured
- Python 3.x installed locally

## Architecture
1. Real-time data is published to Pub/Sub Lite topics.
2. Spark Streaming jobs on Dataproc consume the data and process it in real-time.
3. Insights are generated by joining streaming data with static datasets or other streaming data sources.

## Setup
### 1. Setup Pub/Sub Lite
1. Enable Pub/Sub Lite API:
```bash
gcloud services enable pubsublite.googleapis.com
```

2. Create a topic:
```bash
gcloud pubsublite topics create product-stream-topic \
--location=us-central1-a \
--partitions=1 \
--per-partition-bytes=30GiB \
--retention-period=24h
```

3. Create a subscription:
```bash
gcloud pubsublite subscriptions create product-stream-subscription \
--location=us-central1-a \
--topic=product-stream-topic \
--delivery-requirement=deliver-after-stored
```

### 2. Setup Dataproc and Submit Streaming Jobs
1. Submit the product discount streaming job:
```bash
gcloud dataproc jobs submit pyspark streaming_product_discounts.py \
--cluster=spark-streaming \
--region=us-central1 \
--jars=gs://spark-lib/pubsublite/pubsublite-spark-sql-streaming-LATEST-with-dependencies.jar
```

2. Submit the popular product recommendation job:
```bash
gcloud dataproc jobs submit pyspark popular_products_recommendation.py \
--cluster=spark-streaming \
--region=us-central1 \
--jars=gs://spark-lib/pubsublite/pubsublite-spark-sql-streaming-LATEST-with-dependencies.jar
```

### 3. Write Data to Pub/Sub for Testing
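For orientation before running the notebook, here is a hedged sketch of constructing one test message. The field names are assumptions (the notebook's actual schema may differ), and the publish call itself, which would use the Pub/Sub Lite Python client, is omitted:

```python
import json
import time

def make_test_event(product_id: str, event_type: str) -> bytes:
    # Pub/Sub Lite messages carry raw bytes; a JSON payload with an
    # event-time field lets Spark do event-time windowing downstream.
    event = {
        "product_id": product_id,
        "event_type": event_type,  # e.g. "browse" or "purchase"
        "event_timestamp": time.time(),
    }
    return json.dumps(event).encode("utf-8")

payload = make_test_event("P-001", "purchase")
decoded = json.loads(payload)
print(decoded["product_id"], decoded["event_type"])
# Publishing would then use google.cloud.pubsublite.cloudpubsub.PublisherClient
# with the topic created in step 1 (client code not shown here).
```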
Run the `write_data_to_pubsub.ipynb` notebook, which includes:

1. Publishing test data for real-time processing.
2. Testing with different window types:
- Tumbling windows
- Sliding windows
- Windows with watermarks
   - Joining data streams with static datasets (e.g., `user_product_discounts.csv`).

### Key Sections in `write_data_to_pubsub.ipynb`
- Publish data for testing.
- Publish data for tumbling windows.
- Publish data for windows with watermarks.
- Write data to Pub/Sub and join with static datasets.
- Generate browsing and purchase events for joining two streaming DataFrames.

## Windows Explained
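The subsections below define each window type. As a quick pure-Python preview (no Spark required; function names are illustrative), this sketch computes which window(s) a given event timestamp falls into:

```python
def tumbling_window(ts, size):
    # A timestamp belongs to exactly one fixed, non-overlapping window.
    start = ts - (ts % size)
    return (start, start + size)

def sliding_windows(ts, size, slide):
    # A timestamp can belong to several overlapping windows: every window
    # whose start lies in (ts - size, ts], stepped backwards by `slide`.
    start = ts - (ts % slide)  # latest window start at or before ts
    starts = []
    while start > ts - size:
        starts.append(start)
        start -= slide
    return [(s, s + size) for s in reversed(starts)]

# An event 90 s into the stream, with 60-second windows:
print(tumbling_window(90, 60))      # → (60, 120)
print(sliding_windows(90, 60, 30))  # → [(60, 120), (90, 150)]
```

Tumbling windows assign each event to exactly one window, while sliding windows can assign the same event to several — the same distinction Spark applies during windowed aggregation.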
### Tumbling Window
- A fixed-size window that does not overlap.
- All events within the window are grouped together.
- Example: A 1-minute tumbling window processes all events arriving between `00:00` and `00:01`.

### Sliding Window
- A fixed-size window that overlaps.
- Multiple windows can process the same event if it falls into overlapping intervals.
- Example: A 1-minute sliding window with a 30-second slide interval emits overlapping windows every 30 seconds (`00:00`–`00:01`, `00:00:30`–`00:01:30`, and so on).

### Watermark
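The late-data rule described in this subsection can be sketched in plain Python (names are illustrative; Spark derives the watermark per query from the maximum event time it has observed so far):

```python
def accept_event(event_time, max_event_time_seen, watermark_delay):
    # Rule of thumb: an event is considered too late (and may be dropped
    # from windowed aggregations) if it is older than the current
    # watermark = max event time seen so far - allowed delay.
    watermark = max_event_time_seen - watermark_delay
    return event_time >= watermark

# 10-second watermark, stream has already seen events up to t=100:
print(accept_event(95, 100, 10))  # → True  (within the 10 s allowance)
print(accept_event(85, 100, 10))  # → False (older than the watermark at t=90)
```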
- Used to handle late-arriving data.
- Defines how long the system waits for late data before finalizing a window.
- Example: A 1-minute window with a watermark of 10 seconds processes late events arriving up to 10 seconds after the window closes.

### Session Window
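The gap-based grouping described in this subsection can be sketched in plain Python (event times in seconds; names are illustrative):

```python
def session_windows(event_times, gap):
    # Group sorted event times into sessions: a new event extends the
    # current session if it arrives within `gap` seconds of the previous
    # event; otherwise it starts a new session.
    sessions = []
    current = []
    for t in sorted(event_times):
        if current and t - current[-1] > gap:
            sessions.append(current)
            current = []
        current.append(t)
    if current:
        sessions.append(current)
    return sessions

# 5-minute (300 s) gap: a 400 s pause splits the stream into two sessions.
print(session_windows([0, 100, 250, 650, 700], 300))
# → [[0, 100, 250], [650, 700]]
```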
- Captures events within a defined gap duration.
- New events extend the session if they occur within the gap duration.
- Example: A session window with a 5-minute gap groups events until no events occur for 5 minutes.

## Scripts
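All three scripts below revolve around joins. As a tiny pure-Python illustration of the kind of stream–static enrichment `streaming_product_discounts.py` performs (field names and the default-discount behavior are assumptions, not the script's actual code):

```python
# Static side: a discount table keyed by product, as might be loaded from
# user_product_discounts.csv (field names are assumptions).
static_discounts = {"P-001": 0.10, "P-002": 0.25}

def enrich(stream_batch, discounts):
    # Join each streaming record with the static table; records without a
    # matching key keep a discount of 0.0 (an outer-join-style default).
    return [
        {**event, "discount": discounts.get(event["product_id"], 0.0)}
        for event in stream_batch
    ]

batch = [
    {"product_id": "P-001", "price": 100.0},
    {"product_id": "P-999", "price": 50.0},
]
for row in enrich(batch, static_discounts):
    print(row["product_id"], row["discount"])
# → P-001 0.1
# → P-999 0.0
```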
- `streaming_product_discounts.py`: Consumes product stream data, joins with static discount datasets, and calculates discount insights.
- `popular_products_recommendation.py`: Joins browsing and purchase event streams to recommend popular products.
- `write_data_to_pubsub.ipynb`: Publishes test data to Pub/Sub Lite for various scenarios, including testing windows and joins.

## Usage
1. Configure Pub/Sub Lite topics and subscriptions.
2. Deploy and run Dataproc jobs for real-time analytics.
3. Use the Jupyter notebook to publish test data and validate pipeline functionality.
4. Monitor output in the configured sink or logging service.

## License
This project is licensed under the MIT License. See the LICENSE file for details.