https://github.com/archie-cm/real_time_analytics_with_spark_streaming_on_dataproc

# Real-Time Analytics with Spark Streaming on Dataproc

This project demonstrates how to build a real-time analytics pipeline using Spark Streaming on Google Cloud Platform (GCP). The pipeline processes real-time data from Pub/Sub Lite and joins it with static datasets to deliver insights such as product discounts and popular product recommendations.

## Table of Contents
- [Overview](#overview)
- [Prerequisites](#prerequisites)
- [Architecture](#architecture)
- [Setup](#setup)
  - [1. Setup Pub/Sub Lite](#1-setup-pubsub-lite)
  - [2. Setup Dataproc and Submit Streaming Jobs](#2-setup-dataproc-and-submit-streaming-jobs)
  - [3. Write Data to Pub/Sub for Testing](#3-write-data-to-pubsub-for-testing)
- [Windows Explained](#windows-explained)
- [Scripts](#scripts)
- [Usage](#usage)
- [License](#license)

---

## Overview
This pipeline is designed for real-time analytics on streaming data using Spark Streaming and Pub/Sub Lite. It includes features like tumbling windows, sliding windows, watermarking, and joining static and streaming datasets to deliver actionable insights.

## Prerequisites
- Google Cloud account
- Google Cloud SDK installed, with the `gcloud` CLI configured
- Python 3.x installed locally

## Architecture
1. Real-time data is published to Pub/Sub Lite topics.
2. Spark Streaming jobs on Dataproc consume the data and process it in real time.
3. Insights are generated by joining streaming data with static datasets or other streaming data sources.

## Setup

### 1. Setup Pub/Sub Lite
1. Enable Pub/Sub Lite API:
```bash
gcloud services enable pubsublite.googleapis.com
```

2. Create a topic:
```bash
gcloud pubsublite topics create product-stream-topic \
--location=us-central1-a \
--partitions=1 \
--per-partition-bytes=30GiB \
--retention-period=24h
```

3. Create a subscription:
```bash
gcloud pubsublite subscriptions create product-stream-subscription \
--location=us-central1-a \
--topic=product-stream-topic \
--delivery-requirement=deliver-after-stored
```

### 2. Setup Dataproc and Submit Streaming Jobs
1. Submit the product discount streaming job:
```bash
gcloud dataproc jobs submit pyspark streaming_product_discounts.py \
--cluster=spark-streaming \
--region=us-central1 \
--jars=gs://spark-lib/pubsublite/pubsublite-spark-sql-streaming-LATEST-with-dependencies.jar
```

2. Submit the popular product recommendation job:
```bash
gcloud dataproc jobs submit pyspark popular_products_recommendation.py \
--cluster=spark-streaming \
--region=us-central1 \
--jars=gs://spark-lib/pubsublite/pubsublite-spark-sql-streaming-LATEST-with-dependencies.jar
```

### 3. Write Data to Pub/Sub for Testing
Run the `write_data_to_pubsub.ipynb` notebook, which includes:

1. Publishing test data for real-time processing.
2. Testing different windowing and join scenarios:
   - Tumbling windows
   - Sliding windows
   - Windows with watermarks
   - Joining data streams with static datasets (e.g., `user_product_discounts.csv`)
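To make the stream-static join concrete, here is a toy plain-Python version of the per-micro-batch logic; the column names and discount rates are illustrative assumptions, not taken from `user_product_discounts.csv` or the project's Spark code:

```python
def apply_discounts(batch, discounts):
    """Left-join a micro-batch of events against a static discount table."""
    joined = []
    for row in batch:
        rate = discounts.get(row["product_id"], 0.0)  # static-side lookup
        joined.append({**row,
                       "discount": rate,
                       "final_price": row["price"] * (1 - rate)})
    return joined

static_discounts = {"p1": 0.10}  # stands in for user_product_discounts.csv
batch = [{"product_id": "p1", "price": 100.0},
         {"product_id": "p2", "price": 50.0}]
print(apply_discounts(batch, static_discounts))
```

In Spark, the same shape is a join between a streaming DataFrame and a static DataFrame loaded once from the CSV.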

### Key Sections in `write_data_to_pubsub.ipynb`
- Publish data for testing.
- Publish data for tumbling windows.
- Publish data for windows with watermarks.
- Write data to Pub/Sub and join with static datasets.
- Generate browsing and purchase events for joining two streaming DataFrames.
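The browsing/purchase join can likewise be sketched in plain Python; the event fields and the 10-minute bound here are illustrative assumptions, not the notebook's actual schema:

```python
def join_streams(browses, purchases, max_delay_s=600):
    """Pair each purchase with earlier browse events for the same product,
    a toy version of a stream-stream join with a time bound."""
    pairs = []
    for p in purchases:
        for b in browses:
            if (b["product_id"] == p["product_id"]
                    and 0 <= p["ts"] - b["ts"] <= max_delay_s):
                pairs.append((b["ts"], p["ts"], p["product_id"]))
    return pairs

browses = [{"product_id": "p1", "ts": 0}, {"product_id": "p2", "ts": 100}]
purchases = [{"product_id": "p1", "ts": 300}]
print(join_streams(browses, purchases))  # [(0, 300, 'p1')]
```

In Spark Structured Streaming, a stream-stream join like this requires watermarks on both inputs so the engine can discard old join state.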

## Windows Explained
### Tumbling Window
- A fixed-size window that does not overlap.
- All events within the window are grouped together.
- Example: A 1-minute tumbling window processes all events arriving between `00:00` and `00:01`.
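As a plain-Python sketch of the concept (not the project's Spark code), tumbling-window assignment maps each event to exactly one fixed bucket:

```python
from collections import defaultdict

def tumbling_windows(events, size_s=60):
    """Assign each (timestamp_s, value) event to exactly one fixed-size window."""
    buckets = defaultdict(list)
    for ts, value in events:
        start = (ts // size_s) * size_s  # the single window this event falls into
        buckets[(start, start + size_s)].append(value)
    return dict(buckets)

# events at 5 s and 30 s share the [0, 60) window; 65 s lands in [60, 120)
print(tumbling_windows([(5, "a"), (30, "b"), (65, "c")]))
# {(0, 60): ['a', 'b'], (60, 120): ['c']}
```

In PySpark, the equivalent grouping comes from `window(col("timestamp"), "1 minute")`.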

### Sliding Window
- A fixed-size window that overlaps.
- Multiple windows can process the same event if it falls into overlapping intervals.
- Example: A 1-minute sliding window with a 30-second slide interval.
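The overlap is easiest to see by listing every window an event belongs to; again a plain-Python sketch of the concept, not the project's Spark code:

```python
def sliding_windows(ts, size_s=60, slide_s=30):
    """List every (start, end) window containing event time ts;
    window starts are aligned to multiples of the slide interval."""
    windows = []
    start = (ts // slide_s) * slide_s  # latest window start <= ts
    while start > ts - size_s:
        windows.append((start, start + size_s))
        start -= slide_s
    return sorted(windows)

# an event at 65 s falls into two overlapping 1-minute windows
print(sliding_windows(65))  # [(30, 90), (60, 120)]
```

In PySpark, the same assignment comes from `window(col("timestamp"), "1 minute", "30 seconds")`.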

### Watermark
- Used to handle late-arriving data.
- Defines how long the system waits for late data before finalizing a window.
- Example: With a 1-minute window and a 10-second watermark, a late event is still accepted as long as the watermark (the maximum event time seen so far, minus 10 seconds) has not yet passed the end of its window.
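A small simulation makes the late-data rule concrete; this is a plain-Python sketch of the watermark semantics, not Spark's actual implementation:

```python
def run_with_watermark(events, window_s=60, delay_s=10):
    """Accept or drop events in arrival order: an event is too late once the
    watermark (max event time seen so far minus delay_s) has passed the end
    of the window the event belongs to."""
    max_event_time, accepted, dropped = 0, [], []
    for ts in events:
        max_event_time = max(max_event_time, ts)
        watermark = max_event_time - delay_s
        window_end = (ts // window_s + 1) * window_s
        (dropped if window_end <= watermark else accepted).append(ts)
    return accepted, dropped

# the second 55 arrives after event 120 pushed the watermark to 110,
# so its window [0, 60) is already closed
print(run_with_watermark([10, 55, 120, 55, 135]))  # ([10, 55, 120, 135], [55])
```

In PySpark, the delay is declared with `withWatermark("timestamp", "10 seconds")` before the windowed aggregation.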

### Session Window
- Captures events within a defined gap duration.
- New events extend the session if they occur within the gap duration.
- Example: A session window with a 5-minute gap groups events until no events occur for 5 minutes.
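Session assignment can be sketched the same way in plain Python (again a concept illustration, not the project's Spark code):

```python
def session_windows(timestamps, gap_s=300):
    """Group sorted event times into sessions; a new session starts
    whenever the gap since the previous event exceeds gap_s."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap_s:
            sessions[-1].append(ts)  # within the gap: extend current session
        else:
            sessions.append([ts])    # gap exceeded: start a new session
    return sessions

# 100 s gaps keep events in one session; a 400 s gap starts a new one
print(session_windows([0, 100, 200, 600, 650]))  # [[0, 100, 200], [600, 650]]
```

In PySpark (3.2+), the equivalent grouping comes from `session_window(col("timestamp"), "5 minutes")`.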

## Scripts
- `streaming_product_discounts.py`: Consumes product stream data, joins with static discount datasets, and calculates discount insights.
- `popular_products_recommendation.py`: Joins browsing and purchase event streams to recommend popular products.
- `write_data_to_pubsub.ipynb`: Publishes test data to Pub/Sub Lite for various scenarios, including testing windows and joins.

## Usage
1. Configure Pub/Sub Lite topics and subscriptions.
2. Deploy and run Dataproc jobs for real-time analytics.
3. Use the Jupyter notebook to publish test data and validate pipeline functionality.
4. Monitor output in the configured sink or logging service.

## License
This project is licensed under the MIT License. See the LICENSE file for details.