https://github.com/archie-cm/real_time_analytics_with_spark_streaming_on_dataproc

# Real-Time Analytics with Spark Streaming on Dataproc

This project demonstrates how to build a real-time analytics pipeline using Spark Streaming on Google Cloud Platform (GCP). The pipeline processes real-time data from Pub/Sub Lite and joins it with static datasets to deliver insights such as product discounts and popular product recommendations.

## Table of Contents
- [Overview](#overview)
- [Prerequisites](#prerequisites)
- [Architecture](#architecture)
- [Setup](#setup)
  - [1. Setup Pub/Sub Lite](#1-setup-pubsub-lite)
  - [2. Setup Dataproc and Submit Streaming Jobs](#2-setup-dataproc-and-submit-streaming-jobs)
  - [3. Write Data to Pub/Sub for Testing](#3-write-data-to-pubsub-for-testing)
- [Windows Explained](#windows-explained)
- [Scripts](#scripts)
- [Usage](#usage)
- [License](#license)

---

## Overview
This pipeline is designed for real-time analytics on streaming data using Spark Streaming and Pub/Sub Lite. It includes features like tumbling windows, sliding windows, watermarking, and joining static and streaming datasets to deliver actionable insights.

## Prerequisites
- Google Cloud account
- Google Cloud SDK installed, with the `gcloud` CLI configured
- Python 3.x installed locally

## Architecture
1. Real-time data is published to Pub/Sub Lite topics.
2. Spark Streaming jobs on Dataproc consume the data and process it in real time.
3. Insights are generated by joining streaming data with static datasets or other streaming data sources.

## Setup

### 1. Setup Pub/Sub Lite
1. Enable Pub/Sub Lite API:
```bash
gcloud services enable pubsublite.googleapis.com
```

2. Create a topic:
```bash
gcloud pubsublite topics create product-stream-topic \
--location=us-central1-a \
--partitions=1 \
--per-partition-bytes=30GiB \
--retention-period=24h
```

3. Create a subscription:
```bash
gcloud pubsublite subscriptions create product-stream-subscription \
--location=us-central1-a \
--topic=product-stream-topic \
--delivery-requirement=deliver-after-stored
```

### 2. Setup Dataproc and Submit Streaming Jobs
1. Submit the product discount streaming job:
```bash
gcloud dataproc jobs submit pyspark streaming_product_discounts.py \
--cluster=spark-streaming \
--region=us-central1 \
--jars=gs://spark-lib/pubsublite/pubsublite-spark-sql-streaming-LATEST-with-dependencies.jar
```

2. Submit the popular product recommendation job:
```bash
gcloud dataproc jobs submit pyspark popular_products_recommendation.py \
--cluster=spark-streaming \
--region=us-central1 \
--jars=gs://spark-lib/pubsublite/pubsublite-spark-sql-streaming-LATEST-with-dependencies.jar
```

### 3. Write Data to Pub/Sub for Testing
Run the `write_data_to_pubsub.ipynb` notebook, which includes:

1. Publishing test data for real-time processing.
2. Testing different windowing and join scenarios:
   - Tumbling windows
   - Sliding windows
   - Windows with watermarks
   - Joining data streams with static datasets (e.g., `user_product_discounts.csv`)
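To make the stream-static join concrete, here is a toy plain-Python version of the per-micro-batch logic; the column names and discount rates are illustrative assumptions, not taken from `user_product_discounts.csv` or the project's Spark code:

```python
def apply_discounts(batch, discounts):
    """Left-join a micro-batch of events against a static discount table."""
    joined = []
    for row in batch:
        rate = discounts.get(row["product_id"], 0.0)  # static-side lookup
        joined.append({**row,
                       "discount": rate,
                       "final_price": row["price"] * (1 - rate)})
    return joined

static_discounts = {"p1": 0.10}  # stands in for user_product_discounts.csv
batch = [{"product_id": "p1", "price": 100.0},
         {"product_id": "p2", "price": 50.0}]
print(apply_discounts(batch, static_discounts))
```

In Spark, the same shape is a join between a streaming DataFrame and a static DataFrame loaded once from the CSV.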

### Key Sections in `write_data_to_pubsub.ipynb`
- Publish data for testing.
- Publish data for tumbling windows.
- Publish data for windows with watermarks.
- Write data to Pub/Sub and join with static datasets.
- Generate browsing and purchase events for joining two streaming DataFrames.
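The browsing/purchase join can likewise be sketched in plain Python; the event fields and the 10-minute bound here are illustrative assumptions, not the notebook's actual schema:

```python
def join_streams(browses, purchases, max_delay_s=600):
    """Pair each purchase with earlier browse events for the same product,
    a toy version of a stream-stream join with a time bound."""
    pairs = []
    for p in purchases:
        for b in browses:
            if (b["product_id"] == p["product_id"]
                    and 0 <= p["ts"] - b["ts"] <= max_delay_s):
                pairs.append((b["ts"], p["ts"], p["product_id"]))
    return pairs

browses = [{"product_id": "p1", "ts": 0}, {"product_id": "p2", "ts": 100}]
purchases = [{"product_id": "p1", "ts": 300}]
print(join_streams(browses, purchases))  # [(0, 300, 'p1')]
```

In Spark Structured Streaming, a stream-stream join like this requires watermarks on both inputs so the engine can discard old join state.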

## Windows Explained
### Tumbling Window
- A fixed-size window that does not overlap.
- All events within the window are grouped together.
- Example: A 1-minute tumbling window processes all events arriving between `00:00` and `00:01`.
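As a plain-Python sketch of the concept (not the project's Spark code), tumbling-window assignment maps each event to exactly one fixed bucket:

```python
from collections import defaultdict

def tumbling_windows(events, size_s=60):
    """Assign each (timestamp_s, value) event to exactly one fixed-size window."""
    buckets = defaultdict(list)
    for ts, value in events:
        start = (ts // size_s) * size_s  # the single window this event falls into
        buckets[(start, start + size_s)].append(value)
    return dict(buckets)

# events at 5 s and 30 s share the [0, 60) window; 65 s lands in [60, 120)
print(tumbling_windows([(5, "a"), (30, "b"), (65, "c")]))
# {(0, 60): ['a', 'b'], (60, 120): ['c']}
```

In PySpark, the equivalent grouping comes from `window(col("timestamp"), "1 minute")`.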

### Sliding Window
- A fixed-size window that overlaps.
- Multiple windows can process the same event if it falls into overlapping intervals.
- Example: A 1-minute sliding window with a 30-second slide interval.
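The overlap is easiest to see by listing every window an event belongs to; again a plain-Python sketch of the concept, not the project's Spark code:

```python
def sliding_windows(ts, size_s=60, slide_s=30):
    """List every (start, end) window containing event time ts;
    window starts are aligned to multiples of the slide interval."""
    windows = []
    start = (ts // slide_s) * slide_s  # latest window start <= ts
    while start > ts - size_s:
        windows.append((start, start + size_s))
        start -= slide_s
    return sorted(windows)

# an event at 65 s falls into two overlapping 1-minute windows
print(sliding_windows(65))  # [(30, 90), (60, 120)]
```

In PySpark, the same assignment comes from `window(col("timestamp"), "1 minute", "30 seconds")`.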

### Watermark
- Used to handle late-arriving data.
- Defines how long the system waits for late data before finalizing a window.
- Example: With a 1-minute window and a 10-second watermark, a late event is still accepted as long as the watermark (the maximum event time seen so far, minus 10 seconds) has not yet passed the end of its window.
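A small simulation makes the late-data rule concrete; this is a plain-Python sketch of the watermark semantics, not Spark's actual implementation:

```python
def run_with_watermark(events, window_s=60, delay_s=10):
    """Accept or drop events in arrival order: an event is too late once the
    watermark (max event time seen so far minus delay_s) has passed the end
    of the window the event belongs to."""
    max_event_time, accepted, dropped = 0, [], []
    for ts in events:
        max_event_time = max(max_event_time, ts)
        watermark = max_event_time - delay_s
        window_end = (ts // window_s + 1) * window_s
        (dropped if window_end <= watermark else accepted).append(ts)
    return accepted, dropped

# the second 55 arrives after event 120 pushed the watermark to 110,
# so its window [0, 60) is already closed
print(run_with_watermark([10, 55, 120, 55, 135]))  # ([10, 55, 120, 135], [55])
```

In PySpark, the delay is declared with `withWatermark("timestamp", "10 seconds")` before the windowed aggregation.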

### Session Window
- Captures events within a defined gap duration.
- New events extend the session if they occur within the gap duration.
- Example: A session window with a 5-minute gap groups events until no events occur for 5 minutes.
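Session assignment can be sketched the same way in plain Python (again a concept illustration, not the project's Spark code):

```python
def session_windows(timestamps, gap_s=300):
    """Group sorted event times into sessions; a new session starts
    whenever the gap since the previous event exceeds gap_s."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap_s:
            sessions[-1].append(ts)  # within the gap: extend current session
        else:
            sessions.append([ts])    # gap exceeded: start a new session
    return sessions

# 100 s gaps keep events in one session; a 400 s gap starts a new one
print(session_windows([0, 100, 200, 600, 650]))  # [[0, 100, 200], [600, 650]]
```

In PySpark (3.2+), the equivalent grouping comes from `session_window(col("timestamp"), "5 minutes")`.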

## Scripts
- `streaming_product_discounts.py`: Consumes product stream data, joins with static discount datasets, and calculates discount insights.
- `popular_products_recommendation.py`: Joins browsing and purchase event streams to recommend popular products.
- `write_data_to_pubsub.ipynb`: Publishes test data to Pub/Sub Lite for various scenarios, including testing windows and joins.

## Usage
1. Configure Pub/Sub Lite topics and subscriptions.
2. Deploy and run Dataproc jobs for real-time analytics.
3. Use the Jupyter notebook to publish test data and validate pipeline functionality.
4. Monitor output in the configured sink or logging service.

## License
This project is licensed under the MIT License. See the LICENSE file for details.