Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/archie-cm/real_time_product_recommendations_with_machine_learning_on_gcp
This project demonstrates how to build a real-time product recommendation system using Pub/Sub Lite and Apache Spark with Dataproc
https://github.com/archie-cm/real_time_product_recommendations_with_machine_learning_on_gcp
dataproc pubsublite spark
Last synced: 9 days ago
JSON representation
This project demonstrates how to build a real-time product recommendation system using Pub/Sub Lite and Apache Spark with Dataproc
- Host: GitHub
- URL: https://github.com/archie-cm/real_time_product_recommendations_with_machine_learning_on_gcp
- Owner: archie-cm
- Created: 2025-01-08T15:33:36.000Z (13 days ago)
- Default Branch: main
- Last Pushed: 2025-01-08T15:47:54.000Z (13 days ago)
- Last Synced: 2025-01-08T16:53:59.875Z (13 days ago)
- Topics: dataproc, pubsublite, spark
- Language: Python
- Homepage:
- Size: 1.79 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Real-Time Product Recommendations with Pub/Sub and Dataproc
This project demonstrates how to build a real-time product recommendation system using Pub/Sub Lite and Apache Spark with Dataproc. The system leverages the Alternating Least Squares (ALS) machine learning algorithm for collaborative filtering to recommend products in real time.
## Table of Contents
- [Overview](#overview)
- [Prerequisites](#prerequisites)
- [Pipeline Process](#pipeline-process)
- [1. Upload Source Files](#1-upload-source-files)
- [2. Set Up Pub/Sub Lite](#2-set-up-pubsublite)
- [3. Train Recommendation Model](#3-train-recommendation-model)
- [4. Publish Events to Pub/Sub Lite](#4-publish-events-to-pubsublite)
- [5. Set Up Dataproc](#5-set-up-dataproc)
- [6. Model Inference](#6-model-inference)
- [License](#license)---
## Overview
This project uses:
1. Google Pub/Sub Lite for real-time data ingestion.
2. Dataproc with Spark for training and deploying a recommendation model.
3. ALS (Alternating Least Squares) from Apache Spark MLlib for collaborative filtering.## Prerequisites
- Google Cloud account.
- GCP SDK installed.
- `gcloud` CLI configured.
- Python 3.x installed locally.
- Google Cloud Storage bucket for uploading files.## Pipeline Process
### 1. Upload Source Files
Upload the source files (`users.csv`, `products.csv`, `order_items.csv`) to a Google Cloud Storage bucket:
```bash
gsutil cp users.csv gs:///
gsutil cp products.csv gs:///
gsutil cp order_items.csv gs:///
```### 2. Set Up Pub/Sub Lite
1. Enable Pub/Sub Lite API:
```bash
gcloud services enable pubsublite.googleapis.com
```
2. Create a topic:
```bash
gcloud pubsublite topics create recommendations-topic \
--location=us-central1-a \
--partitions=1 \
--per-partition-bytes=30GiB \
--retention-period=24h
```
3. Create a subscription:
```bash
gcloud pubsublite subscriptions create recommendations-subscription \
--location=us-central1-a \
--topic=recommendations-topic \
--delivery-requirement=deliver-after-stored
```### 3. Train Recommendation Model
Train the ALS recommendation model using Apache Spark and save the model to Google Cloud Storage:
```bash
spark-submit model-training.py
```
Make sure the script reads data from the uploaded files in GCS and saves the trained model back to GCS.### 4. Publish Events to Pub/Sub Lite
Run the `publish-pubsublite-events.py` script to publish events to Pub/Sub Lite:
```bash
python publish-pubsublite-events.py
```
This script reads data from your source files and simulates real-time events by publishing them to the Pub/Sub Lite topic.### 5. Set Up Dataproc
Submit a PySpark job to Dataproc to stream recommendations in real time:
```bash
gcloud dataproc jobs submit pyspark streaming-recommendations.py \
--cluster=spark-streaming \
--region=us-central1 \
--jars=gs://spark-lib/pubsublite/pubsublite-spark-sql-streaming-LATEST-with-dependencies.jar
```
This job processes incoming events from Pub/Sub Lite and provides real-time recommendations.### 6. Model Inference
Run the `model-inference.py` script to use the trained model for inference:
```bash
python model-inference.py
```
This script performs batch or real-time inference using the trained ALS model to recommend products for users based on their activity.## License
This project is licensed under the MIT License. See the LICENSE file for details.