https://github.com/phact/streaming-ml-product-recommendation


Host: GitHub
URL: https://github.com/phact/streaming-ml-product-recommendation
Owner: phact
Created: 2017-04-17T20:31:53.000Z (about 9 years ago)
Default Branch: master
Last Pushed: 2018-07-09T14:49:10.000Z (almost 8 years ago)
Last Synced: 2025-05-05T21:11:31.585Z (about 1 year ago)
Language: Scala
Size: 10.1 MB
Stars: 2
Watchers: 2
Forks: 5
Open Issues: 0
Metadata Files:
- Readme: README.md


# StreamingMLProductRecommendation

This guide explains how to use the power tools streaming machine learning product recommendation asset, brought to you by the Vanguard team.

Upgraded for DSE 6.0

### Motivation

Machine learning powered recommendation engines have wide applications across multiple industries as companies seek to provide their end customers with deep insights by leveraging data in the moment. Although many tools support historical analysis that yields recommendations, DataStax Enterprise (DSE) is particularly well suited to power real-time recommendation / personalization systems. DSE proves most useful when it comes to operationalizing and productionizing analytical systems. This is largely due to DSE's design objectives of operating at scale, in a distributed fashion, while fulfilling the performance and availability requirements of user-facing, mission-critical applications.

### What is included?

This field asset includes a working application for real-time recommendations leveraging the following DSE functionality:

* Machine Learning
* Streaming analytics
* Batch analytics
* Real-time JDBC / SQL (dynamic caching)
* DSEFS

### Business Take Aways

By streaming customer market basket data from a retail organization through DSE analytics and using it to train a collaborative filtering machine learning model, we can maintain a top-K list of recommended products per customer that reflects their historical and recent buying patterns.

In the retail industry, both online and brick-and-mortar businesses are leveraging ML and real-time analytics pipelines to gather insights that become differentiators for them in the marketplace. The DataStax stack is the foundation for enterprise personalization / recommendation systems across multiple industries.

### Technical Take Aways

For a technical deep dive, take a look at the following sections:

- Machine learning model
- Streaming analytics pipeline
- Real-time JDBC / SQL (dynamic caching)

## Startup Script

This asset leverages
[simple-startup](https://github.com/jshook/simple-startup). To start the entire
asset, run `./startup all`. For other options, run `./startup`.

## Manual Usage:

### Prep the files:

Make sure DSEFS is enabled in `dse.yaml`:

    dsefs_options:
        enabled: true
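
As a quick sanity check before starting DSE, the setting can be verified from the command line. The helper below is only a sketch: `dsefs_enabled` is a hypothetical name, and it assumes that `dsefs_options:` is an uncommented top-level key with `enabled: true` indented beneath it. The location of `dse.yaml` depends on how DSE was installed, so the path is passed in as an argument.

```shell
# Hypothetical helper: exits 0 if the given dse.yaml has
# "enabled: true" inside an uncommented dsefs_options block.
dsefs_enabled() {
  awk '
    /^dsefs_options:/ { in_block = 1; next }
    in_block && /^[^ \t#]/ { in_block = 0 }   # next top-level key ends the block
    in_block && /^[ \t]+enabled:[ \t]*true/ { found = 1 }
    END { exit !found }
  ' "$1"
}

# Usage (the path is an assumption; adjust it to your install):
#   dsefs_enabled /etc/dse/dse.yaml && echo "DSEFS is enabled"
```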

And push the raw data file into the root directory of dsefs:

```
dsefs / > put ./sales_observations sales_observations
dsefs / > ls sales_observations
sales_observations
```

### Streaming Job:
To run this on your local machine, first start a Netcat server:

    $ nc -lk 9999

Build:

    mvn package

Then run the example:

    $ dse spark-submit --deploy-mode cluster --supervise \
        --class com.datastax.powertools.analytics.SparkMLProductRecommendationStreamingJob \
        ./target/StreamingMLProductRecommendations-0.1.jar localhost 9999

To run the model, predict via streaming, and serve results via JDBC, run the
ServeJDBC class:

    $ dse spark-submit \
        --class com.datastax.powertools.analytics.SparkMLProductRecommendationServeJDBC \
        ./target/StreamingMLProductRecommendations-0.1.jar localhost 9999

Then connect with beeline and query the predictions:

    $ dse beeline
    > !connect jdbc:hive2://localhost:10000
    > select * from recommendations.predictions where user=10277 order by prediction desc;
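
The query above returns every prediction for the user. To see only a top-K list, a `limit` clause can be appended. The helper below is a small sketch that prints such a statement. `top_k_query` is a hypothetical name, the table and column names are taken from the query above, and the default of 10 is an arbitrary choice.

```shell
# Hypothetical helper: prints a query for the top K predictions of one user.
# Table and column names come from the README query; "limit" is the addition.
top_k_query() {
  user_id="$1"
  k="${2:-10}"
  printf 'select * from recommendations.predictions where user=%s order by prediction desc limit %s;\n' "$user_id" "$k"
}

# Usage: print the statement, then paste it at the beeline prompt.
#   top_k_query 10277 5
```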

Paste a few records into the `nc` prompt and watch the results change in beeline:

```
102779564.0000
1027795649564.0000
1027515254.0000
1027795649564744.0000
10277956495647442304.0000
102751525415254.0000
102779564956474423042304.0000
10277 221 4
10277 221 1
```

Alternatively, you can run `./socketstream` to write a record per second to the stream from bash.
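
For reference, the idea of feeding one record per second into the socket can be sketched as follows. This is not the contents of `./socketstream`, which may work differently. `emit_records` is a hypothetical function, the record layout assumes the three space-separated fields seen in the last two sample records above, and the product ids and ratings it generates are made-up values.

```shell
# Hypothetical stand-in for ./socketstream; the real script may differ.
# Prints COUNT records, one every DELAY seconds, as "user product rating".
emit_records() {
  count="${1:-5}"
  delay="${2:-1}"
  user="${3:-10277}"
  i=1
  while [ "$i" -le "$count" ]; do
    # Product id and rating are made-up sample values for illustration.
    printf '%s %s %s\n' "$user" "$((220 + i))" "$((i % 5 + 1))"
    sleep "$delay"
    i=$((i + 1))
  done
}

# Usage (assumes the streaming job connects to localhost:9999):
#   emit_records 10 1 | nc -lk 9999
```

Piping the generator into `nc` replaces the manual pasting step. Records written before the streaming job has connected may not reach it, so start the job first.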

### Docs

Pull in your submodules:

    git submodule update --init
    git submodule sync

Then run the server:

    hugo server ./content
