https://github.com/getindata/streaming-ml-with-ksql
Demo of running Spark MLLib model on Kafka with KSQL, using Mleap serialization
- Host: GitHub
- URL: https://github.com/getindata/streaming-ml-with-ksql
- Owner: getindata
- Created: 2022-03-22T07:24:35.000Z (about 4 years ago)
- Default Branch: master
- Last Pushed: 2022-03-31T06:53:51.000Z (about 4 years ago)
- Last Synced: 2025-01-24T02:31:01.817Z (over 1 year ago)
- Language: Python
- Size: 22.5 KB
- Stars: 1
- Watchers: 8
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Streaming-ML with KSQL
A proof of concept of an MLOps system that requires no coding skills other than SQL to apply an ML model in production. Core components:
* [MLflow](https://mlflow.org/) - used as the experiment tracker and model registry
* Models trained on generated sample data with [Spark MLlib](https://spark.apache.org/mllib/), serialized using [MLeap](https://github.com/combust/mleap)
* [Kafka Connect](https://kafka.apache.org/documentation/#connect) server to stream the inputs from the database using change data capture (CDC)
* [KSQL](https://ksqldb.io/) User Defined Function (UDF) that downloads the model and runs the predictions
* Random training and prediction data generator for demo purposes, using [DOGE](https://github.com/getindata/doge-datagen)
## How to run it?
1. Compile the KSQL UDF by entering the `udf` directory and running (TODO: automate this step):

        ./gradlew download
        ./gradlew build
1. Enter the main directory and run `docker-compose up -d` in order to start all the services.
1. Navigate to `http://localhost:8080` and confirm that the training process has started
1. Once training is finished, register the model as `Bot Detector` and promote it to `Production`
1. Create a Kafka Connect source connector to stream MySQL data into Kafka:

        http :8083/connectors @infra/connect/mysql-source.json
1. Then, in the KSQL CLI (`docker exec -ti ksql ksql`), set up the stream of user changes and the table built on it:

        CREATE STREAM users_stream WITH (KAFKA_TOPIC = 'mysql.demo.users', VALUE_FORMAT = 'AVRO');
        CREATE STREAM users_stream_rekey AS SELECT * FROM users_stream PARTITION BY id;
        CREATE TABLE users WITH (KAFKA_TOPIC = 'USERS_STREAM_REKEY', VALUE_FORMAT = 'AVRO');
1. You may want to add some records to MySQL (`docker exec -ti mysql mysql -pkafkademo demo`) and check the changes with `select * from users emit changes;`
1. Next, simulate some traffic:

        docker exec -ti traffic-generator bash
        python generator.py
1. Configure aggregated views on the data using a 10-minute hopping window that advances every 2 minutes:

        CREATE STREAM events WITH (KAFKA_TOPIC = 'events', VALUE_FORMAT = 'AVRO', TIMESTAMP='ts');
        CREATE TABLE events_in_10_minutes_window AS SELECT
            user_id,
            -- earliest and latest event time seen in the window
            TIMESTAMPTOSTRING(min(events.rowtime), 'HH:mm:ss') as window_start,
            TIMESTAMPTOSTRING(max(events.rowtime), 'HH:mm:ss') as window_end,
            -- per-event-type counters
            SUM(CASE WHEN event = 'main_page' THEN 1 ELSE 0 END) AS main_page_views,
            SUM(CASE WHEN event = 'products_listing' THEN 1 ELSE 0 END) AS listing_views,
            SUM(CASE WHEN event = 'product_page' THEN 1 ELSE 0 END) AS product_views,
            SUM(CASE WHEN event = 'product_gallery' THEN 1 ELSE 0 END) AS gallery_views
        FROM events
        WINDOW HOPPING (SIZE 10 MINUTES, ADVANCE BY 2 MINUTES) GROUP BY user_id;
        CREATE STREAM aggregated_events_stream WITH (KAFKA_TOPIC = 'EVENTS_IN_10_MINUTES_WINDOW', VALUE_FORMAT = 'AVRO');
1. Check the input data for the model:

        SELECT user_id, country, platform, product_views, listing_views, gallery_views, nb_orders FROM aggregated_events_stream
        LEFT JOIN users ON aggregated_events_stream.user_id = users.rowkey
        EMIT CHANGES;
1. Finally, pass the data through the ML model trained in the earlier steps and push the results back to Kafka:

        CREATE STREAM bot_detection_results AS
        SELECT
            user_id,
            ip_address,
            window_start,
            window_end,
            predict('Bot Detector', as_array(country, platform), as_array(product_views, listing_views, gallery_views, nb_orders)) AS prediction
        FROM aggregated_events_stream
        LEFT JOIN users ON aggregated_events_stream.user_id = users.rowkey;
1. Push the topic with predictions into MongoDB:

        http :8083/connectors @infra/connect/mongo-sink.json
1. Verify the data in MongoDB:

        docker exec -ti mongo mongo
        > db.bot_detection_results.find()
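
To make the `WINDOW HOPPING (SIZE 10 MINUTES, ADVANCE BY 2 MINUTES)` clause from the steps above more concrete, here is a minimal Python sketch of how a single event can fall into several overlapping windows and how per-window counters build up. It is an illustration only, not code from this repository: the function names and the sample events are hypothetical, and it assumes windows aligned to multiples of the advance interval, counted from the epoch.

```python
from collections import Counter, defaultdict

SIZE_MS = 10 * 60 * 1000     # SIZE 10 MINUTES
ADVANCE_MS = 2 * 60 * 1000   # ADVANCE BY 2 MINUTES


def window_starts(ts_ms, size_ms=SIZE_MS, advance_ms=ADVANCE_MS):
    """Return the start times (epoch ms) of all hopping windows containing ts_ms.

    A window [start, start + size) contains ts when start <= ts < start + size.
    Assumption: window starts are non-negative multiples of advance_ms.
    """
    start = ts_ms - ts_ms % advance_ms  # latest aligned start <= ts
    starts = []
    while start > ts_ms - size_ms and start >= 0:
        starts.append(start)
        start -= advance_ms
    return sorted(starts)


def aggregate(events):
    """Count events per (user_id, window_start) and event type.

    `events` is an iterable of (user_id, event, ts_ms) tuples, a hypothetical
    stand-in for the records on the `events` topic.
    """
    counts = defaultdict(Counter)
    for user_id, event, ts_ms in events:
        for start in window_starts(ts_ms):
            counts[(user_id, start)][event] += 1
    return counts


# Hypothetical sample traffic for a single user.
sample = [
    (1, "main_page", 60_000),
    (1, "product_page", 130_000),
    (1, "product_page", 660_000),
]
for key, counter in sorted(aggregate(sample).items()):
    print(key, dict(counter))
```

With a 10-minute size and a 2-minute advance, an event normally belongs to five windows, so each user produces several overlapping rows of counters. Note that `window_start` and `window_end` in the query above are the earliest and latest event times seen in a window, not the window bounds computed here.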
## Resetting the state
To reset the Kafka state while keeping the trained models (for example, when preparing a demo), run:

    docker-compose stop kafka schema-registry connect mysql ksql mongo
    docker-compose rm -f kafka mysql mongo
    docker-compose up -d
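
Because the reset removes the Kafka container, the connector registrations from the walkthrough will most likely have to be repeated (an assumption; it depends on where the compose file keeps Kafka Connect's state). As a sketch of what the `http :8083/connectors @file` calls do, the following standard-library Python posts a connector definition file to the Kafka Connect REST API. The helper names are hypothetical, and the URL assumes Connect is exposed on `localhost:8083` as in the commands above.

```python
import json
import urllib.request

# Same endpoint as `http :8083/connectors` (assumes Connect on localhost:8083).
CONNECT_URL = "http://localhost:8083/connectors"


def build_request(config_path, url=CONNECT_URL):
    """Build a POST request that submits a connector definition file as JSON."""
    with open(config_path, "rb") as f:
        body = f.read()
    json.loads(body)  # fail early if the file is not valid JSON
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def register(config_path, url=CONNECT_URL):
    """Send the request and return the HTTP status code.

    urllib raises urllib.error.HTTPError for 4xx/5xx responses,
    for example when a connector with the same name already exists.
    """
    with urllib.request.urlopen(build_request(config_path, url)) as resp:
        return resp.status


# Usage (requires the Connect service from docker-compose to be running):
#   register("infra/connect/mysql-source.json")
#   register("infra/connect/mongo-sink.json")  # after bot_detection_results exists
```

Building the request separately from sending it keeps the payload easy to inspect without a running Connect instance.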