{"id":21514913,"url":"https://github.com/getindata/streaming-ml-with-ksql","last_synced_at":"2025-03-17T16:15:14.278Z","repository":{"id":42575481,"uuid":"472656102","full_name":"getindata/streaming-ml-with-ksql","owner":"getindata","description":"Demo of running Spark MLLib model on Kafka with KSQL, using Mleap serialization","archived":false,"fork":false,"pushed_at":"2022-03-31T06:53:51.000Z","size":23,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":8,"default_branch":"master","last_synced_at":"2025-01-24T02:31:01.817Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/getindata.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-03-22T07:24:35.000Z","updated_at":"2022-03-23T18:13:25.000Z","dependencies_parsed_at":"2022-09-11T07:21:06.159Z","dependency_job_id":null,"html_url":"https://github.com/getindata/streaming-ml-with-ksql","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/getindata%2Fstreaming-ml-with-ksql","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/getindata%2Fstreaming-ml-with-ksql/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/getindata%2Fstreaming-ml-with-ksql/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/getindata%2Fstreaming-ml-with-ksql/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/getindata","download_url":"https://codeload.github.com/getindata/st
reaming-ml-with-ksql/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244066192,"owners_count":20392407,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-23T23:53:31.323Z","updated_at":"2025-03-17T16:15:14.249Z","avatar_url":"https://github.com/getindata.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Streaming-ML with KSQL\n\nA proof-of-concept of an MLOps system that doesn't require coding skills (other than SQL) to apply the ML model in production. Core components:\n\n* [MLflow](https://mlflow.org/) - used as the experiments tracker and model registry\n* Models trained on generated sample data with [Spark MLlib](https://spark.apache.org/mllib/), serialized using [MLeap](https://github.com/combust/mleap)\n* [Kafka Connect](https://kafka.apache.org/documentation/#connect) server to stream the inputs from the database using CDC\n* [KSQL](https://ksqldb.io/) User Defined Function (UDF) that downloads the model and runs the predictions\n* Random training and prediction data generator for demo purposes, using [DOGE](https://github.com/getindata/doge-datagen)\n\n## How to run it?\n\n1. Compile the KSQL UDF by entering the `udf` directory and executing (TODO: automate it):\n\n        ./gradlew download\n        ./gradlew build\n\n1. Enter the main directory and run `docker-compose up -d` to start all the services.\n1. Navigate to `http://localhost:8080` and confirm that the training process has started\n1. 
Once training is finished, register the model as `Bot Detector` and promote it to `Production`\n1. Create a Kafka Connect source connector to stream MySQL data into Kafka\n\n        http :8083/connectors @infra/connect/mysql-source.json\n\n1. Then, in the KSQL CLI (`docker exec -ti ksql ksql`), set up the user changes stream and the table:\n\n        CREATE STREAM users_stream WITH (KAFKA_TOPIC = 'mysql.demo.users', VALUE_FORMAT = 'AVRO');\n        CREATE STREAM users_stream_rekey AS SELECT * FROM users_stream PARTITION BY id;\n        CREATE TABLE users WITH (KAFKA_TOPIC = 'USERS_STREAM_REKEY', VALUE_FORMAT = 'AVRO');\n\n1. You may want to add some records to MySQL (`docker exec -ti mysql mysql -pkafkademo demo`) and check the changes with `select * from users emit changes;`\n1. Next, simulate some traffic:\n\n        $ docker exec -ti traffic-generator bash\n        python generator.py\n\n1. Configure aggregated views on the data with a 10-minute hopping window (2-minute slide):\n\n        CREATE STREAM events WITH (KAFKA_TOPIC = 'events', VALUE_FORMAT = 'AVRO', TIMESTAMP='ts');\n\n        CREATE TABLE events_in_10_minutes_window AS SELECT \n          user_id,\n          TIMESTAMPTOSTRING(min(events.rowtime), 'HH:mm:ss') as window_start,\n          TIMESTAMPTOSTRING(max(events.rowtime), 'HH:mm:ss') as window_end,\n          SUM(CASE WHEN event = 'main_page' THEN 1 ELSE 0 END) AS main_page_views,\n          SUM(CASE WHEN event = 'products_listing' THEN 1 ELSE 0 END) AS listing_views,\n          SUM(CASE WHEN event = 'product_page' THEN 1 ELSE 0 END) AS product_views,\n          SUM(CASE WHEN event = 'product_gallery' THEN 1 ELSE 0 END) AS gallery_views\n        FROM events \n        WINDOW HOPPING (SIZE 10 MINUTES, ADVANCE BY 2 MINUTES) GROUP BY user_id;\n\n        CREATE STREAM aggregated_events_stream WITH (KAFKA_TOPIC = 'EVENTS_IN_10_MINUTES_WINDOW', VALUE_FORMAT = 'AVRO');\n\n1. 
Check the input data for the model:\n\n        SELECT user_id, country, platform, product_views, listing_views, gallery_views, nb_orders FROM aggregated_events_stream\n        LEFT JOIN users ON aggregated_events_stream.user_id = users.rowkey\n        EMIT CHANGES;\n\n1. Finally, pass the data through the ML model trained in the earlier steps and push the results back to Kafka:\n\n        CREATE STREAM bot_detection_results AS\n        SELECT\n            user_id,\n            ip_address,\n            window_start,\n            window_end,\n            predict('Bot Detector', as_array(country, platform), as_array(product_views, listing_views, gallery_views, nb_orders)) AS prediction\n        FROM aggregated_events_stream\n        LEFT JOIN users ON aggregated_events_stream.user_id = users.rowkey;\n\n1. Push the topic with predictions into MongoDB:\n\n        http :8083/connectors @infra/connect/mongo-sink.json\n\n1. Verify data in MongoDB:\n\n        docker exec -ti mongo mongo\n        \u003e db.bot_detection_results.find()\n\n## Resetting the state\n\nIn order to keep the trained models, but reset Kafka state as a demo preparation, run:\n\n    docker-compose stop kafka schema-registry connect mysql ksql mongo\n    docker-compose rm -f kafka mysql mongo\n    docker-compose up -d\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgetindata%2Fstreaming-ml-with-ksql","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgetindata%2Fstreaming-ml-with-ksql","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgetindata%2Fstreaming-ml-with-ksql/lists"}