{"id":22492584,"url":"https://github.com/garystafford/streaming-sales-generator","last_synced_at":"2025-08-03T00:31:25.805Z","repository":{"id":59105285,"uuid":"530234874","full_name":"garystafford/streaming-sales-generator","owner":"garystafford","description":"Streaming Synthetic Sales Data Generator: Streaming sales data generator for Apache Kafka, written in Python","archived":false,"fork":false,"pushed_at":"2022-12-28T17:37:45.000Z","size":9727,"stargazers_count":28,"open_issues_count":2,"forks_count":11,"subscribers_count":4,"default_branch":"main","last_synced_at":"2023-08-05T02:22:57.225Z","etag":null,"topics":["analytics","apache-flink","apache-kafka","data","kafka","kafka-streams","kstreams","python","spark-structured-streaming","streaming-data"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/garystafford.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-08-29T13:35:23.000Z","updated_at":"2023-07-27T16:15:46.000Z","dependencies_parsed_at":"2023-01-31T07:01:19.048Z","dependency_job_id":null,"html_url":"https://github.com/garystafford/streaming-sales-generator","commit_stats":null,"previous_names":[],"tags_count":0,"template":null,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/garystafford%2Fstreaming-sales-generator","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/garystafford%2Fstreaming-sales-generator/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/garystafford%2Fstreaming-sales-generator/releases","manifests_url":"https://repos.
ecosyste.ms/api/v1/hosts/GitHub/repositories/garystafford%2Fstreaming-sales-generator/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/garystafford","download_url":"https://codeload.github.com/garystafford/streaming-sales-generator/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":228508010,"owners_count":17931264,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["analytics","apache-flink","apache-kafka","data","kafka","kafka-streams","kstreams","python","spark-structured-streaming","streaming-data"],"created_at":"2024-12-06T18:19:08.725Z","updated_at":"2024-12-06T18:19:10.166Z","avatar_url":"https://github.com/garystafford.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Streaming Synthetic Sales Data Generator\n\n## TL;DR\n\n1. `docker stack deploy streaming-stack --compose-file docker/spark-kafka-stack.yml` to create local instance of Kafka\n2. `python3 -m pip install kafka-python` to install the `kafka-python` package\n3. `cd sales_generator/`\n4. `python3 ./producer.py` to start generating streaming data to Apache Kafka\n5. `python3 ./consumer.py` in a separate terminal window to view results\n\n## Background\n\nEach time you want to explore or demonstrate a new streaming technology, you must first find an adequate data source or\ndevelop a new one. Ideally, the streaming data source should be complex enough to perform multiple types of analyses\non and visualize different aspects with Business Intelligence (BI) and dashboarding tools. 
Additionally, the streaming\ndata source should possess a degree of consistency and predictability while still displaying a reasonable level of\nnatural randomness. Conversely, the source should not result in an unnatural uniform distribution of data over time.\n\nThis project's highly configurable, Python-based, synthetic data generator ([producer.py](sales_generator/producer.py)) streams product listings,\nsales transactions, and inventory restocking activities to Apache Kafka topics. It is designed for\ndemonstrating streaming data analytics tools, such as\n[Apache Spark Structured Streaming](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html),\n[Apache Beam](https://beam.apache.org/), [Apache Flink](https://flink.apache.org/),\n[Apache Kafka Streams](https://kafka.apache.org/documentation/streams/),\n[Apache Pinot](https://pinot.apache.org/), [Databricks](https://www.databricks.com/),\n[Google Cloud Dataproc](https://cloud.google.com/dataproc),\nand [Amazon Kinesis Data Analytics](https://aws.amazon.com/kinesis/data-analytics/).\n\n## Video Demonstration\n\nShort [YouTube video](https://youtu.be/MTCsN7gJuJM) demonstrates the generator in use (video only - no audio).\n\n## Streaming Code Sample\n\n* Apache Spark Structured Streaming: [Code samples](./apache_spark_examples/) written with PySpark, which consumes and aggregates the \nreal-time sales data from Kafka using Apache Spark\n* Apache Flink: [Code sample](https://github.com/garystafford/flink-kafka-demo/) written in Java, which consumes and aggregates the \nreal-time sales data from Kafka using Apache Flink\n* Apache Kafka Streams (KStreams): [Code sample](https://github.com/garystafford/kstreams-kafka-demo/) written in Java, which consumes and aggregates the real-time sales data from Kafka using KStreams\n* Apache Pinot/Apache Superset: [Code sample](./apache_pinot_examples/) to query products, purchases, and purchases-enhanced streaming data from Kafka using 
SQL\n\n![Architecture1](./diagram/streaming_workflow_intro.png)\n\n## Project Features\n\n* Generator is configurable in a separate [configuration.ini](sales_generator/configuration/configuration.ini) file\n* Semi-random data generation - random variables are weighted and can be adjusted in the `.ini` file\n* Over 25 smoothie drink products in [products.csv](sales_generator/data/products.csv): descriptions, inventories, costs, ingredients,\n  product propensity-to-buy range value\n* The propensity to buy a product is determined by an assigned value from a range of 0 to 200\n* Writes initial product list to an Apache Kafka topic (topic 1/3)\n* Generates semi-random streaming drink purchases, with time, item, quantity, price, total price, etc.\n* Club membership discounts semi-randomly applied to smoothie purchases\n* Add-on supplements semi-randomly applied to smoothie purchases\n* Writes smoothie purchases to an Apache Kafka topic (topic 2/3)\n* Restocks low product inventories based on minimum stock levels\n* Writes restocking activities to an Apache Kafka topic with time, old level, new level, etc. (topic 3/3)\n* Configurable authentication methods (SASL/SCRAM or PLAINTEXT) for connecting to Kafka\n\n## Sample Dashboard\n\nA simple dashboard example created from the streaming sales data joined with the static product list.\n\n![Dashboard](screengrabs/dashboard.png)\n\n## Raw Product List\n\nProducts are based on the Tropical Smoothie menu\nfrom [Fast Food Menu Prices](https://www.fastfoodmenuprices.com/tropical-smoothie-prices/). The last four columns, prefixed with `_`,\nwere used to generate artificial product category and product propensity-to-buy weighting. 
These determine how frequently the\nproducts are purchased in the simulation.\n\nA few sample products from the CSV file, [products.csv](sales_generator/data/products.csv), are shown below.\n\n```text\nID,Category,Item,Size,COGS,Price,Inventory,ContainsFruit,ContainsVeggies,ContainsNuts,ContainsCaffeine,_CatWeight,_ItemWeight,_TotalWeight,_RangeWeight\nCS01,Classic Smoothies,Sunrise Sunset,24 oz.,1.50,4.99,75,TRUE,FALSE,FALSE,FALSE,3,1,3,3\nCS04,Classic Smoothies,Sunny Day,24 oz.,1.50,4.99,75,TRUE,FALSE,FALSE,FALSE,3,1,3,18\nSF02,Superfoods Smoothies,Totally Green,24 oz.,2.10,5.99,50,TRUE,TRUE,FALSE,FALSE,2,1,2,84\nSC01,Supercharged Smoothies,Triple Berry Oat,24 oz.,2.70,5.99,35,TRUE,FALSE,FALSE,FALSE,3,5,15,137\nIS03,Indulgent Smoothies,Beach Bum,24 oz.,2.20,5.49,60,TRUE,TRUE,FALSE,FALSE,4,3,12,192\n```\n\n## Products Topic\n\nA few sample product messages are shown below.\n\n```json\n[\n    {\n        \"event_time\": \"2022-09-11 14:39:46.934384\",\n        \"product_id\": \"CS01\",\n        \"category\": \"Classic Smoothies\",\n        \"item\": \"Sunrise Sunset\",\n        \"size\": \"24 oz.\",\n        \"cogs\": 1.5,\n        \"price\": 4.99,\n        \"inventory_level\": 75,\n        \"contains_fruit\": true,\n        \"contains_veggies\": false,\n        \"contains_nuts\": false,\n        \"contains_caffeine\": false,\n        \"propensity_to_buy\": 3\n    },\n    {\n        \"event_time\": \"2022-09-11 14:39:50.715191\",\n        \"product_id\": \"CS02\",\n        \"category\": \"Classic Smoothies\",\n        \"item\": \"Kiwi Quencher\",\n        \"size\": \"24 oz.\",\n        \"cogs\": 1.5,\n        \"price\": 4.99,\n        \"inventory_level\": 75,\n        \"contains_fruit\": true,\n        \"contains_veggies\": false,\n        \"contains_nuts\": false,\n        \"contains_caffeine\": false,\n        \"propensity_to_buy\": 6\n    },\n    {\n        \"event_time\": \"2022-09-11 14:39:54.232999\",\n        \"product_id\": \"SF04\",\n        \"category\": 
\"Superfoods Smoothies\",\n        \"item\": \"Pomegranate Plunge\",\n        \"size\": \"24 oz.\",\n        \"cogs\": 2.1,\n        \"price\": 5.99,\n        \"inventory_level\": 50,\n        \"contains_fruit\": true,\n        \"contains_veggies\": false,\n        \"contains_nuts\": false,\n        \"contains_caffeine\": false,\n        \"propensity_to_buy\": 94\n    },\n    {\n        \"event_time\": \"2022-09-11 14:39:55.538469\",\n        \"product_id\": \"SC03\",\n        \"category\": \"Supercharged Smoothies\",\n        \"item\": \"Health Nut\",\n        \"size\": \"24 oz.\",\n        \"cogs\": 2.7,\n        \"price\": 5.99,\n        \"inventory_level\": 35,\n        \"contains_fruit\": false,\n        \"contains_veggies\": false,\n        \"contains_nuts\": true,\n        \"contains_caffeine\": false,\n        \"propensity_to_buy\": 143\n    },\n    {\n        \"event_time\": \"2022-09-11 14:39:56.226351\",\n        \"product_id\": \"IS01\",\n        \"category\": \"Indulgent Smoothies\",\n        \"item\": \"Bahama Mama\",\n        \"size\": \"24 oz.\",\n        \"cogs\": 2.2,\n        \"price\": 5.49,\n        \"inventory_level\": 60,\n        \"contains_fruit\": true,\n        \"contains_veggies\": false,\n        \"contains_nuts\": false,\n        \"contains_caffeine\": false,\n        \"propensity_to_buy\": 168\n    }\n]\n```\n\n## Purchases Topic\n\nA few sample sales transaction messages are shown below.\n\n```json\n[\n    {\n        \"transaction_time\": \"2022-09-13 11:51:09.006164\",\n        \"transaction_id\": \"9000324019618167755\",\n        \"product_id\": \"CS06\",\n        \"price\": 4.99,\n        \"quantity\": 1,\n        \"is_member\": false,\n        \"member_discount\": 0.0,\n        \"add_supplements\": true,\n        \"supplement_price\": 1.99,\n        \"total_purchase\": 6.98\n    },\n    {\n        \"transaction_time\": \"2022-09-13 11:53:24.925539\",\n        \"transaction_id\": \"9051670610281553996\",\n        \"product_id\": 
\"SC04\",\n        \"price\": 5.99,\n        \"quantity\": 1,\n        \"is_member\": true,\n        \"member_discount\": 0.1,\n        \"add_supplements\": true,\n        \"supplement_price\": 1.99,\n        \"total_purchase\": 7.18\n    },\n    {\n        \"transaction_time\": \"2022-09-13 11:56:27.143473\",\n        \"transaction_id\": \"6730925912413682784\",\n        \"product_id\": \"SF03\",\n        \"price\": 5.99,\n        \"quantity\": 1,\n        \"is_member\": false,\n        \"member_discount\": 0.0,\n        \"add_supplements\": true,\n        \"supplement_price\": 1.99,\n        \"total_purchase\": 7.98\n    },\n    {\n        \"transaction_time\": \"2022-09-13 11:59:33.269093\",\n        \"transaction_id\": \"2051718832449428473\",\n        \"product_id\": \"CS04\",\n        \"price\": 4.99,\n        \"quantity\": 1,\n        \"is_member\": true,\n        \"member_discount\": 0.1,\n        \"add_supplements\": true,\n        \"supplement_price\": 1.99,\n        \"total_purchase\": 18.85\n    },\n    {\n        \"transaction_time\": \"2022-09-13 11:59:33.269093\",\n        \"transaction_id\": \"2051718832449428473\",\n        \"product_id\": \"SF07\",\n        \"price\": 5.99,\n        \"quantity\": 2,\n        \"is_member\": true,\n        \"member_discount\": 0.1,\n        \"add_supplements\": true,\n        \"supplement_price\": 1.99,\n        \"total_purchase\": 14.36\n    }\n]\n```\n\n### Sample Batch Data\n\nThe [sample_data_small.json](sample_batch_data/sample_data_small.json) file contains a batch of 290 purchases,\nrepresenting a typical 12-hour business day from 8AM to 8PM.\nThe [sample_data_large.json](sample_batch_data/sample_data_large.json) file contains 500 purchases,\nspanning ~20.5 hours of sample data.\n\n## Restocking Activity Topic\n\nA few sample inventory activity messages are shown below.\n\n```json\n[\n    {\n        \"event_time\": \"2022-08-29 15:09:23.007874\",\n        \"product_id\": \"SC05\",\n        \"existing_level\": 
9,\n        \"stock_quantity\": 15,\n        \"new_level\": 24\n    },\n    {\n        \"event_time\": \"2022-08-29 15:12:30.415329\",\n        \"product_id\": \"SC03\",\n        \"existing_level\": 10,\n        \"stock_quantity\": 15,\n        \"new_level\": 25\n    },\n    {\n        \"event_time\": \"2022-08-29 15:19:38.139400\",\n        \"product_id\": \"SC01\",\n        \"existing_level\": 10,\n        \"stock_quantity\": 15,\n        \"new_level\": 25\n    },\n    {\n        \"event_time\": \"2022-08-29 15:34:35.392350\",\n        \"product_id\": \"SC04\",\n        \"existing_level\": 9,\n        \"stock_quantity\": 15,\n        \"new_level\": 24\n    },\n    {\n        \"event_time\": \"2022-08-29 15:48:55.183778\",\n        \"product_id\": \"IS01\",\n        \"existing_level\": 10,\n        \"stock_quantity\": 15,\n        \"new_level\": 25\n    }\n]\n```\n\n## Docker Streaming Stacks\n\n```shell\n# optional: delete previous stack\ndocker stack rm streaming-stack\n\n# deploy kafka stack\ndocker swarm init\ndocker stack deploy streaming-stack --compose-file docker/spark-kstreams-stack.yml\n\n# view results\ndocker stats\n\ndocker container ls --format \"{{ .Names}}, {{ .Status}}\"\n```\n\n### Containers\n\nExample Apache Kafka, Zookeeper, Spark, Flink, Pinot, Superset, KStreams, and JupyterLab containers:\n\n```text\nCONTAINER ID   IMAGE                      PORTS                                    NAMES\n8edd2caf765d   garystafford/kstreams-kafka-demo:0.7.0                              streaming-stack_kstreams.1...\n1d7c6ab3009d   bitnami/spark:3.3                                                   streaming-stack_spark...\n1d7c6ab3009d   bitnami/spark:3.3                                                   streaming-stack_spark-worker...\n6114dc4a9824   bitnami/kafka:3.2.1        9092/tcp                                 streaming-stack_kafka.1...\n837c0cdd1498   bitnami/zookeeper:3.8.0    2181/tcp, 2888/tcp, 3888/tcp, 8080/tcp   
streaming-stack_zookeeper.1...\n\n\nCONTAINER ID   IMAGE                                  PORTS                                    NAMES\n97ae95af3190   flink:1.15.2                           6123/tcp, 8081/tcp                       streaming-stack_taskmanager.1...\ndb2ef10587cd   flink:1.15.2                           6123/tcp, 8081/tcp                       streaming-stack_jobmanager.1...\nf7b975f43087   bitnami/kafka:3.2.1                    9092/tcp                                 streaming-stack_kafka.1...\nfb42722dccc0   jupyter/pyspark-notebook:spark-3.3.0   4040/tcp, 8888/tcp                       streaming-stack_jupyter.1...\nfa74d7a69ff8   garystafford/superset-pinot:0.11.0     8088/tcp                                 streaming-stack_superset.1...\n4ac637924d5f   apachepinot/pinot:0.11.0-...           8096-8099/tcp, 9000/tcp                  streaming-stack_pinot-server.1...\n45a60ea9efce   apachepinot/pinot:0.11.0-...           8096-8099/tcp, 9000/tcp                  streaming-stack_pinot-broker.1...\n2c3a910ed2a5   apachepinot/pinot:0.11.0-...           
8096-8099/tcp, 9000/tcp                  streaming-stack_pinot-controller.1...\n3b93a2daa1ee   zookeeper:3.8.0                        2181/tcp, 2888/tcp, 3888/tcp, 8080/tcp   streaming-stack_zookeeper.1...\n```\n\n## Helpful Commands\n\nTo run the application:\n\n```shell\n# install `kafka-python` python package\npython3 -m pip install kafka-python\n\ncd sales_generator/\n\n# run in foreground\npython3 ./producer.py\n# alternatively, run as background process\nnohup python3 ./producer.py \u0026\n```\n\nManage the topics from within the Kafka container:\n\n```shell\ndocker exec -it $(docker container ls --filter  name=streaming-stack_kafka.1 --format \"{{.ID}}\") bash\n\nexport BOOTSTRAP_SERVERS=\"localhost:9092\"\nexport TOPIC_PRODUCTS=\"demo.products\"\nexport TOPIC_PURCHASES=\"demo.purchases\"\nexport TOPIC_INVENTORIES=\"demo.inventories\"\n\n# list topics\nkafka-topics.sh --list \\\n    --bootstrap-server $BOOTSTRAP_SERVERS\n\n# describe topic\nkafka-topics.sh --describe \\\n    --topic $TOPIC_PURCHASES \\\n    --bootstrap-server $BOOTSTRAP_SERVERS\n\n# list consumer groups\nkafka-consumer-groups.sh --list \\\n    --bootstrap-server $BOOTSTRAP_SERVERS\n  \n# delete topics\nkafka-topics.sh --bootstrap-server $BOOTSTRAP_SERVERS --delete --topic $TOPIC_PRODUCTS\nkafka-topics.sh --bootstrap-server $BOOTSTRAP_SERVERS --delete --topic $TOPIC_PURCHASES\nkafka-topics.sh --bootstrap-server $BOOTSTRAP_SERVERS --delete --topic $TOPIC_INVENTORIES\n\n# optional: create topics (or they will be automatically created)\nkafka-topics.sh --create --topic $TOPIC_PRODUCTS \\\n    --partitions 1 --replication-factor 1 \\\n    --config cleanup.policy=compact \\\n    --bootstrap-server $BOOTSTRAP_SERVERS\n\nkafka-topics.sh --create --topic $TOPIC_PURCHASES \\\n    --partitions 1 --replication-factor 1 \\\n    --config cleanup.policy=compact \\\n    --bootstrap-server $BOOTSTRAP_SERVERS\n\nkafka-topics.sh --create --topic $TOPIC_INVENTORIES \\\n    --partitions 1 --replication-factor 1 
\\\n    --config cleanup.policy=compact \\\n    --bootstrap-server $BOOTSTRAP_SERVERS\n\n# read topics from beginning\nkafka-console-consumer.sh \\\n    --topic $TOPIC_PRODUCTS --from-beginning \\\n    --bootstrap-server $BOOTSTRAP_SERVERS\n\nkafka-console-consumer.sh \\\n    --topic $TOPIC_PURCHASES --from-beginning \\\n    --bootstrap-server $BOOTSTRAP_SERVERS\n\nkafka-console-consumer.sh \\\n    --topic $TOPIC_INVENTORIES --from-beginning \\\n    --bootstrap-server $BOOTSTRAP_SERVERS\n```\n\n## Setup Amazon Linux 2 EC2 Instance with Docker\n\nSuggest a minimum `m5.xlarge` instance type. Bootstrap script using Amazon SSM `aws:runShellScript`:\n\n```shell\nsudo yum update -y\nsudo yum install docker vim git wget jq htop python3-pip -y\n\nsudo usermod -a -G docker ec2-user\nid ec2-user\nnewgrp docker\n\npip3 install docker-compose kafka-python python-dateutil\nsudo systemctl enable docker.service\nsudo systemctl start docker.service\nsudo systemctl status docker.service\ndocker --version\ndocker swarm init\n\ncd /home/ec2-user/\n\ngit clone https://github.com/garystafford/streaming-sales-generator.git\ncd streaming-sales-generator/\n\ndocker stack deploy streaming-stack --compose-file docker/spark-kstreams-stack.yml\n```\n\n## TODO Items\n\n* ✓ Add the estimated Cost of Goods Sold (COGS) to each product, allowing for gross profit analyses\n* ✓ Add SASL/SCRAM authentication option for Apache Kafka in addition to PLAINTEXT\n* ✓ Add streaming data analysis example using Apache Spark Structured Streaming\n* ✓ Add streaming data analysis example using Apache Flink\n* ✓ Add streaming data analysis example using Apache Kafka Streams (KStreams)\n* ✓ Add streaming data analysis example using Apache Pinot with Apache Superset and JupyterLab\n* ✓ Add event time to Product model so product changes can be accounted for in stream\n* ✓ Docker streaming stack to support all examples: Apache Kafka, Spark, Flink, Pinot, Superset, and JupyterLab\n* ✓ Enable multiple product sales to 
be associated with a single transaction, add transaction ID to Purchases Class\n* ❏ Replace specific restocking events with a more generic events topic with an event type field: restocking, price change, COGS change,\n  ingredients, etc.\n* ❏ Add hours of operation (e.g., Monday 8AM - 8PM), which impact when sales can be made\n* ❏ Add semi-random sales volume variability based on day and time of day (e.g., Friday evening vs. Monday morning)\n* ❏ Add a positive and negative sales anomalies variable, such as a winter storm, power outage, or successful marketing\n  promotion\n* ❏ Add a supply chain issues variable that could impact availability of certain products (zero inventory/lost sales)\n\n---\n_The contents of this repository represent my viewpoints and not those of my past or current employers, including Amazon Web\nServices (AWS). All third-party libraries, modules, plugins, and SDKs are the property of their respective owners. The\nauthor(s) assumes no responsibility or liability for any errors or omissions in the content of this site. The\ninformation contained in this site is provided on an \"as is\" basis with no guarantees of completeness, accuracy,\nusefulness or timeliness._\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgarystafford%2Fstreaming-sales-generator","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgarystafford%2Fstreaming-sales-generator","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgarystafford%2Fstreaming-sales-generator/lists"}