Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/shayartt/streaming-orders
Project to stream real-time orders and apply ETL pipelines & analytics using Databricks, Kafka, and AWS.
databricks etl kafka python spark spark-streaming
Last synced: about 19 hours ago
- Host: GitHub
- URL: https://github.com/shayartt/streaming-orders
- Owner: Shayartt
- Created: 2024-06-13T18:07:49.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2024-06-21T15:26:29.000Z (8 months ago)
- Last Synced: 2025-02-13T08:16:46.390Z (about 19 hours ago)
- Topics: databricks, etl, kafka, python, spark, spark-streaming
- Language: Jupyter Notebook
- Homepage:
- Size: 823 KB
- Stars: 1
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: changelog
README
# streaming-orders
Project to stream real-time orders and apply ETL pipelines & analytics using Databricks, Kafka, and AWS.

### Documentation Diagrams:
#### High-Level Tech Diagram:
![High-Level Diagram](docs/diagrams/streaming-orders-highlevel.drawio.png?raw=true "High-Level")

#### Data Flow:
![Data Flow Diagram](docs/diagrams/Low-level-data-flow.drawio.png?raw=true "Data Flow")

More details to be included later...
### Milestones:
+ Stream orders from Kafka: Simulate a real-time stream of orders from an e-commerce platform. (Confluent Kafka) ✅
- Kafka setup in Confluent.
![KafkaTopic](docs/screens/kafka_topic.png?raw=true "KafkaTopic")
- Producer simulation in src/simulator.py (see the sketch below).
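The contents of src/simulator.py aren't reproduced on this page; below is a minimal sketch of what such an order simulator might look like with the confluent-kafka client. The broker address, topic name, order fields, and timings are illustrative assumptions, not the project's actual values.

```python
import json
import random
import time
import uuid

from confluent_kafka import Producer

# Hypothetical config -- real Confluent Cloud credentials would go here.
producer = Producer({"bootstrap.servers": "localhost:9092"})

def make_order() -> dict:
    """Generate one fake e-commerce order."""
    return {
        "order_id": str(uuid.uuid4()),
        "customer_id": random.randint(1, 500),
        "product": random.choice(["laptop", "phone", "headset"]),
        "amount": round(random.uniform(5, 2000), 2),
        "order_ts": time.time(),
    }

if __name__ == "__main__":
    while True:
        order = make_order()
        # Key by customer so a customer's orders land in the same partition.
        producer.produce("orders", key=str(order["customer_id"]),
                         value=json.dumps(order))
        producer.poll(0)  # serve delivery callbacks
        time.sleep(random.uniform(0.1, 1.0))
```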
+ Streaming ingestion & processing: Create Spark jobs that read from our Kafka stream, apply ETL pipelines, and write the data into a Delta lake (Databricks Workflows + AWS S3; see the sketch after this milestone). ✅
- Streaming: 🚀
![Streamingflow](docs/screens/streaming_workflow.PNG?raw=true "Streamingflow")
- Processing: 🛠️
![Processingflow](docs/screens/processing_workflow.png?raw=true "Processingflow")
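A minimal sketch of the kind of Spark Structured Streaming job this milestone describes, assuming a Databricks notebook (where `spark` is predefined); the schema, broker, and S3 paths are hypothetical placeholders:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import (StructType, StructField, StringType,
                               IntegerType, DoubleType, TimestampType)

# Illustrative schema for the simulated order events.
order_schema = StructType([
    StructField("order_id", StringType()),
    StructField("customer_id", IntegerType()),
    StructField("product", StringType()),
    StructField("amount", DoubleType()),
    StructField("order_ts", TimestampType()),
])

# Read raw events from the Confluent topic.
raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "<confluent-broker>:9092")
       .option("subscribe", "orders")
       .option("startingOffsets", "latest")
       .load())

# Kafka values arrive as bytes: cast to string, parse JSON, flatten.
orders = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(F.from_json("json", order_schema).alias("o"))
          .select("o.*"))

# Append into a bronze Delta table on S3; the checkpoint makes the sink
# fault-tolerant across restarts.
(orders.writeStream.format("delta")
 .option("checkpointLocation", "s3://my-bucket/checkpoints/orders_bronze")
 .outputMode("append")
 .start("s3://my-bucket/bronze/orders"))
```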
+ Analytics pipelines: The final step of the workflow, triggered on newly received orders, applies the analysis tasks below: ✅
- Develop fraud detection rules that flag suspicious orders based on patterns such as high-value orders, frequent orders from the same customer, etc. (see the sketch below) 💡
- Alerting: Set up alerts for flagged clients using Databricks alerting (email only). 🚨
+ Done using Databricks SQL Alerts: the analyzer code stores its results in a Delta table, and a trigger based on that table is set up via the Databricks platform.
![Fraud Rule](docs/screens/Fraud_alerts.png?raw=true "Fraud Rule")
+ Visualization: Not the main objective of this project, but we'll create a quick dashboard including some analytical statistics using Databricks dashboarding. 👀
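A minimal sketch of the rule-based fraud flagging described above, appending flags to the Delta table that the SQL alert watches. The thresholds, column names, and the `fraud_flags` table are assumptions for illustration:

```python
from pyspark.sql import functions as F

# `spark` is provided by the Databricks runtime; path is hypothetical.
orders = spark.read.format("delta").load("s3://my-bucket/silver/orders")

# Rule 1: unusually high order value.
high_value = (orders.filter(F.col("amount") > 1000)
              .select("order_id", "customer_id")
              .withColumn("rule", F.lit("high_value")))

# Rule 2: many orders from the same customer within one hour.
frequent = (orders.groupBy("customer_id", F.window("order_ts", "1 hour"))
            .count()
            .filter(F.col("count") > 5)
            .select("customer_id")
            .withColumn("order_id", F.lit(None).cast("string"))
            .withColumn("rule", F.lit("frequent_orders")))

# Append both flag sets to the table the Databricks SQL alert is built on.
(high_value.unionByName(frequent)
 .write.format("delta").mode("append").saveAsTable("fraud_flags"))
```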
+ Additional challenges: As this is a learning project, I decided to add some challenging use cases that simulate real-life problems, as listed below: 🤓
- Data Schema Evolution: Handle changes in the data schema over time, such as new fields being added to the order data or changing types. 🔄
- Data Schema Evolution: Ensure backward and forward compatibility in our streaming pipeline. 🔄
+ I added a new field called "message" to my topic and changed the producer code to include it; on the consumer side, I enabled schema merging to allow schema evolution in my Delta table, and for backward compatibility I provided a default value for the new field (see the first sketch below).
- Exactly-Once Processing: Implement exactly-once processing semantics to ensure that each order is processed exactly once, even in the case of failures (avoid duplication; see the second sketch below).
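A minimal sketch of the schema-evolution handling described above: the Delta write opts into `mergeSchema` so the new `message` column can be added, and pre-evolution events get a default value for backward compatibility. It assumes the `orders` stream from the ingestion sketch, with `order_schema` extended by a nullable `message` field (old events then parse to null); paths are hypothetical.

```python
from pyspark.sql import functions as F

# Default the new field for events produced before the schema change.
with_message = orders.withColumn(
    "message",
    F.coalesce(F.col("message"), F.lit(""))
)

(with_message.writeStream.format("delta")
 .option("mergeSchema", "true")  # let Delta add the new column to the table
 .option("checkpointLocation", "s3://my-bucket/checkpoints/orders_evolved")
 .outputMode("append")
 .start("s3://my-bucket/bronze/orders"))
```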
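And a minimal sketch of one way to get the exactly-once behavior mentioned above: Structured Streaming checkpoints track which Kafka offsets each micro-batch covered, and a `foreachBatch` MERGE on the order's natural key makes any replayed batch idempotent. Table paths and names are assumptions.

```python
from delta.tables import DeltaTable

TARGET = "s3://my-bucket/silver/orders"  # hypothetical

def upsert_batch(batch_df, batch_id):
    # Deduplicate inside the micro-batch, then MERGE on order_id so a
    # replayed batch cannot insert the same order twice.
    (DeltaTable.forPath(spark, TARGET).alias("t")
     .merge(batch_df.dropDuplicates(["order_id"]).alias("s"),
            "t.order_id = s.order_id")
     .whenNotMatchedInsertAll()
     .execute())

(orders.writeStream
 .foreachBatch(upsert_batch)
 .option("checkpointLocation", "s3://my-bucket/checkpoints/orders_silver")
 .start())
```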
- Stateful Stream Processing: Use stateful transformations to keep track of state across events. For example, maintain a running total of orders per customer or product category.
+ Implemented two live counters in statefull_streaming.ipynb (see the sketch below).
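The notebook itself isn't shown on this page; below is a minimal sketch of the kind of running-total aggregation it implements, reusing the `orders` stream from the ingestion sketch (query and column names are assumptions):

```python
from pyspark.sql import functions as F

# Running order count and total spend per customer, updated as events arrive.
running_totals = (orders.groupBy("customer_id")
                  .agg(F.count("*").alias("order_count"),
                       F.sum("amount").alias("total_spent")))

# "complete" mode re-emits the full state table on every trigger.
(running_totals.writeStream
 .outputMode("complete")
 .format("memory")
 .queryName("customer_totals")
 .start())
```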
- Scalability and Performance Optimization: Optimize your Spark streaming jobs for performance and scalability. This includes tuning Spark configurations, partitioning data effectively, and minimizing data shuffles.
+ Optimizations applied throughout the code, for example use of built-in functions and broadcast joins (see the sketch below).
+ Scalability: Added a dateOrder field and changed the previous one to a datetime type.
+ Used the bronze/silver/gold table concept instead of saving directly to an S3 path.
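A minimal sketch of the broadcast-join optimization mentioned above, joining the `orders` stream against a small dimension table (the `gold.products` table and join key are assumptions):

```python
from pyspark.sql import functions as F

# Small dimension table: broadcasting it to every executor avoids
# shuffling the (much larger) orders stream for the join.
products = spark.read.table("gold.products")

enriched = orders.join(F.broadcast(products), on="product", how="left")
```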
- Data Quality Monitoring: Implement real-time data quality checks and monitoring. Alert if data quality issues such as missing values, duplicates, or outliers are detected.
+ Implemented within the statefull_streaming notebook (see the sketch below).
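A minimal sketch of per-batch quality checks of the kind described, using `foreachBatch` on the `orders` stream. The checks, thresholds, and the print-based alert hook are assumptions; the project itself surfaces issues through Databricks alerting.

```python
from pyspark.sql import functions as F

def quality_check(batch_df, batch_id):
    total = batch_df.count()
    if total == 0:
        return
    nulls = batch_df.filter(F.col("order_id").isNull() |
                            F.col("amount").isNull()).count()
    dups = total - batch_df.dropDuplicates(["order_id"]).count()
    outliers = batch_df.filter(F.col("amount") < 0).count()
    if nulls or dups or outliers:
        # Placeholder hook: the real pipeline records issues for alerts.
        print(f"batch {batch_id}: {nulls} nulls, {dups} dups, "
              f"{outliers} outliers out of {total} rows")

(orders.writeStream
 .foreachBatch(quality_check)
 .option("checkpointLocation", "s3://my-bucket/checkpoints/quality")
 .start())
```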
- Data Enrichment: Enrich your streaming data with additional context from external data sources. For example, enrich order data with customer demographics or product details from a database.
+ Implemented Enums to generate additional information, plus GeoPy to fetch location information (see the sketch below).
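A minimal sketch of GeoPy-based enrichment via reverse geocoding. The user agent and coordinate inputs are assumptions; Nominatim is rate-limited, hence GeoPy's `RateLimiter` wrapper.

```python
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

geolocator = Nominatim(user_agent="streaming-orders-demo")  # hypothetical agent
reverse = RateLimiter(geolocator.reverse, min_delay_seconds=1)

def locate(lat: float, lon: float) -> str:
    """Reverse-geocode order coordinates into a readable address."""
    location = reverse((lat, lon), language="en")
    return location.address if location else "unknown"

print(locate(48.8584, 2.2945))  # e.g. an address near the Eiffel Tower
```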
- Real-Time Analytics and Reporting: Build real-time analytics and reporting dashboards using tools like Databricks SQL, Power BI, or Tableau. Visualize key metrics such as order volume, sales trends, and customer behavior.
- Databricks dashboard based on Seaborn & PyPlot reading from a real-time Delta lake (see the sketch below).
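A minimal sketch of the kind of notebook cell behind such a dashboard: aggregate the live Delta table with Spark, pull the small result into pandas, and plot it with Seaborn. The `gold.orders` table and its columns are assumptions.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Aggregate in Spark, then bring only the small result into pandas.
pdf = (spark.read.table("gold.orders")
       .groupBy("product").count()
       .toPandas())

sns.barplot(data=pdf, x="product", y="count")
plt.title("Order volume by product")
plt.show()
```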