https://github.com/mfurmanczyk/wh-sales
E-commerce analytics data warehouse ETL made with Apache Spark.
airflow data data-engineering data-warehouse kotlin python spark
Last synced: 5 months ago
- Host: GitHub
- URL: https://github.com/mfurmanczyk/wh-sales
- Owner: MFurmanczyk
- Created: 2024-07-23T19:13:34.000Z (almost 2 years ago)
- Default Branch: master
- Last Pushed: 2024-08-08T06:56:34.000Z (almost 2 years ago)
- Last Synced: 2025-02-02T01:11:22.305Z (over 1 year ago)
- Topics: airflow, data, data-engineering, data-warehouse, kotlin, python, spark
- Language: Kotlin
- Homepage:
- Size: 223 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# E-Commerce Data Warehouse ETL
This project implements an ETL (Extract, Transform, Load) process that migrates data from an OLTP (Online Transaction Processing) system and transforms it into a star schema in a data warehouse. The ETL jobs are written in Kotlin on Apache Spark and orchestrated with Apache Airflow.
## Features
- **ETL Processing of MySQL tables**:
  - **T_CATEGORY**: product categories.
  - **T_CUSTOMER**: customer data and their addresses.
  - **T_ORDER**: information about orders.
  - **T_ORDER_REL**: information about products in orders.
  - **T_PRODUCT**: product information.
  - **T_PROMO and T_PROMO_REL**: information about promotions and the products affected by them.
- **Data Transformation**:
  - Transforms data from OLTP format to a star schema suitable for analytical queries.
- **Orchestration**:
  - Utilizes Apache Airflow for scheduling and managing the ETL workflows.
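
To make the OLTP-to-star-schema step more concrete, here is a minimal sketch of the reshaping idea. It is not the project's code: the real jobs run in Kotlin on Spark, while this uses plain Python purely for illustration. Every column name (`order_id`, `customer_id`, `unit_price_cents`, and so on), the sample rows, and the `date_key` format are assumptions, since this README does not describe the actual table layouts.

```python
from datetime import date

# Hypothetical rows standing in for T_ORDER and T_ORDER_REL.
# The real column names are not documented in this README.
orders = [
    {"order_id": 1, "customer_id": 10, "order_date": date(2024, 7, 23)},
]
order_lines = [
    {"order_id": 1, "product_id": 100, "quantity": 2, "unit_price_cents": 999},
    {"order_id": 1, "product_id": 101, "quantity": 1, "unit_price_cents": 2500},
]


def date_key(d):
    """Encode a date as an integer surrogate key, e.g. 2024-07-23 -> 20240723."""
    return d.year * 10000 + d.month * 100 + d.day


def build_date_dimension(orders):
    """Return one dimension row per distinct order date."""
    distinct_dates = sorted({o["order_date"] for o in orders})
    return [
        {"date_key": date_key(d), "year": d.year, "month": d.month, "day": d.day}
        for d in distinct_dates
    ]


def build_sales_fact(orders, order_lines):
    """Return one fact row per order line: dimension keys plus additive measures."""
    orders_by_id = {o["order_id"]: o for o in orders}
    facts = []
    for line in order_lines:
        order = orders_by_id[line["order_id"]]
        facts.append({
            "date_key": date_key(order["order_date"]),
            "customer_key": order["customer_id"],
            "product_key": line["product_id"],
            "quantity": line["quantity"],
            "amount_cents": line["quantity"] * line["unit_price_cents"],
        })
    return facts


date_dim = build_date_dimension(orders)
sales_fact = build_sales_fact(orders, order_lines)
```

The fact table holds one row per order line with only keys and numeric measures, while descriptive attributes live in dimension tables. That separation is what makes a star schema convenient for analytical queries.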
## Technologies Used
- **Programming Language**: Kotlin
- **Data Processing Framework**: Apache Spark v3.3.2
- **Workflow Orchestration**: Apache Airflow
## Getting Started
### Prerequisites
- **Kotlin**: Ensure you have Kotlin installed. [Install Kotlin](https://kotlinlang.org/docs/tutorials/command-line.html).
- **Apache Spark v3.3.2**: Ensure you have Apache Spark installed. [Install Spark](https://spark.apache.org/downloads.html).
- **Apache Airflow**: Ensure you have Apache Airflow installed. [Install Airflow](https://airflow.apache.org/docs/apache-airflow/stable/start.html).
### Installation
1. Clone the repository:
```bash
git clone https://github.com/MFurmanczyk/wh-sales.git
cd wh-sales
```
2. Build the project:
```bash
./gradlew shadowJar
```
3. Set up Apache Airflow:
```bash
cd airflow
docker-compose up
```
4. Run the ETL by starting the Apache Airflow web server and scheduler in the background:
```bash
docker-compose up -d
```
5. Move `dag.py` to Airflow's DAGs folder.
6. Access the Airflow UI at http://localhost:8080 and trigger the ETL DAG (`sales_dag`).
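
If you would rather trigger the DAG from a script than from the UI, the following is a minimal sketch using only the Python standard library. It assumes an Airflow 2.x deployment with the stable REST API and basic authentication enabled; this README does not state the Airflow version or the auth setup, so treat both as assumptions. The helper names `build_trigger_request` and `send` and the `airflow`/`airflow` credentials in the usage comment are hypothetical.

```python
import base64
import json
import urllib.request

AIRFLOW_URL = "http://localhost:8080"  # UI address given in this README
DAG_ID = "sales_dag"                   # DAG name given in this README


def build_trigger_request(base_url, dag_id, username, password):
    """Build (without sending) a POST request that asks Airflow to start a DAG run.

    Assumes the Airflow 2.x stable REST API path /api/v1/dags/{dag_id}/dagRuns
    and that the basic-auth API backend is enabled.
    """
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return urllib.request.Request(
        url=f"{base_url}/api/v1/dags/{dag_id}/dagRuns",
        data=json.dumps({"conf": {}}).encode(),
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request):
    """Send the request and return the decoded JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Example usage (requires a running Airflow instance; credentials are placeholders):
#   run = send(build_trigger_request(AIRFLOW_URL, DAG_ID, "airflow", "airflow"))
#   print(run["dag_run_id"])
```

Newly added DAGs are often paused by default. In that case a triggered run will most likely stay queued until the DAG is unpaused in the UI.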
## License
This project is licensed under the MIT License - see the LICENSE file for details.