![1](https://github.com/user-attachments/assets/5b98ca67-3770-4d4a-b444-ad8b70c40557)

# Enterprise-Grade Offline Data Warehouse Solution for E-Commerce



This project aims to build an enterprise-grade offline data warehouse solution based on e-commerce platform order data. By leveraging **Docker containers** to simulate a big data platform, it achieves a complete workflow from ETL processing to data warehouse modeling, OLAP analysis, and data visualization.

The core value of this project lies in its implementation of **enterprise-grade data warehouse modeling**, integrating e-commerce order data with relevant business themes through standardized dimension modeling and fact table design, ensuring data accuracy, consistency, and traceability. Meanwhile, **the deployment of a big data cluster via Docker containers** simplifies environment management, lowers operational costs, and offers a flexible deployment model for distributed batch processing powered by Spark. Additionally, the project incorporates **CI/CD automation**, enabling rapid iterations while maintaining the stability and reliability of the data pipeline. Storage and computation are also **highly optimized** to maximize hardware resource utilization.

To monitor and manage the system effectively, a **Grafana-based cluster monitoring system** has been implemented, providing real-time insights into cluster health metrics and assisting in performance tuning and capacity planning. Finally, by integrating **business intelligence (BI) and visualization solutions**, the project transforms complex data warehouse analytics into intuitive dashboards and reports, allowing business teams to make data-driven decisions more efficiently.

By combining these critical features, including:

| ✅ Core Feature | 🔥 Core Highlights | 📦 Deliverables |
|-----------|------------------|---------------|
| **1. [Data Warehouse Modeling and Documentation](https://github.com/Smars-Bin-Hu/EComDWH-BatchDataProcessingPlatform/tree/main?tab=readme-ov-file#1-data-warehouse-modeling-and-documentation)** | - Full dimensional modeling process (Star Schema / Snowflake Schema)<br>- Standardized development norms (ODS/DWD/DWM/DWS/ADS five-layer modeling)<br>- Business matrix: defining & managing dimensions & fact tables | - Data warehouse design document (Markdown/PDF)<br>- Hive SQL modeling code<br>- Database ER diagram |
| **2. [Cluster Deployment](https://github.com/Smars-Bin-Hu/EComDWH-BatchDataProcessingPlatform/tree/main?tab=readme-ov-file#2-a-self-built-distributed-big-data-platform)** | - Fully containerized deployment with Docker for quick replication<br>- High-availability environment: Hadoop + Hive + Spark + Zookeeper + ClickHouse | - Docker images (open-source Dockerfile)<br>- `.env` configuration file<br>- `docker-compose.yml` (one-click cluster startup)<br>- Infra configuration files (Hadoop, Hive, Spark, Zookeeper) |
| **3. [Distributed Batch Processing](https://github.com/Smars-Bin-Hu/EComDWH-BatchDataProcessingPlatform/tree/main?tab=readme-ov-file#3-distributed-batch-processing)** | - ETL processing using Spark for Oracle relational data<br>- Multi-layer processing: ODS → DWD → DWM → DWS → ADS<br>- Efficient data transformation & aggregation | - Spark ETL code (PySpark)<br>- SparkSQL scripts<br>- Data flow diagram |
| **4. [CI/CD Automation](https://github.com/Smars-Bin-Hu/EComDWH-BatchDataProcessingPlatform/tree/main?tab=readme-ov-file#4-cicd-automation)** | - Automated Airflow DAG deployment (auto-sync with code updates)<br>- Automated Spark job submission (eliminates manual `spark-submit`)<br>- Hive table schema change detection (automatic alerts) | - GitHub Actions / Jenkins pipeline<br>- CI/CD code and documentation<br>- Sample log screenshots |
| **5. [Storage & Computation Optimization](https://github.com/Smars-Bin-Hu/EComDWH-BatchDataProcessingPlatform/tree/main?tab=readme-ov-file#5-storage--computation-optimization)** | - SQL optimization (dynamic partitioning, indexing, storage partitioning)<br>- Spark tuning: salting, skew join hints, broadcast joins, `reduceByKey` vs. `groupByKey`<br>- Hive tuning: Z-Order sorting (boosts ClickHouse queries), Parquet + Snappy compression | - Pre- and post-optimization performance comparison<br>- Spark optimization code<br>- SQL execution plan screenshots |
| **6. [DevOps - Monitoring and Alerting](https://github.com/Smars-Bin-Hu/EComDWH-BatchDataProcessingPlatform/tree/main?tab=readme-ov-file#6-devops---monitoring-and-alerting)** | - Prometheus + Grafana performance monitoring for the Hadoop cluster and MySQL<br>- AlertManager for alerting and email notifications | - Prometheus and Grafana configuration files<br>- Grafana dashboard screenshots |
| **7. [Business Intelligence & Visualization](https://github.com/Smars-Bin-Hu/EComDWH-BatchDataProcessingPlatform/tree/main?tab=readme-ov-file#7-business-intelligence--visualization)** | - PowerBI dashboards for data analysis<br>- Real business-driven visualizations<br>- Actionable business insights | - PowerBI visualization screenshots<br>- Business analysis report<br>- Key business metric explanations (BI insights) |

this project delivers a professional, robust, and highly efficient solution for enterprises dealing with large-scale data processing and analytics.

## โš™๏ธ Core Deliverables

### 1. Data Warehouse Modeling and Documentation

This project demonstrates my ability to build a data warehouse from the ground up following enterprise-grade standards. I independently designed and documented a complete SOP for data warehouse development, covering every critical step in the modeling roadmap. From initial business data research to final model delivery, I established a standardized methodology that ensures clarity, scalability, and maintainability. The SOP includes detailed best practices on data warehouse layering, table naming conventions, field naming rules, and lifecycle management for warehouse tables. For more information, please refer to the documentation below.

🔗 DWH Dimensional Modelling Documents and Code:

- [DWH Modelling Standard Operation Procedure (SOP)](./docs/doc/dwh-modelling-sop.md)
- [Business Data Research](./docs/doc/business_data_research.md)

Data Warehouse Development Specification

- [Data Warehouse Layering Specification](./docs/doc/data-warehouse-development-specification/data-warehouse-layering-specification.md)
- [Table Naming Conventions](./docs/doc/data-warehouse-development-specification/table-naming-convertions.md)
- [Data Warehouse Column Naming Conventions](./docs/doc/data-warehouse-development-specification/partitioning-column-naming-conventions.md)
- [Data Table Lifecycle Management Specification](./docs/doc/data-warehouse-development-specification/data-table-lifecycle-management-specification.md)

[🔨 Code - Hive DDL](https://github.com/Smars-Bin-Hu/EComDWH-BatchDataProcessingPlatform/tree/main/src/warehouse_modeling) for all data warehouse layers: ods, dwd, dwm, dws, dwt, dim (Operational Data Store, DW Detail, DW Middle, DW Summary, DW Theme, DW Dimension), plus the analytical data store in ClickHouse
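
To make these conventions concrete, here is a minimal illustrative sketch (the table and column names are hypothetical, not taken from the repository) of an ODS-layer Hive DDL issued through PySpark: a layer-prefixed table name, a date partition column, and Snappy-compressed Parquet storage.

```python
# Hypothetical example of the layering/naming conventions: an ODS-layer table
# named with its layer prefix and partitioned by an ingestion-date column.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("dwh-modeling-example")
    .enableHiveSupport()  # route DDL to the Hive metastore
    .getOrCreate()
)

spark.sql("""
    CREATE TABLE IF NOT EXISTS ods.ods_sales_order_full (
        order_id     BIGINT         COMMENT 'order primary key',
        customer_id  BIGINT         COMMENT 'customer foreign key',
        order_status STRING         COMMENT 'order status code',
        order_amount DECIMAL(16,2)  COMMENT 'order amount',
        create_time  TIMESTAMP      COMMENT 'order creation time'
    )
    COMMENT 'ODS full snapshot of the OLTP sales orders'
    PARTITIONED BY (dt STRING COMMENT 'ingestion date, yyyy-MM-dd')
    STORED AS PARQUET
    TBLPROPERTIES ('parquet.compression' = 'SNAPPY')
""")
```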

![image](https://github.com/user-attachments/assets/ec924ea9-1acf-48a3-99ba-1546c1e8c3a9)

Figure 1: DWH Dimensional Modelling SOP

![image](https://github.com/user-attachments/assets/ab21c750-052f-4c10-baf0-bc97e5ed8274)

Figure 2: DWH Dimensional Modelling Methodology Diagram

![ECom-DWH-Pipeline](https://github.com/Smars-Bin-Hu/my-draw-io/blob/main/ECom-DWH-Datapipeline-Proejct/ECom-DWH-Pipeline.drawio.svg)

Figure 3: DWH Dimensional Modelling Architecture

### 2. A Self-Built Distributed Big Data Platform

This distributed data platform was built entirely from scratch by myself. Starting with a base Ubuntu 20.04 Docker image, I manually installed and configured each component step by step, ultimately creating a fully functional three-node Hadoop cluster with distributed storage and computing capabilities. The platform is fully containerized, featuring a highly available HDFS and YARN architecture. It supports Hive for data warehousing, Spark for distributed computing, Airflow for workflow orchestration, and Prometheus + Grafana for performance monitoring. A MySQL container manages metadata for both Hive and Airflow and is also monitored by Prometheus. An Oracle container simulates the backend of a business system and serves as the data source for the data warehouse. All container images are open-sourced and published to the [🔨 GitHub Container Registry](https://github.com/Smars-Bin-Hu/EComDWH-BatchDataProcessingPlatform/pkgs/container/proj1-dwh-cluster), making it easy for anyone to deploy the same platform locally.

[🔨 Code - Docker Compose File](https://github.com/Smars-Bin-Hu/EComDWH-BatchDataProcessingPlatform/blob/main/docker-compose-bigdata.yml)

[🔨 Code - Configuration Files for the Cluster: Hadoop, ZooKeeper, Hive, MySQL, Spark, Prometheus & Grafana, Airflow](https://github.com/Smars-Bin-Hu/EComDWH-BatchDataProcessingPlatform/tree/main/src/infra)

[🔨 Code - Container Internal Scripts: Hadoop, ZooKeeper, Hive, MySQL, Spark, Prometheus & Grafana, Airflow](https://github.com/Smars-Bin-Hu/EComDWH-BatchDataProcessingPlatform/tree/main/src/scripts)

[🔨 Code - Commonly Used Snippets for the Cluster: Hadoop, ZooKeeper, Hive, MySQL, Spark, Prometheus & Grafana, Airflow](https://github.com/Smars-Bin-Hu/EComDWH-BatchDataProcessingPlatform/tree/main/src/snippets)
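
As a quick smoke test of the containerized cluster, a sketch like the one below could be run from a client container. It assumes only that `HADOOP_CONF_DIR` points at the mounted cluster configs; the `mycluster` HA nameservice is a hypothetical placeholder, not necessarily the name used in this repository.

```python
# Minimal sketch: start a Spark session on YARN and round-trip a tiny
# DataFrame through HA HDFS to confirm storage and compute both work.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cluster-smoke-test")
    .master("yarn")  # ResourceManager and HDFS HA addresses come from HADOOP_CONF_DIR
    .getOrCreate()
)

df = spark.range(10)  # ten rows, one `id` column
df.write.mode("overwrite").parquet("hdfs://mycluster/tmp/smoke_test")  # hypothetical nameservice
print(spark.read.parquet("hdfs://mycluster/tmp/smoke_test").count())   # expect 10
spark.stop()
```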


Figure 1: All Containers Window

![ECom-DWH-Pipeline](https://github.com/Smars-Bin-Hu/my-draw-io/blob/main/ECom-DWH-Datapipeline-Proejct/ECom-DWH-Tech-Arc.drawio.svg)

Figure 2: Data Platform Architecture

### 3. Distributed Batch Processing

1. [🔨 Code - Extract and Load pipeline (OLTP -> DWH, DWH -> OLAP)](https://github.com/Smars-Bin-Hu/EComDWH-BatchDataProcessingPlatform/tree/main/src/data_pipeline) (see the sketch after this list)

2. [🔨 Code - Batch Processing (Transform)](https://github.com/Smars-Bin-Hu/EComDWH-BatchDataProcessingPlatform/tree/main/src/batch_processing)

3. 🔨 Code - Scheduling based on Airflow (DAG)
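
The sketch below condenses the Extract-and-Load idea behind items 1 and 2: read one business table from the Oracle container over JDBC and land it in the ODS layer as a partitioned snapshot. The JDBC URL, credentials, and table names are hypothetical stand-ins for the repository's actual configuration.

```python
# Hypothetical Extract-and-Load sketch (OLTP -> DWH), not the project's code.
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .appName("oracle-to-ods")
    .enableHiveSupport()
    .getOrCreate()
)

# Extract: pull one table from the Oracle business database over JDBC.
src = (
    spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//oracle-oltp:1521/ORCLPDB1")  # hypothetical host/service
    .option("dbtable", "SALES.ORDERS")                               # hypothetical source table
    .option("user", "etl_user")
    .option("password", "***")
    .option("driver", "oracle.jdbc.OracleDriver")
    .load()
)

# Load: land the snapshot into the ODS layer, partitioned by ingestion date.
(src.withColumn("dt", F.lit("2025-03-04"))  # in production the run date comes from the scheduler
    .write.mode("overwrite")
    .partitionBy("dt")
    .format("parquet")
    .saveAsTable("ods.ods_sales_order_full"))
```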

### 4. CI/CD Automation



1. GitHub Actions Code

[🔨 Code - workflows.main YAML](https://github.com/Smars-Bin-Hu/EComDWH-BatchDataProcessingPlatform/blob/main/.github/workflows/main.yml)

2. Key Screenshots


Figure 1: Data platform launching and stop automation


Figure 2: Sample Log Screenshot I


Figure 3: Sample Log Screenshot II

3. [🔗 Link - Automation Workflow Web UI](https://github.com/Smars-Bin-Hu/EComDWH-BatchDataProcessingPlatform/actions)

### 5. Storage & Computation Optimization
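
The feature table above names the concrete tunings (dynamic partitioning, salting, skew join hints, broadcast joins, Parquet + Snappy). As a flavor of the Spark-side techniques, here is a minimal sketch of a broadcast join and key salting; the table and column names are hypothetical, not the project's actual optimization code.

```python
# Sketch of two Spark tunings named in the feature table: broadcast join
# and key salting. Table/column names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("skew-tuning-demo").getOrCreate()

facts = spark.table("dwd.dwd_sales_order_detail")  # large, skewed fact table (hypothetical)
dims = spark.table("dim.dim_customer")             # small dimension table (hypothetical)

# 1) Broadcast join: ship the small dimension to every executor, avoiding a shuffle.
joined = facts.join(F.broadcast(dims), "customer_id")

# 2) Salting: spread each hot join key over N buckets so no single task owns it all.
N = 16
salted_facts = facts.withColumn("salt", (F.rand() * N).cast("long"))
replicated_dims = dims.crossJoin(
    spark.range(N).withColumnRenamed("id", "salt")  # replicate every dim row N times
)
joined_skew_safe = salted_facts.join(replicated_dims, ["customer_id", "salt"])
```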

### 6. DevOps - Monitoring and Alerting

[🔨 Code - Monitoring Services Configuration Files: Prometheus, Grafana, AlertManager](https://github.com/Smars-Bin-Hu/EComDWH-BatchDataProcessingPlatform/tree/main/src/infra/monitoring-config)

[🔨 Code - Monitoring Services Start & Stop Scripts: Prometheus, Grafana, AlertManager](https://github.com/Smars-Bin-Hu/EComDWH-BatchDataProcessingPlatform/tree/main/src/scripts/monitoring)

[🔨 Code - Container Metrics Exporter Start & Stop Scripts: `my-start-node-exporter.sh` & `my-stop-node-exporter.sh`](https://github.com/Smars-Bin-Hu/EComDWH-BatchDataProcessingPlatform/tree/main/src/scripts/hadoop-master)

Figure 1: Prometheus

Figure 2: Grafana-Hadoop-Cluster-instance-hadoop-master

Figure 3: Grafana-MySQLD

### 7. Business Intelligence & Visualization

[🔗 Link - PowerBI Public Access (link may expire)](https://app.powerbi.com/view?r=eyJrIjoiMzVjYTQ3NmMtODllZS00N2JhLWFkNWItMWI4MmYyNDZjMDc1IiwidCI6IjI0MGI3OWM1LTZiZWYtNDYwOC1hNDE3LTY1NjllODQzNTQ1YyJ9)

Microsoft PowerBI connects to ClickHouse and extracts data from the **analytical data storage** layer.

Figure 1: PowerBI Dashboard Demo
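
Behind that dashboard sits the DWH -> OLAP load from section 3. A minimal sketch of exporting an ADS-layer table to ClickHouse over JDBC might look like this; the host, database, credentials, and table name are hypothetical, and the ClickHouse JDBC driver jar is assumed to be on the Spark classpath.

```python
# Hypothetical sketch of the ADS -> ClickHouse export that PowerBI reads from.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("ads-to-clickhouse")
    .enableHiveSupport()
    .getOrCreate()
)

ads = spark.table("ads.ads_sales_daily_summary")  # hypothetical ADS-layer table

(ads.write.format("jdbc")
    .option("url", "jdbc:clickhouse://clickhouse-olap:8123/ads")  # hypothetical host/db
    .option("dbtable", "ads_sales_daily_summary")
    .option("user", "default")
    .option("password", "***")
    .option("driver", "com.clickhouse.jdbc.ClickHouseDriver")
    .mode("append")
    .save())
```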

## Tech Stack

This project sets up a high-availability big data platform, including the following components:

![Apache Spark](https://img.shields.io/badge/Spark-FDEE21?style=for-the-badge&logo=apachespark&logoColor=black) ![Apache Hadoop](https://img.shields.io/badge/Hadoop-66CCFF?style=for-the-badge&logo=apachehadoop&logoColor=black) ![Apache ZooKeeper](https://img.shields.io/badge/Zookeeper-8e8c3a?style=for-the-badge&color=8e8c3a) ![Apache Airflow](https://img.shields.io/badge/Airflow-017CEE?style=for-the-badge&logo=apacheairflow&logoColor=white) ![Apache Hive](https://img.shields.io/badge/Hive-FDEE21?style=for-the-badge&logo=apachehive&logoColor=black) ![ClickHouse](https://img.shields.io/badge/ClickHouse-FFCC01?style=for-the-badge&logo=clickhouse&logoColor=white) ![Prometheus](https://img.shields.io/badge/Prometheus-f2f2e8?style=for-the-badge&logo=prometheus&color=f2f2e8) ![Grafana](https://img.shields.io/badge/Grafana-252523?style=for-the-badge&logo=grafana&color=252523) ![MySQL](https://img.shields.io/badge/MySQL-blue?style=for-the-badge&logo=mysql&logoColor=yellow&color=blue) ![Oracle Database](https://img.shields.io/badge/Oracle_Database-red?style=for-the-badge&color=red) ![Microsoft PowerBI](https://img.shields.io/badge/Microsoft_PowerBI-pink?style=for-the-badge&color=pink)
![Docker](https://img.shields.io/badge/docker-%230db7ed.svg?style=for-the-badge&logo=docker&logoColor=white)

| Components | Features | Version |
|------------------------|--------------------------------|---------|
| Apache Hadoop | Big Data Distributed Framework | 3.2.4 |
| Apache Zookeeper | High Availability | 3.8.4 |
| Apache Spark | Distributed Computing | 3.3.0 |
| Apache Hive | Data Warehousing | 3.1.3 |
| Apache Airflow | Workflow Scheduling | 2.7.2 |
| MySQL | Metastore | 8.0.39 |
| Oracle Database | Data Source (OLTP) | 19.0.0 |
| Azure Cloud ClickHouse | OLAP Analysis | 24.12 |
| Microsoft PowerBI | BI Dashboard | latest |
| Prometheus | Monitoring | 2.52.0 |
| Grafana | Monitoring GUI | 10.3.1 |
| Docker | Containerization | 28.0.1 |

## ๐Ÿ“ Project Directory

```bash
/bigdata-datawarehouse-project
│── /.github/workflows            # CI/CD automation workflows via GitHub Actions
│── /docs                         # docs (all business and technology documents about this project)
│── /src
│   │── /data_pipeline            # data pipeline code (ETL/ELT logic, output)
│   │── /warehouse_modeling       # DWH modelling (Hive SQL etc.)
│   │── /batch_processing         # data batch processing (PySpark + SparkSQL)
│   │── /dags                     # task scheduling (Airflow)
│   │── /infra                    # infrastructure deployment (Docker, configuration files)
│   │── /snippets                 # commonly used commands and snippets
│   │── /README                   # source code usage instructions (markdown files)
│   │── README.md                 # navigation for the source code usage instructions
│   │── main_data_pipeline.py     # runs the data_pipeline module for the `Extract` and `Load` jobs
│   │── main_batch_processing.py  # runs the batch_processing module for the `Transform` jobs
│── /tests                        # unit-testing snippets for small features (DWH modelling, data pipeline, dags etc.)
│── README.md                     # introduction to the project
│── docker-compose-bigdata.yml    # Docker Compose file to launch the container cluster
│── .env                          # `published on purpose` for use by the docker-compose file
│── .gitignore                    # keeps some directories from being committed to the remote repo
│── .gitattributes                # Git repository attributes config
│── LICENSE                       # copyright for this project
│── mysql-metadata-restore.sh     # container-level script: restore MySQL container metadata
│── mysql-metastore-dump.sh       # container-level script: dump MySQL container metadata
│── push-to-ghcr.sh               # container-level script: push the images to GitHub Container Registry
│── start-data-clients.sh         # container-level script: start Hive, Spark etc.
│── start-hadoop-cluster.sh       # container-level script: start the Hadoop HA cluster
│── start-other-services.sh       # container-level script: start Airflow, Prometheus, Grafana etc.
│── stop-data-clients.sh          # container-level script: stop Hive, Spark etc.
│── stop-hadoop-cluster.sh        # container-level script: stop the Hadoop HA cluster
│── stop-other-services.sh        # container-level script: stop Airflow, Prometheus, Grafana etc.
```

## 🚀 Quick Start `/src`

### [๐Ÿ”— Source Code Instruction for Use](./src/README.md)

## 📌 Project Documents `/docs`

#### 1. Business Logic & Tech Selection

- Business Logic
- [Project Tech Architecture](./docs/doc/tech-architecture.md)

#### 2. Development Specification

[DWH Modelling Standard Operation Procedure (SOP)](./docs/doc/dwh-modelling-sop.md)

- [Business Data Research](./docs/doc/business_data_research.md)
- Data Warehouse Development Specification
  - [Data Warehouse Layering Specification](./docs/doc/data-warehouse-development-specification/data-warehouse-layering-specification.md)
  - [Table Naming Conventions](./docs/doc/data-warehouse-development-specification/table-naming-convertions.md)
  - [Data Warehouse Column Naming Conventions](./docs/doc/data-warehouse-development-specification/partitioning-column-naming-conventions.md)
  - [Data Table Lifecycle Management Specification](./docs/doc/data-warehouse-development-specification/data-table-lifecycle-management-specification.md)
- Python Development Specification
  - Package Modularization
- SQL Development Specification
  - [Development Specification](./docs/doc/data-warehouse-development-specification/development-specification.md)

#### 3. Troubleshooting

- [Future Bugs to Fix](./docs/doc/error-handling/future-fix.md)
- [04_MAR_2025](./docs/doc/error-handling/04_MAR_2025.md)
- [05_MAR_2025](./docs/doc/error-handling/05_MAR_2025.md)
- [06_MAR_2025](./docs/doc/error-handling/06_MAR_2025.md)

#### 4. Infrastructure & Building

- Core architecture: Docker container distribution diagram
- Hadoop 3-node cluster setup and configuration
- Hive node setup and configuration
- Spark node setup and configuration
- MySQL node setup and configuration
- Oracle node setup and configuration
- Airflow node setup and configuration (MySQL and LocalExecutor settings in `airflow.cfg`)
- `docker-compose` file configuration

#### 5. Development

- Data Warehousing
  - ods
  - dwd
- Data Pipeline ETL
  - Spark on YARN to connect Oracle (Hello World)
  - Spark to extract data and load it to HDFS
  - OOP
- Scheduler (Airflow)
  - some files under /scripts

#### 6. Optimization

- [Too many INFO logs: Reducing Spark Console Log Levels](./docs/doc/optimization/reducing-spark-console-log-levels.md)
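
For reference, the runtime equivalent of that fix (a sketch; the linked note may instead configure `log4j.properties`) is a single call on the SparkContext:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("quiet-logs").getOrCreate()
# Show warnings and errors only, instead of flooding the console with INFO lines.
spark.sparkContext.setLogLevel("WARN")
```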

#### 7. Testing
- spark_connect_oracle.py

## License

This project is licensed under the MIT License - see the [LICENSE](./LICENSE) file for details.
Created and maintained by **Smars-Bin-Hu**.