# Enterprise-Grade Offline Data Warehouse Solution for E-Commerce
This project aims to build an enterprise-grade offline data warehouse solution based on e-commerce platform order data. By leveraging **Docker containers** to simulate a big data platform, it achieves a complete workflow from ETL processing to data warehouse modeling, OLAP analysis, and data visualization.
The core value of this project lies in its implementation of **enterprise-grade data warehouse modeling**, integrating e-commerce order data with relevant business themes through standardized dimensional modeling and fact table design, ensuring data accuracy, consistency, and traceability. Meanwhile, **deploying the big data cluster via Docker containers** simplifies environment management and reduces operational costs, offering a flexible deployment model for distributed batch processing powered by Spark. The project also incorporates **CI/CD automation**, enabling rapid iteration while maintaining the stability and reliability of the data pipeline. Storage and computation are **highly optimized** to maximize hardware resource utilization.
To monitor and manage the system effectively, a **Grafana-based cluster monitoring system** has been implemented, providing real-time insights into cluster health metrics and assisting in performance tuning and capacity planning. Finally, by integrating **business intelligence (BI) and visualization solutions**, the project transforms complex data warehouse analytics into intuitive dashboards and reports, allowing business teams to make data-driven decisions more efficiently.
By combining the critical features summarized below, this project delivers a professional, robust, and highly efficient solution for enterprises dealing with large-scale data processing and analytics.
| ✅ Core Feature | 🔥 Core Highlights | 📦 Deliverables |
|-----------|------------------|---------------|
| **1. [Data Warehouse Modeling and Documentation](https://github.com/Smars-Bin-Hu/EComDWH-BatchDataProcessingPlatform/tree/main?tab=readme-ov-file#1-data-warehouse-modeling-and-documentation)** | - Full dimensional modeling process (Star Schema / Snowflake Schema)<br>- Standardized development norms (ODS/DWD/DWM/DWS/ADS five-layer modeling)<br>- Business matrix: defining & managing dimensions & fact tables | - Data warehouse design document (Markdown/PDF)<br>- Hive SQL modeling code<br>- Database ER diagram |
| **2. [Cluster Deployment](https://github.com/Smars-Bin-Hu/EComDWH-BatchDataProcessingPlatform/tree/main?tab=readme-ov-file#2-a-self-built-distributed-big-data-platform)** | - Fully containerized deployment with Docker for quick replication<br>- High-availability environment: Hadoop + Hive + Spark + Zookeeper + ClickHouse | - Docker images (open-source Dockerfile)<br>- `.env` configuration file<br>- `docker-compose.yml` (one-click cluster startup)<br>- Infra configuration files (Hadoop, Hive, Spark, Zookeeper) |
| **3. [Distributed Batch Processing](https://github.com/Smars-Bin-Hu/EComDWH-BatchDataProcessingPlatform/tree/main?tab=readme-ov-file#3-distributed-batch-processing)** | - ETL processing using Spark for Oracle relational data<br>- Multi-layer processing: ODS → DWD → DWM → DWS → ADS<br>- Efficient data transformation & aggregation | - Spark ETL code (PySpark)<br>- SparkSQL scripts<br>- Data flow diagram |
| **4. [CI/CD Automation](https://github.com/Smars-Bin-Hu/EComDWH-BatchDataProcessingPlatform/tree/main?tab=readme-ov-file#4-cicd-automation)** | - Automated Airflow DAG deployment (auto-sync with code updates)<br>- Automated Spark job submission (eliminates manual `spark-submit`)<br>- Hive table schema change detection (automatic alerts) | - GitHub Actions / Jenkins pipeline<br>- CI/CD code and documentation<br>- Sample log screenshots |
| **5. [Storage & Computation Optimization](https://github.com/Smars-Bin-Hu/EComDWH-BatchDataProcessingPlatform/tree/main?tab=readme-ov-file#5-storage--computation-optimization)** | - SQL optimization (dynamic partitioning, indexing, storage partitioning)<br>- Spark tuning: salting, skew join hints, broadcast joins, `reduceByKey` vs. `groupByKey`<br>- Hive tuning: Z-Order sorting (boosts ClickHouse queries), Parquet + Snappy compression | - Pre- and post-optimization performance comparison<br>- Spark optimization code<br>- SQL execution plan screenshots |
| **6. [DevOps - Monitoring and Alerting](https://github.com/Smars-Bin-Hu/EComDWH-BatchDataProcessingPlatform/tree/main?tab=readme-ov-file#6-devops---monitoring-and-alerting)** | - Prometheus + Grafana performance monitoring for the Hadoop cluster / MySQL<br>- AlertManager for alerting and email notifications | - Prometheus and Grafana configuration files<br>- Grafana dashboard screenshots |
| **7. [Business Intelligence & Visualization](https://github.com/Smars-Bin-Hu/EComDWH-BatchDataProcessingPlatform/tree/main?tab=readme-ov-file#7-business-intelligence--visualization)** | - PowerBI dashboards for data analysis<br>- Real business-driven visualizations<br>- Actionable business insights | - PowerBI visualization screenshots<br>- Business analysis report<br>- Key business metric explanations (BI insights) |
## ⚙️ Core Deliverables
### 1. Data Warehouse Modeling and Documentation
This project demonstrates my ability to build a data warehouse from the ground up following enterprise-grade standards. I independently designed and documented a complete SOP for data warehouse development, covering every critical step in the modeling roadmap. From initial business data research to final model delivery, I established a standardized methodology that ensures clarity, scalability, and maintainability. The SOP includes detailed best practices on data warehouse layering, table naming conventions, field naming rules, and lifecycle management for warehouse tables. For more information, please refer to the documentation below.
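To make the layering and naming conventions concrete, here is a minimal sketch of how an ODS-layer Hive table might be declared via SparkSQL, assuming a reachable Hive metastore. The database, table, and column names are hypothetical illustrations of the conventions, not the project's actual DDL (that lives in the Hive DDL link below).

```python
# Minimal sketch (hypothetical names): declaring a layered Hive table with
# SparkSQL. The real DDL for all layers is in src/warehouse_modeling.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("dwh-modeling-sketch")
    .enableHiveSupport()  # requires a reachable Hive metastore
    .getOrCreate()
)

spark.sql("CREATE DATABASE IF NOT EXISTS ods")

# ODS layer: raw order data landed from the OLTP source, partitioned by load date.
spark.sql("""
    CREATE TABLE IF NOT EXISTS ods.ods_sales_order_info (
        order_id      BIGINT,
        customer_id   BIGINT,
        order_status  STRING,
        order_amount  DECIMAL(16, 2),
        created_at    TIMESTAMP
    )
    PARTITIONED BY (dt STRING)     -- partition column per the naming conventions
    STORED AS PARQUET              -- columnar storage, pairs well with Snappy
""")
```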
👇 Click to Show DWH Dimensional Modelling Documents and Code
- [DWH Modelling Standard Operation Procedure (SOP)](./docs/doc/dwh-modelling-sop.md)
- [Business Data Research](./docs/doc/business_data_research.md)
- Data Warehouse Development Specification
  - [Data Warehouse Layering Specification](./docs/doc/data-warehouse-development-specification/data-warehouse-layering-specification.md)
  - [Table Naming Conventions](./docs/doc/data-warehouse-development-specification/table-naming-convertions.md)
  - [Data Warehouse Column Naming Conventions](./docs/doc/data-warehouse-development-specification/partitioning-column-naming-conventions.md)
  - [Data Table Lifecycle Management Specification](./docs/doc/data-warehouse-development-specification/data-table-lifecycle-management-specification.md)
- [🔨 Code - Hive DDL](https://github.com/Smars-Bin-Hu/EComDWH-BatchDataProcessingPlatform/tree/main/src/warehouse_modeling) for all warehouse layers: ods, dwd, dwm, dws, dwt, dim (Operational Data Store, DW detail, DW middle, DW summary, DW theme, DW dimension) plus the ClickHouse analytical data storage layer

Figure 1: DWH Dimensional Modelling SOP

Figure 2: DWH Dimensional Modelling Methodology Diagram

Figure 3: DWH Dimensional Modelling Architecture
### 2. A Self-Built Distributed Big Data Platform
This distributed data platform was built entirely from scratch. Starting from a base Ubuntu 20.04 Docker image, I manually installed and configured each component step by step, ultimately creating a fully functional three-node Hadoop cluster with distributed storage and computing capabilities. The platform is fully containerized, featuring a highly available HDFS and YARN architecture. It supports Hive for data warehousing, Spark for distributed computing, Airflow for workflow orchestration, and Prometheus + Grafana for performance monitoring. A MySQL container manages metadata for both Hive and Airflow and is also monitored by Prometheus. An Oracle container simulates the backend of a business system and serves as a data source for the data warehouse. All container images are open-sourced and published to [🔨 GitHub Container Registry](https://github.com/Smars-Bin-Hu/EComDWH-BatchDataProcessingPlatform/pkgs/container/proj1-dwh-cluster), making it easy for anyone to deploy the same platform locally.
[🔨 Code - Docker Compose File](https://github.com/Smars-Bin-Hu/EComDWH-BatchDataProcessingPlatform/blob/main/docker-compose-bigdata.yml)
[🔨 Code - Cluster Configuration Files: Hadoop, ZooKeeper, Hive, MySQL, Spark, Prometheus & Grafana, Airflow](https://github.com/Smars-Bin-Hu/EComDWH-BatchDataProcessingPlatform/tree/main/src/infra)
[🔨 Code - Container Internal Scripts: Hadoop, ZooKeeper, Hive, MySQL, Spark, Prometheus & Grafana, Airflow](https://github.com/Smars-Bin-Hu/EComDWH-BatchDataProcessingPlatform/tree/main/src/scripts)
[🔨 Code - Commonly Used Cluster Snippets: Hadoop, ZooKeeper, Hive, MySQL, Spark, Prometheus & Grafana, Airflow](https://github.com/Smars-Bin-Hu/EComDWH-BatchDataProcessingPlatform/tree/main/src/snippets)
Figure 1: All Containers Window

Figure 2: Data Platform Architecture
### 3. Distributed Batch Processing
1. [🔨 Code - Extract and Load Pipeline (OLTP → DWH, DWH → OLAP)](https://github.com/Smars-Bin-Hu/EComDWH-BatchDataProcessingPlatform/tree/main/src/data_pipeline) (a minimal PySpark sketch of this step follows below)
2. [🔨 Code - Batch Processing (Transform)](https://github.com/Smars-Bin-Hu/EComDWH-BatchDataProcessingPlatform/tree/main/src/batch_processing)
3. 🔨 Code - Scheduling based on Airflow (DAG)
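As a rough illustration of the extract-and-load step, here is a minimal PySpark sketch that pulls one table from Oracle over JDBC and lands it in HDFS as Parquet. The connection URL, credentials, table name, and paths are placeholders, not the project's actual configuration.

```python
# Minimal sketch (hypothetical connection details): extract one Oracle table
# over JDBC and land it in HDFS as Parquet, partitioned by load date.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("oracle-extract-sketch")
    # The Oracle JDBC driver jar must be on the Spark classpath,
    # e.g. via spark.jars or --jars at submit time.
    .getOrCreate()
)

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//oracle-host:1521/ORCLPDB1")  # placeholder
    .option("dbtable", "SALES.ORDER_INFO")                           # placeholder
    .option("user", "etl_user")                                      # placeholder
    .option("password", "***")
    .option("driver", "oracle.jdbc.OracleDriver")
    .load()
)

# Land the raw extract in the ODS zone on HDFS; downstream SparkSQL jobs then
# transform it layer by layer (ODS -> DWD -> DWM -> DWS -> ADS).
df.write.mode("overwrite").parquet(
    "hdfs:///warehouse/ods/ods_sales_order_info/dt=2025-03-01"  # placeholder path
)
```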
### 4. CI/CD Automation
1. GitHub Actions Code
[🔨 Code - workflows.main YAML](https://github.com/Smars-Bin-Hu/EComDWH-BatchDataProcessingPlatform/blob/main/.github/workflows/main.yml)
2. Key Screenshots
Figure 1: Data platform launch and stop automation
Figure 2: Sample Log Screenshot I
Figure 3: Sample Log Screenshot II
3. [🔗 Link - Automation Workflow Web UI](https://github.com/Smars-Bin-Hu/EComDWH-BatchDataProcessingPlatform/actions)
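To show what the automated scheduling and Spark job submission can look like end to end, here is a minimal, hypothetical Airflow DAG that runs the two entry-point scripts via `spark-submit`. The DAG id, schedule, and container paths are illustrative assumptions, not the project's actual DAG code.

```python
# Minimal sketch (hypothetical DAG): schedule the daily batch run and submit
# the Spark jobs with spark-submit instead of running them by hand.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="ecomdwh_daily_batch",   # illustrative name
    start_date=datetime(2025, 3, 1),
    schedule="0 2 * * *",           # run at 02:00 every day
    catchup=False,
) as dag:
    extract_load = BashOperator(
        task_id="extract_and_load",
        bash_command="spark-submit --master yarn /opt/project/main_data_pipeline.py",  # placeholder path
    )
    batch_transform = BashOperator(
        task_id="batch_transform",
        bash_command="spark-submit --master yarn /opt/project/main_batch_processing.py",  # placeholder path
    )

    extract_load >> batch_transform  # transform only after the raw load succeeds
```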
### 5. Storage & Computation Optimization
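Two of the Spark tuning techniques named in the feature table, broadcast joins for small dimension tables and key salting for skewed join keys, can be illustrated with a short PySpark sketch. The data, key distribution, and bucket count below are hypothetical; the project's actual optimization code and before/after comparisons are among the deliverables.

```python
# Minimal sketch (hypothetical data): broadcast join for a small dimension,
# and key salting to spread a skewed join key across partitions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("optimization-sketch").getOrCreate()

orders = spark.range(1_000_000).select(
    F.col("id").alias("order_id"),
    (F.col("id") % 10).alias("customer_id"),  # deliberately skewed key
    (F.rand() * 100).alias("amount"),
)
customers = spark.range(10).select(F.col("id").alias("customer_id"))

# 1) Broadcast join: ship the small dimension table to every executor so the
#    large fact table is never shuffled.
joined = orders.join(F.broadcast(customers), "customer_id")

# 2) Salting: append a random suffix to the hot key, and replicate the small
#    side once per suffix, so one skewed key no longer lands on one partition.
SALT_BUCKETS = 8
salted_orders = orders.withColumn(
    "salted_key",
    F.concat_ws("_", F.col("customer_id"), (F.rand() * SALT_BUCKETS).cast("int")),
)
salted_customers = (
    customers.crossJoin(spark.range(SALT_BUCKETS).select(F.col("id").alias("salt")))
    .withColumn("salted_key", F.concat_ws("_", "customer_id", "salt"))
    .drop("customer_id", "salt")
)
salted_join = salted_orders.join(salted_customers, "salted_key")
```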
### 6. DevOps - Monitoring and Alerting
[🔨 Code - Monitoring Services Configuration Files: Prometheus, Grafana, AlertManager](https://github.com/Smars-Bin-Hu/EComDWH-BatchDataProcessingPlatform/tree/main/src/infra/monitoring-config)
[🔨 Code - Monitoring Services Start & Stop Scripts: Prometheus, Grafana, AlertManager](https://github.com/Smars-Bin-Hu/EComDWH-BatchDataProcessingPlatform/tree/main/src/scripts/monitoring)
[🔨 Code - Container Metrics Exporter Start & Stop Scripts: `my-start-node-exporter.sh` & `my-stop-node-exporter.sh`](https://github.com/Smars-Bin-Hu/EComDWH-BatchDataProcessingPlatform/tree/main/src/scripts/hadoop-master)
Figure 1: Prometheus
Figure 2: Grafana-Hadoop-Cluster-instance-hadoop-master
Figure 3: Grafana-MySQLD
### 7. Business Intelligence & Visualization
[🔗 Link - PowerBI Public Access (expirable)](https://app.powerbi.com/view?r=eyJrIjoiMzVjYTQ3NmMtODllZS00N2JhLWFkNWItMWI4MmYyNDZjMDc1IiwidCI6IjI0MGI3OWM1LTZiZWYtNDYwOC1hNDE3LTY1NjllODQzNTQ1YyJ9)
Microsoft PowerBI connects to ClickHouse and reads the **analytical data storage** layer.
Figure 1: PowerBI Dashboard Demo
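The same ADS-layer data the dashboard consumes can also be inspected directly from Python. Below is a hedged sketch using the third-party `clickhouse-driver` package; the host, credentials, and table name are hypothetical placeholders, not the project's actual schema.

```python
# Minimal sketch (hypothetical host/table): read the analytical data storage
# (ADS) layer from ClickHouse -- the same layer the PowerBI dashboard consumes.
# Requires the third-party package: pip install clickhouse-driver
from clickhouse_driver import Client

client = Client(host="clickhouse-host", user="default", password="")  # placeholders

rows = client.execute(
    "SELECT order_date, SUM(order_amount) AS daily_revenue "
    "FROM ads.ads_daily_order_summary "  # hypothetical ADS table
    "GROUP BY order_date ORDER BY order_date"
)
for order_date, daily_revenue in rows:
    print(order_date, daily_revenue)
```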
## Tech Stack
This project sets up a high-availability big data platform, including the following components:
| Components | Features | Version |
|------------------------|--------------------------------|---------|
| Apache Hadoop | Big Data Distributed Framework | 3.2.4 |
| Apache Zookeeper | High Availability | 3.8.4 |
| Apache Spark | Distributed Computing | 3.3.0 |
| Apache Hive | Data Warehousing | 3.1.3 |
| Apache Airflow | Workflow Scheduling | 2.7.2 |
| MySQL | Metastore | 8.0.39 |
| Oracle Database | OLTP Data Source (simulated business backend) | 19.0.0 |
| Azure Cloud ClickHouse | OLAP Analysis | 24.12 |
| Microsoft PowerBI | BI Dashboard | latest |
| Prometheus | Monitoring | 2.52.0 |
| Grafana | Monitoring GUI | 10.3.1 |
| Docker | Containerization | 28.0.1 |

## 📁 Project Directory
```bash
/bigdata-datawarehouse-project
├── /.github/workflows              # CI/CD automation workflows via GitHub Actions
├── /docs                           # all business and technology documents for this project
├── /src
│   ├── /data_pipeline              # data pipeline code (ETL/ELT logic, output)
│   ├── /warehouse_modeling         # DWH modelling (Hive SQL, etc.)
│   ├── /batch_processing           # data batch processing (PySpark + SparkSQL)
│   ├── /dags                       # task scheduler (Airflow)
│   ├── /infra                      # infrastructure deployment (Docker, configuration files)
│   ├── /snippets                   # commonly used commands and snippets
│   ├── /README                     # source code usage instructions (Markdown files)
│   ├── README.md                   # navigation for the source code usage instructions
│   ├── main_data_pipeline.py       # runs the data_pipeline module for the Extract and Load jobs
│   └── main_batch_processing.py    # runs the batch_processing module for the Transform jobs
├── /tests                          # unit-testing snippets for small features (DWH modelling, data pipeline, dags, etc.)
├── README.md                       # project introduction
├── docker-compose-bigdata.yml      # Docker Compose file to launch the container cluster
├── .env                            # published on purpose, for use by the docker-compose file
├── .gitignore                      # keeps some directories from being committed to the remote repo
├── .gitattributes                  # Git repository attributes config
├── LICENSE                         # copyright for this project
├── mysql-metadata-restore.sh       # container-level script: restore MySQL container metadata
├── mysql-metastore-dump.sh         # container-level script: dump MySQL container metadata
├── push-to-ghcr.sh                 # container-level script: push images to GitHub Container Registry
├── start-data-clients.sh           # container-level script: start Hive, Spark, etc.
├── start-hadoop-cluster.sh         # container-level script: start the Hadoop HA cluster
├── start-other-services.sh         # container-level script: start Airflow, Prometheus, Grafana, etc.
├── stop-data-clients.sh            # container-level script: stop Hive, Spark, etc.
├── stop-hadoop-cluster.sh          # container-level script: stop the Hadoop HA cluster
└── stop-other-services.sh          # container-level script: stop Airflow, Prometheus, Grafana, etc.
```

## 🚀 Quick Start `/src`
### [📖 Source Code Usage Instructions](./src/README.md)

## 📚 Project Documents `/docs`
#### 1. Business Logic & Tech Selection
- Business Logic
- [Project Tech Architecture](./docs/doc/tech-architecture.md)

#### 2. Development Specification
- [DWH Modelling Standard Operation Procedure (SOP)](./docs/doc/dwh-modelling-sop.md)
- [Business Data Research](./docs/doc/business_data_research.md)
- Data Warehouse Development Specification
  - [Data Warehouse Layering Specification](./docs/doc/data-warehouse-development-specification/data-warehouse-layering-specification.md)
  - [Table Naming Conventions](./docs/doc/data-warehouse-development-specification/table-naming-convertions.md)
  - [Data Warehouse Column Naming Conventions](./docs/doc/data-warehouse-development-specification/partitioning-column-naming-conventions.md)
  - [Data Table Lifecycle Management Specification](./docs/doc/data-warehouse-development-specification/data-table-lifecycle-management-specification.md)
- Python Development Specification
  - Package Modularization
- SQL Development Specification
  - [Development Specification](./docs/doc/data-warehouse-development-specification/development-specification.md)

#### 3. Troubleshooting
- [Future Bugs to Fix](./docs/doc/error-handling/future-fix.md)
- [04_MAR_2025](./docs/doc/error-handling/04_MAR_2025.md)
- [05_MAR_2025](./docs/doc/error-handling/05_MAR_2025.md)
- [06_MAR_2025](./docs/doc/error-handling/06_MAR_2025.md)

#### 4. Infrastructure & Building
- Core architecture: Docker container distribution diagram
- Hadoop three-node cluster setup and configuration
- Hive node setup and configuration
- Spark node setup and configuration
- MySQL node setup and configuration
- Oracle node setup and configuration
- Airflow node setup and configuration (MySQL and LocalExecutor settings in `airflow.cfg`)
- `docker-compose` file configuration

#### 5. Development
- Data Warehousing
  - ods
  - dwd
- Data Pipeline ETL
  - Spark on YARN connecting to Oracle (Hello World)
  - Spark extracting data and loading it to HDFS
  - OOP
- Scheduler (Airflow)
- Some files under `/scripts`

#### 6. Optimization
- [Too many INFO logs: Reducing Spark Console Log Levels](./docs/doc/optimization/reducing-spark-console-log-levels.md)
#### 7. Testing
- `spark_connect_oracle.py`

## License
This project is licensed under the MIT License - see the [LICENSE](./LICENSE) file for details.
Created and maintained by **Smars-Bin-Hu**.