https://github.com/jiatangzhi/master_thesis

This project implements my master’s thesis on building a scalable, ACID-compliant data lakehouse architecture for IoT and industrial workloads, integrating AWS Glue, S3, Athena, and Grafana with Iceberg to evaluate Copy-on-Write vs Merge-on-Read performance.

apache-iceberg aws-glue aws-s3 batch-processing data-engineering data-lakehouse distributed-systems grafana iot-data mqtt open-table-format python3 schema-evolution spark


Host: GitHub
URL: https://github.com/jiatangzhi/master_thesis
Owner: jiatangzhi
Created: 2025-10-16T14:07:26.000Z (8 months ago)
Default Branch: main
Last Pushed: 2025-10-16T14:23:39.000Z (8 months ago)
Last Synced: 2025-10-17T17:16:09.695Z (8 months ago)
Topics: apache-iceberg, aws-glue, aws-s3, batch-processing, data-engineering, data-lakehouse, distributed-systems, grafana, iot-data, mqtt, open-table-format, python3, schema-evolution, spark
Homepage:
Size: 2.21 MB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md


# 🧠 Design and Optimization of a Cloud-Based Transactional Data Lake for Evolving Data Models

This repository contains the implementation and documentation of my **Master’s Thesis** project — an end-to-end cloud-based **transactional data lakehouse** built with **Apache Iceberg** and **AWS** services.

The project bridges the gap between traditional data lakes and warehouses by introducing **ACID compliance**, **schema evolution**, and **real-time adaptability**, optimized for large-scale and evolving analytical workloads.

---

## 📖 Abstract

The rapid expansion of data-driven applications demands scalable and transactional cloud architectures.
This project designs and evaluates a **data lakehouse** that integrates the flexibility of **data lakes** with the transactional integrity of **data warehouses**, leveraging **Apache Iceberg** as the open table format.

It benchmarks **Copy-on-Write (CoW)** and **Merge-on-Read (MoR)** strategies to assess ingestion throughput, query performance, and compaction efficiency, while demonstrating schema evolution, rollback, and time travel capabilities.
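
As a minimal sketch of how the two strategies can be selected, Iceberg exposes the table properties `write.delete.mode`, `write.update.mode`, and `write.merge.mode`, each accepting `copy-on-write` or `merge-on-read`. The helper below only builds the Spark SQL statement; the table identifier is hypothetical and not taken from the thesis code. Merge-on-Read relies on delete files, so it assumes the table uses Iceberg format version 2.

```python
# Iceberg's write-mode table properties; each accepts
# "copy-on-write" or "merge-on-read".
WRITE_MODE_PROPERTIES = ("write.delete.mode", "write.update.mode", "write.merge.mode")
VALID_MODES = ("copy-on-write", "merge-on-read")


def write_mode_ddl(table: str, mode: str) -> str:
    """Build an ALTER TABLE statement that switches row-level operations to `mode`."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    props = ", ".join(f"'{name}'='{mode}'" for name in WRITE_MODE_PROPERTIES)
    return f"ALTER TABLE {table} SET TBLPROPERTIES ({props})"


# Hypothetical table identifier, for illustration only.
print(write_mode_ddl("glue_catalog.iot.telemetry", "merge-on-read"))
```

The resulting statement would be passed to `spark.sql(...)` in a session that has the Iceberg extensions enabled.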

---

## 🧩 Architecture Overview

The proposed lakehouse is composed of **five modular layers**:

1. **Ingestion Layer**: IoT data ingestion through **AWS IoT Core** over the **MQTT** protocol.
2. **Storage Layer**: Persistent, scalable storage in **Amazon S3** (raw and curated zones).
3. **Processing Layer**: ETL pipelines built on **AWS Glue** and **Apache Spark**, with table metadata tracked in the **AWS Glue Data Catalog**.
4. **API Layer**: Query access through **Amazon Athena** for SQL-based analytics.
5. **Consumption Layer**: Interactive visualization and real-time monitoring through **Grafana** and **Amazon CloudWatch**.
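
To illustrate how the storage and processing layers connect, the sketch below assembles the Spark settings that Iceberg documents for using the AWS Glue Data Catalog with S3 file I/O. The catalog name and warehouse path are placeholders, not values from this repository, and the Iceberg Spark runtime and AWS bundle are assumed to be on the classpath.

```python
from __future__ import annotations


def iceberg_glue_conf(catalog: str, warehouse: str) -> dict[str, str]:
    """Return Spark settings that register an Iceberg catalog backed by Glue and S3."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        # Enables Iceberg SQL extensions such as CALL procedures.
        "spark.sql.extensions": (
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
        ),
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        f"{prefix}.warehouse": warehouse,
    }


# Placeholder catalog name and bucket, for illustration only.
conf = iceberg_glue_conf("glue_catalog", "s3://example-bucket/warehouse")
for key, value in conf.items():
    print(f"{key}={value}")

# With PySpark installed, each pair could be applied as:
#   builder = SparkSession.builder
#   for key, value in conf.items():
#       builder = builder.config(key, value)
```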

---

## 🧪 Research Focus

The main research objective is to design and optimize a **transactional, cloud-native data lakehouse** capable of handling IoT data streams with high ingestion rates while maintaining performance, consistency, and reliability.

### 🔍 Comparative Evaluation
- **Apache Iceberg vs Apache Hudi**
- **Copy-on-Write (CoW)** vs **Merge-on-Read (MoR)**
- Analysis of **query latency**, **ingestion throughput**, **compaction cost**, and **metadata scalability**

### 🧠 Core Features
- ACID-compliant table management
- Schema evolution and snapshot rollback
- Metadata pruning and partition optimization
- ETL automation through AWS Glue
- IaC deployment using **AWS CDK**
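
The schema evolution and rollback features listed above map onto a handful of Iceberg Spark SQL statements. The helpers below are a sketch that only formats those statements; the catalog, table, column, and snapshot values are hypothetical. `VERSION AS OF` assumes Spark 3.3 or later, and the `CALL` procedure assumes the Iceberg SQL extensions are enabled.

```python
def add_column_sql(table: str, column: str, col_type: str) -> str:
    """Schema evolution: add a column without rewriting existing data files."""
    return f"ALTER TABLE {table} ADD COLUMN {column} {col_type}"


def drop_column_sql(table: str, column: str) -> str:
    """Schema evolution: drop a column as a metadata-only change."""
    return f"ALTER TABLE {table} DROP COLUMN {column}"


def rollback_sql(catalog: str, table: str, snapshot_id: int) -> str:
    """Roll the table's current state back to an earlier snapshot."""
    return f"CALL {catalog}.system.rollback_to_snapshot('{table}', {snapshot_id})"


def time_travel_sql(table: str, snapshot_id: int) -> str:
    """Read the table as it was at a given snapshot."""
    return f"SELECT * FROM {table} VERSION AS OF {snapshot_id}"


# Hypothetical identifiers and snapshot id, for illustration only.
full_name = "glue_catalog.iot.telemetry"
print(add_column_sql(full_name, "gpu_percent", "double"))
print(drop_column_sql(full_name, "gpu_percent"))
print(rollback_sql("glue_catalog", "iot.telemetry", 1234567890))
print(time_travel_sql(full_name, 1234567890))
```

Snapshot ids for a real table can be listed by querying its `snapshots` metadata table, for example `SELECT * FROM glue_catalog.iot.telemetry.snapshots`.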

---

## ⚙️ Technologies Used

| Category | Technologies |
|-----------|--------------|
| Cloud Services | AWS S3, Glue, Athena, IoT Core, CloudWatch, Managed Grafana |
| Data Frameworks | Apache Iceberg, Apache Hudi, Apache Spark |
| Programming | Python, PySpark, Boto3 |
| Infrastructure | AWS CDK (IaC), Virtual Machines (Edge Simulation) |
| Visualization | Grafana, CloudWatch Dashboards |

---

## 🧰 Industrial Simulator

A local **industrial monitoring simulator** was developed to generate realistic IoT metrics (CPU, memory, disk I/O, network I/O).
Each virtual device (VM) publishes telemetry payloads to AWS IoT Core via **MQTT**. The raw data is stored in S3 as CSV and then transformed into **Parquet** for the Iceberg tables.
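
A minimal sketch of what one simulated device could emit is shown below. It uses only the standard library and random values rather than real host metrics. The field names and device ids are assumptions made for illustration, not the schema used in the thesis, and the MQTT publish step is left out.

```python
import csv
import io
import json
import random
import time

# Hypothetical telemetry schema covering CPU, memory, disk I/O, and network I/O.
FIELDS = [
    "device_id",
    "timestamp",
    "cpu_percent",
    "memory_percent",
    "disk_io_bytes",
    "network_io_bytes",
]


def make_payload(device_id, rng, now=None):
    """Generate one simulated telemetry record for a virtual device."""
    return {
        "device_id": device_id,
        "timestamp": int(now if now is not None else time.time()),
        "cpu_percent": round(rng.uniform(0.0, 100.0), 2),
        "memory_percent": round(rng.uniform(0.0, 100.0), 2),
        "disk_io_bytes": rng.randint(0, 10_000_000),
        "network_io_bytes": rng.randint(0, 10_000_000),
    }


def to_csv(payloads):
    """Serialize records to CSV text, the raw format landed in S3."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(payloads)
    return buffer.getvalue()


rng = random.Random(42)  # Seeded so repeated runs produce the same batch.
batch = [make_payload(f"vm-{i:02d}", rng, now=1_700_000_000) for i in range(3)]
print(json.dumps(batch[0]))  # JSON body a device might publish over MQTT.
print(to_csv(batch))
```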

---

## 📊 Performance Experiments

Benchmarks included:
- Read and write latency under CoW vs MoR strategies
- Query time before and after compaction
- Schema evolution tests (column addition/removal)
- Snapshot rollback and time-travel verification
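
One way to collect such measurements is a small timing harness wrapped around each query, run once before and once after compaction. The sketch below is an assumption about how this could be done, not the benchmark code from the thesis. The compaction statement uses Iceberg's `rewrite_data_files` procedure, and the catalog and table names are placeholders.

```python
import statistics
import time


def time_runs(fn, runs=5):
    """Call `fn` repeatedly and summarize wall-clock latency in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return {
        "runs": runs,
        "median_s": statistics.median(samples),
        "min_s": min(samples),
        "max_s": max(samples),
    }


def compaction_sql(catalog, table):
    """Build the Iceberg procedure call that compacts small data files."""
    return f"CALL {catalog}.system.rewrite_data_files(table => '{table}')"


# Stand-in workload; with Spark this would be something like
#   lambda: spark.sql(query).collect()
result = time_runs(lambda: sum(range(100_000)), runs=3)
print(result)
print(compaction_sql("glue_catalog", "iot.telemetry"))
```

Reporting the median alongside the minimum and maximum makes a single slow run, such as a cold start, less likely to distort the comparison.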

The results indicate that **Iceberg’s metadata layer** significantly improves query performance and storage efficiency, making it a robust foundation for **scalable, cost-effective cloud lakehouses**.

---

## 🗂️ License

This project is licensed under the **MIT License**.
