Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/aabouzaid/modern-data-platform-poc
My M.Sc. dissertation: Modern Data Platform using DataOps, Kubernetes, and Cloud-Native ecosystem to build a resilient Big Data platform based on Data Lakehouse architecture which is the base for Machine Learning (MLOps) and Artificial Intelligence (AIOps).
big-data cloud-agnostic cloud-native data-engineering data-lakehouse data-platform dataops edinburgh-napier kubernetes msc msc-project
- Host: GitHub
- URL: https://github.com/aabouzaid/modern-data-platform-poc
- Owner: aabouzaid
- Created: 2023-02-20T09:52:30.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-05-12T21:25:27.000Z (6 months ago)
- Last Synced: 2024-07-30T17:59:36.974Z (3 months ago)
- Topics: big-data, cloud-agnostic, cloud-native, data-engineering, data-lakehouse, data-platform, dataops, edinburgh-napier, kubernetes, msc, msc-project
- Language: Jupyter Notebook
- Homepage: https://dx.doi.org/10.13140/RG.2.2.15360.71689
- Size: 5.52 MB
- Stars: 6
- Watchers: 3
- Forks: 1
- Open Issues: 2
Metadata Files:
- Readme: README.md
README
# Modern Data Platform PoC
A proof of concept for the core of a Modern Data Platform using DataOps, Kubernetes, and the Cloud-Native ecosystem
to build a resilient Big Data platform based on the Data Lakehouse architecture, which is the foundation for
Machine Learning (MLOps) and Artificial Intelligence (AIOps).

> **Note**
>
> This project is part of my Master of Science in Data Engineering
> at Edinburgh Napier University (April 2023).

## Contents
- [Architecture](#architecture)
- [Deployment](#deployment)
- [Benchmarking](#benchmarking)

## Architecture
### Core Components
The core components of the platform are:
- Infrastructure (Kubernetes)
- Data Ingestion (Argo Workflows + Python)
- Data Storage (MinIO)
- Data Processing (Dremio)

### Initial Model
To visualise the interactions of the current implementation, the
[C4 software architecture model](https://c4model.com/) (Context, Containers, Components, and Code)
is used.

The following is a simplified view of the initial architecture model
(all the abstractions are combined together).

![Modern Data Platform Initial Architecture Model](initial-architecture-model.png)
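
Ahead of the deployment steps, it may help to confirm that the command-line tools this README relies on are available. The following is only a sketch and not part of the repository: `missing_tools` is a hypothetical helper, and the binary names `docker` and `helm` are assumptions (the names `asdf`, `kind`, `kubectl`, and `kustomize` appear in the commands in this README). It also assumes the tools are reachable through `PATH` in the current shell.

```sh
# Hypothetical helper (not part of this repository).
# Prints the name of every tool that cannot be resolved by the shell.
# With no arguments it checks an assumed default list of binaries.
missing_tools() {
  # Fall back to the assumed default tool list when no names are given.
  if [ "$#" -eq 0 ]; then
    set -- asdf docker helm kind kubectl kustomize
  fi
  for tool in "$@"; do
    # `command -v` succeeds only when the shell can resolve the name.
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}
```

Calling `missing_tools` without arguments prints nothing when every tool in the assumed list is found; any name it prints still has to be installed, for example via `asdf install`.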
## Deployment
**Prerequisites:** [asdf](https://asdf-vm.com/), a Linux operating system, and Docker Engine
(tested with asdf 0.11.1, Ubuntu 20.04.5 LTS, and Docker Engine Community 23.0.1).

The following tools are used in development:
- Helm
- KinD
- Kubectl
- Kustomize

They can be installed with the corresponding versions via `asdf`:
```sh
asdf install
```

Create the local Kubernetes cluster:
```sh
kind create cluster \
--config clusters/local/kind-cluster-config.yaml
```

Deploy the applications to the Kubernetes cluster:
```sh
kustomize build --enable-helm clusters/local | kubectl apply -f -
```

Wait for deployments to be ready:
```sh
# Ingress-Nginx.
kubectl rollout status deployment \
--watch --namespace ingress-nginx ingress-nginx-controller

# MinIO.
kubectl rollout status deployment \
--watch --namespace minio minio

# Argo Workflows.
kubectl rollout status deployment \
--watch --namespace argo-workflows argo-workflows-server

# Dremio.
kubectl rollout status statefulset \
--watch --namespace dremio dremio-master
```

Apply the data pipeline:
```sh
kubectl apply --namespace argo-workflows --filename \
pipelines/ingestion/argo-workflow-covid19-subnational-data.yaml
```

## Benchmarking
The TPC-DS test suite was used
to assess the performance of the platform.

For complete results, please check the project
[Jupyter Notebook](./benchmark/dremio_v24_0_0_tpc_ds_benchmark.ipynb)
in the [benchmarking section](./benchmark).
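
Benchmark runs are only meaningful once every component is fully rolled out, so it can be convenient to repeat the readiness checks from the Deployment section in one step. The sketch below is an illustration rather than part of the repository: `wait_for_platform`, the `WORKLOADS` variable, the `DRY_RUN` switch, and the default timeout of `300s` are all assumptions introduced here. The namespaces and workload names are the ones used in the rollout commands above.

```sh
# Workloads to wait for, one "namespace kind name" triple per line
# (copied from the rollout checks in the Deployment section).
WORKLOADS='ingress-nginx deployment ingress-nginx-controller
minio deployment minio
argo-workflows deployment argo-workflows-server
dremio statefulset dremio-master'

# Hypothetical helper (not part of this repository).
# Runs `kubectl rollout status` for each workload and stops at the first failure.
# $1: optional timeout per workload (assumed default: 300s).
# Set DRY_RUN=1 to print the commands instead of running them.
wait_for_platform() {
  timeout="${1:-300s}"
  while read -r namespace kind name; do
    # Build the command once so dry-run and real runs stay identical.
    set -- kubectl rollout status "$kind" \
      --watch --namespace "$namespace" --timeout="$timeout" "$name"
    if [ "${DRY_RUN:-0}" = "1" ]; then
      printf '%s\n' "$*"
    else
      # Keep kubectl away from the loop's input and abort on failure.
      "$@" </dev/null || return 1
    fi
  done <<EOF
$WORKLOADS
EOF
}
```

Running `DRY_RUN=1 wait_for_platform 60s` prints the four commands without contacting the cluster, which is a cheap way to review them before running `wait_for_platform` for real.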