{"id":17654924,"url":"https://github.com/aabouzaid/modern-data-platform-poc","last_synced_at":"2025-06-27T23:03:07.955Z","repository":{"id":179134480,"uuid":"604083925","full_name":"aabouzaid/modern-data-platform-poc","owner":"aabouzaid","description":"My M.Sc. dissertation: Modern Data Platform using DataOps, Kubernetes, and Cloud-Native ecosystem to build a resilient Big Data platform based on Data Lakehouse architecture which is the base for Machine Learning (MLOps) and Artificial Intelligence (AIOps).","archived":false,"fork":false,"pushed_at":"2024-05-12T21:25:27.000Z","size":5788,"stargazers_count":8,"open_issues_count":2,"forks_count":1,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-05-07T10:29:06.677Z","etag":null,"topics":["big-data","cloud-agnostic","cloud-native","data-engineering","data-lakehouse","data-platform","dataops","edinburgh-napier","kubernetes","msc","msc-project"],"latest_commit_sha":null,"homepage":"https://dx.doi.org/10.13140/RG.2.2.15360.71689","language":"Jupyter 
Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/aabouzaid.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2023-02-20T09:52:30.000Z","updated_at":"2025-04-01T10:57:36.000Z","dependencies_parsed_at":"2025-04-20T15:15:13.078Z","dependency_job_id":null,"html_url":"https://github.com/aabouzaid/modern-data-platform-poc","commit_stats":{"total_commits":8,"total_committers":1,"mean_commits":8.0,"dds":0.0,"last_synced_commit":"79e9716f567e58d7e621926616e470af9e03906f"},"previous_names":["aabouzaid/modern-data-platform-poc"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/aabouzaid/modern-data-platform-poc","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aabouzaid%2Fmodern-data-platform-poc","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aabouzaid%2Fmodern-data-platform-poc/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aabouzaid%2Fmodern-data-platform-poc/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aabouzaid%2Fmodern-data-platform-poc/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/aabouzaid","download_url":"https://codeload.github.com/aabouzaid/modern-data-platform-poc/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aabouzaid%2Fmodern-data-platform-poc/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":262347470,"owners_count":23296893,"icon_url":"ht
tps://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["big-data","cloud-agnostic","cloud-native","data-engineering","data-lakehouse","data-platform","dataops","edinburgh-napier","kubernetes","msc","msc-project"],"created_at":"2024-10-23T12:40:19.972Z","updated_at":"2025-06-27T23:03:07.923Z","avatar_url":"https://github.com/aabouzaid.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003c!-- omit in toc --\u003e\n# Modern Data Platform PoC\n\nA proof of concept for the core of a Modern Data Platform that uses DataOps, Kubernetes, and the Cloud-Native ecosystem\nto build a resilient Big Data platform based on the Data Lakehouse architecture, which is the base for\nMachine Learning (MLOps) and Artificial Intelligence (AIOps).\n\n\u003e **Note**\n\u003e\n\u003e This project is part of my Master of Science in Data Engineering\n\u003e at Edinburgh Napier University (April 2023).\n\n\u003c!-- omit in toc --\u003e\n## Contents\n- [Architecture](#architecture)\n- [Deployment](#deployment)\n- [Benchmarking](#benchmarking)\n\n## Architecture\n\n\u003c!-- omit in toc --\u003e\n### Core Components\n\nThe core components of the platform are:\n\n- Infrastructure (Kubernetes)\n- Data Ingestion (Argo Workflows + Python)\n- Data Storage (MinIO)\n- Data Processing (Dremio)\n\n\u003c!-- omit in toc --\u003e\n### Initial Model\n\nTo visualise the interactions of the current implementation, the\n[C4 software architecture model](https://c4model.com/) (Context, Containers, Components, and Code)\nis used.\n\nThe following is a simplified view of the initial architecture model\n(all 
the abstractions are combined together).\n\n![Modern Data Platform Initial Architecture Model](initial-architecture-model.png)\n\n## Deployment\n\n**Prerequisites:** [asdf](https://asdf-vm.com/), Linux operating system, and Docker Engine\n(tested with asdf 0.11.1, Ubuntu 20.04.5 LTS, and Docker Engine Community 23.0.1).\n\nThe following tools are used for development:\n- Helm\n- KinD\n- Kubectl\n- Kustomize\n\nThey can be installed at their corresponding versions via `asdf`:\n\n```sh\nasdf install\n```\n\nCreate the local Kubernetes cluster:\n\n```sh\nkind create cluster \\\n  --config clusters/local/kind-cluster-config.yaml\n```\n\nDeploy the applications to the Kubernetes cluster:\n\n```sh\nkustomize build --enable-helm clusters/local | kubectl apply -f -\n```\n\nWait for the deployments to be ready:\n\n```sh\n# Ingress-Nginx.\nkubectl rollout status deployment \\\n  --watch --namespace ingress-nginx ingress-nginx-controller\n\n# MinIO.\nkubectl rollout status deployment \\\n  --watch --namespace minio minio\n\n# Argo Workflows.\nkubectl rollout status deployment \\\n  --watch --namespace argo-workflows argo-workflows-server\n\n# Dremio.\nkubectl rollout status statefulset \\\n  --watch --namespace dremio dremio-master\n```\n\nApply the data pipeline:\n\n```sh\nkubectl apply --namespace argo-workflows --filename \\\n  pipelines/ingestion/argo-workflow-covid19-subnational-data.yaml\n```\n\n## Benchmarking\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"./benchmark/queries-performance-with-cache-enabled.png\" width=\"90%\"\u003e\n\u003c/p\u003e\n\nThe TPC-DS test suite was used\nto assess the performance of the platform.\n\nFor complete results, please check the project\n[Jupyter Notebook](./benchmark/dremio_v24_0_0_tpc_ds_benchmark.ipynb)\nin the [benchmarking 
section](./benchmark).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faabouzaid%2Fmodern-data-platform-poc","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faabouzaid%2Fmodern-data-platform-poc","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faabouzaid%2Fmodern-data-platform-poc/lists"}