https://github.com/hellomaxime/data-platform-on-kubernetes
Open Source Data Platform on Kubernetes
bigdata data data-pipeline dbt druid etl kubernetes ml open-source platform spark superset
Last synced: about 1 month ago
- Host: GitHub
- URL: https://github.com/hellomaxime/data-platform-on-kubernetes
- Owner: hellomaxime
- License: apache-2.0
- Created: 2024-02-20T20:51:44.000Z (12 months ago)
- Default Branch: main
- Last Pushed: 2024-04-22T16:45:58.000Z (10 months ago)
- Last Synced: 2024-11-07T12:16:58.547Z (3 months ago)
- Topics: bigdata, data, data-pipeline, dbt, druid, etl, kubernetes, ml, open-source, platform, spark, superset
- Language: Shell
- Homepage:
- Size: 153 KB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE
# Data platform on Kubernetes
This project aims to deploy a complete data platform on Kubernetes. Many services are available to build end-to-end data engineering projects, from ingestion to visualization.
## Prerequisites
- docker
- kubernetes (minikube cluster for local development)
- kubectl
- helm

## Available services
- __Data ingestion__
  - NiFi
- __Data integration__
  - Airbyte
- __Message queue__
  - Kafka
  - RabbitMQ
- __Change Data Capture__
  - Debezium
- __Database__
  - Cassandra
  - Druid
  - MongoDB
  - MySQL/phpMyAdmin
  - PostgreSQL/pgAdmin
- __Data warehouse__
  - ClickHouse
- __Datalake__
  - MinIO
- __Data transformation__
  - dbt
  - Flink
  - Spark
- __Data quality__
  - Great Expectations
- __Distributed SQL query engine__
  - Trino
- __Visualization__
  - Metabase
  - Superset
- __Machine learning__
  - Kubeflow
- __Orchestration__
  - Airflow
  - Argo Workflows
- __Monitoring__
  - Grafana/Prometheus
- __Notebook__
  - JupyterHub

## Data formats
- Delta Lake
- Apache Iceberg (soon)

## How to deploy the data platform on Kubernetes
Before deploying to the cluster, choose the services you want to start in the `.config` file by marking each one `y` or `n`.
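To make the y/n selection concrete, here is a minimal sketch of how a script could read such a file. It assumes a hypothetical layout of one `name=y` or `name=n` entry per line; the actual format of the repository's `.config` file is not shown here and may differ. The function name `enabled_services` is illustrative.

```shell
# Sketch only: assumes a hypothetical `.config` layout with one
# `name=y` or `name=n` entry per line. The real file may differ.

# Print the name of every service marked "y" in the given config file.
enabled_services() {
  local config_file="${1:-.config}"
  local name value
  # The `|| [ -n "$name" ]` part handles a last line without a newline.
  while IFS='=' read -r name value || [ -n "$name" ]; do
    # Skip blank lines and comment lines.
    case "$name" in
      ''|\#*) continue ;;
    esac
    if [ "$value" = "y" ]; then
      printf '%s\n' "$name"
    fi
  done < "$config_file"
}
```

Calling `enabled_services .config` would then print one enabled service name per line, which a start script could loop over.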
__Deploy the data platform__
`./start.sh`

You may need to wait a few minutes for all services to start. You can check pod status with the following command: `kubectl get all -A`.
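Instead of re-running the status command by hand, the wait can be scripted. The sketch below polls `kubectl get pods -A --no-headers` until every pod reports `Running` or `Completed`. The function name `wait_for_pods`, the default timeout, and the polling interval are illustrative choices, not part of the repository's scripts.

```shell
# Sketch only: polls pod status until everything is Running or Completed.
# Function name, timeout, and interval are illustrative assumptions.
wait_for_pods() {
  local timeout="${1:-600}"   # total seconds to wait
  local interval="${2:-10}"   # seconds between polls
  local elapsed=0
  local pods pending

  while [ "$elapsed" -lt "$timeout" ]; do
    # Give up immediately if the cluster cannot be queried.
    pods=$(kubectl get pods -A --no-headers) || return 1
    # With -A the columns are NAMESPACE NAME READY STATUS RESTARTS AGE,
    # so STATUS is the fourth field. Count pods in any other state.
    pending=$(printf '%s\n' "$pods" \
      | awk 'NF && $4 != "Running" && $4 != "Completed" {n++} END {print n+0}')
    if [ "$pending" -eq 0 ]; then
      echo "All pods are Running or Completed."
      return 0
    fi
    echo "Waiting on $pending pod(s)..."
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done

  echo "Timed out after ${timeout}s." >&2
  return 1
}
```

For example, `wait_for_pods 900 15` would poll every 15 seconds for up to 15 minutes. Note that a pod in the `Running` state can still have containers that are not ready, so this is a coarse check only.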
__Turn off the data platform__
`./stop.sh`

## Helpful
__Some services are accessible through a URL__
Example: `http://dataplatform..io/`

__Access another service from inside the cluster__
`..svc.cluster.local:`

__Get Helm default values__
`helm show values > values.yaml`

__Config file__
Set the `.config` file to choose which services to enable or disable.

__minikube ingress addon__
`minikube addons enable ingress`

__Kubernetes dashboard__
`minikube dashboard --url`
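
In-cluster addresses like the one listed above follow the standard Kubernetes Service DNS pattern: service name, namespace, `svc.cluster.local`, then the port. The helper below is a minimal sketch that assembles such an address. It assumes the default `cluster.local` cluster domain, and the function name and example values are hypothetical.

```shell
# Sketch only: builds an in-cluster address from the standard Kubernetes
# Service DNS pattern. Assumes the default `cluster.local` cluster domain.
# The function name and any example arguments are hypothetical.
service_address() {
  local service="$1"     # Service name
  local namespace="$2"   # namespace the Service lives in
  local port="$3"        # Service port
  printf '%s.%s.svc.cluster.local:%s\n' "$service" "$namespace" "$port"
}
```

For instance, `service_address my-service my-namespace 8080` prints `my-service.my-namespace.svc.cluster.local:8080`, which another pod in the cluster could use as a host and port.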