{"id":16597925,"url":"https://github.com/abhishek-ch/data-machinelearning-the-boring-way","last_synced_at":"2025-03-21T13:32:26.417Z","repository":{"id":47646714,"uuid":"487806021","full_name":"abhishek-ch/data-machinelearning-the-boring-way","owner":"abhishek-ch","description":"Build \u0026 Learn Data Engineering,Machine Learning over Kubernetes. No Shortcut approach.","archived":false,"fork":false,"pushed_at":"2023-01-03T11:03:38.000Z","size":3487,"stargazers_count":57,"open_issues_count":2,"forks_count":11,"subscribers_count":6,"default_branch":"main","last_synced_at":"2025-03-18T01:51:28.683Z","etag":null,"topics":["data-infrastructure","dataengineering","datascience","kubernetes","machine-learning","mlops"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/abhishek-ch.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-05-02T10:40:12.000Z","updated_at":"2025-03-03T02:56:49.000Z","dependencies_parsed_at":"2023-02-01T06:00:43.299Z","dependency_job_id":null,"html_url":"https://github.com/abhishek-ch/data-machinelearning-the-boring-way","commit_stats":null,"previous_names":[],"tags_count":10,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abhishek-ch%2Fdata-machinelearning-the-boring-way","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abhishek-ch%2Fdata-machinelearning-the-boring-way/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abhishek-ch%2Fdata-machinelearning-the-boring-way/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts
/GitHub/repositories/abhishek-ch%2Fdata-machinelearning-the-boring-way/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/abhishek-ch","download_url":"https://codeload.github.com/abhishek-ch/data-machinelearning-the-boring-way/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244806171,"owners_count":20513394,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-infrastructure","dataengineering","datascience","kubernetes","machine-learning","mlops"],"created_at":"2024-10-12T00:07:05.301Z","updated_at":"2025-03-21T13:32:25.992Z","avatar_url":"https://github.com/abhishek-ch.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Data \u0026 Machine Learning - The Boring Way\n\nThis tutorial walks you through setting up and building a Data Engineering \u0026 Machine Learning Platform. \nThe tutorial is designed to explore several different technologies for similar problems, without bias toward any of them. 
\n\n__This is not a Production-Ready Setup__\n\n## Target Audience\nData Engineers, Machine Learning Engineers, Data Scientists, SREs, Infrastructure Engineers, Data Analysts, Data Analytics Engineers\n\n# Expected Technologies \u0026 Workflow \n\n## Data Engineering \u0026 Analytics\n- [X] Kubernetes Kind Installation [link](/docs/01-setting-up-cluster.md)\n- [X] [MinIO](https://min.io/) Integrate Object Storage on top of Kubernetes and use the MinIO interface to simulate S3 [link](/docs/02-setting-up-minio.md)\n- [X] [Apache Airflow](https://airflow.apache.org/) on top of Kubernetes \u0026 Running an end-to-end Airflow Workflow using the Kubernetes Executor [link](docs/04-setting-up-airflow.md)\n- [X] [Apache Spark](https://spark.apache.org/) Deploy Apache Spark on Kubernetes and run an example [link](/docs/03-setting-up-apachespark-k8s.md)\n- [ ] [Prefect](https://www.prefect.io/) Setup \u0026 Running an end-to-end Workflow\n- [ ] [Dagster](https://dagster.io/) Setup \u0026 Running an end-to-end Workflow\n- [ ] Set up an ETL job running end-to-end on Apache Airflow. This job contains Spark \u0026 Python Operators\n- [ ] [Apache Hive](https://cwiki.apache.org/confluence/display/hive/design) Setting up Hive \u0026 Hive Metastore\n- [ ] Deploy Trino \u0026 Open Source [Presto](https://prestodb.io/) and run data analytics queries.\n- [ ] Integrate [Superset](https://superset.apache.org/) \u0026 [Metabase](https://www.metabase.com/) to run visualizations. 
Integrate Presto with the visualization system.\n- [ ] Open Table Format using [Delta](https://docs.delta.io/latest/quick-start.html)\n- [ ] Open Table Format using [Apache Iceberg](https://iceberg.apache.org/)\n- [ ] Open Table Format using [Apache Hudi](https://hudi.apache.org/)\n- [ ] Metadata Management using [Amundsen](https://www.amundsen.io/)\n- [ ] Metadata Management using [Datahub](https://datahubproject.io/)\n- [ ] Setting up [Apache Kafka](https://kafka.apache.org/) distributed event streaming platform\n- [ ] Using Spark Structured Streaming to run an end-to-end pipeline over any realtime data sources\n- [ ] Using [Apache Flink](https://flink.apache.org/) to run an end-to-end pipeline over any realtime data sources\n- [ ] [Redpanda](https://redpanda.com/), streaming data platform to run a similar workflow\n- [ ] [Airbyte](https://airbyte.com/) Data Integration platform\n- [ ] [Talend](https://www.talend.com/products/data-integration/) UI-based Data Integration\n- [ ] [DBT](https://www.getdbt.com/) DBT SQL Pipeline to compare with Spark and other tech\n- [ ] [Debezium](https://debezium.io/) Change Data Capture using Debezium to sync multiple databases\n\n## Monitoring \u0026 Observability\n- [ ] [Grafana](https://grafana.com/) Setting Up Grafana for Monitoring components. 
Start with Monitoring Pods\n- [ ] [FluentD](https://www.fluentd.org/) Logging metrics from pods \u0026 integrating them with the Monitoring layer\n- [ ] Setting up a full Monitoring and Alerting Platform \u0026 integrating monitoring across the other technologies\n- [ ] Setting up an Observability system \n\n## Machine Learning\n- [ ] Setup [Ray](https://www.ray.io/) for Data Transformations\n- [ ] Use [Scikit-learn](https://scikit-learn.org/) for an example ML training\n- [ ] Setup [Argo Pipeline](https://argoproj.github.io/) for deploying ML Jobs\n- [ ] Setup [Flyte](https://flyte.org/) Orchestrator for pythonic Deployment\n- [ ] Use [PyTorch Lightning](https://www.pytorchlightning.ai/) for running ML training\n- [ ] Use TensorFlow for running ML training\n- [ ] Setup ML End-to-End Workflow on Flyte\n- [ ] Deploy [MLflow](https://www.mlflow.org/docs/latest/index.html) for ML Model Tracking \u0026 Experimentation\n- [ ] Deploy [BentoML](https://www.bentoml.com/) for deploying ML Models\n- [ ] Deploy [Seldon Core](https://github.com/SeldonIO/seldon-core) for ML Model Management\n- [ ] Integrate MLflow with Seldon Core \n\n## Prerequisites\n* 🐳 Docker Installed \n* [kubectl](https://kubernetes.io/docs/tasks/tools/) Installed. The Kubernetes command-line tool, kubectl, allows you to run commands against Kubernetes clusters\n* [Lens](https://k8slens.dev/) Installed, UI for Kubernetes.  
\n_This is optional; kubectl is enough to get all relevant stats from the Kubernetes cluster_\n* [Helm](https://helm.sh/) The package manager for Kubernetes\n\n## Lab Basic Setup\n* [Setting Up Kind](https://kind.sigs.k8s.io/docs/user/quick-start/)\n* Deleting older Pods [PodCleaner](/docs/05-cronjob-podcleaner.md)","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fabhishek-ch%2Fdata-machinelearning-the-boring-way","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fabhishek-ch%2Fdata-machinelearning-the-boring-way","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fabhishek-ch%2Fdata-machinelearning-the-boring-way/lists"}