https://github.com/narius2030/lakehouse-solution-imcp

An end-to-end MLOps pipeline to develop, train, and deploy an Image Caption model that automatically generates captions for images based on diverse datasets

apache-airflow apache-kafka batch-processing lakehouse mlflow-tracking mlops polars spark-streaming stream-processing


Host: GitHub
URL: https://github.com/narius2030/lakehouse-solution-imcp
Owner: Narius2030
Created: 2025-01-27T15:20:22.000Z (9 months ago)
Default Branch: main
Last Pushed: 2025-02-25T03:25:37.000Z (8 months ago)
Last Synced: 2025-02-25T04:27:20.886Z (8 months ago)
Topics: apache-airflow, apache-kafka, batch-processing, lakehouse, mlflow-tracking, mlops, polars, spark-streaming, stream-processing
Language: Python
Homepage:
Size: 2.17 MB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md


# General Architecture of Data Lake

- Built a Data Lake following the Medallion architecture, with a `catalog layer` and a `storage layer` for storing images and their metadata.
- Streamed `file uploading` and `captured images` events from the mobile app (sent through the API) into the raw storage area, which makes the data available for AI training more varied.
- Integrated NLP and image processing into the ETL pipeline to periodically normalize images and metadata.
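To make the flow above more concrete, here is a minimal sketch of how an uploaded image event might be given a key in the raw (bronze) area and how its caption metadata might be normalized. It is an illustration only: the `UploadEvent` fields, the `bronze`/`silver`/`gold` layer names, the key layout, and the normalization rules are all assumptions, not taken from this repository.

```python
import re
import unicodedata
from dataclasses import dataclass
from datetime import datetime, timezone

# Assumed layer names; the Medallion pattern conventionally uses bronze/silver/gold.
LAYERS = ("bronze", "silver", "gold")


@dataclass(frozen=True)
class UploadEvent:
    """Hypothetical shape of one upload event; field names are assumptions."""
    image_id: str
    source: str  # e.g. "file_upload" or "mobile_capture"
    caption: str
    created_at: datetime


def object_key(layer: str, event: UploadEvent) -> str:
    """Build a date-partitioned storage key for an event in the given layer."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer}")
    day = event.created_at.strftime("%Y/%m/%d")
    return f"{layer}/{event.source}/{day}/{event.image_id}.jpg"


def normalize_caption(text: str) -> str:
    """Apply Unicode NFKC, lowercase, drop punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


event = UploadEvent(
    image_id="img-0001",
    source="mobile_capture",
    caption="  A Dog,   running on the BEACH!  ",
    created_at=datetime(2025, 2, 1, tzinfo=timezone.utc),
)
raw_key = object_key("bronze", event)
clean_caption = normalize_caption(event.caption)
print(raw_key)        # bronze/mobile_capture/2025/02/01/img-0001.jpg
print(clean_caption)  # a dog running on the beach
```

Partitioning keys by source and date is one common way to keep raw uploads easy to reprocess in periodic batches; the actual layout used by the project is shown in the Storage Structure diagram.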

![image](https://github.com/user-attachments/assets/923a659b-0401-4c68-a28b-704d6db14098)

## Detailed Architecture

![image](https://github.com/user-attachments/assets/64b1f8b2-22ce-4cdd-ac63-c8855883fbe0)

## Storage Structure

![image](https://github.com/user-attachments/assets/89c2aa4f-47a4-415e-a252-19f46bd7f3ef)

## MLOps Cycle

![image](https://github.com/user-attachments/assets/8c400e4c-48c5-4352-aa71-e2a4990cea85)

# FastAPI-based Microservice

> More details in this [repo](https://github.com/Narius2030/FastAPI-Microservice-IMCP.git)

- Develop APIs to retrieve the images and metadata normalized in the Data Lake, for the automated incremental learning process.
- Develop APIs to upload a user's captured image and metadata to the storage system for later use, and then activate the model.
- Utilize Nginx to route and load-balance requests among API service containers, **_reducing latency_** and **_avoiding overload_** on each service.
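The load-balancing idea can be illustrated with a short Python sketch. Nginx distributes requests across an upstream group in round-robin order by default, but this README does not say which method is configured, so treat the strategy as an assumption. The container addresses and the `RoundRobinBalancer` class are hypothetical and only mimic the behaviour; they are not part of the project.

```python
from collections import Counter
from itertools import cycle

# Hypothetical container addresses; the real service names are not given here.
UPSTREAMS = ["api-1:8000", "api-2:8000", "api-3:8000"]


class RoundRobinBalancer:
    """Hand out upstream addresses in a fixed rotating order."""

    def __init__(self, upstreams):
        if not upstreams:
            raise ValueError("at least one upstream is required")
        self._rotation = cycle(upstreams)

    def pick(self) -> str:
        """Return the upstream that should receive the next request."""
        return next(self._rotation)


balancer = RoundRobinBalancer(UPSTREAMS)
assignments = [balancer.pick() for _ in range(9)]
load = Counter(assignments)
print(load)  # each of the three upstreams receives 3 of the 9 requests
```

Because every container receives an equal share of requests, no single API service becomes a hotspot, which is the overload-avoidance effect described above.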

![image](https://github.com/user-attachments/assets/11163700-dade-444e-8b19-d97bb7083237)
