https://github.com/narius2030/datalake-solution-imcp
This project developed and implemented a Data Lake architecture to support an AI model that generates image captions. The architecture was designed to efficiently ingest, process, and centrally store large volumes of image and text data.
- Host: GitHub
- URL: https://github.com/narius2030/datalake-solution-imcp
- Owner: Narius2030
- Created: 2024-08-24T15:53:17.000Z
- Default Branch: main
- Last Pushed: 2025-02-06T13:54:14.000Z
- Last Synced: 2025-02-06T14:38:01.651Z
- Topics: data-lake, docker-container, etl-pipeline, fastapi, medallion-architecture, mlops, nosql-database, object-storage
- Language: Python
- Homepage:
- Size: 193 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
## Overall Architecture

## Detailed Architecture

## Storage Structure in Data Lake

## Overall Data Pipeline

## Practical Data Pipeline
At the `Bronze` layer:
* Ingestion is split into **3 DAGs**, each collecting data from the sources
* Each DAG collects raw data from Parquet files and user files (including images and metadata) at the source and loads it into the MongoDB and MinIO aggregate stores
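
The Bronze step described above can be sketched roughly as follows. This is only an illustration under assumptions, not the repository's code: `InMemoryStores`, `ingest_raw_record`, the `bronze/images/` key prefix, and the record fields are all hypothetical, and plain Python containers stand in for the MongoDB collection and the MinIO bucket so the example runs without any services.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class InMemoryStores:
    """Hypothetical stand-ins: `documents` plays the role of a MongoDB
    collection, `objects` the role of a MinIO bucket."""
    documents: list = field(default_factory=list)
    objects: dict = field(default_factory=dict)


def ingest_raw_record(record: dict, image_bytes: bytes, stores: InMemoryStores) -> str:
    """Store the image under a content-derived key and keep the raw
    metadata unchanged, plus a pointer to the stored object."""
    # Assumed key layout; the real bucket structure is not given in the text.
    object_key = f"bronze/images/{hashlib.sha256(image_bytes).hexdigest()}.jpg"
    stores.objects[object_key] = image_bytes
    stores.documents.append({**record, "object_key": object_key, "layer": "bronze"})
    return object_key


stores = InMemoryStores()
key = ingest_raw_record(
    {"caption": " A dog on grass ", "source": "user_upload"},
    b"\xff\xd8fake-jpeg-bytes",
    stores,
)
```

In a real deployment each of the three DAGs would presumably wrap a function like this in a task and write to the actual stores; keeping the metadata untouched at this stage leaves all cleaning to the Silver layer.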



At the `Silver` and `Gold` layers:
* The Silver layer refines the raw metadata from Bronze, producing the refined metadata for the `Catalog` layer of the Data Lake
* The Gold layer extracts image features from the sources and saves them in MinIO
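
As a rough illustration of the Silver refinement step, the sketch below cleans a batch of raw metadata records. The text does not say which rules the project applies, so the ones used here (whitespace and case normalization, dropping records without a caption or object key, removing duplicates) are assumptions, as are the function name `refine_metadata` and the field names. The Gold feature extraction is not shown because it depends on the model in use.

```python
def refine_metadata(raw_docs: list[dict]) -> list[dict]:
    """Return cleaned, de-duplicated metadata rows from raw Bronze records."""
    refined, seen = [], set()
    for doc in raw_docs:
        # Collapse runs of whitespace and lowercase the caption.
        caption = " ".join(str(doc.get("caption") or "").split()).lower()
        object_key = doc.get("object_key")
        # Skip records that cannot be linked to an image or have no text.
        if not caption or not object_key:
            continue
        # Skip exact repeats of the same image/caption pair.
        if (object_key, caption) in seen:
            continue
        seen.add((object_key, caption))
        refined.append({"object_key": object_key, "caption": caption, "layer": "silver"})
    return refined


raw = [
    {"object_key": "bronze/images/a.jpg", "caption": "  A Dog  on grass "},
    {"object_key": "bronze/images/a.jpg", "caption": "a dog on grass"},
    {"object_key": "bronze/images/b.jpg", "caption": ""},
]
catalog_rows = refine_metadata(raw)
```

Here the second record collapses into the first after normalization and the third is dropped for having no caption, so a single refined row remains for the catalog.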
