{"id":22242321,"url":"https://github.com/narius2030/datalake-solution-imcp","last_synced_at":"2026-04-15T15:44:19.515Z","repository":{"id":265920574,"uuid":"847015174","full_name":"Narius2030/DataLake-Solution-IMCP","owner":"Narius2030","description":"This project involved the development and implementation of a Data Lake architecture to support an AI model capable of generating image captions. The architecture was designed to efficiently ingest, process, and centrally store large volumes of image and text data.","archived":false,"fork":false,"pushed_at":"2025-02-06T13:54:14.000Z","size":201880,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-06T14:38:01.651Z","etag":null,"topics":["data-lake","docker-container","etl-pipeline","fastapi","medallion-architecture","mlops","nosql-database","object-storage"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Narius2030.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-08-24T15:53:17.000Z","updated_at":"2025-02-06T13:54:17.000Z","dependencies_parsed_at":"2025-02-06T14:42:34.730Z","dependency_job_id":null,"html_url":"https://github.com/Narius2030/DataLake-Solution-IMCP","commit_stats":null,"previous_names":["narius2030/datalake-solution-imcp"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Narius2030%2FDataLake-Solution-IMCP","tags_url":"https://repos.ecosyste.ms/ap
i/v1/hosts/GitHub/repositories/Narius2030%2FDataLake-Solution-IMCP/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Narius2030%2FDataLake-Solution-IMCP/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Narius2030%2FDataLake-Solution-IMCP/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Narius2030","download_url":"https://codeload.github.com/Narius2030/DataLake-Solution-IMCP/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245449430,"owners_count":20617185,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-lake","docker-container","etl-pipeline","fastapi","medallion-architecture","mlops","nosql-database","object-storage"],"created_at":"2024-12-03T04:15:44.445Z","updated_at":"2026-04-15T15:44:19.479Z","avatar_url":"https://github.com/Narius2030.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"## Overall Architecture\n![image](https://github.com/user-attachments/assets/c8195cf6-ee86-46f8-8890-ddd793b68cb5)\n\n\n## Detailed Architecture\n![image](https://github.com/user-attachments/assets/13726b7e-6c91-4453-a291-1dda31684cd1)\n\n## Storage Structure in Data Lake:\n![image](https://github.com/user-attachments/assets/5b9185f7-71c0-467a-a826-36881d5db6b6)\n\n\n\n## Overall Data Pipeline\n![image](https://github.com/user-attachments/assets/0f0e0040-8681-4b8f-9ba0-ec1eea828972)\n\n\n## Practical Data Pipeline\nAt the `Bronze` layer:\n* The work is divided into **3 DAGs** that 
collect data from sources\n* Each DAG collects raw data from the sources (Parquet files and user files, including images and metadata) and loads it into the MongoDB and MinIO aggregate stores\n\n![image](https://github.com/user-attachments/assets/1bb6786b-38b4-4207-be2c-394e9d7dc9a7)\n\n![image](https://github.com/user-attachments/assets/b4d65fbb-fd18-4ab7-8102-de535c38a960)\n\n![image](https://github.com/user-attachments/assets/14e114a5-7ef7-47e8-b518-fae740a9f08d)\n\nAt the `Silver` and `Gold` layers:\n* The Silver layer refines the raw metadata from Bronze, producing the refined metadata for the `Catalog` layer of the Data Lake\n* The Gold layer extracts image features from the sources and saves them in MinIO\n\n![image](https://github.com/user-attachments/assets/85e16c66-f599-4191-9a9e-5ac05edb54b9)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnarius2030%2Fdatalake-solution-imcp","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnarius2030%2Fdatalake-solution-imcp","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnarius2030%2Fdatalake-solution-imcp/lists"}