An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with data-lake

A curated list of projects in awesome lists tagged with data-lake.

https://github.com/dlt-hub/dlt

data load tool (dlt) is an open source Python library that makes data loading easy 🛠️

data data-engineering data-lake data-loading data-warehouse elt extract load python transform

Last synced: 26 Mar 2025

https://github.com/apache/kyuubi

Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.

data-lake hacktoberfest hadoop hive jdbc kubernetes spark spark-sql sql thrift

Last synced: 13 May 2025

https://github.com/bytedance/bitsail

BitSail is a distributed high-performance data integration engine which supports batch, streaming and incremental scenarios. BitSail is widely used to synchronize hundreds of trillions of data every day.

big-data data-integration data-lake data-pipeline data-synchronization flink high-performance real-time

Last synced: 15 May 2025

https://github.com/teradata/kylo

Kylo is a data lake management software platform and framework for enabling scalable enterprise-class data lakes on big data technologies such as Teradata, Apache Spark and/or Hadoop. Kylo is licensed under Apache 2.0. Contributed by Teradata Inc.

data-lake hadoop kylo nifi spark teradata

Last synced: 15 May 2025

https://github.com/uber/marmaray

Generic Data Ingestion & Dispersal Library for Hadoop

avro-schema data-lake hadoop ingest-data schema-format spark

Last synced: 23 Mar 2025

https://github.com/gigapi/gigapi

GigAPI is a Timeseries lakehouse for real-time data and sub-second queries, powered by DuckDB OLAP + Parquet Query Engine, Compactor w/ Cloud-Native Storage. Drop-in FDAP alternative ⭐

api clickhouse-server data-lake database datalake duckdb duckdb-api duckdb-server ducklake fdap gigapipe golang lakehouse olap parquet qryn query-engine rest-api s3 sql

Last synced: 05 Oct 2025

https://github.com/awslabs/amazon-s3-find-and-forget

Amazon S3 Find and Forget is a solution to handle data erasure requests from data lakes stored on Amazon S3, for example, pursuant to the European General Data Protection Regulation (GDPR)

amazon-s3 aws big-data ccpa data data-erasure data-lake gdpr parquet privacy right-to-be-forgotten s3

Last synced: 04 Apr 2025

https://github.com/maxi-k/btrblocks

BtrBlocks: Efficient Columnar Compression for Data Lakes (SIGMOD 2023 Paper)

compression data-lake databases research

Last synced: 09 Apr 2025

https://github.com/azure/usql

U-SQL Examples and Issue Tracking

azure big-data data-lake u-sql

Last synced: 12 Apr 2025

https://github.com/garystafford/tickit-data-lake-demo

Resources for video demonstrations and blog posts related to DataOps on AWS

airflow aws data-lake dataops devops redshift

Last synced: 07 May 2025

https://github.com/azure/azuredatalake

Samples and Docs for Azure Data Lake Store and Analytics

azure big-data data-lake

Last synced: 09 Apr 2025

https://github.com/Canner/wren-engine

🤖 The semantic engine for LLMs, bringing semantic context to AI agents. 🔥

business-intelligence data data-analysis data-analytics data-lake data-warehouse hacktoberfest llm semantic semantic-layer sql

Last synced: 01 Apr 2025

https://github.com/smart-data-lake/smart-data-lake

Smart Automation Tool for building modern Data Lakes and Data Pipelines

data-lake data-pipelines deltalake hadoop hive scala smart-data-lake spark transform-data

Last synced: 13 Apr 2025

https://github.com/camunda-community-hub/zeeqs

GraphQL API for Zeebe data

data-lake graphql zeebe zeebe-tool

Last synced: 04 Oct 2025

https://github.com/OElesin/querypal

Web UI for Amazon Athena

analytics aws aws-athena data data-lake sql

Last synced: 30 Jul 2025

https://github.com/matsmoll/aligned

The DBT of ML: Aligned describes data dependencies in ML systems and reduces technical data debt

ai data-contracts data-lake datacontracts dbt feature-engineering feature-store ml ml-ops mlops

Last synced: 13 Aug 2025

https://github.com/imsanjoykb/python-mysql-operation

This Python MySQL Repo shows you how to use MySQL Connector Python to access MySQL databases. You will learn how to connect to MySQL database and perform common database operations such as SELECT, INSERT, UPDATE, & DELETE in Python.

data-lake database database-operation database-programming mysql-server python-mysql python-mysql-connector

Last synced: 20 Jun 2025

https://github.com/ec-europa/eubfr-data-lake

EU Budget for Results - Data Lake

data-lake

Last synced: 29 Apr 2025

https://github.com/apache/kyuubi-docker

Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.

data-lake hadoop hive jdbc kubernetes spark spark-sql sql thrift

Last synced: 19 Oct 2025

https://github.com/AuFeld/Data_Engineering_Projects

A collection of data engineering projects: data modeling, ETL pipelines, data lakes, infrastructure configuration on AWS, data warehousing, containerization, and a dashboard to monitor data pipeline KPIs

airflow aws cassandra data-engineering data-lake data-warehouse docker emr etl-pipeline infrastructure-as-code infrastructure-setup postgresql python redshift s3 spark

Last synced: 29 Jul 2025

https://github.com/treeverse/lakefs-hooks

A simple lakeFS webhook for pre-commit and pre-merge validation of data objects

data-engineering data-lake lakefs

Last synced: 23 Oct 2025

https://github.com/sap-samples/hana-cloud-relational-data-lake-onboarding

This is an end-to-end onboarding sample for SAP HANA Cloud, relational data lake. It shows how to create schema, load data, and execute queries.

data-lake hana-cloud sample sap-hana

Last synced: 05 Mar 2025

https://github.com/jlsilva01/adls-azure

Procedure for creating Azure Data Lake Storage using Terraform, through an MS Learn Sandbox subscription

adlsgen2 azure azurecli data-lake terraform

Last synced: 24 Jun 2025

https://github.com/datasphere-oss/datasphere

DataSphere is the first open-source cloud-native data observability platform that helps you trace the whole data infrastructure in your warehouses, lakes and databases.

cloud-native daas data-analytics data-governance data-lake data-management data-observability datamesh datasphere warehouse

Last synced: 14 Jul 2025

https://github.com/ibm-cloud/nodejs-data-lake-dashboard

Sample and tutorial that creates interactive dashboards using: Dynamic Dashboard Embedded, Cloud Object Storage, SQL Query, DB2 Warehouse and AppID.

cloud data-lake db2 db2-warehouse ibm-cloud ibm-cloud-solutions tutorial

Last synced: 22 Apr 2025

https://github.com/DataDrivenGit/Music-Streaming-App-using-AWS-ETL

Implemented a data warehouse and a data lake on AWS, with data modeling in Postgres and Apache Cassandra; also used Apache Airflow to create a data pipeline

airflow-operators cassandra data-lake data-pipelines datawarehouse postgres python3 sql

Last synced: 20 Jul 2025

https://github.com/bahbosque/delta-to-iceberg-aws-glue

Tool to migrate Delta Lake tables to Apache Iceberg using AWS Glue and S3

apache-iceberg aws aws-glue-data-catalog data-lake delta-lake migration-tool open-source spark

Last synced: 03 Jul 2025

https://github.com/codewell/data-kale

The Simple Data Lake - Data Kale

data data-lake python

Last synced: 22 Feb 2025

https://github.com/pirate-emperor/bigdata-pipeline

BigData Pipeline is a local testing environment for experimenting with various storage solutions (RDB, HDFS), query engines (Trino), schedulers (Airflow), and ETL/ELT tools (DBT). It supports MySQL, Hadoop, Hive, Kudu, and more.

airflow airflow-dags airflow-docker big-data data-lake data-lakestore data-warehouse dbt dbt-core distributed-computing docker docker-compose hadoop hive hiveql kudu mysql mysql-server trino trino-cli

Last synced: 01 Aug 2025

https://github.com/shrikantnaidu/data-lakehouse-with-delta-lake

Setting up a data lakehouse with Delta Lake using Docker

data-lake delta-lake docker-compose pyspark

Last synced: 30 Jul 2025

https://github.com/san089/yelp_project

This project creates a data lake for the Yelp dataset, then uses it to build an analytical sandbox for data science and a data warehouse for reporting.

data-lake data-pipeline etl etl-pipeline ingestion load pyspark recommender-system redshift

Last synced: 06 Mar 2025

https://github.com/santiagortiiz/snowflake-data-pipelines

EPAM's Snowflake hands-on lab. We built a pipeline to read and load data from S3 into Snowflake, developed an ETL workflow to clean the data and stored it in a data warehouse with the 3NF and Star schemas for data mart analysis.

business-intelligence data-lake data-pipelines data-warehouse etl snowflake streams

Last synced: 26 Jun 2025

https://github.com/cds-snc/data-lake

Infrastructure for the Platform Data Lake

aws data-lake terraform

Last synced: 10 Apr 2025

https://github.com/supakunz/book-revenue-pipeline

A ready-to-use Docker-based template for data engineering projects, featuring a complete stack with Apache Airflow, Spark, and MinIO for building scalable data pipelines.

apache-airflow apache-spark data-engineering data-lake delta-lake docker docker-compose etl-pipeline minio pyspark python s3 template-boilerplate

Last synced: 30 Dec 2025

https://github.com/najuzilu/dl-spark

Building a Data Lake with Spark

aws-emr aws-s3 data-engineering data-lake etl-pipeline spark

Last synced: 25 Aug 2025

https://github.com/narius2030/datalake-solution-imcp

This project involved the development and implementation of a Data Lake architecture to support an AI model capable of generating image captions. The architecture was designed to efficiently ingest, process, and centrally store large volumes of image and text data.

data-lake docker-container etl-pipeline fastapi medallion-architecture mlops nosql-database object-storage

Last synced: 25 Mar 2025

https://github.com/mikeacosta/cowculate

Herd management ETL and analytics processing

apache-spark aws cloudformation data-lake etl kinesis-stream redshift s3

Last synced: 08 Oct 2025

https://github.com/stevehoober254/dataengineer-portfolio

📊 End-to-end ETL pipelines, Airflow DAGs, notebook-driven analytics & data warehousing

airflow analytics big-data dagster data-engineering data-lake data-pipelines etl python spark

Last synced: 09 Oct 2025

https://github.com/vermicida/data-lake

Data Lake: the code for project #4 of Udacity's Data Engineer Nanodegree Program

aws-s3 data-engineering data-lake etl-pipeline python spark

Last synced: 12 Oct 2025

https://github.com/toch/oasis

A DIY data lake 🌴

data-lake oasis

Last synced: 20 Feb 2025

https://github.com/morgan-sell/usa-tourism-etl

Coalesced and transformed various data sources to create a comprehensive data lake for the USA tourism sector.

aws data-engineering data-lake emr-cluster etl-pipeline python spark

Last synced: 02 Apr 2025

https://github.com/banknatchapol/sparkify-data-lake

Build an ETL pipeline for a data lake hosted on AWS S3.

aws data-lake

Last synced: 03 Sep 2025

https://github.com/anuveyatsu/cloudflare-data-fabric

Cloudflare Data Fabric: Use Cloudflare's global infrastructure to build a flexible, resilient framework for data solutions.

cloudflare data data-lake fabric lakehouse mesh

Last synced: 12 Sep 2025

https://github.com/aymendaoudi/electric-vehicle-charging-simulator

Simulation, ingestion, and ETL of data from millions of EV charging sessions by thousands of EVs at thousands of stations around the world.

apache-airflow apache-spark batch-processing data-engineering data-lake data-warehouse kafka kafka-connect lake-fs minio mongodb postgresql python3 simpy spark stream-processing

Last synced: 27 Mar 2025

https://github.com/gakas14/aws-serverless-data-lake

This workshop builds a serverless data lake architecture using Amazon Kinesis Firehose for streaming data ingestion, AWS Glue for data integration (ETL, catalogue management), Amazon S3 for data lake storage, and Amazon Athena for SQL big data analytics.

athena aws data-lake etl glue-catalog glue-etl kinesis-firehose kinesis-stream s3 sql

Last synced: 20 Feb 2025

https://github.com/zxkane/serverless-docker-images-analytics

Serverless Analytics app for analyzing docker image layers

analytics aws aws-athena aws-cdk aws-glue big-data data-lake

Last synced: 28 Mar 2025

https://github.com/manuel-lang/data-lake-with-spark

Project Data Lake as part of Udacity's Data Engineering Nanodegree

data-engineering data-lake etl-pipeline s3 spark udacity udacity-data-engineer-nanodegree

Last synced: 01 Mar 2025

https://github.com/josecsotomorales/pyspark-datalake

Data Lake with PySpark and AWS S3

data-lake pyspark python spark

Last synced: 27 Mar 2025

https://github.com/adilkhash/aws-meetup-almaty-2019-data-lake

Resources for AWS Almaty Meetup: Building scalable Data Lake with AWS

amazon-athena amazon-s3 aws aws-glue data-lake etl-pipeline luigi-workflows

Last synced: 27 Feb 2025

https://github.com/jaimealruiz/proyecto-tfg

Design and implementation of an interconnection between LLMs and Data Spaces

data-lake mcp mcp-client mcp-server rag

Last synced: 08 Sep 2025

https://github.com/yasarsultan/taxi-trip-analysis

The NYC Taxi Trip Batch Data Pipeline automates processing of large-scale trip data using Apache Spark and Airflow, integrating AWS S3 and Google BigQuery for storage and analytics. It features scalable, containerized workflows with robust data validation.

airflow aws-s3 bash-script batch-processing bigquery data-lake data-warehouse docker python3 spark

Last synced: 01 Mar 2025

https://github.com/narius2030/imcp-support-blinders

This project focuses on image captioning by creating two primary models: DarkNetLM and DarkNetVG2. Both models leverage the CSP DarkNet53 architecture as the backbone of YOLOv8 for feature extraction from images, combined with Transformers or LSTM to generate captions.

computer-vision data-lake image-captioning large-language-model mobile-app

Last synced: 30 Oct 2025

https://github.com/jcguidry/flight-ml-ingest-gcp-func

Ingests flight data for ML projects, using serverless cloud functions

cloud-function data-lake flightaware-aeroapi-service gcp github-actions pandas

Last synced: 07 Nov 2025

https://github.com/sugumarsrinivasan/sql-datawarehouse-project

Building a modern data warehouse with SQL Server, including ETL processes, data modeling, and data analytics.

data-analysis data-analytics data-engineering data-lake data-science data-warehouse datawarehousing etl etl-pipeline medallion-architecture sql sql-query sql-server

Last synced: 24 Oct 2025

https://github.com/dina-hosny/sparkify---data-lake-with-aws

Sparkify - Data Lake with AWS - Udacity Data Engineering Expert Track.

analytics aws data-engineering data-lake data-pipelines dataset etl fwd udacity

Last synced: 30 Jul 2025

https://github.com/mxagar/data_engineering_guide

Personal notes on the IBM Data Engineering Certificate as well as other sources focusing on AWS.

airflow aws data-lake data-modeling data-pipelines data-science no-sql spark sql warehouse

Last synced: 27 Jul 2025

https://github.com/lesiaukr/goit-de-fp

Master's degree | Data Engineering | Final course projects | goit-de-fp

apache-airflow apache-kafka apache-spark data-lake docker goit-de-fp python streaming-pipeline

Last synced: 25 Jul 2025

https://github.com/jibbs1703/tickit-data-pipeline

This repository demonstrates the creation of a robust data pipeline using an orchestrator with on-prem and cloud resources. It collects data from on-premises SQL and NoSQL databases and loads it into a SQL database in the cloud.

aws-glue aws-glue-crawler aws-glue-data-catalog aws-redshift aws-s3 boto3 data-lake database etl-pipeline medallion-architecture mongodb precommit-hooks

Last synced: 17 Sep 2025

https://github.com/paznera/php-minio-obj-store

CLI (Laravel Prompts) example of btrfs-like storage behind an S3 interface (MinIO) and in-memory (Redis) indexing with object metadata

btrfs buckets data-lake laravel minio php redis s3 snapshots

Last synced: 03 Apr 2025

https://github.com/daniel-jcvv/daniel-jcvv

👨‍💻 Data Engineer | 3+ years enterprise experience with Telcel & Citi Banamex. Develops ETL pipelines, data governance, and cloud solutions. Building scalable data architectures and automated workflows for Fortune 500 clients. Tech Stack: Python, SQL Server, Oracle, Apache Airflow, PySpark

agentic-ai apache-airflow apache-kafka apache-spark automation business-intelligence citi-bank-apis data-analysis data-engineering data-lake data-warehouse etl-pipeline medallion-architecture mlops n8n-workflow python rag sql-server

Last synced: 02 Nov 2025

https://github.com/ahmedd38/dataengineer-portfolio

📊 End-to-end ETL pipelines, Airflow DAGs, notebook-driven analytics & data warehousing

airflow analytics big-data dagster data-engineering data-lake data-pipelines etl python spark

Last synced: 15 Apr 2025

https://github.com/dominique-jacque/nba-data-lake

NBA Data Lake Repository contains the setup_nba_data_lake.py script, which automates the creation of a data lake for NBA analytics using AWS services. The script integrates Amazon S3, AWS Glue, and Amazon Athena, and sets up the infrastructure needed to store and query NBA-related data.

amazon-athena api aws-glue cloudshell data-lake iam s3

Last synced: 08 Mar 2025