An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with datalake

A curated list of projects in awesome lists tagged with datalake.

https://github.com/sinaptik-ai/pandas-ai

Chat with your database or your datalake (SQL, CSV, parquet). PandasAI makes data analysis conversational using LLMs and RAG.

ai csv data data-analysis data-science data-visualization database datalake gpt-4 llm pandas sql text-to-sql

Last synced: 12 May 2025

https://github.com/trinodb/trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)

analytics big-data data-science database databases datalake delta-lake distributed-database distributed-systems hadoop hive iceberg java jdbc presto prestodb query-engine sql trino

Last synced: 12 Nov 2025

https://github.com/starrocks/starrocks

The world's fastest open query engine for sub-second analytics both on and off the data lakehouse. With the flexibility to support nearly any scenario, StarRocks provides best-in-class performance for multi-dimensional analytics, real-time analytics, and ad-hoc queries. A Linux Foundation project.

analytics big-data cloudnative database datalake delta-lake distributed-database hudi iceberg join lakehouse lakehouse-platform mpp olap real-time-analytics real-time-updates realtime-database sql star-schema vectorized

Last synced: 06 Jan 2026

https://github.com/activeloopai/deeplake

Database for AI. Store Vectors, Images, Texts, Videos, etc. Use with LLMs/LangChain. Store, query, version, & visualize any AI data. Stream data in real-time to PyTorch/TensorFlow. https://activeloop.ai

ai computer-vision cv data-science datalake datasets deep-learning image-processing langchain large-language-models llm machine-learning ml mlops multi-modal python pytorch tensorflow vector-database vector-search

Last synced: 13 May 2025

https://github.com/apache/hudi

Upserts, Deletes And Incremental Processing on Big Data.

apacheflink apachehudi apachespark bigdata data-integration datalake hudi incremental-processing stream-processing

Last synced: 12 May 2025

https://github.com/DataLinkDC/dinky

Dinky is a real-time data development platform based on Apache Flink, enabling agile data development, deployment and operation.

datalake datawarehouse flink flinkcdc flinksql olap real-time-computing-platform sql

Last synced: 27 Mar 2025

https://github.com/lakesoul-io/lakesoul

LakeSoul is an end-to-end, realtime and cloud native Lakehouse framework with fast data ingestion, concurrent update and incremental data analytics on cloud storages for both BI and AI applications.

arrow big-data datafusion datalake flink huggingface lakehouse lakesoul postgresql python pytorch rust spark sql streaming vectorized velox

Last synced: 14 May 2025

https://github.com/apache/gravitino

World's most powerful open data catalog for building a high-performance, geo-distributed and federated metadata lake.

ai-catalog data-catalog datalake federated-query lakehouse metadata metalake model-catalog opendatacatalog skycomputing stratosphere

Last synced: 13 May 2025

https://github.com/apache/amoro

Apache Amoro (incubating) is a Lakehouse management system built on open data lake formats.

bigdata datalake lakehouse

Last synced: 14 May 2025

https://github.com/Datavault-UK/automate-dv

A free-to-use dbt package for creating and loading Data Vault 2.0 compliant data warehouses (powered by dbt, an open source data engineering tool and registered trademark of dbt Labs).

data-vault dataengineering datalake datavault datavault20 datawarehouse datawarehousing dbt elt etl metadata snowflake sql

Last synced: 13 May 2025

https://github.com/linkedin/openhouse

Open Control Plane for Tables in Data Lakehouse

big-data catalog datalake datalakehouse declarative iceberg management tables

Last synced: 17 Aug 2025

https://github.com/gigapi/gigapi

GigAPI is a Timeseries lakehouse for real-time data and sub-second queries, powered by DuckDB OLAP + Parquet Query Engine, Compactor w/ Cloud-Native Storage. Drop-in FDAP alternative ⭐

api clickhouse-server data-lake database datalake duckdb duckdb-api duckdb-server ducklake fdap gigapipe golang lakehouse olap parquet qryn query-engine rest-api s3 sql

Last synced: 05 Oct 2025

https://github.com/WeBankFinTech/Streamis

Streaming application development and management system, based on Linkis and DSS, planning to provide the workflow-like graphical drag-and-drop development capability.

datalake dataspherestudio deltalake flink hudi iceberg kafka linkis streaming streamis warehouse wedatasphere

Last synced: 15 Jul 2025

https://github.com/learningjournal/spark-streaming-in-scala

Apache Spark 3 - Structured Streaming Course Material

apache-spark big-data bigdata datalake scala spark spark-sql spark-streaming

Last synced: 16 May 2025

https://github.com/ExpediaGroup/apiary

Apiary provides modules which can be combined to create a federated cloud data lake

aws datalake hive hive-metastore

Last synced: 13 May 2025

https://github.com/absaoss/enceladus

Dynamic Conformance Engine

bigdata datalake hadoop mongodb scala spark spring

Last synced: 20 Aug 2025

https://github.com/abdullahkhawer/aws-auto-terminate-idle-emr

An AWS-based solution that uses AWS CloudWatch and a Python-based AWS Lambda function to automatically terminate AWS EMR clusters that have been idle for a specified period of time.

amazon-web-services automation aws aws-cloudformation aws-cloudwatch aws-emr aws-lambda bigdata boto3 cft cloudformation cloudwatch datalake emr etl idle python python-3-7 serverless terminate

Last synced: 02 Jul 2025

https://github.com/aws-solutions-library-samples/aws-insurancelake-etl

This solution helps you deploy ETL processes and data storage resources to create an Insurance Lake using Amazon S3 buckets for storage, AWS Glue for data transformation, and AWS CDK Pipelines. It is based on the AWS blog post "Deploy data lake ETL jobs using CDK Pipelines" and complements the InsuranceLake Infrastructure project.

aws cdk datalake glue insurance

Last synced: 15 Apr 2025

https://github.com/imsanjoykb/etl-project

The goal of this project is to illustrate Extract Transform Load (ETL) using Python and SQL. ETL is a process commonly done in computing, which takes raw data, cleans it and stores it for later use. The extraction phase targets and retrieves the data. Transform manipulates and cleans the data. Then load stores the data, typically in a data warehouse.

data-engineering database datalake datawarehouse etl etl-automation etl-pipeline etl-solutions

Last synced: 18 Aug 2025

https://github.com/prestodb/prestorials

Tutorials and examples of how to deploy Presto and connect it to different data sources

aws awsglue data datalake docker example glue lakehouse mongodb presto presto-connector prestodb prestosql sql tutorial walkthrough

Last synced: 24 Oct 2025

https://github.com/lynnlangit/serverless-architecture

Companion to my LinkedIn Learning 'Serverless Architecture' course

aws-lambda azure-functions datalake gcp-cloud-functions serverless serverless-architectures

Last synced: 03 Apr 2025

https://github.com/AWS-Big-Data-Projects/AWS-Data-Lake

AWS Lake Formation makes it easy to set up, secure, and manage your data lakes. It also supports data discovery through the metadata search capabilities of the Lake Formation console, with search results restricted by column permissions.

aws-s3 datalake

Last synced: 20 Jul 2025

https://github.com/aws-solutions-library-samples/aws-insurancelake-infrastructure

This solution helps you deploy ETL processes and data storage resources to create an Insurance Lake using Amazon S3 buckets for storage, AWS Glue for data transformation, and AWS CDK Pipelines. It is based on the AWS blog post "Deploy data lake ETL jobs using CDK Pipelines" and complements the InsuranceLake ETL with CDK Pipelines project.

aws cdk datalake insurance

Last synced: 14 Oct 2025

https://github.com/ismailsimsek/iceberg-examples

Apache Iceberg Spark S3 examples

datalake iceberg s3 sql sql-merge

Last synced: 04 May 2025

https://github.com/stonezhong/DataManager

Better organize data in a data lake and build ETL pipelines with a web UI tool.

datalake datawarehouse etl spark sparksql

Last synced: 20 Jul 2025

https://github.com/calvinhartwell/getting-started-with-kylo

An introduction to using Kylo, an open source data lake builder from Teradata

apache-nifi datalake gitbook hadoop hdp kylo nifi spark teradata thinkbig thinkbiganalytics

Last synced: 11 Jun 2025

https://github.com/mimetis/projecty

Project Y is a straightforward Landing Zones automated deployment tool dedicated to data processing.

azure azuredatabricks azuredatafactory azurekeyvault azurelandingzone databricks datalake synapse

Last synced: 12 Apr 2025

https://github.com/kleinyuan/llama2-csv-webapp

Self-hosted or locally hosted Llama 2-based web app for chatting with multiple CSV files

chatgpt csv datalake large-language-models llama2 llm meta openai pandas pandas-ai pandasai streamlit

Last synced: 12 Apr 2025

https://github.com/kimtth/pyspark-tika-text-extraction

🚴‍♂️⛷ Data lake: performance tuning for text extraction from a huge number of files.

apache-spark apache-tika data-pipeline datalake multithreading pyspark spark tika-python

Last synced: 17 Jul 2025

https://github.com/lynnlangit/learning-nosql

Companion repository to the LinkedIn Learning course 'Cloud NoSQL for SQL Pros'

aws-dynamodb data datalake dynamodb gcp-bigtable nosql vector-database

Last synced: 03 Apr 2025

https://github.com/macieklesiczka/azof

Lakehouse with time travel

datafusion datalake lakehouse parquet rust-lang

Last synced: 11 Oct 2025

https://github.com/divithraju/divith-raju-immigration-data-engineering

A Capstone Project that covers several aspects of Data Engineering (Data Exploration, Cleaning, Modeling, Pipelining, Processing)

apachespark bigdata bigdataprocessing bigdataproject capstone-project datacleaning dataengineering datalake datamodeling datapipeline dataprocessing dataschema dataset datawherehouse pandas sql

Last synced: 20 Feb 2025

https://github.com/ac-gomes/data_engineer_with_airflow

This project is an adaptation based on a real test for a Junior Data Engineer position.

airflow aws-s3 azure-storage datalake datalake-ingestion json-api postgres python3

Last synced: 22 Feb 2025

https://github.com/neuro-ml/tarn

An insanely customizable framework for key-value storage 💾

cache datalake memoization persistent python storage

Last synced: 23 Apr 2025

https://github.com/macieklesiczka/bazof

Lakehouse with time travel

datafusion datalake lakehouse parquet rust-lang

Last synced: 22 Mar 2025

https://github.com/leonardodrigo/breweries-data-lake

This project builds an Azure Data Lake using the Medallion architecture to process data with Spark from the Open Breweries DB API.

airflow azure brewerydb datalake docker docker-compose pyspark

Last synced: 05 Apr 2025

https://github.com/hussein-awala/gdpr-compliant-lakehouse

This repository is a demonstration of how to handle GDPR export and delete requests in an Iceberg Lakehouse to make it GDPR-compliant.

apache-iceberg apache-spark datalake gdpr lakehouse

Last synced: 08 Sep 2025

https://github.com/thunchanokbow/audiblebook-revenue

Manage big data on cloud computing to find a list of best-selling audible books, generate reports and dashboards, and provide products and sales promotions that meet the needs of consumers in Thailand

apache-airflow bigquery cloudcomposer data-visualization datalake datawarehouse googlecloudstorage lookerstudio pandas python3

Last synced: 19 Jul 2025

https://github.com/simonjang/s3-query-json

Query JSON documents on S3 with SQL

datalake nodejs s3

Last synced: 12 Jun 2025

https://github.com/postpayio/ness

A Python datalake client.

datalake pandas s3

Last synced: 16 Jun 2025

https://github.com/richclement/aws-data-lake-sdk

An SDK for the AWS data lake.

aws datalake sdk

Last synced: 10 May 2025

https://github.com/nxion/sql-data-warehouse-project

Building a modern data warehouse with MS SQL Server, ETL processes, data modeling, and analytics.

data data-analysis data-analytics data-engineering data-lakehouse data-warehouse datalake datascience etl etl-job medallion-architecture ms mssql sql sql-query sql-server

Last synced: 03 Mar 2025

https://github.com/mnpw/mdex

Iceberg metadata explorer

datalake iceberg rust

Last synced: 13 Sep 2025

https://github.com/ayushman0511/data-warehouse-project1

A comprehensive guide to building a data warehouse with SQL Server, including ETL processes, data modeling, and analytics.

data data-ana data-anal data-cleaning data-enginee data-lakehou datalake datasci dataware datawarehouse datawarehousi etl etl-job etl-pipeline medallion sql sql-quer sql-query sql-server sqlserver

Last synced: 26 Jun 2025

https://github.com/niranjanrao07/data-226-assignments

This repository includes assignments for DATA 226, focused on designing databases, implementing SQL for analytics, performing ETL operations, building data pipelines, and conducting OLAP.

airflow datalake datawarehouse dbt pipeline python snowflake sql

Last synced: 05 Sep 2025

https://github.com/sheitak/datalake-jljq

Data lake project for ingesting and transforming financial data, with a BI dashboard proposal

airflow-docker big-data datalake pyspark python3 spark

Last synced: 11 Sep 2025

https://github.com/senaldolage/wa-road-insights-pipeline

End-to-end Azure data pipeline project analyzing Western Australia transport datasets with dashboards built in Tableau. Featuring Data Factory, Databricks, Synapse, and Data Lake Gen2.

datalake pyspark synapse-analytics tableau

Last synced: 27 Jun 2025

https://github.com/pprzetacznik/datalake-aws

Sample data lake pipeline on AWS implemented using Terraform

aws csv datalake parquet python terraform

Last synced: 29 Mar 2025

https://github.com/chandima2000/adventure-works-sales-data-engineering-project

The aim of this project is to build an end-to-end data engineering project using Microsoft Azure

adf azure data-engineering databricks datalake etl-pipeline

Last synced: 30 Mar 2025

https://github.com/jszafran/personal-aws-data-lake

Personal, cloud-based (AWS) data lake for experimenting with cloud services.

aws cloud data data-engineering dataengineering datalake etl terraform

Last synced: 17 Mar 2025

https://github.com/jayhan94/ducklake-java

A Java implementation of DuckLake

datalake ducklake lakehouse

Last synced: 23 Jun 2025

https://github.com/dilermando-lima/trino-pg-mysql-s3-parquet

Trino cluster that collects data from MySQL and PostgreSQL, processes it, and saves it to S3 as Parquet

bigdata datalake docker mysql postgresql query-engine s3 trino trinodb

Last synced: 07 May 2025

https://github.com/dougdss89/wideworldadventure

This repository includes all files that compose the design and unification of the databases AdventureWorks and WideWorldAdventure project.

bigdata databricks datalake datawarehouse dbt deltalake duckdb elt etl etl-pipeline spark

Last synced: 05 Oct 2025

https://github.com/tuanai-vireox/open-data-lakehouse

Build an Open Data Lakehouse

datalake lakehouse

Last synced: 22 Aug 2025

https://github.com/trannhatnguyen2/bi_cloud_kientap

Building a Business Intelligence Solution on the Microsoft Azure Cloud Platform with Dynamic ELT Integration

azure datalake datawarehouse powerbi

Last synced: 29 Aug 2025

https://github.com/jayhan94/minilake

A modern mini lakehouse based on Spark and Delta, running in Docker.

analytics datalake deltalake lakehouse spark

Last synced: 14 Mar 2025

https://github.com/deddyandri/tokyo-olympic-azure-data-analyst-project

Tokyo Olympics Azure data analysis and engineering project

azure datalake powerbi sql

Last synced: 09 Sep 2025

https://github.com/matz1979/spark-etl-pipelines

My final big data project, built with Spark

bigdata datalake etl-pipeline python spark

Last synced: 09 Sep 2025

https://github.com/nataliabeltranarg/nosql-dataarchitecture-spark

Implementing core components of a data-driven architecture using Spark: Data Management and Data Analysis Backbones with structured zones in a data lake and analytical capabilities

data-science dataarchitecture datalake datamanagement java-8 javajdk pyspark spark

Last synced: 29 Oct 2025

https://github.com/leehuwuj/lake-inspector

Inspect your lakehouse data by using PyArrow

arrow datalake lakehouse pyarrow

Last synced: 03 Apr 2025

https://github.com/trannhatnguyen2/bi_datalake_azure

Building Data Lake on the Microsoft Azure Cloud Platform

azure databricks datalake powerbi sql-server

Last synced: 06 Mar 2025

https://github.com/malondaclement/datalake

DataLake project 💾

datalake mysql python3

Last synced: 13 Oct 2025

https://github.com/orvillex/datalake

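This tutorial mainly shares knowledge about the current mainstream data lake frameworks. The current plan is to write tutorials on how to use the three mainstream frameworks: Delta Lake, Hudi, and Iceberg.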

datalake delta-lake hudi iceberg scala spark

Last synced: 20 Mar 2025

https://github.com/scrogson/duckpond-rs

Rust implementation of the DuckLake lakehouse format

datalake ducklake mysql postgres rust sqlite

Last synced: 25 Jun 2025