An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with datalake

A curated list of projects in awesome lists tagged with datalake.

https://github.com/sinaptik-ai/pandas-ai

Chat with your database or your datalake (SQL, CSV, parquet). PandasAI makes data analysis conversational using LLMs and RAG.

ai csv data data-analysis data-science data-visualization database datalake gpt-4 llm pandas sql text-to-sql

Last synced: 15 Jan 2026

https://github.com/trinodb/trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)

analytics big-data data-science database databases datalake delta-lake distributed-database distributed-systems hadoop hive iceberg java jdbc presto prestodb query-engine sql trino

Last synced: 02 Apr 2026

https://github.com/starrocks/starrocks

The world's fastest open query engine for sub-second analytics both on and off the data lakehouse. With the flexibility to support nearly any scenario, StarRocks provides best-in-class performance for multi-dimensional analytics, real-time analytics, and ad-hoc queries. A Linux Foundation project.

analytics big-data cloudnative database datalake delta-lake distributed-database hudi iceberg join lakehouse lakehouse-platform mpp olap real-time-analytics real-time-updates realtime-database sql star-schema vectorized

Last synced: 16 Feb 2026

https://github.com/activeloopai/deeplake

Deeplake is an AI Data Runtime for Agents. It provides serverless Postgres with a multimodal datalake, enabling scalable retrieval and training.

agent agentic-rag ai clawbot computer-vision datalake deep-learning filesystem large-language-models llm memory mlops multimodal openclaw postgres pytorch rag skill vector-database

Last synced: 11 Jun 2026

https://github.com/apache/hudi

Upserts, Deletes And Incremental Processing on Big Data.

apacheflink apachehudi apachespark bigdata data-integration datalake hudi incremental-processing stream-processing

Last synced: 12 May 2025

https://github.com/DataLinkDC/dinky

Dinky is a real-time data development platform based on Apache Flink, enabling agile data development, deployment and operation.

datalake datawarehouse flink flinkcdc flinksql olap real-time-computing-platform sql

Last synced: 27 Mar 2025

https://github.com/lakesoul-io/lakesoul

LakeSoul is an end-to-end, real-time, cloud-native lakehouse framework with fast data ingestion, concurrent updates, and incremental data analytics on cloud storage for both BI and AI applications.

arrow big-data datafusion datalake flink huggingface lakehouse lakesoul postgresql python pytorch rust spark sql streaming vectorized velox

Last synced: 14 May 2025

https://github.com/apache/Gravitino

World's most powerful open data catalog for building a high-performance, geo-distributed and federated metadata lake.

ai-catalog data-catalog datalake federated-query lakehouse metadata metalake model-catalog opendatacatalog skycomputing stratosphere

Last synced: 03 Oct 2025

https://github.com/apache/amoro

Apache Amoro (incubating) is a Lakehouse management system built on open data lake formats.

bigdata datalake lakehouse

Last synced: 14 May 2025

https://github.com/Datavault-UK/automate-dv

A free-to-use dbt package for creating and loading Data Vault 2.0 compliant data warehouses (powered by dbt, an open source data engineering tool and registered trademark of dbt Labs).

data-vault dataengineering datalake datavault datavault20 datawarehouse datawarehousing dbt elt etl metadata snowflake sql

Last synced: 13 May 2025

https://github.com/linkedin/openhouse

Open Control Plane for Tables in Data Lakehouse

big-data catalog datalake datalakehouse declarative iceberg management tables

Last synced: 17 Aug 2025

https://github.com/gigapi/gigapi

GigAPI is a Timeseries lakehouse for real-time data and sub-second queries, powered by DuckDB OLAP + Parquet Query Engine, Compactor w/ Cloud-Native Storage. Drop-in FDAP alternative ⭐

api clickhouse-server data-lake database datalake duckdb duckdb-api duckdb-server ducklake fdap gigapipe golang lakehouse olap parquet qryn query-engine rest-api s3 sql

Last synced: 05 Oct 2025

https://github.com/awslabs/visual-asset-management-system

Visual Asset Management System (VAMS) is a purpose-built, AWS native solution for the management and distribution of traditional to specialized visual assets used in physical AI and spatial computing.

2d 3d datalake digital-asset-management extended-reality metadata physical-ai pipelines spatial-computing spatial-data

Last synced: 16 Jan 2026

https://github.com/WeBankFinTech/Streamis

A streaming application development and management system based on Linkis and DSS, with plans to provide workflow-like graphical drag-and-drop development.

datalake dataspherestudio deltalake flink hudi iceberg kafka linkis streaming streamis warehouse wedatasphere

Last synced: 15 Jul 2025

https://github.com/buoyant-data/oxbow

Collection of AWS Lambdas for creating and managing Delta tables

datalake deltalake lambda parquet rust

Last synced: 07 Apr 2026

https://github.com/learningjournal/spark-streaming-in-scala

Apache Spark 3 - Structured Streaming Course Material

apache-spark big-data bigdata datalake scala spark spark-sql spark-streaming

Last synced: 16 May 2025

https://github.com/ExpediaGroup/apiary

Apiary provides modules which can be combined to create a federated cloud data lake

aws datalake hive hive-metastore

Last synced: 13 May 2025

https://github.com/aws-solutions-library-samples/aws-insurancelake-etl

This solution helps you deploy ETL processes and data storage resources to create an Insurance Lake using Amazon S3 buckets for storage, AWS Glue for data transformation, and AWS CDK Pipelines. It is originally based on the AWS blog "Deploy data lake ETL jobs using CDK Pipelines", and complements the InsuranceLake Infrastructure project.

aws cdk datalake glue insurance

Last synced: 27 Jan 2026

https://github.com/absaoss/enceladus

Dynamic Conformance Engine

bigdata datalake hadoop mongodb scala spark spring

Last synced: 20 Aug 2025

https://github.com/abdullahkhawer/aws-auto-terminate-idle-emr

An AWS-based solution that uses AWS CloudWatch and a Python-based AWS Lambda function to automatically terminate AWS EMR clusters that have been idle for a specified period of time.

amazon-web-services automation aws aws-cloudformation aws-cloudwatch aws-emr aws-lambda bigdata boto3 cft cloudformation cloudwatch datalake emr etl idle python python-3-7 serverless terminate

Last synced: 02 Jul 2025

https://github.com/polardb/duckdb-paimon

DuckDB extension for accessing Apache Paimon. 🦆

datalake duckdb paimon

Last synced: 19 Apr 2026

https://github.com/imsanjoykb/etl-project

The goal of this project is to illustrate Extract Transform Load (ETL) using Python and SQL. ETL is a process commonly done in computing, which takes raw data, cleans it and stores it for later use. The extraction phase targets and retrieves the data. Transform manipulates and cleans the data. Then load stores the data, typically in a data warehouse.

data-engineering database datalake datawarehouse etl etl-automation etl-pipeline etl-solutions

Last synced: 18 Aug 2025

https://github.com/prestodb/prestorials

Tutorials and examples of how to deploy Presto and connect it to different data sources

aws awsglue data datalake docker example glue lakehouse mongodb presto presto-connector prestodb prestosql sql tutorial walkthrough

Last synced: 13 Mar 2026

https://github.com/lynnlangit/serverless-architecture

Companion to my LinkedIn Learning 'Serverless Architecture' course

aws-lambda azure-functions datalake gcp-cloud-functions serverless serverless-architectures

Last synced: 16 Jan 2026

https://github.com/AWS-Big-Data-Projects/AWS-Data-Lake

AWS Lake Formation makes it easy to set up, secure, and manage your data lakes. It also supports data discovery using the metadata search capabilities of Lake Formation in the console, with metadata search results restricted by column permissions.

aws-s3 datalake

Last synced: 20 Jul 2025

https://github.com/aws-solutions-library-samples/aws-insurancelake-infrastructure

This solution helps you deploy ETL processes and data storage resources to create an Insurance Lake using Amazon S3 buckets for storage, AWS Glue for data transformation, and AWS CDK Pipelines. It is originally based on the AWS blog "Deploy data lake ETL jobs using CDK Pipelines", and complements the InsuranceLake ETL with CDK Pipelines project.

aws cdk datalake insurance

Last synced: 14 Oct 2025

https://github.com/ismailsimsek/iceberg-examples

Apache Iceberg examples with Spark and S3

datalake iceberg s3 sql sql-merge

Last synced: 04 May 2025

https://github.com/stonezhong/DataManager

Better organize data in a data lake and build ETL pipelines with a web UI tool.

datalake datawarehouse etl spark sparksql

Last synced: 20 Jul 2025

https://github.com/scrogson/duckpond-rs

Rust implementation of the DuckLake lakehouse format

datalake ducklake mysql postgres rust sqlite

Last synced: 05 May 2026

https://github.com/prefeitura-rio/pipelines_rj_sms

Data pipelines of the Municipal Health Secretariat (Secretaria Municipal de Saúde)

datalake pipelines-as-code prefect reporting

Last synced: 17 Jan 2026

https://github.com/calvinhartwell/getting-started-with-kylo

An introduction to using Kylo, an open source data lake builder from Teradata

apache-nifi datalake gitbook hadoop hdp kylo nifi spark teradata thinkbig thinkbiganalytics

Last synced: 11 Jun 2025

https://github.com/mimetis/projecty

Project Y is a straightforward automated deployment tool for Landing Zones dedicated to data processing.

azure azuredatabricks azuredatafactory azurekeyvault azurelandingzone databricks datalake synapse

Last synced: 12 Apr 2025

https://github.com/kassette-ai/kassette-server

Secured pipelines for your reporting and auditing data

audit datalake etl kassette powerbi reporting servicenow warehouse workflow

Last synced: 15 Jan 2026

https://github.com/kleinyuan/llama2-csv-webapp

Self-hosted or locally hosted Llama 2 based web app for chatting with multiple CSV files

chatgpt csv datalake large-language-models llama2 llm meta openai pandas pandas-ai pandasai streamlit

Last synced: 12 Apr 2025

https://github.com/sidequery/dlt-iceberg

An Iceberg destination for DLT that supports REST catalogs

apache-iceberg data-engineering datalake dlt dlthub etl iceberg

Last synced: 09 Feb 2026

https://github.com/kimtth/pyspark-tika-text-extraction

🚴‍♂️⛷ Data lake: performance tuning for text extraction from a huge number of files.

apache-spark apache-tika data-pipeline datalake multithreading pyspark spark tika-python

Last synced: 17 Jul 2025

https://github.com/macieklesiczka/azof

Lakehouse with time travel

datafusion datalake lakehouse parquet rust-lang

Last synced: 02 Mar 2026

https://github.com/lynnlangit/learning-nosql

Companion repository to LinkedIn Learning course 'Cloud NoSQL for SQL Pros'

aws-dynamodb data datalake dynamodb gcp-bigtable nosql vector-database

Last synced: 15 Jan 2026

https://github.com/openaleph/ftm-lakehouse

Data standard and archive storage for structured FollowTheMoney data, leaked data, private and public document collections.

aleph archive datalake deltalake followthemoney lakehouse openaleph opensanctions

Last synced: 02 Feb 2026

https://github.com/ac-gomes/data_engineer_with_airflow

This project is an adaptation based on a real test for a Junior Data Engineer position.

airflow aws-s3 azure-storage datalake datalake-ingestion json-api postgres python3

Last synced: 17 May 2026

https://github.com/divithraju/divith-raju-immigration-data-engineering

A Capstone Project that covers several aspects of Data Engineering (Data Exploration, Cleaning, Modeling, Pipelining, Processing)

apachespark bigdata bigdataprocessing bigdataproject capstone-project datacleaning dataengineering datalake datamodeling datapipeline dataprocessing dataschema dataset datawherehouse pandas sql

Last synced: 29 Apr 2026

https://github.com/erwan-simon/aws-data-platform-framework

A unified framework to industrialize data ingestion, transformation and pipeline execution on AWS using Terraform, from infrastructure provisioning to runtime execution, designed as a reusable and standalone data platform.

aws data data-framework datalake docker iceberg python spark step-functions terraform terraform-module

Last synced: 23 May 2026

https://github.com/neuro-ml/tarn

An insanely customizable framework for key-value storage 💾

cache datalake memoization persistent python storage

Last synced: 23 Apr 2025

https://github.com/amosproj/amos2024ss04-building-information-enhancer

Building Information System for potential energy savings

datalake energy energy-consumption

Last synced: 04 Apr 2026

https://github.com/murilobellatini/ifood-data-architect-test

My solution to the iFood Data Architect Test using PySpark, Jupyter and Docker in order to create a local prototype data lake.

datalake datamart docker docker-compose pyspark python storage

Last synced: 07 Jan 2026

https://github.com/agnosticeng/cli

Agnostic magic is now at your fingertips.

cli clickhouse data datalake datalakehouse

Last synced: 03 Mar 2026

https://github.com/postpayio/ness

A Python datalake client.

datalake pandas s3

Last synced: 07 May 2026

https://github.com/simonjang/s3-query-json

Query JSON documents on S3 with SQL

datalake nodejs s3

Last synced: 02 May 2026

https://github.com/leonardodrigo/breweries-data-lake

This project builds an Azure Data Lake using the Medallion architecture to process data with Spark from the Open Breweries DB API.

airflow azure brewerydb datalake docker docker-compose pyspark

Last synced: 19 Jan 2026

https://github.com/macieklesiczka/bazof

Lakehouse with time travel

datafusion datalake lakehouse parquet rust-lang

Last synced: 22 Mar 2025

https://github.com/hussein-awala/gdpr-compliant-lakehouse

This repository is a demonstration of how to handle GDPR export and delete requests in an Iceberg Lakehouse to make it GDPR-compliant.

apache-iceberg apache-spark datalake gdpr lakehouse

Last synced: 18 May 2026

https://github.com/richclement/aws-data-lake-sdk

An SDK for the AWS data lake.

aws datalake sdk

Last synced: 10 May 2025

https://github.com/thunchanokbow/audiblebook-revenue

Manages big data in the cloud to find a list of best-selling Audible books, generate reports and dashboards, and provide products and sales promotions that meet the needs of consumers in Thailand.

apache-airflow bigquery cloudcomposer data-visualization datalake datawarehouse googlecloudstorage lookerstudio pandas python3

Last synced: 11 Apr 2026

https://github.com/hoaihuongbk/lakeops

A modern data lake operations toolkit working with multiple table formats (Delta, Iceberg, Parquet) and engines (Spark, Polars) via the same APIs.

data data-operations dataengineering datalake

Last synced: 07 Mar 2026

https://github.com/chandima2000/adventure-works-sales-data-engineering-project

The aim of this project is to build an end-to-end data engineering project using Microsoft Azure

adf azure data-engineering databricks datalake etl-pipeline

Last synced: 30 Apr 2026

https://github.com/jblukach/parquet2csv

Convert from CSV to Parquet and back again!

athena csv datalake parquet rust s3 utility

Last synced: 26 Mar 2025

https://github.com/matz1979/spark-etl-pipelines

My final big data project, built with Spark

bigdata datalake etl-pipeline python spark

Last synced: 08 May 2026

https://github.com/felipelaptrin/data-lake

This project is a simple proof of concept to implement a data lake using AWS cloud.

aws datalake githubactions terraform

Last synced: 09 May 2026

https://github.com/jayhan94/minilake

A modern mini lakehouse based on Spark and Delta, running in Docker.

analytics datalake deltalake lakehouse spark

Last synced: 14 Mar 2025

https://github.com/slowlatency/de-apple-data-analysis

A Data Pipeline solution using Databricks and Apache Spark to process and analyze Apple data.

datalake python spark sql

Last synced: 13 May 2026

https://github.com/trannhatnguyen2/bi_cloud_kientap

Building a Business Intelligence Solution on the Microsoft Azure Cloud Platform with Dynamic ELT Integration

azure datalake datawarehouse powerbi

Last synced: 29 Aug 2025

https://github.com/trannhatnguyen2/bi_datalake_azure

Building a Data Lake on the Microsoft Azure Cloud Platform

azure databricks datalake powerbi sql-server

Last synced: 22 Apr 2026

https://github.com/malondaclement/datalake

DataLake project 💾

datalake mysql python3

Last synced: 16 May 2026

https://github.com/senaldolage/wa-road-insights-pipeline

End-to-end Azure data pipeline project analyzing Western Australia transport datasets with dashboards built in Tableau. Featuring Data Factory, Databricks, Synapse, and Data Lake Gen2.

datalake pyspark synapse-analytics tableau

Last synced: 27 Jun 2025