Projects in Awesome Lists tagged with datalake
A curated list of projects in awesome lists tagged with datalake.
https://github.com/sinaptik-ai/pandas-ai
Chat with your database or your datalake (SQL, CSV, parquet). PandasAI makes data analysis conversational using LLMs and RAG.
ai csv data data-analysis data-science data-visualization database datalake gpt-4 llm pandas sql text-to-sql
Last synced: 15 Jan 2026
https://github.com/Sinaptik-AI/pandas-ai
Chat with your database or your datalake (SQL, CSV, parquet). PandasAI makes data analysis conversational using LLMs and RAG.
ai csv data data-analysis data-science data-visualization database datalake gpt-4 llm pandas sql text-to-sql
Last synced: 25 Mar 2025
https://github.com/trinodb/trino
Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
analytics big-data data-science database databases datalake delta-lake distributed-database distributed-systems hadoop hive iceberg java jdbc presto prestodb query-engine sql trino
Last synced: 02 Apr 2026
https://github.com/starrocks/starrocks
The world's fastest open query engine for sub-second analytics both on and off the data lakehouse. With the flexibility to support nearly any scenario, StarRocks provides best-in-class performance for multi-dimensional analytics, real-time analytics, and ad-hoc queries. A Linux Foundation project.
analytics big-data cloudnative database datalake delta-lake distributed-database hudi iceberg join lakehouse lakehouse-platform mpp olap real-time-analytics real-time-updates realtime-database sql star-schema vectorized
Last synced: 16 Feb 2026
https://github.com/activeloopai/deeplake
Deeplake is an AI Data Runtime for Agents. It provides serverless Postgres with a multimodal datalake, enabling scalable retrieval and training.
agent agentic-rag ai clawbot computer-vision datalake deep-learning filesystem large-language-models llm memory mlops multimodal openclaw postgres pytorch rag skill vector-database
Last synced: 11 Jun 2026
https://github.com/StarRocks/starrocks
The world's fastest open query engine for sub-second analytics both on and off the data lakehouse. With the flexibility to support nearly any scenario, StarRocks provides best-in-class performance for multi-dimensional analytics, real-time analytics, and ad-hoc queries. A Linux Foundation project.
analytics big-data cloudnative database datalake delta-lake distributed-database hudi iceberg join lakehouse lakehouse-platform mpp olap real-time-analytics real-time-updates realtime-database sql star-schema vectorized
Last synced: 14 Mar 2025
https://github.com/apache/hudi
Upserts, Deletes And Incremental Processing on Big Data.
apacheflink apachehudi apachespark bigdata data-integration datalake hudi incremental-processing stream-processing
Last synced: 12 May 2025
https://github.com/treeverse/lakefs
lakeFS - Data version control for your data lake | Git for data
apache-spark apache-sparksql aws-s3 azure-blob-storage azure-storage data-engineering data-lake data-quality data-version-control data-versioning datalake datalakes git-for-data go golang google-cloud-storage hadoop-filesystem lakefs object-storage
Last synced: 18 Feb 2026
https://github.com/treeverse/lakeFS
lakeFS - Data version control for your data lake | Git for data
apache-spark apache-sparksql aws-s3 azure-blob-storage azure-storage data-engineering data-lake data-quality data-version-control data-versioning datalake datalakes git-for-data go golang google-cloud-storage hadoop-filesystem lakefs object-storage
Last synced: 20 Mar 2025
https://github.com/DataLinkDC/dinky
Dinky is a real-time data development platform based on Apache Flink, enabling agile data development, deployment and operation.
datalake datawarehouse flink flinkcdc flinksql olap real-time-computing-platform sql
Last synced: 27 Mar 2025
https://github.com/lakesoul-io/lakesoul
LakeSoul is an end-to-end, realtime and cloud native Lakehouse framework with fast data ingestion, concurrent update and incremental data analytics on cloud storages for both BI and AI applications.
arrow big-data datafusion datalake flink huggingface lakehouse lakesoul postgresql python pytorch rust spark sql streaming vectorized velox
Last synced: 14 May 2025
https://github.com/lakesoul-io/LakeSoul
LakeSoul is an end-to-end, realtime and cloud native Lakehouse framework with fast data ingestion, concurrent update and incremental data analytics on cloud storages for both BI and AI applications.
arrow big-data datafusion datalake flink huggingface lakehouse lakesoul postgresql python pytorch rust spark sql streaming vectorized velox
Last synced: 27 Mar 2025
https://github.com/leo-project/leofs
The LeoFS Storage System
datalake distributed-file-system distributed-storage erlang leofs nfs nfs-server s3 s3-storage
Last synced: 08 Apr 2025
https://github.com/apache/gravitino
World's most powerful open data catalog for building a high-performance, geo-distributed and federated metadata lake.
ai-catalog data-catalog datalake federated-query lakehouse metadata metalake model-catalog opendatacatalog skycomputing stratosphere
Last synced: 13 May 2025
https://github.com/zinggAI/zingg
Scalable identity resolution, entity resolution, data mastering and deduplication using ML
analytics cdp customer-data-platform data-science databricks dataengineering datalake dataquality dedupe deduplication entity-resolution fuzzy-matching fuzzymatch identity-resolution master-data-management masterdata mdm ml snowflake spark
Last synced: 16 Nov 2025
https://github.com/apache/Gravitino
World's most powerful open data catalog for building a high-performance, geo-distributed and federated metadata lake.
ai-catalog data-catalog datalake federated-query lakehouse metadata metalake model-catalog opendatacatalog skycomputing stratosphere
Last synced: 03 Oct 2025
https://github.com/zinggai/zingg
Scalable identity resolution, entity resolution, data mastering and deduplication using ML
analytics analytics-engineering data-science data-transformation data-transformations dataengineering datalake dataquality dedupe deduplication entity-resolution etl fuzzy-matching fuzzymatch identity identity-resolution masterdata ml modern-data-stack spark
Last synced: 14 May 2025
https://github.com/apache/amoro
Apache Amoro (incubating) is a Lakehouse management system built on open data lake formats.
Last synced: 14 May 2025
https://github.com/leesf/hudi-resources
A collection of Apache Hudi resources
apache apachehudi bigdata data-integration datalake hudi hudi-resources incremental-processing stream-processing
Last synced: 27 Mar 2025
https://github.com/paradedb/pg_analytics
DuckDB-powered data lake analytics from Postgres
analytics arrow big-data columnar database datafusion datalake deltalake duckdb iceberg lakehouse lakehouse-platform object-storage olap paradedb parquet postgres postgresql realtime-analytics sql
Last synced: 24 Mar 2025
https://github.com/Datavault-UK/automate-dv
A free to use dbt package for creating and loading Data Vault 2.0 compliant Data Warehouses (powered by dbt, an open source data engineering tool, registered trademark of dbt Labs)
data-vault dataengineering datalake datavault datavault20 datawarehouse datawarehousing dbt elt etl metadata snowflake sql
Last synced: 13 May 2025
https://github.com/linkedin/openhouse
Open Control Plane for Tables in Data Lakehouse
big-data catalog datalake datalakehouse declarative iceberg management tables
Last synced: 17 Aug 2025
https://github.com/gigapi/gigapi
GigAPI is a Timeseries lakehouse for real-time data and sub-second queries, powered by DuckDB OLAP + Parquet Query Engine, Compactor w/ Cloud-Native Storage. Drop-in FDAP alternative ⭐
api clickhouse-server data-lake database datalake duckdb duckdb-api duckdb-server ducklake fdap gigapipe golang lakehouse olap parquet qryn query-engine rest-api s3 sql
Last synced: 05 Oct 2025
https://github.com/cuebook/cuelake
Use SQL to build ELT pipelines on a data lakehouse.
apache-iceberg apache-spark data-engineering data-ingestion data-integration data-lake data-pipeline data-transfer datalake delta elt etl incremental-updates lakehouse pipelines spark-sql sql upsert zeppelin-notebook
Last synced: 07 Apr 2025
https://github.com/awslabs/visual-asset-management-system
Visual Asset Management System (VAMS) is a purpose-built, AWS native solution for the management and distribution of traditional to specialized visual assets used in physical AI and spatial computing.
2d 3d datalake digital-asset-management extended-reality metadata physical-ai pipelines spatial-computing spatial-data
Last synced: 16 Jan 2026
https://github.com/izhangzhihao/real-time-data-warehouse
Real-time Data Warehouse with Apache Flink & Apache Kafka & Apache Hudi
cdc change-data-capture data-warehouse data-warehousing datalake debezium delta delta-lake deltalake elasticsearch flink flink-sql hoodie hudi iceberg kafka real-time-data-warehouse spark spark-sql sql
Last synced: 07 Sep 2025
https://github.com/datawithbaraa/sql-data-warehouse-project
A comprehensive guide to building a modern data warehouse with SQL Server, including ETL processes, data modeling, and analytics.
data-analysis data-analytics data-cleaning data-engineering data-lakehouse data-science data-warehouse data-warehousing datalake datascience datawarehouse datawarehousing etl etl-job etl-pipeline medallion-architecture sql sql-query sql-server sqlserver
Last synced: 06 Apr 2025
https://github.com/WeBankFinTech/Streamis
A streaming application development and management system based on Linkis and DSS, with plans to provide workflow-like graphical drag-and-drop development.
datalake dataspherestudio deltalake flink hudi iceberg kafka linkis streaming streamis warehouse wedatasphere
Last synced: 15 Jul 2025
https://github.com/apache/doris-website
Apache Doris Website
analytics apache big-data data-warehousing database datalake dbms distributed-system doris hadoop hive hudi iceberg mpp olap ssb tpch vectorized
Last synced: 15 May 2025
https://github.com/neuralinkcorp/datarepo
data-warehouse datalake datawarehouse delta-lake
Last synced: 17 Aug 2025
https://github.com/learningjournal/sparkprogramminginscala
Apache Spark Course Material
apache-spark big-data bigdata data-lake datalake scala spark spark-scala spark-sql
Last synced: 17 Mar 2025
https://github.com/learningjournal/spark-streaming-in-scala
Apache Spark 3 - Structured Streaming Course Material
apache-spark big-data bigdata datalake scala spark spark-sql spark-streaming
Last synced: 16 May 2025
https://github.com/paloaltonetworks/pan-cortex-data-lake-python
Python idiomatic SDK for Cortex™ Data Lake.
api applicationframework cortex data datalake directory directory-sync directory-sync-service event event-service logging logging-service paloalto paloaltonetworks pan pancloud panw python rest-api sdk
Last synced: 05 May 2025
https://github.com/apache/doris-thirdparty
Self-managed thirdparty dependencies for Apache Doris
analytics big-data data-warehousing database datalake dbms distributed-database hadoop hive hudi iceberg mpp olap real-time sql ssb tpch vectorized
Last synced: 18 Jul 2025
https://github.com/ExpediaGroup/apiary
Apiary provides modules which can be combined to create a federated cloud data lake
aws datalake hive hive-metastore
Last synced: 13 May 2025
https://github.com/aws-solutions-library-samples/aws-insurancelake-etl
This solution helps you deploy ETL processes and data storage resources to create an Insurance Lake using Amazon S3 buckets for storage, AWS Glue for data transformation, and AWS CDK Pipelines. It is originally based on the AWS blog "Deploy data lake ETL jobs using CDK Pipelines", and complements the InsuranceLake Infrastructure project.
aws cdk datalake glue insurance
Last synced: 27 Jan 2026
https://github.com/abdullahkhawer/aws-auto-terminate-idle-emr
An AWS-based solution that uses AWS CloudWatch and a Python-based AWS Lambda function to automatically terminate AWS EMR clusters that have been idle for a specified period of time.
amazon-web-services automation aws aws-cloudformation aws-cloudwatch aws-emr aws-lambda bigdata boto3 cft cloudformation cloudwatch datalake emr etl idle python python-3-7 serverless terminate
Last synced: 02 Jul 2025
https://github.com/polardb/duckdb-paimon
DuckDB extension for accessing Apache Paimon. 🦆
Last synced: 19 Apr 2026
https://github.com/imsanjoykb/etl-project
The goal of this project is to illustrate Extract Transform Load (ETL) using Python and SQL. ETL is a process commonly done in computing, which takes raw data, cleans it and stores it for later use. The extraction phase targets and retrieves the data. Transform manipulates and cleans the data. Then load stores the data, typically in a data warehouse.
data-engineering database datalake datawarehouse etl etl-automation etl-pipeline etl-solutions
Last synced: 18 Aug 2025
https://github.com/lynnlangit/serverless-architecture
Companion to my LinkedIn Learning 'Serverless Architecture' course
aws-lambda azure-functions datalake gcp-cloud-functions serverless serverless-architectures
Last synced: 16 Jan 2026
https://github.com/dbsystel/datalake-graphql-wrapper
The DataLake GraphQL Wrapper provides a GraphQL API for presto/trino.
boilerplate cli datalake generator graphql pothos presto prestodb prestosql template trino trinodb typescript wrapper-api yoga-graphql
Last synced: 10 Jul 2025
https://github.com/AWS-Big-Data-Projects/AWS-Data-Lake
AWS Lake Formation makes it easy to set up, secure, and manage your data lakes. It also supports data discovery through the metadata search capabilities of the Lake Formation console, with metadata search results restricted by column permissions.
Last synced: 20 Jul 2025
https://github.com/aws-solutions-library-samples/aws-insurancelake-infrastructure
This solution helps you deploy ETL processes and data storage resources to create an Insurance Lake using Amazon S3 buckets for storage, AWS Glue for data transformation, and AWS CDK Pipelines. It is originally based on the AWS blog "Deploy data lake ETL jobs using CDK Pipelines", and complements the InsuranceLake ETL with CDK Pipelines project.
Last synced: 14 Oct 2025
https://github.com/openedi/open-data-access-tools
OEDI Data Lake Access
aws datalake nrel oedi open-energy renewable-energy
Last synced: 22 Jul 2025
https://github.com/openEDI/open-data-access-tools
OEDI Data Lake Access
aws datalake nrel oedi open-energy renewable-energy
Last synced: 07 May 2025
https://github.com/stonezhong/DataManager
Better organize data in data lake and build ETL pipeline with Web UI tool.
datalake datawarehouse etl spark sparksql
Last synced: 20 Jul 2025
https://github.com/prefeitura-rio/pipelines_rj_sms
Data pipelines of the Municipal Health Department (Secretaria Municipal de Saúde)
datalake pipelines-as-code prefect reporting
Last synced: 17 Jan 2026
https://github.com/vre-hub/vre
VRE infrastructure running at CERN
data-analysis datalake flux helm-charts high-energy-physics jupyterhub jupyterlab k8s openstack platform reana rucio
Last synced: 18 Jan 2026
https://github.com/gigapi/gigapi-querier
DuckDB Query Engine for GigAPI
arrow-flight datalake duckdb duckdb-server flightsql gigapipe influxdb3 lakehouse lakehouse-engine parquet
Last synced: 07 Oct 2025
https://github.com/calvinhartwell/getting-started-with-kylo
An introduction to using Kylo, an open source data lake builder from Teradata
apache-nifi datalake gitbook hadoop hdp kylo nifi spark teradata thinkbig thinkbiganalytics
Last synced: 11 Jun 2025
https://github.com/mimetis/projecty
Project Y is a straightforward Landing Zones automated deployment tool dedicated to data processing.
azure azuredatabricks azuredatafactory azurekeyvault azurelandingzone databricks datalake synapse
Last synced: 12 Apr 2025
https://github.com/kassette-ai/kassette-server
Secured pipelines for your reporting and auditing data
audit datalake etl kassette powerbi reporting servicenow warehouse workflow
Last synced: 15 Jan 2026
https://github.com/sidequery/dlt-iceberg
An Iceberg destination for DLT that supports REST catalogs
apache-iceberg data-engineering datalake dlt dlthub etl iceberg
Last synced: 09 Feb 2026
https://github.com/kimtth/pyspark-tika-text-extraction
🚴‍♂️⛷ Data Lake: performance tuning for text extraction from a huge number of files.
apache-spark apache-tika data-pipeline datalake multithreading pyspark spark tika-python
Last synced: 17 Jul 2025
https://github.com/tuanai-vireox/dataplatform-stack
How to build a complete data platform.
airflow cdc data data-warehouse datalake dataplatform dbt flink k8s kafka spark-streaming
Last synced: 22 Aug 2025
https://github.com/aessing/demo-mdwh
Modern Data Warehouse demos with Azure Databricks, Azure Data Factory & Azure Dedicated SQL pool (formerly SQL DW)
azure azure-data-factory azure-databricks data data-engineering data-science databricks databricks-notebooks datafactory datalake datawarehouse datawarehousing delta-lake demos etl machine-learning mdwh ml modern-data-warehouse spark
Last synced: 26 Jun 2025
https://github.com/macieklesiczka/azof
Lakehouse with time travel
datafusion datalake lakehouse parquet rust-lang
Last synced: 02 Mar 2026
https://github.com/lynnlangit/learning-nosql
Companion repository to the LinkedIn Learning course 'Cloud NoSQL for SQL Pros'
aws-dynamodb data datalake dynamodb gcp-bigtable nosql vector-database
Last synced: 15 Jan 2026
https://github.com/openaleph/ftm-lakehouse
Data standard and archive storage for structured FollowTheMoney data, leaked data, private and public document collections.
aleph archive datalake deltalake followthemoney lakehouse openaleph opensanctions
Last synced: 02 Feb 2026
https://github.com/ac-gomes/data_engineer_with_airflow
This project is an adaptation based on a real test for a Junior Data Engineer position.
airflow aws-s3 azure-storage datalake datalake-ingestion json-api postgres python3
Last synced: 17 May 2026
https://github.com/jhole89/serverless-data-pipelines-demo
aws aws-glue aws-iam aws-lambda big-data datalake serverless terraform
Last synced: 12 Jan 2026
https://github.com/divithraju/divith-raju-immigration-data-engineering
A Capstone Project that covers several aspects of Data Engineering (Data Exploration, Cleaning, Modeling, Pipelining, Processing)
apachespark bigdata bigdataprocessing bigdataproject capstone-project datacleaning dataengineering datalake datamodeling datapipeline dataprocessing dataschema dataset datawherehouse pandas sql
Last synced: 29 Apr 2026
https://github.com/erwan-simon/aws-data-platform-framework
A unified framework to industrialize data ingestion, transformation and pipeline execution on AWS using Terraform, from infrastructure provisioning to runtime execution, designed as a reusable and standalone data platform.
aws data data-framework datalake docker iceberg python spark step-functions terraform terraform-module
Last synced: 23 May 2026
https://github.com/neuro-ml/tarn
An insanely customizable framework for key-value storage 💾
cache datalake memoization persistent python storage
Last synced: 23 Apr 2025
https://github.com/amosproj/amos2024ss04-building-information-enhancer
Building Information System for potential energy savings
datalake energy energy-consumption
Last synced: 04 Apr 2026
https://github.com/johnmata0427/data-lake-case-studies
Case studies with Data Lake
azure data-science datalake jupyter-notebook nosql powerbi sql
Last synced: 24 Apr 2026
https://github.com/murilobellatini/ifood-data-architect-test
My solution to the iFood Data Architect Test using PySpark, Jupyter and Docker in order to create a local prototype data lake.
datalake datamart docker docker-compose pyspark python storage
Last synced: 07 Jan 2026
https://github.com/omr5221/esbi_stream
Application to ingest data into DB from API
api api-client cli datalake docker docker-compose exe keyring logging multiprocessing multithreading pyinstaller python3 sqlalchemy
Last synced: 09 May 2026
https://github.com/riju18/apache-iceberg-kickstart
apache-iceberg datalake datalakehouse docker dremio minio nessie pysaprk python3 s3 sql zeppelin
Last synced: 27 Apr 2026
https://github.com/agnosticeng/cli
Agnostic magic is now at your fingertips.
cli clickhouse data datalake datalakehouse
Last synced: 03 Mar 2026
https://github.com/phelipe-sempreboni/informations
Repository for tutorials, information and notes on technology in general.
amazon-web-services datahub datalake datamart datawarehouse datawarehousing etl modelagem-de-dados olap oltp oracle-database pl-sql pl-sql-script powerbi-desktop powerbi-service rds-database sql sqlserver
Last synced: 19 Apr 2026
https://github.com/simonjang/s3-query-json
Query JSON documents on S3 with SQL
Last synced: 02 May 2026
https://github.com/leonardodrigo/breweries-data-lake
This project builds an Azure Data Lake using the Medallion architecture to process data with Spark from the Open Breweries DB API.
airflow azure brewerydb datalake docker docker-compose pyspark
Last synced: 19 Jan 2026
https://github.com/macieklesiczka/bazof
Lakehouse with time travel
datafusion datalake lakehouse parquet rust-lang
Last synced: 22 Mar 2025
https://github.com/hussein-awala/gdpr-compliant-lakehouse
This repository is a demonstration of how to handle GDPR export and delete requests in an Iceberg Lakehouse to make it GDPR-compliant.
apache-iceberg apache-spark datalake gdpr lakehouse
Last synced: 18 May 2026
https://github.com/richclement/aws-data-lake-sdk
An SDK for the AWS data lake.
Last synced: 10 May 2025
https://github.com/thunchanokbow/audiblebook-revenue
Manage big data on cloud computing to find a list of best-selling audible books, generate reports and dashboards, and provide products and sales promotions that meet the needs of consumers in Thailand
apache-airflow bigquery cloudcomposer data-visualization datalake datawarehouse googlecloudstorage lookerstudio pandas python3
Last synced: 11 Apr 2026
https://github.com/JohnMata0427/Data-Lake-Case-Studies
Case studies with Data Lake
azure data-science datalake jupyter-notebook nosql powerbi sql
Last synced: 22 Sep 2025
https://github.com/hoaihuongbk/lakeops
A modern data lake operations toolkit working with multiple table formats (Delta, Iceberg, Parquet) and engines (Spark, Polars) via the same APIs.
data data-operations dataengineering datalake
Last synced: 07 Mar 2026
https://github.com/chandima2000/adventure-works-sales-data-engineering-project
The aim of this project is to build an end-to-end data engineering project using Microsoft Azure
adf azure data-engineering databricks datalake etl-pipeline
Last synced: 30 Apr 2026
https://github.com/carolinerocks/azure-data-engineering-end-to-end-project
azure databricks datafactory datalake powerbi python sql synapse
Last synced: 07 May 2026
https://github.com/stefen-taime/azurepipeline
Azure Data Pipeline
azure databricks datalake http terraform vault
Last synced: 08 May 2026
https://github.com/matz1979/spark-etl-pipelines
My final big data project, built with Spark
bigdata datalake etl-pipeline python spark
Last synced: 08 May 2026
https://github.com/felipelaptrin/data-lake
This project is a simple proof of concept to implement a data lake using AWS cloud.
aws datalake githubactions terraform
Last synced: 09 May 2026
https://github.com/slowlatency/de-apple-data-analysis
A Data Pipeline solution using Databricks and Apache Spark to process and analyze Apple data.
Last synced: 13 May 2026
https://github.com/trannhatnguyen2/bi_cloud_kientap
Building a Business Intelligence Solution on the Microsoft Azure Cloud Platform with Dynamic ELT Integration
azure datalake datawarehouse powerbi
Last synced: 29 Aug 2025
https://github.com/trannhatnguyen2/bi_datalake_azure
Building Data Lake on the Microsoft Azure Cloud Platform
azure databricks datalake powerbi sql-server
Last synced: 22 Apr 2026
https://github.com/k178412/sql-data-warehouse-project
A hands-on data warehouse project using SQL Server, covering ETL processes and data modeling.
bronze-layer data-analysis data-analytics data-cleaning data-engineering data-warehouse database datalake dataset datawarehouse etl etl-pipeline etl-process gold-layer silver-layer sql sql-query sql-server sqlserver
Last synced: 25 Apr 2026
https://github.com/senaldolage/wa-road-insights-pipeline
End-to-end Azure data pipeline project analyzing Western Australia transport datasets with dashboards built in Tableau. Featuring Data Factory, Databricks, Synapse, and Data Lake Gen2.
datalake pyspark synapse-analytics tableau
Last synced: 27 Jun 2025