Projects in Awesome Lists tagged with datalake

https://github.com/sinaptik-ai/pandas-ai

Chat with your database or your datalake (SQL, CSV, parquet). PandasAI makes data analysis conversational using LLMs and RAG.

ai csv data data-analysis data-science data-visualization database datalake gpt-4 llm pandas sql text-to-sql

Last synced: 15 Jan 2026

https://github.com/trinodb/trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)

analytics big-data data-science database databases datalake delta-lake distributed-database distributed-systems hadoop hive iceberg java jdbc presto prestodb query-engine sql trino

Last synced: 02 Apr 2026

https://github.com/starrocks/starrocks

The world's fastest open query engine for sub-second analytics both on and off the data lakehouse. With the flexibility to support nearly any scenario, StarRocks provides best-in-class performance for multi-dimensional analytics, real-time analytics, and ad-hoc queries. A Linux Foundation project.

analytics big-data cloudnative database datalake delta-lake distributed-database hudi iceberg join lakehouse lakehouse-platform mpp olap real-time-analytics real-time-updates realtime-database sql star-schema vectorized

Last synced: 16 Feb 2026

https://github.com/activeloopai/deeplake

Deeplake is an AI data runtime for agents. It provides serverless Postgres with a multimodal datalake, enabling scalable retrieval and training.

agent agentic-rag ai clawbot computer-vision datalake deep-learning filesystem large-language-models llm memory mlops multimodal openclaw postgres pytorch rag skill vector-database

Last synced: 11 Jun 2026

https://github.com/apache/hudi

Upserts, Deletes And Incremental Processing on Big Data.

apacheflink apachehudi apachespark bigdata data-integration datalake hudi incremental-processing stream-processing

Last synced: 12 May 2025

https://github.com/treeverse/lakefs

lakeFS - Data version control for your data lake | Git for data

apache-spark apache-sparksql aws-s3 azure-blob-storage azure-storage data-engineering data-lake data-quality data-version-control data-versioning datalake datalakes git-for-data go golang google-cloud-storage hadoop-filesystem lakefs object-storage

Last synced: 18 Feb 2026

https://github.com/DataLinkDC/dinky

Dinky is a real-time data development platform based on Apache Flink, enabling agile data development, deployment and operation.

datalake datawarehouse flink flinkcdc flinksql olap real-time-computing-platform sql

Last synced: 27 Mar 2025

https://github.com/lakesoul-io/lakesoul

LakeSoul is an end-to-end, realtime and cloud native Lakehouse framework with fast data ingestion, concurrent update and incremental data analytics on cloud storages for both BI and AI applications.

arrow big-data datafusion datalake flink huggingface lakehouse lakesoul postgresql python pytorch rust spark sql streaming vectorized velox

Last synced: 14 May 2025

https://github.com/leo-project/leofs

The LeoFS Storage System

datalake distributed-file-system distributed-storage erlang leofs nfs nfs-server s3 s3-storage

Last synced: 08 Apr 2025

https://github.com/apache/gravitino

World's most powerful open data catalog for building a high-performance, geo-distributed and federated metadata lake.

ai-catalog data-catalog datalake federated-query lakehouse metadata metalake model-catalog opendatacatalog skycomputing stratosphere

Last synced: 13 May 2025

https://github.com/zinggAI/zingg

Scalable identity resolution, entity resolution, data mastering and deduplication using ML

analytics cdp customer-data-platform data-science databricks dataengineering datalake dataquality dedupe deduplication entity-resolution fuzzy-matching fuzzymatch identity-resolution master-data-management masterdata mdm ml snowflake spark

Last synced: 16 Nov 2025

https://github.com/apache/Gravitino

World's most powerful open data catalog for building a high-performance, geo-distributed and federated metadata lake.

ai-catalog data-catalog datalake federated-query lakehouse metadata metalake model-catalog opendatacatalog skycomputing stratosphere

Last synced: 03 Oct 2025

https://github.com/zinggai/zingg

Scalable identity resolution, entity resolution, data mastering and deduplication using ML

analytics analytics-engineering data-science data-transformation data-transformations dataengineering datalake dataquality dedupe deduplication entity-resolution etl fuzzy-matching fuzzymatch identity identity-resolution masterdata ml modern-data-stack spark

Last synced: 14 May 2025

https://github.com/apache/amoro

Apache Amoro (incubating) is a Lakehouse management system built on open data lake formats.

bigdata datalake lakehouse

Last synced: 14 May 2025

https://github.com/clickhouse/clickbench

ClickBench: a Benchmark For Analytical Databases

analytics aws benchmark big-data bigquery chdb clickhouse databases datafusion datalake doris duckdb iceberg lakehouse olap parquet rust snowflake sql

Last synced: 06 Sep 2025

https://github.com/leesf/hudi-resources

A collection of Apache Hudi resources

apache apachehudi bigdata data-integration datalake hudi hudi-resources incremental-processing stream-processing

Last synced: 27 Mar 2025

https://github.com/paradedb/pg_analytics

DuckDB-powered data lake analytics from Postgres

analytics arrow big-data columnar database datafusion datalake deltalake duckdb iceberg lakehouse lakehouse-platform object-storage olap paradedb parquet postgres postgresql realtime-analytics sql

Last synced: 24 Mar 2025

https://github.com/Datavault-UK/automate-dv

A free to use dbt package for creating and loading Data Vault 2.0 compliant Data Warehouses (powered by dbt, an open source data engineering tool, registered trademark of dbt Labs)

data-vault dataengineering datalake datavault datavault20 datawarehouse datawarehousing dbt elt etl metadata snowflake sql

Last synced: 13 May 2025

https://github.com/linkedin/openhouse

Open Control Plane for Tables in Data Lakehouse

big-data catalog datalake datalakehouse declarative iceberg management tables

Last synced: 17 Aug 2025

https://github.com/gigapi/gigapi

GigAPI is a Timeseries lakehouse for real-time data and sub-second queries, powered by DuckDB OLAP + Parquet Query Engine, Compactor w/ Cloud-Native Storage. Drop-in FDAP alternative ⭐

api clickhouse-server data-lake database datalake duckdb duckdb-api duckdb-server ducklake fdap gigapipe golang lakehouse olap parquet qryn query-engine rest-api s3 sql

Last synced: 05 Oct 2025

https://github.com/cuebook/cuelake

Use SQL to build ELT pipelines on a data lakehouse.

apache-iceberg apache-spark data-engineering data-ingestion data-integration data-lake data-pipeline data-transfer datalake delta elt etl incremental-updates lakehouse pipelines spark-sql sql upsert zeppelin-notebook

Last synced: 07 Apr 2025

https://github.com/awslabs/visual-asset-management-system

Visual Asset Management System (VAMS) is a purpose-built, AWS native solution for the management and distribution of traditional to specialized visual assets used in physical AI and spatial computing.

2d 3d datalake digital-asset-management extended-reality metadata physical-ai pipelines spatial-computing spatial-data

Last synced: 16 Jan 2026

https://github.com/izhangzhihao/real-time-data-warehouse

Real-time Data Warehouse with Apache Flink & Apache Kafka & Apache Hudi

cdc change-data-capture data-warehouse data-warehousing datalake debezium delta delta-lake deltalake elasticsearch flink flink-sql hoodie hudi iceberg kafka real-time-data-warehouse spark spark-sql sql

Last synced: 07 Sep 2025

https://github.com/datawithbaraa/sql-data-warehouse-project

A comprehensive guide to building a modern data warehouse with SQL Server, including ETL processes, data modeling, and analytics.

data-analysis data-analytics data-cleaning data-engineering data-lakehouse data-science data-warehouse data-warehousing datalake datascience datawarehouse datawarehousing etl etl-job etl-pipeline medallion-architecture sql sql-query sql-server sqlserver

Last synced: 06 Apr 2025

https://github.com/WeBankFinTech/Streamis

Streaming application development and management system based on Linkis and DSS, with plans to provide workflow-like graphical drag-and-drop development.

datalake dataspherestudio deltalake flink hudi iceberg kafka linkis streaming streamis warehouse wedatasphere

Last synced: 15 Jul 2025

https://github.com/apache/doris-website

Apache Doris Website

analytics apache big-data data-warehousing database datalake dbms distributed-system doris hadoop hive hudi iceberg mpp olap ssb tpch vectorized

Last synced: 15 May 2025

https://github.com/neuralinkcorp/datarepo

data-warehouse datalake datawarehouse delta-lake

Last synced: 17 Aug 2025

https://github.com/learningjournal/sparkprogramminginscala

Apache Spark Course Material

apache-spark big-data bigdata data-lake datalake scala spark spark-scala spark-sql

Last synced: 17 Mar 2025

https://github.com/buoyant-data/oxbow

Collection of AWS Lambdas for creating and managing Delta tables

datalake deltalake lambda parquet rust

Last synced: 07 Apr 2026

https://github.com/learningjournal/spark-streaming-in-scala

Apache Spark 3 - Structured Streaming Course Material

apache-spark big-data bigdata datalake scala spark spark-sql spark-streaming

Last synced: 16 May 2025

https://github.com/paloaltonetworks/pan-cortex-data-lake-python

Python idiomatic SDK for Cortex™ Data Lake.

api applicationframework cortex data datalake directory directory-sync directory-sync-service event event-service logging logging-service paloalto paloaltonetworks pan pancloud panw python rest-api sdk

Last synced: 05 May 2025

https://github.com/apache/doris-thirdparty

Self-managed thirdparty dependencies for Apache Doris

analytics big-data data-warehousing database datalake dbms distributed-database hadoop hive hudi iceberg mpp olap real-time sql ssb tpch vectorized

Last synced: 18 Jul 2025

https://github.com/ExpediaGroup/apiary

Apiary provides modules which can be combined to create a federated cloud data lake

aws datalake hive hive-metastore

Last synced: 13 May 2025

https://github.com/aws-solutions-library-samples/aws-insurancelake-etl

This solution helps you deploy ETL processes and data storage resources to create an Insurance Lake using Amazon S3 buckets for storage, AWS Glue for data transformation, and AWS CDK Pipelines. It is originally based on the AWS blog Deploy data lake ETL jobs using CDK Pipelines, and complements the InsuranceLake Infrastructure project

aws cdk datalake glue insurance

Last synced: 27 Jan 2026

https://github.com/absaoss/enceladus

Dynamic Conformance Engine

bigdata datalake hadoop mongodb scala spark spring

Last synced: 20 Aug 2025

https://github.com/abdullahkhawer/aws-auto-terminate-idle-emr

An AWS-based solution that uses AWS CloudWatch and a Python-based AWS Lambda function to automatically terminate AWS EMR clusters that have been idle for a specified period of time.

amazon-web-services automation aws aws-cloudformation aws-cloudwatch aws-emr aws-lambda bigdata boto3 cft cloudformation cloudwatch datalake emr etl idle python python-3-7 serverless terminate

Last synced: 02 Jul 2025

https://github.com/polardb/duckdb-paimon

DuckDB extension for accessing Apache Paimon. 🦆

datalake duckdb paimon

Last synced: 19 Apr 2026

https://github.com/imsanjoykb/etl-project

The goal of this project is to illustrate Extract Transform Load (ETL) using Python and SQL. ETL is a process commonly done in computing, which takes raw data, cleans it and stores it for later use. The extraction phase targets and retrieves the data. Transform manipulates and cleans the data. Then load stores the data, typically in a data warehouse.

data-engineering database datalake datawarehouse etl etl-automation etl-pipeline etl-solutions

Last synced: 18 Aug 2025

https://github.com/prestodb/prestorials

Tutorials and examples of how to deploy Presto and connect it to different data sources

aws awsglue data datalake docker example glue lakehouse mongodb presto presto-connector prestodb prestosql sql tutorial walkthrough

Last synced: 13 Mar 2026

https://github.com/lynnlangit/serverless-architecture

Companion to my LinkedIn Learning 'Serverless Architecture' course

aws-lambda azure-functions datalake gcp-cloud-functions serverless serverless-architectures

Last synced: 16 Jan 2026

https://github.com/dbsystel/datalake-graphql-wrapper

The DataLake GraphQL Wrapper provides a GraphQL API for presto/trino.

boilerplate cli datalake generator graphql pothos presto prestodb prestosql template trino trinodb typescript wrapper-api yoga-graphql

Last synced: 10 Jul 2025

https://github.com/AWS-Big-Data-Projects/AWS-Data-Lake

AWS Lake Formation makes it easy to set up, secure, and manage your data lakes. It also supports data discovery through the console's metadata search, with search results restricted by column permissions.

aws-s3 datalake

Last synced: 20 Jul 2025

https://github.com/aws-solutions-library-samples/aws-insurancelake-infrastructure

This solution helps you deploy ETL processes and data storage resources to create an Insurance Lake using Amazon S3 buckets for storage, AWS Glue for data transformation, and AWS CDK Pipelines. It is originally based on the AWS blog Deploy data lake ETL jobs using CDK Pipelines, and complements the InsuranceLake ETL with CDK Pipelines project.

aws cdk datalake insurance

Last synced: 14 Oct 2025

https://github.com/openedi/open-data-access-tools

OEDI Data Lake Access

aws datalake nrel oedi open-energy renewable-energy

Last synced: 22 Jul 2025

https://github.com/ismailsimsek/iceberg-examples

Apache Iceberg examples with Spark and S3

datalake iceberg s3 sql sql-merge

Last synced: 04 May 2025

https://github.com/stonezhong/DataManager

Better organize data in a data lake and build ETL pipelines with a web UI tool.

datalake datawarehouse etl spark sparksql

Last synced: 20 Jul 2025

https://github.com/scrogson/duckpond-rs

Rust implementation of the DuckLake lakehouse format

datalake ducklake mysql postgres rust sqlite

Last synced: 05 May 2026

https://github.com/prefeitura-rio/pipelines_rj_sms

Data pipelines of the Municipal Health Department (Secretaria Municipal de Saúde)

datalake pipelines-as-code prefect reporting

Last synced: 17 Jan 2026

https://github.com/vre-hub/vre

VRE infrastructure running at CERN

data-analysis datalake flux helm-charts high-energy-physics jupyterhub jupyterlab k8s openstack platform reana rucio

Last synced: 18 Jan 2026

https://github.com/gigapi/gigapi-querier

DuckDB Query Engine for GigAPI

arrow-flight datalake duckdb duckdb-server flightsql gigapipe influxdb3 lakehouse lakehouse-engine parquet

Last synced: 07 Oct 2025

https://github.com/calvinhartwell/getting-started-with-kylo

An introduction to using Kylo, an open source data lake builder from Teradata

apache-nifi datalake gitbook hadoop hdp kylo nifi spark teradata thinkbig thinkbiganalytics

Last synced: 11 Jun 2025

https://github.com/mimetis/projecty

Project Y is a straightforward tool for the automated deployment of landing zones dedicated to data processing.

azure azuredatabricks azuredatafactory azurekeyvault azurelandingzone databricks datalake synapse

Last synced: 12 Apr 2025

https://github.com/kassette-ai/kassette-server

Secured pipelines for your reporting and auditing data

audit datalake etl kassette powerbi reporting servicenow warehouse workflow

Last synced: 15 Jan 2026

https://github.com/kleinyuan/llama2-csv-webapp

Self-hosted (local) Llama 2-based web app for chatting with multiple CSV files

chatgpt csv datalake large-language-models llama2 llm meta openai pandas pandas-ai pandasai streamlit

Last synced: 12 Apr 2025

https://github.com/sidequery/dlt-iceberg

An Iceberg destination for DLT that supports REST catalogs

apache-iceberg data-engineering datalake dlt dlthub etl iceberg

Last synced: 09 Feb 2026

https://github.com/kimtth/pyspark-tika-text-extraction

🚴‍♂️⛷ Data Lake: performance tuning for text extraction from a huge number of files.

apache-spark apache-tika data-pipeline datalake multithreading pyspark spark tika-python

Last synced: 17 Jul 2025

https://github.com/tuanai-vireox/dataplatform-stack

How to build a complete data platform

airflow cdc data data-warehouse datalake dataplatform dbt flink k8s kafka spark-streaming

Last synced: 22 Aug 2025

https://github.com/aessing/demo-mdwh

Modern Data Warehouse demos with Azure Databricks, Azure Data Factory & Azure Dedicated SQL pool (formerly SQL DW)

azure azure-data-factory azure-databricks data data-engineering data-science databricks databricks-notebooks datafactory datalake datawarehouse datawarehousing delta-lake demos etl machine-learning mdwh ml modern-data-warehouse spark

Last synced: 26 Jun 2025

https://github.com/macieklesiczka/azof

Lakehouse with time travel

datafusion datalake lakehouse parquet rust-lang

Last synced: 02 Mar 2026

https://github.com/lynnlangit/learning-nosql

Companion repository to the LinkedIn Learning course 'Cloud NoSQL for SQL Pros'

aws-dynamodb data datalake dynamodb gcp-bigtable nosql vector-database

Last synced: 15 Jan 2026

https://github.com/openaleph/ftm-lakehouse

Data standard and archive storage for structured FollowTheMoney data, leaked data, private and public document collections.

aleph archive datalake deltalake followthemoney lakehouse openaleph opensanctions

Last synced: 02 Feb 2026

https://github.com/ac-gomes/data_engineer_with_airflow

This project is an adaptation based on a real test for a Junior Data Engineer position.

airflow aws-s3 azure-storage datalake datalake-ingestion json-api postgres python3

Last synced: 17 May 2026

https://github.com/jhole89/serverless-data-pipelines-demo

aws aws-glue aws-iam aws-lambda big-data datalake serverless terraform

Last synced: 12 Jan 2026

https://github.com/divithraju/divith-raju-immigration-data-engineering

A Capstone Project that covers several aspects of Data Engineering (Data Exploration, Cleaning, Modeling, Pipelining, Processing)

apachespark bigdata bigdataprocessing bigdataproject capstone-project datacleaning dataengineering datalake datamodeling datapipeline dataprocessing dataschema dataset datawherehouse pandas sql

Last synced: 29 Apr 2026

https://github.com/erwan-simon/aws-data-platform-framework

A unified framework to industrialize data ingestion, transformation and pipeline execution on AWS using Terraform, from infrastructure provisioning to runtime execution, designed as a reusable and standalone data platform.

aws data data-framework datalake docker iceberg python spark step-functions terraform terraform-module

Last synced: 23 May 2026

https://github.com/neuro-ml/tarn

An insanely customizable framework for key-value storage 💾

cache datalake memoization persistent python storage

Last synced: 23 Apr 2025

https://github.com/amosproj/amos2024ss04-building-information-enhancer

Building Information System for potential energy savings

datalake energy energy-consumption

Last synced: 04 Apr 2026

https://github.com/johnmata0427/data-lake-case-studies

Case Studies with Data Lake

azure data-science datalake jupyter-notebook nosql powerbi sql

Last synced: 24 Apr 2026

https://github.com/murilobellatini/ifood-data-architect-test

My solution to the iFood Data Architect Test using PySpark, Jupyter and Docker in order to create a local prototype data lake.

datalake datamart docker docker-compose pyspark python storage

Last synced: 07 Jan 2026

https://github.com/omr5221/esbi_stream

Application to ingest data into DB from API

api api-client cli datalake docker docker-compose exe keyring logging multiprocessing multithreading pyinstaller python3 sqlalchemy

Last synced: 09 May 2026

https://github.com/riju18/apache-iceberg-kickstart

apache-iceberg datalake datalakehouse docker dremio minio nessie pysaprk python3 s3 sql zeppelin

Last synced: 27 Apr 2026

https://github.com/agnosticeng/cli

Agnostic magic is now at your fingertips.

cli clickhouse data datalake datalakehouse

Last synced: 03 Mar 2026

https://github.com/postpayio/ness

A Python datalake client.

datalake pandas s3

Last synced: 07 May 2026

https://github.com/phelipe-sempreboni/informations

Repository for tutorials, information and notes on technology in general.

amazon-web-services datahub datalake datamart datawarehouse datawarehousing etl modelagem-de-dados olap oltp oracle-database pl-sql pl-sql-script powerbi-desktop powerbi-service rds-database sql sqlserver

Last synced: 19 Apr 2026

https://github.com/simonjang/s3-query-json

Query JSON documents on S3 with SQL

datalake nodejs s3

Last synced: 02 May 2026

https://github.com/leonardodrigo/breweries-data-lake

This project builds an Azure Data Lake using the Medallion architecture to process data with Spark from the Open Breweries DB API.

airflow azure brewerydb datalake docker docker-compose pyspark

Last synced: 19 Jan 2026

https://github.com/macieklesiczka/bazof

Lakehouse with time travel

datafusion datalake lakehouse parquet rust-lang

Last synced: 22 Mar 2025

https://github.com/hussein-awala/gdpr-compliant-lakehouse

This repository is a demonstration of how to handle GDPR export and delete requests in an Iceberg Lakehouse to make it GDPR-compliant.

apache-iceberg apache-spark datalake gdpr lakehouse

Last synced: 18 May 2026

https://github.com/richclement/aws-data-lake-sdk

An SDK for the AWS data lake.

aws datalake sdk

Last synced: 10 May 2025

https://github.com/thunchanokbow/audiblebook-revenue

Manage big data on cloud computing to find a list of best-selling audible books, generate reports and dashboards, and provide products and sales promotions that meet the needs of consumers in Thailand

apache-airflow bigquery cloudcomposer data-visualization datalake datawarehouse googlecloudstorage lookerstudio pandas python3

Last synced: 11 Apr 2026

https://github.com/hoaihuongbk/lakeops

A modern data lake operations toolkit working with multiple table formats (Delta, Iceberg, Parquet) and engines (Spark, Polars) via the same APIs.

data data-operations dataengineering datalake

Last synced: 07 Mar 2026

https://github.com/chandima2000/adventure-works-sales-data-engineering-project

The aim of this project is to build an end-to-end data engineering project using Microsoft Azure

adf azure data-engineering databricks datalake etl-pipeline

Last synced: 30 Apr 2026

https://github.com/jblukach/parquet2csv

Convert from CSV to Parquet and back again!

athena csv datalake parquet rust s3 utility

Last synced: 26 Mar 2025

https://github.com/carolinerocks/azure-data-engineering-end-to-end-project

azure databricks datafactory datalake powerbi python sql synapse

Last synced: 07 May 2026

https://github.com/stefen-taime/azurepipeline

Azure Data Pipeline

azure databricks datalake http terraform vault

Last synced: 08 May 2026

https://github.com/matz1979/spark-etl-pipelines

My final big data project, built with Spark

bigdata datalake etl-pipeline python spark

Last synced: 08 May 2026

https://github.com/felipelaptrin/data-lake

This project is a simple proof of concept to implement a data lake using AWS cloud.

aws datalake githubactions terraform

Last synced: 09 May 2026

https://github.com/jayhan94/minilake

A modern mini lakehouse based on Spark and Delta, running in Docker.

analytics datalake deltalake lakehouse spark

Last synced: 14 Mar 2025

https://github.com/slowlatency/de-apple-data-analysis

A Data Pipeline solution using Databricks and Apache Spark to process and analyze Apple data.

datalake python spark sql

Last synced: 13 May 2026

https://github.com/trannhatnguyen2/bi_cloud_kientap

Building a Business Intelligence Solution on the Microsoft Azure Cloud Platform with Dynamic ELT Integration

azure datalake datawarehouse powerbi

Last synced: 29 Aug 2025

https://github.com/trannhatnguyen2/bi_datalake_azure

Building a Data Lake on the Microsoft Azure Cloud Platform

azure databricks datalake powerbi sql-server

Last synced: 22 Apr 2026

https://github.com/malondaclement/datalake

DataLake project 💾

datalake mysql python3

Last synced: 16 May 2026

https://github.com/k178412/sql-data-warehouse-project

A hands-on data warehouse project using SQL Server, covering ETL processes and data modeling.

bronze-layer data-analysis data-analytics data-cleaning data-engineering data-warehouse database datalake dataset datawarehouse etl etl-pipeline etl-process gold-layer silver-layer sql sql-query sql-server sqlserver

Last synced: 25 Apr 2026

https://github.com/senaldolage/wa-road-insights-pipeline

End-to-end Azure data pipeline project analyzing Western Australia transport datasets with dashboards built in Tableau. Featuring Data Factory, Databricks, Synapse, and Data Lake Gen2.

datalake pyspark synapse-analytics tableau

Last synced: 27 Jun 2025