Projects in Awesome Lists tagged with data-lake
A curated list of projects in awesome lists tagged with data-lake.
https://github.com/treeverse/lakefs
lakeFS - Data version control for your data lake | Git for data
apache-spark apache-sparksql aws-s3 azure-blob-storage azure-storage data-engineering data-lake data-quality data-version-control data-versioning datalake datalakes git-for-data go golang google-cloud-storage hadoop-filesystem lakefs object-storage
Last synced: 24 Dec 2025
https://github.com/dlt-hub/dlt
data load tool (dlt) is an open source Python library that makes data loading easy 🛠️
data data-engineering data-lake data-loading data-warehouse elt extract load python transform
Last synced: 26 Mar 2025
https://github.com/apache/kyuubi
Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.
data-lake hacktoberfest hadoop hive jdbc kubernetes spark spark-sql sql thrift
Last synced: 13 May 2025
https://github.com/bytedance/bitsail
BitSail is a distributed high-performance data integration engine which supports batch, streaming and incremental scenarios. BitSail is widely used to synchronize hundreds of trillions of data every day.
big-data data-integration data-lake data-pipeline data-synchronization flink high-performance real-time
Last synced: 15 May 2025
https://github.com/san089/udacity-data-engineering-projects
A few projects related to Data Engineering, including Data Modeling, Infrastructure setup on cloud, Data Warehousing and Data Lake development.
airflow airflow-operators aws aws-ec2 aws-s3 aws-sdk cassandra cassandra-database cloudformation cluster data data-engineering data-engineering-pipeline data-lake data-modeling data-warehouse etl-pipeline infrastructure postgres postgresql-database
Last synced: 08 Apr 2025
https://github.com/san089/goodreads_etl_pipeline
An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.
airflow airflow-dag apache-airflow apache-spark data-engineering data-engineering-pipeline data-lake data-migration emr-cluster etl-framework etl-job etl-pipeline goodreads-data-pipeline livy python redshift s3 scheduler spark warehouse
Last synced: 16 May 2025
https://github.com/teradata/kylo
Kylo is a data lake management software platform and framework for enabling scalable enterprise-class data lakes on big data technologies such as Teradata, Apache Spark and/or Hadoop. Kylo is licensed under Apache 2.0. Contributed by Teradata Inc.
data-lake hadoop kylo nifi spark teradata
Last synced: 15 May 2025
https://github.com/alanchn31/data-engineering-projects
Personal Data Engineering Projects
airflow aws-redshift cassandra data-engineering data-engineering-nanodegree data-lake data-modeling data-warehouse ingest-data mongodb postgres scrapy spark star-schema
Last synced: 12 Apr 2025
https://github.com/canner/vulcan-sql
Data API Framework for AI Agents and Data Apps
ai ai-agent analytics api-builder bigquery clickhouse data-lake data-warehouse database duckdb ksqldb postgresql reporting restful-api snowflake spreadsheet sql typescript vulcan-sql vulcansql
Last synced: 15 May 2025
https://github.com/uber/marmaray
Generic Data Ingestion & Dispersal Library for Hadoop
avro-schema data-lake hadoop ingest-data schema-format spark
Last synced: 23 Mar 2025
https://github.com/aws-solutions-library-samples/data-lakes-on-aws
Enterprise-grade, production-hardened, serverless data lake on AWS
analytics aws best-practices data-engineering data-lake etl framework iac lake-formation serverless
Last synced: 14 Oct 2025
https://github.com/gigapi/gigapi
GigAPI is a Timeseries lakehouse for real-time data and sub-second queries, powered by DuckDB OLAP + Parquet Query Engine, Compactor w/ Cloud-Native Storage. Drop-in FDAP alternative ⭐
api clickhouse-server data-lake database datalake duckdb duckdb-api duckdb-server ducklake fdap gigapipe golang lakehouse olap parquet qryn query-engine rest-api s3 sql
Last synced: 05 Oct 2025
https://github.com/cuebook/cuelake
Use SQL to build ELT pipelines on a data lakehouse.
apache-iceberg apache-spark data-engineering data-ingestion data-integration data-lake data-pipeline data-transfer datalake delta elt etl incremental-updates lakehouse pipelines spark-sql sql upsert zeppelin-notebook
Last synced: 07 Apr 2025
https://github.com/awslabs/amazon-s3-find-and-forget
Amazon S3 Find and Forget is a solution to handle data erasure requests from data lakes stored on Amazon S3, for example, pursuant to the European General Data Protection Regulation (GDPR)
amazon-s3 aws big-data ccpa data data-erasure data-lake gdpr parquet privacy right-to-be-forgotten s3
Last synced: 04 Apr 2025
https://github.com/maxi-k/btrblocks
BtrBlocks: Efficient Columnar Compression for Data Lakes (SIGMOD 2023 Paper)
compression data-lake databases research
Last synced: 09 Apr 2025
https://github.com/azure/usql
U-SQL Examples and Issue Tracking
azure big-data data-lake u-sql
Last synced: 12 Apr 2025
https://github.com/azure/azuredatalake
Samples and Docs for Azure Data Lake Store and Analytics
Last synced: 09 Apr 2025
https://github.com/Canner/wren-engine
🤖 The semantic engine for LLMs, bringing semantic context to AI agents. 🔥
business-intelligence data data-analysis data-analytics data-lake data-warehouse hacktoberfest llm semantic semantic-layer sql
Last synced: 01 Apr 2025
https://github.com/learningjournal/spark-streaming-in-python
Apache Spark 3 - Structured Streaming Course Material
apache-spark big-data bigdata data-lake pyspark python spark-sql spark-streaming
Last synced: 04 Sep 2025
https://github.com/smart-data-lake/smart-data-lake
Smart Automation Tool for building modern Data Lakes and Data Pipelines
data-lake data-pipelines deltalake hadoop hive scala smart-data-lake spark transform-data
Last synced: 13 Apr 2025
https://github.com/GitDataAI/jiaozifs
A Git-like Version Control File System for AI & Data Product Management.
aiops data-collaboration data-lake data-lineage data-product data-version-control data-versioning dataops digital-twins federated-learning git git-filesystem git-for-data git-interface jiaozifs jzfs mlops version-controlled-filesystem
Last synced: 03 Mar 2025
https://github.com/GitDataAI/jzfs
A Git-like Version Control File System for AI & Data Product Management.
aiops data-collaboration data-lake data-lineage data-product data-version-control data-versioning dataops digital-twins federated-learning git git-filesystem git-for-data git-interface jiaozifs jzfs mlops version-controlled-filesystem
Last synced: 04 Apr 2025
https://github.com/learningjournal/sparkprogramminginscala
Apache Spark Course Material
apache-spark big-data bigdata data-lake datalake scala spark spark-scala spark-sql
Last synced: 17 Mar 2025
https://github.com/aws-samples/aws-dbs-refarch-datalake
Reference Architectures for Datalakes on AWS
amazon-emr data-analytics data-catalog data-lake data-transformation emr-cluster glue hive-metastore ingest-data
Last synced: 06 May 2025
https://github.com/camunda-community-hub/zeeqs
GraphQL API for Zeebe data
data-lake graphql zeebe zeebe-tool
Last synced: 04 Oct 2025
https://github.com/OElesin/querypal
Web UI for Amazon Athena
analytics aws aws-athena data data-lake sql
Last synced: 30 Jul 2025
https://github.com/kenthsu/udacity-data-engineering-nanodgree
Udacity Data Engineering Nanodegree Program
apache-airflow apache-cassandra apache-spark aws-redshift aws-s3 data-engineering data-lake data-pipelines data-quality data-warehouses postgresql
Last synced: 10 Apr 2025
https://github.com/matsmoll/aligned
The DBT of ML: Aligned describes data dependencies in ML systems and reduces technical data debt
ai data-contracts data-lake datacontracts dbt feature-engineering feature-store ml ml-ops mlops
Last synced: 13 Aug 2025
https://github.com/nodestream-proj/nodestream
A Declarative framework for Building, Maintaining, and Analyzing Graph Data
api athena aws cli data-engineering data-lake data-science declarative etl framework graph graphql kafka knowledge-graph neo4j python s3 security visualization yaml
Last synced: 05 Apr 2025
https://github.com/suecodelabs/cnfuzz
Breaking Cloud Native Web APIs in their natural habitat.
aws aws-s3 cicd cloud-native data-lake fuzzing golang kubernetes microsoft openapi openapi-spec opensource rest-api rest-api-test restler security-tools service-mesh
Last synced: 12 Apr 2025
https://github.com/canner/vulcan-sql-examples
Curated VulcanSQL showcases
analytics api-builder bigquery data data-lake data-warehouse database duckdb examples postgresql reporting restful-api sql vulcan-sql vulcansql
Last synced: 19 Jul 2025
https://github.com/imsanjoykb/python-mysql-operation
This Python MySQL Repo shows you how to use MySQL Connector Python to access MySQL databases. You will learn how to connect to MySQL database and perform common database operations such as SELECT, INSERT, UPDATE, & DELETE in Python.
data-lake database database-operation database-programming mysql-server python-mysql python-mysql-connector
Last synced: 20 Jun 2025
https://github.com/linkml/linkml-store
wrapper for multiple linkml storage engines
data-lake data-stack database-wrapper duckdb hdf5 linkml mongo mongodb nosql-database rdf semweb triplestore vector-database
Last synced: 23 Apr 2025
https://github.com/ec-europa/eubfr-data-lake
EU Budget for Results - Data Lake
Last synced: 29 Apr 2025
https://github.com/apache/kyuubi-docker
Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.
data-lake hadoop hive jdbc kubernetes spark spark-sql sql thrift
Last synced: 19 Oct 2025
https://github.com/AuFeld/Data_Engineering_Projects
A collection of data engineering projects: data modeling, ETL pipelines, data lakes, infrastructure configuration on AWS, data warehousing, containerization, and a dashboard to monitor data pipeline KPIs
airflow aws cassandra data-engineering data-lake data-warehouse docker emr etl-pipeline infrastructure-as-code infrastructure-setup postgresql python redshift s3 spark
Last synced: 29 Jul 2025
https://github.com/zkan/swu-ds525
DS525
data-engineering data-lake data-modeling data-pipeline data-warehouse
Last synced: 19 Aug 2025
https://github.com/treeverse/lakefs-hooks
a simple lakeFS webhook for pre-commit and pre-merge validation of data objects
data-engineering data-lake lakefs
Last synced: 23 Oct 2025
https://github.com/sap-samples/hana-cloud-relational-data-lake-onboarding
This is an end-to-end onboarding sample for SAP HANA Cloud, relational data lake. It shows how to create schema, load data, and execute queries.
data-lake hana-cloud sample sap-hana
Last synced: 05 Mar 2025
https://github.com/mahmoudparsian/data-warehousing
This repository is a place for the Data Warehousing course at the Information Systems & Analytics department, Santa Clara University.
business-intelligence data-analytics data-lake data-lakehouse data-mining data-modeling data-visualization data-warehouse data-warehousing database dimensional-modeling elt etl extract load snowflake-schema star-schema tableau transform
Last synced: 03 Jul 2025
https://github.com/datasphere-oss/datasphere
DataSphere is the first open-source cloud-native data observability platform that helps you trace the whole data infrastructure in your warehouses, lakes and databases.
cloud-native daas data-analytics data-governance data-lake data-management data-observability datamesh datasphere warehouse
Last synced: 14 Jul 2025
https://github.com/ibm-cloud/nodejs-data-lake-dashboard
Sample and tutorial that creates interactive dashboards using: Dynamic Dashboard Embedded, Cloud Object Storage, SQL Query, DB2 Warehouse and AppID.
cloud data-lake db2 db2-warehouse ibm-cloud ibm-cloud-solutions tutorial
Last synced: 22 Apr 2025
https://github.com/DataDrivenGit/Music-Streaming-App-using-AWS-ETL
Implemented a Data Warehouse and a Data Lake on AWS, with data modeling in Postgres and Apache Cassandra; also used Apache Airflow to create a data pipeline
airflow-operators cassandra data-lake data-pipelines datawarehouse postgres python3 sql
Last synced: 20 Jul 2025
https://github.com/bahbosque/delta-to-iceberg-aws-glue
Tool to migrate Delta Lake tables to Apache Iceberg using AWS Glue and S3
apache-iceberg aws aws-glue-data-catalog data-lake delta-lake migration-tool open-source spark
Last synced: 03 Jul 2025
https://github.com/pirate-emperor/bigdata-pipeline
BigData Pipeline is a local testing environment for experimenting with various storage solutions (RDB, HDFS), query engines (Trino), schedulers (Airflow), and ETL/ELT tools (DBT). It supports MySQL, Hadoop, Hive, Kudu, and more.
airflow airflow-dags airflow-docker big-data data-lake data-lakestore data-warehouse dbt dbt-core distributed-computing docker docker-compose hadoop hive hiveql kudu mysql mysql-server trino trino-cli
Last synced: 01 Aug 2025
https://github.com/sigpwned/jdbq
JDBI-inspired Database Access Framework for Java + BigQuery
bigquery data-access-framework data-access-layer data-access-library data-lake java persistence persistence-framework persistence-layer
Last synced: 15 May 2025
https://github.com/shrikantnaidu/data-lakehouse-with-delta-lake
Setting up a data lakehouse with delta lake using docker
data-lake delta-lake docker-compose pyspark
Last synced: 30 Jul 2025
https://github.com/san089/yelp_project
This project creates a data lake for the Yelp dataset, uses it to build an analytical sandbox for data science purposes, and creates a data warehouse for reporting purposes.
data-lake data-pipeline etl etl-pipeline ingestion load pyspark recommender-system redshift
Last synced: 06 Mar 2025
https://github.com/agutiernc/nyc-citi-bike-insights
Data Engineering Project using NYC Citi Bike data for years 2019, 2020, and 2023
analytics-engineering batch-processing data-engineering data-lake data-warehouse dbt dlt etl-pipeline google-cloud-storage python sql terraform
Last synced: 25 Feb 2025
https://github.com/santiagortiiz/snowflake-data-pipelines
EPAM's Snowflake hands-on lab. We built a pipeline to read and load data from S3 into Snowflake, developed an ETL workflow to clean the data and stored it in a data warehouse with the 3NF and Star schemas for data mart analysis.
business-intelligence data-lake data-pipelines data-warehouse etl snowflake streams
Last synced: 26 Jun 2025
https://github.com/cds-snc/data-lake
Infrastructure for the Platform Data Lake
Last synced: 10 Apr 2025
https://github.com/mikeacosta/florasense
Orchestrating Cloud ETL Workloads
apache-spark aws cloudformation data-lake data-warehouse emr-cluster etl-pipeline kinesis-stream lambda-functions redshift redshift-spectrum step-functions
Last synced: 22 Feb 2025
https://github.com/supakunz/book-revenue-pipeline
A ready-to-use Docker-based template for data engineering projects, featuring a complete stack with Apache Airflow, Spark, and MinIO for building scalable data pipelines.
apache-airflow apache-spark data-engineering data-lake delta-lake docker docker-compose etl-pipeline minio pyspark python s3 template-boilerplate
Last synced: 30 Dec 2025
https://github.com/najuzilu/dl-spark
Building a Data Lake with Spark
aws-emr aws-s3 data-engineering data-lake etl-pipeline spark
Last synced: 25 Aug 2025
https://github.com/datawithbaraa/sql-modern-warehouse-and-analytics
A comprehensive guide to building a modern data warehouse with SQL Server, including ETL processes, data modeling, and analytics.
data-analysis data-analytics data-cleaning data-engineering data-lake data-lakehouse data-science data-warehouse data-warehousing database datalake datascience datawarehouse datawarehousing etl medallion-architecture pipeline sql sql-query sql-server
Last synced: 09 Apr 2025
https://github.com/narius2030/datalake-solution-imcp
This project involved the development and implementation of a Data Lake architecture to support an AI model capable of generating image captions. The architecture was designed to efficiently ingest, process, and centrally store large volumes of image and text data.
data-lake docker-container etl-pipeline fastapi medallion-architecture mlops nosql-database object-storage
Last synced: 25 Mar 2025
https://github.com/mikeacosta/cowculate
Herd management ETL and analytics processing
apache-spark aws cloudformation data-lake etl kinesis-stream redshift s3
Last synced: 08 Oct 2025
https://github.com/stevehoober254/dataengineer-portfolio
📊 End-to-end ETL pipelines, Airflow DAGs, notebook-driven analytics & data warehousing
airflow analytics big-data dagster data-engineering data-lake data-pipelines etl python spark
Last synced: 09 Oct 2025
https://github.com/vermicida/data-lake
Data Lake, the code corresponding to project #4 of Udacity's Data Engineer Nanodegree Program
aws-s3 data-engineering data-lake etl-pipeline python spark
Last synced: 12 Oct 2025
https://github.com/morgan-sell/usa-tourism-etl
Coalesced and transformed various data sources to create a comprehensive data lake for the USA tourism sector.
aws data-engineering data-lake emr-cluster etl-pipeline python spark
Last synced: 02 Apr 2025
https://github.com/banknatchapol/sparkify-data-lake
Build an ETL pipeline for a data lake hosted on AWS S3.
Last synced: 03 Sep 2025
https://github.com/anuveyatsu/cloudflare-data-fabric
Cloudflare Data Fabric: Use Cloudflare's global infrastructure to build a flexible, resilient framework for data solutions.
cloudflare data data-lake fabric lakehouse mesh
Last synced: 12 Sep 2025
https://github.com/shinie19/sql-data-warehouse-project
Build a modern Data Warehouse from scratch with SQL Server, including ETL processes, data modeling and analytics.
data-analysis data-analytics data-cleaning data-engineering data-lake data-lakehouse data-modeling data-normalization data-science data-standardization data-warehouse etl-pipeline medallion-architecture sql-server
Last synced: 11 Mar 2025
https://github.com/aymendaoudi/electric-vehicle-charging-simulator
Simulation, Ingestion and ETL-ing data of millions of EV charging sessions by thousands of EVs in thousands of stations around the world.
apache-airflow apache-spark batch-processing data-engineering data-lake data-warehouse kafka kafka-connect lake-fs minio mongodb postgresql python3 simpy spark stream-processing
Last synced: 27 Mar 2025
https://github.com/gakas14/aws-serverless-data-lake
This workshop is to build a serverless data lake architecture using Amazon Kinesis Firehose for streaming data ingestion, AWS Glue for Data Integration (ETL, Catalogue Management), Amazon S3 for data lake storage, Amazon Athena for SQL big data analytics.
athena aws data-lake etl glue-catalog glue-etl kinesis-firehose kinesis-stream s3 sql
Last synced: 20 Feb 2025
https://github.com/zxkane/serverless-docker-images-analytics
Serverless Analytics app for analyzing docker image layers
analytics aws aws-athena aws-cdk aws-glue big-data data-lake
Last synced: 28 Mar 2025
https://github.com/manuel-lang/data-lake-with-spark
Project Data Lake as part of Udacity's Data Engineering Nanodegree
data-engineering data-lake etl-pipeline s3 spark udacity udacity-data-engineer-nanodegree
Last synced: 01 Mar 2025
https://github.com/josecsotomorales/pyspark-datalake
Data Lake with PySpark and AWS S3
data-lake pyspark python spark
Last synced: 27 Mar 2025
https://github.com/adilkhash/aws-meetup-almaty-2019-data-lake
Resources for AWS Almaty Meetup: Building scalable Data Lake with AWS
amazon-athena amazon-s3 aws aws-glue data-lake etl-pipeline luigi-workflows
Last synced: 27 Feb 2025
https://github.com/jaimealruiz/proyecto-tfg
Design and implementation of an interconnection between LLMs and data spaces
data-lake mcp mcp-client mcp-server rag
Last synced: 08 Sep 2025
https://github.com/yasarsultan/taxi-trip-analysis
The NYC Taxi Trip Batch Data Pipeline automates processing of large-scale trip data using Apache Spark and Airflow, integrating AWS S3 and Google BigQuery for storage and analytics. It features scalable, containerized workflows with robust data validation.
airflow aws-s3 bash-script batch-processing bigquery data-lake data-warehouse docker python3 spark
Last synced: 01 Mar 2025
https://github.com/rafie-b/data-warehouse-aws-pipeline-chat-api
notebook guide
api-rest aws data-lake data-warehouse message-data pipeline python sql
Last synced: 13 Oct 2025
https://github.com/narius2030/imcp-support-blinders
This project focuses on image captioning by creating two primary models: DarkNetLM and DarkNetVG2. Both models leverage the CSP DarkNet53 architecture as the backbone of YOLOv8 for feature extraction from images, combined with Transformers or LSTM to generate captions.
computer-vision data-lake image-captioning large-language-model mobile-app
Last synced: 30 Oct 2025
https://github.com/jcguidry/flight-ml-ingest-gcp-func
Ingests flight data for ML projects, using serverless cloud functions
cloud-function data-lake flightaware-aeroapi-service gcp github-actions pandas
Last synced: 07 Nov 2025
https://github.com/sugumarsrinivasan/sql-datawarehouse-project
Building a modern data warehouse with SQL Server, including ETL processes, data modeling, and data analytics.
data-analysis data-analytics data-engineering data-lake data-science data-warehouse datawarehousing etl etl-pipeline medallion-architecture sql sql-query sql-server
Last synced: 24 Oct 2025
https://github.com/dina-hosny/sparkify---data-lake-with-aws
Sparkify - Data Lake with AWS - Udacity Data Engineering Expert Track.
analytics aws data-engineering data-lake data-pipelines dataset etl fwd udacity
Last synced: 30 Jul 2025
https://github.com/mxagar/data_engineering_guide
Personal notes on the IBM Data Engineering Certificate as well as other sources focusing on AWS.
airflow aws data-lake data-modeling data-pipelines data-science no-sql spark sql warehouse
Last synced: 27 Jul 2025
https://github.com/lesiaukr/goit-de-fp
Master's degree | Data Engineering | Final course projects | goit-de-fp
apache-airflow apache-kafka apache-spark data-lake docker goit-de-fp python streaming-pipeline
Last synced: 25 Jul 2025
https://github.com/zkan/data-engineering-on-gcp
Data Engineering on Google Cloud Platform (GCP)
bigquery data-engineering data-lake data-pipeline data-warehouse gcs google-cloud-platform machine-learning
Last synced: 19 Aug 2025
https://github.com/jibbs1703/tickit-data-pipeline
This repository demonstrates the creation of a robust data pipeline using an orchestrator together with on-prem and cloud resources. It collects data from on-premises SQL and NoSQL databases and loads it into a SQL database in the cloud.
aws-glue aws-glue-crawler aws-glue-data-catalog aws-redshift aws-s3 boto3 data-lake database etl-pipeline medallion-architecture mongodb precommit-hooks
Last synced: 17 Sep 2025
https://github.com/daniel-jcvv/daniel-jcvv
👨💻 Data Engineer | 3+ years enterprise experience with Telcel & Citi Banamex. Develop ETL pipelines, data governance, and cloud solutions. Building scalable data architectures and automated workflows for Fortune 500 clients. Tech Stack: Python, SQL Server, Oracle, Apache Airflow, PySpark
agentic-ai apache-airflow apache-kafka apache-spark automation business-intelligence citi-bank-apis data-analysis data-engineering data-lake data-warehouse etl-pipeline medallion-architecture mlops n8n-workflow python rag sql-server
Last synced: 02 Nov 2025
https://github.com/ahmedd38/dataengineer-portfolio
📊 End-to-end ETL pipelines, Airflow DAGs, notebook-driven analytics & data warehousing
airflow analytics big-data dagster data-engineering data-lake data-pipelines etl python spark
Last synced: 15 Apr 2025
https://github.com/hsiehshujeng/dynamodb-streaming-datalake
A demo of DynamoDB CDC into data lake with AWS CDK v2
aws-cdk data-lake ddb dynamodb dynamodb-stream kinesis-firehose kinesis-stream typescript
Last synced: 28 Mar 2025
https://github.com/dominique-jacque/nba-data-lake
NBA Data Lake Repository contains the setup_nba_data_lake.py script, which automates the creation of a data lake for NBA analytics using AWS services. The script integrates Amazon S3, AWS Glue, and Amazon Athena, and sets up the infrastructure needed to store and query NBA-related data.
amazon-athena api aws-glue cloudshell data-lake iam s3
Last synced: 08 Mar 2025
https://github.com/manuelandersen/football-pipeline
DE Zoomcamp 2024 Final Project 🧙
bigquery data-engineering data-lake data-warehouse dbt dbt-cloud etl-pipeline google-cloud looker-studio mageai python
Last synced: 02 Jan 2026