Projects in Awesome Lists tagged with dataengineering
A curated list of projects in awesome lists tagged with dataengineering.
https://github.com/dataexpert-io/data-engineer-handbook
This is a repo with links to everything you'd ever want to learn about data engineering
apachespark awesome bigdata data dataengineering sql
Last synced: 28 Sep 2025
https://github.com/DataExpert-io/data-engineer-handbook
This is a repo with links to everything you'd ever want to learn about data engineering
apachespark awesome bigdata data dataengineering sql
Last synced: 04 Apr 2025
https://github.com/open-metadata/openmetadata
OpenMetadata is a unified metadata platform for data discovery, data observability, and data governance powered by a central metadata repository, in-depth column level lineage, and seamless team collaboration.
data-catalog data-collaboration data-contracts data-discovery data-governance data-lineage data-observability data-profiling data-quality data-quality-checks data-science data-validation datadiscovery dataengineering dataquality dbt hacktoberfest metadata metadata-management snowflake
Last synced: 12 Nov 2025
https://github.com/open-metadata/OpenMetadata
OpenMetadata is a unified metadata platform for data discovery, data observability, and data governance powered by a central metadata repository, in-depth column level lineage, and seamless team collaboration.
data-catalog data-collaboration data-contracts data-discovery data-governance data-lineage data-observability data-profiling data-quality data-quality-checks data-science data-validation datadiscovery dataengineering dataquality dbt hacktoberfest metadata metadata-management snowflake
Last synced: 15 Mar 2025
https://github.com/datafold/data-diff
Compare tables within or across databases
data data-diffing data-engineering data-quality data-quality-monitoring data-science database databricks-sql dataengineering dataquality dbt mysql oracle-database postgres postgresql python rdbms snowflake sql trino
Last synced: 24 Mar 2025
https://github.com/tobikodata/sqlmesh
Scalable and efficient data transformation framework - backwards compatible with dbt.
dataengineering dataops dbt elt etl python sql transformation
Last synced: 21 Jan 2026
https://github.com/TobikoData/sqlmesh
Efficient data transformation and modeling framework that is backwards compatible with dbt.
dataengineering dataops dbt elt etl python sql transformation
Last synced: 26 Mar 2025
https://github.com/zinggAI/zingg
Scalable identity resolution, entity resolution, data mastering and deduplication using ML
analytics cdp customer-data-platform data-science databricks dataengineering datalake dataquality dedupe deduplication entity-resolution fuzzy-matching fuzzymatch identity-resolution master-data-management masterdata mdm ml snowflake spark
Last synced: 16 Nov 2025
https://github.com/zinggai/zingg
Scalable identity resolution, entity resolution, data mastering and deduplication using ML
analytics analytics-engineering data-science data-transformation data-transformations dataengineering datalake dataquality dedupe deduplication entity-resolution etl fuzzy-matching fuzzymatch identity identity-resolution masterdata ml modern-data-stack spark
Last synced: 14 May 2025
https://github.com/514-labs/moosestack
The developer framework for building analytics into your app on top of ClickHouse, Redpanda and other high-performance analytical infrastructure
analytics data dataengineering deployment framework insights metrics python rust typescript
Last synced: 22 Jan 2026
https://github.com/Datavault-UK/automate-dv
A free to use dbt package for creating and loading Data Vault 2.0 compliant Data Warehouses (powered by dbt, an open source data engineering tool, registered trademark of dbt Labs)
data-vault dataengineering datalake datavault datavault20 datawarehouse datawarehousing dbt elt etl metadata snowflake sql
Last synced: 13 May 2025
https://github.com/awslabs/aws-ddk
An open source development framework to help you build data workflows and modern data architecture on AWS.
aws dataengineering dataops python
Last synced: 14 Jan 2026
https://github.com/kevinheavey/modern-polars
Code and data for the Modern Polars book
data-analytics data-engineering data-science dataengineering pandas polars python
Last synced: 10 Sep 2025
https://kevinheavey.github.io/modern-polars/
Code and data for the Modern Polars book
data-analytics data-engineering data-science dataengineering pandas polars python
Last synced: 10 Jul 2025
https://github.com/mehd-io/pypi-duck-flow
End-to-end data engineering project to get insights from PyPI using Python, DuckDB, MotherDuck & Evidence
dataengineering duckdb etl python
Last synced: 06 Oct 2025
https://github.com/josephmachado/beginner_de_project_stream
Simple stream processing pipeline
apache-flink dataengineering datapipeline graphana postgresql prometheus
Last synced: 15 Apr 2025
https://github.com/noahgift/data-engineering-and-dataops
Duke MIDS: Data Engineering and DataOps Course
book cloud course data data-science dataengineering dataops duke mlops software-engineering
Last synced: 03 Sep 2025
https://github.com/514-labs/moose
The developer framework for your data & analytics stack
analytics data dataengineering deployment framework insights metrics python rust typescript
Last synced: 05 Apr 2025
https://github.com/abhishek-ch/data-machinelearning-the-boring-way
Build & Learn Data Engineering, Machine Learning over Kubernetes. No Shortcut approach.
data-infrastructure dataengineering datascience kubernetes machine-learning mlops
Last synced: 21 Mar 2025
https://github.com/kislerdm/data-engineering-interviews
Data engineering interviews Q&A for data community by data community
dataengineering interview-questions kafka linux opensource python spark sql
Last synced: 29 Apr 2025
https://github.com/olist/work-at-olist-data
Apply for a job at Olist's Data Team: https://olist.gupy.io/
analytics data dataengineering datascience dataset julia machinelearning pandas python r sql
Last synced: 25 Jun 2025
https://github.com/josephmachado/socialetl
Project for "Data pipeline design patterns" blog.
dataengineering design-patterns etl-pipeline makefile python reddit social-media-data sqllite3
Last synced: 15 Apr 2025
https://github.com/josephmachado/de_project
Step by step instructions to create a production-ready data pipeline
dataengineering datapipeline python
Last synced: 15 Apr 2025
https://github.com/wittline/apache-spark-docker
Dockerizing an Apache Spark Standalone Cluster
apache-spark dataengineer dataengineering docker docker-compose hadoop-cluster hadoop-docker hdfs hive hive-metastore hue pyspark
Last synced: 13 Apr 2025
https://github.com/aakashnand/trino-ranger-demo
Tutorial on how to set up Trino and Apache Ranger using Docker
dataengineering datagovernance docker hacktoberfest ranger trino tutorial
Last synced: 04 Apr 2025
https://github.com/airscholar/sparkingflow
This project demonstrates how to use Apache Airflow to submit jobs to an Apache Spark cluster in different programming languages, using Python, Scala and Java as examples.
apache-airflow dataengineering docker java pyspark scala spark
Last synced: 10 Apr 2025
https://github.com/airscholar/realtimestreamingengineering
This project serves as a comprehensive guide to building an end-to-end data engineering pipeline using TCP/IP Socket, Apache Spark, OpenAI LLM, Kafka and Elasticsearch. It covers each stage from data acquisition, processing, and sentiment analysis with ChatGPT to production to a Kafka topic and connection to Elasticsearch.
apache-spark chatgpt dataengineering elasticsearch kafka openai-api tcp-socket
Last synced: 10 Apr 2025
https://github.com/spratiher9/sparkdataset
Instant search for and access to many datasets in PySpark.
benchmark benchmark-framework data data-analysis data-mining dataengineering dataset datasets easy-access-application instantsearch pyspark python python3 quickstart r spark standard
Last synced: 02 Aug 2025
https://github.com/josephmachado/data-engineering-interview-series
Repository for Data Engineering Interview Series
data-structures dataengineering interview
Last synced: 15 Apr 2025
https://github.com/wittline/pyspark-on-aws-emr
The goal of this project is to offer an AWS EMR template using Spot Fleet and On-Demand Instances that you can use quickly. Just focus on writing pyspark code.
aws aws-emr big-data big-data-analytics dataengineering ec2-spot ec2-spot-instances emr-cluster pyspark python spark wordcloud-generator
Last synced: 13 Apr 2025
https://github.com/waylonwalker/kedro-static-viz
kedro cli plugin for generating a static kedro viz site (html, css, js) that can be deployed on many serverless tools.
data dataengineering datapipeline kedro kedro-plugin python
Last synced: 05 May 2025
https://github.com/WaylonWalker/kedro-static-viz
kedro cli plugin for generating a static kedro viz site (html, css, js) that can be deployed on many serverless tools.
data dataengineering datapipeline kedro kedro-plugin python
Last synced: 24 Mar 2025
https://github.com/wittline/data-engineer-challenge
Challenge Data Engineer
data-engineering data-pipeline dataengineering docker docker-compose fastapi postgresql
Last synced: 05 Sep 2025
https://github.com/wittline/pydag
Scheduling Big Data Workloads and Data Pipelines in the Cloud with pyDag
big-data bigquery cloud dag data-engineering data-pipeline dataengineering dataproc dataproc-cluster directed-acyclic-graph google-cloud google-cloud-platform parallel-processing task-scheduler task-scheduling workflow-engine
Last synced: 13 Apr 2025
https://github.com/josephmachado/e2e_datapipeline_test
Example repo to create end to end tests for data pipeline.
aws dataengineering moto pytest python3 testing
Last synced: 15 Apr 2025
https://github.com/airscholar/footballdataengineering
An end-to-end data engineering pipeline that fetches data from Wikipedia, cleans and transforms it with Apache Airflow and saves it on Azure Data Lake. Other processing takes place on Azure Data Factory, Azure Synapse and Tableau.
apache-airflow azure-data-factory azure-data-lake-gen2 azure-databricks azure-synapse-analytics data-engineering dataengineering
Last synced: 10 Apr 2025
https://github.com/waylonwalker/kedro-action
A GitHub Action to lint, test, build-docs, package, and run your kedro pipelines. Supports any Python version you give it (that is also supported by pyenv).
actions dataengineering datapipeline kedro
Last synced: 05 May 2025
https://github.com/anuran-roy/serpytor
A distributed, low-code, end-to-end data collection and analysis tool for data folks. Take the pain out of data collection from your pipeline!
data dataengineering datascience distributed-computing distributed-systems low-code lowcode open-source pipeline python python3
Last synced: 16 May 2025
https://github.com/open-metadata/openmetadata-site
Open Standard for Metadata. A Single place to Discover, Collaborate and Get your data right.
automation bigdata bigdataanalytics data-catalog data-discovery data-observability data-profiling data-quality-monitoring data-science datadiscovery dataengineering dataquality datascience dbt governance hacktoberfest hacktoberfest2022 metadata metadata-api metadata-management
Last synced: 14 Apr 2025
https://github.com/tuanai-vireox/gcp-professional-data-engineer
GCP Professional Data Engineer Certification- Learning
data dataengineering gcp professional
Last synced: 22 Aug 2025
https://github.com/ahmetfurkandemir/dataengineering-youtube-project
Data Engineering Youtube Project
amazon amazon-athena amazon-glue aws bash dataengineering iam-role lambda lambda-functions python s3-bucket s3-storage s3api spark
Last synced: 15 Apr 2025
https://github.com/adilkhash/luigi-course-materials
Materials for the course Introduction to Data Engineering: Data Pipelines
dataeng dataengineering datapipeline luigi python workflow-engine
Last synced: 02 Sep 2025
https://github.com/caogiathinh/urban-mobility-elt-pipeline
Built a complete end-to-end data platform to ingest, process, and analyze complex, multi-source public datasets for business intelligence.
dataengineering docker gooogle-cloud kestra pandas postgresql-database python spark sql terrraform
Last synced: 07 Oct 2025
https://github.com/mikma03/devops-mlops
Tools for DevOps and MLOps. Materials and projects. New technologies and infrastructure review.
airflow ansible aws azure cicd dataengineering devops-tools jenkins kubernetes terraform
Last synced: 16 Jun 2025
https://github.com/abhishek-ch/dataengineering-agent
Data Engineering Agent Using Open AI Function Call
dataengineering llm openai openai-function-call python
Last synced: 17 Jun 2025
https://github.com/jaehyeon-kim/general-demos
Data engineering demo projects
aws dataengineering dbt kafka kafkaconnect opensearch serverlessapplicationmodel spark
Last synced: 26 Mar 2025
https://github.com/recodehive/recode-website
recodehive helps you learn and master data skills and encourages you to code on open source.
data data-science dataengineering opensource python sql tutorials website
Last synced: 27 Jul 2025
https://github.com/wittline/dataengineering-assignment
Prescreening Tasks for Data Engineer
dataengineering docker jupyter-notebook postgresql
Last synced: 13 Apr 2025
https://github.com/m-farag/rawbuilder
an elegant datasets factory
dataengineering dataset-generator datasets package python software-engineering
Last synced: 07 May 2025
https://github.com/clarifai/clarifai-python-datautils
Extract, transform, and load unstructured data into Clarifai's AI platform
dataanalysis dataengineering ingestion ingestion-pipeline unstructured-data unstructured-data-analysis unstructured-image unstructured-text
Last synced: 18 Oct 2025
https://github.com/koddachad/dq_tester
A lightweight, simple data quality testing tool.
data database dataengineering dataquality dataqualitycheck
Last synced: 08 Oct 2025
https://github.com/paulescu/backfill-feature-store-with-prefect
Backfill historical OHLC feature in a Feature Store (Hopsworks) using an orchestration tool (Prefect).
backfill dataengineering hopsworks machine-learning ml mlops prefect
Last synced: 30 Oct 2025
https://github.com/jaehyeon-kim/beam-demos
Apache Beam demo projects
apachebeam dataengineering datastreaming docker docker-compose kubernetes python realtimeanalytics
Last synced: 17 Aug 2025
https://github.com/stefen-taime/moderndataengineerpipeline
Building a Robust Data Pipeline: Integrating Proxy Rotation, Kafka, MongoDB, Redis, Logstash, Elasticsearch, and MinIO for Efficient Web Scraping
auth0 connect dataengineering docker-compose elasticsearch fastapi kafka logstash minio mongodb proxy redis
Last synced: 26 Sep 2025
https://github.com/tirendazacademy/hands-on-data-science-with-gcp
Google BigQuery Tutorial
big-data big-data-analytics bigdata bigquery bigquery-ml bigqueryml cloud-computing data-analysis data-analytics data-engineering data-science dataanalysis dataengineering google-bigquery google-cloud-platform machienlearning machine-learning
Last synced: 06 Oct 2025
https://github.com/divithraju/divith-raju-immigration-data-engineering
A Capstone Project that covers several aspects of Data Engineering (Data Exploration, Cleaning, Modeling, Pipelining, Processing)
apachespark bigdata bigdataprocessing bigdataproject capstone-project datacleaning dataengineering datalake datamodeling datapipeline dataprocessing dataschema dataset datawherehouse pandas sql
Last synced: 20 Feb 2025
https://github.com/vigneshss-07/google-cloud-professional-data-engineer-acompleteguide
This repo contains all study, lab and supporting materials for the Udemy course "Google Cloud Professional Data Engineer - A Complete Guide".
big-data bigquery cloud-computing dataengineering elt-pipeline etl-framework gcp-services gcp-storage google-cloud machine-learning
Last synced: 10 Apr 2025
https://github.com/ashton-sidhu/sysmon-extract
Extract logs based on events from Sysmon. Comes as a package, CLI and UI.
data-science dataengineering infosec spark streamlit sysmon threat-intelligence threathunting
Last synced: 06 May 2025
https://github.com/huseyincenik/data_science
Data Science materials
data data-science data-structures data-visualization dataanalysis dataengineering datapreparation dataprocessing datascience dataset time-series time-series-analysis timeline timeseries timeseries-analysis timeseriesforecasting
Last synced: 25 Jul 2025
https://github.com/wittline/livyc
Apache Spark as a Service with Apache Livy Client
apache-livy apache-spark big-data data-engineering dataengineering docker livy-client livy-docker pyhton spark
Last synced: 23 Feb 2025
https://github.com/realdatadriven/central-set-go
This open-source project is a dynamic, data-driven, and configuration-driven application built with Golang. Out of the box, it provides an admin app that allows users to manage multiple applications, offering built-in authentication, user management, role-based access control at the CRUD level for each table, and an ETL workflow powered by DuckDB.
api backend dashboard dataengineering datascience duckdb etl-pipeline etlx golang relational-databases
Last synced: 07 Sep 2025
https://github.com/dina-hosny/sparkify---data-pipelines-with-airflow
Sparkify - Data Pipelines with Airflow - Udacity Data Engineering Expert Track.
airflow aws data-engineering dataengineering etl pipline redshift redshift-cluster udacity
Last synced: 07 Jul 2025
https://github.com/divithraju/divith-raju-openmetadata
Open Standard for Metadata. A Single place to Discover, Collaborate and Get your data right.
automation bigdata bigdataanalytics data data-structures dataengineering datascience hacktoberfest2022 metadata metadata-extraction
Last synced: 20 Feb 2025
https://github.com/hq969/youtube-data-pipeline-aws
Leveraging AWS cloud services, this ETL pipeline transforms YouTube video statistics data. Data is downloaded from Kaggle, uploaded to an S3 bucket, and cataloged using AWS Glue for querying with Athena. AWS Lambda and Glue convert the data to Parquet format and store it in a cleansed S3 bucket. AWS QuickSight then visualizes the materialised data.
aws aws-cloudwatch aws-data-engineering-project aws-glue aws-iam aws-lambda aws-s3 data-engineering-pipeline data-pipeline dataengineering etl etl-pipeline pandas pyhton pyspark spark
Last synced: 12 Apr 2025
https://github.com/divithraju/divith-raju-searchengine-wikipedia
Search engine optimization: a complete search engine experience built on top of a 75 GB Wikipedia corpus with subsecond latency for searches. Results contain wiki pages ordered by TF-IDF relevance based on the given search word(s). From optimized code to the k-way merge sort algorithm, this project addresses latency, indexing, and big data challenges.
algorithms data dataengineering inverted-index linux merge-sort nlp project project-repository python3 serchengine software-engineering ubuntu wikipedia
Last synced: 20 Feb 2025
https://github.com/halovina/djangoorm
Django ORM tutorial for backend engineers
backendengineering dataengineering django orm-framework python
Last synced: 08 Jul 2025
https://github.com/phelipe-sempreboni/data-engineering
Repository for tutorials, information, notes and projects about data engineering.
data dataengineering engine engineering enviroment etl etl-pipeline pipeline project python
Last synced: 04 Oct 2025
https://github.com/hq969/realtime-data-pipeline-for-stack-market-analysis
This repo demonstrates the development of a real-time data pipeline designed to ingest, process, and analyze stock market data. Using cutting-edge tools like Apache Kafka, PostgreSQL, and Python, the pipeline captures stock data in real-time and stores it in a robust data architecture, enabling timely analysis and insights.
aiven-cloud apachekafkademo apis dataengineering etl etl-pipeline pipeline postgresql-database python3 stock-data
Last synced: 04 Oct 2025
https://github.com/dain55788/ibm-data-engineer-lecture-note
Lecture Notes and Practice Materials of IBM Data Engineering Course
data-analysis database dataengineering datawarehouse ibm
Last synced: 03 Apr 2025
https://github.com/pankajsingh09/data_engineering_using_aws
This Repository contains the contents related to Data Engineering Using AWS
aws data-ingestion dataengineering event-bridge lambda-functions pipeline pycharm-ide pyspark python s3 spark
Last synced: 19 Jan 2026
https://github.com/josephmachado/data-quality-w-greatexpectations
Code for data quality with greatexpectations blog
dataengineering dataquality greatexpectations python
Last synced: 15 Apr 2025
https://github.com/andrewdarnall/the-observer
A big data processing pipeline with a topic modeling model (BERTopic) using Mastodon data
apache-kafka apache-spark bertopic dataengineering mastodon tapunict
Last synced: 16 Oct 2025
https://github.com/moh-ayman/stripeapi-to-bq---cfunc-etl
Google Cloud Function built to perform an ETL job that collects Stripe API data and transforms it for import into BigQuery.
bigquery dataengineering etl-pipeline gcp gcp-cloud-functions pandas-dataframe python stripe-api
Last synced: 17 Oct 2025
https://github.com/pankajsingh09/python_for_data_engineering
Learning Python For Data Engineering
dataengineering metplotlib nump oops-in-python pandas python seaborn visualization
Last synced: 27 Mar 2025
https://github.com/nycolasdiaas/zarea-de-risco
Zarea de Risco is a project aimed at public safety in the state of Ceará, focused on collecting and analyzing news and information relevant to the field.
data-science dataengineering docker nosql python
Last synced: 14 Apr 2025
https://github.com/charlesgaydon/colorize-swisssurface3d-lidar
Instructions to colorize SwissSURFACE3D Lidar using SwissIMAGE10 orthoimages, and split the point cloud for later deep learning training.
dataengineering deeplearning lidar swissimage10 swisssurface3d swisstopo
Last synced: 25 Oct 2025
https://github.com/lostdir/movie_dashboard_with_airflow_etl
Real-Time Trending Movies Dashboard: A Streamlit-based dashboard that fetches and displays trending movies, genres, ratings, and descriptions using an ETL pipeline to extract data from the TMDB API, transform it, and load it into a PostgreSQL database, with daily updates managed by Airflow.
airflow dashboard dataengineering docker pipeline streamlit tmdb-api
Last synced: 15 May 2025
https://github.com/pavithra19/apache_spark_people_data_processor
This project is a data processing application built with Apache Spark and Scala. It is designed to efficiently process, analyze and transform large datasets related to people data. It leverages Spark’s distributed computing capabilities to handle scalable data ingestion, cleaning and reporting. Shell scripts are included for Hadoop deployment.
apachespark dataengineering hadoop hdfs scala
Last synced: 19 Jun 2025
https://github.com/f-lab-edu/league-of-legends-data-solution
Benchmarking "League of Legends", this project provides a data solution that keeps data flowing smoothly in real time through an API that generates player behavior events.
Last synced: 12 Jul 2025
https://github.com/thanaphongk37/data-science-and-data-analyst-project
Portfolio of data analysis, data science, and data engineering projects built using Azure services, SQL and Python.
apache-superset azure-storage dashboards data-analysis data-science databricks dataengineering datafactory datapipeline powerbi python sisense sql sql-server visualization
Last synced: 29 Mar 2025
https://github.com/jaehyeon-kim/sam-for-data-professionals
Serverless Application Model (SAM) for Data Professionals
aws aws-lambda dataengineering sam serverless
Last synced: 04 Apr 2025
https://github.com/sadmansakib93/leetcode-sql-50
My MySQL solutions for LeetCode's SQL 50 study plan (Crack SQL Interview in 50 Qs)
data-science dataengineering leetcode-solutions leetcode-sql mysql sql
Last synced: 17 Jun 2025
https://github.com/nicklitwinow/hse-python-capstone-project
This project is a comprehensive data engineering and analytics solution built using modern technologies such as Airflow, Spark, PostgreSQL, MySQL, Kafka, and Docker. It orchestrates data ingestion, processing, replication, streaming, and analytics across multiple containers.
airflow analytics dataengineering docker etl kafka mysql postgresql python spark streaming
Last synced: 30 Dec 2025
https://github.com/divithraju/divith-aju-hadoop-pyspark-pipeline
This project demonstrates the creation of a scalable data processing pipeline for handling and analyzing log data from a hypothetical e-commerce platform. Leveraging Hadoop and PySpark, the pipeline is designed to process large volumes of log files, providing meaningful insights into user behavior, system performance, and sales metrics.
apache-hadoop-framework apache-spark bigdata client data database dataengineering dataingestionframework datapreprocessing documentation ecommerce-platform hdfs pipeline project project-repository pyspark python3 software-engineering
Last synced: 06 Mar 2025
https://github.com/janainacazuza/dev_utils
Dev_Utils is a collection of essential, time-saving scripts designed for developers and data engineers working with Linux and macOS. It helps automate repetitive tasks, like project setup and workflow management, enhancing productivity and streamlining development processes.
automationscripts dataengineering developertools devutils linux macos opensource projectautomation python sql
Last synced: 01 Apr 2025
https://github.com/wsdt/dataproject_2sm
Simple default web application (university assignment)
database dataengineering students
Last synced: 10 Nov 2025
https://github.com/prathmeshyelne/etl-pipeline-for-employee-data-using-data-fusion-airflow
This repository contains code and configuration files for an Extract, Transform, Load (ETL) project using Google Cloud Data Fusion for data extraction, Apache Airflow/Composer for orchestration, and Google BigQuery for data loading.
airflow bigquery dataengineering etl gcp googlecloudplatform
Last synced: 02 Aug 2025
https://github.com/jaehyeon-kim/dbt-cicd-demo
DBT CI/CD Demo
bigquery cicd dataengineering dbt gcp github-actions
Last synced: 12 Jul 2025
https://github.com/jbangtson/wedge_project
In this data engineering project, I analyzed point-of-sale (POS) 🏪 data from the Wedge Co-Op in Minneapolis, spanning January 2010 to January 2017. The dataset captures transaction-level details from a member-owned cooperative, with 75% of transactions generated by member-owners, enabling comprehensive shopping pattern analysis.
Last synced: 27 Oct 2025
https://github.com/benjaminr/udacity-data-engineering
Data Engineering
data dataengineering python udacity
Last synced: 12 Sep 2025
https://github.com/alimarzouk/paris-aq
ELTL pipeline to monitor air quality in the Paris Île-de-France area
airflow airquality big-data bigquery dataengineering gcs spark
Last synced: 01 Aug 2025
https://github.com/tuancamtbtx/dataengineer-principles
Data Engineering Principles
Last synced: 20 Mar 2025
https://github.com/wklee610/de_project
[Data Engineer] Personal Toy Project For Study
Last synced: 31 Mar 2025
https://github.com/olamide100/capstone-project-llm-zoomcamp
Comparative Guide Assistant
argocd data dataengineering docker grafana kubernetes llm-agent mlops-workflow rag strreamlit
Last synced: 22 Sep 2025
https://github.com/cqllum/schema2dwh
⚡ Automatically produce a data model for your database from its information schema using GenAI.
ai data data-structures dataengineering datawarehousing dwh gemini gemini-api genai reporting reporting-tool schema-design
Last synced: 13 Mar 2025
https://github.com/pizofreude/insightflow-retail-economic-pipeline
A data engineering portfolio project using AWS cloud services to analyze correlations between Malaysian retail performance and fuel prices. Features Terraform IaC, ETL/ELT with AWS S3, Glue, SQL analytics via Athena coupled with data transformation via dbt, and workflow orchestration with Kestra.
aws-athena aws-batch aws-glue aws-quicksight aws-s3 dataengineering dbt-cloud docker kestra open-api postgres python sql terraform
Last synced: 30 Dec 2025