An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with dataengineering

A curated list of projects in awesome lists tagged with dataengineering.

https://github.com/dataexpert-io/data-engineer-handbook

This is a repo with links to everything you'd ever want to learn about data engineering

apachespark awesome bigdata data dataengineering sql

Last synced: 28 Sep 2025

https://github.com/DataExpert-io/data-engineer-handbook

This is a repo with links to everything you'd ever want to learn about data engineering

apachespark awesome bigdata data dataengineering sql

Last synced: 04 Apr 2025

https://github.com/open-metadata/openmetadata

OpenMetadata is a unified metadata platform for data discovery, data observability, and data governance powered by a central metadata repository, in-depth column level lineage, and seamless team collaboration.

data-catalog data-collaboration data-contracts data-discovery data-governance data-lineage data-observability data-profiling data-quality data-quality-checks data-science data-validation datadiscovery dataengineering dataquality dbt hacktoberfest metadata metadata-management snowflake

Last synced: 12 Nov 2025

https://github.com/open-metadata/OpenMetadata

OpenMetadata is a unified metadata platform for data discovery, data observability, and data governance powered by a central metadata repository, in-depth column level lineage, and seamless team collaboration.

data-catalog data-collaboration data-contracts data-discovery data-governance data-lineage data-observability data-profiling data-quality data-quality-checks data-science data-validation datadiscovery dataengineering dataquality dbt hacktoberfest metadata metadata-management snowflake

Last synced: 15 Mar 2025

https://github.com/tobikodata/sqlmesh

Scalable and efficient data transformation framework - backwards compatible with dbt.

dataengineering dataops dbt elt etl python sql transformation

Last synced: 21 Jan 2026

https://github.com/TobikoData/sqlmesh

Efficient data transformation and modeling framework that is backwards compatible with dbt.

dataengineering dataops dbt elt etl python sql transformation

Last synced: 26 Mar 2025

https://github.com/514-labs/moosestack

The developer framework for building analytics into your app on top of ClickHouse, Redpanda and other high-performance analytical infrastructure

analytics data dataengineering deployment framework insights metrics python rust typescript

Last synced: 22 Jan 2026

https://github.com/Datavault-UK/automate-dv

A free to use dbt package for creating and loading Data Vault 2.0 compliant Data Warehouses (powered by dbt, an open source data engineering tool, registered trademark of dbt Labs)

data-vault dataengineering datalake datavault datavault20 datawarehouse datawarehousing dbt elt etl metadata snowflake sql

Last synced: 13 May 2025

https://github.com/awslabs/aws-ddk

An open source development framework to help you build data workflows and modern data architecture on AWS.

aws dataengineering dataops python

Last synced: 14 Jan 2026

https://github.com/mehd-io/pypi-duck-flow

End-to-end data engineering project to get insights from PyPI using Python, DuckDB, MotherDuck & Evidence

dataengineering duckdb etl python

Last synced: 06 Oct 2025

https://github.com/514-labs/moose

The developer framework for your data & analytics stack

analytics data dataengineering deployment framework insights metrics python rust typescript

Last synced: 05 Apr 2025

https://github.com/abhishek-ch/data-machinelearning-the-boring-way

Build & Learn Data Engineering, Machine Learning over Kubernetes. No Shortcut approach.

data-infrastructure dataengineering datascience kubernetes machine-learning mlops

Last synced: 21 Mar 2025

https://github.com/kislerdm/data-engineering-interviews

Data engineering interview Q&A for the data community, by the data community

dataengineering interview-questions kafka linux opensource python spark sql

Last synced: 29 Apr 2025

https://github.com/olist/work-at-olist-data

Apply for a job at Olist's Data Team: https://olist.gupy.io/

analytics data dataengineering datascience dataset julia machinelearning pandas python r sql

Last synced: 25 Jun 2025

https://github.com/josephmachado/de_project

Step by step instructions to create a production-ready data pipeline

dataengineering datapipeline python

Last synced: 15 Apr 2025

https://github.com/aakashnand/trino-ranger-demo

Tutorial on how to set up Trino and Apache Ranger using Docker

dataengineering datagovernance docker hacktoberfest ranger trino tutorial

Last synced: 04 Apr 2025

https://github.com/airscholar/sparkingflow

This project demonstrates how to use Apache Airflow to submit jobs to an Apache Spark cluster in different programming languages, using Python, Scala and Java as examples.

apache-airflow dataengineering docker java pyspark scala spark

Last synced: 10 Apr 2025

https://github.com/airscholar/realtimestreamingengineering

This project serves as a comprehensive guide to building an end-to-end data engineering pipeline using TCP/IP Socket, Apache Spark, OpenAI LLM, Kafka and Elasticsearch. It covers each stage: data acquisition, processing, sentiment analysis with ChatGPT, production to a Kafka topic, and connection to Elasticsearch.

apache-spark chatgpt dataengineering elasticsearch kafka openai-api tcp-socket

Last synced: 10 Apr 2025

https://github.com/josephmachado/data-engineering-interview-series

Repository for Data Engineering Interview Series

data-structures dataengineering interview

Last synced: 15 Apr 2025

https://github.com/wittline/pyspark-on-aws-emr

The goal of this project is to offer an AWS EMR template using Spot Fleet and On-Demand Instances that you can use quickly. Just focus on writing PySpark code.

aws aws-emr big-data big-data-analytics dataengineering ec2-spot ec2-spot-instances emr-cluster pyspark python spark wordcloud-generator

Last synced: 13 Apr 2025

https://github.com/waylonwalker/kedro-static-viz

kedro cli plugin for generating a static kedro viz site (html, css, js) that can be deployed on many serverless tools.

data dataengineering datapipeline kedro kedro-plugin python

Last synced: 05 May 2025

https://github.com/WaylonWalker/kedro-static-viz

kedro cli plugin for generating a static kedro viz site (html, css, js) that can be deployed on many serverless tools.

data dataengineering datapipeline kedro kedro-plugin python

Last synced: 24 Mar 2025

https://github.com/josephmachado/e2e_datapipeline_test

Example repo to create end-to-end tests for a data pipeline.

aws dataengineering moto pytest python3 testing

Last synced: 15 Apr 2025

https://github.com/airscholar/footballdataengineering

An end-to-end data engineering pipeline that fetches data from Wikipedia, cleans and transforms it with Apache Airflow and saves it on Azure Data Lake. Other processing takes place on Azure Data Factory, Azure Synapse and Tableau.

apache-airflow azure-data-factory azure-data-lake-gen2 azure-databricks azure-synapse-analytics data-engineering dataengineering

Last synced: 10 Apr 2025

https://github.com/waylonwalker/kedro-action

A GitHub Action to lint, test, build-docs, package, and run your kedro pipelines. Supports any Python version you give it (that is also supported by pyenv).

actions dataengineering datapipeline kedro

Last synced: 05 May 2025

https://github.com/anuran-roy/serpytor

A distributed, low-code, end-to-end data collection and analysis tool for data folks. Take the pain out of data collection from your pipeline!

data dataengineering datascience distributed-computing distributed-systems low-code lowcode open-source pipeline python python3

Last synced: 16 May 2025

https://github.com/tuanai-vireox/gcp-professional-data-engineer

GCP Professional Data Engineer Certification - Learning

data dataengineering gcp professional

Last synced: 22 Aug 2025

https://github.com/adilkhash/luigi-course-materials

Materials for the course Introduction to Data Engineering: Data Pipelines

dataeng dataengineering datapipeline luigi python workflow-engine

Last synced: 02 Sep 2025

https://github.com/caogiathinh/urban-mobility-elt-pipeline

Built a complete end-to-end data platform to ingest, process, and analyze complex, multi-source public datasets for business intelligence.

dataengineering docker gooogle-cloud kestra pandas postgresql-database python spark sql terrraform

Last synced: 07 Oct 2025

https://github.com/mikma03/devops-mlops

Tools for DevOps and MLOps. Materials and projects. New technologies and infrastructure review.

airflow ansible aws azure cicd dataengineering devops-tools jenkins kubernetes terraform

Last synced: 16 Jun 2025

https://github.com/abhishek-ch/dataengineering-agent

Data Engineering Agent Using OpenAI Function Calling

dataengineering llm openai openai-function-call python

Last synced: 17 Jun 2025

https://github.com/recodehive/recode-website

recodehive helps you learn and master data skills, and encourages you to contribute to open source.

data data-science dataengineering opensource python sql tutorials website

Last synced: 27 Jul 2025

https://github.com/koddachad/dq_tester

A lightweight, simple data quality testing tool.

data database dataengineering dataquality dataqualitycheck

Last synced: 08 Oct 2025

https://github.com/paulescu/backfill-feature-store-with-prefect

Backfill historical OHLC feature in a Feature Store (Hopsworks) using an orchestration tool (Prefect).

backfill dataengineering hopsworks machine-learning ml mlops prefect

Last synced: 30 Oct 2025

https://github.com/stefen-taime/moderndataengineerpipeline

Building a Robust Data Pipeline: Integrating Proxy Rotation, Kafka, MongoDB, Redis, Logstash, Elasticsearch, and MinIO for Efficient Web Scraping

auth0 connect dataengineering docker-compose elasticsearch fastapi kafka logstash minio mongodb proxy redis

Last synced: 26 Sep 2025

https://github.com/divithraju/divith-raju-immigration-data-engineering

A Capstone Project that covers several aspects of Data Engineering (Data Exploration, Cleaning, Modeling, Pipelining, Processing)

apachespark bigdata bigdataprocessing bigdataproject capstone-project datacleaning dataengineering datalake datamodeling datapipeline dataprocessing dataschema dataset datawherehouse pandas sql

Last synced: 20 Feb 2025

https://github.com/vigneshss-07/google-cloud-professional-data-engineer-acompleteguide

This repo contains all study, lab, and supporting materials for the Udemy course "Google Cloud Professional Data Engineer - A Complete Guide".

big-data bigquery cloud-computing dataengineering elt-pipeline etl-framework gcp-services gcp-storage google-cloud machine-learning

Last synced: 10 Apr 2025

https://github.com/ashton-sidhu/sysmon-extract

Extract logs based on events from Sysmon. Comes as a package, CLI, and UI.

data-science dataengineering infosec spark streamlit sysmon threat-intelligence threathunting

Last synced: 06 May 2025

https://github.com/realdatadriven/central-set-go

This open-source project is a dynamic, data-driven, and configuration-driven application built with Golang. Out of the box, it provides an admin app that allows users to manage multiple applications, offering built-in authentication, user management, role-based access control at the CRUD level for each table, and an ETL workflow powered by DuckDB.

api backend dashboard dataengineering datascience duckdb etl-pipeline etlx golang relational-databases

Last synced: 07 Sep 2025

https://github.com/dina-hosny/sparkify---data-pipelines-with-airflow

Sparkify - Data Pipelines with Airflow - Udacity Data Engineering Expert Track.

airflow aws data-engineering dataengineering etl pipline redshift redshift-cluster udacity

Last synced: 07 Jul 2025

https://github.com/divithraju/divith-raju-openmetadata

Open Standard for Metadata. A Single place to Discover, Collaborate and Get your data right.

automation bigdata bigdataanalytics data data-structures dataengineering datascience hacktoberfest2022 metadata metadata-extraction

Last synced: 20 Feb 2025

https://github.com/hq969/youtube-data-pipeline-aws

Leveraging AWS Cloud Services, an ETL pipeline transforms YouTube video statistics data. Data is downloaded from Kaggle, uploaded to an S3 bucket, and cataloged using AWS Glue for querying with Athena. AWS Lambda and Glue convert the data to Parquet format and store it in a cleansed S3 bucket. AWS QuickSight then visualizes the materialised data.

aws aws-cloudwatch aws-data-engineering-project aws-glue aws-iam aws-lambda aws-s3 data-engineering-pipeline data-pipeline dataengineering etl etl-pipeline pandas pyhton pyspark spark

Last synced: 12 Apr 2025

https://github.com/divithraju/divith-raju-searchengine-wikipedia

Search engine optimization: a complete search engine experience built on top of a 75 GB Wikipedia corpus with subsecond latency for searches. Results contain wiki pages ordered by TF/IDF relevance based on the given search word(s). From optimized code to the K-way merge sort algorithm, this project addresses latency, indexing, and big data challenges.

algorithms data dataengineering inverted-index linux merge-sort nlp project project-repository python3 serchengine software-engineering ubuntu wikipedia

Last synced: 20 Feb 2025

https://github.com/halovina/djangoorm

Django ORM tutorial for backend engineers

backendengineering dataengineering django orm-framework python

Last synced: 08 Jul 2025

https://github.com/phelipe-sempreboni/data-engineering

Repository for tutorials, information, notes and projects about data engineering.

data dataengineering engine engineering enviroment etl etl-pipeline pipeline project python

Last synced: 04 Oct 2025

https://github.com/hq969/realtime-data-pipeline-for-stack-market-analysis

This repo demonstrates the development of a real-time data pipeline designed to ingest, process, and analyze stock market data. Using cutting-edge tools like Apache Kafka, PostgreSQL, and Python, the pipeline captures stock data in real-time and stores it in a robust data architecture, enabling timely analysis and insights.

aiven-cloud apachekafkademo apis dataengineering etl etl-pipeline pipeline postgresql-database python3 stock-data

Last synced: 04 Oct 2025

https://github.com/dain55788/ibm-data-engineer-lecture-note

Lecture Notes and Practice Materials of IBM Data Engineering Course

data-analysis database dataengineering datawarehouse ibm

Last synced: 03 Apr 2025

https://github.com/m-farag/etlworkers

A data engineering package

dataengineering python

Last synced: 07 Apr 2025

https://github.com/pankajsingh09/data_engineering_using_aws

This repository contains content related to Data Engineering Using AWS

aws data-ingestion dataengineering event-bridge lambda-functions pipeline pycharm-ide pyspark python s3 spark

Last synced: 19 Jan 2026

https://github.com/josephmachado/data-quality-w-greatexpectations

Code for the data quality with Great Expectations blog

dataengineering dataquality greatexpectations python

Last synced: 15 Apr 2025

https://github.com/andrewdarnall/the-observer

A big data processing pipeline with a topic model (BERTopic) using Mastodon data

apache-kafka apache-spark bertopic dataengineering mastodon tapunict

Last synced: 16 Oct 2025

https://github.com/moh-ayman/stripeapi-to-bq---cfunc-etl

Google Cloud Function built to perform an ETL job that collects Stripe API data and transforms it for import into BigQuery.

bigquery dataengineering etl-pipeline gcp gcp-cloud-functions pandas-dataframe python stripe-api

Last synced: 17 Oct 2025

https://github.com/nycolasdiaas/zarea-de-risco

Zarea de Risco is a project aimed at public safety in the state of Ceará, focusing on collecting and analyzing news and information relevant to the field

data-science dataengineering docker nosql python

Last synced: 14 Apr 2025

https://github.com/charlesgaydon/colorize-swisssurface3d-lidar

Instructions to colorize SwissSURFACE3D Lidar using SwissIMAGE10 orthoimages, and split the point cloud for later deep learning training.

dataengineering deeplearning lidar swissimage10 swisssurface3d swisstopo

Last synced: 25 Oct 2025

https://github.com/lostdir/movie_dashboard_with_airflow_etl

Real-Time Trending Movies Dashboard: A Streamlit-based dashboard that fetches and displays trending movies, genres, ratings, and descriptions using an ETL pipeline to extract data from the TMDB API, transform it, and load it into a PostgreSQL database, with daily updates managed by Airflow.

airflow dashboard dataengineering docker pipeline streamlit tmdb-api

Last synced: 15 May 2025

https://github.com/pavithra19/apache_spark_people_data_processor

This project is a data processing application built with Apache Spark and Scala. It is designed to efficiently process, analyze and transform large datasets related to people data. It leverages Spark’s distributed computing capabilities to handle scalable data ingestion, cleaning and reporting. Shell scripts are included for Hadoop deployment.

apachespark dataengineering hadoop hdfs scala

Last synced: 19 Jun 2025

https://github.com/f-lab-edu/league-of-legends-data-solution

Benchmarking 'League of Legends', this project provides a data solution so that data flows smoothly in real time through an API that generates player behavior events.

airflow dataengineering spark

Last synced: 12 Jul 2025

https://github.com/thanaphongk37/data-science-and-data-analyst-project

Portfolio of data analysis, data science, and data engineering projects built using Azure services, SQL, and Python.

apache-superset azure-storage dashboards data-analysis data-science databricks dataengineering datafactory datapipeline powerbi python sisense sql sql-server visualization

Last synced: 29 Mar 2025

https://github.com/jaehyeon-kim/sam-for-data-professionals

Serverless Application Model (SAM) for Data Professionals

aws aws-lambda dataengineering sam serverless

Last synced: 04 Apr 2025

https://github.com/sadmansakib93/leetcode-sql-50

My MySQL solutions for LeetCode's SQL 50 study plan (Crack SQL Interview in 50 Qs)

data-science dataengineering leetcode-solutions leetcode-sql mysql sql

Last synced: 17 Jun 2025

https://github.com/nicklitwinow/hse-python-capstone-project

This project is a comprehensive data engineering and analytics solution built using modern technologies such as Airflow, Spark, PostgreSQL, MySQL, Kafka, and Docker. It orchestrates data ingestion, processing, replication, streaming, and analytics across multiple containers.

airflow analytics dataengineering docker etl kafka mysql postgresql python spark streaming

Last synced: 30 Dec 2025

https://github.com/divithraju/divith-aju-hadoop-pyspark-pipeline

This project demonstrates the creation of a scalable data processing pipeline for handling and analyzing log data from a hypothetical e-commerce platform. Leveraging Hadoop and PySpark, the pipeline is designed to process large volumes of log files, providing meaningful insights into user behavior, system performance, and sales metrics.

apache-hadoop-framework apache-spark bigdata client data database dataengineering dataingestionframework datapreprocessing documentation ecommerce-platform hdfs pipeline project project-repository pyspark python3 software-engineering

Last synced: 06 Mar 2025

https://github.com/janainacazuza/dev_utils

Dev_Utils is a collection of essential, time-saving scripts designed for developers and data engineers working with Linux and macOS. It helps automate repetitive tasks, like project setup and workflow management, enhancing productivity and streamlining development processes.

automationscripts dataengineering developertools devutils linux macos opensource projectautomation python sql

Last synced: 01 Apr 2025

https://github.com/wsdt/dataproject_2sm

Simple default web application (university assignment)

database dataengineering students

Last synced: 10 Nov 2025

https://github.com/prathmeshyelne/etl-pipeline-for-employee-data-using-data-fusion-airflow

This repository contains code and configuration files for an Extract, Transform, Load (ETL) project using Google Cloud Data Fusion for data extraction, Apache Airflow/Composer for orchestration, and Google BigQuery for data loading.

airflow bigquery dataengineering etl gcp googlecloudplatform

Last synced: 02 Aug 2025

https://github.com/jbangtson/wedge_project

In this data engineering project, I analyzed point-of-sale (POS) 🏪 data from the Wedge Co-Op in Minneapolis, spanning January 2010 to January 2017. The dataset captures transaction-level details from a member-owned cooperative, with 75% of transactions generated by member-owners, enabling comprehensive shopping pattern analysis.

dataengineering gbq python

Last synced: 27 Oct 2025

https://github.com/alimarzouk/paris-aq

ELTL pipeline to monitor air quality in the Paris Île-de-France area

airflow airquality big-data bigquery dataengineering gcs spark

Last synced: 01 Aug 2025

https://github.com/tuancamtbtx/dataengineer-principles

Data Engineering Principles

dataengineering principles

Last synced: 20 Mar 2025

https://github.com/wklee610/de_project

[Data Engineer] Personal Toy Project For Study

data dataengineering

Last synced: 31 Mar 2025

https://github.com/cqllum/schema2dwh

⚡ Automatically produce a data model for your database from its information schema using GenAI.

ai data data-structures dataengineering datawarehousing dwh gemini gemini-api genai reporting reporting-tool schema-design

Last synced: 13 Mar 2025

https://github.com/pizofreude/insightflow-retail-economic-pipeline

A data engineering portfolio project using AWS cloud services to analyze correlations between Malaysian retail performance and fuel prices. Features Terraform IaC, ETL/ELT with AWS S3, Glue, SQL analytics via Athena coupled with data transformation via dbt, and workflow orchestration with Kestra.

aws-athena aws-batch aws-glue aws-quicksight aws-s3 dataengineering dbt-cloud docker kestra open-api postgres python sql terraform

Last synced: 30 Dec 2025