An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with data-engineer

A curated list of projects in awesome lists tagged with data-engineer.

https://github.com/data-engineering-community/data-engineering-wiki

The best place to learn data engineering. Built and maintained by the data engineering community.

data data-engineer data-engineering data-modeling data-pipelines database etl sql

Last synced: 14 May 2025

https://github.com/wwwlike/vlife

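Enterprise-grade low-code rapid development platform, including visual page configuration, custom forms, custom reports, a permission-management scaffold application, and automatic front-end and back-end code generation. Its main feature is low-code development: complex CRUD functionality can be built, with front-end and back-end development completed just by writing the data model.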

admin ahooks codegenerator data-engineer form-designer formily low-code querydsl react spring spring-boot template template-project

Last synced: 09 Jan 2026

https://github.com/digitalghost-dev/premier-league

A Data Engineering project. Repository for backend infrastructure and Streamlit app files for a Premier League Dashboard.

bigquery cloud-run data-engineer data-pipeline data-visualization docker firestore go google-cloud prefect python streamlit

Last synced: 30 Sep 2025

https://github.com/unnati-xyz/scalable-data-science-platform

Content for architecting a data science platform for products using Luigi, Spark & Flask.

data-engineer data-pipeline data-science luigi machine-learning rest-api spark

Last synced: 19 Jul 2025

https://github.com/sblack4/google-data-engineering-coursera

For the Coursera specialization https://www.coursera.org/specializations/gcp-data-machine-learning

coursera-specialization data-engineer google-cloud leveraging-unstructured-data

Last synced: 11 Aug 2025

https://github.com/camposvinicius/aws-etl

This is an ETL application on AWS that uses open sales and customer data, available here: https://github.com/camposvinicius/data/blob/main/AdventureWorks.zip. The download is a zipped file containing several .csv files, to which the transformations are applied.

airflow argocd athena aws catalog data data-engineer database emr emr-cluster etl glue kubernetes pipeline postgres pyspark rds spark

Last synced: 30 Jul 2025

https://github.com/huemulsolutions/huemul-bigdatagovernance

Huemul BigDataGovernance is a framework that runs on Spark, Hive, and HDFS. It enables a corporate single-source-of-truth data strategy based on data governance best practices. It lets you implement tables with primary key and foreign key checks when inserting and updating data through the library, along with validation of nulls, text lengths, numeric and date minimums/maximums, unique values, and default values. It also lets you classify fields by the applicability of ARCO rights, to ease the implementation of GDPR-style data protection laws, and to identify security levels and whether any type of encryption is applied. Additionally, it lets you add more complex validation rules on the same table.

bigdata chile cloudera data data-engineer data-engineering data-governance data-warehouse datamart dataquality gdpr hadoop hive hortonworks huemul huemul-bigdatagovernance parquet spark spark-sql trabaja-sobre-spark

Last synced: 26 Apr 2025

https://github.com/lixx21/airflow-dbt-gcp

A comprehensive data pipeline leveraging Airflow, DBT, Google Cloud Platform (GCP), and Docker to extract, transform, and load data seamlessly from a staging layer to a data warehouse and data mart.

airflow bigquery data-engineer dbt gcp

Last synced: 10 Apr 2025

https://github.com/mohidex/data-pipeline-on-gcp

The Real-time Ecommerce Data Collection and Processing project empowers businesses with real-time insights by efficiently extracting, processing, and storing ecommerce data from multiple sources. Combining Golang and Python, this cutting-edge solution streamlines data handling from diverse ecommerce websites.

beautifulsoup data-engineer data-pipeline data-science database datastore dependency-injection firebase firestore gcp go golang google google-cloud pubsub python solid-principles storage web-scraping

Last synced: 14 Apr 2025

https://github.com/justmalhar/thinklikeanengineer

💡 Think Like An Engineer is a roadmap for engineering leadership, a toolkit for growth hacking through engineering, and a manifesto for productivity enhancement

data-engineer data-engineering engineering engineering-management leadership senior-engineer software-engineer system-design

Last synced: 05 Feb 2026

https://github.com/pwenker/data-engineering

My notes for Udacity's Data Engineering Nanodegree.

data-engineer data-engineering quiz udacity udacity-nanodegree

Last synced: 12 Aug 2025

https://github.com/mazzasaverio/pipeline-docs-data-extractor

(Let's build a) Robust pipeline for extracting structured data from various documents

airflow data-engineer data-engineering etl-pipeline large-language-models pdf-text-extraction unstructured

Last synced: 13 Aug 2025

https://github.com/mazzasaverio/data-software-engineering-journal

I decided to start tracking my learning, tips, code, building projects, ideas, and curiosities that I discover on my product engineer development journey. I hope that others might find interesting insights, discover their own paths, and enjoy the journey as well.

blog data-engineer data-engineering nextjs notion personal-website product-engineering react saas

Last synced: 11 Apr 2025

https://github.com/mazzasaverio/data-engineering-save

Data Engineering Notes, Resources & Insights

data-engineer data-engineering product-engineering

Last synced: 02 Sep 2025

https://github.com/1sumer/1sumer

Data Analyst | Python | SQL | Power BI | R | Excel | PySpark | EDA | ETL | Data Visualization | Statistical Analysis | Data Wrangling | Data Modeling | MongoDB | Machine Learning | Deployment | GitHub | AWS

data-analyst data-cleaning-and-preprocessing data-engineer data-modelling data-scientist data-visualization

Last synced: 19 Jan 2026

https://github.com/ortizfram/datacamp-data-engineer-with-python-course

DataCamp Data Engineer with Python course. 73 hours / 19 courses / 2 skill assessments.

all-courses answers career-track data-engineer datacamp datacamp-course python sql

Last synced: 10 Jul 2025

https://github.com/hanan-nawaz/100_days_of_data_engineering

Journey through 100 days of Data Engineering, featuring daily learning, practice, and projects. This repository includes notes, exercises, and code snippets covering essential topics such as GitHub, Python, ETL, data pipelines, and more

data-engineer data-engineering git github python3

Last synced: 21 Jul 2025

https://github.com/mikecerton/the-retail-elt-pipeline-end-to-end

This project designs and implements an ETL pipeline using Apache Airflow (Docker Compose) to ingest, process, and store retail data. AWS S3 acts as the data lake, AWS Redshift as the data warehouse, and Looker Studio as the visualization layer. [Data Engineer]

apache-airflow aws-redshift aws-s3 data-engineer etl-pipeline looker-studio

Last synced: 02 Apr 2025

https://github.com/longnguyen010203/100day-self-learning-de

📚💻⌨ A self-study log covering more than 3 months at 3-4 h/day, in preparation for applying for an intern or fresher position as a Data Engineer in 2024 🥇🏆

data-engineer data-engineering self-learning

Last synced: 01 Feb 2026

https://github.com/mramshaw/ml_at_scale

An operational description of ML at Scale

business-analyst data-engineer data-scientist etl ml production-engineer

Last synced: 02 Feb 2026

https://github.com/mensenvau/mensenvau

I am a Mid Software/Data Engineer.

data-engineer database-development software-engineering

Last synced: 06 Jan 2026

https://github.com/lixx21/spotify-scrapping

Scrapes data from a Spotify playlist URL using Python and Selenium

data-engineer data-scraping python scraping-websites selenium spotify-playlist

Last synced: 03 Apr 2025

https://github.com/jasontanx/prefect-learning

Prefect - Data orchestration tool practice & learning

data-engineer data-orchestration prefect workflow-management

Last synced: 26 Mar 2025

https://github.com/mikecerton/apache_kafka_basic

This repository provides a fundamental understanding of Apache Kafka: its core components, basic Python scripts that demonstrate how to create topics, produce messages, and consume messages, and a docker-compose.yml file for easy setup. [Data Engineer]

apache-kafka data-engineer docker-compose python

Last synced: 18 Mar 2025

https://github.com/janascher/engenharia-de-dados

Solutions to the class activities from Alpha EdTech's Data Engineering course.

bigquery dash-plotly data-engineer data-warehouse google-cloud-platform google-cloud-storage pandas pyspark python

Last synced: 20 Jun 2025

https://github.com/mensenvau/internship_sql_analytics

🚀 Internship SQL (Easy, Advanced)

adventureworks data-engineer internship sql

Last synced: 29 Oct 2025

https://github.com/sanketrs/sql-interview-preparation-questions-with-answers

Designed as a comprehensive resource for aspiring data analysts, data engineers, and database administrators.

business-intelligence data-analyst data-engineer sql-interview-questions sql-interview-questions-answers

Last synced: 15 Jan 2026

https://github.com/dan3002/tiktok-crawler

This is a simple TikTok crawler that can be used to download videos from TikTok. It uses the TikTok API to get the video URL and then downloads the video using the requests library. It can download videos from multiple hashtags or by sound.

crawler-python data-engineer playwright python tiktok

Last synced: 03 Mar 2025

https://github.com/lixx21/airflow-mysql-to-bigquery

ETL to move data from MySQL into BigQuery using Airflow

airflow data-engineer data-pipeline etl

Last synced: 13 Sep 2025

https://github.com/m4tice/kafka-examples

Practice of Apache Kafka

apache-kafka data-engineer linux

Last synced: 16 Mar 2025

https://github.com/mensenvau/leetcode_sql_problems

😊 LeetCode database section solutions

data-engineer data-engineering leetcode leetcode-solutions mysql sql

Last synced: 25 Mar 2025

https://github.com/lixx21/dbt-shopping-data-transform

This project leverages DBT (Data Build Tool) to transform raw shopping data into a well-structured, analytics-ready format

data-engineer dbt docker etl-pipeline

Last synced: 03 Apr 2025

https://github.com/mensenvau/data_engineering_solution_no1

Data Engineer Lead Analyst Case Study

data-analyst data-engineer sql

Last synced: 13 Oct 2025

https://github.com/pyk/belajar-data-engineering

A guide to becoming a Data Engineer

data-engineer data-engineering

Last synced: 31 Jan 2026

https://github.com/mitgar14/etl-workshop-1

Workshop #1 (Data Engineer) for the ETL course using Pandas, Matplotlib, SQLAlchemy and Power BI for the creation of the dashboard.

data-engineer data-visualization etl pandas postgresql powerbi python sqlalchemy

Last synced: 24 Mar 2025

https://github.com/apancoast/healthcare-deserts-and-public-transit

This dbt-based project aims to analyze the intersection of healthcare accessibility and public transit coverage in Mecklenburg County, NC.

analysis analytics-engineering data-engineer dbt healthcare hpsa hrsa public-data public-transit

Last synced: 27 Oct 2025

https://github.com/jasontanx/data-engineering-zoomcamp-23

Data Engineering Zoomcamp from DataTalksClub

data-engineer datatalksclub

Last synced: 11 Jul 2025

https://github.com/swidvey/snowflake-task-etl

Example of a simple AWS-to-Snowflake ETL task

data-engineer elt snowflake sql

Last synced: 09 Apr 2025

https://github.com/rifa8/fundamental-de

Learning about fundamental data engineering

data-engineer data-engineering normalization

Last synced: 01 Feb 2026

https://github.com/lixx21/kafka-ibm-stocking

Moves IBM stock data to a JSON file in real time using Kafka

data-engineer kafka kafka-streams

Last synced: 03 Apr 2025

https://github.com/lasbrdev/sgbd-sql-nosql-engenheiro-de-dados-dio

A description of the role of relational and non-relational database management systems (DBMS) in the context of a Data Engineer's work

data-engineer linux nosql-databases relational-databases sgbd-relacionais

Last synced: 02 Sep 2025

https://github.com/fadhiildzaki/etl_superstore

This project automates ETL for Superstore data, extracting from PostgreSQL, transforming in Python, and reloading into PostgreSQL weekly. I conducted data analysis in Jupyter Notebook and built a Metabase dashboard for insights.

airflow data-analyst data-engineer data-science etl-automation metabase postgresql python

Last synced: 22 Mar 2025

https://github.com/awinardi1004/spark-data-engineering-pipeline

End-to-end big data pipeline using Apache Spark & Hadoop (HDFS) with the Olist E-commerce dataset. Covers data ingestion, cleaning, integration, optimization, and serving.

data-engineer data-pipeline hadoop spark

Last synced: 07 Oct 2025

https://github.com/mikma03/azure_data_engineer_associate

Materials and resources for Azure certification - DP-203

azure data-engineer microsoft-azure resources

Last synced: 26 Feb 2025

https://github.com/thangbuiq/thangbuiq

⭐️ Check out my profile and consider starring one of my projects if you like it!

data-engineer data-science devops mlops

Last synced: 04 Jan 2026

https://github.com/ahbiels/fegtec

FegTec is a fictitious company that wants to transfer Parquet files containing customer data from the AWS cloud to Google Cloud

aws bucket cloudfunctions data-engineer gcp pandas parquet-files python transfer-data

Last synced: 27 Feb 2025

https://github.com/mitgar14/etl-workshop-2

Workshop #2 (ETL process using Airflow) for the ETL course using Apache Airflow to build a data pipeline.

airflow data-engineer data-engineering data-visualization etl pandas postgresql powerbi python sqlalchemy

Last synced: 24 Mar 2025