Projects in Awesome Lists tagged with data-engineer
A curated list of projects in awesome lists tagged with data-engineer.
https://github.com/andkret/Cookbook
The Data Engineering Cookbook
best-practices big-data cookbook data-engineer data-engineering
Last synced: 14 Mar 2025
https://github.com/andkret/cookbook
The Data Engineering Cookbook
best-practices big-data cookbook data-engineer data-engineering
Last synced: 25 Jan 2026
https://github.com/data-engineering-community/data-engineering-wiki
The best place to learn data engineering. Built and maintained by the data engineering community.
data data-engineer data-engineering data-modeling data-pipelines database etl sql
Last synced: 14 May 2025
https://github.com/wwwlike/vlife
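Enterprise-grade low-code rapid development platform, including visual page configuration, custom forms, custom reports, a permission-management scaffold application, and automatic front-end and back-end code generation. Its main feature is low-code development: complex CRUD functionality can be delivered, front end and back end, by writing only the data model.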
admin ahooks codegenerator data-engineer form-designer formily low-code querydsl react spring spring-boot template template-project
Last synced: 09 Jan 2026
https://github.com/vmware/versatile-data-kit
One framework to develop, deploy and operate data workflows with Python and SQL.
analytics data data-engineer data-engineering data-engineering-pipeline data-lineage data-pipelines data-science data-structures data-warehouse database dataops elt etl pipeline python snowflake sql trino warehouse
Last synced: 15 May 2025
https://github.com/digitalghost-dev/premier-league
A Data Engineering project. Repository for backend infrastructure and Streamlit app files for a Premier League Dashboard.
bigquery cloud-run data-engineer data-pipeline data-visualization docker firestore go google-cloud prefect python streamlit
Last synced: 30 Sep 2025
https://github.com/unnati-xyz/scalable-data-science-platform
Content for architecting a data science platform for products using Luigi, Spark & Flask.
data-engineer data-pipeline data-science luigi machine-learning rest-api spark
Last synced: 19 Jul 2025
https://github.com/ahmetfurkandemir/data-engineering-project-with-hdfs-and-kafka
Data Engineering Project with Hadoop HDFS and Kafka
data data-engineer data-engineering data-engineering-pipeline docker docker-compose hadoop hadoop-filesystem hadoop-hdfs hdfs hdfs-client hdfs-dfs kafka kafka-consumer kafka-producer kafka-ui kafkaui pipline python python-hdfs-client
Last synced: 15 Apr 2025
https://github.com/Thomas-George-T/Thomas-George-T
Readme for my :octocat: Profile
data-engineer data-science github github-profile icons machine-learning profile-readme readme svg svg-icons
Last synced: 15 Mar 2025
https://github.com/sblack4/google-data-engineering-coursera
For the Coursera specialization https://www.coursera.org/specializations/gcp-data-machine-learning
coursera-specialization data-engineer google-cloud leveraging-unstructured-data
Last synced: 11 Aug 2025
https://github.com/devinterview-io/data-engineer-interview-questions
🟣 Data Engineer interview questions and answers to help you prepare for your next machine learning and data science interview in 2024.
ai-interview-questions coding-interview-questions coding-interviews data-engineer data-engineer-interview-questions data-engineer-questions data-engineer-tech-interview data-science data-science-interview data-science-interview-questions data-scientist-interview interview-practice interview-preparation machine-learning machine-learning-and-data-science machine-learning-interview machine-learning-interview-questions software-engineer-interview technical-interview-questions
Last synced: 08 Jan 2026
https://github.com/camposvinicius/aws-etl
This is an ETL application on AWS that uses general open sales and customer data, which you can find here: https://github.com/camposvinicius/data/blob/main/AdventureWorks.zip. It is a zipped file containing several .csv files that we will transform.
airflow argocd athena aws catalog data data-engineer database emr emr-cluster etl glue kubernetes pipeline postgres pyspark rds spark
Last synced: 30 Jul 2025
https://github.com/aaaastark/top-big-data-scientist-questions-for-interview
Top Big Tech Data Science Questions
ai alibaba amazon apple computer-science computer-vision data-engineer data-science deep-learning facebook google ibm intel interview-questions machine-learning netflix nvidia orcale spacex tesla
Last synced: 04 Feb 2026
https://github.com/huemulsolutions/huemul-bigdatagovernance
Huemul BigDataGovernance is a framework that runs on Spark, Hive, and HDFS. It supports a corporate single-source-of-truth data strategy based on data governance best practices. It lets you implement tables with primary key and foreign key control when inserting and updating data through the library, along with validation of nulls, text lengths, minimum/maximum values for numbers and dates, unique values, and default values. It also lets you classify fields by the applicability of ARCO rights to ease compliance with GDPR-style data protection laws, and identify security levels and whether any kind of encryption is applied. In addition, it lets you add more complex validation rules on the same table.
bigdata chile cloudera data data-engineer data-engineering data-governance data-warehouse datamart dataquality gdpr hadoop hive hortonworks huemul huemul-bigdatagovernance parquet spark spark-sql trabaja-sobre-spark
Last synced: 26 Apr 2025
https://github.com/lixx21/airflow-dbt-gcp
A comprehensive data pipeline leveraging Airflow, DBT, Google Cloud Platform (GCP), and Docker to extract, transform, and load data seamlessly from a staging layer to a data warehouse and data mart.
airflow bigquery data-engineer dbt gcp
Last synced: 10 Apr 2025
https://github.com/bayoadejare/lightning-containers
Docker powered starter for geospatial analysis of lightning atmospheric data.
clustering-analysis csv-files data-engineer data-engineering-pipeline data-warehouse databases docker jupyter machine-learning-algorithms noaa-weather orchestrator pandas python3 spatialite sqlite streamlit-dashboard
Last synced: 07 Apr 2025
https://github.com/mohidex/data-pipeline-on-gcp
The Real-time Ecommerce Data Collection and Processing project empowers businesses with real-time insights by efficiently extracting, processing, and storing ecommerce data from multiple sources. Combining Golang and Python, this cutting-edge solution streamlines data handling from diverse ecommerce websites.
beautifulsoup data-engineer data-pipeline data-science database datastore dependency-injection firebase firestore gcp go golang google google-cloud pubsub python solid-principles storage web-scraping
Last synced: 14 Apr 2025
https://github.com/justmalhar/thinklikeanengineer
💡 Think Like An Engineer is a roadmap for engineering leadership, a toolkit for growth hacking through engineering, and a manifesto for productivity enhancement
data-engineer data-engineering engineering engineering-management leadership senior-engineer software-engineer system-design
Last synced: 05 Feb 2026
https://github.com/pwenker/data-engineering
My notes for Udacity's Data Engineering Nanodegree.
data-engineer data-engineering quiz udacity udacity-nanodegree
Last synced: 12 Aug 2025
https://github.com/mazzasaverio/pipeline-docs-data-extractor
(Let's build a) Robust pipeline for extracting structured data from various documents
airflow data-engineer data-engineering etl-pipeline large-language-models pdf-text-extraction unstructured
Last synced: 13 Aug 2025
https://github.com/mazzasaverio/data-software-engineering-journal
I decided to start tracking my learning, tips, code, building projects, ideas, and curiosities that I discover on my product engineer development journey. I hope that others might find interesting insights, discover their own paths, and enjoy the journey as well.
blog data-engineer data-engineering nextjs notion personal-website product-engineering react saas
Last synced: 11 Apr 2025
https://github.com/mazzasaverio/data-engineering-save
Data Engineering Notes, Resources & Insights
data-engineer data-engineering product-engineering
Last synced: 02 Sep 2025
https://github.com/arverma/data-engineer-interview-experience
My interview experience with the companies I interviewed with
big-data data data-engineer data-engineering engineering interview interview-practice interview-preparation interview-questions python3 spark sql
Last synced: 08 Oct 2025
https://github.com/1sumer/1sumer
Data Analyst | Python | SQL | Power BI | R | Excel | PySpark | EDA | ETL | Data Visualization | Statistical Analysis | Data Wrangling | Data Modeling | MongoDB | Machine Learning | Deployment | GitHub | AWS
data-analyst data-cleaning-and-preprocessing data-engineer data-modelling data-scientist data-visualization
Last synced: 19 Jan 2026
https://github.com/ortizfram/datacamp-data-engineer-with-python-course
DataCamp Data Engineer with Python course: 73 hours, 19 courses, 2 skill assessments.
all-courses answers career-track data-engineer datacamp datacamp-course python sql
Last synced: 10 Jul 2025
https://github.com/hanan-nawaz/100_days_of_data_engineering
Journey through 100 days of Data Engineering, featuring daily learning, practice, and projects. This repository includes notes, exercises, and code snippets covering essential topics such as GitHub, Python, ETL, data pipelines, and more
data-engineer data-engineering git github python3
Last synced: 21 Jul 2025
https://github.com/mikecerton/the-retail-elt-pipeline-end-to-end
This project designs and implements an ETL pipeline using Apache Airflow (Docker Compose) to ingest, process, and store retail data. AWS S3 acts as the data lake, AWS Redshift as the data warehouse, and Looker Studio for visualization. [Data Engineer]
apache-airflow aws-redshift aws-s3 data-engineer etl-pipeline looker-studio
Last synced: 02 Apr 2025
https://github.com/longnguyen010203/100day-self-learning-de
📚💻⌨ Self-study process for more than 3 months with 3-4h/day to prepare for the journey of applying for an intern or fresher position as a Data Engineer in 2024 ️🥇️🏆
data-engineer data-engineering self-learning
Last synced: 01 Feb 2026
https://github.com/mramshaw/ml_at_scale
An operational description of ML at Scale
business-analyst data-engineer data-scientist etl ml production-engineer
Last synced: 02 Feb 2026
https://github.com/mensenvau/mensenvau
I am a Mid Software/Data Engineer.
data-engineer database-development software-engineering
Last synced: 06 Jan 2026
https://github.com/lixx21/spotify-scrapping
Scraping data from Spotify Playlist URL using Python and Selenium
data-engineer data-scraping python scraping-websites selenium spotify-playlist
Last synced: 03 Apr 2025
https://github.com/jasontanx/prefect-learning
Prefect - Data orchestration tool practice & learning
data-engineer data-orchestration prefect workflow-management
Last synced: 26 Mar 2025
https://github.com/mikecerton/apache_kafka_basic
This repository provides a fundamental understanding of Apache Kafka, including its core components, basic Python scripts to demonstrate how to create topics, produce messages, and consume messages, as well as a docker-compose.yml file for easy setup. [Data Engineer]
apache-kafka data-engineer docker-compose python
Last synced: 18 Mar 2025
https://github.com/janascher/engenharia-de-dados
Solutions to the exercises from Alpha EdTech's Data Engineering classes.
bigquery dash-plotly data-engineer data-warehouse google-cloud-platform google-cloud-storage pandas pyspark python
Last synced: 20 Jun 2025
https://github.com/mensenvau/internship_sql_analytics
🚀 Internship SQL (Easy, Advanced)
adventureworks data-engineer internship sql
Last synced: 29 Oct 2025
https://github.com/sanketrs/sql-interview-preparation-questions-with-answers
Designed as a comprehensive resource for aspiring data analysts, data engineers, and database administrators.
business-intelligence data-analyst data-engineer sql-interview-questions sql-interview-questions-answers
Last synced: 15 Jan 2026
https://github.com/dan3002/tiktok-crawler
This is a simple TikTok crawler for downloading videos from TikTok. It uses the TikTok API to get the video URL and then downloads the video with the requests library. It can download videos from multiple hashtags or by sound.
crawler-python data-engineer playwright python tiktok
Last synced: 03 Mar 2025
https://github.com/lixx21/airflow-mysql-to-bigquery
ETL to move data from MySQL into BigQuery using Airflow
airflow data-engineer data-pipeline etl
Last synced: 13 Sep 2025
https://github.com/nottherealtar/data_engineering_assesments
assesments data data-engineer interview-questions interview-test
Last synced: 13 Sep 2025
https://github.com/janainacazuza/janainacazuza
A little about me
bigdata data-architecture data-engineer data-engineering data-pipeline database python sql
Last synced: 26 Jun 2025
https://github.com/m4tice/kafka-examples
Practice of Apache Kafka
apache-kafka data-engineer linux
Last synced: 16 Mar 2025
https://github.com/mensenvau/leetcode_sql_problems
😊 Solutions to the LeetCode database problems
data-engineer data-engineering leetcode leetcode-solutions mysql sql
Last synced: 25 Mar 2025
https://github.com/lixx21/dbt-shopping-data-transform
This project leverages DBT (Data Build Tool) to transform raw shopping data into a well-structured, analytics-ready format
data-engineer dbt docker etl-pipeline
Last synced: 03 Apr 2025
https://github.com/mensenvau/data_engineering_solution_no1
Data Engineer Lead Analyst Case Study
data-analyst data-engineer sql
Last synced: 13 Oct 2025
https://github.com/pyk/belajar-data-engineering
A guide to becoming a Data Engineer
data-engineer data-engineering
Last synced: 31 Jan 2026
https://github.com/mitgar14/etl-workshop-1
Workshop #1 (Data Engineer) for the ETL course using Pandas, Matplotlib, SQLAlchemy and Power BI for the creation of the dashboard.
data-engineer data-visualization etl pandas postgresql powerbi python sqlalchemy
Last synced: 24 Mar 2025
https://github.com/omr5221/pyspark
data-engineer jupyter-notebook python spark
Last synced: 22 Mar 2025
https://github.com/apancoast/healthcare-deserts-and-public-transit
This dbt-based project aims to analyze the intersection of healthcare accessibility and public transit coverage in Mecklenburg County, NC.
analysis analytics-engineering data-engineer dbt healthcare hpsa hrsa public-data public-transit
Last synced: 27 Oct 2025
https://github.com/jasontanx/data-engineering-zoomcamp-23
Data Engineering Zoomcamp from DataTalksClub
Last synced: 11 Jul 2025
https://github.com/swidvey/snowflake-task-etl
Example of simple AWS to Snowflake ETL Task
data-engineer elt snowflake sql
Last synced: 09 Apr 2025
https://github.com/rifa8/fundamental-de
Learning about fundamental data engineering
data-engineer data-engineering normalization
Last synced: 01 Feb 2026
https://github.com/shakespear567/data_engineering_gcp
Data Engineering Using Google Cloud Platform and Mage
apachebeam bigquery clouddataflow cloudsql data-engineer dataflow dataproc gcp-components google-bigquery google-cloud google-virtualmachine looker spark terraform
Last synced: 30 Aug 2025
https://github.com/n4en/python-for-data-engineers
Python for data engineers
data data-engineer data-engineering dataengineering python python-notebooks python3 tutorial
Last synced: 26 Aug 2025
https://github.com/lixx21/kafka-ibm-stocking
Move IBM stock data to a JSON file in real time using Kafka
data-engineer kafka kafka-streams
Last synced: 03 Apr 2025
https://github.com/lasbrdev/sgbd-sql-nosql-engenheiro-de-dados-dio
A description of the role of relational and non-relational database management systems (DBMS) in the context of a Data Engineer's work
data-engineer linux nosql-databases relational-databases sgbd-relacionais
Last synced: 02 Sep 2025
https://github.com/bayoadejare/pipeline-sleep
Sleep Data Pipeline with Azure Data Factory
adf azure azure-sql correlation-analysis data-engineer data-engineering-pipeline data-warehouse database databricks machine-learning pandas power-bi powerbi python3 synapse
Last synced: 19 Aug 2025
https://github.com/bayoadejare/pipeline-ecommerce
E-commerce Data Pipeline
adf azure azure-sql clustering-analysis data-engineer data-engineering-pipeline data-warehouse database databricks machine-learning pandas power-bi powerbi python3 synapse
Last synced: 28 Jul 2025
https://github.com/fadhiildzaki/etl_superstore
This project automates ETL for Superstore data, extracting from PostgreSQL, transforming in Python, and reloading into PostgreSQL weekly. I conducted data analysis in Jupyter Notebook and built a Metabase dashboard for insights.
airflow data-analyst data-engineer data-science etl-automation metabase postgresql python
Last synced: 22 Mar 2025
https://github.com/higorcazuza81/higorcazuza81
A little about me
big-data data-architecture data-engineer data-engineering data-engineering-pipeline database python sql
Last synced: 04 Jul 2025
https://github.com/charlesemil/sql-data-warehouse-project
Building a modern data warehouse with SQL Server, including ETL processes, data modeling, and analytics.
data-analytics data-cleaning data-engineer data-science data-warehouse data-warehousing datascience etl etl-pipeline medallion-architecture sql sql-query sql-server sqlserver
Last synced: 05 Oct 2025
https://github.com/awinardi1004/spark-data-engineering-pipeline
End-to-end big data pipeline using Apache Spark & Hadoop (HDFS) with the Olist E-commerce dataset. Covers data ingestion, cleaning, integration, optimization, and serving.
data-engineer data-pipeline hadoop spark
Last synced: 07 Oct 2025
https://github.com/bayoadejare/pipeline-edtech
Edtech ADF Pipeline Project
adf azure azure-sql contextual-analysis data-engineer data-engineering-pipeline data-warehouse database databricks machine-learning pandas power-bi powerbi python3 synapse
Last synced: 23 Mar 2025
https://github.com/mikma03/azure_data_engineer_associate
Materials and resources for Azure certification - DP-203
azure data-engineer microsoft-azure resources
Last synced: 26 Feb 2025
https://github.com/thangbuiq/thangbuiq
⭐️ Check out my profile and consider starring one of my projects if you like it!
data-engineer data-science devops mlops
Last synced: 04 Jan 2026
https://github.com/ahbiels/fegtec
FegTec is a fictional company that wants to transfer Parquet files containing customer data from the AWS cloud to Google Cloud
aws bucket cloudfunctions data-engineer gcp pandas parquet-files python transfer-data
Last synced: 27 Feb 2025
https://github.com/mitgar14/etl-workshop-2
Workshop #2 (ETL process using Airflow) for the ETL course using Apache Airflow to build a data pipeline.
airflow data-engineer data-engineering data-visualization etl pandas postgresql powerbi python sqlalchemy
Last synced: 24 Mar 2025