https://github.com/manuelguerra1987/data-engineering-zoomcamp-notes
Notes and material from 2025 Data Engineering Zoomcamp by Datatalks.Club
airflow bigquery data-engineering docker kubernetes
Last synced: 10 months ago
- Host: GitHub
- URL: https://github.com/manuelguerra1987/data-engineering-zoomcamp-notes
- Owner: ManuelGuerra1987
- Created: 2024-12-07T02:54:16.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-01-15T16:04:05.000Z (over 1 year ago)
- Last Synced: 2025-01-15T17:43:05.757Z (over 1 year ago)
- Topics: airflow, bigquery, data-engineering, docker, kubernetes
- Language: Python
- Homepage:
- Size: 6.76 MB
- Stars: 9
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Data Engineering Zoomcamp 2025
This repo contains notes and homework for the [Data Engineering Zoomcamp](https://github.com/DataTalksClub/data-engineering-zoomcamp) by [DataTalks.Club](https://datatalks.club/).
## Syllabus
### [Module 1: Containerization and Infrastructure as Code](1_Containerization-and-Infrastructure-as-Code/)
* Docker and docker-compose
* Running Postgres in a container
* Ingesting data to Postgres with Python
* Running Postgres and pgAdmin with docker-compose
* Google Cloud Platform (GCP)
* Terraform
* Setting up infrastructure on GCP with Terraform
### [Module 2: Workflow Orchestration with Kestra](2_Workflow-Orchestration-(Kestra)/)
* Introduction to Workflow Orchestration
* Introduction to Kestra
* Launch Kestra using Docker Compose
* ETL Pipelines: Load Data to Local Postgres
* ETL Pipelines: Load Data to Google Cloud Platform (GCP)
### [Module 3: Data Warehouse with BigQuery](3_Data-Warehouse/)
* OLAP vs OLTP
* Data Warehouse
* BigQuery
* Creating an external table
* Partitioning and clustering
* BigQuery best practices
### [Module 4: Analytics Engineering with dbt](4_Analytics-Engineering/)
* Introduction to analytics engineering
* Introduction to dbt
* Setting up dbt with BigQuery
* Development of dbt Models
* Building the model
* Testing and documenting
* Deployment
* Visualizing the data
### [Module 5: Batch Processing with Spark](5_Batch-Processing-Spark/)
* Introduction to Batch Processing
* Introduction to Spark
* Spark SQL and DataFrames
* Spark Internals
* Running Spark in the Cloud
## Extra:
### [Module 2: Workflow Orchestration with Airflow](2_Workflow-Orchestration-AirFlow/)
* Data Lake vs Data Warehouse
* ETL vs ELT
* Introduction to Workflow Orchestration
* Airflow architecture
* Setting up Airflow with Docker
* Ingesting data to local Postgres with Airflow
* Ingesting data to GCP with Airflow
* Airflow with Kubernetes
### Environment setup
You can set up the environment locally on your laptop or PC, or on a virtual machine in Google Cloud Platform.
These notes use Windows with WSL2 locally.
For the course you'll need:
* Python 3
* Google Cloud SDK (explained in module 1)
* Docker with docker-compose
* Terraform (explained in module 1)
* Git and a GitHub account
* Google Cloud Platform account
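
To confirm that the command-line tools in the list above are installed, something like the following sketch could be run inside WSL2 or on the VM. It is not part of the course material. The executable names (`python3`, `gcloud`, `docker`, `terraform`, `git`) are assumed to be the usual ones for these tools, and `find_missing` is a hypothetical helper written for this illustration. Depending on your Docker version, Compose may be available as `docker compose` or as a separate `docker-compose` binary, so it is not checked here.

```python
import shutil

# Assumed executable names for the tools listed above; adjust to your setup.
REQUIRED_TOOLS = ["python3", "gcloud", "docker", "terraform", "git"]


def find_missing(tools, which=shutil.which):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_TOOLS)
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```

The script only checks that each executable is discoverable; it does not verify versions or that the Google Cloud SDK is authenticated against your account.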