
https://github.com/nathadriele/data-engineering-zoomcamp

The Data Engineering Zoomcamp covers essential skills in containerization, workflow orchestration, data warehousing, analytics engineering, batch processing, and stream processing. It includes tools like Docker, Terraform, BigQuery, dbt, Spark, Kafka, Kestra, Postgres, Google Data Studio, and Metabase.





## Data Engineering Zoomcamp

The Data Engineering Zoomcamp teaches the core concepts, tools, and practical skills needed for modern data engineering. The course covers a wide range of topics, from containerization and infrastructure as code to batch and stream processing. It is hands-on and project-based: participants learn the theory and then apply it by building real-world pipelines.

![image](https://github.com/user-attachments/assets/3bc25a83-a158-484b-b73b-3358e930cc4c)

#### Featured Tools and Technologies

- Docker: Containerization platform for building, shipping, and running applications.
- Terraform: Infrastructure as code tool for building, changing, and versioning infrastructure.
- Google BigQuery: Serverless, highly scalable, and cost-effective multi-cloud data warehouse.
- dbt (data build tool): Analytics engineering tool for defining, running, and testing SQL transformations inside the data warehouse.
- Apache Spark: Open-source distributed computing system for big data processing.
- Apache Kafka: Distributed event streaming platform for building real-time data pipelines and streaming applications.
- Kestra: Flexible and scalable workflow orchestration and automation tool.
- PostgreSQL: Powerful open-source relational database system.
- Google Data Studio: Data visualization and reporting tool to turn data into informative dashboards and reports.
- Metabase: Open-source business intelligence and analytics tool for easy data visualization and exploration.

#### Module 1: Containerization and Infrastructure as Code

- GCP
- Docker and docker-compose
- Running Postgres locally with Docker
- Setting up infrastructure on GCP with Terraform
- Preparing the environment

#### Module 2: Workflow Orchestration
- Data Lake
- Workflow orchestration
- Workflow orchestration with Kestra

#### Workshop 1: Data Ingestion
- Reading from APIs
- Building scalable pipelines
- Normalising data
- Incremental loading
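
As a rough illustration of two of the workshop topics, normalising data and incremental loading, here is a minimal sketch in plain Python. It is not taken from the course material: the function names `flatten` and `incremental_load`, the `updated_at` cursor field, and the `__` key separator are all assumptions chosen for the example. The idea is to flatten nested records into single-level rows and to append only records newer than the last value seen in a previous run.

```python
from typing import Any, Iterable


def flatten(record: dict[str, Any], parent: str = "", sep: str = "__") -> dict[str, Any]:
    """Flatten nested dicts into one level, joining keys with `sep` (hypothetical helper)."""
    flat: dict[str, Any] = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


def incremental_load(
    source: Iterable[dict[str, Any]],
    destination: list[dict[str, Any]],
    last_seen: int,
    cursor_field: str = "updated_at",  # assumed cursor column
) -> int:
    """Append only records newer than `last_seen`; return the new high-water mark."""
    high_water = last_seen
    for record in source:
        if record[cursor_field] > last_seen:
            destination.append(flatten(record))
            high_water = max(high_water, record[cursor_field])
    return high_water


if __name__ == "__main__":
    rows: list[dict[str, Any]] = []
    batch = [
        {"id": 1, "updated_at": 10, "trip": {"distance": 2.5}},
        {"id": 2, "updated_at": 20, "trip": {"distance": 1.0}},
    ]
    mark = incremental_load(batch, rows, last_seen=0)
    # A second run with the same batch adds nothing, because no record is newer than `mark`.
    mark = incremental_load(batch, rows, last_seen=mark)
    print(len(rows), mark)
```

In a real pipeline the high-water mark would be persisted between runs (for example in a state table), and the destination would be a database or data lake rather than an in-memory list.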

#### Module 3: Data Warehouse
- Data Warehouse
- BigQuery
- Partitioning and clustering
- BigQuery best practices
- Internals of BigQuery
- BigQuery Machine Learning

#### Module 4: Analytics Engineering
- Basics of analytics engineering
- dbt (data build tool)
- BigQuery and dbt
- Postgres and dbt
- dbt models
- Testing and documenting
- Deployment to the cloud and locally
- Visualizing the data with Google Data Studio and Metabase

#### Module 5: Batch Processing
- Batch processing
- What is Spark
- Spark DataFrames
- Spark SQL
- Internals: GroupBy and joins
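
To give a feel for the GroupBy internals topic, the following is a toy sketch in plain Python of a two-stage distributed aggregation: each partition is pre-aggregated locally, then the partial results are shuffled by key and merged. This is a simplified model and not Spark's API; the names `partial_aggregate`, `shuffle`, and `group_by_sum` are hypothetical, and a real engine adds serialization, spilling, and query planning on top.

```python
from collections import defaultdict

Pair = tuple[str, float]


def partial_aggregate(partition: list[Pair]) -> dict[str, float]:
    """Stage 1: sum values per key inside a single partition (no data movement)."""
    sums: dict[str, float] = defaultdict(float)
    for key, value in partition:
        sums[key] += value
    return dict(sums)


def shuffle(partials: list[dict[str, float]], num_reducers: int) -> list[dict[str, list[float]]]:
    """Route each key's partial sums to one reducer bucket, chosen by hashing the key."""
    buckets: list[dict[str, list[float]]] = [defaultdict(list) for _ in range(num_reducers)]
    for partial in partials:
        for key, value in partial.items():
            buckets[hash(key) % num_reducers][key].append(value)
    return buckets


def group_by_sum(partitions: list[list[Pair]], num_reducers: int = 2) -> dict[str, float]:
    """Stage 2: merge the shuffled partial sums into one final total per key."""
    partials = [partial_aggregate(p) for p in partitions]
    result: dict[str, float] = {}
    for bucket in shuffle(partials, num_reducers):
        for key, values in bucket.items():
            result[key] = sum(values)
    return result


if __name__ == "__main__":
    data = [
        [("a", 1.0), ("b", 2.0), ("a", 3.0)],  # partition 0
        [("b", 4.0), ("c", 5.0)],              # partition 1
    ]
    print(group_by_sum(data))
```

Pre-aggregating before the shuffle matters because only one value per key per partition has to cross the network, rather than every row.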

#### Module 6: Streaming
- Introduction to Kafka
- Schemas (Avro)
- Kafka Streams
- Kafka Connect and KSQL

#### Project
- Week 1 and 2: working on your project
- Week 3: reviewing your peers

https://github.com/DataTalksClub/data-engineering-zoomcamp