https://github.com/irwandifo/gcp-batch-pipeline

batch-processing data-pipeline data-platform kestra

# GCP Batch Data Pipeline

This project implements a data pipeline for a small-to-medium-scale data platform on Google Cloud Platform (GCP). The pipeline is designed for batch processing and uses Kestra, DuckDB, dbt, Iceberg, BigQuery, and Parquet to build the data processing workflow.

## Components

| **Categories** | **Tools** | **Details** |
|------------------------|-----------------------------------------|-------------------------------------------|
| Data Source | Pagila | Sample PostgreSQL database. |
| Orchestration | Kestra | Coordinates workflows and tasks. |
| Ingestion | Kestra | Manages data ingestion workflows. |
| Storage | Google Cloud Storage (GCS) and BigQuery | Stores data using the medallion architecture. |
| Processing | DuckDB and dbt+BigQuery | Performs data transformation. |
| Data Quality | Soda and dbt test | Ensures data accuracy and reliability. |
| Alerting | Resend | Sends email notifications for issues. |
| Consumption | BigQuery | Data consumption layer. |
| Analytics | Looker Studio and Evidence | Analytics and visualizations layer. |
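
The table lists the stages but not how they hand off to each other. As a rough illustration only, and not the project's actual Kestra flow, the standard-library Python sketch below chains them in the order the table suggests: ingest raw rows, refine them through the usual bronze, silver, and gold medallion layers, and stop with an alert if a quality check fails. Every function name, the sample rows, and the quality rule are hypothetical. According to the table, the real steps are handled by Kestra, DuckDB, dbt, Soda, and Resend; the README does not say which tool owns which layer.

```python
from dataclasses import dataclass, field


@dataclass
class RunResult:
    """Outcome of one batch run (hypothetical structure, for illustration)."""
    layers: dict = field(default_factory=dict)  # layer name -> list of rows
    alerts: list = field(default_factory=list)  # messages that would be emailed


def ingest():
    # Stand-in for the ingestion step; these rows are made-up sample data.
    return [
        {"film_id": 1, "title": " Alpha ", "rental_rate": "2.99"},
        {"film_id": 2, "title": "Beta", "rental_rate": "4.99"},
        {"film_id": 2, "title": "Beta", "rental_rate": "4.99"},  # duplicate
    ]


def to_silver(bronze):
    # Clean and deduplicate: trim strings, cast types, keep one row per key.
    seen, silver = set(), []
    for row in bronze:
        if row["film_id"] in seen:
            continue
        seen.add(row["film_id"])
        silver.append({
            "film_id": row["film_id"],
            "title": row["title"].strip(),
            "rental_rate": float(row["rental_rate"]),
        })
    return silver


def to_gold(silver):
    # Aggregate the cleaned rows into a consumption-ready summary.
    rates = [r["rental_rate"] for r in silver]
    avg = round(sum(rates) / len(rates), 2) if rates else 0.0
    return [{"film_count": len(silver), "avg_rental_rate": avg}]


def check_quality(rows):
    # Hypothetical rule: every row needs a title and a non-negative price.
    return [f"bad row: {r}" for r in rows if not r["title"] or r["rental_rate"] < 0]


def run_pipeline(source=ingest):
    result = RunResult()
    result.layers["bronze"] = source()
    result.layers["silver"] = to_silver(result.layers["bronze"])
    failures = check_quality(result.layers["silver"])
    if failures:
        # Stop before publishing; record the alert instead of sending email.
        result.alerts.extend(failures)
        return result
    result.layers["gold"] = to_gold(result.layers["silver"])
    return result
```

Calling `run_pipeline()` fills the `bronze`, `silver`, and `gold` entries of `result.layers`. Passing a source that returns a row with a blank title leaves `gold` unset and puts one message in `result.alerts`, which mirrors the idea of gating the consumption layer on data quality.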

## Architecture Diagram

![GCP batch pipeline architecture diagram](https://github.com/irwandifo/gcp-batch-infra/blob/main/img/gcp-batch-diagram.png)