https://github.com/alexye-mapleleafs/automate-dataproc-process-in-gcp
This repo demonstrates a general process for automating workloads in GCP Dataproc to leverage its processing power.
- Host: GitHub
- URL: https://github.com/alexye-mapleleafs/automate-dataproc-process-in-gcp
- Owner: AlexYe-MapleLeafs
- Created: 2025-08-12T12:53:53.000Z (2 months ago)
- Default Branch: main
- Last Pushed: 2025-08-13T06:04:52.000Z (2 months ago)
- Last Synced: 2025-08-13T08:17:28.276Z (2 months ago)
- Topics: airflow-dags, dataproc-clusters, docker-image, google-cloud-platform, python3
- Language: Python
- Homepage:
- Size: 118 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
## README
**Automated Processes in Dataproc on Google Cloud Platform to Leverage Its Computing Power**
This repository contains a simplified, demonstration version of a real-world GCP project.
It illustrates the structure, configuration, and code required to automate processes in **Google Cloud Platform** Dataproc services.

---
## Repository Structure
| File/Folder | Purpose |
| --- | --- |
| **`scripts/`** | PySpark scripts that extract data from multiple sources and load it into **BigQuery** target tables (a minimal sketch follows after this table). <br> _Note: the production environment contains 10+ scripts; this demo includes only a subset for clarity._ |
| **`jars/`** | In the real project, this folder holds `gcs-connector-hadoop2-2.1.1.jar` and `spark-3.2-bigquery-0.30.0.jar`, which the jobs require. <br> _Note: the jars are not included in the repo to reduce its size; download them from the original websites indicated in the txt documents in this folder._ |
| **`miniconda3/`** | In the real project, this folder holds `Miniconda3-py39_23.5.2-0-Linux-x86_64.sh`, which the image build requires. <br> _Note: the Miniconda installer is not included in the repo to reduce its size; download it from the original website indicated in the txt document in this folder._ |
| **`accp.yaml`** | Pipeline configuration file defining build instructions for the container image, including the image name, build context, and other ACCP pipeline parameters. <br> _Referenced during pipeline initialization._ |
| **`Dockerfile`** | Image build specification defining the base image, system dependencies, and Python packages required to run the scripts; the Miniconda installation step is defined here. <br> _Referenced in `accp.yaml`._ |
| **`dataproc_serverless_01_dag.py`** | **Apache Airflow DAG** that orchestrates and schedules the end-to-end process. It references the container image (via the `container_image` parameter) built from the files above and defines the execution logic and scheduling parameters (see the DAG sketch after the Workflow Overview). |

---
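As a concrete illustration of the `scripts/` entry above, here is a minimal sketch of what one of the PySpark jobs might look like: read raw files from GCS, apply a simple transformation, and write the result to BigQuery through the spark-bigquery connector listed under `jars/`. The bucket, dataset, table, and column names are placeholders, not resources from the actual project.

```python
# Minimal sketch of a scripts/ job: read CSV files from GCS, clean them,
# and append the result to a BigQuery table.
# All resource names below are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("demo-extract-load")
    .getOrCreate()
)

# Read raw data from a GCS location (requires the GCS connector jar).
raw_df = spark.read.option("header", True).csv("gs://example-bucket/raw/orders/")

# Example transformation: keep valid rows and add a load timestamp.
clean_df = (
    raw_df
    .filter(F.col("order_id").isNotNull())
    .withColumn("load_ts", F.current_timestamp())
)

# Write to BigQuery (requires the spark-bigquery connector jar listed in jars/).
(
    clean_df.write
    .format("bigquery")
    .option("temporaryGcsBucket", "example-bucket-tmp")
    .mode("append")
    .save("example_dataset.orders_clean")
)
```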
## Workflow Overview
1. **Code and Configuration** — Python scripts, configuration files, and dependencies are stored in this repository.
2. **Image Build** — The ACCP pipeline uses `accp.yaml` and `Dockerfile` to build a container image with all required dependencies.
3. **Data Processing** — The container runs Python scripts to fetch, transform, and load data into BigQuery.
4. **Orchestration** — Airflow triggers the container execution according to the schedule defined in `dataproc_serverless_01_dag.py` (a minimal DAG sketch follows below).

---
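Below is a minimal sketch of an Airflow DAG in the spirit of `dataproc_serverless_01_dag.py`, assuming the Dataproc Serverless batch API is used via the Google provider's `DataprocCreateBatchOperator`. The project, region, schedule, GCS path, and container image URI are placeholders; the real DAG in this repository may differ in structure and parameters.

```python
# Sketch of a DAG that submits a Dataproc Serverless (batch) PySpark job
# running inside the custom container image built from the Dockerfile.
# All identifiers below are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateBatchOperator,
)

BATCH_CONFIG = {
    "pyspark_batch": {
        "main_python_file_uri": "gs://example-bucket/scripts/extract_load.py",
    },
    "runtime_config": {
        # Custom image built by the ACCP pipeline from accp.yaml / Dockerfile.
        "container_image": "europe-docker.pkg.dev/example-project/repo/dataproc-image:latest",
    },
}

with DAG(
    dag_id="dataproc_serverless_01_dag",
    start_date=datetime(2025, 1, 1),
    schedule="0 6 * * *",  # Airflow 2.4+ style; daily at 06:00
    catchup=False,
) as dag:
    run_batch = DataprocCreateBatchOperator(
        task_id="run_dataproc_batch",
        project_id="example-project",
        region="europe-west1",
        batch=BATCH_CONFIG,
        batch_id="dataproc-serverless-01-{{ ds_nodash }}",
    )
```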
