Projects in Awesome Lists tagged with dataproc
A curated list of projects in awesome lists tagged with dataproc.
https://github.com/dataflint/spark
Performance Observability for Apache Spark
apache-spark big-data data-pipeline data-pipelines databricks dataproc emr etl observability optimization spark-operator
Last synced: 10 May 2026
https://github.com/lynnlangit/learning-hadoop-and-spark
Companion to the Learning Hadoop and Learning Spark courses on LinkedIn Learning
apache-spark dataproc emr hadoop learning-hadoop mapreduce spark wordcount
Last synced: 16 May 2025
https://github.com/spotify/spydra
Ephemeral Hadoop clusters using Google Compute Platform
Last synced: 14 Jan 2026
https://github.com/allegro/bigflow
A Python framework for data processing on GCP.
airflow-dag beam bigquery composer dag dataflow dataproc gcp python python-framework workflows
Last synced: 08 Apr 2025
https://github.com/googlecloudplatform/serverless-spark-workshop
Solution Accelerators for Serverless Spark on GCP, the industry's first auto-scaling and serverless Spark as a service
apache-spark autoscaling bigdata dataproc hadoop serverless solution-accelerator spark usecases
Last synced: 07 Oct 2025
https://github.com/tharwaninitin/etlflow
EtlFlow is an ecosystem of functional libraries in Scala based on ZIO for running complex Auditable workflows which can interact with Google Cloud Platform, AWS, Kubernetes, Databases, SFTP servers, On-Prem Systems and more.
aws bigquery dataproc etl etl-framework etl-pipeline gcp gcs redis s3 scala spark zio
Last synced: 28 Feb 2025
https://github.com/debussy-labs/debussy_concert
Debussy is an opinionated Data Architecture and Engineering framework, enabling data analysts and engineers to build better platforms and pipelines.
airflow airflow-operators airflow-plugin big-data-platform bigquery data-architecture data-engineering data-pipeline dataform dataproc dbt gcp google-cloud mssql mysql postgresql spark sql workflow
Last synced: 13 Aug 2025
https://github.com/wittline/pydag
Scheduling Big Data Workloads and Data Pipelines in the Cloud with pyDag
big-data bigquery cloud dag data-engineering data-pipeline dataengineering dataproc dataproc-cluster directed-acyclic-graph google-cloud google-cloud-platform parallel-processing task-scheduler task-scheduling workflow-engine
Last synced: 13 Apr 2025
https://github.com/googlecloudplatform/dataproc-trino-autoscaler
Trino Autoscaler on Dataproc automates the scaling of a Dataproc cluster based on real-time resource utilization by Trino workloads
Last synced: 20 Oct 2025
https://github.com/garystafford/dataproc-workflow-templates
Demonstration of Google Cloud Dataproc Workflow Templates
dataproc gcp google-cloud-platform hadoop pyspark spark
Last synced: 14 Mar 2026
https://github.com/garystafford/dataproc-python-demo
Demonstration of Google Cloud Dataproc for running PySpark jobs
cloud-dataproc dataproc gcp google pyspark python
Last synced: 13 Jul 2025
https://github.com/maengsanha/bigdata
KMU CS Hot Topics in Big Data
cloud9 dataproc docker documentdb machine-learning mapreduce mongodb nlp spark
Last synced: 16 Jan 2026
https://github.com/garystafford/dataproc-java-demo
Demonstration of Google Cloud Dataproc for running Spark jobs with Java
big-data-analytics dataproc gcp google java spark
Last synced: 03 Aug 2025
https://github.com/tadod12/big-data-with-gcp
Experimenting with GCP for a Big Data Project
big-data dataproc google-cloud-platform
Last synced: 21 Sep 2025
https://github.com/thunchanokbow/inventory-amazon
Inventory value is also important for determining a company's liquidity, or its ability to meet its short-term financial obligations. A high inventory value can indicate that a company has too much money tied up in inventory, which could make it difficult for the company to pay its bills.
azure bigquery cloudcomposer clouddatabase cloudstorage compute-engine dataproc postgresql powerbi pyspark-sql python3
Last synced: 12 Apr 2026
https://github.com/archie-cm/real_time_product_recommendations_with_machine_learning_on_gcp
This project demonstrates how to build a real-time product recommendation system using Pub/Sub Lite and Apache Spark with Dataproc
Last synced: 20 Apr 2026
https://github.com/mohamedkashifuddin/gcp-ecommerce-data-pipeline
An e-commerce data lakehouse implemented on Google Cloud Platform (GCP). This project features an end-to-end data pipeline, from raw data generation via Cloud Functions, layered processing with PySpark on Dataproc, to structured data warehousing in BigQuery. It's fully orchestrated by Apache Airflow, enabling analytics and BI with Metabase.
airflow bigquery cloud-functions data-pipeline dataproc ecommerce gcp metabase pyspark
Last synced: 18 May 2026
https://github.com/borfergi/stock-market-data-pipeline
A fully serverless data pipeline that prepares stock market data from your selected companies using GCS, PySpark, BigQuery, Composer (Airflow), and Terraform.
airflow bigquery composer data-pipeline dataproc gcs polygon-api pyspark terraform
Last synced: 09 Apr 2026
https://github.com/archie-cm/end_to_end_batch_processing_pipeline_with_dataproc
This project demonstrates how to build an end-to-end batch processing pipeline using Apache Spark on Google Cloud Platform (GCP)
Last synced: 29 Dec 2025
https://github.com/suv05/brazilian-ecommerce-data-analysis
End-to-End Big Data Analytics on Google Cloud Platform
bigquery dataproc kaggle-dataset spark
Last synced: 15 Apr 2026
https://github.com/yuyatinnefeld/dataproc-api-service
🧪 Test Features 🧪 | GCP Dataproc + FastAPI
Last synced: 22 Apr 2026
https://github.com/eshwarcvs/save-gcp-local
Run GCP Dataproc Spark jobs locally in Docker/Podman to save cloud cost, with zero DAG edits.
airflow cost-optimization dataproc docker gcp local-testing podman spark
Last synced: 05 Jun 2026
https://github.com/benmizrahi/paperless
A papermill implementation to run notebooks inside Dataproc Serverless
dataproc gcp jupyter jupyter-notebook notebook python serverless
Last synced: 24 Apr 2026
https://github.com/miozilla/dataprochs
dataprochs :elephant::honeybee: : Dataproc Cluster # Apache # Hadoop # MapReduce # Spark # YARN # HDFS
apache cluster dataproc hadoop hdfs hive mapreduce node-manager node-worker pig pyspark spark sparksql yarn
Last synced: 09 May 2026
https://github.com/malbiruk/million-songs-pipeline
End-to-end batch pipeline joining audio features, lyrics, and genres from the Million Song Dataset
batch-processing bigquery data-engineering data-pipeline data-warehouse dataproc dbt dezoomcamp gcp million-song-dataset prefect pyspark streamlit terraform
Last synced: 08 Jun 2026
https://github.com/rodolphecalvet/spark-scala-dataproc
Scala/Spark for predicting the outcome of Kickstarter project applications. Deployment scripts on GCP Dataproc.
Last synced: 28 Apr 2026
https://github.com/jcguidry/flight-ml-preprocess-gcp
Continuous flight event data processing using Spark Streaming and Delta Lake storage, deployed on a GCP Dataproc cluster.
dataproc deltalake gcp spark spark-streaming
Last synced: 01 May 2026
https://github.com/subhamay-bhattacharyya-tf/terraform-google-dataproc-cluster
Terraform module to create and manage Dataproc clusters, batches (serverless), workflow templates, autoscaling policies, and job execution.
dataproc terraform-gcp-module terraform-module
Last synced: 02 May 2026
https://github.com/shakespear567/data_engineering_gcp
Data Engineering Using Google Cloud Platform and Mage
apachebeam bigquery clouddataflow cloudsql data-engineer dataflow dataproc gcp-components google-bigquery google-cloud google-virtualmachine looker spark terraform
Last synced: 07 May 2026
https://github.com/anant/example-airflow-dataproc-astra
airflow apache-airflow dataproc datastax datastax-astra gitpod google-dataproc
Last synced: 12 Jun 2025