An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with dataproc

A curated list of projects in awesome lists tagged with dataproc.

https://github.com/googlecloudplatform/data-analytics-golden-demo

An end-to-end demo of Google Cloud's data and analytics stack.

bigdata bigquery composer dataflow dataproc gcp

Last synced: 16 May 2025

https://github.com/lynnlangit/learning-hadoop-and-spark

Companion to the Learning Hadoop and Learning Spark courses on LinkedIn Learning

apache-spark dataproc emr hadoop learning-hadoop mapreduce spark wordcount

Last synced: 16 May 2025

https://github.com/spotify/spydra

Ephemeral Hadoop clusters using Google Compute Platform

dataproc google-cloud hadoop

Last synced: 14 Jan 2026

https://github.com/allegro/bigflow

A Python framework for data processing on GCP.

airflow-dag beam bigquery composer dag dataflow dataproc gcp python python-framework workflows

Last synced: 08 Apr 2025

https://github.com/googlecloudplatform/serverless-spark-workshop

Solution Accelerators for Serverless Spark on GCP, the industry's first auto-scaling and serverless Spark as a service

apache-spark autoscaling bigdata dataproc hadoop serverless solution-accelerator spark usecases

Last synced: 07 Oct 2025

https://github.com/tharwaninitin/etlflow

EtlFlow is an ecosystem of functional libraries in Scala based on ZIO for running complex Auditable workflows which can interact with Google Cloud Platform, AWS, Kubernetes, Databases, SFTP servers, On-Prem Systems and more.

aws bigquery dataproc etl etl-framework etl-pipeline gcp gcs redis s3 scala spark zio

Last synced: 28 Feb 2025

https://github.com/jehiah/gomrjob

gomrjob - a Go Framework for Hadoop Map Reduce Jobs

dataproc go hadoop mapreduce mrjob

Last synced: 17 Mar 2025

https://github.com/debussy-labs/debussy_concert

Debussy is an opinionated Data Architecture and Engineering framework, enabling data analysts and engineers to build better platforms and pipelines.

airflow airflow-operators airflow-plugin big-data-platform bigquery data-architecture data-engineering data-pipeline dataform dataproc dbt gcp google-cloud mssql mysql postgresql spark sql workflow

Last synced: 13 Aug 2025

https://github.com/googlecloudplatform/dataproc-scala-examples

Dataproc Scala Examples is an effort to assist in the creation of Spark jobs written in Scala to run on Dataproc.

airflow composer dataproc gcp scala spark

Last synced: 19 Aug 2025

https://github.com/googlecloudplatform/dataproc-trino-autoscaler

Trino Autoscaler on Dataproc automates the scaling of a Dataproc cluster based on real-time resource utilization by Trino workloads

apache-trino dataproc

Last synced: 20 Oct 2025

https://github.com/garystafford/dataproc-workflow-templates

Demonstration of Google Cloud Dataproc Workflow Templates

dataproc gcp google-cloud-platform hadoop pyspark spark

Last synced: 14 Mar 2026

https://github.com/garystafford/dataproc-python-demo

Demonstration of Google Cloud Dataproc for running PySpark jobs

cloud-dataproc dataproc gcp google pyspark python

Last synced: 13 Jul 2025

https://github.com/garystafford/dataproc-java-demo

Demonstration of Google Cloud Dataproc for running Spark jobs with Java

big-data-analytics dataproc gcp google java spark

Last synced: 03 Aug 2025

https://github.com/lalelisealstad/store-sales-pyspark-etl

ETL pipeline with pyspark in Google Cloud Platform (GCP) using infrastructure-as-code principles with Terraform

big-query dataproc iac pyspark terraform

Last synced: 14 Mar 2025

https://github.com/tadod12/big-data-with-gcp

Experimenting with GCP for a Big Data project

big-data dataproc google-cloud-platform

Last synced: 21 Sep 2025

https://github.com/thunchanokbow/inventory-amazon

Inventory value is also important for determining a company's liquidity, or its ability to meet its short-term financial obligations. A high inventory value can indicate that a company has too much money tied up in inventory, which could make it difficult for the company to pay its bills.

azure bigquery cloudcomposer clouddatabase cloudstorage compute-engine dataproc postgresql powerbi pyspark-sql python3

Last synced: 12 Apr 2026

https://github.com/archie-cm/real_time_product_recommendations_with_machine_learning_on_gcp

This project demonstrates how to build a real-time product recommendation system using Pub/Sub Lite and Apache Spark with Dataproc

dataproc pubsublite spark

Last synced: 20 Apr 2026

https://github.com/elise-alstad/store-sales-pyspark-etl

ETL pipeline with pyspark in Google Cloud Platform (GCP) using infrastructure-as-code principles with Terraform

big-query dataproc iac pyspark terraform

Last synced: 25 Apr 2026

https://github.com/mohamedkashifuddin/gcp-ecommerce-data-pipeline

An e-commerce data lakehouse implemented on Google Cloud Platform (GCP). This project features an end-to-end data pipeline, from raw data generation via Cloud Functions, layered processing with PySpark on Dataproc, to structured data warehousing in BigQuery. It's fully orchestrated by Apache Airflow, enabling analytics and BI with Metabase.

airflow bigquery cloud-functions data-pipeline dataproc ecommerce gcp metabase pyspark

Last synced: 18 May 2026

https://github.com/borfergi/stock-market-data-pipeline

A fully serverless data pipeline that prepares stock market data from your selected companies using GCS, PySpark, BigQuery, Composer (Airflow), and Terraform.

airflow bigquery composer data-pipeline dataproc gcs polygon-api pyspark terraform

Last synced: 09 Apr 2026

https://github.com/archie-cm/end_to_end_batch_processing_pipeline_with_dataproc

This project demonstrates how to build an end-to-end batch processing pipeline using Apache Spark on Google Cloud Platform (GCP)

dataproc spark

Last synced: 29 Dec 2025

https://github.com/jewertow/mapreduce-nyc-collisions

Implementation of data processing in the MapReduce model.

airflow avro composer dataproc gcp hadoop hive mapreduce scala terraform

Last synced: 11 May 2026

https://github.com/suv05/brazilian-ecommerce-data-analysis

End-to-End Big Data Analytics on Google Cloud Platform

bigquery dataproc kaggle-dataset spark

Last synced: 15 Apr 2026

https://github.com/yuyatinnefeld/dataproc-api-service

🧪 Test Features 🧪 | GCP Dataproc + FastAPI

dataproc fastapi gcp

Last synced: 22 Apr 2026

https://github.com/eshwarcvs/save-gcp-local

Run GCP Dataproc Spark jobs locally in Docker/Podman to save cloud cost, with zero DAG edits.

airflow cost-optimization dataproc docker gcp local-testing podman spark

Last synced: 05 Jun 2026

https://github.com/benmizrahi/paperless

A papermill implementation to run notebooks inside Dataproc Serverless

dataproc gcp jupyter jupyter-notebook notebook python serverless

Last synced: 24 Apr 2026

https://github.com/miozilla/dataprochs

dataprochs :elephant::honeybee: : Dataproc Cluster # Apache # Hadoop # MapReduce # Spark # YARN # HDFS

apache cluster dataproc hadoop hdfs hive mapreduce node-manager node-worker pig pyspark spark sparksql yarn

Last synced: 09 May 2026

https://github.com/malbiruk/million-songs-pipeline

End-to-end batch pipeline joining audio features, lyrics, and genres from the Million Song Dataset

batch-processing bigquery data-engineering data-pipeline data-warehouse dataproc dbt dezoomcamp gcp million-song-dataset prefect pyspark streamlit terraform

Last synced: 08 Jun 2026

https://github.com/rodolphecalvet/spark-scala-dataproc

Scala/Spark for predicting the outcome of Kickstarter project applications. Deployment scripts on GCP Dataproc.

dataproc gcs scala spark

Last synced: 28 Apr 2026

https://github.com/jcguidry/flight-ml-preprocess-gcp

Continuous flight event data processing using Spark Streaming and Delta Lake storage, deployed on a GCP Dataproc cluster.

dataproc deltalake gcp spark spark-streaming

Last synced: 01 May 2026

https://github.com/subhamay-bhattacharyya-tf/terraform-google-dataproc-cluster

๐Ÿ—๏ธ Terraform module to create and manage Dataproc clusters, batches (serverless), workflow templates, autoscaling policies, and job execution.

dataproc terraform-gcp-module terraform-module

Last synced: 02 May 2026