https://github.com/googleclouddataproc/cloud-dataproc
Cloud Dataproc: Samples and Utils
- Host: GitHub
- URL: https://github.com/googleclouddataproc/cloud-dataproc
- Owner: GoogleCloudDataproc
- License: apache-2.0
- Created: 2016-11-02T05:49:22.000Z (almost 9 years ago)
- Default Branch: master
- Last Pushed: 2025-05-14T09:10:04.000Z (5 months ago)
- Last Synced: 2025-05-14T10:35:44.386Z (5 months ago)
- Topics: google-cloud-dataproc
- Language: Jupyter Notebook
- Homepage:
- Size: 3.07 MB
- Stars: 203
- Watchers: 34
- Forks: 128
- Open Issues: 6
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
README
# Google Cloud Dataproc
This repository contains code and documentation for use with
[Google Cloud Dataproc](https://cloud.google.com/dataproc/).

## Samples in this Repository
* `codelabs/opencv-haarcascade` provides the source code for the [OpenCV Dataproc Codelab](https://codelabs.developers.google.com/codelabs/cloud-dataproc-opencv/index.html), which demonstrates a Spark job that adds facial detection to a set of images.
* `codelabs/spark-bigquery` provides the source code for the [PySpark for Preprocessing BigQuery Data Codelab](https://codelabs.developers.google.com/codelabs/pyspark-bigquery/index.html), which demonstrates using PySpark on Cloud Dataproc to process data from BigQuery.
* `codelabs/spark-nlp` provides the source code for the [PySpark for Natural Language Processing Codelab](https://codelabs.developers.google.com/codelabs/spark-nlp/index.html), which demonstrates using the [spark-nlp](https://github.com/JohnSnowLabs/spark-nlp) library for Natural Language Processing.
* `notebooks/ai-ml/` provides source code for Spark for AI/ML use cases, including a [PyTorch](https://pytorch.org/) sample for image classification.
* `notebooks/python` provides example Jupyter notebooks that demonstrate using PySpark with the [BigQuery Storage Connector](https://github.com/GoogleCloudPlatform/spark-bigquery-connector) and the [Spark GCS Connector](https://github.com/GoogleCloudPlatform/bigdata-interop/tree/master/gcs).
* `spark-tensorflow` provides an example of using Spark as a preprocessing toolchain for TensorFlow jobs. Optionally,
it demonstrates the [spark-tensorflow-connector](https://github.com/tensorflow/ecosystem/tree/master/spark/spark-tensorflow-connector) to convert CSV files to TFRecords.
* `spark-translate` provides a simple demo Spark application that translates words using Google's Translation API and runs on Cloud Dataproc.
* `gcloud` provides a set of scripts to provision Dataproc clusters for use in exercising arbitrary initialization actions.

See each directory's README for more information.
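As a rough illustration of the kind of provisioning described for the `gcloud` directory, the sketch below assembles a `gcloud dataproc clusters create` command that attaches an initialization action. The helper name, cluster name, region, and bucket path are hypothetical and are not taken from this repository; the actual scripts in `gcloud/` may work quite differently.

```python
import shlex


def build_create_cluster_command(cluster_name, region, init_action_uris):
    """Return the argv list for creating a Dataproc cluster.

    Hypothetical helper for illustration only; not part of this repository.
    """
    command = [
        "gcloud", "dataproc", "clusters", "create", cluster_name,
        f"--region={region}",
    ]
    if init_action_uris:
        # --initialization-actions takes a comma-separated list of
        # Cloud Storage URIs pointing at executables or scripts.
        command.append("--initialization-actions=" + ",".join(init_action_uris))
    return command


if __name__ == "__main__":
    argv = build_create_cluster_command(
        "example-cluster",                          # hypothetical cluster name
        "us-central1",                              # assumed region
        ["gs://example-bucket/my-init-action.sh"],  # hypothetical script path
    )
    # Print a shell-safe command line instead of running it.
    print(shlex.join(argv))
```

Building the command as a list keeps quoting explicit; to actually run it, the list could be passed to `subprocess.run(argv, check=True)` on a machine with the Google Cloud CLI installed and authenticated.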
## Additional Dataproc Repositories
You can find more Dataproc resources in these GitHub repositories:
### Dataproc projects
* [Dataproc initialization
actions](https://github.com/GoogleCloudPlatform/dataproc-initialization-actions)
* [GCP Token Broker](https://github.com/GoogleCloudPlatform/gcp-token-broker)
* [Dataproc Custom Images](https://github.com/GoogleCloudPlatform/dataproc-custom-images)
* [Dataproc Spawner](https://github.com/GoogleCloudPlatform/dataprocspawner)

### Connectors
* [Hadoop/Spark GCS Connector](https://github.com/GoogleCloudPlatform/bigdata-interop/tree/master/gcs)
* [Spark BigQuery Connector](https://github.com/GoogleCloudPlatform/spark-bigquery-connector)
* [Hadoop BigQuery Connector](https://github.com/GoogleCloudPlatform/bigdata-interop/tree/master/bigquery)
* [Spark Pubsub Connector](https://github.com/GoogleCloudPlatform/bigdata-interop/tree/master/pubsub)
* [Spark Spanner Connector](https://github.com/GoogleCloudPlatform/cloud-spanner-spark-connector)
* [Hive BigQuery Storage Handler](https://github.com/GoogleCloudPlatform/hive-bigquery-storage-handler)

### Kubernetes Operators
* [Spark Kubernetes operator](https://github.com/GoogleCloudPlatform/spark-on-k8s-operator)
* [Flink Kubernetes operator](https://github.com/GoogleCloudPlatform/flink-on-k8s-operator)

### Examples
* [Dataproc Python
examples](https://github.com/GoogleCloudPlatform/python-docs-samples/tree/master/dataproc)
* [Dataproc Pubsub Spark Streaming example](https://github.com/GoogleCloudPlatform/dataproc-pubsub-spark-streaming)
* [Dataproc Java Bigtable sample](https://github.com/GoogleCloudPlatform/cloud-bigtable-examples/tree/master/java/dataproc-wordcount)
* [Dataproc Spark-Bigtable samples](https://github.com/GoogleCloudPlatform/cloud-bigtable-examples/tree/master/scala)

## For more information
For more information, review the [Dataproc
documentation](https://cloud.google.com/dataproc/docs/). You can also
pose questions to the [Stack
Overflow](http://stackoverflow.com/questions/tagged/google-cloud-dataproc) community
with the tag `google-cloud-dataproc`.
See our other [Google Cloud Platform GitHub
repos](https://github.com/GoogleCloudPlatform) for sample applications and
scaffolding for other frameworks and use cases.

## Contributing changes
* See [CONTRIBUTING.md](CONTRIBUTING.md)
## Licensing
* See [LICENSE](LICENSE)