{"id":18304034,"url":"https://github.com/googleclouddataproc/cloud-dataproc","last_synced_at":"2025-05-16T06:03:03.120Z","repository":{"id":40411114,"uuid":"72608223","full_name":"GoogleCloudDataproc/cloud-dataproc","owner":"GoogleCloudDataproc","description":"Cloud Dataproc: Samples and Utils","archived":false,"fork":false,"pushed_at":"2025-05-14T09:10:04.000Z","size":3223,"stargazers_count":203,"open_issues_count":6,"forks_count":128,"subscribers_count":34,"default_branch":"master","last_synced_at":"2025-05-14T10:35:44.386Z","etag":null,"topics":["google-cloud-dataproc"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/GoogleCloudDataproc.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2016-11-02T05:49:22.000Z","updated_at":"2025-05-14T09:09:55.000Z","dependencies_parsed_at":"2023-01-30T02:01:27.100Z","dependency_job_id":"e2fd54eb-d83f-4829-bf21-d51ad854229e","html_url":"https://github.com/GoogleCloudDataproc/cloud-dataproc","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GoogleCloudDataproc%2Fcloud-dataproc","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GoogleCloudDataproc%2Fcloud-dataproc/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GoogleCloudDataproc%2Fcloud-dataproc/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/re
positories/GoogleCloudDataproc%2Fcloud-dataproc/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/GoogleCloudDataproc","download_url":"https://codeload.github.com/GoogleCloudDataproc/cloud-dataproc/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254478160,"owners_count":22077675,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["google-cloud-dataproc"],"created_at":"2024-11-05T15:27:36.356Z","updated_at":"2025-05-16T06:03:03.072Z","avatar_url":"https://github.com/GoogleCloudDataproc.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Google Cloud Dataproc\n\nThis repository contains code and documentation for use with\n[Google Cloud Dataproc](https://cloud.google.com/dataproc/).\n\n## Samples in this Repository\n * `codelabs/opencv-haarcascade` provides the source code for the [OpenCV Dataproc Codelab](https://codelabs.developers.google.com/codelabs/cloud-dataproc-opencv/index.html), which demonstrates a Spark job that adds facial detection to a set of images. 
\n* `codelabs/spark-bigquery` provides the source code for the [PySpark for Preprocessing BigQuery Data Codelab](https://codelabs.developers.google.com/codelabs/pyspark-bigquery/index.html), which demonstrates using PySpark on Cloud Dataproc to process data from BigQuery.\n* `codelabs/spark-nlp` provides the source code for the [PySpark for Natural Language Processing Codelab](https://codelabs.developers.google.com/codelabs/spark-nlp/index.html), which demonstrates using the [spark-nlp](https://github.com/JohnSnowLabs/spark-nlp) library for Natural Language Processing.\n* `notebooks/ai-ml/` provides source code for Spark for AI/ML use cases, including a [PyTorch](https://pytorch.org/) sample for image classification.\n* `notebooks/python` provides example Jupyter notebooks to demonstrate using PySpark with the [BigQuery Storage Connector](https://github.com/GoogleCloudPlatform/spark-bigquery-connector) and the [Spark GCS Connector](https://github.com/GoogleCloudPlatform/bigdata-interop/tree/master/gcs).\n * `spark-tensorflow` provides an example of using Spark as a preprocessing toolchain for TensorFlow jobs. 
Optionally,\n it demonstrates the [spark-tensorflow-connector](https://github.com/tensorflow/ecosystem/tree/master/spark/spark-tensorflow-connector) to convert CSV files to TFRecords.\n * `spark-translate` provides a simple demo Spark application that translates words using Google's Translation API and runs on Cloud Dataproc.\n * `gcloud` provides a set of scripts to provision Dataproc clusters for use in exercising arbitrary initialization-actions.\n\nSee each directory's README for more information.\n\n\n## Additional Dataproc Repositories\n\nYou can find more Dataproc resources in these GitHub repositories:\n\n### Dataproc projects\n* [Dataproc initialization\n  actions](https://github.com/GoogleCloudPlatform/dataproc-initialization-actions)\n* [GCP Token Broker](https://github.com/GoogleCloudPlatform/gcp-token-broker)\n* [Dataproc Custom Images](https://github.com/GoogleCloudPlatform/dataproc-custom-images)\n* [Dataproc Spawner](https://github.com/GoogleCloudPlatform/dataprocspawner)\n\n### Connectors\n* [Hadoop/Spark GCS Connector](https://github.com/GoogleCloudPlatform/bigdata-interop/tree/master/gcs)\n* [Spark BigQuery Connector](https://github.com/GoogleCloudPlatform/spark-bigquery-connector)\n* [Hadoop BigQuery Connector](https://github.com/GoogleCloudPlatform/bigdata-interop/tree/master/bigquery)\n* [Spark Pubsub Connector](https://github.com/GoogleCloudPlatform/bigdata-interop/tree/master/pubsub)\n* [Spark Spanner Connector](https://github.com/GoogleCloudPlatform/cloud-spanner-spark-connector)\n* [Hive BigQuery Storage Handler](https://github.com/GoogleCloudPlatform/hive-bigquery-storage-handler)\n\n### Kubernetes Operators\n* [Spark Kubernetes operator](https://github.com/GoogleCloudPlatform/spark-on-k8s-operator)\n* [Flink Kubernetes operator](https://github.com/GoogleCloudPlatform/flink-on-k8s-operator)\n\n### Examples\n* [Dataproc Python\n  examples](https://github.com/GoogleCloudPlatform/python-docs-samples/tree/master/dataproc)\n* [Dataproc 
Pubsub Spark Streaming example](https://github.com/GoogleCloudPlatform/dataproc-pubsub-spark-streaming)\n* [Dataproc Java Bigtable sample](https://github.com/GoogleCloudPlatform/cloud-bigtable-examples/tree/master/java/dataproc-wordcount)\n* [Dataproc Spark-Bigtable samples](https://github.com/GoogleCloudPlatform/cloud-bigtable-examples/tree/master/scala)\n\n## For more information\nFor more information, review the [Dataproc\ndocumentation](https://cloud.google.com/dataproc/docs/). You can also\npose questions to the [Stack\nOverflow](http://stackoverflow.com/questions/tagged/google-cloud-dataproc) community\nwith the tag `google-cloud-dataproc`.\nSee our other [Google Cloud Platform GitHub\nrepos](https://github.com/GoogleCloudPlatform) for sample applications and\nscaffolding for other frameworks and use cases.\n\n## Contributing changes\n\n* See [CONTRIBUTING.md](CONTRIBUTING.md)\n\n## Licensing\n\n* See [LICENSE](LICENSE)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgoogleclouddataproc%2Fcloud-dataproc","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgoogleclouddataproc%2Fcloud-dataproc","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgoogleclouddataproc%2Fcloud-dataproc/lists"}