{"id":18304111,"url":"https://github.com/googleclouddataproc/initialization-actions","last_synced_at":"2025-04-08T11:08:16.897Z","repository":{"id":37921862,"uuid":"44124590","full_name":"GoogleCloudDataproc/initialization-actions","owner":"GoogleCloudDataproc","description":"Run in all nodes of your cluster before the cluster starts - lets you customize your cluster","archived":false,"fork":false,"pushed_at":"2024-03-28T06:48:49.000Z","size":35494,"stargazers_count":581,"open_issues_count":60,"forks_count":510,"subscribers_count":68,"default_branch":"master","last_synced_at":"2024-03-28T07:40:43.474Z","etag":null,"topics":["google-cloud-dataproc"],"latest_commit_sha":null,"homepage":"https://cloud.google.com/dataproc/init-actions","language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/GoogleCloudDataproc.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2015-10-12T17:58:35.000Z","updated_at":"2024-04-15T15:25:52.252Z","dependencies_parsed_at":"2023-02-08T13:45:34.564Z","dependency_job_id":"6f068820-25c0-45bd-947d-98aa6017a42b","html_url":"https://github.com/GoogleCloudDataproc/initialization-actions","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GoogleCloudDataproc%2Finitialization-actions","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GoogleCloudDataproc%2Finitialization-actions/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GoogleCloudDataproc%2Fin
itialization-actions/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GoogleCloudDataproc%2Finitialization-actions/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/GoogleCloudDataproc","download_url":"https://codeload.github.com/GoogleCloudDataproc/initialization-actions/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247829491,"owners_count":21002995,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["google-cloud-dataproc"],"created_at":"2024-11-05T15:27:49.349Z","updated_at":"2025-04-08T11:08:16.879Z","avatar_url":"https://github.com/GoogleCloudDataproc.png","language":"Shell","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Cloud Dataproc Initialization Actions\n\nWhen creating a [Dataproc](https://cloud.google.com/dataproc/) cluster, you can specify [initialization actions](https://cloud.google.com/dataproc/init-actions) in executables and/or scripts that Dataproc will run on all nodes in your Dataproc cluster immediately after the cluster is set up. Initialization actions often set up job dependencies, such as installing Python packages, so that jobs can be submitted to the cluster without having to install dependencies when the jobs are run.\n\n## How initialization actions are used\n\nInitialization actions must be stored in a [Cloud Storage](https://cloud.google.com/storage) bucket and can be passed as a parameter to the `gcloud` command or the `clusters.create` API when creating a Dataproc cluster. 
For example, to specify an initialization action when creating a cluster with the `gcloud` command, you can run:\n\n```bash\ngcloud dataproc clusters create \u003cCLUSTER_NAME\u003e \\\n    [--initialization-actions [GCS_URI,...]] \\\n    [--initialization-action-timeout TIMEOUT]\n```\n\nDuring development, you can create a Dataproc cluster using Dataproc-provided\n[regional](https://cloud.google.com/dataproc/docs/concepts/regional-endpoints) initialization\nactions buckets (for example `goog-dataproc-initialization-actions-us-east1`):\n\n```bash\nREGION=\u003cregion\u003e\nCLUSTER=\u003ccluster_name\u003e\ngcloud dataproc clusters create ${CLUSTER} \\\n    --region ${REGION} \\\n    --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/presto/presto.sh\n```\n\n**:warning: NOTICE:** For production usage, before creating clusters, it is strongly recommended\nthat you copy initialization actions to your own Cloud Storage bucket to guarantee consistent use of the\nsame initialization action code across all Dataproc cluster nodes and to prevent unintended upgrades\nfrom upstream in the cluster:\n\n```bash\nBUCKET=\u003cyour_init_actions_bucket\u003e\nCLUSTER=\u003ccluster_name\u003e\ngsutil cp presto/presto.sh gs://${BUCKET}/\ngcloud dataproc clusters create ${CLUSTER} --initialization-actions gs://${BUCKET}/presto.sh\n```\n\nYou can decide when to sync your copy of the initialization action with any changes to the initialization action that occur in the GitHub repository. Doing this is also useful if you want to modify initialization actions to meet your needs.\n\n## Why these samples are provided\n\nThese samples are provided to show how various packages and components can be installed on Dataproc clusters. You should understand how these samples work before running them on your clusters. 
The initialization actions provided in this repository are provided **without support** and you **use them at your own risk**.\n\n## Actions provided\n\nThis repository currently offers the following actions for use with Dataproc clusters.\n\n* Install additional Apache Hadoop ecosystem components\n  * [Alluxio](https://www.alluxio.io/)\n  * [Apache Drill](http://drill.apache.org)\n  * [Apache Flink](http://flink.apache.org)\n  * [Apache Gobblin](https://gobblin.apache.org/)\n  * [Apache Hive HCatalog](https://cwiki.apache.org/confluence/display/Hive/HCatalog)\n  * [Apache Kafka](http://kafka.apache.org)\n  * [Apache Livy](https://livy.incubator.apache.org/)\n  * [Apache Oozie](http://oozie.apache.org)\n  * [Apache ZooKeeper](http://zookeeper.apache.org)\n  * [Presto](http://prestodb.io)\n* Improve data science and interactive experiences\n  * [Miniconda](https://conda.io/docs/)\n  * [Apache Zeppelin](http://zeppelin.apache.org)\n  * [RStudio Server](https://www.rstudio.com/products/rstudio/#Server)\n  * [Intel BigDL](https://bigdl-project.github.io)\n  * [Hue](http://gethue.com)\n* Configure the environment\n  * Configure a *nice* shell environment\n  * To switch to Python 3, use the conda initialization action\n* Connect to Google Cloud Platform services\n  * Install alternate versions of the [Cloud Storage and BigQuery connectors](https://github.com/GoogleCloudPlatform/bigdata-interop/releases). 
[Specific versions](https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-versions) of these connectors come pre-installed on Cloud Dataproc clusters.\n  * Share a [Cloud SQL](https://cloud.google.com/sql/) Hive Metastore, or simply read/write data from Cloud SQL.\n* Set up monitoring\n  * [Stackdriver](https://cloud.google.com/stackdriver/)\n  * [Ganglia](http://ganglia.info/)\n\n## Removed actions\n\nPreviously, this repo provided init actions for the following, which have\nsince been removed because equivalent functionality is now provided directly\nby Dataproc:\n\n* [Apache Tez](http://tez.apache.org): This is now pre-installed in all\n  current Dataproc image versions.\n* [Datalab](https://cloud.google.com/datalab/): Datalab has been replaced by\n  Vertex AI Workbench, which integrates with Dataproc.\n* [Jupyter](http://jupyter.org/): This has been replaced with the\n  [Jupyter Optional Component](https://cloud.google.com/dataproc/docs/concepts/components/jupyter).\n\n## Initialization actions on single node clusters\n\n[Single Node clusters](https://cloud.google.com/dataproc/docs/concepts/single-node-clusters) have `dataproc-role` set to `Master` and `dataproc-worker-count` set to `0`. Most of the initialization actions in this repository should work out of the box because they run only on the master. Examples include notebooks, such as Apache Zeppelin, and libraries, such as Apache Tez. Actions that run on all nodes of the cluster, such as cloud-sql-proxy, also work out of the box.\n\nSome initialization actions are known **not to work** on Single Node clusters. 
All of these expect to have daemons on multiple nodes.\n\n* Apache Drill\n* Apache Flink\n* Apache Kafka\n* Apache ZooKeeper\n\nFeel free to send pull requests or file issues if you have a good use case for running one of these actions on a Single Node cluster.\n\n## Using cluster metadata\n\nDataproc sets special [metadata values](https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/metadata)\nfor the instances that run in your cluster. You can use these values to customize the behavior of\ninitialization actions, for example:\n\n```bash\nROLE=$(/usr/share/google/get_metadata_value attributes/dataproc-role)\nif [[ \"${ROLE}\" == 'Master' ]]; then\n  ... master specific actions ...\nelse\n  ... worker specific actions ...\nfi\n```\n\nYou can also use the `--metadata` flag of the `gcloud dataproc clusters create` command to provide your own\ncustom metadata:\n\n```bash\ngcloud dataproc clusters create cluster-name \\\n    --initialization-actions ... \\\n    --metadata name1=value1,name2=value2,... \\\n    ... other flags ...\n```\n\n## For more information\n\nFor more information, review the [Dataproc documentation](https://cloud.google.com/dataproc/init-actions). You can also pose questions to the [Stack Overflow](http://www.stackoverflow.com) community with the tag `google-cloud-dataproc`.\nSee our other [Google Cloud Platform GitHub\nrepos](https://github.com/GoogleCloudPlatform) for sample applications and\nscaffolding for other frameworks and use cases.\n\n### Mailing list\n\nSubscribe to [cloud-dataproc-discuss@google.com](https://groups.google.com/forum/#!forum/cloud-dataproc-discuss) for announcements and discussion.\n\n## Contributing changes\n\n* See [CONTRIBUTING.md](CONTRIBUTING.md)\n\n## Licensing\n\n* See [LICENSE](LICENSE)\n\n## FAQ\n1. 
You might see an error message similar to the following when upgrading the agent, installing the agent, or running apt-get update on Debian/Ubuntu Linux:\n```\nE: Repository 'https://packages.cloud.google.com/apt google-cloud-monitoring-buster-all InRelease' changed its 'Origin' value from 'google-cloud-monitoring-buster' to 'namespaces/cloud-ops-agents-artifacts/repositories/google-cloud-monitoring-buster-all'\nE: Repository 'https://packages.cloud.google.com/apt google-cloud-monitoring-buster-all InRelease' changed its 'Label' value from 'google-cloud-monitoring-buster' to 'namespaces/cloud-ops-agents-artifacts/repositories/google-cloud-monitoring-buster-all'\n```\nThis message indicates that the package repository cache may have diverged from its source. To resolve this, run the following command:\n\n```\napt-get --allow-releaseinfo-change update\n```\n\nThen, run the upgrade or install again.\n\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgoogleclouddataproc%2Finitialization-actions","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgoogleclouddataproc%2Finitialization-actions","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgoogleclouddataproc%2Finitialization-actions/lists"}