{"id":18304027,"url":"https://github.com/googleclouddataproc/custom-images","last_synced_at":"2025-04-09T06:12:19.023Z","repository":{"id":38326445,"uuid":"191822275","full_name":"GoogleCloudDataproc/custom-images","owner":"GoogleCloudDataproc","description":"Tools for creating Dataproc custom images","archived":false,"fork":false,"pushed_at":"2025-03-27T15:26:59.000Z","size":565,"stargazers_count":32,"open_issues_count":13,"forks_count":62,"subscribers_count":28,"default_branch":"main","last_synced_at":"2025-04-02T05:08:01.143Z","etag":null,"topics":["google-cloud-dataproc"],"latest_commit_sha":null,"homepage":"https://cloud.google.com/dataproc/docs/guides/dataproc-images","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/GoogleCloudDataproc.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-06-13T19:34:47.000Z","updated_at":"2025-02-15T20:26:57.000Z","dependencies_parsed_at":"2022-08-24T02:41:03.078Z","dependency_job_id":"db905546-ceaf-4374-83a3-0ebc6a167d40","html_url":"https://github.com/GoogleCloudDataproc/custom-images","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GoogleCloudDataproc%2Fcustom-images","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GoogleCloudDataproc%2Fcustom-images/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GoogleCloudDataproc%2Fcustom-images/releases","manifests_url":"https://repo
s.ecosyste.ms/api/v1/hosts/GitHub/repositories/GoogleCloudDataproc%2Fcustom-images/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/GoogleCloudDataproc","download_url":"https://codeload.github.com/GoogleCloudDataproc/custom-images/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247987285,"owners_count":21028895,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["google-cloud-dataproc"],"created_at":"2024-11-05T15:27:34.838Z","updated_at":"2025-04-09T06:12:19.004Z","avatar_url":"https://github.com/GoogleCloudDataproc.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Build Dataproc custom images\n\nThis page describes how to generate a custom Dataproc image.\n\n## Important notes\n\nTo help ensure that clusters receive the latest service updates and bug fixes,\nthe creation of clusters with a custom image is limited to **365 days** from the\nimage creation date, but existing custom-image clusters can run indefinitely.\nAutomation to continuously build a custom image may be necessary if you wish to\ncreate clusters with a custom image for a period greater than 365 days.\n\nCreating clusters with expired custom images is possible by following these\n[instructions](https://cloud.google.com/dataproc/docs/guides/dataproc-images#how_to_create_a_cluster_with_an_expired_custom_image),\nbut Cloud Dataproc cannot guarantee support of issues that arise with these\nclusters.\n\n## Requirements\n\n1.  Python\n2.  gcloud\n3.  Bash 3.0.\n4.  
A GCE project with billing, Google Cloud Dataproc API, Google Compute Engine\n    API, Google Secret Manager API, and Google Cloud Storage APIs enabled.\n5.  Use `gcloud config set project \u003cyour-project\u003e` to specify which project to\n    use to create and save your custom image.\n\n## Generate custom image\n\nTo generate a custom image, run the following command:\n\n```shell\npython generate_custom_image.py \\\n    --image-name '\u003cnew_custom_image_name\u003e' \\\n    --dataproc-version '\u003cdataproc_version\u003e' \\\n    --customization-script '\u003ccustom_script_to_install_custom_packages\u003e' \\\n    --zone '\u003czone_to_create_instance_to_build_custom_image\u003e' \\\n    --gcs-bucket '\u003cgcs_bucket_to_write_logs\u003e'\n```\n\n### Arguments\n\n*   **--image-name**: The name for the custom image.\n*   **--dataproc-version**: The Dataproc version for this custom image\n    to build on. Examples: `2.2.32-debian12`, `2.2.31-debian12`,\n    `2.2.31-ubuntu22`. If the sub-minor version is unspecified, the\n    latest available one will be used.  Examples: `2.2-rocky9`,\n    `2.2-debian12`. For a complete list of Dataproc image versions,\n    please review the output of `gcloud compute images list --project\n    cloud-dataproc`. To understand Dataproc versioning, please refer\n    to the\n    [documentation](https://cloud.google.com/dataproc/docs/concepts/versioning/overview).\n    **This argument is mutually exclusive with `--base-image-uri` and\n    `--base-image-family`**.\n*   **--base-image-uri**: The full image URI for the base Dataproc image. The\n    customization script will be executed on top of this image instead of an\n    out-of-the-box Dataproc image. This image must be a valid Dataproc image.\n    **This argument is mutually exclusive with `--dataproc-version` and\n    `--base-image-family`**.\n*   **--base-image-family**: The image family that the boot disk will be\n    initialized with. 
The latest non-deprecated image from the family will be\n    used. An example base image family URI is\n    `projects/PROJECT_NAME/global/images/family/\u003cFAMILY_NAME\u003e`. To get the list\n    of image families (and the associated image), run `gcloud compute images\n    list [--project \u003cPROJECT_NAME\u003e]`. **This argument is mutually exclusive with\n    `--dataproc-version` and `--base-image-uri`**.\n*   **--customization-script**: The script used to install custom packages on\n    the image.\n*   **--zone**: The GCE zone for running your GCE instance.\n*   **--gcs-bucket**: A GCS bucket to store the custom image build logs.\n\n#### Optional Arguments\n\n*   **--family**: The family of the source image. This will cause the latest\n    non-deprecated image in the family to be used as the source image.\n*   **--project-id**: The project ID of the project where the custom image is\n    created and saved. The default project ID is the current project ID\n    specified in `gcloud config get-value project`.\n*   **--oauth**: The OAuth credential file used to call Google Cloud APIs. The\n    default OAuth is the application-default credentials from gcloud.\n*   **--machine-type**: The machine type used to build the custom image. The\n    default is `n1-standard-1`.\n*   **--no-smoke-test**: This parameter disables smoke testing of the\n    newly built custom image. The smoke test verifies that the newly\n    built custom image can create a functional Dataproc cluster. Disabling this\n    step will speed up the custom image build process; however, it is not\n    advised. Note: The smoke test creates a Dataproc cluster with the newly\n    built image, runs a short job, and deletes the cluster at the end.\n*   **--network**: This parameter specifies the GCE network to be used to launch\n    the GCE VM instance which builds the custom Dataproc image. The default\n    network is 'global/networks/default'. 
If the default network does not exist\n    in your project, please specify a valid network interface. For more\n    information on network interfaces, please refer to\n    [GCE VPC documentation](https://cloud.google.com/vpc/docs/vpc).\n*   **--subnetwork**: This parameter specifies the subnetwork that is used to\n    launch the VM instance that builds the custom Dataproc image. A full\n    subnetwork URL is required. The default subnetwork is None. For more\n    information, please refer to\n    [GCE VPC documentation](https://cloud.google.com/vpc/docs/vpc).\n*   **--no-external-ip**: This parameter disables the external IP for the\n    image build VM. The VM will not be able to access the internet, but if\n    [Private Google Access](https://cloud.google.com/vpc/docs/configure-private-google-access)\n    is enabled for the subnetwork, it can still access Google services (e.g.,\n    GCS) through the internal IP of the VPC.\n*   **--service-account**: The service account that is used to launch the VM\n    instance that builds the custom Dataproc image. The scope of this service\n    account defaults to \"/auth/cloud-platform\", which authorizes the VM instance\n    to access all cloud platform services that are granted by IAM roles.\n    Note: The IAM role must allow the VM instance to access the GCS bucket in\n    order to access scripts and write logs.\n*   **--extra-sources**: Additional files/directories uploaded along with the\n    customization script. This argument is evaluated to a JSON dictionary.\n*   **--disk-size**: The size in GB of the disk attached to the VM instance used\n    to build the custom image. The default is `40` GB.\n*   **--accelerator**: The accelerators (e.g. GPUs) attached to the VM instance\n    used to build the custom image. This flag supports the same\n    [values](https://cloud.google.com/sdk/gcloud/reference/compute/instances/create#--accelerator)\n    as the `gcloud compute instances create --accelerator` flag. 
By default no\n    accelerators are attached.\n*   **--base-image-uri**: The partial image URI for the base Dataproc image. The\n    customization script will be executed on top of this image instead of an\n    out-of-the-box Dataproc image. This image must be a valid Dataproc image.\n    The format of the partial image URI is the following:\n    `projects/\u003cproject_id\u003e/global/images/\u003cimage_name\u003e`.\n*   **--storage-location**: The storage location (e.g. US, us-central1) of the\n    custom GCE image. This flag supports the same\n    [values](https://cloud.google.com/sdk/gcloud/reference/compute/images/create#--storage-location)\n    as the `gcloud compute images create --storage-location` flag. If not\n    specified, the default GCE image storage location is used.\n*   **--shutdown-instance-timer-sec**: The time to wait in seconds before\n    shutting down the VM instance. This value may need to be increased if your\n    init script generates a lot of output on stdout. If not specified, the\n    default value of 300 seconds will be used.\n*   **--dry-run**: Dry run mode, which only validates input and generates the\n    workflow script without creating an image. Disabled by default.\n*   **--trusted-cert**: A certificate in DER format to be inserted\n    into the custom image's EFI boot sector.  Can be generated by\n    following examples/secure-boot/README.md.  This argument is mutually\n    exclusive with `--base-image-family`.\n*   **--metadata**: VM metadata which can be read by the customization script\n    with `/usr/share/google/get_metadata_value attributes/\u003ckey\u003e` at runtime. The\n    value of this flag takes the form of `key1=value1,key2=value2,...`. If the\n    value includes special characters (e.g., `=`, `,` or spaces) which need to\n    be escaped, consider encoding the value, then decoding it in the\n    customization script. 
See more information about VM metadata at\n    https://cloud.google.com/sdk/gcloud/reference/compute/instances/create.\n\n#### Overriding cluster properties with a custom image\n\nYou can use custom images to override any\n[cluster properties](https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/cluster-properties)\nset during cluster creation. If a user creates a cluster with your custom image\nbut sets cluster properties different from those you set with your custom image,\nyour custom image cluster property settings will take precedence.\n\nTo set cluster properties with your custom image:\n\nIn your custom image\n[customization script](https://cloud.google.com/dataproc/docs/guides/dataproc-images#running_the_code),\ncreate a `dataproc.custom.properties` file in `/etc/google-dataproc`, then set\ncluster property values in the file.\n\n*   Sample `dataproc.custom.properties` file contents:\n\n    ```shell\n    dataproc.conscrypt.provider.enable=true\n    dataproc.logging.stackdriver.enable=false\n    ```\n\n*   Sample customization script file-creation snippet to override two cluster\n    properties:\n\n    ```shell\n    cat \u003c\u003cEOF \u003e/etc/google-dataproc/dataproc.custom.properties\n    dataproc.conscrypt.provider.enable=true\n    dataproc.logging.stackdriver.enable=false\n    EOF\n    ```\n\n### Examples\n\n#### Create a custom image\n\nCreate a custom image with name `custom-image-1-5-9` with Dataproc version\n`1.5.9-debian10`:\n\n```shell\npython generate_custom_image.py \\\n    --image-name custom-image-1-5-9 \\\n    --dataproc-version 1.5.9-debian10 \\\n    --customization-script ~/custom-script.sh \\\n    --metadata 'key1=value1,key2=value2' \\\n    --zone us-central1-f \\\n    --gcs-bucket gs://my-test-bucket\n```\n\n#### Create a custom image without running smoke test\n\n```shell\npython generate_custom_image.py \\\n    --image-name custom-image-1-5-9 \\\n    --dataproc-version 1.5.9-debian10 \\\n    --customization-script 
~/custom-script.sh \\\n    --zone us-central1-f \\\n    --gcs-bucket gs://my-test-bucket \\\n    --no-smoke-test\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgoogleclouddataproc%2Fcustom-images","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgoogleclouddataproc%2Fcustom-images","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgoogleclouddataproc%2Fcustom-images/lists"}