{"id":15221529,"url":"https://github.com/googlecloudplatform/nvidia-nemo-on-gke","last_synced_at":"2025-10-20T00:32:32.776Z","repository":{"id":227892797,"uuid":"748694705","full_name":"GoogleCloudPlatform/nvidia-nemo-on-gke","owner":"GoogleCloudPlatform","description":"Training NVIDIA NeMo Megatron Large Language Model (LLM) using NeMo Framework on Google Kubernetes Engine","archived":false,"fork":false,"pushed_at":"2024-11-19T16:21:37.000Z","size":990,"stargazers_count":12,"open_issues_count":4,"forks_count":5,"subscribers_count":15,"default_branch":"main","last_synced_at":"2024-12-18T08:41:11.274Z","etag":null,"topics":["gke","megatron-lm","nvidia","nvidia-gpu","nvidia-nemo"],"latest_commit_sha":null,"homepage":"https://github.com/GoogleCloudPlatform/nvidia-nemo-on-gke","language":"HCL","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/GoogleCloudPlatform.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-01-26T15:05:24.000Z","updated_at":"2024-11-27T04:12:54.000Z","dependencies_parsed_at":"2024-05-01T18:20:06.419Z","dependency_job_id":"71a99dca-ac29-4aab-a15f-6df1c3637317","html_url":"https://github.com/GoogleCloudPlatform/nvidia-nemo-on-gke","commit_stats":null,"previous_names":["googlecloudplatform/nvidia-nemo-on-gke"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GoogleCloudPlatform%2Fnvidia-nemo-on-gke","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GoogleCloudPlatform%2Fnvidia
-nemo-on-gke/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GoogleCloudPlatform%2Fnvidia-nemo-on-gke/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GoogleCloudPlatform%2Fnvidia-nemo-on-gke/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/GoogleCloudPlatform","download_url":"https://codeload.github.com/GoogleCloudPlatform/nvidia-nemo-on-gke/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":237236958,"owners_count":19277082,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["gke","megatron-lm","nvidia","nvidia-gpu","nvidia-nemo"],"created_at":"2024-09-28T15:05:38.284Z","updated_at":"2025-10-20T00:32:32.770Z","avatar_url":"https://github.com/GoogleCloudPlatform.png","language":"HCL","funding_links":[],"categories":[],"sub_categories":[],"readme":"# NeMo Framework on Google Kubernetes Engine (GKE) to train Megatron LM\n\nThis repository contains an end-to-end walkthrough of training NVIDIA's NeMo Megatron Large Language Model (LLM) using NeMo framework on Google Kubernetes Engine.\n\n## Table of Contents\n\n- [Introduction to NVIDIA NeMo Framework](#introduction-to-nvidia-nemo-framework)\n- [NVIDIA GPUs on Google Cloud](#nvidia-gpus-on-google-cloud)\n- [Training AI/ML workloads on GKE](#training-aiml-workloads-on-gke)\n- [Prerequisites](#prerequisites)\n  - [Hardware Requirements](#hardware-requirements)\n  - [Software Requirements](#software-requirements)\n- [Walkthrough](#walkthrough)\n  - [Security](#security)\n  - 
[Bootstrapping](#bootstrapping)\n  - [Cluster Setup](#cluster-setup)\n  - [Configure](#configure)\n  - [Model Training](#model-training)\n    - [Data curation](#data-curation-using-gcs-bucket)\n    - [NeMo training Image](#training-image)\n    - [Configure model parameters](#configure-model-parameters)\n    - [Launch training using Helm](#launch-training-using-helm)\n  - [Observations](#observations)\n    - [TensorBoard using Open-source](#tensorboard-using-open-source)\n    - [TensorBoard using Vertex AI](#tensorboard-using-vertex-ai)\n    - [Dashboard - Scalars](#dashboard---scalars)\n      - [Reduced Train Loss](#reduced-train-loss)\n      - [Train Step Timing](#train-step-timing)\n  - [Teardown](#teardown)\n  - [Troubleshooting](#troubleshooting)\n    - [Kueue Configuration issue](#kueue-configuration-issue)\n    - [TensorBoard Installation issue](#tensorboard-open-source-installation)\n    - [PVC deletion fails](#pvc-deletion-fails)\n    - [Docker login failure](#docker-login-failure)\n- [Beyond the Walkthrough](#beyond-the-walkthrough)\n- [Versioning](#versioning)\n- [Code of Conduct](#code-of-conduct)\n- [Contributing](#contributing)\n- [License](#license)\n\n## Introduction to NVIDIA NeMo Framework\n\nNVIDIA NeMo™ is an end-to-end platform for development of custom generative AI models anywhere. 
NVIDIA NeMo framework is designed for enterprise development. It uses NVIDIA's state-of-the-art technology to facilitate a complete workflow, from automated distributed data processing to training of large-scale bespoke models using sophisticated 3D parallelism techniques and, finally, deployment using retrieval-augmented generation for large-scale inference on an infrastructure of your choice, be it on-premises or in the cloud.\n\nFor enterprises running their business on AI, NVIDIA AI Enterprise provides a production-grade, secure, end-to-end software platform that includes NeMo as well as generative AI reference applications and enterprise support to streamline adoption. Now organizations can integrate AI into their operations, streamlining processes, enhancing decision-making capabilities, and ultimately driving greater value.\n\nTo understand more about NVIDIA NeMo, click [here](https://www.nvidia.com/en-us/ai-data-science/generative-ai/nemo-framework/)\n\n## NVIDIA GPUs on Google Cloud\n\nGoogle Kubernetes Engine (GKE) supports a broad range of NVIDIA graphics processing units (GPUs) that can be attached to the GKE nodes containing one or more Compute Virtual Machine instances. These GPUs are purpose-built to accelerate a diverse array of AI/ML workloads including Large Language Model (LLM) training and inference. Within GKE, virtual machines with NVIDIA GPUs are set up in passthrough mode. This setup grants VMs direct control over the GPUs and their associated memory.\n\nThe table below lists the GPUs supported in GKE. The detailed list is available [here](https://cloud.google.com/compute/docs/gpus#nvidia_gpus_for_compute_workloads)\n\n|  | NVIDIA GPU | VM | Machine-type(s) | GPUs | vCPUs | Memory (GB) |\n|---|---|---|---|---|---|---|\n| 1. 
| [H100](\u003chttps://www.nvidia.com/en-us/data-center/h100/\u003e) \u003csup\u003e1\u003c/sup\u003e | [A3](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a3-vms) | `a3-highgpu-8g` | 8 | 208 | 1872 |\n| 2. | [A100](https://www.nvidia.com/en-us/data-center/a100/) | [A2 standard](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a2-standard-vms) | `a2-highgpu-1g` - `a2-highgpu-8g` \u003cbr\u003e `a2-megagpu-16g` | 1-16 | 12-96 | 85-1360 |\n| 3. | [A100](https://www.nvidia.com/en-us/data-center/a100/) | [A2 ultra 80GB](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a2-ultra-vms) | `a2-ultragpu-1g` - `a2-ultragpu-8g`| 1-8 | 12-96 | 170-1360 |\n\n[1] The walkthrough below requires access to A3 VMs.\n\n## Training AI/ML workloads on GKE\n\nGKE is prevalent in web, stateful and batch workloads, and its footprint in the AI/ML space for training and inference is increasing. It provides access to the latest NVIDIA GPUs, object and block storage, and fast networking. Inherent features such as auto-scaling, placements and self-healing, along with support for Local SSDs, GCS Fuse, Fast Socket, and gVNIC, augment the network, storage and communication performance across the stack. In addition, it has an ecosystem of first-party and third-party integrations, frameworks, tools and libraries.\n\nThis technical walkthrough demonstrates the process of training the NeMo Megatron model on A3 virtual machines in a Google Kubernetes Engine (GKE) environment using GPUDirect-TCPX technology. The A3 virtual machines are equipped with 8 NVIDIA H100 Tensor Core GPUs and 4 network interface controllers (NICs), each with a bandwidth of 200 gigabits per second. This configuration uses GPUDirect-TCPX to facilitate direct transfers between GPUs and NICs.\n\nNVIDIA NeMo framework sits on top of the GKE setup. 
The NeMo training image (available in [NVIDIA NGC](https://www.nvidia.com/en-us/gpu-cloud/)) is used to train the model along with PyTorch and NVIDIA CUDA libraries. The infrastructure can be used by more than one team to build, train, customize and deploy LLMs for GenAI.\n\n[\u003cimg src=\"images/1.high-level-arch.png\" width=\"750\"/\u003e](HighLevelArch)\n\n## High-Level flow\n\n[\u003cimg src=\"images/6.setup-to-results.png\" width=\"600\"/\u003e](HighLevelFlow)\n\nThe logical flow of the walkthrough consists of the steps below:\n\n1. **Setup**: Infrastructure is provisioned: GKE Cluster with Node pools, GPUs and SSDs, Artifact Registry\n2. **Configure**: Configurations of components such as Kubernetes Kueue, Filestore and GCS for storage, GKE networking (Device mode) and TensorBoard\n3. **Onboard**: Public dataset (for instance, Wikipedia) is made ready by tokenizing using GPT2BPETokenizer\n4. **Training**: NVIDIA NeMo framework Image is used to train the Megatron-LM\n5. **Results**: Model training results can be viewed in TensorBoard to assess and re-train as necessary\n\n## Prerequisites\n\n### Hardware Requirements\n\nAs the walkthrough depends on the availability of NVIDIA H100 GPUs accessible as A3 machines on Google Cloud, it is important you have access to them. You may first need to request a quota allocation for your project: [submit a quota request](https://cloud.google.com/docs/quota/view-manage#requesting_higher_quota) for the [GPU quota](https://cloud.google.com/compute/resource-usage#gpu_quota) named `NVIDIA_H100_GPUS`.\n\n### Enabling AI Infrastructure for Regulated Organizations\n\nGoogle Cloud’s [Assured Workloads](https://cloud.google.com/security/products/assured-workloads?e=48754805\u0026hl=en) helps ensure that regulated organizations across the public and private sector can accelerate AI innovation while meeting their compliance and security requirements. 
Assured Workloads provides control packages to support the creation of compliant boundaries in Google Cloud. A control package is a set of controls that, when combined, supports the regulatory baseline for a compliance statute or regulation. These controls include mechanisms to enforce data residency, data sovereignty, personnel access, and more.\n\nWe encourage you to evaluate Assured Workloads' [control packages](https://cloud.google.com/assured-workloads/docs/control-packages) and decide whether a control package is required for your organization to meet its regulatory and compliance requirements. If so, we recommend you first deploy Assured Workloads using [this repository](https://github.com/GoogleCloudPlatform/assured-workloads-terraform), allowing you to maintain your regulatory and compliance requirements, before running these labs.\n\nNote that unsupported products are not recommended for use by Assured Workloads customers without due diligence and waivers from your regulatory agencies or divisions.\n\n### Software Requirements\n\nThe following are required in order to complete the walkthrough in your Google Cloud project:\n\n1. [Google Cloud Project](https://console.cloud.google.com) with billing enabled\n2. [gcloud CLI](https://cloud.google.com/sdk/docs/install)\n3. [Terraform](https://developer.hashicorp.com/terraform/tutorials/gcp-get-started/install-cli)\n4. [Git](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git)\n5. [Kubectl](https://kubernetes.io/docs/tasks/tools/install-kubectl-linux/)\n6. 
Access to NVIDIA GPU Cloud (NGC)\n    - Install [NGC CLI](https://docs.ngc.nvidia.com/cli/cmd.html)\n    - [NGC Big NLP Training Docker images](https://registry.ngc.nvidia.com/orgs/ea-bignlp/containers/bignlp-training)\n\n## Walkthrough\n\n\u003e [!NOTE]\n\u003e Before you proceed further, ensure all CLI tools listed [above](#software-requirements) are installed and configured.\n\n### Security\n\nThe walkthrough uses the default Compute Engine service account.\n\n### Bootstrapping\n\nThe bootstrap phase initializes the Google Cloud project and the Terraform environment for resource state management.\n\n1. Download [source](https://github.com/GoogleCloudPlatform/NVIDIA-nemo-on-gke.git)\n\n    ```console\n    git clone https://github.com/GoogleCloudPlatform/NVIDIA-nemo-on-gke.git\n    cd NVIDIA-nemo-on-gke/infra\n    ```\n\n2. Configure gcloud\n\n    ```console\n    gcloud auth login\n    gcloud auth application-default login\n    gcloud config set project [project id]\n    ```\n\n3. Configure environment per requirements in [terraform.auto.tfvars](./infra/1-bootstrap/terraform.auto.tfvars)\n\n    | Variable | Description | Default | Need update? |\n    |---|---|---|---|\n    | `project_id` | Google Project ID | \u003c\u003e | *Yes* |\n    | `tf_state_bucket.name` | GCS Bucket for terraform state management | \u003c\u003e | *Yes* |\n    | `tf_state_bucket.location` | GCP Region | \u003c\u003e | *Yes* |\n\n    Save changes in the `terraform.auto.tfvars` file\n\n    ```console\n    cd 1-bootstrap\n    terraform init\n    terraform apply\n    ```\n\n    **Validate**: Terraform finishes successfully.\n\n    ```console\n    terraform apply\n    Apply complete! Resources: 0 added, 0 changed, 0 destroyed.\n    ```\n\n### Cluster Setup\n\nThe Cluster setup provisions:\n\n- [ ] GKE cluster with 2 node pools:\n  - [ ] Default: One `e2-standard-4` instance\n  - [ ] Managed: Multiple `a3-highgpu-8g` instances with H100 GPUs\n- [ ] 5 VPCs each with a subnet\n- [ ] NVIDIA GPU drivers installed\n\n1. 
Configure Cluster per requirements in [terraform.auto.tfvars](./infra/2-setup/terraform.auto.tfvars)\n\n    | Variable | Description | Default | Need update? |\n    |---|---|---|---|\n    | `zone` | GCP zone within region to provision resources | \u003c\u003e | *Yes* |\n    | `cluster_prefix` | Name of GKE Cluster | `gke-nemo-dev` | *Yes* |\n    | `gke_version` | Stable GKE version | `1.27.8-gke.1067004` | *Optional* |\n    | `node_count` | Number of nodes in Managed node pool | 2 | *Yes* |\n    | `node_default_count` | Number of nodes in Default node pool | 1 | *Optional* |\n    | `node_default_type` | Machine type | `e2-standard-4` | *Optional* |\n\n    Save changes in the `terraform.auto.tfvars` file\n\n2. Create Cluster\n\n    ```console\n    cd ../2-setup\n    terraform init\n    terraform apply\n    ```\n\n    \u003e [!NOTE]\n    \u003e This setup could take 15-30 mins depending on the size of the node pool.\n\n#### Validate Setup\n\n```console\nterraform output\n`cluster_prefix`: '\u003cGKE Cluster name\u003e'\n`cluster_location`: '\u003cGKE Cluster location\u003e'\n```\n\n```console\nkubectl get pods -n kube-system | grep nvidia\n```\n\n**Expected:** 3 NVIDIA pods running\n\n### Configure\n\nThe Configure step updates the cluster with:\n\n- [ ] [Cloud Filestore](https://cloud.google.com/filestore?hl=en) with a configurable tier\n- [ ] [Kueue](https://kueue.sigs.k8s.io/) for job management\n- [ ] NVIDIA GPU drivers\n- [ ] Enable Device mode in VPC [1-4]\n- [ ] TensorBoard\n\n1. Update variables per requirements in [terraform.auto.tfvars](./infra/3-config/terraform.auto.tfvars)\n\n    | Variable | Description | Default | Need update? 
|\n    |---|---|---|---|\n    | `kueue_version` | Kueue version to be installed | `v0.5.2` | *Optional* |\n    | `kueue_cluster_name` | Name of Cluster-scoped object to manage pool of resources | `a3-queue` | *Optional* |\n    | `kueue_local_name` | Name of LocalQueue to accept workloads | `a3-queue` | *Optional* |\n    | `storage_tier` | Filestore service tier | `enterprise` | *Optional* |\n    | `storage_size` | Filestore capacity | `1Ti` | *Optional* |\n\n    Save changes in the `terraform.auto.tfvars` file\n\n2. Configure Cluster\n\n    ```console\n    cd ../3-config\n    terraform init\n    terraform apply\n    ```\n\n    \u003e [!NOTE]\n    \u003e This setup could take up to 10 mins.\n\n#### Validate Configuration\n\n```console\nterraform apply\nApply complete! Resources: 0 added, 0 changed, 0 destroyed.\n```\n\n```console\nkubectl -n kueue-system get pods\n```\n\n**Expected:** `kueue-controller-manager-xx` in `Running` status\n\n```console\nkubectl get clusterqueue\n```\n\n**Expected:** `a3-queue`\n\n```console\nkubectl get pods -A | grep -E \"tensorboard|inverse-proxy\"\n```\n\n**Expected:** Pods named `tensorboard-xx-yy` and `inverse-proxy` in `Running` status\n\n### Model Training\n\n#### Data curation using GCS Bucket\n\nBelow are the details of the dataset used for training.\n\n  |  | Details |\n  |---|---|\n  | Dataset | Wikipedia |\n  | Source | [Link](https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2) |\n  | Tokenizing | [WikiExtractor](https://github.com/attardi/wikiextractor) |\n  | GCS Location | `gs://nemo-megatron-demo/training-data/processed/gpt/wikitext` |\n  \n\u003e [!NOTE]\n\u003e For your own custom data, follow these steps:\n\n- [ ] Create GCS Bucket to host training data\n`gcloud storage buckets create gs://\u003cunique-bucket-name\u003e --location=$REGION`\n\n- [ ] Dataset needs to be tokenized in a format compatible with 
[Megatron-LM](https://github.com/NVIDIA/Megatron-LM?tab=readme-ov-file#data-preprocessing) for tokenizer type `GPT2BPETokenizer`\n\n- [ ] Upload the data to the bucket\n`gcloud storage cp my-training-data.{idx,bin} gs://\u003cunique-bucket-name\u003e`\n\n    \u003e :warning: Training data cannot exceed the size of the local SSD (6TB). Each node has 16 local SSDs of 375GB each. For larger sizes the shared Filestore or GCS Fuse can be used.\n\n#### Training Image\n\nThe latest NVIDIA NeMo Framework Training image available at [NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo) is used. The current version is `nvcr.io/nvidia/nemo:25.02`\n\n#### Configure model parameters\n\nThe file [nemo-example/selected-configuration.yaml](./infra/4-training/nemo-example/selected-configuration.yaml) is a [NeMo Megatron](https://github.com/NVIDIA/NeMo) compatible configuration file. It is soft-linked to the GPT-5B file at `nemo-example/nemo-configurations/gpt-5b.yaml`.\n\n\u003e [!NOTE]\n\u003e For the initial run, use the same file. For future launches, review and edit the configuration. [NeMo Megatron Launcher](https://github.com/NVIDIA/NeMo-Megatron-Launcher/tree/master/launcher_scripts/conf/training) has examples of alternate models and model sizes.\n\n#### Launch training using Helm\n\nLaunch the GPT model training across the desired nodes\n\n1. A few values can be provided on the Helm command line\n\n  | Variable | Description | Default | Need update? 
|\n  |---|---|---|---|\n  | `workload.gpus` | Total number of GPUs | `16` | *Optional* |\n  | `workload.image` | Image version for NeMo training | `nvcr.io/nvidia/nemo:25.02` | *Optional* |\n  | `queue` | LocalQueue name that submits to ClusterQueue | `a3-queue` | *Optional* |\n\n  Launch Helm workload\n\n  ```console\n  cd nemo-example/\n  helm install --set workload.gpus=16 \\\n  --set workload.image=nvcr.io/nvidia/nemo:25.02 \\\n  --set queue=a3-queue \\\n  $USER-nemo-$(date +%s) .\n  ```\n\nAlternatively, you can launch training using Terraform\n\n1. Update variables per requirements in [terraform.auto.tfvars](./infra/4-training/terraform.auto.tfvars)\n\n  | Variable | Description | Default | Need update? |\n  |---|---|---|---|\n  | `training_image_name` | NVIDIA NeMo Framework Image | `nvcr.io/nvidia/nemo:25.02` | *Optional* |\n  | `kueue_name` | LocalQueue name that submits to ClusterQueue | `a3-queue` | *Optional* |\n\n  Save changes in the `terraform.auto.tfvars` file\n\n  ```console\n  cd ../4-training\n  terraform init\n  TF_VAR_user=$USER terraform apply\n  ```\n\n#### Validate Job\n\n```console\nkubectl get pods -A | grep nemo\n```\n\n**Expected:** `$USER-nemo-gpt-5b-YYYYMMDDHHmmss` pods in `Running` status\n\n\u003e [!NOTE]\n\u003e The workload might take 30 to 60 minutes to complete for the 5B file. Training duration depends on cluster size.\n\n### Observations\n\nThere are a couple of ways to launch TensorBoard and view the results. TensorBoard provides in-depth dashboards to track and visualize key metrics such as accuracy and log loss. The logs from the training script are available in the Filestore instance.\n\n#### TensorBoard using Open-source\n\nTensorBoard was already set up in the cluster during the [Configure](#configure) step. 
Launch the dashboard by running the command below and clicking the link in the `Hostname` field.\n\n```console\nkubectl describe configmap inverse-proxy-config\n```\n\nAlternatively, you can set up port forwarding on the TensorBoard container\n\n```console\nkubectl get pods -A | grep tensorboard\nkubectl port-forward tensorboard-\u003csuffix\u003e :6006\n```\n\nOpen `http://localhost:\u003cforwarded-port\u003e/`\n\n#### TensorBoard using Vertex AI\n\nAlternatively, [Vertex AI TensorBoard](https://cloud.google.com/vertex-ai/docs/experiments/tensorboard-introduction) is the enterprise and managed version of the open-sourced [TensorBoard](https://www.tensorflow.org/tensorboard) for ML experiment visualization. The logs from the Filestore instance need to be copied to a Google Cloud Storage (GCS) bucket, and the TensorBoard instance pointed at that bucket.\n\n#### TensorBoard Setup\n\nYou can use the CLI to set up the TensorBoard instance and configure the instance to point to the GCS Bucket logs.\n\n- [ ] Create a virtual env\n\n```console\n\npython3 -m venv venv-tb\n\n```\n\n- [ ] Install required python packages\n\n```console\n\nsource venv-tb/bin/activate\npip install \"google-cloud-aiplatform[tensorboard]\"\n\n```\n\n- [ ] Set up TensorBoard instance\n\n```python\nfrom google.cloud import aiplatform\n\nproject_id = \"\u003cproject-id\u003e\"\nlocation = \"\u003cgcp-region\u003e\"\nexperiment_name = \"\u003cexperiment name\u003e\"\n\n# Create a Vertex AI TensorBoard instance named after the experiment\ntensorboard = aiplatform.Tensorboard.create(\n    display_name=experiment_name,\n    project=project_id,\n    location=location,\n)\n\n```\n\n- [ ] Launch TensorBoard Dashboard\n\nThe commands below use IPython/notebook syntax (`!` prefix and `{VAR}` interpolation), so run them in a notebook or IPython session.\n\n```console\n\nPROJECT_ID=\"\u003cproject-id\u003e\"\nREGION=\"\u003cgcp-region\u003e\"\nEXPERIMENT_NAME=\"\u003cexperiment-name\u003e\"\n\nTB_RESOURCE_NAME = !gcloud ai tensorboards list --region={REGION} \\\n--filter='display_name:{EXPERIMENT_NAME}' \\\n--format='value(name.basename())'\n\n!tb-gcp-uploader 
--tensorboard_resource_name projects/{PROJECT_ID}/locations/{REGION}/tensorboards/{TB_RESOURCE_NAME[1]} \\\n--logdir=gs://{EXPERIMENT_NAME}-ml-logs/nemo-experiments \\\n--experiment_name={EXPERIMENT_NAME} \\\n--one_shot=True\n\n```\n\nThe output includes a clickable link like the one below. Alternatively, you can find the experiment in the Google Cloud Console under [Vertex AI](https://console.cloud.google.com/vertex-ai/experiments/experiments)\n\n```console\n\nView your TensorBoard at https://\u003cgcp-region\u003e.tensorboard.googleusercontent.com/experiment/projects+\u003cproject-number\u003e+locations+\u003cgcp-region\u003e+tensorboards+\u003ctensor-board-id\u003e+experiments+\u003cexperiment-name\u003e\n\n```\n\n[\u003cimg src=\"images/3.tensorboard-main.png\" width=\"750\"/\u003e](TensorBoardMain)\n\n#### Dashboard - Scalars\n\nThe Scalars tab helps in understanding metrics such as loss and how they change as the training continues.\n\n##### Reduced train loss\n\nTraining loss is a fundamental metric in machine learning. It indicates how well your model fits the training data during each training step (iteration) or at the end of an epoch (a full pass through the dataset). In NeMo, the `reduced_train_loss` scalar is the training loss averaged (reduced) across data-parallel ranks. A value that goes down over time is generally a positive sign when training machine learning models. You can see this trend in the graphic below.\n\n[\u003cimg src=\"images/4.tensorboard-reduced-train-loss.png\" width=\"400\"/\u003e](TensorBoardReducedTrainLoss)\n\n##### Train step timing\n\nTrain step timing in TensorBoard refers to the amount of time taken to complete a single training step in your machine learning model. 
In the graphic below, the train step time decreases progressively during initial training and stabilizes during later stages.\n\n[\u003cimg src=\"images/5.tensorboard-train-step-timing.png\" width=\"400\"/\u003e](TensorBoardTrainStepTiming)\n\n### Teardown\n\n#### Deactivate Virtual Environment\n\n```bash\ndeactivate\n```\n\n#### Destroy environment\n\n\u003e :warning: All resources provisioned in this walkthrough will be destroyed.\n\n```bash\ncd ../3-config\nterraform destroy\n```\n\n```bash\ncd ../2-setup\nterraform destroy\n```\n\n```bash\ncd ../1-bootstrap\nterraform destroy\n```\n\n### Troubleshooting\n\nThis section lists errors that you might encounter during the walkthrough.\n\n#### Kueue Configuration issue\n\nYou might experience a Terraform deployment issue in [3-config](#configure), as it takes up to a minute for Kueue to be fully available before the Local and Cluster queues can be created.\n\n**Fix:** Re-run `terraform apply`\n\n#### TensorBoard (open-source) installation\n\nIn [3-config](#configure), it takes a few minutes for the Persistent Volume Claim (PVC) to reach the **Bound** state. It could take up to 20 mins before TensorBoard pods are in **Running** status\n\n**Fix:** Wait up to 20 mins.\n\n#### PVC deletion fails\n\n**Issue:** The PVC deletion might take longer than usual as the Finalizer run is blocked.\n\n**Fix:** Set the `finalizers` metadata to `null` to proceed.\n\n```console\nkubectl patch pvc cluster-filestore -p '{\"metadata\":{\"finalizers\":null}}'\n```\n\n#### Docker login failure\n\n**Issue:** Docker login to registry \u003chttps://us-central1-docker.pkg.dev\u003e fails\n\n**Fix:** Log in to Docker using an OAuth access token\n\n```console\ngcloud auth print-access-token | sudo docker login -u oauth2accesstoken \\\n--password-stdin https://us-central1-docker.pkg.dev\n```\n\n## Beyond the Walkthrough\n\nThis walkthrough can be adapted to different data locations. 
[BigQuery](https://cloud.google.com/bigquery?hl=en) is a fully-managed, serverless data warehouse by Google Cloud. BigQuery can be configured as a data source for training the model.\n\n## Versioning\n\nInitial version: February 2024\n\n## Code of Conduct\n\n[View](./CODE_OF_CONDUCT.md)\n\n## Contributing\n\n[View](./CONTRIBUTING.md)\n\n## License\n\n[View](./LICENSE)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgooglecloudplatform%2Fnvidia-nemo-on-gke","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgooglecloudplatform%2Fnvidia-nemo-on-gke","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgooglecloudplatform%2Fnvidia-nemo-on-gke/lists"}