https://github.com/ai-hypercomputer/cloud-tpu-monitoring-debugging
- Host: GitHub
- URL: https://github.com/ai-hypercomputer/cloud-tpu-monitoring-debugging
- Owner: AI-Hypercomputer
- License: apache-2.0
- Created: 2023-04-18T17:47:46.000Z (about 3 years ago)
- Default Branch: main
- Last Pushed: 2025-02-18T21:24:14.000Z (over 1 year ago)
- Last Synced: 2025-06-09T05:51:25.050Z (12 months ago)
- Language: HCL
- Size: 75.2 KB
- Stars: 12
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
# Cloud TPU Monitoring Debugging
## Overview
The Cloud TPU Monitoring Debugging repository contains the infrastructure and logic required to monitor and debug jobs running on Cloud TPU.
Terraform is used to deploy resources in your Google Cloud project.
Terraform is an open-source tool for setting up and managing Google Cloud
infrastructure from configuration files. This repository helps customers
deploy various Google Cloud resources through scripts, without manual
effort.
The [cloud-tpu-diagnostics PyPI package](https://pypi.org/project/cloud-tpu-diagnostics) contains the logic to monitor, debug, and profile jobs running on Cloud TPU.
## Getting Started with Terraform
- Follow [this link](https://developer.hashicorp.com/terraform/tutorials/gcp-get-started/install-cli) to install Terraform on your machine.
- Run `terraform init` to initialize the Google Cloud Terraform provider. This command downloads the necessary plugins and creates the `.terraform` directory.
- If the Terraform Google Cloud provider version is updated, run `terraform init --upgrade` for the update to take effect.
- Optionally, run `terraform plan` before deploying to validate resource declarations and catch syntax errors or version mismatches.
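To make concrete what `terraform init` and `terraform init --upgrade` resolve, here is a minimal sketch of a provider requirement block. The version constraints and the `my-gcp-project` project ID are illustrative assumptions, not values taken from this repository.

```hcl
# Illustrative only: the constraints below are assumptions, not the versions
# pinned in this repository.
terraform {
  required_version = ">= 1.3.0"

  required_providers {
    google = {
      source  = "hashicorp/google"
      version = ">= 4.57.0"
    }
  }
}

provider "google" {
  project = "my-gcp-project" # hypothetical project ID
}
```

With a block like this, `terraform init` selects a provider release that satisfies the constraint and records it in `.terraform.lock.hcl`, while `terraform init --upgrade` re-resolves to the newest release the constraint allows.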
### Configure Terraform to store state in Cloud Storage
By default, Terraform stores [state](https://www.terraform.io/docs/state/) locally in a file named `terraform.tfstate`. This default makes Terraform hard to use in teams, especially when several users run Terraform at the same time and each machine has its own view of the current infrastructure. To avoid such issues, this section configures a remote state that points to a Google Cloud Storage (GCS) bucket.
1. In Cloud Shell, create the GCS bucket:
    gsutil mb gs://${GCS_BUCKET_NAME}
2. Enable [Object Versioning](https://cloud.google.com/storage/docs/object-versioning) to keep the history of your deployments. Enabling Object Versioning increases [storage costs](https://cloud.google.com/storage/pricing), which you can mitigate by configuring [Object Lifecycle Management](https://cloud.google.com/storage/docs/lifecycle) to delete old state versions.
    gsutil versioning set on gs://${GCS_BUCKET_NAME}
3. When you run `terraform init` to initialize Terraform, enter the name of the GCS bucket created above at the prompt:
    Initializing the backend...
    bucket
      The name of the Google Cloud Storage bucket
      Enter a value:
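Terraform shows a bucket prompt like this when the `gcs` backend is declared without a `bucket` argument (a partial backend configuration). A minimal sketch of such a declaration follows. The `prefix` value is a hypothetical example, and the backend block in this repository may differ.

```hcl
terraform {
  backend "gcs" {
    # "bucket" is intentionally omitted, so `terraform init` prompts for it.
    # Hypothetical prefix: state objects are stored under this path in the bucket.
    prefix = "terraform/state"
  }
}
```

To skip the interactive prompt, the bucket name can instead be passed at initialization time with `terraform init -backend-config="bucket=${GCS_BUCKET_NAME}"`.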
## Deploy GCP Resources
The following resources are managed in this directory:
1. **Monitoring Dashboard**: This is an outlier dashboard that displays statistics and outlier mode for TPU metrics.
2. **Debugging Dashboard**: This dashboard displays the stack traces collected in Cloud Logging for the process running on TPU VMs.
3. **Logging Storage**: This is a user-defined log bucket to store stack traces. Creating a new log bucket is optional. If you choose not to create a separate log bucket, the stack traces are collected in the [_Default log bucket](https://cloud.google.com/logging/docs/routing/overview#default-bucket).
### Deploy Resources for Workloads on GCE
Run `terraform init && terraform apply` inside the `gcp_resources/gce` directory to deploy all the resources mentioned above for TPU workloads running on GCE. You will be prompted to provide values for some input variables. After you confirm the action, all the resources are deployed automatically in your GCP project.
### Deploy Resources for Workloads on GKE
Run `terraform init && terraform apply` inside the `gcp_resources/gke` directory to deploy all the resources mentioned above for TPU workloads running on GKE. You will be prompted to provide values for some input variables. After you confirm the action, all the resources are deployed automatically in your GCP project.
> **_NOTE:_** See the guide below for more details about GCE/GKE-specific resources and prerequisites.
Follow the guide below to deploy the resources individually:
### Monitoring Dashboard
#### GCE
Run `terraform init && terraform apply` inside `gcp_resources/gce/resources/dashboard/monitoring_dashboard/` to deploy only the monitoring dashboard for GCE in your GCP project.
If the `node_prefix` parameter is not specified in the input variable `var.monitoring_dashboard_config` or is set to an empty string, the metrics on the dashboard will plot the data points for all TPU VMs in your GCP project.
For instance, if you provide `{"node_prefix": "test"}` as the input value for the input variable `var.monitoring_dashboard_config`, then the metrics on the monitoring dashboard will only show the data points for the TPU VMs with node names that start with `test`. Refer to this [doc](https://cloud.google.com/sdk/gcloud/reference/alpha/compute/tpus/queued-resources/create#--node-prefix) for more information on node prefix for TPUs in multislice.
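One way to avoid typing this value at the prompt is a `terraform.tfvars` file in the same directory. This sketch assumes that `monitoring_dashboard_config` accepts an object with a `node_prefix` key, as in the example above. Any other keys the variable may support are not shown.

```hcl
# terraform.tfvars (hypothetical file placed in the monitoring_dashboard directory)
# Restricts the dashboard metrics to TPU VMs whose node names start with "test".
monitoring_dashboard_config = {
  node_prefix = "test"
}
```

Terraform loads `terraform.tfvars` automatically during `terraform apply`. Any other input variables that are not set in the file are still prompted for.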
#### GKE
Run `terraform init && terraform apply` inside `gcp_resources/gke/resources/dashboard/monitoring_dashboard/` to deploy only the monitoring dashboard for GKE in your GCP project.
### Debugging Dashboard
#### GCE
Run `terraform init && terraform apply` inside `gcp_resources/gce/resources/dashboard/logging_dashboard/` to deploy only the debugging dashboard for GCE in your GCP project.
#### GKE
Run `terraform init && terraform apply` inside `gcp_resources/gke/resources/dashboard/logging_dashboard/` to deploy only the debugging dashboard for GKE in your GCP project.
To view traces in the debugging dashboard, add a sidecar container to your TPU workload running on GKE. The sidecar container name must match the regex `[a-z-0-9]*stacktrace[a-z-0-9]*`. Here is an example of the sidecar container to add:
```
containers:
- name: stacktrace-log-collector
  image: busybox:1.28
  resources:
    limits:
      cpu: 100m
      memory: 200Mi
  args: [/bin/sh, -c, "while [ ! -d /tmp/debugging ]; do sleep 60; done; while [ ! -e /tmp/debugging/* ]; do sleep 60; done; tail -n+1 -f /tmp/debugging/*"]
  volumeMounts:
  - name: tpu-debug-logs
    readOnly: true
    mountPath: /tmp/debugging
- name:
  .....
  .....
volumes:
- name: tpu-debug-logs
```
### Log Storage
#### GCE
Run `terraform init && terraform apply` inside `gcp_resources/gce/resources/log_storage/` to deploy a separate log bucket to store stack traces for GCE. You will be prompted to provide the name of your GCP project and the bucket configuration. You can also set the retention period for the bucket.
#### GKE
Run `terraform init && terraform apply` inside `gcp_resources/gke/resources/log_storage/` to deploy a separate log bucket to store stack traces for GKE. You will be prompted to provide the name of your GCP project and the bucket configuration. You can also set the retention period for the bucket. Make sure that the sidecar container is running in your GKE cluster, as described in the [Debugging Dashboard section for GKE](#debugging-dashboard).
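For orientation, a user-defined log bucket with a retention period can be expressed with the Google provider's `google_logging_project_bucket_config` resource. The sketch below is not the module shipped in this repository: the resource label, project ID, bucket name, and retention value are all hypothetical.

```hcl
# Sketch only: every name and value here is a hypothetical placeholder.
resource "google_logging_project_bucket_config" "stack_trace_bucket" {
  project        = "my-gcp-project"   # hypothetical project ID
  location       = "global"
  bucket_id      = "tpu-stack-traces" # hypothetical log bucket name
  retention_days = 30                 # retention period, in days
}
```

A log bucket only receives entries that a log sink routes to it, so a complete setup would also need a sink whose filter selects the stack trace logs. That filter is specific to the repository's modules and is not reproduced here.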