{"id":29202575,"url":"https://github.com/ai-hypercomputer/cloud-accelerator-diagnostics","last_synced_at":"2025-07-02T13:32:44.273Z","repository":{"id":228440642,"uuid":"767197650","full_name":"AI-Hypercomputer/cloud-accelerator-diagnostics","owner":"AI-Hypercomputer","description":null,"archived":false,"fork":false,"pushed_at":"2025-06-26T11:37:38.000Z","size":52,"stargazers_count":21,"open_issues_count":3,"forks_count":10,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-06-26T12:32:54.243Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/AI-Hypercomputer.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-03-04T21:45:13.000Z","updated_at":"2025-06-26T11:37:42.000Z","dependencies_parsed_at":"2024-06-24T19:33:09.820Z","dependency_job_id":"aa2711ad-d81f-4003-b082-ad85990fd79e","html_url":"https://github.com/AI-Hypercomputer/cloud-accelerator-diagnostics","commit_stats":null,"previous_names":["google/cloud-accelerator-diagnostics"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/AI-Hypercomputer/cloud-accelerator-diagnostics","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AI-Hypercomputer%2Fcloud-accelerator-diagnostics","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AI-Hypercomputer%2Fcloud-accelerator-diagnostics/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AI-Hypercomputer%2Fc
loud-accelerator-diagnostics/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AI-Hypercomputer%2Fcloud-accelerator-diagnostics/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/AI-Hypercomputer","download_url":"https://codeload.github.com/AI-Hypercomputer/cloud-accelerator-diagnostics/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AI-Hypercomputer%2Fcloud-accelerator-diagnostics/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":263148125,"owners_count":23421116,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-07-02T13:30:31.801Z","updated_at":"2025-07-02T13:32:44.259Z","avatar_url":"https://github.com/AI-Hypercomputer.png","language":"Python","readme":"\u003c!--\n Copyright 2023 Google LLC\n \n Licensed under the Apache License, Version 2.0 (the \"License\");\n you may not use this file except in compliance with the License.\n You may obtain a copy of the License at\n \n      https://www.apache.org/licenses/LICENSE-2.0\n \n Unless required by applicable law or agreed to in writing, software\n distributed under the License is distributed on an \"AS IS\" BASIS,\n WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n See the License for the specific language governing permissions and\n limitations under the License.\n --\u003e\n# Cloud Accelerator Diagnostics\n\n## Overview\nCloud Accelerator Diagnostics is a library to monitor, debug and profile the workloads running on Cloud 
accelerators like TPUs and GPUs. Additionally, this library provides a streamlined approach to automatically upload data to Tensorboard Experiments in Vertex AI. The package allows users to create a Tensorboard instance and Experiments in Vertex AI, and upload logs to them.\n\n## Installation\nTo install the Cloud Accelerator Diagnostics package, run the following command:\n\n ```bash\n pip install cloud-accelerator-diagnostics\n ```\n\n## Automating Uploads to Vertex AI Tensorboard\nBefore creating a Vertex AI Tensorboard and uploading logs to it, you must enable the [Vertex AI API](https://cloud.google.com/vertex-ai/docs/start/cloud-environment#enable_vertexai_apis) in your Google Cloud console. Also, make sure to assign the [Vertex AI User IAM role](https://cloud.google.com/vertex-ai/docs/general/access-control#aiplatform.user) to the service account that will call the APIs in the `cloud-accelerator-diagnostics` package. This is required to create and access the Vertex AI Tensorboard in the Google Cloud console.\n\n### Create Vertex AI Tensorboard\nTo learn about Vertex AI Tensorboard, visit this [page](https://cloud.google.com/vertex-ai/docs/experiments/tensorboard-introduction).\n\nHere is an example script to create a Vertex AI Tensorboard instance with the name `test-instance` in Google Cloud Project `test-project`.\n\nNote: Vertex AI is available only in [these](https://cloud.google.com/vertex-ai/docs/general/locations#available-regions) regions.\n\n```\nfrom cloud_accelerator_diagnostics import tensorboard\n\ninstance_id = tensorboard.create_instance(project=\"test-project\",\n                                          location=\"us-central1\",\n                                          tensorboard_name=\"test-instance\")\nprint(\"Vertex AI Tensorboard created: \", instance_id)\n```\n\n### Create Vertex AI Experiment\nTo learn about Vertex AI Experiments, visit this [page](https://cloud.google.com/vertex-ai/docs/experiments/intro-vertex-ai-experiments).\n\nThe following 
script will create a Vertex AI Experiment named `test-experiment` in your Google Cloud Project `test-project`. Here's how it handles attaching a Tensorboard instance:\n\n**Scenario 1: Tensorboard Instance Exists**\n\nIf a Tensorboard instance named `test-instance` already exists in your project, the script will attach it to the new Experiment.\n\n**Scenario 2: No Tensorboard Instance Present**\n\nIf `test-instance` does not exist, the script will create a new Tensorboard instance with that name and attach it to the Experiment.\n\n```\nfrom cloud_accelerator_diagnostics import tensorboard\n\ninstance_id, tensorboard_url = tensorboard.create_experiment(project=\"test-project\",\n                                                             location=\"us-central1\",\n                                                             experiment_name=\"test-experiment\",\n                                                             tensorboard_name=\"test-instance\")\n\nprint(\"View your Vertex AI Tensorboard here: \", tensorboard_url)\n```\n\nIf a Vertex AI Experiment with the specified name exists, a new one will not be created, and the existing Experiment's URL will be returned.\n\nNote: You can attach multiple Vertex AI Experiments to a single Vertex AI Tensorboard.\n\n### Upload Logs to Vertex AI Tensorboard\nThe following script continuously monitors the directory (`logdir`) for new data and uploads it to your Vertex AI Tensorboard Experiment. Note that after calling `start_upload_to_tensorboard()`, the thread will be kept alive even if an exception is thrown. To ensure the thread gets shut down, put any code after `start_upload_to_tensorboard()` and before `stop_upload_to_tensorboard()` in a `try` block, and call `stop_upload_to_tensorboard()` in a `finally` block. 
This example shows how you can upload the [profile logs](https://jax.readthedocs.io/en/latest/profiling.html#programmatic-capture) collected for your JAX workload to Vertex AI Tensorboard.\n\n```\nimport jax\n\nfrom cloud_accelerator_diagnostics import uploader\n\nuploader.start_upload_to_tensorboard(project=\"test-project\",\n                                     location=\"us-central1\",\n                                     experiment_name=\"test-experiment\",\n                                     tensorboard_name=\"test-instance\",\n                                     logdir=\"gs://test-directory/testing\")\ntry:\n  jax.profiler.start_trace(\"gs://test-directory/testing\")\n  # \u003cyour code goes here\u003e\n  jax.profiler.stop_trace()\nfinally:\n  uploader.stop_upload_to_tensorboard()\n```\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fai-hypercomputer%2Fcloud-accelerator-diagnostics","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fai-hypercomputer%2Fcloud-accelerator-diagnostics","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fai-hypercomputer%2Fcloud-accelerator-diagnostics/lists"}