{"id":29202567,"url":"https://github.com/ai-hypercomputer/cloud-diagnostics-xprof","last_synced_at":"2025-07-02T13:32:39.840Z","repository":{"id":284615485,"uuid":"942309034","full_name":"AI-Hypercomputer/cloud-diagnostics-xprof","owner":"AI-Hypercomputer","description":null,"archived":false,"fork":false,"pushed_at":"2025-06-25T19:33:11.000Z","size":1529,"stargazers_count":7,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-06-25T20:31:15.988Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://pypi.org/project/cloud-diagnostics-xprof/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/AI-Hypercomputer.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"docs/CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-03-03T22:51:34.000Z","updated_at":"2025-06-25T19:33:15.000Z","dependencies_parsed_at":null,"dependency_job_id":"c5a844cd-e35b-4aa3-8dbf-a9234a2db7f3","html_url":"https://github.com/AI-Hypercomputer/cloud-diagnostics-xprof","commit_stats":null,"previous_names":["ai-hypercomputer/cloud-diagnostics-xprof"],"tags_count":8,"template":false,"template_full_name":null,"purl":"pkg:github/AI-Hypercomputer/cloud-diagnostics-xprof","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AI-Hypercomputer%2Fcloud-diagnostics-xprof","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AI-Hypercomputer%2Fcloud-diagnostics-xprof/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AI-Hypercomputer%2Fclo
ud-diagnostics-xprof/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AI-Hypercomputer%2Fcloud-diagnostics-xprof/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/AI-Hypercomputer","download_url":"https://codeload.github.com/AI-Hypercomputer/cloud-diagnostics-xprof/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AI-Hypercomputer%2Fcloud-diagnostics-xprof/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":263148129,"owners_count":23421117,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-07-02T13:30:27.634Z","updated_at":"2025-07-02T13:32:39.788Z","avatar_url":"https://github.com/AI-Hypercomputer.png","language":"Python","readme":"\u003c!--\n Copyright 2023 Google LLC\n \n Licensed under the Apache License, Version 2.0 (the \"License\");\n you may not use this file except in compliance with the License.\n You may obtain a copy of the License at\n \n      https://www.apache.org/licenses/LICENSE-2.0\n \n Unless required by applicable law or agreed to in writing, software\n distributed under the License is distributed on an \"AS IS\" BASIS,\n WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n See the License for the specific language governing permissions and\n limitations under the License.\n --\u003e\n# xprofiler\n\nThe `xprofiler` tool aims to simplify profiling experience for XLA workloads.\nIt provides an abstraction over profile sessions and manages\n[`xprof` 
hosting](https://github.com/openxla/xprof) experience.\nThis includes allowing users to create and manage VM instances that\nare preprovisioned with TensorBoard and the latest profiling tools.\n\nFor more information about profiling with `xprof`, please see the `xprof`\n[documentation](https://github.com/openxla/xprof/blob/master/docs/).\n\n## Quickstart\n\n`xprofiler` can be set up on a user's workstation/cloudtop or on a TPU VM.\n\n\u003e Note:\n\u003e Before setting up `xprofiler`, users will need to enable profile collection\n\u003e for their workload by starting the profile server\n\u003e (see [section on enabling this collector](#prerequisite-enable-collector))\n\u003e or capturing programmatically\n\u003e (see [section on programmatic profile capture](#programmatic-profile-capture)).\n\n### Install Dependencies\n\n`xprofiler` relies on [gcloud](https://cloud.google.com/sdk).\n\nThe first step is to follow the documentation to [install](https://cloud.google.com/sdk/docs/install) it.\n\nRunning the initial `gcloud` setup will ensure things like your default project\nID are set.\n\n```bash\n# Set up project context\ngcloud init\n# Set up auth for gcloud\ngcloud auth login\n# Set up auth for client libraries\ngcloud auth application-default login\n```\n\n### Setup cloud-diagnostics-xprof Package\n\nUse a virtual environment (as best practice).\n\n```bash\npython3 -m venv venv\nsource venv/bin/activate\n\n# Install package\npip install cloud-diagnostics-xprof\n\n# Confirm installed with pip\npip show cloud-diagnostics-xprof\n\nName: cloud-diagnostics-xprof\nVersion: X.Y.Z\nSummary: Abstraction over profile session locations and infrastructure running the analysis.\nHome-page: https://github.com/AI-Hypercomputer/cloud-diagnostics-xprof\nAuthor: Author-email: Hypercompute Diagon \u003chypercompute-diagon@google.com\u003e\n```\n\n### Permissions\n\n`xprofiler` relies on project level IAM permissions.\n\n* VM's service account must have required permissions.\n  * 
`\u003cproject-number\u003e-compute@developer.gserviceaccount.com` is the default\n  service account. Users can also use custom Service Accounts for their setup.\n  * The Service Account must have the `Storage Object User` role on the input GCS bucket.\n  This is needed to read/write profile traces to GCS.\n* Users must have the `Service Account User` role on the above service account. This is\nneeded to access the reverse proxy URL link for visualization.\n* Users must have the `Storage Object User` role on the input GCS bucket. This is\nneeded to validate `xprofiler` instance state during setup.\n\n\u003e Summary:\n\u003e\n\u003e Users need to ensure when using `xprofiler` that it's in the _same project_ as\n\u003e their GCS bucket.\n\u003e\n\u003e Users also need to ensure they grant the required permissions on their GCS\n\u003e bucket so the `xprofiler` VM can access the bucket.\n\n### Recommendations\n\n#### GCS Paths\n\n`xprofiler` uses a specific path pattern to locate and manage multiple profiling\nsessions stored within a Google Cloud Storage (GCS) bucket. This enables the\nvisualization of various profiling sessions from different runs using a single\n`xprofiler` instance.\n\n##### GCS Paths for `xprofiler create`\n\nIt's recommended when using the `xprofiler create` subcommand to specify only\nthe root GCS bucket, without any subdirectories:\n\n```\ngs://\u003cbucket-name\u003e\n```\n\nThis approach allows `xprofiler` to discover and load profiles from multiple\nruns and sessions stored under that bucket.\nFor instance, all the profiles organized as below would be loaded in our example:\n\n* `gs://\u003cbucket-name\u003e/run1/plugins/profile/session1/\u003cprofile\u003e.xplane.pb`\n* `gs://\u003cbucket-name\u003e/run1/plugins/profile/session2/\u003cprofile\u003e.xplane.pb`\n* `gs://\u003cbucket-name\u003e/run2/plugins/profile/session1/\u003cprofile\u003e.xplane.pb`\n\nSpecifying `gs://\u003cbucket-name\u003e` during `xprofiler create` will allow users to\nview all of these profiles in TensorBoard. 
They will see all runs in the\ndropdown menu as `run1/session1`, `run1/session2`, and `run2/session1`.\n\n##### GCS Paths for Profile Capture\n\nWhen users programmatically capture profiles or use the `xprofiler capture`\nsubcommand with a GCS bucket path like `gs://\u003cbucket-name\u003e/\u003crun-name\u003e`, all\nprofiling data will be collected in a structured subdirectory:\n\n```\ngs://\u003cbucket-name\u003e/\u003crun-name\u003e/tensorboard/plugins/profile/\u003csession-id\u003e/\n```\n\nUsers will see the run in the dropdown menu as\n`\u003crun-name\u003e/tensorboard/\u003csession-id\u003e/`.\nHere, the `\u003csession-id\u003e` uniquely identifies a specific profiling session within\nthat run.\n\n\u003e Note:\n\u003e As long as users have xplane files that follow the pattern\n\u003e `.../plugins/profile/\u003csession\u003e` under the bucket path from\n\u003e `xprofiler create`, all the profiles will be picked up in any subdirectories.\n\n##### Examples of proper and improper GCS paths\n\nQuick note on examples of proper and improper GCS paths (for log directory\nparameter):\n\n```bash\n# Proper path (note forward slash at end is optional)\ngs://my-bucket/main_directory/sub-a/sub-b/\n\n# Proper path\ngs://my_other_bucket/main_directory/sub-1/sub-2\n\n# Improper path: does not start with gs://\nmy_other_bucket/main_directory/sub-1/sub-2\n```\n\n#### Machine Types\n\nDuring creation, users have the option to specify the VM machine type.\nThe selection of a machine type can impact both performance and cost, and the\noptimal choice often correlates with the size of the profiles users anticipate\nworking with.\n\nBy default, `xprofiler` utilizes the `c4-highmem-8` machine type. 
This\nconfiguration is generally robust and should provide sufficient resources for a\nwide range of common profile sizes.\n\nHowever, users may find it beneficial to select a different machine type based\non their specific needs:\n\n* For workloads involving relatively small profiles (e.g., under approximately\n  100 MB), a less powerful machine like `e2-highmem-4` might be a cost-effective\n  alternative without significantly compromising performance for those tasks.\n* Conversely, users with particularly large profiles can opt for a more capable\n  machine (such as `c4-highmem-32`), which could lead to faster processing times.\n* Although the default machine type should be sufficient for most users, if\n  users find it taking more than **3 minutes** for profiles to load then they\n  may want to try a more powerful machine type for the `xprofiler` VM.\n\nThe following table offers some general suggestions.\nPlease consider these as rough guidelines rather than strict prescriptions, as\nthe ideal machine type can depend on multiple factors specific to users'\nprofile data. Users may want to try more powerful machine types if profiles take more\nthan ***3 minutes*** to load.\n\n| Profile Size | Suggested Machine Type | Primary Consideration |\n|---|---|---|\n| Small (\u003c ~100 MB) | e2-highmem-4 | Cost-effectiveness |\n| Medium / Typical | c4-highmem-8 (Default) | Balanced performance \u0026 cost |\n| Large (\u003e 1 GB) | c4-highmem-32 | Higher processing power |\n\nWhile we generally recommend utilizing a\n[general-purpose machine type](https://cloud.google.com/compute/docs/general-purpose-machines),\nusers are free to explore and specify other machine types that better suit their\nrequirements. 
A comprehensive list of machine types can be found in the\n[Google Cloud documentation](https://cloud.google.com/compute/docs/machine-resource).\n\nFor more information about specifying a machine type for `xprofiler create`,\nplease refer to the [section below](#xprofiler-create---machine-type) on\n`xprofiler create --machine-type`.\n\n### Create `xprofiler` Instance\n\nTo create an `xprofiler` instance, you must provide a path to a GCS bucket and\na zone. Project information will be retrieved from `gcloud`'s config.\n\n```bash\nZONE=\"\u003csome zone\u003e\"\nGCS_PATH=\"gs://\u003csome-bucket\u003e\"\n\nxprofiler create -z $ZONE -l $GCS_PATH\n```\n\nWhen the command completes, you will see it return information about the\ninstance created, similar to below:\n\n```\nWaiting for instance to be created. It can take a few minutes.\n\nInstance for gs://\u003csome-bucket\u003e has been created.\nYou can access it via following,\n1. https://\u003cid\u003e-dot-us-\u003cregion\u003e.notebooks.googleusercontent.com.\n2. xprofiler connect -z \u003csome zone\u003e -l gs://\u003csome-bucket\u003e -m ssh\nInstance is hosted at xprof-97db0ee6-93f6-46d4-b4c4-6d024b34a99f VM.\n```\n\n\u003e Note: Depending on availability, the zone specified might not have the default\n\u003e machine type for the VM. In that case, you instead might see an error followed\n\u003e by potential zones where the machine type is available.\n\u003e\n\u003e For more details, see [section](#xprofiler-create---machine-type) on machine\n\u003e types when using `xprofiler create`.\n\nThis will create a VM instance with `xprofiler` packages installed. The setup\ncan take up to a few minutes. The link above is shareable with anyone with IAM\npermissions.\n\nBy default, `xprofiler` instances will be hosted on a `c4-highmem-8` machine. Users\ncan also specify a machine type of their choice using the `--machine-type` flag.\n\nDuring `create`, users will be prompted if they would like to create a second\ninstance for the same GCS path. 
Pressing anything but `Y` or `y` will exit the\nprogram.\n\n```\n$ xprofiler create -z \u003czone\u003e -l gs://\u003csome-bucket\u003e\n\nInstance for gs://\u003csome-bucket\u003e already exists.\n\nLog_Directory       URL                                                           Name                                        Zone\n------------------  ------------------------------------------------------------  ------------------------------------------  -------\ngs://\u003csome-bucket\u003e  https://\u003cid\u003e-dot-us-\u003cregion\u003e.notebooks.googleusercontent.com  xprof-97db0ee6-93f6-46d4-b4c4-6d024b34a99f  \u003czone\u003e\n\n\nDo you want to continue to create another instance with the same log directory? (y/n)\ny\nWaiting for instance to be created. It can take a few minutes.\n\nInstance for gs://\u003csome-bucket\u003e has been created.\nYou can access it via following,\n1. https://\u003cid\u003e-dot-us-\u003cregion\u003e.notebooks.googleusercontent.com.\n2. xprofiler connect -z \u003czone\u003e -l gs://\u003csome-bucket\u003e -m ssh\nInstance is hosted at xprof-\u003cuuid\u003e VM.\n```\n\n### Open `xprofiler` Instance\n\n##### Using Proxy\n\nUsers can open created instances using the link from create output. This path\nrelies on a reverse proxy to expose the xprofiler backend. 
Users must have\nvalid IAM permissions.\n\n##### Using SSH Tunnel (Preferred for larger captures)\n\nUsers can connect to an instance by specifying a log_directory.\n\n* Connect uses an SSH tunnel and users can open a localhost URL from their\nbrowsers.\n\n\u003e Note: `-z (--zone)` and `-l (--log_directory)` are mandatory arguments.\n\n```\nxprofiler connect -z $ZONE -l $GCS_PATH -m ssh\n\nxprofiler instance can be accessed at http://localhost:6006.\n```\n\n\u003e Note:\n\u003e Running `xprofiler connect` using the SSH tunnel must be done on your local\n\u003e host and ***not*** on a TPU VM.\n\u003e\n\u003e Running `xprofiler connect` using the SSH option allows users to access the\n\u003e `xprofiler` web server from their local browser.\n\u003e Therefore, running the `xprofiler connect` subcommand on a TPU VM is not\n\u003e useful and won't work as expected.\n\n### List `xprofiler` Instances\n\nWhen listing `xprofiler` instances, specifying a zone is recommended. 
Users can\noptionally provide bucket information and/or VM instance names.\n\n```bash\nZONE=us-central1-a\n\nxprofiler list -z $ZONE\n```\n\n\u003e Note: The `-z (--zones)` flag is not required but is highly recommended.\n\u003e If a zone is not provided, the command can take longer to search for all\n\u003e relevant VM instances.\n\nThis will output something like the following if there are instances matching\nthe list criteria:\n\n```bash\nLog_Directory             URL                                                                  Name                                        Zone\n------------------------  -------------------------------------------------------------------  ------------------------------------------  -------\ngs://\u003csome-bucket\u003e        https://\u003cid\u003e-dot-us-\u003cregion\u003e.notebooks.googleusercontent.com         xprof-97db0ee6-93f6-46d4-b4c4-6d024b34a99f  \u003czone\u003e\ngs://\u003csome-other-bucket\u003e  https://\u003cid\u003e-dot-us-\u003cregion\u003e.notebooks.googleusercontent.com         xprof-ev86r7c5-3d09-xb9b-a8e5-a495f5996eef  \u003czone\u003e\n```\n\nNote you can specify one or more GCS bucket paths and/or VM instance names to\nget any VMs associated with the criteria provided. This will list any VMs\nassociated with the log directories or VM names specified.\n(See [section](#optionally-specifying-log-directories-andor-vm-names) below for\nmore details.)\n\n```bash\n# Specifying one GCS path\nxprofiler list -z $ZONE -l $GCS_PATH\n\n# Specifying one VM instance name\nxprofiler list -z $ZONE --vm-name $VM_NAME\n```\n\n### Delete `xprofiler` Instance\n\nTo delete an instance, you'll need to specify either the GCS bucket paths or the\nVM instances' names. 
Specifying the zone is required.\n\n```bash\n# Delete by associated GCS path\nxprofiler delete -z us-central1-b -l gs://\u003csome-bucket\u003e\n\nFound 1 VM(s) to delete.\nLog_Directory       URL                                                                  Name                                        Zone\n------------------  -------------------------------------------------------------------  ------------------------------------------  -------\ngs://\u003csome-bucket\u003e  https://\u003cid\u003e-dot-us-\u003cregion\u003e.notebooks.googleusercontent.com         xprof-8187640b-e612-4c47-b4df-59a7fc86b253  \u003czone\u003e\n\nDo you want to continue to delete the VM `xprof-8187640b-e612-4c47-b4df-59a7fc86b253`?\nEnter y/n: y\nWill delete VM `xprof-8187640b-e612-4c47-b4df-59a7fc86b253`\n\n\n# Delete by VM instance name\nVM_NAME=\"xprof-8187640b-e612-4c47-b4df-59a7fc86b253\"\nxprofiler delete -z $ZONE --vm-name $VM_NAME\n```\n\n### Capture Profile\n\nUsers can capture profiles programmatically or manually. Captured profile data\nwill be saved to the given GCS path from `--log-directory`. 
It will specifically\nsave to the path of `gs://\u003csome-bucket\u003e/\u003csome-run\u003e/plugins/profile/`.\n\n##### Prerequisite: Enable Collector\n\nUsers must enable the collector in their workloads by following the steps\nbelow.\n\n\u003e Note: This is needed for both programmatic and manual captures, except for\n\u003e JAX.\n\u003e For JAX programmatic capture, users do not need to include `start_server`.\n\u003e Users using JAX only need this for manual profile capture methods.\n\n```python\n# To enable for a jax workload\nimport jax\njax.profiler.start_server(9012)\n\n# To enable for a pytorch workload\nimport torch_xla.debug.profiler as xp\nserver = xp.start_server(9012)\n\n# To enable for tensorflow workload\nimport tensorflow.compat.v2 as tf2\ntf2.profiler.experimental.server.start(9012)\n```\n\nThe links below have more information about the individual frameworks:\n\n* [JAX](https://docs.jax.dev/en/latest/profiling.html#manual-capture)\n* [PyTorch](https://cloud.google.com/tpu/docs/pytorch-xla-performance-profiling-tpu-vm#starting_the_profile_server)\n* [TensorFlow](https://www.tensorflow.org/guide/profiler#collect_performance_data)\n\n##### Programmatic Profile Capture\n\nUsers can capture traces from their workloads by marking their code paths.\nProgrammatic capture is more deterministic and gives more control to users.\n\n\u003e Note: The code snippets below assume that the code in the earlier\n\u003e [prerequisite section](#prerequisite-enable-collector) has already been run.\n\n###### JAX Profile Capture\n\n```python\njax.profiler.start_trace(\"gs://\u003csome_bucket\u003e/\u003csome_run\u003e\")\n# Code to profile\n...\njax.profiler.stop_trace()\n```\n\nAlternatively, use the `jax.profiler.trace()` context manager:\n\n```python\nwith jax.profiler.trace(\"gs://\u003csome_bucket\u003e/\u003csome_run\u003e\"):\n  # Code to profile\n  ...\n```\n\n###### PyTorch Profile Capture\n\n```python\nxp.trace_detached(f\"localhost:{9012}\", 
\"gs://\u003csome_bucket\u003e/\u003csome_run\u003e\", duration_ms=2000)\n\n# Using StepTrace\nfor step, (input, label) in enumerate(loader):\n  with xp.StepTrace('train_step', step_num=step):\n    # code to trace\n    ...\n```\n\nAlternatively, wrap individual parts of the code with `xp.Trace`:\n\n```python\n# Using Trace\nwith xp.Trace('fwd_context'):\n    # code to trace\n    ...\n```\n\n###### TensorFlow Profile Capture\n\n```python\ntf.profiler.experimental.start(\"gs://\u003csome_bucket\u003e/\u003csome_run\u003e\")\nfor step in range(num_steps):\n  # Creates a trace event for each training step with the step number\n  with tf.profiler.experimental.Trace(\"Train\", step_num=step):\n    train_fn()\ntf.profiler.experimental.stop()\n```\n\n##### Manual Profile Capture\n\nUsers can also trigger profile capture on target hosts. There are two methods to\ndo this:\n\n* Using the `xprofiler capture` command\n  - For [GCE](#profile-capture-via-xprofiler-gce) workloads\n  - For [GKE](#profile-capture-via-xprofiler-gke) workloads\n* Using [TensorBoard's UI](#profile-capture-via-tensorboard-ui)\n\n###### Profile Capture via TensorBoard UI\n\nUsers have the option to trigger a profile capture using TensorBoard's UI.\n\nFirst, visit the proxy URL for a VM instance (created via `xprofiler`) to visit\nthe TensorBoard UI. Which will bring you to one of two pages.\n\n**Scenario 1: GCS Has Profile Data**\n\nIf the GCS log directory associated with the VM instance has profile data\nalready available, you'll likely see a page similar to this with profile runs\nready to view:\n\n![TensorBoard UI on profile tab](https://raw.githubusercontent.com/AI-Hypercomputer/cloud-diagnostics-xprof/refs/heads/main/docs/images/tensorboard-ui-profile-tab-with-profiles.png)\n\nNotice the \"CAPTURE PROFILE\" button on the dashboard. 
You'll want to click that\n\u0026 proceed with the next section on completing this form to capture profile data.\n\n**Scenario 2: GCS Has No Profile Data**\n\nYou may see a similar page to this one with no dashboards if the GCS log\ndirectory does not yet have any profile data:\n\n![TensorBoard UI that is blank with message on dashboards](https://raw.githubusercontent.com/AI-Hypercomputer/cloud-diagnostics-xprof/refs/heads/main/docs/images/tensorboard-ui-with-no-profiles.png)\n\nYou will then need to select the profile tab:\n\n![TensorBoard UI with upper-right dropdown menu selecting \"Profile\"](https://raw.githubusercontent.com/AI-Hypercomputer/cloud-diagnostics-xprof/refs/heads/main/docs/images/tensorboard-ui-with-no-profiles-dropdown-on-profile.png)\n\nYou'll then see a page similar to this one with a \"CAPTURE PROFILE\" button:\n\n![TensorBoard UI that is blank with message on profiling](https://raw.githubusercontent.com/AI-Hypercomputer/cloud-diagnostics-xprof/refs/heads/main/docs/images/tensorboard-ui-profile-tab-with-profiles.png)\n\nYou want to click the \"CAPTURE PROFILE\" button which will bring up a form to\nfill. 
Proceed to the next section for details on completing this form to capture\nprofile data.\n\n**Completing Form for Profile Capture**\n\nIn either case from above, you should see a form similar to this one to fill out to capture\nprofile data:\n\n![Incomplete form for TensorBoard UI profile capture](https://raw.githubusercontent.com/AI-Hypercomputer/cloud-diagnostics-xprof/refs/heads/main/docs/images/tensorboard-ui-capture-profile-form-incomplete.png)\n\nAt a minimum, you will need to provide the \"Profile Service URL(s)\" for the TPU VM\ninstance.\n\n\u003e Note:\n\u003e The instructions refer to the TPU VM that is _running_ the workload to\n\u003e profile and ***NOT*** the `xprofiler` VM instance.\n\nYou will need the full hostname for the TPU \u0026 port number\nwith the following format:\n\n```\n\u003cTPU_VM_HOSTNAME\u003e.\u003cZONE\u003e.c.\u003cGCP_PROJECT_NAME\u003e.internal:\u003cPORT_NUMBER\u003e\n```\n\n* `TPU_VM_HOSTNAME`: This is different from the TPU name and refers to the host\n  that the workload is running on.\n  You can retrieve the hostname using `gcloud` by providing the TPU VM name and\n  the TPU's zone:\n  `gcloud compute tpus tpu-vm ssh $TPU_NAME --zone=$ZONE --command=\"hostname\"`\n* `ZONE`: This is the zone of the TPU VM. Note that it is ***NOT*** necessarily\n  the same as the zone of the `xprofiler` VM instance that is displaying TensorBoard.\n* `GCP_PROJECT_NAME`: This is the project name for the TPU VM.\n  Note that it is ***NOT*** necessarily the same as the project of the `xprofiler` VM instance\n  that is displaying TensorBoard. 
However, it likely will need to be since\n  having the TPU in a different project will likely lead to permission issues,\n  preventing profile capture.\n* `PORT_NUMBER`: This is the port that was set when starting the profile server\n  in the relevant code.\n  See earlier [prerequisite section](#prerequisite-enable-collector).\n\nFor example, your string will look similar to this:\n\n```\nt1v-n-g8675e3i-w-0.us-east5-b.c.my-project.internal:9012\n```\n\nYou can then adjust any of the other settings you care to modify and click\n\"CAPTURE\".\n\n![Complete form for TensorBoard UI profile capture](https://raw.githubusercontent.com/AI-Hypercomputer/cloud-diagnostics-xprof/refs/heads/main/docs/images/tensorboard-ui-capture-profile-form-complete.png)\n\nYou will see a loading animation and then a message at the bottom of the screen.\n\nIf successful, you will see a message similar to this:\n\n![Successful capture for TensorBoard UI profile capture](https://raw.githubusercontent.com/AI-Hypercomputer/cloud-diagnostics-xprof/refs/heads/main/docs/images/tensorboard-ui-capture-profile-message-success.png)\n\nIf something went wrong you might see something similar to this:\n\n![Failed capture for TensorBoard UI profile capture](https://raw.githubusercontent.com/AI-Hypercomputer/cloud-diagnostics-xprof/refs/heads/main/docs/images/tensorboard-ui-capture-profile-message-failure.png)\n\nYou can attempt the capture again, ensuring your settings in the form are\ncorrect. 
You may also need to confirm the TPU workload is running and properly\nconfigured for profiling.\n\n\u003e Note:\n\u003e After a successful capture, you might need to refresh the dashboard.\n\u003e You can hit the refresh icon for a single refresh or go to the settings menu\n\u003e (the gear icon) and set \"Reload data\" to refresh automatically.\n\n###### Profile Capture via `xprofiler`: GCE\n\nFor JAX, `xprofiler` requires the\n[tensorboard-plugin-profile](https://pypi.org/project/tensorboard-plugin-profile)\npackage, which must also be available on target VMs.\n\n\u003e Note: `xprofiler` uses `gsutil` to move files to the GCS bucket from the target VM.\n\u003e VMs must have `gcloud` pre-installed.\n\n```bash\n# Trigger capture profile\n# Framework can be jax or pytorch\nxprofiler capture \\\n  -z \u003czone\u003e \\\n  -l gs://\u003csome-bucket\u003e/\u003csome-run\u003e \\\n  -f jax \\\n  -n vm_name1 vm_name2 vm_name3 \\\n  -d 2000 # duration in ms\n\nStarting profile capture on host vm_name1.\nProfile saved to gs://\u003csome-bucket\u003e/\u003csome-run\u003e/tensorboard and session id is session_2025_04_03_18_13_49.\n\nStarting profile capture on host vm_name2.\nProfile saved to gs://\u003csome-bucket\u003e/\u003csome-run\u003e/tensorboard and session id is session_2025_04_03_18_13_49.\n```\n\n###### Profile Capture via `xprofiler`: GKE\n\nFor GKE, users are required to set up `kubectl` and cluster context on their\nmachines. 
(See details on setting up\n[kubectl](https://cloud.google.com/kubernetes-engine/docs/how-to/cluster-access-for-kubectl).)\n\n```bash\ngcloud container clusters get-credentials \u003ccluster_name\u003e --region=\u003cregion\u003e\n```\n\nAfter setting up credentials, users can verify the current context:\n\n```bash\nkubectl config current-context\ngke_\u003cproject_id\u003e_\u003cregion\u003e_\u003ccluster_name\u003e\n```\n\nUsers can then get a mapping between pods and nodes using the `kubectl get pods`\ncommand:\n\n```bash\n$ kubectl get pods -o wide | awk '{print $1\"\\t\\t\"$7}'\n```\n\nFor GKE, users can then pass a list of pods to the `xprofiler capture` command to\ninitiate profile capture.\n\n```bash\n# Trigger capture profile\n# Framework can be jax or pytorch\nxprofiler capture \\\n  -z \u003czone\u003e \\\n  -o gke \\\n  -l gs://\u003csome-bucket\u003e/\u003csome-run\u003e \\\n  -f jax \\\n  -n pod_1 pod_2 pod_3 \\\n  -d 2000 # duration in ms\n\nStarting profile capture on pod_1.\nProfile saved to gs://\u003csome-bucket\u003e/\u003csome-run\u003e/tensorboard and session id is session_2025_04_03_18_13_49.\n\nStarting profile capture on pod_2.\nProfile saved to gs://\u003csome-bucket\u003e/\u003csome-run\u003e/tensorboard and session id is session_2025_04_03_18_13_49.\n```\n\n## Details on `xprofiler`\n\n### Main Command: `xprofiler`\n\nThe `xprofiler` command has additional subcommands that can be invoked to\n[create](#subcommand-xprofiler-create) VM instances,\n[list](#subcommand-xprofiler-list) VM instances,\n[delete](#subcommand-xprofiler-delete) instances, etc.\n\nHowever, the main `xprofiler` command has some additional options without\ninvoking a subcommand.\n\n#### `xprofiler --help`\n\nGives additional information about using the command including flag options and\navailable subcommands. It can also be called with `xprofiler -h`.\n\n\u003e Note: Each subcommand has a `-h (--help)` flag that can give information\n\u003e about that specific subcommand. 
For example: `xprofiler list -h`\n\n### Passing Extra Arguments to `xprofiler` Subcommands\n\nEach available subcommand has a set of parameters passed in with various flags.\nThese flags enable certain actions including forming the internal commands\n(mostly [`gcloud`](https://cloud.google.com/cli) commands).\n\nHowever, some advanced users might find it useful to override the internal\ncommands' flags and/or find the available subcommand flags limiting. In that\ncase, those users may find it useful to pass extra arguments\n(not officially defined in the `xprofiler` subcommand) to `xprofiler`\nsubcommands.\n\nThe basic use is to give flags to the internal command. This means the user\nshould ideally be fairly familiar with how the subcommand being overridden would\nuse these commands. Thus this is considered ***advanced usage*** and should be\nused with ***caution***.\n\nBelow is an example of using extra arguments to change the\n[maintenance policy](https://cloud.google.com/sdk/gcloud/reference/compute/instances/create#--maintenance-policy)\nfor the `xprofiler` VM instance using the `create` subcommand:\n\n```bash\nxprofiler \\\n  create \\\n    -z us-east5-a -l gs://example-gs-bucket/path \\\n    --maintenance-policy=terminate\n```\n\nThis will essentially add the flag `--maintenance-policy=terminate` to the main\ninternal `gcloud` command within the `create` subcommand.\n\nThe following extra argument formats are supported:\n\n* `--flag` or `-f` (as a boolean flag)\n* `--flag value` or `-f value`\n* `--flag=value` or `-f=value`\n* `--flag=value0,value1,value2` or `-f=value0,value1,value2` for multiple values\n* `--flag value0,value1,value2` or `-f value0,value1,value2` for multiple values\n\n\u003e Note:\n\u003e It's recommended that users pass extra arguments *after* the subcommand\n\u003e (`create`, `list`, etc.).\n\u003e\n\u003e Although there is some support for providing the extra arguments before the\n\u003e subcommand (such as `xprofiler 
--limit=5 list -z $ZONE`) it is not guaranteed\n\u003e to work. This is because there can be interference with the subcommand\n\u003e position.\n\n\nValues given as extra arguments may override the values used in the original\nmain internal command.\n\nFor example, consider the following command:\n\n```bash\nxprofiler \\\n  create \\\n    -z us-east5-a -l gs://example-gs-bucket/path \\\n    --zone=us-central1-a\n```\n\nThis will effectively execute the same main internal command as if this was run\ninstead:\n\n```bash\nxprofiler create -z us-central1-a -l gs://example-gs-bucket/path\n```\n\nFinally, it should be noted that multiple values can be used as extra arguments,\nsuch as the example below:\n\n```bash\n\nxprofiler \\\n  create \\\n    -z us-east5-a -l gs://example-gs-bucket/path \\\n    --zone=us-central1-a \\\n    --maintenance-policy=terminate \\\n    --machine-type=e2-highmem-8\n```\n\n### Subcommand: `xprofiler create`\n\nThis command is used to create a new VM instance for `xprofiler` to run with a\ngiven profile log directory GCS path.\n\n`xprofiler create` will return an error if the machine type given is not found\nin the provided zone. Note that the error message will include a `gcloud`\ncommand that can be used to determine a zone with the given machine type.\n\nUsage details:\n\n```\nxprofiler create\n  [--help]\n  --log-directory GS_PATH\n  --zone ZONE_NAME\n  [--vm-name VM_NAME]\n  [--machine-type MACHINE_TYPE]\n  [--auto-delete-on-failure-off]\n  [--verbose]\n```\n\n#### `xprofiler create --help`\n\nThis provides the basic usage guide for the `xprofiler create` subcommand.\n\n#### `xprofiler create --machine-type`\n\nThe `create` command defaults to using `c4-highmem-8` for the VM instance.\nHowever, users can specify a different machine type using the flag\n`--machine-type` followed by a machine type such as `e2-highmem-8`. Information\non machine types can be found\n[here](https://cloud.google.com/compute/docs/machine-resource). 
Also see\nour [recommendations for `xprofiler`](#machine-types).\n\nNote that if a machine type is not found for the given zone, an error will occur\nwith a suggestion for running a `gcloud` command as well as some available zones\nfor that machine type.\n\nThe output will look similar to this:\n\n```bash\nPlease check the machine type w/ us-east5-c and try again. You can investigate zones with the machine type victors-discount-vm available:\ngcloud compute machine-types list --filter=\"name=victors-discount-vm\" --format=\"value(zone)\"\nThe machine type and zone do not match.\nSuggested zones with machine type victors-discount-vm available:\n['us-central1-a', 'us-central1-b', 'us-central1-c', 'us-central1-f', 'europe-west1-b', 'europe-west1-c', 'europe-west1-d', 'us-west1-a', 'us-west1-b', 'us-west1-c']\n```\n\n#### `xprofiler create --auto-delete-on-failure-off`\n\nThe `create` command will automatically delete failed VM instances created by\nthe `xprofiler` tool. This is to ensure that a malformed VM does not persist if\nit can't be fully utilized by `xprofiler`.\n\nHowever, users can optionally turn off automatic deletion using the\n`--auto-delete-on-failure-off` flag. 
This can be particularly useful when\ndebugging issues with VM creation.\n\n### Subcommand: `xprofiler list`\n\nThis command is used to list VM instances created by the `xprofiler` tool.\n\nUsage details:\n\n```\nxprofiler list\n  [--help]\n  [--zones ZONE_NAME [ZONE_NAME ...]]\n  [--log-directory GS_PATH [GS_PATH ...]]\n  [--vm-name VM_NAME [VM_NAME ...]]\n  [--filter FILTER_NAME [FILTER_NAME ...]]\n  [--verbose]\n```\n\n#### `xprofiler list --help`\n\nThis provides the basic usage guide for the `xprofiler list` subcommand.\n\n#### `xprofiler list --zones`\n\nThe `list` subcommand can optionally take a `-z (--zones)` flag to specify which\nzones to consider for listing VMs.\n\n```bash\n# Listing all xprofiler VMs in us-central1-a\nxprofiler list -z us-central1-a\n\n# Listing all xprofiler VMs in us-east5-a and us-central1-a\nxprofiler list -z us-east5-a us-central1-a\n```\n\nIf no value for the zones is provided, then `xprofiler list` will search across\nall zones with any other matching criteria in mind. This, however, can take\nsignificantly more time, so it is recommended to specify the zone(s)\nexplicitly.\n\n#### Optionally specifying log directories and/or VM names\n\nUsers can optionally specify one or more log directories (GCS paths) and/or VM\nnames. 
This can be done with the `-l (--log-directory)` flag for log directories\nand with the `-n (--vm-name)` flag for VM instance names.\n\nWhen specifying multiple criteria, any matching VM will be listed.\n\nExamples:\n\n```bash\n# List VMs that match either GCS path\nxprofiler list -l gs://bucket0/top-dir gs://bucket1/top-dir\n\n\n# List VMs that match either VM name\nxprofiler list -n my-vm-one my-vm-two\n\n\n# List VMs that match any of the GCS paths or VM names\nxprofiler list \\\n  -l gs://bucket0/top-dir gs://bucket1/top-dir \\\n  -n my-vm-one my-vm-two\n```\n\n### Subcommand: `xprofiler delete`\n\nThis command is used to delete VM instances, focused on those created by the\n`xprofiler` tool.\n\nUsage details:\n\n```\nxprofiler delete\n  [--help]\n  --zone ZONE_NAME\n  [--log-directory GS_PATH [GS_PATH ...]]\n  [--vm-name VM_NAME [VM_NAME ...]]\n  [--verbose]\n```\n\n#### `xprofiler delete --help`\n\nThis provides the basic usage guide for the `xprofiler delete` subcommand.\n\n### Subcommand: `xprofiler capture`\n\nThis command is used to capture profiles from a running workload. 
The captured\nprofiles will be saved based on the `--log-directory` path.\nSpecifically, they are saved to\n`gs://\u003cbucket-name\u003e/\u003crun-name\u003e/plugins/profile/` if\n`gs://\u003cbucket-name\u003e/\u003crun-name\u003e` is given as the log directory.\n\nUsage details:\n\n```\nxprofiler capture\n  [--help]\n  --log-directory GS_PATH\n  --zone ZONE_NAME\n  --hosts HOST_NAME [HOST_NAME ...]\n  --framework FRAMEWORK\n  [--orchestrator ORCHESTRATOR]\n  [--duration DURATION]\n  [--port LOCAL_PORT]\n  [--verbose]\n```\n\n#### `xprofiler capture --help`\n\nThis provides the basic usage guide for the `xprofiler capture` subcommand.\n\n### Subcommand: `xprofiler connect`\n\nUsage details:\n\n```\nxprofiler connect\n  [--help]\n  --log-directory GS_PATH\n  --zone ZONE_NAME\n  [--mode MODE]\n  [--port LOCAL_PORT]\n  [--host-port HOST_PORT]\n  [--disconnect]\n  [--verbose]\n```\n\n#### `xprofiler connect --help`\n\nThis provides the basic usage guide for the `xprofiler connect` subcommand.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fai-hypercomputer%2Fcloud-diagnostics-xprof","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fai-hypercomputer%2Fcloud-diagnostics-xprof","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fai-hypercomputer%2Fcloud-diagnostics-xprof/lists"}