{"id":18779156,"url":"https://github.com/trisongz/tpubar","last_synced_at":"2025-04-13T11:27:41.577Z","repository":{"id":57476802,"uuid":"319592105","full_name":"trisongz/tpubar","owner":"trisongz","description":"Google Cloud TPU Utilization Bar for Training Models","archived":false,"fork":false,"pushed_at":"2020-12-23T08:51:01.000Z","size":692,"stargazers_count":8,"open_issues_count":1,"forks_count":2,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-03-27T02:39:15.862Z","etag":null,"topics":["tensorflow","tpu","tpus"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/trisongz.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-12-08T09:41:01.000Z","updated_at":"2024-03-06T20:51:45.000Z","dependencies_parsed_at":"2022-09-14T16:22:46.395Z","dependency_job_id":null,"html_url":"https://github.com/trisongz/tpubar","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/trisongz%2Ftpubar","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/trisongz%2Ftpubar/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/trisongz%2Ftpubar/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/trisongz%2Ftpubar/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/trisongz","download_url":"https://codeload.github.com/trisongz/tpubar/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":2
48705011,"owners_count":21148469,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["tensorflow","tpu","tpus"],"created_at":"2024-11-07T20:19:00.227Z","updated_at":"2025-04-13T11:27:41.550Z","avatar_url":"https://github.com/trisongz.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# TPUBar\n\n Google Cloud TPU Utilization Bar for Training Models\n \n\u003cp align=\"center\"\u003e\n    \u003cbr\u003e\n    \u003cimg src=\"https://github.com/trisongz/tpubar/raw/master/docs/tpubar_img.png\"/\u003e\n    \u003cbr\u003e\n\u003cp\u003e\n\n\n```shell\n# from pypi\npip install --upgrade tpubar\n\n# from src\npip install --upgrade git+https://github.com/trisongz/tpubar.git\n```\n\n[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/trisongz/tpubar/blob/master/docs/T5_on_TPU_Torch_XLA_TPUBar.ipynb)\n\n## Quickstart\n\n```python3\n!pip install --upgrade tpubar\n!pip install --upgrade git+https://github.com/trisongz/tpubar.git\n\n# Option #1 on Colab\n\n!tpubar test # you will be prompted to authenticate with GCE on Colab\n\n# Option #2 on Colab\n\nfrom tpubar import TPUMonitor\nimport os\n\nmonitor = TPUMonitor(tpu_name=os.environ.get('TPU_NAME', None), profiler='v2')\n\n# your training code below\n\nmonitor.start()\n\nfor x in dataset:\n    ops(x)\n    print(monitor.current_stats)\n\n# Option #3 in Terminal/CLI - (Non Colab/Remote VM/Your Desktop)\ntpubar test tpu-name\n\n```\n\n## API Quickstart\n\n```python3\nfrom tpubar import TPUMonitor\n\n'''\ndefault args\n- tpu_name = None, 
(str) name of a TPU you want to query, in case of multiple active TPUs\n- project = None, (str) gcp project name\n- profiler = 'v1', (str) options are ['v1', 'v2']\n    - v1: for Non-Colab, Pytorch, Tensorflow Estimator (TF1), and Non-Tensorflow TPU Queries\n    - v2: Colab, Tensorflow 2+\n- refresh_secs = 10, (int) how many seconds between each query\n- fileout = None, (str) path where tqdm output goes, defaults to sys.stdout\n- verbose = False, (bool) prints current_stats every query if True\n- disable = False, (bool) disables TPU Bars if True, useful if you only want to capture stats\n\n# Colors can be defined using standard cli colors or hex (e.g. 'green' or '#00ff00')\n- tpu_util = 'green', (str) color for TPU MXU Bar\n- tpu_secondary = 'yellow', (str) color for second TPU Bar [Memory for v1, Active Time for v2]\n- cpu_util = 'blue', (str) color for CPU Utilization Bar\n- ram_util = 'blue', (str) color for RAM Utilization Bar\n\n'''\nmonitor = TPUMonitor(tpu_name=None, project=None, profiler='v1', refresh_secs=10, fileout=None, verbose=False, disable=False, tpu_util='green', tpu_secondary='yellow', cpu_util='blue', ram_util='blue')\n\nmonitor.start()\n\n# current_stats can be read at any time to retrieve stats; use stats.get(var, '') to avoid errors, since Idle Time and Idle String are not returned until the TPU has fully initialized.\n'''\n# Stats available\n\n- v1 returns {'tpu_mxu': float, 'tpu_mem_per': float, 'tpu_mem_used': float, 'tpu_mem_str': str, 'cpu_util': float, 'ram_util': float, 'ram_util_str': str}\n- v2 returns {'tpu_mxu': float, 'tpu_mxu_str': str, 'tpu_idle_time': float, 'tpu_idle_str': str, 'cpu_util': float, 'ram_util': float, 'ram_util_str': str}\n# Example\n'v1': {'tpu_mxu': 52.88895420451343, 'tpu_mem_per': 100.0, 'tpu_mem_used': 198.5, 'tpu_mem_str': '198.50GB/127.96GB', 'cpu_util': 0.9, 'ram_util': 54.5, 'ram_util_str': '49.43GB/96.00GB'}\n\n'''\nstats = monitor.current_stats\ntpu_mxu = stats.get('tpu_mxu', '')\n\n# Adding Hooks\n# hook = {'name': 'Slack', 
'func': notificationclient.message, 'freq': 10}\n# This will call notificationclient.message(monitor.current_stats) every 10 monitoring iterations\n# If refresh_secs = 10, then the function will fire every 100 seconds.\n# The hook will receive all the stats returned above as a dict.\n\nmonitor.add_hook(name='slack', hook=notificationclient.message, freq=10)\n\n# Remove a Hook\nmonitor.rm_hook(name='slack')\n\n# Manually Firing a Hook\n# To force all hooks to fire, say at the end of a training loop\n\nstats = monitor.current_stats\nmessage = do_format(stats) # format your message into a string\n\nmonitor.fire_hooks(message, force=True)\n\n# Getting the elapsed time (since tpubar started monitoring)\ntrain_time = monitor.get_time(fmt='hrs') # ['secs', 'mins', 'hrs', 'days', 'wks']\n\n# Create a Timeout Monitor that sends a notification when TPU MXU falls below x% after y number of pings\n# timeout_hook = {'idx': 0, 'num_timeouts': num_timeouts, 'hook': hook, 'min_mxu': min_mxu, 'pulse': 0.00, 'warnings': 0}\n# Pulse = last recorded MXU when warning notification fires.\nmonitor.create_timeout_hook(hook=notificationclient.message, min_mxu=10.00, num_timeouts=20)\n\n# Upon firing, the message below is sent to the notificationclient\n# Warnings reset after detecting TPU \u003e min MXU.\n\nmsg = \"TPUBar has detected [number of warnings] periods of under [min_mxu]. Last TPU MXU Pulse: [last recorded MXU]. Time Alive: [time_active in hrs]\"\nnotificationclient.message(msg)\n\n\n# Rerouting Print Functions (Unstable)\n# To avoid line breaks and overlapping bars in stdout, you can optionally reroute any print function to use tpubar's logger, which uses tqdm.write. 
 This will return the rerouted print function\n\n_info = logger.info # back up the original function in case things go wrong\nlogger.info = monitor.reroute_print(logger.info)\n\n# Restore the original\nlogger.info = _info\n\n\n```\n\n## CLI Quickstart\n\nThe commands can be run remotely or on the same VM.\n\n```shell\n\n# Monitor the TPU until Exit (Ctrl+C)\ntpubar monitor [tpuname] --project [gcp_project] (optional)\n\n# Test Run for 60 secs\ntpubar test [tpuname] --project [gcp_project] (optional)\n\n# Create or use an application key found in tpubar/auth.json\ntpubar auth [adc_name] -l (list auths)\n\n# Create new tmux session\ntpubar sess [session_name]\n\n# Attach your current window to the tmux session\ntpubar attach [session_name]\n\n# Kill a tmux session\ntpubar killsess [session_name]\n\n```\n\n\u003cp align=\"center\"\u003e\n    \u003cbr\u003e\n    \u003cimg src=\"https://github.com/trisongz/tpubar/raw/master/docs/tpumonitor.png\"/\u003e\n    \u003cbr\u003e\n\u003cp\u003e\n\n## Notes\n\nTPUBar has two versions, 'v1' and 'v2', because they use different API calls to get TPU metrics. Within Colab, only 'v2' works if you do not have TPUs in your Google Cloud project. Otherwise, choose the version that matches your setup (see below) to avoid compatibility issues.\n\n- 'v1' is meant for TPU projects running on GCE and/or using Tensorflow \u003c 2. Additionally, v1 can be called on a remote system (like your PC) to query your TPU running on GCE without being directly connected. 
Not yet tested, but should also be used in Pytorch training as well.\n\n- 'v2' is meant for Colab and/or Tensorflow 2+, and uses tensorflow APIs, which require the system to be directly connected to the TPUs.\n\n## Bonus\n\nYou can call 'tpubar sess new_session' in CLI to create a new tmux session and 'tpubar killsess new_session' to kill it.\n\n## Contributors\n\n[@shawwn](https://github.com/shawwn)\n\n## Acknowledgements\n\n[Tensorflow Research Cloud](https://www.tensorflow.org/tfrc) for providing TPU Resources","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftrisongz%2Ftpubar","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftrisongz%2Ftpubar","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftrisongz%2Ftpubar/lists"}