{"id":24475580,"url":"https://github.com/nathom/tpu-pod-tutorial","last_synced_at":"2026-01-02T00:37:00.987Z","repository":{"id":273255269,"uuid":"918769221","full_name":"nathom/tpu-pod-tutorial","owner":"nathom","description":"Get started on TPU pods!","archived":false,"fork":false,"pushed_at":"2025-01-24T21:03:16.000Z","size":39,"stargazers_count":1,"open_issues_count":0,"forks_count":2,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-09T19:56:08.310Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/nathom.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-01-18T20:03:16.000Z","updated_at":"2025-01-20T02:14:27.000Z","dependencies_parsed_at":"2025-01-19T20:20:49.204Z","dependency_job_id":"4d8ac9ec-ee2a-49f4-a0f9-2a5840b13a59","html_url":"https://github.com/nathom/tpu-pod-tutorial","commit_stats":null,"previous_names":["nathom/tpu-pod-tutorial"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nathom%2Ftpu-pod-tutorial","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nathom%2Ftpu-pod-tutorial/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nathom%2Ftpu-pod-tutorial/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nathom%2Ftpu-pod-tutorial/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/nathom","download_url":"https://codelo
ad.github.com/nathom/tpu-pod-tutorial/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243603235,"owners_count":20317797,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-01-21T09:15:30.199Z","updated_at":"2026-01-02T00:37:00.937Z","avatar_url":"https://github.com/nathom.png","language":null,"readme":"# tpu-pod-tutorial\n\nThis tutorial will get you started with multi-host TPU VM pods.\nIf you don't want to follow the tutorial step by step, a completed reference version\nis on the `run` branch.\n\n## Set up this tutorial\n\nFirst, fork this repo on GitHub. Then, clone your fork\nonto one of the hosts on the TPU VM.\n\n```sh\ngit clone https://github.com/your_username/tpu-pod-tutorial ~/your_folder/tpu-pod-tutorial\n```\n\n\n## Set up the uv project\n\nTo make builds reproducible across hosts, we will be using \n`uv` as a package manager. 
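A quick note on the reproducibility claim: `uv` pins the exact version of every dependency in a `uv.lock` file, so each host resolves an identical environment. After the `uv init` and `uv add` steps below, the project's `pyproject.toml` will look roughly like this (an illustrative sketch, not generated output; the version numbers are assumptions):

```toml
# Illustrative sketch -- the real file is produced by `uv init` / `uv add`
[project]
name = "tpu-pod-tutorial"
version = "0.1.0"
requires-python = ">=3.10"
dependencies = [
    "jax[tpu]>=0.5.0",
]
```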
It's like `pip` but *much* better.\n\n\u003e If `uv` is not installed, follow the instructions [here](https://docs.astral.sh/uv/getting-started/installation/).\n\n`cd` into `your_folder` and initialize `uv`:\n\n```sh\nuv init .\n```\n\nYou should see something like:\n\n```\n$ uv init .\nInitialized project `tpu-pod-tutorial` at `/home/ucsdwanglab/your_folder/tpu-pod-tutorial`\n$ ls\nLICENSE  README.md  hello.py  pyproject.toml\n```\n\nLet's add JAX as a dependency:\n\n```sh\nuv add --prerelease=allow \"jax[tpu]\" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html\n```\n\n```\nUsing CPython 3.10.16\nCreating virtual environment at: .venv\nResolved 14 packages in 154ms\nInstalled 13 packages in 61ms\n + certifi==2024.12.14\n + charset-normalizer==3.4.1\n + idna==3.10\n + jax==0.5.0\n + jaxlib==0.5.0\n + libtpu==0.0.8\n + libtpu-nightly==0.1.dev20241010+nightly.cleanup\n + ml-dtypes==0.5.1\n + numpy==2.2.1\n + opt-einsum==3.4.0\n + requests==2.32.3\n + scipy==1.15.1\n + urllib3==2.3.0\n$ ls\nLICENSE  README.md  hello.py  pyproject.toml  uv.lock\n```\n\n## Run distributed code\n\nOur program is *distributed*. This means that it involves several machines\n(we call them hosts), each of which runs a process and communicates with the others over\nthe network. Crucially, the hosts **do not share a filesystem**, so we need to distribute all of our code and data\nbefore we can run anything.\n\nLet's begin.\n\nFirst, let's make sure we can connect to all the hosts. To do this, we will use the `eopod` tool,\nwhich can be run with `uv`'s execution tool `uvx`:\n\n```sh\nuvx eopod run echo \"Hello, world!\"\n```\n\n```\nStarted at: 2025-01-18 20:27:48\nExecuting: echo Hello, world!\nExecuting command on worker all: echo Hello, world!\nUsing ssh batch size of 1. 
Attempting to SSH into 1 nodes with a total of 4 workers.\nSSH: Attempting to connect to worker 0...\nSSH: Attempting to connect to worker 1...\nSSH: Attempting to connect to worker 2...\nSSH: Attempting to connect to worker 3...\nHello, world!\nHello, world!\nHello, world!\nHello, world!\nCommand completed successfully\n\nCommand completed successfully!\nCompleted at: 2025-01-18 20:27:52\nDuration: 0:00:03.858619\n```\n\nYou should see one `Hello, world!` per host. I have 4 hosts, so I see 4.\n\nRunning Python code is a bit more involved. First, let's install `uv`\non all hosts:\n\n```sh\nuvx eopod run 'curl -LsSf https://astral.sh/uv/install.sh | sh'\n```\n\n```\nStarted at: 2025-01-18 21:02:46\nExecuting: curl -LsSf https://astral.sh/uv/install.sh | sh\nExecuting command on worker all: curl -LsSf\nhttps://astral.sh/uv/install.sh | sh\nUsing ssh batch size of 1. Attempting to SSH into 1 nodes with a total of 4 workers.\nSSH: Attempting to connect to worker 0...\nSSH: Attempting to connect to worker 1...\nSSH: Attempting to connect to worker 2...\nSSH: Attempting to connect to worker 3...\ndownloading uv 0.5.21 x86_64-unknown-linux-gnu\ndownloading uv 0.5.21 x86_64-unknown-linux-gnu\ndownloading uv 0.5.21 x86_64-unknown-linux-gnu\ndownloading uv 0.5.21 x86_64-unknown-linux-gnu\nno checksums to verify\nno checksums to verify\nno checksums to verify\ninstalling to /home/ucsdwanglab/.local/bin\nno checksums to verify\ninstalling to /home/ucsdwanglab/.local/bin\n  uv\n  uvx\neverything's installed!\n  uv\n  uvx\neverything's installed!\ninstalling to /home/ucsdwanglab/.local/bin\n  uv\n  uvx\neverything's installed!\ninstalling to /home/ucsdwanglab/.local/bin\n  uv\n  uvx\neverything's installed!\nCommand completed successfully\n```\n\nVerify that `uv` is now available on every host. Then, edit `hello.py` to contain\n\n```python\nimport jax\n\njax.distributed.initialize()\n\nif jax.process_index() == 0:\n  print(jax.device_count())\n```\n\nThis will initialize JAX's distributed computing 
system, and print the total number\nof detected devices from Host 0.\n\nRemember that our hosts don't share a filesystem, so we need to sync this code among all the machines. We'll\nuse `git` to do this:\n\n```sh\ngit add . \u0026\u0026 git commit -am \"Add hello.py\" \u0026\u0026 git push\n```\n\nNext, we clone this repo on all hosts:\n\n```sh\nuvx eopod run git clone https://github.com/your_username/tpu-pod-tutorial ~/your_folder/tpu-pod-tutorial\n```\n\nMake sure we get a `Success` from every host:\n\n```sh\nuvx eopod run '[ -d ~/your_folder/tpu-pod-tutorial ] \u0026\u0026 echo Success || echo Fail'\n```\n\nNow, we can run the program. Create `run.sh` with the following contents:\n\n```bash\n#!/bin/bash\n\nif [ $# -lt 1 ]; then\n    echo \"Usage: $0 \u003cpython_file\u003e\"\n    exit 1\nfi\n\nPYTHON_FILE=\"$1\"\n\neopod kill-tpu --force\neopod run \"\nexport PATH=$PATH \u0026\u0026\ncd ~/your_folder/tpu-pod-tutorial \u0026\u0026\ngit pull \u0026\u0026\nuv run --prerelease allow $PYTHON_FILE\n\"\n```\n\nAnd run:\n\n```sh\nbash run.sh hello.py\n```\n\n```\n$ bash run.sh hello.py\n⠋ Scanning for TPU processes...Fetching TPU status...\n⠙ Scanning for TPU processes...TPU state: READY\nExecuting command on worker 0: ps aux | grep -E\n'python|jax|tensorflow' | grep -v grep | awk '{print $2}' | while read\npid; do   if [ -d /proc/$pid ] \u0026\u0026 grep -q 'accel' /proc/$pid/maps\n2\u003e/dev/null; then     echo $pid;  fi; done\nExecuting command on worker 1: ps aux | grep -E\n'python|jax|tensorflow' | grep -v grep | awk '{print $2}' | while read\npid; do   if [ -d /proc/$pid ] \u0026\u0026 grep -q 'accel' /proc/$pid/maps\n2\u003e/dev/null; then     echo $pid;  fi; done\nExecuting command on worker 2: ps aux | grep -E\n'python|jax|tensorflow' | grep -v grep | awk '{print $2}' | while read\npid; do   if [ -d /proc/$pid ] \u0026\u0026 grep -q 'accel' /proc/$pid/maps\n2\u003e/dev/null; then     echo $pid;  fi; done\nExecuting command on worker 3: ps aux | grep -E\n'python|jax|tensorflow' | grep 
-v grep | awk '{print $2}' | while read\npid; do   if [ -d /proc/$pid ] \u0026\u0026 grep -q 'accel' /proc/$pid/maps\n2\u003e/dev/null; then     echo $pid;  fi; done\n⠏ Scanning for TPU processes...Command completed successfully\nCommand completed successfully\n⠋ Scanning for TPU processes...Command completed successfully\n⠙ Scanning for TPU processes...Command completed successfully\nNo TPU processes found.\n⠙ Scanning for TPU processes...\nStarted at: 2025-01-18 21:00:12\nExecuting:\nexport\nPATH=/home/ucsdwanglab/.local/bin:/home/ucsdwanglab/.local/bin:/usr/lo\ncal/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/\nlocal/games:/snap/bin:/home/ucsdwanglab/.local/bin \u0026\u0026\ncd ~/your_folder/tpu-pod-tutorial \u0026\u0026\ngit checkout run \u0026\u0026\ngit pull \u0026\u0026\nuv run --prerelease allow hello.py\n\nExecuting command on worker all:\nexport\nPATH=/home/ucsdwanglab/.local/bin:/home/ucsdwanglab/.local/bin:/usr/lo\ncal/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/\nlocal/games:/snap/bin:/home/ucsdwanglab/.local/bin \u0026\u0026\ncd ~/your_folder/tpu-pod-tutorial \u0026\u0026\ngit checkout run \u0026\u0026\ngit pull \u0026\u0026\nuv run --prerelease allow hello.py\n\nUsing ssh batch size of 1. 
Attempting to SSH into 1 nodes with a total of 4 workers.\nSSH: Attempting to connect to worker 0...\nSSH: Attempting to connect to worker 1...\nSSH: Attempting to connect to worker 2...\nSSH: Attempting to connect to worker 3...\nAlready on 'run'\nM\tREADME.md\nYour branch is up to date with 'origin/run'.\nAlready on 'run'\nYour branch is up to date with 'origin/run'.\nAlready on 'run'\nYour branch is up to date with 'origin/run'.\nAlready on 'run'\nYour branch is up to date with 'origin/run'.\nAlready up to date.\nAlready up to date.\nAlready up to date.\nAlready up to date.\n16\nCommand completed successfully\n\nCommand completed successfully!\nCompleted at: 2025-01-18 21:00:20\nDuration: 0:00:08.020470\n```\n\nWe see there are 16 detected devices on the `v4-32` pod! \n\nLet's try running a distributed computation. Create a file `pmap.py` in the repo\nwith the following contents\n\n```python\n# The following code snippet will be run on all TPU hosts\nimport jax\n\n# The total number of TPU cores in the Pod\ndevice_count = jax.device_count()\n\n# The number of TPU cores attached to this host\nlocal_device_count = jax.local_device_count()\n\n# The psum is performed over all mapped devices across the Pod\nxs = jax.numpy.ones(jax.local_device_count())\nr = jax.pmap(lambda x: jax.lax.psum(x, 'i'), axis_name='i')(xs)\n\n# Print from a single host to avoid duplicated output\nif jax.process_index() == 0:\n    print('global device count:', jax.device_count())\n    print('local device count:', jax.local_device_count())\n    print('pmap result:', r)\n```\n\nCommit and push\n\n```sh\ngit add pmap.py \u0026\u0026 git commit -am \"Add pmap.py\" \u0026\u0026 git push \n```\n\nAnd run \n\n```sh\nbash run.sh pmap.py\n```\n\nAfter all the setup output, it should show something like\n\n```\nglobal device count: 16\nlocal device count: 4\npmap result: [16. 16. 16. 
16.]\n```\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnathom%2Ftpu-pod-tutorial","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnathom%2Ftpu-pod-tutorial","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnathom%2Ftpu-pod-tutorial/lists"}