{"id":21986905,"url":"https://github.com/saforem2/llm","last_synced_at":"2026-05-08T10:35:30.111Z","repository":{"id":204670695,"uuid":"710502535","full_name":"saforem2/llm","owner":"saforem2","description":"LLMs","archived":false,"fork":false,"pushed_at":"2023-10-31T11:51:25.000Z","size":8,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-18T22:57:49.675Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/saforem2.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2023-10-26T20:35:42.000Z","updated_at":"2023-10-31T11:51:28.000Z","dependencies_parsed_at":null,"dependency_job_id":"9a203f28-04da-431b-802a-05d760aa4634","html_url":"https://github.com/saforem2/llm","commit_stats":null,"previous_names":["saforem2/llm"],"tags_count":0,"template":false,"template_full_name":"saforem2/ezpz","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/saforem2%2Fllm","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/saforem2%2Fllm/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/saforem2%2Fllm/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/saforem2%2Fllm/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/saforem2","download_url":"https://codeload.github.com/saforem2/llm/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245045479,"owners_count":20552044,"icon_url":"https:/
/github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-29T18:22:36.309Z","updated_at":"2026-05-08T10:35:25.055Z","avatar_url":"https://github.com/saforem2.png","language":"Shell","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ✨ `ezpz`\n\n[![Pytorch](https://img.shields.io/badge/PyTorch-ee4c2c?logo=pytorch\u0026logoColor=white)](#pytorch) [![Tensorflow](https://img.shields.io/badge/TensorFlow-%23FF6F00.svg?\u0026logo=TensorFlow\u0026logoColor=white)](#tensorflow) [![hydra](https://img.shields.io/badge/Config-Hydra-89b8cd)](https://hydra.cc)\n\n\u003e [!NOTE]\n\u003e This library is **very much** still a WIP.  \n\u003e Any ideas / issues / suggestions for improving things would be greatly appreciated.\n\nSimplifies the process of setting up distributed training for:\n\n- [`framework=pytorch`](#pytorch) + `backend={DDP, deepspeed, horovod}`\n\n- [`framework=tensorflow`](#tensorflow) + `backend=horovod`\n\nezpz setup on any of `{thetaGPU, Polaris, Perlmutter}`:\n\n```bash\ngit clone 'https://github.com/saforem2/ezpz'\nsource ./ezpz/src/ezpz/bin/savejobenv\npython3 -m pip install -e ezpz --require-virtualenv\n# e.g. 
to launch src/ezpz/__main__.py with pytorch + deepspeed:\nlaunch $(which python3) -m ezpz framework=pytorch backend=deepspeed\n```\n\n_2ez_.\n\n## Setup\n\n\u003cdetails open\u003e\u003csummary\u003e\u003ch3\u003eALCF:\u003c/h3\u003e\u003c/summary\u003e\n\n\n```bash\n# Most recent `conda` versions as of 10-17-2023\nif [[ $(hostname) == x3* ]]; then\n    export MACHINE=\"polaris\"\n    export CONDA_DATE=\"2023-10-04\"\nelif [[ $(hostname) == theta* ]]; then\n    export MACHINE=\"thetaGPU\"\n    export CONDA_DATE=\"2023-01-11\"\nfi\nmodule load \"conda/${CONDA_DATE}\" ; conda activate base\n# Clone saforem2/ezpz and navigate into it\ngit clone https://github.com/saforem2/ezpz\ncd ezpz\n# Make a new venv for this project,\n# in the project root: ./venvs/$MACHINE/$CONDA_DATE\nVENV_DIR=\"venvs/${MACHINE}/${CONDA_DATE}\"\npython3 -m venv \"${VENV_DIR}\" --system-site-packages\nsource \"venvs/${MACHINE}/${CONDA_DATE}/bin/activate\"\n# install `ezpz` into this `venv`\npython3 -m pip install -e .\n# to launch simple training example\n# (launches `src/ezpz/__main__.py`)\ncd src/ezpz\n./bin/train.sh framework=pytorch backend=DDP\n```\n\u003c/details\u003e\n\n\u003cdetails open\u003e\u003csummary\u003e\u003ch3\u003ePerlmutter (@ NERSC):\u003c/h3\u003e\u003c/summary\u003e\n\n```bash\n# request slurm allocation with `salloc`\nNODES=2 ; HRS=2 ; salloc --nodes $NODES --qos preempt --time $HRS:00:00 -C 'gpu\u0026hbm80g' --gpus=$(( 4 * NODES )) -A \u003cproj\u003e_g\n# load `pytorch/2.0.1` module\nmodule load libfabric cudatoolkit pytorch/2.0.1\n# Clone saforem2/ezpz and navigate into it\ngit clone https://github.com/saforem2/ezpz\ncd ezpz\n# update pip and install `ezpz`\npython3 -m pip install --upgrade pip setuptools wheel\npython3 -m pip install -e .\ncd src/ezpz\n./bin/train.sh framework=pytorch backend=DDP\n```\n\n\u003c/details\u003e\n\nwhere `framework` $\\in$ `{pytorch, tensorflow}`, and `backend` $\\in$ `{DDP,\ndeepspeed, horovod}`[^tf-hvd]  \n\n[^tf-hvd]: Note 
`framework=tensorflow` is **only** compatible with `backend=horovod`\n\n\n\u003cdetails closed\u003e\u003csummary\u003e\u003cb\u003eDeprecated:\u003c/b\u003e\u003c/summary\u003e\n\n- Install:\n  ```bash\n  git clone https://github.com/saforem2/ezpz\n  python3 -m pip install -e ezpz\n  ```\n\n- Determine available resources:\n  ```bash\n  [[ \"$(hostname)\" == theta* ]] \u0026\u0026 HOSTFILE=\"${COBALT_NODEFILE}\"  # ThetaGPU @ ALCF\n  [[ \"$(hostname)\" == x3* ]] \u0026\u0026 HOSTFILE=\"${PBS_NODEFILE}\"        # Polaris @ ALCF\n  [[ \"$(hostname)\" == nid* ]] \u0026\u0026 HOSTFILE=\"${SLURM_NODELIST}\"     # Perlmutter @ NERSC\n  NHOSTS=$(wc -l \u003c \"${HOSTFILE}\")\n  NGPU_PER_HOST=$(nvidia-smi -L | wc -l)\n  NGPUS=\"$((${NHOSTS}*${NGPU_PER_HOST}))\";\n  echo $NHOSTS $NGPU_PER_HOST $NGPUS\n  2 4 8\n  ```\n\n- Example `python` script:\n\n  ```python\n  \"\"\"\n  ezpz/test.py\n  \"\"\"\n  from ezpz import setup_torch, setup_tensorflow\n\n\n  def test(\n      framework: str = 'pytorch',\n      backend: str = 'deepspeed',\n      port: str = '5432'\n  ):\n      if framework == 'pytorch':\n          _ = setup_torch(\n              backend=backend,\n              port=port,\n          )\n      elif framework == 'tensorflow':\n          _ = setup_tensorflow()\n      else:\n          raise ValueError\n\n  if __name__ == '__main__':\n      import sys\n      try:\n          framework = sys.argv[1]\n      except IndexError:\n          framework = 'pytorch'\n      try:\n          backend = sys.argv[2]\n      except IndexError:\n          backend = 'deepspeed'\n      try:\n          port = sys.argv[3]\n      except IndexError:\n          port = '5432'\n      test(framework=framework, backend=backend, port=port)\n  ```\n  \n\u003c/details\u003e\n\n\n## Examples\n\n\u003e [!IMPORTANT]\n\u003e We can `launch` on any of `{ThetaGPU, Polaris, Perlmutter}` (*)\n\u003e with a specific `{framework, backend}` combo by\n\u003e 1. 
[`savejobenv`](./src/ezpz/bin/savejobenv):\n\u003e     - This will `export launch=\u003clauncher\u003e \u003clauncher-opts\u003e`\n\u003e       for `\u003clauncher\u003e` $\\in$ `{mpirun,mpiexec,srun}`\n\u003e       on (*) respectively.\n\u003e     - By default, `launch \u003cexec\u003e` will launch `\u003cexec\u003e` across\n\u003e       _all_ the available GPUs in your active `{COBALT,PBS,slurm}` job.\n\u003e 2. `launch`\n\u003e     - e.g. `launch $(which python3) -m ezpz framework=\u003cframework\u003e backend=\u003cbackend\u003e`, will:\n\u003e         - `launch` [`__main__.py`](./src/ezpz/__main__.py) (in this case)\n\u003e           with framework `\u003cframework\u003e` and backend `\u003cbackend\u003e`\n\u003e           (e.g. `pytorch` and `deepspeed`)\n\u003e\n\u003e - Complete example:\n\u003e ```bash\n\u003e #!/bin/bash --login\n\u003e git clone https://github.com/saforem2/ezpz\n\u003e source ./ezpz/src/ezpz/bin/savejobenv\n\u003e launch $(which python3) -m ezpz framework=\u003cframework\u003e backend=\u003cbackend\u003e\n\u003e ```\n\u003e for `framework` $\\in$ `{pytorch, tensorflow}` and `backend` $\\in$ `{horovod, deepspeed, DDP}`[^1]\n\n[^1]: `deepspeed`, `DDP` only support `pytorch`\n\n### PyTorch\n\n\u003cdetails closed\u003e\u003csummary\u003e\u003ch3\u003e✅ PyTorch + [...]\u003c/h3\u003e\u003c/summary\u003e\n  \n\u003cdetails closed\u003e\u003csummary\u003e\u003ch4\u003e\u003ccode\u003eDDP\u003c/code\u003e:\u003c/h4\u003e\u003c/summary\u003e\n\n```bash\nlaunch framework=pytorch backend=DDP\n```\n\n\u003cdetails closed\u003e\u003csummary\u003e\u003cb\u003eOutput:\u003c/b\u003e\u003c/summary\u003e\n\n```bash\nConnected to tcp://x3005c0s31b1n0.hsn.cm.polaris.alcf.anl.gov:7919\nFound executable /soft/datascience/conda/2023-10-04/mconda3/bin/python3\nLaunching application c079ffa9-4732-45ba-995b-e5685330311b\n[10/05/23 16:56:26][INFO][dist.py:362] - Using DDP for distributed training\n[10/05/23 16:56:27][INFO][dist.py:413] - RANK: 0 / 7\n[10/05/23 
16:56:27][INFO][dist.py:413] - RANK: 2 / 7\n[10/05/23 16:56:27][INFO][dist.py:413] - RANK: 4 / 7\n[10/05/23 16:56:27][INFO][dist.py:413] - RANK: 3 / 7\n[10/05/23 16:56:27][INFO][dist.py:413] - RANK: 1 / 7\n[10/05/23 16:56:27][INFO][dist.py:413] - RANK: 6 / 7\n[10/05/23 16:56:27][INFO][dist.py:413] - RANK: 5 / 7\n[10/05/23 16:56:27][INFO][dist.py:413] - RANK: 7 / 7\n```\n\n\u003c/details\u003e\n\u003c/details\u003e\n\n\u003cdetails closed\u003e\u003csummary\u003e\u003ch4\u003e\u003ccode\u003edeepspeed\u003c/code\u003e:\u003c/h4\u003e\u003c/summary\u003e\n\n```bash\nlaunch framework=pytorch backend=deepspeed\n```\n\n\u003cdetails closed\u003e\u003csummary\u003e\u003cb\u003eOutput:\u003c/b\u003e\u003c/summary\u003e\n\n```bash\nConnected to tcp://x3005c0s31b1n0.hsn.cm.polaris.alcf.anl.gov:7919\nFound executable /soft/datascience/conda/2023-10-04/mconda3/bin/python3\nLaunching application c1c5bcd5-c300-4927-82e4-236d4643e31d\n[10/05/23 16:56:34][INFO][dist.py:362] - Using deepspeed for distributed training\n[2023-10-05 16:56:34,949] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)\n[2023-10-05 16:56:34,949] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)\n[2023-10-05 16:56:34,949] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)\n[2023-10-05 16:56:34,949] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)\n[2023-10-05 16:56:34,953] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)\n[2023-10-05 16:56:34,953] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)\n[2023-10-05 16:56:34,953] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)\n[2023-10-05 16:56:34,953] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)\n[2023-10-05 
16:56:40,160] [INFO] [comm.py:637:init_distributed] cdb=None\n[2023-10-05 16:56:40,160] [INFO] [comm.py:637:init_distributed] cdb=None\n[2023-10-05 16:56:40,160] [INFO] [comm.py:652:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...\n[2023-10-05 16:56:40,160] [INFO] [comm.py:637:init_distributed] cdb=None\n[2023-10-05 16:56:40,160] [INFO] [comm.py:652:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...\n[2023-10-05 16:56:40,160] [INFO] [comm.py:652:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...\n[2023-10-05 16:56:40,160] [INFO] [comm.py:637:init_distributed] cdb=None\n[2023-10-05 16:56:40,160] [INFO] [comm.py:652:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...\n[2023-10-05 16:56:40,767] [INFO] [comm.py:637:init_distributed] cdb=None\n[2023-10-05 16:56:40,767] [INFO] [comm.py:637:init_distributed] cdb=None\n[2023-10-05 16:56:40,767] [INFO] [comm.py:652:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...\n[2023-10-05 16:56:40,767] [INFO] [comm.py:652:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...\n[2023-10-05 16:56:40,767] [INFO] [comm.py:637:init_distributed] cdb=None\n[2023-10-05 16:56:40,767] [INFO] [comm.py:652:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...\n[2023-10-05 16:56:40,767] [INFO] [comm.py:637:init_distributed] cdb=None\n[2023-10-05 16:56:40,767] [INFO] [comm.py:652:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...\n[2023-10-05 16:56:41,621] [INFO] [comm.py:702:mpi_discovery] Discovered MPI settings of world_rank=4, local_rank=0, world_size=8, master_addr=10.140.57.89, master_port=29500\n[2023-10-05 16:56:41,621] [INFO] [comm.py:702:mpi_discovery] 
Discovered MPI settings of world_rank=5, local_rank=1, world_size=8, master_addr=10.140.57.89, master_port=29500\n[2023-10-05 16:56:41,621] [INFO] [comm.py:702:mpi_discovery] Discovered MPI settings of world_rank=0, local_rank=0, world_size=8, master_addr=10.140.57.89, master_port=29500\n[2023-10-05 16:56:41,621] [INFO] [comm.py:702:mpi_discovery] Discovered MPI settings of world_rank=6, local_rank=2, world_size=8, master_addr=10.140.57.89, master_port=29500\n[2023-10-05 16:56:41,621] [INFO] [comm.py:702:mpi_discovery] Discovered MPI settings of world_rank=1, local_rank=1, world_size=8, master_addr=10.140.57.89, master_port=29500\n[2023-10-05 16:56:41,621] [INFO] [comm.py:702:mpi_discovery] Discovered MPI settings of world_rank=7, local_rank=3, world_size=8, master_addr=10.140.57.89, master_port=29500\n[2023-10-05 16:56:41,621] [INFO] [comm.py:702:mpi_discovery] Discovered MPI settings of world_rank=2, local_rank=2, world_size=8, master_addr=10.140.57.89, master_port=29500\n[2023-10-05 16:56:41,621] [INFO] [comm.py:702:mpi_discovery] Discovered MPI settings of world_rank=3, local_rank=3, world_size=8, master_addr=10.140.57.89, master_port=29500\n[2023-10-05 16:56:41,621] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl\n[10/05/23 16:56:41][INFO][dist.py:413] - RANK: 0 / 7\n[10/05/23 16:56:41][INFO][dist.py:413] - RANK: 2 / 7\n[10/05/23 16:56:41][INFO][dist.py:413] - RANK: 1 / 7\n[10/05/23 16:56:41][INFO][dist.py:413] - RANK: 7 / 7\n[10/05/23 16:56:41][INFO][dist.py:413] - RANK: 4 / 7\n[10/05/23 16:56:41][INFO][dist.py:413] - RANK: 5 / 7\n[10/05/23 16:56:41][INFO][dist.py:413] - RANK: 6 / 7\n[10/05/23 16:56:41][INFO][dist.py:413] - RANK: 3 / 7\n```\n\n\u003c/details\u003e\n\u003c/details\u003e\n\n\u003cdetails closed\u003e\u003csummary\u003e\u003ch4\u003e\u003ccode\u003ehorovod\u003c/code\u003e\u003c/h4\u003e\u003c/summary\u003e\n\n```bash\nlaunch framework=pytorch backend=horovod\n```\n\n\u003cdetails 
closed\u003e\u003csummary\u003e\u003cb\u003eOutput:\u003c/b\u003e\u003c/summary\u003e\n\n```bash\nConnected to tcp://x3005c0s31b1n0.hsn.cm.polaris.alcf.anl.gov:7919\nFound executable /soft/datascience/conda/2023-10-04/mconda3/bin/python3\nLaunching application c079ffa9-4732-45ba-995b-e5685330311b\n[10/05/23 16:56:26][INFO][dist.py:362] - Using DDP for distributed training\n[10/05/23 16:56:27][INFO][dist.py:413] - RANK: 0 / 7\n[10/05/23 16:56:27][INFO][dist.py:413] - RANK: 2 / 7\n[10/05/23 16:56:27][INFO][dist.py:413] - RANK: 4 / 7\n[10/05/23 16:56:27][INFO][dist.py:413] - RANK: 3 / 7\n[10/05/23 16:56:27][INFO][dist.py:413] - RANK: 1 / 7\n[10/05/23 16:56:27][INFO][dist.py:413] - RANK: 6 / 7\n[10/05/23 16:56:27][INFO][dist.py:413] - RANK: 5 / 7\n[10/05/23 16:56:27][INFO][dist.py:413] - RANK: 7 / 7\n```\n\n\u003c/details\u003e\n\u003c/details\u003e\n\u003c/details\u003e\n\n### TensorFlow\n\n\u003cdetails closed\u003e\u003csummary\u003e\u003ch3\u003e✅ TensorFlow + \u003ccode\u003ehorovod\u003c/code\u003e:\u003c/h3\u003e\u003c/summary\u003e\n\n```bash\nlaunch framework=tensorflow backend=horovod\n```\n\n\u003cdetails closed\u003e\u003csummary\u003e\u003cb\u003eOutput:\u003c/b\u003e\u003c/summary\u003e\n\n```bash\nConnected to tcp://x3005c0s31b1n0.hsn.cm.polaris.alcf.anl.gov:7919\nFound executable /soft/datascience/conda/2023-10-04/mconda3/bin/python3\nLaunching application 2b7b89f3-5f40-42de-aa12-a15876baee09\n2023-10-05 16:56:49.870938: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.\nTo enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.\n2023-10-05 16:56:49.870938: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.\nTo enable the following instructions: SSE3 
SSE4.1 SSE4.2 AVX AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.\n2023-10-05 16:56:49.870938: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.\nTo enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.\n2023-10-05 16:56:49.870940: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.\nTo enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.\n2023-10-05 16:56:50.038355: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.\nTo enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.\n2023-10-05 16:56:50.038355: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.\nTo enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.\n2023-10-05 16:56:50.038353: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.\nTo enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.\n2023-10-05 16:56:50.038359: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.\nTo enable the 
following instructions: SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.\n2023-10-05 16:57:00.277129: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 38341 MB memory:  -\u003e device: 0, name: NVIDIA A100-SXM4-40GB, pci bus id: 0000:07:00.0,compute capability: 8.0\n[10/05/23 16:57:00][INFO][dist.py:203] - RANK: 4 / 7\n2023-10-05 16:57:00.303774: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 38341 MB memory:  -\u003e device: 0, name: NVIDIA A100-SXM4-40GB, pci bus id: 0000:07:00.0,compute capability: 8.0\n[10/05/23 16:57:00][INFO][dist.py:203] - RANK: 0 / 7\n2023-10-05 16:57:00.430211: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 38341 MB memory:  -\u003e device: 1, name: NVIDIA A100-SXM4-40GB, pci bus id: 0000:46:00.0,compute capability: 8.0\n[10/05/23 16:57:00][INFO][dist.py:203] - RANK: 5 / 7\n2023-10-05 16:57:00.445891: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 38341 MB memory:  -\u003e device: 1, name: NVIDIA A100-SXM4-40GB, pci bus id: 0000:46:00.0,compute capability: 8.0\n2023-10-05 16:57:00.447921: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 38341 MB memory:  -\u003e device: 2, name: NVIDIA A100-SXM4-40GB, pci bus id: 0000:85:00.0,compute capability: 8.0\n[10/05/23 16:57:00][INFO][dist.py:203] - RANK: 1 / 7\n[10/05/23 16:57:00][INFO][dist.py:203] - RANK: 2 / 7\n2023-10-05 16:57:00.452035: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 38341 MB memory:  -\u003e device: 2, name: NVIDIA A100-SXM4-40GB, pci bus id: 0000:85:00.0,compute capability: 
8.0\n[10/05/23 16:57:00][INFO][dist.py:203] - RANK: 6 / 7\n2023-10-05 16:57:00.458780: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 38341 MB memory:  -\u003e device: 3, name: NVIDIA A100-SXM4-40GB, pci bus id: 0000:c7:00.0,compute capability: 8.0\n[10/05/23 16:57:00][INFO][dist.py:203] - RANK: 7 / 7\n2023-10-05 16:57:00.472986: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 38341 MB memory:  -\u003e device: 3, name: NVIDIA A100-SXM4-40GB, pci bus id: 0000:c7:00.0,compute capability: 8.0\n[10/05/23 16:57:00][INFO][dist.py:203] - RANK: 3 / 7\n```\n\n\u003c/details\u003e\n\u003c/details\u003e\n\n## Helper Utilities\n\n- [`src/ezpz/bin/savejobenv`](./src/ezpz/bin/savejobenv): Shell script to save\n  relevant job related environment variables to a file which can be `sourced`\n  from new login instances.\n- [`src/ezpz/bin/getjobenv`](./src/ezpz/bin/getjobenv): Shell script that, when\n  sourced, will populate the current environment with the necessary job-related\n  variables.\n\n\n\u003c!--\u003cdetails open\u003e\u003csummary\u003e\u003ch3\u003esavejobenv\u003c/h3\u003e\u003c/summary\u003e--\u003e\n\n### `savejobenv`\n\nLaunch a job, clone (or navigate into) `ezpz`, and `source` [`src/ezpz/bin/savejobenv`](./src/ezpz/bin/savejobenv):\n\n```bash\n(thetalogin4) $ qsub-gpu -A datascience -n 2 -q full-node --attrs=\"filesystems=home,grand,eagle,theta-fs0:ssds=required\" -t 06:00 -I\nJob routed to queue \"full-node\".\nWait for job 10155652 to start...\nOpening interactive session to thetagpu04\n[...]\n```\n\n```bash\n(thetagpu04) $ git clone https://github.com/saforem2/ezpz\n(thetagpu04) $ source ezpz/src/ezpz/bin/savejobenv\n┌───────────────────────────────────────────────────────────────────\n│ Writing COBALT vars to /home/foremans/.cobaltenv\n│ HOSTFILE: /var/tmp/cobalt.10155652\n│ NHOSTS: 2\n│ 8 GPUs per host\n│ 16 GPUs 
total\n└───────────────────────────────────────────────────────────────────\n┌───────────────────────────────────────────────────────────────────\n│ [DIST INFO]:\n│   • Writing Job info to /home/foremans/.cobaltenv\n│     • HOSTFILE: /var/tmp/cobalt.10155652\n│     • NHOSTS: 2\n│     • NGPU_PER_HOST: 8\n│     • NGPUS = (NHOSTS * NGPU_PER_HOST) = 16\n│ [Hosts]:\n│       • thetagpu04 thetagpu19\n│ [Launch]:\n│     • Use: 'launch' (=mpirun -n  -N  --hostfile /var/tmp/cobalt.10155652 -x PATH -x LD_LIBRARY_PATH)\n│       to launch job\n└───────────────────────────────────────────────────────────────────\n┌────────────────────────────────────────────────────────────────────────────────\n│ YOU ARE HERE: /home/foremans\n│ Run 'source ./bin/getjobenv' in a NEW SHELL to automatically set env vars\n└────────────────────────────────────────────────────────────────────────────────\n```\n\n\n\u003c!--\n\u003cdetails closed\u003e\u003csummary\u003e\u003ch3\u003e\u003ccode\u003egetjobenv\u003c/code\u003e\u003c/h3\u003e\u003c/summary\u003e\n--\u003e\n\n\n### `getjobenv`\n\nNow, in a **NEW SHELL**\n\n```bash\n(localhost)   $ ssh \u003cuser\u003e@theta\n```\n\n```bash\n(thetalogin4) $ ssh thetagpu19\n```\n\n```bash\n(thetagpu19)  $ module load conda/2023-01-11; conda activate base\n(thetagpu19)  $ cd ezpz\n(thetagpu19)  $ source ./src/ezpz/bin/getjobenv\n┌──────────────────────────────────────────────────────────────────\n│ [Hosts]: \n│     • thetagpu04, thetagpu19\n└──────────────────────────────────────────────────────────────────\n┌──────────────────────────────────────────────────────────────────\n│ [DIST INFO]: \n│     • Loading job env from: /home/foremans/.cobaltenv\n│     • HOSTFILE: /var/tmp/cobalt.10155652\n│     • NHOSTS: 2\n│     • NGPU_PER_HOST: 8\n│     • NGPUS (NHOSTS x NGPU_PER_HOST): 16\n│     • DIST_LAUNCH: mpirun -n 16 -N 8 --hostfile /var/tmp/cobalt.10155652 -x PATH -x LD_LIBRARY_PATH\n│     • Defining alias: launch: aliased to mpirun -n 16 -N 8 --hostfile 
/var/tmp/cobalt.10155652 -x PATH -x LD_LIBRARY_PATH\n└──────────────────────────────────────────────────────────────────\n(thetagpu19) $ mkdir -p venvs/thetaGPU/2023-01-11\n(thetagpu19) $ python3 -m venv venvs/thetaGPU/2023-01-11 --system-site-packages\n(thetagpu19) $ source venvs/thetaGPU/2023-01-11/bin/activate\n(thetagpu19) $ python3 -m pip install -e . --require-virtualenv\n(thetagpu19) $ launch python3 -m ezpz framework=pytorch backend=DDP\n[2023-10-26 12:21:26,716][ezpz.dist][INFO] - Using DDP for distributed training\n[2023-10-26 12:21:26,787][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 13\n[2023-10-26 12:21:26,787][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 14\n[2023-10-26 12:21:26,787][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 8\n[2023-10-26 12:21:26,787][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 12\n[2023-10-26 12:21:26,787][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 6\n[2023-10-26 12:21:26,788][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 9\n[2023-10-26 12:21:26,787][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 10\n[2023-10-26 12:21:26,788][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 15\n[2023-10-26 12:21:26,788][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 11\n[2023-10-26 12:21:26,789][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 7\n[2023-10-26 12:21:26,789][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 3\n[2023-10-26 
12:21:26,789][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 1\n[2023-10-26 12:21:26,789][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 4\n[2023-10-26 12:21:26,789][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 5\n[2023-10-26 12:21:26,789][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 2\n[2023-10-26 12:21:26,798][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 0\n[2023-10-26 12:21:26,811][torch.distributed.distributed_c10d][INFO] - Rank 14: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes.\n[2023-10-26 12:21:26,812][torch.distributed.distributed_c10d][INFO] - Rank 6: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes.\n[2023-10-26 12:21:26,814][torch.distributed.distributed_c10d][INFO] - Rank 13: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes.\n[2023-10-26 12:21:26,815][torch.distributed.distributed_c10d][INFO] - Rank 7: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes.\n[2023-10-26 12:21:26,816][torch.distributed.distributed_c10d][INFO] - Rank 8: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes.\n[2023-10-26 12:21:26,817][torch.distributed.distributed_c10d][INFO] - Rank 3: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes.\n[2023-10-26 12:21:26,819][torch.distributed.distributed_c10d][INFO] - Rank 12: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes.\n[2023-10-26 12:21:26,820][torch.distributed.distributed_c10d][INFO] - Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes.\n[2023-10-26 12:21:26,821][torch.distributed.distributed_c10d][INFO] - Rank 10: Completed 
store-based barrier for key:store_based_barrier_key:1 with 16 nodes.\n[2023-10-26 12:21:26,823][torch.distributed.distributed_c10d][INFO] - Rank 4: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes.\n[2023-10-26 12:21:26,825][torch.distributed.distributed_c10d][INFO] - Rank 9: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes.\n[2023-10-26 12:21:26,825][torch.distributed.distributed_c10d][INFO] - Rank 5: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes.\n[2023-10-26 12:21:26,827][torch.distributed.distributed_c10d][INFO] - Rank 15: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes.\n[2023-10-26 12:21:26,828][torch.distributed.distributed_c10d][INFO] - Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes.\n[2023-10-26 12:21:26,830][torch.distributed.distributed_c10d][INFO] - Rank 11: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes.\n[2023-10-26 12:21:26,831][torch.distributed.distributed_c10d][INFO] - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes.\n[2023-10-26 12:21:27,035][ezpz.dist][INFO] - RANK: 0 / 15\n{\n  \"framework\": \"pytorch\",\n  \"backend\": \"DDP\",\n  \"use_wandb\": false,\n  \"seed\": null,\n  \"port\": null,\n  \"ds_config_path\": null,\n  \"wandb_project_name\": null,\n  \"precision\": null,\n  \"ngpus\": null\n}\n[2023-10-26 12:21:27,038][__main__][INFO] - Output dir: /lus/grand/projects/datascience/foremans/locations/thetaGPU/projects/saforem2/ezpz/outputs/runs/pytorch/DDP/2023-10-26/12-21-25\n[2023-10-26 12:21:27,097][ezpz.dist][INFO] - RANK: 8 / 15\n[2023-10-26 12:21:27,103][ezpz.dist][INFO] - RANK: 6 / 15\n[2023-10-26 12:21:27,104][ezpz.dist][INFO] - RANK: 14 / 15\n[2023-10-26 12:21:27,111][ezpz.dist][INFO] - RANK: 13 / 15\n[2023-10-26 12:21:27,116][ezpz.dist][INFO] - RANK: 1 / 15\n[2023-10-26 12:21:27,126][ezpz.dist][INFO] - RANK: 
7 / 15\n[2023-10-26 12:21:27,135][ezpz.dist][INFO] - RANK: 10 / 15\n[2023-10-26 12:21:27,139][ezpz.dist][INFO] - RANK: 12 / 15\n[2023-10-26 12:21:27,141][ezpz.dist][INFO] - RANK: 9 / 15\n[2023-10-26 12:21:27,141][ezpz.dist][INFO] - RANK: 15 / 15\n[2023-10-26 12:21:27,141][ezpz.dist][INFO] - RANK: 11 / 15\n[2023-10-26 12:21:27,141][ezpz.dist][INFO] - RANK: 5 / 15\n[2023-10-26 12:21:27,144][ezpz.dist][INFO] - RANK: 2 / 15\n[2023-10-26 12:21:27,145][ezpz.dist][INFO] - RANK: 4 / 15\n[2023-10-26 12:21:27,145][ezpz.dist][INFO] - RANK: 3 / 15\n16.56s user 30.05s system 706% cpu 6.595s total\n```\n\nWhile this example uses ThetaGPU, the same process works on any\nof `{ThetaGPU, Polaris, Perlmutter}`.\n\n2ez\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsaforem2%2Fllm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsaforem2%2Fllm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsaforem2%2Fllm/lists"}