https://github.com/princetonuniversity/multi_gpu_training
- Host: GitHub
- URL: https://github.com/princetonuniversity/multi_gpu_training
- Owner: PrincetonUniversity
- License: MIT
- Created: 2022-01-20T13:22:03.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2025-03-06T14:08:24.000Z (7 months ago)
- Last Synced: 2025-04-12T08:22:08.296Z (6 months ago)
- Language: Python
- Size: 4.1 MB
- Stars: 315
- Watchers: 2
- Forks: 46
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Multi-GPU Training with PyTorch: Data and Model Parallelism
### About
The material in this repo demonstrates multi-GPU training using PyTorch. Part 1 covers how to optimize single-GPU training. The code changes needed to enable multi-GPU training with the data-parallel and model-parallel approaches are then shown. This workshop aims to prepare researchers to use the new H100 GPU nodes that are part of Princeton Language and Intelligence.
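The hands-on material in this repo covers these approaches in full; purely as a rough illustration of the data-parallel case, the sketch below shows the common PyTorch `DistributedDataParallel` pattern with a toy model (the model, tensor sizes, and `torchrun` launch line are illustrative assumptions, not code from this repo):

```python
# Minimal DistributedDataParallel (DDP) sketch -- illustrative only.
# Hypothetical launch: torchrun --nproc_per_node=4 ddp_sketch.py
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy model standing in for real training code (an assumption)
    model = nn.Linear(32, 4).to(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])

    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()

    for step in range(10):
        # Each rank trains on its own shard of the data; DDP averages
        # the gradients across ranks during backward()
        inputs = torch.randn(64, 32, device=local_rank)
        targets = torch.randn(64, 4, device=local_rank)
        optimizer.zero_grad()
        loss = loss_fn(ddp_model(inputs), targets)
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

In the model-parallel approach, by contrast, the layers of a single model are split across GPUs rather than replicating the whole model on each one.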
### Setup
Make sure you can run Python on Adroit:
```bash
$ ssh <YourNetID>@adroit.princeton.edu  # VPN required if off-campus
$ git clone https://github.com/PrincetonUniversity/multi_gpu_training.git
$ cd multi_gpu_training
$ module load anaconda3/2023.9
(base) $ python --version
Python 3.11.5
```

### Getting Help
If you encounter any difficulties with the material in this guide, please send an email to cses@princeton.edu or attend a help session.
### Authorship
This guide was created by Mengzhou Xia, Alexander Wettig and Jonathan Halverson. Members of Princeton Research Computing made contributions to this material.