https://github.com/princetonuniversity/multi_gpu_training
- Host: GitHub
- URL: https://github.com/princetonuniversity/multi_gpu_training
- Owner: PrincetonUniversity
- License: MIT
- Created: 2022-01-20T13:22:03.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2025-03-06T14:08:24.000Z (7 months ago)
- Last Synced: 2025-04-12T08:22:08.296Z (6 months ago)
- Language: Python
- Size: 4.1 MB
- Stars: 315
- Watchers: 2
- Forks: 46
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Multi-GPU Training with PyTorch: Data and Model Parallelism
### About
The material in this repo demonstrates multi-GPU training using PyTorch. Part 1 covers how to optimize single-GPU training. The code changes needed to enable multi-GPU training with the data-parallel and model-parallel approaches are then shown. This workshop aims to prepare researchers to use the new H100 GPU nodes that are part of Princeton Language and Intelligence.
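The hands-on material in this repo covers these approaches in full; purely as a rough illustration of the data-parallel case, the sketch below shows the common PyTorch `DistributedDataParallel` pattern with a toy model (the model, tensor sizes, and `torchrun` launch line are illustrative assumptions, not code from this repo):

```python
# Minimal DistributedDataParallel (DDP) sketch -- illustrative only.
# Hypothetical launch: torchrun --nproc_per_node=4 ddp_sketch.py
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy model standing in for real training code (an assumption)
    model = nn.Linear(32, 4).to(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])

    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()

    for step in range(10):
        # Each rank trains on its own shard of the data; DDP averages
        # the gradients across ranks during backward()
        inputs = torch.randn(64, 32, device=local_rank)
        targets = torch.randn(64, 4, device=local_rank)
        optimizer.zero_grad()
        loss = loss_fn(ddp_model(inputs), targets)
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

In the model-parallel approach, by contrast, the layers of a single model are split across GPUs rather than replicating the whole model on each one.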
### Setup
Make sure you can run Python on Adroit:
```bash
$ ssh <YourNetID>@adroit.princeton.edu  # VPN required if off-campus
$ git clone https://github.com/PrincetonUniversity/multi_gpu_training.git
$ cd multi_gpu_training
$ module load anaconda3/2023.9
(base) $ python --version
Python 3.11.5
```

### Getting Help
If you encounter any difficulties with the material in this guide, please send an email to cses@princeton.edu or attend a help session.
### Authorship
This guide was created by Mengzhou Xia, Alexander Wettig and Jonathan Halverson. Members of Princeton Research Computing made contributions to this material.