https://github.com/lambdalabsml/distributed-training-guide

Best practices & guides on how to write distributed pytorch training code
https://github.com/lambdalabsml/distributed-training-guide

cluster cuda deepspeed distributed-training fsdp gpu gpu-cluster kuberentes lambdalabs mpi nccl pytorch sharding slurm

Last synced: about 2 months ago
JSON representation

Best practices & guides on how to write distributed pytorch training code

Host: GitHub
URL: https://github.com/lambdalabsml/distributed-training-guide
Owner: LambdaLabsML
License: mit
Created: 2024-07-31T16:11:23.000Z (11 months ago)
Default Branch: main
Last Pushed: 2025-02-24T05:42:35.000Z (5 months ago)
Last Synced: 2025-05-16T06:07:13.029Z (about 2 months ago)
Topics: cluster, cuda, deepspeed, distributed-training, fsdp, gpu, gpu-cluster, kuberentes, lambdalabs, mpi, nccl, pytorch, sharding, slurm
Language: Python
Homepage:
Size: 429 KB
Stars: 420
Watchers: 6
Forks: 33
Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

# Distributed Training Guide

[Neurips 2024 presentation slides here](https://docs.google.com/presentation/d/1ANMmkOGaruYKTvhnsAbZgI9GrdMliNvibWGuNYw6HX8/edit?usp=sharing)

Ever wondered how to train a large neural network across a giant cluster? Look no further!

This is a comprehensive guide on best practices for distributed training, diagnosing errors, and fully utilizing all resources available. It is organized into sequential chapters, each with a `README.md` and a `train_llm.py` script in them. The readme will discuss both the high level concepts of distributed training, and the code changes introduced in that chapter.

The guide is written entirely in very minimal standard pytorch, using `transformers` and `datasets` for models and data, respectively. No other library is used for distributed code - the distributed stuff is entirely in pytorch.

1. [Chapter 1](./01-single-gpu/) - A standard Causal LLM training script that runs on a **single GPU**.
2. [Chapter 2](./02-distributed-data-parallel/) - Upgrades the training script to support **multiple GPUs and to use DDP**.
3. [Chapter 3](./03-job-launchers/) - Covers how to **launch training jobs** across clusters with multiple nodes.
4. [Chapter 4](./04-fully-sharded-data-parallel/) - Upgrades the training script to **use FSDP** instead of DDP for more optimal memory usage.
5. [Chapter 5](./05-training-llama-405b/) - Upgrades the training script to **train Llama-405b**.
6. [Chapter 6](./06-tensor-parallel/) - Upgrades our single GPU training script to support **tensor parallelism**.
7. [Chapter 7](./06-2d-parallel/) - Upgrades our TP training script to use **2d parallelism (FSDP + TP)**.
8. [Alternative Frameworks](./alternative-frameworks/) - Covers different frameworks that all work with pytorch under the hood.
9. [Diagnosing Errors](./diagnosing-errors/) - Best practices and how tos for **quickly diagnosing errors** in your cluster.
10. [Related Topics](./related-topics/) - Topics that you should be aware of when distributed training.

Questions this guide answers:

- How do I update a single gpu training/fine tuning script to run on multiple GPUs or multiple nodes?
- How do I diagnose hanging/errors that happen during training?
- My model/optimizer is too big for a single gpu - how do I train/fine tune it on my cluster?
- How do I schedule/launch training on a cluster?
- How do I scale my hyperparameters when increasing the number of workers?

Best practices for logging stdout/stderr and wandb are also included, as logging is vitally important in diagnosing/debugging training runs on a cluster.

Each of the training scripts is aimed at training a causal language model (i.e. gpt/llama).

## Set up

### Clone this repo

```bash
git clone https://github.com/LambdaLabsML/distributed-training-guide.git
```

### Virtual Environment

```bash
cd distributed-training-guide
python3 -m venv venv
source venv/bin/activate
python -m pip install -U pip
pip install -U setuptools wheel
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
```

### wandb

This tutorial uses `wandb` as an experiment tracker.

```bash
wandb login
```

🦄 Other exciting ML projects at Lambda: ML Times, Text2Video, GPU Benchmark.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/lambdalabsml/distributed-training-guide

Awesome Lists containing this project

README