https://github.com/LambdaLabsML/distributed-training-guide
Best practices & guides on how to write distributed pytorch training code
cluster cuda deepspeed distributed-training fsdp gpu gpu-cluster kuberentes lambdalabs mpi nccl pytorch sharding slurm
Last synced: 3 months ago
- Host: GitHub
- URL: https://github.com/LambdaLabsML/distributed-training-guide
- Owner: LambdaLabsML
- License: mit
- Created: 2024-07-31T16:11:23.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2024-10-18T16:11:15.000Z (3 months ago)
- Last Synced: 2024-10-21T13:25:48.396Z (3 months ago)
- Topics: cluster, cuda, deepspeed, distributed-training, fsdp, gpu, gpu-cluster, kuberentes, lambdalabs, mpi, nccl, pytorch, sharding, slurm
- Language: Python
- Homepage:
- Size: 331 KB
- Stars: 190
- Watchers: 4
- Forks: 12
- Open Issues: 8
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-LLM-resourses - Distributed Training Guide
# Distributed Training Guide
This guide aims to be a comprehensive reference on best practices for distributed training, diagnosing errors, and fully utilizing all available resources.
## Questions this guide answers:
- How do I update a single-GPU training/fine-tuning script to run on multiple GPUs or multiple nodes?
- How do I diagnose hangs and errors that happen during training?
- My model/optimizer is too big for a single GPU. How do I train/fine-tune it on my cluster?
- How do I schedule/launch training on a cluster?
- How do I scale my hyperparameters when increasing the number of workers?
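As a rough illustration of the last question, one widely used heuristic is to keep the per-worker batch size fixed and scale the learning rate with the growth in global batch size, either linearly or by the square root. This is an assumption for illustration, not necessarily the recommendation made in the chapters of this guide, and the names `scale_learning_rate`, `base_lr`, and `base_world_size` are hypothetical.

```python
import math


def scale_learning_rate(
    base_lr: float,
    base_world_size: int,
    world_size: int,
    rule: str = "linear",
) -> float:
    """Scale a learning rate tuned for `base_world_size` workers to `world_size` workers.

    Assumes the per-worker batch size is unchanged, so the global batch
    size grows by the factor world_size / base_world_size.
    """
    if base_world_size <= 0 or world_size <= 0:
        raise ValueError("world sizes must be positive")

    factor = world_size / base_world_size
    if rule == "linear":
        # Linear scaling heuristic: learning rate grows with the global batch size.
        return base_lr * factor
    if rule == "sqrt":
        # Square-root scaling heuristic: a more conservative alternative.
        return base_lr * math.sqrt(factor)
    raise ValueError(f"unknown scaling rule: {rule!r}")
```

Any scaled value is only a starting point; it usually still needs to be validated with a short training run.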
Best practices for logging (stdout/stderr and wandb) are also included, as logging is essential for diagnosing and debugging training runs on a cluster.
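A minimal sketch of rank-aware stdout logging, assuming the script is launched with `torchrun`, which exports a `RANK` environment variable for each worker. The helper names (`get_rank`, `setup_logging`) and the logger name `train` are hypothetical; the chapters may set this up differently.

```python
import logging
import os
import sys


def get_rank() -> int:
    # torchrun sets RANK for every worker; fall back to 0 for single-process runs.
    return int(os.environ.get("RANK", "0"))


def setup_logging(level: int = logging.INFO) -> logging.Logger:
    """Return a logger that prefixes every line with the worker's rank."""
    rank = get_rank()
    logger = logging.getLogger("train")  # hypothetical logger name
    logger.setLevel(level)
    logger.handlers.clear()  # avoid duplicate handlers if called twice

    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(
        logging.Formatter(f"[%(asctime)s] [rank={rank}] %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    logger.propagate = False  # keep the root logger from printing the line again
    return logger
```

Tagging each line with its rank makes it easier to tell which worker produced a message when the output of many processes is interleaved in one file.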
## How to read
This guide is organized into sequential chapters, each containing a `README.md` and a `train_llm.py` script. Each README discusses the changes introduced in that chapter and explains them in more detail.
**Each training script trains a causal language model (e.g., a GPT-style model).**
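For readers new to the term, a causal language model is trained to predict each token from the tokens before it. The sketch below shows how a single sequence of token IDs can be split into inputs and next-token labels. The function name `make_causal_lm_example` is hypothetical, and the guide's scripts may rely on a library that performs this shift internally.

```python
def make_causal_lm_example(token_ids: list[int]) -> tuple[list[int], list[int]]:
    """Split one token sequence into (inputs, labels) for next-token prediction."""
    if len(token_ids) < 2:
        raise ValueError("need at least two tokens to form an input/label pair")

    inputs = token_ids[:-1]  # the model sees every token except the last
    labels = token_ids[1:]   # the target at position i is the token at i + 1
    return inputs, labels
```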
## Set up
### Clone this repo
```bash
git clone https://github.com/LambdaLabsML/distributed-training-guide.git
```
### Virtual Environment
```bash
cd distributed-training-guide
python3 -m venv venv
source venv/bin/activate
python -m pip install -U pip
pip install -U setuptools wheel
pip install -r requirements.txt
```
### wandb
This tutorial uses `wandb` as an experiment tracker.
```bash
wandb login
```
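In a multi-process run, one common pattern is to create the wandb run from a single process so that each experiment appears once. The sketch below assumes that pattern and a `torchrun`-style `RANK` environment variable; `is_main_process`, `init_tracking`, and the project name are hypothetical, and the chapters of this guide may organize wandb runs differently.

```python
import os


def is_main_process() -> bool:
    # torchrun sets RANK for every worker; a missing RANK means a single-process run.
    return int(os.environ.get("RANK", "0")) == 0


def init_tracking(project: str, config: dict):
    """Start a wandb run on rank 0 only; other ranks get None back."""
    if not is_main_process():
        return None

    import wandb  # imported lazily so non-zero ranks never touch wandb

    return wandb.init(project=project, config=config)
```

A training loop could then guard its metric logging with `if run is not None:` where `run = init_tracking("my-project", {"lr": 1e-4})`.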
🦄 Other exciting ML projects at Lambda: ML Times, Text2Video, GPU Benchmark.