https://github.com/dudeperf3ct/llm-parallelism-pytorch
Implementing various LLM training parallelism strategies for fun!
- Host: GitHub
- URL: https://github.com/dudeperf3ct/llm-parallelism-pytorch
- Owner: dudeperf3ct
- License: MIT
- Created: 2025-12-01T11:12:45.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2025-12-28T19:20:09.000Z (2 months ago)
- Last Synced: 2025-12-31T10:43:56.793Z (about 2 months ago)
- Topics: data-parallelism, llm-parallelism, pytorch
- Language: Python
- Size: 165 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Distributed Training Experiments
Implement and compare various data parallelism strategies on the Yelp Review Full dataset using `HuggingFaceTB/SmolLM2-360M-Instruct`.
Data parallelism write-up: https://dudeperf3ct.github.io/posts/implement_data_parallelism/
## Requirements
- Python 3.12
- [uv](https://docs.astral.sh/uv/) for dependency management
- Multiple GPUs

I used a 2 x Nvidia L40 (24 GB) instance on the RunPod platform to run these experiments; it cost around $2/hour as of December 2025.
## Setup
```bash
# Install dependencies into .venv
uv sync
```
## Run (multi-GPU only)
To run all implemented strategies in one go:
```bash
./run_experiment.sh 4
```
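The script's contents aren't shown here, but assuming its positional argument is the GPU count, a loop over the implemented strategies would look roughly like this (echoing each command rather than executing it, so the sketch can be dry-run without GPUs):

```shell
#!/usr/bin/env bash
# Hypothetical sketch of run_experiment.sh: the positional argument
# (default 4) is assumed to be the number of GPUs to use.
NUM_GPUS=${1:-4}
for strategy in simple_ddp simple_ddp_ga simple_ddp_hook simple_ddp_async bucket_ddp_async pytorch_ddp; do
  # echo instead of executing, so this is safe to run anywhere
  echo "torchrun --standalone --nproc_per_node=${NUM_GPUS} main.py --ddp-choice ${strategy}"
done
```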
The following sections describe how to run each strategy individually. The `torchrun` CLI sets up the distributed environment variables for you.
```bash
# Choose how many GPUs to use on the node
NUM_GPUS=4
torchrun --standalone --nproc_per_node=$NUM_GPUS main.py --ddp-choice simple_ddp
```
Notes:
- `GLOBAL_BATCH_SIZE` (8) is split evenly across ranks; adjust it if you change `NUM_GPUS` or use GPUs with more memory.
- Profiler traces land under `profile/<ddp-choice>/rank_<rank>/`.
- Logs print only on rank 0.
- You can change `--ddp-choice` to try different strategies: `simple_ddp`, `simple_ddp_ga`, `simple_ddp_hook`, `simple_ddp_async`, `bucket_ddp_async`, `pytorch_ddp`.
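Two ideas behind the notes above can be illustrated in plain Python (hypothetical helper names; lists stand in for `torch.distributed` tensors and collectives): each rank processes an equal slice of the global batch, and a naive DDP step averages gradients across ranks after backward via an all-reduce(SUM) followed by a divide.

```python
def per_rank_batch_size(global_batch_size: int, world_size: int) -> int:
    """Each rank gets an equal slice of the global batch."""
    assert global_batch_size % world_size == 0, "batch must divide evenly across ranks"
    return global_batch_size // world_size

def average_gradients(per_rank_grads: list[list[float]]) -> list[float]:
    """Simulate all_reduce(SUM) + divide by world_size.

    In real DDP code each rank holds one of these gradient lists, and
    dist.all_reduce leaves every rank with the same averaged values.
    """
    world_size = len(per_rank_grads)
    n_params = len(per_rank_grads[0])
    return [
        sum(grads[i] for grads in per_rank_grads) / world_size
        for i in range(n_params)
    ]
```

For example, with the defaults above, `per_rank_batch_size(8, 4)` gives each of 4 ranks a micro-batch of 2.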
## Trace Analysis
Analyze PyTorch profiler traces with the [Holistic Trace Analysis](https://github.com/facebookresearch/HolisticTraceAnalysis) (HTA) library. The script generates a single HTML dashboard and a compact CSV summary.
```bash
python scripts/analyze_traces.py --trace-dir profile/simple_ddp --select latest
```
By default, the output directory is inferred by replacing `profile/` with `reports/`, so the example above writes to `reports/simple_ddp/summary.html` and `reports/simple_ddp/summary.csv`.
Optional flags:
- `--select all` to analyze each trace window and save under `reports/<ddp-choice>/run_<...>_<...>/`.
- `--enable-multiprocessing` to parse traces with multiprocessing.
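The analysis script's internals aren't shown here, but HTA's public entry point suggests it does something along these lines (a sketch only; it assumes per-rank JSON traces already exist under the given directory and is not runnable without them):

```python
from hta.trace_analysis import TraceAnalysis

# Point HTA at one strategy's trace directory (one JSON trace per rank).
analyzer = TraceAnalysis(trace_dir="profile/simple_ddp")

# A few of HTA's built-in analyses, which a summary script could
# collect into an HTML dashboard and a CSV:
temporal = analyzer.get_temporal_breakdown()   # compute vs. non-compute vs. idle time
kernels = analyzer.get_gpu_kernel_breakdown()  # time spent per GPU kernel category
overlap = analyzer.get_comm_comp_overlap()     # communication/computation overlap per rank
```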
> [!NOTE]
> Each experiment produces a per-rank trace file that can be viewed in the [Perfetto UI](https://ui.perfetto.dev/). This gives a detailed breakdown of CUDA streams and CPU threads, showing the compute time for every operation running on the GPU and CPU.