https://github.com/dudeperf3ct/llm-parallelism-pytorch
Implementing various LLM training parallelism strategies for fun!
- Host: GitHub
- URL: https://github.com/dudeperf3ct/llm-parallelism-pytorch
- Owner: dudeperf3ct
- License: MIT
- Created: 2025-12-01T11:12:45.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2025-12-28T19:20:09.000Z (2 months ago)
- Last Synced: 2025-12-31T10:43:56.793Z (about 2 months ago)
- Topics: data-parallelism, llm-parallelism, pytorch
- Language: Python
- Size: 165 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Distributed Training Experiments
Implement and compare various data parallelism strategies on the Yelp Review Full dataset using `HuggingFaceTB/SmolLM2-360M-Instruct`.
Data parallelism write-up: https://dudeperf3ct.github.io/posts/implement_data_parallelism/
## Requirements
- Python 3.12
- [uv](https://docs.astral.sh/uv/) for dependency management
- Multiple GPUs

I used a 2 x Nvidia L40 (24 GB) instance on the RunPod platform to run these experiments; it cost around $2/hour as of December 2025.
## Setup
```bash
# Install dependencies into .venv
uv sync
```
## Run (multi-GPU only)
To run all implemented strategies in one go:
```bash
./run_experiment.sh 4
```
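The script's contents aren't shown here, but assuming its positional argument is the GPU count, a loop over the implemented strategies would look roughly like this (echoing each command rather than executing it, so the sketch can be dry-run without GPUs):

```shell
#!/usr/bin/env bash
# Hypothetical sketch of run_experiment.sh: the positional argument
# (default 4) is assumed to be the number of GPUs to use.
NUM_GPUS=${1:-4}
for strategy in simple_ddp simple_ddp_ga simple_ddp_hook simple_ddp_async bucket_ddp_async pytorch_ddp; do
  # echo instead of executing, so this is safe to run anywhere
  echo "torchrun --standalone --nproc_per_node=${NUM_GPUS} main.py --ddp-choice ${strategy}"
done
```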
The following sections describe how to run each strategy individually. The `torchrun` CLI sets up the distributed environment variables for you.
```bash
# Choose how many GPUs to use on the node
NUM_GPUS=4
torchrun --standalone --nproc_per_node=$NUM_GPUS main.py --ddp-choice simple_ddp
```
Notes:
- `GLOBAL_BATCH_SIZE` (8) is split evenly across ranks; adjust it if you change `NUM_GPUS` or use GPUs with more memory.
- Profiler traces land under `profile/<ddp-choice>/rank_<rank>/`.
- Logs print only on rank 0.
- You can change `--ddp-choice` to try different strategies: `simple_ddp`, `simple_ddp_ga`, `simple_ddp_hook`, `simple_ddp_async`, `bucket_ddp_async`, `pytorch_ddp`.
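Two ideas behind the notes above can be illustrated in plain Python (hypothetical helper names; lists stand in for `torch.distributed` tensors and collectives): each rank processes an equal slice of the global batch, and a naive DDP step averages gradients across ranks after backward via an all-reduce(SUM) followed by a divide.

```python
def per_rank_batch_size(global_batch_size: int, world_size: int) -> int:
    """Each rank gets an equal slice of the global batch."""
    assert global_batch_size % world_size == 0, "batch must divide evenly across ranks"
    return global_batch_size // world_size

def average_gradients(per_rank_grads: list[list[float]]) -> list[float]:
    """Simulate all_reduce(SUM) + divide by world_size.

    In real DDP code each rank holds one of these gradient lists, and
    dist.all_reduce leaves every rank with the same averaged values.
    """
    world_size = len(per_rank_grads)
    n_params = len(per_rank_grads[0])
    return [
        sum(grads[i] for grads in per_rank_grads) / world_size
        for i in range(n_params)
    ]
```

For example, with the defaults above, `per_rank_batch_size(8, 4)` gives each of 4 ranks a micro-batch of 2.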
## Trace Analysis
Analyze PyTorch profiler traces with the [Holistic Trace Analysis](https://github.com/facebookresearch/HolisticTraceAnalysis) (HTA) library. The script generates a single HTML dashboard and a compact CSV summary.
```bash
python scripts/analyze_traces.py --trace-dir profile/simple_ddp --select latest
```
By default, the output directory is inferred by replacing `profile/` with `reports/`, so the example above writes to `reports/simple_ddp/summary.html` and `reports/simple_ddp/summary.csv`.
Optional flags:
- `--select all` to analyze each trace window and save under `reports/<ddp-choice>/run_<...>_<...>/`.
- `--enable-multiprocessing` to parse traces with multiprocessing.
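The analysis script's internals aren't shown here, but HTA's public entry point suggests it does something along these lines (a sketch only; it assumes per-rank JSON traces already exist under the given directory and is not runnable without them):

```python
from hta.trace_analysis import TraceAnalysis

# Point HTA at one strategy's trace directory (one JSON trace per rank).
analyzer = TraceAnalysis(trace_dir="profile/simple_ddp")

# A few of HTA's built-in analyses, which a summary script could
# collect into an HTML dashboard and a CSV:
temporal = analyzer.get_temporal_breakdown()   # compute vs. non-compute vs. idle time
kernels = analyzer.get_gpu_kernel_breakdown()  # time spent per GPU kernel category
overlap = analyzer.get_comm_comp_overlap()     # communication/computation overlap per rank
```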
> [!NOTE]
> Each experiment produces a per-rank trace file that can be viewed in the [Perfetto UI](https://ui.perfetto.dev/). This gives a detailed breakdown of CUDA streams and CPU threads, showing the compute time for every operation running on the GPU and CPU.