https://github.com/huggingface/picotron_tutorial
- Host: GitHub
- URL: https://github.com/huggingface/picotron_tutorial
- Owner: huggingface
- Created: 2024-11-15T12:36:09.000Z (8 months ago)
- Default Branch: master
- Last Pushed: 2024-12-20T12:57:11.000Z (6 months ago)
- Last Synced: 2024-12-25T02:06:33.520Z (6 months ago)
- Language: Python
- Size: 340 KB
- Stars: 25
- Watchers: 4
- Forks: 4
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Picotron tutorial
A step-by-step tutorial on how to build the [Picotron](https://github.com/huggingface/picotron) distributed training framework from scratch 🔥
## Videos
> More to come. Full playlist [here](https://www.youtube.com/playlist?list=PL-_armZiJvAnhcRr6yTJ0__f3Oi-LLi9S) 🎬
- 🎬 [[Picotron tutorial] Part 1: Model, Process Group Manager, Dataloader](https://youtu.be/u2VSwDDpaBM)
## Setup
```
conda create -n env-picotron-tutorial python=3.10 -y
conda activate env-picotron-tutorial
pip install -e .
```

## Sanity check
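The multi-GPU commands below assume enough visible devices for the requested `--nproc_per_node`. A minimal pre-flight check using only plain PyTorch (not part of the tutorial code, just a suggested sanity step):

```python
import torch

# The parallel runs below use --nproc_per_node 4 or 8, so at least that many
# GPUs must be visible; adjust `required` to the command you plan to launch.
required = 8  # largest --nproc_per_node used in the sanity-check commands
available = torch.cuda.device_count()
print(f"visible GPUs: {available}")
assert available >= required, f"need {required} GPUs, found only {available}"
```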
- Convergence testing of a 1B-parameter Llama model on 4,096,000 tokens to check that the loss curves match across configurations.

```bash
# Baseline
cd step3_dataloader/
torchrun --nproc_per_node 1 train.py --micro_batch_size 4 --gradient_accumulation_steps 8 --seq_len 1024 --max_tokens 4096000 --num_proc 16 --model_name TinyLlama/TinyLlama_v1.1 --num_hidden_layers 22 --num_attention_heads 32 --num_key_value_heads 4 --run_name baseline_1B --use_wandb

# Tensor Parallel
cd step4_tensor_parallel/
torchrun --nproc_per_node 4 train.py --tp_size 4 --micro_batch_size 4 --gradient_accumulation_steps 8 --seq_len 1024 --max_tokens 4096000 --num_proc 16 --model_name TinyLlama/TinyLlama_v1.1 --num_hidden_layers 22 --num_attention_heads 32 --num_key_value_heads 4 --run_name tp_1B --use_wandb

# Data Parallel
cd step6_data_parallel_bucket/
torchrun --nproc_per_node 4 train.py --dp_size 4 --micro_batch_size 1 --gradient_accumulation_steps 8 --seq_len 1024 --max_tokens 4096000 --num_proc 16 --model_name TinyLlama/TinyLlama_v1.1 --num_hidden_layers 22 --num_attention_heads 32 --num_key_value_heads 4 --run_name dp_bucket_1B --use_wandb

# Pipeline Parallel
cd step8_pipeline_parallel_1f1b/
torchrun --nproc_per_node 4 train.py --pp_size 4 --pp_engine 1f1b --micro_batch_size 4 --gradient_accumulation_steps 8 --seq_len 1024 --max_tokens 4096000 --num_proc 16 --model_name TinyLlama/TinyLlama_v1.1 --num_hidden_layers 22 --num_attention_heads 32 --num_key_value_heads 4 --run_name pp_1f1b_1B --use_wandb

# 3D parallelism (Tensor + Data + Pipeline parallel)
torchrun --nproc_per_node 8 train.py --tp_size 2 --pp_size 2 --pp_engine 1f1b --dp_size 2 --micro_batch_size 2 --gradient_accumulation_steps 8 --seq_len 1024 --max_tokens 4096000 --num_proc 16 --model_name TinyLlama/TinyLlama_v1.1 --num_hidden_layers 22 --num_attention_heads 32 --num_key_value_heads 4 --run_name 3D_parallelism_1B --use_wandb
```
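A rough cross-check on why these runs are comparable: assuming the usual convention that one optimizer step processes `dp_size × micro_batch_size × gradient_accumulation_steps × seq_len` tokens (an assumption about the accounting, not taken from the picotron source), every configuration above covers 32,768 tokens per step and therefore hits the 4,096,000-token budget after 125 steps:

```python
# Back-of-the-envelope step count for the sanity-check runs above.
# Assumes tokens per optimizer step = dp * micro_batch_size * grad_accum * seq_len.
MAX_TOKENS = 4_096_000
SEQ_LEN = 1024

configs = {
    "baseline_1B":       dict(dp=1, micro_bs=4, grad_accum=8),
    "tp_1B":             dict(dp=1, micro_bs=4, grad_accum=8),
    "dp_bucket_1B":      dict(dp=4, micro_bs=1, grad_accum=8),
    "pp_1f1b_1B":        dict(dp=1, micro_bs=4, grad_accum=8),
    "3D_parallelism_1B": dict(dp=2, micro_bs=2, grad_accum=8),
}

for name, c in configs.items():
    tokens_per_step = c["dp"] * c["micro_bs"] * c["grad_accum"] * SEQ_LEN
    steps = MAX_TOKENS // tokens_per_step
    print(f"{name}: {tokens_per_step} tokens/step -> {steps} optimizer steps")
```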