Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/imoneoi/multipack_sampler
Multipack distributed sampler for fast padding-free training of LLMs
- Host: GitHub
- URL: https://github.com/imoneoi/multipack_sampler
- Owner: imoneoi
- License: mit
- Created: 2023-07-06T11:12:29.000Z (over 1 year ago)
- Default Branch: master
- Last Pushed: 2023-07-08T13:16:47.000Z (over 1 year ago)
- Last Synced: 2024-07-18T20:41:37.065Z (4 months ago)
- Language: Python
- Homepage:
- Size: 13.7 KB
- Stars: 159
- Watchers: 3
- Forks: 12
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Multipack Sampler
The Multipack sampler is designed for padding-free distributed training of large language models. It packs variable-length sequences into fixed-capacity batches using an approximate solution to the identical-machines scheduling problem, maximizing batch-processing efficiency. On the OpenChat V1 training set it reaches >99% of theoretical efficiency, whereas the interleaved sampler reaches only ~75%.
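For intuition, this kind of length-aware packing can be approximated with a first-fit-decreasing (FFD) bin-packing heuristic, treating each batch as a "bin" with a fixed token capacity. The sketch below is illustrative only and not necessarily the exact algorithm this repository implements; `pack_ffd` is a hypothetical name:

```python
import numpy as np

# Illustrative FFD packing: NOT necessarily the repository's exact algorithm.
def pack_ffd(lengths: np.ndarray, capacity: int) -> list[list[int]]:
    """Pack sequence indices into bins holding at most `capacity` total tokens."""
    order = np.argsort(lengths)[::-1]   # longest sequences first
    bins: list[list[int]] = []          # sequence indices per bin
    free: list[int] = []                # remaining capacity per bin
    for idx in order:
        n = int(lengths[idx])
        for b in range(len(bins)):      # place into the first bin with room
            if free[b] >= n:
                bins[b].append(int(idx))
                free[b] -= n
                break
        else:                           # nothing fits: open a new bin
            bins.append([int(idx)])
            free.append(capacity - n)
    return bins

# e.g. pack 1,000 random sequence lengths into 16 * 2048-token batches
lengths = np.random.randint(100, 2049, size=1000)
batches = pack_ffd(lengths, capacity=16 * 2048)
```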
## Benchmark
Please refer to `test_multipack.ipynb`
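The "Overall Efficiency" figure presumably measures the fraction of total batch capacity occupied by real tokens; that reading is an assumption, and the notebook is authoritative. A minimal sketch of such a computation, reusing the `batches` layout from the FFD example above:

```python
def overall_efficiency(batches: list[list[int]], lengths, capacity: int) -> float:
    """Fraction of total batch capacity filled with real tokens (assumed metric)."""
    used = sum(int(lengths[i]) for batch in batches for i in batch)
    return used / (len(batches) * capacity)
```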
```
OpenChat V1 (testdata.json)

Sampler Multipack:
Overall Efficiency: 0.9963896327548557

Sampler Interleaved:
Overall Efficiency: 0.756684939066569
```

## Usage
Compatible with PyTorch `DataLoader`
```python
# The import paths below are assumptions; adjust them to match your checkout
# of https://github.com/imoneoi/multipack_sampler.
import numpy as np
from torch.utils.data import DataLoader

from multipack_sampler import MultipackDistributedBatchSampler

batch_max_len = 16 * 2048  # batch size * max context length

# Token length of every sample in the dataset
lengths = np.array([len(tokens) for tokens in data])

sampler = MultipackDistributedBatchSampler(
    batch_max_length=batch_max_len,
    lengths=lengths,
    seed=0
)

dataloader = DataLoader(data, batch_sampler=sampler)
```
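Since the sampler is seeded, a multi-epoch distributed run should re-randomize each epoch's packing consistently across ranks. Assuming the sampler exposes a `set_epoch` method in the style of `torch.utils.data.DistributedSampler` (an assumption; check the repository source), the training loop would look like:

```python
# Sketch only: assumes sampler.set_epoch() exists, in the style of
# torch.utils.data.DistributedSampler; verify against the repository source.
for epoch in range(num_epochs):
    sampler.set_epoch(epoch)  # re-seed packing/shuffling for this epoch
    for batch in dataloader:
        ...                   # forward/backward on a packed, padding-free batch
```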
## License
MIT