# Fine-tuning Transformers on Ray Train

This repository contains a modified version of the [`deepspeed_with_config_support.py` script](https://github.com/huggingface/accelerate/blob/main/examples/by_feature/deepspeed_with_config_support.py) that leverages Ray Train for easy HF Accelerate + DeepSpeed Transformers fine-tuning on a distributed Ray cluster.
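
For orientation, below is a minimal sketch of how such a script can be driven through Ray Train's `TorchTrainer`; the function name `train_func`, the worker count, and the config keys are illustrative assumptions, not the script's exact entry points.

```python
# Hypothetical sketch: wrapping an HF Accelerate + DeepSpeed training loop
# in Ray Train. `train_func` and its config keys are assumptions for
# illustration only.
from ray.air.config import ScalingConfig
from ray.train.torch import TorchTrainer


def train_func(config: dict):
    # Inside each Ray worker, the Accelerate + DeepSpeed training loop from
    # deepspeed_with_config_support.py runs as usual; Ray Train sets up the
    # torch.distributed process group the loop relies on.
    ...


trainer = TorchTrainer(
    train_loop_per_worker=train_func,
    train_loop_config={"model_name_or_path": "facebook/opt-125m"},
    scaling_config=ScalingConfig(num_workers=16, use_gpu=True),
)
result = trainer.fit()
```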

## Instructions

First, run `bash mount_nvme.sh` to mount the NVMe drives on the GPU nodes in the cluster. This only needs to be done once.

Next, run `bash example.sh` to fine-tune the `facebook/opt-125m` model on `alllines.txt` (the text of all Shakespeare plays). Models up to `opt-66b` have been tested, but larger models may require GPU nodes with more RAM (for DeepSpeed offload and model saving) and/or a lower batch size to avoid CUDA OOMs. The model checkpoints will be uploaded to S3.
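
As a hedged sketch, the S3 upload can be configured through Ray Train's `RunConfig`; the bucket path and run name below are placeholders, and the exact argument depends on the Ray release in use.

```python
# Sketch of pointing Ray Train's results and checkpoints at S3.
# The bucket path and run name are placeholders, not values from example.sh.
from ray.air.config import RunConfig

run_config = RunConfig(
    name="opt-shakespeare-finetune",            # placeholder run name
    storage_path="s3://<your-bucket>/results",  # recent Ray releases
)
# On older Ray releases the same effect is achieved with
# RunConfig(sync_config=SyncConfig(upload_dir="s3://<your-bucket>/results")).
```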

WARNING: Ray Train checkpointing can cause OOMs with very large checkpoints (from models with >20B parameters). We are working on a fix, but for now make sure that no checkpoint is reported in `session.report`. Furthermore, DeepSpeed requires a substantial amount of RAM to save the final checkpoint, as it gathers the weights from all partitions onto a single node. It is recommended to use large instances, e.g. `g5.48xlarge`, when training `opt-66b` or similarly sized models.
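
A minimal sketch of reporting metrics without attaching a checkpoint, in line with the workaround above (the metric names are illustrative):

```python
# Report training metrics only, with no checkpoint attached, to avoid the
# large-checkpoint OOM described above. Metric names are illustrative.
from ray.air import session


def report_progress(loss: float, epoch: int) -> None:
    # Passing only a metrics dict (no `checkpoint=` argument) keeps Ray Train
    # from gathering and persisting the full model state on every report.
    session.report({"loss": loss, "epoch": epoch})
```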