https://github.com/vita-group/ligo
[ICLR 2023] "Learning to Grow Pretrained Models for Efficient Transformer Training" by Peihao Wang, Rameswar Panda, Lucas Torroba Hennigen, Philip Greengard, Leonid Karlinsky, Rogerio Feris, David Cox, Zhangyang Wang, Yoon Kim
- Host: GitHub
- URL: https://github.com/vita-group/ligo
- Owner: VITA-Group
- License: mit
- Created: 2023-02-21T03:16:18.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2024-02-26T17:40:13.000Z (10 months ago)
- Last Synced: 2024-02-26T18:54:42.191Z (10 months ago)
- Topics: efficient-training, large-language-models, model-reusing, transformers
- Language: Python
- Homepage: https://vita-group.github.io/LiGO/
- Size: 1.55 MB
- Stars: 64
- Watchers: 27
- Forks: 5
- Open Issues: 1
- Metadata Files:
  - Readme: README.md
  - License: LICENSE
README
# Learning to Grow Pretrained Models for Efficient Transformer Training
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT)
The official implementation of the ICLR 2023 paper [Learning to Grow Pretrained Models for Efficient Transformer Training](https://arxiv.org/abs/2303.00980).
[Peihao Wang](https://peihaowang.github.io/)<sup>1</sup>,
[Rameswar Panda](https://rpand002.github.io/)<sup>2</sup>,
[Lucas Torroba Hennigen](https://ltorroba.github.io/)<sup>4</sup>,
[Philip Greengard](http://www.columbia.edu/~pg2118/)<sup>3</sup>,
[Leonid Karlinsky](https://scholar.google.com/citations?user=WbO7tjYAAAAJ&hl=en)<sup>2</sup>,
[Rogerio Feris](http://rogerioferis.com/)<sup>2</sup>,
[David Cox](https://mitibmwatsonailab.mit.edu/people/david-cox/)<sup>2</sup>,
[Zhangyang (Atlas) Wang](https://vita-group.github.io/)<sup>1</sup>,
[Yoon Kim](https://people.csail.mit.edu/yoonkim/)<sup>4</sup>

<sup>1</sup>University of Texas at Austin, <sup>2</sup>MIT-IBM Watson Lab, <sup>3</sup>Columbia University, <sup>4</sup>MIT
![](./docs/images/pipeline.png)
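At a glance, LiGO learns a linear map from the parameters of a small pretrained transformer to an initialization for a larger one, factorized into width- and depth-growth operators that are themselves trained briefly before full training resumes. The toy PyTorch snippet below is only a sketch of the width-growth idea for a single weight matrix (names and shapes are illustrative, not the repo's implementation): the large matrix is built as `A @ W_small @ B.T` with small learnable factors `A` and `B`.

```python
# Toy sketch of a LiGO-style width-growth operator for one weight matrix.
# Not the repo's implementation; names and shapes are illustrative only.
import torch
import torch.nn as nn

class WidthGrow(nn.Module):
    """Learned linear expansion of one small-model weight matrix."""
    def __init__(self, d_in_small, d_out_small, d_in_large, d_out_large):
        super().__init__()
        # Learnable expansion factors applied to the small model's weights.
        self.A = nn.Parameter(torch.randn(d_out_large, d_out_small) / d_out_small ** 0.5)
        self.B = nn.Parameter(torch.randn(d_in_large, d_in_small) / d_in_small ** 0.5)

    def forward(self, w_small):             # w_small: (d_out_small, d_in_small)
        return self.A @ w_small @ self.B.T  # -> (d_out_large, d_in_large)

w_small = torch.randn(512, 512)             # e.g. one weight from a 6L-512H model
grow = WidthGrow(512, 512, 768, 768)
w_large_init = grow(w_small)                # initialization for a 12L-768H model
print(w_large_init.shape)                   # torch.Size([768, 768])
```

In the scripts below, `run_grow_distributed.py` trains such growth operators for a small number of steps, and `run_lm_distributed.py` then continues standard pretraining from the grown initialization.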
## Getting Started
### Dependency
To run our code, the following libraries are required:
```
torch
torchvision
transformers==4.21.0
tensorboardX

# For GLUE evaluation
sklearn

# Faiss supports fast indexing.
# The code has a torch-implemented GPU indexing, so do not worry if you could not install faiss.
faiss-gpu>=1.6.3

# Spacy is used in sentence segmentation, where the sentences are the input to the cross-modality matching model.
spacy

# A higher h5py version to support h5py.VirtualLayout
h5py>=2.10.0
```
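As a quick sanity check that your environment matches the pinned versions above, a minimal (illustrative) snippet:

```python
# Illustrative environment check: confirms the key libraries import and that
# transformers matches the version pinned above.
import h5py
import sklearn
import spacy
import torch
import transformers

print("torch", torch.__version__)
print("transformers", transformers.__version__)
print("h5py", h5py.__version__)
assert transformers.__version__ == "4.21.0", "this repo pins transformers==4.21.0"
```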
### Data Preparation

We re-use the data preparation pipeline provided by [Vokenization](https://github.com/airsplay/vokenization#vokenization-vokenization).
**1. Download and Pre-Process Pure-Language Data**
We provide scripts to get the English Wikipedia dataset.
The scripts to download and process the wiki data are modified from [XLM](https://github.com/facebookresearch/XLM). Note that the data processing pipelines for BERT and RoBERTa are different, as they use different tokenizers.
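As a small illustration of that difference (not part of the repo's pipeline; the checkpoint names below are only examples), BERT uses a WordPiece vocabulary built over word-level tokens, while RoBERTa's byte-level BPE encodes raw text including spaces, which is why it needs an untokenized dump:

```python
# Illustrative only: shows why BERT and RoBERTa need different wiki pipelines.
from transformers import AutoTokenizer

bert_tok = AutoTokenizer.from_pretrained("bert-base-cased")
roberta_tok = AutoTokenizer.from_pretrained("roberta-base")

sentence = "Learning to grow pretrained models."
print(bert_tok.tokenize(sentence))     # WordPiece pieces, e.g. ['Learning', 'to', 'grow', ...]
print(roberta_tok.tokenize(sentence))  # byte-level BPE pieces with leading 'Ġ' space markers
```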
To get data for training BERT, use the following command:
```
bash data/wiki/get_data_cased.bash en
```

RoBERTa requires an untokenized version of English Wikipedia, so please use the following command:
```
bash data/wiki/get_data_cased_untokenized.bash en
```

**2. Tokenize Language Data**
We next tokenize the language corpus. This will locally save three files per corpus: the tokenized text plus companion `.hdf5` and `.line` files.
Use the corresponding command to tokenize the dataset:

```
# For BERT
bash tokenization/tokenize_wiki_bert.bash

# For RoBERTa
bash tokenization/tokenize_wiki_roberta.bash
```
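Once tokenization has finished, you can sanity-check a generated `.hdf5` file with a quick h5py sketch (the path below is only a hypothetical example, not a file name the scripts necessarily produce):

```python
# Illustrative: list the datasets stored in a tokenized-corpus .hdf5 file.
import h5py

path = "data/wiki-cased/en.train.hdf5"  # hypothetical example path
with h5py.File(path, "r") as f:
    f.visit(print)                      # print every group/dataset name in the file
```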
## Usage

Please use the following commands to pretrain BERT or RoBERTa:
### Training BERT from Scratch
(6L, 512H) BERT
```
python run_lm_distributed.py --config configs/bert_wiki.txt --config_name configs/bert-6L-512H.json --output_dir --max_steps 400000 --warmup_steps 10000 --should_continue
```

(12L, 768H) BERT
```
python run_lm_distributed.py --config configs/bert_wiki.txt --config_name configs/bert-12L-768H.json --output_dir --max_steps 400000 --warmup_steps 10000 --should_continue
```

### Training BERT with LiGO
First train a LiGO operator using the following command:
```
python run_grow_distributed.py --config configs/bert_wiki.txt --config_name configs/bert-12L-768H.json --output_dir --tune_width --tune_depth --source_model_path --fuse_init_scheme stackbert_noisy rand --max_steps 100 --logging_steps 100 --ckpt_steps 100 --should_continue
```
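The `--fuse_init_scheme stackbert_noisy rand` arguments control how the LiGO depth- and width-growth operators are initialized before tuning; `stackbert` appears to reference a StackBERT-style (progressive stacking) initialization, which grows depth by duplicating the existing blocks. A toy sketch of plain stacking (not the repo's code; the state-dict keys are illustrative):

```python
# Toy sketch of StackBERT-style depth growth: each of the 6 small-model layers
# also initializes the corresponding layer in the upper half of a 12-layer model.
# State-dict keys are illustrative, not the repo's exact naming.
def stack_layers(small_state_dict, num_small_layers=6):
    large_state_dict = {}
    for key, tensor in small_state_dict.items():
        large_state_dict[key] = tensor
        if ".layer." in key:                        # e.g. "encoder.layer.3.attention.self.query.weight"
            prefix, rest = key.split(".layer.", 1)
            idx, tail = rest.split(".", 1)
            new_key = f"{prefix}.layer.{int(idx) + num_small_layers}.{tail}"
            large_state_dict[new_key] = tensor.clone()
    return large_state_dict
```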
Then use the pre-trained LiGO operator to grow the model:

```
python run_lm_distributed.py --config configs/bert_wiki.txt --config_name configs/bert-12L-768H.json --output_dir --grow_scheme ligo --source_model_path --pretrained_ligo_path --fuse_init_scheme stackbert_noisy rand --learning_rate 2e-4 --warmup_steps 0 --should_continue
```

### Training RoBERTa from Scratch
(6L, 512H) RoBERTa
```
python run_lm_distributed.py --config configs/roberta_wiki.txt --config_name configs/roberta-6L-512H.json --per_gpu_train_batch_size 64 --gradient_accumulation_steps 4 --learning_rate 2e-4 --output_dir --should_continue
```

(12L, 768H) RoBERTa
```
python run_lm_distributed.py --config configs/roberta_wiki.txt --config_name configs/roberta-12L-768H.json --per_gpu_train_batch_size 64 --gradient_accumulation_steps 4 --learning_rate 2e-4 --output_dir --should_continue
```

Note that the argument `--gradient_accumulation_steps 4` is necessary to guarantee that the effective batch size of RoBERTa is 4 times that of BERT. Alternatively, one can use 4 times the number of GPUs to achieve the same batch size.
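The arithmetic behind that note, as a minimal sketch (the GPU counts here are hypothetical examples):

```python
# Effective batch size per optimizer step = per-GPU batch * grad-accumulation steps * #GPUs.
# With the RoBERTa settings above (per-GPU batch 64, accumulation 4), each GPU contributes
# 4x the samples per update that it would with accumulation 1.
def effective_batch(per_gpu_batch, grad_accum_steps, num_gpus):
    return per_gpu_batch * grad_accum_steps * num_gpus

print(effective_batch(64, 4, 8))   # 2048 samples per update on 8 GPUs with accumulation
print(effective_batch(64, 1, 32))  # the same 2048 reached with 4x the GPUs instead
```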
### Training RoBERTa with LiGO
```
# Train LiGO
python run_grow_distributed.py --config configs/roberta_wiki.txt --config_name configs/roberta-12L-768H.json --per_gpu_train_batch_size 64 --gradient_accumulation_steps 4 --learning_rate 2e-4 --output_dir --tune_width --tune_depth --source_model_path --fuse_init_scheme stackbert_noisy rand --max_steps 100 --logging_steps 100 --ckpt_steps 100 --should_continue

# Apply the pre-trained LiGO operator to grow the model
python vlm/run_lm_distributed.py --config configs/roberta_wiki.txt --config_name configs/roberta-12L-768H.json --per_gpu_train_batch_size 64 --gradient_accumulation_steps 4 --output_dir --grow_scheme ligo --source_model_path --pretrained_ligo_path --fuse_init_scheme stackbert_noisy rand --learning_rate 2e-4 --warmup_steps 10000 --should_continue
```

## Citation
This repository is based on the project [Vokenization](https://github.com/airsplay/vokenization#vokenization-vokenization).
If you find this repository or our work helpful for your own research, please cite our paper:

```
@inproceedings{wang2023learning,
title={Learning to grow pretrained models for efficient transformer training},
author={Wang, Peihao and Panda, Rameswar and Hennigen, Lucas Torroba and Greengard, Philip and Karlinsky, Leonid and Feris, Rogerio and Cox, David Daniel and Wang, Zhangyang and Kim, Yoon},
booktitle={International Conference on Learning Representations},
year={2023},
url={https://openreview.net/forum?id=cDYRS5iZ16f},
}
```