https://github.com/vita-group/data-efficient-scaling
[ICML 2023] "Data Efficient Neural Scaling Law via Model Reusing" by Peihao Wang, Rameswar Panda, Zhangyang Wang
data-efficient large-language-models model-reusing scaling-laws
- Host: GitHub
- URL: https://github.com/vita-group/data-efficient-scaling
- Owner: VITA-Group
- Created: 2023-06-01T03:44:12.000Z (almost 3 years ago)
- Default Branch: main
- Last Pushed: 2024-01-04T23:08:12.000Z (over 2 years ago)
- Last Synced: 2025-10-05T07:26:09.966Z (8 months ago)
- Topics: data-efficient, large-language-models, model-reusing, scaling-laws
- Language: Python
- Homepage: https://huggingface.co/peihaowang/data-efficient-scaling
- Size: 188 KB
- Stars: 14
- Watchers: 9
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Data Efficient Neural Scaling Law via Model Reusing
[License: MIT](https://opensource.org/licenses/MIT)
The official implementation of the ICML 2023 paper [Data Efficient Neural Scaling Law via Model Reusing](https://openreview.net/pdf?id=iXYnIz4RRx).
Peihao Wang, Rameswar Panda, Zhangyang (Atlas) Wang
This repository is based on the project [Vokenization](https://github.com/airsplay/vokenization#vokenization-vokenization).

## Introduction
The number of parameters in large transformers has been observed to grow exponentially. Despite notable performance improvements, concerns have been raised that such a growing model size will run out of data in the near future. As manifested in the neural scaling law, modern learning backbones are not data-efficient. To maintain the utility of the model capacity, training data should be increased proportionally. In this paper, we study the neural scaling law under the previously overlooked data scarcity regime, focusing on the more challenging situation where we need to train a gigantic model with a disproportionately limited supply of available training data. We find that the existing power laws underestimate the data inefficiency of large transformers. Their performance will drop significantly if the training set is insufficient. Fortunately, we discover another blessing - such a data-inefficient scaling law can be restored through a model reusing approach that warm-starts the training of a large model by initializing it using smaller models. Our empirical study shows that model reusing can effectively reproduce the power law under the data scarcity regime. When progressively applying model reusing to expand the model size, we also observe consistent performance improvement in large transformers.
## Getting Started
### Dependency
The following Python libraries are required to run our code:
```
torch
torchvision
transformers==4.21.0
tensorboardX
# For GLUE evaluation
scikit-learn
# Faiss supports fast indexing.
# The code includes a torch-implemented GPU index, so do not worry if you cannot install faiss.
faiss-gpu>=1.6.3
# Spacy is used in sentence segmentation, where the sentences are the input to the cross-modality matching model.
spacy
# A higher h5py version to support h5py.VirtualLayout
h5py>=2.10.0
# For plotting
matplotlib
```
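Before running anything, it can help to see which of these libraries are already installed. The snippet below is a minimal sketch using only the standard library. The distribution names are assumptions about the matching PyPI packages (for example, `scikit-learn` is the PyPI name of the `sklearn` module), and `faiss-gpu` is treated as optional because the comments above say the code can work without it.

```python
from importlib import metadata

# Assumed PyPI distribution names for the libraries listed above.
REQUIRED = ["torch", "torchvision", "transformers", "tensorboardX",
            "scikit-learn", "spacy", "h5py", "matplotlib"]
OPTIONAL = ["faiss-gpu"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Print a short report; missing optional packages are only flagged, not fatal.
for label, names in (("required", REQUIRED), ("optional", OPTIONAL)):
    for name, version in installed_versions(names).items():
        print(f"[{label}] {name}: {version or 'NOT INSTALLED'}")
```

The report only reads package metadata, so it does not import the libraries themselves and runs quickly even in an incomplete environment.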
### Data Preparation
We re-use the data preparation pipeline provided by [Vokenization](https://github.com/airsplay/vokenization#vokenization-vokenization).
**1. Download and Pre-Process Pure-Language Data**
We provide scripts to get the English Wikipedia dataset.
The scripts to download and process the wiki data are modified from [XLM](https://github.com/facebookresearch/XLM). Note that the data processing pipelines for BERT and RoBERTa differ because they use different tokenizers.
To get data for training BERT, use the following command:
```
bash data/wiki/get_data_cased.bash en
```
RoBERTa requires an untokenized version of English Wikipedia, so use the following command instead:
```
bash data/wiki/get_data_cased_untokenized.bash en
```
**2. Tokenize Language Data**
We next tokenize the language corpus. This saves three files locally: `<dataset_name>.<tokenizer_name>`, `<dataset_name>.<tokenizer_name>.hdf5`, and `<dataset_name>.<tokenizer_name>.line`.
Use the following commands accordingly to tokenize the dataset:
```
# For BERT
bash tokenization/tokenize_wiki_bert.bash
# For RoBERTa
bash tokenization/tokenize_wiki_roberta.bash
```
You only need `en.valid.raw.bert-uncased.hdf5` to reproduce the results in our paper.
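To confirm that tokenization produced the expected outputs, one option is a small check like the sketch below. It assumes the naming pattern suggested by `en.valid.raw.bert-uncased.hdf5`, that is, the dataset path followed by the tokenizer name and an optional `.hdf5` or `.line` suffix. The helper names and the `data/wiki-cased` directory used in the example call are illustrative, not part of the repository's API.

```python
from pathlib import Path


def expected_outputs(dataset_path, tokenizer_name):
    """Return the three output paths assumed to be written by tokenization."""
    base = f"{dataset_path}.{tokenizer_name}"
    return [Path(base), Path(base + ".hdf5"), Path(base + ".line")]


def missing_outputs(dataset_path, tokenizer_name):
    """Return the expected output paths that do not exist on disk."""
    return [p for p in expected_outputs(dataset_path, tokenizer_name)
            if not p.exists()]


# Example: check the validation split used for reproducing the curves.
for path in missing_outputs("data/wiki-cased/en.valid.raw", "bert-uncased"):
    print(f"missing: {path}")
```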
## Reproducing Curves
**1. Download Pre-Trained Checkpoints**
We release our pre-trained checkpoints on the [Hugging Face Hub](https://huggingface.co/peihaowang/data-efficient-scaling). You can download them with the following command:
```
git clone https://huggingface.co/peihaowang/data-efficient-scaling
```
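Large files on the Hub are usually stored with Git LFS, so a plain `git clone` without `git-lfs` installed may fetch only pointer files. As an alternative, the sketch below uses `snapshot_download` from the `huggingface_hub` package. This package is not in the dependency list above, so treat it as an extra assumption, and the `local_dir` argument needs a reasonably recent `huggingface_hub` release. The helper name `download_checkpoints` is hypothetical.

```python
REPO_ID = "peihaowang/data-efficient-scaling"


def download_checkpoints(local_dir="./data-efficient-scaling"):
    """Download every file in the checkpoint repository into local_dir.

    Requires `pip install huggingface_hub`; imported lazily so this module
    can be loaded even when the package is absent.
    """
    from huggingface_hub import snapshot_download

    # Returns the path of the folder that now holds the repository files.
    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)


# Usage (needs network access):
#   path = download_checkpoints()
#   print("checkpoints stored in", path)
```

The default `local_dir` matches the folder that `git clone` would create, so the notebook's default `checkpoints_path` should work with either download method.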
**2. Run Jupyter Notebook**
We provide a Jupyter notebook, `plot_curves.ipynb`, to evaluate all checkpoints and plot the curves. Set the following arguments (shown with their default values) before running it:
```
# random seed and GPU ID
seed = 0
gpu = 0
# path to the prepared evaluation dataset
eval_data_file = 'data/wiki-cased/en.valid.raw'
# path to the folder holding the checkpoints downloaded from the Hugging Face Hub
checkpoints_path = './data-efficient-scaling/'
# evaluation batch size, adjust according to memory limit
batch_size = 64
```
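Evaluating every checkpoint can take a while, so it may be worth checking these settings before launching the notebook. The sketch below is not part of the repository. It assumes `eval_data_file` is a path prefix shared by the tokenized files (as the default value suggests), and `check_settings` is a hypothetical helper name.

```python
from pathlib import Path


def check_settings(eval_data_file, checkpoints_path, batch_size):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []

    # eval_data_file is treated as a prefix: look for any file starting with it.
    prefix = Path(eval_data_file)
    data_dir = prefix.parent
    if not data_dir.is_dir() or not any(data_dir.glob(prefix.name + "*")):
        problems.append(f"no evaluation data found for prefix {eval_data_file!r}")

    if not Path(checkpoints_path).is_dir():
        problems.append(f"checkpoint folder {checkpoints_path!r} does not exist")

    if not isinstance(batch_size, int) or batch_size <= 0:
        problems.append(f"batch_size must be a positive integer, got {batch_size!r}")

    return problems


# Example with the notebook's default values.
for problem in check_settings("data/wiki-cased/en.valid.raw",
                              "./data-efficient-scaling/", 64):
    print("warning:", problem)
```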
## Citation
If you find this work or our code helpful for your own research, please cite our paper.
```
@inproceedings{wang2023data,
title={Data efficient neural scaling law via model reusing},
author={Wang, Peihao and Panda, Rameswar and Wang, Zhangyang},
booktitle={International Conference on Machine Learning},
pages={36193--36204},
year={2023},
organization={PMLR}
}
```