Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/yangluo7/CAME
The official implementation of "CAME: Confidence-guided Adaptive Memory Optimization"
- Host: GitHub
- URL: https://github.com/yangluo7/CAME
- Owner: yangluo7
- License: MIT
- Created: 2023-07-20T13:28:45.000Z (over 1 year ago)
- Default Branch: master
- Last Pushed: 2024-06-27T15:00:13.000Z (5 months ago)
- Last Synced: 2024-09-30T11:35:52.867Z (about 1 month ago)
- Topics: deep-learning, diffusion-transformer, large-language-models, memory-efficient, optimizer
- Language: Python
- Homepage:
- Size: 785 KB
- Stars: 65
- Watchers: 2
- Forks: 5
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# CAME Optimizer
ACL 2023 Outstanding Paper Award
Confidence-guided Adaptive Memory Efficient Optimization

This is the official implementation of the **CAME** optimizer from "[Confidence-guided Adaptive Memory Efficient Optimization](https://arxiv.org/abs/2307.02047)". Please cite the paper and star this repo if you find CAME useful. Thanks!
[Paper](https://arxiv.org/abs/2307.02047) | [Twitter](https://twitter.com/ZangweiZheng/status/1680227732788236289) | [Blog](https://zhengzangw.github.io/blogs/came) | [PyPI Package](https://pypi.org/project/came-pytorch/) | [Zhihu](https://zhuanlan.zhihu.com/p/643816029)
## Method

In this work, we studied a confidence-guided strategy to reduce the instability of existing memory-efficient optimizers. Based on this strategy, we proposed CAME to simultaneously achieve two goals: fast convergence as in traditional adaptive methods, and low memory usage as in memory-efficient methods.

The pseudocode is presented in the figure below, with the differences from Adafactor highlighted in blue.
## Install
```bash
pip install came-pytorch
```
## Usage

```python
from came_pytorch import CAME

# model is any torch.nn.Module; betas are (beta1, beta2, beta3) and
# eps is (eps1, eps2), following the paper's notation.
optimizer = CAME(
    model.parameters(),
    lr=2e-4,
    weight_decay=1e-2,
    betas=(0.9, 0.999, 0.9999),
    eps=(1e-30, 1e-16),
)
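
# --- Illustrative addition, not from the original README: CAME is used like
# any other torch.optim optimizer; dataloader and loss_fn are assumed to be
# defined elsewhere. ---
for inputs, targets in dataloader:
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()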
```

## Hyper-parameter Tuning
* Pre-training: Based on our experiments on BERT-Large, GPT-2, and T5, a learning rate of 0.5-0.9x the AdamW learning rate is a suitable starting point for CAME.
* Set $\beta_1$ and $\beta_2$ to the same values used in AdamW, and choose $\beta_3$ to be larger than $\beta_2$. For example, consider choosing $\beta_3$ in $[0.9995, 0.99995]$ if $\beta_1, \beta_2 = 0.9, 0.999$, and $\beta_3$ in $[0.99, 0.999]$ if $\beta_1, \beta_2 = 0.9, 0.95$. Due to computational resource constraints, we did not explore more combinations of the three betas; different training tasks may require different combinations for optimal performance (see the sketch after this list).
* If you have any feedback or comments regarding hyper-parameter tuning, please do not hesitate to share them with us!
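As a concrete illustration of the guidance above (the learning rates below are placeholders, not tuned values), the two recommended regimes could be configured like this:

```python
from came_pytorch import CAME

# If AdamW would use betas=(0.9, 0.999): pick beta3 in [0.9995, 0.99995],
# e.g. 0.9999, with a learning rate around 0.5-0.9x the AdamW value.
optimizer = CAME(model.parameters(), lr=1e-4,
                 betas=(0.9, 0.999, 0.9999), eps=(1e-30, 1e-16))

# If AdamW would use betas=(0.9, 0.95): pick beta3 in [0.99, 0.999], e.g. 0.995.
optimizer = CAME(model.parameters(), lr=1e-4,
                 betas=(0.9, 0.95, 0.995), eps=(1e-30, 1e-16))
```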
## Experiments

Apart from the BERT and T5 experiments shown in the paper, we conducted additional experiments and record the results here.
### Fine-tuning LLaMA-7B
| | MMLU | WikiText | HellaSwag | TruthfulQA (MC) | BoolQ | COPA | WSC | WIC |
| -------------- | --------- | -------- | --------- | --------------- | --------- | --------- | --------- | --------- |
| Alpaca-7B | 40.21 | 6.74 | 59.76 | **38.89** | **79.57** | **88.00** | 46.15 | 49.84 |
| Alpaca-7B-CAME | **40.59** | **6.38** | **59.80** | 38.61 | 79.08 | **88.00** | **49.04** | **50.78** |

We fine-tuned LLaMA-7B with [stanford-alpaca](https://github.com/tatsu-lab/stanford_alpaca) (a 52k-example instruction-tuning dataset). To replicate our result, first register the CAME optimizer with the transformers package. Then, in the Alpaca training script, change the default optimizer from "adamw" to "came".
Alpaca-7B and Alpaca-7B-CAME are evaluated using [Instruct-eval](https://github.com/declare-lab/instruct-eval) and [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness).
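The README does not include the registration code itself; as one hedged illustration (not necessarily the authors' exact change), a `Trainer` subclass that overrides `create_optimizer` can swap CAME in without patching the transformers source:

```python
from came_pytorch import CAME
from transformers import Trainer


class CAMETrainer(Trainer):
    """Trainer that builds CAME instead of the default AdamW optimizer."""

    def create_optimizer(self):
        if self.optimizer is None:
            self.optimizer = CAME(
                self.model.parameters(),
                lr=self.args.learning_rate,
                weight_decay=self.args.weight_decay,
                betas=(0.9, 0.999, 0.9999),
                eps=(1e-30, 1e-16),
            )
        return self.optimizer
```

Note that this simplified sketch applies weight decay to all parameters; the stock Trainer excludes biases and layer norms, so adjust the parameter groups if you want to match that behavior.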
### Pre-training GPT-2
![CAME_gpt2](assets/gpt-2_came.png)
The pre-training of GPT-2 (Medium, 345M) is based on [Megatron-LM](https://github.com/NVIDIA/Megatron-LM). To replicate our result, add the CAME optimizer in [`megatron/optimizer/__init__.py`](https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/optimizer/__init__.py) and set *args.optimizer* to "came".
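The exact surrounding code in `megatron/optimizer/__init__.py` differs across Megatron-LM versions, so the following is only a schematic sketch of the kind of branch to add next to the existing Adam/SGD cases (the `param_groups` and `args` names follow common Megatron conventions and may need adjusting to your checkout):

```python
from came_pytorch import CAME

# Inside the optimizer-construction function, alongside the existing
# 'adam' and 'sgd' branches; param_groups comes from Megatron's own
# parameter-group helper.
if args.optimizer == 'came':
    optimizer = CAME(param_groups,
                     lr=args.lr,
                     weight_decay=args.weight_decay,
                     betas=(0.9, 0.999, 0.9999),
                     eps=(1e-30, 1e-16))
```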
## Memory Usage Comparison
To ensure a fair comparison, we set the batch size to 1 for the pre-training of GPT-2 (Medium) to examine the memory footprint of CAME and AdamW.

|              | AdamW | CAME     |
|--------------|-------|----------|
| Memory (GiB) | 8.77  | **7.44** |
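The README does not show how the footprint was recorded; one simple way (assuming a single CUDA device and a training loop like the one in the Usage section) is to read PyTorch's peak-memory counter:

```python
import torch

torch.cuda.reset_peak_memory_stats()
# ... run a few training steps with the optimizer under test ...
peak_gib = torch.cuda.max_memory_allocated() / 1024 ** 3
print(f"Peak allocated memory: {peak_gib:.2f} GiB")
```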
## Citation

```bibtex
@inproceedings{luo2023came,
  title={CAME: Confidence-guided Adaptive Memory Efficient Optimization},
  author={Luo, Yang and Ren, Xiaozhe and Zheng, Zangwei and Jiang, Zhuo and Jiang, Xin and You, Yang},
  booktitle={Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  pages={4442--4453},
  year={2023}
}
```