{"id":13576005,"url":"https://github.com/zyushun/Adam-mini","last_synced_at":"2025-04-05T05:30:37.020Z","repository":{"id":246077728,"uuid":"818644241","full_name":"zyushun/Adam-mini","owner":"zyushun","description":"Code for Adam-mini: Use Fewer Learning Rates To Gain More https://arxiv.org/abs/2406.16793","archived":false,"fork":false,"pushed_at":"2024-12-05T13:27:18.000Z","size":54313,"stargazers_count":399,"open_issues_count":19,"forks_count":14,"subscribers_count":12,"default_branch":"main","last_synced_at":"2025-04-04T08:42:41.208Z","etag":null,"topics":["deep-learning","large-language-models","optimizer"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/zyushun.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-06-22T12:34:53.000Z","updated_at":"2025-04-01T05:34:53.000Z","dependencies_parsed_at":"2024-07-08T09:30:51.515Z","dependency_job_id":"9cb96bc0-aea5-4fff-95bb-f6c268739f6d","html_url":"https://github.com/zyushun/Adam-mini","commit_stats":null,"previous_names":["zyushun/adam-mini"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zyushun%2FAdam-mini","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zyushun%2FAdam-mini/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zyushun%2FAdam-mini/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zyushun%2FAdam-mini/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/zyushun","download_url":"https://codeload.github.com/zyushun/Adam-mini/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247294011,"owners_count":20915329,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","large-language-models","optimizer"],"created_at":"2024-08-01T15:01:06.230Z","updated_at":"2025-04-05T05:30:32.009Z","avatar_url":"https://github.com/zyushun.png","language":"Python","readme":"# Adam-mini\n\n**【Important notice!!!】** We are happy to anounce that we have updated Adam-mini to **version 1.1.0** in PyPI (see [here](https://pypi.org/project/adam-mini/)). This is a major update: based on more careful Hessian investigation of Transformers, we change the partition strategies for Values, attn_proj, MLPs, embedding, and the output layer. In particular, our new partition strategy  for the embedding \u0026 output layer eliminates the need for Adam-mini to treat these these layers as special cases. As a result, Adam-mini now saves 50% memory over Adam for all models of any size (previously is 45% to 50% reduction for \u003e1B models).  The updated form of Adam-mini is shown in **Algorithm 1** and the paper will be updated accordingly very soon.\n\n\n\nThis repository contains the official PyTorch implementation of Adam-mini optimizer, a mini-version of Adam that achieves on-par or better performance than AdamW with **50%** less memory footprint.\n\nAdam-mini reduces memory by cutting down the learning rate (lr) resources in Adam (i.e., $1/\\sqrt{v}$): we argue that **\u003e99.9%** of these lr in $v$ could be harmlessly removed if we:\n\n(1) carefully partition the parameters into blocks following our proposed principle related to **Hessian structure**.  \n(2) assign a single **but good** lr to each parameter block.\n\nWe find a simple and effective way to reach these requirements. The resulting algorithm is shown below in **Algorithm 1**. Check out more detailed descriptions in our paper: [Adam-mini: Use Fewer Learning Rates To Gain More](https://arxiv.org/abs/2406.16793).\n\n\u003cimg src=\"figures/figure1.png\" style=\"zoom:40%;\" /\u003e\n\n\u003cimg src=\"figures/illustration.png\" style=\"zoom:40%;\" /\u003e\n\n![](figures/adam-mini-v1.1.0.png)\n\n## How to use\n\nInstall torch (\u003e=1.8.0) and run the following commands.\n\n```\npip install adam-mini\n```\n\nor if you prefer to import from source\n\n```\ngit clone https://github.com/zyushun/Adam-mini\ncd Adam-mini\npip install -e .\n```\n\nThen use Adam-mini optimizer as follows.\n\n```\nfrom adam_mini import Adam_mini\n\noptimizer = Adam_mini(\n            named_parameters = model.named_parameters(),\n            lr = lr,\n            betas = (beta1,beta2),\n            eps = eps,\n            weight_decay = weight_decay,\n            dim = model_config.dim,\n            n_heads = model_config.n_heads,\n            n_kv_heads = model_config.n_kv_heads,\n            )\n\n```\n\n**Hyperparameter choices:** Regarding learning rate (lr), weight_decay, beta1, beta2, eps, we recommend using the same values as those used for AdamW.\n\nIf you are training Transformers, please also pass the following info to Adam-mini:\n\n- dim: dimension for hidden feature. Could be unspecified if you are training non-transformer models.\n\n- n_heads: number of attention heads. Could be unspecified if you are training non-transformer models.\n\n- n_kv_heads: number of head for Key and Value. Or equivalently, number of query groups in Group query Attention. Also known as \"n_query_groups\". If is None, it will be the same value as n_head. Could be unspecified if you are training non-transformer models.\n\n## Support\n\nOur current implementation of Adam-mini supports popular distributed frameworks and codebase including:\n\n1. DDP distributed framework\n2. FSDP distributed framework\n3. [DeepSpeed](https://github.com/microsoft/DeepSpeed)\n4. [Hugginface Trainer](https://huggingface.co/docs/transformers/en/main_classes/trainer)\n5. [Torchtitan](https://github.com/pytorch/torchtitan)\n6. [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory). Detailed usage instruction can be seen in [examples](https://github.com/hiyouga/LLaMA-Factory/blob/main/examples/README_zh.md)\n7. More is coming! Do not hesitate to contact us if Adam-mini does not support your codebase!\n\n## Examples\n\nWe here provide sample code on pre-training, SFT, and RLHF. You need 2xA800-80GB or 2xA100-80GB GPUs to run the experiments below.\n\n### Example 1: GPT2 series Pre-training\n\nWe pre-train GPT2 series (125M-1.5B) using [NanoGPT](https://github.com/karpathy/nanoGPT) codebase under DDP framework. Install dependencies from pip:\n\n```\nconda env create -f gpt2/environment.yml\nconda activate gpt2\ncd examples/gpt2\n```\n\nRun the code for GPT2 pre-training:\n\n```\nbash run_gpt2.sh\n```\n\nYou will get the following curves.\n\n\u003cimg src=\"figures/gpt2.png\" style=\"zoom:100%;\" /\u003e\n\n\n\n### Example 2: Llama series Pre-training\n\nWe here provide a sample code for pre-training Llama series (from 39M to 13B) using [Torchtitan](https://github.com/pytorch/torchtitan) code base under FSDP framework. We recommend using Torchtitan codebase as it will be much faster than NanoGPT codebase for processing the same amount of tokens. \n\nInstall dependence from pip (or please see the instructions from [Torchtitan](https://github.com/pytorch/torchtitan)):\n\n```\ncd examples/llama\npip install -r requirements.txt\npip3 install --pre torch==2.5.0.dev20240617  --index-url https://download.pytorch.org/whl/nightly/cu121 #or cu118\npip3 install --pre torchdata --index-url https://download.pytorch.org/whl/nightly\n```\n\nDownload a tokenizer.model. Follow the instructions on the official [meta-llama](https://huggingface.co/meta-llama/Meta-Llama-3-8B) repository to ensure you have access to the Llama model weights. Once you have confirmed access, you can run the following command to download the Llama 3 / Llama 2 tokenizer to your local machine.\n\n```\n# Get your HF token from https://huggingface.co/settings/tokens\n\n# llama3 tokenizer.model\npython torchtitan/datasets/download_tokenizer.py --repo_id meta-llama/Meta-Llama-3-8B --tokenizer_path \"original\" --hf_token=...\n\n# llama2 tokenizer.model\npython torchtitan/datasets/download_tokenizer.py --repo_id meta-llama/Llama-2-13b-hf --hf_token=...\n```\n\nChange your data path in the configuration file (for instance,  ./train_configs/llama3_8b_mini.toml).  To debug, you can download the small dataset \"c4_mini\" via [GoogleDrive](https://drive.google.com/drive/folders/1B16KpuhUyz4p7mwc9xmRHuyCY37dAw-2?usp=sharing) and put it under the path \"./torchtitan/datasets/c4_mini/\".\n\n```\ndataset = \"c4\" #for debug can use \"c4_mini\"\ndataset_path = \"your_path/c4\" #for debug can use \"./torchtitan/datasets/c4_mini/\"\n```\n\nThen we can kick off the training. For instance, you can train Llama models from 39M to 1B and  reproduce our scaling-law experiments. You can  train all models for  a complete pre-training run by Chinchilla's law. The total running time would be about 300 GPU hours (we tested on 4*A800-80GB GPUs).\n\n```\nbash run_llama_2_scaling_law.sh\n```\n\nYou can get the following curves (after changing x-axis into FLOPs and taking log)\n\n\u003cimg src=\"figures/scaling_law.png\" style=\"zoom:100%;\" /\u003e\n\nAfter a complete pre-training run by Chinchilla's law, you will get the following the final validation perplexity.\n\n\u003cimg src=\"figures/perplexity_table.png\" style=\"zoom:100%;\" /\u003e\n\nIn particular, the training curves of 1B model will look like the following.\n\n\n\n\u003cimg src=\"figures/0928_adam_mini_1b.png\" style=\"zoom:80%;\" /\u003e\n\n\n\nYou can also pre-train Llama3-8B and Llama2-13B using the folloiwng code.\n\n```\nbash run_llama_3_8b.sh\nbash run_llama_2_13b.sh\n\n#after creating the optimize\noptimizer.wv_names = {} # For 8B and 13B experiments, we apply a single lr for Value and find it performs a bit better\n```\n\nYou will get the following curves.\n\n\u003cimg src=\"figures/1001_llama3_8b_13b.png\" style=\"zoom:150%;\" /\u003e\n\n\n\n\n\n\n\n\n\n### Example 2: Llama2-7B Supervised Fine-tuning and RLHF\n\nWe fine-tune Llama2-7B using [ReMax](https://github.com/liziniu/ReMax) codebase under [DeepSpeed](https://github.com/microsoft/DeepSpeedExamples/tree/master/applications/DeepSpeed-Chat) framework. Install dependencies from pip:\n\n```\nconda env create -f RLHF/environment.yml\nconda activate rlhf\ncd examples/RLHF\n```\n\nRun the code for SFT with LoRA :\n\n```\nbash training_scripts/sft/run_sft_lora.sh\n```\n\nRun the code for full-parameter SFT :\n\n```\nbash training_scripts/sft/run_sft_full.sh\n```\n\nRun the code for reward model training in RLHF\n\n```\nbash training_scripts/reward/run_reward.sh\n```\n\nRun the code for reward optimization in RLHF using ReMax:\n\n```\nbash training_scripts/po/remax/run_remax.sh\n```\n\nYou will get the following curves.\n\n\u003cimg src=\"figures/sft_and_rlhf.png\" style=\"zoom:40%;\" /\u003e\n\n## Remarks\n\n**How to use Adam-mini in Huggingface Trainer**. If you are using Huggingface Trainer, please overwrite \"create_optimizer\" as follows to change optimizer:\n\n```\n def create_optimizer(self) -\u003e \"torch.optim.Optimizer\":\n        if self.optimizer is None:\n            if (self.finetuning_args.use_adammini):\n                self.optimizer = Adam_mini(\n            named_parameters = model.named_parameters(),\n            lr = lr,\n            betas = (beta1,beta2),\n            eps = eps,\n            weight_decay = weight_decay,\n            model_sharding = True,\n            dim = model_config.dim,\n            n_heads = model_config.n_heads,\n            n_kv_heads = model_config.n_kv_heads,\n            )\n        return super().create_optimizer()\n```\n\n**About checkpoint saving under FSDP:** If you are using FSDP distributed framework, we apologize that we still have unexpected error for saving checkpoints. We are working on it and will update soon.\n\n**About CPU offload:** Our current implementation of Adam-mini supports CPU offload in FSDP, while it does not support CPU offload in DeepSpeed. Please turn off offload when using DeepSpeed. We will resolve this issue soon.\n\n## Changelog\n\n[24/06/26] We are online!\n\n[24/07/21] We now support the Adam-mini by pip install\n\n[24/08/09] We now support the Adam-mini in [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory).\n\n[24/09/04] We update Adam-mini to version 1.0.3 in PyPI (see [here](https://pypi.org/project/adam-mini/)). We deprecate the argument \"model_sharding\". We will assume that model parallelism is always used and \"model_sharding\" is always set True. We will remove this argument in the future version.\n\n[24/09/18] We update Adam-mini to version 1.0.4 in PyPI (see [here](https://pypi.org/project/adam-mini/)). We add the argument \"verbose\" to allow manually mute the logs by Adam-mini. We support CPU-offload in FSDP.\n\n[24/10/18] We update Adam-mini to version 1.1.0 in PyPI (see [here](https://pypi.org/project/adam-mini/)). This is a major update: we change the partition rules for attn_proj, MLPs, embedding, and the output layer. In particular, we design a new partition strategy for the embedding \u0026 output layer, and now Adam-mini no longer need to treat these two layers as special cases. As a result, Adam-mini now saves 50% memory over Adam for all models of any size (previously is 45% to 50% reduction for \u003e1B models).  \n\n## Acknowledgements\n\n1. The above code is heavily based on the codebase of [NanoGPT](https://github.com/karpathy/nanoGPT), [Torchtitan](https://github.com/pytorch/torchtitan), [ReMax](https://github.com/liziniu/ReMax), and [DeepSpeed](https://github.com/microsoft/DeepSpeedExamples/tree/master/applications/DeepSpeed-Chat).\n2. We'd like to express our gratitude to [@lessw2020](https://github.com/lessw2020) and [@awgu](https://github.com/awgu) for the support on [Torchtitan](https://github.com/pytorch/torchtitan) and the great suggestions for refactoring the code of Adam-mini!\n3. We'd like to express our gratitude to [@Mrw33554432](https://github.com/Mrw33554432) for the pull request to pip install!\n4. We'd like to express our gratitude to [@relic-yuexi](https://github.com/relic-yuexi) for the pull request to [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory)!\n5. We'd like to express our gratitude to [@Ashwanth369](https://github.com/Ashwanth369) for the pull request to [Huggingface Transformers](https://github.com/huggingface/transformers)!\n6. We'd like to express our gratitude to [@minienglish1](https://github.com/minienglish1) for the suggestions on  CPU-offload ([Issue #28](https://github.com/zyushun/Adam-mini/issues/28))! \n\n## Citation\n\nIf you find this code helpful, please cite our paper in the following format.\n\n```\n@article{zhang2024adam,\n  title     = {Adam-mini: Use Fewer Learning Rates To Gain More},\n  author    = {Zhang, Yushun and Chen, Congliang  and Li, Ziniu and Ding, Tian and Wu, Chenwei and Ye, Yinyu and Luo, Zhi-Quan and Sun, Ruoyu},\n  booktitle = {arXiv preprint arXiv:2406.16793},\n  year      = {2024},\n}\n```\n","funding_links":[],"categories":["Python","其他_机器学习与深度学习"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzyushun%2FAdam-mini","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fzyushun%2FAdam-mini","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzyushun%2FAdam-mini/lists"}