{"id":14399713,"url":"https://github.com/jiaweizzhao/GaLore","last_synced_at":"2025-08-24T11:31:40.011Z","repository":{"id":226566209,"uuid":"768386340","full_name":"jiaweizzhao/GaLore","owner":"jiaweizzhao","description":"GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection","archived":false,"fork":false,"pushed_at":"2024-10-28T18:01:54.000Z","size":405,"stargazers_count":1578,"open_issues_count":46,"forks_count":159,"subscribers_count":17,"default_branch":"master","last_synced_at":"2025-07-23T04:29:37.245Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jiaweizzhao.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-03-07T01:34:59.000Z","updated_at":"2025-07-18T17:02:57.000Z","dependencies_parsed_at":"2024-04-07T05:28:18.561Z","dependency_job_id":"f979801a-2e3b-4536-92dc-94741d53e6a5","html_url":"https://github.com/jiaweizzhao/GaLore","commit_stats":null,"previous_names":["jiaweizzhao/galore"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/jiaweizzhao/GaLore","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jiaweizzhao%2FGaLore","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jiaweizzhao%2FGaLore/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jiaweizzhao%2FGaLore/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jiaweizzhao%2FGaLore/manifests","owner_
url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jiaweizzhao","download_url":"https://codeload.github.com/jiaweizzhao/GaLore/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jiaweizzhao%2FGaLore/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":271854475,"owners_count":24834453,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-24T02:00:11.135Z","response_time":111,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-29T07:00:36.399Z","updated_at":"2025-08-24T11:31:39.695Z","avatar_url":"https://github.com/jiaweizzhao.png","language":"Python","funding_links":[],"categories":["A01_文本生成_文本对话","Python"],"sub_categories":["大语言对话模型及数据"],"readme":"# GaLore\n\nThis repo contains the pre-release version of GaLore algorithm, proposed by [GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection](https://arxiv.org/abs/2403.03507).\n\nGradient Low-Rank Projection (GaLore) is a memory-efficient low-rank training strategy that allows *full-parameter* learning but is more *memory-efficient* than common low-rank adaptation methods, such as LoRA.\nAs a gradient projection method, GaLore is independent of the choice of optimizers and can be easily plugged into existing ones with only two lines of code, as shown in Algorithm 1 below.\n\n\u003cdiv 
align=\"center\"\u003e\n  \u003cimg src=\"imgs/galore_code_box.png\" alt=\"Image 2\" style=\"width: 550px; margin: 0 auto;\"\u003e\n\u003c/div\u003e\n\n## News\n\n\n- **2024-09-01**: We are working on GaLore 2, which is a more efficient and accessible version of GaLore. Please stay tuned!\n- **2024-07-11**: We release Q-GaLore: Quantized GaLore with INT4 Projection. [[paper](https://arxiv.org/abs/2407.08296)] [[code](https://github.com/VITA-Group/Q-GaLore)]\n\n- **2024-07-01**: GaLore is accepted to ICML 2024 as Oral! \n\n- **2024-04-20**: Please join our Slack workspace [GaLore-Social](https://join.slack.com/t/galore-social/shared_invite/zt-2ev152px0-DguuQ5WRTLQjtq2C88HBvQ) to discuss with us and the community.\n\n## Installation\n\n### Install GaLore optimizer\nInstall from pip:\n```bash \npip install galore-torch\n```\n\nor if you want to install from source:\n\n```bash\ngit clone git@github.com:jiaweizzhao/GaLore.git\ncd GaLore\npip install -e .\n```\n\n### Install experiment dependencies\n\n```bash\npip install -r exp_requirements.txt\n```\n\nOur experiment scripts are tested on Python 3.8 with PyTorch 2.1.\n\n## Usage\n\n### Save optimizer memory using GaLore optimizers\n\n```python\nfrom galore_torch import GaLoreAdamW, GaLoreAdamW8bit, GaLoreAdafactor\n# define param groups as galore_params and non_galore_params\nparam_groups = [{'params': non_galore_params}, \n                {'params': galore_params, 'rank': 128, 'update_proj_gap': 200, 'scale': 0.25, 'proj_type': 'std'}]\noptimizer = GaLoreAdamW(param_groups, lr=0.01)\n```\n### Save weight gradient memory using per-layer weight updates\n\nWe use `register_post_accumulate_grad_hook` provided by [PyTorch](https://pytorch.org/tutorials/intermediate/optimizer_step_in_backward_tutorial.html) (`torch\u003e=2.1.0`) to enable per-layer weight updates. 
An example is shown below:\n\n```python\n# define an optimizer for each parameter p, and store them in optimizer_dict\noptimizer_dict = {}\nfor p in model.parameters():\n    if p.requires_grad:\n        optimizer_dict[p] = GaLoreAdamW([{'params': p, 'rank': 128, 'update_proj_gap': 200, 'scale': 0.25, 'proj_type': 'std'}], lr=0.01)\n\n# define a hook function to update the parameter p during the backward pass\ndef optimizer_hook(p):\n    if p.grad is None: \n        return\n    optimizer_dict[p].step()\n    optimizer_dict[p].zero_grad()\n\n# Register the hook onto every parameter\nfor p in model.parameters():\n    if p.requires_grad:\n        p.register_post_accumulate_grad_hook(optimizer_hook)\n```\nMore details can be found in [torchrun_main.py](https://github.com/jiaweizzhao/GaLore/blob/a6bc1650984b1c090a4e108d7c0e3109ee7ad844/torchrun_main.py#L334).\n\n## Benchmark 1: Pre-Training LLaMA on C4 dataset\n`torchrun_main.py` is the main script for training LLaMA models on C4 with GaLore. Benchmark scripts for various model sizes are in the `scripts/benchmark_c4` folder.\nFor example, to train a 60M model on C4, do the following:\n\n```bash\n# LLaMA-60M, GaLore-Adam, 1 A100, 1 Node\ntorchrun --standalone --nproc_per_node 1 torchrun_main.py \\\n    --model_config configs/llama_60m.json \\\n    --lr 0.01 \\\n    --galore_scale 0.25 \\\n    --rank 128 \\\n    --update_proj_gap 200 \\\n    --batch_size 256 \\\n    --total_batch_size 512 \\\n    --num_training_steps 10000 \\\n    --warmup_steps 1000 \\\n    --weight_decay 0 \\\n    --dtype bfloat16 \\\n    --eval_every 1000 \\\n    --optimizer galore_adamw \n```\n\n### Train 7B model with a single GPU with 24GB memory\nTo train a 7B model on a single GPU such as an NVIDIA RTX 4090, specify `--optimizer=galore_adamw8bit_per_layer`, which enables `GaLoreAdamW8bit` with per-layer weight updates.\nWith activation checkpointing, you can maintain a batch size of 16 (tested on an NVIDIA RTX 4090).\n\n```bash\n# LLaMA-7B, 8-bit 
GaLore-Adam, single GPU, activation checkpointing\n# bsz=16, 22.8G, \ntorchrun --standalone --nproc_per_node 1 torchrun_main.py \\\n    --model_config configs/llama_7b.json \\\n    --lr 0.005 \\\n    --galore_scale 0.25 \\\n    --rank 1024 \\\n    --update_proj_gap 500 \\\n    --batch_size 16 \\\n    --total_batch_size 512 \\\n    --activation_checkpointing \\\n    --num_training_steps 150000 \\\n    --warmup_steps 15000 \\\n    --weight_decay 0 \\\n    --grad_clipping 1.0 \\\n    --dtype bfloat16 \\\n    --eval_every 1000 \\\n    --single_gpu \\\n    --optimizer galore_adamw8bit_per_layer\n```\n\nCurrently, the per-layer weight update technique is supported only for single-GPU training (`--single_gpu`), without `nn.parallel.DistributedDataParallel`. We are working on supporting multi-GPU training with per-layer weight updates.\n\n## Benchmark 2: Fine-Tuning RoBERTa on GLUE tasks\n`run_glue.py` is the main script for fine-tuning RoBERTa models on GLUE tasks with GaLore. An example script is shown below:\n\n```bash\npython run_glue.py \\\n    --model_name_or_path roberta-base \\\n    --task_name mrpc \\\n    --enable_galore \\\n    --lora_all_modules \\\n    --max_length 512 \\\n    --seed=1234 \\\n    --lora_r 4 \\\n    --galore_scale 4 \\\n    --per_device_train_batch_size 16 \\\n    --update_proj_gap 500 \\\n    --learning_rate 3e-5 \\\n    --num_train_epochs 30 \\\n    --output_dir results/ft/roberta_base/mrpc\n```\n\n## Citation\n```bibtex\n@misc{zhao2024galore,\n      title={GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection}, \n      author={Jiawei Zhao and Zhenyu Zhang and Beidi Chen and Zhangyang Wang and Anima Anandkumar and Yuandong Tian},\n      year={2024},\n      eprint={2403.03507},\n      archivePrefix={arXiv},\n      
primaryClass={cs.LG}\n}\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjiaweizzhao%2FGaLore","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjiaweizzhao%2FGaLore","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjiaweizzhao%2FGaLore/lists"}