{"id":13472994,"url":"https://github.com/lucidrains/lion-pytorch","last_synced_at":"2025-09-25T17:28:34.326Z","repository":{"id":65899004,"uuid":"601904916","full_name":"lucidrains/lion-pytorch","owner":"lucidrains","description":"🦁 Lion, new optimizer discovered by Google Brain using genetic algorithms that is purportedly better than Adam(w), in Pytorch","archived":false,"fork":false,"pushed_at":"2024-11-27T15:28:24.000Z","size":221,"stargazers_count":2124,"open_issues_count":8,"forks_count":55,"subscribers_count":14,"default_branch":"main","last_synced_at":"2025-04-24T00:41:32.866Z","etag":null,"topics":["artificial-intelligence","deep-learning","evolutionary-search","optimizers"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lucidrains.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-02-15T04:24:19.000Z","updated_at":"2025-04-23T23:14:13.000Z","dependencies_parsed_at":null,"dependency_job_id":"60341273-86c4-44ad-b4b1-9199d48fcd89","html_url":"https://github.com/lucidrains/lion-pytorch","commit_stats":{"total_commits":39,"total_committers":5,"mean_commits":7.8,"dds":"0.15384615384615385","last_synced_commit":"70f492cd1d4e198ea533cbb3b0f024ae22fec26c"},"previous_names":[],"tags_count":15,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lucidrains%2Flion-pytorch","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lucidrains%2Flion-pytorch/tags","releases_url":"https://repos.ecos
yste.ms/api/v1/hosts/GitHub/repositories/lucidrains%2Flion-pytorch/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lucidrains%2Flion-pytorch/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lucidrains","download_url":"https://codeload.github.com/lucidrains/lion-pytorch/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253990460,"owners_count":21995774,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["artificial-intelligence","deep-learning","evolutionary-search","optimizers"],"created_at":"2024-07-31T16:00:59.815Z","updated_at":"2025-09-25T17:28:29.281Z","avatar_url":"https://github.com/lucidrains.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"\u003cimg src=\"./lion.png\" width=\"500px\"\u003e\u003c/img\u003e\n\n## 🦁 Lion - Pytorch\n\n\u003ca href=\"https://arxiv.org/abs/2302.06675\"\u003e🦁 Lion\u003c/a\u003e, Evo**L**ved S**i**gn M**o**me**n**tum, new optimizer discovered by Google Brain that is purportedly better than Adam(w), in Pytorch. 
This is nearly a straight copy from \u003ca href=\"https://github.com/google/automl/blob/master/lion/lion_pytorch.py\"\u003ehere\u003c/a\u003e, with a few minor modifications.\n\nIt is so simple, we may as well get it accessible and used asap by everyone to train some great models, if it really works 🤞\n\n### Instructions\n- Learning rate and weight decay: the authors write in Section 5 - `Based on our experience, a suitable learning rate for Lion is typically 3-10x smaller than that for AdamW. Since the effective weight decay is lr * λ, the value of decoupled weight decay λ used for Lion is 3-10x larger than that for AdamW in order to maintain a similar strength.` The initial value, peak value, and end value in the learning rate schedule should be changed ***simultaneously*** with the same ratio compared to AdamW, [evidenced by a researcher](https://github.com/lucidrains/lion-pytorch/discussions/1#discussioncomment-5239900).\n\n- Learning rate schedule: the authors use the same learning rate schedule for Lion as AdamW in the paper. Nevertheless, they observe a larger gain when using a cosine decay schedule to train ViT, compared to a reciprocal square-root schedule.\n\n- β1 and β2: the authors write in Section 5 - `The default values for β1 and β2 in AdamW are set as 0.9 and 0.999, respectively, with an ε of 1e−8, while in Lion, the default values for β1 and β2 are discovered through the program search process and set as 0.9 and 0.99, respectively.` Similar to how people reduce β2 to 0.99 or smaller and increase ε to 1e-6 in AdamW to improve stability, using `β1=0.95, β2=0.98` in Lion can also be helpful in mitigating instability during training, as suggested by the authors. 
This was \u003ca href=\"https://github.com/lucidrains/lion-pytorch/issues/13#issuecomment-1455123143\"\u003ecorroborated by a researcher\u003c/a\u003e.\n\n### Updates\n- Update: seems to work for my local enwik8 autoregressive language modeling.\n\n- Update 2: \u003ca href=\"https://api.wandb.ai/links/lucidrains/d4v6c8sl\"\u003eexperiments\u003c/a\u003e, seems much worse than Adam if learning rate held constant.\n\n- Update 3: Dividing the learning rate by 3, seeing better early results than Adam. Maybe Adam has been dethroned, after nearly a decade.\n\n- Update 4: using the 10x smaller learning rate rule of thumb from the paper resulted in the worst run. So I guess it still takes a bit of tuning.\n\nA summarization of previous updates: as shown in the \u003ca href=\"https://api.wandb.ai/links/lucidrains/d4v6c8sl\"\u003eexperiments\u003c/a\u003e, Lion with a 3x smaller learning rate beats Adam. It still takes a bit of tuning as a 10x smaller learning rate leads to a worse result.\n\n- Update 5: so far hearing all positive results for language modeling, when done right. Also heard positive results for significant text-to-image training, although it takes a bit of tuning. The negative results seem to be with problems and architectures outside of what was evaluated in the paper - RL, feedforward networks, weird hybrid architectures with LSTMs + convolutions etc. Negative anecdata also confirms this technique is sensitive to batch size, amount of data / augmentation. Tbd what optimal learning rate schedule is, and whether cooldown affects results. 
Also interestingly have a positive result at open-clip, which became negative as the model size was scaled up (but may be resolvable).\n\n- Update 6: open clip issue [resolved by the author](https://github.com/mlfoundations/open_clip/pull/432#issuecomment-1457323237), by setting a higher initial temperature.\n\n- Update 7: would only recommend this optimizer in the setting of high batch sizes (64 or above)\n\n## Install\n\n```bash\n$ pip install lion-pytorch\n```\nAlternatively, using conda:\n```bash\n$ conda install lion-pytorch\n```\n\n## Usage\n\n```python\n# toy model\n\nimport torch\nfrom torch import nn\n\nmodel = nn.Linear(10, 1)\n\n# import Lion and instantiate with parameters\n\nfrom lion_pytorch import Lion\n\nopt = Lion(model.parameters(), lr=1e-4, weight_decay=1e-2)\n\n# forward and backwards\n\nloss = model(torch.randn(10))\nloss.backward()\n\n# optimizer step\n\nopt.step()\nopt.zero_grad()\n```\n\nTo use a fused kernel for updating the parameters, first `pip install triton -U --pre`, then\n\n```python\nopt = Lion(\n    model.parameters(),\n    lr=1e-4,\n    weight_decay=1e-2,\n    use_triton=True # set this to True to use cuda kernel w/ Triton lang (Tillet et al)\n)\n```\n\n## Appreciation\n\n- \u003ca href=\"https://stability.ai/\"\u003eStability.ai\u003c/a\u003e for the generous sponsorship to work and open source cutting edge artificial intelligence research\n\n## Citations\n\n```bibtex\n@misc{https://doi.org/10.48550/arxiv.2302.06675,\n    url     = {https://arxiv.org/abs/2302.06675},\n    author  = {Chen, Xiangning and Liang, Chen and Huang, Da and Real, Esteban and Wang, Kaiyuan and Liu, Yao and Pham, Hieu and Dong, Xuanyi and Luong, Thang and Hsieh, Cho-Jui and Lu, Yifeng and Le, Quoc V.},\n    title   = {Symbolic Discovery of Optimization Algorithms},\n    publisher = {arXiv},\n    year = {2023}\n}\n```\n\n```bibtex\n@article{Tillet2019TritonAI,\n    title   = {Triton: an intermediate language and compiler for tiled neural network 
computations},\n    author  = {Philippe Tillet and H. Kung and D. Cox},\n    journal = {Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages},\n    year    = {2019}\n}\n```\n\n```bibtex\n@misc{Schaipp2024,\n    author  = {Fabian Schaipp},\n    url     = {https://fabian-sp.github.io/posts/2024/02/decoupling/}\n}\n```\n\n```bibtex\n@inproceedings{Liang2024CautiousOI,\n    title   = {Cautious Optimizers: Improving Training with One Line of Code},\n    author  = {Kaizhao Liang and Lizhang Chen and Bo Liu and Qiang Liu},\n    year    = {2024},\n    url     = {https://api.semanticscholar.org/CorpusID:274234738}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flucidrains%2Flion-pytorch","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flucidrains%2Flion-pytorch","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flucidrains%2Flion-pytorch/lists"}