{"id":17909475,"url":"https://github.com/ymcui/lamb_optimizer_tf","last_synced_at":"2025-09-26T12:31:23.373Z","repository":{"id":109115586,"uuid":"190484246","full_name":"ymcui/LAMB_Optimizer_TF","owner":"ymcui","description":"LAMB Optimizer for Large Batch Training (TensorFlow version)","archived":false,"fork":false,"pushed_at":"2020-01-17T12:25:07.000Z","size":127,"stargazers_count":120,"open_issues_count":1,"forks_count":22,"subscribers_count":8,"default_branch":"master","last_synced_at":"2025-01-09T17:00:48.348Z","etag":null,"topics":["bert","optimizer","tensorflow"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ymcui.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2019-06-05T23:48:39.000Z","updated_at":"2024-09-05T08:27:57.000Z","dependencies_parsed_at":null,"dependency_job_id":"75cad58d-3b05-496b-9fa4-e0021082dfb6","html_url":"https://github.com/ymcui/LAMB_Optimizer_TF","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ymcui%2FLAMB_Optimizer_TF","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ymcui%2FLAMB_Optimizer_TF/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ymcui%2FLAMB_Optimizer_TF/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ymcui%2FLAMB_Optimizer_TF/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ymcui","download_url":"https://codeload.github.com/ymcui/LAMB_Optimizer_TF/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":234309721,"owners_count":18811949,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bert","optimizer","tensorflow"],"created_at":"2024-10-28T19:25:44.098Z","updated_at":"2025-09-26T12:31:23.005Z","avatar_url":"https://github.com/ymcui.png","language":"Python","readme":"# LAMB Optimizer (TensorFlow)\nThis is a simple implementation of LAMB Optimizer, which appeared in the paper [**\"Large Batch Optimization for Deep Learning: Training BERT in 76 minutes\"**](https://arxiv.org/abs/1904.00962v3). 
## Usage
The implementation is based on the BERT [repository](https://github.com/google-research/bert), which uses `AdamWeightDecayOptimizer` (defined in [`optimization.py`](https://github.com/google-research/bert/blob/master/optimization.py)) for pre-training and fine-tuning.

- Just use `LAMBOptimizer` as a regular optimizer in TensorFlow, similar to `Adam` or `AdamWeightDecayOptimizer`; a minimal sketch follows below.
- The LAMB optimizer lives in `optimization.py`.
- There is nothing special to tune other than the initial `learning_rate`.
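As a concrete sketch, the snippet below wires `LAMBOptimizer` into a toy TensorFlow 1.x graph using the BERT-style `tf.gradients` + `apply_gradients` pattern. The constructor arguments are assumed to mirror BERT's `AdamWeightDecayOptimizer` (only `learning_rate` passed, other hyperparameters left at their defaults); check `optimization.py` for the exact signature.

```python
import numpy as np
import tensorflow as tf  # TensorFlow 1.x, as used by the BERT codebase

from optimization import LAMBOptimizer  # this repository's optimization.py

# A tiny linear-regression graph, just to have something to optimize.
x = tf.placeholder(tf.float32, [None, 10])
y = tf.placeholder(tf.float32, [None, 1])
w = tf.get_variable("w", [10, 1], initializer=tf.zeros_initializer())
loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - y))

# Constructor arguments assumed to mirror BERT's AdamWeightDecayOptimizer;
# see optimization.py for the exact signature and defaults.
optimizer = LAMBOptimizer(learning_rate=0.001)

tvars = tf.trainable_variables()
grads = tf.gradients(loss, tvars)
grads, _ = tf.clip_by_global_norm(grads, clip_norm=1.0)  # BERT-style clipping
train_op = optimizer.apply_gradients(zip(grads, tvars))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    feed = {x: np.random.randn(64, 10).astype(np.float32),
            y: np.random.randn(64, 1).astype(np.float32)}
    for _ in range(100):
        sess.run(train_op, feed_dict=feed)
```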
## Results on MNIST
- I don't have a TPU Pod to test LAMB's scalability on BERT with large batches 😂, so I tested it on MNIST to verify its effectiveness.
- All optimizers use an initial learning rate of **0.001** (the default setting); the learning rate was **NOT** scaled with the batch size (which may bring further gains, but I leave that for you to test).
- All experiments were run on an NVIDIA Tesla T4.

Here are the numbers for five classical neural networks **(MLP, CNN, Bi-RNN, Bi-GRU, Bi-LSTM)** with different optimizers **(Adam, AdamW, LAMB)**.

I only list results for batch sizes of {64, 128, 1024, 16384}. For full results, please see [`FULL_RESULTS.md`](https://github.com/ymcui/LAMB_Optimizer_TF/blob/master/FULL_RESULTS.md).


### Batch=64
| Optimizer | MLP | CNN | Bi-RNN | Bi-GRU | Bi-LSTM | Note |
| :------ | :-----: | :-----: | :-----: | :-----: | :-----: | :-----: |
| Adam | 97.03 | 98.93 | 96.24 | 98.92 | **99.04** | Just ordinary Adam |
| AdamW | 97.11 | 99.01 | 96.50 | **99.11** | **99.04** | Used in BERT |
| **LAMB** | **98.27** | **99.33** | **97.73** | 98.83 | 98.94 | New optimizer for large batch |


### Batch=128
| Optimizer | MLP | CNN | Bi-RNN | Bi-GRU | Bi-LSTM | Note |
| :------ | :-----: | :-----: | :-----: | :-----: | :-----: | :-----: |
| Adam | 96.38 | 98.76 | 97.73 | **99.08** | **99.09** | Just ordinary Adam |
| AdamW | 96.57 | 98.72 | **98.05** | 98.96 | 99.00 | Used in BERT |
| **LAMB** | **97.90** | **99.20** | 98.04 | 98.87 | 98.76 | New optimizer for large batch |


### Batch=1024
| Optimizer | MLP | CNN | Bi-RNN | Bi-GRU | Bi-LSTM | Note |
| :------ | :-----: | :-----: | :-----: | :-----: | :-----: | :-----: |
| Adam | 93.05 | 97.92 | 98.10 | **98.94** | 98.67 | Just ordinary Adam |
| AdamW | 93.67 | 98.00 | 98.19 | 98.86 | **98.82** | Used in BERT |
| **LAMB** | **97.68** | **98.82** | **98.27** | 98.61 | 98.47 | New optimizer for large batch |


### Batch=16384
| Optimizer | MLP | CNN | Bi-RNN | Bi-GRU | Bi-LSTM | Note |
| :------ | :-----: | :-----: | :-----: | :-----: | :-----: | :-----: |
| Adam | 88.46 | 95.06 | 95.98 | 97.81 | 97.74 | Just ordinary Adam |
| AdamW | 91.46 | 96.57 | **96.34** | **98.45** | **98.39** | Used in BERT |
| **LAMB** | **93.23** | **97.89** | 93.76 | 87.60 | 80.36 | New optimizer for large batch |


### Several Conclusions
**Note: these conclusions are drawn only from the results above.**

- LAMB outperforms `Adam` and `AdamW` most of the time, and its results are consistent across different batch sizes.
- LAMB shows a big advantage over `Adam` and `AdamW` at large batch sizes, demonstrating its excellent scalability.
- LAMB fails to outperform `Adam` and `AdamW` on the more complex RNN-based models, regardless of batch size.


## Reproducibility
Check [`mnist_tensorflow.ipynb`](https://github.com/ymcui/LAMB_Optimizer_TF/blob/master/mnist_tensorflow.ipynb) for details.

Note: GPU/TPU runs will not produce exactly the same results even with a fixed random seed.


## References
- Large Batch Optimization for Deep Learning: Training BERT in 76 minutes. https://arxiv.org/abs/1904.00962v3
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. https://arxiv.org/abs/1810.04805

## Issues
For help or issues, please submit a GitHub issue.