{"id":24732514,"url":"https://github.com/tony-y/pytorch_warmup","last_synced_at":"2025-05-15T04:04:19.092Z","repository":{"id":41113838,"uuid":"218718586","full_name":"Tony-Y/pytorch_warmup","owner":"Tony-Y","description":"Learning Rate Warmup in PyTorch","archived":false,"fork":false,"pushed_at":"2025-03-11T04:39:42.000Z","size":7524,"stargazers_count":410,"open_issues_count":0,"forks_count":24,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-05-15T04:02:50.215Z","etag":null,"topics":["adam","deep-learning","learning-rate-scheduling","pytorch","warmup"],"latest_commit_sha":null,"homepage":"https://tony-y.github.io/pytorch_warmup/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Tony-Y.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-10-31T08:28:04.000Z","updated_at":"2025-05-03T15:25:39.000Z","dependencies_parsed_at":"2024-10-25T19:28:06.937Z","dependency_job_id":"18f2dfb8-897e-4e09-924e-2432c03c9cc4","html_url":"https://github.com/Tony-Y/pytorch_warmup","commit_stats":{"total_commits":32,"total_committers":1,"mean_commits":32.0,"dds":0.0,"last_synced_commit":"ed2b7bdfa43cd11b346fd4bfb2f5f7f055531bab"},"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Tony-Y%2Fpytorch_warmup","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Tony-Y%2Fpytorch_warmup/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Tony-Y%2Fpytorch_warmup/releases",
"manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Tony-Y%2Fpytorch_warmup/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Tony-Y","download_url":"https://codeload.github.com/Tony-Y/pytorch_warmup/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254270641,"owners_count":22042858,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["adam","deep-learning","learning-rate-scheduling","pytorch","warmup"],"created_at":"2025-01-27T17:52:34.479Z","updated_at":"2025-05-15T04:04:18.781Z","avatar_url":"https://github.com/Tony-Y.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# A PyTorch Extension for Learning Rate Warmup\n\nThis library contains PyTorch implementations of the warmup schedules described in [On the adequacy of untuned warmup for adaptive optimization](https://arxiv.org/abs/1910.04209).\n\n\u003cp align=\"center\"\u003e\u003cimg src=\"https://github.com/Tony-Y/pytorch_warmup/raw/master/examples/plots/figs/warmup_schedule.png\" alt=\"Warmup schedule\" width=\"400\"/\u003e\u003c/p\u003e\n\n[![Python package](https://github.com/Tony-Y/pytorch_warmup/workflows/Python%20package/badge.svg)](https://github.com/Tony-Y/pytorch_warmup/)\n[![PyPI version shields.io](https://img.shields.io/pypi/v/pytorch-warmup.svg)](https://pypi.python.org/pypi/pytorch-warmup/)\n[![PyPI license](https://img.shields.io/pypi/l/pytorch-warmup.svg)](https://github.com/Tony-Y/pytorch_warmup/blob/master/LICENSE)\n[![Python 
versions](https://img.shields.io/badge/python-3.9%20%7C%203.10%20%7C%203.11%20%7C%203.12-blue)](https://www.python.org)\n\n## Installation\n\nMake sure you have Python 3.9+ and PyTorch 1.9+ or 2.x. Then, run the following command in the project directory:\n\n```shell\npython -m pip install .\n```\n\nor install the latest version from the Python Package Index:\n\n```shell\npip install -U pytorch_warmup\n```\n\n## Examples\n\n* [CIFAR10](https://github.com/Tony-Y/pytorch_warmup/tree/master/examples/cifar10) -\n A sample script to train a ResNet model on the CIFAR10 dataset using an optimization algorithm with a warmup schedule.\n Its README presents ResNet20 results obtained using each of AdamW, NAdamW, AMSGradW, and AdaMax\n together with each of various warmup schedules.\n In addition, there is a ResNet performance comparison (up to ResNet110) obtained using the SGD algorithm\n with a linear warmup schedule.\n* [EMNIST](https://github.com/Tony-Y/pytorch_warmup/tree/master/examples/emnist) -\n A sample script to train a CNN model on the EMNIST dataset using the AdamW algorithm with a warmup schedule.\n Its README presents results obtained using the AdamW algorithm with each of the untuned linear warmup, the untuned exponential warmup,\n and the RAdam warmup.\n* [Plots](https://github.com/Tony-Y/pytorch_warmup/tree/master/examples/plots) -\n A script to plot effective warmup periods as a function of \u0026beta;\u0026#8322;, and warmup schedules over time.\n\n## Usage\n\nThe [documentation](https://tony-y.github.io/pytorch_warmup/master/) provides more detailed information on this library than is covered below. 
\n\n### Sample Codes\n\nThe scheduled learning rate is dampened by multiplying it by the warmup factor:\n\n\u003cp align=\"center\"\u003e\u003cimg src=\"https://github.com/Tony-Y/pytorch_warmup/raw/master/examples/emnist/figs/learning_rate.png\" alt=\"Learning rate\" width=\"400\"/\u003e\u003c/p\u003e\n\n#### Approach 1\n\n[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Tony-Y/colab-notebooks/blob/master/PyTorch_Warmup_Approach1_chaining.ipynb)\n\nWhen the learning rate schedule uses the global iteration number, the untuned linear warmup can be used\ntogether with `Adam` or one of its variants (`AdamW`, `NAdam`, etc.) as follows:\n\n```python\nimport torch\nimport pytorch_warmup as warmup\n\noptimizer = torch.optim.AdamW(params, lr=0.001, betas=(0.9, 0.999), weight_decay=0.01)\n    # This sample code uses the AdamW optimizer.\nnum_steps = len(dataloader) * num_epochs\nlr_scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_steps)\n    # The LR schedule initialization resets the initial LR of the optimizer.\nwarmup_scheduler = warmup.UntunedLinearWarmup(optimizer)\n    # The warmup schedule initialization dampens the initial LR of the optimizer.\nfor epoch in range(1,num_epochs+1):\n    for batch in dataloader:\n        optimizer.zero_grad()\n        loss = ...\n        loss.backward()\n        optimizer.step()\n        with warmup_scheduler.dampening():\n            lr_scheduler.step()\n```\n\n\u003e [!Warning]\n\u003e Note that the warmup schedule must be initialized after the learning rate schedule, never before it.\n\nIf you want to use learning rate schedule *chaining*, which is supported in PyTorch 1.4 or later, you can simply place the learning rate scheduler calls in the body of the `with` statement:\n\n```python\nlr_scheduler1 = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)\nlr_scheduler2 = torch.optim.lr_scheduler.StepLR(optimizer, 
step_size=3, gamma=0.1)\nwarmup_scheduler = warmup.UntunedLinearWarmup(optimizer)\nfor epoch in range(1,num_epochs+1):\n    for batch in dataloader:\n        ...\n        optimizer.step()\n        with warmup_scheduler.dampening():\n            lr_scheduler1.step()\n            lr_scheduler2.step()\n```\n\nIf you want to start the learning rate schedule after the end of the linear warmup, delay it by the warmup period:\n\n```python\nwarmup_period = 2000\nnum_steps = len(dataloader) * num_epochs - warmup_period\nlr_scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_steps)\nwarmup_scheduler = warmup.LinearWarmup(optimizer, warmup_period)\nfor epoch in range(1,num_epochs+1):\n    for batch in dataloader:\n        ...\n        optimizer.step()\n        with warmup_scheduler.dampening():\n            if warmup_scheduler.last_step + 1 \u003e= warmup_period:\n                lr_scheduler.step()\n```\n\n#### Approach 2\n\n[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Tony-Y/colab-notebooks/blob/master/PyTorch_Warmup_Approach2_chaining.ipynb)\n\nWhen the learning rate schedule uses the epoch number, the warmup schedule can be used as follows:\n\n```python\nlr_scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[num_epochs//3], gamma=0.1)\nwarmup_scheduler = warmup.UntunedLinearWarmup(optimizer)\nfor epoch in range(1,num_epochs+1):\n    for i, batch in enumerate(dataloader):\n        optimizer.zero_grad()\n        loss = ...\n        loss.backward()\n        optimizer.step()\n        if i \u003c len(dataloader)-1:\n            with warmup_scheduler.dampening():\n                pass\n    with warmup_scheduler.dampening():\n        lr_scheduler.step()\n```\n\nThis code can be rewritten more compactly:\n\n```python\nfor epoch in range(1,num_epochs+1):\n    for i, batch in enumerate(dataloader):\n        optimizer.zero_grad()\n        loss = ...\n        
loss.backward()\n        optimizer.step()\n        with warmup_scheduler.dampening():\n            if i + 1 == len(dataloader):\n                lr_scheduler.step()\n```\n\n#### Approach 3\n\nWhen you use `CosineAnnealingWarmRestarts`, the warmup schedule can be used as follows:\n\n```python\nlr_scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=10, T_mult=2)\nwarmup_period = 2000\nwarmup_scheduler = warmup.LinearWarmup(optimizer, warmup_period)\niters = len(dataloader)\nwarmup_epochs = ... # for example, (warmup_period + iters - 1) // iters\nfor epoch in range(epochs+warmup_epochs):\n    for i, batch in enumerate(dataloader):\n        optimizer.zero_grad()\n        loss = ...\n        loss.backward()\n        optimizer.step()\n        with warmup_scheduler.dampening():\n            if epoch \u003e= warmup_epochs:\n                lr_scheduler.step(epoch-warmup_epochs + i / iters)\n```\n\n### Warmup Schedules\n\n#### Manual Warmup\n\nIn `LinearWarmup` and `ExponentialWarmup`, the warmup factor `w(t)` depends on a warmup period that must be specified manually.\n\n##### Linear\n\n`w(t) = min(1, t / warmup_period)`\n\n```python\nwarmup_scheduler = warmup.LinearWarmup(optimizer, warmup_period=2000)\n```\n\nFor details, please refer to [LinearWarmup](https://tony-y.github.io/pytorch_warmup/master/manual_warmup.html#pytorch_warmup.base.LinearWarmup) in the documentation.\n\n##### Exponential\n\n`w(t) = 1 - exp(-t / warmup_period)`\n\n```python\nwarmup_scheduler = warmup.ExponentialWarmup(optimizer, warmup_period=1000)\n```\n\nFor details, please refer to [ExponentialWarmup](https://tony-y.github.io/pytorch_warmup/master/manual_warmup.html#pytorch_warmup.base.ExponentialWarmup) in the documentation.\n\n#### Untuned Warmup\n\nIn `UntunedLinearWarmup` and `UntunedExponentialWarmup`, the warmup period is determined by a function of Adam's `beta2` parameter.\n\n##### Linear\n\n`warmup_period = 2 / (1 - beta2)`\n\n```python\nwarmup_scheduler = 
warmup.UntunedLinearWarmup(optimizer)\n```\n\nFor details please refer to [UntunedLinearWarmup](https://tony-y.github.io/pytorch_warmup/master/untuned_warmup.html#pytorch_warmup.untuned.UntunedLinearWarmup) in the documentation.\n\n##### Exponential\n\n`warmup_period = 1 / (1 - beta2)`\n\n```python\nwarmup_scheduler = warmup.UntunedExponentialWarmup(optimizer)\n```\n\nFor details please refer to [UntunedExponentialWarmup](https://tony-y.github.io/pytorch_warmup/master/untuned_warmup.html#pytorch_warmup.untuned.UntunedExponentialWarmup) in the documentation.\n\n#### RAdam Warmup\n\nIn `RAdamWarmup`, the warmup factor `w(t)` is a complicated function depending on Adam's `beta2` parameter.\n\n```python\nwarmup_scheduler = warmup.RAdamWarmup(optimizer)\n```\n\nFor details please refer to [RAdamWarmup](https://tony-y.github.io/pytorch_warmup/master/radam_warmup.html#pytorch_warmup.radam.RAdamWarmup) in the documentation, or\n\"[On the Variance of the Adaptive Learning Rate and Beyond](https://arxiv.org/abs/1908.03265).\"\n\n### Apex's Adam\n\nThe Apex library provides an Adam optimizer tuned for CUDA devices, [FusedAdam](https://nvidia.github.io/apex/optimizers.html#apex.optimizers.FusedAdam). The FusedAdam optimizer can be used together with any one of the warmup schedules above. 
For example:\n\n[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Tony-Y/colab-notebooks/blob/master/PyTorch_Warmup_FusedAdam.ipynb)\n\n```python\noptimizer = apex.optimizers.FusedAdam(params, lr=0.001, betas=(0.9, 0.999), weight_decay=0.01)\nlr_scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_steps)\nwarmup_scheduler = warmup.UntunedLinearWarmup(optimizer)\n```\n\n### Compiled Optimizers\n\n[Benchmarking results](https://dev-discuss.pytorch.org/t/performance-comparison-between-torch-compile-and-apex-optimizers/2023)\nshow that the compiled Adam outperforms Apex's Adam.\n\n\u003e [!Warning]\n\u003e PyTorch 2.3 or later is required for using the compiled optimizer with\n\u003e a warmup scheduler and/or LR schedulers.\n\u003e PyTorch-Warmup 0.2 or earlier is incompatible with the compiled optimizer.\n\nYou can compile the Adam optimizer as follows:\n\n```python\nmodel = model.to(device)\noptimizer = torch.optim.Adam(model.parameters(), lr=torch.tensor(0.001).to(device))\nopt_step = torch.compile(optimizer.step, mode=\"reduce-overhead\")\n```\n\n\u003e [!Important]\n\u003e Wrap the learning rate in a `Tensor`, or `torch.compile` will recompile\n\u003e as the value of the learning rate changes.\n\nThen, the compiled version `opt_step` must be invoked instead of `optimizer.step`:\n\n```python\nfor epoch in range(1,num_epochs+1):\n    for batch in dataloader:\n        optimizer.zero_grad()\n        loss = ...\n        loss.backward()\n        opt_step()\n        with warmup_scheduler.dampening():\n            lr_scheduler.step()\n```\n\nYou can also compile other built-in optimizers in the way shown above.\n\n\u003e [!Note]\n\u003e When using the compiled SGD with momentum, its momentum buffer needs\n\u003e to be initialized manually. 
You can find sample code in the CIFAR10 example.\n\nIn practice, you may compile the optimizer step together with other PyTorch code as follows:\n\n```python\n@torch.compile(mode=\"reduce-overhead\")\ndef train_iter_fn(batch):\n    optimizer.zero_grad()\n    loss = ...\n    loss.backward()\n    optimizer.step()\n\nfor epoch in range(1,num_epochs+1):\n    for batch in dataloader:\n        train_iter_fn(batch)\n        with warmup_scheduler.dampening():\n            lr_scheduler.step()\n```\n\n`torch.compile` skips `lr_scheduler.step` even if it is invoked within `train_iter_fn`.\nLikewise, you should not compile `warmup_scheduler.dampening`.\nYou may also use `torch.compiler.disable` to have `torch.compile` skip a function\nthat updates the learning rate as follows:\n\n```python\n@torch.compiler.disable\ndef update_lr_fn():\n    with warmup_scheduler.dampening():\n        lr_scheduler.step()\n\n@torch.compile(mode=\"reduce-overhead\")\ndef train_iter_fn(batch):\n    optimizer.zero_grad()\n    loss = ...\n    loss.backward()\n    optimizer.step()\n    update_lr_fn()\n\nfor epoch in range(1,num_epochs+1):\n    for batch in dataloader:\n        train_iter_fn(batch)\n```\n\n## License\n\nMIT License\n\n\u0026copy; 2019-2025 Takenori Yamamoto\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftony-y%2Fpytorch_warmup","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftony-y%2Fpytorch_warmup","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftony-y%2Fpytorch_warmup/lists"}