https://github.com/tony-y/pytorch_warmup
Learning Rate Warmup in PyTorch
https://github.com/tony-y/pytorch_warmup
adam deep-learning learning-rate-scheduling pytorch warmup
Last synced: about 1 year ago
JSON representation
Learning Rate Warmup in PyTorch
- Host: GitHub
- URL: https://github.com/tony-y/pytorch_warmup
- Owner: Tony-Y
- License: mit
- Created: 2019-10-31T08:28:04.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2025-03-11T04:39:42.000Z (over 1 year ago)
- Last Synced: 2025-05-15T04:02:50.215Z (about 1 year ago)
- Topics: adam, deep-learning, learning-rate-scheduling, pytorch, warmup
- Language: Python
- Homepage: https://tony-y.github.io/pytorch_warmup/
- Size: 7.18 MB
- Stars: 410
- Watchers: 4
- Forks: 24
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# A PyTorch Extension for Learning Rate Warmup
This library contains PyTorch implementations of the warmup schedules described in [On the adequacy of untuned warmup for adaptive optimization](https://arxiv.org/abs/1910.04209).

[](https://github.com/Tony-Y/pytorch_warmup/)
[](https://pypi.python.org/pypi/pytorch-warmup/)
[](https://github.com/Tony-Y/pytorch_warmup/blob/master/LICENSE)
[](https://www.python.org)
## Installation
Make sure you have Python 3.9+ and PyTorch 1.9+ or 2.x. Then, run the following command in the project directory:
```shell
python -m pip install .
```
or install the latest version from the Python Package Index:
```shell
pip install -U pytorch_warmup
```
## Examples
* [CIFAR10](https://github.com/Tony-Y/pytorch_warmup/tree/master/examples/cifar10) -
A sample script to train a ResNet model on the CIFAR10 dataset using an optimization algorithm with a warmup schedule.
Its README presents ResNet20 results obtained using each of AdamW, NAdamW, AMSGradW, and AdaMax
together with each of various warmup schedules.
In addition, there is a ResNet performance comparison (up to ResNet110) obtained using the SGD algorithm
with a linear warmup schedule.
* [EMNIST](https://github.com/Tony-Y/pytorch_warmup/tree/master/examples/emnist) -
A sample script to train a CNN model on the EMNIST dataset using the AdamW algorithm with a warmup schedule.
Its README presents a result obtained using the AdamW algorithm with each of the untuned linear and exponential warmup,
and the RAdam warmup.
* [Plots](https://github.com/Tony-Y/pytorch_warmup/tree/master/examples/plots) -
A script to plot effective warmup periods as a function of β₂, and warmup schedules over time.
## Usage
The [documentation](https://tony-y.github.io/pytorch_warmup/master/) provides more detailed information on this library, unseen below.
### Sample Codes
The scheduled learning rate is dampened by the multiplication of the warmup factor:

#### Approach 1
[](https://colab.research.google.com/github/Tony-Y/colab-notebooks/blob/master/PyTorch_Warmup_Approach1_chaining.ipynb)
When the learning rate schedule uses the global iteration number, the untuned linear warmup can be used
together with `Adam` or its variant (`AdamW`, `NAdam`, etc.) as follows:
```python
import torch
import pytorch_warmup as warmup
optimizer = torch.optim.AdamW(params, lr=0.001, betas=(0.9, 0.999), weight_decay=0.01)
# This sample code uses the AdamW optimizer.
num_steps = len(dataloader) * num_epochs
lr_scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_steps)
# The LR schedule initialization resets the initial LR of the optimizer.
warmup_scheduler = warmup.UntunedLinearWarmup(optimizer)
# The warmup schedule initialization dampens the initial LR of the optimizer.
for epoch in range(1,num_epochs+1):
for batch in dataloader:
optimizer.zero_grad()
loss = ...
loss.backward()
optimizer.step()
with warmup_scheduler.dampening():
lr_scheduler.step()
```
> [!Warning]
> Note that the warmup schedule must not be initialized before the initialization of the learning rate schedule.
If you want to use the learning rate schedule *chaining*, which is supported for PyTorch 1.4 or above, you may simply write a code of learning rate schedulers as a suite of the `with` statement:
```python
lr_scheduler1 = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)
lr_scheduler2 = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.1)
warmup_scheduler = warmup.UntunedLinearWarmup(optimizer)
for epoch in range(1,num_epochs+1):
for batch in dataloader:
...
optimizer.step()
with warmup_scheduler.dampening():
lr_scheduler1.step()
lr_scheduler2.step()
```
If you want to start the learning rate schedule after the end of the linear warmup, delay it by the warmup period:
```python
warmup_period = 2000
num_steps = len(dataloader) * num_epochs - warmup_period
lr_scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_steps)
warmup_scheduler = warmup.LinearWarmup(optimizer, warmup_period)
for epoch in range(1,num_epochs+1):
for batch in dataloader:
...
optimizer.step()
with warmup_scheduler.dampening():
if warmup_scheduler.last_step + 1 >= warmup_period:
lr_scheduler.step()
```
#### Approach 2
[](https://colab.research.google.com/github/Tony-Y/colab-notebooks/blob/master/PyTorch_Warmup_Approach2_chaining.ipynb)
When the learning rate schedule uses the epoch number, the warmup schedule can be used as follows:
```python
lr_scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[num_epochs//3], gamma=0.1)
warmup_scheduler = warmup.UntunedLinearWarmup(optimizer)
for epoch in range(1,num_epochs+1):
for i, batch in enumerate(dataloader):
optimizer.zero_grad()
loss = ...
loss.backward()
optimizer.step()
if i < len(dataloader)-1:
with warmup_scheduler.dampening():
pass
with warmup_scheduler.dampening():
lr_scheduler.step()
```
This code can be rewritten more compactly:
```python
for epoch in range(1,num_epochs+1):
for i, batch in enumerate(dataloader):
optimizer.zero_grad()
loss = ...
loss.backward()
optimizer.step()
with warmup_scheduler.dampening():
if i + 1 == len(dataloader):
lr_scheduler.step()
```
#### Approach 3
When you use `CosineAnnealingWarmRestarts`, the warmup schedule can be used as follows:
```python
lr_scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=10, T_mult=2)
warmup_period = 2000
warmup_scheduler = warmup.LinearWarmup(optimizer, warmup_period)
iters = len(dataloader)
warmup_epochs = ... # for example, (warmup_period + iters - 1) // iters
for epoch in range(epochs+warmup_epochs):
for i, batch in enumerate(dataloader):
optimizer.zero_grad()
loss = ...
loss.backward()
optimizer.step()
with warmup_scheduler.dampening():
if epoch >= warmup_epochs:
lr_scheduler.step(epoch-warmup_epochs + i / iters)
```
### Warmup Schedules
#### Manual Warmup
In `LinearWarmup` and `ExponentialWarmup`, the warmup factor `w(t)` depends on the warmup period that must manually be specified.
##### Linear
`w(t) = min(1, t / warmup_period)`
```python
warmup_scheduler = warmup.LinearWarmup(optimizer, warmup_period=2000)
```
For details please refer to [LinearWarmup](https://tony-y.github.io/pytorch_warmup/master/manual_warmup.html#pytorch_warmup.base.LinearWarmup) in the documentation.
##### Exponential
`w(t) = 1 - exp(-t / warmup_period)`
```python
warmup_scheduler = warmup.ExponentialWarmup(optimizer, warmup_period=1000)
```
For details please refer to [ExponentialWarmup](https://tony-y.github.io/pytorch_warmup/master/manual_warmup.html#pytorch_warmup.base.ExponentialWarmup) in the documentation.
#### Untuned Warmup
In `UntunedLinearWarmup` and `UntunedExponentialWarmup`, the warmup period is determined by a function of Adam's `beta2` parameter.
##### Linear
`warmup_period = 2 / (1 - beta2)`
```python
warmup_scheduler = warmup.UntunedLinearWarmup(optimizer)
```
For details please refer to [UntunedLinearWarmup](https://tony-y.github.io/pytorch_warmup/master/untuned_warmup.html#pytorch_warmup.untuned.UntunedLinearWarmup) in the documentation.
##### Exponential
`warmup_period = 1 / (1 - beta2)`
```python
warmup_scheduler = warmup.UntunedExponentialWarmup(optimizer)
```
For details please refer to [UntunedExponentialWarmup](https://tony-y.github.io/pytorch_warmup/master/untuned_warmup.html#pytorch_warmup.untuned.UntunedExponentialWarmup) in the documentation.
#### RAdam Warmup
In `RAdamWarmup`, the warmup factor `w(t)` is a complicated function depending on Adam's `beta2` parameter.
```python
warmup_scheduler = warmup.RAdamWarmup(optimizer)
```
For details please refer to [RAdamWarmup](https://tony-y.github.io/pytorch_warmup/master/radam_warmup.html#pytorch_warmup.radam.RAdamWarmup) in the documentation, or
"[On the Variance of the Adaptive Learning Rate and Beyond](https://arxiv.org/abs/1908.03265)."
### Apex's Adam
The Apex library provides an Adam optimizer tuned for CUDA devices, [FusedAdam](https://nvidia.github.io/apex/optimizers.html#apex.optimizers.FusedAdam). The FusedAdam optimizer can be used together with any one of the warmup schedules above. For example:
[](https://colab.research.google.com/github/Tony-Y/colab-notebooks/blob/master/PyTorch_Warmup_FusedAdam.ipynb)
```python
optimizer = apex.optimizers.FusedAdam(params, lr=0.001, betas=(0.9, 0.999), weight_decay=0.01)
lr_scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_steps)
warmup_scheduler = warmup.UntunedLinearWarmup(optimizer)
```
### Compiled Optimizers
[Benchmarking results](https://dev-discuss.pytorch.org/t/performance-comparison-between-torch-compile-and-apex-optimizers/2023)
show that the complied Adam outperforms the Apex's Adam.
> [!Warning]
> PyTorch 2.3 or later is required for using the compiled optimizer with
> a warmup scheduler and/or LR schedulers.
> PyTorch-Warmup 0.2 or earlier is incompatible with the complied optimizer.
You can compile the Adam optimizer as follows:
```python
model = model.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=torch.tensor(0.001).to(device))
opt_step = torch.compile(optimizer.step, mode="reduce-overhead")
```
> [!Important]
> Wrap the learning rate in a `Tensor`, or `torch.compile` will recompile
> as the value of the learning rate changes.
Then, the compiled version `opt_step` have to be invoked instead of `optimizer.step`:
```python
for epoch in range(1,num_epochs+1):
for batch in dataloader:
optimizer.zero_grad()
loss = ...
loss.backward()
opt_step()
with warmup_scheduler.dampening():
lr_scheduler.step()
```
You can also compile other built-in optimizers in the way shown above.
> [!Note]
> When using the compiled SGD with momentum, its momentum buffer is needed
> to be initialized manually. You can find sample code in the CIFAR10 exmaple.
In practice, you may compile it together with other PyTorch code as follows:
```python
@torch.compile(mode="reduce-overhead")
def train_iter_fn(batch):
optimizer.zero_grad()
loss = ...
loss.backward()
optimizer.step()
for epoch in range(1,num_epochs+1):
for batch in dataloader:
train_iter_fn(batch)
with warmup_scheduler.dampening():
lr_scheduler.step()
```
`torch.compile` skips `lr_scheduler.step` even if it were invoked within `train_iter_fn`.
Likewise, you should not compile `warmup_scheduler.dampening`.
You may also use `torch.compiler.disable` to have `torch.compile` skip a function
updating the learning rate as follows:
```python
@torch.compiler.disable
def update_lr_fn():
with warmup_scheduler.dampening():
lr_scheduler.step()
@torch.compile(mode="reduce-overhead")
def train_iter_fn(batch):
optimizer.zero_grad()
loss = ...
loss.backward()
optimizer.step()
update_lr_fn()
for epoch in range(1,num_epochs+1):
for batch in dataloader:
train_iter_fn(batch)
```
## License
MIT License
© 2019-2025 Takenori Yamamoto