Hypergradient based Optimization Methods
===
This work explores improvements to the existing hypergradient-based optimizers proposed in the paper [Online Learning Rate Adaptation with Hypergradient Descent][1]. A report summarising the work can be found [here](https://github.com/harshalmittal4/Hypergradient_variants/blob/master/Hypergradient_optimizers.pdf).

## Introduction
The method proposed in the paper ["Online Learning Rate Adaptation with Hypergradient Descent"][1] automatically adjusts the learning rate so as to minimize an estimate of the expectation of the loss. It introduces the "hypergradient": the gradient of the loss function w.r.t. the hyperparameter eta, the optimizer's learning rate. At each training iteration the step size is learned through a gradient-descent update driven by the hypergradient. Applied alongside the model optimizers SGD, SGD with Nesterov momentum (SGDN), and Adam, this yields their hypergradient counterparts SGD-HD, SGDN-HD, and Adam-HD, which demonstrate faster convergence of the loss and better generalization than the plain optimizers alone.
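
As a concrete reference point, the basic hypergradient-descent update for plain SGD (SGD-HD) can be sketched as follows; this is a minimal, illustrative implementation, and the function and variable names are ours rather than the repository's.

```python
import numpy as np

def sgd_hd(grad_fn, theta, alpha0=0.01, beta=1e-4, steps=100):
    """Sketch of SGD-HD: plain SGD whose learning rate is itself adapted
    by gradient descent on the hypergradient."""
    alpha = alpha0
    prev_grad = np.zeros_like(theta)             # gradient from the previous iteration
    for _ in range(steps):
        grad = grad_fn(theta)
        hypergrad = -np.dot(grad, prev_grad)     # d(loss)/d(alpha), since d(theta)/d(alpha) = -prev_grad
        alpha = alpha - beta * hypergrad         # gradient-descent step on the learning rate
        theta = theta - alpha * grad             # ordinary SGD step with the adapted alpha
        prev_grad = grad
    return theta
```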

However, we expect that the hypergradient-based learning-rate update can be made more accurate, and we aim to exploit these gains further by boosting the learning-rate updates with momentum and adaptive gradients. We experiment with
1. Hypergradient descent with momentum, and
2. Adam with hypergradient,

alongside the model optimizers SGD, SGD with Nesterov momentum (SGDN), and Adam (a sketch of both learning-rate updates is given just below).
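
The two learning-rate updates above can be sketched as follows. This is a minimal, illustrative version of the idea only; the function names and the `state` bookkeeping are ours and do not mirror the repository's implementation.

```python
import numpy as np

def lr_update_sgdn(alpha, hypergrad, state, beta=1e-4, mu=0.9):
    """Hypergradient descent with Nesterov momentum on the learning rate."""
    v = mu * state.get("v", 0.0) + hypergrad       # velocity accumulated on the hypergradient
    state["v"] = v
    return alpha - beta * (hypergrad + mu * v)     # Nesterov-style step on alpha

def lr_update_adam(alpha, hypergrad, state, beta=1e-4, b1=0.9, b2=0.999, eps=1e-8):
    """Adam with hypergradient on the learning rate."""
    t = state.get("t", 0) + 1
    m = b1 * state.get("m", 0.0) + (1 - b1) * hypergrad
    v = b2 * state.get("v2", 0.0) + (1 - b2) * hypergrad ** 2
    state.update(t=t, m=m, v2=v)
    m_hat = m / (1 - b1 ** t)                      # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)                      # bias-corrected second moment
    return alpha - beta * m_hat / (np.sqrt(v_hat) + eps)
```

In the naming scheme described next, these two updates correspond to the -SGDNlop and -Adamlop learning-rate optimizers respectively.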

The naming convention used is **{model optimizer}op-{learning rate optimizer}lop**; following this, **{model optimizer}op-SGDNlop** denotes the case where the learning-rate optimizer is hypergradient descent with momentum, and **{model optimizer}op-Adamlop** the case where it is Adam with hypergradient.

The new optimizers and the hypergradient-descent baselines against which their performance is compared are:
- SGDop-SGDNlop, with baseline SGD-HD (i.e. SGDop-SGDlop)
- SGDNop-SGDNlop, with baseline SGDN-HD (i.e. SGDNop-SGDlop)
- Adamop-Adamlop, with baseline Adam-HD (i.e. Adamop-SGDlop)

When evaluated against their hypergradient-descent baselines, the new optimizers provide better generalization, faster convergence, and better training stability (less sensitivity to the initially chosen learning rate).

Motivation
---
The alpha_0 (initial learning rate) and beta (hypergradient learning rate) configurations for the new optimizers are kept the same as those of the respective baselines from the paper (see [run.sh](https://github.com/harshalmittal4/HD_variants/blob/master/run.sh) for details). The results show that the new optimizers perform better for all three models (VGGNet, LogReg, MLP). A more detailed description of the optimizers can be found in the project report [here](https://github.com/harshalmittal4/Hypergradient_variants/blob/master/Hypergradient_optimizers.pdf).


*(Figure: see `plots/` for the experiment plots.)* Behavior of the optimizers compared with their hypergradient-descent baselines. Columns, left to right: logistic regression on MNIST, multi-layer neural network on MNIST, and VGGNet on CIFAR-10.

Project Structure
---
The project is organised as follows:

```bash
.
├── hypergrad/
│   ├── __init__.py
│   ├── sgd_Hd.py              # model op. SGD, l.r. optimizer hypergradient descent (original)
│   └── adam_Hd.py             # model op. Adam, l.r. optimizer hypergradient descent (original)
├── op_sgd_lop_sgdn.py         # model op. SGD, l.r. optimizer hypergradient descent with momentum
├── op_sgd_lop_adam.py         # model op. SGD, l.r. optimizer Adam with hypergradient
├── op_adam_lop_sgdn.py        # model op. Adam, l.r. optimizer hypergradient descent with momentum
├── op_adam_lop_adam.py        # model op. Adam, l.r. optimizer Adam with hypergradient
├── vgg.py
├── train.py
├── test/                      # results of the experiments
├── plot_src/
├── plots/                     # experiment plots
├── run.sh                     # to run the experiments
│
│   # The files below are generated after running the experiments:
├── {model}_{optimizer}_{beta}_epochs{X}.pth     # model checkpoint
└── test/{model}/{alpha}_{beta}/{optimizer}.csv  # experiment results
```

Experiments
---
The experiment configurations (hyperparameters alpha_0 and beta) for the optimizers and the three model classes are defined in [run.sh](https://github.com/harshalmittal4/HD_variants/blob/master/run.sh). The experiments for the new optimizers follow the same settings as their hypergradient-descent counterparts: LogReg (20 epochs on MNIST), MLP (100 epochs on MNIST), and VGGNet (200 epochs on CIFAR-10).
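
For quick reference, the per-model settings described above can be summarized as follows; this is an illustrative summary only, and the authoritative alpha_0 / beta grids live in run.sh rather than here.

```python
# Illustrative summary of the experiment settings described above
# (model keys are ours; alpha_0 / beta values are defined in run.sh).
EXPERIMENTS = {
    "logreg": {"dataset": "MNIST",    "epochs": 20},
    "mlp":    {"dataset": "MNIST",    "epochs": 100},
    "vggnet": {"dataset": "CIFAR-10", "epochs": 200},
}
```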

## References

1) [Hypergradient Descent (Github repository)](https://github.com/gbaydin/hypergradient-descent)

## Contributors
- [harshalmittal4][2]
- [yashkant][3]
- [Ankit-Dhankhar][4]

[1]:https://arxiv.org/pdf/1703.04782.pdf
[2]:https://github.com/harshalmittal4
[3]:https://github.com/yashkant
[4]:https://github.com/Ankit-Dhankhar