{"id":13757238,"url":"https://github.com/awslabs/adatune","last_synced_at":"2025-05-10T05:31:51.814Z","repository":{"id":46824759,"uuid":"193592210","full_name":"awslabs/adatune","owner":"awslabs","description":"Gradient based Hyperparameter Tuning library in PyTorch","archived":false,"fork":false,"pushed_at":"2020-07-17T15:41:27.000Z","size":1002,"stargazers_count":289,"open_issues_count":2,"forks_count":32,"subscribers_count":12,"default_branch":"master","last_synced_at":"2025-05-09T23:39:27.506Z","etag":null,"topics":["automl","deep-learning","hyperparameter-tuning","learning-rate-scheduling","machine-learning","neural-networks","pytorch"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/awslabs.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE.txt","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-06-24T22:57:19.000Z","updated_at":"2024-11-09T19:27:19.000Z","dependencies_parsed_at":"2022-09-23T05:14:02.870Z","dependency_job_id":null,"html_url":"https://github.com/awslabs/adatune","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/awslabs%2Fadatune","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/awslabs%2Fadatune/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/awslabs%2Fadatune/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/awslabs%2Fadatune/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/awslabs","download_url":"https://codeload.
github.com/awslabs/adatune/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253371072,"owners_count":21897998,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["automl","deep-learning","hyperparameter-tuning","learning-rate-scheduling","machine-learning","neural-networks","pytorch"],"created_at":"2024-08-03T12:00:30.288Z","updated_at":"2025-05-10T05:31:51.375Z","avatar_url":"https://github.com/awslabs.png","language":"Python","funding_links":[],"categories":["2.For Experiment","Libraries"],"sub_categories":["Hyperparameter Tuning"],"readme":"AdaTune\n=======\n\nAdaTune is a library to perform gradient-based hyperparameter tuning for training deep neural networks. AdaTune currently supports tuning of the `learning_rate` parameter, but some of the methods implemented here can be extended to other hyperparameters like `momentum` or `weight_decay`. AdaTune provides the following gradient-based hyperparameter tuning algorithms: [HD](https://arxiv.org/abs/1703.04782), [RTHO](http://proceedings.mlr.press/v70/franceschi17a.html) and our newly proposed algorithm, [MARTHE](https://arxiv.org/abs/1910.08525). The repository also contains other commonly used non-adaptive `learning_rate` scheduling strategies like staircase-decay, exponential-decay and cosine-annealing-with-restarts. The library is implemented in PyTorch. 
\n\nMathematical Formulation of the Problem\n=======================================\nThe goal of the methods in this package is to automatically compute in an online fashion\na learning rate schedule for stochastic optimization\nmethods (such as SGD) only on the basis of the given learning task, aiming at producing models\nwith associated small validation error.\n\nTheoretically, we have to solve the problem of finding a learning rate (LR) schedule under the framework of  gradient-based hyperparameter optimization.\nIn this sense, we consider as an optimal schedule \u003cimg src=\"https://render.githubusercontent.com/render/math?math=\\eta^*=(\\eta^*_0,\\dots,\\eta^*_{T-1})\\in\\mathbb{R}_{\\geq{0}}^T\"\u003e a solution to the following constrained optimization problem:\n\n\u003cimg src=\"https://render.githubusercontent.com/render/math?math=\\min%20\\{f_T(\\eta)=E(w_T(\\eta)):\\eta\\in\\mathbb{R}_{\\geq{0}}^T\\}\\quad%20s.t.\\quad%20w_0=\\bar{w},\\quad%20w_{t+1}(\\eta)=\\Phi_t(w_{t}(\\eta),\\eta_t)\" title=\"\" /\u003e\n\nfor \u003cimg src=\"https://render.githubusercontent.com/render/math?math=t=\\{0,%20\\dots,%20T-1\\}\" title=\" \" /\u003e,\nwhere \u003cimg src=\"https://render.githubusercontent.com/render/math?math=E:\\mathbb{R}^d\\to\\mathbb{R}_{\\geq{0}}\" title=\" \" /\u003e is an objective function,\n\u003cimg src=\"https://render.githubusercontent.com/render/math?math=\\Phi_t:\\mathbb{R}^d\\times%20\\mathbb{R}_{\\geq0}\\to\\mathbb{R}^d\" title=\" \" /\u003e is a (possibly stochastic) weight update dynamics,\n\u003cimg src=\"https://render.githubusercontent.com/render/math?math=\\bar{w}\\in\\mathbb{R}^d\" title=\" \" /\u003e represents the initial model weights (parameters) and finally\n\u003cimg src=\"https://render.githubusercontent.com/render/math?math=w_t\" title=\" \" /\u003e are the weights after t iterations. 
\n\nWe can think of E as either the training or the validation loss of the model,\nwhile the dynamics \u003cimg src=\"https://render.githubusercontent.com/render/math?math=\\Phi\" title=\" \" /\u003e describe the update rule (such as SGD, SGD-Momentum, or Adam). For example, in the case of SGD,\n\u003cimg src=\"https://render.githubusercontent.com/render/math?math=\\Phi_t(w_t,\\eta_t)=w_t-\\eta_t\\nabla%20L_t(w_t)\" title=\" \" /\u003e with\n\u003cimg src=\"https://render.githubusercontent.com/render/math?math=L_t(w_t)\" title=\" \" /\u003e the (possibly regularized) training loss\non the t-th minibatch. The horizon T should be large enough so that\nthe training error can be effectively minimized, in order to avoid underfitting.\nNote that too large a value of T is not necessarily harmful, since \u003cimg src=\"https://render.githubusercontent.com/render/math?math=\\eta_k=0\" title=\" \" /\u003e\nfor \u003cimg src=\"https://render.githubusercontent.com/render/math?math=k \\geq \\bar{T}\" title=\"\" /\u003e is still a feasible solution, which implements early stopping in\nthis setting.\n\nInstallation\n============\nThe library can be installed (from source) like this:\n\n```\ngit clone https://github.com/awslabs/adatune.git\ncd adatune\npython setup.py install\n```\n\n\nUsage\n=====\nYou can easily replace a non-adaptive `learning_rate` based training procedure with an adaptive one (RTHO/MARTHE) like this:\n\nNon-Adaptive\n------------\n```\nloss.backward()\noptimizer.step()\n```\n\nAdaptive\n--------\n```\nfirst_grad = ag.grad(loss, net.parameters(), create_graph=True, retain_graph=True)\nhyper_optim.compute_hg(net, first_grad)\nfor params, gradients in zip(net.parameters(), first_grad):\n    params.grad = gradients\noptimizer.step()\nhyper_optim.hyper_step(vg.val_grad(net))\n```\n\nThere are two standalone Python scripts provided in the `bin` directory which show in detail how to use the library. 
\n* `baselines.py` - This file contains all the baselines we compared against while developing **MARTHE** (apart from RTHO). The parameters defined in the `cli_def` function are self-explanatory. You can change the `learning_rate` adaptation strategy with the `lr-scheduler` parameter defined there.\n\nFor example, if you want to run cosine-annealing-with-restarts for VGG-11 on CIFAR-10 with SGD-momentum as the optimizer, you can run it like this after the package is installed:\n\n```\npython bin/baselines.py --network vgg --dataset cifar_10 --optimizer sgd --momentum 0.9 --lr-scheduler cyclic\n```\n\n* `rtho.py` - This file contains the implementation of RTHO and MARTHE. MARTHE is a generalization of RTHO and HD. It is implemented together with RTHO because both algorithms share the common component of computing the Hessian-Vector-Product.\n\nIf you want to run MARTHE, HD, or RTHO, you can run it like this:\n\n```\npython bin/rtho.py --network vgg --dataset cifar_10 --optimizer sgd --momentum 0.9 --hyper-lr 1e-8\n```\nIf you pass `mu` as 1.0, the algorithm behaves as RTHO. If you pass `mu` as 0, the algorithm is similar to HD (though the outer gradient will be computed on the validation set instead of the training set). \n\nIn order to automatically set and adapt `mu`, set it to any value less than 0.0. You can also pass a fixed value of `mu` in the range of [0.99, 0.999] if you don't want adaptive behavior for `mu`. \n\nIf you set `alpha` to 0.0, the `hyper-lr` value will stay the same for the whole training procedure.\n\nGenerally, the value of `hyper-lr` should be set at least 3-4 orders of magnitude lower for Adam than for SGD (w/o momentum) for all the gradient-based methods.\n\nIn order to automatically set and adapt `hyper-lr`, set `alpha` to a small positive value (e.g. 1e-6).\n\nYou can use a linear search algorithm to gradually reduce the value of `alpha`, starting from a higher value, until the algorithm no longer diverges. 
Generally, if the value of `alpha` is too high for a given task, the algorithm diverges within the first few epochs.\n\nIn the future, we plan to implement a find_hyper_lr method to automatically handle the linear search over `alpha` as well (completely removing any human intervention from the whole procedure).\n\nBoth scripts have a parameter called `model-loc`, which determines where the trained model is saved. Please create this directory before running the code if you are using a different directory than the current working directory.\n\nNetworks\n========\n`network.py` implements LeNet-5, VGG, ResNet, MLP, Wide-ResNet and DenseNet-BC. So far, experiments have mostly been done with VGG-11 and ResNet-18. \n\nDatasets \u0026 DataLoaders\n======================\nThe list of available Datasets/DataLoaders can be found in `data_loader.py`. The currently supported datasets are MNIST, CIFAR-10 and CIFAR-100. The DataLoader classes will download these datasets when used for the first time. \n\nResults comparing MARTHE and other methods\n==========================================\n![Accuracy CIFAR10 VGG](figures/marthe_comparison.png)\n\n\nFor further details, please refer to the original [paper](https://arxiv.org/abs/1910.08525).\n\nHow to cite\n===========\nThis code is based on the following paper:\n\nDonini et al. 
\"MARTHE: Scheduling the Learning Rate Via Online Hypergradients\"\nIJCAI-PRICAI 2020.\n\nBibTeX citation:\n```\n@inproceedings{donini2020MARTHE,\n  title={MARTHE: Scheduling the Learning Rate Via Online Hypergradients},\n  author={Donini, Michele and Franceschi, Luca and Majumder, Orchid and Pontil, Massimiliano and Frasconi, Paolo},\n  booktitle={Proceedings of the 29th International Joint Conference on Artificial Intelligence and the 17th Pacific Rim International Conference on Artificial Intelligence},\n  year={2020},\n  organization={AAAI Press}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fawslabs%2Fadatune","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fawslabs%2Fadatune","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fawslabs%2Fadatune/lists"}