# awesome-nn-optimization

Awesome list for Neural Network Optimization methods.
## Content

#### Popular Optimization algorithms
- SGD [[Book]](https://www.deeplearningbook.org/contents/optimization.html)
- Momentum [[Book]](https://www.deeplearningbook.org/contents/optimization.html)
- RMSProp [[Book]](https://www.deeplearningbook.org/contents/optimization.html)
- AdaGrad [[Link]](http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf)
- ADAM [[Link]](https://arxiv.org/abs/1412.6980)
- AdaBound [[Link]](https://arxiv.org/abs/1902.09843) [[Github]](https://github.com/Luolc/AdaBound)
- ADAMAX [[Link]](https://arxiv.org/abs/1412.6980)
- NADAM [[Link]](https://openreview.net/pdf?id=OM0jvwB8jIp57ZJjtNEZ)
- ADAMW [[Link]](https://openreview.net/forum?id=rk6qdGgCZ)
- AdaLOMO [Link](https://arxiv.org/pdf/2310.10195.pdf)
- All optimizers list [Awesome-Optimizer](https://github.com/zoq/Awesome-Optimizer)
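
For quick reference, here is a minimal NumPy sketch (not taken from any of the papers above) of the SGD-with-momentum and Adam update rules; the hyperparameter defaults are illustrative only.

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.01, beta=0.9):
    """One SGD-with-momentum update: v <- beta*v + grad; w <- w - lr*v."""
    velocity = beta * velocity + grad
    return w - lr * velocity, velocity

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (t starts at 1) with bias-corrected moment estimates."""
    m = beta1 * m + (1 - beta1) * grad          # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```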

#### Normalization Methods
- BatchNorm [[Link]](https://arxiv.org/abs/1502.03167)
- Weight Norm [[Link]](http://papers.nips.cc/paper/6113-weight-normalization-a-simple-reparameterization-to-accelerate-training-of-deep-neural-networks)
- Spectral Norm [[Link]](https://arxiv.org/abs/1802.05957)
- Cosine Normalization [[Link]](https://arxiv.org/pdf/1702.05870.pdf)
- L2 Regularization versus Batch and Weight Normalization [Link](https://arxiv.org/pdf/1706.05350.pdf)
- WHY GRADIENT CLIPPING ACCELERATES TRAINING: A THEORETICAL JUSTIFICATION FOR ADAPTIVITY [Link](https://openreview.net/pdf?id=BJgnXpVYwS)
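
As a quick illustration of the first entry, a minimal sketch of the BatchNorm forward pass (normalize each feature over the mini-batch, then apply a learned scale and shift). A real layer also tracks running statistics for inference, which this sketch omits.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """BatchNorm over a mini-batch: x has shape (batch, features)."""
    mu = x.mean(axis=0)                      # per-feature batch mean
    var = x.var(axis=0)                      # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize
    return gamma * x_hat + beta              # learned scale and shift
```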

#### On Convexity and Generalization of Neural Networks
- Convex Neural Networks [[Link]](http://papers.nips.cc/paper/2800-convex-neural-networks.pdf)
- Breaking the Curse of Dimensionality with Convex Neural Networks [[Link]](http://jmlr.org/papers/volume18/14-546/14-546.pdf)
- UNDERSTANDING DEEP LEARNING REQUIRES RETHINKING GENERALIZATION [[Link]](https://arxiv.org/pdf/1611.03530.pdf)
- Optimal Control Via Neural Networks: A Convex Approach. [[Link]](https://openreview.net/forum?id=H1MW72AcK7)
- Input Convex Neural Networks [[Link]](https://arxiv.org/pdf/1609.07152.pdf)
- A New Concept of Convex based Multiple Neural Networks Structure. [[Link]](http://www.ifaamas.org/Proceedings/aamas2019/pdfs/p1306.pdf)
- SGD Converges to Global Minimum in Deep Learning via Star-convex Path [[Link]](https://arxiv.org/abs/1901.00451)
- A Convergence Theory for Deep Learning via Over-Parameterization [Link](https://arxiv.org/abs/1811.03962)

#### Continuation Methods and Curriculum Learning
- Curriculum Learning [[Link]](https://ronan.collobert.com/pub/matos/2009_curriculum_icml.pdf)
- SOLVING RUBIK’S CUBE WITH A ROBOT HAND [Link](https://arxiv.org/pdf/1910.07113.pdf)
- Noisy Activation Function [[Link]](http://proceedings.mlr.press/v48/gulcehre16.pdf)
- Mollifying Networks [[Link]](https://arxiv.org/abs/1608.04980)
- Curriculum Learning by Transfer Learning: Theory and Experiments with Deep Networks [Link](https://arxiv.org/pdf/1802.03796.pdf) [Talk](https://vimeo.com/287808087)
- Automated Curriculum Learning for Neural Networks [Link](http://proceedings.mlr.press/v70/graves17a/graves17a.pdf)
- On The Power of Curriculum Learning in Training Deep Networks [Link](https://arxiv.org/pdf/1904.03626.pdf)
- On-line Adaptative Curriculum Learning for GANs [Link](https://arxiv.org/abs/1808.00020)
- Parameter Continuation with Secant Approximation for Deep Neural Networks and Step-up GAN [Link](https://digitalcommons.wpi.edu/etd-theses/1256/)
- HashNet: Deep Learning to Hash by Continuation. [[Link]](https://arxiv.org/abs/1702.00758)
- Learning Combinations of Activation Functions. [[Link]](https://arxiv.org/pdf/1801.09403.pdf)
- Learning and development in neural networks: The importance of starting small (1993) [Link](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.128.4487&rep=rep1&type=pdf)
- Flexible shaping: How learning in small steps helps [Link](https://www.sciencedirect.com/science/article/pii/S0010027708002850)
- Curriculum Labeling: Self-paced Pseudo-Labeling for Semi-Supervised Learning [Link](https://arxiv.org/pdf/2001.06001.pdf)
- RETHINKING CURRICULUM LEARNING WITH INCREMENTAL LABELS AND ADAPTIVE COMPENSATION [Link](https://arxiv.org/pdf/2001.04529.pdf)
- Parameter Continuation Methods for the Optimization of Deep Neural Networks [Link](https://ieeexplore.ieee.org/abstract/document/8999318)
- Denoising Neural Machine Translation Training with Trusted Data and Online Data Selection [Link](https://www.aclweb.org/anthology/W18-6314.pdf)
- Reinforcement Learning based Curriculum Optimization for Neural Machine Translation [Link](https://www.aclweb.org/anthology/N19-1208.pdf)
- EVOLUTIONARY POPULATION CURRICULUM FOR SCALING MULTI-AGENT REINFORCEMENT LEARNING [Link](https://openreview.net/pdf?id=SJxbHkrKDH)
- ENTROPY-SGD: BIASING GRADIENT DESCENT INTO WIDE VALLEYS [Link](https://arxiv.org/pdf/1611.01838.pdf)
- NEIGHBOURHOOD DISTILLATION: ON THE BENEFITS OF NON END-TO-END DISTILLATION [Link](https://arxiv.org/abs/2010.01189)
- LEARNING TO EXECUTE [Link](https://arxiv.org/pdf/1410.4615.pdf)
- Cyclical Annealing Schedule: A Simple Approach to Mitigating KL Vanishing [Link](https://arxiv.org/pdf/1903.10145.pdf)
- Data Parameters: A New Family of Parameters for Learning a Differentiable Curriculum [Link](https://proceedings.neurips.cc/paper/2019/file/926ffc0ca56636b9e73c565cf994ea5a-Paper.pdf)
- Breaking the Curse of Space Explosion: Towards Efficient NAS with Curriculum Search [Link](http://proceedings.mlr.press/v119/guo20b.html)
- Continuation Methods and Curriculum Learning for Learning to Rank [Link](http://www.dei.unipd.it/~ferro/papers/2018/CIKM2018_FLMP.pdf)
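
Most of the curriculum-learning papers above share the same skeleton: score examples by difficulty, then train on an expanding easy-to-hard subset. Below is a minimal, paper-agnostic sketch of that pacing loop; the linear pacing schedule and the difficulty scores are placeholder assumptions, not taken from any specific entry.

```python
import numpy as np

def curriculum_batches(data, difficulty, n_steps, batch_size, rng=None):
    """Yield mini-batches drawn from an easy-to-hard growing subset of the data."""
    rng = rng or np.random.default_rng(0)
    order = np.argsort(difficulty)           # easiest examples first
    for step in range(n_steps):
        frac = min(1.0, 0.2 + 0.8 * step / max(1, n_steps - 1))  # linear pacing
        pool = order[: max(batch_size, int(frac * len(order)))]
        yield data[rng.choice(pool, size=batch_size, replace=False)]
```

The papers differ mainly in how `difficulty` is obtained (teacher loss, input length, human annotation) and how quickly the training pool grows.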

#### On Loss Surfaces and Generalization of Deep Neural Networks
- Exact solutions to the nonlinear dynamics of learning in deep linear neural networks [Link](https://arxiv.org/abs/1312.6120)
- QUALITATIVELY CHARACTERIZING NEURAL NETWORK OPTIMIZATION PROBLEMS [[Link]](https://arxiv.org/pdf/1412.6544.pdf)
- The Loss Surfaces of Multilayer Networks [[Link]](https://arxiv.org/abs/1412.0233)
- Visualizing the Loss Landscape of Neural Nets [[Link]](https://papers.nips.cc/paper/7875-visualizing-the-loss-landscape-of-neural-nets.pdf)
- The Loss Surface Of Deep Linear Networks Viewed Through The Algebraic Geometry Lens [[Link]](https://arxiv.org/pdf/1810.07716.pdf)
- How regularization affects the critical points in linear networks. [[Link]](http://papers.nips.cc/paper/6844-how-regularization-affects-the-critical-points-in-linear-networks.pdf)
- Local minima in training of neural networks [[Link]](https://arxiv.org/abs/1611.06310)
- Necessary and Sufficient Geometries for Gradient Methods [Link](http://papers.nips.cc/paper/9325-necessary-and-sufficient-geometries-for-gradient-methods)
- Fine-grained Optimization of Deep Neural Networks [Link](http://papers.nips.cc/paper/8425-fine-grained-optimization-of-deep-neural-networks)
- SCORE-BASED GENERATIVE MODELING THROUGH STOCHASTIC DIFFERENTIAL EQUATIONS [Link](https://openreview.net/pdf?id=PxTIG12RRHS)
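
A simple tool behind several of these papers (e.g. the linear-interpolation experiments in "Qualitatively Characterizing Neural Network Optimization Problems") is evaluating the loss along a straight line between two parameter vectors. A minimal sketch, assuming a generic `loss_fn` over flattened parameter vectors:

```python
import numpy as np

def interpolation_profile(loss_fn, theta_init, theta_final, n_points=25):
    """Evaluate the loss along the straight line between two parameter vectors,
    as in linear-interpolation loss-landscape probes."""
    alphas = np.linspace(0.0, 1.0, n_points)
    losses = np.array([loss_fn((1 - a) * theta_init + a * theta_final) for a in alphas])
    return alphas, losses
```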

#### Dynamics, Bifurcations and RNNs difficulty to train
- Deep Equilibrium Models [Link](http://papers.nips.cc/paper/8358-deep-equilibrium-models.pdf)
- Bifurcations of Recurrent Neural Networks in Gradient Descent Learning [[Link]](https://pdfs.semanticscholar.org/b579/27b713a6f9b73c7941f99144165396483478.pdf)
- On the difficulty of training recurrent neural networks [[Link]](http://proceedings.mlr.press/v28/pascanu13.pdf)
- Understanding and Controlling Memory in Recurrent Neural Networks [[Link]](https://arxiv.org/pdf/1902.07275.pdf)
- Dynamics and Bifurcation of Neural Networks [[Link]](https://pdfs.semanticscholar.org/a413/4a36fef5ef55d0ff7dae029d6b8f55140cf7.pdf)
- Context Aware Machine Learning [[Link]](https://arxiv.org/pdf/1901.03415.pdf)
- The trade-off between long-term memory and smoothness for recurrent networks [[Link]](https://arxiv.org/pdf/1906.08482.pdf)
- Dynamical complexity and computation in recurrent neural networks beyond their fixed point [[Link]](https://www.nature.com/articles/s41598-018-21624-2.pdf)
- Bifurcations in discrete-time neural networks: controlling complex network behaviour with inputs [[Link]](https://pub.uni-bielefeld.de/record/2302580)
- Interpreting Recurrent Neural Networks Behaviour via Excitable Network Attractors [[Link]](https://link.springer.com/article/10.1007/s12559-019-09634-2#Sec11)
- Bifurcation analysis of a neural network model [Link](https://link.springer.com/article/10.1007/BF00203668)
- A Differentiable Physics Engine for Deep Learning in Robotics [Link](https://www.frontiersin.org/articles/10.3389/fnbot.2019.00006/full)
- Deep learning for universal linear embeddings of nonlinear dynamics [Link](https://arxiv.org/pdf/1712.09707.pdf)
- Deep Hidden Physics Models: Deep Learning of Nonlinear Partial Differential Equations [Link](http://www.jmlr.org/papers/volume19/18-046/18-046.pdf)
- Analysis of gradient descent learning algorithms for multilayer feedforward neural networks [Link](https://ieeexplore.ieee.org/abstract/document/203921)
- A dynamical model for the analysis and acceleration of learning in feedforward networks [Link](https://www.sciencedirect.com/science/article/abs/pii/S0893608001000521)
- A bio-inspired bistable recurrent cell allows for long-lasting memory [Link](https://arxiv.org/abs/2006.05252)
- Equilibrium Propagation: Bridging the Gap between Energy-Based Models and Backpropagation [Link](https://www.frontiersin.org/articles/10.3389/fncom.2017.00024/full)
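
One practical takeaway from "On the difficulty of training recurrent neural networks" is clipping the global gradient norm to tame exploding gradients. A minimal NumPy sketch of that remedy (the threshold value is an illustrative assumption):

```python
import numpy as np

def clip_gradient_norm(grads, max_norm=1.0):
    """Rescale a list of gradient arrays so their global L2 norm is at most max_norm,
    the exploding-gradient remedy discussed by Pascanu et al."""
    total_norm = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads]
```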

#### Poor Local Minima and Sharp Minima
- Adding One Neuron Can Eliminate All Bad Local Minima [Link](https://papers.nips.cc/paper/7688-adding-one-neuron-can-eliminate-all-bad-local-minima.pdf)
- Deep Learning without Poor Local Minima [Link](https://papers.nips.cc/paper/6112-deep-learning-without-poor-local-minima.pdf)
- Elimination of All Bad Local Minima in Deep Learning [Link](https://arxiv.org/pdf/1901.00279.pdf)
- How to escape saddle points efficiently. [Link](https://arxiv.org/pdf/1703.00887.pdf)
- Depth with Nonlinearity Creates No Bad Local Minima in ResNets [Link](https://arxiv.org/abs/1810.09038)
- Sharp Minima Can Generalize For Deep Nets [Link](https://arxiv.org/pdf/1703.04933.pdf)
- Asymmetric Valleys: Beyond Sharp and Flat Local Minima [Link](https://papers.nips.cc/paper/2019/file/01d8bae291b1e4724443375634ccfa0e-Paper.pdf)
- A Reparameterization-Invariant Flatness Measure for Deep Neural Networks [Link](https://arxiv.org/pdf/1912.00058.pdf)
- A Simple Weight Decay Can Improve Generalization [Link](https://papers.nips.cc/paper/1991/file/8eefcfdf5990e441f0fb6f3fad709e21-Paper.pdf)
- Finding Critical and Gradient-Flat Points of Deep Neural Network Loss Functions [Link](https://escholarship.org/content/qt4fw6x5b3/qt4fw6x5b3_noSplash_14ef3ae1644808c863f9b2eb344addcc.pdf?t=qhtt5i)
- The Loss Surface Of Deep Linear Networks Viewed Through The Algebraic Geometry Lens [Link](https://arxiv.org/pdf/1810.07716.pdf)
- Theoretical Issues in Deep Networks: Approximation, Optimization and Generalization [Link](https://arxiv.org/pdf/1908.09375.pdf)
- Flatness is a False Friend [Link](https://arxiv.org/pdf/2006.09091.pdf)
- Are Saddles Good Enough for Deep Learning? [Link](https://www.researchgate.net/publication/317399405_Are_Saddles_Good_Enough_for_Deep_Learning)
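
Several of these papers quantify "sharpness" as how quickly the loss rises under small weight perturbations. A crude, paper-agnostic probe is sketched below; note that "Sharp Minima Can Generalize For Deep Nets" argues such measures are not reparameterization-invariant, so treat the number as a diagnostic rather than a generalization guarantee.

```python
import numpy as np

def sharpness_probe(loss_fn, theta, radius=1e-2, n_samples=20, rng=None):
    """Average loss increase under random weight perturbations of a fixed radius."""
    rng = rng or np.random.default_rng(0)
    base = loss_fn(theta)
    rises = []
    for _ in range(n_samples):
        d = rng.normal(size=theta.shape)
        d *= radius / np.linalg.norm(d)      # project onto the sphere of given radius
        rises.append(loss_fn(theta + d) - base)
    return float(np.mean(rises))
```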

#### Initialization of Neural Network
- Deep learning course notes [Link](https://www.deeplearning.ai/ai-notes/initialization/)
- On the importance of initialization and momentum in deep learning [Link](http://proceedings.mlr.press/v28/sutskever13.html)
- The Break-Even Point on Optimization Trajectories of Deep Neural Networks [Link](https://arxiv.org/abs/2002.09572)
- THE EARLY PHASE OF NEURAL NETWORK TRAINING [Link](https://research.fb.com/wp-content/uploads/2020/02/The-Early-Phase-of-Neural-Network-Training.pdf?)
- One ticket to win them all: generalizing lottery ticket initializations across datasets and optimizers [Link](http://papers.nips.cc/paper/8739-one-ticket-to-win-them-all-generalizing-lottery-ticket-initializations-across-datasets-and-optimizers.pdf)
- PCA-Initialized Deep Neural Networks Applied To Document Image Analysis [Link](https://arxiv.org/abs/1702.00177)
- Understanding the difficulty of training deep feedforward neural networks [Link](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf)
- Unitary Evolution of RNNs [Link](https://arxiv.org/abs/1511.06464)
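
As a reference point for the Glorot & Bengio paper above, a minimal sketch of Xavier (uniform) initialization, alongside the common He variant for ReLU networks (the latter is not listed above but is a standard companion):

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng=None):
    """Glorot/Xavier uniform initialization: variance scaled by fan_in + fan_out."""
    rng = rng or np.random.default_rng(0)
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def he_normal(fan_in, fan_out, rng=None):
    """He initialization for ReLU networks: std = sqrt(2 / fan_in)."""
    rng = rng or np.random.default_rng(0)
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))
```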

#### Momentum in Optimization
- RETHINKING THE HYPERPARAMETERS FOR FINE-TUNING [Link](https://openreview.net/pdf?id=B1g8VkHFPH)
- Momentum Residual Neural Networks [Link](https://proceedings.mlr.press/v139/sander21a.html)
- Smooth momentum: improving lipschitzness in gradient descent [Link](https://doi.org/10.1007/s10489-022-04207-7)
- Momentum-based Weight Interpolation of Strong Zero-Shot Models for Continual Learning [Link](https://arxiv.org/pdf/2211.03186.pdf)
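
The last entry builds on the general idea of keeping a slowly moving average of the weights. A generic exponential-moving-average sketch is given below; it is not that paper's exact interpolation scheme, just the underlying primitive.

```python
def ema_update(avg_params, new_params, momentum=0.99):
    """Momentum-style exponential moving average of a list of weight arrays."""
    return [momentum * a + (1.0 - momentum) * p for a, p in zip(avg_params, new_params)]
```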

#### Batch Size Optimization
- ON LARGE-BATCH TRAINING FOR DEEP LEARNING: GENERALIZATION GAP AND SHARP MINIMA [Link](https://arxiv.org/pdf/1609.04836.pdf)
- Revisiting Small Batch Training for Deep Neural Networks [Link](https://arxiv.org/abs/1804.07612)
- LARGE BATCH TRAINING OF CONVOLUTIONAL NETWORKS [Link](https://arxiv.org/pdf/1708.03888.pdf)
- Large Batch Optimization for Deep Learning: Training BERT in 76 minutes [Link](https://arxiv.org/abs/1904.00962)
- DON’T DECAY THE LEARNING RATE, INCREASE THE BATCH SIZE [Link](https://arxiv.org/abs/1711.00489)
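
A heuristic that recurs in these large-batch papers is scaling the learning rate with the batch size (and, per "Don't Decay the Learning Rate, Increase the Batch Size", trading learning-rate decay for batch-size growth). A minimal sketch of the linear-scaling rule; the baseline values in the example are illustrative only.

```python
def scale_lr_for_batch_size(base_lr, base_batch_size, new_batch_size):
    """Linear-scaling heuristic: grow the learning rate proportionally to the batch size."""
    return base_lr * new_batch_size / base_batch_size

# Example: a baseline of lr=0.1 at batch size 256 becomes lr=0.4 at batch size 1024.
print(scale_lr_for_batch_size(0.1, 256, 1024))
```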

#### Degeneracy of Neural Networks
- Exact solutions to the nonlinear dynamics of learning in deep linear neural networks [Link](https://arxiv.org/pdf/1312.6120.pdf)
- Avoiding pathologies in very deep networks [Link](https://arxiv.org/abs/1402.5836)
- Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice [Link](https://arxiv.org/abs/1711.04735)
- SKIP CONNECTIONS ELIMINATE SINGULARITIES [Link](https://openreview.net/pdf?id=HkwBEMWCZ)
- How degenerate is the parametrization of neural networks with the ReLU activation function? [Link](https://arxiv.org/pdf/1905.09803.pdf)
- Theory of Deep Learning III: explaining the non-overfitting puzzle [Link](https://cbmm.mit.edu/sites/default/files/publications/CBMM-Memo-073v2_0.pdf)
- Provable Benefit of Orthogonal Initialization in Optimizing Deep Linear Networks [Link](https://openreview.net/forum?id=rkgqN1SYvr)
- Understanding Deep Learning: Expected Spanning Dimension and Controlling the Flexibility of Neural Networks [Link](https://www.frontiersin.org/articles/10.3389/fams.2020.572539/full)
- The Loss Surface Of Deep Linear Networks Viewed Through The Algebraic Geometry Lens [Link](https://arxiv.org/pdf/1810.07716.pdf)
- PYHESSIAN: Neural Networks Through the Lens of the Hessian [Link](https://arxiv.org/pdf/1912.07145.pdf)
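
PyHessian (last entry) and similar Hessian-spectrum tools are built on the Hessian-vector product computed via double backpropagation (Pearlmutter's trick). A minimal PyTorch sketch of that primitive, assuming `params` and `vec` are matching lists of tensors; the actual PyHessian implementation differs in its details.

```python
import torch

def hessian_vector_product(loss, params, vec):
    """Compute H @ vec via double backprop: differentiate the gradient-vector dot product."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    dot = sum((g * v).sum() for g, v in zip(grads, vec))
    return torch.autograd.grad(dot, params, retain_graph=True)
```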

#### Convergence Analysis in Deep Learning
- A CONVERGENCE ANALYSIS OF GRADIENT DESCENT FOR DEEP LINEAR NEURAL NETWORKS [Link](https://openreview.net/pdf?id=SkMQg3C5K7)
- A Convergence Theory for Deep Learning via Over-Parameterization [Link](http://proceedings.mlr.press/v97/allen-zhu19a/allen-zhu19a.pdf)
- Convergence Analysis of Homotopy-SGD for Non-Convex Optimization [Link](https://openreview.net/forum?id=Twf5rUVeU-I)

#### Multi-Task Learning with curricula
- Learning the Curriculum with Bayesian Optimization for Task-Specific Word Representation Learning. [Link](https://www.aclweb.org/anthology/P16-1013.pdf)
- Learning a Multitask Curriculum for Neural Machine Translation. [Link](https://arxiv.org/pdf/1908.10940.pdf)
- Self-paced Curriculum Learning. [Link](http://www.cs.cmu.edu/~lujiang/camera_ready_papers/AAAI_SPCL_2015.pdf)
- Curriculum Learning of Multiple Tasks. [Link](http://openaccess.thecvf.com/content_cvpr_2015/papers/Pentina_Curriculum_Learning_of_2015_CVPR_paper.pdf)

#### Constrained Optimization for Deep Learning
- A Primal-Dual Formulation for Deep Learning with Constraints [Link](https://papers.nips.cc/paper/9385-a-primal-dual-formulation-for-deep-learning-with-constraints.pdf)

#### Reinforcement Learning and Curriculum
- Object-Oriented Curriculum Generation for Reinforcement Learning [Link](http://ifaamas.org/Proceedings/aamas2018/pdfs/p1026.pdf)
- Teacher-Student Curriculum Learning [Link](https://arxiv.org/abs/1707.00183)

#### Tutorials, Surveys and Blogs
- Curriculum Learning: A Survey [Link](https://arxiv.org/pdf/2101.10382.pdf)
- A Comprehensive Survey on Curriculum Learning [Link](https://arxiv.org/pdf/2010.13166.pdf)
- Off the Convex Path [[Blog]](https://www.offconvex.org/)
- An overview of gradient descent optimization algorithms [[Link]](https://arxiv.org/pdf/1609.04747.pdf)
- Review of second-order optimization techniques in artificial neural networks backpropagation [Link](https://iopscience.iop.org/article/10.1088/1757-899X/495/1/012003/pdf)
- Linear Algebra and data [Link](https://github.com/harsh306/ML_Notes/blob/master/linear_algebra.md)
- Why Momentum Really Works [[Blog]](https://distill.pub/2017/momentum/)
- Optimization [[Book]](https://www.deeplearningbook.org/contents/optimization.html)
- Optimization for deep learning: theory and algorithms [Link](https://arxiv.org/pdf/1912.08957.pdf)
- Generalization Error in Deep Learning [Link](https://arxiv.org/pdf/1808.01174.pdf)
- Automatic Differentiation in Machine Learning: a Survey [Link](https://arxiv.org/pdf/1502.05767.pdf)
- Curriculum Learning for Reinforcement Learning Domains: A Framework and Survey [Link](https://arxiv.org/pdf/2003.04960.pdf)
- Automatic Curriculum Learning For Deep RL: A Short Survey [Link](https://arxiv.org/abs/2003.04664)
- The Generalization Mystery: Sharp vs Flat Minima [Link](https://www.inference.vc/sharp-vs-flat-minima-are-still-a-mystery-to-me/)

#### Contributing
If you've found any informative resources that you think belong here, be sure to submit a pull request or create an issue!

##### If you find this list helpful, consider buying me a coffee :)

- [![ko-fi](https://www.ko-fi.com/img/githubbutton_sm.svg)](https://ko-fi.com/F1F02R7JR)
- Or send me $2-4 on my Venmo account [@HARSHNILESH-PATHAK](https://venmo.com/HARSHNILESH-PATHAK)