https://github.com/rdspring1/count-sketch-optimizers

A compressed adaptive optimizer for training large-scale deep learning models using PyTorch

# Count-Sketch Optimizers
[Compressing Gradient Optimizers via Count-Sketches](http://proceedings.mlr.press/v97/spring19a/spring19a.pdf)

An ICML 2019 paper by Ryan Spring, Anastasios Kyrillidis, Vijai Mohan, and Anshumali Shrivastava.
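
The paper's central idea is to replace the dense auxiliary state of an adaptive optimizer (for example, the RMSprop second-moment estimate) with a count-min sketch whose size is a small fraction of the parameter count. The PyTorch snippet below is a minimal illustration of that data structure only; the class and method names are hypothetical and do not reflect the repository's actual implementation.

```python
# Minimal count-min sketch over non-negative per-parameter statistics.
# Hypothetical illustration, not the repository's implementation.
import torch

class CountMinSketch:
    def __init__(self, width, depth=3, device="cpu"):
        # depth x width table of cells; memory is depth * width floats,
        # independent of the number of model parameters
        self.table = torch.zeros(depth, width, device=device)
        self.width = width
        # random parameters for a simple universal hash per row
        self.a = torch.randint(1, 2**31 - 1, (depth, 1), device=device)
        self.b = torch.randint(0, 2**31 - 1, (depth, 1), device=device)

    def _buckets(self, idx):
        # idx: 1-D LongTensor of parameter indices -> (depth, len(idx)) bucket ids
        return (self.a * idx.unsqueeze(0) + self.b) % self.width

    def update(self, idx, values):
        # accumulate `values` into the cell chosen by every hash row
        buckets = self._buckets(idx)
        for row in range(self.table.size(0)):
            self.table[row].index_add_(0, buckets[row], values)

    def query(self, idx):
        # count-min estimate: minimum over rows (valid for non-negative values)
        buckets = self._buckets(idx)
        estimates = torch.stack([self.table[row][buckets[row]]
                                 for row in range(self.table.size(0))])
        return estimates.min(dim=0).values
```

Because every cell is shared by many parameters, each row can only overestimate a non-negative statistic, which is why `query` takes the minimum across rows.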

# BERT-Large Training Results
Trained with activation checkpointing and mixed-precision (FP16) training on NVIDIA V100 DGX-1 servers.

| BERT-Large | Adam | Count-Min Sketch (CMS) - RMSprop |
| -------------------- | -------------- | -------------------------------- |
| Time (Days) | **5.32** | 5.52 |
| Size (MB) | 7,097 | **5,133** |
| Test Perplexity | **4.04** | 4.18 |

![Convergence Rate - Adam, CMS-RMSprop](/paper/BERT_Large_Convergence.png)
![Faster convergence rate with larger batch size - CMS-RMSprop](/paper/BERT_Large_Batch_Size.png)

# Instructions
1. Install the requirements listed below.
2. Add the `optimizers` folder to `$PYTHONPATH` (see the usage sketch below).
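
With the `optimizers` folder on `$PYTHONPATH`, the optimizers are intended to plug into an ordinary PyTorch training loop in place of `torch.optim`. The snippet below is a hypothetical usage sketch: the count-sketch optimizer import is left as a placeholder (check the `optimizers` folder for the real module and class names), and `torch.optim.RMSprop` is used as a stand-in so the sketch runs as written.

```python
import torch
import torch.nn as nn

# import sys; sys.path.append("optimizers")   # alternative to exporting $PYTHONPATH
# from <module> import <CMSOptimizer>         # placeholder; see the optimizers/ folder

model = nn.Linear(1024, 10)
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3)  # stand-in so this runs
# optimizer = CMSOptimizer(model.parameters(), lr=1e-3)       # intended drop-in usage

for step in range(3):
    x = torch.randn(32, 1024)
    y = torch.randint(0, 10, (32,))
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()
```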

# Requirements
1. torch
2. torchvision
3. cupy
4. pynvrtc

# Examples
1. ImageNet - ResNet-18
2. LM1B - Transformer / LSTM
3. Wikitext-2 - LSTM

# Dense Layer Support
We support compressing the dense layers of the neural network without relying on update sparsity. During training, a single fused CUDA kernel updates the auxiliary variables and applies the gradient update for each parameter. The dense kernel is functionally equivalent to the sparse kernel; the main difference is that we avoid materializing the auxiliary variables for the dense layers in global memory and instead keep them in the shared memory of the GPU streaming multiprocessor. Without this key feature, our approach would not save any GPU memory for the dense layers. In the sparse case, we assume that the number of non-zero gradient updates is significantly smaller than the size of the auxiliary variables. (See `dense_exp_cms.py` for more details.)
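
As a rough illustration of the data flow described above (and assuming the hypothetical `CountMinSketch` structure sketched earlier in this README), the eager-mode Python below treats every parameter as active and never allocates a per-parameter second-moment tensor. The repository's dense path performs these same steps inside one fused CUDA kernel, holding the intermediate auxiliary values in shared memory rather than global memory, which is where the memory saving for dense layers comes from.

```python
import torch

def fused_dense_step(param, grad, sketch, lr=1e-3, beta=0.99, eps=1e-8):
    # Treat every parameter as "active": the only auxiliary memory is the
    # sketch table itself, never a tensor the size of `param`.
    idx = torch.arange(param.numel(), device=param.device)
    g = grad.reshape(-1)
    sketch.table.mul_(beta)                       # decay the running statistic in place
    sketch.update(idx, (1.0 - beta) * g.pow(2))   # accumulate squared gradients
    v_hat = sketch.query(idx)                     # approximate second moment per parameter
    param.data.add_((-lr * g / (v_hat.sqrt() + eps)).reshape(param.shape))
```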

# References
1. [Transformer Architecture - Nvidia Megatron Language Model](https://github.com/NVIDIA/Megatron-LM)
2. [Compressing Gradient Optimizers via Count-Sketches (ICML 2019)](http://proceedings.mlr.press/v97/spring19a.html)