https://github.com/lucasbotang/gradnorm
PyTorch implementation of the GradNorm
- Host: GitHub
- URL: https://github.com/lucasbotang/gradnorm
- Owner: LucasBoTang
- License: apache-2.0
- Created: 2022-08-26T04:14:17.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2024-09-04T15:16:04.000Z (6 months ago)
- Last Synced: 2025-02-07T18:13:45.052Z (14 days ago)
- Topics: deep-learning, gradnorm, multitask-learning, python, pytorch
- Language: Jupyter Notebook
- Homepage:
- Size: 1.81 MB
- Stars: 82
- Watchers: 2
- Forks: 5
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# PyTorch GradNorm
This is a PyTorch-based implementation of [GradNorm: Gradient normalization for adaptive loss balancing in deep multitask networks](http://proceedings.mlr.press/v80/chen18a.html), which is a gradient normalization algorithm that automatically balances training in deep multitask models by dynamically tuning gradient magnitudes.
A toy example notebook is available [**here**](https://github.com/LucasBoTang/GradNorm/blob/main/Test.ipynb).
## Algorithm
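In brief, GradNorm learns a weight $w_i$ for each task loss $L_i$ and trains the weighted sum $\sum_i w_i L_i$. At every step it measures the gradient norm $G_W^{(i)} = \lVert \nabla_W\, w_i L_i \rVert_2$ of each weighted task loss with respect to the weights $W$ of a chosen shared layer, computes each task's relative inverse training rate $r_i = \tilde{L}_i / \mathbb{E}_j[\tilde{L}_j]$ with $\tilde{L}_i = L_i(t)/L_i(0)$, and then pulls every $G_W^{(i)}$ toward the common target $\bar{G}_W \cdot r_i^{\alpha}$ (treated as a constant) by taking a gradient step on $L_{\text{grad}} = \sum_i \big| G_W^{(i)} - \bar{G}_W \cdot r_i^{\alpha} \big|$ with respect to the $w_i$ only, afterwards renormalizing so that $\sum_i w_i = T$. The following is a minimal sketch of one such balancing step, written against the paper rather than the code in this repository; the function name, its arguments, and the exact bookkeeping are hypothetical.

```python
import torch

def gradnorm_step(task_losses, initial_losses, weights, shared_weight, alpha, weight_optimizer):
    """One GradNorm balancing step (illustrative sketch, not the repository's gradNorm).

    task_losses:      list of scalar task losses L_i(t), still attached to the graph
    initial_losses:   detached tensor of the task losses L_i(0) from the first step
    weights:          1-D leaf tensor of task weights w_i with requires_grad=True
    shared_weight:    weight tensor W of the chosen shared layer, e.g. net.fc4.weight
    alpha:            restoring-force hyperparameter
    weight_optimizer: optimizer that updates only `weights`
    """
    # G_W^(i): L2 norm of the gradient of each weighted task loss w.r.t. W.
    # create_graph=True keeps the dependence on w_i so that L_grad can later be
    # differentiated with respect to the task weights.
    grad_norms = torch.stack([
        torch.autograd.grad(w_i * L_i, shared_weight,
                            retain_graph=True, create_graph=True)[0].norm(2)
        for w_i, L_i in zip(weights, task_losses)
    ])

    # Relative inverse training rates r_i(t) and the constant target Gbar_W * r_i^alpha.
    with torch.no_grad():
        loss_ratios = torch.stack([L.detach() for L in task_losses]) / initial_losses
        inverse_rates = loss_ratios / loss_ratios.mean()
        target = grad_norms.mean() * inverse_rates ** alpha

    # L_grad = sum_i |G_W^(i) - Gbar_W * r_i^alpha|, minimized w.r.t. the weights only.
    gradnorm_loss = (grad_norms - target).abs().sum()
    weight_optimizer.zero_grad()
    weights.grad = torch.autograd.grad(gradnorm_loss, weights, retain_graph=True)[0]
    weight_optimizer.step()

    # Renormalize so the task weights sum to the number of tasks.
    with torch.no_grad():
        weights *= len(task_losses) / weights.sum()
    return gradnorm_loss.detach()
```

In the repository all of this is wrapped inside the `gradNorm` training loop described under Usage below, so the sketch is only meant to make the roles of the `layer` and `alpha` parameters concrete.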
## Dependencies
- [PyTorch](https://pytorch.org/)
- [NumPy](https://numpy.org/)

## Usage
### Parameters
- net: a multitask network that also computes the per-task losses
- layer: the shared layer (or layers) of the network whose weights GradNorm computes the gradient norms against
- alpha: hyperparameter $\alpha$ that sets the strength of the restoring force
- dataloader: training dataloader
- num_epochs: number of epochs
- lr1: learning rate of multitask loss
- lr2: learning rate of weights
- log: flag to record the task weights and losses during training

### Sample Code
```python
from gradnorm import gradNorm
log_weights, log_loss = gradNorm(net=mtlnet, layer=net.fc4, alpha=0.12, dataloader=dataloader,
num_epochs=100, lr1=1e-5, lr2=1e-4, log=False)
```
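The returned `log_weights` and `log_loss` can then be inspected to see how the task weights and losses evolve. A minimal plotting sketch, assuming the logs end up as arrays of shape `(iterations, num_tasks)` (the actual format in the repository may differ) and that matplotlib is available, which is not listed among the dependencies:

```python
import numpy as np
import matplotlib.pyplot as plt

# Assumed shapes: (num_iterations, num_tasks); the real log format may differ.
log_weights = np.asarray(log_weights)
log_loss = np.asarray(log_loss)

fig, (ax_w, ax_l) = plt.subplots(1, 2, figsize=(10, 4))
for t in range(log_weights.shape[1]):
    ax_w.plot(log_weights[:, t], label=f"task {t}")
    ax_l.plot(log_loss[:, t], label=f"task {t}")
ax_w.set_title("task weights $w_i$")
ax_l.set_title("task losses $L_i$")
ax_w.set_xlabel("iteration")
ax_l.set_xlabel("iteration")
ax_w.legend()
ax_l.legend()
plt.show()
```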
## Toy Example (from Original Paper)

### Data
Consider $T$ regression tasks trained using standard squared loss onto the functions:
$$
f_i (\mathbf{x}) = \sigma_i \tanh \left( ( \mathbf{B} + \epsilon_i ) \mathbf{x} \right)
$$

Inputs have dimension 250 and outputs have dimension 100, while $\mathbf{B}$ and $\epsilon_i$ are constant matrices whose elements are generated IID from $N(0, 10)$ and $N(0, 3.5)$, respectively. Each task therefore shares information in $\mathbf{B}$ but also contains task-specific information $\epsilon_i$. The $\sigma_i$ set the scales of the outputs.
```python
from data import toyDataset
dataset = toyDataset(num_data=10000, dim_features=250, dim_labels=100, scalars=[1,100])
```
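To feed this dataset to `gradNorm` (which expects a `dataloader` argument, as in the sample code above), it can be wrapped in a standard PyTorch `DataLoader`. The batch size below is an illustrative assumption, not a value taken from the repository:

```python
from torch.utils.data import DataLoader

# Batch size and shuffling are illustrative choices, not values from the repo.
dataloader = DataLoader(dataset, batch_size=100, shuffle=True)
```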
### Model
A 4-layer fully-connected ReLU-activated network with 100 neurons per layer is used as the common trunk for the toy example. A final affine transformation layer gives the $T$ task-specific predictions.
```python
from model import fcNet, mtlNet
net = fcNet(dim_features=250, dim_labels=100, n_tasks=2) # fc net with multiple heads
mtlnet = mtlNet(net) # multitask net with task loss
```

### Result (10 Tasks)