https://github.com/zanussbaum/mup-tf

Maximal Update Parameterization in Tensorflow
deep-learning hyperparameter-optimization machine-learning mup python tensorflow transformers

Host: GitHub
URL: https://github.com/zanussbaum/mup-tf
Owner: zanussbaum
License: mit
Created: 2022-10-21T22:50:58.000Z (over 3 years ago)
Default Branch: main
Last Pushed: 2023-02-07T22:43:22.000Z (over 3 years ago)
Last Synced: 2025-02-26T06:49:36.639Z (over 1 year ago)
Topics: deep-learning, hyperparameter-optimization, machine-learning, mup, python, tensorflow, transformers
Language: Python
Homepage: https://arxiv.org/abs/2203.03466
Size: 2.19 MB
Stars: 0
Watchers: 3
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# μP for TensorFlow

This is a very preliminary TensorFlow 2 port of Yang and Hu et al.'s [μP repo](https://github.com/microsoft/mup).

## Installation

To install, either install the package from PyPI or clone the repo and install it locally.

```bash
pip install mup-tf
```

## Install from Source

```bash
git clone https://github.com/zanussbaum/mup-tf.git
cd mup-tf
pip install -e .
```
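
After either install route, a quick way to confirm the package is visible to the current interpreter is to look up its import name without importing it. The sketch below assumes the import name is `mup_tf`, as in the usage example in this README; the helper `is_installed` is a hypothetical name used only for illustration.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found by the import system."""
    # find_spec returns None when no top-level module with this name exists.
    return importlib.util.find_spec(module_name) is not None


# Assumption: the package installs under the import name `mup_tf`.
print("mup_tf importable:", is_installed("mup_tf"))
```

If this prints `False` after an editable install, the usual cause is running `pip` from a different environment than the interpreter.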

## Basic Usage

This example is adapted from the original μP repo.

```python
import tensorflow as tf
from mup_tf import MuReadout, make_base_shapes, set_base_shapes, MuSGD, MuAdam

class MyModel(tf.keras.Model):
    def __init__(self, width, ...):
        ...
        ### In model definition, replace output layer with MuReadout
        # readout = tf.keras.layers.Dense(d_out)
        readout = MuReadout(d_out)
        ### If tying weights with an input Embedding layer, do
        # readout = MuSharedReadout(input_layer.weight)
        ...

    def call(self, ...):
        ...
        ### If using a transformer, make sure to use
        ###   1/d instead of 1/sqrt(d) attention scaling
        # attention_scores = tf.matmul(query, key, transpose_b=True) / d**0.5
        attention_scores = tf.matmul(query, key, transpose_b=True) * 8 / d
        ### We use 8/d instead of 1/d here to be backward compatible
        ###   with 1/d**0.5 when d=64, a common head dimension.
        ...

### Instantiate a base model
base_model = MyModel(width=1)
### Instantiate a "delta" model that differs from the base model
###   in all dimensions ("widths") that one wishes to scale.
### Here it's simple, but e.g., in a Transformer, you may want to scale
###   both nhead and dhead, so the delta model should differ in both.
delta_model = MyModel(width=2)
### Instantiate the target model (the model you actually want to train).
### This should be the same as the base model except
###   the widths could be potentially different.
### In particular, base_model and model should have the same depth.
model = MyModel(width=100)

### Set base shapes
### When `model` has same parameter shapes as `base_model`,
###   `model` behaves exactly the same as `base_model`
###   (which is in TensorFlow's default parametrization).
###   This provides backward compatibility at this particular model size.
###   Otherwise, `model`'s init and LR are scaled by μP.
### IMPORTANT: this should be called as soon as possible,
###   before re-initialization and optimizer definition.
infshapes = set_base_shapes(model, base_model, delta=delta_model)

### Alternatively, one can save the base model shapes in a file
# make_base_shapes(base_model, delta_model, filename)
### and later set base shapes directly from the filename
# set_base_shapes(model, filename)
### This is useful when one cannot fit both
###   base_model and model in memory at the same time

### Replace your custom init, if any
### (the calls in this loop are PyTorch-style, as in the original μP README)
for param in model.parameters():
    ### If initializing manually with fixed std or bounds,
    ### then replace with same function from mup.init
    # torch.nn.init.uniform_(param, -0.1, 0.1)
    mup.init.uniform_(param, -0.1, 0.1)
    ### Likewise, if using
    ###   `xavier_uniform_, xavier_normal_, kaiming_uniform_, kaiming_normal_`
    ### from `torch.nn.init`, replace with the same functions from `mup.init`

### Use the optimizers from `mup_tf` instead of `tf.keras.optimizers`
# optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)
### When using tf.distribute.MirroredStrategy, pass the infshapes
###   (returned by `set_base_shapes` above) to the optimizer,
###   as tensors are reset and the `infshape` attribute is lost.
opt_kwargs = {"infshapes": infshapes}
optimizer = MuSGD(learning_rate=0.1, **opt_kwargs)
```
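
The `8 / d` attention factor in the comments above can be checked with plain arithmetic. The sketch below compares it with the conventional `1 / sqrt(d)` factor for a few head dimensions. The function names and the `base_d` parameter are hypothetical and not part of `mup_tf`; writing the constant `8` as `sqrt(base_d)` with `base_d=64` is an assumption made here for illustration.

```python
import math


def standard_scale(d: int) -> float:
    """Conventional 1/sqrt(d) attention scaling."""
    return 1.0 / math.sqrt(d)


def mup_scale(d: int, base_d: int = 64) -> float:
    """1/d-style scaling, normalized to match 1/sqrt(d) at d == base_d.

    With base_d=64 this equals the 8/d factor used in the example above.
    """
    return math.sqrt(base_d) / d


for d in (32, 64, 128, 256):
    print(f"d={d:4d}  1/sqrt(d)={standard_scale(d):.5f}  8/d={mup_scale(d):.5f}")
```

The two factors coincide at `d = 64`. For larger head dimensions the `8 / d` factor shrinks faster, which is the intended difference between the two scalings.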
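
The roles of the base, delta, and target models described in the comments above can be illustrated without TensorFlow. In the sketch below, a dimension counts as scalable when it differs between the base and delta shapes, and its width multiplier is the target size divided by the base size. This is a simplified illustration of the idea, not the implementation in `mup_tf`; the function `width_multipliers`, the helper `mlp_shapes`, and the parameter names are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Shape = Tuple[int, ...]


def width_multipliers(
    base: Dict[str, Shape],
    delta: Dict[str, Shape],
    target: Dict[str, Shape],
) -> Dict[str, List[Optional[float]]]:
    """For every parameter, return one entry per dimension:
    None for a fixed dimension, target/base for a scalable one."""
    result: Dict[str, List[Optional[float]]] = {}
    for name, base_shape in base.items():
        mults: List[Optional[float]] = []
        for b, dl, t in zip(base_shape, delta[name], target[name]):
            if b == dl:
                # Identical in base and delta: treated as a fixed-size dimension.
                mults.append(None)
            else:
                # Differs between base and delta: treated as a width to scale.
                mults.append(t / b)
        result[name] = mults
    return result


def mlp_shapes(width: int) -> Dict[str, Shape]:
    """Hypothetical kernel shapes (in, out) for a small MLP: 10 inputs, 3 outputs."""
    return {"hidden/kernel": (10, width), "readout/kernel": (width, 3)}


# Mirrors the usage above: base width 1, delta width 2, target width 100.
mults = width_multipliers(mlp_shapes(1), mlp_shapes(2), mlp_shapes(100))
print(mults)
```

When the target shapes equal the base shapes, every scalable multiplier is `1.0`, which corresponds to the backward-compatibility case mentioned in the comments.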
