{"id":13571007,"url":"https://github.com/microsoft/mup","last_synced_at":"2025-05-14T18:07:10.353Z","repository":{"id":39905099,"uuid":"423991420","full_name":"microsoft/mup","owner":"microsoft","description":"maximal update parametrization (µP)","archived":false,"fork":false,"pushed_at":"2024-07-17T11:54:02.000Z","size":17479,"stargazers_count":1506,"open_issues_count":31,"forks_count":100,"subscribers_count":29,"default_branch":"main","last_synced_at":"2025-05-07T23:47:15.959Z","etag":null,"topics":["deep-learning","machine-learning","mup","mutransfer","python","pytorch","transformers"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2203.03466","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/microsoft.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":"SUPPORT.md","governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-11-02T20:36:26.000Z","updated_at":"2025-05-07T21:19:05.000Z","dependencies_parsed_at":"2023-02-17T07:16:14.966Z","dependency_job_id":"e4224e7e-b13a-4aa7-8d5a-585a78e07d32","html_url":"https://github.com/microsoft/mup","commit_stats":{"total_commits":50,"total_committers":8,"mean_commits":6.25,"dds":0.76,"last_synced_commit":"a33ea802bcef1d7744057e34ff00d1a5d7e3d7c4"},"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2Fmup","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2Fmup/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/micros
oft%2Fmup/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2Fmup/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/microsoft","download_url":"https://codeload.github.com/microsoft/mup/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254198515,"owners_count":22030966,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","machine-learning","mup","mutransfer","python","pytorch","transformers"],"created_at":"2024-08-01T14:00:57.449Z","updated_at":"2025-05-14T18:07:05.343Z","avatar_url":"https://github.com/microsoft.png","language":"Jupyter Notebook","readme":"# Maximal Update Parametrization (μP) and Hyperparameter Transfer (μTransfer) \n\n[Paper link](https://arxiv.org/abs/2203.03466)\n|\n[Blog link](https://www.microsoft.com/en-us/research/blog/%C2%B5transfer-a-technique-for-hyperparameter-tuning-of-enormous-neural-networks/)\n|\n[YouTube link](https://www.youtube.com/watch?v=z8-C42mAwBc)\n\nIn [*Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer*](https://arxiv.org/abs/2203.03466), we show that optimal hyperparameters become stable across neural network sizes when we parametrize the model in [maximal update parametrization (μP)](http://arxiv.org/abs/2011.14522).\nThis can be used to tune extremely large neural networks such as large pretrained transformers, as we have done in our work.\nMore generally, μP reduces the fragility and uncertainty when transitioning from exploration to scaling up, 
which are not often talked about explicitly in the deep learning literature.\n\n![](figures/sp_vs_mup_dashed.png)\n\u003cfont size=\"1\"\u003e *Figure above: Training loss against learning rate on Transformers of varying `d_model` trained with Adam.*\u003c/font\u003e \n\n\nμP turns out to be the *unique* \"natural\" parametrization that has this hyperparameter stability property across width, as empirically verified in the gif below on MLPs trained with SGD. Here, across time, we interpolate between PyTorch default and μP's learning rate and initialization scalings (right), and we scale up the width-256 model (log2(width)=8) to width 2^13 = 8192 using this interpolated scaling rule (left).\n\n![](figures/parametrizations.gif)\n\nThis repo contains the source code for the `mup` package, our tool that makes the implementation of μP in PyTorch models effortless and less error-prone.\n\n## Table of Contents\n\n\n  - [Installation](#installation)\n    - [Install From Source](#install-from-source)\n  - [Basic Usage](#basic-usage)\n  - [How `mup` Works Under the Hood](#how-mup-works-under-the-hood)\n  - [Current Limitations](#current-limitations)\n  - [Checking Correctness of Parametrization](#checking-correctness-of-parametrization)\n    - [Coord Check](#coord-check)\n    - [Making Your Own Coord Check Plots](#making-your-own-coord-check-plots)\n    - [Wider is Always Better](#wider-is-always-better)\n  - [Examples](#examples)\n  - [Running Tests](#running-tests)\n  - [The Basic Math](#the-basic-math)\n  - [Contributing](#contributing)\n  - [Trademarks](#trademarks)\n\n## Installation\n\n```\npip install mup\n```\n\n### Install From Source\n\nClone this repo, change to its directory, and do\n```\npip install -r requirements.txt\npip install -e .\n```\n\n## Basic Usage\n\n```Python\nfrom mup import MuReadout, make_base_shapes, set_base_shapes, MuSGD, MuAdam\n\nclass MyModel(nn.Module):\n    def __init__(self, width, ...):\n        ...\n        ### In model definition, 
replace output layer with MuReadout\n        # readout = nn.Linear(width, d_out)\n        readout = MuReadout(width, d_out)\n        ### If tying weights with an input nn.Embedding layer, do\n        # readout = MuSharedReadout(input_layer.weight)\n        ...\n    def forward(self, ...):\n        ...\n        ### If using a transformer, make sure to use\n        ###   1/d instead of 1/sqrt(d) attention scaling\n        # attention_scores = query @ key.T / d**0.5\n        attention_scores = query @ key.T * 8 / d\n        ### We use 8/d instead of 1/d here to be backward compatible\n        ###   with 1/d**0.5 when d=64, a common head dimension.\n        ...\n\n### Instantiate a base model\nbase_model = MyModel(width=1)\n### Optionally, use `torchdistx.deferred_init.deferred_init` to avoid instantiating the parameters\n### Simply install `torchdistx` and use\n# base_model = torchdistx.deferred_init.deferred_init(MyModel, width=1)\n### Instantiate a \"delta\" model that differs from the base model\n###   in all dimensions (\"widths\") that one wishes to scale.\n### Here it's simple, but e.g., in a Transformer, you may want to scale\n###   both nhead and dhead, so the delta model should differ in both.\ndelta_model = MyModel(width=2) # Optionally use `torchdistx` to avoid instantiating\n\n### Instantiate the target model (the model you actually want to train).\n### This should be the same as the base model except \n###   the widths could be potentially different.\n### In particular, base_model and model should have the same depth.\nmodel = MyModel(width=100)\n\n### Set base shapes\n### When `model` has same parameter shapes as `base_model`,\n###   `model` behaves exactly the same as `base_model`\n###   (which is in PyTorch's default parametrization).\n###   This provides backward compatibility at this particular model size.\n###   Otherwise, `model`'s init and LR are scaled by μP.\n### IMPORTANT: this should be called as soon as possible,\n###   before 
re-initialization and optimizer definition.\nset_base_shapes(model, base_model, delta=delta_model)\n\n### Alternatively, one can save the base model shapes in a file\n# make_base_shapes(base_model, delta_model, filename)\n### and later set base shapes directly from the filename\n# set_base_shapes(model, filename)\n### This is useful when one cannot fit both \n###   base_model and model in memory at the same time\n\n### Replace your custom init, if any\nfor param in model.parameters():\n    ### If initializing manually with fixed std or bounds,\n    ### then replace with same function from mup.init\n    # torch.nn.init.uniform_(param, -0.1, 0.1)\n    mup.init.uniform_(param, -0.1, 0.1)\n    ### Likewise, if using\n    ###   `xavier_uniform_, xavier_normal_, kaiming_uniform_, kaiming_normal_`\n    ### from `torch.nn.init`, replace with the same functions from `mup.init`\n\n### Use the optimizers from `mup.optim` instead of `torch.optim`\n# optimizer = torch.optim.SGD(model.parameters(), lr=0.1)\noptimizer = MuSGD(model.parameters(), lr=0.1)\n\n### Then just train normally\n```\n\nNote the base and delta models *do not need to be trained* --- we are only extracting parameter shape information from them.\nTherefore, optionally, we can avoid instantiating these potentially large models by using the `deferred_init` function in `torchdistx`.\nAfter installing [`torchdistx`](https://github.com/pytorch/torchdistx), use `torchdistx.deferred_init.deferred_init(MyModel, **args)` instead of `MyModel(**args)`. 
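Since the base and delta models only supply shape information, the bookkeeping can be sketched in a few lines of plain Python (a simplified illustration with hypothetical helper names, not the actual `mup` API): a dimension counts as "infinite" when it differs between the base and delta shapes, and the Adam learning rate of a hidden weight is divided by `fan_in / base_fan_in`.

```python
# Simplified sketch (NOT the mup API) of the shape bookkeeping:
# base/delta shapes determine which dimensions are "infinite",
# and the Adam LR of a hidden weight shrinks by fan_in / base_fan_in.

def infinite_dims(base_shape, delta_shape):
    # A dimension is "infinite" if it changes between base and delta.
    return [i for i, (b, d) in enumerate(zip(base_shape, delta_shape)) if b != d]

def adam_lr(global_lr, base_shape, delta_shape, target_shape):
    # Assume the (fan_out, fan_in) convention of nn.Linear weights.
    fan_in_idx = 1
    if fan_in_idx not in infinite_dims(base_shape, delta_shape):
        return global_lr  # finite fan_in: LR is width-independent
    width_mult = target_shape[fan_in_idx] / base_shape[fan_in_idx]
    return global_lr / width_mult

# Hidden weight: base width 64 scaled to 1024 -> Adam LR shrinks 16x.
print(adam_lr(1e-3, (64, 64), (128, 128), (1024, 1024)))  # 6.25e-05
# Input weight with fixed fan_in of 10: Adam LR unchanged.
print(adam_lr(1e-3, (64, 10), (128, 10), (1024, 10)))     # 0.001
```

The real package stores this information per parameter as `p.infshape` and applies the scaling inside `MuAdam`/`MuSGD`, as the "How `mup` Works Under the Hood" section explains.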
See [this page](https://pytorch.org/torchdistx/latest/deferred_init.html) for more detail.\nIn the MLP and Transformer examples we provide (not `mutransformers`), you can activate this feature by passing `--deferred_init`.\n\n\n## How `mup` Works Under the Hood\n\n\nBy invoking `set_base_shapes(model, ...)`, each parameter tensor `p` of `model` gets a `p.infshape` attribute that stores, for each of its dimensions, the corresponding base dimension and whether that dimension should be considered `infinite` (i.e. will be scaled up/down, e.g., `d_model` of a Transformer) or `finite` (i.e. will be fixed, e.g., vocabulary size).\nThis information is used in the initializers and optimizers to automatically scale the parameters or learning rates to be compliant with μP.\nFor example, the Adam learning rate of hidden weights `p` is calculated as `globalLR / p.infshape.width_mult()`, where `p.infshape.width_mult()` essentially calculates `fan_in / base_fan_in`.\n\n\n## Current Limitations\n\n- `set_base_shapes(model, ...)` assumes that `model` has just been randomly initialized in the standard way and rescales its parameters using the base shape information so the model is in μP.\n- If you want data parallelism, please use `torch.nn.parallel.DistributedDataParallel` instead of `torch.nn.DataParallel`. This is because the latter removes the attributes the `mup` package adds to each parameter tensor of the model. Also, for performance, PyTorch [recommends the former anyway](https://pytorch.org/docs/stable/notes/cuda.html#cuda-nn-ddp-instead).\n- We scale the learning rate according to μP explicitly by creating refined parameter groups from what is passed to the `mup` optimizer and by manipulating the `lr` attribute in those groups. This is compatible with PyTorch's learning rate schedulers. However, if you roll your own, make sure the scheduler sets the learning rate relative to what is currently in the refined parameter groups. 
The following is an example of what *not* to do and what is OK:\n```python\noptimizer = mup.MuAdam(model.parameters(), lr=1e-3)\nfor pg in optimizer.param_groups:\n  # what NOT to do: setting learning rate absolutely\n  # pg['lr'] = 1e-3 * 2\n  # what is an OK alternative: setting it relatively\n  pg['lr'] *= 2\n```\n- By default, any parameter matrix that has 2 \"infinite\" dimensions (i.e. dimensions that are different from base dimensions) is considered by `mup` to have shape (fan_out, fan_in), i.e., in the forward pass, this matrix multiplies its input on the right. This is the case with all `nn.Linear` weights from PyTorch. If you have a custom parameter, say `W`, that violates this convention, you can manually set `W.infshape.main_idx = 0; W.infshape.main = W.infshape[0]` to let `mup` know that its shape corresponds to (fan_in, fan_out). A similar discussion applies if you have a parameter *tensor* with many dimensions but exactly 2 \"infinite\" dimensions, for which the first is fan_in and the second is fan_out.\n- Currently, [`torch.save` does not save the `infshape` objects attached to each parameter tensor](https://github.com/pytorch/pytorch/issues/72129). 
Before this is fixed, you would have to set base shapes manually after loading a model checkpoint like so:\n```python\nmodel = torch.load('my/model/path.pt')\n# Important: note the flag `rescale_params=False`!\nset_base_shapes(model, 'my/base/shape/path.bsh', rescale_params=False)\n```\n(`set_base_shapes` by default rescales the parameters of `model`, assuming it's freshly initialized by PyTorch, to be consistent with μP.\nThe `rescale_params=False` flag turns off this behavior.)\n\n\n## Checking Correctness of Parametrization\n\n\n### Coord Check\n\nJust like gradient checking is a simple way of verifying the correctness of an autograd implementation, *coordinate checking* is a simple way to verify you have implemented μP correctly: calculate the average size (which we denote in the y-axis below by `l1`) of the coordinates of each activation vector in, and output of, the model, for a few steps of training and a few different widths.\nIf implemented correctly, we should see this `l1` stay stable across many widths; otherwise, the `l1` can blow up or shrink to 0 with width.\n(We are essentially checking desideratum 1 described below.)\n(The `l1` calculates `x.abs().mean()` for each activation vector `x` and is just one measure of the \"average size\" of `x`'s entries; one can also use analogously defined `l2`, `l4`, etc, though they may exhibit greater fluctuation with random seeds.)\n\nFor example, in the following, we plot `width` vs `l1` for 2 steps of training, where t=1 means at initialization, before any gradient update.\nEach curve corresponds to a (pre-)activation vector of a layer or the output of the network.\nThe first set of 3 plots shows an MLP in standard parametrization (SP), trained with Adam.\nWe see that after 1 step of update, the activation/output `l1` values explode with width.\nThis means SP is \"incorrect.\"\n![](coord_checks/sp_mlp_adam_lr0.001_nseeds5_bn0_coord.png)\nWe now do the same for an MLP in maximal update parametrization (μP) (including using 
`mup.optim.MuAdam` instead of `torch.optim.Adam`).\nIn contrast to the above, all curves stay horizontal, indicating that μP is implemented correctly.\n![](coord_checks/μp_mlp_adam_lr0.001_nseeds5_bn0_coord.png)\nWe call this way of checking implementation correctness a *coord check*, short for \"coordinate check.\"\n\n### Making Your Own Coord Check Plots\nWe provide an easy way to implement this check via functions in the `mup.coord_check` module.\nThe workflow typically looks like the following.\n\n```Python\nfrom mup.coord_check import get_coord_data, plot_coord_data\n# construct a dictionary of lazy μP models with differing widths\ndef lazy_model(width):\n    # `set_base_shapes` returns the model\n    return lambda: set_base_shapes(MyMuModel(width), 'my/base/shape/path.bsh')\n    # Note: any custom initialization with `mup.init` would need to\n    # be done inside the lambda as well\nmodels = {64: lazy_model(64), ..., 1024: lazy_model(1024)}\n# make a dataloader with small batch size/seq len\n#   just for testing\ndataloader = ...\n# record data from the model activations over a few steps of training\n# this returns a pandas dataframe\ndf = get_coord_data(models, dataloader)\n# This saves the coord check plots to filename.\nplot_coord_data(df, save_to=filename)\n# If you are in jupyter notebook, you can also do\n#   `plt.show()`\n# to show the plot\n```\nFor example, the `mup.coord_check.example_plot_coord_check` function is implemented this way for toy MLP and CNN models.\n\nIf you see the curves blow up or shrink to 0 with width after a few steps of training, then there's a bug in your μP implementation (did you forget to vary some dimension, like `d_ffn`, in the delta model?).\nIf instead you see the curves converge to the right, then most likely your implementation is correct.\nHowever, there are two typical exceptions to this;\nthe following can shrink to 0 at initialization in μP (at a 1/sqrt(width) rate):\n  - the network output\n  - the attention 
logits in a Transformer\n\nThese are transient, and after a few steps their curves should be roughly flat.\nNevertheless, to remove the discrepancy at init, we recommend\n   - initializing the output layer \n   (should be a `MuReadout` instance) weights to be 0 via\n   the `readout_zero_init=True` option and\n   - initializing the query matrix in a Transformer to 0\n     (this has to be done manually). If symmetry-breaking is desired in the attention logits at init, initialize the (relative) position biases with nonzero variance.\n     \n#### Tips for Coord Check\n\n- Use a large learning rate (larger than you'd use for actual training). This would emphasize any potential exploding coordinates issue, which could be hidden by the initialization if the learning rate is too small.\n- If you reuse a module multiple times in the forward pass, then `mup.get_coord_data` will only record the statistics from the last usage. In this case, for testing purposes, one can wrap different usages with `nn.Identity` modules of different names to distinguish them.\n\n### Wider is Always Better\n\n![](figures/widerbetter.png)\n\nAnother sign that μP has not been implemented correctly is if going wider does worse (on training loss) after some width, at some point during training.\nThe figure above illustrates this in a collection of training curves: (left) the correct implementation should always see performance improve with width, at any point in training; (middle) if you used standard parametrization (SP), sometimes you may see performance improve with width up to some point and then suddenly it becomes worse with wider models; (right) or you may immediately see worsening performance even for narrow models.\n\n## Examples\nSee the `MLP`, `Transformer`, and `ResNet` folders inside `examples/` as well as the tests in `mup/test` for examples.\nPeople familiar with [Huggingface Transformers](https://github.com/huggingface/transformers) may also find the `examples/mutransformers` 
submodule instructive (obtained via `git submodule update --init`), which is also available standalone at [https://github.com/microsoft/mutransformers](https://github.com/microsoft/mutransformers).\n\n## Native Integration With Huggingface\n\nFrustrated that your [Huggingface Transformer](https://github.com/huggingface/transformers) breaks when you scale up? Want to tune hyperparameters for your large multi-GPU [Huggingface Transformer](https://github.com/huggingface/transformers) on a single GPU, right out of the box? If so, please upvote [this GitHub issue](https://github.com/huggingface/transformers/issues/16157)!\n\n\n## Running Tests\nTo run tests, do\n```bash\npython -m mup.test\n```\n\n\n## The Basic Math\n\nμP is designed to satisfy the following desiderata:\n\n\u003e At any time during training\n\u003e 1. Every (pre)activation vector in a network should have Θ(1)-sized coordinates\n\u003e 2. Neural network output should be O(1).\n\u003e 3. All parameters should be updated as much as possible (in terms of scaling in width) without leading to divergence\n\nIt turns out these desiderata uniquely single out μP.\nTo derive μP from them, one needs to carefully consider how the *coordinate size* of a vector Av, resulting from a square matrix A multiplying vector v, depends on those of A and v, when A and v are \"correlated\".\nHere you can think of A as weights and v as an activation vector.\nThis in turn depends on what kind of matrix A is and what kind of vector v is.\nIn the context of training a wide neural network, it turns out we only need to consider vectors that have approximately iid coordinates, and two kinds of matrices: 1) those that look like outer products of such vectors, and 2) random iid matrices.\nThose of type 1 cover things like weight gradients; those of type 2 cover things like weight initialization.\nIf A and v both have entry size Θ(1) and are correlated in ways that arise naturally during training, then we have the following 
table.\n\n|                  | outer product A (type 1) | iid A  (type 2)    |\n|------------------|--------------------------|--------------------|\n| Entry size of Av | Θ(n)                     | Θ(sqrt(n))         |\n\nGiven this table, one can then trace the forward and backward computation of a network to derive μP straightforwardly.\n\nSee [our blog post](https://www.microsoft.com/en-us/research/blog/%C2%B5transfer-a-technique-for-hyperparameter-tuning-of-enormous-neural-networks/) for a gentle primer and [our paper](https://arxiv.org/abs/2203.03466) for details.\n\n\n## Contributing\n\nThis project welcomes contributions and suggestions.  Most contributions require you to agree to a\nContributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us\nthe rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.\n\nWhen you submit a pull request, a CLA bot will automatically determine whether you need to provide\na CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions\nprovided by the bot. You will only need to do this once across all repos using our CLA.\n\nThis project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).\nFor more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or\ncontact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.\n\n## Trademarks\n\nThis project may contain trademarks or logos for projects, products, or services. 
Authorized use of Microsoft \ntrademarks or logos is subject to and must follow \n[Microsoft's Trademark \u0026 Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general).\nUse of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship.\nAny use of third-party trademarks or logos is subject to those third parties' policies.\n","funding_links":[],"categories":["Jupyter Notebook"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmicrosoft%2Fmup","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmicrosoft%2Fmup","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmicrosoft%2Fmup/lists"}