maximal update parametrization (µP)

Overview

Maximal Update Parametrization (μP) and Hyperparameter Transfer (μTransfer)

Paper link | Blog link

In Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer, we show that optimal hyperparameters become stable across neural network sizes when we parametrize the model in maximal update parametrization (μP). This can be used to tune extremely large neural networks such as large pretrained transformers, as we have done in our work. More generally, μP reduces the fragility and uncertainty when transitioning from exploration to scaling up, which are not often talked about explicitly in the deep learning literature.

Figure above: Training loss against learning rate on Transformers of varying d_model trained with Adam.

μP turns out to be the unique "natural" parametrization that has this hyperparameter stability property across width, as empirically verified in the gif below on MLPs trained with SGD. Here, across time, we interpolate between PyTorch default and μP's learning rate and initialization scalings (right), and we scale up the width-256 model (log2(width)=8) to width 2^13 = 8192 using this interpolated scaling rule (left).

This repo contains the source code for the mup package, our tool that makes implementing μP in PyTorch models effortless and less error-prone.

Installation

pip install mup

Install From Source

Clone this repo, change to its directory, and do

pip install -r requirements.txt
pip install -e .

Basic Usage

import torch
import torch.nn as nn
import mup
from mup import MuReadout, MuSharedReadout, make_base_shapes, set_base_shapes, MuSGD, MuAdam

class MyModel(nn.Module):
    def __init__(self, width, ...):
        ...
        ### In model definition, replace output layer with MuReadout
        # readout = nn.Linear(width, d_out)
        readout = MuReadout(width, d_out)
        ### If tying weights with an input nn.Embedding layer, do
        # readout = MuSharedReadout(input_layer.weight)
        ...
    def forward(self, ...):
        ...
        ### If using a transformer, make sure to use
        ###   1/d instead of 1/sqrt(d) attention scaling
        # attention_scores = query @ key.T / d**0.5
        attention_scores = query @ key.T * 8 / d
        ### We use 8/d instead of 1/d here to be backward compatible
        ###   with 1/d**0.5 when d=64, a common head dimension.
        ...

### Instantiate a base model
base_model = MyModel(width=1)
### Optionally, use `device='meta'` to avoid instantiating the parameters
### This requires you to pass the device flag down to all sub-modules
# base_model = MyModel(width=1, device='meta')
### Instantiate a "delta" model that differs from the base model
###   in all dimensions ("widths") that one wishes to scale.
### Here it's simple, but e.g., in a Transformer, you may want to scale
###   both nhead and dhead, so the delta model should differ in both.
delta_model = MyModel(width=2) # Optionally add the `device='meta'` to avoid instantiating

### Instantiate the target model (the model you actually want to train).
### This should be the same as the base model except 
###   the widths could be potentially different.
### In particular, base_model and model should have the same depth.
model = MyModel(width=100)

### Set base shapes
### When `model` has same parameter shapes as `base_model`,
###   `model` behaves exactly the same as `base_model`
###   (which is in PyTorch's default parametrization).
###   This provides backward compatibility at this particular model size.
###   Otherwise, `model`'s init and LR are scaled by μP.
### IMPORTANT: this should be called as soon as possible,
###   before re-initialization and optimizer definition.
set_base_shapes(model, base_model, delta=delta_model)

### Alternatively, one can save the base model shapes in a file
# make_base_shapes(base_model, delta_model, filename)
### and later set base shapes directly from the filename
# set_base_shapes(model, filename)
### This is useful when one cannot fit both 
###   base_model and model in memory at the same time

### Replace your custom init, if any
for param in model.parameters():
    ### If initializing manually with fixed std or bounds,
    ### then replace with same function from mup.init
    # torch.nn.init.uniform_(param, -0.1, 0.1)
    mup.init.uniform_(param, -0.1, 0.1)
    ### Likewise, if using
    ###   `xavier_uniform_, xavier_normal_, kaiming_uniform_, kaiming_normal_`
    ### from `torch.nn.init`, replace with the same functions from `mup.init`

### Use the optimizers from `mup.optim` instead of `torch.optim`
# optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
optimizer = MuSGD(model.parameters(), lr=0.1)

### Then just train normally

Note the base and delta models do not need to be trained --- we are only extracting parameter shape information from them. Therefore, optionally, we can avoid instantiating these potentially large models by passing device='meta' to their constructor. However, you need to make sure that the device flag is appropriately passed down to the constructor of all submodules. Of course, it'd be even better if PyTorch can do this automatically for any existing nn.Module. If you want to see this happen, please upvote this PyTorch issue.
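The 8/d attention scaling mentioned in the usage snippet above can be checked with a little arithmetic. The following is only an illustrative sketch: `sp_scale` and `mup_scale` are hypothetical helper names, not part of mup, and `base_d=64` is the head dimension assumed in the comment.

```python
import math

def sp_scale(d):
    """Standard 1/sqrt(d) attention scaling."""
    return 1.0 / math.sqrt(d)

def mup_scale(d, base_d=64):
    """1/d attention scaling, multiplied by sqrt(base_d) so that it
    coincides with 1/sqrt(d) when d == base_d (8/d for base_d = 64)."""
    return math.sqrt(base_d) / d

# The two scalings agree at d = 64 and diverge as d grows.
for d in (64, 128, 256):
    print(d, sp_scale(d), mup_scale(d))
```

At d=64 both scalings give 0.125, so a model at that head dimension behaves as before; for larger d the μP scaling shrinks faster.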

How mup Works Under the Hood

By invoking set_base_shapes(model, ...), each parameter tensor p of model gets a p.infshape attribute that stores, for each of its dimensions, the corresponding base dimension and whether that dimension should be considered infinite (i.e. will be scaled up/down, e.g., d_model of a Transformer) or finite (i.e. will be fixed, e.g., vocabulary size). This information is used in the initializers and optimizers to automatically scale the parameters or learning rates to be compliant with μP. For example, the Adam learning rate of hidden weights p is calculated as globalLR / p.infshape.width_mult(), where p.infshape.width_mult() essentially calculates fan_in / base_fan_in.
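As a rough illustration of the learning rate rule described above, the sketch below reproduces only the arithmetic globalLR / (fan_in / base_fan_in). The function names are hypothetical stand-ins and not mup's implementation, which reads this information from p.infshape.

```python
def width_mult(fan_in, base_fan_in):
    """Ratio of a hidden weight's fan_in to the base model's fan_in."""
    return fan_in / base_fan_in

def hidden_adam_lr(global_lr, fan_in, base_fan_in):
    """Adam learning rate for a hidden weight (one with two infinite dimensions)."""
    return global_lr / width_mult(fan_in, base_fan_in)

# At the base width the learning rate is unchanged; at 4x the width it is 4x smaller.
print(hidden_adam_lr(1e-3, fan_in=256, base_fan_in=256))
print(hidden_adam_lr(1e-3, fan_in=1024, base_fan_in=256))
```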

Current Limitations

  • set_base_shapes(model, ...) assumes that model has just been randomly initialized in the standard way and rescales its parameters using the base shape information so the model is in μP.
  • If you want data parallelism, please use torch.nn.parallel.DistributedDataParallel instead of torch.nn.DataParallel. This is because the latter removes the attributes the mup package adds to each parameter tensor of the model. Also, for performance, PyTorch recommends the former anyway.
  • We scale the learning rate according to μP explicitly by creating refined parameter groups from what is passed to the mup optimizer and by manipulating the lr attribute in those groups. This is compatible with PyTorch's learning rate schedulers. However, if you roll your own, make sure the scheduler sets the learning rate relative to what is currently in the refined parameter groups. The following is an example of what not to do and what is OK:
optimizer = mup.MuAdam(model.parameters(), lr=1e-3)
for pg in optimizer.param_groups:
  # what NOT to do: setting learning rate absolutely
  # pg['lr'] = 1e-3 * 2
  # what is an OK alternative: setting it relatively
  pg['lr'] *= 2
  • By default, any parameter matrix that has 2 "infinite" dimensions (i.e. dimensions that are different from base dimensions) is considered by mup to have shape (fan_out, fan_in), i.e., in the forward pass, this matrix multiplies its input on the right. This is the case with all nn.Linear weights from PyTorch. If you have a custom parameter, say W, that violates this convention, you can manually set W.infshape.main_idx = 0; W.infshape.main = W.infshape[0] to let mup know that its shape corresponds to (fan_in, fan_out). A similar discussion applies if you have a parameter tensor with many dimensions but exactly 2 "infinite" dimensions, for which the first is fan_in and the second is fan_out.
  • Currently, torch.save does not save the infshape objects attached to each parameter tensor. Before this is fixed, you would have to set base shape manually after loading a model checkpoint like so:
model = torch.load('my/model/path.pt')
# Important: note the flag `rescale_params=False`!
set_base_shapes(model, 'my/base/shape/path.bsh', rescale_params=False)

(set_base_shapes by default rescales the parameters of model, assuming it's freshly initialized by PyTorch, to be consistent with μP. The rescale_params=False flag turns off this behavior.)
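Returning to the learning rate scheduling limitation in the list above: a hand-rolled scheduler can stay compatible by remembering each group's initial learning rate and always scaling relative to it. The sketch below is an assumption-laden illustration, not part of mup: `RelativeSchedule` is a hypothetical class, and plain dicts stand in for `optimizer.param_groups` so that it runs without PyTorch.

```python
class RelativeSchedule:
    """Scales every group's lr by a factor relative to its initial value."""

    def __init__(self, param_groups):
        self.param_groups = param_groups
        # Remember the (already width-scaled) learning rate of each group.
        self.base_lrs = [pg['lr'] for pg in param_groups]

    def set_factor(self, factor):
        # Multiply each group's own base lr; never overwrite with one absolute value.
        for pg, base_lr in zip(self.param_groups, self.base_lrs):
            pg['lr'] = base_lr * factor

# Stand-in for optimizer.param_groups after the groups have been refined:
# the group holding wider weights carries a smaller lr.
groups = [{'lr': 1e-3}, {'lr': 2.5e-4}]
sched = RelativeSchedule(groups)
sched.set_factor(0.5)
print([pg['lr'] for pg in groups])
```

Because each group is scaled from its own base value, the ratio between groups is preserved at every step of the schedule.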

Checking Correctness of Parametrization

Coord Check

Just like gradient checking is a simple way of verifying the correctness of an autograd implementation, coordinate checking is a simple way to verify you have implemented μP correctly: calculate the average size (which we denote in the y-axis below by l1) of the coordinates of each activation vector in, and output of, the model, for a few steps of training and a few different widths. If implemented correctly, then we shall see this l1 stable over many widths; otherwise, the l1 can blow up or shrink to 0 with width. (We are essentially checking desideratum 1 described below.) (The l1 calculates x.abs().mean() for each activation vector x and is just one measure of the "average size" of x's entries; one can also use analogously defined l2, l4, etc, though they may exhibit greater fluctuation with random seeds.)
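To make the "average size" statistic concrete, here is a small sketch. With p=1 it matches x.abs().mean() as stated above; reading l2 and l4 as power means of the absolute coordinates is our assumption about what "analogously defined" means, and `lp_size` is a hypothetical helper, not a mup function.

```python
def lp_size(x, p=1):
    """(mean of |x_i|^p)^(1/p); p=1 is the plain mean absolute coordinate."""
    return (sum(abs(v) ** p for v in x) / len(x)) ** (1.0 / p)

# A toy "activation vector" given as a plain list of floats.
x = [0.5, -1.0, 2.0, -0.5]
print(lp_size(x, 1), lp_size(x, 2), lp_size(x, 4))
```

Higher p weights large coordinates more heavily, which is one reason such measures may fluctuate more across random seeds.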

For example, in the following, we plot width vs l1 for 2 steps of training, where t=1 means at initialization, before any gradient update. Each curve corresponds to a (pre-)activation vector of a layer or the output of the network. The first set of 3 plots shows an MLP in standard parametrization (SP), trained by Adam. We see that after 1 step of update, the activation/output l1 explodes with width. This means SP is "incorrect." We now do the same for an MLP in maximal update parametrization (μP) (including using mup.optim.MuAdam instead of torch.optim.Adam). In contrast to the above, all curves stay horizontal, indicating that μP is implemented correctly. We call this way of checking implementation correctness a coord check, short for "coordinate check."

Making Your Own Coord Check Plots

We provide an easy way to implement this check via functions in the mup.coord_check module. The workflow typically looks like the following.

from mup.coord_check import get_coord_data, plot_coord_data
# construct a dictionary of lazy μP models with differing widths
def lazy_model(width):
    # `set_base_shapes` returns the model
    return lambda: set_base_shapes(MyMuModel(width), 'my/base/shape/path.bsh')
    # Note: any custom initialization with `mup.init` would need to
    # be done inside the lambda as well
models = {64: lazy_model(64), ..., 1024: lazy_model(1024)}
# make a dataloader with small batch size/seq len
#   just for testing
dataloader = ...
# record data from the model activations over a few steps of training
# this returns a pandas dataframe
df = get_coord_data(models, dataloader)
# This saves the coord check plots to filename.
plot_coord_data(df, save_to=filename)
# If you are in jupyter notebook, you can also do
#   `plt.show()`
# to show the plot

For example, the mup.coord_check.example_plot_coord_check function is implemented this way for toy MLP and CNN models.

If you see the curves blow up or shrink to 0 with width after a few steps of training, then there's a bug in your μP implementation (did you forget to vary some dimension, like d_ffn, in the delta model?). If instead you see the curves converge to the right, then most likely your implementation is correct. However, there are two typical exceptions to this; the following can shrink to 0 at initialization in μP (at a 1/sqrt(width) rate):

  • the network output
  • the attention logits in a Transformer

These are transient, and after a few steps their curves should be roughly flat. Nevertheless, to remove the discrepancy at init, we recommend

  • initializing the output layer (should be a MuReadout instance) weights to be 0 via the readout_zero_init=True option and
  • initializing the query matrix in a Transformer to 0 (this has to be done manually). If symmetry-breaking is desired in the attention logits at init, initialize the (relative) position biases with nonzero variance.

Tips for Coord Check

  • Use a large learning rate (larger than you'd use for actual training). This would emphasize any potential exploding coordinates issue, which could be hidden by the initialization if the learning rate is too small.
  • If you reuse a module multiple times in the forward pass, then mup.get_coord_data will only record the statistics from the last usage. In this case, for testing purposes, one can wrap different usages with nn.Identity modules of different names to distinguish them.

Wider is Always Better

Another sign that μP has not been implemented correctly is if going wider does worse (on training loss) after some width, at some point during training. The figure above illustrates this in a collection of training curves: (left) the correct implementation should always see performance improve with width, at any point in training; (middle) if you used standard parametrization (SP), sometimes you may see performance improve with width up to some point and then suddenly it becomes worse with wider models; (right) or you may immediately see worsening performance even for narrow models.
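The "wider is always better" criterion can be turned into a simple check over recorded training curves. This is a minimal sketch under stated assumptions: `wider_is_better` is a hypothetical helper, the curves are made-up numbers, and `tol` is an assumed slack for seed-to-seed noise.

```python
def wider_is_better(curves, tol=0.0):
    """Return True if, at every recorded step, training loss does not
    increase as width increases.

    `curves` maps width -> list of training losses (one entry per step).
    """
    widths = sorted(curves)
    for narrow, wide in zip(widths, widths[1:]):
        for loss_narrow, loss_wide in zip(curves[narrow], curves[wide]):
            if loss_wide > loss_narrow + tol:
                return False
    return True

# Made-up numbers purely for illustration.
good = {256: [2.0, 1.5, 1.2], 512: [1.9, 1.4, 1.1], 1024: [1.8, 1.3, 1.0]}
bad = {256: [2.0, 1.5, 1.2], 512: [1.9, 1.4, 1.1], 1024: [2.1, 1.6, 1.3]}
print(wider_is_better(good), wider_is_better(bad))
```

In practice a small positive `tol` is probably needed, since real loss curves are noisy.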

Examples

See the MLP, Transformer, and ResNet folders inside examples/ as well as the tests in mup/test for examples. People familiar with Huggingface Transformers may also find the examples/mutransformers submodule instructive (obtained via git submodule update --init), which is also available standalone at https://github.com/microsoft/mutransformers.

Native Integration With Huggingface

Frustrated that your Huggingface Transformer breaks when you scale up? Want to tune hyperparameters for your large multi-GPU Huggingface Transformer on a single GPU, right out of the box? If so, please upvote this github issue!

Running Tests

To run tests, do

python -m mup.test

The Basic Math

μP is designed so as to satisfy the following desiderata:

At any time during training

  1. Every (pre)activation vector in a network should have Θ(1)-sized coordinates
  2. Neural network output should be O(1).
  3. All parameters should be updated as much as possible (in terms of scaling in width) without leading to divergence

It turns out these desiderata uniquely single out μP. To derive μP from them, one needs to carefully consider how the coordinate size of a vector Av, resulting from a square matrix A multiplying vector v, depends on those of A and v, when A and v are "correlated". Here you can think of A as weights and v as an activation vector. This in turn depends on what kind of matrix A is and what kind of vector v is. In the context of training a wide neural network, it turns out we only need to consider vectors that have approximately iid coordinates, and two kinds of matrices: 1) those that look like outer products of such vectors, and 2) random iid matrices. Those of type 1 cover things like weight gradients; those of type 2 cover things like weight initialization. Then, if A and v both have entry size Θ(1) and they are correlated in ways that arise naturally during training, we have the following table, where n is the dimension of v.

                 | outer product A (type 1) | iid A (type 2)
Entry size of Av | Θ(n)                     | Θ(sqrt(n))

Given this table, one can then trace the forward and backward computation of a network to derive μP straightforwardly.
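The two scalings in the table can be observed numerically. The sketch below is an illustration we constructed, not code from the paper: for type 1 it takes A = u v^T applied to the same v (the fully correlated case), and for type 2 an iid Gaussian A, which is a simplified independent setting. All helper names are hypothetical.

```python
import random

def mean_abs(x):
    """Average absolute coordinate of a vector."""
    return sum(abs(v) for v in x) / len(x)

def entry_size_outer(n, rng):
    """Type 1: A = u v^T applied to the correlated vector v, so Av = u * (v . v)."""
    u = [rng.gauss(0.0, 1.0) for _ in range(n)]
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    vv = sum(vj * vj for vj in v)
    return mean_abs([ui * vv for ui in u])

def entry_size_iid(n, rng):
    """Type 2: A with iid standard Gaussian entries applied to a vector v."""
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return mean_abs([sum(rng.gauss(0.0, 1.0) * vj for vj in v) for _ in range(n)])

rng = random.Random(0)
for n in (64, 256):
    # Dividing by n and sqrt(n) respectively should leave roughly constant values.
    print(n, entry_size_outer(n, rng) / n, entry_size_iid(n, rng) / n ** 0.5)
```

If the normalized values stay of order 1 as n grows, the entries of Av scale as Θ(n) and Θ(sqrt(n)) respectively, matching the table.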

See our blog post for a gentle primer and our paper for details.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.

Comments
  • Conv1D Coord check looks good (I think), but μTransfer does not seem to work?

    Hi all, attaching the coord check plots and also a screen shot of the train loss and auprc plots. I used the Conv1D from the branch, but have also tried

    I was looking at the conv plots from the examples and I noticed that one of the layers is constant across width, but after the first step is significantly smaller. Is that an issue?

    [Figures: μP coord check plot; SP coord check plot; training loss; training AUPRC]

    I also tried this with a transformer based model and found similar results where the transferred HPs did not result in better performance. I can regenerate those plots if needed.

    Is this expected? What can I do to fix this? Having mup work would be a huge unlock for us :D

    opened by zanussbaum 20
  • Coord-check for conv1d

    I modified the muconv2d branch to get a conv1d variant of the output layer for mup, and I applied it to a shallow variant of a unet model I've been testing.

    repo for model: https://github.com/bob80333/audio_bandwidth_extension fork of mup with conv1d: https://github.com/bob80333/mup/tree/muconv2d

    Here are the coord-check results. They don't look quite as smooth as in the paper, but there's definitely a big difference between mup and no mup.

    [Figures: coord check plot with mup; coord check plot without mup]

    Does this look about right for the coordinate check? The figures I saw in the example looked much smoother than this.

    opened by bob80333 17
  • mu parametrization for channel attention

    Hi, I have another question about the mu parametrization for a special attention mechanism - channel attention.

    In standard scaled dot-product attention (also regarded as spatial attention), we have Q, K, V with shape n x d (ignoring heads) and we will calculate softmax(scale * Q K^T) V to get a n x d output, where scale = 1/sqrt(d) in SP and scale = 1/d in muP (or 1/sqrt(d_0) / width_mult in muP for backward compatibility).

    In channel attention, we still have Q, K, V with shape n x d (ignoring heads). The different part is, we will calculate (softmax(scale * Q^T K) V^T)^T to get a n x d output, where scale = 1/sqrt(n) in SP. Since the attention map Q^T K now has shape d x d instead of n x n, I am not sure how the scale should be modified in SP accordingly. Should we use 1/sqrt(n) / width_mult?

    In addition, Appendix B - Matrix-Like, Vector-Like, Scalar-Like Parameters has some interpretation behind the scale:

    a multiplier of order 1/fan_in should accompany any weight that maps an infinite dimension to a finite one. This interpretation then nicely covers both the output logits and the attention logits (i.e. 1/d attention).

    But such interpretation may not be directly used as a guidance to set up the scale in the channel attention.

    Thanks!

    opened by xwjabc 5
  • MuAdam not adjusting lr for output weights

    Hi, thank you for your great project for hyperparameter tuning!

    As our team is migrating mup to another training framework, it occurred to us that MuAdam does not scale the learning rate for output weights as illustrated in the TP5 paper: [image]

    https://github.com/microsoft/mup/blob/c9d67001c47ae254ea4b7e26146ffd059520b6ba/mup/optim.py#L55-L70

    It seems to us that only the lr of hidden layer (the layer with 2 inf dimensions) is scaled w.r.t fanin, but the output weight is ignored. We wonder if this is intended. Thank you!

    opened by zhuzilin 4
  • Multiple nn.Linear layers

    Hi, your project is really interesting, so I am learning how to apply it to some specific models. For example, if the model has multiple nn.Linear layers as in wav2vec 2.0 (self.post_extract_proj, self.project_q, self.project_inp, self.target_glu, self.final_proj), should I replace all of these layers with MuReadout?

    class Wav2Vec2Model(BaseFairseqModel):
        def __init__(self, cfg: Wav2Vec2Config):
            super().__init__()
            self.cfg = cfg
    
            feature_enc_layers = eval(cfg.conv_feature_layers)
            self.embed = feature_enc_layers[-1][0]
    
            self.feature_extractor = ConvFeatureExtractionModel(
                conv_layers=feature_enc_layers,
                dropout=0.0,
                mode=cfg.extractor_mode,
                conv_bias=cfg.conv_bias,
            )
    
            self.post_extract_proj = (
                nn.Linear(self.embed, cfg.encoder_embed_dim)
                if self.embed != cfg.encoder_embed_dim and not cfg.quantize_input
                else None
            )
    
            self.mask_prob = cfg.mask_prob
            self.mask_selection = cfg.mask_selection
            self.mask_other = cfg.mask_other
            self.mask_length = cfg.mask_length
            self.no_mask_overlap = cfg.no_mask_overlap
            self.mask_min_space = cfg.mask_min_space
    
            self.mask_channel_prob = cfg.mask_channel_prob
            self.mask_channel_before = cfg.mask_channel_before
            self.mask_channel_selection = cfg.mask_channel_selection
            self.mask_channel_other = cfg.mask_channel_other
            self.mask_channel_length = cfg.mask_channel_length
            self.no_mask_channel_overlap = cfg.no_mask_channel_overlap
            self.mask_channel_min_space = cfg.mask_channel_min_space
    
            self.dropout_input = nn.Dropout(cfg.dropout_input)
            self.dropout_features = nn.Dropout(cfg.dropout_features)
    
            self.feature_grad_mult = cfg.feature_grad_mult
    
            self.quantizer = None
            self.input_quantizer = None
    
            self.n_negatives = cfg.num_negatives
            self.cross_sample_negatives = cfg.cross_sample_negatives
            self.codebook_negatives = cfg.codebook_negatives
            self.negatives_from_everywhere = cfg.negatives_from_everywhere
    
            self.logit_temp = cfg.logit_temp
    
            final_dim = cfg.final_dim if cfg.final_dim > 0 else cfg.encoder_embed_dim
    
            if cfg.quantize_targets:
                vq_dim = cfg.latent_dim if cfg.latent_dim > 0 else final_dim
                self.quantizer = GumbelVectorQuantizer(
                    dim=self.embed,
                    num_vars=cfg.latent_vars,
                    temp=cfg.latent_temp,
                    groups=cfg.latent_groups,
                    combine_groups=False,
                    vq_dim=vq_dim,
                    time_first=True,
                    weight_proj_depth=cfg.quantizer_depth,
                    weight_proj_factor=cfg.quantizer_factor,
                )
                self.project_q = nn.Linear(vq_dim, final_dim)
            else:
                self.project_q = nn.Linear(self.embed, final_dim)
    
            if cfg.quantize_input:
                if cfg.same_quantizer and self.quantizer is not None:
                    vq_dim = final_dim
                    self.input_quantizer = self.quantizer
                else:
                    vq_dim = cfg.latent_dim if cfg.latent_dim > 0 else cfg.encoder_embed_dim
                    self.input_quantizer = GumbelVectorQuantizer(
                        dim=self.embed,
                        num_vars=cfg.latent_vars,
                        temp=cfg.latent_temp,
                        groups=cfg.latent_groups,
                        combine_groups=False,
                        vq_dim=vq_dim,
                        time_first=True,
                        weight_proj_depth=cfg.quantizer_depth,
                        weight_proj_factor=cfg.quantizer_factor,
                    )
                self.project_inp = nn.Linear(vq_dim, cfg.encoder_embed_dim)
    
            self.mask_emb = nn.Parameter(
                torch.FloatTensor(cfg.encoder_embed_dim).uniform_()
            )
    
            self.encoder = TransformerEncoder(cfg)
            self.layer_norm = LayerNorm(self.embed)
    
            self.target_glu = None
            if cfg.target_glu:
                self.target_glu = nn.Sequential(
                    nn.Linear(final_dim, final_dim * 2), nn.GLU()
                )
    
            self.final_proj = nn.Linear(cfg.encoder_embed_dim, final_dim)
    

    Thank you! ^^

    opened by windspirit95 4
  • Consider decoupled weight decay optimizers?

    Hi! I'm a big fan of this project and I noticed that MUP has some wrappers for common optimizers MuAdam, MuAdamW, and MuSGD. Given the focus on hyperparameter stability and tuning, I was wondering if you might be interested in adding / experimenting with the decoupled weight decay optimizers from this paper (https://arxiv.org/abs/1711.05101)?

    For context, PyTorch's implementations of SGD, Adam, and even AdamW all scale the weight decay by the learning rate in their update step, and I've found that this makes it tough to tune the two values independently (if you increase LR, you also silently increase the effective WD). This is due to PyTorch's scheduler implementation which only updates an Optimizer's LR, and so they schedule the WD in sync by multiplying the two values together.

    E.g. here is Pytorch's AdamW code, which shows why it is not really decoupled: https://github.com/pytorch/pytorch/blob/11231b0f935c092598c994a4bab686485aac1856/torch/optim/adamw.py#L248

    The "correct" way to decouple LR, WD is described in the paper, and we have some PyTorch-ready implementations here (code, docs) in MosaicML's Composer library. Though I haven't seen any examples in MUP yet of tuning the WD across model sizes, I feel like this could be a common hparam that users want to tune and that DecoupledSGDW or DecoupledAdamW could help make it more stable :)

    opened by abhi-mosaic 4
  • mu parametrization for multi-head attention / grouped convolution

    Hi, in Appendix E.2 - Number of Attention Heads, there is a use case that fixes d_head (dimension size per head) and scales n_head (number of heads). Do we need to change anything when we use such multi-head attention with scaled n_head? Or we still follow the same way as shown in the provided Transformer example (scale d_head, only change 1/sqrt(d) to 1/d and keep other settings the same).

    Similarly, when applying to the muP to grouped convolution which keeps dim size per group and scales number of groups, is there any special rule we should follow?

    Thanks!

    opened by xwjabc 3
  • muP for contrastive losses

    Hi, I have a question regarding the use of muP in contrastive losses: Assume we have anchor embedding x, positive embedding x_pos, and negative embedding x_neg. All x, x_pos, and x_neg are C-dim vectors where C represents the width that is categorized as an infinite dimension. The loss L is formulated as:

    L = -log( exp(sim(x, x_pos)) / (exp(sim(x, x_pos)) + exp(sim(x, x_neg))) )

    where sim(a, b) = cos(a, b) for each embedding pair. It seems the sim() merges two infinite-dim vectors to a finite one, which is similar to the Q K^T operation in self-attention. However, the difference is that the cosine similarity already bounds the output. Thus, I wonder if there is anything we need to change in the loss function when we use muP? Thanks!

    opened by xwjabc 2
  • Optimizers for coord check

    Thank you for your great work! When trying the coord check in the examples, I noticed that the original optimizers (e.g., sgd, adam) are used instead of the muP optimizers (e.g., musgd, muadam). However, according to the Table 8 in the paper, the optimizers should be adjusted accordingly to make activations bounded. Is there any reason behind the use of original optimizers?

    opened by xwjabc 2
  • ResNet readout_zero_init=True?

    Dear Greg,

    Awesome project! May I ask why the linear output layer in ResNet is initialized as 0 instead of Gaussian(mean=0, var=1) as mentioned in the paper?

    Thanks a lot for your time and help.

    opened by D-X-Y 2
  • Hyperparameter search on base models

    Following up on the conversation here https://github.com/microsoft/mup/issues/11 since it wasn't related to the original issue.

    How exactly should the learning rates be split up when doing hyperparameter search on the base model? You said input/hidden/output, but Table 8 groups all biases with input weights. Does the output bias also fall into the input/bias group? (it has finite fan-in and fan-out, unlike the other biases which have infinite fan-out)

    opened by davisyoshida 2
  • Coord check looks good, but μTransfer is not working as expected

    Hello, μP team! Very excited to see you open source your excellent work! I was looking to apply μP on our work, and on Megatron-DeepSpeed I modified the training script as suggested in the tutorial, set the infshape, reset parameters initialization, put on MuAdam, and got a coord_check that looked successful. But when we transfer the learning rate that performed well on the 350M GPT model to the large model 1.3B, we found that the 1.3B could not withstand such a large learning rate and eventually produced NaN.

    I was wondering what details might not have been taken into account, or the conditions were not met, causing μTransfer to fail. How should I debug, or μTransfer just won't work under this condition?

    The following is the experimental information.

    [Images: experimental configuration]

    350M -> 1.3B GPT model μTransfer training loss (tensorboard link): [image]

    I think it may be a bit redundant, but if you are interested, the transformation of μP is here:

    1. Replace output layer with MuReadout, https://github.com/shjwudp/Megatron-LM/blob/mup/megatron/model/gpt_model.py#L250
    2. Make sure to use 1/d instead of 1/sqrt(d) attention scaling, https://github.com/shjwudp/Megatron-LM/blob/mup/megatron/model/transformer.py#L175
    3. Set infshape and do mup parameter initialization, https://github.com/shjwudp/Megatron-LM/blob/mup/pretrain_gpt.py#L110
    4. Put on MuAdam, https://github.com/shjwudp/Megatron-LM/blob/mup/megatron/optimizer/init.py#L65
    5. Implement the equivalent MuReadout._rescale_parameters operation, https://github.com/shjwudp/Megatron-LM/blob/mup/megatron/mpu/layers.py#L191
    6. Modify lr scheduler to update lr according to width, https://github.com/shjwudp/Megatron-LM/blob/mup/megatron/learning_rates.py#L127
    7. Coord check, https://github.com/shjwudp/Megatron-LM/blob/mup/megatron/mup_utils.py#L16
    opened by shjwudp 4
  • Does mup support Swin Transformer v2 model?

    Hi, we are trying to use the mup tool to tune a Swin Transformer v2 model. I modified the code of Swin Transformer v2 to adapt it to mup and ran the "save base shape" and "coordinate check" steps. The results of the "coordinate check" show that it does not meet the requirements of mup.

    Does mup support the Swin Transformer v2 model?

    In "swin_transformer_v2.py", I made the following changes (Swin Transformer v2 doesn't use "1/sqrt(d) attention scaling", so I didn't modify that part):

    1. Replaced the output layer nn.Linear with MuReadout
    2. Replaced the std normal init with the mup normal init
    self.norm = norm_layer(self.num_features)
    self.avgpool = nn.AdaptiveAvgPool1d(1)
    # self.head = nn.Linear(self.num_features, num_classes) if num_classes > 0 else nn.Identity()
    ### muP: replace nn.Linear with MuReadout
    self.head = MuReadout(self.num_features, num_classes) if num_classes > 0 else nn.Identity()
    
    self.apply(self._init_weights)
    for bly in self.layers:
        bly._init_respostnorm()
    
    def _init_weights(self, m, readout_zero_init=False, query_zero_init=False):
        ### muP: swap constant std normal init with normal_ from `mup.init`.
        ### Because `_init_weights` is called in `__init__`, before `infshape` is set,
        ### we need to manually call `self.apply(self._init_weights)` after calling
        ### `set_base_shapes(model, base)`
        if isinstance(m, nn.Linear):
            if isinstance(m, MuReadout) and readout_zero_init:
                m.weight.data.zero_()
            else:
                if hasattr(m.weight, 'infshape'):
                    normal_(m.weight, mean=0.0, std=.02)
                else:
                    trunc_normal_(m.weight, std=.02)
                    if isinstance(m, nn.Linear) and m.bias is not None:
                        nn.init.constant_(m.bias, 0)
        ### End muP
        elif isinstance(m, nn.LayerNorm):
            nn.init.constant_(m.bias, 0)
            nn.init.constant_(m.weight, 1.0)
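
    The comment in `_init_weights` matters because the μP init depends on the width multiplier, which is only known once `set_base_shapes` has attached `infshape` to each parameter. The sketch below illustrates that dependence in plain Python. `mup_init_std` is a hypothetical helper, and the rule it encodes (scale the std of a matrix-like weight by `width_mult ** -0.5`) is an assumption about how `normal_` from `mup.init` behaves, not a copy of it.

```python
def mup_init_std(std, fan_in, base_fan_in=None):
    """Hypothetical helper: init std for a matrix-like weight.

    base_fan_in=None models a weight initialized before base shapes are set:
    no width multiplier is known, so the constant std is used at every width.
    """
    if base_fan_in is None:
        return std
    width_mult = fan_in / base_fan_in
    return std * width_mult ** -0.5


if __name__ == "__main__":
    for width in (96, 192, 384):
        # Init run inside __init__, before set_base_shapes.
        before = mup_init_std(0.02, fan_in=width)
        # Re-init after set_base_shapes, with a base width of 96.
        after = mup_init_std(0.02, fan_in=width, base_fan_in=96)
        print(width, before, round(after, 5))
```

    Under this assumption, a model whose weights are only initialized inside `__init__` uses the constant std at every width, which is one possible reason for μP and SP coordinate checks looking similar and is worth ruling out.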
    

    In "main.py" of Swin Transformer, I added the "save base shape" and "coordinate check" functions.

    def main(config, args):
        dataset_train, dataset_val, data_loader_train, data_loader_val, mixup_fn = build_loader(config)
    
        logger.info(f"Creating model:{config.MODEL.TYPE}/{config.MODEL.NAME}")
        model = build_model(config)
        logger.info(str(model))
    
        ### muP
        if args.save_base_shapes:
            print(f'saving base shapes at {args.save_base_shapes}')
            base_shapes = get_shapes(model)
            delta_config = copy.deepcopy(config)
            delta_config.defrost()
            delta_config.MODEL.SWINV2.EMBED_DIM *= 2  # Modify SwinV2 embed dim
            delta_config.MODEL.SWIN.EMBED_DIM *= 2  # Modify Swin embed dim
            # delta_config.MODEL.SWIN_MOE.EMBED_DIM *= 2  # Modify Swin_moe embed dim
            delta_config.MODEL.SWIN_MLP.EMBED_DIM *= 2  # Modify Swin_mlp embed dim
    
            delta_shapes = get_shapes(
                # just need to change whatever dimension(s) we are scaling
                build_model(delta_config)
            )
            make_base_shapes(base_shapes, delta_shapes, savefile=args.save_base_shapes)
            print('done and exit')
            import sys
            sys.exit()
        if args.load_base_shapes:
            print(f'loading base shapes from {args.load_base_shapes}')
            set_base_shapes(model, args.load_base_shapes)
            print('done')
        else:
            print('using own shapes')
            set_base_shapes(model, None)
            print('done')
    
    ### muP
    def coord_check(mup, config, lr, optimizer, nsteps, nseeds, args, plotdir='', legend=False):
        dataset_train, dataset_val, data_loader_train, data_loader_val, mixup_fn = build_loader(config)
    
        def gen(w, standparam=False):
            def f():
                delta_config = copy.deepcopy(config)
                delta_config.defrost()
                delta_config.MODEL.SWINV2.EMBED_DIM = w  # Modify SwinV2 embed dim
                delta_config.MODEL.SWIN.EMBED_DIM = w  # Modify Swin embed dim
                # delta_config.MODEL.SWIN_MOE.EMBED_DIM = w  # Modify Swin_moe embed dim
                delta_config.MODEL.SWIN_MLP.EMBED_DIM = w  # Modify Swin_mlp embed dim
                model = build_model(delta_config)
    
                if standparam:
                    set_base_shapes(model, None)
                else:
                    assert args.load_base_shapes, 'load_base_shapes needs to be nonempty'
                    set_base_shapes(model, args.load_base_shapes)
                return model
            return f
    
        optimizer = optimizer.replace('mu', '')
        widths = (12, 24, 48, 96, 192)
        models = {w: gen(w, standparam=not mup) for w in widths}
    
        # train_data = batchify(corpus.train, batch_size, device=args.device)
        df = get_coord_data(models, data_loader_train, mup=mup, lr=lr, optimizer=optimizer, flatten_output=True,
                            nseeds=nseeds, nsteps=nsteps, lossfn='xent')
    
        prm = 'muP' if mup else 'SP'
        return plot_coord_data(df, legend=legend,
                               save_to=os.path.join(plotdir, f'{prm.lower()}_trsfmr_{optimizer}_coord.png'),
                               suptitle=f'{prm} Transformer {optimizer} lr={lr} nseeds={nseeds}',
                               face_color='xkcd:light grey' if not mup else None)
    
    if __name__ == '__main__':
        args, config = parse_option()
    
        ......
    
        ### muP
        if args.coord_check:
            print('testing parametrization')
            import os
            os.makedirs('coord_checks', exist_ok=True)
            plotdir = 'coord_checks'
            coord_check(mup=True, config=config, lr=0.0001, optimizer='adamw',
                        nsteps=args.coord_check_nsteps, nseeds=args.coord_check_nseeds, args=args,
                        plotdir=plotdir, legend=False)
            coord_check(mup=False, config=config, lr=0.0001, optimizer='adamw',
                        nsteps=args.coord_check_nsteps, nseeds=args.coord_check_nseeds, args=args,
                        plotdir=plotdir, legend=False)
            import sys
            sys.exit()
    
        main(config, args)
    

    The "coordinate check" results show only a small difference between "mup" and "SP". Sorry, I can't upload pictures. Could you please help us check whether mup can support the Swin Transformer v2 model, or whether there is some other cause? Thanks a lot.
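
    One way to make "only a small difference" concrete is to measure how a recorded activation size changes with width. The sketch below fits a log-log slope in plain Python. The function name `width_slope` and the input numbers are made up for illustration and are not part of mup. A slope near 0 means the quantity is stable across widths, which is what a passing coordinate check looks like, while a clearly positive slope means it grows with width.

```python
import math


def width_slope(size_by_width):
    """Least-squares slope of log(size) against log(width).

    size_by_width maps a width to an average coordinate size, e.g. the mean
    absolute activation of one layer at one training step.
    """
    xs = [math.log(w) for w in size_by_width]
    ys = [math.log(s) for s in size_by_width.values()]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    den = sum((x - x_mean) ** 2 for x in xs)
    return num / den


if __name__ == "__main__":
    # Illustrative numbers only, not measurements from Swin Transformer v2.
    stable = {12: 0.50, 24: 0.52, 48: 0.49, 96: 0.51, 192: 0.50}
    growing = {w: 0.1 * math.sqrt(w) for w in (12, 24, 48, 96, 192)}
    print(round(width_slope(stable), 3))
    print(round(width_slope(growing), 3))
```

    Computing this per layer and per step for both the "mup" and "SP" runs would show whether the two parametrizations really behave alike or whether the plots only look similar at the widths tested.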

    opened by shiyf129 1
  • integration with Flax?

    integration with Flax?

    Is there any interest in integrating this work with Flax?

    They already have an init function that decouples parameter initialization from model definition, which could make introducing mup fairly plug-and-play.

    Plus, they rely on optax for their optimizers. Since that library focuses on composability, you might be able to introduce a transformation that takes an optimizer and makes it mup compatible.

    Overall, I believe the Flax ecosystem could make mup more easily accessible to people.

    opened by nestordemeure 4
  • Does mup work with model with Conv2D as output?

    Does mup work with model with Conv2D as output?

    Hello, this project looks great and the GitHub documentation is really good. Just wondering if mup would work with a model that has nn.Conv2d as its last layer instead of a linear layer.

    opened by BurguerJohn 5
  • PyTorch Lightning example

    PyTorch Lightning example

    Dear team behind mup,

    This is some great work! I believe providing a PyTorch Lightning example could help users adopt this library.

    I even wonder if this technique could be integrated with even less boilerplate. I was thinking about an extension to the PyTorch Lightning Tuner that would automatically apply mup and the µTransferable hyperparameters.

    I wonder if someone from the mup team would be interested in investigating these ideas to make this work even more widely accessible.

    Best, T.C

    opened by tchaton 1
Releases(v1.0.0)
Owner
Microsoft
Open source projects and samples from Microsoft