auto-tuning momentum SGD optimizer

Overview

YellowFin

YellowFin is an auto-tuning optimizer based on momentum SGD which requires no manual specification of learning rate and momentum. It measures the objective landscape on-the-fly and tunes momentum as well as learning rate using local quadratic approximation.
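To make the quadratic picture concrete, here is a minimal sketch, not the code in this repository. It runs plain momentum SGD on a one-dimensional quadratic and picks momentum and learning rate from a curvature range. The helper names `tuned_momentum_and_lr` and `momentum_sgd_on_quadratic` are hypothetical, and the closed-form rule is an assumption that covers only the noiseless case; the actual tuner also accounts for gradient variance and distance to the optimum.

```python
import math


def tuned_momentum_and_lr(h_min, h_max):
    """Noiseless-case choice of (mu, lr) for curvatures in [h_min, h_max].

    Assumption: illustrative rule for a quadratic without gradient noise.
    """
    kappa = h_max / h_min  # condition number of the curvature range
    mu = ((math.sqrt(kappa) - 1.0) / (math.sqrt(kappa) + 1.0)) ** 2
    lr = (1.0 - math.sqrt(mu)) ** 2 / h_min
    return mu, lr


def momentum_sgd_on_quadratic(h, mu, lr, x0=1.0, steps=200):
    """Run momentum SGD on f(x) = 0.5 * h * x**2 and return the final x."""
    x, velocity = x0, 0.0
    for _ in range(steps):
        grad = h * x
        velocity = mu * velocity - lr * grad
        x += velocity
    return x


mu, lr = tuned_momentum_and_lr(h_min=1.0, h_max=100.0)
for h in (1.0, 10.0, 100.0):
    # The same (mu, lr) pair drives x toward 0 for every curvature in the range.
    print(h, momentum_sgd_on_quadratic(h, mu, lr))
```

With a single (mu, lr) pair, the iterate shrinks toward zero for every curvature between h_min and h_max, which is the property the tuner aims for.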

The implementation here can be a drop-in replacement for any optimizer in PyTorch. After from yellowfin import YFOptimizer, it supports the step and zero_grad functions like any PyTorch optimizer. We also provide an interface to manually set the learning rate schedule at every iteration for finer control (see the Detailed guidelines section).

For more technical details, please refer to our paper YellowFin and the Art of Momentum Tuning.

For more usage details, please refer to the inline documentation of tuner_utils/yellowfin.py. Example usage can be found here for ResNext on CIFAR10 and Tied LSTM on PTB.

YellowFin is under active development. Many members of the community have kindly submitted issues and pull requests. We are incorporating fixes and smoothing things out. As a result the repository code is in flux. Please make sure you use the latest version and submit any issues you might have!

Updates

[2017.07.03] Fixed a gradient clipping bug. Please pull our latest master branch to make gradient clipping great again in YellowFin.

[2017.07.28] Switched to logarithmic smoothing to accelerate adaptation to curvature range trends.

[2017.08.01] Added optional feature to enforce non-increasing value of lr * gradient norm for stability in some rare cases.

[2017.08.05] Added feature to correct estimation bias from sparse gradient.

[2017.08.16] Replaced the numpy root solver with a closed-form solution using Vieta's substitution for the cubic equation. This resolves the stability issue of the numpy root solver.
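As an illustration of the technique named in this update, the sketch below applies Vieta's substitution to a depressed cubic of the form y^3 + p*y + p = 0 with p > 0. The specific cubic, the coefficient name `p`, and the function name `solve_depressed_cubic` are assumptions chosen for the example; please refer to tuner_utils/yellowfin.py for the equation actually solved.

```python
import math


def solve_depressed_cubic(p):
    """Return the single real root y of y**3 + p*y + p = 0 for p > 0.

    Vieta's substitution y = w - p / (3*w) turns the cubic into a
    quadratic in w**3, which has a closed-form solution.
    """
    if p <= 0:
        raise ValueError("p must be positive")
    # Quadratic in w**3: (w**3)**2 + p * w**3 - p**3 / 27 = 0.
    # Take the negative root of that quadratic.
    w3 = (-math.sqrt(p ** 2 + 4.0 / 27.0 * p ** 3) - p) / 2.0
    # Real cube root of a negative number.
    w = math.copysign(abs(w3) ** (1.0 / 3.0), w3)
    return w - p / (3.0 * w)


y = solve_depressed_cubic(1.0)
residual = y ** 3 + y + 1.0  # should be close to zero
print(y, residual)
```

Because the formula uses only a square root and a cube root, it avoids the iterative eigenvalue computation behind a general polynomial root finder.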

[2017.10.29] Major fix for stability. We added eps to protect fractions in our code, as well as an adaptive clipping feature to properly deal with exploding gradients (manual clipping is still supported, as described in the detailed guidelines below).
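The two ideas in the last update, an eps-protected fraction and a norm threshold derived from the estimated maximal curvature, can be sketched in plain Python as follows. This is a simplified illustration on a flat list of numbers, not the repository implementation; `clip_by_global_norm`, `adaptive_threshold`, and the eps value are hypothetical names and choices.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-15):
    """Scale a flat list of gradient values so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    # eps keeps the division well-defined when the norm is zero.
    scale = min(1.0, max_norm / (norm + eps))
    return [g * scale for g in grads], norm


def adaptive_threshold(h_max_estimate):
    """Square root of the estimated maximal curvature, as in the guidelines."""
    return math.sqrt(h_max_estimate)


grads = [3.0, 4.0]                  # L2 norm is 5.0
thresh = adaptive_threshold(4.0)    # sqrt(4.0) = 2.0
clipped, norm_before = clip_by_global_norm(grads, thresh)
print(norm_before, clipped)
```

Gradients whose norm is already below the threshold pass through unchanged, so clipping only acts on the rare exploding steps.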

Setup instructions for experiments

Please clone the master branch and follow the instructions to run YellowFin on ResNext for CIFAR10 and tied LSTM on Penn Treebank for language modeling. The models are adapted from ResNext repo and PyTorch example tied LSTM repo respectively. Thanks to the researchers for developing the models. For more experiments on more convolutional and recurrent neural networks, please refer to our Tensorflow implementation of YellowFin.

Note: YellowFin is tested for compatibility with PyTorch v0.2.0 under Python 2.7.

Run CIFAR10 ResNext experiments

The experiments on 110-layer ResNet with CIFAR10 and 164-layer ResNet with CIFAR100 can be launched using

cd pytorch-cifar
python main.py --logdir=path_to_logs --opt_method=YF

Run Penn Treebank tied LSTM experiments

The experiments on multiple-layer LSTM on Penn Treebank can be launched using

cd word_language_model
python main.py --emsize 650 --nhid 650 --dropout 0.5 --epochs 40 --tied --opt_method=YF --logdir=path_to_logs --cuda

For more experiments, please refer to our YellowFin Tensorflow Repo.

Detailed guidelines

  • Basic use: optimizer = YFOptimizer(parameter_list) applies the same default setting (i.e. without tuning) that we used for all the PyTorch and Tensorflow experiments in our paper.

  • Interface for manual finer control: If you want finer control over the learning rate (say, a manually set constant learning rate), or you want to use the typical lr-dropping technique after a certain number of epochs, please use set_lr_factor() in the YFOptimizer class. E.g. if you want to use a manually set constant learning rate, you can run set_lr_factor(desired_lr / self._lr) before self.step() at each iteration. Or, if you want to always multiply the learning rate originally tuned by YellowFin by a factor of 2.0, you may call optimizer.set_lr_factor(2.0) right after optimizer = YFOptimizer(parameter_list) and before training with YellowFin. More details can be found here. (The arguments lr and mu in YFOptimizer initialization are dummies, kept only for backward compatibility.)

  • Gradient clipping: The default setting uses adaptive gradient clipping to prevent gradient explosion, thresholding the norm of the gradient to the square root of our estimated maximal curvature. There are three cases regarding gradient clipping: the default adaptive clipping, and the two alternatives listed below. We recommend first turning off gradient clipping, and only turning it on when necessary.

    • If you want to manually set the threshold for gradient clipping, please first use adapt_clip=False to turn off the auto-clipping feature. Then you can either pass the clip_thresh=thresh_on_the_gradient_norm argument when initializing the YFOptimizer to clip according to your threshold inside YFOptimizer, or clip the gradient outside of YFOptimizer before step() is called.

    • If you want to totally turn off gradient clipping in YFOptimizer, please use clip_thresh=None, adapt_clip=False when initializing the YFOptimizer.

  • Normalization: When using log-probability-style losses, please make sure the loss is properly normalized. In some RNN/LSTM cases, the cross_entropy needs to be averaged over the number of samples in a minibatch. In some PyTorch loss functions, it also needs to be averaged over the number of classes and the sequence length of each sample. E.g. in nn.MultiLabelSoftMarginLoss, size_average=True needs to be set.

  • Non-increasing move: In some rare cases, we have observed that an increasing value of lr * || grad ||, i.e. the move, may result in instability. We implemented an engineering trick to enforce a non-increasing value of lr * || grad ||. The default setting turns the feature off; if it is really necessary, you can turn it on with force_non_inc_step_after_iter=the starting iter from which you want to enforce the non-increasing value. We recommend setting force_non_inc_step_after_iter to at least a few hundred because some models may need to gradually raise the magnitude of the gradient in the beginning (e.g. a model that is not properly initialized may have near-zero gradients and need some iterations to reach a reasonable gradient level).
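For the manual learning rate control described in the guidelines above, the factor passed to set_lr_factor() can be computed with small helpers like the ones sketched here. The helper names `constant_lr_factor` and `step_decay_factor` and the default decay values are hypothetical; only set_lr_factor() itself comes from the YFOptimizer class.

```python
def constant_lr_factor(desired_lr, tuned_lr):
    """Factor that makes tuned_lr * factor equal desired_lr."""
    return desired_lr / tuned_lr


def step_decay_factor(epoch, drop=0.5, every=10):
    """Multiply the tuned learning rate by `drop` once every `every` epochs."""
    return drop ** (epoch // every)


# Intended use inside a training loop (not executed here):
#   optimizer.set_lr_factor(step_decay_factor(epoch))
#   optimizer.step()
for epoch in (0, 9, 10, 25):
    print(epoch, step_decay_factor(epoch))
```

The factor scales the learning rate tuned by YellowFin rather than replacing it, so a decay schedule expressed this way still lets the tuner adapt between drops.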

Citation

If you use YellowFin in your paper, please cite the paper:

@article{zhang2017yellowfin,
  title={YellowFin and the Art of Momentum Tuning},
  author={Zhang, Jian and Mitliagkas, Ioannis and R{\'e}, Christopher},
  journal={arXiv preprint arXiv:1706.03471},
  year={2017}
}

Acknowledgement

We thank Olexa Bilaniuk, Andrew Drozdov, Paroma Varma, Bryan He, as well as GitHub users @elPistolero and @esvhd for their help in contributing to and testing the codebase.

Implementation for other platforms

For Tensorflow users, we provide an implementation in the YellowFin Tensorflow Repo.

We thank the contributors for YellowFin in different deep learning frameworks.

Comments
  • NaN and AssertionError

    Thanks for open sourcing the code !

    I've tried it on a simple MLP and could not find a set of parameters (lr and mu) that would not yield one of those two errors:

    assert root.size == 1 AssertionError

    and

    numpy.linalg.linalg.LinAlgError: Array must not contain infs or NaNs

    Any tips on best practices ?

    opened by tdeboissiere 21
  • LR keeps growing instead of shrinking

    Hi,

    I'm running into a situation where YellowFin keeps adjusting the LR upward instead of decaying it downward - which, of course, prevents the network from converging. Any idea why it would be happening? Thanks!

    I initialize YF like so: optimizer = YFOptimizer(net.parameters(), lr=train_args['lr'])

    And the output is...

    [epoch 4], [iter 20 / 123], [train main loss 0.15482], [lr 0.057262]
    [epoch 4], [iter 40 / 123], [train main loss 0.14807], [lr 0.058468]
    [epoch 4], [iter 60 / 123], [train main loss 0.14976], [lr 0.059635]
    [epoch 4], [iter 80 / 123], [train main loss 0.14867], [lr 0.060768]
    [epoch 4], [iter 100 / 123], [train main loss 0.14935], [lr 0.061881]
    [epoch 4], [iter 120 / 123], [train main loss 0.14653], [lr 0.062980]
    
    ----------------------------------------------------------------------------------------------
    [epoch 5], [iter 20 / 123], [train main loss 0.14709], [lr 0.064228]
    [epoch 5], [iter 40 / 123], [train main loss 0.14188], [lr 0.065275]
    [epoch 5], [iter 60 / 123], [train main loss 0.16187], [lr 0.066297]
    [epoch 5], [iter 80 / 123], [train main loss 0.15231], [lr 0.067289]
    [epoch 5], [iter 100 / 123], [train main loss 0.15639], [lr 0.068227]
    [epoch 5], [iter 120 / 123], [train main loss 0.15515], [lr 0.069117]
    
    ----------------------------------------------------------------------------------------------
    [epoch 6], [iter 20 / 123], [train main loss 0.13752], [lr 0.070135]
    [epoch 6], [iter 40 / 123], [train main loss 0.13210], [lr 0.071002]
    [epoch 6], [iter 60 / 123], [train main loss 0.13821], [lr 0.071850]
    [epoch 6], [iter 80 / 123], [train main loss 0.13456], [lr 0.072690]
    [epoch 6], [iter 100 / 123], [train main loss 0.13225], [lr 0.073533]
    [epoch 6], [iter 120 / 123], [train main loss 0.13367], [lr 0.074379]
    
    opened by achaiah 11
  • YF doesn't work for the cv task

    Hi, I tried your optimizer instead of SGD for this challenge https://www.kaggle.com/c/planet-understanding-the-amazon-from-space and get this kind of train/valid curves https://s.mail.ru/BFbQ/1K6w1bqD7 , which is obv awful (tried different setups but no success). SGD & plateau scheduler reach 0.08258 val loss.

    What data do you need to investigate such bad performance ? For example learning rates changes between 0.1 and 3 (!) .

    opened by EdwardTyantov 4
  • AttributeError: 'YFOptimizer' object has no attribute '_state_checkpoint'

    Hi,

    The following error pops up when I use yellowfin.py in my setup.

    File "/home/bchatter/Documents/workspace/python/async-opt/resnet_cifar_yellowfin/yellowfin.py", line 568, in step
        self.load_state_dict_perturb(copy.deepcopy(self._state_checkpoint) )
    AttributeError: 'YFOptimizer' object has no attribute '_state_checkpoint'
    

    I attempted solving this issue by copying the line 543 in yellowfin.py self._state_checkpoint = copy.deepcopy(self.state_dict() ) just above the line 568. However, with that the optimizer starts to be non-converging (at times diverges, but certainly does not converge).

    Could you please look at the issue?

    opened by bapi 3
  • Learning Rate Decay

    Hi,

    For the YellowFin optimizer, do I need to use the learning rate decay trick? For example, I evaluate the model on the dev set and, if the performance drops, I halve the learning rate. This trick works very well for optimizers such as Adam and Adadelta. Will it also work if I switch to the YellowFin optimizer?

    Thanks.

    opened by magic282 2
  • illegal memory access

    Feel free to close since it seems like a PyTorch bug, but as a heads up, in case others hit the same issue, on some runs I got this:

    Traceback (most recent call last):
      File "train.py", line 219, in <module>
        optimizer.step()
      File "/home/grant/repos/aud1/yellowfin.py", line 371, in step
        self.after_apply()
      File "/home/grant/repos/aud1/yellowfin.py", line 276, in after_apply
        self.grad_sparsity()
      File "/home/grant/repos/aud1/yellowfin.py", line 219, in grad_sparsity
        grad_non_zero = grad.nonzero()
    RuntimeError: an illegal memory access was encountered
    
    opened by greaber 2
  • Simplistic load/save of yellowfin.

    Trying to address: https://github.com/JianGoForIt/YellowFin_Pytorch/issues/9

    The approach is to save anything that is potentially "stateful" in the optimizer. A quick sanity check indicates that this works.

    Sanity check: https://gist.github.com/mrdrozdov/d3d6ac43a3130c799e4f3f5867853184

    opened by mrdrozdov 2
  • Feature Request: Implement state_dict() / load_state_dict()

    thanks for this code release, I wish more papers did this! It really makes it effortless to try out this new optimizer :)

    It'd be great if YellowFin supported the state_dict() and load_state_dict() functions, to maintain a consistent serialization API with the other PyTorch optimizers.

    opened by nelson-liu 2
  • Different variance in publication and implementation

    In the publication the variance in "Algorithm 3 Gradient variance" is defined as: [image]. However, in the PyTorch implementation the variance is defined as: [image]. Did you try YellowFin with the variance from the publication? Were the results worse? Which definition of the variance did you use to produce the results in the publication?

    opened by glogowski-wojciech 1
  • Why alpha and mu are global, not parameter-wise?

    Hi! Thank you for your great work!

    Could you tell why you decided to use one global alpha and one global mu for whole the model instead of creating a separate alpha and mu for each matrix of weights in the model? The other approach seems to be more natural to me because each matrix of weights might have different distributions of its values and values of its gradient. Did you consider it? Do you see a reason not to do so?

    opened by glogowski-wojciech 1
  • Assertion Error: assert root.size == 1

    I get the following error with the following stack: optimizer.step() -> self.after_apply() -> self.get_mu() -> assert root.size == 1

    It works fine for most of my runs but fails the assertion sometimes. I will try reproducing it in a simpler setting.

    opened by ashudeep 1
  • Does not work with pytorch 0.4

    There seem to be two reasons for this:

    1. 0.4 introduced 0-dimensional tensors (scalars) and to get their value as a python float we need to call .item() on them. If we don't (and train on GPU) yellowfin will hold on to tensors on both CPU and GPU and try to do operations on them (which will cause an exception since they are on different devices, that exception will be swallowed by the checkpoint restoration mechanism).
    2. Tensors and Variables have been merged in 0.4, so unless the code is changed yellowfin will hold on to tensors with gradient history causing a memory leak.

    The first issue seems to be quite easy to patch, I can send a pull-request for that part if you want to.

    opened by dnaq 2
  • Bad performance on large vision models

    Hello there,

    I am doing my best to learn how to use this optimizer, as I would very much like to have an auto-tuned optimizer where I do not have to spend endless days fiddling with hyperparameters. I have tried to use YellowFin to learn large vision models such as MobileNet, but my results are always very disappointing as compared to a traditional optimizer such as SGD. I am not so concerned about convergence time as I am about loss/accuracy; I have found that YellowFin tends to converge to a much worse loss/accuracy than my SGD runs do.

    I am posting here an example of training MobileNet on the ImageNet dataset with a batch size of 64, comparing the training and testing loss (as well as testing accuracy) of a few epochs of training on MobileNet. In both cases, I have a learning rate schedule applied to set the learning rate factor to 0.3 ^ (epoch // 10), which causes the learning rate to fall to 3/10 of its value every 10 epochs. You can see the effect of this learning rate schedule in the sgd plot fairly easily, the yf plot shows it less clearly. In these figures, the training loss (per minibatch) is shown in blue, while the testing loss (per epoch) is shown in red, with the relevant axis shown on the left. The top-1 and top-5 accuracies on the training dataset are shown in green (per epoch), with their relevant axis given on the right. Other than the optimizer choice, all other training settings are the same, including minibatch size (64), dataset (ImageNet) and model architecture (MobileNet).

    Here is a plot for an SGD optimizer run (note that I have this model only partially trained, this is because it has trained enough that we can already see it will converge to a significantly better loss than the YF model did, below): [plot: mobilenet SGD]

    Here is a plot for a YellowFin optimizer run: [plot: mobilenet YellowFin]

    If there are any questions about my methodology I would be happy to explain in greater detail. There is nothing particularly special going on in my model, I am simply trying to determine why YellowFin seems to converge with such poor results.

    opened by staticfloat 3
  • 'YFOptimizer' object has no attribute '_h_min' when calling optimizer.state_dict()

    When I try to save the optimizer.state_dict() before the first training step, this error occurred. Seems that we should add self._h_min = 0.0 and self._h_max = 0.0 in __init__() ?

    opened by JindongJiang 0
  • too many things are kept as state

    I was trying to experiment with the effects of changing the clipping threshold during training, but I noticed that my changes were getting overridden because this is kept as state. Also some other user set options are kept in the state, but probably should not be lest it be difficult to change them during training.

    opened by greaber 0
  • Fix auto_clip_fac in case of resuming from checkpoint

    When resuming from a checkpoint, self.iter is not zero, but self._h_max is still undefined, so the optimizer would error. The PR checks directly if self._h_max is defined, which should always work.

    opened by greaber 2
Owner
Jian Zhang
PhD student in machine learning at Stanford University