Ranger deep learning optimizer rewrite to use newest components

Overview

Ranger21 - integrating the latest deep learning components into a single optimizer

Ranger, with an RAdam + Lookahead core, is now approaching two years old.
(Original publication, Aug 2019: New deep learning optimizer Ranger.)
In the interim, a number of new developments have happened, including the rise of Transformers for vision.

Thus, Ranger21 (as in 2021) is a rewrite with multiple new additions reflecting some of the most impressive papers of the past year. The focus for Ranger21 is to parameterize these internals, and where possible automate them, so that you can easily test and leverage some of the newest concepts in AI training and optimize the optimizer for your respective dataset.
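
As a quick orientation, here is a minimal usage sketch. The constructor arguments are the ones that appear in the settings dumps later on this page; treat this as an assumption rather than the definitive API and check the source for the current signature and defaults:

```python
# Minimal usage sketch (assumed API based on the settings shown later on this
# page - check ranger21.py for the current signature and defaults).
import torch
from ranger21 import Ranger21

model = torch.nn.Linear(10, 2)  # stand-in model for illustration

optimizer = Ranger21(
    model.parameters(),
    lr=1e-3,
    num_epochs=40,               # Ranger21 builds its warmup/warmdown schedule
    num_batches_per_epoch=100,   # from the total number of training steps
    weight_decay=1e-4,
    use_madgrad=False,           # core engine toggle: Adam (False) or MadGrad (True)
    using_gc=True,               # gradient centralization
)

for epoch in range(40):
    for _ in range(100):  # num_batches_per_epoch
        x, y = torch.randn(8, 10), torch.randint(0, 2, (8,))
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(x), y)
        loss.backward()
        optimizer.step()
```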

Latest Simple Benchmark comparison (Image classification, dog breed subset of ImageNet, ResNet-18):

Ranger21:
Accuracy: 74.02% Validation Loss: 15.00

Adam:
Accuracy: 64.84% Validation Loss: 17.19

Net result: 14.15% greater (relative) accuracy with Ranger21 vs Adam, same training epochs.

Ranger21 Status:

April 27 PM - Ranger21 now training on ImageNet! Starting work on benchmarking Ranger21 on ImageNet. Due to cost, it will be trained for 40 epochs on ImageNet and compared with the same setup run for 40 epochs using Adam, to have a basic "gold standard" comparison. Training is underway now; hope to have results by the end of this week.

April 26 PM - added smarter auto warmup based on Dickson Neoh's report (tested with only 5 epochs), and first pip install setup thanks to @BrianPugh!
The warmup structure for Ranger21 is based on the paper by Ma/Yarats, which uses the beta2 param to compute the default warmup. However, that also assumes a longer training run. @DNH on the fastai forums tested with 5 epochs, which meant training never got past the warmup phase.
Thus, a check has been added for the percentage of warmup relative to the total training time, with an automatic fallback to 30% (settable via warmup_pct_default) to account for shorter training runs.
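
For intuition, here is a rough sketch of that logic. It assumes the 2/(1 - beta2) warmup rule from the Ma/Yarats paper plus the percentage fallback described above; the function and argument names are illustrative, not Ranger21's internals:

```python
# Illustrative sketch of the auto-warmup fallback (assumed logic, not
# Ranger21's exact code): use the beta2-derived warmup unless it would eat
# too large a fraction of a short training run.
def compute_warmup_iterations(beta2, num_epochs, num_batches_per_epoch,
                              warmup_pct_default=0.3):
    total_steps = num_epochs * num_batches_per_epoch
    beta2_warmup = int(2 / (1 - beta2))      # e.g. beta2=0.999 -> 2000 steps
    if beta2_warmup > warmup_pct_default * total_steps:
        # short run: cap warmup at a fixed percentage of total training
        return int(warmup_pct_default * total_steps)
    return beta2_warmup
```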

  • First pip install for Ranger21, thanks to @BrianPugh! Over the next week or two the focus will be on making Ranger21 easier to install and use rather than on adding new optimizer features, and a basic pip install is already underway:

```
git clone https://github.com/lessw2020/Ranger21.git
cd Ranger21
python -m pip install -e .
```

or install directly from GitHub:

```
python -m pip install git+https://github.com/lessw2020/Ranger21.git
```

April 25 PM - added guard for potential key error issue: an update has been checked in to add an additional guard preventing a key error reported earlier today during the lookahead step. This should correct it, but since it could not be reproduced locally, please update to the latest code and raise an issue if you still encounter it. Thanks!

April 25 - Fixed warmdown calculation error, moved to linear warmdown, new high in benchmark: an error was found in the warmdown calculations. It has been fixed, and the schedule has also moved to a linear warmdown. This resulted in another new high for the simple benchmark, with results now moved above so they don't get lost in the updates section.
Note that the warmdown now decays from the full lr down to a minimum lr (default 3e-5), rather than declining to 0 as before.
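
A small sketch of that linear warmdown, using illustrative names (the warmdown_start_pct and 3e-5 minimum values come from the settings shown elsewhere on this page; this is not Ranger21's exact code):

```python
# Illustrative linear warmdown from the base lr to a minimum lr (assumed
# logic, not Ranger21's exact code). Before warmdown_start_pct of training
# the lr stays flat; afterwards it decays linearly to min_lr.
def linear_warmdown_lr(step, total_steps, base_lr, min_lr=3e-5,
                       warmdown_start_pct=0.72):
    start_step = int(warmdown_start_pct * total_steps)
    if step <= start_step:
        return base_lr
    progress = (step - start_step) / max(1, total_steps - start_step)
    return base_lr - progress * (base_lr - min_lr)
```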

Note that you can display the lr curves directly by simply using:

```python
import matplotlib.pyplot as plt

lr_curve = optimizer.tracking_lr  # per-epoch lr values tracked by Ranger21
plt.plot(lr_curve)
plt.show()
```

Ranger21 internally tracks the lr per epoch for this type of review. Additional updates include adding a clear_cache() to reset the cached lookahead params, moving the lookahead processing into its own function, and cleaning up some naming conventions. The code will use item_active=True/False rather than the prior using_item=True/False to keep things simpler, as item properties are now alphabetically grouped instead of being cluttered into the using_item layout.
April 24 - New record on benchmark with NormLoss, Lookahead, PosNeg momentum, Stable decay, etc. all combined: integrating NormLoss and Lookahead into Ranger21 set a new high on our simple benchmark (ResNet-18, subset of ImageWoof).
Best Accuracy = 73.41, Best Val Loss = 15.06

For comparison, using plain Adam on this benchmark:
Adam Only Accuracy = 64.84 Best Adam Val Loss = 17.19

In other words, 12.5%+ higher relative accuracy at the moment for the same training epochs by using Ranger21 vs Adam.

Basically, it shows that the integration of all these various new techniques is paying off, as combining them currently delivers better results than any single one of them added to Adam.

New code checked in - adds Lookahead and, of course, Norm Loss. The settings can also now be displayed via .show_settings() as an easy way to check the current configuration.
(screenshot: Ranger21 settings output, April 24 build)

Given that the extensive settings may become overwhelming, the plan is to add config file support to make it easy to save out settings for various architectures, and ideally to provide 'best settings' recipes for CNNs, Transformers for image/video, GANs, etc.

April 23 - Norm Loss will be added, initial benchmarking in progress for several features: a new soft regularizer, norm loss, was recently published in this paper on arXiv: https://arxiv.org/abs/2103.06583v1

It's in the spirit of weight decay, but approaches it in a unique manner by nudging the weights towards the oblique manifold. This means that, unlike weight decay, it can actually push smaller weights up towards unit norm, whereas weight decay only pushes weights down. Their paper also shows norm loss is less sensitive to hyperparameters such as batch size, etc., unlike regular weight decay.

One of the lead authors was kind enough to share their TF implementation, which has been reworked into PyTorch form and integrated into Ranger21. Initial testing set a new best validation loss on my very basic benchmark. Thus, norm loss will be available with the next code update.
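
To make the idea concrete, here is a small sketch of a norm-loss style penalty. The exact form is assumed from the paper's description (per-neuron deviation from unit norm), and in Ranger21 it is applied inside the optimizer step rather than as an explicit loss term, so treat this purely as an illustration:

```python
# Illustrative norm-loss style penalty (assumed form, not Ranger21's code):
# each output unit's weight vector is nudged towards unit norm, so small
# weights are pushed up and large weights are pushed down.
import torch

def norm_loss_penalty(weight: torch.Tensor, normloss_factor: float = 1e-4):
    w = weight.reshape(weight.shape[0], -1)   # dim 0 = output units / filters
    norms = w.norm(dim=1)
    return normloss_factor * ((1.0 - norms) ** 2).sum()
```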

Also did some initial benchmarking to set vanilla Adam as a baseline, plus ablation-style testing with positive-negative momentum. Pos-neg momentum alone is a big improvement over vanilla Adam, and I'm looking forward to mapping out the contributions and synergies between all of the new features being rolled into Ranger21, including norm loss, adaptive gradient clipping, gradient centralization, etc.

April 18 PM - Adaptive gradient clipping added; thanks for the suggestion and code from @kayuksel. AGC is used in NFNets to replace BatchNorm. For our use case here, it provides a smarter gradient clipping algorithm than the usual hard clipping, and should ideally better stabilize training.
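
The core of AGC is a relative clipping rule: a gradient is scaled down whenever its norm exceeds a fixed fraction of the corresponding parameter norm. The simplified per-tensor sketch below is an assumption for illustration; the real AGC (and the reference code quoted in the comments further down this page) applies the rule unit-wise, per output channel:

```python
# Simplified per-tensor AGC sketch (illustration only; real AGC is unit-wise).
# Default clipping/eps values mirror the agc_clipping_value and agc_eps
# settings shown later on this page.
import torch

def agc_clip_(param: torch.Tensor, clipping: float = 1e-2, eps: float = 1e-3):
    if param.grad is None:
        return
    p_norm = param.detach().norm().clamp(min=eps)
    g_norm = param.grad.detach().norm()
    max_norm = clipping * p_norm
    if g_norm > max_norm:
        param.grad.mul_(max_norm / g_norm.clamp(min=1e-6))
```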

Here's how the Ranger21 settings output looks at the moment: (screenshot: ranger21_settings)

April 18 AM - Chebyshev fractals added, cosine warmdown (cosine decay) added
Chebyshev performed reasonably well, but it still needs more work before it can be recommended, so it defaults to off at the moment. There are two papers providing support for using Chebyshev, one of which is: https://arxiv.org/abs/2010.13335v1
Cosine warmdown has been added, so the default lr schedule for Ranger21 is a linear warmup, a flat run at the provided lr, and then a cosine decay of the lr starting at the percentage of training passed in (default is 0.65).
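
Putting the pieces together, the default schedule at this point looked roughly like the sketch below (illustrative names and logic, assuming the linear warmup, flat phase, and cosine decay described above; not Ranger21's exact code):

```python
# Illustrative three-phase lr schedule (assumed logic, not Ranger21's code):
# linear warmup -> flat at base_lr -> cosine warmdown starting at a fixed
# fraction of total training.
import math

def three_phase_lr(step, total_steps, base_lr, warmup_steps=1000,
                   warmdown_start_pct=0.65, min_lr=0.0):
    if step < warmup_steps:                        # phase 1: linear warmup
        return base_lr * (step + 1) / warmup_steps
    start = int(warmdown_start_pct * total_steps)
    if step < start:                               # phase 2: flat run
        return base_lr
    progress = (step - start) / max(1, total_steps - start)  # phase 3: cosine
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```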

April 17 - building benchmark dataset(s): as a cost-effective way of testing Ranger21 and its various options, currently taking a subset of ImageNet categories and building out, at the high level, an "ImageSubNet50" along with a few sub-category datasets. These are similar in spirit to ImageNette and ImageWoof, but the hope is to make a few relative improvements, including pre-sizing to 224x224 for speed of training/testing. The first sub-dataset in progress is ImageBirds, which includes:
n01614925 bald eagle
n01616318 vulture
n01622779 grey owl

n01806143 peacock
n01833805 hummingbird

This is a medium-fine classification problem and will be used as the first test for this type of benchmarking. Ideally, a separate repo will be made for ImageBirds shortly to make it available for people to use, though hosting the dataset poses a cost problem...

April 12 - positive-negative momentum added, MadGrad core checked in: testing over the weekend showed that positive-negative momentum works really well, and even better with GC.
The code is a bit messy at the moment because AdaiW was also tested, but it did not do that well, so it was removed and positive-negative momentum was added instead. Pos-neg momentum is a new technique that adds parameter-based, anisotropic noise to the gradient, which helps it settle into flatter minima and also escape saddle points. In other words, better results.
Link to their excellent paper: https://arxiv.org/abs/2103.17182
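
For a rough picture of the mechanism, the sketch below follows the positive-negative momentum formulation as I understand it from the paper (two momentum buffers updated on alternating steps, combined as a weighted difference); the function and variable names are illustrative and this is not Ranger21's exact implementation:

```python
# Illustrative positive-negative momentum step (assumed form per the paper
# linked above, not Ranger21's code). buf_a and buf_b are torch tensors that
# persist across steps; their weighted difference injects anisotropic noise.
import math

def pnm_update(grad, buf_a, buf_b, step, beta1=0.9, beta0=1.0):
    # pick which buffer gets refreshed this step (they alternate)
    pos, neg = (buf_a, buf_b) if step % 2 == 0 else (buf_b, buf_a)
    pos.mul_(beta1 ** 2).add_(grad, alpha=1 - beta1 ** 2)
    # positive weight on the fresh buffer, negative weight on the stale one
    noise_norm = math.sqrt((1 + beta0) ** 2 + beta0 ** 2)
    return ((1 + beta0) * pos - beta0 * neg) / noise_norm
```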

You can toggle between the MadGrad core and the Adam core with the use_madgrad=True/False flag.

April 10 - MadGrad core engine integrated: MadGrad has been added in a way that lets you select either MadGrad or Adam as the core 'engine' for the optimizer.
Thus, you'll be able to simply toggle which engine to use, as well as the various enhancements (warmup, stable weight decay, gradient centralization), and quickly find the best optimization setup for your specific dataset.
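
A quick sketch of what that toggling could look like in practice (parameter names are taken from the settings dumps on this page; the specific lr values are only placeholders, since MadGrad typically wants a different lr than Adam, as noted below):

```python
# Engine toggle sketch (assumed API; check ranger21.py for the real signature).
import torch
from ranger21 import Ranger21

model = torch.nn.Linear(10, 2)  # stand-in model for illustration

# Adam core engine
opt_adam = Ranger21(model.parameters(), lr=1e-3,
                    num_epochs=40, num_batches_per_epoch=100,
                    use_madgrad=False, using_gc=True, weight_decay=1e-4)

# MadGrad core engine - typically needs a noticeably different (higher) lr
opt_madgrad = Ranger21(model.parameters(), lr=1e-2,
                       num_epochs=40, num_batches_per_epoch=100,
                       use_madgrad=True, using_gc=True, weight_decay=1e-4)
```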

Still testing things and will update the code here... Gradient centralization is good for both engines: the first findings are that gradient centralization definitely improves MadGrad (just as it does with the Adam core), so GC will be on by default for both engines.
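
Gradient centralization itself is a one-line idea: subtract the mean of each gradient over its non-output dimensions before the update. A minimal sketch follows (illustrative, not Ranger21's exact code; the gc_conv_only flag shown in the settings dumps corresponds to restricting this to conv layers):

```python
# Minimal gradient centralization sketch (illustration, not Ranger21's code):
# remove the per-filter mean from the gradient so it has zero mean across its
# non-output dimensions.
import torch

def centralize_gradient(grad: torch.Tensor, gc_conv_only: bool = False):
    min_dims = 3 if gc_conv_only else 1     # conv weights are 4D; linear are 2D
    if grad.dim() > min_dims:
        dims = tuple(range(1, grad.dim()))
        grad = grad - grad.mean(dim=dims, keepdim=True)
    return grad
```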

LR selection is very different between MadGrad and Adam core engine:

One item: the starting lr for MadGrad is very different (typically higher) than with Adam. Some testing has been done with automated LR scheduling (HyperExplorer and ABEL), but that will be added later if it proves successful. If you simply plug your usual Adam LRs into MadGrad, you won't be impressed :)

Note that AdamP projection was also tested as an option, but impact was minimal, so will not be adding it atm.

April 6 - Ranger21 alpha ready - automatic warmup added. Seeing impressive results with only 3 features implemented.
Stable weight decay + GC + automated linear warmup seem to sync very nicely. Thus, if you are feeling adventurous, Ranger21 is basically alpha-usable. It is recommended to use the default warmup (automatic by default), but test the lr and weight decay.
Ranger21 will output its settings at init to make it clear what you are running with.

April 5 - stable weight decay added. Quick testing shows nice results with 1e-4 weight decay on a subset of ImageNet.
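
As a rough illustration of the difference from Adam/AdamW-style decay: stable weight decay rescales the decay term by a global statistic of the second-moment estimates, so its effective strength does not drift as the adaptive denominators change. The sketch below is an assumed form based on the paper linked in the feature list that follows, not Ranger21's exact code:

```python
# Illustrative stable weight decay step (assumed form, not Ranger21's code):
# the decay is divided by the global average of the (bias-corrected) second
# moment, instead of being applied directly as in Adam/AdamW.
def apply_stable_weight_decay(params, exp_avg_sqs, lr, weight_decay, bias_corr2):
    # exp_avg_sqs: list of per-parameter second-moment tensors (Adam's v_t)
    total = sum(v.sum() for v in exp_avg_sqs)
    count = sum(v.numel() for v in exp_avg_sqs)
    denom = (total / count / bias_corr2).sqrt().clamp(min=1e-12)
    for p in params:
        p.data.mul_(1 - lr * weight_decay / denom)
```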

Current feature set planned:

1 - feature complete - automated linear and exponential warmup in place of RAdam. This is based on the findings of https://arxiv.org/abs/1910.04209v3

2 - feature in progress - MadGrad core engine. This is based on my own testing with Vision Transformers as well as the compelling MadGrad paper: https://arxiv.org/abs/2101.11075v1

3 - feature complete - Stable Weight Decay instead of AdamW-style or Adam-style decay: needs more testing, but the paper is very compelling: https://arxiv.org/abs/2011.11152v3

4 - feature complete - Gradient Centralization will be continued - as always, you can turn it on or off. https://arxiv.org/abs/2004.01461v2

5 - Lookahead may be brought forward - unclear how much it may help with the new MadGrad core, which already leverages dual averaging, but it will probably be included as a testable param.

6 - feature implementation in progress - dual optimization engines - both the Adam and MadGrad cores will be present so that one can quickly test with either MadGrad or Adam (or AdamP) with the flip of a param.

If you have ideas/feedback, feel free to open an issue.

Installation

Until this is up on PyPI, it can either be installed by cloning the repository:

```
git clone https://github.com/lessw2020/Ranger21.git
cd Ranger21
python -m pip install -e .
```

or install directly from GitHub:

```
python -m pip install git+https://github.com/lessw2020/Ranger21.git
```
Comments
  • hit nan for variance_normalized

    Not certain this is a bug yet, but I'm getting this rarely after a while of training and am not finding an issue on my side. Input to the loss function looks good (no NaNs). I'm working with a fairly complex loss function though, so it's very possible I have a rare bug in my code.

    I'm using the following options

    Ranger21(
          params=params, lr=3e-4, 
          num_epochs=1e12, num_batches_per_epoch=1, num_warmup_iterations=1000, 
          using_gc=True, weight_decay=1e-4, use_madgrad=True
          )
    

    I've seen this with a batch size of 4-128 so far, so doesn't seem to be dependent on that.

    opened by jimmiebtlr 7
  • Changes in lr

    I got different learning rate curves in two identical experiments, do you understand the reason? (two lr-curve screenshots attached) It looks like the first image is the desired result.

    opened by zsgj-Xxx 7
  • Allow parallel patch based training

    Currently Ranger21's variance_normalized occasionally acquires NaNs and faults if used in data-parallel training, i.e. division by zero. This can be mitigated using eps, and I have not observed a difference in results.

    opened by ryancinsight 3
  • Adaptive Gradient Clipping

    Hi @lessw2020, thanks for this awesome work. I came here from the fastai forums and have been playing around with Ranger21 for a few days now. The results seem pretty solid, and in most cases I was easily able to beat Ranger or get comparable results. Just a few points I noticed...

    1. I don't think AGC is working if we train using fp16. I was getting some weird losses if I kept use_adaptive_gradient_clipping on while training in fp16. It works fine if I keep training in fp32 though. Is this something to be expected, or am I doing something wrong?
    2. I also noticed that the learning rate of parameters in Ranger21 is not modified, i.e., optimizer.param_groups[n]["lr"] remains the same throughout. Are you computing the learning rate schedule on the fly and then updating the weights?
    opened by benihime91 3
  • Some fixes when using MADGRAD (Softplus and Stable weight decay)

    Hi,

    While studying the Ranger21 code with MADGRAD as the main optimizer, I asked myself the following questions:

    • Is the Softplus transformation implemented with MADGRAD in your framework?
    • How do you perform stable weight decay with MADGRAD?

    I think you tested the softplus transformation with AdamW as the main optimizer but not with MADGRAD, so I fixed it (I made the assumption that beta=50 is still the optimal beta for softplus). For stable weight decay, I used a cube root instead of a square root in order to have the same range of values for theta(t-1) and m(t). I also replaced len(list(x.size())) with x.dim() in the gradient centralization part (shorter and a bit faster).

    I ran some tests with a CNN architecture on a split of the ESC-50 dataset (spectrograms) to ensure I broke nothing. (two training-curve screenshots attached) The orange curve is Ranger21 without modifications. The blue one is with the softplus modification. The red one is with the softplus modification and the stable weight decay using the cube root. It is not clear if it really helps, but it seems that both modifications improve convergence. I kept the same configuration between runs, which is the following:

    lr: 0.0007
    lookahead_active: False
    lookahead_mergetime: 5
    lookahead_blending_alpha: 0.5
    lookahead_load_at_validation: False
    use_madgrad: True
    use_adabelief: False
    softplus: True
    using_gc: True
    using_normgc: True
    gc_conv_only: False
    normloss_active: True
    normloss_factor: 1e-4
    use_adaptive_gradient_clipping: True
    agc_clipping_value: 1e-2
    agc_eps: 1e-3
    momentum_type: pnm
    pnm_momentum_factor: 1.0
    momentum: 0.9
    eps: 1e-8
    num_batches_per_epoch: 26
    num_epochs: 250
    use_cheb: False
    use_warmup: False
    num_warmup_iterations: None
    warmdown_active: False
    warmdown_start_pct: 0.72
    warmdown_min_lr: 1e-5
    weight_decay: 1e-4
    decay_type: stable
    warmup_type: linear
    warmup_pct_default: 0.22
    logging_active: True
    

    I hope it will help you in your investigations.

    PS: I love your work. I also tried to modify MADGRAD, and thanks to you I discovered great new papers on optimizers.

    opened by TheZothen 2
  • comparing ranger21 to SAM optimizer

    Do you have any metrics on how Ranger21 compares to the new SAM optimizer? I am using fastai and would like to incorporate Ranger and SAM into my pipeline, but don't know which one to start with.

    opened by nikky4D 2
  • torch.grad removed in PyTorch 1.8.1?

    I'm getting the following error with PyTorch 1.8.1

    AttributeError: module 'torch' has no attribute 'grad'

    Swapping Line 515 for with torch.enable_grad(): seems to resolve the error.

    I can't find it in the 1.8 release notes, but it appears torch.grad() might be deprecated? Not sure if anyone else can replicate.

    Cheers!

    opened by jszym 2
  • Augmentation requests

    Those are apparently the most promising optimizers; it would be very useful to see how they compare to RAdam/MadGrad!

    Adabelief https://github.com/lessw2020/Ranger-Deep-Learning-Optimizer/issues/44

    Stochastic weight averaging https://pytorch.org/blog/pytorch-1.6-now-includes-stochastic-weight-averaging/

    Adas https://paperswithcode.com/paper/adas-adaptive-scheduling-of-stochastic

    opened by LifeIsStrange 2
  • Adaptive Gradient Clipping

    Hello, I highly recommend adding AGC as well, as it is extremely helpful for training stability.

    import torch
    import torch.optim as opt

    def unitwise_norm(x):
        # per-output-unit L2 norm for 4D conv weights, full-tensor norm otherwise
        dim = [1, 2, 3] if x.ndim == 4 else 0
        return torch.sum(x**2, dim=dim, keepdim=x.ndim > 1) ** 0.5

    class AGC(opt.Optimizer):
        """Wraps another optimizer and adaptively clips gradients before its step."""
        def __init__(self, params, optim: opt.Optimizer, clipping=1e-2, eps=1e-3):
            self.optim = optim
            defaults = dict(clipping=clipping, eps=eps)
            defaults = {**defaults, **optim.defaults}
            super(AGC, self).__init__(params, defaults)

        @torch.no_grad()
        def step(self, closure=None):
            loss = None
            if closure is not None:
                with torch.enable_grad():
                    loss = closure()

            for group in self.param_groups:
                for p in group['params']:
                    if p.grad is None:
                        continue
                    # clip when the unit-wise gradient norm exceeds
                    # clipping * unit-wise parameter norm
                    param_norm = torch.max(unitwise_norm(p),
                                           torch.tensor(group['eps']).to(p.device))
                    grad_norm = unitwise_norm(p.grad)
                    max_norm = param_norm * group['clipping']
                    trigger = grad_norm > max_norm
                    clipped = p.grad * (max_norm / torch.max(
                        grad_norm, torch.tensor(1e-6).to(p.device)))
                    p.grad.data.copy_(torch.where(trigger, clipped, p.grad))

            self.optim.step(closure)
    
    opened by kayuksel 2
  • File "/home/.../site-packages/ranger21/ranger21.py", line 680, in step raise RuntimeError("hit nan for variance_normalized")

    Any idea what might have happened here?

    The training runs normally with Ranger (20); when switching to 21, it crashes with this error:

      File "/home/.../lib/python3.8/site-packages/ranger21/ranger21.py", line 680, in step
        raise RuntimeError("hit nan for variance_normalized")
    

    BTW, for Ranger20 you recommended training with the Mish activation function; is this also true for Ranger21?

    I am training a segmentation network and some of the samples are completely empty.

    opened by neuronflow 1
  • Activate/deactivate softplus for MADGRAD & choosing beta softplus

    Hi,

    I forgot to add a way to deactivate softplus for MADGRAD in the last PR. I also added an easy way to choose the beta parameter of the softplus transform.

    opened by TheZothen 1
  • Nice name of your project)

    Just wanted to say thank you. I've been "playing" with deep learning libraries and the ZoneMinder DVR solution and really love that sphere as a hobby.

    I will delete this comment later or you can do the same)

    opened by Ranger21 0
  • Gradient normalization lowers the maximum learning rate that can converge.

    I found this problem while training ResNet18 on CIFAR-100 for an experiment. I still haven't looked into this issue enough to find out what the cause is.

    opened by Handagot 0
  • Not support pytorch_1.3.1

    To the developer: thank you for developing such a great optimizer. I have used it with pytorch_1.8 and pytorch_1.9 successfully. When I use pytorch_1.3.1, ranger21 reports some errors. I think ranger21 does not support pytorch_1.3.1. Could you make it available in the future, please? Here is the report info:

    import torch
    import ranger21
    Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/home/huangneng/tools/Ranger21/ranger21/__init__.py", line 1, in <module>
       from .ranger21 import Ranger21
     File "/home/huangneng/tools/Ranger21/ranger21/ranger21.py", line 49, in <module>
       from torch import linalg as LA
    ImportError: cannot import name 'linalg'
    

    Best, Neng

    opened by huangnengCSU 0
  • Require an documentation

    To the developer: thanks a lot for developing such a nice project. There are many parameters to be set in Ranger21, but I don't know what these parameters do. If possible, please provide explanatory documentation.

    Best Neng

    opened by huangnengCSU 2
  • decouple the lr scheduler and optimizer?

    Hi @lessw2020, thanks for the very nice work! I noticed that in Ranger21 the optimizer is tightly coupled with the lr scheduler; could you guide me on how I can decouple them?

    opened by hiyyg 5
Owner: Less Wright - Principal Software Engineer at Audere; PM/Test/Dev at Microsoft; Software Architect at X10 Wireless