Ranger - a synergistic optimizer using RAdam (Rectified Adam), Gradient Centralization and LookAhead in one codebase

Less Wright

Last update: Dec 21, 2022

Overview

Ranger-Deep-Learning-Optimizer

Ranger - a synergistic optimizer combining RAdam (Rectified Adam) and LookAhead, and now GC (gradient centralization) in one optimizer.

Quick note - Ranger21 is now in beta; it is Ranger with a host of new improvements.

Recommend you compare results with Ranger21: https://github.com/lessw2020/Ranger21

Latest version 20.9.4 - updates Gradient Centralization to GC2 (thanks to the GC developer) and removes addcmul_ deprecation warnings in PyTorch 1.6.0.

*Latest version is in ranger2020.py - looking at a few other additions before integrating into the main ranger.py.

What is Gradient Centralization? "GC can be viewed as a projected gradient descent method with a constrained loss function. The Lipschitzness of the constrained loss function and its gradient is better so that the training process becomes more efficient and stable." Source paper: https://arxiv.org/abs/2004.01461v2
Ranger now uses Gradient Centralization by default and applies it to all conv and fc layers. Everything is customizable, so you can test with and without it on your own datasets (turn it on or off via the "use_gc" flag at init).
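
As a rough illustration of the operation described above, the sketch below centralizes a gradient by subtracting, from each output unit's slice of the gradient, the mean of that slice. It uses plain Python lists rather than tensors, and the name centralize_gradient is an assumption made for this example, not part of the Ranger API.

```python
def centralize_gradient(grad):
    """Return a copy of a 2-D gradient with each row shifted to zero mean.

    Each row holds the gradient for one output unit (one flattened conv
    filter or one fc neuron). Hypothetical helper, for illustration only.
    """
    centered = []
    for row in grad:
        row_mean = sum(row) / len(row)
        centered.append([g - row_mean for g in row])
    return centered


# Toy gradient for a layer with 2 output units and 3 inputs.
grad = [[1.0, 2.0, 3.0], [4.0, 4.0, 10.0]]
centered = centralize_gradient(grad)
print(centered)  # [[-1.0, 0.0, 1.0], [-2.0, -2.0, 4.0]]
```

After centralization every row sums to zero, which is the constraint the quoted paper refers to.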

Best training results - keep the learning rate flat for the first 75% of training, then either step down and run a lower lr for the remaining 25%, or cosine-anneal over the last 25%.

Per extensive testing - It's important to note that simply running one learning rate the entire time will not produce optimal results.
Effectively Ranger will end up 'hovering' around the optimal zone, but can't descend into it unless it has some additional run time at a lower rate to drop down into the optimal valley.
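
The flat-then-anneal recommendation above can be sketched as a small schedule function. This is a minimal, framework-free sketch: the function name flat_then_cosine_lr and its parameters are assumptions for illustration, and in practice you would plug the returned value into whatever scheduler mechanism your training loop uses.

```python
import math


def flat_then_cosine_lr(step, total_steps, base_lr, flat_fraction=0.75, min_lr=0.0):
    """Learning rate at `step`: flat for the first part, cosine decay after.

    Hypothetical helper; flat_fraction=0.75 mirrors the 75% / 25% split
    recommended above.
    """
    flat_steps = int(total_steps * flat_fraction)
    if step < flat_steps:
        return base_lr
    anneal_steps = max(1, total_steps - flat_steps)
    progress = min(1.0, (step - flat_steps) / anneal_steps)
    # Cosine factor goes from 1 (start of anneal) down to 0 (end of training).
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Learning rate for each of 100 training steps.
schedule = [flat_then_cosine_lr(s, 100, 1e-3) for s in range(100)]
```

The first 75 entries of schedule equal the base rate; the remaining 25 decrease smoothly toward min_lr.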

Full customization at init:

Ranger will now print out id and gc settings at init so you can confirm the optimizer settings at train time:

/////////////////////

Medium article with more info:
https://medium.com/@lessw/new-deep-learning-optimizer-ranger-synergistic-combination-of-radam-lookahead-for-the-best-of-2dc83f79a48d

Multiple updates: 1 - Ranger is the optimizer we used to beat the high scores for 12 different categories on the FastAI leaderboards! (Previous records all held with AdamW optimizer).

2 - Highly recommend combining Ranger with: Mish activation function, and flat+ cosine anneal training curve.

3 - Based on that, also found .95 is better than .90 for beta1 (momentum) param (ala betas=(0.95, 0.999)).

Fixes: 1 - Differential group learning rates are now supported. This was fixed in RAdam and ported here thanks to @sholderbach.

2 - Save and then load could leave first-run weights stranded in memory, slowing down future runs. This is now fixed.

Installation

Clone the repo, cd into it and install it in editable mode (-e option). That way, there is no need to re-install the package after each modification.

git clone https://github.com/lessw2020/Ranger-Deep-Learning-Optimizer
cd Ranger-Deep-Learning-Optimizer
pip install -e .

Usage

from ranger import Ranger  # this is from ranger.py
from ranger import RangerVA  # this is from ranger913A.py
from ranger import RangerQH  # this is from rangerqh.py

# Define your model
model = ...
# Each of the Ranger, RangerVA, RangerQH have different parameters.
optimizer = Ranger(model.parameters(), **kwargs)

Usage and notebook to test are available here: https://github.com/lessw2020/Ranger-Mish-ImageWoof-5

Citing this work

We recommend you use the following to cite Ranger in your publications:

@misc{Ranger,
  author = {Wright, Less},
  title = {Ranger - a synergistic optimizer.},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/lessw2020/Ranger-Deep-Learning-Optimizer}}
}

Comments

BUG: Module not added to package; not importable
ranger currently cannot be used from a pip install because the ranger module was not added to the package. The package is entirely empty, resulting in the following error:

$ pip install git+https://github.com/lessw2020/Ranger-Deep-Learning-Optimizer.git@73811db2eb55e1e3e3b736177cafaebe4807d669
[...]
Installing collected packages: ranger
Successfully installed ranger-0.0.1
$ python
>>> import ranger
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'ranger'
>>> from ranger import Ranger
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'ranger'
>>> import ranger.ranger
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'ranger'

This is resolved by using setuptools.find_packages to add the ranger module to the package.

I also added the README.md contents as long_description in setup.py, and incremented the version number to 0.1.dev0. Using dev or dev0 as the patch number indicates that the version is unstable, i.e. there is a one-to-many mapping from the version number, 0.1.dev0, to the state of the codebase in the repository.

I can split this out into multiple PRs if you'd prefer.
opened by scottclowe 8
How to cite Ranger in a paper?

In my recent paper I used Ranger. I want to give the author(s) all the credit they deserve, but I'm not sure how to cite it properly. Currently I cite the Medium article. Should I cite this GitHub repo instead? Thanks.

opened by askerlee 6
step_counter not set
Hi, thanks for your work.

I just plugged it into my model and found that step_counter was not set for all param_groups.

I fixed it with this hack:

#look ahead tracking and updating if latest batch = k
for group,slow_weights in zip(self.param_groups,self.slow_weights):
    if 'step_counter' not in group:
        group["step_counter"] = 0

but I suspect it's not optimal... this would mean that self.param_groups changed between the constructor and step(), but I have no idea why. Have you seen something similar before?

Thanks
opened by m-toman 5
Making it a python package

Would you like to make this a python package that could be installed with pip? It would be more practical.

I'd like to include it in my repo asteroid and give you proper credit for it.

One way is to install a python package (I can make a PR for that), the other one would be to copy-paste some of the code and point to the license file. Which way would you prefer?

opened by mpariente 4
Do we need some kind of Learning rate decay with Ranger?

For AdamW, people usually add some sort of learning rate decay: linear, cosine, triangular, etc. Warm-up steps are also popular.

Do we need all of these with Ranger or just use a fixed learning rate?

opened by avostryakov 3
Is there a publication of Ranger?

I want to cite ranger on a Medium article and I would like to know if there is an arXiv publication of Ranger or a published peer-reviewed paper on some conference or journal.

I saw you linked a paper in the README.md, but it does not seem to be about Ranger, as the word itself does not appear anywhere in it. I know the RAdam and Lookahead papers, but a Ranger one is missing from my library. Thanks

opened by nuzrub 2
Make Ranger a python package

As discussed in #20, it would be really practical to have these optimizers in a package. This PR makes that possible. ranger can be installed and imported in any Python project, so there is no need to copy-paste the ranger optimizer anymore. This also makes it possible to give you proper credit wherever it is used (I'll add it to the requirements in asteroid, for example).

I also updated the README with the install and usage instructions. You can clone the repo, then install it with pip, in editable mode (pretty practical for research) or not.

Note : In the __init__.py I imported the three optimizers so that they can be imported from the package directly (from ranger import RangerQH instead of from ranger.rangerqh import RangerQH). Both ranger.py and ranger913A.py had the same name for the class, so I changed the class in ranger913A.py to RangerVA (for versionA).

I'd like to hear from you; it would be very practical to have this as a package.

opened by mpariente 2
N_sma_threshhold should be instance variable

Thank you for the great implementation. I think I found a small part to modify at ranger.py line 116.

original code: if N_sma > N_sma_threshhold:

proposed code: if N_sma > self.N_sma_threshhold:

opened by ohmorimori 2
N_sma_threshhold

You first have if N_sma > self.N_sma_threshhold:

and then you have if N_sma > 4:

Is it right that the second one is constant or should that also be N_sma_threshhold parameter?
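
For reference, the RAdam rectification logic that these thresholds guard can be isolated as below. This is a minimal sketch based on the formula in the RangerVA step() listing posted further down this page; the function name radam_step_scale and its return layout are assumptions for illustration. Note that the rectification term takes the square root of an expression containing (N_sma - 4), so it is only defined when N_sma is greater than 4, which is one plausible reason for a hard-coded lower bound.

```python
import math


def radam_step_scale(step, beta1=0.95, beta2=0.999, n_sma_threshold=5):
    """Return (n_sma, scale, rectified) for a given 1-based step count.

    Hypothetical helper mirroring the N_sma logic in the listing on this
    page; `scale` is the factor applied before multiplying by the lr.
    """
    beta2_t = beta2 ** step
    n_sma_max = 2 / (1 - beta2) - 1
    n_sma = n_sma_max - 2 * step * beta2_t / (1 - beta2_t)
    bias_correction1 = 1 - beta1 ** step
    if n_sma > n_sma_threshold:
        # Variance is considered tractable: apply the rectification term.
        rect = math.sqrt(
            (1 - beta2_t)
            * (n_sma - 4) / (n_sma_max - 4)
            * (n_sma - 2) / n_sma
            * n_sma_max / (n_sma_max - 2)
        )
        return n_sma, rect / bias_correction1, True
    # Early steps: fall back to a plain bias-corrected momentum update.
    return n_sma, 1.0 / bias_correction1, False


early = radam_step_scale(1)
late = radam_step_scale(1000)
```

With the default betas, the first step is not rectified (N_sma is about 1), while step 1000 is.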

opened by kayuksel 2
Let's revolutionize the AI research field

Hi, I have a dream and I'll try to share it with you.

But before explaining further, I'd like you to analyze this input and tell me what you think about it!

Small rant on the inertia of AI research

First of all, thank you for advancing progress in deep learning.

I'm just a random guy who wants to implement an AGI (lol) and, like many NLP engineers, I need HIGHLY accurate neural networks for fundamental NLP tasks (e.g. POS tagging, NER, dependency parsing, coreference resolution, WSD, etc.). They are all not very accurate (often sub-95% F1 score) and their errors add up.

Such limitations make NLP not yet suitable for many things. This is why improving the state of the art (which can be observed on paperswithcode.com) is a crucial priority for academics.

Effectively, many researchers have smart ideas to improve the state of the art and often slightly improve it by: Having a "standard neural network" for the task and mix with it their new fancy idea.

I speak from knowledge: I've read most papers on the state-of-the-art leaderboards for most fundamental NLP tasks. Almost always they have this common baseline + one idea, theirs. The common baseline sometimes slowly evolves (e.g. now it's often a pre-trained model (say BERT) + fine-tuning + their idea).

Sorry to say, but "this" seems absurd to me, where "this" means the fact that most researchers by far work in isolation, not integrating others' ideas (or only with great inertia). I would have wished that the state of the art in one NLP task would be a combination of, e.g., 50 innovative and complementary ideas from researchers. You are researchers; do you have an idea why that is the case? If someone actually tried to merge all good, complementary and compatible ideas, would they have the best, unmatchable state of the art? Why don't facebookresearch, Microsoft and Google try the low-hanging fruit and, in addition to producing X new shiny ideas per month, actually try to merge them in a coherent, synergetic manner? I would like you to tell me what you think of this major issue that slows AI progress.

As an example of such inertia, let's talk about Swish, Mish or RAdam: those things are incredibly easy to try, to see whether they give a neural network free accuracy gains. Yet no paper on the state-of-the-art leaderboards has tried Swish, Mish or RAdam, despite them being so simple to try (you don't need to change the neural network). Not even the pre-trained models that so many papers depend on have tried them (I opened issues for each of them).

Once I know what you think about this research inertia, I'll explain my vision of what needs to be done to fix it.

opened by LifeIsStrange 2

Not working using cuda

Variables self.slow_weights are always on cpu. You can easily fix this by adding a .to() method in Ranger class like so:

def to(self, device):
    # Compare strings with ==; "is" checks object identity and is unreliable here.
    if device == "cuda":
        for i in range(len(self.slow_weights)):
            for j, w in enumerate(self.slow_weights[i]):
                self.slow_weights[i][j] = w.cuda()
    elif device == "cpu":
        for i in range(len(self.slow_weights)):
            for j, w in enumerate(self.slow_weights[i]):
                self.slow_weights[i][j] = w.cpu()

opened by Fable67 2

Collate pip package so that it picks up from main repo.

Actually, there is a pip package but it is based out of a fork of this repo. I think it would make sense to collate this effort to the main repo.

Originally posted by @sarthakpati in https://github.com/lessw2020/Ranger-Deep-Learning-Optimizer/issues/33#issuecomment-821314754

opened by sarthakpati 2
Please note in the documentation (or in the constructor) that closures must be enabled

Hi,

Today I had a relatively long debugging session after upgrading my PyTorch Lightning installation, because training_step wasn't being called.

It finally turned out that the problem was that the "closure" argument is not used in the step function (it is commented out, as also noted in the source code).

However, as it is apparently required by some libraries and is also recommended by the official PyTorch guidelines, it would be great if it were better documented that people might need to enable these lines.

Thanks in advance.

opened by ABotond 0
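
To make the closure convention discussed above concrete, here is a torch-free toy sketch of a step() method that honors an optional closure. The ToySGD class and its dict-based parameters are assumptions invented for this example; it only mirrors the shape of the step(closure=None) convention used by PyTorch optimizers and is not Ranger's implementation.

```python
class ToySGD:
    """Minimal optimizer-like class; hypothetical, for illustration only."""

    def __init__(self, params, lr=0.1):
        self.params = params  # list of dicts with "value" and "grad" keys
        self.lr = lr

    def step(self, closure=None):
        loss = None
        if closure is not None:
            # The closure re-evaluates the model, fills in the gradients
            # and returns the loss. Skipping this call means the
            # forward/backward pass never runs.
            loss = closure()
        for p in self.params:
            p["value"] -= self.lr * p["grad"]
        return loss


w = {"value": 1.0, "grad": 0.0}
optimizer = ToySGD([w], lr=0.1)


def closure():
    # Loss f(w) = w**2 with gradient 2*w.
    w["grad"] = 2.0 * w["value"]
    return w["value"] ** 2


loss = optimizer.step(closure)
```

If step() ignored the closure, the gradient would stay at zero and the returned loss would be None, which is consistent with the symptom described in this issue for frameworks that drive training through the closure.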

This overload of addcmul_ is deprecated: addcmul_(Number value, Tensor tensor1, Tensor tensor2)

I get the following warning when using ranger with pytorch 1.6.0

/path/Ranger-Deep-Learning-Optimizer/ranger/ranger.py:138: UserWarning: This overload of addcmul_ is deprecated:
        addcmul_(Number value, Tensor tensor1, Tensor tensor2)
Consider using one of the following signatures instead:
        addcmul_(Tensor tensor1, Tensor tensor2, *, Number value) (Triggered internally at  /pytorch/torch/csrc/utils/python_arg_parser.cpp:766.)
  exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)

opened by neuronflow 5

RangerVA with GC

Hello,

Thank you for your work on these optimizers, by the way. I was testing a couple of them and was getting quite good results with RangerVA originally. Then, when your gradient centralization was added, I got further improvements, but it also seemed to overfit the training set more easily despite using the same parameters. Therefore, I tried combining gradient centralization into the RangerVA algorithm, and so far it seems to be performing quite well, and faster, since it seems I can use larger batch sizes. I was wondering if you could quickly check, whenever you have some free time, whether I implemented it correctly in the code below, since you are so used to this optimizer.

Best

import math

import torch
from torch.optim.optimizer import Optimizer


class RangerVA(Optimizer):

def __init__(self, params, lr=1e-3, 
             alpha=0.5, k=6, n_sma_threshhold=5, betas=(.95,0.999), 
             eps=1e-5, weight_decay=0, amsgrad=True, transformer='softplus', smooth=50,
             grad_transformer='square',use_gc=True, gc_conv_only=False):
    #parameter checks
    if not 0.0 <= alpha <= 1.0:
        raise ValueError(f'Invalid slow update rate: {alpha}')
    if not 1 <= k:
        raise ValueError(f'Invalid lookahead steps: {k}')
    if not lr > 0:
        raise ValueError(f'Invalid Learning Rate: {lr}')
    if not eps > 0:
        raise ValueError(f'Invalid eps: {eps}')

    #prep defaults and init torch.optim base
    defaults = dict(lr=lr, alpha=alpha, k=k, step_counter=0, betas=betas, 
                    n_sma_threshhold=n_sma_threshhold, eps=eps, weight_decay=weight_decay,
                    smooth=smooth, transformer=transformer, grad_transformer=grad_transformer,
                   amsgrad=amsgrad,use_gc=use_gc, gc_conv_only=gc_conv_only )
    super().__init__(params,defaults)

    #adjustable threshold
    self.n_sma_threshhold = n_sma_threshhold   

    #look ahead params
    self.alpha = alpha
    self.k = k 

    #radam buffer for state
    self.radam_buffer = [[None,None,None] for ind in range(10)]
    
    #gc on or off
    self.use_gc=use_gc
    #level of gradient centralization
    self.gc_gradient_threshold = 3 if gc_conv_only else 1
    print(f"Ranger optimizer loaded. \nGradient Centralization usage = {self.use_gc}")
    if (self.use_gc and self.gc_gradient_threshold==1):
        print(f"GC applied to both conv and fc layers")
    elif (self.use_gc and self.gc_gradient_threshold==3):
        print(f"GC applied to conv layers only")


def __setstate__(self, state):
    print("set state called")
    super(RangerVA, self).__setstate__(state)


def step(self, closure=None):
    loss = None
    #Evaluate averages and grad, update param tensors
    for group in self.param_groups:
        for p in group['params']:
            if p.grad is None:
                continue
            grad = p.grad.data.double()
            if grad.is_sparse:
                raise RuntimeError('Ranger optimizer does not support sparse gradients')
            
            amsgrad = group['amsgrad']
            smooth = group['smooth']
            grad_transformer = group['grad_transformer']

            p_data_fp32 = p.data.double()

            state = self.state[p]  #get state dict for this param

            if len(state) == 0:   
                state['step'] = 0
                state['exp_avg'] = torch.zeros_like(p_data_fp32)
                state['exp_avg_sq'] = torch.zeros_like(p_data_fp32)
                if amsgrad:
                    # Maintains max of all exp. moving avg. of sq. grad. values
                    state['max_exp_avg_sq'] = torch.zeros_like(p.data)                    

                #look ahead weight storage now in state dict 
                state['slow_buffer'] = torch.empty_like(p.data)
                state['slow_buffer'].copy_(p.data)

            else:
                state['exp_avg'] = state['exp_avg'].type_as(p_data_fp32)
                state['exp_avg_sq'] = state['exp_avg_sq'].type_as(p_data_fp32)
                                  

            #begin computations 
            exp_avg, exp_avg_sq = state['exp_avg'], state['exp_avg_sq']
            beta1, beta2 = group['betas']
            if amsgrad:
                max_exp_avg_sq = state['max_exp_avg_sq']  
                # Maintains the maximum of all 2nd moment running avg. till now
                torch.max(max_exp_avg_sq, exp_avg_sq, out=max_exp_avg_sq)
                # Use the max. for normalizing running avg. of gradient
                denomc = max_exp_avg_sq.clone()
            else:
                denomc = exp_avg_sq.clone()
            #GC operation for Conv layers and FC layers       
            if grad.dim() > self.gc_gradient_threshold:                    
                grad.add_(-grad.mean(dim = tuple(range(1,grad.dim())), keepdim = True))

            state['step'] += 1              

            #compute variance mov avg (keyword form avoids the deprecated addcmul_ overload)
            exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
            #compute mean moving avg
            exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
            buffered = self.radam_buffer[int(state['step'] % 10)]
            if state['step'] == buffered[0]:
                N_sma, step_size = buffered[1], buffered[2]
            else:
                buffered[0] = state['step']
                beta2_t = beta2 ** state['step']
                N_sma_max = 2 / (1 - beta2) - 1
                N_sma = N_sma_max - 2 * state['step'] * beta2_t / (1 - beta2_t)
                buffered[1] = N_sma
                if N_sma > self.n_sma_threshhold:
                    step_size = math.sqrt((1 - beta2_t) * (N_sma - 4) / (N_sma_max - 4) * (N_sma - 2) / N_sma * N_sma_max / (N_sma_max - 2)) / (1 - beta1 ** state['step'])
                else:
                    step_size = 1.0 / (1 - beta1 ** state['step'])
                buffered[2] = step_size

            
            ##transformer
            if grad_transformer == 'square':
                grad_tmp = grad**2
                denomc.sqrt_() 
            elif grad_transformer == 'abs':
                grad_tmp = grad.abs()


            exp_avg_sq.mul_(beta2).add_((1 - beta2)*grad_tmp)

            if group['weight_decay'] != 0:
                p_data_fp32.add_(p_data_fp32, alpha=-group['weight_decay'] * group['lr'])
            bias_correction1 = 1 - beta1 ** state['step']
            bias_correction2 = 1 - beta2 ** state['step']
            step_size = group['lr'] * math.sqrt(bias_correction2) / bias_correction1                

            
            # ...let's use calibrated alr 
            if N_sma > self.n_sma_threshhold:
                if group['transformer'] == 'softplus':
                    sp = torch.nn.Softplus(smooth)
                    denomf = sp(denomc)
                    p_data_fp32.addcdiv_(exp_avg, denomf, value=-step_size)
                else:
                    denom = exp_avg_sq.sqrt().add_(group['eps'])
                    p_data_fp32.addcdiv_(exp_avg, denom, value=-step_size * group['lr'])
            else:
                p_data_fp32.add_(exp_avg, alpha=-step_size * group['lr'])
            p.data.copy_(p_data_fp32)

            #integrated look ahead...
            #we do it at the param level instead of group level
            if state['step'] % group['k'] == 0:
                slow_p = state['slow_buffer'] #get access to slow param tensor
                slow_p.add_(p.data - slow_p, alpha=self.alpha)  #(fast weights - slow weights) * alpha
                p.data.copy_(slow_p)  #copy interpolated weights to RAdam param tensor

    return loss

opened by ryancinsight 0
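
The integrated LookAhead step at the end of the listing above (interpolate the slow weights toward the fast weights every k steps, then copy them back) can be isolated in a few lines. This is a framework-free sketch on plain Python lists; lookahead_sync and train are hypothetical names, and plain gradient descent on f(w) = w**2 stands in for the inner RAdam update.

```python
def lookahead_sync(fast, slow, alpha=0.5):
    """Move slow weights toward fast weights, then reset fast to slow (in place)."""
    for i in range(len(slow)):
        slow[i] += alpha * (fast[i] - slow[i])
        fast[i] = slow[i]


def train(steps=12, k=6, lr=0.1, alpha=0.5):
    """Toy loop: gradient descent on f(w) = w**2 with a LookAhead sync every k steps."""
    fast = [1.0]
    slow = list(fast)
    for step in range(1, steps + 1):
        grad = [2.0 * w for w in fast]                   # gradient of w**2
        fast = [w - lr * g for w, g in zip(fast, grad)]  # inner (fast) update
        if step % k == 0:
            lookahead_sync(fast, slow, alpha)
    return fast, slow


fast_w, slow_w = train()
```

With alpha=0.5, each sync places the weights halfway between the previous slow weights and the current fast weights.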

Owner

Less Wright

Principal Software Engineer at Audere; PM/Test/Dev at Microsoft; Software Architect at X10 Wireless

GitHub

Implements Gradient Centralization and allows it to use as a Python package in TensorFlow

Gradient Centralization TensorFlow This Python package implements Gradient Centralization in TensorFlow, a simple and effective optimization technique

101 Nov 1, 2022

Code and datasets for the paper "KnowPrompt: Knowledge-aware Prompt-tuning with Synergistic Optimization for Relation Extraction"

KnowPrompt Code and datasets for our paper "KnowPrompt: Knowledge-aware Prompt-tuning with Synergistic Optimization for Relation Extraction" Requireme

137 Dec 31, 2022

With this package, you can generate mixed-integer linear programming (MIP) models of trained artificial neural networks (ANNs) using the rectified linear unit (ReLU) activation function

With this package, you can generate mixed-integer linear programming (MIP) models of trained artificial neural networks (ANNs) using the rectified linear unit (ReLU) activation function. At the moment, only TensorFlow sequential models are supported. Interfaces to either the Pyomo or Gurobi modeling environments are offered.

40 Dec 27, 2022

How Do Adam and Training Strategies Help BNNs Optimization? In ICML 2021.

AdamBNN This is the pytorch implementation of our paper "How Do Adam and Training Strategies Help BNNs Optimization?", published in ICML 2021. In this

47 Sep 20, 2022

This is the official PyTorch implementation of the paper "TransFG: A Transformer Architecture for Fine-grained Recognition" (Ju He, Jie-Neng Chen, Shuai Liu, Adam Kortylewski, Cheng Yang, Yutong Bai, Changhu Wang, Alan Yuille).

TransFG: A Transformer Architecture for Fine-grained Recognition Official PyTorch code for the paper: TransFG: A Transformer Architecture for Fine-gra

307 Jan 3, 2023

PyTorch implementation of our Adam-NSCL algorithm from our CVPR2021 (oral) paper "Training Networks in Null Space for Continual Learning"

Adam-NSCL This is a PyTorch implementation of Adam-NSCL algorithm for continual learning from our CVPR2021 (oral) paper: Title: Training Networks in N

34 Dec 21, 2022

A PyTorch implementation of Learning to learn by gradient descent by gradient descent

Intro PyTorch implementation of Learning to learn by gradient descent by gradient descent. Run python main.py TODO Initial implementation Toy data LST

300 Dec 11, 2022

AdamW optimizer and cosine learning rate annealing with restarts

AdamW optimizer and cosine learning rate annealing with restarts This repository contains an implementation of AdamW optimization algorithm and cosine

133 Dec 20, 2022

A mini library for Policy Gradients with Parameter-based Exploration, with reference implementation of the ClipUp optimizer from NNAISENSE.

PGPElib A mini library for Policy Gradients with Parameter-based Exploration [1] and friends. This library serves as a clean re-implementation of the

56 Jan 1, 2023

PyTorch implementation DRO: Deep Recurrent Optimizer for Structure-from-Motion

auto-tuning momentum SGD optimizer

Apollo optimizer in tensorflow

This is an implementation of Googles Yogi-Optimizer in Keras (tf.keras)

DeepOBS: A Deep Learning Optimizer Benchmark Suite

An Implicit Function Theorem (IFT) optimizer for bi-level optimizations

AdamW optimizer for bfloat16 models in pytorch.

Storage-optimizer - Identify potintial optimizations on the cloud storage accounts

ESGD-M - A stochastic non-convex second order optimizer, suitable for training deep learning models, for PyTorch

This is the codebase for the ICLR 2021 paper Trajectory Prediction using Equivariant Continuous Convolution