AdamW optimizer and cosine learning rate annealing with restarts

Overview

This repository contains an implementation of the AdamW optimization algorithm and of the cosine learning rate scheduler with restarts described in "Decoupled Weight Decay Regularization". The AdamW implementation is straightforward and does not differ much from the existing Adam implementation for PyTorch, except that it separates weight decay from the gradient-based update. The cosine annealing scheduler with restarts allows the model to converge to a (possibly) different local minimum on every restart and normalizes the weight decay hyperparameter according to the length of the restart period. Unlike the schedulers in the standard PyTorch scheduler suite, this scheduler adjusts the optimizer's learning rate not on every epoch but on every batch update, as described in the paper.
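
For intuition, decoupled weight decay means the decay is applied directly to the weights instead of being folded into the gradient as an L2 penalty. Below is a minimal conceptual sketch of a single AdamW-style parameter update, written for this description rather than taken from the repository; the function name and default hyperparameter values are illustrative only.

    import torch

    def adamw_style_update(p, grad, exp_avg, exp_avg_sq, step,
                           lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=1e-2):
        # Illustrative single-tensor AdamW step: Adam update plus decoupled decay.
        beta1, beta2 = betas
        # Decoupled weight decay: shrink the weights directly; the gradient is untouched.
        p.mul_(1 - lr * weight_decay)
        # Standard Adam first/second moment estimates with bias correction.
        exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
        exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
        denom = (exp_avg_sq / (1 - beta2 ** step)).sqrt().add_(eps)
        p.add_(exp_avg / (1 - beta1 ** step) / denom, alpha=-lr)
        return p

For comparison, Adam with classic L2 regularization would instead add weight_decay * p to the gradient before computing the moment estimates, which couples the decay to the adaptive step sizes; removing that coupling is the point of the paper.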

Cyclical Learning Rates

Besides the "cosine" and "arccosine" policies (arccosine has a steeper profile at the limiting points), there are "triangular", "triangular2" and "exp_range" policies, which implement the schedules proposed in "Cyclical Learning Rates for Training Neural Networks". The ratio of the increasing and decreasing phases of the triangular policy can be adjusted with the triangular_step parameter. The minimum allowed lr is set by the min_lr parameter.

  • The triangular schedule is enabled by passing the policy="triangular" parameter.
  • The triangular2 schedule reduces the maximum lr by half on each restart cycle and is enabled by passing the policy="triangular2" parameter, or by combining the parameters policy="triangular", eta_on_restart_cb=ReduceMaxLROnRestart(ratio=0.5). The ratio parameter sets the factor by which the maximum lr is scaled on each restart.
  • The exp_range schedule is enabled by passing the policy="exp_range" parameter. It scales the maximum lr exponentially with the iteration count; the base of the exponentiation is set by the gamma parameter.

These schedules can be combined with shrinking/expanding restart periods and weight decay normalization, and they can be used with AdamW and other PyTorch optimizers, as sketched below.
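
For illustration, the snippets below show how these policies might be selected with the CyclicLRWithRestarts constructor used in the example further down; the keyword names follow the descriptions above, while the concrete values for triangular_step, ratio and gamma are placeholders rather than recommendations.

    # triangular, with an adjusted up/down ratio and a lower bound on lr
    scheduler = CyclicLRWithRestarts(optimizer, batch_size, epoch_size,
                                     restart_period=5, t_mult=1.2,
                                     policy="triangular", triangular_step=0.5,
                                     min_lr=1e-6)

    # triangular2: either directly, or as triangular plus a restart callback
    scheduler = CyclicLRWithRestarts(optimizer, batch_size, epoch_size,
                                     restart_period=5, t_mult=1.2,
                                     policy="triangular2")
    scheduler = CyclicLRWithRestarts(optimizer, batch_size, epoch_size,
                                     restart_period=5, t_mult=1.2,
                                     policy="triangular",
                                     eta_on_restart_cb=ReduceMaxLROnRestart(ratio=0.5))

    # exp_range: maximum lr scaled by gamma ** iteration
    scheduler = CyclicLRWithRestarts(optimizer, batch_size, epoch_size,
                                     restart_period=5, t_mult=1.2,
                                     policy="exp_range", gamma=0.9999)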

Example:

    batch_size = 32
    epoch_size = 1024
    model = resnet()
    optimizer = AdamW(model.parameters(), lr=1e-3, weight_decay=1e-5)
    scheduler = CyclicLRWithRestarts(optimizer, batch_size, epoch_size,
                                     restart_period=5, t_mult=1.2, policy="cosine")
    for epoch in range(100):
        scheduler.step()                # advance the epoch-level schedule
        for batch in train_loader:      # iterate over the batches of one epoch
            ...
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            scheduler.batch_step()      # adjust lr and weight decay after every batch
        validate(...)
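
In this example, restart_period=5 sets the length of the first cosine cycle in epochs, while t_mult stretches each subsequent cycle. Assuming t_mult is a period multiplication factor in the SGDR sense (an assumption based on the parameter name, not something stated above), the cycle lengths would grow roughly as follows:

    # Assumed behaviour: each restart period is t_mult times the previous one
    restart_period, t_mult = 5, 1.2
    periods = []
    for _ in range(4):
        periods.append(round(restart_period, 2))
        restart_period *= t_mult
    print(periods)  # [5, 6.0, 7.2, 8.64] epochs
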
Comments
  • StopIteration

    Hi, thank you for sharing. Following your description, I tried to use your code in my project, but I got an error in 'scheduler.batch_step()'; it happened on the line 't_cur = self.t_epoch + next(self.batch_increment)'.

    opened by reborm 7
  • Hypergradient Descent

    Thank you for sharing this. Would it be possible for you to also integrate the Hypergradient Descent technique into your AdamW implementation? It reduces the need to tune the initial learning rate. https://github.com/gbaydin/hypergradient-descent

        if state['step'] > 1:
            prev_bias_correction1 = 1 - beta1 ** (state['step'] - 1)
            prev_bias_correction2 = 1 - beta2 ** (state['step'] - 1)
            # Hypergradient for Adam:
            h = torch.dot(grad.view(-1),
                          torch.div(exp_avg, exp_avg_sq.sqrt().add_(group['eps'])).view(-1)) \
                * math.sqrt(prev_bias_correction2) / prev_bias_correction1
            # Hypergradient descent of the learning rate:
            group['lr'] += group['hypergrad_lr'] * h

    I have also read a lot of criticism of AMSGrad and haven't yet been able to get any improvement with that variant. Could you please share your thoughts on that? FYI, two other techniques that I am currently experimenting with are Padam and QHAdam.

    opened by akaniklaus 5
  • Lower/Upper Bound for LR and Upper Bound decay

    Hey there,

    Nice update of the scheduler! It's really useful!

    It would also be nice to have the possibility to set the following parameters: base_lr, max_lr and scale_fn.

    The scale_fn would be a function that decreases the max_lr:

    • by half after each period, while keeping the base lr constant,
    • by a factor of gamma**(iterations),
    • or by whatever lambda_function is given.

    Here an example implementation in Keras: https://github.com/bckenstler/CLR

    I tried to hack this myself but I'm stuck. I'm not entirely sure which eta you use (is it the one from weight decay?). And even if I'm right, I can't persist my hack because of the lambda function -.-

    I'm also not sure why, but in my case (super-resolution), when using cosine/arccosine my model diverges every time after restarting (AdamW, wd=1e-6). It happens with triangular too, but not directly at the start of the second cycle. Do you maybe have an idea where it could come from?

    Thanks for your time!

    opened by uyekt 2
  • Persisting CosineAnnealingLRWithRestarts

    Hi there,

    Up to now all my schedulers inherited from _LRScheduler, so I didn't need to care too much about how they would be persisted.

    For my checkpoints I define my state like this:

    state = {
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),
        "scheduler_state": scheduler.state_dict(),
    }

    However, CosineAnnealingLRWithRestarts does not have this state_dict() method.

    I checked the implementation of state_dict() in the documentation: https://pytorch.org/docs/stable/_modules/torch/optim/lr_scheduler.html#LambdaLR

    and tried to extend your code myself, but I probably missed something. Could you take a look?

    Diffs are:

    I inherit the class from _LRScheduler:

    from torch.optim.lr_scheduler import _LRScheduler
    
    class CosineAnnealingLRWithRestarts(_LRScheduler):
    
    

    And rewrite the state_dict()

    
    
        def state_dict(self):
            """Returns the state of the scheduler as a :class:`dict`.
    
            It contains an entry for every variable in self.__dict__ which
            is not the optimizer.
            The learning rate lambda functions will only be saved if they are callable objects
            and not if they are functions or lambdas.
            """
            state_dict = {key: value for key, value in self.__dict__.items() if key not in ('optimizer', 'base_lrs', 'base_weight_decays')}
            state_dict['base_lrs'] = [None] * len(self.base_lrs)
            state_dict['base_weight_decays'] = [None] * len(self.base_weight_decays)
    
            for idx, fn in enumerate(self.base_weight_decays):
                if not isinstance(fn, types.FunctionType):
                    # state_dict['base_weight_decays'][idx] = fn.__dict__.copy()
                    state_dict['base_weight_decays'][idx] = fn
    
            for idx, fn in enumerate(self.base_lrs):
                if not isinstance(fn, types.FunctionType):
                    # state_dict['base_lrs'][idx] = fn.__dict__.copy()
                    state_dict['base_lrs'][idx] = fn
    
    
            return state_dict
    
        def load_state_dict(self, state_dict):
            """Loads the schedulers state.
    
            Arguments:
                state_dict (dict): scheduler state. Should be an object returned
                    from a call to :meth:`state_dict`.
            """
            base_lrs = state_dict.pop('base_lrs')
            base_weight_decays = state_dict.pop('base_weight_decays')
    
            self.__dict__.update(state_dict)
    
            for idx, fn in enumerate(base_lrs):
                if fn is not None:
                    self.base_lrs[idx] = fn        
    
            for idx, fn in enumerate(base_weight_decays):
                if fn is not None:
                    self.base_weight_decays[idx] = fn
    

    However, I still get AttributeError: Can't pickle local object 'Tensor.__iter__.<locals>.<lambda>'. It would be terrific to be able to persist the state of this scheduler :-)

    opened by uyekt 2
  • scheduler.batch_step() AttributeError: 'CosineLRWithRestarts' object has no attribute 'batch_increment'

    Z:\sp2\nhdeblur_pytorch>python "train.py" 1>"train_log.txt"
    Traceback (most recent call last):
      File "train.py", line 140, in <module>
        train(train_gen=trainloader, model=model, criterion=criterion, optimizer=optimizer, epoch=epoch)
      File "train.py", line 115, in train
        scheduler.batch_step()
      File "Z:\sp2\nhdeblur_pytorch\cosine_scheduler.py", line 110, in batch_step
        t_cur = self.t_epoch + next(self.batch_increment)
    AttributeError: 'CosineLRWithRestarts' object has no attribute 'batch_increment'

    optimizer = adamw.AdamW(model.parameters(), lr=opt.lr, weight_decay=0)
    scheduler = cosine_scheduler.CosineLRWithRestarts(optimizer, batch_size=opt.batch_size, epoch_size=len(src_set), restart_period=5, t_mult=1.2)
    
    def train(train_gen, model, criterion, optimizer, epoch):
        epoch_loss = 0
        for iteration, batch in enumerate(train_gen, 1):
            nr = batch[0].to(device)
            hr = batch[1].to(device)
            
            optimizer.zero_grad()
            loss = criterion(model(nr), hr)
            epoch_loss += loss.item()
            loss.backward()
            optimizer.step()
            scheduler.batch_step()
        
            if iteration % 1000 == 0:
                print('===> Epoch[{e}]({it}/{dl}): Loss{l:.4f};'.format(e=epoch, it=iteration, dl=len(train_gen), l=loss.cpu()))
                
        Current_time = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime())
        epoch_loss_average = epoch_loss / len(train_gen)
        print('===> {ct} Epoch {e} Complete: Avg Loss: {avg_loss:.4f}, Sum Loss: {sum_loss:.4f}'
              .format(e=epoch, avg_loss=epoch_loss_average, sum_loss=epoch_loss, ct=Current_time))
    
    opened by Ken1256 1
  • LR Scheduler help

    Can you please help me write my own learning rate scheduler? I couldn't find much documentation on how to write one in PyTorch. I went through this MXNet guide and came to the conclusion that I could do the following:

    lrs = [scheduler(i+1) for i in range(epochs*batch_size)]
    iters = 1
    for i in range(epochs):
        for data, label in train:
            ...  # backward and calculate loss
            for group in optimizer.param_groups:
                group['lr'] = lrs[iters]
            optimizer.step()
            iters += 1

    What is the more elegant way of doing it?

    opened by swagato-c 1
  • Getting Stop Iteration when running for training

    StopIteration                             Traceback (most recent call last)
    in ()
          1 training(model=model, epoch=20, eval_every=500,
          2          loss_func=loss_function, optimizer=optimizer, train_iter=train_iter,
    ----> 3          val_iter=val_iter, scheduler=scheduler, warmup_epoch=3, early_stop=2)

    in training(epoch, model, eval_every, loss_func, optimizer, train_iter, val_iter, scheduler, early_stop, warmup_epoch)
         37     loss.backward()
         38     optimizer.step()
    ---> 39     scheduler.batch_step()
         40     if step % eval_every == 0:
         41         model.eval()

    in batch_step(self)
        274
        275     def batch_step(self):
    --> 276         t_cur = self.t_epoch + next(self.batch_increment)
        277         for param_group, (lr, weight_decay) in zip(self.optimizer.param_groups,
        278                                                    self.get_lr(t_cur)):

    StopIteration:

    opened by enzoampil 0
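
A pattern worth noting across the comments above (the two StopIteration tracebacks and the missing batch_increment error): batch_increment appears to be created by the per-epoch scheduler.step() call and to yield one value per expected batch, so skipping scheduler.step() or passing an epoch_size/batch_size pair that does not match the actual number of batches per epoch can produce exactly these errors. The sketch below repeats the loop shape from the usage example above; it is an inference from the tracebacks rather than a confirmed fix, and train_dataset, train_loader and num_epochs are placeholder names.

    # epoch_size should reflect the dataset the loader actually iterates over
    scheduler = CyclicLRWithRestarts(optimizer, batch_size=batch_size,
                                     epoch_size=len(train_dataset),
                                     restart_period=5, t_mult=1.2)
    for epoch in range(num_epochs):
        scheduler.step()            # (re)initializes the per-epoch batch_increment iterator
        for batch in train_loader:
            ...
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            scheduler.batch_step()  # one call per batch, matching epoch_size / batch_size
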
Owner
Maksym Pyrozhok