AdamW optimizer and cosine learning rate annealing with restarts

Overview

This repository contains an implementation of the AdamW optimization algorithm and of the cosine learning rate scheduler with restarts described in "Decoupled Weight Decay Regularization". The AdamW implementation is straightforward and does not differ much from the existing Adam implementation for PyTorch, except that it separates weight decay from the gradient-based update. The cosine annealing scheduler with restarts allows the model to converge to a (possibly) different local minimum on every restart and normalizes the weight decay hyperparameter according to the length of the restart period. Unlike the schedulers in the standard PyTorch scheduler suite, this scheduler adjusts the optimizer's learning rate not on every epoch but on every batch update, as described in the paper.
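
For intuition, decoupled weight decay means the decay is applied directly to the weights instead of being folded into the gradient as an L2 penalty. Below is a minimal conceptual sketch of a single AdamW-style parameter update, written for this description rather than taken from the repository; the function name and default hyperparameter values are illustrative only.

    import torch

    def adamw_style_update(p, grad, exp_avg, exp_avg_sq, step,
                           lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=1e-2):
        # Illustrative single-tensor AdamW step: Adam update plus decoupled decay.
        beta1, beta2 = betas
        # Decoupled weight decay: shrink the weights directly; the gradient is untouched.
        p.mul_(1 - lr * weight_decay)
        # Standard Adam first/second moment estimates with bias correction.
        exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
        exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
        denom = (exp_avg_sq / (1 - beta2 ** step)).sqrt().add_(eps)
        p.add_(exp_avg / (1 - beta1 ** step) / denom, alpha=-lr)
        return p

For comparison, Adam with classic L2 regularization would instead add weight_decay * p to the gradient before computing the moment estimates, which couples the decay to the adaptive step sizes; removing that coupling is the point of the paper.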

Cyclical Learning Rates

Besides the "cosine" and "arccosine" policies (arccosine has a steeper profile at the limiting points), there are "triangular", "triangular2" and "exp_range" policies, which implement the schedules proposed in "Cyclical Learning Rates for Training Neural Networks". The ratio of the increasing and decreasing phases of the triangular policy can be adjusted with the triangular_step parameter. The minimum allowed lr is set by the min_lr parameter.

  • The triangular schedule is enabled by passing the policy="triangular" parameter.
  • The triangular2 schedule reduces the maximum lr by half on each restart cycle and is enabled by passing the policy="triangular2" parameter, or by combining the parameters policy="triangular", eta_on_restart_cb=ReduceMaxLROnRestart(ratio=0.5). The ratio parameter sets the factor by which the maximum lr is scaled on each restart.
  • The exp_range schedule is enabled by passing the policy="exp_range" parameter. It scales the maximum lr exponentially with the iteration count; the base of the exponentiation is set by the gamma parameter.

These schedules can be combined with shrinking/expanding restart periods and weight decay normalization, and they can be used with AdamW and other PyTorch optimizers, as sketched below.
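
For illustration, the snippets below show how these policies might be selected with the CyclicLRWithRestarts constructor used in the example further down; the keyword names follow the descriptions above, while the concrete values for triangular_step, ratio and gamma are placeholders rather than recommendations.

    # triangular, with an adjusted up/down ratio and a lower bound on lr
    scheduler = CyclicLRWithRestarts(optimizer, batch_size, epoch_size,
                                     restart_period=5, t_mult=1.2,
                                     policy="triangular", triangular_step=0.5,
                                     min_lr=1e-6)

    # triangular2: either directly, or as triangular plus a restart callback
    scheduler = CyclicLRWithRestarts(optimizer, batch_size, epoch_size,
                                     restart_period=5, t_mult=1.2,
                                     policy="triangular2")
    scheduler = CyclicLRWithRestarts(optimizer, batch_size, epoch_size,
                                     restart_period=5, t_mult=1.2,
                                     policy="triangular",
                                     eta_on_restart_cb=ReduceMaxLROnRestart(ratio=0.5))

    # exp_range: maximum lr scaled by gamma ** iteration
    scheduler = CyclicLRWithRestarts(optimizer, batch_size, epoch_size,
                                     restart_period=5, t_mult=1.2,
                                     policy="exp_range", gamma=0.9999)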

Example:

    batch_size = 32
    epoch_size = 1024
    model = resnet()
    optimizer = AdamW(model.parameters(), lr=1e-3, weight_decay=1e-5)
    scheduler = CyclicLRWithRestarts(optimizer, batch_size, epoch_size,
                                     restart_period=5, t_mult=1.2, policy="cosine")
    for epoch in range(100):
        scheduler.step()                # advance the epoch-level schedule
        for batch in train_loader:      # iterate over the batches of one epoch
            ...
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            scheduler.batch_step()      # adjust lr and weight decay after every batch
        validate(...)
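
In this example, restart_period=5 sets the length of the first cosine cycle in epochs, while t_mult stretches each subsequent cycle. Assuming t_mult is a period multiplication factor in the SGDR sense (an assumption based on the parameter name, not something stated above), the cycle lengths would grow roughly as follows:

    # Assumed behaviour: each restart period is t_mult times the previous one
    restart_period, t_mult = 5, 1.2
    periods = []
    for _ in range(4):
        periods.append(round(restart_period, 2))
        restart_period *= t_mult
    print(periods)  # [5, 6.0, 7.2, 8.64] epochs
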
Comments
  • StopIteration

    Hi, thank you for sharing. Following your description, I tried to use your code in my project, but I got an error in 'scheduler.batch_step()'; it happened on the line 't_cur = self.t_epoch + next(self.batch_increment)'.

    opened by reborm 7
  • Hypergradient Descent

    Thank you for sharing this. Would it be possible for you to also integrate the Hypergradient Descent technique into your AdamW implementation? It reduces the need to tune the initial learning rate. https://github.com/gbaydin/hypergradient-descent

        if state['step'] > 1:
            prev_bias_correction1 = 1 - beta1 ** (state['step'] - 1)
            prev_bias_correction2 = 1 - beta2 ** (state['step'] - 1)
            # Hypergradient for Adam:
            h = torch.dot(grad.view(-1),
                          torch.div(exp_avg, exp_avg_sq.sqrt().add_(group['eps'])).view(-1)) \
                * math.sqrt(prev_bias_correction2) / prev_bias_correction1
            # Hypergradient descent of the learning rate:
            group['lr'] += group['hypergrad_lr'] * h

    I have also read a lot of criticism of AMSGrad and haven't yet been able to get any improvement with that variant. Could you please share your thoughts on that? FYI, two other techniques that I am currently experimenting with are Padam and QHAdam.

    opened by akaniklaus 5
  • Lower/Upper Bound for LR and Upper Bound decay

    Hey there,

    Nice update of the scheduler! It's really useful!

    It would also be nice to have the possibility to set the following parameters: base_lr, max_lr and scale_fn.

    The scale_fn would be a function that decreases the max_lr:

    • by half after each period, while keeping the base lr constant,
    • by a factor of gamma**(iterations),
    • or by whatever lambda_function is given.

    Here an example implementation in Keras: https://github.com/bckenstler/CLR

    I tried to hack this myself but I'm stuck. I'm not entirely sure which eta you use (is it the one from weight decay?). And even if I'm right, I can't persist my hack because of the lambda function -.-

    I'm also not sure why, but in my case (super-resolution), when using cosine/arccosine my model diverges every time after restarting (AdamW, wd=1e-6). It happens with triangular too, but not directly at the start of the second cycle. Do you maybe have an idea where it could come from?

    Thanks for your time!

    opened by uyekt 2
  • Persisting CosineAnnealingLRWithRestarts

    Hi there,

    Up to now all my schedulers inherited from _LRScheduler, so I didn't need to care too much about how they would be persisted.

    For my checkpoints I define my state like this:

    state = {
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),
        "scheduler_state": scheduler.state_dict(),
    }

    However, CosineAnnealingLRWithRestarts does not have this state_dict() method.

    I checked the implementation of state_dict() in the documentation: https://pytorch.org/docs/stable/_modules/torch/optim/lr_scheduler.html#LambdaLR

    and tried to extend your code myself, but I probably missed something. Could you take a look?

    Diffs are:

    I inherit the class from _LRScheduler:

    from torch.optim.lr_scheduler import _LRScheduler
    
    class CosineAnnealingLRWithRestarts(_LRScheduler):
    
    

    And rewrite the state_dict()

    
    
        def state_dict(self):
            """Returns the state of the scheduler as a :class:`dict`.
    
            It contains an entry for every variable in self.__dict__ which
            is not the optimizer.
            The learning rate lambda functions will only be saved if they are callable objects
            and not if they are functions or lambdas.
            """
            state_dict = {key: value for key, value in self.__dict__.items() if key not in ('optimizer', 'base_lrs', 'base_weight_decays')}
            state_dict['base_lrs'] = [None] * len(self.base_lrs)
            state_dict['base_weight_decays'] = [None] * len(self.base_weight_decays)
    
            for idx, fn in enumerate(self.base_weight_decays):
                if not isinstance(fn, types.FunctionType):
                    # state_dict['base_weight_decays'][idx] = fn.__dict__.copy()
                    state_dict['base_weight_decays'][idx] = fn
    
            for idx, fn in enumerate(self.base_lrs):
                if not isinstance(fn, types.FunctionType):
                    # state_dict['base_lrs'][idx] = fn.__dict__.copy()
                    state_dict['base_lrs'][idx] = fn
    
    
            return state_dict
    
        def load_state_dict(self, state_dict):
            """Loads the schedulers state.
    
            Arguments:
                state_dict (dict): scheduler state. Should be an object returned
                    from a call to :meth:`state_dict`.
            """
            base_lrs = state_dict.pop('base_lrs')
            base_weight_decays = state_dict.pop('base_weight_decays')
    
            self.__dict__.update(state_dict)
    
            for idx, fn in enumerate(base_lrs):
                if fn is not None:
                    self.base_lrs[idx] = fn        
    
            for idx, fn in enumerate(base_weight_decays):
                if fn is not None:
                    self.base_weight_decays[idx] = fn
    

    However, I still get AttributeError: Can't pickle local object 'Tensor.__iter__.<locals>.<lambda>'. It would be terrific to be able to persist the state of this scheduler :-)

    opened by uyekt 2
  • scheduler.batch_step() AttributeError: 'CosineLRWithRestarts' object has no attribute 'batch_increment'

    Z:\sp2\nhdeblur_pytorch>python "train.py" 1>"train_log.txt"
    Traceback (most recent call last):
      File "train.py", line 140, in <module>
        train(train_gen=trainloader, model=model, criterion=criterion, optimizer=optimizer, epoch=epoch)
      File "train.py", line 115, in train
        scheduler.batch_step()
      File "Z:\sp2\nhdeblur_pytorch\cosine_scheduler.py", line 110, in batch_step
        t_cur = self.t_epoch + next(self.batch_increment)
    AttributeError: 'CosineLRWithRestarts' object has no attribute 'batch_increment'

    optimizer = adamw.AdamW(model.parameters(), lr=opt.lr, weight_decay=0)
    scheduler = cosine_scheduler.CosineLRWithRestarts(optimizer, batch_size=opt.batch_size, epoch_size=len(src_set), restart_period=5, t_mult=1.2)
    
    def train(train_gen, model, criterion, optimizer, epoch):
        epoch_loss = 0
        for iteration, batch in enumerate(train_gen, 1):
            nr = batch[0].to(device)
            hr = batch[1].to(device)
            
            optimizer.zero_grad()
            loss = criterion(model(nr), hr)
            epoch_loss += loss.item()
            loss.backward()
            optimizer.step()
            scheduler.batch_step()
        
            if iteration % 1000 == 0:
                print('===> Epoch[{e}]({it}/{dl}): Loss{l:.4f};'.format(e=epoch, it=iteration, dl=len(train_gen), l=loss.cpu()))
                
        Current_time = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime())
        epoch_loss_average = epoch_loss / len(train_gen)
        print('===> {ct} Epoch {e} Complete: Avg Loss: {avg_loss:.4f}, Sum Loss: {sum_loss:.4f}'
              .format(e=epoch, avg_loss=epoch_loss_average, sum_loss=epoch_loss, ct=Current_time))
    
    opened by Ken1256 1
  • LR Scheduler help

    Can you please help me write my own learning rate scheduler? I couldn't find much documentation on how to write one in PyTorch. I went through this MXNet guide and came to the conclusion that I could do the following:

    lrs = [scheduler(i+1) for i in range(epochs*batch_size)]
    iters = 1
    for i in range(epochs):
        for data, label in train:
            ...  # backward and calculate loss
            for group in optimizer.param_groups:
                group['lr'] = lrs[iters]
            optimizer.step()
            iters += 1

    What is the more elegant way of doing it?

    opened by swagato-c 1
  • Getting Stop Iteration when running for training

    StopIteration                             Traceback (most recent call last)
    in ()
          1 training(model=model, epoch=20, eval_every=500,
          2          loss_func=loss_function, optimizer=optimizer, train_iter=train_iter,
    ----> 3          val_iter=val_iter, scheduler=scheduler, warmup_epoch=3, early_stop=2)

    in training(epoch, model, eval_every, loss_func, optimizer, train_iter, val_iter, scheduler, early_stop, warmup_epoch)
         37     loss.backward()
         38     optimizer.step()
    ---> 39     scheduler.batch_step()
         40     if step % eval_every == 0:
         41         model.eval()

    in batch_step(self)
        274
        275     def batch_step(self):
    --> 276         t_cur = self.t_epoch + next(self.batch_increment)
        277         for param_group, (lr, weight_decay) in zip(self.optimizer.param_groups,
        278                                                    self.get_lr(t_cur)):

    StopIteration:

    opened by enzoampil 0
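
A pattern worth noting across the comments above (the two StopIteration tracebacks and the missing batch_increment error): batch_increment appears to be created by the per-epoch scheduler.step() call and to yield one value per expected batch, so skipping scheduler.step() or passing an epoch_size/batch_size pair that does not match the actual number of batches per epoch can produce exactly these errors. The sketch below repeats the loop shape from the usage example above; it is an inference from the tracebacks rather than a confirmed fix, and train_dataset, train_loader and num_epochs are placeholder names.

    # epoch_size should reflect the dataset the loader actually iterates over
    scheduler = CyclicLRWithRestarts(optimizer, batch_size=batch_size,
                                     epoch_size=len(train_dataset),
                                     restart_period=5, t_mult=1.2)
    for epoch in range(num_epochs):
        scheduler.step()            # (re)initializes the per-epoch batch_increment iterator
        for batch in train_loader:
            ...
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            scheduler.batch_step()  # one call per batch, matching epoch_size / batch_size
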
Owner
Maksym Pyrozhok