wide-minima-density-hypothesis
Details about the wide minima density hypothesis and metrics to compute the width of a minimum
This repo presents the wide minima density hypothesis as proposed in the paper Wide-minima Density Hypothesis and the Explore-Exploit Learning Rate Schedule (arXiv:2003.03977).
Key contributions:
- A hypothesis about the relative density of wide vs. narrow minima in the loss landscape
- A SOTA LR schedule (the Knee schedule) that exploits the hypothesis and beats standard baseline schedules
- Reduced wall-clock training time and GPU compute hours with our LR schedule (pretraining BERT-Large in 33% fewer training steps)
- SOTA BLEU score on IWSLT'14 (DE-EN)
Prerequisites:
- CUDA, cudnn
- Python 3.6+
- PyTorch 1.4.0
Knee LR Schedule
Based on the density of wide vs. narrow minima, we propose the Knee LR schedule, which pushes generalization boundaries further by exploiting the nature of the loss landscape. It is an explore-exploit schedule: the explore phase maintains a high learning rate for a significant portion of training in order to reach and land in a wide minimum with good probability, while the exploit phase simply decays the learning rate linearly to zero. The only hyperparameter to tune is the number of explore epochs/steps, and we have shown that allocating 50% of the training budget to the explore phase is enough to land in a wider minimum and generalize better, effectively removing the need for hyperparameter tuning.
- Note that many experiments require warmup, which is done over a fixed number of steps at the start of training and is usually needed for Adam-based optimizers and large-batch training. Warmup is complementary to the Knee schedule and can be combined with it (see the sketch below).
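For intuition, here is a minimal sketch of the learning-rate curve described above, assuming linear warmup to peak_lr, a constant peak_lr during the explore phase, and a linear decay to zero over the remaining steps; the actual KneeLRScheduler in this repo may differ in implementation details:

def knee_lr_sketch(step, peak_lr, warmup_steps, explore_steps, total_steps):
    """Illustrative LR rule: warmup -> constant explore -> linear decay to zero."""
    if step < warmup_steps:
        # Linear warmup from 0 to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    if step < warmup_steps + explore_steps:
        # Explore phase: hold the peak LR to land in a wide minimum with good probability.
        return peak_lr
    # Exploit phase: linear decay from peak_lr down to zero.
    decay_steps = max(1, total_steps - warmup_steps - explore_steps)
    steps_into_decay = step - warmup_steps - explore_steps
    return peak_lr * max(0.0, 1.0 - steps_into_decay / decay_steps)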
To use the Knee Schedule, import the scheduler into your training file:
>>> from knee_lr_schedule import KneeLRScheduler
>>> scheduler = KneeLRScheduler(optimizer, peak_lr, warmup_steps, explore_steps, total_steps)
To use it during training:
>>> model.train()
>>> optimizer.zero_grad()
>>> output = model(inputs)
>>> loss = criterion(output, targets)
>>> loss.backward()
>>> optimizer.step()
>>> scheduler.step()  # update the learning rate once per optimizer step
Details about args:
- optimizer : optimizer used for training the model (SGD/Adam)
- peak_lr : the peak learning rate required in the explore phase to escape narrow minima
- warmup_steps : steps used for warmup (usually needed for Adam optimizers/large-batch training). Default: 0
- explore_steps : total steps for the explore phase
- total_steps : total training budget, in steps, for training the model
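For example, with a 100k-step budget and the 50% explore allocation recommended above (the specific numbers here, including the peak LR of 0.1, are purely illustrative):

>>> total_steps = 100000
>>> warmup_steps = 10000              # set to 0 if your setup does not need warmup
>>> explore_steps = total_steps // 2  # 50% of the training budget for the explore phase
>>> scheduler = KneeLRScheduler(optimizer, 0.1, warmup_steps, explore_steps, total_steps)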
Measuring the width of a minimum
Keskar et al., 2016 (https://arxiv.org/abs/1609.04836) argue that wider minima generalize much better than sharper minima. The computation in their work uses the compute-expensive, second-order L-BFGS-B method, which is hard to scale. We instead use a projected-gradient-ascent-based method, which is first-order in nature and very easy to implement and use. Here is a simple way to compute the width of the minimum your model finds during training:
>>> from minima_width_compute import ComputeKeskarSharpness
>>> cks = ComputeKeskarSharpness(model_final_ckpt, optimizer, criterion, trainloader, epsilon, lr, max_steps)
>>> width = cks.compute_sharpness()
Details about args:
- model_final_ckpt : model loaded with the saved checkpoint after the final training step
- optimizer : optimizer to use for projected gradient ascent (SGD, Adam)
- criterion : criterion for computing the loss (e.g. torch.nn.CrossEntropyLoss)
- trainloader : iterator over the training dataset (torch.utils.data.DataLoader)
- epsilon : determines the local boundary around the minimum within which the width is computed (Default: 1e-4)
- lr : learning rate for the optimizer performing projected gradient ascent (Default: 0.001)
- max_steps : maximum number of steps used to compute the width (Default: 1000). Setting it too low could prevent the projected gradient ascent from converging to an optimal point.
The above default values were chosen after tuning and observing the loss values of projected gradient ascent on CIFAR-10 with ResNet-18 and the SGD-momentum optimizer, as mentioned in our paper. The values may vary for experiments with other optimizers/datasets/models; please tune them for optimal convergence.
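For intuition, below is a rough sketch of what a projected-gradient-ascent width computation looks like, assuming a Keskar-style box constraint of radius epsilon around the trained weights and that model and data live on the same device; the actual ComputeKeskarSharpness class in this repo may differ in details such as the exact projection, normalization, and the quantity it reports:

import torch

def width_sketch(model, criterion, trainloader, epsilon=1e-4, lr=0.001, max_steps=1000):
    # Remember the trained weights; the width is measured around this point.
    center = [p.detach().clone() for p in model.parameters()]
    ascent_opt = torch.optim.SGD(model.parameters(), lr=lr)
    max_loss, data_iter = 0.0, iter(trainloader)
    for _ in range(max_steps):
        try:
            inputs, targets = next(data_iter)
        except StopIteration:
            data_iter = iter(trainloader)
            inputs, targets = next(data_iter)
        ascent_opt.zero_grad()
        loss = criterion(model(inputs), targets)
        (-loss).backward()  # gradient *ascent*: maximize the loss near the minimum
        ascent_opt.step()
        with torch.no_grad():
            # Project each parameter back into the epsilon-box around the trained weights.
            for p, c in zip(model.parameters(), center):
                bound = epsilon * (c.abs() + 1.0)
                p.copy_(torch.max(torch.min(p, c + bound), c - bound))
        max_loss = max(max_loss, loss.item())
    # A small maximum loss inside the box indicates a wide (flat) minimum.
    return max_loss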
- Acknowledgements: We would like to thank Harshay Shah (https://github.com/harshays) for helpful discussions on computing the width of minima.
Citation
Please cite our paper in your publications if you use our work:
@article{iyer2020wideminima,
title={Wide-minima Density Hypothesis and the Explore-Exploit Learning Rate Schedule},
author={Iyer, Nikhil and Thejas, V and Kwatra, Nipun and Ramjee, Ramachandran and Sivathanu, Muthian},
journal={arXiv preprint arXiv:2003.03977},
year={2020}
}
- Note: This work was done during an internship at Microsoft Research India.