Sharpness-Aware Minimization for Efficiently Improving Generalization

Sayak Paul

Last update: Dec 8, 2022

Related tags

Deep Learning deep-neural-networks computer-vision tensorflow generalization tpu-acceleration loss-landscape

Overview

Sharpness-Aware-Minimization-TensorFlow

This repository provides a minimal implementation of sharpness-aware minimization (SAM), introduced in the paper "Sharpness-Aware Minimization for Efficiently Improving Generalization", in TensorFlow 2. SAM is motivated by the connection between the geometry of the loss landscape of deep neural networks and their generalization ability. It attempts to minimize the loss value and the loss sharpness simultaneously, thereby seeking parameters that lie in neighborhoods with uniformly low loss. This differs from traditional SGD-based optimization, which seeks parameters that individually have a low loss value. A figure in the original paper demonstrates the effects of using SAM.
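The two-step nature of a SAM update can be illustrated without any deep learning framework. The following is a minimal sketch on a toy quadratic loss, not the code from the notebook: the names `loss`, `grad`, and `sam_step`, the toy loss itself, and the learning rate are assumptions made for illustration. It computes a gradient, moves to a nearby point in the direction of increasing loss, takes the gradient there, and applies that gradient to the original parameters.

```python
import math

def loss(w):
    # Toy quadratic loss, steeper along the second coordinate (illustrative only).
    return 0.5 * (w[0] ** 2 + 10.0 * w[1] ** 2)

def grad(w):
    # Analytic gradient of the toy loss above.
    return [w[0], 10.0 * w[1]]

def sam_step(w, rho=0.05, lr=0.1):
    # 1) Gradient at the current parameters.
    g = grad(w)
    norm = math.sqrt(sum(gi * gi for gi in g)) + 1e-12
    # 2) Perturb the parameters along the gradient, scaled to L2 length rho.
    eps = [rho * gi / norm for gi in g]
    w_adv = [wi + ei for wi, ei in zip(w, eps)]
    # 3) Gradient at the perturbed parameters.
    g_sam = grad(w_adv)
    # 4) Apply that gradient to the original (unperturbed) parameters.
    return [wi - lr * gi for wi, gi in zip(w, g_sam)]

w = [1.0, 1.0]
initial_loss = loss(w)
for _ in range(100):
    w = sam_step(w)
final_loss = loss(w)
print(initial_loss, final_loss)
```

Each step costs two gradient evaluations instead of one, which is where the extra training time of SAM comes from.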

My goal with this repository is to be able to quickly train neural networks with and without SAM. All the experiments are shown in the SAM.ipynb notebook, which is end-to-end executable on Google Colab. Furthermore, it uses the free TPUs (TPUv2-8) that Google Colab provides, allowing readers to experiment very quickly.

Notes

Before moving to the findings, please be aware of the following notable differences between my implementation and the original paper:

ResNet20 (attributed to this repository) is used as opposed to PyramidNet and WideResNet.
ShakeDrop regularization has not been used.
Two simple augmentation transformations (random crop and random brightness) have been used as opposed to Cutout and AutoAugment.
Adam has been used as the optimizer with the default arguments provided by TensorFlow, together with a ReduceLROnPlateau callback. Table 1 of the original paper suggests using SGD with different configurations.
Instead of training for the full number of epochs, I used early stopping with a patience of 10.

SAM has only one hyperparameter, rho, which controls the size of the neighborhood in parameter space. In my experiments, it defaults to 0.05. For other details related to the training configuration (i.e. network depth, learning rate, batch size, etc.), please refer to the notebook.
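For reference, the role of rho can be written out following the formulation in the original SAM paper, for the L2-norm case and with the weight decay term omitted. The symbols w (parameters), L (training loss), and epsilon (perturbation) are notation assumed here for illustration, not names from the notebook. The first expression is the objective; the second is the first-order approximation of the worst-case perturbation inside the rho-ball.

```latex
\min_{w} \; \max_{\|\epsilon\|_2 \le \rho} L(w + \epsilon),
\qquad
\hat{\epsilon}(w) = \rho \, \frac{\nabla_w L(w)}{\|\nabla_w L(w)\|_2}
```

A larger rho therefore means the gradient used for the update is taken further away from the current parameters.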

Findings

Method       Number of Parameters (million)  Final Test Accuracy (%)
With SAM     0.575114                        80.5
Without SAM  0.575114                        83.1

Acknowledgements

David Samuel's PyTorch implementation

Comments

Potential bug

The structure of the train_step in cell 8 of the notebook is very unconventional.

def train_step(self, data):
    ... first model evaluation
    ... first tape gradient
    ... second model evaluation
    ... update parameters
    ... second tape gradient

Usually, for the parameter update to affect the second tape gradient, the update must come before the second model evaluation.

def train_step(self, data):
    ... first model evaluation
    ... first tape gradient
    ... update parameters
    ... second model evaluation
    ... second tape gradient
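The effect of the two orderings can be sketched on a scalar toy problem. This is an illustration only, not the notebook's train_step: the function names, the loss L(w) = w**2, and the `perturb_first` flag are hypothetical, and the sketch assumes that the point at which the model is evaluated is the point at which the second gradient is taken.

```python
def grad(w):
    # Gradient of the toy loss L(w) = w**2.
    return 2.0 * w

def second_gradient(w, rho, perturb_first):
    g1 = grad(w)                           # first gradient
    eps = rho * g1 / (abs(g1) + 1e-12)     # ascent step scaled to length rho
    if perturb_first:
        w_eval = w + eps                   # update parameters, then evaluate
    else:
        w_eval = w                         # evaluate before the update is applied
    return g1, grad(w_eval)                # second gradient

g1, g_late = second_gradient(1.0, 0.05, perturb_first=False)
_, g_sam = second_gradient(1.0, 0.05, perturb_first=True)
print(g1, g_late, g_sam)
```

In this toy setting, evaluating before the update yields a second gradient identical to the first, so the step reduces to an ordinary gradient step; only the second ordering produces a gradient taken at the perturbed point.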

opened by rainwoodman 5

Adding a data generator

That's a great example - thanks. When I try replacing train_ds with a data generator though, I get "NotImplementedError: When subclassing the Model class, you should implement a call method." I also tried adding this call method to SAMModel, but that didn't work either:

def call(self, inputs):
    return self.resnet_model(inputs)

Any ideas? The attached main.py runs your code plus a simple data generator which just shuffles the training data.

main.zip
opened by jhuus 5

Fix #7

Fix #7.

The SAM training accuracy is near 82.5%, increased from the quoted numbers on the home page.

The non-SAM training appears to have terminated early at 5m30 (~70 epochs), at a lower accuracy than the number on the home page.

opened by rainwoodman 4

Memory requirements

Hi,

Great work. From the notebook it looks like the runtime is about a third longer with SAM. Is that right? That's not too bad.

But how much more memory does the training require? We are calculating gradients twice for each step, right? Don't we need more memory for that?

PS: If I set the learning rate to 1e-2, the model without SAM clearly outperforms the model with SAM.
With
Epoch 97/100 49/49 [==============================] - 4s 78ms/step - loss: 0.5391 - accuracy: 0.7984 - val_loss: 0.5719 - val_accuracy: 0.8091 - lr: 0.0025
Without
Epoch 53/200 49/49 [==============================] - 5s 101ms/step - loss: 0.2032 - accuracy: 0.9299 - val_loss: 0.4490 - val_accuracy: 0.8669 - lr: 7.8125e-05

opened by grofte 1

Reproducing WRN-28-10 (SAM) for SVHN dataset

I am trying to reproduce the results for WRN-28-10 (SAM) trained on 10-class classification SVHN dataset (Percentage Error 0.99) - https://paperswithcode.com/sota/image-classification-on-svhn

I'm able to train WRN-28-10 using https://github.com/hysts/pytorch_wrn (Modified the script to incorporate SAM into it)

I'm achieving a test accuracy of 93%. How can I replicate the SOTA Percentage Error of 0.99 for WRN-28-10 (SAM)? Which hyperparameters should I use?

Any help is appreciated!!

opened by akshaydp1995 1
