Deep Residual Learning for Image Recognition

This is a Torch implementation of "Deep Residual Learning for Image Recognition" by Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, the winners of the 2015 ILSVRC and COCO challenges.

What's working: CIFAR converges, as per the paper.

What's not working yet: ImageNet. I have also only implemented Option (A) for the residual network bottleneck strategy.

Changes

  • 2016-02-01: Added others' preliminary results on ImageNet for the architecture. (I haven't found time to train ImageNet yet)
  • 2016-01-21: Completed the 'alternate solver' experiments on deep networks. These ones take quite a long time.
  • 2016-01-19:
    • New results: Re-ran the 'alternate building block' results on deeper networks. They have more of an effect.
    • Added a table of contents to avoid getting lost.
    • Added experimental artifacts (log of training loss and test error, the saved model, any patches used on the source code, etc.) for two of the more interesting experiments, for curious folks who want to reproduce our results. (These artifacts are hereby released under the zlib license.)
  • 2016-01-15:
    • New CIFAR results: I re-ran all the CIFAR experiments and updated the results. There were a few bugs: we were only testing on the first 2,000 images in the training set, and they were sampled with replacement. These new results are much more stable over time.
  • 2016-01-12: Release results of CIFAR experiments.

How to use

  • You need at least CUDA 7.0 and CuDNN v4.
  • Install Torch.
  • Install the Torch CUDNN V4 library: git clone https://github.com/soumith/cudnn.torch; cd cudnn.torch; git checkout R4; luarocks make
    This will give you cudnn.SpatialBatchNormalization, which helps save quite a lot of memory.
  • Install nninit: luarocks install nninit.
  • Download CIFAR 10. Use --dataRoot to specify the location of the extracted CIFAR 10 folder.
  • Run train-cifar.lua.

CIFAR: Effect of model size

For this test, our goal is to reproduce Figure 6 from the original paper:

figure 6 from original paper

We train our model for 200 epochs (this is about 7.8e4 of their iterations on the above graph). Like their paper, we start at a learning rate of 0.1 and reduce it to 0.01 at 80 epochs and then to 0.001 at 160 epochs.
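As a rough illustration of the schedule described above, here is a minimal Python sketch (plain Python, not the Torch code in this repository). The function names are hypothetical, epochs are assumed to be counted from 0, and the dataset size (50,000 training images) and mini-batch size (128) are assumptions taken from the paper's CIFAR-10 setup rather than from this README.

```python
# Assumptions, not taken from this repository's code:
# CIFAR-10 has 50,000 training images and the mini-batch size is 128.
TRAIN_IMAGES = 50_000
BATCH_SIZE = 128


def iterations_for(epochs):
    """Convert a number of epochs into mini-batch iterations."""
    return epochs * TRAIN_IMAGES / BATCH_SIZE


def learning_rate(epoch):
    """Step schedule: divide the learning rate by 10 at epochs 80 and 160."""
    if epoch < 80:
        return 0.1
    if epoch < 160:
        return 0.01
    return 0.001


print(iterations_for(200))  # 78125.0, i.e. about 7.8e4 iterations
```

Under these assumptions, 200 epochs correspond to 78,125 iterations, which matches the "about 7.8e4" figure quoted above.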

Training loss

Training loss curve

Testing error

Test error curve

| Model | My Test Error | Reference Test Error from Tab. 6 | Artifacts |
|---|---|---|---|
| Nsize=3, 20 layers | 0.0829 | 0.0875 | Model, Loss and Error logs, Source commit + patch |
| Nsize=5, 32 layers | 0.0763 | 0.0751 | Model, Loss and Error logs, Source commit + patch |
| Nsize=7, 44 layers | 0.0714 | 0.0717 | Model, Loss and Error logs, Source commit + patch |
| Nsize=9, 56 layers | 0.0694 | 0.0697 | Model, Loss and Error logs, Source commit + patch |
| Nsize=18, 110 layers, fancy policy¹ | 0.0673 | 0.0661² | Model, Loss and Error logs, Source commit + patch |

We can typically reproduce the results from the paper to within 0.5 percentage points. For the 20-, 44-, and 56-layer networks, we achieve very slightly lower error than the reference, though this may just be noise.

¹: For this run, we used a learning rate of 0.001 for the first 400 iterations. We then raised the learning rate to 0.1 and trained as usual. This is consistent with the actual paper's results.

²: Note that the paper reports the best run from five runs, as well as the mean. I consider the mean to be a valid test protocol, but I don't like reporting the 'best' score because this is effectively training on the test set. (This method of reporting effectively introduces an extra parameter into the model--which model to use from the ensemble--and this parameter is fitted to the test set)
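The Nsize values and layer counts in the table above follow the paper's CIFAR layout of 6n + 2 weighted layers. A small Python sketch of that arithmetic (the function name is hypothetical and is not part of this repository):

```python
def resnet_depth(nsize):
    """Layer count for the paper's CIFAR networks: 6n + 2."""
    return 6 * nsize + 2


for nsize in (3, 5, 7, 9, 18):
    print(nsize, resnet_depth(nsize))
```

This gives 20, 32, 44, 56, and 110 layers for the five models listed.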

CIFAR: Effect of model architecture

This experiment explores the effect of different NN architectures that alter the "Building Block" model inside the residual network.

The original paper used a "Building Block" similar to the "Reference" model on the left part of the figure below, with the standard convolution layer, batch normalization, and ReLU, followed by another convolution layer and batch normalization. The only interesting piece of this architecture is that they move the ReLU after the addition.

We investigated two alternate strategies.

Three different alternate CIFAR architectures

  • Alternate 1: Move batch normalization after the addition. (Middle) The reasoning behind this choice is to test whether normalizing the first term of the addition is desirable. It grew out of the mistaken belief that batch normalization always normalizes to have zero mean and unit variance. If this were true, building an identity building block would be impossible because the input to the addition always has unit variance. However, this is not true. BN layers have additional learnable scale and bias parameters, so the output of the batch normalization layer (the input to the addition) is not forced to have unit variance.

  • Alternate 2: Remove the second ReLU. The idea behind this was noticing that in the reference architecture, the input cannot proceed to the output without being modified by a ReLU. This makes identity connections technically impossible because negative numbers would always be clipped as they passed through the skip layers of the network. To avoid this, we could either move the ReLU before the addition or remove it completely. However, it is not correct to move the ReLU before the addition: such an architecture would ensure that the output would never decrease because the first addition term could never be negative. The other option is to simply remove the ReLU completely, sacrificing the nonlinear property of this layer. It is unclear which approach is better.
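The ReLU placement argument can be illustrated with a toy Python sketch. This is not the repository's Torch code: scalars stand in for feature maps, batch normalization is left out, and `f` is a hypothetical stand-in for the convolutional branch of a building block.

```python
def relu(x):
    return max(0.0, x)


def reference_block(x, f):
    """ReLU applied after the addition, as in the reference architecture."""
    return relu(f(x) + x)


def relu_before_add_block(x, f):
    """ReLU moved onto the residual branch, before the addition."""
    return relu(f(x)) + x


def no_relu_block(x, f):
    """Second ReLU removed completely."""
    return f(x) + x


def zero_branch(x):
    """A residual branch that has learned to output zero (identity block)."""
    return 0.0


x = -2.0
print(reference_block(x, zero_branch))   # negative input is clipped
print(no_relu_block(x, zero_branch))     # input passes through unchanged
print(relu_before_add_block(x, lambda v: -5.0))  # output cannot drop below x
```

With a zero residual branch, the reference block clips the negative input, the no-ReLU block passes it through unchanged, and the ReLU-before-add block can never produce an output smaller than its input.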

To test these strategies, we repeat the above protocol using the smallest (20-layer) residual network model.

(Note: The other experiments all use the leftmost "Reference" model.)

Training loss

Testing error

| Architecture | Test error |
|---|---|
| ReLU, BN before add (ORIG PAPER reimplementation) | 0.0829 |
| No ReLU, BN before add | 0.0862 |
| ReLU, BN after add | 0.0834 |
| No ReLU, BN after add | 0.0823 |

All methods achieve accuracies within about 0.5% of each other. Removing ReLU and moving the batch normalization after the addition seems to make a small improvement on CIFAR, but there is too much noise in the test error curve to reliably tell a difference.

CIFAR: Effect of model architecture on deep networks

The above experiments on the 20-layer networks do not reveal any interesting differences. However, these differences become more pronounced when evaluated on very deep networks. We retry the above experiments on 110-layer (Nsize=18) networks.

Training loss

Testing error

Results:

  • For deep networks, it's best to put the batch normalization before the addition part of each building block layer. This effectively removes most of the batch normalization operations from the input skip paths. If a batch normalization comes after each building block, then there exists a path from the input straight to the output that passes through several batch normalizations in a row. This could be problematic because each BN is not idempotent (the effects of several BN layers accumulate).

  • Removing the ReLU layer at the end of each building block appears to give a small improvement (~0.6 percentage points).

| Architecture | Test error | Artifacts |
|---|---|---|
| ReLU, BN before add (ORIG PAPER reimplementation) | 0.0697 | Model, Loss and Error logs, Source commit + patch |
| No ReLU, BN before add | 0.0632 | Model, Loss and Error logs, Source commit + patch |
| ReLU, BN after add | 0.1356 | Model, Loss and Error logs, Source commit + patch |
| No ReLU, BN after add | 0.1230 | Model, Loss and Error logs, Source commit + patch |

ImageNet: Effect of model architecture (preliminary)

@ducha-aiki is performing preliminary experiments on ImageNet. For ordinary CaffeNet networks, @ducha-aiki found that putting batch normalization after the ReLU layer may provide a small benefit compared to putting it before.

Note also that results on CIFAR-10 often contradict results on ImageNet. For example, leaky ReLU outperforms ReLU on CIFAR, but is worse on ImageNet.

@ducha-aiki's more detailed results here: https://github.com/ducha-aiki/caffenet-benchmark/blob/master/batchnorm.md

CIFAR: Alternate training strategies (RMSPROP, Adagrad, Adadelta)

Can we improve on the basic SGD update rule with Nesterov momentum? This experiment aims to find out. Common wisdom suggests that alternate update rules may converge faster, at least initially, but they do not outperform well-tuned SGD in the long run.
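For readers unfamiliar with these update rules, here is a minimal Python sketch of one common formulation of SGD with Nesterov momentum next to RMSprop, applied to the toy objective f(x) = x². This is an illustration only, not the solver code used in these experiments. The hyperparameter defaults (momentum 0.9, alpha 0.99, eps 1e-8), the starting point, and the `minimize` helper are assumptions chosen for the example.

```python
import math


def nesterov_sgd_step(x, velocity, grad, lr=0.1, momentum=0.9):
    """One SGD step with Nesterov momentum (a common formulation)."""
    velocity = momentum * velocity + grad
    x = x - lr * (grad + momentum * velocity)
    return x, velocity


def rmsprop_step(x, mean_square, grad, lr=1e-2, alpha=0.99, eps=1e-8):
    """One RMSprop step: scale the gradient by a running RMS of past gradients."""
    mean_square = alpha * mean_square + (1.0 - alpha) * grad * grad
    x = x - lr * grad / (math.sqrt(mean_square) + eps)
    return x, mean_square


def minimize(step, x=5.0, steps=200):
    """Run `step` repeatedly on f(x) = x^2, whose gradient is 2x."""
    state = 0.0
    for _ in range(steps):
        grad = 2.0 * x
        x, state = step(x, state, grad)
    return x


print(minimize(nesterov_sgd_step))
print(minimize(rmsprop_step))
```

A one-dimensional quadratic says nothing about how the solvers compare on CIFAR. The sketch only shows how each rule turns a gradient into a parameter update.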

Training loss curve

Testing error curve

In our experiments, vanilla SGD with Nesterov momentum and a learning rate of 0.1 eventually reaches the lowest test error. Interestingly, RMSPROP with learning rate 1e-2 achieves a lower training loss, but overfits.

| Strategy | Test error |
|---|---|
| Original paper: SGD + Nesterov momentum, 1e-1 | 0.0829 |
| RMSprop, learning rate = 1e-4 | 0.1677 |
| RMSprop, 1e-3 | 0.1055 |
| RMSprop, 1e-2 | 0.0945 |
| Adadelta¹, rho = 0.3 | 0.1093 |
| Adagrad, 1e-3 | 0.3536 |
| Adagrad, 1e-2 | 0.1603 |
| Adagrad, 1e-1 | 0.1255 |

¹: Adadelta does not use a learning rate, so we did not use the same learning rate policy as in the paper. We just let it run until convergence.

See Andrej Karpathy's CS231N notes for more details on each of these learning strategies.

CIFAR: Alternate training strategies on deep networks

Deeper networks are more prone to overfitting. Unlike the earlier experiments, all of these models (except Adagrad with a learning rate of 1e-3) achieve a loss under 0.1, but test error varies quite wildly. Once again, using vanilla SGD with Nesterov momentum achieves the lowest error.

Training loss

Testing error

| Solver | Testing error |
|---|---|
| Nsize=18, Original paper: Nesterov, 1e-1 | 0.0697 |
| Nsize=18, RMSprop, 1e-4 | 0.1482 |
| Nsize=18, RMSprop, 1e-3 | 0.0821 |
| Nsize=18, RMSprop, 1e-2 | 0.0768 |
| Nsize=18, RMSprop, 1e-1 | 0.1098 |
| Nsize=18, Adadelta | 0.0888 |
| Nsize=18, Adagrad, 1e-3 | 0.3022 |
| Nsize=18, Adagrad, 1e-2 | 0.1321 |
| Nsize=18, Adagrad, 1e-1 | 0.1145 |

Effect of batch norm momentum

For our experiments, we use batch normalization with an exponential running mean and standard deviation and a momentum of 0.1, meaning that at each batch the running mean and standard deviation move 10% of the way toward the current batch's statistics. A value of 1.0 would cause the batch normalization layer to calculate the mean and standard deviation across only the current batch, and a value of 0 would cause the batch normalization layer to stop accumulating changes in the running mean and standard deviation.
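The exponential update described above can be written as a short Python sketch. The function name is hypothetical, and a single scalar stands in for the per-channel statistics a real batch normalization layer keeps.

```python
def update_running(running, batch_value, momentum=0.1):
    """Move the running statistic a fraction `momentum` toward the batch value."""
    return (1.0 - momentum) * running + momentum * batch_value


running_mean = 2.0
batch_mean = 4.0
print(update_running(running_mean, batch_mean, momentum=0.1))  # 10% of the way to 4.0
print(update_running(running_mean, batch_mean, momentum=1.0))  # current batch only
print(update_running(running_mean, batch_mean, momentum=0.0))  # running value frozen
```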

The strictest interpretation of the original batch normalization paper is to calculate the mean and standard deviation across the entire training set at every update. This takes too long in practice, so the exponential average is usually used instead.

We test whether batch normalization momentum affects the results. We try different values away from the default, along with a "dynamic" update strategy that sets the momentum to 1 / (1+n), where n is the number of batches seen so far (n resets to 0 at every epoch). At the end of each training epoch, this means the batch normalization's running mean and standard deviation are effectively calculated over the entire training set.
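To see why a momentum of 1 / (1+n) yields an average over everything seen so far, here is a small Python sketch. The function name is hypothetical, scalar batch means stand in for the per-channel statistics of a real layer, and batches are assumed to be of equal size.

```python
def dynamic_running_mean(batch_means):
    """Running mean with momentum 1 / (1 + n), n = batches seen so far."""
    running = 0.0
    for n, batch_mean in enumerate(batch_means):
        momentum = 1.0 / (1 + n)  # 1, 1/2, 1/3, ...
        running = (1.0 - momentum) * running + momentum * batch_mean
    return running


batch_means = [1.0, 2.0, 3.0, 6.0]
print(dynamic_running_mean(batch_means))
print(sum(batch_means) / len(batch_means))
```

The two printed values agree up to floating-point rounding: each update reweights the running value by n / (n+1) and the new batch by 1 / (n+1), which is the incremental form of an arithmetic mean.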

None of these changes appears to make a significant difference.

Test error curve

| Strategy | Test Error |
|---|---|
| BN, momentum = 1 just for fun | 0.0863 |
| BN, momentum = 0.01 | 0.0835 |
| Original paper: BN momentum = 0.1 | 0.0829 |
| Dynamic, reset every epoch | 0.0822 |

TODO: ImageNet

Comments
  • How to use lab-workbook in train-cifar.lua

    Hi,

    First of all, thank you for sharing your review work.

    It seems that lab-workbook is required to make snapshot in train-cifar.lua; thus I installed it with your repository (https://github.com/gcr/lab-workbook).

    However, this package asked me to have a configuration file ~/.lab-workbook-config, and I have no idea how to set it.

    Is there anyone who can help me use it?

    I would also appreciate it if someone could let me know how I can snapshot weights with other methods. I used to use Caffe and I'm new to Lua and Torch.

    Thanks,

    opened by lim0606 6
  • Remove workbook calls

    Rather than having to comment out the workbook calls, I've wrapped them in if statements by using pcall to work out whether 'lab-workbook' is available or not. Obviously I can't test if it is available, but now it can run without crashing.

    I've also updated the docs to remove this and also mentioned --dataRoot because the docs make it sound like the data path is hardcoded in.

    opened by Kaixhin 6
  • CIFAR-10 net

    Hi Michael, I'm curious why do you need this in your CIFAR-10 config:

    model = addResidualLayer2(model, 64, 10, 1)
    

    Why not just feed last convolutional layer directly to avg pooling? E.g:

    for i=1,N do   model = addResidualLayer2(model, 64)   end
    ------> 64, 8,8     Pooling, Linear, Softmax
    model = nn.SpatialAveragePooling(8,8)(model)
    

    I think this is how it's done in the paper.

    opened by Alexey-Kamenev 6
  • Fix broken headings in Markdown files

    GitHub changed the way Markdown headings are parsed, so this change fixes it.

    See bryant1410/readmesfix for more information.

    Tackles bryant1410/readmesfix#1

    opened by bryant1410 1
  • BROKEN: Does not converge

    Hello! The posted version of my code does not converge.

    Right now, I'm taking a step back and debugging convergence on CIFAR. I'll push a new change to this repository when it's working.

    For now, you should remove the nn.ReLU just before the nn.LogSoftMax() layer...

    opened by gcr 1
  • Problem of Shortcut on cifar dataset

    First of all, thank you for your code. I have tried to train ResNet-20 on the CIFAR-10 dataset with PyTorch, using ShortcutA as you did, but I can't get results as good as the ones you report, even though I followed all the configurations and data augmentation in the original paper. I can only get good results when I use ShortcutB. Is the problem the shortcut, the initialization, or something else? Thank you for your ideas.

    opened by I-Doctor 0
  • Why are 2 convolutions used in every res-block in a resnet?

    This is more a general question about resnets.

    I am wondering why there are always 2 convolutional layers used in every res-block instead of just one. To me, just one convolution with a skip connection seems like the more natural choice. So why did they use 2 convs in every block in the official paper? What happens if I use just one? And are there any studies comparing one convolution to two?

    opened by mbcel 1
  • Unexpected output behaviour

    Hello everyone, I met a problem about unexpected output behaviour of batch normalisation layer or Cadd table.

    I would like to investigate the output of these layers with layer.output attribute, but I found this unexpected behaviour: if the layer is followed by a relu unit, then the output of that layer will be the same as relu (non-negative).

    For example, in the residual nets, there are two types of arrangements: batch normalisation layers with and without a ReLU unit following. Those without a ReLU following behave as expected (they have negative values). However, for those with a ReLU following, the output values that should be negative become zero.

    Does anyone know what is happening there? Thank you a lot.

    opened by pszyu 0
  • What is your Machine and Runtime Configuration?

    What is your Machine (Runtime Env.) Configuration ? like number of GPUs, and some other configuration info. like true batchsize, etc.

    Another question: in the "Alternate training strategies" part, you didn't include Adam and other optimization strategies. Why? And how were these strategies' initial learning rates set? Was it "try one by one"?

    thks.

    opened by whatbeg 0
  • Classification with pre-loaded model problems

    Hi Guys I'm trying to classify images with the pre-loaded models but something is strange....

    $ th classify.lua resnet-34.t7 ImgnetCat.jpg
    Classes for ImgnetCat.jpg
    0.0075856610201299 bucket, pail
    0.0070607061497867 hook, claw
    0.0061263972893357 water bottle
    0.0054563735611737 tennis ball
    0.0046155038289726 plastic bag

    $ th classify.lua resnet-34.t7 Tennis-ball-007.jpg
    0.0085550472140312 bucket, pail
    0.0073438654653728 hook, claw
    0.0070917885750532 water bottle
    0.0054101385176182 water jug
    0.0049047190696001 sunglasses, dark glasses, shades

    I'm doing something wrong? Thanks

    I'm using the following images tennis-ball-007 imgnetcat

    opened by leonardoaraujosantos 0
  • training data also normalized?

    Hi G, I am so new to Torch, just a quick concern about fetching training data. The training data is supposed to be normalized, too. However, I see no such operation in the dataTrain:getBatch() call. Specifically, the code here does not pass input value back to batch. Can you point out where I misunderstood? Thanks!

    opened by hli2020 2