Improving Convolutional Networks via Attention Transfer (ICLR 2017)

Overview

Attention Transfer

PyTorch code for "Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer" https://arxiv.org/abs/1612.03928
Conference paper at ICLR2017: https://openreview.net/forum?id=Sks9_ajex

What's in this repo so far:

  • Activation-based AT code for CIFAR-10 experiments
  • Code for ImageNet experiments (ResNet-18 student, ResNet-34 teacher)
  • Jupyter notebook to visualize attention maps of ResNet-34: visualize-attention.ipynb

Coming:

  • grad-based AT
  • Scenes and CUB activation-based AT code

The code uses PyTorch (https://pytorch.org). Note that the original experiments were done using torch-autograd. We have so far verified that the CIFAR-10 experiments are exactly reproducible in PyTorch, and are in the process of doing the same for ImageNet (results are currently very slightly worse in PyTorch, due to hyperparameters).

bibtex:

@inproceedings{Zagoruyko2017AT,
    author = {Sergey Zagoruyko and Nikos Komodakis},
    title = {Paying More Attention to Attention: Improving the Performance of
             Convolutional Neural Networks via Attention Transfer},
    booktitle = {ICLR},
    url = {https://arxiv.org/abs/1612.03928},
    year = {2017}}

Requirements

First install PyTorch, then install torchnet:

pip install git+https://github.com/pytorch/tnt.git@master

then install other Python packages:

pip install -r requirements.txt

Experiments

CIFAR-10

This section describes how to reproduce the results reported in Table 1 of the paper.

First, train teachers:

python cifar.py --save logs/resnet_40_1_teacher --depth 40 --width 1
python cifar.py --save logs/resnet_16_2_teacher --depth 16 --width 2
python cifar.py --save logs/resnet_40_2_teacher --depth 40 --width 2

To train with activation-based AT do:

python cifar.py --save logs/at_16_1_16_2 --teacher_id resnet_16_2_teacher --beta 1e+3
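
The AT objective adds beta times a sum of attention-map losses to the usual cross-entropy. Below is a minimal sketch of the idea, consistent with the at/at_loss helpers quoted in the comments further down; the total_loss wrapper and its argument names are illustrative, not the repo's exact code:

import torch.nn.functional as F

def at(x):
    # attention map: channel-wise mean of squared activations,
    # flattened per sample and L2-normalized
    return F.normalize(x.pow(2).mean(1).view(x.size(0), -1))

def at_loss(x, y):
    # mean squared difference between normalized attention maps
    return (at(x) - at(y)).pow(2).mean()

def total_loss(y_s, targets, groups_s, groups_t, beta):
    # cross-entropy on student logits plus beta * sum of AT terms
    # over matched (student, teacher) activation groups
    return F.cross_entropy(y_s, targets) + beta * sum(
        at_loss(g_s, g_t) for g_s, g_t in zip(groups_s, groups_t))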

To train with KD:

python cifar.py --save logs/kd_16_1_16_2 --teacher_id resnet_16_2_teacher --alpha 0.9
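
Here --alpha weights a temperature-softened KL term against the teacher relative to the hard-label cross-entropy (the script's default temperature is 4, as seen in the parsed options in the comments below). A minimal sketch of the usual Hinton-style formulation, with illustrative names:

import torch.nn.functional as F

def kd_loss(y_s, y_t, targets, T=4.0, alpha=0.9):
    # soften both distributions with temperature T; the T^2 factor
    # keeps gradients comparable in scale to the cross-entropy term
    p = F.log_softmax(y_s / T, dim=1)
    q = F.softmax(y_t / T, dim=1)
    kl = F.kl_div(p, q, reduction='batchmean') * (T * T)
    return alpha * kl + (1.0 - alpha) * F.cross_entropy(y_s, targets)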

We plan to soon add AT+KD with decaying beta, which gives the best knowledge transfer results.

ImageNet

Pretrained model

We provide a ResNet-18 model pretrained with activation-based AT:

Model                      top-1 / top-5 val error
ResNet-18                  30.4 / 10.8
ResNet-18-ResNet-34-AT     29.3 / 10.0

Download link: https://s3.amazonaws.com/modelzoo-networks/resnet-18-at-export.pth

Model definition: https://github.com/szagoruyko/functional-zoo/blob/master/resnet-18-at-export.ipynb

Convergence plot:

Train from scratch

Download pretrained weights for ResNet-34 (see also functional-zoo for more information):

wget https://s3.amazonaws.com/modelzoo-networks/resnet-34-export.pth

Prepare the data following fb.resnet.torch and run training (e.g. using 2 GPUs):

python imagenet.py --imagenetpath ~/ILSVRC2012 --depth 18 --width 1 \
                   --teacher_params resnet-34-export.pth --gpu_id 0,1 --ngpu 2 \
                   --beta 1e+3

Comments
  • invalid variables

    invalid variables

    When I run cifar.py I get the error: new() received an invalid combination of arguments - got (Tensor, int, int, int), but expected one of:

    • (torch.device device)
    • (tuple of ints size, torch.device device)
    • (torch.Storage storage)
    • (Tensor other)
    • (object data, torch.device device)

    What is the problem? Thanks for the answer!
    opened by vkadykova 6
  • Question on Code

    Question on Code

    Thank you for your code; it's been very helpful for studying computer vision. Unfortunately I can't run it correctly, which I suspect is a software-version issue. Could you tell me which versions you used (e.g. Python, OpenCV)? I am using Python 2.7 and OpenCV 3.2.0.

    thank you very much

    opened by zhenxing1992 6
  • question about "params.itervalues()"

    question about "params.itervalues()"

    (py35) user@user-ASUS:~/fzz/study/attention-transfer-master$ python cifar.py --save logs/resnet_40_1_teacher --depth 40 --width 1
    parsed options: {'data_root': '.', 'dataset': 'CIFAR10', 'cuda': False, 'width': 1.0, 'lr': 0.1, 'alpha': 0, 'teacher_id': '', 'gpu_id': '0', 'lr_decay_ratio': 0.2, 'epoch_step': '[60,120,160]', 'beta': 0, 'batchSize': 128, 'ngpu': 1, 'depth': 40, 'randomcrop_pad': 4, 'optim_method': 'SGD', 'nthread': 4, 'weightDecay': 0.0005, 'resume': '', 'epochs': 200, 'save': 'logs/resnet_40_1_teacher', 'temperature': 4, 'dtype': 'float'}
    Files already downloaded and verified
    Files already downloaded and verified
    Traceback (most recent call last):
      File "cifar.py", line 331, in <module>
        main()
      File "cifar.py", line 212, in main
        optimizable = [v for v in params.itervalues() if v.requires_grad]
    AttributeError: 'collections.OrderedDict' object has no attribute 'itervalues'

    When I run cifar.py, I get this error, and I can't find “itervalues” defined anywhere in the files.
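
    A likely fix, assuming the repo's Python 2 code is being run under Python 3 (where dict.itervalues() no longer exists): use values() instead in cifar.py.

    # Python 3 replacement for the failing line in cifar.py
    optimizable = [v for v in params.values() if v.requires_grad]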

    opened by Fzz123 2
  • fix type mismatch in calling torch API

    fix type mismatch in calling torch API

    np.arange() returns an Iterable[numpy.int64]. However, many torch APIs only accept plain Python int. In a multi-GPU setup, the original code crashes like the following:

      File "/opt/conda/lib/python3.6/site-packages/torch/cuda/comm.py", line 157, in scatter
        with torch.cuda.device(device), torch.cuda.stream(stream):
      File "/opt/conda/lib/python3.6/site-packages/torch/cuda/__init__.py", line 227, in __enter__
        torch._C._cuda_setDevice(self.idx)
    RuntimeError: invalid argument to setDevice
    
      File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 14, in scatter_map
        return Scatter.apply(target_gpus, None, dim, obj)
      File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 74, in forward
        outputs = comm.scatter(input, ctx.target_gpus, ctx.chunk_sizes, ctx.dim, streams)
      File "/opt/conda/lib/python3.6/site-packages/torch/cuda/comm.py", line 159, in scatter
        outputs.append(chunk.cuda(device, non_blocking=True))
    TypeError: cuda(): argument 'device' (position 1) must be torch.device, not numpy.int64
    
      File "/opt/conda/lib/python3.6/site-packages/torch/cuda/comm.py", line 197, in gather
        result = tensors[0].new(expected_size, device=destination)
    TypeError: new() received an invalid combination of arguments - got (torch.Size, device=numpy.int64), but expected one of:
     * (torch.device device)
     * (tuple of ints size, torch.device device)
          didn't match because some of the arguments have invalid types: (torch.Size, device=numpy.int64)
     * (torch.Storage storage)
     * (Tensor other)
     * (object data, torch.device device)
          didn't match because some of the arguments have invalid types: (torch.Size, device=numpy.int64)
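
    A minimal sketch of the fix (illustrative, not the exact patch): cast the numpy.int64 ids from np.arange() to plain Python ints before handing them to torch's multi-GPU utilities.

    import numpy as np

    ngpu = 2  # example value
    # plain Python ints are accepted by torch's device arguments
    device_ids = [int(i) for i in np.arange(ngpu)]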
    
    opened by HisiFish 1
  • Question on KL loss

    Question on KL loss

    opened by wentianli 1
  • Hoping to see the implementation of AT+KD with decaying beta

    Hoping to see the implementation of AT+KD with decaying beta

    Hi, I like your work and am curious when you are planning to add the implementation of AT+KD with decaying beta. Will it be committed soon?

    Thank you.

    opened by AIbeginner2020 0
  • My Imagenet replication results are poor

    My Imagenet replication results are poor

    Hello, first of all, thank you very much for your great work!

    While reproducing the results of your paper, I ran into several confusing points in the ImageNet part, and my results are much worse than those reported in the paper.

    • First, the accuracy of the ResNet-34 teacher network mentioned in the paper differs from that of the pretrained ResNet-34 model you provide. I don't know whether this is what causes the poor student results. Could you provide the ResNet-34 model used in the paper?

    • Second, for the ImageNet experiment you mention that the hyperparameters are the same as in the transfer experiments, but no specific values are given. What beta value did you use, please?

    Here are my reproduction results: "Imagenet_AT" is the run with beta set to 1000, which is much worse than the result in the paper; "Imagenet_AT2000" is the run after adjusting beta to 2000. Since this experiment is very computationally expensive, I stopped it after observing that the early results were very poor. (Plots of my runs and of the paper's results were attached as images.)

    opened by somone23412 0
  • Fix KL normalization

    Fix KL normalization

    • F.kl_div normalizes by the number of elements in the tensor; fixed to normalize by the minibatch size instead
    • also removed the factor of 2

    Checked that this produces the same results as in the paper.

    Fixes #18 and #7
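
    For reference, a minimal sketch of the corrected term (p and q are assumed to be the temperature-T log-softmax of the student logits and softmax of the teacher logits):

    import torch.nn.functional as F

    def kl_term(p, q, T):
        # sum over all elements, then divide by the minibatch size only;
        # the old code used the default reduction (which also divides by
        # the number of classes) and carried an extra factor of 2
        return T * T * F.kl_div(p, q, reduction='sum') / p.shape[0]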

    opened by szagoruyko 0
  • Question about KL_loss average

    Question about KL_loss average

    Hi, thanks for sharing your code. I have a question about the KL loss implementation: PyTorch's KL loss averages over both the batch and the class dimension, but the original knowledge distillation loss does not average over the class dimension. So I assume there is a bug here?

    opened by Lan1991Xu 0
  • RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same

    RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same

    When I ran the CIFAR code on a GPU, I got the following error. Any suggestions would be appreciated!

    RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same
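
    A likely fix, assuming the model parameters were left on the CPU while the inputs were moved to the GPU: move the parameters over as well before training, e.g.

    # params is assumed to be the dict of parameter tensors used by the
    # functional model in cifar.py
    params = {k: v.cuda() for k, v in params.items()}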

    opened by toseattle 1
  • Why not use bn for teacher net in imagenet.py

    Why not use bn for teacher net in imagenet.py

    Thanks for your great work first!

    I wonder why you do not apply the BN layer when running the teacher model at inference time here ( https://github.com/szagoruyko/attention-transfer/blob/master/imagenet.py#L117 )? Is it a typo?

    Hope for your reply!

    opened by cheerss 2
  • Got error when use 2 gpus.

    Got error when use 2 gpus.

    I got an error when using 2 GPUs to train the model on ImageNet. I followed the steps in the README, but got the following error:

    Traceback (most recent call last):
      File "imagenet.py", line 340, in <module>
        main()
      File "imagenet.py", line 336, in main
        engine.train(h, iter_train, opt.epochs, optimizer)
      File "/usr/local/lib/python3.6/site-packages/torchnet/engine/engine.py", line 63, in train
        state['optimizer'].step(closure)
      File "/usr/local/lib/python3.6/site-packages/torch/optim/sgd.py", line 80, in step
        loss = closure()
      File "/usr/local/lib/python3.6/site-packages/torchnet/engine/engine.py", line 52, in closure
        loss, output = state['network'](state['sample'])
      File "imagenet.py", line 265, in h
        y_s, y_t, loss_groups = utils.data_parallel(f, inputs, params, mode, range(opt.ngpu))
      File "/opt/ml/job/utils.py", line 64, in data_parallel
        return gather(outputs, output_device)
      File "/usr/local/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
        return gather_map(outputs)
      File "/usr/local/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
        return type(out)(map(gather_map, zip(*outputs)))
      File "/usr/local/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
        return type(out)(map(gather_map, zip(*outputs)))
      File "/usr/local/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 55, in gather_map
        return Gather.apply(target_device, dim, *outputs)
      File "/usr/local/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 54, in forward
        ctx.input_sizes = tuple(map(lambda i: i.size(ctx.dim), inputs))
      File "/usr/local/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 54, in <lambda>
        ctx.input_sizes = tuple(map(lambda i: i.size(ctx.dim), inputs))
    RuntimeError: dimension specified as 0 but tensor has no dimensions

    Thank you if you have any solution to this.

    opened by gtxjinx 0
  • Loss function problems

    Loss function problems

    Hi ,

    Thanks for your great work. I have a question about the implementation details: why do you just square the activations and then take the mean, rather than using the L2 norm as described in the paper? (A screenshot of the paper's equation was attached.)

    def at(x):
        return F.normalize(x.pow(2).mean(1).view(x.size(0), -1))
    
    
    def at_loss(x, y):
        return (at(x) - at(y)).pow(2).mean()
    
    opened by jacky4323 0
  • Strategy of α and β decay during training

    Strategy of α and β decay during training

    @szagoruyko @EderSantana Hi, thanks for sharing your code. Would you please specify your strategy for decaying the two multipliers α and β during training? Thanks in advance.

    opened by d-li14 0
  • Setting of β

    Setting of β

    Hi.

    In the paper, the authors said "As for parameter β in eq. 2, it usually varies about 0.1, as we set it to 10^3 divided by number of elements in attention map and batch size for each layer. "

    But I am still confused. What does 10^3 mean here, and how was 0.1 obtained?
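
    A back-of-envelope reading (map size and batch size are assumptions): with the mean-reduced AT loss, a multiplier beta = 10^3 corresponds to an effective per-element weight of 10^3 / (H*W * batch_size). For the 8x8 maps of a CIFAR network's last stage and batch size 128:

    H, W, batch_size = 8, 8, 128       # assumed last-stage map and batch
    print(1e3 / (H * W * batch_size))  # ~0.12, i.e. "about 0.1"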

    opened by tangbohu 1
PyTorch implementation of "ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context" (INTERSPEECH 2020)

ContextNet ContextNet has CNN-RNN-transducer architecture and features a fully convolutional encoder that incorporates global context information into

Sangchun Ha 24 Nov 24, 2022
Official implementation of Self-supervised Graph Attention Networks (SuperGAT), ICLR 2021.

SuperGAT Official implementation of Self-supervised Graph Attention Networks (SuperGAT). This model is presented at How to Find Your Friendly Neighbor

Dongkwan Kim 127 Dec 28, 2022
An implementation demo of the ICLR 2021 paper Neural Attention Distillation: Erasing Backdoor Triggers from Deep Neural Networks in PyTorch.

Neural Attention Distillation This is an implementation demo of the ICLR 2021 paper Neural Attention Distillation: Erasing Backdoor Triggers from Deep

Yige-Li 84 Jan 4, 2023
A PyTorch implementation of the paper "Semantic Image Synthesis via Adversarial Learning" in ICCV 2017

Semantic Image Synthesis via Adversarial Learning This is a PyTorch implementation of the paper Semantic Image Synthesis via Adversarial Learning. Req

Seonghyeon Nam 146 Nov 25, 2022
Understanding and Improving Encoder Layer Fusion in Sequence-to-Sequence Learning (ICLR 2021)

Understanding and Improving Encoder Layer Fusion in Sequence-to-Sequence Learning (ICLR 2021) Citation Please cite as: @inproceedings{liu2020understan

Sunbow Liu 22 Nov 25, 2022
Fader Networks: Manipulating Images by Sliding Attributes - NIPS 2017

FaderNetworks PyTorch implementation of Fader Networks (NIPS 2017). Fader Networks can generate different realistic versions of images by modifying at

Facebook Research 753 Dec 23, 2022
PyTorch version of the paper 'Enhanced Deep Residual Networks for Single Image Super-Resolution' (CVPRW 2017)

About PyTorch 1.2.0 Now the master branch supports PyTorch 1.2.0 by default. Due to the serious version problem (especially torch.utils.data.dataloade

Sanghyun Son 2.1k Jan 1, 2023
Oriented Response Networks, in CVPR 2017

Oriented Response Networks [Home] [Project] [Paper] [Supp] [Poster] Torch Implementation The torch branch contains: the official torch implementation

ZhouYanzhao 217 Dec 12, 2022
[ICLR 2021, Spotlight] Large Scale Image Completion via Co-Modulated Generative Adversarial Networks

Large Scale Image Completion via Co-Modulated Generative Adversarial Networks, ICLR 2021 (Spotlight) Demo | Paper [NEW!] Time to play with our interac

Shengyu Zhao 373 Jan 2, 2023
Code for our ICASSP 2021 paper: SA-Net: Shuffle Attention for Deep Convolutional Neural Networks

SA-Net: Shuffle Attention for Deep Convolutional Neural Networks (paper) By Qing-Long Zhang and Yu-Bin Yang [State Key Laboratory for Novel Software T

Qing-Long Zhang 199 Jan 8, 2023
Pervasive Attention: 2D Convolutional Networks for Sequence-to-Sequence Prediction

This is a fork of Fairseq(-py) with implementations of the following models: Pervasive Attention - 2D Convolutional Neural Networks for Sequence-to-Se

Maha 490 Dec 15, 2022
Codes for TIM2021 paper "Anchor-Based Spatio-Temporal Attention 3-D Convolutional Networks for Dynamic 3-D Point Cloud Sequences"

Codes for TIM2021 paper "Anchor-Based Spatio-Temporal Attention 3-D Convolutional Networks for Dynamic 3-D Point Cloud Sequences"

Intelligent Robotics and Machine Vision Lab 4 Jul 19, 2022
Pytorch version of VidLanKD: Improving Language Understanding viaVideo-Distilled Knowledge Transfer

VidLanKD Implementation of VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer by Zineng Tang, Jaemin Cho, Hao Tan, Mohi

Zineng Tang 54 Dec 20, 2022
[ICLR 2021] Is Attention Better Than Matrix Decomposition?

Enjoy-Hamburger ?? Official implementation of Hamburger, Is Attention Better Than Matrix Decomposition? (ICLR 2021) Under construction. Introduction T

Gsunshine 271 Dec 29, 2022
code for our paper "Source Data-absent Unsupervised Domain Adaptation through Hypothesis Transfer and Labeling Transfer"

SHOT++ Code for our TPAMI submission "Source Data-absent Unsupervised Domain Adaptation through Hypothesis Transfer and Labeling Transfer" that is ext

null 75 Dec 16, 2022
Transfer-Learn is an open-source and well-documented library for Transfer Learning.

Transfer-Learn is an open-source and well-documented library for Transfer Learning. It is based on pure PyTorch with high performance and friendly API. Our code is pythonic, and the design is consistent with torchvision. You can easily develop new algorithms, or readily apply existing algorithms.

THUML @ Tsinghua University 2.2k Jan 3, 2023
Transfer style api - An API to use with Tranfer Style App, where you can use two image and transfer the style

Transfer Style API It's an API to use with Tranfer Style App, where you can use

Brian Alejandro 1 Feb 13, 2022
The official PyTorch implementation of recent paper - SAINT: Improved Neural Networks for Tabular Data via Row Attention and Contrastive Pre-Training

This repository is the official PyTorch implementation of SAINT. Find the paper on arxiv SAINT: Improved Neural Networks for Tabular Data via Row Atte

Gowthami Somepalli 284 Dec 21, 2022
CoSMA: Convolutional Semi-Regular Mesh Autoencoder. From Paper "Mesh Convolutional Autoencoder for Semi-Regular Meshes of Different Sizes"

Mesh Convolutional Autoencoder for Semi-Regular Meshes of Different Sizes Implementation of CoSMA: Convolutional Semi-Regular Mesh Autoencoder arXiv p

Fraunhofer SCAI 10 Oct 11, 2022