Improving Convolutional Networks via Attention Transfer (ICLR 2017)

Overview

Attention Transfer

PyTorch code for "Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer" https://arxiv.org/abs/1612.03928
Conference paper at ICLR2017: https://openreview.net/forum?id=Sks9_ajex

What's in this repo so far:

  • Activation-based AT code for CIFAR-10 experiments
  • Code for ImageNet experiments (ResNet-18 student, ResNet-34 teacher)
  • Jupyter notebook to visualize attention maps of ResNet-34: visualize-attention.ipynb

Coming:

  • grad-based AT
  • Scenes and CUB activation-based AT code

The code uses PyTorch (https://pytorch.org). Note that the original experiments were done using torch-autograd. We have so far verified that the CIFAR-10 experiments are exactly reproducible in PyTorch, and are in the process of doing the same for ImageNet (results are currently very slightly worse in PyTorch, due to hyperparameters).

bibtex:

@inproceedings{Zagoruyko2017AT,
    author = {Sergey Zagoruyko and Nikos Komodakis},
    title = {Paying More Attention to Attention: Improving the Performance of
             Convolutional Neural Networks via Attention Transfer},
    booktitle = {ICLR},
    url = {https://arxiv.org/abs/1612.03928},
    year = {2017}}

Requirements

First install PyTorch, then install torchnet:

pip install git+https://github.com/pytorch/tnt.git@master

then install other Python packages:

pip install -r requirements.txt

Experiments

CIFAR-10

This section describes how to reproduce the results reported in Table 1 of the paper.

First, train teachers:

python cifar.py --save logs/resnet_40_1_teacher --depth 40 --width 1
python cifar.py --save logs/resnet_16_2_teacher --depth 16 --width 2
python cifar.py --save logs/resnet_40_2_teacher --depth 40 --width 2

To train with activation-based AT do:

python cifar.py --save logs/at_16_1_16_2 --teacher_id resnet_16_2_teacher --beta 1e+3
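
The AT objective adds beta times a sum of attention-map losses to the usual cross-entropy. Below is a minimal sketch of the idea, consistent with the at/at_loss helpers quoted in the comments further down; the total_loss wrapper and its argument names are illustrative, not the repo's exact code:

import torch.nn.functional as F

def at(x):
    # attention map: channel-wise mean of squared activations,
    # flattened per sample and L2-normalized
    return F.normalize(x.pow(2).mean(1).view(x.size(0), -1))

def at_loss(x, y):
    # mean squared difference between normalized attention maps
    return (at(x) - at(y)).pow(2).mean()

def total_loss(y_s, targets, groups_s, groups_t, beta):
    # cross-entropy on student logits plus beta * sum of AT terms
    # over matched (student, teacher) activation groups
    return F.cross_entropy(y_s, targets) + beta * sum(
        at_loss(g_s, g_t) for g_s, g_t in zip(groups_s, groups_t))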

To train with KD:

python cifar.py --save logs/kd_16_1_16_2 --teacher_id resnet_16_2_teacher --alpha 0.9
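
Here --alpha weights a temperature-softened KL term against the teacher relative to the hard-label cross-entropy (the script's default temperature is 4, as seen in the parsed options in the comments below). A minimal sketch of the usual Hinton-style formulation, with illustrative names:

import torch.nn.functional as F

def kd_loss(y_s, y_t, targets, T=4.0, alpha=0.9):
    # soften both distributions with temperature T; the T^2 factor
    # keeps gradients comparable in scale to the cross-entropy term
    p = F.log_softmax(y_s / T, dim=1)
    q = F.softmax(y_t / T, dim=1)
    kl = F.kl_div(p, q, reduction='batchmean') * (T * T)
    return alpha * kl + (1.0 - alpha) * F.cross_entropy(y_s, targets)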

We plan to soon add AT+KD with decaying beta, which gives the best knowledge transfer results.

ImageNet

Pretrained model

We provide a ResNet-18 model pretrained with activation-based AT:

Model                      top-1 / top-5 val error
ResNet-18                  30.4 / 10.8
ResNet-18-ResNet-34-AT     29.3 / 10.0

Download link: https://s3.amazonaws.com/modelzoo-networks/resnet-18-at-export.pth

Model definition: https://github.com/szagoruyko/functional-zoo/blob/master/resnet-18-at-export.ipynb

Convergence plot:

Train from scratch

Download pretrained weights for ResNet-34 (see also functional-zoo for more information):

wget https://s3.amazonaws.com/modelzoo-networks/resnet-34-export.pth

Prepare the data following fb.resnet.torch and run training (e.g. using 2 GPUs):

python imagenet.py --imagenetpath ~/ILSVRC2012 --depth 18 --width 1 \
                   --teacher_params resnet-34-export.pth --gpu_id 0,1 --ngpu 2 \
                   --beta 1e+3

Comments
  • invalid variables

    invalid variables

    When I run cifar.py I get the error: new() received an invalid combination of arguments - got (Tensor, int, int, int), but expected one of:

    • (torch.device device)
    • (tuple of ints size, torch.device device)
    • (torch.Storage storage)
    • (Tensor other)
    • (object data, torch.device device)

    What is the problem? Thanks for the answer!
    opened by vkadykova 6
  • Question on Code

    Question on Code

    Thank you for your code; it's been very helpful for studying computer vision. Unfortunately I can't run it correctly, which I suspect is a software-version issue. Could you tell me which versions you used (e.g. Python, OpenCV)? I am using Python 2.7 and OpenCV 3.2.0.

    thank you very much

    opened by zhenxing1992 6
  • question about "params.itervalues()"

    question about "params.itervalues()"

    (py35) user@user-ASUS:~/fzz/study/attention-transfer-master$ python cifar.py --save logs/resnet_40_1_teacher --depth 40 --width 1
    parsed options: {'data_root': '.', 'dataset': 'CIFAR10', 'cuda': False, 'width': 1.0, 'lr': 0.1, 'alpha': 0, 'teacher_id': '', 'gpu_id': '0', 'lr_decay_ratio': 0.2, 'epoch_step': '[60,120,160]', 'beta': 0, 'batchSize': 128, 'ngpu': 1, 'depth': 40, 'randomcrop_pad': 4, 'optim_method': 'SGD', 'nthread': 4, 'weightDecay': 0.0005, 'resume': '', 'epochs': 200, 'save': 'logs/resnet_40_1_teacher', 'temperature': 4, 'dtype': 'float'}
    Files already downloaded and verified
    Files already downloaded and verified
    Traceback (most recent call last):
      File "cifar.py", line 331, in <module>
        main()
      File "cifar.py", line 212, in main
        optimizable = [v for v in params.itervalues() if v.requires_grad]
    AttributeError: 'collections.OrderedDict' object has no attribute 'itervalues'

    When I run cifar.py, I get this error, and I can't find “itervalues” defined anywhere in the files.
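
    A likely fix, assuming the repo's Python 2 code is being run under Python 3 (where dict.itervalues() no longer exists): use values() instead in cifar.py.

    # Python 3 replacement for the failing line in cifar.py
    optimizable = [v for v in params.values() if v.requires_grad]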

    opened by Fzz123 2
  • fix type mismatch in calling torch API

    fix type mismatch in calling torch API

    np.arange() returns an Iterable[numpy.int64]. However, many torch APIs only accept plain Python int. In a multi-GPU setup, the original code crashes like the following:

      File "/opt/conda/lib/python3.6/site-packages/torch/cuda/comm.py", line 157, in scatter
        with torch.cuda.device(device), torch.cuda.stream(stream):
      File "/opt/conda/lib/python3.6/site-packages/torch/cuda/__init__.py", line 227, in __enter__
        torch._C._cuda_setDevice(self.idx)
    RuntimeError: invalid argument to setDevice
    
      File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 14, in scatter_map
        return Scatter.apply(target_gpus, None, dim, obj)
      File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 74, in forward
        outputs = comm.scatter(input, ctx.target_gpus, ctx.chunk_sizes, ctx.dim, streams)
      File "/opt/conda/lib/python3.6/site-packages/torch/cuda/comm.py", line 159, in scatter
        outputs.append(chunk.cuda(device, non_blocking=True))
    TypeError: cuda(): argument 'device' (position 1) must be torch.device, not numpy.int64
    
      File "/opt/conda/lib/python3.6/site-packages/torch/cuda/comm.py", line 197, in gather
        result = tensors[0].new(expected_size, device=destination)
    TypeError: new() received an invalid combination of arguments - got (torch.Size, device=numpy.int64), but expected one of:
     * (torch.device device)
     * (tuple of ints size, torch.device device)
          didn't match because some of the arguments have invalid types: (torch.Size, device=numpy.int64)
     * (torch.Storage storage)
     * (Tensor other)
     * (object data, torch.device device)
          didn't match because some of the arguments have invalid types: (torch.Size, device=numpy.int64)
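
    A minimal sketch of the fix (illustrative, not the exact patch): cast the numpy.int64 ids from np.arange() to plain Python ints before handing them to torch's multi-GPU utilities.

    import numpy as np

    ngpu = 2  # example value
    # plain Python ints are accepted by torch's device arguments
    device_ids = [int(i) for i in np.arange(ngpu)]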
    
    opened by HisiFish 1
  • Question on KL loss

    Question on KL loss

    opened by wentianli 1
  • Hoping to see the implementation of AT+KD with decaying beta

    Hoping to see the implementation of AT+KD with decaying beta

    Hi, I like your work and am curious when you are planning to add the implementation of AT+KD with decaying beta. Will it be committed soon?

    Thank you.

    opened by AIbeginner2020 0
  • My Imagenet replication results are poor

    My Imagenet replication results are poor

    Hello, first of all, thank you very much for your great work!

    While reproducing the results of your paper, I ran into several confusing points in the ImageNet part, and my results are much worse than those reported in the paper.

    • First, the accuracy of the ResNet-34 teacher network mentioned in the paper differs from that of the pretrained ResNet-34 model you provide. I don't know whether this is what causes the poor student results. Could you provide the ResNet-34 model used in the paper?

    • Second, for the ImageNet experiment you mention that the hyperparameters are the same as in the transfer experiments, but no specific values are given. What beta value did you use, please?

    Here are my reproduction results: "Imagenet_AT" is the run with beta set to 1000, which is much worse than the result in the paper; "Imagenet_AT2000" is the run after adjusting beta to 2000. Since this experiment is very computationally expensive, I stopped it after observing that the early results were very poor. (Plots of my runs and of the paper's results were attached as images.)

    opened by somone23412 0
  • Fix KL normalization

    Fix KL normalization

    • F.kl_div normalizes by the number of elements in the tensor; fixed to normalize by the minibatch size instead
    • also removed the factor of 2

    Checked that this produces the same results as in the paper.

    Fixes #18 and #7
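
    For reference, a minimal sketch of the corrected term (p and q are assumed to be the temperature-T log-softmax of the student logits and softmax of the teacher logits):

    import torch.nn.functional as F

    def kl_term(p, q, T):
        # sum over all elements, then divide by the minibatch size only;
        # the old code used the default reduction (which also divides by
        # the number of classes) and carried an extra factor of 2
        return T * T * F.kl_div(p, q, reduction='sum') / p.shape[0]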

    opened by szagoruyko 0
  • Question about KL_loss average

    Question about KL_loss average

    Hi, thanks for sharing your code. I have a question about the KL loss implementation: PyTorch's KL loss averages over both the batch and the class dimension, but the original knowledge distillation loss does not average over the class dimension. So I assume there is a bug here?

    opened by Lan1991Xu 0
  • RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same

    RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same

    When I ran the CIFAR code on a GPU, I got the following error. Any suggestions would be appreciated!

    RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same
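
    A likely fix, assuming the model parameters were left on the CPU while the inputs were moved to the GPU: move the parameters over as well before training, e.g.

    # params is assumed to be the dict of parameter tensors used by the
    # functional model in cifar.py
    params = {k: v.cuda() for k, v in params.items()}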

    opened by toseattle 1
  • Why not use bn for teacher net in imagenet.py

    Why not use bn for teacher net in imagenet.py

    Thanks for your great work first!

    I wonder why you do not apply the BN layer when running the teacher model at inference time here ( https://github.com/szagoruyko/attention-transfer/blob/master/imagenet.py#L117 )? Is it a typo?

    Hope for your reply!

    opened by cheerss 2
  • Got error when use 2 gpus.

    Got error when use 2 gpus.

    I got an error when using 2 GPUs to train the model on ImageNet. I followed the steps in the README, but got the following error:

    Traceback (most recent call last):
      File "imagenet.py", line 340, in <module>
        main()
      File "imagenet.py", line 336, in main
        engine.train(h, iter_train, opt.epochs, optimizer)
      File "/usr/local/lib/python3.6/site-packages/torchnet/engine/engine.py", line 63, in train
        state['optimizer'].step(closure)
      File "/usr/local/lib/python3.6/site-packages/torch/optim/sgd.py", line 80, in step
        loss = closure()
      File "/usr/local/lib/python3.6/site-packages/torchnet/engine/engine.py", line 52, in closure
        loss, output = state['network'](state['sample'])
      File "imagenet.py", line 265, in h
        y_s, y_t, loss_groups = utils.data_parallel(f, inputs, params, mode, range(opt.ngpu))
      File "/opt/ml/job/utils.py", line 64, in data_parallel
        return gather(outputs, output_device)
      File "/usr/local/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
        return gather_map(outputs)
      File "/usr/local/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
        return type(out)(map(gather_map, zip(*outputs)))
      File "/usr/local/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
        return type(out)(map(gather_map, zip(*outputs)))
      File "/usr/local/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 55, in gather_map
        return Gather.apply(target_device, dim, *outputs)
      File "/usr/local/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 54, in forward
        ctx.input_sizes = tuple(map(lambda i: i.size(ctx.dim), inputs))
      File "/usr/local/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 54, in <lambda>
        ctx.input_sizes = tuple(map(lambda i: i.size(ctx.dim), inputs))
    RuntimeError: dimension specified as 0 but tensor has no dimensions

    Thank you if you have any solution to this.

    opened by gtxjinx 0
  • Loss function problems

    Loss function problems

    Hi ,

    Thanks for your great work. I have a question about the implementation details: why do you just square the activations and then take the mean, rather than using the L2 norm as described in the paper? (A screenshot of the paper's equation was attached.)

    def at(x):
        return F.normalize(x.pow(2).mean(1).view(x.size(0), -1))
    
    
    def at_loss(x, y):
        return (at(x) - at(y)).pow(2).mean()
    
    opened by jacky4323 0
  • Strategy of α and β decay during training

    Strategy of α and β decay during training

    @szagoruyko @EderSantana Hi, thanks for sharing your code. Would you please specify your strategy for decaying the two multipliers α and β during training? Thanks in advance.

    opened by d-li14 0
  • Setting of β

    Setting of β

    Hi.

    In the paper, the authors said "As for parameter β in eq. 2, it usually varies about 0.1, as we set it to 10^3 divided by number of elements in attention map and batch size for each layer. "

    But I am still confused. What does 10^3 mean here, and how was 0.1 obtained?
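
    A back-of-envelope reading (map size and batch size are assumptions): with the mean-reduced AT loss, a multiplier beta = 10^3 corresponds to an effective per-element weight of 10^3 / (H*W * batch_size). For the 8x8 maps of a CIFAR network's last stage and batch size 128:

    H, W, batch_size = 8, 8, 128       # assumed last-stage map and batch
    print(1e3 / (H * W * batch_size))  # ~0.12, i.e. "about 0.1"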

    opened by tangbohu 1
PyTorch implementation of "ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context" (INTERSPEECH 2020)

ContextNet ContextNet has CNN-RNN-transducer architecture and features a fully convolutional encoder that incorporates global context information into

Sangchun Ha 24 Nov 24, 2022
Official implementation of Self-supervised Graph Attention Networks (SuperGAT), ICLR 2021.

SuperGAT Official implementation of Self-supervised Graph Attention Networks (SuperGAT). This model is presented at How to Find Your Friendly Neighbor

Dongkwan Kim 127 Dec 28, 2022
An implementation demo of the ICLR 2021 paper Neural Attention Distillation: Erasing Backdoor Triggers from Deep Neural Networks in PyTorch.

Neural Attention Distillation This is an implementation demo of the ICLR 2021 paper Neural Attention Distillation: Erasing Backdoor Triggers from Deep

Yige-Li 84 Jan 4, 2023
A PyTorch implementation of the paper "Semantic Image Synthesis via Adversarial Learning" in ICCV 2017

Semantic Image Synthesis via Adversarial Learning This is a PyTorch implementation of the paper Semantic Image Synthesis via Adversarial Learning. Req

Seonghyeon Nam 146 Nov 25, 2022
Understanding and Improving Encoder Layer Fusion in Sequence-to-Sequence Learning (ICLR 2021)

Understanding and Improving Encoder Layer Fusion in Sequence-to-Sequence Learning (ICLR 2021) Citation Please cite as: @inproceedings{liu2020understan

Sunbow Liu 22 Nov 25, 2022
Fader Networks: Manipulating Images by Sliding Attributes - NIPS 2017

FaderNetworks PyTorch implementation of Fader Networks (NIPS 2017). Fader Networks can generate different realistic versions of images by modifying at

Facebook Research 753 Dec 23, 2022
PyTorch version of the paper 'Enhanced Deep Residual Networks for Single Image Super-Resolution' (CVPRW 2017)

About PyTorch 1.2.0 Now the master branch supports PyTorch 1.2.0 by default. Due to the serious version problem (especially torch.utils.data.dataloade

Sanghyun Son 2.1k Jan 1, 2023
Oriented Response Networks, in CVPR 2017

Oriented Response Networks [Home] [Project] [Paper] [Supp] [Poster] Torch Implementation The torch branch contains: the official torch implementation

ZhouYanzhao 217 Dec 12, 2022
[ICLR 2021, Spotlight] Large Scale Image Completion via Co-Modulated Generative Adversarial Networks

Large Scale Image Completion via Co-Modulated Generative Adversarial Networks, ICLR 2021 (Spotlight) Demo | Paper [NEW!] Time to play with our interac

Shengyu Zhao 373 Jan 2, 2023
Code for our ICASSP 2021 paper: SA-Net: Shuffle Attention for Deep Convolutional Neural Networks

SA-Net: Shuffle Attention for Deep Convolutional Neural Networks (paper) By Qing-Long Zhang and Yu-Bin Yang [State Key Laboratory for Novel Software T

Qing-Long Zhang 199 Jan 8, 2023
Pervasive Attention: 2D Convolutional Networks for Sequence-to-Sequence Prediction

This is a fork of Fairseq(-py) with implementations of the following models: Pervasive Attention - 2D Convolutional Neural Networks for Sequence-to-Se

Maha 490 Dec 15, 2022
Codes for TIM2021 paper "Anchor-Based Spatio-Temporal Attention 3-D Convolutional Networks for Dynamic 3-D Point Cloud Sequences"

Codes for TIM2021 paper "Anchor-Based Spatio-Temporal Attention 3-D Convolutional Networks for Dynamic 3-D Point Cloud Sequences"

Intelligent Robotics and Machine Vision Lab 4 Jul 19, 2022
Pytorch version of VidLanKD: Improving Language Understanding viaVideo-Distilled Knowledge Transfer

VidLanKD Implementation of VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer by Zineng Tang, Jaemin Cho, Hao Tan, Mohi

Zineng Tang 54 Dec 20, 2022
[ICLR 2021] Is Attention Better Than Matrix Decomposition?

Enjoy-Hamburger ?? Official implementation of Hamburger, Is Attention Better Than Matrix Decomposition? (ICLR 2021) Under construction. Introduction T

Gsunshine 271 Dec 29, 2022
code for our paper "Source Data-absent Unsupervised Domain Adaptation through Hypothesis Transfer and Labeling Transfer"

SHOT++ Code for our TPAMI submission "Source Data-absent Unsupervised Domain Adaptation through Hypothesis Transfer and Labeling Transfer" that is ext

null 75 Dec 16, 2022
Transfer-Learn is an open-source and well-documented library for Transfer Learning.

Transfer-Learn is an open-source and well-documented library for Transfer Learning. It is based on pure PyTorch with high performance and friendly API. Our code is pythonic, and the design is consistent with torchvision. You can easily develop new algorithms, or readily apply existing algorithms.

THUML @ Tsinghua University 2.2k Jan 3, 2023
Transfer style api - An API to use with Tranfer Style App, where you can use two image and transfer the style

Transfer Style API It's an API to use with Tranfer Style App, where you can use

Brian Alejandro 1 Feb 13, 2022
The official PyTorch implementation of recent paper - SAINT: Improved Neural Networks for Tabular Data via Row Attention and Contrastive Pre-Training

This repository is the official PyTorch implementation of SAINT. Find the paper on arxiv SAINT: Improved Neural Networks for Tabular Data via Row Atte

Gowthami Somepalli 284 Dec 21, 2022
CoSMA: Convolutional Semi-Regular Mesh Autoencoder. From Paper "Mesh Convolutional Autoencoder for Semi-Regular Meshes of Different Sizes"

Mesh Convolutional Autoencoder for Semi-Regular Meshes of Different Sizes Implementation of CoSMA: Convolutional Semi-Regular Mesh Autoencoder arXiv p

Fraunhofer SCAI 10 Oct 11, 2022