Official PyTorch implementation of PS-KD

Last update: Dec 28, 2022

Overview

Self-Knowledge Distillation with Progressive Refinement of Targets (PS-KD)

Accepted at ICCV 2021, oral presentation

Official PyTorch implementation of Self-Knowledge Distillation with Progressive Refinement of Targets (PS-KD).
[Slides] [Paper] [Video]
Kyungyul Kim, ByeongMoon Ji, Doyoung Yoon and Sangheum Hwang

Abstract

The generalization capability of deep neural networks has been substantially improved by applying a wide spectrum of regularization methods, e.g., restricting function space, injecting randomness during training, augmenting data, etc. In this work, we propose a simple yet effective regularization method named progressive self-knowledge distillation (PS-KD), which progressively distills a model's own knowledge to soften hard targets (i.e., one-hot vectors) during training. Hence, it can be interpreted within a framework of knowledge distillation as a student becomes a teacher itself. Specifically, targets are adjusted adaptively by combining the ground-truth and past predictions from the model itself. Please refer to the paper for more details.
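The target-softening step described above can be sketched in a few lines of plain Python. This is only an illustration, not the repository's code: the function names are hypothetical, the linear schedule alpha_t = alpha_T * t / T is an assumption about how the mixing weight grows over training, and total_epochs=300 is an arbitrary example value.

```python
import math

def pskd_soft_targets(one_hot, prev_probs, epoch, total_epochs, alpha_T=0.8):
    """Mix the hard label with the model's prediction from the previous epoch.

    Assumption: the mixing weight grows linearly from 0 to alpha_T.
    """
    alpha_t = alpha_T * epoch / total_epochs
    return [(1.0 - alpha_t) * y + alpha_t * p for y, p in zip(one_hot, prev_probs)]

def soft_cross_entropy(logits, soft_targets):
    """Cross-entropy between softmax(logits) and a soft target distribution."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return -sum(q * (z - log_z) for q, z in zip(soft_targets, logits))

# Hypothetical 3-class example, halfway through training.
y = [0.0, 1.0, 0.0]          # ground-truth one-hot vector
prev = [0.2, 0.7, 0.1]       # prediction stored at the previous epoch
targets = pskd_soft_targets(y, prev, epoch=150, total_epochs=300)
loss = soft_cross_entropy([0.5, 2.0, -1.0], targets)
```

At epoch 0 the targets equal the one-hot labels, and the model's past predictions gain weight as training proceeds.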

Requirements

We have tested the code in the following environment:

Python 3.7.7 / PyTorch (>=1.6.0) / torchvision (>=0.7.0)

Datasets

Currently, only the CIFAR-100 and ImageNet datasets are supported.

Note: To verify the effectiveness of PS-KD on object detection and machine translation tasks, we also used the following datasets:

For object detection: Pascal VOC
For machine translation: IWSLT 15 English-German / German-English, Multi30k
(Please refer to the paper for more details.)

How to Run

Single-node & Multi-GPU Training

To train a single model on one node with multiple GPUs, run the following command:

$ python3 main.py --lr 0.1 \
                  --lr_decay_schedule 150 225 \
                  --PSKD \
                  --experiments_dir '<set your own path>' \
                  --classifier_type 'ResNet18' \
                  --data_path '<root your own data path>' \
                  --data_type '<cifar100 or imagenet>' \
                  --alpha_T 0.8 \
                  --rank 0 \
                  --world_size 1 \
                  --multiprocessing_distributed True

Multi-node Training

To train a single model with 2 nodes, for instance, run the commands below in sequence:

# on the node #0
$ python3 main.py --lr 0.1 \
                  --lr_decay_schedule 150 225 \
                  --PSKD \
                  --experiments_dir '<set your own path>' \
                  --classifier_type 'ResNet18' \
                  --data_path '<root your own data path>' \
                  --data_type '<cifar100 or imagenet>' \
                  --alpha_T 0.8 \
                  --rank 0 \
                  --world_size 2 \
                  --dist_url tcp://{master_ip}:{master_port} \
                  --multiprocessing_distributed

# on the node #1
$ python3 main.py --lr 0.1 \
                  --lr_decay_schedule 150 225 \
                  --PSKD \
                  --experiments_dir '<set your own path>' \
                  --classifier_type 'ResNet18' \
                  --data_path '<root your own data path>' \
                  --data_type '<cifar100 or imagenet>' \
                  --alpha_T 0.8 \
                  --rank 1 \
                  --world_size 2 \
                  --dist_url tcp://{master_ip}:{master_port} \
                  --multiprocessing_distributed
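The relationship between the --rank and --world_size values above and the per-GPU processes can be sketched as follows. This assumes main.py follows the common PyTorch launch pattern in which --rank is the node index, --world_size is the number of nodes, and one process is spawned per GPU; the source does not confirm this, and the helper name is hypothetical.

```python
def global_ranks(node_rank, num_nodes, gpus_per_node):
    """Return (total process count, global ranks of the processes on this node).

    Assumption: one process per GPU, ranks numbered node by node.
    """
    total_processes = num_nodes * gpus_per_node
    ranks = [node_rank * gpus_per_node + gpu for gpu in range(gpus_per_node)]
    return total_processes, ranks

# Two nodes with four GPUs each (example values only).
total, node0 = global_ranks(node_rank=0, num_nodes=2, gpus_per_node=4)
_, node1 = global_ranks(node_rank=1, num_nodes=2, gpus_per_node=4)
```

Under this assumption, every node must point --dist_url at the same master address and port so that all processes join the same group.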

Saving & Loading Checkpoints

Saved Filenames

save_dir is determined automatically (with a sequential number suffix) unless otherwise specified.
Model checkpoints are saved in ./{experiments_dir}/models/checkpoint_{epoch}.pth.
The best checkpoint is saved in ./{experiments_dir}/models/checkpoint_best.pth.

Loading Checkpoints (resume)

Pass the checkpoint path via the --resume argument.
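A small helper for building the path to pass to --resume might look like the sketch below. It follows the filename layout listed under Saved Filenames; the helper name is hypothetical, and writing the epoch number without zero padding is an assumption.

```python
from pathlib import Path
from typing import Optional

def checkpoint_path(experiments_dir: str, epoch: Optional[int] = None) -> Path:
    """Build a checkpoint path; with no epoch, point at the best checkpoint."""
    # Assumption: the epoch number is written without zero padding.
    name = "checkpoint_best.pth" if epoch is None else f"checkpoint_{epoch}.pth"
    return Path(experiments_dir) / "models" / name

best = checkpoint_path("my_experiment")
epoch_150 = checkpoint_path("my_experiment", epoch=150)
```

The resulting path can then be supplied on the command line, e.g. `--resume my_experiment/models/checkpoint_best.pth`.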

Experimental Results

Performance measures

Top-1 Error / Top-5 Error
Negative Log Likelihood (NLL)
Expected Calibration Error (ECE)
Area Under the Risk-coverage Curve (AURC)
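For readers unfamiliar with the calibration and selective-prediction metrics, the sketch below shows one common way to compute ECE and AURC from per-sample confidences and correctness flags. It is not taken from this repository: the function names, the choice of 15 equal-width bins, and the absence of any scaling factor are all assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=15):
    """Weighted average gap between confidence and accuracy over equal-width bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes to the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece

def aurc(confidences, correct):
    """Average error rate over all coverage levels, keeping most-confident samples first."""
    order = sorted(range(len(confidences)), key=lambda i: -confidences[i])
    errors, risks = 0, []
    for k, i in enumerate(order, start=1):
        errors += 0 if correct[i] else 1
        risks.append(errors / k)  # risk when only the top-k samples are kept
    return sum(risks) / len(risks)

ece_value = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
aurc_value = aurc([0.9, 0.8, 0.7], [True, False, True])
```

Lower is better for both quantities: a well-calibrated model has confidences that match its accuracy, and a low AURC means errors are concentrated among low-confidence predictions.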

Results on CIFAR-100

Model + Method	Dataset	Top-1 Error	Top-5 Error	NLL	ECE	AURC
PreAct ResNet-18 (baseline)	CIFAR-100	24.18	6.90	1.10	11.84	67.65
PreAct ResNet-18 + Label Smoothing	CIFAR-100	20.94	6.02	0.98	10.79	57.74
PreAct ResNet-18 + CS-KD [CVPR'20]	CIFAR-100	21.30	5.70	0.88	6.24	56.56
PreAct ResNet-18 + TF-KD [CVPR'20]	CIFAR-100	22.88	6.01	1.05	11.96	61.77
PreAct ResNet-18 + PS-KD	CIFAR-100	20.82	5.10	0.76	1.77	52.10
PreAct ResNet-101 (baseline)	CIFAR-100	20.75	5.28	0.89	10.02	55.45
PreAct ResNet-101 + Label Smoothing	CIFAR-100	19.84	5.07	0.93	3.43	95.76
PreAct ResNet-101 + CS-KD [CVPR'20]	CIFAR-100	20.76	5.62	1.02	12.18	64.44
PreAct ResNet-101 + TF-KD [CVPR'20]	CIFAR-100	20.13	5.10	0.84	6.14	58.8
PreAct ResNet-101 + PS-KD	CIFAR-100	19.43	4.30	0.74	6.92	49.01
DenseNet-121 (baseline)	CIFAR-100	20.05	4.99	0.82	7.34	52.21
DenseNet-121 + Label Smoothing	CIFAR-100	19.80	5.46	0.92	3.76	91.06
DenseNet-121 + CS-KD [CVPR'20]	CIFAR-100	20.47	6.21	1.07	13.80	73.37
DenseNet-121 + TF-KD [CVPR'20]	CIFAR-100	19.88	5.10	0.85	7.33	69.23
DenseNet-121 + PS-KD	CIFAR-100	18.73	3.90	0.69	3.71	45.55
ResNeXt-29 (baseline)	CIFAR-100	18.65	4.47	0.74	4.17	44.27
ResNeXt-29 + Label Smoothing	CIFAR-100	17.60	4.23	1.05	22.14	41.92
ResNeXt-29 + CS-KD [CVPR'20]	CIFAR-100	18.26	4.37	0.80	5.95	42.11
ResNeXt-29 + TF-KD [CVPR'20]	CIFAR-100	17.33	3.87	0.74	6.73	40.34
ResNeXt-29 + PS-KD	CIFAR-100	17.28	3.60	0.72	9.18	40.19
PyramidNet-200 (baseline)	CIFAR-100	16.80	3.69	0.73	8.04	36.95
PyramidNet-200 + Label Smoothing	CIFAR-100	17.82	4.72	0.89	3.46	105.02
PyramidNet-200 + CS-KD [CVPR'20]	CIFAR-100	18.31	5.70	1.17	14.70	70.05
PyramidNet-200 + TF-KD [CVPR'20]	CIFAR-100	16.48	3.37	0.79	10.48	37.04
PyramidNet-200 + PS-KD	CIFAR-100	15.49	3.08	0.56	1.83	32.14

Results on ImageNet

Model + Method	Dataset	Top-1 Error	Top-5 Error	NLL	ECE	AURC
DenseNet-264*	ImageNet	22.15	6.12	--	--	--
ResNet-152	ImageNet	22.19	6.19	0.88	3.84	61.79
ResNet-152 + Label Smoothing	ImageNet	21.73	5.85	0.92	3.91	68.24
ResNet-152 + CS-KD [CVPR'20]	ImageNet	21.61	5.92	0.90	5.79	62.12
ResNet-152 + TF-KD [CVPR'20]	ImageNet	22.76	6.43	0.91	4.70	65.28
ResNet-152 + PS-KD	ImageNet	21.41	5.86	0.84	2.51	61.01

* denotes results reported in the original papers

Citation

If you find this repository useful, please consider giving it a star ⭐ and citing PS-KD:

@InProceedings{Kim_2021_ICCV,
    author    = {Kim, Kyungyul and Ji, ByeongMoon and Yoon, Doyoung and Hwang, Sangheum},
    title     = {Self-Knowledge Distillation With Progressive Refinement of Targets},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2021},
    pages     = {6567-6576}
}

Contact for Issues

ByeongMoon Ji, [email protected]
Kyungyul Kim, [email protected]
Doyoung Yoon, [email protected]

License

Copyright (c) 2021-present LG CNS Corp.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.

Comments

Hyperparameter for ImageNet
Hi @lgcnsai @ByeongMoonJi,

Thanks for your excellent work!

The main paper reports a batch size of 256 for ImageNet, while the supplementary material uses 512.

Which batch size did you actually use for Table 4?

Also, could you share the alpha value used for the ImageNet experiments in Table 4?

Lastly, could you upload the pre-trained ResNet-152 models (baseline, with LS, with CS, with TF, and with PS-KD) to GitHub?

Best Regards, Chakkrit
opened by chakkritte 3
Results of LS

Thanks for your great work. The label smoothing results reported in your paper are surprisingly high. Did you reproduce these results yourselves, or were they copied from existing work? If the former, could you share the code?

Thanks a lot!

opened by MrChenFeng 1
Training Problem

Hello, what machine configuration was used for training in this paper? On my machine (NVIDIA 2080, PyTorch 1.6), ResNet18 on CIFAR-100 only reaches an accuracy of 78.600.

opened by XiaoBuL 1
Should the gradient be calculated only for P_t(x), or for both P_{t-1}(x) and P_{t}(x)?

https://github.com/lgcnsai/PS-KD-Pytorch/blob/a0fceec51c3742515416f3ad1a2764cf4b321287/main.py#L481

https://github.com/lgcnsai/PS-KD-Pytorch/blob/a0fceec51c3742515416f3ad1a2764cf4b321287/main.py#L437

https://github.com/lgcnsai/PS-KD-Pytorch/blob/a0fceec51c3742515416f3ad1a2764cf4b321287/main.py#L446

Hi~ When reading Eq. (6) in the authors' interesting paper, I wondered whether the gradient should be calculated only for P_t(x), or for both P_{t-1}(x) and P_{t}(x). The theoretical support in the paper seems to be based on the former, but the code appears to implement the latter.

Specifically, at line 481 of the referenced code, softmax_output is assigned to all_predictions[input_indices] without detach(). In the next epoch, all_predictions[input_indices] is used to calculate soft_targets (see line 437 of the referenced code). The loss is then computed as loss = criterion_CE_pskd(outputs, soft_targets), so loss.backward() will compute gradients for both outputs and soft_targets, which correspond to P_{t}(x) and (1-\alpha)y+\alpha P_{t-1}(x) in the paper, respectively.

Is my understanding correct, or have I missed something?

opened by gyla1993 1
