Code implementation of Data Efficient Stagewise Knowledge Distillation paper.


Data Efficient Stagewise Knowledge Distillation

Stagewise Training Procedure


This repository presents the code implementation for Stagewise Knowledge Distillation, a technique for improving knowledge transfer between a teacher model and student model.


  • Install the dependencies using conda with the environment.yml file
    conda env create -f environment.yml
  • Set up the stagewise-knowledge-distillation package itself
    pip install -e .
  • In addition to the dependencies above, an Nvidia GPU (CUDA compatible) with at least 8 GB of video memory is recommended (most of the experiments also work with 6 GB). However, the code also runs on CPU-only machines.

Image Classification


In this work, ResNet architectures are used. Specifically, we used ResNet10, 14, 18, 20 and 26 as student networks and ResNet34 as the teacher network. The datasets used are CIFAR10, Imagenette and Imagewoof. Note that Imagenette and Imagewoof are subsets of ImageNet.


  • Before running any experiments, download the data and the saved teacher model weights to the appropriate locations.

  • The following script

    • downloads the datasets
    • saves 10%, 20%, 30% and 40% splits of each dataset separately
    • downloads teacher model weights for all 3 datasets
    # assuming you are in the root folder of the repository
    cd image_classification/scripts
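The fixed-percentage splits described above can be illustrated with a minimal sketch. It assumes class-stratified sampling with a fixed seed; the function name `stratified_subset` and the toy labels are hypothetical, and the repository's own script may build its splits differently.

```python
import random
from collections import defaultdict


def stratified_subset(labels, percent, seed=0):
    """Return sorted sample indices covering roughly `percent`% of each class.

    Hypothetical helper for illustration; not taken from the repository.
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    rng = random.Random(seed)  # local RNG, so the global random state is untouched
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    chosen = []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Keep at least one sample per class, even for tiny classes.
        keep = max(1, round(len(indices) * percent / 100))
        chosen.extend(indices[:keep])
    return sorted(chosen)


# Toy dataset: 10 classes with 100 samples each (made-up labels).
labels = [i % 10 for i in range(1000)]
splits = {p: stratified_subset(labels, p, seed=0) for p in (10, 20, 30, 40)}
```

Sampling per class keeps the class balance of the full dataset in every split, which matters when only 10% of the data is used.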


For detailed information on the various experiments, refer to the paper. All image classification experiments share the following training arguments, listed with the values they accept:

  • dataset (-d) : imagenette, imagewoof, cifar10
  • model (-m) : resnet10, resnet14, resnet18, resnet20, resnet26, resnet34
  • number of epochs (-e) : Integer is required
  • percentage of dataset (-p) : 10, 20, 30, 40 (don't use this argument at all for full dataset experiments)
  • random seed (-s) : Give any random seed (for reproducibility purposes)
  • gpu (-g) : Omit unless training on CPU (in which case, pass -g 'cpu'). On multi-GPU systems, run export CUDA_VISIBLE_DEVICES=id in the terminal before the experiment, where id is the ID of your GPU according to the nvidia-smi output.
  • Comet ML API key (-a) (optional) : To track your experiments with Comet ML, either pass your API key as the argument or make it the default argument in the file. Otherwise, omit this argument.
  • Comet ML workspace (-w) (optional) : To track your experiments with Comet ML, either pass your workspace name as the argument or make it the default argument in the file. Otherwise, omit this argument.
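As a rough illustration of how the arguments above fit together, here is a minimal `argparse` sketch. Only the short flags and the listed values come from this README; the long option names, the defaults and the function name `build_parser` are assumptions and may not match the actual training scripts.

```python
import argparse


def build_parser():
    """Sketch of a parser for the arguments listed above (long names are assumed)."""
    parser = argparse.ArgumentParser(description="Training arguments (sketch)")
    parser.add_argument("-d", "--dataset", required=True,
                        choices=["imagenette", "imagewoof", "cifar10"])
    parser.add_argument("-m", "--model", required=True,
                        choices=["resnet10", "resnet14", "resnet18",
                                 "resnet20", "resnet26", "resnet34"])
    parser.add_argument("-e", "--epoch", type=int, required=True)
    # Omitting -p means the full dataset is used.
    parser.add_argument("-p", "--percentage", type=int,
                        choices=[10, 20, 30, 40], default=None)
    parser.add_argument("-s", "--seed", type=int, default=0)
    # Assumed default: train on GPU unless -g cpu is given.
    parser.add_argument("-g", "--gpu", default="cuda")
    parser.add_argument("-a", "--api_key", default=None)
    parser.add_argument("-w", "--workspace", default=None)
    return parser


# Parse the flags of one of the example experiments (20% Imagewoof, ResNet18).
args = build_parser().parse_args("-d imagewoof -m resnet18 -p 20 -e 100 -s 0".split())
full = build_parser().parse_args("-d imagenette -m resnet10 -e 100 -s 0".split())
```

Restricting `-d`, `-m` and `-p` with `choices` makes a mistyped dataset name or an unsupported percentage fail immediately instead of partway through a run.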

In the following subsections, example commands for training are given for one experiment each.

No Teacher

Full Imagenette dataset, ResNet10

python3 -d imagenette -m resnet10 -e 100 -s 0

Traditional KD (FitNets)

20% Imagewoof dataset, ResNet18

python3 -d imagewoof -m resnet18 -p 20 -e 100 -s 0


30% CIFAR10 dataset, ResNet14

python3 -d cifar10 -m resnet14 -p 30 -e 100 -s 0

Attention Transfer KD

10% Imagewoof dataset, ResNet26

python3 -d imagewoof -m resnet26 -p 10 -e 100 -s 0

Hinton KD

Full CIFAR10 dataset, ResNet14

python3 -d cifar10 -m resnet14 -e 100 -s 0
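For readers unfamiliar with Hinton KD, the loss is commonly written as a weighted sum of a softened teacher-student KL term and the usual cross-entropy. The sketch below shows that standard formulation in plain Python so the arithmetic is visible. The temperature and `alpha` values are illustrative assumptions, and the repository's implementation may differ.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def hinton_kd_loss(student_logits, teacher_logits, label,
                   temperature=4.0, alpha=0.9):
    """Standard Hinton-style KD loss for a single sample (assumed hyperparameters)."""
    soft_student = softmax(student_logits, temperature)
    soft_teacher = softmax(teacher_logits, temperature)
    # KL(teacher || student) between the temperature-softened distributions.
    kl = sum(t * math.log(t / s)
             for t, s in zip(soft_teacher, soft_student) if t > 0)
    # Cross-entropy against the hard label at temperature 1.
    ce = -math.log(softmax(student_logits)[label])
    # The T^2 factor keeps the soft-target gradients on a comparable scale.
    return alpha * temperature ** 2 * kl + (1 - alpha) * ce


matched = hinton_kd_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1], label=0)
mismatched = hinton_kd_loss([0.1, 1.0, 2.0], [2.0, 1.0, 0.1], label=0)
```

When the student reproduces the teacher's logits exactly, the KL term vanishes and only the weighted cross-entropy remains, so `matched` is smaller than `mismatched`.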

Simultaneous KD (Proposed Baseline)

40% Imagenette dataset, ResNet20

python3 -d imagenette -m resnet20 -p 40 -e 100 -s 0

Stagewise KD (Proposed Method)

Full CIFAR10 dataset, ResNet10

python3 -d cifar10 -m resnet10 -e 100 -s 0

Semantic Segmentation


In this work, ResNet backbones are used to construct symmetric U-Nets for semantic segmentation. Specifically, we used ResNet10, 14, 18, 20 and 26 as the backbones for student networks and ResNet34 as the backbone for the teacher network. The dataset used is the Cambridge-driving Labeled Video Database (CamVid).


  • The following script
    • downloads the data (and moves it to the appropriate folder)
    • saves 10%, 20%, 30% and 40% splits of the dataset separately
    • downloads the pretrained teacher weights to the appropriate folder
    # assuming you are in the root folder of the repository
    cd semantic_segmentation/scripts


For detailed information on the various experiments, refer to the paper. All semantic segmentation experiments share the following training arguments, listed with the values they accept:

  • dataset (-d) : camvid
  • model (-m) : resnet10, resnet14, resnet18, resnet20, resnet26, resnet34
  • number of epochs (-e) : Integer is required
  • percentage of dataset (-p) : 10, 20, 30, 40 (don't use this argument at all for full dataset experiments)
  • random seed (-s) : Give any random seed (for reproducibility purposes)
  • gpu (-g) : Omit unless training on CPU (in which case, pass -g 'cpu'). On multi-GPU systems, run export CUDA_VISIBLE_DEVICES=id in the terminal before the experiment, where id is the ID of your GPU according to the nvidia-smi output.
  • Comet ML API key (-a) (optional) : To track your experiments with Comet ML, either pass your API key as the argument or make it the default argument in the file. Otherwise, omit this argument.
  • Comet ML workspace (-w) (optional) : To track your experiments with Comet ML, either pass your workspace name as the argument or make it the default argument in the file. Otherwise, omit this argument.
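To show what the random seed argument typically controls, here is a minimal seeding sketch. The helper name `seed_everything` is hypothetical; it seeds Python's `random` module and, if they are installed, NumPy and PyTorch. How the repository's scripts actually apply the seed is not shown in this README.

```python
import random


def seed_everything(seed):
    """Seed the common random number generators (hypothetical helper)."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass  # PyTorch not installed; nothing to seed


seed_everything(0)
first_draw = random.random()
seed_everything(0)
second_draw = random.random()  # same value as first_draw
```

Seeding makes data shuffling and weight initialization repeatable, although some GPU operations can remain nondeterministic even with a fixed seed.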

Note: Currently, there are no plans for adding Attention Transfer KD and FSP KD experiments for semantic segmentation.

In the following subsections, example commands for training are given for one experiment each.

No Teacher

Full CamVid dataset, ResNet10

python3 -d camvid -m resnet10 -e 100 -s 0

Traditional KD (FitNets)

20% CamVid dataset, ResNet18

python3 -d camvid -m resnet18 -p 20 -e 100 -s 0

Simultaneous KD (Proposed Baseline)

40% CamVid dataset, ResNet20

python3 -d camvid -m resnet20 -p 40 -e 100 -s 0

Stagewise KD (Proposed Method)

10% CamVid dataset, ResNet10

python3 -d camvid -m resnet10 -p 10 -e 100 -s 0


If you use this code or method in your work, please cite using

      title={Data Efficient Stagewise Knowledge Distillation}, 
      author={Akshay Kulkarni and Navid Panchi and Sharath Chandra Raparthy and Shital Chiddarwar},

Built by Akshay Kulkarni, Navid Panchi and Sharath Chandra Raparthy.

  • Runtime Error while running the code


    Hi, I followed the instructions for environment setup but got a runtime error while running the code for your proposed method. The error was a datatype exception: a float was expected but a double was received.

    opened by DanielSHKao 8
  • Some details about the project


    I would like to ask how the teacher model for Imagenette (BS = 64) in this project was trained. When I trained it myself, the results were very poor. Thank you.

    opened by QQQhz 4
  • refactored segmentation code


    @akshaykvnit please look into cityscapes class in file, needs few functions that I am not able to find. Except that everything looks fine to me. Imports have also been optimized

    opened by navidpanchi 4
  • Is there something wrong when computing mIOU?


    Hi, thanks for sharing! When computing the mIoU of the whole dataset, it seems that you just average the mIoU of each batch, but I don't think that is correct.

    # stagewise-knowledge-distillation\semantic_segmentation\utils\
    def mean_iou(model, dataloader, args):
        gpu1 = args.gpu
        ious = list()
        for i, (data, target) in enumerate(dataloader):
            data, target = data.float().to(gpu1), target.long().to(gpu1)
            prediction = model(data)
            prediction = F.softmax(prediction, dim=1)
            prediction = torch.argmax(prediction, axis=1).squeeze(1)
            ious.append(iou(target, prediction, num_classes=11))
        return (sum(ious) / len(ious))
    opened by zhouzg 3
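A minimal sketch of the alternative raised in the mIoU question above: accumulate a confusion matrix over all batches and compute per-class IoU once at the end, instead of averaging per-batch values. It uses plain Python lists of flattened labels, and the helper names and toy batches are hypothetical. It is not the repository's code.

```python
def update_confusion(confusion, targets, predictions):
    """Add one batch of flattened labels to the running confusion matrix."""
    for t, p in zip(targets, predictions):
        confusion[t][p] += 1


def mean_iou_from_confusion(confusion):
    """Mean of per-class IoU, skipping classes that never appear."""
    num_classes = len(confusion)
    ious = []
    for c in range(num_classes):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp
        fp = sum(confusion[r][c] for r in range(num_classes)) - tp
        union = tp + fp + fn
        if union > 0:
            ious.append(tp / union)
    return sum(ious) / len(ious) if ious else 0.0


num_classes = 2
# Two toy batches of (targets, predictions); values are made up.
batches = [
    ([0, 0, 0, 1], [0, 0, 0, 0]),
    ([1, 1, 1, 1], [1, 1, 1, 1]),
]

confusion = [[0] * num_classes for _ in range(num_classes)]
per_batch = []
for targets, predictions in batches:
    update_confusion(confusion, targets, predictions)
    batch_confusion = [[0] * num_classes for _ in range(num_classes)]
    update_confusion(batch_confusion, targets, predictions)
    per_batch.append(mean_iou_from_confusion(batch_confusion))

dataset_miou = mean_iou_from_confusion(confusion)  # approximately 0.775
batch_avg_miou = sum(per_batch) / len(per_batch)   # approximately 0.6875
```

The two numbers differ because per-batch averaging gives every batch equal weight regardless of how many pixels of each class it contains.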
  • Implementation of ATKD


    Adding implementation of Attention Transfer KD (ATKD) which is an ICLR '17 paper. Some inspiration for the source code is from their official implementation.

    Please go through the code (in case there are any obvious mistakes).

    I ran one experiment with ResNet10 full data and the result is 92% validation accuracy (comparable to 92.2% of simultaneous KD for same settings).

    opened by akshaykulkarni07 3
  • Large size of repository


    The repository is very large to download (around 150 MB). Most of this is in the .git folder because of the large number of commits and multiple branches. Is there a way to reduce the size? Cloning the repo takes up far more space than the actual code, which is barely 2 or 3 MB. @SharathRaparthy @navidpanchi

    opened by akshaykulkarni07 3
  • Image Classification Refactoring


    Started refactoring image classification code. Not yet done, but @SharathRaparthy and @navidpanchi, take a look at the organization of the code, and let me know if any files need to be moved somewhere else. Meanwhile, I'll keep changing the code internally without changing the overall directory structure.

    opened by akshaykulkarni07 2
  • Code refactoring


    @akshaykvnit review the code, especially the lines, and check the dataset also. Once we are able to write a single file for both small and big we can merge again.

    opened by navidpanchi 2
  • General Improvements


    I will be improving the code and the repository in general as much as possible. Some of the changes which will be done (soon) are:

    • [x] Improved and combined README for both image classification and semantic segmentation experiments.

    • [ ] Replacing Comet ML with Tensorboard, since Tensorboard is now officially supported by PyTorch (and I have been able to use it seamlessly) and it works locally without requiring Internet. Also, there is an issue of privacy, since my ID and API key are exposed in the code :cry:.

    • [x] Fixing the evaluation code for image classification experiments.

    The first draft for point 1 is already committed, so please go through and give your suggestions @SharathRaparthy @navidpanchi. Also, if you can think of any other such minor improvements/changes, then post them in this PR itself.

    Note: Don't make changes to the Table of Contents manually since I have used some plugin in VSCode to generate it automatically.

    documentation enhancement 
    opened by akshaykulkarni07 1
  • Implementation of FSPKD


    Adding implementation of FSP KD described in A Gift from Knowledge Distillation: Fast Optimization, Network Minimization and Transfer Learning which is a CVPR '17 paper. Code is in our format, but FSP matrix code is inspired from this repo. I wasn't able to find any official implementation.

    I ran one experiment with ResNet10 full data and the result is 94.2% validation accuracy (compared to 92% of ATKD and 92.2% of simultaneous KD for same settings).

    Note: If we merge the ATKD PR #9 before this, then it may cause some merge conflicts. But we can deal with that later.

    opened by akshaykulkarni07 0
Robotics and AI community of VNIT