PyTorch implementation of Patient Knowledge Distillation for BERT Model Compression

Overview

This repository contains a PyTorch implementation of Patient Knowledge Distillation (PKD) for BERT model compression. PKD compresses a deep BERT teacher into a shallower student by distilling not only the teacher's output distribution but also the [CLS] hidden states of its intermediate layers.

Installation

Run the commands below to set up the environment:

conda install pytorch torchvision cudatoolkit=10.0 -c pytorch
pip install -r requirements.txt

Training

Objective Function

L = (1 - \alpha) * L_CE + \alpha * L_DS + \beta * L_PT,

where L_CE is the cross-entropy loss on the ground-truth labels, L_DS is the usual distillation loss on the teacher's soft predictions, and L_PT is the proposed patient loss computed from the teacher's intermediate layers. Please see our paper below for more details.
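
As a point of reference, here is a minimal PyTorch sketch of how the three terms can be combined. The function name, tensor shapes, and the temperature T are illustrative assumptions rather than the repository's actual implementation; the patient term follows the paper's definition as a mean-squared error between L2-normalized [CLS] hidden states of the student layers and the matched teacher layers.

    import torch.nn.functional as F

    def pkd_loss(student_logits, teacher_logits, labels,
                 student_cls, teacher_cls, alpha=0.5, beta=10.0, T=2.0):
        # student_cls / teacher_cls: [CLS] hidden states of the student layers and
        # the matched teacher layers, shape (batch, n_layers, hidden).
        # alpha, beta, and T are illustrative defaults, not the repository's settings.

        # L_CE: cross-entropy against the ground-truth labels
        loss_ce = F.cross_entropy(student_logits, labels)

        # L_DS: KL divergence between temperature-softened student and teacher predictions
        loss_ds = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                           F.softmax(teacher_logits / T, dim=-1),
                           reduction='batchmean') * (T ** 2)

        # L_PT: patient loss, MSE between L2-normalized [CLS] hidden states
        s_norm = F.normalize(student_cls, p=2, dim=-1)
        t_norm = F.normalize(teacher_cls, p=2, dim=-1)
        loss_pt = F.mse_loss(s_norm, t_norm)

        return (1.0 - alpha) * loss_ce + alpha * loss_ds + beta * loss_pt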

Data Preprocess

Modify HOME_DATA_FOLDER in envs.py and put all data under it (by default it is ./data); the RTE data is already uploaded for your convenience. The expected layout is listed below, and a short setup sketch follows the list.

  • The folder names under HOME_DATA_FOLDER should be:
    • data_raw: stores the raw data of all tasks, so put the downloaded raw data here
      • MRPC
      • RTE
      • ... (other tasks)
    • data_feat: stores the tokenized data for each task (optional)
      • MRPC
      • RTE
      • ...
    • models
      • pretrained: put the downloaded pretrained model (bert-base-uncased) under this folder
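
The short sketch below creates this layout, assuming the default ./data location. The folder names come from the list above; the use of pathlib here is only an illustrative convenience, not part of the repository.

    from pathlib import Path

    # Should match HOME_DATA_FOLDER in envs.py (./data by default)
    HOME_DATA_FOLDER = Path('./data')

    for task in ['MRPC', 'RTE']:  # add whichever other GLUE tasks you need
        (HOME_DATA_FOLDER / 'data_raw' / task).mkdir(parents=True, exist_ok=True)
        (HOME_DATA_FOLDER / 'data_feat' / task).mkdir(parents=True, exist_ok=True)

    # the downloaded bert-base-uncased checkpoint goes under models/pretrained
    (HOME_DATA_FOLDER / 'models' / 'pretrained').mkdir(parents=True, exist_ok=True)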

Predefined Training

Run NLI_KD_training.py to start training. You can set DEBUG = True to run with one of the pre-defined argument sets listed below (a sketch of this flow follows the list).

  • set argv = get_predefine_argv('glue', 'RTE', 'finetune_teacher') or argv = get_predefine_argv('glue', 'RTE', 'finetune_student') to start normal fine-tuning of the teacher or the student
  • run run_glue_benchmark.py to get the teacher's predictions for KD or PKD
    • set output_all_layers = True for the patient teacher
    • set output_all_layers = False for the normal teacher
  • set argv = get_predefine_argv('glue', 'RTE', 'kd') to start vanilla KD
  • set argv = get_predefine_argv('glue', 'RTE', 'kd.cls') to start patient KD (PKD)
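
For orientation, the sketch below shows roughly how the DEBUG flag and the predefined arguments fit together at the top of NLI_KD_training.py. Only DEBUG, get_predefine_argv, and the mode strings appear in the repository; the import path and the surrounding control flow are assumptions for illustration.

    import sys

    # Import path is assumed here; get_predefine_argv is defined in argument_parser.py.
    from argument_parser import get_predefine_argv

    DEBUG = True  # set to True to use one of the predefined argument sets

    if DEBUG:
        # uncomment exactly one of the following
        argv = get_predefine_argv('glue', 'RTE', 'finetune_teacher')    # fine-tune the 12-layer teacher
        # argv = get_predefine_argv('glue', 'RTE', 'finetune_student')  # fine-tune the student directly
        # argv = get_predefine_argv('glue', 'RTE', 'kd')                # vanilla KD
        # argv = get_predefine_argv('glue', 'RTE', 'kd.cls')            # patient KD (PKD)
    else:
        argv = sys.argv[1:]

    # NLI_KD_training.py then parses argv and runs training with those arguments.

For the kd and kd.cls modes, make sure run_glue_benchmark.py has been run first so that the teacher's predictions (and, for the patient teacher, its intermediate-layer outputs) are available.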

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Citation

If you find this code useful for your research, please consider citing:

@article{sun2019patient,
  title={Patient Knowledge Distillation for BERT Model Compression},
  author={Sun, Siqi and Cheng, Yu and Gan, Zhe and Liu, Jingjing},
  journal={arXiv preprint arXiv:1908.09355},
  year={2019}
}

The paper is available on arXiv: https://arxiv.org/abs/1908.09355.

Comments
  • Reproducing results

    Nice paper! Thanks for sharing the code. I was trying to reproduce your results. It would be great if you could share the best hyperparameters for each GLUE task, for example for the command: $ python NLI_KD_training.py

    For RTE I was able to get the following results:

    acc = 0.6216666666666667
    eval_loss = 1.3263624415118644
    

    BUT

    With the only change being line 34 set to argv = get_predefine_argv('glue', 'MRPC', 'finetune_student'), I got:

    acc = 0.28289855072463765
    acc_and_f1 = 0.14144927536231883
    eval_loss = 3.820818234373022
    f1 = 0.0
    
    opened by pawankmrs 3
  • How do I run student predictions?

    Hey, I am trying to reproduce your results, and am interested in training several students with different numbers of hidden layers. I want to submit the student predictions to the GLUE website. I have been able to train student models with the PKD-Skip procedure.

    My question is, how do I make predictions from the student model? I guess I should change the run_glue_benchmark somehow. Any help in this regard will be appreciated.

    opened by smr97 2
  • RuntimeError: Function AddBackward0 returned an invalid gradient at index 1 - expected type torch.cuda.HalfTensor but got torch.cuda.FloatTensor

    I found a bug: when we set fp16=False and train RTE with kd.cls, we get this problem. The traceback is:

    Traceback (most recent call last):
      File "NLI_KD_training.py", line 288, in <module>
        loss.backward()
      File "/home/vernon/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 102, in backward
        torch.autograd.backward(self, gradient, retain_graph, create_graph)
      File "/home/vernon/anaconda3/lib/python3.6/site-packages/torch/autograd/__init__.py", line 90, in backward
        allow_unreachable=True)  # allow_unreachable flag
    RuntimeError: Function AddBackward0 returned an invalid gradient at index 1 - expected type torch.cuda.HalfTensor but got torch.cuda.FloatTensor
    
    opened by yg33717 2
  • Result is different...

    Thank you for your code. However, when I run the code with only the fine-tune teacher (BERT-base) setting:

    # run simple fune-tuning *teacher* by uncommenting below cmd
        argv = get_predefine_argv('glue', 'RTE', 'finetune_teacher')
    

    In argument_parser.py,

        elif mode == 'glue':
            argv = [
                    '--task_name', task_name,
                    '--bert_model', 'bert-base-uncased',
                    '--max_seq_length', '128',
                    '--train_batch_size', '32',
                    '--learning_rate', '2e-5',
                    '--num_train_epochs', '4',
                    '--eval_batch_size', '32',
                    '--log_every_step', '1',
                    '--output_dir', os.path.join(HOME_DATA_FOLDER, f'outputs/KD/{task_name}/teacher_12layer'),
                    '--do_train', 'True',
                    '--do_eval', 'True',
                    '--fp16', 'True',
                ]
            if train_type == 'finetune_teacher':
                argv += [
                    '--student_hidden_layers', '12',
                    '--kd_model', 'kd',
                    '--do_eval', 'True',
                    '--alpha', '0.0',    # alpha = 0 is equivalent to fine-tuning for KD
                ]
    

    Result

    12/16/2019 06:48:04 - INFO - __main__ -   ***** Eval results *****
    12/16/2019 06:48:04 - INFO - __main__ -     acc = 0.5983333333333334
    12/16/2019 06:48:04 - INFO - __main__ -     eval_loss = 1.6177796708776595
    

    What is the reason?

    opened by jdh3577 2
  • Running NLI_KD_training.py raises the following error.

    File 'NLI_KD_training.py', line 204: max_grad_norm=1.0 raises TypeError: __init__() got an unexpected keyword argument 'max_grad_norm'. According to https://nvidia.github.io/apex/optimizers.html, apex has been updated to a new release that removed this parameter. If I just ignore this parameter, will the performance be affected?

    opened by non-swimmer 2
  • Trying to do distillation for regression task

    Hi, I am trying to extend your research and compute accuracy for all GLUE tasks, and I am somewhat stuck on STS-B. Since it is a regression task, where do you think I should make the changes to get the numbers?

    opened by smr97 1
  • What's the version of python and pytorch of this project?

    I am impressed by your work, but I encountered some issues when reproducing this project. May I ask: which versions of Python and PyTorch does this project use?

    opened by vikotse 1
  • question on pretrained/bert_config.json

    Thanks for sharing your work! I'm training on a new dataset (a classification task just like the GLUE datasets) with the following steps and wanted to make sure I'm doing it right.

    1. Get a fine-tuned BERT (pytorch_model.bin) on the dataset and put it into the pretrained directory.

    2. Run NLI_KD_training.py with the following to get encoder.pkl and classifier.pkl:

    # run simple fune-tuning *teacher* by uncommenting below cmd
    argv = get_predefine_argv('glue', 'RTE', 'finetune_teacher')
    
    3. Run run_glue_benchmark.py to get the teacher's predictions for KD.

    4. Run NLI_KD_training.py with the following to distill knowledge from the teacher to the student model:

        # run Patient Teacher by uncommenting below cmd
        argv = get_predefine_argv('glue', 'RTE', 'kd.cls')
    

    I'm confused since step 2 redundantly fine-tunes the teacher (BERT base model), which was already done in step 1. Is it correct to place the fine-tuned version into the pretrained directory, or should I just use the plain pytorch_model.bin?

    opened by SeoHyeong 1
  • Where to download the pretrained weights?

    Thanks a lot for your impressive work and I want to reproduce the results in the paper. Now I have a question: where can I get the pretrained model (bert-base-uncased)? Thank you!

    opened by thudzj 1
  • Why do you set for KD.Full like this [fix_pooler=True]?

    Hi,

    Thank you for your interesting work! I am just wondering why you don't use the pooler only for KD.Full, and if you do use the pooler, did you initialize it with the BERT teacher's weight and bias?

    Thank you, Sincerely,

    opened by GeondoPark 0
  • Some questions about layer number (model size)

    Hi,

    Thank you for your interesting work! I have just started to learn BERT and distillation recently. I have some general questions regarding this topic.

    1. I want to compare the performance of BERT with different model sizes (numbers of transformer blocks). Is it necessary to do distillation? If I just train a 6-layer BERT without distillation, will the performance be bad?

    2. Do you have to do pretraining every time you change the number of layers in BERT? Is it possible to just remove some layers from an existing pre-trained model and fine-tune on the downstream tasks?

    3. Why does BERT have 12 blocks, not 11 or 13, etc.? I couldn't find any explanation.

    Thanks, ZLK

    opened by ZLKong 0
  • Not able to reproduce results

    First, thank you for releasing your code.

    I am trying to reproduce results of your paper. I am running NLI_KD_training.py for MRPC with DEBUG=True.

    The setting I am running is argv = get_predefine_argv('glue', 'MRPC', 'finetune_teacher').

    After completing training for 4 epochs, I get the following results:

    05/10/2020 19:09:30 - INFO - __main__ -   ***** Eval results *****
    05/10/2020 19:09:30 - INFO - __main__ -     acc = 0.27942028985507245
    05/10/2020 19:09:30 - INFO - __main__ -     acc_and_f1 = 0.13971014492753622
    05/10/2020 19:09:30 - INFO - __main__ -     eval_loss = 3.8775325307139643
    05/10/2020 19:09:30 - INFO - __main__ -     f1 = 0.0
    

    Also, the eval_log has the following:

    epoch,acc,loss
    1,0.8259803921568627,0.35975449818831223
    2,0.8700980392156863,0.3205762528456174
    3,0.8774509803921569,0.3944101127294394
    4,0.8578431372549019,0.4749428268808585
    

    -- which suggests training is probably correct but something is wrong with the test evaluation.

    I have referred to the hyperparameter files that are provided in results_summary but I am not sure what might be wrong.

    opened by ashim95 0