PyTorch implementation of Patient Knowledge Distillation for BERT Model Compression

Overview

This repository contains a PyTorch implementation of Patient Knowledge Distillation (PKD) for BERT model compression. PKD compresses a deep BERT teacher into a shallower student by distilling not only the teacher's output distribution but also the [CLS] hidden states of its intermediate layers.

Installation

Run the commands below to set up the environment:

conda install pytorch torchvision cudatoolkit=10.0 -c pytorch
pip install -r requirements.txt

Training

Objective Function

L = (1 - \alpha) * L_CE + \alpha * L_DS + \beta * L_PT,

where L_CE is the cross-entropy loss on the ground-truth labels, L_DS is the usual distillation loss on the teacher's soft predictions, and L_PT is the proposed patient loss computed from the teacher's intermediate layers. Please see our paper below for more details.
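
As a point of reference, here is a minimal PyTorch sketch of how the three terms can be combined. The function name, tensor shapes, and the temperature T are illustrative assumptions rather than the repository's actual implementation; the patient term follows the paper's definition as a mean-squared error between L2-normalized [CLS] hidden states of the student layers and the matched teacher layers.

    import torch.nn.functional as F

    def pkd_loss(student_logits, teacher_logits, labels,
                 student_cls, teacher_cls, alpha=0.5, beta=10.0, T=2.0):
        # student_cls / teacher_cls: [CLS] hidden states of the student layers and
        # the matched teacher layers, shape (batch, n_layers, hidden).
        # alpha, beta, and T are illustrative defaults, not the repository's settings.

        # L_CE: cross-entropy against the ground-truth labels
        loss_ce = F.cross_entropy(student_logits, labels)

        # L_DS: KL divergence between temperature-softened student and teacher predictions
        loss_ds = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                           F.softmax(teacher_logits / T, dim=-1),
                           reduction='batchmean') * (T ** 2)

        # L_PT: patient loss, MSE between L2-normalized [CLS] hidden states
        s_norm = F.normalize(student_cls, p=2, dim=-1)
        t_norm = F.normalize(teacher_cls, p=2, dim=-1)
        loss_pt = F.mse_loss(s_norm, t_norm)

        return (1.0 - alpha) * loss_ce + alpha * loss_ds + beta * loss_pt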

Data Preprocess

Modify HOME_DATA_FOLDER in envs.py and put all data under it (by default it is ./data); the RTE data is already uploaded for your convenience. The expected layout is listed below, and a short setup sketch follows the list.

  • The folder names under HOME_DATA_FOLDER should be:
    • data_raw: stores the raw data of all tasks, so put the downloaded raw data here
      • MRPC
      • RTE
      • ... (other tasks)
    • data_feat: stores the tokenized data for each task (optional)
      • MRPC
      • RTE
      • ...
    • models
      • pretrained: put the downloaded pretrained model (bert-base-uncased) under this folder
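
The short sketch below creates this layout, assuming the default ./data location. The folder names come from the list above; the use of pathlib here is only an illustrative convenience, not part of the repository.

    from pathlib import Path

    # Should match HOME_DATA_FOLDER in envs.py (./data by default)
    HOME_DATA_FOLDER = Path('./data')

    for task in ['MRPC', 'RTE']:  # add whichever other GLUE tasks you need
        (HOME_DATA_FOLDER / 'data_raw' / task).mkdir(parents=True, exist_ok=True)
        (HOME_DATA_FOLDER / 'data_feat' / task).mkdir(parents=True, exist_ok=True)

    # the downloaded bert-base-uncased checkpoint goes under models/pretrained
    (HOME_DATA_FOLDER / 'models' / 'pretrained').mkdir(parents=True, exist_ok=True)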

Predefined Training

Run NLI_KD_training.py to start training. You can set DEBUG = True to run with one of the pre-defined argument sets listed below (a sketch of this flow follows the list).

  • set argv = get_predefine_argv('glue', 'RTE', 'finetune_teacher') or argv = get_predefine_argv('glue', 'RTE', 'finetune_student') to start normal fine-tuning of the teacher or the student
  • run run_glue_benchmark.py to get the teacher's predictions for KD or PKD
    • set output_all_layers = True for the patient teacher
    • set output_all_layers = False for the normal teacher
  • set argv = get_predefine_argv('glue', 'RTE', 'kd') to start vanilla KD
  • set argv = get_predefine_argv('glue', 'RTE', 'kd.cls') to start patient KD (PKD)
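
For orientation, the sketch below shows roughly how the DEBUG flag and the predefined arguments fit together at the top of NLI_KD_training.py. Only DEBUG, get_predefine_argv, and the mode strings appear in the repository; the import path and the surrounding control flow are assumptions for illustration.

    import sys

    # Import path is assumed here; get_predefine_argv is defined in argument_parser.py.
    from argument_parser import get_predefine_argv

    DEBUG = True  # set to True to use one of the predefined argument sets

    if DEBUG:
        # uncomment exactly one of the following
        argv = get_predefine_argv('glue', 'RTE', 'finetune_teacher')    # fine-tune the 12-layer teacher
        # argv = get_predefine_argv('glue', 'RTE', 'finetune_student')  # fine-tune the student directly
        # argv = get_predefine_argv('glue', 'RTE', 'kd')                # vanilla KD
        # argv = get_predefine_argv('glue', 'RTE', 'kd.cls')            # patient KD (PKD)
    else:
        argv = sys.argv[1:]

    # NLI_KD_training.py then parses argv and runs training with those arguments.

For the kd and kd.cls modes, make sure run_glue_benchmark.py has been run first so that the teacher's predictions (and, for the patient teacher, its intermediate-layer outputs) are available.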

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Citation

If you find this code useful for your research, please consider citing:

@article{sun2019patient,
  title={Patient Knowledge Distillation for BERT Model Compression},
  author={Sun, Siqi and Cheng, Yu and Gan, Zhe and Liu, Jingjing},
  journal={arXiv preprint arXiv:1908.09355},
  year={2019}
}

The paper is available on arXiv: https://arxiv.org/abs/1908.09355.

Comments
  • Reproducing results

    Nice paper! Thanks for sharing the code. I was trying to reproduce your results. It would be great if you could share the best hyperparameters for each GLUE task, for example for the command: $ python NLI_KD_training.py

    For RTE I was able to get the following results:

    acc = 0.6216666666666667
    eval_loss = 1.3263624415118644
    

    BUT

    With the only change being line 34 set to argv = get_predefine_argv('glue', 'MRPC', 'finetune_student'), I got:

    acc = 0.28289855072463765
    acc_and_f1 = 0.14144927536231883
    eval_loss = 3.820818234373022
    f1 = 0.0
    
    opened by pawankmrs 3
  • How do I run student predictions?

    Hey, I am trying to reproduce your results, and am interested in training several students with different numbers of hidden layers. I want to submit the student predictions to the GLUE website. I have been able to train student models with the PKD-Skip procedure.

    My question is, how do I make predictions from the student model? I guess I should change the run_glue_benchmark somehow. Any help in this regard will be appreciated.

    opened by smr97 2
  • RuntimeError: Function AddBackward0 returned an invalid gradient at index 1 - expected type torch.cuda.HalfTensor but got torch.cuda.FloatTensor

    I found a bug: when we set fp16=False and train RTE with kd.cls, we get this problem. The traceback is:

    Traceback (most recent call last):
      File "NLI_KD_training.py", line 288, in <module>
        loss.backward()
      File "/home/vernon/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 102, in backward
        torch.autograd.backward(self, gradient, retain_graph, create_graph)
      File "/home/vernon/anaconda3/lib/python3.6/site-packages/torch/autograd/__init__.py", line 90, in backward
        allow_unreachable=True)  # allow_unreachable flag
    RuntimeError: Function AddBackward0 returned an invalid gradient at index 1 - expected type torch.cuda.HalfTensor but got torch.cuda.FloatTensor
    
    opened by yg33717 2
  • Result is different...

    Thank you for your code. However, when I run the code with only the fine-tune teacher (BERT-base) setting:

    # run simple fune-tuning *teacher* by uncommenting below cmd
        argv = get_predefine_argv('glue', 'RTE', 'finetune_teacher')
    

    In argument_parser.py,

        elif mode == 'glue':
            argv = [
                    '--task_name', task_name,
                    '--bert_model', 'bert-base-uncased',
                    '--max_seq_length', '128',
                    '--train_batch_size', '32',
                    '--learning_rate', '2e-5',
                    '--num_train_epochs', '4',
                    '--eval_batch_size', '32',
                    '--log_every_step', '1',
                    '--output_dir', os.path.join(HOME_DATA_FOLDER, f'outputs/KD/{task_name}/teacher_12layer'),
                    '--do_train', 'True',
                    '--do_eval', 'True',
                    '--fp16', 'True',
                ]
            if train_type == 'finetune_teacher':
                argv += [
                    '--student_hidden_layers', '12',
                    '--kd_model', 'kd',
                    '--do_eval', 'True',
                    '--alpha', '0.0',    # alpha = 0 is equivalent to fine-tuning for KD
                ]
    

    Result

    12/16/2019 06:48:04 - INFO - __main__ -   ***** Eval results *****
    12/16/2019 06:48:04 - INFO - __main__ -     acc = 0.5983333333333334
    12/16/2019 06:48:04 - INFO - __main__ -     eval_loss = 1.6177796708776595
    

    What is the reason?

    opened by jdh3577 2
  • Running NLI_KD_training.py raises the following error.

    File 'NLI_KD_training.py', line 204: max_grad_norm=1.0 raises TypeError: __init__() got an unexpected keyword argument 'max_grad_norm'. According to https://nvidia.github.io/apex/optimizers.html, apex has been updated to a new release that removed this parameter. If I just ignore this parameter, will the performance be affected?

    opened by non-swimmer 2
  • Trying to do distillation for regression task

    Hi, I am trying to extend your research and compute accuracy for all GLUE tasks, and I am somewhat stuck on STS-B. Since it is a regression task, where do you think I should make the changes to get the numbers?

    opened by smr97 1
  • What's the version of python and pytorch of this project?

    I am impressed by your work, but I encountered some issues when reproducing this project. May I ask: which versions of Python and PyTorch does this project use?

    opened by vikotse 1
  • question on pretrained/bert_config.json

    Thanks for sharing your work! I'm training on a new dataset (a classification task just like the GLUE datasets) with the following steps and wanted to make sure I'm doing it right.

    1. Get a fine-tuned BERT (pytorch_model.bin) on the dataset and put it into the pretrained directory.

    2. Run NLI_KD_training.py with the following to get encoder.pkl and classifier.pkl:

    # run simple fune-tuning *teacher* by uncommenting below cmd
    argv = get_predefine_argv('glue', 'RTE', 'finetune_teacher')
    
    3. Run run_glue_benchmark.py to get the teacher's predictions for KD.

    4. Run NLI_KD_training.py with the following to distill knowledge from the teacher to the student model:

        # run Patient Teacher by uncommenting below cmd
        argv = get_predefine_argv('glue', 'RTE', 'kd.cls')
    

    I'm confused since step 2 redundantly fine-tunes the teacher (BERT base model), which was already done in step 1. Is it correct to place the fine-tuned version into the pretrained directory, or should I just use the plain pytorch_model.bin?

    opened by SeoHyeong 1
  • Where to download the pretrained weights?

    Thanks a lot for your impressive work and I want to reproduce the results in the paper. Now I have a question: where can I get the pretrained model (bert-base-uncased)? Thank you!

    opened by thudzj 1
  • Why do you set for KD.Full like this [fix_pooler=True]?

    Hi,

    Thank you for your interesting work! I am just wondering why you don't use the pooler only for KD.Full, and if you do use the pooler, did you initialize it with the BERT teacher's weight and bias?

    Thank you, Sincerely,

    opened by GeondoPark 0
  • Some questions about layer number (model size)

    Hi,

    Thank you for your interesting work! I have just started to learn BERT and distillation recently. I have some general questions regarding this topic.

    1. I want to compare the performance of BERT with different model sizes (numbers of transformer blocks). Is it necessary to do distillation? If I just train a 6-layer BERT without distillation, will the performance be bad?

    2. Do you have to do pretraining every time you change the number of layers in BERT? Is it possible to just remove some layers from an existing pre-trained model and fine-tune on the downstream tasks?

    3. Why does BERT have 12 blocks, not 11 or 13, etc.? I couldn't find any explanation.

    Thanks, ZLK

    opened by ZLKong 0
  • Not able to reproduce results

    First, thank you for releasing your code.

    I am trying to reproduce results of your paper. I am running NLI_KD_training.py for MRPC with DEBUG=True.

    The setting I am running is argv = get_predefine_argv('glue', 'MRPC', 'finetune_teacher').

    After completing training for 4 epochs, I get the following results:

    05/10/2020 19:09:30 - INFO - __main__ -   ***** Eval results *****
    05/10/2020 19:09:30 - INFO - __main__ -     acc = 0.27942028985507245
    05/10/2020 19:09:30 - INFO - __main__ -     acc_and_f1 = 0.13971014492753622
    05/10/2020 19:09:30 - INFO - __main__ -     eval_loss = 3.8775325307139643
    05/10/2020 19:09:30 - INFO - __main__ -     f1 = 0.0
    

    Also, the eval_log has the following:

    epoch,acc,loss
    1,0.8259803921568627,0.35975449818831223
    2,0.8700980392156863,0.3205762528456174
    3,0.8774509803921569,0.3944101127294394
    4,0.8578431372549019,0.4749428268808585
    

    -- which suggests training is probably correct but something is wrong with the test evaluation.

    I have referred to the hyperparameter files that are provided in results_summary but I am not sure what might be wrong.

    opened by ashim95 0