⛵️The official PyTorch implementation for "BERT-of-Theseus: Compressing BERT by Progressive Module Replacing" (EMNLP 2020).

Overview

BERT-of-Theseus

Code for paper "BERT-of-Theseus: Compressing BERT by Progressive Module Replacing".

BERT-of-Theseus is a new compressed BERT obtained by progressively replacing the modules of the original BERT with compact successor modules.

[Figure: BERT-of-Theseus illustration]
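
To make the idea concrete, here is a minimal PyTorch sketch of progressive module replacing. It is illustrative only and does not mirror the repo's actual modeling code; the class and variable names are hypothetical. During training, each compact successor module takes over its group of frozen predecessor layers with probability replacing_rate; at inference time only the successor modules are used.

import torch
import torch.nn as nn


class TheseusBlock(nn.Module):
    """One successor module paired with the predecessor layers it replaces (illustrative)."""

    def __init__(self, predecessor_layers: nn.ModuleList, successor_layer: nn.Module):
        super().__init__()
        self.predecessor_layers = predecessor_layers  # frozen, from the fine-tuned predecessor
        self.successor_layer = successor_layer        # trainable, compact replacement
        for p in self.predecessor_layers.parameters():
            p.requires_grad_(False)

    def forward(self, hidden_states: torch.Tensor, replacing_rate: float) -> torch.Tensor:
        # During training, flip a coin per block: keep the original layers or replace them.
        if self.training and torch.rand(1).item() >= replacing_rate:
            for layer in self.predecessor_layers:
                hidden_states = layer(hidden_states)
            return hidden_states
        return self.successor_layer(hidden_states)


# Toy example: 12 predecessor "layers" compressed into 6 successor layers (2:1 groups).
dim = 16
prd = nn.ModuleList([nn.Linear(dim, dim) for _ in range(12)])
blocks = nn.ModuleList(
    [TheseusBlock(prd[2 * i: 2 * i + 2], nn.Linear(dim, dim)) for i in range(6)]
)
x = torch.randn(1, dim)
for block in blocks:
    x = block(x, replacing_rate=0.5)
print(x.shape)  # torch.Size([1, 16])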

Citation

If you use this code in your research, please cite our paper:

@inproceedings{xu-etal-2020-bert,
    title = "{BERT}-of-Theseus: Compressing {BERT} by Progressive Module Replacing",
    author = "Xu, Canwen  and
      Zhou, Wangchunshu  and
      Ge, Tao  and
      Wei, Furu  and
      Zhou, Ming",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-main.633",
    pages = "7859--7869"
}

NEW: We have uploaded a script for making predictions on GLUE tasks and preparing leaderboard submissions. Check it out here!

How to run BERT-of-Theseus

Requirements

Our code is built on huggingface/transformers. To use it, you must clone and install huggingface/transformers first.

Compress a BERT

  1. If you haven't already done so, fine-tune a predecessor model following the instructions from huggingface/transformers and save it to a directory.
  2. Run compression following the examples below:
# For compression with a replacement scheduler
export GLUE_DIR=/path/to/glue_data
export TASK_NAME=MRPC

python ./run_glue.py \
  --model_name_or_path /path/to/saved_predecessor \
  --task_name $TASK_NAME \
  --do_train \
  --do_eval \
  --do_lower_case \
  --data_dir "$GLUE_DIR/$TASK_NAME" \
  --max_seq_length 128 \
  --per_gpu_train_batch_size 32 \
  --per_gpu_eval_batch_size 32 \
  --learning_rate 2e-5 \
  --save_steps 50 \
  --num_train_epochs 15 \
  --output_dir /path/to/save_successor/ \
  --evaluate_during_training \
  --replacing_rate 0.3 \
  --scheduler_type linear \
  --scheduler_linear_k 0.0006
# For compression with a constant replacing rate
export GLUE_DIR=/path/to/glue_data
export TASK_NAME=MRPC

python ./run_glue.py \
  --model_name_or_path /path/to/saved_predecessor \
  --task_name $TASK_NAME \
  --do_train \
  --do_eval \
  --do_lower_case \
  --data_dir "$GLUE_DIR/$TASK_NAME" \
  --max_seq_length 128 \
  --per_gpu_train_batch_size 32 \
  --per_gpu_eval_batch_size 32 \
  --learning_rate 2e-5 \
  --save_steps 50 \
  --num_train_epochs 15 \
  --output_dir /path/to/save_successor/ \
  --evaluate_during_training \
  --replacing_rate 0.5 \
  --steps_for_replacing 2500 

For a detailed description of the arguments, please refer to the source code.
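
As a rough guide to the scheduler arguments above: with --scheduler_type linear, the replacing probability grows linearly with the training step, starting from --replacing_rate with slope --scheduler_linear_k and capped at 1.0, so training eventually fine-tunes the successor modules only. The snippet below is a hedged sketch of that curriculum, not the repository's scheduler code; refer to the scheduler implementation in this repo for the authoritative logic.

def linear_replacing_rate(step: int, base_rate: float = 0.3, k: float = 0.0006) -> float:
    """Replacing probability at a given training step (illustrative).

    base_rate corresponds to --replacing_rate and k to --scheduler_linear_k.
    """
    return min(1.0, base_rate + k * step)


print(linear_replacing_rate(0))     # 0.3  -> mostly predecessor modules early on
print(linear_replacing_rate(1000))  # 0.9
print(linear_replacing_rate(2000))  # 1.0  -> successor-only from here on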

Load Pretrained Model on MNLI

We provide a 6-layer model pretrained on MNLI as a general-purpose compressed model, which can transfer to other sentence classification tasks and outperforms DistilBERT (with the same 6-layer structure) on six GLUE tasks (dev set).

Method          | MNLI | MRPC | QNLI | QQP  | RTE  | SST-2 | STS-B
BERT-base       | 83.5 | 89.5 | 91.2 | 89.8 | 71.1 | 91.5  | 88.9
DistilBERT      | 79.0 | 87.5 | 85.3 | 84.9 | 59.9 | 90.7  | 81.2
BERT-of-Theseus | 82.1 | 87.5 | 88.8 | 88.8 | 70.1 | 91.8  | 87.8

You can easily load our general-purpose model using huggingface/transformers.

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("canwenxu/BERT-of-Theseus-MNLI")
model = AutoModel.from_pretrained("canwenxu/BERT-of-Theseus-MNLI")
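
For example, here is a complete illustrative snippet that loads the model and encodes a sentence (the exact output object may vary slightly across transformers versions):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("canwenxu/BERT-of-Theseus-MNLI")
model = AutoModel.from_pretrained("canwenxu/BERT-of-Theseus-MNLI")
model.eval()

inputs = tokenizer("BERT-of-Theseus compresses BERT by module replacing.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

last_hidden_state = outputs[0]  # shape: (batch_size, sequence_length, hidden_size)
print(last_hidden_state.shape)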

Bug Report and Contribution

If you'd like to contribute and add more tasks (only GLUE is available at the moment), please submit a pull request and contact me. Also, if you find any problem or bug, please report it by opening an issue. Thanks!

Third-Party Implementations

We list some third-party implementations from the community here. Please kindly add your implementation to this list:

Comments
  • ImportError: cannot import name 'BERT_PRETRAINED_MODEL_ARCHIVE_MAP'


    First of all, I would like to thank you for your work. I have read your paper on "BERT-of-Theseus" and found it amazing and well written. After fine-tuning BERT on the MRPC GLUE task, I tried to compress it using the BERT-of-Theseus technique, but unfortunately I get the following error: "ImportError: cannot import name 'BERT_PRETRAINED_MODEL_ARCHIVE_MAP'". Do you have any idea how to fix this? Thank you a lot.

    opened by MedSao 5
  • [Answered] MRPC Reproducibility


    Hi, I read this paper and the idea is interesting.

    However, after running your code, I cannot reach the performance reported in the paper. For instance, the MRPC dataset is used to fine-tune the teacher model, then the model is loaded and used to distill the dark knowledge into the student model. But the final performance of the student on MRPC is 85.2, which is much lower than the result in the paper, i.e., 89.0.

    I do not know why that happens; could you please help me out?

    Best.

    opened by YaNjIeE 5
  • are predecessor modules weights frozen or not


    Hi, thanks for your great work. According to the paper, the predecessor module weights are frozen after being fine-tuned on the task data (including the embedding and the output classifier). In the code, however, if my understanding is correct, the fine-tuned predecessor weights are not frozen; instead, the loss can back-propagate to the corresponding parameters. So which behavior is intended? Thanks in advance.

    opened by TobiasLee 4
  • Comparison against six layer BERT


    Hi

    I read your paper and found it very interesting. I was wondering whether you have any ablation results comparing a 6-layer BERT-of-Theseus (compressed from a 12-layer BERT) against a 6-layer BERT trained from scratch? If not, do you have any intuition for whether module replacement of a larger model would surpass that same smaller model trained from scratch?

    Many thanks David

    opened by david-macleod 2
  • Can I easily change the number of hidden layers for scc_layer?


    First, thank you for the amazing work. I have already used this method to train a small BERT (6 hidden layers) from BERT-base (12 hidden layers); on my own dataset the inference speed nearly doubles, but the accuracy drops from 84.5% to 82.8%. So I'm wondering whether I can use RoBERTa-large (24 hidden layers) as the predecessor model and set the number of hidden layers for scc_layer to 8?

    Because in modeling_bert_of_theseus.py I found the following lines of code:

    def __init__(self, config, scc_n_layer=6):
        super(BertEncoder, self).__init__()
        self.prd_n_layer = config.num_hidden_layers
        self.scc_n_layer = scc_n_layer
        assert self.prd_n_layer % self.scc_n_layer == 0
    opened by Jaeker0512 2
  • The config.json of scc_net should be modified?


    the config.json of the model which has been compressed as follows: { "_num_labels": 20, "architectures": [ "BertForSequenceClassification" ], "attention_probs_dropout_prob": 0.1, "bos_token_id": null, "directionality": "bidi", "do_sample": false, "early_stopping": false, "eos_token_ids": null, "finetuning_task": "mytask", "hidden_act": "gelu", "hidden_dropout_prob": 0.1, "hidden_size": 768, "id2label": { "0": "LABEL_0", "1": "LABEL_1", "2": "LABEL_2", "3": "LABEL_3", "4": "LABEL_4", "5": "LABEL_5", "6": "LABEL_6", "7": "LABEL_7", "8": "LABEL_8", "9": "LABEL_9", "10": "LABEL_10", "11": "LABEL_11", "12": "LABEL_12", "13": "LABEL_13", "14": "LABEL_14", "15": "LABEL_15", "16": "LABEL_16", "17": "LABEL_17", "18": "LABEL_18", "19": "LABEL_19" }, "initializer_range": 0.02, "intermediate_size": 3072, "is_decoder": false, "is_encoder_decoder": false, "label2id": { "LABEL_0": 0, "LABEL_1": 1, "LABEL_10": 10, "LABEL_11": 11, "LABEL_12": 12, "LABEL_13": 13, "LABEL_14": 14, "LABEL_15": 15, "LABEL_16": 16, "LABEL_17": 17, "LABEL_18": 18, "LABEL_19": 19, "LABEL_2": 2, "LABEL_3": 3, "LABEL_4": 4, "LABEL_5": 5, "LABEL_6": 6, "LABEL_7": 7, "LABEL_8": 8, "LABEL_9": 9 }, "layer_norm_eps": 1e-12, "length_penalty": 1.0, "max_length": 20, "max_position_embeddings": 512, "min_length": 0, "model_type": "bert", "no_repeat_ngram_size": 0, "num_attention_heads": 12, "num_beams": 1, "num_hidden_layers": 12, "num_return_sequences": 1, "output_attentions": false, "output_hidden_states": true, "output_past": true, "pad_token_id": null, "pooler_fc_size": 768, "pooler_num_attention_heads": 12, "pooler_num_fc_layers": 3, "pooler_size_per_head": 128, "pooler_type": "first_token_transform", "pruned_heads": {}, "repetition_penalty": 1.0, "temperature": 1.0, "top_k": 50, "top_p": 1.0, "torchscript": false, "type_vocab_size": 2, "use_bfloat16": false, "vocab_size": 21128 }

    If I load this model, this config.json will be loaded as well; should these parameters be modified? As far as I know, num_hidden_layers should not be 12.

    opened by janyChan 2
  • [Answered] Does Theseus support pruning width of network?


    Hi, thanks for sharing Theseus. To my understanding, Theseus can prune the depth of the network; I wonder whether it can also prune the width of the network?

    opened by jiezhangGt 2
  • fine-tune a predecessor model


    Would you provide the checkpoint of the predecessor model on the RTE dataset? With the mentioned hyperparameters, I cannot obtain the same score as your BERT-base result in the paper.

    opened by mcps5601 1
  • Which version of transformers  is this repo using?


    I got ImportError: cannot import name 'BERT_PRETRAINED_MODEL_ARCHIVE_MAP' from 'transformers.modeling_bert' when trying to follow the demo in the README. It seems BERT-of-Theseus requires an older version of transformers.

    opened by liebkne 1
  • how to draw the cover picture?


    Excuse me, I am seriously curious about the BERT-of-Theseus logo: how did you draw the cover picture? With some advanced tool, or just Photoshop?

    opened by MrRace 1
  • Duplicate definition of name (last_hidden_state) for float 16 onnx


    After training like that:

    # For compression with a replacement scheduler
    export GLUE_DIR=glue_script/glue_data
    export TASK_NAME=MRPC
    
    python ./run_glue.py \
      --model_name_or_path /home/bert-base \
      --task_name $TASK_NAME \
      --do_train \
      --do_eval \
      --do_lower_case \
      --data_dir "$GLUE_DIR/$TASK_NAME" \
      --max_seq_length 128 \
      --per_gpu_train_batch_size 32 \
      --per_gpu_eval_batch_size 32 \
      --learning_rate 2e-5 \
      --save_steps 50 \
      --num_train_epochs 15 \
      --output_dir result/ \
      --evaluate_during_training \
      --replacing_rate 0.3 \
      --scheduler_type linear \
      --scheduler_linear_k 0.0006
    

    I converted the resulting model to a Hugging Face checkpoint with convert_to_hf_ckpt.py, and then tried to export it to ONNX:

    output = torch.onnx.export(model,
                                   org_dummy_input,
                                   MODEL_ONNX_PATH,
                                   verbose=True,
                                   operator_export_type=OPERATOR_EXPORT_TYPE,
                                   opset_version=12,
                                   input_names=['input_ids', 'attention_mask', 'token_type_ids'], 
                                   output_names=['last_hidden_state', 'pooler_output'],  
                                   do_constant_folding=True,
                                   dynamic_axes={"input_ids": {0: "batch_size", 1: 'seq_length'},
                                                 "token_type_ids": {0: "batch_size", 1: 'seq_length'},
                                                 "attention_mask": {0: "batch_size", 1: 'seq_length'},
                                                 "pooler_output": {0: "batch_size"},
                                                 "last_hidden_state": {0: "batch_size", 1: 'seq_length'}}
                                   )
    

    The result is a float32 ONNX model. I then try to convert it to float16 ONNX:

    python3 -m onnxruntime_tools.transformers.optimizer --input  onnx/bert_fp32.onnx --output onnx/bert_fp16.onnx --float16
    

    However, when I run inference with the float16 ONNX model, I get the following error:

      session = InferenceSession(model_file, providers=['CUDAExecutionProvider'])
      File "/usr/local/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 283, in __init__
        self._create_inference_session(providers, provider_options, disabled_optimizers)
      File "/usr/local/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 310, in _create_inference_session
        sess = C.InferenceSession(session_options, self._model_path, True, self._read_config_from_model)
    onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Load model from /home/model_compression/BERT-of-Theseus/onnx/bert_fp16.onnx failed:This is an invalid model. Error: Duplicate definition of name (last_hidden_state).
    

    Have you tried converting the PyTorch model to float16 ONNX? Why does the half-precision conversion cause the duplicate definition of the name (last_hidden_state)? By the way, the float32 ONNX model runs successfully without errors.

    opened by MrRace 0
  • CoLA reproducibility


    Hi, I cannot reproduce the CoLA score reported in the paper. I followed Hugging Face's repo to train a predecessor model with a Matthews correlation of 55.76. However, the highest score of the successor model I got is 35.82. Could you provide the hyperparameters for training on the CoLA dataset?

    opened by mcps5601 1
  • What does “max_length” mean in config.json of successor


    What does "max_length" mean in the successor's config.json? I set max_seq_length=128 when running compression, but "max_length" in the successor's config.json is 20.


    opened by SuMeng123 0