This PyTorch package implements MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation (NAACL 2022).



  • Create and activate conda environment.
conda env create -f environment.yml
  • Install Transformers locally.
pip install -e .
  • Note: The code is adapted from this codebase. Arguments regarding LoRA and adapter can be safely ignored.


MoEBERT targets task-specific distillation. Before running any distillation code, a pre-trained BERT model should be fine-tuned on the target task. The path to the fine-tuned model should be passed to --model_name_or_path.

Importance Score Computation

  • To compute the importance scores, run the provided script with the --preprocess_importance argument added and the --do_train argument removed.
  • If multiple GPUs are used to compute the importance scores, an importance_[rank].pkl file will be saved for each GPU. Use the provided merge script to combine these files.
  • To use the pre-computed importance scores, pass the file name to --moebert_load_importance.
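The multi-GPU merge step can be pictured with a short sketch. It assumes, purely for illustration, that each importance_[rank].pkl file holds a list of layers, each layer being a list of float scores, and that merging means summing the scores element-wise across GPUs. The actual file layout and the repository's own merge script may differ, and `merge_importance` is a hypothetical helper, not part of the codebase.

```python
import pickle


def merge_importance(paths):
    """Sum per-layer importance scores across per-GPU pickle files.

    Assumption (not confirmed by the repository): every file stores a
    list of layers, and every layer is a list of floats of equal length
    across files.
    """
    merged = None
    for path in sorted(paths):
        with open(path, "rb") as f:
            scores = pickle.load(f)
        if merged is None:
            # Copy the first file so the loaded object is not mutated.
            merged = [list(layer) for layer in scores]
            continue
        if len(scores) != len(merged):
            raise ValueError(f"{path}: unexpected number of layers")
        for layer_sum, layer in zip(merged, scores):
            if len(layer) != len(layer_sum):
                raise ValueError(f"{path}: unexpected layer size")
            for i, value in enumerate(layer):
                layer_sum[i] += value
    if merged is None:
        raise ValueError("no importance files given")
    return merged
```

Under these assumptions, one would collect the per-GPU files (for example with `glob.glob("importance_*.pkl")`), call `merge_importance` on them, pickle the result to a single file, and pass that file name to --moebert_load_importance.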

Knowledge Distillation

  • For GLUE tasks, see examples/text-classification/
  • For question answering tasks, see examples/question-answering/
  • Run the provided bash script as an example.
  • The codebase supports different routing strategies: gate-token, gate-sentence, hash-random and hash-balance. Choices should be passed to --moebert_route_method.
    • To use hash-balance, a balanced hash list needs to be pre-computed using the provided script. The path to the saved hash list should be passed to --moebert_route_hash_list.
    • Add a load balancing loss by setting --moebert_load_balance when using trainable gating mechanisms.
    • The sentence-based gating mechanism (gate-sentence) is advantageous for inference because it induces significantly less communication overhead compared with token-level routing methods.
Issues

  • Should the model for the target task be fine-tuned from BERT or MoEBERT?

    In the README, you mentioned that:

    Before running any distillation code, a pre-trained BERT model should be fine-tuned on the target task. Path to the fine-tuned model should be passed to --model_name_or_path.

    Can I fine-tune the bert-base-uncased model and then run the distillation code with the MoE options? Is a pretrained MoEBERT model necessary? Thanks very much!

    opened by LisaWang0306 3
  • Parameters are not shared in experts

    Hi, from the paper I understood that the most important parameters are shared across the different experts. However, in the code I didn't see how the parameters are kept identical during training. I see expert_list[i] = fc1_weight_data[idx, :].clone(), but the tensor created by clone is a separate copy, not the same variable as the original. I also ran experiments to check my assumption: after several steps, the parameters in the experts are no longer the same. Can you shed more light on this? Thanks.

    opened by shukuangxi 0
  • What is the bash script for fine-tuning without MoE?

    Hi @SimiaoZuo, you mentioned that we need to fine-tune first. But how do we get the fine-tuned model and convert it? Many thanks!

    opened by CaffreyR 0
  • Error when running `bash`

    Hi @SimiaoZuo, I encountered problems when running bash.

    The error information is below! Thanks very much!

  • "Need to turn the model to a MoE first" error

    I just removed the "--do_train" and "--do_eval" lines and added a "--do_predict" line. But when I run it, a "Need to turn the model to a MoE first" error occurs. I wonder why this happens. Thanks a lot.

    opened by Harry-zzh 5
Simiao Zuo
PhD Student @ Georgia Tech
