Pretrained language models and their related optimization techniques developed by Huawei Noah's Ark Lab.


This repository provides the latest pretrained language models and their related optimization techniques developed by Huawei Noah's Ark Lab.

Directory structure

  • PanGu-α is a large-scale autoregressive pretrained Chinese language model with up to 200B parameters. The models are developed with MindSpore and trained on a cluster of Ascend 910 AI processors.
  • NEZHA-TensorFlow is a pretrained Chinese language model, developed with TensorFlow, that achieves state-of-the-art performance on several Chinese NLP tasks.
  • NEZHA-PyTorch is the PyTorch version of NEZHA.
  • NEZHA-Gen-TensorFlow provides two GPT models. One is Yuefu (乐府), a Chinese classical poetry generation model; the other is a general-purpose Chinese GPT model.
  • TinyBERT is a compressed BERT model that is 7.5x smaller and 9.4x faster at inference.
  • TinyBERT-MindSpore is a MindSpore version of TinyBERT.
  • DynaBERT is a dynamic BERT model with adaptive width and depth.
  • BBPE provides a byte-level vocabulary building tool and its corresponding tokenizer.
  • PMLM is a probabilistically masked language model. Trained without the complex two-stream self-attention, PMLM can be treated as a simple approximation of XLNet.
  • TernaryBERT is a weight ternarization method for BERT models, developed with PyTorch.
  • TernaryBERT-MindSpore is the MindSpore version of TernaryBERT.
  • HyperText is an efficient text classification model based on hyperbolic geometry theories.
  • BinaryBERT is a weight binarization method for BERT models that uses ternary weight splitting, developed with PyTorch.
  • AutoTinyBERT provides a model zoo that can meet different latency requirements.
  Question about TinyBERT Data Augmentation ${GLOVE_EMB}$

    Hi, all

    In the Data Augmentation section I saw “--glove_embs ${GLOVE_EMB}$”, and I am wondering what I should use to replace "${GLOVE_EMB}$".

    From the code, I gather that it refers to the GloVe embedding file. Should we replace "${GLOVE_EMB}$" with the location of that file?

    Where can we get the GloVe embedding file? Could you provide a link?

    opened by MichaelCaohn 6
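
For the `${GLOVE_EMB}$` question above: the placeholder presumably stands for the path to a GloVe embedding file in the standard text format, where each line holds a token followed by its space-separated vector components. GloVe vectors are distributed by the Stanford NLP group. Below is a minimal sketch of reading that format. The helper name `load_glove` and the tiny in-memory sample are assumptions for illustration, not part of the TinyBERT code.

```python
import io


def load_glove(lines):
    """Parse GloVe text-format lines into a {token: vector} dict.

    `load_glove` is a hypothetical helper for illustration, not a function
    from the TinyBERT code base.
    """
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        token, values = parts[0], parts[1:]
        embeddings[token] = [float(v) for v in values]
    return embeddings


# Tiny stand-in for a real downloaded GloVe .txt file.
sample = io.StringIO("the 0.1 0.2 0.3\ncat -0.5 0.25 1.0\n")
glove = load_glove(sample)
print(len(glove), glove["cat"])
```

With a real file, the same helper could be called on `open(path, encoding="utf-8")`, where `path` is the value that would replace `${GLOVE_EMB}$`.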
  Does TernaryBERT only use KD loss (teacher-student loss) during training?

    Hi, thanks for this great source code. It really helps me a lot!

    While studying TernaryBERT with the paper and this source code, I have a question about the KD training loss. Algorithm 1 in the paper says that the gradient is computed from the distillation loss only, not from the distillation loss plus the ground-truth (GT) label cross-entropy loss.

    In the source code as well, only the KD loss is used for the backward pass.

    Does TernaryBERT use only the KD loss, without the ground-truth labels as a training objective?

    In the paper's ablation study, the bottom row (-Trm-Logits) means the GT label loss is used. Would it then be correct to say that the top row (TernaryBERT) uses all three losses (Trm, Logits, and GT label)?

    I'm a little confused about which loss I should use to reproduce TernaryBERT's performance. It would be very helpful if you could answer my question.

    Thanks in advance!

    opened by MarsJacobs 4
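
To make the two candidate objectives in the question above concrete, here is a minimal pure-Python sketch contrasting a soft cross-entropy on teacher logits (a distillation loss) with the usual cross-entropy on a ground-truth label. The function names and toy logits are assumptions for illustration. This is not the repository's training code, and it does not settle which loss the paper's results rely on.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_cross_entropy(student_logits, teacher_logits):
    """Distillation loss: cross-entropy against the teacher's soft distribution."""
    teacher_probs = softmax(teacher_logits)
    student_log_probs = [math.log(p) for p in softmax(student_logits)]
    return -sum(t * s for t, s in zip(teacher_probs, student_log_probs))


def hard_cross_entropy(student_logits, label):
    """Ground-truth loss: negative log-probability of the labelled class."""
    return -math.log(softmax(student_logits)[label])


# Toy logits for a 3-class task (made-up values).
student = [2.0, 0.5, -1.0]
teacher = [3.0, 0.0, -2.0]
kd_loss = soft_cross_entropy(student, teacher)
gt_loss = hard_cross_entropy(student, label=0)
combined = kd_loss + gt_loss  # one possible way to use both signals
print(kd_loss, gt_loss, combined)
```

When the teacher distribution collapses onto a single class, the soft cross-entropy approaches the hard cross-entropy for that class, which is why the two objectives are often discussed together.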
  How to set the config file of student_model?

    When I try to train the general Chinese TinyBERT model, I run into the problem that the project does not provide an example config file. Could anyone share a reference config file for student_model? Thanks for your kindness!

    opened by jinsongpan 4
  Data Augmentation

    In the Data Augmentation phase, pretrained_bert_model is set to General_TinyBERT, but the description calls it the "pre-trained language model BERT".

    opened by gongel 4
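  • Questions about TinyBERT


    After reading the TinyBERT paper, I have the following questions:

    (1) Does the pre-training distillation stage mean distilling the student TinyBERT while the teacher BERT is being pre-trained, for example distilling once per epoch or on some other schedule? From the paper's diagram, I first assumed that distillation runs at the same time as pre-training. [image] The other possibility is that BERT is pre-trained first, the teacher BERT is then frozen, and the same pre-training corpus is fed to both the teacher BERT and the TinyBERT being distilled. Is distillation then done one objective function at a time?

    (2) The paper does not seem to report the resource consumption of the pre-training and fine-tuning stages. For example, how much time did the two stages take in total?

    Many thanks!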

    opened by MrRace 4
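  • Role of fit_size when the teacher and student hidden_size differ


    Suppose the hidden_size of the teacher and the student are d and d' respectively. When d is not equal to d', the student model's fit_dense layer maps d' to the same dimension as d, so that the hidden_state loss between student and teacher can be computed. When d and d' are equal, however, the hidden_state loss could be computed directly without the fit_dense mapping. The code uses an `if is_student` check; shouldn't it actually check whether d equals d'?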

    opened by littttttlebird 3
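
The alternative suggested in the question above can be sketched as follows: apply a linear projection to the student's hidden state only when its width differs from the teacher's, then compute a mean squared error. All names here (`linear`, `mse`, `hidden_state_loss`) and the toy weights are assumptions for illustration. This shows the asker's proposed dimension check, not the repository's `is_student` logic, and the source does not say whether skipping the projection at equal widths matches the authors' intent.

```python
def linear(x, weight, bias):
    """Apply y = W x + b to one vector x; weight has shape (out_dim, in_dim)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)


def hidden_state_loss(student_h, teacher_h, fit_weight=None, fit_bias=None):
    """Project the student state only when its width differs from the teacher's."""
    if len(student_h) != len(teacher_h):
        if fit_weight is None or fit_bias is None:
            raise ValueError("a projection is required when hidden sizes differ")
        student_h = linear(student_h, fit_weight, fit_bias)
    return mse(student_h, teacher_h)


# d' = 2 (student), d = 3 (teacher): the projection maps 2 -> 3 dimensions.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [0.0, 0.0, 0.0]
loss_projected = hidden_state_loss([1.0, 2.0], [1.0, 2.0, 3.0], W, b)

# d' == d: the states are compared directly, with no projection.
loss_direct = hidden_state_loss([1.0, 2.0, 4.0], [1.0, 2.0, 3.0])
print(loss_projected, loss_direct)
```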
  tinyBert general model with `cased`

    Have you done general distillation using the bert-base-cased model? And would you have a cased General_TinyBERT_v2 (4layer-312dim) model available?

    When trying python3 --teacher_model $FT_BERT_BASE_DIR --student_model $GENERAL_TINYBERT_DIR ... on a fine-tuned model that is 'bert-cased', a CUDA error is thrown.

    opened by sv-v5 3
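  • Vocabulary sizes do not match


    In the task-specific distillation stage, the teacher is a fine-tuned BERT base and the student is general_tinybert; both derive from BERT base. The BERT base vocabulary size is 21128, so why does the downloaded general_tinybert have a vocabulary of 30522? How are the two aligned? During task-specific distillation the student's vocabulary is larger, so feeding the same input to the teacher causes an index out-of-range error.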

    student_logits, student_atts, student_reps = student_model(input_ids, segment_ids, input_mask, is_student=True)
    teacher_logits, teacher_atts, teacher_reps = teacher_model(input_ids, segment_ids, input_mask)
    opened by littttttlebird 3
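
A mismatch like the one described above can be caught before the forward pass by validating token ids against each model's vocabulary size. The sketch below is a hypothetical pre-check in plain Python: `check_token_ids` and the sample ids are assumptions for illustration, and only the two vocabulary sizes come from the issue.

```python
def check_token_ids(input_ids, vocab_size, model_name):
    """Raise a clear error if any token id would index past the embedding table."""
    bad = [i for i in input_ids if i < 0 or i >= vocab_size]
    if bad:
        raise ValueError(
            f"{model_name}: {len(bad)} token id(s) outside [0, {vocab_size}); "
            f"largest is {max(bad)}"
        )


TEACHER_VOCAB = 21128  # size reported in the issue for the fine-tuned teacher
STUDENT_VOCAB = 30522  # size reported in the issue for general_tinybert

input_ids = [101, 2769, 25000, 102]  # made-up ids; 25000 fits only the student
check_token_ids(input_ids, STUDENT_VOCAB, "student")  # passes silently
try:
    check_token_ids(input_ids, TEACHER_VOCAB, "teacher")
    teacher_ok = True
except ValueError as err:
    teacher_ok = False
    print(err)
```

Running a check of this kind on the CPU tends to give a more readable message than the error raised inside an embedding lookup on the GPU.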
  AutoTinyBERT models not accessible

    Thanks for the awesome work on AutoTinyBERT!

    We would like to use your final model checkpoints. However, the links provided in the AutoTinyBERT Model Zoo are not accessible. It would be of great help to our work if you could share the model checkpoints.

    Looking forward to your response.

    opened by AdityaKane2001 2
  #TinyBert Training Pipeline Problems

    Hi Huawei team:

    Sorry to disturb you. Could you answer the following questions?

    Why does the TinyBERT training pipeline not use DDP to initialize the student model, and instead only initializes the teacher model with it? And why is there no synchronization of the normalization layers?

    Also, when mixed precision is enabled, where can I find the "backward" function of "optimizer"?



    opened by mexiQQ 2
  Bump horovod from 0.22.1 to 0.24.0 in /JABER-PyTorch

    Bumps horovod from 0.22.1 to 0.24.0.

    opened by dependabot[bot] 1
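  • Which enwiki-latest-pages-articles dataset do the TinyBERT experiments actually use?


    Page 6 of the paper states: "For the general distillation, we set the maximum sequence length to 128 and use English Wikipedia (2,500M words)". I downloaded the latest dump from the specified link.

    Decompressing the archive produces an 86 GB XML file. This project's preprocessing code kept failing with out-of-disk-space errors and crashed after every ten or so hours. After reading the code, I changed the path of self.document_shelf_filepath on line 52 of pregenerate_training_data.py from the /cache/ directory to a directory on an external disk with 500 GB of space. That finally stopped the disk-space errors, but processing is very slow: it took 84 hours to get from line 367 to line 390.

    Then came the worst part. Three epochs still have to run, and it took another 2 days to finish just 5% of the first epoch. That works out to 40 days per epoch, so 120 days for all 3 epochs!

    Does data preprocessing alone really take this long? Even once it finishes, training on GPUs still has to follow; will that take even longer? Which dataset did the paper use? Does it have to run on the Huawei Cloud platform to be faster?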

    opened by ra225 0
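
The time estimate in the issue above follows from a simple extrapolation, sketched here with the figures quoted by the asker (2 days for 5% of one epoch, 3 epochs). The helper name `estimate_total_days` is an assumption for illustration, and the calculation assumes throughput stays constant.

```python
def estimate_total_days(days_elapsed, percent_done, epochs):
    """Linearly extrapolate total preprocessing time from observed progress."""
    days_per_epoch = days_elapsed * 100 / percent_done
    return days_per_epoch, days_per_epoch * epochs


# Figures quoted in the issue: 2 days for 5% of the first epoch, 3 epochs total.
per_epoch, total = estimate_total_days(days_elapsed=2, percent_done=5, epochs=3)
print(per_epoch, total)
```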
  Bump certifi from 2021.5.30 to 2022.12.7 in /JABER-PyTorch

    Bumps certifi from 2021.5.30 to 2022.12.7.


    opened by dependabot[bot] 1
  • Embedding vectors are NaN when using the nezha_base_www model


    # Import the NEZHA model
    from transformers import NezhaModel, NezhaConfig

    self.config = BertConfig.from_pretrained(config_path)
    self.bert_module = NezhaModel.from_pretrained(bert_dir, config=self.config)
    bert_outputs = self.bert_module(input_ids=x, attention_mask=mask, token_type_ids=segs, output_hidden_states=True)

    Several layers in bert_outputs are NaN, and I do not know why.

    BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], ..., [nan, nan, nan, ..., nan, nan, nan]], device='cuda:0'), hidden_states=(tensor([[[ 0.5742, -0.2564, 0.4186, ..., 0.8307, -1.6965, 0.6848], [-0.6152, 0.1826, -1.1161, ..., 0.6985, -3.4405, 1.4675], [-0.2423, 0.8284, 0.5155, ..., 1.0843, -1.4233, 0.5122], ..., [-0.2828, -0.2603, -0.6676, ..., 0.5609, -2.0621, 0.5314],

         [ 0.5203,  0.3228, -0.4273,  ..., -0.2345, -0.1468, -0.2845],
         [ 0.5203,  0.3228, -0.4273,  ..., -0.2345, -0.1468, -0.2845],
         [ 0.5203,  0.3228, -0.4273,  ..., -0.2345, -0.1468, -0.2845]]],
       device='cuda:0'), tensor([[[nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan]]], device='cuda:0'),), past_key_values=None, attentions=None, cross_attentions=None)
    opened by yixiu00001 0
