This repository contains the resources for our paper "Revisiting Pre-trained Models for Chinese Natural Language Processing", published in the Findings of EMNLP. You can read the camera-ready paper through the ACL Anthology or the arXiv pre-print.
Revisiting Pre-trained Models for Chinese Natural Language Processing
Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Shijin Wang, Guoping Hu
For resources other than MacBERT, please visit the following repositories:
- Chinese BERT-wwm series: https://github.com/ymcui/Chinese-BERT-wwm
- Chinese ELECTRA: https://github.com/ymcui/Chinese-ELECTRA
- Chinese XLNet: https://github.com/ymcui/Chinese-XLNet
More resources by HFL: https://github.com/ymcui/HFL-Anthology
News
2021/10/24 We propose the first pre-trained language model specifically focused on Chinese minority languages. Check: https://github.com/ymcui/Chinese-Minority-PLM
2021/7/21 The book "Natural Language Processing: A Pre-trained Model Approach" (自然语言处理：基于预训练模型的方法), written by several researchers from HIT SCIR, has been published. You are welcome to purchase it or join our book giveaway.
[Nov 3, 2020] Pre-trained MacBERT models are available through direct Download or Quick Load. Use them as you would the original BERT (except that they cannot perform the original MLM task).
[Sep 15, 2020] Our paper "Revisiting Pre-Trained Models for Chinese Natural Language Processing" has been accepted to the Findings of EMNLP as a long paper.
Guide
Section | Description |
---|---|
Introduction | Introduction to MacBERT |
Download | Download links for MacBERT |
Quick Load | Learn how to quickly load our models through 🤗 Transformers |
Results | Results on several Chinese NLP datasets |
FAQ | Frequently Asked Questions |
Citation | Citation |
Introduction
MacBERT is an improved BERT that adopts a novel MLM-as-correction pre-training task, which mitigates the discrepancy between pre-training and fine-tuning.
Instead of masking with the [MASK] token, which never appears in the fine-tuning stage, we propose to use similar words for masking. A similar word is obtained with the Synonyms toolkit (Wang and Hu, 2017), which is based on word2vec (Mikolov et al., 2013) similarity calculations. If an N-gram is selected for masking, we find a similar word for each word in the N-gram individually. In the rare case that no similar word exists, we fall back to random word replacement (see the sketch after the example table below).
Here is an example of our pre-training task.
 | Example |
---|---|
Original Sentence | we use a language model to predict the probability of the next word. |
MLM | we use a language [M] to [M] ##di ##ct the pro [M] ##bility of the next word . |
Whole word masking | we use a language [M] to [M] [M] [M] the [M] [M] [M] of the next word . |
N-gram masking | we use a [M] [M] to [M] [M] [M] the [M] [M] [M] [M] [M] next word . |
MLM as correction | we use a text system to ca ##lc ##ulate the po ##si ##bility of the next word . |
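To make the procedure concrete, below is a simplified Python sketch of the masking strategy described above. The synonym table, vocabulary, and helper names are illustrative assumptions, and the ratios (15% of input words masked, 40%/30%/20%/10% for 1-gram to 4-gram masking) follow the paper; this is not the actual pre-training code, which is not released.

import random

# Hypothetical synonym lookup; the paper uses the Synonyms toolkit (Wang and Hu, 2017),
# whose neighbors are based on word2vec similarity.
def similar_word(word, synonym_table, vocab):
    candidates = synonym_table.get(word, [])
    # Degrade to a random vocabulary word when no similar word exists.
    return random.choice(candidates) if candidates else random.choice(vocab)

def mac_mask(words, synonym_table, vocab, mask_ratio=0.15):
    # MLM as correction: replace selected N-grams with similar words instead of [MASK].
    corrupted = list(words)
    target = max(1, int(len(words) * mask_ratio))
    replaced = 0
    while replaced < target:
        # N-gram masking: 1-gram to 4-gram with decreasing probability.
        n = random.choices([1, 2, 3, 4], weights=[40, 30, 20, 10])[0]
        start = random.randrange(len(words))
        for i in range(start, min(start + n, len(words))):
            # Each word in the selected N-gram is replaced by its own similar word.
            corrupted[i] = similar_word(words[i], synonym_table, vocab)
            replaced += 1
    return corrupted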
Besides the new pre-training task, we also incorporate the following techniques.
- Whole Word Masking (WWM)
- N-gram masking
- Sentence-Order Prediction (SOP)
Note that MacBERT can be used as a drop-in replacement for the original BERT, as there is no difference in the main neural architecture.
For more technical details, please check our paper: Revisiting Pre-trained Models for Chinese Natural Language Processing
Download
We mainly provide pre-trained MacBERT models in TensorFlow 1.x.
- MacBERT-large, Chinese: 24-layer, 1024-hidden, 16-heads, 324M parameters
- MacBERT-base, Chinese: 12-layer, 768-hidden, 12-heads, 102M parameters
Model | Google Drive | iFLYTEK Cloud | Size |
---|---|---|---|
MacBERT-large, Chinese | TensorFlow | TensorFlow (pw: 3Yg3) | 1.2G |
MacBERT-base, Chinese | TensorFlow | TensorFlow (pw: E2cP) | 383M |
PyTorch/TensorFlow2 Version
If you need these models in PyTorch/TensorFlow2, you can either:
- Convert the TensorFlow checkpoint into PyTorch/TensorFlow2 using 🤗 Transformers (see the sketch below)
- Download them directly from https://huggingface.co/hfl
Steps: select one of the models on the page above → click "list all files in model" at the end of the model page → download the bin/json files from the pop-up window.
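For the conversion route, here is a minimal sketch using the 🤗 Transformers Python API (it requires torch and tensorflow to be installed). The local directory and checkpoint file names below are assumptions and may differ from the files in the downloaded archive.

from transformers import BertConfig, BertForPreTraining, BertTokenizer

ckpt_dir = "chinese_macbert_base"  # assumed folder of the unzipped TensorFlow checkpoint

# Build the config from the BERT-style JSON config file shipped with the checkpoint.
config = BertConfig.from_json_file(f"{ckpt_dir}/bert_config.json")

# Load the TensorFlow 1.x checkpoint (point to the .index file) into a PyTorch model.
model = BertForPreTraining.from_pretrained(
    f"{ckpt_dir}/model.ckpt.index", from_tf=True, config=config
)

# Save in the standard Transformers format (pytorch_model.bin + config.json), plus the vocab.
model.save_pretrained("chinese_macbert_base_pt")
BertTokenizer.from_pretrained(ckpt_dir).save_pretrained("chinese_macbert_base_pt")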
Quick Load
With Hugging Face Transformers, the models above can be easily loaded with the following code.
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("MODEL_NAME")
model = BertModel.from_pretrained("MODEL_NAME")
**Notice: Please use BertTokenizer and BertModel for loading MacBERT models.**
The actual models and their MODEL_NAME values are listed below.
Original Model | MODEL_NAME |
---|---|
MacBERT-large | hfl/chinese-macbert-large |
MacBERT-base | hfl/chinese-macbert-base |
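For example, a quick sanity check with MacBERT-base (a recent version of transformers and torch is assumed; the input sentence is arbitrary):

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("hfl/chinese-macbert-base")
model = BertModel.from_pretrained("hfl/chinese-macbert-base")

# Encode an example sentence and run a forward pass.
inputs = tokenizer("使用语言模型来预测下一个词的概率。", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # torch.Size([1, sequence_length, 768])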
Results
We present the results of MacBERT on the following six tasks (please read our paper for other results).
- CMRC 2018 (Cui et al., 2019): Span-Extraction Machine Reading Comprehension (Simplified Chinese)
- DRCD (Shao et al., 2018): Span-Extraction Machine Reading Comprehension (Traditional Chinese)
- XNLI (Conneau et al., 2018): Natural Language Inference
- ChnSentiCorp: Sentiment Analysis
- LCQMC (Liu et al., 2018): Sentence Pair Matching
- BQ Corpus (Chen et al., 2018): Sentence Pair Matching
To ensure the stability of the results, we run each experiment 10 times and report both the maximum and the average score (the latter in brackets).
CMRC 2018
The CMRC 2018 dataset is released by the Joint Laboratory of HIT and iFLYTEK Research. The model should answer questions based on a given passage, in the same format as SQuAD. Evaluation metrics: EM / F1
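For reference, here is a minimal sketch of the two metrics computed at the character level, as is common for Chinese MRC; it is an illustration only, not the official CMRC 2018 evaluation script (which additionally normalizes punctuation and scores against multiple reference answers).

from collections import Counter

def exact_match(prediction: str, reference: str) -> float:
    # EM: 1 if the predicted span equals the reference answer exactly, else 0.
    return float(prediction == reference)

def char_f1(prediction: str, reference: str) -> float:
    # F1 over overlapping characters between prediction and reference.
    common = Counter(prediction) & Counter(reference)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(prediction)
    recall = num_same / len(reference)
    return 2 * precision * recall / (precision + recall)

print(exact_match("下一个词", "下一个词"))              # 1.0
print(round(char_f1("下一个词的概率", "下一个词"), 2))   # 0.73 (precision 4/7, recall 1.0)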
Model | Development | Test | Challenge | #Params |
---|---|---|---|---|
BERT-base | 65.5 (64.4) / 84.5 (84.0) | 70.0 (68.7) / 87.0 (86.3) | 18.6 (17.0) / 43.3 (41.3) | 102M |
BERT-wwm | 66.3 (65.0) / 85.6 (84.7) | 70.5 (69.1) / 87.4 (86.7) | 21.0 (19.3) / 47.0 (43.9) | 102M |
BERT-wwm-ext | 67.1 (65.6) / 85.7 (85.0) | 71.4 (70.0) / 87.7 (87.0) | 24.0 (20.0) / 47.3 (44.6) | 102M |
RoBERTa-wwm-ext | 67.4 (66.5) / 87.2 (86.5) | 72.6 (71.4) / 89.4 (88.8) | 26.2 (24.6) / 51.0 (49.1) | 102M |
ELECTRA-base | 68.4 (68.0) / 84.8 (84.6) | 73.1 (72.7) / 87.1 (86.9) | 22.6 (21.7) / 45.0 (43.8) | 102M |
MacBERT-base | 68.5 (67.3) / 87.9 (87.1) | 73.2 (72.4) / 89.5 (89.2) | 30.2 (26.4) / 54.0 (52.2) | 102M |
ELECTRA-large | 69.1 (68.2) / 85.2 (84.5) | 73.9 (72.8) / 87.1 (86.6) | 23.0 (21.6) / 44.2 (43.2) | 324M |
RoBERTa-wwm-ext-large | 68.5 (67.6) / 88.4 (87.9) | 74.2 (72.4) / 90.6 (90.0) | 31.5 (30.1) / 60.1 (57.5) | 324M |
MacBERT-large | 70.7 (68.6) / 88.9 (88.2) | 74.8 (73.2) / 90.7 (90.1) | 31.9 (29.6) / 60.2 (57.6) | 324M |
DRCD
DRCD is also a span-extraction machine reading comprehension dataset, released by Delta Research Center. The text is written in Traditional Chinese. Evaluation metrics: EM / F1
Model | Development | Test | #Params |
---|---|---|---|
BERT-base | 83.1 (82.7) / 89.9 (89.6) | 82.2 (81.6) / 89.2 (88.8) | 102M |
BERT-wwm | 84.3 (83.4) / 90.5 (90.2) | 82.8 (81.8) / 89.7 (89.0) | 102M |
BERT-wwm-ext | 85.0 (84.5) / 91.2 (90.9) | 83.6 (83.0) / 90.4 (89.9) | 102M |
RoBERTa-wwm-ext | 86.6 (85.9) / 92.5 (92.2) | 85.6 (85.2) / 92.0 (91.7) | 102M |
ELECTRA-base | 87.5 (87.0) / 92.5 (92.3) | 86.9 (86.6) / 91.8 (91.7) | 102M |
MacBERT-base | 89.4 (89.2) / 94.3 (94.1) | 89.5 (88.7) / 93.8 (93.5) | 102M |
ELECTRA-large | 88.8 (88.7) / 93.3 (93.2) | 88.8 (88.2) / 93.6 (93.2) | 324M |
RoBERTa-wwm-ext-large | 89.6 (89.1) / 94.8 (94.4) | 89.6 (88.9) / 94.5 (94.1) | 324M |
MacBERT-large | 91.2 (90.8) / 95.6 (95.3) | 91.7 (90.9) / 95.6 (95.3) | 324M |
XNLI
We use XNLI data for testing the NLI task. Evaluation metrics: Accuracy
Model | Development | Test | #Params |
---|---|---|---|
BERT-base | 77.8 (77.4) | 77.8 (77.5) | 102M |
BERT-wwm | 79.0 (78.4) | 78.2 (78.0) | 102M |
BERT-wwm-ext | 79.4 (78.6) | 78.7 (78.3) | 102M |
RoBERTa-wwm-ext | 80.0 (79.2) | 78.8 (78.3) | 102M |
ELECTRA-base | 77.9 (77.0) | 78.4 (77.8) | 102M |
MacBERT-base | 80.3 (79.7) | 79.3 (78.8) | 102M |
ELECTRA-large | 81.5 (80.8) | 81.0 (80.9) | 324M |
RoBERTa-wwm-ext-large | 82.1 (81.3) | 81.2 (80.6) | 324M |
MacBERT-large | 82.4 (81.8) | 81.3 (80.6) | 324M |
ChnSentiCorp
We use ChnSentiCorp data for testing sentiment analysis. Evaluation metrics: Accuracy
Model | Development | Test | #Params |
---|---|---|---|
BERT-base | 94.7 (94.3) | 95.0 (94.7) | 102M |
BERT-wwm | 95.1 (94.5) | 95.4 (95.0) | 102M |
BERT-wwm-ext | 95.4 (94.6) | 95.3 (94.7) | 102M |
RoBERTa-wwm-ext | 95.0 (94.6) | 95.6 (94.8) | 102M |
ELECTRA-base | 93.8 (93.0) | 94.5 (93.5) | 102M |
MacBERT-base | 95.2 (94.8) | 95.6 (94.9) | 102M |
ELECTRA-large | 95.2 (94.6) | 95.3 (94.8) | 324M |
RoBERTa-wwm-ext-large | 95.8 (94.9) | 95.8 (94.9) | 324M |
MacBERT-large | 95.7 (95.0) | 95.9 (95.1) | 324M |
LCQMC
LCQMC is a sentence pair matching dataset, which could be seen as a binary classification task. Evaluation metrics: Accuracy
Model | Development | Test | #Params |
---|---|---|---|
BERT | 89.4 (88.4) | 86.9 (86.4) | 102M |
BERT-wwm | 89.4 (89.2) | 87.0 (86.8) | 102M |
BERT-wwm-ext | 89.6 (89.2) | 87.1 (86.6) | 102M |
RoBERTa-wwm-ext | 89.0 (88.7) | 86.4 (86.1) | 102M |
ELECTRA-base | 90.2 (89.8) | 87.6 (87.3) | 102M |
MacBERT-base | 89.5 (89.3) | 87.0 (86.5) | 102M |
ELECTRA-large | 90.7 (90.4) | 87.3 (87.2) | 324M |
RoBERTa-wwm-ext-large | 90.4 (90.0) | 87.0 (86.8) | 324M |
MacBERT-large | 90.6 (90.3) | 87.6 (87.1) | 324M |
BQ Corpus
BQ Corpus is a sentence pair matching dataset, which could be seen as a binary classification task. Evaluation metrics: Accuracy
Model | Development | Test | #Params |
---|---|---|---|
BERT | 86.0 (85.5) | 84.8 (84.6) | 102M |
BERT-wwm | 86.1 (85.6) | 85.2 (84.9) | 102M |
BERT-wwm-ext | 86.4 (85.5) | 85.3 (84.8) | 102M |
RoBERTa-wwm-ext | 86.0 (85.4) | 85.0 (84.6) | 102M |
ELECTRA-base | 84.8 (84.7) | 84.5 (84.0) | 102M |
MacBERT-base | 86.0 (85.5) | 85.2 (84.9) | 102M |
ELECTRA-large | 86.7 (86.2) | 85.1 (84.8) | 324M |
RoBERTa-wwm-ext-large | 86.3 (85.7) | 85.8 (84.9) | 324M |
MacBERT-large | 86.2 (85.7) | 85.6 (85.0) | 324M |
FAQ
Question 1: Do you have an English version of MacBERT?
A1: Sorry, we do not have an English version of pre-trained MacBERT.
Question 2: How do I use MacBERT?
A2: Use it as you would the original BERT in the fine-tuning stage (just replace the checkpoint and config files). You can also perform further pre-training on our checkpoint with the MLM/NSP/SOP objectives.
Question 3: Could you provide pre-training code for MacBERT?
A3: Sorry, we cannot provide the source code at the moment. We may release it in the future, but there is no guarantee.
Question 4: How about releasing the pre-training data?
A4: We do not have the right to redistribute the data, as doing so could lead to legal violations.
Question 5: Will you release pre-trained MacBERT trained on larger data?
A5: Currently, we have no plans on this.
Citation
If you find our resources or paper useful, please consider including the following citation in your paper.
@inproceedings{cui-etal-2020-revisiting,
title = "Revisiting Pre-Trained Models for {C}hinese Natural Language Processing",
author = "Cui, Yiming and
Che, Wanxiang and
Liu, Ting and
Qin, Bing and
Wang, Shijin and
Hu, Guoping",
booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.findings-emnlp.58",
pages = "657--668",
}
Or:
@article{cui-etal-2021-pretrain,
title={Pre-Training with Whole Word Masking for Chinese BERT},
author={Cui, Yiming and Che, Wanxiang and Liu, Ting and Qin, Bing and Yang, Ziqing},
journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
year={2021},
url={https://ieeexplore.ieee.org/document/9599397},
doi={10.1109/TASLP.2021.3124365},
}
Acknowledgment
The first author would like to thank the Google TensorFlow Research Cloud (TFRC) Program.
Issues
Before you submit an issue:
- You are advised to read the FAQ before submitting an issue.
- Repetitive and irrelevant issues will be ignored and closed by the stale bot. Thank you for your understanding and support.
- We cannot accommodate EVERY request, so please bear in mind that there is no guarantee that your request will be met.
- Always be polite when you submit an issue.