ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information

Overview

This repository contains the code, models, and data for our ACL 2021 paper:

ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information
Zijun Sun, Xiaoya Li, Xiaofei Sun, Yuxian Meng, Xiang Ao, Qing He, Fei Wu and Jiwei Li

Guide

Section Description
Introduction Introduction to ChineseBERT
Download Download links for ChineseBERT
Quick tour Learn how to quickly load models
Experiment Experiment results on different Chinese NLP datasets
Citation Citation
Contact How to contact us

Introduction

We propose ChineseBERT, which incorporates both the glyph and pinyin information of Chinese characters into language model pretraining.

First, for each Chinese character, we obtain three kinds of embeddings:

  • Char Embedding: the same as the original BERT token embedding.
  • Glyph Embedding: captures visual features based on different fonts of a Chinese character.
  • Pinyin Embedding: captures phonetic features from the pinyin sequence of a Chinese character.

Then the char, glyph, and pinyin embeddings are concatenated and mapped to a D-dimensional fusion embedding through a fully connected layer.
Finally, the fusion embedding is added to the position embedding and fed as input to the BERT model.
The following figure shows the overall architecture of the ChineseBERT model.

[Figure: overview of the ChineseBERT model architecture]

ChineseBERT leverages the glyph and pinyin information of Chinese characters to enhance the model's ability to capture contextual semantics from surface character forms and to disambiguate polyphonic characters in Chinese.
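
For intuition, here is a minimal PyTorch sketch of the fusion step described above. The class, the dimensions, and the projection layers are illustrative assumptions rather than the repository's actual implementation (see models/modeling_glycebert.py for the real one).

import torch
import torch.nn as nn

class FusionEmbeddingSketch(nn.Module):
    """Illustrative sketch: concatenate char/glyph/pinyin embeddings,
    project to hidden size D, then add position embeddings."""

    def __init__(self, vocab_size=23236, hidden_size=768,
                 glyph_dim=1728, pinyin_dim=128, max_position=512):
        super().__init__()
        self.char_embeddings = nn.Embedding(vocab_size, hidden_size)
        self.glyph_proj = nn.Linear(glyph_dim, hidden_size)    # glyph features from font bitmaps
        self.pinyin_proj = nn.Linear(pinyin_dim, hidden_size)  # pinyin features from a small CNN (not shown)
        self.fusion = nn.Linear(3 * hidden_size, hidden_size)  # concat -> D-dimensional fusion embedding
        self.position_embeddings = nn.Embedding(max_position, hidden_size)

    def forward(self, input_ids, glyph_feats, pinyin_feats):
        # input_ids: (batch, seq); glyph_feats: (batch, seq, glyph_dim); pinyin_feats: (batch, seq, pinyin_dim)
        char = self.char_embeddings(input_ids)
        glyph = self.glyph_proj(glyph_feats)
        pinyin = self.pinyin_proj(pinyin_feats)
        fused = self.fusion(torch.cat([char, glyph, pinyin], dim=-1))
        positions = torch.arange(input_ids.size(1), device=input_ids.device)
        return fused + self.position_embeddings(positions)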

Download

We provide pre-trained ChineseBERT models in PyTorch, following the Hugging Face model format.

  • ChineseBERT-base: 12-layer, 768-hidden, 12-heads, 147M parameters
  • ChineseBERT-large: 24-layer, 1024-hidden, 16-heads, 374M parameters

Our model can be downloaded here:

Model Model Hub Size
ChineseBERT-base Pytorch 564M
ChineseBERT-large Pytorch 1.4G

Note: the model hub contains the model weights, font files, and pinyin config files.

Quick tour

Our models are trained with Hugging Face Transformers, so they can be easily loaded.
Download the ChineseBERT model and save it at [CHINESEBERT_PATH].
Here is a quick tour of loading our model.

>>> from models.modeling_glycebert import GlyceBertForMaskedLM

>>> chinese_bert = GlyceBertForMaskedLM.from_pretrained([CHINESEBERT_PATH])
>>> print(chinese_bert)

The complete example can be found here: Masked word completion with ChineseBERT
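
As a rough sketch of what that example does, the snippet below masks one character and reads off the model's top prediction. It reuses the BertDataset tokenizer introduced in the next example and assumes the standard BERT [MASK] token id (103); please refer to the linked example for the exact, supported way to build masked inputs.

>>> from datasets.bert_dataset import BertDataset
>>> from models.modeling_glycebert import GlyceBertForMaskedLM

>>> tokenizer = BertDataset([CHINESEBERT_PATH])
>>> chinese_bert = GlyceBertForMaskedLM.from_pretrained([CHINESEBERT_PATH])

>>> sentence = '我喜欢猫'
>>> input_ids, pinyin_ids = tokenizer.tokenize_sentence(sentence)
>>> input_ids = input_ids.view(1, -1)
>>> pinyin_ids = pinyin_ids.view(1, input_ids.shape[1], 8)

>>> mask_pos = 2                    # position of the character to mask (after [CLS])
>>> input_ids[0, mask_pos] = 103    # assumption: 103 is the [MASK] id in the BERT vocab
>>> logits = chinese_bert(input_ids, pinyin_ids)[0]    # (1, seq_len, vocab_size)
>>> predicted_id = logits[0, mask_pos].argmax(-1).item()
>>> print(predicted_id)             # map back to a character with the vocab file if needed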

Another example, obtaining the representation of a sentence:

>>> from datasets.bert_dataset import BertDataset
>>> from models.modeling_glycebert import GlyceBertModel

>>> tokenizer = BertDataset([CHINESEBERT_PATH])
>>> chinese_bert = GlyceBertModel.from_pretrained([CHINESEBERT_PATH])
>>> sentence = '我喜欢猫'

>>> input_ids, pinyin_ids = tokenizer.tokenize_sentence(sentence)
>>> length = input_ids.shape[0]
>>> input_ids = input_ids.view(1, length)
>>> pinyin_ids = pinyin_ids.view(1, length, 8)
>>> output_hidden = chinese_bert.forward(input_ids, pinyin_ids)[0]
>>> print(output_hidden)
tensor([[[ 0.0287, -0.0126,  0.0389,  ...,  0.0228, -0.0677, -0.1519],
         [ 0.0144, -0.2494, -0.1853,  ...,  0.0673,  0.0424, -0.1074],
         [ 0.0839, -0.2989, -0.2421,  ...,  0.0454, -0.1474, -0.1736],
         [-0.0499, -0.2983, -0.1604,  ..., -0.0550, -0.1863,  0.0226],
         [ 0.1428, -0.0682, -0.1310,  ..., -0.1126,  0.0440, -0.1782],
         [ 0.0287, -0.0126,  0.0389,  ...,  0.0228, -0.0677, -0.1519]]],
       grad_fn=)

The complete code can be found HERE
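
If you need a single fixed-size sentence vector rather than per-token states, a common recipe (our own illustration, not something prescribed by this repository) is to pool the hidden states from the example above:

>>> sentence_vector = output_hidden.mean(dim=1)    # mean-pool over tokens: (1, hidden_size)
>>> cls_vector = output_hidden[:, 0, :]            # or take the [CLS] state: (1, hidden_size)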

Experiments

ChnSentiCorp

ChnSentiCorp is a sentiment analysis dataset.
Evaluation Metrics: Accuracy

Model Dev Test
ERNIE 95.4 95.5
BERT 95.1 95.4
BERT-wwm 95.4 95.3
RoBERTa 95.0 95.6
MacBERT 95.2 95.6
ChineseBERT 95.6 95.7
---- ----
RoBERTa-large 95.8 95.8
MacBERT-large 95.7 95.9
ChineseBERT-large 95.8 95.9

Training details and code can be found HERE

THUCNews

THUCNews contains news in 10 categories.
Evaluation Metrics: Accuracy

Model Dev Test
ERNIE 95.4 95.5
BERT 95.1 95.4
BERT-wwm 95.4 95.3
RoBERTa 95.0 95.6
MacBERT 95.2 95.6
ChineseBERT 95.6 95.7
---- ----
RoBERTa-large 95.8 95.8
MacBERT-large 95.7 95.9
ChineseBERT-large 95.8 95.9

Training details and code can be found HERE

XNLI

XNLI is a dataset for natural language inference.
Evaluation Metrics: Accuracy

Model Dev Test
ERNIE 79.7 78.6
BERT 79.0 78.2
BERT-wwm 79.4 78.7
RoBERTa 80.0 78.8
MacBERT 80.3 79.3
ChineseBERT 80.5 79.6
---- ----
RoBERTa-large 82.1 81.2
MacBERT-large 82.4 81.3
ChineseBERT-large 82.7 81.6

Training details and code can be found HERE

BQ

BQ Corpus is a sentence pair matching dataset.
Evaluation Metrics: Accuracy

Model Dev Test
ERNIE 86.3 85.0
BERT 86.1 85.2
BERT-wwm 86.4 85.3
RoBERTa 86.0 85.0
MacBERT 86.0 85.2
ChineseBERT 86.4 85.2
---- ----
RoBERTa-large 86.3 85.8
MacBERT-large 86.2 85.6
ChineseBERT-large 86.5 86.0

Training details and code can be found HERE

LCQMC

LCQMC Corpus is a sentence pair matching dataset.
Evaluation Metrics: Accuracy

Model Dev Test
ERNIE 89.8 87.2
BERT 89.4 87.0
BERT-wwm 89.6 87.1
RoBERTa 89.0 86.4
MacBERT 89.5 87.0
ChineseBERT 89.8 87.4
---- ----
RoBERTa-large 90.4 87.0
MacBERT-large 90.6 87.6
ChineseBERT-large 90.5 87.8

Training details and code can be found HERE

TNEWS

TNEWS is a 15-class short news text classification dataset.
Evaluation Metrics: Accuracy

Model Dev Test
ERNIE 58.24 58.33
BERT 56.09 56.58
BERT-wwm 56.77 56.86
RoBERTa 57.51 56.94
ChineseBERT 58.64 58.95
---- ----
RoBERTa-large 58.32 58.61
ChineseBERT-large 59.06 59.47

Training details and code can be found HERE

CMRC

CMRC is a machine reading comprehension dataset.
Evaluation Metrics: EM

Model Dev Test
ERNIE 66.89 74.70
BERT 66.77 71.60
BERT-wwm 66.96 73.95
RoBERTa 67.89 75.20
MacBERT - -
ChineseBERT 67.95 95.7
---- ----
RoBERTa-large 70.59 77.95
ChineseBERT-large 70.70 78.05

Training details and code can be found HERE

OntoNotes

OntoNotes 4.0 is a Chinese named entity recognition dataset and contains 18 named entity types.

Evaluation Metrics: Span-Level F1

Model Test Precision Test Recall Test F1
BERT 79.69 82.09 80.87
RoBERTa 80.43 80.30 80.37
ChineseBERT 80.03 83.33 81.65
---- ---- ----
RoBERTa-large 80.72 82.07 81.39
ChineseBERT-large 80.77 83.65 82.18

Training details and code can be found HERE

Weibo

Weibo is a Chinese named entity recognition dataset and contains 4 named entity types.

Evaluation Metrics: Span-Level F1

Model Test Precision Test Recall Test F1
BERT 67.12 66.88 67.33
RoBERTa 68.49 67.81 68.15
ChineseBERT 68.27 69.78 69.02
---- ---- ----
RoBERTa-large 66.74 70.02 68.35
ChineseBERT-large 68.75 72.97 70.80

Training details and code can be found HERE

Contact

If you have any questions about our paper, code, model, or data, please feel free to reach out via GitHub issues or email.
You can send email to [email protected] or [email protected]

Comments
  • Error when loading font images through the API

    Hello, and thank you for your work! When I call the interface following the quick tour example, I get an error: ValueError: cannot reshape array of size 3555312 into shape (23236,24,24). The problem comes from np.load(np_file).astype(np.float32) for np_file in font_npy_files. Could you help me figure out the cause? (Is something wrong with the downloaded font npy files? I re-downloaded them several times without success, and all dependency versions follow requirement.txt exactly.) Thanks again for your work and help!

    opened by Sarahtorekryo 17
  • Usability question

    Hello, and congratulations on this work being accepted to ACL 2021; pretraining that incorporates glyphs and pinyin will surely bring gains on Chinese NLP tasks. I would also like to use ChineseBERT on tasks beyond those mentioned in the paper. Is there a BERT-like API that can be called, e.g. tokenizer = Tokenizer.from_pretrain([ChineseBert]), config = Config.from_pretrain([ChineseBert]), model = Bert.from_pretrain([ChineseBert])? Or is there an instruction describing how to call the model?

    opened by Aopolin-Lv 7
  • How to get the XNLI train set

    I got the dev and test sets from https://github.com/facebookresearch/XNLI, but then found this: [image]

    P.S. I found the path xnli_xxx.tsv in your code (https://github.com/ShannonAI/ChineseBert/blob/f6b4cd901e8f8b3ef2340ce2a8685b41df9bc261/tasks/XNLI/XNLI_trainer.py#L134). Why not xnli.xxx.tsv?

    opened by jiaqianjing 1
  • What should I do about this recurring error? I don't know how to fix it

    (yl) D:\ChineseBert-main\tasks\THUCNew>python THUCNews_trainer.py --bert_path ./111/ --data_dir ./cnews/ --save_path ./222/ --max_epoch=5 --lr=2e-5 --batch_size=8 --gpus=0
    Some weights of the model checkpoint at ./111/ were not used when initializing GlyceBertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.decoder.bias']

    • This IS expected if you are initializing GlyceBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
    • This IS NOT expected if you are initializing GlyceBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

    Some weights of GlyceBertForSequenceClassification were not initialized from the model checkpoint at ./111/ and are newly initialized: ['classifier.weight', 'classifier.bias']
    You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

    Traceback (most recent call last):
      File "THUCNews_trainer.py", line 229, in main()
      File "THUCNews_trainer.py", line 193, in main
        model = ChnSentiClassificationTask(args)
      File "THUCNews_trainer.py", line 51, in init
        self.model = GlyceBertForSequenceClassification.from_pretrained(self.bert_dir)
      File "d:\ProgramData\Anaconda3\envs\yl\lib\site-packages\transformers\modeling_utils.py", line 1071, in from_pretrained
        model.class.name, "\n\t".join(error_msgs)
    RuntimeError: Error(s) in loading state_dict for GlyceBertForSequenceClassification:
      size mismatch for bert.embeddings.glyph_embeddings.embedding.weight: copying a param with shape torch.Size([23236, 1728]) from checkpoint, the shape in current model is torch.Size([23236, 1152]).
    opened by tiexueYL 1
  • Cannot reproduce fine tuning on ChnSentiCorp

    Validation sanity check: 0it [00:00, ?it/s]thread '<unnamed>' panicked at 'no entry found for key', D:\a\tokenizers\tokenizers\tokenizers\src\models\mod.rs:36:66
    note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
    Traceback (most recent call last):
      File "ChnSetiCorp_trainer.py", line 227, in <module>
        main()
      File "ChnSetiCorp_trainer.py", line 216, in main
        trainer.fit(model)
      File "C:\Users\jingtanwang\Anaconda3\envs\chinesebert\lib\site-packages\pytorch_lightning\trainer\states.py", line 48, in wrapped_fn
        result = fn(self, *args, **kwargs)
      File "C:\Users\jingtanwang\Anaconda3\envs\chinesebert\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1084, in fit
        results = self.accelerator_backend.train(model)
      File "C:\Users\jingtanwang\Anaconda3\envs\chinesebert\lib\site-packages\pytorch_lightning\accelerators\cpu_backend.py", line 39, in train
        results = self.trainer.run_pretrain_routine(model)
      File "C:\Users\jingtanwang\Anaconda3\envs\chinesebert\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1224, in run_pretrain_routine
        self._run_sanity_check(ref_model, model)
      File "C:\Users\jingtanwang\Anaconda3\envs\chinesebert\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1257, in _run_sanity_check
        eval_results = self._evaluate(model, self.val_dataloaders, max_batches, False)
      File "C:\Users\jingtanwang\Anaconda3\envs\chinesebert\lib\site-packages\pytorch_lightning\trainer\evaluation_loop.py", line 305, in _evaluate
        for batch_idx, batch in enumerate(dataloader):
      File "C:\Users\jingtanwang\Anaconda3\envs\chinesebert\lib\site-packages\torch\utils\data\dataloader.py", line 279, in __iter__
        return _MultiProcessingDataLoaderIter(self)
      File "C:\Users\jingtanwang\Anaconda3\envs\chinesebert\lib\site-packages\torch\utils\data\dataloader.py", line 719, in __init__
        w.start()
      File "C:\Users\jingtanwang\Anaconda3\envs\chinesebert\lib\multiprocessing\process.py", line 105, in start
        self._popen = self._Popen(self)
      File "C:\Users\jingtanwang\Anaconda3\envs\chinesebert\lib\multiprocessing\context.py", line 223, in _Popen
        return _default_context.get_context().Process._Popen(process_obj)
      File "C:\Users\jingtanwang\Anaconda3\envs\chinesebert\lib\multiprocessing\context.py", line 322, in _Popen
        return Popen(process_obj)
      File "C:\Users\jingtanwang\Anaconda3\envs\chinesebert\lib\multiprocessing\popen_spawn_win32.py", line 65, in __init__
        reduction.dump(process_obj, to_child)
      File "C:\Users\jingtanwang\Anaconda3\envs\chinesebert\lib\multiprocessing\reduction.py", line 60, in dump
        ForkingPickler(file, protocol).dump(obj)
    pyo3_runtime.PanicException: no entry found for key
    
    

    Based on https://github.com/huggingface/tokenizers/issues/260, it seems vocab.json is missing one entry, but when I try tokenizer.encode(sentence) on all lines in ChnSentiCorp, it works.

    opened by JTWang2000 1
  • OSError: Failed to interpret file

    Hi, I am getting this error: OSError: Failed to interpret file 'chineseBert20210929/datasets/ChineseBERT-base/config/._STXINGKA.TTF24.npy' as a pickle. Do you know how to solve it? Thanks.

    opened by zhunipingan 0
  • TL;DR~

    Please check the requirement.txt to make sure you install the right version

    Originally posted by @zijunsun in https://github.com/ShannonAI/ChineseBert/issues/18#issuecomment-895709627

    opened by thomaswkk 0
  • OntoNotes only has ['ORG', 'GPE', 'PER', 'LOC'], not 18 entity types

    OntoNotes 4.0 is a Chinese named entity recognition dataset and contains 18 named entity types. OntoNotes 4.0 contains 15K/4K/4K instances for training/dev/test.

    ['ORG', 'GPE', 'PER', 'LOC']??

    opened by majinmin 0
  • How to continue pretraining on my own data

    Hello! Do you have the code for pretraining the model? I tried to continue pretraining with run_mlm.py (https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling/run_mlm.py), but the tokenizer used in that script differs from the tokenizer in your model (BertMaskDataset), and after swapping it in I ran into many problems. I hope you can help. Thanks!

    opened by cxyccc 2