Chinese NER (Named Entity Recognition) using BERT (Softmax, CRF, Span)

Weitang Liu

Last update: Jan 3, 2023

Related tags

Text Data & NLP nlp crf pytorch chinese span ner albert bert softmax focal-loss adversarial-training labelsmoothing

Overview

Chinese NER using Bert

BERT for Chinese NER.

dataset list

cner: datasets/cner
CLUENER: https://github.com/CLUEbenchmark/CLUENER

model list

BERT+Softmax
BERT+CRF
BERT+Span

requirement

1.1.0 <= PyTorch < 1.5.0
cuda=9.0
python3.6+

input format

Input format (the BIOS tag scheme is preferred): one character and its label per line. Sentences are separated by a blank line.

美	B-LOC
国	I-LOC
的	O
华	B-PER
莱	I-PER
士	I-PER

我	O
跟	O
他	O
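The format above can be read with a few lines of Python. This is a minimal sketch, not the repository's own data loader: read_bios is a hypothetical helper, and it assumes that the character and its label are separated by whitespace and that the character itself is never a whitespace character.

```python
from typing import List, Tuple

Sentence = List[Tuple[str, str]]


def read_bios(text: str) -> List[Sentence]:
    """Parse 'char<whitespace>label' lines; a blank line ends a sentence."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in text.splitlines():
        if not line.strip():
            # Blank line: close the sentence collected so far, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        char, label = line.split()
        current.append((char, label))
    if current:
        sentences.append(current)
    return sentences


sample = "美\tB-LOC\n国\tI-LOC\n的\tO\n华\tB-PER\n莱\tI-PER\n士\tI-PER\n\n我\tO\n跟\tO\n他\tO\n"
sentences = read_bios(sample)
print(len(sentences), sentences[0][0])
```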

run the code

Modify the configuration information in run_ner_xxx.py or run_ner_xxx.sh, then run:
sh scripts/run_ner_xxx.sh

note: file structure of the model

├── prev_trained_model
|  └── bert_base
|  |  └── pytorch_model.bin
|  |  └── config.json
|  |  └── vocab.txt
|  |  └── ......
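Before launching a run it can help to confirm that the model directory matches this layout. The sketch below is an illustration only: missing_model_files is a hypothetical helper, and the three file names are simply the ones shown in the tree above.

```python
from pathlib import Path
from typing import List

# File names taken from the tree above; the directory itself is up to you.
REQUIRED_FILES = ("pytorch_model.bin", "config.json", "vocab.txt")


def missing_model_files(model_dir: str) -> List[str]:
    """Return the required file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_model_files("prev_trained_model/bert_base")
    print("missing:", missing or "nothing")
```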

CLUENER result

The overall performance of BERT on dev:

Model	Accuracy (entity)	Recall (entity)	F1 score (entity)
BERT+Softmax	0.7897	0.8031	0.7963
BERT+CRF	0.7977	0.8177	0.8076
BERT+Span	0.8132	0.8092	0.8112
BERT+Span+adv	0.8267	0.8073	0.8169
BERT-small(6 layers)+Span+kd	0.8241	0.7839	0.8051
BERT+Span+focal_loss	0.8121	0.8008	0.8064
BERT+Span+label_smoothing	0.8235	0.7946	0.8088
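The F1 column can be checked against the other two columns. In the rows tried below it matches the harmonic mean of the first two columns, which suggests (as an observation, not something the table states) that "Accuracy (entity)" plays the role of entity-level precision. A small sketch, with f1_score as a hypothetical helper:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (first column, recall, reported F1) copied from the table above.
rows = {
    "BERT+Softmax": (0.7897, 0.8031, 0.7963),
    "BERT+CRF": (0.7977, 0.8177, 0.8076),
    "BERT+Span": (0.8132, 0.8092, 0.8112),
}
for name, (p, r, reported) in rows.items():
    print(f"{name}: computed {f1_score(p, r):.4f}, reported {reported:.4f}")
```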

ALBERT for CLUENER

The overall performance of ALBERT on dev:

model	version	Accuracy(entity)	Recall(entity)	F1(entity)	Train time/epoch
albert	base_google	0.8014	0.6908	0.7420	0.75x
albert	large_google	0.8024	0.7520	0.7763	2.1x
albert	xlarge_google	0.8286	0.7773	0.8021	6.7x
bert	google	0.8118	0.8031	0.8074	-----
albert	base_bright	0.8068	0.7529	0.7789	0.75x
albert	large_bright	0.8152	0.7480	0.7802	2.2x
albert	xlarge_bright	0.8222	0.7692	0.7948	7.3x

Cner result

The overall performance of BERT on dev (test):

Model	Accuracy (entity)	Recall (entity)	F1 score (entity)
BERT+Softmax	0.9586(0.9566)	0.9644(0.9613)	0.9615(0.9590)
BERT+CRF	0.9562(0.9539)	0.9671(0.9644)	0.9616(0.9591)
BERT+Span	0.9604(0.9620)	0.9617(0.9632)	0.9611(0.9626)
BERT+Span+focal_loss	0.9516(0.9569)	0.9644(0.9681)	0.9580(0.9625)
BERT+Span+label_smoothing	0.9566(0.9568)	0.9624(0.9656)	0.9595(0.9612)

Comments

When running on Windows, the following error occurs: "OSError: [Errno 22]"

init_logger(log_file=args.output_dir + f'/{args.model_type}-{args.task_name}-{time_}.log')

The error comes from this line in run_ner_crf.py. How can it be solved? One simple workaround is: init_logger(log_file=args.output_dir + r'/{args.model_type}-{args.task_name}-{time_}.log')

But then the placeholders are no longer substituted, and the file name literally becomes "{args.model_type}-{args.task_name}-{time_}.log".

I think I know the reason. One explanation I found for OSError: [Errno 22] says: "You are on Windows, which uses the backslash as the path separator. In Python the backslash is an escape character, so you need to use a forward slash, a raw string, or an escaped backslash. Change these types of addresses:"

I tried several times to solve this problem and failed after spending 3 hours. Can anyone tell me how to solve it?
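One possible cause, stated here as an assumption because the thread does not show how time_ is built: if time_ contains colons (for example from an "%H:%M:%S" format), Windows rejects the file name, since ':' is not allowed in Windows file names. A minimal sketch that builds a colon-free name; build_log_path is a hypothetical helper, not part of the repository.

```python
import time
from pathlib import Path


def build_log_path(output_dir: str, model_type: str, task_name: str) -> Path:
    """Build '<output_dir>/<model_type>-<task_name>-<timestamp>.log' without colons."""
    # '-' separates hour, minute and second because ':' is invalid in Windows file names.
    stamp = time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())
    return Path(output_dir) / f"{model_type}-{task_name}-{stamp}.log"


log_path = build_log_path("outputs", "bert", "cner")
print(log_path)
```

Under that assumption, the call in run_ner_crf.py could become init_logger(log_file=str(build_log_path(args.output_dir, args.model_type, args.task_name))).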

opened by lvjiujin 3
StopIteration error?

First of all, thank you for the excellent open-source work; it matches my needs exactly. However, when I actually run it, the following error appears and I don't know what is going on. Any guidance would be appreciated. Looking forward to your reply!

07/10/2020 16:14:08 - INFO - root - ***** Running training *****
07/10/2020 16:14:08 - INFO - root - Num examples = 10748
07/10/2020 16:14:08 - INFO - root - Num Epochs = 4
07/10/2020 16:14:08 - INFO - root - Instantaneous batch size per GPU = 24
07/10/2020 16:14:08 - INFO - root - Total train batch size (w. parallel, distributed & accumulation) = 48
07/10/2020 16:14:08 - INFO - root - Gradient Accumulation steps = 1
07/10/2020 16:14:08 - INFO - root - Total optimization steps = 896
Traceback (most recent call last):
  File "run_ner_crf.py", line 497, in <module>
    main()
  File "run_ner_crf.py", line 438, in main
    global_step, tr_loss = train(args, train_dataset, model, tokenizer)
  File "run_ner_crf.py", line 132, in train
    outputs = model(**inputs)
  File "/home/user/.conda/envs/torch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user/.conda/envs/torch/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/user/.conda/envs/torch/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/user/.conda/envs/torch/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/home/user/.conda/envs/torch/lib/python3.6/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
StopIteration: Caught StopIteration in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/user/.conda/envs/torch/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/home/user/.conda/envs/torch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/stephen-lib/stephen的个人文件夹/my_code/NLP组件研发/细粒度实体识别/BERT-NER-Pytorch/models/bert_for_ner.py", line 58, in forward
    outputs =self.bert(input_ids = input_ids,attention_mask=attention_mask,token_type_ids=token_type_ids)
  File "/home/user/.conda/envs/torch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/stephen-lib/stephen的个人文件夹/my_code/NLP组件研发/细粒度实体识别/BERT-NER-Pytorch/models/transformers/modeling_bert.py", line 606, in forward
    extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype) # fp16 compatibility
StopIteration

opened by Vincent131499 3
Where can I download the pre-trained files?

The chinese_L-12_H-768_A-12 package downloaded from google-research contains only bert_model.ckpt, vocab.txt, and bert_config.json, but the code does not seem to expect these files: OSError: Error no file named ['pytorch_model.bin', 'tf_model.h5', 'model.ckpt.index'] found in directory prev_trained_model/bert-base/bert-base-chinese or from_tf set to False

opened by lzh66 2
mask in crf

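Hello,

If attention_mask is used as the CRF mask, then when a word has multiple sub-tokens, all of those tokens are kept. In BERT for NER, classification is done on the first token of each word.

https://github.com/lonePatient/BERT-NER-Pytorch/blob/master/models/bert_for_ner.py#L64 The same applies at decode time:

https://github.com/lonePatient/BERT-NER-Pytorch/blob/38326e125696c5a34c54ada676930bee4a2d1693/run_ner_crf.py#L210

The mask here is also the attention mask. As a result, everything from CLS to SEP, including all tokens in between, is kept and used for decoding. Is this mask setting reasonable, or should only the first token of each word be kept? Thanks!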

opened by DanqingZ 2
TypeError: __init__() got an unexpected keyword argument 'max_len'
Using the author's custom CNerTokenizer raises __init__() got an unexpected keyword argument 'max_len'. The full error message is:

  File "BERT-NER-Pytorch-master/run_ner_softmax.py", line 549, in <module>
    main()
  File "BERT-NER-Pytorch-master/run_ner_softmax.py", line 480, in main
    cache_dir=args.cache_dir if args.cache_dir else None,)
  File "BERT-NER-Pytorch-master\models\transformers\tokenization_utils.py", line 282, in from_pretrained
    return cls._from_pretrained(*inputs, **kwargs)
  File "BERT-NER-Pytorch-master\models\transformers\tokenization_utils.py", line 411, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
TypeError: __init__() got an unexpected keyword argument 'max_len'

P.S. Using BertTokenizer does not raise this error. I would also like to ask why the author defines a custom tokenizer. Doesn't BertTokenizer convert words that are missing from the vocabulary to <UNK>?
opened by possible1402 2
The crf function raises an error when using multiple GPUs

02/25/2020 13:50:42 - INFO - root - ***** Running evaluation *****
02/25/2020 13:50:42 - INFO - root - Num examples = 1343
02/25/2020 13:50:42 - INFO - root - Batch size = 48
Traceback (most recent call last):
  File "run_ner_crf.py", line 517, in <module>
    main()
  File "run_ner_crf.py", line 459, in main
    global_step, tr_loss = train(args, train_dataset, model, tokenizer)
  File "run_ner_crf.py", line 148, in train
    evaluate(args, model, tokenizer)
  File "run_ner_crf.py", line 197, in evaluate
    tags,_ = model.crf._obtain_labels(logits, args.id2label, inputs['input_lens'])
  File "/root/.pyenv/versions/3.7.2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 591, in __getattr__
    type(self).__name__, name))
AttributeError: 'DataParallel' object has no attribute 'crf'

After investigating: the crf function is custom-defined. With multiple GPUs the model is wrapped in DataParallel, and the DataParallel wrapper does not expose this custom crf attribute, which causes the error.

opened by aliendaniel 2
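Some details about MRC-NER

How are sentences that contain no entity, or that contain multiple entities, handled?

When splitting by the maximum sequence length, how is an entity that originally sat within a single paragraph split?

In the eval algorithm, the ground-truth (gold) count is not accurate, so the results are inconsistent with the standard CoNLL NER evaluation script. Are results comparable when different algorithms use different metrics?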

opened by 578123043 2
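Problem with tokenizer.tokenize()

In the train branch of run_ner_crf.py, why is a trailing comma left inside the from_pretrained() call? As a result, tokenizer.tokenize() does not split character by character, so tokens and label_ids end up with different lengths. tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path, do_lower_case=args.do_lower_case, )

run_ner_span.py uses from_pretrained() without the comma, and there tokenizer.tokenize() splits character by character, so tokens and label_ids have the same length. tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path, do_lower_case=args.do_lower_case)

What is the difference between the two forms? This looks like a bug. After fixing it, the biggest remaining problem is that when I train on my own dataset with run_ner_crf.py, recall keeps dropping after two epochs. Has the author run into this? Please reply when you have time.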
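As a point of reference, Python treats a trailing comma inside a call's argument list as purely syntactic, so the two from_pretrained(...) spellings above pass identical arguments; a difference in tokenization would have to come from somewhere else. The stand-in function below is hypothetical and only echoes its arguments.

```python
def from_pretrained(path, do_lower_case=False):
    """Stand-in that returns the arguments it received."""
    return (path, do_lower_case)


with_comma = from_pretrained("some/model/path", do_lower_case=True, )
without_comma = from_pretrained("some/model/path", do_lower_case=True)
print(with_comma == without_comma)  # prints True
```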

opened by 472027909 1

Does the bert_crf model in the code forget to pass through the CRF decoder layer at prediction time?

class BertCrfForNer(BertPreTrainedModel):
    def __init__(self, config):
        super(BertCrfForNer, self).__init__(config)
        self.bert = BertModel(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)
        self.crf = CRF(num_tags=config.num_labels, batch_first=True)
        self.init_weights()

    def forward(self, input_ids, token_type_ids=None, attention_mask=None, labels=None):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        sequence_output = outputs[0]
        sequence_output = self.dropout(sequence_output)
        # Per-token emission scores; no CRF decoding happens inside forward().
        logits = self.classifier(sequence_output)
        outputs = (logits,)
        if labels is not None:
            # The CRF layer is only called here, to compute the training loss.
            loss = self.crf(emissions=logits, tags=labels, mask=attention_mask)
            outputs = (-1 * loss,) + outputs
        return outputs  # (loss), scores
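In the code quoted above, forward returns only the emission scores (plus the loss when labels are given). The evaluation traceback in the multi-GPU issue calls model.crf._obtain_labels(logits, ...), which suggests that decoding is done outside forward. To illustrate what a CRF decoding step computes, here is a plain-Python Viterbi sketch. It is not the repository's implementation: viterbi_decode is a hypothetical function, and start/end transition scores are left out to keep it short.

```python
from typing import List, Sequence


def viterbi_decode(emissions: Sequence[Sequence[float]],
                   transitions: Sequence[Sequence[float]]) -> List[int]:
    """Return the highest-scoring tag sequence.

    emissions[t][j] is the score of tag j at position t;
    transitions[i][j] is the score of moving from tag i to tag j.
    """
    num_tags = len(transitions)
    score = list(emissions[0])
    backpointers: List[List[int]] = []
    for t in range(1, len(emissions)):
        new_score: List[float] = []
        pointers: List[int] = []
        for j in range(num_tags):
            # Best previous tag for ending at tag j in position t.
            best_prev = max(range(num_tags), key=lambda i: score[i] + transitions[i][j])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + transitions[best_prev][j] + emissions[t][j])
        score = new_score
        backpointers.append(pointers)
    # Follow the back-pointers from the best final tag.
    path = [max(range(num_tags), key=lambda j: score[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


emissions = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
sticky = [[0.0, -5.0], [-5.0, 0.0]]   # switching tags is penalized
neutral = [[0.0, 0.0], [0.0, 0.0]]    # transitions have no effect
print(viterbi_decode(emissions, sticky))
print(viterbi_decode(emissions, neutral))
```

With the penalized transitions the decoder prefers a constant tag sequence even though the middle position's emission favors the other tag, which is the behavior a per-token argmax over logits cannot reproduce.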

opened by xinjicong 1

Question? about bert-base

Thank you for sharing the code. Where can I download the pre-trained model "bert-base" that belongs under BERT-NER-Pytorch/prev_trained_model/bert-base/? Could you provide a download link? Looking forward to your reply.

opened by ypc-stu 1
Precision and recall for the CONT type are both 0 on the cner test and dev sets

On the cner test and dev sets, precision and recall for the CONT type are both 0, while the metrics for the other entity types look fairly normal. I have not found the cause yet.

***** Eval results %s *****
acc: 0.9466 - recall: 0.9373 - f1: 0.9419 - loss: 0.6826
***** Entity results %s *****
******* CONT results ********
acc: 0.0000 - recall: 0.0000 - f1: 0.0000
******* EDU results ********
acc: 0.9911 - recall: 0.9911 - f1: 0.9911
******* LOC results ********
acc: 1.0000 - recall: 1.0000 - f1: 1.0000
******* NAME results ********
acc: 1.0000 - recall: 1.0000 - f1: 1.0000
******* ORG results ********
acc: 0.9253 - recall: 0.9403 - f1: 0.9327
******* PRO results ********
acc: 0.8684 - recall: 1.0000 - f1: 0.9296
******* RACE results ********
acc: 1.0000 - recall: 1.0000 - f1: 1.0000
******* TITLE results ********
acc: 0.9505 - recall: 0.9481 - f1: 0.9493
testset precision:0.946617008069522, recall:0.9373079287031346, f1:0.941939468807906, loss:0.6826483011245728

opened by yangjianxin1 0
Large performance gap between softmax and CRF with identical parameters. How can I find the cause and improve the results?

epoch: 10 train_batch_size: 8 evaluation_batch_size: 16 learning_rate: 3e-5

Pre-trained model: bert_base_chinese

Dataset: my own Chinese dataset

bert+softmax: acc: 0.0729 - recall: 0.4249 - f1: 0.1245 - loss: 1.0244

bert+crf: acc: 0.4268 - recall: 0.5729 - f1: 0.4892 - loss: 34.9675

Running on Google BERT: accuracy: 85.46%; precision: 36.28%; recall: 48.35%; FB1: 41.45

opened by northfun 0

What is the list of values for the required parameters?

Running the script fails with: run_ner_crf.py: error: the following arguments are required: --task_name, --data_dir, --model_type, --model_name_or_path, --output_dir

I then looked at the contents of the get_argparse() function:

    parser.add_argument("--task_name", default=None, type=str, required=True,
                        help="The name of the task to train selected in the list: ")
    parser.add_argument("--data_dir", default=None, type=str, required=True,
                        help="The input data dir. Should contain the training files for the CoNLL-2003 NER task.", )
    parser.add_argument("--model_type", default=None, type=str, required=True,
                        help="Model type selected in the list: ")
    parser.add_argument("--model_name_or_path", default=None, type=str, required=True,
                        help="Path to pre-trained model or shortcut name selected in the list: " )
    parser.add_argument("--output_dir", default=None, type=str, required=True,
                        help="The output directory where the model predictions and checkpoints will be written.", )

In the help text, what are the lists for task name, model type, and model name? Nothing is written after the colon, so I don't know what to pass for these arguments.
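The sketch below shows how the five required arguments fit together on a command line. Every value in example_argv is an illustrative assumption pieced together from names that appear elsewhere on this page (the cner dataset directory, the prev_trained_model layout); none of them is a documented list of valid choices, and build_required_parser is a hypothetical reduced stand-in for get_argparse().

```python
import argparse


def build_required_parser() -> argparse.ArgumentParser:
    """Parser with only the five required arguments named in the error message."""
    parser = argparse.ArgumentParser()
    for name in ("--task_name", "--data_dir", "--model_type",
                 "--model_name_or_path", "--output_dir"):
        parser.add_argument(name, default=None, type=str, required=True)
    return parser


# All values below are illustrative assumptions, not documented choices.
example_argv = [
    "--task_name", "cner",
    "--data_dir", "datasets/cner",
    "--model_type", "bert",
    "--model_name_or_path", "prev_trained_model/bert_base",
    "--output_dir", "outputs/cner_output",
]
args = build_required_parser().parse_args(example_argv)
print(args.task_name, args.model_type, args.output_dir)
```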

opened by xiaohou1112 1

Fix do_lower_case
The do_lower_case parameter determines whether the input text is lower-cased, and it is passed to the tokenizer.

tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path, do_lower_case=args.do_lower_case,)

The parameter takes effect inside the tokenizer.tokenize method. This project calls tokenizer.convert_tokens_to_ids directly, so the parameter has no effect in practice and lower-casing has to be handled manually.

def convert_examples_to_features(...):
    ...
    if tokenizer.do_lower_case:
        tokens = [x.lower() for x in tokens]
    ...
    input_ids = tokenizer.convert_tokens_to_ids(tokens)
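The effect of that manual step can be seen with a toy vocabulary. This is a sketch under stated assumptions: tokens_to_ids and toy_vocab are hypothetical, and the lookup simply falls back to the [UNK] id for unknown tokens.

```python
from typing import Dict, List


def tokens_to_ids(tokens: List[str], vocab: Dict[str, int], do_lower_case: bool) -> List[int]:
    """Map tokens to ids, lower-casing first when requested."""
    if do_lower_case:
        tokens = [token.lower() for token in tokens]
    unk_id = vocab["[UNK]"]
    return [vocab.get(token, unk_id) for token in tokens]


toy_vocab = {"[UNK]": 0, "a": 1, "b": 2}
# Without lower-casing, "A" is not in the vocabulary and falls back to [UNK].
print(tokens_to_ids(["A", "b"], toy_vocab, do_lower_case=False))
print(tokens_to_ids(["A", "b"], toy_vocab, do_lower_case=True))
```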
opened by entropy2333 0
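Question about tokenizer.tokenize

I have read the TF tokenizer code: it takes a sentence or a single char as input and returns a single sentence or a single char, whereas the torch version takes a sentence or a single char and returns a list for the sentence or a list for the char.

The important problem is this: if the input char is itself an unk-type character, PyTorch's tokenizer.tokenize(char) returns an empty result rather than [UNK]. I am curious why PyTorch does it this way, because it means the training data x and the labels can no longer be aligned......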
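One way to keep tokens and labels aligned is to emit exactly one token per input character and substitute an unknown-token marker for anything outside the vocabulary. The sketch below assumes character-level labels, as in the input format of this project; char_tokenize and toy_vocab are hypothetical and this is not the repository's tokenizer.

```python
from typing import Container, List


def char_tokenize(text: str, vocab: Container[str], unk_token: str = "[UNK]") -> List[str]:
    """Emit exactly one token per input character, using unk_token for unknown ones."""
    return [char if char in vocab else unk_token for char in text]


toy_vocab = {"我", "跟", "他"}
labels = ["O", "O", "O", "O"]
# The space is not in the vocabulary, but it still produces a token.
tokens = char_tokenize("我跟 他", toy_vocab)
print(tokens, len(tokens) == len(labels))
```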

opened by lsx0930 1

Owner

Weitang Liu

weibo: https://weibo.com/277974397

GitHub

Pytorch-NLU, a Chinese text classification and sequence labeling toolkit. It supports multi-class and multi-label classification of long and short Chinese texts, and sequence labeling tasks such as Chinese named entity recognition, part-of-speech tagging, and word segmentation.

186 Dec 24, 2022

Bidirectional LSTM-CRF and ELMo for Named-Entity Recognition, Part-of-Speech Tagging and so on.

anaGo anaGo is a Python library for sequence labeling(NER, PoS Tagging,...), implemented in Keras. anaGo can solve sequence labeling tasks such as nam

1.5k Dec 5, 2022

Bidirectional LSTM-CRF and ELMo for Named-Entity Recognition, Part-of-Speech Tagging and so on.

anaGo anaGo is a Python library for sequence labeling(NER, PoS Tagging,...), implemented in Keras. anaGo can solve sequence labeling tasks such as nam

1.4k Feb 17, 2021

Implemented shortest-circuit disambiguation, maximum probability disambiguation, HMM-based lexical annotation and BiLSTM+CRF-based named entity recognition

0 Feb 13, 2022

Framework for fine-tuning pretrained transformers for Named-Entity Recognition (NER) tasks

NERDA Not only is NERDA a mesmerizing muppet-like character. NERDA is also a python package, that offers a slick easy-to-use interface for fine-tuning

141 Dec 30, 2022

Spacy-ginza-ner-webapi - Named Entity Recognition API with spaCy and GiNZA

Named Entity Recognition API with spaCy and GiNZA I wrote a blog post about this

3 Feb 27, 2022

Chinese named entity recognization (bert/roberta/macbert/bert_wwm with Keras)

2 Jul 5, 2022

Chinese NER with albert/electra or other bert descendable model (keras)

Chinese NLP (albert/electra with Keras) Named Entity Recognization Project Structure ./ ├── NER │ ├── __init__.py │ ├── log

2 Nov 20, 2022

PhoNLP: A BERT-based multi-task learning toolkit for part-of-speech tagging, named entity recognition and dependency parsing

PhoNLP is a multi-task learning model for joint part-of-speech (POS) tagging, named entity recognition (NER) and dependency parsing. Experiments on Vietnamese benchmark datasets show that PhoNLP produces state-of-the-art results, outperforming a single-task learning approach that fine-tunes the pre-trained Vietnamese language model PhoBERT for each task independently.

109 Dec 2, 2022

Pytorch-Named-Entity-Recognition-with-BERT

BERT NER Use google BERT to do CoNLL-2003 NER ! Train model using Python and Inference using C++ ALBERT-TF2.0 BERT-NER-TENSORFLOW-2.0 BERT-SQuAD Requi

1.1k Dec 25, 2022

Use Google's BERT for named entity recognition （CoNLL-2003 as the dataset）.

For better performance, you can try NLPGNN, see NLPGNN for more details. BERT-NER Version 2 Use Google's BERT for named entity recognition （CoNLL-2003

1.2k Dec 26, 2022

RoNER is a Named Entity Recognition model based on a pre-trained BERT transformer model trained on RONECv2

RoNER RoNER is a Named Entity Recognition model based on a pre-trained BERT transformer model trained on RONECv2. It is meant to be an easy to use, hi

9 Nov 7, 2022

Chinese Named Entity Recognization (BiLSTM with PyTorch)

BiLSTM-CRF for Name Entity Recognition PyTorch version A PyTorch implementation of Bi-LSTM-CRF model for Chinese Named Entity Recognition. Implemented with PyTorch

5 Jun 1, 2022

TweebankNLP - Pre-trained Tweet NLP Pipeline (NER, tokenization, lemmatization, POS tagging, dependency parsing) + Models + Tweebank-NER

TweebankNLP This repo contains the new Tweebank-NER dataset and Twitter-Stanza p

84 Dec 20, 2022

Tool to add main subject to items on Wikidata using a WMFs CirrusSearch for named entity recognition or a manually supplied list of QIDs

ItemSubjector Tool made to add main subject statements to items based on the title using a home-brewed CirrusSearch-based Named Entity Recognition alg

9 Nov 17, 2022
