Chinese NER (Named Entity Recognition) using BERT (Softmax, CRF, Span)

Overview

BERT for Chinese NER, with Softmax, CRF, and Span decoding heads.

dataset list

  1. cner: datasets/cner
  2. CLUENER: https://github.com/CLUEbenchmark/CLUENER

model list

  1. BERT+Softmax
  2. BERT+CRF
  3. BERT+Span

requirements

  1. 1.1.0 <= PyTorch < 1.5.0
  2. CUDA 9.0
  3. Python 3.6+

input format

The input format follows the BIOS tag scheme, with one character and its label per line. Sentences are separated by a blank line.

美	B-LOC
国	I-LOC
的	O
华	B-PER
莱	I-PER
士	I-PER

我	O
跟	O
他	O
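
A minimal reader sketch for this format (read_bios is a hypothetical helper, not part of this repo); a blank line closes a sentence:

    # Sketch: parse the character-per-line BIOS format shown above.
    def read_bios(path):
        sentences, chars, labels = [], [], []
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.rstrip("\n")
                if not line:                  # blank line: sentence boundary
                    if chars:
                        sentences.append((chars, labels))
                        chars, labels = [], []
                else:
                    char, tag = line.split()  # character and its tag
                    chars.append(char)
                    labels.append(tag)
        if chars:                             # flush the final sentence
            sentences.append((chars, labels))
        return sentences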

run the code

  1. Modify the configuration information in run_ner_xxx.py or run_ner_xxx.sh.
  2. sh scripts/run_ner_xxx.sh

note: expected file structure of the pretrained model directory

├── prev_trained_model
|  └── bert_base
|  |  └── pytorch_model.bin
|  |  └── config.json
|  |  └── vocab.txt
|  |  └── ......

CLUENER result

The overall performance of BERT on dev:

Model                         Accuracy (entity)  Recall (entity)  F1 score (entity)
BERT+Softmax                  0.7897             0.8031           0.7963
BERT+CRF                      0.7977             0.8177           0.8076
BERT+Span                     0.8132             0.8092           0.8112
BERT+Span+adv                 0.8267             0.8073           0.8169
BERT-small(6 layers)+Span+kd  0.8241             0.7839           0.8051
BERT+Span+focal_loss          0.8121             0.8008           0.8064
BERT+Span+label_smoothing     0.8235             0.7946           0.8088

ALBERT for CLUENER

The overall performance of ALBERT on dev:

model   version        Accuracy (entity)  Recall (entity)  F1 (entity)  Train time/epoch
albert  base_google    0.8014             0.6908           0.7420       0.75x
albert  large_google   0.8024             0.7520           0.7763       2.1x
albert  xlarge_google  0.8286             0.7773           0.8021       6.7x
bert    google         0.8118             0.8031           0.8074       -----
albert  base_bright    0.8068             0.7529           0.7789       0.75x
albert  large_bright   0.8152             0.7480           0.7802       2.2x
albert  xlarge_bright  0.8222             0.7692           0.7948       7.3x

Cner result

The overall performance of BERT on dev (test):

Model                      Accuracy (entity)  Recall (entity)  F1 score (entity)
BERT+Softmax               0.9586 (0.9566)    0.9644 (0.9613)  0.9615 (0.9590)
BERT+CRF                   0.9562 (0.9539)    0.9671 (0.9644)  0.9616 (0.9591)
BERT+Span                  0.9604 (0.9620)    0.9617 (0.9632)  0.9611 (0.9626)
BERT+Span+focal_loss       0.9516 (0.9569)    0.9644 (0.9681)  0.9580 (0.9625)
BERT+Span+label_smoothing  0.9566 (0.9568)    0.9624 (0.9656)  0.9595 (0.9612)
Comments
  • OSError: [Errno 22] when running on Windows

    When it runs on Windows, "OSError: [Errno 22]" is raised by this line of code in the run_ner_crf.py file:

    init_logger(log_file=args.output_dir + f'/{args.model_type}-{args.task_name}-{time_}.log')

    How can it be solved? A simple method is the following:

    init_logger(log_file=args.output_dir + r'/{args.model_type}-{args.task_name}-{time_}.log')

    but then the log file is literally named "{args.model_type}-{args.task_name}-{time_}.log".

    I think I know the reason. The usual advice for OSError: [Errno 22] is: "You are on Windows, which uses backslash as the path separator. For Python, however, backslash is the escape character, so you need to use a forward slash, a raw string, or escape the backslash; change these kinds of paths."

    I tried several times to solve this problem, failed, and spent 3 hours on it. Who can tell me how to solve it?
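
    One plausible cause (an assumption, not confirmed in this thread): Windows forbids ':' in file names, so if time_ is a timestamp formatted with colons (e.g. "%H:%M:%S"), the f-string builds an invalid path even though the slashes are fine. A hedged sketch of a fix:

    # Hedged sketch: keep the timestamp free of ':' and build the path with
    # os.path.join; args and init_logger come from run_ner_crf.py.
    import os
    import time

    time_ = time.strftime("%Y-%m-%d-%H_%M_%S")  # underscores, not colons
    init_logger(log_file=os.path.join(
        args.output_dir, f"{args.model_type}-{args.task_name}-{time_}.log"))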

    opened by lvjiujin 3
  • StopIteration error?

    First of all, thank you for this excellent open-source work; it exactly matches my needs. But when I actually run it, the following error appears and I do not know why. Please advise, and please reply!

    07/10/2020 16:14:08 - INFO - root - ***** Running training *****
    07/10/2020 16:14:08 - INFO - root - Num examples = 10748
    07/10/2020 16:14:08 - INFO - root - Num Epochs = 4
    07/10/2020 16:14:08 - INFO - root - Instantaneous batch size per GPU = 24
    07/10/2020 16:14:08 - INFO - root - Total train batch size (w. parallel, distributed & accumulation) = 48
    07/10/2020 16:14:08 - INFO - root - Gradient Accumulation steps = 1
    07/10/2020 16:14:08 - INFO - root - Total optimization steps = 896
    Traceback (most recent call last):
      File "run_ner_crf.py", line 497, in <module>
        main()
      File "run_ner_crf.py", line 438, in main
        global_step, tr_loss = train(args, train_dataset, model, tokenizer)
      File "run_ner_crf.py", line 132, in train
        outputs = model(**inputs)
      File "/home/user/.conda/envs/torch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
        result = self.forward(*input, **kwargs)
      File "/home/user/.conda/envs/torch/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
        outputs = self.parallel_apply(replicas, inputs, kwargs)
      File "/home/user/.conda/envs/torch/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
        return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
      File "/home/user/.conda/envs/torch/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
        output.reraise()
      File "/home/user/.conda/envs/torch/lib/python3.6/site-packages/torch/_utils.py", line 395, in reraise
        raise self.exc_type(msg)
    StopIteration: Caught StopIteration in replica 0 on device 0.
    Original Traceback (most recent call last):
      File "/home/user/.conda/envs/torch/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
        output = module(*input, **kwargs)
      File "/home/user/.conda/envs/torch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
        result = self.forward(*input, **kwargs)
      File "/mnt/stephen-lib/stephen的个人文件夹/my_code/NLP组件研发/细粒度实体识别/BERT-NER-Pytorch/models/bert_for_ner.py", line 58, in forward
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
      File "/home/user/.conda/envs/torch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
        result = self.forward(*input, **kwargs)
      File "/mnt/stephen-lib/stephen的个人文件夹/my_code/NLP组件研发/细粒度实体识别/BERT-NER-Pytorch/models/transformers/modeling_bert.py", line 606, in forward
        extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype)  # fp16 compatibility
    StopIteration
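
    This StopIteration comes from next(self.parameters()) being evaluated inside forward() under nn.DataParallel, a known incompatibility with PyTorch >= 1.5; the requirements above pin 1.1.0 <= PyTorch < 1.5.0, and downgrading torch is the simplest fix. A hedged code-side sketch, not the repo's own fix:

    # Hedged sketch: pin the mask dtype explicitly instead of querying
    # next(self.parameters()).dtype, which raises StopIteration inside
    # DataParallel replicas on PyTorch >= 1.5 (this gives up fp16 support).
    extended_attention_mask = extended_attention_mask.to(dtype=torch.float32)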

    opened by Vincent131499 3
  • Where can the pretrained files be downloaded?

    The chinese_L-12_H-768_A-12 archive downloaded from google-research contains only bert_model.ckpt, vocab.txt, and bert_config.json, but the code seems to need different files:
    OSError: Error no file named ['pytorch_model.bin', 'tf_model.h5', 'model.ckpt.index'] found in directory prev_trained_model/bert-base/bert-base-chinese or from_tf set to False
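
    A hedged sketch of one way to produce pytorch_model.bin from the Google TF checkpoint, assuming the HuggingFace transformers package is installed (this mirrors its standard conversion utility; it is not a script shipped with this repo). Afterwards, rename bert_config.json to config.json and copy vocab.txt so the directory matches the file structure shown above:

    # Hedged sketch: convert the TF checkpoint to a PyTorch state dict
    # (reading the checkpoint also requires tensorflow to be installed).
    import torch
    from transformers import BertConfig, BertForPreTraining, load_tf_weights_in_bert

    config = BertConfig.from_json_file("chinese_L-12_H-768_A-12/bert_config.json")
    model = BertForPreTraining(config)
    load_tf_weights_in_bert(model, config, "chinese_L-12_H-768_A-12/bert_model.ckpt")
    torch.save(model.state_dict(), "prev_trained_model/bert_base/pytorch_model.bin")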

    opened by lzh66 2
  • mask in crf

    Hello,

    If attention_mask is used as the CRF mask, then when a word has multiple sub-tokens, all of those tokens are kept. But in BERT for NER, classification is done on the first token of each word.

    https://github.com/lonePatient/BERT-NER-Pytorch/blob/master/models/bert_for_ner.py#L64

    Likewise, at decode time,

    https://github.com/lonePatient/BERT-NER-Pytorch/blob/38326e125696c5a34c54ada676930bee4a2d1693/run_ner_crf.py#L210

    the mask here is also the attention mask, so everything from [CLS] to [SEP], including every token in between, is kept and used for decoding. Is it reasonable to set the mask this way, or should only the first token of each word be kept? Thanks!

    opened by DanqingZ 2
  • TypeError: __init__() got an unexpected keyword argument 'max_len'

    Using the author's custom CNerTokenizer raises __init__() got an unexpected keyword argument 'max_len'. The detailed error is:

      File "BERT-NER-Pytorch-master/run_ner_softmax.py", line 549, in <module>
        main()
      File "BERT-NER-Pytorch-master/run_ner_softmax.py", line 480, in main
        cache_dir=args.cache_dir if args.cache_dir else None,)
      File "BERT-NER-Pytorch-master\models\transformers\tokenization_utils.py", line 282, in from_pretrained
        return cls._from_pretrained(*inputs, **kwargs)
      File "BERT-NER-Pytorch-master\models\transformers\tokenization_utils.py", line 411, in _from_pretrained
        tokenizer = cls(*init_inputs, **init_kwargs)
    TypeError: __init__() got an unexpected keyword argument 'max_len'

    P.S. Using BertTokenizer does not raise the error. I would also like to ask why the author defines a custom tokenizer; doesn't BertTokenizer already convert words that are not in the vocabulary to [UNK]?
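
    A hedged patch sketch (an assumption, not the repo's fix): let the custom tokenizer accept and forward extra keyword arguments such as max_len to its parent class instead of rejecting them:

    # Hedged sketch: BertTokenizer here is the repo-local copy under
    # models/transformers (an assumption); **kwargs forwards max_len upstream.
    from models.transformers import BertTokenizer

    class CNerTokenizer(BertTokenizer):
        def __init__(self, vocab_file, do_lower_case=False, **kwargs):
            super().__init__(vocab_file=vocab_file,
                             do_lower_case=do_lower_case, **kwargs)

    As for the second question, the custom tokenizer appears to exist so that text is split strictly character by character, keeping tokens aligned one-to-one with the per-character labels.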

    opened by possible1402 2
  • CRF error in the multi-GPU case

    02/25/2020 13:50:42 - INFO - root - ***** Running evaluation *****
    02/25/2020 13:50:42 - INFO - root - Num examples = 1343
    02/25/2020 13:50:42 - INFO - root - Batch size = 48
    Traceback (most recent call last):
      File "run_ner_crf.py", line 517, in <module>
        main()
      File "run_ner_crf.py", line 459, in main
        global_step, tr_loss = train(args, train_dataset, model, tokenizer)
      File "run_ner_crf.py", line 148, in train
        evaluate(args, model, tokenizer)
      File "run_ner_crf.py", line 197, in evaluate
        tags,_ = model.crf._obtain_labels(logits, args.id2label, inputs['input_lens'])
      File "/root/.pyenv/versions/3.7.2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 591, in __getattr__
        type(self).__name__, name))
    AttributeError: 'DataParallel' object has no attribute 'crf'

    After investigation: the crf module is custom-defined, and in the multi-GPU case the model is wrapped in DataParallel, which does not expose the custom crf attribute.
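
    A common workaround sketch (not necessarily the repo's fix): unwrap DataParallel before touching custom attributes such as the CRF layer:

    # Hedged sketch: DataParallel proxies only nn.Module's standard API,
    # so reach through .module for custom attributes like crf.
    bare_model = model.module if hasattr(model, "module") else model
    tags, _ = bare_model.crf._obtain_labels(logits, args.id2label, inputs['input_lens'])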

    opened by aliendaniel 2
  • Some details about MRC-NER

    How are sentences that contain no entities, or that contain multiple entities, handled?

    When truncating to the maximum sequence length, how are entities that originally sit within one paragraph split?

    Regarding the evaluation algorithm: the ground-truth (gold) count at that point is not exact, so the results disagree with the standard CoNLL NER evaluation script. If different algorithms use different metrics, are the results comparable?

    opened by 578123043 2
  • tokenizer.tokenize() question

    In the train branch of run_ner_crf.py, why does the from_pretrained() call end with a trailing comma? With it, tokenizer.tokenize() does not split character by character, so tokens and label_ids end up with different lengths:

    tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path, do_lower_case=args.do_lower_case, )

    run_ner_span.py uses from_pretrained() without the trailing comma, and there tokenizer.tokenize() does split character by character, so tokens and label_ids have equal lengths:

    tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path, do_lower_case=args.do_lower_case)

    What is the difference between the two ways of writing it? It feels like a bug. After fixing it, the biggest remaining problem: training on my own dataset with run_ner_crf.py, recall keeps dropping after two epochs. Has the author run into this? Please reply when you have time.
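
    For reference, a trailing comma inside a Python call is syntactically inert, so the two calls above are identical; any behavioral difference more plausibly comes from which tokenizer class each script binds. A self-contained check:

    # A trailing comma in a call changes nothing: f(1, 2,) == f(1, 2).
    def f(*args):
        return args

    assert f(1, 2,) == f(1, 2)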

    opened by 472027909 1
  • At prediction time, does the bert_crf model in the code forget to go through the CRF decoder layer?

    class BertCrfForNer(BertPreTrainedModel):
        def __init__(self, config):
            super(BertCrfForNer, self).__init__(config)
            self.bert = BertModel(config)
            self.dropout = nn.Dropout(config.hidden_dropout_prob)
            self.classifier = nn.Linear(config.hidden_size, config.num_labels)
            self.crf = CRF(num_tags=config.num_labels, batch_first=True)
            self.init_weights()

        def forward(self, input_ids, token_type_ids=None, attention_mask=None, labels=None):
            outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
            sequence_output = outputs[0]
            sequence_output = self.dropout(sequence_output)
            logits = self.classifier(sequence_output)
            outputs = (logits,)
            if labels is not None:
                loss = self.crf(emissions=logits, tags=labels, mask=attention_mask)
                outputs = (-1 * loss,) + outputs
            return outputs  # (loss), scores
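
    For context, forward() above indeed returns only the emissions at inference time; the Viterbi decode is applied outside the model in the prediction loop. A hedged sketch of that step (crf.decode is the torchcrf-style decode; the exact names in run_ner_crf.py may differ):

    # Hedged sketch: decode the best tag sequence from the emissions.
    logits = model(input_ids=input_ids, attention_mask=attention_mask)[0]
    tags = model.crf.decode(logits, attention_mask)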
    
    opened by xinjicong 1
  • Question about bert-base

    Thank you for sharing the code. May I ask where the pre-trained model "bert-base" in BERT-NER-Pytorch/prev_trained_model/bert-base/ can be downloaded? Could you provide a download link? Looking forward to your reply.

    opened by ypc-stu 1
  • Precision and recall for the CONT type are both 0 on the cner test and dev sets

    On the cner test and dev sets, precision and recall for the CONT entity type are both 0, while the metrics for the other entity types look fairly normal; I have not yet found the cause.

    ***** Eval results %s *****
    acc: 0.9466 - recall: 0.9373 - f1: 0.9419 - loss: 0.6826
    ***** Entity results %s *****
    ******* CONT results ********
    acc: 0.0000 - recall: 0.0000 - f1: 0.0000
    ******* EDU results ********
    acc: 0.9911 - recall: 0.9911 - f1: 0.9911
    ******* LOC results ********
    acc: 1.0000 - recall: 1.0000 - f1: 1.0000
    ******* NAME results ********
    acc: 1.0000 - recall: 1.0000 - f1: 1.0000
    ******* ORG results ********
    acc: 0.9253 - recall: 0.9403 - f1: 0.9327
    ******* PRO results ********
    acc: 0.8684 - recall: 1.0000 - f1: 0.9296
    ******* RACE results ********
    acc: 1.0000 - recall: 1.0000 - f1: 1.0000
    ******* TITLE results ********
    acc: 0.9505 - recall: 0.9481 - f1: 0.9493
    testset precision:0.946617008069522, recall:0.9373079287031346, f1:0.941939468807906, loss:0.6826483011245728

    opened by yangjianxin1 0
  • Softmax and CRF performance differ greatly with identical hyperparameters; how should I diagnose and tune this?

    epoch: 10, train_batch_size: 8, evaluation_batch_size: 16, learning_rate: 3e-5

    Pretrained model: bert_base_chinese

    My own Chinese dataset.

    bert+softmax: acc: 0.0729 - recall: 0.4249 - f1: 0.1245 - loss: 1.0244

    bert+crf: acc: 0.4268 - recall: 0.5729 - f1: 0.4892 - loss: 34.9675

    Running on Google's BERT: accuracy: 85.46%; precision: 36.28%; recall: 48.35%; FB1: 41.45

    opened by northfun 0
  • What are the lists of values for the required parameters?

    Running the script reports an error:
    run_ner_crf.py: error: the following arguments are required: --task_name, --data_dir, --model_type, --model_name_or_path, --output_dir

    Then, looking at the contents of the get_argparse() function:

    parser.add_argument("--task_name", default=None, type=str, required=True,
                        help="The name of the task to train selected in the list: ")
    parser.add_argument("--data_dir", default=None, type=str, required=True,
                        help="The input data dir. Should contain the training files for the CoNLL-2003 NER task.", )
    parser.add_argument("--model_type", default=None, type=str, required=True,
                        help="Model type selected in the list: ")
    parser.add_argument("--model_name_or_path", default=None, type=str, required=True,
                        help="Path to pre-trained model or shortcut name selected in the list: " )
    parser.add_argument("--output_dir", default=None, type=str, required=True,
                        help="The output directory where the model predictions and checkpoints will be written.", )

    In the help text for task_name, model_type, and model_name_or_path, what are the lists? Nothing is written after the colons, so I do not know what to pass for these parameters.
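
    For what it's worth, the valid values appear to come from dictionaries defined in the run scripts; a hedged sketch of surfacing them in the help text (ner_processors and MODEL_CLASSES are assumed names based on the repo's structure):

    # Hedged sketch: expose valid values via argparse choices. ner_processors
    # maps task names (e.g. "cner", "cluener") to processors; MODEL_CLASSES
    # maps model types (e.g. "bert", "albert") to their classes.
    parser.add_argument("--task_name", default=None, type=str, required=True,
                        choices=list(ner_processors.keys()),
                        help="The name of the task to train: " + ", ".join(ner_processors))
    parser.add_argument("--model_type", default=None, type=str, required=True,
                        choices=list(MODEL_CLASSES.keys()),
                        help="Model type: " + ", ".join(MODEL_CLASSES))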

    opened by xiaohou1112 1
  • Fix do_lower_case

    The do_lower_case flag decides whether the input text is lower-cased, and it is passed to the tokenizer:

    tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path, do_lower_case=args.do_lower_case,)

    The flag only takes effect inside the tokenizer.tokenize method. Since this project calls tokenizer.convert_tokens_to_ids directly, the flag actually has no effect, so the lower-casing must be done manually:

    def convert_examples_to_features(...):
        ...
        if tokenizer.do_lower_case:
            tokens = [x.lower() for x in tokens]
        ...
        input_ids = tokenizer.convert_tokens_to_ids(tokens)
    opened by entropy2333 0
  • A question about tokenizer.tokenize

    Having read the TF tokenizer code: given a sentence or a single char as input, it returns a sentence or a single char, whereas the torch version, given a sentence or a single char, returns a list for either.

    The important issue: if the input char is itself a UNK-type character, pytorch's tokenizer.tokenize(char) actually returns an empty list rather than [UNK]? I am curious why pytorch does it this way; it directly makes it impossible to align the training data x with the labels...
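
    A hedged guard sketch (not from the repo) that keeps tokens aligned one-to-one with per-character labels when tokenize() returns an empty list:

    # Hedged sketch: map characters that tokenize to nothing onto [UNK].
    tokens = []
    for ch in text:
        pieces = tokenizer.tokenize(ch)
        tokens.extend(pieces if pieces else ["[UNK]"])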

    opened by lsx0930 1
Owner
Weitang Liu
weibo: https://weibo.com/277974397
Pytorch-NLU, a Chinese text classification and sequence annotation toolkit. It supports multi-class and multi-label classification of Chinese long and short text, and sequence annotation tasks such as Chinese named entity recognition, part-of-speech tagging, and word segmentation.

Pytorch-NLU, a Chinese text classification and sequence annotation toolkit. It supports multi-class and multi-label classification of Chinese long and short text, and sequence annotation tasks such as Chinese named entity recognition, part-of-speech tagging, and word segmentation.

null 186 Dec 24, 2022
Bidirectional LSTM-CRF and ELMo for Named-Entity Recognition, Part-of-Speech Tagging and so on.

anaGo anaGo is a Python library for sequence labeling(NER, PoS Tagging,...), implemented in Keras. anaGo can solve sequence labeling tasks such as nam

Hiroki Nakayama 1.5k Dec 5, 2022
Implemented shortest-circuit disambiguation, maximum probability disambiguation, HMM-based lexical annotation and BiLSTM+CRF-based named entity recognition

Implemented shortest-circuit disambiguation, maximum probability disambiguation, HMM-based lexical annotation and BiLSTM+CRF-based named entity recognition

null 0 Feb 13, 2022
Framework for fine-tuning pretrained transformers for Named-Entity Recognition (NER) tasks

NERDA Not only is NERDA a mesmerizing muppet-like character. NERDA is also a python package, that offers a slick easy-to-use interface for fine-tuning

Ekstra Bladet 141 Dec 30, 2022
Spacy-ginza-ner-webapi - Named Entity Recognition API with spaCy and GiNZA

Named Entity Recognition API with spaCy and GiNZA I wrote a blog post about this

Yuki Okuda 3 Feb 27, 2022
Chinese named entity recognization (bert/roberta/macbert/bert_wwm with Keras)

Chinese named entity recognization (bert/roberta/macbert/bert_wwm with Keras)

null 2 Jul 5, 2022
Chinese NER with albert/electra or other bert descendable model (keras)

Chinese NLP (albert/electra with Keras) Named Entity Recognization Project Structure ./ ├── NER │   ├── __init__.py │   ├── log

null 2 Nov 20, 2022
PhoNLP: A BERT-based multi-task learning toolkit for part-of-speech tagging, named entity recognition and dependency parsing

PhoNLP is a multi-task learning model for joint part-of-speech (POS) tagging, named entity recognition (NER) and dependency parsing. Experiments on Vietnamese benchmark datasets show that PhoNLP produces state-of-the-art results, outperforming a single-task learning approach that fine-tunes the pre-trained Vietnamese language model PhoBERT for each task independently.

VinAI Research 109 Dec 2, 2022
Pytorch-Named-Entity-Recognition-with-BERT

BERT NER Use google BERT to do CoNLL-2003 NER ! Train model using Python and Inference using C++ ALBERT-TF2.0 BERT-NER-TENSORFLOW-2.0 BERT-SQuAD Requi

Kamal Raj 1.1k Dec 25, 2022
Use Google's BERT for named entity recognition (CoNLL-2003 as the dataset).

For better performance, you can try NLPGNN, see NLPGNN for more details. BERT-NER Version 2 Use Google's BERT for named entity recognition (CoNLL-2003

Kaiyinzhou 1.2k Dec 26, 2022
RoNER is a Named Entity Recognition model based on a pre-trained BERT transformer model trained on RONECv2

RoNER RoNER is a Named Entity Recognition model based on a pre-trained BERT transformer model trained on RONECv2. It is meant to be an easy to use, hi

Stefan Dumitrescu 9 Nov 7, 2022
Chinese Named Entity Recognization (BiLSTM with PyTorch)

BiLSTM-CRF for Named Entity Recognition, PyTorch version. A PyTorch implementation of the Bi-LSTM-CRF model for Chinese Named Entity Recognition.

null 5 Jun 1, 2022
a chinese segment base on crf

Genius is an open-source Python Chinese word segmentation component based on the CRF (Conditional Random Field) algorithm. Features: supports python2.x, python3.x, and pypy2.x; simple pinyin segmentation; user-defined breaks; user-defined merged words.

duanhongyi 237 Nov 4, 2022
Named-entity recognition using neural networks. Easy-to-use and state-of-the-art results.

NeuroNER NeuroNER is a program that performs named-entity recognition (NER). Website: neuroner.com. This page gives step-by-step instructions to insta

Franck Dernoncourt 1.6k Dec 27, 2022
Tool to add main subject to items on Wikidata using a WMFs CirrusSearch for named entity recognition or a manually supplied list of QIDs

ItemSubjector Tool made to add main subject statements to items based on the title using a home-brewed CirrusSearch-based Named Entity Recognition alg

Dennis Priskorn 9 Nov 17, 2022