Framework for fine-tuning pretrained transformers for Named-Entity Recognition (NER) tasks

Overview

NERDA

Not only is NERDA a mesmerizing muppet-like character. NERDA is also a Python package that offers a slick, easy-to-use interface for fine-tuning pretrained transformers for Named Entity Recognition (NER) tasks.

You can also utilize NERDA to access a selection of precooked NERDA models that you can use right off the shelf for NER tasks.

NERDA is built on Hugging Face transformers and the popular PyTorch framework.

Installation guide

NERDA can be installed from PyPI with

pip install NERDA

If you want the development version then install directly from GitHub.
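
For example, you can install straight from the repository (the URL below points to the ebanalyse/NERDA GitHub repository referenced elsewhere in this document):

pip install git+https://github.com/ebanalyse/NERDA.git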

Named-Entity Recognition tasks

Named-entity recognition (NER) (also known as (named) entity identification, entity chunking, and entity extraction) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.

Example Task:

Task

Identify person names and organizations in text:

Jim bought 300 shares of Acme Corp.

Solution

Named Entity    Type
'Jim'           Person
'Acme Corp.'    Organization

Read more about NER on Wikipedia.

Train Your Own NERDA Model

Say we want to fine-tune a pretrained Multilingual BERT transformer for NER in English.

Load package.

from NERDA.models import NERDA

Instantiate a NERDA model (with default settings) for the CoNLL-2003 English NER data set.

from NERDA.datasets import get_conll_data
model = NERDA(dataset_training = get_conll_data('train'),
              dataset_validation = get_conll_data('valid'),
              transformer = 'bert-base-multilingual-uncased')

By default the network architecture is analogous to that of the models in Hvingelby et al. 2020.

The model can then be trained/fine-tuned by invoking the train method, e.g.

model.train()

Note: this will take some time depending on the specs of your machine (if you want to skip training, you can go ahead and use one of the models that we have already precooked for you instead).

After the model has been trained, the model can be used for predicting named entities in new texts.

# text to identify named entities in.
text = 'Old MacDonald had a farm'
model.predict_text(text)
([['Old', 'MacDonald', 'had', 'a', 'farm']], [['B-PER', 'I-PER', 'O', 'O', 'O']])

This means that the model identified 'Old MacDonald' as a person: the tags follow the IOB scheme, where 'B-PER' marks the beginning of a person entity and 'I-PER' its continuation.
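
If you want to persist a fine-tuned model for later use, the package also exposes save/load methods; below is a sketch assuming the save_network and load_network_from_file methods referenced in the comments further down (their exact signatures have changed between versions):

# save the fine-tuned weights to disk (older signature; newer versions
# take an output directory and also persist the tokenizer)
model.save_network(model_path="model.bin")

# later, on a NERDA instance configured the same way:
model.load_network_from_file(model_path="model.bin")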

Please note that the NERDA model configuration above was instantiated with all default settings. You can, however, customize your NERDA model in many ways:

  • Use your own data set (fine-tune a transformer for any given language)
  • Choose whatever transformer you like
  • Set all of the hyperparameters for the model (a sketch follows below)
  • You can even apply your own network architecture
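
For example, a customized configuration could look like the sketch below. The data format mirrors what get_conll_data returns (dicts of pre-tokenized 'sentences' with one IOB tag per token); the tag_scheme, tag_outside and hyperparameters arguments are assumptions that may differ between versions:

from NERDA.models import NERDA

# your own data: pre-tokenized sentences, one IOB tag per token
training = {'sentences': [['Jim', 'bought', 'shares', 'of', 'Acme', 'Corp.']],
            'tags':      [['B-PER', 'O', 'O', 'O', 'B-ORG', 'I-ORG']]}
validation = {'sentences': [['Old', 'MacDonald', 'had', 'a', 'farm']],
              'tags':      [['B-PER', 'I-PER', 'O', 'O', 'O']]}

model = NERDA(dataset_training = training,
              dataset_validation = validation,
              tag_scheme = ['B-PER', 'I-PER', 'B-ORG', 'I-ORG'],  # entity tags; the outside tag is given separately
              tag_outside = 'O',
              transformer = 'bert-base-multilingual-uncased',
              hyperparameters = {'epochs': 4,
                                 'learning_rate': 0.0001,
                                 'train_batch_size': 13,
                                 'warmup_steps': 500})
model.train()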

Read more about advanced usage of NERDA in the detailed documentation.

Use a Precooked NERDA model

We have precooked a number of NERDA models for Danish and English that you can download and use right off the shelf.

Here is an example.

Instantiate a Multilingual BERT model that has been fine-tuned for NER in Danish, DA_BERT_ML.

from NERDA.precooked import DA_BERT_ML
model = DA_BERT_ML()

Download the network from the web and load it:

model.download_network()
model.load_network()

You can now predict named entities in new (Danish) texts:

# (Danish) text to identify named entities in:
# 'Jens Hansen har en bondegård' = 'Old MacDonald had a farm'
text = 'Jens Hansen har en bondegård'
model.predict_text(text)
([['Jens', 'Hansen', 'har', 'en', 'bondegård']], [['B-PER', 'I-PER', 'O', 'O', 'O']])

List of Precooked Models

The table below shows the precooked NERDA models publicly available for download.

Model          Language   Transformer         Dataset      F1-score
DA_BERT_ML     Danish     Multilingual BERT   DaNE         82.8
DA_ELECTRA_DA  Danish     Danish ELECTRA      DaNE         79.8
EN_BERT_ML     English    Multilingual BERT   CoNLL-2003   90.4
EN_ELECTRA_EN  English    English ELECTRA     CoNLL-2003   89.1

The F1-score is micro-averaged across entity tags and evaluated on the respective test sets (which were used neither for training nor for validation of the models).

Note that we have not spent a lot of time on actually fine-tuning the models, so there could be room for improvement. If you are able to improve the models, we will be happy to hear from you and include your NERDA model.
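
The scores can be reproduced along these lines (a sketch: it assumes a 'test' split argument to get_conll_data analogous to 'train'/'valid', and uses the evaluate_performance method referenced in the comments below):

from NERDA.precooked import EN_BERT_ML
from NERDA.datasets import get_conll_data

model = EN_BERT_ML()
model.download_network()
model.load_network()

# compute F1 scores per entity tag on the held-out test split
performance = model.evaluate_performance(get_conll_data('test'))
print(performance)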

Model Performance

The table below summarizes the performance (F1-scores) of the precooked NERDA models.

Level      DA_BERT_ML   DA_ELECTRA_DA   EN_BERT_ML   EN_ELECTRA_EN
B-PER      93.8         92.0            96.0         95.1
I-PER      97.8         97.1            98.5         97.9
B-ORG      69.5         66.9            88.4         86.2
I-ORG      69.9         70.7            85.7         83.1
B-LOC      82.5         79.0            92.3         91.1
I-LOC      31.6         44.4            83.9         80.5
B-MISC     73.4         68.6            81.8         80.1
I-MISC     86.1         63.6            63.4         68.4
AVG_MICRO  82.8         79.8            90.4         89.1
AVG_MACRO  75.6         72.8            86.3         85.3

'NERDA'?

'NERDA' originally stands for 'Named Entity Recognition for DAnish'. However, this is somewhat misleading, since the functionality is no longer limited to Danish. On the contrary, it generalizes to other languages: NERDA supports fine-tuning of transformers for NER tasks in any language.

Background

NERDA is developed as part of Ekstra Bladet's activities on Platform Intelligence in News (PIN). PIN is an industrial research project carried out in collaboration between the Technical University of Denmark, the University of Copenhagen and Copenhagen Business School, with funding from Innovation Fund Denmark. The project runs from 2020 to 2023 and develops recommender systems and natural language processing systems geared towards news publishing, some of which are open-sourced, like NERDA.

Read more

The detailed documentation for NERDA including code references and extended workflow examples can be accessed here.

Contact

We hope that you will find NERDA useful.

Please direct any questions and feedback to us!

If you want to contribute (which we encourage you to do), open a PR.

If you encounter a bug or want to suggest an enhancement, please open an issue.

Comments
  • tiny bug in NERDADataSetReader!

    Hi there! In some cases an error is raised while iterating over the DataLoader's batches, and I believe it happens because of the length of the offsets list. The error looks like this:

    RuntimeError: Caught RuntimeError in DataLoader worker process 0.
    Original Traceback (most recent call last):
      File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/worker.py", line 198, in _worker_loop
        data = fetcher.fetch(index)
      File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
        return self.collate_fn(data)
      File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/collate.py", line 73, in default_collate
        return {key: default_collate([d[key] for d in batch]) for key in elem}
      File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/collate.py", line 73, in <dictcomp>
        return {key: default_collate([d[key] for d in batch]) for key in elem}
      File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate
        return torch.stack(batch, 0, out=out)
    RuntimeError: stack expects each tensor to be equal size, but got [150] at entry 0 and [151] at entry 1

    A quick and unprincipled fix is to add an extra line of code in the class NERDADataSetReader() to truncate the list; this worked for me! :)

    offsets = offsets[:self.max_len]
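
    For context, the underlying failure is torch's default collate function trying to stack tensors of unequal length, which is exactly what happens when one item's offsets list ends up one element longer than another's. A minimal, self-contained reproduction of just that failure (hypothetical data, not NERDA code):

    import torch
    from torch.utils.data._utils.collate import default_collate

    # two batch items whose tensors differ in length by one element,
    # mimicking an offsets list that was not truncated to max_len
    batch = [{'offsets': torch.zeros(150, dtype=torch.long)},
             {'offsets': torch.zeros(151, dtype=torch.long)}]
    default_collate(batch)  # RuntimeError: stack expects each tensor to be equal size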

    opened by meti-94 12
  • not work with different pytorch/transformers version

    I tested the same dataset along with the same model and hyperparameters, but in different versions of torch and transformers. With torch 1.8.1 it raises the following error: TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]

    However, it works fine in the environment in which I just had NERDA pip installed...

    opened by Hansyvea 5
  • added citation

    Hej EB,

    I will be citing this work in an upcoming paper on DaCy. I have added the citation I intend to use, but feel free to change it if you feel like it doesn't fit. I will change mine accordingly. I extracted the authors from the commit history which I realize isn't always representative.

    opened by KennethEnevoldsen 4
  • fixing behavior of transformers tokenizer for all chars and words

    Hi, I was finally able to work on the code over the weekend, and I found the cause of the error: it was due to a tokenizer problem. In many languages, including my language (Persian), there are words and characters (abbreviations) that the tokenize() method of the tokenizer class is not able to identify, so for such inputs an empty list of word pieces is returned. In the next step the offsets array is nevertheless expanded (by [1]) even though no word piece was identified, which eventually leads to errors in the training and evaluation process. For example, the word ۖ marks sanctity for religious figures and is seen in many writings, but cannot be identified.

    opened by meti-94 4
  • 'BertModel' object has no attribute 'name_or_path'

    Hi there - thanks for releasing your code and taking the time to read my issue!

    I've just started to explore how NERDA works by trying to run the code from https://ebanalyse.github.io/NERDA/workflow/ but ran into this error:

    KeyError: 'electra'

    I then changed the transformer to 'bert-base-multilingual-uncased', but now I am getting this error:

    ModuleAttributeError: 'BertModel' object has no attribute 'name_or_path'

    I can't see anyone else with the same issue online and my knowledge of transformers is poor. I'd really appreciate any help you could provide.

    Cheers

    opened by CyrusDobbs 4
  • Could you please explain how to load a local conll2003 formatted file?

    How can I save the trained model locally and load it next time? I didn't see this in the tutorial (perhaps it is too easy... but I couldn't find a way). Thanks in advance.

    opened by Hansyvea 3
  • Are there any ways to check recall and precision in addition to F1-Score?

    Thank you for providing such a helpful library. I have one question.

    I've installed NERDA from PyPI and used the Performance function. compute_f1_scores(y_pred, y_true, labels, **kwargs) returns a list of F1 scores, but is there any way to check recall and precision for each tag?

    On the official page of Performance, I could not find information to solve the issue.

    I appreciate your support.

    opened by mk0222-deep 2
  • Load trained model across devices (CPU/GPU)

    First of all thanks a lot for making this project. You've made it super simple to train a custom model!

    I experienced some issues using a model trained and saved on a GPU on another computer running only a CPU, and I thought I'd share how to deal with it.

    I could both torch.save() and torch.load() a model on my GPU PC, as you write in #14. Running my model on another PC with only a CPU should then be handled by providing map_location = torch.device('cpu'), as PyTorch writes in its documentation.

    So I tried that, of course, using the code:

    model_path = 'some_path_to_model.pt'
    device = torch.device('cpu')
    model = torch.load(model_path, map_location=device)
    

    and printing model.device would return cpu. Everything seemed to be working properly.

    Next, when I wanted to predict_text, I received this assertion error: AssertionError: Torch not compiled with CUDA enabled. Super weird, since I had checked that the model was on the CPU.

    It turned out that the NERDANetwork, which is a torch.nn.Module, was still cast to the 'old' GPU device. So printing model.network.device returned cuda, despite model.device returning cpu.

    So my solution was to cast both the loaded model and the NERDANetwork to cpu:

    model_path = 'some_path_to_model.pt'
    device = torch.device('cpu')
    model = torch.load(model_path, map_location=device)
    model.network.device = device
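
    Wrapped up as a small helper, the whole workaround looks like this (a sketch of the steps above; the network.device assignment is the NERDA-specific part):

    import torch

    def load_nerda_model(model_path, device_str='cpu'):
        device = torch.device(device_str)
        # remap all stored tensors to the target device while unpickling
        model = torch.load(model_path, map_location=device)
        # NERDA keeps a second device reference on its wrapped network,
        # so override that as well
        model.network.device = device
        return model

    model = load_nerda_model('some_path_to_model.pt', 'cpu')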
    

    I hope you'll find it useful!

    opened by AGMoller 2
  • Question about validation_set

    I want to ask a simple question. The parameters of the model are set before training begins, so what is the purpose of the validation set during model training? Thank you!

    opened by Shengyu-Liu558 2
  • troubles with 'device' parameter in models.py

    Hi there!

    I have some troubles when I specify the device in NERDA model:

    [screenshot of the error omitted]

    At the same time, when I do not specify the device parameter, everything works just fine. Looks like the trouble is in models.py: when device != None, self.device is never initialized.

    opened by Combo-Breaker 2
  • F1 on Token level?

    Is the evaluation in the evaluate_performance function (models.py) done at the token level or the entity level? From the image in the workflow documentation it seems to be done at the token level, but the comments in the code mention entity level.

    opened by nishkalavallabhi 1
  • Wrong scikit-learn dependency causes issues.

    We are using NERDA for a couple of things, but we currently get this error when using NERDA as a dependency.

    × python setup.py egg_info did not run successfully.
        │ exit code: 1
        ╰─> [18 lines of output]
            The 'sklearn' PyPI package is deprecated, use 'scikit-learn'
            rather than 'sklearn' for pip commands.
            
            Here is how to fix this error in the main use cases:
            - use 'pip install scikit-learn' rather than 'pip install sklearn'
            - replace 'sklearn' by 'scikit-learn' in your pip requirements files
              (requirements.txt, setup.py, setup.cfg, Pipfile, etc ...)
            - if the 'sklearn' package is used by one of your dependencies,
              it would be great if you take some time to track which package uses
              'sklearn' instead of 'scikit-learn' and report it to their issue tracker
            - as a last resort, set the environment variable
              SKLEARN_ALLOW_DEPRECATED_SKLEARN_PACKAGE_INSTALL=True to avoid this error
    

    The issue stems from adding the deprecated sklearn instead of scikit-learn in your setup file at line 21:

    https://github.com/ebanalyse/NERDA/blob/ae45d7e5368059721d1073384201433ea7a6e820/setup.py#L21

    Would you please change the dependency to scikit-learn instead?

    opened by sorenmc 0
  • save and load tokenizer method implemented

    Changes:

    1. Changed the load_network_from_file and save_network methods so that the tokenizer can now be saved and loaded as well.
       Save network before: model.save_network(model_path="model.bin")
       Save network now: model.save_network(output_dir="models/")
       Load network before: model.load_network_from_file(model_path="model.bin")
       Load network now: model.load_network_from_file(model_path="models/model.bin", tokenizer_path="models/tokenizer/")

    2. Typo fix: AVG_MICRO --> AVG_MACRO (from previous PR: https://github.com/ebanalyse/NERDA/pull/33)

    3. Added a version number to the package (from previous PR: https://github.com/ebanalyse/NERDA/pull/34)

    4. Adapted preprocessing for some HF models (from previous PR: https://github.com/ebanalyse/NERDA/pull/36, but removed line 106)

    opened by Chaarangan 0
  • max_len check gives poor warning message

    Change this:

    msg = f'Sentence #{item} length {len(tokens)} exceeds max_len {self.max_len} and has been truncated'
    

    to

    msg = f'Sentence #{item} length {len(tokens)} exceeds max_len {self.max_len} - 2 and has been truncated, note that two tokens are used to surround the sentence with the [CLS] and [SEP] token'
    

    Since the warning 'Sentence 4 length 511 exceeds max_len 512 and has been truncated' doesn't make sense.

    opened by prhbrt 0
  • Not Able to get Accuracy Score

    ValueError: Found input variables with inconsistent numbers of samples: [18734, 18733]

    I am able to get F1 scores when I pass test_dict to the evaluate function as model.evaluate_performance(test_dict), but if I pass "True" as a parameter, as in model.evaluate_performance(test_dict, True), I get a ValueError. I cross-checked test_dict manually: there are 18734 samples in y_test in total, so why is the function missing one value in y_pred (the predicted samples)?

    opened by ParthP8399 1
  • tag_scheme should be able to contain the outside_tag

    If someone passes the outside_tag as part of the tag scheme, they will probably get something like:

    [Expected input batch_size (324) to match target batch_size (4)]

    That is because the outside tag then appears twice in the complete tag list, inflating the number of tags and thereby the expected batch size. This can easily be fixed by deduplicating in training.py (line 138):

        tag_complete = list(set([tag_outside] + tag_scheme))
    
    opened by rubmz 0
Owner
Ekstra Bladet
GitHub of Ekstra Bladet Analyse
Pytorch-NLU, a Chinese text classification and sequence annotation toolkit

Pytorch-NLU is a Chinese text classification and sequence annotation toolkit. It supports multi-class and multi-label classification of Chinese long and short texts, and sequence annotation tasks such as Chinese named entity recognition, part-of-speech tagging and word segmentation.

null 186 Dec 24, 2022
Spacy-ginza-ner-webapi - Named Entity Recognition API with spaCy and GiNZA

Named Entity Recognition API with spaCy and GiNZA I wrote a blog post about this

Yuki Okuda 3 Feb 27, 2022
Named-entity recognition using neural networks. Easy-to-use and state-of-the-art results.

NeuroNER NeuroNER is a program that performs named-entity recognition (NER). Website: neuroner.com. This page gives step-by-step instructions to insta

Franck Dernoncourt 1.6k Dec 27, 2022
CrossNER: Evaluating Cross-Domain Named Entity Recognition (AAAI-2021)

CrossNER is a fully-labeled collected of named entity recognition (NER) data spanning over five diverse domains (Politics, Natural Science, Music, Literature, and Artificial Intelligence) with specialized entity categories for different domains.

Zihan Liu 89 Nov 10, 2022
PhoNLP: A BERT-based multi-task learning toolkit for part-of-speech tagging, named entity recognition and dependency parsing

PhoNLP is a multi-task learning model for joint part-of-speech (POS) tagging, named entity recognition (NER) and dependency parsing. Experiments on Vietnamese benchmark datasets show that PhoNLP produces state-of-the-art results, outperforming a single-task learning approach that fine-tunes the pre-trained Vietnamese language model PhoBERT for each task independently.

VinAI Research 109 Dec 2, 2022
Bidirectional LSTM-CRF and ELMo for Named-Entity Recognition, Part-of-Speech Tagging and so on.

anaGo anaGo is a Python library for sequence labeling(NER, PoS Tagging,...), implemented in Keras. anaGo can solve sequence labeling tasks such as nam

Hiroki Nakayama 1.5k Dec 5, 2022
Pytorch-Named-Entity-Recognition-with-BERT

BERT NER Use google BERT to do CoNLL-2003 NER ! Train model using Python and Inference using C++ ALBERT-TF2.0 BERT-NER-TENSORFLOW-2.0 BERT-SQuAD Requi

Kamal Raj 1.1k Dec 25, 2022
A text augmentation tool for named entity recognition.

neraug This python library helps you with augmenting text data for named entity recognition. Augmentation Example Reference from An Analysis of Simple

Hiroki Nakayama 48 Oct 11, 2022
Tool to add main subject to items on Wikidata using a WMFs CirrusSearch for named entity recognition or a manually supplied list of QIDs

ItemSubjector Tool made to add main subject statements to items based on the title using a home-brewed CirrusSearch-based Named Entity Recognition alg

Dennis Priskorn 9 Nov 17, 2022
Implemented shortest-circuit disambiguation, maximum probability disambiguation, HMM-based lexical annotation and BiLSTM+CRF-based named entity recognition

Implemented shortest-circuit disambiguation, maximum probability disambiguation, HMM-based lexical annotation and BiLSTM+CRF-based named entity recognition

null 0 Feb 13, 2022
Use Google's BERT for named entity recognition (CoNLL-2003 as the dataset).

For better performance, you can try NLPGNN, see NLPGNN for more details. BERT-NER Version 2 Use Google's BERT for named entity recognition (CoNLL-2003

Kaiyinzhou 1.2k Dec 26, 2022
Named Entity Recognition API used by TEI Publisher

TEI Publisher Named Entity Recognition API This repository contains the API used by TEI Publisher's web-annotation editor to detect entities in the in

e-editiones.org 14 Nov 15, 2022
Nested Named Entity Recognition

Nested Named Entity Recognition Training Dataset: CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark url: https://tianchi.aliyun.

null 8 Dec 25, 2022
RoNER is a Named Entity Recognition model based on a pre-trained BERT transformer model trained on RONECv2

RoNER RoNER is a Named Entity Recognition model based on a pre-trained BERT transformer model trained on RONECv2. It is meant to be an easy to use, hi

Stefan Dumitrescu 9 Nov 7, 2022
Code for "Parallel Instance Query Network for Named Entity Recognition", accepted at ACL 2022.

README Code for Two-stage Identifier: "Parallel Instance Query Network for Named Entity Recognition", accepted at ACL 2022. For details of the model a

Yongliang Shen 45 Nov 29, 2022