CrossNER: Evaluating Cross-Domain Named Entity Recognition (AAAI-2021)

Zihan Liu

Last update: Nov 10, 2022


Overview

CrossNER

NEW (2021/1/5): Fixed several annotation errors (thanks for the help from Youliang Yuan).

CrossNER: Evaluating Cross-Domain Named Entity Recognition (Accepted in AAAI-2021) [PDF]

CrossNER is a fully labeled collection of named entity recognition (NER) data spanning five diverse domains (Politics, Natural Science, Music, Literature, and Artificial Intelligence), with specialized entity categories for each domain. CrossNER also includes unlabeled domain-related corpora for the same five domains. We hope that the dataset will catalyze research in NER domain adaptation.

You can have a quick overview of this paper through our blog. If you use the dataset in an academic paper, please consider citing the following paper.

@article{liu2020crossner,
      title={CrossNER: Evaluating Cross-Domain Named Entity Recognition}, 
      author={Zihan Liu and Yan Xu and Tiezheng Yu and Wenliang Dai and Ziwei Ji and Samuel Cahyawijaya and Andrea Madotto and Pascale Fung},
      year={2020},
      eprint={2012.04373},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

The CrossNER Dataset

Data Statistics and Entity Categories

Data statistics of unlabeled domain corpora, labeled NER samples and entity categories for each domain.

Data Examples

Data examples for the collected five domains. Each domain has its specialized entity categories.

Domain Overlaps

Vocabulary overlaps between domains (%). “Reuters” denotes the Reuters News domain, “Science” denotes the natural science domain, and “Litera.” denotes the literature domain.

Download

Labeled NER data: Labeled NER data for the five target domains (Politics, Science, Music, Literature, and AI) and the source domain (Reuters News from the CoNLL-2003 shared task) can be found in the ner_data folder.
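As a rough illustration of how the labeled files could be loaded, here is a minimal sketch. It assumes the CoNLL-style layout of one token and its label per line, separated by whitespace, with a blank line between sentences; that layout, the helper names (`read_ner_lines`, `read_ner_file`), and the sample sentence are assumptions for illustration, not something this README specifies. Check a file in the ner_data folder before relying on it.

```python
from typing import Iterable, List, Tuple

Sentence = Tuple[List[str], List[str]]  # (tokens, labels)


def read_ner_lines(lines: Iterable[str]) -> List[Sentence]:
    """Parse 'token<whitespace>label' lines; blank lines separate sentences."""
    sentences: List[Sentence] = []
    tokens: List[str] = []
    labels: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current sentence, if there is one.
            if tokens:
                sentences.append((tokens, labels))
                tokens, labels = [], []
            continue
        parts = line.split()
        tokens.append(parts[0])   # first column: the token
        labels.append(parts[-1])  # last column: the label
    if tokens:  # the file may not end with a blank line
        sentences.append((tokens, labels))
    return sentences


def read_ner_file(path: str) -> List[Sentence]:
    """Read one labeled file, e.g. a train.txt inside a domain folder."""
    with open(path, encoding="utf-8") as handle:
        return read_ner_lines(handle)


# Made-up sample in the assumed layout (not taken from the dataset).
sample = [
    "Barack\tB-politician\n",
    "Obama\tI-politician\n",
    "visited\tO\n",
    "\n",
    "Paris\tB-location\n",
]
parsed = read_ner_lines(sample)
```

With a local copy of the repository, `read_ner_file("ner_data/politics/train.txt")` would be the corresponding call, assuming the politics folder follows the same naming as the other domain folders.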

Unlabeled Corpora: Unlabeled domain-related corpora (domain-level, entity-level, task-level and integrated) for the five target domains can be downloaded here.

Dependency

Install PyTorch (tested with PyTorch 1.2.0 and Python 3.6)
Install transformers (tested with transformers 3.0.2)

Domain-Adaptive Pre-Training (DAPT)

Configurations

--train_data_file: The file path of the pre-training corpus.
--output_dir: The output directory where the pre-trained model is saved.
--model_name_or_path: The model from which to continue pre-training.

❱❱❱ python run_language_modeling.py --output_dir=politics_spanlevel_integrated --model_type=bert --model_name_or_path=bert-base-cased --do_train --train_data_file=corpus/politics_integrated.txt --mlm

This example runs span-level pre-training on the integrated corpus for the politics domain. The code is adapted from run_language_modeling.py in Hugging Face transformers (3.0.2).
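To give a feel for what "span-level" can mean in masked language modeling, here is a small sketch that masks contiguous spans of tokens instead of independent tokens. This is an illustration under stated assumptions, not the logic of the repository's modified run_language_modeling.py: the function name `mask_spans`, the 15% masking budget, the maximum span length of 3, and the uniform span sampling are all hypothetical choices.

```python
import random
from typing import List


def mask_spans(tokens: List[str], mask_ratio: float = 0.15, max_span: int = 3,
               mask_token: str = "[MASK]", seed: int = 0) -> List[str]:
    """Replace non-overlapping contiguous spans with mask_token until the budget is used."""
    rng = random.Random(seed)
    # Number of tokens to mask: at least one, never more than the sentence length.
    budget = min(len(tokens), max(1, round(len(tokens) * mask_ratio)))
    masked = list(tokens)
    covered = set()
    attempts = 0
    while len(covered) < budget and attempts < 100 * len(tokens):
        attempts += 1
        # Span length is capped by what is left of the budget.
        span_len = min(rng.randint(1, max_span), budget - len(covered))
        start = rng.randrange(0, len(tokens) - span_len + 1)
        span = range(start, start + span_len)
        if any(i in covered for i in span):
            continue  # skip spans that overlap an already masked position
        for i in span:
            masked[i] = mask_token
            covered.add(i)
    return masked


tokens = [f"w{i}" for i in range(20)]
masked = mask_spans(tokens)
```

In a real pipeline the masking would operate on subword IDs inside the data collator, and the original tokens at masked positions would become the prediction targets.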

Baselines

Configurations

--tgt_dm: Target domain that the model needs to adapt to.
--conll: Use source domain data (News domain from CoNLL 2003) for pre-training.
--joint: Jointly train using source and target domain data.
--num_tag: Number of label types for the target domain (we put the details in src/dataloader.py).
--ckpt: Checkpoint path to load the pre-trained model.
--emb_file: Word-level embeddings file path.
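The value passed to --num_tag depends on the tagging scheme. As a hedged worked example, assume BIO tagging, where every entity category contributes a B- and an I- label and a single O label covers everything else, so that num_tag = 2 × (number of categories) + 1. The category list below is an illustrative guess for the politics domain; src/dataloader.py holds the authoritative label sets.

```python
from typing import List


def bio_labels(categories: List[str]) -> List[str]:
    """Build the BIO label set: 'O' plus B-/I- variants of every category."""
    labels = ["O"]
    for category in categories:
        labels.append(f"B-{category}")
        labels.append(f"I-{category}")
    return labels


# Illustrative category names (an assumption, not copied from the repository).
politics_categories = [
    "politician", "person", "organisation", "politicalparty", "event",
    "election", "country", "location", "misc",
]
politics_labels = bio_labels(politics_categories)
num_tag = len(politics_labels)  # 2 * 9 + 1
```

Nine categories give 19 labels, which is consistent with the --num_tag 19 used in the politics commands in this README. By the same arithmetic, a tagger that only predicts entity-agnostic B, I, and O would need 3 labels.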

Directly Fine-tune

Directly fine-tune the pre-trained model (span-level + integrated corpus) to the target domain (politics domain).

❱❱❱ python main.py --exp_name politics_directly_finetune --exp_id 1 --num_tag 19 --ckpt politics_spanlevel_integrated/pytorch_model.bin --tgt_dm politics --batch_size 16

Jointly Train

Initialize the model with the pre-trained model (span-level + integrated corpus). Then, jointly train the model with the source and target (politics) domain data.

❱❱❱ python main.py --exp_name politics_jointly_train --exp_id 1 --num_tag 19 --conll --joint --ckpt politics_spanlevel_integrated/pytorch_model.bin --tgt_dm politics

Pre-train then Fine-tune

Initialize the model with the pre-trained model (span-level + integrated corpus). Then pre-train it on the source domain data and fine-tune it on the target (politics) domain.

❱❱❱ python main.py --exp_name politics_pretrain_then_finetune --exp_id 1 --num_tag 19 --conll --ckpt politics_spanlevel_integrated/pytorch_model.bin --tgt_dm politics --batch_size 16
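The three commands above differ only in whether --conll and --joint are passed. The lookup below summarizes that mapping as read from the commands in this README; it is a reading aid with hypothetical names (`STAGES`, `stages_for`), not code from the repository, and flag combinations that the README does not document are deliberately left undefined.

```python
from typing import Dict, List, Optional, Tuple

# Keys: (--conll given, --joint given). Values: NER training stages, in order,
# that follow initialization from the domain-adaptive pre-trained checkpoint.
STAGES: Dict[Tuple[bool, bool], List[str]] = {
    (False, False): ["target"],           # Directly Fine-tune
    (True, True): ["source + target"],    # Jointly Train
    (True, False): ["source", "target"],  # Pre-train then Fine-tune
}


def stages_for(conll: bool, joint: bool) -> Optional[List[str]]:
    """Return the documented stages, or None for an undocumented flag combination."""
    return STAGES.get((conll, joint))


pretrain_then_finetune = stages_for(conll=True, joint=False)
```

Here "source" stands for the CoNLL-2003 news data and "target" for the chosen --tgt_dm domain.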

BiLSTM-CRF (Lample et al. 2016)

Jointly train BiLSTM-CRF (word + character level) on the source domain and the target (politics) domain. We use glove.6B.300d.txt for word-level embeddings and torchtext.vocab.CharNGram() for character-level embeddings.

❱❱❱ python main.py --exp_name politics_bilstm_wordchar --exp_id 1 --num_tag 19 --tgt_dm politics --bilstm --dropout 0.3 --lr 1e-3 --usechar --emb_dim 400
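The --emb_dim 400 in this command is consistent with concatenating a 300-dimensional GloVe word vector with a character n-gram vector of 100 dimensions. The sketch below shows that concatenation with plain Python lists. The 100-dimensional size for CharNGram, the zero-vector fallback for unknown tokens, and the names `embed_token`, `word_vectors`, and `char_vectors` are assumptions for illustration; the repository's own embedding code may differ.

```python
from typing import Dict, List

WORD_DIM = 300  # size of glove.6B.300d vectors
CHAR_DIM = 100  # assumed size of the character n-gram vectors


def embed_token(token: str,
                word_vectors: Dict[str, List[float]],
                char_vectors: Dict[str, List[float]]) -> List[float]:
    """Concatenate the word-level and character-level vectors for one token."""
    # glove.6B is uncased, so look words up in lowercase; unknown tokens get zeros.
    word_part = word_vectors.get(token.lower(), [0.0] * WORD_DIM)
    char_part = char_vectors.get(token, [0.0] * CHAR_DIM)
    return word_part + char_part


# Toy lookup tables standing in for the real pre-trained embeddings.
word_vectors = {"senate": [0.5] * WORD_DIM}
char_vectors = {"Senate": [0.1] * CHAR_DIM}
vector = embed_token("Senate", word_vectors, char_vectors)
```

Dropping --usechar would presumably leave only the word-level part, in which case the embedding dimension would need to change accordingly.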

Coach (Liu et al. 2020)

Jointly train Coach (word + character level) on the source domain and the target (politics) domain.

❱❱❱ python main.py --exp_name politics_coach_wordchar --exp_id 1 --num_tag 3 --entity_enc_hidden_dim 200 --tgt_dm politics --coach --dropout 0.5 --lr 1e-4 --usechar --emb_dim 400

Other Notes

In the aforementioned baselines, we provide running commands for the politics target domain as an example. The running commands for other target domains can be found in the run.sh file.

Bug Report

Feel free to create an issue or send an email to [email protected].

Comments

Vocab files

Hello,

Thank you for sharing the code and the datasets. I am trying to reproduce the experiments, but I am getting an error because the vocab.txt file is not present in any of the domain folders. I am trying to run the baseline python main.py --exp_name politics_bilstm_wordchar --exp_id 1 --tgt_dm politics --bilstm --dropout 0.3 --lr 1e-3 --usechar --emb_dim 400, but I am getting the error: FileNotFoundError: [Errno 2] No such file or directory: 'ner_data/conll2003/vocab.txt'

Maybe I am missing a step that generates the vocab files.

Regards,

opened by alejosierra 3
Pre-train then Fine-tune Comparing with Jointly Train

Hi, I am a little confused about the meaning of "Pre-train" in this paper. It seems that sometimes it refers to the span-level MLM task and sometimes to the NER task. According to the repo, pre-training on the source domain in Pre-train then Fine-tune means performing the NER task on the source domain, not the MLM task. So the main difference between Pre-train then Fine-tune and Jointly Train is whether the model first trains on the source domain and the best model is then trained on the target domain, or whether source and target domain data (including the target domain augmentation) are mixed in a single training stage. Do I understand it correctly?

opened by luoqiaoyang 2
Pre-training source domain stops on the 2nd epoch

Hi, I have another question regarding pre-training on the source domain (conll) when doing pre-train then fine-tune. Here https://github.com/zliucr/CrossNER/blob/2e7ba2a7798c961e3f29fbc51252c5a8d40224bf/src/trainer.py#L121 the training is set to stop after 2 epochs. Is this by design? I could not find anything about it in the paper.

opened by alejosierra 2
BERT Training Epoch Number

Hi, thank you for open-sourcing your work. In run_language_modeling.py, I notice that you set "num_train_epochs" to 15. Is there a reason for that? The default value in the huggingface script is 3, and there isn't an evaluation file. Is there any risk of overfitting?

opened by FrankCast1e 2
AI domain has "programlang" entity not recorded in entity categories

According to the preprint here https://arxiv.org/pdf/2012.04373.pdf, the domain "AI" is not supposed to have entity "programlang". But the entity is in https://github.com/zliucr/CrossNER/blob/main/ner_data/ai/train.txt#L96. Can I ask what is the correct list of entities for AI and other domains? Thank you.

opened by nguyenvanhoang7398 1
Vocab file

Hello,

Thank you for this great work.

I am getting this error: FileNotFoundError: [Errno 2] No such file or directory: 'ner_data/conll2003/vocab.txt'

Could you please provide the vocab file?

Thanks.

opened by Tinarights 0
Request to share pretrained model checkpoints

Hi, Thanks for open-sourcing your work. I was exploring this repo and was curious to reproduce these results. Since domain-adaptive pre-training is compute heavy and expensive, could you share the pre-trained weights to enable experimentation on your datasets? For example: One would need "politics_spanlevel_integrated/pytorch_model.bin" to train any baseline for politics domain. It would be great if you could share these model files.

PS: vocab.txt files are also missing in the data folder. Although one can create it easily, it would be great if you could share your version to ensure consistency.

Thanks, -Nitesh

opened by NiteshMethani 0

Owner

Zihan Liu

Ph.D. Candidate at HKUST CAiRE. I work on natural language processing, multilingual, dialogue, cross-domain adaptation.

