Negative Sampling for NER
The unlabeled entity problem is prevalent in many NER scenarios (e.g., weakly supervised NER). Our ICLR 2021 paper proposes negative sampling as a solution to this issue. This repository contains the implementation of our approach.
Note that this is not an officially supported Tencent product.
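For intuition, below is a minimal sketch of the span-level negative sampling idea: instead of treating every unannotated span as a negative ("non-entity") example, only a small random subset of them is used for training, which reduces the chance of learning from spans that are actually unlabeled entities. All names and the sampling ratio here are illustrative assumptions, not the repo's actual API.

import random

def sample_negative_spans(sent_len, labeled_entities, ratio=0.35, max_span_len=10):
    # Annotated entity spans (start, end) are excluded from the candidates.
    positives = {(start, end) for start, end, _ in labeled_entities}
    candidates = [
        (i, j)
        for i in range(sent_len)
        for j in range(i, min(i + max_span_len, sent_len))
        if (i, j) not in positives
    ]
    # Keep only a small random subset of the non-annotated spans as negative
    # training examples; the sample size used here is an illustrative choice.
    k = min(len(candidates), max(1, int(ratio * sent_len)))
    return random.sample(candidates, k)

# Example with the first sentence from the data format shown below.
entities = [(0, 0, 'ORG'), (5, 6, 'PER'), (10, 10, 'ORG')]
print(sample_negative_spans(sent_len=13, labeled_entities=entities))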
Preparation
Preparation takes two steps. First, reformulate the NER data and move it into a new folder named "dataset". The folder should contain train.json, dev.json, and test.json. Each JSON file is a list of dicts, as in the following example:
[
{
"sentence": "['Somerset', '83', 'and', '174', '(', 'P.', 'Simmons', '4-38', ')', ',', 'Leicestershire', '296', '.']",
"labeled entities": "[(0, 0, 'ORG'), (5, 6, 'PER'), (10, 10, 'ORG')]",
},
{
"sentence": "['Leicestershire', '22', 'points', ',', 'Somerset', '4', '.']",
"labeled entities": "[(0, 0, 'ORG'), (4, 4, 'ORG')]",
}
]
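Note that "sentence" and "labeled entities" are strings holding Python literals rather than JSON arrays, so a loader has to parse them back into lists and tuples. A minimal loading sketch (the function name is hypothetical; each tuple is read as (start index, end index, entity type), following the example above):

import ast
import json

def load_split(path):
    # Each file is a JSON list of dicts; both fields are Python-literal strings.
    with open(path, "r", encoding="utf-8") as f:
        raw = json.load(f)
    data = []
    for item in raw:
        tokens = ast.literal_eval(item["sentence"])             # list of str
        entities = ast.literal_eval(item["labeled entities"])   # list of (start, end, type)
        data.append((tokens, entities))
    return data

train = load_split("dataset/train.json")
print(train[0])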
Second, prepare the pretrained LM (i.e., BERT) and the evaluation script. Create a directory named "resource" and arrange the files as follows:
- resource
    - bert-base-cased
        - model.pt
        - vocab.txt
    - conlleval.pl
Note that the files extracted from BERT.tar.gz need to be renamed to match the layout above.
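Before training, a small script like the following can verify that the expected files are in place. This is only a sanity check written against the tree above, not part of the repo.

import os

expected = [
    "resource/bert-base-cased/model.pt",
    "resource/bert-base-cased/vocab.txt",
    "resource/conlleval.pl",
]
for path in expected:
    status = "ok" if os.path.isfile(path) else "MISSING"
    print(f"{status:7s} {path}")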
Training and Test
CUDA_VISIBLE_DEVICES=0 python main.py -dd dataset -cd save -rd resource
Citation
@inproceedings{li2021empirical,
title={Empirical Analysis of Unlabeled Entity Problem in Named Entity Recognition},
author={Yangming Li and Lemao Liu and Shuming Shi},
booktitle={International Conference on Learning Representations},
year={2021},
url={https://openreview.net/forum?id=5jRVa89sZk}
}