Data and evaluation code for the paper WikiNEuRal: Combined Neural and Knowledge-based Silver Data Creation for Multilingual NER.
```bibtex
@inproceedings{tedeschi-etal-2021-wikineural-combined,
    title = "{W}iki{NE}u{R}al: {C}ombined Neural and Knowledge-based Silver Data Creation for Multilingual {NER}",
    author = "Tedeschi, Simone and
      Maiorca, Valentino and
      Campolungo, Niccol{\`o} and
      Cecconi, Francesco and
      Navigli, Roberto",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2021",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-emnlp.215",
    pages = "2521--2533",
    abstract = "Multilingual Named Entity Recognition (NER) is a key intermediate task which is needed in many areas of NLP. In this paper, we address the well-known issue of data scarcity in NER, especially relevant when moving to a multilingual scenario, and go beyond current approaches to the creation of multilingual silver data for the task. We exploit the texts of Wikipedia and introduce a new methodology based on the effective combination of knowledge-based approaches and neural models, together with a novel domain adaptation technique, to produce high-quality training corpora for NER. We evaluate our datasets extensively on standard benchmarks for NER, yielding substantial improvements up to 6 span-based F1-score points over previous state-of-the-art systems for data creation.",
}
```
Please consider citing our work if you use data and/or code from this repository.
In a nutshell, WikiNEuRal is a novel technique that builds upon a multilingual lexical knowledge base (i.e., BabelNet) and transformer-based architectures (i.e., BERT) to produce high-quality annotations for multilingual NER. It shows consistent improvements of up to 6 span-based F1-score points over state-of-the-art alternative data-creation methods on common NER benchmarks. Moreover, in our paper we also present a new approach for creating interpretable word embeddings, together with a domain-adaptation algorithm, which enables WikiNEuRal to create domain-specific training corpora.
## Data
| Dataset Version | Sentences | Tokens | PER | ORG | LOC | MISC | OTHER |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| WikiNEuRal EN | 116k | 2.73M | 51k | 31k | 67k | 45k | 2.40M |
| WikiNEuRal ES | 95k | 2.33M | 43k | 17k | 68k | 25k | 2.04M |
| WikiNEuRal NL | 107k | 1.91M | 46k | 22k | 61k | 24k | 1.64M |
| WikiNEuRal DE | 124k | 2.19M | 60k | 32k | 59k | 25k | 1.87M |
| WikiNEuRal RU | 123k | 2.39M | 40k | 26k | 89k | 25k | 2.13M |
| WikiNEuRal IT | 111k | 2.99M | 67k | 22k | 97k | 26k | 2.62M |
| WikiNEuRal FR | 127k | 3.24M | 76k | 25k | 101k | 29k | 2.83M |
| WikiNEuRal PL | 141k | 2.29M | 59k | 34k | 118k | 22k | 1.91M |
| WikiNEuRal PT | 106k | 2.53M | 44k | 17k | 112k | 25k | 2.20M |
| WikiNEuRal EN DA (CoNLL) | 29k | 759k | 12k | 23k | 6k | 3k | 0.54M |
| WikiNEuRal NL DA (CoNLL) | 34k | 598k | 17k | 8k | 18k | 6k | 0.51M |
| WikiNEuRal DE DA (CoNLL) | 41k | 706k | 17k | 12k | 23k | 3k | 0.61M |
| WikiNEuRal EN DA (OntoNotes) | 48k | 1.18M | 20k | 13k | 38k | 12k | 1.02M |
Further datasets, such as the combination of WikiNEuRal with gold-standard training data (i.e., CoNLL) or the gold-standard datasets themselves, can be obtained by simply concatenating the two `train.conllu` files together (e.g., concatenating `data/conll/en/train.conllu` and `data/wikineural/en/train.conllu` gives CoNLL+WikiNEuRal).
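For example, a minimal shell sketch of this concatenation for English; note that the output path below is illustrative, not part of the repository layout:

```bash
# Combine gold CoNLL and silver WikiNEuRal English training data.
# The output directory and file name are just an example.
mkdir -p data/conll+wikineural/en
cat data/conll/en/train.conllu data/wikineural/en/train.conllu \
    > data/conll+wikineural/en/train.conllu
```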
## How to use
- To train 10 models on CoNLL English, run:

  ```
  python run.py -m +train.seed_idx=0,1,2,3,4,5,6,7,8,9 data.datamodule.source=conll data.datamodule.language=en
  ```

  **Note**: for the EN, ES, NL and DE versions of WikiNEuRal, you can use the CoNLL splits as validation and testing material (e.g., copy `data/conll/en/val.conllu` into `data/wikineural/en/`). Similarly, for RU and PL you can use the BSNLP splits. For the other languages, you can use `scripts/create_splits.py` to split a given `train.conllu` file into train, dev and test sets (a sketch of this workflow is shown after this list).
- To produce results for the 10 trained models, run:

  ```
  bash test.sh
  ```

  `test.sh` also contains more complex bash `for` loops that can produce results on multiple datasets / models at once.
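As referenced in the note above, validation and test splits can either be copied from the gold corpora or generated from a silver `train.conllu` file. Below is a minimal sketch; the positional arguments passed to `scripts/create_splits.py` are an assumption about its interface, not its documented usage, so check the script's argument parser before running it:

```bash
# Reuse the gold CoNLL validation split for English WikiNEuRal, as suggested above.
cp data/conll/en/val.conllu data/wikineural/en/

# For a language without gold splits (e.g., Italian), derive train/dev/test
# from the silver corpus. NOTE: the arguments below are assumed, not the
# script's documented interface.
python scripts/create_splits.py data/wikineural/it/train.conllu data/wikineural/it/
```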
## License
WikiNEuRal is licensed under the CC BY-NC-SA 4.0 license. The text of the license is available at https://creativecommons.org/licenses/by-nc-sa/4.0/.
We emphasize that the raw sentences were extracted from Wikipedia (wikipedia.org), while the NER annotations were produced by Babelscape.
## Acknowledgments
We gratefully acknowledge the support of the ERC Consolidator Grant MOUSSE No. 726487 under the European Union's Horizon 2020 research and innovation programme (http://mousse-project.org/).
This work was also supported by the PerLIR project (Personal Linguistic resources in Information Retrieval) funded by the MIUR Progetti di ricerca di Rilevante Interesse Nazionale programme (PRIN2017).