KIND: an Italian Multi-Domain Dataset for Named Entity Recognition

Digital Humanities

Last update: Jun 21, 2022

Overview

KIND (Kessler Italian Named-entities Dataset)

KIND is an Italian dataset for Named-Entity Recognition.

It contains more than one million tokens annotated with three classes: persons, locations, and organizations. Most of the dataset (around 600K tokens) carries manual gold annotations and spans three domains: news, literature, and political discourse.

To build the dataset, we use only freely available texts, released under licenses that permit both research and commercial use.

In particular, we release four chapters, with texts taken from: (i) Wikinews (WN), as a source of news texts from the last decades; (ii) Italian fiction books (FIC) whose authors died more than 70 years ago; (iii) writings and speeches by the Italian politician Aldo Moro (AM); and (iv) writings and speeches by the Italian politician Alcide De Gasperi (ADG).

Wikinews

Wikinews is a free, multilingual project of collaborative journalism. The Italian chapter contains more than 11,000 news articles, released under the Creative Commons Attribution 2.5 License.

In building KIND, we randomly choose 1,000 articles, evenly distributed over the last 20 years, for a total of 308,622 tokens.
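One way to draw a random sample that is evenly spread over publication years is to sample a fixed quota from each year. The sketch below illustrates that idea only; it is not the procedure used to build KIND, and the function name `sample_evenly_by_year`, the `(article_id, year)` input shape, and the toy data are all assumptions made for the example.

```python
import random
from collections import defaultdict


def sample_evenly_by_year(articles, total, seed=0):
    """Pick about `total` article ids, spread evenly across years.

    `articles` is an iterable of (article_id, year) pairs (hypothetical shape).
    """
    rng = random.Random(seed)
    by_year = defaultdict(list)
    for article_id, year in articles:
        by_year[year].append(article_id)

    years = sorted(by_year)
    quota, remainder = divmod(total, len(years))
    chosen = []
    for index, year in enumerate(years):
        # The first `remainder` years get one extra article.
        wanted = quota + (1 if index < remainder else 0)
        pool = by_year[year]
        # Never ask for more articles than a year actually has.
        chosen.extend(rng.sample(pool, min(wanted, len(pool))))
    return chosen


# Toy data: 20 years with 100 made-up article ids each.
toy_articles = [(f"{year}-{i}", year) for year in range(2002, 2022) for i in range(100)]
picked = sample_evenly_by_year(toy_articles, total=1000)
print(len(picked))
```

If a year has fewer articles than its quota, this sketch simply returns fewer than `total` items rather than redistributing the shortfall.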

Literature

For fiction, we annotate 86 book chapters taken from 10 books by Italian authors who all died more than 70 years ago, for a total of 192,448 tokens. The plain texts are taken from the Liber Liber website.

In particular, we choose: Il giorno delle Mésules (Ettore Castiglioni, 12,853 tokens), L'amante di Cesare (Augusto De Angelis, 13,464 tokens), Canne al vento (Grazia Deledda, 13,945 tokens), 1861-1911 - Cinquant’anni di vita nazionale ricordati ai fanciulli (Guido Fabiani, 10,801 tokens), Lettere dal carcere (Antonio Gramsci, 10,655 tokens), Anarchismo e democrazia (Errico Malatesta, 11,557 tokens), L'amore negato (Maria Messina, 31,115 tokens), La luna e i falò (Cesare Pavese, 10,705 tokens), La coscienza di Zeno (Italo Svevo, 56,364 tokens), Le cose più grandi di lui (Luciano Zuccoli, 20,989 tokens).
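As a quick consistency check, the per-book counts listed above can be added up and compared with the stated total of 192,448 tokens. The dictionary below just transcribes the figures from this section, keyed by author surname for brevity; the variable names are illustrative.

```python
# Token counts per book, as listed above (keyed by author surname).
fiction_tokens = {
    "Castiglioni": 12_853,
    "De Angelis": 13_464,
    "Deledda": 13_945,
    "Fabiani": 10_801,
    "Gramsci": 10_655,
    "Malatesta": 11_557,
    "Messina": 31_115,
    "Pavese": 10_705,
    "Svevo": 56_364,
    "Zuccoli": 20_989,
}

fiction_total = sum(fiction_tokens.values())
print(len(fiction_tokens), fiction_total)  # 10 192448
```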

In selecting works no longer under copyright, we favored texts that are as recent as possible, so that models trained on this data transfer well to contemporary novels, whose language is more likely to resemble that of recent works.

Aldo Moro's Works

Aldo Moro's writings have recently been collected by the University of Bologna and published on a platform called Edizione Nazionale delle Opere di Aldo Moro.

The project is still ongoing and currently contains 806 documents, for a total of about one million tokens.

In the first release of KIND, we include 392,604 tokens from Aldo Moro's works, with silver annotations (see the reference below).

Alcide De Gasperi's Writings

Finally, we annotate 158 documents (150,632 tokens) from Alcide Digitale, spanning 50 years of European history.

The complete corpus contains a comprehensive collection of Alcide De Gasperi’s public documents, 2,762 in total, written or transcribed between 1901 and 1954.
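Putting the four chapters together, the token counts quoted in the sections above can be summed to check the headline figures in the overview. The gold/silver split below is an assumption read off the text: only the Aldo Moro chapter is described as having silver annotations, so the other three are treated as manually annotated.

```python
# Token counts per chapter, taken from the sections above.
gold_tokens = {"WN": 308_622, "FIC": 192_448, "ADG": 150_632}  # assumed manual (gold)
silver_tokens = {"AM": 392_604}  # described as silver annotations

gold_total = sum(gold_tokens.values())
overall_total = gold_total + sum(silver_tokens.values())

print(gold_total)     # 651702
print(overall_total)  # 1044306
```

Under that assumption, the overall total is consistent with "more than one million tokens", and the manually annotated part comes to roughly 650K tokens.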

License

The NER annotations in (i), (ii), and (iii) are released under the Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license. Annotations from Alcide De Gasperi's writings are released under the Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.

Comments

IOB2 reference implementation
Hi,

many thanks for releasing this new resource for Italian NER!

I would like to integrate this dataset into our Flair library and into Hugging Face Datasets, but I have a question about the labeling scheme. At the moment the dataset has this format:

In O fiamme O l' O Istituto LOC Lama LOC Tzong LOC Khapa LOC , O monastero O buddhista O

One could just write a conversion script to convert the labels into IOB2 format, so that the data takes this form:

In O fiamme O l' O Istituto B-LOC Lama I-LOC Tzong I-LOC Khapa I-LOC , O monastero O buddhista O

However, I would like to know whether you could provide a reference implementation or the IOB2-converted dataset, so that the conversion results can be compared.

Many thanks!
opened by stefan-it
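The conversion asked about in the comment above can be sketched in a few lines. This is not a reference implementation from the KIND authors: the function name `to_iob2` is made up for the example, and the rule it applies (a label opens a new entity whenever it differs from the previous token's label) is an assumption about how the flat labels should be read.

```python
def to_iob2(labels, outside="O"):
    """Convert flat per-token labels (e.g. LOC, O) into IOB2 tags (B-LOC, I-LOC, O)."""
    converted = []
    previous = outside
    for label in labels:
        if label == outside:
            converted.append(outside)
        elif label == previous:
            # Same class as the previous token: continue the entity.
            converted.append("I-" + label)
        else:
            # Class changed (or previous token was outside): start a new entity.
            converted.append("B-" + label)
        previous = label
    return converted


# The example sentence from the comment above.
tokens = "In fiamme l' Istituto Lama Tzong Khapa , monastero buddhista".split()
flat_labels = ["O", "O", "O", "LOC", "LOC", "LOC", "LOC", "O", "O", "O"]

for token, tag in zip(tokens, to_iob2(flat_labels)):
    print(token, tag)
```

Two limitations follow from the flat scheme itself: two adjacent entities of the same class cannot be told apart and are merged into one span, and the function should be called once per sentence so that an entity at the end of one sentence does not continue into the next.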
