An open-source Kazakh named entity recognition dataset (KazNERD), annotation guidelines, and baseline NER models.

ISSAI

Last update: Dec 23, 2022


Overview

Kazakh Named Entity Recognition

This repository contains an open-source Kazakh named entity recognition dataset (KazNERD), named entity annotation guidelines (in Kazakh), and training code for NER models (CRF, BiLSTM-CNN-CRF, BERT, and XLM-RoBERTa).

KazNERD Corpus
Annotation Guidelines
NER Models
Citation

1. KazNERD Corpus

KazNERD contains 112,702 sentences extracted from television news text, with 136,333 annotations across 25 entity classes. All sentences in the dataset were manually annotated by two native Kazakh-speaking linguists under the supervision of an ISSAI researcher. The IOB2 scheme was used for annotation. The dataset, in CoNLL 2002 format, is located here.
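To make the format concrete, here is a minimal sketch of reading CoNLL-style lines and turning IOB2 tags into entity spans. It assumes one token per line with the token in the first column and the tag in the last, and a blank line between sentences; the exact column layout of the KazNERD files should be checked against the dataset. The sample tokens and the tag names `PERSON` and `GPE` are invented for illustration and are not taken from the corpus.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]  # (token, IOB2 tag) pairs


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Group 'token ... tag' lines into sentences separated by blank lines."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the sentence collected so far.
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        # Assumption: token is the first column, tag is the last column.
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


def iob2_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Return (label, start, end) spans, end exclusive, from IOB2 tags.

    An I- tag that does not continue an open span of the same label is
    invalid IOB2 and is ignored here.
    """
    spans: List[Tuple[str, int, int]] = []
    label, start = None, 0
    for i, tag in enumerate(tags + ["O"]):  # the sentinel closes a final span
        continues = tag.startswith("I-") and label == tag[2:]
        if label is not None and not continues:
            spans.append((label, start, i))
            label = None
        if tag.startswith("B-"):
            label, start = tag[2:], i
    return spans


# Invented sample, not from KazNERD.
SAMPLE = [
    "Abai B-PERSON",
    "Qunanbaiuly I-PERSON",
    "visited O",
    "Semey B-GPE",
    "",
    "Hello O",
]

if __name__ == "__main__":
    for sentence in read_conll(SAMPLE):
        tokens = [tok for tok, _ in sentence]
        for label, start, end in iob2_spans([tag for _, tag in sentence]):
            print(label, " ".join(tokens[start:end]))
```

In IOB2 every entity starts with a `B-` tag, so two adjacent entities of the same class remain distinguishable, which is why the span extractor closes the open span whenever it sees a new `B-` tag.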

2. Annotation Guidelines

The annotation guidelines followed to build KazNERD are located here. They contain rules and examples for annotating each of the 25 named entity classes, and are written in Kazakh.

3. NER Models

3.1 CRF

Conda Environment Setup for CRF

The CRF-based NER model training code targets Python 3.8. To make the experiments easier to replicate, we recommend setting up a Conda environment.

conda create --name knerdCRF python=3.8
conda activate knerdCRF
conda install -c anaconda nltk scikit-learn
conda install -c conda-forge sklearn-crfsuite seqeval

Start CRF training

$ cd crf
$ python runCRF_KazNERD.py
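For readers unfamiliar with CRF taggers: sklearn-crfsuite models are trained on one feature dictionary per token, grouped into one list per sentence. The sketch below shows what such a feature function might look like. The feature names and the choice of features are assumptions made for illustration; they are not the features used by runCRF_KazNERD.py.

```python
from typing import Dict, List, Union

Feature = Union[str, bool]


def token_features(tokens: List[str], i: int) -> Dict[str, Feature]:
    """Build a hand-crafted feature dictionary for the token at position i."""
    word = tokens[i]
    features: Dict[str, Feature] = {
        "bias": True,
        "word.lower": word.lower(),
        "suffix3": word[-3:],  # last three characters, a rough morphology cue
        "word.istitle": word.istitle(),
        "word.isdigit": word.isdigit(),
    }
    # Context features from the neighbouring tokens, with explicit
    # beginning-of-sentence and end-of-sentence markers at the edges.
    if i > 0:
        features["prev.lower"] = tokens[i - 1].lower()
    else:
        features["BOS"] = True
    if i < len(tokens) - 1:
        features["next.lower"] = tokens[i + 1].lower()
    else:
        features["EOS"] = True
    return features


def sentence_features(tokens: List[str]) -> List[Dict[str, Feature]]:
    """One feature dictionary per token, in sentence order."""
    return [token_features(tokens, i) for i in range(len(tokens))]
```

A list of such per-sentence feature lists, paired with the matching lists of IOB2 tags, is the input shape that `sklearn_crfsuite.CRF.fit` expects.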

3.2 BiLSTM-CNN-CRF

Conda Environment Setup for BiLSTM-CNN-CRF

The BiLSTM-CNN-CRF-based NER model training code targets Python 3.8 and PyTorch 1.7.1. To make the experiments easier to replicate, we recommend setting up a Conda environment.

conda create --name knerdLSTM python=3.8
conda activate knerdLSTM
# Check https://pytorch.org/get-started/previous-versions/#v171
# to install a PyTorch version suitable for your OS and CUDA
# or feel free to adapt the code to a newer PyTorch version
conda install pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=10.1 -c pytorch   # we used this version
conda install -c conda-forge tqdm seqeval

Start BiLSTM-CNN-CRF training

$ cd BiLSTM_CNN_CRF
$ bash run_train_p.sh

3.3 BERT and XLM-RoBERTa

Conda Environment Setup for BERT and XLM-RoBERTa

The BERT- and XLM-RoBERTa-based NER model training code targets Python 3.8 and PyTorch 1.7.1. To make the experiments easier to replicate, we recommend setting up a Conda environment.

conda create --name knerdBERT python=3.8
conda activate knerdBERT
# Check https://pytorch.org/get-started/previous-versions/#v171
# to install a PyTorch version suitable for your OS and CUDA
# or feel free to adapt the code to a newer PyTorch version
conda install pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=10.1 -c pytorch   # we used this version
conda install -c anaconda numpy
conda install -c conda-forge seqeval
pip install transformers
pip install datasets
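
Before starting a training run, it can help to confirm that the environment actually contains the packages installed above. The following is a small sketch using only the standard library; the helper name `missing_packages` and the list `BERT_ENV` are hypothetical and not part of the repository.

```python
import importlib.util
from typing import Iterable, List


def missing_packages(import_names: Iterable[str]) -> List[str]:
    """Return the top-level import names that cannot be found in this environment."""
    return [
        name for name in import_names
        if importlib.util.find_spec(name) is None
    ]


# Import names for the packages installed in the knerdBERT environment.
BERT_ENV = ["torch", "numpy", "seqeval", "transformers", "datasets"]

if __name__ == "__main__":
    missing = missing_packages(BERT_ENV)
    print("missing:", ", ".join(missing) if missing else "none")
```

The same helper can be reused for the other environments by passing their import names, for example `sklearn_crfsuite` and `nltk` for the CRF setup.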

Start BERT training

$ cd bert
$ python run_finetune_kaznerd.py bert

Start XLM-RoBERTa training

$ cd bert
$ python run_finetune_kaznerd.py roberta

4. Citation

@misc{yeshpanov2021kaznerd,
      title={KazNERD: Kazakh Named Entity Recognition Dataset}, 
      author={Rustem Yeshpanov and Yerbolat Khassanov and Huseyin Atakan Varol},
      year={2021},
      eprint={2111.13419},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Comments

Script for training BERT model doesn't save the model...

The script run_finetune_kaznerd.py does not save the model after training, which makes it impossible to use for inference. Assuming the script uses the Hugging Face Trainer, this can be fixed by calling trainer.save_model(model_path) right after trainer.train().

opened by illided 0
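
The fix suggested in this comment can be sketched as follows. The sketch assumes that run_finetune_kaznerd.py builds a Hugging Face `transformers.Trainer`, which the comment implies but which is not confirmed here; on that class the method that persists the model is `save_model`. The helper name `train_and_save` and the `output_dir` argument are hypothetical.

```python
from pathlib import Path


def train_and_save(trainer, output_dir: str) -> Path:
    """Run training, then write the model to output_dir for later inference.

    `trainer` is expected to behave like transformers.Trainer, that is, to
    expose train() and save_model(output_dir). This is an assumption about
    the training script, not something taken from the repository.
    """
    trainer.train()
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Persist the trained model so it can be reloaded without retraining.
    trainer.save_model(str(out))
    return out
```

Under the same assumption, a model saved this way could later be reloaded from the directory with `AutoModelForTokenClassification.from_pretrained`.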
