PhoNLP: A BERT-based multi-task learning toolkit for part-of-speech tagging, named entity recognition and dependency parsing

VinAI Research

Last update: Dec 2, 2022

Related tags

Text Data & NLP named-entity-recognition ner pos-tagging dependency-parsing vietnamese-nlp multi-task-learning

Overview

PhoNLP: A joint multi-task learning model for Vietnamese part-of-speech tagging, named entity recognition and dependency parsing

PhoNLP is a multi-task learning model for joint part-of-speech (POS) tagging, named entity recognition (NER) and dependency parsing. Experiments on Vietnamese benchmark datasets show that PhoNLP produces state-of-the-art results, outperforming a single-task learning approach that fine-tunes the pre-trained Vietnamese language model PhoBERT for each task independently.

Details of the PhoNLP model architecture and experimental results can be found in our paper:

@article{PhoNLP,
title     = {{PhoNLP: A joint multi-task learning model for Vietnamese part-of-speech tagging, named entity recognition and dependency parsing}},
author    = {Linh The Nguyen and Dat Quoc Nguyen},
journal   = {arXiv preprint},
volume    = {arXiv:2101.01476},
year      = {2021}
}

Please CITE our paper when PhoNLP is used to help produce published results or is incorporated into other software.

Although we specify PhoNLP for Vietnamese, the usage examples below can work directly for other languages that have gold annotated corpora available for the three tasks of POS tagging, NER and dependency parsing, and a pre-trained BERT-based language model available from transformers.

Installation

Python version >= 3.6; PyTorch version >= 1.4.0
PhoNLP can be installed using pip as follows: pip3 install phonlp

Alternatively, PhoNLP can be installed from source with the following commands:

 git clone https://github.com/VinAIResearch/PhoNLP
 cd PhoNLP
 pip3 install -e .

Usage example: Command lines

To run the command-line examples, please install phonlp from source:

git clone https://github.com/VinAIResearch/PhoNLP
cd PhoNLP
pip3 install -e .

Training

cd phonlp/models
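python3 run_phonlp.py --mode train --save_dir <model_folder_path> \
	--pretrained_lm <transformers_pretrained_model> \
	--lr <float_value> --batch_size <int_value> --num_epoch <int_value> \
	--lambda_pos <float_value> --lambda_ner <float_value> --lambda_dep <float_value> \
	--train_file_pos <path_to_training_file_pos> --eval_file_pos <path_to_validation_file_pos> \
	--train_file_ner <path_to_training_file_ner> --eval_file_ner <path_to_validation_file_ner> \
	--train_file_dep <path_to_training_file_dep> --eval_file_dep <path_to_validation_file_dep>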

--lambda_pos, --lambda_ner and --lambda_dep represent mixture weights associated with POS tagging, NER and dependency parsing losses, respectively, and lambda_pos + lambda_ner + lambda_dep = 1.
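
The role of the mixture weights can be sketched as a weighted sum of the three per-task losses. This is a minimal illustration only, not PhoNLP's actual training code: the helper name combine_losses and the use of plain floats for the losses are assumptions made for the example.

```python
import math


def combine_losses(loss_pos, loss_ner, loss_dep,
                   lambda_pos=0.4, lambda_ner=0.2, lambda_dep=0.4):
    """Return the weighted sum of the three task losses.

    Hypothetical helper for illustration; the mixture weights must sum to 1.
    """
    total_weight = lambda_pos + lambda_ner + lambda_dep
    if not math.isclose(total_weight, 1.0, abs_tol=1e-8):
        raise ValueError(f"mixture weights must sum to 1, got {total_weight}")
    return lambda_pos * loss_pos + lambda_ner * loss_ner + lambda_dep * loss_dep


# Example with made-up loss values for one training step.
joint_loss = combine_losses(loss_pos=1.0, loss_ner=2.0, loss_dep=0.5)
print(joint_loss)
```

With these weights, the NER loss contributes half as much per unit of loss as the POS tagging or dependency parsing loss.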

Example:

cd phonlp/models
python3 run_phonlp.py --mode train --save_dir ./phonlp_tmp \
	--pretrained_lm "vinai/phobert-base" \
	--lr 1e-5 --batch_size 32 --num_epoch 40 \
	--lambda_pos 0.4 --lambda_ner 0.2 --lambda_dep 0.4 \
	--train_file_pos ../sample_data/pos_train.txt --eval_file_pos ../sample_data/pos_valid.txt \
	--train_file_ner ../sample_data/ner_train.txt --eval_file_ner ../sample_data/ner_valid.txt \
	--train_file_dep ../sample_data/dep_train.conll --eval_file_dep ../sample_data/dep_valid.conll

Evaluation

cd phonlp/models
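python3 run_phonlp.py --mode eval --save_dir <model_folder_path> \
	--batch_size <int_value> \
	--eval_file_pos <path_to_test_file_pos> \
	--eval_file_ner <path_to_test_file_ner> \
	--eval_file_dep <path_to_test_file_dep>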

Example:

cd phonlp/models
python3 run_phonlp.py --mode eval --save_dir ./phonlp_tmp \
	--batch_size 8 \
	--eval_file_pos ../sample_data/pos_test.txt \
	--eval_file_ner ../sample_data/ner_test.txt \
	--eval_file_dep ../sample_data/dep_test.conll
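
Dependency parsing quality is commonly reported as unlabeled and labeled attachment scores (UAS and LAS). The sketch below shows how these two numbers are defined; it is not PhoNLP's evaluation code. The function name attachment_scores and the toy gold and predicted analyses are assumptions, and details such as whether punctuation is counted may differ in the actual evaluation.

```python
def attachment_scores(gold, pred):
    """Compute (UAS, LAS) in percent.

    gold, pred: lists of (head_index, relation) pairs, one per token.
    UAS counts tokens with the correct head; LAS also requires the
    correct relation label.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same number of tokens")
    if not gold:
        return 0.0, 0.0
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    las_hits = sum(g == p for g, p in zip(gold, pred))
    total = len(gold)
    return 100.0 * uas_hits / total, 100.0 * las_hits / total


# Toy six-token sentence: one wrong label (token 4), one wrong head (token 5).
gold = [(3, "sub"), (3, "adv"), (0, "root"), (3, "loc"), (4, "pob"), (3, "punct")]
pred = [(3, "sub"), (3, "adv"), (0, "root"), (3, "adv"), (3, "pob"), (3, "punct")]
uas, las = attachment_scores(gold, pred)
print(f"UAS = {uas:.2f}, LAS = {las:.2f}")
```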

Annotate a corpus

cd phonlp/models
python3 run_phonlp.py --mode annotate --save_dir <model_folder_path> \
	--batch_size <int_value> \
	--input_file <path_to_input_file> \
	--output_file <path_to_output_file>

Example:

cd phonlp/models
python3 run_phonlp.py --mode annotate --save_dir ./phonlp_tmp \
	--batch_size 8 \
	--input_file ../sample_data/input.txt \
	--output_file ../sample_data/output.txt

A pre-trained PhoNLP model for Vietnamese is also available; it can be downloaded with phonlp.download(), as shown in the Python API example below.

Usage example: Python API

import phonlp
# Automatically download the pre-trained PhoNLP model 
# and save it in a local machine folder
phonlp.download(save_dir='./pretrained_phonlp')
# Load the pre-trained PhoNLP model
model = phonlp.load(save_dir='./pretrained_phonlp')
# Annotate a corpus where each line represents a word-segmented sentence
model.annotate(input_file='input.txt', output_file='output.txt')
# Annotate a word-segmented sentence
model.print_out(model.annotate(text="Tôi đang làm_việc tại VinAI ."))

By default, the output for each input sentence is formatted with 6 columns representing word index, word form, POS tag, NER label, head index of the current word and its dependency relation type:

1	Tôi	P	O	3	sub
2	đang	R	O	3	adv
3	làm_việc	V	O	0	root
4	tại	E	O	3	loc
5	VinAI	Np	B-ORG	4	prob
6	.	CH	O	3	punct
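
The 6-column output above can be read back into structured records for downstream processing. The sketch below assumes the columns are tab-separated, as displayed; the Token type and the parse_output_line helper are hypothetical names introduced for this example and are not part of the phonlp package.

```python
from typing import NamedTuple


class Token(NamedTuple):
    index: int    # word index within the sentence (1-based)
    form: str     # word form
    pos: str      # POS tag
    ner: str      # NER label
    head: int     # head index (0 marks the root)
    deprel: str   # dependency relation type


def parse_output_line(line):
    """Parse one tab-separated output row into a Token."""
    fields = [field.strip() for field in line.strip().split("\t")]
    if len(fields) != 6:
        raise ValueError(f"expected 6 columns, got {len(fields)}: {line!r}")
    index, form, pos, ner, head, deprel = fields
    return Token(int(index), form, pos, ner, int(head), deprel)


sample = (
    "1\tTôi\tP\tO\t3\tsub\n"
    "2\tđang\tR\tO\t3\tadv\n"
    "3\tlàm_việc\tV\tO\t0\troot\n"
    "4\ttại\tE\tO\t3\tloc\n"
    "5\tVinAI\tNp\tB-ORG\t4\tprob\n"
    "6\t.\tCH\tO\t3\tpunct\n"
)
tokens = [parse_output_line(row) for row in sample.splitlines() if row.strip()]
entity_words = [token.form for token in tokens if token.ner != "O"]
root = next(token for token in tokens if token.head == 0)
print(entity_words, root.form)
```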

In addition, the output can be formatted following the 10-column CoNLL format, where the last column holds the NER predictions. To do so, pass output_type='conll' to the model.annotate() function. The batch_size parameter of model.annotate() can also be raised from its default of 1 (batch_size=1) to fit your computer's memory; a larger batch_size generally gives faster annotation.
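
As a minimal sketch, the two options can be combined in a small wrapper. The wrapper name annotate_corpus_conll and the batch size of 8 are assumptions for illustration; only the output_type and batch_size parameters of model.annotate() come from the description above.

```python
def annotate_corpus_conll(model, input_file, output_file, batch_size=8):
    """Annotate a word-segmented corpus and write 10-column CoNLL output.

    `model` is expected to be the object returned by phonlp.load();
    this wrapper itself is a hypothetical convenience helper.
    """
    model.annotate(
        input_file=input_file,
        output_file=output_file,
        output_type="conll",
        batch_size=batch_size,
    )


# Typical use, assuming the pre-trained model was downloaded beforehand:
#   import phonlp
#   model = phonlp.load(save_dir='./pretrained_phonlp')
#   annotate_corpus_conll(model, 'input.txt', 'output.txt', batch_size=8)
```

If memory runs out during annotation, lowering batch_size is the first thing to try.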

Comments

What if I want to train NER task only?

Hello, thanks for publishing the code. I have a question; please help me clarify it. What should I change in the code if I only want to train the NER task and not the others?

opened by icyda17 10

PhoNLP doesn't train or evaluate the NER Task

I am running the PhoNLP code. However, the results show that it doesn't train or evaluate the NER task. I paste my parameters below:

Train:

python3 phonlp/models/run_phonlp.py --mode train --save_dir ./phonlp_tmp_save_model \
	--pretrained_lm "vinai/phobert-base" \
	--lr 1e-5 --batch_size 32 --num_epoch 40 \
	--lambda_pos 0.4 --lambda_ner 0.2 --lambda_dep 0.4 \
	--train_file_pos phonlp/sample_data/pos_train.txt --eval_file_pos phonlp/sample_data/pos_valid.txt \
	--train_file_ner phonlp/sample_data/ner_train.txt --eval_file_ner phonlp/sample_data/ner_valid.txt \
	--train_file_dep phonlp/sample_data/dep_train.conll --eval_file_dep phonlp/sample_data/dep_valid.conll \
	--output_file_dep phonlp/models/jointmodel/dep.out

Evaluate:

python3 phonlp/models/run_phonlp.py --mode eval --save_dir ./phonlp_tmp_save_model \
	--batch_size 8 \
	--eval_file_pos phonlp/sample_data/pos_test.txt \
	--eval_file_ner phonlp/sample_data/ner_test.txt \
	--eval_file_dep phonlp/sample_data/dep_test.conll \
	--output_file_dep phonlp/models/jointmodel/dep.out

The training and evaluation results:

Train result: Training ended with 39 epochs. Best dev las = 5.23, uas = 15.28, upos = 24.22, f1 = 0.0

Evaluate result: POS tagging: 24.25, NER: 0.00, Dependency parsing: 6.63/27.24

I would appreciate your reply. @datquocnguyen @thelinhbkhn2014 @dangne @maihoangdao Thank you!

opened by COCOMiss 2

Training on Covid19 dataset

Hi, I'm having a problem trying to train the model on the Vietnamese Covid 19 NER dataset. This dataset is in the .conll format, and I get the error shown below. I can't figure this error out. I wonder whether the cause is the .conll format (the sample data format is .txt) or whether I need to preprocess the Covid 19 dataset. Can you suggest a solution?

Traceback (most recent call last):
  File "run_phonlp.py", line 526, in <module>
    main()
  File "run_phonlp.py", line 128, in main
    train(args)
  File "run_phonlp.py", line 146, in train
    vocab = BuildVocab(args, args["train_file_pos"], train_doc_dep, args["train_file_ner"]).vocab
  File "/content/PhoNLP/phonlp/models/jointmodel/data.py", line 440, in __init__
    self.vocab = self.build_vocab(data_dep, data_pos, data_ner)
  File "/content/PhoNLP/phonlp/models/jointmodel/data.py", line 446, in build_vocab
    ner_tag = TagVocab(data_ner, idx=1)
  File "/content/PhoNLP/phonlp/models/common/vocab.py", line 27, in __init__
    self.build_vocab()
  File "/content/PhoNLP/phonlp/models/ner/vocab.py", line 11, in build_vocab
    counter = Counter([w[self.idx] for sent in self.data for w in sent])
  File "/content/PhoNLP/phonlp/models/ner/vocab.py", line 11, in <listcomp>
    counter = Counter([w[self.idx] for sent in self.data for w in sent])
IndexError: list index out of range

opened by huyhoang240101 1
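
The IndexError in the traceback above is raised where the vocabulary builder reads w[self.idx] with idx=1, which suggests that at least one token row in the NER file has fewer than two columns. The sketch below is one way to locate such rows before training. It assumes one token per line with whitespace-separated columns and blank lines between sentences; the helper find_short_rows and the sample rows are hypothetical, and PhoNLP's own reader may split lines differently.

```python
def find_short_rows(lines, min_columns=2):
    """Return (line_number, text) for non-blank rows with too few columns."""
    problems = []
    for number, raw in enumerate(lines, start=1):
        text = raw.rstrip("\n")
        if not text.strip():
            continue  # blank line: treated as a sentence boundary
        if len(text.split()) < min_columns:
            problems.append((number, text))
    return problems


# Made-up rows: the second one is missing its tag column.
sample_rows = ["Bệnh_nhân\tO\n", "91\n", "\n", "Hà_Nội\tB-LOCATION\n"]
print(find_short_rows(sample_rows))

# For a real file (path is a placeholder):
#   with open("ner_train.txt", encoding="utf-8") as handle:
#       print(find_short_rows(handle))
```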
What is the 'dob' tag in the dependency parsing task?

After running the dependency parsing task, I received the following result with a tag named 'dob'. Please help me understand the meaning of this tag, and please provide the full list of tags :)

opened by icyda17 1
About CRFLoss

Hi, I am trying to write CRFLoss from scratch, and I have also read many other sources besides your code. However, it seems that you omit the start and end tags in your code while other sources don't. I saw the TODO comment in your code saying that you will change the code in the future; will you? If not, is there any difference between your code and the others? Below are some references that you can check and compare. I am a newbie, so I hope that you will help me out. Thanks in advance.

https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Sequence-Labeling
https://github.com/mtreviso/linear-chain-crf/tree/865bcf25fb33f73d59978426eb1c0f587e1f95f8
https://github.com/kmkurn/pytorch-crf/blob/8f3203a1f1d7984c87718bfe31853242670258db/torchcrf/__init__.py

opened by Quang-elec44 1
Asking for POS/NER dataset

Thanks for your great Vietnamese NLP toolkits.

I want to get the VLSP 2013 POS tagging dataset and the VLSP 2016 NER dataset. How can I get these datasets, since the official website does not provide any download link?

I am looking forward to hearing from you.

opened by demdecuong 1
Make the app look more canonical

I just looked into your code and found it weird in terms of organization and usage style. Some thoughts:

1. In development, people have to "install" the project code itself (pip install -e .). I agree that we have to install the dependencies, but installing the app itself is unnecessary and weird, given that we are standing right in the code base folder.

Suggestion:

Just tell people to run pip install -r requirements.txt to install dependencies, or better, use Poetry to organize dependencies.

2. The project has some CLI tools. That is good, but making people cd phonlp/models in order to use the CLI tools is silly.

Suggestion:

Organize the project so that people just stay at the top folder and run:

python3 -m phonlp.tools train ...

With this organization, you also solve the 1st issue (no need to pip install -e .).

3. You are modifying the import path:

https://github.com/VinAIResearch/PhoNLP/blob/efda60735a0b596c7df948fb49eeb8f835cd3734/phonlp/models/run_phonlp.py#L6-L9

With a proper organization, you don't have to do this. It is ugly to have normal code before import statements.

4. The code doesn't follow the standard Python coding style (PEP-8).

5. Hard-coded absolute folder path: https://github.com/VinAIResearch/PhoNLP/blob/efda60735a0b596c7df948fb49eeb8f835cd3734/phonlp/models/run_phonlp.py#L28-L29

It won't run on other people's machines, because "/home/ubuntu/linhnt140/" doesn't exist there.

Suggestion:

Use ~/ or ~/Documents; it always points to the user's home folder, no matter what his/her username is.

6. Command options don't follow the common style. Look at this usage:

python3 run_phonlp.py --mode train --save_dir ./phonlp_tmp --pretrained_lm "vinai/phobert-base"

The popular style for CLI on Linux is GNU style, by which the above command should be:

python3 run_phonlp.py --mode train --save-dir ./phonlp_tmp --pretrained-lm "vinai/phobert-base"

Or, combined with the suggestion for the 2nd issue, it should be:

python3 -m phonlp.tools train --save-dir ./phonlp_tmp --pretrained-lm "vinai/phobert-base"
opened by hongquan 1
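
The suggestions above about GNU-style options, a subcommand interface and home-relative default paths can be sketched with argparse from the standard library. This is purely illustrative: the phonlp.tools module does not exist in PhoNLP, and the option names and the default save folder below are assumptions taken from the proposal, not from the project.

```python
import argparse
from pathlib import Path


def build_parser():
    """Build a parser with a `train` subcommand and hyphenated options."""
    parser = argparse.ArgumentParser(prog="phonlp.tools")  # hypothetical module name
    subparsers = parser.add_subparsers(dest="command")
    subparsers.required = True  # a subcommand must be given

    train = subparsers.add_parser("train", help="train a joint model")
    # Default lives under the user's home folder instead of a hard-coded path.
    train.add_argument("--save-dir", type=Path,
                       default=Path.home() / "phonlp_models")
    train.add_argument("--pretrained-lm", default="vinai/phobert-base")
    return parser


# argparse maps --save-dir to args.save_dir and --pretrained-lm to args.pretrained_lm.
args = build_parser().parse_args(
    ["train", "--save-dir", "./phonlp_tmp", "--pretrained-lm", "vinai/phobert-base"]
)
print(args.command, args.save_dir, args.pretrained_lm)
```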

Owner

VinAI Research

GitHub

Bidirectional LSTM-CRF and ELMo for Named-Entity Recognition, Part-of-Speech Tagging and so on.

anaGo anaGo is a Python library for sequence labeling(NER, PoS Tagging,...), implemented in Keras. anaGo can solve sequence labeling tasks such as nam

1.5k Dec 5, 2022

Bidirectional LSTM-CRF and ELMo for Named-Entity Recognition, Part-of-Speech Tagging and so on.

anaGo anaGo is a Python library for sequence labeling(NER, PoS Tagging,...), implemented in Keras. anaGo can solve sequence labeling tasks such as nam

1.4k Feb 17, 2021

Negative sampling for solving the unlabeled entity problem in NER. ICLR-2021 paper: Empirical Analysis of Unlabeled Entity Problem in Named Entity Recognition.

Negative Sampling for NER Unlabeled entity problem is prevalent in many NER scenarios (e.g., weakly supervised NER). Our paper in ICLR-2021 proposes u

128 Dec 29, 2022

Neural network models for joint POS tagging and dependency parsing (CoNLL 2017-2018)

Neural Network Models for Joint POS Tagging and Dependency Parsing Implementations of joint models for POS tagging and dependency parsing, as describe

152 Sep 2, 2022

TweebankNLP - Pre-trained Tweet NLP Pipeline (NER, tokenization, lemmatization, POS tagging, dependency parsing) + Models + Tweebank-NER

TweebankNLP This repo contains the new Tweebank-NER dataset and Twitter-Stanza p

84 Dec 20, 2022

RoNER is a Named Entity Recognition model based on a pre-trained BERT transformer model trained on RONECv2

RoNER RoNER is a Named Entity Recognition model based on a pre-trained BERT transformer model trained on RONECv2. It is meant to be an easy to use, hi

9 Nov 7, 2022

Pytorch-Named-Entity-Recognition-with-BERT

BERT NER Use google BERT to do CoNLL-2003 NER ! Train model using Python and Inference using C++ ALBERT-TF2.0 BERT-NER-TENSORFLOW-2.0 BERT-SQuAD Requi

1.1k Dec 25, 2022

Use Google's BERT for named entity recognition （CoNLL-2003 as the dataset）.

For better performance, you can try NLPGNN, see NLPGNN for more details. BERT-NER Version 2 Use Google's BERT for named entity recognition （CoNLL-2003

1.2k Dec 26, 2022

Simple, Pythonic, text processing--Sentiment analysis, part-of-speech tagging, noun phrase extraction, translation, and more.

TextBlob: Simplified Text Processing Homepage: https://textblob.readthedocs.io/ TextBlob is a Python (2 and 3) library for processing textual data. It

8.4k Dec 26, 2022

Simple, Pythonic, text processing--Sentiment analysis, part-of-speech tagging, noun phrase extraction, translation, and more.

TextBlob: Simplified Text Processing Homepage: https://textblob.readthedocs.io/ TextBlob is a Python (2 and 3) library for processing textual data. It

7.5k Feb 17, 2021

Part of Speech Tagging using Hidden Markov Model (HMM) POS Tagger and Brill Tagger

Part of Speech Tagging using Hidden Markov Model (HMM) POS Tagger and Brill Tagger In this project, our aim is to tune, compare, and contrast the perf

0 Dec 25, 2021

Implemented shortest-circuit disambiguation, maximum probability disambiguation, HMM-based lexical annotation and BiLSTM+CRF-based named entity recognition

0 Feb 13, 2022

pytorch-kaldi is a project for developing state-of-the-art DNN/RNN hybrid speech recognition systems. The DNN part is managed by pytorch, while feature extraction, label computation, and decoding are performed with the kaldi toolkit.

The PyTorch-Kaldi Speech Recognition Toolkit PyTorch-Kaldi is an open-source repository for developing state-of-the-art DNN/HMM speech recognition sys

2.3k Dec 27, 2022

