PhoNLP: A BERT-based multi-task learning toolkit for part-of-speech tagging, named entity recognition and dependency parsing

Overview

PhoNLP: A joint multi-task learning model for Vietnamese part-of-speech tagging, named entity recognition and dependency parsing

PhoNLP is a multi-task learning model for joint part-of-speech (POS) tagging, named entity recognition (NER) and dependency parsing. Experiments on Vietnamese benchmark datasets show that PhoNLP produces state-of-the-art results, outperforming a single-task learning approach that fine-tunes the pre-trained Vietnamese language model PhoBERT for each task independently.

Details of the PhoNLP model architecture and experimental results can be found in our paper:

@article{PhoNLP,
title     = {{PhoNLP: A joint multi-task learning model for Vietnamese part-of-speech tagging, named entity recognition and dependency parsing}},
author    = {Linh The Nguyen and Dat Quoc Nguyen},
journal   = {arXiv preprint},
volume    = {arXiv:2101.01476},
year      = {2021}
}

Please CITE our paper when PhoNLP is used to help produce published results or is incorporated into other software.

Although we describe PhoNLP for Vietnamese, the usage examples below work directly for any other language that has gold-annotated corpora for the three tasks (POS tagging, NER and dependency parsing) and a pre-trained BERT-based language model available from transformers.

Installation

  • Python version >= 3.6; PyTorch version >= 1.4.0
  • PhoNLP can be installed using pip as follows: pip3 install phonlp
  • Alternatively, PhoNLP can be installed from source with the following commands:
     git clone https://github.com/VinAIResearch/PhoNLP
     cd PhoNLP
     pip3 install -e .
    

Usage example: Command lines

To run the command-line examples, please install phonlp from source:

git clone https://github.com/VinAIResearch/PhoNLP
cd PhoNLP
pip3 install -e . 

Training

cd phonlp/models
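python3 run_phonlp.py --mode train --save_dir <model_folder_path> \
	--pretrained_lm <transformers_pretrained_model> \
	--lr <float_value> --batch_size <int_value> --num_epoch <int_value> \
	--lambda_pos <float_value> --lambda_ner <float_value> --lambda_dep <float_value> \
	--train_file_pos <path_to_training_file_pos> --eval_file_pos <path_to_validation_file_pos> \
	--train_file_ner <path_to_training_file_ner> --eval_file_ner <path_to_validation_file_ner> \
	--train_file_dep <path_to_training_file_dep> --eval_file_dep <path_to_validation_file_dep>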

--lambda_pos, --lambda_ner and --lambda_dep are the mixture weights of the POS tagging, NER and dependency parsing losses, respectively. They must satisfy lambda_pos + lambda_ner + lambda_dep = 1.
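
The constraint on the mixture weights can be illustrated with a small sketch. It assumes that the joint training objective is a weighted sum of the three task losses, which is what "mixture weights" suggests. The function name combine_losses and the loss values are hypothetical and are not part of PhoNLP.

```python
import math


def combine_losses(loss_pos, loss_ner, loss_dep,
                   lambda_pos=0.4, lambda_ner=0.2, lambda_dep=0.4):
    """Weighted sum of three task losses (assumed form of the joint objective)."""
    total_weight = lambda_pos + lambda_ner + lambda_dep
    # Reject weight settings that violate lambda_pos + lambda_ner + lambda_dep = 1.
    if not math.isclose(total_weight, 1.0, abs_tol=1e-6):
        raise ValueError(f"mixture weights must sum to 1, got {total_weight}")
    return lambda_pos * loss_pos + lambda_ner * loss_ner + lambda_dep * loss_dep


# Hypothetical per-task loss values for one training batch.
joint_loss = combine_losses(loss_pos=1.0, loss_ner=2.0, loss_dep=3.0)
print(joint_loss)
```

Checking the sum before training starts catches a mistyped weight early, instead of silently rescaling the objective.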

Example:

cd phonlp/models
python3 run_phonlp.py --mode train --save_dir ./phonlp_tmp \
	--pretrained_lm "vinai/phobert-base" \
	--lr 1e-5 --batch_size 32 --num_epoch 40 \
	--lambda_pos 0.4 --lambda_ner 0.2 --lambda_dep 0.4 \
	--train_file_pos ../sample_data/pos_train.txt --eval_file_pos ../sample_data/pos_valid.txt \
	--train_file_ner ../sample_data/ner_train.txt --eval_file_ner ../sample_data/ner_valid.txt \
	--train_file_dep ../sample_data/dep_train.conll --eval_file_dep ../sample_data/dep_valid.conll

Evaluation

cd phonlp/models
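python3 run_phonlp.py --mode eval --save_dir <model_folder_path> \
	--batch_size <int_value> \
	--eval_file_pos <path_to_test_file_pos> \
	--eval_file_ner <path_to_test_file_ner> \
	--eval_file_dep <path_to_test_file_dep>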

Example:

cd phonlp/models
python3 run_phonlp.py --mode eval --save_dir ./phonlp_tmp \
	--batch_size 8 \
	--eval_file_pos ../sample_data/pos_test.txt \
	--eval_file_ner ../sample_data/ner_test.txt \
	--eval_file_dep ../sample_data/dep_test.conll 

Annotate a corpus

cd phonlp/models
python3 run_phonlp.py --mode annotate --save_dir <model_folder_path> \
	--batch_size <int_value> \
	--input_file <path_to_input_file> \
	--output_file <path_to_output_file>

Example:

cd phonlp/models
python3 run_phonlp.py --mode annotate --save_dir ./phonlp_tmp \
	--batch_size 8 \
	--input_file ../sample_data/input.txt \
	--output_file ../sample_data/output.txt 

A pre-trained PhoNLP model for Vietnamese is available for download; the Python API below fetches it automatically via phonlp.download().

Usage example: Python API

import phonlp
# Automatically download the pre-trained PhoNLP model 
# and save it in a local machine folder
phonlp.download(save_dir='./pretrained_phonlp')
# Load the pre-trained PhoNLP model
model = phonlp.load(save_dir='./pretrained_phonlp')
# Annotate a corpus where each line represents a word-segmented sentence
model.annotate(input_file='input.txt', output_file='output.txt')
# Annotate a word-segmented sentence
model.print_out(model.annotate(text="Tôi đang làm_việc tại VinAI ."))

By default, the output for each input sentence is formatted with 6 columns representing word index, word form, POS tag, NER label, head index of the current word and its dependency relation type:

1	Tôi	P	O	3	sub
2	đang	R	O	3	adv
3	làm_việc	V	O	0	root
4	tại	E	O	3	loc
5	VinAI	Np	B-ORG	4	prob
6	.	CH	O	3	punct
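
As a minimal sketch of how this 6-column output could be consumed downstream, the snippet below parses tab-separated rows into records. The Token type and the parse_rows helper are hypothetical names introduced for illustration; they are not part of the PhoNLP API.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    index: int    # position of the word in the sentence (1-based)
    form: str     # word form
    pos: str      # POS tag
    ner: str      # NER label
    head: int     # index of the head word (0 for the root)
    deprel: str   # dependency relation type


def parse_rows(text: str) -> List[Token]:
    """Parse tab-separated 6-column rows into Token records."""
    tokens = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        cols = [col.strip() for col in line.strip().split("\t")]
        if len(cols) != 6:
            raise ValueError(f"expected 6 columns, got {len(cols)}: {line!r}")
        index, form, pos, ner, head, deprel = cols
        tokens.append(Token(int(index), form, pos, ner, int(head), deprel))
    return tokens


SAMPLE = (
    "1\tTôi\tP\tO\t3\tsub\n"
    "2\tđang\tR\tO\t3\tadv\n"
    "3\tlàm_việc\tV\tO\t0\troot\n"
    "4\ttại\tE\tO\t3\tloc\n"
    "5\tVinAI\tNp\tB-ORG\t4\tprob\n"
    "6\t.\tCH\tO\t3\tpunct\n"
)

tokens = parse_rows(SAMPLE)
root_words = [t.form for t in tokens if t.head == 0]
entity_words = [(t.form, t.ner) for t in tokens if t.ner != "O"]
print(root_words, entity_words)
```

Stripping each field tolerates stray whitespace around column values, which can appear when output is copied from a terminal or a web page.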

In addition, the output can be written in the 10-column CoNLL format, where the last column holds the NER predictions. To do so, pass output_type='conll' to the model.annotate() function. The batch_size parameter of model.annotate() can also be raised from its default of 1 (batch_size=1) to fit your computer's memory; a larger batch_size generally makes annotation faster.
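
The two options can be combined in one call. The wrapper below is a sketch only: annotate_to_conll is a hypothetical helper name, and it assumes that output_type and batch_size can be passed together with input_file and output_file, as the description above suggests.

```python
def annotate_to_conll(model, input_file, output_file, batch_size=8):
    """Annotate a word-segmented corpus and write 10-column CoNLL output.

    `model` is expected to be the object returned by phonlp.load().
    """
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    model.annotate(
        input_file=input_file,
        output_file=output_file,
        output_type="conll",    # 10-column CoNLL output, NER in the last column
        batch_size=batch_size,  # larger values trade memory for speed
    )


# Intended use (requires phonlp and a downloaded model):
#   import phonlp
#   model = phonlp.load(save_dir="./pretrained_phonlp")
#   annotate_to_conll(model, "input.txt", "output.txt", batch_size=16)
```

If a given batch_size exhausts memory, reducing it and retrying is the simplest remedy.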

Comments
  • What if I want to train NER task only?

    Hello, thanks for publishing the code. I have a question; please help me clarify it. What should I change in the code if I only want to train the NER task and not the others?

    opened by icyda17 10
  • PhoNLP doesn't train or evaluate the NER Task

    I am running the PhoNLP code. However, the results show that it doesn't train or evaluate the NER task. My parameters are below.

    Train:

    python3 phonlp/models/run_phonlp.py --mode train --save_dir ./phonlp_tmp_save_model \
    	--pretrained_lm "vinai/phobert-base" \
    	--lr 1e-5 --batch_size 32 --num_epoch 40 \
    	--lambda_pos 0.4 --lambda_ner 0.2 --lambda_dep 0.4 \
    	--train_file_pos phonlp/sample_data/pos_train.txt --eval_file_pos phonlp/sample_data/pos_valid.txt \
    	--train_file_ner phonlp/sample_data/ner_train.txt --eval_file_ner phonlp/sample_data/ner_valid.txt \
    	--train_file_dep phonlp/sample_data/dep_train.conll --eval_file_dep phonlp/sample_data/dep_valid.conll \
    	--output_file_dep phonlp/models/jointmodel/dep.out

    Evaluate:

    python3 phonlp/models/run_phonlp.py --mode eval --save_dir ./phonlp_tmp_save_model \
    	--batch_size 8 \
    	--eval_file_pos phonlp/sample_data/pos_test.txt \
    	--eval_file_ner phonlp/sample_data/ner_test.txt \
    	--eval_file_dep phonlp/sample_data/dep_test.conll \
    	--output_file_dep phonlp/models/jointmodel/dep.out

    Train result: Training ended with 39 epochs. Best dev las = 5.23, uas = 15.28, upos = 24.22, f1 = 0.0

    Evaluate result: POS tagging: 24.25, NER: 0.00, Dependency parsing: 6.63/27.24

    I would appreciate your reply. @datquocnguyen @thelinhbkhn2014 @dangne @maihoangdao Thank you~~

    opened by COCOMiss 2
  • Training on Covid19 dataset

    Hi, I'm having a problem training the model on the Vietnamese Covid 19 NER dataset. This dataset is in .conll form, and I get the error shown below. I can't figure it out. I wonder whether the cause is the .conll format (the sample data format is .txt) or whether I need to preprocess the Covid 19 dataset. Can you suggest a solution?

    Traceback (most recent call last):
      File "run_phonlp.py", line 526, in <module>
        main()
      File "run_phonlp.py", line 128, in main
        train(args)
      File "run_phonlp.py", line 146, in train
        vocab = BuildVocab(args, args["train_file_pos"], train_doc_dep, args["train_file_ner"]).vocab
      File "/content/PhoNLP/phonlp/models/jointmodel/data.py", line 440, in __init__
        self.vocab = self.build_vocab(data_dep, data_pos, data_ner)
      File "/content/PhoNLP/phonlp/models/jointmodel/data.py", line 446, in build_vocab
        ner_tag = TagVocab(data_ner, idx=1)
      File "/content/PhoNLP/phonlp/models/common/vocab.py", line 27, in __init__
        self.build_vocab()
      File "/content/PhoNLP/phonlp/models/ner/vocab.py", line 11, in build_vocab
        counter = Counter([w[self.idx] for sent in self.data for w in sent])
      File "/content/PhoNLP/phonlp/models/ner/vocab.py", line 11, in <listcomp>
        counter = Counter([w[self.idx] for sent in self.data for w in sent])
    IndexError: list index out of range

    opened by huyhoang240101 1
  • What is the 'dob' tag in the dependency parsing task?

    After running the dependency parsing task, I received a result containing a tag named 'dob'. Please help me understand the meaning of this tag, and provide the full list of tags :)

    opened by icyda17 1
  • About CRFLoss

    Hi, I am trying to write CRFLoss from scratch, and I have read many other sources besides your code. However, it seems that you omit the start and end tags in your code while other sources don't. I saw a TODO comment in your code saying that you will change the code in the future; is that still planned? If not, is there any difference between your code and the others? Below are some references that you can check and compare against. I am a newbie, so I hope that you will help me out. Thanks in advance.

    https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Sequence-Labeling
    https://github.com/mtreviso/linear-chain-crf/tree/865bcf25fb33f73d59978426eb1c0f587e1f95f8
    https://github.com/kmkurn/pytorch-crf/blob/8f3203a1f1d7984c87718bfe31853242670258db/torchcrf/__init__.py

    opened by Quang-elec44 1
  • Asking for POS/NER dataset

    Thanks for your great Vietnamese NLP toolkits.

    I want to get the VLSP 2013 POS tagging dataset and the VLSP 2016 NER dataset. How can I get these datasets, since the official website does not provide any download link?

    I am looking forward to hearing from you.

    opened by demdecuong 1
  • Make the app look more canonical

    I just looked into your code and found it weird in terms of organization and usage style. Some thoughts:

    1. In development, people have to "install" the project code itself, with pip install -e . (note the trailing dot). I agree that we have to install the dependencies, but installing the app itself is unnecessary and weird, given that we are standing right in the code base folder.

      Suggestion:

      • Just tell people to do pip install -r requirements.txt to install dependencies, or better, use Poetry to organize dependencies.
    2. The project has some CLI tools. That is good, but the way people have to cd phonlp/models in order to use the CLI tools is silly.

      Suggestion:

      • Organize the project so that people can stay in the top folder and run:
      python3 -m phonlp.tools train ...
      

      With this organization, you also solve the 1st issue (no need to pip install -e .).

    3. You are modifying import path:

      https://github.com/VinAIResearch/PhoNLP/blob/efda60735a0b596c7df948fb49eeb8f835cd3734/phonlp/models/run_phonlp.py#L6-L9

      With a proper organization, you don't have to do this. Having normal code before import statements is ugly.

    4. Code doesn't follow standard Python coding style (PEP-8).

    5. Hard-coded absolute folder path: https://github.com/VinAIResearch/PhoNLP/blob/efda60735a0b596c7df948fb49eeb8f835cd3734/phonlp/models/run_phonlp.py#L28-L29

      It won't run on other people's machines, because "/home/ubuntu/linhnt140/" doesn't exist there.

      Suggestion:

      • Use ~/ or ~/Documents, which always points to the user's home folder, no matter what the username is.
    6. Command options don't follow common style. Look at this usage:

      python3 run_phonlp.py --mode train --save_dir ./phonlp_tmp --pretrained_lm "vinai/phobert-base"
      

      The popular style for CLIs on Linux is the GNU style, under which the above command would be:

      python3 run_phonlp.py --mode train --save-dir ./phonlp_tmp --pretrained-lm "vinai/phobert-base"
      

      Or, combined with the suggestion for the 2nd issue, it would be:

      python3 -m phonlp.tools train --save-dir ./phonlp_tmp --pretrained-lm "vinai/phobert-base"
      
    opened by hongquan 1
Owner
VinAI Research