The winning system (DAMO-NLP) of the SemEval 2022 MultiCoNER shared task, ranking first on 10 out of 13 tracks.

Overview

KB-NER: a Knowledge-based System for Multilingual Complex Named Entity Recognition

This is the code for the winning system (DAMO-NLP) of the SemEval 2022 MultiCoNER shared task, which ranked first on 10 out of 13 tracks. [Rankings], [Paper].

KB-NER is a knowledge-based system in which we build a multilingual knowledge base from Wikipedia to provide related context to the NER model.


Guide

Requirements

To run our code, install the dependencies:

pip install -r requirements.txt

Quick Start

Datasets

To make it easier to run the code, please download our pre-processed datasets.

  • Training and development data with retrieved knowledge: [OneDrive]

  • Test data with retrieved knowledge: [OneDrive]

  • Our model predictions submitted at the test phase: [OneDrive]. We believe the predictions can be used for distilling knowledge from our system.

Recommended Training and Testing Data for Each Language

Language/Track | Training | Testing
EN-English | EN-English_conll_rank_eos_doc_full_wiki_v3 | EN-English_conll_rank_eos_doc_full_wiki_v3_test
ES-Spanish | ES-Spanish_conll_rank_eos_doc_full_wiki_v3 | ES-Spanish_conll_rank_eos_doc_full_wiki_v3_test
NL-Dutch | NL-Dutch_conll_rank_eos_doc_full_wiki_v3 | NL-Dutch_conll_rank_eos_doc_full_wiki_v3_test
RU-Russian | RU-Russian_conll_rank_eos_doc_full_wiki_v3 | RU-Russian_conll_rank_eos_doc_full_wiki_v3_test
TR-Turkish | TR-Turkish_conll_rank_eos_doc_full_wiki_v3 | TR-Turkish_conll_rank_eos_doc_full_wiki_v3_test
KO-Korean | KO-Korean_conll_rank_eos_doc_full_wiki_v3 | KO-Korean_conll_rank_eos_doc_full_wiki_v3_test
FA-Farsi | FA-Farsi_conll_rank_eos_doc_full_wiki_v3 | FA-Farsi_conll_rank_eos_doc_full_wiki_v3_test
DE-German | DE-German_conll_rank_eos_doc_full_wiki_v3_sentence_withent | DE-German_conll_rank_eos_doc_full_wiki_v3_test_sentence_withent
ZH-Chinese | ZH-Chinese_conll_rank_eos_doc_full_wiki_v3_sentence | ZH-Chinese_conll_rank_eos_doc_full_wiki_v3_test_sentence
HI-Hindi | HI-Hindi_conll_rank_eos_doc_full_wiki_v3_sentence | HI-Hindi_conll_rank_eos_doc_full_wiki_v3_test_sentence
BN-Bangla | BN-Bangla_conll_rank_eos_doc_full_wiki_v3_sentence | BN-Bangla_conll_rank_eos_doc_full_wiki_v3_test_sentence
MULTI-Multilingual | All monolingual datasets *_conll_rank_eos_doc_full_wiki_v3 | MULTI_Multilingual_conll_rank_eos_doc_full_wiki_v3_test_langwiki
MIX-Code_mixed | MIX-Code_mixed_conll_rank_eos_doc_full_wiki_v3_sentence | MIX-Code_mixed_conll_rank_eos_doc_full_wiki_v3_test_sentence
MIX-Code_mixed (Iterative) | MIX-Code_mixed_conll_rank_eos_doc_full_wiki_v4_sentence_withent | MIX-Code_mixed_conll_rank_eos_doc_full_wiki_v4_test_sentence_withent

The meanings of the suffixes in the folder names are listed below:

Suffix | Description
test | Our test data with retrieved contexts from the knowledge base
v3 | Contexts in the data are from sentence retrieval
v4 | Contexts in the data are from iterative entity retrieval
sentence | Using matched sentences as the contexts (Wiki-Sent-link in the paper)
sentence_withent | Using matched sentences with wiki anchors as the contexts (Wiki-Sent in the paper)
w/o sentence and sentence_withent | Using matched paragraphs with wiki anchors as the contexts (Wiki-Para in the paper)

Note that in the iterative entity retrieval datasets, the training data use gold entities to retrieve knowledge, while the test data (with 'test' in the folder name) use entities predicted by our ensembled sentence-retrieval models.

Trained Models

Since our test-phase submission used 130+ trained models, we only release the trained models for English (monolingual), Multilingual, and Code-mixed.


Training and Testing on MultiCoNER Datasets

This section is a guide for running our code with the trained models and processed datasets downloaded above. If you want to train and make predictions on your own datasets, please refer to Building Knowledge-based System for how to build the knowledge retrieval system and train your own knowledge-based models from scratch.

Testing

We provide four trained models for the three tracks; they are candidates for the ensemble models in our submission. To make predictions with our trained models, follow the instructions below.

For English, xlmr-large-pretuned-tuned-wiki-full-first_10epoch_1batch_4accumulate_0.000005lr_10000lrrate_en_monolingual_crf_fast_norelearn_sentbatch_sentloss_withdev_finetune_saving_amz_doc_wiki_v3_ner20 is required. For all configs, change data_folder in the YAML file to the path of your downloaded dataset. For example:

ColumnCorpus-EN-EnglishDOC:
    column_format:
      0: text
      1: pos
      2: upos
      3: ner
    comment_symbol: '# id'
    data_folder: EN-English_conll_rank_eos_doc_full_wiki_v3 # change the data_folder here
    tag_to_bioes: ner

Run:

# English
CUDA_VISIBLE_DEVICES=0 python -u train.py --config config/xlmr-large-pretuned-tuned-wiki-full-first_10epoch_1batch_4accumulate_0.000005lr_10000lrrate_en_monolingual_crf_fast_norelearn_sentbatch_sentloss_withdev_finetune_saving_amz_doc_wiki_v3_ner20.yaml --parse --keep_order --target_dir EN-English_conll_rank_eos_doc_full_wiki_v3_test --num_columns 4 --batch_size 32 --output_dir semeval2022_predictions 

For Multilingual, xlmr-large-pretuned-tuned-wiki-first_3epoch_1batch_4accumulate_0.000005lr_10000lrrate_multi_monolingual_crf_fast_norelearn_sentbatch_sentloss_withdev_finetune_saving_amz_doc_wiki_v3_ner24 is required.

Run:

# Multilingual
CUDA_VISIBLE_DEVICES=0 python -u train.py --config config/xlmr-large-pretuned-tuned-wiki-first_3epoch_1batch_4accumulate_0.000005lr_10000lrrate_multi_monolingual_crf_fast_norelearn_sentbatch_sentloss_withdev_finetune_saving_amz_doc_wiki_v3_ner24.yaml --parse --keep_order --target_dir MULTI_Multilingual_conll_rank_eos_doc_full_wiki_v3_test_langwiki --num_columns 4 --batch_size 32 --output_dir semeval2022_predictions 

For Code-mixed, two trained models are required:

xlmr-large-pretuned-tuned-wiki-full-first_100epoch_1batch_4accumulate_0.000005lr_10000lrrate_mix_monolingual_crf_fast_norelearn_sentbatch_sentloss_withdev_finetune_saving_amz_doc_wiki_v3_sentence_ner40 is the model based on sentence retrieval, and

xlmr-large-pretuned-tuned-wiki-full-v4-first_100epoch_1batch_4accumulate_0.000005lr_10000lrrate_mix_monolingual_crf_fast_norelearn_sentbatch_sentloss_withdev_finetune_saving_amz_doc_wiki_v4_sentence_withent_ner30 is the model based on iterative entity retrieval.

The sentence-retrieval-based models are used to predict entity mentions for iterative entity retrieval. The iterative-entity-retrieval-based models are expected to be stronger than the sentence-retrieval-based models on the code-mixed track.

Run:

# Code-mixed + Sentence Retrieval
CUDA_VISIBLE_DEVICES=0 python -u train.py --config config/xlmr-large-pretuned-tuned-wiki-full-first_100epoch_1batch_4accumulate_0.000005lr_10000lrrate_mix_monolingual_crf_fast_norelearn_sentbatch_sentloss_withdev_finetune_saving_amz_doc_wiki_v3_sentence_ner40.yaml --parse --keep_order --target_dir MIX_Code_mixed_conll_rank_eos_doc_full_wiki_v3_test_sentence_withent --num_columns 4 --batch_size 32 --output_dir semeval2022_predictions 

# Code-mixed + Iterative Entity Retrieval
CUDA_VISIBLE_DEVICES=0 python -u train.py --config config/xlmr-large-pretuned-tuned-wiki-full-v4-first_100epoch_1batch_4accumulate_0.000005lr_10000lrrate_mix_monolingual_crf_fast_norelearn_sentbatch_sentloss_withdev_finetune_saving_amz_doc_wiki_v4_sentence_withent_ner30.yaml --parse --keep_order --target_dir MIX_Code_mixed_conll_rank_eos_doc_full_wiki_v4_test_sentence_withent --num_columns 4 --batch_size 32 --output_dir semeval2022_predictions 

Training the monolingual models

To train the monolingual model based on fine-tuned multilingual embeddings, the trained model xlmr-large-pretuned-tuned-wiki-first_3epoch_1batch_4accumulate_0.000005lr_10000lrrate_multi_monolingual_crf_fast_norelearn_sentbatch_sentloss_withdev_finetune_saving_amz_doc_wiki_v3_10upsample_addmix_ner23 is required. Run:

CUDA_VISIBLE_DEVICES=0 python train.py --config config/xlmr-large-pretuned-tuned-wiki-full-first_10epoch_1batch_4accumulate_0.000005lr_10000lrrate_en_monolingual_crf_fast_norelearn_sentbatch_sentloss_withdev_finetune_saving_amz_doc_wiki_v3_ner20.yaml

If you want to train on other languages, change the corresponding settings in the config. For example:

model_name: xlmr-large-pretuned-tuned-wiki-full-first_10epoch_1batch_4accumulate_0.000005lr_10000lrrate_es_monolingual_crf_fast_norelearn_sentbatch_sentloss_withdev_finetune_saving_amz_doc_wiki_v3_ner20 # change model_name so the model is saved in a new folder; here we change _en_ to _es_ for Spanish
...
  ColumnCorpus-ES-SpanishDOC: # Training dataset settings
    column_format:
      0: text
      1: pos
      2: upos
      3: ner
    comment_symbol: '# id'
    data_folder: ES-Spanish_conll_rank_eos_doc_full_wiki_v3 # change the dataset folder
    tag_to_bioes: ner
  Corpus: ColumnCorpus-ES-SpanishDOC # It must be the same as the corpus name above

Please refer to this table to decide the training dataset for each language.

Note: this model and the following two models are trained on both the training and development sets. As a result, the development F1 score should be about 100 during training.

Training the multilingual models

xlm-roberta-large-ft10w, an XLM-R model further pretrained on the shared task data, is required to train the multilingual models. Run:

CUDA_VISIBLE_DEVICES=0 python train.py --config config/xlmr-large-pretuned-tuned-wiki-first_3epoch_1batch_4accumulate_0.000005lr_10000lrrate_multi_monolingual_crf_fast_norelearn_sentbatch_sentloss_withdev_finetune_saving_amz_doc_wiki_v3_ner24.yaml

Training the code-mixed models

For our code-mixed models with sentence retrieval, download xlmr-large-pretuned-tuned-wiki-first_3epoch_1batch_4accumulate_0.000005lr_10000lrrate_multi_monolingual_crf_fast_norelearn_sentbatch_sentloss_withdev_finetune_saving_amz_doc_wiki_v3_10upsample_addmix_ner23 and run:

CUDA_VISIBLE_DEVICES=0 python train.py --config config/xlmr-large-pretuned-tuned-wiki-full-first_100epoch_1batch_4accumulate_0.000005lr_10000lrrate_mix_monolingual_crf_fast_norelearn_sentbatch_sentloss_withdev_finetune_saving_amz_doc_wiki_v3_sentence_ner40.yaml

For our code-mixed models with iterative entity retrieval, download xlmr-large-pretuned-tuned-wiki-first_3epoch_1batch_4accumulate_0.000005lr_10000lrrate_multi_monolingual_crf_fast_norelearn_sentbatch_sentloss_withdev_finetune_saving_amz_doc_wiki_v4_10upsample_addmix_ner23 and run:

CUDA_VISIBLE_DEVICES=0 python train.py --config config/xlmr-large-pretuned-tuned-wiki-full-v4-first_100epoch_1batch_4accumulate_0.000005lr_10000lrrate_mix_monolingual_crf_fast_norelearn_sentbatch_sentloss_withdev_finetune_saving_amz_doc_wiki_v4_sentence_withent_ner30.yaml

Building Knowledge-based System

Knowledge Base Building

Index Building

Our wiki-based retrieval system is built on ElasticSearch, so you need to install ElasticSearch properly before building the knowledge base. For an installation tutorial, please refer to this link. In addition, to make ElasticSearch support Chinese word segmentation, we recommend installing ik-analyzer.

After installing ElasticSearch, you can build your local multilingual wiki knowledge base. First, download the latest wiki dumps from Wikimedia and store them in an LMDB database by running:

cd kb/dumps
./download.sh
./convert_db.sh
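
Here, download.sh fetches the dumps and convert_db.sh caches the pages in LMDB so later stages can read them quickly. Below is a minimal sketch of that access pattern only; the database path, key scheme, and encodings are assumptions, not the script's actual layout:

import lmdb

# Open (or create) the page cache; the path and map size are illustrative.
env = lmdb.open("kb/dumps/enwiki.lmdb", map_size=50 * 2**30)

def put_page(title, text):
    # Store one wiki page under its title.
    with env.begin(write=True) as txn:
        txn.put(title.encode("utf-8"), text.encode("utf-8"))

def get_page(title):
    # Read a page back; returns None if the title is not cached.
    with env.begin() as txn:
        value = txn.get(title.encode("utf-8"))
        return value.decode("utf-8") if value is not None else None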

Then convert the files from XML to plain text:

cd ..
./parse_text.sh

Finally, build the knowledge base with ElasticSearch, i.e., create indexes for the 11 languages (make sure ElasticSearch is running and listening on the default port 9200):

./bulid_kb.sh
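
To check that the indexes are reachable, you can query ElasticSearch directly over its REST API. This is only a hedged sanity check, not part of the repo; the index name enwiki and the field names used in the query are assumptions about the knowledge-base schema:

import requests

# List the indexes created by the build script.
print(requests.get("http://localhost:9200/_cat/indices?v").text)

# Run a simple full-text query against one index (field names are assumed).
query = {"query": {"match": {"text": "bic runga"}}, "size": 3}
resp = requests.get("http://localhost:9200/enwiki/_search", json=query).json()
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("title"))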

Retrieval-based Data Augmentation

We provide two types of retrieval-based augmentation, one at the sentence level and one at the entity level. First, place the datasets in CoNLL format under kb/datasets, then run the following commands as needed.

  • Sentence-level retrieval augmentation:

    python generate_data.py --lan en
  • Entity-level retrieval augmentation:

    python generate_data.py --lan en --with_entity

Note that --lan specifies the language and --with_entity indicates whether to retrieve at the entity level (default is false).

The retrieval results are presented in the following format. The first line contains the original sentence and the entities in it. The following lines are the 10 (default) most relevant retrieval results, one per row.

original sentence \t entity #1 | entity #2 ···
retrieved sentence #1 \t associated paragraph #1 \t associated title #1 \t score #1 \t wiki url #1 \t hits on the sentence #1 ---#--- hits on the title #1
retrieved sentence #2 \t associated paragraph #2 \t associated title #2 \t score #2 \t wiki url #2 \t hits on the sentence #2 ---#--- hits on the title #2
···
retrieved sentence #10 \t associated paragraph #10 \t associated title #10 \t score #10 \t wiki url #10 \t hits on the sentence #10 ---#---  hits on the title #10

Here is an example:

anthology is a compilation album by new zealand singer songwriter and multi instrumentalist bic runga .	compilation album | new zealand | bic runga 
Anthology is a compilation album by New Zealand singer-songwriter and multi-instrumentalist Bic Runga.	Anthology is a <e:Compilation album>compilation album</e> by <e:New Zealand>New Zealand</e> singer-songwriter and multi-instrumentalist <e:Bic Runga>Bic Runga</e>. The album was initially set to be released on 23 November 2012, but ultimately released on 1 December 2012 in New Zealand. The album cover was revealed on 29 October 2012.	Anthology (Bic Runga album)	145.28241	https://en.wikipedia.org/wiki/Anthology (Bic Runga album)	<hit>Anthology</hit> <hit>is</hit> <hit>a</hit> <hit>compilation</hit> <hit>album</hit> <hit>by</hit> <hit>New</hit> <hit>Zealand</hit> <hit>singer</hit>-<hit>songwriter</hit> <hit>and</hit> <hit>multi</hit>-<hit>instrumentalist</hit> <hit>Bic</hit> <hit>Runga</hit> ---#--- Anthology (<hit>Bic</hit> <hit>Runga</hit> <hit>album</hit>)
Briolette Kah Bic Runga  (born 13 January 1976), recording as Bic Runga, is a New Zealand singer-songwriter and multi-instrumentalist pop artist.	Briolette Kah Bic Runga  (born 13 January 1976), recording as Bic Runga, is a New Zealand singer-songwriter and multi-instrumentalist pop artist. Her first three <e:Album>studio albums</e> debuted at number one on the <e:Recorded Music NZ>New Zealand Top 40 Album charts</e>. Runga has also found success internationally in Australia, Ireland and the United Kingdom with her song "<e:Sway (Bic Runga song)>Sway</e>".	Bic Runga	125.18798	https://en.wikipedia.org/wiki/Bic Runga	Briolette Kah <hit>Bic</hit> <hit>Runga</hit>  (born 13 January 1976), recording as <hit>Bic</hit> <hit>Runga</hit>, <hit>is</hit> <hit>a</hit> <hit>New</hit> <hit>Zealand</hit> <hit>singer</hit>-<hit>songwriter</hit> ---#--- <hit>Bic</hit> <hit>Runga</hit>
Birds is the third studio album by New Zealand artist Bic Runga.	Birds is the third <e:Album>studio album</e> by <e:New Zealand>New Zealand</e> artist <e:Bic Runga>Bic Runga</e>. The album was released in New Zealand on 28 November 2005. The album was Bic's third no.1 album garnering platinum status in its first week. The album was certified 3x platinum. The album won the <e:Aotearoa Music Award for Album of the Year>New Zealand Music Award for Album of the Year</e> in 2006, her second award for Best Album, after her debut release <e:Drive (Bic Runga album)>Drive</e>.	Birds (Bic Runga album)	100.14264	https://en.wikipedia.org/wiki/Birds (Bic Runga album)	Birds <hit>is</hit> the third studio <hit>album</hit> <hit>by</hit> <hit>New</hit> <hit>Zealand</hit> artist <hit>Bic</hit> <hit>Runga</hit>. ---#--- Birds (<hit>Bic</hit> <hit>Runga</hit> <hit>album</hit>)
"Sway" is a song by New Zealand singer Bic Runga.	"Sway" is a song by New Zealand singer <e:Bic Runga>Bic Runga</e>. It was released as the second single from her debut studio album, <e:Drive (Bic Runga album)>Drive</e> (1997), in 1997. The song peaked at  7 in New Zealand and No. 10 in Australia, earning gold <e:Music recording certification>certifications</e> in both countries. At the <e:Aotearoa Music Awards>32nd New Zealand Music Awards</e>, the song won three awards: Single of the Year, Best Songwriter, and Best Engineer (Simon Sheridan). In 2001, it was voted the <e:APRA Top 100 New Zealand Songs of All Time#6>6th best New Zealand song of all time</e> by members of <e:APRA AMCOS>APRA</e>. A music video directed by John Taft was made for the song.	Sway (Bic Runga song)	97.816284	https://en.wikipedia.org/wiki/Sway (Bic Runga song)	"Sway" <hit>is</hit> <hit>a</hit> song <hit>by</hit> <hit>New</hit> <hit>Zealand</hit> <hit>singer</hit> <hit>Bic</hit> <hit>Runga</hit>. ---#--- Sway (<hit>Bic</hit> <hit>Runga</hit> song)
Drive is the debut solo album by New Zealand artist Bic Runga, released on 14 July 1997.	Drive is the debut solo album by New Zealand artist <e:Bic Runga>Bic Runga</e>, released on 14 July 1997. The album went seven times <e:Music recording certification>platinum</e> in New Zealand, and won the <e:Aotearoa Music Award for Album of the Year>New Zealand Music Award for Album of the Year</e> at the <e:Aotearoa Music Awards>32nd New Zealand Music Awards</e>.	Drive (Bic Runga album)	94.014656	https://en.wikipedia.org/wiki/Drive (Bic Runga album)	Drive <hit>is</hit> the debut solo <hit>album</hit> <hit>by</hit> <hit>New</hit> <hit>Zealand</hit> artist <hit>Bic</hit> <hit>Runga</hit>, released on 14 July 1997. ---#--- Drive (<hit>Bic</hit> <hit>Runga</hit> <hit>album</hit>)
Bic Runga at Discogs	Bic Runga at <e:Discogs>Discogs</e>	Bic Runga	93.661865	https://en.wikipedia.org/wiki/Bic Runga	<hit>Bic</hit> <hit>Runga</hit> at Discogs ---#--- <hit>Bic</hit> <hit>Runga</hit>
Close Your Eyes is the fifth studio album by New Zealand singer-song writer Bic Runga.	Close Your Eyes is the fifth studio album by <e:New Zealand>New Zealand</e> singer-song writer <e:Bic Runga>Bic Runga</e>. The album is made up of ten covers and two original tracks. Upon announcement of the album in October, Runga said: "There are so many songs I've always wanted to cover. I wanted to see if I could not just be a singer-songwriter, but someone who could also interpret songs. In the process, I found there are so many reasons why a cover version wouldn't work, perhaps because the lyrics were not something I could relate to first hand, because technically I wasn't ready or because the original was too iconic. But the songs that all made it on the record specifically say something about where I'm at in my life, better than if I'd written it myself. It was a challenging process, I'm really proud of the singing and the production and the statement".	Close Your Eyes (Bic Runga album)	90.77379	https://en.wikipedia.org/wiki/Close Your Eyes (Bic Runga album)	Close Your Eyes <hit>is</hit> the fifth studio <hit>album</hit> <hit>by</hit> <hit>New</hit> <hit>Zealand</hit> <hit>singer</hit>-song writer <hit>Bic</hit> <hit>Runga</hit>. ---#--- Close Your Eyes (<hit>Bic</hit> <hit>Runga</hit> <hit>album</hit>)
All tracks by Bic Runga.	All tracks by <e:Bic Runga>Bic Runga</e>.	Drive (Bic Runga album)	89.630394	https://en.wikipedia.org/wiki/Drive (Bic Runga album)	All tracks <hit>by</hit> <hit>Bic</hit> <hit>Runga</hit>. ---#--- Drive (<hit>Bic</hit> <hit>Runga</hit> <hit>album</hit>)
"Sorry" is a song by New Zealand recording artist, Bic Runga.	"Sorry" is a song by New Zealand recording artist, <e:Bic Runga>Bic Runga</e>. The single was released in <e:Australia>Australia</e> and <e:Germany>Germany</e> only as the final single from her debut studio album, <e:Drive (Bic Runga album)>Drive</e> (1997).	Sorry (Bic Runga song)	89.33654	https://en.wikipedia.org/wiki/Sorry (Bic Runga song)	"Sorry" <hit>is</hit> <hit>a</hit> song <hit>by</hit> <hit>New</hit> <hit>Zealand</hit> recording artist, <hit>Bic</hit> <hit>Runga</hit>. ---#--- Sorry (<hit>Bic</hit> <hit>Runga</hit> song)
In November 2008, Runga released Try to Remember Everything which is a collection of unreleased, new and rare Bic Runga recordings from 1996 to 2008.	In November 2008, Runga released <e:Try to Remember Everything>Try to Remember Everything</e> which is a collection of unreleased, new and rare Bic Runga recordings from 1996 to 2008. The album was certified Gold in New Zealand on 14 December 2008, selling over 7,500 copies.	Bic Runga	89.24142	https://en.wikipedia.org/wiki/Bic Runga	In November 2008, <hit>Runga</hit> released Try to Remember Everything which <hit>is</hit> <hit>a</hit> collection of unreleased, <hit>new</hit> ---#--- <hit>Bic</hit> <hit>Runga</hit>
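
To make the format concrete, below is a minimal sketch (not the official loader) for parsing one retrieval block, i.e. the query line followed by its retrieved rows. How blocks are delimited in the file on disk is an assumption here; adapt it to the actual output of generate_data.py:

def parse_block(lines):
    # lines[0] is the query line, lines[1:] are the retrieved rows.
    sentence, _, entity_str = lines[0].partition("\t")
    entities = [e.strip() for e in entity_str.split("|") if e.strip()]
    results = []
    for row in lines[1:]:
        fields = row.split("\t")
        if len(fields) < 6:
            continue  # skip malformed rows
        sent, paragraph, title, score, url, hits = fields[:6]
        hits_sentence, _, hits_title = hits.partition("---#---")
        results.append({
            "sentence": sent,
            "paragraph": paragraph,
            "title": title,
            "score": float(score),
            "url": url,
            "hits_sentence": hits_sentence.strip(),
            "hits_title": hits_title.strip(),
        })
    return {"query": sentence, "entities": entities, "results": results}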

If you want to do iterative retrieval at the entity level, convert the model predictions to CoNLL format and then perform entity-level retrieval, as sketched below.
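
As a rough illustration of that conversion step (the column layout expected by generate_data.py is an assumption here), the sketch below writes per-token predictions as CoNLL files and extracts the predicted entity mentions from the BIO tags:

def write_conll(sentences, path):
    # sentences: list of sentences, each a list of (token, bio_tag) pairs.
    with open(path, "w", encoding="utf-8") as f:
        for sent in sentences:
            for token, tag in sent:
                f.write(f"{token}\t{tag}\n")
            f.write("\n")  # blank line separates sentences

def bio_spans(sent):
    # Extract (entity text, label) spans from one BIO-tagged sentence.
    spans, current, label = [], [], None
    for token, tag in sent + [("", "O")]:  # sentinel flushes the last span
        if tag.startswith("B-"):
            if current:
                spans.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(token)
        else:
            if current:
                spans.append((" ".join(current), label))
            current, label = [], None
    return spans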

Context Processing

Here we take the code-mixed track (mix) as an example of generating contexts for the datasets.

Usage:

$ python kb/context_process.py -h
usage: context_process.py [-h] [--retrieval_file RETRIEVAL_FILE]
                [--conll_folder CONLL_FOLDER] [--lang LANG] [--use_sentence]
                [--use_paragraph_entity]

optional arguments:
  -h, --help            show this help message and exit
  --retrieval_file RETRIEVAL_FILE
                        The retrieved contexts from the knowledge base.
  --conll_folder CONLL_FOLDER
                        The data folder you want to generate contexts, the
                        code will read train, dev, test data in the folder in
                        conll formatting.
  --lang LANG           The language code of the data, for example "en". We
                        have special processing for Chinese ("zh") and Code-
                        mixed ("mix").
  --use_sentence        use matched sentence in the retrieval results as the
                        contexts
  --use_paragraph_entity
                        use matched sentence and the wiki anchor in the
                        retrieval results as the contexts

Given the retrieved contexts and the CoNLL data folder, run the following to generate contexts for Wiki-Para:

python kb/context_process.py --retrieval_file semeval_retrieve_res/mix.conll --conll_folder semeval_test/MIX_Code_mixed --lang mix

To generate contexts for Wiki-Sent-link, run:

python kb/context_process.py --retrieval_file semeval_retrieve_res/mix.conll --conll_folder semeval_test/MIX_Code_mixed --lang mix --use_sentence

To generate contexts for Wiki-Sent, run:

python kb/context_process.py --retrieval_file semeval_retrieve_res/mix.conll --conll_folder semeval_test/MIX_Code_mixed --lang mix --use_sentence --use_paragraph_entity
  • Note: the file semeval_retrieve_res/mix.conll contains the retrieval results for all the sets in MIX_Code_mixed. You may need to modify the code to suit your own requirements. For more details, see lines 972-1002 and lines 1006-1029.

Multi-stage Fine-tuning

Taking the transfer from multilingual to monolingual models as an example, we first train a multilingual model (the same model as in Training the multilingual models, but trained only on the training data):

CUDA_VISIBLE_DEVICES=0 python train.py --config config/xlmr-large-pretuned-tuned-wiki-first_3epoch_1batch_4accumulate_0.000005lr_10000lrrate_multi_monolingual_crf_fast_norelearn_sentbatch_sentloss_nodev_finetune_saving_amz_doc_wiki_v3_ner24.yaml

In the config file, you can find:

...
train:
  ...
  save_finetuned_embedding: true
  ...
...

The code will save the fine-tuned embeddings at the end of each epoch when save_finetuned_embedding is set to true. In this case, you can find the saved embeddings at resources/taggers/xlmr-large-pretuned-tuned-wiki-first_3epoch_1batch_4accumulate_0.000005lr_10000lrrate_multi_monolingual_crf_fast_norelearn_sentbatch_sentloss_nodev_finetune_saving_amz_doc_wiki_v3_ner24/xlm-roberta-large-ft10w.

Then you can use the fine-tuned xlm-roberta-large-ft10w embeddings to initialize the XLM-R embeddings. Taking fine-tuning the English monolingual model as an example, set the embeddings in config/xlmr-large-pretuned-tuned-wiki-full-first_10epoch_1batch_4accumulate_0.000005lr_10000lrrate_en_monolingual_crf_fast_norelearn_sentbatch_sentloss_nodev_finetune_saving_amz_doc_wiki_v3_ner20.yaml as:

embeddings:
  TransformerWordEmbeddings-0:
    fine_tune: true
    layers: '-1'
    model: resources/taggers/xlmr-large-pretuned-tuned-wiki-first_3epoch_1batch_4accumulate_0.000005lr_10000lrrate_multi_monolingual_crf_fast_norelearn_sentbatch_sentloss_nodev_finetune_saving_amz_doc_wiki_v3_ner24/xlm-roberta-large-ft10w
    pooling_operation: first

Run the model training:

CUDA_VISIBLE_DEVICES=0 python train.py --config config/xlmr-large-pretuned-tuned-wiki-full-first_10epoch_1batch_4accumulate_0.000005lr_10000lrrate_en_monolingual_crf_fast_norelearn_sentbatch_sentloss_nodev_finetune_saving_amz_doc_wiki_v3_ner20.yaml

Majority Voting Ensemble

We provide an example of majority-voting ensembling. Download all the English predictions here and run:

python ensemble_prediction.py 
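
ensemble_prediction.py is the official implementation; the sketch below only illustrates the idea of token-level majority voting over several prediction files. The assumed file layout (whitespace-separated columns with the token first and the predicted tag last, sentences separated by blank lines, identical tokenization across files) may differ from the actual prediction format:

from collections import Counter

def read_predictions(path):
    # Read a CoNLL-style prediction file into sentences of (token, tag) pairs.
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                if current:
                    sentences.append(current)
                    current = []
                continue
            cols = line.split()
            current.append((cols[0], cols[-1]))
    if current:
        sentences.append(current)
    return sentences

def majority_vote(prediction_files):
    all_preds = [read_predictions(p) for p in prediction_files]
    voted = []
    for sent_group in zip(*all_preds):  # the same sentence across all files
        tokens = [tok for tok, _ in sent_group[0]]
        tags = []
        for i in range(len(tokens)):
            votes = Counter(sent[i][1] for sent in sent_group)
            tags.append(votes.most_common(1)[0][0])
        voted.append(list(zip(tokens, tags)))
    return voted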

(Optional) CE and ACE Models

We also provide code for running CE and ACE (Section 5.6 of the paper) for monolingual models; you can check the guide here. To select embedding candidates, besides the embeddings listed in the ACE embedding list, we recommend using the fine-tuned XLM-R embedding checkpoints of the multilingual and monolingual models. config/xlmr-task-wiki-extdoc_en-xlmr-task-wiki-extdoc_multi-xlmr-pretuned-wiki-tuned_word_flair_mflair_55b-elmo_150epoch_32batch_0.1lr_1000hidden_en_crf_reinforce_doc_freeze_norelearn_nodev_amz_wiki_v3_ner6.yaml is an example of an ACE configuration we tried during system building:

embeddings:
  ELMoEmbeddings-0: #ELMo
    options_file: elmo/elmo_2x4096_512_2048cnn_2xhighway_5.5B_options.json
    weight_file: elmo/elmo_2x4096_512_2048cnn_2xhighway_5.5B_weights.hdf5
  FastWordEmbeddings-0:
    embeddings: en
    freeze: true
  FlairEmbeddings-0: #Flair
    model: en-forward
  FlairEmbeddings-1:
    model: en-backward
  FlairEmbeddings-2:
    model: multi-forward
  FlairEmbeddings-3:
    model: multi-backward
  TransformerWordEmbeddings-0: #Fine-tuned XLM-R trained on English dataset
    layers: '-1'
    model: resources/taggers/xlmr-large-pretuned-tuned-wiki-first_10epoch_1batch_4accumulate_0.000005lr_10000lrrate_en_monolingual_crf_fast_norelearn_sentbatch_sentloss_nodev_finetune_saving_amz_doc_wiki_v3_ner24/xlm-roberta-large
    pooling_operation: first
    use_internal_doc: true
  TransformerWordEmbeddings-1: #Fine-tuned RoBERTa trained on English dataset
    layers: '-1'
    model: resources/taggers/en-xlmr-large-first_10epoch_1batch_4accumulate_0.000005lr_5000lrrate_en_monolingual_crf_fast_norelearn_sentbatch_sentloss_nodev_saving_finetune_amz_doc_wiki_v3_ner25/roberta-large
    pooling_operation: first
    use_internal_doc: true
  TransformerWordEmbeddings-2: #Fine-tuned XLMR trained on Multilingual dataset
    layers: '-1'
    model: resources/taggers/xlmr-large-pretuned-tuned-new-first_3epoch_1batch_4accumulate_0.000005lr_10000lrrate_multi_monolingual_crf_fast_norelearn_sentbatch_sentloss_nodev_finetune_saving_amz_doc_wiki_v3_ner24/xlm-roberta-large
    pooling_operation: first

To run CE, you can change the config like this:

train:
  ...
  max_episodes: 1
  max_epochs: 300
  ...

Config File

You can find a description of each part of the config file here.

Citing Us

If you find the code helpful, please cite:

@article{wang2022damonlp,
      title={{DAMO-NLP at SemEval-2022 Task 11: A Knowledge-based System for Multilingual Named Entity Recognition}}, 
      author={Xinyu Wang and Yongliang Shen and Jiong Cai and Tao Wang and Xiaobin Wang and Pengjun Xie and Fei Huang and Weiming Lu and Yueting Zhuang and Kewei Tu and Wei Lu and Yong Jiang},
      year={2022},
      eprint={2203.00545},
      url={https://arxiv.org/abs/2203.00545},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@inproceedings{wang2021improving,
    title = "{{Improving Named Entity Recognition by External Context Retrieving and Cooperative Learning}}",
    author = "Wang, Xinyu and Jiang, Yong and Bach, Nguyen and Wang, Tao and Huang, Zhongqiang and Huang, Fei and Tu, Kewei",
    booktitle = "{the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (\textbf{ACL-IJCNLP 2021})}",
    address = "Online",
    month = aug,
    year = "2021",
    publisher = "Association for Computational Linguistics",
}

If you find the CE and ACE models helpful, please cite:

@inproceedings{wang2020automated,
    title = "{{Automated Concatenation of Embeddings for Structured Prediction}}",
    author = "Wang, Xinyu and Jiang, Yong and Bach, Nguyen and Wang, Tao and Huang, Zhongqiang and Huang, Fei and Tu, Kewei",
    booktitle = "{the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (\textbf{ACL-IJCNLP 2021})}",
    month = aug,
    address = "Online",
    year = "2021",
    publisher = "Association for Computational Linguistics",
}

@inproceedings{wang-etal-2020-embeddings,
    title = "{More Embeddings, Better Sequence Labelers?}",
    author = "Wang, Xinyu  and
      Jiang, Yong  and
      Bach, Nguyen  and
      Wang, Tao  and
      Huang, Zhongqiang  and
      Huang, Fei  and
      Tu, Kewei",
    booktitle = "{{\bf EMNLP-Findings 2020}}",
    month = nov,
    year = "2020",
    address = "Online",
    % publisher = "Association for Computational Linguistics",
    % url = "https://www.aclweb.org/anthology/2020.findings-emnlp.356",
    % doi = "10.18653/v1/2020.findings-emnlp.356",
    pages = "3992--4006",
}

Acknowledgement

The code is built on the great repo flair (version 0.4.3) and has been modified substantially. It also supports our previous work, such as multilingual knowledge distillation (MultilangStructureKD), automated concatenation of embeddings (ACE), and utilizing external contexts (CLNER). You can also try these approaches in this repo.

Contact

Feel free to post questions or comments as issues, or email Xinyu Wang.

For questions about the knowledge retrieval module, you can also ask Yongliang Shen and Jiong Cai.
