Entity Disambiguation as text extraction (ACL 2022)

Overview

ExtEnD: Extractive Entity Disambiguation


This repository contains the code of ExtEnD: Extractive Entity Disambiguation, a novel approach to Entity Disambiguation (i.e. the task of linking a mention in context with its most suitable entity in a reference knowledge base) where we reformulate this task as a text extraction problem. This work was accepted at ACL 2022.

If you find our paper, code or framework useful, please reference this work in your paper:

@inproceedings{barba-etal-2021-extend,
    title = "{E}xt{E}n{D}: Extractive Entity Disambiguation",
    author = "Barba, Edoardo  and
      Procopio, Luigi  and
      Navigli, Roberto",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics",
    month = may,
    year = "2022",
    address = "Online and Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
}


ExtEnD is built on top of the classy library. If you are interested in using this project, we recommend first checking its introduction, although this is not strictly required to train and use the models.

Finally, we also developed a few additional tools that make it simple to use and test ExtEnD models: a spaCy component and a Docker image, both described in the sections below.

Set up the environment

Requirements:

  • Debian-based (e.g. Debian, Ubuntu, ...) system
  • conda installed

To quickly set up the environment to use ExtEnD or replicate our experiments, you can use the bash script setup.sh. Its only requirements are the two listed above: a Debian-based system and conda.

bash setup.sh

Checkpoints

We release the following checkpoints:

Model            | Training Dataset | Avg Score
Longformer Large | AIDA             | 85.8

Once you have downloaded the files, untar them inside the experiments/ folder.

# move file to experiments folder
mv ~/Downloads/extend-longformer-large.tar.gz experiments/
# untar
tar -xf experiments/extend-longformer-large.tar.gz -C experiments/
rm experiments/extend-longformer-large.tar.gz

Data

All the datasets used to train and evaluate ExtEnD can be downloaded using the following script from the Facebook GENRE repository.

We strongly recommend you organize them in the following structure under the data folder as it is used by several scripts in the project.

data
├── aida
│   ├── test.aida
│   ├── train.aida
│   └── validation.aida
└── out_of_domain
    ├── ace2004-test-kilt.ed
    ├── aquaint-test-kilt.ed
    ├── clueweb-test-kilt.ed
    ├── msnbc-test-kilt.ed
    └── wiki-test-kilt.ed
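
Since several scripts rely on this layout, it can help to check it before launching anything. The following is a minimal sketch, not part of the repository: the helper name missing_data_files is hypothetical, and the file list is simply copied from the tree above.

```python
from pathlib import Path

# Expected layout, copied from the directory tree above.
EXPECTED_FILES = {
    "aida": ["train.aida", "validation.aida", "test.aida"],
    "out_of_domain": [
        "ace2004-test-kilt.ed",
        "aquaint-test-kilt.ed",
        "clueweb-test-kilt.ed",
        "msnbc-test-kilt.ed",
        "wiki-test-kilt.ed",
    ],
}


def missing_data_files(data_dir):
    """Return the expected files that are absent under data_dir.

    Hypothetical helper: it only checks that the files exist,
    not that their content is in the right format.
    """
    root = Path(data_dir)
    return [
        str(Path(subdir) / name)
        for subdir, names in EXPECTED_FILES.items()
        for name in names
        if not (root / subdir / name).is_file()
    ]


if __name__ == "__main__":
    missing = missing_data_files("data")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("Data folder layout looks complete.")
```

Run it from the repository root; an empty result means every file in the tree above was found.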

Training

To train a model from scratch, you just have to use the following command:

classy train qa <folder> -n my-model-name --profile aida-longformer-large-gam -pd extend

<folder> can be any folder containing exactly 3 files:

  • train.aida
  • validation.aida
  • test.aida

This is required to let classy automatically discover the dataset splits. For instance, to re-train our AIDA-only model:

classy train qa data/aida -n my-model-name --profile aida-longformer-large-gam -pd extend

Note that <folder> can be any folder, as long as:

  • it contains these 3 files
  • they are in the same format as the files in data/aida

So if you want to train on different datasets, just create the corresponding directory and you are ready to go!

If you want to modify some training hyperparameters, you just have to edit the aida-longformer-large-gam profile in the configurations/ folder. You can take a look at the modifiable parameters by adding the --print parameter to the training command. You can find more on this in the official classy documentation.

Predict

You can use classy syntax to perform file prediction:

classy predict -pd extend file \
    experiments/extend-longformer-large \
    data/aida/test.aida \
    -o data/aida_test_predictions.aida

Evaluation

To evaluate a checkpoint, you can run the bash script scripts/full_evaluation.sh, passing its path as an input argument. This will evaluate the model provided against both AIDA and OOD resources.

# syntax: bash scripts/full_evaluation.sh <ckpt-path>
bash scripts/full_evaluation.sh experiments/extend-longformer-large/2021-10-22/09-11-39/checkpoints/best.ckpt

If you are interested in AIDA-only evaluation, you can use scripts/aida_evaluation.sh instead (same syntax).

Furthermore, you can evaluate the model on any dataset that follows the same format as the original ones with the following command:

classy evaluate \
    experiments/extend-longformer-large/2021-10-22/09-11-39/checkpoints/best.ckpt \
    data/aida/test.aida \
    -o data/aida_test_evaluation.txt \
    -pd extend

spaCy

You can also use ExtEnD with spaCy, allowing you to use our system with a seamless interface that tackles full end-to-end entity linking. To do so, you just need to have cloned the repo and run setup.sh to configure the environment. Then, you will be able to add extend as a custom component in the following way:

import spacy
from extend import spacy_component

nlp = spacy.load("en_core_web_sm")

extend_config = dict(
    checkpoint_path="<ckpt-path>",
    mentions_inventory_path="<inventory-path>",
    device=0,
    tokens_per_batch=4000,
)

nlp.add_pipe("extend", after="ner", config=extend_config)

input_sentence = "Japan began the defence of their title " \
                 "with a lucky 2-1 win against Syria " \
                 "in a championship match on Friday."

doc = nlp(input_sentence)

# [(Japan, Japan National Football Team), (Syria, Syria National Football Team)]
disambiguated_entities = [(ent.text, ent._.disambiguated_entity) for ent in doc.ents]

Where:

  • <ckpt-path> is the path to a pretrained checkpoint of extend that you can find in the Checkpoints section, and
  • <inventory-path> is the path to a file containing the mapping from mentions to the corresponding candidates.

We support two formats for <inventory-path>:

  • tsv:
    $ head -1 <inventory-path>
    Rome [TAB] Rome City [TAB] Rome Football Team [TAB] Roman Empire [TAB] ...
    That is, <inventory-path> is a tab-separated file where, for each row, we have the mention (Rome) followed by its possible entities.
  • sqlite: a sqlite3 database with a candidate table with two columns:
    • mention (text PRIMARY KEY)
    • entities (text). This must be a tab-separated list of the corresponding entities.

We release the following pre-computed inventories that you can use as <inventory-path> (we recommend creating a folder data/inventories/ and placing the downloaded files there, e.g., <inventory-path> = data/inventories/le-and-titov-2018-inventory.min-count-2.sqlite3):

Inventory | Number of Mentions | Source
le-and-titov-2018-inventory.min-count-2.tsv | 12090972 | Cleaned version of the candidate set released by Le and Titov (2018). We discard mentions whose count is less than 2.
[Recommended] le-and-titov-2018-inventory.min-count-2.sqlite3 | 12090972 | Cleaned version of the candidate set released by Le and Titov (2018). We discard mentions whose count is less than 2.
le-and-titov-2018-inventory.tsv | 21571265 | The candidate set released by Le and Titov (2018)
le-and-titov-2018-inventory.sqlite3 | 21571265 | The candidate set released by Le and Titov (2018)

Note that, as long as you respect either of these two formats, you can also create and use your own inventory!
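
As an illustration of the two formats, here is a minimal sketch that converts a tsv inventory into the sqlite layout described above, using only the Python standard library. It is not taken from the repository: the function names tsv_to_sqlite and candidates_for are hypothetical, the table name candidate is our reading of the description above, and overwriting duplicate mentions with INSERT OR REPLACE is an assumption.

```python
import sqlite3


def tsv_to_sqlite(tsv_path, sqlite_path):
    """Convert a tab-separated inventory into a sqlite3 inventory.

    Each tsv row is: mention, then its candidate entities, all tab-separated.
    The sqlite table stores the entities as a single tab-separated string.
    """
    conn = sqlite3.connect(sqlite_path)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS candidate "
            "(mention text PRIMARY KEY, entities text)"
        )
        with open(tsv_path, encoding="utf-8") as f:
            for line in f:
                if not line.strip():
                    continue  # skip empty rows
                fields = line.rstrip("\r\n").split("\t")
                mention, entities = fields[0], "\t".join(fields[1:])
                # Assumption: a repeated mention overwrites the earlier row.
                conn.execute(
                    "INSERT OR REPLACE INTO candidate (mention, entities) "
                    "VALUES (?, ?)",
                    (mention, entities),
                )
        conn.commit()
    finally:
        conn.close()


def candidates_for(sqlite_path, mention):
    """Return the list of candidate entities stored for a mention."""
    conn = sqlite3.connect(sqlite_path)
    try:
        row = conn.execute(
            "SELECT entities FROM candidate WHERE mention = ?", (mention,)
        ).fetchone()
    finally:
        conn.close()
    return row[0].split("\t") if row and row[0] else []
```

With the example row shown for the tsv format, candidates_for(path, "Rome") would return the three entities listed after the mention.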

Docker container

Finally, we also release a docker image running two services, a streamlit demo and a REST service:

$ docker run -p 22001:22001 -p 22002:22002 --rm -itd poccio/extend:1.0.1
<container id>

Now you can:

  • check out the streamlit demo at http://127.0.0.1:22001/
  • invoke the REST service running at http://127.0.0.1:22002/ (the OpenAPI documentation is available at http://127.0.0.1:22002/docs):
    $ curl -X POST http://127.0.0.1:22002/ -H 'Content-Type: application/json' -d '[{"text": "Rome is in Italy"}]'
    [{"text":"Rome is in Italy","disambiguated_entities":[{"char_start":0,"char_end":4,"mention":"Rome","entity":"Rome"},{"char_start":11,"char_end":16,"mention":"Italy","entity":"Italy"}]}]
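
The curl call above can also be issued from Python. The sketch below uses only the standard library and assumes the container from the previous command is running locally; the helper names build_request, extract_links and disambiguate are hypothetical, and the request and response shapes are taken from the curl example. Note that the Content-Type: application/json header is set explicitly, as in the curl call.

```python
import json
import urllib.request

SERVICE_URL = "http://127.0.0.1:22002/"  # port taken from the docker run command


def build_request(texts, url=SERVICE_URL):
    """Build a POST request whose JSON body mirrors the curl example."""
    payload = json.dumps([{"text": text} for text in texts]).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_links(response_body):
    """Turn the JSON response into one list of (mention, entity) pairs per text."""
    documents = json.loads(response_body)
    return [
        [(e["mention"], e["entity"]) for e in doc["disambiguated_entities"]]
        for doc in documents
    ]


def disambiguate(texts, url=SERVICE_URL):
    """Send the texts to the running service and return the extracted pairs."""
    with urllib.request.urlopen(build_request(texts, url)) as response:
        return extract_links(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Requires the docker container to be up.
    print(disambiguate(["Rome is in Italy"]))
```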

Acknowledgments

The authors gratefully acknowledge the support of the ERC Consolidator Grant MOUSSE No. 726487 under the European Union’s Horizon 2020 research and innovation programme.

This work was supported in part by the MIUR under grant “Dipartimenti di eccellenza 2018-2022” of the Department of Computer Science of the Sapienza University of Rome.

License

This work is under the Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.

Comments
  • Failure to train the model using classy

    Hi, nice work!

    I would like to reproduce your training process. I installed the dependencies and downloaded the dataset following the README. I have renamed aida-train-kilt.jsonl to train.jsonl etc. I used the following command in the root directory

    classy train qa data/aida -n my-model-name --profile aida-longformer-large-gam -pd extend
    

    and got the following error

    Error executing job with overrides: ['device=cuda', 'exp_name=my-model-name', 'data.datamodule.dataset_path=data/aida']
    Traceback (most recent call last):
      File "/home/ICT2000/jxu/miniconda3/envs/extend/lib/python3.8/site-packages/classy/scripts/cli/train.py", line 620, in <lambda>
        lambda cfg: _main_mock(cfg, blames=blames if args.print else None)
      File "/home/ICT2000/jxu/miniconda3/envs/extend/lib/python3.8/site-packages/classy/scripts/cli/train.py", line 208, in _main_mock
        train(cfg)
      File "/home/ICT2000/jxu/miniconda3/envs/extend/lib/python3.8/site-packages/classy/scripts/model/train.py", line 22, in train
        pl_data_module.prepare_data()
      File "/home/ICT2000/jxu/miniconda3/envs/extend/lib/python3.8/site-packages/pytorch_lightning/core/datamodule.py", line 474, in wrapped_fn
        fn(*args, **kwargs)
      File "/home/ICT2000/jxu/miniconda3/envs/extend/lib/python3.8/site-packages/classy/data/data_modules.py", line 161, in prepare_data
        shuffle_and_store_dataset(
      File "/home/ICT2000/jxu/miniconda3/envs/extend/lib/python3.8/site-packages/classy/utils/data.py", line 39, in shuffle_and_store_dataset
        samples = shuffle_dataset(dataset_path, data_driver)
      File "/home/ICT2000/jxu/miniconda3/envs/extend/lib/python3.8/site-packages/classy/utils/data.py", line 29, in shuffle_dataset
        samples = load_dataset(dataset_path, data_driver)
      File "/home/ICT2000/jxu/miniconda3/envs/extend/lib/python3.8/site-packages/classy/utils/data.py", line 22, in load_dataset
        return list(data_driver.read_from_path(dataset_path))
      File "/home/ICT2000/jxu/miniconda3/envs/extend/lib/python3.8/site-packages/classy/data/data_drivers.py", line 620, in read
        yield QASample(**json.loads(line))
    TypeError: __init__() missing 2 required positional arguments: 'context' and 'question'
    

    I see that extend/data contains another data_drivers module, but classy still used its own version. Since I am new to classy, I am not sure how to proceed. Thank you!

    opened by cnut1648 4
  • File not found Error: While adding extend to spacy nlp pipeline

    Using the same classy version mentioned in requirements.txt

    Traceback (most recent call last):
      File "spacy_extend.py", line 13, in <module>
        nlp.add_pipe("extend", after="ner", config=extend_config)
      File "/home/vasista/miniconda3/envs/extendtest/lib/python3.8/site-packages/spacy/language.py", line 792, in add_pipe
        pipe_component = self.create_pipe(
      File "/home/vasista/miniconda3/envs/extendtest/lib/python3.8/site-packages/spacy/language.py", line 674, in create_pipe
        resolved = registry.resolve(cfg, validate=validate)
      File "/home/vasista/miniconda3/envs/extendtest/lib/python3.8/site-packages/thinc/config.py", line 746, in resolve
        resolved, _ = cls._make(
      File "/home/vasista/miniconda3/envs/extendtest/lib/python3.8/site-packages/thinc/config.py", line 795, in _make
        filled, _, resolved = cls._fill(
      File "/home/vasista/miniconda3/envs/extendtest/lib/python3.8/site-packages/thinc/config.py", line 867, in _fill
        getter_result = getter(*args, **kwargs)
      File "/home/vasista/extend/extend/spacy_component.py", line 86, in __init__
        self.model = load_checkpoint(checkpoint_path, device)
      File "/home/vasista/extend/extend/spacy_component.py", line 22, in load_checkpoint
        model = load_classy_module_from_checkpoint(checkpoint_path)
      File "/home/vasista/miniconda3/envs/extendtest/lib/python3.8/site-packages/classy/utils/lightning.py", line 57, in load_classy_module_from_checkpoint
        conf = load_training_conf_from_checkpoint(checkpoint_path)
      File "/home/vasista/miniconda3/envs/extendtest/lib/python3.8/site-packages/classy/utils/lightning.py", line 23, in load_training_conf_from_checkpoint
        conf = OmegaConf.load(f"{experiment_folder}/.hydra/{conf_file}")
      File "/home/vasista/miniconda3/envs/extendtest/lib/python3.8/site-packages/omegaconf/omegaconf.py", line 183, in load
        with io.open(os.path.abspath(file_), "r", encoding="utf-8") as f:
    FileNotFoundError: [Errno 2] No such file or directory: '/home/vasista/extend/.hydra/config.yaml'

    opened by Vasistareddy 4
  • Spacy example returns None

    Hi, I'm trying out your system and after installing everything the right way (some packages needed to be down/upgraded), I ran the spacy example on the longformer with le&titov's candidate sets.

    The code is the following and the output warning message is below.

    It says that dataset.base is empty. Might that be the problem?

    Output: [('Japan', None), ('Syria', None), ('Friday', None)]

    import spacy
    from extend import spacy_component
    
    nlp = spacy.load("en_core_web_sm")
    
    extend_config = dict(
        checkpoint_path="../extend-longformer-large/2021-10-22/09-11-39/checkpoints/best.ckpt",
        mentions_inventory_path="../le-and-titov-2018-inventory.min-count-2.sqlite3",
        device=0,
        tokens_per_batch=4000,
    )
    
    nlp.add_pipe("extend", after="ner", config=extend_config)
    
    input_sentence = "Japan began the defence of their title " \
                     "with a lucky 2-1 win against Syria " \
                     "in a championship match on Friday."
    
    doc = nlp(input_sentence)
    
    # [(Japan, Japan National Football Team), (Syria, Syria National Football Team)]
    disambiguated_entities = [(ent.text, ent._.disambiguated_entity) for ent in doc.ents]

    2022-05-05 13:39:07.458 WARNING classy.data.dataset.base: Token batch size 4000 < max length 4096. This might result in batches with only 1 sample that contain more token than the specified token batch size
    2022-05-05 13:39:07.459 WARNING classy.data.dataset.base: Dataset empty
    
    
    opened by Valdegg 1
  • REST service not working

    Hi,

    I tried to run a REST service with the docker image that you make available. However, I do not receive any result when I try to disambiguate even simple sentences.

    Example:

    resp = requests.post(url="http://127.0.0.1:22002/", data='[{"text":"Bob Dylan is a famous singer."}]')

    The json result is

    [{'text': 'Bob Dylan is a singer.', 'disambiguated_entities': []}]

    opened by sntcristian 1
Owner
Sapienza NLP group
The NLP group at the Sapienza University of Rome