Text Extraction Formulation + Feedback Loop for state-of-the-art WSD (EMNLP 2021)

Sapienza NLP group

Last update: Dec 13, 2022

Related tags

Deep Learning consec

Overview

ConSeC

ConSeC is a novel approach to Word Sense Disambiguation (WSD), accepted at EMNLP 2021. It frames WSD as a text extraction task and features a feedback loop strategy that allows the disambiguation of a target word to be conditioned not only on its context but also on the explicit senses assigned to nearby words.

If you find our paper, code or framework useful, please reference this work in your paper:

@inproceedings{barba-etal-2021-consec,
    title = "{C}on{S}e{C}: Word Sense Disambiguation as Continuous Sense Comprehension",
    author = "Barba, Edoardo  and
      Procopio, Luigi  and
      Navigli, Roberto",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.112",
    pages = "1492--1503",
}

Setup Env

Requirements:

Debian-based (e.g. Debian, Ubuntu, ...) system
conda installed

Run the following command to quickly setup the env needed to run our code:

bash setup.sh

It's a bash command that will setup a conda environment with everything you need. Just answer the prompts as you proceed.

Finally, download the following resources:

Wikipedia Freqs. This is a compressed folder containing the files needed to compute the PMI score. Once downloaded, place the file inside data/ and run:
```
cd data/
tar -xvf pmi.tar.gz
rm pmi.tar.gz
cd ..
```
optionally, you can download the checkpoint trained on Semcor only that achieves 82.0 on ALL; place it inside the experiments/ folder (we recommend experiments/released-ckpts/)

Train

This is a PyTorch Lightning project with hydra configurations files, so most of the training parameters (e.g. datasets, optimizer, model, ...) are specified in yaml files. If you are not familiar with hydra and want to play a bit with training new models, we recommend going first through hydra tutorials; otherwise, you can skip this section (but you should still checkout hydra as it's an amazing piece of software!).

Anyway, training is done via the training script, src/scripts/model/train.py, and its parameters are read from the .yaml files in the conf/ folders (but for the conf/test/ folder which is used for evaluation). Once you applied all your desired changes, you can run the new training with:

(consec) user@user-pc:~/consec$ PYTHONPATH=$(pwd) python src/scripts/model/train.py

Evaluate

Evaluation is similarly handled via hydra configuration files, located in the conf/test/ folder. There's a single file there, which specifies how to evaluate (e.g. model checkpoint and test to use) against the framework of Raganato et al. (2017) (we will include XL-WSD, along with its checkpoints, later on). Parameters are quite self-explanatory and you might be most interested in the following ones:

model.model_checkpoint: path to the target checkpoint to use
test_raganato_path: path to the test file to evaluate against

To make a practical example, to evaluate the checkpoint we released against SemEval-2007, run the following command:

(consec) user@user-pc:~/consec$ PYTHONPATH=$(pwd) python src/scripts/model/raganato_evaluate.py model.model_checkpoint=experiments/released-ckpts/consec_semcor_normal_best.ckpt test_raganato_path=data/WSD_Evaluation_Framework/Evaluation_Datasets/semeval2007/semeval2007

NOTE: test_raganato_path expects what we refer to as a raganato path, that is, a prefix path such that both {test_raganato_path}.data.xml and {test_raganato_path}.gold.key.txt exist (and have the same role as in the standard evaluation framework).

Interactive Predict

We also implemented an interactive predict that allows you to query the model interactively; given as input:

a word in a context
its candidate definitions
its context definitions the model will disambiguate the target word. Check it out with:

(consec) user@user-pc:~/consec$ PYTHONPATH=$(pwd) python src/scripts/model/predict.py experiments/released-ckpts/consec_semcor_normal_best.ckpt -t
Enter space-separated text: I have a beautiful dog
Target position: 4
Enter candidate lemma-def pairs. " --- " separated. Enter to stop
 * dog --- a member of the genus Canis
 * dog --- someone who is morally reprehensible
 * 
Enter context lemma-def-position tuples. " --- " separated. Position should be token position in space-separated input. Enter to stop
 * beautiful --- delighting the senses or exciting intellectual or emotional admiration --- 3
 * 
        # predictions
                 * 0.9939        dog     a member of the genus Canis 
                 * 0.0061        dog     someone who is morally reprehensible

The scores assigned to each prediction are their probabilities.

Acknowledgments

The authors gratefully acknowledge the support of the ERC Consolidator Grant MOUSSE No. 726487 under the European Union’s Horizon 2020 research and innovation programme.

This work was supported in part by the MIUR under grant “Dipartimenti di eccellenza 2018-2022” of the Department of Computer Science of the Sapienza University of Rome.

License

This work is under the Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license

State of the Art Neural Networks for Deep Learning

pyradox This python library helps you with implementing various state of the art neural networks in a totally customizable fashion using Tensorflow 2

60 May 29, 2022

Code for paper "A Critical Assessment of State-of-the-Art in Entity Alignment" (https://arxiv.org/abs/2010.16314)

A Critical Assessment of State-of-the-Art in Entity Alignment This repository contains the source code for the paper A Critical Assessment of State-of

16 Oct 14, 2022

Quickly comparing your image classification models with the state-of-the-art models (such as DenseNet, ResNet, ...)

Image Classification Project Killer in PyTorch This repo is designed for those who want to start their experiments two days before the deadline and ki

349 Dec 8, 2022

State of the art Semantic Sentence Embeddings

Contrastive Tension State of the art Semantic Sentence Embeddings Published Paper · Huggingface Models · Report Bug Overview This is the official code

88 Dec 30, 2022

QuickAI is a Python library that makes it extremely easy to experiment with state-of-the-art Machine Learning models.

152 Jan 2, 2023

LaneDet is an open source lane detection toolbox based on PyTorch that aims to pull together a wide variety of state-of-the-art lane detection models

LaneDet is an open source lane detection toolbox based on PyTorch that aims to pull together a wide variety of state-of-the-art lane detection models. Developers can reproduce these SOTA methods and build their own methods.

405 Jan 4, 2023

tsai is an open-source deep learning package built on top of Pytorch & fastai focused on state-of-the-art techniques for time series classification, regression and forecasting.

Time series Timeseries Deep Learning Pytorch fastai - State-of-the-art Deep Learning with Time Series and Sequences in Pytorch / fastai

2.8k Jan 8, 2023

State-of-the-art data augmentation search algorithms in PyTorch

MuarAugment Description MuarAugment is a package providing the easiest way to a state-of-the-art data augmentation pipeline. How to use You can instal

43 Dec 12, 2022

This is the unofficial code of Deep Dual-resolution Networks for Real-time and Accurate Semantic Segmentation of Road Scenes. which achieve state-of-the-art trade-off between accuracy and speed on cityscapes and camvid, without using inference acceleration and extra data

Deep Dual-resolution Networks for Real-time and Accurate Semantic Segmentation of Road Scenes Introduction This is the unofficial code of Deep Dual-re

113 Dec 23, 2022

Comments

Unable to find WNGT files.

Hello, I am interested in your project and want to try it out.

However, when I tried the experiment with WNGT, I got the following error: FileNotFoundError: [Errno 2] No such file or directory: 'data/WSD_Evaluation_Framework/Training_Corpora/WNG/glosses_main.gold.key.txt' The reason is that WSD_Evaluation_Framework folder does not contain WNG and WNE folders and their files.

Please advise where to find these files. Thank you.

opened by sonodasong 4
unclear source of the data/complemented-datasets folder

When I try to train the model I get the following error. It is unclear to me where the complemented datasets are coming from. They are not contained in the Raganto files. Can you please specify the origin of those?

FileNotFoundError: Error instantiating 'src.disambiguation_corpora.WordNetCorpus' : [Errno 2] No such file or directory: 'data/complemented-datasets/semeval2007/semeval2007.gold.key.txt'

opened by vlosing 1
Suggest to loosen the dependency on networkx
Hi, your project consec requires "networkx==2.5" in its dependency. After analyzing the source code, we found that the following versions of networkx can also be suitable without affecting your project, i.e., networkx 2.5.1. Therefore, we suggest to loosen the dependency on networkx from "networkx==2.5" to "networkx>=2.5,<=2.5.1" to avoid any possible conflict for importing more packages or for downstream projects that may use consec.

May I pull a request to further loosen the dependency on networkx?

By the way, could you please tell us whether such dependency analysis may be potentially helpful for maintaining dependencies easier during your development?

We also give our detailed analysis as follows for your reference:

Your project consec directly uses 4 APIs from package networkx.

networkx.algorithms.components.weakly_connected.weakly_connected_components, networkx.algorithms.cycles.find_cycle, networkx.exception.NetworkXNoCycle.__init__, networkx.classes.digraph.DiGraph.__init__

Beginning from the 4 APIs above, 5 functions are then indirectly called, including 4 networkx's internal APIs and 1 outsider APIs. The specific call graph is listed as follows (neglecting some repeated function occurrences).

[/SapienzaNLP/consec] +--networkx.algorithms.components.weakly_connected.weakly_connected_components | +--networkx.algorithms.components.weakly_connected._plain_bfs +--networkx.algorithms.cycles.find_cycle +--networkx.exception.NetworkXNoCycle.__init__ +--networkx.classes.digraph.DiGraph.__init__ | +--networkx.convert.to_networkx_graph | | +--networkx.convert.from_dict_of_dicts | | +--networkx.convert.from_dict_of_lists | | +--warnings.warn | | +--networkx.convert.from_edgelist

We scan networkx's versions and observe that during its evolution between any version from [2.5.1] and 2.5, the changing functions (diffs being listed below) have none intersection with any function or API we mentioned above (either directly or indirectly called by this project).

diff: 2.5(original) 2.5.1 [](no clear difference between the source codes of two versions)

As for other packages, the APIs of warnings are called by networkx in the call graph and the dependencies on these packages also stay the same in our suggested versions, thus avoiding any outside conflict.

Therefore, we believe that it is quite safe to loose your dependency on networkx from "networkx==2.5" to "networkx>=2.5,<=2.5.1". This will improve the applicability of consec and reduce the possibility of any further dependency conflict with other projects.
opened by Agnes-U 0

Owner

Sapienza NLP group

The NLP group at the Sapienza University of Rome

GitHub

Deep Text Search is an AI-powered multilingual text search and recommendation engine with state-of-the-art transformer-based multilingual text embedding (50+ languages).

Deep Text Search - AI Based Text Search & Recommendation System Deep Text Search is an AI-powered multilingual text search and recommendation engine w

19 Sep 29, 2022

Feedback is important: response-aware feedback mechanism for background based conversation

RFM The code for the paper: "Feedback is important: response-aware feedback mechanism for background based conversation." Requirements python 3.7 pyto

2 Sep 29, 2022

Source code for the GPT-2 story generation models in the EMNLP 2020 paper "STORIUM: A Dataset and Evaluation Platform for Human-in-the-Loop Story Generation"

Storium GPT-2 Models This is the official repository for the GPT-2 models described in the EMNLP 2020 paper [STORIUM: A Dataset and Evaluation Platfor

27 Dec 20, 2022

TSDF++: A Multi-Object Formulation for Dynamic Object Tracking and Reconstruction

TSDF++: A Multi-Object Formulation for Dynamic Object Tracking and Reconstruction TSDF++ is a novel multi-object TSDF formulation that can encode mult

130 Dec 29, 2022

Summary Explorer is a tool to visually explore the state-of-the-art in text summarization.

Summary Explorer Summary Explorer is a tool to visually inspect the summaries from several state-of-the-art neural summarization models across multipl

42 Aug 14, 2022

Implementation of NÜWA, state of the art attention network for text to video synthesis, in Pytorch

NÜWA - Pytorch (wip) Implementation of NÜWA, state of the art attention network for text to video synthesis, in Pytorch. This repository will be popul

463 Dec 28, 2022

Code for EMNLP 2021 main conference paper "Text AutoAugment: Learning Compositional Augmentation Policy for Text Classification"

Text-AutoAugment (TAA) This repository contains the code for our paper Text AutoAugment: Learning Compositional Augmentation Policy for Text Classific

105 Jan 3, 2023

Release of SPLASH: Dataset for semantic parse correction with natural language feedback in the context of text-to-SQL parsing

SPLASH: Semantic Parsing with Language Assistance from Humans SPLASH is dataset for the task of semantic parse correction with natural language feedba

Microsoft Research - Language and Information Technologies (MSR LIT)

35 Oct 31, 2022

Official Implementation of CoSMo: Content-Style Modulation for Image Retrieval with Text Feedback

CoSMo.pytorch Official Implementation of CoSMo: Content-Style Modulation for Image Retrieval with Text Feedback, Seungmin Lee*, Dongwan Kim*, Bohyung

54 Dec 8, 2022

Adversarial Robustness Toolbox (ART) - Python Library for Machine Learning Security - Evasion, Poisoning, Extraction, Inference - Red and Blue Teams

Adversarial Robustness Toolbox (ART) is a Python library for Machine Learning Security. ART provides tools that enable developers and researchers to defend and evaluate Machine Learning models and applications against the adversarial threats of Evasion, Poisoning, Extraction, and Inference. ART supports all popular machine learning frameworks (TensorFlow, Keras, PyTorch, MXNet, scikit-learn, XGBoost, LightGBM, CatBoost, GPy, etc.), all data types (images, tables, audio, video, etc.) and machine learning tasks (classification, object detection, speech recognition, generation, certification, etc.).

3.4k Jan 4, 2023

Text Extraction Formulation + Feedback Loop for state-of-the-art WSD (EMNLP 2021)

Related tags

Overview

ConSeC

Setup Env

Train

Evaluate

Interactive Predict

Acknowledgments

License

You might also like...

State of the Art Neural Networks for Deep Learning

Code for paper "A Critical Assessment of State-of-the-Art in Entity Alignment" (https://arxiv.org/abs/2010.16314)

Quickly comparing your image classification models with the state-of-the-art models (such as DenseNet, ResNet, ...)

State of the art Semantic Sentence Embeddings

QuickAI is a Python library that makes it extremely easy to experiment with state-of-the-art Machine Learning models.

LaneDet is an open source lane detection toolbox based on PyTorch that aims to pull together a wide variety of state-of-the-art lane detection models

tsai is an open-source deep learning package built on top of Pytorch & fastai focused on state-of-the-art techniques for time series classification, regression and forecasting.

State-of-the-art data augmentation search algorithms in PyTorch

This is the unofficial code of Deep Dual-resolution Networks for Real-time and Accurate Semantic Segmentation of Road Scenes. which achieve state-of-the-art trade-off between accuracy and speed on cityscapes and camvid, without using inference acceleration and extra data

Comments

Unable to find WNGT files.

unclear source of the data/complemented-datasets folder

Suggest to loosen the dependency on networkx

Owner

Sapienza NLP group

Deep Text Search is an AI-powered multilingual text search and recommendation engine with state-of-the-art transformer-based multilingual text embedding (50+ languages).

Feedback is important: response-aware feedback mechanism for background based conversation

Source code for the GPT-2 story generation models in the EMNLP 2020 paper "STORIUM: A Dataset and Evaluation Platform for Human-in-the-Loop Story Generation"

TSDF++: A Multi-Object Formulation for Dynamic Object Tracking and Reconstruction

Summary Explorer is a tool to visually explore the state-of-the-art in text summarization.

Implementation of NÜWA, state of the art attention network for text to video synthesis, in Pytorch

Code for EMNLP 2021 main conference paper "Text AutoAugment: Learning Compositional Augmentation Policy for Text Classification"

Release of SPLASH: Dataset for semantic parse correction with natural language feedback in the context of text-to-SQL parsing

Official Implementation of CoSMo: Content-Style Modulation for Image Retrieval with Text Feedback

Adversarial Robustness Toolbox (ART) - Python Library for Machine Learning Security - Evasion, Poisoning, Extraction, Inference - Red and Blue Teams