Cross-Document Coreference Resolution
This repository contains code and models for end-to-end cross-document coreference resolution, as described in our papers:
- Cross-document Coreference Resolution over Predicted Mentions (Findings of ACL 2021)
- Realistic Evaluation Principles for Cross-document Coreference Resolution (*SEM 2021)
The models are trained on ECB+, but they can be applied to any multi-document setting.
Getting started
- Install python3 requirements
pip install -r requirements.txt
Extract mentions and raw text from ECB+
Run the following script to extract the data from the ECB+ dataset and build the gold conll files. The ECB+ corpus can be downloaded here.
python get_ecb_data.py --data_path path_to_data
Training Instructions
The core of our model is the pairwise scorer between two spans, which indicates how likely it is that two spans belong to the same cluster.
Training method
We present three ways to train this pairwise scorer:
- Pipeline: first train a span scorer, then train the pairwise scorer using the same spans at each epoch.
- Continue: pre-train the span scorer, then train the pairwise scorer while continuing to train the span scorer.
- End-to-end: train both models together from scratch.
To choose the training method, set the value of `training_method` in `config_pairwise.json` to `pipeline`, `continue`, or `e2e`. In our paper, we found the `continue` method to perform best for event coreference, and we apply it to entity and ALL as well.
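For example, a minimal excerpt of the relevant field in `config_pairwise.json` (the rest of the file is omitted here):

```json
{
  "training_method": "continue"
}
```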
What are the labels?
In ECB+, the entity and event coreference clusters are annotated separately, making it possible to train a model on event or entity coreference only. Our model can therefore be trained on events, entities, or both. Set the value of `mention_type` in `config_pairwise.json` (and `config_span_scorer.json`) to `events`, `entities`, or `mixed` (corresponding to ALL in the paper).
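For example, the label choice is again a single field in the config (other fields omitted):

```json
{
  "mention_type": "events"
}
```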
Running the model
In both the `pipeline` and `continue` methods, you first need to run the span scorer model:
python train_span_scorer --config configs/config_span_scorer.json
For the pairwise scorer, run the following script
python train_pairwise_scorer --config configs/config_pairwise.json
Some important parameters in `config_pairwise.json`:
- `max_mention_span`
- `top_k`: pruning coefficient
- `training_method`: (`pipeline`, `continue`, `e2e`)
- `subtopic`: (`true`, `false`) whether to train at the topic or subtopic level (ECB+ notions)
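An illustrative fragment with these fields (the values are examples only, not necessarily the repository defaults; `training_method` and `mention_type` are set in the same file, as shown above):

```json
{
  "max_mention_span": 10,
  "top_k": 0.25,
  "subtopic": true
}
```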
Tuning threshold for agglomerative clustering
The training above saves 10 models (one per epoch) in the specified directory; each model consists of a span_repr, a span scorer, and a pairwise scorer. To find the best model and the best threshold for agglomerative clustering, you need to run a hyperparameter search over the 10 models and several threshold values, evaluated on the dev set. To do so, set `split` to `dev` in `config_clustering.json` and run the two following scripts:
python tuned_threshold.py --config configs/config_clustering.json
python run_scorer.py [path_of_directory_of_conll_files] [mention_type]
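Conceptually, the search pairs each saved model with several clustering thresholds and keeps the combination that scores best on the dev set. Below is a minimal sketch of that loop using scikit-learn's agglomerative clustering (assuming scikit-learn ≥ 1.2 for the `metric` argument); it is not the repository's `tuned_threshold.py`, and the distance matrix and scoring function are placeholders you would supply from the pairwise scorer and the dev-set evaluation.

```python
# Minimal sketch of a clustering-threshold sweep (not the repo's tuned_threshold.py).
from sklearn.cluster import AgglomerativeClustering


def cluster_mentions(distance_matrix, threshold):
    """Cluster mentions from a precomputed pairwise distance matrix."""
    clustering = AgglomerativeClustering(
        n_clusters=None,            # let the threshold determine the number of clusters
        metric="precomputed",       # distances derived from the pairwise scores
        linkage="average",
        distance_threshold=threshold,
    )
    return clustering.fit_predict(distance_matrix)


def sweep_thresholds(distance_matrix, thresholds, score_fn):
    """Return the threshold whose clustering scores best under score_fn."""
    best_threshold, best_score = None, float("-inf")
    for t in thresholds:
        labels = cluster_mentions(distance_matrix, t)
        score = score_fn(labels)    # e.g. CoNLL F1 against the gold dev clusters
        if score > best_score:
            best_threshold, best_score = t, score
    return best_threshold, best_score
```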
Prediction
Given the trained pairwise scorer, the best `model_num`, and the `threshold` from the training and tuning above, set `split` to `test` in `config_clustering.json` and run the following script:
python predict.py --config configs/config_clustering
(`model_path` corresponds to the directory in which you stored the trained models.)
An important setting in `config_clustering.json` is `topic_level`. If you set it to `false`, you need to provide the path to the predicted topics in `predicted_topics_path` in order to produce conll files at the corpus level.
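For reference, an illustrative excerpt of `config_clustering.json` for prediction (the values are examples only; `model_num` and `threshold` should come from your own tuning):

```json
{
  "split": "test",
  "model_path": "path_to_trained_models",
  "model_num": 7,
  "threshold": 0.55,
  "topic_level": true
}
```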
Evaluation
The output of the `predict.py` script is a file in the standard conll format. It is then straightforward to evaluate it against the corresponding gold conll file (created in the first step), using the official conll coreference scorer, which you can find here, or the coval toolkit (a Python implementation).
Make sure to use the gold files of the same evaluation level (topic or corpus) as the predictions.
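For example, with the reference implementation of the conll scorer cloned locally, the call looks roughly like this (file names are placeholders):
perl scorer.pl all path_to_gold.conll path_to_predictions.conll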
Notes
- If you chose to train the pairwise scorer with the end-to-end method, you don't need to provide a `span_repr_path` or a `span_scorer_path` in `config_pairwise.json`.
- If you use this model with gold mentions, the span scorer is not relevant and you can ignore the training method.
- If you're interested in a newer but heavier model, check out our cross-encoder model.
Team
- Arie Cattan
- Alon Eirew
- Gabriel Stanovsky
- Mandar Joshi
- Ido Dagan