Cross-Document Coreference Resolution


This repository contains code and models for end-to-end cross-document coreference resolution, as described in our papers.

The models are trained on ECB+, but they can be used in any multi-document setting.

Getting started

  • Install the Python 3 requirements: pip install -r requirements.txt

Extract mentions and raw text from ECB+

Run the following script to extract the data from the ECB+ dataset and build the gold conll files. The ECB+ corpus can be downloaded here.

python get_ecb_data.py --data_path path_to_data
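
The script's usage message (quoted in the issues below) shows it also accepts an --output_dir flag. A full invocation might therefore look like this, where both paths are placeholders:

python get_ecb_data.py --data_path /path/to/ECB+ --output_dir data/ecb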

Training Instructions

The core of our model is the pairwise scorer between two spans, which indicates how likely it is that two spans belong to the same cluster.

Training method

We present 3 ways to train this pairwise scorer:

  1. Pipeline: first train a span scorer, then train the pairwise scorer using the same spans at each epoch.
  2. Continue: pre-train the span scorer, then train the pairwise scorer while continuing to train the span scorer.
  3. End-to-end: train both models together from scratch.

To choose the training method, set the value of training_method in config_pairwise.json to pipeline, continue, or e2e. In our paper, we found the continue method to perform best for event coreference, and we apply it to entity and ALL as well.

What are the labels?

In ECB+, the entity and event coreference clusters are annotated separately, making it possible to train a model on event or entity coreference only. Accordingly, our model can be trained on events, entities, or both. Set the value of mention_type in config_pairwise.json (and config_span_scorer.json) to events, entities, or mixed (corresponding to ALL in the paper).
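
For example, to reproduce the best-performing event setting from the paper, the relevant entries in config_pairwise.json would look like the following minimal excerpt (all other fields omitted):

    {
      "mention_type": "events",
      "training_method": "continue"
    }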

Running the model

In both the pipeline and continue methods, you first need to train the span scorer model:

python train_span_scorer.py --config configs/config_span_scorer.json

For the pairwise scorer, run the following script:

python train_pairwise_scorer.py --config configs/config_pairwise.json

Some important parameters in config_pairwise.json:

  • max_mention_span: maximum width of candidate mention spans
  • top_k: pruning coefficient
  • training_method: (pipeline, continue, e2e)
  • subtopic: (true, false) whether to train at the topic or subtopic level (ECB+ notions).
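
For reference, the config dump quoted in one of the issues below shows one concrete setting of these parameters:

    max_mention_span = 10
    top_k = 0.25
    training_method = "continue"
    subtopic = false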

Tuning threshold for agglomerative clustering

The training above saves 10 models (one per epoch) in the specified directory, where each model consists of a span_repr, a span scorer, and a pairwise scorer. To find the best model and the best threshold for the agglomerative clustering, you need to run a hyperparameter search over the 10 models and several threshold values, evaluated on the dev set. To do that, set split: dev in config_clustering.json and run the two following scripts:

python tuned_threshold.py --config configs/config_clustering.json

python run_scorer.py [path_of_directory_of_conll_files] [mention_type]
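
Conceptually, the step being tuned here cuts the agglomerative-clustering tree of the pairwise scores at a distance threshold. The following is a minimal sketch assuming scikit-learn, not the repository's actual code; in particular, the score-to-distance conversion is an assumption:

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering

    def cluster_mentions(pairwise_scores, threshold, linkage="average"):
        # Higher pairwise score means "more likely coreferent", so negate the
        # scores to obtain a distance matrix (an assumed conversion).
        distances = -pairwise_scores
        clustering = AgglomerativeClustering(
            n_clusters=None,               # let the threshold decide the number of clusters
            metric="precomputed",          # older scikit-learn versions call this "affinity"
            linkage=linkage,               # "average" matches linkage_type in config_clustering.json
            distance_threshold=threshold,
        )
        return clustering.fit_predict(distances)

    # Toy example: mentions 0 and 1 score high with each other, mention 2 doesn't.
    scores = np.array([[0.0, 8.5, 0.1],
                       [8.5, 0.0, 0.2],
                       [0.1, 0.2, 0.0]])
    print(cluster_mentions(scores, threshold=-1.0))  # two clusters: {0, 1} and {2}

tuned_threshold.py automates exactly this sweep: it tries each of the 10 saved models with several thresholds and keeps the combination with the best dev-set score.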

Prediction

Given the trained pairwise scorer, the best model_num, and the threshold from the training and tuning above, set split: test in config_clustering.json and run the following script:

python predict.py --config configs/config_clustering.json

(model_path corresponds to the directory in which you've stored the trained models.)

An important configuration in config_clustering.json is topic_level. If you set it to false, you need to provide the path to the predicted topics in predicted_topics_path in order to produce conll files at the corpus level.
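
Note that the traceback quoted in one of the issues below shows this path being opened with open(..., 'rb'), so predicted_topics_path should point to a single pickled file rather than a directory. A hypothetical sketch of writing such a file (the list-of-document-names-per-topic format is an assumption, not the repository's documented format):

    import pickle

    # Hypothetical format: one list of document names per predicted topic.
    predicted_topics = [
        ["12_1ecb", "12_3ecb"],
        ["45_2ecb"],
    ]
    with open("data/predicted_topics", "wb") as f:
        pickle.dump(predicted_topics, f)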

Evaluation

The output of the predict.py script is a file in the standard conll format. It's then straightforward to evaluate it against its corresponding gold conll file (created in the first step), using the official conll coreference scorer that you can find here, or the coval system (a Python implementation).

Make sure to use the gold files of the same evaluation level (topic or corpus) as the predictions.
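
For example, with the reference implementation of the conll scorer (assuming it is cloned locally; both conll file names below are placeholders), the gold (key) file comes first and the system file second:

perl reference-coreference-scorers/scorer.pl all gold_test_events.conll predicted_test_events.conll

As one of the issues below points out, swapping the key and system files silently flips all recall and precision scores.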

Notes

  • If you choose to train the pairwise scorer with the end-to-end method, you don't need to provide a span_repr_path or a span_scorer_path in config_pairwise.json.

  • If you use this model with gold mentions, the span scorer is not relevant and you can ignore the training method (see the snippet after this list).

  • If you're interested in a newer but heavier model, check out our cross-encoder model.
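
As a minimal sketch, the gold-mention setting from the second note above corresponds to a single flag in config_pairwise.json (the key name appears in the config dump quoted in the issues below):

    {
      "use_gold_mentions": true
    }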

Team

Comments
  • the bad result

    Hi, I ran this code without any errors, but the result is bad and I don't know what's wrong. Could you please give me some suggestions? Thanks.

    python -u run_scorer_test.py ./models/pairwise_scorers/test_events_average_0.8_model_3_corpus_level.conll events
    Processing file: ./models/pairwise_scorers/test_events_average_0.8_model_3_corpus_level.conll
    0.65 1.0 0.787878787878788
    0.8205128205128205 0.8205128205128205 0.8205128205128205
    0.5474218379209004 0.6790832621385336 0.6061858323184838
    0.15918234423790356 0.7040757533599581 0.2596591430831051
    0.4649510356293238 0.6254223508155559 0.5333783332066294
    {'mentions_recall': 0.65, 'mentions_precision': 1.0, 'mentions_f1': 0.787878787878788, 'muc_recall': 0.8205128205128205, 'muc_precision': 0.8205128205128205, 'muc_f1': 0.8205128205128205, 'bcub_recall': 0.5474218379209004, 'bcub_precision': 0.6790832621385336, 'bcub_f1': 0.6061858323184838, 'ceafe_recall': 0.15918234423790356, 'ceafe_precision': 0.7040757533599581, 'ceafe_f1': 0.2596591430831051, 'lea_recall': 0.4649510356293238, 'lea_precision': 0.6254223508155559, 'lea_f1': 0.5333783332066294, 'conll': 56.211926530480305}

    opened by cccccs 4
  • predicted_topics_path yields either FileNotFoundError or IsADirectoryError

    What should I do about the predicted_topics_path entry in config_clustering.json? I have topic_level set to false, but predict.py still yields the following error.

    python predict.py --config configs/config_clustering.json

    gpu_num = [
      1
    ]
    bert_model = "roberta-large"
    hidden_layer = 1024
    dropout = 0.3
    with_mention_width = true
    with_head_attention = true
    embedding_dimension = 20
    max_mention_span = 10
    use_gold_mentions = false
    mention_type = "mixed"
    top_k = 0.25
    split = "test"
    training_method = "continue"
    subtopic = false
    use_predicted_topics = true
    segment_window = 512
    exact = false
    topic_level = true
    predicted_topics_path = "/home/mhillebrand/coref/data/predicted_topics"
    data_folder = "data/ecb/mentions"
    save_path = "models/pairwise_scorers"
    model_path = "models/pairwise_scorers"
    model_num = 9
    keep_singletons = false
    threshold = 44.27621788652271
    linkage_type = "average"
    Traceback (most recent call last):
      File "predict.py", line 94, in <module>
        data = create_corpus(config, bert_tokenizer, config.split, is_training=False)
      File "/home/mhillebrand/coref/utils.py", line 32, in create_corpus
        with open(config.predicted_topics_path, 'rb') as f:
    FileNotFoundError: [Errno 2] No such file or directory: '/home/mhillebrand/coref/data/predicted_topics'
    

    Then if I create the data/predicted_topics directory, I get this error:

    IsADirectoryError: [Errno 21] Is a directory: '/home/mhillebrand/coref/data/predicted_topics'
    
    opened by mhillebrand 2
  • ModuleNotFoundError: No module named 'coval.coval.eval'

    $ python run_scorer.py models/pairwise_scorers mixed

    Traceback (most recent call last):
      File "run_scorer.py", line 4, in <module>
        from coval.coval.eval import evaluator
    ModuleNotFoundError: No module named 'coval.coval.eval'
    
    opened by mhillebrand 2
  • get_ecb_data.py: error: unrecognized arguments

    python get_ecb_data.py --/home/dr/Desktop/corefecb/data /home/dr/Desktop/corefecb
    2022-01-03 16:54:15.600300: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
    usage: get_ecb_data.py [-h] [--data_path DATA_PATH] [--output_dir OUTPUT_DIR]
    get_ecb_data.py: error: unrecognized arguments: --/home/dr/Desktop/corefecb/data /home/dr/Desktop/corefecb

    Why am I getting this error?

    opened by kusumlata123 1
  • Flipped key_file, sys_file in run_scorer.py

    In run_scorer.py, key_file and sys_file are flipped, i.e. the scorer receives the system predictions instead of the gold annotations (and vice-versa). All recall and precision scores produced by this code section are backwards. :grimacing:

    opened by mbugert 0
  • Reproducing code

    Hi! I'm reproducing this code for a university module on reproducibility in NLP. It's going great so far (great job on the reproducibility side!), but I'm confused about how to obtain the predicted topics: in order to run predict.py with the configuration "topic: false", it's said I need a path to the predicted topics. Am I meant to get these from other code, or are the predicted topics the output of another script in this repo? Thank you! :)

    opened by Andrea4-sr 0
Owner

Arie Cattan, PhD candidate, Computer Science, Bar-Ilan University