This repository contains the code for the EMNLP 2021 paper "Word-Level Coreference Resolution".

Overview

Word-Level Coreference Resolution

This repository contains the code to reproduce the experiments described in the paper of the same name, accepted to EMNLP 2021. The paper is available here.

Table of contents

  1. Preparation
  2. Training
  3. Evaluation

Preparation

The following instructions have been tested with Python 3.7 on an Ubuntu 20.04 machine.

You will need:

  • OntoNotes 5.0 corpus (download here, registration needed)
  • Python 2.7 to run conll-2012 scripts
  • Java runtime to run Stanford Parser
  • Python 3.7+ to run the model
  • Perl to run conll-2012 evaluation scripts
  • CUDA-enabled machine (48 GB to train, 4 GB to evaluate)
  1. Extract the OntoNotes 5.0 archive. Assuming it is in the repository's root directory:

     tar -xzvf ontonotes-release-5.0_LDC2013T19.tgz
    
  2. Switch to a Python 2.7 environment (one where python runs version 2.7). This is necessary for the conll-2012 scripts to run correctly. To do it with conda:

     conda create -y --name py27 python=2.7 && conda activate py27
    
  3. Run the conll data preparation scripts (~30min):

     sh get_conll_data.sh ontonotes-release-5.0 data
    
  4. Download conll scorers and Stanford Parser:

     sh get_third_party.sh
    
  5. Prepare your environment. To do it with conda:

     conda create -y --name wl-coref python=3.7 openjdk perl
     conda activate wl-coref
     python -m pip install -r requirements.txt
    
  6. Build the corpus in jsonlines format (~20 min); see the sketch after this list for what the resulting records look like:

     python convert_to_jsonlines.py data/conll-2012/ --out-dir data
     python convert_to_heads.py
    

You're all set!
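
Each line of the jsonlines files produced in step 6 is one JSON document. Below is a minimal sketch of how to inspect a record; the file name is an assumption, so substitute whatever convert_to_jsonlines.py actually wrote to your --out-dir:

    import json

    # Hypothetical file name: use the file convert_to_jsonlines.py produced.
    with open("data/english_train.jsonlines", encoding="utf8") as f:
        doc = json.loads(next(f))  # one JSON document per line

    # Keys reported in the issues below include cased_words, sent_id, speaker,
    # pos, deprel, head and clusters; convert_to_heads.py further derives
    # head2span, word_clusters and span_clusters from them.
    print(sorted(doc.keys()))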

Training

If you have completed all the steps in the previous section, then just run:

python run.py train roberta

Use the -h flag to see more parameters, and set the CUDA_VISIBLE_DEVICES environment variable to limit which CUDA devices are visible to the script (see the example below). Refer to config.toml to modify existing model configurations or to create your own.
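
As an example, this invocation exposes only the first GPU to the script (CUDA_VISIBLE_DEVICES is standard CUDA behavior, not a project-specific flag):

    CUDA_VISIBLE_DEVICES=0 python run.py train roberta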

Evaluation

Make sure that you have successfully completed all steps of the Preparation section.

  1. Download and save the pretrained model to the data directory.

     https://www.dropbox.com/s/vf7zadyksgj40zu/roberta_%28e20_2021.05.02_01.16%29_release.pt?dl=0
    
  2. Generate the conll-formatted output:

     python run.py eval roberta --data-split test
    
  3. Run the conll-2012 scoring scripts to obtain the metrics (the trailing 20 refers to the checkpoint's epoch count, matching the e20 in the downloaded file name):

     python calculate_conll.py roberta test 20
    
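For reference, calculate_conll.py drives the conll-2012 reference scorer on the gold and predicted conll files; according to the issue report below, the equivalent direct invocation looks like this (paths assume the default data directory and the e20 checkpoint):

    perl reference-coreference-scorers/scorer.pl all data/conll_logs/roberta_test_e20.gold.conll data/conll_logs/roberta_test_e20.pred.conll none
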
Comments
  • about the training process

    Here is the error I encountered: Epoch 1: bc/cnn/00/cnn_0001 c_loss: 2.11580 s_loss: 0.57502: 14% 394/2802 [01:04<04:54, 8.18docs/s]. It seems the training process has stopped. Can you tell me why? Thanks.

    opened by leileilin 27
  • some confusions about convert_to_heads.py

    Hello, I have a new question about the convert_to_heads.py file, in which some spans and clusters get deleted. Is this the case described as follows? "In those cases 'A' and 'A & B' are different spans with the same head word, 'A'. In our implementation such cases were simply discarded from the training set, because they were few and we were able to perform well, even though we couldn't predict any of such cases during inference." Like you said in #2. Thanks.

    opened by leileilin 14
  • about chinese dataset

    Hello, thank you for open-sourcing this great work. I want to process a Chinese dataset following your pipeline, but the convert_to_jsonlines.py step reports an error. Do you know why? Thanks.

    opened by leileilin 14
  • what is the equivalent of "edu.stanford.nlp.trees.EnglishGrammaticalStructure" for arabic coreference resolution task

    Hi,

    I can't find an ArabicGrammaticalStructure class in edu.stanford.nlp. It works for English data but not for Arabic:

        Converting constituents to dependencies... development: 0% 0/44 [00:00<?, ?docs/s]
        Exception in thread "main" java.lang.IllegalArgumentException: No head rule defined for PV+PVSUFF using class edu.stanford.nlp.trees.SemanticHeadFinder in PV+PVSUFF-39
            at edu.stanford.nlp.trees.AbstractCollinsHeadFinder.determineNonTrivialHead(AbstractCollinsHeadFinder.java:222)
            at edu.stanford.nlp.trees.SemanticHeadFinder.determineNonTrivialHead(SemanticHeadFinder.java:348)
            at edu.stanford.nlp.trees.AbstractCollinsHeadFinder.determineHead(AbstractCollinsHeadFinder.java:179)
            at edu.stanford.nlp.trees.TreeGraphNode.percolateHeads(TreeGraphNode.java:476)
            at edu.stanford.nlp.trees.TreeGraphNode.percolateHeads(TreeGraphNode.java:474)
            at edu.stanford.nlp.trees.TreeGraphNode.percolateHeads(TreeGraphNode.java:474)
            at edu.stanford.nlp.trees.TreeGraphNode.percolateHeads(TreeGraphNode.java:474)
            at edu.stanford.nlp.trees.TreeGraphNode.percolateHeads(TreeGraphNode.java:474)
            at edu.stanford.nlp.trees.GrammaticalStructure.<init>(GrammaticalStructure.java:94)
            at edu.stanford.nlp.trees.EnglishGrammaticalStructure.<init>(EnglishGrammaticalStructure.java:86)
            at edu.stanford.nlp.trees.EnglishGrammaticalStructure.<init>(EnglishGrammaticalStructure.java:66)
            at edu.stanford.nlp.parser.lexparser.EnglishTreebankParserParams.getGrammaticalStructure(EnglishTreebankParserParams.java:2271)
            at edu.stanford.nlp.trees.GrammaticalStructure$TreeBankGrammaticalStructureWrapper$GsIterator.primeGs(GrammaticalStructure.java:1361)
            at edu.stanford.nlp.trees.GrammaticalStructure$TreeBankGrammaticalStructureWrapper$GsIterator.<init>(GrammaticalStructure.java:1348)
            at edu.stanford.nlp.trees.GrammaticalStructure$TreeBankGrammaticalStructureWrapper.iterator(GrammaticalStructure.java:1325)
            at edu.stanford.nlp.trees.GrammaticalStructure.main(GrammaticalStructure.java:1604)
        development: 0% 0/44 [00:00<?, ?docs/s]
        Traceback (most recent call last):
          File "convert_to_jsonlines.py", line 392, in <module>
            convert_con_to_dep(args.tmp_dir, conll_filenames)
          File "convert_to_jsonlines.py", line 195, in convert_con_to_dep
            subprocess.run(cmd, check=True, stdout=out)
          File "/home/souid/anaconda3/envs/wl-coref/lib/python3.7/subprocess.py", line 512, in run
            output=stdout, stderr=stderr)
        subprocess.CalledProcessError: Command '['java', '-cp', 'downloads/stanford-parser.jar', 'edu.stanford.nlp.trees.EnglishGrammaticalStructure', '-basic', '-keepPunct', '-conllx', '-treeFile', 'temp/data/conll-2012/v4/data/development/data/arabic/annotations/nw/ann/00/ann_0010.v4_gold_conll']' returned non-zero exit status 1.

    opened by aymen-souid 11
  • Conll perl script refusing to score because of "too many repeated mentions (>10) in the response"

    I ran the preparation scripts successfully.

    Downloaded the roberta checkpoint from the Dropbox link and placed it in the data folder.

    Ran the command: python calculate_conll.py roberta test 20

    I noticed some subprocess errors because I was using Python 3.6 instead of Python 3.7.

    The error was: unexpected keyword argument 'capture_output'

    Fixed the issue with this

    But then I got an error: 'NoneType' object has no attribute 'group' (origin of the error: line 15)

    I ran the perl script directly in bash: perl reference-coreference-scorers/scorer.pl all data/conll_logs/roberta_test_e20.gold.conll data/conll_logs/roberta_test_e20.pred.conll none

    MUC came out to be 86 (F1), but while calculating B3 I got this error: Found too many repeated mentions (> 10) in the response, so refusing to score. Please fix the output

    I think it is this error that made line 15 above fail (because the output was empty).

    How do I proceed now? How do I evaluate the results?

    opened by ritwikmishra 9
  • Inference out of the box?

    Hi! Thank you for publishing the model. Could you please explain how to run inference out of the box? If I understood correctly, the model from Dropbox has already been trained, so we should be able to run it as-is, but by design the model requires the original data and builds optimizers in order to run:

    class CorefModel:
        """
        Attributes:
            config (coref.config.Config): the model's configuration,
                see config.toml for the details
            epochs_trained (int): number of epochs the model has been trained for
            trainable (Dict[str, torch.nn.Module]): trainable submodules with their
                names used as keys
            training (bool): used to toggle train/eval modes
    
        Submodules (in the order of their usage in the pipeline):
            tokenizer (transformers.AutoTokenizer)
            bert (transformers.AutoModel)
            we (WordEncoder)
            rough_scorer (RoughScorer)
            pw (PairwiseEncoder)
            a_scorer (AnaphoricityScorer)
            sp (SpanPredictor)
        """
        def __init__(self,
                     config_path: str,
                     section: str,
                     epochs_trained: int = 0):
            """
            A newly created model is set to evaluation mode.
    
            Args:
                config_path (str): the path to the toml file with the configuration
                section (str): the selected section of the config file
                epochs_trained (int): the number of epochs finished
                    (useful for warm start)
            """
            self.config = CorefModel._load_config(config_path, section)
            self.epochs_trained = epochs_trained
            self._docs: Dict[str, List[Doc]] = {}
            self._build_model()
            self._build_optimizers()
            self._set_training(False)
            self._coref_criterion = CorefLoss(self.config.bce_loss_weight)
            self._span_criterion = torch.nn.CrossEntropyLoss(reduction="sum")
    

    So maybe there is an option to run it without all this stuff?

    opened by Dzz1th 7
  • Questions_dataset-representation

    Based on my reading of this code base, training uses the following features: cased_words, sent_id, speaker, pos, deprel, head, clusters.

    These are then converted into: cased_words, sent_id, speaker, pos, deprel, head, head2span, word_clusters, span_clusters.

    In the inference data example, meanwhile, the only features used are cased_words, sent_id, and optionally speaker.

    My questions are:

    1. Where do the pos, deprel, head, and clusters data come from in inference mode? Are they derived from cased_words or not?
    2. In training mode, are the speaker, pos, deprel, head, and clusters data used as well?

    Thank you

    opened by fajarmuslim 7
  • shall I use convert_to_heads when using CoNLL-U?

    Hi, thanks so much for your work! I have a question regarding the convert_to_heads.py script. I'm trying to make RoBERTa learn coreference resolution, but my data is in the .conllu format, and I'm having quite a hard time preprocessing the data and modifying some of your code to make it work. Can you share some insights/thoughts on that? I would be much obliged.

    Cheers

    opened by brgsk 7
  • about the training data format

    Hello, I'd like to ask about the .jsonlines file produced by convert_to_jsonlines.py. Can training still succeed if some attributes in the jsonlines file, such as speaker and pos, are discarded?

    opened by leileilin 5
  • Inference on conversation.

    Hello, great work.

    I had two questions:

    1. What is sent_id in the sample input file supposed to refer to?

    2. If I want to run inference on dialogue, like the tc genre, what should the conversation format be?

    opened by maherr13 5
  • Questions about training

    Currently, when running this source code, I get a CUDA out-of-memory error, since a single GPU has only 32 GB of memory.

    On the other hand, I have access to a server with 8 GPUs (each with 32 GB of memory). Can I run this training experiment in parallel mode?

    If so, how can I achieve that?

    Thanks in advance.

    opened by fajarmuslim 5
  • Reduce training memory requirement

    CUDA-enabled machine (48 GB to train, 4 GB to evaluate)

    @vdobrovolskii Friendly ping: are 48 GB really needed to train? Can't we train longer (how long?) with less? Couldn't your project leverage FP16, FP8, and other optimizations? You can get them out of the box if you use RoBERTa from the Transformers library (https://github.com/huggingface/transformers). There is also Accelerate (https://huggingface.co/docs/accelerate/index).

    I have a 3070 with 8 GB of GDDR6 :/

    opened by LifeIsStrange 2