This repository contains the code for the EMNLP 2021 paper "Word-Level Coreference Resolution".

Overview

Word-Level Coreference Resolution

This repository contains the code to reproduce the experiments described in the paper of the same name, accepted to EMNLP 2021. The paper is available here.

Table of contents

  1. Preparation
  2. Training
  3. Evaluation

Preparation

The following instructions have been tested with Python 3.7 on an Ubuntu 20.04 machine.

You will need:

  • OntoNotes 5.0 corpus (download here, registration needed)
  • Python 2.7 to run conll-2012 scripts
  • Java runtime to run Stanford Parser
  • Python 3.7+ to run the model
  • Perl to run conll-2012 evaluation scripts
  • CUDA-enabled machine (48 GB to train, 4 GB to evaluate)
  1. Extract the OntoNotes 5.0 archive. Assuming it is in the repository's root directory:

     tar -xzvf ontonotes-release-5.0_LDC2013T19.tgz
    
  2. Switch to a Python 2.7 environment (one where python runs version 2.7). This is necessary for the conll-2012 scripts to run correctly. To do it with conda:

     conda create -y --name py27 python=2.7 && conda activate py27
    
  3. Run the conll data preparation scripts (~30min):

     sh get_conll_data.sh ontonotes-release-5.0 data
    
  4. Download conll scorers and Stanford Parser:

     sh get_third_party.sh
    
  5. Prepare your environment. To do it with conda:

     conda create -y --name wl-coref python=3.7 openjdk perl
     conda activate wl-coref
     python -m pip install -r requirements.txt
    
  6. Build the corpus in jsonlines format (~20 min); see the sketch after this list for what the resulting records look like:

     python convert_to_jsonlines.py data/conll-2012/ --out-dir data
     python convert_to_heads.py
    

You're all set!
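
Each line of the jsonlines files produced in step 6 is one JSON document. Below is a minimal sketch of how to inspect a record; the file name is an assumption, so substitute whatever convert_to_jsonlines.py actually wrote to your --out-dir:

    import json

    # Hypothetical file name: use the file convert_to_jsonlines.py produced.
    with open("data/english_train.jsonlines", encoding="utf8") as f:
        doc = json.loads(next(f))  # one JSON document per line

    # Keys reported in the issues below include cased_words, sent_id, speaker,
    # pos, deprel, head and clusters; convert_to_heads.py further derives
    # head2span, word_clusters and span_clusters from them.
    print(sorted(doc.keys()))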

Training

If you have completed all the steps in the previous section, then just run:

python run.py train roberta

Use the -h flag to see more parameters, and set the CUDA_VISIBLE_DEVICES environment variable to limit which CUDA devices are visible to the script (see the example below). Refer to config.toml to modify existing model configurations or to create your own.
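
As an example, this invocation exposes only the first GPU to the script (CUDA_VISIBLE_DEVICES is standard CUDA behavior, not a project-specific flag):

    CUDA_VISIBLE_DEVICES=0 python run.py train roberta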

Evaluation

Make sure that you have successfully completed all steps of the Preparation section.

  1. Download and save the pretrained model to the data directory.

     https://www.dropbox.com/s/vf7zadyksgj40zu/roberta_%28e20_2021.05.02_01.16%29_release.pt?dl=0
    
  2. Generate the conll-formatted output:

     python run.py eval roberta --data-split test
    
  3. Run the conll-2012 scoring scripts to obtain the metrics (the trailing 20 refers to the checkpoint's epoch count, matching the e20 in the downloaded file name):

     python calculate_conll.py roberta test 20
    
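For reference, calculate_conll.py drives the conll-2012 reference scorer on the gold and predicted conll files; according to the issue report below, the equivalent direct invocation looks like this (paths assume the default data directory and the e20 checkpoint):

    perl reference-coreference-scorers/scorer.pl all data/conll_logs/roberta_test_e20.gold.conll data/conll_logs/roberta_test_e20.pred.conll none
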
Comments
  • about the training process

    Here is the error I encountered: Epoch 1: bc/cnn/00/cnn_0001 c_loss: 2.11580 s_loss: 0.57502: 14% 394/2802 [01:04<04:54, 8.18docs/s]. It seems the training process has stopped. Can you tell me why? Thanks.

    opened by leileilin 27
  • some confusions about convert_to_heads.py

    Hello, I have a new question about the convert_to_heads.py file, in which some spans and clusters get deleted. Is this the case described as follows? "In those cases 'A' and 'A & B' are different spans with the same head word, 'A'. In our implementation such cases were simply discarded from the training set, because they were few and we were able to perform well, even though we couldn't predict any of such cases during inference." Like you said in #2. Thanks.

    opened by leileilin 14
  • about chinese dataset

    Hello, thank you for open-sourcing this great work. I want to process a Chinese dataset following your pipeline, but the convert_to_jsonlines.py step reports an error. Do you know why? Thanks.

    opened by leileilin 14
  • what is the equivalent of "edu.stanford.nlp.trees.EnglishGrammaticalStructure" for arabic coreference resolution task

    Hi,

    I can't find an ArabicGrammaticalStructure class in edu.stanford.nlp. It works for English data but not for Arabic:

        Converting constituents to dependencies... development: 0% 0/44 [00:00<?, ?docs/s]
        Exception in thread "main" java.lang.IllegalArgumentException: No head rule defined for PV+PVSUFF using class edu.stanford.nlp.trees.SemanticHeadFinder in PV+PVSUFF-39
            at edu.stanford.nlp.trees.AbstractCollinsHeadFinder.determineNonTrivialHead(AbstractCollinsHeadFinder.java:222)
            at edu.stanford.nlp.trees.SemanticHeadFinder.determineNonTrivialHead(SemanticHeadFinder.java:348)
            at edu.stanford.nlp.trees.AbstractCollinsHeadFinder.determineHead(AbstractCollinsHeadFinder.java:179)
            at edu.stanford.nlp.trees.TreeGraphNode.percolateHeads(TreeGraphNode.java:476)
            at edu.stanford.nlp.trees.TreeGraphNode.percolateHeads(TreeGraphNode.java:474)
            at edu.stanford.nlp.trees.TreeGraphNode.percolateHeads(TreeGraphNode.java:474)
            at edu.stanford.nlp.trees.TreeGraphNode.percolateHeads(TreeGraphNode.java:474)
            at edu.stanford.nlp.trees.TreeGraphNode.percolateHeads(TreeGraphNode.java:474)
            at edu.stanford.nlp.trees.GrammaticalStructure.<init>(GrammaticalStructure.java:94)
            at edu.stanford.nlp.trees.EnglishGrammaticalStructure.<init>(EnglishGrammaticalStructure.java:86)
            at edu.stanford.nlp.trees.EnglishGrammaticalStructure.<init>(EnglishGrammaticalStructure.java:66)
            at edu.stanford.nlp.parser.lexparser.EnglishTreebankParserParams.getGrammaticalStructure(EnglishTreebankParserParams.java:2271)
            at edu.stanford.nlp.trees.GrammaticalStructure$TreeBankGrammaticalStructureWrapper$GsIterator.primeGs(GrammaticalStructure.java:1361)
            at edu.stanford.nlp.trees.GrammaticalStructure$TreeBankGrammaticalStructureWrapper$GsIterator.<init>(GrammaticalStructure.java:1348)
            at edu.stanford.nlp.trees.GrammaticalStructure$TreeBankGrammaticalStructureWrapper.iterator(GrammaticalStructure.java:1325)
            at edu.stanford.nlp.trees.GrammaticalStructure.main(GrammaticalStructure.java:1604)
        development: 0% 0/44 [00:00<?, ?docs/s]
        Traceback (most recent call last):
          File "convert_to_jsonlines.py", line 392, in <module>
            convert_con_to_dep(args.tmp_dir, conll_filenames)
          File "convert_to_jsonlines.py", line 195, in convert_con_to_dep
            subprocess.run(cmd, check=True, stdout=out)
          File "/home/souid/anaconda3/envs/wl-coref/lib/python3.7/subprocess.py", line 512, in run
            output=stdout, stderr=stderr)
        subprocess.CalledProcessError: Command '['java', '-cp', 'downloads/stanford-parser.jar', 'edu.stanford.nlp.trees.EnglishGrammaticalStructure', '-basic', '-keepPunct', '-conllx', '-treeFile', 'temp/data/conll-2012/v4/data/development/data/arabic/annotations/nw/ann/00/ann_0010.v4_gold_conll']' returned non-zero exit status 1.

    opened by aymen-souid 11
  • Conll perl script refusing to score because of "too many repeated mentions (>10) in the response"

    I ran the preparation scripts successfully.

    Downloaded the roberta checkpoint from the Dropbox link and placed it in the data folder.

    Ran the command: python calculate_conll.py roberta test 20

    I noticed some subprocess errors because I was using Python 3.6 instead of Python 3.7.

    The error was: unexpected keyword argument 'capture_output'

    Fixed the issue with this

    But then I got an error: 'NoneType' object has no attribute 'group' (origin of the error: line 15)

    I ran the perl script directly in bash: perl reference-coreference-scorers/scorer.pl all data/conll_logs/roberta_test_e20.gold.conll data/conll_logs/roberta_test_e20.pred.conll none

    MUC came out to be 86 (F1), but while calculating B3 I got this error: Found too many repeated mentions (> 10) in the response, so refusing to score. Please fix the output

    I think it is this error that made line 15 above fail (because the output was empty).

    How do I proceed now? How do I evaluate the results?

    opened by ritwikmishra 9
  • Inference out of the box?

    Hi! Thank you for publishing the model. Could you please explain how to run inference out of the box? If I understood correctly, the model from Dropbox has already been trained, so we should be able to run it as-is, but by design the model requires the original data and builds optimizers in order to run:

    class CorefModel:
        """
        Attributes:
            config (coref.config.Config): the model's configuration,
                see config.toml for the details
            epochs_trained (int): number of epochs the model has been trained for
            trainable (Dict[str, torch.nn.Module]): trainable submodules with their
                names used as keys
            training (bool): used to toggle train/eval modes
    
        Submodules (in the order of their usage in the pipeline):
            tokenizer (transformers.AutoTokenizer)
            bert (transformers.AutoModel)
            we (WordEncoder)
            rough_scorer (RoughScorer)
            pw (PairwiseEncoder)
            a_scorer (AnaphoricityScorer)
            sp (SpanPredictor)
        """
        def __init__(self,
                     config_path: str,
                     section: str,
                     epochs_trained: int = 0):
            """
            A newly created model is set to evaluation mode.
    
            Args:
                config_path (str): the path to the toml file with the configuration
                section (str): the selected section of the config file
                epochs_trained (int): the number of epochs finished
                    (useful for warm start)
            """
            self.config = CorefModel._load_config(config_path, section)
            self.epochs_trained = epochs_trained
            self._docs: Dict[str, List[Doc]] = {}
            self._build_model()
            self._build_optimizers()
            self._set_training(False)
            self._coref_criterion = CorefLoss(self.config.bce_loss_weight)
            self._span_criterion = torch.nn.CrossEntropyLoss(reduction="sum")
    

    So maybe there is an option to run it without all this stuff?

    opened by Dzz1th 7
  • Questions_dataset-representation

    Based on my reading of this code base, training uses the following features: cased_words, sent_id, speaker, pos, deprel, head, clusters.

    These are then converted into: cased_words, sent_id, speaker, pos, deprel, head, head2span, word_clusters, span_clusters.

    In the inference data example, meanwhile, the only features used are cased_words, sent_id, and optionally speaker.

    My questions are:

    1. Where do the pos, deprel, head, and clusters data come from in inference mode? Are they derived from cased_words or not?
    2. In training mode, are the speaker, pos, deprel, head, and clusters data used as well?

    Thank you

    opened by fajarmuslim 7
  • shall I use convert_to_heads when using CoNLL-U?

    Hi, thanks so much for your work! I have a question regarding the convert_to_heads.py script. I'm trying to make RoBERTa learn coreference resolution, but my data is in the .conllu format, and I'm having quite a hard time preprocessing the data and modifying some of your code to make it work. Can you share some insights/thoughts on that? I would be much obliged.

    Cheers

    opened by brgsk 7
  • about the training data format

    Hello, I'd like to ask about the .jsonlines file produced by convert_to_jsonlines.py. Can training still succeed if some attributes in the jsonlines file, such as speaker and pos, are discarded?

    opened by leileilin 5
  • Inference on conversation.

    Hello, great work.

    I had two questions:

    1. What is sent_id in the sample input file supposed to refer to?

    2. If I want to run inference on dialogue, like the tc genre, what should the conversation format be?

    opened by maherr13 5
  • Questions about training

    Currently, when running this source code, I get a CUDA out-of-memory error, since a single GPU has only 32 GB of memory.

    On the other hand, I have access to a server with 8 GPUs (each with 32 GB of memory). Can I run this training experiment in parallel mode?

    If so, how can I achieve that?

    Thanks in advance.

    opened by fajarmuslim 5
  • Reduce training memory requirement

    CUDA-enabled machine (48 GB to train, 4 GB to evaluate)

    @vdobrovolskii Friendly ping: are 48 GB really needed to train? Can't we train longer (how long?) with less? Couldn't your project leverage FP16, FP8, and other optimizations? You can get them out of the box if you use RoBERTa from the Transformers library (https://github.com/huggingface/transformers). There is also Accelerate (https://huggingface.co/docs/accelerate/index).

    I have a 3070 with 8 GB of GDDR6 :/

    opened by LifeIsStrange 2