[ACL-IJCNLP 2021] Improving Named Entity Recognition by External Context Retrieving and Cooperative Learning

Last update: Dec 8, 2022

Overview

CLNER

The code is for our ACL-IJCNLP 2021 paper: Improving Named Entity Recognition by External Context Retrieving and Cooperative Learning

CLNER is a framework for improving the accuracy of NER models by retrieving external contexts and then using cooperative learning to improve both input views. The code was initially based on flair version 0.4.3 and was later extended with knowledge distillation and ACE approaches to distill smaller models or achieve SOTA results. The config files in these repos are also applicable to this code.

Requirements

The project is based on PyTorch 1.1+ and Python 3.6+. To run our code, install:

pip install -r requirements.txt

The following requirements should be satisfied:

transformers: 3.0.0

Datasets

The datasets used in our paper are available here.

Training

Training NER Models with External Contexts

Run:

CUDA_VISIBLE_DEVICES=0 python train.py --config config/wnut17_doc.yaml

Training NER Models with Cooperative Learning

Run:

CUDA_VISIBLE_DEVICES=0 python train.py --config config/wnut17_doc_cl_kl.yaml
CUDA_VISIBLE_DEVICES=0 python train.py --config config/wnut17_doc_cl_l2.yaml

Train on Your Own Dataset

To set the dataset manually, specify it in the $config_file as follows:

targets: ner
ner:
  Corpus: ColumnCorpus-1
  ColumnCorpus-1: 
    data_folder: datasets/conll_03_english
    column_format:
      0: text
      1: pos
      2: chunk
      3: ner
    tag_to_bioes: ner
  tag_dictionary: resources/taggers/your_ner_tags.pkl

The tag_dictionary is a path to the tag dictionary for the task. If the path does not exist, the code will generate a tag dictionary at the path automatically. The dataset format is: Corpus: $CorpusClassName-$id, where $id is the name of the dataset (anything you like). You can train on multiple datasets jointly by listing several corpora in Corpus, separated by : (see the Corpus entry under Config File).

Please refer to Config File for more details.
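
As a rough illustration of the $CorpusClassName-$id naming scheme, the sketch below splits a Corpus value into (class name, id) pairs. It is not the repository's parser: the function name parse_corpus_spec is hypothetical, and splitting each entry on its first hyphen is an assumption made for this example.

```python
from typing import List, Tuple


def parse_corpus_spec(spec: str) -> List[Tuple[str, str]]:
    """Split a Corpus value such as 'ColumnCorpus-1:ColumnCorpus-2'
    into (class_name, dataset_id) pairs.

    Hypothetical helper for illustration; not part of the CLNER code.
    """
    pairs = []
    for entry in spec.split(":"):
        entry = entry.strip()
        if not entry:
            continue
        # Assumption: split on the first '-' only, so the id may contain '-'.
        class_name, sep, dataset_id = entry.partition("-")
        if not sep or not class_name or not dataset_id:
            raise ValueError(f"expected $CorpusClassName-$id, got {entry!r}")
        pairs.append((class_name, dataset_id))
    return pairs


print(parse_corpus_spec("ColumnCorpus-1:ColumnCorpus-2"))
```

Each pair would then correspond to one sub-section under the target key, named exactly like the entry (for example, ColumnCorpus-1).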

Parse files

To parse a file, include train in the file name and put the file in a directory $dir (for example, parse_file_dir/train.your_file_name). Then run:

CUDA_VISIBLE_DEVICES=0 python train.py --config $config_file --parse --target_dir $dir --keep_order

For sequence labeling, the file format should be column_format={0: 'text', 1:'ner'}; otherwise, modify line 232 in train.py. The parsed results will be written to outputs/. Note that you may need to preprocess your file with dummy tags for prediction; please check this issue for more details.
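
The dummy-tag preprocessing mentioned above might look like the following sketch, which turns tokenized sentences into the two-column text/ner layout. Several details are assumptions rather than facts from this repository: the helper name to_dummy_tagged_columns, the use of O as the dummy tag, a single space as the column separator, and a blank line between sentences.

```python
from typing import Iterable, List


def to_dummy_tagged_columns(sentences: Iterable[List[str]], dummy_tag: str = "O") -> str:
    """Render tokenized sentences as 'token tag' lines, one token per line.

    Hypothetical helper: every token receives the same placeholder tag so
    that the file has the two columns expected by {0: 'text', 1: 'ner'}.
    """
    lines = []
    for tokens in sentences:
        if not tokens:
            continue  # skip empty sentences
        for token in tokens:
            lines.append(f"{token} {dummy_tag}")
        lines.append("")  # assumed sentence boundary: one blank line
    return "\n".join(lines)


sentences = [
    ["Is", "it", "cricket", "season", "?"],
    ["Owner", "says", "its", "cool", "."],
]
print(to_dummy_tagged_columns(sentences))
```

The returned text can be written to a file such as parse_file_dir/train.your_file_name before running the parse command.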

Config File

The config files are based on yaml format.

targets: The target task
- ner: named entity recognition
- upos: part-of-speech tagging
- chunk: chunking
- ast: abstract extraction
- dependency: dependency parsing
- enhancedud: semantic dependency parsing/enhanced universal dependency parsing
ner: An example for the targets. If targets: ner, then the code will read the values with the key of ner.
- Corpus: The training corpora for the model, use : to split different corpora.
- tag_dictionary: A path to the tag dictionary for the task. If the path does not exist, the code will generate a tag dictionary at the path automatically.
target_dir: Save directory.
model_name: The trained models will be saved in $target_dir/$model_name.
model: The model to train, depending on the task.
- FastSequenceTagger: Sequence labeling model. The values are the parameters.
- SemanticDependencyParser: Syntactic/semantic dependency parsing model. The values are the parameters.
embeddings: The embeddings for the model, each key is the class name of the embedding and the values of the key are the parameters, see flair/embeddings.py for more details. For each embedding, use $classname-$id to represent the class. For example, if you want to use BERT and M-BERT for a single model, you can name: TransformerWordEmbeddings-0, TransformerWordEmbeddings-1.
trainer: The trainer class.
- ModelFinetuner: The trainer for fine-tuning embeddings or simply training a task model without ACE.
- ReinforcementTrainer: The trainer for training ACE.
train: The parameters for the train function in trainer (for example, ReinforcementTrainer.train()).
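
To make the relationship between these keys concrete, here is a minimal sketch that works on an already-loaded config (a plain dict). It only illustrates the conventions described above and is not how the repository reads its configs: the function describe_config and the example values (my_ner_model, the model names) are hypothetical.

```python
from typing import Any, Dict


def describe_config(config: Dict[str, Any]) -> Dict[str, Any]:
    """Summarize a loaded config dict (hypothetical helper, not CLNER code)."""
    target = config["targets"]
    if target not in config:
        raise KeyError(f"targets is {target!r} but the config has no {target!r} section")
    # Models are saved in $target_dir/$model_name.
    save_path = f"{config['target_dir'].rstrip('/')}/{config['model_name']}"
    # Embedding keys follow $classname-$id; keep only the class name.
    embedding_classes = [
        key.rpartition("-")[0] or key for key in config.get("embeddings", {})
    ]
    return {
        "target_section": config[target],
        "save_path": save_path,
        "embedding_classes": embedding_classes,
    }


example = {
    "targets": "ner",
    "ner": {"Corpus": "ColumnCorpus-1", "tag_dictionary": "resources/taggers/your_ner_tags.pkl"},
    "target_dir": "resources/taggers/",
    "model_name": "my_ner_model",  # hypothetical name
    "embeddings": {
        "TransformerWordEmbeddings-0": {"model": "bert-base-cased"},
        "TransformerWordEmbeddings-1": {"model": "bert-base-multilingual-cased"},
    },
}
print(describe_config(example)["save_path"])
```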

Citing Us

If you find the code helpful, please cite:

@inproceedings{wang2021improving,
    title = "{{Improving Named Entity Recognition by External Context Retrieving and Cooperative Learning}}",
    author={Wang, Xinyu and Jiang, Yong and Bach, Nguyen and Wang, Tao and Huang, Zhongqiang and Huang, Fei and Tu, Kewei},
    booktitle = "{the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (\textbf{ACL-IJCNLP 2021})}",
    month = aug,
    year = "2021",
    publisher = "Association for Computational Linguistics",
}

Contact

Feel free to post your questions or comments as issues or email them to Xinyu Wang.

Comments

Request for explanation regarding testing the model with some text

Thank you for your help; I can now train the model on the biomedical datasets with my own datasets. Now I want to test the model on some text. There are some files in the 'tests' folder, but could you explain more? (Is there a main function that runs everything needed to test the model? I am confused.)

opened by goodmary1121 5

How to config yaml to train the model using BC5CDR, NCBI disease dataset?

I succeeded in reproducing training on WNUT 17, so I want to try training the model on the BC5CDR and NCBI disease datasets. Could you upload the yaml files you used for the experiments on these datasets, or tell me which arguments I need to change (including the data folder)?

opened by goodmary1121 5

Received error "UnboundLocalError: local variable 'loss' referenced before assignment"

When I run "python train.py --config config/wnut17_doc.y", I get the error message below:

2021-09-30 09:31:34,791 Model training base path: "resources/taggers/xlmr-first_10epoch_2batch_2accumulate_0.000005lr_10000lrrate_eng_monolingual_crf_fast_norelearn_sentbatch_sentloss_finetune_nodev_wnut_doc_full_bertscore_eos_ner9"
2021-09-30 09:31:34,791 ----------------------------------------------------------------------------------------------------
2021-09-30 09:31:34,791 Device: cpu
2021-09-30 09:31:34,791 ----------------------------------------------------------------------------------------------------
2021-09-30 09:31:34,791 Embeddings storage mode: none
2021-09-30 09:31:35,288 ----------------------------------------------------------------------------------------------------
2021-09-30 09:31:35,292 Current loss interpolation: 1
['xlm-roberta-base']
Traceback (most recent call last):
  File "/pyt_pro/CLNER/flair/trainers/finetune_trainer.py", line 915, in train
    loss = self.model.forward_loss(student_input)
  File "/pyt_pro/CLNER/flair/models/sequence_tagger_model.py", line 1901, in forward_loss
    features = self.forward(data_points)
  File "/pyt_pro/CLNER/flair/models/sequence_tagger_model.py", line 1027, in forward
    self.mask=self.sequence_mask(torch.tensor(lengths),longest_token_sequence_in_batch).cuda().type_as(features)
  File "/root/miniconda3/envs/env3_pyt1.3.1/lib/python3.6/site-packages/torch/cuda/__init__.py", line 192, in _lazy_init
    _check_driver()
  File "/root/miniconda3/envs/env3_pyt1.3.1/lib/python3.6/site-packages/torch/cuda/__init__.py", line 102, in _check_driver
    http://www.nvidia.com/Download/index.aspx""")
AssertionError:
Found no NVIDIA driver on your system. Please check that you
have an NVIDIA GPU and installed a driver from
http://www.nvidia.com/Download/index.aspx
2021-09-30 09:31:35,633 [18, 18]
2021-09-30 09:31:35,633 [Sentence: "Is it cricket season ? I've killed about 20 in the laundry room in the past week ." - 18 Tokens, Sentence: "Owner says its cool . Trying to get info on when they'll be back in town now @AandLClothingCo" - 18 Tokens]
> <path>CLNER/flair/trainers/finetune_trainer.py(991)train()
-> if loss != 0:

Here it stopped in pdb mode. After I entered c to continue, it threw the error "UnboundLocalError: local variable 'loss' referenced before assignment".

opened by victorbai2 5

`pip install -r requirements.txt` does not run properly with python 3.7 and 3.8

Hello, I have tried to run your model on my machine and install the required packages with pip install -r requirements.txt, but the installation failed for some packages.

...
ERROR: Could not find a version that satisfies the requirement mkl-fft==1.0.6 (from versions: 1.3.0)
ERROR: No matching distribution found for mkl-fft==1.0.6

I have tried Python 3.7 and 3.8, and both failed.

Can you check that for me?

Thank you.
opened by ciaochiaociao 4

How to solve dependencies between libraries

Currently my computer has conllu installed (version 1.3.1, via 'conda install -c conda-forge conllu==1.3.1'), and I also tried to install allennlp (version 0.9.0) in the same way.

However, allennlp 0.9.0 cannot be installed. Please let me know if you know how to solve this problem.

opened by goodmary1121 3

Reproduction of the experiment WITHOUT cooperative learning

Hello,

I am currently trying to reproduce the results given in your paper. I have just run python train.py --config wnut17_doc.yaml 5 times, which gives micro-F1 scores of 58.68, 59.85, 60.78, 58.71 and 59.72, averaging 59.55. This is noticeably lower than the 60.20 reported in your paper. (see the figure below)

Is the config file wnut17_doc.yaml used for this experiment? Is there something I missed or did wrong?

opened by ciaochiaociao 3

How to add BIO-Bert as a transformer setting in configuration file?

I want to use Bio-Bert as a transformer, so where in the config file should I make changes? I also went through flair/embedding.py but there was no mention of Bio-Bert.

opened by ShubhamDarak37 2

`FileNotFoundError: [Errno 2] No such file or directory: '/home/yongjiang.jy/.flair/datasets/wnut17_bertscore_eos_doc_full'`

Hello, it seems that the code also needs the file wnut17_bertscore_eos_doc_full from the above path. Isn't it automatically generated by your code? If not, how can I get this file? Thank you.

Traceback (most recent call last):
  File "train.py", line 83, in <module>
    config = ConfigParser(config,all=args.all,zero_shot=args.zeroshot,other_shot=args.other,predict=args.predict)
  File "/mnt/nluuser/cwhsu/workspace/CLNER/flair/config_parser.py", line 62, in __init__
    self.corpus: ListCorpus=self.get_corpus
  File "/mnt/nluuser/cwhsu/workspace/CLNER/flair/config_parser.py", line 330, in get_corpus
    current_dataset=getattr(datasets,corpus_name)(**self.config[self.target][corpus])
  File "/mnt/nluuser/cwhsu/workspace/CLNER/flair/datasets.py", line 59, in __init__
    for file in data_folder.iterdir():
  File "/home/cwhsu/anaconda3/envs/CLNER4/lib/python3.6/pathlib.py", line 1081, in iterdir
    for name in self._accessor.listdir(self):
  File "/home/cwhsu/anaconda3/envs/CLNER4/lib/python3.6/pathlib.py", line 387, in wrapped
    return strfunc(str(pathobj), *args)
FileNotFoundError: [Errno 2] No such file or directory: '/home/yongjiang.jy/.flair/datasets/wnut17_bertscore_eos_doc_full'

opened by ciaochiaociao 2

ModuleNotFoundError: No module named 'flair.image_encoder'

A new issue I found when I run

CUDA_VISIBLE_DEVICES=0 python train.py --config config/wnut17_doc.yaml

After looking at your code, it seems you directly downloaded the flair package and modified it instead of installing it with pip. So I don't need to install flair, do I?

And it seems that there is no image_encoder in flair directory, so that may be why it can not be found?

Below is the error message:

  File "train.py", line 3, in <module>
    from flair.data import Dictionary, Sentence, Token, Label
  File "/mnt/nluuser/cwhsu/workspace/CLNER/flair/__init__.py", line 15, in <module>
    from . import models
  File "/mnt/nluuser/cwhsu/workspace/CLNER/flair/models/__init__.py", line 1, in <module>
    from .sequence_tagger_model import SequenceTagger, FastSequenceTagger
  File "/mnt/nluuser/cwhsu/workspace/CLNER/flair/models/sequence_tagger_model.py", line 8, in <module>
    import flair.nn
  File "/mnt/nluuser/cwhsu/workspace/CLNER/flair/nn.py", line 12, in <module>
    from flair.datasets import DataLoader
  File "/mnt/nluuser/cwhsu/workspace/CLNER/flair/datasets.py", line 18, in <module>
    from flair.image_encoder import *
ModuleNotFoundError: No module named 'flair.image_encoder'

opened by ciaochiaociao 2

Processing of the Retrieved Results of Google Search Engine

Hello,

I would like to ask you how you process the results returned from Google Search Engine. Specifically,

How do you handle title and snippet? Do you just add title as an independent sentence (where title and snippet will be ranked and selected separately) or prepend/append with the snippet (where the concatenated sentence will be ranked)?

How do you handle the incompleteness of the snippet (i.e., the snippet with the ellipsis ...)?

In your paper, what did you mean by "We filter the retrieved texts that contain any part of the datasets."?

opened by ciaochiaociao 1

CVE-2007-4559 Patch

Patching CVE-2007-4559

Hi, we are security researchers from the Advanced Research Center at Trellix. We have begun a campaign to patch a widespread bug named CVE-2007-4559. CVE-2007-4559 is a 15-year-old bug in the Python tarfile package. By using extract() or extractall() on a tarfile object without sanitizing input, a maliciously crafted .tar file could perform a directory path traversal attack. We found at least one unsanitized extractall() in your codebase and are providing a patch for you via pull request. The patch essentially checks whether all tarfile members will be extracted safely and throws an exception otherwise. We encourage you to use this patch or your own solution to secure against CVE-2007-4559. Further technical information about the vulnerability can be found in this blog.
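
The kind of check described above can be sketched as follows. This is an illustration only, not the patch submitted in the pull request: the function names are hypothetical, and the sketch validates member paths only, without inspecting symlink or hardlink targets.

```python
import os
import tarfile


def is_within_directory(directory: str, target: str) -> bool:
    """Return True if target resolves to a location inside directory."""
    abs_directory = os.path.realpath(directory)
    abs_target = os.path.realpath(target)
    return os.path.commonpath([abs_directory, abs_target]) == abs_directory


def safe_extractall(tar: tarfile.TarFile, path: str = ".") -> None:
    """Extract all members, refusing archives that would escape path.

    Hypothetical helper. Limitation: link targets inside the archive
    are not checked here.
    """
    for member in tar.getmembers():
        member_path = os.path.join(path, member.name)
        if not is_within_directory(path, member_path):
            raise ValueError(f"blocked path traversal in tar member: {member.name!r}")
    tar.extractall(path)
```

A call such as tar.extractall(target) would then become safe_extractall(tar, target). Recent Python versions also provide extraction filters in tarfile, which may be preferable where available.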

If you have further questions, you may contact us through this project's lead researcher, Kasimir Schulz.

opened by TrellixVulnTeam 0

pdb: loss reference before assignment

Hey,

When running: python train.py --config config/wnut17_doc_cl_kl.yaml, with the original code (only change in paths) I run into an error that the loss is referenced before assignment. See the following screenshot:

The given TypeError causes this issue. I have tried the option to add is_split_into_words=True into line 3171 in embeddings.py. This gave a new error: with again same result (no assignment of loss). What can be the cause of this?

opened by Nuveyla 8
