[ACL-IJCNLP 2021] Improving Named Entity Recognition by External Context Retrieving and Cooperative Learning

Overview

CLNER

The code is for our ACL-IJCNLP 2021 paper: Improving Named Entity Recognition by External Context Retrieving and Cooperative Learning

CLNER is a framework for improving the accuracy of NER models by retrieving external contexts and then using cooperative learning to improve both input views. The code is initially based on flair version 0.4.3. It was then extended with knowledge distillation and ACE approaches to distill smaller models or achieve SOTA results. The config files in these repos are also applicable to this code.


Guide

Requirements

The project is based on PyTorch 1.1+ and Python 3.6+. To run our code, install:

pip install -r requirements.txt

The following requirements should be satisfied:

Datasets

The datasets used in our paper are available here.

Training

Training NER Models with External Contexts

Run:

CUDA_VISIBLE_DEVICES=0 python train.py --config config/wnut17_doc.yaml

Training NER Models with Cooperative Learning

Run:

CUDA_VISIBLE_DEVICES=0 python train.py --config config/wnut17_doc_cl_kl.yaml
CUDA_VISIBLE_DEVICES=0 python train.py --config config/wnut17_doc_cl_l2.yaml
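To launch both cooperative-learning configurations one after another, a small wrapper along the following lines may help. This is only a sketch: the helper names `build_train_command` and `run_all` are hypothetical and not part of the repository, and it assumes the script is started from the repository root, where train.py lives. By default it only prints the commands.

```python
import os
import subprocess
import sys

# Config paths taken from the commands above.
CL_CONFIGS = ["config/wnut17_doc_cl_kl.yaml", "config/wnut17_doc_cl_l2.yaml"]


def build_train_command(config_path):
    """Return the argv list equivalent to `python train.py --config <config_path>`."""
    return [sys.executable, "train.py", "--config", config_path]


def run_all(configs, gpu_id="0", dry_run=True):
    """Run one training job per config, or only print the commands if dry_run is True."""
    # Same effect as prefixing the command with CUDA_VISIBLE_DEVICES=<gpu_id>.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpu_id)
    for config_path in configs:
        cmd = build_train_command(config_path)
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, env=env, check=True)


if __name__ == "__main__":
    run_all(CL_CONFIGS)  # pass dry_run=False to actually start training
```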

Train on Your Own Dataset

To set the dataset manually, specify it in the $config_file as follows:

targets: ner
ner:
  Corpus: ColumnCorpus-1
  ColumnCorpus-1: 
    data_folder: datasets/conll_03_english
    column_format:
      0: text
      1: pos
      2: chunk
      3: ner
    tag_to_bioes: ner
  tag_dictionary: resources/taggers/your_ner_tags.pkl

The tag_dictionary is a path to the tag dictionary for the task. If the path does not exist, the code will generate a tag dictionary at that path automatically. The dataset format is Corpus: $CorpusClassName-$id, where $id is a name for the dataset (anything you like). You can train on multiple datasets jointly by listing several corpora in Corpus, separated by :.
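As a rough illustration of the $CorpusClassName-$id naming and the : separator described under Config File, the sketch below splits a Corpus value into its entries. The functions `parse_corpus_spec` and `corpus_keys` are hypothetical helpers written for this example; they are not the repository's config parser, which may behave differently.

```python
def parse_corpus_spec(spec):
    """Split 'ClassA-id1:ClassB-id2' into [('ClassA', 'id1'), ('ClassB', 'id2')]."""
    entries = []
    for key in spec.split(":"):
        key = key.strip()
        # Everything before the first '-' is treated as the class name.
        class_name, sep, corpus_id = key.partition("-")
        if not sep or not class_name or not corpus_id:
            raise ValueError(f"expected $CorpusClassName-$id, got {key!r}")
        entries.append((class_name, corpus_id))
    return entries


def corpus_keys(spec):
    """Return the keys expected under the target section, one per corpus."""
    return [f"{name}-{cid}" for name, cid in parse_corpus_spec(spec)]


print(corpus_keys("ColumnCorpus-1:ColumnCorpus-2"))
# ['ColumnCorpus-1', 'ColumnCorpus-2']
```

Each returned key would then need its own block (data_folder, column_format, and so on) under the target section, in the same way as ColumnCorpus-1 in the example above.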

Please refer to Config File for more details.

Parse files

To parse a file, include train in the file name and put the file in a directory $dir (for example, parse_file_dir/train.your_file_name). Then run:

CUDA_VISIBLE_DEVICES=0 python train.py --config $config_file --parse --target_dir $dir --keep_order

For sequence labeling, the file should follow column_format={0: 'text', 1: 'ner'}; alternatively, you can modify line 232 in train.py. The parsed results will be written to outputs/. Note that you may need to preprocess your file with dummy tags for prediction; please check this issue for more details.
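The preprocessing step mentioned above could look roughly like the following sketch, which writes raw sentences into a two-column file with a dummy tag. Several details are assumptions rather than facts from this repository: whitespace tokenization, space-separated columns, a blank line between sentences (the usual CoNLL-style layout), and "O" as the dummy tag. The function name `write_dummy_tagged_file` is hypothetical.

```python
from pathlib import Path


def write_dummy_tagged_file(sentences, out_dir, name, dummy_tag="O"):
    """Write sentences as 'token tag' lines to <out_dir>/train.<name> and return the path."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # The file name must contain 'train', as required for parsing.
    out_path = out_dir / f"train.{name}"
    lines = []
    for sentence in sentences:
        for token in sentence.split():  # assumption: whitespace tokenization
            lines.append(f"{token} {dummy_tag}")
        lines.append("")  # a blank line ends the sentence
    out_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return out_path


if __name__ == "__main__":
    path = write_dummy_tagged_file(
        ["Is it cricket season ?"], "parse_file_dir", "your_file_name"
    )
    print(path)
```

The resulting directory can then be passed as $dir to the --parse command shown above.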

Config File

The config files are based on yaml format.

  • targets: The target task
    • ner: named entity recognition
    • upos: part-of-speech tagging
    • chunk: chunking
    • ast: abstract extraction
    • dependency: dependency parsing
    • enhancedud: semantic dependency parsing/enhanced universal dependency parsing
  • ner: An example for the targets. If targets: ner, then the code will read the values with the key of ner.
    • Corpus: The training corpora for the model, use : to split different corpora.
    • tag_dictionary: A path to the tag dictionary for the task. If the path does not exist, the code will generate a tag dictionary at the path automatically.
  • target_dir: Save directory.
  • model_name: The trained models will be saved in $target_dir/$model_name.
  • model: The model to train, depending on the task.
    • FastSequenceTagger: Sequence labeling model. The values are the parameters.
    • SemanticDependencyParser: Syntactic/semantic dependency parsing model. The values are the parameters.
  • embeddings: The embeddings for the model, each key is the class name of the embedding and the values of the key are the parameters, see flair/embeddings.py for more details. For each embedding, use $classname-$id to represent the class. For example, if you want to use BERT and M-BERT for a single model, you can name: TransformerWordEmbeddings-0, TransformerWordEmbeddings-1.
  • trainer: The trainer class.
    • ModelFinetuner: The trainer for fine-tuning embeddings or for simply training a task model without ACE.
    • ReinforcementTrainer: The trainer for training ACE.
  • train: the parameters for the train function in trainer (for example, ReinforcementTrainer.train()).
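The relationships between the keys listed above can be sketched with a plain Python dictionary standing in for a loaded YAML file. This is an illustration only: the values for target_dir and model_name are made up, the embedding parameters are omitted, and the three helper functions are hypothetical rather than taken from the repository.

```python
import os

# A dictionary mirroring the structure of a config file (values are illustrative).
config = {
    "targets": "ner",
    "ner": {
        "Corpus": "ColumnCorpus-1",
        "tag_dictionary": "resources/taggers/your_ner_tags.pkl",
    },
    "target_dir": "resources/taggers/",
    "model_name": "my_ner_model",  # hypothetical model name
    "embeddings": {
        "TransformerWordEmbeddings-0": {},  # parameters omitted in this sketch
        "TransformerWordEmbeddings-1": {},
    },
}


def target_section(cfg):
    """Return the block named by `targets` (here, the values under `ner`)."""
    return cfg[cfg["targets"]]


def model_dir(cfg):
    """Return $target_dir/$model_name, where trained models are saved."""
    return os.path.join(cfg["target_dir"], cfg["model_name"])


def embedding_classes(cfg):
    """Strip the -$id suffix from each $classname-$id embedding key."""
    return [key.rsplit("-", 1)[0] for key in cfg["embeddings"]]


print(target_section(config)["Corpus"])
print(model_dir(config))
print(embedding_classes(config))
```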

Citing Us

If you find the code helpful, please cite:

@inproceedings{wang2021improving,
    title = "{{Improving Named Entity Recognition by External Context Retrieving and Cooperative Learning}}",
    author={Wang, Xinyu and Jiang, Yong and Bach, Nguyen and Wang, Tao and Huang, Zhongqiang and Huang, Fei and Tu, Kewei},
    booktitle = "{the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (\textbf{ACL-IJCNLP 2021})}",
    month = aug,
    year = "2021",
    publisher = "Association for Computational Linguistics",
}

Contact

Feel free to post your questions or comments in the issues or to email them to Xinyu Wang.

Comments
  • Request for explanation regarding testing the model with some text

    Thank you for your help; I can now train the model on the biomedical datasets and on my own datasets. Next I want to test the model on some text. There are some files in the 'tests' folder, but could you explain them in more detail? (Is there a main function that runs everything needed to test the model? I am confused.)

    opened by goodmary1121 5
  • How to config yaml to train the model using BC5CDR, NCBI disease dataset?

    I succeeded in reproducing the training of the model on WNUT 17, so I now want to train the model on the BC5CDR and NCBI disease datasets. Could you upload the yaml files you used for the experiments on these datasets, or tell me which arguments I need to change (including the data folder)?

    opened by goodmary1121 5
  • Received error "UnboundLocalError: local variable 'loss' referenced before assignment"

    When I run "python train.py --config config/wnut17_doc.y", I get the error message below:

    2021-09-30 09:31:34,791 Model training base path: "resources/taggers/xlmr-first_10epoch_2batch_2accumulate_0.000005lr_10000lrrate_eng_monolingual_crf_fast_norelearn_sentbatch_sentloss_finetune_nodev_wnut_doc_full_bertscore_eos_ner9"
    2021-09-30 09:31:34,791 ----------------------------------------------------------------------------------------------------
    2021-09-30 09:31:34,791 Device: cpu
    2021-09-30 09:31:34,791 ----------------------------------------------------------------------------------------------------
    2021-09-30 09:31:34,791 Embeddings storage mode: none
    2021-09-30 09:31:35,288 ----------------------------------------------------------------------------------------------------
    2021-09-30 09:31:35,292 Current loss interpolation: 1
    ['xlm-roberta-base']
    Traceback (most recent call last):
      File "/pyt_pro/CLNER/flair/trainers/finetune_trainer.py", line 915, in train
        loss = self.model.forward_loss(student_input)
      File "/pyt_pro/CLNER/flair/models/sequence_tagger_model.py", line 1901, in forward_loss
        features = self.forward(data_points)
      File "/pyt_pro/CLNER/flair/models/sequence_tagger_model.py", line 1027, in forward
        self.mask=self.sequence_mask(torch.tensor(lengths),longest_token_sequence_in_batch).cuda().type_as(features)
      File "/root/miniconda3/envs/env3_pyt1.3.1/lib/python3.6/site-packages/torch/cuda/__init__.py", line 192, in _lazy_init
        _check_driver()
      File "/root/miniconda3/envs/env3_pyt1.3.1/lib/python3.6/site-packages/torch/cuda/__init__.py", line 102, in _check_driver
        http://www.nvidia.com/Download/index.aspx""")
    AssertionError:
    Found no NVIDIA driver on your system. Please check that you
    have an NVIDIA GPU and installed a driver from
    http://www.nvidia.com/Download/index.aspx
    2021-09-30 09:31:35,633 [18, 18]
    2021-09-30 09:31:35,633 [Sentence: "Is it cricket season ? I've killed about 20 in the laundry room in the past week ." - 18 Tokens, Sentence: "Owner says its cool . Trying to get info on when they'll be back in town now @AandLClothingCo" - 18 Tokens]
    > <path>CLNER/flair/trainers/finetune_trainer.py(991)train()
    -> if loss != 0:
    

    It stopped here in pdb mode. After I entered c to continue, it threw the error "UnboundLocalError: local variable 'loss' referenced before assignment".

    opened by victorbai2 5
  • `pip install -r requirements.txt` does not run properly with python 3.7 and 3.8

    Hello, I tried to set up your model on my machine and install the required packages with pip install -r requirements.txt, but the installation failed for some packages.

    ...
    ERROR: Could not find a version that satisfies the requirement mkl-fft==1.0.6 (from versions: 1.3.0)
    ERROR: No matching distribution found for mkl-fft==1.0.6
    

    I have tried with python 3.7 and 3.8 version, and they all failed.

    Can you check that for me?

    Thank you.

    opened by ciaochiaociao 4
  • How to solve dependencies between libraries

    [image] Currently my computer has conllu (version 1.3.1, installed with 'conda install -c conda-forge conllu==1.3.1'), and I also tried to install allennlp (version 0.9.0) in the same way.

    [image] But allennlp version 0.9.0 cannot be installed. Please help if you know how to solve the problem.

    opened by goodmary1121 3
  • Reproduction of the experiment WITHOUT cooperative learning

    Hello,

    I am currently trying to reproduce the results given in your paper. I ran python train.py --config wnut17_doc.yaml 5 times, which gave micro-F1 scores of 58.68, 59.85, 60.78, 58.71 and 59.72. The average is 59.55, which is noticeably lower than the 60.20 reported in your paper (see the figure below).

    image

    Is the config file wnut17_doc.yaml used for this experiment? Is there something I missed or did wrong?

    opened by ciaochiaociao 3
  • How to add BIO-Bert as a transformer setting in configuration file?

    I want to use Bio-Bert as a transformer. Where in the config file should I make changes? I also went through flair/embedding.py, but there was no mention of Bio-Bert.

    opened by ShubhamDarak37 2
  • `FileNotFoundError: [Errno 2] No such file or directory: '/home/yongjiang.jy/.flair/datasets/wnut17_bertscore_eos_doc_full'`

    Hello, it seems that the code also needs the file wnut17_bertscore_eos_doc_full from the above path. Isn't it generated automatically by your code? If not, how can I get this file? Thank you.

    Traceback (most recent call last):
      File "train.py", line 83, in <module>
        config = ConfigParser(config,all=args.all,zero_shot=args.zeroshot,other_shot=args.other,predict=args.predict)
      File "/mnt/nluuser/cwhsu/workspace/CLNER/flair/config_parser.py", line 62, in __init__
        self.corpus: ListCorpus=self.get_corpus
      File "/mnt/nluuser/cwhsu/workspace/CLNER/flair/config_parser.py", line 330, in get_corpus
        current_dataset=getattr(datasets,corpus_name)(**self.config[self.target][corpus])
      File "/mnt/nluuser/cwhsu/workspace/CLNER/flair/datasets.py", line 59, in __init__
        for file in data_folder.iterdir():
      File "/home/cwhsu/anaconda3/envs/CLNER4/lib/python3.6/pathlib.py", line 1081, in iterdir
        for name in self._accessor.listdir(self):
      File "/home/cwhsu/anaconda3/envs/CLNER4/lib/python3.6/pathlib.py", line 387, in wrapped
        return strfunc(str(pathobj), *args)
    FileNotFoundError: [Errno 2] No such file or directory: '/home/yongjiang.jy/.flair/datasets/wnut17_bertscore_eos_doc_full'
    
    opened by ciaochiaociao 2
  • ModuleNotFoundError: No module named 'flair.image_encoder'

    A new issue I found when I run

    CUDA_VISIBLE_DEVICES=0 python train.py --config config/wnut17_doc.yaml

    After looking at your code, it seems that you copied the flair package into the repository and modified it instead of installing it with pip. So I don't need to install flair, do I?

    And it seems that there is no image_encoder in flair directory, so that may be why it can not be found?

    Below is the error message:

      File "train.py", line 3, in <module>
        from flair.data import Dictionary, Sentence, Token, Label
      File "/mnt/nluuser/cwhsu/workspace/CLNER/flair/__init__.py", line 15, in <module>
        from . import models
      File "/mnt/nluuser/cwhsu/workspace/CLNER/flair/models/__init__.py", line 1, in <module>
        from .sequence_tagger_model import SequenceTagger, FastSequenceTagger
      File "/mnt/nluuser/cwhsu/workspace/CLNER/flair/models/sequence_tagger_model.py", line 8, in <module>
        import flair.nn
      File "/mnt/nluuser/cwhsu/workspace/CLNER/flair/nn.py", line 12, in <module>
        from flair.datasets import DataLoader
      File "/mnt/nluuser/cwhsu/workspace/CLNER/flair/datasets.py", line 18, in <module>
        from flair.image_encoder import *
    ModuleNotFoundError: No module named 'flair.image_encoder'
    
    opened by ciaochiaociao 2
  • Processing of the Retrieved Results of Google Search Engine

    Hello,

    I would like to ask you how you process the results returned from Google Search Engine. Specifically,

    1. How do you handle title and snippet? Do you just add title as an independent sentence (where title and snippet will be ranked and selected separately) or prepend/append with the snippet (where the concatenated sentence will be ranked)?
    2. How do you handle the incompleteness of the snippet (i.e., the snippet with the ellipsis ...)?
    3. In your paper, what did you mean by

      We filter the retrieved texts that contain any part of the datasets.

    opened by ciaochiaociao 1
  • CVE-2007-4559 Patch

    Patching CVE-2007-4559

    Hi, we are security researchers from the Advanced Research Center at Trellix. We have begun a campaign to patch a widespread bug named CVE-2007-4559. CVE-2007-4559 is a 15-year-old bug in the Python tarfile package. By using extract() or extractall() on a tarfile object without sanitizing input, a maliciously crafted .tar file could perform a directory path traversal attack. We found at least one unsanitized extractall() in your codebase and are providing a patch for you via pull request. The patch essentially checks whether all tarfile members will be extracted safely and throws an exception otherwise. We encourage you to use this patch or your own solution to secure against CVE-2007-4559. Further technical information about the vulnerability can be found in this blog.

    If you have further questions, you may contact us through this project's lead researcher, Kasimir Schulz.

    opened by TrellixVulnTeam 0
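The kind of check described in the issue above (verify every member before extracting, raise otherwise) can be sketched as follows. This is not the patch from the pull request; the names `is_within_directory` and `safe_extractall` are chosen for this example, and the check only covers member paths, not symlink or hardlink targets.

```python
import os
import tarfile


def is_within_directory(directory, target):
    """Return True if `target` resolves to a location inside `directory`."""
    abs_directory = os.path.abspath(directory)
    abs_target = os.path.abspath(target)
    return os.path.commonpath([abs_directory, abs_target]) == abs_directory


def safe_extractall(tar, path="."):
    """Extract all members of an open TarFile, refusing archives with path traversal."""
    for member in tar.getmembers():
        member_path = os.path.join(path, member.name)
        if not is_within_directory(path, member_path):
            raise ValueError(f"blocked path traversal in tar member: {member.name!r}")
    # Only reached when every member stays inside `path`.
    tar.extractall(path)
```

A call such as `tar.extractall(target)` would then become `safe_extractall(tar, target)`. Recent Python versions also offer extraction filters on `extractall`, which may be preferable where available.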
  • pdb: loss reference before assignment

    Hey,

    When running python train.py --config config/wnut17_doc_cl_kl.yaml with the original code (the only change is in the paths), I run into an error saying that the loss is referenced before assignment. See the following screenshot:

    image

    The given TypeError causes this issue. I tried adding is_split_into_words=True at line 3171 in embeddings.py. This gave a new error ([image]), again with the same result (no assignment of loss). What could be the cause of this?

    opened by Nuveyla 8