A Unified Generative Framework for Various NER Subtasks

Overview

This is the code for the ACL-IJCNLP 2021 paper A Unified Generative Framework for Various NER Subtasks.

Install the packages in requirements.txt, then use the following commands to install two additional packages:

pip install git+https://github.com/fastnlp/fastNLP@dev
pip install git+https://github.com/fastnlp/fitlog

You need to put your data in a folder parallel to this repo, organized as follows:

    - BARTNER/
        - train.py
        ...
    - data/
        - conll2003
            - train.txt
            - test.txt
            - dev.txt
        - en-ontonotes
            - ...
        - Share_2013
        - Share_2014
        - CADEC
        - en_ace04
        - en_ace05
        - genia

For conll2003 and en-ontonotes, your data in each split should look like the following (the first column is the word, the second column is the tag; we assume BIO tagging):

LONDON B-LOC
1996-08-30 O

West B-MISC
Indian I-MISC
all-rounder O
Phil B-PER
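
For illustration, here is a minimal sketch of reading this two-column format into (words, tags) pairs per sentence. The helper name read_bio is hypothetical; the loaders in data/pipe.py already handle these files:

def read_bio(path):
    """Read a two-column BIO file into a list of (words, tags) sentences."""
    sentences, words, tags = [], [], []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:  # a blank line separates sentences
                if words:
                    sentences.append((words, tags))
                    words, tags = [], []
                continue
            word, tag = line.rsplit(' ', 1)  # first column word, second column tag
            words.append(word)
            tags.append(tag)
    if words:  # flush the last sentence if the file has no trailing blank line
        sentences.append((words, tags))
    return sentences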

For the nested datasets en_ace04, en_ace05 and genia, the data should look like the following (each line is a JSON object containing ners and sentences keys):

{"ners": [[[16, 16, "DNA"], [4, 8, "DNA"], [24, 26, "DNA"], [19, 20, "DNA"]], [[31, 31, "DNA"], [2, 2, "DNA"], [4, 4, "DNA"], [30, 31, "DNA"]], [[23, 24, "RNA"], [14, 15, "cell_type"], [1, 2, "RNA"]], [[2, 2, "DNA"]], [], [[0, 0, "DNA"], [9, 9, "cell_type"]]], "sentences": [["There", "is", "a", "single", "methionine", "codon-initiated", "open", "reading", "frame", "of", "1,458", "nt", "in", "frame", "with", "a", "homeobox", "and", "a", "CAX", "repeat", ",", "and", "the", "open", "reading", "frame", "is", "predicted", "to", "encode", "a", "protein", "of", "51,659", "daltons."], ["When", "the", "homeodomain", "from", "HB24", "was", "compared", "to", "known", "mammalian", "and", "Drosophila", "homeodomains", "it", "was", "found", "to", "be", "only", "moderately", "conserved,", "but", "when", "it", "was", "compared", "to", "a", "highly", "diverged", "Drosophila", "homeodomain", ",", "H2.0,", "it", "was", "found", "to", "be", "80%", "identical."], ["The", "HB24", "mRNA", "was", "absent", "or", "present", "at", "low", "levels", "in", "normal", "B", "and", "T", "lymphocytes", ";", "however,", "with", "the", "appropriate", "activation", "signal", "HB24", "mRNA", "was", "induced", "within", "several", "hours", "even", "in", "the", "presence", "of", "cycloheximide", "."], ["Characterization", "of", "HB24", "expression", "in", "lymphoid", "and", "select", "developing", "tissues", "was", "performed", "by", "in", "situ", "hybridization", "."], ["Positive", "hybridization", "was", "found", "in", "thymus", ",", "tonsil", ",", "bone", "marrow", ",", "developing", "vessels", ",", "and", "in", "fetal", "brain", "."], ["HB24", "is", "likely", "to", "have", "an", "important", "role", "in", "lymphocytes", "as", "well", "as", "in", "certain", "developing", "tissues", "."]]}
{"ners": [[[16, 16, "DNA"], [4, 8, "DNA"], [24, 26, "DNA"], [19, 20, "DNA"]], [[31, 31, "DNA"], [2, 2, "DNA"], [4, 4, "DNA"], [30, 31, "DNA"]], [[23, 24, "RNA"], [14, 15, "cell_type"], [1, 2, "RNA"]], [[2, 2, "DNA"]], [], [[0, 0, "DNA"], [9, 9, "cell_type"]]], "sentences": [["There", "is", "a", "single", "methionine", "codon-initiated", "open", "reading", "frame", "of", "1,458", "nt", "in", "frame", "with", "a", "homeobox", "and", "a", "CAX", "repeat", ",", "and", "the", "open", "reading", "frame", "is", "predicted", "to", "encode", "a", "protein", "of", "51,659", "daltons."], ["When", "the", "homeodomain", "from", "HB24", "was", "compared", "to", "known", "mammalian", "and", "Drosophila", "homeodomains", "it", "was", "found", "to", "be", "only", "moderately", "conserved,", "but", "when", "it", "was", "compared", "to", "a", "highly", "diverged", "Drosophila", "homeodomain", ",", "H2.0,", "it", "was", "found", "to", "be", "80%", "identical."], ["The", "HB24", "mRNA", "was", "absent", "or", "present", "at", "low", "levels", "in", "normal", "B", "and", "T", "lymphocytes", ";", "however,", "with", "the", "appropriate", "activation", "signal", "HB24", "mRNA", "was", "induced", "within", "several", "hours", "even", "in", "the", "presence", "of", "cycloheximide", "."], ["Characterization", "of", "HB24", "expression", "in", "lymphoid", "and", "select", "developing", "tissues", "was", "performed", "by", "in", "situ", "hybridization", "."], ["Positive", "hybridization", "was", "found", "in", "thymus", ",", "tonsil", ",", "bone", "marrow", ",", "developing", "vessels", ",", "and", "in", "fetal", "brain", "."], ["HB24", "is", "likely", "to", "have", "an", "important", "role", "in", "lymphocytes", "as", "well", "as", "in", "certain", "developing", "tissues", "."]]}
...
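
For illustration, a minimal sketch of reading this jsonlines format. This assumes, based on the example above, that sentences and ners are parallel lists and that each span is [start, end, label] over word indexes, with both ends inclusive (since [16, 16] covers a single word):

import json

def read_nested(path):
    """Read nested-NER jsonlines into (words, spans) pairs per sentence."""
    samples = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            obj = json.loads(line)
            # "sentences" and "ners" have one entry per sentence
            for words, ners in zip(obj['sentences'], obj['ners']):
                # ners is a list of [start, end, label]; it may be empty
                samples.append((words, ners))
    return samples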

For the discontinuous datasets Share_2013, Share_2014 and CADEC, the data should look like the following (each sample takes two lines; an empty second line means the sample has no entities):

Abdominal cramps , flatulence , gas , bloating .
0,1 ADR|3,3 ADR|7,7 ADR|5,5 ADR

Cramps would start within 15 minutes of taking pill , even during meals .
0,0 ADR

...
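
For illustration, a minimal sketch of parsing the annotation line. The exact encoding of discontinuous entities is defined by the pre-processing script mentioned below; here we assume each entity is a run of comma-separated word offsets followed by its tag, with entities separated by '|':

def parse_annotation(line):
    """Parse e.g. '0,1 ADR|3,3 ADR' into [([0, 1], 'ADR'), ([3, 3], 'ADR')]."""
    entities = []
    line = line.strip()
    if not line:  # an empty annotation line means no entities
        return entities
    for ann in line.split('|'):
        offsets, tag = ann.rsplit(' ', 1)
        # discontinuous entities would carry more than one offset pair
        # (an assumption; check the pre-processing script below)
        indexes = [int(i) for i in offsets.replace(',', ' ').split()]
        entities.append((indexes, tag))
    return entities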

We use code from https://github.com/daixiangau/acl2020-transition-discontinuous-ner to pre-process the data.

You can run the code directly with

python train.py

You should see output like the following:

Save cache to caches/data_facebook/bart-large_conll2003_word.pt.                                                                                                        
max_len_a:0.6, max_len:10
In total 3 datasets:
        test has 3453 instances.
        train has 14041 instances.
        dev has 3250 instances.

The number of tokens in tokenizer  50265
50269 50274
input fields after batch(if batch size is 2):
        tgt_tokens: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2, 8]) 
        src_tokens: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2, 11]) 
        first: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2, 11]) 
        src_seq_len: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2]) 
        tgt_seq_len: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2]) 
target fields after batch(if batch size is 2):
        entities: (1)type:numpy.ndarray (2)dtype:object, (3)shape:(2,) 
        tgt_tokens: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2, 8]) 
        target_span: (1)type:numpy.ndarray (2)dtype:object, (3)shape:(2,) 
        tgt_seq_len: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2]) 

training epochs started 2021-06-02-11-49-26-964889
Epoch 1/30:   0%|                                                         | 15/32430 [00:06<3:12:37,  2.80it/s, loss:6.96158

Some important Python files are listed below:

- BartNER
  - data
     - pipe.py # load and process data
  - model
     - bart.py # the model file
  - train.py  # the training file

The Loader classes in data/pipe.py load the data, and the data.BartNERPipe class processes it. A loader should read the data into a DataBundle object; you can mimic the provided loaders to write your own. As long as your dataset provides the following four fields, BartNERPipe should be able to process it (see the sketch after this list):

- raw_words  # List[str]
    # ['AL-AIN', ',', 'United', 'Arab', 'Emirates', '1996-12-06']
- entities  # List[List[str]]
    # [['AL-AIN'], ['United', 'Arab', 'Emirates']]
- entity_tags  # List[str], the same length as entities
    # ['loc', 'loc']
- entity_spans # List[List[int]], each inner list must have an even number of ints, giving the start (inclusive) and end (exclusive) of each entity segment
    # [[0, 1], [2, 5]] or, for discontinuous NER, [[0, 1, 5, 7], [2, 3, 5, 7], ...]
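
For illustration, a minimal sketch of a custom loader built on the fastNLP API used in this repo. The function name load_my_split and the hard-coded instance are hypothetical; a real loader would parse these values from your files:

from fastNLP import DataSet, Instance
from fastNLP.io import DataBundle

def load_my_split(path):
    """Build a DataSet whose instances carry the four required fields."""
    ds = DataSet()
    # A real loader would parse `path`; this instance mirrors the fields above.
    ds.append(Instance(
        raw_words=['AL-AIN', ',', 'United', 'Arab', 'Emirates', '1996-12-06'],
        entities=[['AL-AIN'], ['United', 'Arab', 'Emirates']],
        entity_tags=['loc', 'loc'],
        entity_spans=[[0, 1], [2, 5]],  # start inclusive, end exclusive
    ))
    return ds

# Bundle the splits so that BartNERPipe.process(data_bundle) can consume them.
data_bundle = DataBundle(datasets={
    'train': load_my_split('train.txt'),
    'dev': load_my_split('dev.txt'),
    'test': load_my_split('test.txt'),
})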

To help you reproduce our results, we have hard-coded the hyper-parameters for each dataset in the code; you can change them as needed. We conducted all experiments on an NVIDIA 3090 (24 GB memory). Some known difficulties in reproducing this code: (1) on some datasets (nested and discontinuous), the F1 can drop to 0 or near 0 during training; please discard such runs; (2) randomness causes large performance variance on some datasets, so please run multiple times.

We deeply understand how frustrating it can be when results are hard to reproduce; we tried our best to make sure the results are at least reproducible on our equipment (we usually average over at least five runs).

Comments
  • Indexes shifting

    Hello, nice work! Sorry if I missed something. I have a question about the decoder's output in your code.

    Based on your code, it seems you're shifting the tokens' position indexes by the number of labels. I was wondering why you shift the tokens instead of the labels. Thank you!

    Please find an example below (results from here).

    raw_words: ['SOCCER', '-', 'JAPAN', 'GET', 'LUCKY', 'WIN', ',', 'CHINA', 'IN', 'SURPRISE', 'DEFEAT', '.']
    target_ids: [0, 11, 2, 20, 3, 1]	# each of the token positions is shifted by 6
    word_bpe_ids: [0, 13910, 3376, 2076, 111, 344, 591, 1889, 7777, 226, 23806, 975, 17164, 2156, 3858, 16712, 2808, 31987, 4454, 18819, 5885, 10885, 2571, 479, 2]
    word_bpe_tokens: ['<s>', 'ĠSO', 'CC', 'ER', 'Ġ-', 'ĠJ', 'AP', 'AN', 'ĠGET', 'ĠL', 'UCK', 'Y', 'ĠWIN', 'Ġ,', 'ĠCH', 'INA', 'ĠIN', 'ĠSUR', 'PR', 'ISE', 'ĠDE', 'FE', 'AT', 'Ġ.', '</s>']
    
    opened by xinsu626 · 4 comments
  • FASTNLP batch loader error

    Has anyone seen similar issues when running train.py with a customized dataset? My error is as follows:

    training epochs started 2022-12-31-17-14-10-312655
    Traceback (most recent call last):
      File "train.py", line 252, in <module>
        trainer.train(load_best_model=False)
      File "/home/velvinfu/miniconda3/envs/py38/lib/python3.8/site-packages/fastNLP/core/trainer.py", line 667, in train
        raise e
      File "/home/velvinfu/miniconda3/envs/py38/lib/python3.8/site-packages/fastNLP/core/trainer.py", line 658, in train
        self._train()
      File "/home/velvinfu/miniconda3/envs/py38/lib/python3.8/site-packages/fastNLP/core/trainer.py", line 712, in _train
        for batch_x, batch_y in self.data_iterator:
      File "/home/velvinfu/miniconda3/envs/py38/lib/python3.8/site-packages/fastNLP/core/batch.py", line 267, in __iter__
        for indices, batch_x, batch_y in self.dataiter:
      File "/home/velvinfu/miniconda3/envs/py38/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
        data = self._next_data()
      File "/home/velvinfu/miniconda3/envs/py38/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1085, in _next_data
        return self._process_data(data)
      File "/home/velvinfu/miniconda3/envs/py38/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1111, in _process_data
        data.reraise()
      File "/home/velvinfu/miniconda3/envs/py38/lib/python3.8/site-packages/torch/_utils.py", line 428, in reraise
        raise self.exc_type(msg)
    ValueError: Caught ValueError in DataLoader worker process 0.
    Original Traceback (most recent call last):
      File "/home/velvinfu/miniconda3/envs/py38/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 198, in _worker_loop
        data = fetcher.fetch(index)
      File "/home/velvinfu/miniconda3/envs/py38/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
        return self.collate_fn(data)
      File "/home/velvinfu/miniconda3/envs/py38/lib/python3.8/site-packages/fastNLP/core/batch.py", line 91, in collate_fn
        sin_y = _pad(sin_y, dataset=self.dataset, as_numpy=self.as_numpy)
      File "/home/velvinfu/miniconda3/envs/py38/lib/python3.8/site-packages/fastNLP/core/batch.py", line 44, in _pad
        res = f.pad(vlist)
      File "/home/velvinfu/miniconda3/envs/py38/lib/python3.8/site-packages/fastNLP/core/field.py", line 492, in pad
        return self.padder(contents, field_name=self.name, field_ele_dtype=self.dtype, dim=self._cell_ndim)
      File "/home/velvinfu/miniconda3/envs/py38/lib/python3.8/site-packages/fastNLP/core/field.py", line 247, in __call__
        return np.array(contents)
    ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (16,) + inhomogeneous part.

    opened by velvinnn · 2 comments
  • Error when trying a Chinese BART

    I want to use the model on a Chinese dataset, so I replaced BART with Hugging Face's chinese-bart, but it keeps throwing the following error:

    Cannot set field:src_tokens as input, exception happens at the 0 value.
    Traceback (most recent call last):
      File "/workspace/BARTNER/train.py", line 135, in <module>
        data_bundle, tokenizer, mapping2id = get_data()
      File "/usr/local/lib/python3.6/dist-packages/fastNLP/core/utils.py", line 357, in wrapper
        results = func(*args, **kwargs)
      File "/workspace/BARTNER/train.py", line 127, in get_data
        data_bundle = pipe.process_from_file(paths, demo=demo)
      File "/workspace/BARTNER/data/pipe.py", line 218, in process_from_file
        data_bundle = self.process(data_bundle)
      File "/workspace/BARTNER/data/pipe.py", line 192, in process
        data_bundle.set_input('tgt_tokens', 'src_tokens', 'src_seq_len', 'tgt_seq_len', 'first')
      File "/usr/local/lib/python3.6/dist-packages/fastNLP/io/data_bundle.py", line 142, in set_input
        dataset.set_input(field_name, flag=flag, use_1st_ins_infer_dim_type=use_1st_ins_infer_dim_type)
      File "/usr/local/lib/python3.6/dist-packages/fastNLP/core/dataset.py", line 787, in set_input
        raise e
      File "/usr/local/lib/python3.6/dist-packages/fastNLP/core/dataset.py", line 784, in set_input
        self.field_arrays[name].is_input = flag
      File "/usr/local/lib/python3.6/dist-packages/fastNLP/core/field.py", line 371, in is_input
        self._check_dtype_and_ndim(only_check_1st_ins_dim_type=self._use_1st_ins_infer_dim_type)
      File "/usr/local/lib/python3.6/dist-packages/fastNLP/core/field.py", line 423, in _check_dtype_and_ndim
        raise e
      File "/usr/local/lib/python3.6/dist-packages/fastNLP/core/field.py", line 406, in _check_dtype_and_ndim
        type_0, dim_0 = _get_ele_type_and_dim(cell_0)
      File "/usr/local/lib/python3.6/dist-packages/fastNLP/core/field.py", line 56, in _get_ele_type_and_dim
        res = [_get_ele_type_and_dim(cell_i, dim) for cell_i in cell]
      File "/usr/local/lib/python3.6/dist-packages/fastNLP/core/field.py", line 56, in <listcomp>
        res = [_get_ele_type_and_dim(cell_i, dim) for cell_i in cell]
      File "/usr/local/lib/python3.6/dist-packages/fastNLP/core/field.py", line 84, in _get_ele_type_and_dim
        raise SetInputOrTargetException(f"Cannot process type:{type(cell)}.")
    fastNLP.core.field.SetInputOrTargetException: Cannot process type:<class 'NoneType'>.

    I'm new to NER; how can I solve this? Thanks!

    opened by liushz · 0 comments
  • Migrate to transformers 4.0.0 or above?

    Hello! I'm using this model architecture for NER on domain-specific tasks and it works pretty well! However, the old version of transformers is still a little troublesome.

    For example, to stay closer to BART's pretraining, I want to encode the whole sentence with the tokenizer directly, rather than splitting it on spaces and then using add_prefix_space=True. I managed this with the 'span' method, but the 'word' method needs extra work because of the old tokenizer version.

    Is there any plan to release a version for transformers 4.0.0 or above?

    opened by kizunasunhy · 0 comments
  • RuntimeError("The program has been running off.")

    In Epoch:16/Step:17296, got best dev performance: Seq2SeqSpanMetric: f=0.0, rec=0.0, pre=0.0, em=0.1836
    ....
      File "/root/work/BARTNER/model/callbacks.py", line 107, in on_valid_end
        raise RuntimeError("The program has been running off.")
    RuntimeError: The program has been running off.

    Does this mean the model stopped by itself because it failed to converge? What can I do now? Train it again? Thanks.

    opened by kizunasunhy · 1 comment
  • 'tgt_seq_len' is not defined

    Traceback (most recent call last):
      File "predictor.py", line 128, in <module>
        tgt_seq_len = (tgt_seq_len - 2).tolist()
    NameError: name 'tgt_seq_len' is not defined

    opened by vineeth143 · 1 comment