Source code for "Pack Together: Entity and Relation Extraction with Levitated Marker"

THUNLP

Last update: Dec 30, 2022

PL-Marker

Source code for Pack Together: Entity and Relation Extraction with Levitated Marker.

Quick links

Overview
Setup
Training Script
Quick Start
Use TypeMarker
Citation

Overview

In this work, we present a novel span representation approach, named Packed Levitated Markers, to consider the dependencies between the spans (pairs) by strategically packing the markers in the encoder. Our approach is evaluated on two typical span (pair) representation tasks:

Named Entity Recognition (NER): We adopt a group packing strategy that lets the model process a large number of spans together, so it can consider their dependencies with limited resources.
Relation Extraction (RE): We adopt a subject-oriented packing strategy that packs each subject and all its objects into one instance, to model the dependencies between span pairs that share the same subject.
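
The two packing strategies above can be illustrated with a minimal Python sketch. It only shows how spans might be grouped into instances; it is not the repository's implementation, and the function names (enumerate_spans, pack_groups, pack_by_subject), the inclusive (start, end) span convention, and the parameters are assumptions made for illustration. How the markers are placed in the encoder input is not modelled here.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # inclusive (start, end) token indices; an assumed convention


def enumerate_spans(num_tokens: int, max_span_len: int) -> List[Span]:
    """List every candidate span of at most max_span_len tokens, ordered by start then end."""
    return [
        (start, end)
        for start in range(num_tokens)
        for end in range(start, min(start + max_span_len, num_tokens))
    ]


def pack_groups(spans: List[Span], group_size: int) -> List[List[Span]]:
    """Group packing (NER): split the candidate spans into groups of at most group_size."""
    return [spans[i:i + group_size] for i in range(0, len(spans), group_size)]


def pack_by_subject(entities: List[Span]) -> List[Tuple[Span, List[Span]]]:
    """Subject-oriented packing (RE): one instance per subject with all other entities as objects."""
    return [
        (subj, [obj for obj in entities if obj != subj])
        for subj in entities
    ]


# Hypothetical toy input: a 4-token sentence and three already-detected entities.
spans = enumerate_spans(num_tokens=4, max_span_len=2)
groups = pack_groups(spans, group_size=3)
instances = pack_by_subject([(0, 0), (2, 3), (5, 5)])
```

Each group in groups would be encoded in one forward pass for NER, and each (subject, objects) pair in instances would form one RE instance.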

Please find more details of this work in our paper.

Setup

Install Dependencies

The code is based on Hugging Face's transformers.

Install dependencies and apex:

pip3 install -r requirement.txt
pip3 install --editable transformers

Download and preprocess the datasets

Our experiments are based on the following datasets. Please find the links and pre-processing below:

CoNLL03: We use the English part of CoNLL03.
OntoNotes: We use preprocess_ontonotes.py to preprocess OntoNotes 5.0.
Few-NERD: The dataset can be downloaded from their website.
ACE04/ACE05: We use the preprocessing code from the DyGIE repo. Please follow the instructions to preprocess the ACE05 and ACE04 datasets.
SciERC: The preprocessed SciERC dataset can be downloaded from their project website.
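
As a quick sanity check after preprocessing, the following sketch counts sentences, entities, and relations per document. It assumes the DyGIE-style layout, where each line of a file such as train.json is one JSON document with "sentences", "ner", and "relations" fields; that layout, the helper names, and the sample document are assumptions for illustration and should be checked against your own files.

```python
import json
from typing import Dict, Iterable, Iterator


def iter_documents(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield one parsed document per non-empty JSON line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_annotations(doc: Dict) -> Dict[str, int]:
    """Count sentences, entity mentions, and relations in one document."""
    return {
        "sentences": len(doc["sentences"]),
        "entities": sum(len(sent) for sent in doc.get("ner", [])),
        "relations": sum(len(sent) for sent in doc.get("relations", [])),
    }


# Hypothetical one-document sample in the assumed layout:
# ner entries are [start, end, label]; relations are [s_start, s_end, o_start, o_end, label].
sample = json.dumps({
    "doc_key": "example-doc",
    "sentences": [["PL-Marker", "extracts", "relations", "."]],
    "ner": [[[0, 0, "Method"], [2, 2, "Task"]]],
    "relations": [[[0, 0, 2, 2, "USED-FOR"]]],
})
stats = [count_annotations(doc) for doc in iter_documents([sample])]
```

To check a real file, pass an open file handle to iter_documents in place of the in-memory list.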

Pre-trained Models

We release our pre-trained NER models and RE models for ACE05 and SciERC datasets on Google Drive/Tsinghua Cloud.

Note: the performance of the pre-trained models might be slightly different from the reported numbers in the paper, since we reported the average numbers based on multiple runs.

Training Script

Train NER Models:

bash scripts/run_train_ner_PLMarker.sh
bash scripts/run_train_ner_BIO.sh
bash scripts/run_train_ner_TokenCat.sh

Train RE Models:

bash run_train_re.sh

Quick Start

The following commands can be used to run our pre-trained models on SciERC.

Evaluate the NER model:

CUDA_VISIBLE_DEVICES=0  python3  run_acener.py  --model_type bertspanmarker  \
    --model_name_or_path  ../bert_models/scibert-uncased  --do_lower_case  \
    --data_dir scierc  \
    --learning_rate 2e-5  --num_train_epochs 50  --per_gpu_train_batch_size  8  --per_gpu_eval_batch_size 16  --gradient_accumulation_steps 1  \
    --max_seq_length 512  --save_steps 2000  --max_pair_length 256  --max_mention_ori_length 8    \
    --do_eval  --evaluate_during_training   --eval_all_checkpoints  \
    --fp16  --seed 42  --onedropout  --lminit  \
    --train_file train.json --dev_file dev.json --test_file test.json  \
    --output_dir sciner_models/sciner-scibert  --overwrite_output_dir  --output_results

Evaluate the RE model:

CUDA_VISIBLE_DEVICES=0  python3  run_re.py  --model_type bertsub  \
    --model_name_or_path  ../bert_models/scibert-uncased  --do_lower_case  \
    --data_dir scierc  \
    --learning_rate 2e-5  --num_train_epochs 10  --per_gpu_train_batch_size  8  --per_gpu_eval_batch_size 16  --gradient_accumulation_steps 1  \
    --max_seq_length 256  --max_pair_length 16  --save_steps 2500  \
    --do_eval  --evaluate_during_training   --eval_all_checkpoints  --eval_logsoftmax  \
    --fp16  --lminit   \
    --test_file sciner_models/sciner-scibert/ent_pred_test.json  \
    --use_ner_results \
    --output_dir scire_models/scire-scibert

Here, --use_ner_results denotes using the original entity type predicted by NER models.

TypeMarker

If we use the flag --use_typemarker for the RE models, the results will be:

Model	Ent	Rel	Rel+
ACE05-UnTypeMarker (in paper)	89.7	68.8	66.3
ACE05-TypeMarker	89.7	67.5	65.2
SciERC-UnTypeMarker (in paper)	69.9	52.0	40.6
SciERC-TypeMarker	69.9	52.5	40.9

Since TypeMarker improves performance on SciERC but lowers it on ACE05, we did not use it in the paper.
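
The size of that trade-off can be read off the table with a few lines of Python. The numbers are copied from the table above; the dictionary layout and the function name typemarker_delta are illustrative choices only.

```python
# Scores copied from the table above, in the order (Ent, Rel, Rel+).
scores = {
    "ACE05": {"untyped": (89.7, 68.8, 66.3), "typed": (89.7, 67.5, 65.2)},
    "SciERC": {"untyped": (69.9, 52.0, 40.6), "typed": (69.9, 52.5, 40.9)},
}
metrics = ("Ent", "Rel", "Rel+")


def typemarker_delta(dataset: str) -> dict:
    """Return TypeMarker minus UnTypeMarker for each metric, rounded to one decimal."""
    rows = scores[dataset]
    return {
        name: round(typed - untyped, 1)
        for name, untyped, typed in zip(metrics, rows["untyped"], rows["typed"])
    }


ace05_delta = typemarker_delta("ACE05")    # Rel and Rel+ drop by 1.3 and 1.1 points
scierc_delta = typemarker_delta("SciERC")  # Rel and Rel+ rise by 0.5 and 0.3 points
```

The entity scores are identical in both settings because the flag only affects the RE models.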

Citation

If you use our code in your research, please cite our work:

@article{ye2021plmarker,
  author    = {Deming Ye and Yankai Lin and Maosong Sun},
  title     = {Pack Together: Entity and Relation Extraction with Levitated Marker},
  journal   = {arXiv Preprint},
  year      = {2021}
}

Comments

512 and 1024?

As far as I know, BERT limits position embeddings to 512. However, when I looked at the code, I found that the position ids, input ids, etc. have size 1024. I am quite confused by this. Could you explain the difference between the two?

opened by Jay0412 11
Question about the Quick Start

Hello, I was curious: in the Quick Start section, what does "--max_mention_ori_length: 8" mean? If I run on a different dataset, should I change it based on my data size? Thanks.

opened by Zephyr1022 10
Modeling_bert.py

In Modeling_bert.py, in BertForACEBothOneDropoutSub, why doesn't the NER classifier concatenate m1_states, while BertForSpanMarkerNER concatenates them to build the feature vector? Could you explain e1, e2, and m1 in more detail? From the code, it looks like train_re.sh can train NER and RE together with the right options. Is that possible, and if so, which exact options do I need? Also, is the subject-oriented packing strategy used only in evaluation?

opened by Jay0412 8

Trouble running "Quick Start"-scripts

Hi! Firstly, thanks for publishing your research and models! :)

I have trouble evaluating the NER model with the given command CUDA_VISIBLE_DEVICES=0 python3 run_acener.py --model_type bertspanmarker ... The output is a json-file with only one line: {"dev_best_f1": 0}

The last 3 lines of the log-output are:

02/04/2022 17:37:16 - INFO - __main__ -   Training/evaluation parameters Namespace(adam_epsilon=1e-08, alpha=1, cache_dir='', config_name='', data_dir='../scierc/raw_data', dev_file='dev.json', device=device(type='cuda'), do_eval=True, do_lower_case=True, do_test=False, do_train=False, eval_all_checkpoints=True, evaluate_during_training=True, fp16=True, fp16_opt_level='O1', gradient_accumulation_steps=1, group_axis=-1, group_edge=False, group_sort=False, learning_rate=2e-05, lminit=True, local_rank=-1, logging_steps=5, max_grad_norm=1.0, max_mention_ori_length=8, max_pair_length=256, max_seq_length=512, max_steps=-1, model_name_or_path='../bert_models/scibert_scivocab_uncased', model_type='bertspanmarker', n_gpu=1, no_cuda=False, no_test=False, norm_emb=False, num_train_epochs=50.0, onedropout=True, output_dir='../sciner_models/sciner-scibert', output_results=True, overwrite_cache=False, overwrite_output_dir=True, per_gpu_eval_batch_size=16, per_gpu_train_batch_size=8, save_steps=2000, save_total_limit=1, seed=42, server_ip='', server_port='', shuffle=False, test_file='test.json', tokenizer_name='', train_file='train.json', use_full_layer=-1, warmup_steps=-1, weight_decay=0.0)
02/04/2022 17:37:16 - INFO - __main__ -   Evaluate on test set
02/04/2022 17:37:16 - INFO - __main__ -   Evaluate the following checkpoints: []

As you can see in the first line, I changed the original command in the following way:

--model_name_or_path ../bert_models/scibert_scivocab_uncased I couldn't find a folder "scibert-uncased", so I downloaded the 4th model from huggingface as described in the "Training Script"-section (AllenAI) - is this maybe the wrong model?
--data_dir ../scierc/raw_data I downloaded the SciERC raw_data from their website to execute the evaluation on - is this the wrong dataset?

opened by Clemens123 8

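A few questions about the code
Hello, I ran into a few questions while reading the code. Could you help clarify them?

What do [30002], [30003], [3], and [4] in the following code represent, and what are they for? https://github.com/thunlp/PL-Marker/blob/91b03f3ff58ad29fd7b9920a954a02d12756f05d/run_re.py#L373-L378

Why does the following code add a named-entity entry (10000, 10000, 'NIL')? Also, when entities are paired up into candidate relation pairs, sub can be (10000, 10000, 'NIL') but obj cannot. Why is that? https://github.com/thunlp/PL-Marker/blob/91b03f3ff58ad29fd7b9920a954a02d12756f05d/run_re.py#L283-L284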
opened by lairunlin 7
How to prepare dataset for training the model?
Hi, thanks for sharing this awesome work. I have a few questions; please help me understand:

I have a set of text paragraphs and want to extract entities and the relations between the detected entities. How would I prepare my dataset for the NER and Relation Extraction models on these paragraphs? What format should I follow?

If you could recommend any tool or any way to prepare or annotate the data in the format the model expects, it would be a great help.

Thanks.
opened by karndeepsingh 7
Question about `run_ner.py`

Lines 221 to 238 of run_ner.py mainly serve to obtain target_tokens. Could you explain the logic there? Why is it handled this way? What do half_context_length, left_context_length, and right_context_length mean? Thank you very much. https://github.com/thunlp/PL-Marker/blob/b4863d47e2197b8d410e3d693d684707df9df2a1/run_ner.py#L221-L238

opened by lairunlin 7
f1_with_ner2

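Hello, I modified your code to implement a fully levitated marker approach. The results show that f1 reaches the expected level, but f1_with_ner and ner_f1 are very poor, and ner_f1 gets worse as training goes on. I have searched for a long time without finding the problem. As far as I can tell, I also train with the golden dev file, so why does the NER F1 keep dropping while the corresponding ner_f1 in your code stays at 1.0? It would be great if you could help me find where my code goes wrong, and I would also be grateful if you could tell me where the problem might be. Below are some screenshots of the training; my modified code and run script are attached at the end. Thank you.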

run_train_re_approx.zip

opened by WangSheng21s 6
f1_with_ner

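First of all, thank you for the excellent work. I have a small question:

Why can f1_with_ner reach 1.0 on the dev set when running the relation extraction task? Is the corresponding golden NER being used? If so, could you point out where the relevant code is in run_re.py? It looks to me as if the model's predictions are used, but in that case it should not be possible to reach 1.0.

Thanks.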

opened by WangSheng21s 6
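Processing of the CoNLL03 dataset

Hello, why is the CoNLL03 dataset converted to the I-label form? Does that make the dataset's labelmap= {'O':0,'I-label':num}, so that 'B-label' no longer exists? Yet the label_map defined in the code does include B-label. Also, in classification the model gives target-label=9. So why does the dataset replace B-label with I-label?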

opened by Hou-jing 6
Error when training ner_PLMarker with albert-xxlarge-v1 and apex

Hello, when we train the ner_PLMarker model on the ACE05 data, training works with bert-base-uncased plus the fp16 flag, and with albert-xxlarge-v1 without fp16. With albert-xxlarge-v1 plus fp16, however, it fails. The error occurs at mixed_query_layer = self.query(input_ids) in AlbertAttention, where amp cached_cast raises IndexError: tuple index out of range. Have you run into this problem?

opened by yanzhh 6
