KnowPrompt

Code and datasets for our paper "KnowPrompt: Knowledge-aware Prompt-tuning with Synergistic Optimization for Relation Extraction"

Requirements

To install requirements:

pip install -r requirements.txt

Datasets

We provide all the datasets and prompts used in our experiments.

The expected structure of files is:

knowprompt
 |-- dataset
 |    |-- semeval
 |    |    |-- train.txt       
 |    |    |-- dev.txt
 |    |    |-- test.txt
 |    |    |-- temp.txt
 |    |    |-- rel2id.json
 |    |-- dialogue
 |    |    |-- train.json       
 |    |    |-- dev.json
 |    |    |-- test.json
 |    |    |-- rel2id.json
 |    |-- tacred
 |    |    |-- train.txt       
 |    |    |-- dev.txt
 |    |    |-- test.txt
 |    |    |-- temp.txt
 |    |    |-- rel2id.json
 |    |-- tacrev
 |    |    |-- train.txt       
 |    |    |-- dev.txt
 |    |    |-- test.txt
 |    |    |-- temp.txt
 |    |    |-- rel2id.json
 |    |-- retacred
 |    |    |-- train.txt       
 |    |    |-- dev.txt
 |    |    |-- test.txt
 |    |    |-- temp.txt
 |    |    |-- rel2id.json
 |-- scripts
 |    |-- semeval.sh
 |    |-- dialogue.sh
 |    |-- ...
 

Run the experiments

Initialize the answer words

Use the command below to get the answer words to use during training.

python get_label_word.py --model_name_or_path bert-large-uncased  --dataset_name semeval

The {answer_words}.pt file will be saved in the dataset directory; you need to set model_name_or_path and dataset_name when running get_label_word.py.
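
As a rough reference, the sketch below illustrates the kind of mapping this step produces: candidate answer words derived from each relation label, tokenized and saved as a .pt file. It is only an assumption about what get_label_word.py does, not the repository's exact code, and the output filename is illustrative.

# Illustrative sketch only; the authoritative logic lives in get_label_word.py.
import json
import re
import torch
from transformers import AutoTokenizer

def build_label_words(model_name_or_path, dataset_name):
    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
    with open(f"dataset/{dataset_name}/rel2id.json") as f:
        rel2id = json.load(f)
    label_words = {}
    for relation in rel2id:
        # Break a label such as "per:country_of_birth" into plain words.
        words = [w for w in re.split(r"[^A-Za-z]+", relation) if w]
        # Map each word to the ids of its sub-tokens in the PLM vocabulary.
        label_words[relation] = [tokenizer(w, add_special_tokens=False)["input_ids"] for w in words]
    torch.save(label_words, f"dataset/{dataset_name}/label_words_{model_name_or_path}.pt")

build_label_words("bert-large-uncased", "semeval")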

Split dataset

Download the data first and put it in the dataset folder. Then run the command below to obtain the few-shot dataset.

python generate_k_shot.py --data_dir ./dataset --k 8 --dataset semeval
cd dataset
cd semeval
cp rel2id.json val.txt test.txt ./k-shot/8-1

You need to modify k and dataset to choose the shot count and the dataset. By default we use seeds 1, 2, 3, 4, 5 to generate each k-shot split; you can change this in generate_k_shot.py.
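
For orientation, a minimal sketch of how such a per-relation k-shot split can be drawn is shown below. generate_k_shot.py is the authoritative implementation; this version assumes one JSON object per line in train.txt with a "relation" field, which is an assumption about the data format.

# Illustrative sketch only; see generate_k_shot.py for the real splitting logic.
import json
import os
import random

def sample_k_shot(data_dir, dataset, k=8, seed=1):
    random.seed(seed)
    with open(os.path.join(data_dir, dataset, "train.txt")) as f:
        examples = [json.loads(line) for line in f if line.strip()]
    # Group the training examples by relation label.
    by_relation = {}
    for ex in examples:
        by_relation.setdefault(ex["relation"], []).append(ex)
    # Write k examples per relation into dataset/<name>/k-shot/<k>-<seed>/train.txt.
    out_dir = os.path.join(data_dir, dataset, "k-shot", f"{k}-{seed}")
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, "train.txt"), "w") as f:
        for rel, exs in by_relation.items():
            for ex in random.sample(exs, min(k, len(exs))):
                f.write(json.dumps(ex, ensure_ascii=False) + "\n")

sample_k_shot("./dataset", "semeval", k=8, seed=1)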

Let's run

Our scripts automatically run the experiments in the 8-shot, 16-shot, 32-shot, and standard supervised settings, covering training, evaluation, and testing. We use random seed 1 as an example in our code; you can of course run multiple experiments with different seeds, as sketched below.
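
If you want to sweep several seeds, one simple option is to invoke the script repeatedly, as in the sketch below. It assumes your copy of scripts/semeval.sh reads a SEED environment variable and forwards it to main.py; the released scripts may need a small edit for this, so treat it as an illustration rather than a supported interface.

# Illustrative sketch: run the SemEval script once per random seed.
import os
import subprocess

for seed in (1, 2, 3, 4, 5):
    env = dict(os.environ, SEED=str(seed))  # assumed to be read inside the script
    subprocess.run(["bash", "scripts/semeval.sh"], env=env, check=True)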

Example for SEMEVAL

Train the KnowPrompt model on SEMEVAL with the following command:

>> bash scripts/semeval.sh  # for roberta-large

Scripts for TACRED-Revisit, Re-TACRED, and Wiki80 from our paper are also provided; you just need to run them like the example above.

Example for DialogRE

As the data format of DialogRE is very different from the other datasets, its processor class is also different. Train the KnowPrompt model on DialogRE with the following command:

>> bash scripts/dialogue.sh  # for roberta-base
Comments
  • Runtime error

    Runtime error

    When running >> bash scripts/retacred.sh I get the following error:

    Traceback (most recent call last):
      File "main.py", line 244, in <module>
        main()
      File "main.py", line 128, in main
        parser = _setup_parser()
      File "main.py", line 55, in _setup_parser
        litmodel_class = _import_class(f"lit_models.{temp_args.litmodel_class}")
      File "main.py", line 27, in _import_class
        class_ = getattr(module, class_name)
    AttributeError: module 'lit_models' has no attribute 'TransformerLitModel'
    scripts/retacred.sh: line 2: --model_name_or_path: command not found
    scripts/retacred.sh: line 3: --accumulate_grad_batches: command not found
    scripts/retacred.sh: line 4: --batch_size: command not found
    scripts/retacred.sh: line 5: --data_dir: command not found
    scripts/retacred.sh: line 6: --check_val_every_n_epoch: command not found
    scripts/retacred.sh: line 7: --data_class: command not found
    scripts/retacred.sh: line 8: --max_seq_length: command not found
    scripts/retacred.sh: line 9: --model_class: command not found
    scripts/retacred.sh: line 10: --t_lambda: command not found
    scripts/retacred.sh: line 11: --wandb: command not found
    scripts/retacred.sh: line 12: --litmodel_class: command not found
    scripts/retacred.sh: line 13: --task_name: command not found
    scripts/retacred.sh: line 14: --lr: command not found

    What is the cause of this?

    opened by rrxsir 13
  • An academic issue on "How to estimate the entity type distributions when the relation class is not known"

    An academic issue on "How to estimate the entity type distributions when the relation class is not known"

    According to your paper, you estimate the prior distributions over the candidate sets C_sub and C_obj of potential entity types for a given relation class, where the prior distributions are estimated by frequency statistics. But how do you estimate the prior distributions when the relation of an instance is unknown? It feels like "the chicken or the egg".

    For example, the relation "per:country_of_birth" indicates that the subject entity is a "person" and the object entity is a "country". The prior distribution for C_sub can then be counted as {"person": 1}, but we would need to know in advance that this instance expresses the relation "per:country_of_birth" before we can estimate the prior distributions of the candidate set.

    question 
    opened by typhoonlee 8
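
    For what it is worth, the frequency statistics referred to in the question above can only be computed where the gold relation of each instance is known, i.e. on the labelled training set. The sketch below shows such frequency counting; the field names subj_type and obj_type are assumptions about the annotation format, not the repository's exact schema.

    # Illustrative sketch: estimate prior distributions over subject/object entity
    # types for each relation by counting co-occurrences in labelled training data.
    from collections import Counter, defaultdict

    def estimate_type_priors(examples):
        subj_counts = defaultdict(Counter)
        obj_counts = defaultdict(Counter)
        for ex in examples:
            subj_counts[ex["relation"]][ex["subj_type"]] += 1
            obj_counts[ex["relation"]][ex["obj_type"]] += 1
        def normalize(counts):
            return {rel: {t: n / sum(c.values()) for t, n in c.items()}
                    for rel, c in counts.items()}
        return normalize(subj_counts), normalize(obj_counts)

    train = [
        {"relation": "per:country_of_birth", "subj_type": "person", "obj_type": "country"},
        {"relation": "per:country_of_birth", "subj_type": "person", "obj_type": "country"},
    ]
    phi_sub, phi_obj = estimate_type_priors(train)
    print(phi_sub["per:country_of_birth"])  # {'person': 1.0}
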
  • How to initialize the learnable relation embedding?

    How to initialize the learnable relation embedding?

    Some questions that came up while reading this excellent paper:

    • The Relation Knowledge Injection subsection says that the semantic knowledge of a relation should be injected into its initial embedding. Is this done by weighting the word embeddings by the word frequencies within that relation? For example, for the relation y = per:countries_of_residence, the candidate word set is {"person", "country", "residence"} with probability distribution {1/3, 1/3, 1/3}, so is the relation y initialized as y_initialized_embedding = 1/3 person_embedding + 1/3 country_embedding + 1/3 residence_embedding?
    • The same subsection assumes that an implicit virtual answer word exists in the vocabulary of a given PLM to represent the relation label (e.g. y in question 1). How is the embedding of this virtual word computed, and how does it relate to the initialization of relation y in question 1?
    • In Figure 2 (b), is the output at the [MASK] position the probability distribution of [MASK] over the virtual answer words V'? What does multiplying this result with the output of the relation embedding head represent? (Does the relation embedding head correspond to the initialized relations from question 1?)

    Looking forward to the authors' reply. Thank you very much!

    opened by chenhaishun 5
  • Missing weighted average function for virtual answer word

    Missing weighted average function for virtual answer word

    Hi. Thanks for your great work.

    The paper mentions a weighted average function on page 4, indicating that the embeddings of virtual words should be initialized according to the probability distribution. However, your code performs only a mean operation. Is that a bug, or does it make only a negligible difference that we can ignore?

    Moreover, I am a little confused about the probability distribution. Is it still based on the prior distributions discussed in Entity Knowledge Injection in Section 4.1?

    Thanks in advance for your patience.

    https://github.com/zjunlp/KnowPrompt/blob/8734c20b0e6b771a747013d3399afec02365f39f/lit_models/transformer.py#L167

    opened by MrZilinXiao 4
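
    To make the distinction discussed above concrete, the sketch below contrasts a plain mean with a probability-weighted average when initializing a virtual answer-word embedding. It is only an illustration under assumed shapes and names, not the repository's code.

    # Illustrative sketch: initialise a virtual answer-word embedding from its
    # candidate word embeddings, either uniformly or with given probabilities.
    import torch

    def init_virtual_word(word_embeddings, candidate_ids, probs=None):
        vecs = word_embeddings[candidate_ids]        # (num_candidates, hidden)
        if probs is None:
            return vecs.mean(dim=0)                  # simple mean
        weights = torch.tensor(probs).unsqueeze(-1)  # (num_candidates, 1)
        return (weights * vecs).sum(dim=0)           # probability-weighted average

    embedding_table = torch.randn(30522, 1024)       # toy stand-in for a PLM embedding table
    mean_init = init_virtual_word(embedding_table, [2711, 2406])
    weighted_init = init_virtual_word(embedding_table, [2711, 2406], probs=[0.7, 0.3])
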
  • Question about the initialization of the virtual type word embeddings

    Question about the initialization of the virtual type word embeddings

    Hello, I would like to ask a few questions:

    1. Inconsistency between the paper and the code in initializing the virtual type words: in the paper, [sub] and [obj] are initialized by multiplying the embeddings with the probabilities 𝜙_sub and 𝜙_obj, but lines 174-175 of the latest transformer.py seem to simply take the mean, which amounts to giving every entity type the same 𝜙_sub and 𝜙_obj. Am I misunderstanding the paper or the code?
    2. The two_stage setting: to enable two-stage training in the code, do I need to set the --two_steps argument to True? During reproduction, after setting --two_steps to True the test-set result dropped from around 70 to around 20, which puzzles me.

    Thank you!

    opened by jyf123 3
  • Reproducing the results

    Reproducing the results

    While reproducing the results with this project, specifically the 8-shot result on the SemEval dataset with roberta-large, I get Eval/best_f1 = 0.149, which is far below the number reported in the paper. In addition, at line 214 of main.py (if not args.two_steps: trainer.test()) I get the message "No 'test_dataloader()' method defined to run 'Trainer.test'". Is this because the released project is incomplete?

    opened by ningpang 2
  • Is the virtual type word in the paper the same for every relation?

    Is the virtual type word in the paper the same for every relation?

    Looking at the code, it seems that [sub] and [obj] are just single tokens, i.e. the type word embedding is the same for every relation. Why, then, do the words around [sub] and [obj] differ across sentences in Table 6 of the paper? Does this type word embedding change at inference time? (My understanding of the pipeline: after training produces an embedding for each relation plus the [sub] and [obj] embeddings, at inference time [sub], [obj], and [MASK] are inserted according to the template, [MASK] is predicted, and its similarity with the relation embeddings is computed.)

    opened by Facico 2
  • Some questions about the KE loss, i.e. the "Implicit Structured Constraints" in the paper

    Some questions about the KE loss, i.e. the "Implicit Structured Constraints" in the paper

    Hello! First of all, my respect for your excellent work and for releasing the code!

    Background: while re-running the code in this project I developed some doubts about the effectiveness of the KE loss. The paper states "In addition, there exists rich semantic knowledge among relation labels and structural knowledge implications among relational triples, which cannot be ignored.", and I take ke_loss to be this structural knowledge, but directly using $(s + r - o)$ feels a bit crude. Moreover, judging from the experiment logs, ke_loss is never actually optimized during training.

    https://github.com/zjunlp/KnowPrompt/blob/9159e4bf4f1ae4986fddcf5c803bf4f953ee3e9b/lit_models/transformer.py#L292-L293

    (Incidentally, the logging of this part in your released code is wrong: both statements log loss.)

    https://github.com/zjunlp/KnowPrompt/blob/9159e4bf4f1ae4986fddcf5c803bf4f953ee3e9b/lit_models/transformer.py#L196-L197

    After fixing the logging, ke_loss stays at around 20 throughout training. This supports my suspicion above: since $s$, $r$, and $o$ come directly from the model's output, without any additional fully connected re-mapping, directly taking the L2 norm of the distance may be hard to train.

    In short, my questions are as follows; I would be very grateful if you could find time to answer some of them. @njcx-ai (I assume you are one of the authors?)

    1. What were the considerations behind the design of the KE loss, and how did it behave in your experiments?
    2. Negative samples are drawn over the full max_token_length, so they necessarily include tokens from the prompt and from the padding after the sentence. Was this taken into account?
    3. Why does the negative-sample computation use real_relation_embedding rather than the model's output?

    https://github.com/zjunlp/KnowPrompt/blob/9159e4bf4f1ae4986fddcf5c803bf4f953ee3e9b/lit_models/transformer.py#L262-L297

    opened by Xerrors 2
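
    For context on the discussion above, the structural constraint in question is a TransE-style objective that encourages s + r ≈ o for the embeddings at the [sub], [MASK], and [obj] positions. The sketch below shows one common margin-based formulation of such a loss; the shapes, the margin value, and the negative-sampling strategy are assumptions, not the repository's exact implementation.

    # Illustrative sketch of a TransE-style structural loss with one negative sample.
    import torch
    import torch.nn.functional as F

    def ke_loss(s, r, o, o_neg, margin=9.0):
        # s, r, o, o_neg: (batch, hidden) embeddings of subject, relation,
        # object, and a corrupted object.
        pos = torch.norm(s + r - o, p=2, dim=-1)
        neg = torch.norm(s + r - o_neg, p=2, dim=-1)
        # Sigmoid formulation; a hinge loss max(0, margin + pos - neg) is another option.
        return (-F.logsigmoid(margin - pos) - F.logsigmoid(neg - margin)).mean()

    batch, hidden = 4, 1024
    loss = ke_loss(torch.randn(batch, hidden), torch.randn(batch, hidden),
                   torch.randn(batch, hidden), torch.randn(batch, hidden))
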
  • ValueError: too many values to unpack (expected 4)

    ValueError: too many values to unpack (expected 4)

    Hi! I am using my own annotated dataset, formatted like this: {'token': ['地', '面', '状', '况', '不', '良', '导', '致', '位', '置', '偏', '移', '。'], 'h': {'name': '位置偏移', 'pos': [8, 12]}, 't': {'name': '地面状况不良', 'pos': [0, 6]}, 'relation': '因果关系'} (I'm not sure whether this format is correct.) After running, I get the following error:

    File "D:\re\knowprompt2\KnowPrompt\lit_models\transformer.py", line 210, in validation_step
        input_ids, attention_mask, labels, _ = batch
    ValueError: too many values to unpack (expected 4)

    The code has not been modified. I don't know whether this is a problem with my data format or something else. Looking forward to your reply, thank you!

    opened by xiaohou1112 1
  • experiment result

    experiment result

    Following your instructions, I ran the experiment on the SemEval dataset via bash scripts/semeval.sh, but the loss does not drop during training. What is going wrong? I did not change anything. @njcx-ai

    help wanted 
    opened by cpmss521 1
  • About the calculation of \phi(r)

    About the calculation of \phi(r)

    Hi, congratulations on your great work! I have a question after reading the paper (though I have not read the code yet): how do you calculate the value of \phi(r)? It does not seem to be explained in the paper. Thank you.

    opened by Frankie123421 1