Code for our ACL 2021 paper "One2Set: Generating Diverse Keyphrases as a Set"

Overview

One2Set

This repository contains the code for our ACL 2021 paper “One2Set: Generating Diverse Keyphrases as a Set”.

Our implementation is built on the source code from keyphrase-generation-rl and fastNLP. Thanks for their work.

If you use this code, please cite our paper:

@inproceedings{ye2021one2set,
  title={One2Set: Generating Diverse Keyphrases as a Set},
  author={Ye, Jiacheng and Gui, Tao and Luo, Yichao and Xu, Yige and Zhang, Qi},
  booktitle={Proceedings of ACL},
  year={2021}
}

Dependency

  • python 3.5+
  • pytorch 1.0+

Dataset

The datasets can be downloaded from here, which are the tokenized version of the datasets provided by Ken Chen:

  • The testsets directory contains the five datasets for testing (i.e., inspec, krapivin, nus, and semeval and kp20k), where each of the datasets contains test_src.txt and test_trg.txt.
  • The kp20k_separated directory contains the training and validation files (i.e., train_src.txt, train_trg.txt, valid_src.txt and valid_trg.txt).
  • Each line of the *_src.txt file is the source document, which contains the tokenized words of title <eos> abstract .
  • Each line of the *_trg.txt file contains the target keyphrases separated by an ; character. The <peos> is used to mark the end of present ground-truth keyphrases and train a separate set loss for SetTrans model. For example, each line can be like present keyphrase one;present keyphrase two;<peos>;absent keyprhase one;absent keyphrase two.

Quick Start

The whole process includes the following steps:

  • Preprocessing: The preprocess.py script numericalizes the train_src.txt, train_trg.txt,valid_src.txt and valid_trg.txt files, and produces train.one2many.pt, valid.one2many.pt and vocab.pt.
  • Training: The train.py script loads the train.one2many.pt, valid.one2many.pt and vocab.pt file and performs training. We evaluate the model every 8000 batches on the valid set, and the model will be saved if the valid loss is lower than the previous one.
  • Decoding: The predict.py script loads the trained model and performs decoding on the five test datasets. The prediction file will be saved, which is like predicted keyphrase one;predicted keyphrase two;…. For SetTrans, we ignore the $\varnothing$ predictions that represent the meaning of “no corresponding keyphrase”.
  • Evaluation: The evaluate_prediction.py script loads the ground-truth and predicted keyphrases, and calculates the $F_1@5$ and $F_1@M$ metrics.

For the sake of simplicity, we provide an one-click script in the script directory. You can run the following command to run the whole process with SetTrans model under One2Set paradigm:

bash scripts/run_one2set.sh

You can also run the baseline Transformer model under One2Seq paradigm with the following command:

bash scripts/run_one2seq.sh

Note:

  • Please download and unzip the datasets in the ./data directory first.
  • To run all the bash files smoothly, you may need to specify the correct home_dir (i.e., the absolute path to kg_one2set dictionary) and the gpu id for CUDA_VISIBLE_DEVICES. We provide a small amount of data to quickly test whether your running environment is correct. You can test by running the following command:
bash scripts/run_small_one2set.sh

Resources

You can download our trained model here. We also provide raw predictions and corresponding evaluation results of three runs with different random seeds here, which contains the following files:

test
├── Full_One2set_Copy_Seed27_Dropout0.1_LR0.0001_BS12_MaxLen6_MaxNum20_LossScalePre0.2_LossScaleAb0.1_Step2_SetLoss
│   ├── inspec
│   │   ├── predictions.txt
│   │   └── results_log_5_M_5_M_5_M.txt
│   ├── kp20k
│   │   ├── predictions.txt
│   │   └── results_log_5_M_5_M_5_M.txt
│   ├── krapivin
│   │   ├── predictions.txt
│   │   └── results_log_5_M_5_M_5_M.txt
│   ├── nus
│   │   ├── predictions.txt
│   │   └── results_log_5_M_5_M_5_M.txt
│   └── semeval
│       ├── predictions.txt
│       └── results_log_5_M_5_M_5_M.txt
├── Full_One2set_Copy_Seed527_Dropout0.1_LR0.0001_BS12_MaxLen6_MaxNum20_LossScalePre0.2_LossScaleAb0.1_Step2_SetLoss
│   ├── ...
└── Full_One2set_Copy_Seed9527_Dropout0.1_LR0.0001_BS12_MaxLen6_MaxNum20_LossScalePre0.2_LossScaleAb0.1_Step2_SetLoss
    ├── ...
You might also like...
Code for the paper "Balancing Training for Multilingual Neural Machine Translation, ACL 2020"

Balancing Training for Multilingual Neural Machine Translation Implementation of the paper Balancing Training for Multilingual Neural Machine Translat

PyTorch implementation of our Adam-NSCL algorithm from our CVPR2021 (oral) paper "Training Networks in Null Space for Continual Learning"

Adam-NSCL This is a PyTorch implementation of Adam-NSCL algorithm for continual learning from our CVPR2021 (oral) paper: Title: Training Networks in N

[NAACL & ACL 2021] SapBERT: Self-alignment pretraining for BERT.
[NAACL & ACL 2021] SapBERT: Self-alignment pretraining for BERT.

SapBERT: Self-alignment pretraining for BERT This repo holds code for the SapBERT model presented in our NAACL 2021 paper: Self-Alignment Pretraining

PIGLeT: Language Grounding Through Neuro-Symbolic Interaction in a 3D World [ACL 2021]

piglet PIGLeT: Language Grounding Through Neuro-Symbolic Interaction in a 3D World [ACL 2021] This repo contains code and data for PIGLeT. If you like

The official implementation for ACL 2021 "Challenges in Information Seeking QA: Unanswerable Questions and Paragraph Retrieval".

Code for "Challenges in Information Seeking QA: Unanswerable Questions and Paragraph Retrieval" (ACL 2021, Long) This is the repository for baseline m

LV-BERT: Exploiting Layer Variety for BERT (Findings of ACL 2021)

LV-BERT Introduction In this repo, we introduce LV-BERT by exploiting layer variety for BERT. For detailed description and experimental results, pleas

Official PyTorch Implementation of SSMix (Findings of ACL 2021)
Official PyTorch Implementation of SSMix (Findings of ACL 2021)

SSMix: Saliency-based Span Mixup for Text Classification (Findings of ACL 2021) Official PyTorch Implementation of SSMix | Paper Abstract Data augment

NeuralWOZ: Learning to Collect Task-Oriented Dialogue via Model-based Simulation (ACL-IJCNLP 2021)
NeuralWOZ: Learning to Collect Task-Oriented Dialogue via Model-based Simulation (ACL-IJCNLP 2021)

NeuralWOZ This code is official implementation of "NeuralWOZ: Learning to Collect Task-Oriented Dialogue via Model-based Simulation". Sungdong Kim, Mi

[ACL-IJCNLP 2021] Improving Named Entity Recognition by External Context Retrieving and Cooperative Learning

CLNER The code is for our ACL-IJCNLP 2021 paper: Improving Named Entity Recognition by External Context Retrieving and Cooperative Learning CLNER is a

Comments
  • a bug about [dec_layers]

    a bug about [dec_layers]

    大佬,您好,我是小白 代码通过-dec_layers参数来制定模块的解码器的层数,但是,实际上没有作用。 因为Decoder代码文件的

    self.input_fc = nn.Linear(self.embed.embedding_dim, d_model)
    self.layer_stacks = nn.ModuleList([TransformerSeq2SeqDecoderLayer(d_model, n_head, dim_ff, dropout, layer_idx,  fix_kp_num_len, max_kp_num)  for layer_idx in range(6)])
    self.embed_scale = math.sqrt(d_model)
    

    里面直接给定了参数 ( for layer_idx in range(6)])),所以使得,至少出现以下bugs:

    1. state【初始化采用def init_state(self, encoder_output, encoder_mask)函数】严格按照num_layers,即-dec_layers,但是实际层数为6,当dec_layers小于6时,会提示state访问越界 File "d:\code\【study-see】\【源码】kg_one2set-master\pykp\modules\multi_head_attn.py", line 57, in forward prev_k = state.decoder_prev_key[self.layer_idx] IndexError: list index out of range
    2. 当-dec_layers不等于6时,参数不起作用。

    另外,对于decoder代码里面出现num_layers=opt.enc_layers,我觉得不妥,您可以试着提取一个基类,使得decoder,encoder继承于他。

    【这个项目,包括您的论文,我收益非常大,很感谢您的工作,我还在继续拜读您的论文和代码,十分感谢】

    opened by yuanyihan 2
  • RuntimeError: CUDA error: device-side assert triggered

    RuntimeError: CUDA error: device-side assert triggered

    Hi,

    Thanks for the nice repo.

    I am facing the following error while training the model with kp20k dataset. FYI, I am training with batch_size=2.

    08/30/2021 23:41:03 [INFO] train_ml: Epoch 1; batch: 90000; total batch: 90000,avg training ppl: 5.333, loss: 1.674                                                              
    08/30/2021 23:43:40 [INFO] train_ml: Epoch 1; batch: 91000; total batch: 91000,avg training ppl: 5.328, loss: 1.673                                                              
    08/30/2021 23:46:18 [INFO] train_ml: Epoch 1; batch: 92000; total batch: 92000,avg training ppl: 5.322, loss: 1.672                                                              
    /pytorch/aten/src/ATen/native/cuda/Indexing.cu:662: indexSelectLargeIndex: block: [148,0,0], thread: [96,0,0] Assertion `srcIndex < srcSelectDimSize` failed.                     
    /pytorch/aten/src/ATen/native/cuda/Indexing.cu:662: indexSelectLargeIndex: block: [148,0,0], thread: [97,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
    ...
    /pytorch/aten/src/ATen/native/cuda/Indexing.cu:662: indexSelectLargeIndex: block: [130,0,0], thread: [30,0,0] Assertion `srcIndex < srcSelectDimSize` failed.                     
    /pytorch/aten/src/ATen/native/cuda/Indexing.cu:662: indexSelectLargeIndex: block: [130,0,0], thread: [31,0,0] Assertion `srcIndex < srcSelectDimSize` failed.                     
    Traceback (most recent call last):                                                                                                                                                
      File "train.py", line 103, in <module>                                                                                                                                          
        main(opt)                                                                                                                                                                     
      File "train.py", line 85, in main                                                                                                                                               
        train_ml.train_model(model, optimizer, train_data_loader, valid_data_loader, opt)                                                                                             
      File "/home/ubuntu/kg_one2set/train_ml.py", line 44, in train_model
        batch_loss_stat = train_one_batch(batch, model, optimizer, opt)
      File "/home/ubuntu/kg_one2set/train_ml.py", line 146, in train_one_batch
        control_embed = model.decoder.forward_seg(state)
      File "/home/ubuntu/kg_one2set/pykp/decoder/transformer.py", line 153, in forward_seg
        control_idx = torch.arange(0, self.max_kp_num).long().to(device).reshape(1, -1).repeat(batch_size, 1)
    RuntimeError: CUDA error: device-side assert triggered
    

    Any suggestions would be appreciated.

    opened by kgarg8 0
Code for our paper "SimCLS: A Simple Framework for Contrastive Learning of Abstractive Summarization", ACL 2021

SimCLS Code for our paper: "SimCLS: A Simple Framework for Contrastive Learning of Abstractive Summarization", ACL 2021 1. How to Install Requirements

Yixin Liu 150 Dec 12, 2022
Code and data of the ACL 2021 paper: Few-Shot Text Ranking with Meta Adapted Synthetic Weak Supervision

MetaAdaptRank This repository provides the implementation of meta-learning to reweight synthetic weak supervision data described in the paper Few-Shot

THUNLP 5 Jun 16, 2022
code associated with ACL 2021 DExperts paper

DExperts Hi! This repository contains code for the paper DExperts: Decoding-Time Controlled Text Generation with Experts and Anti-Experts to appear at

Alisa Liu 68 Dec 15, 2022
null 190 Jan 3, 2023
Data and Code for ACL 2021 Paper "Inter-GPS: Interpretable Geometry Problem Solving with Formal Language and Symbolic Reasoning"

Introduction Code and data for ACL 2021 Paper "Inter-GPS: Interpretable Geometry Problem Solving with Formal Language and Symbolic Reasoning". We cons

Pan Lu 81 Dec 27, 2022
Code for ACL'2021 paper WARP 🌀 Word-level Adversarial ReProgramming

Code for ACL'2021 paper WARP ?? Word-level Adversarial ReProgramming. Outperforming `GPT-3` on SuperGLUE Few-Shot text classification.

YerevaNN 75 Nov 6, 2022
Codes for ACL-IJCNLP 2021 Paper "Zero-shot Fact Verification by Claim Generation"

Zero-shot-Fact-Verification-by-Claim-Generation This repository contains code and models for the paper: Zero-shot Fact Verification by Claim Generatio

Liangming Pan 47 Jan 1, 2023
PyTorch implementation for ACL 2021 paper "Maria: A Visual Experience Powered Conversational Agent".

Maria: A Visual Experience Powered Conversational Agent This repository is the Pytorch implementation of our paper "Maria: A Visual Experience Powered

Jokie 22 Dec 12, 2022
The source codes for ACL 2021 paper 'BoB: BERT Over BERT for Training Persona-based Dialogue Models from Limited Personalized Data'

BoB: BERT Over BERT for Training Persona-based Dialogue Models from Limited Personalized Data This repository provides the implementation details for

null 124 Dec 27, 2022
A sample pytorch Implementation of ACL 2021 research paper "Learning Span-Level Interactions for Aspect Sentiment Triplet Extraction".

Span-ASTE-Pytorch This repository is a pytorch version that implements Ali's ACL 2021 research paper Learning Span-Level Interactions for Aspect Senti

来自丹麦的天籁 10 Dec 6, 2022