Code for our ACL 2021 paper "One2Set: Generating Diverse Keyphrases as a Set"

Jiacheng Ye

Last update: Jan 5, 2023


One2Set

This repository contains the code for our ACL 2021 paper “One2Set: Generating Diverse Keyphrases as a Set”.

Our implementation is built on the source code from keyphrase-generation-rl and fastNLP. Thanks for their work.

If you use this code, please cite our paper:

@inproceedings{ye2021one2set,
  title={One2Set: Generating Diverse Keyphrases as a Set},
  author={Ye, Jiacheng and Gui, Tao and Luo, Yichao and Xu, Yige and Zhang, Qi},
  booktitle={Proceedings of ACL},
  year={2021}
}

Dependency

python 3.5+
pytorch 1.0+

Dataset

The datasets can be downloaded from here. They are tokenized versions of the datasets provided by Ken Chen:

The testsets directory contains the five test datasets (i.e., inspec, krapivin, nus, semeval and kp20k), each of which contains test_src.txt and test_trg.txt.
The kp20k_separated directory contains the training and validation files (i.e., train_src.txt, train_trg.txt, valid_src.txt and valid_trg.txt).
Each line of a *_src.txt file is one source document, given as the tokenized words of title <eos> abstract.
Each line of a *_trg.txt file contains the target keyphrases separated by a ; character. The <peos> token marks the end of the present ground-truth keyphrases and is used to train a separate set loss for the SetTrans model. For example, a line can look like present keyphrase one;present keyphrase two;<peos>;absent keyphrase one;absent keyphrase two.
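The line formats described above can be read with a few lines of Python. This is a minimal sketch, not the repository's own loader: the function names are hypothetical, and treating a line without <peos> as containing only present keyphrases is an assumption.

```python
def parse_src_line(line):
    """Split a *_src.txt line into (title_tokens, abstract_tokens)."""
    title, _, abstract = line.strip().partition("<eos>")
    return title.split(), abstract.split()


def parse_trg_line(line):
    """Split a *_trg.txt line into (present, absent) keyphrase lists."""
    phrases = [p.strip() for p in line.strip().split(";") if p.strip()]
    if "<peos>" in phrases:
        cut = phrases.index("<peos>")
        return phrases[:cut], phrases[cut + 1:]
    # Assumption: without a <peos> marker, treat every keyphrase as present.
    return phrases, []
```

Calling parse_trg_line on the example line above returns the two present keyphrases in the first list and the two absent keyphrases in the second.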

Quick Start

The whole process includes the following steps:

Preprocessing: The preprocess.py script numericalizes the train_src.txt, train_trg.txt, valid_src.txt and valid_trg.txt files, and produces train.one2many.pt, valid.one2many.pt and vocab.pt.
Training: The train.py script loads the train.one2many.pt, valid.one2many.pt and vocab.pt files and performs training. We evaluate the model on the validation set every 8000 batches and save it if the validation loss is lower than the previous one.
Decoding: The predict.py script loads the trained model and performs decoding on the five test datasets. The predictions are saved to a file in which each line looks like predicted keyphrase one;predicted keyphrase two;…. For SetTrans, we ignore the $\varnothing$ predictions, which mean “no corresponding keyphrase”.
Evaluation: The evaluate_prediction.py script loads the ground-truth and predicted keyphrases, and calculates the $F_1@5$ and $F_1@M$ metrics.
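To make the decoding output and the two metrics concrete, here is a simplified sketch of reading one prediction line and scoring it. It is not the logic of evaluate_prediction.py. It assumes exact string matching with no stemming, assumes the $\varnothing$ prediction is written as the hypothetical token `<null>`, and does not pad short prediction lists when computing $F_1@5$.

```python
NULL_TOKEN = "<null>"  # hypothetical spelling of the "no keyphrase" prediction


def parse_predictions(line, null_token=NULL_TOKEN):
    """Split a prediction line on ';', dropping empty, null and duplicate entries."""
    seen, phrases = set(), []
    for phrase in line.strip().split(";"):
        phrase = " ".join(phrase.split())  # normalize whitespace
        if phrase and phrase != null_token and phrase not in seen:
            seen.add(phrase)
            phrases.append(phrase)
    return phrases


def f1_at_k(predicted, gold, k=None):
    """F1 over the top-k predictions; k=None scores all predictions (F1@M)."""
    top = predicted if k is None else predicted[:k]
    gold_set = set(gold)
    if not top or not gold_set:
        return 0.0
    correct = sum(1 for phrase in top if phrase in gold_set)
    if correct == 0:
        return 0.0
    precision = correct / len(top)
    recall = correct / len(gold_set)
    return 2 * precision * recall / (precision + recall)
```

With this sketch, f1_at_k(preds, gold, k=5) corresponds to $F_1@5$ and f1_at_k(preds, gold) to $F_1@M$, where M is the number of predictions the model actually produced.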

For simplicity, we provide a one-click script in the scripts directory. Run the following command to execute the whole process with the SetTrans model under the One2Set paradigm:

bash scripts/run_one2set.sh

You can also run the baseline Transformer model under One2Seq paradigm with the following command:

bash scripts/run_one2seq.sh

Note:

Please download and unzip the datasets in the ./data directory first.
To run all the bash scripts smoothly, you may need to set the correct home_dir (i.e., the absolute path to the kg_one2set directory) and the GPU id for CUDA_VISIBLE_DEVICES.
We provide a small amount of data for quickly checking whether your environment is set up correctly. You can test it by running the following command:

bash scripts/run_small_one2set.sh
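Before launching the full scripts, it can help to confirm that the unzipped data is where the scripts expect it. The check below is a sketch under an assumed layout: kp20k_separated and testsets sit directly under ./data, with one subdirectory per test set. The function names are hypothetical, so adjust the paths if your copy is organized differently.

```python
from pathlib import Path

TEST_SETS = ["inspec", "krapivin", "nus", "semeval", "kp20k"]
TRAIN_FILES = ["train_src.txt", "train_trg.txt", "valid_src.txt", "valid_trg.txt"]


def expected_files(data_dir):
    """List the files described in the Dataset section, under the assumed layout."""
    root = Path(data_dir)
    files = [root / "kp20k_separated" / name for name in TRAIN_FILES]
    for dataset in TEST_SETS:
        files.append(root / "testsets" / dataset / "test_src.txt")
        files.append(root / "testsets" / dataset / "test_trg.txt")
    return files


def missing_files(data_dir):
    """Return the expected files that do not exist on disk."""
    return [path for path in expected_files(data_dir) if not path.is_file()]
```

Calling missing_files("./data") from the repository root returns an empty list when every expected file is present.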

Resources

You can download our trained model here. We also provide the raw predictions and corresponding evaluation results of three runs with different random seeds here, which contain the following files:

test
├── Full_One2set_Copy_Seed27_Dropout0.1_LR0.0001_BS12_MaxLen6_MaxNum20_LossScalePre0.2_LossScaleAb0.1_Step2_SetLoss
│   ├── inspec
│   │   ├── predictions.txt
│   │   └── results_log_5_M_5_M_5_M.txt
│   ├── kp20k
│   │   ├── predictions.txt
│   │   └── results_log_5_M_5_M_5_M.txt
│   ├── krapivin
│   │   ├── predictions.txt
│   │   └── results_log_5_M_5_M_5_M.txt
│   ├── nus
│   │   ├── predictions.txt
│   │   └── results_log_5_M_5_M_5_M.txt
│   └── semeval
│       ├── predictions.txt
│       └── results_log_5_M_5_M_5_M.txt
├── Full_One2set_Copy_Seed527_Dropout0.1_LR0.0001_BS12_MaxLen6_MaxNum20_LossScalePre0.2_LossScaleAb0.1_Step2_SetLoss
│   ├── ...
└── Full_One2set_Copy_Seed9527_Dropout0.1_LR0.0001_BS12_MaxLen6_MaxNum20_LossScalePre0.2_LossScaleAb0.1_Step2_SetLoss
    ├── ...
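The run directory names above encode the settings of each run as name-value fragments separated by underscores. The sketch below splits such a name into numeric settings and plain flags, which is handy when comparing the three seeds. The helper name is hypothetical, and what each key means (for example, what BS or Step stands for) is not stated here, so the keys are kept exactly as they appear.

```python
import re

RUN_NAME = ("Full_One2set_Copy_Seed27_Dropout0.1_LR0.0001_BS12_MaxLen6_"
            "MaxNum20_LossScalePre0.2_LossScaleAb0.1_Step2_SetLoss")

# A fragment counts as a setting if it is letters followed by one trailing number.
FIELD = re.compile(r"^([A-Za-z]+?)(\d+(?:\.\d+)?)$")


def parse_run_name(name):
    """Split a run directory name into ({key: number}, [flags])."""
    settings, flags = {}, []
    for part in name.split("_"):
        match = FIELD.match(part)
        if match:
            key, value = match.groups()
            settings[key] = float(value) if "." in value else int(value)
        else:
            flags.append(part)  # e.g. "One2set" has no trailing number
    return settings, flags
```

For the first directory in the tree, the parsed settings include a Seed of 27 and an LR of 0.0001, while fragments such as Full and SetLoss are returned as flags.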


Comments

a bug about [dec_layers]
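Hello, I'm a beginner. The code uses the -dec_layers argument to set the number of decoder layers, but in practice the argument has no effect, because the decoder source file contains:

self.input_fc = nn.Linear(self.embed.embedding_dim, d_model)
self.layer_stacks = nn.ModuleList([TransformerSeq2SeqDecoderLayer(d_model, n_head, dim_ff, dropout, layer_idx, fix_kp_num_len, max_kp_num) for layer_idx in range(6)])
self.embed_scale = math.sqrt(d_model)

The layer count is hard-coded here (for layer_idx in range(6)), which causes at least the following bugs:

The state (initialized by def init_state(self, encoder_output, encoder_mask)) is sized strictly according to num_layers, i.e., -dec_layers, but the actual number of layers is 6. When dec_layers is less than 6, accessing the state goes out of range:
File "d:\code\【study-see】\【源码】kg_one2set-master\pykp\modules\multi_head_attn.py", line 57, in forward
    prev_k = state.decoder_prev_key[self.layer_idx]
IndexError: list index out of range

When -dec_layers is not equal to 6, the argument has no effect.

In addition, the decoder code contains num_layers=opt.enc_layers, which does not look right to me. You could try extracting a base class that both the decoder and the encoder inherit from.

(I have benefited a great deal from this project and from your paper. Thank you very much for your work. I am still reading through the paper and the code.)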
opened by yuanyihan 2
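The mismatch reported above, between the hard-coded range(6) and a state sized by -dec_layers, can be reproduced without PyTorch. The classes below are simplified stand-ins, not the repository's modules, and the hard_coded switch is a hypothetical device for showing both behaviours side by side. Building the layer list from num_layers is the change the report implies.

```python
class DecoderLayer:
    """Stand-in for a decoder layer that reads its own slot of the cached state."""

    def __init__(self, layer_idx):
        self.layer_idx = layer_idx

    def forward(self, prev_keys):
        return prev_keys[self.layer_idx]


class Decoder:
    def __init__(self, num_layers, hard_coded=False):
        self.num_layers = num_layers
        # Reported behaviour: six layers regardless of num_layers.
        # Suggested behaviour: one layer per configured layer.
        layer_count = 6 if hard_coded else num_layers
        self.layer_stacks = [DecoderLayer(i) for i in range(layer_count)]

    def init_state(self):
        # One cached-key slot per configured layer.
        return [None] * self.num_layers

    def forward(self):
        prev_keys = self.init_state()
        for layer in self.layer_stacks:
            layer.forward(prev_keys)
        return len(self.layer_stacks)
```

With hard_coded=True and num_layers=3, the fourth layer indexes past the end of the state and raises IndexError, matching the traceback in the report. With the layer count taken from num_layers, the forward pass completes.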

RuntimeError: CUDA error: device-side assert triggered

Hi,

Thanks for the nice repo.

I am facing the following error while training the model with kp20k dataset. FYI, I am training with batch_size=2.

08/30/2021 23:41:03 [INFO] train_ml: Epoch 1; batch: 90000; total batch: 90000，avg training ppl: 5.333, loss: 1.674                                                              
08/30/2021 23:43:40 [INFO] train_ml: Epoch 1; batch: 91000; total batch: 91000，avg training ppl: 5.328, loss: 1.673                                                              
08/30/2021 23:46:18 [INFO] train_ml: Epoch 1; batch: 92000; total batch: 92000，avg training ppl: 5.322, loss: 1.672                                                              
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:662: indexSelectLargeIndex: block: [148,0,0], thread: [96,0,0] Assertion `srcIndex < srcSelectDimSize` failed.                     
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:662: indexSelectLargeIndex: block: [148,0,0], thread: [97,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
...
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:662: indexSelectLargeIndex: block: [130,0,0], thread: [30,0,0] Assertion `srcIndex < srcSelectDimSize` failed.                     
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:662: indexSelectLargeIndex: block: [130,0,0], thread: [31,0,0] Assertion `srcIndex < srcSelectDimSize` failed.                     
Traceback (most recent call last):                                                                                                                                                
  File "train.py", line 103, in <module>                                                                                                                                          
    main(opt)                                                                                                                                                                     
  File "train.py", line 85, in main                                                                                                                                               
    train_ml.train_model(model, optimizer, train_data_loader, valid_data_loader, opt)                                                                                             
  File "/home/ubuntu/kg_one2set/train_ml.py", line 44, in train_model
    batch_loss_stat = train_one_batch(batch, model, optimizer, opt)
  File "/home/ubuntu/kg_one2set/train_ml.py", line 146, in train_one_batch
    control_embed = model.decoder.forward_seg(state)
  File "/home/ubuntu/kg_one2set/pykp/decoder/transformer.py", line 153, in forward_seg
    control_idx = torch.arange(0, self.max_kp_num).long().to(device).reshape(1, -1).repeat(batch_size, 1)
RuntimeError: CUDA error: device-side assert triggered

Any suggestions would be appreciated.

opened by kgarg8 0
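One general observation on this report: the assertion `srcIndex < srcSelectDimSize` is raised by PyTorch's index-select kernel when an index is not smaller than the size of the dimension being indexed, which typically points to an embedding lookup receiving an id outside the table. Because CUDA kernels run asynchronously, the Python line shown in the traceback is not necessarily the one that failed. Rerunning with CUDA_LAUNCH_BLOCKING=1 or on CPU usually gives a more accurate location. The helper below is a hypothetical diagnostic, not part of the repository. It scans a batch of ids, given as nested Python lists, for values outside an assumed vocabulary size.

```python
def find_out_of_range(token_ids, vocab_size):
    """Return (row, column, id) for every id outside [0, vocab_size)."""
    bad = []
    for row, sequence in enumerate(token_ids):
        for col, idx in enumerate(sequence):
            if not 0 <= idx < vocab_size:
                bad.append((row, col, idx))
    return bad
```

Running such a check on each batch before it reaches the model, for instance on the result of calling .tolist() on the id tensor, would show whether a particular training example carries an out-of-range id.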
