Code for the papers "Generation-Augmented Retrieval for Open-Domain Question Answering" (ACL 2021) and "Reader-Guided Passage Reranking for Open-Domain Question Answering" (Findings of ACL 2021)

Overview

This repo provides the code of the following papers:

(GAR) "Generation-Augmented Retrieval for Open-domain Question Answering", ACL 2021

(RIDER) "Reader-Guided Passage Reranking for Open-Domain Question Answering", Findings of ACL 2021.

GAR augments a question with relevant contexts generated by seq2seq learning, taking the question as input and targets such as the answer, the sentence containing the answer, and the title of a passage that contains the answer. With the generated contexts appended to the original questions, GAR achieves state-of-the-art OpenQA performance with a simple BM25 retriever.
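As a rough illustration of the augmentation step (the snippet below is a sketch, not the repo's API; the function name and example data are made up for illustration), the generated contexts are simply concatenated with the original question to form the query passed to BM25:

    # Illustrative sketch (not the repo's API): build the augmented BM25 query by
    # appending generated contexts (answer / sentence / title) to the question.
    def augment_query(question, generated_contexts):
        return " ".join([question] + generated_contexts)

    query = augment_query(
        "who sings does he love me with reba",
        ["Linda Davis", "Does He Love You is a duet by Reba McEntire and Linda Davis"],
    )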

RIDER is a simple and effective passage reranker that reranks retrieved passages according to reader predictions, without any training. RIDER achieves 10-20 point gains in top-1 retrieval accuracy and 1-4 point gains in Exact Match (EM), and even outperforms supervised transformer-based rerankers.
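The reranking heuristic can be sketched as follows, assuming each passage is a dict with a text field (a simplified illustration of the idea only; see rider/rider.py for the actual implementation): passages that contain one of the reader's top predicted answers are moved to the front, and the original retrieval order is preserved within each group.

    # Simplified sketch of the RIDER idea (see rider/rider.py for the real code):
    # passages mentioning a top reader prediction come first; relative order is kept.
    def rerank_by_reader_predictions(passages, predictions):
        preds = [p.lower() for p in predictions]
        hit, miss = [], []
        for psg in passages:
            (hit if any(a in psg["text"].lower() for a in preds) else miss).append(psg)
        return hit + miss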

Code

Generation

The codebase for the seq2seq models is based on the (old) huggingface/transformers (version 2.11.0) examples.

See train_gen.yml for the package requirements and example commands to run the models.

train_generator.py: training of seq2seq models.

conf.py: configurations for train_generator.py. There are some default parameters, but it might be easier to set, e.g., --data_dir and --output_dir directly.

test_generator.py: testing of seq2seq models (if not already done in train_generator.py).

Retrieval

We use pyserini for BM25 retrieval. Please refer to its documentation for indexing and searching the wiki passages (the wiki passages can be downloaded here). Alternatively, you may take a look at its effort to reproduce DPR results, which gives more detailed instructions and incorporates the passage-level span voting used in GAR.
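For reference, a minimal BM25 search with pyserini looks roughly like the snippet below (the index path is a placeholder, and SimpleSearcher is the search class in pyserini versions contemporary with this repo; later releases renamed it):

    # Minimal BM25 retrieval sketch with pyserini; the index path is a placeholder
    # (build or download a Wikipedia passage index following the pyserini docs).
    from pyserini.search import SimpleSearcher

    searcher = SimpleSearcher("path/to/wikipedia-passage-index")
    hits = searcher.search("who sings does he love me with reba Linda Davis", k=100)
    for hit in hits[:5]:
        print(hit.docid, hit.score)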

Reranking

Please see the instructions in rider/rider.py.

Reading

We experiment with one extractive reader and one generative reader.

For the extractive reader, we take the one used by dense passage retrieval. Please refer to DPR for more details.

For the generative reader, we reuse the codebase from the generation stage above, with [question; top-retrieved passages] as the source input and one ground-truth answer as the target output. An example script is provided in train_gen.yml.
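A rough sketch of how such a source line might be assembled (illustrative only; the function name and truncation parameter are made up, so check train_gen.yml and the generation code for the exact format):

    # Illustrative only: build one seq2seq source line for the generative reader by
    # concatenating the question with the text of its top retrieved passages.
    def build_reader_source(question, retrieved_passages, num_passages=10):
        return " ".join([question] + [p["text"] for p in retrieved_passages[:num_passages]])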

Data

Please refer to DPR for dataset downloading.

For seq2seq learning, use {train/val/test}.source as the input and {train/val/test}.target as the output, where each line is one example.

In the same folder, save the list of ground-truth answers with name {val/test}.target.json if you want to evaluate EM during training.
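As an illustrative example of the layout (the exact structure of {val/test}.target.json is assumed here to be a list of acceptable-answer lists, one per example; check the evaluation code to confirm):

    # Illustrative layout: line-aligned .source/.target files for seq2seq training,
    # plus an optional JSON file of acceptable answers per example for EM evaluation.
    import json

    questions = ["who sings does he love me with reba"]
    targets = ["Linda Davis"]    # generation target (e.g., the answer)
    answers = [["Linda Davis"]]  # assumed format: one list of acceptable answers per example

    with open("val.source", "w") as f:
        f.write("\n".join(questions) + "\n")
    with open("val.target", "w") as f:
        f.write("\n".join(targets) + "\n")
    with open("val.target.json", "w") as f:
        json.dump(answers, f)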

Cite

Please use the following BibTeX entries to cite our papers.

@article{mao2020generation,
  title={Generation-augmented retrieval for open-domain question answering},
  author={Mao, Yuning and He, Pengcheng and Liu, Xiaodong and Shen, Yelong and Gao, Jianfeng and Han, Jiawei and Chen, Weizhu},
  journal={arXiv preprint arXiv:2009.08553},
  year={2020}
}

@article{mao2021reader,
  title={Reader-Guided Passage Reranking for Open-Domain Question Answering},
  author={Mao, Yuning and He, Pengcheng and Liu, Xiaodong and Shen, Yelong and Gao, Jianfeng and Han, Jiawei and Chen, Weizhu},
  journal={arXiv preprint arXiv:2101.00294},
  year={2021}
}

Comments
  • Fusion question

    Fusion question

    Hi, I ran into a problem when trying to fuse the results from three sources following the instruction in another issue: "Say you have 3 lists of retrieved docs [a1, a2, ...], [b1, b2, ...], [c1, c2, ...]. The combined list would be [a1, b1, c1, a2, b2, c2, ...]". Specifically, I treat list a, list b, and list c as the answer, sentence, and title lists, respectively. I have checked my code many times and still obtain the slightly different results below; is that reasonable? Top5 accuracy: 0.5853 Top10 accuracy: 0.6604 Top20 accuracy: 0.7269
    Top50 accuracy: 0.7967 Top100 accuracy: 0.8382

    I use the provided augmented queries with BM25 on NQ, and I get the same results as in another issue for each individual source. Here is my fusion code; I would be grateful if you could point out any mistakes. Or could you provide the source code? It would be really helpful!

    import json

    def fusion(n_doc, tit=True, sen=True, ans_path=None, tit_path=None, sen_path=None, output_path=None):
        if tit:
            data_tit = json.load(open(tit_path, "r"))
        if sen:
            data_sen = json.load(open(sen_path, "r"))
        data_ans = json.load(open(ans_path, "r"))

        fusion_result = []
        for i, data in enumerate(data_ans):
            fusion_result.append(data)  # I do this to keep the same format as the evaluation file
            fusion_psgs = []
            for j, ctx in enumerate(data["ctxs"]):
                fusion_psgs.append(ctx)
                if len(fusion_psgs) == n_doc:
                    break
                if tit:
                    fusion_psgs.append(data_tit[i]["ctxs"][j])
                if sen:
                    fusion_psgs.append(data_sen[i]["ctxs"][j])
            assert len(fusion_psgs) == n_doc, f"question {i} fusion passages length {len(fusion_psgs)} != {n_doc}"
            fusion_result[i]["ctxs"] = fusion_psgs

        json.dump(fusion_result, open(output_path, "w"))
        return fusion_result

    Thank you !!!

    opened by zhengmq2010 4
  • Example of calling the RIDER function from DPR inference code

    Example of calling the RIDER function from DPR inference code

    Thank you for sharing your implementation of GAR and RIDER. There is a function called rider_rerank; I would appreciate it if you could advise me on how you actually used it in your DPR reader inference code. If possible, could you give an example of how you called it?

    opened by kambehmw 4
  • What is the difference between the two checkpoints?

    What is the difference between the two checkpoints?

    I noticed that there were two checkpoints when I finished training the model. Are they the same? I have trained three models on custom datasets, but I always get the same checkpoint names: ① checkpointepoch=1.ckpt and ② checkpointlast.ckpt

    So why is it *epoch=1, and not 50 or 100? ("--num_train_epochs", default=8)

    opened by XY2323819551 2
  • Question about training Answer_generator (GAR)

    Question about training Answer_generator (GAR)

    What do the data look like for training an answer generator, and where are they? I cannot find any information about the variable data_dir="./cnn-dailymail/cnn_dm/" of the SummarizationDataset class in utils_gen.py (line 156). How should I download and prepare the training data for the answer generator?

    opened by yiyaxiaozhi 2
  • Model checkpoints

    Model checkpoints

    Hello! Thanks for publishing your code. Do you intend to publish the checkpoints? The command-line arguments that you used to generate them would also be very helpful. Thanks.

    opened by JulesGM 2
  • Why not finetune DPR using the results of GAR?

    Why not finetune DPR using the results of GAR?

    It seems like dense retrieval may achieve better results, but in GAR+ you only fuse the results of sparse and dense retrieval, so why not finetune DPR using the augmented queries? Or, if you have carried out such experiments, what is the performance?

    opened by bunny-sleepy 1
  • No module named 'transformers.tokenization_utils_base'

    No module named 'transformers.tokenization_utils_base'

    Hello, sorry to bother you. I ran into an issue when I tried to train the GAR model, as follows:

    Traceback (most recent call last):
      File "train_generator.py", line 245, in val_dataloader
        return self.get_dataloader("val", batch_size=self.hparams.eval_batch_size, num_workers=4)
      File "train_generator.py", line 225, in get_dataloader
        dataset = SummarizationDataset(self.tokenizer, type_path=type_path, **self.dataset_kwargs)
      File "../gar/utils_gen.py", line 177, in __init__
        self.source = pickle.load(open(os.path.join(data_dir, type_path + f".source.processed{suffix}"), 'rb'))
    ModuleNotFoundError: No module named 'transformers.tokenization_utils_base'

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "train_generator.py", line 308, in <module>
        main(args)
      File "train_generator.py", line 285, in main
        trainer = generic_train(model, args, logger, resume_cp_file=cp_file, )
      File "../gar/lightning_base.py", line 220, in generic_train
        trainer.fit(model)
      [... pytorch_lightning trainer / ddp_backend / data_loading frames ...]
      File "train_generator.py", line 248, in val_dataloader
        return self.get_dataloader("train", batch_size=self.hparams.eval_batch_size, num_workers=4)
      File "train_generator.py", line 225, in get_dataloader
        dataset = SummarizationDataset(self.tokenizer, type_path=type_path, **self.dataset_kwargs)
      File "../gar/utils_gen.py", line 177, in __init__
        self.source = pickle.load(open(os.path.join(data_dir, type_path + f".source.processed{suffix}"), 'rb'))
    ModuleNotFoundError: No module named 'transformers.tokenization_utils_base'

    I installed transformers==2.11.0 and tokenizers==0.7.0, and I run the project with the command: GEN_TARGET='answer' python train_generator.py --remark generator_train_nq_A --train_batch_size 128 --eval_batch_size 256 --ckpt_metric val-ROUGE-1

    So, how can I solve this? Thanks in advance!

    opened by XY2323819551 1
  • RuntimeError during extractive_reader Validation

    RuntimeError during extractive_reader Validation

    When validating the extractive_reader, it raises a RuntimeError (shown in the attached image) and the iteration only runs up to 73it. How should I fix it? Does passage_idx = idxs[q, p].item() take so long that it causes the timeout?

    opened by yiyaxiaozhi 1
  • NaNs and default fp16

    NaNs and default fp16

    Hello, just a note that fp16 defaults to on. On my config at least, I get NaNs when it's enabled. Also, the argument has both default=True and action="store_true", which is a bit odd and makes me think it was supposed to be default=False.

    parser.add_argument(
            "--fp16",
            action='store_true',
            default=True,
            help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit",
        )
    

    https://github.com/morningmoni/GAR/blob/master/gar/conf.py#L102

    opened by JulesGM 1
  • GAR Data Files Missing

    GAR Data Files Missing

    I was going through the provided GAR data files for NQ. It seems that nq-title is missing all of the test.target files. It would be great if this could be updated to include them.

    opened by ronakice 1
  • Question about GAR and the Rider

    Question about GAR and the Rider

    Thanks for your wonderful work. It brings a new angle to the IR task. I am confused about the GAR generator and the predicted answers produced by a generative reader that RIDER uses for reranking. In my view, the input format for GAR is: [start_token] question [end_token] with target: [start_token] answer/sentence/title [end_token], while the format for the reader is: [start_token] question [separate_token] passage [end_token]. What is the relationship between the GAR generator and the generator that provides the predictions for reranking? Are these two independent models?
    In addition, I could not find any call to rider.py in the released code. Could you provide some instructions on it?

    opened by yiyaxiaozhi 1