Binary Passage Retriever (BPR) - an efficient passage retriever for open-domain question answering

Studio Ousia

Last update: Dec 7, 2022

Overview

BPR

Binary Passage Retriever (BPR) is an efficient neural retrieval model for open-domain question answering. BPR integrates a learning-to-hash technique into Dense Passage Retriever (DPR) so that passage embeddings are represented as compact binary codes rather than continuous vectors. This substantially reduces the memory footprint without a loss of accuracy on the Natural Questions and TriviaQA datasets.
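As a rough illustration of why binary codes save memory, the sketch below thresholds toy vectors at zero and packs the resulting bits with NumPy. The 768-dimensional size, the thresholding at zero, and the helper name binarize are assumptions made for illustration; they are not taken from this repository's code.

```python
import numpy as np


def binarize(embeddings: np.ndarray) -> np.ndarray:
    """Map continuous vectors to packed binary codes (1 bit per dimension)."""
    # Threshold at zero: positive components become 1, the rest become 0.
    bits = (embeddings > 0).astype(np.uint8)
    # Pack 8 dimensions into each byte.
    return np.packbits(bits, axis=1)


rng = np.random.default_rng(0)
# 4 toy "passages" with an assumed embedding size of 768.
dense = rng.standard_normal((4, 768)).astype(np.float32)
codes = binarize(dense)

bytes_per_dense = dense.nbytes // len(dense)  # 768 dims * 4 bytes
bytes_per_code = codes.nbytes // len(codes)   # 768 dims / 8 bits per byte
print(bytes_per_dense, bytes_per_code)
```

Under these assumptions, each passage shrinks from 3072 bytes as float32 to 96 bytes as a binary code, a 32x reduction.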

BPR was originally developed to improve the computational efficiency of the Sōseki question answering system submitted to the Systems under 6GB track in the NeurIPS 2020 EfficientQA competition. Please refer to our ACL 2021 paper for further technical details.

Installation

BPR can be installed using Poetry:

poetry install

The virtual environment automatically created by Poetry can be activated by poetry shell.

Alternatively, you can install required libraries using pip:

pip install -r requirements.txt

Trained Models

(coming soon)

Reproducing Experiments

Before you start, you need to download the datasets available on the DPR website into <DPR_DATASET_DIR>.

The experimental results on the Natural Questions dataset can be reproduced by running the commands provided in this section. We used a server with 8 NVIDIA Tesla V100 GPUs with 16GB memory in the experiments. The results on the TriviaQA dataset can be reproduced by changing the file names of the input dataset to the corresponding ones (e.g., nq-train.json -> trivia-train.json).

1. Building passage database

python build_passage_db.py \
    --passage_file=<DPR_DATASET_DIR>/wikipedia_split/psgs_w100.tsv \
    --output_file=<PASSAGE_DB_FILE>

2. Training BPR

python train_biencoder.py \
   --gpus=8 \
   --distributed_backend=ddp \
   --train_file=<DPR_DATASET_DIR>/retriever/nq-train.json \
   --eval_file=<DPR_DATASET_DIR>/retriever/nq-dev.json \
   --gradient_clip_val=2.0 \
   --max_epochs=40 \
   --binary

3. Building passage embeddings

python generate_embeddings.py \
   --biencoder_file=<BPR_CHECKPOINT_FILE> \
   --output_file=<EMBEDDING_FILE> \
   --passage_db_file=<PASSAGE_DB_FILE> \
   --batch_size=4096 \
   --parallel

4. Evaluating BPR

python evaluate_retriever.py \
    --binary_k=1000 \
    --biencoder_file=<BPR_CHECKPOINT_FILE> \
    --embedding_file=<EMBEDDING_FILE> \
    --passage_db_file=<PASSAGE_DB_FILE> \
    --qa_file=<DPR_DATASET_DIR>/retriever/qas/nq-test.csv \
    --parallel

5. Creating dataset for reader

python evaluate_retriever.py \
    --binary_k=1000 \
    --biencoder_file=<BPR_CHECKPOINT_FILE> \
    --embedding_file=<EMBEDDING_FILE> \
    --passage_db_file=<PASSAGE_DB_FILE> \
    --qa_file=<DPR_DATASET_DIR>/retriever/qas/nq-train.csv \
    --output_file=<READER_TRAIN_FILE> \
    --top_k=200 \
    --parallel

python evaluate_retriever.py \
    --binary_k=1000 \
    --biencoder_file=<BPR_CHECKPOINT_FILE> \
    --embedding_file=<EMBEDDING_FILE> \
    --passage_db_file=<PASSAGE_DB_FILE> \
    --qa_file=<DPR_DATASET_DIR>/retriever/qas/nq-dev.csv \
    --output_file=<READER_DEV_FILE> \
    --top_k=200 \
    --parallel

python evaluate_retriever.py \
    --binary_k=1000 \
    --biencoder_file=<BPR_CHECKPOINT_FILE> \
    --embedding_file=<EMBEDDING_FILE> \
    --passage_db_file=<PASSAGE_DB_FILE> \
    --qa_file=<DPR_DATASET_DIR>/retriever/qas/nq-test.csv \
    --output_file=<READER_TEST_FILE> \
    --top_k=200 \
    --parallel
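The three commands in this step differ only in the split passed to --qa_file and in the --output_file. A small driver, sketched below, could build them from one template. The names COMMON_ARGS, OUTPUTS, and build_command are hypothetical, and the angle-bracket placeholders are kept as written in this README; replace them with real paths before running anything.

```python
# Flags shared by all three invocations, copied from the commands in this step.
COMMON_ARGS = [
    "--binary_k=1000",
    "--biencoder_file=<BPR_CHECKPOINT_FILE>",
    "--embedding_file=<EMBEDDING_FILE>",
    "--passage_db_file=<PASSAGE_DB_FILE>",
    "--top_k=200",
    "--parallel",
]

# One output placeholder per split (train, dev, test).
OUTPUTS = {
    "train": "<READER_TRAIN_FILE>",
    "dev": "<READER_DEV_FILE>",
    "test": "<READER_TEST_FILE>",
}


def build_command(split: str) -> list:
    """Return the argument list for one evaluate_retriever.py run."""
    return [
        "python",
        "evaluate_retriever.py",
        *COMMON_ARGS,
        f"--qa_file=<DPR_DATASET_DIR>/retriever/qas/nq-{split}.csv",
        f"--output_file={OUTPUTS[split]}",
    ]


commands = [build_command(split) for split in OUTPUTS]
for cmd in commands:
    print(" ".join(cmd))
```

Once the placeholders are filled in, each list can be passed to subprocess.run(cmd, check=True) to execute the runs one after another.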

6. Training reader

python train_reader.py \
   --gpus=8 \
   --distributed_backend=ddp \
   --train_file=<READER_TRAIN_FILE> \
   --validation_file=<READER_DEV_FILE> \
   --test_file=<READER_TEST_FILE> \
   --learning_rate=2e-5 \
   --max_epochs=20 \
   --accumulate_grad_batches=4 \
   --nq_gold_train_file=<DPR_DATASET_DIR>/gold_passages_info/nq_train.json \
   --nq_gold_validation_file=<DPR_DATASET_DIR>/gold_passages_info/nq_dev.json \
   --nq_gold_test_file=<DPR_DATASET_DIR>/gold_passages_info/nq_test.json \
   --train_batch_size=1 \
   --eval_batch_size=2 \
   --gradient_clip_val=2.0

7. Evaluating reader

python evaluate_reader.py \
    --gpus=8 \
    --distributed_backend=ddp \
    --checkpoint_file=<READER_CHECKPOINT_FILE> \
    --eval_batch_size=1

License

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Citation

If you find this work useful, please cite the following paper:

@inproceedings{yamada2021bpr,
  title={Efficient Passage Retrieval with Hashing for Open-domain Question Answering},
  author={Ikuya Yamada and Akari Asai and Hannaneh Hajishirzi},
  booktitle={ACL},
  year={2021}
}

Comments

Evaluation result
The experimental results I obtained are far lower than those in the paper. My environment is as follows:

Ubuntu 18.04.5 LTS
Python 3.8.10
CPU: Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz
GPU: 2 * TITAN RTX 24GB
MEM: 125GB

The rest of the environment follows requirements.txt. Our evaluation steps are:

Building passage database

Training BPR

python train_biencoder.py --gpus=2 --distributed_backend=ddp --train_file=/downloads/data/retriever/nq-train.json --eval_file=/downloads/data/retriever/nq-dev.json --gradient_clip_val=2.0 --max_epochs=40 --binary

After training, there are two more folders (version_2 and version_3) in the "./biencoder" folder. We found that only version_3 has a checkpoints folder, so <BPR_CHECKPOINT_FILE> is "./biencoder/version_3/checkpoints/last.ckpt".

Building passage embeddings

CUDA_VISIBLE_DEVICES=0,1 python generate_embeddings.py --biencoder_file=./biencoder/version_3/checkpoints/last.ckpt --output_file=./biencoder/embedding/em_my --passage_db_file=./passage_db --batch_size=2048 --parallel

We only changed batch_size from 4096 to 2048. Building the embeddings took more time than training!

Evaluating BPR

The top-1 precision is 38.78, which is much lower than the 41.1 reported in the paper and the 49 reported on GitHub.
opened by xuanricheng 4

Reproducing issues: broken pipe & CUDA out of memory errors

Hi,

I was trying to train BPR by running

python train_biencoder.py --gpus=7 --distributed_backend=ddp --train_file=nq-train.json \
--eval_file=nq-dev.json --gradient_clip_val=2.0 --max_epochs=40 --binary --train_batch_size=4 --eval_batch_size=4

However, there are a lot of errors. For example, after the validation sanity check, there is a broken pipe error in multiprocessing/connection.py. The output is listed below:

Traceback (most recent call last):
  File "/miniconda3/envs/bpr/lib/python3.7/multiprocessing/queues.py", line 242, in _feed
    send_bytes(obj)
  File "/miniconda3/envs/bpr/lib/python3.7/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/miniconda3/envs/bpr/lib/python3.7/multiprocessing/connection.py", line 404, in _send_bytes
    self._send(header + buf)
  File "/miniconda3/envs/bpr/lib/python3.7/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

Furthermore, I encountered CUDA out of memory errors. The trimmed output is attached below. (Each line was repeated 3 times in the original output, because 3 of the 7 GPUs I am using hit OOM errors.)

Traceback (most recent call last):
  File "bpr/train_biencoder.py", line 53, in <module>
    trainer.fit(model)
  File "miniconda3/envs/bpr/lib/python3.7/site-packages/pytorch_lightning/trainer/states.py", line 48, in wrapped_fn
    result = fn(self, *args, **kwargs)
  File "miniconda3/envs/bpr/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1046, in fit
    self.accelerator_backend.train(model)
  File "miniconda3/envs/bpr/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_backend.py", line 57, in train
    self.ddp_train(process_idx=self.task_idx, mp_queue=None, model=model)
  File "/gscratch/cse/xyu530/miniconda3/envs/bpr/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_backend.py", line 224, in ddp_train
    results = self.trainer.run_pretrain_routine(model)
  File "miniconda3/envs/bpr/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1239, in run_pretrain_routine
    self.train()
  File "miniconda3/envs/bpr/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 394, in train
    self.run_training_epoch()
  File "miniconda3/envs/bpr/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 516, in run_training_epoch
    self.run_evaluation(test_mode=False)
  File "miniconda3/envs/bpr/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 582, in run_evaluation
    eval_results = self._evaluate(self.model, dataloaders, max_batches, test_mode)
  File "miniconda3/envs/bpr/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 396, in _evaluate
    eval_results = self.__run_eval_epoch_end(test_mode, outputs, dataloaders, using_eval_result)
  File "miniconda3/envs/bpr/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 490, in __run_eval_epoch_end
    eval_results = model.validation_epoch_end(eval_results)
  File "bpr/bpr/biencoder.py", line 246, in validation_epoch_end
    dist.all_gather(passage_repr_list, passage_repr)
  File "miniconda3/envs/bpr/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1185, in all_gather
    work = _default_pg.allgather([tensor_list], [tensor])
RuntimeError: CUDA out of memory. Tried to allocate 1.14 GiB (GPU X; 10.76 GiB total capacity; 8.14 GiB already allocated; 526.56 MiB free; 9.29 GiB reserved in total by PyTorch)
    work = _default_pg.allgather([tensor_list], [tensor])

Sorry for putting all these outputs here!

I installed BPR with pip install -r requirements.txt and built the passage database successfully. The GPUs I am using are 7 GeForce RTX 2080 Ti.

Thanks for any help!

opened by velocityCavalry 4

Puzzles of hamming distance

Hi, this work is great for efficient retrieval, and I learned a lot from it. I have a question about the Hamming distance between the query hash code and the document hash code. The paper states that hamming_distance(q, d) = 1/2 (const - <q, d>), where <q, d> is the inner product. In index.py, the score is calculated by np.einsum("ijk,ik->ij", passage_embeddings, query_embeddings). Is the source code consistent with this equation? I am also unsure about the shapes of passage_embeddings and query_embeddings. I look forward to your reply. Thank you so much!
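For readers with the same question, the identity can be checked numerically. The sketch below uses toy codes with entries in {-1, +1}, for which the constant equals the code length. It does not reproduce index.py: the shapes are inferred only from the einsum subscripts quoted above, and all sizes and variable values are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 64  # toy code length (assumption)

# Two toy hash codes with entries in {-1, +1}.
q = rng.choice([-1, 1], size=dim)
d = rng.choice([-1, 1], size=dim)

hamming = int(np.sum(q != d))  # number of positions where the codes differ
inner = int(q @ d)             # (#agreements) - (#disagreements)

# inner = dim - 2 * hamming, hence hamming = (dim - inner) / 2.
assert hamming == (dim - inner) // 2

# Shapes implied by the subscripts "ijk,ik->ij":
#   passage_embeddings: (num_queries, num_candidates, dim)
#   query_embeddings:   (num_queries, dim)
#   scores:             (num_queries, num_candidates)
num_queries, num_candidates = 2, 5
passage_embeddings = rng.choice([-1, 1], size=(num_queries, num_candidates, dim))
query_embeddings = rng.standard_normal((num_queries, dim))
scores = np.einsum("ijk,ik->ij", passage_embeddings, query_embeddings)
print(scores.shape)
```

Because the Hamming distance is a decreasing linear function of the inner product for such codes, ranking candidates by a larger inner product is the same as ranking them by a smaller Hamming distance.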

opened by lightningtyb 3

how to get 38 ms query time in Binary hash mode

In the paper, the query time using "hash table lookup" is 38.1 ms. In README.MD, the query time using "Binary hash" is 38 ms. Is passing the --use_binary_hash option to evaluate_retriever.py enough to reach a query time of 38 ms? In my case, --use_binary_hash does not help much (81.5 ms -> 75 ms). Is this a problem with my setup or a hardware limitation? My CPU: (48 cores) Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz

opened by xuanricheng 2

Loading passage binary codes from BinaryHash Faiss Index stored on disk

Hi @ikuyamada,

Thanks for the awesome repository and clean code for BPR!

Suppose we have stored the passage embeddings (binary codes) in a faiss BinaryHash index and saved it to disk. Can we get the passage embeddings back by loading only the faiss BinaryHash index? Or would we also need to save the passage embeddings separately (which would take more memory)?

Kind Regards, Nandan Thakur

opened by thakur-nandan 2

Use your own passage

Hi

I was wondering if there is a way to use my own passages instead of Wikipedia when retrieving results. Just changing the TSV file in "InMemoryPassageDB" gives an "index is out of range" error. What would be the best way to do this without going through the whole training process?
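One possible route, sketched here under assumptions, is to write your own passages in the same TSV layout as psgs_w100.tsv and then rerun steps 1 and 3 of "Reproducing Experiments" with an existing checkpoint, so that the passage database and the embeddings refer to the same passage IDs. The column order (id, text, title) with a header row is assumed from the DPR file and should be verified against build_passage_db.py. The helper write_passages_tsv is hypothetical and not part of this repository.

```python
import csv
import io


def write_passages_tsv(passages, fileobj):
    """Write (title, text) pairs as tab-separated rows: id, text, title."""
    writer = csv.writer(fileobj, delimiter="\t", lineterminator="\n")
    writer.writerow(["id", "text", "title"])  # assumed header row
    for passage_id, (title, text) in enumerate(passages, start=1):
        writer.writerow([passage_id, text, title])


# Toy passages; an in-memory buffer stands in for a real file.
my_passages = [
    ("Example title A", "First example passage."),
    ("Example title B", "Second example passage."),
]
buffer = io.StringIO()
write_passages_tsv(my_passages, buffer)
tsv_text = buffer.getvalue()
print(tsv_text)
```

To write to disk instead, open the target with open(path, "w", newline="", encoding="utf-8") and pass that file object to the helper.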

opened by sb1992 1

Can we load and search the BPR Flat Binary Index in GPU

Hi,

I have a 2.5GB Flat Binary Index.

I want to put it in GPU to search,

I tried this

res = faiss.StandardGpuResources()
self.index = faiss.GpuIndexBinaryFlat(res, binary_index)

The index loads onto the GPU, but during search the GPU utilization goes to 100% and it dies.

Any idea how to load it and search it?

I have a Tesla V100 16GB.

opened by astar10239 3

