Binary Passage Retriever (BPR) - an efficient passage retriever for open-domain question answering

BPR

Binary Passage Retriever (BPR) is an efficient neural retrieval model for open-domain question answering. BPR integrates a learning-to-hash technique into Dense Passage Retriever (DPR) to represent passage embeddings as compact binary codes rather than continuous vectors. This substantially reduces the memory footprint of the passage index (from 65GB to 2GB, as reported in the paper) without loss of accuracy on the Natural Questions and TriviaQA datasets.
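To make the memory saving concrete, the following back-of-the-envelope sketch compares the raw storage of float32 vectors with that of packed binary codes. The embedding width (768) and the corpus size (about 21 million passages) are assumptions chosen for illustration, not values stated in this README, and the helper name index_size_bytes is hypothetical.

```python
# Rough index size: float32 vectors vs. packed binary codes.
# ASSUMPTIONS (not from this README): 768-dimensional embeddings and
# roughly 21 million passages; substitute your own values.


def index_size_bytes(num_passages: int, dim: int, binary: bool) -> int:
    """Return the raw storage needed for the passage vectors alone."""
    if binary:
        # One bit per dimension, packed eight to a byte.
        return num_passages * (dim // 8)
    # One 4-byte float32 per dimension.
    return num_passages * dim * 4


NUM_PASSAGES = 21_000_000  # assumed corpus size
DIM = 768                  # assumed embedding width

dense = index_size_bytes(NUM_PASSAGES, DIM, binary=False)
packed = index_size_bytes(NUM_PASSAGES, DIM, binary=True)

print(f"dense : {dense / 1e9:.1f} GB")
print(f"binary: {packed / 1e9:.1f} GB")
print(f"ratio : {dense // packed}x")
```

Under these assumptions each passage shrinks from 3072 bytes to 96 bytes, which is where the 32-fold reduction comes from. Index overhead and the passage text itself are not counted.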

BPR was originally developed to improve the computational efficiency of the Sōseki question answering system submitted to the Systems under 6GB track in the NeurIPS 2020 EfficientQA competition. Please refer to our ACL 2021 paper for further technical details.

Installation

BPR can be installed using Poetry:

poetry install

The virtual environment automatically created by Poetry can be activated by running poetry shell.

Alternatively, you can install required libraries using pip:

pip install -r requirements.txt

Trained Models

(coming soon)

Reproducing Experiments

Before you start, you need to download the datasets available on the DPR website into <DPR_DATASET_DIR>.

The experimental results on the Natural Questions dataset can be reproduced by running the commands provided in this section. We used a server with 8 NVIDIA Tesla V100 GPUs with 16GB memory in the experiments. The results on the TriviaQA dataset can be reproduced by changing the file names of the input dataset to the corresponding ones (e.g., nq-train.json -> trivia-train.json).

1. Building passage database

python build_passage_db.py \
    --passage_file=<DPR_DATASET_DIR>/wikipedia_split/psgs_w100.tsv \
    --output_file=<PASSAGE_DB_FILE>
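For readers who want to inspect the input or prepare a file in the same layout, here is a minimal sketch of reading a DPR-style passage TSV, which has a header row followed by three tab-separated columns (id, text, title). It does not reproduce what build_passage_db.py does internally; the function name read_passages and the two sample rows are hypothetical.

```python
import csv
import io
from typing import Iterable, Iterator, Tuple


def read_passages(lines: Iterable[str]) -> Iterator[Tuple[int, str, str]]:
    """Yield (passage_id, title, text) from tab-separated lines with a header."""
    reader = csv.reader(lines, delimiter="\t")
    for row in reader:
        if row[0] == "id":  # skip the header row
            continue
        passage_id, text, title = row
        yield int(passage_id), title, text


# Hypothetical two-row sample in the (id, text, title) column order.
sample = io.StringIO(
    "id\ttext\ttitle\n"
    "1\tAaron is a prophet in the Abrahamic religions.\tAaron\n"
    "2\tBPR stores passages as binary codes.\tBPR\n"
)

passages = {pid: (title, text) for pid, title, text in read_passages(sample)}
print(len(passages), passages[2][0])
```

To read a real file, pass an open file handle (for example, open(path, newline="", encoding="utf-8")) instead of the in-memory sample.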

2. Training BPR

python train_biencoder.py \
   --gpus=8 \
   --distributed_backend=ddp \
   --train_file=<DPR_DATASET_DIR>/retriever/nq-train.json \
   --eval_file=<DPR_DATASET_DIR>/retriever/nq-dev.json \
   --gradient_clip_val=2.0 \
   --max_epochs=40 \
   --binary

3. Building passage embeddings

python generate_embeddings.py \
   --biencoder_file=<BPR_CHECKPOINT_FILE> \
   --output_file=<EMBEDDING_FILE> \
   --passage_db_file=<PASSAGE_DB_FILE> \
   --batch_size=4096 \
   --parallel

4. Evaluating BPR

python evaluate_retriever.py \
    --binary_k=1000 \
    --biencoder_file=<BPR_CHECKPOINT_FILE> \
    --embedding_file=<EMBEDDING_FILE> \
    --passage_db_file=<PASSAGE_DB_FILE> \
    --qa_file=<DPR_DATASET_DIR>/retriever/qas/nq-test.csv \
    --parallel
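The --binary_k option here and the --top_k option used in the next step correspond to a two-stage search: the paper describes generating candidates by Hamming distance between binary codes and then reranking them by the inner product between the continuous query vector and the binary passage codes. The following NumPy sketch illustrates that idea on random toy data. It is not the code in evaluate_retriever.py, and the function names to_binary and retrieve are hypothetical.

```python
import numpy as np


def to_binary(x: np.ndarray) -> np.ndarray:
    """Map real-valued vectors to {-1, +1} codes with the sign function."""
    return np.where(x >= 0, 1, -1).astype(np.int8)


def retrieve(query: np.ndarray, passage_codes: np.ndarray,
             binary_k: int, top_k: int) -> np.ndarray:
    """Return indices of the top_k passages for one continuous query vector."""
    dim = passage_codes.shape[1]

    # Stage 1: candidate generation by Hamming distance between binary codes.
    query_code = to_binary(query).astype(np.int32)
    inner = passage_codes.astype(np.int32) @ query_code
    hamming = (dim - inner) // 2
    candidates = np.argsort(hamming, kind="stable")[:binary_k]

    # Stage 2: rerank the candidates using the continuous query vector.
    scores = passage_codes[candidates].astype(np.float64) @ query.astype(np.float64)
    order = np.argsort(-scores, kind="stable")[:top_k]
    return candidates[order]


# Toy data: 1000 random passage codes of width 64 (sizes are arbitrary).
rng = np.random.default_rng(0)
passages = to_binary(rng.standard_normal((1000, 64)))
query = rng.standard_normal(64)

top = retrieve(query, passages, binary_k=100, top_k=5)
print(top)
```

A larger binary_k gives the reranker more candidates to choose from at the cost of more inner products per query.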

5. Creating dataset for reader

python evaluate_retriever.py \
    --binary_k=1000 \
    --biencoder_file=<BPR_CHECKPOINT_FILE> \
    --embedding_file=<EMBEDDING_FILE> \
    --passage_db_file=<PASSAGE_DB_FILE> \
    --qa_file=<DPR_DATASET_DIR>/retriever/qas/nq-train.csv \
    --output_file=<READER_TRAIN_FILE> \
    --top_k=200 \
    --parallel

python evaluate_retriever.py \
    --binary_k=1000 \
    --biencoder_file=<BPR_CHECKPOINT_FILE> \
    --embedding_file=<EMBEDDING_FILE> \
    --passage_db_file=<PASSAGE_DB_FILE> \
    --qa_file=<DPR_DATASET_DIR>/retriever/qas/nq-dev.csv \
    --output_file=<READER_DEV_FILE> \
    --top_k=200 \
    --parallel

python evaluate_retriever.py \
    --binary_k=1000 \
    --biencoder_file=<BPR_CHECKPOINT_FILE> \
    --embedding_file=<EMBEDDING_FILE> \
    --passage_db_file=<PASSAGE_DB_FILE> \
    --qa_file=<DPR_DATASET_DIR>/retriever/qas/nq-test.csv \
    --output_file=<READER_TEST_FILE> \
    --top_k=200 \
    --parallel

6. Training reader

python train_reader.py \
   --gpus=8 \
   --distributed_backend=ddp \
   --train_file=<READER_TRAIN_FILE> \
   --validation_file=<READER_DEV_FILE> \
   --test_file=<READER_TEST_FILE> \
   --learning_rate=2e-5 \
   --max_epochs=20 \
   --accumulate_grad_batches=4 \
   --nq_gold_train_file=<DPR_DATASET_DIR>/gold_passages_info/nq_train.json \
   --nq_gold_validation_file=<DPR_DATASET_DIR>/gold_passages_info/nq_dev.json \
   --nq_gold_test_file=<DPR_DATASET_DIR>/gold_passages_info/nq_test.json \
   --train_batch_size=1 \
   --eval_batch_size=2 \
   --gradient_clip_val=2.0

7. Evaluating reader

python evaluate_reader.py \
    --gpus=8 \
    --distributed_backend=ddp \
    --checkpoint_file=<READER_CHECKPOINT_FILE> \
    --eval_batch_size=1

License

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Citation

If you find this work useful, please cite the following paper:

@inproceedings{yamada2021bpr,
  title={Efficient Passage Retrieval with Hashing for Open-domain Question Answering},
  author={Ikuya Yamada and Akari Asai and Hannaneh Hajishirzi},
  booktitle={ACL},
  year={2021}
}
Comments
  • Evaluation result

    The experimental results are far lower than those reported in the paper. My environment is as follows:

    • OS: Ubuntu 18.04.5 LTS
    • Python: 3.8.10
    • CPU: Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz
    • GPU: 2 * TITAN RTX 24GB
    • Memory: 125GB
    • Other packages follow requirements.txt.

    Our evaluation steps are:

    1. Building passage database
    2. Training BPR
    • python train_biencoder.py --gpus=2 --distributed_backend=ddp --train_file=/downloads/data/retriever/nq-train.json --eval_file=/downloads/data/retriever/nq-dev.json --gradient_clip_val=2.0 --max_epochs=40 --binary

    • After training, there were two more folders (version_2 and version_3) in the "./biencoder" folder. We found that only version_3 has a checkpoints folder, so <BPR_CHECKPOINT_FILE> is "./biencoder/version_3/checkpoints/last.ckpt".

    3. Building passage embeddings
    • CUDA_VISIBLE_DEVICES=0,1 python generate_embeddings.py --biencoder_file=./biencoder/version_3/checkpoints/last.ckpt --output_file=./biencoder/embedding/em_my --passage_db_file=./passage_db --batch_size=2048 --parallel

    • We only changed batch_size from 4096 to 2048. Building the embeddings took more time than training!

    4. Evaluating BPR
    • The top-1 precision is 38.78, which is much lower than the 41.1 reported in the paper and the 49 reported on GitHub.
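For reference when comparing such numbers, retrieval quality in DPR-style evaluation is usually reported as top-k accuracy: the percentage of questions for which at least one of the k highest-ranked passages contains an answer. A minimal sketch of that metric follows, assuming per-question hit flags ordered by rank. The function name top_k_accuracy and the sample flags are hypothetical, and this is not the repository's evaluation code.

```python
from typing import Sequence


def top_k_accuracy(hits_per_question: Sequence[Sequence[bool]], k: int) -> float:
    """Percentage of questions with an answer-bearing passage in the top k."""
    if not hits_per_question:
        raise ValueError("no questions given")
    matched = sum(1 for hits in hits_per_question if any(hits[:k]))
    return 100.0 * matched / len(hits_per_question)


# Hypothetical hit flags for four questions, ordered by retrieval rank.
hits = [
    [True, False, False],
    [False, True, False],
    [False, False, False],
    [False, False, True],
]

print(top_k_accuracy(hits, 1), top_k_accuracy(hits, 3))
```

Because the metric counts a question once no matter how many of its passages match, top-1 accuracy can differ noticeably between runs that rank the same passages in a slightly different order.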

    opened by xuanricheng 4
  • Reproducing issues: broken pipe & CUDA out of memory errors

    Hi,

    I was trying to train BPR by running

    python train_biencoder.py --gpus=7 --distributed_backend=ddp --train_file=nq-train.json \
    --eval_file=nq-dev.json --gradient_clip_val=2.0 --max_epochs=40 --binary --train_batch_size=4 --eval_batch_size=4
    

    However, there are a lot of errors. For example, after the validation sanity check, there is a broken pipe error in multiprocessing/connection.py; the output is listed below:

    Traceback (most recent call last):
      File "/miniconda3/envs/bpr/lib/python3.7/multiprocessing/queues.py", line 242, in _feed
        send_bytes(obj)
      File "/miniconda3/envs/bpr/lib/python3.7/multiprocessing/connection.py", line 200, in send_bytes
        self._send_bytes(m[offset:offset + size])
      File "/miniconda3/envs/bpr/lib/python3.7/multiprocessing/connection.py", line 404, in _send_bytes
        self._send(header + buf)
      File "/miniconda3/envs/bpr/lib/python3.7/multiprocessing/connection.py", line 368, in _send
        n = write(self._handle, buf)
    BrokenPipeError: [Errno 32] Broken pipe
    

    Furthermore, I encountered CUDA out-of-memory issues. The trimmed output is attached below. (Each line is repeated 3 times because 3 of the 7 GPUs I am using encountered OOM errors.)

    Traceback (most recent call last):
      File "bpr/train_biencoder.py", line 53, in <module>
        trainer.fit(model)
      File "miniconda3/envs/bpr/lib/python3.7/site-packages/pytorch_lightning/trainer/states.py", line 48, in wrapped_fn
        result = fn(self, *args, **kwargs)
      File "miniconda3/envs/bpr/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1046, in fit
        self.accelerator_backend.train(model)
      File "miniconda3/envs/bpr/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_backend.py", line 57, in train
        self.ddp_train(process_idx=self.task_idx, mp_queue=None, model=model)
      File "/gscratch/cse/xyu530/miniconda3/envs/bpr/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_backend.py", line 224, in ddp_train
        results = self.trainer.run_pretrain_routine(model)
      File "miniconda3/envs/bpr/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1239, in run_pretrain_routine
        self.train()
      File "miniconda3/envs/bpr/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 394, in train
        self.run_training_epoch()
      File "miniconda3/envs/bpr/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 516, in run_training_epoch
        self.run_evaluation(test_mode=False)
      File "miniconda3/envs/bpr/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 582, in run_evaluation
        eval_results = self._evaluate(self.model, dataloaders, max_batches, test_mode)
      File "miniconda3/envs/bpr/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 396, in _evaluate
        eval_results = self.__run_eval_epoch_end(test_mode, outputs, dataloaders, using_eval_result)
      File "miniconda3/envs/bpr/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 490, in __run_eval_epoch_end
        eval_results = model.validation_epoch_end(eval_results)
      File "bpr/bpr/biencoder.py", line 246, in validation_epoch_end
        dist.all_gather(passage_repr_list, passage_repr)
      File "miniconda3/envs/bpr/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1185, in all_gather
        work = _default_pg.allgather([tensor_list], [tensor])
    RuntimeError: CUDA out of memory. Tried to allocate 1.14 GiB (GPU X; 10.76 GiB total capacity; 8.14 GiB already allocated; 526.56 MiB free; 9.29 GiB reserved in total by PyTorch)
        work = _default_pg.allgather([tensor_list], [tensor])
    

    Sorry for putting all these outputs here!

    I installed BPR with pip install -r requirements.txt and built the passage database successfully. I am using 7 GeForce RTX 2080 Ti GPUs.

    Thanks for any help!

    opened by velocityCavalry 4
  • Puzzles of hamming distance

    Hi, this work is awesome for efficient retrieval, and I learned a lot from it. I have a question about the Hamming distance between the query hash code and the document hash code. The paper states that hamming_distance(q, d) = 1/2 (const - <q, d>), where <q, d> is the inner product. In index.py, however, the score is calculated by np.einsum("ijk,ik->ij", passage_embeddings, query_embeddings). I wonder whether the source code is consistent with the equation, and I am not sure about the shapes of passage_embeddings and query_embeddings. I hope to hear from you. Thank you so much!
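The identity quoted in this question can be checked numerically. For codes in {-1, +1}, each matching position contributes +1 to the inner product and each differing position contributes -1, so <q, d> = dim - 2 * hamming, and the constant is the code length. The sketch below also shows the shapes implied by the einsum signature "ijk,ik->ij": a 3-D array indexed by (query, candidate, dimension) and a 2-D array indexed by (query, dimension). The toy sizes are arbitrary, and reading i as the query index and j as the candidate index is an assumption about how index.py uses the call.

```python
import numpy as np

rng = np.random.default_rng(0)
num_queries, num_candidates, dim = 2, 5, 16  # toy sizes for illustration

# Codes in {-1, +1}: one set of candidate passages per query.
passage_codes = rng.choice([-1, 1], size=(num_queries, num_candidates, dim))
query_codes = rng.choice([-1, 1], size=(num_queries, dim))

# Hamming distance counted directly as the number of differing positions.
hamming = (passage_codes != query_codes[:, None, :]).sum(axis=-1)

# "ijk,ik->ij": for query i and candidate j, sum the products over dimension k.
inner = np.einsum("ijk,ik->ij", passage_codes, query_codes)

# Identity for +/-1 codes: hamming = (dim - inner) / 2.
assert np.array_equal(hamming, (dim - inner) // 2)
print(inner.shape)
```

Since the Hamming distance is a decreasing function of the inner product, ranking candidates by larger inner product and ranking them by smaller Hamming distance give the same order when both vectors are binary.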

    opened by lightningtyb 3
  • how to get 38 ms query time in Binary hash mode

    In the paper, the query time using "hash table lookup" is 38.1ms. In README.MD, the query time using "Binary hash" is 38ms. Is it possible to reach a query time of 38ms just by using the --use_binary_hash option when running evaluate_retriever.py? In my case, --use_binary_hash does not help much with query time (81.5ms -> 75ms). Is this caused by my runtime setup or by my hardware? My CPU: (48 core) Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz

    opened by xuanricheng 2
  • Loading passage binary codes from BinaryHash Faiss Index stored on disk

    Hi @ikuyamada,

    Thanks for the awesome repository and clean code for BPR!

    Suppose we have stored the passage embeddings (binary codes) in a faiss BinaryHash index and saved it to disk. Can we recover the passage embeddings by loading only the faiss BinaryHash index, or would we also need to save the passage embeddings separately (which would take more memory)?

    Kind Regards, Nandan Thakur

    opened by thakur-nandan 2
  • Use your own passage

    Hi

    I was wondering whether there is a way to use my own passages instead of Wikipedia when retrieving results. Just changing the TSV file in "InMemoryPassageDB" gives an index-out-of-range error. What would be the best way to do this without going through the whole training process?

    opened by sb1992 1
  • Can we load and search the BPR Flat Binary Index in GPU

    Hi,

    I have a 2.5GB Flat Binary Index, and I want to load it onto the GPU for searching.

    I tried this:

    res = faiss.StandardGpuResources()
    self.index = faiss.GpuIndexBinaryFlat(res, binary_index)

    It loads onto the GPU, but during search the GPU goes to 100% utilization and dies.

    Any idea how to load and search it?

    I have a Tesla V100 16GB.

    opened by astar10239 3
Owner
Studio Ousia