Train Dense Passage Retriever (DPR) with a single GPU

Overview

Gradient Cached Dense Passage Retrieval

Gradient Cached Dense Passage Retrieval (GC-DPR) is an extension of the original DPR library.
We introduce the Gradient Cache technique, which enables scaling the batch size of the contrastive loss, and therefore the number of in-batch negatives, far beyond the GPU RAM limitation. With GC-DPR, you can reproduce the state-of-the-art open Q&A system trained on 8 x 32GB V100 GPUs using a single 11GB GPU.

To use Gradient Cache in your own project, check out our GradCache package.

Gradient Cache Technique

  • Retriever training quality depends on its effective batch size.
  • The contrastive loss (NLL with in-batch negatives) conditions on the entire batch and requires fitting the entire batch into GPU memory.
  • The Gradient Cache technique separates back propagation into a loss-to-representation part and a representation-to-encoder part. It uses an extra forward pass without gradient tracking to compute the representations, and then stores the representations' gradients in a cache.
  • The cached representation gradients remove the data dependency in encoder back propagation and allow updating the encoders one sub-batch at a time to fit GPU memory (see the sketch after this list).
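
Below is a minimal PyTorch sketch of the idea, assuming toy q_encoder/ctx_encoder callables that map input tensors to dense vectors and positives placed on the diagonal of the score matrix. It illustrates the technique and is not the repository's exact implementation.

import torch
import torch.nn.functional as F

def grad_cached_step(q_encoder, ctx_encoder, q_batch, ctx_batch, chunk_size=16):
    # Sketch of gradient-cached contrastive training; hypothetical encoders, not GC-DPR's exact code.
    # 1) Forward pass without gradient tracking to build full-batch representations.
    with torch.no_grad():
        q_reps = torch.cat([q_encoder(c) for c in q_batch.split(chunk_size)])
        ctx_reps = torch.cat([ctx_encoder(c) for c in ctx_batch.split(chunk_size)])

    # 2) Compute the in-batch-negative NLL loss on the detached representations
    #    and cache the gradients with respect to the representations.
    q_reps.requires_grad_()
    ctx_reps.requires_grad_()
    scores = q_reps @ ctx_reps.T                                  # (num_q, num_ctx) similarities
    labels = torch.arange(q_reps.size(0), device=scores.device)   # positives assumed on the diagonal
    loss = F.cross_entropy(scores, labels)
    loss.backward()
    q_cache = q_reps.grad.split(chunk_size)
    ctx_cache = ctx_reps.grad.split(chunk_size)

    # 3) Re-encode one sub-batch at a time with gradients enabled, back-propagating
    #    the cached representation gradients into the encoder parameters.
    for chunk, cached_grad in zip(q_batch.split(chunk_size), q_cache):
        q_encoder(chunk).backward(gradient=cached_grad)
    for chunk, cached_grad in zip(ctx_batch.split(chunk_size), ctx_cache):
        ctx_encoder(chunk).backward(gradient=cached_grad)
    return loss.detach()

Peak memory is determined by the sub-batch size rather than the full batch, which is what allows a large number of in-batch negatives on a single GPU.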

Details can be found in our paper Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup, to appear at The 6th Workshop on Representation Learning for NLP (RepL4NLP) 2021. Talk to us at ACL if you are interested!

@inproceedings{gao2021scaling,
     title={Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup},
     author={Luyu Gao and Yunyi Zhang and Jiawei Han and Jamie Callan},
     booktitle={Proceedings of the 6th Workshop on Representation Learning for NLP},
     year={2021},
}

Understanding the Code

One of the goals of this repo is to help clarify our gradient cache technique. The initial commit of the repo is a snapshot of DPR. You can run diff against it to see all changes made in GC-DPR. In general, we have modified the training and inference code. The model code is kept unchanged, so users should be able to use trained checkpoints interchangeably between DPR and GC-DPR.

Additional Features

Several changes were made to speed up training and inference and/or to facilitate easy reproduction of the original results. You may get some speed up with GC-DPR even when not using gradient cache.

  • Automatic Mixed Precision (AMP): GC-DPR replaces apex AMP with PyTorch native AMP. It is supported for both training and inference.
  • Multi-process Data Loading: batches are loaded with PyTorch's DataLoader during training and inference in loader processes, which largely avoids the GPU idling while waiting for text tokenization and batch creation.
  • Gradient Compensation: PyTorch's Distributed Data Parallel (DDP) average-reduces model parameter gradients across GPUs, while dense retriever training all-gathers representations to compute the loss. This means the actual gradient is divided by the number of GPUs. For reproducibility, we scale the loss according to the number of available devices to replicate this effect (see the sketch below). Another option is to use _register_comm_hook on the DDP object; we opt for loss scaling as it is easier to understand.
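
As a concrete illustration of this compensation, here is a hedged sketch (a hypothetical helper, not GC-DPR's exact code; in the repository the equivalent scaling is applied to the surrogate loss): DDP divides parameter gradients by the number of processes, while the reference DPR run used 8 GPUs, so scaling the loss by world_size / 8 reproduces the gradients of the 8-GPU run.

import torch.distributed as dist

def scale_loss_for_reproducibility(loss):
    # DDP average-reduces gradients, i.e. divides them by world_size, while the
    # original DPR recipe was tuned on 8 GPUs (an effective division by 8).
    # Multiplying the loss by world_size / 8 makes gradients match the 8-GPU run.
    world_size = dist.get_world_size() if dist.is_initialized() else 1
    return loss * (world_size / 8.0)

# Usage (sketch): scale_loss_for_reproducibility(loss).backward() before the optimizer step.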

The following instructions are adapted from the original DPR's.

Installation

You can clone this repo and install with pip.

git clone https://github.com/luyug/GC-DPR.git
cd GC-DPR
pip install .

GC-DPR is tested on Python 3.8.5, PyTorch 1.6.0 and Huggingface transformers 3.0.2.

Resources & Data formats

You can follow the original DPR instructions:

First, you need to prepare data for either retriever or reader training. Each of the DPR components has its own input/output data formats. You can see format descriptions below. DPR provides NQ & Trivia preprocessed datasets (and model checkpoints) to be downloaded from the cloud using our data/download_data.py tool. One needs to specify the resource name to be downloaded. Run 'python data/download_data.py' to see all options.

python data/download_data.py \
   --resource {key from download_data.py's RESOURCES_MAP}  \
   [optional --output_dir {your location}]  

The resource name matching is prefix-based. So if you need to download all data resources, just use --resource data.

Retriever input data format

The data format of the Retriever training data is JSON.
It contains pools of 2 types of negative passages per question, as well as positive passages and some additional information.

[  
  {  
   "question": "....",  
   "answers": ["...", "...", "..."],  
   "positive_ctxs": [{  
      "title": "...",  
      "text": "...."  
   }],  
   "negative_ctxs": ["..."],  
   "hard_negative_ctxs": ["..."]  
  },  
  ...  
]  

Elements' structure for negative_ctxs & hard_negative_ctxs is exactly the same as for positive_ctxs. The preprocessed data available for downloading also contains some extra attributes which may be useful for model modifications (like bm25 scores per passage). Still, they are not currently in use by DPR.
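
For illustration, a small sketch of writing a training file in this format (the example question, passages, and output path are placeholders, not part of the released data):

import json

example = {
    "question": "who wrote the declaration of independence",
    "answers": ["Thomas Jefferson"],
    "positive_ctxs": [{"title": "Declaration of Independence",
                       "text": "... drafted by Thomas Jefferson ..."}],
    "negative_ctxs": [],        # same element structure as positive_ctxs
    "hard_negative_ctxs": [{"title": "United States Declaration", "text": "..."}],
}

# Placeholder output path; the retriever expects a JSON list of such records.
with open("my_retriever_train.json", "w") as f:
    json.dump([example], f, indent=2)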

You can download the prepared NQ dataset used in the paper by using the 'data.retriever.nq' key prefix. Only the dev & train subsets are available in this format. We also provide question & answer-only CSV data files for all train/dev/test splits. Those are used for model evaluation, since our NQ preprocessing step loses a part of the original sample set. Use the 'data.retriever.qas.*' resource keys to get the respective sets for evaluation.

python data/download_data.py  
   --resource data.retriever  
   [optional --output_dir {your location}]  

Retriever training

In order to start training on one machine:

python train_dense_encoder.py  
   --encoder_model_type {hf_bert | pytext_bert | fairseq_roberta}  
   --pretrained_model_cfg {bert-base-uncased| roberta-base}  
   --train_file {train files glob expression}  
   --dev_file {dev files glob expression}  
   --output_dir {dir to save checkpoints}  
   --grad_cache
   --q_chunk_size {question sub-batch sizes}
   --ctx_chunk_size {context sub-batch sizes}
   --fp16  # if train with mixed precision

We introduce the following parameters for gradient cached training:

  • --grad_cache activates gradient cached training
  • --q_chunk_size: sub-batch size used when updating the question encoder, defaults to 16
  • --ctx_chunk_size: sub-batch size used when updating the context encoder, defaults to 8

The default settings work for an 11GB RTX 2080 Ti.

Retriever inference

You can follow the original DPR instructions:

Generating representation vectors for the static documents dataset is a highly parallelizable process which can take up to a few days if computed on a single GPU. You might want to use multiple available GPU servers by running the script on each of them independently and specifying their own shards.

python generate_dense_embeddings.py \
   --model_file {path to biencoder checkpoint} \
   --ctx_file {path to psgs_w100.tsv file} \
   --shard_id {shard_num, 0-based} --num_shards {total number of shards} \
   --out_file ${out files location + name PREFIX}  \
   --fp16  # if use mixed precision inference

In addition, we provide the --fp16 flag, which runs encoder inference with mixed precision using the autocast context manager. Mixed precision inference gives roughly a 2x speed up. The generated embeddings are converted to full precision before saving, for index search on x86 CPU.
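
Conceptually, the flag does something like the following sketch (the encoder call and helper name are placeholders, not the repository's exact inference loop):

import torch

@torch.no_grad()
def encode_passages_fp16(ctx_encoder, input_ids, attention_mask):
    # Run the encoder forward pass under autocast for mixed precision inference.
    with torch.cuda.amp.autocast():
        embeddings = ctx_encoder(input_ids, attention_mask)
    # Cast back to full precision before saving, so the vectors can be indexed
    # and searched on an x86 CPU.
    return embeddings.float().cpu().numpy()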

Retriever validation against the entire set of documents:

python dense_retriever.py \
   --model_file ${path to biencoder checkpoint} \
   --ctx_file  {path to all documents .tsv file} \
   --qa_file {path to test|dev .csv file} \
   --encoded_ctx_file "{encoded document files glob expression}" \
   --out_file {path to output json file with results} \
  --n-docs 200  

We additionally provide the flags --encode_q_and_save, --q_encoding_path, and --re_encode_q, which allow separating question encoding from index search. One can therefore opt to run question encoding on a GPU server and search on a CPU server.

The tool writes the retrieved results for subsequent reader model training into the specified out_file.
It is a JSON file with the following format:

[  
    {  
        "question": "...",  
        "answers": ["...", "...", ... ],  
        "ctxs": [  
            {  
                "id": "...",  # passage id from database tsv file  
                "title": "",  
                "text": "....",  
                "score": "...",  # retriever score  
                "has_answer": true|false  
            },  
            ...  
        ]  
    },  
    ...  
]  

Results are sorted by their similarity score, from most relevant to least relevant.
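
For instance, a small sketch of consuming this output in Python (field names follow the format above; the file path is a placeholder):

import json

# Placeholder path to the dense_retriever.py out_file.
with open("retriever_results.json") as f:
    results = json.load(f)

for item in results[:3]:
    top = item["ctxs"][0]   # contexts are already sorted by retriever score
    print(item["question"], "->", top["title"], top["score"], top["has_answer"])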

Optional reader model input data pre-processing.

Since the reader model uses a specific combination of positive and negative passages for each question and also needs to know the answer span location in the bpe-tokenized form, it is recommended to preprocess and serialize the output from the retriever model before starting the reader training. This saves hours at train time.
If you don't run this preprocessing, the Reader training pipeline checks if the input file(s) extension is .pkl and if not, preprocesses and caches results automatically in the same folder as the original files.

python preprocess_reader_data.py \
   --retriever_results {path to a file with results from dense_retriever.py} \
   --gold_passages {path to gold passages info} \
   --do_lower_case \
   --pretrained_model_cfg {pretrained_cfg} \
   --encoder_model_type {hf_bert | pytext_bert | fairseq_roberta} \
   --out_file {path for output files} \
   --is_train_set  

Reader model training

python train_reader.py \
   --encoder_model_type {hf_bert | pytext_bert | fairseq_roberta} \
   --pretrained_model_cfg {bert-base-uncased| roberta-base} \
   --train_file "{globe expression for train files from #5 or #6 above}" \
   --dev_file "{globe expression for train files}" \
   --output_dir {path to output dir}  

Notes:

  • If you use pytext_bert or fairseq_roberta, you need to download pre-trained weights and specify the --pretrained_file parameter. Specify the directory location of the downloaded files for the 'pretrained.fairseq.roberta-base' resource prefix for the RoBERTa model, or the file path for pytext BERT (resource name 'pretrained.pytext.bert-base.model').
  • The reader training pipeline does model validation every --eval_step batches.
  • Like the bi-encoder, it saves model checkpoints on every validation.
  • Like the bi-encoder, there is no stop condition besides a specified number of epochs to train.
  • Like the bi-encoder, there is no best checkpoint selection logic, so one needs to select a checkpoint based on dev set validation performance, which is logged in the training process output.
  • Our current code only calculates the Exact Match metric.

Reader model inference

To run inference, run train_reader.py without specifying train_file. Make sure to specify model_file with the path to the checkpoint, passages_per_question_predict with the number of passages per question (used when saving the prediction file), and eval_top_docs with a list of top-passage threshold values from which to choose the question's answer span (to be printed in the logs). An example command line is as follows.

python train_reader.py \
  --prediction_results_file {some dir}/results.json \
  --eval_top_docs 10 20 40 50 80 100 \
  --dev_file {path to data.retriever_results.nq.single.test file} \
  --model_file {path to the reader checkpoint} \
  --dev_batch_size 80 \
  --passages_per_question_predict 100 \
  --sequence_length 350  

Best hyperparameter settings

An end-to-end example with the best settings for the NQ dataset. Most of this section follows DPR's instructions. The retriever training session is adjusted for a single 11GB GPU.

1. Download all retriever training and validation data:

python data/download_data.py --resource data.wikipedia_split.psgs_w100  
python data/download_data.py --resource data.retriever.nq  
python data/download_data.py --resource data.retriever.qas.nq  

2. Bi-encoder (retriever) training in single set mode.

Train on a single 11GB RTX 2080 Ti GPU:

python train_dense_encoder.py \
   --max_grad_norm 2.0 \
   --encoder_model_type hf_bert \
   --pretrained_model_cfg bert-base-uncased \
   --seed 12345 \
   --sequence_length 256 \
   --warmup_steps 1237 \
   --batch_size 128 \
   --do_lower_case \
   --train_file "{glob expression to train files downloaded as 'data.retriever.nq-train' resource}" \
   --dev_file "{glob expression to dev files downloaded as 'data.retriever.nq-dev' resource}" \
   --output_dir {your output dir} \
   --learning_rate 2e-05 \
   --num_train_epochs 40 \
   --dev_batch_size 16 \
   --val_av_rank_start_epoch 30 \
   --fp16 \
   --grad_cache \
   --global_loss_buf_sz 2097152 \
   --val_av_rank_max_qs 1000

This takes less than 2 days to complete the 40 epochs of training. The best checkpoint for the bi-encoder is usually the last one, but results should not differ much if you take any checkpoint after epoch ~25. We use a smaller value for --val_av_rank_max_qs to speed up validation.

3. Generate embeddings for Wikipedia.

Just use the instructions for "Generating representations for large documents set". It takes about 40 minutes to produce representation vectors for 21 million passages on 50 2-GPU servers.

4. Evaluate retrieval accuracy and generate top passage results for each of the train/dev/test datasets.

python dense_retriever.py \
   --model_file {path to checkpoint file from step 1} \
   --ctx_file {path to psgs_w100.tsv file} \
   --qa_file {path to test/dev qas file} \
   --encoded_ctx_file "{glob expression for generated files from step 3}" \
   --out_file {path for output json files} \
   --n-docs 100 \
   --validation_workers 32 \
   --batch_size 64  

Adjust batch_size based on the available number of GPUs; 64 should work for a 2-GPU server.

5. Reader training

Here is DPR's original instruction for reader training. We trained the reader model for large datasets using a single 8-GPU x 32GB server.

python train_reader.py \
   --seed 42 \
   --learning_rate 1e-5 \
   --eval_step 2000 \
   --do_lower_case \
   --eval_top_docs 50 \
   --encoder_model_type hf_bert \
   --pretrained_model_cfg bert-base-uncased \
   --train_file "{glob expression for train output files from step 4}" \
   --dev_file {glob expression for dev output file from step 4} \
   --warmup_steps 0 \
   --sequence_length 350 \
   --batch_size 16 \
   --passages_per_question 24 \
   --num_train_epochs 100000 \
   --dev_batch_size 72 \
   --passages_per_question_predict 50 \
   --output_dir {your save dir path}  

We found that the learning rate above works best with a static schedule, so one needs to stop training manually based on evaluation performance dynamics.
Our best results were achieved at 16-18 training epochs, or after ~60k model updates.

We provide all input and intermediate results for the e2e pipeline on the NQ dataset, and most of the similar resources for Trivia.

Misc.

  • TREC validation requires regexp-based matching. We support only retriever validation in the regexp mode. See the --match parameter option.
  • WebQ validation requires entity normalization, which is not included as of now.

Reference

If you find GC-DPR helpful, please consider citing our paper:

@inproceedings{gao2021scaling,
      title={Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup},
      author={Luyu Gao and Yunyi Zhang and Jiawei Han and Jamie Callan},
      booktitle={Proceedings of the 6th Workshop on Representation Learning for NLP},
      year={2021},
}

Also consider citing the original DPR paper:

@misc{karpukhin2020dense,  
    title={Dense Passage Retrieval for Open-Domain Question Answering},  
    author={Vladimir Karpukhin and Barlas Oğuz and Sewon Min and Patrick Lewis and Ledell Wu and Sergey Edunov and Danqi Chen and Wen-tau Yih},  
    year={2020},  
    eprint={2004.04906},  
    archivePrefix={arXiv},  
    primaryClass={cs.CL}  
}  

License

GC-DPR inherits DPR's CC-BY-NC 4.0 license as of now.

Comments
  • Multi GPU support

    Hi, does the current version of encoder training with GC support multiple GPUs? I tried to run the training on the NQ dataset by following the instructions in README.md, but on a machine with 2 GPUs it seems to run slower than on a single GPU, i.e. on a single GPU one step takes about 4 sec, but with two GPUs one step takes about 24 sec.

    opened by MXueguang 4
  • Error after pip install .

    I am also working on a contrastive learning problem (medical data) where each input sample is very large, so the number of samples per batch is very small. I saw your paper and found that your method is exactly what I need!!

    A quick question: on Windows, after pip install . I see

    Processing c:\users\user\dropbox\book\bnl\research\granger\fmri\code\gc-dpr
      Preparing metadata (setup.py) ... error
      error: subprocess-exited-with-error
    
      × python setup.py egg_info did not run successfully.
      │ exit code: 1
      ╰─> [14 lines of output]
          error: Multiple top-level packages discovered in a flat-layout: ['dpr', 'data'].
    
          To avoid accidental inclusion of unwanted files or directories,
          setuptools will not proceed with this build.
    
          If you are trying to create a single distribution with multiple packages
          on purpose, you should not rely on automatic discovery.
          Instead, consider the following options:
    
          1. set up custom discovery (`find` directive with `include` or `exclude`)
          2. use a `src-layout`
          3. explicitly set `py_modules` or `packages` with a list of names
    
          To find more information, look for "package discovery" on setuptools docs.
          [end of output]
    
      note: This error originates from a subprocess, and is likely not a problem with pip.
    error: metadata-generation-failed
    

    Did this error come up often when you were setting up your environment? Thanks!

    opened by RudolphXu 0
  • coCondenser hyperparameter

    Hello. Thank you for your great work!

    I have a few questions regarding coCondenser fine-tuning. First, for the 1st-stage and 2nd-stage fine-tuning on the MS MARCO dataset, could you share the numbers of negative samples and hard negative samples?

    Second, what are the criteria for selecting hard negative samples? For example, when the rank of the positive document is 5th, are the hard negative samples the documents ranked 1st to 4th?

    opened by ehdlf4999 0
  • fine tuning existing dpr model

    Hi, thanks for posting a nice repo! I see that we can train the DPR model with GC-DPR, but I guess we need to train it from scratch by loading base models (bert-base-uncased | roberta-base). How can we use this repo to fine-tune a pretrained DPR model? For example, we already have the DPR encoder models provided by Facebook: question_model = "facebook/dpr-question_encoder-single-nq-base", context_model = "facebook/dpr-ctx_encoder-single-nq-base".

    To make these models domain-specific, my idea is to fine-tune them with domain data. It would be helpful if you could let me know how to load the question and context models with train_dense_encoder. Any other suggestions would be appreciated.

    opened by vijayirlapati 0
  • Multiply by distributed_factor/8.

    Thanks for posting a really nice repo! While studying the code, I found the following at lines 669 and 691 of 'train_dense_encoder.py': ''' surrogate = surrogate * (trainer.distributed_factor / 8.) ''' I don't fully understand the reason for this multiplication. Can you explain it? Thank you 👍

    opened by Doohae 2
  • multilingual-bert issue?

    Hi,

    I found something weird: when using a multilingual BERT, e.g. bert-base-multilingual-uncased, it seems like grad_cache doesn't work. I know it sounds strange, since changing the BERT model shouldn't affect it, but I tried normal BERT, German BERT, and mBERT, and only the latter needs a very small batch_size (like 4) to run successfully. Other models like German BERT run with batch_size=128 without problems. Do you know the reason for this? By the way, great paper and code, extremely helpful! Thanks in advance!

    opened by Smu-Tan 4