
Dense Passage Retrieval

Dense Passage Retrieval (DPR) is a set of tools and models for state-of-the-art open-domain Q&A research. It is based on the following paper:

Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih. Dense Passage Retrieval for Open-Domain Question Answering. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781, 2020.

If you find this work useful, please cite the following paper:

@inproceedings{karpukhin-etal-2020-dense,
    title = "Dense Passage Retrieval for Open-Domain Question Answering",
    author = "Karpukhin, Vladimir and Oguz, Barlas and Min, Sewon and Lewis, Patrick and Wu, Ledell and Edunov, Sergey and Chen, Danqi and Yih, Wen-tau",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-main.550",
    doi = "10.18653/v1/2020.emnlp-main.550",
    pages = "6769--6781",
}

If you're interested in reproducing the experimental results in the paper based on our model checkpoints (i.e., you don't want to train the encoders from scratch), you might consider using the Pyserini toolkit, which has the experiments nicely packaged and installable via pip. Their toolkit also reports higher BM25 and hybrid scores.

Features

  1. Dense retriever model based on a bi-encoder architecture (a minimal scoring sketch follows this list).
  2. Extractive Q&A reader & ranker joint model, inspired by this paper.
  3. Related data pre- and post-processing tools.
  4. Dense retriever inference component based on a FAISS index.
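
The bi-encoder scores a question against a passage as the dot product of two independently computed vectors. Below is a minimal, illustrative sketch of that idea with Hugging Face BERT encoders; the function and variable names are ours for illustration and do not correspond to DPR's actual classes.

import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
question_encoder = BertModel.from_pretrained("bert-base-uncased")
passage_encoder = BertModel.from_pretrained("bert-base-uncased")

def encode(encoder, texts):
    # Tokenize and take the [CLS] vector as the sequence representation
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = encoder(**batch)
    return outputs[0][:, 0]  # works with both tuple and ModelOutput return types

question_vec = encode(question_encoder, ["who wrote hamlet"])
passage_vecs = encode(passage_encoder, [
    "Hamlet is a tragedy written by William Shakespeare ...",
    "The Globe Theatre was built in 1599 ...",
])
scores = question_vec @ passage_vecs.T  # higher dot product = more relevant passage
print(scores)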

New (March 2021) release

DPR codebase is upgraded with a number of enhancements and new models. Major changes:

  1. Hydra-based configuration for all the command line tools except the data loader (to be converted soon)
  2. Pluggable data processing layer to support custom datasets
  3. New retrieval model checkpoint with better performance.

New (March 2021) retrieval model

A new bi-encoder model trained on the NQ dataset only is now provided: a new checkpoint, training data, retrieval results and Wikipedia embeddings. The training data combines the original DPR NQ train set with a version of it in which hard negatives are mined with the DPR index itself, built from the previous NQ checkpoint. A bi-encoder model is trained from scratch on this combined data, and this training scheme gives a nice retrieval performance boost.

New vs old top-k documents retrieval accuracy on NQ test set (3610 questions).

| Top-k passages | Original DPR NQ model | New DPR model |
| -------------- | --------------------- | ------------- |
| 1              | 45.87                 | 52.47         |
| 5              | 68.14                 | 72.24         |
| 20             | 79.97                 | 81.33         |
| 100            | 85.87                 | 87.29         |

New model downloadable resource names (see how to use the download_data script below):

Checkpoint: checkpoint.retriever.single-adv-hn.nq.bert-base-encoder

New training data: data.retriever.nq-adv-hn-train

Retriever results for NQ test set: data.retriever_results.nq.single-adv-hn.test

Wikipedia embeddings: data.retriever_results.nq.single-adv-hn.wikipedia_passages

Installation

Installation from source. Python virtual or Conda environments are recommended.

git clone [email protected]:facebookresearch/DPR.git
cd DPR
pip install .

DPR is tested on Python 3.6+ and PyTorch 1.2.0+. DPR relies on third-party libraries for the encoder implementations. It currently supports Huggingface (version <=3.1.0) BERT, Pytext BERT and Fairseq RoBERTa encoder models. Because of the generality of the tokenization process, DPR uses Huggingface tokenizers as of now, so Huggingface is the only required dependency; Pytext & Fairseq are optional. Install them separately if you want to use those encoders.

Resources & Data formats

First, you need to prepare data for either retriever or reader training. Each of the DPR components has its own input/output data formats. You can see format descriptions below. DPR provides NQ & Trivia preprocessed datasets (and model checkpoints) to be downloaded from the cloud using our dpr/data/download_data.py tool. One needs to specify the resource name to be downloaded. Run 'python data/download_data.py' to see all options.

python data/download_data.py \
	--resource {key from download_data.py's RESOURCES_MAP}  \
	[optional --output_dir {your location}]

The resource name matching is prefix-based. So if you need to download all data resources, just use --resource data.

Retriever input data format

The default data format of the Retriever training data is JSON. It contains pools of 2 types of negative passages per question, as well as positive passages and some additional information.

[
  {
	"question": "....",
	"answers": ["...", "...", "..."],
	"positive_ctxs": [{
		"title": "...",
		"text": "...."
	}],
	"negative_ctxs": ["..."],
	"hard_negative_ctxs": ["..."]
  },
  ...
]

The element structure for negative_ctxs & hard_negative_ctxs is exactly the same as for positive_ctxs. The preprocessed data available for download also contains some extra attributes which may be useful for model modifications (like BM25 scores per passage); they are not currently used by DPR.
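
As a quick sanity check of a file in this format, one can load it and print the pool sizes (a small sketch; the file name below is just an example):

import json

with open("nq-train.json") as f:
    samples = json.load(f)

print("questions:", len(samples))
example = samples[0]
print(example["question"], example["answers"])
print("positives:", len(example["positive_ctxs"]),
      "negatives:", len(example["negative_ctxs"]),
      "hard negatives:", len(example["hard_negative_ctxs"]))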

You can download the prepared NQ dataset used in the paper by using the 'data.retriever.nq' key prefix. Only the dev & train subsets are available in this format. We also provide question & answer-only CSV data files for all train/dev/test splits. Those are used for model evaluation, since our NQ preprocessing step loses part of the original sample set. Use the 'data.retriever.qas.*' resource keys to get the respective sets for evaluation.

python data/download_data.py
	--resource data.retriever
	[optional --output_dir {your location}]

DPR data formats and custom processing

One can use their own data format and custom data parsing & loading logic by inheriting from DPR's Dataset classes in the dpr/data/{biencoder|retriever|reader}_data.py files and implementing the load_data() and __getitem__() methods. See the DPR hydra configuration instructions.
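
A rough sketch of such a subclass is below. The Dataset, BiEncoderSample and BiEncoderPassage names follow dpr/data/biencoder_data.py, but exact base-class signatures may differ between versions, so treat this as a starting point rather than a drop-in implementation; the JSONL layout is invented for the example.

import json

from dpr.data.biencoder_data import BiEncoderPassage, BiEncoderSample, Dataset


class MyJsonlDataset(Dataset):
    def __init__(self, file: str):
        super().__init__()
        self.file = file
        self.data = []

    def load_data(self, start_pos: int = -1, end_pos: int = -1):
        # Read one JSON object per line into memory
        with open(self.file) as f:
            self.data = [json.loads(line) for line in f]

    def __getitem__(self, index) -> BiEncoderSample:
        row = self.data[index]
        sample = BiEncoderSample()
        sample.query = row["question"]
        sample.positive_passages = [
            BiEncoderPassage(p["text"], p["title"]) for p in row["positives"]
        ]
        sample.negative_passages = []
        sample.hard_negative_passages = [
            BiEncoderPassage(p["text"], p["title"]) for p in row["hard_negatives"]
        ]
        return sample

    def __len__(self):
        return len(self.data)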

Retriever training

Retriever training quality depends on its effective batch size. The results reported in the paper were obtained with 8 x 32GB GPUs. To start training on a single machine:

python train_dense_encoder.py \
train_datasets=[list of train datasets, comma separated without spaces] \
dev_datasets=[list of dev datasets, comma separated without spaces] \
train=biencoder_local \
output_dir={path to checkpoints dir}

Example for NQ dataset

python train_dense_encoder.py \
train_datasets=[nq_train] \
dev_datasets=[nq_dev] \
train=biencoder_local \
output_dir={path to checkpoints dir}

DPR uses HuggingFace BERT-base as the encoder by default. Other ready options include Fairseq's RoBERTa and Pytext BERT models. One can select them by either changing the encoder configuration files (conf/encoder/hf_bert.yaml) or providing a new configuration file in the conf/encoder dir and enabling it with the encoder={new file name} command line parameter.

Notes:

  • If you want to use pytext bert or fairseq roberta, you will need to download pre-trained weights and specify encoder.pretrained_file parameter. Specify the dir location of the downloaded files for 'pretrained.fairseq.roberta-base' resource prefix for RoBERTa model or the file path for pytext BERT (resource name 'pretrained.pytext.bert-base.model').
  • Validation and checkpoint saving happens according to train.eval_per_epoch parameter value.
  • There is no stop condition besides a specified amount of epochs to train (train.num_train_epochs configuration parameter).
  • Every evaluation saves a model checkpoint.
  • The best checkpoint is logged in the train process output.
  • Regular NLL classification loss validation for bi-encoder training can be replaced with average rank evaluation. It aggregates passage and question vectors from the input data passage pools, computes a large similarity matrix over those representations, and then averages the rank of the gold passage for each question. We found this metric to correlate better with final retrieval performance than the NLL classification loss. Note, however, that this average rank validation works differently in DistributedDataParallel vs DataParallel PyTorch modes. See the train.val_av_rank_* set of parameters to enable this mode and modify its settings; a minimal sketch of the computation follows.
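
For reference, here is a minimal sketch of the average-rank idea (not DPR's actual validation code): given pooled question vectors, passage vectors and the gold-passage index per question, compute the similarity matrix and average the gold passage's rank.

import torch

def average_rank(q_vectors: torch.Tensor, p_vectors: torch.Tensor, gold_idx: torch.Tensor) -> float:
    scores = q_vectors @ p_vectors.T                       # (n_questions, n_passages) similarity matrix
    gold_scores = scores.gather(1, gold_idx.unsqueeze(1))  # score of each question's gold passage
    ranks = (scores > gold_scores).sum(dim=1)              # how many passages outrank the gold one
    return ranks.float().mean().item()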

See the section 'Best hyperparameter settings' below for an end-to-end example of our best setups.

Retriever inference

Generating representation vectors for the static documents dataset is a highly parallelizable process which can take up to a few days if computed on a single GPU. You might want to use multiple available GPU servers by running the script on each of them independently and specifying a different shard for each.

python generate_dense_embeddings.py \
	model_file={path to biencoder checkpoint} \
	ctx_src={name of the passages resource, set to dpr_wiki to use our original wikipedia split} \
	shard_id={shard_num, 0-based} num_shards={total number of shards} \
	out_file={result files location + name PREFIX}

The ctx_src parameter takes either the name of a downloadable passages resource or just a source name from the conf/ctx_sources/default_sources.yaml file.

Note: you can use a much larger batch size here compared to training mode. For example, setting batch_size to 128 on a 2-GPU (16GB) server should work fine. You can download already generated Wikipedia embeddings from our original model (trained on the NQ dataset) using the resource key 'data.retriever_results.nq.single.wikipedia_passages'. The embeddings resource name for the new, better model is 'data.retriever_results.nq.single-adv-hn.wikipedia_passages'.

We generally use the following params on 50 2-gpu nodes: batch_size=128 shard_id=0 num_shards=50
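
For illustration, this is roughly how shard_id/num_shards split the passage collection so that each server encodes only its own contiguous slice (a sketch, not DPR's exact code):

def shard_bounds(total_passages: int, shard_id: int, num_shards: int):
    # Each shard gets a contiguous slice of the passage collection.
    per_shard = (total_passages + num_shards - 1) // num_shards  # ceiling division
    start = shard_id * per_shard
    end = min(start + per_shard, total_passages)
    return start, end

print(shard_bounds(21_000_000, shard_id=0, num_shards=50))   # (0, 420000)
print(shard_bounds(21_000_000, shard_id=49, num_shards=50))  # (20580000, 21000000)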

Retriever validation against the entire set of documents:

python dense_retriever.py \
	model_file={path to a checkpoint downloaded from our download_data.py as 'checkpoint.retriever.single.nq.bert-base-encoder'} \
	qa_dataset={the name of the test source} \
	ctx_datatsets=[{list of passage sources's names, comma separated without spaces}] \
	encoded_ctx_files=[{list of encoded document files glob expression, comma separated without spaces}] \
	out_file={path to output json file with results} 
	

For example, if your generated embeddings for two passage sets are saved as ~/myproject/embeddings_passages1/wiki_passages_* and ~/myproject/embeddings_passages2/wiki_passages_* files and you want to evaluate on the NQ dataset:

python dense_retriever.py \
	model_file={path to a checkpoint file} \
	qa_dataset=nq_test \
	ctx_datatsets=[dpr_wiki] \
	encoded_ctx_files=[\"~/myproject/embeddings_passages1/wiki_passages_*\",\"~/myproject/embeddings_passages2/wiki_passages_*\"] \
	out_file={path to output json file with results} 

The tool writes the retrieved results for subsequent reader model training into the specified out_file. It is a JSON file with the following format:

[
    {
        "question": "...",
        "answers": ["...", "...", ... ],
        "ctxs": [
            {
                "id": "...", # passage id from database tsv file
                "title": "",
                "text": "....",
                "score": "...",  # retriever score
                "has_answer": true|false
            },
            ...
        ]
    },
    ...
]

Results are sorted by their similarity score, from most relevant to least relevant.
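
As an example, here is a small sketch (the file name is hypothetical) that computes top-k retrieval accuracy from such a results file using the has_answer flags:

import json

with open("nq_test_retriever_results.json") as f:
    results = json.load(f)

for k in (1, 5, 20, 100):
    # A question counts as a hit if any of its top-k passages contains the answer.
    hits = sum(any(ctx["has_answer"] for ctx in r["ctxs"][:k]) for r in results)
    print(f"top-{k} accuracy: {hits / len(results):.4f}")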

By default, dense_retriever uses an exhaustive search process, but you can opt in to lossy index types. We provide HNSW and HNSW_SQ index options. Enable them with the indexer=hnsw or indexer=hnsw_sq command line arguments. Note that these indexes may be of limited use from a research point of view, since their fast retrieval comes at the cost of much longer indexing time and higher RAM usage. The similarity score provided is the dot product for the default exhaustive-search case (indexer=flat) and L2 distance in a modified representation space in the case of the HNSW index.
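
A minimal FAISS sketch contrasting the two index families (illustrative; not the code from dpr/indexer/faiss_indexers.py):

import faiss
import numpy as np

dim = 768
passages = np.random.rand(10000, dim).astype("float32")
queries = np.random.rand(4, dim).astype("float32")

flat = faiss.IndexFlatIP(dim)        # exact inner-product (dot-product) search, as with indexer=flat
flat.add(passages)
flat_scores, flat_ids = flat.search(queries, 10)

hnsw = faiss.IndexHNSWFlat(dim, 32)  # approximate search: faster queries, slower indexing, more RAM
# IndexHNSWFlat works with L2 distance, which is why DPR maps vectors into a
# modified representation space before indexing them this way.
hnsw.add(passages)
hnsw_scores, hnsw_ids = hnsw.search(queries, 10)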

Reader model training

python train_extractive_reader.py \
	encoder.sequence_length=350 \
	train_files={path to the retriever train set results file} \
	dev_files={path to the retriever dev set results file}  \
	output_dir={path to output dir}

Default hyperparameters are set for a single node with an 8-GPU setup. Modify them as needed in the conf/train/extractive_reader_default.yaml and conf/extractive_reader_train_cfg.yaml configuration files or override specific parameters from the command line. The first run will preprocess train_files & dev_files, convert them into a serialized set of .pkl files in the same location, and use those files on all subsequent runs.

Notes:

  • If you want to use pytext bert or fairseq roberta, you will need to download pre-trained weights and specify encoder.pretrained_file parameter. Specify the dir location of the downloaded files for 'pretrained.fairseq.roberta-base' resource prefix for RoBERTa model or the file path for pytext BERT (resource name 'pretrained.pytext.bert-base.model').
  • Reader training pipeline does model validation every train.eval_step batches
  • Like the bi-encoder, it saves model checkpoints on every validation
  • Like the bi-encoder, there is no stop condition besides a specified amount of epochs to train.
  • Like the bi-encoder, there is no best checkpoint selection logic, so one needs to select that based on dev set validation performance which is logged in the train process output.
  • Our current code only calculates the Exact Match metric.

Reader model inference

In order to run inference, run train_extractive_reader.py without specifying train_files. Make sure to specify model_file with the path to the checkpoint, passages_per_question_predict with the number of passages per question (used when saving the prediction file), and eval_top_docs with a list of top-passage threshold values from which to choose the question's answer span (to be printed in the logs). An example command line is as follows.

python train_extractive_reader.py \
  prediction_results_file={path to a file to write the results to} \
  eval_top_docs=[10,20,40,50,80,100] \
  dev_files={path to the retriever results file to evaluate} \
  model_file= {path to the reader checkpoint} \
  train.dev_batch_size=80 \
  passages_per_question_predict=100 \
  encoder.sequence_length=350

Distributed training

Use Pytorch's distributed training launcher tool:

python -m torch.distributed.launch \
	--nproc_per_node={WORLD_SIZE}  {non-distributed script name & parameters}

Note:

  • All batch-size-related parameters are specified per GPU in distributed (DistributedDataParallel) mode and for all available GPUs in DataParallel (single node, multi-GPU) mode. For example, train.batch_size=16 on 8 GPUs gives an effective batch size of 128 under DistributedDataParallel, but only 16 under DataParallel.

Best hyperparameter settings

An end-to-end example with the best settings for the NQ dataset.

1. Download all retriever training and validation data:

python data/download_data.py --resource data.wikipedia_split.psgs_w100
python data/download_data.py --resource data.retriever.nq
python data/download_data.py --resource data.retriever.qas.nq

2. Bi-encoder (retriever) training in the single-set mode.

We used distributed training mode on a single 8 GPU x 32 GB server.

python -m torch.distributed.launch --nproc_per_node=8 \
train_dense_encoder.py \
train=biencoder_nq \
train_datasets=[nq_train] \
dev_datasets=[nq_dev] \
output_dir={your output dir}

New model training combines two NQ datasets:

python -m torch.distributed.launch --nproc_per_node=8 \
train_dense_encoder.py \
train=biencoder_nq \
train_datasets=[nq_train,nq_train_hn1] \
dev_datasets=[nq_dev] \
output_dir={your output dir}

Training for 40 epochs takes about a day. Validation switches to average rank on epoch 30, and the value should be around 25 or less by the end. The best bi-encoder checkpoint is usually the last one, but results should not differ much for any checkpoint after epoch ~25.

3. Generate embeddings for Wikipedia.

Just follow the 'Retriever inference' instructions above. It takes about 40 minutes to produce representation vectors for 21 million passages on 50 2-GPU servers.

4. Evaluate retrieval accuracy and generate top passage results for each of the train/dev/test datasets.

python dense_retriever.py \
	model_file={path to the best checkpoint or use one of our provided checkpoints (resource names like checkpoint.retriever.*)} \
	qa_dataset=nq_test \
	ctx_datatsets=[dpr_wiki] \
	encoded_ctx_files=["{glob expression for generated embedding files}"] \
	out_file={path to the output file}

Adjust batch_size based on the number of available GPUs; 64-128 should work for a 2-GPU server.

5. Reader training

We trained the reader model for large datasets using a single 8 GPU x 32 GB server. All the default parameters are already set to our best NQ settings. Please also download the data.gold_passages_info.nq_train & data.gold_passages_info.nq_dev resources for the NQ dataset - they are used for special NQ-only heuristics when preprocessing the data for NQ reader training. If you have already run reader training on NQ data without gold_passages_src & gold_passages_src_dev specified, please delete the corresponding .pkl files so that they will be re-generated.

python train_extractive_reader.py \
	encoder.sequence_length=350 \
	train_files={path to the retriever train set results file} \
	dev_files={path to the retriever dev set results file}  \
	gold_passages_src={path to data.gold_passages_info.nq_train file} \
	gold_passages_src_dev={path to data.gold_passages_info.nq_dev file} \
	output_dir={path to output dir}

We found that the learning rate above works best with a static schedule, so one needs to stop training manually based on evaluation performance dynamics. Our best results were achieved at 16-18 training epochs, or after ~60k model updates.

We provide all input and intermediate results for the e2e pipeline for the NQ dataset and most of the similar resources for Trivia.

Misc.

  • TREC validation requires regexp-based matching. We support only retriever validation in regexp mode. See the --match parameter option.
  • WebQ validation requires entity normalization, which is not included as of now.

License

DPR is CC-BY-NC 4.0 licensed as of now.

Comments
  • Error when running train_reader -- ValueError: a must be greater than 0 unless no samples are taken

    Hi! I get the following error when running train_reader.py:

    Total iterations per epoch=1237
     Total updates=24720
      Eval step = 2000
    ***** Training *****
    ***** Epoch 0 *****
    Traceback (most recent call last):
      File "train_reader.py", line 507, in <module>
        main()
      File "train_reader.py", line 498, in main
        trainer.run_train()
      File "train_reader.py", line 126, in run_train
        global_step = self._train_epoch(scheduler, epoch, eval_step, train_iterator, global_step)
      File "train_reader.py", line 225, in _train_epoch
        is_train=True, shuffle=True)
      File "/home/aarchan/dpr_aug/DPR-Aug/dpr/models/reader.py", line 134, in create_reader_input
        is_random=shuffle)
      File "/home/aarchan/dpr_aug/DPR-Aug/dpr/models/reader.py", line 193, in _create_question_passages_tensors
        positive_idx = _get_positive_idx(positives, max_len, is_random)
      File "/home/aarchan/dpr_aug/DPR-Aug/dpr/models/reader.py", line 175, in _get_positive_idx
        positive_idx = np.random.choice(len(positives)) if is_random else 0
      File "mtrand.pyx", line 894, in numpy.random.mtrand.RandomState.choice
    ValueError: a must be greater than 0 unless no samples are taken
    

    This error occurs right after train_reader.py successfully loads all of the preprocessed .pkl reader data files. Could you please help me resolve this issue?

    bug 
    opened by aarzchan 25
  • Questions about the Retriever input data format

    Hi, thank you so much for open-sourcing DPR! I have some questions about the Retriever input data format.

    Given the paper, the best performance comes from the Gold setting + 1 BM25 paragraph, in which (from my understanding) the negative examples are in-batch gold paragraphs and 1 BM25 paragraph. On the other hand, in the provided retriever's nq_train.json data, there are multiple positive_ctxs, 50 negative_ctxs and a lot of hard_negative_ctxs, while it seems that those negative_ctxs will not be used by default and only one paragraph from hard_negative_ctxs will be used.

    First, what is the difference between the negative_ctxs and hard_negative_ctxs? Second, how are those negative paragraphs selected? Also, there are multiple positive_ctxs in nq_train.json. According to the paper, the positive examples for NQ and SQuAD are the preprocessed paragraphs corresponding to the original reference paragraphs in the original NQ / SQuAD datasets. How are the positive paragraphs in nq_train.json selected?

    For SQuAD and Natural Questions, since the original passages have been split and processed differently than our pool of candidate passages, we match and replace each gold passage with the corresponding passage in the candidate pool.

    opened by AkariAsai 21
  • Questions on new implementation

    Hi, Nice work on the new performance! I saw you mentioned that the new model is trained on new training data combined with your original training data. I have some confusion here.

    1. How do you get the "nq-adv-hn-train.json"? Are the gold and hard_negative passages retrieved from a pre-trained DPR model rather than BM25? And which pre-trained DPR model do you use? Is it the "single-adv-hn.nq.bert-base-encoder"?
    2. If I use the new training data "nq-adv-hn-train.json", should I still use nq-train.json to get your performance? If so, does that mean I need to add one BM25 hard_negative from nq-train.json?
    opened by yeliu918 19
  • Cannot Reproduce SQuAD  Retrieval Result

    Hello @vlad-karpukhin

    I have been trying to reproduce the result for SQuAD (as well as Trivia) dataset of Table 2 from the paper. (Single mode)

    Below is a summary of the steps I have taken and the resulting output.

    Step 1. Download dataset

    • Download squad1-train.json, squad1-dev.json and squad1-test.csv via download_data.py

    Step 2. Retriever training

    • I trained the retriever model via train_dense_encoder.py with the following arguments
    Initialized host brain-cluster-gpu10.dakao.io as d.rank 0 on device=cuda:0, n_gpu=1, world size=8
    16-bits training: False
     **************** CONFIGURATION ****************
    adam_betas                     -->   (0.9, 0.999)
    adam_eps                       -->   1e-08
    batch_size                     -->   16
    checkpoint_file_name           -->   dpr_biencoder
    dev_batch_size                 -->   16
    dev_file                       -->   data/data/retriever/squad1-dev.json
    device                         -->   cuda:0
    distributed_world_size         -->   8
    do_lower_case                  -->   True
    dropout                        -->   0.1
    encoder_model_type             -->   hf_bert
    eval_per_epoch                 -->   1
    fix_ctx_encoder                -->   False
    fp16                           -->   False
    fp16_opt_level                 -->   O1
    global_loss_buf_sz             -->   150000
    gradient_accumulation_steps    -->   1
    hard_negatives                 -->   1
    learning_rate                  -->   2e-05
    local_rank                     -->   0
    log_batch_step                 -->   100
    max_grad_norm                  -->   2.0
    model_file                     -->   None
    n_gpu                          -->   1
    no_cuda                        -->   False
    num_train_epochs               -->   50.0
    other_negatives                -->   0
    output_dir                     -->   ./checkpoint/sq_best
    pretrained_file                -->   None
    pretrained_model_cfg           -->   bert-base-uncased
    projection_dim                 -->   0
    seed                           -->   12345
    sequence_length                -->   256
    shuffle_positive_ctx           -->   False
    train_file                     -->   data/data/retriever/squad1-train.json
    train_files_upsample_rates     -->   None
    train_rolling_loss_step        -->   500
    val_av_rank_bsz                -->   128
    val_av_rank_hard_neg           -->   30
    val_av_rank_max_qs             -->   10000
    val_av_rank_other_neg          -->   30
    val_av_rank_start_epoch        -->   300
    warmup_steps                   -->   1237
    weight_decay                   -->   0.0
     
    
    • Please note that I trained with num_train_epochs=50 instead of 40.

    Step 3. Retriever inference

    • Run generate_dense_embeddings.py with the following arguments
    Initialized host gpu-cloud-vnode186.dakao.io as d.rank 1 on device=cuda:1, n_gpu=1, world size=80
    16-bits training: False
    Reading saved model from ./checkpoint/sq_best/dpr_biencoder.49.548
    model_state_dict keys odict_keys(['model_dict', 'optimizer_dict', 'scheduler_dict', 'offset', 'epoch', 'encoder_params'])
    Overriding args parameter value from checkpoint state. Param = do_lower_case, value = True
    Overriding args parameter value from checkpoint state. Param = pretrained_model_cfg, value = bert-base-uncased
    Overriding args parameter value from checkpoint state. Param = encoder_model_type, value = hf_bert
    Overriding args parameter value from checkpoint state. Param = sequence_length, value = 256
     **************** CONFIGURATION ****************
    batch_size                     -->   2200
    ctx_file                       -->   ./data/data/wikipedia_split/psgs_w100.tsv
    device                         -->   cuda:1
    distributed_world_size         -->   80
    do_lower_case                  -->   True
    encoder_model_type             -->   hf_bert
    fp16                           -->   False
    fp16_opt_level                 -->   O1
    local_rank                     -->   1
    model_file                     -->   ./checkpoint/sq_best/dpr_biencoder.49.548
    n_gpu                          -->   1
    no_cuda                        -->   False
    num_shards                     -->   40
    out_file                       -->   ./checkpoint/sq_best/embed_epoch_49
    pretrained_file                -->   None
    pretrained_model_cfg           -->   bert-base-uncased
    projection_dim                 -->   0
    sequence_length                -->   256
    shard_id                       -->   0
    
    • ./checkpoint/sq_best/dpr_biencoder.49.548 was the checkpoint with the highest correct prediction ratio on the dev set during training.
    • I split psgs_w100.tsv into 40 different shards

    Step 4. Retriever validation against the entire set of documents

    • Finally I evaluated the IR accuracy via dense_retriever.py with the following arguments
    Initialized host gpu-cloud-vnode186.dakao.io as d.rank -1 on device=cuda, n_gpu=2, world size=1
    16-bits training: False
     **************** CONFIGURATION ****************
    batch_size                     -->   1800
    ctx_file                       -->   data/data/wikipedia_split/psgs_w100.tsv
    device                         -->   cuda
    distributed_world_size         -->   1
    do_lower_case                  -->   False
    encoded_ctx_file               -->   checkpoint/sq_best/embed_epoch_49_*
    encoder_model_type             -->   None
    fp16                           -->   False
    fp16_opt_level                 -->   O1
    hnsw_index                     -->   False
    index_buffer                   -->   50000
    local_rank                     -->   -1
    match                          -->   string
    model_file                     -->   checkpoint/sq_best/dpr_biencoder.49.548
    n_docs                         -->   100
    n_gpu                          -->   2
    no_cuda                        -->   False
    out_file                       -->   checkpoint/sq_best/eval_test_epoch_49_top_100.json
    pretrained_file                -->   None
    pretrained_model_cfg           -->   None
    projection_dim                 -->   0
    qa_file                        -->   data/data/retriever/qas/squad1-test.csv
    save_or_load_index             -->   False
    sequence_length                -->   256
    
    • And the printed result was
    Total data indexed 21015320
    Data indexing completed.
    Encoded queries 1800
    Encoded queries 3600
    Encoded queries 5400
    Encoded queries 7200
    Encoded queries 9000
    Total encoded queries tensor torch.Size([10570, 768])
    index search time: 1522.145621 sec.
    Reading data from: data/data/wikipedia_split/psgs_w100.tsv
    Matching answers in top docs...
    Per question validation results len=10570
    Validation results: top k documents hits [1574, 2290, 2766, 3134, 3419, 3687, 3900, 4068, 4233, 4379, 4508, 4627, 4734, 4832, 4921, 5005, 5106, 5196, 5260, 5311,
    5384, 5454, 5501, 5565, 5627, 5675, 5720, 5768, 5817, 5873, 5917, 5956, 5998, 6037, 6078, 6128, 6158, 6187, 6223, 6252, 6287, 6322, 6347, 6370, 6396, 6422, 6444,
    6470, 6501, 6530, 6564, 6591, 6619, 6638, 6661, 6687, 6720, 6742, 6758, 6777, 6794, 6808, 6822, 6840, 6861, 6890, 6905, 6925, 6938, 6960, 6982, 7003, 7017, 7042,
    7058, 7072, 7087, 7100, 7113, 7126, 7138, 7148, 7162, 7177, 7188, 7204, 7213, 7230, 7243, 7261, 7272, 7285, 7297, 7305, 7316, 7330, 7348, 7359, 7377, 7389]
    Validation results: top k documents hits accuracy [0.1489120151371807, 0.21665089877010407, 0.26168401135288555, 0.2964995269631031, 0.32346263008514664, 0.348817
    4077578051, 0.36896877956480606, 0.3848628192999054, 0.40047303689687797, 0.4142857142857143, 0.42649006622516555, 0.4377483443708609, 0.4478713339640492, 0.45714
    285714285713, 0.46556291390728477, 0.4735099337748344, 0.48306527909176916, 0.49157994323557236, 0.49763481551561023, 0.5024597918637653, 0.5093661305581836, 0.51
    59886471144749, 0.5204351939451277, 0.5264900662251656, 0.5323557237464522, 0.5368968779564806, 0.5411542100283823, 0.5456953642384106, 0.5503311258278145, 0.5556
    291390728477, 0.5597918637653737, 0.5634815515610218, 0.5674550614947966, 0.5711447492904447, 0.5750236518448439, 0.5797540208136235, 0.5825922421948913, 0.585335
    8561967833, 0.5887417218543046, 0.5914853358561968, 0.5947965941343425, 0.5981078524124882, 0.6004730368968779, 0.6026490066225165, 0.605108798486282, 0.607568590
    3500473, 0.6096499526963103, 0.6121097445600757, 0.615042573320719, 0.6177861873226111, 0.6210028382213812, 0.6235572374645222, 0.6262062440870388, 0.628003784295
    175, 0.6301797540208136, 0.632639545884579, 0.6357615894039735, 0.6378429517502365, 0.639356669820246, 0.6411542100283822, 0.6427625354777673, 0.6440870387890255,
     0.6454115421002838, 0.6471144749290445, 0.6491012298959319, 0.651844843897824, 0.6532639545884579, 0.6551561021759698, 0.6563859981078524, 0.6584673604541155, 0.
    6605487228003785, 0.6625354777672658, 0.6638599810785242, 0.6662251655629139, 0.6677388836329233, 0.6690633869441817, 0.6704824976348155, 0.6717123935666982, 0.67
    29422894985809, 0.6741721854304635, 0.6753074739829706, 0.6762535477767266, 0.6775780510879849, 0.6789971617786187, 0.6800378429517502, 0.6815515610217597, 0.6824
    0302743614, 0.6840113528855251, 0.6852412488174078, 0.6869441816461684, 0.6879848628192999, 0.6892147587511825, 0.6903500473036897, 0.6911069063386944, 0.69214758
    75118259, 0.6934720908230843, 0.6951750236518448, 0.6962157048249763, 0.697918637653737, 0.6990539262062441]
    
    Top-20: 50.246%
    Top-100: 69.905%
    
    Saved results * scores  to checkpoint/sq_best/eval_test_epoch_49_top_100.json
    

    Issue

    • This result is not consistent with the values reported in the paper
      • You reported that the Top-20 and Top-100 accuracy for SQuAD under Single Mode is 63.2 and 77.2, respectively.
    • Issues #62 and #93 also report that they could not reproduce the SQuAD result
    • However, with the same code and hyperparameters, I was able to (almost) reproduce the result for the Trivia QA dataset.
      • Reproduced Trivia QA Top-20/100 accuracy : 79.3/84.9
      • Reported Trivia QA Top-20/100 accuracy : 79.4/85.0
    • Therefore I suspect that there might be some difference between the shared SQuAD dataset and the one you actually used.

    Please let me know if you find something wrong. Thank you!

    opened by robinsongh381 16
  • Minor suggestion on the trivia-train.json

    Hi there,

    I was finetuning the DPR model on the trivia dataset but found that there were many entries in trivia-train.json that contain no positive contexts. As nq-train.json has no empty entries (it has been cleaned up), I think it'd be better to clean up trivia as well for consistency. Otherwise, it might lead to an index mismatch if pre-trained query embeddings are used instead of the query encoder.

    Best,

    enhancement 
    opened by alexlimh 13
  • Best results reproduction instruction

    Hello,

    I am trying to train a model based on your instructions and tried to run train_dense_encoder.py. In the instructions you refer to --dev_file {path to downloaded data.retriever.qas.nq-dev resource} but it is unclear which file you mean.

    Is it retriever/qas/nq-dev.csv or retriever/nq-dev.json? The first option fails as the code expects a json file but the second one doesn't seem like a "retriever.qas" resource based on its name.

    opened by iftachg 12
  • Seeking KILT meta-data for DPR

    Hi, I notice some updates have been made here to facilitate the KILT dataset format. In addition, I wonder if the below meta-data from KILT can be also shared here:

    1. the 22,220,793 passages split from the KILT knowledge source
    2. the corresponding passage_id of positive and negative passages for each query of the NQ dataset (mined by the DPR checkpoint)

    Both meta-data are necessary for reproducing or improving DPR on KILT, and I think it will be more convenient for people to follow up if the above meta-data are shared. Thanks.

    opened by jzhoubu 11
  • Question about reader training.

    Hi, I'm here again :)

    I tried to use test data constructed from my retrieved passages on the NQ dataset to test the reader model trained with your provided training data, but the results are not very good, even though the retrieval performance is quite good.

    I feel that the problem may be that the training data does not match my data, so I would like to ask how your reader training data is structured. For example, what is the query and the passage source?

    Thank you!

    opened by ReyonRen 11
  • dense_retriever -- MemoryError: std::bad_alloc

    Hi! It seems that no matter what value I set index_buffer to, I get the following error when running dense_retriever.py:

    Traceback (most recent call last):
      File "dense_retriever.py", line 331, in <module>
        main(args)
      File "dense_retriever.py", line 268, in main
        retriever.index_encoded_data(input_paths, buffer_size=index_buffer_sz)
      File "dense_retriever.py", line 100, in index_encoded_data
        self.index.index_data(buffer)
      File "/home/aarchan/qa-aug/qa-aug/dpr/indexer/faiss_indexers.py", line 93, in index_data
        self.index.add(vectors)
      File "/home/aarchan/anaconda2/envs/qa-aug/lib/python3.8/site-packages/faiss/__init__.py", line 138, in replacement_add
        self.add_c(n, swig_ptr(x))
      File "/home/aarchan/anaconda2/envs/qa-aug/lib/python3.8/site-packages/faiss/swigfaiss.py", line 1454, in add
        return _swigfaiss.IndexFlat_add(self, n, x)
    MemoryError: std::bad_alloc
    

    For reference, the machine I'm running this on has 128GB RAM, but it doesn't seem to be enough. Could you please help me with this issue? Thanks!

    opened by aarzchan 11
  • Reproduce Table 3 in paper

    Hi, all.

    I am reproducing the second block of Table 3 in the paper but am having problems with Gold #N = 7. I wonder what causes this inconsistency.

    The results I get are shown below.

    | batch_size                | top5  | top20 | top100 |
    | ------------------------- | ----- | ----- | ------ |
    | bsize=8 (Gold #N = 7)     | 44.0% | 64.2% | 78.2%  |
    | bsize=128 (Gold #N = 127) | 57.6% | 74.5% | 84.0%  |

    For Gold #N = 7, the result has a big gap with the paper while it is fine for Gold #N = 127.

    I run the experiment for Gold #N = 7 on a server with 4 V100 GPUs and the experiment for Gold #N = 127 on a server with 8 V100 GPUs. The two servers are with the same CUDA version and virtual env. The only difference is the setting of batch size.

    The script I used for experiments is shown below.

    #!/bin/bash
    
    set -x
    HYDRA_FULL_ERROR=1 python train_dense_encoder.py \
        train_datasets=[nq_train] \
        dev_datasets=[nq_dev] \
        train=biencoder_nq \
        train.batch_size=$1 \
        train.hard_negatives=0 \
        output_dir=./runs/
    

    Note that, I run these experiments in DataParallel (single node - multi gpu) mode.

    opened by yxliu-ntu 9
  • TOP-K results on NQ datasets

    Hi, I ran dense_retriever.py to obtain results on the NQ dataset, but I get the following, which is far behind the results in the paper.

    Validation results: top k documents hits accuracy [0.13268698060941828, 0.17506925207756233, 0.20498614958448755, 0.22548476454293628, 0.24681440443213296, 0.2614958448753463, 0.2742382271468144, 0.2839335180055402, 0.29418282548476454, 0.30193905817174516, 0.3080332409972299, 0.31329639889196675, 0.31994459833795014, 0.3238227146814404, 0.328808864265928, 0.33518005540166207, 0.3371191135734072, 0.3401662049861496, 0.3443213296398892, 0.34709141274238225, 0.3518005540166205, 0.35650969529085874, 0.35983379501385043, 0.3634349030470914, 0.36925207756232686, 0.3717451523545706, 0.3731301939058172, 0.37590027700831025, 0.378393351800554, 0.3797783933518006, 0.3817174515235457, 0.3847645429362881, 0.38725761772853184, 0.3897506925207756, 0.3914127423822715, 0.3930747922437673, 0.3939058171745152, 0.3958448753462604, 0.3972299168975069, 0.4, 0.4005540166204986, 0.4024930747922438, 0.40470914127423824, 0.4069252077562327, 0.40941828254847645, 0.4113573407202216, 0.41301939058171744, 0.4138504155124654, 0.4149584487534626, 0.41606648199445984, 0.41772853185595565, 0.42049861495844876, 0.4224376731301939, 0.42382271468144045, 0.4249307479224377, 0.42548476454293627, 0.4279778393351801, 0.4293628808864266, 0.4310249307479224, 0.4326869806094183, 0.4337950138504155, 0.43490304709141275, 0.4357340720221607, 0.43656509695290857, 0.4373961218836565, 0.43822714681440444, 0.4404432132963989, 0.4409972299168975, 0.44182825484764543, 0.4437673130193906, 0.4451523545706371, 0.44626038781163435, 0.44681440443213294, 0.4473684210526316, 0.44792243767313017, 0.4490304709141274, 0.450415512465374, 0.4506925207756233, 0.4518005540166205, 0.4520775623268698, 0.45290858725761773, 0.4534626038781163, 0.45373961218836567, 0.45457063711911355, 0.4551246537396122, 0.45595567867036013, 0.45706371191135736, 0.45789473684210524, 0.4581717451523546, 0.4587257617728532, 0.4590027700831025, 0.4592797783933518, 0.4598337950138504, 0.46066481994459835, 0.46094182825484764, 0.46121883656509693, 0.4614958448753463, 0.46232686980609417, 0.4628808864265928, 0.4634349030470914]

    python dense_retriever.py
    model_file=[checkpoint]
    qa_dataset=nq_test
    ctx_datatsets=[dpr_wiki]
    encoded_ctx_files=[/home/v-nuochen/DPR/outputs/2021-04-17/08-29-34/nq-generate-emd_0]
    out_file=nq_retrieval_07_08

    The model_file is downloaded from your previous checkpoint, but encoded_ctx_files is generated by myself from generate_dense_embeddings.py with default settings.

    Could you please tell me why?

    opened by cn-boop 9
  • Missing required positional arguments

    Hi, I'm following the instructional code in the readme and ran python train_dense_encoder.py \ train_datasets=[nq_train] \ dev_datasets=[nq_dev] \ train=biencoder_local \ output_dir={path to checkpoints dir}

    after installing the nq_train and nq_dev datasets. However, whenever I run this I get an error in pytorch: torch/nn/modules/module.py", line 1190, in _call_impl return forward_call(*input, **kwargs) TypeError: forward() missing 6 required positional arguments: 'question_ids', 'question_segments', 'question_attn_mask', 'context_ids', 'ctx_segments', and 'ctx_attn_mask'

    I'm not sure what could be causing this.

    opened by grossmanm 1
  • Passages with multiple answer spans

    Hi, I'm trying to implement the reader model training in Tensorflow. How does DPR deal with multiple answer spans in a positive passage? Will there just be multiple positions labelled as the correct start/end position?

    E.g. if there are 2 answer starts in a passage (h_i, h_j), would the model try to maximise the scores for tokens i and j using cross entropy loss?

    opened by TZeng20 0
  • `cosine_scores` defined in biencoder.py does not work

    Reference: https://github.com/facebookresearch/DPR/blob/d9f3e41bb0087687fa182a4d580711188fd82df9/dpr/models/biencoder.py#L57

    F.cosine_similarity will fail to compute the similarity along a specified dimension when the other dimensions differ. For example, if x is a 10x64 tensor, and y is a 20x64 tensor, then it is expected to get a 10x20 matrix when calling cosine_scores. However, that function won't work:

    >>> x = torch.randn(10, 64)
    >>> y = torch.randn(20, 64)
    >>> F.cosine_similarity(x, y, dim=1)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    RuntimeError: The size of tensor a (10) must match the size of tensor b (20) at non-singleton dimension 0
    

    Since it is not used anywhere else in the repo and the paper, maybe it would be a good idea to remove cosine_scores?

    opened by xhluca 0
  • Question in training reader model.

    Hi, I'm trying to reproduce this work and I have trouble training the reader. It's not clear to me which files {path to the retriever train set results file} and {path to the retriever dev set results file} refer to. I am looking forward to any replies.

    opened by gaishun 1
  • Doubt regarding all_gather_list in case of DDP

    Hi,

    Thanks for the amazing framework. I have a doubt regarding the utility of the all_gather_list function, that gathers the tensors across the GPUs. When we are training in DDP, the gradients are synchronized before the parameter updates, therefore, why is this step needed? Is it just to collate the loss or number of correct predictions or the rank (in evaluation)? If yes, then couldn't one gather all of them after computing the loss, instead of exchanging the question and context representations first and then going forward with it?

    Thanks!

    opened by bhattg 0