
Dense Passage Retrieval

Dense Passage Retrieval (DPR) is a set of tools and models for state-of-the-art open-domain Q&A research. It is based on the following paper:

Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih. Dense Passage Retrieval for Open-Domain Question Answering. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781, 2020.

If you find this work useful, please cite the following paper:

@inproceedings{karpukhin-etal-2020-dense,
    title = "Dense Passage Retrieval for Open-Domain Question Answering",
    author = "Karpukhin, Vladimir and Oguz, Barlas and Min, Sewon and Lewis, Patrick and Wu, Ledell and Edunov, Sergey and Chen, Danqi and Yih, Wen-tau",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-main.550",
    doi = "10.18653/v1/2020.emnlp-main.550",
    pages = "6769--6781",
}

If you're interested in reproducing the experimental results in the paper based on our model checkpoints (i.e., you don't want to train the encoders from scratch), you might consider using the Pyserini toolkit, which has the experiments nicely packaged and installable via pip. Their toolkit also reports higher BM25 and hybrid scores.

Features

  1. Dense retriever model based on a bi-encoder architecture (a minimal scoring sketch follows this list).
  2. Extractive Q&A reader & ranker joint model, inspired by this paper.
  3. Related data pre- and post-processing tools.
  4. Dense retriever inference component based on a FAISS index.
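
The bi-encoder scores a question against a passage as the dot product of two independently computed vectors. Below is a minimal, illustrative sketch of that idea with Hugging Face BERT encoders; the function and variable names are ours for illustration and do not correspond to DPR's actual classes.

import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
question_encoder = BertModel.from_pretrained("bert-base-uncased")
passage_encoder = BertModel.from_pretrained("bert-base-uncased")

def encode(encoder, texts):
    # Tokenize and take the [CLS] vector as the sequence representation
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = encoder(**batch)
    return outputs[0][:, 0]  # works with both tuple and ModelOutput return types

question_vec = encode(question_encoder, ["who wrote hamlet"])
passage_vecs = encode(passage_encoder, [
    "Hamlet is a tragedy written by William Shakespeare ...",
    "The Globe Theatre was built in 1599 ...",
])
scores = question_vec @ passage_vecs.T  # higher dot product = more relevant passage
print(scores)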

New (March 2021) release

DPR codebase is upgraded with a number of enhancements and new models. Major changes:

  1. Hydra-based configuration for all the command line tools except the data loader (to be converted soon)
  2. Pluggable data processing layer to support custom datasets
  3. New retrieval model checkpoint with better performance.

New (March 2021) retrieval model

A new bi-encoder model trained on the NQ dataset only is now provided: a new checkpoint, training data, retrieval results and Wikipedia embeddings. The training data combines the original DPR NQ train set with a version of it in which hard negatives are mined with the DPR index itself, built from the previous NQ checkpoint. A bi-encoder model is trained from scratch on this combined data, and this training scheme gives a nice retrieval performance boost.

New vs old top-k documents retrieval accuracy on NQ test set (3610 questions).

| Top-k passages | Original DPR NQ model | New DPR model |
| -------------- | --------------------- | ------------- |
| 1              | 45.87                 | 52.47         |
| 5              | 68.14                 | 72.24         |
| 20             | 79.97                 | 81.33         |
| 100            | 85.87                 | 87.29         |

New model downloadable resource names (see how to use the download_data script below):

Checkpoint: checkpoint.retriever.single-adv-hn.nq.bert-base-encoder

New training data: data.retriever.nq-adv-hn-train

Retriever results for NQ test set: data.retriever_results.nq.single-adv-hn.test

Wikipedia embeddings: data.retriever_results.nq.single-adv-hn.wikipedia_passages

Installation

Installation from source. Python virtual or Conda environments are recommended.

git clone [email protected]:facebookresearch/DPR.git
cd DPR
pip install .

DPR is tested on Python 3.6+ and PyTorch 1.2.0+. DPR relies on third-party libraries for the encoder implementations. It currently supports Huggingface (version <=3.1.0) BERT, Pytext BERT and Fairseq RoBERTa encoder models. Because of the generality of the tokenization process, DPR uses Huggingface tokenizers as of now, so Huggingface is the only required dependency; Pytext & Fairseq are optional. Install them separately if you want to use those encoders.

Resources & Data formats

First, you need to prepare data for either retriever or reader training. Each of the DPR components has its own input/output data formats. You can see format descriptions below. DPR provides NQ & Trivia preprocessed datasets (and model checkpoints) to be downloaded from the cloud using our dpr/data/download_data.py tool. One needs to specify the resource name to be downloaded. Run 'python data/download_data.py' to see all options.

python data/download_data.py \
	--resource {key from download_data.py's RESOURCES_MAP}  \
	[optional --output_dir {your location}]

The resource name matching is prefix-based. So if you need to download all data resources, just use --resource data.

Retriever input data format

The default data format of the Retriever training data is JSON. It contains pools of 2 types of negative passages per question, as well as positive passages and some additional information.

[
  {
	"question": "....",
	"answers": ["...", "...", "..."],
	"positive_ctxs": [{
		"title": "...",
		"text": "...."
	}],
	"negative_ctxs": ["..."],
	"hard_negative_ctxs": ["..."]
  },
  ...
]

The element structure for negative_ctxs & hard_negative_ctxs is exactly the same as for positive_ctxs. The preprocessed data available for download also contains some extra attributes which may be useful for model modifications (like BM25 scores per passage); they are not currently used by DPR.
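
As a quick sanity check of a file in this format, one can load it and print the pool sizes (a small sketch; the file name below is just an example):

import json

with open("nq-train.json") as f:
    samples = json.load(f)

print("questions:", len(samples))
example = samples[0]
print(example["question"], example["answers"])
print("positives:", len(example["positive_ctxs"]),
      "negatives:", len(example["negative_ctxs"]),
      "hard negatives:", len(example["hard_negative_ctxs"]))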

You can download the prepared NQ dataset used in the paper by using the 'data.retriever.nq' key prefix. Only the dev & train subsets are available in this format. We also provide question & answer-only CSV data files for all train/dev/test splits. Those are used for model evaluation, since our NQ preprocessing step loses part of the original sample set. Use the 'data.retriever.qas.*' resource keys to get the respective sets for evaluation.

python data/download_data.py
	--resource data.retriever
	[optional --output_dir {your location}]

DPR data formats and custom processing

One can use their own data format and custom data parsing & loading logic by inheriting from DPR's Dataset classes in the dpr/data/{biencoder|retriever|reader}_data.py files and implementing the load_data() and __getitem__() methods. See the DPR hydra configuration instructions.
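
A rough sketch of such a subclass is below. The Dataset, BiEncoderSample and BiEncoderPassage names follow dpr/data/biencoder_data.py, but exact base-class signatures may differ between versions, so treat this as a starting point rather than a drop-in implementation; the JSONL layout is invented for the example.

import json

from dpr.data.biencoder_data import BiEncoderPassage, BiEncoderSample, Dataset


class MyJsonlDataset(Dataset):
    def __init__(self, file: str):
        super().__init__()
        self.file = file
        self.data = []

    def load_data(self, start_pos: int = -1, end_pos: int = -1):
        # Read one JSON object per line into memory
        with open(self.file) as f:
            self.data = [json.loads(line) for line in f]

    def __getitem__(self, index) -> BiEncoderSample:
        row = self.data[index]
        sample = BiEncoderSample()
        sample.query = row["question"]
        sample.positive_passages = [
            BiEncoderPassage(p["text"], p["title"]) for p in row["positives"]
        ]
        sample.negative_passages = []
        sample.hard_negative_passages = [
            BiEncoderPassage(p["text"], p["title"]) for p in row["hard_negatives"]
        ]
        return sample

    def __len__(self):
        return len(self.data)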

Retriever training

Retriever training quality depends on its effective batch size. The results reported in the paper were obtained with 8 x 32GB GPUs. To start training on a single machine:

python train_dense_encoder.py \
train_datasets=[list of train datasets, comma separated without spaces] \
dev_datasets=[list of dev datasets, comma separated without spaces] \
train=biencoder_local \
output_dir={path to checkpoints dir}

Example for NQ dataset

python train_dense_encoder.py \
train_datasets=[nq_train] \
dev_datasets=[nq_dev] \
train=biencoder_local \
output_dir={path to checkpoints dir}

DPR uses HuggingFace BERT-base as the encoder by default. Other ready options include Fairseq's RoBERTa and Pytext BERT models. One can select them by either changing the encoder configuration files (conf/encoder/hf_bert.yaml) or providing a new configuration file in the conf/encoder dir and enabling it with the encoder={new file name} command line parameter.

Notes:

  • If you want to use pytext bert or fairseq roberta, you will need to download pre-trained weights and specify encoder.pretrained_file parameter. Specify the dir location of the downloaded files for 'pretrained.fairseq.roberta-base' resource prefix for RoBERTa model or the file path for pytext BERT (resource name 'pretrained.pytext.bert-base.model').
  • Validation and checkpoint saving happens according to train.eval_per_epoch parameter value.
  • There is no stop condition besides a specified amount of epochs to train (train.num_train_epochs configuration parameter).
  • Every evaluation saves a model checkpoint.
  • The best checkpoint is logged in the train process output.
  • Regular NLL classification loss validation for bi-encoder training can be replaced with average rank evaluation. It aggregates passage and question vectors from the input data passage pools, computes a large similarity matrix over those representations, and then averages the rank of the gold passage for each question. We found this metric to correlate better with final retrieval performance than the NLL classification loss. Note, however, that this average rank validation works differently in DistributedDataParallel vs DataParallel PyTorch modes. See the train.val_av_rank_* set of parameters to enable this mode and modify its settings; a minimal sketch of the computation follows.
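
For reference, here is a minimal sketch of the average-rank idea (not DPR's actual validation code): given pooled question vectors, passage vectors and the gold-passage index per question, compute the similarity matrix and average the gold passage's rank.

import torch

def average_rank(q_vectors: torch.Tensor, p_vectors: torch.Tensor, gold_idx: torch.Tensor) -> float:
    scores = q_vectors @ p_vectors.T                       # (n_questions, n_passages) similarity matrix
    gold_scores = scores.gather(1, gold_idx.unsqueeze(1))  # score of each question's gold passage
    ranks = (scores > gold_scores).sum(dim=1)              # how many passages outrank the gold one
    return ranks.float().mean().item()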

See the section 'Best hyperparameter settings' below for an end-to-end example of our best setups.

Retriever inference

Generating representation vectors for the static documents dataset is a highly parallelizable process which can take up to a few days if computed on a single GPU. You might want to use multiple available GPU servers by running the script on each of them independently and specifying a different shard for each.

python generate_dense_embeddings.py \
	model_file={path to biencoder checkpoint} \
	ctx_src={name of the passages resource, set to dpr_wiki to use our original wikipedia split} \
	shard_id={shard_num, 0-based} num_shards={total number of shards} \
	out_file={result files location + name PREFIX}

The ctx_src parameter takes either the name of a downloadable passages resource or just a source name from the conf/ctx_sources/default_sources.yaml file.

Note: you can use a much larger batch size here compared to training mode. For example, setting batch_size to 128 on a 2-GPU (16GB) server should work fine. You can download already generated Wikipedia embeddings from our original model (trained on the NQ dataset) using the resource key 'data.retriever_results.nq.single.wikipedia_passages'. The embeddings resource name for the new, better model is 'data.retriever_results.nq.single-adv-hn.wikipedia_passages'.

We generally use the following params on 50 2-gpu nodes: batch_size=128 shard_id=0 num_shards=50
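
For illustration, this is roughly how shard_id/num_shards split the passage collection so that each server encodes only its own contiguous slice (a sketch, not DPR's exact code):

def shard_bounds(total_passages: int, shard_id: int, num_shards: int):
    # Each shard gets a contiguous slice of the passage collection.
    per_shard = (total_passages + num_shards - 1) // num_shards  # ceiling division
    start = shard_id * per_shard
    end = min(start + per_shard, total_passages)
    return start, end

print(shard_bounds(21_000_000, shard_id=0, num_shards=50))   # (0, 420000)
print(shard_bounds(21_000_000, shard_id=49, num_shards=50))  # (20580000, 21000000)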

Retriever validation against the entire set of documents:

python dense_retriever.py \
	model_file={path to a checkpoint downloaded from our download_data.py as 'checkpoint.retriever.single.nq.bert-base-encoder'} \
	qa_dataset={the name of the test source} \
	ctx_datatsets=[{list of passage sources's names, comma separated without spaces}] \
	encoded_ctx_files=[{list of encoded document files glob expression, comma separated without spaces}] \
	out_file={path to output json file with results} 
	

For example, if your generated embeddings for two passage sets are saved as ~/myproject/embeddings_passages1/wiki_passages_* and ~/myproject/embeddings_passages2/wiki_passages_* files and you want to evaluate on the NQ dataset:

python dense_retriever.py \
	model_file={path to a checkpoint file} \
	qa_dataset=nq_test \
	ctx_datatsets=[dpr_wiki] \
	encoded_ctx_files=[\"~/myproject/embeddings_passages1/wiki_passages_*\",\"~/myproject/embeddings_passages2/wiki_passages_*\"] \
	out_file={path to output json file with results} 

The tool writes the retrieved results for subsequent reader model training into the specified out_file. It is a JSON file with the following format:

[
    {
        "question": "...",
        "answers": ["...", "...", ... ],
        "ctxs": [
            {
                "id": "...", # passage id from database tsv file
                "title": "",
                "text": "....",
                "score": "...",  # retriever score
                "has_answer": true|false
            },
            ...
        ]
    },
    ...
]

Results are sorted by their similarity score, from most relevant to least relevant.
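
As an example, here is a small sketch (the file name is hypothetical) that computes top-k retrieval accuracy from such a results file using the has_answer flags:

import json

with open("nq_test_retriever_results.json") as f:
    results = json.load(f)

for k in (1, 5, 20, 100):
    # A question counts as a hit if any of its top-k passages contains the answer.
    hits = sum(any(ctx["has_answer"] for ctx in r["ctxs"][:k]) for r in results)
    print(f"top-{k} accuracy: {hits / len(results):.4f}")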

By default, dense_retriever uses an exhaustive search process, but you can opt in to lossy index types. We provide HNSW and HNSW_SQ index options. Enable them with the indexer=hnsw or indexer=hnsw_sq command line arguments. Note that these indexes may be of limited use from a research point of view, since their fast retrieval comes at the cost of much longer indexing time and higher RAM usage. The similarity score provided is the dot product for the default exhaustive-search case (indexer=flat) and L2 distance in a modified representation space in the case of the HNSW index.
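
A minimal FAISS sketch contrasting the two index families (illustrative; not the code from dpr/indexer/faiss_indexers.py):

import faiss
import numpy as np

dim = 768
passages = np.random.rand(10000, dim).astype("float32")
queries = np.random.rand(4, dim).astype("float32")

flat = faiss.IndexFlatIP(dim)        # exact inner-product (dot-product) search, as with indexer=flat
flat.add(passages)
flat_scores, flat_ids = flat.search(queries, 10)

hnsw = faiss.IndexHNSWFlat(dim, 32)  # approximate search: faster queries, slower indexing, more RAM
# IndexHNSWFlat works with L2 distance, which is why DPR maps vectors into a
# modified representation space before indexing them this way.
hnsw.add(passages)
hnsw_scores, hnsw_ids = hnsw.search(queries, 10)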

Reader model training

python train_extractive_reader.py \
	encoder.sequence_length=350 \
	train_files={path to the retriever train set results file} \
	dev_files={path to the retriever dev set results file}  \
	output_dir={path to output dir}

Default hyperparameters are set for a single node with an 8-GPU setup. Modify them as needed in the conf/train/extractive_reader_default.yaml and conf/extractive_reader_train_cfg.yaml configuration files or override specific parameters from the command line. The first run will preprocess train_files & dev_files, convert them into a serialized set of .pkl files in the same location, and use those files on all subsequent runs.

Notes:

  • If you want to use pytext bert or fairseq roberta, you will need to download pre-trained weights and specify encoder.pretrained_file parameter. Specify the dir location of the downloaded files for 'pretrained.fairseq.roberta-base' resource prefix for RoBERTa model or the file path for pytext BERT (resource name 'pretrained.pytext.bert-base.model').
  • Reader training pipeline does model validation every train.eval_step batches
  • Like the bi-encoder, it saves model checkpoints on every validation
  • Like the bi-encoder, there is no stop condition besides a specified amount of epochs to train.
  • Like the bi-encoder, there is no best checkpoint selection logic, so one needs to select that based on dev set validation performance which is logged in the train process output.
  • Our current code only calculates the Exact Match metric.

Reader model inference

In order to run inference, run train_extractive_reader.py without specifying train_files. Make sure to specify model_file with the path to the checkpoint, passages_per_question_predict with the number of passages per question (used when saving the prediction file), and eval_top_docs with a list of top-passage threshold values from which to choose the question's answer span (to be printed in the logs). An example command line is as follows.

python train_extractive_reader.py \
  prediction_results_file={path to a file to write the results to} \
  eval_top_docs=[10,20,40,50,80,100] \
  dev_files={path to the retriever results file to evaluate} \
  model_file= {path to the reader checkpoint} \
  train.dev_batch_size=80 \
  passages_per_question_predict=100 \
  encoder.sequence_length=350

Distributed training

Use Pytorch's distributed training launcher tool:

python -m torch.distributed.launch \
	--nproc_per_node={WORLD_SIZE}  {non-distributed script name & parameters}

Note:

  • All batch-size-related parameters are specified per GPU in distributed (DistributedDataParallel) mode and for all available GPUs in DataParallel (single node, multi-GPU) mode. For example, train.batch_size=16 on 8 GPUs gives an effective batch size of 128 under DistributedDataParallel, but only 16 under DataParallel.

Best hyperparameter settings

An end-to-end example with the best settings for the NQ dataset.

1. Download all retriever training and validation data:

python data/download_data.py --resource data.wikipedia_split.psgs_w100
python data/download_data.py --resource data.retriever.nq
python data/download_data.py --resource data.retriever.qas.nq

2. Bi-encoder (retriever) training in the single-set mode.

We used distributed training mode on a single 8 GPU x 32 GB server.

python -m torch.distributed.launch --nproc_per_node=8 \
train_dense_encoder.py \
train=biencoder_nq \
train_datasets=[nq_train] \
dev_datasets=[nq_dev] \
output_dir={your output dir}

New model training combines two NQ datasets:

python -m torch.distributed.launch --nproc_per_node=8 \
train_dense_encoder.py \
train=biencoder_nq \
train_datasets=[nq_train,nq_train_hn1] \
dev_datasets=[nq_dev] \
output_dir={your output dir}

Training for 40 epochs takes about a day. Validation switches to average rank on epoch 30, and the value should be around 25 or less by the end. The best bi-encoder checkpoint is usually the last one, but results should not differ much for any checkpoint after epoch ~25.

3. Generate embeddings for Wikipedia.

Just follow the 'Retriever inference' instructions above. It takes about 40 minutes to produce representation vectors for 21 million passages on 50 2-GPU servers.

4. Evaluate retrieval accuracy and generate top passage results for each of the train/dev/test datasets.

python dense_retriever.py \
	model_file={path to the best checkpoint or use one of our provided checkpoints (resource names like checkpoint.retriever.*)} \
	qa_dataset=nq_test \
	ctx_datatsets=[dpr_wiki] \
	encoded_ctx_files=["{glob expression for generated embedding files}"] \
	out_file={path to the output file}

Adjust batch_size based on the number of available GPUs; 64-128 should work for a 2-GPU server.

5. Reader training

We trained the reader model for large datasets using a single 8 GPU x 32 GB server. All the default parameters are already set to our best NQ settings. Please also download the data.gold_passages_info.nq_train & data.gold_passages_info.nq_dev resources for the NQ dataset - they are used for special NQ-only heuristics when preprocessing the data for NQ reader training. If you have already run reader training on NQ data without gold_passages_src & gold_passages_src_dev specified, please delete the corresponding .pkl files so that they will be re-generated.

python train_extractive_reader.py \
	encoder.sequence_length=350 \
	train_files={path to the retriever train set results file} \
	dev_files={path to the retriever dev set results file}  \
	gold_passages_src={path to data.gold_passages_info.nq_train file} \
	gold_passages_src_dev={path to data.gold_passages_info.nq_dev file} \
	output_dir={path to output dir}

We found that the learning rate above works best with a static schedule, so one needs to stop training manually based on evaluation performance dynamics. Our best results were achieved at 16-18 training epochs, or after ~60k model updates.

We provide all input and intermediate results for the e2e pipeline for the NQ dataset and most of the similar resources for Trivia.

Misc.

  • TREC validation requires regexp-based matching. We support only retriever validation in regexp mode. See the --match parameter option.
  • WebQ validation requires entity normalization, which is not included as of now.

License

DPR is CC-BY-NC 4.0 licensed as of now.

Comments
  • Error when running train_reader -- ValueError: a must be greater than 0 unless no samples are taken

    Hi! I get the following error when running train_reader.py:

    Total iterations per epoch=1237
     Total updates=24720
      Eval step = 2000
    ***** Training *****
    ***** Epoch 0 *****
    Traceback (most recent call last):
      File "train_reader.py", line 507, in <module>
        main()
      File "train_reader.py", line 498, in main
        trainer.run_train()
      File "train_reader.py", line 126, in run_train
        global_step = self._train_epoch(scheduler, epoch, eval_step, train_iterator, global_step)
      File "train_reader.py", line 225, in _train_epoch
        is_train=True, shuffle=True)
      File "/home/aarchan/dpr_aug/DPR-Aug/dpr/models/reader.py", line 134, in create_reader_input
        is_random=shuffle)
      File "/home/aarchan/dpr_aug/DPR-Aug/dpr/models/reader.py", line 193, in _create_question_passages_tensors
        positive_idx = _get_positive_idx(positives, max_len, is_random)
      File "/home/aarchan/dpr_aug/DPR-Aug/dpr/models/reader.py", line 175, in _get_positive_idx
        positive_idx = np.random.choice(len(positives)) if is_random else 0
      File "mtrand.pyx", line 894, in numpy.random.mtrand.RandomState.choice
    ValueError: a must be greater than 0 unless no samples are taken
    

    This error occurs right after train_reader.py successfully loads all of the preprocessed .pkl reader data files. Could you please help me resolve this issue?

    bug 
    opened by aarzchan 25
  • Questions about the Retriever input data format

    Hi, thank you so much for open-sourcing DPR! I have some questions about the Retriever input data format.

    Given the paper, the best performance comes from the Gold setting + 1 BM25 paragraph, in which (from my understanding) the negative examples are in-batch gold paragraphs and 1 BM25 paragraph. On the other hand, in the provided retriever's nq_train.json data, there are multiple positive_ctxs, 50 negative_ctxs and a lot of hard_negative_ctxs, while it seems that those negative_ctxs will not be used by default and only one paragraph from hard_negative_ctxs will be used.

    First, what is the difference between the negative_ctxs and hard_negative_ctxs? Second, how are those negative paragraphs selected? Also, there are multiple positive_ctxs in nq_train.json. According to the paper, the positive examples for NQ and SQuAD are the preprocessed paragraphs corresponding to the original reference paragraphs in the original NQ / SQuAD datasets. How are the positive paragraphs in nq_train.json selected?

    For SQuAD and Natural Questions, since the original passages have been split and processed differently than our pool of candidate passages, we match and replace each gold passage with the corresponding passage in the candidate pool.

    opened by AkariAsai 21
  • Questions on new implementation

    Hi, Nice work on the new performance! I saw you mentioned that the new model is trained on new training data combined with your original training data. I have some confusion here.

    1. How do you get the "nq-adv-hn-train.json"? Are the gold and hard_negative passages retrieved from a pre-trained DPR model rather than BM25? And which pre-trained DPR model do you use? Is it the "single-adv-hn.nq.bert-base-encoder"?
    2. If I use the new training data "nq-adv-hn-train.json", should I still use nq-train.json to get your performance? If so, does that mean I need to add one BM25 hard_negative from nq-train.json?
    opened by yeliu918 19
  • Cannot Reproduce SQuAD  Retrieval Result

    Hello @vlad-karpukhin

    I have been trying to reproduce the result for SQuAD (as well as Trivia) dataset of Table 2 from the paper. (Single mode)

    Below is a summary of the steps I have taken and the resulting output.

    Step 1. Download dataset

    • Download squad1-train.json, squad1-dev.json and squad1-test.csv via download_data.py

    Step 2. Retriever training

    • I trained the retriever model via train_dense_encoder.py with the following arguments
    Initialized host brain-cluster-gpu10.dakao.io as d.rank 0 on device=cuda:0, n_gpu=1, world size=8
    16-bits training: False
     **************** CONFIGURATION ****************
    adam_betas                     -->   (0.9, 0.999)
    adam_eps                       -->   1e-08
    batch_size                     -->   16
    checkpoint_file_name           -->   dpr_biencoder
    dev_batch_size                 -->   16
    dev_file                       -->   data/data/retriever/squad1-dev.json
    device                         -->   cuda:0
    distributed_world_size         -->   8
    do_lower_case                  -->   True
    dropout                        -->   0.1
    encoder_model_type             -->   hf_bert
    eval_per_epoch                 -->   1
    fix_ctx_encoder                -->   False
    fp16                           -->   False
    fp16_opt_level                 -->   O1
    global_loss_buf_sz             -->   150000
    gradient_accumulation_steps    -->   1
    hard_negatives                 -->   1
    learning_rate                  -->   2e-05
    local_rank                     -->   0
    log_batch_step                 -->   100
    max_grad_norm                  -->   2.0
    model_file                     -->   None
    n_gpu                          -->   1
    no_cuda                        -->   False
    num_train_epochs               -->   50.0
    other_negatives                -->   0
    output_dir                     -->   ./checkpoint/sq_best
    pretrained_file                -->   None
    pretrained_model_cfg           -->   bert-base-uncased
    projection_dim                 -->   0
    seed                           -->   12345
    sequence_length                -->   256
    shuffle_positive_ctx           -->   False
    train_file                     -->   data/data/retriever/squad1-train.json
    train_files_upsample_rates     -->   None
    train_rolling_loss_step        -->   500
    val_av_rank_bsz                -->   128
    val_av_rank_hard_neg           -->   30
    val_av_rank_max_qs             -->   10000
    val_av_rank_other_neg          -->   30
    val_av_rank_start_epoch        -->   300
    warmup_steps                   -->   1237
    weight_decay                   -->   0.0
     
    
    • Please note that I trained with num_train_epochs=50 instead of 40.

    Step 3. Retriever inference

    • Run generate_dense_embeddings.py with the following arguments
    Initialized host gpu-cloud-vnode186.dakao.io as d.rank 1 on device=cuda:1, n_gpu=1, world size=80
    16-bits training: False
    Reading saved model from ./checkpoint/sq_best/dpr_biencoder.49.548
    model_state_dict keys odict_keys(['model_dict', 'optimizer_dict', 'scheduler_dict', 'offset', 'epoch', 'encoder_params'])
    Overriding args parameter value from checkpoint state. Param = do_lower_case, value = True
    Overriding args parameter value from checkpoint state. Param = pretrained_model_cfg, value = bert-base-uncased
    Overriding args parameter value from checkpoint state. Param = encoder_model_type, value = hf_bert
    Overriding args parameter value from checkpoint state. Param = sequence_length, value = 256
     **************** CONFIGURATION ****************
    batch_size                     -->   2200
    ctx_file                       -->   ./data/data/wikipedia_split/psgs_w100.tsv
    device                         -->   cuda:1
    distributed_world_size         -->   80
    do_lower_case                  -->   True
    encoder_model_type             -->   hf_bert
    fp16                           -->   False
    fp16_opt_level                 -->   O1
    local_rank                     -->   1
    model_file                     -->   ./checkpoint/sq_best/dpr_biencoder.49.548
    n_gpu                          -->   1
    no_cuda                        -->   False
    num_shards                     -->   40
    out_file                       -->   ./checkpoint/sq_best/embed_epoch_49
    pretrained_file                -->   None
    pretrained_model_cfg           -->   bert-base-uncased
    projection_dim                 -->   0
    sequence_length                -->   256
    shard_id                       -->   0
    
    • ./checkpoint/sq_best/dpr_biencoder.49.548 was the checkpoint with the highest correct prediction ratio on the dev set during training.
    • I split psgs_w100.tsv into 40 different shards

    Step 4. Retriever validation against the entire set of documents

    • Finally I evaluated the IR accuracy via dense_retriever.py with the following arguments
    Initialized host gpu-cloud-vnode186.dakao.io as d.rank -1 on device=cuda, n_gpu=2, world size=1
    16-bits training: False
     **************** CONFIGURATION ****************
    batch_size                     -->   1800
    ctx_file                       -->   data/data/wikipedia_split/psgs_w100.tsv
    device                         -->   cuda
    distributed_world_size         -->   1
    do_lower_case                  -->   False
    encoded_ctx_file               -->   checkpoint/sq_best/embed_epoch_49_*
    encoder_model_type             -->   None
    fp16                           -->   False
    fp16_opt_level                 -->   O1
    hnsw_index                     -->   False
    index_buffer                   -->   50000
    local_rank                     -->   -1
    match                          -->   string
    model_file                     -->   checkpoint/sq_best/dpr_biencoder.49.548
    n_docs                         -->   100
    n_gpu                          -->   2
    no_cuda                        -->   False
    out_file                       -->   checkpoint/sq_best/eval_test_epoch_49_top_100.json
    pretrained_file                -->   None
    pretrained_model_cfg           -->   None
    projection_dim                 -->   0
    qa_file                        -->   data/data/retriever/qas/squad1-test.csv
    save_or_load_index             -->   False
    sequence_length                -->   256
    
    • And the printed result was
    Total data indexed 21015320
    Data indexing completed.
    Encoded queries 1800
    Encoded queries 3600
    Encoded queries 5400
    Encoded queries 7200
    Encoded queries 9000
    Total encoded queries tensor torch.Size([10570, 768])
    index search time: 1522.145621 sec.
    Reading data from: data/data/wikipedia_split/psgs_w100.tsv
    Matching answers in top docs...
    Per question validation results len=10570
    Validation results: top k documents hits [1574, 2290, 2766, 3134, 3419, 3687, 3900, 4068, 4233, 4379, 4508, 4627, 4734, 4832, 4921, 5005, 5106, 5196, 5260, 5311,
    5384, 5454, 5501, 5565, 5627, 5675, 5720, 5768, 5817, 5873, 5917, 5956, 5998, 6037, 6078, 6128, 6158, 6187, 6223, 6252, 6287, 6322, 6347, 6370, 6396, 6422, 6444,
    6470, 6501, 6530, 6564, 6591, 6619, 6638, 6661, 6687, 6720, 6742, 6758, 6777, 6794, 6808, 6822, 6840, 6861, 6890, 6905, 6925, 6938, 6960, 6982, 7003, 7017, 7042,
    7058, 7072, 7087, 7100, 7113, 7126, 7138, 7148, 7162, 7177, 7188, 7204, 7213, 7230, 7243, 7261, 7272, 7285, 7297, 7305, 7316, 7330, 7348, 7359, 7377, 7389]
    Validation results: top k documents hits accuracy [0.1489120151371807, 0.21665089877010407, 0.26168401135288555, 0.2964995269631031, 0.32346263008514664, 0.348817
    4077578051, 0.36896877956480606, 0.3848628192999054, 0.40047303689687797, 0.4142857142857143, 0.42649006622516555, 0.4377483443708609, 0.4478713339640492, 0.45714
    285714285713, 0.46556291390728477, 0.4735099337748344, 0.48306527909176916, 0.49157994323557236, 0.49763481551561023, 0.5024597918637653, 0.5093661305581836, 0.51
    59886471144749, 0.5204351939451277, 0.5264900662251656, 0.5323557237464522, 0.5368968779564806, 0.5411542100283823, 0.5456953642384106, 0.5503311258278145, 0.5556
    291390728477, 0.5597918637653737, 0.5634815515610218, 0.5674550614947966, 0.5711447492904447, 0.5750236518448439, 0.5797540208136235, 0.5825922421948913, 0.585335
    8561967833, 0.5887417218543046, 0.5914853358561968, 0.5947965941343425, 0.5981078524124882, 0.6004730368968779, 0.6026490066225165, 0.605108798486282, 0.607568590
    3500473, 0.6096499526963103, 0.6121097445600757, 0.615042573320719, 0.6177861873226111, 0.6210028382213812, 0.6235572374645222, 0.6262062440870388, 0.628003784295
    175, 0.6301797540208136, 0.632639545884579, 0.6357615894039735, 0.6378429517502365, 0.639356669820246, 0.6411542100283822, 0.6427625354777673, 0.6440870387890255,
     0.6454115421002838, 0.6471144749290445, 0.6491012298959319, 0.651844843897824, 0.6532639545884579, 0.6551561021759698, 0.6563859981078524, 0.6584673604541155, 0.
    6605487228003785, 0.6625354777672658, 0.6638599810785242, 0.6662251655629139, 0.6677388836329233, 0.6690633869441817, 0.6704824976348155, 0.6717123935666982, 0.67
    29422894985809, 0.6741721854304635, 0.6753074739829706, 0.6762535477767266, 0.6775780510879849, 0.6789971617786187, 0.6800378429517502, 0.6815515610217597, 0.6824
    0302743614, 0.6840113528855251, 0.6852412488174078, 0.6869441816461684, 0.6879848628192999, 0.6892147587511825, 0.6903500473036897, 0.6911069063386944, 0.69214758
    75118259, 0.6934720908230843, 0.6951750236518448, 0.6962157048249763, 0.697918637653737, 0.6990539262062441]
    
    Top-20: 50.246%
    Top-100: 69.905%
    
    Saved results * scores  to checkpoint/sq_best/eval_test_epoch_49_top_100.json
    

    Issue

    • This result is not consistent with the values reported in the paper
      • You reported that the Top-20 and Top-100 accuracy for SQuAD under Single Mode is 63.2 and 77.2, respectively.
    • Issues #62 and #93 also report that they could not reproduce the SQuAD result
    • However, with the same code and hyperparameters, I was able to (almost) reproduce the result for the Trivia QA dataset.
      • Reproduced Trivia QA Top-20/100 accuracy : 79.3/84.9
      • Reported Trivia QA Top-20/100 accuracy : 79.4/85.0
    • Therefore I suspect that there might be some difference between the shared SQuAD dataset and the one you actually used.

    Please let me know if you find something wrong. Thank you!

    opened by robinsongh381 16
  • Minor suggestion on the trivia-train.json

    Hi there,

    I was finetuning the DPR model on the trivia dataset but found that there were many entries in trivia-train.json that contain no positive contexts. As nq-train.json has no empty entries (it has been cleaned up), I think it'd be better to clean up trivia as well for consistency. Otherwise, it might lead to an index mismatch if pre-trained query embeddings are used instead of the query encoder.

    Best,

    enhancement 
    opened by alexlimh 13
  • Best results reproduction instruction

    Hello,

    I am trying to train a model based on your instructions and tried to run train_dense_encoder.py. In the instructions you refer to --dev_file {path to downloaded data.retriever.qas.nq-dev resource} but it is unclear which file you mean.

    Is it retriever/qas/nq-dev.csv or retriever/nq-dev.json? The first option fails as the code expects a json file but the second one doesn't seem like a "retriever.qas" resource based on its name.

    opened by iftachg 12
  • Seeking KILT meta-data for DPR

    Hi, I notice some updates have been made here to facilitate the KILT dataset format. In addition, I wonder if the below meta-data from KILT can be also shared here:

    1. the 22,220,793 passages split from the KILT knowledge source
    2. the corresponding passage_id of positive and negative passages for each query of the NQ dataset (mined by the DPR checkpoint)

    Both meta-data are necessary for reproducing or improving DPR on KILT, and I think it will be more convenient for people to follow up if the above meta-data are shared. Thanks.

    opened by jzhoubu 11
  • Question about reader training.

    Hi, I'm here again :)

    I tried to use test data constructed from my retrieved passages on the NQ dataset to test the reader model trained with your provided training data, but the results are not very good, even though the retrieval performance is quite good.

    I feel that the problem may be that the training data does not match my data, so I would like to ask how your reader training data is structured. For example, what is the query and the passage source?

    Thank you!

    opened by ReyonRen 11
  • dense_retriever -- MemoryError: std::bad_alloc

    Hi! It seems that no matter what value I set index_buffer to, I get the following error when running dense_retriever.py:

    Traceback (most recent call last):
      File "dense_retriever.py", line 331, in <module>
        main(args)
      File "dense_retriever.py", line 268, in main
        retriever.index_encoded_data(input_paths, buffer_size=index_buffer_sz)
      File "dense_retriever.py", line 100, in index_encoded_data
        self.index.index_data(buffer)
      File "/home/aarchan/qa-aug/qa-aug/dpr/indexer/faiss_indexers.py", line 93, in index_data
        self.index.add(vectors)
      File "/home/aarchan/anaconda2/envs/qa-aug/lib/python3.8/site-packages/faiss/__init__.py", line 138, in replacement_add
        self.add_c(n, swig_ptr(x))
      File "/home/aarchan/anaconda2/envs/qa-aug/lib/python3.8/site-packages/faiss/swigfaiss.py", line 1454, in add
        return _swigfaiss.IndexFlat_add(self, n, x)
    MemoryError: std::bad_alloc
    

    For reference, the machine I'm running this on has 128GB RAM, but it doesn't seem to be enough. Could you please help me with this issue? Thanks!

    opened by aarzchan 11
  • Reproduce Table 3 in paper

    Hi, all.

    I am reproducing the second block of Table 3 in the paper but am having problems with Gold #N = 7. I wonder what causes this inconsistency.

    The results I get are shown below.

    | batch_size                | top5  | top20 | top100 |
    | ------------------------- | ----- | ----- | ------ |
    | bsize=8 (Gold #N = 7)     | 44.0% | 64.2% | 78.2%  |
    | bsize=128 (Gold #N = 127) | 57.6% | 74.5% | 84.0%  |

    For Gold #N = 7, the result has a big gap with the paper while it is fine for Gold #N = 127.

    I run the experiment for Gold #N = 7 on a server with 4 V100 GPUs and the experiment for Gold #N = 127 on a server with 8 V100 GPUs. The two servers are with the same CUDA version and virtual env. The only difference is the setting of batch size.

    The script I used for experiments is shown below.

    #!/bin/bash
    
    set -x
    HYDRA_FULL_ERROR=1 python train_dense_encoder.py \
        train_datasets=[nq_train] \
        dev_datasets=[nq_dev] \
        train=biencoder_nq \
        train.batch_size=$1 \
        train.hard_negatives=0 \
        output_dir=./runs/
    

    Note that, I run these experiments in DataParallel (single node - multi gpu) mode.

    opened by yxliu-ntu 9
  • TOP-K results on NQ datasets

    Hi, I ran dense_retriever.py to obtain results on the NQ dataset, but I get the following, which is far behind the results in the paper.

    Validation results: top k documents hits accuracy [0.13268698060941828, 0.17506925207756233, 0.20498614958448755, 0.22548476454293628, 0.24681440443213296, 0.2614958448753463, 0.2742382271468144, 0.2839335180055402, 0.29418282548476454, 0.30193905817174516, 0.3080332409972299, 0.31329639889196675, 0.31994459833795014, 0.3238227146814404, 0.328808864265928, 0.33518005540166207, 0.3371191135734072, 0.3401662049861496, 0.3443213296398892, 0.34709141274238225, 0.3518005540166205, 0.35650969529085874, 0.35983379501385043, 0.3634349030470914, 0.36925207756232686, 0.3717451523545706, 0.3731301939058172, 0.37590027700831025, 0.378393351800554, 0.3797783933518006, 0.3817174515235457, 0.3847645429362881, 0.38725761772853184, 0.3897506925207756, 0.3914127423822715, 0.3930747922437673, 0.3939058171745152, 0.3958448753462604, 0.3972299168975069, 0.4, 0.4005540166204986, 0.4024930747922438, 0.40470914127423824, 0.4069252077562327, 0.40941828254847645, 0.4113573407202216, 0.41301939058171744, 0.4138504155124654, 0.4149584487534626, 0.41606648199445984, 0.41772853185595565, 0.42049861495844876, 0.4224376731301939, 0.42382271468144045, 0.4249307479224377, 0.42548476454293627, 0.4279778393351801, 0.4293628808864266, 0.4310249307479224, 0.4326869806094183, 0.4337950138504155, 0.43490304709141275, 0.4357340720221607, 0.43656509695290857, 0.4373961218836565, 0.43822714681440444, 0.4404432132963989, 0.4409972299168975, 0.44182825484764543, 0.4437673130193906, 0.4451523545706371, 0.44626038781163435, 0.44681440443213294, 0.4473684210526316, 0.44792243767313017, 0.4490304709141274, 0.450415512465374, 0.4506925207756233, 0.4518005540166205, 0.4520775623268698, 0.45290858725761773, 0.4534626038781163, 0.45373961218836567, 0.45457063711911355, 0.4551246537396122, 0.45595567867036013, 0.45706371191135736, 0.45789473684210524, 0.4581717451523546, 0.4587257617728532, 0.4590027700831025, 0.4592797783933518, 0.4598337950138504, 0.46066481994459835, 0.46094182825484764, 0.46121883656509693, 0.4614958448753463, 0.46232686980609417, 0.4628808864265928, 0.4634349030470914]

    python dense_retriever.py
    model_file=[checkpoint]
    qa_dataset=nq_test
    ctx_datatsets=[dpr_wiki]
    encoded_ctx_files=[/home/v-nuochen/DPR/outputs/2021-04-17/08-29-34/nq-generate-emd_0]
    out_file=nq_retrieval_07_08

    The model_file is downloaded from your previous checkpoint, but encoded_ctx_files is generated by myself from generate_dense_embeddings.py with default settings.

    Could you please tell me why?

    opened by cn-boop 9
  • Missing required positional arguments

    Hi, I'm following the instructional code in the readme and ran python train_dense_encoder.py \ train_datasets=[nq_train] \ dev_datasets=[nq_dev] \ train=biencoder_local \ output_dir={path to checkpoints dir}

    after installing the nq_train and nq_dev datasets. However, whenever I run this I get an error in pytorch: torch/nn/modules/module.py", line 1190, in _call_impl return forward_call(*input, **kwargs) TypeError: forward() missing 6 required positional arguments: 'question_ids', 'question_segments', 'question_attn_mask', 'context_ids', 'ctx_segments', and 'ctx_attn_mask'

    I'm not sure what could be causing this.

    opened by grossmanm 1
  • Passages with multiple answer spans

    Hi, I'm trying to implement the reader model training in Tensorflow. How does DPR deal with multiple answer spans in a positive passage? Will there just be multiple positions labelled as the correct start/end position?

    E.g. if there are 2 answer starts in a passage (h_i, h_j), would the model try to maximise the scores for tokens i and j using cross entropy loss?

    opened by TZeng20 0
  • `cosine_scores` defined in biencoder.py does not work

    Reference: https://github.com/facebookresearch/DPR/blob/d9f3e41bb0087687fa182a4d580711188fd82df9/dpr/models/biencoder.py#L57

    F.cosine_similarity will fail to compute the similarity along a specified dimension when the other dimensions differ. For example, if x is a 10x64 tensor, and y is a 20x64 tensor, then it is expected to get a 10x20 matrix when calling cosine_scores. However, that function won't work:

    >>> x = torch.randn(10, 64)
    >>> y = torch.randn(20, 64)
    >>> F.cosine_similarity(x, y, dim=1)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    RuntimeError: The size of tensor a (10) must match the size of tensor b (20) at non-singleton dimension 0
    

    Since it is not used anywhere else in the repo and the paper, maybe it would be a good idea to remove cosine_scores?

    opened by xhluca 0
  • Question in training reader model.

    Hi, I'm trying to reproduce this work and I have trouble training the reader. It's not clear to me which files {path to the retriever train set results file} and {path to the retriever dev set results file} refer to. I am looking forward to any replies.

    opened by gaishun 1
  • Doubt regarding all_gather_list in case of DDP

    Hi,

    Thanks for the amazing framework. I have a doubt regarding the utility of the all_gather_list function, that gathers the tensors across the GPUs. When we are training in DDP, the gradients are synchronized before the parameter updates, therefore, why is this step needed? Is it just to collate the loss or number of correct predictions or the rank (in evaluation)? If yes, then couldn't one gather all of them after computing the loss, instead of exchanging the question and context representations first and then going forward with it?

    Thanks!

    opened by bhattg 0