Optimizing DR with hard negatives and achieving SOTA first-stage retrieval performance on TREC DL Track (SIGIR 2021 Full Paper).

Overview

Optimizing Dense Retrieval Model Training with Hard Negatives

Jingtao Zhan, Jiaxin Mao, Yiqun Liu, Jiafeng Guo, Min Zhang, Shaoping Ma

This repo provides code, retrieval results, and trained models for our SIGIR 2021 full paper, Optimizing Dense Retrieval Model Training with Hard Negatives. The previous version is Learning To Retrieve: How to Train a Dense Retrieval Model Effectively and Efficiently.

We achieve strong retrieval results on both passage and document retrieval benchmarks. The two proposed algorithms, STAR and ADORE, are very efficient. They are well worth trying and will most likely improve your retriever's performance by a large margin.

The following figure shows the pros and cons of different training methods. You can train an effective Dense Retrieval model in three steps. First, warm up your model using random negatives or BM25 top negatives. Second, use our proposed STAR to train both the query encoder and the document encoder. Third, use our proposed ADORE to train the query encoder only.

[Figure: comparison of the pros and cons of different training methods]

Retrieval Results and Trained Models

| Passage Retrieval | Dev MRR@10 | Dev R@100 | Test NDCG@10 | Files |
|---|---|---|---|---|
| Inbatch-Neg | 0.264 | 0.837 | 0.583 | Model |
| Rand-Neg | 0.301 | 0.853 | 0.612 | Model |
| STAR | 0.340 | 0.867 | 0.642 | Model / Train / Dev / TRECTest |
| ADORE (Inbatch-Neg) | 0.316 | 0.860 | 0.658 | Model |
| ADORE (Rand-Neg) | 0.326 | 0.865 | 0.661 | Model |
| ADORE (STAR) | 0.347 | 0.876 | 0.683 | Model / Train / Dev / TRECTest / Leaderboard |

| Doc Retrieval | Dev MRR@100 | Dev R@100 | Test NDCG@10 | Files |
|---|---|---|---|---|
| Inbatch-Neg | 0.320 | 0.864 | 0.544 | Model |
| Rand-Neg | 0.330 | 0.859 | 0.572 | Model |
| STAR | 0.390 | 0.867 | 0.605 | Model / Train / Dev / TRECTest |
| ADORE (Inbatch-Neg) | 0.362 | 0.884 | 0.580 | Model |
| ADORE (Rand-Neg) | 0.361 | 0.885 | 0.585 | Model |
| ADORE (STAR) | 0.405 | 0.919 | 0.628 | Model / Train / Dev / TRECTest / Leaderboard |

If you want to use our first-stage leaderboard runs, contact me and I will send you the file.

If any links fail or the files are broken, please contact me or open an issue.

Requirements

To install requirements, run the following commands:

git clone [email protected]:jingtaozhan/DRhard.git
cd DRhard
python setup.py install

However, you need to set up a separate Python environment for data preprocessing (see below).

Data Download

To download all the needed data, run:

bash download_data.sh

Data Preprocess

You need to set up a new environment with transformers==2.8.0 to tokenize the text, because we find that the tokenizer behaves differently across versions 2, 3, and 4. To replicate the results in our paper with the provided trained models, you must use version 2.8.0 for preprocessing; otherwise, you may need to re-train the DR models.

Run the following commands.

python preprocess.py --data_type 0; python preprocess.py --data_type 1

Inference

With our provided trained models, you can easily replicate our reported experimental results. Note that minor variance may be observed due to environmental differences.

STAR

The following commands use the provided STAR model to compute query/passage embeddings and perform similarity search on the dev set. (You can pass the --faiss_gpus option to use GPUs for much faster similarity search.)

python ./star/inference.py --data_type passage --max_doc_length 256 --mode dev   
python ./star/inference.py --data_type doc --max_doc_length 512 --mode dev   
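
Under the hood, this step is an exact inner-product nearest-neighbor search over the pre-computed embeddings. Below is a minimal, illustrative faiss sketch; the array names, shapes, and the flat index type are assumptions for illustration, not the repo's actual variables:

import faiss
import numpy as np

# Hypothetical embeddings from a RoBERTa-base dual encoder (768-dimensional).
passage_embeddings = np.random.rand(1000, 768).astype("float32")  # (num_passages, dim)
query_embeddings = np.random.rand(8, 768).astype("float32")       # (num_queries, dim)

index = faiss.IndexFlatIP(768)                      # exact inner-product (dot-product) search
index.add(passage_embeddings)                       # index all passage vectors
scores, ids = index.search(query_embeddings, 100)   # top-100 passage ids per query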

Run the following command to evaluate on the MSMARCO Passage dataset.

python ./msmarco_eval.py ./data/passage/preprocess/dev-qrel.tsv ./data/passage/evaluate/star/dev.rank.tsv
Eval Started
#####################
MRR @10: 0.3404237731386721
QueriesRanked: 6980
#####################

Run the following command to evaluate on the MSMARCO Document dataset.

python ./msmarco_eval.py ./data/doc/preprocess/dev-qrel.tsv ./data/doc/evaluate/star/dev.rank.tsv 100
Eval Started
#####################
MRR @100: 0.3903422772218344
QueriesRanked: 5193
#####################
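
For reference, MRR@k (the metric printed above) is the mean, over all judged queries, of the reciprocal rank of the first relevant document within the top k results. A tiny illustrative computation on toy data (not tied to the repo's file formats):

def mrr_at_k(run, qrels, k=10):
    # run: {qid: [pid, ...]} ranked results; qrels: {qid: {relevant pids}}
    total = 0.0
    for qid, ranking in run.items():
        for rank, pid in enumerate(ranking[:k], start=1):
            if pid in qrels.get(qid, set()):
                total += 1.0 / rank
                break
    return total / len(run)

print(mrr_at_k({"q1": ["p3", "p7", "p1"]}, {"q1": {"p7"}}))  # 0.5: the first relevant hit is at rank 2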

ADORE

ADORE only computes the query embeddings; the document embeddings are pre-computed by another DR model, such as STAR. The following commands use the provided ADORE (STAR) model to compute query embeddings and perform similarity search on the dev set. (You can pass the --faiss_gpus option to use GPUs for much faster similarity search.)

python ./adore/inference.py --model_dir ./data/passage/trained_models/adore-star --output_dir ./data/passage/evaluate/adore-star --preprocess_dir ./data/passage/preprocess --mode dev --dmemmap_path ./data/passage/evaluate/star/passages.memmap
python ./adore/inference.py --model_dir ./data/doc/trained_models/adore-star --output_dir ./data/doc/evaluate/adore-star --preprocess_dir ./data/doc/preprocess --mode dev --dmemmap_path ./data/doc/evaluate/star/passages.memmap

Evaluate the ADORE (STAR) model on the dev passage dataset:

python ./msmarco_eval.py ./data/passage/preprocess/dev-qrel.tsv ./data/passage/evaluate/adore-star/dev.rank.tsv

You will get

Eval Started
#####################
MRR @10: 0.34660697230181425
QueriesRanked: 6980
#####################

Evaluate the ADORE (STAR) model on the dev document dataset:

python ./msmarco_eval.py ./data/doc/preprocess/dev-qrel.tsv ./data/doc/evaluate/adore-star/dev.rank.tsv 100

You will get

Eval Started
#####################
MRR @100: 0.4049777020859768
QueriesRanked: 5193
#####################

Convert QID/PID Back

Our data preprocessing assigns new ids to each query and document. Therefore, you may want to convert the ids back to the official MSMARCO ids. We provide a script for this.

The following commands show an example of converting ADORE-STAR's ranking results on the dev passage dataset and evaluating them against the official qrels.

python ./cvt_back.py --input_dir ./data/passage/evaluate/adore-star/ --preprocess_dir ./data/passage/preprocess --output_dir ./data/passage/official_runs/adore-star --mode dev --dataset passage
python ./msmarco_eval.py ./data/passage/dataset/qrels.dev.small.tsv ./data/passage/official_runs/adore-star/dev.rank.tsv

You will get

Eval Started
#####################
MRR @10: 0.34660697230181425
QueriesRanked: 6980
#####################
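
Conceptually, this conversion is a simple lookup from the internal ids assigned during preprocessing back to the official MSMARCO ids. A minimal sketch of the idea; the mapping dictionaries and file names below are hypothetical, not the repo's exact layout:

# Hypothetical id maps produced during preprocessing (internal id -> official id).
new2old_qid = {0: "1048585", 1: "1048642"}
new2old_pid = {0: "7187158", 1: "841321"}

with open("dev.rank.tsv") as fin, open("dev.rank.official.tsv", "w") as fout:
    for line in fin:
        qid, pid, rank = line.split()
        fout.write(f"{new2old_qid[int(qid)]}\t{new2old_pid[int(pid)]}\t{rank}\n")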

Train

In the following instructions, we show how to replicate our experimental results on the MSMARCO Passage Retrieval task.

STAR

We use the same warmup model as ANCE, the most competitive baseline, to enable a fair comparison. Please download it and extract it to ./data/passage/warmup.

Next, we use this warmup model to extract static hard negatives, which will be utilized by STAR.

python ./star/prepare_hardneg.py \
--data_type passage \
--max_query_length 32 \
--max_doc_length 256 \
--mode dev \
--topk 200

It will automatically use all available GPUs to retrieve documents. If the total available CUDA memory is less than 26GB (the index size), you can add --not_faiss_cuda to use the CPU for retrieval.
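
In spirit, a static hard negative is a highly-ranked but non-relevant document retrieved once by the warmup model and then kept fixed for the rest of training. A rough illustrative sketch (function and variable names are assumptions, not the repo's implementation):

def build_static_hardnegs(retrieved, qrels, topk=200):
    # retrieved: {qid: [pid, ...]} top-ranked pids from the warmup retriever
    # qrels: {qid: {relevant pids}}
    return {
        qid: [p for p in pids[:topk] if p not in qrels.get(qid, set())]
        for qid, pids in retrieved.items()
    }

hardnegs = build_static_hardnegs({"q1": ["p9", "p2", "p5"]}, {"q1": {"p2"}})
print(hardnegs)  # {'q1': ['p9', 'p5']} -- this pool stays fixed throughout STAR training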

Run the following command to train the DR model with STAR. In our experiments, we only use one GPU to train.

python ./star/train.py --do_train \
    --max_query_length 24 \
    --max_doc_length 120 \
    --preprocess_dir ./data/passage/preprocess \
    --hardneg_path ./data/passage/warmup_retrieve/hard.json \
    --init_path ./data/passage/warmup \
    --output_dir ./data/passage/star_train/models \
    --logging_dir ./data/passage/star_train/log \
    --optimizer_str lamb \
    --learning_rate 1e-4 \
    --gradient_checkpointing --fp16

Although we set the number of training epochs to a very large value in the script, training is likely to converge within 50k steps (about 1.5 days), and you can manually kill the process at that point. Using multiple GPUs should speed things up considerably, but requires some changes to the code.
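
For intuition, STAR trains the dual encoder with both the static hard negatives prepared above and in-batch random negatives. The sketch below shows one generic way to combine the two signals in a dual-encoder step; it is illustrative only and not necessarily the exact loss implemented in ./star/train.py:

import torch
import torch.nn.functional as F

def star_style_loss(q, pos, hard_neg):
    # q, pos, hard_neg: (batch, dim) embeddings of queries, positive docs, and static hard negatives
    inbatch_scores = q @ pos.t()                              # (batch, batch); diagonal = own positive
    hard_scores = (q * hard_neg).sum(dim=-1, keepdim=True)    # (batch, 1); one hard negative per query
    scores = torch.cat([inbatch_scores, hard_scores], dim=1)  # other queries' positives act as random negatives
    labels = torch.arange(q.size(0), device=q.device)         # the correct "class" is the diagonal entry
    return F.cross_entropy(scores, labels)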

ADORE

Now we show how to use ADORE to finetune the query encoder. Here we use our provided STAR checkpoint as the fixed document encoder. You can also use another document encoder.

The passage embeddings produced by STAR should be located at ./data/passage/evaluate/star/passages.memmap. If they are not, follow the STAR inference procedure shown above.

python ./adore/train.py \
--metric_cut 200 \
--init_path ./data/passage/trained_models/star \
--pembed_path ./data/passage/evaluate/star/passages.memmap \
--model_save_dir ./data/passage/adore_train/models \
--log_dir ./data/passage/adore_train/log \
--preprocess_dir ./data/passage/preprocess \
--model_gpu_index 0 \
--faiss_gpu_index 1 2 3

The above command uses the first GPU for encoding and the 2nd–4th GPUs for dense retrieval. You can change the faiss_gpu_index values based on your available CUDA memory. For example, if you have a 32GB GPU, you can set model_gpu_index and faiss_gpu_index both to 0 because the CUDA memory is large enough. But if you only have 11GB GPUs, three GPUs are required for faiss.

Empirically, ADORE significantly improves retrieval performance after training for only one epoch, which takes about one hour when GPUs are used to retrieve the dynamic hard negatives.
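
Conceptually, each ADORE step retrieves with the current query encoder against the frozen passage embeddings, treats the retrieved non-relevant passages as dynamic hard negatives, and updates the query encoder only. A simplified illustrative loop follows; the names and the pairwise hinge loss are assumptions, not the exact implementation in ./adore/train.py:

import torch
import torch.nn.functional as F

def adore_step(query_encoder, optimizer, batch, doc_embeddings, qrels, topk=200):
    # doc_embeddings: (num_docs, dim) frozen tensor from the fixed document encoder
    q = query_encoder(**batch["inputs"])                     # (batch, dim), trainable
    scores, ids = torch.topk(q @ doc_embeddings.t(), topk)   # retrieve with the *current* queries
    loss = q.new_zeros(())
    for i, qid in enumerate(batch["qids"]):
        pos = torch.as_tensor(sorted(qrels[qid]), device=q.device)
        pos_scores = q[i] @ doc_embeddings[pos].t()          # scores of the relevant docs
        neg_scores = scores[i][~torch.isin(ids[i], pos)]     # retrieved docs that are not relevant
        # pairwise hinge over (positive, dynamic hard negative) pairs
        loss = loss + F.relu(1.0 - pos_scores.unsqueeze(1) + neg_scores.unsqueeze(0)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()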

Comments
  • RobertaDot_NLL_LN class not defined?

    Hi, Jingtao,

    I find that the released adore model is not defined in your code.

    The config.json file indicates that the model architecture is "RobertaDot_NLL_LN"; however, it does not seem to be defined in model.py.

    opened by ylwangy 5
  • keyError of rankdict

    Hi, Jingtao,

    When I train the STAR model, there is an error:

    File "./dataset.py", line 176, in __getitem__
        hardpids = random.sample(self.rankdict[str(qid)], self.hard_num)
    KeyError: '18337'

    The valid keys of rankdict should be 1, 2, ..., 6980. Am I right?

    opened by ylwangy 3
  • About the transformers library version

    Hello, thank you for open-sourcing your code!

    While trying to run it, I noticed that the README says transformers 2.8.0 should be used because the tokenizer behaves differently in version 3 and above, yet setup.py pins transformers==3.4.0. Why is that, and which version of the transformers library should I use?

    opened by dhx20150812 3
  • A possible bug in Data Process

    Thanks for releasing this clean and easy-to-follow code! It is very helpful to dense retrieval researchers like me.

    However, I may have found a bug in preprocess.py. Since the model is initialized from RoBERTa-base, the native pad token id is 1, not 0. In this file, however, the pad token is set to 0 (the [CLS] token id), which may be why the pre-trained checkpoint cannot reproduce the results reported in your paper. https://github.com/jingtaozhan/DRhard/blob/main/preprocess.py#L23

    opened by Lancelot39 2
  • Why is there a cnt variable in get_collate_function?

    In https://github.com/jingtaozhan/DRhard/blob/dc17f3d1f7f59d13d15daa1a728dc8d6efc48b92/dataset.py, if we take a look at the data collator,

    def get_collate_function(max_seq_length):
        cnt = 0
        def collate_function(batch):
            nonlocal cnt
            length = None
            if cnt < 10:
                length = max_seq_length
                cnt += 1
    
            input_ids = [x["input_ids"] for x in batch]
            attention_mask = [x["attention_mask"] for x in batch]
            data = {
                "input_ids": pack_tensor_2D(input_ids, default=1, 
                    dtype=torch.int64, length=length),
                "attention_mask": pack_tensor_2D(attention_mask, default=0, 
                    dtype=torch.int64, length=length),
            }
            ids = [x['id'] for x in batch]
            return data, ids
        return collate_function  
    

    we see that there is a cnt variable which decides whether collate_function should pad to the full max_seq_length or not. I couldn't figure out why it is needed. Could you please explain the significance of cnt?

    Thank you AM

    opened by ari9dam 2
  • Reproduce results

    Hi Jingtao,

    I am trying to reproduce the results shown in the README. The models are downloaded from Google Drive. For the transformers version, I used 2.8.0 for preprocessing and 4.8.2 for inference.

    I ran the following commands:

    python ./star/inference.py --data_type passage --max_doc_length 256 --mode dev
    python ./msmarco_eval.py ./data/passage/preprocess/dev-qrel.tsv ./data/passage/evaluate/star/dev.rank.tsv

    And I got the following results:

    Eval Started
    #####################
    MRR @10: 0.010382669304589082
    QueriesRanked: 6980
    #####################

    Could you help me figure out what I did wrong? Thanks!

    opened by laos1984 2
  • Import error when doing STAR inference

    There is an ImportError when I try to replicate your work:

    $python ./star/inference.py --data_type passage --max_doc_length 256 --mode dev
    Traceback (most recent call last):
      File "./star/inference.py", line 15, in <module>
        from model import RobertaDot
      File "/home/yicheng.fyc/DRhard/./model.py", line 10, in <module>
        from transformers.modeling_roberta import RobertaPreTrainedModel
    ImportError: cannot import name 'RobertaPreTrainedModel' from 'transformers.modeling_roberta' (/home/yicheng.fyc/miniconda2/envs/adore/lib/python3.8/site-packages/transformers-2.8.0-py3.8.egg/transformers/modeling_roberta.py)
    

    It happens with both transformers 2.8.0 and 3.4.0. Upgrading transformers to 4.2 fixes this problem, but it leads to a huge gap in MRR.

    opened by yitsingF 2
  • About the length of tokens

    Hello,

    I have read your paper and am quite interested in your work! I have a question about the tokens: I notice you truncate passages to 120 tokens for MSMARCO Passage Retrieval, whereas the original ANCE paper uses 512 tokens. Does the number of tokens have an impact on accuracy?

    opened by KaishuaiXu 2
  • How did you evaluate on the TREC 2019 test set?

    Hi,

    I can't find the instructions to replicate the nDCG performance on TREC 2019. Could you tell me how to run the evaluation on the TREC 2019 test set?

    Thanks.

    opened by jordane95 1
  • Evaluation on test passage dataset

    Hello, I found the result of the provided inbatch-neg model on the test dataset is very poor. Is the TREC DL Passage data the test dataset?
    What should I do to reproduce the NDCG@10 and R@100 on the TREC DL dataset?

    • Command: python ./msmarco_eval.py ./data/passage/preprocess/test-qrel.tsv ./data/passage/evaluate/download_inbatch/test.rank.tsv

    • Results:

      Eval Started
      #####################
      MRR @10: 0.04559967102039234
      QueriesRanked: 43
      #####################

    opened by staoxiao 1
  • RepBERT

    Is the generation of passage embeddings in this program the same as in the RepBERT program?

    python precompute.py --load_model_path ./data/ckpt-350000 --task doc
    python precompute.py --load_model_path ./data/ckpt-350000 --task query_dev.small
    python precompute.py --load_model_path ./data/ckpt-350000 --task query_eval.small

    opened by wangjiajia5889758 1
  • Training setup of ANCE and STAR

    Hi, thank you for publishing the code for your interesting paper. I was trying to reproduce the STAR results in the ANCE setup, i.e. using static hard negatives and in-batch negatives, but I am unable to reach an MRR@10 of 0.34. Also, the STAR checkpoint provided in this repo does not produce an MRR@10 of 0.34 when evaluated with the ANCE repo; I get an MRR@10 of 0.299 instead. I see there are some differences between the training setups in your repo and the ANCE one. Can you please highlight them?

    opened by ranonrkm 1
  • Combined loss implementation

    Hi, I am trying to understand how you combined the hard negative loss Ls with the in-batch random negative loss Lr. In the paper, the in-batch random negative loss is scaled by an alpha hyperparameter, but there is no mention of the value of alpha used in the experiments.

    Following star/train.py, I found the RobertaDot_InBatch model, whose forward function calls the inbatch_train method.

    At the end of the inbatch_train method (line 182), I found

    return ((first_loss + second_loss) / (first_num + second_num),)
    

    which is different from the combined loss proposed in the paper (Eq. 13).

    Am I missing something?

    Also, for each query in the batch, did you consider all the possible in-batch random negatives or just one?

    Thanks in advance!

    opened by AmenRa 7