Improving Query Representations for Dense Retrieval with Pseudo Relevance Feedback: A Reproducibility Study.

APR

The repo for the paper Improving Query Representations for Dense Retrieval with Pseudo Relevance Feedback: A Reproducibility Study.

Environment setup

To reproduce the results in the paper, we rely on two open-source IR toolkits: Pyserini and tevatron.

We cloned, merged, and modified the two toolkits in this repo and use them to train the PRF models and run inference with them. To set up the environment, follow the installation instructions in the original GitHub repos:

Install Pyserini: https://github.com/castorini/pyserini/blob/master/docs/installation.md.

Install tevatron: https://github.com/texttron/tevatron#installation.

You also need the MS MARCO passage ranking dataset, including the collection and queries. We refer to the official GitHub repo for downloading the data.

To reproduce ANCE-PRF inference results with the original model checkpoint

The code, dataset, and model for reproducing the ANCE-PRF results presented in the original paper:

HongChien Yu, Chenyan Xiong, Jamie Callan. Improving Query Representations for Dense Retrieval with Pseudo Relevance Feedback

have been merged into the Pyserini source. You only need to follow the corresponding Pyserini instructions, which cover downloading the dataset, the model checkpoint (provided by the original authors), and the dense index, as well as running PRF inference.

To train dense retriever PRF models

We use tevatron to train the dense retriever PRF query encoders that we investigated in the paper.

First, you need run files for the training queries in order to build the hard negative training set for each dense retriever (DR).

You can use Pyserini to generate run files for ANCE, TCT-ColBERTv2 and DistilBERT KD TASB by changing the query set flag --topics to queries.train.tsv.

Once you have the run file, cd to /tevatron and run the following, setting --model-type to ANCE, TCT, or DistilBERT:

python make_train_from_ranking.py \
    --ranking-file /path/to/train/run \
    --model-type ANCE \
    --output /path/to/save/hard/negative
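
To give a rough idea of what building a hard negative training set from a run file involves, here is a minimal sketch. It is not the repo's make_train_from_ranking.py: the function names, the tab-separated "qid, docid, rank" run format, the qrels dictionary, and the output fields are all assumptions made for illustration. For each judged query it keeps the top-ranked passages as PRF feedback and samples negatives from the ranked passages that are not judged relevant.

```python
import random
from collections import defaultdict


def read_run(lines):
    """Parse 'qid<TAB>docid<TAB>rank' lines into {qid: [docid, ...]} in rank order."""
    ranked = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        qid, docid, rank = line.split("\t")
        ranked[qid].append((int(rank), docid))
    return {qid: [d for _, d in sorted(pairs)] for qid, pairs in ranked.items()}


def build_examples(run, qrels, prf_k=3, n_negatives=20, seed=0):
    """Build one (hypothetical) training example per judged query.

    run:   {qid: [docid, ...]} ranked list from the first-stage retriever
    qrels: {qid: set of relevant docids}
    """
    rng = random.Random(seed)
    examples = []
    for qid, docids in run.items():
        positives = qrels.get(qid)
        if not positives:
            # Skip queries without relevance judgments instead of failing.
            continue
        candidates = [d for d in docids if d not in positives]
        negatives = rng.sample(candidates, min(n_negatives, len(candidates)))
        examples.append({
            "qid": qid,
            "prf_passages": docids[:prf_k],
            "positives": sorted(positives),
            "negatives": negatives,
        })
    return examples


# Toy data: q1 is judged, q2 is not and is therefore skipped.
run = read_run(["q1\td3\t2", "q1\td7\t1", "q1\td9\t3", "q2\td1\t1"])
examples = build_examples(run, {"q1": {"d9"}}, prf_k=2, n_negatives=2)
print(examples)
```

Skipping unjudged queries with qrels.get(qid) is a design choice of this sketch; the actual script may handle them differently.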

Apart from the hard negative training set, you also need the original DR query encoder model checkpoints to initialize the model weights. You can download them from the Hugging Face model hub: ance, tct_colbert-v2-hnp-msmarco, distilbert-dot-tas_b-b256-msmarco. Name each folder that contains a model exactly as in its Hugging Face model hub link.
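
Since the folder names matter, a quick check before training can save a failed run. The sketch below is only an illustration: the helper name and the idea of keeping all three folders under one root directory are assumptions, and the folder names are the three model names listed above.

```python
from pathlib import Path

# Folder names copied from the model names listed above.
EXPECTED_MODEL_FOLDERS = (
    "ance",
    "tct_colbert-v2-hnp-msmarco",
    "distilbert-dot-tas_b-b256-msmarco",
)


def missing_model_folders(root):
    """Return the expected model folder names that are not directories under root."""
    root = Path(root)
    return [name for name in EXPECTED_MODEL_FOLDERS if not (root / name).is_dir()]
```

For example, missing_model_folders("./models") returns an empty list once all three checkpoints have been downloaded into ./models (a hypothetical location).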

After you have generated the hard negative training set and downloaded all the models, you can kick off the training for DR-PRF query encoders by:

python -m torch.distributed.launch \
    --nproc_per_node=2 \
    -m tevatron.driver.train \
    --output_dir /path/to/save/model/checkpoints \
    --model_name_or_path /path/to/model/folder \
    --do_train \
    --save_steps 5000 \
    --train_dir /path/to/hard/negative \
    --fp16 \
    --per_device_train_batch_size 32 \
    --learning_rate 1e-6 \
    --num_train_epochs 10 \
    --train_n_passages 21 \
    --q_max_len 512 \
    --dataloader_num_workers 10 \
    --warmup_steps 5000 \
    --add_pooler
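
To see what these flags imply for the size of each training step, the arithmetic can be sketched as below. It assumes no gradient accumulation and tevatron's convention that --train_n_passages counts one positive passage plus the sampled negatives per query; the helper name is made up for illustration.

```python
def batch_shape(nproc_per_node, per_device_train_batch_size, train_n_passages):
    """Return (queries per step, passages per step) summed over all processes."""
    queries = nproc_per_node * per_device_train_batch_size
    passages = queries * train_n_passages
    return queries, passages


queries, passages = batch_shape(nproc_per_node=2,
                                per_device_train_batch_size=32,
                                train_n_passages=21)
print(queries, passages)  # 64 1344
```

Under these assumptions each query is paired with 1 positive and 20 hard negatives, so changing --nproc_per_node or the per-device batch size changes the effective batch size accordingly.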

To run inference with dense retriever PRF models

Install Pyserini by following the instructions within pyserini/README.md

Then run:

# --encoder is used for first-round retrieval; --prf-encoder generates the PRF query.
python -m pyserini.dsearch --topics /path/to/query/tsv/file \
    --index /path/to/index \
    --encoder /path/to/encoder \
    --batch-size 64 \
    --output /path/to/output/run/file \
    --prf-method tctv2-prf \
    --threads 12 \
    --sparse-index msmarco-passage \
    --prf-encoder /path/to/encoder \
    --prf-depth 3

An example would be:

python -m pyserini.dsearch --topics ./data/msmarco-test2020-queries.tsv \
    --index ./dindex-msmarco-passage-tct_colbert-v2-hnp-bf \
    --encoder ./tct_colbert_v2_hnp \
    --batch-size 64 \
    --output ./runs/tctv2-prf3.res \
    --prf-method tctv2-prf \
    --threads 12 \
    --sparse-index msmarco-passage \
    --prf-encoder ./tct-colbert-v2-prf3/checkpoint-10000 \
    --prf-depth 3

Alternatively, you can use the pre-built indexes and models available in Pyserini:

python -m pyserini.dsearch --topics dl19-passage \
    --index msmarco-passage-tct_colbert-v2-hnp-bf \
    --encoder castorini/tct_colbert-v2-hnp-msmarco \
    --batch-size 64 \
    --output ./runs/tctv2-prf3.res \
    --prf-method tctv2-prf \
    --threads 12 \
    --sparse-index msmarco-passage \
    --prf-encoder ./tct-colbert-v2-prf3/checkpoint-10000 \
    --prf-depth 3

The value of --prf-depth must match the depth the PRF encoder was trained with: an encoder trained with PRF depth 3 can only be used with --prf-depth 3.
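
The reason the depth is tied to the trained encoder becomes clearer when looking at the encoder input. The sketch below assumes the PRF encoder consumes the query text concatenated with the top feedback passages, separated by [SEP], as described for ANCE-PRF in the original paper. The function name is hypothetical, and whitespace splitting stands in for the encoder's real subword tokenizer.

```python
def build_prf_input(query, feedback_passages, prf_depth=3, max_len=512):
    """Concatenate the query with the top `prf_depth` feedback passages.

    Whitespace tokens stand in for the encoder's real subword tokenizer.
    """
    if len(feedback_passages) < prf_depth:
        raise ValueError("not enough feedback passages for the requested depth")
    tokens = ["[CLS]"] + query.split() + ["[SEP]"]
    for passage in feedback_passages[:prf_depth]:
        tokens += passage.split() + ["[SEP]"]
    # Truncate to the maximum query length (cf. --q_max_len 512 in training).
    return tokens[:max_len]


tokens = build_prf_input("what is prf",
                         ["p one", "p two", "p three", "p four"],
                         prf_depth=3)
print(len(tokens))  # 14
```

An encoder trained on inputs with three feedback passages has only seen inputs of this shape, which is why a different depth at inference time is not expected to work well.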

Where --topics can be:
  • TREC DL 2019 Passage: dl19-passage
  • TREC DL 2020 Passage: dl20
  • MS MARCO Passage V1: msmarco-passage-dev-subset

--encoder can be:
  • ANCE: castorini/ance-msmarco-passage
  • TCT-ColBERT V2 HN+: castorini/tct_colbert-v2-hnp-msmarco
  • DistilBERT Balanced: sebastian-hofstaetter/distilbert-dot-tas_b-b256-msmarco

--index can be:
  • ANCE index with MS MARCO V1 passage collection: msmarco-passage-ance-bf
  • TCT-ColBERT V2 HN+ index with MS MARCO V1 passage collection: msmarco-passage-tct_colbert-v2-hnp-bf
  • DistilBERT Balanced index with MS MARCO V1 passage collection: msmarco-passage-distilbert-dot-tas_b-b256-bf
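
Because each encoder must be paired with the index built from the same model, it can help to keep the pairs in one place. The sketch below assembles the dsearch arguments from such a table. The short keys ("ance", "tct", "distilbert") and the function name are made up for illustration; the encoder and index names, and the fixed flag values, are copied from the commands and options above. The PRF method is passed in by the caller because only tctv2-prf is shown in this README.

```python
import shlex

# (encoder, pre-built index) pairs copied from the options listed above.
PREBUILT = {
    "ance": ("castorini/ance-msmarco-passage",
             "msmarco-passage-ance-bf"),
    "tct": ("castorini/tct_colbert-v2-hnp-msmarco",
            "msmarco-passage-tct_colbert-v2-hnp-bf"),
    "distilbert": ("sebastian-hofstaetter/distilbert-dot-tas_b-b256-msmarco",
                   "msmarco-passage-distilbert-dot-tas_b-b256-bf"),
}


def dsearch_command(model, topics, prf_method, prf_encoder, prf_depth, output):
    """Return the argument list for a pyserini.dsearch PRF run."""
    encoder, index = PREBUILT[model]
    return [
        "python", "-m", "pyserini.dsearch",
        "--topics", topics,
        "--index", index,
        "--encoder", encoder,
        "--batch-size", "64",
        "--output", output,
        "--prf-method", prf_method,
        "--threads", "12",
        "--sparse-index", "msmarco-passage",
        "--prf-encoder", prf_encoder,
        "--prf-depth", str(prf_depth),
    ]


cmd = dsearch_command("tct", "dl19-passage", "tctv2-prf",
                      "./tct-colbert-v2-prf3/checkpoint-10000", 3,
                      "./runs/tctv2-prf3.res")
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True), which avoids shell quoting and line-continuation issues.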

To evaluate the run:

TREC DL 2019

python -m pyserini.eval.trec_eval -c -m ndcg_cut.10 -m recall.1000 -l 2 dl19-passage ./runs/tctv2-prf3.res

TREC DL 2020

python -m pyserini.eval.trec_eval -c -m ndcg_cut.10 -m recall.1000 -l 2 dl20-passage ./runs/tctv2-prf3.res

MS MARCO Passage Ranking V1

python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset ./runs/tctv2-prf3.res
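
For reference, the MS MARCO passage evaluation reports MRR@10. The following is a minimal sketch of that metric, not the official evaluation script; the function name and the in-memory run and qrels dictionaries are assumptions for illustration.

```python
def mrr_at_k(run, qrels, k=10):
    """Mean reciprocal rank of the first relevant passage within the top k.

    run:   {qid: [docid, ...]} in rank order
    qrels: {qid: set of relevant docids}
    The mean is taken over the queries present in qrels.
    """
    total = 0.0
    for qid, relevant in qrels.items():
        for rank, docid in enumerate(run.get(qid, [])[:k], start=1):
            if docid in relevant:
                total += 1.0 / rank
                break
    return total / len(qrels) if qrels else 0.0


run = {"q1": ["d2", "d1"], "q2": ["d5"]}
qrels = {"q1": {"d1"}, "q2": {"d9"}}
print(mrr_at_k(run, qrels))  # 0.25
```

In the toy example, q1 has its first relevant passage at rank 2 (reciprocal rank 0.5) and q2 has none in the top 10 (0), giving a mean of 0.25.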
Comments
  • Where can I find "--query-file" and "--pair-file" to reproduce ance-prf experiments?

    Hello, I want to reproduce the code; however, I do not know how to get "--query-file" and "--pair-file" ……

    make_train_from_ranking.py
    ……
    parser.add_argument('--query-file', type=str, default='./data/msmarco_passage/query/train_query_judged.tsv')
    parser.add_argument('--collection-file', type=str, default='./data/msmarco_passage/collection/collection.tsv')
    parser.add_argument('--pair-file', type=str, default='./data/msmarco_passage/qrels/train_query_passage_pair.tsv')
    parser.add_argument('--encoder', type=str, default='./data/msmarco_passage/models/ANCE')
    ……

    opened by XY2323819551 1
  • self.passage_encoder.eval()

    Hi, I would like to ask if using self.passage_encoder.eval() means that the passage_encoder will not do any updates during the training? Looking forward to your reply.

    opened by XY2323819551 0
  • KeyError: '1' when I ran "make_train_from_ranking.py"

    Hello, thanks for your amazing work, I really want to reproduce it. However, I ran into an issue when running the code. Could you help me?

    command line:
    python make_train_from_ranking.py \
        --ranking-file /home/zhangxy/QA/ANCE-PRF/pyserini/runs/run.msmarco-passage.ance.bf.tsv \
        --model-type ANCE \
        --query-file /home/zhangxy/QA/ANCE-PRF-main/data/marco_raw_data/queries.train.tsv \
        --collection-file ./data/msmarco_passage/collection/collection.tsv \
        --pair-file /home/zhangxy/QA/ANCE-PRF-main/data/marco_raw_data/qrels.train.tsv \
        --output data/hard/negative.result \
        --encoder /home/zhangxy/QA/pyserini_for_ance-prf/pyserini/encoders/ance-msmarco-passage

    processing:
    Load Query: 100% 808731/808731 [00:00<00:00, 1140903.16it/s]
    Load Collection: 100% 8841823/8841823 [00:16<00:00, 521248.96it/s]
    Load Q-D Pair: 100% 532761/532761 [00:00<00:00, 989247.88it/s]
    Load Ranking: 0% 0/808731000 [00:00<?, ?it/s]
    Traceback (most recent call last):
      File "make_train_from_ranking.py", line 94, in <module>
        rankings, topk = read_ranking(args.ranking_file, pair, args.prf_k, args.from_top)
      File "make_train_from_ranking.py", line 35, in read_ranking
        targets = pair[qid].keys()
    KeyError: '1'

    opened by XY2323819551 4
Owner
ielab
The Information Engineering Lab