This repository contains the code for the paper 'PARM: Paragraph Aggregation Retrieval Model for Dense Document-to-Document Retrieval' published at ECIR'22.

Sophia Althammer

Last update: Aug 26, 2022

Related tags

Deep Learning parm

Overview

Paragraph Aggregation Retrieval Model (PARM) for Dense Document-to-Document Retrieval

This repository contains the code for the paper PARM: A Paragraph Aggregation Retrieval Model for Dense Document-to-Document Retrieval and is partly based on the DPR Github repository. PARM is a Paragraph Aggregation Retrieval Model for dense document-to-document retrieval tasks, which liberates dense passage retrieval models from their limited input lenght and does retrieval on the paragraph-level.

We focus on the task of legal case retrieval and train and evaluate our models on the COLIEE 2021 data and evaluate our models on the CaseLaw collection.

The dense retrieval models are trained on the COLIEE data and can be found here. For training the dense retrieval model we utilize the DPR Github repository.

If you use our models or code, please cite our work:

@inproceedings{althammer2022parm,
      title={Paragraph Aggregation Retrieval Model (PARM) for Dense Document-to-Document Retrieval}, 
      author={Althammer, Sophia and Hofstätter, Sebastian and Sertkan, Mete and Verberne, Suzan and Hanbury, Allan},
      year={2022},
      booktitle={Advances in Information Retrieval, 44rd European Conference on IR Research, ECIR 2022},
}

Training the dense retrieval model

The dense retrieval models need to be trained, either on the paragraph-level data of COLIEE Task2 or additionally on the document-level data of COLIEE Task1

./DPR/train_dense_encoder.py: trains the dense bi-encoder (Step1)

python -m torch.distributed.launch --nproc_per_node=2 train_dense_encoder.py 
--max_grad_norm 2.0 
--encoder_model_type hf_bert 
--checkpoint_file_name --insert path to pretrained encoder checkpoint here if available-- 
--model_file  --insert path to pretrained chechpoint here if available-- 
--seed 12345 
--sequence_length 256 
--warmup_steps 1237 
--batch_size 22 
--do_lower_case 
--train_file --path to json train file-- 
--dev_file --path to json val file-- 
--output_dir --path to output directory--
--learning_rate 1e-05
--num_train_epochs 70
--dev_batch_size 22
--val_av_rank_start_epoch 60
--eval_per_epoch 1
--global_loss_buf_sz 250000

Generate dense embeddings index with trained DPR model

./DPR/generate_dense_embeddings.py: encodes the corpus in the dense index (Step2)

python generate_dense_embeddings.py
--model_file --insert path to pretrained checkpoint here from Step1--
--pretrained_file  --insert path to pretrained chechpoint here from Step1--
--ctx_file --insert path to tsv file with documents in the corpus--
--out_file --insert path to output index--
--batch_size 750

Search in the dense index

./DPR/dense_retriever.py: searches in the dense index the top n-docs (Step3)

python dense_retriever.py 
--model_file --insert path to pretrained checkpoint here from Step1--
--ctx_file --insert path to tsv file with documents in the corpus--
--qa_file --insert path to csv file with the queries--
--encoded_ctx_file --path to the dense index (.pkl format) from Step2--
--out_file --path to .json output file for search results--
--n-docs 1000

Poolout dense vectors for aggregation step

First you need to get the dense embeddings for the query paragraphs:

./DPR/get_question_tensors.py: encodes the query paragraphs with the dense encoder checkpoint and stores the embeddings in the output file (Step4)

python get_question_tensors.py
--model_file --insert path to pretrained checkpoint here from Step1--
--qa_file --insert path to csv file with the queries--
--out_file --path to output file for output index--

Once you have the dense embeddings of the paragraphs in the index and of the questions, you do the vector-based aggregation step in PARM with VRRF (alternatively with Min, Max, Avg, Sum, VScores, VRanks) and evaluate the aggregated results

./representation_aggregation.py: aggregates the run, stores and evaluates the aggregated run (Step5)

python representation_aggregation.py
--encoded_ctx_file --path to the encoded index (.pkl format) from Step2--
--encoded_qa_file  --path to the encoded queries (.pkl format) from Step4--
--output_top1000s --path to the top-n file (.json format) from Step3--
--label_file  --path to the label file (.json format)--
--aggregation_mode --choose from vrrf/vscores/vranks/sum/max/min/avg
--candidate_mode p_from_retrieved_list
--output_dir --path to output directory--
--output_file_name  --output file name--

Preprocessing

Preprocess COLIEE Task 1 data for dense retrieval

./preprocessing/preprocess_coliee_2021_task1.py: preprocess the COLIEE Task 1 dataset by removing non-English text, removing non-informative summaries, removing tabs etc

Preprocess CaseLaw collection

./preprocessing/caselaw_stat_corpus.py: preprocess the CaseLaw collection

Preprocess data for training the dense retrieval model

In order to train the dense retrieval models, the data needs to be preprocessed. For training and retrieval we split up the documents into their paragraphs.

./preprocessing/preprocess_finetune_data_dpr_task1.py: preprocess the COLIEE Task 1 document-level labels for training the DPR model
./preprocessing/preprocess_finetune_data_dpr.py: preprocess the COLIEE Task 2 paragraph-level labels for training the DPR model

Comments

How to inference on our own data?

Hi,

Thanks for open-sourcing this great repository!

I am trying to do inference on my own dataset. However, I don't have labeled answers for the queries. So, I can't really evaluate the method and I would like to see the results qualitatively. It would be great if you can provide me with some pointers on this.

Thanks in advance!

opened by avijit9 2

The implementation of the paper "A Deep Feature Aggregation Network for Accurate Indoor Camera Localization".

A Deep Feature Aggregation Network for Accurate Indoor Camera Localization This is the PyTorch implementation of our paper "A Deep Feature Aggregation

9 Dec 9, 2022

Provided is code that demonstrates the training and evaluation of the work presented in the paper: "On the Detection of Digital Face Manipulation" published in CVPR 2020.

FFD Source Code Provided is code that demonstrates the training and evaluation of the work presented in the paper: "On the Detection of Digital Face M

88 Nov 22, 2022

An efficient and effective learning to rank algorithm by mining information across ranking candidates. This repository contains the tensorflow implementation of SERank model. The code is developed based on TF-Ranking.

SERank An efficient and effective learning to rank algorithm by mining information across ranking candidates. This repository contains the tensorflow

44 Oct 20, 2022

This repository allows you to anonymize sensitive information in images/videos. The solution is fully compatible with the DL-based training/inference solutions that we already published/will publish for Object Detection and Semantic Segmentation.

BMW-Anonymization-Api Data privacy and individuals’ anonymity are and always have been a major concern for data-driven companies. Therefore, we design

148 Dec 21, 2022

Repository for "Improving evidential deep learning via multi-task learning," published in AAAI2022

Improving evidential deep learning via multi task learning It is a repository of AAAI2022 paper, “Improving evidential deep learning via multi-task le

11 Nov 19, 2022

This is the code for the paper "Jinkai Zheng, Xinchen Liu, Wu Liu, Lingxiao He, Chenggang Yan, Tao Mei: Gait Recognition in the Wild with Dense 3D Representations and A Benchmark. (CVPR 2022)"

Gait3D-Benchmark This is the code for the paper "Jinkai Zheng, Xinchen Liu, Wu Liu, Lingxiao He, Chenggang Yan, Tao Mei: Gait Recognition in the Wild

82 Jan 4, 2023

This repository contains the code for the paper "Hierarchical Motion Understanding via Motion Programs"

Hierarchical Motion Understanding via Motion Programs (CVPR 2021) This repository contains the official implementation of: Hierarchical Motion Underst

40 Dec 5, 2022

This repository contains the source code and data for reproducing results of Deep Continuous Clustering paper

Deep Continuous Clustering Introduction This is a Pytorch implementation of the DCC algorithms presented in the following paper (paper): Sohil Atul Sh

197 Nov 29, 2022

This repository contains a re-implementation of the code for the CVPR 2021 paper "Omnimatte: Associating Objects and Their Effects in Video."

Omnimatte in PyTorch This repository contains a re-implementation of the code for the CVPR 2021 paper "Omnimatte: Associating Objects and Their Effect

728 Dec 28, 2022

This repository contains the code for the paper 'PARM: Paragraph Aggregation Retrieval Model for Dense Document-to-Document Retrieval' published at ECIR'22.

Related tags

Overview

Paragraph Aggregation Retrieval Model (PARM) for Dense Document-to-Document Retrieval

Training the dense retrieval model

Generate dense embeddings index with trained DPR model

Search in the dense index

Poolout dense vectors for aggregation step

Preprocessing

Preprocess data for training the dense retrieval model

You might also like...

The implementation of the paper "A Deep Feature Aggregation Network for Accurate Indoor Camera Localization".

Provided is code that demonstrates the training and evaluation of the work presented in the paper: "On the Detection of Digital Face Manipulation" published in CVPR 2020.

An efficient and effective learning to rank algorithm by mining information across ranking candidates. This repository contains the tensorflow implementation of SERank model. The code is developed based on TF-Ranking.

This repository allows you to anonymize sensitive information in images/videos. The solution is fully compatible with the DL-based training/inference solutions that we already published/will publish for Object Detection and Semantic Segmentation.

Repository for "Improving evidential deep learning via multi-task learning," published in AAAI2022

This is the code for the paper "Jinkai Zheng, Xinchen Liu, Wu Liu, Lingxiao He, Chenggang Yan, Tao Mei: Gait Recognition in the Wild with Dense 3D Representations and A Benchmark. (CVPR 2022)"

This repository contains the code for the paper "Hierarchical Motion Understanding via Motion Programs"

This repository contains the source code and data for reproducing results of Deep Continuous Clustering paper

This repository contains a re-implementation of the code for the CVPR 2021 paper "Omnimatte: Associating Objects and Their Effects in Video."

Comments

How to inference on our own data?

Owner

Sophia Althammer

This repository contains the code, data, and models of the paper titled "XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages" published in Findings of the Association for Computational Linguistics: ACL 2021.

This is the official implementation of "One Question Answering Model for Many Languages with Cross-lingual Dense Passage Retrieval".

Personal implementation of paper "Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval"

Official repository for the CVPR 2021 paper "Learning Feature Aggregation for Deep 3D Morphable Models"

Contains code for Deep Kernelized Dense Geometric Matching

Code for the TASLP paper "PSLA: Improving Audio Tagging With Pretraining, Sampling, Labeling, and Aggregation".

Scalable training for dense retrieval models.

Repo for WWW 2022 paper: Progressively Optimized Bi-Granular Document Representation for Scalable Embedding Based Retrieval

PyTorch implementation of CVPR 2020 paper (Reference-Based Sketch Image Colorization using Augmented-Self Reference and Dense Semantic Correspondence) and pre-trained model on ImageNet dataset

Official implementation of NeurIPS 2021 paper "Contextual Similarity Aggregation with Self-attention for Visual Re-ranking"