Overview

Contriever: Towards Unsupervised Dense Information Retrieval with Contrastive Learning

This repository contains pre-trained models and some evaluation code for our paper Towards Unsupervised Dense Information Retrieval with Contrastive Learning.

We use a simple contrastive learning framework to pre-train models for information retrieval. Contriever, trained without any supervision, is competitive with BM25 on recall at 100 (R@100) on the BEIR benchmark. After fine-tuning on MS MARCO, Contriever obtains strong performance, especially for R@100.
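
To make the objective concrete, below is a minimal sketch of an InfoNCE-style contrastive loss with in-batch negatives, the general form of loss this family of methods builds on. It is illustrative only: the training setup described in the paper adds further machinery (for example, positive pairs built by cropping the same document and a larger pool of negatives), and the temperature value here is an arbitrary placeholder, not a number from the paper.

import torch
import torch.nn.functional as F

def contrastive_loss(query_emb, key_emb, temperature=0.05):
    # query_emb, key_emb: (batch, dim) tensors of query and passage embeddings.
    # Row i of key_emb is the positive for row i of query_emb;
    # every other row in the batch acts as a negative.
    scores = query_emb @ key_emb.t() / temperature
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)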

Getting Started

Pre-trained models can be loaded through the HuggingFace transformers library:

import transformers
from src.contriever import Contriever

model = Contriever.from_pretrained("facebook/contriever")
tokenizer = transformers.BertTokenizerFast.from_pretrained("facebook/contriever")

Embeddings for different sentences can be obtained by doing the following:

sentences = [
    "Where was Marie Curie born?",
    "Maria Sklodowska, later known as Marie Curie, was born on November 7, 1867.",
    "Born in Paris on 15 May 1859, Pierre Curie was the son of Eugène Curie, a doctor of French Catholic origin from Alsace."
]

inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
embeddings = model(**inputs)

Then similarity scores between the different sentences can be obtained with a dot product between the embeddings:

score01 = embeddings[0] @ embeddings[1]  # 1.0473
score02 = embeddings[0] @ embeddings[2]  # 1.0095
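
If you prefer to avoid the src.contriever dependency, a similar result can be obtained with the plain transformers API. This is a minimal sketch that assumes Contriever's pooling is an attention-mask-aware mean over the last hidden state; it reuses the sentences list defined above.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/contriever")
model = AutoModel.from_pretrained("facebook/contriever")

def mean_pooling(token_embeddings, mask):
    # Zero out padded positions, then average over the sequence length.
    token_embeddings = token_embeddings.masked_fill(~mask[..., None].bool(), 0.0)
    return token_embeddings.sum(dim=1) / mask.sum(dim=1)[..., None]

inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
embeddings = mean_pooling(outputs.last_hidden_state, inputs["attention_mask"])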

BEIR evaluation

Scores on the BEIR benchmark can be reproduced using eval_beir.py.

python eval_beir.py --model_name_or_path facebook/contriever-msmarco --dataset scifact
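
The script above handles dataset download and scoring. For reference, the sketch below shows roughly what such an evaluation looks like when written directly against the beir package. It is an approximation (beir's SentenceBERT wrapper falls back to mean pooling for a plain HuggingFace checkpoint), so the exact numbers may differ slightly from eval_beir.py.

from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

# Download and load the SciFact test split from the BEIR mirror.
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/scifact.zip"
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

# Exact (brute-force) dense retrieval with dot-product scoring, as in the snippet above.
retriever = EvaluateRetrieval(
    DRES(models.SentenceBERT("facebook/contriever-msmarco"), batch_size=128),
    score_function="dot",
)
results = retriever.retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(recall)  # includes Recall@100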

Available models

Model                          Description
facebook/contriever            Model pre-trained on Wikipedia and CC-net without any supervised data
facebook/contriever-msmarco    Pre-trained model fine-tuned on MS MARCO

References

[1] G. Izacard, M. Caron, L. Hosseini, S. Riedel, P. Bojanowski, A. Joulin, E. Grave. Towards Unsupervised Dense Information Retrieval with Contrastive Learning. arXiv:2112.09118, 2021.

@misc{izacard2021contriever,
      title={Towards Unsupervised Dense Information Retrieval with Contrastive Learning}, 
      author={Gautier Izacard and Mathilde Caron and Lucas Hosseini and Sebastian Riedel and Piotr Bojanowski and Armand Joulin and Edouard Grave},
      year={2021},
      eprint={2112.09118},
      archivePrefix={arXiv},
}

License

See the LICENSE file for more details.

Comments
  • To reproduce baseline scores

    Hello @gizacard @GitHub30 ,

    I wonder if you can share some details about how to reproduce the unsupervised baseline scores, such as the scores in Table 9. Do you just take existing checkpoints and evaluate them on BEIR, or do you pretrain them on your own (using the same data/settings as training Contriever)? I found that I cannot reproduce the same SimCSE scores with the original released checkpoint (https://huggingface.co/princeton-nlp/unsup-simcse-roberta-large).

    Also for fine-tuning with MSMARCO, is it similar to the supervised SimCSE training?

    Thanks again for sharing the resources! Rui

    opened by memray 3
  • Default value -1 for main_port in eval_beir.py may raise error

    Thanks for providing this great work!

    I noticed that you set the default value of main_port to -1 in eval_beir.py, which raises an error when I run

    python eval_beir.py --model_name_or_path facebook/contriever-msmarco --dataset scifact

    directly. I set this value to 10001, which solved the problem. I suggest changing the default value or adding a note to the README (the full workaround command is sketched after this comments list).

    opened by ziqing-huang 1
  • Pretraining data

    Great work! Thanks for sharing the model and code. I was wondering if you could share the pretraining data or the code to generate positive pairs from the pretraining corpus. Thank you in advance!

    opened by swj0419 1
  • Host Wikipedia Keys like DPR?

    Hello folks, I appreciate this work quite a bit, congrats on the new state of the art on zero-shot retrieval.

    I feel like something very helpful that DPR did for researchers in labs with less per-researcher compute was to host the key embeddings (and the FAISS index). This helps everyone (including me, as an M.Sc. student) and promotes research and reproducibility in the field.

    As both your team and the DPR team are from Facebook Research, it is likely possible for you folks as well. Just wondering :) Thanks.

    opened by JulesGM 1
  • Details for Wikipedia data formatting

    I'm looking at tokenization_script.sh and I see that you're loading en_XX.txt, which presumably contains all of Wikipedia's text in a single file. My question is, what text does this include? I'd imagine it includes paragraphs from pages, but do you include section headers? Do you include Wikipedia edit discussion pages or just content pages? I can certainly prepare a similar file for the December 20th, 2018 Wikipedia dump (see my code here), but I want to follow your data preparation as closely as possible. Could you share:

    1. the data itself
    2. the code you used to generate the single Wikipedia text file
    3. OR some additional details about how you're generating the single Wikipedia text file?

    Please let me know if you have any questions for me. Thank you for sharing this great repo. I think this project holds a ton of promise.

    opened by ToddMorrill 1
  • Data format for fine-tuning

    Hi, may I know if it is possible to share data samples for fine-tuning Contriever? In the code, there are fields like "positive_ctxs", "hard_negative_ctxs", "negative_ctxs", and "question". Could you provide some examples? (An illustrative guess at this format is sketched after the comments list.)

    opened by e0397123 0
  • Script finetuning on MSMarco

    Thanks a lot for releasing the code and the scripts for pre-training.

    I'm trying to reproduce the numbers on MS MARCO after fine-tuning, and it would be great if you could also release the scripts for fine-tuning.

    Specifically, I had questions about training the model after mining the hard negatives.

    Is it initialized from the pre-trained Contriever model or from the Contriever model fine-tuned with random negatives?

    opened by gangiswag 1
  • Mr Tydi checkpoint

    Hi,

    Thanks for the great project :) I was wondering if there is a plan to release the checkpoints trained on Mr. TyDi? They could be useful, since training on it seems to improve model quality on non-translated data.

    opened by cadurosar 0
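
Regarding the main_port comment above: assuming eval_beir.py exposes main_port as a regular command-line flag (the comment implies it is an option with a default of -1), the workaround described there amounts to something like:

python eval_beir.py --model_name_or_path facebook/contriever-msmarco --dataset scifact --main_port 10001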
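
Regarding the fine-tuning data question above: no sample file is included here, but given the field names mentioned in that comment, a training example presumably looks something like the hypothetical sketch below, which borrows the DPR-style layout (the title/text keys inside each context are an assumption, not taken from this repository).

# Hypothetical fine-tuning example; the field names come from the comment above,
# the inner title/text structure is assumed by analogy with the DPR data format.
example = {
    "question": "Where was Marie Curie born?",
    "positive_ctxs": [
        {"title": "Marie Curie", "text": "Maria Sklodowska, later known as Marie Curie, was born on November 7, 1867."}
    ],
    "negative_ctxs": [],
    "hard_negative_ctxs": [
        {"title": "Pierre Curie", "text": "Born in Paris on 15 May 1859, Pierre Curie was the son of Eugène Curie."}
    ],
}
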
Owner
Meta Research