Overview

Contriever: Towards Unsupervised Dense Information Retrieval with Contrastive Learning

This repository contains pre-trained models and some evaluation code for our paper Towards Unsupervised Dense Information Retrieval with Contrastive Learning.

We use a simple contrastive learning framework to pre-train models for information retrieval. Contriever, trained without any supervision, is competitive with BM25 on recall at 100 (R@100) on the BEIR benchmark. After fine-tuning on MS MARCO, Contriever obtains strong performance, especially for R@100.
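
To make the objective concrete, below is a minimal sketch of an InfoNCE-style contrastive loss with in-batch negatives, the general form of loss this family of methods builds on. It is illustrative only: the training setup described in the paper adds further machinery (for example, positive pairs built by cropping the same document and a larger pool of negatives), and the temperature value here is an arbitrary placeholder, not a number from the paper.

import torch
import torch.nn.functional as F

def contrastive_loss(query_emb, key_emb, temperature=0.05):
    # query_emb, key_emb: (batch, dim) tensors of query and passage embeddings.
    # Row i of key_emb is the positive for row i of query_emb;
    # every other row in the batch acts as a negative.
    scores = query_emb @ key_emb.t() / temperature
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)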

Getting Started

Pre-trained models can be loaded through the HuggingFace transformers library:

import transformers
from src.contriever import Contriever

model = Contriever.from_pretrained("facebook/contriever")
tokenizer = transformers.BertTokenizerFast.from_pretrained("facebook/contriever")

Embeddings for different sentences can be obtained by doing the following:

sentences = [
    "Where was Marie Curie born?",
    "Maria Sklodowska, later known as Marie Curie, was born on November 7, 1867.",
    "Born in Paris on 15 May 1859, Pierre Curie was the son of Eugène Curie, a doctor of French Catholic origin from Alsace."
]

inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
embeddings = model(**inputs)

Then similarity scores between the different sentences can be obtained with a dot product between the embeddings:

score01 = embeddings[0] @ embeddings[1]  # 1.0473
score02 = embeddings[0] @ embeddings[2]  # 1.0095
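
If you prefer to avoid the src.contriever dependency, a similar result can be obtained with the plain transformers API. This is a minimal sketch that assumes Contriever's pooling is an attention-mask-aware mean over the last hidden state; it reuses the sentences list defined above.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/contriever")
model = AutoModel.from_pretrained("facebook/contriever")

def mean_pooling(token_embeddings, mask):
    # Zero out padded positions, then average over the sequence length.
    token_embeddings = token_embeddings.masked_fill(~mask[..., None].bool(), 0.0)
    return token_embeddings.sum(dim=1) / mask.sum(dim=1)[..., None]

inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
embeddings = mean_pooling(outputs.last_hidden_state, inputs["attention_mask"])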

BEIR evaluation

Scores on the BEIR benchmark can be reproduced using eval_beir.py.

python eval_beir.py --model_name_or_path facebook/contriever-msmarco --dataset scifact
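
The script above handles dataset download and scoring. For reference, the sketch below shows roughly what such an evaluation looks like when written directly against the beir package. It is an approximation (beir's SentenceBERT wrapper falls back to mean pooling for a plain HuggingFace checkpoint), so the exact numbers may differ slightly from eval_beir.py.

from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

# Download and load the SciFact test split from the BEIR mirror.
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/scifact.zip"
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

# Exact (brute-force) dense retrieval with dot-product scoring, as in the snippet above.
retriever = EvaluateRetrieval(
    DRES(models.SentenceBERT("facebook/contriever-msmarco"), batch_size=128),
    score_function="dot",
)
results = retriever.retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(recall)  # includes Recall@100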

Available models

Model                          Description
facebook/contriever            Model pre-trained on Wikipedia and CC-net without any supervised data
facebook/contriever-msmarco    Pre-trained model fine-tuned on MS MARCO

References

[1] G. Izacard, M. Caron, L. Hosseini, S. Riedel, P. Bojanowski, A. Joulin, E. Grave. Towards Unsupervised Dense Information Retrieval with Contrastive Learning. arXiv:2112.09118, 2021.

@misc{izacard2021contriever,
      title={Towards Unsupervised Dense Information Retrieval with Contrastive Learning}, 
      author={Gautier Izacard and Mathilde Caron and Lucas Hosseini and Sebastian Riedel and Piotr Bojanowski and Armand Joulin and Edouard Grave},
      year={2021},
      eprint={2112.09118},
      archivePrefix={arXiv},
}

License

See the LICENSE file for more details.

Comments
  • To reproduce baseline scores

    Hello @gizacard @GitHub30 ,

    I wonder if you can share some details about how to reproduce the unsupervised baseline scores, such as the scores in Table 9. Do you just take existing checkpoints and evaluate them on BEIR, or do you pretrain them on your own (using the same data/settings as training Contriever)? I found that I cannot reproduce the same SimCSE scores with the original released checkpoint (https://huggingface.co/princeton-nlp/unsup-simcse-roberta-large).

    Also for fine-tuning with MSMARCO, is it similar to the supervised SimCSE training?

    Thanks again for sharing the resources! Rui

    opened by memray 3
  • Default value -1 for main_port in eval_beir.py may raise error

    Thanks for providing this great work!

    I noticed that you set the default value of main_port to -1 in eval_beir.py, which raises an error when I run

    python eval_beir.py --model_name_or_path facebook/contriever-msmarco --dataset scifact

    directly. I set this value to 10001, which solved the problem. I suggest changing the default value or adding a note to the README (the full workaround command is sketched after this comments list).

    opened by ziqing-huang 1
  • Pretraining data

    Great work! Thanks for sharing the model and code. I was wondering if you could share the pretraining data or the code to generate positive pairs from the pretraining corpus. Thank you in advance!

    opened by swj0419 1
  • Host Wikipedia Keys like DPR?

    Hello folks, I appreciate this work quite a bit, congrats on the new state of the art on zero-shot retrieval.

    I feel like something very helpful that DPR did for researchers in labs with less per-researcher compute was to host the key embeddings (and the FAISS index). This helps everyone (including me, as an M.Sc. student) and promotes research and reproducibility in the field.

    As both your team and the DPR team are from Facebook Research, it is likely possible for you folks as well. Just wondering :) Thanks.

    opened by JulesGM 1
  • Details for Wikipedia data formatting

    I'm looking at tokenization_script.sh and I see that you're loading en_XX.txt, which presumably contains all of Wikipedia's text in a single file. My question is, what text does this include? I'd imagine it includes paragraphs from pages, but do you include section headers? Do you include Wikipedia edit discussion pages or just content pages? I can certainly prepare a similar file for the December 20th, 2018 Wikipedia dump (see my code here), but I want to follow your data preparation as closely as possible. Could you share:

    1. the data itself
    2. the code you used to generate the single Wikipedia text file
    3. OR some additional details about how you're generating the single Wikipedia text file?

    Please let me know if you have any questions for me. Thank you for sharing this great repo. I think this project holds a ton of promise.

    opened by ToddMorrill 1
  • Data format for fine-tuning

    Hi, may I know if it is possible to share data samples for fine-tuning Contriever? In the code, there are fields like "positive_ctxs", "hard_negative_ctxs", "negative_ctxs", and "question". Could you provide some examples? (An illustrative guess at this format is sketched after the comments list.)

    opened by e0397123 0
  • Script finetuning on MSMarco

    Thanks a lot for releasing the code and the scripts for pre-training.

    I'm trying to reproduce the numbers on MS MARCO after fine-tuning, and it would be great if you could also release the scripts for fine-tuning.

    Specifically, I had questions about training the model after mining the hard negatives.

    Is it initialized from the pre-trained Contriever model or from the Contriever model fine-tuned with random negatives?

    opened by gangiswag 1
  • Mr Tydi checkpoint

    Hi,

    Thanks for the great project :) I was wondering if there is a plan to release the checkpoints trained on Mr. TyDi? They could be useful, since training on it seems to improve model quality on non-translated data.

    opened by cadurosar 0
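
Regarding the main_port comment above: assuming eval_beir.py exposes main_port as a regular command-line flag (the comment implies it is an option with a default of -1), the workaround described there amounts to something like:

python eval_beir.py --model_name_or_path facebook/contriever-msmarco --dataset scifact --main_port 10001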
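
Regarding the fine-tuning data question above: no sample file is included here, but given the field names mentioned in that comment, a training example presumably looks something like the hypothetical sketch below, which borrows the DPR-style layout (the title/text keys inside each context are an assumption, not taken from this repository).

# Hypothetical fine-tuning example; the field names come from the comment above,
# the inner title/text structure is assumed by analogy with the DPR data format.
example = {
    "question": "Where was Marie Curie born?",
    "positive_ctxs": [
        {"title": "Marie Curie", "text": "Maria Sklodowska, later known as Marie Curie, was born on November 7, 1867."}
    ],
    "negative_ctxs": [],
    "hard_negative_ctxs": [
        {"title": "Pierre Curie", "text": "Born in Paris on 15 May 1859, Pierre Curie was the son of Eugène Curie."}
    ],
}
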
Owner
Meta Research