Overview

Tevatron

Tevatron is a simple and efficient toolkit for training and running dense retrievers with deep language models. The toolkit has a modularized design for easy research; a set of command line tools is also provided for fast development and testing. A set of easy-to-use interfaces to Huggingface's state-of-the-art pre-trained transformers ensures Tevatron's strong performance.

Tevatron is currently in its initial development stage. We will be actively adding new features, and API changes may happen. Suggestions, feature requests, and PRs are welcome.

Features

  • Command line interface for dense retriever training/encoding and dense index search.
  • Flexible and extendable Pytorch retriever models.
  • A highly efficient Trainer, a subclass of the Huggingface Trainer, that natively supports training performance features like mixed precision and distributed data parallel.
  • Fast and memory-efficient train/inference data access based on memory mapping with Apache Arrow through Huggingface datasets.

Installation

First install the neural network and similarity search backends, namely Pytorch and FAISS. Check out the official installation guides for Pytorch and FAISS.

Then install Tevatron with pip,

pip install tevatron

Or, typically for development/research, clone this repo and install it as editable,

git clone https://github.com/texttron/tevatron
cd tevatron
pip install --editable .

Note: The current code base has been tested with torch==1.8.2, faiss-cpu==1.7.1, transformers==4.9.2, and datasets==1.11.0.

Data Format

Training: Each line of the training file is a training instance,

{'query': TEXT_TYPE, 'positives': List[TEXT_TYPE], 'negatives': List[TEXT_TYPE]}
...

Inference/Encoding: Each line of the encoding file is a piece of text to be encoded,

{'text_id': "xxx", 'text': TEXT_TYPE}
...

Here TEXT_TYPE can be either a raw string or pre-tokenized ids, i.e. List[int]. Using the latter can lower data processing latency during training and reduce/eliminate GPU wait. Note: the current code requires the text_id of passages/contexts to be convertible to an integer, e.g. integers or strings of integers.
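For concreteness, here is a minimal sketch of producing both files with plain Python and a Huggingface tokenizer. The file names, the example texts, and the choice of bert-base-uncased are illustrative and not required by Tevatron:

import json
import os

from transformers import AutoTokenizer

# Illustrative setup: any tokenizer matching your retriever's backbone works here.
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', use_fast=True)

def to_ids(text):
    # Pre-tokenize into List[int] (here without special tokens; check the examples'
    # prepare scripts for the exact convention used by each recipe).
    return tokenizer.encode(text, add_special_tokens=False)

os.makedirs('train_dir', exist_ok=True)
os.makedirs('corpus', exist_ok=True)

# One training instance per line (jsonl).
with open('train_dir/split00.json', 'w') as f:
    instance = {
        'query': to_ids('what is dense retrieval'),
        'positives': [to_ids('Dense retrieval maps queries and passages into a shared vector space.')],
        'negatives': [to_ids('An unrelated passage used as a negative example.')],
    }
    f.write(json.dumps(instance) + '\n')

# One passage per line; text_id must be convertible to an integer.
with open('corpus/shard1.json', 'w') as f:
    passage = {'text_id': '0', 'text': to_ids('Passage text to be encoded.')}
    f.write(json.dumps(passage) + '\n')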

Training (Simple)

To train a simple dense retriever, call the tevatron.driver.train module,

python -m tevatron.driver.train \
  --output_dir $OUTDIR \
  --model_name_or_path bert-base-uncased \
  --do_train \
  --save_steps 20000 \
  --train_dir $TRAIN_DIR \
  --fp16 \
  --per_device_train_batch_size 8 \
  --learning_rate 5e-6 \
  --num_train_epochs 2 \
  --dataloader_num_workers 2

Here we picked the bert-base-uncased BERT weights from the Huggingface Hub and turned on AMP with --fp16 to speed up training. Several additional command flags are provided to configure the learned model, e.g. --add_pooler, which adds a linear projection. A full list of command line arguments can be found in tevatron.arguments.
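Conceptually, the pooler enabled by --add_pooler is just a learned linear projection applied on top of the encoder output before scoring. A rough illustrative sketch of that idea (not Tevatron's exact implementation) looks like this:

import torch
import torch.nn as nn

class LinearPooler(nn.Module):
    # Illustrative only: project the encoder's [CLS] representation to the retrieval dimension.
    def __init__(self, input_dim: int = 768, output_dim: int = 768):
        super().__init__()
        self.linear = nn.Linear(input_dim, output_dim)

    def forward(self, cls_rep: torch.Tensor) -> torch.Tensor:
        # cls_rep: [batch_size, input_dim] hidden state of the first token
        return self.linear(cls_rep)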

Training (Research)

Check out run.py in the examples directory for a fully configurable train/test loop. Typically you will do something like the following; a sketch of the elided setup follows the snippet.

from tevatron.modeling import DenseModel
from tevatron.trainer import DenseTrainer as Trainer

...
model = DenseModel.build(
        model_args,
        data_args,
        training_args,
        config=config,
        cache_dir=model_args.cache_dir,
    )
trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        data_collator=collator,
    )
...
trainer.train()
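The elided `...` parts parse arguments and build the dataset and collator. A minimal sketch of that glue is shown below; the dataclass names imported from tevatron.arguments are assumptions that may differ between versions, so treat examples/run.py and tevatron.arguments as the reference:

from transformers import AutoConfig, AutoTokenizer, HfArgumentParser

# Assumed dataclass names; see tevatron.arguments for the definitive list.
from tevatron.arguments import ModelArguments, DataArguments, TrainingArguments

parser = HfArgumentParser((ModelArguments, DataArguments, TrainingArguments))
model_args, data_args, training_args = parser.parse_args_into_dataclasses()

config = AutoConfig.from_pretrained(model_args.model_name_or_path, cache_dir=model_args.cache_dir)
tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path, cache_dir=model_args.cache_dir)

# train_dataset and collator are then built from the classes in tevatron.data
# (TrainDataset and the train-time collator), reading the jsonl files under --train_dir.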

Encoding

To encode, call the tevatron.driver.encode module. For a large corpus, split the corpus into shards to parallelize encoding; a sketch of one way to shard the corpus follows the loop below.

for s in shard1 shard2 shard3
do
python -m tevatron.driver.encode \
  --output_dir=$OUTDIR \
  --tokenizer_name $TOK \
  --config_name $CONFIG \
  --model_name_or_path $MODEL_DIR \
  --fp16 \
  --per_device_eval_batch_size 128 \
  --encode_in_path $CORPUS_DIR/$s.json \
  --encoded_save_path $ENCODE_DIR/$s.pt
done
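One simple way to produce those shards from a single jsonl corpus file is a round-robin split; the file names below are placeholders and should match the $CORPUS_DIR/$s.json paths used above:

import os

# Round-robin split of a single jsonl corpus into shard files.
n_shards = 3
os.makedirs('corpus_shards', exist_ok=True)
shards = [open(os.path.join('corpus_shards', f'shard{i + 1}.json'), 'w') for i in range(n_shards)]
with open('corpus.json') as f:
    for i, line in enumerate(f):
        shards[i % n_shards].write(line)
for shard in shards:
    shard.close()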

Index Search

Call the tevatron.faiss_retriever module,

python -m tevatron.faiss_retriever \
--query_reps $ENCODE_QRY_DIR/qry.pt \
--passage_reps $ENCODE_DIR/'*.pt' \
--depth $DEPTH \
--batch_size -1 \
--save_text \
--save_ranking_to rank.tsv

The encoded corpus or corpus shards are loaded based on glob pattern matching of the --passage_reps argument. The --batch_size argument controls the number of queries passed to the FAISS index in each search call; -1 passes all queries in one call. Larger batches typically run faster (due to better memory access patterns and hardware utilization). Setting the --save_text flag saves the ranking to a tsv file with each line being qid pid score.
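For instance, a small sketch of loading the saved run back into per-query rankings, assuming the three whitespace/tab-separated columns described above and the rank.tsv name from the command:

from collections import defaultdict

# Each line of the run produced with --save_text is: qid pid score
run = defaultdict(list)
with open('rank.tsv') as f:
    for line in f:
        qid, pid, score = line.split()
        run[qid].append((pid, float(score)))

# Entries per query are written in retrieval order; sorting by score is a safe default.
for qid in run:
    run[qid].sort(key=lambda pair: pair[1], reverse=True)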

Alternatively, parallelize search over the shards,

for s in shard1 shard2 shard3
do
python -m tevatron.faiss_retriever \
--query_reps $ENCODE_QRY_DIR/qry.pt \
--passage_reps $ENCODE_DIR/$s.pt \
--depth $DEPTH \
--save_ranking_to $INTERMEDIATE_DIR/$s
done

Then combine the results using the reducer module,

python -m tevatron.faiss_retriever.reducer \
--score_dir $INTERMEDIATE_DIR \
--query $ENCODE_QRY_DIR/qry.pt \
--save_ranking_to rank.txt

Contacts

If you have a toolkit-specific question, feel free to open an issue.

You can also reach out to us for general comments/suggestions/questions through email.

Comments
  • coCondenser MS-MARCO Passage Retrieval example raises error

    I'm trying to reproduce this example: https://github.com/texttron/tevatron/tree/main/examples/coCondenser-marco

    Inference with the fine-tuned checkpoint (encode and index search) is all good; however, fine-tuning stage 1 (training) raises an error:

    Downloading and preparing dataset json/default (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /home/dmx/.cache/huggingface/datasets/json/default-e642d34fc5e4ebf2/0.0.0/793d004298099bd3c4e61eb7878475bcf1dc212bf2e34437d85126758720d7f9...
    10/29/2021 10:44:35 - WARNING - datasets.builder - Using custom data configuration default-e642d34fc5e4ebf2
    Traceback (most recent call last):
      File "/home/dmx/anaconda3/lib/python3.8/site-packages/tevatron/driver/train.py", line 117, in <module>
        main()
      File "/home/dmx/anaconda3/lib/python3.8/site-packages/tevatron/driver/train.py", line 82, in main
        train_dataset = TrainDataset(
      File "/home/dmx/anaconda3/lib/python3.8/site-packages/tevatron/data.py", line 29, in __init__
        self.train_data = datasets.load_dataset(
      File "/home/dmx/anaconda3/lib/python3.8/site-packages/datasets-1.9.0-py3.8.egg/datasets/load.py", line 856, in load_dataset
        builder_instance.download_and_prepare(
      File "/home/dmx/anaconda3/lib/python3.8/site-packages/datasets-1.9.0-py3.8.egg/datasets/builder.py", line 583, in download_and_prepare
        self._download_and_prepare(
      File "/home/dmx/anaconda3/lib/python3.8/site-packages/datasets-1.9.0-py3.8.egg/datasets/builder.py", line 639, in _download_and_prepare
        split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
      File "/home/dmx/anaconda3/lib/python3.8/site-packages/datasets-1.9.0-py3.8.egg/datasets/packaged_modules/json/json.py", line 46, in _split_generators
        raise ValueError(f"At least one data file must be specified, but got data_files={self.config.data_files}")
    ValueError: At least one data file must be specified, but got data_files=None

    I think the problem may come from lines 25~32 of tevatron.driver.train.py, but I don't know the specific reason or how to solve it:

    if isinstance(path_to_data, datasets.Dataset):
        self.train_data = path_to_data
    else:
        self.train_data = datasets.load_dataset(
            'json',
            data_dir=path_to_data,
            ignore_verifications=False,
        )['train']

    opened by YuLengsen 10
  • Save the last checkpoint also in a folder as others

    Currently, the last checkpoint is saved in the root folder of the other checkpoints. This minor change puts the last one in the same level of folder as the others.

    opened by ArvinZhuang 9
  • Question about reproducing coCondenser-nq

    Hi, @luyug.

    Thanks for your awesome work and detailed guidelines. I reproduced the model according to coCondenser-nq's [README](https://github.com/texttron/tevatron/tree/main/examples/coCondenser-nq). But I got the following results (results from pyserini):

    Top5    accuracy: 0.3526315789473684                         
    Top20   accuracy: 0.47700831024930745 
    Top100  accuracy: 0.5833795013850416 
    

    I think I made a mistake in one step, so the results are lower than the BM25 results. I sequentially executed the following scripts to train the model (the model co-condenser-wiki was downloaded from huggingface).

    #prepare_data.sh
    
    nq_train_path="/data2/private/xxx/DPR/downloads/data/retriever/nq-train.json" #biencoder-nq-train.json
    output_path="/data2/private/xxx/condenser/nq-train/bm25.bert.json"
    model_path="/data2/private/xxx/model/co-condenser-wiki"
    hn_path="/data2/private/xxx/condenser/hn.json"
    output_hn_path="/data2/private/xxx/condenser/nq-train/hn.bert.json"
    python prepare_wiki_train.py --input $nq_train_path --output $output_path --tokenizer $model_path
    
    python prepare_wiki_train.py --input $hn_path --output $output_hn_path --tokenizer $model_path
    
    
    #train_nq.sh
    CONDENSER_MODEL_NAME="/data2/private/xxx/model/co-condenser-wiki"
    train_path="/data2/private/xxx/condenser/nq-train/"
    output_path="/data2/private/xxx/condenser/model_nq3/"
    cache_path="/data2/private/xxx/condenser/.cache/"
    CUDA_VISIBLE_DEVICES=2,3,4,5 python -m torch.distributed.launch --nproc_per_node=4 -m tevatron.driver.train \
      --output_dir $output_path \
      --model_name_or_path $CONDENSER_MODEL_NAME \
      --cache_dir $cache_path \
      --do_train \
      --save_steps 10000 \
      --train_dir $train_path \
      --fp16 \
      --per_device_train_batch_size 32 \
      --train_n_passages 2 \
      --learning_rate 5e-6 \
      --q_max_len 32 \
      --p_max_len 256 \
      --num_train_epochs 40 \
      --negatives_x_device \
      --positive_passage_no_shuffle \
      --untie_encoder \
      --grad_cache \
      --gc_p_chunk_size 24 \
      --gc_q_chunk_size 8
    
    
    
    #encode_emb_passage.sh
    
    OUTDIR="./temp"
    wiki_dir="/data2/private/xxx/condenser/wikipedia-corpus" 
    #"/data2/private/xxx/DPR/downloads/psgs_w100.tsv"
    CONDENSER_MODEL_NAME="/data2/private/xxx/model/co-condenser-wiki"
    train_path="/data2/private/xxx/condenser/nq-train/"
    model_path="/data2/private/xxx/condenser/model_nq3/"
    cache_path="/data2/private/xxx/condenser/.cache/"
    emb_nq_path="/data2/private/xxx/condenser/embeddings-nq"
    emb_query_path="/data2/private/xxx/condenser/embeddings-nq-queries/"
    query_path="/data2/private/xxx/condenser/nq-test-queries.json"
    MODEL_DIR=nq-model
    
    echo $1 #  $1 is the id of GPU
    for s in $(seq -f "%02g" $2 $3) # 0 - 19
    do
    CUDA_VISIBLE_DEVICES=$1 python -m tevatron.driver.encode \
      --output_dir=$OUTDIR \
      --cache_dir $cache_path \
      --model_name_or_path $model_path/checkpoint-40000/passage_model \
      --tokenizer_name $model_path \
      --fp16 \
      --per_device_eval_batch_size 64 \
      --p_max_len 256 \
      --dataset_proc_num 8 \
      --encode_in_path $wiki_dir/docs$s.json \
      --encoded_save_path $emb_nq_path/$s.pt \
      --encode_num_shard 20 \
      --passage_field_separator sep_token \
      --encode_shard_index $s
    done
    
    
    #encode_emb_query.sh
    
    OUTDIR="./temp"
    wiki_dir="/data2/private/xxx/condenser/wikipedia-corpus" 
    #"/data2/private/xxx/DPR/downloads/psgs_w100.tsv"
    CONDENSER_MODEL_NAME="/data2/private/xxx/model/co-condenser-wiki"
    train_path="/data2/private/xxx/condenser/nq-train/"
    model_path="/data2/private/xxx/condenser/model_nq3/"
    cache_path="/data2/private/xxx/condenser/.cache/"
    emb_nq_path="/data2/private/xxx/condenser/embeddings-nq/"
    emb_query_path="/data2/private/xxx/condenser/embeddings-nq-queries/"
    query_path="/data2/private/xxx/condenser/nq-test-queries.json"
    MODEL_DIR=nq-model
    
    
    # query
    
    CUDA_VISIBLE_DEVICES=$1 python -m tevatron.driver.encode \
      --output_dir=$OUTDIR \
      --model_name_or_path $model_path/checkpoint-40000/query_model \
      --tokenizer_name $model_path \
      --fp16 \
      --per_device_eval_batch_size 64 \
      --q_max_len 32 \
      --dataset_proc_num 2 \
      --encode_in_path $query_path \
      --encoded_save_path $emb_query_path/query.pt \
      --encode_is_qry
    
    #inference.sh
    
    ENCODE_QRY_DIR="/data2/private/xxx/condenser/embeddings-nq-queries/"
    ENCODE_DIR="/data2/private/xxx/condenser/embeddings-nq/"
    DEPTH=200
    RUN="/data2/private/xxx/condenser/run.nq.test.txt"
    OUTDIR="./temp"
    wiki_dir="/data2/private/xxx/condenser/wikipedia-corpus" 
    #"/data2/private/xxx/DPR/downloads/psgs_w100.tsv"
    
    MODEL_DIR=nq-model
    python -m tevatron.faiss_retriever \
    --query_reps $ENCODE_QRY_DIR/query.pt \
    --passage_reps $ENCODE_DIR/'*.pt' \
    --depth $DEPTH \
    --batch_size -1 \
    --save_text \
    --save_ranking_to $RUN
    
    #eval.sh
    RUN="/data2/private/xxx/condenser/run.nq.test.txt"
    trec_out="/data2/private/xxx/condenser/run.nq.test.teIn"
    json_out="/data2/private/xxx/condenser/run.nq.test.json"
    python -m tevatron.utils.format.convert_result_to_trec \
        --input $RUN --output $trec_out
    
    
    python -m pyserini.eval.convert_trec_run_to_dpr_retrieval_run --topics dpr-nq-test \
                                                                    --index wikipedia-dpr \
                                                                    --input $trec_out \
                                                                    --output $json_out
    
    python -m pyserini.eval.evaluate_dpr_retrieval --retrieval $json_out \
        --topk 5 20 100
    

    Is there any parameter I set wrong?

    Thanks!

    opened by Facico 8
  • how to improve the results with uniCOIL

    Hi, Thanks for the great work!

    I ran experiments with the `modeling` in tevatron (but the data loader is implemented by myself) on msmarco-passage. For DenseModel, the result reaches MRR@10: 0.31+. But for uniCOIL, the result is only about MRR@10: 0.26+ (this is far from your result of 0.328).

    In fact, I noticed that the implementation of uniCOIL in tevatron is somewhat different from the paper (https://github.com/luyug/COIL). By comparison, I also ran experiments with the code in https://github.com/luyug/COIL (dim=1 and no_cls), and got a similar result as above.

    Can you provide some insight on how to improve this result? For example, is there any special operation on the data for uniCOIL? I also tried to initialize the model with distilbert as the example shows, but got a worse result.

    opened by caiyinqiong 5
  • Add encoder for SPLADE

    Hi, here's a PR to add an encoder for SPLADE (not sure if it would work directly for the other sparse implementations, but it shouldn't be hard) and instructions for indexing and retrieving with Anserini. Most of it assumes that one has already downloaded the data for the coCondenser-marco example and has Anserini pre-installed.

    opened by cadurosar 5
  • RunTimeError when training SPLADE - .get_world_size() issues

    Hi,

    I'm trying to train a SPLADE model using the guidelines at https://github.com/texttron/tevatron/tree/main/examples/splade, but I am getting the following runtime error:

    Traceback (most recent call last):
      File "/home/src/tevatron/examples/splade/train_splade.py", line 135, in <module>
        main()
      File "/home/src/tevatron/examples/splade/train_splade.py", line 116, in main
        trainer = SpladeTrainer(
      File "/home/src/tevatron/examples/splade/train_splade.py", line 31, in __init__
        self.world_size = torch.distributed.get_world_size()
      File "/home/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 867, in get_world_size
        return _get_group_size(group)
      File "/home/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 325, in _get_group_size
        default_pg = _get_default_group()
      File "/home/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 429, in _get_default_group
        raise RuntimeError(
    RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

    code snippet from train_splade.py:

    class SpladeTrainer(TevatronTrainer):
        def __init__(self, *args, **kwargs):
            super(SpladeTrainer, self).__init__(*args, **kwargs)
            self.world_size = torch.distributed.get_world_size()
    

    Do you know why am I getting this error?

    Thanks a lot in advance :)

    opened by lboesen 4
  • How to reproduce results on Marco

    Thanks for your great work! I noticed the training hyperparameters in your github repo (training batch size, epochs, etc.) are different from those in your paper for training a dense retriever on MS MARCO. Could you provide the hyperparameters for reproducing the results in Table 3 of your paper? Thanks!

    opened by WenzhengZhang 4
  • why no index for Dense Retrieval models

    Hi,

    I was just wondering why we are not creating indexes for DPR models, and instead using tevatron.faiss_retriever right after encoding the queries and corpus.

    Thanks

    opened by lboesen 3
  • example of splade

    Why is q_max_len=128 in https://github.com/texttron/tevatron/tree/main/examples/splade/readme.md? Is it just a clerical error, or are there special considerations?

    Thank you.

    opened by caiyinqiong 2
  • Reproduction issue of coCondenser NQ

    I used the hard negatives (hn.bert.json) you provided and I can reproduce R@5=75.8. But when I train with my own hard negatives, R@5 is only 64.3.

    How to generate hard negatives for NQ? Could you provide a reproduction setup?

    Here is the setup for my hard negative mining:

      • Model: co-condenser-wiki trained with bm25 negatives
      • Negative depth: 200
      • Negative sample: 30

    Looking forward to your reply!!! Thank you!

    opened by SunSiShining 2
  • Reproduce Condenser Result on MSMARCO passage ranking

    Hi, wonderful work on this toolkit! I really like it!

    Following the README here, I use the following command to train the retriever with Condenser on 2 GPUs, which results in a total batch size of 64, the same setting as in the paper:

    python -m tevatron.driver.train \
      --output_dir ./output_model \
      --model_name_or_path Luyu/condenser \
      --save_steps 20000 \
      --fp16 \
      --train_dir ../marco/bert/train \
      --per_device_train_batch_size 32 \
      --learning_rate 5e-6 \
      --num_train_epochs 3 \
      --dataloader_num_workers 2
    

    The result I got is 0.331:

    #####################
    MRR @ 10: 0.3308558466366481
    QueriesRanked: 6980
    #####################

    Is there any parameter I missed setting? Thanks!

    opened by Albert-Ma 2