Tevatron is a simple and efficient toolkit for training and running dense retrievers with deep language models.

texttron

Last update: Jan 4, 2023

Related tags

Text Data & NLP information-retrieval pytorch transformer question-answering dpr dense-retrieval

Overview

Tevatron

Tevatron is a simple and efficient toolkit for training and running dense retrievers with deep language models. The toolkit has a modularized design for easy research; a set of command line tools is also provided for fast development and testing. A set of easy-to-use interfaces to Huggingface's state-of-the-art pre-trained transformers ensures Tevatron's superior performance.

Tevatron is currently in an initial development stage. We will be actively adding new features, and API changes may happen. Suggestions, feature requests and PRs are welcome.

Features

Command line interface for dense retriever training/encoding and dense index search.
Flexible and extensible PyTorch retriever models.
Highly efficient Trainer, a subclass of Huggingface Trainer, that natively supports training performance features like mixed precision and distributed data parallel.
Fast and memory-efficient train/inference data access based on memory mapping with Apache Arrow through Huggingface datasets.

Installation

First install the neural network and similarity search backends, namely PyTorch and FAISS. Check out the official installation guides for PyTorch and for FAISS.

Then install Tevatron with pip,

pip install tevatron

Or, typically for development/research, clone this repo and install it as editable,

git clone https://github.com/texttron/tevatron
cd tevatron
pip install --editable .

Note: The current code base has been tested with torch==1.8.2, faiss-cpu==1.7.1, transformers==4.9.2, datasets==1.11.0

Data Format

Training: Each line of the train file is a training instance,

{'query': TEXT_TYPE, 'positives': List[TEXT_TYPE], 'negatives': List[TEXT_TYPE]}
...

Inference/Encoding: Each line of the encoding file is a piece of text to be encoded,

{'text_id': "xxx", 'text': TEXT_TYPE}
...

Here TEXT_TYPE can be either raw string or pre-tokenized ids, i.e. List[int]. Using the latter can help lower data processing latency during training to reduce/eliminate GPU wait. Note: the current code requires text_id of passages/contexts to be convertible to integer, e.g. integers or string of integers.
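As an illustration of the two formats above, here is a minimal sketch that writes a train file and an encoding file with one JSON object per line. The toy records, the file names and the `write_jsonl` helper are hypothetical and not part of Tevatron; raw strings are used for TEXT_TYPE, and text_id values are strings of integers so that they stay convertible to integer.

```python
import json
import os
import tempfile


def write_jsonl(path, records):
    """Write one JSON object per line (hypothetical helper, not a Tevatron API)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


# Toy training instance: one query, a list of positives, a list of negatives.
train_records = [
    {
        "query": "what is dense retrieval",
        "positives": ["Dense retrieval encodes queries and passages as vectors."],
        "negatives": ["The Tevatron was a particle accelerator."],
    }
]

# Toy encoding input: text_id is a string of an integer, so int(text_id) works.
corpus_records = [
    {"text_id": "0", "text": "Dense retrieval encodes queries and passages as vectors."},
    {"text_id": "1", "text": "The Tevatron was a particle accelerator."},
]

out_dir = tempfile.mkdtemp()
train_path = os.path.join(out_dir, "train.json")
corpus_path = os.path.join(out_dir, "corpus.json")
write_jsonl(train_path, train_records)
write_jsonl(corpus_path, corpus_records)
```

Pre-tokenized input would follow the same layout, with each string replaced by a list of token ids.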

Training (Simple)

To train a simple dense retriever, call the tevatron.driver.train module,

python -m tevatron.driver.train \
  --output_dir $OUTDIR \
  --model_name_or_path bert-base-uncased \
  --do_train \
  --save_steps 20000 \
  --train_dir $TRAIN_DIR \
  --fp16 \
  --per_device_train_batch_size 8 \
  --learning_rate 5e-6 \
  --num_train_epochs 2 \
  --dataloader_num_workers 2

Here we picked the bert-base-uncased BERT weights from the Huggingface Hub and turned on AMP with --fp16 to speed up training. Several additional command flags are provided to configure the learned model, e.g. --add_pooler, which adds a linear projection. A full list of command line arguments can be found in tevatron.arguments.

Training (Research)

Check out run.py in the examples directory for a fully configurable train/test loop. Typically you will do,

from tevatron.modeling import DenseModel
from tevatron.trainer import DenseTrainer as Trainer

...
model = DenseModel.build(
        model_args,
        data_args,
        training_args,
        config=config,
        cache_dir=model_args.cache_dir,
    )
trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        data_collator=collator,
    )
...
trainer.train()

Encoding

To encode, call the tevatron.driver.encode module. For a large corpus, split the corpus into shards to parallelize encoding.

for s in shard1 shard2 shard3
do
python -m tevatron.driver.encode \
  --output_dir=$OUTDIR \
  --tokenizer_name $TOK \
  --config_name $CONFIG \
  --model_name_or_path $MODEL_DIR \
  --fp16 \
  --per_device_eval_batch_size 128 \
  --encode_in_path $CORPUS_DIR/$s.json \
  --encoded_save_path $ENCODE_DIR/$s.pt
done
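The loop above assumes the corpus has already been split into per-shard files. One possible way to produce them is sketched below; `shard_lines` and `write_shards` are hypothetical helpers rather than Tevatron utilities, and the round-robin assignment is just one reasonable choice.

```python
def shard_lines(lines, num_shards):
    """Distribute lines round-robin into num_shards lists."""
    shards = [[] for _ in range(num_shards)]
    for i, line in enumerate(lines):
        shards[i % num_shards].append(line)
    return shards


def write_shards(corpus_path, out_prefix, num_shards):
    """Split a one-JSON-object-per-line corpus into out_prefix1.json, out_prefix2.json, ..."""
    with open(corpus_path, encoding="utf-8") as f:
        # Drop blank lines and normalize line endings.
        lines = [line.rstrip("\n") + "\n" for line in f if line.strip()]
    paths = []
    for idx, shard in enumerate(shard_lines(lines, num_shards), start=1):
        path = f"{out_prefix}{idx}.json"
        with open(path, "w", encoding="utf-8") as out:
            out.writelines(shard)
        paths.append(path)
    return paths
```

For example, `write_shards("corpus.json", "shard", 3)` would write shard1.json, shard2.json and shard3.json in the current directory, matching the names used in the loop.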

Index Search

Call the tevatron.faiss_retriever module,

python -m tevatron.faiss_retriever \
--query_reps $ENCODE_QRY_DIR/qry.pt \
--passage_reps $ENCODE_DIR/'*.pt' \
--depth $DEPTH \
--batch_size -1 \
--save_text \
--save_ranking_to rank.tsv

The encoded corpus or corpus shards are loaded by glob pattern matching on the --passage_reps argument. The --batch_size argument controls the number of queries passed to the FAISS index in each search call; -1 passes all queries in one call. Larger batches typically run faster because of better memory access patterns and hardware utilization. Setting the --save_text flag saves the ranking to a tsv file with each line being qid pid score.
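For downstream processing, the text ranking can be read back into memory. The sketch below assumes the fields are tab-separated (as the tsv extension suggests) and that a higher score means a better match; `read_ranking` is a hypothetical helper, not part of Tevatron.

```python
from collections import defaultdict


def read_ranking(lines):
    """Parse 'qid<TAB>pid<TAB>score' lines into {qid: [(pid, score), ...]}.

    Hits for each query are sorted by descending score (assumed: higher is better).
    """
    ranking = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        qid, pid, score = line.split("\t")
        ranking[qid].append((pid, float(score)))
    for hits in ranking.values():
        hits.sort(key=lambda hit: hit[1], reverse=True)
    return dict(ranking)
```

A file written with --save_text could then be loaded with something like `with open("rank.tsv") as f: ranking = read_ranking(f)`.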

Alternatively, parallelize search over the shards,

for s in shard1 shard2 shard3
do
python -m tevatron.faiss_retriever \
--query_reps $ENCODE_QRY_DIR/qry.pt \
--passage_reps $ENCODE_DIR/$s.pt \
--depth $DEPTH \
--save_ranking_to $INTERMEDIATE_DIR/$s
done

Then combine the results using the reducer module,

python -m tevatron.faiss_retriever.reducer \
--score_dir $INTERMEDIATE_DIR \
--query $ENCODE_QRY_DIR/qry.pt \
--save_ranking_to rank.txt
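Conceptually, combining per-shard results means pooling each query's hits from all shards and keeping the best-scoring ones. The sketch below illustrates that idea only; it is not the reducer module's actual implementation, and `merge_shard_rankings`, its in-memory input layout and the higher-is-better scoring are assumptions made for the example.

```python
import heapq


def merge_shard_rankings(shard_rankings, depth):
    """Merge per-shard rankings into one top-`depth` ranking per query.

    shard_rankings: iterable of {qid: [(pid, score), ...]} dicts, one per shard.
    Returns {qid: [(pid, score), ...]} sorted by descending score.
    """
    pooled = {}
    for ranking in shard_rankings:
        for qid, hits in ranking.items():
            pooled.setdefault(qid, []).extend(hits)
    return {
        qid: heapq.nlargest(depth, hits, key=lambda hit: hit[1])
        for qid, hits in pooled.items()
    }
```

Because every shard returns its own top hits up to the requested depth, the overall top hits for a query are always contained in the pooled candidates, so the merge is exact as long as each shard search is.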

Contacts

If you have a toolkit specific question, feel free to open an issue.

You can also reach out to us for general comments/suggestions/questions through email.

Luyu Gao [email protected]
Xueguang Ma [email protected]

Comments

coCondenser MS-MARCO Passage Retrieval example raises error

I'm trying to reproduce this example: https://github.com/texttron/tevatron/tree/main/examples/coCondenser-marco

Inference with the fine-tuned checkpoint (encoding and index search) works fine; however, fine-tuning stage 1 (training) raises an error:

Downloading and preparing dataset json/default (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /home/dmx/.cache/huggingface/datasets/json/default-e642d34fc5e4ebf2/0.0.0/793d004298099bd3c4e61eb7878475bcf1dc212bf2e34437d85126758720d7f9...
10/29/2021 10:44:35 - WARNING - datasets.builder - Using custom data configuration default-e642d34fc5e4ebf2
Traceback (most recent call last):
  File "/home/dmx/anaconda3/lib/python3.8/site-packages/tevatron/driver/train.py", line 117, in <module>
    main()
  File "/home/dmx/anaconda3/lib/python3.8/site-packages/tevatron/driver/train.py", line 82, in main
    train_dataset = TrainDataset(
  File "/home/dmx/anaconda3/lib/python3.8/site-packages/tevatron/data.py", line 29, in __init__
    self.train_data = datasets.load_dataset(
  File "/home/dmx/anaconda3/lib/python3.8/site-packages/datasets-1.9.0-py3.8.egg/datasets/load.py", line 856, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/dmx/anaconda3/lib/python3.8/site-packages/datasets-1.9.0-py3.8.egg/datasets/builder.py", line 583, in download_and_prepare
    self._download_and_prepare(
  File "/home/dmx/anaconda3/lib/python3.8/site-packages/datasets-1.9.0-py3.8.egg/datasets/builder.py", line 639, in _download_and_prepare
    split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
  File "/home/dmx/anaconda3/lib/python3.8/site-packages/datasets-1.9.0-py3.8.egg/datasets/packaged_modules/json/json.py", line 46, in _split_generators
    raise ValueError(f"At least one data file must be specified, but got data_files={self.config.data_files}")
ValueError: At least one data file must be specified, but got data_files=None

I think the problem may come from lines 25~32 of tevatron.driver.train.py, but I don't know the specific reason or how to solve it:

if isinstance(path_to_data, datasets.Dataset):
    self.train_data = path_to_data
else:
    self.train_data = datasets.load_dataset(
        'json',
        data_dir=path_to_data,
        ignore_verifications=False,
    )['train']

opened by YuLengsen 10
Save the last checkpoint also in a folder as others

Currently, the last checkpoint is saved in the root folder that contains the other checkpoints. This minor change puts the last checkpoint in its own folder at the same level as the others.

opened by ArvinZhuang 9

Question about reproducing coCondenser-nq

Hi, @luyug.

Thanks for your awesome work and detailed guidelines. I reproduced the model according to coCondenser-nq's [README](https://github.com/texttron/tevatron/tree/main/examples/coCondenser-nq), but I got the following results (evaluated with pyserini).

Top5    accuracy: 0.3526315789473684                         
Top20   accuracy: 0.47700831024930745 
Top100  accuracy: 0.5833795013850416

I think I made a mistake in one of the steps, because the results are lower than those of BM25. I executed the following scripts in sequence to train the model. (The co-condenser-wiki model was downloaded from Huggingface.)

#prepare_data.sh

nq_train_path="/data2/private/xxx/DPR/downloads/data/retriever/nq-train.json" #biencoder-nq-train.json
output_path="/data2/private/xxx/condenser/nq-train/bm25.bert.json"
model_path="/data2/private/xxx/model/co-condenser-wiki"
hn_path="/data2/private/xxx/condenser/hn.json"
output_hn_path="/data2/private/xxx/condenser/nq-train/hn.bert.json"
python prepare_wiki_train.py --input $nq_train_path --output $output_path --tokenizer $model_path

python prepare_wiki_train.py --input $hn_path --output $output_hn_path --tokenizer $model_path

#train_nq.sh
CONDENSER_MODEL_NAME="/data2/private/xxx/model/co-condenser-wiki"
train_path="/data2/private/xxx/condenser/nq-train/"
output_path="/data2/private/xxx/condenser/model_nq3/"
cache_path="/data2/private/xxx/condenser/.cache/"
CUDA_VISIBLE_DEVICES=2,3,4,5 python -m torch.distributed.launch --nproc_per_node=4 -m tevatron.driver.train \
  --output_dir $output_path \
  --model_name_or_path $CONDENSER_MODEL_NAME \
  --cache_dir $cache_path \
  --do_train \
  --save_steps 10000 \
  --train_dir $train_path \
  --fp16 \
  --per_device_train_batch_size 32 \
  --train_n_passages 2 \
  --learning_rate 5e-6 \
  --q_max_len 32 \
  --p_max_len 256 \
  --num_train_epochs 40 \
  --negatives_x_device \
  --positive_passage_no_shuffle \
  --untie_encoder \
  --grad_cache \
  --gc_p_chunk_size 24 \
  --gc_q_chunk_size 8

#encode_emb_passage.sh

OUTDIR="./temp"
wiki_dir="/data2/private/xxx/condenser/wikipedia-corpus" 
#"/data2/private/xxx/DPR/downloads/psgs_w100.tsv"
CONDENSER_MODEL_NAME="/data2/private/xxx/model/co-condenser-wiki"
train_path="/data2/private/xxx/condenser/nq-train/"
model_path="/data2/private/xxx/condenser/model_nq3/"
cache_path="/data2/private/xxx/condenser/.cache/"
emb_nq_path="/data2/private/xxx/condenser/embeddings-nq"
emb_query_path="/data2/private/xxx/condenser/embeddings-nq-queries/"
query_path="/data2/private/xxx/condenser/nq-test-queries.json"
MODEL_DIR=nq-model

echo $1 #  $1 is the id of GPU
for s in $(seq -f "%02g" $2 $3) # 0 - 19
do
CUDA_VISIBLE_DEVICES=$1 python -m tevatron.driver.encode \
  --output_dir=$OUTDIR \
  --cache_dir $cache_path \
  --model_name_or_path $model_path/checkpoint-40000/passage_model \
  --tokenizer_name $model_path \
  --fp16 \
  --per_device_eval_batch_size 64 \
  --p_max_len 256 \
  --dataset_proc_num 8 \
  --encode_in_path $wiki_dir/docs$s.json \
  --encoded_save_path $emb_nq_path/$s.pt \
  --encode_num_shard 20 \
  --passage_field_separator sep_token \
  --encode_shard_index $s
done

#encode_emb_query.sh

OUTDIR="./temp"
wiki_dir="/data2/private/xxx/condenser/wikipedia-corpus" 
#"/data2/private/xxx/DPR/downloads/psgs_w100.tsv"
CONDENSER_MODEL_NAME="/data2/private/xxx/model/co-condenser-wiki"
train_path="/data2/private/xxx/condenser/nq-train/"
model_path="/data2/private/xxx/condenser/model_nq3/"
cache_path="/data2/private/xxx/condenser/.cache/"
emb_nq_path="/data2/private/xxx/condenser/embeddings-nq/"
emb_query_path="/data2/private/xxx/condenser/embeddings-nq-queries/"
query_path="/data2/private/xxx/condenser/nq-test-queries.json"
MODEL_DIR=nq-model


# query

CUDA_VISIBLE_DEVICES=$1 python -m tevatron.driver.encode \
  --output_dir=$OUTDIR \
  --model_name_or_path $model_path/checkpoint-40000/query_model \
  --tokenizer_name $model_path \
  --fp16 \
  --per_device_eval_batch_size 64 \
  --q_max_len 32 \
  --dataset_proc_num 2 \
  --encode_in_path $query_path \
  --encoded_save_path $emb_query_path/query.pt \
  --encode_is_qry

#inference.sh

ENCODE_QRY_DIR="/data2/private/xxx/condenser/embeddings-nq-queries/"
ENCODE_DIR="/data2/private/xxx/condenser/embeddings-nq/"
DEPTH=200
RUN="/data2/private/xxx/condenser/run.nq.test.txt"
OUTDIR="./temp"
wiki_dir="/data2/private/xxx/condenser/wikipedia-corpus" 
#"/data2/private/xxx/DPR/downloads/psgs_w100.tsv"

MODEL_DIR=nq-model
python -m tevatron.faiss_retriever \
--query_reps $ENCODE_QRY_DIR/query.pt \
--passage_reps $ENCODE_DIR/'*.pt' \
--depth $DEPTH \
--batch_size -1 \
--save_text \
--save_ranking_to $RUN

#eval.sh
RUN="/data2/private/xxx/condenser/run.nq.test.txt"
trec_out="/data2/private/xxx/condenser/run.nq.test.teIn"
json_out="/data2/private/xxx/condenser/run.nq.test.json"
python -m tevatron.utils.format.convert_result_to_trec \
    --input $RUN --output $trec_out


python -m pyserini.eval.convert_trec_run_to_dpr_retrieval_run --topics dpr-nq-test \
                                                                --index wikipedia-dpr \
                                                                --input $trec_out \
                                                                --output $json_out

python -m pyserini.eval.evaluate_dpr_retrieval --retrieval $json_out \
    --topk 5 20 100

Is there any parameter I set wrong?

Thanks!

opened by Facico 8

how to improve the results with uniCOIL

Hi, Thanks for the great work!

I ran experiments on msmarco-passage with the modeling module in tevatron (the data loader is my own implementation). For DenseModel, the result reaches MRR@10: 0.31+. But for uniCOIL, the result is only about MRR@10: 0.26+ (far from your result of 0.328).

In fact, I noticed that the implementation of uniCOIL in tevatron is somewhat different from the paper (https://github.com/luyug/COIL). For comparison, I also ran experiments with the code in https://github.com/luyug/COIL (dim=1 and no_cls) and got a similar result.

Can you provide some insight on how to improve this result? For example, is there any special operation on the data for uniCOIL? I also tried initializing the model with distilbert as the example shows, but got a worse result.

opened by caiyinqiong 5
Add encoder for SPLADE

Hi, here's a PR to add an encoder for SPLADE (not sure if it would work directly for the other sparse implems, but shouldn't be hard) and instructions for indexing and retrieving with Anserini. Most of it is based on the fact that one has already downloaded data for the coCondenser-marco example and has pre-installed Anserini.

opened by cadurosar 5
RunTimeError when training SPLADE - .get_world_size() issues
Hi,

I'm trying to train a SPLADE model using the guidelines at https://github.com/texttron/tevatron/tree/main/examples/splade, but I am getting the following runtime error:

Traceback (most recent call last):
  File "/home/src/tevatron/examples/splade/train_splade.py", line 135, in <module>
    main()
  File "/home/src/tevatron/examples/splade/train_splade.py", line 116, in main
    trainer = SpladeTrainer(
  File "/home/src/tevatron/examples/splade/train_splade.py", line 31, in __init__
    self.world_size = torch.distributed.get_world_size()
  File "/home/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 867, in get_world_size
    return _get_group_size(group)
  File "/home/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 325, in _get_group_size
    default_pg = _get_default_group()
  File "/home/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 429, in _get_default_group
    raise RuntimeError(
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

Code snippet from train_splade.py:

class SpladeTrainer(TevatronTrainer):
    def __init__(self, *args, **kwargs):
        super(SpladeTrainer, self).__init__(*args, **kwargs)
        self.world_size = torch.distributed.get_world_size()

Do you know why I am getting this error?

Thanks a lot in advance :)
opened by lboesen 4
How to reproduce results on Marco

Thanks for your great work! I noticed that the training hyperparameters in your GitHub repo (training batch size, epochs, etc.) are different from those in your paper for training a dense retriever on MS MARCO. Could you provide the hyperparameters for reproducing the results in Table 3 of your paper? Thanks!

opened by WenzhengZhang 4
why no index for Dense Retrieval models

Hi,

I was just wondering why we are not creating indexes for DPR models, and are instead using tevatron.faiss_retriever right after encoding the queries and corpus.

Thanks

opened by lboesen 3
example of splade

why q_max_len=128 in https://github.com/texttron/tevatron/tree/main/examples/splade/readme.md? Is it just a clerical error or special considerations?

Thank you.

opened by caiyinqiong 2
Reproduction issue of coCondenser NQ

I used the hard negatives (hn.bert.json) you provided and could reproduce R@5=75.8. But when I train with my own hard negatives, R@5 is only 64.3.

How to generate hard negatives for NQ? Could you provide a reproduction setup?

Here is the setup for my mining hard negatives: Model: co-condenser-wiki trained with bm25-negative Negative depth: 200 Negative sample: 30

Looking forward to your reply!!! Thank you!

opened by SunSiShining 2
Reproduce Condenser Result on MSMARCO passage ranking
Hi, wonderful work on this toolkit! I really like it!

Following the README here, I use the following command to train the retriever with Condenser on 2 GPUS which results in the total batch size of 64, the same setting as in the paper:

python -m tevatron.driver.train \
  --output_dir ./output_model \
  --model_name_or_path Luyu/condenser \
  --save_steps 20000 \
  --fp16 \
  --train_dir ../marco/bert/train \
  --per_device_train_batch_size 32 \
  --learning_rate 5e-6 \
  --num_train_epochs 3 \
  --dataloader_num_workers 2

The result I got is 0.331:

##################### MRR @ 10: 0.3308558466366481 QueriesRanked: 6980 #####################

Is there any parameter I missed to set? Thanks!
opened by Albert-Ma 2

Owner

texttron

GitHub

This repository contains all the source code that is needed for the project : An Efficient Pipeline For Bloom’s Taxonomy Using Natural Language Processing and Deep Learning

Pipeline For NLP with Bloom's Taxonomy Using Improved Question Classification and Question Generation using Deep Learning This repository contains all

9 Jul 17, 2021

Simple tool/toolkit for evaluating NLG (Natural Language Generation) offering various automated metrics.

Simple tool/toolkit for evaluating NLG (Natural Language Generation) offering various automated metrics. Jury offers a smooth and easy-to-use interface. It uses datasets for underlying metric computation, and hence adding custom metric is easy as adopting datasets.Metric.

129 Jan 6, 2023

Code and checkpoints for training the transformer-based Table QA models introduced in the paper TAPAS: Weakly Supervised Table Parsing via Pre-training.

End-to-end neural table-text understanding models.

914 Jan 7, 2023

This project is part of Eleuther AI's quest to create a massive repository of high quality text data for training language models.

42 Dec 13, 2022

Ongoing research training transformer language models at scale, including: BERT & GPT-2

What is this fork of Megatron-LM and Megatron-DeepSpeed This is a detached fork of https://github.com/microsoft/Megatron-DeepSpeed, which in itself is

316 Jan 3, 2023

Ongoing research training transformer language models at scale, including: BERT & GPT-2

Megatron (1 and 2) is a large, powerful transformer developed by the Applied Deep Learning Research team at NVIDIA.

3.5k Dec 30, 2022

Pre-training BERT masked language models with custom vocabulary

Pre-training BERT Masked Language Models (MLM) This repository contains the method to pre-train a BERT model using custom vocabulary. It was used to p

14 Nov 2, 2022

A Neural Language Style Transfer framework to transfer natural language text smoothly between fine-grained language styles like formal/casual, active/passive, and many more. Created by Prithiviraj Damodaran. Open to pull requests and other forms of collaboration.

Styleformer A Neural Language Style Transfer framework to transfer natural language text smoothly between fine-grained language styles like formal/cas

431 Dec 19, 2022

A simple Speech Emotion Recognition (SER) API created using Flask and running in a Docker container.

emovoz Introduction A simple Speech Emotion Recognition (SER) API created using Flask and running in a Docker container. The SER system was built with

2 Nov 11, 2022

A design of MIDI language for music generation task, specifically for Natural Language Processing (NLP) models.

MIDI Language Introduction Reference Paper: Pop Music Transformer: Beat-based Modeling and Generation of Expressive Pop Piano Compositions: code This

3 May 25, 2022

A collection of Classical Chinese natural language processing models, including Classical Chinese related models and resources on the Internet.

GuwenModels: 古文自然语言处理模型合集, 收录互联网上的古文相关模型及资源. A collection of Classical Chinese natural language processing models, including Classical Chinese related models and resources on the Internet.

66 Dec 26, 2022

Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow. This is part of the CASL project: http://casl-project.ai/

Texar is a toolkit aiming to support a broad set of machine learning, especially natural language processing and text generation tasks. Texar provides

2.3k Jan 7, 2023

Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow. This is part of the CASL project: http://casl-project.ai/

Texar is a toolkit aiming to support a broad set of machine learning, especially natural language processing and text generation tasks. Texar provides

2.1k Feb 17, 2021

Code to reproduce NeurIPS paper: Accelerated Sparse Neural Training: A Provable and Efficient Method to Find N:M Transposable Masks

Accelerated Sparse Neural Training: A Provable and Efficient Method to Find N:M Transposable Masks Recently, researchers proposed pruning deep neural n

4 Feb 23, 2022
