Build Text Rerankers with Deep Language Models

Luyu Gao

Last update: Dec 6, 2022

Reranker

Reranker is a lightweight, effective and efficient package for training and deploying deep language model rerankers in information retrieval (IR), question answering (QA) and many other natural language processing (NLP) pipelines. The training procedure follows our ECIR paper Rethink Training of BERT Rerankers in Multi-Stage Retrieval Pipeline, using a localized contrastive estimation (LCE) loss.

Reranker speaks Huggingface 🤗 language! This means that you instantly get all state-of-the-art pre-trained models as soon as they are ported to HF transformers. You also get the familiar model and trainer interfaces.

State of the Art Performance

Reranker has two submissions to the MS MARCO document leaderboard. Each took 1st place, advancing the SOTA!

Date	Submission Name	Dev MRR@100	Eval MRR@100
2021/01/20	LCE loss + HDCT (ensemble)	0.464	0.405
2020/09/09	HDCT top100 + BERT-base FirstP (single)	0.434	0.382

Features

Training rerankers from the state-of-the-art pre-trained language models like BERT, RoBERTa and ELECTRA.
The state-of-the-art reranking performance with our LCE loss based training pipeline.
GPU memory optimizations: Loss Parallelism and Gradient Cache, which allow training of larger models.
Faster training
- Distributed Data Parallel (DDP) for multiple GPUs.
- Automatic Mixed Precision (AMP) training and inference with up to 2x speedup!
Break CPU RAM limitation by memory mapping datasets with pyarrow through datasets package interface.
Checkpoint interoperability with Hugging Face transformers.

Design Philosophy

The library is dedicated to text reranking: modeling, training and testing. This helps us keep the code concise and focused on a specific task.

Under the hood, Reranker provides a thin wrapper layer over Hugging Face libraries. Our model wraps PreTrainedModel and our trainer subclasses the Hugging Face Trainer, so you can work with the familiar interfaces.

Installation and Dependencies

Reranker uses PyTorch, Hugging Face Transformers and Datasets. Install with the following commands,

git clone https://github.com/luyug/Reranker.git
cd Reranker
pip install .

Reranker has been tested with torch==1.6.0, transformers==4.2.0, datasets==1.1.3.

For development, install as editable,

pip install -e .

Workflow

Inference (Reranking)

The easiest way to do inference is to use one of our uploaded trained checkpoints with RerankerForInference.

from reranker import RerankerForInference
rk = RerankerForInference.from_pretrained("Luyu/bert-base-mdoc-bm25")  # load checkpoint

inputs = rk.tokenize('weather in new york', 'it is cold today in new york', return_tensors='pt')
score = rk(inputs).logits
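
In a pipeline, the same scoring call is usually applied to every candidate returned by a first-stage retriever, and the candidates are then sorted by score. The sketch below illustrates only that sorting step. The names rerank and overlap_score are hypothetical and not part of Reranker; overlap_score is a toy stand-in so the example runs without a checkpoint. With Reranker, the scoring function would presumably wrap rk.tokenize and rk(inputs).logits as shown above.

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],
) -> List[Tuple[str, float]]:
    """Score every (query, candidate) pair and sort by descending score."""
    scored = [(doc, score_fn(query, doc)) for doc in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def overlap_score(query: str, doc: str) -> float:
    """Toy stand-in scorer (hypothetical): number of shared lowercase words."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


ranked = rerank(
    "weather in new york",
    ["stock prices fell sharply", "it is cold today in new york"],
    overlap_score,
)
for doc, score in ranked:
    print(score, doc)
```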

Training

For training, you will need a model, a dataset and a trainer. Say we have parsed arguments into model_args, data_args and training_args with reranker.arguments. First, initialize the reranker and tokenizer from one of the pre-trained language models from Hugging Face. For example, let's use RoBERTa by loading roberta-base.

from reranker import Reranker 
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('roberta-base')
model = Reranker.from_pretrained(model_args, data_args, training_args, 'roberta-base')

Then create the dataset,

from reranker.data import GroupedTrainDataset
train_dataset = GroupedTrainDataset(
    data_args, data_args.train_path, 
    tokenizer=tokenizer, train_args=training_args
)

Create a trainer and train,

from reranker import RerankerTrainer
from reranker.data import GroupCollator  # collates each query's group of passages into a batch
trainer = RerankerTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=GroupCollator(tokenizer),
)
trainer.train()

See full examples in our examples.

Examples

MS MARCO Document Ranking with Reranker

More to come

Large Models

Loss Parallelism

We support computing a query's LCE loss across multiple GPUs with the flag --collaborative. Note that a group size (pos + neg) not divisible by the number of GPUs may cause undefined behaviour. You will typically want to use it with gradient accumulation steps greater than one.

Detailed instructions to be added.
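
For reference, the per-query loss being split across GPUs is a softmax cross-entropy over the scores of one group (one positive plus its negatives). The following is a minimal pure-Python sketch of that quantity for a single query, not Reranker's implementation. Placing the positive at index 0 is an assumption made for illustration, and check_group_size is a hypothetical helper that only mirrors the divisibility note above.

```python
import math
from typing import List


def lce_loss(group_scores: List[float], positive_index: int = 0) -> float:
    """Cross-entropy of the positive against all passages in one query's group.

    Computes -log(exp(s_pos) / sum_j exp(s_j)) with a stable log-sum-exp.
    """
    m = max(group_scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in group_scores))
    return log_z - group_scores[positive_index]


def check_group_size(group_size: int, num_gpus: int) -> None:
    """Hypothetical guard: reject group sizes that cannot be split evenly."""
    if group_size % num_gpus != 0:
        raise ValueError(
            f"group size {group_size} is not divisible by {num_gpus} GPUs"
        )


# One positive (index 0) and three negatives, with made-up scores.
scores = [2.0, 0.5, -1.0, 0.0]
check_group_size(len(scores), num_gpus=2)
loss = lce_loss(scores)
print(loss)
```

The loss shrinks as the positive's score grows relative to the negatives, which is why harder negatives in the group give a stronger training signal.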

Gradient Cache

Experimental: we provide subclasses RerankerDC and RerankerDCTrainer. In the MS MARCO example, you can use them with the --distance_cache argument to activate gradient caching with respect to the computed unnormalized distance. This potentially allows training with an unlimited number of negatives, beyond the GPU memory limitation and up to numerical precision. The method is described in our preprint Scaling Deep Contrastive Learning Batch Size with Almost Constant Peak Memory Usage.

Detailed instructions to be added.

Helpers

We provide a few helpers in the helper directory for data formatting,

Score Formatting

score_to_marco.py turns a raw score txt file into MS MARCO format.
score_to_tein.py turns a raw score txt file into TREC eval format.

For example,

python score_to_tein.py --score_file {path to raw score txt}

This generates a TREC eval format file in the same directory as the raw score file.

Data Format

Reranker core utilities (batch training, batch inference) expect processed and tokenized text in token id format. This means pre-processing should be done beforehand, e.g. with a BERT tokenizer.

Training Data

Training data is grouped by query into a json file where each line has a query, its corresponding positives and sampled negatives.

{
    "qry": {
        "qid": str,
        "query": List[int],
    },
    "pos": List[
        {
            "pid": str,
            "passage": List[int],
        }
    ],
    "neg": List[
        {
            "pid": str,
            "passage": List[int]
        }
    ]
}

Training data is handled by class reranker.data.GroupedTrainDataset.
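
As an illustration of the format above, the sketch below builds one training line with the standard json module. The function name make_train_line and the token ids are placeholders chosen for this example; real ids would come from your tokenizer.

```python
import json
from typing import List, Tuple

Passage = Tuple[str, List[int]]  # (pid, token ids)


def make_train_line(
    qid: str,
    query_ids: List[int],
    positives: List[Passage],
    negatives: List[Passage],
) -> str:
    """Serialize one query group as a single JSON line (keys: qry, pos, neg)."""
    record = {
        "qry": {"qid": qid, "query": query_ids},
        "pos": [{"pid": pid, "passage": ids} for pid, ids in positives],
        "neg": [{"pid": pid, "passage": ids} for pid, ids in negatives],
    }
    return json.dumps(record)


# Placeholder token ids, for illustration only.
line = make_train_line(
    qid="q1",
    query_ids=[11, 12, 13],
    positives=[("p7", [21, 22, 23, 24])],
    negatives=[("p3", [31, 32]), ("p9", [41, 42, 43])],
)
print(line)
```

Writing one such line per query yields the file that is passed as the training path.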

Inference (Reranking) Data

Inference data is organized as query-document (passage) pairs. Each line is a json entry to be reranked (scored).

{
    "qid": str,
    "pid": str,
    "qry": List[int],
    "psg": List[int]
}

To speed up postprocessing, we currently take an additional tsv specifying text ids,

qid0     pid0
qid0     pid1
...

The ordering in the two files is expected to be the same.

Inference data is handled by class reranker.data.PredictionDataset.
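
One simple way to keep the two files in the same order is to write them in a single loop. The sketch below does this with in-memory buffers; write_prediction_files is a hypothetical helper and the token ids are placeholders.

```python
import io
import json
from typing import Iterable, List, TextIO, Tuple

Pair = Tuple[str, str, List[int], List[int]]  # (qid, pid, query ids, passage ids)


def write_prediction_files(
    pairs: Iterable[Pair], json_out: TextIO, tsv_out: TextIO
) -> None:
    """Write the JSON-lines file and the id tsv together so their order matches."""
    for qid, pid, qry, psg in pairs:
        entry = {"qid": qid, "pid": pid, "qry": qry, "psg": psg}
        json_out.write(json.dumps(entry) + "\n")
        tsv_out.write(f"{qid}\t{pid}\n")


# Placeholder token ids, for illustration only.
pairs = [
    ("qid0", "pid0", [11, 12], [21, 22, 23]),
    ("qid0", "pid1", [11, 12], [31, 32]),
]
json_buf, tsv_buf = io.StringIO(), io.StringIO()
write_prediction_files(pairs, json_buf, tsv_buf)
print(json_buf.getvalue())
print(tsv_buf.getvalue())
```

In practice the two buffers would be replaced by files opened for writing.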

Result Scores

Scores are stored in a tsv file with columns corresponding to qid, pid and score.

qid0     pid0     s0
qid0     pid1     s1
...

You can post-process it with our helper scripts into MS MARCO format or TREC eval format.
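
To show what such post-processing involves, here is a small sketch that ranks passages per query by score and emits MS MARCO style (qid, pid, rank) and TREC style (qid Q0 pid rank score tag) lines. It is not the code of the helper scripts; the function names and the run tag "reranker" are assumptions for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Ranked = Tuple[str, str, int, float]  # (qid, pid, rank, score)


def rank_scores(lines: Iterable[str]) -> List[Ranked]:
    """Group 'qid pid score' rows by query and rank by descending score."""
    by_query: Dict[str, List[Tuple[str, float]]] = defaultdict(list)
    for line in lines:
        qid, pid, score = line.split()
        by_query[qid].append((pid, float(score)))
    ranked: List[Ranked] = []
    for qid, items in by_query.items():
        items.sort(key=lambda item: item[1], reverse=True)
        for rank, (pid, score) in enumerate(items, start=1):
            ranked.append((qid, pid, rank, score))
    return ranked


def to_marco(ranked: List[Ranked]) -> List[str]:
    """MS MARCO style rows: qid, pid, rank (tab separated)."""
    return [f"{qid}\t{pid}\t{rank}" for qid, pid, rank, _ in ranked]


def to_trec(ranked: List[Ranked], run_tag: str = "reranker") -> List[str]:
    """TREC style rows: qid Q0 pid rank score run_tag."""
    return [f"{qid} Q0 {pid} {rank} {score} {run_tag}" for qid, pid, rank, score in ranked]


rows = ["qid0\tpid0\t0.1", "qid0\tpid1\t0.9", "qid1\tpid5\t0.4"]
ranked = rank_scores(rows)
print("\n".join(to_marco(ranked)))
print("\n".join(to_trec(ranked)))
```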

Contribution

We welcome contributions to the package, whether new dataset interfaces or new models.

Contact

You can reach me by email [email protected]. As a 2nd year master's student, I get busy from time to time and may not reply very promptly. Feel free to ping me if you don't get a reply.

Citation

If you use Reranker in your research, please consider citing our ECIR paper,

@inproceedings{gao2021lce,
      title={Rethink Training of BERT Rerankers in Multi-Stage Retrieval Pipeline},
      author={Luyu Gao and Zhuyun Dai and Jamie Callan},
      year={2021},
      booktitle={The 43rd European Conference On Information Retrieval (ECIR)}
}

For the gradient cache utility, consider citing our preprint,

@misc{gao2021scaling,
      title={Scaling Deep Contrastive Learning Batch Size with Almost Constant Peak Memory Usage}, 
      author={Luyu Gao and Yunyi Zhang},
      year={2021},
      eprint={2101.06983},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

License

Reranker is currently licensed under CC-BY-NC 4.0.

Comments

DDP support

Hi @luyug, thanks for open-sourcing this repo.

I checked the repo for DDP support but didn't find any code related to dist.init_process_group or a DDP model wrapper, though there is a distributed sampler. The readme says DDP training is supported; is the code for that part included, or is there something under the hood I need to learn?

Thanks

opened by EarthXP 3

the code stuck in prediction step.

Hi author, I tried to run your prediction step using the command:

 python -m torch.distributed.launch --nproc_per_node 4 python run_marco.py \
  --output_dir {score saving directory, not used for the moment} \
  --model_name_or_path {path to checkpoint} \
  --tokenizer_name bert-base-uncased \
  --do_predict \
  --max_len 512 \
  --fp16 \
  --per_device_eval_batch_size 64 \
  --dataloader_num_workers 8 \
  --pred_path {path to prediction json} \
  --pred_id_file  {path to prediction id tsv} \
  --rank_score_path {save path of the text file of scores}

but the program just gets stuck somewhere, as the picture shows:

I also checked the GPU status with the command nvidia-smi; the result shows that 3 cards are running for this program.

I have waited a very long time and it still does not move on to the next step. Is this normal, or how can I fix it?

thanks.

opened by h2222 2

some problem with the model performance

I followed the MS MARCO example, but cannot reproduce the results in your report.

The BERT checkpoint is bert-base-uncased; here is my result on dev: {'MRR @100': 0.41827504737896776, 'QueriesRanked': 5193}

Did I miss something?

opened by Wenjun-Peng 0

Problem when fine-tuning on msmarco passage

I tried to use the reranker module to fine-tune on the MS MARCO passage task, but the results are not good (for RoBERTa large and BM25 top100, MRR@10 is only 0.3195). Is there anything to watch out for when fine-tuning on MS MARCO passage?

opened by zyznull 0

Hyperparameters for MSMARCO Doc Training

Hi there. Which set of hyperparameters did you use for your MS MARCO doc training ensemble submission? I tried the sample hyperparameters listed in your documentation, but my results are not as good as yours.

opened by larryli1999 0

Datasets.load_dataset breaks with Python 3.9

Error: if Python 3.9 is installed, the setup command will install Pandas 1.3.0 because older versions of Pandas are not compatible with Python 3.9. This Pandas version doesn't accept the following call:

read_csv("file.csv", names=None, prefix=None)

breaking the load_dataset function when used with the csv script.

The function call below in build_train_from_ranking.py will output the following error message: "ValueError: Specified named and prefix; you can only specify one."

train_doc_collection = datasets.load_dataset(
    path='csv',
    data_files=collection_path,
    column_names=columns,
    delimiter='\t',
    ignore_verifications=True,
)['train']

That is because the last Pandas update doesn't accept None as parameter, only pandas.lib.no_default constant as per issue #42387.

Downgrading to Python 3.8 and Pandas 1.0.4 corrects the problem.

I believe Python 3.8 should be enforced.

opened by Valerieps 2

Problem with reading dataset

I tried to follow the training section of the readme. I get the following error:

Traceback (most recent call last):
  File "C:\Users\Christoph.Schneider\PycharmProjects\SentBertHelpDesk\try_reranker.py", line 22, in <module>
    train_dataset = GroupedTrainDataset(
  File "C:\Users\Christoph.Schneider\Anaconda3\envs\SentBertHelpDesk\lib\site-packages\reranker\data.py", line 31, in __init__
    self.nlp_dataset = datasets.load_dataset(
  File "C:\Users\Christoph.Schneider\Anaconda3\envs\SentBertHelpDesk\lib\site-packages\datasets\load.py", line 742, in load_dataset
    builder_instance.download_and_prepare(
  File "C:\Users\Christoph.Schneider\Anaconda3\envs\SentBertHelpDesk\lib\site-packages\datasets\builder.py", line 574, in download_and_prepare
    self._download_and_prepare(
  File "C:\Users\Christoph.Schneider\Anaconda3\envs\SentBertHelpDesk\lib\site-packages\datasets\builder.py", line 652, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "C:\Users\Christoph.Schneider\Anaconda3\envs\SentBertHelpDesk\lib\site-packages\datasets\builder.py", line 1041, in _prepare_split
    for key, table in utils.tqdm(generator, unit=" tables", leave=False, disable=not_verbose):
  File "C:\Users\Christoph.Schneider\Anaconda3\envs\SentBertHelpDesk\lib\site-packages\tqdm\std.py", line 1133, in __iter__
    for obj in iterable:
  File "C:\Users\Christoph.Schneider\Anaconda3\envs\SentBertHelpDesk\lib\site-packages\datasets\packaged_modules\json\json.py", line 96, in _generate_tables
    pa_table = pa_table.cast(self.config.schema)
  File "pyarrow\table.pxi", line 1409, in pyarrow.lib.Table.cast
ValueError: Target schema's field names are not matching the table's field names: ['qry', 'pos', 'neg'], ['neg', 'pos', 'qry']

train.zip

I've attached the training file that I use. It follows the standards described in the readme.

opened by HerrKrishna 4
