PyTorch Reimplementation of REALM and ORQA
This is a PyTorch reimplementation of REALM (paper, codebase) and ORQA (paper, codebase).
Not all features are implemented yet; the predictor and the finetuning script are currently available.
The terms "retriever" and "searcher" are largely interchangeable in the code. The only difference is that the retriever is used for REALM pretraining, while the searcher is used for ORQA finetuning.
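To make the shared role of the retriever and searcher concrete, here is a minimal sketch of what both conceptually do: score candidate evidence blocks against a question embedding by inner product and keep the top k. The function name `top_k_inner_product` and the plain-list embeddings are hypothetical and are not taken from this repository. The real components work on learned embeddings, and the `scann` dependency suggests an approximate index is used rather than the brute-force scan shown here.

```python
from typing import List, Sequence, Tuple


def top_k_inner_product(
    query: Sequence[float],
    blocks: Sequence[Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Return (block_index, score) pairs for the k highest inner products.

    Hypothetical helper for illustration only; it scans every block,
    which is exact but much slower than an approximate index.
    """
    scored = [
        (index, sum(q * b for q, b in zip(query, block)))
        for index, block in enumerate(blocks)
    ]
    # Highest score first; Python's sort is stable, so ties keep index order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

For example, `top_k_inner_product([1.0, 0.0], [[0.5, 0.5], [2.0, 0.0], [-1.0, 3.0]], k=2)` ranks block 1 first and block 0 second.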
Prerequisite
cd transformers && pip install -U -e ".[dev]"
pip install -U scann apache_beam
Data
To download pretrained checkpoints and preprocessed data, please follow the instructions below:
cd data
pip install -U -r requirements.txt
sh download.sh
Finetune (Experimental)
The default finetuning dataset is Natural Questions (NQ). To load your own dataset, change the loading function in data.py.
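As a starting point for a custom dataset, a loader might look like the sketch below. It assumes a JSON Lines file in which each line has a "question" string and an "answers" list (or a single answer string). The function name `load_custom_qa`, the file format, and the returned record shape are all assumptions for illustration; check the existing loading function in data.py for the structure the finetuning script actually expects.

```python
import json
from pathlib import Path
from typing import Dict, Iterator, Union


def load_custom_qa(path: Union[str, Path]) -> Iterator[Dict[str, object]]:
    """Yield {"question": str, "answers": list of str} records from a JSON Lines file.

    Hypothetical loader: the field names and output shape are assumptions,
    not the format required by data.py.
    """
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            record = json.loads(line)
            question = record["question"].strip()
            answers = record["answers"]
            if isinstance(answers, str):
                answers = [answers]  # accept a single answer given as a string
            if not question or not answers:
                raise ValueError(f"line {line_number}: empty question or answers")
            yield {"question": question, "answers": [str(a) for a in answers]}
```

Calling `list(load_custom_qa("my_data.jsonl"))` (the file name is a placeholder) returns the records in file order.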
Training:
python run_finetune.py --is_train \
--model_dir "./" \
--num_epochs 2 \
--device cuda
Evaluation:
python run_finetune.py \
--retriever_pretrained_name "retriever" \
--checkpoint_pretrained_name "reader" \
--model_dir "./" \
--device cuda
Predict
The retriever and reader both default to the orqa_nq_model_from_realm checkpoint. To use different checkpoints, specify --retriever_path and --checkpoint_path.
python predictor.py --question "Who is the pioneer in modern computer science?"
Output: alan mathison turing
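To call the predictor from another Python script, one option is to wrap the command line shown above. The sketch below only uses the flags documented in this README (--question, --retriever_path, --checkpoint_path). The helper names `build_predict_command` and `run_predictor` are hypothetical, and the exact format of what predictor.py prints is not assumed, so the raw standard output is returned unchanged apart from trimming whitespace.

```python
import subprocess
import sys
from typing import List, Optional


def build_predict_command(
    question: str,
    retriever_path: Optional[str] = None,
    checkpoint_path: Optional[str] = None,
) -> List[str]:
    """Build the argument list for predictor.py (hypothetical helper)."""
    command = [sys.executable, "predictor.py", "--question", question]
    # Only pass the optional flags when a non-default checkpoint is requested.
    if retriever_path is not None:
        command += ["--retriever_path", retriever_path]
    if checkpoint_path is not None:
        command += ["--checkpoint_path", checkpoint_path]
    return command


def run_predictor(
    question: str,
    retriever_path: Optional[str] = None,
    checkpoint_path: Optional[str] = None,
) -> str:
    """Run predictor.py from the repository root and return its stdout."""
    result = subprocess.run(
        build_predict_command(question, retriever_path, checkpoint_path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the script exits with an error
    )
    return result.stdout.strip()
```

Passing the arguments as a list avoids shell quoting problems with questions that contain quotes or other special characters.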
License
Apache License 2.0