Fusion-in-Decoder Distilling Knowledge from Reader to Retriever for Question Answering

Meta Research

Last update: Dec 19, 2022

Overview

This repository contains code for:

Fusion-in-Decoder models
Distilling Knowledge from Reader to Retriever

Dependencies

Python 3
PyTorch (currently tested on version 1.6.0)
Transformers (version 3.0.2, unlikely to work with a different version)
NumPy

Data

Download data

NaturalQuestions and TriviaQA data can be downloaded using get-data.sh. Both datasets are obtained from the original source, and the Wikipedia dump is downloaded from the DPR repository. In addition to the questions and answers, this script retrieves the Wikipedia passages used to train the released pretrained models.

Data format

The expected data format is a list of entry examples, where each entry example is a dictionary containing

id: example id, optional
question: question text
target: answer used for model training, if not given, the target is randomly sampled from the 'answers' list
answers: list of answer text for evaluation, also used for training if target is not given
ctxs: a list of passages, where each item is a dictionary containing
  title: article title
  text: passage text

Entry example:

{
  'id': '0',
  'question': 'What element did Marie Curie name after her native land?',
  'target': 'Polonium',
  'answers': ['Polonium', 'Po (chemical element)', 'Po'],
  'ctxs': [
            {
                "title": "Marie Curie",
                "text": "them on visits to Poland. She named the first chemical element that she discovered in 1898 \"polonium\", after her native country. Marie Curie died in 1934, aged 66, at a sanatorium in Sancellemoz (Haute-Savoie), France, of aplastic anemia from exposure to radiation in the course of her scientific research and in the course of her radiological work at field hospitals during World War I. Maria Sk\u0142odowska was born in Warsaw, in Congress Poland in the Russian Empire, on 7 November 1867, the fifth and youngest child of well-known teachers Bronis\u0142awa, \"n\u00e9e\" Boguska, and W\u0142adys\u0142aw Sk\u0142odowski. The elder siblings of Maria"
            },
            {
                "title": "Marie Curie",
                "text": "was present in such minute quantities that they would eventually have to process tons of the ore. In July 1898, Curie and her husband published a joint paper announcing the existence of an element which they named \"polonium\", in honour of her native Poland, which would for another twenty years remain partitioned among three empires (Russian, Austrian, and Prussian). On 26 December 1898, the Curies announced the existence of a second element, which they named \"radium\", from the Latin word for \"ray\". In the course of their research, they also coined the word \"radioactivity\". To prove their discoveries beyond any"
            }
          ]
}
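The field rules above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's loader: the function name prepare_examples and the seed parameter are hypothetical, and the sketch assumes the entries have already been parsed into a list of dictionaries.

```python
import random


def prepare_examples(examples, seed=0):
    """Check each entry and fill in the optional 'id' and 'target' fields.

    Hypothetical helper: it mirrors the field descriptions in this README,
    not the actual code in src/data.py.
    """
    rng = random.Random(seed)
    prepared = []
    for k, example in enumerate(examples):
        for key in ("question", "ctxs"):
            if key not in example:
                raise ValueError(f"entry {k} is missing the '{key}' field")
        entry = dict(example)  # shallow copy, the input list is left untouched
        # 'id' is optional: fall back to the position in the list.
        entry.setdefault("id", str(k))
        # 'target' is optional: sample it from the 'answers' list.
        if "target" not in entry:
            entry["target"] = rng.choice(entry["answers"])
        # Every passage needs a title and a text.
        for ctx in entry["ctxs"]:
            if "title" not in ctx or "text" not in ctx:
                raise ValueError(f"entry {k} has a passage without title or text")
        prepared.append(entry)
    return prepared
```

Calling prepare_examples on a list that omits 'id' and 'target' returns copies in which both fields are set, with 'target' drawn from 'answers'.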

Pretrained models

Pretrained models can be downloaded using get-model.sh. Currently available models are [nq_reader_base, nq_reader_large, nq_retriever, tqa_reader_base, tqa_reader_large, tqa_retriever].

bash get-model.sh -m model_name

Performance of the pretrained models:

Model size	NaturalQuestions dev	NaturalQuestions test	TriviaQA dev	TriviaQA test
base	49.2	50.1	68.7	69.3
large	52.7	54.4	72.5	72.5

I. Fusion-in-Decoder

Fusion-in-Decoder models can be trained using train_reader.py and evaluated with test_reader.py.

Train

train_reader.py provides the code to train a model. An example usage of the script is given below:

python train_reader.py \
        --train_data train_data.json \
        --eval_data eval_data.json \
        --model_size base \
        --per_gpu_batch_size 1 \
        --n_context 100 \
        --name my_experiment \
        --checkpoint_dir checkpoint

Training these models with 100 passages is memory intensive. To alleviate this issue, we use gradient checkpointing, enabled with the --use_checkpoint option. Tensors of variable sizes also lead to memory overhead. Encoder input tensors have a fixed size by default, but decoder input tensors do not. The tensor size on the decoder side can be fixed using --answer_maxlength. The large readers have been trained on 64 GPUs with the following hyperparameters:

python train_reader.py \
        --use_checkpoint \
        --lr 0.00005 \
        --optim adamw \
        --scheduler linear \
        --weight_decay 0.01 \
        --text_maxlength 250 \
        --per_gpu_batch_size 1 \
        --n_context 100 \
        --total_step 15000 \
        --warmup_step 1000
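To make the scale of this configuration concrete, the arithmetic can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, the accumulation_steps argument defaults to 1 because the command above does not set it, and the token count assumes every passage is padded or truncated to --text_maxlength.

```python
def effective_batch_size(n_gpus, per_gpu_batch_size, accumulation_steps=1):
    """Number of questions contributing to one optimizer update."""
    return n_gpus * per_gpu_batch_size * accumulation_steps


def encoder_tokens_per_question(n_context, text_maxlength):
    """Encoder input size for one question, assuming fixed-size passages."""
    return n_context * text_maxlength


# 64 GPUs with --per_gpu_batch_size 1, as in the command above.
print(effective_batch_size(n_gpus=64, per_gpu_batch_size=1))
# 100 passages of at most 250 tokens each.
print(encoder_tokens_per_question(n_context=100, text_maxlength=250))
```

With these settings each update sees 64 questions, and each question feeds 25,000 encoder tokens to the model, which is why checkpointing helps.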

Test

You can evaluate your model or a pretrained model with test_reader.py. An example usage of the script is provided below.

python test_reader.py \
        --model_path checkpoint_dir/my_experiment/my_model_dir/checkpoint/best_dev \
        --eval_data eval_data.json \
        --per_gpu_batch_size 1 \
        --n_context 100 \
        --name my_test \
        --checkpoint_dir checkpoint
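For readers unfamiliar with how generated answers are usually scored in open-domain question answering, here is a minimal sketch of an exact-match check. It assumes the common SQuAD-style normalization (lowercasing, removing punctuation and English articles); this README does not state which normalization test_reader.py applies, so treat the function names and the normalization steps as assumptions.

```python
import re
import string


def normalize_answer(text):
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    punctuation = set(string.punctuation)
    text = "".join(ch for ch in text if ch not in punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, answers):
    """True if the prediction matches any reference answer after normalization."""
    normalized = normalize_answer(prediction)
    return any(normalized == normalize_answer(answer) for answer in answers)


answers = ['Polonium', 'Po (chemical element)', 'Po']
print(exact_match("The polonium.", answers))  # True
print(exact_match("Radium", answers))         # False
```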

II. Distilling knowledge from reader to retriever for question answering

This repository also contains code to train a retriever model following the method proposed in our paper: Distilling knowledge from reader to retriever for question answering. This code is heavily inspired by the DPR codebase and reuses parts of it. The proposed method consists of several steps:

1. Obtain reader cross-attention scores

Assuming that relevant passages have already been retrieved for each question, the first step is to generate cross-attention scores. This can be done with the --write_crossattention_scores option of test_reader.py, which saves the dataset with cross-attention scores to checkpoint_dir/name/dataset_wscores.json. The initial set of passages for each question can be retrieved with different methods, such as DPR or BM25.

python test_reader.py \
        --model_path my_model_path \
        --eval_data data.json \
        --per_gpu_batch_size 4 \
        --n_context 100 \
        --name my_test \
        --checkpoint_dir checkpoint \
        --write_crossattention_scores
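Once the scores are written, they can be inspected by sorting each question's passages. The sketch below assumes that every item of ctxs in dataset_wscores.json carries a 'score' field holding the aggregated cross-attention score; that field name is an assumption, not something this README specifies, and rerank_passages is a hypothetical helper.

```python
def rerank_passages(example, top_k=None):
    """Return the passages of one entry sorted by decreasing score.

    Assumes each passage dictionary has a 'score' field (hypothetical).
    """
    ranked = sorted(example["ctxs"], key=lambda ctx: float(ctx["score"]), reverse=True)
    return ranked[:top_k]


example = {
    "question": "What element did Marie Curie name after her native land?",
    "ctxs": [
        {"title": "Marie Curie", "text": "passage A", "score": 0.12},
        {"title": "Marie Curie", "text": "passage B", "score": 0.57},
        {"title": "Marie Curie", "text": "passage C", "score": 0.31},
    ],
}
print([ctx["text"] for ctx in rerank_passages(example, top_k=2)])
```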

2. Retriever training

train_retriever.py provides the code to train a retriever using the scores previously generated.

python train_retriever.py \
        --lr 1e-4 \
        --optim adamw \
        --scheduler linear \
        --train_data train_data.json \
        --eval_data eval_data.json \
        --n_context 100 \
        --total_steps 20000 \
        --scheduler_steps 30000
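The idea of training a retriever from reader scores can be sketched as a distillation loss. The version below uses a KL divergence between the softmax of the reader scores and the softmax of the retriever scores, which is our reading of the objective in [2]; this README does not spell out the loss, so train_retriever.py may differ in its details. The code is plain Python for readability, and all function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted) for two discrete distributions."""
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(target, predicted)
    )


def distillation_loss(reader_scores, retriever_scores):
    """Penalize the retriever when its passage ranking departs from the reader's."""
    return kl_divergence(softmax(reader_scores), softmax(retriever_scores))


# The loss is zero when both models agree and positive otherwise.
print(distillation_loss([3.0, 1.0, 0.0], [3.0, 1.0, 0.0]))
print(distillation_loss([3.0, 1.0, 0.0], [0.0, 1.0, 3.0]))
```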

3. Knowledge source indexing

Then the trained retriever is used to index a knowledge source, Wikipedia in our case.

# --model_path expects a directory, --passages expects a .tsv file
python3 generate_retriever_embedding.py \
        --model_path <model_dir> \
        --passages passages.tsv \
        --output_path wikipedia_embeddings \
        --shard_id 0 \
        --num_shards 1 \
        --per_gpu_batch_size 500
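The --shard_id and --num_shards options suggest that the passage file can be split across several indexing jobs. A minimal sketch of one way to do this, assuming contiguous shards with the remainder assigned to the last one (an assumption about the behaviour, with a hypothetical function name), is:

```python
def shard_bounds(n_passages, shard_id, num_shards):
    """Return the [start, end) range of passages handled by one shard."""
    shard_size = n_passages // num_shards
    start = shard_id * shard_size
    # The last shard also takes the passages left over by integer division.
    end = n_passages if shard_id == num_shards - 1 else start + shard_size
    return start, end


# Ten passages split across three jobs.
for shard_id in range(3):
    print(shard_id, shard_bounds(10, shard_id, 3))
```

With --num_shards 1, as in the command above, the single shard covers every passage.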

4. Passage retrieval

After indexing, given an input query, passages can be efficiently retrieved:

python passage_retrieval.py \
    --model_path <model_dir> \
    --passages psgs_w100.tsv \
    --data_path data.json \
    --passages_embeddings "wikipedia_embeddings/wiki_*" \
    --output_path retrieved_data.json \
    --n-docs 100

We found that iterating the four steps above can improve performance, depending on the initial set of documents.
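Conceptually, the retrieval step scores every passage embedding against the query embedding and keeps the best ones. The sketch below does this with a brute-force inner product in plain Python. It is only an illustration: the function names and the dictionary layout are hypothetical, and a real index over Wikipedia would rely on a dedicated similarity-search library rather than a Python loop.

```python
import heapq


def inner_product(u, v):
    """Dot product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def retrieve(query_embedding, passage_embeddings, n_docs=100):
    """Return the n_docs (passage_id, score) pairs with the highest scores."""
    scored = (
        (inner_product(query_embedding, embedding), passage_id)
        for passage_id, embedding in passage_embeddings.items()
    )
    return [(passage_id, score) for score, passage_id in heapq.nlargest(n_docs, scored)]


passage_embeddings = {"p1": [1.0, 0.0], "p2": [0.0, 1.0], "p3": [0.5, 0.5]}
print(retrieve([1.0, 0.2], passage_embeddings, n_docs=2))
```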

References

[1] G. Izacard, E. Grave. Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering

@misc{izacard2020leveraging,
      title={Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering},
      author={Gautier Izacard and Edouard Grave},
      year={2020},
      eprint={2007.01282},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

[2] G. Izacard, E. Grave. Distilling Knowledge from Reader to Retriever for Question Answering

@misc{izacard2020distilling,
      title={Distilling Knowledge from Reader to Retriever for Question Answering},
      author={Gautier Izacard and Edouard Grave},
      year={2020},
      eprint={2012.04584},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

License

See the LICENSE file for more details.

Comments

fix: Fix typing and import bugs

Fix bugs in the inference pipeline:
src/preprocess.py: remove the never-used import parser, which is not referenced in requirements.txt and causes issues when running
passage_retrieval.py: fix reference bug src.index.serialize => index.serialize
src/index.py: index_file -> str(index_file); faiss expects a string but gets a Path object instead
CLA Signed

opened by borislavmavrin 3

Fix string formatting error

When the test_reader.py script is used with the --write_results flag, the following error is thrown:

Traceback (most recent call last):
  File "test_reader.py", line 129, in <module>
    exactmatch, total = evaluate(model, eval_dataset, eval_dataloader, tokenizer, opt)
  File "test_reader.py", line 34, in evaluate
    fw = open(write_path / '%d.txt'%opt.global_rank, 'a')
TypeError: unsupported operand type(s) for %: 'PosixPath' and 'int'

This patch fixes the order of operations.
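The error comes from Python operator precedence: / and % have the same priority and are evaluated left to right, so the path is built first and % is then applied to a Path object. A minimal reproduction and one possible fix are sketched below; the directory names are hypothetical and the actual patch may be written differently.

```python
from pathlib import Path

write_path = Path("checkpoint") / "my_test" / "test_results"  # hypothetical directory
global_rank = 0

# Buggy form: parsed as (write_path / '%d.txt') % global_rank.
try:
    write_path / '%d.txt' % global_rank
except TypeError as error:
    message = str(error)
    print(message)

# Fixed form: format the file name first, then join it to the path.
fixed = write_path / ('%d.txt' % global_rank)
print(fixed.name)  # 0.txt
```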

CLA Signed

opened by justinborromeo 3

Tutorial about using pretrained models

Hello,

I want to use the pre-trained model and evaluate it following the example usage of the provided script. My code is as follows:

python test_reader.py \
        --model_path pretrained_models/nq_readers_base \
        --eval_data open_domain_data/TQA/dev.json \
        --per_gpu_batch_size 4 \
        --n_context 100 \
        --name my_test \
        --checkpoint_dir checkpoint \
        --write_crossattention_scores

But when I ran the code, it showed that there is an error in /src/data.py:

Traceback (most recent call last):
  File "test_reader.py", line 108, in <module>
    world_size=opt.world_size
  File "/root/FiD/src/data.py", line 131, in load_data
    if global_rank > -1 and not k%world_size==global_rank:
ZeroDivisionError: integer division or modulo by zero

Could you please tell me whether there is a problem with my code, and provide some more detailed guidelines or examples?

Thanks for any help!

Sherry

opened by infopg 2

Hyper-params for reproducing results of base reader in FiD

I tried reproducing the results of the base reader of FiD, but I am getting a test set score of 47.8 (I should be getting around 50).

I ran the command below on 8 GPUs. To match the reported batch size of 64, I used 4 accumulation steps and increased total_steps and eval_freq proportionally.

python -m torch.distributed.launch --nproc_per_node=8 --master-port 1234 train_reader.py --train_data train.json --eval_data dev.json --per_gpu_batch_size 2 --accumulation_steps=4 --n_context 100 --checkpoint_dir $OUTPUT_DIR --name my_train --total_steps 40000 --text_maxlength 250 --eval_freq=2000

Can you tell me which hyperparams to use to reproduce the base reader results for FiD?
opened by akhilkedia 1

About the hyperparameters of finetuning t5-base

Hi @gizacard,

Thanks for your awesome project. I would like to know the hyperparameters used for finetuning T5-base.

The tutorial only shares the hyperparameters for T5-large, shown below. Could you share the ones for T5-base as well?

python train_reader.py \
        --use_checkpoint \
        --lr 0.00005 \
        --optim adamw \
        --scheduler linear \
        --weight_decay 0.01 \
        --text_maxlength 250 \
        --per_gpu_batch_size 1 \
        --n_context 100 \
        --total_step 15000 \
        --warmup_step 1000

Thanks, looking forward to your reply.

opened by shunyuzh 1

A question about ``passages_index''

Hi, authors. I'm now going to replicate your FiD project. I'm wondering about the data preprocessing strategies.

I found that the ''passages_index'' files of the Natural Questions and TriviaQA datasets are simply downloaded from the URL ''https://dl.fbaipublicfiles.com/FiD/data/[dataset-name].tar.gz''. However, I could not find details about how these passages_index files were generated. Are the passages just ranked in descending order of their Lucene-BM25 scores (excluding the passages that do not contain answers)? Or did you adopt other methods to generate the passages_index?

Looking forward to your reply.

opened by chuzhumin98 1

Running FiD with fp16 precision

Hi, we wonder whether you have implemented an fp16 version of this model. We tried to implement it ourselves, but it does not work: the loss decreases for a few hundred steps and then increases again. We tried both NVIDIA Apex AMP and PyTorch native AMP, but neither works. We wonder whether you have implemented it yourselves and whether it behaves as expected. Thank you very much. We appreciate your work.

opened by huvunvidia 1

Question about reported result on NQ dataset

Hi there,

I notice that the NQ test results reported in the paper are 48.2 (base) and 51.4 (large), while in your repo the results are 50.1 (base) and 54.4 (large). What explains the difference?

Also, when I run test_reader.py to reproduce the result on the test set, using DPR's retriever results (file: retriever_results.nq.single.test.json) and the nq_reader_base checkpoint downloaded through get-model.sh, the EM I get is 45.82, worse than both 48.2 and 50.1. I checked the passage ids retrieved in DPR's results against the idx you provided, and they are not quite the same. Could you tell me how you retrieved the passages?

Thanks a lot

opened by Rosarubu 1

Paths to checkpoints, data?

I wanted to check the zero/few-shot performance of this approach and was wondering if you guys have paths to data and/or checkpoints somewhere on FAIR cluster?

Thanks in advance!

opened by sshleifer 0

CUDA suddenly runs out of memory when only a quarter of it is in use

I am fine-tuning FiD for generative question answering on a 13 GB dataset that combines ELI5 and MS MARCO, and I am running into a CUDA out-of-memory problem. I am fine-tuning the model on 2 RTX 3090 GPUs (24 GB each) on a single node. After a number of steps, training stops with a CUDA out-of-memory error, but the step differs from run to run: sometimes it fails at step 69000 out of 776000, sometimes at step 23000, and so on. When I track GPU memory with watch nvidia-smi, the two GPUs show only around 7 GB and 9 GB in use, and then one of them suddenly stops and reports CUDA out of memory. I do not understand the reason.

I also call torch.cuda.empty_cache() every 500 steps, but it still fails. This is my training script:

export NGPU=2;
python -m torch.distributed.launch \
        --nproc_per_node=$NGPU train.py \
        --train_data /home/jovyan/final_data/merge_ELI5_MS_MARCO.npz \
        --model_size base \
        --per_gpu_batch_size 1 \
        --n_context 4 \
        --name my_experiment \
        --checkpoint_dir checkpoint \
        --accumulation_steps 32 \
        --use_checkpoint \
        --total_steps 776004 \
        --optim adamw\
        --scheduler linear

This is the error:

RuntimeError: CUDA out of memory. Tried to allocate 502.00 MiB (GPU 1; 23.70 GiB total capacity; 20.55 GiB already allocated; 348.81 MiB free; 22.18 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1970826 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Unable to shutdown process 1970826 via 15, forcefully exitting via 9
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 1970827) of binary: /home/jovyan/.conda_env/fid/bin/python
Traceback (most recent call last):
  File "/home/jovyan/.conda_env/fid/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/jovyan/.conda_env/fid/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/jovyan/.conda_env/fid/lib/python3.9/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/jovyan/.conda_env/fid/lib/python3.9/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/jovyan/.conda_env/fid/lib/python3.9/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/jovyan/.conda_env/fid/lib/python3.9/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/jovyan/.conda_env/fid/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/jovyan/.conda_env/fid/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-01-07_08:47:07
  host      : cc6244288069
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 1970827)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Could you please help me figure out what is going on? Thank you in advance.

opened by khang-nguyen2907 0

Question about changing LM model to GPT

Hello, I want to change the model used for FiD to GPT (decoder-only). I've reviewed BB3, but I'm not sure which part to look at, so I'm asking here. Could you please share a reference or link?

opened by daje0601 0

Output_scores in generate() method

I need output_scores for my use case. I tried adding

output_scores=True, return_dict_in_generate=True

to the generate() function in the src/model.py file, but it still doesn't return the output scores. Please let me know how to solve this issue.

opened by nrjvarshney 0

Memory leak for dataloader

The amount of RAM the program uses grows quickly until memory is exhausted. I'm running this code in a multi-processing setup, and I don't know whether the problem also applies to single-process runs.

After a good while of investigation, I noticed that the problem comes from the __getitem__ function of the customized Dataset. I changed it to the following:

def __getitem__(self, index):
    return {
        'index': index,
        'question': self.question_prefix + " " + self.data[index]['question'],
        'target': self.get_target(self.data[index]),
        'passages': [self.f.format(c['title'], c['text']) for c in self.data[index]['ctxs'][:self.n_context]],
        'scores': torch.tensor([float(c['score']) for c in self.data[index]['ctxs'][:self.n_context]]),
        'graph': self.data[index]['graph'],
        'node_indices': self.data[index]['node_indices'],
    }

and the memory issue is gone for me. It seems like as long as I define any local variable inside this function, the RAM will get blown eventually.

This change might disable certain functionality of the original code and could cause certain corner cases to crash the training loop. Any other suggestions?

Thanks in advance!

opened by jumxglhf 1

added the option pad_to_max_length

Added the option pad_to_max_length, which makes it possible to either use the specified max length or compute the max length present in the batch dynamically.

The default is to use the dynamic computation, since it is much cheaper in terms of memory consumption.
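The difference between the two modes can be illustrated on lists of token ids. This is a sketch of the general idea only: the function pad_batch and its pad_id argument are hypothetical and do not reproduce the code of the pull request.

```python
def pad_batch(token_ids, max_length, pad_to_max_length=False, pad_id=0):
    """Pad (and truncate) a batch of token id lists to a common length.

    With pad_to_max_length=False the common length is the longest sequence
    in the batch, capped at max_length; otherwise it is always max_length.
    """
    if pad_to_max_length:
        target = max_length
    else:
        target = min(max_length, max(len(ids) for ids in token_ids))
    padded = []
    for ids in token_ids:
        kept = ids[:target]
        padded.append(kept + [pad_id] * (target - len(kept)))
    return padded


batch = [[11, 12, 13], [21]]
print(pad_batch(batch, max_length=5))                          # length 3
print(pad_batch(batch, max_length=5, pad_to_max_length=True))  # length 5
```

Dynamic padding yields smaller tensors whenever the batch contains only short sequences, at the cost of tensor shapes that vary from batch to batch.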

opened by mosheber 1
