Fusion-in-Decoder and Distilling Knowledge from Reader to Retriever for Question Answering

Overview

This repository contains code for:

  • Fusion-in-Decoder models
  • Distilling Knowledge from Reader to Retriever

Dependencies

  • Python 3
  • PyTorch (currently tested on version 1.6.0)
  • Transformers (version 3.0.2, unlikely to work with a different version)
  • NumPy

Data

Download data

NaturalQuestions and TriviaQA data can be downloaded using get-data.sh. Both datasets are obtained from their original sources, and the Wikipedia dump is downloaded from the DPR repository. In addition to the questions and answers, this script retrieves the Wikipedia passages used to train the released pretrained models.
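
For example (assuming the script is run from the repository root and takes no arguments):

bash get-data.sh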

Data format

The expected data format is a list of entries, where each entry is a dictionary containing:

  • id: example id, optional
  • question: question text
  • target: answer used for model training; if not given, the target is randomly sampled from the 'answers' list
  • answers: list of answer strings used for evaluation, also used for training if target is not given
  • ctxs: a list of passages, where each item is a dictionary containing
      - title: article title
      - text: passage text

Entry example:

{
  'id': '0',
  'question': 'What element did Marie Curie name after her native land?',
  'target': 'Polonium',
  'answers': ['Polonium', 'Po (chemical element)', 'Po'],
  'ctxs': [
            {
                "title": "Marie Curie",
                "text": "them on visits to Poland. She named the first chemical element that she discovered in 1898 \"polonium\", after her native country. Marie Curie died in 1934, aged 66, at a sanatorium in Sancellemoz (Haute-Savoie), France, of aplastic anemia from exposure to radiation in the course of her scientific research and in the course of her radiological work at field hospitals during World War I. Maria Sk\u0142odowska was born in Warsaw, in Congress Poland in the Russian Empire, on 7 November 1867, the fifth and youngest child of well-known teachers Bronis\u0142awa, \"n\u00e9e\" Boguska, and W\u0142adys\u0142aw Sk\u0142odowski. The elder siblings of Maria"
            },
            {
                "title": "Marie Curie",
                "text": "was present in such minute quantities that they would eventually have to process tons of the ore. In July 1898, Curie and her husband published a joint paper announcing the existence of an element which they named \"polonium\", in honour of her native Poland, which would for another twenty years remain partitioned among three empires (Russian, Austrian, and Prussian). On 26 December 1898, the Curies announced the existence of a second element, which they named \"radium\", from the Latin word for \"ray\". In the course of their research, they also coined the word \"radioactivity\". To prove their discoveries beyond any"
            }
          ]
}
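
A quick way to sanity-check that a prepared file follows this format is to inspect its first entry. The sketch below assumes jq is installed and that the file is named train_data.json:

# List the keys of the first entry and count its retrieved passages.
jq '.[0] | keys' train_data.json
jq '.[0].ctxs | length' train_data.json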

Pretrained models

Pretrained models can be downloaded using get-model.sh. Currently available models are [nq_reader_base, nq_reader_large, nq_retriever, tqa_reader_base, tqa_reader_large, tqa_retriever].

bash get-model.sh -m model_name

Performance of the pretrained models:

Model size    NaturalQuestions (dev / test)    TriviaQA (dev / test)
base          49.2 / 50.1                      68.7 / 69.3
large         52.7 / 54.4                      72.5 / 72.5

I. Fusion-in-Decoder

Fusion-in-Decoder models can be trained using train_reader.py and evaluated with test_reader.py.

Train

train_reader.py provides the code to train a model. An example usage of the script is given below:

python train_reader.py \
        --train_data train_data.json \
        --eval_data eval_data.json \
        --model_size base \
        --per_gpu_batch_size 1 \
        --n_context 100 \
        --name my_experiment \
        --checkpoint_dir checkpoint \

Training these models with 100 passages is memory intensive. To alleviate this issue, we use checkpointing with the --use_checkpoint option. Tensors of variable sizes lead to memory overhead: encoder input tensors have a fixed size by default, but decoder input tensors do not. The tensor size on the decoder side can be fixed using --answer_maxlength. The large readers have been trained on 64 GPUs with the following hyperparameters:

python train_reader.py \
        --use_checkpoint \
        --lr 0.00005 \
        --optim adamw \
        --scheduler linear \
        --weight_decay 0.01 \
        --text_maxlength 250 \
        --per_gpu_batch_size 1 \
        --n_context 100 \
        --total_step 15000 \
        --warmup_step 1000 \

Test

You can evaluate your model or a pretrained model with test_reader.py. An example usage of the script is provided below.

python test_reader.py \
        --model_path checkpoint_dir/my_experiment/my_model_dir/checkpoint/best_dev \
        --eval_data eval_data.json \
        --per_gpu_batch_size 1 \
        --n_context 100 \
        --name my_test \
        --checkpoint_dir checkpoint \
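
According to one of the issues in the comments below, test_reader.py also accepts a --write_results flag that writes per-example predictions to disk (apparently one file per process under the experiment directory; this behavior is inferred from that issue, not documented here). A hedged variant of the command above:

python test_reader.py \
        --model_path checkpoint_dir/my_experiment/my_model_dir/checkpoint/best_dev \
        --eval_data eval_data.json \
        --per_gpu_batch_size 1 \
        --n_context 100 \
        --name my_test \
        --checkpoint_dir checkpoint \
        --write_results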

II. Distilling knowledge from reader to retriever for question answering

This repository also contains code to train a retriever model following the method proposed in our paper: Distilling Knowledge from Reader to Retriever for Question Answering. This code is heavily inspired by the DPR codebase and reuses parts of it. The proposed method consists of several steps:

1. Obtain reader cross-attention scores

Assuming that we have already retrieved relevant passages for each question, the first step consists of generating cross-attention scores. This can be done using the option --write_crossattention_scores in test.py. It saves the dataset with cross-attention scores in checkpoint_dir/name/dataset_wscores.json. To retrieve the initial set of passages for each question, different options can be considered, such as DPR or BM25.

python test.py \
        --model_path my_model_path \
        --eval_data data.json \
        --per_gpu_batch_size 4 \
        --n_context 100 \
        --name my_test \
        --checkpoint_dir checkpoint \
        --write_crossattention_scores \
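
With the arguments above, the scored dataset is written to checkpoint/my_test/dataset_wscores.json. A quick, hedged way to see what was added to each passage (assuming jq is installed):

# Print the first passage of the first example, including the added score field.
jq '.[0].ctxs[0]' checkpoint/my_test/dataset_wscores.json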

2. Retriever training

train_retriever.py provides the code to train a retriever using the scores previously generated.

python train_retriever.py \
        --lr 1e-4 \
        --optim adamw \
        --scheduler linear \
        --train_data train_data.json \
        --eval_data eval_data.json \
        --n_context 100 \
        --total_steps 20000 \
        --scheduler_steps 30000 \

3. Knowledge source indexing

Then the trained retriever is used to index a knowledge source, Wikipedia in our case. Here --model_path points to the trained retriever directory and --passages to a .tsv file of passages.

python3 generate_retriever_embedding.py \
        --model_path <model_dir> \
        --passages passages.tsv \
        --output_path wikipedia_embeddings \
        --shard_id 0 \
        --num_shards 1 \
        --per_gpu_batch_size 500 \
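
Embedding a full Wikipedia dump in a single process is slow; the --shard_id and --num_shards arguments let the work be split. Below is a hedged sketch running four shards in parallel, one GPU per shard (the shard count and the one-GPU-per-shard mapping are assumptions; <model_dir> is the same placeholder as above):

# Run four embedding shards in parallel, one per GPU (sketch).
for i in 0 1 2 3; do
    CUDA_VISIBLE_DEVICES=$i python3 generate_retriever_embedding.py \
            --model_path <model_dir> \
            --passages passages.tsv \
            --output_path wikipedia_embeddings \
            --shard_id $i \
            --num_shards 4 \
            --per_gpu_batch_size 500 &
done
wait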

4. Passage retrieval

After indexing, given an input query, passages can be efficiently retrieved:

python passage_retrieval.py \
    --model_path <model_dir> \
    --passages psgs_w100.tsv \
    --data_path data.json \
    --passages_embeddings "wikipedia_embeddings/wiki_*" \
    --output_path retrieved_data.json \
    --n-docs 100 \

We found that iterating these four steps can improve performance, depending on the initial set of documents.
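
For instance, the retrieved passages can be fed back into reader training to start the next round. A hedged sketch (retrieved_eval_data.json is an assumed file name, obtained by also running passage_retrieval.py on the evaluation questions):

python train_reader.py \
        --train_data retrieved_data.json \
        --eval_data retrieved_eval_data.json \
        --model_size base \
        --per_gpu_batch_size 1 \
        --n_context 100 \
        --name my_experiment_round2 \
        --checkpoint_dir checkpoint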

References

[1] G. Izacard, E. Grave. Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering.

@misc{izacard2020leveraging,
      title={Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering},
      author={Gautier Izacard and Edouard Grave},
      year={2020},
      eprint={2007.01282},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

[2] G. Izacard, E. Grave. Distilling Knowledge from Reader to Retriever for Question Answering.

@misc{izacard2020distilling,
      title={Distilling Knowledge from Reader to Retriever for Question Answering},
      author={Gautier Izacard and Edouard Grave},
      year={2020},
      eprint={2012.04584},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

License

See the LICENSE file for more details.

Comments
  • fix: Fix typing and import bugs

    Fix bugs in the inference pipeline:
      - src/preprocess.py: remove the never-used import parser, which is not referenced in requirements.txt and causes issues when running
      - passage_retrieval.py: fix reference bug src.index.serialize => index.serialize
      - src/index.py: index_file -> str(index_file); faiss expects a string but gets a Path object instead

    CLA Signed 
    opened by borislavmavrin 3
  • Fix string formatting error

    When the test_reader.py script is used with the --write_results flag, the following error is thrown:

    Traceback (most recent call last):
      File "test_reader.py", line 129, in <module>
        exactmatch, total = evaluate(model, eval_dataset, eval_dataloader, tokenizer, opt)
      File "test_reader.py", line 34, in evaluate
        fw = open(write_path / '%d.txt'%opt.global_rank, 'a')
    TypeError: unsupported operand type(s) for %: 'PosixPath' and 'int'
    

    This patch fixes the order of operations.

    CLA Signed 
    opened by justinborromeo 3
  • Tutorial about using pretrained models

    Hello,

    I want to use a pre-trained model and evaluate it according to the example usage of the provided script. My command is as follows:

    python test_reader.py \
    --model_path pretrained_models/nq_readers_base \
    --eval_data open_domain_data/TQA/dev.json \
    --per_gpu_batch_size 4 \
    --n_context 100 \
    --name my_test \
    --checkpoint_dir checkpoint \
    --write_crossattention_scores \

    But when I ran the command, it showed that there is an error in /src/data.py:

    Traceback (most recent call last):
      File "test_reader.py", line 108, in
        world_size=opt.world_size
      File "/root/FiD/src/data.py", line 131, in load_data
        if global_rank > -1 and not k%world_size==global_rank:
    ZeroDivisionError: integer division or modulo by zero

    Could you please tell me is there a problem with my code and provide some more detailed guidelines or examples?

    Thanks for any help!

    Sherry

    opened by infopg 2
  • Hyper-params for reproducing results of base reader in FiD

    I tried reproducing the results of the base reader of FiD, but I am getting a test set score of 47.8 (I should be getting around 50).

    I ran with the parameters below on 8 GPUs; correspondingly, to achieve the reported batch size of 64, I used 4 accumulation steps and increased total_steps and eval_freq proportionally.

    python -m torch.distributed.launch --nproc_per_node=8 --master-port 1234 train_reader.py --train_data  train.json --eval_data dev.json --per_gpu_batch_size 2 --accumulation_steps=4 --n_context 100 --checkpoint_dir $OUTPUT_DIR --name my_train --total_steps 40000 --text_maxlength 250 --eval_freq=2000 
    

    Can you tell me which hyperparams to use to reproduce the base reader results for FiD?

    opened by akhilkedia 1
  • About the hyperparameters of finetuning t5-base

    Hi @gizacard ,

    Thanks for your awesome project. I just want to know the hyperparameters for finetuning T5-base.

    You have only shared T5-large's hyperparameters in the tutorial, as follows; could you share T5-base's as well?

    python train_reader.py \
            --use_checkpoint \
            --lr 0.00005 \
            --optim adamw \
            --scheduler linear \
            --weight_decay 0.01 \
            --text_maxlength 250 \
            --per_gpu_batch_size 1 \
            --n_context 100 \
            --total_step 15000 \
            --warmup_step 1000 \
    

    Thanks, looking forward to your reply.

    opened by shunyuzh 1
  • A question about ``passages_index''

    Hi, authors. I'm now going to replicate your FiD project. I'm wondering about the data preprocessing strategies.

    I found that the ''passages_index'' files of the Natural Questions and TriviaQA datasets are just downloaded from the URL ''https://dl.fbaipublicfiles.com/FiD/data/[dataset-name].tar.gz''. However, I could not find details about how these passages_index files were generated. Were the passages just ranked in descending order of Lucene-BM25 scores (excluding passages that do not contain answers)? Or did you adopt other methods to generate the passages_index?

    Looking forward to your reply.

    opened by chuzhumin98 1
  • Running FiD with fp16 precision

    Hi, we wonder if you have implemented an fp16 version of this model? We tried to implement it ourselves, but it doesn't work: the loss decreases for hundreds of steps and then increases again. We tried both NVIDIA Apex AMP and PyTorch native AMP, but neither of them works. We wonder if you have implemented it yourself and whether it behaves as expected. Thank you very much. We appreciate your work.

    opened by huvunvidia 1
  • question about reported result on NQ dataset

    Hi there,

    I notice that the reported values on the NQ test set in the paper are 48.2 (base) and 51.4 (large), while in your repo the results are 50.1 (base) and 54.4 (large). What's the difference between them?

    Also, when I run test_reader.py to reproduce the result on the test set, using DPR's retriever result (file: retriever_results.nq.single.test.json) and the nq_reader_base checkpoint downloaded through get-model.sh, the EM I get is 45.82, worse than 48.2 or 50.1. I checked the passage ids retrieved in DPR's result against the ids you provided, and they do not seem to be quite the same. Could you tell me how you retrieved the passages?

    Thanks a lot

    opened by Rosarubu 1
  • Paths to checkpoints, data?

    I wanted to check the zero/few-shot performance of this approach and was wondering if you guys have paths to data and/or checkpoints somewhere on FAIR cluster?

    Thanks in advance!

    opened by sshleifer 0
  • CUDA memory suddenly runs out of space when only a quarter of memory is used

    I fine-tune FiD as a generative question answering task on a 13 GB dataset that is a combination of ELI5 and MS MARCO. Unfortunately, I run into CUDA out-of-memory problems. I am fine-tuning this model with 2 RTX 3090 24 GB GPUs on a single node. After the model runs for a number of steps, it is stopped by CUDA out of memory, but the number of steps is different in each case: sometimes CUDA memory runs out of space at step 69000 out of 776000, or 23000 out of 776000, and so on. While I track the CUDA memory via watch nvidia-smi, the memory of the 2 GPUs is only around 7 GB and 9 GB occupied, and then suddenly one of them stops and reports CUDA out of memory. I do not understand the reason.

    I also put torch.cuda.empty_cache() after every 500 steps but it still fails. This is my training script:

    export NGPU=2;
    python -m torch.distributed.launch \
            --nproc_per_node=$NGPU train.py \
            --train_data /home/jovyan/final_data/merge_ELI5_MS_MARCO.npz \
            --model_size base \
            --per_gpu_batch_size 1 \
            --n_context 4 \
            --name my_experiment \
            --checkpoint_dir checkpoint \
            --accumulation_steps 32 \
            --use_checkpoint \
            --total_steps 776004 \
            --optim adamw\
            --scheduler linear
    

    This is the error:

    RuntimeError: CUDA out of memory. Tried to allocate 502.00 MiB (GPU 1; 23.70 GiB total capacity; 20.55 GiB already allocated; 348.81 MiB free; 22.18 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
    WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1970826 closing signal SIGTERM
    WARNING:torch.distributed.elastic.multiprocessing.api:Unable to shutdown process 1970826 via 15, forcefully exitting via 9
    ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 1970827) of binary: /home/jovyan/.conda_env/fid/bin/python
    Traceback (most recent call last):
      File "/home/jovyan/.conda_env/fid/lib/python3.9/runpy.py", line 197, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "/home/jovyan/.conda_env/fid/lib/python3.9/runpy.py", line 87, in _run_code
        exec(code, run_globals)
      File "/home/jovyan/.conda_env/fid/lib/python3.9/site-packages/torch/distributed/launch.py", line 193, in <module>
        main()
      File "/home/jovyan/.conda_env/fid/lib/python3.9/site-packages/torch/distributed/launch.py", line 189, in main
        launch(args)
      File "/home/jovyan/.conda_env/fid/lib/python3.9/site-packages/torch/distributed/launch.py", line 174, in launch
        run(args)
      File "/home/jovyan/.conda_env/fid/lib/python3.9/site-packages/torch/distributed/run.py", line 752, in run
        elastic_launch(
      File "/home/jovyan/.conda_env/fid/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
        return launch_agent(self._config, self._entrypoint, list(args))
      File "/home/jovyan/.conda_env/fid/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
        raise ChildFailedError(
    torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
    ============================================================
    train.py FAILED
    ------------------------------------------------------------
    Failures:
      <NO_OTHER_FAILURES>
    ------------------------------------------------------------
    Root Cause (first observed failure):
    [0]:
      time      : 2023-01-07_08:47:07
      host      : cc6244288069
      rank      : 1 (local_rank: 1)
      exitcode  : 1 (pid: 1970827)
      error_file: <N/A>
      traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
    ============================================================
    

    Could you please help me figure out what is going on? Thank you in advance.

    opened by khang-nguyen2907 0
  • Question about changing LM model to GPT

    Hello, I want to change the model used for FiD to GPT (decoder-only). I've reviewed BB3, but I'm not sure which part to look at, so I'm asking. Could you please share a reference or link?

    opened by daje0601 0
  • Output_scores in generate() method

    I need output_scores for my use case. I tried adding

    output_scores=True,
    return_dict_in_generate=True
    

    in the generate() call in the src/model.py file, but it still doesn't return the output scores. Please let me know how to solve this issue.

    opened by nrjvarshney 0
  • Memory leak for dataloader

    The amount of RAM the program takes quickly increases until the memory overflows. I'm running this code with a multi-processing setup, and I don't know if the problem applies to single processing.

    After a good while of investigation, I noticed that the problem comes from the __getitem__ function of the customized Dataset. I changed it to the following:

    def __getitem__(self, index):
        return {
            'index' : index,
            'question' : self.question_prefix + " " + self.data[index]['question'],
            'target' : self.get_target(self.data[index]),
            'passages' : [self.f.format(c['title'], c['text']) for c in self.data[index]['ctxs'][:self.n_context]],
            'scores' : torch.tensor([float(c['score']) for c in self.data[index]['ctxs'][:self.n_context]]),
            'graph' : self.data[index]['graph'],
            'node_indices' : self.data[index]['node_indices']
        }

    and the memory issue is gone for me. It seems that as long as I define any local variable inside this function, the RAM eventually blows up.

    This change might disable certain functionalities of the original code and make certain corner cases crash the training loop. Any other suggestions?

    Thanks in advance!

    opened by jumxglhf 1
  • added the option pad_to_max_length

    Added the option pad_to_max_length, which enables either using the specified max length or dynamically computing the max length present in the batch.

    The default is to use the dynamic computation, since it is much cheaper in terms of memory consumption.

    opened by mosheber 1