Ditch the Gold Standard: Re-evaluating Conversational Question Answering

This is the repository for our paper Ditch the Gold Standard: Re-evaluating Conversational Question Answering.

Overview

In this work, we conduct the first large-scale human evaluation of state-of-the-art conversational QA systems. In our evaluation, human annotators chat with conversational QA models about passages from the QuAC development set and then judge the correctness of the model answers. We release the human-annotated dataset in the section below.

We also identify a critical issue with the current automatic evaluation protocol, which pre-collects human-human conversations and uses ground-truth answers as the conversational history (the figure below contrasts the different evaluation protocols). Comparing the two, we find that automatic evaluation does not always agree with human evaluation. We propose a new evaluation protocol based on predicted history and question rewriting, and our experiments show that it reflects real-world performance better than the original automatic evaluation. Code for the new protocol is provided below.

Different evaluation protocols

Human Evaluation Dataset

You can download the human evaluation dataset from data/human_annotation_data.json. The JSON file contains a single field, data, which is a list of conversations. Each conversation contains the following fields:

  • model_name: The model evaluated. One of bert4quac, graphflow, ham, excord.
  • context: The passage used in this conversation.
  • dialog_id: The ID from the original QuAC dataset.
  • qas: The conversation, which contains a list of QA pairs. Each QA pair has the following fields:
    • turn_id: The turn number.
    • question: The question from the human annotator.
    • answer: The answer from the model.
    • valid: Whether the question is valid (annotated by our human annotator).
    • answerable: Whether the question is answerable (annotated by our human annotator).
    • correct: Whether the model's answer is correct (annotated by our human annotator).
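
For reference, here is a minimal Python sketch for loading the dataset and computing per-model human accuracy. The field names match the documentation above; the aggregation logic is our own illustration and assumes valid and correct are stored as booleans (or 0/1):

```python
import json
from collections import Counter

# Load the human evaluation dataset (field names as documented above).
with open("data/human_annotation_data.json") as f:
    conversations = json.load(f)["data"]

correct, total = Counter(), Counter()
for conv in conversations:
    for qa in conv["qas"]:
        if not qa["valid"]:  # skip questions annotators flagged as invalid
            continue
        total[conv["model_name"]] += 1
        correct[conv["model_name"]] += int(qa["correct"])

for model in sorted(total):
    print(f"{model}: {correct[model] / total[model]:.1%} of valid turns judged correct")
```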

Automatic model evaluation interface

We provide a convenient interface for testing model performance under the evaluation protocols compared in our paper, including Auto-Pred, Auto-Replace, and our proposed protocol, Auto-Rewrite, which better reflects models' performance in human-model conversations. Please refer to our paper for more details. The following figure shows how Auto-Rewrite works.

Auto-rewrite
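
To give a flavor of the rewriting step, here is a hedged sketch of coreference-based question rewriting with AllenNLP (the allennlp / allennlp_models dependencies listed in Step 2 below). This is our own illustration of the idea, not the repository's exact pipeline, and it assumes the public SpanBERT coref model and the predictor's coref_resolved helper:

```python
from allennlp.predictors.predictor import Predictor
import allennlp_models.coref  # registers the coreference predictor

# Illustrative only; the actual Auto-Rewrite logic lives in run_quac_eval.py.
predictor = Predictor.from_path(
    "https://storage.googleapis.com/allennlp-public-models/coref-spanbert-large-2021.03.10.tar.gz"
)

history = ["Where did Messi start his career?", "He joined Barcelona as a youth player."]
question = "When did he leave the club?"

# Resolve pronouns in the current question against the dialog history,
# then keep the final sentence as the self-contained rewritten question.
resolved = predictor.coref_resolved(" ".join(history + [question]))
rewritten = resolved.split(". ")[-1]
print(rewritten)  # e.g., "When did Messi leave Barcelona?"
```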

To use our evaluation interface with your own model, follow these steps:

  • Step 1: Download the QuAC dataset.

  • Step 2: Install allennlp, allennlp_models, ncr.replace_corefs through pip if you would like to use Auto-Rewrite.

  • Step 3: Download the CANARD dataset and set --canard_path if you would like to use Auto-Replace.

  • Step 4: Write a model interface following the template interface.py. Explanations of each function are provided as in-line comments, and a hypothetical skeleton is sketched after this list. Make sure to import all your model dependencies at the top.

  • Step 5: Add the model to the evaluation script run_quac_eval.py. Changes that need to be made are marked with #TODO.

  • Step 6: Run the evaluation script. See run.sh for reference. Explanations of all arguments are provided in run_quac_eval.py. Make sure to turn on only one of --pred, --rewrite, or --replace.
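
As a starting point for Step 4, the skeleton below shows the general shape of a model interface. The method names are illustrative placeholders only; the required signatures are defined by interface.py in this repository:

```python
# Hypothetical model interface skeleton. The real method names and signatures
# are defined by interface.py; the ones below are placeholders.

class DummyQuACModel:
    """A trivial baseline that always answers with the passage's first sentence."""

    def __init__(self, args=None):
        # Load your real model/tokenizer here; per Step 4, import all model
        # dependencies at the top of the file.
        self.args = args

    def predict(self, passage, question, history):
        # `history` holds the (question, answer) pairs so far; a real model
        # would encode passage + history + question and predict an answer span.
        end = passage.find(".")
        return passage if end == -1 else passage[: end + 1]
```

Once the model is registered in run_quac_eval.py (Step 5), run it as in run.sh with exactly one of --pred, --rewrite, or --replace enabled.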

Citation

@article{li2021ditch,
   title={Ditch the Gold Standard: Re-evaluating Conversational Question Answering},
   author={Li, Huihan and Gao, Tianyu and Goenka, Manan and Chen, Danqi},
   journal={arXiv preprint arXiv:2112.08812},
   year={2021}
}