# Ditch the Gold Standard: Re-evaluating Conversational Question Answering
This is the repository for our paper [Ditch the Gold Standard: Re-evaluating Conversational Question Answering](https://arxiv.org/abs/2112.08812).
## Overview
In this work, we conduct the first large-scale human evaluation of state-of-the-art conversational QA systems. In our evaluation, human annotators chat with conversational QA models about passages from the QuAC development set and then judge the correctness of the model answers. We release the human-annotated dataset in the section below.
We also identify a critical issue with the current automatic evaluation, which pre-collects human-human conversations and uses ground-truth answers as conversational history (the figure below illustrates the differences between the evaluation protocols). By comparison, we find that the automatic evaluation does not always agree with the human evaluation. We propose a new evaluation protocol based on predicted history and question rewriting. Our experiments show that the new protocol better reflects real-world performance than the original automatic evaluation. We also provide the code for the new evaluation protocol below.
## Human Evaluation Dataset
You can download the human annotation dataset from `data/human_annotation_data.json`. The JSON file contains one data field, `data`, which is a list of conversations. Each conversation contains the following fields:

- `model_name`: The model evaluated. One of `bert4quac`, `graphflow`, `ham`, `excord`.
- `context`: The passage used in this conversation.
- `dialog_id`: The ID from the original QuAC dataset.
- `qas`: The conversation, which contains a list of QA pairs. Each QA pair has the following fields:
  - `turn_id`: The turn number.
  - `question`: The question from the human annotator.
  - `answer`: The answer from the model.
  - `valid`: Whether the question is valid (annotated by our human annotator).
  - `answerable`: Whether the question is answerable (annotated by our human annotator).
  - `correct`: Whether the model's answer is correct (annotated by our human annotator).
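As a quick illustration, here is a minimal sketch (assuming the layout described above, and that `correct` is stored as a boolean or 0/1 flag) that loads the annotations and reports the fraction of correct model answers per model:

```python
import json
from collections import defaultdict

# Load the human evaluation data (path as described above).
with open("data/human_annotation_data.json") as f:
    conversations = json.load(f)["data"]

# Count answered turns and correct answers per model.
totals = defaultdict(int)
corrects = defaultdict(int)
for conv in conversations:
    for qa in conv["qas"]:
        totals[conv["model_name"]] += 1
        if qa["correct"]:  # assumes a truthy/falsy correctness label
            corrects[conv["model_name"]] += 1

for model, total in totals.items():
    print(f"{model}: {corrects[model] / total:.3f} accuracy over {total} turns")
```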
## Automatic model evaluation interface
We provide a convenient interface to test model performance under the evaluation protocols compared in our paper, including `Auto-Pred`, `Auto-Replace`, and our proposed protocol, `Auto-Rewrite`, which better reflects models' performance in human-model conversations. Please refer to our paper for more details. The figure below describes how `Auto-Rewrite` works.
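As a rough illustration of the idea (not the implementation in this repository), the sketch below walks through one conversation under a predicted-history protocol with question rewriting. `rewrite_question` and `model.predict` are hypothetical stand-ins for a coreference-based rewriter and a model interface; see `interface.py` and `run_quac_eval.py` for the actual entry points.

```python
def evaluate_dialog_auto_rewrite(model, passage, gold_turns, rewrite_question):
    """Illustrative loop: feed the model its OWN previous answers as history,
    and rewrite each gold question so it no longer depends on gold answers."""
    predicted_history = []  # (question, predicted_answer) pairs
    predictions = []
    for turn in gold_turns:
        # Rewrite the question against the predicted history so that references
        # to earlier gold answers are made explicit.
        question = rewrite_question(turn["question"], predicted_history)
        answer = model.predict(passage, question, predicted_history)
        predicted_history.append((question, answer))
        predictions.append(answer)
    return predictions
```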
To use our evaluation interface on your own model, follow these steps:
- Step 1: Download the QuAC dataset.
- Step 2: Install `allennlp`, `allennlp_models`, and `ncr.replace_corefs` through `pip` if you would like to use `Auto-Rewrite`.
- Step 3: Download the CANARD dataset and set `--canard_path` if you would like to use `Auto-Replace`.
- Step 4: Write a model interface following the template `interface.py` (an illustrative sketch is given after this list). Explanations of each function are provided through in-line comments. Make sure to import all your model dependencies at the top.
- Step 5: Add the model to the evaluation script `run_quac_eval.py`. The changes that need to be made are marked with `#TODO`.
- Step 6: Run the evaluation script. See `run.sh` for reference. Explanations of all arguments are provided in `run_quac_eval.py`. Make sure to turn on only one of `--pred`, `--rewrite`, or `--replace`.
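For reference, a model interface along the lines of Step 4 might look roughly like the sketch below. The class and method names here are illustrative assumptions, not the actual signatures in `interface.py`; follow the in-line comments in the template for the exact functions to implement.

```python
# Illustrative sketch only: the real template is interface.py, whose function
# names and signatures may differ. Import your model dependencies at the top.
from typing import List, Tuple


class DummyQAModel:
    """Hypothetical wrapper exposing a conversational QA model to the evaluator."""

    def __init__(self, model_path: str):
        # A real interface would load model weights and a tokenizer here.
        self.model_path = model_path

    def predict(
        self,
        context: str,
        question: str,
        history: List[Tuple[str, str]],
    ) -> str:
        """Answer `question` about `context`, given prior (question, answer) turns."""
        # A real model would encode the passage, question, and history and
        # return a predicted answer span; this dummy just returns a sentence.
        return context.split(". ")[0]
```

The evaluation script then calls the interface once per turn, feeding it whichever conversational history the selected protocol (`--pred`, `--rewrite`, or `--replace`) constructs.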
## Citation
```bibtex
@article{li2021ditch,
   title={Ditch the Gold Standard: Re-evaluating Conversational Question Answering},
   author={Li, Huihan and Gao, Tianyu and Goenka, Manan and Chen, Danqi},
   journal={arXiv preprint arXiv:2112.08812},
   year={2021}
}
```