Official repository with code and data accompanying the NAACL 2021 paper "Hurdles to Progress in Long-form Question Answering" (https://arxiv.org/abs/2103.06332).

Overview

Hurdles to Progress in Long-form Question Answering

This repository contains the official scripts and datasets accompanying our NAACL 2021 paper, "Hurdles to Progress in Long-form Question Answering". This repository supports inference from the pretrained retriever / generator & includes evaluation scripts.

Specifically, this codebase contains the model checkpoints, inference scripts for the retriever / generator model, generated outputs from model using c-REALM retrievals and random retrievals, scripts to compute ROUGE-L / R-Prec scores using the generations, scripts for question paraphrase classification, scripts for computing ROUGE-L bounds. You can also find the original Routing Transformer model's codebase here.

Setup

pip install transformers
pip install tensor2tensor

For GPU support, you might need to change the version of your TensorFlow depending on the CUDA / CuDNN installation (details). GPU support is strongly recommended for faster inference.

Model Checkpoints & Generations

Routing Transformer finetuned on ELI5: link
c-REALM TF Hub model + encoded retrieval corpora: link
c-REALM tokenized KILT Wikipedia data: link
c-REALM tokenized ELI5 training data: link
Pre-computed generations & QQP classifier: link

The original Routing Transformer model (pretrained on PG-19) and a local attention version of it can be found in the main repository (link).

Generating from the Routing Transformer

(We have provided the pre-computed retrievals from c-REALM on ELI5, so no need to run the c-REALM retriever)

  1. Download the "Routing Transformer finetuned on ELI5" model listed above and place it inside models.
wget https://storage.googleapis.com/rt-checkpoint/eli5_checkpoint.zip
unzip eli5_checkpoint.zip -d models
rm eli5_checkpoint.zip
  1. Download the generations folder from the Google Drive link listed as "Pre-computed generations & QQP classifier" above.

  2. Run eval_generate_eli5.py to generate from the model. We have provided c-REALM retrieval outputs in the script for the ELI5 validation / test split. For custom inputs, you will need to load the retriever and wikipedia corpus (see next section). Generation is on the slower side (~4 minutes per ELI5 QA pair on a 1080ti GPU), we hope to switch to the faster decoding mode in the Routing Transformer model in the near future.

Retrievals from c-REALM

(This script only tests the retriever, it doesn't depend on the Routing Transformer generator model)

  1. Download the "c-REALM TF Hub model + encoded retrieval corpora" model listed above. Place it inside the models folder.
wget https://storage.googleapis.com/rt-checkpoint/retriever.zip
unzip retriever.zip -d models
rm retriever.zip
  1. Download "c-REALM tokenized KILT Wikipedia data" if you are interested in retrieving from the KILT Wikipedia corpus and/or "c-REALM tokenized ELI5 training data" if you are interested in retrieving question paraphrases from the ELI5 training set. Place them inside the models folder.
wget https://storage.googleapis.com/rt-checkpoint/eli5_retrieval_train.zip
unzip eli5_retrieval_train.zip -d models
rm eli5_retrieval_train.zip
  1. Run eval_retriever_eli5.py to retrieve using c-REALM. Modify the --retrieval_corpus flag to choose the retrieval corpus.

Evaluation of Outputs

Setup

  1. Download the generations folder from the Google Drive link into this root folder.

  2. Clone the KILT repository in this folder and run the installation in a virtual environment.

git clone https://github.com/facebookresearch/KILT
cd KILT
virtualenv -p python3.7 kilt-venv
pip install -r requirements.txt
pip install --editable .
  1. If you are interested in using the Quora Question Paraphrase classifier (used in Section 3.2 of the paper), download the roberta-large-finetuned-qqp folder from "Pre-computed generations & QQP classifier" listed above. This model was built by Tu Vu.

  2. Download the ELI5 train, validation and test splits.

cd KILT
wget http://dl.fbaipublicfiles.com/KILT/eli5-train-kilt.jsonl -O train.jsonl
wget http://dl.fbaipublicfiles.com/KILT/eli5-dev-kilt.jsonl -O valid.jsonl
wget http://dl.fbaipublicfiles.com/KILT/eli5-test_without_answers-kilt.jsonl -O test.jsonl

Running evaluations

Enter the KILT folder and run the following command for evaluating p=0.6 with c-REALM retrievals on the validation set:

python kilt/eval_downstream.py ../generations/final_guess_eli5_0.6_predicted_retrieval.jsonl ../generations/final_gold_eli5_0.6_predicted_retrieval.jsonl

which should give you the output (partly reported in Table 6 of the paper),

{   'downstream': {   'accuracy': 0.0,
                      'em': 0.0,
                      'f1': 0.25566078582652935,
                      'rougel': 0.24417152125142375},
    'kilt': {   'KILT-accuracy': 0.0,
                'KILT-em': 0.0,
                'KILT-f1': 0.03414819887348917,
                'KILT-rougel': 0.03205580975169385},
    'retrieval': {'Rprec': 0.13258897418004187, 'recall@5': 0.2122586648057688}}

To evaluate other configurations, modify the paths in the command above. You can replace 0.6 with 0.9 for higher entropy generations, and replace predicted with random for randomly selected retrieval paragraphs (Hurdle #1 or Section 3.1 in the paper). Note that you should make this change for both the guess and gold files, to ensure correct alignment. We have only provided generations for the validation set since the test set answers / retrievals for ELI5 are hidden behind the KILT leaderboard.

Question paraphrase classification using QQP Classifier

In Section 3.2 of our paper, we used a Quora Question Paraphrase classifier to find question paraphrases amoung similar questions retrieved by c-REALM. To run this, make sure you have downloaded the QQP checkpoint (step 3 in Setup) and run,

python run_qqp.py --input_file generations/final_guess_eli5_0.6_similar_questions.jsonl

You should get a score of 43.6%. Note that this is a lower-bound --- qualitatively we found this classifier missed several paraphrase pairs with low lexical overlap, or cases where the retrieved training set question will have a super-set of the information needed to answer the validation set question.

Lower and Upper Bounds on ROUGE-L

Run the following to evaluate bounds on ROUGE-L. Make sure you have completed steps 1, 4 in the setup above. Scripts to evaluate other bounds involving training set retrieval coming soon!

cp generate_final_guess_bounds.py KILT/
cd KILT

# Copy input lowerbound, should get 20.0 ROUGE-L
python generate_final_guess_bounds.py --bound_type copy_input

# Random training set answer, should get 15.8-16.2 ROUGE-L depending on randomness
python generate_final_guess_bounds.py --bound_type random_train_ans

# "Performance" can be further boosted by randomly selecting from only longest answers
# for each training set question, up to ~16.7 ROUGE-L. This result is not reported in
# paper, but can be run using:
python generate_final_guess_bounds.py --bound_type random_train_ans_longest

# Longest gold answer upperbound, should get 21.2 ROUGE-L
python generate_final_guess_bounds.py --bound_type longest_gold

# Best gold answer upperbound, should get 26.2 ROUGE-L (takes a while to run, 45 min on single core)
python generate_final_guess_bounds.py --bound_type best_gold

Citation

If you found our paper or this repository useful, please cite:

@inproceedings{lfqa21,
author={Kalpesh Krishna and Aurko Roy and Mohit Iyyer},
Booktitle = {North American Association for Computational Linguistics},
Year = "2021",
Title={Hurdles to Progress in Long-form Question Answering},
}
You might also like...
Official Implementation for
Official Implementation for "ReStyle: A Residual-Based StyleGAN Encoder via Iterative Refinement" https://arxiv.org/abs/2104.02699

ReStyle: A Residual-Based StyleGAN Encoder via Iterative Refinement Recently, the power of unconditional image synthesis has significantly advanced th

Non-Official Pytorch implementation of
Non-Official Pytorch implementation of "Face Identity Disentanglement via Latent Space Mapping" https://arxiv.org/abs/2005.07728 Using StyleGAN2 instead of StyleGAN

Face Identity Disentanglement via Latent Space Mapping - Implement in pytorch with StyleGAN 2 Description Pytorch implementation of the paper Face Ide

Unofficial Tensorflow-Keras implementation of Fastformer based on paper [Fastformer: Additive Attention Can Be All You Need](https://arxiv.org/abs/2108.09084).
Unofficial Tensorflow-Keras implementation of Fastformer based on paper [Fastformer: Additive Attention Can Be All You Need](https://arxiv.org/abs/2108.09084).

Fastformer-Keras Unofficial Tensorflow-Keras implementation of Fastformer based on paper Fastformer: Additive Attention Can Be All You Need. Tensorflo

source code for https://arxiv.org/abs/2005.11248 "Accelerating Antimicrobial Discovery with Controllable Deep Generative Models and Molecular Dynamics"

Accelerating Antimicrobial Discovery with Controllable Deep Generative Models and Molecular Dynamics This work will be published in Nature Biomedical

Pytorch implementation of Each Part Matters: Local Patterns Facilitate Cross-view Geo-localization https://arxiv.org/abs/2008.11646
Pytorch implementation of Each Part Matters: Local Patterns Facilitate Cross-view Geo-localization https://arxiv.org/abs/2008.11646

[TCSVT] Each Part Matters: Local Patterns Facilitate Cross-view Geo-localization LPN [Paper] NEWs Prerequisites Python 3.6 GPU Memory = 8G Numpy 1.

https://arxiv.org/abs/2102.11005
https://arxiv.org/abs/2102.11005

LogME LogME: Practical Assessment of Pre-trained Models for Transfer Learning How to use Just feed the features f and labels y to the function, and yo

ISTR: End-to-End Instance Segmentation with Transformers (https://arxiv.org/abs/2105.00637)

This is the project page for the paper: ISTR: End-to-End Instance Segmentation via Transformers, Jie Hu, Liujuan Cao, Yao Lu, ShengChuan Zhang, Yan Wa

Minimal implementation of PAWS (https://arxiv.org/abs/2104.13963) in TensorFlow.
Minimal implementation of PAWS (https://arxiv.org/abs/2104.13963) in TensorFlow.

PAWS-TF 🐾 Implementation of Semi-Supervised Learning of Visual Features by Non-Parametrically Predicting View Assignments with Support Samples (PAWS)

YOLO5Face: Why Reinventing a Face Detector (https://arxiv.org/abs/2105.12931)
YOLO5Face: Why Reinventing a Face Detector (https://arxiv.org/abs/2105.12931)

Introduction Yolov5-face is a real-time,high accuracy face detection. Performance Single Scale Inference on VGA resolution(max side is equal to 640 an

Comments
  • code for training the retriever

    code for training the retriever

    Hello thanks for open sourcing your code. I wonder if you are planning to release code for training the retriever starting from REALM-CCNews? I understand that you provided a reference for the pytorch version here and I was able to train one starting from Bert . But I am not sure if the REALM checkpoints are released to be used by torch? If not would you mind sharing the tensorflow implementation?

    Thanks

    opened by fabrahman 0
Owner
Kalpesh Krishna
PhD student in Computer Science at UMass Amherst. Formerly IIT Bombay, @google-research, @mozilla, TTIC, @wncc.
Kalpesh Krishna
Supplementary code for the paper "Meta-Solver for Neural Ordinary Differential Equations" https://arxiv.org/abs/2103.08561

Meta-Solver for Neural Ordinary Differential Equations Towards robust neural ODEs using parametrized solvers. Main idea Each Runge-Kutta (RK) solver w

Julia Gusak 25 Aug 12, 2021
Unofficial implementation of "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows" (https://arxiv.org/abs/2103.14030)

Swin-Transformer-Tensorflow A direct translation of the official PyTorch implementation of "Swin Transformer: Hierarchical Vision Transformer using Sh

null 52 Dec 29, 2022
This repo provides the official code for TransBTS: Multimodal Brain Tumor Segmentation Using Transformer (https://arxiv.org/pdf/2103.04430.pdf).

TransBTS: Multimodal Brain Tumor Segmentation Using Transformer This repo is the official implementation for TransBTS: Multimodal Brain Tumor Segmenta

Raymond 247 Dec 28, 2022
This is an official implementation of our CVPR 2021 paper "Bottom-Up Human Pose Estimation Via Disentangled Keypoint Regression" (https://arxiv.org/abs/2104.02300)

Bottom-Up Human Pose Estimation Via Disentangled Keypoint Regression Introduction In this paper, we are interested in the bottom-up paradigm of estima

HRNet 367 Dec 27, 2022
Official implementation of the paper Image Generators with Conditionally-Independent Pixel Synthesis https://arxiv.org/abs/2011.13775

CIPS -- Official Pytorch Implementation of the paper Image Generators with Conditionally-Independent Pixel Synthesis Requirements pip install -r requi

Multimodal Lab @ Samsung AI Center Moscow 201 Dec 21, 2022
This repository contains the code used for Predicting Patient Outcomes with Graph Representation Learning (https://arxiv.org/abs/2101.03940).

Predicting Patient Outcomes with Graph Representation Learning This repository contains the code used for Predicting Patient Outcomes with Graph Repre

Emma Rocheteau 76 Dec 22, 2022
Source code for models described in the paper "AudioCLIP: Extending CLIP to Image, Text and Audio" (https://arxiv.org/abs/2106.13043)

AudioCLIP Extending CLIP to Image, Text and Audio This repository contains implementation of the models described in the paper arXiv:2106.13043. This

null 458 Jan 2, 2023
Code for paper "A Critical Assessment of State-of-the-Art in Entity Alignment" (https://arxiv.org/abs/2010.16314)

A Critical Assessment of State-of-the-Art in Entity Alignment This repository contains the source code for the paper A Critical Assessment of State-of

Max Berrendorf 16 Oct 14, 2022
Code for the paper: Learning Adversarially Robust Representations via Worst-Case Mutual Information Maximization (https://arxiv.org/abs/2002.11798)

Representation Robustness Evaluations Our implementation is based on code from MadryLab's robustness package and Devon Hjelm's Deep InfoMax. For all t

Sicheng 19 Dec 7, 2022