EMNLP'2021: Simple Entity-centric Questions Challenge Dense Retrievers

Overview

EntityQuestions

This repository contains the EntityQuestions dataset as well as code to evaluate retrieval results from the paper Simple Entity-centric Questions Challenge Dense Retrievers by Chris Sciavolino*, Zexuan Zhong*, Jinhyuk Lee, and Danqi Chen (* equal contribution).

[9/16/21] This repo is not yet set in stone; we're still putting finishing touches on the tooling and documentation :) Thanks for your patience!

Installation

You can download a .zip file of the dataset from the URL below, for example with wget:

$ wget https://nlp.cs.princeton.edu/projects/entity-questions/dataset.zip

The dependencies needed to run the code in this repository are listed in requirements.txt. We recommend having a separate miniconda environment for running DPR code. You can create the environment using the following commands:

$ conda create -n EntityQ python=3.6
$ conda activate EntityQ
$ pip install -r requirements.txt

Dataset Overview

The unzipped dataset directory should have the following structure:

dataset/
    | train/
        | P*.train.json     // all randomly sampled training files 
    | dev/
        | P*.dev.json       // all randomly sampled development files
    | test/
        | P*.test.json      // all randomly sampled testing files
    | one-off/
        | common-random-buckets/
            | P*/
                | bucket*.test.json
        | no-overlap/
            | P*/
                | P*_no_overlap.{train,dev,test}.json
        | nq-seen-buckets/
            | P*/
                | bucket*.test.json
        | similar/
            | P*/
                | P*_similar.{train,dev,test}.json

The main dataset is included in dataset/ under train/, dev/, and test/, each containing the randomly sampled training, development, and testing subsets, respectively. For example, the evaluation set for place-of-birth (P19) can be found in the dataset/test/P19.test.json file.
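
As a rough illustration of the layout above, the following sketch loads one relation file and counts the examples per relation in a split. It assumes each file holds a JSON array of examples; the field names inside each example are not described in this README, so the code does not rely on them. The function names `load_relation_split` and `count_examples` are hypothetical helpers, not part of this repository.

```python
import json
from pathlib import Path


def load_relation_split(dataset_dir, relation, split="test"):
    """Load one relation file, e.g. dataset/test/P19.test.json."""
    path = Path(dataset_dir) / split / f"{relation}.{split}.json"
    with path.open(encoding="utf-8") as f:
        return json.load(f)  # assumed to be a JSON array of examples


def count_examples(dataset_dir, split="test"):
    """Map each relation ID (e.g. "P19") to its number of examples."""
    counts = {}
    split_dir = Path(dataset_dir) / split
    for path in sorted(split_dir.glob(f"P*.{split}.json")):
        relation = path.name.split(".")[0]
        with path.open(encoding="utf-8") as f:
            counts[relation] = len(json.load(f))
    return counts
```

For example, `load_relation_split("dataset", "P19")` would read the place-of-birth evaluation set mentioned above.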

We also include all of the one-off datasets we used to generate the tables/figures presented in the paper under dataset/one-off/, explained below:

  • one-off/common-random-buckets/ contains buckets of 1,000 randomly sampled examples, used to produce Fig. 1 of the paper (specifically for rand-ent).
  • one-off/no-overlap/ contains the training/development splits for our analyses in Section 4.1 of the paper (we do not use the testing split in our analysis). These training/development sets have subject entities with no token overlap with subject entities of the randomly sampled test set (specifically for all fine-tuning in Table 2).
  • one-off/nq-seen-buckets/ contains buckets of questions with subject entities that overlap with subject entities seen in the NQ training set, used to produce Fig. 1 of the paper (specifically for train-ent).
  • one-off/similar/ contains the training/development splits for the syntactically different but semantically equivalent question sets, used for our analyses in Section 4.1 (specifically the similar rows). Again, we do not use the testing split in our analysis. These questions are identical to one-off/no-overlap/ but use a different question template.

Retrieving DPR Results

Our analysis is based on a previous version of the DPR repository (specifically the Oct. 5 version with commit hash 27a8436b070861e2fff481e37244009b48c29c09), so our commands may not be up to date with the March 2021 release. That said, most of the commands should transfer with little change.

First, we recommend following the setup guide from the official DPR repository. Once set up, you can download the relevant pre-trained models/indices using their download_data.py script. For our analysis, we used the DPR-NQ model and the DPR-Multi model. To run retrieval using a pre-trained model, you'll minimally need to download:

  1. The pre-trained model
  2. The Wikipedia passage splits
  3. The encoded Wikipedia passage FAISS index
  4. A question/answer dataset

With this, you can use the following python command:

python dense_retriever.py \
    --batch_size 512 \
    --model_file "path/to/pretrained/model/file.cp" \
    --qa_file "path/to/qa/dataset/to/evaluate.json" \
    --ctx_file "path/to/wikipedia/passage/splits.tsv" \
    --encoded_ctx_file "path/to/encoded/wikipedia/passage/index/" \
    --save_or_load_index \
    --n-docs 100 \
    --validation_workers 1 \
    --out_file "path/to/desired/output/location.json"
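
If you want to run the same command over many question files (for instance every file in dataset/test/), a small wrapper can assemble the argument list for you. The sketch below only mirrors the flags shown above; `build_dpr_command` is a hypothetical helper, and naming the output file after the input file is our assumption.

```python
from pathlib import Path


def build_dpr_command(qa_file, out_dir, model_file, ctx_file, encoded_ctx_file):
    """Build the dense_retriever.py argument list for one question file."""
    # Assumption: results are written under out_dir with the input's file name.
    out_file = Path(out_dir) / Path(qa_file).name
    return [
        "python", "dense_retriever.py",
        "--batch_size", "512",
        "--model_file", str(model_file),
        "--qa_file", str(qa_file),
        "--ctx_file", str(ctx_file),
        "--encoded_ctx_file", str(encoded_ctx_file),
        "--save_or_load_index",
        "--n-docs", "100",
        "--validation_workers", "1",
        "--out_file", str(out_file),
    ]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` from inside the DPR checkout.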

We ran DPR retrieval on a single NVIDIA RTX 2080 Ti GPU (11 GB) on a machine with 128 GB of RAM.

Retrieving BM25 Results

We use the Pyserini implementation of BM25 for our analysis. We use the default settings and index on the same passage splits downloaded from the DPR repository. We include steps to re-create our BM25 results below.

First, we need to pre-process the DPR passage splits into the proper format for BM25 indexing. The script for this is bm25/build_bm25_ctx_passages.py. Rather than writing all passages into a single file, you can optionally shard the passages into multiple files (specified by the n_shards argument). The script also creates a mapping from the passage ID to the title of the article the passage is from. You can use it as follows:

python bm25/build_bm25_ctx_passages.py \
    --wiki_passages_file "path/to/wikipedia/passage/splits.tsv" \
    --outdir "path/to/desired/output/directory/" \
    --title_index_path "path/to/desired/output/directory/.json" \
    --n_shards number_of_shards_of_passages_to_write
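
To give a feel for what this pre-processing step involves, here is a minimal sketch, not the repository's actual script. It assumes the DPR passage file is a tab-separated file with a header row and the columns id, text, title, and it writes JSON-lines shards with the `id` and `contents` fields that Pyserini's JsonCollection reads. The shard file names, the round-robin sharding, and joining title and text with a newline are all our assumptions.

```python
import csv
import json
from pathlib import Path


def write_passage_shards(tsv_path, out_dir, n_shards=1):
    """Convert a DPR-style passage TSV into JSON-lines shards.

    Returns a dict mapping passage ID to article title.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical shard naming: passages_0.jsonl, passages_1.jsonl, ...
    shards = [
        (out_dir / f"passages_{i}.jsonl").open("w", encoding="utf-8")
        for i in range(n_shards)
    ]
    id_to_title = {}
    try:
        with open(tsv_path, encoding="utf-8", newline="") as f:
            reader = csv.reader(f, delimiter="\t")
            next(reader)  # skip the header row (assumed: id, text, title)
            for row_idx, (pid, text, title) in enumerate(reader):
                # Assumption: title and passage text are joined by a newline.
                doc = {"id": pid, "contents": f"{title}\n{text}"}
                # Round-robin assignment of passages to shards.
                shards[row_idx % n_shards].write(json.dumps(doc) + "\n")
                id_to_title[pid] = title
    finally:
        for shard in shards:
            shard.close()
    return id_to_title
```

The returned mapping corresponds to the title index described above and could be saved with `json.dump`.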

Now that you have all the passages in files, you can build the BM25 index using the following command:

python -m pyserini.index -collection JsonCollection \
    -generator DefaultLuceneDocumentGenerator \
    -threads 4 \
    -input "path/to/generated/passages/folder/" \
    -index "path/to/desired/index/folder/" \
    -storePositions -storeDocvectors -storeRaw

Once the index is built, you can use it in the bm25/bm25_retriever.py script to get retrieval results for an input file:

python bm25/bm25_retriever.py \
    --index_path "path/to/built/bm25/index/directory/" \
    --passage_id_to_title_path "path/to/title_index_path/from_step_1.json" \
    --input "path/to/input/qa/file.json" \
    --output_dir "path/to/output/directory/"

By default, the script will retrieve 100 passages (--n_docs), use string matching to determine answer presence (--answer_type), and take in .json files (--input_file_type). You can optionally provide a glob using the --glob flag. The script writes the results to a file with the same name as the input file, but in the output directory.
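
For intuition, answer presence by string matching can be approximated as below. This is a simplified, case-insensitive substring check written for illustration; the actual script may normalize or tokenize differently, and `normalize` and `has_answer` are names we chose for this sketch.

```python
import re
import unicodedata


def normalize(text):
    """Unicode-normalize, lowercase, and collapse whitespace."""
    text = unicodedata.normalize("NFD", text).lower()
    return re.sub(r"\s+", " ", text).strip()


def has_answer(answers, passage_text):
    """Return True if any non-empty answer string occurs in the passage."""
    passage = normalize(passage_text)
    return any(normalize(a) in passage for a in answers if a.strip())
```

Because this is a plain substring test, a short answer can also match inside a longer word; a token-level comparison avoids that at the cost of a tokenizer dependency.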

Evaluating Retriever Results

We provide an evaluation script in utils/accuracy.py. The expected format is equivalent to DPR's output format. It either accepts a single file to evaluate, or a glob of multiple files if the --glob option is set. To evaluate a single file, you can use the following command:

python utils/accuracy.py \
    --results "path/to/retrieval/results.json" \
    --k_values 1,5,20,100

or with a glob with:

python utils/accuracy.py \
    --results="path/to/glob*.test.json" \
    --glob \
    --k_values 1,5,20,100
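
The metric itself is straightforward to reproduce. The sketch below computes top-k retrieval accuracy, assuming the DPR-style output format in which the results file is a list of entries, each with a ranked `ctxs` list whose items carry a boolean `has_answer` field. It is an illustration under that assumption, not a copy of utils/accuracy.py.

```python
def top_k_accuracy(results, k_values=(1, 5, 20, 100)):
    """Fraction of questions with an answer in their top-k passages."""
    hits = {k: 0 for k in k_values}
    for example in results:
        # ctxs is assumed to be sorted from best to worst retrieval score.
        flags = [ctx["has_answer"] for ctx in example["ctxs"]]
        for k in k_values:
            if any(flags[:k]):
                hits[k] += 1
    total = len(results)
    return {k: (hits[k] / total if total else 0.0) for k in k_values}
```

To use it, load a results file with `json.load` and pass the resulting list to `top_k_accuracy`.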

Bugs or Questions?

Feel free to open an issue on this GitHub repository and we'd be happy to answer your questions as best we can!

Citation

If you use our dataset or code in your research, please cite our work:

@inproceedings{sciavolino2021simple,
   title={Simple Entity-centric Questions Challenge Dense Retrievers},
   author={Sciavolino, Christopher and Zhong, Zexuan and Lee, Jinhyuk and Chen, Danqi},
   booktitle={Empirical Methods in Natural Language Processing (EMNLP)},
   year={2021}
}
Comments
  • Aligned T-REx Samples

    Hi,

    Thank you for sharing this interesting work!

    The paper mentions that the entity questions were created using triples from T-REx. Would it be possible to share the aligned T-REx samples for the questions? It could be useful to know the entities in the questions (e.g. to do similar analyses on entity statistics like in Figure 1 of the paper), as well as the corresponding evidence in Wikipedia (e.g. for mapping to knowledge bases like KILT).

    Thank you!

    opened by mleszczy 6
  • BM25 text extraction

    The code for text extraction in BM25 seems to incorrectly include the title portion, because hit.raw actually has the string form of the dictionary. The correct extraction code is

    json.loads(hit.raw)['contents'][len(title):].strip()
    

    This bug results in a slightly higher recall for BM25 (because answer strings can now be found in the title part included in text). It's a slight but consistent (false) improvement. Please consider updating the results in the paper if this is a legitimate bug. I've created a PR.

    opened by karlstratos 5
  • Do you have any plans to release the frequency of the entity?

    The paper mentions that "In our analysis, we use the Wikipedia hyperlink count as a proxy for an entity's frequency." My understanding is that this means traversing all of Wikipedia and counting the hyperlinks that point to an entity's page. This can be difficult, and implementations can introduce bias. Do you have any plans to release a file describing the frequencies of entities, or code that counts entity frequencies?

    Thank you so much!

    opened by xiahaoyun 4
  • Is the "average" in Table 1 "micro-average" or "macro-average"?

    Hi,

    As titled, is the reported overall performance in Table 1 a micro-average over all samples in the entire dataset regardless of relation types, or a macro-average taken across all relation types?

    Thanks.

    opened by ccsasuke 3
  • The dataset does not provide "positive paragraph context"

    Hi, the dataset does not provide "positive paragraph context".

    Should we use has_answer_fn.py over the entire Wikipedia to create positive_ctxs?

    opened by SunSiShining 2