RNG-KBQA: Generation Augmented Iterative Ranking for Knowledge Base Question Answering

Overview

Authors: Xi Ye, Semih Yavuz, Kazuma Hashimoto, Yingbo Zhou and Caiming Xiong

Abstract

Existing KBQA approaches, despite achieving strong performance on i.i.d. test data, often struggle in generalizing to questions involving unseen KB schema items. Prior ranking-based approaches have shown some success in generalization, but suffer from the coverage issue. We present RnG-KBQA, a Rank-and-Generate approach for KBQA, which remedies the coverage issue with a generation model while preserving a strong generalization capability. Our approach first uses a contrastive ranker to rank a set of candidate logical forms obtained by searching over the knowledge graph. It then introduces a tailored generation model conditioned on the question and the top-ranked candidates to compose the final logical form. We achieve new state-of-the-art results on the GrailQA and WebQSP datasets. In particular, our method surpasses the prior state-of-the-art by a large margin on the GrailQA leaderboard. In addition, RnG-KBQA outperforms all prior approaches on the popular WebQSP benchmark, even including the ones that use oracle entity linking. The experimental results demonstrate the effectiveness of the interplay between ranking and generation, which leads to the superior performance of our proposed approach across all settings, with especially strong improvements in zero-shot generalization.

Paper link: https://arxiv.org/pdf/2109.08678.pdf

Requirements

The code is tested under the following environment setup:

  • python==3.8.10
  • pytorch==1.7.0
  • transformers==3.3.1
  • spacy==3.1.1
  • for other requirements, please see requirements.txt

System requirements:

We recommend a machine with over 300 GB of memory for training the models and a machine with 128 GB of memory for inference. However, 256 GB is still sufficient for running inference and training all of the models (some memory-saving tricks are needed when training the ranker model for GrailQA).

General Setup

Setup Experiment Directory

Before running the scripts, please use setup.sh to set up the experiment folders. Basically, it creates some symbolic links in each experiment directory.

Setup Freebase

All of the datasets use Freebase as the knowledge source. Please follow Freebase Setup to set up a Virtuoso triplestore service. After starting your Virtuoso service, if you modified the default URL, you may need to change the URL in /framework/executor/sparql_executor.py accordingly.
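Once the Virtuoso service is running, you can sanity-check the endpoint with a small standalone query before running any experiments. This is only an illustrative sketch: it assumes the default local endpoint http://localhost:8890/sparql (adjust it if you changed the URL above) and queries the endpoint directly with SPARQLWrapper rather than going through the repo's executor.

    from SPARQLWrapper import SPARQLWrapper, JSON

    # Assumed default Virtuoso endpoint; change this if you modified the URL
    # used in /framework/executor/sparql_executor.py.
    endpoint = SPARQLWrapper("http://localhost:8890/sparql")
    endpoint.setReturnFormat(JSON)
    endpoint.setQuery("""
        PREFIX : <http://rdf.freebase.com/ns/>
        SELECT ?name WHERE { :m.03_r3 :type.object.name ?name } LIMIT 5
    """)
    results = endpoint.query().convert()
    for binding in results["results"]["bindings"]:
        print(binding["name"]["value"])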

Reproducing the Results on GrailQA

Please use /GrailQA as the working directory when running experiments on GrailQA.


Prepare dataset and pretrained checkpoints

Dataset

Please download the dataset and put the files under outputs/ so that they are organized as outputs/grailqa_v1.0_train.json, outputs/grailqa_v1.0_dev.json, and outputs/grailqa_v1.0_test.json. (Please rename the test-public split to test.)

NER Checkpoints

We use the NER system (under the directories entity_linking and entity_linker) from the original GrailQA code repo. Please follow the instructions in the original repo to pull the related data.

Other Checkpoints

Please download the following checkpoints for entity disambiguation, candidate ranking, and augmented generation, then unzip them and put them under the checkpoints/ directory.

KB Cache

We provide a cache of KB query results, which can save considerable time. Please download the cache file for GrailQA, unzip it, and put the files under cache/, so that cache/grail-LinkedRelation.bin and cache/grail-TwoHopPath.bin are in place.
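The exact serialization of these .bin files is internal to the code, but if you just want to verify that a cache file loads, a quick peek along the following lines may help (a sketch only, assuming standard pickle serialization and a dict-like cache object, which may not match the repo's actual format):

    import pickle

    # Hypothetical inspection of the GrailQA relation cache.
    with open("cache/grail-LinkedRelation.bin", "rb") as f:
        cache = pickle.load(f)
    print(type(cache))
    try:
        print(len(cache), "cached entries")
    except TypeError:
        print("cache object has no length")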


Running inference

Demo for Checking the Pipeline

It's recommended to run the one-click demo script first to test whether everything mentioned above is set up correctly. If it runs through successfully, you'll get a final F1 of around 0.86. Please make sure you can reproduce the results on this small demo set first, as inference on dev and test can take a long time.

sh scripts/walk_through_demo.sh

Step by Step Instructions

We also provide step-by-step inference instructions as below:

(i) Detecting Entities

Once having the entity linker ready, run

python detect_entity_mention.py --split <split>   # e.g., --split test

This will write the entity mentions to outputs/grail_<split>_entities.json. We extract up to 10 candidate entities for each mention, which will be further disambiguated in the next step.

!! Running entity detection for the first time requires building a surface-form index, which can take a long time (but it is only needed the first time).
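To quickly check that this step produced output, you can load the resulting JSON and count the processed entries (a minimal sketch using the dev split; the inner structure of each entry is not documented here, so only a top-level count is shown):

    import json

    # Path follows the naming pattern above, using the dev split as an example.
    with open("outputs/grail_dev_entities.json") as f:
        detected = json.load(f)
    print(len(detected), "entries in the entity mention file")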

(ii) Disambiguating Entities (Entity Linking)

We have provided a pretrained entity disambiguation model:

sh scripts/run_disamb.sh predict <model_path> <split>

E.g., sh scripts/run_disamb.sh predict checkpoints/grail_bert_entity_disamb test

This will write the prediction results (the selected entity index for each mention) to misc/grail_<split>_entity_linking.json.

(iii) Enumerating Logical Form Candidates

python enumerate_candidates.py --split <split> --pred_file <pred_file>

E.g., python enumerate_candidates.py --split test --pred_file misc/grail_test_entity_linking.json.

This will write the enumerated candidates to outputs/grail_<split>_candidates-ranking.jsonline.

(iv) Running Ranker

sh scripts/run_ranker.sh predict <model_path> <split>

E.g., sh scripts/run_ranker.sh predict checkpoints/grail_bert_ranking test

This will write the candidate logits (the logit of each candidate for each example) to misc/grail_<split>_candidates_logits.bin, and the prediction results (in the original GrailQA prediction format) to misc/grail_<split>_ranker_results.txt.

You may evaluate the ranker results with grail_evaluate.py:

E.g., python grail_evaluate.py outputs/grailqa_v1.0_dev.json misc/grail_dev_ranker_results.txt

(v) Running Generator

First, prepare the generation model inputs:

python make_generation_dataset.py --split <split> --logit_file <logit_file>

E.g., python make_generation_dataset.py --split test --logit_file misc/grail_test_candidate_logits.bin.

This will read the candidates, use the logits to select the top-k candidates, and write the generation model inputs to outputs/grail_<split>_gen.json.
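Conceptually, this step just sorts each example's candidates by their ranker logit and keeps the best k as context for the generator. A minimal, self-contained illustration of that idea (not the repo's actual code; the candidate representation and the value of k here are made up):

    # Illustrative top-k selection over parallel lists of candidates and logits.
    def select_topk(candidates, logits, k=5):
        """Return the k candidates with the highest ranker logits."""
        scored = sorted(zip(candidates, logits), key=lambda pair: pair[1], reverse=True)
        return [cand for cand, _ in scored[:k]]

    # Toy example.
    cands = ["(AND type.a ...)", "(AND type.b ...)", "(AND type.c ...)"]
    print(select_topk(cands, [0.2, 1.3, -0.5], k=2))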

Second, run the generation model to get the top-k predictions:

sh scripts/run_gen.sh predict <model_path> <split>

E.g., sh scripts/run_gen.sh predict checkpoints/grail_t5_generation test.

This will generate the top-k decoded logical forms, stored at misc/grail_<split>_topk_generations.json.
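Under the hood, the generator is a seq2seq (T5) model that beam-searches several logical forms per question, conditioned on the question and the top-ranked candidates. The following generic transformers sketch shows the idea only; it is not the repo's inference code, the model name is a placeholder, and the exact input format (question followed by ';'-separated candidates) is an assumption based on the paper's description.

    from transformers import T5ForConditionalGeneration, T5Tokenizer

    # Placeholder base model; the real checkpoint is checkpoints/grail_t5_generation.
    tokenizer = T5Tokenizer.from_pretrained("t5-base")
    model = T5ForConditionalGeneration.from_pretrained("t5-base")

    # Assumed input layout: question plus top-ranked candidate logical forms.
    source = "what is the capital of jamaica? ; (AND location.country ...) ; ..."
    inputs = tokenizer(source, return_tensors="pt")
    outputs = model.generate(inputs["input_ids"], num_beams=10,
                             num_return_sequences=10, max_length=128)
    topk = [tokenizer.decode(seq, skip_special_tokens=True) for seq in outputs]
    print(topk)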

(vi) Final Inference Steps

Given the decoded top-k predictions, we go down the list and execute the logical forms one by one until we find one that returns valid answers.

python eval_topk_prediction.py --split <split> --pred_file <pred_file>

E.g., python eval_topk_prediction.py --split test --pred_file misc/grail_test_topk_generations.json

The prediction results (in the original GrailQA prediction format) will be written to misc/grail_<split>_final_results.txt.
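The execute-until-valid logic described above can be pictured with the following sketch (hypothetical helper names; the actual implementation lives in eval_topk_prediction.py and the SPARQL executor):

    # Illustrative only: walk the ranked list of generated logical forms and
    # return the first one whose execution yields a non-empty answer set.
    def first_valid_prediction(topk_logical_forms, execute_fn):
        for lf in topk_logical_forms:
            try:
                answers = execute_fn(lf)  # e.g., convert to SPARQL and query Freebase
            except Exception:
                continue  # malformed or non-executable candidate; try the next one
            if answers:
                return lf, answers
        return None, []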

You can then use the official GrailQA evaluation script to run evaluation:

python grail_evaluate.py

E.g., python grail_evaluate.py outputs/grailqa_v1.0_dev.json misc/grail_dev_final_results.txt


Training Models

We have already provided pretrained models ready for running inference. If you'd like to train your own models, please check out the README in the /GrailQA folder.

Reproducing the Results on WebQSP

Please use /WebQSP as the working directory when running experiments on WebQSP.


Prepare dataset and pretrained checkpoints

Dataset

Please download the WebQSP dataset and put the files under outputs/ so that they are organized as outputs/WebQSP.train.json and outputs/WebQSP.test.json.

Evaluation Script

Please make a copy of the official evaluation script (eval/eval.py in the WebQSP zip file) and put the script under this directory (WebQSP) with the name legacy_eval.py.

Model Checkpoints

Please download the following checkpoints for candidate ranking and augmented generation, then unzip them and put them under the checkpoints/ directory.

KB Cache

Please download the cache file for WebQSP, unzip it, and put the files under cache/ so that cache/webqsp-LinkedRelation.bin and cache/webqsp-TwoHopPath.bin are in place.


Running inference

(i) Parsing Sparql-Query to S-Expression

As stated in the paper, we use s-expressions as logical forms, which are not provided by the original dataset, so we provide a script to parse the SPARQL queries into s-expressions.

Run python parse_sparql.py, which will augment the original dataset files with s-expressions and save them under outputs/ as outputs/WebQSP.train.expr.json and outputs/WebQSP.test.expr.json. Since there is no official validation set, we further randomly select 200 examples from the training set for validation, yielding the ptrain and pdev splits.
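The ptrain/pdev split mentioned above is simply a random held-out subset of 200 training examples. A sketch of the idea (the file layout, the seed, and the assumption that the file stores a flat list of examples are all assumptions, not necessarily the repo's exact procedure):

    import json
    import random

    with open("outputs/WebQSP.train.expr.json") as f:
        data = json.load(f)
    # Handle either a flat list or the original {"Questions": [...]} layout.
    examples = data["Questions"] if isinstance(data, dict) and "Questions" in data else data

    random.seed(0)                      # assumed seed, for reproducibility
    random.shuffle(examples)
    pdev, ptrain = examples[:200], examples[200:]
    print(len(ptrain), "ptrain examples,", len(pdev), "pdev examples")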

(ii) Entity Detection and Linking using ELQ

This step can be skipped, as we have already included its outputs (misc/webqsp_train_elq-5_mid.json, outputs/webqsp_test_elq-5_mid.json).

The script and config for the ELQ model can be found in elq_linking/run_elq_linker.py. If you'd like to run entity linking yourself, please copy run_elq_linker.py into the ELQ model directory and run the script there.

(iii) Enumerating Logical Form Candidates

python enumerate_candidates.py --split test

This will write enumerated candidates to outputs/webqsp_test_candidates-ranking.jsonline.

(iv) Running Ranker

sh scripts/run_ranker.sh predict checkpoints/webqsp_bert_ranking test

This will write the candidate logits (the logit of each candidate for each example) to misc/webqsp_test_candidates_logits.bin, and the prediction results (in the original GrailQA prediction format) to misc/webqsp_test_ranker_results.txt.

(v) Running Generator

First, prepare the generation model inputs:

python make_generation_dataset.py --split test --logit_file misc/webqsp_test_candidate_logits.bin.

This will read the candidates, use the logits to select the top-k candidates, and write the generation model inputs to outputs/webqsp_test_gen.json.

Second, run the generation model to get the top-k predictions:

sh scripts/run_gen.sh predict checkpoints/webqsp_t5_generation test

This will generate top-k decoded logical forms stored at misc/webqsp_test_topk_generations.json.

(vi) Final Inference Steps

Given the decoded top-k predictions, we go down the list and execute the logical forms one by one until we find one that returns valid answers (the same procedure as for GrailQA above).

python eval_topk_prediction.py --split test --pred_file misc/webqsp_test_topk_generations.json

The prediction results will be written (in GrailQA prediction format) to misc/webqsp_test_final_results.txt.

You can then use the official WebQSP evaluation script (modified only in its I/O) to run evaluation:

python webqsp_evaluate.py outputs/WebQSP.test.json misc/webqsp_test_final_results.txt.


Training Models

We have already provided pretrained models ready for running inference. If you'd like to train your own models, please check out the README in the /WebQSP folder.

Citation

@misc{ye2021rngkbqa,
    title={RnG-KBQA: Generation Augmented Iterative Ranking for Knowledge Base Question Answering}, 
    author={Xi Ye and Semih Yavuz and Kazuma Hashimoto and Yingbo Zhou and Caiming Xiong},
    year={2021},
    eprint={2109.08678},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Questions?

For any questions, feel free to open an issue or reach out to the authors by email.

License

The code is released under BSD 3-Clause - see LICENSE for details.

Comments
  • Format of the ranking candidates file for GrailQA

    Hi. I am interested in the ranker part of this project. I am currently setting up the environment. However, looks like the previous steps could be time consuming. Can I get some quick information on the format of the output files for:

    python enumerate_candidates.py --split train # we use gt entity for training (so no need for prediction on training)
    python enumerate_candidates.py --split dev --pred_file misc/grail_dev_entity_linking.json
    

    Thanks!

    opened by zluw1117 10
  • About training the ranker

    https://github.com/salesforce/rng-kbqa/blob/2b6ef28e7724f11181f59589398894a1d0617455/GrailQA/scripts/run_ranker.sh#L45

    Hi, I would like to ask the batch size of the ranker (not entity disambiguation). In the paper, the batch size is 8. However, the script here is 1.

    Besides, when evaluating (predicting) the LF ranking, should the batch size be 1 according to BERTCandidateRanker: https://github.com/salesforce/rng-kbqa/blob/2b6ef28e7724f11181f59589398894a1d0617455/framework/models/BertRanker.py#L30

    But the script set the evaluation batch size to 128 as shown in, https://github.com/salesforce/rng-kbqa/blob/2b6ef28e7724f11181f59589398894a1d0617455/GrailQA/scripts/run_ranker.sh#L47

    Do these two numbers have different meanings? Would you please provide me with a clue? Thanks a lot.

    opened by yhshu 4
  • Question about WebQSP evaluation

    Hi!

    The WebQSP eval.py file will generate 2 F1 scores: "Average f1 over questions (accuracy)" and "F1 of average recall and average precision". May I know which one are you showing?

    Thanks.

    opened by PlusRoss 4
  • How to process the mids which can't convert to string name?

    Hi, When I follow this excellent work, I encounter some problems: I can't convert some mids to the string type name. I use your get_name script and get some triples output. But there is not any helpful string name information. How should I process this problem? Thanks!


    Here are some example of mids and their one-hop triples by executed search on the freebase:

    m.0gxnnwp: [['m.0gxnnwp', 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type', 'people.sibling_relationship'], ['m.0gxnnwp', 'type.object.type', 'people.sibling_relationship'], ['m.0gxnnwp', 'people.sibling_relationship.sibling', 'm.06w2sn5'], ['m.0gxnnwp', 'people.sibling_relationship.sibling', 'm.0gxnnwq']]

    m.0855mj_: [['m.0855mj_', 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type', 'film.performance'], ['m.0855mj_', 'type.object.type', 'film.performance'], ['m.0855mj_', 'film.performance.actor', 'm.09l3p'], ['m.0855mj_', 'film.performance.film', 'm.062zjtt'], ['m.0855mj_', 'film.performance.character', 'm.0dttll']]

    m.04g55p8: [['m.04g55p8', 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type', 'common.topic'], ['m.04g55p8', 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type', 'user.dfhuynh.default_domain.assassination'], ['m.04g55p8', 'type.object.type', 'common.topic'], ['m.04g55p8', 'type.object.type', 'user.dfhuynh.default_domain.assassination'], ['m.04g55p8', 'user.dfhuynh.default_domain.assassination.assassinated_person', 'm.0d3k14'], ['m.04g55p8', 'user.dfhuynh.default_domain.assassination.assassin', 'm.0bgl08'], ['m.04g55p8', 'user.dfhuynh.default_domain.assassination.date', '1960-12-11-08:00'], ['m.04g55p8', 'user.dfhuynh.default_domain.assassination.location', 'm.0rqf1'], ['m.04g55p8', 'user.dfhuynh.default_domain.assassination.method', 'm.04g56gm'], ['m.04g55p8', 'user.dfhuynh.default_domain.assassination.outcome', 'm.04g5679']]

    opened by JBoRu 4
  • How is the entity linking F1 calculated?

    Hi there, I've noticed that in the paper appendix there's an entity linking evaluation. I would like to ask how the linking F1 is calculated. Is the evaluation in this repo?

    Since there are some questions in GrailQA that have no gold entity, and it is definitely possible that the linking result is empty for a question, how does the F1 calculation handle these cases?

    Thanks for your reply.

    opened by yhshu 4
  • What the difference among generation target, target approximate, target full expression and target gt?

    Hi there, When I run into WebQSP generation, I'm confused about the difference between generation target, target approximate, target full expression and target gt. What's the difference among the four? Which one should I pick for generation data?

    Thanks. Yiheng

    opened by yhshu 4
  • Is there any release for query enumeration and ranking results?

    Hi there, I'm currently interested in reproducing the results of RnG-KBQA. The dev set is relatively small, so it worked fine for me. However, the training set might be too big for enumerating candidates, maybe even a few days for this single step? I wonder if there is a way to share the files for the enumeration of candidate queries on the training set, and maybe the ranking results?

    Thanks a lot.

    opened by yhshu 3
  • other dataset

    Hi, Congratulations on such interesting work. Existing research usually considers both the CWQ and WebQSP datasets. However, you only test on WebQSP in your paper. I want to know: (1) whether you have conducted experiments on CWQ to verify your method's performance; (2) if not, how to conduct experiments on this new dataset? I think it should be similar to WebQSP. Looking forward to your reply.

    opened by JBoRu 2
  • version of transformers

    Hi Xi,

    Thanks for open-sourcing this awesome work! May I know which version of transformers you installed? I saw pytorch-transformers==1.1.0 in requirements.txt, but I have a missing-package problem when importing transformers in run_ranker.py and run_generator.py. I tried transformers==3.4.0 and transformers==4.16.0, but they all have some problems.

    opened by yeliu918 2
  • Experiment Reproduction Help

    I deployed the environment as instructed. The result obtained when testing the demo script is only 0.72, which is lower than the expected 0.86. In addition, when I did the follow-up tests, the 32 GB of RAM on my computer did not seem to be enough: when I used the ranker on the dev data, the process was killed. Could you please give me some advice to reproduce the results?

    opened by Yepgang 2
  • Is it possible to provide results of entity linking?

    RnG-KBQA's entity linking is a further enhancement of the GrailQA implementation. Currently, entity linking is a relatively independent step, and significantly improving the effectiveness of entity linking is usually difficult. Would the authors be willing to publish the results of entity linking?

    This may be of great help for subsequent studies. Many thanks.

    opened by yhshu 2
  • urllib.error.URLError

        PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        PREFIX : <http://rdf.freebase.com/ns/> 
        SELECT (?x0 AS ?label) WHERE {
        SELECT DISTINCT ?x0  WHERE {
        :m.03_r3 rdfs:label ?x0 . 
                            FILTER (langMatches( lang(?x0), "EN" ) )
                             }
                             }
    

    Reading: 0%|

    except urllib.error.URLError:
        print(query)
        exit(0)
    

    After throwing this exception, the program will exit directly. These three websites seem to be inaccessible. What is the situation? Is there an alternative if it is not accessible?

    opened by FH-xk 4