GAP-text2SQL: Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training

Overview

Code and model from our AAAI 2021 paper

Updates

[2021/02/05] Added support for running the model on your own databases and queries. Check out the notebook.

Abstract

Most recently, there has been significant interest in learning contextual representations for various NLP tasks, by leveraging large-scale text corpora to train large neural language models with self-supervised learning objectives, such as Masked Language Model (MLM). However, based on a pilot study, we observe three issues with existing general-purpose language models when they are applied to text-to-SQL semantic parsers: they fail to detect column mentions in the utterances, fail to infer column mentions from cell values, and fail to compose complex SQL queries. To mitigate these issues, we present a model pre-training framework, Generation-Augmented Pre-training (GAP), that jointly learns representations of natural language utterances and table schemas by leveraging generation models to generate pre-training data. GAP MODEL is trained on 2M utterance-schema pairs and 30K utterance-schema-SQL triples, whose utterances are produced by generative models. Based on experimental results, neural semantic parsers that leverage GAP MODEL as a representation encoder obtain new state-of-the-art results on both SPIDER and CRITERIA-TO-SQL benchmarks.

Setup

conda create --name gap-text2sql python=3.7
source activate gap-text2sql
conda install pytorch=1.5 cudatoolkit=10.2 -c pytorch
pip install -r requirements.txt
python -c "import nltk; nltk.download('stopwords'); nltk.download('punkt')"

Download the dataset

pip install gdown
cd rat-sql-gap
gdown --id 1_AckYkinAnhqmRQtGsQgUKAnTHxxX5J0
unzip spider.zip
bash data/spider/generate.sh ./spider

Build dataset directory

mkdir data/spider-bart
cp ./spider/tables.json data/spider-bart/
cp ./spider/train_spider.json data/spider-bart/
cp ./spider/train_others.json data/spider-bart/
cp ./spider/dev.json data/spider-bart/
ln -s $(pwd)/spider/database data/spider-bart/database
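
A small sanity check (illustrative only, not part of the repository) confirms that the copied Spider JSON files parse and that the database symlink resolves:

import json
import os

base = "data/spider-bart"

# Each of these Spider files is a JSON array; print how many entries it holds.
for name in ["tables.json", "train_spider.json", "train_others.json", "dev.json"]:
    with open(os.path.join(base, name)) as f:
        print(name, "->", len(json.load(f)), "entries")

# The database directory was symlinked in the last step above.
print("database symlink resolves:", os.path.isdir(os.path.join(base, "database")))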

Download the Stanford CoreNLP library

mkdir third_party
wget http://nlp.stanford.edu/software/stanford-corenlp-full-2018-10-05.zip
unzip stanford-corenlp-full-2018-10-05.zip -d third_party/

Start the Stanford CoreNLP server

pushd third_party/stanford-corenlp-full-2018-10-05
nohup java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 8999 -timeout 15000 > server.log &
popd
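
To verify the server is listening (an optional check, assuming the port 8999 chosen in the command above), you can request the root URL, which CoreNLP answers with a small status page:

import urllib.request

try:
    # Port 8999 matches the -port flag passed to the server above.
    with urllib.request.urlopen("http://localhost:8999", timeout=10) as resp:
        print("CoreNLP server is up, HTTP status:", resp.status)
except OSError as exc:
    print("CoreNLP server not reachable yet:", exc)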

Download the checkpoint

mkdir -p logdir/bart_run_1/bs\=12\,lr\=1.0e-04\,bert_lr\=1.0e-05\,end_lr\=0e0\,att\=1/
mkdir ie_dirs
aws s3 cp s3://gap-text2sql-public/checkpoint-artifacts/gap-finetuned-checkpoint logdir/bart_run_1/bs\=12\,lr\=1.0e-04\,bert_lr\=1.0e-05\,end_lr\=0e0\,att\=1/model_checkpoint-00041000

mkdir -p pretrained_checkpoint
aws s3 cp s3://gap-text2sql-public/checkpoint-artifacts/pretrained-checkpoint pretrained_checkpoint/pytorch_model.bin

Alternatively, if you don't have awscli, you can download the gap-finetuned-checkpoint and pretrained-checkpoint files directly:

curl https://gap-text2sql-public.s3.amazonaws.com/checkpoint-artifacts/gap-finetuned-checkpoint -o logdir/bart_run_1/bs\=12\,lr\=1.0e-04\,bert_lr\=1.0e-05\,end_lr\=0e0\,att\=1/model_checkpoint-00041000
curl https://gap-text2sql-public.s3.amazonaws.com/checkpoint-artifacts/pretrained-checkpoint -o pretrained_checkpoint/pytorch_model.bin
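
Either way, an optional check (a sketch; it assumes both files are standard torch.save artifacts loadable on CPU) is to open them with torch.load:

import torch

# The GAP-pretrained BART weights: a state dict mapping parameter names to tensors.
pretrained = torch.load("pretrained_checkpoint/pytorch_model.bin", map_location="cpu")
print("pretrained checkpoint parameters:", len(pretrained))

# The fine-tuned parser checkpoint restored by the inference step below.
finetuned = torch.load(
    "logdir/bart_run_1/bs=12,lr=1.0e-04,bert_lr=1.0e-05,end_lr=0e0,att=1/model_checkpoint-00041000",
    map_location="cpu",
)
print("fine-tuned checkpoint keys:", list(finetuned.keys()))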

Preprocess dataset

python run.py preprocess experiments/spider-configs/gap-run.jsonnet

Inference

python run.py eval experiments/spider-configs/gap-run.jsonnet

You will then find the inference results and evaluation results at ie_dirs/bart_run_1_true_1-step41000.infer and ie_dirs/bart_run_1_true_1-step41000.eval.
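
If you want a quick look at the predictions, a minimal sketch is shown below; it assumes the .infer file is JSON Lines (one JSON object per example) and only prints the top-level keys of the first record, so it does not depend on the exact output schema:

import json

# Read the first prediction record from the inference output.
with open("ie_dirs/bart_run_1_true_1-step41000.infer") as f:
    first = json.loads(f.readline())

print(sorted(first.keys()))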

Training

python run.py train experiments/spider-configs/gap-run.jsonnet

Security

See CONTRIBUTING for more information.

License

This project is licensed under the Apache-2.0 License.

Comments
  • Error while running Inference

    Error while running Inference

    Hi,

    Thanks for open-sourcing the project. I was trying this on a non-GPU Windows 10 machine (conda environment, Python 3.7.9, PyTorch 1.5). I was able to run the Preprocess dataset step, but got the error below while running Inference:

    (envs) Lenovo-PC MINGW64 /d/NLP/NL_to_SQL/gap-text2sql/rat-sql-gap (main)
    $ python run.py eval experiments/spider-configs/gap-run.jsonnet
    WARNING <class 'seq2struct.models.enc_dec.EncDecModel.Preproc'>: superfluous {'name': 'EncDec'}
    WARNING <class 'seq2struct.models.enc_dec.EncDecModel'>: superfluous {'decoder_preproc': {'grammar': {'clause_order': None, 'end_with_from': True, 'factorize_sketch': 2, 'include_literals': False, 'infer_from_conditions': True, 'name': 'spider', 'output_from': True, 'use_table_pointer': True}, 'save_path': 'data/spider-bart/nl2code-1115,output_from=true,fs=2,emb=bart,cvlink', 'use_seq_elem_rules': True}, 'encoder_preproc': {'bart_version': 'facebook/bart-large', 'compute_cv_link': True, 'compute_sc_link': True, 'db_path': 'data/spider-bart/database', 'fix_issue_16_primary_keys': True, 'include_table_name_in_column': False, 'save_path': 'data/spider-bart/nl2code-1115,output_from=true,fs=2,emb=bart,cvlink'}}
    Parameter containing: tensor([[-0.0370, 0.1117, 0.1829, ..., 0.2054, 0.0578, -0.0750], [ 0.0055, -0.0049, -0.0069, ..., -0.0030, 0.0038, 0.0087], [-0.0448, 0.4604, -0.0604, ..., 0.1073, 0.0310, 0.0477], ..., [-0.0138, 0.0278, -0.0467, ..., 0.0455, -0.0265, 0.0125], [-0.0043, 0.0153, -0.0567, ..., 0.0496, 0.0108, -0.0099], [ 0.0053, 0.0324, -0.0179, ..., -0.0085, 0.0223, -0.0020]], requires_grad=True)
    Updated the model with ./pretrained_checkpoint\pytorch_model.bin
    Parameter containing: tensor([[-0.0383, 0.1205, 0.1776, ..., 0.1973, 0.0594, -0.0699], [ 0.0046, -0.0023, -0.0084, ..., -0.0036, 0.0047, 0.0084], [-0.0460, 0.4671, -0.0650, ..., 0.1027, 0.0256, 0.0475], ..., [ 0.0086, 0.0037, 0.0363, ..., -0.0296, -0.0097, -0.0068], [-0.0160, 0.0123, 0.0015, ..., 0.0040, 0.0185, 0.0038], [-0.0049, -0.0121, -0.0235, ..., 0.0200, 0.0148, -0.0020]], requires_grad=True)
    Loading model from logdir/bart_run_1\bs=12,lr=1.0e-04,bert_lr=1.0e-05,end_lr=0e0,att=1\model_checkpoint-00041000
    Traceback (most recent call last):
      File "run.py", line 104, in <module>
        main()
      File "run.py", line 83, in main
        infer.main(infer_config)
      File "D:\NLP\NL_to_SQL\gap-text2sql\rat-sql-gap\seq2struct\commands\infer.py", line 215, in main
        model = inferer.load_model(args.logdir, args.step)
      File "D:\NLP\NL_to_SQL\gap-text2sql\rat-sql-gap\seq2struct\commands\infer.py", line 48, in load_model
        last_step = saver.restore(logdir, step=step, map_location=self.device, item_keys=["model"])
      File "D:\NLP\NL_to_SQL\gap-text2sql\rat-sql-gap\seq2struct\utils\saver.py", line 122, in restore
        items2restore, model_dir, map_location, step)
      File "D:\NLP\NL_to_SQL\gap-text2sql\rat-sql-gap\seq2struct\utils\saver.py", line 40, in load_checkpoint
        item_dict[item_name].load_state_dict(checkpoint[item_name])
      File "D:\NLP\NL_to_SQL\gap-text2sql\envs\lib\site-packages\torch\nn\modules\module.py", line 847, in load_state_dict
        self.__class__.__name__, "\n\t".join(error_msgs)))
    RuntimeError: Error(s) in loading state_dict for EncDecModel:
      size mismatch for decoder.rule_logits.2.weight: copying a param with shape torch.Size([97, 128]) from checkpoint, the shape in current model is torch.Size([94, 128]).
      size mismatch for decoder.rule_logits.2.bias: copying a param with shape torch.Size([97]) from checkpoint, the shape in current model is torch.Size([94]).
      size mismatch for decoder.rule_embedding.weight: copying a param with shape torch.Size([97, 128]) from checkpoint, the shape in current model is torch.Size([94, 128]).

    Can you guide me on where I need to make changes?

    opened by TheurgicDuke771 6
  • Cannot access model checkpoint

    Cannot access model checkpoint

    Hi, I am trying to learn and test your work and am strictly following your instructions, but I cannot access the model checkpoint from the AWS S3 bucket. Is your project closed now, is the checkpoint bucket not public, or am I having some other connectivity issue? Running the following command:

    aws s3 cp s3://gap-text2sql-public/checkpoint-artifacts/gap-finetuned-checkpoint logdir/bart_run_1/bs\=12\,lr\=1.0e-04\,bert_lr\=1.0e-05\,end_lr\=0e0\,att\=1/model_checkpoint-00041000

    Getting this error:

    fatal error: An error occurred (403) when calling the HeadObject operation: Forbidden

    Would be grateful for feedback. Regards

    opened by romapavelko01 2
  • Issue while loading the model

    Issue while loading the model

    Hey @Impavidity @TheurgicDuke771, I'm facing the same issue. I don't think there is any problem with the (Spider) data I'm using.

    Loading model from logdir/bart_run_1/bs=12,lr=1.0e-04,bert_lr=1.0e-05,end_lr=0e0,att=1/model_checkpoint-00041000
    
    ---------------------------------------------------------------------------
    
    KeyError                                  Traceback (most recent call last)
    
    <ipython-input-45-2534e992a832> in <module>()
    ----> 1 model = inferer.load_model(model_dir, checkpoint_step)
    
    6 frames
    
    /content/gap-text2sql/rat-sql-gap/seq2struct/models/variational_lstm.py in _hook_remove_dropout_masks_from_state_dict(cls, instance, state_dict, prefix, local_metadata)
         75     @classmethod
         76     def _hook_remove_dropout_masks_from_state_dict(cls, instance, state_dict, prefix, local_metadata):
    ---> 77         del state_dict[prefix + '_input_dropout_mask']
         78         del state_dict[prefix + '_h_dropout_mask']
         79 
    
    KeyError: 'decoder.state_update._input_dropout_mask'
    

    The folder structure can be seen in the attached screenshot (spider_data_issue).

    Please guide me to tackle this issue. Thanks in advance.

    opened by ujjawalcse 2
  • Why does overfitting the training set yield better scores?

    Why does overfitting the training set yield better scores?

    I've tried to retrain your model and managed to get the same scores as you report. What I don't understand is:

    • there is no early stopping
    • the model used is the last saved checkpoint, at which point the model has heavily overfit the training set (training loss very close to 0)

    However, I tried to get scores from another checkpoint with a lower validation loss, and it yielded worse EXACT match scores. So why does overfitting the training set yield better scores?

    opened by JoaoLages 2
  • Problems in downloading the dataset

    Problems in downloading the dataset

    Hello,

    I am going through your setup instructions in the README, and for downloading the dataset there's this command: bash data/spider-20190205/generate.sh ./spider, which I believe is meant to be run from the rat-sql-gap dir. However, there is no directory containing this generation script, and I cannot find the script anywhere.

    Please let me know if I am missing something obvious.

    Thanks!

    opened by hclent 2
  • I can't find AverageSpanExtractor module :(

    I can't find AverageSpanExtractor module :(

    Hi @Impavidity @pnpnpn, I'm referring to your code for my text-to-SQL study. To modify the GAP encoder part, I tuned relogic/tabart-pretraining.py. Executing that code requires the AverageSpanExtractor module, but I can't find it at relogic.logickit.modules.span_extractors.average_span_extractor. Is the module missing from GitHub?

    Best regards!

    opened by minu4242 1
  • Why are the EXEC scores always zero?

    Why are the EXEC scores always zero?

    For every model I train, including the pretrained model, the EXEC scores are zero. Does this mean that none of the generated queries matches the gold SQL query when run to produce an output table?

    opened by JoaoLages 1
  • support inference on own databases and queries (#5)

    support inference on own databases and queries (#5)

    Resolves issue #5 on supporting inference on your own databases and queries.

    • Add the related dependencies for supporting this functionality.
    • Create an easy-to-use notebook.
    • Add a database example.
    • Update readme.

    By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

    opened by Impavidity 0
  • Add data preprocess script

    Add data preprocess script

    Resolves issue #1 on the missing data preprocessing script.

    • Add the data preprocessing script and its dependencies.
    • Clean up unused dependencies and configuration file.

    By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

    opened by Impavidity 0
  • ON clause is missing in inference output

    ON clause is missing in inference output

    Hi,

    I am facing an issue with identifying the ON clause in the output:

    Question: List the pilots are from london
    Result: SELECT pilot.Name FROM pilot JOIN match WHERE match.Location = 'terminal'

    Is there any way to get the ON clauses in the generated queries?

    Thanks in advance ​

    opened by b4zyuvaraj 2
  • What are all the steps we need to follow, for our own database

    What are all the steps we need to follow, for our own database

    I am new to this and would like to test text-to-SQL with my own database.

    Could anyone share the steps to set up a new database and run questions against it?

    Thanks in advance.

    opened by b4zyuvaraj 0
  • Can anyone help out in figuring out what is "terminal" here?

    Can anyone help out in figuring out what is "terminal" here?

    While executing the natural language query "Birth year of Gina Rinehart", we got the following SQL query output. The output SQL query is okay, but wherever the query should contain a literal value (in this case, 'Gina Rinehart'), we get "terminal" instead.

    Natural language query (from Spider's singer database): Birth year of Gina Rinehart
    Corresponding output SQL query: SELECT singer.Birth_Year FROM singer WHERE singer.Name = 'terminal'

    opened by surajjkumar 1
  • The number of grammar rules (94) is inconsistent with the pre-trained model (97)

    The number of grammar rules (94) is inconsistent with the pre-trained model (97)

    I find that the number of grammar rules is 94 when I follow the instructions to preprocess the data (with the hyper-parameter fs=2), but the size of rule_embedding in the pre-trained model is 97.

    Traceback (most recent call last):
      File "run.py", line 104, in <module>
        main()
      File "run.py", line 83, in main
        infer.main(infer_config)
      File "/home/yuhao/zhenwen/repair_model/gap-text2sql-main/rat-sql-gap/seq2struct/commands/infer.py", line 239, in main
        model = inferer.load_model(args.logdir, args.step)
      File "/home/yuhao/zhenwen/repair_model/gap-text2sql-main/rat-sql-gap/seq2struct/commands/infer.py", line 48, in load_model
        last_step = saver.restore(logdir, step=step, map_location=self.device, item_keys=["model"])
      File "/home/yuhao/zhenwen/repair_model/gap-text2sql-main/rat-sql-gap/seq2struct/utils/saver.py", line 122, in restore
        items2restore, model_dir, map_location, step)
      File "/home/yuhao/zhenwen/repair_model/gap-text2sql-main/rat-sql-gap/seq2struct/utils/saver.py", line 40, in load_checkpoint
        item_dict[item_name].load_state_dict(checkpoint[item_name])
      File "/home/yuhao/.conda/envs/zhenwen/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1407, in load_state_dict
        self.__class__.__name__, "\n\t".join(error_msgs)))
    RuntimeError: Error(s) in loading state_dict for EncDecModel:
      Unexpected key(s) in state_dict: "decoder.state_update._input_dropout_mask", "decoder.state_update._h_dropout_mask".
      size mismatch for decoder.rule_logits.2.weight: copying a param with shape torch.Size([97, 128]) from checkpoint, the shape in current model is torch.Size([94, 128]).
      size mismatch for decoder.rule_logits.2.bias: copying a param with shape torch.Size([97]) from checkpoint, the shape in current model is torch.Size([94]).
      size mismatch for decoder.rule_embedding.weight: copying a param with shape torch.Size([97, 128]) from checkpoint, the shape in current model is torch.Size([94, 128])

    opened by lzw-pku 1
  • Can we get the pretraining data?

    Can we get the pretraining data?

    Can you please let me know how to download the additional 30K (utterance, schema, SQL) triples that were used for pre-training?

    Thanks in advance!

    opened by dennissm 1
Owner
Amazon Web Services - Labs (AWS Labs)