GAP-text2SQL: Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training

Amazon Web Services - Labs

Last update: Jan 9, 2023

Related tags

Text Data & NLP nlp machine-learning deep-learning nlu text-generation pytorch pretrained-models language-model semantic-parsing text2sql

Overview

Code and model from our AAAI 2021 paper

Updates

[2020/02/05] Added support for running the model on your own databases and queries. Check out the notebook.

Abstract

Most recently, there has been significant interest in learning contextual representations for various NLP tasks, by leveraging large-scale text corpora to train large neural language models with self-supervised learning objectives, such as Masked Language Model (MLM). However, based on a pilot study, we observe three issues with existing general-purpose language models when they are applied to text-to-SQL semantic parsers: they fail to detect column mentions in the utterances, fail to infer column mentions from cell values, and fail to compose complex SQL queries. To mitigate these issues, we present a model pre-training framework, Generation-Augmented Pre-training (GAP), that jointly learns representations of natural language utterances and table schemas by leveraging generation models to generate pre-training data. GAP MODEL is trained on 2M utterance-schema pairs and 30K utterance-schema-SQL triples, whose utterances are produced by generative models. Based on experimental results, neural semantic parsers that leverage GAP MODEL as a representation encoder obtain new state-of-the-art results on both SPIDER and CRITERIA-TO-SQL benchmarks.

Setup

conda create --name gap-text2sql python=3.7
source activate gap-text2sql
conda install pytorch=1.5 cudatoolkit=10.2 -c pytorch
pip install -r requirements.txt
python -c "import nltk; nltk.download('stopwords'); nltk.download('punkt')"

Download the dataset

pip install gdown
cd rat-sql-gap
gdown --id 1_AckYkinAnhqmRQtGsQgUKAnTHxxX5J0
unzip spider.zip
bash data/spider/generate.sh ./spider

Build dataset directory

mkdir data/spider-bart
cp ./spider/tables.json data/spider-bart/
cp ./spider/train_spider.json data/spider-bart/
cp ./spider/train_others.json data/spider-bart/
cp ./spider/dev.json data/spider-bart/
ln -s $(pwd)/spider/database data/spider-bart/database

Download the library

mkdir third_party
wget http://nlp.stanford.edu/software/stanford-corenlp-full-2018-10-05.zip
unzip stanford-corenlp-full-2018-10-05.zip -d third_party/

Start the Stanford library

pushd third_party/stanford-corenlp-full-2018-10-05
nohup java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 8999 -timeout 15000 > server.log &
popd
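
Because the server is started in the background with nohup, it can be useful to confirm that it is actually listening before running preprocessing. The following is a minimal sketch, not part of the repository: the helper name is_port_open and the host "localhost" are assumptions, while port 8999 is the one passed to StanfordCoreNLPServer above. It only checks that a TCP connection can be opened, not that the server answers annotation requests correctly.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host not resolvable.
        return False


if __name__ == "__main__":
    # 8999 matches the -port argument used above; "localhost" is an assumption.
    print("CoreNLP server reachable:", is_port_open("localhost", 8999))
```

If this prints False, server.log in the CoreNLP directory is the first place to look.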

Download the checkpoint

mkdir -p logdir/bart_run_1/bs\=12\,lr\=1.0e-04\,bert_lr\=1.0e-05\,end_lr\=0e0\,att\=1/
mkdir ie_dirs
aws s3 cp s3://gap-text2sql-public/checkpoint-artifacts/gap-finetuned-checkpoint logdir/bart_run_1/bs\=12\,lr\=1.0e-04\,bert_lr\=1.0e-05\,end_lr\=0e0\,att\=1/model_checkpoint-00041000

mkdir -p pretrained_checkpoint
aws s3 cp s3://gap-text2sql-public/checkpoint-artifacts/pretrained-checkpoint pretrained_checkpoint/pytorch_model.bin

Alternatively, if you don't have awscli, you can download gap-finetuned-checkpoint and pretrained-checkpoint directly with curl:

curl https://gap-text2sql-public.s3.amazonaws.com/checkpoint-artifacts/gap-finetuned-checkpoint -o logdir/bart_run_1/bs\=12\,lr\=1.0e-04\,bert_lr\=1.0e-05\,end_lr\=0e0\,att\=1/model_checkpoint-00041000
curl https://gap-text2sql-public.s3.amazonaws.com/checkpoint-artifacts/pretrained-checkpoint -o pretrained_checkpoint/pytorch_model.bin
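
The backslashes in the commands above are only shell escapes; the directory on disk is literally named bs=12,lr=1.0e-04,bert_lr=1.0e-05,end_lr=0e0,att=1. A small sketch like the one below can confirm that both checkpoint files ended up where the commands put them. The helper missing_files and the constant names are hypothetical, and the script assumes it is run from the rat-sql-gap directory.

```python
from pathlib import Path

# Directory name as it appears on disk (the shell commands escape "=" and ",").
RUN_DIR = Path("logdir/bart_run_1") / "bs=12,lr=1.0e-04,bert_lr=1.0e-05,end_lr=0e0,att=1"
FINETUNED_CHECKPOINT = RUN_DIR / "model_checkpoint-00041000"
PRETRAINED_CHECKPOINT = Path("pretrained_checkpoint") / "pytorch_model.bin"


def missing_files(root: Path, files=(FINETUNED_CHECKPOINT, PRETRAINED_CHECKPOINT)):
    """Return the relative paths from `files` that do not exist as files under `root`."""
    return [f for f in files if not (root / f).is_file()]


if __name__ == "__main__":
    # Assumes the current working directory is rat-sql-gap.
    for path in missing_files(Path(".")):
        print("missing:", path)
```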

Preprocess dataset

python run.py preprocess experiments/spider-configs/gap-run.jsonnet

Inference

python run.py eval experiments/spider-configs/gap-run.jsonnet

You then get the inference results and the evaluation results at ie_dirs/bart_run_1_true_1-step41000.infer and ie_dirs/bart_run_1_true_1-step41000.eval, respectively.

Training

python run.py train experiments/spider-configs/gap-run.jsonnet

Security

See CONTRIBUTING for more information.

License

This project is licensed under the Apache-2.0 License.

Comments

Error while running Inference

Hi,

Thanks for open-sourcing the project. I was trying this on a non-GPU Windows 10 machine (conda environment, Python 3.7.9, PyTorch 1.5). I was able to run the Preprocess dataset step, but got the error below while running Inference:

(envs) Lenovo-PC MINGW64 /d/NLP/NL_to_SQL/gap-text2sql/rat-sql-gap (main)
$ python run.py eval experiments/spider-configs/gap-run.jsonnet
WARNING <class 'seq2struct.models.enc_dec.EncDecModel.Preproc'>: superfluous {'name': 'EncDec'}
WARNING <class 'seq2struct.models.enc_dec.EncDecModel'>: superfluous {'decoder_preproc': {'grammar': {'clause_order': None, 'end_with_from': True, 'factorize_sketch': 2, 'include_literals': False, 'infer_from_conditions': True, 'name': 'spider', 'output_from': True, 'use_table_pointer': True}, 'save_path': 'data/spider-bart/nl2code-1115,output_from=true,fs=2,emb=bart,cvlink', 'use_seq_elem_rules': True}, 'encoder_preproc': {'bart_version': 'facebook/bart-large', 'compute_cv_link': True, 'compute_sc_link': True, 'db_path': 'data/spider-bart/database', 'fix_issue_16_primary_keys': True, 'include_table_name_in_column': False, 'save_path': 'data/spider-bart/nl2code-1115,output_from=true,fs=2,emb=bart,cvlink'}}
Parameter containing:
tensor([[-0.0370, 0.1117, 0.1829, ..., 0.2054, 0.0578, -0.0750],
        [ 0.0055, -0.0049, -0.0069, ..., -0.0030, 0.0038, 0.0087],
        [-0.0448, 0.4604, -0.0604, ..., 0.1073, 0.0310, 0.0477],
        ...,
        [-0.0138, 0.0278, -0.0467, ..., 0.0455, -0.0265, 0.0125],
        [-0.0043, 0.0153, -0.0567, ..., 0.0496, 0.0108, -0.0099],
        [ 0.0053, 0.0324, -0.0179, ..., -0.0085, 0.0223, -0.0020]], requires_grad=True)
Updated the model with ./pretrained_checkpoint\pytorch_model.bin
Parameter containing:
tensor([[-0.0383, 0.1205, 0.1776, ..., 0.1973, 0.0594, -0.0699],
        [ 0.0046, -0.0023, -0.0084, ..., -0.0036, 0.0047, 0.0084],
        [-0.0460, 0.4671, -0.0650, ..., 0.1027, 0.0256, 0.0475],
        ...,
        [ 0.0086, 0.0037, 0.0363, ..., -0.0296, -0.0097, -0.0068],
        [-0.0160, 0.0123, 0.0015, ..., 0.0040, 0.0185, 0.0038],
        [-0.0049, -0.0121, -0.0235, ..., 0.0200, 0.0148, -0.0020]], requires_grad=True)
Loading model from logdir/bart_run_1\bs=12,lr=1.0e-04,bert_lr=1.0e-05,end_lr=0e0,att=1\model_checkpoint-00041000
Traceback (most recent call last):
  File "run.py", line 104, in <module>
    main()
  File "run.py", line 83, in main
    infer.main(infer_config)
  File "D:\NLP\NL_to_SQL\gap-text2sql\rat-sql-gap\seq2struct\commands\infer.py", line 215, in main
    model = inferer.load_model(args.logdir, args.step)
  File "D:\NLP\NL_to_SQL\gap-text2sql\rat-sql-gap\seq2struct\commands\infer.py", line 48, in load_model
    last_step = saver.restore(logdir, step=step, map_location=self.device, item_keys=["model"])
  File "D:\NLP\NL_to_SQL\gap-text2sql\rat-sql-gap\seq2struct\utils\saver.py", line 122, in restore
    items2restore, model_dir, map_location, step)
  File "D:\NLP\NL_to_SQL\gap-text2sql\rat-sql-gap\seq2struct\utils\saver.py", line 40, in load_checkpoint
    item_dict[item_name].load_state_dict(checkpoint[item_name])
  File "D:\NLP\NL_to_SQL\gap-text2sql\envs\lib\site-packages\torch\nn\modules\module.py", line 847, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for EncDecModel:
    size mismatch for decoder.rule_logits.2.weight: copying a param with shape torch.Size([97, 128]) from checkpoint, the shape in current model is torch.Size([94, 128]).
    size mismatch for decoder.rule_logits.2.bias: copying a param with shape torch.Size([97]) from checkpoint, the shape in current model is torch.Size([94]).
    size mismatch for decoder.rule_embedding.weight: copying a param with shape torch.Size([97, 128]) from checkpoint, the shape in current model is torch.Size([94, 128]).
(envs) Lenovo-PC MINGW64 /d/NLP/NL_to_SQL/gap-text2sql/rat-sql-gap (main)
$

Can you guide me on where I need to make changes?

opened by TheurgicDuke771 6
Cannot access model checkpoint

Hi, I am trying to learn from and test your work, strictly following your instructions, but I cannot access the model checkpoint in the AWS S3 bucket. Is the project closed now, is the checkpoint bucket no longer public, or am I having some other connectivity issue? I am running the following command:

aws s3 cp s3://gap-text2sql-public/checkpoint-artifacts/gap-finetuned-checkpoint logdir/bart_run_1/bs\=12\,lr\=1.0e-04\,bert_lr\=1.0e-05\,end_lr\=0e0\,att\=1/model_checkpoint-00041000

and getting this error:

fatal error: An error occurred (403) when calling the HeadObject operation: Forbidden

I would be grateful for feedback.

Regards

opened by romapavelko01 2

Issue while loading the model

Hey @Impavidity @TheurgicDuke771, I'm facing the same issue. I don't think there is any problem with the data (Spider) I'm using.

Loading model from logdir/bart_run_1/bs=12,lr=1.0e-04,bert_lr=1.0e-05,end_lr=0e0,att=1/model_checkpoint-00041000

---------------------------------------------------------------------------

KeyError                                  Traceback (most recent call last)

<ipython-input-45-2534e992a832> in <module>()
----> 1 model = inferer.load_model(model_dir, checkpoint_step)

6 frames

/content/gap-text2sql/rat-sql-gap/seq2struct/models/variational_lstm.py in _hook_remove_dropout_masks_from_state_dict(cls, instance, state_dict, prefix, local_metadata)
     75     @classmethod
     76     def _hook_remove_dropout_masks_from_state_dict(cls, instance, state_dict, prefix, local_metadata):
---> 77         del state_dict[prefix + '_input_dropout_mask']
     78         del state_dict[prefix + '_h_dropout_mask']
     79 

KeyError: 'decoder.state_update._input_dropout_mask'

The folder structure is shown in the attached screenshot (spider_data_issue).

Please guide me to tackle this issue. Thanks in advance.

opened by ujjawalcse 2
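
The KeyError in the traceback above comes from the plain `del state_dict[...]` statement, which raises whenever the key is absent. The sketch below illustrates the difference between that and a tolerant removal, using an ordinary dict in place of a real PyTorch state dict. The helper drop_keys is hypothetical and not part of seq2struct, and whether silently skipping the missing dropout-mask keys is the right fix for this model is an assumption the issue thread does not confirm.

```python
def drop_keys(state_dict, prefix, names):
    """Delete prefix+name from state_dict for each name that is present.

    Returns the list of keys that were actually removed; absent keys are
    skipped instead of raising KeyError.
    """
    removed = []
    for name in names:
        key = prefix + name
        if key in state_dict:
            del state_dict[key]
            removed.append(key)
    return removed


# Stand-in for a state dict that lacks the dropout-mask entries
# (the key "decoder.state_update.weight" is made up for illustration).
state = {"decoder.state_update.weight": 0.0}
masks = ["_input_dropout_mask", "_h_dropout_mask"]

# With plain `del`, the first missing key would raise KeyError as in the issue.
print(drop_keys(state, "decoder.state_update.", masks))
```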

Why overfitting train yields better scores?
I've tried to retrain your model and managed to get the same scores as you report. What I don't understand is:

there is no early stopping

the model used is the last saved checkpoint, by which point the model is heavily overfit on the training set (training loss very close to 0)

However, when I evaluated another checkpoint with a lower validation loss, it yielded worse EXACT match scores. So why does overfitting the training set yield better scores?
opened by JoaoLages 2
Problems in downloading the dataset

Hello,

I am going through the setup instructions in the README, and the section on downloading the dataset contains this command: bash data/spider-20190205/generate.sh ./spider. I believe it is meant to be run from the rat-sql-gap directory. However, there is no directory containing this generation script, and I cannot find the script anywhere.

Please let me know if I am missing something obvious.

Thanks!

opened by hclent 2
I can't find AverageSpanExtractor module :(

Hi @Impavidity @pnpnpn, I'm referring to your code for my text-to-SQL study. To modify the GAP encoder part, I edited relogic/tabart-pretraining.py. Running that code requires the AverageSpanExtractor module, but I can't find it at relogic.logickit.modules.span_extractors.average_span_extractor. Is the module missing from GitHub?

Best regards!

opened by minu4242 1
Why are the EXEC scores always zero?

For every model I train, including the pretrained model, the EXEC scores are zero. Does this mean that none of the generated queries produces the same output table as the gold SQL query when executed?

opened by JoaoLages 1
support inference on own databases and queries (#5)
Resolves issue #5 by adding support for running inference on your own databases and queries.

Add related dependency for supporting the functionality.

Create an easy-to-use notebook.

Add a database example.

Update readme.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
opened by Impavidity 0
Add data preprocess script
Resolves issue #1 about the missing data preprocessing script.

Add data preprocess script and its dependency.

Clean up unused dependencies and configuration file.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
opened by Impavidity 0
On clause are missing in inference output

Hi,

I am having trouble getting the ON clause in the output:

Question: List the pilots are from london
Result: SELECT pilot.Name FROM pilot JOIN match WHERE match.Location = 'terminal'

Is there any way to recover the ON clauses?

Thanks in advance

opened by b4zyuvaraj 2
What are all the steps we need to follow, for our own database

I am new to this and would like to test text-to-SQL with my own database.

Could anyone share the steps needed to convert a new database and run questions against it?

Thanks in advance.

opened by b4zyuvaraj 0
Can anyone help out in figuring out what is "terminal" here?

When executing the natural language query "Birth year of Gina Rinehart", we got the SQL query shown below. The structure of the query is fine, but wherever a literal value should appear (in this case 'Gina Rinehart'), we get 'terminal' instead.

Natural language query against Spider's singer database: Birth year of Gina Rinehart
Output SQL query: SELECT singer.Birth_Year FROM singer WHERE singer.Name = 'terminal'

opened by surajjkumar 1
The number of grammar rules (94) is inconsistent with the pre-trained model (97)

I find that the number of the grammar rules is 94 when I follow the instructions to preprocess the data (with the hyper-parameter fs=2). But the size of rule_embedding in the pre-trained model is 97.

Traceback (most recent call last):
  File "run.py", line 104, in <module>
    main()
  File "run.py", line 83, in main
    infer.main(infer_config)
  File "/home/yuhao/zhenwen/repair_model/gap-text2sql-main/rat-sql-gap/seq2struct/commands/infer.py", line 239, in main
    model = inferer.load_model(args.logdir, args.step)
  File "/home/yuhao/zhenwen/repair_model/gap-text2sql-main/rat-sql-gap/seq2struct/commands/infer.py", line 48, in load_model
    last_step = saver.restore(logdir, step=step, map_location=self.device, item_keys=["model"])
  File "/home/yuhao/zhenwen/repair_model/gap-text2sql-main/rat-sql-gap/seq2struct/utils/saver.py", line 122, in restore
    items2restore, model_dir, map_location, step)
  File "/home/yuhao/zhenwen/repair_model/gap-text2sql-main/rat-sql-gap/seq2struct/utils/saver.py", line 40, in load_checkpoint
    item_dict[item_name].load_state_dict(checkpoint[item_name])
  File "/home/yuhao/.conda/envs/zhenwen/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1407, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for EncDecModel:
    Unexpected key(s) in state_dict: "decoder.state_update._input_dropout_mask", "decoder.state_update._h_dropout_mask".
    size mismatch for decoder.rule_logits.2.weight: copying a param with shape torch.Size([97, 128]) from checkpoint, the shape in current model is torch.Size([94, 128]).
    size mismatch for decoder.rule_logits.2.bias: copying a param with shape torch.Size([97]) from checkpoint, the shape in current model is torch.Size([94]).
    size mismatch for decoder.rule_embedding.weight: copying a param with shape torch.Size([97, 128]) from checkpoint, the shape in current model is torch.Size([94, 128])

opened by lzw-pku 1
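
To see at a glance which parameters disagree between a checkpoint and a freshly built model, the shapes can be compared by name. The sketch below is a minimal illustration and not part of the repository: shape_mismatches is a hypothetical helper, and the shapes are copied by hand from the error message above. With PyTorch, the same kind of dictionary could presumably be built with {k: tuple(v.shape) for k, v in state_dict.items()}. The sketch only reports the mismatch; it does not explain why preprocessing produced 94 grammar rules instead of 97.

```python
def shape_mismatches(checkpoint_shapes, model_shapes):
    """Return {name: (checkpoint_shape, model_shape)} for parameters that
    exist on both sides but have different shapes."""
    return {
        name: (tuple(ckpt_shape), tuple(model_shapes[name]))
        for name, ckpt_shape in checkpoint_shapes.items()
        if name in model_shapes and tuple(ckpt_shape) != tuple(model_shapes[name])
    }


# Shapes taken from the RuntimeError reported in this issue.
checkpoint_shapes = {
    "decoder.rule_logits.2.weight": (97, 128),
    "decoder.rule_logits.2.bias": (97,),
    "decoder.rule_embedding.weight": (97, 128),
}
model_shapes = {
    "decoder.rule_logits.2.weight": (94, 128),
    "decoder.rule_logits.2.bias": (94,),
    "decoder.rule_embedding.weight": (94, 128),
}

for name, (ckpt, current) in shape_mismatches(checkpoint_shapes, model_shapes).items():
    print(f"{name}: checkpoint {ckpt} vs current model {current}")
```

All three mismatched parameters differ only in their first dimension (97 in the checkpoint, 94 in the current model), which matches the grammar rule counts stated in the issue.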
Can we get the pretraining data?

Can you please let me know how to download the additional 30K data [(Utterance, Schema, SQL) triples] that was used for pretraining?

Thanks in advance!

opened by dennissm 1

Owner

Amazon Web Services - Labs

AWS Labs

GitHub https://arxiv.org/abs/2012.10309

Implementation of our ACL 2022 paper Bridging the Data Gap between Training and Inference for Unsupervised Neural Machine Translation

Bridging the Data Gap between Training and Inference for Unsupervised Neural Machine Translation This is the implementation of our paper: Bridging the

20 Dec 12, 2022

Code for papers "Generation-Augmented Retrieval for Open-Domain Question Answering" and "Reader-Guided Passage Reranking for Open-Domain Question Answering", ACL 2021

This repo provides the code of the following papers: (GAR) "Generation-Augmented Retrieval for Open-domain Question Answering", ACL 2021 (RIDER) "Read

49 Dec 26, 2022

Code to use Augmented Shapiro Wilks Stopping, as well as code for the paper "Statistically Significant Stopping of Neural Network Training"

This codebase is being actively maintained, please create an issue if you have issues using it Basics All data files are included under losses and ea

32 Nov 9, 2021

Official code of our work, Unified Pre-training for Program Understanding and Generation [NAACL 2021].

PLBART Code pre-release of our work, Unified Pre-training for Program Understanding and Generation accepted at NAACL 2021. Note. A detailed documentat

138 Dec 30, 2022

MASS: Masked Sequence to Sequence Pre-training for Language Generation

1.1k Dec 17, 2022

TweebankNLP - Pre-trained Tweet NLP Pipeline (NER, tokenization, lemmatization, POS tagging, dependency parsing) + Models + Tweebank-NER

TweebankNLP This repo contains the new Tweebank-NER dataset and Twitter-Stanza p

84 Dec 20, 2022

Universal End2End Training Platform, including pre-training, classification tasks, machine translation, and etc.

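Background. Installation. Quick start: (1) pre-trained models, (2) machine translation, (3) text classification. TenTrans advanced: 1. multilingual machine translation 2. cross-lingual pre-training. Background: TenTrans is a unified end-to-end multilingual, multi-task pre-training platform that supports multiple pre-training methods as well as sequence generation and natural language understanding tasks. Installation: git clone git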

Tencent Minority-Mandarin Translation Team

42 Dec 20, 2022

:hot_pepper: R²SQL: "Dynamic Hybrid Relation Network for Cross-Domain Context-Dependent Semantic Parsing." (AAAI 2021)

R²SQL The PyTorch implementation of paper Dynamic Hybrid Relation Network for Cross-Domain Context-Dependent Semantic Parsing. (AAAI 2021) Requirement

60 Dec 31, 2022

PIZZA - a task-oriented semantic parsing dataset

The PIZZA dataset continues the exploration of task-oriented parsing by introducing a new dataset for parsing pizza and drink orders, whose semantics cannot be captured by flat slots and intents.

17 Dec 14, 2022

Implementation of Memorizing Transformers (ICLR 2022), attention net augmented with indexing and retrieval of memories using approximate nearest neighbors, in Pytorch

Memorizing Transformers - Pytorch Implementation of Memorizing Transformers (ICLR 2022), attention net augmented with indexing and retrieval of memori

364 Jan 6, 2023

TaCL: Improve BERT Pre-training with Token-aware Contrastive Learning

Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning

Pre-training BERT masked language models with custom vocabulary

CCQA A New Web-Scale Question Answering Dataset for Model Pre-Training

iBOT: Image BERT Pre-Training with Online Tokenizer

Princeton NLP's pre-training library based on fairseq with DeepSpeed kernel integration 🚃

SIGIR'22 paper: Axiomatically Regularized Pre-training for Ad hoc Search

Beyond Masking: Demystifying Token-Based Pre-Training for Vision Transformers

Semi-automated vocabulary generation from semantic vector models