Weakly Supervised Text-to-SQL Parsing through Question Decomposition
The official repository for the paper "Weakly Supervised Text-to-SQL Parsing through Question Decomposition" by Tomer Wolfson, Daniel Deutch and Jonathan Berant, accepted to the Findings of NAACL 2022.
This repository contains the code and data used in our paper:
- Code for automatically synthesizing SQL queries from question decompositions + answers
- Code for the text-to-SQL and text-to-QDMR models used in our paper
Setup
- Create the virtual environment
conda create -n [ENV_NAME] python=3.8
conda activate [ENV_NAME]
- Clone the repository
git clone https://github.com/tomerwolgithub/question-decomposition-to-sql
cd question-decomposition-to-sql
- Install the relevant requirements
pip install -r requirements.txt
python -m spacy download en_core_web_lg
- To train the QDMR parser model, please set up a separate environment (due to different Hugging Face library versions):
conda create -n qdmr_parser_env python=3.8
conda activate qdmr_parser_env
pip install -r requirements_qdmr_parser.txt
python -m spacy download en_core_web_lg
Download Resources
1. QDMR Parsing Datasets:
- Academic, GeoQuery and Spider gold QDMRs from the Break dataset
- Gold QDMR data for IMDB and Yelp is available in this repository
2. Text-to-SQL Datasets:
- Spider
- Academic, IMDB, Yelp and GeoQuery (Jonathan Kummerfeld's repository)
3. Databases (schema & contents):
- Spider (Yu et al., 2018)
- Academic, IMDB and Yelp (Yaghmazadeh et al., OOPSLA 2017)
- GeoQuery (Zelle & Mooney, AAAI 1996)
Convert the MySQL databases of Academic, IMDB, Yelp and GeoQuery to sqlite format using the tool of Jean-Luc Lacroix:
./mysql2sqlite academic_mysql.sql | sqlite3 academic_sqlite.db
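To verify a conversion, a quick check such as the following (an illustrative snippet, not part of the repository; the file name matches the command above) lists the tables of the resulting SQLite file:

```python
# Sanity check for a converted database (illustrative, not a repo script).
# "academic_sqlite.db" matches the conversion command above.
import sqlite3

conn = sqlite3.connect("academic_sqlite.db")
tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
).fetchall()
print([name for (name,) in tables])
conn.close()
```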
Data Generation
Our SQL synthesis is given examples of <QDMR, database, answer> and automatically generates a SQL query that executes to the correct answer. The QDMR decompositions are either manually annotated or automatically predicted by a trained QDMR parser.
Begin by copying all relevant sqlite databases to the data_generation directory.
mkdir data_generation/data
mkdir data_generation/data/spider_databases # copy Spider databases here
mkdir data_generation/data/other_databases # copy Academic, IMDB, Yelp and Geo databases here
- The SQL synthesis expects a formatted csv file, see example. Note that the SQL query in these files is only used to compute the answer.
- This step may take several hours, as multiple candidate SQL queries are executed on their respective databases (see the sketch after the command below).
- To synthesize SQL from the <QDMR, database, answer> examples run:
python data_generation/main.py \
--input_file input_qdmr_examples.csv \
--output_file qdmr_grounded_sql.csv \
--json_steps True
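Conceptually, the synthesis enumerates candidate SQL queries for a grounded QDMR and keeps those whose execution result matches the given answer. The sketch below illustrates only this execution-and-compare step; the function names and the answer format are illustrative assumptions, and the actual logic lives in data_generation/main.py.

```python
# Illustrative sketch of execution-guided filtering, not the repository's
# implementation. The candidate queries and gold answer format are
# hypothetical placeholders.
import sqlite3
from collections import Counter

def executes_to_answer(db_path, candidate_sql, gold_answer):
    """Run a candidate SQL query and compare its result set
    (order-insensitive) to the gold answer."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(candidate_sql).fetchall()
    except sqlite3.Error:
        return False  # malformed candidates simply fail
    finally:
        conn.close()
    return Counter(rows) == Counter(tuple(r) for r in gold_answer)

def filter_candidates(db_path, candidate_queries, gold_answer):
    """Keep only the candidates that execute to the correct answer."""
    return [q for q in candidate_queries
            if executes_to_answer(db_path, q, gold_answer)]
```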
Synthesized Data
The SQL synthesized using QDMR + answer supervision is available for each dataset in the data/sql_synthesis_results/ directory.
- data/sql_synthesis_results/gold_qdmr_supervision: contains SQL synthesized using gold QDMRs that are manually annotated
- data/sql_synthesis_results/predicted_qdmr_supervision: contains SQL synthesized using QDMRs predicted by a trained parser
Models
QDMR Parser
The QDMR parser is a T5-large sequence-to-sequence model, fine-tuned to map questions to their QDMR. The model expects two csv files as its train and dev sets; use the files from the downloaded Break dataset to train the parser. Make sure that you are in the relevant Python environment (requirements_qdmr_parser.txt).
To train the QDMR parser configure the following parameters in train.py:
- data_dir: the path to the directory containing the NL-to-QDMR datasets
- training_set_file: name of the train set csv file (e.g. break_train.csv)
- dev_set_file: name of the dev set csv file (e.g. break_dev.csv)
- output_dir: the directory to store the trained model
After configuration, train the model as follows:
TOKENIZERS_PARALLELISM=false CUDA_VISIBLE_DEVICES=0 python src/qdmr_parser/train.py
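For reference, the parser treats QDMR prediction as plain sequence-to-sequence generation. The following is a minimal fine-tuning sketch assuming the standard Hugging Face transformers API; the hyperparameters, data handling and QDMR example are illustrative and may differ from src/qdmr_parser/train.py.

```python
# Minimal seq2seq fine-tuning sketch (illustrative; not the repo's train.py).
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-large")
model = T5ForConditionalGeneration.from_pretrained("t5-large")

# A single illustrative (question, QDMR) pair
question = "Return the papers published after 2015"
qdmr = "return papers ;return #1 published after 2015"

inputs = tokenizer(question, return_tensors="pt")
labels = tokenizer(qdmr, return_tensors="pt").input_ids

# One optimization step: standard cross-entropy over the QDMR tokens
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss = model(**inputs, labels=labels).loss
loss.backward()
optimizer.step()
```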
To test a trained model and store its predictions, configure the following parameters in test.py:
- checkpoint_path: path to the trained QDMR parser model to be evaluated
- dev_set_file: name of the dev set csv file to generate predictions for
- predictions_output_file: the output file to store the parser's generated predictions
And run the following command:
TOKENIZERS_PARALLELISM=false CUDA_VISIBLE_DEVICES=0 python src/qdmr_parser/test.py
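Generation itself is standard beam-search decoding from the fine-tuned checkpoint. A minimal sketch (the checkpoint path and decoding parameters are placeholders, not the values used in src/qdmr_parser/test.py):

```python
# Illustrative inference sketch; checkpoint path and decoding settings are
# placeholders.
from transformers import T5ForConditionalGeneration, T5Tokenizer

checkpoint = "path/to/trained_qdmr_parser"  # corresponds to checkpoint_path
tokenizer = T5Tokenizer.from_pretrained(checkpoint)
model = T5ForConditionalGeneration.from_pretrained(checkpoint)

question = "How many papers were published after 2015?"
inputs = tokenizer(question, return_tensors="pt")
output_ids = model.generate(**inputs, num_beams=5, max_length=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```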
Text-to-SQL
The text-to-SQL models are T5-large sequence-to-sequence models, fine-tuned to map questions to executable SQL queries. We compare models trained on expert-annotated gold SQL queries to models trained on our SQL synthesized from QDMR and answer supervision.
1. Setup directory
Setup the data for the text-to-SQL experiments as follows:
data
├── tables.json                                # Spider tables.json
├── databases
│   ├── academic
│   │   └── academic.sqlite                    # Sqlite version of the populated Academic database (see downloads)
│   ├── geo
│   │   └── geo.sqlite                         # Sqlite version of the populated Geo database (see downloads)
│   ├── imdb
│   │   └── imdb.sqlite                        # Sqlite version of the populated IMDB database (see downloads)
│   ├── spider_databases                       # Spider databases directory
│   │   ├── activity_1
│   │   │   └── activity_1.sqlite
│   │   └── ...
│   └── yelp
│       └── yelp.sqlite                        # Sqlite version of the populated Yelp database (see downloads)
└── queries
    ├── geo                                    # See experiments data
    │   ├── geo_qdmr_train.json
    │   ├── geo_qdmr_predicted_train.json
    │   ├── geo_gold_train.json
    │   ├── geo_gold_dev.json
    │   ├── geo_gold_test.json
    │   ├── geo_gold_train.sql
    │   ├── geo_gold_dev.sql
    │   └── geo_gold_test.sql
    └── spider
        ├── spider_qdmr_train.json             # See experiments data
        ├── spider_qdmr_predicted_train.json   # See experiments data
        ├── spider_gold_train.json             # Spider training set
        ├── spider_gold_dev.json               # Spider dev set
        ├── spider_gold_train.sql              # Spider training set SQL queries
        └── spider_gold_dev.sql                # Spider dev set SQL queries
Database files are described in the downloads section. See the experiments section for the exact train and test files.
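If convenient, the empty layout above can be created programmatically before copying in the database and query files (a convenience sketch, not a repository script):

```python
# Convenience sketch: create the expected directory layout, then copy the
# sqlite databases and query files into it.
import os

for d in [
    "data/databases/academic",
    "data/databases/geo",
    "data/databases/imdb",
    "data/databases/spider_databases",
    "data/databases/yelp",
    "data/queries/geo",
    "data/queries/spider",
]:
    os.makedirs(d, exist_ok=True)
```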
2. Train model
To train the text-to-SQL model configure the following parameters in train.py:
- dataset: either spider or geo
- target_encoding: sql for gold SQL, and either qdmr_formula or qdmr_sql for the QDMR experiments
- data_dir: path to the directory containing the experiments data
- output_dir: the directory to store the trained model
- db_dir: the directory containing the databases
- training_set_file: training set file in the data directory, e.g. spider/spider_gold_train.json
- dev_set_file: dev set file in the data directory, e.g. spider/spider_gold_dev.json
- dev_set_sql: dev set SQL queries in the data directory, e.g. spider/spider_gold_dev.sql
After configuration, train the model by running:
CUDA_VISIBLE_DEVICES=0 python train.py
3. Test model
To test the text-to-SQL model, first configure the relevant parameters and checkpoint_path in test.py. After configuration, generate the trained model's predictions using:
CUDA_VISIBLE_DEVICES=0 python test.py
Experiments
Data
Gold SQL:
For the Spider experiments we use its original train and dev json and sql files. For Geo880, Academic, IMDB and Yelp we format the original datasets as json files, available here.
QDMR Synthesized SQL:
The QDMR text-to-SQL models are not trained directly on the synthesized SQL. Instead, we train on an encoded QDMR representation with its phrase-DB linking (from the SQL synthesis). This representation is automatically mapped to SQL to evaluate the models execution accuracy. To generate these grounded QDMRs we use the output of the data generation phase. The function encoded_grounded_qdmr
in src/data_generation/write_encoding.py
recieves the json
file containing the synthesized SQL examples. It then encodes them as lisp style formulas of QDMR steps and their relevant phrase-DB linking.
For convenience, you can download the encoded QDMR training sets used in our experiments here. These include:
- qdmr_ground_enc_spider_train.json: 5,349 examples, synthesized using gold QDMR + answer supervision
- qdmr_ground_enc_predicted_spider_train_few_shot: 5,075 examples, synthesized using 700 gold QDMRs and predicted QDMR + answer supervision
- qdmr_ground_enc_predicted_spider_train_30_db.json: 1,129 examples, synthesized using predicted QDMR + answer supervision
- qdmr_ground_enc_predicted_spider_train_40_db.json: 1,440 examples, synthesized using predicted QDMR + answer supervision
- qdmr_ground_enc_predicted_spider_train_40_db_V2.json: 1,552 examples, synthesized using predicted QDMR + answer supervision
- qdmr_ground_enc_geo880_train.json: 454 examples, synthesized using gold QDMR + answer supervision
- qdmr_ground_enc_predicted_geo_train_zero_shot.json: 432 examples, synthesized using predicted QDMR + answer supervision
Configurations
The configurations used for training the text-to-SQL models on Spider are listed below; other parameters are fixed in train.py.
SQL Gold (Spider):
{'dataset': 'spider',
'target_encoding': 'sql',
'db_dir': 'databases/spider_databases',
'training_set_file': 'queries/spider/spider_gold_train.json',
'dev_set_file': 'queries/spider/spider_gold_dev.json',
'dev_set_sql': 'queries/spider/spider_gold_dev.sql'}
QDMR Gold (Spider):
{'dataset': 'spider',
'target_encoding': 'qdmr_formula',
'db_dir': 'databases/spider_databases',
'training_set_file': 'queries/spider/spider_qdmr_train.json',
'dev_set_file': 'queries/spider/spider_gold_dev.json',
'dev_set_sql': 'queries/spider/spider_gold_dev.sql'}
QDMR Predicted (Spider):
{'dataset': 'spider',
'target_encoding': 'qdmr_formula',
'db_dir': 'databases/spider_databases',
'training_set_file': 'queries/spider/spider_qdmr_predicted_train.json',
'dev_set_file': 'queries/spider/spider_gold_dev.json',
'dev_set_sql': 'queries/spider/spider_gold_dev.sql'}
The configurations for training the text-to-SQL models on Geo880.
SQL Gold (Geo):
{'dataset': 'geo',
'target_encoding': 'sql',
'db_dir': 'databases',
'training_set_file': 'queries/geo/geo_gold_train.json',
'dev_set_file': 'queries/geo/geo_gold_dev.json',
'dev_set_sql': 'queries/geo/geo_gold_dev.sql'}
QDMR Gold (Geo):
{'dataset': 'geo',
'target_encoding': 'qdmr_sql',
'db_dir': 'databases',
'training_set_file': 'queries/geo/geo_qdmr_train.json',
'dev_set_file': 'queries/geo/geo_gold_dev.json',
'dev_set_sql': 'queries/geo/geo_gold_dev.sql'}
QDMR Predicted (Geo):
{'dataset': 'geo',
'target_encoding': 'qdmr_sql',
'db_dir': 'databases',
'training_set_file': 'queries/geo/geo_qdmr_predicted_train.json',
'dev_set_file': 'queries/geo/geo_gold_dev.json',
'dev_set_sql': 'queries/geo/geo_gold_dev.sql'}
Evaluation
Text-to-SQL model performance is evaluated using SQL execution accuracy in src/text_to_sql/eval_spider.py. The script automatically converts encoded QDMR predictions to SQL before executing them on the target database.
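Execution accuracy counts a prediction as correct when it executes to the same result set as the gold query on the target database. Below is a minimal sketch of that comparison, assuming plain sqlite access; the actual script additionally handles the QDMR-to-SQL conversion and per-dataset bookkeeping.

```python
# Minimal execution-accuracy sketch (illustrative); eval_spider.py also
# converts encoded QDMR predictions to SQL before this comparison.
import sqlite3
from collections import Counter

def execution_match(db_path, predicted_sql, gold_sql):
    """True if both queries execute and return the same multiset of rows."""
    conn = sqlite3.connect(db_path)
    try:
        pred = conn.execute(predicted_sql).fetchall()
        gold = conn.execute(gold_sql).fetchall()
    except sqlite3.Error:
        return False
    finally:
        conn.close()
    return Counter(pred) == Counter(gold)

def execution_accuracy(examples):
    """examples: iterable of (db_path, predicted_sql, gold_sql) triples."""
    examples = list(examples)
    correct = sum(execution_match(db, p, g) for db, p, g in examples)
    return correct / len(examples)
```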
Citation
@inproceedings{wolfson-etal-2022-weakly,
  title = {Weakly Supervised Text-to-SQL Parsing through Question Decomposition},
  author = {Wolfson, Tomer and Deutch, Daniel and Berant, Jonathan},
  booktitle = {Findings of the Association for Computational Linguistics: NAACL 2022},
  year = {2022},
}
License
This repository and its data are released under the MIT license.
For the licensing of all external datasets and databases used throughout our experiments:
- The Break dataset (Wolfson et al., 2020) is under the MIT License.
- Spider (Yu et al., 2018) is under the CC BY-SA 4.0 License.
- Geo880 (Zelle and Mooney, 1996) is available under the GNU General Public License 2.0.
- The text-to-SQL versions of Academic (Li and Jagadish, 2014) and Geo880 were made publicly available by Finegan-Dollak et al. (2018).
- The IMDB and YELP datasets were publicly released by Yaghmazadeh et al. (2017).