PIZZA - a task-oriented semantic parsing dataset

Last update: Dec 14, 2022

Related tags

Overview

PIZZA - a task-oriented semantic parsing dataset

The PIZZA dataset continues the exploration of task-oriented parsing by introducing a new dataset for parsing pizza and drink orders, whose semantics cannot be captured by flat slots and intents.

The dataset comes in two main versions, one in a recently introduced utterance-level hierarchical notation that we call TOP, and one whose targets are executable representations (EXR).

Below are two examples of orders that can be found in the data:

{
    "dev.SRC": "five medium pizzas with tomatoes and ham",
    "dev.EXR": "(ORDER (PIZZAORDER (NUMBER 5 ) (SIZE MEDIUM ) (TOPPING HAM ) (TOPPING TOMATOES ) ) )",
    "dev.TOP": "(ORDER (PIZZAORDER (NUMBER five ) (SIZE medium ) pizzas with (TOPPING tomatoes ) and (TOPPING ham ) ) )"
}
{
    "dev.SRC": "i want to order two medium pizzas with sausage and black olives and two medium pizzas with pepperoni and extra cheese and three large pizzas with pepperoni and sausage",
    "dev.EXR": "(ORDER (PIZZAORDER (NUMBER 2 ) (SIZE MEDIUM ) (COMPLEX_TOPPING (QUANTITY EXTRA ) (TOPPING CHEESE ) ) (TOPPING PEPPERONI ) ) (PIZZAORDER (NUMBER 2 ) (SIZE MEDIUM ) (TOPPING OLIVES ) (TOPPING SAUSAGE ) ) (PIZZAORDER (NUMBER 3 ) (SIZE LARGE ) (TOPPING PEPPERONI ) (TOPPING SAUSAGE ) ) )",
    "dev.TOP": "(ORDER i want to order (PIZZAORDER (NUMBER two ) (SIZE medium ) pizzas with (TOPPING sausage ) and (TOPPING black olives ) ) and (PIZZAORDER (NUMBER two ) (SIZE medium ) pizzas with (TOPPING pepperoni ) and (COMPLEX_TOPPING (QUANTITY extra ) (TOPPING cheese ) ) ) and (PIZZAORDER (NUMBER three ) (SIZE large ) pizzas with (TOPPING pepperoni ) and (TOPPING sausage ) ) )"
}

While more details on the dataset conventions and construction can be found in the paper, we give a high level idea of the main rules our target semantics follow:

Each order can include any number of pizza and/or drink suborders. These suborders are labeled with the constructors PIZZAORDER and DRINKORDER, respectively.
Each top-level order is always labeled with the root constructor ORDER.
Both pizza and drink orders can have NUMBER and SIZE attributes.
A pizza order can have any number of TOPPING attributes, each of which can be negated. Negative particles can have larger scope with the use of the or particle, e.g., no peppers or onions will negate both peppers and onions.
Toppings can be modified by quantifiers such as a lot or extra, a little, etc.
A pizza order can have a STYLE attribute (e.g., thin crust style or chicago style).
Styles can be negated.
Each drink order must have a DRINKTYPE (e.g. coke), and can also have a CONTAINERTYPE (e.g. bottle) and/or a volume modifier (e.g., three 20 fl ounce coke cans).

We view ORDER, PIZZAORDER, and DRINKORDER as intents, and the rest of the semantic constructors as composite slots, with the exception of the leaf constructors, which are viewed as entities (resolved slot values).

Dataset statistics

In the below table we give high level statistics of the data.

	Train	Dev	Test
Number of utterances	2,456,446	348	1,357
Unique entities	367	109	180
Avg entities per utterance	5.32	5.37	5.42
Avg intents per utterance	1.76	1.25	1.28

More details can be found in appendix of our publication.

Getting Started

The repo structure is as follows:

PIZZA
|
|_____ data
|      |_____ PIZZA_train.json.zip             # a zipped version of the training data
|      |_____ PIZZA_train.10_percent.json.zip  # a random subset representing 10% of training data
|      |_____ PIZZA_dev.json                   # the dev portion of the data
|      |_____ PIZZA_test.json                  # the test portion of the data
|
|_____ utils
|      |_____ __init__.py
|      |_____ entity_resolution.py # entity resolution script
|      |_____ express_utils.py     # utilities
|      |_____ semantic_matchers.py # metric functions
|      |_____ sexp_reader.py       # tree reader helper functions 
|      |_____ trees.py             # tree classes and readers
|      |
|      |_____ catalogs             # directory with catalog files of entities
|      
|_____ doc 
|      |_____ PIZZA_dataset_reader_metrics_examples.ipynb
|
|_____ READMED.md

The dev and test data files are json lines files where each line represents one utterance and contains 4 keys:

*.SRC: the natural language order input
*.EXR: the ground truth target semantic representation in EXR format
*.TOP: the ground truth target semantic representation in TOP format
*.PCFG_ERR: a boolean flag indicating whether our PCFG system parsed the utterance correctly. See publication for more details

The training data file comes in a similar format, with two differences:

there is no train.PCFG_ERR flag since the training data is all synthetically generated hence parsable with perfect accuracy. In other words, this flag would be True for all utterances in that file.
there is an extra train.TOP-DECOUPLED key that is the ground truth target semantic representation in TOP-Decoupled format. See publication for more details.

Security

See CONTRIBUTING for more information.

License

This library is licensed under the CC-BY-NC-4.0 License.

Code for paper "Role-oriented Network Embedding Based on Adversarial Learning between Higher-order and Local Features"

Role-oriented Network Embedding Based on Adversarial Learning between Higher-order and Local Features Train python main.py --dataset brazil-flights C

0 Jun 28, 2022

Intent parsing and slot filling in PyTorch with seq2seq + attention

PyTorch Seq2Seq Intent Parsing Reframing intent parsing as a human - machine translation task. Work in progress successor to torch-seq2seq-intent-pars

159 Apr 4, 2022

Code and checkpoints for training the transformer-based Table QA models introduced in the paper TAPAS: Weakly Supervised Table Parsing via Pre-training.

End-to-end neural table-text understanding models.

914 Jan 7, 2023

Neural network models for joint POS tagging and dependency parsing (CoNLL 2017-2018)

Neural Network Models for Joint POS Tagging and Dependency Parsing Implementations of joint models for POS tagging and dependency parsing, as describe

152 Sep 2, 2022

TweebankNLP - Pre-trained Tweet NLP Pipeline (NER, tokenization, lemmatization, POS tagging, dependency parsing) + Models + Tweebank-NER

TweebankNLP This repo contains the new Tweebank-NER dataset and Twitter-Stanza p

84 Dec 20, 2022

Mednlp - Medical natural language parsing and utility library

Medical natural language parsing and utility library A natural language medical

3 Aug 24, 2022

Task-based datasets, preprocessing, and evaluation for sequence models.

SeqIO: Task-based datasets, preprocessing, and evaluation for sequence models. SeqIO is a library for processing sequential data to be fed into downst

290 Dec 26, 2022

Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning

GenSen Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning Sandeep Subramanian, Adam Trischler, Yoshua B

309 Oct 19, 2022

Deduplication is the task to combine different representations of the same real world entity.

Deduplication is the task to combine different representations of the same real world entity. This package implements deduplication using active learning. Active learning allows for rapid training without having to provide a large, manually labelled dataset.

63 Nov 17, 2022

PIZZA - a task-oriented semantic parsing dataset

Related tags

Overview

PIZZA - a task-oriented semantic parsing dataset

Dataset statistics

Getting Started

Security

License

You might also like...

Code for paper "Role-oriented Network Embedding Based on Adversarial Learning between Higher-order and Local Features"

Intent parsing and slot filling in PyTorch with seq2seq + attention

Code and checkpoints for training the transformer-based Table QA models introduced in the paper TAPAS: Weakly Supervised Table Parsing via Pre-training.

Neural network models for joint POS tagging and dependency parsing (CoNLL 2017-2018)

TweebankNLP - Pre-trained Tweet NLP Pipeline (NER, tokenization, lemmatization, POS tagging, dependency parsing) + Models + Tweebank-NER

Mednlp - Medical natural language parsing and utility library

Task-based datasets, preprocessing, and evaluation for sequence models.

Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning

Deduplication is the task to combine different representations of the same real world entity.

Owner

A Survey of Natural Language Generation in Task-Oriented Dialogue System (TOD): Recent Advances and New Frontiers

PhoNLP: A BERT-based multi-task learning toolkit for part-of-speech tagging, named entity recognition and dependency parsing

:hot_pepper: R²SQL: "Dynamic Hybrid Relation Network for Cross-Domain Context-Dependent Semantic Parsing." (AAAI 2021)

GAP-text2SQL: Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training

Using Bert as the backbone model for lime, designed for NLP task explanation (sentence pair text classification task)

The FinQA dataset from paper: FinQA: A Dataset of Numerical Reasoning over Financial Data

WIT (Wikipedia-based Image Text) Dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ languages.

A 30000+ Chinese MRC dataset - Delta Reading Comprehension Dataset

Knowledge Oriented Programming Language