PASTRIE
Official release of the corpus described in the paper:
Michael Kranzlein, Emma Manning, Siyao Peng, Shira Wein, Aryaman Arora, and Nathan Schneider (2020). PASTRIE: A Corpus of Prepositions Annotated with Supersense Tags in Reddit International English [link]. Proceedings of the 14th Linguistic Annotation Workshop.
Overview
PASTRIE is a corpus of English data from Reddit annotated with preposition supersenses from the SNACS inventory.
While the data in PASTRIE is in English, it was produced by presumed speakers of four L1s:
- English
- French
- German
- Spanish
For details on how L1s were identified, see section 3.1 of Rabinovich et al. (2018).
Annotation Example
Below is an example sentence from the corpus, where annotation targets are bolded and preposition supersenses are annotated with the notation SceneRole↝Function. Together, a scene role and function are known as a construal.
Data Formats
PASTRIE is released in the following formats. We expect that most projects will be best served by one of the JSON formats.
.conllulex
: the 19-column CoNLL-U-Lex format originally used for STREUSLE..json
: a JSON representation of the CoNLL-U-Lex that does not require a CoNLL-U-Lex parser..govobj.json
: an extended version of the JSON representation that contains information about each preposition's syntactic parent and object.
PASTRIE mostly follows STREUSLE with respect to the data format and SNACS annotation practice. Primary differences in the annotations are:
- Lemmas, part-of-speech tags, and syntactic dependencies aim to follow the UD standard in both cases. They are gold in STREUSLE, versus automatic with some manual corrections in PASTRIE.
- PASTRIE does not group together base+clitic combinations, whereas STREUSLE does (multiword tokens—where a single orthographic word contains multiple syntactic words).
- PASTRIE does not regularly specify
SpaceAfter=No
to indicate alignment between the tokens and the raw text. - In PASTRIE, the raw text string accompanying the sentence may contain two or more consecutive spaces.
- PASTRIE lacks enhanced dependencies.
- Multiword expression annotations in PASTRIE are limited to expressions containing a preposition. Depending on the syntactic head, the expression may or may not have a SNACS supersense.
- Verbal multiword expressions in PASTRIE are not subtyped in the lexcat; they all have a lexcat of
V
.
- Verbal multiword expressions in PASTRIE are not subtyped in the lexcat; they all have a lexcat of
- Noun and verb expressions in PASTRIE do not have supersense labels.