Gold standard corpus annotated with verb-preverb connections for Hungarian.

Hungarian Preverb Corpus

A gold standard corpus manually annotated with verb-preverb connections for Hungarian.

corpus

The corpus consists of the following 4 files:

filename	# sentences	# preverbs
difficult_validate1.txt	310	357
difficult_validate2.txt	840	935
difficult_test.txt	327	376
general_test.txt	503	500

In the general dataset, preverbs occur in the same distribution as in ordinary Hungarian text. The difficult dataset is specially crafted: the most common and easiest-to-handle pattern, in which a verb is directly followed by its preverb (e.g. megy ki 'go out'), is omitted. The validate files are for development/validation, the test files are for testing. Note that a general_validate dataset would not be useful, because the trivial pattern would make up the vast majority of it and overwhelm the more interesting, less frequent patterns.

Accordingly, the emPreverb tool, which connects preverbs to their corresponding verbs, was developed based only on the interesting difficult examples, and was tested on both difficult and general data.

(Remark. The difficult_validate dataset is divided into two parts for historical reasons, but you can simply use them together: they contain a total of 1150 sentences and 1292 preverbs.)
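
The counts in the table can be approximated with a few lines of Python. This is only a sketch under two assumptions that the text above does not confirm: that each non-empty line of a file is one sentence, and that the "# preverbs" column counts every backslash-plus-digit mark described in the annotation guidelines (including the 0 ID). The function name corpus_stats is hypothetical.

```python
import re

# A preverb mark is a backslash followed by a single digit, e.g. meg\1.
PREVERB_MARK = re.compile(r"\\\d")


def corpus_stats(lines):
    """Return (sentence_count, preverb_count) for an iterable of lines.

    Assumption: one sentence per non-empty line.
    """
    sentences = 0
    preverbs = 0
    for line in lines:
        if not line.strip():
            continue  # skip empty lines
        sentences += 1
        preverbs += len(PREVERB_MARK.findall(line))
    return sentences, preverbs
```

To get combined figures for the two difficult_validate files, the lines of both files can be chained (for example with itertools.chain) and passed to the same function. If the files are laid out differently from what is assumed here, the results will not match the table.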

corpus annotation guidelines

A preverb is marked by a suffixed backslash followed by a (single-digit!) ID number: meg\1.
The word from which the preverb was separated is marked by a pipe followed by the same ID number: főzve|1.
Within the same line, different verb-preverb pairs must (obviously) receive different ID numbers.
A preverb that does not belong to any word in the sentence (ellipsis etc.) is marked with a zero ID: "Hazakísérhetlek?" "Meg\0 hát." Any number of preverbs can have the 0 ID within the same line.
In the difficult dataset, a verb directly followed by its preverb is not annotated: főzte meg, but: főzte|1 volna meg\1.
In the general dataset, the first pattern is annotated as well: főzte|1 meg\1.
Normally there is a 1:1 correspondence between preverbs and verbs. However, there are exceptions, and these are annotated accordingly, e.g. Se ki\1, se be\1 nem lehetett menni|1 Budakesziről; át-\1 meg átjárták|1.
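
The guidelines above can be read back mechanically. The following is a minimal sketch of such a reader, not part of the corpus or of emPreverb: the function name extract_connections and the two regular expressions are illustrative assumptions, and the sketch assumes that word forms consist of letters, digits and hyphens only.

```python
import re
from collections import defaultdict

# Preverb: word form + backslash + single-digit ID, e.g. meg\1 or át-\1.
PREVERB_RE = re.compile(r"(\w[\w-]*)\\(\d)")
# Verb (or other host word): word form + pipe + single-digit ID, e.g. főzve|1.
VERB_RE = re.compile(r"(\w[\w-]*)\|(\d)")


def extract_connections(line):
    """Return (pairs, orphans) for one annotated line.

    pairs   -- list of (preverb, verb) tuples sharing a non-zero ID
    orphans -- preverbs with ID 0, i.e. without a host word in the line
    """
    preverbs = defaultdict(list)
    verbs = defaultdict(list)
    for form, idx in PREVERB_RE.findall(line):
        preverbs[int(idx)].append(form)
    for form, idx in VERB_RE.findall(line):
        verbs[int(idx)].append(form)

    orphans = preverbs.pop(0, [])
    # Several preverbs may share one verb, so pair every preverb
    # with every word carrying the same ID.
    pairs = [
        (preverb, verb)
        for idx in sorted(preverbs)
        for preverb in preverbs[idx]
        for verb in verbs.get(idx, [])
    ]
    return pairs, orphans


print(extract_connections(r"főzte|1 volna meg\1"))
# ([('meg', 'főzte')], [])
print(extract_connections(r"Se ki\1, se be\1 nem lehetett menni|1 Budakesziről"))
# ([('ki', 'menni'), ('be', 'menni')], [])
```

Pairing by ID rather than by position keeps the non-1:1 cases (two preverbs for one verb) intact, and zero-ID preverbs are returned separately because they have no counterpart in the line.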

Check (see Steps 1 to 4 in evaluate.ipynb) whether tokens annotated as separated preverbs are also analysed as preverbs by e-magyar morph,pos. If not (e.g. if the preverb meg is tagged by emtsv as [/Conj]), remove this annotation (or the whole item if no annotation is left) from the dataset, because preverb will necessarily fail due to the incorrect emtsv annotation, which is extraneous to the evaluation of its performance. Exception: person-inflected preverb-like postpositions such as in utánam\1 dobják|1, which are tagged by emtsv as [/Post], and case-inflected personal pronouns such as in hozzá\1 voltam szokva|1, which are tagged as [/N|Pro], should not be removed from the dataset, since preverb should be able to handle these.

If a token is annotated as the verb stem counterpart of a separated preverb but is not tagged by emtsv as a verb, check whether the preverb annotation is correct. If it is, do not remove this annotation from the dataset: preverb is supposed to be able to handle the connection of such separated preverbs.
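
The preverb-side check described above could be sketched roughly as follows. This is not the code from evaluate.ipynb: the function names are hypothetical, the preverb tag [/Prev] is an assumption about the emtsv tagset (only [/Conj], [/Post] and [/N|Pro] are named in the text), and the sketch simplifies the exception by accepting any tag that starts with [/Post] or [/N|Pro] without checking person or case inflection.

```python
# Tags under which a preverb annotation is kept.
# "[/Prev]" is an ASSUMED tag for preverbs; "[/Post]" and "[/N|Pro]"
# are the two exceptions named in the guidelines.
KEPT_TAG_PREFIXES = ("[/Prev]", "[/Post]", "[/N|Pro]")


def keep_preverb_annotation(emtsv_tag):
    """Return True if a token with this tag keeps its preverb annotation."""
    return emtsv_tag.startswith(KEPT_TAG_PREFIXES)


def filter_item(annotated_preverbs):
    """Filter the preverb annotations of one corpus item.

    annotated_preverbs -- list of (word_form, emtsv_tag) tuples
    Returns the annotations to keep; an empty list means that the
    whole item should be dropped from the dataset.
    """
    return [
        (form, tag)
        for form, tag in annotated_preverbs
        if keep_preverb_annotation(tag)
    ]
```

Note that the verb side is deliberately not filtered here: according to the guidelines, a host word that emtsv does not tag as a verb keeps its annotation as long as the preverb annotation itself is correct.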

evaluation

An environment for reproducing evaluation of emPreverb as published in the paper below.

git clone https://github.com/ril-lexknowrep/emPreverb
cd emPreverb
make evaluate

Note that make evaluate clones this repo inside emPreverb and runs the evaluation.

The results are written to general_test_results.txt and difficult_test_results.txt. They should be exactly the same as those in Table 3 of the paper below.

development

An environment used for developing emPreverb. It is "for us", but if you insist on using it:

git clone https://github.com/ril-lexknowrep/emPreverb
cd emPreverb
git clone https://github.com/ril-lexknowrep/hungarian-preverb-corpus
cd hungarian-preverb-corpus/development
jupyter notebook evaluate.ipynb

(Remark. Yes, please clone this repo inside emPreverb.)

citation

If you use the corpus, please cite the following paper.

Pethő, Gergely and Sass, Bálint and Kalivoda, Ágnes and Simon, László and Lipp, Veronika: Igekötő-kapcsolás. In: MSZNY 2022.

Owner

RIL Lexical Knowledge Representation Research Group
