UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language

Grammarly

Last update: Dec 29, 2022

Related tags

Deep Learning natural-language-processing corpus dataset corpus-data corpus-tools gec nlp-datasets grammatical-error-correction ukrainian-language

Overview

This repository contains UA-GEC data and an accompanying Python library.

Data

All corpus data and metadata are stored under ./data. It has two subfolders, one for the train split and one for the test split.

Each split (train and test) has further subfolders for different data representations:

./data/{train,test}/annotated stores documents in the annotated format

./data/{train,test}/source and ./data/{train,test}/target store the original and the corrected versions of documents. Text files in these directories are plain text with no annotation markup. These files were produced from the annotated data and are, in some way, redundant. We keep them because this format is convenient in some use cases.
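
The plain-text layout above makes it possible to read parallel documents without any special tooling. The following is a minimal sketch, not part of the repository: the function name iter_parallel_docs is hypothetical, and it assumes that a corrected file in target has the same file name as its original in source.

```python
from pathlib import Path
from typing import Iterator, Tuple


def iter_parallel_docs(data_dir: str, split: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (file name, source text, target text) for each document in a split."""
    source_dir = Path(data_dir) / split / "source"
    target_dir = Path(data_dir) / split / "target"
    for source_path in sorted(p for p in source_dir.iterdir() if p.is_file()):
        # Assumption: the corrected file carries the same name as the original.
        target_path = target_dir / source_path.name
        if not target_path.exists():
            continue  # skip documents that have no counterpart
        yield (
            source_path.name,
            source_path.read_text(encoding="utf-8"),
            target_path.read_text(encoding="utf-8"),
        )
```

With the repository checked out, iter_parallel_docs("./data", "train") would walk the train split one document at a time.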

Metadata

./data/metadata.csv stores per-document metadata. It's a CSV file with the following fields:

id (str): document identifier.
author_id (str): document author identifier.
is_native (int): 1 if the author is a native speaker, 0 otherwise.
region (str): the author's region of birth. A special value "Інше" is used both for authors who were born outside Ukraine and authors who preferred not to specify their region.
gender (str): could be "Жіноча" (female), "Чоловіча" (male), or "Інша" (other).
occupation (str): one of "Технічна", "Гуманітарна", "Природнича", "Інша"
submission_type (str): one of "essay", "translation", or "text_donation"
source_language (str): for submissions of the "translation" type, this field indicates the source language of the translated text. Possible values are "de", "en", "fr", "ru", and "pl".
annotator_id (int): ID of the annotator who corrected the document.
partition (str): one of "test" or "train"
is_sensitive (int): 1 if the document contains profanity or offensive language
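
As an illustration of how these fields might be used, here is a small sketch that loads metadata rows with the standard csv module and filters them. It assumes the file starts with a header row naming the fields listed above; load_metadata and the SAMPLE rows are made up for the example and are not real corpus data.

```python
import csv
import io
from collections import Counter


def load_metadata(csv_file):
    """Read metadata rows and convert the integer flags to int."""
    rows = []
    for row in csv.DictReader(csv_file):
        row["is_native"] = int(row["is_native"])
        row["is_sensitive"] = int(row["is_sensitive"])
        rows.append(row)
    return rows


# Made-up sample with a subset of the documented columns (not real corpus data).
SAMPLE = """id,author_id,is_native,submission_type,partition,is_sensitive
0001,a1,1,essay,train,0
0002,a2,0,translation,train,0
0003,a1,1,essay,test,1
"""

rows = load_metadata(io.StringIO(SAMPLE))

# IDs of non-sensitive train documents written by native speakers.
train_native = [
    r["id"]
    for r in rows
    if r["partition"] == "train" and r["is_native"] == 1 and not r["is_sensitive"]
]
print(train_native)  # ['0001']

# Number of documents per submission type.
print(Counter(r["submission_type"] for r in rows))
```

For the real file, the same function could be called on open("./data/metadata.csv", encoding="utf-8", newline="").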

Annotation format

Annotated files are text files that use the following in-text annotation format: {error=>edit:::error_type=Tag}, where error and edit stand for the text item before and after correction respectively, and Tag denotes an error category (Grammar, Spelling, Punctuation, or Fluency).

Example of an annotated sentence:

    I {likes=>like:::error_type=Grammar} turtles.

An accompanying Python package, ua_gec, provides many tools for working with annotated texts. See its documentation for details.
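
For quick experiments without the package, the in-text format can also be handled with a regular expression. This is a rough sketch rather than the ua_gec implementation; it assumes that curly braces never occur inside an annotation, and the helper names are hypothetical.

```python
import re

# Matches {error=>edit:::error_type=Tag}; the ":::error_type=Tag" part is optional.
# Assumption: "{" and "}" never appear inside an annotation.
ANNOTATION_RE = re.compile(
    r"\{(?P<error>[^{}]*?)=>(?P<edit>[^{}]*?)(?::::error_type=(?P<tag>[^{}]*))?\}"
)


def to_source(annotated: str) -> str:
    """Return the text before correction."""
    return ANNOTATION_RE.sub(lambda m: m.group("error"), annotated)


def to_target(annotated: str) -> str:
    """Return the text after correction."""
    return ANNOTATION_RE.sub(lambda m: m.group("edit"), annotated)


def error_types(annotated: str) -> list:
    """Return the error category of every annotation, in order of appearance."""
    return [m.group("tag") for m in ANNOTATION_RE.finditer(annotated)]


sentence = "I {likes=>like:::error_type=Grammar} turtles."
print(to_source(sentence))    # I likes turtles.
print(to_target(sentence))    # I like turtles.
print(error_types(sentence))  # ['Grammar']
```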

Train-test split

We expect users of the corpus to train and tune their models on the train split only. Feel free to further split it into train-dev (or use cross-validation).

Please use the test split only for reporting scores of your final model. In particular, never optimize on the test set. Do not tune hyperparameters on it. Do not use it for model selection in any way.
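
One possible way to carve a dev set out of the train split is sketched below. Grouping by author, so that no author's writing ends up on both sides, is a design choice of this example and not something the corpus prescribes; split_train_dev and its parameters are hypothetical.

```python
import random


def split_train_dev(doc_authors, dev_fraction=0.1, seed=0):
    """Split document IDs into train and dev lists so that no author spans both.

    doc_authors maps a document id to its author_id (both available in the metadata).
    """
    authors = sorted(set(doc_authors.values()))
    random.Random(seed).shuffle(authors)  # fixed seed keeps the split reproducible
    n_dev = max(1, round(len(authors) * dev_fraction))
    dev_authors = set(authors[:n_dev])
    train_ids = sorted(d for d, a in doc_authors.items() if a not in dev_authors)
    dev_ids = sorted(d for d, a in doc_authors.items() if a in dev_authors)
    return train_ids, dev_ids


# Made-up document and author IDs, for illustration only.
example = {"0001": "a1", "0002": "a1", "0003": "a2", "0004": "a3", "0005": "a4"}
train_ids, dev_ids = split_train_dev(example, dev_fraction=0.25, seed=1)
print(train_ids, dev_ids)
```

Only documents whose partition is "train" should be passed in; the test split stays untouched.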

The next section lists the per-split statistics.

Statistics

UA-GEC contains:

Split	Documents	Sentences	Tokens	Authors
train	851	18,225	285,247	416
test	160	2,490	43,432	76
TOTAL	1,011	20,715	328,779	492

See stats.txt for detailed statistics generated by the following command (ua-gec must be installed first):

$ make stats

Python library

As an alternative to working with the data files directly, you may use a Python package called ua_gec. This package includes the data and has classes to iterate over documents, read metadata, work with annotations, etc.

Getting started

The package can be installed with pip:

    $ pip install ua_gec==1.1

Alternatively, you can install it from the source code:

    $ cd python
    $ python setup.py develop

Iterating through corpus

Once installed, you may get annotated documents from the Python code:

    
    >>> from ua_gec import Corpus
    >>> corpus = Corpus(partition="train")
    >>> for doc in corpus:
    ...     print(doc.source)         # "I likes it."
    ...     print(doc.target)         # "I like it."
    ...     print(doc.annotated)      # <AnnotatedText("I {likes=>like} it.")>
    ...     print(doc.meta.region)    # "Київська"

Note that the doc.annotated property is of type AnnotatedText. This class is described in the next section.

Working with annotations

ua_gec.AnnotatedText is a class that provides tools for processing annotated texts. It can iterate over annotations, get annotation error type, remove some of the annotations, and more.

While we're working on detailed documentation, here is an example to get you started. It removes all Fluency annotations from a text:

    >>> from ua_gec import AnnotatedText
    >>> text = AnnotatedText("I {likes=>like:::error_type=Grammar} it.")
    >>> for ann in text.iter_annotations():
    ...     print(ann.source_text)       # likes
    ...     print(ann.top_suggestion)    # like
    ...     print(ann.meta)              # {'error_type': 'Grammar'}
    ...     if ann.meta["error_type"] == "Fluency":
    ...         text.remove(ann)         # or `text.apply(ann)`
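
The same idea can be sketched on the raw annotated string, independently of the library. This is not how ua_gec implements removal; drop_error_type is a hypothetical helper, and it assumes every annotation carries an error_type and contains no curly braces.

```python
import re

# Assumption: every annotation looks like {error=>edit:::error_type=Tag}.
ANNOTATION_RE = re.compile(r"\{([^{}]*?)=>([^{}]*?):::error_type=([^{}]*)\}")


def drop_error_type(annotated: str, error_type: str) -> str:
    """Revert annotations of one error type to their original text; keep the rest."""

    def revert(match):
        source, _edit, tag = match.groups()
        # Keep the whole annotation unless its tag is the one being dropped.
        return source if tag == error_type else match.group(0)

    return ANNOTATION_RE.sub(revert, annotated)


text = "I {likes=>like:::error_type=Grammar} {big=>large:::error_type=Fluency} turtles."
print(drop_error_type(text, "Fluency"))
# I {likes=>like:::error_type=Grammar} big turtles.
```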

Contributing

Data collection is an ongoing activity. You can always contribute your Ukrainian writings or complete one of the writing tasks at https://ua-gec-dataset.grammarly.ai/
Code and documentation improvements are welcome. Please submit a pull request.

Contacts

Comments

M2 format representation.

Hi there, That's a truly tremendous contribution to the development of the Ukrainian GEC. Great job! By the way, do you plan to add the m2 format representation/converter to this dataset?

opened by BogdanDidenko 2
Add GEC-only annotations
Background

Naively removing fluency edits may result in ungrammatical sentences. To avoid that, annotators validated the whole corpus with the fluency edits removed. They made edits to ensure that sentences stay grammatical even after the fluency edits are removed.

This branch contains the results of their work.

Files structure

The commit temporarily adds two folders.

results-gec-only.src contains the snippets that were handed to annotators for the GEC-only annotation. These snippets:

should cover everything added in v1 and v2

have fluency annotations removed

have no detailed annotations

have no fixes applied to v1 and v2 since the data collection

results-gec-only.tgt is the result of the annotators' work:

should fix errors introduced by removing fluency annotations

sometimes annotators make edits unrelated to fluency removal (seems like they were fixing random errors along the way)

Future work

These are subtasks for integrating GEC-only into the release:

[ ] Copy detailed annotations here

[ ] Integrate fixes made to the main corpus here

[ ] Backport some edits from GEC-only to the main corpus (where edits are bugfixes and are not related to fluency removal)

[ ] Check for missing snippets and fix them

[ ] Update docs and find a proper place for GEC-only
opened by asivokon 1
Issue with data point 730 in train split
There appears to be a problem with the target rendering for data point 730 in the train split. Notice below that the source text contains 5 bullet points whereas the target text contains only 4, because the last two have been merged and this part of the text has been lost altogether: "Вони ходять по вулицях, чекають тебе під будинком, стрибають на тебе в пошуках їжі і гавкають ночами. Враховуючи мою нелюбов до бродячих псів, для мене це був стрес)"

>>> print(doc.source)

Рай чи пекло?

"І на сонці бувають плями" як то кажуть, тому і на цьому дивовижному острові є свої особливості, які можуть перетворити відпочинок у пекло. Для мене особисто такими недоліками стало наступне:

◇ живність - окрім китів, слонів і бурундуків, які милують око, тут є комарі, мурахи, кукарачі, павуки, ящірки, жаби, змії, варани, кажани і тд. І це не в зоопарку. Це у вас в номері/на віллі, на пляжі, на вулицях міста.

◇ шум - всі автобуси з музичним супроводом і дуже голосним, тому будьте готові слухати зірок місцевого шоу-бізу постійно, хочете ви того чи ні. Кондуктори в автобусах постійно кричать у відкриті двері, зазиваючи пасажирів, і на їх фоні наші крикуни "Рівне - Корець - Новоград" на столичному вокзалі, просто забиті тіхоні) На дорогах шалений трафік з автобусів, тук-туків, мопедів і велосипедів. Кожен з них бібікає приблизно раз на 20 секунд (поздоровкатись, попередити про обгін чи поворот, виказати незадоволення). Можете собі уявити яка це симфонія)

◇ бруд - пилючність на дорогах просто нереальна. Якщо проїхатись в автобусі біля відкритого віконечка, писок доведеться довго вмивати потім) В метрі від райського пляжу може бути купа сміття і нікого вона не займає. Така ж історія і у великих містах, шум, гам, срач і бардак.

◇ собаки - багато-багато-багато собак всюди=>усюди:::error_type=Spelling}. Вони ходять по вулицях, чекають тебе під будинком, стрибають на тебе в пошуках їжі і гавкають ночами. Враховуючи мою нелюбов до бродячих псів, для мене це був стрес)

◇ москалі їх ще більше ніж собак. І це такі москалі, в найгіршому їх прояві, в футболках з прапором "вєлікай" Рассєї або "Льоха рєшаєт фсьо", які кричать на весь пляж/ресторан, говорять до всіх російською і дивуються, чому місцеві їх не розуміють. Намагаються всюди влізти без черги і поводять себе максимально по хамськи, власне, характерна для них поведінка. І вони, насправді, дратують більше за комарів, мурах і змій, разом узятих.

То що ж з цим всим робити? А нічого) Ці всі штуки дійсно можуть дратувати і псувати настрій. Але таке можливо, якщо ти невиспаний, болить голова чи просто втомився. Тоді кожна мурашка виводить з себе) А коли проходить голова і втома, то всі ці штуки сприймаються як місцевий колорит) І якщо з цим колоритом познайомитись ближче, то нічого страшного, як виявляється, немає: зміюки ці неотруйні, собак можна відігнати, кажани літають високо і людей не чіпають, співаків в автобусі можна переглушити навушниками або повчити місцеві пісні, від пилу можна взяти з собою тонік для очистки обличчя, і не смітити на пляжах та вулицях, щоб не додавати бруду в місцеві купи сміття. Єдиний мінус - від москалів так просто не здихаєшся, але це питання намагається вирішити не одна нація і тут Шрі-Ланка безсила.

Мабуть недоліки можна знайти всюди, якщо дуже захотіти, але плюсів у Шрі-Ланки значно більше ;)

>>> print(doc.target)

Рай чи пекло?

"І на сонці бувають плями" як то кажуть, тому і на цьому дивовижному острові є свої особливості, які можуть перетворити відпочинок у пекло. Для мене особисто такими недоліками стали:

◇ живність – окрім китів, слонів і бурундуків, які милують око, тут є комарі, мурахи, кукарачі, павуки, ящірки, жаби, змії, варани, кажани і т. ін. І це не в зоопарку. Це у вас у номері/на віллі, на пляжі, на вулицях міста.

◇ шум – всі автобуси з музичним супроводом і дуже голосним, тому будьте готові слухати зірок місцевого шоу-бізу постійно, хочете ви того чи ні. Кондуктори в автобусах постійно кричать у відчинені двері, закликаючи пасажирів, і на їхньому тлі наші крикуни "Рівне – Корець – Новоград" на столичному вокзалі, – просто забиті тишки) На дорогах шалений трафік з автобусів, тук-туків, мопедів і велосипедів. Кожен із них бібікає приблизно раз на 20 секунд (поздоровкатись, попередити про обгін чи поворот, виказати невдоволення). Можете собі уявити, яка це симфонія)

◇ бруд – пилючність на дорогах просто нереальна. Якщо проїхатись в автобусі біля відчиненого віконечка, потім доведеться довго вмивати писок ) В метрі від райського пляжу може бути купа сміття і нікого вона не займає. Така сама історія і у великих містах, шум, гам, срач і бардак.

◇ собаки – багато-багато-багато собак москалі {-=>– їх ще більше ніж собак. І це такі москалі, в найгіршій їхній сутності, у футболках із прапором "вєлікай" Рассєї або "Льоха рєшаєт фсьо", які кричать на весь пляж/ресторан, говорять до всіх російською і дивуються, чому місцеві їх не розуміють. Намагаються всюди влізти без черги і поводять себе максимально по-хамськи, власне, характерна для них поведінка. І вони, насправді, дратують більше за комарів, мурах і змій разом узятих.

То що ж із цим усім робити? А нічого) Ці всі штуки дійсно можуть дратувати і псувати настрій. Але таке можливо, якщо ти невиспаний, болить голова чи просто втомився. Тоді кожна мурашка виводить із себе) А коли проходить біль і втома, то всі ці штуки сприймаються як місцевий колорит) І якщо з цим колоритом познайомитись ближче, то нічого страшного, як виявляється, немає: зміюки ці неотруйні, собак можна відігнати, кажани літають високо і людей не чіпають, співаків в автобусі можна переглушити навушниками або повчити місцеві пісні, від пилу можна взяти зі собою тонік для очищення обличчя, і не смітити на пляжах та вулицях, щоб не додавати бруду в місцеві купи сміття. Єдиний мінус – москалів так просто не здихаєшся, але це питання намагається вирішити не одна нація, і тут Шрі-Ланка безсила.

Мабуть, недоліки можна знайти всюди, якщо дуже захотіти, але плюсів у Шрі-Ланки значно більше ;)
opened by YovaKem 1
Release UA-GEC version 2
This is a major corpus update that includes:

861 new documents (13,020 new sentences)!

Detailed annotations (22 error categories vs. 4 categories in v1)

GEC-only annotations

Multiple annotators per document (as indicated by doc.meta.annotator_id)
opened by asivokon 0
Add 205 snippets annotated by second annotator (Galyna)

This covers the entire test set plus 45 snippets in train.

Source parts of annotator 1 and annotator 2 should match exactly. However, this doesn't hold for 87 docs. Often, the mismatch is related to newlines and auto-replacements like ...=>…

Before merging the PR:

[x] Fix documents whose source parts diverge

[x] Make sure ./scripts/validate.py doesn't complain

opened by asivokon 0
Fix SomethingWrong annotations

The attached files are SomethingWrong fixes for the non-detailed annotated docs. We need to merge them into the current v2-dev. Unfortunately, I don't think this can be automated, since there might be edits in the surrounding text.

result-fix-wrong.zip

opened by asivokon 0
Double check files with no annotations
The following documents have no annotations at all. Often, it means that these are "perfect", error-free texts. Occasionally, there might be docs missed by annotators.

Please review and confirm that these docs don't need further correction (or correct them as needed).

Annotations missing in the train:

0117

0120

0153

0206

0238

0299

0344

0386

0399

0402

0460

0650

0690

0708

0758

0759

1872

1873

1874

1875

1876

1877

1878

1879

1880

1881

1882

1883

1884

1885

1886

1887

1888

1889

1890

Annotations missing in the test:

0683

0851
opened by asivokon 0
Preliminary v2 release

This commit contains new texts, collected and annotated in 2021.

However, it has a number of issues and is not ready for release yet.

Those issues are to be addressed in the subsequent PRs.

opened by asivokon 0
Data and sentence splitting fixes
This PR makes several changes:

Represent newlines with the \n sequence

Manually fix a dozen annotated documents for newlines, lists, tables

Better sentence-splitting. From now on source and target files are guaranteed to have the same number of lines. This, in particular, fixes issue #7

Regenerate derivative data views (source only, target only, tokenized, sentence-split, etc.) from the original annotated files on every release. This is to ensure they are always in sync.
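
The line-count guarantee above can be checked with a few lines of code. The following is an illustrative sketch, not the repository's validation script; count_lines and find_misaligned are hypothetical names, and the inputs are expected to be the already loaded source and target texts.

```python
def count_lines(text: str) -> int:
    """Count lines separated by "\n", with or without a trailing newline."""
    if not text:
        return 0
    return text.count("\n") + (0 if text.endswith("\n") else 1)


def find_misaligned(pairs):
    """Return (doc_id, n_source_lines, n_target_lines) for every mismatching pair.

    pairs is an iterable of (doc_id, source_text, target_text).
    """
    mismatches = []
    for doc_id, source_text, target_text in pairs:
        n_source, n_target = count_lines(source_text), count_lines(target_text)
        if n_source != n_target:
            mismatches.append((doc_id, n_source, n_target))
    return mismatches


# Made-up documents: the second one has lost a line in the target.
docs = [
    ("0001", "First.\nSecond.\n", "First.\nSecond.\n"),
    ("0002", "First.\nSecond.\n", "First.\n"),
]
print(find_misaligned(docs))  # [('0002', 2, 1)]
```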
opened by asivokon 0
Add `is_sensitive` to metadata
Added

is_sensitive (metadata flag) to mark documents that contain profanity.

stats.txt - contains detailed dataset statistics

Changed

Removed sensitive content markers ({ск}) from 8 documents
opened by asivokon 0
Prepare .m2 files
M2 specifics:

Annotations are done on a sentence level.

Texts are tokenized with Stanza.

The error type annotations are copied from the corpus.

There's a special document heading sentence added to the beginning of each document.

It looks like this: # 0123, where 0123 is the document ID.

This makes it possible to use document-level context.
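
To illustrate how the heading sentences could be used, here is a sketch that groups M2 sentences by document. It assumes the files follow the usual M2 layout ("S" lines for tokenized sentences, "A" lines for edits, blank lines between sentences) and that the heading survives tokenization as "#" followed by the ID; both points are assumptions, and the sample text is made up.

```python
import re

# Assumption: a heading sentence is "#" followed by the document ID.
HEADING_RE = re.compile(r"^#\s*(\d+)$")


def group_m2_by_document(m2_text: str) -> dict:
    """Return {doc_id: [(sentence, [edit lines]), ...]} from M2-formatted text."""
    documents = {}
    current = None
    for block in m2_text.strip().split("\n\n"):
        lines = block.strip().splitlines()
        if not lines or not lines[0].startswith("S "):
            continue
        sentence = lines[0][2:]
        edits = [line for line in lines[1:] if line.startswith("A ")]
        heading = HEADING_RE.match(sentence.strip())
        if heading:
            # A heading sentence starts a new document.
            current = heading.group(1)
            documents[current] = []
        elif current is not None:
            documents[current].append((sentence, edits))
    return documents


SAMPLE_M2 = """S # 0123

S I likes turtles .
A 1 2|||Grammar|||like|||REQUIRED|||-NONE-|||0

S They are cute .

S # 0124

S Hello .
"""

grouped = group_m2_by_document(SAMPLE_M2)
print(sorted(grouped))       # ['0123', '0124']
print(len(grouped["0123"]))  # 2
```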
opened by asivokon 1
Add generated M^2 files
This PR contains a couple of changes:

adds means to create a new data representation annotated-source-sentences - these are the split source sentences with annotations from the source document present;

adds scripts to automatically create M^2 files necessary for evaluation.

M^2 files derive from the existing data representations, so if the annotations, tokenization, or sentence splitting are faulty, there might be mistakes in the M^2 representation as well. Note that currently, running m2-scorer on tokenized-target-sentences as system output does not produce perfect scores, although it should. This should be fixed in future iterations.
opened by pavlo-kuchmiichuk 1

Owner

Grammarly

Millions of users rely on Grammarly's AI-powered products to make their messages, documents, and social media posts clear, mistake-free, and impactful.

GitHub https://ua-gec-dataset.grammarly.ai/

