UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language

Grammarly

Last update: Jan 2, 2023

Overview

This repository contains UA-GEC data and an accompanying Python library.

Data

All corpus data and metadata are stored under ./data. It has two subfolders, one for the train split and one for the test split.

Each split (train and test) has further subfolders for different data representations:

./data/{train,test}/annotated stores documents in the annotated format

./data/{train,test}/source and ./data/{train,test}/target store the original and the corrected versions of documents. Text files in these directories are plain text with no annotation markup. These files were produced from the annotated data and are, strictly speaking, redundant. We keep them because this format is convenient in some use cases.
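A minimal sketch of reading the plain-text views side by side. It assumes that a document has the same file name in the source and target directories, which the layout above suggests but does not state. `iter_pairs` is a hypothetical helper, not part of ua_gec.

```python
from pathlib import Path


def iter_pairs(data_dir, split="train"):
    """Yield (file name, source text, target text) for one split.

    Assumption: a document has the same file name under source/ and
    target/. Files without a counterpart are skipped.
    """
    source_dir = Path(data_dir) / split / "source"
    target_dir = Path(data_dir) / split / "target"
    for source_path in sorted(source_dir.iterdir()):
        target_path = target_dir / source_path.name
        if not (source_path.is_file() and target_path.is_file()):
            continue
        yield (
            source_path.name,
            source_path.read_text(encoding="utf-8"),
            target_path.read_text(encoding="utf-8"),
        )
```

Calling `iter_pairs("./data", "train")` from the repository root would then walk the train split.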

Metadata

./data/metadata.csv stores per-document metadata. It's a CSV file with the following fields:

id (str): document identifier.
author_id (str): document author identifier.
is_native (int): 1 if the author is a native speaker, 0 otherwise
region (str): the author's region of birth. A special value "Інше" is used both for authors who were born outside Ukraine and authors who preferred not to specify their region.
gender (str): could be "Жіноча" (female), "Чоловіча" (male), or "Інша" (other).
occupation (str): one of "Технічна", "Гуманітарна", "Природнича", "Інша"
submission_type (str): one of "essay", "translation", or "text_donation"
source_language (str): for submissions of the "translation" type, this field indicates the source language of the translated text. Possible values are "de", "en", "fr", "ru", and "pl".
annotator_id (int): ID of the annotator who corrected the document.
partition (str): one of "test" or "train"
is_sensitive (int): 1 if the document contains profanity or offensive language
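The fields above can be read with the standard csv module. This is a sketch that assumes metadata.csv starts with a header row using exactly these field names. `load_metadata` and `select_ids` are hypothetical helpers, not part of ua_gec.

```python
import csv


def load_metadata(path="./data/metadata.csv"):
    """Read per-document metadata into a list of dicts.

    Assumption: the first row is a header with the field names
    listed above. All values are returned as strings.
    """
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def select_ids(rows, partition="train", include_sensitive=False):
    """Return ids of the documents in a partition.

    Documents flagged with is_sensitive == "1" are dropped unless
    include_sensitive is True.
    """
    return [
        row["id"]
        for row in rows
        if row["partition"] == partition
        and (include_sensitive or row["is_sensitive"] != "1")
    ]
```

Because csv.DictReader returns strings, integer fields such as is_native and is_sensitive are compared as "0" and "1" here.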

Annotation format

Annotated files are text files that use the following in-text annotation format: {error=>edit:::error_type=Tag}, where error and edit stand for the text item before and after correction respectively, and Tag denotes an error category (Grammar, Spelling, Punctuation, or Fluency).

Example of an annotated sentence:

    I {likes=>like:::error_type=Grammar} turtles.

An accompanying Python package, ua_gec, provides many tools for working with annotated texts. See its documentation for details.
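To make the format concrete without the package, here is a small regex-based sketch. It assumes annotations are not nested and that neither side of an edit contains a curly brace. The names `ANNOTATION`, `parse_annotations`, `to_source`, and `to_target` are illustrative and are not the ua_gec API.

```python
import re

# Matches {error=>edit:::error_type=Tag}.
# Assumption: no nesting and no curly braces inside error or edit.
ANNOTATION = re.compile(
    r"\{(?P<error>[^{}]*?)=>(?P<edit>[^{}]*?):::error_type=(?P<tag>[^{}]+)\}"
)


def parse_annotations(text):
    """Return a list of (error, edit, tag) tuples found in the text."""
    return [(m["error"], m["edit"], m["tag"]) for m in ANNOTATION.finditer(text)]


def to_source(text):
    """Replace every annotation with its text before correction."""
    return ANNOTATION.sub(lambda m: m["error"], text)


def to_target(text):
    """Replace every annotation with its text after correction."""
    return ANNOTATION.sub(lambda m: m["edit"], text)


sample = "I {likes=>like:::error_type=Grammar} turtles."
print(parse_annotations(sample))  # [('likes', 'like', 'Grammar')]
print(to_source(sample))          # I likes turtles.
print(to_target(sample))          # I like turtles.
```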

Train-test split

We expect users of the corpus to train and tune their models on the train split only. Feel free to further split it into train-dev (or use cross-validation).

Please use the test split only for reporting scores of your final model. In particular, never optimize on the test set. Do not tune hyperparameters on it. Do not use it for model selection in any way.
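One possible way to carve a dev set out of the train split is a seeded shuffle of document ids, sketched below. The 10% fraction, the seed, and the helper name `split_train_dev` are assumptions made for illustration and are not prescribed by the corpus.

```python
import random


def split_train_dev(doc_ids, dev_fraction=0.1, seed=0):
    """Split train document ids into (train, dev) lists.

    The ids are sorted before shuffling, so the result depends only
    on the set of ids and the seed, not on the input order.
    """
    ids = sorted(doc_ids)
    random.Random(seed).shuffle(ids)
    n_dev = max(1, round(len(ids) * dev_fraction))
    return ids[n_dev:], ids[:n_dev]
```

Splitting by document keeps all sentences of a document on one side. Grouping by author_id instead would be a stricter variant.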

The next section lists the per-split statistics.

Statistics

UA-GEC contains:

Split	Documents	Sentences	Tokens	Authors
train	851	18,225	285,247	416
test	160	2,490	43,432	76
TOTAL	1,011	20,715	328,779	492

See stats.txt for detailed statistics generated by the following command (ua-gec must be installed first):

$ make stats

Python library

Instead of operating on the data files directly, you may use a Python package called ua_gec. This package includes the data and has classes to iterate over documents, read metadata, work with annotations, etc.

Getting started

The package can be installed with pip:

    $ pip install ua_gec==1.1

Alternatively, you can install it from the source code:

    $ cd python
    $ python setup.py develop

Iterating through corpus

Once installed, you may get annotated documents from the Python code:

    
    >>> from ua_gec import Corpus
    >>> corpus = Corpus(partition="train")
    >>> for doc in corpus:
    ...     print(doc.source)         # "I likes it."
    ...     print(doc.target)         # "I like it."
    ...     print(doc.annotated)      # AnnotatedText("I {likes=>like} it.")
    ...     print(doc.meta.region)    # "Київська"

Note that the doc.annotated property is of type AnnotatedText. This class is described in the next section.

Working with annotations

ua_gec.AnnotatedText is a class that provides tools for processing annotated texts. It can iterate over annotations, get annotation error type, remove some of the annotations, and more.

While we're working on detailed documentation, here is an example to get you started. It removes all Fluency annotations from a text:

    >>> from ua_gec import AnnotatedText
    >>> text = AnnotatedText("I {likes=>like:::error_type=Grammar} it.")
    >>> for ann in text.iter_annotations():
    ...     print(ann.source_text)       # likes
    ...     print(ann.top_suggestion)    # like
    ...     print(ann.meta)              # {'error_type': 'Grammar'}
    ...     if ann.meta["error_type"] == "Fluency":
    ...         text.remove(ann)         # or `text.apply(ann)`

Contributing

Data collection is an ongoing activity. You can always contribute your Ukrainian writings or complete one of the writing tasks at https://ua-gec-dataset.grammarly.ai/
Code and documentation improvements are welcome. Please submit a pull request.

Contacts

Comments

M2 format representation.

Hi there, That's a truly tremendous contribution to the development of the Ukrainian GEC. Great job! By the way, do you plan to add the m2 format representation/converter to this dataset?

opened by BogdanDidenko 2
Add GEC-only annotations
Background

Naively removing fluency edits may result in ungrammatical sentences. To avoid that, annotators validated the whole corpus with the fluency edits removed. They made edits to ensure that sentences stay grammatical even after the fluency edits are removed.

This branch contains the results of their work.
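For reference, the naive removal mentioned above can be sketched as follows. It assumes the v1 in-text format with a Fluency tag, and it assumes that "removing" a fluency edit means reverting to the original wording. `drop_fluency` is a hypothetical helper, not the procedure used to build this branch.

```python
import re

# Matches {error=>edit:::error_type=Tag}; assumes no nested annotations.
ANNOTATION = re.compile(
    r"\{(?P<error>[^{}]*?)=>(?P<edit>[^{}]*?):::error_type=(?P<tag>[^{}]+)\}"
)


def drop_fluency(text):
    """Revert Fluency edits to the original text; keep other annotations."""

    def revert(match):
        if match["tag"] == "Fluency":
            return match["error"]
        return match.group(0)

    return ANNOTATION.sub(revert, text)


sample = (
    "I {likes=>like:::error_type=Grammar} it "
    "{very=>really:::error_type=Fluency} much."
)
print(drop_fluency(sample))
# I {likes=>like:::error_type=Grammar} it very much.
```

Nothing in this sketch checks that the remaining edits still yield a grammatical sentence, which is the gap the annotators' validation pass addresses.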

Files structure

The commit temporarily adds two folders.

results-gec-only.src contains the snippets that were handed to annotators for the GEC-only annotation. These snippets:

should contain all snippets added in v1 and v2

have fluency annotations removed

have no detailed annotations

have no fixes applied to v1 and v2 since the data collection

results-gec-only.tgt is the result of the annotators' work:

should fix errors introduced by removing fluency annotations

sometimes annotators make edits unrelated to fluency removal (seems like they were fixing random errors along the way)

Future work

These are subtasks for integrating GEC-only into the release:

[ ] Copy detailed annotations here

[ ] Integrate fixes made to the main corpus here

[ ] Backport some edits from GEC-only to the main corpus (where edits are bugfixes and are not related to fluency removal)

[ ] Check for missing snippets and fix them

[ ] Update docs and find a proper place for GEC-only
opened by asivokon 1
Issue with data point 730 in train split
There appears to be a problem with the target rendering for data point 730 in the train split. Notice below that the source text contains 5 bullet points whereas the target text contains only 4, because the last two have been merged and this part of the text has been lost altogether: "Вони ходять по вулицях, чекають тебе під будинком, стрибають на тебе в пошуках їжі і гавкають ночами. Враховуючи мою нелюбов до бродячих псів, для мене це був стрес)"

>>> print(doc.source)

Рай чи пекло?

"І на сонці бувають плями" як то кажуть, тому і на цьому дивовижному острові є свої особливості, які можуть перетворити відпочинок у пекло. Для мене особисто такими недоліками стало наступне:

◇ живність - окрім китів, слонів і бурундуків, які милують око, тут є комарі, мурахи, кукарачі, павуки, ящірки, жаби, змії, варани, кажани і тд. І це не в зоопарку. Це у вас в номері/на віллі, на пляжі, на вулицях міста.

◇ шум - всі автобуси з музичним супроводом і дуже голосним, тому будьте готові слухати зірок місцевого шоу-бізу постійно, хочете ви того чи ні. Кондуктори в автобусах постійно кричать у відкриті двері, зазиваючи пасажирів, і на їх фоні наші крикуни "Рівне - Корець - Новоград" на столичному вокзалі, просто забиті тіхоні) На дорогах шалений трафік з автобусів, тук-туків, мопедів і велосипедів. Кожен з них бібікає приблизно раз на 20 секунд (поздоровкатись, попередити про обгін чи поворот, виказати незадоволення). Можете собі уявити яка це симфонія)

◇ бруд - пилючність на дорогах просто нереальна. Якщо проїхатись в автобусі біля відкритого віконечка, писок доведеться довго вмивати потім) В метрі від райського пляжу може бути купа сміття і нікого вона не займає. Така ж історія і у великих містах, шум, гам, срач і бардак.

◇ собаки - багато-багато-багато собак всюди=>усюди:::error_type=Spelling}. Вони ходять по вулицях, чекають тебе під будинком, стрибають на тебе в пошуках їжі і гавкають ночами. Враховуючи мою нелюбов до бродячих псів, для мене це був стрес)

◇ москалі їх ще більше ніж собак. І це такі москалі, в найгіршому їх прояві, в футболках з прапором "вєлікай" Рассєї або "Льоха рєшаєт фсьо", які кричать на весь пляж/ресторан, говорять до всіх російською і дивуються, чому місцеві їх не розуміють. Намагаються всюди влізти без черги і поводять себе максимально по хамськи, власне, характерна для них поведінка. І вони, насправді, дратують більше за комарів, мурах і змій, разом узятих.

То що ж з цим всим робити? А нічого) Ці всі штуки дійсно можуть дратувати і псувати настрій. Але таке можливо, якщо ти невиспаний, болить голова чи просто втомився. Тоді кожна мурашка виводить з себе) А коли проходить голова і втома, то всі ці штуки сприймаються як місцевий колорит) І якщо з цим колоритом познайомитись ближче, то нічого страшного, як виявляється, немає: зміюки ці неотруйні, собак можна відігнати, кажани літають високо і людей не чіпають, співаків в автобусі можна переглушити навушниками або повчити місцеві пісні, від пилу можна взяти з собою тонік для очистки обличчя, і не смітити на пляжах та вулицях, щоб не додавати бруду в місцеві купи сміття. Єдиний мінус - від москалів так просто не здихаєшся, але це питання намагається вирішити не одна нація і тут Шрі-Ланка безсила.

Мабуть недоліки можна знайти всюди, якщо дуже захотіти, але плюсів у Шрі-Ланки значно більше ;)

>>> print(doc.target)

Рай чи пекло?

"І на сонці бувають плями" як то кажуть, тому і на цьому дивовижному острові є свої особливості, які можуть перетворити відпочинок у пекло. Для мене особисто такими недоліками стали:

◇ живність – окрім китів, слонів і бурундуків, які милують око, тут є комарі, мурахи, кукарачі, павуки, ящірки, жаби, змії, варани, кажани і т. ін. І це не в зоопарку. Це у вас у номері/на віллі, на пляжі, на вулицях міста.

◇ шум – всі автобуси з музичним супроводом і дуже голосним, тому будьте готові слухати зірок місцевого шоу-бізу постійно, хочете ви того чи ні. Кондуктори в автобусах постійно кричать у відчинені двері, закликаючи пасажирів, і на їхньому тлі наші крикуни "Рівне – Корець – Новоград" на столичному вокзалі, – просто забиті тишки) На дорогах шалений трафік з автобусів, тук-туків, мопедів і велосипедів. Кожен із них бібікає приблизно раз на 20 секунд (поздоровкатись, попередити про обгін чи поворот, виказати невдоволення). Можете собі уявити, яка це симфонія)

◇ бруд – пилючність на дорогах просто нереальна. Якщо проїхатись в автобусі біля відчиненого віконечка, потім доведеться довго вмивати писок ) В метрі від райського пляжу може бути купа сміття і нікого вона не займає. Така сама історія і у великих містах, шум, гам, срач і бардак.

◇ собаки – багато-багато-багато собак москалі {-=>– їх ще більше ніж собак. І це такі москалі, в найгіршій їхній сутності, у футболках із прапором "вєлікай" Рассєї або "Льоха рєшаєт фсьо", які кричать на весь пляж/ресторан, говорять до всіх російською і дивуються, чому місцеві їх не розуміють. Намагаються всюди влізти без черги і поводять себе максимально по-хамськи, власне, характерна для них поведінка. І вони, насправді, дратують більше за комарів, мурах і змій разом узятих.

То що ж із цим усім робити? А нічого) Ці всі штуки дійсно можуть дратувати і псувати настрій. Але таке можливо, якщо ти невиспаний, болить голова чи просто втомився. Тоді кожна мурашка виводить із себе) А коли проходить біль і втома, то всі ці штуки сприймаються як місцевий колорит) І якщо з цим колоритом познайомитись ближче, то нічого страшного, як виявляється, немає: зміюки ці неотруйні, собак можна відігнати, кажани літають високо і людей не чіпають, співаків в автобусі можна переглушити навушниками або повчити місцеві пісні, від пилу можна взяти зі собою тонік для очищення обличчя, і не смітити на пляжах та вулицях, щоб не додавати бруду в місцеві купи сміття. Єдиний мінус – москалів так просто не здихаєшся, але це питання намагається вирішити не одна нація, і тут Шрі-Ланка безсила.

Мабуть, недоліки можна знайти всюди, якщо дуже захотіти, але плюсів у Шрі-Ланки значно більше ;)
opened by YovaKem 1
Release UA-GEC version 2
This is a major corpus update that includes:

861 new documents (13,020 new sentences)!

Detailed annotations (22 error categories vs. 4 categories in v1)

GEC-only annotations

Multiple annotators per document (as indicated by doc.meta.annotator_id)
opened by asivokon 0
Add 205 snippets annotated by second annotator (Galyna)

This covers all test set + 45 snippets in train

Before merging:

The source parts of annotator 1 and annotator 2 should match exactly. However, this doesn't hold for 87 docs. Often, this is related to newlines and auto-replacements like ...=>…

[x] Fix documents whose source parts diverge

[x] Make sure ./scripts/validate.py doesn't complain

opened by asivokon 0
Fix SomethingWrong annotations

The attached files are SomethingWrong fixes for the non-detailed annotated docs. We need to merge them into the current v2-dev. Unfortunately, I don't think this can be automated, since there might be edits in the surrounding text.

result-fix-wrong.zip

opened by asivokon 0
Double check files with no annotations
The following documents have no annotations at all. Often, it means that these are "perfect", error-free texts. Occasionally, there might be docs missed by annotators.

Please review and confirm that these docs don't need further correction (or correct them as needed).

Annotations missing in the train:

0117

0120

0153

0206

0238

0299

0344

0386

0399

0402

0460

0650

0690

0708

0758

0759

1872

1873

1874

1875

1876

1877

1878

1879

1880

1881

1882

1883

1884

1885

1886

1887

1888

1889

1890

Annotations missing in the test:

0683

0851
opened by asivokon 0
Preliminary v2 release

This commit contains new texts, collected and annotated in 2021.

However, it has a number of issues and is not ready for release yet.

Those issues are to be addressed in the subsequent PRs.

opened by asivokon 0
Data and sentence splitting fixes
This PR makes several changes:

Represent newlines with the \n sequence

Manually fix a dozen annotated documents for newlines, lists, tables

Better sentence-splitting. From now on source and target files are guaranteed to have the same number of lines. This, in particular, fixes issue #7

Regenerate derivative data views (source only, target only, tokenized, sentence-split, etc.) from the original annotated files on every release. This is to ensure they are always in sync.
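The line-count guarantee can be spot-checked with a few lines of Python. This sketch assumes the source and target views live in two directories with matching file names. The directory arguments and the helper names are illustrative and do not refer to the repository's own validation script.

```python
from pathlib import Path


def count_lines(path):
    """Count the lines of a UTF-8 text file."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)


def find_mismatches(source_dir, target_dir):
    """Return (file name, source lines, target lines) for files that differ.

    Assumption: a document has the same file name in both directories.
    Files without a counterpart are ignored.
    """
    mismatches = []
    for source_path in sorted(Path(source_dir).iterdir()):
        target_path = Path(target_dir) / source_path.name
        if not (source_path.is_file() and target_path.is_file()):
            continue
        n_source = count_lines(source_path)
        n_target = count_lines(target_path)
        if n_source != n_target:
            mismatches.append((source_path.name, n_source, n_target))
    return mismatches
```

An empty result means every paired file has the same number of lines.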
opened by asivokon 0
Add `is_sensitive` to metadata
Added

is_sensitive (metadata flag) to mark documents that contain profanity.

stats.txt - contains detailed dataset statistics

Changed

Removed sensitive content markers ({ск}) from 8 documents
opened by asivokon 0
Prepare .m2 files
M2 specifics:

Annotations are done on a sentence level.

Texts are tokenized with Stanza.

The error type annotations are copied from the corpus

There's a special document heading sentence added to the beginning of each document.

It looks like this: # 0123, where 0123 is the document ID.

This makes it possible to use document-level context.
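A sketch of how the heading sentences could be used to regroup M2 sentences by document. It assumes the standard M2 layout (blocks separated by blank lines, each starting with an "S " line) and that the heading appears as the line "S # 0123". Both the exact heading shape and the helper name `group_by_document` are assumptions.

```python
import re

# Assumed shape of the document heading sentence in the .m2 file.
HEADING = re.compile(r"^S # (\d+)$")


def group_by_document(m2_text):
    """Map document ID to its list of M2 sentence blocks.

    Heading blocks themselves are not included in the lists.
    Sentences that appear before any heading are ignored.
    """
    docs = {}
    current = None
    for block in m2_text.strip().split("\n\n"):
        lines = block.strip().splitlines()
        if not lines:
            continue
        match = HEADING.match(lines[0])
        if match:
            current = match.group(1)
            docs.setdefault(current, [])
        elif current is not None:
            docs[current].append(block.strip())
    return docs
```

With the sentences regrouped this way, a model or an evaluation script can see neighboring sentences of the same document.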
opened by asivokon 1
Add generated M^2 files
This PR contains a couple of changes:

adds means to create a new data representation annotated-source-sentences - these are the split source sentences with annotations from the source document present;

adds scripts to automatically create M^2 files necessary for evaluation.

M^2 files derive from the existing data representations, thus if the annotations/tokenization/sentence splitting is faulty, there might be mistakes in M^2 representation as well. Note that currently, running m2-scorer on tokenized-target-sentences as system output does not produce perfect scores, although it should. This should be fixed in future iterations.
opened by pavlo-kuchmiichuk 1
