This repository provides implementations of data augmentation techniques for Japanese NLP.

Koga Kobayashi

Last update: Nov 11, 2022

Overview

daaja

This repository implements the data augmentation techniques from the following papers for Japanese NLP:

EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks
An Analysis of Simple Data Augmentation for Named Entity Recognition

Install

pip install daaja

How to use

EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks

Command

python -m daaja.eda.run --input input.tsv --output data_augmentor.tsv

The format of input.tsv is as follows:

1	この映画はとてもおもしろい
0	つまらない映画だった
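
As the sample shows, each line holds a label and a sentence separated by a tab. The following is a minimal sketch of producing and reading such a file with Python's standard csv module. The helper names write_eda_tsv and read_eda_tsv are hypothetical and are not part of daaja.

```python
import csv


def write_eda_tsv(path, rows):
    """Write (label, text) pairs, one tab-separated pair per line."""
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        writer.writerows(rows)


def read_eda_tsv(path):
    """Read the file back into a list of (label, text) tuples."""
    with open(path, encoding="utf-8", newline="") as f:
        return [(label, text) for label, text in csv.reader(f, delimiter="\t")]
```

Calling write_eda_tsv("input.tsv", [("1", "この映画はとてもおもしろい"), ("0", "つまらない映画だった")]) would produce the two lines shown above.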

In Python

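from daaja.eda import EasyDataAugmentor
# alpha_sr, alpha_ri, alpha_rs: rates for synonym replacement, random insertion, and random swap
# p_rd: probability of random deletion; num_aug: number of augmented sentences to generate
augmentor = EasyDataAugmentor(alpha_sr=0.1, alpha_ri=0.1, alpha_rs=0.1, p_rd=0.1, num_aug=4)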
text = "日本語でデータ拡張を行う"
aug_texts = augmentor.augments(text)
print(aug_texts)
# ['日本語でを拡張データ行う', '日本語でデータ押広げるを行う', '日本語でデータ拡張を行う', '日本語で智見拡張を行う', '日本語でデータ拡張を行う']
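
To give a feel for what the swap and deletion parameters control, here is a rough sketch of two of the four EDA operations on an already tokenized sentence. It is not daaja's implementation. The tokenization, the function names, and the choice to floor the swap count at 1 are all assumptions made for illustration.

```python
import random


def random_swap(tokens, n, rng):
    """Swap two randomly chosen positions, repeated n times."""
    tokens = list(tokens)  # copy so the caller's list is untouched
    if len(tokens) < 2:
        return tokens
    for _ in range(n):
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens


def random_deletion(tokens, p, rng):
    """Drop each token with probability p, always keeping at least one."""
    if len(tokens) <= 1:
        return list(tokens)
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]


rng = random.Random(0)
tokens = ["日本語", "で", "データ", "拡張", "を", "行う"]  # assumed tokenization
n_swaps = max(1, int(0.1 * len(tokens)))  # assumption: alpha_rs * length, at least 1
print("".join(random_swap(tokens, n_swaps, rng)))
print("".join(random_deletion(tokens, 0.1, rng)))
```

Synonym replacement and random insertion additionally need a thesaurus lookup, so they are left out of this sketch.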

An Analysis of Simple Data Augmentation for Named Entity Recognition

Command

python -m daaja.ner_sda.run --input input.tsv --output data_augmentor.tsv

The format of input.tsv is as follows:

私	O
は	O
田中	B-PER
と	O
いい	O
ます	O
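
Each line holds one token and its BIO label separated by a tab. The sketch below groups such lines into the parallel tokens_list and labels_list structures used in the Python example. The function name read_ner_tsv is hypothetical, and the use of a blank line as a sentence separator is an assumption (CoNLL style) that the sample above does not confirm.

```python
def read_ner_tsv(lines):
    """Group 'token<TAB>label' lines into parallel token and label lists.

    Assumption: a blank line separates sentences (CoNLL style).
    """
    tokens_list, labels_list = [], []
    tokens, labels = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence, if there is one.
            if tokens:
                tokens_list.append(tokens)
                labels_list.append(labels)
                tokens, labels = [], []
            continue
        token, label = line.split("\t")
        tokens.append(token)
        labels.append(label)
    if tokens:  # flush the last sentence when the input has no trailing blank line
        tokens_list.append(tokens)
        labels_list.append(labels)
    return tokens_list, labels_list
```

It accepts any iterable of lines, so an open file object can be passed directly.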

In Python

from daaja.ner_sda import SimpleDataAugmentationforNER
tokens_list = [
    ["私", "は", "田中", "と", "いい", "ます"],
    ["筑波", "大学", "に", "所属", "して", "ます"],
    ["今日", "から", "筑波", "大学", "に", "通う"],
    ["茨城", "大学"],
]
labels_list = [
    ["O", "O", "B-PER", "O", "O", "O"],
    ["B-ORG", "I-ORG", "O", "O", "O", "O"],
    ["B-DATE", "O", "B-ORG", "I-ORG", "O", "O"],
    ["B-ORG", "I-ORG"],
]
augmentor = SimpleDataAugmentationforNER(
    tokens_list=tokens_list, labels_list=labels_list,
    p_power=1, p_lwtr=1, p_mr=1, p_sis=1, p_sr=1, num_aug=4,
)
tokens = ["吉田", "さん", "は", "株式", "会社", "A", "に", "出張", "予定", "だ"]
labels = ["B-PER", "O", "O", "B-ORG", "I-ORG", "I-ORG", "O", "O", "O", "O"]
augmented_tokens_list, augmented_labels_list = augmentor.augments(tokens, labels)
print(augmented_tokens_list)
# [['吉田', 'さん', 'は', '株式', '会社', 'A', 'に', '出張', '志す', 'だ'],
#  ['吉田', 'さん', 'は', '株式', '大学', '大学', 'に', '出張', '予定', 'だ'],
#  ['吉田', 'さん', 'は', '株式', '会社', 'A', 'に', '出張', '予定', 'だ'],
#  ['吉田', 'さん', 'は', '筑波', '大学', 'に', '出張', '予定', 'だ'],
#  ['吉田', 'さん', 'は', '株式', '会社', 'A', 'に', '出張', '予定', 'だ']]
print(augmented_labels_list)
# [['B-PER', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'O', 'O', 'O', 'O'],
#  ['B-PER', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'O', 'O', 'O', 'O'],
#  ['B-PER', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'O', 'O', 'O', 'O'],
#  ['B-PER', 'O', 'O', 'B-ORG', 'I-ORG', 'O', 'O', 'O', 'O'],
#  ['B-PER', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'O', 'O', 'O', 'O']]
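
In the fourth augmented example above, a three-token ORG mention has become a two-token one, which is consistent with the mention replacement technique described in the NER paper. The following sketch shows how such a replacement could work on BIO-labelled data. It is an illustration only, not daaja's code: all function names are hypothetical, and mapping the parameter p to p_mr is an assumption.

```python
import random
from collections import defaultdict


def extract_mentions(tokens, labels):
    """Return (start, end, entity_type) spans from BIO labels; end is exclusive."""
    spans, start = [], None
    for i, label in enumerate(labels + ["O"]):  # sentinel closes a trailing span
        if start is not None and not label.startswith("I-"):
            spans.append((start, i, labels[start][2:]))
            start = None
        if label.startswith("B-"):
            start = i
    return spans


def build_mention_pool(tokens_list, labels_list):
    """Collect every training mention, grouped by entity type."""
    pool = defaultdict(list)
    for tokens, labels in zip(tokens_list, labels_list):
        for start, end, ent in extract_mentions(tokens, labels):
            pool[ent].append(tokens[start:end])
    return pool


def mention_replacement(tokens, labels, pool, p, rng):
    """With probability p, swap each mention for a pooled mention of the same type."""
    new_tokens, new_labels, cursor = [], [], 0
    for start, end, ent in extract_mentions(tokens, labels):
        new_tokens += tokens[cursor:start]
        new_labels += labels[cursor:start]
        mention = tokens[start:end]
        if pool.get(ent) and rng.random() < p:
            mention = rng.choice(pool[ent])
        new_tokens += mention
        # Re-label the (possibly shorter or longer) mention in BIO format.
        new_labels += ["B-" + ent] + ["I-" + ent] * (len(mention) - 1)
        cursor = end
    new_tokens += tokens[cursor:]
    new_labels += labels[cursor:]
    return new_tokens, new_labels


pool = build_mention_pool(
    [["筑波", "大学", "に", "所属", "して", "ます"], ["茨城", "大学"]],
    [["B-ORG", "I-ORG", "O", "O", "O", "O"], ["B-ORG", "I-ORG"]],
)
tokens = ["吉田", "さん", "は", "株式", "会社", "A", "に", "出張", "予定", "だ"]
labels = ["B-PER", "O", "O", "B-ORG", "I-ORG", "I-ORG", "O", "O", "O", "O"]
print(mention_replacement(tokens, labels, pool, 1.0, random.Random(0)))
```

Because the labels are regenerated from the replacement mention, the token and label sequences stay the same length even when the mention length changes.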

Comments

too many progress bars

When I use EasyDataAugmentor during training, the console fills up with progress bars.

Could the tqdm call on line 19 of sequential_flow.py be made switchable on and off when constructing EasyDataAugmentor? https://github.com/kajyuuen/daaja/blob/12835943868d43f5c248cf1ea87ab60f67a6e03d/daaja/flows/sequential_flow.py#L19

opened by Yongtae723 6
Error on from daaja.methods.eda.easy_data_augmentor import EasyDataAugmentor

After installing daaja with pip, running from daaja.methods.eda.easy_data_augmentor import EasyDataAugmentor raises the following error: ConnectionError: HTTPConnectionPool(host='compling.hss.ntu.edu.sg', port=80): Max retries exceeded with url: /wnja/data/1.1/wnjpn.db.gz (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f3b6a6cced0>: Failed to establish a new connection: [Errno 110] Connection timed out'))

opened by naoki1213mj 5
Is it possible to run this on a GPU?

Hi!

Thank you for the great library. When I train with this augmentation, it takes much more time than the forward and backward passes.

Could the augmentation run on a GPU to save time?

Thank you

opened by Yongtae723 3
Bump joblib from 1.1.0 to 1.2.0
Bumps joblib from 1.1.0 to 1.2.0.

Changelog

Sourced from joblib's changelog.

Release 1.2.0

Fix a security issue where eval(pre_dispatch) could potentially run arbitrary code. Now only basic numerics are supported. joblib/joblib#1327

Make sure that joblib works even when multiprocessing is not available, for instance with Pyodide joblib/joblib#1256

Avoid unnecessary warnings when workers and main process delete the temporary memmap folder contents concurrently. joblib/joblib#1263

Fix memory alignment bug for pickles containing numpy arrays. This is especially important when loading the pickle with mmap_mode != None as the resulting numpy.memmap object would not be able to correct the misalignment without performing a memory copy. This bug would cause invalid computation and segmentation faults with native code that would directly access the underlying data buffer of a numpy array, for instance C/C++/Cython code compiled with older GCC versions or some old OpenBLAS written in platform specific assembly. joblib/joblib#1254

Vendor cloudpickle 2.2.0 which adds support for PyPy 3.8+.

Vendor loky 3.3.0 which fixes several bugs including:

robustly forcibly terminating worker processes in case of a crash (joblib/joblib#1269);

avoiding leaking worker processes in case of nested loky parallel calls;

reliably spawning the correct number of reusable workers.

Release 1.1.1

Fix a security issue where eval(pre_dispatch) could potentially run arbitrary code. Now only basic numerics are supported. joblib/joblib#1327

Commits

5991350 Release 1.2.0

3fa2188 MAINT cleanup numpy warnings related to np.matrix in tests (#1340)

cea26ff CI test the future loky-3.3.0 branch (#1338)

8aca6f4 MAINT: remove pytest.warns(None) warnings in pytest 7 (#1264)

067ed4f XFAIL test_child_raises_parent_exits_cleanly with multiprocessing (#1339)

ac4ebd5 MAINT add back pytest warnings plugin (#1337)

a23427d Test child raises parent exits cleanly more reliable on macos (#1335)

ac09691 [MAINT] various test updates (#1334)

4a314b1 Vendor loky 3.2.0 (#1333)

bdf47e9 Make test_parallel_with_interactively_defined_functions_default_backend timeo...

Additional commits viewable in compare view

dependencies
opened by dependabot[bot] 0
Implement Data Augmentation using Pre-trained Transformer Models
paper

Data Augmentation using Pre-trained Transformer Models

code

https://github.com/varunkumar-dev/TransformersDataAugmentation

ref

https://www.ai-shift.co.jp/techblog/1939

add-new-technique
opened by kajyuuen 0
Implement Contextual Augmentation
Paper

Contextual Augmentation: Data Augmentation by Words with Paradigmatic Relations

Code

https://github.com/pfnet-research/contextual_augmentation

add-new-technique
opened by kajyuuen 0
Implement MixText
Paper

MixText: Linguistically-Informed Interpolation of Hidden Space for Semi-Supervised Text Classification

Code

https://github.com/GT-SALT/MixText

add-new-technique
opened by kajyuuen 0

Releases (v0.0.7)

v0.0.7 (Oct 24, 2022)
Changes

Change pytest @kajyuuen (#35 #37 #38)

Change WORDNER_URL @kajyuuen (#34)

v0.0.6 (Mar 3, 2022)
Changes

Update version @kajyuuen (#27)

Add verbose option @kajyuuen (#25)

📖 Documentation

Add README_ja.md and Update README.md @kajyuuen (#26)

v0.0.5 (Feb 27, 2022)
Changes

💪 Enhancement

Add ContextualAugmentor @kajyuuen (#23)

Add BackTranslationAugmentor @kajyuuen (#21, #22)

📖 Documentation

Add quick_example @kajyuuen (#17)

v0.0.4 (Feb 21, 2022)
Changes

Release v0.0.4 @kajyuuen (#16)

Chore add release drafter @kajyuuen (#6)

💪 Enhancement

Add tqdm @kajyuuen (#8)

📖 Documentation

Refactoring @kajyuuen (#15)

Add SDA example @kajyuuen (#9)

Add EDA example @kajyuuen (#7)

v0.0.3 (Feb 13, 2022)
v0.0.2 (Feb 13, 2022)
