NL-Augmenter 🦎 → 🐍 A Collaborative Repository of Natural Language Transformations

Overview

NL-Augmenter 🦎 → 🐍

The NL-Augmenter is a collaborative effort intended to add transformations of datasets dealing with natural language. Transformations augment text datasets in diverse ways, including: introducing spelling errors, translating to a different language, randomizing names and numbers, paraphrasing ... and whatever creative augmentation you contribute to the benchmark. We invite submissions of transformations to this framework by way of GitHub pull request, through September 1, 2021. All submitters of accepted transformations (and filters) will be included as co-authors on a paper announcing this framework.

The framework organizers can be contacted at [email protected].

Submission timeline

Due date | Description
September 1, 2021 | Pull request must be opened to be eligible for inclusion in the framework and associated paper
September 22, 2021 | Review process for the pull request above must be complete

A transformation can be revised between the pull request submission and pull request merge deadlines. We will provide reviewer feedback to help with the revisions.

The transformations which are already accepted to NL-Augmenter are summarized in this table. Transformations undergoing review can be seen as pull requests.

Colab notebook

To quickly see transformations and filters in action, run through our Colab notebook.

Installation

Requirements

  • Python 3.7

Instructions

# When creating a new transformation, replace this with your forked repository (see below)
git clone https://github.com/GEM-benchmark/NL-Augmenter.git
cd NL-Augmenter
python setup.py sdist
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.0.tar.gz

How do I create a transformation?

Setup

First, fork the repository in GitHub! 🍴


Your fork will have its own location, which we will call PATH_TO_YOUR_FORK. Next, clone the forked repository and create a branch for your transformation, which here we will call my_awesome_transformation:

git clone $PATH_TO_YOUR_FORK
cd NL-Augmenter
git checkout -b my_awesome_transformation

We will base our transformation on an existing example. Create a new transformation directory by copying over an existing transformation:

cd transformations/
cp -r butter_fingers_perturbation my_awesome_transformation
cd my_awesome_transformation

Creating a transformation

  1. In the file transformation.py, rename the class ButterFingersPerturbation to MyAwesomeTransformation and choose one of the interfaces from the interfaces/ folder. See the full list of options here.
  2. Now put all your creativity into implementing the generate method. If you intend to use external libraries, add them with their version numbers to requirements.txt.
  3. Update my_awesome_transformation/README.md to describe your transformation.
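As a sketch of step 2, a minimal transformation might look like the following (the class name and the adjacent-letter-swap logic are illustrative, not the actual butter_fingers implementation; the exact generate signature is defined by the interface chosen in step 1):

```python
import random


class MyAwesomeTransformation:
    """Hypothetical sentence-level transformation that swaps adjacent
    letters to simulate typing errors."""

    def __init__(self, seed: int = 0, prob: float = 0.1):
        self.seed = seed
        self.prob = prob

    def generate(self, sentence: str) -> list:
        random.seed(self.seed)
        chars = list(sentence)
        for i in range(len(chars) - 1):
            # Only perturb letter pairs, and only with probability `prob`.
            if chars[i].isalpha() and chars[i + 1].isalpha() and random.random() < self.prob:
                chars[i], chars[i + 1] = chars[i + 1], chars[i]
        # NL-Augmenter operations return a list of perturbed outputs.
        return ["".join(chars)]
```

Seeding in the constructor keeps the perturbation deterministic, which is what makes recorded test cases reproducible.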

Testing and evaluating (Optional)

Once you are done, add at least 5 example pairs as test cases in the file test.json so that no one breaks your code inadvertently.
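The structure of test.json mirrors the example you copied; for a sentence-level transformation it looks roughly like this sketch (the sentences and perturbed outputs here are illustrative; copy the exact field names from the forked transformation's test.json):

```json
{
    "type": "my_awesome_transformation",
    "test_cases": [
        {
            "class": "MyAwesomeTransformation",
            "inputs": {"sentence": "Andrew played cricket in England."},
            "outputs": [{"sentence": "Andrew palyed cricket in Egnland."}]
        }
    ]
}
```

The test runner compares each recorded output against what the transformation currently produces, so an unintended behavior change shows up as a test failure.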

Once the transformation is ready, test it:

pytest -s --t=my_awesome_transformation

If you would like to evaluate your transformation against a common 🤗 HuggingFace model, we encourage you to check the evaluation instructions.

Code Styling

To standardize the code, we use the black code formatter, which runs as a pre-commit hook. To use the hook, install pre-commit with pip install pre-commit (it should already be installed if you followed the instructions above). Then run pre-commit install to install the hook. On future commits, you should see the black formatter run on all Python files you've staged for commit.

Submitting

Once the tests pass and you are happy with the transformation, submit it for review. First, commit and push your changes:

git add transformations/my_awesome_transformation/*
git commit -m "Added my_awesome_transformation"
git push --set-upstream origin my_awesome_transformation

Finally, submit a pull request. The last git push command prints a URL that can be copied into a browser to initiate such a pull request. Alternatively, you can do so from the GitHub website.


✨ Congratulations, you've submitted a transformation to NL-Augmenter! ✨

How do I create a filter?

We also accept pull requests for creating filters, which identify interesting subpopulations of a dataset. The process for adding a new filter is the same as above, except that filter implementations implement .filter instead of .generate and are placed in the filters folder. So, just as transformations can transform examples of text, filters can identify whether an example follows some pattern of text! The only difference is that while a transformation returns another example of the same input format, a filter simply returns True or False! For step-by-step instructions, follow these steps.
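To make the transformation/filter contrast concrete, here is a minimal, hypothetical filter (the class name and the word-count criterion are illustrative; real filters implement the .filter method of one of the interfaces in the interfaces/ folder):

```python
class LengthFilter:
    """Hypothetical filter that keeps only short examples."""

    def __init__(self, max_words: int = 10):
        self.max_words = max_words

    def filter(self, sentence: str) -> bool:
        # Unlike a transformation's generate, filter returns a boolean:
        # True when the example belongs to the subpopulation of interest.
        return len(sentence.split()) <= self.max_words
```

A filter like this could be used to select the subpopulation of short sentences before measuring a model's behavior on it.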

Comments
  • My french_conjugation_transformation


    I faced this issue when using pre-commit:


    When I went to check the pre-commit-config.yaml file, it said that the repo for this hook is local.


    Since the black, flake8, and isort hooks passed, I committed the code with the following command:

    git commit -m "My french_conjugation_transformation" -n

    transformation 
    opened by Louanes1 20
  • Spacy behaves differently when testing one case vs testing all cases


    It seems Spacy's tokenizer behaves differently when I run pytest -s --t=emojify and pytest -s --t=light --f=light.

    For example, I added the following snippet in my generate() function:

    print([str(t) for t in self.nlp(sentence)])
    

    With input sentence "Apple is looking at buying U.K. startup for $132 billion."

    pytest -s --t=emojify gives:

    ['Apple', 'is', 'looking', 'at', 'buying', 'U.K.', 'startup', 'for', '$', '132', 'billion', '.']
    

    However, pytest -s --t=light --f=light gives:

    ['Apple', 'is', 'looking', 'at', 'buying', 'U.K.', 'startup', 'for', '$1', '32', 'billion.']
    

    I use the following code to load spacy:

    import spacy
    from initialize import spacy_nlp
    self.nlp = spacy_nlp if spacy_nlp else spacy.load("en_core_web_sm")
    

    It looks very strange. Am I overlooking something?

    bug 
    opened by xiaohk 19
  • add tests for entity_mention_replacement_ner


    opened by uyaseen 15
  • Added Backtranslation for NER


    This transformation adapts backtranslation to the task of NER by generating paraphrases of the contexts around the entity mention(s) using backtranslation. It can be used as a data augmentation strategy to improve the underlying sequence (NER) model as reported by Yaseen and Langer, 2021.

    Contributors: Usama Yaseen ([email protected]), Stefan Langer ([email protected])

    Affiliation: Siemens AG

    transformation 
    opened by uyaseen 14
  • Ocr perturbation


    This PR introduces the "OCR perturbation" transformation, which directly induces Optical Character Recognition (OCR) errors into the input text. It renders the input sentence as an image and recognizes the rendered text using an OCR engine.

    transformation 
    opened by mnamysl 13
  • Adding financial amounts replacement


    This transformation consistently replaces financial amounts throughout a text. The replacement changes the amount, the writing format, as well as the currency of the financial amount. The change is consistent with regard to:

    • the modifier used to change all amounts of the same currency throughout the text.
      • e.g., the sentence I owe Fred € 20 and I need € 10 for the bus. might be changed to I owe Fred 2 906.37 Yen and I need 1 453.19 Yen for the bus.
    • the modifier used to change the amounts so that new amounts are relatively close to the original amount.
    • the rate used for a change of currency, reflecting the actual bank rate.
    transformation 
    opened by ahonore 10
  • Named Entity count filter


    This filter allows filtering data based on the counts of named entities, for more fine-grained analysis of text generation systems with respect to named entities in the input. The PR includes test cases + a Readme for more details. Thanks.

    filter 
    opened by vyraun 10
  • Add NegationStrengthen


    These augmentations convert causal sentences' direction and/or strength based on grammar rules and dependency trees. We introduce augmentations that amend both sentence and target (SentenceAndTargetOperation).

    transformation 
    opened by tanfiona 10
  • Style transfer paraphrasing


    Hello,

    This paraphraser is an enabler to use GPT-2 paraphrasers, originally trained by Krishna et al. for the paper Reformulating Unsupervised Style Transfer as Paraphrase Generation. Currently, I have integrated 6 different GPT-2 paraphrasers with different styles that I have tested and verified to work, but I might integrate more by uploading them to Huggingface (WIP).

    The currently supported styles are:

    • Shakespeare
    • Switchboard (Conversation Speech)
    • Tweets
    • Bible
    • Romantic poetry
    • Basic

    The paraphraser is quite similar to formality_change, but more diverse and general, and works for several different styles. The "Switchboard" model targets a "Conversational Speech" style similar to formality.

    Model sizes

    Each model is about 3.25 GB, so the testing will require a GPU if it has to be quick (as with any large language model).

    A comment about the tests

    Note that the tests currently only cover the Shakespeare model. I was unsure how to add different tests for different models. Perhaps subclasses for the different models would be a better option than specifying the specific style, at least given the testing environment. Can I add instantiation arguments for the different tests somehow?

    License

    I was unsure how to handle the LICENSE from the original repository. Much of the code is based upon the original code (which I am not an author of) made by @martiansideofthemoon and his co-authors, but has been greatly modified. I therefore included the license (which is MIT), but please correct me if this is not necessary.

    No new requirements

    No new requirements needed to be added; the current model and the code I adapted from the original repository were compatible with the requirements already in the repository.

    (Edit, I accidentally clicked and submitted the PR before finishing what I was writing here).

    transformation 
    opened by Filco306 9
  • Add Color Transformation


    This transformation augments the input sentence by randomly replacing colors.

    For example,

    I bought this pink shoes today! Isn't it pretty?

    becomes

    I bought this misty rose shoes today! Isn't it pretty?

    transformation 
    opened by seungjaeryanlee 9
  • Global seed changes


    This PR is for the changes discussed in #145. I thought removing it from the interface made the most sense, so we don't pass any redundant information and only set the seed for both random and numpy once in initialize.py. I've also used this to fix some remaining spacy.load issues in some of the filters/transformations so they are loaded only once in initialize.py.

    Would be nice if you could test all things to make sure I didn't break anything.

    Draft for now, since I think some tests will need to be adjusted because they used a different seed.

    opened by SirRob1997 9
  • `NumberToWord` is not loadable (likely due to hyphens in folder name)


    It does not appear possible to load the NumberToWord transformation after installing nlaugmenter.

    https://github.com/GEM-benchmark/NL-Augmenter/blob/main/nlaugmenter/transformations/number-to-word/transformation.py

    This is likely due to the hyphens in number-to-word, which make the folder an invalid Python module name and break path loading.

    opened by fabriceyhc 0
  • `Formal2Casual` fails to load due to unavailable huggingface model


    from nlaugmenter.transformations.formality_change.transformation import Formal2Casual
    
    OSError: prithivida/parrot_adequacy_on_BART is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
    If this is a private repository, make sure to pass a token having permission to this repo with `use_auth_token` or log in with `huggingface-cli login` and pass `use_auth_token=True`.
    

    The model (prithivida/parrot_adequacy_on_BART) is indeed not available on huggingface anymore. Perhaps an acceptable alternative is to use prithivida/parrot_adequacy_model instead?

    opened by fabriceyhc 0
  • Swap Transformations


    Thank you for your great work! It's super useful!

    I have a suggestion for improvement: some transformations work on a "swap" principle. For example, in GenderSwap, if we had "sister" in the original sentence then it would be transformed to "brother" and vice versa. There are scenarios where it's important to know which direction the transformation went, female to male or male to female. In my case, for example, I want to compare the performance of my model on female/male sentences at inference time.

    I really liked the way TenseTransformation works. You need to specify in the constructor what tense (past/present/future) you want to transform to. Maybe that could be applicable for other swap transformations?

    Thanks again!

    opened by shachardon 0
  • Chinese2digits


    This transformation converts numbers written in Chinese to digits, such as "一" (one) to 1, "二" (two) to 2, and "百" (a hundred) to 100. This transformation is a vital component of Chinese NLP systems. It would benefit all tasks that take a sentence/paragraph/document with numbers written in Chinese characters as input, such as text classification, text generation, etc.

    opened by JerryX1110 0
  • Adding ambiguous characters filter


    I'm creating a new PR as per the request in https://github.com/GEM-benchmark/NL-Augmenter/pull/274.

    I also squashed all prior commits into one. The squashed commits were:

    • Renaming ambiguousfilter folder
    • Adding alphanumeric characters filter
    • Removing unused imports and dead code in alphanumeric filter, renaming class
    • Adding ambiguous characters filter
    • Fixing import error by renaming alphanumeric_filter.py -> filter.py
    • Adding keywords
    • Addressing reviewers' comments
    • Adding necessary import
    • Making recommended changes

    opened by motiwari 0