NL-Augmenter 🦎 → 🐍 A Collaborative Repository of Natural Language Transformations

Overview

NL-Augmenter 🦎 → 🐍

The NL-Augmenter is a collaborative effort intended to add transformations of datasets dealing with natural language. Transformations augment text datasets in diverse ways, including: introducing spelling errors, translating to a different language, randomizing names and numbers, paraphrasing ... and whatever creative augmentation you contribute to the benchmark. We invite submissions of transformations to this framework by way of GitHub pull request, through September 1, 2021. All submitters of accepted transformations (and filters) will be included as co-authors on a paper announcing this framework.

The framework organizers can be contacted at [email protected].

Submission timeline

Due date | Description
September 1, 2021 | Pull request must be opened to be eligible for inclusion in the framework and associated paper
September 22, 2021 | Review process for the pull request above must be complete

A transformation can be revised between the pull request submission and pull request merge deadlines. We will provide reviewer feedback to help with the revisions.

The transformations which are already accepted to NL-Augmenter are summarized in this table. Transformations undergoing review can be seen as pull requests.

Colab notebook

To quickly see transformations and filters in action, run through our Colab notebook.

Installation

Requirements

  • Python 3.7

Instructions

# When creating a new transformation, replace this with your forked repository (see below)
git clone https://github.com/GEM-benchmark/NL-Augmenter.git
cd NL-Augmenter
python setup.py sdist
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.0.tar.gz

How do I create a transformation?

Setup

First, fork the repository in GitHub! 🍴


Your fork will have its own location, which we will call PATH_TO_YOUR_FORK. Next, clone the forked repository and create a branch for your transformation, which here we will call my_awesome_transformation:

git clone $PATH_TO_YOUR_FORK
cd NL-Augmenter
git checkout -b my_awesome_transformation

We will base our transformation on an existing example. Create a new transformation directory by copying over an existing transformation:

cd transformations/
cp -r butter_fingers_perturbation my_awesome_transformation
cd my_awesome_transformation

Creating a transformation

  1. In the file transformation.py, rename the class ButterFingersPerturbation to MyAwesomeTransformation and choose one of the interfaces from the interfaces/ folder. See the full list of options here.
  2. Now put all your creativity into implementing the generate method. If you intend to use external libraries, add them with their version numbers to requirements.txt.
  3. Update my_awesome_transformation/README.md to describe your transformation.
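As a sketch of step 2, a minimal transformation might look like the following (the class name and the adjacent-letter-swap logic are illustrative, not the actual butter_fingers implementation; the exact generate signature is defined by the interface chosen in step 1):

```python
import random


class MyAwesomeTransformation:
    """Hypothetical sentence-level transformation that swaps adjacent
    letters to simulate typing errors."""

    def __init__(self, seed: int = 0, prob: float = 0.1):
        self.seed = seed
        self.prob = prob

    def generate(self, sentence: str) -> list:
        random.seed(self.seed)
        chars = list(sentence)
        for i in range(len(chars) - 1):
            # Only perturb letter pairs, and only with probability `prob`.
            if chars[i].isalpha() and chars[i + 1].isalpha() and random.random() < self.prob:
                chars[i], chars[i + 1] = chars[i + 1], chars[i]
        # NL-Augmenter operations return a list of perturbed outputs.
        return ["".join(chars)]
```

Seeding in the constructor keeps the perturbation deterministic, which is what makes recorded test cases reproducible.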

Testing and evaluating (Optional)

Once you are done, add at least 5 example pairs as test cases in the file test.json so that no one breaks your code inadvertently.
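The structure of test.json mirrors the example you copied; for a sentence-level transformation it looks roughly like this sketch (the sentences and perturbed outputs here are illustrative; copy the exact field names from the forked transformation's test.json):

```json
{
    "type": "my_awesome_transformation",
    "test_cases": [
        {
            "class": "MyAwesomeTransformation",
            "inputs": {"sentence": "Andrew played cricket in England."},
            "outputs": [{"sentence": "Andrew palyed cricket in Egnland."}]
        }
    ]
}
```

The test runner compares each recorded output against what the transformation currently produces, so an unintended behavior change shows up as a test failure.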

Once the transformation is ready, test it:

pytest -s --t=my_awesome_transformation

If you would like to evaluate your transformation against a common 🤗 HuggingFace model, we encourage you to check the evaluation instructions.

Code Styling

To standardize the code, we use the black code formatter, which runs as a pre-commit hook. To use the hook, install pre-commit with pip install pre-commit (it should already be installed if you followed the instructions above). Then run pre-commit install to install the hook. On future commits, you should see the black formatter run on all Python files you've staged for commit.

Submitting

Once the tests pass and you are happy with the transformation, submit it for review. First, commit and push your changes:

git add transformations/my_awesome_transformation/*
git commit -m "Added my_awesome_transformation"
git push --set-upstream origin my_awesome_transformation

Finally, submit a pull request. The last git push command prints a URL that can be copied into a browser to initiate such a pull request. Alternatively, you can do so from the GitHub website.


✨ Congratulations, you've submitted a transformation to NL-Augmenter! ✨

How do I create a filter?

We also accept pull requests for creating filters, which identify interesting subpopulations of a dataset. The process for adding a new filter is the same as above, except that filter implementations implement .filter instead of .generate and are placed in the filters folder. So, just as transformations can transform examples of text, filters can identify whether an example follows some pattern of text! The only difference is that while a transformation returns another example of the same input format, a filter simply returns True or False! For step-by-step instructions, follow these steps.
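To make the transformation/filter contrast concrete, here is a minimal, hypothetical filter (the class name and the word-count criterion are illustrative; real filters implement the .filter method of one of the interfaces in the interfaces/ folder):

```python
class LengthFilter:
    """Hypothetical filter that keeps only short examples."""

    def __init__(self, max_words: int = 10):
        self.max_words = max_words

    def filter(self, sentence: str) -> bool:
        # Unlike a transformation's generate, filter returns a boolean:
        # True when the example belongs to the subpopulation of interest.
        return len(sentence.split()) <= self.max_words
```

A filter like this could be used to select the subpopulation of short sentences before measuring a model's behavior on it.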

Comments
  • My french_conjugation_transformation


    I faced this issue when using pre-commit:


    When I went to check the pre-commit-config.yaml file, it said that the repo for this hook is local.


    Since the black, flake8, and isort hooks passed, I committed the code with the following command:

    git commit -m "My french_conjugation_transformation" -n

    transformation 
    opened by Louanes1 20
  • Spacy behaves differently when testing one case vs testing all cases


    It seems Spacy's tokenizer behaves differently when I run pytest -s --t=emojify and pytest -s --t=light --f=light.

    For example, I added the following snippet in my generate() function:

    print([str(t) for t in self.nlp(sentence)])
    

    With input sentence "Apple is looking at buying U.K. startup for $132 billion."

    pytest -s --t=emojify gives:

    ['Apple', 'is', 'looking', 'at', 'buying', 'U.K.', 'startup', 'for', '$', '132', 'billion', '.']
    

    However, pytest -s --t=light --f=light gives:

    ['Apple', 'is', 'looking', 'at', 'buying', 'U.K.', 'startup', 'for', '$1', '32', 'billion.']
    

    I use the following code to load spacy:

    import spacy
    from initialize import spacy_nlp
    self.nlp = spacy_nlp if spacy_nlp else spacy.load("en_core_web_sm")
    

    It looks very strange. Am I overlooking something?

    bug 
    opened by xiaohk 19
  • add tests for entity_mention_replacement_ner


    opened by uyaseen 15
  • Added Backtranslation for NER


    This transformation adapts backtranslation to the task of NER by generating paraphrases of the contexts around the entity mention(s) using backtranslation. It can be used as a data augmentation strategy to improve the underlying sequence (NER) model as reported by Yaseen and Langer, 2021.

    Contributors: Usama Yaseen ([email protected]), Stefan Langer ([email protected])

    Affiliation: Siemens AG

    transformation 
    opened by uyaseen 14
  • Ocr perturbation


    This PR introduces the "OCR perturbation" transformation, which directly induces Optical Character Recognition (OCR) errors into the input text. It renders the input sentence as an image and recognizes the rendered text using an OCR engine.

    transformation 
    opened by mnamysl 13
  • Adding financial amounts replacement


    This transformation consistently replaces financial amounts throughout a text. The replacement changes the amount, the writing format, as well as the currency of the financial amount. The change is consistent with regard to:

    • the modifier used to change all amounts of the same currency throughout the text.
      • e.g., the sentence I owe Fred € 20 and I need € 10 for the bus. might be changed to I owe Fred 2 906.37 Yen and I need 1 453.19 Yen for the bus.
    • the modifier used to change the amounts so that new amounts are relatively close to the original amount.
    • the rate used for a change of currency, reflecting the actual bank rate.
    transformation 
    opened by ahonore 10
  • Named Entity count filter


    This filter allows filtering data based on the counts of named entities, for more fine-grained analysis of text generation systems with respect to named entities in the input. The PR includes test cases + a Readme for more details. Thanks.

    filter 
    opened by vyraun 10
  • Add NegationStrengthen


    These augmentations convert causal sentences' direction and/or strength based on grammar rules and dependency trees. We introduce augmentations that amend both sentence and target (SentenceAndTargetOperation).

    transformation 
    opened by tanfiona 10
  • Style transfer paraphrasing


    Hello,

    This paraphraser is an enabler to use GPT-2 paraphrasers, originally trained by Krishna et al. for the paper Reformulating Unsupervised Style Transfer as Paraphrase Generation. Currently, I have integrated 6 different GPT-2 paraphrasers with different styles that I have tested and verified to work, but I might integrate more by uploading them to Huggingface (WIP).

    The currently supported styles are:

    • Shakespeare
    • Switchboard (Conversation Speech)
    • Tweets
    • Bible
    • Romantic poetry
    • Basic

    The paraphraser is quite similar to formality_change, but more diverse and general, and works for several different styles. The "Switchboard" model targets a "Conversational Speech" style similar to formality.

    Model sizes

    Each model is about 3.25 GB, so the testing will require a GPU if it has to be quick (as with any large language model).

    A comment about the tests

    Note that the tests currently only cover the Shakespeare model. I was unsure how to add different tests for different models. Perhaps subclasses for the different models would be a better option than specifying the specific style, at least given the testing environment. Can I add instantiation arguments for the different tests somehow?

    License

    I was unsure how to handle the LICENSE from the original repository. Much of the code is based upon the original code (which I am not an author of) made by @martiansideofthemoon and his co-authors, but has been greatly modified. I therefore included the license (which is MIT), but please correct me if this is not necessary.

    No new requirements

    No new requirements needed to be added; the current model and the code I adapted from the original repository were compatible with the requirements already in the repository.

    (Edit, I accidentally clicked and submitted the PR before finishing what I was writing here).

    transformation 
    opened by Filco306 9
  • Add Color Transformation


    This transformation augments the input sentence by randomly replacing colors.

    For example,

    I bought this pink shoes today! Isn't it pretty?

    becomes

    I bought this misty rose shoes today! Isn't it pretty?

    transformation 
    opened by seungjaeryanlee 9
  • Global seed changes


    This PR is for the changes discussed in #145. I thought removing it from the interface made the most sense, so we don't pass any redundant information and only set the seed for both random and numpy once in initialize.py. I've also used this to fix some remaining spacy.load issues in some of the filters/transformations so they are loaded only once in initialize.py.

    Would be nice if you could test all things to make sure I didn't break anything.

    Draft for now, since I think some tests will need to be adjusted because they used a different seed.

    opened by SirRob1997 9
  • `NumberToWord` is not loadable (likely due to hyphens in folder name)


    It does not appear possible to load the NumberToWord transformation after installing nlaugmenter.

    https://github.com/GEM-benchmark/NL-Augmenter/blob/main/nlaugmenter/transformations/number-to-word/transformation.py

    This is likely due to the hyphens in number-to-word, which make the folder an invalid Python module name and break path loading.

    opened by fabriceyhc 0
  • `Formal2Casual` fails to load due to unavailable huggingface model


    from nlaugmenter.transformations.formality_change.transformation import Formal2Casual
    
    OSError: prithivida/parrot_adequacy_on_BART is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
    If this is a private repository, make sure to pass a token having permission to this repo with `use_auth_token` or log in with `huggingface-cli login` and pass `use_auth_token=True`.
    

    The model (prithivida/parrot_adequacy_on_BART) is indeed not available on huggingface anymore. Perhaps an acceptable alternative is to use prithivida/parrot_adequacy_model instead?

    opened by fabriceyhc 0
  • Swap Transformations


    Thank you for your great work! It's super useful!

    I have a suggestion for improvement: some transformations work on a "swap" principle. For example, in GenderSwap, if we had "sister" in the original sentence then it would be transformed to "brother" and vice versa. There are scenarios where it's important to know which direction the transformation went, female to male or male to female. In my case, for example, I want to compare the performance of my model on female/male sentences at inference time.

    I really liked the way TenseTransformation works. You need to specify in the constructor what tense (past/present/future) you want to transform to. Maybe that could be applicable for other swap transformations?

    Thanks again!

    opened by shachardon 0
  • Chinese2digits


    This transformation converts numbers written in Chinese to digits, such as "一" (one) to 1, "二" (two) to 2, and "百" (a hundred) to 100. This transformation is a vital component of Chinese NLP systems. It would benefit all tasks that take a sentence/paragraph/document with numbers written in Chinese characters as input, such as text classification, text generation, etc.

    opened by JerryX1110 0
  • Adding ambiguous characters filter


    I'm creating a new PR as per the request in https://github.com/GEM-benchmark/NL-Augmenter/pull/274.

    I also squashed all prior commits into one. The squashed commits were:

    • Renaming ambiguousfilter folder
    • Adding alphanumeric characters filter
    • Removing unused imports and dead code in alphanumeric filter, renaming class
    • Adding ambiguous characters filter
    • Fixing import error by renaming alphanumeric_filter.py -> filter.py
    • Adding keywords
    • Addressing reviewers' comments
    • Adding necessary import
    • Making recommended changes

    opened by motiwari 0