NL-Augmenter 🦎 → 🐍 A Collaborative Repository of Natural Language Transformations


The NL-Augmenter is a collaborative effort intended to add transformations of datasets dealing with natural language. Transformations augment text datasets in diverse ways, including: randomizing names and numbers, changing style/syntax, paraphrasing, KB-based paraphrasing ... and whatever creative augmentation you contribute. We invite submissions of transformations to this framework by way of GitHub pull request, through August 31, 2021. All submitters of accepted transformations (and filters) will be included as co-authors on a paper announcing this framework.

The framework organizers can be contacted at [email protected].

Submission timeline

Due date | Description
~~August 31, 2021~~ | ~~Pull request must be opened to be eligible for inclusion in the framework and associated paper~~
September ~~22~~ 30, 2021 | Review process for pull request above must be complete

A transformation can be revised between the pull request submission and pull request merge deadlines. We will provide reviewer feedback to help with the revisions.

The transformations which are already accepted to NL-Augmenter are summarized in the transformations folder. Transformations undergoing review can be seen as pull requests.

Colab notebook

To quickly see transformations and filters in action, run through our Colab notebook.

Some Ideas for Transformations

If you need inspiration for what transformations to implement, check out https://github.com/GEM-benchmark/NL-Augmenter/issues/75, where some ideas and previous papers are discussed. So far, contributions have focused on morphological inflections, character level changes, and random noise. The best new pull requests will be dissimilar from these existing contributions.

Installation

Requirements

  • Python 3.7

Instructions

# When creating a new transformation, replace this with your forked repository (see below)
git clone https://github.com/GEM-benchmark/NL-Augmenter.git
cd NL-Augmenter
python setup.py sdist
pip install -e .
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0.tar.gz

How do I create a transformation?

Setup

First, fork the repository in GitHub! 🍴

fork button

Your fork will have its own location, which we will call PATH_TO_YOUR_FORK. Next, clone the forked repository and create a branch for your transformation, which here we will call my_awesome_transformation:

git clone $PATH_TO_YOUR_FORK
cd NL-Augmenter
git checkout -b my_awesome_transformation

We will base our transformation on an existing example. Create a new transformation directory by copying over an existing transformation. You can choose to copy from other transformation directories depending on the task you wish to create a transformation for. Check some of the existing pull requests and merged transformations first to avoid duplicating efforts or creating transformations too similar to previous ones.

cd transformations/
cp -r butter_fingers_perturbation my_awesome_transformation
cd my_awesome_transformation

Creating a transformation

  1. In the file transformation.py, rename the class ButterFingersPerturbation to MyAwesomeTransformation and choose one of the interfaces from the interfaces/ folder. See the full list of options here.
  2. Now put all your creativity into implementing the generate method. If you intend to use external libraries, add them with their version numbers in requirements.txt.
  3. Update my_awesome_transformation/README.md to describe your transformation.
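Steps 1 and 2 above can be sketched as follows. This is a minimal, self-contained sketch: in the repository the class would subclass one of the interfaces from the interfaces/ folder (e.g. SentenceOperation), which is omitted here, and the case-swapping "perturbation" is a placeholder for your actual logic.

```python
import random

# Minimal sketch of a transformation. In the repository this class would
# subclass an interface from interfaces/; that subclassing is omitted so
# the sketch stays self-contained. The case-swap below is a placeholder
# perturbation, not a real augmentation.
class MyAwesomeTransformation:
    def __init__(self, seed=0):
        self.seed = seed

    def generate(self, sentence: str):
        random.seed(self.seed)
        # Transformations return a list of perturbed variants of the input.
        return [sentence.swapcase()]
```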

Testing and evaluating (Optional)

Once you are done, add at least 5 example pairs as test cases in the file test.json so that no one breaks your code inadvertently.
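As a sketch of what such a test file contains, the layout below mirrors the test.json of existing transformations like butter_fingers_perturbation; the exact field names can differ per interface, so copy the file from the transformation you based yours on.

```python
import json

# Sketch of a test.json layout, modeled on existing transformations such as
# butter_fingers_perturbation; field names may differ for other interfaces.
test_file = {
    "type": "my_awesome_transformation",
    "test_cases": [
        {
            "class": "MyAwesomeTransformation",
            "inputs": {"sentence": "Hello, world!"},
            "outputs": [{"sentence": "hELLO, WORLD!"}],
        }
        # ...add at least five input/output pairs in total
    ],
}
print(json.dumps(test_file, indent=2))
```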

Once the transformation is ready, test it:

pytest -s --t=my_awesome_transformation

If you would like to evaluate your transformation against a common 🤗 HuggingFace model, we encourage you to check the evaluation docs.

Code Styling

To standardize the code, we use the black code formatter, which runs as a pre-commit hook. To use the hook, install pre-commit with pip install pre-commit (it should already be installed if you followed the instructions above). Then run pre-commit install to install the hook. On future commits, you should see the black code formatter run on all Python files you've staged for commit.

Submitting

Once the tests pass and you are happy with the transformation, submit them for review. First, commit and push your changes:

git add transformations/my_awesome_transformation/*
git commit -m "Added my_awesome_transformation"
git push --set-upstream origin my_awesome_transformation

Finally, submit a pull request. The last git push command prints a URL that can be copied into a browser to initiate such a pull request. Alternatively, you can do so from the GitHub website.

pull request button

Congratulations, you've submitted a transformation to NL-Augmenter!

How do I create a filter?

We also accept pull requests for filters, which identify interesting subpopulations of a dataset. The process for adding a new filter is the same as above, except that filter implementations implement .filter instead of .generate and are placed in the filters folder. Just as transformations can transform examples of text, filters can identify whether an example follows some pattern of text. The only difference is that while a transformation returns another example of the same input format, a filter simply returns True or False. For step-by-step instructions, follow these steps.
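As a sketch of the difference, a filter implements .filter and returns a boolean. The class and length criterion below are purely illustrative, not an actual NL-Augmenter filter:

```python
# Illustrative filter sketch: like transformations, a real filter would
# subclass an interface from interfaces/, omitted here for brevity.
# The word-count criterion is a stand-in for your actual pattern check.
class ShortSentenceFilter:
    def __init__(self, max_words=10):
        self.max_words = max_words

    def filter(self, sentence: str) -> bool:
        # Keep only examples with at most max_words tokens.
        return len(sentence.split()) <= self.max_words
```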

BIG-Bench 🪑

If you are interested in NL-Augmenter, you may also be interested in the BIG-bench large scale collaborative benchmark for language models.

Most Creative Implementations 🏆

After all pull requests have been merged, the 3 most creative implementations will be selected and featured on this README page and on the NL-Augmenter webpage.

License

Some transformations include components released under a different (permissive, open-source) license. For license details, refer to the README.md and any license files in the transformation's or filter's directory.

Comments
  • My french_conjugation_transformation


    Faced this issue when using pre-commit:

    (screenshot of the pre-commit error omitted)

    When I went to check the .pre-commit-config.yaml file, it says that the repo for this hook is local.

    (screenshot omitted)

    Since the black, flake8 and isort hooks passed, I committed the code with the following command:

    git commit -m "My french_conjugation_transformation" -n

    transformation 
    opened by Louanes1 20
  • Spacy behaves differently when testing one case vs testing all cases


    It seems Spacy's tokenizer behaves differently when I run pytest -s --t=emojify and pytest -s --t=light --f=light.

    For example, I added the following snippet in my generate() function:

    print([str(t) for t in self.nlp(sentence)])
    

    With input sentence "Apple is looking at buying U.K. startup for $132 billion."

    pytest -s --t=emojify gives:

    ['Apple', 'is', 'looking', 'at', 'buying', 'U.K.', 'startup', 'for', '$', '132', 'billion', '.']
    

    However, pytest -s --t=light --f=light gives:

    ['Apple', 'is', 'looking', 'at', 'buying', 'U.K.', 'startup', 'for', '$1', '32', 'billion.']
    

    I use the following code to load spacy:

    import spacy
    from initialize import spacy_nlp

    # spacy_nlp is the shared model from initialize.py; fall back to loading locally
    self.nlp = spacy_nlp if spacy_nlp else spacy.load("en_core_web_sm")
    

    It looks very strange. Am I overlooking something?

    bug 
    opened by xiaohk 19
  • add tests for entity_mention_replacement_ner


    opened by uyaseen 15
  • Added Backtranslation for NER


    This transformation adapts backtranslation to the task of NER by generating paraphrases of the contexts around the entity mention(s) using backtranslation. It can be used as a data augmentation strategy to improve the underlying sequence (NER) model as reported by Yaseen and Langer, 2021.

    Contributors: Usama Yaseen ([email protected]), Stefan Langer ([email protected])

    Affiliation: Siemens AG

    transformation 
    opened by uyaseen 14
  • Ocr perturbation


    This PR introduces the "OCR perturbation" transformation, which directly induces Optical Character Recognition (OCR) errors into the input text. It renders the input sentence as an image and recognizes the rendered text using an OCR engine.

    transformation 
    opened by mnamysl 13
  • Adding financial amounts replacement


    This transformation consistently replaces financial amounts throughout a text. The replacement changes the amount, the writing format, as well as the currency of the financial amount. The change is consistent with regard to:

    • the modifier used to change all amounts of the same currency throughout the text.
      • e.g., the sentence I owe Fred € 20 and I need € 10 for the bus. might be changed to I owe Fred 2 906.37 Yen and I need 1 453.19 Yen for the bus.
    • the modifier used to change the amounts so that new amounts are relatively close to the original amount.
    • the rate used for a change of currency, reflecting the actual bank rate.
    transformation 
    opened by ahonore 10
  • Named Entity count filter


    This filter allows filtering data based on the counts of named entities, for more fine-grained analysis of text generation systems with respect to named entities in the input. The PR includes test cases and a README with more details. Thanks.

    filter 
    opened by vyraun 10
  • Add NegationStrengthen


    These augmentations convert causal sentences' direction and/or strength based on grammar rules and dependency trees. We introduce augmentations that amend both the sentence and the target (SentenceAndTargetOperation).

    transformation 
    opened by tanfiona 10
  • Style transfer paraphrasing


    Hello,

    This paraphraser enables the use of GPT-2 paraphrasers originally trained by Krishna et al. for the paper Reformulating Unsupervised Style Transfer as Paraphrase Generation. Currently, I have integrated 6 different GPT-2 paraphrasers with different styles that I have tested and verified to work, but I might integrate more by uploading them to Huggingface (WIP).

    The currently supported styles are:

    • Shakespeare
    • Switchboard (Conversation Speech)
    • Tweets
    • Bible
    • Romantic poetry
    • Basic

    The paraphraser is quite similar to formality_change, but more diverse and general, and works for several different styles. The "Switchboard" model produces "Conversational Speech"-style output, which is similar to formality transfer.

    Model sizes

    Each model is about 3.25 GB, so the testing will require a GPU if it has to be quick (as with any large language model).

    A comment about the tests

    Note that the tests currently only cover the Shakespeare model. I was unsure how to add different tests for different models. Perhaps subclasses for the different models are a better option than specifying the specific style, at least given the testing environment. Can I add instantiation arguments for the different tests somehow?

    License

    I was unsure how to handle the LICENSE from the original repository. Much of the code is based upon the original code (which I am not an author of) made by @martiansideofthemoon and his co-authors, but has been greatly modified. I therefore included the license (which is MIT), but please correct me if this is not necessary.

    No new requirements

    No new requirements needed to be added; the current model and the code I adapted from the original repository were compatible with the requirements already existing in the repository.

    (Edit, I accidentally clicked and submitted the PR before finishing what I was writing here).

    transformation 
    opened by Filco306 9
  • Add Color Transformation


    This transformation augments the input sentence by randomly replacing colors.

    For example,

    I bought this pink shoes today! Isn't it pretty?

    becomes

    I bought this misty rose shoes today! Isn't it pretty?

    transformation 
    opened by seungjaeryanlee 9
  • Global seed changes


    This PR is for the changes discussed in #145. I thought removing it from the interface made the most sense, so we don't pass any redundant information and only set the seed for both random and numpy once in initialize.py. I've also used this to fix some remaining spacy.load issues in some of the filters/transformations so they are loaded only once in initialize.py.

    Would be nice if you could test all things to make sure I didn't break anything.

    Draft for now, since I think some tests need to be adjusted now since they used a different seed.

    opened by SirRob1997 9
  • `NumberToWord` is not loadable (likely due to hyphens in folder name)


    It does not appear possible to load the NumberToWord transformation after installing nlaugmenter.

    https://github.com/GEM-benchmark/NL-Augmenter/blob/main/nlaugmenter/transformations/number-to-word/transformation.py

    This is likely due to number-to-word breaking python's path loading.

    opened by fabriceyhc 0
  • `Formal2Casual` fails to load due to unavailable huggingface model


    from nlaugmenter.transformations.formality_change.transformation import Formal2Casual
    
    OSError: prithivida/parrot_adequacy_on_BART is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
    If this is a private repository, make sure to pass a token having permission to this repo with `use_auth_token` or log in with `huggingface-cli login` and pass `use_auth_token=True`.
    

    The model (prithivida/parrot_adequacy_on_BART) is indeed not available on huggingface anymore. Perhaps an acceptable alternative is to use prithivida/parrot_adequacy_model instead?

    opened by fabriceyhc 0
  • Swap Transformations


    Thank you for your great work! It's super useful!

    I have a suggestion for improvement: some transformations work on a "swap" principle. For example, in GenderSwap, if the original sentence contained "sister", it would be transformed to "brother" and vice versa. There are scenarios where it is important to know which direction the transformation went, female to male or male to female. In my case, for example, I want to compare the performance of my model on female/male sentences at inference time.

    I really liked the way TenseTransformation works. You need to specify in the constructor what tense (past/present/future) you want to transform to. Maybe that could be applicable for other swap transformations?
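    The suggestion could look roughly like this. The class name, constructor argument, and tiny word table below are hypothetical illustrations, not the actual GenderSwap API:

    ```python
    # Hypothetical sketch of the suggested API: a direction argument in the
    # constructor, analogous to TenseTransformation's target tense. Not the
    # actual GenderSwap interface; the word table is a toy example.
    SWAPS = {"sister": "brother", "she": "he"}

    class DirectedGenderSwap:
        def __init__(self, direction="female_to_male"):
            self.mapping = (
                SWAPS if direction == "female_to_male"
                else {v: k for k, v in SWAPS.items()}
            )

        def generate(self, sentence):
            words = [self.mapping.get(w, w) for w in sentence.split()]
            return [" ".join(words)]
    ```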

    Thanks again!

    opened by shachardon 0
  • Chinese2digits


    This transformation converts numbers written in Chinese characters to digits, such as "一" (one) to 1, "二" (two) to 2, and "百" (a hundred) to 100. This transformation is a vital component of Chinese NLP systems. It would benefit all tasks that take as input a sentence/paragraph/document containing numbers written in Chinese characters, such as text classification, text generation, etc.
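    A simplified sketch of the idea; the character tables and parsing below are illustrative, not the PR's implementation, and omit larger units such as 万:

    ```python
    # Toy Chinese-numeral-to-digit conversion: digits accumulate into
    # `current`, and a unit character multiplies the pending digit
    # (defaulting to 1, so bare "百" reads as 100). Larger units such as
    # 万 are deliberately omitted from this sketch.
    DIGITS = {"一": 1, "二": 2, "三": 3, "四": 4, "五": 5,
              "六": 6, "七": 7, "八": 8, "九": 9}
    UNITS = {"十": 10, "百": 100, "千": 1000}

    def chinese_to_int(text: str) -> int:
        total, current = 0, 0
        for ch in text:
            if ch in DIGITS:
                current = DIGITS[ch]
            elif ch in UNITS:
                total += max(current, 1) * UNITS[ch]
                current = 0
        return total + current
    ```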

    opened by JerryX1110 0
  • Adding ambiguous characters filter


    I'm creating a new PR as per the request in https://github.com/GEM-benchmark/NL-Augmenter/pull/274.

    I also squashed all prior commits into one:

    • Renaming ambiguousfilter folder
    • Adding alphanumeric characters filter
    • Removing unused imports and dead code in alphanumeric filter, renaming class
    • Adding ambiguous characters filter
    • Fixing import error by renaming alphanumeric_filter.py -> filter.py
    • Adding keywords
    • Addressing reviewers' comments
    • Adding necessary import
    • Making recommended changes

    opened by motiwari 0