The source code for the Cutoff data augmentation approach proposed in this paper: "A Simple but Tough-to-Beat Data Augmentation Approach for Natural Language Understanding and Generation".

Dinghan Shen

Last update: Dec 22, 2022

Related tags

Overview

Cutoff: A Simple Data Augmentation Approach for Natural Language

This repository contains source code necessary to reproduce the results presented in the following paper:

A Simple but Tough-to-Beat Data Augmentation Approach for Natural Language Understanding and Generation

This project is maintained by Dinghan Shen. Feel free to contact [email protected] for any relevant issues.

Natural Language Undertanding (e.g. GLUE tasks, etc.)

Prerequisite:

CUDA, cudnn
Python 3.7
PyTorch 1.4.0

Run

Install Huggingface Transformers according to the instructions here: https://github.com/huggingface/transformers.
Download the datasets from the GLUE benchmark:

python download_glue_data.py --data_dir glue_data --tasks all

Fine-tune the RoBERTa-base or RoBERTa-large model with the Cutoff data augmentation strategies:

>>> chmod +x run_glue.sh
>>> ./run_glue.sh

Options: different settings and hyperparameters can be selected and specified in the run_glue.sh script:

do_aug: whether augmented examples are used for training.
aug_type: the specific strategy to synthesize Cutoff samples, which can be chosen from: 'span_cutoff', 'token_cutoff' and 'dim_cutoff'.
aug_cutoff_ratio: the ratio corresponding to the span length, token number or number of dimensions to be cut.
aug_ce_loss: the coefficient for the cross-entropy loss over the cutoff examples.
aug_js_loss: the coefficient for the Jensen-Shannon (JS) Divergence consistency loss over the cutoff examples.
TASK_NAME: the downstream GLUE task for fine-tuning.
model_name_or_path: the pre-trained for initialization (both RoBERTa-base or RoBERTa-large models are supported).
output_dir: the folder results being saved to.

Natural Language Generation (e.g. Translation, etc.)

Please refer to Neural Machine Translation with Data Augmentation for more details

IWSLT'14 German to English (Transformers)

Task	Setting	Approach	BLEU
iwslt14 de-en	transformer-small	w/o cutoff	36.2
iwslt14 de-en	transformer-small	w/ cutoff	37.6

WMT'14 English to German (Transformers)

Task	Setting	Approach	BLEU
wmt14 en-de	transformer-base	w/o cutoff	28.6
wmt14 en-de	transformer-base	w/ cutoff	29.1
wmt14 en-de	transformer-big	w/o cutoff	29.5
wmt14 en-de	transformer-big	w/ cutoff	30.3

Citation

Please cite our paper in your publications if it helps your research:

@article{shen2020simple,
  title={A Simple but Tough-to-Beat Data Augmentation Approach for Natural Language Understanding and Generation},
  author={Shen, Dinghan and Zheng, Mingzhi and Shen, Yelong and Qu, Yanru and Chen, Weizhu},
  journal={arXiv preprint arXiv:2009.13818},
  year={2020}
}

Source codes for the paper "Local Additivity Based Data Augmentation for Semi-supervised NER"

LADA This repo contains codes for the following paper: Jiaao Chen*, Zhenghui Wang*, Ran Tian, Zichao Yang, Diyi Yang: Local Additivity Based Data Augm

36 Dec 2, 2022

Implementation of Geometric Vector Perceptron, a simple circuit for 3d rotation equivariance for learning over large biomolecules, in Pytorch. Idea proposed and accepted at ICLR 2021

Geometric Vector Perceptron Implementation of Geometric Vector Perceptron, a simple circuit with 3d rotation equivariance for learning over large biom

59 Nov 24, 2022

Implementation of 'lightweight' GAN, proposed in ICLR 2021, in Pytorch. High resolution image generations that can be trained within a day or two

512x512 flowers after 12 hours of training, 1 gpu 256x256 flowers after 12 hours of training, 1 gpu Pizza 'Lightweight' GAN Implementation of 'lightwe

1.5k Jan 2, 2023

This repository contains the implementation of Deep Detail Enhancment for Any Garment proposed in Eurographics 2021

Comments

Issue on the reproducing results on Table 1

Congratulations on this interesting work!

Now, I am trying to reproduce the results on your papers including cutoff and back-translation which also shows a great improvement. But, now I'm struggling due to the absence of some details.

For example, I can reproduce the results for CoLA task after a huge amount of trials and error for finding proper hyper-parameter (cutoff_ratio, ce_loss, js_loss), however, can't reproduce the results for other tasks on GLUE.

It might be due to that I have not found proper hyper-parameters yet.. So, if you let me know or give the detailed hyper-parameters for each GLUE task, then it will be very helpful..! Can you let me know some details (e.g., used value of alpha,beta for each task) for reproducing the results?

Also, in the case of back-translation, now I'm trying to reproduce it with WMT en-de, de-en 19 in fairseq. But, I can't improve the naive baseline. For example, my reproducing results is almost 88, but your reporting is 91.7 in RTE with RoBERTa large So, can you let me know the details of back-translation and related hyper-parameters?

Thanks for your help and look forward to hearing from you!

opened by bbuing9 1

The source code for the Cutoff data augmentation approach proposed in this paper: "A Simple but Tough-to-Beat Data Augmentation Approach for Natural Language Understanding and Generation".

Related tags

Overview

Cutoff: A Simple Data Augmentation Approach for Natural Language

Natural Language Undertanding (e.g. GLUE tasks, etc.)

Prerequisite:

Run

Natural Language Generation (e.g. Translation, etc.)

IWSLT'14 German to English (Transformers)

WMT'14 English to German (Transformers)

Citation

You might also like...

Source codes for the paper "Local Additivity Based Data Augmentation for Semi-supervised NER"

Implementation of Geometric Vector Perceptron, a simple circuit for 3d rotation equivariance for learning over large biomolecules, in Pytorch. Idea proposed and accepted at ICLR 2021

Implementation of 'lightweight' GAN, proposed in ICLR 2021, in Pytorch. High resolution image generations that can be trained within a day or two

This repository contains the implementation of Deep Detail Enhancment for Any Garment proposed in Eurographics 2021

This repo contains the implementation of the algorithm proposed in Off-Belief Learning, ICML 2021.

The implemetation of Dynamic Nerual Garments proposed in Siggraph Asia 2021

Implement object segmentation on images using HOG algorithm proposed in CVPR 2005

Pytorch implementation of the popular Improv RNN model originally proposed by the Magenta team.

Torch-ngp - A pytorch implementation of the hash encoder proposed in instant-ngp

Comments

Issue on the reproducing results on Table 1

Owner

Dinghan Shen

Code and data of the Fine-Grained R2R Dataset proposed in paper Sub-Instruction Aware Vision-and-Language Navigation

Empirical Study of Transformers for Source Code & A Simple Approach for Handling Out-of-Vocabulary Identifiers in Deep Learning for Source Code

Image transformations designed for Scene Text Recognition (STR) data augmentation. Published at ICCV 2021 Workshop on Interactive Labeling and Data Augmentation for Vision.

nnDetection is a self-configuring framework for 3D (volumetric) medical object detection which can be applied to new data sets without manual intervention. It includes guides for 12 data sets that were used to develop and evaluate the performance of the proposed method.

An implementation for the loss function proposed in Decoupled Contrastive Loss paper.

Implementation of the method proposed in the paper "Neural Descriptor Fields: SE(3)-Equivariant Object Representations for Manipulation"

Implementation of a memory efficient multi-head attention as proposed in the paper, "Self-attention Does Not Need O(n²) Memory"

PyTorch reimplementation of the Smooth ReLU activation function proposed in the paper "Real World Large Scale Recommendation Systems Reproducibility and Smooth Activations" [arXiv 2022].

A PyTorch implementation of Mugs proposed by our paper "Mugs: A Multi-Granular Self-Supervised Learning Framework".

Code for CMaskTrack R-CNN (proposed in Occluded Video Instance Segmentation)