A2T: Towards Improving Adversarial Training of NLP Models (EMNLP 2021 Findings)

Related tags

TextAttack-A2T
Overview

A2T: Towards Improving Adversarial Training of NLP Models

This is the source code for the EMNLP 2021 (Findings) paper "Towards Improving Adversarial Training of NLP Models".

If you use the code, please cite the paper:

@misc{yoo2021improving,
      title={Towards Improving Adversarial Training of NLP Models}, 
      author={Jin Yong Yoo and Yanjun Qi},
      year={2021},
      eprint={2109.00544},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Prerequisites

The work heavily relies on the TextAttack package. In fact, the main training code is implemented in the TextAttack package.

Required packages are listed in the requirements.txt file.

pip install -r requirements.txt

Data

All of the data used for the paper are available from HuggingFace's Datasets.

For IMDB and Yelp datasets, because there are no official validation splits, we randomly sampled 5k and 10k, respectively, from the training set and used them as valid splits. We provide the splits in this Google Drive folder. To use them with the provided code, place each folder (e.g. imdb, yelp, augmented_data) inside ./data (run mkdir data).

Also, augmented training data generated using SSMBA and back-translation are available in the same folder.

Training

To train BERT model on IMDB dataset with A2T attack for 4 epochs and 1 clean epoch with gamma of 0.2:

python train.py \
    --train imdb \
    --eval imdb \
    --model-type bert \
    --model-save-path ./example \
    --num-epochs 4 \
    --num-clean-epochs 1 \
    --num-adv-examples 0.2 \
    --attack-epoch-interval 1 \
    --attack a2t \
    --learning-rate 5e-5 \
    --num-warmup-steps 100 \
    --grad-accumu-steps 1 \
    --checkpoint-interval-epochs 1 \
    --seed 42

You can also pass roberta to train RoBERTa model instead of BERT model. To select other datasets from the paper, pass rt (MR), yelp, or snli for --train and --eval.

This script is actually just to run the Trainer class from the TextAttack package. To checkout how training is performed, please checkout the Trainer class.

Evaluation

To evalute the accuracy, robustness, and interpretability of our trained model from above, run

python evaluate.py \
    --dataset imdb \
    --model-type bert \
    --checkpoint-paths ./example_run \
    --epoch 4 \
    --save-log \
    --accuracy \
    --robustness \
    --attacks a2t a2t_mlm textfooler bae pwws pso \
    --interpretability 

This takes the last checkpoint model (--epoch 4) and evaluates its accuracy on both IMDB and Yelp dataset (for cross-domain accuracy). It also evalutes the model's robustness against A2T, A2T-MLM, TextFooler, BAE, PWWS, and PSO attacks. Lastly, with the --interpretability flag, AOPC scores are calculated.

Note that you will have to run --robustness and --interpretability with --accuracy (or after you separately evaluate accuracy) since both robustness and intepretability evaluations rely on the accuracy evaluation to know which samples the model was able to predict correctly. By default 1000 samples are attacked to evaluate robustness. Likewise, 1000 samples are used to calculate AOPC score for interpretability.

If you're evaluating multiple models for comparison, it's also advised that you provide all the checkpoint paths together to --checkpoint-paths. This is because the samples that are correctly by each model will be different, so we first need to identify the intersection of the all correct predictions before using them to evaluate robustness for all the models. This will allow fairer comparison of models' robustness rather than using attack different samples for each model.

Data Augmentation

Lastly, we also provide augment.py which we used to perform data augmentation methods such as SSMBA and back-translation.

Following is an example command for augmenting imdb dataset with SSMBA method.

python augment.py \
    --dataset imdb \
    --augmentation ssmba \
    --output-path ./augmented_data \
    --seed 42 

You can also pass backtranslation to --augmentation.

Owner
QData
http://www.cs.virginia.edu/yanjun/
QData
BPEmb is a collection of pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) and trained on Wikipedia.

BPEmb is a collection of pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) and trained on Wikipedia. Its intended use is as input for neural models in natural language processing.

Benjamin Heinzerling 1k Oct 21, 2021
State of the Art Natural Language Processing

Spark NLP: State of the Art Natural Language Processing Spark NLP is a Natural Language Processing library built on top of Apache Spark ML. It provide

John Snow Labs 2.4k Oct 22, 2021
State of the Art Natural Language Processing

Spark NLP: State of the Art Natural Language Processing Spark NLP is a Natural Language Processing library built on top of Apache Spark ML. It provide

John Snow Labs 1.9k Feb 18, 2021
A2T: Towards Improving Adversarial Training of NLP Models (EMNLP 2021 Findings)

A2T: Towards Improving Adversarial Training of NLP Models This is the source code for the EMNLP 2021 (Findings) paper "Towards Improving Adversarial T

QData 8 Oct 20, 2021
TextAttack šŸ™ is a Python framework for adversarial attacks, data augmentation, and model training in NLP

TextAttack ?? Generating adversarial examples for NLP models [TextAttack Documentation on ReadTheDocs] About ā€¢ Setup ā€¢ Usage ā€¢ Design About TextAttack

QData 1.7k Oct 23, 2021
A model library for exploring state-of-the-art deep learning topologies and techniques for optimizing Natural Language Processing neural networks

A Deep Learning NLP/NLU library by IntelĀ® AI Lab Overview | Models | Installation | Examples | Documentation | Tutorials | Contributing NLP Architect

Intel Labs 2.8k Oct 21, 2021
A model library for exploring state-of-the-art deep learning topologies and techniques for optimizing Natural Language Processing neural networks

A Deep Learning NLP/NLU library by IntelĀ® AI Lab Overview | Models | Installation | Examples | Documentation | Tutorials | Contributing NLP Architect

Intel Labs 2.7k Oct 16, 2021
A model library for exploring state-of-the-art deep learning topologies and techniques for optimizing Natural Language Processing neural networks

A Deep Learning NLP/NLU library by IntelĀ® AI Lab Overview | Models | Installation | Examples | Documentation | Tutorials | Contributing NLP Architect

Intel Labs 2.6k Feb 18, 2021
Tracking Progress in Natural Language Processing

Repository to track the progress in Natural Language Processing (NLP), including the datasets and the current state-of-the-art for the most common NLP tasks.

Sebastian Ruder 19.2k Oct 18, 2021
COVID-19 Related NLP Papers

COVID-19 outbreak has become a global pandemic. NLP researchers are fighting the epidemic in their own way.

xcfeng 21 Oct 19, 2021
This is my reading list for my PhD in AI, NLP, Deep Learning and more.

This is my reading list for my PhD in AI, NLP, Deep Learning and more.

Zhong Peixiang 141 Sep 30, 2021
Awesome-NLP-Research (ANLP)

Awesome-NLP-Research (ANLP)

Language, Information, and Learning at Yale 64 Oct 14, 2021
Flexible interface for high-performance research using SOTA Transformers leveraging Pytorch Lightning, Transformers, and Hydra.

Flexible interface for high performance research using SOTA Transformers leveraging Pytorch Lightning, Transformers, and Hydra. What is Lightning Tran

Pytorch Lightning 362 Oct 21, 2021
A Survey of Natural Language Generation in Task-Oriented Dialogue System (TOD): Recent Advances and New Frontiers

A Survey of Natural Language Generation in Task-Oriented Dialogue System (TOD): Recent Advances and New Frontiers

Libo Qin 86 Oct 12, 2021
A Paper List for Speech Translation

Keyword: Speech Translation, Spoken Language Processing, Natural Language Processing

null 89 Oct 21, 2021
Basic Utilities for PyTorch Natural Language Processing (NLP)

Basic Utilities for PyTorch Natural Language Processing (NLP) PyTorch-NLP, or torchnlp for short, is a library of basic utilities for PyTorch NLP. tor

Michael Petrochuk 2k Oct 15, 2021
Basic Utilities for PyTorch Natural Language Processing (NLP)

Basic Utilities for PyTorch Natural Language Processing (NLP) PyTorch-NLP, or torchnlp for short, is a library of basic utilities for PyTorch NLP. tor

Michael Petrochuk 1.9k Feb 3, 2021
Basic Utilities for PyTorch Natural Language Processing (NLP)

Basic Utilities for PyTorch Natural Language Processing (NLP) PyTorch-NLP, or torchnlp for short, is a library of basic utilities for PyTorch NLP. tor

Michael Petrochuk 1.9k Feb 18, 2021