Unified Multilingual Robustness Evaluation Toolkit for Natural Language Processing
[TextFlint Documentation on ReadTheDocs]
About • Setup • Usage • Design
About
TextFlint is a multilingual robustness evaluation platform for natural language processing tasks. It unifies general text transformations, task-specific transformations, adversarial attacks, subpopulations, and their combinations to provide comprehensive robustness analysis.
Features:
There are lots of reasons to use TextFlint:
- Full coverage of transformation types, including 20 general transformations, 8 subpopulations and 60 task-specific transformations, as well as thousands of their combinations, which covers nearly all aspects of text transformation and enables a comprehensive robustness evaluation of your model. TextFlint also supports adversarial attacks to generate model-specific transformed data.
- Targeted data augmentation: you can use the generated data to train or fine-tune your model and improve its robustness.
- Automatic generation of a complete analytical report that accurately explains where your model's shortcomings are, such as problems with lexical or syntactic rules.
Setup
Installation
You can either use pip or clone this repo to install TextFlint.
- Using pip (recommended)
  pip install textflint
- Cloning this repo
  git clone https://github.com/textflint/textflint.git
  cd textflint
  python setup.py install
Usage
Workflow
The general workflow of TextFlint is displayed above. Evaluation of a target model can be divided into three steps:
- For input preparation, the original dataset for testing, which is to be loaded by `Dataset`, should first be formatted as a series of `JSON` objects. The TextFlint configuration is specified by `Config`, and the target model is loaded as a `FlintModel`.
- In adversarial sample generation, multi-perspective transformations (i.e., `Transformation`, `Subpopulation` and `AttackRecipe`) are performed on the `Dataset` to generate transformed samples. To ensure the semantic and grammatical correctness of the transformed samples, `Validator` calculates a confidence score for each sample and filters out unacceptable ones.
- Lastly, `Analyzer` collects the evaluation results and `ReportGenerator` automatically generates a comprehensive report of model robustness.
Quick Start
The following code snippet shows how to generate transformed data for the Sentiment Analysis (SA) task.
from textflint.engine import Engine

# load the data samples
sample1 = {'x': 'Titanic is my favorite movie.', 'y': 'pos'}
sample2 = {'x': 'I don\'t like the actor Tim Hill', 'y': 'neg'}
data_samples = [sample1, sample2]

# define the output directory
out_dir_path = './test_result/'

# configuration: None presumably falls back to the default SA settings (SA.json);
# otherwise pass the path of your own configuration file
config = None

# run transformation/subpopulation/attack and save the transformed data to out_dir_path in json format
engine = Engine('SA')
engine.run(data_samples, out_dir_path, config)
You can also feed data to the TextFlint `Engine` in other ways (e.g., a `json` or `csv` file) where each line represents one sample. Default transformations and subpopulations for this task are defined in `SA.json`, and you can also pass your own configuration file as needed.
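As an illustration, the two in-memory samples above could be written to a file with one JSON object per line, the line-per-sample format mentioned above; this is a minimal sketch, and the file name `sa_input.json` is only an example:

```python
import json

# the same SA samples as above: 'x' holds the text, 'y' holds the label
samples = [
    {'x': 'Titanic is my favorite movie.', 'y': 'pos'},
    {'x': "I don't like the actor Tim Hill", 'y': 'neg'},
]

# write one JSON object per line
with open('sa_input.json', 'w', encoding='utf-8') as f:
    for sample in samples:
        f.write(json.dumps(sample) + '\n')
```

The resulting file can then be fed to the engine in place of the in-memory list of dictionaries.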
Transformed Datasets
After the transformation, here are the contents of `./test_result/`:
ori_AddEntitySummary-movie_1.json
ori_AddEntitySummary-person_1.json
trans_AddEntitySummary-movie_1.json
trans_AddEntitySummary-person_1.json
...
where `trans_AddEntitySummary-movie_1.json` contains 1 sample successfully transformed by the `AddEntitySummary` transformation, and `ori_AddEntitySummary-movie_1.json` contains the corresponding original sample. The content of `ori_AddEntitySummary-movie_1.json` is:
{"x": "Titanic is my favorite movie.", "y": "pos", "sample_id": 0}
The content of `trans_AddEntitySummary-movie_1.json` is:
{"x": "Titanic (A seventeen-year-old aristocrat falls in love with a kind but poor artist aboard the luxurious, ill-fated R.
M.S. Titanic .) is my favorite movie.", "y": "pos", "sample_id": 0}
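Since the original and transformed files share the same `sample_id`, they are easy to line up for manual inspection. A minimal sketch, using the file names from the example above:

```python
import json
import os

out_dir = './test_result/'
name = 'AddEntitySummary-movie_1.json'

def load_samples(path):
    """Load a TextFlint output file: one JSON object per line, keyed by sample_id."""
    with open(path, encoding='utf-8') as f:
        return {obj['sample_id']: obj for obj in map(json.loads, f)}

ori = load_samples(os.path.join(out_dir, 'ori_' + name))
trans = load_samples(os.path.join(out_dir, 'trans_' + name))

# print each original text next to its transformed counterpart
for sample_id, ori_sample in ori.items():
    print('original   :', ori_sample['x'])
    print('transformed:', trans[sample_id]['x'])
```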
Design
Architecture
Input layer: receives textual datasets and models as input, represented as `Dataset` and `FlintModel` respectively.

- `Dataset`: a container for `Sample` objects that provides efficient and convenient operation interfaces for `Sample`. `Dataset` supports loading, verifying, and saving data in JSON or CSV format for various NLP tasks.
- `FlintModel`: the target model used in an adversarial attack.
Generation layer: there are four main components in the generation layer:

- `Subpopulation`: generates a subset of a `Dataset`.
- `Transformation`: transforms each sample of a `Dataset` if it can be transformed.
- `AttackRecipe`: attacks the `FlintModel` and generates a `Dataset` of adversarial examples.
- `Validator`: verifies the quality of samples generated by `Transformation` and `AttackRecipe`.
Report layer: analyzes the model's test results and provides a robustness report for users.
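To make the division of labor concrete, the conceptual sketch below mirrors how these pieces fit together. It uses plain Python functions rather than the actual TextFlint classes, so all names here are illustrative only:

```python
from typing import Callable, Dict, List

Sample = Dict[str, str]  # e.g. {'x': 'Titanic is great.', 'y': 'pos'}

def subpopulation(dataset: List[Sample], keep: Callable[[Sample], bool]) -> List[Sample]:
    # Generation layer: retrieve the subset of samples that matches some attribute.
    return [s for s in dataset if keep(s)]

def transformation(dataset: List[Sample], fn: Callable[[str], str]) -> List[Sample]:
    # Generation layer: apply a text transformation to every sample.
    return [{**s, 'x': fn(s['x'])} for s in dataset]

def validate(original: List[Sample], transformed: List[Sample],
             score: Callable[[str, str], float], threshold: float) -> List[Sample]:
    # Generation layer: keep only transformed samples whose quality score is acceptable.
    return [t for o, t in zip(original, transformed)
            if score(o['x'], t['x']) >= threshold]

def report(accuracy_per_setting: Dict[str, float]) -> None:
    # Report layer: summarize model accuracy under each transformation/subpopulation.
    for name, acc in sorted(accuracy_per_setting.items()):
        print(f'{name:<25s} accuracy = {acc:.2%}')
```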
Transformation
To verify robustness comprehensively, TextFlint offers 20 universal transformations and 60 task-specific transformations, covering 12 NLP tasks. The following table summarizes the currently supported `Transformation`s; examples for each transformation can be found on our website, and a toy sketch of one transformation follows the table.
| Task | Transformation | Description | Reference |
| --- | --- | --- | --- |
| UT (Universal Transformation) | AppendIrr | Extends sentences by appending irrelevant sentences. | - |
| | BackTrans | BackTrans (Trans is short for translation) replaces test data with paraphrases produced by back translation, which helps figure out whether the target model merely captures literal features instead of semantic meaning. | - |
| | Contraction | Contraction replaces phrases like `will not` and `he has` with their contracted forms, namely `won't` and `he's`. | - |
| | InsertAdv | Transforms an input by adding an adverb before the verb. | - |
| | Keyboard | Keyboard simulates the way people mistype words, changing tokens into mistaken ones with errors caused by keyboard proximity, like `word → worf` and `ambiguous → amviguius`. | - |
| | MLMSuggestion | MLMSuggestion (MLM is short for masked language model) generates new sentences in which one syntactic-category element of the original sentence is replaced by what a masked language model predicts. | - |
| | Ocr | Simulates OCR errors by replacing characters with random values. | - |
| | Prejudice | Transforms an input by reversing gender words or place names in sentences. | - |
| | Punctuation | Transforms an input by adding punctuation at the end of the sentence. | - |
| | ReverseNeg | Transforms an affirmative sentence into a negative sentence, or vice versa. | - |
| | SpellingError | Leverages a pre-defined spelling-mistake dictionary to simulate spelling mistakes. | Text Data Augmentation Made Simple By Leveraging NLP Cloud APIs (https://arxiv.org/ftp/arxiv/papers/1812/1812.04718.pdf) |
| | SwapAntWordNet | Transforms an input by replacing its words with antonyms provided by WordNet. | - |
| | SwapNamedEnt | Swaps entities with other entities of the same category. | - |
| | SwapNum | Transforms an input by replacing the numbers in it. | - |
| | SwapSynWordEmbedding | Transforms an input by replacing its words with neighbors in the GloVe embedding space. | - |
| | SwapSynWordNet | Transforms an input by replacing its words with synonyms provided by WordNet. | - |
| | Tense | Transforms all verb tenses in the sentence. | - |
| | TwitterType | Transforms the input using common Twitter-style abbreviations. | - |
| | Typos | Randomly inserts, deletes, swaps or replaces a single letter within one word (Ireland → Irland). | Synthetic and Noise Both Break Neural Machine Translation (https://arxiv.org/pdf/1711.02173.pdf) |
| | WordCase | Transforms an input to upper case, lower case, or capitalized case. | - |
| RE (Relation Extraction) | InsertClause | Inserts entity descriptions for the head and tail entities. | - |
| | SwapEnt-LowFreq | A sub-transformation of EntitySwap that replaces entities in the text with random low-frequency entities of the same type. | - |
| | SwapTriplePos-Birth | Specially designed for the birth relation; paraphrases the sentence while keeping the original birth relation between the entity pair. | - |
| | SwapTriplePos-Employee | Specially designed for the employee relation; deletes the TITLE description of each employee while keeping the original employee relation between the entity pair. | - |
| | SwapEnt-SamEtype | A sub-transformation of EntitySwap that replaces entities in the text with random entities of the same type. | - |
| | SwapTriplePos-Age | Specially designed for the age relation; paraphrases the sentence while keeping the original age relation between the entity pair. | - |
| | SwapEnt-MultiType | A sub-transformation of EntitySwap that replaces entities in the text with random same-typed entities that have multiple possible types. | - |
| NER (Named Entity Recognition) | EntTypos | Swaps/deletes/adds a random character in entities. | - |
| | ConcatSent | Concatenates sentences into a longer one. | - |
| | SwapLonger | Substitutes short entities with longer ones. | - |
| | CrossCategory | Entity swap with entities that can be labeled with different labels. | - |
| | OOV | Entity swap with out-of-vocabulary (OOV) entities. | - |
| POS (Part-of-Speech Tagging) | SwapMultiPOSRB | The phenomenon of conversion implies that some words hold multiple parts of speech, which might confuse language models in terms of POS tagging. Accordingly, adverbs are replaced with words holding multiple parts of speech. | - |
| | SwapPrefix | Swaps the prefix of one word while keeping its part-of-speech tag. | - |
| | SwapMultiPOSVB | Like SwapMultiPOSRB, but replaces verbs with words holding multiple parts of speech. | - |
| | SwapMultiPOSNN | Like SwapMultiPOSRB, but replaces nouns with words holding multiple parts of speech. | - |
| | SwapMultiPOSJJ | Like SwapMultiPOSRB, but replaces adjectives with words holding multiple parts of speech. | - |
| COREF (Coreference Resolution) | RndConcat | Randomly retrieves an irrelevant paragraph from the corpus and concatenates it after the original document. | - |
| | RndDelete | Deletes each sentence of the original document with some probability (20% by default), deleting at least one sentence; the related coreference labels are deleted as well. | - |
| | RndReplace | Randomly retrieves irrelevant sentences from the corpus and replaces sentences of the original document with them (the ratio of replaced sentences to original sentences is 20% by default). | - |
| | RndShuffle | Performs a number of swaps of adjacent sentences in the original document (the number of swaps is 20% of the number of original sentences by default). | - |
| | RndInsert | Randomly retrieves irrelevant sentences from the corpus and inserts them into the original document (the ratio of inserted sentences to original sentences is 20% by default). | - |
| | RndRepeat | Randomly picks sentences from the original document and inserts them elsewhere in the document (the ratio of inserted sentences to original sentences is 20% by default). | - |
| ABSA (Aspect-based Sentiment Analysis) | RevTgt | Reverses the sentiment of the target aspect. | Tasty Burgers, Soggy Fries: Probing Aspect Robustness in Aspect-Based Sentiment Analysis (https://www.aclweb.org/anthology/2020.emnlp-main.292.pdf) |
| | AddDiff | Adds aspects with the opposite sentiment from the target aspect. | |
| | RevNon | Reverses the sentiment of the non-target aspects that originally have the same sentiment as the target. | |
| CWS (Chinese Word Segmentation) | SwapContraction | Replaces common abbreviations in the sentence with complete words of the same meaning. | - |
| | SwapNum | Replaces numerals in the sentence with other numerals of similar size. | - |
| | SwapSyn | Replaces some words in the sentence with very similar words. | - |
| | SwapName | Replaces the surname or given name of a person in the sentence to produce local ambiguity that is irrelevant to the sentence. | - |
| | SwapVerb | Transforms some verbs in the sentence into other forms used in Chinese. | - |
| SM (Semantic Matching) | SwapWord | Adds a meaningless sentence to the premise without changing the semantics. | - |
| | SwapNum | Finds number words in the sentences and replaces them with different number words. | - |
| | Overlap | Generates data from templates in which the hypothesis and sentence1 have high word overlap but different meanings. | - |
| SA (Sentiment Analysis) | SwapSpecialEnt-Person | Identifies person names in the sentence and randomly replaces them with other entity names of the same kind. | - |
| | SwapSpecialEnt-Movie | Identifies movie names in the sentence and randomly replaces them with other movie names. | - |
| | AddSum-Movie | Identifies movie names in the sentence and inserts their summaries (taken from Wikipedia) after them. | - |
| | AddSum-Person | Identifies person names in the sentence and inserts their summaries (taken from Wikipedia) after them. | - |
| | DoubleDenial | Finds specific words in the sentence and replaces them with double negations. | - |
| NLI (Natural Language Inference) | NumWord | Finds number words in the sentences and replaces them with different number words. | Stress Test Evaluation for Natural Language Inference (https://www.aclweb.org/anthology/C18-1198/) |
| | SwapAnt | Finds keywords in the sentences and replaces them with their antonyms. | |
| | AddSent | Adds a meaningless sentence to the premise without changing the semantics. | |
| | Overlap | Generates data from templates in which the hypothesis and premise have high word overlap but different meanings. | Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference (https://www.aclweb.org/anthology/P19-1334/) |
| MRC (Machine Reading Comprehension) | PerturbQuestion-MLM | Paraphrases the question using a masked language model. | - |
| | PerturbQuestion-BackTrans | Paraphrases the question using back translation. | - |
| | AddSentDiverse | Generates a distractor with an altered question and a fake answer. | Adversarial Augmentation Policy Search for Domain and Cross-Lingual Generalization in Reading Comprehension (https://arxiv.org/pdf/2004.06076) |
| | PerturbAnswer | Transforms the sentence containing the gold answer based on specific rules. | |
| | ModifyPos | Rotates the order of the sentences in the context. | - |
| DP (Dependency Parsing) | AddSubtree | Transforms the input sentence by adding a subordinate clause from WikiData. | - |
| | RemoveSubtree | Transforms the input sentence by removing a subordinate clause. | - |
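As a concrete illustration of the character-level transformations in the table (e.g., `Typos`), the sketch below randomly inserts, deletes, swaps, or replaces a single letter in a word. It is a toy example, not the TextFlint implementation:

```python
import random
import string

def typo(word: str, rng: random.Random = random.Random(0)) -> str:
    """Apply one random character-level edit, in the spirit of the Typos transformation."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    op = rng.choice(['insert', 'delete', 'swap', 'replace'])
    if op == 'insert':   # add a random letter
        return word[:i] + rng.choice(string.ascii_lowercase) + word[i:]
    if op == 'delete':   # drop one letter
        return word[:i] + word[i + 1:]
    if op == 'swap':     # swap two adjacent letters
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    return word[:i] + rng.choice(string.ascii_lowercase) + word[i + 1:]  # replace one letter

print(typo('Ireland'))  # prints a misspelled variant of 'Ireland'; which edit depends on the random seed
```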
Subpopulation
`Subpopulation` identifies the specific part of a dataset on which the target model performs poorly. To retrieve a subset that meets the configuration, `Subpopulation` divides the dataset by sorting samples according to certain attributes. We support the following `Subpopulation`s (a toy example follows the table):
| Subpopulation | Description | Reference |
| --- | --- | --- |
| LMSubPopulation_0%-20% | Filters samples based on text perplexity from a language model (i.e., GPT-2); 0%-20% is the lower part of the scores. | Robustness Gym: Unifying the NLP Evaluation Landscape (https://arxiv.org/pdf/2101.04840) |
| LMSubPopulation_80%-100% | Filters samples based on text perplexity from a language model (i.e., GPT-2); 80%-100% is the higher part of the scores. | |
| LengthSubPopulation_0%-20% | Filters samples based on text length; 0%-20% is the lower part of the lengths. | |
| LengthSubPopulation_80%-100% | Filters samples based on text length; 80%-100% is the higher part of the lengths. | |
| PhraseSubPopulation-negation | Filters samples based on a group of phrases; the remaining samples contain negation words (e.g., not, don't, aren't, no). | |
| PhraseSubPopulation-question | Filters samples based on a group of phrases; the remaining samples contain question words (e.g., what, which, how, when). | |
| PrejudiceSubpopulation-man | Filters samples based on gender; the chosen samples contain only words related to males (e.g., he, his, father, boy). | |
| PrejudiceSubpopulation-woman | Filters samples based on gender; the chosen samples contain only words related to females (e.g., she, her, mother, girl). | |
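For intuition, a filter in the style of `LengthSubPopulation_0%-20%` can be sketched in a few lines: rank samples by an attribute (here, token count) and keep a percentile slice. This is a conceptual example, not the TextFlint implementation:

```python
from typing import Dict, List

def length_subpopulation(dataset: List[Dict[str, str]],
                         lo: float = 0.0, hi: float = 0.2) -> List[Dict[str, str]]:
    """Keep the samples whose text length falls between the lo and hi percentiles."""
    ranked = sorted(dataset, key=lambda s: len(s['x'].split()))
    n = len(ranked)
    return ranked[int(lo * n):int(hi * n)]

data = [{'x': 'Great movie.', 'y': 'pos'},
        {'x': 'I have rarely seen a film this tedious and this long.', 'y': 'neg'},
        {'x': 'Not bad at all.', 'y': 'pos'},
        {'x': 'Fine.', 'y': 'pos'},
        {'x': 'The plot made no sense to me whatsoever.', 'y': 'neg'}]

print(length_subpopulation(data))  # the shortest 20% of the samples (1 of 5 here)
```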
AttackRecipe
`AttackRecipe` aims to find a perturbation of an input text that satisfies the attack's goal of fooling the given `FlintModel`. In contrast to `Transformation`, `AttackRecipe` requires the prediction scores of the target model. TextFlint provides an interface to integrate the easy-to-use adversarial attack recipes implemented in textattack. Users can refer to textattack for more information about the supported `AttackRecipe`s.
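The essential difference from a plain `Transformation` is that an attack repeatedly queries the target model while searching for a perturbation. The toy greedy word-substitution attack below illustrates that loop with a dummy scoring function; it is a conceptual sketch only and does not reflect the textattack or TextFlint APIs:

```python
from typing import Callable, Dict, List

def greedy_word_attack(text: str,
                       substitutes: Dict[str, List[str]],
                       true_label_score: Callable[[str], float],
                       threshold: float = 0.5) -> str:
    """Greedily swap words for candidate substitutes, keeping a swap only when it
    lowers the model's confidence in the true label; stop once the confidence
    drops below the decision threshold (i.e., the model is fooled)."""
    words = text.split()
    best_score = true_label_score(text)
    for i, word in enumerate(words):
        for sub in substitutes.get(word.lower(), []):
            trial = words[:i] + [sub] + words[i + 1:]
            score = true_label_score(' '.join(trial))
            if score < best_score:      # the substitution hurts the model, keep it
                words, best_score = trial, score
            if best_score < threshold:  # attack goal reached
                return ' '.join(words)
    return ' '.join(words)

# dummy "model": confidence in the positive label drops once 'favorite' disappears
score_pos = lambda t: 0.9 if 'favorite' in t else 0.3
print(greedy_word_attack('Titanic is my favorite movie.',
                         {'favorite': ['decent', 'passable']},
                         score_pos))
```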
Validator
It is crucial to verify the quality of samples generated by `Transformation` and `AttackRecipe`. TextFlint provides several metrics to calculate confidence (a sketch of one such metric follows the table):
| Validator | Description | Reference |
| --- | --- | --- |
| MaxWordsPerturbed | Word replacement ratio of the generated text compared with the original text, based on the longest common subsequence (LCS). | - |
| LevenshteinDistance | The edit distance between the original text and the generated text. | - |
| DeCLUTREncoder | Semantic similarity calculated based on Universal Sentence Encoder. | Universal Sentence Encoder (https://arxiv.org/pdf/1803.11175.pdf) |
| GPT2Perplexity | Language model perplexity calculated with the GPT-2 model. | Language Models are Unsupervised Multitask Learners (http://www.persagen.com/files/misc/radford2019language.pdf) |
| TranslateScore | BLEU/METEOR/chrF score. | Bleu: a Method for Automatic Evaluation of Machine Translation (https://www.aclweb.org/anthology/P02-1040.pdf); METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments (https://www.aclweb.org/anthology/W05-0909.pdf); chrF: character n-gram F-score for automatic MT evaluation (https://www.aclweb.org/anthology/W15-3049.pdf) |
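As an example of the kind of score a validator computes, the edit distance used by `LevenshteinDistance` fits in a dozen lines. This is a minimal sketch rather than the TextFlint implementation:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of character insertions, deletions, and substitutions
    needed to turn string a into string b (classic dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute ca -> cb
        prev = curr
    return prev[-1]

# a large distance relative to the text length suggests the generated sample
# drifted too far from the original and should be filtered out
print(levenshtein('Titanic is my favorite movie.',
                  'Titanic is my favourite movie!'))  # -> 2
```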
Report
In the generation layer, TextFlint can generate three types of adversarial samples and verify the robustness of the target model. Based on the results from the generation layer, the report layer aims to provide users with a standard analysis report at the lexical, syntactic, and semantic levels. For example, on the Sentiment Analysis (SA) task, this is a statistical chart of the performance of `XLNet` with different types of `Transformation`/`Subpopulation`/`AttackRecipe` on the `IMDB` dataset. We can see that the model's performance is lower than the original results on all the transformed datasets.
Citation
If you are using TextFlint for your work, please cite:
@article{gui2021textflint,
title={TextFlint: Unified Multilingual Robustness Evaluation Toolkit for Natural Language Processing},
author={Gui, Tao and Wang, Xiao and Zhang, Qi and Liu, Qin and Zou, Yicheng and Zhou, Xin and Zheng, Rui and Zhang, Chong and Wu, Qinzhuo and Ye, Jiacheng and others},
journal={arXiv preprint arXiv:2103.11441},
year={2021}
}