Text-AutoAugment (TAA)

This repository contains the code for our paper Text AutoAugment: Learning Compositional Augmentation Policy for Text Classification (EMNLP 2021 main conference).

Overview

  1. We present a learnable and compositional framework for data augmentation. Our proposed algorithm automatically searches for the optimal compositional policy, which improves both the diversity and the quality of augmented samples (a minimal sketch of such a policy follows this list).

  2. In low-resource and class-imbalanced regimes on six benchmark datasets, TAA significantly improves the generalization ability of deep neural networks such as BERT and effectively boosts text classification performance.
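
To make the idea of a compositional policy concrete, here is a purely illustrative sketch of how such a policy could be sampled and applied; the operation names, probabilities, and magnitudes below are made up and are not the searched policies shipped in archive.py:

    import random

    def random_word_delete(text, m):
        """Toy op: delete each word with probability m (stand-in for the ops in augmentation.py)."""
        kept = [w for w in text.split() if random.random() > m]
        return " ".join(kept) if kept else text

    def random_word_swap(text, m):
        """Toy op: swap roughly m * len(words) adjacent word pairs."""
        words = text.split()
        for _ in range(max(1, int(m * len(words)))):
            if len(words) < 2:
                break
            i = random.randrange(len(words) - 1)
            words[i], words[i + 1] = words[i + 1], words[i]
        return " ".join(words)

    OPS = {"random_word_delete": random_word_delete, "random_word_swap": random_word_swap}

    # A policy is a set of sub-policies; each sub-policy chains
    # (operation, probability, magnitude) triples. Values here are hypothetical.
    POLICY = [
        [("random_word_swap", 0.5, 0.1), ("random_word_delete", 0.8, 0.1)],
        [("random_word_delete", 0.3, 0.2)],
    ]

    def apply_policy(text, policy):
        sub_policy = random.choice(policy)        # sample one sub-policy per example
        for name, prob, magnitude in sub_policy:  # apply each op stochastically
            if random.random() < prob:
                text = OPS[name](text, magnitude)
        return text

    print(apply_policy("text autoaugment searches for compositional augmentation policies", POLICY))

Jointly searching over which operations to compose and over their probabilities and magnitudes is what the framework optimizes.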

Getting Started

  1. Prepare the environment (a quick sanity check for this step is sketched after this list)

    conda create -n taa python=3.6
    conda activate taa
    conda install pytorch torchvision cudatoolkit=10.0 -c https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch
    pip install -r requirements.txt 
    python -c "import nltk; nltk.download('wordnet'); nltk.download('averaged_perceptron_tagger')"
  2. Modify the dataroot parameter in confs/*.yaml and the abspath parameter in script/*.sh:

    • e.g., change dataroot: /home/renshuhuai/TextAutoAugment/data/aclImdb in confs/bert_imdb.yaml to dataroot: path-to-your-TextAutoAugment/data/aclImdb
    • change --abspath '/home/renshuhuai/TextAutoAugment' in script/imdb_lowresource.sh to --abspath 'path-to-your-TextAutoAugment'
  3. Search for the best augmentation policy, e.g., in the low-resource regime for IMDB:

    sh script/imdb_lowresource.sh

    Scripts for policy search in the low-resource and class-imbalanced regimes for all datasets are provided in the script/ folder.

  4. Train a model with a pre-searched policy in archive.py, e.g., train a model in the low-resource regime for IMDB:

    python train.py -c confs/bert_imdb.yaml 

    To train a model on the full IMDB dataset:

    python train.py -c confs/bert_imdb.yaml --train-npc -1 --valid-npc -1 --test-npc -1  
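
As noted in step 1, the environment can be sanity-checked with a short snippet like the following (illustrative, not part of the repo):

    import torch
    import nltk

    # Confirm the PyTorch install and GPU visibility.
    print(torch.__version__, "CUDA available:", torch.cuda.is_available())

    # These raise LookupError if the downloads in step 1 did not succeed.
    nltk.data.find("corpora/wordnet")
    nltk.data.find("taggers/averaged_perceptron_tagger")
    print("NLTK resources OK")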

Contact

If you have any questions related to the code or the paper, feel free to email Shuhuai (renshuhuai007 [AT] gmail [DOT] com).

Acknowledgments

Our code is based on fast-autoaugment.

Citation

If you find this code useful for your research, please consider citing:

@inproceedings{ren2021taa,
  title={Text AutoAugment: Learning Compositional Augmentation Policy for Text Classification},
  author={Shuhuai Ren and Jinchao Zhang and Lei Li and Xu Sun and Jie Zhou},
  booktitle={EMNLP},
  year={2021}
}

License

MIT

Comments
  • How do I reproduce the code?

    Hello! I am a master's student and very interested in your paper. I want to reproduce the code for my research, but I ran into some obstacles and hope you can answer a few questions. 1. The paper mentions an experimental environment of 8 Tesla P40 GPUs; mine is 2 x GTX 1080 with 16 GB of memory and CUDA 10.2, configured following your README. Is this setup enough to run your code? 2. The README says reproduction mainly means running reproduce_experiment.py, but it fails at line 46 of that script, reporting that a required file is missing under taa/models. I tried to generate the required policy file via search.py but did not know where to start, and I am confused about the overall pipeline. What steps should I follow to reproduce the results end to end?

    I am still a beginner at coding; thank you very much for taking time out of your busy schedule to answer!

    opened by javanlu123 3
  • The return in augmentation.py cannot be used as a data source

    The return value of each transform function in augmentation.py should be a str, but a list is generated by default, e.g.:

        def random_word_delete(text, m): return aug.augment(text)

    should become

        def random_word_delete(text, m): return aug.augment(text)[0]
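
    For context, recent nlpaug releases return a list from augment() even for a single input string, which is why indexing [0] restores the str that downstream code expects. A minimal sketch of the suggested fix (mapping the magnitude m to nlpaug's aug_p is an assumption, not necessarily how augmentation.py wires it):

        import nlpaug.augmenter.word as naw

        def random_word_delete(text, m):
            # Assumption: magnitude m sets the fraction of words to delete.
            aug = naw.RandomWordAug(action="delete", aug_p=m)
            # augment() returns a list like ["augmented text"] in recent
            # nlpaug versions, so take the first element to get a plain str.
            return aug.augment(text)[0]

        print(random_word_delete("this sentence will lose a word or two", 0.2))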

    opened by mvllwong 2
  • module 'ray.tune' has no attribute 'trial_runner'

    Traceback (most recent call last):
      File "reproduce_experiment.py", line 10, in <module>
        from taa.search import get_path, search_policy, train_model_parallel
      File "/root/yxyanyi/xiaozhu/text-autoaugment-main/taa/search.py", line 51, in <module>
        patch = gorilla.Patch(ray.tune.trial_runner.TrialRunner, 'step', step_w_log, settings=gorilla.Settings(allow_hit=True))
    AttributeError: module 'ray.tune' has no attribute 'trial_runner'

    Hello, I installed the environment following your GitHub instructions and then ran reproduce_experiment.py directly, and got this error. I would appreciate your guidance.

    opened by caesar-jojo 1
  • Error in load_dataset when using a custom dataset

    Hello. When using a custom dataset (formatted the same as the example dataset) and running with the example config file, load_dataset raises an error. Details below:

    Traceback (most recent call last):
      File "/opt/anaconda3/envs/forRL/lib/python3.8/site-packages/ray/tune/ray_trial_executor.py", line 901, in get_next_executor_event
        future_result = ray.get(ready_future)
      File "/opt/anaconda3/envs/forRL/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
        return func(*args, **kwargs)
      File "/opt/anaconda3/envs/forRL/lib/python3.8/site-packages/ray/worker.py", line 1809, in get
        raise value.as_instanceof_cause()
    ray.exceptions.RayTaskError(TuneError): ray::ImplicitFunc.train() (pid=51539, ip=10.10.25.1, repr=objective)
      File "/opt/anaconda3/envs/forRL/lib/python3.8/site-packages/ray/tune/trainable.py", line 349, in train
        result = self.step()
      File "/opt/anaconda3/envs/forRL/lib/python3.8/site-packages/ray/tune/function_runner.py", line 403, in step
        self._report_thread_runner_error(block=True)
      File "/opt/anaconda3/envs/forRL/lib/python3.8/site-packages/ray/tune/function_runner.py", line 567, in _report_thread_runner_error
        raise TuneError(
    ray.tune.error.TuneError: Trial raised an exception. Traceback:
    ray::ImplicitFunc.train() (pid=51539, ip=10.10.25.1, repr=objective)
      File "/opt/anaconda3/envs/forRL/lib/python3.8/site-packages/ray/tune/function_runner.py", line 272, in run
        self._entrypoint()
      File "/opt/anaconda3/envs/forRL/lib/python3.8/site-packages/ray/tune/function_runner.py", line 348, in entrypoint
        return self._trainable_func(
      File "/opt/anaconda3/envs/forRL/lib/python3.8/site-packages/ray/tune/function_runner.py", line 640, in _trainable_func
        output = fn()
      File "/mnt/zhzhang_hdd/implicit-hate-corpus/text_autoaugment/taa/search.py", line 85, in objective
        result = train_and_eval(config['tag'], policy_opt=True, save_path=save_path, only_eval=False)
      File "/mnt/zhzhang_hdd/implicit-hate-corpus/text_autoaugment/taa/train.py", line 49, in train_and_eval
        train_dataset, val_dataset, test_dataset = get_datasets(dataset_type, policy_opt=policy_opt)
      File "/mnt/zhzhang_hdd/implicit-hate-corpus/text_autoaugment/taa/data.py", line 58, in get_datasets
        test_dataset = load_dataset(path=path, name=dataset, data_dir=data_dir, data_files=data_files, split='test')
      File "/opt/anaconda3/envs/forRL/lib/python3.8/site-packages/datasets/load.py", line 1714, in load_dataset
        ds = builder_instance.as_dataset(split=split, ignore_verifications=ignore_verifications, in_memory=keep_in_memory)
      File "/opt/anaconda3/envs/forRL/lib/python3.8/site-packages/datasets/builder.py", line 763, in as_dataset
        datasets = utils.map_nested(
      File "/opt/anaconda3/envs/forRL/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 250, in map_nested
        return function(data_struct)
      File "/opt/anaconda3/envs/forRL/lib/python3.8/site-packages/datasets/builder.py", line 794, in _build_single_dataset
        ds = self._as_dataset(
      File "/opt/anaconda3/envs/forRL/lib/python3.8/site-packages/datasets/builder.py", line 862, in _as_dataset
        dataset_kwargs = ArrowReader(self._cache_dir, self.info).read(
      File "/opt/anaconda3/envs/forRL/lib/python3.8/site-packages/datasets/arrow_reader.py", line 211, in read
        files = self.get_file_instructions(name, instructions, split_infos)
      File "/opt/anaconda3/envs/forRL/lib/python3.8/site-packages/datasets/arrow_reader.py", line 184, in get_file_instructions
        file_instructions = make_file_instructions(
      File "/opt/anaconda3/envs/forRL/lib/python3.8/site-packages/datasets/arrow_reader.py", line 107, in make_file_instructions
        absolute_instructions = instruction.to_absolute(name2len)
      File "/opt/anaconda3/envs/forRL/lib/python3.8/site-packages/datasets/arrow_reader.py", line 618, in to_absolute
        return [_rel_to_abs_instr(rel_instr, name2len) for rel_instr in self._relative_instructions]
      File "/opt/anaconda3/envs/forRL/lib/python3.8/site-packages/datasets/arrow_reader.py", line 618, in <listcomp>
        return [_rel_to_abs_instr(rel_instr, name2len) for rel_instr in self._relative_instructions]
      File "/opt/anaconda3/envs/forRL/lib/python3.8/site-packages/datasets/arrow_reader.py", line 433, in _rel_to_abs_instr
        raise ValueError(f'Unknown split "{split}". Should be one of {list(name2len)}.')
    ValueError: Unknown split "test". Should be one of ['train'].

    Do I need to split the data beforehand in the data folder, and how should the config file be modified? Thank you.

    opened by zzh-SJTU 1
  • "Exception: This class is a singleton!"

    Hi there, when I try to use search and augment like augmented_train_dataset = search_and_augment(configfile="./text-autoaugment/taa/confs/bert_sst2_example.yaml"), I get the error "Exception: This class is a singleton!". I did not modify anything in the file. How can I fix it?

    opened by isilberfin 1
  • Make requirements more relaxed

    When installing other libraries, for instance a later huggingface transformers release or other augmentation libraries, the pinned requirements are a bit too restrictive.

    I have tested the fork, and it still produces the expected output when running the training scripts in scripts/.

    I would like to use the framework, but with the current restrictions both pip and poetry fail due to library version conflicts. @RenShuhuai-Andy @wolvecap

    opened by MarkusSagen 0
  • CVE-2007-4559 Patch

    Patching CVE-2007-4559

    Hi, we are security researchers from the Advanced Research Center at Trellix. We have begun a campaign to patch a widespread bug named CVE-2007-4559. CVE-2007-4559 is a 15-year-old bug in the Python tarfile package. By using extract() or extractall() on a tarfile object without sanitizing input, a maliciously crafted .tar file could perform a directory path traversal attack. We found at least one unsanitized extractall() in your codebase and are providing a patch for you via pull request. The patch essentially checks that all tarfile members will be extracted safely and throws an exception otherwise. We encourage you to use this patch or your own solution to secure against CVE-2007-4559. Further technical information about the vulnerability can be found in this blog.
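
    The check described above amounts to verifying each member's resolved path before extraction; a sketch along those lines (not the exact code from the pull request):

        import os
        import tarfile

        def safe_extractall(tar: tarfile.TarFile, path: str = ".") -> None:
            """Extract only if no member would escape the target directory."""
            base = os.path.realpath(path)
            for member in tar.getmembers():
                target = os.path.realpath(os.path.join(path, member.name))
                # A member like "../../etc/passwd" resolves outside base.
                if os.path.commonpath([base, target]) != base:
                    raise RuntimeError(f"Blocked path traversal: {member.name}")
            tar.extractall(path)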

    If you have further questions you may contact us through this project's lead researcher, Kasimir Schulz.

    opened by TrellixVulnTeam 0
  • Cannot find the file sst2_Bert_op2_policy4_n-aug1_ir0.02_taa.log

    FileNotFoundError: [Errno 2] No such file or directory: 'D:\anaconda\qass\text-autoaugment-main\text-autoaugment-main\examples\taa\models\sst2_Bert_op2_policy4_n-aug1_ir0.02_taa.log'

    The file cannot be found. What is this file for?

    opened by gf52 10
Owner
LancoPKU
Language Computing and Machine Learning Group (Xu Sun's group) at Peking University
Code for ACL 2021 main conference paper "Conversations are not Flat: Modeling the Intrinsic Information Flow between Dialogue Utterances".

Conversations are not Flat: Modeling the Intrinsic Information Flow between Dialogue Utterances This repository contains the code and pre-trained mode

ICTNLP 90 Dec 27, 2022
Code for ACL 2022 main conference paper "STEMM: Self-learning with Speech-text Manifold Mixup for Speech Translation".

STEMM: Self-learning with Speech-Text Manifold Mixup for Speech Translation This is a PyTorch implementation for the ACL 2022 main conference paper ST

ICTNLP 29 Oct 16, 2022
This is the code for the EMNLP 2021 paper AEDA: An Easier Data Augmentation Technique for Text Classification

The baseline code is for EDA: Easy Data Augmentation techniques for boosting performance on text classification tasks

Akbar Karimi 81 Dec 9, 2022
This repository contains the code for EMNLP-2021 paper "Word-Level Coreference Resolution"

Word-Level Coreference Resolution This is a repository with the code to reproduce the experiments described in the paper of the same name, which was a

null 79 Dec 27, 2022
Code for the paper in Findings of EMNLP 2021: "EfficientBERT: Progressively Searching Multilayer Perceptron via Warm-up Knowledge Distillation".

This repository contains the code for the paper in Findings of EMNLP 2021: "EfficientBERT: Progressively Searching Multilayer Perceptron via Warm-up Knowledge Distillation".

Chenhe Dong 28 Nov 10, 2022
💛 Code and Dataset for our EMNLP 2021 paper: "Perspective-taking and Pragmatics for Generating Empathetic Responses Focused on Emotion Causes"

Perspective-taking and Pragmatics for Generating Empathetic Responses Focused on Emotion Causes Official PyTorch implementation and EmoCause evaluatio

Hyunwoo Kim 50 Dec 21, 2022
Code and dataset for the EMNLP 2021 Finding paper "Can NLI Models Verify QA Systems’ Predictions?"

Code and dataset for the EMNLP 2021 Finding paper "Can NLI Models Verify QA Systems’ Predictions?"

Jifan Chen 22 Oct 21, 2022
Code for Findings at EMNLP 2021 paper: "Learn Continually, Generalize Rapidly: Lifelong Knowledge Accumulation for Few-shot Learning"

Learn Continually, Generalize Rapidly: Lifelong Knowledge Accumulation for Few-shot Learning This repo is for Findings at EMNLP 2021 paper: Learn Cont

INK Lab @ USC 6 Sep 2, 2022
Code to reproduce the results of the paper 'Towards Realistic Few-Shot Relation Extraction' (EMNLP 2021)

Realistic Few-Shot Relation Extraction This repository contains code to reproduce the results in the paper "Towards Realistic Few-Shot Relation Extrac

Bloomberg 8 Nov 9, 2022
Code for EMNLP'21 paper "Types of Out-of-Distribution Texts and How to Detect Them"

Code for EMNLP'21 paper "Types of Out-of-Distribution Texts and How to Detect Them"

Udit Arora 19 Oct 28, 2022
This is the main repository of open-sourced speech technology by Huawei Noah's Ark Lab.

Speech-Backbones This is the main repository of open-sourced speech technology by Huawei Noah's Ark Lab. Grad-TTS Official implementation of the Grad-

HUAWEI Noah's Ark Lab 295 Jan 7, 2023
Tool to add main subject to items on Wikidata using a WMFs CirrusSearch for named entity recognition or a manually supplied list of QIDs

ItemSubjector Tool made to add main subject statements to items based on the title using a home-brewed CirrusSearch-based Named Entity Recognition alg

Dennis Priskorn 9 Nov 17, 2022
Main repository for the chatbot Bobotinho.

Bobotinho Bot Main repository for the chatbot Bobotinho. ℹ️ Introduction Twitch chatbot with entertainment commands. Technologies Concurrent code

Bobotinho 14 Nov 29, 2022
🍊 PAUSE (Positive and Annealed Unlabeled Sentence Embedding), accepted by EMNLP'2021 🌴

PAUSE: Positive and Annealed Unlabeled Sentence Embedding Sentence embedding refers to a set of effective and versatile techniques for converting raw

EQT 21 Dec 15, 2022
[EMNLP 2021] LM-Critic: Language Models for Unsupervised Grammatical Error Correction

LM-Critic: Language Models for Unsupervised Grammatical Error Correction This repo provides the source code & data of our paper: LM-Critic: Language M

Michihiro Yasunaga 98 Nov 24, 2022
Beta Distribution Guided Aspect-aware Graph for Aspect Category Sentiment Analysis with Affective Knowledge. Proceedings of EMNLP 2021

AAGCN-ACSA EMNLP 2021 Introduction This repository was used in our paper: Beta Distribution Guided Aspect-aware Graph for Aspect Category Sentiment An

Akuchi 36 Dec 18, 2022
EMNLP'2021: Can Language Models be Biomedical Knowledge Bases?

BioLAMA BioLAMA is biomedical factual knowledge triples for probing biomedical LMs. The triples are collected and pre-processed from three sources: CT

DMIS Laboratory - Korea University 41 Nov 18, 2022
A2T: Towards Improving Adversarial Training of NLP Models (EMNLP 2021 Findings)

A2T: Towards Improving Adversarial Training of NLP Models This is the source code for the EMNLP 2021 (Findings) paper "Towards Improving Adversarial T

QData 17 Oct 15, 2022
Universal Adversarial Triggers for Attacking and Analyzing NLP (EMNLP 2019)

Universal Adversarial Triggers for Attacking and Analyzing NLP This is the official code for the EMNLP 2019 paper, Universal Adversarial Triggers for

Eric Wallace 248 Dec 17, 2022