Code for EMNLP 2021 main conference paper "Text AutoAugment: Learning Compositional Augmentation Policy for Text Classification"

LancoPKU

Last update: Jan 3, 2023


Text-AutoAugment (TAA)

This repository contains the code for our paper Text AutoAugment: Learning Compositional Augmentation Policy for Text Classification (EMNLP 2021 main conference).

Overview

We present a learnable and compositional framework for data augmentation. Our proposed algorithm automatically searches for the optimal compositional policy, which improves the diversity and quality of augmented samples.
In low-resource and class-imbalanced regimes of six benchmark datasets, TAA significantly improves the generalization ability of deep neural networks like BERT and effectively boosts text classification performance.
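
To make "compositional policy" concrete, here is a minimal sketch in which a sub-policy is a sequence of (operation, probability, magnitude) triples applied one after another. The two toy operations, the function names, and the triple layout are illustrative assumptions; they are not the operations or data structures used in this repository.

```python
import random


def drop_words(text, magnitude, rng):
    """Toy operation (hypothetical): drop each word with probability `magnitude`."""
    words = text.split()
    kept = [w for w in words if rng.random() >= magnitude]
    # Never return an empty string; fall back to the input instead.
    return " ".join(kept) if kept else text


def swap_adjacent(text, magnitude, rng):
    """Toy operation (hypothetical): swap about `magnitude * len(words)` adjacent pairs."""
    words = text.split()
    n_swaps = int(round(magnitude * len(words)))
    for _ in range(n_swaps):
        if len(words) < 2:
            break
        i = rng.randrange(len(words) - 1)
        words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)


def apply_sub_policy(text, sub_policy, rng):
    """Apply each (operation, probability, magnitude) triple in order.

    Each operation fires with its own probability, so the output of one
    operation becomes the input of the next: that chaining is the
    "compositional" part of this sketch.
    """
    for operation, probability, magnitude in sub_policy:
        if rng.random() < probability:
            text = operation(text, magnitude, rng)
    return text


if __name__ == "__main__":
    rng = random.Random(0)
    sub_policy = [(drop_words, 0.5, 0.1), (swap_adjacent, 0.5, 0.2)]
    print(apply_sub_policy("this movie was surprisingly good", sub_policy, rng))
```

A search procedure would then tune the probabilities and magnitudes rather than fixing them by hand.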

Getting Started

Prepare environment

conda create -n taa python=3.6
conda activate taa
conda install pytorch torchvision cudatoolkit=10.0 -c https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch
pip install -r requirements.txt 
python -c "import nltk; nltk.download('wordnet'); nltk.download('averaged_perceptron_tagger')"

Modify the dataroot parameter in confs/*.yaml and the abspath parameter in script/*.sh:
- e.g., change dataroot: /home/renshuhuai/TextAutoAugment/data/aclImdb in confs/bert_imdb.yaml to dataroot: path-to-your-TextAutoAugment/data/aclImdb
- change --abspath '/home/renshuhuai/TextAutoAugment' in script/imdb_lowresource.sh to --abspath 'path-to-your-TextAutoAugment'
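
If there are many config files, the dataroot edit can be scripted. The sketch below assumes each config has a top-level `dataroot:` line, as in the example above; the helper name `replace_dataroot_prefix` is hypothetical and not part of this repository.

```python
from pathlib import Path


def replace_dataroot_prefix(yaml_text, old_prefix, new_prefix):
    """Swap `old_prefix` for `new_prefix` in every top-level `dataroot:` line.

    Assumption: `dataroot` is a top-level key written on a single line.
    All other lines are returned unchanged.
    """
    out = []
    for line in yaml_text.splitlines():
        if line.startswith("dataroot:"):
            value = line.split(":", 1)[1].strip()
            if value.startswith(old_prefix):
                value = new_prefix + value[len(old_prefix):]
            line = f"dataroot: {value}"
        out.append(line)
    trailing = "\n" if yaml_text.endswith("\n") else ""
    return "\n".join(out) + trailing


if __name__ == "__main__":
    old = "/home/renshuhuai/TextAutoAugment"
    new = str(Path.cwd())  # assumption: the script is run from the repository root
    for conf in Path("confs").glob("*.yaml"):
        conf.write_text(replace_dataroot_prefix(conf.read_text(), old, new))
```

The same prefix swap would apply to the `--abspath` argument in script/*.sh, although those files are shell scripts rather than YAML.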
Search for the best augmentation policy, e.g., low-resource regime for IMDB:
```
sh script/imdb_lowresource.sh
```
Scripts for policy search in the low-resource and class-imbalanced regimes for all datasets are provided in the script/ folder.
Train a model with the pre-searched policy in archive.py, e.g., train a model in the low-resource regime for IMDB:
```
python train.py -c confs/bert_imdb.yaml 
```
Train a model on the full IMDB dataset:
```
python train.py -c confs/bert_imdb.yaml --train-npc -1 --valid-npc -1 --test-npc -1  
```

Contact

If you have any questions related to the code or the paper, feel free to email Shuhuai (renshuhuai007 [AT] gmail [DOT] com).

Acknowledgments

This code builds on fast-autoaugment.

Citation

If you find this code useful for your research, please consider citing:

@inproceedings{ren2021taa,
  title={Text AutoAugment: Learning Compositional Augmentation Policy for Text Classification},
  author={Shuhuai Ren and Jinchao Zhang and Lei Li and Xu Sun and Jie Zhou},
  booktitle={EMNLP},
  year={2021}
}

License

MIT

Comments

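How do I reproduce the code?

Hello! I am a master's student and am very interested in your paper. I would like to reproduce its code for my research, but I ran into some obstacles along the way and hope you can answer a few questions.
1. The paper says the experiments were run on 8 Tesla P40 GPUs. My setup is 2 × 1080 GPUs, 16 GB of GPU memory, and CUDA 10.2, and I configured the environment following your README. Can this setup run your code?
2. The README says that reproducing the results mainly means running reproduce_experiment.py, but line 46 of that script raises an error saying the required files are missing under taa/models. I tried running search.py to generate the required policy file, but I do not know where to start, and I am confused about the overall run-and-debug workflow. If I want to fully reproduce the results, which steps should I follow, and in what order?

I am still a beginner when it comes to code. Thank you very much for taking the time to answer!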

opened by javanlu123 3
The return value in augmentation.py cannot be used as a data source

Each transform function in augmentation.py should return a str, but by default it returns a list. For example, change

    def random_word_delete(text, m):
        return aug.augment(text)

to

    def random_word_delete(text, m):
        return aug.augment(text)[0]
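
Rather than indexing with `[0]` in every transform, the return value can be normalized in one place. This is only a sketch: it assumes the augmenter returns either a str or a non-empty list of str, and the helper name `as_text` is hypothetical.

```python
def as_text(augmented):
    """Normalize an augmenter's return value to a single string.

    Assumption: the augmenter returns either a str or a non-empty
    list/tuple whose first element is the augmented text.
    """
    if isinstance(augmented, str):
        return augmented
    if isinstance(augmented, (list, tuple)) and augmented:
        return augmented[0]
    raise ValueError(f"unexpected augmenter output: {augmented!r}")
```

A transform would then end with `return as_text(aug.augment(text))`, which works whether the call yields a string or a list.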

opened by mvllwong 2
module 'ray.tune' has no attribute 'trial_runner'

Traceback (most recent call last):
  File "reproduce_experiment.py", line 10, in
    from taa.search import get_path, search_policy, train_model_parallel
  File "/root/yxyanyi/xiaozhu/text-autoaugment-main/taa/search.py", line 51, in
    patch = gorilla.Patch(ray.tune.trial_runner.TrialRunner, 'step', step_w_log, settings=gorilla.Settings(allow_hit=True))
AttributeError: module 'ray.tune' has no attribute 'trial_runner'

Hello, after installing the environment following your GitHub instructions, I ran reproduce_experiment.py directly and got this error. Any guidance would be appreciated.
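
An error like this generally means the installed package does not expose the submodule that the code expects, which can happen when the installed version differs from the one the code was written against. The following diagnostic sketch checks whether a dotted module name can be found before any patching is attempted; the helper name `has_submodule` is hypothetical, and the check identifies the mismatch without fixing it.

```python
import importlib.util


def has_submodule(dotted_name):
    """Return True if `dotted_name` can be located as an importable module."""
    try:
        return importlib.util.find_spec(dotted_name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package (for example `ray`) is not installed.
        return False


if __name__ == "__main__":
    name = "ray.tune.trial_runner"
    print(name, "is available" if has_submodule(name) else "is NOT available")
```

If the check reports the submodule as missing, comparing the installed Ray version against the one pinned in requirements.txt is a reasonable next step.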

opened by caesar-jojo 1
load_dataset raises an error when using a custom dataset

Hello, when I use a custom dataset (in the same format as the example dataset) and run with the example config file, the load_dataset function raises an error. Details:

Traceback (most recent call last):
  File "/opt/anaconda3/envs/forRL/lib/python3.8/site-packages/ray/tune/ray_trial_executor.py", line 901, in get_next_executor_event
    future_result = ray.get(ready_future)
  File "/opt/anaconda3/envs/forRL/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/opt/anaconda3/envs/forRL/lib/python3.8/site-packages/ray/worker.py", line 1809, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TuneError): ray::ImplicitFunc.train() (pid=51539, ip=10.10.25.1, repr=objective)
  File "/opt/anaconda3/envs/forRL/lib/python3.8/site-packages/ray/tune/trainable.py", line 349, in train
    result = self.step()
  File "/opt/anaconda3/envs/forRL/lib/python3.8/site-packages/ray/tune/function_runner.py", line 403, in step
    self._report_thread_runner_error(block=True)
  File "/opt/anaconda3/envs/forRL/lib/python3.8/site-packages/ray/tune/function_runner.py", line 567, in _report_thread_runner_error
    raise TuneError(
ray.tune.error.TuneError: Trial raised an exception.
Traceback: ray::ImplicitFunc.train() (pid=51539, ip=10.10.25.1, repr=objective)
  File "/opt/anaconda3/envs/forRL/lib/python3.8/site-packages/ray/tune/function_runner.py", line 272, in run
    self._entrypoint()
  File "/opt/anaconda3/envs/forRL/lib/python3.8/site-packages/ray/tune/function_runner.py", line 348, in entrypoint
    return self._trainable_func(
  File "/opt/anaconda3/envs/forRL/lib/python3.8/site-packages/ray/tune/function_runner.py", line 640, in _trainable_func
    output = fn()
  File "/mnt/zhzhang_hdd/implicit-hate-corpus/text_autoaugment/taa/search.py", line 85, in objective
    result = train_and_eval(config['tag'], policy_opt=True, save_path=save_path, only_eval=False)
  File "/mnt/zhzhang_hdd/implicit-hate-corpus/text_autoaugment/taa/train.py", line 49, in train_and_eval
    train_dataset, val_dataset, test_dataset = get_datasets(dataset_type, policy_opt=policy_opt)
  File "/mnt/zhzhang_hdd/implicit-hate-corpus/text_autoaugment/taa/data.py", line 58, in get_datasets
    test_dataset = load_dataset(path=path, name=dataset, data_dir=data_dir, data_files=data_files, split='test')
  File "/opt/anaconda3/envs/forRL/lib/python3.8/site-packages/datasets/load.py", line 1714, in load_dataset
    ds = builder_instance.as_dataset(split=split, ignore_verifications=ignore_verifications, in_memory=keep_in_memory)
  File "/opt/anaconda3/envs/forRL/lib/python3.8/site-packages/datasets/builder.py", line 763, in as_dataset
    datasets = utils.map_nested(
  File "/opt/anaconda3/envs/forRL/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 250, in map_nested
    return function(data_struct)
  File "/opt/anaconda3/envs/forRL/lib/python3.8/site-packages/datasets/builder.py", line 794, in _build_single_dataset
    ds = self._as_dataset(
  File "/opt/anaconda3/envs/forRL/lib/python3.8/site-packages/datasets/builder.py", line 862, in _as_dataset
    dataset_kwargs = ArrowReader(self._cache_dir, self.info).read(
  File "/opt/anaconda3/envs/forRL/lib/python3.8/site-packages/datasets/arrow_reader.py", line 211, in read
    files = self.get_file_instructions(name, instructions, split_infos)
  File "/opt/anaconda3/envs/forRL/lib/python3.8/site-packages/datasets/arrow_reader.py", line 184, in get_file_instructions
    file_instructions = make_file_instructions(
  File "/opt/anaconda3/envs/forRL/lib/python3.8/site-packages/datasets/arrow_reader.py", line 107, in make_file_instructions
    absolute_instructions = instruction.to_absolute(name2len)
  File "/opt/anaconda3/envs/forRL/lib/python3.8/site-packages/datasets/arrow_reader.py", line 618, in to_absolute
    return [_rel_to_abs_instr(rel_instr, name2len) for rel_instr in self._relative_instructions]
  File "/opt/anaconda3/envs/forRL/lib/python3.8/site-packages/datasets/arrow_reader.py", line 618, in
    return [_rel_to_abs_instr(rel_instr, name2len) for rel_instr in self._relative_instructions]
  File "/opt/anaconda3/envs/forRL/lib/python3.8/site-packages/datasets/arrow_reader.py", line 433, in _rel_to_abs_instr
    raise ValueError(f'Unknown split "{split}". Should be one of {list(name2len)}.')
ValueError: Unknown split "test". Should be one of ['train'].

Do I need to split the data in the data folder beforehand? And how should I modify the config file? Thank you.
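
The final error says that only a 'train' split was found while the code requested 'test'. One plausible remedy, not confirmed by the authors on this page, is to split the data into separate train and test portions before loading. The sketch below shows only the splitting step with the standard library; the function name `split_rows` and the 80/20 ratio are assumptions, and how TAA expects the resulting files to be named or referenced in the config is not stated here.

```python
import random


def split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle `rows` reproducibly and return (train_rows, test_rows).

    At least one row goes to the test portion when `rows` is non-empty.
    """
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = max(1, int(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]


if __name__ == "__main__":
    examples = [(f"sentence {i}", i % 2) for i in range(10)]
    train_rows, test_rows = split_rows(examples)
    print(len(train_rows), "train rows,", len(test_rows), "test rows")
```

Each portion could then be written to its own file, since the `datasets` library accepts a `data_files` mapping from split names to files.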

opened by zzh-SJTU 1
"Exception: This class is a singleton!"

Hi there, when I try to use search and augment like : augmented_train_dataset = search_and_augment(configfile="./text-autoaugment/taa/confs/bert_sst2_example.yaml") I have the error "Exception: This class is a singleton!". I did not modify anything in the file. How can I fix it ?

opened by isilberfin 1
Make requirements more relaxed

When installing other libraries, for instance a later version of huggingface transformers or other augmentation libraries, the requirements are a bit too restrictive.

I have tested the fork, and it still produces the expected output when running the training scripts in scripts/.

I would like to use the framework, but with these restrictions both pip and poetry fail due to library version conflicts. @RenShuhuai-Andy @wolvecap

opened by MarkusSagen 0
CVE-2007-4559 Patch

Patching CVE-2007-4559

Hi, we are security researchers from the Advanced Research Center at Trellix. We have begun a campaign to patch a widespread bug named CVE-2007-4559. CVE-2007-4559 is a 15-year-old bug in the Python tarfile package. By using extract() or extractall() on a tarfile object without sanitizing input, a maliciously crafted .tar file could perform a directory path traversal attack. We found at least one unsanitized extractall() in your codebase and are providing a patch for you via pull request. The patch essentially checks to see if all tarfile members will be extracted safely and throws an exception otherwise. We encourage you to use this patch or your own solution to secure against CVE-2007-4559. Further technical information about the vulnerability can be found in this blog.

If you have further questions, you may contact us through this project's lead researcher, Kasimir Schulz.
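
The check described above can be sketched as follows. This illustrates the general technique (reject any member whose resolved path falls outside the target directory) and is not the exact patch from the pull request; the function names are hypothetical, and the sketch does not examine symbolic links stored inside the archive.

```python
import os
import tarfile


def is_within_directory(directory, target):
    """Return True if `target` resolves to a path inside `directory`."""
    abs_directory = os.path.abspath(directory)
    abs_target = os.path.abspath(target)
    return os.path.commonpath([abs_directory, abs_target]) == abs_directory


def safe_extract(tar, path="."):
    """Extract every member of an open TarFile, refusing path traversal.

    All members are checked before anything is written, so a malicious
    archive is rejected without leaving partially extracted files behind.
    """
    for member in tar.getmembers():
        member_path = os.path.join(path, member.name)
        if not is_within_directory(path, member_path):
            raise ValueError(f"unsafe path in tar archive: {member.name!r}")
    tar.extractall(path)
```

A call such as `tar.extractall(dest)` would be replaced by `safe_extract(tar, dest)`. Recent Python releases also provide extraction filters in `tarfile` that address the same class of problem.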

opened by TrellixVulnTeam 0
Cannot find the file sst2_Bert_op2_policy4_n-aug1_ir0.02_taa.log

FileNotFoundError: [Errno 2] No such file or directory: 'D:\anaconda\qass\text-autoaugment-main\text-autoaugment-main\examples\taa\models\sst2_Bert_op2_policy4_n-aug1_ir0.02_taa.log' The file cannot be found. What is this file for?

opened by gf52 10
