Code for EMNLP 2021 main conference paper "Text AutoAugment: Learning Compositional Augmentation Policy for Text Classification"

Overview

Text-AutoAugment (TAA)

This repository contains the code for our paper Text AutoAugment: Learning Compositional Augmentation Policy for Text Classification (EMNLP 2021 main conference).

Overview

  1. We present a learnable and compositional framework for data augmentation. Our proposed algorithm automatically searches for the optimal compositional policy, which improves the diversity and quality of augmented samples.

  2. In low-resource and class-imbalanced regimes of six benchmark datasets, TAA significantly improves the generalization ability of deep neural networks like BERT and effectively boosts text classification performance.
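To make the idea of a compositional policy more concrete, the sketch below chains a few text-editing operations, each applied with its own probability and magnitude. The operation names (`random_delete`, `random_swap`), the `(operation, probability, magnitude)` tuple layout, and the sample policy are illustrative assumptions, not the repository's actual search space or API.

```python
import random

def random_delete(words, magnitude, rng):
    """Drop each word with probability `magnitude`, keeping at least one word."""
    if not words:
        return []
    kept = [w for w in words if rng.random() >= magnitude]
    return kept if kept else [rng.choice(words)]

def random_swap(words, magnitude, rng):
    """Swap random pairs of words; the number of swaps grows with `magnitude`."""
    words = list(words)
    if len(words) < 2:
        return words
    n_swaps = max(1, int(magnitude * len(words)))
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

# Hypothetical operation registry, for illustration only.
OPS = {"random_delete": random_delete, "random_swap": random_swap}

def apply_policy(text, policy, seed=None):
    """Apply each (operation, probability, magnitude) step of `policy` in order."""
    rng = random.Random(seed)
    words = text.split()
    for name, prob, magnitude in policy:
        if words and rng.random() < prob:
            words = OPS[name](words, magnitude, rng)
    return " ".join(words)

# A made-up policy: delete with p=0.5 at strength 0.2, then swap with p=0.8 at strength 0.1.
example_policy = [("random_delete", 0.5, 0.2), ("random_swap", 0.8, 0.1)]
```

Calling `apply_policy("the movie was surprisingly good", example_policy, seed=0)` returns one augmented variant of the sentence. A search procedure would then compare many such policies by the validation performance of models trained on their output.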

Getting Started

  1. Prepare environment

    conda create -n taa python=3.6
    conda activate taa
    conda install pytorch torchvision cudatoolkit=10.0 -c https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch
    pip install -r requirements.txt 
    python -c "import nltk; nltk.download('wordnet'); nltk.download('averaged_perceptron_tagger')"
  2. Modify the dataroot parameter in confs/*.yaml and the abspath parameter in script/*.sh:

    • e.g., change dataroot: /home/renshuhuai/TextAutoAugment/data/aclImdb in confs/bert_imdb.yaml to dataroot: path-to-your-TextAutoAugment/data/aclImdb
    • change --abspath '/home/renshuhuai/TextAutoAugment' in script/imdb_lowresource.sh to --abspath 'path-to-your-TextAutoAugment'
  3. Search for the best augmentation policy, e.g., low-resource regime for IMDB:

    sh script/imdb_lowresource.sh

    Scripts for policy search in the low-resource and class-imbalanced regimes for all datasets are provided in the script/ folder.

  4. Train a model with the pre-searched policy in archive.py, e.g., in the low-resource regime for IMDB:

    python train.py -c confs/bert_imdb.yaml 

    Train a model on the full IMDB dataset:

    python train.py -c confs/bert_imdb.yaml --train-npc -1 --valid-npc -1 --test-npc -1  
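The path edit in step 2 has to be repeated for every config file, so it can help to script it. The following is a minimal sketch that assumes each config keeps its data path on a single `dataroot:` line, as in the `confs/bert_imdb.yaml` example above. The helper names `rewrite_dataroot` and `rewrite_confs` are hypothetical and are not part of this repository. The `--abspath` argument in script/*.sh still needs the analogous edit by hand.

```python
import re
from pathlib import Path

def rewrite_dataroot(yaml_text, old_prefix, new_prefix):
    """Replace `old_prefix` with `new_prefix` on every `dataroot:` line."""
    pattern = re.compile(
        r"^([ \t]*dataroot:[ \t]*)" + re.escape(old_prefix), re.MULTILINE
    )
    # A function replacement avoids backslash-escape surprises in Windows paths.
    return pattern.sub(lambda m: m.group(1) + new_prefix, yaml_text)

def rewrite_confs(conf_dir, old_prefix, new_prefix):
    """Rewrite all *.yaml files in `conf_dir`; return the names of changed files."""
    changed = []
    for path in sorted(Path(conf_dir).glob("*.yaml")):
        original = path.read_text(encoding="utf-8")
        updated = rewrite_dataroot(original, old_prefix, new_prefix)
        if updated != original:
            path.write_text(updated, encoding="utf-8")
            changed.append(path.name)
    return changed
```

For example, `rewrite_confs("confs", "/home/renshuhuai/TextAutoAugment", "path-to-your-TextAutoAugment")` would update every config whose `dataroot` starts with the original author's path and leave the others untouched.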

Contact

If you have any questions related to the code or the paper, feel free to email Shuhuai (renshuhuai007 [AT] gmail [DOT] com).

Acknowledgments

This code draws on fast-autoaugment.

Citation

If you find this code useful for your research, please consider citing:

@inproceedings{ren2021taa,
  title={Text AutoAugment: Learning Compositional Augmentation Policy for Text Classification},
  author={Shuhuai Ren and Jinchao Zhang and Lei Li and Xu Sun and Jie Zhou},
  booktitle={EMNLP},
  year={2021}
}

License

MIT

Comments
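  • How do I reproduce the code?

    Hello! I am a master's student and am very interested in your paper. I would like to reproduce the code for my research, but I have run into some obstacles and would be grateful if you could answer a few questions.

    1. The paper mentions that the experiments ran on 8 Tesla P40 GPUs. My setup is two 1080 GPUs with 16 GB of GPU memory and CUDA 10.2, and I configured the environment according to your README. Can this configuration run your code?

    2. The README says that reproduction mainly involves running reproduce_experiment.py, but execution fails at line 46 of that script with an error saying that the required file is missing under taa/models. I tried running search.py to generate the required policy file, but I do not know where to start, and the overall run-and-debug flow is unclear to me. If I want to fully reproduce the results, what steps should I follow, and in what order?

    I am still a beginner when it comes to code. Thank you very much for taking the time to answer!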

    opened by javanlu123 3
  • The return value in augmentation.py cannot be used as a data source

    Each transform function in augmentation.py should return a str, not the list that is generated by default. For example, change

        def random_word_delete(text, m):
            return aug.augment(text)

    to

        def random_word_delete(text, m):
            return aug.augment(text)[0]

    opened by mvllwong 2
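The fix proposed in this report indexes the result with `[0]`, which would break again if `aug.augment` returned a plain string. A more defensive variant is sketched below. The helper name `as_single_text` is hypothetical, and the assumption that the return type may be either a string or a list of strings is inferred from this report rather than confirmed by the repository.

```python
def as_single_text(augmented):
    """Return one string whether the augmenter produced a str or a list of str."""
    if isinstance(augmented, str):
        return augmented
    if isinstance(augmented, (list, tuple)) and augmented:
        return augmented[0]
    raise ValueError(f"Unexpected augmenter output: {augmented!r}")

# Intended use inside a transform function (not executed here):
#     def random_word_delete(text, m):
#         return as_single_text(aug.augment(text))
```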
  • module 'ray.tune' has no attribute 'trial_runner'

        Traceback (most recent call last):
          File "reproduce_experiment.py", line 10, in <module>
            from taa.search import get_path, search_policy, train_model_parallel
          File "/root/yxyanyi/xiaozhu/text-autoaugment-main/taa/search.py", line 51, in <module>
            patch = gorilla.Patch(ray.tune.trial_runner.TrialRunner, 'step', step_w_log, settings=gorilla.Settings(allow_hit=True))
        AttributeError: module 'ray.tune' has no attribute 'trial_runner'

    Hello, after setting up the environment as described in your GitHub repository, I ran reproduce_experiment.py directly and got this error. Any guidance would be appreciated.

    opened by caesar-jojo 1
  • Error in load_dataset when using a custom dataset

    Hello, when I use a custom dataset (in the same format as the example dataset) and run with the example config file, the load_dataset function raises an error. The details are as follows:

        Traceback (most recent call last):
          File "/opt/anaconda3/envs/forRL/lib/python3.8/site-packages/ray/tune/ray_trial_executor.py", line 901, in get_next_executor_event
            future_result = ray.get(ready_future)
          File "/opt/anaconda3/envs/forRL/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
            return func(*args, **kwargs)
          File "/opt/anaconda3/envs/forRL/lib/python3.8/site-packages/ray/worker.py", line 1809, in get
            raise value.as_instanceof_cause()
        ray.exceptions.RayTaskError(TuneError): ray::ImplicitFunc.train() (pid=51539, ip=10.10.25.1, repr=objective)
          File "/opt/anaconda3/envs/forRL/lib/python3.8/site-packages/ray/tune/trainable.py", line 349, in train
            result = self.step()
          File "/opt/anaconda3/envs/forRL/lib/python3.8/site-packages/ray/tune/function_runner.py", line 403, in step
            self._report_thread_runner_error(block=True)
          File "/opt/anaconda3/envs/forRL/lib/python3.8/site-packages/ray/tune/function_runner.py", line 567, in _report_thread_runner_error
            raise TuneError(
        ray.tune.error.TuneError: Trial raised an exception.
        Traceback:
        ray::ImplicitFunc.train() (pid=51539, ip=10.10.25.1, repr=objective)
          File "/opt/anaconda3/envs/forRL/lib/python3.8/site-packages/ray/tune/function_runner.py", line 272, in run
            self._entrypoint()
          File "/opt/anaconda3/envs/forRL/lib/python3.8/site-packages/ray/tune/function_runner.py", line 348, in entrypoint
            return self._trainable_func(
          File "/opt/anaconda3/envs/forRL/lib/python3.8/site-packages/ray/tune/function_runner.py", line 640, in _trainable_func
            output = fn()
          File "/mnt/zhzhang_hdd/implicit-hate-corpus/text_autoaugment/taa/search.py", line 85, in objective
            result = train_and_eval(config['tag'], policy_opt=True, save_path=save_path, only_eval=False)
          File "/mnt/zhzhang_hdd/implicit-hate-corpus/text_autoaugment/taa/train.py", line 49, in train_and_eval
            train_dataset, val_dataset, test_dataset = get_datasets(dataset_type, policy_opt=policy_opt)
          File "/mnt/zhzhang_hdd/implicit-hate-corpus/text_autoaugment/taa/data.py", line 58, in get_datasets
            test_dataset = load_dataset(path=path, name=dataset, data_dir=data_dir, data_files=data_files, split='test')
          File "/opt/anaconda3/envs/forRL/lib/python3.8/site-packages/datasets/load.py", line 1714, in load_dataset
            ds = builder_instance.as_dataset(split=split, ignore_verifications=ignore_verifications, in_memory=keep_in_memory)
          File "/opt/anaconda3/envs/forRL/lib/python3.8/site-packages/datasets/builder.py", line 763, in as_dataset
            datasets = utils.map_nested(
          File "/opt/anaconda3/envs/forRL/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 250, in map_nested
            return function(data_struct)
          File "/opt/anaconda3/envs/forRL/lib/python3.8/site-packages/datasets/builder.py", line 794, in _build_single_dataset
            ds = self._as_dataset(
          File "/opt/anaconda3/envs/forRL/lib/python3.8/site-packages/datasets/builder.py", line 862, in _as_dataset
            dataset_kwargs = ArrowReader(self._cache_dir, self.info).read(
          File "/opt/anaconda3/envs/forRL/lib/python3.8/site-packages/datasets/arrow_reader.py", line 211, in read
            files = self.get_file_instructions(name, instructions, split_infos)
          File "/opt/anaconda3/envs/forRL/lib/python3.8/site-packages/datasets/arrow_reader.py", line 184, in get_file_instructions
            file_instructions = make_file_instructions(
          File "/opt/anaconda3/envs/forRL/lib/python3.8/site-packages/datasets/arrow_reader.py", line 107, in make_file_instructions
            absolute_instructions = instruction.to_absolute(name2len)
          File "/opt/anaconda3/envs/forRL/lib/python3.8/site-packages/datasets/arrow_reader.py", line 618, in to_absolute
            return [_rel_to_abs_instr(rel_instr, name2len) for rel_instr in self._relative_instructions]
          File "/opt/anaconda3/envs/forRL/lib/python3.8/site-packages/datasets/arrow_reader.py", line 618, in <listcomp>
            return [_rel_to_abs_instr(rel_instr, name2len) for rel_instr in self._relative_instructions]
          File "/opt/anaconda3/envs/forRL/lib/python3.8/site-packages/datasets/arrow_reader.py", line 433, in _rel_to_abs_instr
            raise ValueError(f'Unknown split "{split}". Should be one of {list(name2len)}.')
        ValueError: Unknown split "test". Should be one of ['train'].

    Do I need to split the data in the data folder first? And how should I modify the config file? Thank you.

    opened by zzh-SJTU 1
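The final error in this report says that the loaded data exposes only a 'train' split while the code requests 'test'. One possible workaround, which the maintainers have not confirmed, is to split the data into separate train and test files before loading. The sketch below assumes a single CSV file with a header row. The function names and file layout are hypothetical, and how the config should point at the resulting files is not covered here.

```python
import csv
import random

def split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle rows deterministically and return (train_rows, test_rows)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    # Keep at least one test row whenever there is more than one row in total.
    n_test = max(1, int(len(rows) * test_fraction)) if len(rows) > 1 else 0
    return rows[n_test:], rows[:n_test]

def split_csv(src_path, train_path, test_path, test_fraction=0.2, seed=0):
    """Write train/test CSV files that both repeat the header of `src_path`."""
    with open(src_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)
    train, test = split_rows(rows, test_fraction, seed)
    for path, part in ((train_path, train), (test_path, test)):
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(part)
    return len(train), len(test)
```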
  • "Exception: This class is a singleton!"

    Hi there, when I call search_and_augment like this:

        augmented_train_dataset = search_and_augment(configfile="./text-autoaugment/taa/confs/bert_sst2_example.yaml")

    I get the error "Exception: This class is a singleton!". I did not modify anything in the file. How can I fix it?

    opened by isilberfin 1
  • Make requirements more relaxed

    When installing other libraries, for instance later versions of huggingface transformers or other augmentation libraries, the requirements are a bit too restrictive.

    I have tested the fork, and it still produces the expected output when running the training scripts in scripts/.

    I would like to use the framework, but with these restrictions both pip and poetry fail because of library version conflicts. @RenShuhuai-Andy @wolvecap

    opened by MarkusSagen 0
  • CVE-2007-4559 Patch

    Hi, we are security researchers from the Advanced Research Center at Trellix. We have begun a campaign to patch a widespread bug named CVE-2007-4559. CVE-2007-4559 is a 15-year-old bug in the Python tarfile package. When extract() or extractall() is called on a tarfile object without sanitizing input, a maliciously crafted .tar file can perform a directory path traversal attack. We found at least one unsanitized extractall() in your codebase and are providing a patch for you via pull request. The patch essentially checks whether all tarfile members will be extracted safely and throws an exception otherwise. We encourage you to use this patch or your own solution to secure against CVE-2007-4559. Further technical information about the vulnerability can be found in this blog.

    If you have further questions, you may contact us through this project's lead researcher, Kasimir Schulz.

    opened by TrellixVulnTeam 0
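The kind of check described in this report can be sketched with the standard library as follows. This is a minimal illustration of the general approach, not the patch from the pull request. The helper names `is_within_directory` and `safe_extractall` are made up for this example, and the check covers member paths only, not symbolic or hard links inside the archive. Newer Python versions also accept a `filter` argument to `extractall()` (for example `filter="data"`), which may be preferable where available.

```python
import os
import tarfile

def is_within_directory(directory, target):
    """Return True if `target` resolves to a location inside `directory`."""
    abs_directory = os.path.abspath(directory)
    abs_target = os.path.abspath(target)
    return os.path.commonpath([abs_directory, abs_target]) == abs_directory

def safe_extractall(tar, path="."):
    """Extract all members of `tar`, refusing any that would escape `path`."""
    for member in tar.getmembers():
        member_path = os.path.join(path, member.name)
        if not is_within_directory(path, member_path):
            raise ValueError(f"Blocked path traversal in tar member: {member.name}")
    tar.extractall(path)
```

A caller would replace `tar.extractall(dest)` with `safe_extractall(tar, dest)` on an already opened `tarfile.TarFile` object.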
  • Cannot find the file sst2_Bert_op2_policy4_n-aug1_ir0.02_taa.log

    FileNotFoundError: [Errno 2] No such file or directory: 'D:\anaconda\qass\text-autoaugment-main\text-autoaugment-main\examples\taa\models\sst2_Bert_op2_policy4_n-aug1_ir0.02_taa.log'

    This file cannot be found. What is it for?

    opened by gf52 10
Owner
LancoPKU
Language Computing and Machine Learning Group (Xu Sun's group) at Peking University