Official PyTorch Implementation of SSMix (Findings of ACL 2021)

Clova AI Research

Last update: Dec 27, 2022

Related tags

Deep Learning ssmix

Overview

SSMix: Saliency-based Span Mixup for Text Classification (Findings of ACL 2021)

Official PyTorch Implementation of SSMix | Paper

Abstract

Data augmentation with mixup has shown to be effective on various computer vision tasks. Despite its great success, there has been a hurdle to apply mixup to NLP tasks since text consists of discrete tokens with variable length. In this work, we propose SSMix, a novel mixup method where the operation is performed on input text rather than on hidden vectors like previous approaches. SSMix synthesizes a sentence while preserving the locality of two original texts by span-based mixing and keeping more tokens related to the prediction relying on saliency information. With extensive experiments, we empirically validate that our method outperforms hidden-level mixup methods on the wide range of text classification benchmarks, including textual entailment, sentiment classification, and question-type classification.

Code Structure

|__ augmentation/ --> augmentation methods by method type
    |__ __init__.py --> wrapper for all augmentation methods. Contains metric used for single & paired sentence tasks
    |__ saliency.py --> Calculates saliency by L2 norm gradient backpropagation
    |__ ssmix.py --> Output ssmix sentence with options such as no span and no saliency given two input sentence with additional information
    |__ unk.py --> Output randomly replaced unk sentence 
|__ read_data/ --> Module used for loading data
    |__ __init__.py --> wrapper function for getting data split by train and valid depending on dataset type
    |__  dataset.py --> Class to get NLU dataset
    |__ preprocess.py --> preprocessor that makes input, label, and accuracy metric depending on dataset type
|__ trainer.py --> Code that does actual training 
|__ run_train.py --> Load hyperparameter, initiate training, pipeline
|__ classifiation_model.py -> Augmented from huggingface modeling_bert.py. Define BERT architectures that can handle multiple inputs for Tmix

Part of code is modified from the MixText implementation.

Getting Started

pip install -r requirements.txt

Code is runnable on both CPU and GPU, but we highly recommended to run on GPU. Strictly following the versions specified in the requirements.txt file is desirable to sucessfully execute our code without errors.

Model Training

python run_train.py --batch_size ${BSZ} --seed ${SEED} --dataset {DATASET} --optimizer_lr ${LR} ${MODE}

For all our experiments, we use 32 as the batch size (BSZ), and perform five different runs by changing the seed (SEED) from 0 to 4. We experiment on a wide range of text classifiction datasets (DATASET): 'sst2', 'qqp', 'mnli', 'qnli', 'rte', 'mrpc', 'trec-coarse', 'trec-fine', 'anli'. You should set --anli_round argument to one of 1, 2, 3 for the ANLI dataset.

Once you run the code, trained checkpoints are created under checkpoints directory. To train a model without mixup, you have to set MODE to 'normal'. To run with mixup approaches including our SSMix, you should set MODE as the name of the mixup method ('ssmix', 'tmix', 'embedmix', 'unk'). We load the checkpoint trained without mixup before training with mixup. We use 5e-5 for the normal mode and 1e-5 for mixup methods as the learning rate (LR).

You can modify the argument values (e.g., embed_alpha, hidden_alpha, etc) to adjust to your training hyperparameter needs. For ablation study of SSMix, you can exclude salieny constraint (--ss_no_saliency) or span constraint (--ss_no_span). Type python run_train.py --help or check run_train.py to see the full list of available hyperparameters. For debugging or analysis, you can turn on verbose options (--verbose and --verbose_show_augment_example).

License

Copyright 2021-present NAVER Corp.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

Code for our ACL 2021 paper - ConSERT: A Contrastive Framework for Self-Supervised Sentence Representation Transfer

ConSERT Code for our ACL 2021 paper - ConSERT: A Contrastive Framework for Self-Supervised Sentence Representation Transfer Requirements torch==1.6.0

478 Dec 25, 2022

Codes for ACL-IJCNLP 2021 Paper "Zero-shot Fact Verification by Claim Generation"

Zero-shot-Fact-Verification-by-Claim-Generation This repository contains code and models for the paper: Zero-shot Fact Verification by Claim Generatio

47 Jan 1, 2023

Code for our paper "SimCLS: A Simple Framework for Contrastive Learning of Abstractive Summarization", ACL 2021

SimCLS Code for our paper: "SimCLS: A Simple Framework for Contrastive Learning of Abstractive Summarization", ACL 2021 1. How to Install Requirements

150 Dec 12, 2022

Code and data of the ACL 2021 paper: Few-Shot Text Ranking with Meta Adapted Synthetic Weak Supervision

MetaAdaptRank This repository provides the implementation of meta-learning to reweight synthetic weak supervision data described in the paper Few-Shot

5 Jun 16, 2022

Code for our ACL 2021 paper "One2Set: Generating Diverse Keyphrases as a Set"

One2Set This repository contains the code for our ACL 2021 paper “One2Set: Generating Diverse Keyphrases as a Set”. Our implementation is built on the

63 Jan 5, 2023

code associated with ACL 2021 DExperts paper

DExperts Hi! This repository contains code for the paper DExperts: Decoding-Time Controlled Text Generation with Experts and Anti-Experts to appear at

68 Dec 15, 2022

The source codes for ACL 2021 paper 'BoB: BERT Over BERT for Training Persona-based Dialogue Models from Limited Personalized Data'

BoB: BERT Over BERT for Training Persona-based Dialogue Models from Limited Personalized Data This repository provides the implementation details for

124 Dec 27, 2022

NeuralWOZ: Learning to Collect Task-Oriented Dialogue via Model-based Simulation (ACL-IJCNLP 2021)

NeuralWOZ This code is official implementation of "NeuralWOZ: Learning to Collect Task-Oriented Dialogue via Model-based Simulation". Sungdong Kim, Mi

31 Oct 25, 2022

Data and Code for ACL 2021 Paper "Inter-GPS: Interpretable Geometry Problem Solving with Formal Language and Symbolic Reasoning"

Introduction Code and data for ACL 2021 Paper "Inter-GPS: Interpretable Geometry Problem Solving with Formal Language and Symbolic Reasoning". We cons

81 Dec 27, 2022

Comments

datasets

When i use the following cmd

python run_train.py --batch_size 32 --seed 12 --dataset mnli --optimizer_lr 0.001 ssmix

got a mistake
File "/home/SuXiangDong/HuoZhiqiang_82/Experiment/ssmix/read_data/preprocess.py", line 103, in get_metric metric = load_metric(r"./dataset/glue", self.task_name) File "/home/SuXiangDong/HuoZhiqiang_82/anaconda3/envs/ssmix/lib/python3.8/site-packages/datasets/load.py", line 613, in load_metric metric = metric_cls( TypeError: 'NoneType' object is not callable

Have you ever come across this mistake? My environment same to your requirments.txt

opened by zhiqiangohuo 1
Bug for saliency
Why use version high to use pretrained vinai/phobert-base in vietnameses then saliency embedding dimension(16, num_labels) not dimensions(16,max_sequence_length, hidden_size) with batch_size = 32

Can u fix saliency for version transformers high , thanks u so much
opened by batman-do 0

Official PyTorch Implementation of SSMix (Findings of ACL 2021)

Related tags

Overview

SSMix: Saliency-based Span Mixup for Text Classification (Findings of ACL 2021)

Abstract

Code Structure

Getting Started

Model Training

License

You might also like...

Code for our ACL 2021 paper - ConSERT: A Contrastive Framework for Self-Supervised Sentence Representation Transfer

Codes for ACL-IJCNLP 2021 Paper "Zero-shot Fact Verification by Claim Generation"

Code for our paper "SimCLS: A Simple Framework for Contrastive Learning of Abstractive Summarization", ACL 2021

Code and data of the ACL 2021 paper: Few-Shot Text Ranking with Meta Adapted Synthetic Weak Supervision

Code for our ACL 2021 paper "One2Set: Generating Diverse Keyphrases as a Set"

code associated with ACL 2021 DExperts paper

The source codes for ACL 2021 paper 'BoB: BERT Over BERT for Training Persona-based Dialogue Models from Limited Personalized Data'

NeuralWOZ: Learning to Collect Task-Oriented Dialogue via Model-based Simulation (ACL-IJCNLP 2021)

Data and Code for ACL 2021 Paper "Inter-GPS: Interpretable Geometry Problem Solving with Formal Language and Symbolic Reasoning"

Comments

datasets

Bug for saliency

Owner

Clova AI Research

This repository contains the code, data, and models of the paper titled "XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages" published in Findings of the Association for Computational Linguistics: ACL 2021.

Pytorch implementation for the EMNLP 2020 (Findings) paper: Connecting the Dots: A Knowledgeable Path Generator for Commonsense Question Answering

🌈 PyTorch Implementation for EMNLP'21 Findings "Reasoning Visual Dialog with Sparse Graph Learning and Knowledge Transfer"

The official implementation for ACL 2021 "Challenges in Information Seeking QA: Unanswerable Questions and Paragraph Retrieval".

Data augmentation for NLP, accepted at EMNLP 2021 Findings

PyTorch implementation for ACL 2021 paper "Maria: A Visual Experience Powered Conversational Agent".

A sample pytorch Implementation of ACL 2021 research paper "Learning Span-Level Interactions for Aspect Sentiment Triplet Extraction".

Learning the Beauty in Songs: Neural Singing Voice Beautifier; ACL 2022 (Main conference); Official code

[NAACL & ACL 2021] SapBERT: Self-alignment pretraining for BERT.

PIGLeT: Language Grounding Through Neuro-Symbolic Interaction in a 3D World [ACL 2021]