Official PyTorch Implementation of Length-Adaptive Transformer (ACL 2021)

Clova AI Research

Last update: Dec 28, 2022

Overview

Length-Adaptive Transformer

This is the official PyTorch implementation of Length-Adaptive Transformer. For detailed information about the method, please refer to our paper.

Our code is based on HuggingFace's ( 🤗 ) Transformers library. Currently, it supports only a limited set of transformers (BERT and DistilBERT) and downstream tasks (SQuAD 1.1 and the GLUE benchmark). We will extend it one by one to support other transformers and tasks. In the meantime, you can apply our method to other use cases yourself.

Getting Started

Requirements

Python 3
PyTorch
🤗 Transformers
torchprofile (to measure FLOPs)
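
Before running the scripts, it can help to confirm that the packages above are importable. The following is a minimal sketch, not part of the repository; the import names (`torch`, `transformers`, `torchprofile`) are assumed to be the usual ones for these packages.

```python
import importlib.util

# Assumed import names for the requirements listed above.
REQUIRED_MODULES = ("torch", "transformers", "torchprofile")


def find_missing(modules=REQUIRED_MODULES):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing()
    print("missing:", ", ".join(missing) if missing else "none")
```

The check only looks for the modules; it does not verify that the installed versions are compatible with the code.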

Dataset Preparation

SQuAD 1.1: Download the following files into a $SQUAD_DIR directory: train-v1.1.json, dev-v1.1.json, and evaluate-v1.1.py.
GLUE: Run the GLUE data download script to download the data into a $GLUE_DIR directory.
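
As a quick sanity check of the SQuAD layout described above, a small helper can report which of the three files are still missing. This is an illustrative sketch: the function name `missing_squad_files` is hypothetical, and only the file names come from this README.

```python
import os
from pathlib import Path

# File names taken from the dataset preparation step above.
SQUAD_FILES = ("train-v1.1.json", "dev-v1.1.json", "evaluate-v1.1.py")


def missing_squad_files(squad_dir):
    """Return the expected SQuAD 1.1 files that are not present in squad_dir."""
    root = Path(squad_dir)
    return [name for name in SQUAD_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    # Falls back to the current directory if SQUAD_DIR is not set.
    squad_dir = os.environ.get("SQUAD_DIR", ".")
    print("missing in", squad_dir, ":", missing_squad_files(squad_dir))
```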

(Standard) Finetuning pretrained transformer

For SQuAD 1.1, use run_squad.py, which is slightly modified from 🤗 Transformers' question-answering example.

python run_squad.py \
  --model_type bert \
  --model_name_or_path bert-base-uncased \
  --do_train \
  --do_eval \
  --evaluate_during_training \
  --save_only_best \
  --do_lower_case \
  --data_dir $SQUAD_DIR \
  --train_file train-v1.1.json \
  --predict_file dev-v1.1.json \
  --per_gpu_train_batch_size 32 \
  --per_gpu_eval_batch_size 32 \
  --learning_rate 5e-5 \
  --num_train_epochs 3.0 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir $SQUAD_OUTPUT_DIR/standard

For GLUE, use run_glue.py, which is slightly modified from 🤗 Transformers' text-classification example.

python run_glue.py \
  --model_name_or_path bert-base-cased \
  --task_name $TASK_NAME \
  --do_train \
  --do_eval \
  --data_dir $GLUE_DIR/$TASK_NAME \
  --max_seq_length 128 \
  --per_device_train_batch_size 32 \
  --per_device_eval_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 3.0 \
  --output_dir $GLUE_OUTPUT_DIR/$TASK_NAME/standard

Training with LengthDrop

Starting from a checkpoint finetuned without Drop-and-Restore, continue finetuning for additional steps with Drop-and-Restore and LengthDrop.
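
To give an intuition for what a length configuration is, here is a minimal sketch of how one might be sampled. It assumes that each layer keeps a number of tokens drawn uniformly between (1 - length_drop_ratio_bound) times the previous layer's length and that length, and that each update uses the full model, `num_sandwich` random configurations, and the smallest configuration. These assumptions follow our reading of the paper; the function names are hypothetical and the actual code in run_squad.py and run_glue.py may round or sample differently.

```python
import random


def sample_length_configuration(max_seq_length, num_layers,
                                length_drop_ratio_bound, rng=random):
    """Sample one tuple of per-layer sequence lengths (assumed sampling rule)."""
    lengths = []
    current = max_seq_length
    for _ in range(num_layers):
        # Lowest length allowed for this layer, never below one token.
        lower = max(1, int((1.0 - length_drop_ratio_bound) * current))
        current = rng.randint(lower, current)
        lengths.append(current)
    return tuple(lengths)


def sandwich_configurations(max_seq_length, num_layers,
                            length_drop_ratio_bound, num_sandwich, rng=random):
    """Return [largest, num_sandwich random samples, smallest] configurations."""
    largest = (max_seq_length,) * num_layers  # no tokens dropped
    smallest = []
    current = max_seq_length
    for _ in range(num_layers):
        # Always take the lower bound to get the smallest configuration.
        current = max(1, int((1.0 - length_drop_ratio_bound) * current))
        smallest.append(current)
    sampled = [
        sample_length_configuration(max_seq_length, num_layers,
                                    length_drop_ratio_bound, rng)
        for _ in range(num_sandwich)
    ]
    return [largest] + sampled + [tuple(smallest)]


if __name__ == "__main__":
    # Values mirror the SQuAD command in this section; 12 layers assumes BERT-base.
    for config in sandwich_configurations(384, 12, 0.2, 2, random.Random(0)):
        print(config)
```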

python run_squad.py \
  --model_type bert \
  --model_name_or_path $SQUAD_OUTPUT_DIR/standard/checkpoint-best \
  --do_train \
  --do_eval \
  --evaluate_during_training \
  --save_only_best \
  --do_lower_case \
  --data_dir $SQUAD_DIR \
  --train_file train-v1.1.json \
  --predict_file dev-v1.1.json \
  --per_gpu_train_batch_size 32 \
  --per_gpu_eval_batch_size 32 \
  --learning_rate 5e-5 \
  --num_train_epochs 5.0 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir $SQUAD_OUTPUT_DIR/length_adaptive \
  --length_adaptive \
  --num_sandwich 2 \
  --length_drop_ratio_bound 0.2 \
  --layer_dropout_prob 0.2

python run_glue.py \
  --model_name_or_path $GLUE_OUTPUT_DIR/$TASK_NAME/standard/checkpoint-best \
  --task_name $TASK_NAME \
  --do_train \
  --do_eval \
  --data_dir $GLUE_DIR/$TASK_NAME \
  --max_seq_length 128 \
  --per_device_train_batch_size 32 \
  --per_device_eval_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 5.0 \
  --output_dir $GLUE_OUTPUT_DIR/$TASK_NAME/length_adaptive \
  --length_adaptive \
  --num_sandwich 2 \
  --length_drop_ratio_bound 0.2 \
  --layer_dropout_prob 0.2

Evolutionary Search of Length Configurations

After training with LengthDrop, perform an evolutionary search to find length configurations for anytime prediction.
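
The search loop can be pictured with the toy sketch below. It is not the repository's implementation: it assumes that mutation resamples some layers' lengths, that crossover averages two parents layer by layer, and that parents are drawn from the current Pareto frontier of (cost, score) pairs. `toy_evaluate` and `mutation_prob` are hypothetical stand-ins; in practice the cost would be FLOPs and the score would be task accuracy or F1.

```python
import random


def pareto_front(evaluated):
    """Keep configurations that no cheaper-or-equal configuration beats on score."""
    front = {}
    best_score = float("-inf")
    ordered = sorted(evaluated.items(), key=lambda kv: (kv[1][0], -kv[1][1]))
    for config, (cost, score) in ordered:
        if score > best_score:
            front[config] = (cost, score)
            best_score = score
    return front


def mutate(config, max_seq_length, mutation_prob, rng):
    """Resample some layers while keeping lengths non-increasing across layers."""
    child = []
    prev = max_seq_length
    for length in config:
        if rng.random() < mutation_prob:
            length = rng.randint(1, prev)
        else:
            length = min(length, prev)
        child.append(length)
        prev = length
    return tuple(child)


def crossover(parent_a, parent_b):
    """Average two parents layer by layer (stays non-increasing)."""
    return tuple((a + b) // 2 for a, b in zip(parent_a, parent_b))


def evolutionary_search(evaluate, max_seq_length, num_layers, evo_iter,
                        mutation_size, crossover_size, mutation_prob=0.5, seed=0):
    rng = random.Random(seed)
    full = (max_seq_length,) * num_layers
    evaluated = {full: evaluate(full)}
    for _ in range(evo_iter):
        parents = list(pareto_front(evaluated))
        children = [mutate(rng.choice(parents), max_seq_length, mutation_prob, rng)
                    for _ in range(mutation_size)]
        if len(parents) >= 2:
            for _ in range(crossover_size):
                children.append(crossover(*rng.sample(parents, 2)))
        for child in children:
            if child not in evaluated:
                evaluated[child] = evaluate(child)
    return pareto_front(evaluated)


def toy_evaluate(config):
    """Hypothetical stand-in: cost grows with length, score has diminishing returns."""
    return sum(config), sum(length ** 0.5 for length in config)


if __name__ == "__main__":
    front = evolutionary_search(toy_evaluate, 128, 6, evo_iter=5,
                                mutation_size=10, crossover_size=10)
    for config, (cost, score) in front.items():
        print(config, cost, round(score, 2))
```

Each entry of the returned frontier is a length configuration together with its cost and score, so a configuration can be picked to match a given compute budget.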

python run_squad.py \
  --model_type bert \
  --model_name_or_path $SQUAD_OUTPUT_DIR/length_adaptive/checkpoint-best \
  --do_search \
  --do_lower_case \
  --data_dir $SQUAD_DIR \
  --train_file train-v1.1.json \
  --predict_file dev-v1.1.json \
  --per_gpu_eval_batch_size 32 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir $SQUAD_OUTPUT_DIR/evolutionary_search \
  --evo_iter 30 \
  --mutation_size 30 \
  --crossover_size 30

python run_glue.py \
  --model_name_or_path $GLUE_OUTPUT_DIR/$TASK_NAME/length_adaptive/checkpoint-best \
  --task_name $TASK_NAME \
  --do_search \
  --data_dir $GLUE_DIR/$TASK_NAME \
  --max_seq_length 128 \
  --per_device_eval_batch_size 32 \
  --output_dir $GLUE_OUTPUT_DIR/$TASK_NAME/evolutionary_search \
  --evo_iter 30 \
  --mutation_size 30 \
  --crossover_size 30

License

Copyright 2020-present NAVER Corp.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

Comments

Batch size for reproduction

Hello,

How many GPUs did you use during training? I am trying to reach F1 = 88.5 on SQuAD 1.1 with the parameters provided in the README, but I reach 88.3 at most.

Thanks, Shira

opened by shira-g 2

Pareto curves clarifications

Hello,

Can you please clarify the units used in the paper to measure FLOPs (Figure 2 in the paper)? I tried to reproduce the results. For example, when measuring MACs (using torchprofile) for the bert-standard model, I get the value 35340779904, and I am not sure how to convert it in order to create the Pareto curve and compare it with the search results. In addition, can you please share how you constructed the sequence lengths for each of the graph points of the standard and length-adaptive models?

Thank you, Shira

opened by shira-g 2

slimmable argument

Hello,

I tried to reproduce your results with the commands suggested in the README file, and so far I have not been able to. I notice that you refer to the 'slimmable' arg, but it is not set or initialized anywhere in the code. Is that a mistake? As far as I can tell, the sampled length configuration is only used during training when the 'slimmable' flag is set, and without it only a 'standard' training is executed.

I would appreciate any help, thank you.

opened by shira-g 2

Which version of transformers did you use

I tried transformers 2.11.0, 3.1.0, and 4.3.0, but none of them worked; import errors occurred every time.
version 2.11.0: ModuleNotFoundError: No module named 'transformers.modeling_outputs'
version 4.3.0: ModuleNotFoundError: No module named 'transformers.modeling_bert' ....

opened by cin-hubert 1

Loading the optimizer before training length adaptive

Hello,

I am trying to apply your method to my own model (trained using different code) instead of bert-base, and I got a good F1 result using your code to train the model with length-adaptive. However, I noticed that your code loads the saved optimizer and scheduler states of the provided model. Since I could not think of a reason to load the optimizer of a fine-tuned model before training it with length-adaptive, I removed the saved optimizer, and actually got a lower result. Is there an explanation for this behaviour that you are aware of? I also noticed that you commented out the lines in your code that save the optimizer (while the code still loads a pre-saved optimizer), and I wonder if that was intentional.

Regards and thank you, Shira

opened by shira-g 0


