SAGE: Sensitivity-guided Adaptive Learning Rate for Transformers

Chen Liang

Last update: Nov 7, 2022

Related tags

Deep Learning transformers pruning redundancy generalization fine-tuning adaptive-learning-rate

Overview

This repo contains the code for the paper "No Parameters Left Behind: Sensitivity Guided Adaptive Learning Rate for Training Large Transformer Models" (ICLR 2022).

Getting Started

Pull and run the Docker image:
pytorch/pytorch:1.5.1-cuda10.1-cudnn7-devel
Install the requirements:
pip install -r requirements.txt

Data and Model

Download the data and pre-trained models:
./download.sh
Please refer to the GLUE benchmark site for details on the benchmark.
Preprocess the data:
./experiments/glue/prepro.sh
For the most up-to-date data processing details, please refer to the mt-dnn repo.

Fine-tuning Pre-trained Models using SAGE

We provide an example script for fine-tuning a pre-trained BERT-base model on MNLI using Adamax-SAGE:

./scripts/train_mnli_usadamax.sh GPUID

A few notes:

learning_rate and beta3 are two of the most important hyper-parameters. The learning_rate that works well for Adamax/AdamW-SAGE is usually 2 to 5 times larger than the one that works well for Adamax/AdamW, depending on the task. The beta3 that works well for Adamax/AdamW-SAGE is usually between 0.6 and 0.9, depending on the task.
To use AdamW-SAGE, set the argument --optim=usadamw. The current codebase only contains implementations of Adamax-SAGE and AdamW-SAGE; see module/bert_optim.py for details. Please refer to our paper for how to integrate SAGE into other optimizers.
To fine-tune a pre-trained RoBERTa-base model, set --init_checkpoint to the model path and --encoder_type to 2. Other supported models are listed in pretrained_models.py.
To fine-tune on other tasks, set --train_datasets and --test_datasets to the corresponding task names.
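
The guidance above can be turned into a small hyper-parameter grid plus a list of flag overrides. The Python sketch below is illustrative only. The helper names sage_search_grid and sage_overrides, the baseline learning rate, the checkpoint path, and the task name are hypothetical and not part of this repo; only the flag names and the value ranges come from the notes above. It also assumes every flag accepts the --flag=value form shown for --optim=usadamw.

```python
from itertools import product


def sage_search_grid(base_lr,
                     lr_multipliers=(2, 3, 4, 5),
                     beta3_values=(0.6, 0.7, 0.8, 0.9)):
    """Return candidate (learning_rate, beta3) pairs for a SAGE run.

    base_lr is a learning rate that already works for plain Adamax/AdamW.
    The defaults cover the ranges suggested in the notes: 2x to 5x the
    baseline learning rate, and beta3 between 0.6 and 0.9.
    """
    return [(base_lr * m, b3)
            for m, b3 in product(lr_multipliers, beta3_values)]


def sage_overrides(optim=None, init_checkpoint=None, encoder_type=None,
                   train_datasets=None, test_datasets=None):
    """Build a list of command-line overrides from the documented flags.

    Only flags that are explicitly given are emitted. The --flag=value
    form is an assumption based on the --optim=usadamw example.
    """
    options = [
        ("--optim", optim),
        ("--init_checkpoint", init_checkpoint),
        ("--encoder_type", encoder_type),
        ("--train_datasets", train_datasets),
        ("--test_datasets", test_datasets),
    ]
    return [f"{flag}={value}" for flag, value in options if value is not None]


if __name__ == "__main__":
    # Hypothetical baseline: a learning rate tuned for plain AdamW.
    for lr, beta3 in sage_search_grid(base_lr=5e-5):
        print(f"learning_rate={lr:g} beta3={beta3}")

    # Hypothetical RoBERTa-base run with AdamW-SAGE; path and task are placeholders.
    print(sage_overrides(optim="usadamw",
                         init_checkpoint="path/to/roberta_base_checkpoint",
                         encoder_type=2,
                         train_datasets="mnli",
                         test_datasets="mnli"))
```

How these overrides reach the training entry point depends on the script in use, so check scripts/train_mnli_usadamax.sh before wiring them in.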

Citation

@inproceedings{
liang2022no,
title={No Parameters Left Behind: Sensitivity Guided Adaptive Learning Rate for Training Large Transformer Models},
author={Chen Liang and Haoming Jiang and Simiao Zuo and Pengcheng He and Xiaodong Liu and Jianfeng Gao and Weizhu Chen and Tuo Zhao},
booktitle={International Conference on Learning Representations},
year={2022},
url={https://openreview.net/forum?id=cuvga_CiVND}
}

Contact Information

For help or issues related to this package, please submit a GitHub issue. For personal questions related to this paper, please contact Chen Liang ([email protected]).
