Implementation of paper "Towards a Unified View of Parameter-Efficient Transfer Learning"

Junxian He

Last update: Dec 29, 2022

Overview

A Unified Framework for Parameter-Efficient Transfer Learning

This is the official implementation of the paper:

Towards a Unified View of Parameter-Efficient Transfer Learning
Junxian He*, Chunting Zhou*, Xuezhe Ma, Taylor Berg-Kirkpatrick, Graham Neubig
Preprint 2021

Parameter-efficient transfer learning (PETL) methods tune only a small number of (extra) parameters to adapt large pretrained models to downstream tasks. This paper reveals the connections among existing PETL methods such as adapters, prefix tuning, and LoRA, and proposes a unified framework to interpret their designs. The framework can instantiate existing approaches by varying values along several defined design dimensions, and it also provides principled guidance for designing new PETL methods. In this repo, as in the paper, we include examples of how new state-of-the-art PETL methods can be derived from the unified framework.

Dependencies

This repo is a fork of the huggingface transformers repo (forked on June 23, 2021), and the code is tested on PyTorch 1.9.0. Please follow the instructions below to install dependencies after you set up PyTorch:

git clone git@github.com:jxhe/MAM-adapter.git
cd MAM-adapter

# install transformers from this repo
pip install -e .

# install other requirements
pip install datasets==1.11.0

# used to compute BLEU score for en-ro translation
git clone git@github.com:moses-smt/mosesdecoder.git

Usage

MAM-Adapter

Run the following command to reproduce the MAM-Adapter results in the paper on the XSum, en-ro translation, MNLI, or SST2 datasets:

bash exps/run_{xsum|en_ro|glue}.sh

We ran all the experiments with one A6000 or A100 GPU that has >=40GB GPU memory. If your GPU has less memory, you may need to reduce bsz (max_tokens_per_batch for en-ro) and increase gradient_steps in the scripts to match our effective batch size. To train on multiple GPUs with data parallelism, launch with python -m torch.distributed.launch --nproc_per_node {num_gpus}.
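The batch-size adjustment described above can be sketched as a small calculation. This is a minimal sketch, assuming the effective batch size is the product of the per-GPU batch size, the number of GPUs, and the number of gradient-accumulation steps. The function name and the example numbers are hypothetical and are not taken from the scripts.

```python
import math


def gradient_steps_for(target_effective_bsz: int, per_gpu_bsz: int, num_gpus: int = 1) -> int:
    """Return the smallest number of accumulation steps such that
    per_gpu_bsz * num_gpus * steps >= target_effective_bsz."""
    if per_gpu_bsz <= 0 or num_gpus <= 0:
        raise ValueError("per_gpu_bsz and num_gpus must be positive")
    return math.ceil(target_effective_bsz / (per_gpu_bsz * num_gpus))


# Hypothetical reference run: bsz=16 with gradient_steps=4 gives an effective batch size of 64.
reference_effective_bsz = 16 * 4

# A smaller GPU that only fits bsz=4 needs more accumulation steps.
print(gradient_steps_for(reference_effective_bsz, per_gpu_bsz=4))              # 16
# Spreading the same run over two GPUs halves the steps needed.
print(gradient_steps_for(reference_effective_bsz, per_gpu_bsz=4, num_gpus=2))  # 8
```

For en-ro the same reasoning would apply to max_tokens_per_batch, with the batch measured in tokens rather than examples.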

Training time: in our experiments that use one GPU, XSum takes 24 hours w/ A100 or 50 hours w/ A6000, en-ro takes 20 hours w/ A6000, SST2 takes 2 hours, and MNLI takes 10 hours.

Advanced Usage for Other PETL Variants

As the paper shows, our unified framework instantiates different PETL variants easily by varying values along the design dimensions. You can modify the scripts to train the other PETL variants studied in the paper. We include some examples in run_xsum.sh, which can be applied directly to the other scripts as well:

# ----- MAM adapter -----
attn_mode="prefix"
attn_option="concat"
attn_composition="add"
attn_bn=30  # attn bottleneck dim

ffn_mode="adapter"
ffn_option="parallel"
ffn_adapter_layernorm_option="none"
ffn_adapter_init_option="lora"
ffn_adapter_scalar="4"
ffn_bn=512 # ffn bottleneck dim

# ----- prefix tuning baseline ----- 
# attn_mode="prefix"
# attn_option="concat"
# attn_composition="add"
# attn_bn=200  # attn bottleneck dim

# ffn_mode="none"
# ffn_option="parallel"
# ffn_adapter_layernorm_option="none"
# ffn_adapter_init_option="lora"
# ffn_adapter_scalar="4"
# ffn_bn=512 # ffn bottleneck dim

# ----- Houlsby Adapter ----- 
# attn_mode="adapter"
# attn_option="sequential"
# attn_composition="add"
# attn_bn=200  # attn bottleneck dim

# ffn_mode="adapter"
# ffn_option="sequential"
# ffn_adapter_layernorm_option="none"
# ffn_adapter_init_option="bert"
# ffn_adapter_scalar="1"
# ffn_bn=200 # ffn bottleneck dim

# ----- FFN Scaled Parallel Adapter ----- 
# attn_mode="none"
# attn_option="parallel"
# attn_composition="add"
# attn_bn=200  # attn bottleneck dim

# ffn_mode="adapter"
# ffn_option="parallel"
# ffn_adapter_layernorm_option="none"
# ffn_adapter_init_option="lora"
# ffn_adapter_scalar="4"
# ffn_bn=512 # ffn bottleneck dim

There are more variations than those shown above. See petl/options.py for a complete explanation of these arguments. The results of all variants reported in the paper can be reproduced by changing these values in the scripts.
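As a reading aid, the sequential and parallel adapter settings in the examples above can be related to the update rules in the paper's unified view. Here x is the sublayer input, h is the sublayer output, r is the bottleneck dimension, and s is the scaling factor. Mapping ffn_bn to r and ffn_adapter_scalar to s is an assumption based on the argument names and the script comments, so please check petl/options.py before relying on it.

```latex
% Sequential adapter (ffn_option="sequential"): the adapter reads the sublayer output h
h \leftarrow h + \mathrm{ReLU}(h W_{\text{down}})\, W_{\text{up}}

% Scaled parallel adapter (ffn_option="parallel"): the adapter reads the sublayer input x
h \leftarrow h + s \cdot \mathrm{ReLU}(x W_{\text{down}})\, W_{\text{up}},
\qquad W_{\text{down}} \in \mathbb{R}^{d \times r},\quad W_{\text{up}} \in \mathbb{R}^{r \times d}
```

Under this reading, the "Houlsby Adapter" example uses the first form with s = 1 at both the attention and FFN sublayers, while the "FFN Scaled Parallel Adapter" example uses the second form at the FFN sublayer only.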

Citation

@article{he2021towards,
  title={Towards a Unified View of Parameter-Efficient Transfer Learning},
  author={He, Junxian and Zhou, Chunting and Ma, Xuezhe and Berg-Kirkpatrick, Taylor and Neubig, Graham},
  journal={arXiv preprint arXiv:2110.04366},
  year={2021}
}

Comments

[Bug] Adapter and LoRA for Roberta

Running the setting below on SST2 and MNLI:

attn_mode="adapter"
attn_option="sequential"
attn_composition="add"
attn_bn=200 # attn bottleneck dim

ffn_mode="adapter"
ffn_option="sequential"
ffn_adapter_layernorm_option="none"
ffn_adapter_init_option="bert"
ffn_adapter_scalar="1"
ffn_bn=200 # ffn bottleneck dim

Several errors were raised. It seems that some parameters, such as d_model and dropout, are set incorrectly in modeling_roberta.py.

I fixed them, but the log confuses me:

Houlsby adds adapters in two places: after self-attention and after the FFN. So why is an adapter added inside self-attention, and what is adapter_layer_norm_before.weight used for?

Thanks

opened by Albert-Ma 6

How does the initialization of Prefix affect the performance?

I would like to know how the initialization of Prefix affects the performance in the following two situations.

situation 1: the official implementation

class Prefix(nn.Module):
    def __init__(...):
        ...
class PETLEncModel(PretrainedModel):
    def __init__(...):
        ...
        self.pretrained_model = pretrained_model
        self.prompt_model = Prefix(...)
        ...

situation 2: my re-implementation

class Prefix(nn.Module):
    def __init__(...):
        ...
class PETLEncModel(RobertaPretrainedModel):
    def __init__(...):
        ...
        self.roberta = RobertaModel(config) # RobertaModel (with adapter) is copied from your implementation
        self.prompt_model = Prefix(...)
        self.init_weights() # how does init_weights() affect the Prefix initialization?

opened by Doragd 5

Question about your parameter budget

Hi,

Thank you so much for open-sourcing the code. I'm trying to use your MAM Adapter on Roberta and have a question about your parameter budget.

I'm following the config in exps/run_glue.sh but found the number of trainable parameters much larger than expected. More specifically, I use the following code snippet to calculate trainable parameters and the result is 46783200.

sum([param.nelement() if param.requires_grad else 0 for param in model.parameters()])
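A per-module breakdown can make a total like this easier to interpret. The sketch below groups trainable parameter counts by the leading components of each parameter name. The helper name trainable_by_group and the stand-in parameters are hypothetical. The helper only assumes objects that expose requires_grad and nelement(), as PyTorch parameters do.

```python
from collections import defaultdict
from types import SimpleNamespace


def trainable_by_group(named_params, depth=1):
    """Sum trainable parameter counts, grouped by the first `depth` name components."""
    totals = defaultdict(int)
    for name, param in named_params:
        if param.requires_grad:
            group = ".".join(name.split(".")[:depth])
            totals[group] += param.nelement()
    return dict(totals)


def fake_param(n, requires_grad):
    # Stand-in for a torch.nn.Parameter so the example runs without PyTorch.
    return SimpleNamespace(requires_grad=requires_grad, nelement=lambda: n)


demo = [
    ("pretrained_model.encoder.weight", fake_param(100, False)),
    ("pretrained_model.adapter.weight", fake_param(10, True)),
    ("prompt_model.wte.weight", fake_param(5, True)),
]
print(trainable_by_group(demo))  # {'pretrained_model': 10, 'prompt_model': 5}
```

With a real PyTorch model, the call would be trainable_by_group(model.named_parameters()).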

I investigated your code and found that, in the Prefix class in petl/petl_factory.py, the modules used to compute temp_dict['encoder_decoder'] and temp_dict['encoder'] are unnecessary for Roberta, so I excluded them. However, I still have 16000160 trainable parameters. The details are as follows (module - requires_grad - nelement):

pretrained_model.roberta.embeddings.word_embeddings.weight False 38603520
pretrained_model.roberta.embeddings.position_embeddings.weight False 394752
pretrained_model.roberta.embeddings.token_type_embeddings.weight False 768
pretrained_model.roberta.embeddings.LayerNorm.weight False 768
pretrained_model.roberta.embeddings.LayerNorm.bias False 768
pretrained_model.roberta.encoder.layer.0.attention.self.query.weight False 589824
pretrained_model.roberta.encoder.layer.0.attention.self.query.bias False 768
pretrained_model.roberta.encoder.layer.0.attention.self.value.weight False 589824
pretrained_model.roberta.encoder.layer.0.attention.self.value.bias False 768
pretrained_model.roberta.encoder.layer.0.attention.self.key.weight False 589824
pretrained_model.roberta.encoder.layer.0.attention.self.key.bias False 768
pretrained_model.roberta.encoder.layer.0.attention.output.dense.weight False 589824
pretrained_model.roberta.encoder.layer.0.attention.output.dense.bias False 768
pretrained_model.roberta.encoder.layer.0.attention.output.LayerNorm.weight False 768
pretrained_model.roberta.encoder.layer.0.attention.output.LayerNorm.bias False 768
pretrained_model.roberta.encoder.layer.0.intermediate.dense.weight False 2359296
pretrained_model.roberta.encoder.layer.0.intermediate.dense.bias False 3072
pretrained_model.roberta.encoder.layer.0.output.dense.weight False 2359296
pretrained_model.roberta.encoder.layer.0.output.dense.bias False 768
pretrained_model.roberta.encoder.layer.0.output.LayerNorm.weight False 768
pretrained_model.roberta.encoder.layer.0.output.LayerNorm.bias False 768
pretrained_model.roberta.encoder.layer.0.output.ef_ffn_adapter.down_proj.weight True 12288
pretrained_model.roberta.encoder.layer.0.output.ef_ffn_adapter.down_proj.bias True 16
pretrained_model.roberta.encoder.layer.0.output.ef_ffn_adapter.up_proj.weight True 12288
pretrained_model.roberta.encoder.layer.0.output.ef_ffn_adapter.up_proj.bias True 768
pretrained_model.roberta.encoder.layer.0.ef_ffn_adapter.down_proj.weight True 12288
pretrained_model.roberta.encoder.layer.0.ef_ffn_adapter.down_proj.bias True 16
pretrained_model.roberta.encoder.layer.0.ef_ffn_adapter.up_proj.weight True 12288
pretrained_model.roberta.encoder.layer.0.ef_ffn_adapter.up_proj.bias True 768

...... (layer.1 - layer.11 are the same as layer.0)

pretrained_model.classifier.dense.weight False 589824
pretrained_model.classifier.dense.bias False 768
pretrained_model.classifier.out_proj.weight False 2304
pretrained_model.classifier.out_proj.bias False 3
prompt_model.wte.weight True 12288
prompt_model.control_trans.0.weight True 614400
prompt_model.control_trans.0.bias True 800
prompt_model.control_trans.2.weight True 14745600
prompt_model.control_trans.2.bias True 18432

Your paper says that the MAM Adapter only uses 0.5% of the parameters of the original LM, so I'm wondering whether I misunderstood something or did something wrong.
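The reported total can be checked against the listing with a little arithmetic. This is a sketch that only reproduces the numbers shown above. The dimension names are labels chosen here, and the prefix length of 16 and the two-layer shape of control_trans are inferred from the listed tensor sizes rather than from the code. It does not say which of these parameters the paper counts toward its budget.

```python
hidden = 768      # size of the 768-element biases in the listing
num_layers = 12   # layer.0 through layer.11
adapter_bn = 16   # down_proj.bias has 16 elements
prefix_len = 16   # inferred: prompt_model.wte.weight has 12288 = 16 * 768 elements
mid_dim = 800     # control_trans.0.bias has 800 elements


def linear_params(d_in, d_out):
    """Weight plus bias count of a dense layer."""
    return d_in * d_out + d_out


# One adapter = down projection + up projection (12288 + 16 + 12288 + 768).
adapter = linear_params(hidden, adapter_bn) + linear_params(adapter_bn, hidden)
# The listing shows two ef_ffn_adapter entries per layer.
adapters_total = 2 * num_layers * adapter

prefix_embedding = prefix_len * hidden
# control_trans.0 maps hidden -> mid_dim; control_trans.2 maps mid_dim -> 2 * num_layers * hidden.
prefix_mlp = linear_params(hidden, mid_dim) + linear_params(mid_dim, 2 * num_layers * hidden)

total = adapters_total + prefix_embedding + prefix_mlp
print(adapters_total)               # 608640
print(prefix_mlp)                   # 15379232
print(total)                        # 16000160
print(f"{prefix_mlp / total:.1%}")  # 96.1%
```

By this count, the two control_trans layers account for nearly all of the 16000160 trainable parameters, while the adapters contribute 608640.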

Thanks for your time and your help!

opened by shaoyijia 4

Values for the bottleneck dims and max_tokens_per_batch in paper
Hi, thanks for the great work! Can I ask two questions regarding the implementation details in the paper?

(1) What are the bottleneck dims (attn_bn and ffn_bn) for the prefix tuning and adapter variants used in Fig. 4 of the main paper?

(2) For en-ro (as in exps/run_en_ro.sh), did you use the default max_tokens_per_batch=4096 for all bottleneck dims? If not, could you share the max_tokens_per_batch values you used?

Thanks in advance for your help!
opened by KMnP 2

`Prefix_Adapter` and `MHAdapter_Layer` in `petl/petl_factory.py` seem to be unused

"Adapter" is not defined.

class Prefix_Adapter(nn.Module):
    def __init__(self, args, config):
        super().__init__()
        self.prefix = Prefix(args, config)
        self.adapters = Adapter(args, config) # "Adapter" is not defined 

    def forward(self, bsz, nsamples=1, device="cuda"):
        prefix = self.prefix(bsz, nsamples, device)
        adapters = self.adapters(bsz)

        for ii, dic in enumerate(adapters):
            for key, value in dic.items():
                prefix[ii][key] = value
        return prefix

opened by Doragd 2

The implementation details of Pfeiffer Adapter for MT task

Hi,

You implemented the Pfeiffer Adapter as one of your baselines for MT. Could you point me to the relevant code?

The original Pfeiffer adapter is implemented with post-norm, but MBART is trained with pre-norm. I wonder how I should use the residual connection and layernorm.

opened by BaohaoLiao 0
Prefix tuning with "gated add" for Roberta

Hi, thanks for publishing the paper and sharing the source code.

I found that "attn_output" is never used after it is defined.
When tuning Roberta with the paper's version of prefix tuning, it does not seem to work properly.

Could you please check it out?

https://github.com/jxhe/unify-parameter-efficient-tuning/blob/3222ce2c0079566a28043e22380eb4ab6ad14389/src/transformers/models/roberta/modeling_roberta.py#L391-L392

Thanks!

opened by GeondoPark 0
Question about the training curve

Hi, thanks for sharing the source code.

Could you please share the training log file, i.e., log.txt, with me? I ran into some training problems, and the loss decreases very slowly, so I am wondering what the curve is supposed to look like. I need the logs for both the XSum and WMT16 En-Ro experiments.

Thanks in advance for your help.

opened by speedcell4 0
The instantiation of Multi-head PA and the design choice of MAM adapter.

Thanks for your great work! I have read your paper, but I am a bit confused about two things.

(1) The instantiation of Multi-head PA. How can we instantiate Multi-head PA (r=30) to make it have the same quantity of tuned parameters as PA (attn, r=30) according to Table 4 in the main paper? My initial thought is that Multi-head PA's tuned parameters will be N_h times those of PA.

(2) The design choice of MAM adapter. According to my understanding, MH PA (attn, r = 30) is slightly better than prefix tuning (l = 30) based on the result in Table 4 (35.3>35.2), and according to previous papers like LoRA, prefix tuning is not stable to optimize. However, MAM adopts prefix tuning. Is there a specific reason for this?

Would you mind giving me any clues about these two questions?

opened by JacobYuan7 3

Owner

Junxian He

NLP/ML PhD student at CMU
