Implementation of paper "Towards a Unified View of Parameter-Efficient Transfer Learning"

Overview

A Unified Framework for Parameter-Efficient Transfer Learning

This is the official implementation of the paper:

Towards a Unified View of Parameter-Efficient Transfer Learning
Junxian He*, Chunting Zhou*, Xuezhe Ma, Taylor Berg-Kirkpatrick, Graham Neubig
Preprint 2021

Parameter-efficient transfer learning (PETL) methods tune only a small number of (extra) parameters to adapt large pretrained models to downstream tasks. This paper reveals the connections among existing PETL methods such as adapters, prefix tuning, and LoRA, and proposes a unified framework to interpret their designs. The unified framework can instantiate existing approaches by varying values along several defined design dimensions, and it also provides principled guidance for designing new PETL methods. In this repo, as in the paper, we include examples of how new state-of-the-art PETL methods can be easily derived from the unified framework.


Dependencies

This repo is a fork of the Hugging Face transformers repo (forked on June 23, 2021), and the code is tested with PyTorch 1.9.0. After setting up PyTorch, follow the instructions below to install the remaining dependencies:

git clone git@github.com:jxhe/MAM-adapter.git
cd MAM-adapter

# install transformers from this repo
pip install -e .

# install other requirements
pip install datasets==1.11.0

# used to compute BLEU score for en-ro translation
git clone git@github.com:moses-smt/mosesdecoder.git
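
For reference, the en-ro evaluation relies on Moses' multi-bleu.perl script from the repository cloned above. Below is a minimal scoring sketch; ref.txt and hyp.txt are hypothetical placeholders for your detokenized reference and system output files, not files produced by this repo:

# score a detokenized hypothesis file against a reference with Moses' multi-bleu.perl
perl mosesdecoder/scripts/generic/multi-bleu.perl ref.txt < hyp.txt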

Usage

MAM-Adapter

Run the following command to reproduce the MAM-Adapter results in the paper on the XSum, en-ro translation, MNLI, or SST2 datasets:

bash exps/run_{xsum|en_ro|glue}.sh

We ran all experiments on a single A6000 or A100 GPU with at least 40GB of memory. If your GPU has less memory, you may need to reduce bsz (max_tokens_per_batch for en-ro) and increase gradient_steps in the scripts to match our effective batch size. You can also train with multiple GPUs using python -m torch.distributed.launch --nproc_per_node {num_gpus} to enable data parallelism, as sketched below.
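
A rough sketch of such a multi-GPU launch (train.py and NUM_GPUS are placeholders, not names from this repo; the actual Python entrypoint and its flags are the ones invoked inside the exps/*.sh scripts):

# sketch only: wrap the Python training entrypoint with torch.distributed.launch,
# then reduce the per-GPU batch size and gradient_steps so the effective batch
# size matches the single-GPU setup described above
NUM_GPUS=4
python -m torch.distributed.launch --nproc_per_node ${NUM_GPUS} train.py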

Training time with a single GPU: XSum takes 24 hours on an A100 or 50 hours on an A6000, en-ro takes 20 hours on an A6000, SST2 takes 2 hours, and MNLI takes 10 hours.

Advanced Usage for Other PETL Variants

As the paper shows, our unified framework easily instantiates different PETL variants by varying the design dimensions. You can modify the scripts to train the other PETL variants studied in the paper; we include some examples in run_xsum.sh, which can be applied directly to the other scripts as well:

# ----- MAM adapter -----
attn_mode="prefix"
attn_option="concat"
attn_composition="add"
attn_bn=30  # attn bottleneck dim

ffn_mode="adapter"
ffn_option="parallel"
ffn_adapter_layernorm_option="none"
ffn_adapter_init_option="lora"
ffn_adapter_scalar="4"
ffn_bn=512 # ffn bottleneck dim

# ----- prefix tuning baseline ----- 
# attn_mode="prefix"
# attn_option="concat"
# attn_composition="add"
# attn_bn=200  # attn bottleneck dim

# ffn_mode="none"
# ffn_option="parallel"
# ffn_adapter_layernorm_option="none"
# ffn_adapter_init_option="lora"
# ffn_adapter_scalar="4"
# ffn_bn=512 # ffn bottleneck dim

# ----- Houlsby Adapter ----- 
# attn_mode="adapter"
# attn_option="sequential"
# attn_composition="add"
# attn_bn=200  # attn bottleneck dim

# ffn_mode="adapter"
# ffn_option="sequential"
# ffn_adapter_layernorm_option="none"
# ffn_adapter_init_option="bert"
# ffn_adapter_scalar="1"
# ffn_bn=200 # ffn bottleneck dim

# ----- FFN Scaled Parallel Adapter ----- 
# attn_mode="none"
# attn_option="parallel"
# attn_composition="add"
# attn_bn=200  # attn bottleneck dim

# ffn_mode="adapter"
# ffn_option="parallel"
# ffn_adapter_layernorm_option="none"
# ffn_adapter_init_option="lora"
# ffn_adapter_scalar="4"
# ffn_bn=512 # ffn bottleneck dim

There are more variants than those shown above. Please see petl/options.py for a complete explanation of these arguments. The results of all variants reported in the paper can be reproduced by changing these values in the scripts.
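
If you want to quickly locate where these arguments are defined without reading the whole file, a simple grep over petl/options.py works (the option names below are the ones used in the scripts above):

# list the lines in petl/options.py that define or document the attention/FFN options
grep -n -E "attn_mode|attn_option|ffn_mode|ffn_option" petl/options.py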

Citation

@article{he2021towards,
  title={Towards a Unified View of Parameter-Efficient Transfer Learning},
  author={He, Junxian and Zhou, Chunting and Ma, Xuezhe and Berg-Kirkpatrick, Taylor and Neubig, Graham},
  journal={arXiv preprint arXiv:2110.04366},
  year={2021}
}
Comments
  • [Bug] Adapter and LoRA for Roberta

    Running the setting below on SST2 and MNLI:

    attn_mode="adapter" attn_option="sequential" attn_composition="add" attn_bn=200  # attn bottleneck dim
    ffn_mode="adapter" ffn_option="sequential" ffn_adapter_layernorm_option="none" ffn_adapter_init_option="bert" ffn_adapter_scalar="1" ffn_bn=200  # ffn bottleneck dim

    Several errors were raised. It seems some parameters, such as d_model and dropout, were set incorrectly in modeling_roberta.py.

    I fixed them, but the resulting log confuses me (screenshot attached in the issue):

    Houlsby et al. add adapters in two places: after the self-attention and after the FFN. So why is an adapter added inside the self-attention, and what is adapter_layer_norm_before.weight used for?

    Thanks

    opened by Albert-Ma 6
  • How the initialization of Prefix affects the performance?

    I would like to know how the initialization of Prefix affects the performance in the following situation.

    • situation 1: the official implementation
    class Prefix(nn.Module):
        def __init__(...):
            ...
    class PETLEncModel(PretrainedModel):
        def __init__(...):
             ...
            self.pretrained_model = pretrained_model
            self.prompt_model = Prefix(...)
            ...
    
    • situation 2: my re-implementation
    class Prefix(nn.Module):
        def __init__(...):
            ...
    class PETLEncModel(RobertaPretrainedModel):
        def __init__(...):
            ...
            self.roberta = RobertaModel(config) # the details of RobertaModel (with adapter) copy from your implementation
            self.prompt_model = Prefix(...)
            self.init_weights() # how does init_weights() affect this?
        
    
    opened by Doragd 5
  • Question about your parameter budget

    Hi,

    Thank you so much for open-sourcing the code. I'm trying to use your MAM Adapter on RoBERTa and have a question about your parameter budget.

    I'm following the config in exps/run_glue.sh but found that the number of trainable parameters is much larger than expected. More specifically, I used the following code snippet to count trainable parameters, and the result is 46783200.

    sum([param.nelement() if param.requires_grad else 0 for param in model.parameters()])
    

    I investigated your code and found that in the Prefix class of petl/petl_factory.py, the modules computing temp_dict['encoder_decoder'] and temp_dict['encoder'] are unnecessary for RoBERTa, so I excluded them. However, I still get 16000160 trainable parameters; the details are as follows (module - requires_grad - nelement):

    pretrained_model.roberta.embeddings.word_embeddings.weight False 38603520
    pretrained_model.roberta.embeddings.position_embeddings.weight False 394752
    pretrained_model.roberta.embeddings.token_type_embeddings.weight False 768
    pretrained_model.roberta.embeddings.LayerNorm.weight False 768
    pretrained_model.roberta.embeddings.LayerNorm.bias False 768
    pretrained_model.roberta.encoder.layer.0.attention.self.query.weight False 589824
    pretrained_model.roberta.encoder.layer.0.attention.self.query.bias False 768
    pretrained_model.roberta.encoder.layer.0.attention.self.value.weight False 589824
    pretrained_model.roberta.encoder.layer.0.attention.self.value.bias False 768
    pretrained_model.roberta.encoder.layer.0.attention.self.key.weight False 589824
    pretrained_model.roberta.encoder.layer.0.attention.self.key.bias False 768
    pretrained_model.roberta.encoder.layer.0.attention.output.dense.weight False 589824
    pretrained_model.roberta.encoder.layer.0.attention.output.dense.bias False 768
    pretrained_model.roberta.encoder.layer.0.attention.output.LayerNorm.weight False 768
    pretrained_model.roberta.encoder.layer.0.attention.output.LayerNorm.bias False 768
    pretrained_model.roberta.encoder.layer.0.intermediate.dense.weight False 2359296
    pretrained_model.roberta.encoder.layer.0.intermediate.dense.bias False 3072
    pretrained_model.roberta.encoder.layer.0.output.dense.weight False 2359296
    pretrained_model.roberta.encoder.layer.0.output.dense.bias False 768
    pretrained_model.roberta.encoder.layer.0.output.LayerNorm.weight False 768
    pretrained_model.roberta.encoder.layer.0.output.LayerNorm.bias False 768
    pretrained_model.roberta.encoder.layer.0.output.ef_ffn_adapter.down_proj.weight True 12288
    pretrained_model.roberta.encoder.layer.0.output.ef_ffn_adapter.down_proj.bias True 16
    pretrained_model.roberta.encoder.layer.0.output.ef_ffn_adapter.up_proj.weight True 12288
    pretrained_model.roberta.encoder.layer.0.output.ef_ffn_adapter.up_proj.bias True 768
    pretrained_model.roberta.encoder.layer.0.ef_ffn_adapter.down_proj.weight True 12288
    pretrained_model.roberta.encoder.layer.0.ef_ffn_adapter.down_proj.bias True 16
    pretrained_model.roberta.encoder.layer.0.ef_ffn_adapter.up_proj.weight True 12288
    pretrained_model.roberta.encoder.layer.0.ef_ffn_adapter.up_proj.bias True 768
    
    ...... (layer.1 - layer.11 are the same with layer.0)
    
    pretrained_model.classifier.dense.weight False 589824
    pretrained_model.classifier.dense.bias False 768
    pretrained_model.classifier.out_proj.weight False 2304
    pretrained_model.classifier.out_proj.bias False 3
    prompt_model.wte.weight True 12288
    prompt_model.control_trans.0.weight True 614400
    prompt_model.control_trans.0.bias True 800
    prompt_model.control_trans.2.weight True 14745600
    prompt_model.control_trans.2.bias True 18432
    

    Your paper says the MAM Adapter costs only 0.5% of the original LM's parameters, so I'm not sure whether I misunderstood something or did something wrong.

    Thanks for your time and your help!

    opened by shaoyijia 4
  • Values for the bottleneck dims and max_tokens_per_batch in paper

    Hi, thanks for the great work! Can I ask two questions regarding the implementation details in the paper?

    1. What are the bottleneck dims (attn_bn and ffn_bn) for prefix tuning and Adapter used in Fig. 4 of the main paper?
    2. For en-ro (as in exps/run_en_ro.sh), did you use the default max_tokens_per_batch=4096 for all bottleneck dims? If not, is it possible to share the max_tokens_per_batch values you used?

    Thanks for your help in advance!

    opened by KMnP 2
  • `Prefix_Adapter` & `MHAdapter_Layer` in `petl/petl_factory.py` seem to be unused

    "Adapter" is not defined.

    class Prefix_Adapter(nn.Module):
        def __init__(self, args, config):
            super().__init__()
            self.prefix = Prefix(args, config)
            self.adapters = Adapter(args, config) # "Adapter" is not defined 
    
        def forward(self, bsz, nsamples=1, device="cuda"):
            prefix = self.prefix(bsz, nsamples, device)
            adapters = self.adapters(bsz)
    
            for ii, dic in enumerate(adapters):
                for key, value in dic.items():
                    prefix[ii][key] = value
            return prefix
    
    opened by Doragd 2
  • Devc

    opened by violet-zct 0
  • minor

    opened by violet-zct 0
  • Devc

    opened by violet-zct 0
  • Devc

    opened by violet-zct 0
  • fix minor bugs; minor refactor; update scripts

    opened by violet-zct 0
  • fix bug unfreeze params

    opened by violet-zct 0
  • The implementation details of the Pfeiffer Adapter for the MT task

    Hi,

    You implemented the Pfeiffer adapter as one of your baselines for MT. Could you point me to the related code?

    The original Pfeiffer adapter is implemented with post-norm, but mBART is trained with pre-norm. I wonder how I should handle the residual connection and layer norm.

    opened by BaohaoLiao 0
  • Prefix tuning with "gated add" for Roberta

    Hi, thanks for publishing the paper and sharing the source code.

    I found that attn_output is not used after it is defined. When training RoBERTa for parameter-efficient learning, the paper's version of prefix tuning does not seem to work properly.

    Could you please check it out?!

    https://github.com/jxhe/unify-parameter-efficient-tuning/blob/3222ce2c0079566a28043e22380eb4ab6ad14389/src/transformers/models/roberta/modeling_roberta.py#L391-L392

    Thanks!

    opened by GeondoPark 0
  • Question about the training curve

    Hi, thanks for sharing the source code.

    Could you please share the training log file, i.e., log.txt? I encountered some training problems: the loss decreases very slowly, and I'm wondering what it is supposed to look like. I need the logs for both the XSum and WMT16 En-Ro experiments.

    Thanks ahead for your help.

    opened by speedcell4 0
  • The instantiation of Multi-head PA and the design choice of MAM adapter.

    Thanks for your great work! I have read your paper, but I am a bit confused about two things.

    (1) The instantiation of Multi-head PA. How can we instantiate Multi-head PA (r=30) so that it has the same number of tuned parameters as PA (attn, r=30), as in Table 4 of the main paper? My initial thought is that Multi-head PA's tuned parameters would be N_h times those of PA.

    (2) The design choice of the MAM adapter. As I understand it, MH PA (attn, r=30) is slightly better than prefix tuning (l=30) based on the results in Table 4 (35.3 > 35.2), and according to previous papers such as LoRA, prefix tuning is not stable to optimize. However, the MAM adapter adopts prefix tuning. Is there a specific reason for this?

    Would you mind giving me any clues about these two questions?

    opened by JacobYuan7 3