Adapter-BERT: Parameter-Efficient Transfer Learning for NLP.

Introduction

This repository contains a version of BERT that can be trained using adapters. Our ICML 2019 paper contains a full description of this technique: Parameter-Efficient Transfer Learning for NLP.

Adapters allow one to train a model to solve new tasks while adjusting only a few parameters per task. This technique yields compact models that share many parameters across tasks, whilst performing similarly to fine-tuning the entire model independently for every task.
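
For intuition, here is a rough sketch of the adapter idea. It is an illustrative PyTorch sketch, not the repository's TensorFlow implementation; the module name, layer sizes, and initialization scale below are assumptions chosen for the example.

import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Down-project, apply a nonlinearity, up-project, and add a skip connection."""

    def __init__(self, hidden_size=768, bottleneck_size=64, init_std=1e-3):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck_size)
        self.up = nn.Linear(bottleneck_size, hidden_size)
        self.act = nn.GELU()
        # A small-standard-deviation initialization keeps the adapter output close
        # to zero at the start, so with the skip connection the whole module starts
        # out as a near-identity function.
        for layer in (self.down, self.up):
            nn.init.normal_(layer.weight, std=init_std)
            nn.init.zeros_(layer.bias)

    def forward(self, hidden_states):
        return hidden_states + self.up(self.act(self.down(hidden_states)))

# Example: hidden states of shape (batch size, sequence length, hidden size).
adapter = BottleneckAdapter()
output = adapter(torch.randn(2, 128, 768))  # close to the input at initialization

Only the two small projections are new parameters per task, which is why the per-task models stay compact.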

The code here is forked from the original BERT repo. It provides our version of BERT with adapters, and the capability to train it on the GLUE tasks. For additional details on BERT, and support for additional tasks, see the original repo.

Tuning BERT with Adapters

The following command provides an example of tuning with adapters on GLUE.

Fine-tuning may be run on a GPU with at least 12GB of RAM, or a Cloud TPU. The same constraints apply as for full fine-tuning of BERT. For additional details, and instructions on downloading a pre-trained checkpoint and the GLUE tasks, see https://github.com/google-research/bert.

export BERT_BASE_DIR=/path/to/bert/uncased_L-12_H-768_A-12
export GLUE_DIR=/path/to/glue

python run_classifier.py \
  --task_name=MRPC \
  --do_train=true \
  --do_eval=true \
  --data_dir=$GLUE_DIR/MRPC \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=32 \
  --learning_rate=3e-4 \
  --num_train_epochs=5.0 \
  --output_dir=/tmp/adapter_bert_mrpc/

You should see an output like this:

***** Eval results *****
  eval_accuracy = 0.85784316
  eval_loss = 0.48347527
  global_step = 573
  loss = 0.48347527

This means that the Dev set accuracy was 85.78%. Small sets like MRPC have high variance in Dev set accuracy, even when starting from the same pre-training checkpoint, so results may deviate from this figure by around 2%.

Citation

Please use the following citation for this work:

@inproceedings{houlsby2019parameter,
  title = {Parameter-Efficient Transfer Learning for {NLP}},
  author = {Houlsby, Neil and Giurgiu, Andrei and Jastrzebski, Stanislaw and Morrone, Bruna and De Laroussilhe, Quentin and Gesmundo, Andrea and Attariyan, Mona and Gelly, Sylvain},
  booktitle = {Proceedings of the 36th International Conference on Machine Learning},
  year = {2019},
}

The paper is also available on arXiv: https://arxiv.org/abs/1902.00751.

Disclaimer

This is not an official Google product.

Contact information

For personal communication, please contact Neil Houlsby ([email protected]).

Comments
  • Efficient/low-latency serving with multiple adapters?

    Thanks for releasing this paper. I must say that I've had similar results to yours: adapters beat linear combinations, conv heads, and every other head I can think of. As a bonus, they nearly match BERT's fully fine-tuned performance!

    A couple of broad questions if that's ok:

    The only downside is that with adapter models you can't use a single BERT service to provide features to multiple heads (e.g. bert-as-service). Do you have any ideas about efficiently serving multiple adapters? The best thing I can think of is keeping the main model in memory and quickly switching the adapter parameters based on the request. (A rough sketch of this adapter-swapping idea is given after these comments.)

    Also, have you experimented with online/active learning with adapters? It seems like a fruitful area, since we don't know how to do online learning well with transformers, and adapters let you train with high learning rates and far fewer parameters.

    opened by wassname 2
  • In adapter fine-tuning, why not fix the original params?

    Hi guys,

    I had a glance at the run_classifier.py code and didn't see any code that freezes the original transformer parameters, so it looks like a full fine-tuning setting. Why is that? (A sketch of freezing everything except the adapters is given after these comments.) Thanks~

    opened by zihaolucky 2
  • ValueError: Tensor not found in checkpoint

    Hi, I am trying to implement adapter modules into my copy of BERT, however I am running into problems with adding the layers. As far as I can tell, the model gets built correctly, but when I try to run run_pretraining.py I get a ValueError reporting that tensors are not found in the checkpoint (error screenshots omitted).

    The problem is that it doesn't know what to do with the adapter layers since they aren't found in the checkpoint file - how can I work around this or get BERT to recognize that I want to add them in?

    For reference, this is how I am running the script (I've modified it slightly to include the adapter modules as a flag):

    python run_pretraining.py \
      --adapter=True \
      --input_file=/path/to/tfrecord/pretrained_iob2.tfrecord \
      --output_dir=/usr/bert/adapter \
      --do_train=True \
      --do_eval=True \
      --bert_config_file=/path/to/bert/multi_cased_L-12_H-768_A-12/bert_config.json \
      --init_checkpoint=/path/to/bert/multi_cased_L-12_H-768_A-12/bert_model.ckpt \
      --train_batch_size=16

    opened by julia320 1
  • How a near-identity initialization is implemented

    Dear authors,

    After reading the code, I see that the adapter parameters (w1, w2) are by default initialized with a small standard deviation.

    Does this guarantee that the adapter is a near-identity mapping at initialization? (A quick numerical check is given after these comments.)

    opened by boxin-wbx 2
  • Adapters on large datasets in GLUE could not get the same results

    Hi, I am trying adapters on BERT-base and evaluating on GLUE. On smaller datasets like MRPC, RTE, and CoLA I see good results, but on the larger GLUE datasets like MNLI, QNLI, and SST-2 I am really struggling, and the results fall well below BERT-base.

    I have a deadline soon and need to compare fairly with your method, so I would very much appreciate your feedback. Any suggestions that could help the results on large-scale datasets?

    thanks

    opened by dorost1234 1
  • freezing "layer_norm" and "head"

    Hi, could you confirm whether, in the implementation of adapters, the layer_norm of the original model should be unfrozen, or only the layer_norm inside the adapter? And what about the classifier head: does it need to be frozen? (See the parameter-freezing sketch after these comments.) Thanks

    opened by rabeehkarimimahabadi 2
  • How to implement adapters in case of pre norm

    Hi, I have a model in which normalization happens first and then the residual add. In the paper you discussed the post-norm case; could you tell me how I can implement adapters for this pre-norm case? Thank you. (One possible placement is sketched after these comments.)

    I mark the lines where the normalization and the add operation happen with ** to describe the model better:

    class LayerFF(nn.Module):
        def __init__(self, config):
            super().__init__()
            if config.feed_forward_proj == "relu":
                self.DenseReluDense = DenseReluDense(config)
            elif config.feed_forward_proj == "gated-gelu":
                self.DenseReluDense = DenseGatedGeluDense(config)
            else:
                raise ValueError(
                    f"{self.config.feed_forward_proj} is not supported. Choose between `relu` and `gated-gelu`"
                )
    
            self.layer_norm = LayerNorm(config.d_model, eps=config.layer_norm_epsilon)
            self.dropout = nn.Dropout(config.dropout_rate)
    
        def forward(self, hidden_states):
            **forwarded_states = self.layer_norm(hidden_states)**
            forwarded_states = self.DenseReluDense(forwarded_states)
            **hidden_states = hidden_states + self.dropout(forwarded_states)**
            return hidden_states
    
    class LayerSelfAttention(nn.Module):
        def __init__(self, config, has_relative_attention_bias=False):
            super().__init__()
            self.SelfAttention = Attention(config, has_relative_attention_bias=has_relative_attention_bias)
            self.layer_norm = LayerNorm(config.d_model, eps=config.layer_norm_epsilon)
            self.dropout = nn.Dropout(config.dropout_rate)
    
        def forward(
            self,
            hidden_states,
            attention_mask=None,
            position_bias=None,
            head_mask=None,
            past_key_value=None,
            use_cache=False,
            output_attentions=False,
        ):
            **normed_hidden_states = self.layer_norm(hidden_states)**
            attention_output = self.SelfAttention(
                normed_hidden_states,
                mask=attention_mask,
                position_bias=position_bias,
                head_mask=head_mask,
                past_key_value=past_key_value,
                use_cache=use_cache,
                output_attentions=output_attentions,
            )
            **hidden_states = hidden_states + self.dropout(attention_output[0])**
            outputs = (hidden_states,) + attention_output[1:]  # add attentions if we output them
            return outputs
    
    opened by rabeehkarimimahabadi 2
  • Hyperparameters of GLUE datasets

    Hi, thanks for your great work!

    I fail to reproduce the high results on the GLUE datasets. Could you provide the hyperparameters (training epochs, learning rate, etc.) for the 9 GLUE tasks corresponding to the best results claimed in the paper?

    opened by mazicwong 0
  • missing processors

    Congratulations on the great paper!

    One question, do you have additional processor classes? At the moment, the code reads:

    processors = {
        "cola": ColaProcessor,
        "mnli": MnliProcessor,
        "mrpc": MrpcProcessor,
    }

    and later:

    if task_name not in processors:
        raise ValueError("Task not found: %s" % (task_name))

    meaning that only 3 datasets can be used for training.

    It would be useful for replication purposes to have all processors available. Let me know if this is possible or if I have misunderstood!

    opened by jacobdeasy 1
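
Sketches for some of the questions above

On the serving question in the first comment: one way to realize "keep the main model in memory and switch the adapter parameters per request" is to hold a single backbone in memory and store only the small per-task adapter weights, loading the relevant set before each request. The sketch below is a hypothetical PyTorch illustration of that idea, not something provided by this repository; the naming convention ("adapter" appearing in parameter names) and the server class are assumptions, and concurrency is ignored.

import torch

def adapter_state(model):
    # Collect only the (small) adapter parameters into a per-task state dict.
    return {name: tensor.detach().clone()
            for name, tensor in model.state_dict().items() if "adapter" in name}

class AdapterSwitchingServer:
    # Hypothetical sketch: one backbone shared across tasks, with the per-task
    # adapter weights swapped in before each request. Not thread-safe as written.
    def __init__(self, model, adapter_states):
        self.model = model.eval()             # shared backbone, loaded once
        self.adapter_states = adapter_states  # task name -> adapter state dict

    @torch.no_grad()
    def predict(self, task_name, inputs):
        self.model.load_state_dict(self.adapter_states[task_name], strict=False)
        return self.model(inputs)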
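
On the two freezing questions above (whether run_classifier.py freezes the original parameters, and which layer norms and heads to train): in the paper, only the adapter parameters, the layer normalization parameters, and the new task-specific head are updated, while the pre-trained weights stay fixed. The following is a rough PyTorch-style sketch of that selection; the name patterns are assumptions about a hypothetical adapter-augmented model, not the variable names used in this repository's TensorFlow code.

def mark_only_adapters_trainable(model,
                                 trainable_patterns=("adapter", "LayerNorm", "classifier")):
    # Freeze every parameter except those whose names match one of the patterns.
    # The patterns are illustrative; adjust them to the actual model's naming.
    for name, param in model.named_parameters():
        param.requires_grad = any(pattern in name for pattern in trainable_patterns)
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print("training %d of %d parameters (%.2f%%)"
          % (trainable, total, 100.0 * trainable / total))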
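
On the near-identity initialization question above: with the skip connection, the adapter computes input + up(activation(down(input))); when the projection weights have a small standard deviation and the biases are zero, the second term is tiny, so the adapter starts close to (though not exactly at) the identity. A quick numerical check, assuming the bottleneck layout sketched in the Introduction:

import torch
import torch.nn as nn
import torch.nn.functional as F

hidden = torch.randn(4, 768)
down, up = nn.Linear(768, 64), nn.Linear(64, 768)
for layer in (down, up):
    nn.init.normal_(layer.weight, std=1e-3)  # small standard deviation
    nn.init.zeros_(layer.bias)
output = hidden + up(F.gelu(down(hidden)))   # adapter with skip connection
print((output - hidden).abs().max())         # tiny, i.e. near-identity at init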
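
On the pre-norm question above: the paper discusses the post-norm Transformer used in BERT, so for a pre-norm block like the one quoted in that comment the placement below is only a suggestion, not something prescribed by the paper or this repository. One natural option is to apply the adapter to the sub-layer output on the residual branch, before the dropout and the residual add, so the skip connection itself is untouched:

import torch.nn as nn

class PreNormBlockWithAdapter(nn.Module):
    # Hypothetical pre-norm residual block with an adapter on the residual branch.
    # "sublayer" stands for a feed-forward or self-attention module, and "adapter"
    # could be a bottleneck module like the sketch in the Introduction above.
    def __init__(self, d_model, sublayer, adapter, dropout=0.1, eps=1e-6):
        super().__init__()
        self.layer_norm = nn.LayerNorm(d_model, eps=eps)
        self.sublayer = sublayer
        self.adapter = adapter
        self.dropout = nn.Dropout(dropout)

    def forward(self, hidden_states):
        branch = self.layer_norm(hidden_states)      # pre-norm
        branch = self.sublayer(branch)               # attention or feed-forward
        branch = self.adapter(branch)                # adapter on the branch only
        return hidden_states + self.dropout(branch)  # residual add is unchanged

Because a near-identity adapter leaves the branch output almost unchanged, the block behaves like the original pre-norm block at the start of training.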