P-Tuning v2: Prompt Tuning Can Be Comparable to Finetuning Universally Across Scales and Tasks

Overview
Comments
  • Results of Multi-task Learning.

    Thank you for providing the source code for the nice work. I have some questions regarding Multi-task Learning (MPT-2) in Table 4. In the paper, the authors mention that

    For the multi-task setting, we combine the training set of the three datasets for pre-training. We use different linear classifiers for each dataset while sharing the continuous prompts.

    1. What do you mean by pre-training here? Do you first pre-train on all datasets with labels and then continue training on a specific dataset?
    2. I tried to find the pre-training code for multi-task learning in the repo but could not. Would it be possible to make it public?

    Many thanks for the clarification!
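
    Not the authors' implementation, just a minimal sketch of how the quoted setup could look: a single trainable prefix shared by all datasets, with a separate linear classifier per dataset (the class and argument names below are made up for illustration). Whether a per-dataset stage follows the combined pre-training is exactly what question 1 asks.

        import torch.nn as nn

        class MultiTaskPrefixHeads(nn.Module):
            """Toy sketch: one shared continuous prompt, one linear classifier per dataset."""

            def __init__(self, hidden_size, num_layers, pre_seq_len, num_labels_per_task):
                super().__init__()
                # Shared prefix parameters (keys and values for every layer), trained on all tasks.
                self.prefix = nn.Embedding(pre_seq_len, num_layers * 2 * hidden_size)
                # A separate linear head for each dataset.
                self.classifiers = nn.ModuleList(
                    [nn.Linear(hidden_size, n) for n in num_labels_per_task]
                )

            def forward(self, pooled_output, task_id):
                # pooled_output: [batch_size, hidden_size] from the frozen backbone, which
                # consumes the shared prefix via past_key_values; route to this task's head.
                return self.classifiers[task_id](pooled_output)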

    opened by LeeShiyang 7
  • Questions about deep prompt per layer

    Hi, I have a question about deep prompts. I understand that deep prompts are implemented through past_key_values in the model. How can I see the actual prompt weights per layer? The shape of the prompt parameter is (prefix_len, config.num_hidden_layers * 2 * config.hidden_size) without the prefix projection, while the shape of past_key_values fed to each layer is [2, batch_size, n_head, prefix_len, n_embd]. I believe the first '2' corresponds to the key and value of the attention mechanism. What I want is a [prefix_len, config.hidden_size] vector per layer, just like the embedding vector in prompt tuning v1.

    Do you have any idea for this?

    Thanks : )
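
    One possible way to inspect this, shown as a self-contained sketch below: without the prefix projection the prompt parameter is just an embedding table, and its flat last dimension can be viewed as (layer, key/value, hidden), assuming the reshape done in the repo's get_prompt and the key-before-value ordering of past_key_values. On a trained model you would read the trained prefix encoder's embedding weight (e.g. model.prefix_encoder.embedding.weight; the exact attribute path may differ) instead of the random stand-in. Note these are learned key/value prompts injected through past_key_values, so there is no single input-embedding-style vector per layer as in v1.

        import torch.nn as nn

        # Stand-in for the prefix encoder without the projection:
        # an embedding of shape [pre_seq_len, num_layers * 2 * hidden_size].
        pre_seq_len, num_layers, hidden_size = 4, 12, 768
        prefix = nn.Embedding(pre_seq_len, num_layers * 2 * hidden_size)

        # Reinterpret the flat dimension as (layer, key/value, hidden), matching the
        # view(...) performed before the tensor is split into per-layer past_key_values.
        per_layer = prefix.weight.view(pre_seq_len, num_layers, 2, hidden_size)

        key_prompt_layer0 = per_layer[:, 0, 0, :]    # [pre_seq_len, hidden_size], key prompt of layer 0
        value_prompt_layer0 = per_layer[:, 0, 1, :]  # [pre_seq_len, hidden_size], value prompt of layer 0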

    opened by eunjiinkim 6
  • DeBERTa P-Tuning v2 speed

    I've observed that when training DeBERTaV2 with P-Tuning v2, evaluation takes significantly more time than with other methods. Have you observed such behaviour?

    It even takes significantly more time than P-Tuning v1, despite the fact that v1 has higher attention complexity to evaluate.

    It seems like the issue is the ad-hoc implementation of past_key_values for DeBERTa, which is the only difference in the backbone model code between v1 and v2, but I can't figure out the specific reason.

    opened by kefirski 6
  • sequence_classification.py

        # pooled_output = outputs[1]
        sequence_output = outputs[0]
        # drop the prefix positions, then pool on the first remaining token
        sequence_output = sequence_output[:, self.pre_seq_len:, :].contiguous()
        first_token_tensor = sequence_output[:, 0]
        pooled_output = self.bert.pooler.dense(first_token_tensor)
        pooled_output = self.bert.pooler.activation(pooled_output)

    Could the authors explain the purpose of this modification? In our tests the original method (using pooled_output = outputs[1] directly) also worked well.

    opened by zhaogangthu 6
  • P-Tuning v2 works as expected on NER but does not converge on classification tasks

    Hello, following the method in the paper we reproduced P-Tuning v2 on the People's Daily NER dataset, and our results are largely consistent with the paper. However, on two classification datasets, RTE and the Ant Financial text-similarity dataset, training does not converge. We tried reparameterizing the prefix embeddings with an MLP and with an LSTM, with little effect. Did the authors run into anything like this on classification datasets?

    Taking the Ant Financial dataset as an example, the loss does not decrease from the very start, and on the validation set the model ends up predicting the majority class for every example. We also compared the gradients between NER and classification and found no obvious difference.

    On the model side we tried RoBERTa-large, BERT-large, and BERT-base.

    opened by sxthunder 6
  • A question about the process of P-Tuning v2.

    Thank you for your well-organized code. Since my background is in computer vision, I am not very familiar with NLP, but I'm very interested in P-Tuning.

    1. In P-Tuning v2, do I need to include the prompts when training the pre-trained model, or only when I train on a downstream task?
    2. For the initialization of a prompt:
      prompts = torch.arange()
      prompts = torch.nn.Embedding(prompts)
      Is it normal to use the above initialization method? (See the sketch after this list.)
    3. For training on a downstream task, I need to freeze all of the model's parameters but not the prompts' parameters, right?
    4. For deep prompts, do I need to reinitialize the prompts at each layer (with the same initialization as in 2)? I'm sorry for asking so many questions. Looking forward to your reply.
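
    Regarding question 2, a minimal self-contained sketch of the usual pattern: a trainable prompt embedding indexed by torch.arange, with the pre-trained backbone frozen. The sizes and model name below are placeholders, not the repo's configuration.

        import torch
        import torch.nn as nn
        from transformers import AutoModel  # any Hugging Face encoder as the frozen backbone

        pre_seq_len, hidden_size = 20, 768              # illustrative sizes only

        # Prompts are a trainable embedding table indexed by fixed positions.
        prefix_tokens = torch.arange(pre_seq_len)       # tensor([0, 1, ..., pre_seq_len - 1])
        prompt_embedding = nn.Embedding(pre_seq_len, hidden_size)
        prompts = prompt_embedding(prefix_tokens)       # [pre_seq_len, hidden_size]

        # The backbone stays frozen; only the prompt parameters (and a task head) are trained.
        backbone = AutoModel.from_pretrained("bert-base-uncased")
        for param in backbone.parameters():
            param.requires_grad = False
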
    opened by QWTforGithub 5
  • Bug for BertPrompt series code?

    Hi, I notice that the BertPrompt model does not feed the [CLS] representation to the linear classification head. I'll try to explain with the following code and toy inputs: say input_ids has shape [8, 32] and pre_seq_len is 3, so inputs_embeds will be [8, 35, 768]. I'll comment the shapes of the key variables in the code and state my concern.

    class BertPromptForSequenceClassification(BertPreTrainedModel):
        def forward(*args):
            return_dict = return_dict if return_dict is not None else self.config.use_return_dict
    
            batch_size = input_ids.shape[0]
            raw_embedding = self.embeddings(
                input_ids=input_ids, 
                position_ids=position_ids,
                token_type_ids=token_type_ids,
            )
            prompts = self.get_prompt(batch_size=batch_size)
        inputs_embeds = torch.cat((prompts, raw_embedding), dim=1)  # prompts prepended to the token embeddings
            prefix_attention_mask = torch.ones(batch_size, self.pre_seq_len).to(self.bert.device)
            attention_mask = torch.cat((prefix_attention_mask, attention_mask), dim=1)
    
            # inputs_embeds's shape: [8, 35, 768]
    
    
            outputs = self.bert(
                # input_ids,
                attention_mask=attention_mask,
                # token_type_ids=token_type_ids,
                # position_ids=position_ids,
                head_mask=head_mask,
                inputs_embeds=inputs_embeds,
                output_attentions=output_attentions,
                output_hidden_states=output_hidden_states,
                return_dict=return_dict,
                # past_key_values=past_key_values,
            ) 
    # since the BERT pooler takes the first token of the sequence as its input,
    # the token actually fed to the classifier here is the first soft prompt token, not [CLS]!
            
            pooled_output = outputs[1]
    

    I wonder, is this the soft prompt tuning variant that P-Tuning v2 is compared against? If so, the token used by its classification head is not the [CLS] token.

    Is that expected?
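
    For reference, a toy, self-contained sketch (random tensors stand in for the model outputs) of how one could pool on the real [CLS] position instead of the first soft prompt token. It mirrors the modification discussed in the sequence_classification.py issue above, without claiming this is what the authors intend.

        import torch
        import torch.nn as nn

        # Toy stand-ins matching the shapes above.
        batch_size, pre_seq_len, seq_len, hidden = 8, 3, 32, 768
        sequence_output = torch.randn(batch_size, pre_seq_len + seq_len, hidden)  # outputs[0], i.e. [8, 35, 768]
        pooler_dense = nn.Linear(hidden, hidden)             # stands in for self.bert.pooler.dense

        # Drop the prompt positions, then pool on the real [CLS] token.
        real_tokens = sequence_output[:, pre_seq_len:, :]    # [8, 32, 768]
        cls_repr = real_tokens[:, 0]                         # [8, 768]
        pooled_output = torch.tanh(pooler_dense(cls_repr))   # BertPooler applies a tanh activation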

    opened by tangzhy 5
  • With prefix_projection enabled, test accuracy never changes and stays at 62.1

        12/07/2021 19:30:19 - INFO - training.trainer_base - ***** Epoch 12: Best results *****
        12/07/2021 19:30:19 - INFO - training.trainer_base - best_epoch = 0
        12/07/2021 19:30:19 - INFO - training.trainer_base - best_eval_accuracy = 0.6217125382262997
        12/07/2021 19:30:19 - INFO - training.trainer_base - epoch = 12.0
        OrderedDict([('best_epoch', 0), ('best_eval_accuracy', 0.6217125382262997), ('epoch', 13.0)])
        {'loss': 0.7488, 'learning_rate': 0.006054054054054054, 'epoch': 13.51}

    opened by yh351016 5
  • Unable to reproduce the PT-2 results of RTE in Table 1

    I have some questions about reproducing the PT-2 results for RTE in Table 1.

    My base model is RoBERTa-large. I trained the model for 10 epochs with the recommended parameters (prompt length = 4, learning rate = 1e-2, as suggested in a previous issue).

    However, I can only get roughly 58% accuracy on the RTE dev set.

    I am not sure which of the factors below might cause this; I hope the authors can give me some hints, many thanks!

    1. How many training epochs did you use for RTE?
    2. If I understand correctly, you tune both the classification head and the prompts inserted in each layer, right? In that case, does the initialization matter, and how did you do the initialization?
    3. I notice that you insert the prompts before the [CLS] token; is there a specific reason for that?
    4. Are you using the vanilla roberta-large checkpoint?
    opened by CSerxy 5
  • Question about "deep prompts"

    Hi,

    I've seen issues asking about the past_key_value implementation and I've tried a code snippet to confirm if it's consistent with what's described in the paper. However, it doesn't seem to work - for different inputs, the first few tokens (i.e., the prompts) are not identical. Could you please take a look at the code snippet to see if it's correct and where the problem is?

        import torch
        from transformers import AutoTokenizer, RobertaConfig

        # RobertaPrefixForSequenceClassification comes from the repo's model code
        # (e.g. sequence_classification.py).

        config = RobertaConfig.from_pretrained('roberta-base')
        config.pre_seq_len = 2
        config.prefix_projection = False

        model = RobertaPrefixForSequenceClassification.from_pretrained('roberta-base', config=config)
    
        tokenizer = AutoTokenizer.from_pretrained('roberta-base')
    
        sentences = ['This is an example sentence', 'Each sentence is converted']
    
        encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
    
        with torch.no_grad():
            model_output = model(**encoded_input, output_hidden_states=True)
            layer_hidden_states = model_output['hidden_states'][1:]  # discard the output of embedding layer
            for hidden_state in layer_hidden_states:
                deep_prompts = hidden_state[:, :model.pre_seq_len, :]  # [batch_size, pre_seq_len, hidden_size]
                assert deep_prompts[0].equal(deep_prompts[1])
    
    opened by JetRunner 4
  • Questions about inference time

    Questions about inference time

    Hi. I have a question.

    In the case of Prefix Tuning, I think there will be some advantage in training time.

    However, I don't think there will be a big advantage in the inference time of the trained model. What do you think?
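
    For a rough sense of the inference overhead in question, a back-of-the-envelope count of attention scores per layer (illustrative numbers, not a benchmark): prefix tuning adds pre_seq_len extra key/value pairs for every query to attend to, and no parameters are merged back into the backbone.

        seq_len, pre_seq_len = 128, 20
        plain_scores = seq_len * seq_len                    # 16384 query-key products per head
        prefix_scores = seq_len * (seq_len + pre_seq_len)   # 18944, about 16% more
        print(prefix_scores / plain_scores)                 # 1.15625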

    opened by yeontaek 4
  • Questions about Results on Question Answering, Table 3

    Questions about Results on Question Answering, Table 3

    In Lester et al. (2021), they use T5 as the pre-trained model and use the LM head to generate answers. For models like BERT and RoBERTa explored in this work, we cannot use an LM head to extract context spans as the answers, which means a linear QA head is essential. Is the task-specific linear head fine-tuned together with the prompt embeddings in PT, Table 3? If so, this implementation is a little different from the original one. If not, a randomly initialized QA head is not expected to produce meaningful outputs and would hinder PT training, which makes the PT results in Table 3 meaningless.

    Or do I have some misunderstanding about the LM head in QA tasks?
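
    For context, a minimal self-contained sketch of the kind of linear QA head being discussed, i.e. the standard extractive-QA head that maps each token's hidden state to start/end logits (as in Hugging Face's BertForQuestionAnswering). Whether such a head is trained jointly with the prompts in Table 3 is exactly the question above; the tensors here are random stand-ins.

        import torch
        import torch.nn as nn

        batch_size, seq_len, hidden_size = 2, 128, 768
        sequence_output = torch.randn(batch_size, seq_len, hidden_size)  # backbone outputs (stand-in)

        # Linear span-extraction head: 2 logits (start, end) per token.
        qa_head = nn.Linear(hidden_size, 2)
        logits = qa_head(sequence_output)                    # [batch_size, seq_len, 2]
        start_logits, end_logits = logits.split(1, dim=-1)
        start_logits = start_logits.squeeze(-1)              # [batch_size, seq_len]
        end_logits = end_logits.squeeze(-1)                  # [batch_size, seq_len]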

    opened by haichao592 5
Owner
THUDM (Data Mining Research Group at Tsinghua University)