P-Tuning v2: Prompt Tuning Can Be Comparable to Finetuning Universally Across Scales and Tasks

    Results of Multi-task Learning.

    Thank you for providing the source code for the nice work. I have some questions regarding Multi-task Learning (MPT-2) in Table 4. In the paper, the authors mention that

    For the multi-task setting, we combine the training set of the three datasets for pre-training. We use different linear classifiers for each dataset while sharing the continuous prompts.

    1. What do you mean by pre-training here? Do you first pre-train on all datasets with labels and then continue training on a specific dataset?
    2. I tried to find the pre-training code for multi-task learning in the repo but could not. Would it be possible to make it public?

    Many thanks for the clarification!

    Questions about deep prompt per layer

    Hi, I have a question about deep prompts. I understand that deep prompts are implemented through past_key_values in the model. How can I see the actual prompt weights per layer? When trans is not used, the prompt has shape (prefix_len, config.num_hidden_layers * 2 * config.hidden_size), and the past_key_values passed to each layer have shape [2, batch_size, n_head, prefix_len, n_embd]. I believe the leading 2 corresponds to the key and value of the attention mechanism. What I want is a [prefix_len, config.hidden_size] matrix per layer, just like the prompt embedding in prompt tuning v1.

    Do you have any idea for this?

    Thanks : )
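
    The shapes quoted in this question suggest one way to slice per-layer prompts out of the flat prefix matrix. The sketch below is not taken from the repository: it assumes the flat dimension is laid out layer by layer with each layer's key block followed by its value block, and the helper names and toy sizes are hypothetical. Plain lists stand in for tensors so the index arithmetic is easy to check.

```python
def layer_kv_columns(layer, hidden_size):
    """Column ranges of one layer's key and value prompt inside a flat prefix row.

    Assumes the flat row is ordered as
    [layer0_key, layer0_value, layer1_key, layer1_value, ...].
    """
    key_start = 2 * layer * hidden_size
    value_start = key_start + hidden_size
    return (key_start, value_start), (value_start, value_start + hidden_size)


def layer_prompts(prefix_rows, layer, hidden_size):
    """Slice [prefix_len, hidden_size] key and value prompts for one layer."""
    (k0, k1), (v0, v1) = layer_kv_columns(layer, hidden_size)
    keys = [row[k0:k1] for row in prefix_rows]
    values = [row[v0:v1] for row in prefix_rows]
    return keys, values


# Toy sizes: 3 prefix positions, 2 layers, hidden size 4.
prefix_len, num_layers, hidden_size = 3, 2, 4
width = num_layers * 2 * hidden_size
# Each entry encodes its own (row, column) position so slices are easy to inspect.
prefix_rows = [[p * width + c for c in range(width)] for p in range(prefix_len)]

keys, values = layer_prompts(prefix_rows, layer=1, hidden_size=hidden_size)
```

    Under that layout assumption, each layer yields two [prefix_len, hidden_size] matrices (a key prompt and a value prompt) rather than the single input embedding of prompt tuning v1.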

    DeBERTa P-Tuning v2 speed

    I've observed that training DeBERTaV2 with P-Tuning v2 takes significantly more time to evaluate than other methods. Have you observed such behaviour?

    It even takes significantly more time than P-Tuning v1, despite the fact that v1 has higher complexity when evaluating attention.

    The issue seems to be the ad-hoc implementation of past_key_values for DeBERTa, which is the only difference in the backbone model code between v1 and v2, but I can't figure out the specific reason.
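
    The attention-cost comparison in this report can be made concrete by counting attention score entries. The sketch below assumes that v1 prepends prompts to the input sequence, so they act as both queries and keys, while v2 supplies them only as cached keys and values. The function name and the toy sizes are hypothetical, and the count ignores every other cost, so it does not explain the observed DeBERTa slowdown.

```python
def attention_score_entries(seq_len, prefix_len, method):
    """Number of query-key score entries per head and per layer.

    "v1": prompts are extra input positions (queries and keys).
    "v2": prompts are extra keys/values only (past_key_values style).
    """
    if method == "v1":
        total = seq_len + prefix_len
        return total * total
    if method == "v2":
        return seq_len * (seq_len + prefix_len)
    raise ValueError(f"unknown method: {method}")


# Toy sizes: 128 input tokens and 20 prompt positions.
v1_entries = attention_score_entries(128, 20, "v1")
v2_entries = attention_score_entries(128, 20, "v2")
```

    With these toy sizes the v1 count is the larger of the two, which matches the expectation stated above that v2 should not be slower on attention alone.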

    sequence_classification.py

        # pooled_output = outputs[1]
        sequence_output = outputs[0]
        sequence_output = sequence_output[:, self.pre_seq_len:, :].contiguous()
        first_token_tensor = sequence_output[:, 0]
        pooled_output = self.bert.pooler.dense(first_token_tensor)
        pooled_output = self.bert.pooler.activation(pooled_output)

    What is the purpose of this modification? In my tests, the original method already works quite well.

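    P-tuning v2 works normally on NER but does not converge on classification tasks

    Hello, we reproduced the method from the paper on the People's Daily NER dataset, and our conclusions largely match the paper's. However, on two classification datasets, RTE and the Ant Financial text similarity dataset, training does not converge. We tried reparameterizing the prefix embedding with an MLP and with an LSTM, but neither helped much. Did the authors run into this when working on classification datasets?

    Taking the Ant Financial dataset as an example, the loss does not decrease from the very beginning, and on the validation set the model ends up predicting the majority class for every example. We compared the gradients for NER and classification and found no obvious difference.

    On the model side, we tried RoBERTa-large, BERT-large, and BERT-base.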

    A question about the process of P-tuning v2.

    Thank you for your well-organized code. Since my major is computer vision, I am not familiar with NLP. I'm very interested in P-tuning.

    1. In P-tuning v2, do I need to use the prompts when pre-training the model, or only when training on a downstream task?
    2. For the initialization of a prompt:
      prefix_tokens = torch.arange(pre_seq_len)
      prompts = torch.nn.Embedding(pre_seq_len, hidden_size)(prefix_tokens)
      Is it normal to use the above initialization method?
    3. For training a downstream task, I need to freeze all of the model's parameters, but not the prompts' parameters, right?
    4. For deep prompts, I need to initialize separate prompts at each layer, right (with the same initialization as in 2)?

    I'm sorry for asking so many questions. Looking forward to your reply.

    Bug for BertPrompt series code?

    Hi, I notice that the BERT prompt model does not feed the [CLS] token to the linear head. I will try to explain with the following code and toy inputs: say input_ids has shape [8, 32] and pre_seq_len is 3, so inputs_embeds has shape [8, 35, 768]. I comment the shapes of the key variables in the code and state my concern.

    class BertPromptForSequenceClassification(BertPreTrainedModel):
        def forward(self, input_ids=None, attention_mask=None, return_dict=None):  # other arguments omitted
            return_dict = return_dict if return_dict is not None else self.config.use_return_dict
            batch_size = input_ids.shape[0]
            raw_embedding = self.embeddings(input_ids=input_ids)  # remaining arguments omitted in this excerpt
            prompts = self.get_prompt(batch_size=batch_size)
            inputs_embeds = torch.cat((prompts, raw_embedding), dim=1)
            prefix_attention_mask = torch.ones(batch_size, self.pre_seq_len).to(self.bert.device)
            attention_mask = torch.cat((prefix_attention_mask, attention_mask), dim=1)
            # inputs_embeds's shape: [8, 35, 768]
            outputs = self.bert(
                # input_ids,
                attention_mask=attention_mask,
                inputs_embeds=inputs_embeds,
                # token_type_ids=token_type_ids,
                # position_ids=position_ids,
                # past_key_values=past_key_values,
            )  # remaining arguments omitted in this excerpt
            # Since the BERT encoder passes the first token to the pooler,
            # the token actually used by the classifier is the first soft-prompt token!
            pooled_output = outputs[1]

    I wonder, is P-tuning v2 compared against this soft prompt tuning implementation? If so, the token used by the classification head for the latter is not [CLS].

    Is that expected?
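
    The indexing concern in this report can be illustrated with plain lists standing in for the embedding sequence. This is only a sketch with hypothetical token strings. It assumes, as the issue states, that the prompts are concatenated in front of the embedded tokens and that the pooler reads position 0.

```python
pre_seq_len = 3
prompts = [f"prompt_{i}" for i in range(pre_seq_len)]
tokens = ["[CLS]", "hello", "world", "[SEP]"]

# Mirrors torch.cat((prompts, raw_embedding), dim=1) along the sequence axis.
sequence = prompts + tokens

# What a pooler that reads index 0 would see.
first_position = sequence[0]

# Dropping the prompt positions first puts [CLS] back at index 0.
after_prompts = sequence[pre_seq_len:]
cls_position = after_prompts[0]
```

    Slicing off the first pre_seq_len positions before pooling is the same idea as the sequence_output[:, self.pre_seq_len:, :] line quoted in the sequence_classification.py issue above.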

    When training with prefix_projection, test accuracy stays fixed at 62.1

    12/07/2021 19:30:19 - INFO - training.trainer_base - ***** Epoch 12: Best results *****
    12/07/2021 19:30:19 - INFO - training.trainer_base - best_epoch = 0
    12/07/2021 19:30:19 - INFO - training.trainer_base - best_eval_accuracy = 0.6217125382262997
    12/07/2021 19:30:19 - INFO - training.trainer_base - epoch = 12.0
    OrderedDict([('best_epoch', 0), ('best_eval_accuracy', 0.6217125382262997), ('epoch', 13.0)])
    {'loss': 0.7488, 'learning_rate': 0.006054054054054054, 'epoch': 13.51}

    Unable to reproduce the PT-2 results of RTE in Table 1

    I have some questions about reproducing the PT-2 results on RTE in Table 1.

    My base model is RoBERTa-large. I trained it for 10 epochs with the recommended parameters (prompt length = 4, learning rate = 1e-2, as suggested in a previous issue).

    However, I can only get roughly 58% accuracy on the RTE dev set.

    I am not sure whether any of the factors below could cause this. I hope the authors can give me some hints. Many thanks!

    1. How many epochs did you train for on RTE?
    2. If I understand correctly, you are tuning both the classification head and the inserted prompts in each layer, right? In this case, would the initialization matter? As a follow-up question, how did you do the initialization?
    3. I notice that you insert the prompts before the [CLS] token. Is there any specific reason for that?
    4. Are you using the vanilla roberta-large checkpoint?

    Question about "deep prompts"


    I've seen issues asking about the past_key_value implementation and I've tried a code snippet to confirm if it's consistent with what's described in the paper. However, it doesn't seem to work - for different inputs, the first few tokens (i.e., the prompts) are not identical. Could you please take a look at the code snippet to see if it's correct and where the problem is?

        import torch
        from transformers import AutoTokenizer, RobertaConfig
        # Assumed import path inside the P-tuning v2 repo
        from model.sequence_classification import RobertaPrefixForSequenceClassification

        config = RobertaConfig.from_pretrained('roberta-base')
        config.pre_seq_len = 2
        config.prefix_projection = False
        model = RobertaPrefixForSequenceClassification.from_pretrained('roberta-base', config=config)
        tokenizer = AutoTokenizer.from_pretrained('roberta-base')
        sentences = ['This is an example sentence', 'Each sentence is converted']
        encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
        with torch.no_grad():
            model_output = model(**encoded_input, output_hidden_states=True)
            layer_hidden_states = model_output['hidden_states'][1:]  # discard the output of embedding layer
            for hidden_state in layer_hidden_states:
                deep_prompts = hidden_state[:, :model.pre_seq_len, :]  # [batch_size, pre_seq_len, hidden_size]
                assert deep_prompts[0].equal(deep_prompts[1])
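
    One possible reading of the failing assertion, offered as a sketch and not as the authors' answer: if the prompts enter each layer only as extra keys and values (the past_key_values mechanism mentioned in the question), they never occupy output positions, so the first positions of hidden_states would be ordinary input tokens that differ between sentences. The single-head attention below uses plain Python lists, and every name and number in it is hypothetical.

```python
import math


def attend(queries, keys, values):
    """Single-head scaled dot-product attention over plain Python lists."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w / total * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs


# Hypothetical learned prefix: 2 key/value vectors shared by every input.
prefix_keys = [[1.0, 0.0], [0.0, 1.0]]
prefix_values = [[0.5, 0.5], [-0.5, 0.5]]


def layer_output(tokens):
    # Tokens act as queries, keys and values; the prefix only extends keys and values.
    return attend(tokens, prefix_keys + tokens, prefix_values + tokens)


sentence_a = [[1.0, 2.0], [0.0, 1.0], [2.0, 0.0]]
sentence_b = [[-1.0, 0.5], [1.0, 1.0], [0.0, -2.0]]
out_a = layer_output(sentence_a)
out_b = layer_output(sentence_b)
```

    In this toy setup each output has one row per input token and none for the prefix, and the leading rows differ between the two sentences even though the prefix is shared.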

    Questions about inference time

    Hi. I have a question.

    With prefix tuning, I think there is some advantage in training time.

    However, I don't expect a big advantage in inference time for the trained model. What do you think?

    Questions about Results on Question Answering, Table 3

    Lester et al. (2021) use T5 as the pre-trained model and use the LM head to generate answers. For models like BERT and RoBERTa, which are explored in this work, we cannot use the LM head to extract context spans as answers, which means a linear QA head is essential. In the PT results of Table 3, is the task-specific linear head fine-tuned together with the prompt embeddings? If so, this implementation differs slightly from the original one. If not, the randomly initialized QA head is not expected to produce meaningful outputs and would hinder PT training, which would make the PT results in Table 3 meaningless.

    Or am I misunderstanding something about the LM head in QA tasks?

