P-Tuning v2: Prompt Tuning Can Be Comparable to Finetuning Universally Across Scales and Tasks

Overview
Comments
  • Results of Multi-task Learning.

    Thank you for providing the source code for the nice work. I have some questions regarding Multi-task Learning (MPT-2) in Table 4. In the paper, the authors mention that

    For the multi-task setting, we combine the training set of the three datasets for pre-training. We use different linear classifiers for each dataset while sharing the continuous prompts.

    1. What do you mean by pre-training here? Do you first pre-train on all datasets with labels and then continue training on a specific dataset?
    2. I tried to find the pre-training code for multi-task learning in the repo but could not. Would it be possible to make it public?

    Many thanks for the clarification!
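
    Not the authors' implementation, just a minimal sketch of how the quoted setup could look: a single trainable prefix shared by all datasets, with a separate linear classifier per dataset (the class and argument names below are made up for illustration). Whether a per-dataset stage follows the combined pre-training is exactly what question 1 asks.

        import torch.nn as nn

        class MultiTaskPrefixHeads(nn.Module):
            """Toy sketch: one shared continuous prompt, one linear classifier per dataset."""

            def __init__(self, hidden_size, num_layers, pre_seq_len, num_labels_per_task):
                super().__init__()
                # Shared prefix parameters (keys and values for every layer), trained on all tasks.
                self.prefix = nn.Embedding(pre_seq_len, num_layers * 2 * hidden_size)
                # A separate linear head for each dataset.
                self.classifiers = nn.ModuleList(
                    [nn.Linear(hidden_size, n) for n in num_labels_per_task]
                )

            def forward(self, pooled_output, task_id):
                # pooled_output: [batch_size, hidden_size] from the frozen backbone, which
                # consumes the shared prefix via past_key_values; route to this task's head.
                return self.classifiers[task_id](pooled_output)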

    opened by LeeShiyang 7
  • Questions about deep prompt per layer

    Hi, I have a question about deep prompts. I understand that deep prompts are implemented through past_key_values in the model. How can I see the actual prompt weights per layer? The shape of the prompt parameter is (prefix_len, config.num_hidden_layers * 2 * config.hidden_size) without the prefix projection, while the shape of past_key_values fed to each layer is [2, batch_size, n_head, prefix_len, n_embd]. I believe the first '2' corresponds to the key and value of the attention mechanism. What I want is a [prefix_len, config.hidden_size] vector per layer, just like the embedding vector in prompt tuning v1.

    Do you have any idea for this?

    Thanks : )
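
    One possible way to inspect this, shown as a self-contained sketch below: without the prefix projection the prompt parameter is just an embedding table, and its flat last dimension can be viewed as (layer, key/value, hidden), assuming the reshape done in the repo's get_prompt and the key-before-value ordering of past_key_values. On a trained model you would read the trained prefix encoder's embedding weight (e.g. model.prefix_encoder.embedding.weight; the exact attribute path may differ) instead of the random stand-in. Note these are learned key/value prompts injected through past_key_values, so there is no single input-embedding-style vector per layer as in v1.

        import torch.nn as nn

        # Stand-in for the prefix encoder without the projection:
        # an embedding of shape [pre_seq_len, num_layers * 2 * hidden_size].
        pre_seq_len, num_layers, hidden_size = 4, 12, 768
        prefix = nn.Embedding(pre_seq_len, num_layers * 2 * hidden_size)

        # Reinterpret the flat dimension as (layer, key/value, hidden), matching the
        # view(...) performed before the tensor is split into per-layer past_key_values.
        per_layer = prefix.weight.view(pre_seq_len, num_layers, 2, hidden_size)

        key_prompt_layer0 = per_layer[:, 0, 0, :]    # [pre_seq_len, hidden_size], key prompt of layer 0
        value_prompt_layer0 = per_layer[:, 0, 1, :]  # [pre_seq_len, hidden_size], value prompt of layer 0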

    opened by eunjiinkim 6
  • DeBERTa P-Tuning v2 speed

    I've observed that when training DeBERTaV2 with P-Tuning v2, evaluation takes significantly more time than with other methods. Have you observed such behaviour?

    It even takes significantly more time than P-Tuning v1, despite the fact that v1 has higher attention complexity to evaluate.

    It seems like the issue is the ad-hoc implementation of past_key_values for DeBERTa, which is the only difference in the backbone model code between v1 and v2, but I can't figure out the specific reason.

    opened by kefirski 6
  • sequence_classification.py

        # pooled_output = outputs[1]
        sequence_output = outputs[0]
        # drop the prefix positions, then pool on the first remaining token
        sequence_output = sequence_output[:, self.pre_seq_len:, :].contiguous()
        first_token_tensor = sequence_output[:, 0]
        pooled_output = self.bert.pooler.dense(first_token_tensor)
        pooled_output = self.bert.pooler.activation(pooled_output)

    Could the authors explain the purpose of this modification? In our tests the original method (using pooled_output = outputs[1] directly) also worked well.

    opened by zhaogangthu 6
  • P-Tuning v2 works as expected on NER but does not converge on classification tasks

    Hello, following the method in the paper we reproduced P-Tuning v2 on the People's Daily NER dataset, and our results are largely consistent with the paper. However, on two classification datasets, RTE and the Ant Financial text-similarity dataset, training does not converge. We tried reparameterizing the prefix embeddings with an MLP and with an LSTM, with little effect. Did the authors run into anything like this on classification datasets?

    Taking the Ant Financial dataset as an example, the loss does not decrease from the very start, and on the validation set the model ends up predicting the majority class for every example. We also compared the gradients between NER and classification and found no obvious difference.

    On the model side we tried RoBERTa-large, BERT-large, and BERT-base.

    opened by sxthunder 6
  • A question about the process of P-Tuning v2.

    Thank you for your well-organized code. Since my background is in computer vision, I am not very familiar with NLP, but I'm very interested in P-Tuning.

    1. In P-Tuning v2, do I need to include the prompts when training the pre-trained model, or only when I train on a downstream task?
    2. For the initialization of a prompt:
      prompts = torch.arange()
      prompts = torch.nn.Embedding(prompts)
      Is it normal to use the above initialization method? (See the sketch after this list.)
    3. For training on a downstream task, I need to freeze all of the model's parameters but not the prompts' parameters, right?
    4. For deep prompts, do I need to reinitialize the prompts at each layer (with the same initialization as in 2)? I'm sorry for asking so many questions. Looking forward to your reply.
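
    Regarding question 2, a minimal self-contained sketch of the usual pattern: a trainable prompt embedding indexed by torch.arange, with the pre-trained backbone frozen. The sizes and model name below are placeholders, not the repo's configuration.

        import torch
        import torch.nn as nn
        from transformers import AutoModel  # any Hugging Face encoder as the frozen backbone

        pre_seq_len, hidden_size = 20, 768              # illustrative sizes only

        # Prompts are a trainable embedding table indexed by fixed positions.
        prefix_tokens = torch.arange(pre_seq_len)       # tensor([0, 1, ..., pre_seq_len - 1])
        prompt_embedding = nn.Embedding(pre_seq_len, hidden_size)
        prompts = prompt_embedding(prefix_tokens)       # [pre_seq_len, hidden_size]

        # The backbone stays frozen; only the prompt parameters (and a task head) are trained.
        backbone = AutoModel.from_pretrained("bert-base-uncased")
        for param in backbone.parameters():
            param.requires_grad = False
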
    opened by QWTforGithub 5
  • Bug for BertPrompt series code?

    Hi, I notice that the BertPrompt model does not feed the [CLS] representation to the linear classification head. I'll try to explain with the following code and toy inputs: say input_ids has shape [8, 32] and pre_seq_len is 3, so inputs_embeds will be [8, 35, 768]. I'll comment the shapes of the key variables in the code and state my concern.

    class BertPromptForSequenceClassification(BertPreTrainedModel):
        def forward(*args):
            return_dict = return_dict if return_dict is not None else self.config.use_return_dict
    
            batch_size = input_ids.shape[0]
            raw_embedding = self.embeddings(
                input_ids=input_ids, 
                position_ids=position_ids,
                token_type_ids=token_type_ids,
            )
            prompts = self.get_prompt(batch_size=batch_size)
        inputs_embeds = torch.cat((prompts, raw_embedding), dim=1)  # prompts prepended to the token embeddings
            prefix_attention_mask = torch.ones(batch_size, self.pre_seq_len).to(self.bert.device)
            attention_mask = torch.cat((prefix_attention_mask, attention_mask), dim=1)
    
            # inputs_embeds's shape: [8, 35, 768]
    
    
            outputs = self.bert(
                # input_ids,
                attention_mask=attention_mask,
                # token_type_ids=token_type_ids,
                # position_ids=position_ids,
                head_mask=head_mask,
                inputs_embeds=inputs_embeds,
                output_attentions=output_attentions,
                output_hidden_states=output_hidden_states,
                return_dict=return_dict,
                # past_key_values=past_key_values,
            ) 
    # since the BERT pooler takes the first token of the sequence as its input,
    # the token actually fed to the classifier here is the first soft prompt token, not [CLS]!
            
            pooled_output = outputs[1]
    

    I wonder, is this the soft prompt tuning variant that P-Tuning v2 is compared against? If so, the token used by its classification head is not the [CLS] token.

    Is that expected?
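
    For reference, a toy, self-contained sketch (random tensors stand in for the model outputs) of how one could pool on the real [CLS] position instead of the first soft prompt token. It mirrors the modification discussed in the sequence_classification.py issue above, without claiming this is what the authors intend.

        import torch
        import torch.nn as nn

        # Toy stand-ins matching the shapes above.
        batch_size, pre_seq_len, seq_len, hidden = 8, 3, 32, 768
        sequence_output = torch.randn(batch_size, pre_seq_len + seq_len, hidden)  # outputs[0], i.e. [8, 35, 768]
        pooler_dense = nn.Linear(hidden, hidden)             # stands in for self.bert.pooler.dense

        # Drop the prompt positions, then pool on the real [CLS] token.
        real_tokens = sequence_output[:, pre_seq_len:, :]    # [8, 32, 768]
        cls_repr = real_tokens[:, 0]                         # [8, 768]
        pooled_output = torch.tanh(pooler_dense(cls_repr))   # BertPooler applies a tanh activation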

    opened by tangzhy 5
  • With prefix_projection enabled, test accuracy never changes and stays at 62.1

        12/07/2021 19:30:19 - INFO - training.trainer_base - ***** Epoch 12: Best results *****
        12/07/2021 19:30:19 - INFO - training.trainer_base - best_epoch = 0
        12/07/2021 19:30:19 - INFO - training.trainer_base - best_eval_accuracy = 0.6217125382262997
        12/07/2021 19:30:19 - INFO - training.trainer_base - epoch = 12.0
        OrderedDict([('best_epoch', 0), ('best_eval_accuracy', 0.6217125382262997), ('epoch', 13.0)])
        {'loss': 0.7488, 'learning_rate': 0.006054054054054054, 'epoch': 13.51}

    opened by yh351016 5
  • Unable to reproduce the PT-2 results of RTE in Table 1

    I have some questions about reproducing the PT-2 results for RTE in Table 1.

    My base model is RoBERTa-large. I trained the model for 10 epochs with the recommended parameters (prompt length = 4, learning rate = 1e-2, as suggested in a previous issue).

    However, I can only get roughly 58% accuracy on the RTE dev set.

    I am not sure which of the factors below might cause this; I hope the authors can give me some hints, many thanks!

    1. How many training epochs did you use for RTE?
    2. If I understand correctly, you tune both the classification head and the prompts inserted in each layer, right? In that case, does the initialization matter, and how did you do the initialization?
    3. I notice that you insert the prompts before the [CLS] token; is there a specific reason for that?
    4. Are you using the vanilla roberta-large checkpoint?
    opened by CSerxy 5
  • Question about "deep prompts"

    Hi,

    I've seen issues asking about the past_key_value implementation and I've tried a code snippet to confirm if it's consistent with what's described in the paper. However, it doesn't seem to work - for different inputs, the first few tokens (i.e., the prompts) are not identical. Could you please take a look at the code snippet to see if it's correct and where the problem is?

        import torch
        from transformers import AutoTokenizer, RobertaConfig

        # RobertaPrefixForSequenceClassification comes from the repo's model code
        # (e.g. sequence_classification.py).

        config = RobertaConfig.from_pretrained('roberta-base')
        config.pre_seq_len = 2
        config.prefix_projection = False

        model = RobertaPrefixForSequenceClassification.from_pretrained('roberta-base', config=config)
    
        tokenizer = AutoTokenizer.from_pretrained('roberta-base')
    
        sentences = ['This is an example sentence', 'Each sentence is converted']
    
        encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
    
        with torch.no_grad():
            model_output = model(**encoded_input, output_hidden_states=True)
            layer_hidden_states = model_output['hidden_states'][1:]  # discard the output of embedding layer
            for hidden_state in layer_hidden_states:
                deep_prompts = hidden_state[:, :model.pre_seq_len, :]  # [batch_size, pre_seq_len, hidden_size]
                assert deep_prompts[0].equal(deep_prompts[1])
    
    opened by JetRunner 4
  • Questions about inference time

    Questions about inference time

    Hi. I have a question.

    In the case of Prefix Tuning, I think there will be some advantage in training time.

    However, I don't think there will be a big advantage in the inference time of the trained model. What do you think?
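
    For a rough sense of the inference overhead in question, a back-of-the-envelope count of attention scores per layer (illustrative numbers, not a benchmark): prefix tuning adds pre_seq_len extra key/value pairs for every query to attend to, and no parameters are merged back into the backbone.

        seq_len, pre_seq_len = 128, 20
        plain_scores = seq_len * seq_len                    # 16384 query-key products per head
        prefix_scores = seq_len * (seq_len + pre_seq_len)   # 18944, about 16% more
        print(prefix_scores / plain_scores)                 # 1.15625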

    opened by yeontaek 4
  • Questions about Results on Question Answering, Table 3

    Questions about Results on Question Answering, Table 3

    In Lester et al. (2021), they use T5 as the pre-trained model and use the LM head to generate answers. For models like BERT and RoBERTa explored in this work, we cannot use an LM head to extract context spans as the answers, which means a linear QA head is essential. Is the task-specific linear head fine-tuned together with the prompt embeddings in PT, Table 3? If so, this implementation is a little different from the original one. If not, a randomly initialized QA head is not expected to produce meaningful outputs and would hinder PT training, which makes the PT results in Table 3 meaningless.

    Or do I have some misunderstanding about the LM head in QA tasks?
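
    For context, a minimal self-contained sketch of the kind of linear QA head being discussed, i.e. the standard extractive-QA head that maps each token's hidden state to start/end logits (as in Hugging Face's BertForQuestionAnswering). Whether such a head is trained jointly with the prompts in Table 3 is exactly the question above; the tensors here are random stand-ins.

        import torch
        import torch.nn as nn

        batch_size, seq_len, hidden_size = 2, 128, 768
        sequence_output = torch.randn(batch_size, seq_len, hidden_size)  # backbone outputs (stand-in)

        # Linear span-extraction head: 2 logits (start, end) per token.
        qa_head = nn.Linear(hidden_size, 2)
        logits = qa_head(sequence_output)                    # [batch_size, seq_len, 2]
        start_logits, end_logits = logits.split(1, dim=-1)
        start_logits = start_logits.squeeze(-1)              # [batch_size, seq_len]
        end_logits = end_logits.squeeze(-1)                  # [batch_size, seq_len]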

    opened by haichao592 5
Owner
THUDM (Data Mining Research Group at Tsinghua University)