
Overview

P-tuning

A novel method to tune language models. Code and datasets for the paper "GPT Understands, Too".

How to use our code

We have released the code and datasets for the LAMA and few-shot SuperGLUE (32-dev) experiments. Please check README.md and requirement.txt in the corresponding subdirectories for details.

The LAMA and FewGLUE_32dev datasets are available. Place the LAMA dataset in the ./data directory and the SuperGLUE dataset in the project root (./) directory.
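
A quick way to sanity-check the layout after downloading is a short script like the one below; the folder names are assumptions based on the description above, so adjust them to match your copy.

    # Sanity-check the dataset layout described above.
    # NOTE: the folder names below are assumptions; adjust to your local copy.
    import os

    expected = ["./data", "./FewGLUE_32dev"]
    for path in expected:
        status = "found" if os.path.isdir(path) else "missing"
        print(f"{path}: {status}")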

Citation

If you find our work useful, please cite the following paper:

@article{liu2021gpt,
  title={GPT Understands, Too}, 
  author={Xiao Liu and Yanan Zheng and Zhengxiao Du and Ming Ding and Yujie Qian and Zhilin Yang and Jie Tang},
  year={2021},
  journal={arXiv preprint arXiv:2103.10385},
  url={https://arxiv.org/abs/2103.10385}
}
Comments
  • Some questions about the paper

    Some questions about the paper

    Hi, I had the pleasure of reading your paper "GPT Understands, Too", and it is indeed very good. While reading it, two main questions came up; I would appreciate your guidance.

    1. How large is the performance gap between optimizing the prompt directly as embeddings and generating it with the LSTM used in the paper? The paper does not seem to compare the two.

    2. Could you briefly list the templates used for each SuperGLUE task? I only see (3, sub, 3, obj, 3) and (3, sub, 3, obj) given for LAMA; the other tasks are not listed. (An illustrative sketch of this notation follows below.)
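
    For context, here is a minimal sketch of how the (3, sub, 3, obj, 3) notation mentioned in question 2 might expand into a token sequence; the token names and the helper function are illustrative, not the repository's code.

    # Illustrative expansion of the (3, sub, 3, obj, 3) notation: numbers count
    # pseudo prompt tokens placed around the subject and the [MASK] slot for the
    # object. Placeholder names only; not the repository's implementation.
    def expand_template(template=(3, "sub", 3, "obj", 3), subject="Britain"):
        seq = []
        for part in template:
            if part == "sub":
                seq.append(subject)
            elif part == "obj":
                seq.append("[MASK]")
            else:
                seq.extend(["[PROMPT]"] * part)
        return seq

    print(expand_template())
    # ['[PROMPT]', '[PROMPT]', '[PROMPT]', 'Britain', '[PROMPT]', '[PROMPT]',
    #  '[PROMPT]', '[MASK]', '[PROMPT]', '[PROMPT]', '[PROMPT]']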

    opened by bojone 6
  • BERT output is a tuple (in LAMA)

    BERT output is a tuple (in LAMA)

    Hi, thanks for the great codebase!

    In the bert_out method of the PTuneForLAMA class:

    # LAMA/p_tuning/modeling.py (around line 124)
    def bert_out():
        label_mask = (queries == self.tokenizer.mask_token_id).nonzero().reshape(bz, -1)[:, 1].unsqueeze(
            1).to(self.device)  # bz * 1
        labels = torch.empty_like(queries).fill_(-100).long().to(self.device)  # bz * seq_len
        labels = labels.scatter_(1, label_mask, label_ids)
        output = self.model(inputs_embeds=inputs_embeds.to(self.device),
                            attention_mask=attention_mask.to(self.device).bool(),
                            labels=labels.to(self.device))
        loss, logits = output.loss, output.logits
    

    The output object has no loss or logits attributes, since it is a tuple.

    I think it should be changed as below:

    def bert_out():
        label_mask = (queries == self.tokenizer.mask_token_id).nonzero().reshape(bz, -1)[:, 1].unsqueeze(
            1).to(self.device)  # bz * 1
        labels = torch.empty_like(queries).fill_(-100).long().to(self.device)  # bz * seq_len
        labels = labels.scatter_(1, label_mask, label_ids)
        loss, logits = self.model(inputs_embeds=inputs_embeds.to(self.device),
                            attention_mask=attention_mask.to(self.device).bool(),
                            labels=labels.to(self.device))
    

    I checked that this code works fine on my machine. Thank you again.


    07.08 update: gpt_out() also has the same issue.

    loss, logits, _ = self.model(inputs_embeds=inputs_embeds.to(self.device).half(),
                        attention_mask=attention_mask.to(self.device).half(),
                        labels=labels.to(self.device))
    

    With a newer version of Hugging Face Transformers, this can also be solved by setting the return_dict option to True.
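
    For example, a minimal sketch reusing the names from the snippets above (return_dict is supported in recent transformers releases; treat this as illustrative rather than a patch against the exact file):

    # Same call as above, but requesting a model-output object instead of a tuple.
    output = self.model(inputs_embeds=inputs_embeds.to(self.device),
                        attention_mask=attention_mask.to(self.device).bool(),
                        labels=labels.to(self.device),
                        return_dict=True)
    loss, logits = output.loss, output.logits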

    opened by olenmg 3
  • A problem about the prompt

    A problem about the prompt

    Is the input to the bidirectional model randomly initialized during p-tuning, or is it the embedding of the template? The pseudo-prompts in Figure 2(b) seem to indicate that the model uses the template embedding as input. I'm a little confused by this description.

    opened by logoutAgain 3
  • questions about discreteness in optimization

    questions about discreteness in optimization

    Hi! Thank you for the interesting paper. While reading it, I came across something I don't understand and would like to ask about.

    In the Optimization part of Section 3.2 of the paper:

    If h is initialized with random distribution and then optimized with stochastic gradient descent (SGD), 
    which has been proved to only change the parameters in a small neighborhood (AllenZhu et al., 2019), 
    the optimizer would easily fall into local minima.
    

    Is the problem the discreteness of the word embedding $e$ during optimization? Could you explain this in more detail?

    The second question is why the proposed prompt encoder encourages discreteness; it may be connected to the first question.
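
    For reference, the prompt encoder proposed in the paper passes the pseudo-token embeddings through a bidirectional LSTM followed by a two-layer MLP, so the prompt embeddings are optimized jointly rather than independently. A minimal sketch (layer sizes are illustrative and this is not the repository's exact implementation):

    # Minimal sketch of a P-tuning-style prompt encoder: a bidirectional LSTM
    # plus a two-layer MLP over the pseudo prompt embeddings. Sizes are
    # illustrative; not the repository's exact code.
    import torch
    import torch.nn as nn

    class PromptEncoder(nn.Module):
        def __init__(self, num_prompt_tokens=6, hidden_size=768):
            super().__init__()
            self.embedding = nn.Embedding(num_prompt_tokens, hidden_size)
            self.lstm = nn.LSTM(hidden_size, hidden_size // 2, num_layers=2,
                                bidirectional=True, batch_first=True)
            self.mlp = nn.Sequential(nn.Linear(hidden_size, hidden_size),
                                     nn.ReLU(),
                                     nn.Linear(hidden_size, hidden_size))

        def forward(self):
            ids = torch.arange(self.embedding.num_embeddings)
            x = self.embedding(ids).unsqueeze(0)   # 1 x num_tokens x hidden
            h, _ = self.lstm(x)                    # ties prompt tokens together
            return self.mlp(h).squeeze(0)          # num_tokens x hidden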

    Thank you.

    opened by skygl 2
  • Inconsistent SuperGLUE Results from P-Tuning and P-TuningV2 Paper

    Inconsistent SuperGLUE Results from P-Tuning and P-TuningV2 Paper

    Hi, I find that most of the SuperGLUE metrics of PT reported in the P-Tuning paper are superior to those of fine-tuning, but the metrics of PT reported in the P-Tuning v2 paper are much worse than fine-tuning. For example, on the BoolQ task, the P-Tuning paper reports an accuracy of 72.9 for fine-tuning and 73.9 for PT, while the P-Tuning v2 paper reports 77.7 for fine-tuning and 67.2 for PT.

    It seems that PT in the P-Tuning v2 paper is much worse than fine-tuning, which is the opposite of the conclusion drawn in the P-Tuning paper.

    opened by theoqian 2
  • Vocabulary issue

    Vocabulary issue

    I'd like to ask what to do when a sample's target label is not in the given vocabulary. Looking at the source code, the LAMA experiments filter such samples with if token_wrapper(args, d['obj_label']) not in vocab: continue (approach 1), while the few-shot experiments map labels to verbalizers, e.g. "contradiction": ["No"], "entailment": ["Yes"], "neutral": ["Maybe"] (approach 2). Is there a better way to handle this? For example, when the total number of class labels is very large, approach 2 is not usable, but I also don't want to discard those samples, so I'd rather not use approach 1. Thanks!
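
    For reference, a minimal sketch of the two approaches described above; the verbalizer map is taken from the question, while the helper functions and their names are illustrative.

    # Approach 1: drop samples whose label token is missing from the vocabulary.
    # Approach 2: map each class label to a small set of in-vocabulary verbalizers.
    # Helper names are illustrative, not the repository's code.
    VERBALIZERS = {"contradiction": ["No"], "entailment": ["Yes"], "neutral": ["Maybe"]}

    def keep_sample(label_token: str, vocab: set) -> bool:
        return label_token in vocab        # approach 1: filter the sample otherwise

    def verbalize(label: str) -> str:
        return VERBALIZERS[label][0]       # approach 2: predict the verbalizer token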

    opened by beiweixiaoxu 2
  • Typo in code (causes the prompt not to use warmup)

    Typo in code (causes the prompt not to use warmup)

    In the following code https://github.com/THUDM/P-tuning/blob/368ab8561bab04b44010744a365124efaed6bf16/PT-Fewshot/pet/wrapper.py#L316 I presume the right optimizer should be embedding_optimizer instead of optimizer. I am curious whether this is the reason why the embedding alone did not work. A sketch of the suspected fix follows below.
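
    A minimal sketch of the suspected fix, assuming a PET-style wrapper that builds a separate optimizer and warmup scheduler for the prompt embedding; the placeholder modules and hyper-parameters are illustrative, not the actual file.

    # The warmup scheduler for the prompt embedding should be built from
    # embedding_optimizer, not the model's optimizer. Placeholder values below.
    import torch.nn as nn
    from torch.optim import AdamW
    from transformers import get_linear_schedule_with_warmup

    prompt_embedding = nn.Embedding(6, 768)   # stands in for the continuous prompt
    embedding_optimizer = AdamW(prompt_embedding.parameters(), lr=1e-5)

    embedding_scheduler = get_linear_schedule_with_warmup(
        embedding_optimizer,                  # was: optimizer (the model's)
        num_warmup_steps=10,
        num_training_steps=100,
    )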

    opened by yhcc 2
  • Few-shot experiments: switching the encoder to bert-base-cased performs much worse

    Few-shot experiments: switching the encoder to bert-base-cased performs much worse

    Hi, thank you very much for open-sourcing the code. While reproducing the results, I ran into the following two questions:

    1. In the few-shot experiments, changing the encoder from albert-xxlarge-v2 to bert-base-cased while keeping everything else the same causes a large drop in performance (accuracy on WiC and RTE is only around 50%). Is this solely due to the encoder's capacity, or are there other important hyper-parameters that need tuning?
    2. When reproducing the paper's results with the released code, I found a large gap on the CB dataset, as shown in the attached image (left: my results, right: the paper's results). What might be the reason?
    opened by Life-0-1 2
  • Is LAMA p-tuning seeking a global prompt representation across relation?

    Is LAMA p-tuning seeking a global prompt representation across relation?

    Hi, I read the paper and found it very interesting! When you apply p-tuning to the LAMA dataset, is the encoder for the pseudo tokens shared across relation types, or do you train a separate embedding for each relation type? After looking over the code, I believe the encoder trained via p-tuning is global and not conditioned on relation types, but I just want to double check.

    Presuming the encoder is globally shared across relation types, does p-tuning really need a training dataset as large as AutoPrompt's? AutoPrompt optimized the prompt per relation (which can be confirmed with their official release of generated prompts), and that is one of the reasons why they need so much training data for each relation type (1,000 data points per relation).

    Meanwhile, p-tuning seems to be more efficient because of the continuous optimization and the shared encoder, so I feel it could establish a very strong baseline even in a few-shot setting. Have you done any ablation studies to see whether it works with limited data? I'm very curious about the result if you have it.

    Thank you for sharing the code and your work!

    opened by asahi417 2
  • Fully-supervised SuperGLUE

    Fully-supervised SuperGLUE

    Nice work. May I ask whether you plan to release the code to reproduce P-tuning on the fully-supervised SuperGLUE tasks? There seems to be no relevant code yet.

    opened by zjujh1995 2
  • Few-shot NLU: learning rate for model parameters vs. embedding parameters

    Few-shot NLU: learning rate for model parameters vs. embedding parameters

    Hi!

    Thanks for the interesting paper and releasing this nice codebase! I had a quick question with respect to the learning rate used for the fewshot NLU experiments. The paper mentions (Section 4.2) that:

    We perform grid search of hyper-parameters and take the best combination on Ddev or Ddev32. Specifically, we take learning rates from 1e-5, 2e-5, 3e-5 and batch sizes from 16, 32

    However, it seems that the model is updated with a fixed learning rate of 1e-5 in the code (https://github.com/THUDM/P-tuning/blob/main/PT-Fewshot/pet/wrapper.py#L312), and the learning rate taken from the CLI is only used for the embedding parameters.

    Given that the paper and code seem to differ in this regard, I'm not sure whether this is a bug in the code (i.e., the model and the embedding parameters should always use the LR taken from the CLI) or whether the paper omits this detail (i.e., in reality, the LR grid search is only done on the embedding parameters, and 1e-5 is always used for the model). Could you clarify which approach was taken in your experiments?
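
    For context, one common way to give the backbone and the prompt embedding different learning rates is via optimizer parameter groups (the repository, as discussed above, appears to use two separate optimizers instead); the modules and values below are placeholders, not the wrapper code.

    # Placeholder modules standing in for the pretrained model and the prompt.
    import torch.nn as nn
    from torch.optim import AdamW

    backbone = nn.Linear(768, 768)            # stands in for the pretrained model
    prompt_embedding = nn.Embedding(6, 768)   # stands in for the continuous prompt

    optimizer = AdamW([
        {"params": backbone.parameters(), "lr": 1e-5},          # fixed backbone LR
        {"params": prompt_embedding.parameters(), "lr": 3e-5},  # LR from the grid search
    ])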

    Thanks again!

    opened by nelson-liu 2
  • Doesn't anyone see a problem with the prompt construction code for BERT-style transformers?

    Doesn't anyone see a problem with the prompt construction code for BERT-style transformers?

    if 'gpt' not in self.args.model_name and 'megatron' not in self.args.model_name:
        # BERT-style model (sentence starts with [CLS] and ends with [SEP])
        return [[self.tokenizer.cls_token_id]  # [CLS]
                + prompt_tokens * self.template[0]
                + [self.tokenizer.mask_token_id]  # head entity
                + prompt_tokens * self.template[1]
                + self.tokenizer.convert_tokens_to_ids(self.tokenizer.tokenize(' ' + x_h))  # [MASK] (tail entity)
                + (prompt_tokens * self.template[2] if self.template[
                                                           2] > 0 else self.tokenizer.convert_tokens_to_ids(['.']))
                + [self.tokenizer.sep_token_id]
                ]
    

    I'm not sure whether my understanding is correct. According to the paper, are the mask_token_id (commented as the head entity) and the token ids of x_h (commented as "[MASK] (tail entity)") swapped here? If so, the BERT-style fine-tuning would actually be using the head entity to predict the head entity, rather than using the head entity to predict the tail entity? @Life-0-1 @#12 @#15 @Xiao9905

    So for those whose attempts to reproduce the results with BERT did not work, could this be the reason?

    opened by lovekittynine 2
  • Can we add our customized classification task (such as other GLUE tasks) into this package?

    Can we add our customized classification task (such as other GLUE tasks) into this package?

    Hi P-tuning authors,

    I would like to ask: if I want to evaluate P-tuning on some new data not used in your paper, which part of your code should I modify? Many thanks, and looking forward to your answer!

    Best

    opened by CSerxy 0
  • Code for the GPT implementation

    Code for the GPT implementation

    Hi, I read the paper. Figure 1 says that GPTs can be better than similar-sized BERTs on NLU with P-tuning, but there seems to be no code for the GPT implementation. Could you share that part? How can I reproduce that result?

    opened by Deonmi 0
  • Some questions about P-tuning

    Some questions about P-tuning

    1. Hi, I'd like to ask: in P-tuning, how is the position of [MASK] among the [unused] tokens determined? Is it chosen manually? If not, how is it decided? 2. The paper says that when the data is limited, anchor words are used; for example, when predicting "the capital of Britain", adding a [capital] token among the [unused] tokens works better. How is the position of this [capital] token determined?

    opened by Ming-Qin-tech 4
  • gpt2-medium LAMA

    gpt2-medium LAMA

    Hi, I have just used the default parameters to p-tune gpt2-medium on the LAMA task, and the results are as follows: best dev_hit@1: 51.8, best test_hit@1: 44.5. I have some questions about these results. (1) There seems to be a gap between the dev results and the test results. Are the dev set and the test set drawn from the same distribution? Would it be possible to provide the scripts that generate the train/dev/test splits, along with the original dataset? (2) The result reported in the paper is 46.5, which is close to the best test_hit@1. Are the results in the paper based on the test set? It would be very nice if shell scripts were provided to reproduce the results in the paper.

    opened by casually-PYlearner 1