An original implementation of "Noisy Channel Language Model Prompting for Few-Shot Text Classification"

Overview

Channel LM Prompting (and beyond)

This includes an original implementation of "Noisy Channel Language Model Prompting for Few-Shot Text Classification" by Sewon Min, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer (2021).

For any questions about the paper or the code, or to request pretrained checkpoints, please contact the first author (email) or open a GitHub issue.

If you find our code or paper useful, please cite the paper:

@article{min2021noisy,
  title={Noisy Channel Language Model Prompting for Few-Shot Text Classification},
  author={Min, Sewon and Lewis, Mike and Hajishirzi, Hannaneh and Zettlemoyer, Luke},
  journal={arXiv preprint},
  year={2021}
}

This repo also includes implementations of many recent papers studying prompt-based learning. Please make sure to cite the corresponding papers when you use implementations of the methods in this repo.

Content

  1. Installation
  2. Download & Preprocess Data
  3. Demonstration-based methods
  4. Tuning methods

You can run the channel model and the direct model for each of these methods. Please see Section 3 of the paper for more details about these formulations.
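For intuition, below is a minimal, hypothetical sketch of the two formulations with a Hugging Face GPT-2 model: the direct model scores the verbalized label given the input, while the channel model scores the input given the verbalized label (with a uniform label prior). It is illustrative only and does not reproduce main.py; the example text, verbalizers, and the score helper are made up.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def score(prefix, continuation):
    # Sum of log P(continuation tokens | prefix) under the causal LM.
    prefix_ids = tokenizer.encode(prefix)
    cont_ids = tokenizer.encode(continuation)
    input_ids = torch.tensor([prefix_ids + cont_ids])
    with torch.no_grad():
        log_probs = torch.log_softmax(model(input_ids).logits, dim=-1)
    # Logits at position i predict the token at position i+1.
    return sum(log_probs[0, len(prefix_ids) + i - 1, tok].item()
               for i, tok in enumerate(cont_ids))

x = "A three hour cinema master class."                       # hypothetical input
verbalizers = {"positive": " It was great.", "negative": " It was terrible."}

direct = {y: score(x, v) for y, v in verbalizers.items()}          # argmax_y P(y | x)
channel = {y: score(v, " " + x) for y, v in verbalizers.items()}   # argmax_y P(x | y)
print(max(direct, key=direct.get), max(channel, key=channel.get))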

Installation

$ conda create -n lm-prompt python=3.8
$ conda activate lm-prompt
$ conda install pytorch=1.7.1 -c pytorch
$ pip install transformers==4.3.0

Download and Preprocess Data

We use (and modify) the data and the preprocessing script from Gao et al. ACL 2021 (paper, code) and Zhang et al. NeurIPS 2015 (paper, data).

To download the k-shot data (already preprocessed): Download the data (776MB) from this link. Please place data.zip under the same directory as the code and unzip it.

To download the original data and preprocess yourself:

pip install pandas==1.1.5 # for preprocessing script
mkdir data
cd data
wget https://nlp.cs.princeton.edu/projects/lm-bff/datasets.tar
tar xvf datasets.tar
cd ..

Also, download the data from here and place it in data/original.

Then, run python3 generative_k_shot_data.py, and you are done!

Optionally, you can specify arguments such as

  • --k: number of training examples (default is 16).
  • --balance: whether to guarantee label balance in the training data; more precisely, whether k is the total number of training examples or the number per label (default is False).
  • --data_dir: directory for the original data (default is data/original).
  • --output_dir: directory for the preprocessed data (default is data).

To check the data: You can see the list of eleven datasets used in the paper by running ls data/k-shot. Each dataset consists of five different splits based on five different seeds (the test sets are the same).

Demonstration-based methods

This section is for methods that do not update any of the model parameters. For details about the methods, please see Section 4.1 of the paper.

Zero-shot

python main.py \
    --task {task_name} \
    --split {dev|test} \
    --data_dir data \
    --out_dir out \
    --gpt2 gpt2-large \
    --do_zeroshot \
    --method {direct|channel}

This command runs zero-shot inference with GPT2-large using four different templates (verbalizers), as reported in the paper.

  • For "channel", please specify --method channel.
  • For "direct", please specify --method direct.
  • For "direct++", please run the command line without --split first (this will run inference using the N/A input, following Zhao et al. ICML 2021), and then run the command line with --method direct --use_calibration.

Useful notes:

  • Note that, once you run inference, it will save a cache in the out directory, and will re-load the cache file when you run the exact same command line.
  • You can adjust --batch_size if you run into out-of-memory (OOM) issues (default is 32).
  • Please note that GPU parallelization is not implemented for inference.
  • To save a log file, please specify --log_file.
  • To use GPT2 with different sizes, please use --gpt2 {gpt2|gpt2-medium|gpt2-xl}.

Concat-based demonstration

python main.py \
    --task {task_name} \
    --split {dev|test} \
    --data_dir data \
    --out_dir out \
    --gpt2 gpt2-large \
    --do_zeroshot \
    --method {direct|channel} \
    --use_demonstrations \
    --k 16 \
    --seed {13|21|42|87|100}
  • You can modify k and seed to try different numbers of training examples and different seeds for the k-shot data.
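For intuition, here is a minimal, hypothetical sketch of how a concat-based demonstration prompt is assembled: the k training examples, each followed by its verbalized label, are concatenated in front of the test input. The template and verbalizers below are made up; the actual templates are defined in the repo's code.

# Hypothetical k-shot training examples and verbalizers.
train_set = [("A gorgeous film.", "positive"), ("A dull slog.", "negative")]
verbalizers = {"positive": "It was great.", "negative": "It was terrible."}

# Concatenate every demonstration before the test input; the resulting prompt
# is then scored with either the direct or the channel model.
demonstrations = " ".join(f"{x} {verbalizers[y]}" for x, y in train_set)
test_input = "A three hour cinema master class."
prompt = f"{demonstrations} {test_input}"
print(prompt)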

Ensemble-based demonstration

Add --ensemble to the command line for the Concat-based demonstration method.

Tuning methods

This section is for methods that fully finetune the model parameters (standard finetuning), or update a very limited number of parameters (prompt tuning, head tuning and transformation tuning). For details about the methods, please see Section 4.2 of the paper.

Prompt tuning

python main.py \
    --task {task_name} \
    --split {dev|test} \
    --data_dir data \
    --out_dir out \
    --gpt2 gpt2-large \
    --method {direct|channel} \
    --prompt_tune \
    --do_train \
    --batch_size 32 \
    --lr {0.1|0.01|0.001}
  • Please see Appendix B of the paper to see which learning rate we used for each dataset.
  • Once you train the model, you can specify --do_check to load the existing checkpoint without retraining the model.
  • Please note that GPU parallelization is implemented for training, but is not implemented for inference.
  • Note that, by default, we use the checkpoint that is trained for 100 steps.
  • To explore different numbers of prompt tokens, please specify --n_prefix. The default value is 20, following the original prompt tuning paper (Lester et al. 2021); a minimal sketch of the idea follows this list.
  • If you want to explore zero-shot task transfer (Section 6.4 in the paper), you can (1) first train the model on the training data, and (2) run inference by specifying --task {task_name_for_test} --train_task {task_name_for_train} --do_check.
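As a minimal, hypothetical sketch of the idea behind prompt tuning (not the repo's implementation): GPT-2 is frozen and only a small matrix of n_prefix soft-prompt embeddings is trained, prepended to the token embeddings via inputs_embeds. The training text and the SGD settings below are made up; only the n_prefix=20 default comes from the repo.

import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer

n_prefix = 20
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
for p in model.parameters():              # freeze all pretrained weights
    p.requires_grad = False

hidden = model.config.n_embd
soft_prompt = nn.Parameter(torch.randn(n_prefix, hidden) * 0.02)  # only trainable parameters

def loss_with_prompt(text):
    ids = tokenizer(text, return_tensors="pt").input_ids
    tok_embeds = model.get_input_embeddings()(ids)       # (1, T, hidden)
    prompt = soft_prompt.unsqueeze(0)                    # (1, n_prefix, hidden)
    inputs_embeds = torch.cat([prompt, tok_embeds], dim=1)
    # Ignore the prompt positions in the LM loss (-100 = ignored label).
    labels = torch.cat([torch.full((1, n_prefix), -100, dtype=torch.long), ids], dim=1)
    return model(inputs_embeds=inputs_embeds, labels=labels).loss

optimizer = torch.optim.SGD([soft_prompt], lr=0.1)
loss = loss_with_prompt("A three hour cinema master class. It was great.")  # hypothetical example
loss.backward()
optimizer.step()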

Head tuning

Use --head_tune instead of --prompt_tune in the command line for the prompt tuning method. Note that head tuning is only for the direct baseline.

Transformation tuning

Use --transform_tune instead of --prompt_tune in the command line for the prompt tuning method. Note that transformation tuning is only for the direct baseline.

Standard finetuning

To finetune the entire model parameters, as in typical finetuning, please do not specify any of --prompt_tune, --head_tune or --transform_tune.

Results

For all results, please check out Table 3 and Table 4 of the paper.

Comments
  • how to evaluate and test afterwards

    Hi, I see the running commands in the repo, but I find it hard to see how to automatically evaluate the checkpoint selected based on the loss on the test set to report the results. Could you kindly provide one example where you train and then test on the best checkpoint automatically? Thanks.

    opened by jackfeinmann5 10
  • Commands to reproduce the results of Table 4

    Hi, I am trying to reproduce Table 4.

    • Could you tell me whether the results are on the dev or the test set?
    • Did you run each model with multiple learning rates and then report the test results for the model that did best on the dev set? Did you do this for all methods? It seems the learning rate for prompt tuning is tuned over {0.1, 0.01, 0.001}, while for the other methods it is set to 1e-5 (page 13)?
    • Do I need to change any hyper-parameters in the scripts? (The paper says the number of global steps is set to 100, but I could not find a parameter to set it.)
    • It is written that you ran each model 80 times; should I do this with 80 different seeds?
    • You mentioned "Note that, by default, we use the checkpoint that is trained for 100 steps." I was wondering why you do not choose the step that performs best on the validation set.
    • Could you share the commands you used for Table 4? Below are the sample commands I am planning to use; I am not sure whether this is all I need to set:
    // Direct + head tuning
    python main.py --task trec --split dev --data_dir data --out_dir out --gpt2 gpt2 --method direct --do_train --head_tune

    // Direct + transformation tuning
    python main.py --task trec --split dev --data_dir data --out_dir out --gpt2 gpt2 --method direct --do_train --transform_tune

    // Direct + prompt tuning
    python main.py --task trec --split dev --data_dir data --out_dir out --gpt2 gpt2 --method direct --do_train --prompt_tune

    // Channel + prompt tuning
    python main.py --task trec --split dev --data_dir data --out_dir out --gpt2 gpt2 --method channel --do_train --prompt_tune

    Shall I run all of the above commands with multiple learning rates and then report the test results for the model obtaining the best average accuracy? Would this be the same hyper-parameters/setup as you used for Table 4? I would really appreciate knowing whether I am missing any setup for each model.

    thanks

    opened by dorost1234 6
  • loss computation for noisy channel model

    Hi, first of all, thanks for being patient with my questions. I have some questions about how the loss for the noisy channel model is computed:

    1. In prepro_sentence_pair_single in data.py you compute token_type_id, which is 1 for labels. If we were using the direct model, would one still need this masking?
    2. The length of the input during the loss computation is 146 instead of the set input length of 128; could you tell me why the final length differs from what is set?
    3. During data processing, you set inputs = BOS+ids1+ids2+EOS+padding and labels = inputs for the direct model. I am wondering why you do not feed ids1 and let the model predict ids2, and why the input and labels are the same. Also, the Hugging Face code does not seem to add BOS/EOS tokens during data processing; could you kindly tell me why you added them?
    4. I also wonder why you truncate to max_length-16 during data preparation:
    truncated = np.sum([len(inputs)>max_length_per_example-16 for inputs in test_inputs]) 
    

    Thank you.

    opened by dorost1234 2
  • running direct model + prompt

    Hi, I was wondering why the direct model + prompt tuning is not included in the paper. Could I run this method by setting prompt_tune=True and model = direct? Thanks for your help.

    opened by dorost1234 2
  • verbalizers for trec dataset

    Hi, it seems to me that "Location" and "Number" need to be swapped, as label 4 (zero-indexed) corresponds to "number" and label 5 corresponds to "location".

    Here is the link to the dataset labels: https://huggingface.co/datasets/viewer/ if you search trec

    Here are the current verbalizers: ["Description", "Entity", "Expression", "Human", "Location", "Number"]

    thanks

    opened by dorost1234 1
  • initialization from pretrained model's vocabulary

    Hi, could you tell me how you computed the top 5000 words for initialization? In the paper you refer to the work of Lester et al.; I checked it, and apparently they computed the probability that each token appeared during pretraining and took the most common ones. I am not sure how to compute this probability. Thanks a lot.

    opened by dorost1234 1
  • datasets are missing

    Hi, thanks for the code. I am trying to download the data and subsample it myself for larger k. Some of the datasets, such as SST-2, SST-5, MR, CR, TREC, and Subj, are missing from the ones you have uploaded; could you kindly let me know where I can download these? I would appreciate a download link to the version of these datasets used in this work, if feasible. Thanks.

    opened by jackfeinmann5 1
  • closest code in transformers to noisy channel model

    Hi, I am trying to build on your code using the Hugging Face library examples. Could you please tell me which example in the HF repo is closest to the one your work is built on top of? Thank you.

    opened by jackfeinmann5 1