The source code for Generating Training Data with Language Models: Towards Zero-Shot Language Understanding.

Yu Meng

Last update: Dec 12, 2022

Related tags

Deep Learning natural-language-processing text-classification text-generation natural-language-understanding zero-shot-learning pretrained-language-model

Overview

SuperGen

The source code for Generating Training Data with Language Models: Towards Zero-Shot Language Understanding.

Requirements

Before running, you need to first install the required packages by typing following commands (Using a virtual environment is recommended):

pip3 install -r requirements.txt

Overview

SuperGen is a Supervision Generation method for zero-shot learning on NLU tasks. Instead of training on task-specific data, SuperGen generates training data guided by label-descriptive prompts with a unidirectional language model and fine-tunes another language model on the generated data.

Training and Test Data: Our method does not use any task-specific data (e.g., original training set). We provide our generated training set and original dev set (used as the test set) of each GLUE task under the data directory: train.json files are the generated training set (after data selection); test.tsv files are the original GLUE dev set (used as the test set for evaluation purpose).
Pretraining Corpus: We provide the processed pretraining corpus (Wikipedia and OpenWebText) for generating training data for sequence-pair tasks under the pretrain_corpus directory; see the README file there for details.

Generating Training Data

The generated training set used in the paper are provided as train.json files under each task directory; you should be able to obtain very similar generated data by following the steps below:

Data Generation: The entry script for generating training data for GLUE tasks is gen_train_data.py. The basic usage is

python gen_train_data.py --task $TASK --label $LABEL --save_dir $SAVE_DIR --num_gen $NUM_GEN

You can generate training data of each label either by setting individual label name $LABEL one at a time or by setting $LABEL=all to generate data for all labels (this will still be done sequentially). You may want to set $NUM_GEN to be larger than the desired training set size, as only those texts with the highest generated probability will be used to form the final training set.

Data Selection: After generating the training data, the final training set can be constructed by running the following:

python src/gen_utils.py --task $TASK --num_select_samples $NUM_SELECT \
                        --read_dir $SAVE_DIR --save_dir $DATA_DIR

Example: We provide an example script run_gen.sh that includes the entire generation process for all GLUE tasks under the setting described in the paper.

Fine-Tuning

The entry script for fine-tuning on generated data is finetune.py. The basic usage is

python finetune.py \
    --task_name $TASK \
    --data_dir data/$TASK \
    --overwrite_output_dir \
    --do_train \
    --do_predict \
    --smooth $SM \
    --momentum $MOMENT \
    --eval_steps $INTERVAL \
    --threshold $TH \
    --reg_weight $REG \
    --temp_ensemble_rampup $RAMP \
    --model_name_or_path $MODEL \
    --max_seq_length 128 \
    --first_sent_limit 100 \
    --per_device_train_batch_size $BS \
    --learning_rate $LR \
    --num_train_epochs 3 \
    --output_dir $OUT_DIR \
    --template $TEMPLATE \
    --mapping $MAPPING \
    --warmup_ratio 0.1 \
    --save_at_last \

Example: We provide an example script run_finetune.sh with command line arguments set up for all GLUE tasks under the setting described in the paper.

Results: When using the same prompt-based fine-tuning pipeline (with the same manual prompts and label words), zero-shot SuperGen even achieves better performance than few-shot LM-BFF using 32 annotated samples per class across seven GLUE classification tasks:

Method	MNLI-m/mm	QQP	QNLI	SST-2	CoLA	RTE	MRPC	AVG
LM-BFF 32-Sample Few-Shot	68.3/70.5	65.5	64.5	92.7	9.3	69.1	74.5	63.6
SuperGen Zero-Shot	72.3/73.8	66.1	73.3	92.8	32.7	65.3	82.2	69.4

Acknowledgement

Some scripts in this repository are adapted from COCO-LM (for COCO-LM model), LM-BFF (for prompt-based fine-tuning) and huggingface transformers (for text generation and GLUE processor/trainer).

Citations

Please cite the following paper if you find the code helpful for your research.

@article{meng2022generating,
  title={Generating Training Data with Language Models: Towards Zero-Shot Language Understanding},
  author={Meng, Yu and Huang, Jiaxin and Zhang, Yu and Han, Jiawei},
  journal={arXiv preprint arXiv:2202.04538},
  year={2022}
}

You might also like...

The code of Zero-shot learning for low-light image enhancement based on dual iteration

Comments

Could you please offer a script for gpt2 to generate the training data?

I made a gpt2 script to generate the training data following your paper, but it did badly in qnli task, different to CTRL, the filtering rule is not suitable for gpt2, but I can't handle the rule myself. Thank you.

opened by Luobots 1

The source code for Generating Training Data with Language Models: Towards Zero-Shot Language Understanding.

Related tags

Overview

SuperGen

Requirements

Overview

Generating Training Data

Fine-Tuning

Acknowledgement

Citations

You might also like...

The code of Zero-shot learning for low-light image enhancement based on dual iteration

Towards Part-Based Understanding of RGB-D Scans

Towards Long-Form Video Understanding

[CVPR2022] Bridge-Prompt: Towards Ordinal Action Understanding in Instructional Videos

SC-GlowTTS: an Efficient Zero-Shot Multi-Speaker Text-To-Speech Model

Zero-shot Synthesis with Group-Supervised Learning (ICLR 2021 paper)

Codes for ACL-IJCNLP 2021 Paper "Zero-shot Fact Verification by Claim Generation"

Official Pytorch Implementation of: "Semantic Diversity Learning for Zero-Shot Multi-label Classification"(2021) paper

Shared Attention for Multi-label Zero-shot Learning

Comments

Could you please offer a script for gpt2 to generate the training data?

Owner

Yu Meng

The code for our paper "NSP-BERT: A Prompt-based Zero-Shot Learner Through an Original Pre-training Task —— Next Sentence Prediction"

EMNLP 2021 Adapting Language Models for Zero-shot Learning by Meta-tuning on Dataset and Prompt Collections

The source code for the Cutoff data augmentation approach proposed in this paper: "A Simple but Tough-to-Beat Data Augmentation Approach for Natural Language Understanding and Generation".

PyTorch code for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

Source code and dataset for ACL2021 paper: "ERICA: Improving Entity and Relation Understanding for Pre-trained Language Models via Contrastive Learning".

[CVPR 2021] Released code for Counterfactual Zero-Shot and Open-Set Visual Recognition

code for CVPR paper Zero-shot Instance Segmentation

Original code for "Zero-Shot Domain Adaptation with a Physics Prior"

Code repo for EMNLP21 paper "Zero-Shot Information Extraction as a Unified Text-to-Triple Translation"

Code for the AAAI 2022 paper "Zero-Shot Cross-Lingual Machine Reading Comprehension via Inter-Sentence Dependency Graph".