This repository contains the code for "Generating Datasets with Pretrained Language Models".

Datasets from Instructions (DINO 🦕 )

This repository contains the code for Generating Datasets with Pretrained Language Models. The paper introduces a method called Datasets from Instructions (DINO 🦕 ) that enables pretrained language models to generate entire datasets from scratch.

🔧 Setup

All requirements for DINO can be found in requirements.txt. You can install all required packages in a new environment with pip install -r requirements.txt.

💬 CLI Usage

Single Texts

To generate datasets for (single) text classification, you can use DINO as follows:

python3 dino.py \
 --output_dir <OUTPUT_DIR> \
 --task_file <TASK_FILE> \
 --num_entries_per_label <N>

where <OUTPUT_DIR> is a directory to which the generated dataset is written, <TASK_FILE> is a JSON file containing a task specification (see Task Specs), and <N> is the number of examples to generate per label. If dino.py stops with the message "If 'input_file' is not set, 'batch_size' must be set", additionally pass --batch_size. To get an overview of additional parameters, run python3 dino.py --help.

Text Pairs

To generate datasets for text pair classification, you first need a dataset of raw input texts (which you can also generate using DINO). You can then run

python3 dino.py \
 --output_dir <OUTPUT_DIR> \
 --task_file <TASK_FILE> \
 --input_file <INPUT_FILE> \
 --input_file_type <INPUT_FILE_TYPE> \
 --num_entries_per_input_and_label <N>

with <OUTPUT_DIR> and <TASK_FILE> as before. <INPUT_FILE> refers to the file containing raw input texts, and <INPUT_FILE_TYPE> specifies its type, which should be one of

  • plain: for a plain text file with one input text per line
  • jsonl: for a dataset file generated by DINO in a previous step

and <N> is the number of examples to generate per label and input text.
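
The two input file types can be pictured with a small reader. This is only a sketch of the file formats, not DINO's own loading code: the helper name read_inputs is hypothetical, and the assumption that a jsonl record stores its text under the key text_a is not stated in this README.

```python
import json
from pathlib import Path


def read_inputs(path, file_type, text_field="text_a"):
    """Return the raw input texts stored in `path` (hypothetical helper).

    `text_field` is an assumption about where a jsonl record keeps its text.
    """
    # Drop empty lines so that a trailing newline does not create an empty input.
    lines = [
        line.strip()
        for line in Path(path).read_text(encoding="utf-8").splitlines()
        if line.strip()
    ]
    if file_type == "plain":
        # plain: one input text per line
        return lines
    if file_type == "jsonl":
        # jsonl: one JSON object per line
        return [json.loads(line)[text_field] for line in lines]
    raise ValueError(f"unknown input file type: {file_type!r}")
```

A plain file is the easiest way to bring your own texts; the jsonl branch only matters when the inputs themselves come from an earlier DINO run.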

📋 Task Specs

🚨 Before you write custom task specifications, please note that this is still a very early release and we have not yet tested DINO on tasks other than semantic textual similarity. Please let us know if you see something strange. 🚨

To generate a dataset for a task, you need to provide a file with a task specification that contains (among other things) the instructions given to the pretrained language model. A task specification is a single JSON object that looks like this:

{
  "task_name": "<TASK_NAME>",
  "labels": {
    "<LABEL_1>": {
      "instruction": "<INSTRUCTION_1>",
      "counter_labels": [<COUNTER_LABELS_1>]
    },

    ...,

    "<LABEL_n>": {
      "instruction": "<INSTRUCTION_n>",
      "counter_labels": [<COUNTER_LABELS_n>]
    }
  }
}

Here, <TASK_NAME> is the name for the task and <LABEL_1>, ..., <LABEL_n> are the task's labels. For each label <LABEL_i>, <INSTRUCTION_i> is the instruction provided to the language model for generating examples with label <LABEL_i> (see Writing Instructions). You can additionally specify a list of counter labels <COUNTER_LABELS_i> for each label. This tells the model to generate outputs that are not only likely given the current label, but also unlikely given all counter labels (see the paper for details).
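
Before running DINO, it can help to check a specification for structural slips such as a counter label that is not itself a label. The following is a minimal sketch under stated assumptions: validate_task_spec is a hypothetical helper rather than part of DINO, and the example labels and the second instruction are made up for illustration.

```python
import json


def validate_task_spec(spec):
    """Return a list of problems found in a task specification dict."""
    problems = []
    if not spec.get("task_name"):
        problems.append("missing 'task_name'")
    labels = spec.get("labels") or {}
    if not labels:
        problems.append("no labels defined")
    for label, entry in labels.items():
        if not entry.get("instruction"):
            problems.append(f"label '{label}' has no instruction")
        # Every counter label must be one of the labels of the same task.
        for counter in entry.get("counter_labels", []):
            if counter not in labels:
                problems.append(
                    f"counter_label '{counter}' for label '{label}' is not a label"
                )
    return problems


# Illustrative specification; label names are assumptions for this sketch.
spec = {
    "task_name": "restaurant-reviews",
    "labels": {
        "positive": {
            "instruction": 'Task: Write a review for a really great restaurant.\nReview: "',
            "counter_labels": ["negative"],
        },
        "negative": {
            "instruction": 'Task: Write a review for a really bad restaurant.\nReview: "',
            "counter_labels": ["positive"],
        },
    },
}

assert validate_task_spec(spec) == []
print(json.dumps(spec, indent=2))  # text that could be saved as the task file
```

Building the specification as a Python dict and serializing it with json.dumps also takes care of escaping the quotation marks and newlines inside the instructions.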

Examples

You can find two examples of task specifications in /task_specs:

  • sts.json is a task specification for generating a semantic textual similarity dataset if a set of raw input texts is already given.
  • sts-x1.json is a task specification for generating a set of raw input texts. This set can then be used in a subsequent step to generate a full STS dataset using sts.json.

Writing Instructions

When writing instructions for a new task, you should consider the following things:

  • Always end your instructions with an (opening) quotation mark ("). This is required because it allows us to interpret the next quotation mark generated by the language model as a signal that it is done generating an example.
  • For good results, keep the instructions as short and simple as possible as this makes it easier for a pretrained language model to understand them.
  • If you are writing instructions for a text pair classification task, make sure that each instruction contains the placeholder <X1> exactly once. At this position, the provided raw input sentences are inserted during generation.

An example of an instruction that prompts the model to generate a positive review for a restaurant would be:

Task: Write a review for a really great restaurant.
Review: "

An example of an instruction that prompts the model to generate a sentence that has the same meaning as another given sentence would be:

Task: Write two sentences that mean the same thing.
Sentence 1: "<X1>"
Sentence 2: "
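
The rules above can be sketched in a few lines of Python. All three helper names are hypothetical and this is not DINO's implementation; the sketch only illustrates the trailing quotation mark, the single <X1> placeholder, and how the next quotation mark delimits a generated example. The model continuation used here is made up.

```python
PLACEHOLDER = "<X1>"


def check_instruction(instruction, text_pair_task=False):
    """Return a list of rule violations for one instruction."""
    problems = []
    if not instruction.endswith('"'):
        problems.append("instruction must end with an opening quotation mark")
    if text_pair_task and instruction.count(PLACEHOLDER) != 1:
        problems.append("instruction must contain <X1> exactly once")
    return problems


def fill_placeholder(instruction, input_text):
    """Insert a raw input text at the position of the placeholder."""
    return instruction.replace(PLACEHOLDER, input_text)


def cut_at_closing_quote(continuation):
    """Keep only the text before the first quotation mark of a continuation."""
    return continuation.split('"', 1)[0]


instruction = (
    "Task: Write two sentences that mean the same thing.\n"
    'Sentence 1: "<X1>"\n'
    'Sentence 2: "'
)
assert check_instruction(instruction, text_pair_task=True) == []

prompt = fill_placeholder(instruction, "A man is playing a guitar.")
print(prompt)

# Made-up continuation: everything after the closing quotation mark is dropped.
continuation = 'A man plays the guitar." Sentence 3: "'
print(cut_at_closing_quote(continuation))
```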

🦕 Generated DINOs

In this section, we will soon make publicly available a list of datasets that we have generated using DINO.

📕 Citation

If you make use of the code in this repository or of any DINO-based dataset, please cite the following paper:

@article{schick2020generating,
  title={Generating Datasets with Pretrained Language Models},
  author={Timo Schick and Hinrich Schütze},
  journal={Computing Research Repository},
  volume={arXiv:2104.07540},
  url={https://arxiv.org/abs/2104.07540},
  year={2021}
}
Comments
  • Request for public STSb-DINO dataset

    Hi,

    May I ask for the STSb-DINO dataset mentioned in the paper?

    I tried to generate the dataset by running

    python3 dino.py \
    --output_dir output \
    --task_file task_specs/sts-x2.json \
    --input_file raw.in \
    --input_file_type plain \
    --num_entries_per_input_and_label 2 \
    --max_output_length 40 \
    --decay_constant 100 \
    --top_p 0.9 \
    --top_k 5 \
    --remove_duplicates \
    --remove_identical_pairs

    where the task specification is

    {
      "task_name": "sts",
      "labels": {
        "2": {
          "instruction": "Task: Write two sentences that mean the same thing.\nSentence 1: \"<X1>\"\nSentence 2: \"",
          "counter_labels": []
        },
        "1": {
          "instruction": "Task: Write two sentences that are somewhat similar.\nSentence 1: \"<X1>\"\nSentence 2: \"",
          "counter_labels": [ "2" ]
        },
        "0": {
          "instruction": "Task: Write two sentences that are on completely different topics.\nSentence 1: \"<X1>\"\nSentence 2: \"",
          "counter_labels": [ "2", "1" ]
        }
      }
    }

    However, the result on STS-b is only around 60% with SRoberta training.

    May I ask which part went wrong?

    Regards, Yiming

    opened by MatthewCYM 5
  • What instance did you use?

    Hey,

    Thanks for the great work!

    I tried to follow the tutorial from your blog post on Colab, but each time I try, it looks like the process is killed before it even reaches print("Starting dataset generation with DINO...").

    What type of instance (type of TPU/GPU) did you use to run the implementation?

    Thanks,

    Arnault

    opened by agombert 4
  • Self-debiasing yields unexpected results

    Hi Timo, thanks for the cool work you've done around PLMs! ❤️

    I have been trying out the package, albeit with a smaller GPT model and a different language, and the results of using the debiasing feature are quite unexpected. My use case is similar to the IMDB review generation, where we generate sentences of opposite polarity, and I provide a prompt with instructions and examples, as in the GPT-3 paper.

    To use the self-debiasing feature, I indicated that positive sentences should be different from negative sentences. However, the results of the generation are skewed: only one label has good sentences, and the other label has gibberish. My guess is that my examples had similar sentence structure, with just the sentiment words flipped, so the model was trying to avoid generating similar sentences.

    What are the best practices for generating sentences with different polarity, or was I doing something wrong?

    opened by neoyipeng2018 3
  • question about baseline in table 1

    Hi!

    Impressive results, many thanks for the code! I have a Q about Table 1, where it says "supervised SBERT baseline", but in the rest of the paper it only mentions SBERT trained on NLI and not tuning on STS train. I guess supervised means also it's fine-tuned on STS training, or do I err? Thanks, Juri

    opened by flipz357 2
  • A question about the generated data

    Thanks for your great work. I want to generate some semantically similar pairs using your method. I use the task specification

    {
      "task_name": "sts",
      "labels": {
        "1": {
          "instruction": "Task: Write two sentences that mean the same thing.\nSentence 1: \"<X1>\"\nSentence 2: \"",
          "counter_labels": []
        }
      }
    }

    but my result is as follows:

    {"text_a": "a man with a hard hat is dancing", "text_b": "a man with a hard hat is dancing", "label": "1"}
    {"text_a": "a man with a hard hat is dancing", "text_b": "A man with a hard hat is dancing", "label": "1"}
    {"text_a": "a young child is riding a horse", "text_b": "a child is riding a horse", "label": "1"}
    {"text_a": "a young child is riding a horse", "text_b": "A child is riding a horse", "label": "1"}
    {"text_a": "a man is feeding a mouse to a snake", "text_b": "a man is feeding a mouse to a snake.", "label": "1"}
    {"text_a": "a man is feeding a mouse to a snake", "text_b": "A man is feeding a mouse to a snake", "label": "1"}
    {"text_a": "a woman is playing the guitar", "text_b": "a woman is playing the guitar", "label": "1"}
    {"text_a": "a woman is playing the guitar", "text_b": "the man is playing the guitar", "label": "1"}
    {"text_a": "a woman is playing the flute", "text_b": "a woman is playing the flute", "label": "1"}
    {"text_a": "a woman is playing the flute", "text_b": "a woman is playing the flute", "label": "1"}
    {"text_a": "a woman is cutting an onion", "text_b": "an onion is cutting a woman", "label": "1"}
    {"text_a": "a woman is cutting an onion", "text_b": "an onion is cutting a woman", "label": "1"}
    {"text_a": "a man is erasing a chalk board", "text_b": "a man is erasing a chalkboard.", "label": "1"}
    {"text_a": "a man is erasing a chalk board", "text_b": "a man is erasing a chalk board.", "label": "1"}
    {"text_a": "a woman is carrying a boy", "text_b": "a boy is carrying a woman", "label": "1"}
    {"text_a": "a woman is carrying a boy", "text_b": "A woman is carrying a boy", "label": "1"}
    {"text_a": "three men are playing guitars", "text_b": "Three men are playing guitars", "label": "1"}
    {"text_a": "three men are playing guitars", "text_b": "three guitarists are playing guitars", "label": "1"}
    {"text_a": "a woman peels a potato", "text_b": "a woman peels a potato", "label": "1"}
    {"text_a": "a woman peels a potato", "text_b": "a woman peels a potato", "label": "1"}

    Most of the generated texts are the same as the original ones, or only change the character "a" to "A". I think these data are meaningless. Did I do something wrong?

    opened by zxz0928 2
  • AttributeError: 'SelfDebiasingGPT2LMHeadModel' object has no attribute '_init_sequence_length_for_generation'

    Done loading 6 inputs from file 'sts-x1-dataset.jsonl'
    Starting dataset generation with DINO...
    Dataset Entries: 0%| | 0/6 [00:00<?, ?it/s]
    Traceback (most recent call last):
      File "/Users/lihao/pythonProject/pythonProject/dino.py", line 161, in <module>
        outputs = generator.generate_dataset(inputs, num_entries_per_input_and_label=args.num_entries_per_input_and_label,
      File "/Users/lihao/pythonProject/pythonProject/modeling.py", line 98, in generate_dataset
        dataset += self._generate_dataset_entries(input_text_or_id, label=label, num_entries=num_entries_per_input_and_label,
      File "/Users/lihao/pythonProject/pythonProject/modeling.py", line 125, in _generate_dataset_entries
        model_outputs = self.model.generate_self_debiasing(
      File "/Users/lihao/pythonProject/pythonProject/modeling.py", line 280, in generate_self_debiasing
        output_ids = self._model.generate(**inputs, min_length=min_length, max_length=max_length, **kwargs)
      File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
        return func(*args, **kwargs)
      File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/transformers/generation_utils.py", line 1310, in generate
        return self.sample(
      File "/Users/lihao/pythonProject/pythonProject/generation.py", line 170, in sample
        sequence_lengths, unfinished_sequences, cur_len = self._init_sequence_length_for_generation(
      File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1185, in __getattr__
        raise AttributeError("'{}' object has no attribute '{}'".format(
    AttributeError: 'SelfDebiasingGPT2LMHeadModel' object has no attribute '_init_sequence_length_for_generation'

    Hi, when I run the model, I run into this problem. Any idea how to solve it? Thanks!

    opened by HarrywillDr 1
  • Why using a single instruction as input?

    Hi Dr. Timo, thanks for your amazing work and code! I'm wondering why you use a single instruction as input in a batch. I noticed that when generating x2, I have to set the batch size to 2, which is relatively small and results in poor inference efficiency. What's the intuition behind this design?

    opened by jiacheng-ye 1
  • RuntimeError: The size of tensor a (1024) must match the size of tensor b (1370) at non-singleton dimension 3

    dino.py will raise an error whenever the input sequence is longer than the maximum allowed length (1024).

    Full stack trace:

    Starting dataset generation with DINO...
    Traceback (most recent call last):
      File "dino.py", line 162, in <module>
        num_entries_per_label=args.num_entries_per_label, batch_size=args.batch_size)
      File "dev/dino/modeling.py", line 99, in generate_dataset
        generate_with_inputs=generate_with_inputs)
      File "dev/dino/modeling.py", line 127, in _generate_dataset_entries
        do_sample=True, min_length=self.max_output_length, max_length=self.max_output_length, top_k=self.top_k, top_p=self.top_p
      File "dev/dino/modeling.py", line 277, in generate_self_debiasing
        output_ids = self._model.generate(**inputs, min_length=min_length, max_length=max_length, **kwargs)
      File "dev/dino/venv/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
        return func(*args, **kwargs)
      File "dev/dino/venv/lib/python3.7/site-packages/transformers/generation_utils.py", line 924, in generate
        **model_kwargs,
      File "dev/dino/generation.py", line 184, in sample
        output_hidden_states=output_hidden_states,
      File "dev/dino/venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
        result = self.forward(*input, **kwargs)
      File "dev/dino/venv/lib/python3.7/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 901, in forward
        return_dict=return_dict,
      File "dev/dino/venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
        result = self.forward(*input, **kwargs)
      File "dev/dino/venv/lib/python3.7/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 746, in forward
        output_attentions=output_attentions,
      File "dev/dino/venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
        result = self.forward(*input, **kwargs)
      File "dev/dino/venv/lib/python3.7/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 294, in forward
        output_attentions=output_attentions,
      File "dev/dino/venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
        result = self.forward(*input, **kwargs)
      File "dev/dino/venv/lib/python3.7/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 239, in forward
        attn_outputs = self._attn(query, key, value, attention_mask, head_mask, output_attentions)
      File "dev/dino/venv/lib/python3.7/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 174, in _attn
        w = torch.where(mask.bool(), w, self.masked_bias.to(w.dtype))
    RuntimeError: The size of tensor a (1024) must match the size of tensor b (1370) at non-singleton dimension 3

    After a bit of searching, I've found what seems to be a fairly simple fix: editing line 275 in modeling.py to max_length = min(1024, max_length + input_length). I've raised a PR to add the change, if you could review it. Thanks.

    opened by Andrewlaw171 1
  • Requirements conflict

    I tried installing in two different environments and one VM and got the same error:

    ERROR: Cannot install -r requirements.txt (line 11), -r requirements.txt (line 4) and torch==1.5.0 because these package versions have conflicting dependencies.

    The conflict is caused by:
        The user requested torch==1.5.0
        torchvision 0.6.0 depends on torch==1.5.0
        sentence-transformers 0.4.1 depends on torch>=1.6.0

    opened by Sripaad 1
  • Fix sts-x2.json sample task

    When I try to run the sts-x2.json sample task, I get an error:

    AssertionError: Invalid task specification: counter_label '2' for label '0.5' is not a label
    

    I think the labels should be changed to:

    {
      "task_name": "sts",
      "labels": {
        "2": {
          "instruction": "Task: Write two sentences that mean the same thing.\nSentence 1: \"<X1>\"\nSentence 2: \"",
          "counter_labels": []
        },
        "1": {
          "instruction": "Task: Write two sentences that are somewhat similar.\nSentence 1: \"<X1>\"\nSentence 2: \"",
          "counter_labels": [
            "2"
          ]
        },
        "0": {
          "instruction": "Task: Write two sentences that are on completely different topics.\nSentence 1: \"<X1>\"\nSentence 2: \"",
          "counter_labels": [
            "2",
            "1"
          ]
        }
      }
    }
    

    Instead of

    {
      "task_name": "sts",
      "labels": {
        "1": {
          "instruction": "Task: Write two sentences that mean the same thing.\nSentence 1: \"<X1>\"\nSentence 2: \"",
          "counter_labels": []
        },
        "0.5": {
          "instruction": "Task: Write two sentences that are somewhat similar.\nSentence 1: \"<X1>\"\nSentence 2: \"",
          "counter_labels": [
            "2"
          ]
        },
        "0": {
          "instruction": "Task: Write two sentences that are on completely different topics.\nSentence 1: \"<X1>\"\nSentence 2: \"",
          "counter_labels": [
            "2",
            "1"
          ]
        }
      }
    }
    
    opened by AleksanderObuchowski 1
  • Update command for using CLI interface for single text generation in README

    When I tried to run the sample command provided in the README:

    python3 dino.py \
     --output_dir <OUTPUT_DIR> \
     --task_file <TASK_FILE> \
     --num_entries_per_label <N>
    

    I got

      File "dino.py", line 38, in validate_args
        assert args.batch_size is not None, "If 'input_file' is not set, 'batch_size' must be set"
    AssertionError: If 'input_file' is not set, 'batch_size' must be set
    

    I think it would be better to add --batch_size <N> to the sample command or to provide a default value for it.

    opened by AleksanderObuchowski 1
  • Nothing comes out when using GPT-3

    My command:

    python dino.py --output_dir=. --task_file=task_specs/sts-x2.json --input_file=input.txt --input_file_type=plain --num_entries_per_input_and_label=3 --model_name=davinci --openai_api_key=<open_api_key>
    

    The output is:

    Done loading 1 inputs from file 'input.txt'
    Using OpenAI's GPT3 (davinci) as generator. The following parameters are ignored: ['decay_constant', 'top_k']
    Starting dataset generation with DINO...
    Dataset Entries: 100%|██████████| 1/1 [00:09<00:00,  9.84s/it]
    Dataset generation complete, dataset contains 0 entries
    Done saving dataset to file '.\sts-dataset.jsonl'
    

    But when I look into sts-dataset.jsonl, it's empty.

    P.S. when I use gpt-2 model in my local machine, everything is ok.

    opened by chansonzhang 1
  • AttributeError: 'GPT2Wrapper' object has no attribute 'model'

    https://github.com/timoschick/dino/blob/0e4a0525f05dfa763bb99fe7cbde2563be32df4f/modeling.py#L275

    Hello, I tried to generate single text and encountered the following error.

      File "/home/myname/dino/modeling.py", line 275, in generate_self_debiasing
        max_length = min(self.model._model.config.max_position_embeddings, max_length + input_length)
    AttributeError: 'GPT2Wrapper' object has no attribute 'model'
    

    I replaced self.model._model with self._model, which solved the problem.

    opened by suparklingmin 0
Owner
Timo Schick
NLP Researcher @ SulzerGmbH , PhD Student @ CIS, LMU Munich