Toolkit for collecting and applying prompts

BigScience Workshop

Last update: Jan 3, 2023

Overview

PromptSource

Promptsource is a toolkit for collecting and applying prompts to NLP datasets.

Promptsource uses a simple templating language to programmatically map an example of a dataset into a text input and a text target.

Promptsource contains a growing collection of prompts (which we call P3: Public Pool of Prompts). As of October 18th, there are ~2'000 prompts for 170+ datasets in P3. Feel free to use these prompts as they are (you'll find citation details here).

Note that a subset of the prompts is still a work in progress. You'll find the list of prompts that may be modified in the near future here. Most modifications will consist of metadata collection, but in some cases they will affect the templates themselves. To facilitate traceability, Promptsource is currently pinned at version 0.1.0.

Setup

Download the repo
Navigate to root directory of the repo
Install requirements with pip install -r requirements.txt in a Python 3.7 environment

Running

You can browse through existing prompts on the hosted version of Promptsource.

If you want to launch a local version (in particular to write prompts), launch the editor from the root directory of the repo with:

streamlit run promptsource/app.py

There are 3 modes in the app:

Helicopter view: aggregate high-level metrics on the current state of the sourcing
Prompted dataset viewer: check the templates you wrote, or those already written, on an entire dataset
Sourcing: write new prompts

Running (read-only)

To host a public streamlit app, launch it with

streamlit run promptsource/app.py -- -r

Prompting an Example:

You can use Promptsource with Datasets to create prompted examples:

# Get an example
from datasets import load_dataset
dataset = load_dataset("ag_news")
example = dataset["train"][0]

# Prompt it
from promptsource.templates import TemplateCollection
# Get all the prompts
collection = TemplateCollection()
# Get all the AG News prompts
ag_news_prompts = collection.get_dataset("ag_news")
# Select a prompt by name
prompt = ag_news_prompts["classify_question_first"]

result = prompt.apply(example)
print("INPUT: ", result[0])
print("TARGET: ", result[1])

Contributing

Contribution guidelines and step-by-step HOW TO are described here.

Writing Prompts

A prompt is expressed in Jinja.

It is rendered using an example (a dictionary) from the corresponding dataset in the Hugging Face datasets library. The separator ||| should appear exactly once to divide the template into input and target. Generally, the input should provide information on the desired behavior, e.g., a text passage and instructions, and the target should be the desired response.
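
As a rough illustration of the rendering step described above, the sketch below renders a Jinja template against an example dictionary and splits the result on |||. The template text, the field names (text, label_name), and the helper apply_template are assumptions made for this example; Promptsource's own implementation may differ.

```python
from jinja2 import Template

# Hypothetical template: everything before "|||" is the input, everything after is the target.
TEMPLATE = "What label best describes this news article?\n{{ text }} ||| {{ label_name }}"


def apply_template(template_str, example):
    """Render a Jinja template against an example dict and split it into (input, target)."""
    rendered = Template(template_str).render(**example)
    parts = [part.strip() for part in rendered.split("|||")]
    if len(parts) != 2:
        raise ValueError("the separator '|||' must appear exactly once")
    return parts[0], parts[1]


# Hypothetical example; real examples come from a Hugging Face dataset.
example = {"text": "Stocks rallied on Friday.", "label_name": "Business"}
prompt_input, prompt_target = apply_template(TEMPLATE, example)
print("INPUT: ", prompt_input)
print("TARGET: ", prompt_target)
```

Splitting after rendering is a simplification: if a field value itself contained |||, this sketch would raise an error instead of producing a prompt.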

For more information, read the Contribution guidelines.

Known Issues

Warning or Error about Darwin on OS X: Try downgrading PyArrow to 3.0.0.

ConnectionRefusedError: [Errno 61] Connection refused: Happens occasionally. Try restarting the app.

Development structure

Promptsource was developed as part of the BigScience project for open research 🌸 , a year-long initiative targeting the study of large models and datasets. The goal of the project is to research language models in a public environment outside large technology companies. The project has 600 researchers from 50 countries and more than 250 institutions.

Citation

If you want to cite P3 or Promptsource, you can use this bibtex:

@misc{sanh2021multitask,
      title={Multitask Prompted Training Enables Zero-Shot Task Generalization}, 
      author={Victor Sanh and Albert Webson and Colin Raffel and Stephen H. Bach and Lintang Sutawika and Zaid Alyafeai and Antoine Chaffin and Arnaud Stiegler and Teven Le Scao and Arun Raja and Manan Dey and M Saiful Bari and Canwen Xu and Urmish Thakker and Shanya Sharma Sharma and Eliza Szczechla and Taewoon Kim and Gunjan Chhablani and Nihal Nayak and Debajyoti Datta and Jonathan Chang and Mike Tian-Jian Jiang and Han Wang and Matteo Manica and Sheng Shen and Zheng Xin Yong and Harshit Pandey and Rachel Bawden and Thomas Wang and Trishala Neeraj and Jos Rozen and Abheesht Sharma and Andrea Santilli and Thibault Fevry and Jason Alan Fries and Ryan Teehan and Stella Biderman and Leo Gao and Tali Bers and Thomas Wolf and Alexander M. Rush},
      year={2021},
      eprint={2110.08207},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

Comments

Bias/fairness quantitative measurements
Two goals:

have quantitative measurements for the paper's Broader impact section

reflect these in the model cards when we release checkpoints

4 held out evaluation sets identified by the Evaluation WG:

jigsaw_toxicity_pred

crows_pairs

winogender (AXG in SuperGLUE)

winobias

At the current state, I see crows_pairs and winobias prompted, jigsaw_toxicity_pred has an opened PR (#451) we need to check, winogender needs to be prompted.

Workflow:

[x] prompt those that were not prompted yet

[x] making sure these were actually cached (@VictorSanh can have this caching step done fairly quickly)

[x] evaluation (normal and score rank evaluation) or maybe they have some special evaluation?

[x] when we know which checkpoints exactly to eval, get the final numbers to report
opened by VictorSanh 22
remove language restrictions in tydiqa + add arabic prompts
[x] Remove the if statement to allow English prompts to work across the Dataset.

[x] Add Arabic prompts for primary task subset with if statement to include only Arabic set.

[x] Add Arabic prompts for secondary task subset.
opened by KhalidAlt 21
Tracking trainings and evals
Main runs

D4 only (finetune-t5-xxl-lm-d4-091621-512)

[x] Training

[x] Eval

[x] Last

[x] Normal

[x] Rank

[x] 1'112'200 (half fine-tuning) (cf #456)

[x] Normal

[x] Rank

[x] 1'124'700 (second to last checkpoint)

[x] Normal

[x] Rank

[ ] SuperGLUE test (1380485) - RUNNING

[x] BigBench - see #469

D4 + GPT (finetune-t5-xxl-lm-d4-gpt-091621/)

[x] Training

[x] Eval

[x] Last

[x] Normal

[x] Rank

[x] 1'112'200 (half fine-tuning)

[x] Normal

[x] Rank

[x] BigBench - see #469 - RUNNING

D4 + GPT + Sglue (finetune-t5-xxl-lm-d4-all-091621)

[X] Training

[X] Eval

[x] Last

[x] Normal

[x] Rank

[x] 1'112'000 (half fine-tuning)

[x] Normal

[x] Rank

[x] BigBench - see #469 - RUNNING

Ablations

Nb Template - 1 OG Template (finetune-t5-xxl-lm-d4-og-091621)

[X] Training

[X] Eval

[x] Last

[x] Normal

[x] Rank

[x] 1'112'000 (half fine-tuning)

[x] Normal

[x] Rank

all OG Templates (finetune-t5-xxl-lm-d4-og-all-091621) #465

[x] Training - RUNNING

[ ] Eval

[ ] Last

[ ] Normal

[ ] Rank

[x] 1'112'000 (half fine-tuning)

[ ] Normal

[x] Rank

Size XL (finetune-t5-xl-lm-d4-091621)

[X] Training

[X] Eval

[x] Last

[x] Normal

[x] Rank

[x] 1'112'000 (half fine-tuning)

[x] Normal

[x] Rank

D4 only duplicate (finetune-t5-xxl-lm-d4-091621/)

[x] Training

[ ] Eval

[ ] Last

[ ] Normal

[ ] Rank

[X] 1'112'000 (half fine-tuning)

[X] Normal

[X] Rank

Baseline

T5 zero-shot baseline

[x] Eval - RUNNING under finetune-t5-xxl-lm-d4-091621

[x] Normal 1384362 - RUNNING

[x] Rank 1384360 - RUNNING
opened by VictorSanh 18
Special eval metrics and scripts
Since we have the string outputs of all tasks, in principle we should be able to run arbitrary metrics, especially for datasets that require fancy metrics. @lintangsutawika has imported the official eval scripts for ReCoRD, SQuAD v2, Natural Questions, TriviaQA, and DROP.

Update: Even when using Lintang's eval scripts, all extractive QAs and closed-book (generative) QAs still have abnormally low numbers, namely:

[ ] ReCoRD

[x] SQuAD2 (contains unanswerables)

[x] DROP

[ ] CoQA (contains unanswerables, multiple questions per example)

[ ] Quac (contains unanswerables, multiple questions per example)

[ ] Natural Questions

[x] TriviaQA

[x] WebQuestions

Also, I think all eval of extractive QA from the training mixture also failed.

(Note that ARC is closed-book, but its performance is fine because it's multiple-choice. A great case in point that machine task categories depend far more on format than on human skill/knowledge.)

Others with issues to keep an eye on:

[x] HellaSwag

[x] Lambada

[x] Winogrande

evaluations
opened by awebson 14
Fix rendering: use simple `st.text` instead of `st.markdown`

The rendering is messed up in a bunch of places. A few issues that are symptomatic: #355 #326

Replace st.markdown by st.text and extensively test that the rendering is better.

opened by VictorSanh 13
Templates for `ncbi_disease`

This one compared to the others has been a real pain, but I think the templates are quite interesting. Not sure the Jinja style is top notch, but it seems functional on the example sets I checked.

opened by drugilsberg 12
Creating unique identifier in the template.yaml

For now, it looks like we can sort of uniquely identify each template using a combination of template name and dataset name, but I'm expecting potential collisions when a lot of people start contributing. Besides, naming each template might not be useful (like if we end up with names like template1 template2 etc...), and it would help contributors if they don't have to add a name/check conflicts on the naming part before merging their template.yaml.

I was thinking that we could add an ID to each entry by getting the hash of timestamp + dataset + string of prompt python function or jinja template? That should be more than enough to prevent collisions
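
A minimal sketch of that idea, assuming SHA-256 as the hash and a hypothetical helper name make_template_id (neither is specified in this proposal):

```python
import hashlib
import time


def make_template_id(dataset_name, template_text, timestamp=None):
    """Return a hex ID derived from timestamp + dataset name + template text."""
    if timestamp is None:
        timestamp = time.time()
    # Join the fields with a separator that is unlikely to appear in any of them,
    # so that ("ab", "c") and ("a", "bc") do not hash to the same value.
    payload = "\x1f".join([repr(timestamp), dataset_name, template_text])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Hypothetical usage with a fixed timestamp so the result is reproducible.
template_id = make_template_id("ag_news", "{{ text }} ||| {{ label }}", timestamp=1630000000.0)
print(template_id)
```

Because the timestamp is part of the payload, two contributors who write the same template for the same dataset still get different IDs unless they pass the same timestamp.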
enhancement

opened by arnaudstiegler 12
Add prompts for DiaBLa dataset
Added a range of prompts for different tasks:

sentence-level translation

analogy-based translation

contextual translation (with context in different languages, automatically vs. manually translated)

detection of errors in translations

identification of machine translated output compared to human-produced translations
opened by rbawden 11

AttributeError: 'NoneType' object has no attribute 'session_id'

Hi,

Is anyone facing this issue while running streamlit run promptsource/app.py ?

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\I355109\Anaconda3\envs\python37\lib\multiprocessing\spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "C:\Users\I355109\Anaconda3\envs\python37\lib\multiprocessing\spawn.py", line 114, in _main
    prepare(preparation_data)
  File "C:\Users\I355109\Anaconda3\envs\python37\lib\multiprocessing\spawn.py", line 225, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "C:\Users\I355109\Anaconda3\envs\python37\lib\multiprocessing\spawn.py", line 277, in _fixup_main_from_path
    run_name="__mp_main__")
  File "C:\Users\I355109\Anaconda3\envs\python37\lib\runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "C:\Users\I355109\Anaconda3\envs\python37\lib\runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "C:\Users\I355109\Anaconda3\envs\python37\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\I355109\promptsource\promptsource\app.py", line 59, in <module>
    state = _get_state()
  File "c:\users\i355109\promptsource\promptsource\session.py", line 84, in _get_state
    session = _get_session()
  File "c:\users\i355109\promptsource\promptsource\session.py", line 74, in _get_session
    session_id = get_report_ctx().session_id
AttributeError: 'NoneType' object has no attribute 'session_id'

opened by manandey 11

Add conll2003 ner,pos,chunk task.
Prompt Description:

flat_question_with_label : Regular task. Labels are normalized in the case of POS tagging.

flat_question_with_random_label : Users are not expected to always provide exactly the labels from the dataset; they may provide only a subset. So here we provide a subset of the labels. If a gold label in the sample is not in the subset, we replace it with "O". When choosing random tags, we always include the "O" tag for NER, POS, and chunk labels.

flat_question_without_label : Regular task. No label is provided.
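
The random-label behavior described above can be sketched in plain Python as follows. This is an illustration rather than the Jinja code from the pull request; the helper names sample_label_subset and mask_gold_labels and the example sentence are hypothetical.

```python
import random


def sample_label_subset(all_labels, k, rng):
    """Pick k labels at random and always include the "O" tag."""
    candidates = [label for label in all_labels if label != "O"]
    subset = set(rng.sample(candidates, k))
    subset.add("O")
    return subset


def mask_gold_labels(gold_labels, allowed):
    """Replace every gold label that is not in the allowed subset with "O"."""
    return [label if label in allowed else "O" for label in gold_labels]


# Hypothetical NER example.
tokens = ["EU", "rejects", "German", "call"]
gold = ["B-ORG", "O", "B-MISC", "O"]

masked = mask_gold_labels(gold, allowed={"B-ORG", "O"})
# Pair each token with its (possibly masked) label in a single pass.
pairs = list(zip(tokens, masked))
print(pairs)

# Drawing a random subset; the seed is fixed only to make the run repeatable.
subset = sample_label_subset(["B-ORG", "B-MISC", "B-PER", "B-LOC", "O"], k=2, rng=random.Random(0))
print(sorted(subset))
```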

POS label Normalization

Both the NER and chunk tasks contain "O" tags, but POS doesn't contain an "O" tag.

For part-of-speech tags, a few labels look unnatural as plain text. For example, see this prompt with all POS labels:

Generate parts of speech from the following sentence. The parts of speech tags are ", '', #, $, (, ), ,, ., :, ``, CC, CD, DT, EX, FW, IN, JJ, JJR, JJS, LS, MD, NN, NNP, NNPS, NNS, NN|SYM, PDT, POS, PRP, PRP$, RB, RBR, RBS, RP, SYM, TO, UH, VB, VBD, VBG, VBN, VBP, VBZ, WDT, WP, WP$, WRB

Here, the first 9 labels are normalized to the "O" tag.

Earlier Pull

Earlier, zip was not available, so I wrote brute-force code with O(n^2) complexity. Now that zip is available, I rewrote the code with simpler notation and a single loop (O(n) complexity). While merging, I messed up the previous pull request https://github.com/bigscience-workshop/promptsource/pull/170 , so I closed it and created this new one.
opened by sbmaruf 11
Some tweaks to the Editor
Default sort by popularity and include number of prompts

Global progress table

Separated out new prompts and select, and the columns (I think this is okay, but might break)

Added text wrapping so that you can view long prompts

Added a template viewer section
opened by srush 11
Possibility to use custom datasets
Hi - I have a custom dataset (e.g. https://huggingface.co/datasets/lauritowal/redefine-math) and would like to create prompts for it by using PromptSource:

- Download the repo
- Navigate to the root directory of the repo
- Run pip install -e . to install the promptsource module

It seems there is no option for loading a custom dataset into the tool via the Web-GUI. Maybe, I am missing something, but this would be a very helpful feature... (If there is a way to achieve this already, please let me know!)

Thanks!
opened by lauritowal 0
Merging xP3 & eval-hackathon
I would like to get all xP3 prompts merged into eval-hackathon & then have eval-hackathon be merged into main once & for all, but I'm not sure if you want that? cc @VictorSanh @stephenbach

It would include adding:

Normal english prompts for various new & existing datasets (many PRs already open - will clean them up & request reviews when ready if someone gives me rights to do so / merge if i can & tests pass)

Long prompts https://github.com/bigscience-workshop/promptsource/pull/823/files ; Would request reviews once ready

Human-translated & Machine-translated prompts (Always suffixed with ht or mt, roughly 20K lines; see https://github.com/Muennighoff/promptsource/pull/47/files)

Lmk if you're okay with that & I'll get started 🙂
opened by Muennighoff 0
Regenerate all templates in unicode
@KhalidAlt made a great suggestion to change

yaml.dump(self.format_for_dump(), open(self.yaml_path, "w"))

to

yaml.dump(self.format_for_dump(), open(self.yaml_path, "w"), allow_unicode=True)

in templates.py so that the yaml files display as unicode (rather than the unicode code points). We should definitely do this, but we should do it when the corpus is relatively stable and we can regenerate (read and write) all the yaml as one commit.
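
To see the difference the flag makes, here is a small sketch using PyYAML with an in-memory buffer instead of the real template files. The sample data and the helper dump_to_string are made up for illustration.

```python
import io

import yaml  # PyYAML


def dump_to_string(obj, **kwargs):
    """Dump obj as YAML into an in-memory buffer and return the text."""
    buffer = io.StringIO()
    yaml.dump(obj, buffer, **kwargs)
    return buffer.getvalue()


# Hypothetical template entry containing a non-ASCII character.
data = {"jinja": "Quelle est la réponse ? ||| {{ answer }}"}

escaped_out = dump_to_string(data)                      # non-ASCII written as escape sequences
unicode_out = dump_to_string(data, allow_unicode=True)  # non-ASCII written as-is

print(escaped_out)
print(unicode_out)

# Both forms load back to the same data, so regenerating files only changes how they display.
assert yaml.safe_load(escaped_out) == yaml.safe_load(unicode_out) == data
```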
opened by stephenbach 0
fix fields names for TREC after

changes were induced here https://huggingface.co/datasets/trec/commit/1f97567bdd2adedefe8abdaa9bd6ee0e6725b458 to the field names (coarse_label and fine_label)

opened by VictorSanh 0

Releases(v0.2.3)

v0.2.3(Jul 2, 2022)

Fix many multiprocessing issues and HF datasets compatibility issues. A few new prompts.
v0.2.2(Mar 1, 2022)

Adding back get_fixed_answer_choices_list as it is used in t-zero. 👀 @craffel
v0.2.1(Feb 14, 2022)

[Hotfix] Fix Pypi install issue (#727) by @Guitaricet
v0.2.0(Feb 4, 2022)
Finished cleaning pass (except for 3 data(sub)sets: species_800, evidence_infer_treatment/1.1 and evidence_infer_treatment/2.0)

Reworked and updated the README and overall documentation

Fixed multiple rendering bugs

Removed T-zero specific code for experiments and pointed to the official T-Zero repo

Linked to official PromptSource paper

Pinned v0.2.0!

v0.1.0(Oct 18, 2021)

First official release of Promptsource along with the release of the paper Multitask Prompted Training Enables Zero-Shot Task Generalization.

Owner

BigScience Workshop

Research workshop on large language models - The Summer of Language Models 21
