Experiments and code to generate the GINC small-scale in-context learning dataset from "An Explanation for In-context Learning as Implicit Bayesian Inference"

P-Lambda

Last update: Dec 19, 2022

Related tags

Deep Learning incontext-learning

Overview

GINC small-scale in-context learning dataset

GINC (Generative In-Context learning Dataset) is a small-scale synthetic dataset for studying in-context learning. The pretraining data is generated by a mixture of HMMs and the in-context learning prompt examples are also generated from HMMs (either from the mixture or not). The prompt examples are out-of-distribution with respect to the pretraining data since every example is independent, concatenated, and separated by delimiters. We provide code to generate GINC-style datasets of varying vocabulary sizes, number of HMMs, and other parameters.

Quickstart

Please create a conda environment or virtualenv using the information in conda-env.yml, then install transformers by going into the transformers/ directory and running pip install -e .. Modify consts.sh to change the default output locations and insert code to activate the environment of choice. Run scripts/runner.sh to run all the experiments on sbatch.

Explore the data

The default dataset has vocab size 50 and the pretraining data is generated as a mixture of 5 HMMs. The pretraining dataset is in data/GINC_trans0.1_start10.0_nsymbols50_nvalues10_nslots10_vic0.9_nhmms10/train.json while in-context prompts are in data/GINC_trans0.1_start10.0_nsymbols50_nvalues10_nslots10_vic0.9_nhmms10/id_prompts_randomsample_*.json.

This repo contains the experiments for the paper An Explanation of In-context Learning as Implicit Bayesian Inference. If you found this repo useful, please cite

@article{xie2021incontext,
  author = {Sang Michael Xie and Aditi Raghunathan and Percy Liang and Tengyu Ma},
  journal = {arXiv preprint arXiv:2111.02080},
  title = {An Explanation of In-context Learning as Implicit Bayesian Inference},
  year = {2021},
}

You might also like...

This is the dataset and code release of the OpenRooms Dataset.

95 Jan 8, 2023

LSTM model trained on a small dataset of 3000 names written in PyTorch

LSTM model trained on a small dataset of 3000 names. Model generates names from model by selecting one out of top 3 letters suggested by model at a time until an EOS (End Of Sentence) character is not encountered.

1 Dec 20, 2021

Applications using the GTN library and code to reproduce experiments in "Differentiable Weighted Finite-State Transducers"

gtn_applications An applications library using GTN. Current examples include: Offline handwriting recognition Automatic speech recognition Installing

68 Dec 29, 2022

Minimal diffusion models - Minimal code and simple experiments to play with Denoising Diffusion Probabilistic Models (DDPMs)

Minimal code and simple experiments to play with Denoising Diffusion Probabilist

16 Oct 6, 2022

A large-scale video dataset for the training and evaluation of 3D human pose estimation models

ASPset-510 ASPset-510 (Australian Sports Pose Dataset) is a large-scale video dataset for the training and evaluation of 3D human pose estimation mode

36 Oct 30, 2022

A large-scale video dataset for the training and evaluation of 3D human pose estimation models

ASPset-510 (Australian Sports Pose Dataset) is a large-scale video dataset for the training and evaluation of 3D human pose estimation models. It contains 17 different amateur subjects performing 30 sports-related actions each, for a total of 510 action clips.

25 Jun 20, 2021

A large-scale face dataset for face parsing, recognition, generation and editing.

Experiments and code to generate the GINC small-scale in-context learning dataset from "An Explanation for In-context Learning as Implicit Bayesian Inference"

Related tags

Overview

GINC small-scale in-context learning dataset

Quickstart

Explore the data

You might also like...

This is the dataset and code release of the OpenRooms Dataset.

LSTM model trained on a small dataset of 3000 names written in PyTorch

Applications using the GTN library and code to reproduce experiments in "Differentiable Weighted Finite-State Transducers"

Minimal diffusion models - Minimal code and simple experiments to play with Denoising Diffusion Probabilistic Models (DDPMs)

A large-scale video dataset for the training and evaluation of 3D human pose estimation models

A large-scale video dataset for the training and evaluation of 3D human pose estimation models

A large-scale face dataset for face parsing, recognition, generation and editing.

Code to reproduce the experiments in the paper "Transformer Based Multi-Source Domain Adaptation" (EMNLP 2020)

Code to run experiments in SLOE: A Faster Method for Statistical Inference in High-Dimensional Logistic Regression.

Owner

P-Lambda

PyTorch experiments with the Zalando fashion-mnist dataset

Repo for my Tensorflow/Keras CV experiments. Mostly revolving around the Danbooru20xx dataset

Context Axial Reverse Attention Network for Small Medical Objects Segmentation

Open source hardware and software platform to build a small scale self driving car.

FactSeg: Foreground Activation Driven Small Object Semantic Segmentation in Large-Scale Remote Sensing Imagery (TGRS)

A data annotation pipeline to generate high-quality, large-scale speech datasets with machine pre-labeling and fully manual auditing.

Source code and notebooks to reproduce experiments and benchmarks on Bias Faces in the Wild (BFW).

The LaTeX and Python code for generating the paper, experiments' results and visualizations reported in each paper is available (whenever possible) in the paper's directory

Asterisk is a framework to generate high-quality training datasets at scale

Release of SPLASH: Dataset for semantic parse correction with natural language feedback in the context of text-to-SQL parsing