Modified GPT using average pooling to reduce the softmax attention memory constraints.

Last update: Dec 3, 2021

Overview

NLP-GPT-Upsampling

This repository contains an implementation of OpenAI's GPT model. Taking inspiration from the Nystromformer, it approximates the full softmax attention matrix so that longer sequences can be modelled in NLP language modeling tasks. The input text sequence is first shortened by a simple strided average pooling, attention is computed on the shorter sequence, and the reduced-length attention output is then upsampled back to the original sequence length using bilinear interpolation.
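The pool, attend, upsample pipeline described above can be sketched in plain Python. This is a minimal illustration and not the repository's TensorFlow code: the function names (`avg_pool`, `attention`, `upsample_linear`, `pooled_self_attention`) are hypothetical, the pooling windows are assumed to be non-overlapping, the causal mask and the learned query/key/value projections are omitted, and "bilinear" upsampling is assumed to reduce to linear interpolation along the sequence axis with the end points aligned.

```python
import math

def avg_pool(seq, stride):
    """Average non-overlapping windows of `stride` token vectors (assumed windowing)."""
    pooled = []
    for start in range(0, len(seq), stride):
        window = seq[start:start + stride]
        dim = len(window[0])
        pooled.append([sum(vec[d] for vec in window) / len(window) for d in range(dim)])
    return pooled

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention. No causal mask in this sketch."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    dim = len(values[0])
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)])
    return out

def upsample_linear(seq, out_len):
    """Linearly interpolate along the sequence axis back to `out_len` rows.

    Assumption: the first and last rows of input and output are aligned.
    """
    in_len = len(seq)
    if in_len == 1 or out_len == 1:
        return [list(seq[0]) for _ in range(out_len)]
    result = []
    for i in range(out_len):
        pos = i * (in_len - 1) / (out_len - 1)
        lo = int(math.floor(pos))
        hi = min(lo + 1, in_len - 1)
        frac = pos - lo
        result.append([(1 - frac) * a + frac * b for a, b in zip(seq[lo], seq[hi])])
    return result

def pooled_self_attention(x, stride):
    """Pool the sequence, attend on the short sequence, upsample to the original length."""
    pooled = avg_pool(x, stride)
    attended = attention(pooled, pooled, pooled)
    return upsample_linear(attended, len(x))

# Toy input: 8 tokens with 2 features each, pooled with a hypothetical stride of 4.
x = [[float(i), 1.0] for i in range(8)]
y = pooled_self_attention(x, stride=4)
print(len(y), len(y[0]))
```

With a stride of 4, the softmax is taken over a 2 x 2 score matrix instead of an 8 x 8 one, and the output still has one row per input token.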

Because this approach is so simple, the model's performance is not expected to match that of the original GPT model, which uses the full attention matrix. The trade-off is that this naive strided averaging can model longer sequences than the original GPT implementation.
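To make the trade-off concrete, the short calculation below counts the entries of the attention score matrix for a single head, with and without pooling. It is a rough sketch: the helper name `score_matrix_entries` is hypothetical, and the sequence length and stride are illustrative values, not settings taken from this repository.

```python
def score_matrix_entries(seq_len, stride=1):
    """Entries in the (pooled) attention score matrix for one head.

    Assumption: pooling with `stride` shortens the sequence to ceil(seq_len / stride).
    """
    pooled_len = -(-seq_len // stride)  # ceiling division
    return pooled_len * pooled_len

full = score_matrix_entries(1024)               # full attention: 1024 x 1024
pooled = score_matrix_entries(1024, stride=4)   # pooled attention: 256 x 256
print(full, pooled, full // pooled)
```

Under these assumptions the score matrix shrinks by a factor of the stride squared, which is what leaves room for longer input sequences within the same memory budget.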

Fig. 1: GPT Model Architecture (obtained from GPT paper)

Data

This repository includes code to process the Movie Dialogue dataset, whose preparation closely follows this script, as well as the Reddit Jokes dataset.

To prepare the data prior to training the model(s), run

python process_movie_dialogue_subword.py

for the Movie Dialogue dataset, or

python process_reddit_jokes_subword_v1.py

for the Reddit Jokes dataset.

Training and Model Inference

Having processed the data into sub-word tokens, run

python train_movie_dialogue_sw_tf_ver2_gpt_keras_upsampled.py
python infer_movie_dialogue_sw_tf_ver2_gpt_keras_upsampled.py

python train_reddit_jokes_sw_tf_ver2_gpt_keras_upsampled.py
python infer_reddit_jokes_sw_tf_ver2_gpt_keras_upsampled.py

to train the respective model on the loaded dataset and then perform inference with the trained model.

Owner

WD
