tldr-transformers

The tl;dr on a few notable transformer/language model papers + other papers (alignment, memorization, etc).

Models: `GPT-*`, `*BERT*`, `Adapter-*`, `*T5`, etc.

BERT and T5 (art from the original papers)


Each set of notes includes links to the paper, the original code implementation (if available), and the Huggingface 🤗 implementation.

Here is an example: t5.
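
As a quick taste of what those Huggingface 🤗 links point at, here is a minimal sketch of loading T5 through the transformers library (the `t5-small` checkpoint and the translation prompt are illustrative choices, not something prescribed by the notes):

```python
# Minimal sketch: pull down a T5 checkpoint and tokenizer from the
# Huggingface hub ("t5-small" is an illustrative choice).
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# T5 casts every task as text-to-text, so the task is selected via a prefix.
inputs = tokenizer(
    "translate English to German: The house is wonderful.",
    return_tensors="pt",
)
outputs = model.generate(**inputs, max_length=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```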

The transformer papers are presented somewhat chronologically below. Go to the "👉 Notes 👈" column to find the notes for each paper.

This repo also includes a table quantifying the differences across the transformer papers, all in one place.

Contents

Quick Note

This is not an intro to deep learning in NLP. If you are looking for that, I recommend one of the following: Fast AI's course, one of the Coursera courses, or maybe this old thing. Come here after that.

Motivation

With the explosion of papers on all things Transformer over the past few years, it seems useful to catalog the salient features/results/insights of each paper in a digestible format. Hence this repo.

Models

| Model | Year | Institute | Paper | 👉 Notes 👈 | Original Code | Huggingface 🤗 | Other Repo |
|---|---|---|---|---|---|---|---|
| Transformer | 2017 | Google | Attention is All You Need | Skipped, too many good write-ups: ? | | | |
| GPT-3 | 2020 | OpenAI | Language Models are Few-Shot Learners | To-Do | X | X | |
| GPT-J-6B | 2021 | EleutherAI | GPT-J-6B: 6B Jax-Based Transformer (public GPT-3) | X | here | x | x |
| BERT | 2018 | Google | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | BERT notes | here | here | |
| DistilBERT | 2019 | Huggingface | DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter | DistilBERT notes | | here | |
| ALBERT | 2019 | Google/Toyota | ALBERT: A Lite BERT for Self-supervised Learning of Language Representations | ALBERT notes | here | here | |
| RoBERTa | 2019 | Facebook | RoBERTa: A Robustly Optimized BERT Pretraining Approach | RoBERTa notes | here | here | |
| BART | 2019 | Facebook | BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension | BART notes | here | here | |
| T5 | 2019 | Google | Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer | T5 notes | here | here | |
| Adapter-BERT | 2019 | Google | Parameter-Efficient Transfer Learning for NLP | Adapter-BERT notes | here | - | here |
| Megatron-LM | 2019 | NVIDIA | Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism | Megatron notes | here | - | here |
| Reformer | 2020 | Google | Reformer: The Efficient Transformer | Reformer notes | here | | |
| ByT5 | 2021 | Google | ByT5: Towards a token-free future with pre-trained byte-to-byte models | ByT5 notes | here | here | |
| CLIP | 2021 | OpenAI | Learning Transferable Visual Models From Natural Language Supervision | CLIP notes | here | here | |
| DALL-E | 2021 | OpenAI | Zero-Shot Text-to-Image Generation | DALL-E notes | here | - | |
| Codex | 2021 | OpenAI | Evaluating Large Language Models Trained on Code | Codex notes | X | - | |

BigTable

All of the table summaries above, collapsed into one really big table here.

Alignment

| Paper | Year | Institute | 👉 Notes 👈 | Codes |
|---|---|---|---|---|
| Fine-Tuning Language Models from Human Preferences | 2019 | OpenAI | To-Do | None |

Scaling

| Paper | Year | Institute | 👉 Notes 👈 | Codes |
|---|---|---|---|---|
| Scaling Laws for Neural Language Models | 2020 | OpenAI | To-Do | None |

Memorization

| Paper | Year | Institute | 👉 Notes 👈 | Codes |
|---|---|---|---|---|
| Extracting Training Data from Large Language Models | 2021 | Google et al. | To-Do | None |
| Deduplicating Training Data Makes Language Models Better | 2021 | Google et al. | To-Do | None |

FewLabels

| Paper | Year | Institute | 👉 Notes 👈 | Codes |
|---|---|---|---|---|
| An Empirical Survey of Data Augmentation for Limited Data Learning in NLP | 2021 | GIT/UNC | To-Do | None |
| Learning with fewer labeled examples | 2021 | Kevin Murphy & Colin Raffel (Preprint: "Probabilistic Machine Learning", Chapter 19) | Worth a read, won't summarize here. | None |

Contribute

If you are interested in contributing to this repo, feel free to do the following:

  1. Fork the repo.
  2. Create a Draft PR for the paper of interest (to signal the write-up is "in flight" and avoid duplicated work).
  3. Use the suggested template to write your "tl;dr". If it's an architecture paper, you may also want to add to the larger table here.
  4. Submit your PR.

Errata

Undoubtedly some of the information here is incorrect. Please open an Issue and point it out.

Citation

```bibtex
@misc{cliff-notes-transformers,
  author = {Thompson, Will},
  url = {https://github.com/will-thompson-k/cliff-notes-transformers},
  year = {2021}
}
```

For the notes above, I've linked the original papers.

License

MIT
