awesome-fast-attention

A curated list of efficient attention modules (last update: Wed, 10 Mar 2021 23:52:22 +0000)

Efficient Attention
Articles/Surveys/Benchmarks

Efficient Attention

Paper (citations)	Implementation	Computational Complexity	AutoRegressive	Main Idea
Generating Wikipedia by Summarizing Long Sequences (282)	memory-compressed-attention	$\mathcal{O}({b}\cdot\frac{N}{b}\cdot\frac{N}{{b}\cdot{k}}\cdot{D})$	✔️	compresses key and value + blocked attention
CBAM: Convolutional Block Attention Module (999+)	attention-module	$\mathcal{O}(({N}\cdot{D}+\frac{{D}^2}{r})+({N}\cdot{D}\cdot{k}^2))$	❌	combines the SE attention with a per-pixel (local) weight
Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks (16)	set_transformer	$\mathcal{O}({N}\cdot{K}\cdot{D})$	❌	uses K relay nodes
CCNet: Criss-Cross Attention for Semantic Segmentation (296)	CCNet	$\mathcal{O}({N}\cdot({H}+{W})\cdot{D})$	❌	each pixel attends to its row and column simultaneously
Efficient Attention: Attention with Linear Complexities (16)	efficient-attention	$\mathcal{O}({N}\cdot{D}^2)$	❌	Softmax(Q)(Softmax(K^T)V)
Star-Transformer (40)	fastNLP	$\mathcal{O}({N}\cdot{D})$	❌	uses a relay (global) node and attends to/from that node
GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond (199)	GCNet	$\mathcal{O}({N}\cdot{D}^2)$	❌	squeeze and excitation with an attention pooling (instead of a GAP)
Generating Long Sequences with Sparse Transformers (257)	DeepSpeed	$\mathcal{O}({N}\cdot\sqrt{N}\cdot{D})$	✔️	sparse block-based attention
SCRAM: Spatially Coherent Randomized Attention Maps (1)	-	$\mathcal{O}({N}\cdot\log({N})\cdot{D})$	✔️	uses PatchMatch to find close keys
Interlaced Sparse Self-Attention for Semantic Segmentation (24)	IN_PAPER	$\mathcal{O}({N}\cdot{D}^2+{N}\cdot\sqrt{N}\cdot{D})$	✔️	combination of a short-range and then a long-range (dilated) attention
Permutohedral Attention Module for Efficient Non-Local Neural Networks (3)	Permutohedral_attention_module	$\mathcal{O}({N}\cdot{D}^2)$	❌	uses the permutohedral lattice approximation algorithm to approximate the attention output
Large Memory Layers with Product Keys (43)	XLM	$\mathcal{O}({Q}\cdot({K}+{k}^2)\cdot{D})$	✔️	searches for nearest-neighbor keys
Expectation-Maximization Attention Networks for Semantic Segmentation (79)	EMANet	$\mathcal{O}({N}\cdot{k}\cdot{D})$	❌	applies expectation maximization to cluster keys into k clusters
BP-Transformer: Modelling Long-Range Context via Binary Partitioning (15)	BPT	$\mathcal{O}({N}\cdot{k}\cdot\log(\frac{N}{k})\cdot{D})$	✔️	attends to distant tokens coarsely and attends to close tokens in a more fine-grained manner
Compressive Transformers for Long-Range Sequence Modelling (48)	compressive-transformer-pytorch	$\mathcal{O}({N}^2\cdot{D})$	✔️	compresses distant tokens instead of just applying stop_grad() to them, a more efficient version of Transformer-XL
Axial Attention in Multidimensional Transformers (36)	axial-attention	$\mathcal{O}({N}\cdot({H}+{W})\cdot{D})$	✔️	applies attention on each axis separately
Reformer: The Efficient Transformer (216)	trax	$\mathcal{O}({N}\cdot\log({N})\cdot{D}^2)$	✔️	uses LSH to find close keys
Sparse Sinkhorn Attention (16)	sinkhorn-transformer	$\mathcal{O}(\frac{{N}^2}{n_b}+{n_b}^2)$	✔️	uses a cost matrix to limit attention between buckets
Transformer on a Diet (2)	transformer-on-diet	$\mathcal{O}({N}\cdot{k}\cdot{D})$	✔️	dilated transformer like WaveNet
Time-aware Large Kernel Convolutions (9)	TaLKConvolutions	$\mathcal{O}({N}\cdot{D})$	✔️	calculates the mean over a dynamic subsequence around each token with the help of a summed-area table
SAC: Accelerating and Structuring Self-Attention via Sparse Adaptive Connection (2)	-	$\mathcal{O}({N}\cdot{k}\cdot{D})$	✔️	learns the q, k connections, i.e. dynamically creates a sparse attention matrix
Efficient Content-Based Sparse Attention with Routing Transformers (38)	routing-transformer	$\mathcal{O}({N}\cdot\sqrt{N}\cdot{D})$	✔️	computes attention with same-cluster tokens (computed by online k-means)
Neural Architecture Search for Lightweight Non-Local Networks (11)	AutoNL	$\mathcal{O}((\frac{H}{h}\cdot\frac{W}{w})\cdot(\frac{D}{k})^2)$	❌	computes Q(KV) and also downsamples q, k, v in both spatial and channel dimensions
Longformer: The Long-Document Transformer (159)	longformer	$\mathcal{O}({N}\cdot({k}+{g})\cdot{D})$	✔️	global + blocked attention
ETC: Encoding Long and Structured Inputs in Transformers (16)	-	$\mathcal{O}(({N}\cdot{g}+{g}^2+{N}\cdot{k})\cdot{D})$	❌	combines global attention (star transformer with multiple global tokens) with local attention
Multi-scale Transformer Language Models (2)	IN_PAPER	$\mathcal{O}({N}^2\cdot{D})$	✔️	UNet-like + retina attention, which is close to BP-Transformer
Synthesizer: Rethinking Self-Attention in Transformer Models (26)	Synthesizer-Rethinking-Self-Attention-Transformer-Models	$\mathcal{O}({N}^2\cdot{D})$	✔️	does not compute pairwise interactions
Jukebox: A Generative Model for Music (45)	jukebox	$\mathcal{O}({N}\cdot\sqrt{N}\cdot{D})$	✔️	better attention patterns from Sparse Transformer
Input-independent Attention Weights Are Expressive Enough: A Study of Attention in Self-supervised Audio Transformers (0)	-	$\mathcal{O}({N}^2\cdot{D})$	✔️	does not compute pairwise interactions and uses fixed mask patterns
GMAT: Global Memory Augmentation for Transformers (2)	gmat	$\mathcal{O}({m}\cdot({N}+{m})\cdot{D})$	❌	adds global tokens
Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention (45)	fast-transformers	$\mathcal{O}({N}\cdot{D}^2)$	✔️	uses phi(q)(phi(k)v) and also improves the sequential sampling step
Linformer: Self-Attention with Linear Complexity (47)	linformer-pytorch	$\mathcal{O}({N}\cdot{k}\cdot{D})$	❌	projects key and value from N×D to k×D
Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers (8)	google-research	$\mathcal{O}({N}\cdot{D}^2\cdot\log({D}))$	✔️	calculates an unbiased stochastic approximation of the attention matrix
Kronecker Attention Networks (1)	kronecker-attention-pytorch	$\mathcal{O}(({H}+{W})^2\cdot{D})$	❌	uses horizontal and lateral average matrices
Real-time Semantic Segmentation with Fast Attention (5)	-	$\mathcal{O}({N}\cdot{D}^2)$	❌	l2_norm(q)(l2_norm(k)v)
Fast Transformers with Clustered Attention (6)	fast-transformers	$\mathcal{O}({N}\cdot{k}\cdot{D})$	❌	groups queries together with LSH
Big Bird: Transformers for Longer Sequences (60)	DeepSpeed	$\mathcal{O}(({g}^2+{N}\cdot({k}+{g}+{r}))\cdot{D})$	❌	ETC with random connections
Tensor Low-Rank Reconstruction for Semantic Segmentation (3)	-	$\mathcal{O}(({D}\cdot{H}\cdot{W}+{D}^2+{H}^2+{W}^2)\cdot{r})$	❌	decomposes the full attention tensor into rank-one tensors (CP decomposition)
Looking for change? Roll the Dice and demand Attention (0)	IN_PAPER	$\mathcal{O}({H}\cdot{W}\cdot{D})$	❌	uses the fractal Tanimoto similarity to compare queries with keys inside the attention module
Rethinking Attention with Performers (30)	google-research	$\mathcal{O}({N}\cdot{m}\cdot{D})$	✔️	unbiased approximation of the attention matrix with softmax kernel
Memformer: The Memory-Augmented Transformer (0)	memformer	$\mathcal{O}({N}\cdot{D})$	✔️	attends to memory slots + Memory-Replay BackPropagation
SMYRF: Efficient Attention using Asymmetric Clustering (1)	smyrf	$\mathcal{O}({N}\cdot\log({N})\cdot{D})$	❌	LSH with balanced clusters
Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting (0)	Informer2020	$\mathcal{O}({N}\cdot\log({N})\cdot{D})$	✔️	sparse attention + funnel-like encoder
Sub-Linear Memory: How to Make Performers SLiM (0)	google-research	$\mathcal{O}({N}\cdot{m}\cdot{D})$	✔️	Performer but with sublinear memory usage
Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention (0)	Nystromformer	$\mathcal{O}({N}\cdot{D})$	❌	uses the Nyström method to approximate the attention matrix
Linear Transformers Are Secretly Fast Weight Memory Systems (0)	fast-weight-transformers	$\mathcal{O}({N}\cdot{m}\cdot{D})$	✔️	shows that linear transformers are basically fast weight networks + proposes a new kernel function to linearise attention, balancing simplicity and effectiveness
LambdaNetworks: Modeling Long-Range Interactions Without Attention (6)	lambda-networks	$\mathcal{O}({N}^2\cdot{k}\cdot\frac{v}{h})$	✔️	generates a linear layer based on context + decouples pos/context
Random Feature Attention (2)	-	$\mathcal{O}({N}\cdot{D})$	✔️	kernel approximation, also building on "Transformers are RNNs"
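
Several rows above (for example the ones summarized as phi(q)(phi(k)v) and Softmax(Q)(Softmax(K^T)V)) rely on the same reordering: instead of forming the N×N matrix of query-key similarities, they first aggregate keys and values into a small D×D summary and then apply it to each query. The sketch below is only an illustration of that reordering in plain Python, not any listed repository's implementation. The feature map `phi` is taken to be elu(x) + 1, and the function names, sizes, and the non-causal setting are all assumptions made for the example.

```python
import math
import random


def phi(x):
    # Assumed positive feature map: elu(x) + 1.
    return x + 1.0 if x > 0 else math.exp(x)


def quadratic_attention(Q, K, V):
    """O(N^2 * D): every query is compared with every key explicitly."""
    out = []
    for q in Q:
        fq = [phi(x) for x in q]
        weights = [sum(a * phi(b) for a, b in zip(fq, k)) for k in K]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(len(V[0]))])
    return out


def linear_attention(Q, K, V):
    """O(N * D^2): summarize keys/values once, then reuse for each query."""
    d_k, d_v = len(K[0]), len(V[0])
    kv = [[0.0] * d_v for _ in range(d_k)]  # sum over n of phi(k_n) v_n^T
    k_sum = [0.0] * d_k                     # sum over n of phi(k_n)
    for k, v in zip(K, V):
        fk = [phi(x) for x in k]
        for i in range(d_k):
            k_sum[i] += fk[i]
            for j in range(d_v):
                kv[i][j] += fk[i] * v[j]
    out = []
    for q in Q:
        fq = [phi(x) for x in q]
        norm = sum(a * b for a, b in zip(fq, k_sum))
        out.append([sum(fq[i] * kv[i][j] for i in range(d_k)) / norm
                    for j in range(d_v)])
    return out


def max_abs_diff(A, B):
    # Largest element-wise difference between two equally shaped matrices.
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))


random.seed(0)
N, D = 16, 4  # hypothetical sequence length and head dimension
Q = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]
K = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]
V = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]

# Both orderings compute the same output up to floating-point rounding.
print(max_abs_diff(quadratic_attention(Q, K, V), linear_attention(Q, K, V)))
```

The two functions differ only in the order of multiplication, which is why the cost moves from quadratic in N to linear in N (at the price of a D² term). An autoregressive variant would instead keep running prefix sums of `kv` and `k_sum`, which is not shown here.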

Articles/Surveys/Benchmarks


TaLK Convolutions

I'm the author of "Time-aware Large Kernel Convolutions" (https://arxiv.org/abs/2002.03184) which is an alternative method to self-attention with linear complexity published in ICML 2020. You can find the implementation here (https://github.com/lioutasb/TaLKConvolutions). Thanks a lot.

opened by lioutasb 1
Add the ProbSparse Attention (Informer)

I have the ProbSparse self-attention from the "Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting" paper (https://arxiv.org/abs/2012.07436). We have built a repo as the official implementation here: https://github.com/zhouhaoyi/Informer2020. Thank you a lot!

opened by zhouhaoyi 0
Memory compressed attention

I have the memory compressed attention from the "Generating Wikipedia" paper: https://github.com/lucidrains/memory-compressed-attention. Also, I wanted to let you know there is a more complete implementation of Linformer by Peter here: https://github.com/tatp22/linformer-pytorch. Thank you for compiling this!

opened by lucidrains 0


Owner

Sepehr Sameni
