# awesome-fast-attention

A curated list of efficient attention modules (last update: Wed, 10 Mar 2021 23:52:22 +0000)

## Table of Contents

- [Efficient Attention](#efficient-attention)

## Efficient Attention
| Paper (citations) | Implementation | Computational Complexity | AutoRegressive | Main Idea |
|---|---|---|---|---|
| Generating Wikipedia by Summarizing Long Sequences (282) | memory-compressed-attention | | | compresses key and value + blocked attention |
| CBAM: Convolutional Block Attention Module (999+) | attention-module | | | combines the SE attention with a per-pixel (local) weight |
| Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks (16) | set_transformer | | | uses K relay nodes |
| CCNet: Criss-Cross Attention for Semantic Segmentation (296) | CCNet | | | each pixel attends to its row and column simultaneously |
| Efficient Attention: Attention with Linear Complexities (16) | efficient-attention | | | Softmax(Q)*(Softmax(K^T)*V) (see the linear-attention sketch below the table) |
| Star-Transformer (40) | fastNLP | | | uses a relay (global) node and attends to/from that node |
| GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond (199) | GCNet | | | squeeze-and-excitation with an attention pooling (instead of a GAP) |
| Generating Long Sequences with Sparse Transformers (257) | DeepSpeed | | | sparse block-based attention |
| SCRAM: Spatially Coherent Randomized Attention Maps (1) | - | | | uses PatchMatch to find close keys |
| Interlaced Sparse Self-Attention for Semantic Segmentation (24) | IN_PAPER | | | combination of short-range and then long-range (dilated) attention |
| Permutohedral Attention Module for Efficient Non-Local Neural Networks (3) | Permutohedral_attention_module | | | uses a permutohedral lattice approximation algorithm to approximate the attention output |
| Large Memory Layers with Product Keys (43) | XLM | | | searches for nearest-neighbor keys |
| Expectation-Maximization Attention Networks for Semantic Segmentation (79) | EMANet | | | applies expectation maximization to cluster keys into k clusters |
| BP-Transformer: Modelling Long-Range Context via Binary Partitioning (15) | BPT | | | attends to distant tokens coarsely and to close tokens in a more fine-grained manner |
| Compressive Transformers for Long-Range Sequence Modelling (48) | compressive-transformer-pytorch | | | compresses distant tokens instead of just stop_grad()-ing them; a more efficient version of Transformer-XL |
| Axial Attention in Multidimensional Transformers (36) | axial-attention | | | applies attention on each axis separately |
| Reformer: The Efficient Transformer (216) | trax | | | uses LSH to find close keys |
| Sparse Sinkhorn Attention (16) | sinkhorn-transformer | | | uses a cost matrix to limit attention between buckets |
| Transformer on a Diet (2) | transformer-on-diet | | | dilated transformer, like WaveNet |
| Time-aware Large Kernel Convolutions (9) | TaLKConvolutions | | | calculates a mean over a dynamic subsequence around each token with the help of a summed-area table |
| SAC: Accelerating and Structuring Self-Attention via Sparse Adaptive Connection (2) | - | | | learns the q, k connections, i.e. dynamically creates a sparse attention matrix |
| Efficient Content-Based Sparse Attention with Routing Transformers (38) | routing-transformer | | | computes attention with same-cluster tokens (computed by online k-means) |
| Neural Architecture Search for Lightweight Non-Local Networks (11) | AutoNL | | | computes Q(KV) and also downsamples q, k, v in both spatial and channel dimensions |
| Longformer: The Long-Document Transformer (159) | longformer | | | global + blocked attention (see the local + global mask sketch below the table) |
| ETC: Encoding Long and Structured Inputs in Transformers (16) | - | | | combines global attention (Star-Transformer with multiple global tokens) with local attention |
| Multi-scale Transformer Language Models (2) | IN_PAPER | | | UNet-like + retina attention, something close to BP-Transformer |
| Synthesizer: Rethinking Self-Attention in Transformer Models (26) | Synthesizer-Rethinking-Self-Attention-Transformer-Models | | | does not compute pairwise interactions |
| Jukebox: A Generative Model for Music (45) | jukebox | | | better attention patterns from Sparse Transformer |
| Input-independent Attention Weights Are Expressive Enough: A Study of Attention in Self-supervised Audio Transformers (0) | - | | | does not compute pairwise interactions and uses fixed mask patterns |
| GMAT: Global Memory Augmentation for Transformers (2) | gmat | | | adds global tokens |
| Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention (45) | fast-transformers | | | uses phi(q)*(phi(k)^T*v) and also improves the sequential sampling step (see the linear-attention sketch below the table) |
| Linformer: Self-Attention with Linear Complexity (47) | linformer-pytorch | | | projects keys and values from n x d down to k x d (see the Linformer sketch below the table) |
| Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers (8) | google-research | | | calculates an unbiased stochastic approximation of the attention matrix |
| Kronecker Attention Networks (1) | kronecker-attention-pytorch | | | uses horizontal and lateral average matrices |
| Real-time Semantic Segmentation with Fast Attention (5) | - | | | l2_norm(q)*(l2_norm(k)^T*v) |
| Fast Transformers with Clustered Attention (6) | fast-transformers | | | groups queries together with LSH |
| Big Bird: Transformers for Longer Sequences (60) | DeepSpeed | | | ETC with random connections |
| Tensor Low-Rank Reconstruction for Semantic Segmentation (3) | - | | | decomposes the full attention tensor into rank-one tensors (CP decomposition) |
| Looking for change? Roll the Dice and demand Attention (0) | IN_PAPER | | | uses the fractal Tanimoto similarity to compare queries with keys inside the attention module |
| Rethinking Attention with Performers (30) | google-research | | | unbiased approximation of the attention matrix with a softmax kernel (see the Performer sketch below the table) |
| Memformer: The Memory-Augmented Transformer (0) | memformer | | | attends to memory slots + Memory-Replay BackPropagation |
| SMYRF: Efficient Attention using Asymmetric Clustering (1) | smyrf | | | LSH with balanced clusters |
| Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting (0) | Informer2020 | | | sparse attention + funnel-like encoder |
| Sub-Linear Memory: How to Make Performers SLiM (0) | google-research | | | Performer, but with sublinear memory usage |
| Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention (0) | Nystromformer | | | uses the Nyström method to approximate the attention matrix |
| Linear Transformers Are Secretly Fast Weight Memory Systems (0) | fast-weight-transformers | | | shows that linear transformers are basically fast weight networks + proposes a new kernel function to linearise attention, balancing simplicity and effectiveness |
| LambdaNetworks: Modeling Long-Range Interactions Without Attention (6) | lambda-networks | | | generates a linear layer based on context + decouples position/context |
| Random Feature Attention (2) | - | | | kernel approximation and also the transformers-are-RNNs formulation |
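
For the rows built on the associativity trick (Efficient Attention, Transformers are RNNs, and related linear-attention papers), here is a minimal NumPy sketch of the idea. It is an illustration only: the function names and shapes are made up here and are not taken from any of the listed repositories.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def efficient_attention(q, k, v):
    """Softmax(Q) @ (Softmax(K)^T @ V): softmax over the feature axis for Q
    and over the sequence axis for K, so the n x n matrix is never formed."""
    q = softmax(q, axis=-1)     # (n, d)
    k = softmax(k, axis=0)      # (n, d)
    return q @ (k.T @ v)        # (n, d) @ (d, d_v) -> (n, d_v)

def kernelized_attention(q, k, v, eps=1e-6):
    """phi(Q) @ (phi(K)^T @ V) with the elu(x)+1 feature map used in
    'Transformers are RNNs', plus the usual normalizer."""
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1 > 0
    q, k = phi(q), phi(k)
    kv = k.T @ v                         # (d, d_v)
    z = q @ k.sum(axis=0)[:, None]       # (n, 1) normalizer
    return (q @ kv) / (z + eps)

n, d = 1024, 64
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
print(efficient_attention(q, k, v).shape, kernelized_attention(q, k, v).shape)
```

Both functions cost O(n * d^2) time and O(n * d) memory instead of O(n^2) for the full attention matrix.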
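
The Longformer / ETC / Big Bird rows all combine a sliding local window with a few global tokens. The sketch below builds that pattern as a dense boolean mask purely for illustration (the real implementations never materialize an n x n mask); the helper name and parameters are hypothetical.

```python
import numpy as np

def local_global_mask(n, window, global_idx):
    """Boolean (n, n) mask, True where attention is allowed: a sliding
    window around each position plus a few global tokens that attend to,
    and are attended by, every position."""
    i = np.arange(n)
    mask = np.abs(i[:, None] - i[None, :]) <= window  # local sliding window
    mask[global_idx, :] = True   # global tokens look at everything
    mask[:, global_idx] = True   # everything looks at global tokens
    return mask

mask = local_global_mask(n=16, window=2, global_idx=[0])
print(int(mask.sum()), "allowed pairs out of", 16 * 16)
```

Big Bird additionally adds a handful of random connections per row on top of this local + global pattern.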
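
A minimal sketch of the Linformer row, assuming the simplest variant: keys and values are projected along the sequence axis from length n down to a fixed r before standard attention. The projection matrices are random placeholders here; in the paper they are learned and can be shared across heads and layers.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def linformer_attention(q, k, v, proj_k, proj_v):
    """Attention after projecting K, V from (n, d) down to (r, d), so the
    score matrix is (n, r) instead of (n, n)."""
    d = q.shape[-1]
    k_r = proj_k @ k                        # (r, d)
    v_r = proj_v @ v                        # (r, d)
    scores = q @ k_r.T / np.sqrt(d)         # (n, r)
    return softmax(scores, axis=-1) @ v_r   # (n, d)

n, d, r = 1024, 64, 128
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
proj_k = rng.standard_normal((r, n)) / np.sqrt(n)  # learned in the paper
proj_v = rng.standard_normal((r, n)) / np.sqrt(n)
print(linformer_attention(q, k, v, proj_k, proj_v).shape)  # (1024, 64)
```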
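
A stripped-down sketch of the Performer and "Masked Language Modeling for Proteins" rows: the softmax kernel exp(q.k) is approximated with positive random features so attention can be computed in linear time. This omits the orthogonal random blocks and feature redrawing of FAVOR+; all names below are illustrative.

```python
import numpy as np

def softmax_random_features(x, omega):
    """Positive random features phi(x) with E[phi(q) . phi(k)] ~ exp(q . k),
    i.e. an unbiased estimate of the un-normalized softmax kernel."""
    m = omega.shape[0]
    sq_norm = (x ** 2).sum(axis=-1, keepdims=True) / 2.0
    return np.exp(x @ omega.T - sq_norm) / np.sqrt(m)

def performer_attention(q, k, v, omega, eps=1e-6):
    """Linear-time attention using the random-feature map above; the 1/d**0.25
    scaling reproduces the usual 1/sqrt(d) inside the softmax."""
    d = q.shape[-1]
    q_p = softmax_random_features(q / d ** 0.25, omega)  # (n, m)
    k_p = softmax_random_features(k / d ** 0.25, omega)  # (n, m)
    kv = k_p.T @ v                           # (m, d_v)
    z = q_p @ k_p.sum(axis=0)[:, None]       # (n, 1) normalizer
    return (q_p @ kv) / (z + eps)

n, d, m = 1024, 64, 256
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
omega = rng.standard_normal((m, d))  # FAVOR+ draws these as orthogonal blocks
print(performer_attention(q, k, v, omega).shape)  # (1024, 64)
```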