Implementation of the Transformer variant proposed in "Transformer Quality in Linear Time"

Phil Wang

Last update: Dec 28, 2022

Overview

FLASH - Pytorch

Implementation of the Transformer variant proposed in the paper Transformer Quality in Linear Time

Install

$ pip install FLASH-pytorch

Usage

The main novel circuit in this paper is the "Gated Attention Unit" (GAU), which the authors claim can replace multi-headed attention while using just one head.

It uses a relu squared activation in place of the softmax. The relu squared activation was first seen in the Primer paper, and the use of ReLU for attention in the ReLA Transformer. The gating style seems mostly inspired by gMLPs.

import torch
from flash_pytorch import GAU

gau = GAU(
    dim = 512,
    query_key_dim = 128,     # query / key dimension
    causal = True,           # autoregressive or not
    expansion_factor = 2,    # hidden dimension = dim * expansion_factor
)

x = torch.randn(1, 1024, 512)
out = gau(x) # (1, 1024, 512)
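
To make the relu squared attention and the gating more concrete, here is a minimal single-head sketch. It is not the library's GAU: the class name TinyGAU is hypothetical, and the choice of LayerNorm, SiLU activations, and scaling the similarities by sequence length are assumptions made for illustration only.

```python
import torch
import torch.nn.functional as F
from torch import nn

class TinyGAU(nn.Module):
    """Illustrative single-head gated attention unit (hypothetical, not the library's GAU)."""

    def __init__(self, dim, query_key_dim=128, expansion_factor=2, causal=False):
        super().__init__()
        hidden_dim = int(dim * expansion_factor)
        self.causal = causal
        self.norm = nn.LayerNorm(dim)
        self.to_hidden = nn.Linear(dim, hidden_dim * 2)  # produces values and gate
        self.to_qk = nn.Linear(dim, query_key_dim)       # one shared projection for q and k
        # cheap per-dimension scale and offset turn the shared projection into q and k
        self.gamma = nn.Parameter(torch.ones(2, query_key_dim))
        self.beta = nn.Parameter(torch.zeros(2, query_key_dim))
        self.to_out = nn.Linear(hidden_dim, dim)

    def forward(self, x):
        n = x.shape[1]
        normed = self.norm(x)
        v, gate = F.silu(self.to_hidden(normed)).chunk(2, dim=-1)
        z = F.silu(self.to_qk(normed))
        q = z * self.gamma[0] + self.beta[0]
        k = z * self.gamma[1] + self.beta[1]

        sim = torch.einsum('bid,bjd->bij', q, k) / n     # assumed scaling
        attn = F.relu(sim) ** 2                          # relu squared instead of softmax

        if self.causal:
            # zero out attention to future positions
            future = torch.ones(n, n, dtype=torch.bool, device=x.device).triu(1)
            attn = attn.masked_fill(future, 0.)

        out = torch.einsum('bij,bjd->bid', attn, v) * gate  # gate the attended values
        return self.to_out(out) + x                          # residual connection

x = torch.randn(1, 64, 32)
out = TinyGAU(dim=32, query_key_dim=16, causal=True)(x)  # same shape as x
```

Because there is no softmax, the causal mask simply zeroes the masked entries rather than filling them with a large negative value.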

The authors then combine GAU with Katharopoulos linear attention, using grouping of the sequences to overcome a known issue with autoregressive linear attention.

They named this combination of the quadratic gated attention unit with grouped linear attention FLASH.

It can be used just as easily

import torch
from flash_pytorch import FLASH

flash = FLASH(
    dim = 512,
    group_size = 256,             # group size
    causal = True,                # autoregressive or not
    query_key_dim = 128,          # query / key dimension
    expansion_factor = 2.         # hidden dimension = dim * expansion_factor
)

x = torch.randn(1, 1111, 512)     # sequence will be auto-padded to nearest group size
out = flash(x) # (1, 1111, 512)
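
The padding and the grouped causal linear attention can be sketched in isolation. This is a rough illustration under stated assumptions, not the library's code: the function names are hypothetical, normalization is left out, and only the linear (cross-group) term is shown. Within each group, the quadratic GAU attention would handle the remaining interactions.

```python
import math
import torch
import torch.nn.functional as F

def pad_to_groups(x, group_size):
    """Right-pad the sequence axis to a multiple of group_size, then split into groups.

    x: (batch, seq, dim) -> (batch, groups, group_size, dim)
    """
    b, n, d = x.shape
    padded_len = math.ceil(n / group_size) * group_size
    x = F.pad(x, (0, 0, 0, padded_len - n), value=0.)
    return x.reshape(b, padded_len // group_size, group_size, d)

def causal_grouped_linear_attention(q, k, v):
    """q, k: (batch, groups, group_size, dk); v: (batch, groups, group_size, dv).

    Each group only sees the summed key-value products of strictly earlier groups,
    so no position can attend to the future through the linear path.
    """
    kv = torch.einsum('bgnd,bgne->bgde', k, v)     # one (dk, dv) summary per group
    kv = kv.cumsum(dim=1)                          # inclusive prefix sum over groups
    kv = F.pad(kv, (0, 0, 0, 0, 1, -1), value=0.)  # shift by one group: exclusive prefix sum
    return torch.einsum('bgnd,bgde->bgne', q, kv)

x = torch.randn(1, 1111, 512)
groups = pad_to_groups(x, 256)   # 1111 tokens are padded to 1280, i.e. 5 groups of 256
```

The cumulative sum runs over groups rather than over individual tokens, which is what keeps the autoregressive case cheap: there is one running summary per group instead of one per position.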

Finally, you can use the full FLASH transformer as mentioned in the paper. This contains all the positional embeddings mentioned in the paper. The absolute positional embedding is a scaled sinusoidal embedding. The GAU quadratic attention gets a one-headed T5 relative positional bias. On top of all this, both the GAU attention and the linear attention are rotary embedded (RoPE).

import torch
from flash_pytorch import FLASHTransformer

model = FLASHTransformer(
    num_tokens = 20000,          # number of tokens
    dim = 512,                   # model dimension
    depth = 12,                  # depth
    causal = True,               # autoregressive or not
    group_size = 256,            # size of the groups
    query_key_dim = 128,         # dimension of queries / keys
    expansion_factor = 2.,       # hidden dimension = dim * expansion_factor
    norm_type = 'scalenorm',     # in the paper, they claimed scalenorm led to faster training at no performance hit. the other option is 'layernorm' (also default)
    shift_tokens = True          # discovered by an independent researcher in Shenzhen @BlinkDL, this simply shifts half of the feature space forward one step along the sequence dimension - greatly improved convergence even more in my local experiments
)

x = torch.randint(0, 20000, (1, 1024))
logits = model(x) # (1, 1024, 20000)
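
The shift_tokens option is easiest to understand on a tiny tensor. The sketch below mirrors the snippet quoted in the comments further down; the helper name shift_half_features is hypothetical and only wraps those three lines for illustration.

```python
import torch
import torch.nn.functional as F

def shift_half_features(x):
    """x: (batch, seq, dim). Shift the first half of the features one step forward in time."""
    x_shift, x_pass = x.chunk(2, dim=-1)
    # on the sequence axis: pad one zero row at the start, crop one row from the end
    x_shift = F.pad(x_shift, (0, 0, 1, -1), value=0.)
    return torch.cat((x_shift, x_pass), dim=-1)

x = torch.arange(1., 9.).reshape(1, 4, 2)   # 4 tokens with 2 features each
print(shift_half_features(x))
# tensor([[[0., 2.],
#          [1., 4.],
#          [3., 6.],
#          [5., 8.]]])
```

After the shift, token t carries its own second feature together with the first feature of token t - 1. The misalignment between the two halves is deliberate: it gives every position direct access to part of the previous token, and it never leaks information from the future.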

Test on Autoregressive Enwik8

$ python train.py

Citations

@article{Hua2022TransformerQI,
    title   = {Transformer Quality in Linear Time},
    author  = {Weizhe Hua and Zihang Dai and Hanxiao Liu and Quoc V. Le},
    journal = {ArXiv},
    year    = {2022},
    volume  = {abs/2202.10447}
}

@software{peng_bo_2021_5196578,
    author    = {PENG Bo},
    title     = {BlinkDL/RWKV-LM: 0.01},
    month     = {aug},
    year      = {2021},
    publisher = {Zenodo},
    version   = {0.01},
    doi       = {10.5281/zenodo.5196578},
    url       = {https://doi.org/10.5281/zenodo.5196578}
}

Comments

einsum operation in Linear Attention Part
Hi, Thanks a lot for your FLASH_pytorch, which helps a lot. I found that there are some differences from the paper in the Linear Attention Part: https://github.com/lucidrains/FLASH-pytorch/blob/main/flash_pytorch/flash_pytorch.py#L342-L343

lin_kv = einsum('b g n d, b g n e -> b d e', lin_k, v) / n
lin_out = einsum('b g n d, b d e -> b g n e', lin_q, lin_kv)

Here lin_kv is three-dim (bde). The code in the paper is

lin_kv = tf.einsum('bhke,bgh→bgke', lin_kv, mask)
linear = tf.einsum('bgnk,bgke→bgne', lin_q, lin_kv)

where lin_kv is four-dim (bgke). It seems that the two ways are not equivalent.

Looking forward to your reply. Best,
opened by ShomyLiu 5
mask error
x = torch.randint(0, 20000, (1, 1024))
mask = x.ne(0)
logits = model(x, mask=mask)

RuntimeError: The size of tensor a (1024) must match the size of tensor b (128) at non-singleton dimension 2
opened by keyunluo 1
Speed on TPU

Hi, thanks for the code! I tested it on a Google TPU v3, and the training speed seems slower than I expected. Maybe there is some operation that is not lowered on TPU.

opened by magicknight 0
About the "shift_tokens"

Thank you for your amazing code.

In the FLASH class, I found a flag, shift_tokens, and the corresponding code is as follows:

if self.shift_tokens:
    x_shift, x_pass = normed_x.chunk(2, dim = -1)
    x_shift = F.pad(x_shift, (0, 0, 1, -1), value = 0.)
    normed_x = torch.cat((x_shift, x_pass), dim = -1)

Assume normed_x has shape [1024, 512], so x_shift and x_pass each have shape [1024, 256]. The code then adds a row of zeros at the start of x_shift, removes its last row, and concatenates x_shift and x_pass to get normed_x.

In my opinion, after the F.pad operation the rows of x_shift and x_pass no longer line up.

May I know why it works?

Kang

opened by kangzhao2 1
Cross-Attention?

Hi, @lucidrains. Thank you for sharing this excellent implementation with us all! Do you have any thoughts as to what changes would need to be made to make cross-attention possible with your FLASH model?

opened by amorehead 2

Releases(0.1.6)

0.1.6 (Sep 23, 2022)
v0.1.5 (Jun 19, 2022)
v0.1.4 (Jun 18, 2022)
0.1.2 (Apr 8, 2022)
0.1.1 (Mar 29, 2022)
0.0.15a (Mar 29, 2022)
0.0.14 (Mar 29, 2022)
0.0.12 (Mar 29, 2022)
0.0.11 (Mar 29, 2022)
0.0.10 (Mar 29, 2022)
0.0.9 (Mar 29, 2022)
0.0.8 (Mar 29, 2022)
0.0.7 (Mar 29, 2022)
0.0.6 (Mar 29, 2022)
0.0.1a (Mar 29, 2022)
0.0.5 (Mar 28, 2022)
0.0.4 (Mar 28, 2022)
0.0.3 (Mar 28, 2022)
0.0.2a (Mar 28, 2022)
0.0.1 (Mar 28, 2022)

Owner

Phil Wang
