# Stochastic Positional Encoding (SPE)
This is the source code repository for the ICML 2021 paper Relative Positional Encoding for Transformers with Linear Complexity by Antoine Liutkus, Ondřej Cífka, Shih-Lun Wu, Umut Şimşekli, Yi-Hsuan Yang and Gaël Richard.
In this paper, we propose Stochastic Positional Encoding (SPE), which provably behaves like relative PE while being compatible with linear-complexity Transformers. We do this by drawing a connection between positional encoding and cross-covariance structures of correlated Gaussian processes.
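To make the idea concrete, here is a minimal, self-contained PyTorch sketch (not the code from this repository) of the sinusoidal variant of this idea: the position codes for queries and keys share the same Gaussian draws, so their empirical cross-covariance approximates a stationary, i.e. relative, positional kernel. All names, shapes and hyperparameters below are illustrative assumptions.

```python
# Minimal illustration (not the repository's implementation) of sinusoidal SPE:
# query and key position codes share the same Gaussian noise, so their
# cross-covariance depends only on the relative position m - n.
import math
import torch

num_sines, num_realizations, length = 8, 4096, 32     # illustrative sizes
freqs = torch.rand(num_sines) * 0.5                   # hypothetical frequencies (cycles/step)
gains = torch.rand(num_sines)                         # hypothetical per-sine gains
positions = torch.arange(length, dtype=torch.float32)

# One (cos, sin) pair per sine, evaluated at every position
phases = 2 * math.pi * freqs[:, None] * positions[None, :]          # (num_sines, length)
basis = torch.stack([torch.cos(phases), torch.sin(phases)], dim=1)  # (num_sines, 2, length)

# Shared standard-normal noise: this sharing is what correlates the two codes
z = torch.randn(num_realizations, num_sines, 2)

qbar = torch.einsum('rsc,scl->rl', z, gains[:, None, None] * basis)  # (num_realizations, length)
kbar = torch.einsum('rsc,scl->rl', z, basis)

# Empirical cross-covariance between positions m and n: approximately
# sum_s gains[s] * cos(2*pi*freqs[s]*(m - n)), i.e. a relative PE kernel.
cov = qbar.T @ kbar / num_realizations                # (length, length)
```

In the paper, the sinusoid parameters are learned, and a convolutional variant (convSPE) is described alongside this sinusoidal one (sineSPE).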
Also check out the companion website with music examples.
Citation:
```bibtex
@inproceedings{pmlr-v139-liutkus21a,
  title = {Relative Positional Encoding for {Transformers} with Linear Complexity},
  author = {Liutkus, Antoine and C{\'i}fka, Ond{\v r}ej and Wu, Shih-Lun and {\c S}im{\c s}ekli, Umut and Yang, Yi-Hsuan and Richard, Ga{\"e}l},
  booktitle = {Proceedings of the 38th International Conference on Machine Learning},
  pages = {7067--7079},
  year = {2021},
  editor = {Meila, Marina and Zhang, Tong},
  volume = {139},
  series = {Proceedings of Machine Learning Research},
  month = {18--24 Jul},
  publisher = {PMLR},
  pdf = {http://proceedings.mlr.press/v139/liutkus21a/liutkus21a.pdf},
  url = {http://proceedings.mlr.press/v139/liutkus21a.html}
}
```
## SPE implementation
We have implemented SPE in PyTorch and JAX/Flax. Each implementation is available as a separate Python package under `src`.
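For a rough picture of how such position codes are meant to be combined with content queries and keys before a linear-attention kernel, here is a schematic sketch. It is not the API of the packages under `src` (see their READMEs for actual usage); the names, shapes and normalization are illustrative assumptions.

```python
# Schematic sketch only: not the packaged API. qbar/kbar stand for SPE position
# codes with an extra "realization" dimension; in actual SPE they are correlated
# across queries and keys as in the paper (here they are just placeholders).
import math
import torch

length, dim, num_realizations = 32, 16, 64
q = torch.randn(length, dim)                        # content queries
k = torch.randn(length, dim)                        # content keys
qbar = torch.randn(length, dim, num_realizations)   # positional codes for queries
kbar = torch.randn(length, dim, num_realizations)   # positional codes for keys

# Element-wise modulation, then flattening (dim, realizations) into one feature
# axis. In expectation, qhat @ khat.T combines the content dot product with the
# relative positional kernel encoded by the codes.
qhat = (q[..., None] * qbar).reshape(length, -1) / math.sqrt(num_realizations)
khat = (k[..., None] * kbar).reshape(length, -1) / math.sqrt(num_realizations)

# qhat and khat can now replace the queries/keys in any linear-complexity
# attention that only touches them through feature maps (e.g. Performer-style).
```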
## Experiments
Each of the 3 experiments (LRA, pop piano generation, groove continuation) has a dedicated directory under `experiments`. See the README files there for how to set up the environment and prepare the datasets. To make sure you have the custom dependencies for each experiment, clone this repository with `--recurse-submodules` or run `git submodule init && git submodule update` after cloning.