Relative Positional Encoding for Transformers with Linear Complexity

Overview

Stochastic Positional Encoding (SPE)

This is the source code repository for the ICML 2021 paper Relative Positional Encoding for Transformers with Linear Complexity by Antoine Liutkus, Ondřej Cífka, Shih-Lun Wu, Umut Şimşekli, Yi-Hsuan Yang and Gaël Richard.

In this paper, we propose Stochastic Positional Encoding (SPE), which provably behaves like relative PE while being compatible with linear-complexity Transformers. We do this by drawing a connection between positional encoding and cross-covariance structures of correlated Gaussian processes.
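
For intuition, here is a small self-contained sketch of the underlying idea (not taken from the paper's code; the kernel, frequencies, and number of realizations are illustrative): draw queries and keys as correlated Gaussian features so that their average dot product matches a chosen relative-position kernel.

    import math
    import torch

    torch.manual_seed(0)

    T, R, K = 128, 4096, 3                    # positions, noise realizations, sinusoidal components
    freqs = torch.tensor([0.01, 0.03, 0.10])  # illustrative frequencies
    lams = torch.tensor([1.0, 0.5, 0.25])     # illustrative component weights

    # Deterministic sinusoidal features of the positions, weighted by sqrt(lambda_k)
    pos = torch.arange(T, dtype=torch.float32)
    phases = 2 * math.pi * freqs[None, :] * pos[:, None]        # (T, K)
    feats = torch.stack([phases.cos(), phases.sin()], dim=-1)   # (T, K, 2)
    feats = feats * lams.sqrt()[None, :, None]

    # Shared Gaussian noise makes the query/key features correlated across positions
    z = torch.randn(R, K, 2)
    qbar = torch.einsum('tkc,rkc->tr', feats, z)                # (T, R)
    kbar = torch.einsum('tkc,rkc->tr', feats, z)                # same construction for keys in this toy

    # Averaging over realizations recovers a stationary (relative) kernel
    approx = qbar @ kbar.T / R                                  # (T, T)
    lags = pos[:, None] - pos[None, :]
    target = (lams * torch.cos(2 * math.pi * freqs * lags[..., None])).sum(-1)
    print((approx - target).abs().max())                        # Monte Carlo error, shrinks like 1/sqrt(R)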

Also check out the companion website with music examples.

Citation:

@inproceedings{pmlr-v139-liutkus21a,
  title     = {Relative Positional Encoding for {Transformers} with Linear Complexity},
  author    = {Liutkus, Antoine and C{\'i}fka, Ond{\v r}ej and Wu, Shih-Lun and {\c S}im{\c s}ekli, Umut and Yang, Yi-Hsuan and Richard, Ga{\"e}l},
  booktitle = {Proceedings of the 38th International Conference on Machine Learning},
  pages     = {7067--7079},
  year      = {2021},
  editor    = {Meila, Marina and Zhang, Tong},
  volume    = {139},
  series    = {Proceedings of Machine Learning Research},
  month     = {18--24 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v139/liutkus21a/liutkus21a.pdf},
  url       = {http://proceedings.mlr.press/v139/liutkus21a.html}
}

SPE implementation

We have implemented SPE in PyTorch and JAX/Flax. Each implementation is available as a separate Python package under src.
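
For reference, a minimal usage sketch of the PyTorch package is shown below. The class and argument names (SineSPE, SPEFilter, num_sines, num_realizations, ...) are taken from the example notebook and the issue snippets quoted further down; check src/pytorch for the exact, current API.

    import torch
    from spe import SineSPE, SPEFilter

    batch, length, num_heads, keys_dim = 8, 100, 4, 64

    # Positional encoder (sinusoidal SPE) and the filter that applies its code
    spe_encoder = SineSPE(num_heads=num_heads, in_features=keys_dim,
                          num_sines=5, num_realizations=64)
    spe_filter = SPEFilter(gated=True, code_shape=spe_encoder.code_shape)

    q = torch.rand(batch, length, num_heads, keys_dim)
    k = torch.rand(batch, length, num_heads, keys_dim)

    # Generate the positional code for this sequence length, then use the
    # filtered queries/keys with any linear-attention mechanism (e.g. Performer).
    pos_code = spe_encoder(q.shape[:2])
    q, k = spe_filter(q, k, pos_code)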

Experiments

Each of the 3 experiments (LRA, pop piano generation, groove continuation) has a dedicated directory under experiments. See the README files there for how to set up the environment and prepare the datasets. To make sure you have the custom dependencies for each experiment, clone this repository with --recurse-submodules or run git submodule init && git submodule update after cloning.

Comments
  • Scale problem

    Hey, I am a little bit confused about the scale.

    Inside SineSPE() you deal with the scale (both d^0.25 and num_realizations^0.25). On the other hand, when you show the application in PyTorch, after applying the filter you divide by sqrt(num_realizations) again; why is that? https://github.com/aliutkus/spe/blob/main/src/pytorch/examples/test_spe.ipynb

    opened by lucastononrodrigues 3
  • sharenoise

    Optimizes memory usage and speed by sharing the SPE across all layers.

    This is done in the following way:

    • not redrawing the noise each time, so that qbar and kbar are shared across all layers. The strategy is to keep qbar and kbar untouched as long as their shapes are OK; they must be manually reset if required.
    • removing the use of einsum. It is indeed much nicer, but for some mysterious reason it apparently did not save RAM when reusing qbar and kbar.

    The notebook applies the SPE many times in a row to simulate many layers.

    :warning: Note that my_spe.reset() must now be called explicitly each time a new SPE must be computed (typically at each batch during training).
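
    A minimal sketch of the resulting usage pattern (the training loop and tensor shapes are illustrative and not part of this PR; only the per-batch reset() call and the sharing of one positional code across layers follow the description above):

    import torch
    from spe import SineSPE, SPEFilter

    batch, length, num_heads, keys_dim, num_layers = 8, 100, 4, 64, 6
    spe_encoder = SineSPE(num_heads=num_heads, in_features=keys_dim,
                          num_sines=5, num_realizations=64)
    spe_filter = SPEFilter(gated=True, code_shape=spe_encoder.code_shape)

    for step in range(3):                                       # stand-in for iterating over batches
        spe_encoder.reset()                                     # redraw the noise once per batch
        pos_code = spe_encoder((batch, length))                 # computed once, shared by all layers
        for _ in range(num_layers):
            q = torch.rand(batch, length, num_heads, keys_dim)  # stand-in for this layer's queries
            k = torch.rand(batch, length, num_heads, keys_dim)  # stand-in for this layer's keys
            qbar, kbar = spe_filter(q, k, pos_code)             # same positional code every layer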

    opened by aliutkus 2
  • Support ndim>1

    Hi there,

    I tried to use ConvSPE with images (ndim=2), but spe.py failed. The following snippet replicates the error with the current SPE implementation; notice that I am using a 2D (50x50) input. This PR fixes the error. If this fix is wrong, please suggest a better solution. Thanks!

    import spe
    import torch

    device = 'cuda' if torch.cuda.is_available() else 'cpu'

    keys_dim = 64
    num_heads = 1
    num_realizations = 64
    kernel_size = 20
    n = 50
    batchsize = 8

    # 2D convolutional SPE and the corresponding gated filter
    poscoder = spe.ConvSPE(ndim=2, in_features=keys_dim, kernel_size=kernel_size,
                           num_heads=num_heads, num_realizations=num_realizations)
    filter = spe.SPEFilter(gated=True, code_shape=poscoder.code_shape).to(device)

    poscoder.to(device=device)

    # Image-shaped queries/keys: (batch, 50, 50, heads, dim)
    q = torch.rand(batchsize, n, n, num_heads, keys_dim, device=device, requires_grad=True)
    k = torch.rand(batchsize, n, n, num_heads, keys_dim, device=device, requires_grad=True)

    # Generate the 2D positional code and apply it to the queries/keys
    poscode = poscoder(q.shape[:3])
    q, k = filter(q, k, poscode)
    
    opened by ahmdtaha 1
  • Fix submodules

    I was trying to reproduce your experiments, and when I ran git submodule init && git submodule update after cloning, I faced several errors while git@github.com:cifkao/fast-transformers.git was being cloned: git@github.com: Permission denied (publickey). Changing the URL from git@github.com:cifkao/fast-transformers.git to https://github.com/cifkao/fast-transformers.git seems to fix the issue.

    opened by maximzubkov 1
  • incorporate a bias parameter

    Instead of deciding how many features should be left unchanged or modulated by SPE, let's train that!

    The idea is to modify the model so that we have $P_d(m,n) \leftarrow \lambda P_d(m-n) + (1-\lambda) \cdot 1$.

    Depending on $\lambda$, this either turns the dimension into a normal (unchanged) one or uses SPE for it.

    In practice, because of the cross terms, this is implemented by drawing noise again, but this time the noise is the same for all time lags (it is hence small). This way, the covariance matrix is full of ones for $Q_d K_d$, but still zero for $Q_d K_{d'}$ ($d \neq d'$).
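
    One way to read this (a reconstruction from the description above, not necessarily the exact implementation): write $\bar{q}_d(m) = \sqrt{\lambda}\,\bar{q}^{\mathrm{SPE}}_d(m) + \sqrt{1-\lambda}\,z_d$ and $\bar{k}_d(n) = \sqrt{\lambda}\,\bar{k}^{\mathrm{SPE}}_d(n) + \sqrt{1-\lambda}\,z_d$, where $z_d$ is a standard Gaussian draw shared across all positions. Independence of the two noise sources then gives $\mathbb{E}[\bar{q}_d(m)\,\bar{k}_d(n)] = \lambda P_d(m-n) + (1-\lambda)$, while $\mathbb{E}[\bar{q}_d(m)\,\bar{k}_{d'}(n)] = 0$ for $d \neq d'$.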

    I didn't test heavily, please have a look

    opened by aliutkus 1
  • refactored for allowing layer-dependent gating

    This implements a new two-step behaviour: first generate the positional encodings, then apply them on demand with a Filter module that allows gating.

    This enables layer-wise gating parameters.

    opened by aliutkus 0
  • Wrong axis in jax spe summation

    For the JAX implementation, on line 210 of spe.py, should the axis summed over be -1 instead of -2? When using -2, the size of the last output dimension is num_realizations rather than the query/key dimension:

    return (spe[:, :keys.shape[1]] * keys[..., None]).sum(axis=-1)
    
    opened by tomweingarten 0
  • Very slow algorithm, is that normal?

    Hello,

    I implemented the algorithm in the Vision Transformer architecture in the following way:

    # inside __init__()
    self.spe = SineSPE(num_heads=head_cnt, in_features=in_dim,
                       num_sines=5, num_realizations=64)
    self.filter = SPEFilter(gated=False, code_shape=self.spe.code_shape)

    # inside forward()
    q, k = self.filter(q, k, self.spe(q.shape[:2]))
    qk, kp = performer(...)
    out = lin_attention(...)
    

    The model I am using has 4 layers, 6 heads, embedding dimension 384, and patch_size=4.

    Training for 100 epochs on CIFAR-100 converges to 42.3%, versus 45.3% without SPE. Although this can be expected, with SPE the training time is around 6x longer; is that normal? Performers + ViT takes 39 minutes, while Performers + ViT + SPE takes around 4 hours. For both I am using 2 Titan XP GPUs.

    This is very problematic for me because I was considering scaling these experiments up to ImageNet.

    I would also like to know how I can implement the indexing T = N^2 for images (where did you do it in the LRA benchmark?), according to Section 2 of the paper.

    Many thanks!
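
    One possible way to obtain a single index over T = N^2 positions is simply to flatten the spatial grid before applying SineSPE. The sketch below uses illustrative shapes and the API names from the snippets above; the LRA code may do this differently.

    import torch
    from spe import SineSPE, SPEFilter

    batch, n, num_heads, keys_dim = 8, 50, 4, 64
    spe_encoder = SineSPE(num_heads=num_heads, in_features=keys_dim,
                          num_sines=5, num_realizations=64)
    spe_filter = SPEFilter(gated=False, code_shape=spe_encoder.code_shape)

    # Image-shaped queries/keys: (batch, N, N, heads, dim)
    q = torch.rand(batch, n, n, num_heads, keys_dim)
    k = torch.rand(batch, n, n, num_heads, keys_dim)

    # Flatten the two spatial axes into one sequence axis of length T = N^2,
    # so that positions are indexed 0..N^2-1.
    q = q.reshape(batch, n * n, num_heads, keys_dim)
    k = k.reshape(batch, n * n, num_heads, keys_dim)

    pos_code = spe_encoder(q.shape[:2])   # positional code for a length-N^2 sequence
    q, k = spe_filter(q, k, pos_code)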

    opened by lucastononrodrigues 3
Owner
Antoine Liutkus
Researcher at Inria