# Nyström Attention

Implementation of Nyström self-attention, from the paper *Nyströmformer*.
## Install

```bash
$ pip install nystrom-attention
```
## Usage

```python
import torch
from nystrom_attention import NystromAttention

attn = NystromAttention(
    dim = 512,
    dim_head = 64,
    heads = 8,
    num_landmarks = 256,   # number of landmarks
    pinv_iterations = 6,   # number of Moore-Penrose iterations for approximating the pseudoinverse (sketched below); 6 was recommended by the paper
    residual = True        # whether to add an extra residual with the values; supposedly leads to faster convergence when turned on
)

x = torch.randn(1, 16384, 512)
mask = torch.ones(1, 16384).bool()

attn(x, mask = mask) # (1, 16384, 512)
```
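The `pinv_iterations` setting controls an iterative Moore-Penrose pseudoinverse applied to the landmark-to-landmark attention matrix. Below is a minimal, self-contained sketch of that cubic iteration in plain PyTorch; the `iterative_pinv` helper name and the standalone setup are illustrative assumptions, not part of the library's public API.

```python
import torch

def iterative_pinv(a, iters = 6):
    # Hypothetical helper (not a library export): cubic Newton-Schulz-style iteration
    # that converges toward the Moore-Penrose pseudoinverse of a square matrix `a`.
    abs_a = a.abs()
    # Scale the transpose by the max row and column sums so the iteration starts
    # inside its region of convergence
    z = a.transpose(-1, -2) / (abs_a.sum(dim = -1).max() * abs_a.sum(dim = -2).max())
    eye = torch.eye(a.shape[-1], device = a.device, dtype = a.dtype)
    for _ in range(iters):
        az = a @ z
        z = 0.25 * z @ (13 * eye - az @ (15 * eye - az @ (7 * eye - az)))
    return z

# Stand-in for a landmark-to-landmark attention matrix (row-stochastic after softmax)
scores = torch.randn(256, 256)
attn_mat = scores.softmax(dim = -1)
approx_pinv = iterative_pinv(attn_mat, iters = 6)   # (256, 256)
```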
Nyströmformer, built from layers of Nyström attention:
```python
import torch
from nystrom_attention import Nystromformer

model = Nystromformer(
    dim = 512,
    dim_head = 64,
    heads = 8,
    depth = 6,
    num_landmarks = 256,
    pinv_iterations = 6
)

x = torch.randn(1, 16384, 512)
mask = torch.ones(1, 16384).bool()

model(x, mask = mask) # (1, 16384, 512)
```
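To use the full model on token inputs, you can wrap it with an embedding layer and a task head. The sketch below is a minimal, assumed setup for sequence classification; `SequenceClassifier`, the embedding, the mean-pooling, and the linear head are illustrative only, with just `Nystromformer` coming from the library.

```python
import torch
from torch import nn
from nystrom_attention import Nystromformer

# Hypothetical wrapper, not part of the library
class SequenceClassifier(nn.Module):
    def __init__(self, num_tokens, dim = 512, num_classes = 2, **encoder_kwargs):
        super().__init__()
        self.embed = nn.Embedding(num_tokens, dim)
        self.encoder = Nystromformer(dim = dim, **encoder_kwargs)
        self.to_logits = nn.Linear(dim, num_classes)

    def forward(self, token_ids, mask = None):
        x = self.embed(token_ids)              # (batch, seq, dim)
        x = self.encoder(x, mask = mask)       # (batch, seq, dim)
        x = x.mean(dim = 1)                    # mean-pool over the sequence
        return self.to_logits(x)               # (batch, num_classes)

model = SequenceClassifier(num_tokens = 20000, dim = 512, depth = 6, heads = 8, num_landmarks = 256)

tokens = torch.randint(0, 20000, (1, 16384))
mask = torch.ones(1, 16384).bool()

logits = model(tokens, mask = mask)            # (1, 2)
```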
You can also import it as `Nystromer` if you wish

```python
from nystrom_attention import Nystromer
```
## Citations

```bibtex
@misc{xiong2021nystromformer,
    title         = {Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention},
    author        = {Yunyang Xiong and Zhanpeng Zeng and Rudrasis Chakraborty and Mingxing Tan and Glenn Fung and Yin Li and Vikas Singh},
    year          = {2021},
    eprint        = {2102.03902},
    archivePrefix = {arXiv},
    primaryClass  = {cs.CL}
}
```