Nystromformer: A Nystrom-based Algorithm for Approximating Self-Attention

Overview


April 6, 2021

We extended segment-means to compute landmarks without requiring the sequence length to be divisible by the number of landmarks. We then used Nystromformer to directly deploy T2T-ViT_t-14 for image classification without retraining. Our T2T-ViT-Nys-14 achieves 78% top-1 accuracy, outperforming Performer/Linformer by +4.3%/+12.7% in this direct-deployment setting.

Feb 27th, 2021

We fixed the coefficient computation of the initial Z_0, which can lead to faster convergence to the pseudoinverse. The original implementation has a scale difference; we keep it as the default option, but the added initialization is recommended. Thanks @sbodenstein for pointing out the difference.
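
For reference, here is a minimal PyTorch sketch of the degree-7 iteration from Lemma 1 with a per-matrix Z_0 scaling of the kind described above, i.e. Z_0 = A^T / (||A||_1 * ||A||_inf); the function name is illustrative and this is not the repository's exact iterative_inv.

import torch

def iterative_pinv(A, n_iter=6):
    # Approximate the Moore-Penrose pseudoinverse of a batched square kernel A
    # (e.g., the m x m landmark softmax kernel) without an explicit inverse.
    I = torch.eye(A.shape[-1], device=A.device, dtype=A.dtype)
    # Per-matrix scaling: ||A||_1 = max column sum, ||A||_inf = max row sum.
    col = A.abs().sum(dim=-2).amax(dim=-1)
    row = A.abs().sum(dim=-1).amax(dim=-1)
    Z = A.transpose(-1, -2) / (col * row)[..., None, None]
    for _ in range(n_iter):
        AZ = A @ Z
        Z = 0.25 * Z @ (13 * I - AZ @ (15 * I - AZ @ (7 * I - AZ)))
    return Z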

Feb 17th, 2021

We have released the source code of our PyTorch reimplementation of the Long Range Arena (LRA) benchmark, which evaluates the generalization ability of models on diverse long-sequence tasks. Our code is based on the official JAX LRA implementation. The Reformer PyTorch implementation is from Hugging Face and the Performer PyTorch implementation is from lucidrains.

Feb 14th, 2021

We have released the scores on individual LRA tasks.

Feb 9th, 2021

We have released the average score across LRA tasks.

Transformers have emerged as a powerful workhorse for a broad range of natural language processing tasks. A key component that drives the impressive performance of Transformers is the self-attention mechanism, which identifies/encodes the influence or dependence of other tokens on each specific token. Its benefits notwithstanding, the quadratic complexity of self-attention in the input sequence length has limited its application to longer sequences – a topic being actively studied in the community. To address this limitation, we propose Nystromformer – a model that exhibits excellent scalability as a function of sequence length. Our idea is based on adapting the Nystrom method to approximate standard self-attention with an efficient O(n) complexity.
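
For intuition, here is a minimal PyTorch sketch of the approximation (illustrative only, not the repository's implementation): landmark queries and keys are formed by segment-means, and the full n x n softmax kernel is replaced by a product of three much smaller kernels. For brevity, the sketch assumes the sequence length is divisible by the number of landmarks and uses torch.linalg.pinv in place of the paper's iterative pseudoinverse.

import torch

def nystrom_attention(Q, K, V, num_landmarks=64):
    # Q, K, V: (batch, seq_len, head_dim); seq_len is assumed to be divisible
    # by num_landmarks so segment-means reduces to a reshape + mean.
    b, n, d = Q.shape
    m = num_landmarks
    scale = d ** -0.5

    # Landmarks via segment-means over contiguous segments of queries/keys.
    Q_l = Q.reshape(b, m, n // m, d).mean(dim=2)
    K_l = K.reshape(b, m, n // m, d).mean(dim=2)

    # Three small softmax kernels replace the full n x n attention matrix.
    F1 = torch.softmax(Q @ K_l.transpose(-1, -2) * scale, dim=-1)    # (b, n, m)
    A = torch.softmax(Q_l @ K_l.transpose(-1, -2) * scale, dim=-1)   # (b, m, m)
    F2 = torch.softmax(Q_l @ K.transpose(-1, -2) * scale, dim=-1)    # (b, m, n)

    # The paper approximates pinv(A) iteratively; torch.linalg.pinv keeps the sketch short.
    return F1 @ torch.linalg.pinv(A) @ (F2 @ V)

# Example: two sequences of length 512 with 64-dimensional heads and 64 landmarks.
out = nystrom_attention(torch.randn(2, 512, 64), torch.randn(2, 512, 64), torch.randn(2, 512, 64))
print(out.shape)  # torch.Size([2, 512, 64])

Only (n x m), (m x m), and (m x n) matrices are ever materialized, which is where the O(n) scaling in sequence length comes from.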

Requirements

docker, nvidia-docker

Datasets

The pretraining dataset consists of English Wikipedia and BookCorpus. For pretraining on long sequences, we added one third Stories and one third RealNews. All downloaded data files should be placed in the corresponding folder under data-preprocessing. The original English Wikipedia dump is preprocessed using wikiextractor, and the resulting files are placed in data-preprocessing/wiki. Then, run data-preprocessing/<corpus>/preprocess.py under each corresponding folder to generate data files in a unified format. After preprocessing, run data-preprocessing/preprocess_data_<sequence length>.py to generate pretraining data of a specific sequence length.
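
As a purely hypothetical illustration of that last step (this is not the repository's preprocess_data script), the fixed-length preprocessing conceptually packs token ids into non-overlapping segments of the target sequence length:

def chunk_token_ids(token_ids, seq_len=512):
    # Pack a long stream of token ids into full, non-overlapping segments of
    # length seq_len; any trailing remainder is dropped for simplicity.
    n_full = len(token_ids) // seq_len
    return [token_ids[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]

segments = chunk_token_ids(list(range(2000)), seq_len=512)
print(len(segments), len(segments[0]))  # 3 512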

Pretraining

To start pretraining with a specific configuration: create a folder (for example, nystrom-512) and write <model folder>/config.json to specify the model and training configuration, then, under <model folder>, run

docker run --rm --name=pretrain \
  --network=host --ipc=host --gpus all \
  -v "$PWD/../data-preprocessing/512-roberta:/dataset" \
  -v "$PWD/../code:/code" \
  -v "$PWD:/model" \
  -d mlpen/bert_env:0 \
  /bin/bash -c \
  "python3 /code/run_pretrain.py >> /model/pretrain.txt 2>&1"

All output is redirected to <model folder>/pretrain.txt. The command creates a <model folder>/model folder holding all checkpoints and the log file. Training can be stopped at any time by running docker kill pretrain, and can be resumed from the last checkpoint using the same command that started pretraining.

Pretraining from Different Model's Checkpoint

Copy a checkpoint (one of the .model or .cp files) from the <source model folder>/model folder to <model folder> and add a key-value pair in <model folder>/config.json: "from_cp": "/model/<checkpoint file>". One example is shown in nystrom-4096/config.json. This procedure also works for extending the maximum sequence length of a model (for example, use nystrom-512 pretrained weights as the initialization for nystrom-4096).

GLUE

To finetune the model on GLUE tasks, download the GLUE datasets and place them under the glue folder, then, under <model folder>, run

docker run --rm --name=glue \
  --network=host --ipc=host --gpus all \
  -v "$PWD/../glue:/glue" \
  -v "$PWD/../code:/code" \
  -v "$PWD:/model" \
  -d mlpen/bert_env:0 \
  /bin/bash -c \
  "python3 /code/run_glue.py --batch_size 32 --lr 3e-5 --epoch 5 --task MRPC --checkpoint 99 >> /model/glue.txt 2>&1"

batch_size, lr, epoch, task, and checkpoint can be changed to finetune on a different task, with different hyperparameters, or from a different checkpoint. All output is redirected to <model folder>/glue.txt. The log file is located in the <model folder>/model folder.

Citation

@inproceedings{xiong2021nystromformer,
  title={Nystr{\"o}mformer: A Nystr{\"o}m-based Algorithm for Approximating Self-Attention},
  author={Xiong, Yunyang and Zeng, Zhanpeng and Chakraborty, Rudrasis and Tan, Mingxing and Fung, Glenn and Li, Yin and Singh, Vikas},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  year={2021}
}
Comments
  • Pre-trained weights

    Hi,

    Thank you for this very clean repository! I wonder whether any checkpoints will be released? If yes, I'm interested in adding those to the HuggingFace Transformers library (the Reformer and LongFormer are already there, but the Linformer and the Nystromformer aren't yet). Since we only have to replace the self-attention mechanism, this seems quite straightforward in terms of modeling.

    Kind regards,

    Niels

    opened by NielsRogge 6
  • Possible bug in iterative_inv

    When the input of iterative_inv is a 4-tensor (which is how it is used in NystromAttention), the calculation of the initial value of V in iterative_inv here takes the max over all dimensions of the 4-tensor K, rather than just the matrix dimensions as claimed in Lemma 1:

    [screenshot of Lemma 1 and the iterative_inv initialization omitted]

    This has a significant effect on the accuracy of the approximation on random data I've tried it on.

    opened by sbodenstein 6
  • Preprocessing datasets

    Hello,

    I'm unclear on the pretraining procedures, in particular the preprocessing of the datasets.

    Unless I'm mistaken, https://github.com/mlpen/Nystromformer/blob/main/data-preprocessing/preprocess_data_512.py suggests we just produce segments of size 512. I'm not sure I understand how SOP is defined in this case? Actually, SOP doesn't seem to be used at all in the pretrain script.

    opened by thomasw21 3
  • Lemma 1 parenthesis seem wrong

    Hello, first of all great work! This looks very promising.

    As I'm reading a bit more in depth, I'm wondering if you might have an error in the parentheses in Lemma 1. Both the code and the proof seem to indicate it.

    [screenshot of Lemma 1 omitted]

    The last term should apply only to AZ instead of (15I - AZ), no?

    This would "fix" the following lines in the proof

    image

    opened by thomasw21 3
  • Results on Long Range Arena

    Hi, I saw the tweet about this paper and wanted to know the performance of Nystromformer on each individual task in Long Range Arena. (Also, what is "Standard" in the graph posted? The standard Transformer does much worse in LRA.)

    opened by joaogui1 3
  • Length of Text Classification Task of LRA

    Hello, I noticed that in your paper, Table 3 shows Text (4K). However, max_length was set to 1000 in your code LRA/datasets/text.py. Is the Text task length 4K or 1K for the scores given in Table 3?

    opened by lxchtan 2
  • Question about LRA Pathfinder task

    Thanks for your great work.

    I noticed that in the paper and also in this repo, you report results on the LRA tasks. I am trying to train your model on LRA with the code given in this repo. For the Pathfinder task, I followed the dataset preparation guidance and got 3 sets of pickle files, i.e. pathfinder32-curv_baseline, pathfinder32-curv_contour_length_9, and pathfinder32-curv_contour_length_14, each with train/dev/test splits.

    I would like to know which set gives the 70.94 result in Table 3.

    opened by YJHMITWEB 2
  • Some questions regarding the paper

    Hi, thanks for sharing the code.

    I have read your paper recently and I have some questions about some details. I would be very grateful if you can help me with those questions.

    1. How do you derive eqn. (4)? As far as I know, the Nyström method only applies to semidefinite matrices, not to products of matrices, let alone the softmax of a matrix product. I did read through the reference above eqn. (4), "Improving CUR Matrix Decomposition and the Nyström Approximation via Adaptive Sampling", but I failed to find any relevant results. Could you please point me to where I can find these results?

    2. The other question is about the key simplification, which first applies the Nyström method inside the softmax and then commutes the matrix product and the softmax operation (computing the softmax first and then the matrix product). While the first step (Nyström) is indeed unbiased, the second step (commuting the operations) is not necessarily exact. In summary, it's hard to say what the final equation (13) really computes, and why it works at all.

    opened by wzm2256 1
  • LRA cifar10.py never stop

    When trying to run python3 cifar10.py to prepare the CIFAR-10 dataset, the script gets into an infinite loop because input_pipeline.get_cifar10_datasets returns a dataset with infinitely many samples (it uses the repeat function). What do you think I should do to fix it?

    opened by liranringel 0
  • Please add a license to this repo

    First, thank you for sharing this project with us!

    Could you please add an explicit LICENSE file to the repo so that it's clear under what terms the content is provided, and under what terms user contributions are licensed?

    Per GitHub docs on licensing:

    [...] without a license, the default copyright laws apply, meaning that you retain all rights to your source code and no one may reproduce, distribute, or create derivative works from your work. If you're creating an open source project, we strongly encourage you to include an open source license.

    Thanks!

    opened by mbrukman 0
  • score of softmax on Text4k; linformer-256 & nystrom-64 doesn't work

    Hi,

    Thanks for the excellent work!

    I found some issues in my humble trials (I didn't change anything in the code):

    1. Using softmax attention on Text4k, I got ~63.7 accuracy instead of the 65.02 reported in your paper.
    2. I also tried linear attention on Text4k and got ~64 accuracy, which is even higher than the vanilla Transformer; did you get the same result on your side?
    3. The attention types linformer-256 and nystrom-64 don't work; the errors are either dimension mismatches or config key errors. It seems that not all attention types run successfully in the released code. By the way, I didn't try all the choices.

    Thank you for your time, I look forward to your reply~

    Ziwei

    opened by ZiweiHe 1
  • Self-attention weights don't always sum to 1

    Hi all,

    Nice work. I have a question, however. I see in my problem setting that the self-attention (SA) weights don't always sum to ~1 (row-wise). I assume this is due to the out-of-sample approximation and structural properties of the true SA matrix. Are there ways to reduce the error and obtain a more accurate approximation, or other ways to account for this? I will try with more landmarks myself and see how the dynamics change.

    Thanks in advance,

    Dennis

    opened by DennisHaijma 2
  • Retrieval accuracy different from official JAX/FLAX implementation

    I wonder why the Retrieval accuracy is almost 20% higher than the official JAX/FLAX implementation. As the paper says, "While we achieve consistent results reported in (Tay et al. 2020) for most tasks in our PyTorch reimplementation, the performance on Retrieval task is higher for all models following the hyperparameters in (Tay et al. 2020)." Is there any difference aside from the hyperparameters?

    opened by cwq159 1