Nystromformer: A Nystrom-based Algorithm for Approximating Self-Attention

Overview


April 6, 2021

We extended segment-means to compute landmarks without requiring the sequence length to be divisible by the number of landmarks. We then used Nystromformer to directly deploy T2T-ViT_t-14 for image classification without retraining. Our T2T-ViT-Nys-14 achieves 78% top-1 accuracy, outperforming Performer/Linformer by +4.3%/+12.7% in this direct-deployment setting.

Feb 27th, 2021

We fixed the coefficient computation of the initial Z_0, which can lead to faster convergence to the pseudoinverse. The original implementation has a scale difference; we keep it as the default option, but the added initialization is recommended. Thanks @sbodenstein for pointing out the difference.
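
For reference, here is a minimal PyTorch sketch of the degree-7 iteration from Lemma 1 with a per-matrix Z_0 scaling of the kind described above, i.e. Z_0 = A^T / (||A||_1 * ||A||_inf); the function name is illustrative and this is not the repository's exact iterative_inv.

import torch

def iterative_pinv(A, n_iter=6):
    # Approximate the Moore-Penrose pseudoinverse of a batched square kernel A
    # (e.g., the m x m landmark softmax kernel) without an explicit inverse.
    I = torch.eye(A.shape[-1], device=A.device, dtype=A.dtype)
    # Per-matrix scaling: ||A||_1 = max column sum, ||A||_inf = max row sum.
    col = A.abs().sum(dim=-2).amax(dim=-1)
    row = A.abs().sum(dim=-1).amax(dim=-1)
    Z = A.transpose(-1, -2) / (col * row)[..., None, None]
    for _ in range(n_iter):
        AZ = A @ Z
        Z = 0.25 * Z @ (13 * I - AZ @ (15 * I - AZ @ (7 * I - AZ)))
    return Z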

Feb 17th, 2021

We have released the source code of our PyTorch reimplementation of the Long Range Arena (LRA) benchmark, which evaluates the generalization ability of models on diverse long-sequence tasks. Our code is based on the official JAX LRA implementation. The Reformer PyTorch implementation is from Hugging Face and the Performer PyTorch implementation is from lucidrains.

Feb 14th, 2021

We have released the scores on individual LRA tasks.

Feb 9th, 2021

We have released the average score across LRA tasks.

Transformers have emerged as a powerful workhorse for a broad range of natural language processing tasks. A key component that drives the impressive performance of Transformers is the self-attention mechanism, which identifies/encodes the influence or dependence of other tokens on each specific token. Its benefits notwithstanding, the quadratic complexity of self-attention in the input sequence length has limited its application to longer sequences – a topic being actively studied in the community. To address this limitation, we propose Nystromformer – a model that exhibits excellent scalability as a function of sequence length. Our idea is based on adapting the Nystrom method to approximate standard self-attention with an efficient O(n) complexity.
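
For intuition, here is a minimal PyTorch sketch of the approximation (illustrative only, not the repository's implementation): landmark queries and keys are formed by segment-means, and the full n x n softmax kernel is replaced by a product of three much smaller kernels. For brevity, the sketch assumes the sequence length is divisible by the number of landmarks and uses torch.linalg.pinv in place of the paper's iterative pseudoinverse.

import torch

def nystrom_attention(Q, K, V, num_landmarks=64):
    # Q, K, V: (batch, seq_len, head_dim); seq_len is assumed to be divisible
    # by num_landmarks so segment-means reduces to a reshape + mean.
    b, n, d = Q.shape
    m = num_landmarks
    scale = d ** -0.5

    # Landmarks via segment-means over contiguous segments of queries/keys.
    Q_l = Q.reshape(b, m, n // m, d).mean(dim=2)
    K_l = K.reshape(b, m, n // m, d).mean(dim=2)

    # Three small softmax kernels replace the full n x n attention matrix.
    F1 = torch.softmax(Q @ K_l.transpose(-1, -2) * scale, dim=-1)    # (b, n, m)
    A = torch.softmax(Q_l @ K_l.transpose(-1, -2) * scale, dim=-1)   # (b, m, m)
    F2 = torch.softmax(Q_l @ K.transpose(-1, -2) * scale, dim=-1)    # (b, m, n)

    # The paper approximates pinv(A) iteratively; torch.linalg.pinv keeps the sketch short.
    return F1 @ torch.linalg.pinv(A) @ (F2 @ V)

# Example: two sequences of length 512 with 64-dimensional heads and 64 landmarks.
out = nystrom_attention(torch.randn(2, 512, 64), torch.randn(2, 512, 64), torch.randn(2, 512, 64))
print(out.shape)  # torch.Size([2, 512, 64])

Only (n x m), (m x m), and (m x n) matrices are ever materialized, which is where the O(n) scaling in sequence length comes from.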

Requirements

docker, nvidia-docker

Datasets

The pretraining dataset consists of English Wikipedia and BookCorpus. For pretraining on long sequences, we added one third Stories and one third RealNews. All downloaded data files should be placed in the corresponding folder under data-preprocessing. The original English Wikipedia dump is preprocessed using wikiextractor, and the resulting files are placed in data-preprocessing/wiki. Then, run data-preprocessing/<corpus>/preprocess.py under each corresponding folder to generate data files in a unified format. After preprocessing, run data-preprocessing/preprocess_data_<sequence length>.py to generate pretraining data of a specific sequence length.
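
As a purely hypothetical illustration of that last step (this is not the repository's preprocess_data script), the fixed-length preprocessing conceptually packs token ids into non-overlapping segments of the target sequence length:

def chunk_token_ids(token_ids, seq_len=512):
    # Pack a long stream of token ids into full, non-overlapping segments of
    # length seq_len; any trailing remainder is dropped for simplicity.
    n_full = len(token_ids) // seq_len
    return [token_ids[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]

segments = chunk_token_ids(list(range(2000)), seq_len=512)
print(len(segments), len(segments[0]))  # 3 512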

Pretraining

To start pretraining with a specific configuration: create a folder (for example, nystrom-512) and write <model folder>/config.json to specify the model and training configuration, then, under <model folder>, run

docker run --rm --name=pretrain \
  --network=host --ipc=host --gpus all \
  -v "$PWD/../data-preprocessing/512-roberta:/dataset" \
  -v "$PWD/../code:/code" \
  -v "$PWD:/model" \
  -d mlpen/bert_env:0 \
  /bin/bash -c \
  "python3 /code/run_pretrain.py >> /model/pretrain.txt 2>&1"

All output is redirected to <model folder>/pretrain.txt. The command creates a <model folder>/model folder holding all checkpoints and the log file. Training can be stopped at any time by running docker kill pretrain, and can be resumed from the last checkpoint using the same command that started pretraining.

Pretraining from Different Model's Checkpoint

Copy a checkpoint (one of the .model or .cp files) from the <source model folder>/model folder to <model folder> and add a key-value pair in <model folder>/config.json: "from_cp": "/model/<checkpoint file>". One example is shown in nystrom-4096/config.json. This procedure also works for extending the maximum sequence length of a model (for example, use nystrom-512 pretrained weights as the initialization for nystrom-4096).

GLUE

To finetune the model on GLUE tasks, download the GLUE datasets and place them under the glue folder, then, under <model folder>, run

docker run --rm --name=glue \
  --network=host --ipc=host --gpus all \
  -v "$PWD/../glue:/glue" \
  -v "$PWD/../code:/code" \
  -v "$PWD:/model" \
  -d mlpen/bert_env:0 \
  /bin/bash -c \
  "python3 /code/run_glue.py --batch_size 32 --lr 3e-5 --epoch 5 --task MRPC --checkpoint 99 >> /model/glue.txt 2>&1"

batch_size, lr, epoch, task, and checkpoint can be changed to finetune on a different task, with different hyperparameters, or from a different checkpoint. All output is redirected to <model folder>/glue.txt. The log file is located in the <model folder>/model folder.

Citation

@inproceedings{xiong2021nystromformer,
  title={Nystr{\"o}mformer: A Nystr{\"o}m-based Algorithm for Approximating Self-Attention},
  author={Xiong, Yunyang and Zeng, Zhanpeng and Chakraborty, Rudrasis and Tan, Mingxing and Fung, Glenn and Li, Yin and Singh, Vikas},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  year={2021}
}
Comments
  • Pre-trained weights

    Hi,

    Thank you for this very clean repository! I wonder whether any checkpoints will be released? If yes, I'm interested in adding those to the HuggingFace Transformers library (the Reformer and LongFormer are already there, but the Linformer and the Nystromformer aren't yet). Since we only have to replace the self-attention mechanism, this seems quite straightforward in terms of modeling.

    Kind regards,

    Niels

    opened by NielsRogge 6
  • Possible bug in iterative_inv

    When the input of iterative_inv is a 4-tensor (which is how it is used in NystromAttention), the calculation of the initial value of V in iterative_inv here takes the max over all dimensions of the 4-tensor K, rather than just the matrix dimensions as claimed in Lemma 1:

    [screenshot of Lemma 1 and the iterative_inv initialization omitted]

    This has a significant effect on the accuracy of the approximation on random data I've tried it on.

    opened by sbodenstein 6
  • Preprocessing datasets

    Hello,

    I'm unclear on the pretraining procedures, in particular the preprocessing of the datasets.

    Unless I'm mistaken, https://github.com/mlpen/Nystromformer/blob/main/data-preprocessing/preprocess_data_512.py suggests we just produce segments of size 512. I'm not sure I understand how SOP is defined in this case? Actually, SOP doesn't seem to be used at all in the pretrain script.

    opened by thomasw21 3
  • Lemma 1 parenthesis seem wrong

    Hello, first of all great work! This looks very promising.

    As I'm reading a bit more in depth, I'm wondering if you might have an error in the parentheses in Lemma 1. Both the code and the proof seem to indicate it.

    [screenshot of Lemma 1 omitted]

    The last term should apply only to AZ instead of (15I - AZ), no?

    This would "fix" the following lines in the proof

    image

    opened by thomasw21 3
  • Results on Long Range Arena

    Hi, I saw the tweet about this paper and wanted to know the performance of Nystromformer on each individual task in Long Range Arena. (Also, what is "Standard" in the graph posted? The standard Transformer does much worse in LRA.)

    opened by joaogui1 3
  • Length of Text Classification Task of LRA

    Hello, I noticed that in your paper, Table 3 shows Text (4K). However, max_length was set to 1000 in your code LRA/datasets/text.py. Is the Text task length 4K or 1K for the scores given in Table 3?

    opened by lxchtan 2
  • Question about LRA Pathfinder task

    Thanks for your great work.

    I noticed that in the paper and also in this repo, you report results on the LRA tasks. I am trying to train your model on LRA with the code given in this repo. For the Pathfinder task, I followed the dataset preparation guidance and got 3 sets of pickle files, i.e. pathfinder32-curv_baseline, pathfinder32-curv_contour_length_9, and pathfinder32-curv_contour_length_14, each with train/dev/test splits.

    I would like to know which set gives the 70.94 result in Table 3.

    opened by YJHMITWEB 2
  • Some questions regarding the paper

    Hi, thanks for sharing the code.

    I have read your paper recently and I have some questions about some details. I would be very grateful if you can help me with those questions.

    1. How do you derive eqn. (4)? As far as I know, the Nyström method only applies to semidefinite matrices, not to products of matrices, let alone the softmax of a matrix product. I did read through the reference above eqn. (4), "Improving CUR Matrix Decomposition and the Nyström Approximation via Adaptive Sampling", but I failed to find any relevant results. Could you please point me to where I can find these results?

    2. The other question is about the key simplification, which first applies the Nyström method inside the softmax and then commutes the matrix product and the softmax operation (computing the softmax first and then the matrix product). While the first step (Nyström) is indeed unbiased, the second step (commuting the operations) is not necessarily exact. In summary, it's hard to say what the final equation (13) really computes, and why it works at all.

    opened by wzm2256 1
  • LRA cifar10.py never stop

    When trying to run python3 cifar10.py to prepare the CIFAR-10 dataset, the script gets into an infinite loop because input_pipeline.get_cifar10_datasets returns a dataset with infinitely many samples (it uses the repeat function). What do you think I should do to fix it?

    opened by liranringel 0
  • Please add a license to this repo

    First, thank you for sharing this project with us!

    Could you please add an explicit LICENSE file to the repo so that it's clear under what terms the content is provided, and under what terms user contributions are licensed?

    Per GitHub docs on licensing:

    [...] without a license, the default copyright laws apply, meaning that you retain all rights to your source code and no one may reproduce, distribute, or create derivative works from your work. If you're creating an open source project, we strongly encourage you to include an open source license.

    Thanks!

    opened by mbrukman 0
  • score of softmax on Text4k; linformer-256 & nystrom-64 doesn't work

    Hi,

    Thanks for the excellent work!

    I found some issues in my humble trials (I didn't change anything in the code):

    1. Using softmax attention on Text4k, I got ~63.7 accuracy instead of the 65.02 reported in your paper.
    2. I also tried linear attention on Text4k and got ~64 accuracy, which is even higher than the vanilla Transformer; did you get the same result on your side?
    3. The attention types linformer-256 and nystrom-64 don't work; the errors are either dimension mismatches or config key errors. It seems that not all attention types run successfully in the released code. By the way, I didn't try all the choices.

    Thank you for your time, I look forward to your reply~

    Ziwei

    opened by ZiweiHe 1
  • Self-attention weights don't always sum to 1

    Hi all,

    Nice work. I have a question, however. I see in my problem setting that the self-attention (SA) weights don't always sum to ~1 (row-wise). I assume this is due to the out-of-sample approximation and structural properties of the true SA matrix. Are there ways to reduce the error and obtain a more accurate approximation, or other ways to account for this? I will try with more landmarks myself and see how the dynamics change.

    Thanks in advance,

    Dennis

    opened by DennisHaijma 2
  • Retrieval accuracy different from official JAX/FLAX implementation

    I wonder why the Retrieval accuracy is almost 20% higher than the official JAX/FLAX implementation. As the paper says, "While we achieve consistent results reported in (Tay et al. 2020) for most tasks in our PyTorch reimplementation, the performance on Retrieval task is higher for all models following the hyperparameters in (Tay et al. 2020)." Is there any difference aside from the hyperparameters?

    opened by cwq159 1