SOFT: Softmax-free Transformer with Linear Complexity


SOFT: Softmax-free Transformer with Linear Complexity,
Jiachen Lu, Jinghan Yao, Junge Zhang, Xiatian Zhu, Hang Xu, Weiguo Gao, Chunjing Xu, Tao Xiang, Li Zhang,
NeurIPS 2021 Spotlight

Requirements

  • timm==0.3.2

  • torch>=1.7.0 and torchvision that matches the PyTorch installation

  • cuda>=10.2

Compilation may fail on CUDA versions below 10.2.
We have compiled it successfully with CUDA 10.2 and CUDA 11.2.

Data preparation

Download and extract the ImageNet train and val images from http://image-net.org/. The directory structure is the standard layout expected by torchvision's datasets.ImageFolder, with the training and validation data in the train/ and val/ folders respectively:

/path/to/imagenet/
  train/
    class1/
      img1.jpeg
    class2/
      img2.jpeg
  val/
    class1/
      img3.jpeg
    class2/
      img4.jpeg
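
Because this is the standard ImageFolder layout, the splits can be loaded directly with torchvision. Below is a minimal sketch of loading the training split; the transform uses common ImageNet defaults for illustration and is not necessarily the exact augmentation defined in the configs:

import torch
from torchvision import datasets, transforms

# Common ImageNet-style preprocessing; the actual training augmentation
# is defined by the config files, this is only an illustration.
transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

train_set = datasets.ImageFolder('/path/to/imagenet/train', transform=transform)
train_loader = torch.utils.data.DataLoader(
    train_set, batch_size=128, shuffle=True, num_workers=8)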

Installation

git clone https://github.com/fudan-zvg/SOFT.git
python -m pip install -e SOFT
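
After installation, a quick way to check whether the compiled CUDA extension is available is to import it (the module name _c is the one referenced in the repository code; if this import fails, the pure PyTorch configs without cuda in the name can still be used):

python -c "import torch; from SOFT import _c; print('SOFT CUDA extension loaded')"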

Main results

Image Classification

ImageNet-1K

Model Resolution Params FLOPs Top-1 (%) Config
SOFT-Tiny 224 13M 1.9G 79.3 SOFT_Tiny.yaml, SOFT_Tiny_cuda.yaml
SOFT-Small 224 24M 3.3G 82.2 SOFT_Small.yaml, SOFT_Small_cuda.yaml
SOFT-Medium 224 45M 7.2G 82.9 SOFT_Medium.yaml, SOFT_Medium_cuda.yaml
SOFT-Large 224 64M 11.0G 83.1 SOFT_Large.yaml, SOFT_Large_cuda.yaml
SOFT-Huge 224 87M 16.3G 83.3 SOFT_Huge.yaml, SOFT_Huge_cuda.yaml

Get Started

Train

We provide two implementations of the Gaussian kernel: a PyTorch version and the exact form of the Gaussian function implemented in CUDA. Config files whose names contain cuda use the CUDA implementation. Both implementations yield the same performance. Please install SOFT before running the CUDA version.
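
For intuition, the kernel replaces softmax attention with a Gaussian function of the distance between queries (there is no separate K in SOFT; it is tied to Q, as the forward snippet quoted in the comments below also shows). The sketch below illustrates the plain quadratic-cost form of such a kernel in PyTorch; it is only an illustration of the idea, not this repository's implementation, which additionally factorises the kernel matrix through downsampled landmark queries to reach linear complexity and may apply extra scaling:

import torch

def gaussian_kernel_attention(Q, V):
    # Q, V: (batch, heads, n, head_dim); no separate K, the kernel is
    # computed between query pairs: attn[i, j] = exp(-||q_i - q_j||^2 / 2).
    q_sq = (Q * Q).sum(dim=-1)                                   # (..., n)
    sq_dist = q_sq.unsqueeze(-1) + q_sq.unsqueeze(-2) \
              - 2.0 * Q @ Q.transpose(-2, -1)                    # (..., n, n)
    attn = torch.exp(-0.5 * sq_dist.clamp(min=0.0))              # Gaussian kernel, no softmax
    return attn @ V                                              # (..., n, head_dim)

# Example: 2 images, 4 heads, 196 tokens, head dim 32
Q = torch.randn(2, 4, 196, 32)
V = torch.randn(2, 4, 196, 32)
out = gaussian_kernel_attention(Q, V)   # shape (2, 4, 196, 32)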

./dist_train.sh ${GPU_NUM} --data ${DATA_PATH} --config ${CONFIG_FILE}
# For example, train SOFT-Tiny on the ImageNet training set with 8 GPUs
./dist_train.sh 8 --data ${DATA_PATH} --config config/SOFT_Tiny.yaml

Test

./dist_train.sh ${GPU_NUM} --data ${DATA_PATH} --config ${CONFIG_FILE} --eval_checkpoint ${CHECKPOINT_FILE} --eval

# For example, test SOFT-Tiny on the ImageNet validation set with 8 GPUs

./dist_train.sh 8 --data ${DATA_PATH} --config config/SOFT_Tiny.yaml --eval_checkpoint ${CHECKPOINT_FILE} --eval

Reference

@inproceedings{SOFT,
    title={SOFT: Softmax-free Transformer with Linear Complexity}, 
    author={Lu, Jiachen and Yao, Jinghan and Zhang, Junge and Zhu, Xiatian and Xu, Hang and Gao, Weiguo and Xu, Chunjing and Xiang, Tao and Zhang, Li},
    booktitle={NeurIPS},
    year={2021}
}

License

MIT

Acknowledgement

Thanks to the following open-sourced repos:
Detectron2
T2T-ViT
PVT
Nystromformer
pytorch-image-models

Comments
  • How should the linear complexity claimed in the paper be understood?

    Apologies for asking directly in Chinese for convenience.

    The key to the linear cost in the paper is downsampling with a strided conv, but once the conv is trained its kernel size and stride are fixed, so the sampling ratio is fixed as well. After training, if we test on longer sequences, the number of landmarks m grows with the sequence length n, so the complexity is still O(n^2) rather than O(n). I looked at the OpenReview reviews, and a reviewer seems to have raised this point; the rebuttal mentions fixing m=49, but for longer test sequences this does not seem possible without changing the stride. Nystromformer's adaptive pooling feels closer to the meaning of a landmark. Also, the conv that generates the landmarks is followed by a norm and a GELU; is that in fact the key to convergence?

    opened by IDKiro 1
  • Substitute the regular attention module with the softmax-free attention module

    Hello,

    The background is that, due to limitations of the computation platform I am using, where the softmax operator costs a lot of time, I am trying to substitute the regular attention modules with the softmax-free attention module.

    I have one question about the structure of SOFT. The core of the softmax-free attention module runs like this:

        def forward(self, X, H, W):
            Q = self.split_heads(self.W_q(X))
            V = self.split_heads(self.W_v(X))
            attn_out = self.attn(Q, V, H, W)
            attn_out = self.combine_heads(attn_out)
            out = self.ff(attn_out)
            return out

    As Q and V are generated from X, does that mean this attention module is akin to a self-attention module rather than a cross-attention module, where Q, K, V come from different domains? If that is the case, is there any suggestion on substituting a regular cross-attention module with softmax-free attention? Thanks.

    Best, Chenxi

    opened by Capchenxi 0
  • About the "from SOFT import _c" library file problem

    Hello, I read your article and code, but I ran into an error at "from SOFT import _c" in the subtraction file. Could you please tell me where to find this library?

    opened by handsomezhuo 4
Owner

Fudan Zhang Vision Group
Zhang Vision Group at the School of Data Science of Fudan University, led by Professor Li Zhang