cosFormer

Overview

Official implementation of the cosFormer attention module from "cosFormer: Rethinking Softmax in Attention" (ICLR 2022).
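
The core idea, in brief: softmax attention is replaced by a ReLU feature map on the queries and keys plus a cos-based positional re-weighting, and the identity cos(a - b) = cos(a)cos(b) + sin(a)sin(b) lets the re-weighted attention be computed with the (K^T V)-first ordering, i.e. in linear rather than quadratic complexity. Below is a minimal, illustrative PyTorch sketch of the non-causal case; it is not the official core code, and the function name, single-head simplification, and shape conventions are assumptions made for brevity.

import math
import torch
import torch.nn.functional as F

def cosformer_attention(q, k, v, eps=1e-6):
    # q: (batch, tgt_len, dim), k and v: (batch, src_len, dim), already projected.
    # Head splitting and any scaling are omitted to keep the sketch short.
    tgt_len, src_len = q.shape[1], k.shape[1]
    m = max(tgt_len, src_len)

    # Non-negative feature map that stands in for exp(.) of softmax attention.
    q, k = F.relu(q), F.relu(k)

    # cos(pi/2 * (i - j)/m) = cos(pi*i/2m)cos(pi*j/2m) + sin(pi*i/2m)sin(pi*j/2m)
    idx_q = torch.arange(1, tgt_len + 1, device=q.device, dtype=q.dtype)
    idx_k = torch.arange(1, src_len + 1, device=k.device, dtype=k.dtype)
    q_cos = q * torch.cos(math.pi / 2 * idx_q / m)[None, :, None]
    q_sin = q * torch.sin(math.pi / 2 * idx_q / m)[None, :, None]
    k_cos = k * torch.cos(math.pi / 2 * idx_k / m)[None, :, None]
    k_sin = k * torch.sin(math.pi / 2 * idx_k / m)[None, :, None]

    # Compute (K^T V) first, so no (tgt_len, src_len) matrix is ever built.
    kv_cos = torch.einsum('bsd,bse->bde', k_cos, v)
    kv_sin = torch.einsum('bsd,bse->bde', k_sin, v)
    num = (torch.einsum('btd,bde->bte', q_cos, kv_cos)
           + torch.einsum('btd,bde->bte', q_sin, kv_sin))

    # Denominator: the same associativity trick with the values replaced by ones,
    # i.e. each query is normalized by its total re-weighted key mass.
    den = (torch.einsum('btd,bd->bt', q_cos, k_cos.sum(dim=1))
           + torch.einsum('btd,bd->bt', q_sin, k_sin.sum(dim=1)))
    return num / den.clamp(min=eps).unsqueeze(-1)

The returned tensor has shape (batch, tgt_len, dim); the repository's core code also covers the causal case, which the first comment below discusses.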

Update log

  • 2022/2/28
    • Add core code

License

This repository is released under the Apache 2.0 license as found in the LICENSE file.

Citation

If you use this code for a paper, please cite:

@inproceedings{
  zhen2022cosformer,
  title={cosFormer: Rethinking Softmax In Attention},
  author={Zhen Qin and Weixuan Sun and Hui Deng and Dongxu Li and Yunshen Wei and Baohong Lv and Junjie Yan and Lingpeng Kong and Yiran Zhong},
  booktitle={International Conference on Learning Representations},
  year={2022},
  url={https://openreview.net/forum?id=Bl8CQrx2Up4}
}
Comments
  • Causal attention not working when q and kv are not the same length

    Thank you for your great work! I am currently working on a seq2seq task and found that the causal attention code only works when src_len and tgt_len are the same. I also suggest adopting EPFL's causal linear attention CUDA code to improve the speed of causal attention. (See the causal-attention sketch after this list.)

    opened by zero0kiriyu 1
  • Question about space complexity

    Thanks very much for your interesting work! I have a question about the O(N) space complexity mentioned in your paper, and I am wondering whether you can help me figure it out.

    In Eq. (11) of your paper, you compute QK^T in the denominator, which seems to lead to O(N^2 d) space complexity. Could you clarify? (A short note on this follows the list.)

    Best

    opened by nihaomiao 0
  • Why does cosFormer not work on the Transformer-XL architecture?

    When implementing cosFormer in the MultiHeadAttention of Transformer-XL and running without extra long-range memory, the ReLU variant performs worse than ELU. I think this is because the attention and FFN blocks differ: XL-style transformers use a different layer norm placement and residual connections. Why is the ReLU(Q)ReLU(K)^T replacement for softmax not robust across different transformer architectures?

    opened by lwaekfjlk 0
  • Pre-trained model

    In the paper, it is mentioned that bidirectional language-modeling pre-training has been done. Are you planning to release pre-trained weights for the model?

    opened by csorujian 0
  • Attn Mask for Non-causal Models

    We are examining non-NLP applications of cosFormer self-attention and would need attention masking for the padded tokens in a batch. Is there a way to incorporate this, given that the code never explicitly computes the attention weights on which masking is traditionally applied? (See the masking sketch after this list.)

    opened by roshansh-cmu 1
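
Regarding the first comment above (causal attention with unequal query/key lengths): causal linear attention of this form is typically computed with running sums over key positions, which pairs query position i with key positions 1..i and therefore implicitly assumes src_len == tgt_len. Below is a minimal, loop-form sketch of that recurrence, illustrative only, with the cos re-weighting omitted for brevity; the EPFL CUDA kernels mentioned in the comment implement the same recurrence far more efficiently.

import torch
import torch.nn.functional as F

def causal_linear_attention(q, k, v, eps=1e-6):
    # q, k, v: (batch, seq_len, dim). Position i attends only to positions j <= i,
    # which is why this formulation needs the query and key sequences aligned.
    q, k = F.relu(q), F.relu(k)
    bsz, seq_len, dim = q.shape
    kv = q.new_zeros(bsz, dim, v.shape[-1])   # running sum of k_j v_j^T
    k_sum = q.new_zeros(bsz, dim)             # running sum of k_j
    out = torch.empty_like(v)
    for i in range(seq_len):
        kv = kv + torch.einsum('bd,be->bde', k[:, i], v[:, i])
        k_sum = k_sum + k[:, i]
        num = torch.einsum('bd,bde->be', q[:, i], kv)
        den = torch.einsum('bd,bd->b', q[:, i], k_sum).clamp(min=eps)
        out[:, i] = num / den.unsqueeze(-1)
    return out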
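
On the space-complexity question above: the denominator of Eq. (11) does not require materializing QK^T, because the sum over key positions can be taken before the product with the queries. With Q' = ReLU(Q) and K' = ReLU(K), a sketch of the rearrangement (in the spirit of the paper's notation, not a quote of it) is

\Bigl[\, Q' {K'}^{\top} \mathbf{1}_N \,\Bigr]_i
  \;=\; \sum_{j=1}^{N} Q'_i \, {K'_j}^{\top}
  \;=\; Q'_i \Bigl( \sum_{j=1}^{N} K'_j \Bigr)^{\top} .

The key sum \sum_j K'_j is a single 1 x d vector computed once in O(Nd), and each of the N denominator entries is then an O(d) dot product, so the denominator costs O(Nd) rather than O(N^2 d). The same ordering applies to the cos- and sin-re-weighted terms.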
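
On the last comment (attention masks for padded tokens in the non-causal model): because no attention-weight matrix is ever formed, padding is typically handled by zeroing the padded key positions before the K^T V and denominator reductions, which removes those positions from every query's numerator and normalizer. A small illustrative helper is shown below; the key_padding_mask name and the True-means-padded convention are assumptions borrowed from PyTorch's MultiheadAttention, not this repository's API.

import torch

def mask_padded_keys(k, v, key_padding_mask):
    # k, v: (batch, src_len, dim); key_padding_mask: (batch, src_len) bool, True = pad.
    # Zeroed keys contribute nothing to K^T V or to the denominator key sum,
    # so padded positions are effectively excluded from attention for every query.
    keep = (~key_padding_mask).unsqueeze(-1).to(k.dtype)
    return k * keep, v * keep

The masked k and v can then be fed to a non-causal attention of the form sketched above; applying the mask before or after the ReLU feature map is equivalent, since ReLU and the cos/sin scaling both preserve zeros.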

Related repositories
This repository provides the official implementation of 'Learning to ignore: rethinking attention in CNNs' accepted in BMVC 2021.

inverse_attention This repository provides the official implementation of 'Learning to ignore: rethinking attention in CNNs' accepted in BMVC 2021. Le

Firas Laakom 5 Jul 8, 2022
[NeurIPS 2021] Galerkin Transformer: a linear attention without softmax

[NeurIPS 2021] Galerkin Transformer: linear attention without softmax Summary A non-numerical analyst oriented explanation on Toward Data Science abou

Shuhao Cao 159 Dec 20, 2022
Locally Enhanced Self-Attention: Rethinking Self-Attention as Local and Context Terms

LESA Introduction This repository contains the official implementation of Locally Enhanced Self-Attention: Rethinking Self-Attention as Local and Cont

Chenglin Yang 20 Dec 31, 2021
an implementation of softmax splatting for differentiable forward warping using PyTorch

softmax-splatting This is a reference implementation of the softmax splatting operator, which has been proposed in Softmax Splatting for Video Frame I

Simon Niklaus 338 Dec 28, 2022
Official implementation of Rethinking Graph Neural Architecture Search from Message-passing (CVPR2021)

Rethinking Graph Neural Architecture Search from Message-passing Intro The GNAS can automatically learn better architecture with the optimal depth of

Shaofei Cai 48 Sep 30, 2022
Official DGL implementation of "Rethinking High-order Graph Convolutional Networks"

SE Aggregation This is the implementation for Rethinking High-order Graph Convolutional Networks. Here we show the codes for citation networks as an e

Tianqi Zhang (张天启) 32 Jul 19, 2022
The Noise Contrastive Estimation for softmax output written in Pytorch

An NCE implementation in pytorch About NCE Noise Contrastive Estimation (NCE) is an approximation method that is used to work around the huge computat

Kaiyu Shi 287 Nov 25, 2022
Implementation of SETR model, Original paper: Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers.

SETR - Pytorch Since the original paper (Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers.) has no official

zhaohu xing 112 Dec 16, 2022
Rethinking the Importance of Implementation Tricks in Multi-Agent Reinforcement Learning

RIIT Our open-source code for RIIT: Rethinking the Importance of Implementation Tricks in Multi-AgentReinforcement Learning. We implement and standard

null 405 Jan 6, 2023
The implementation of "Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer"

Shuffle Transformer The implementation of "Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer" Introduction Very recently, window-

null 87 Nov 29, 2022
A PyTorch implementation of " EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks."

EfficientNet A PyTorch implementation of EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. [arxiv] [Official TF Repo] Implemen

AhnDW 298 Dec 10, 2022
PyTorch implementation of Rethinking Positional Encoding in Language Pre-training

TUPE PyTorch implementation of Rethinking Positional Encoding in Language Pre-training. Quickstart Clone this repository. git clone https://github.com

Jake Tae 5 Jan 27, 2022
Official Pytorch Implementation of Relational Self-Attention: What's Missing in Attention for Video Understanding

Relational Self-Attention: What's Missing in Attention for Video Understanding This repository is the official implementation of "Relational Self-Atte

mandos 43 Dec 7, 2022
Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers

Segmentation Transformer Implementation of Segmentation Transformer in PyTorch, a new model to achieve SOTA in semantic segmentation while using trans

Abhay Gupta 161 Dec 8, 2022
《Rethinking Spatial Dimensions of Vision Transformers》 (2021)

Rethinking Spatial Dimensions of Vision Transformers Byeongho Heo, Sangdoo Yun, Dongyoon Han, Sanghyuk Chun, Junsuk Choe, Seong Joon Oh | Paper NAVER

NAVER AI 224 Dec 27, 2022
Simple Pose: Rethinking and Improving a Bottom-up Approach for Multi-Person Pose Estimation

SimplePose Code and pre-trained models for our paper, “Simple Pose: Rethinking and Improving a Bottom-up Approach for Multi-Person Pose Estimation”, a

Jia Li 256 Dec 24, 2022
[CVPR 2021] Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers

[CVPR 2021] Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers

Fudan Zhang Vision Group 897 Jan 5, 2023
[ICLR2021oral] Rethinking Architecture Selection in Differentiable NAS

DARTS-PT Code accompanying the paper ICLR'2021: Rethinking Architecture Selection in Differentiable NAS Ruochen Wang, Minhao Cheng, Xiangning Chen, Xi

Ruochen Wang 86 Dec 27, 2022
Rethinking the U-Net architecture for multimodal biomedical image segmentation

MultiResUNet Rethinking the U-Net architecture for multimodal biomedical image segmentation This repository contains the original implementation of "M

Nabil Ibtehaz 308 Jan 5, 2023