Examples of using sparse attention, as in "Generating Long Sequences with Sparse Transformers"

OpenAI

Last update: Dec 28, 2022

Overview

Status: Archive (code is provided as-is, no updates expected)

Update August 2020: For an example repository that achieves state-of-the-art modeling performance on CIFAR-10 using Sparse Transformers, please see https://github.com/openai/distribution_augmentation

Sparse Attention

This repository contains the sparse attention primitives used in Sparse Transformers (see blog and paper). Specifically, it includes the following:

A faster implementation of normal attention (the upper triangle is not computed, and many operations are fused).
An implementation of "strided" and "fixed" attention, as in the Sparse Transformers paper.
A simple recompute decorator, which can be adapted for usage with attention.

We hope this code can further accelerate research into sparse attention.

An example Transformer implementation which is close to the version we use internally can be found at https://github.com/openai/blocksparse/blob/master/examples/transformer/enwik8.py.

Overview of kernels

The repository contains fused implementations of the attention operation, which takes Q, K, and V matrices (each of shape [batch, time, dim]) representing the queries, keys, and values for a sequence. For every query element, it returns a weighted sum of the values, where the weights are determined by the scaled matrix product of Q and K^T.
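As a point of reference, the attention operation described above can be written out in a few lines of plain Python. This is only an illustrative sketch, not the fused kernel: it handles a single sequence and a single head, and it assumes the usual softmax normalization, a scale of 1/sqrt(dim), and a causal mask (matching the note that the upper triangle is not computed). The function name causal_attention is hypothetical.

```python
import math

def causal_attention(q, k, v):
    """Dense causal attention for one sequence and one head (reference sketch).

    q, k, v: lists of `time` vectors, each of length `dim`.
    Returns a list of `time` output vectors.
    """
    dim = len(q[0])
    scale = 1.0 / math.sqrt(dim)  # assumed scaling, as in the standard Transformer
    out = []
    for i, q_i in enumerate(q):
        # Only key positions j <= i are scored; the upper triangle is skipped.
        scores = [scale * sum(a * b for a, b in zip(q_i, k[j])) for j in range(i + 1)]
        # Numerically stable softmax over the visible positions.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the visible value vectors.
        out.append([sum(w * v[j][d] for j, w in enumerate(weights))
                    for d in range(len(v[0]))])
    return out

# Tiny usage example: two positions, dim = 2.
x = [[1.0, 0.0], [0.0, 1.0]]
result = causal_attention(x, x, x)
```

The first output row equals the first value vector, since position 0 can only attend to itself.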

The kernels allow you to specify block sparsity in the QK^T matrix. You define a pattern of 0s and 1s on a [time/blocksize, time/blocksize] matrix of blocks; blocks marked 0 are not computed and are excluded from the softmax calculation. Additionally, you can define "callbacks" on the computed blocks, which mask out further values within a given block from the softmax (though the matrix product is still computed for those elements).
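The relationship between the block-level pattern and the per-element callbacks can be illustrated with a small pure-Python sketch. It does not use the blocksparse API; the helper names (causal_block_layout, element_mask, computed_fraction) and the callback signature are assumptions made for illustration only.

```python
def causal_block_layout(n_blocks):
    """Block-level lower triangle: 1 where key block <= query block, else 0."""
    return [[1 if kb <= qb else 0 for kb in range(n_blocks)]
            for qb in range(n_blocks)]

def element_mask(layout, blocksize, callback=None):
    """Expand a block layout into a [time, time] 0/1 mask.

    `callback(q_pos, k_pos)` is a hypothetical stand-in for the per-block
    callbacks: it can only remove elements inside blocks that are computed.
    """
    n = len(layout) * blocksize
    mask = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if layout[i // blocksize][j // blocksize]:
                mask[i][j] = 1 if callback is None or callback(i, j) else 0
    return mask

def computed_fraction(layout):
    """Fraction of blocks for which the matrix product is computed."""
    return sum(map(sum, layout)) / (len(layout) ** 2)

layout = causal_block_layout(4)                  # time = 8 with blocksize = 2
coarse = element_mask(layout, 2)                 # block sparsity only
causal = element_mask(layout, 2, callback=lambda i, j: j <= i)
```

With block sparsity alone, the diagonal blocks still expose a few future positions (for example, query 0 sees key 1). The callback removes those from the softmax, although the products for that block are still computed.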

Block sizes of {8, 16, 32, 64} are supported, and slight advantages in speed may be seen from using larger blocks.

Prerequisites

For fp32 and blocksize 32, any NVIDIA GPU past Kepler can be used (i.e. compute capability beyond 3.5).

For fp16 and blocksize 8, 16, 32, 64, a GPU with Tensor Cores (e.g. the V100 GPU, compute capability >= 7.0) is required.

The primary dependency is the OpenAI blocksparse package.

With CUDA 10 and tensorflow-gpu, you can install blocksparse with pip install blocksparse.

For other setups, you must install blocksparse from source, and directions can be found in the root of the repository.

Examples

Run the following on a non-V100 GPU:

python attention.py

On a V100 GPU:

python attention.py fp16

General usage

An example can be found at the bottom of attention.py.

full_attn_tf = attention_impl(q, k, v, heads=4, attn_mode="all", recompute=True)
full_attn_bs = blocksparse_attention_impl(q, k, v, heads=4, attn_mode="all", recompute=True)

# first step of strided attention
local_attn_bs = blocksparse_attention_impl(q, k, v, heads=4, attn_mode="local", local_attn_ctx=32, recompute=True)
local_attn_tf = attention_impl(q, k, v, heads=4, attn_mode="local", local_attn_ctx=32, recompute=True)

# second step of strided attention
strided_attn_bs = blocksparse_attention_impl(q, k, v, heads=4, attn_mode="strided", local_attn_ctx=32, recompute=True)
strided_attn_tf = attention_impl(q, k, v, heads=4, attn_mode="strided", local_attn_ctx=32, recompute=True)

# the 'fixed' attention pattern
fixed = blocksparse_attention_impl(q, k, v, heads=4, attn_mode="fixed", local_attn_ctx=128, num_verts=4, vertsize=1, recompute=True)
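To make the attention modes used above more concrete, the sketch below writes out element-level connectivity rules based on the definitions in the Sparse Transformers paper. It is not the code from attention.py. The function names, the "fixed_local" and "fixed_summary" labels (the two parts of the paper's fixed pattern), and the parameter c are assumptions for illustration, and the local window is assumed to cover ctx positions including the current one.

```python
def allowed(i, j, mode, ctx, c=1):
    """Return True if query position i may attend to key position j."""
    if j > i:                        # causal: never look ahead
        return False
    if mode == "all":                # dense causal attention
        return True
    if mode == "local":              # the most recent ctx positions, including i
        return i - j < ctx
    if mode == "strided":            # every ctx-th earlier position
        return (i - j) % ctx == 0
    if mode == "fixed_local":        # positions in the same block of length ctx
        return i // ctx == j // ctx
    if mode == "fixed_summary":      # the last c columns of every block
        return j % ctx >= ctx - c
    raise ValueError("unknown mode: %s" % mode)

def render(n, mode, ctx, c=1):
    """Draw the n x n pattern, one string per query row ('x' = attended)."""
    return ["".join("x" if allowed(i, j, mode, ctx, c) else "." for j in range(n))
            for i in range(n)]

for row in render(6, "strided", 2):
    print(row)
```

Printing the "local" and "strided" patterns side by side shows why the two are used as successive steps: the local step covers recent positions, and the strided step reaches further back at regular intervals.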

Referencing this work

If you find this helpful in your work, you can consider citing the following:

@article{child2019sparsetransformer,
  title={Generating Long Sequences with Sparse Transformers},
  author={Child, Rewon and Gray, Scott and Radford, Alec and Sutskever, Ilya},
  journal={URL https://openai.com/blog/sparse-transformers},
  year={2019}
}

Comments

Questions about novelty

The paper is well written and achieves strong results on various datasets. However, the novelty of the contribution is unclear. Q1: How is the Sparse Transformer (strided) different from local attention? Q2: How is the Sparse Transformer (fixed) different from block self-attention (ICLR 2018, https://openreview.net/forum?id=H1cWzoxA-)?

opened by zhaoguangxiang 2
Has anyone been able to reproduce the results for image generation?

It seems that the code for images is not provided, and in #7 it was mentioned that the strided attention is difficult to reproduce. I am wondering whether anyone has successfully reproduced the results for image generation.

opened by shaform 0
a problem in running code

When I tried to run the code the following error occurred:

Traceback (most recent call last):
  File "attention.py", line 4, in <module>
    from blocksparse import BlocksparseTransformer
  File "/home/user/anaconda3/lib/python3.7/site-packages/blocksparse/__init__.py", line 3, in <module>
    from blocksparse.utils import (
  File "/home/user/anaconda3/lib/python3.7/site-packages/blocksparse/utils.py", line 16, in <module>
    _op_module = tf.load_op_library(os.path.join(data_files_path, 'blocksparse_ops.so'))
  File "/home/en/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/load_library.py", line 61, in load_op_library
    lib_handle = py_tf.TF_LoadLibrary(library_filename)
tensorflow.python.framework.errors_impl.NotFoundError: libcudart.so.10.0: cannot open shared object file: No such file or directory

opened by hoenza 1
Problem with reproducing "strided" attention scheme from the paper

Hi, I am trying to visualize the attention schemes using this code, basically trying to reproduce Fig. 3 from the paper. I could reproduce the "fixed" attention scheme as shown below:

The problem is I could not reproduce the "strided" scheme (Fig 3.b from paper). All I get is the following no matter what parameters I try:

If I change some code then I can get the correct "strided" version as shown in the paper. The following is after some code changes:

Did anyone face the same issue?

opened by krishnadubba 2
Great work! but seems insufficient "related work"

See title. As we all know, DynamicConv claims state-of-the-art performance on many tasks (e.g., WMT14 En-De), but I find that DynamicConv is never mentioned in your paper.

Would your team be willing to run comparison experiments, as in issue 659 of the pytorch/fairseq repository?

opened by alphadl 0
