An example of Scatterbrain implementation (combining local attention and Performer)

HazyResearch

Last update: Jan 2, 2023

Related tags

Deep Learning scatterbrain

Overview

We use the template from https://github.com/ashleve/lightning-hydra-template. Please read the instructions there to understand the repo structure.

Implementation & Experiments

An example of Scatterbrain implementation (combining local attention and Performer) is in the file src/models/modules/attention/sblocal.py.

T2T-ViT inference on ImageNet

To run the T2T-ViT inference on ImageNet experiment:

Download the pretrained weights from the [T2T-ViT repo][https://github.com/yitu-opensource/T2T-ViT/releases]:

mkdir -p checkpoints/t2tvit
cd checkpoints/t2tvit
wget https://github.com/yitu-opensource/T2T-ViT/releases/download/main/81.7_T2T_ViTt_14.pth.tar

Convert the weights to the format compatible with our implementation of T2T-ViT:

# cd to scatterbrain path
python scripts/convert_checkpoint_t2t_vit.py checkpoints/t2tvit/81.7_T2T_ViTt_14.pth.tar

Download the ImageNet dataset (just the validation set will suffice). Below, /path/to/imagenet refers to the directory that contains the train and val directories.
Run the inference experiments:

python run.py experiment=imagenet-t2tvit-eval.yaml model/t2tattn_cfg=full datamodule.data_dir=/path/to/imagenet/ eval.ckpt=checkpoints/t2tvit/81.7_T2T_ViTt_14.pth.tar  # 81.7% acc
python run.py experiment=imagenet-t2tvit-eval.yaml model/t2tattn_cfg=local datamodule.data_dir=/path/to/imagenet/ eval.ckpt=checkpoints/t2tvit/81.7_T2T_ViTt_14.pth.tar  # 80.6% acc
python run.py experiment=imagenet-t2tvit-eval.yaml model/t2tattn_cfg=performer datamodule.data_dir=/path/to/imagenet/ eval.ckpt=checkpoints/t2tvit/81.7_T2T_ViTt_14.pth.tar  # 77.8-79.0% acc (there's randomness)
python run.py experiment=imagenet-t2tvit-eval.yaml model/t2tattn_cfg=sblocal datamodule.data_dir=/path/to/imagenet/ eval.ckpt=checkpoints/t2tvit/81.7_T2T_ViTt_14.pth.tar  # 81.1% acc

Requirements

Python 3.8+, Pytorch 1.9+, torchvision, torchtext, pytorch-fast-transformers, munch, einops, timm, hydra-core, hydra-colorlog, python-dotenv, rich, pytorch-lightning, lightning-bolts.

We provide a Dockerfile that lists all the required packages.

Citation

If you use this codebase, or otherwise found our work valuable, please cite:

@inproceedings{chen2021scatterbrain,
  title={Scatterbrain: Unifying Sparse and Low-rank Attention},
  author={Beidi Chen and Tri Dao and Eric Winsor and Zhao Song and Atri Rudra and Christopher R\'{e}},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2021}
}

Comments

layers difference help
Hi, I am very interested in your work and want to try their applications. I have figured out the usage of "monarch_linear.py". But I am still confused about other layers with similar names.

blockdiag_linear.py

blocksparse_linear.py

fastlinear.py -- There are several alternative linear layers here

structured_linear.py -- I know structured_linear is a base class.

Could you please briefly introduce them to help me better understand your code? Thx in advance.
opened by zhujiem 5
Overparametrization in MonarchLinear
Dear authors,

Thank you for providing the code of Monarch for reproducing experiments of your paper.

In Definition C.3, a Monarch matrix M of size $n \times n$ is defined as a matrix admitting a factorization $M = LR$ where $R$ is block diagonal with $b$ blocks, and $L$ is block diagonal with $n / b$ blocks up to a certain permutation of rows and columns.

However, in your code src/models/layers/monarch_linear.py, if I am not mistaken, the class MonarchLinear does not implement correctly the Monarch structure as defined above. Indeed, the code is written as:

class MonarchLinear(StructuredLinear): def __init__(self, *args, nblocks=4, **kwargs): super().__init__(*args, **kwargs) in_blksz = int(math.ceil(self.in_features / nblocks)) out_blksz = int(math.ceil(self.out_features / nblocks)) self.in_features_extended = in_blksz * nblocks self.out_features_extended = out_blksz * nblocks if self.in_features_extended < self.out_features_extended: self.blkdiag1 = nn.Parameter(torch.empty(nblocks, in_blksz, in_blksz)) self.blkdiag2 = nn.Parameter(torch.empty(nblocks, out_blksz, in_blksz)) else: self.blkdiag1 = nn.Parameter(torch.empty(nblocks, out_blksz, in_blksz)) self.blkdiag2 = nn.Parameter(torch.empty(nblocks, out_blksz, out_blksz)) self.reset_parameters() logger.info(f'Linear class {self.__class__}: saving={self.saving}')

which means that self.blkdiag2 encoding $L$ has $b$ blocks and not $n / b$ as defined in Definition C.3. In practice, when you run the experiments of Section E.1.1 with $b=4$, the model is overparametrized compared to the structure defined in Definition C.3.

Could you clarify the reason of the difference between your definition of the Monarch structure and your implementation MonarchLinear?

Best, Léon Zheng
opened by leonzheng2 0
Monarch & PixelFly based MLP layer efficiency testing

Here I post some efficiency testing numbers for Monarch based MLP v.s. vanilla nn.Linear based MLP. I found that Monarch is best suitable for MLPs in Transformer architectures, which generally have large hidden size and batch size. In recommendation-focused MLPs, the MLP is usually small (e.g., 10000x1024x512, the first is feature input dim) and importantly a small batch size (say 10) is often used for serving given concurrent online requests. The following testing numbers are provided as a reference for anyone who has similar tasks.

| | Train(Fwd+Bwd) | Test(Fwd only) | | |:-------------------:|:--------------:|:--------------:|:-------:| | Batch_size=1000 | GPU-P100 | GPU-P100 | CPU | | MLP(10000x1024x512) | 2.95ms | 0.16ms | 26.57ms | | Monarch(nblk=4) | 1.85ms | 0.57ms | 10.29ms | | Monarch(nblk=16) | 1.37ms | 0.55ms | 5.67ms | | | | | | | Batch_size=10 | | | | | MLP(10000x1024x512) | 0.48ms | 0.13ms | 0.59ms | | Monarch(nblk=4) | 1.34ms | 0.54ms | 1.16ms | | Monarch(nblk=16) | 1.31ms | 0.52ms | 1.37ms | | | | | | | Batch_size=10000 | | | | | MLP(1024x1024x512) | 4.86ms | 0.13ms | 46.99ms | | Monarch(nblk=4) | 6.87ms | 0.53ms | 47.55ms | | Monarch(nblk=16) | 6.04ms | 0.51ms | 39.66ms | | | | | | | Batch_size=1000 | | | | | MLP(1024x1024x512) | 0.74ms | 0.16ms | 5.35ms | | Monarch(nblk=4) | 1.42ms | 0.53ms | 4.17ms | | Monarch(nblk=16) | 1.38ms | 0.52ms | 3.84ms | | | | | | | Batch_size=10 | | | | | MLP(1024x1024x512) | 0.46ms | 0.13ms | 0.27ms | | Monarch(nblk=4) | 1.29ms | 0.53ms | 1.15ms | | Monarch(nblk=16) | 1.27ms | 0.51ms | 0.84ms |

I will post the numbers for pixelfly later.

opened by zhujiem 2
Monarch Projection Step for rectangular blocks and where the number of blocks != sqrt(input dimension)
Hi, I had a question regarding the projection step from dense -> monarch butterfly matrices, in the general case, where we have rectangular or nblocks != sqrt(m) of the matrix, as I'm having trouble finding the code for this and extending existing implementations for this case.

I found multiple projection files/functions: blockdiag_butterfly_projection.py and blockdiag_butterfly_einsum.py.

As an example, if I use code roughly as follows:

from src.models.layers.blockdiag_butterfly_multiply import blockdiag_butterfly_multiply x = np.eye(8) nblocks = 2 bfly_dim = 4 w1_bfly = np.random.normal((nblocks, bfly_dim, bfly_dim)) w2_bfly = np.random.normal((nblocks, bfly_dim, bfly_dim)) bfly_matrix = blockdiag_butterfly_multiply(x, w1_bfly, w2_bfly) print(bfly_matrix.shape)

The resulting shape of the output will be 8x8, which is essentially the full matrix used for transformation.

However, if I use the projection function (which is meant for square matrices) from blockdiag_butterfly_projection.py to try and recover the butterfly matrices from this matrix, I run into the issue that it expects the matrix to decompose as follows M_permuted_batched = rearrange(M, '(p k) (r s) -> k r p s', k=sizes[1], r=sizes[0]), while in our case: r = 4 and s = 4, making it incompatible with the matrix dimensions.

Meanwhile, the einsum functions in blockdiag_butterfly_einsum.py gave different results from the original blockdiag_butterfly_multiply (comparing the forward multiplication step not the projection step). (see this colab)

In the paper, I did see the original derivation for algorithm 1: but I was unclear on how to actually perform the decomposition step when we can't decompose the tensor into an m x m x m x m shape.
opened by sashaDoubov 3
Monarch in pure pytorch. [Yes, it's also a hypercube]
Hi, @tridao . Yes, it's me again. First, thanks for the monarch paper, it was quite a read ;)

I've been looking through the repo, but could not find the official monarch implementation yet. So i've built an unofficial one :) https://gist.github.com/justheuristic/499ec116f1f353dfd3314de87f310f80 (warning: pure torch, please don't use for speed evaluation)

And yes, it's also a hypercube. Lemme explain)

Imagine a small monarch layer with 4 input units and 4 output units.

Here's what the paper says should happen:

Consider a matrix with N=4 input units, and hence, M=2. The permutation layer for 4 units is defined as x_new = [x[0], x[2], x[1], x[3]].

Hence, the first batch matrix multiplication has two blocks: [x[0], x[2]] and [x[1], x[3]]. And the second matrix multiplication is unpermuted, the blocks are: [x[0], x[1]] and [x[2], x[3]].

Now let's naively reshape this tensor into a 2x2 square:

x[0] --- x[1] | | x[2] --- x[3]

As you can see, the two batch matrix products go over the square's columns (L matrix) and rows (R matrix).

This intuition holds for any valid N, M: for instance, N=1024, M=32 results in a 32-by-32 square lattice, and the column-to-row order stays the same.

This leads to a few obvious considerations:

we can change this square into an uneven rectangle, e.g. 32-by-64, which yields several ways to define non-square Monarch

we can add more dimensions! i.e. go from square to cube to hypercube, etc.

On adding more dimensions: consider a GPT-3 layer has 12288 units. We can view this as a 3d lattice of shape [16, 24, 32], since 16 * 24 * 32 = 12288

Round 1 goes over the first dimension. It needs 16x16 weight matrices, and the number of such matrices is 768 (32*24).

Round 2 similarly uses 24x24 matrices, 512 of them to be specific (32*16)

Round 3 multiplies over the last dimension, using a total of 384 matrices (16*24), each of size 16x16

Using the code above, you can define this specific grid as follows:

This, in turn, raises several questions:

On memory requirements of Monarch: when done naively, Monarch requires storing 1 additional tensor of activations for backprop for every additional dimensionality -- or recomputing them due to gradient checkpointing. Is there any more efficient strategy for backprop through Monarch?

On relation to tensor decompositions: when viewed from this angle, Monarch sounds vaguely related (though not equivalent) to some popular tensor decompositions, such as TensorTrain or TensorRing. Is Monarch universally better or are there special cases where I should use either one?

p.s. the perspective from this question is not my own, we stumbled into it in discussions with @ostroumova-la , @TimDettmers , @KhrulkovV
opened by justheuristic 2
Pixelfly is a binary hypercube!
Okay, so imagine the full pixelfly matrix with 2^n blocks

Let's give each input block a number 0... 2^n-1 Then, the pixelfly matrix can be defined as such:

for every pair of input blocks i, j

if binary(i) xor binary(j) has all zeros ones except one, then matrix[i, j] is a non-zero block

informally, if binary codes for i and j are equal in all but one bit, "connect" them with weights

This is the same condition that defines a binary cube in n dimensions:

Ergo, pixelfly neurons actually form a cube and "connect" over the edges of said cube.

p.s. not my original thought, discovered in discussion with @ostroumova-la

p.p.s. if that is the case, are there other geometric shapes we could try? So, for instance, fully connected matrix can be viewed as an n-dimensional simplex (triangle -> tetrahedron -> simples) because all blocks connect to all other blocks. Than goes the hypercube of pixelfly. Then what?
opened by justheuristic 3

An example of Scatterbrain implementation (combining local attention and Performer)

Related tags

Overview

Implementation & Experiments

T2T-ViT inference on ImageNet

Requirements

Citation

Comments

layers difference help

Overparametrization in MonarchLinear

Monarch & PixelFly based MLP layer efficiency testing

Monarch Projection Step for rectangular blocks and where the number of blocks != sqrt(input dimension)

Monarch in pure pytorch. [Yes, it's also a hypercube]

Pixelfly is a binary hypercube!

Owner

HazyResearch

code for paper"A High-precision Semantic Segmentation Method Combining Adversarial Learning and Attention Mechanism"

Implementation of the 😇 Attention layer from the paper, Scaling Local Self-Attention For Parameter Efficient Visual Backbones

Locally Enhanced Self-Attention: Rethinking Self-Attention as Local and Context Terms

Neon-erc20-example - Example of creating SPL token and wrapping it with ERC20 interface in Neon EVM

Example-custom-ml-block-keras - Custom Keras ML block example for Edge Impulse

Python-kafka-reset-consumergroup-offset-example - Python Kafka reset consumergroup offset example

Code and datasets for the paper "Combining Events and Frames using Recurrent Asynchronous Multimodal Networks for Monocular Depth Prediction" (RA-L, 2021)

Combining Automatic Labelers and Expert Annotations for Accurate Radiology Report Labeling Using BERT

Official repo for the work titled "SharinGAN: Combining Synthetic and Real Data for Unsupervised GeometryEstimation"

Combining Reinforcement Learning and Constraint Programming for Combinatorial Optimization

Towers of Babel: Combining Images, Language, and 3D Geometry for Learning Multimodal Vision. ICCV 2021.

The official implementation of ELSA: Enhanced Local Self-Attention for Vision Transformer

Rainbow: Combining Improvements in Deep Reinforcement Learning

Implementation of STAM (Space Time Attention Model), a pure and simple attention model that reaches SOTA for video classification

Implementation of Transformer in Transformer, pixel level attention paired with patch level attention for image classification, in Pytorch

Official Pytorch Implementation of Relational Self-Attention: What's Missing in Attention for Video Understanding

Implementation of a memory efficient multi-head attention as proposed in the paper, "Self-attention Does Not Need O(n²) Memory"

Official implementation of cosformer-attention in cosFormer: Rethinking Softmax in Attention

Implementation of Deformable Attention in Pytorch from the paper "Vision Transformer with Deformable Attention"