An example of Scatterbrain implementation (combining local attention and Performer)

Overview

We use the template from https://github.com/ashleve/lightning-hydra-template. Please read the instructions there to understand the repo structure.

Implementation & Experiments

An example of the Scatterbrain implementation (combining local attention and Performer) is in the file src/models/modules/attention/sblocal.py.
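
For reference, below is a minimal, self-contained sketch of the idea behind sblocal.py: a Performer-style low-rank term plus a sparse correction on a local window. The function names, the dense local mask, and the single-head layout are illustrative simplifications, not the actual implementation in this repo.

import math
import torch

def softmax_kernel_features(x, projection):
    # Positive random features (FAVOR+-style): phi(q) @ phi(k) approximates exp(q . k / sqrt(d)).
    d = x.shape[-1]
    x = x / d ** 0.25
    xp = x @ projection.T                                        # (n, m)
    return torch.exp(xp - x.pow(2).sum(-1, keepdim=True) / 2) / projection.shape[0] ** 0.5

def scatterbrain_attention(q, k, v, projection, window):
    # Low-rank term from Performer-style feature maps.
    q_feat = softmax_kernel_features(q, projection)              # (n, m)
    k_feat = softmax_kernel_features(k, projection)              # (n, m)
    approx = q_feat @ k_feat.T                                   # approximates exp(Q K^T / sqrt(d))
    # Sparse support: a local band of half-width `window`.
    n = q.shape[0]
    idx = torch.arange(n)
    local = (idx[:, None] - idx[None, :]).abs() <= window
    # Sparse correction: exact unnormalized attention minus the low-rank estimate,
    # kept only on the local support (dense mask here for clarity; the real code
    # uses block-sparse local attention instead of materializing the n x n matrix).
    exact = torch.exp(q @ k.T / math.sqrt(q.shape[-1]))
    correction = torch.where(local, exact - approx, torch.zeros_like(exact))
    scores = approx + correction
    return (scores @ v) / scores.sum(-1, keepdim=True)

q, k, v = (torch.randn(128, 64) for _ in range(3))
out = scatterbrain_attention(q, k, v, projection=torch.randn(256, 64), window=16)
print(out.shape)  # torch.Size([128, 64])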

T2T-ViT inference on ImageNet

To run the T2T-ViT inference on ImageNet experiment:

  1. Download the pretrained weights from the [T2T-ViT repo](https://github.com/yitu-opensource/T2T-ViT/releases):
mkdir -p checkpoints/t2tvit
cd checkpoints/t2tvit
wget https://github.com/yitu-opensource/T2T-ViT/releases/download/main/81.7_T2T_ViTt_14.pth.tar
  2. Convert the weights to the format compatible with our implementation of T2T-ViT:
# cd to scatterbrain path
python scripts/convert_checkpoint_t2t_vit.py checkpoints/t2tvit/81.7_T2T_ViTt_14.pth.tar
  3. Download the ImageNet dataset (just the validation set will suffice). Below, /path/to/imagenet refers to the directory that contains the train and val directories.
  4. Run the inference experiments:
python run.py experiment=imagenet-t2tvit-eval.yaml model/t2tattn_cfg=full datamodule.data_dir=/path/to/imagenet/ eval.ckpt=checkpoints/t2tvit/81.7_T2T_ViTt_14.pth.tar  # 81.7% acc
python run.py experiment=imagenet-t2tvit-eval.yaml model/t2tattn_cfg=local datamodule.data_dir=/path/to/imagenet/ eval.ckpt=checkpoints/t2tvit/81.7_T2T_ViTt_14.pth.tar  # 80.6% acc
python run.py experiment=imagenet-t2tvit-eval.yaml model/t2tattn_cfg=performer datamodule.data_dir=/path/to/imagenet/ eval.ckpt=checkpoints/t2tvit/81.7_T2T_ViTt_14.pth.tar  # 77.8-79.0% acc (there's randomness)
python run.py experiment=imagenet-t2tvit-eval.yaml model/t2tattn_cfg=sblocal datamodule.data_dir=/path/to/imagenet/ eval.ckpt=checkpoints/t2tvit/81.7_T2T_ViTt_14.pth.tar  # 81.1% acc

Requirements

Python 3.8+, PyTorch 1.9+, torchvision, torchtext, pytorch-fast-transformers, munch, einops, timm, hydra-core, hydra-colorlog, python-dotenv, rich, pytorch-lightning, lightning-bolts.

We provide a Dockerfile that lists all the required packages.

Citation

If you use this codebase or otherwise find our work valuable, please cite:

@inproceedings{chen2021scatterbrain,
  title={Scatterbrain: Unifying Sparse and Low-rank Attention},
  author={Beidi Chen and Tri Dao and Eric Winsor and Zhao Song and Atri Rudra and Christopher R\'{e}},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2021}
}
Comments
  • layers difference help

    Hi, I am very interested in your work and want to try it in my applications. I have figured out the usage of "monarch_linear.py", but I am still confused about the other layers with similar names.

    Could you please briefly introduce them to help me better understand your code? Thanks in advance.

    opened by zhujiem 5
  • Overparametrization in MonarchLinear

    Dear authors,

    Thank you for providing the Monarch code for reproducing the experiments in your paper.

    In Definition C.3, a Monarch matrix M of size $n \times n$ is defined as a matrix admitting a factorization $M = LR$ where $R$ is block diagonal with $b$ blocks, and $L$ is block diagonal with $n / b$ blocks up to a certain permutation of rows and columns.

    However, in your code src/models/layers/monarch_linear.py, if I am not mistaken, the class MonarchLinear does not correctly implement the Monarch structure as defined above. Indeed, the code is written as:

    class MonarchLinear(StructuredLinear):
    
        def __init__(self, *args, nblocks=4, **kwargs):
            super().__init__(*args, **kwargs)
            in_blksz = int(math.ceil(self.in_features / nblocks))
            out_blksz = int(math.ceil(self.out_features / nblocks))
            self.in_features_extended = in_blksz * nblocks
            self.out_features_extended = out_blksz * nblocks
    
            if self.in_features_extended < self.out_features_extended:
                self.blkdiag1 = nn.Parameter(torch.empty(nblocks, in_blksz, in_blksz))
                self.blkdiag2 = nn.Parameter(torch.empty(nblocks, out_blksz, in_blksz))
            else:
                self.blkdiag1 = nn.Parameter(torch.empty(nblocks, out_blksz, in_blksz))
                self.blkdiag2 = nn.Parameter(torch.empty(nblocks, out_blksz, out_blksz))
            self.reset_parameters()
            logger.info(f'Linear class {self.__class__}: saving={self.saving}')
    

    which means that self.blkdiag2, which encodes $L$, has $b$ blocks rather than $n / b$ as required by Definition C.3. In practice, when you run the experiments of Section E.1.1 with $b=4$, the model is overparametrized compared to the structure defined in Definition C.3.
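
    To make the gap concrete with a rough parameter count (illustrative arithmetic only, assuming a square $n \times n$ layer; $n = 1024$ is just an example value):

    # Parameter count for a square n x n Monarch layer with b = nblocks (illustrative).
    n, b = 1024, 4
    def_c3 = b * (n // b) ** 2 + (n // b) * b ** 2   # Definition C.3: R has b blocks of (n/b)^2, L has n/b blocks of b^2
    monarch_linear = 2 * b * (n // b) ** 2           # MonarchLinear: both factors have b blocks of (n/b)^2
    print(def_c3, monarch_linear)                    # 266240 vs 524288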

    Could you clarify the reason for the difference between your definition of the Monarch structure and your implementation of MonarchLinear?

    Best, Léon Zheng

    opened by leonzheng2 0
  • Monarch & PixelFly based MLP layer efficiency testing

    Here I post some efficiency numbers for a Monarch-based MLP vs. a vanilla nn.Linear-based MLP. I found that Monarch is best suited for MLPs in Transformer architectures, which generally have a large hidden size and batch size. In recommendation models, the MLP is usually small (e.g., 10000x1024x512, where the first number is the input feature dimension) and, importantly, a small batch size (say 10) is often used for serving concurrent online requests. The following numbers are provided as a reference for anyone working on similar tasks.

    |                     | Train(Fwd+Bwd) | Test(Fwd only) |         |
    |:-------------------:|:--------------:|:--------------:|:-------:|
    | Batch_size=1000     | GPU-P100       | GPU-P100       | CPU     |
    | MLP(10000x1024x512) | 2.95ms         | 0.16ms         | 26.57ms |
    | Monarch(nblk=4)     | 1.85ms         | 0.57ms         | 10.29ms |
    | Monarch(nblk=16)    | 1.37ms         | 0.55ms         | 5.67ms  |
    | Batch_size=10       |                |                |         |
    | MLP(10000x1024x512) | 0.48ms         | 0.13ms         | 0.59ms  |
    | Monarch(nblk=4)     | 1.34ms         | 0.54ms         | 1.16ms  |
    | Monarch(nblk=16)    | 1.31ms         | 0.52ms         | 1.37ms  |
    | Batch_size=10000    |                |                |         |
    | MLP(1024x1024x512)  | 4.86ms         | 0.13ms         | 46.99ms |
    | Monarch(nblk=4)     | 6.87ms         | 0.53ms         | 47.55ms |
    | Monarch(nblk=16)    | 6.04ms         | 0.51ms         | 39.66ms |
    | Batch_size=1000     |                |                |         |
    | MLP(1024x1024x512)  | 0.74ms         | 0.16ms         | 5.35ms  |
    | Monarch(nblk=4)     | 1.42ms         | 0.53ms         | 4.17ms  |
    | Monarch(nblk=16)    | 1.38ms         | 0.52ms         | 3.84ms  |
    | Batch_size=10       |                |                |         |
    | MLP(1024x1024x512)  | 0.46ms         | 0.13ms         | 0.27ms  |
    | Monarch(nblk=4)     | 1.29ms         | 0.53ms         | 1.15ms  |
    | Monarch(nblk=16)    | 1.27ms         | 0.51ms         | 0.84ms  |
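
    For context, a timing harness along these lines could look as follows. This is an illustrative sketch, not the script used for the table above; it assumes MonarchLinear (from src/models/layers/monarch_linear.py) can be constructed as MonarchLinear(in_features, out_features, nblocks=...).

    import time
    import torch
    import torch.nn as nn
    from src.models.layers.monarch_linear import MonarchLinear  # constructor signature assumed

    @torch.no_grad()
    def fwd_ms(model, x, iters=100):
        # Average forward-pass latency in milliseconds, with warm-up and CUDA sync.
        for _ in range(10):
            model(x)
        if x.is_cuda:
            torch.cuda.synchronize()
        start = time.time()
        for _ in range(iters):
            model(x)
        if x.is_cuda:
            torch.cuda.synchronize()
        return (time.time() - start) / iters * 1e3

    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    x = torch.randn(1000, 10000, device=device)                  # batch_size=1000, input dim 10000
    mlp = nn.Sequential(nn.Linear(10000, 1024), nn.ReLU(), nn.Linear(1024, 512)).to(device)
    monarch = nn.Sequential(MonarchLinear(10000, 1024, nblocks=4), nn.ReLU(),
                            MonarchLinear(1024, 512, nblocks=4)).to(device)
    print(fwd_ms(mlp, x), fwd_ms(monarch, x))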

    I will post the numbers for pixelfly later.

    opened by zhujiem 2
  • Monarch Projection Step for rectangular blocks and where the number of blocks != sqrt(input dimension)

    Hi, I had a question regarding the projection step from dense to Monarch butterfly matrices in the general case, where the matrix is rectangular or nblocks != sqrt(m). I'm having trouble finding the code for this case and extending the existing implementations to cover it.

    I found multiple projection files/functions: blockdiag_butterfly_projection.py and blockdiag_butterfly_einsum.py.

    As an example, if I use code roughly as follows:

    import torch
    from src.models.layers.blockdiag_butterfly_multiply import blockdiag_butterfly_multiply

    x = torch.eye(8)
    nblocks = 2
    bfly_dim = 4

    # two block-diagonal factors, each with `nblocks` blocks of size bfly_dim x bfly_dim
    w1_bfly = torch.randn(nblocks, bfly_dim, bfly_dim)
    w2_bfly = torch.randn(nblocks, bfly_dim, bfly_dim)

    bfly_matrix = blockdiag_butterfly_multiply(x, w1_bfly, w2_bfly)

    print(bfly_matrix.shape)  # torch.Size([8, 8])
    

    The resulting shape of the output will be 8x8, which is essentially the full matrix used for transformation.

    However, if I use the projection function (which is meant for square matrices) from blockdiag_butterfly_projection.py to try to recover the butterfly matrices from this matrix, I run into the issue that it expects the matrix to decompose as M_permuted_batched = rearrange(M, '(p k) (r s) -> k r p s', k=sizes[1], r=sizes[0]), while in our case r = 4 and s = 4, which is incompatible with the matrix dimensions.
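
    For concreteness, a quick shape check of that constraint on the 8x8 toy example above (purely illustrative; the axis names follow the rearrange pattern):

    # '(p k) (r s)' requires p * k = 8 and r * s = 8 for an 8x8 M, so r = 4 and s = 4
    # (our 4x4 blocks) cannot both hold, and we never get the m x m x m x m layout.
    import torch
    from einops import rearrange
    M = torch.randn(8, 8)
    try:
        rearrange(M, '(p k) (r s) -> k r p s', k=4, r=4, s=4)
    except Exception as e:
        print(e)  # einops reports a shape mismatch: r * s = 16 != 8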

    Meanwhile, the einsum functions in blockdiag_butterfly_einsum.py gave different results from the original blockdiag_butterfly_multiply (comparing the forward multiplication step, not the projection step; see this colab).

    In the paper, I did see the original derivation for Algorithm 1 [image], but I was unclear on how to actually perform the decomposition step when we can't decompose the tensor into an m x m x m x m shape.

    opened by sashaDoubov 3
  • Monarch in pure pytorch. [Yes, it's also a hypercube]

    Hi @tridao, yes, it's me again. First, thanks for the Monarch paper; it was quite a read ;)

    I've been looking through the repo but could not find the official Monarch implementation yet, so I've built an unofficial one :) https://gist.github.com/justheuristic/499ec116f1f353dfd3314de87f310f80 (warning: pure torch, please don't use it for speed evaluation).

    And yes, it's also a hypercube. Let me explain.

    Imagine a small monarch layer with 4 input units and 4 output units.

    Here's what the paper says should happen: [image]

    Consider a matrix with N=4 input units, and hence, M=2. The permutation layer for 4 units is defined as x_new = [x[0], x[2], x[1], x[3]].

    Hence, the first batch matrix multiplication has two blocks: [x[0], x[2]] and [x[1], x[3]]. And the second matrix multiplication is unpermuted, the blocks are: [x[0], x[1]] and [x[2], x[3]].

    Now let's naively reshape this tensor into a 2x2 square:

    x[0] --- x[1]
      |        |
    x[2] --- x[3]
    

    As you can see, the two batch matrix products go over the square's columns (L matrix) and rows (R matrix).

    This intuition holds for any valid N, M: for instance, N=1024, M=32 results in a 32-by-32 square lattice, and the column-to-row order stays the same.
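
    A two-line sanity check of this picture (purely illustrative):

    # After reshaping [x0, x1, x2, x3] into a 2x2 square, the permuted blocks
    # [x0, x2], [x1, x3] are its columns and the unpermuted blocks [x0, x1], [x2, x3] its rows.
    import torch
    x = torch.arange(4)
    square = x.reshape(2, 2)
    print(square.T.tolist())  # [[0, 2], [1, 3]] -> blocks of the first bmm
    print(square.tolist())    # [[0, 1], [2, 3]] -> blocks of the second bmm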

    This leads to a few obvious considerations:

    • we can change this square into an uneven rectangle, e.g. 32-by-64, which yields several ways to define non-square Monarch
    • we can add more dimensions! i.e. go from square to cube to hypercube, etc.

    On adding more dimensions: consider a GPT-3 layer with 12288 units. We can view this as a 3D lattice of shape [16, 24, 32], since 16 * 24 * 32 = 12288.

    • Round 1 goes over the first dimension. It needs 16x16 weight matrices, and the number of such matrices is 768 (32*24).
    • Round 2 similarly uses 24x24 matrices, 512 of them to be specific (32*16).
    • Round 3 multiplies over the last dimension, using a total of 384 matrices (16*24), each of size 32x32.
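
    A back-of-the-envelope parameter count for this factorization (illustrative arithmetic, just to make the saving explicit):

    # Each round over dimension d uses (12288 / d) weight matrices of size d x d.
    dims = [16, 24, 32]
    n = 16 * 24 * 32                              # 12288 units
    params = sum(d * d * (n // d) for d in dims)
    print(params, n ** 2)                         # 884736 vs 150994944 for a dense layer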

    Using the code above, you can define this specific grid as follows: [image]

    This, in turn, raises several questions:

    1. On the memory requirements of Monarch: when done naively, Monarch requires storing one additional tensor of activations for backprop for every additional dimension, or recomputing them via gradient checkpointing. Is there a more efficient strategy for backprop through Monarch?
    2. On the relation to tensor decompositions: viewed from this angle, Monarch seems vaguely related (though not equivalent) to some popular tensor decompositions, such as TensorTrain or TensorRing. Is Monarch universally better, or are there special cases where I should use one or the other?

    p.s. the perspective in this question is not my own; we stumbled into it in discussions with @ostroumova-la, @TimDettmers, @KhrulkovV

    opened by justheuristic 2
  • Pixelfly is a binary hypercube!

    Okay, so imagine the full pixelfly matrix with 2^n blocks

    [image]

    Let's give each input block a number 0 ... 2^n-1. Then, the pixelfly matrix can be defined as follows:

    • for every pair of input blocks i, j
    • if binary(i) xor binary(j) has all zeros except one bit, then matrix[i, j] is a non-zero block
      • informally, if the binary codes for i and j are equal in all but one bit, "connect" them with weights

    This is the same condition that defines a binary cube in n dimensions: [image]

    Ergo, pixelfly neurons actually form a cube and "connect" over the edges of said cube.
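
    A tiny illustration of that rule (not the repo's mask construction, just the condition spelled out in code):

    # Blocks i and j are connected iff binary(i) xor binary(j) has exactly one bit set,
    # i.e. they differ in exactly one bit, the edge condition of a binary hypercube.
    import torch
    n_bits = 3
    nblocks = 2 ** n_bits                          # 8 blocks
    i = torch.arange(nblocks)
    xor = (i[:, None] ^ i[None, :]).tolist()
    mask = [[bin(v).count('1') == 1 for v in row] for row in xor]
    print(torch.tensor(mask).int())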

    p.s. not my original thought, discovered in discussion with @ostroumova-la

    p.p.s. if that is the case, are there other geometric shapes we could try? For instance, a fully connected matrix can be viewed as an n-dimensional simplex (triangle -> tetrahedron -> simplex) because all blocks connect to all other blocks. Then comes the hypercube of pixelfly. Then what?

    opened by justheuristic 3
Owner
HazyResearch
We are a CS research group led by Prof. Chris Ré.