This is the official PyTorch implementation for "Mesa: A Memory-saving Training Framework for Transformers".

Related tags

Deep Learning Mesa
Overview

A Memory-saving Training Framework for Transformers

License

This is the official PyTorch implementation for Mesa: A Memory-saving Training Framework for Transformers.

By Zizheng Pan, Peng Chen, Haoyu He, Jing Liu, Jianfei Cai and Bohan Zhuang.

Installation

  1. Create a virtual environment with anaconda.

    conda create -n mesa python=3.7 -y
    conda activate mesa
    
    # Install PyTorch, we use PyTorch 1.7.1 with CUDA 10.1 
    pip install torch==1.7.1+cu101 torchvision==0.8.2+cu101 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html
    
    # Install ninja
    pip install ninja
  2. Build and install Mesa.

    # cloen this repo
    git clone https://github.com/zhuang-group/Mesa
    # build
    cd Mesa/
    # You need to have an NVIDIA GPU
    python setup.py develop

Usage

  1. Prepare your policy and save as a text file, e.g. policy.txt.

    on gelu: # layer tag, choices: fc, conv, gelu, bn, relu, softmax, matmul, layernorm
        by_index: all # layer index
        enable: True # enable for compressing
        level: 256 # we adopt 8-bit quantization by default
        ema_decay: 0.9 # the decay rate for running estimates
        
        by_index: 1 2 # e.g. exluding GELU layers that indexed by 1 and 2.
        enable: False
  2. Next, you can wrap your model with Mesa by:

    import mesa as ms
    ms.policy.convert_by_num_groups(model, 3)
    # or convert by group size with ms.policy.convert_by_group_size(model, 64)
    
    # setup compression policy
    ms.policy.deploy_on_init(model, '[path to policy.txt]', verbose=print, override_verbose=False)

    That's all you need to use Mesa for memory saving.

    Note that convert_by_num_groups and convert_by_group_size only recognize nn.XXX, if your code has functional operations, such as Q@K and F.Softmax, you may need to manually setup these layers. For example:

    # matrix multipcation (before)
    out = Q@K.transpose(-2, -1)
    # with Mesa
    self.mm = ms.MatMul(quant_groups=3)
    out = self.mm(q, k.transpose(-2, -1))
    
    # sofmtax (before)
    attn = attn.softmax(dim=-1)
    # with Mesa
    self.softmax = ms.Softmax(dim=-1, quant_groups=3)
    attn = self.softmax(attn)
  3. You can also target one layer by:

    import mesa as ms
    # previous 
    self.act = nn.GELU()
    # with Mesa
    self.act = ms.GELU(quant_groups=[num of quantization groups])

Demo projects for DeiT and Swin

We provide demo projects to replicate our results of training DeiT and Swin with Mesa, please refer to DeiT-Mesa and Swin-Mesa.

Results on ImageNet

Model Param (M) FLOPs (G) Train Memory (MB) Top-1 (%)
DeiT-Ti 5 1.3 4,171 71.9
DeiT-Ti w/ Mesa 5 1.3 1,858 72.1
DeiT-S 22 4.6 8,459 79.8
DeiT-S w/ Mesa 22 4.6 3,840 80.0
DeiT-B 86 17.5 17,691 81.8
DeiT-B w/ Mesa 86 17.5 8,616 81.8
Swin-Ti 29 4.5 11,812 81.3
Swin-Ti w/ Mesa 29 4.5 5,371 81.3
PVT-Ti 13 1.9 7,800 75.1
PVT-Ti w/ Mesa 13 1.9 3,782 74.9

Memory footprint at training time is measured with a batch size of 128 and an image resolution of 224x224 on a single GPU.

License

This repository is released under the Apache 2.0 license as found in the LICENSE file.

Acknowledgments

This repository has adopted part of the quantization codes from ActNN, we thank the authors for their open-sourced code.

You might also like...
Official implementation of our CVPR2021 paper
Official implementation of our CVPR2021 paper "OTA: Optimal Transport Assignment for Object Detection" in Pytorch.

OTA: Optimal Transport Assignment for Object Detection This project provides an implementation for our CVPR2021 paper "OTA: Optimal Transport Assignme

This is the official PyTorch implementation of the paper
This is the official PyTorch implementation of the paper "TransFG: A Transformer Architecture for Fine-grained Recognition" (Ju He, Jie-Neng Chen, Shuai Liu, Adam Kortylewski, Cheng Yang, Yutong Bai, Changhu Wang, Alan Yuille).

TransFG: A Transformer Architecture for Fine-grained Recognition Official PyTorch code for the paper: TransFG: A Transformer Architecture for Fine-gra

StyleGAN2-ADA - Official PyTorch implementation
StyleGAN2-ADA - Official PyTorch implementation

Need Help? If you’re new to StyleGAN2-ADA and looking to get started, please check out this video series from a course Lia Coleman and I taught in Oct

Official PyTorch implementation of
Official PyTorch implementation of "ArtFlow: Unbiased Image Style Transfer via Reversible Neural Flows"

ArtFlow Official PyTorch implementation of the paper: ArtFlow: Unbiased Image Style Transfer via Reversible Neural Flows Jie An*, Siyu Huang*, Yibing

Official PyTorch implementation of RobustNet (CVPR 2021 Oral)
Official PyTorch implementation of RobustNet (CVPR 2021 Oral)

RobustNet (CVPR 2021 Oral): Official Project Webpage Codes and pretrained models will be released soon. This repository provides the official PyTorch

Official PyTorch implementation for Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers, a novel method to visualize any Transformer-based network. Including examples for DETR, VQA.
Official PyTorch implementation for Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers, a novel method to visualize any Transformer-based network. Including examples for DETR, VQA.

PyTorch Implementation of Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers 1 Using Colab Please notic

[PyTorch] Official implementation of CVPR2021 paper
[PyTorch] Official implementation of CVPR2021 paper "PointDSC: Robust Point Cloud Registration using Deep Spatial Consistency". https://arxiv.org/abs/2103.05465

PointDSC repository PyTorch implementation of PointDSC for CVPR'2021 paper "PointDSC: Robust Point Cloud Registration using Deep Spatial Consistency",

Official PyTorch implementation of MX-Font (Multiple Heads are Better than One: Few-shot Font Generation with Multiple Localized Experts)

Introduction Pytorch implementation of Multiple Heads are Better than One: Few-shot Font Generation with Multiple Localized Expert. | paper Song Park1

Official Pytorch implementation of 'GOCor: Bringing Globally Optimized Correspondence Volumes into Your Neural Network' (NeurIPS 2020)
Official Pytorch implementation of 'GOCor: Bringing Globally Optimized Correspondence Volumes into Your Neural Network' (NeurIPS 2020)

Official implementation of GOCor This is the official implementation of our paper : GOCor: Bringing Globally Optimized Correspondence Volumes into You

Comments
  • Does not compile with new pytorch versions

    Does not compile with new pytorch versions

    pip3 install torch==1.10.0+cu113 torchvision==0.11.1+cu113 torchaudio==0.10.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html
    conda install pytorch=1.10 torchvision torchaudio cudatoolkit=11.3 -c pytorch
    conda install pytorch==1.9 torchvision torchaudio cudatoolkit=11.1 -c pytorch -c conda-forge -y
    
    FAILED: /home/Desktop/Mesa/build/temp.linux-x86_64-3.7/native.o 
    c++ -MMD -MF /home/Desktop/Mesa/build/temp.linux-x86_64-3.7/native.o.d -pthread -B /home/miniconda3/envs/mesa2/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/miniconda3/envs/mesa2/lib/python3.7/site-packages/torch/include -I/home/miniconda3/envs/mesa2/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -I/home/miniconda3/envs/mesa2/lib/python3.7/site-packages/torch/include/TH -I/home/miniconda3/envs/mesa2/lib/python3.7/site-packages/torch/include/THC -I/home/miniconda3/envs/mesa2/include/python3.7m -c -c /home/Desktop/Mesa/native.cpp -o /home/Desktop/Mesa/build/temp.linux-x86_64-3.7/native.o -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=native -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
    cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
    /home/Desktop/Mesa/native.cpp: In function ‘at::Tensor gelu_backward_cpu(const at::Tensor&, const at::Tensor&)’:
    /home/Desktop/Mesa/native.cpp:92:24: error: ‘gelu_backward_cpu’ is not a member of ‘at::native’; did you mean ‘glu_backward_cpu’?
       92 |     return at::native::gelu_backward_cpu(grad_output, input);
          |                        ^~~~~~~~~~~~~~~~~
          |                        glu_backward_cpu
    /home/Desktop/Mesa/native.cpp: In function ‘at::Tensor gelu_backward_cuda(const at::Tensor&, const at::Tensor&)’:
    /home/Desktop/Mesa/native.cpp:98:24: error: ‘gelu_backward_cuda’ is not a member of ‘at::native’; did you mean ‘glu_backward_cuda’?
       98 |     return at::native::gelu_backward_cuda(grad_output, input);
          |                        ^~~~~~~~~~~~~~~~~~
          |                        glu_backward_cuda
    /home/Desktop/Mesa/native.cpp: In function ‘at::Tensor softmax_backward_cpu(const at::Tensor&, const at::Tensor&, int64_t, const at::Tensor&)’:
    /home/Desktop/Mesa/native.cpp:190:24: error: ‘softmax_backward_cpu’ is not a member of ‘at::native’; did you mean ‘col2im_backward_cpu’?
      190 |     return at::native::softmax_backward_cpu(grad_output, output, dim, self);
          |                        ^~~~~~~~~~~~~~~~~~~~
          |                        col2im_backward_cpu
    /home/Desktop/Mesa/native.cpp: In function ‘at::Tensor softmax_backward_cuda(const at::Tensor&, const at::Tensor&, int64_t, const at::Tensor&)’:
    /home/Desktop/Mesa/native.cpp:198:24: error: ‘softmax_backward_cuda’ is not a member of ‘at::native’; did you mean ‘col2im_backward_cuda’?
      198 |     return at::native::softmax_backward_cuda(grad_output, output, dim, self);
          |                        ^~~~~~~~~~~~~~~~~~~~~
          |                        col2im_backward_cuda
    ninja: build stopped: subcommand failed.
    Traceback (most recent call last):
      File "/home/miniconda3/envs/mesa2/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1723, in _run_ninja_build
        env=env)
      File "/home/miniconda3/envs/mesa2/lib/python3.7/subprocess.py", line 512, in run
        output=stdout, stderr=stderr)
    subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
    

    These versions seem to work:

    conda install pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cudatoolkit=10.2 -c pytorch
    conda install pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cudatoolkit=11.1 -c pytorch -c conda-forge
    pip install torch==1.7.1+cu101 torchvision==0.8.2+cu101 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html
    
    good first issue 
    opened by styler00dollar 3
  • How do you measure train memory?

    How do you measure train memory?

    Thanks for your great work! I have a question about the memory usage.

    How do you get the "train memory" in this table https://github.com/zhuang-group/Mesa#results-on-imagenet? Is it the total memory required for training? Do you get it analytically or empirically using torch.cuda.memory_allocated?

    If I understand correctly, Mesa compresses fp16 activations to int8, so it can at most reduce the activation memory by 2x. There are also other components in total memory, so Mesa cannot reduce the total memory by 2x. However, in your results, Mesa reduces memory by more than 2x for some models. How is this possible?

    question 
    opened by merrymercy 2
Owner
Zhuang AI Group
Zhuang AI Group
Official PyTorch implementation for paper Context Matters: Graph-based Self-supervised Representation Learning for Medical Images

Context Matters: Graph-based Self-supervised Representation Learning for Medical Images Official PyTorch implementation for paper Context Matters: Gra

null 49 Nov 23, 2022
StyleGAN2-ADA - Official PyTorch implementation

Abstract: Training generative adversarial networks (GAN) using too little data typically leads to discriminator overfitting, causing training to diverge. We propose an adaptive discriminator augmentation mechanism that significantly stabilizes training in limited data regimes.

NVIDIA Research Projects 3.2k Dec 30, 2022
Official PyTorch implementation of Joint Object Detection and Multi-Object Tracking with Graph Neural Networks

This is the official PyTorch implementation of our paper: "Joint Object Detection and Multi-Object Tracking with Graph Neural Networks". Our project website and video demos are here.

Richard Wang 443 Dec 6, 2022
Official pytorch implementation of paper "Image-to-image Translation via Hierarchical Style Disentanglement".

HiSD: Image-to-image Translation via Hierarchical Style Disentanglement Official pytorch implementation of paper "Image-to-image Translation

null 364 Dec 14, 2022
Official pytorch implementation of paper "Inception Convolution with Efficient Dilation Search" (CVPR 2021 Oral).

IC-Conv This repository is an official implementation of the paper Inception Convolution with Efficient Dilation Search. Getting Started Download Imag

Jie Liu 111 Dec 31, 2022
Official PyTorch Implementation of Unsupervised Learning of Scene Flow Estimation Fusing with Local Rigidity

UnRigidFlow This is the official PyTorch implementation of UnRigidFlow (IJCAI2019). Here are two sample results (~10MB gif for each) of our unsupervis

Liang Liu 28 Nov 16, 2022
Official implementation of our paper "LLA: Loss-aware Label Assignment for Dense Pedestrian Detection" in Pytorch.

LLA: Loss-aware Label Assignment for Dense Pedestrian Detection This project provides an implementation for "LLA: Loss-aware Label Assignment for Dens

null 35 Dec 6, 2022
An official implementation of "SFNet: Learning Object-aware Semantic Correspondence" (CVPR 2019, TPAMI 2020) in PyTorch.

PyTorch implementation of SFNet This is the implementation of the paper "SFNet: Learning Object-aware Semantic Correspondence". For more information,

CV Lab @ Yonsei University 87 Dec 30, 2022
Old Photo Restoration (Official PyTorch Implementation)

Bringing Old Photo Back to Life (CVPR 2020 oral)

Microsoft 11.3k Dec 30, 2022
Official PyTorch implementation of Spatial Dependency Networks.

Spatial Dependency Networks: Neural Layers for Improved Generative Image Modeling Đorđe Miladinović   Aleksandar Stanić   Stefan Bauer   Jürgen Schmid

Djordje Miladinovic 34 Jan 19, 2022