# Mesa: A Memory-saving Training Framework for Transformers
This is the official PyTorch implementation for Mesa: A Memory-saving Training Framework for Transformers.
By Zizheng Pan, Peng Chen, Haoyu He, Jing Liu, Jianfei Cai and Bohan Zhuang.
## Installation
- Create a virtual environment with Anaconda:

  ```bash
  conda create -n mesa python=3.7 -y
  conda activate mesa

  # Install PyTorch, we use PyTorch 1.7.1 with CUDA 10.1
  pip install torch==1.7.1+cu101 torchvision==0.8.2+cu101 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html

  # Install ninja
  pip install ninja
  ```
- Build and install Mesa:

  ```bash
  # clone this repo
  git clone https://github.com/zhuang-group/Mesa

  # build
  cd Mesa/
  # You need to have an NVIDIA GPU
  python setup.py develop
  ```
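As an optional sanity check (a minimal sketch, assuming the CUDA extension built successfully), you can verify that the package imports and that a Mesa layer such as `ms.GELU` can be instantiated:

```bash
python -c "import mesa as ms; print(ms.GELU(quant_groups=3))"
```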
## Usage
- Prepare your policy and save it as a text file, e.g. `policy.txt`:

  ```
  on gelu: # layer tag, choices: fc, conv, gelu, bn, relu, softmax, matmul, layernorm
      by_index: all # layer index
      enable: True # enable compression
      level: 256 # we adopt 8-bit quantization by default
      ema_decay: 0.9 # the decay rate for running estimates

      by_index: 1 2 # e.g. excluding the GELU layers indexed by 1 and 2
      enable: False
  ```
- Next, wrap your model with Mesa:

  ```python
  import mesa as ms
  ms.policy.convert_by_num_groups(model, 3)
  # or convert by group size with ms.policy.convert_by_group_size(model, 64)

  # setup compression policy
  ms.policy.deploy_on_init(model, '[path to policy.txt]', verbose=print, override_verbose=False)
  ```

  That's all you need to use Mesa for memory saving.
  Note that `convert_by_num_groups` and `convert_by_group_size` only recognize `nn.XXX` modules. If your code uses functional operations, such as `Q@K` and `F.softmax`, you may need to set up these layers manually, for example (see the end-to-end sketch after this list for these replacements in context):

  ```python
  # matrix multiplication (before)
  out = Q @ K.transpose(-2, -1)
  # with Mesa
  self.mm = ms.MatMul(quant_groups=3)
  out = self.mm(q, k.transpose(-2, -1))

  # softmax (before)
  attn = attn.softmax(dim=-1)
  # with Mesa
  self.softmax = ms.Softmax(dim=-1, quant_groups=3)
  attn = self.softmax(attn)
  ```
- You can also target one layer by:

  ```python
  import mesa as ms
  # previous
  self.act = nn.GELU()
  # with Mesa
  self.act = ms.GELU(quant_groups=[num of quantization groups])
  ```
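For reference, below is a minimal end-to-end sketch (not part of the original codebase) that combines the steps above: a toy transformer block whose attention uses `ms.MatMul` and `ms.Softmax` for the functional ops, converted with `convert_by_num_groups`, deployed with a policy file, and trained for one step. The class names `ToyAttention` and `ToyBlock`, the dimensions, and the policy path are illustrative assumptions; only the Mesa calls documented above are used.

```python
import torch
import torch.nn as nn
import mesa as ms


class ToyAttention(nn.Module):
    """Multi-head self-attention with Mesa modules replacing the functional ops."""
    def __init__(self, dim, num_heads=4, quant_groups=3):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)   # nn.Linear is picked up by convert_by_num_groups
        self.proj = nn.Linear(dim, dim)
        # functional ops replaced by Mesa layers, as described above
        self.attn_mm = ms.MatMul(quant_groups=quant_groups)   # Q @ K^T
        self.out_mm = ms.MatMul(quant_groups=quant_groups)    # attn @ V
        self.softmax = ms.Softmax(dim=-1, quant_groups=quant_groups)

    def forward(self, x):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)          # each: (B, heads, N, head_dim)
        attn = self.softmax(self.attn_mm(q, k.transpose(-2, -1)) * self.scale)
        out = self.out_mm(attn, v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


class ToyBlock(nn.Module):
    """A single pre-norm transformer block."""
    def __init__(self, dim=192):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = ToyAttention(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, x):
        x = x + self.attn(self.norm1(x))
        return x + self.mlp(self.norm2(x))


model = ToyBlock().cuda()
# Convert the nn.XXX layers (Linear, GELU, LayerNorm, ...) to their Mesa counterparts
ms.policy.convert_by_num_groups(model, 3)
# Apply the compression policy (the path here is illustrative)
ms.policy.deploy_on_init(model, 'policy.txt', verbose=print, override_verbose=False)

# Train exactly as with a plain PyTorch model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(8, 196, 192, device='cuda')
loss = model(x).mean()
loss.backward()
optimizer.step()
```

As with the snippets above, the conversion only touches `nn.XXX` modules, which is why the matrix multiplications and the softmax are wrapped manually inside the attention module.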
## Demo projects for DeiT and Swin
We provide demo projects to replicate our results of training DeiT and Swin with Mesa; please refer to DeiT-Mesa and Swin-Mesa.
## Results on ImageNet
| Model | Param (M) | FLOPs (G) | Train Memory (MB) | Top-1 (%) |
| --- | --- | --- | --- | --- |
| DeiT-Ti | 5 | 1.3 | 4,171 | 71.9 |
| DeiT-Ti w/ Mesa | 5 | 1.3 | 1,858 | 72.1 |
| DeiT-S | 22 | 4.6 | 8,459 | 79.8 |
| DeiT-S w/ Mesa | 22 | 4.6 | 3,840 | 80.0 |
| DeiT-B | 86 | 17.5 | 17,691 | 81.8 |
| DeiT-B w/ Mesa | 86 | 17.5 | 8,616 | 81.8 |
| Swin-Ti | 29 | 4.5 | 11,812 | 81.3 |
| Swin-Ti w/ Mesa | 29 | 4.5 | 5,371 | 81.3 |
| PVT-Ti | 13 | 1.9 | 7,800 | 75.1 |
| PVT-Ti w/ Mesa | 13 | 1.9 | 3,782 | 74.9 |
Memory footprint at training time is measured with a batch size of 128 and an image resolution of 224x224 on a single GPU.
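For reference, peak training memory of this kind can be measured with standard PyTorch CUDA utilities. The following is a minimal sketch under that assumption (not necessarily the exact script used to produce the table above); `peak_train_memory_mb` is a hypothetical helper:

```python
import torch

def peak_train_memory_mb(model, criterion, batch_size=128, resolution=224, device='cuda'):
    """Return peak allocated GPU memory (in MB) for one forward/backward pass."""
    model = model.to(device).train()
    images = torch.randn(batch_size, 3, resolution, resolution, device=device)
    targets = torch.randint(0, 1000, (batch_size,), device=device)
    torch.cuda.reset_peak_memory_stats(device)
    loss = criterion(model(images), targets)
    loss.backward()
    torch.cuda.synchronize(device)
    return torch.cuda.max_memory_allocated(device) / (1024 ** 2)
```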
## License
This repository is released under the Apache 2.0 license as found in the LICENSE file.
## Acknowledgments
This repository has adopted part of the quantization code from ActNN. We thank the authors for their open-sourced code.