FewBit — a library for memory-efficient training of large neural networks

Overview

FewBit is a library for memory-efficient training of large neural networks. Its efficiency originates from storage optimizations applied to the backward pass and from reducing the memory footprint of tensors saved between the forward and backward passes. Specifically, the library provides its own implementations of common activation functions and of the linear layer, since these contribute the most to memory usage during training. The optimized linear layer saves up to 15-20% of memory and the optimized activation functions save up to 15-30%, with negligible loss in performance (see [1][2] for details).

The table below compares different optimizations applied to a RoBERTa model. The compression rate of the randomized linear layer is 20% (it keeps only 20% of the input), and the GELU approximation uses only 3 bits.

#   Task   Batch Size   GELU      Linear Layer   Peak Memory, GiB   Saving, %
1   MRPC   128          Vanilla   Vanilla        11.30              0.0
2   MRPC   128          3-bit     Vanilla        9.75               13.8
3   MRPC   128          Vanilla   Randomized     9.20               18.6
4   MRPC   128          3-bit     Randomized     7.60               32.7

Usage

The fewbit library implements basic activation functions with backward-pass optimizations that reduce the memory footprint during model training. All activation functions exported by the library can be used as drop-in replacements for the standard activation functions implemented in PyTorch. The common pattern is to replace the torch.nn package qualifier with fewbit.

import fewbit
import torch as T

model = T.nn.Sequential(
    ...,
    fewbit.GELU(bits=3),  # Use 3-bit GELU approximation.
    ...,
)

For pre-trained models, one can rebuild the model with the map_module routine, which walks the model tree recursively and allows selected modules or activation functions to be replaced. The user only needs to supply a suitable constructor for the new module. As an example, the code below replaces all default linear layers with randomized ones.

from fewbit import RandomizedLinear
from fewbit.util import convert_linear, map_module

converter = lambda x: convert_linear(x, RandomizedLinear, proj_dim_ratio=0.1)
new_model = map_module(old_model, converter)  # In-place model construction.

Quantized Gradients of Activation Functions

Installation

The simplest and preferred way to install the library is from PyPI.

pip install -U fewbit

FewBit is written in Python, but it implements some operations in C++/CUDA to achieve better performance. Building from source therefore requires the CUDA Toolkit and CMake as a build system. The latest revision can be installed directly from the Git repository with the following command.

pip install -U git+https://github.com/SkoltechAI/fewbit.git

List of Activation Functions

The library supports the following activation functions.

Piece-wise Activation Functions

All activation functions in this section have a 1-bit derivative. The only difference between them is the band: determining on which side of the band an input lies takes two comparisons in the backward pass. The complete list of activation functions is leaky_relu, relu, threshold, hardsigmoid, hardtanh, relu6, hardshrink, and softshrink.
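
To make the 1-bit idea concrete, the snippet below is a minimal sketch in plain PyTorch, independent of the library's native implementation: for a function like relu only a boolean mask of the input needs to be kept for the backward pass instead of the whole activation tensor. Note that torch.bool still occupies one byte per element; FewBit packs the bits in its C++/CUDA extension.

import torch

class OneBitReLU(torch.autograd.Function):
    """Illustrative ReLU that keeps only a boolean mask for the backward pass."""

    @staticmethod
    def forward(ctx, x):
        mask = x > 0                 # one comparison; band-type functions need two
        ctx.save_for_backward(mask)  # 1 bit of information per element
        return x.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        mask, = ctx.saved_tensors
        return grad_output * mask    # the derivative is either 0 or 1

x = torch.randn(4, requires_grad=True)
OneBitReLU.apply(x).sum().backward()  # x.grad is the mask as 0./1. values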

Continuous Activation Functions

Continuous activation functions can be divided into three classes according to their parity: odd, even, and neither even nor odd. The parity property allows a small optimization that increases the precision of the approximation. The complete list of reimplemented activation functions in this category is celu, elu, hardswish, logsigmoid, mish, selu, sigmoid, silu, softplus, softsign, tanh, and tanhshrink.
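
For intuition, the sketch below quantizes the exact GELU derivative to 2**bits representative values, which is roughly the idea behind the few-bit backward pass of [2]. The uniform codebook here is an assumption made purely for illustration; the library precomputes optimal boundaries and levels and stores the resulting codes in packed form in its native extension.

import math

import torch

def few_bit_gelu_grad(x, bits=3):
    """Quantize the exact GELU derivative to 2**bits levels (illustration only)."""
    # d/dx GELU(x) = Phi(x) + x * phi(x), where Phi/phi are the normal CDF/PDF.
    grad = 0.5 * (1 + torch.erf(x / math.sqrt(2))) \
        + x * torch.exp(-0.5 * x**2) / math.sqrt(2 * math.pi)
    levels = torch.linspace(-0.2, 1.2, 2**bits)                 # illustrative codebook
    boundaries = 0.5 * (levels[:-1] + levels[1:])               # nearest-level bins
    codes = torch.bucketize(grad, boundaries).to(torch.uint8)   # what gets stored
    return codes, levels

x = torch.randn(5)
codes, levels = few_bit_gelu_grad(x)
approx_grad = levels[codes.long()]  # dequantized derivative used in the backward pass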

List of Modules

The RandomizedLinear module is a replacement for the default Linear module. It uses approximate (randomized) matrix multiplication to save memory.
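
The sketch below illustrates the underlying idea under the assumption of a Gaussian sketching matrix applied along the batch dimension (see [1] for the estimator actually used): instead of saving the full input for the backward pass, only a small random projection of it is kept, and the gradient with respect to the weight is then recovered approximately.

import torch

def sketched_weight_grad(x, grad_out, proj_dim):
    """Approximate x.T @ grad_out while keeping only a proj_dim-row sketch of x."""
    batch = x.shape[0]
    # In practice the same s must be regenerated at backward time, e.g. from a saved seed.
    s = torch.randn(proj_dim, batch) / proj_dim**0.5  # E[s.T @ s] = I
    x_sketch = s @ x         # (proj_dim, in_features): saved instead of x
    g_sketch = s @ grad_out  # (proj_dim, out_features): computed during backward
    return x_sketch.t() @ g_sketch  # unbiased estimate of x.t() @ grad_out

x = torch.randn(128, 768)         # batch of activations
grad_out = torch.randn(128, 768)  # upstream gradient
approx = sketched_weight_grad(x, grad_out, proj_dim=int(0.2 * 128))
exact = x.t() @ grad_out          # what a vanilla Linear would compute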

Assembly

The preliminary step depends on one's PyTorch distribution and available tooling. Building the native components requires CMake and a build system like Make or Ninja. If PyTorch is installed system-wide, the following step is not necessary; otherwise, one should add the search path for CMake modules to the environment as follows.

export CMAKE_PREFIX_PATH="$(python -c 'import torch.utils; print(torch.utils.cmake_prefix_path)')"

The next step is useful in a development environment. It builds the PyTorch operator library in the source tree (option --inplace) with forced CUDA support (option --cuda). By default, CUDA support is not forced.

python setup.py build_ext --inplace --cuda

With the same options as in the previous step, one can build a binary wheel distribution of the package.

python setup.py bdist_wheel --inplace --cuda

Development Environment with Docker

In order to develop on different platforms, we use a custom Docker image for a non-privileged user, based on the NVIDIA CUDA image. The image contains the pre-built native extension and is parametrized by the user name and user ID of the host system; the latter is crucial for binding host volumes.

docker build -t fewbit --build-arg UID=$(id -u) .
docker run --rm -ti -e TERM=$TERM fewbit

Citation

Please cite the following papers if the library is used in an academic paper (export BibTeX).

@misc{bershatsky2022memoryefficient,
    title={{M}emory-{E}fficient {B}ackpropagation through {L}arge {L}inear {L}ayers},
    author={Daniel Bershatsky and Aleksandr Mikhalev and Alexandr Katrutsa and Julia Gusak and Daniil Merkulov and Ivan Oseledets},
    year={2022},
    eprint={2201.13195},
    archivePrefix={arXiv},
    primaryClass={cs.LG},
}

@misc{novikov2022fewbit,
    title={{F}ew-{B}it {B}ackward: {Q}uantized {G}radients of {A}ctivation {F}unctions for {M}emory {F}ootprint {R}eduction},
    author={Georgii Novikov and Daniel Bershatsky and Julia Gusak and Alex Shonenkov and Denis Dimitrov and Ivan Oseledets},
    year={2022},
    eprint={2202.00441},
    archivePrefix={arXiv},
    primaryClass={cs.LG},
}

License

© The FewBit authors, 2022 — now. Licensed under the BSD 3-Clause License. See the AUTHORS and LICENSE files for more details.¹

Footnotes

  1. The work was supported by Sber AI and the Analytical center under the RF Government (subsidy agreement 000000D730321P5Q0002, Grant No. 70-2021-00145 02.11.2021).

