I-BERT: Integer-only BERT Quantization

Sehoon Kim

Last update: Dec 27, 2022

Related tags

Deep Learning natural-language-processing transformer quantization bert model-compression efficient-model efficient-neural-networks

Overview

I-BERT: Integer-only BERT Quantization

HuggingFace Implementation

I-BERT is also available in the master branch of HuggingFace! Visit the following links for the HuggingFace implementation.

Github Link: https://github.com/huggingface/transformers/tree/master/src/transformers/models/ibert

Model Links:

Installation & Requirements

You can find more detailed installation guides in the Fairseq repo: https://github.com/pytorch/fairseq

1. Fairseq Installation

Reference: Fairseq

PyTorch version >= 1.4.0
Python version >= 3.6
Currently, I-BERT only supports training on GPU

git clone https://github.com/kssteven418/I-BERT.git
cd I-BERT
pip install --editable ./

2. Download pre-trained RoBERTa models

Reference: Fairseq RoBERTa

Download pretrained RoBERTa models from the links and unzip them.

RoBERTa-Base: roberta.base.tar.gz
RoBERTa-Large: roberta.large.tar.gz

# In I-BERT (root) directory
mkdir models && cd models
wget {link}
tar -xvf roberta.{base|large}.tar.gz

3. Download GLUE datasets

Reference: Fairseq Finetuning on GLUE

First, download the data from the GLUE website. Make sure to download the dataset into the I-BERT (root) directory.

# In I-BERT (root) directory
wget https://gist.githubusercontent.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e/raw/17b8dd0d724281ed7c3b2aeeda662b92809aadd5/download_glue_data.py
python download_glue_data.py --data_dir glue_data --tasks all

Then, preprocess the data.

# In I-BERT (root) directory
./examples/roberta/preprocess_GLUE_tasks.sh glue_data {task_name}

task_name can be one of the following: {ALL, QQP, MNLI, QNLI, MRPC, RTE, STS-B, SST-2, CoLA}. ALL will preprocess all the tasks. If the command runs properly, the preprocessed datasets will be stored in I-BERT/{task_name}-bin.
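As a quick sanity check after preprocessing, the following Python sketch lists which of the expected {task_name}-bin directories are still missing. The helper names (expected_bin_dirs, missing_bin_dirs) are hypothetical and not part of the repository, and the sketch assumes that ALL produces one directory per individual task.

```python
from pathlib import Path

# Task names as listed above; "ALL" is assumed to expand to every individual task.
GLUE_TASKS = ["QQP", "MNLI", "QNLI", "MRPC", "RTE", "STS-B", "SST-2", "CoLA"]


def expected_bin_dirs(task_name, root="."):
    """Return the {task_name}-bin directories that preprocessing should create."""
    tasks = GLUE_TASKS if task_name == "ALL" else [task_name]
    unknown = [t for t in tasks if t not in GLUE_TASKS]
    if unknown:
        raise ValueError(f"unknown GLUE task(s): {unknown}")
    return [Path(root) / f"{t}-bin" for t in tasks]


def missing_bin_dirs(task_name, root="."):
    """Return the expected directories that do not exist under root yet."""
    return [d for d in expected_bin_dirs(task_name, root) if not d.is_dir()]
```

Calling missing_bin_dirs("MRPC") from the I-BERT (root) directory should return an empty list once MRPC has been preprocessed.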

Now, you have the models and the datasets ready, so you are ready to run I-BERT!

Task-specific Model Finetuning

Before quantizing the model, you first have to finetune the pre-trained model on a specific downstream task. Although you can finetune the model with the original Fairseq repo, we provide the ibert-base branch, where you can train non-quantized models without having to install the original Fairseq. This branch is identical to the master branch of the original Fairseq repo, except for some logging and run scripts that do not affect functionality. If you already have finetuned models, you can skip this part.

Run the following commands to fetch and move to the ibert-base branch:

# In I-BERT (root) directory
git fetch
git checkout -t origin/ibert-base

Then, run the script:

# In I-BERT (root) directory
# CUDA_VISIBLE_DEVICES={device} python run.py --arch {roberta_base|roberta_large} --task {task_name}
CUDA_VISIBLE_DEVICES=0 python run.py --arch roberta_base --task MRPC

Checkpoints and validation logs will be stored in the ./outputs directory. You can change this location with the option --output-dir OUTPUT_DIR. The exact output location will look something like: ./outputs/none/MRPC-base/wd0.1_ad0.1_d0.1_lr2e-5/1219-101427_ckpt/checkpoint_best.pt. By default, models are trained with the task-specific hyperparameters specified in Fairseq Finetuning on GLUE. However, you can override these hyperparameters with command-line options (use the option -h for more details).
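Because the exact output path contains a hyperparameter string and a timestamp, it can be convenient to locate the checkpoint programmatically. A minimal sketch, assuming the default ./outputs layout shown above; latest_best_checkpoint is a hypothetical helper, not part of the repository:

```python
from pathlib import Path


def latest_best_checkpoint(output_dir="./outputs"):
    """Return the most recently modified checkpoint_best.pt under output_dir, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    # Search every run directory below the output root.
    candidates = list(root.rglob("checkpoint_best.pt"))
    if not candidates:
        return None
    # Pick the file with the newest modification time.
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

The returned path can then be passed to the --restore-file option in the quantization step.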

Quantization & Quantization-Aware-Finetuning

Now, switch back to the ibert branch for quantization.

git checkout ibert

Then run the script. It first quantizes the model and then performs quantization-aware finetuning with the learning rate that you specify with the option --lr {lr}.

# In I-BERT (root) directory
# CUDA_VISIBLE_DEVICES={device} python run.py --arch {roberta_base|roberta_large} --task {task_name} \
# --restore-file {ckpt_path} --lr {lr}
CUDA_VISIBLE_DEVICES=0 python run.py --arch roberta_base --task MRPC --restore-file ckpt-best.pt --lr 1e-6
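When sweeping over tasks or learning rates, the same command line can be assembled from Python. This is only a sketch: build_qat_command is a hypothetical helper, and it uses only the options shown in the command above.

```python
import shlex


def build_qat_command(arch, task, ckpt_path, lr):
    """Assemble the run.py invocation for quantization-aware finetuning."""
    if arch not in ("roberta_base", "roberta_large"):
        raise ValueError(f"unsupported arch: {arch}")
    # lr is kept as a string so it appears on the command line exactly as given.
    return [
        "python", "run.py",
        "--arch", arch,
        "--task", task,
        "--restore-file", str(ckpt_path),
        "--lr", str(lr),
    ]


cmd = build_qat_command("roberta_base", "MRPC", "ckpt-best.pt", "1e-6")
print(shlex.join(cmd))
# To launch it on GPU 0 (not executed here):
# subprocess.run(cmd, env={**os.environ, "CUDA_VISIBLE_DEVICES": "0"}, check=True)
```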

NOTE: Our work is still in progress. Currently, all integer operations are executed with floating point.
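In other words, the quantized values are integer-valued but are held in floating-point storage. The following is a minimal pure-Python sketch of that idea using symmetric uniform quantization; quantize_symmetric is a hypothetical helper for illustration, not the repository's implementation.

```python
def quantize_symmetric(values, num_bits=8):
    """Map floats to integer-valued floats q such that values ~= q * scale."""
    q_max = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / q_max if max_abs > 0 else 1.0
    # Round to the nearest integer, clamp to [-q_max, q_max], then store as float.
    q = [float(max(-q_max, min(q_max, round(v / scale)))) for v in values]
    return q, scale


weights = [0.4, -1.0, 0.25, 0.0]
q, scale = quantize_symmetric(weights)
dequantized = [qi * scale for qi in q]

# Every entry of q holds an integer value, yet its type is float.
print(q, [type(qi).__name__ for qi in q])
print(dequantized)
```

Simulating integer arithmetic this way is useful for checking accuracy, but it does not by itself give the latency benefit of real integer kernels.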

Comments

Training in mixed precision

What is your question?

Hi, thanks for the amazing contribution! I'm trying to use IBert from Huggingface/transformers (4.4.2) in my own training pipeline, where I'm fine-tuning in quant mode with mixed precision (using PyTorch's cuda.amp module). This results in overflows in the QuantLinear layers, which cause subsequent training to break due to NaNs. I'm considering artificially clamping the weights to a smaller range to avoid this, or using a lower bit precision (from 8 to, say, 4) while fine-tuning.

I'm wondering if you have tried this or have any suggestions about my approaches that could help me train effectively.

Thanks.

Code

with autocast(enabled=grad_scaler.is_enabled()):
    # TRAINING CODE...

I'm unable to post any more code (proprietary stuff, sorry!), but I can provide some specifics if you need them.

What's your environment?

PyTorch Version: 1.8.0
OS: Ubuntu 18.04
CUDA/cuDNN version: 10.1/7.6.5

question
opened by bdalal 3
Why use 22 bit quantized activations for some layer norms (except in Embeddings)?

Hi, I've noticed that the QuantAct layers preceding IntLayerNorm in the IBertSelfOutput and IBertOutput modules specify a 22-bit activation width, while the QuantAct layer preceding IntLayerNorm in IBertEmbeddings specifies a 16-bit activation.

I couldn't find any mention of these bit width choices in the paper. Could you please explain why these choices have been made?

Thank you!
question

opened by bdalal 2
Why is integer-only finetuning much slower than fp32 finetuning?

Compared with fp32 finetuning, evaluating on the dev data during training takes about 10x longer with integer-only finetuning. How can I do INT8 inference and achieve the speedup described in the paper?
question

opened by renmada 0

(huggingface) The output of IBERT is float. Am I doing wrong?

What is your question?

I'm using the huggingface's implementation. Even though I set the quant_mode=True, I see the output of IBert is in float32 type. Am I using the model wrong, or is it expected?

Code

self.bert = AutoModel.from_pretrained(
    base_model, quant_mode=quant_mode, add_pooling_layer=False
)

...


def forward(
        self,
        input_ids: Tensor,
        attention_mask: Tensor,
        k: int = None,
        return_layers: List[int] = None,
        return_orig: bool = False,
    ):
        bert_out = self.bert(
            input_ids,
            attention_mask=attention_mask,
            output_hidden_states=True,
            return_dict=True,
        )

        # the output dtype is float32!
        print(bert_out.hidden_states[0])

What's your environment?

PyTorch Version: 1.7.1
OS (e.g., Linux): Ubuntu 20
How you installed fairseq (pip, source): No
Python version: 3.8.5
CUDA/cuDNN version: 11.0
GPU models and configuration: RTX 3090

question

opened by kyoungrok0517 0

Where can I find the integer-sqrt kernel ?

question
opened by Alex-Songs 0
About scaling_factor
Dear Authors, Thanks for sharing valuable codes.

I'm trying to use your code for vision transformer quantization.

About the scaling factor for this work, I have some questions. If I want to swap some layers (i.e. GELU -> IntGELU), I have to set the scaling factor for input args.

For this, I suppose that I can add QuantAct in the forward function of IntGELU,

class IntGELU(nn.Module):

    def forward(self, x, scaling_factor=None):
        if not self.quant_mode:
            return self.activation_fn(x), None
        x, scaling_factor = QuantAct(32, quant_mode=self.quant_mode)

Is it right? Could you please give me some advice? Thanks in advance.
question
opened by DOHA-HWANG 0
Pre-trained weights for specific tasks

Hi, thank you for releasing this project!

I was wondering if you happen to have the pre-trained weights for the models finetuned on the different downstream tasks (QQP, MNLI.. etc). i.e. the initialisation weights for the quantisation-aware fine-tuning stage. I only say this as it would save me a lot of time and compute, and may be helpful to others too.

Thanks
question

opened by roymiles 0
Storing both float32 and int parameters

Hi

It looks like, at least in the HF code, you are storing both the float32 AND the int weights, which would increase the memory footprint. Don't you want to load either one or the other, or at least have an option to quantize and send to cuda or something like that, where you would clear the float32 version or the int version and send to cuda, thus lowering the memory footprint? Alternatively, you could overload the 'to' (or 'cuda', or whatever method is used to convert to cuda) to move over only the right parameters.

Thanks
bug

opened by ontocord 0
Latency 20x with quant_mode = true

In the Hugging Face config, I set quant_mode = TRUE. The weight_integer buffer remains 0, and the result is wrong. Moreover, the inference latency of integer mode is 20 times that of float mode. Can you please explain the reason?
question

opened by LiamPKU 1

Owner

Sehoon Kim

Paper: https://arxiv.org/abs/2101.01321
