I-BERT: Integer-only BERT Quantization

Overview

[Overview figure: I-BERT integer-only BERT quantization]

HuggingFace Implementation

I-BERT is also available in the master branch of HuggingFace Transformers! Visit the following links for the HuggingFace implementation.

Github Link: https://github.com/huggingface/transformers/tree/master/src/transformers/models/ibert

Model Links:
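If you only need the HuggingFace implementation, the models can be loaded through the standard Auto classes. Below is a minimal, illustrative sketch; the checkpoint name kssteven/ibert-roberta-base is an assumption and may differ from the officially listed model links.

# Hedged sketch; the checkpoint name below is an assumption.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("kssteven/ibert-roberta-base")
model = AutoModel.from_pretrained("kssteven/ibert-roberta-base")

inputs = tokenizer("I-BERT performs integer-only inference.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)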

Installation & Requirements

You can find more detailed installation guides from the Fairseq repo: https://github.com/pytorch/fairseq

1. Fairseq Installation

Reference: Fairseq

  • PyTorch version >= 1.4.0
  • Python version >= 3.6
  • Currently, I-BERT only supports training on GPU
git clone https://github.com/kssteven418/I-BERT.git
cd I-BERT
pip install --editable ./

2. Download pre-trained RoBERTa models

Reference: Fairseq RoBERTa

Download pretrained RoBERTa models from the links and unzip them.

# In I-BERT (root) directory
mkdir models && cd models
wget {link}
tar -xvf roberta.{base|large}.tar.gz

3. Download GLUE datasets

Reference: Fairseq Finetuning on GLUE

First, download the data from the GLUE website. Make sure to download the dataset into the I-BERT (root) directory.

# In I-BERT (root) directory
wget https://gist.githubusercontent.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e/raw/17b8dd0d724281ed7c3b2aeeda662b92809aadd5/download_glue_data.py
python download_glue_data.py --data_dir glue_data --tasks all

Then, preprocess the data.

# In I-BERT (root) directory
./examples/roberta/preprocess_GLUE_tasks.sh glue_data {task_name}

task_name can be one of the following: {ALL, QQP, MNLI, QNLI, MRPC, RTE, STS-B, SST-2, CoLA}. ALL will preprocess all the tasks. If the command runs properly, the preprocessed datasets will be stored in I-BERT/{task_name}-bin.

Now that you have the models and the datasets ready, you can run I-BERT!

Task-specific Model Finetuning

Before quantizing the model, you first have to finetune the pre-trained models on a specific downstream task. Although you can finetune the model from the original Fairseq repo, we provide the ibert-base branch where you can train non-quantized models without having to install the original Fairseq. This branch is identical to the master branch of the original Fairseq repo, except for some logging and run scripts that do not affect the core functionality. If you already have finetuned models, you can skip this part.

Run the following commands to fetch and move to the ibert-base branch:

# In I-BERT (root) directory
git fetch
git checkout -t origin/ibert-base

Then, run the script:

# In I-BERT (root) directory
# CUDA_VISIBLE_DEVICES={device} python run.py --arch {roberta_base|roberta_large} --task {task_name}
CUDA_VISIBLE_DEVICES=0 python run.py --arch roberta_base --task MRPC

Checkpoints and validation logs will be stored in the ./outputs directory. You can change this output location with the option --output-dir OUTPUT_DIR. The exact output location will look something like: ./outputs/none/MRPC-base/wd0.1_ad0.1_d0.1_lr2e-5/1219-101427_ckpt/checkpoint_best.pt. By default, models are trained with the task-specific hyperparameters specified in Fairseq Finetuning on GLUE. However, you can also override the hyperparameters via command-line options (use -h for more details).

Quantization & Quantization-Aware Finetuning

Now, we come back to the ibert branch for quantization.

git checkout ibert

Then run the script. This will first quantize the model and then perform quantization-aware finetuning with the learning rate you specify via the option --lr {lr}.

# In I-BERT (root) directory
# CUDA_VISIBLE_DEVICES={device} python run.py --arch {roberta_base|roberta_large} --task {task_name} \
# --restore-file {ckpt_path} --lr {lr}
CUDA_VISIBLE_DEVICES=0 python run.py --arch roberta_base --task MRPC --restore-file ckpt-best.pt --lr 1e-6

NOTE: Our work is still in progress. Currently, all integer operations are simulated with floating-point arithmetic.
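To make the note above concrete, here is a minimal, illustrative sketch (not the repository's exact quantization function) of what simulated integer arithmetic means: values are snapped to an integer grid but kept in float32 tensors, so no real integer kernels are involved and no speedup should be expected yet.

# Illustrative sketch only; not the repo's exact implementation.
import torch

def fake_symmetric_quantize(x: torch.Tensor, num_bits: int = 8):
    # Snap values to a signed integer grid, but keep everything in float32.
    qmax = 2 ** (num_bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-8) / qmax
    x_int = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)  # integer-valued, float dtype
    return x_int * scale, scale

x = torch.randn(4, 8)
x_q, scale = fake_symmetric_quantize(x)
print(x_q.dtype)  # torch.float32 -- the integer math is only simulated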

Comments
  • Training in mixed precision

    What is your question?

    Hi, thanks for the amazing contribution! I'm trying to use IBert from huggingface/transformers (4.4.2) in my own training pipeline, where I'm fine-tuning in quant mode with mixed precision (using PyTorch's cuda.amp module). This results in overflows in the QuantLinear layers, which causes subsequent training to break due to NaNs. I'm considering artificially clamping the weights to a smaller range to avoid this, or using a lower bit precision (say 4 instead of 8) while fine-tuning.

    I'm wondering if you have tried this or have any suggestions about my approaches that could help me train effectively.

    Thanks.

    Code

    from torch.cuda.amp import autocast

    with autocast(enabled=grad_scaler.is_enabled()):
        # training code...
        ...

    I'm unable to post any more code (proprietary stuff, sorry!), but I can provide some specifics if you need them.
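
    One possible workaround (a hedged sketch, not an official recommendation): keep the quantized forward pass in fp32 by locally disabling autocast, so the simulated-integer computations in QuantLinear never run in fp16. The names ibert_model, classifier_head, optimizer, and the batch layout below are placeholders.

    # Hedged sketch; model, head, optimizer, and batch layout are placeholders.
    import torch
    from torch.cuda.amp import GradScaler, autocast

    scaler = GradScaler()

    def training_step(batch):
        input_ids, attention_mask, labels = batch
        with autocast():
            # Run the quantized encoder with autocast disabled so that
            # QuantLinear math stays in fp32 and does not overflow.
            with autocast(enabled=False):
                encoder_out = ibert_model(input_ids, attention_mask=attention_mask)
            logits = classifier_head(encoder_out.last_hidden_state[:, 0])
            loss = torch.nn.functional.cross_entropy(logits, labels)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
        return loss.detach()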

    What's your environment?

    • PyTorch Version: 1.8.0
    • OS: Ubuntu 18.04
    • CUDA/cuDNN version: 10.1/7.6.5
    question · opened by bdalal · 3 comments
  • Why use 22 bit quantized activations for some layer norms (except in Embeddings)?

    Hi, I've noticed that the QuantAct layers preceding IntLayerNorm in the IBertSelfOutput and IBertOutput modules specify a 22-bit activation width, while the QuantAct layer preceding IntLayerNorm in IBertEmbeddings specifies a 16-bit activation.

    I couldn't find any mention of these bit width choices in the paper. Could you please explain why these choices have been made?

    Thank you!

    question · opened by bdalal · 2 comments
  • Why is integer-only finetuning much slower than fp32 finetuning?

    Compared with fp32 finetuning, it takes about 10x more time to run inference on the dev data during training when doing integer-only finetuning. How can I do INT8 inference and achieve the speedup described in the paper?

    question · opened by renmada · 0 comments
  • (huggingface) The output of IBERT is float. Am I doing something wrong?

    What is your question?

    I'm using the HuggingFace implementation. Even though I set quant_mode=True, I see that the output of IBert is of float32 type. Am I using the model wrong, or is this expected?

    Code

    self.bert = AutoModel.from_pretrained(
        base_model, quant_mode=quant_mode, add_pooling_layer=False
    )

    ...

    def forward(
        self,
        input_ids: Tensor,
        attention_mask: Tensor,
        k: int = None,
        return_layers: List[int] = None,
        return_orig: bool = False,
    ):
        bert_out = self.bert(
            input_ids,
            attention_mask=attention_mask,
            output_hidden_states=True,
            return_dict=True,
        )

        # the output dtype is float32!
        print(bert_out.hidden_states[0])
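
    For reference, a minimal hedged sketch of how quant_mode is typically enabled (the checkpoint name kssteven/ibert-roberta-base is an assumption). Because quant_mode only simulates integer arithmetic (see the note above), the returned hidden states are expected to stay float32.

    # Hedged sketch; the checkpoint name is an assumption.
    import torch
    from transformers import AutoConfig, AutoModel, AutoTokenizer

    config = AutoConfig.from_pretrained("kssteven/ibert-roberta-base", quant_mode=True)
    model = AutoModel.from_pretrained("kssteven/ibert-roberta-base", config=config)
    tokenizer = AutoTokenizer.from_pretrained("kssteven/ibert-roberta-base")

    inputs = tokenizer("integer-only inference", return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    print(out.hidden_states[0].dtype)  # torch.float32 -- quantization is simulated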
    

    What's your environment?

    • PyTorch Version: 1.7.1
    • OS (e.g., Linux): Ubuntu 20
    • How you installed fairseq (pip, source): No
    • Python version: 3.8.5
    • CUDA/cuDNN version: 11.0
    • GPU models and configuration: RTX 3090
    question · opened by kyoungrok0517 · 0 comments
  • Where can I find the integer-sqrt kernel?

    question · opened by Alex-Songs · 0 comments
  • About scaling_factor

    Dear authors, thanks for sharing this valuable code.

    I'm trying to use your code for vision transformer quantization.

    I have some questions about the scaling factor. If I want to swap some layers (e.g. GELU -> IntGELU), I have to set the scaling factor for the input arguments.

    For this, I suppose I can add a QuantAct in the forward function of IntGELU:

    class IntGELU(nn.Module):
        def forward(self, x, scaling_factor=None):
            if not self.quant_mode:
                return self.activation_fn(x), None
            x, scaling_factor = QuantAct(32, quant_mode=self.quant_mode)(x)

    Is this right? Could you please give me some advice?

    Thanks in advance.
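
    For reference, a hedged sketch of how such a swap might be wired, loosely following how the HuggingFace I-BERT modules pair QuantAct with the integer operators (the wrapper class itself is hypothetical): instantiate QuantAct once in __init__ so it can track the activation range, then pass its returned scaling factor into IntGELU.

    # Hedged sketch; QuantGELUBlock is a hypothetical wrapper.
    import torch.nn as nn
    from transformers.models.ibert.quant_modules import IntGELU, QuantAct

    class QuantGELUBlock(nn.Module):
        def __init__(self, quant_mode=True):
            super().__init__()
            self.pre_act = QuantAct(32, quant_mode=quant_mode)  # created once; tracks activation range
            self.int_gelu = IntGELU(quant_mode=quant_mode)

        def forward(self, x):
            x, act_scaling_factor = self.pre_act(x)
            x, act_scaling_factor = self.int_gelu(x, act_scaling_factor)
            return x, act_scaling_factor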
    
    
    question · opened by DOHA-HWANG · 0 comments
  • Pre-trained weights for specific tasks

    Hi, thank you for releasing this project!

    I was wondering if you happen to have the pre-trained weights for the models finetuned on the different downstream tasks (QQP, MNLI, etc.), i.e. the initialization weights for the quantization-aware finetuning stage. I only ask because it would save me a lot of time and compute, and it may be helpful to others too.

    Thanks

    question · opened by roymiles · 0 comments
  • Storing both float32 and int parameters

    Hi

    It looks like, at least in the HF code, you are storing both the float32 AND the int weights, which increases the memory footprint. Don't you want to load one or the other, or at least provide an option that quantizes the model and, when sending it to CUDA, clears the float32 (or int) version, thus lowering the memory footprint? Alternatively, you could overload 'to' (or 'cuda', or whatever method is used to convert to CUDA) so that it only moves over the right parameters.

    Thanks

    bug · opened by ontocord · 0 comments
  • Latency 20x with quant_mode = true

    In the HuggingFace config, I set quant_mode = True. The weight_integer buffer remains 0, and the result is wrong. Moreover, the inference latency in integer mode is 20 times that of float mode. Can you please explain the reason?

    question · opened by LiamPKU · 1 comment