MQBench: Towards Reproducible and Deployable Model Quantization Benchmark
We propose a benchmark to evaluate different quantization algorithms under various settings. MQBench is a first attempt to evaluate, analyze, and benchmark the reproducibility and deployability of model quantization algorithms. We choose multiple platforms for real-world deployment, including CPU, GPU, ASIC, and DSP, and evaluate extensive state-of-the-art quantization algorithms under a unified training pipeline. MQBench acts as a bridge between algorithms and hardware. We conduct a comprehensive analysis and find considerable insights, both intuitive and counter-intuitive.
Table of Contents
- Table of Contents
- Highlighted Features
- Installation
- How to Reproduce MQBench
- How to self-implement a quantization algorithm
- How to self-implement a hardware configuration
- Submitting Your Results to MQBench
- License
Highlighted Features
- Integrate with the latest tracing techniques in PyTorch 1.8.
- Quantization Algorithms:
  - Learned Step Size Quantization: https://arxiv.org/abs/1902.08153
  - Quantization Interval Learning: https://arxiv.org/abs/1808.05779
  - Differentiable Soft Quantization: https://arxiv.org/abs/1908.05033
  - Parameterized Clipping Activation: https://arxiv.org/abs/1805.06085
  - Additive Powers-of-Two Quantization: https://arxiv.org/abs/1909.13144
  - DoReFa-Net: https://arxiv.org/abs/1606.06160
- Network Architectures:
  - ResNet-18, ResNet-50: https://arxiv.org/abs/1512.03385
  - MobileNetV2: https://arxiv.org/abs/1801.04381
  - EfficientNet-Lite-B0: https://blog.tensorflow.org/2020/03/higher-accuracy-on-vision-models-with-efficientnet-lite.html
  - RegNetX-600MF: https://arxiv.org/abs/2003.13678
- Hardware Platforms:

| Library  | Hardware Type | Scale Form | Granularity | Symmetry   | Fold BN |
|----------|---------------|------------|-------------|------------|---------|
| Academic | None          | FP32       | Per-tensor  | Symmetric  | No      |
| TensorRT | GPU           | FP32       | Per-channel | Symmetric  | Yes     |
| ACL      | ASIC          | FP32       | Per-channel | Asymmetric | Yes     |
| TVM      | ARM CPU       | POT        | Per-tensor  | Symmetric  | Yes     |
| SNPE     | DSP           | FP32       | Per-tensor  | Asymmetric | Yes     |
| FBGEMM   | X86 CPU       | FP32       | Per-channel | Asymmetric | Yes     |
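These backend settings map naturally onto the `backend_params` dictionaries used later in this README. Below is a hypothetical summary read off the table above (the `HARDWARE_SETTINGS` name is illustrative, not part of MQBench):

```python
# Hypothetical summary of the table above; not part of MQBench itself.
# Flag names follow the get_qconfig backend_params example later in this README.
HARDWARE_SETTINGS = {
    "academic": dict(ada_sign=True,  symmetry=True,  per_channel=False, pot_scale=False),
    "tensorrt": dict(ada_sign=False, symmetry=True,  per_channel=True,  pot_scale=False),
    "acl":      dict(ada_sign=False, symmetry=False, per_channel=True,  pot_scale=False),
    "tvm":      dict(ada_sign=False, symmetry=True,  per_channel=False, pot_scale=True),
    "snpe":     dict(ada_sign=False, symmetry=False, per_channel=False, pot_scale=False),
    "fbgemm":   dict(ada_sign=False, symmetry=False, per_channel=True,  pot_scale=False),
}
```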
Installation
These instructions will help you get MQBench up and running.
- Clone MQBench.

- (Optional) Create a Python virtual environment.

- Install the packages required by MQBench:

  ```shell
  $ pip install -r requirements.txt
  ```

  Note: MQBench uses PyTorch 1.8, and our quantized models are based on the new `torch.fx` tracing technique.

- MQBench uses PyTorch distributed data-parallel training with the `nccl` backend (see details here); please make sure your machine can initialize that distributed training environment.
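If you are unsure whether your machine supports the `nccl` backend, a minimal single-process sanity check (hypothetical, not part of MQBench; assumes at least one CUDA-capable GPU) looks like this:

```python
# Minimal sanity check for the NCCL distributed backend (hypothetical,
# not part of MQBench). Assumes at least one CUDA-capable GPU.
import os
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

dist.init_process_group(backend="nccl", world_size=1, rank=0)
print("NCCL backend initialized:", dist.is_initialized())
dist.destroy_process_group()
```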
How to Reproduce MQBench
We provide the running script `run.sh` and the configuration file `config.yaml` for all experiments in MQBench.
To reproduce LSQ on ResNet-18:
- Enter the directory:

  ```shell
  $ cd PATH-TO-PROJECT/qbench_zoo
  $ cd lsq_experiments/resnet18_4bit_academic
  ```

- Run the script:

  ```shell
  $ sh run.sh
  ```
Note that `run.sh` contains some commands that may not be found on your machine; the core running command is:

```shell
PYTHONPATH=$PYTHONPATH:../../.. python -u -m prototype.solver.cls_quant_solver --config config.yaml
```
How to self-implement a quantization algorithm
All our quantization algorithms are implemented in `prototype/quantization/`. To implement a new algorithm, you need to add your quantizer to this directory.
All quantizers inherit from the `QuantizeBase` class. Each `QuantizeBase` has an observer class that is used to estimate and update the quantization range. The observer design is inspired by the PyTorch 1.8 repo. Initializing a `QuantizeBase` class will also initialize an `Observer` class.
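To illustrate the observer's role, here is a minimal sketch with illustrative names (not the repo's actual `Observer`): an observer watches tensors flow through and tracks the statistics the quantizer uses to derive its range.

```python
import torch
from torch import nn

class RangeObserver(nn.Module):
    """Minimal min/max observer sketch; MQBench's Observer classes are richer."""

    def __init__(self):
        super().__init__()
        self.register_buffer("min_val", torch.tensor(float("inf")))
        self.register_buffer("max_val", torch.tensor(float("-inf")))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Update the running range; the quantizer reads min_val/max_val
        # to compute its scale and zero point.
        self.min_val = torch.minimum(self.min_val, x.detach().min())
        self.max_val = torch.maximum(self.max_val, x.detach().max())
        return x
```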
The parameters of `QuantizeBase` and `Observer` include:

- `quant_min, quant_max`, which specify the rounding boundaries $N_{min}, N_{max}$.
- `qscheme`, which can be `torch.per_tensor_symmetric`, `torch.per_channel_symmetric`, `torch.per_tensor_affine`, or `torch.per_channel_affine`. This is often determined by the hardware setup.
- `ch_axis`, which is the dimension of channel-wise quantization; -1 denotes per-tensor quantization. Typically, for `nn.Conv2d` and `nn.Linear` modules, `ch_axis` should be 0.
- `ada_sign`, which adaptively chooses the signedness. `ada_sign` should be enabled for the academic setting only.
- `pot_scale`, which determines whether the scale parameters are restricted to powers of two.
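To make these parameters concrete, here is a hypothetical skeleton of a `QuantizeBase`-style quantizer (illustrative only; the actual base class in `prototype/quantization/` differs in detail):

```python
import torch
from torch import nn

class MyQuantizer(nn.Module):  # in MQBench, quantizers inherit from QuantizeBase
    def __init__(self, observer_cls, bitwidth=4,
                 qscheme=torch.per_tensor_symmetric,
                 ch_axis=-1, ada_sign=False, pot_scale=False):
        super().__init__()
        self.bitwidth = bitwidth
        self.qscheme = qscheme        # often fixed by the hardware setup
        self.ch_axis = ch_axis        # -1 means per-tensor quantization
        self.ada_sign = ada_sign      # academic setting only
        self.pot_scale = pot_scale    # powers-of-two scale if True
        # Rounding boundaries N_min / N_max for a signed integer grid.
        self.quant_min = -2 ** (bitwidth - 1)
        self.quant_max = 2 ** (bitwidth - 1) - 1
        # Each quantizer owns an observer that estimates the quantization range.
        self.activation_post_process = observer_cls()
```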
Note: each quantizer may have its own unique parameters; see the example of LSQ below.
Example Implementation of LSQ:
- For initialization, we add new parameters to store the scale and zero point:

  ```python
  self.use_grad_scaling = use_grad_scaling
  self.scale = Parameter(torch.tensor([scale]))
  self.zero_point = Parameter(torch.tensor([zero_point]))
  ```
- The major implementation is the `forward` function, which should handle several cases:

  - In the case of `ada_sign=True`, the quantization range should be adjusted:

    ```python
    if self.ada_sign and X.min() >= 0:
        self.quant_max = self.activation_post_process.quant_max = 2 ** self.bitwidth - 1
        self.quant_min = self.activation_post_process.quant_min = 0
        self.activation_post_process.adjust_sign = True
    ```
  - In the case of symmetric quantization, the zero point should be set to 0:

    ```python
    self.zero_point.data.zero_()
    ```
  - In the case of a powers-of-two scale, the scale should be quantized by:

    ```python
    def pot_quantization(tensor: torch.Tensor):
        # Round log2(scale) to the nearest integer with a straight-through estimator.
        log2t = torch.log2(tensor)
        log2t = (torch.round(log2t) - log2t).detach() + log2t
        return 2 ** log2t

    scale = pot_quantization(self.scale)
    ```
  - Implement both per-channel and per-tensor quantization (a combined sketch follows this list).
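Putting the cases above together, a per-tensor/per-channel LSQ-style fake-quantize might look like the following sketch. This is hypothetical: `lsq_fake_quantize` is an illustrative helper, not the repo's exact `forward`; the gradient-scaling factor follows the LSQ paper.

```python
import math
import torch

def lsq_fake_quantize(X, scale, zero_point, quant_min, quant_max,
                      ch_axis=-1, use_grad_scaling=True):
    # Hypothetical sketch of an LSQ-style fake-quantize; not MQBench's actual code.
    if use_grad_scaling:
        # Gradient scale g = 1 / sqrt(N * Q_max), as proposed in the LSQ paper;
        # the forward value of `scale` is unchanged, only its gradient is scaled.
        g = 1.0 / math.sqrt(X.numel() * quant_max)
        scale = (scale - scale * g).detach() + scale * g
    if ch_axis != -1:
        # Per-channel: reshape scale/zero_point to broadcast along ch_axis.
        shape = [1] * X.dim()
        shape[ch_axis] = -1
        scale = scale.reshape(shape)
        zero_point = zero_point.reshape(shape)
    x = X / scale + zero_point
    x = (x.round() - x).detach() + x  # straight-through estimator for rounding
    x = torch.clamp(x, quant_min, quant_max)
    return (x - zero_point) * scale
```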
- After adding your quantizer, the next step is to register it in `prototype/quantization/qconfig.py`: import your quantizer, add it to the `get_qconfig` function, and parse the necessary arguments.
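For example, a hypothetical registration could look like the sketch below (illustrative names throughout; the real `get_qconfig` signature may differ):

```python
# Hypothetical sketch of wiring a new quantizer into qconfig.py
# (the real code in prototype/quantization/qconfig.py may differ).
from functools import partial

from .my_quantizer import MyQuantizer  # assumed module for your quantizer

def get_weight_quantizer(w_method, bit, **backend_params):
    """Return a factory for the weight fake-quantizer selected by w_method."""
    if w_method == "my_method":
        return partial(MyQuantizer, bitwidth=bit, **backend_params)
    raise ValueError(f"unknown w_method: {w_method}")
```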
The final step is to override a `config.yaml` file:

```yaml
qparams:
    w_method: lsq
    a_method: lsq
    bit: 4
    backend: academic
    bnfold: 4
```
By replacing `w_method` and `a_method`, you can run your own implementation.
Note: the rest of the config file should not be modified in order to keep a unified training setting.
How to self-implement a hardware configuration
Adding a new hardware setting is much simpler than adding an algorithm. To do this, we can add another condition to the `if-else` selection. For example, to add a new hardware backend, TFLite Micro:
```python
elif backend == "tflitemicro":
    backend_params = dict(ada_sign=False, symmetry=True, per_channel=False, pot_scale=True)
...
model_qconfig = get_qconfig(**self.qparams, **backend_params)
model = quantize_fx.prepare_qat_fx(model, {"": model_qconfig}, foldbn_config)
```
Submitting Your Results to MQBench
You can submit your implementation to MQBench by submitting a merge request to this repo. The implementation of the new algorithm, the running scripts, and the log files are needed for evaluation.
License
This project is under the Apache 2.0 License.