An efficient and easy-to-use deep learning model compression framework

Alibaba

Last update: Dec 25, 2022


Overview

TinyNeuralNetwork

TinyNeuralNetwork is an efficient and easy-to-use deep learning model compression framework with features such as neural architecture search, pruning, quantization, and model conversion. It has been used for deployment on devices such as Tmall Genie, Haier TV, Youku video, and face-recognition check-in machines, equipping over 10 million IoT devices with AI capability.

Installation

Python >= 3.6, PyTorch >= 1.4 (PyTorch >= 1.6 if quantization-aware training is involved)

# Install the TinyNeuralNetwork framework
git clone https://github.com/alibaba/TinyNeuralNetwork.git
cd TinyNeuralNetwork
python setup.py install

# Alternatively, you may try the one-liner
pip install git+https://github.com/alibaba/TinyNeuralNetwork.git
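The version requirements stated above can be checked before installing. The sketch below uses hypothetical helpers (`parse_version`, `meets_requirements`; they are not part of TinyNeuralNetwork) that compare version strings numerically, so that, for example, PyTorch 1.10 correctly counts as newer than 1.6, which a plain string comparison would get wrong:

```python
import re


def parse_version(text):
    """Turn a version string such as '1.10.2+cu113' into a tuple of ints."""
    core = text.split("+")[0]  # drop local build tags such as '+cu113'
    parts = []
    for piece in core.split(".")[:3]:
        match = re.match(r"\d+", piece)  # keep leading digits only ('0rc1' -> 0)
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)


def meets_requirements(python_version, torch_version, use_qat=False):
    """Python >= 3.6 and PyTorch >= 1.4 (>= 1.6 when QAT is involved)."""
    min_torch = (1, 6) if use_qat else (1, 4)
    return (parse_version(python_version)[:2] >= (3, 6)
            and parse_version(torch_version)[:2] >= min_torch)
```

For example, `meets_requirements(platform.python_version(), torch.__version__, use_qat=True)` checks the current environment before attempting quantization-aware training.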

Basic modules

Computational graph capture: The Graph Tracer in TinyNeuralNetwork captures the connectivity of PyTorch operators, which enables automated pruning and model quantization. It also supports code generation from PyTorch models to equivalent model description files (e.g. models.py).
Dependency resolving: Modifying an operator often causes mismatches within its subgraph, i.e. with other dependent operators. The Graph Modifier in TinyNeuralNetwork handles these mismatches automatically, both within and between subgraphs, so that the computational graph can be modified automatically.
Pruner: OneShot (L1, L2, FPGM), ADMM, NetAdapt, Gradual, End2End and other pruning algorithms have been implemented and will be open-sourced gradually.
Quantization-aware training: TinyNeuralNetwork uses PyTorch's QAT as the backend (simulated bfloat16 training is also supported) and improves its usability by automating operator fusion and the quantization of computational graphs (the official implementation requires the user to do this manually, which is a huge workload).
Model conversion: TinyNeuralNetwork supports conversion of floating-point and quantized PyTorch models to TFLite models for end-to-end deployment.
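The Pruner entry lists OneShot L1/L2 criteria. As a rough, library-independent sketch of that magnitude criterion (not TinyNeuralNetwork's actual pruner code), the snippet below ranks a convolution's output channels by the Lp norm of their filters and keeps the largest ones. The function names are hypothetical, and weights are plain nested lists shaped [out_channels][flattened filter] for illustration only:

```python
import math


def channel_norms(weight, p=2):
    """Lp norm of each output channel's flattened filter."""
    return [sum(abs(w) ** p for w in filt) ** (1.0 / p) for filt in weight]


def select_channels(weight, sparsity, p=2):
    """Return sorted indices of the output channels to keep.

    The channels with the smallest norms are dropped until the fraction
    `sparsity` is removed; at least one channel is always kept.
    """
    n = len(weight)
    n_keep = max(1, n - int(math.floor(n * sparsity)))
    norms = channel_norms(weight, p)
    # sorted() is stable, so ties keep their original channel order.
    ranked = sorted(range(n), key=lambda i: norms[i], reverse=True)
    return sorted(ranked[:n_keep])


def prune_output_channels(weight, bias, keep):
    """Drop the pruned filters from a layer's weight and bias."""
    return [weight[i] for i in keep], [bias[i] for i in keep]
```

In a real network, removing output channels of one layer also requires slicing the input channels of every layer that consumes them, which is the kind of bookkeeping the Graph Modifier described above automates.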

Project architecture

examples: Provides examples of each module
models: Provides pre-trained models for a quick start
tests: Unit tests
tinynn: Code for model compression
- graph: Foundation for computational graph capture, resolving, quantization, code generation, mask management, etc.
- prune: Pruning algorithms
- converter: Model converter
- util: Utility classes

RoadMap

Nov. 2021: A new pruner with adaptive sparsity
Dec. 2021: Model compression for Transformers

Frequently Asked Questions

Because PyTorch is highly complex and frequently updated, we cannot ensure that all cases are covered by automated testing. If you encounter problems, you can check out the FAQ or join the Q&A group in DingTalk via the QR code in the project README.

Comments

Add extra output for inference

Hi Author,

I need to add some extra output tensors that are used only for inference. These tensors are not referenced during training, only during inference after the conversion to TFLite. My idea is to put as many operations as possible into forward, to reduce the post-processing work that otherwise has to be implemented in C/C++.

For instance, some matrix ops such as reshape/sigmoid/multiply are better done on the GPU/NPU than on the CPU.

I added some logic in forward to implement this. Training goes well, but the conversion to TFLite fails with the following error message:

File "/usr/local/lib/python3.6/dist-packages/torch/nn/quantized/modules/functional_modules.py", line 160, in mul
    r = ops.quantized.mul(x, y, scale=self.scale, zero_point=self.zero_point)
RuntimeError: Mul operands should have same data type.

Is there any feasible way or workaround for this scenario? The script is attached. Thanks. movenet_qat.zip
question

opened by liamsun2019 44
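The error message suggests that one operand of the quantized `mul` was not a quantized tensor of the same dtype as the other, for example a float tensor produced by the extra inference-only branch. The sketch below is a simplified per-tensor uint8 model, not PyTorch's kernel, and its function names are hypothetical; it shows why a quantized multiply needs both operands quantized: the integer result is built from both operands' (scale, zero_point) pairs, which a float operand does not have:

```python
def quantize(x, scale, zero_point, qmin=0, qmax=255):
    """Affine-quantize a real value to the uint8 range (round half to even)."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Map a quantized integer back to a real value."""
    return scale * (q - zero_point)


def quantized_mul(q1, p1, q2, p2, out_params, qmin=0, qmax=255):
    """Multiply two quantized values and requantize to `out_params`.

    Each of p1, p2 and out_params is a (scale, zero_point) pair.
    """
    (s1, z1), (s2, z2), (s3, z3) = p1, p2, out_params
    # Integer product of the zero-point-shifted operands.
    acc = (q1 - z1) * (q2 - z2)
    # Rescale by s1 * s2 / s3 (real kernels use fixed-point math here).
    q3 = round(acc * (s1 * s2 / s3)) + z3
    return max(qmin, min(qmax, q3))
```

Under this reading, the usual options are to keep the extra branch inside the quantized region, so both operands carry quantization parameters, or to dequantize before the float-only operations; which one fits depends on where the branch sits in the model.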
Not able to use converter.py to convert a PyTorch MobileNet model to an int8-quantized TFLite model using Colab

Please see the Colab notebook below, which I am using to convert MobileNetV2 from PyTorch to TFLite (int8): https://colab.research.google.com/drive/1eW-I0RDzB3L6Zbz364t5lkI4fxgvpGbI#scrollTo=5YtQg5Ga2wmq

I am getting the errors below:

Traceback (most recent call last):
  File "./examples/converter/convert.py", line 9, in <module>
    from examples.models.cifar10.mobilenet import DEFAULT_STATE_DICT, Mobilenet
ModuleNotFoundError: No module named 'examples.models'
enhancement

opened by nyadla-sys 31
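When a script is launched by path, as in `./examples/converter/convert.py`, Python puts the script's own directory (here `examples/converter`), not the current working directory, at `sys.path[0]`, so the top-level `examples` package cannot be found. A minimal workaround, assuming the repository layout shown in the traceback, is to put the repository root on `sys.path` before the failing import. Both helpers below are hypothetical:

```python
import os
import sys


def repo_root_from(script_path):
    """Return the directory two levels above the script's own directory.

    For <root>/examples/converter/convert.py this yields <root>.
    """
    script_dir = os.path.dirname(os.path.abspath(script_path))
    return os.path.dirname(os.path.dirname(script_dir))


def ensure_on_path(directory):
    """Prepend `directory` to sys.path once, so packages under it import."""
    if directory not in sys.path:
        sys.path.insert(0, directory)
```

A typical call at the top of the script would be `ensure_on_path(repo_root_from(__file__))`, placed before the `from examples.models.cifar10.mobilenet import ...` line. Alternatively, running the script from the repository root with `PYTHONPATH=.` set avoids editing the file.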
Open AI Whisper pytorch model to tflite

@peterjc123 Could you please help us convert the tiny English PyTorch model below to a TFLite model? A Colab notebook would help me a lot. https://openaipublic.azureedge.net/main/whisper/models/d3dd57d32accea0b295c96e26691aa14d8822fac7d9d27d5dc00b4ca2826dd03/tiny.en.pt

Thanks in advance.
question awaiting response

opened by nyadla-sys 30
For an int8 QAT model, the dequantized weights/bias are different from the original model
Hi, sorry to bother you. This issue may not be directly related to tinynn. My scenario is as follows:

I perform int8 QAT based on an official float32 model.

To fine-tune only some layers, I set requires_grad to False for all the other layers.

The training goes normally and the conversion to TFLite also succeeds.

I use tf2onnx to dequantize the generated TFLite model and check the weights/bias. The weights/bias of the frozen layers are quite different from those of the original model.

My goal is to keep the weights/bias of the frozen layers the same as in the original model after dequantization. I'm not sure if that's feasible. The official float32 model performs well in most cases and I just want to fine-tune a few layers. Meanwhile, considering the inference speed on some edge devices, I have to perform QAT, since post-training quantization gives much worse results.

Thanks for your time.
question
opened by liamsun2019 24
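Some difference is expected even for frozen layers. Stored int8 weights are the float weights rounded to a grid with step `scale`, so after dequantization each weight can only be recovered to within about half a step (with per-channel quantization the step differs per channel, but the effect is the same). The sketch below illustrates this in plain Python with a per-tensor symmetric int8 scheme and made-up weights; the scheme is an assumption for illustration, not necessarily the one used in this model:

```python
def symmetric_int8_scale(values):
    """Per-tensor symmetric scale that maps max |w| to 127 (zero_point = 0)."""
    max_abs = max(abs(v) for v in values)
    return max_abs / 127.0 if max_abs > 0 else 1.0


def fake_quantize(values, scale, qmin=-128, qmax=127):
    """Round each value to the int8 grid, then map it back to float."""
    return [max(qmin, min(qmax, round(v / scale))) * scale for v in values]


weights = [0.31, -0.027, 0.0049, -0.5, 0.12]  # made-up example weights
scale = symmetric_int8_scale(weights)
restored = fake_quantize(weights, scale)
# Every weight moves by at most half a quantization step (scale / 2).
max_err = max(abs(w - r) for w, r in zip(weights, restored))
```

Two other possible contributors, independent of rounding: setting `requires_grad` to False does not stop BatchNorm running statistics from updating in training mode, and if BatchNorm is fused into the preceding convolution, the exported weights are the folded ones rather than the original convolution weights.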
outputs are different between a QAT tflite and corresponding de-quantized onnx model
I have a QAT int8 per-channel TFLite model. To check its accuracy, I compare its inference results with those of the dequantized ONNX model:

python3.6 -m tf2onnx.convert --opset 11 --tflite test.tflite --output temp.onnx --dequantize
python3.6 -m onnxsim temp.onnx test.onnx
# test.onnx then holds float32 weights/bias

I run test.onnx and test.tflite separately and compare the inference results on the same input. There are big differences between the results; the ONNX model gives much better results.

I'm not very clear about the inference flow of a QAT TFLite model, so I'm not sure whether this is normal. I expected it to give results similar to those of the ONNX model.

Attached are the ONNX and TFLite models for reference. test.zip
question
opened by liamsun2019 19
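Assuming the `--dequantize` option turns the quantized weights back into float and drops the quantize/dequantize steps, the ONNX model keeps the int8-rounded weights but runs all activations in float32, skipping the activation quantization the TFLite model performs. Some gap is therefore expected; the useful question is how large it is and where it starts. The hypothetical helper below (plain Python, taking flattened outputs as lists) reports a few common metrics; running it on intermediate tensors too can show the layer where the gap begins to grow:

```python
import math


def compare_outputs(reference, candidate):
    """Summarize how far `candidate` is from `reference` (flat float lists)."""
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same number of elements")
    diffs = [abs(r - c) for r, c in zip(reference, candidate)]
    dot = sum(r * c for r, c in zip(reference, candidate))
    norm_r = math.sqrt(sum(r * r for r in reference))
    norm_c = math.sqrt(sum(c * c for c in candidate))
    # Cosine similarity is undefined when either output is all zeros.
    cosine = dot / (norm_r * norm_c) if norm_r and norm_c else float("nan")
    return {
        "max_abs_diff": max(diffs),
        "mean_abs_diff": sum(diffs) / len(diffs),
        "cosine": cosine,
        "worst_index": max(range(len(diffs)), key=diffs.__getitem__),
    }
```

A cosine similarity well below 1 on the final output points to a structural problem rather than ordinary quantization noise, which usually shows up as a small, evenly spread absolute difference.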
RuntimeError raised when converting a QAT model to TFLite
Hi, my QAT training goes well based on your sample code, but the conversion to TFLite fails with the following error message:

RuntimeError: Quantized copy only works with contiguous Tensors

According to the stack trace, the error comes from converter.convert().

My code is almost the same as the sample code:

import torch
from tinynn.converter import TFLiteConverter

print("Start converting the model to TFLite")
with torch.no_grad():
    qat_model.eval()
    qat_model.cpu()

    # The step below converts the model to an actual quantized model, which uses the quantized kernels.
    qat_model = torch.quantization.convert(qat_model)

    # When converting quantized models to TFLite, please ensure the quantization backend is QNNPACK.
    torch.backends.quantized.engine = 'qnnpack'

    # The code section below is used to convert the model to the TFLite format
    converter = TFLiteConverter(qat_model, dummy_input, tflite_path='out/qat_model.tflite')
    converter.convert()

By the way, I am curious how the trained QAT model affects the conversion. It looks like the conversion could be done even without training. I'm not familiar with this and would appreciate your help. Thanks.
question
opened by liamsun2019 19
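On the question of why training matters: the converted model takes its scale and zero point from the observers attached during QAT, together with the trained weights. Converting an untrained QAT model can therefore still succeed, but without training (or at least calibration) those parameters do not reflect the real activation ranges. The sketch below shows the kind of min/max computation such an observer performs; it is a simplified illustration with a hypothetical function name, not PyTorch's `MinMaxObserver` code:

```python
def affine_qparams(min_val, max_val, qmin=0, qmax=255):
    """Asymmetric (scale, zero_point) from an observed [min_val, max_val] range."""
    # Extend the range to include 0 so that zero is exactly representable.
    min_val = min(min_val, 0.0)
    max_val = max(max_val, 0.0)
    if max_val == min_val:
        return 1.0, 0
    scale = (max_val - min_val) / float(qmax - qmin)
    zero_point = qmin - round(min_val / scale)
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, int(zero_point)
```

Training moves the observed ranges, and hence these parameters, toward values that suit real data.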
Any chance to support int16 quantization for QAT and mixed QAT?

There is a kind of quantization that quantizes models with dynamic fixed-point int16. Is there any chance of supporting int16 quantization for QAT and mixed QAT?
enhancement

opened by steven0129 15
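As the term is commonly used, dynamic fixed point stores each tensor as plain integers with a power-of-two scale `2**-fl`, where the fractional length `fl` is chosen per tensor from its range. The sketch below illustrates only that representation for 16 bits; the function names are hypothetical, it is not a TinyNeuralNetwork feature, and it says nothing about int16 kernel support in TFLite:

```python
import math


def dfp_fraction_bits(values, bits=16):
    """Pick the fractional length for a signed `bits`-wide fixed-point format."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return bits - 1
    # floor(log2(x)) + 1 magnitude bits cover [2**k, 2**(k+1)); +1 for the sign.
    int_bits = math.floor(math.log2(max_abs)) + 2
    return bits - int_bits


def dfp_quantize(values, bits=16):
    """Quantize to dynamic fixed point; returns (integers, fraction_bits)."""
    frac = dfp_fraction_bits(values, bits)
    qmin, qmax = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    step = 2.0 ** frac
    return [max(qmin, min(qmax, round(v * step))) for v in values], frac


def dfp_dequantize(ints, frac):
    """Map fixed-point integers back to floats."""
    return [q / 2.0 ** frac for q in ints]
```

Because the scale is a power of two, rescaling between tensors reduces to bit shifts, which is the usual motivation for this format on fixed-point hardware.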
Converting LiteHRNet pytorch model to TFLite, outputs don't match

Hi, this is really great work, thanks!

I am able to convert the LiteHRNet model to TFLite without running into any issues. However, the outputs don't match up.

Here is the output from sending ones through the network. The output has shape [1,17,96,72]; here I am only showing output[0,0,0] from both PyTorch and TFLite:

pytorch
array([6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 1.8188522e-04, 1.7515068e-04, 1.9644469e-04, 1.6027213e-04, 1.9049855e-04, 1.5419864e-04, 1.2460010e-04, 9.0751186e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05], dtype=float32)

tflite
array([6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 1.1580180e-04, 2.3429818e-04, 3.9018277e-04, 7.7823577e-03, 1.8948119e-02, 2.8559987e-02, 3.3612434e-02, 2.5932681e-02, 1.2074142e-02, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05, 6.4367385e-05], dtype=float32)

When I convert to TFLite via the ONNX route, the outputs do match. So my guess is that some of the transposes/reshapes for NHWC are not happening correctly, but I am not sure. What would be the best way to debug this?

Models: LiteHRNet pt trace LiteHRNet tiny tflite
bug

opened by simbara 15
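PyTorch tensors are laid out as NCHW, while TFLite convolution ops use NHWC, so any reshape or view that crosses the layout boundary must see the data in the order PyTorch would. The pattern above (similar values appearing at shifted positions) is consistent with a missing or misplaced transpose, though other causes are possible. The plain-Python sketch below, with hypothetical helper names and tiny shapes, shows how the flat offset of the same element differs between the two layouts:

```python
def nchw_index(n, c, h, w, C, H, W):
    """Flat offset of element (n, c, h, w) in a contiguous NCHW buffer."""
    return ((n * C + c) * H + h) * W + w


def nhwc_index(n, c, h, w, C, H, W):
    """Flat offset of the same element in a contiguous NHWC buffer."""
    return ((n * H + h) * W + w) * C + c


def nchw_to_nhwc(buf, N, C, H, W):
    """Reorder a flat NCHW buffer into NHWC order (what a transpose does)."""
    out = [None] * len(buf)
    for n in range(N):
        for c in range(C):
            for h in range(H):
                for w in range(W):
                    src = nchw_index(n, c, h, w, C, H, W)
                    out[nhwc_index(n, c, h, w, C, H, W)] = buf[src]
    return out
```

For a 1x3x2x2 tensor holding 0..11, the NHWC buffer starts `[0, 4, 8, ...]`; reading it with NCHW offsets (reshaping without the inverse transpose) would put 4 where channel 0 expects 1. Comparing PyTorch and TFLite intermediate outputs layer by layer, with the layout taken into account, helps locate the first op where such a reordering goes wrong.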

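Error from `avg_pool3d` & `Conv3D` in 3D_CNN

Great repo! However, when I tried to use convert.py to convert a 3D-CNN classification model trained in PyTorch into a .tflite model, errors occurred:

Issue 1: x = F.avg_pool3d(x, x.data.size()[-3:]) is not supported

ERROR (tinynn.converter.base) Unsupported ops: aten::avg_pool3d

Issue 2: I squeezed out the temporal dimension, but an error still occurred: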

Traceback (most recent call last):
  File "/home/leovin/Dynamic Gesture Detection/3D-CNNs/TinyNeuralNetwork/examples/converter/convert_for_3DCNNs.py", line 50, in <module>
    main_worker()
  File "/home/leovin/Dynamic Gesture Detection/3D-CNNs/TinyNeuralNetwork/examples/converter/convert_for_3DCNNs.py", line 46, in main_worker
    converter.convert()
  File "/home/leovin/anaconda3/envs/dgr/lib/python3.9/site-packages/TinyNeuralNetwork-0.1.20220310142810-py3.9.egg/tinynn/converter/base.py", line 347, in convert
    optimizer.optimize()
  File "/home/leovin/anaconda3/envs/dgr/lib/python3.9/site-packages/TinyNeuralNetwork-0.1.20220310142810-py3.9.egg/tinynn/converter/operators/optimize.py", line 1377, in optimize
    self.transform_graph()
  File "/home/leovin/anaconda3/envs/dgr/lib/python3.9/site-packages/TinyNeuralNetwork-0.1.20220310142810-py3.9.egg/tinynn/converter/operators/optimize.py", line 210, in transform_graph
    op.transform(self.graph, mapping)
  File "/home/leovin/anaconda3/envs/dgr/lib/python3.9/site-packages/TinyNeuralNetwork-0.1.20220310142810-py3.9.egg/tinynn/converter/operators/tflite/transformable.py", line 235, in transform
    assert False, "Only Conv[Transpose]1d/2d is supported"
AssertionError: Only Conv[Transpose]1d/2d is supported

Process finished with exit code 1

Only Conv[Transpose]1d/2d is supported: is this because Conv3d is not supported?

Issue 3: If Conv3d is supported, what should I do?

Looking forward to your reply!

enhancement question awaiting response

opened by Le0v1n 14

Unsupported ops: argmax, expand, gather

Thanks for your excellent work. It will definitely help when converting models from PyTorch to TFLite.

When I tried to convert MoveNet (https://github.com/lee-man/movenet-pytorch) to TFLite using the tinynn converter, the log reported some unsupported ops: argmax, expand and gather. I hope you can support these ops later :)
enhancement

opened by lee-man 14

OneShotChannelPruner fails on Interpolate layer

Hello author. I'm testing some encoder-decoder models. After some experiments, it seems that the implemented pruner does not work with the interpolate layer.

from tinynn.prune import OneShotChannelPruner
import torch
from torch import Tensor
from torch import nn
from torch.nn import functional as F

class Interpolate(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        return F.interpolate(x, (24, 24), mode='bilinear', align_corners=False)

class Conv2d_Interpolate(nn.Module):
    def __init__(self):
        super().__init__()

        self.conv2d = nn.Conv2d(3, 24, 3, padding=1)
        self.interpolate = Interpolate()

    def forward(self, inTensor: Tensor):
        x = self.conv2d(inTensor)
        x = self.interpolate(x)
        return x

model = Conv2d_Interpolate()

dummy_input = torch.rand((1, 3, 192, 192))

_ = model(dummy_input)

pruner = OneShotChannelPruner(model, dummy_input, {"sparsity": {"default": 0.25}, "metrics": "l2_norm"})

Running the code above gives the error below:

INFO (tinynn.graph.modifier) Start tracking tensor dimension changes...
ERROR (tinynn.graph.modifier) error modifier = interpolate_0_f, type = <class 'tinynn.graph.tracer.TraceFunction'>, kind = interpolate
...
RuntimeError: The size of tensor a (24) must match the size of tensor b (192) at non-singleton dimension 3

Does OneShotChannelPruner currently not support the interpolate layer?

enhancement awaiting response

opened by anawkward 11

[converter] support new PyTorch operators
Below are the PyTorch operators that are yet to be supported.

Unclassified (New)

N/A

Primitives (Python operators)

[x] aten::len

Very easy (Constant generation or aliasing)

[x] aten::clamp_min

[x] aten::clamp_max

[x] aten::expand_as

Easy (Direct mapping)

Medium (Composite of multiple operators)

[x] aten::im2col https://github.com/alibaba/TinyNeuralNetwork/issues/69

[x] aten::col2im https://github.com/alibaba/TinyNeuralNetwork/issues/69

[x] aten::mish

[x] aten::group_norm

[ ] torchvision::nms https://github.com/alibaba/TinyNeuralNetwork/issues/16

Hard (No mapping or the mapping is too complex)

[ ] aten::grid_sample https://github.com/alibaba/TinyNeuralNetwork/issues/69

[ ] quantized::instance_norm

[ ] quantized::layer_norm

enhancement
opened by peterjc123 3
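The "Medium" entries are operators with no single TFLite counterpart that can still be expressed with several primitive ones. As a framework-independent illustration of what such a decomposition means (not the converter's actual lowering; the helper names are hypothetical and GroupNorm's affine weight/bias are omitted), `aten::mish` and `aten::group_norm` can be written in terms of elementwise, reduction, and arithmetic steps:

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x):
    """mish(x) = x * tanh(softplus(x)), built from simpler primitives."""
    return x * math.tanh(softplus(x))


def group_norm(channels, num_groups, eps=1e-5):
    """GroupNorm without affine parameters.

    `channels` is a list of C channels, each a flat list of H*W values.
    Consecutive channels form `num_groups` groups, and each group is
    normalized with its own mean and biased variance.
    """
    C = len(channels)
    if C % num_groups:
        raise ValueError("number of channels must be divisible by num_groups")
    size = C // num_groups
    out = []
    for g in range(num_groups):
        group = channels[g * size:(g + 1) * size]
        flat = [v for ch in group for v in ch]  # reduce over channels and space
        mean = sum(flat) / len(flat)
        var = sum((v - mean) ** 2 for v in flat) / len(flat)
        denom = math.sqrt(var + eps)
        out.extend([[(v - mean) / denom for v in ch] for ch in group])
    return out
```

In a converter, each line of these functions would map to one or more primitive ops (exp, log, tanh, mean, sub, mul, rsqrt), which is what makes the operator "composite" rather than a direct mapping.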

QAT causes conversion errors

Sorry to bother you again. When I use QAT on YOLOv5, the converted model shows significant errors. I compared the two models layer by layer and found that the errors start from the very first layer, and I have no idea how to fix it.

The generated script is here: autoshape_qat.zip

device = "cpu"
dummy_input = torch.rand(1, 3, 640, 640)
model = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True)

for k, m in model.named_modules():
    if isinstance(m, Detect):
        m.inplace = False
        m.onnx_dynamic = True
        m.export = True

quantizer = QATQuantizer(model, dummy_input, work_dir='out', config={'asymmetric': True, 'per_tensor': True})
qat_model = quantizer.quantize()
qat_model.to(device=device)

with torch.no_grad():  
    qat_model.cpu()
    qat_model = torch.quantization.convert(qat_model)
    torch.backends.quantized.engine = quantizer.backend
    convert_and_compare(qat_model, './output/qat_model.tflite', dummy_input)

question

opened by Raychen0617 13

A PTQ tflite model fails to pass benchmark test

My use case: I apply post-training quantization to a .pth model and convert it to TFLite. The generated TFLite model fails the benchmark test with the following error message:

STARTING!
Log parameter values verbosely: [0]
Graph: [out/ptq_model.tflite]
Loaded model out/ptq_model.tflite
ERROR: tensorflow/lite/kernels/concatenation.cc:179 t->params.scale != output->params.scale (3 != -657359264)
ERROR: Node number 154 (CONCATENATION) failed to prepare.
Failed to allocate tensors!
Benchmarking failed.

Please refer to the attachment. Thanks. test.zip
bug

opened by liamsun2019 5
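The failing check in `concatenation.cc` compares each input's scale against the output's, which suggests that the quantized CONCATENATION kernel in this TFLite build expects all inputs to share the output's quantization parameters. One common remedy is to give the concat inputs a single shared range. The plain-Python sketch below uses hypothetical names and is not a TinyNeuralNetwork API; it computes int8 parameters from the union of the observed input ranges and requantizes a value into them:

```python
def affine_qparams(min_val, max_val, qmin=-128, qmax=127):
    """Asymmetric int8 (scale, zero_point) covering [min_val, max_val] and 0."""
    min_val, max_val = min(min_val, 0.0), max(max_val, 0.0)
    if max_val == min_val:
        return 1.0, 0
    scale = (max_val - min_val) / float(qmax - qmin)
    zero_point = int(max(qmin, min(qmax, qmin - round(min_val / scale))))
    return scale, zero_point


def shared_concat_qparams(input_ranges, qmin=-128, qmax=127):
    """One (scale, zero_point) for every concat input and the output.

    `input_ranges` holds the observed (min, max) of each input; the shared
    parameters cover the union of all of them.
    """
    lo = min(r[0] for r in input_ranges)
    hi = max(r[1] for r in input_ranges)
    return affine_qparams(lo, hi, qmin, qmax)


def requantize(q, src, dst, qmin=-128, qmax=127):
    """Map an int8 value from (scale, zero_point) `src` to `dst`."""
    real = src[0] * (q - src[1])
    return int(max(qmin, min(qmax, round(real / dst[0]) + dst[1])))
```

Tying the observers of the concat inputs to one range during quantization achieves the same effect without an explicit requantize step, at the cost of some resolution for inputs with a narrower range.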
oneshotpruner fails when possible channel padding exists

Hi author,

For a certain model, oneshotpruner fails when there is channel padding, with the following error message:

ERROR (tinynn.graph.modifier) All node's sparsity in one subgraph must be the same

Please let me know if you need the model description file. Thanks.
bug

opened by liamsun2019 2
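The error message reflects the constraint behind the Dependency resolving module described in the README: layers that share channels, for example two convolutions whose outputs are added together, must be pruned by the same ratio. A minimal sketch of that bookkeeping, using hypothetical node names and plain union-find rather than TinyNeuralNetwork's Graph Modifier, groups dependent nodes and reports groups with mixed sparsity:

```python
def find(parent, x):
    """Union-find lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def group_dependent_nodes(nodes, dependencies):
    """Group nodes whose output channels must be pruned together.

    `dependencies` lists (a, b) pairs of nodes that share channels.
    """
    parent = {n: n for n in nodes}
    for a, b in dependencies:
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            parent[rb] = ra
    groups = {}
    for n in nodes:
        groups.setdefault(find(parent, n), []).append(n)
    return list(groups.values())


def check_uniform_sparsity(groups, sparsity):
    """Return the groups whose members were assigned different sparsities."""
    return [g for g in groups if len({sparsity[n] for n in g}) > 1]
```

If an operator such as channel padding links nodes that were expected to be independent, a check like this would flag their group, which matches the kind of failure reported here.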

Owner

Alibaba

Alibaba Open Source

GitHub

Revisiting Discriminator in GAN Compression: A Generator-discriminator Cooperative Compression Scheme (NeurIPS2021)

Revisiting Discriminator in GAN Compression: A Generator-discriminator Cooperative Compression Scheme (NeurIPS2021) Overview Prerequisites Linux Pytho

34 Mar 31, 2022

Easy Parallel Library (EPL) is a general and efficient deep learning framework for distributed model training.

English | 简体中文 Easy Parallel Library Overview Easy Parallel Library (EPL) is a general and efficient library for distributed model training. Usability

185 Dec 21, 2022

UMEC: Unified Model and Embedding Compression for Efficient Recommendation Systems

[ICLR 2021] "UMEC: Unified Model and Embedding Compression for Efficient Recommendation Systems" by Jiayi Shen, Haotao Wang*, Shupeng Gui*, Jianchao Tan, Zhangyang Wang, and Ji Liu

39 Dec 3, 2022

MMRazor: a model compression toolkit for model slimming and AutoML

Documentation: https://mmrazor.readthedocs.io/ English | 简体中文 Introduction MMRazor is a model compression toolkit for model slimming and AutoML, which

899 Jan 2, 2023

A Lighting Pytorch Framework for Recommendation System, Easy-to-use and Easy-to-extend.

Torch-RecHub A Lighting Pytorch Framework for Recommendation Models, Easy-to-use and Easy-to-extend. Installation pip install torch-rechub Main features scikit-learn-style, easy to use

67 Jan 4, 2023

A collection of easy-to-use, ready-to-use, interesting deep neural network models

Interesting and reproducible research works should be conserved. This repository wraps a collection of deep neural network models into a simple and un

16 Jun 16, 2022

This is the pytorch implementation for the paper: Learning Accurate Performance Predictors for Ultrafast Automated Model Compression, which is in submission to TPAMI

SeerNet This is the pytorch implementation for the paper: Learning Accurate Performance Predictors for Ultrafast Automated Model Compression, which is

3 May 1, 2022

Implementation of "A Deep Learning Loss Function based on Auditory Power Compression for Speech Enhancement" by pytorch

This repository is used to suspend the results of our paper "A Deep Learning Loss Function based on Auditory Power Compression for Speech Enhancement"

19 Sep 30, 2022

DeepSpeed is a deep learning optimization library that makes distributed training easy, efficient, and effective.

8.4k Jan 1, 2023

High performance, easy-to-use, and scalable machine learning (ML) package, including linear model (LR), factorization machines (FM), and field-aware factorization machines (FFM) for Python and CLI interface.

What is xLearn? xLearn is a high performance, easy-to-use, and scalable machine learning package that contains linear model (LR), factorization machin

3k Jan 3, 2023

High performance, easy-to-use, and scalable machine learning (ML) package, including linear model (LR), factorization machines (FM), and field-aware factorization machines (FFM) for Python and CLI interface.

What is xLearn? xLearn is a high performance, easy-to-use, and scalable machine learning package that contains linear model (LR), factorization machin

2.8k Feb 12, 2021


DA2Lite is an automated model compression toolkit for PyTorch.

Pytorch implementation for Patient Knowledge Distillation for BERT Model Compression

Spatio-Temporal Entropy Model (STEM) for end-to-end leaned video compression.

Pytorch implementation of COIN, a framework for compression with implicit neural representations 🌸

A Pytorch Implementation of a continuously rate adjustable learned image compression framework.

A lossless neural compression framework built on top of JAX.

Deep Compression for Dense Point Cloud Maps.

Official Pytorch implementation for Deep Contextual Video Compression, NeurIPS 2021
