PPQ - A powerful offline neural network quantization tool with customized IR

Overview

PPL Quantization Tool (PPL 量化工具)

PPL Quantization Tool (PPQ) is a powerful offline neural network quantization tool with a customized IR, executor, dispatcher, and optimization passes.

Features

  • Quantable graph, a quantization-oriented network representation.
  • Quantize with CUDA: quantization simulation runs 3x ~ 50x faster than PyTorch.
  • Hardware-friendly: simulated calculations are mostly identical to the hardware results.
  • Multi-platform support.

Installation

To unlock the full power of this quantization tool, at least one CUDA computing device is required. Install CUDA from the CUDA Toolkit; PPL Quantization Tool uses the CUDA compiler to compile its CUDA kernels at runtime.

ATTENTION: For PyTorch users, PyTorch may ship with a minimized set of CUDA libraries that does not satisfy the requirements of this tool; you have to install CUDA from NVIDIA manually.

ATTENTION: Make sure your Python version is >= 3.6.0. PPL Quantization Tool uses language features that require Python >= 3.6.0.

  • Install from source:
  1. Run the following commands in your terminal (for Windows users, use the command line instead).
git clone https://github.com/openppl-public/ppq.git
cd ppq
python setup.py install
  2. Wait for Python to finish the installation.
  • Install from wheel:
  1. Download a compiled Python wheel from the following link: PPL Quantization Tool
  2. Run the following command in your terminal or command line (Windows): "pip install ppq.wheel".
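
After installation, a quick sanity check can confirm that both PPQ and a CUDA device are visible to Python. This is a minimal sketch; it only assumes that PyTorch is installed alongside PPQ:

# Minimal post-install sanity check (assumes PyTorch is installed alongside PPQ).
import torch
import ppq

print('PPQ imported from :', ppq.__file__)
print('CUDA available    :', torch.cuda.is_available())  # PPQ's CUDA kernels need a CUDA device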

Tutorials and Examples

  1. The user guide and system design docs can be found under /doc/pages/instructions in this repository; PPL Quantization Tool documents are written in pure HTML5.
  2. Examples can be found at /ppq/samples.
  3. Let's quantize your network with the following code:
from ppq.api import export_ppq_graph, quantize_torch_model
from ppq import TargetPlatform

# quantize your model within one single line:
quantized = quantize_torch_model(
    model=model, calib_dataloader=calibration_dataloader,
    calib_steps=32, input_shape=(1, 3, 224, 224),
    setting=quant_setting, collate_fn=collate_fn,
    platform=TargetPlatform.PPL_CUDA_INT8,
    device=DEVICE, verbose=0)

# export quantized graph with another line:
export_ppq_graph(
    graph=quantized, platform=TargetPlatform.PPL_CUDA_INT8,
    graph_save_to='Output/quantized(onnx).onnx',
    config_save_to='Output/quantized(onnx).json')
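
The snippet above assumes that model, calibration_dataloader, collate_fn, quant_setting and DEVICE are already defined. A minimal sketch of one way to prepare them (the torchvision model and the random calibration batches are illustrative assumptions, not part of PPQ):

import torch
import torchvision
from ppq.api import QuantizationSettingFactory

DEVICE = 'cuda'                                   # PPQ's CUDA kernels require a CUDA device
model = torchvision.models.resnet18(pretrained=True).to(DEVICE)

# 32 random batches stand in for real calibration data here; use real samples in practice.
calibration_dataloader = [torch.rand(1, 3, 224, 224) for _ in range(32)]
collate_fn = lambda batch: batch.to(DEVICE)       # moves each calibration batch to the executing device

quant_setting = QuantizationSettingFactory.default_setting()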

Contact Us

WeChat Official Account: OpenPPL
QQ Group: 627853444

Email: [email protected]

Other Resources

Contributions

We appreciate all contributions. If you are planning to contribute back bug-fixes, please do so without any further discussion.

If you plan to contribute new features, utility functions, or extensions to the core, please first open an issue and discuss the feature with us. Sending a PR without discussion might end up resulting in a rejected PR because we might be taking the core in a different direction than you might be aware of.

Benchmark

PPQ is tested with models from mmlab-classification, mmlab-detection, mmlab-segmentation, and mmlab-editing; part of our testing results are listed below.

  • No quantization optimization procedure is applied to the following models.
| Model | Type | Calibration | Dispatcher | Metric | PPQ(sim) | PPLCUDA | FP32 |
| ----- | ---- | ----------- | ---------- | ------ | -------- | ------- | ---- |
| Resnet-18 | Classification | 512 imgs | conservative | Acc-Top-1 | 69.50% | 69.42% | 69.88% |
| ResNeXt-101 | Classification | 512 imgs | conservative | Acc-Top-1 | 78.46% | 78.37% | 78.66% |
| SE-ResNet-50 | Classification | 512 imgs | conservative | Acc-Top-1 | 77.24% | 77.26% | 77.76% |
| ShuffleNetV2 | Classification | 512 imgs | conservative | Acc-Top-1 | 69.13% | 68.85% | 69.55% |
| MobileNetV2 | Classification | 512 imgs | conservative | Acc-Top-1 | 70.99% | 71.1% | 71.88% |
| retinanet | Detection | 32 imgs | pplnn | bbox_mAP | 36.1% | 36.1% | 36.4% |
| faster_rcnn | Detection | 32 imgs | pplnn | bbox_mAP | 36.6% | 36.7% | 37.0% |
| fsaf | Detection | 32 imgs | pplnn | bbox_mAP | 36.5% | 36.6% | 37.4% |
| mask_rcnn | Detection | 32 imgs | pplnn | bbox_mAP | 37.7% | 37.6% | 37.9% |
| deeplabv3 | Segmentation | 32 imgs | conservative | aAcc / mIoU | 96.13% / 78.81% | 96.14% / 78.89% | 96.17% / 79.12% |
| deeplabv3plus | Segmentation | 32 imgs | conservative | aAcc / mIoU | 96.27% / 79.39% | 96.26% / 79.29% | 96.29% / 79.60% |
| fcn | Segmentation | 32 imgs | conservative | aAcc / mIoU | 95.75% / 74.56% | 95.62% / 73.96% | 95.68% / 72.35% |
| pspnet | Segmentation | 32 imgs | conservative | aAcc / mIoU | 95.79% / 77.40% | 95.79% / 77.41% | 95.83% / 77.74% |
| srcnn | Editing | 32 imgs | conservative | PSNR / SSIM | 27.88% / 79.70% | 27.88% / 79.07% | 28.41% / 81.06% |
| esrgan | Editing | 32 imgs | conservative | PSNR / SSIM | 27.84% / 75.20% | 27.49% / 72.90% | 27.51% / 72.84% |
  • PPQ(sim) stands for PPQ quantization simulator's result.
  • Dispatcher stands for the dispatching policy of PPQ.
  • Classification models are evaluated on ImageNet; Detection and Segmentation models are evaluated on the COCO dataset; Editing models are evaluated on the DIV2K dataset.
  • All calibration datasets are randomly picked from training data.

License

This project is distributed under the Apache License, Version 2.0.

Comments
  • PPQ can not complie cuda extensions, please check your compiler and system environment, PPQ will disable CUDA KERNEL for now.

    RTX2080Ti
    Python 3.8.13
    ninja 1.5.1
    ppq 0.6.4
    PyTorch 1.12.0
    tensorrt 8.4.1.5
    export PATH=/usr/local/cuda-11.1/bin:$PATH
    export LD_LIBRARY_PATH=/usr/local/cuda-11.1/lib64:$LD_LIBRARY_PATH

    When importing ppq, this message was raised. Could you please give some advice? @zchrissirhcz @ouonline

    opened by songkq 26
  • Error when executing on CPU

    It's me again. I tried running the efficientnet-lite4-11.onnx model from the official ONNX model zoo on CPU and got an error. With the kl and mse calibration policies, the assert at line 582 of quantization/optim/refine.py is triggered, saying that some operator was not quantized correctly. The problem does not occur when I use the minmax policy. All of the above was done on CPU (I don't have a GPU here).

    Can I fix this error by changing some code, or am I limited to the minmax policy when running on CPU?
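
    For reference, a minimal sketch (not from the original issue) of selecting the minmax calibration policy through the user-friendly setting object that appears later on this page; the import path, the platform choice, the 'minmax' string and the defaults of the omitted parameters are assumptions:

        from ppq import TargetPlatform
        from ppq.api import UnbelievableUserFriendlyQuantizationSetting

        # 'minmax' avoids the kl / mse observers that triggered the assert described above.
        SETTING = UnbelievableUserFriendlyQuantizationSetting(
            platform=TargetPlatform.PPL_CUDA_INT8,   # replace with your real target platform
            calibration='minmax',
            equalization=True,
            non_quantable_op=None)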

    opened by Menace-Dragon 18
  • AttributeError: 'Operation' object has no attribute 'config'

    Traceback (most recent call last):
      File "ProgramEntrance.py", line 200, in <module>
        export_ppq_graph(
      File "/root/miniconda3/envs/mmdeploy/lib/python3.8/site-packages/ppq-0.6.5.1-py3.8.egg/ppq/api/interface.py", line 628, in export_ppq_graph
        exporter.export(file_path=graph_save_to, config_path=config_save_to, graph=graph, **kwargs)
      File "/root/miniconda3/envs/mmdeploy/lib/python3.8/site-packages/ppq-0.6.5.1-py3.8.egg/ppq/parser/trt_exporter.py", line 53, in export
        self.export_quantization_config(config_path, graph)
      File "/root/miniconda3/envs/mmdeploy/lib/python3.8/site-packages/ppq-0.6.5.1-py3.8.egg/ppq/parser/trt_exporter.py", line 29, in export_quantization_config
        input_cfg = op.config.input_quantization_config[0]
    AttributeError: 'Operation' object has no attribute 'config'

    Calling trt_int8 in ProgramEntrance.py raises this error, but calling PPL_CUDA_INT8 does not.

    opened by kaizhong2021 14
  • About a bug at line 125 of scheduler/dispatcher.py

    Great project! But I get an error when running the efficientnet-lite4-11.onnx model from the official ONNX model zoo. The error occurs at line 125 of scheduler/dispatcher.py. After some analysis, the cause appears to be:

    1. The model's graph contains a flow like ···-->Conv-->BN-->Clip-->···. PPQ fuses Conv+BN by default, but the fused operation is appended to the end of graph.operations.
    2. When binding a platform to the Clip op, the statement at line 125 of scheduler/dispatcher.py is executed.

    Putting 1 and 2 together: at that point the dispatching_table has no information about the fused ConvBN operation, which triggers the error. It is an ordering problem; I leave it to you, the authors, to decide how best to fix it.

    opened by Menace-Dragon 11
  • RuntimeError: Error happens when dealing with operation ConstantOfShape_1246(TargetPlatform.SOI)

    I ran ppq/samples/Tutorial/quantize.py with a swin-transformer model and target platform TRT_INT8, and got the following error:

      Traceback (most recent call last):
        File "/usr/local/lib/python3.8/dist-packages/ppq-0.6.6-py3.8.egg/ppq/executor/torch.py", line 541, in __forward
          outputs = operation_forward_func(operation, inputs, self._executing_context)
        File "/usr/local/lib/python3.8/dist-packages/ppq-0.6.6-py3.8.egg/ppq/executor/op/torch/default.py", line 1197, in ConstantOfShape_forward
          output = torch.Tensor().new_full(
      TypeError: new_full(): argument 'size' must be tuple of ints, not list

    The above exception was the direct cause of the following exception:

    Traceback (most recent call last):
      File "quantize_test.py", line 71, in <module>
        quantized = quantize_onnx_model(
      File "/usr/local/lib/python3.8/dist-packages/ppq-0.6.6-py3.8.egg/ppq/core/defs.py", line 54, in _wrapper
        return func(*args, **kwargs)
      File "/usr/local/lib/python3.8/dist-packages/ppq-0.6.6-py3.8.egg/ppq/api/interface.py", line 259, in quantize_onnx_model
        quantizer.quantize(
      File "/usr/local/lib/python3.8/dist-packages/ppq-0.6.6-py3.8.egg/ppq/core/defs.py", line 54, in _wrapper
        return func(*args, **kwargs)
      File "/usr/local/lib/python3.8/dist-packages/ppq-0.6.6-py3.8.egg/ppq/quantization/quantizer/base.py", line 61, in quantize
        executor.tracing_operation_meta(inputs=inputs)
      File "/usr/local/lib/python3.8/dist-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
        return func(*args, **kwargs)
      File "/usr/local/lib/python3.8/dist-packages/ppq-0.6.6-py3.8.egg/ppq/core/defs.py", line 54, in _wrapper
        return func(*args, **kwargs)
      File "/usr/local/lib/python3.8/dist-packages/ppq-0.6.6-py3.8.egg/ppq/executor/torch.py", line 603, in tracing_operation_meta
        self.__forward(
      File "/usr/local/lib/python3.8/dist-packages/ppq-0.6.6-py3.8.egg/ppq/executor/torch.py", line 568, in __forward
        raise RuntimeError(f'Error happens when dealing with operation {str(operation)}') from _
    RuntimeError: Error happens when dealing with operation ConstantOfShape_1246(TargetPlatform.SOI) - inputs:['onnx::ConstantOfShape_4472'], outputs:['onnx::Concat_4473']

    opened by shhn1 9
  • A possible reason why PPQ INT8 models exported to the onnx platform run slowly

    I mentioned this under issue #329 before, but it seems better to open a separate issue.

    The current poor INT8 performance of ppq may be caused by the export format. Compared with the unquantized model, the model exported by ppq not only introduces extra quantize and dequantize operations, it also still performs the computation in fp32. The official ONNX int8 models are not exported this way; they use operators dedicated to quantized data, such as QLinearConv, for the computation. Take mobilenet as an example (the download repository is here). The attached images show the unquantized model and the officially quantized model.

    The next image shows the model quantized by ppq with the export platform set to ONNXRUNTIME.

    As you can see, because the official implementation uses operators such as QLinearConv that work directly on quantized data, it does not insert quantize/dequantize operators everywhere, and it also applies graph optimization (for example the clip is optimized away; a further possible optimization is that consecutive QConv ops could probably be fused as well). PPQ, on the other hand, not only inserts a large number of quantize and dequantize operators, it also does the actual computation in fp32. This may be why the onnx model exported by ppq is inefficient.

    ~~This is just a guess and may not be correct.~~
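
    For reference, a minimal sketch (not from the original issue) of how ONNX Runtime's own post-training quantization emits the QOperator format discussed above, i.e. QLinearConv instead of QuantizeLinear/DequantizeLinear pairs. The file names, the input name and the random calibration data are illustrative assumptions:

        import numpy as np
        from onnxruntime.quantization import (CalibrationDataReader, QuantFormat,
                                              QuantType, quantize_static)

        class RandomCalibrationReader(CalibrationDataReader):
            """Feeds a few random batches; replace with real calibration data."""
            def __init__(self, input_name: str, shape, n: int = 32):
                self._samples = iter(
                    [{input_name: np.random.rand(*shape).astype(np.float32)} for _ in range(n)])

            def get_next(self):
                return next(self._samples, None)

        quantize_static(
            model_input='mobilenet.onnx',                  # hypothetical file names
            model_output='mobilenet-int8.onnx',
            calibration_data_reader=RandomCalibrationReader('input', (1, 3, 224, 224)),
            quant_format=QuantFormat.QOperator,            # emit QLinearConv-style operators
            activation_type=QuantType.QUInt8,
            weight_type=QuantType.QInt8)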

    opened by Pzzzzz5142 8
  • Export problem after PPL_DSP_INT8 quantization

    In GPU mode, when running RetinaFace (with a ResNet50 backbone), the quantization process completes successfully, but exporting raises TypeError: Cannot convert Resize_133 to caffe op. Debugging shows the cause is that the check at line 439 of ppq/parser/caffe/caffe_export_utils.py is not satisfied.

    opened by Menace-Dragon 8
  • Quantized model converted to a TensorRT INT8 engine: inference results do not match

    Hello, I am trying to use PPQ quantization to obtain a TensorRT INT8 model. I found that when the model is relatively large, converting the QDQ ONNX model to TRT INT8 seems to have an accuracy problem (outputs do not match). Specifically, a small model such as mnist matches well (errors on the order of 1e-7), while a slightly larger model such as resnet50 shows larger errors. I am not sure whether my procedure is wrong; my current diagnosis is that the TensorRT conversion introduces the error, so I opened an issue in the TensorRT repo, see https://github.com/NVIDIA/TensorRT/issues/2103. Have you encountered similar problems? Thanks!

    opened by FreemanHsu 7
  • The Upsample operator does not seem to support quantization; ConvTranspose seems unable to complete BN folding

    a. When running an onnx test model, I get the error that the Upsample operator has no backend on the target platform; it seems PPQ does not support quantizing the Upsample operator yet. Is support planned?

    b. When running another onnx test model, the model contains a graph like ...-->ConvTranspose-->BatchNorm-->ReLU-->..., and it raises an error that the ConvTranspose operator cannot be folded with BN.

    c. One more question: if the graph is ...-->BatchNorm-->Conv-->ReLU-->..., can folding be performed?

    opened by Menace-Dragon 7
  • Cannot correctly obtain the output of the bn op

    The error is as follows:

      File "/workspace/ppq/ppq/IR/morph.py", line 275, in format_sng_bn
        bn_out_var = bn_op.outputs[0]
    IndexError: list index out of range
    

    The model is one of the official models provided by ONNX.

    The code is the sample code; I only modified the device part, because I do not have an NVIDIA GPU at the moment.

    # ---------------------------------------------------------------
    # This script shows you how to run inference with onnxruntime on a model exported by PPQ.
    # Note that onnxruntime can run all kinds of quantization schemes, but quantization brings
    # almost no speedup to onnxruntime.
    # You can use onnxruntime to verify a quantization scheme and the correctness of ppq
    # quantization, but it is not a reasonable deployment platform.
    # Modify QUANT_PLATFROM to use a different quantization scheme.
    
    # This script exports ppq's internal graph to onnxruntime.
    # You should notice that onnx is designed as an Open Neural Network Exchange format.
    # It has the capability to describe most of ppq's quantization policies, including combinations of:
    #   Symmetrical, Asymmetrical, POT, Per-channel, Per-Layer
    # However onnxruntime can not accelerate quantized model in most cases,
    # you are supposed to use onnxruntime for verifying your network quantization result only.
    # ---------------------------------------------------------------
    
    # For this onnx inference test, all test data is randomly picked.
    # If you want to use real data, just rewrite the definition of SAMPLES
    import onnxruntime
    import torch
    from ppq import *
    from ppq.api import *
    from tqdm import tqdm
    
    QUANT_PLATFROM = TargetPlatform.TRT_INT8
    MODEL = "converted.onnx"
    INPUT_SHAPE = [1, 3, 480, 640]
    SAMPLES = [
        torch.rand(size=INPUT_SHAPE) for _ in range(256)
    ]  # rewrite this to use real data.
    DEVICE = "cpu"
    FINETUNE = True
    QS = QuantizationSettingFactory.default_setting()
    EXECUTING_DEVICE = "cpu"
    REQUIRE_ANALYSE = True
    
    # -------------------------------------------------------------------
    # Commonly tuned options are shown below:
    # -------------------------------------------------------------------
    QS.lsq_optimization = FINETUNE  # enable network fine-tuning to reduce quantization error
    QS.lsq_optimization_setting.steps = 500  # number of fine-tuning steps; affects training time, 500 steps take roughly a few minutes
    QS.lsq_optimization_setting.collecting_device = (
        "cpu"  # where cached data is kept; 'cuda' keeps it on the GPU, switch to 'cpu' if GPU memory runs out
    )
    
    if QUANT_PLATFROM in {
        TargetPlatform.PPL_DSP_INT8,  # these platforms use per-tensor quantization
        TargetPlatform.HEXAGON_INT8,
        TargetPlatform.SNPE_INT8,
        TargetPlatform.METAX_INT8_T,
        TargetPlatform.FPGA_INT8,
    }:
        QS.equalization = True  # per-tensor quantization platforms need equalization
    
    if QUANT_PLATFROM in {
        TargetPlatform.PPL_CUDA_INT8,  # note: before doing this, make sure your runtime supports mixed-precision and floating-point execution
        TargetPlatform.TRT_INT8,
    }:
        QS.dispatching_table.append(operation="OP NAME", platform=TargetPlatform.FP32)
    
    print("正准备量化你的网络,检查下列设置:")
    print(f"TARGET PLATFORM      : {QUANT_PLATFROM.name}")
    print(f"NETWORK INPUTSHAPE   : {INPUT_SHAPE}")
    
    # ENABLE CUDA KERNEL speeds up quantization by 3x ~ 10x, but it cannot be compiled without the proper build environment.
    # You can try installing the build environment, or finish quantization without the CUDA KERNEL: just remove the with ENABLE_CUDA_KERNEL(): block.
    # with ENABLE_CUDA_KERNEL():
    with open("a", "w") as fl:
        qir = quantize_onnx_model(
            onnx_import_file=MODEL,
            calib_dataloader=SAMPLES,
            calib_steps=128,
            setting=QS,
            input_shape=INPUT_SHAPE,
            collate_fn=lambda x: x.to(EXECUTING_DEVICE),
            platform=QUANT_PLATFROM,
            do_quantize=True,
        )
    
        # -------------------------------------------------------------------
        # When PPQ computes quantization error it uses the inverse signal-to-noise ratio as the metric, i.e. noise energy / signal energy.
        # A quantization error of 0.1 means quantization noise accounts for roughly 10% of the total signal energy.
        # Note that graphwise_error_analyse measures the accumulated error:
        # the last layer of a network usually shows a large accumulated error, caused jointly by all preceding layers.
        # Use layerwise_error_analyse to analyse the sources of error layer by layer.
        # -------------------------------------------------------------------
        print("正计算网络量化误差(SNR),最后一层的误差应小于 0.1 以保证量化精度:")
        reports = graphwise_error_analyse(
            graph=qir,
            running_device=EXECUTING_DEVICE,
            steps=32,
            dataloader=SAMPLES,
            collate_fn=lambda x: x.to(EXECUTING_DEVICE),
        )
        for op, snr in reports.items():
            if snr > 0.1:
                ppq_warning(f"层 {op} 的累计量化误差显著,请考虑进行优化")
    
        if REQUIRE_ANALYSE:
            print("正计算逐层量化误差(SNR),每一层的独立量化误差应小于 0.1 以保证量化精度:")
            layerwise_error_analyse(
                graph=qir,
                running_device=EXECUTING_DEVICE,
                interested_outputs=None,
                dataloader=SAMPLES,
                collate_fn=lambda x: x.to(EXECUTING_DEVICE),
            )
    
        print("网络量化结束,正在生成目标文件:")
        export_ppq_graph(
            graph=qir, platform=QUANT_PLATFROM, graph_save_to="model_int8.onnx"
        )
    
        exit(0)
    
        # -------------------------------------------------------------------
        # Record the input/output names; onnxruntime needs them at inference time.
        # This is written for a single input and a single output; adapt it yourself for multiple inputs/outputs.
        # -------------------------------------------------------------------
        int8_input_names = [name for name, _ in qir.inputs.items()]
        int8_output_names = [name for name, _ in qir.outputs.items()]
    
        # -------------------------------------------------------------------
        # Run inference with onnxruntime.
        # As of 2022.05, onnxruntime runs int8 slowly, so do not expect it to be fast.
        # If you know how to make it run faster, or onnxruntime gets updated, feel free to contact me.
        # -------------------------------------------------------------------
        session = onnxruntime.InferenceSession(
            "model_int8.onnx", providers=["CUDAExecutionProvider"]
        )
        onnxruntime_results = []
        for sample in tqdm(
            SAMPLES, desc="ONNXRUNTIME GENERATEING OUTPUTS", total=len(SAMPLES)
        ):
            result = session.run(None, {int8_input_names[0]: convert_any_to_numpy(sample)})
            onnxruntime_results.append(result)
    
    

    I also converted the opset (to 12), but still could not read the output of the bn layer. In addition, I would like to quantize an onnx model and export it back to onnx format; how should I do that? From other issues it seems I should set QUANT_PLATFROM to TargetPlatform.ONNXRUNTIME, but the current version does not seem to support that platform.

    opened by Pzzzzz5142 6
  • RuntimeError of Shape op during Calibration dataset progress and finetune progress

    [image]

    Configuration:

    TARGET_PLATFORM = TargetPlatform.NXP_INT8  # choose your target platform
    MODEL_TYPE = NetworkFramework.ONNX          # or NetworkFramework.CAFFE
    INPUT_LAYOUT = 'chw'                        # input data layout, chw or hwc
    NETWORK_INPUTSHAPE = [16, 1, 40, 61]        # input shape of your network
    CALIBRATION_BATCHSIZE = 16                  # batchsize of calibration dataset
    EXECUTING_DEVICE = 'cuda'                   # 'cuda' or 'cpu'.
    REQUIRE_ANALYSE = True
    DUMP_RESULT = False

    SETTING = UnbelievableUserFriendlyQuantizationSetting(
        platform = TARGET_PLATFORM, finetune_steps = 2500,
        finetune_lr = 1e-3, calibration = 'percentile',
        equalization = True, non_quantable_op = None)
    dataloader = DataLoader(dataset=calibration_dataset, batch_size=32, shuffle=True)
    quantized = quantize(
        working_directory=WORKING_DIRECTORY, setting=SETTING,
        model_type=MODEL_TYPE, executing_device=EXECUTING_DEVICE,
        input_shape=NETWORK_INPUTSHAPE, target_platform=TARGET_PLATFORM,
        dataloader=dataloader, calib_steps=250)

    Problem description:

    At the 213th iteration the Shape op raised the error above. I worked out that this iteration's batch size was 19, and a log printed inside the dataloader iterator confirmed that this fine-tune batch really only delivered 19 samples. It turned out the dataset had been traversed exactly once by the 213th iteration. After I changed both finetune_steps and calib_steps to 100 and resized the calibration dataset to 32*100 samples, it runs normally. The model file is attached: model.zip
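
    For reference, a small sketch (not from the original issue) of the arithmetic behind the workaround: with a dataloader batch size of 32, every calibration or fine-tune step must be able to draw a full batch, otherwise a truncated batch (here 19 samples) reaches the Shape op.

        # Plain sanity check (assumption: ordinary Python, not a PPQ API).
        batch_size = 32
        num_calibration_samples = 32 * 100   # resized calibration set from the issue
        calib_steps = 100
        finetune_steps = 100
        assert calib_steps * batch_size <= num_calibration_samples
        assert finetune_steps * batch_size <= num_calibration_samples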

    opened by lycfly 6
  • parse onnx model failed

    problem:

    kls@ubuntu:~/workspace$ ~/libraries/TensorRT-8.4.1.5/bin/trtexec --onnx=./unet-q.onnx --saveEngine=unet-int8.engine
    &&&& RUNNING TensorRT.trtexec [TensorRT v8401] # /home/kls/libraries/TensorRT-8.4.1.5/bin/trtexec --onnx=./unet-q.onnx --saveEngine=unet-int8.engine
    [01/06/2023-17:32:20] [I] === Model Options ===
    [01/06/2023-17:32:20] [I] Format: ONNX
    [01/06/2023-17:32:20] [I] Model: ./unet-q.onnx
    [01/06/2023-17:32:20] [I] Output:
    [01/06/2023-17:32:20] [I] === Build Options ===
    [01/06/2023-17:32:20] [I] Max batch: explicit batch
    [01/06/2023-17:32:20] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
    [01/06/2023-17:32:20] [I] minTiming: 1
    [01/06/2023-17:32:20] [I] avgTiming: 8
    [01/06/2023-17:32:20] [I] Precision: FP32
    [01/06/2023-17:32:20] [I] LayerPrecisions: 
    [01/06/2023-17:32:20] [I] Calibration: 
    [01/06/2023-17:32:20] [I] Refit: Disabled
    [01/06/2023-17:32:20] [I] Sparsity: Disabled
    [01/06/2023-17:32:20] [I] Safe mode: Disabled
    [01/06/2023-17:32:20] [I] DirectIO mode: Disabled
    [01/06/2023-17:32:20] [I] Restricted mode: Disabled
    [01/06/2023-17:32:20] [I] Build only: Disabled
    [01/06/2023-17:32:20] [I] Save engine: unet-int8.engine
    [01/06/2023-17:32:20] [I] Load engine: 
    [01/06/2023-17:32:20] [I] Profiling verbosity: 0
    [01/06/2023-17:32:20] [I] Tactic sources: Using default tactic sources
    [01/06/2023-17:32:20] [I] timingCacheMode: local
    [01/06/2023-17:32:20] [I] timingCacheFile: 
    [01/06/2023-17:32:20] [I] Input(s)s format: fp32:CHW
    [01/06/2023-17:32:20] [I] Output(s)s format: fp32:CHW
    [01/06/2023-17:32:20] [I] Input build shapes: model
    [01/06/2023-17:32:20] [I] Input calibration shapes: model
    [01/06/2023-17:32:20] [I] === System Options ===
    [01/06/2023-17:32:20] [I] Device: 0
    [01/06/2023-17:32:20] [I] DLACore: 
    [01/06/2023-17:32:20] [I] Plugins:
    [01/06/2023-17:32:20] [I] === Inference Options ===
    [01/06/2023-17:32:20] [I] Batch: Explicit
    [01/06/2023-17:32:20] [I] Input inference shapes: model
    [01/06/2023-17:32:20] [I] Iterations: 10
    [01/06/2023-17:32:20] [I] Duration: 3s (+ 200ms warm up)
    [01/06/2023-17:32:20] [I] Sleep time: 0ms
    [01/06/2023-17:32:20] [I] Idle time: 0ms
    [01/06/2023-17:32:20] [I] Streams: 1
    [01/06/2023-17:32:20] [I] ExposeDMA: Disabled
    [01/06/2023-17:32:20] [I] Data transfers: Enabled
    [01/06/2023-17:32:20] [I] Spin-wait: Disabled
    [01/06/2023-17:32:20] [I] Multithreading: Disabled
    [01/06/2023-17:32:20] [I] CUDA Graph: Disabled
    [01/06/2023-17:32:20] [I] Separate profiling: Disabled
    [01/06/2023-17:32:20] [I] Time Deserialize: Disabled
    [01/06/2023-17:32:20] [I] Time Refit: Disabled
    [01/06/2023-17:32:20] [I] Inputs:
    [01/06/2023-17:32:20] [I] === Reporting Options ===
    [01/06/2023-17:32:20] [I] Verbose: Disabled
    [01/06/2023-17:32:20] [I] Averages: 10 inferences
    [01/06/2023-17:32:20] [I] Percentile: 99
    [01/06/2023-17:32:20] [I] Dump refittable layers:Disabled
    [01/06/2023-17:32:20] [I] Dump output: Disabled
    [01/06/2023-17:32:20] [I] Profile: Disabled
    [01/06/2023-17:32:20] [I] Export timing to JSON file: 
    [01/06/2023-17:32:20] [I] Export output to JSON file: 
    [01/06/2023-17:32:20] [I] Export profile to JSON file: 
    [01/06/2023-17:32:20] [I] 
    [01/06/2023-17:32:20] [I] === Device Information ===
    [01/06/2023-17:32:20] [I] Selected Device: NVIDIA A10
    [01/06/2023-17:32:20] [I] Compute Capability: 8.6
    [01/06/2023-17:32:20] [I] SMs: 72
    [01/06/2023-17:32:20] [I] Compute Clock Rate: 1.695 GHz
    [01/06/2023-17:32:20] [I] Device Global Memory: 22731 MiB
    [01/06/2023-17:32:20] [I] Shared Memory per SM: 100 KiB
    [01/06/2023-17:32:20] [I] Memory Bus Width: 384 bits (ECC enabled)
    [01/06/2023-17:32:20] [I] Memory Clock Rate: 6.251 GHz
    [01/06/2023-17:32:20] [I] 
    [01/06/2023-17:32:20] [I] TensorRT version: 8.4.1
    [01/06/2023-17:32:21] [I] [TRT] [MemUsageChange] Init CUDA: CPU +535, GPU +0, now: CPU 542, GPU 499 (MiB)
    [01/06/2023-17:32:21] [I] Start parsing network model
    [01/06/2023-17:32:21] [I] [TRT] ----------------------------------------------------------------
    [01/06/2023-17:32:21] [I] [TRT] Input filename:   ./unet-q.onnx
    [01/06/2023-17:32:21] [I] [TRT] ONNX IR version:  0.0.7
    [01/06/2023-17:32:21] [I] [TRT] Opset version:    13
    [01/06/2023-17:32:21] [I] [TRT] Producer name:    PPL Quantization Tool
    [01/06/2023-17:32:21] [I] [TRT] Producer version: 
    [01/06/2023-17:32:21] [I] [TRT] Domain:           
    [01/06/2023-17:32:21] [I] [TRT] Model version:    0
    [01/06/2023-17:32:21] [I] [TRT] Doc string:       
    [01/06/2023-17:32:21] [I] [TRT] ----------------------------------------------------------------
    [01/06/2023-17:32:21] [E] [TRT] ModelImporter.cpp:720: While parsing node number 23 [QuantizeLinear -> "PPQ_Variable_297"]:
    [01/06/2023-17:32:21] [E] [TRT] ModelImporter.cpp:721: --- Begin node ---
    [01/06/2023-17:32:21] [E] [TRT] ModelImporter.cpp:722: input: "outc.conv.weight"
    input: "PPQ_Variable_295"
    input: "PPQ_Variable_296"
    output: "PPQ_Variable_297"
    name: "PPQ_Operation_98"
    op_type: "QuantizeLinear"
    attribute {
      name: "axis"
      i: 0
      type: INT
    }
    
    [01/06/2023-17:32:21] [E] [TRT] ModelImporter.cpp:723: --- End node ---
    [01/06/2023-17:32:21] [E] [TRT] ModelImporter.cpp:726: ERROR: builtin_op_importers.cpp:1096 In function QuantDequantLinearHelper:
    [6] Assertion failed: axis == INVALID_AXIS && "Quantization axis attribute is not valid with a single quantization scale"
    [01/06/2023-17:32:21] [E] Failed to parse onnx file
    [01/06/2023-17:32:21] [I] Finish parsing network model
    [01/06/2023-17:32:21] [E] Parsing model failed
    [01/06/2023-17:32:21] [E] Failed to create engine from model or file.
    [01/06/2023-17:32:21] [E] Engine set up failed
    
    opened by nanmi 1
  • evaluation_with_imagenet.py is failure

    Running the official example ppq/ppq/samples/Imagenet/evaluation_with_imagenet.py with Resnet50 reports an error: Test: [700 / 781] Prec@1 75.843 (75.843) Prec@5 92.812 (92.812) Evaluating Model...: 100%|...| 781/781 [00:34<00:00, 22.68it/s]

    • Prec@1 75.804 Prec@5 92.808

      [Warning] File Output/resnet50.onnx is already existed, Exporter will overwrite it.
      /opt/conda/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py:53: UserWarning: Specified provider 'CUDAExecutionProvider' is not in available provider names.Available providers: 'CPUExecutionProvider'
        warnings.warn("Specified provider '{}' is not in available provider names."
      Traceback (most recent call last):
        File "evaluation_with_imagenet.py", line 84, in <module>
          evaluate_onnx_module_with_imagenet(
        File "/home/li.sun/github/ppq/ppq/samples/Imagenet/Utilities/Imagenet/imagenet_util.py", line 103, in evaluate_onnx_module_with_imagenet
          sess = onnxruntime.InferenceSession(path_or_bytes=onnxruntime_model_path, providers=['CUDAExecutionProvider'])
        File "/opt/conda/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 283, in __init__
          self._create_inference_session(providers, provider_options, disabled_optimizers)
        File "/opt/conda/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 310, in _create_inference_session
          sess = C.InferenceSession(session_options, self._model_path, True, self._read_config_from_model)
      onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Load model from Output/resnet50.onnx failed:Type Error: Type (tensor(float)) of output arg (PPQ_Variable_2) of node (PPQ_Operation_0) does not match expected type (tensor(int8)).

    Environment:

    1. ppq git commit id: commit 76e03261bad580e7c52e6f0856034fa9313f69b5 (HEAD -> master, origin/master, origin/HEAD) Author: AwesomeCodingBoy [email protected] Date: Tue Dec 13 14:03:47 2022 +0800

      Update inference_with_ncnn.md (#324)

    onnxruntime 1.13.1, onnxruntime-gpu 1.13.1, or onnxruntime 1.8.1, onnxruntime-gpu 1.8.1

    opened by yuyinsl 2
  • How to export the quantized onnx file and save weight in int8 format?

    Hi, I really appreciate the wonderful tutorials you provide on bilibili. As a beginner in quantization, I have learned a lot from the series of classes. To deploy an onnx model to a specific TPU, the current solution offered by the vendor is to store the output of the ONNX file in int8 format. However, the current default storage format is float32. Could you please tell me how to change this setting?

    For more details you can check this issue: https://github.com/sophgo/tpu-mlir/issues/51. They suggested that we export the onnx in the first (int8) format.

    opened by Jackycheng0808 1
  • Convert Yolov4

    Hi, I have a yolov4 model that I want to run on TensorRT INT8. I read the documentation but, as an English speaker, I am having a hard time following it. Can you please guide me on how to convert the model and prepare the dataset for the ProgramEntrance.py script? I have a dataset in Yolo format.

    Thanks

    opened by Sayyam-Jain 1
  • How can I tell whether a model is suitable for quantization and will benefit from it?

    Hi, I learned about this project from the bilibili videos; they are very clear and I have already liked, coined and favorited them. The videos mention that some networks do not benefit from quantization and may even regress, but they do not explain in detail how to judge this in advance; I am afraid of putting in a lot of effort only to end up with a poor result. Does this depend on the platform and inference framework? For example, for android, arm64 and ncnn, are there any good criteria for making this judgement? Thanks!

    opened by zuowanbushiwo 3
Releases(v0.6.5)
  • v0.6.5(Sep 2, 2022)

    • Analyzer
      • Added a new analysis method: statistical_analyse
      • Operators with multiple outputs can now be analysed
      • Redesigned the way cosine similarity is computed
    • API
      • Added a new API: load_native_graph
      • Added a new API: register_network_quantizer
      • Added a new API: register_network_parser
      • Added a new API: register_network_exporter
      • API functions can now be called with setting=None
    • Executor
      • Support for 1d and 3d convolution
      • Support for 1d and 3d pooling
      • Support for 1d and 3d transposed convolution
      • Support for lstm and gru
      • Support for sin and cos
      • Support for abs
      • Support for sum
      • Support for Erf, Elu and Reciprocal
      • Rewrote the resize, slice and scatterND implementations
      • Removed the restrictions on operator registration; you can now override ppq's built-in operator implementations
      • Fixed padding issues in Conv, Pooling, ConvTranspose and Pad; onnx 1d, 2d and 3d padding is now supported and runs with higher performance
    • Dispatcher
      • Added new dispatchers: purseus, allin
      • Added a new data-type abstraction, opsocket; it will be moved into ppq.IR in the next release
      • The default subgraph partitioning method is now purseus
      • Added dispatching warning messages
    • Observer
      • Added a new calibration method, OrderPreserving; order-preserving quantization is applied to classification networks to improve classification performance
      • Added an asymmetric implementation of mse, plus a C++ mse implementation
    • Graph
      • Support BN fusion for 1d and 3d convolution and transposed convolution
      • Variable now has a shape attribute; you can modify shape directly to set a dynamic shape
      • The graph pattern-matching engine now allows matching with ep_expr = None
      • Graphs can now be copied
    • Optim
      • The LSQ algorithm was rewritten with greatly improved performance; Advanced optimization was merged into LSQ and is now called CuLSQ
      • The Brecq algorithm was moved into legacy and is no longer recommended
      • The layerSplit and BiasCorrection passes were rewritten with improved performance
      • The Layerwise Equalization pass was rewritten; it now supports 1d, 2d and 3d convolution and transposed convolution, and supports including activations
      • Fixed an alignment error in the average pooling operator
      • Fixed issues related to bias and pad quantization
      • Removed RuntimePerlayerCalibrationPass; the related parameters no longer take effect
      • Removed ConstantBakingPass; the related parameters no longer take effect
      • Removed InplaceQuantizationSettingPass; the related parameters no longer take effect
      • Removed the fuse_conv_add setting; the related optimization pass was moved into the legacy file and must now be invoked manually
    • Doc
      • Added documentation for the common optimization passes
      • Added a yolo quantization example
      • Added new getting-started tutorial sample code
    • Cuda
      • Rewrote the core quantization kernels for better performance
      • Rewrote the quantization gradient-propagation kernels for better performance
      • The compilation lock is now removed automatically when compilation starts
    • Core
      • Added a new attribute, Visibility, to TQC; it controls the export visibility of a TQC
      • Renamed some attributes and moved them into ppq.common.py
      • The function __is_revisable of TensorQuantizationConfig is now public and has been renamed to is_revisable
    • Other
      • The TensorRT import warning is now only issued at export time
      • Added support for snpe 1.6.3
      • Added support for tengine
      • Fixed a series of bugs
    Source code(tar.gz)
    Source code(zip)
  • v0.6.4(Jun 1, 2022)

    Reworked the graph-manipulation interface; added the functions remove_operation, remove_variable, insert_op_on_var, insert_op_between_var, create_link_with_var, create_link_with_op, truncate_on_var

    Reworked the onnxruntime export logic and the onnx oos export logic

    Updated the lsq algorithm for faster execution

    Updated the ssd algorithm for faster execution

    Updated core.ffi; it now reports an error if compilation fails.

    Added several API functions, including manop and quantize_native_model, which let you control the optimization logic manually.

    Added a second type of pattern matching

    Added the logic for gru decomposition

    Added a test class for the graph API

    Added QNN export logic

    Added fusion logic for the swish and mish activation functions

    Added FPGAQuantizer

    Added support for the mod, softplus and gru operators

    Removed the misc folder; the code in it is no longer used.

    Fixed an incorrect pad ordering issue

    Fixed a possible variable-name collision when creating variables

    Fixed a bug in path_matching where intermediate results were not copied, which could lead to incorrect results

    Fixed some inconspicuous bugs in the matex gemm split pass

    Fixed some errors in the delete_isolated function

    Fixed PPL_DSP_TI_INT8 being misspelled as PPL_DSP_TI_IN8

    Source code(tar.gz)
    Source code(zip)
  • v0.6.3(Mar 30, 2022)

  • v0.6.2(Mar 18, 2022)

    • Scale and offset are now always torch.Tensor with dtype=fp32 for training your network.
    • PPQ will display a network snapshot when quantizing your network.
    • Added brecq & lsq algorithms.
    • CUDA kernels have been refined; more CUDA kernels are introduced into ppq.
    • Added an exporter for dumping the onnx quantization model.
    • Test cases are introduced starting from ppq 0.6.2.
    Source code(tar.gz)
    Source code(zip)