quantize aware training package for NCNN on pytorch

Last update: Nov 23, 2022

Related tags

Deep Learning ncnnqat

Overview

ncnnqat

ncnnqat is a quantize aware training package for NCNN on pytorch.

ncnnqat

Installation

Supported Platforms: Linux
Accelerators and GPUs: NVIDIA GPUs via CUDA driver 10.1.
Dependencies:
- python >= 3.5, < 4
- pytorch >= 1.6
- numpy >= 1.18.1
- onnx >= 1.7.0
- onnx-simplifier >= 0.3.6
Install ncnnqat via pypi:
```
$ pip install ncnnqat (to do....)
```
It is recommended to install from the source code

or Install ncnnqat via repo：

$ git clone https://github.com/ChenShisen/ncnnqat
$ cd ncnnqat
$ make install

Usage

register_quantization_hook and merge_freeze_bn

(suggest finetuning from a well-trained model, do it after a few epochs of training otherwise.)

from ncnnqat import unquant_weight, merge_freeze_bn, register_quantization_hook
...
...
    for epoch in range(epoch_train):
        model.train()
    if epoch==well_epoch:
        register_quantization_hook(model)
    if epoch>=well_epoch:
        model = merge_freeze_bn(model)  #it will change bn to eval() mode during training
...

Unquantize weight before update it

...
... 
    if epoch>=well_epoch:
        model.apply(unquant_weight)  # using original weight while updating
    optimizer.step()
...

Save weight and save ncnn quantize table after train

...
...
    onnx_path = "./xxx/model.onnx"
    table_path="./xxx/model.table"
    dummy_input = torch.randn(1, 3, img_size, img_size, device='cuda')
    input_names = [ "input" ]
    output_names = [ "fc" ]
    torch.onnx.export(model, dummy_input, onnx_path, verbose=False, input_names=input_names, output_names=output_names)
    save_table(model,onnx_path=onnx_path,table=table_path)

...

if use "model = nn.DataParallel(model)",pytorch unsupport torch.onnx.export,you should save state_dict first and prepare a new model with one gpu,then you will export onnx model.

...
...
    model_s = new_net() #
    model_s.cuda()
    register_quantization_hook(model_s)
    #model_s = merge_freeze_bn(model_s)
    onnx_path = "./xxx/model.onnx"
    table_path="./xxx/model.table"
    dummy_input = torch.randn(1, 3, img_size, img_size, device='cuda')
    input_names = [ "input" ]
    output_names = [ "fc" ]
    model_s.load_state_dict({k.replace('module.',''):v for k,v in model.state_dict().items()}) #model_s = model     model = nn.DataParallel(model)
          
    torch.onnx.export(model_s, dummy_input, onnx_path, verbose=False, input_names=input_names, output_names=output_names)
    save_table(model_s,onnx_path=onnx_path,table=table_path)
    

...

Code Examples

Cifar10 quantization aware training example.

python test/test_cifar10.py

SSD300 quantization aware training example.

   ln -s /your_coco_path/coco ./tests/ssd300/data

   python -m torch.distributed.launch \
    --nproc_per_node=4 \
    --nnodes=1 \
    --node_rank=0 \
    ./tests/ssd300/main.py \
    -d ./tests/ssd300/data/coco

    python ./tests/ssd300/main.py --onnx_save  #load model dict, export onnx and ncnn table

Results

Cifar10

result：

net	fp32(onnx)	ncnnqat	ncnn aciq	ncnn kl
mobilenet_v2	0.91	0.9066	0.9033	0.9066
resnet18	0.94	0.93333	0.9367	0.937

SSD300(resnet18|coco)

fp32:
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.193
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.344
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.191
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.042
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.195
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.328
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.199
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.293
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.309
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.084
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.326
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.501
Current AP: 0.19269

ncnnqat:
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.192
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.342
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.194
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.041
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.194
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.327
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.197
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.291
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.307
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.082
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.325
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.497
Current AP: 0.19202

Todo

....

Comments

ncnnqat的环境问题

在使用pip install ncnnqat的时候会产生如下报错： File "setup.py", line 3, in from torch.utils.cpp_extension import BuildExtension, CUDAExtension ModuleNotFoundError: No module named 'torch'

但log上的显示的python地址和版本里面确有安装torch

若直接对该版本进行编译，则会出现下述问题：

make[1]: Entering directory '/root/ncnnqat' NVCC src/fake_quantize.cu nvcc -std=c++14 -ccbin=g++ -Xcompiler -fPIC -use_fast_math -DNDEBUG -O3 -I./ -I/usr/local/cuda/include -I/opt/conda/include/python3.7m -I/opt/conda/lib/python3.7/site-packages/torch/include -I/opt/conda/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/lib/python3.7/site-packages/torch/include/TH -I/opt/conda/lib/python3.7/site-packages/torch/include/THC -DTORCH_API_INCLUDE_EXTENSION_H -D_GLIBCXX_USE_CXX11_ABI=0 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_75,code=compute_75 -M src/fake_quantize.cu -o obj/cuda/fake_quantize.d
-odir obj/cuda nvcc -std=c++14 -ccbin=g++ -Xcompiler -fPIC -use_fast_math -DNDEBUG -O3 -I./ -I/usr/local/cuda/include -I/opt/conda/include/python3.7m -I/opt/conda/lib/python3.7/site-packages/torch/include -I/opt/conda/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/lib/python3.7/site-packages/torch/include/TH -I/opt/conda/lib/python3.7/site-packages/torch/include/THC -DTORCH_API_INCLUDE_EXTENSION_H -D_GLIBCXX_USE_CXX11_ABI=0 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_75,code=compute_75 -c src/fake_quantize.cu -o obj/cuda/fake_quantize.o /opt/conda/lib/python3.7/site-packages/torch/include/ATen/record_function.h(18): warning: attribute "visibility" does not apply here

/opt/conda/lib/python3.7/site-packages/torch/include/torch/csrc/autograd/profiler.h(97): warning: attribute "visibility" does not apply here

/opt/conda/lib/python3.7/site-packages/torch/include/torch/csrc/autograd/profiler.h(126): warning: attribute "visibility" does not apply here

src/fake_quantize.cu(15): error: a value of type "const float *" cannot be assigned to an entity of type "float *"

src/fake_quantize.cu(21): error: identifier "Row" is undefined

src/fake_quantize.cu(88): warning: variable "momenta" was declared but never referenced

/opt/conda/lib/python3.7/site-packages/torch/include/c10/util/TypeCast.h(27): warning: calling a constexpr host function("real") from a host device function("apply") is not allowed. The experimental flag '--expt-relaxed-constexpr' can be used to allow this. detected during: instantiation of "decltype(auto) c10::maybe_real<true, src_t>::apply(src_t) [with src_t=c10::complex]" (57): here instantiation of "uint8_t c10::static_cast_with_inter_type<uint8_t, src_t>::apply(src_t) [with src_t=c10::complex]" (166): here instantiation of "To c10::convert<To,From>(From) [with To=uint8_t, From=c10::complex]" (178): here instantiation of "To c10::checked_convert<To,From>(From, const char *) [with To=uint8_t, From=c10::complex]" /opt/conda/lib/python3.7/site-packages/torch/include/c10/core/Scalar.h(66): here

2 errors detected in the compilation of "/tmp/tmpxft_00000066_00000000-11_fake_quantize.compute_75.cpp1.ii". Makefile:70: recipe for target 'obj/cuda/fake_quantize.o' failed make[1]: *** [obj/cuda/fake_quantize.o] Error 1 make[1]: Leaving directory '/root/ncnnqat' running install running bdist_egg running egg_info writing ncnnqat.egg-info/PKG-INFO writing dependency_links to ncnnqat.egg-info/dependency_links.txt writing requirements to ncnnqat.egg-info/requires.txt writing top-level names to ncnnqat.egg-info/top_level.txt reading manifest file 'ncnnqat.egg-info/SOURCES.txt' reading manifest template 'MANIFEST.in' writing manifest file 'ncnnqat.egg-info/SOURCES.txt' installing library code to build/bdist.linux-x86_64/egg running install_lib running build_py running build_ext building 'quant_cuda' extension Emitting ninja build file /root/ncnnqat/build/temp.linux-x86_64-3.7/build.ninja... Compiling objects... Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) ninja: no work to do. g++ -pthread -shared -B /opt/conda/compiler_compat -L/opt/conda/lib -Wl,-rpath=/opt/conda/lib -Wl,--no-as-needed -Wl,--sysroot=/ /root/ncnnqat/build/temp.linux-x86_64-3.7/./src/fake_quantize.o -Lobj -L/opt/conda/lib/python3.7/site-packages/torch/lib -L/usr/local/cuda/lib64 -lquant_cuda -lc10 -ltorch -ltorch_cpu -ltorch_python -lcudart -lc10_cuda -ltorch_cuda -o build/lib.linux-x86_64-3.7/quant_cuda.cpython-37m-x86_64-linux-gnu.so /opt/conda/compiler_compat/ld: cannot find -lquant_cuda collect2: error: ld returned 1 exit status error: command 'g++' failed with exit status 1 Makefile:106: recipe for target 'install' failed make: *** [install] Error 1

系统中安装的torch版本为1.6.0，cuda版本是10.1，也尝试了torch1.9.0，cuda10.2，均出现了同样的问题，请帮忙看看安装环境是否有存在问题的地方，感谢！

opened by FitzShen666 0

Super-Fast-Adversarial-Training - A PyTorch Implementation code for developing super fast adversarial training

Super-Fast-Adversarial-Training This is a PyTorch Implementation code for develo

26 Dec 2, 2022

High performance, easy-to-use, and scalable machine learning (ML) package, including linear model (LR), factorization machines (FM), and field-aware factorization machines (FFM) for Python and CLI interface.

What is xLearn? xLearn is a high performance, easy-to-use, and scalable machine learning package that contains linear model (LR), factorization machin

3k Jan 3, 2023

2.8k Feb 12, 2021

Ultra-Data-Efficient GAN Training: Drawing A Lottery Ticket First, Then Training It Toughly

Ultra-Data-Efficient GAN Training: Drawing A Lottery Ticket First, Then Training It Toughly Code for this paper Ultra-Data-Efficient GAN Tra

77 Oct 5, 2022

Learning recognition/segmentation models without end-to-end training. 40%-60% less GPU memory footprint. Same training time. Better performance.

InfoPro-Pytorch The Information Propagation algorithm for training deep networks with local supervision. (ICLR 2021) Revisiting Locally Supervised Lea

78 Dec 27, 2022

ActNN: Reducing Training Memory Footprint via 2-Bit Activation Compressed Training

ActNN : Activation Compressed Training This is the official project repository for ActNN: Reducing Training Memory Footprint via 2-Bit Activation Comp

178 Jan 5, 2023

This is the code for our KILT leaderboard submission to the T-REx and zsRE tasks. It includes code for training a DPR model then continuing training with RAG.

KGI (Knowledge Graph Induction) for slot filling This is the code for our KILT leaderboard submission to the T-REx and zsRE tasks. It includes code fo

72 Jan 6, 2023

BERT model training impelmentation using 1024 A100 GPUs for MLPerf Training v1.1

Pre-trained checkpoint and bert config json file Location of checkpoint and bert config json file This MLCommons members Google Drive location contain

SAIT (Samsung Advanced Institute of Technology)

12 Apr 27, 2022

FuseDream: Training-Free Text-to-Image Generationwith Improved CLIP+GAN Space OptimizationFuseDream: Training-Free Text-to-Image Generationwith Improved CLIP+GAN Space Optimization

FuseDream This repo contains code for our paper (paper link): FuseDream: Training-Free Text-to-Image Generation with Improved CLIP+GAN Space Optimizat

191 Dec 31, 2022

quantize aware training package for NCNN on pytorch

Related tags

Overview

ncnnqat

Table of Contents

Installation

Usage

Code Examples

Results

Todo

You might also like...

Super-Fast-Adversarial-Training - A PyTorch Implementation code for developing super fast adversarial training

High performance, easy-to-use, and scalable machine learning (ML) package, including linear model (LR), factorization machines (FM), and field-aware factorization machines (FFM) for Python and CLI interface.

High performance, easy-to-use, and scalable machine learning (ML) package, including linear model (LR), factorization machines (FM), and field-aware factorization machines (FFM) for Python and CLI interface.

Ultra-Data-Efficient GAN Training: Drawing A Lottery Ticket First, Then Training It Toughly

Learning recognition/segmentation models without end-to-end training. 40%-60% less GPU memory footprint. Same training time. Better performance.

ActNN: Reducing Training Memory Footprint via 2-Bit Activation Compressed Training

This is the code for our KILT leaderboard submission to the T-REx and zsRE tasks. It includes code for training a DPR model then continuing training with RAG.

BERT model training impelmentation using 1024 A100 GPUs for MLPerf Training v1.1

FuseDream: Training-Free Text-to-Image Generationwith Improved CLIP+GAN Space OptimizationFuseDream: Training-Free Text-to-Image Generationwith Improved CLIP+GAN Space Optimization

Comments

ncnnqat的环境问题

Owner

thundernet ncnn

A high-performance anchor-free YOLO. Exceeding yolov3~v5 with ONNX, TensorRT, NCNN, and Openvino supported.

YOLOX is a high-performance anchor-free YOLO, exceeding yolov3~v5 with ONNX, TensorRT, ncnn, and OpenVINO supported.

Ultra-lightweight human body posture key point CNN model. ModelSize:2.3MB HUAWEI P40 NCNN benchmark: 6ms/img,

MQBench Quantization Aware Training with PyTorch

Based on the paper "Geometry-aware Instance-reweighted Adversarial Training" ICLR 2021 oral

Physics-Aware Training (PAT) is a method to train real physical systems with backpropagation.

Degree-Quant: Quantization-Aware Training for Graph Neural Networks.

TAP: Text-Aware Pre-training for Text-VQA and Text-Caption, CVPR 2021 (Oral)