PPLNN (short for "PPLNN is a Primitive Library for Neural Network") is a high-performance deep-learning inference engine for efficient AI inferencing.

Last update: Jan 7, 2023

Related tags

Deep Learning ppl.nn

PPLNN

Overview

PPLNN, which is short for "PPLNN is a Primitive Library for Neural Network", is a high-performance deep-learning inference engine for efficient AI inferencing. It can run various ONNX models and has better support for OpenMMLab.

Documents

Supported Ops and Platforms
Building from Source
Generating ONNX models from OpenMMLab
Getting Started in C++
C++ API References
Develop Guide
- Add New Engines and Ops
- X86
  - Add Ops (Chinese version)
  - Benchmark (Chinese version)
- CUDA
  - Add Ops (Chinese version)
  - Benchmark (Chinese version)

Contact Us

Contributions

This project uses Contributor Covenant as code of conduct. Any contributions would be highly appreciated.

Acknowledgements

License

This project is distributed under the Apache License, Version 2.0.

Comments

cuda convolution kernel input question.

Hi,

I see that the currently implemented CUDA conv kernels are either fp16 or int8, and that their data layout is NHWC, as required by NVIDIA's Tensor Cores. So when running ./tools/pplnn.py, where is the layout transpose done? On the CPU side? In the nvprof result I only see the conv kernel.

If I want to do the transpose on the GPU side, how should I change the command? Or do I need to add an additional transpose node to the ONNX file?
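To make the layout question concrete, here is a minimal pure-Python sketch of what an NCHW to NHWC conversion does to a flat buffer. It only illustrates the index mapping; it is not how PPLNN performs the conversion, and the function name `nchw_to_nhwc` is made up for this example.

```python
def nchw_to_nhwc(data, n, c, h, w):
    """Reorder a flat NCHW buffer into a flat NHWC buffer (illustrative only)."""
    if len(data) != n * c * h * w:
        raise ValueError("buffer size does not match the given dims")
    out = [0] * len(data)
    for ni in range(n):
        for ci in range(c):
            for hi in range(h):
                for wi in range(w):
                    # Offset of element (ni, ci, hi, wi) in NCHW order.
                    src = ((ni * c + ci) * h + hi) * w + wi
                    # Offset of the same element in NHWC order.
                    dst = ((ni * h + hi) * w + wi) * c + ci
                    out[dst] = data[src]
    return out


# N=1, C=2, H=2, W=2: channel 0 holds 0..3, channel 1 holds 4..7.
flat = list(range(8))
nhwc = nchw_to_nhwc(flat, 1, 2, 2, 2)
```

In NHWC the channel values of each pixel sit next to each other, so the first two output elements are the two channel values of pixel (0, 0).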

opened by leiwen83 12
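[gemm_fp32_fma performance] For common shapes, gemm_fp32_fma performs almost the same as TensorFlow 1.15 Eigen matmul?

Question: while testing SGEMM, I found that OpenPPL's gemm_fp32_fma performs almost the same as TensorFlow 1.15's Eigen matmul for some common shapes. Is this expected? Is there any way to improve it? Relevant parameters: OpenPPL version v0.8, a 32-core Intel machine, multithreading used for both, build command: ./build.sh -DPPLNN_USE_X86_64=ON -DPPLNN_ENABLE_ONNX_MODEL=OFF -DPPL_USE_X86_OMP=ON -DPPLNN_USE_OPENMP=ON. The test data follows: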
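When two SGEMM implementations are compared like this, it helps to convert timings into GFLOP/s so that different shapes are comparable. The sketch below is an assumption about how one might do that, not the reporter's benchmark; the helper names are hypothetical, and `best_time` would be given a callable that runs whichever matmul is being measured.

```python
import time


def gemm_flops(m, n, k):
    """Floating-point operations for C[m,n] = A[m,k] @ B[k,n] (one multiply and one add per term)."""
    return 2 * m * n * k


def gflops(m, n, k, seconds):
    """Throughput in GFLOP/s for one GEMM that took `seconds`."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return gemm_flops(m, n, k) / seconds / 1e9


def best_time(fn, warmup=2, repeats=10):
    """Run fn several times and return the fastest wall-clock time in seconds."""
    for _ in range(warmup):
        fn()
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best
```

Taking the best of several runs after a warm-up reduces noise from thread start-up and cache effects, which matters when the two libraries are close.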

opened by huangmiumang 9
[CUDA] `RuntimeBuilder.Preprocess()` causes subsequent CUDA function calls to fail
What are the problems?(screenshots or detailed error messages)

For some models (e.g. YOLOX-s and DBNet-r18; others such as ResNet-18 are fine), subsequent CUDA function calls (or kernel launches) may fail after a runtime is created with RuntimeBuilder.

I first got the CUDA invalid argument error when testing ppl.nn with mmdeploy's test.py, at a point after runtime creation and before inference, when copying data from host to device. Later I hit the same problem when testing with mmdeploy's SDK.

After digging around for a while, I found the simplest way to reproduce the problem using pplnn.py:

insert the following code

import torch
t = torch.Tensor([[1,1],[1,1]]).cuda()

to https://github.com/openppl-public/ppl.nn/blob/1ae5d95f3ee49b3e582564cc004443931fbe2f7a/tools/pplnn.py#L564 and then

python pplnn.py --use-cuda --onnx-model model.onnx --in-shape 1_3_640_640 --quick-select

got

INFO: PPLNN version: [0.8.0], commit: [02418bb57bef2d888b57d44589a599080cb806d9]
[INFO][2022-07-06 22:23:06.057][utils.cc:456] total partition(s) of graph[torch-jit-export]: 1.
[INFO][2022-07-06 22:23:06.067][opt_graph.cc:324] added 1020 new bridge kernels
[INFO][2022-07-06 22:23:06.223][opt_graph.cc:581] deleted 990 bridge kernels
Traceback (most recent call last):
  File "pplnn.py", line 567, in <module>
    t = torch.Tensor([[1,1],[1,1]]).cuda()
RuntimeError: CUDA error: invalid argument
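The `--in-shape 1_3_640_640` value in the command above writes the input dims joined by underscores. As a small sketch, assuming a single input whose dims are plain positive integers, such a string can be turned into a list of dims and an element count like this. The helper names are hypothetical and this is not the parser used by pplnn.py.

```python
def parse_in_shape(text):
    """Split an underscore-separated shape string such as '1_3_640_640' into ints."""
    dims = [int(part) for part in text.split("_")]
    if any(d <= 0 for d in dims):
        raise ValueError("dims must be positive")
    return dims


def element_count(dims):
    """Number of elements in a tensor with the given dims."""
    count = 1
    for d in dims:
        count *= d
    return count


dims = parse_in_shape("1_3_640_640")
total = element_count(dims)
```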

Which version(commit id or tag) of ppl.nn is used?

02418bb57bef2d888b57d44589a599080cb806d9

What's the operating system ppl.nn runs on?

Ubuntu 18.04

What's the compiler and its version?

GCC-7.5, CUDA-11.1

What are the commands used to build ppl.nn?

cmake .. \
    -DCMAKE_INSTALL_PREFIX=/workspace/ppl.nn/install \
    -DPPLNN_ENABLE_PYTHON_API=ON \
    -DPPLNN_USE_X86_64=ON \
    -DPPLNN_USE_CUDA=ON \
    -DPPL_USE_X86_AVX512=OFF \
    -DPPLNN_ENABLE_CUDA_JIT=OFF \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_CUDA_ARCHITECTURES=75

opened by lzhangzz 9

Mask R-CNN failed with pplnn

The model was converted with the mmdetection library. When I try to run it with pplnn, it reports these errors:

[INFO][2021-07-14 17:18:19.999][pplnn.cc:703] ppl.nn version: 5d56662bf5a288898f0dd5b90f763459cc86f47a
[WARNING][2021-07-14 17:18:21.873][engine.cc:209] Default input dims for dynamic graph are 1_3_224_224, we recommend using '--dims' to set a suitable training shape.
[INFO][2021-07-14 17:18:21.873][pplnn.cc:104] ***** register CudaEngine *****
[INFO][2021-07-14 17:18:22.320][simple_graph_partitioner.cc:107] total partition(s) of graph[torch-jit-export]: 1.
[ERROR][2021-07-14 17:18:22.322][reshape_reshape.cc:66] infer shape failed.
[ERROR][2021-07-14 17:18:22.338][reshape_add.cc:39] unbroadcastable input.
[ERROR][2021-07-14 17:18:22.339][reshape_add.cc:39] unbroadcastable input.
[ERROR][2021-07-14 17:18:22.340][reshape_add.cc:39] unbroadcastable input.
[ERROR][2021-07-14 17:18:22.341][reshape_concat.cc:42] input shape not match.
[ERROR][2021-07-14 17:18:22.341][reshape_add.cc:39] unbroadcastable input.
[ERROR][2021-07-14 17:18:22.342][reshape_add.cc:39] unbroadcastable input.
[ERROR][2021-07-14 17:18:22.342][reshape_add.cc:39] unbroadcastable input.
[ERROR][2021-07-14 17:18:22.342][reshape_add.cc:39] unbroadcastable input.
[ERROR][2021-07-14 17:18:22.342][reshape_add.cc:39] unbroadcastable input.
[ERROR][2021-07-14 17:18:22.343][reshape_add.cc:39] unbroadcastable input.
[ERROR][2021-07-14 17:18:22.343][reshape_unsqueeze.cc:36] axes overflow.
[ERROR][2021-07-14 17:18:22.343][reshape_concat.cc:42] input shape not match.
[ERROR][2021-07-14 17:18:22.344][reshape_split.cc:59] splited axis and sum of split point not match.
[ERROR][2021-07-14 17:18:22.344][reshape_split.cc:59] splited axis and sum of split point not match.
[ERROR][2021-07-14 17:18:22.345][reshape_split.cc:59] splited axis and sum of split point not match.
[ERROR][2021-07-14 17:18:22.345][reshape_split.cc:59] splited axis and sum of split point not match.
[ERROR][2021-07-14 17:18:22.345][reshape_split.cc:59] splited axis and sum of split point not match.
[INFO][2021-07-14 17:18:22.346][simple_graph_partitioner.cc:107] total partition(s) of graph[torch-jit-export1]: 1.
[INFO][2021-07-14 17:18:22.346][opt_graph.cc:204] Create 2 TensorImpl
[INFO][2021-07-14 17:18:22.346][opt_graph.cc:316] added 2 new bridge kernels
[INFO][2021-07-14 17:18:22.346][opt_graph.cc:478] deleted 1 bridge kernels
[INFO][2021-07-14 17:18:22.347][simple_graph_partitioner.cc:107] total partition(s) of graph[torch-jit-export2]: 1.
[INFO][2021-07-14 17:18:22.347][opt_graph.cc:204] Create 20 TensorImpl
[INFO][2021-07-14 17:18:22.347][opt_graph.cc:316] added 21 new bridge kernels
[INFO][2021-07-14 17:18:22.347][opt_graph.cc:478] deleted 14 bridge kernels
[ERROR][2021-07-14 17:18:22.348][reshape_split.cc:59] splited axis and sum of split point not match.
[ERROR][2021-07-14 17:18:22.348][reshape_concat.cc:42] input shape not match.
[ERROR][2021-07-14 17:18:22.348][reshape_split.cc:59] splited axis and sum of split point not match.
[ERROR][2021-07-14 17:18:22.349][reshape_split.cc:59] splited axis and sum of split point not match.
[ERROR][2021-07-14 17:18:22.349][reshape_split.cc:59] splited axis and sum of split point not match.
[ERROR][2021-07-14 17:18:22.349][reshape_split.cc:59] splited axis and sum of split point not match.
[ERROR][2021-07-14 17:18:22.350][reshape_split.cc:59] splited axis and sum of split point not match.
[ERROR][2021-07-14 17:18:22.350][reshape_split.cc:59] splited axis and sum of split point not match.
[ERROR][2021-07-14 17:18:22.350][reshape_split.cc:59] splited axis and sum of split point not match.
[ERROR][2021-07-14 17:18:22.389][reshape_add.cc:39] unbroadcastable input.
[ERROR][2021-07-14 17:18:22.389][reshape_add.cc:39] unbroadcastable input.
[ERROR][2021-07-14 17:18:22.390][reshape_add.cc:39] unbroadcastable input.
[ERROR][2021-07-14 17:18:22.390][reshape_add.cc:39] unbroadcastable input.
[ERROR][2021-07-14 17:18:22.390][reshape_add.cc:39] unbroadcastable input.
[ERROR][2021-07-14 17:18:22.391][reshape_add.cc:39] unbroadcastable input.
[ERROR][2021-07-14 17:18:22.391][reshape_unsqueeze.cc:36] axes overflow.
[ERROR][2021-07-14 17:18:22.391][reshape_unsqueeze.cc:36] axes overflow.
[INFO][2021-07-14 17:18:22.392][simple_graph_partitioner.cc:107] total partition(s) of graph[torch-jit-export3]: 1.
[INFO][2021-07-14 17:18:22.392][opt_graph.cc:204] Create 2 TensorImpl
[INFO][2021-07-14 17:18:22.392][opt_graph.cc:316] added 2 new bridge kernels
[INFO][2021-07-14 17:18:22.392][opt_graph.cc:478] deleted 1 bridge kernels
[INFO][2021-07-14 17:18:22.392][simple_graph_partitioner.cc:107] total partition(s) of graph[torch-jit-export4]: 1.
[INFO][2021-07-14 17:18:22.393][opt_graph.cc:204] Create 20 TensorImpl
[INFO][2021-07-14 17:18:22.393][opt_graph.cc:316] added 21 new bridge kernels
[INFO][2021-07-14 17:18:22.408][opt_graph.cc:478] deleted 14 bridge kernels
[ERROR][2021-07-14 17:18:22.408][reshape_split.cc:59] splited axis and sum of split point not match.
[ERROR][2021-07-14 17:18:22.409][reshape_concat.cc:42] input shape not match.
[ERROR][2021-07-14 17:18:22.409][reshape_split.cc:59] splited axis and sum of split point not match.
[ERROR][2021-07-14 17:18:22.409][reshape_split.cc:59] splited axis and sum of split point not match.
[ERROR][2021-07-14 17:18:22.410][reshape_split.cc:59] splited axis and sum of split point not match.
[ERROR][2021-07-14 17:18:22.410][reshape_split.cc:59] splited axis and sum of split point not match.
[ERROR][2021-07-14 17:18:22.410][reshape_split.cc:59] splited axis and sum of split point not match.
[ERROR][2021-07-14 17:18:22.411][reshape_split.cc:59] splited axis and sum of split point not match.
[ERROR][2021-07-14 17:18:22.411][reshape_split.cc:59] splited axis and sum of split point not match.
[ERROR][2021-07-14 17:18:22.413][reshape_split.cc:59] splited axis and sum of split point not match.
[INFO][2021-07-14 17:18:22.426][simple_graph_partitioner.cc:107] total partition(s) of graph[torch-jit-export5]: 1.
[ERROR][2021-07-14 17:18:22.426][reshape_concat.cc:42] input shape not match.
[ERROR][2021-07-14 17:18:22.427][reshape_concat.cc:42] input shape not match.
[ERROR][2021-07-14 17:18:22.427][reshape_concat.cc:42] input shape not match.
[ERROR][2021-07-14 17:18:22.427][reshape_concat.cc:42] input shape not match.
[ERROR][2021-07-14 17:18:22.427][reshape_concat.cc:42] input shape not match.
[ERROR][2021-07-14 17:18:22.428][reshape_concat.cc:42] input shape not match.
[ERROR][2021-07-14 17:18:22.428][reshape_concat.cc:42] input shape not match.
[ERROR][2021-07-14 17:18:22.428][reshape_concat.cc:42] input shape not match.
[ERROR][2021-07-14 17:18:22.429][reshape_concat.cc:42] input shape not match.
[INFO][2021-07-14 17:18:22.429][opt_graph.cc:204] Create 135 TensorImpl
[INFO][2021-07-14 17:18:22.430][opt_graph.cc:316] added 174 new bridge kernels
[INFO][2021-07-14 17:18:22.433][opt_graph.cc:478] deleted 153 bridge kernels
[INFO][2021-07-14 17:18:22.434][opt_graph.cc:204] Create 2263 TensorImpl
[INFO][2021-07-14 17:18:22.660][opt_graph.cc:316] added 2626 new bridge kernels
[INFO][2021-07-14 17:20:05.963][opt_graph.cc:478] deleted 2547 bridge kernels
[ERROR][2021-07-14 17:20:06.007][scheduler_common.cc:170] exec kernel[Pad_146] failed: invalid value
[ERROR][2021-07-14 17:20:06.007][sequential_scheduler.cc:116] execute kernel[Pad_146] failed: invalid value
[ERROR][2021-07-14 17:20:06.007][pplnn.cc:804] Run() failed: invalid value

I'm running it with real image data. Does pplnn support Mask R-CNN, or what should I do to run it successfully? Thanks a lot! The model was generated by this command:

python ../tools/deployment/pytorch2onnx.py ../configs/mask_rcnn/mask_rcnn_r50_fpn_mstrain-poly_3x_coco.py \
mask_rcnn_r50_fpn_mstrain-poly_3x_coco_20210524_201154-21b550bb.pth \
--output-file mask_rcnn.onnx --simplify --dynamic-export

opened by Maosquerade 9

tools/pplnn.py --use-cuda output error

What are the problems?(screenshots or detailed error messages)

I used ./tools/pplnn.py --use-cuda --onnx-model tests/testdata/conv.onnx to test the Python API and the CUDA engine, and added prints of the input and output data values at https://github.com/openppl-public/ppl.nn/blob/master/tools/pplnn.py#L499 and https://github.com/openppl-public/ppl.nn/blob/master/tools/pplnn.py#L511. The input tensor and the output tensor seem to have the same values, which differs from the x86 engine's output.

INFO: PPLNN version: [0.6.3], commit: [9444a9d2ee0b89d8cd4a2fee8cef839fedfe8837]
[INFO][2022-04-19 18:43:40.768][engine_graph_partitioner.cc:103] total partition(s) of graph[torch-jit-export]: 1.
[INFO][2022-04-19 18:43:40.768][opt_graph.cc:329] added 4 new bridge kernels
[INFO][2022-04-19 18:43:40.770][algo_conv_hmma.cc:129] Compiling Conv_0
[INFO][2022-04-19 18:43:41.454][opt_graph.cc:583] deleted 2 bridge kernels
INFO: ----- input info -----
INFO: input[0]
INFO:     name: input
INFO:     dim(s): [1, 3, 4, 4]
INFO:     type: FLOAT32
INFO:     format: NDARRAY
INFO:     byte(s) excluding padding: 192
INFO:     in_data: [[[[-0.7580919  -1.0537796  -1.4523766  -1.1736736 ]
   [-0.50453496 -1.48383    -1.3174736  -0.8811438 ]
   [-1.5446684  -0.33240414 -1.429975   -1.172169  ]
   [-1.2639251  -0.00716734 -0.26453447 -1.4403057 ]]

  [[-1.6206262  -1.3826382  -0.74133873 -0.9391637 ]
   [-0.42861128 -0.09090185 -1.2538221  -0.02137303]
   [-0.074507   -0.29974604 -0.45086026 -1.9801757 ]
   [-0.07279325 -0.67775655 -1.4832225  -1.862076  ]]

  [[-1.0764339  -0.25367737 -1.8603811  -1.5876365 ]
   [-1.8216178  -0.6460962  -0.5559113  -0.9660294 ]
   [-1.837322   -1.0467303  -0.04060197 -0.5114651 ]
   [-0.21527338 -0.26388478 -1.6131785  -1.4633346 ]]]]
INFO: ----- output info -----
INFO: output[0]
INFO:     name: 5
INFO:     dim(s): [1, 3, 5, 5]
INFO:     type: FLOAT32
INFO:     format: NDARRAY
INFO:     byte(s) excluding padding: 300
INFO:     out_data: [[[[-0.7580919  -1.0537796  -1.4523766  -1.1736736  -0.50453496]
   [-1.48383    -1.3174736  -0.8811438  -1.5446684  -0.33240414]
   [-1.429975   -1.172169   -1.2639251  -0.00716734 -0.26453447]
   [-1.4403057  -1.6206262  -1.3826382  -0.74133873 -0.9391637 ]
   [-0.42861128 -0.09090185 -1.2538221  -0.02137303 -0.074507  ]]

  [[-0.29974604 -0.45086026 -1.9801757  -0.07279325 -0.67775655]
   [-1.4832225  -1.862076   -1.0764339  -0.25367737 -1.8603811 ]
   [-1.5876365  -1.8216178  -0.6460962  -0.5559113  -0.9660294 ]
   [-1.837322   -1.0467303  -0.04060197 -0.5114651  -0.21527338]
   [-0.26388478 -1.6131785  -1.4633346   0.          0.        ]]

  [[ 0.          0.          0.          0.          0.        ]
   [ 0.          0.          0.          0.          0.        ]
   [ 0.          0.          0.          0.          0.        ]
   [ 0.          0.          0.          0.          0.        ]
   [ 0.          0.          0.          0.          0.        ]]]]
INFO: Run ok
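In the dump above, the 48 input values reappear in order as the first 48 of the 75 output values, and the remaining 27 are zero. A quick check of this kind can flag such "input copied through" results automatically. This is a sketch on flat Python lists with hypothetical helper names; it is not part of pplnn.py.

```python
def looks_like_passthrough(in_flat, out_flat):
    """True if the output starts with an exact copy of the input and the rest is zero."""
    n = len(in_flat)
    if len(out_flat) < n:
        return False
    return out_flat[:n] == in_flat and all(v == 0.0 for v in out_flat[n:])


def float32_bytes(dims):
    """Unpadded byte size of a FLOAT32 tensor with the given dims (4 bytes per element)."""
    count = 1
    for d in dims:
        count *= d
    return count * 4


# The sizes match the "byte(s) excluding padding" lines in the log above.
in_bytes = float32_bytes([1, 3, 4, 4])
out_bytes = float32_bytes([1, 3, 5, 5])
```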

Which version(commit id or tag) of ppl.nn is used?

PPLNN version: [0.6.3], commit: [9444a9d2ee0b89d8cd4a2fee8cef839fedfe8837]

What's the operating system ppl.nn runs on?

ubuntu18.04

What's the compiler and its version?

g++ (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0

What are the commands used to build ppl.nn?

./build.sh -DHPCC_USE_X86_64=ON -DPPLNN_ENABLE_PYTHON_API=ON -DHPCC_USE_CUDA=ON

What are the execution commands?

PYTHONPATH=./pplnn-build/install/lib python3 ./tools/pplnn.py --use-cuda --onnx-model tests/testdata/conv.onnx

minimal code snippets for reproducing these problems(if necessary)

models and inputs for reproducing these problems (send them to [email protected] if necessary)

opened by sky-fun 8

CUDA inference error

What are the problems?(snapshots or detailed error messages)

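I changed the C++ classification sample project to use CUDA inference (x86 compiles and runs fine, and the CUDA and x86 benchmarks both run). The following is printed at build time: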

$ bear make -j
Consolidate compiler generated dependencies of target classification
[ 50%] Building CXX object CMakeFiles/classification.dir/classification.cpp.o
[100%] Linking CXX executable classification
/home/ubuntu/Documents/ppl.nn/pplnn-build/install/lib/cmake/ppl/../../../lib/libpplnn_static.a(engine_factory.cc.o): In function ‘ppl::nn::CudaEngineFactory::Create(ppl::nn::CudaEngineOptions const&)’:
/home/ubuntu/Documents/ppl.nn/src/ppl/nn/engines/cuda/module/cuda_module.h:42: undefined reference to ‘cuModuleUnload’
/home/ubuntu/Documents/ppl.nn/pplnn-build/install/lib/cmake/ppl/../../../lib/libpplnn_static.a(engine.cc.o): In function ‘ppl::nn::cuda::CudaEngine::~CudaEngine()’:
/home/ubuntu/Documents/ppl.nn/src/ppl/nn/engines/cuda/module/cuda_module.h:42: undefined reference to ‘cuModuleUnload’
/home/ubuntu/Documents/ppl.nn/pplnn-build/install/lib/cmake/ppl/../../../lib/libpplnn_static.a(engine.cc.o): In function ‘ppl::nn::cuda::CudaEngine::~CudaEngine()’:
/home/ubuntu/Documents/ppl.nn/src/ppl/nn/engines/cuda/module/cuda_module.h:42: undefined reference to ‘cuModuleUnload’
/home/ubuntu/Documents/ppl.nn/pplnn-build/install/lib/cmake/ppl/../../../lib/libpplnn_static.a(cuda_compiler.cc.o): In function ‘ppl::nn::cuda::CUDANVRTCCompile(std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::vector<char const*, std::allocator<char const*> >, int, bool)’:
/home/ubuntu/Documents/ppl.nn/src/ppl/nn/engines/cuda/module/cuda_compiler.cc:44: undefined reference to ‘nvrtcCreateProgram’
/home/ubuntu/Documents/ppl.nn/src/ppl/nn/engines/cuda/module/cuda_compiler.cc:45: undefined reference to ‘nvrtcCompileProgram’
/home/ubuntu/Documents/ppl.nn/src/ppl/nn/engines/cuda/module/cuda_compiler.cc:48: undefined reference to ‘nvrtcGetProgramLogSize’
/home/ubuntu/Documents/ppl.nn/src/ppl/nn/engines/cuda/module/cuda_compiler.cc:51: undefined reference to ‘nvrtcGetProgramLog’
/home/ubuntu/Documents/ppl.nn/src/ppl/nn/engines/cuda/module/cuda_compiler.cc:56: undefined reference to ‘nvrtcGetPTXSize’
/home/ubuntu/Documents/ppl.nn/src/ppl/nn/engines/cuda/module/cuda_compiler.cc:59: undefined reference to ‘nvrtcGetPTX’
/home/ubuntu/Documents/ppl.nn/src/ppl/nn/engines/cuda/module/cuda_compiler.cc:60: undefined reference to ‘nvrtcDestroyProgram’
/home/ubuntu/Documents/ppl.nn/src/ppl/nn/engines/cuda/module/cuda_compiler.cc:61: undefined reference to ‘cudaDeviceSynchronize’
/home/ubuntu/Documents/ppl.nn/src/ppl/nn/engines/cuda/module/cuda_compiler.cc:60: undefined reference to ‘nvrtcGetErrorString’
/home/ubuntu/Documents/ppl.nn/src/ppl/nn/engines/cuda/module/cuda_compiler.cc:44: undefined reference to ‘nvrtcGetErrorString’
/home/ubuntu/Documents/ppl.nn/src/ppl/nn/engines/cuda/module/cuda_compiler.cc:59: undefined reference to ‘nvrtcGetErrorString’
/home/ubuntu/Documents/ppl.nn/src/ppl/nn/engines/cuda/module/cuda_compiler.cc:56: undefined reference to ‘nvrtcGetErrorString’
/home/ubuntu/Documents/ppl.nn/pplnn-build/install/lib/cmake/ppl/../../../lib/libpplnn_static.a(cuda_module.cc.o): In function ‘ppl::nn::cuda::CUDAModule::GetKernelFunc()’:
/home/ubuntu/Documents/ppl.nn/src/ppl/nn/engines/cuda/module/cuda_module.cc:25: undefined reference to ‘cuModuleLoadDataEx’
/home/ubuntu/Documents/ppl.nn/src/ppl/nn/engines/cuda/module/cuda_module.cc:25: undefined reference to ‘cuGetErrorName’
/home/ubuntu/Documents/ppl.nn/src/ppl/nn/engines/cuda/module/cuda_module.cc:28: undefined reference to ‘cuModuleGetFunction’
/home/ubuntu/Documents/ppl.nn/src/ppl/nn/engines/cuda/module/cuda_module.cc:28: undefined reference to ‘cuGetErrorName’
/home/ubuntu/Documents/ppl.nn/pplnn-build/install/lib/cmake/ppl/../../../lib/libpplnn_static.a(cuda_module.cc.o): In function ‘ppl::nn::cuda::CUDAModule::GetKernelFunc(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)’:
...
...
...

Which version(commit id or tag) of ppl.nn is used?

ppl.nn version: 0a545145b6b1816fd190c6023a588328872fe80f

What's the operating system ppl.nn runs on?

Linux ubuntu-1660ti 5.4.0-100-generic #113~18.04.1-Ubuntu SMP Mon Feb 7 15:02:59 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

What's the compiler and its version?

I tried two versions of gcc; neither works.

gcc (Ubuntu 9.4.0-1ubuntu1~18.04) 9.4.0
gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0

What are the commands used to build ppl.nn?

./build.sh -DPPLNN_ENABLE_PYTHON_API=ON -DHPCC_USE_X86_64=ON -DHPCC_USE_CUDA=ON

What are the execution commands?

bear make -j

minimal code snippets for reproducing these problems(if necessary)

#include "ppl/nn/engines/cuda/cuda_engine_options.h"
#include "ppl/nn/engines/cuda/engine_factory.h"
...
/************************ 2. create runtime builder from onnx model *************************/
    CudaEngineOptions options;
    options.device_id = 0;
    options.mm_policy = CUDA_MM_BEST_FIT;

    auto cuda_engine = CudaEngineFactory::Create(options);
    if (!cuda_engine)
    {
        return false;
    }
    cuda_engine->Configure(ppl::nn::CUDA_CONF_USE_DEFAULT_ALGORITHMS, false);
    vector<unique_ptr<Engine>> engines;
    vector<Engine *> engine_ptrs;
    engines.emplace_back(unique_ptr<Engine>(cuda_engine));
    engine_ptrs.emplace_back(engines[0].get());
...

models and inputs for reproducing these problems (send them to [email protected] if necessary)

opened by watersounds 8

centernet runs with memory error.

My GPU is a Tesla T4, and the sample model runs normally. When I run CenterNet with --mm-policy=mem, it reports an error like this, but it still produces an output. When I use --mm-policy=perf, it reports an out-of-memory error like this: Both seem to end with a memory error. Is this error familiar to your team, and how can I avoid it?

opened by Maosquerade 8

pplnn run mobilenet v2 model failed. (use cuda)

What are the problems?(screenshots or detailed error messages)

pplnn run mobilenet v2 model failed(use cuda). mobilenet v2 model is exported from torchvision.

ppl.nn version: [0.9.0], commit: [2da19ac438d4f726b8744d650a1751d310fc0710-dirty]
[INFO][2022-12-04 17:42:46.453][pplnn.cc:308] ***** register CudaEngine *****
[INFO][2022-12-04 17:42:46.474][utils.cc:369] total partition(s) of graph[torch_jit]: 1.
[INFO][2022-12-04 17:42:46.478][opt_graph.cc:312] added 242 new bridge kernels
[INFO][2022-12-04 17:42:46.509][algo_conv_hmma.cc:141] Compiling /features/features.0/features.0.0/Conv
[INFO][2022-12-04 17:42:51.219][algo_conv_hmma.cc:146] select kernel nvIdxnSm75Fp16Conv_hmma1688_nhwc_b64x32_w32x16_k32_s16
[INFO][2022-12-04 17:42:51.239][algo_conv_hmma.cc:141] Compiling /features/features.1/conv/conv.1/Conv
[INFO][2022-12-04 17:42:55.559][algo_conv_hmma.cc:146] select kernel nvIdxnSm75Fp16Conv_hmma1688_nhwc_b64x16_w32x8_k64_s32
[INFO][2022-12-04 17:42:55.650][algo_conv_hmma.cc:141] Compiling /features/features.2/conv/conv.0/conv.0.0/Conv
[INFO][2022-12-04 17:42:58.170][algo_conv_hmma.cc:146] select kernel nvIdxnSm75Fp16Conv_hmma1688_nhwc_b128x32_w32x16_k32_s32
[INFO][2022-12-04 17:42:58.184][algo_conv_hmma.cc:141] Compiling /features/features.2/conv/conv.2/Conv
[INFO][2022-12-04 17:43:00.891][algo_conv_hmma.cc:146] select kernel nv2spkSm75Fp16Conv_hmma1688_nhwc_f1_b32x16_w16x16_k64_s32_buf2
[INFO][2022-12-04 17:43:00.921][algo_conv_hmma.cc:141] Compiling /features/features.3/conv/conv.0/conv.0.0/Conv
[INFO][2022-12-04 17:43:06.278][algo_conv_hmma.cc:146] select kernel nvIdxnSm75Fp16Conv_hmma1688_nhwc_b64x32_w32x8_k32_s32
[INFO][2022-12-04 17:43:06.289][algo_conv_hmma.cc:141] Compiling /features/features.3/conv/conv.2/Conv
[INFO][2022-12-04 17:43:06.524][algo_conv_hmma.cc:146] select kernel nv2spkSm75Fp16Conv_hmma1688_nhwc_f1_b64x8_w64x8_k128_s32_buf1
[INFO][2022-12-04 17:43:06.557][algo_conv_hmma.cc:141] Compiling /features/features.4/conv/conv.0/conv.0.0/Conv
[INFO][2022-12-04 17:43:12.012][algo_conv_hmma.cc:146] select kernel nvIdxnSm75Fp16Conv_hmma1688_nhwc_b128x32_w64x8_k32_s32
[INFO][2022-12-04 17:43:12.017][algo_conv_hmma.cc:141] Compiling /features/features.4/conv/conv.2/Conv
Segmentation fault (core dumped)

What are the types of GPU/CPU you are using?

RTX 2080 Ti

What's the operating system ppl.nn runs on?

Ubuntu 18.04

What's the compiler and its version?

g++ 7.5.0 nvcc V10.2.89

Which version(commit id or tag) of ppl.nn is used?

2da19ac438d4f726b8744d650a1751d310fc0710-dirty

What are the commands used to build ppl.nn?

cmake .. -DCMAKE_BUILD_TYPE=Release -DPPLNN_USE_CUDA=ON -DCMAKE_INSTALL_PREFIX=install
cmake --build . -j 20 --config Release
cmake --build . --target install -j 20 --config Release

What are the execution commands?

./pplnn-build/tools/pplnn --use-cuda --onnx-model=mobilenet_v2.onnx --kernel-type=float16 --export-algo-file=algos/mobilenet_v2_fp16.json

minimal code snippets for reproducing these problems(if necessary)

import torch
import torchvision
model = torchvision.models.mobilenet_v2(torchvision.models.MobileNet_V2_Weights.DEFAULT)
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(
       model,
       dummy_input,
       "mobilenet_v2.onnx",
       input_names=["inp"],
       output_names=["out"],
       opset_version=11
)

./pplnn-build/tools/pplnn --use-cuda --onnx-model=mobilenet_v2.onnx  --kernel-type=float16 --export-algo-file=algos/mobilenet_v2_fp16.json

models and inputs for reproducing these problems (send them to [email protected] if necessary)

opened by shiwenloong 7

Running the Python demo on a Celeron J1900 fails with "get unsupported isa 0"

cpu core J1900
vendor_id : GenuineIntel
cpu family : 6
model : 55
model name : Intel(R) Celeron(R) CPU J1900 @ 1.99GHz
stepping : 9
microcode : 0x90c
cpu MHz : 2042.652
cache size : 1024 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 movbe popcnt tsc_deadline_timer rdrand lahf_lm 3dnowprefetch epb pti ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid tsc_adjust smep erms dtherm ida arat md_clear
bugs : cpu_meltdown spectre_v1 spectre_v2 mds msbds_only
bogomips : 4000.00
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual

gcc 7.5.0, OS: Ubuntu 18.04 LTS, PPLNN version: [0.6.3]

I use the command: PYTHONPATH=./pplnn-build/install/lib python3 ./tools/pplnn.py --use-x86 --onnx-model tests/testdata/conv.onnx

Is this CPU perhaps too old to be supported?
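One way to see which SIMD extensions a CPU reports is to look at the `flags` line of /proc/cpuinfo. The sketch below checks a flags string against a short list of extension names; the list is an assumption chosen for illustration, since the report does not say which instruction sets the x86 engine requires, and the helper name is hypothetical.

```python
# Extension names to look for; an illustrative selection, not PPLNN's actual requirements.
ISA_FLAGS = ("sse", "sse2", "sse4_1", "sse4_2", "avx", "avx2", "fma", "avx512f")


def isa_support(flags_line):
    """Map each extension name in ISA_FLAGS to whether it appears in the flags string."""
    flags = set(flags_line.split())
    return {name: name in flags for name in ISA_FLAGS}


# A subset of the J1900 flags quoted in the report above.
j1900_flags = "fpu vme de pse tsc mmx fxsr sse sse2 ssse3 cx16 sse4_1 sse4_2 movbe popcnt rdrand"
support = isa_support(j1900_flags)
```

For the flags quoted in this report, the SSE entries come out true while `avx`, `avx2`, `fma` and `avx512f` come out false.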

opened by F0xZz 7
The x86 engine runs fine, but the CUDA engine cannot run; it hangs at "Compiling Conv_0" until all 64 GB of memory is exhausted

[DEBUG][2022-02-26 11:23:12.125][fuse_shape_optimizer.cc:257] Output count 1 for fused shape node[Shape_127_Fused]
[DEBUG][2022-02-26 11:23:12.126][fuse_shape_optimizer.cc:257] Output count 1 for fused shape node[Shape_139_Fused]
[DEBUG][2022-02-26 11:23:12.126][fuse_shape_optimizer.cc:257] Output count 1 for fused shape node[Shape_151_Fused]
[DEBUG][2022-02-26 11:23:12.126][fuse_shape_optimizer.cc:257] Output count 1 for fused shape node[Shape_163_Fused]
[DEBUG][2022-02-26 11:23:12.126][fuse_shape_optimizer.cc:257] Output count 1 for fused shape node[Shape_176_Fused]
[DEBUG][2022-02-26 11:23:12.126][fuse_shape_optimizer.cc:257] Output count 1 for fused shape node[Shape_185_Fused]
[INFO][2022-02-26 11:23:12.127][engine_graph_partitioner.cc:103] total partition(s) of graph[torch-jit-export]: 1.
[DEBUG][2022-02-26 11:23:12.153][opt_graph.cc:186] Can not reshape safely for node[Resize_170]
[DEBUG][2022-02-26 11:23:12.154][opt_graph.cc:186] Can not reshape safely for node[Resize_158]
[DEBUG][2022-02-26 11:23:12.155][opt_graph.cc:186] Can not reshape safely for node[Resize_146]
[DEBUG][2022-02-26 11:23:12.156][opt_graph.cc:186] Can not reshape safely for node[Resize_134]
[DEBUG][2022-02-26 11:23:12.156][reshape_concat.cc:43] ERROR: input[1]'s dim[2]'s value[1] != input[0]'s dim[2]'s value[37].
[DEBUG][2022-02-26 11:23:12.156][opt_graph.cc:186] Can not reshape safely for node[Concat_171]
[DEBUG][2022-02-26 11:23:12.172][opt_graph.cc:186] Can not reshape safely for node[Resize_183]
[DEBUG][2022-02-26 11:23:12.172][opt_graph.cc:186] Can not reshape safely for node[Resize_192]
[DEBUG][2022-02-26 11:23:12.173][opt_graph.cc:200] Create 305 TensorImpl
[DEBUG][2022-02-26 11:23:12.173][fs_conv.cc:80] Fuse node[Conv_172] and nextnode[Relu_173]
[DEBUG][2022-02-26 11:23:12.173][fs_conv.cc:80] Fuse node[Conv_124] and nextnode[Relu_125]
[DEBUG][2022-02-26 11:23:12.173][fs_conv.cc:80] Fuse node[Conv_136] and nextnode[Relu_137]
[DEBUG][2022-02-26 11:23:12.173][fs_conv.cc:80] Fuse node[Conv_148] and nextnode[Relu_149]
[DEBUG][2022-02-26 11:23:12.173][fs_conv.cc:80] Fuse node[Conv_160] and nextnode[Relu_161]
[DEBUG][2022-02-26 11:23:12.173][fs_conv.cc:80] Fuse node[Conv_120] and nextnode[Add_121]
[DEBUG][2022-02-26 11:23:12.173][fs_conv.cc:80] Fuse node[Conv_120] and nextnode[Relu_122]
[DEBUG][2022-02-26 11:23:12.174][fs_conv.cc:80] Fuse node[Conv_118] and nextnode[Relu_119]
[DEBUG][2022-02-26 11:23:12.174][fs_conv.cc:80] Fuse node[Conv_116] and nextnode[Relu_117]
[DEBUG][2022-02-26 11:23:12.174][fs_conv.cc:80] Fuse node[Conv_113] and nextnode[Add_114]
[DEBUG][2022-02-26 11:23:12.174][fs_conv.cc:80] Fuse node[Conv_113] and nextnode[Relu_115]
[DEBUG][2022-02-26 11:23:12.174][fs_conv.cc:80] Fuse node[Conv_111] and nextnode[Relu_112]
[DEBUG][2022-02-26 11:23:12.174][fs_conv.cc:80] Fuse node[Conv_109] and nextnode[Relu_110]
[DEBUG][2022-02-26 11:23:12.174][fs_conv.cc:80] Fuse node[Conv_105] and nextnode[Add_107]
[DEBUG][2022-02-26 11:23:12.174][fs_conv.cc:80] Fuse node[Conv_105] and nextnode[Relu_108]
[DEBUG][2022-02-26 11:23:12.175][fs_conv.cc:80] Fuse node[Conv_103] and nextnode[Relu_104]
[DEBUG][2022-02-26 11:23:12.175][fs_conv.cc:80] Fuse node[Conv_101] and nextnode[Relu_102]
[DEBUG][2022-02-26 11:23:12.175][fs_conv.cc:80] Fuse node[Conv_98] and nextnode[Add_99]
[DEBUG][2022-02-26 11:23:12.175][fs_conv.cc:80] Fuse node[Conv_98] and nextnode[Relu_100]
[DEBUG][2022-02-26 11:23:12.175][fs_conv.cc:80] Fuse node[Conv_96] and nextnode[Relu_97]
[DEBUG][2022-02-26 11:23:12.175][fs_conv.cc:80] Fuse node[Conv_94] and nextnode[Relu_95]
[DEBUG][2022-02-26 11:23:12.176][fs_conv.cc:80] Fuse node[Conv_91] and nextnode[Add_92]
[DEBUG][2022-02-26 11:23:12.176][fs_conv.cc:80] Fuse node[Conv_91] and nextnode[Relu_93]
[DEBUG][2022-02-26 11:23:12.176][fs_conv.cc:80] Fuse node[Conv_89] and nextnode[Relu_90]
[DEBUG][2022-02-26 11:23:12.176][fs_conv.cc:80] Fuse node[Conv_87] and nextnode[Relu_88]
[DEBUG][2022-02-26 11:23:12.176][fs_conv.cc:80] Fuse node[Conv_84] and nextnode[Add_85]
[DEBUG][2022-02-26 11:23:12.177][fs_conv.cc:80] Fuse node[Conv_84] and nextnode[Relu_86]
[DEBUG][2022-02-26 11:23:12.177][fs_conv.cc:80] Fuse node[Conv_82] and nextnode[Relu_83]
[DEBUG][2022-02-26 11:23:12.177][fs_conv.cc:80] Fuse node[Conv_80] and nextnode[Relu_81]
[DEBUG][2022-02-26 11:23:12.177][fs_conv.cc:80] Fuse node[Conv_77] and nextnode[Add_78]
[DEBUG][2022-02-26 11:23:12.177][fs_conv.cc:80] Fuse node[Conv_77] and nextnode[Relu_79]
[DEBUG][2022-02-26 11:23:12.177][fs_conv.cc:80] Fuse node[Conv_75] and nextnode[Relu_76]
[DEBUG][2022-02-26 11:23:12.178][fs_conv.cc:80] Fuse node[Conv_73] and nextnode[Relu_74]
[DEBUG][2022-02-26 11:23:12.178][fs_conv.cc:80] Fuse node[Conv_70] and nextnode[Add_71]
[DEBUG][2022-02-26 11:23:12.178][fs_conv.cc:80] Fuse node[Conv_70] and nextnode[Relu_72]
[DEBUG][2022-02-26 11:23:12.178][fs_conv.cc:80] Fuse node[Conv_68] and nextnode[Relu_69]
[DEBUG][2022-02-26 11:23:12.178][fs_conv.cc:80] Fuse node[Conv_66] and nextnode[Relu_67]
[DEBUG][2022-02-26 11:23:12.178][fs_conv.cc:80] Fuse node[Conv_62] and nextnode[Add_64]
[DEBUG][2022-02-26 11:23:12.179][fs_conv.cc:80] Fuse node[Conv_62] and nextnode[Relu_65]
[DEBUG][2022-02-26 11:23:12.180][fs_conv.cc:80] Fuse node[Conv_60] and nextnode[Relu_61]
[DEBUG][2022-02-26 11:23:12.180][fs_conv.cc:80] Fuse node[Conv_58] and nextnode[Relu_59]
[DEBUG][2022-02-26 11:23:12.181][fs_conv.cc:80] Fuse node[Conv_55] and nextnode[Add_56]
[DEBUG][2022-02-26 11:23:12.182][fs_conv.cc:80] Fuse node[Conv_55] and nextnode[Relu_57]
[DEBUG][2022-02-26 11:23:12.182][fs_conv.cc:80] Fuse node[Conv_53] and nextnode[Relu_54]
[DEBUG][2022-02-26 11:23:12.182][fs_conv.cc:80] Fuse node[Conv_51] and nextnode[Relu_52]
[DEBUG][2022-02-26 11:23:12.183][fs_conv.cc:80] Fuse node[Conv_48] and nextnode[Add_49]
[DEBUG][2022-02-26 11:23:12.183][fs_conv.cc:80] Fuse node[Conv_48] and nextnode[Relu_50]
[DEBUG][2022-02-26 11:23:12.183][fs_conv.cc:80] Fuse node[Conv_46] and nextnode[Relu_47]
[DEBUG][2022-02-26 11:23:12.183][fs_conv.cc:80] Fuse node[Conv_44] and nextnode[Relu_45]
[DEBUG][2022-02-26 11:23:12.183][fs_conv.cc:80] Fuse node[Conv_41] and nextnode[Add_42]
[DEBUG][2022-02-26 11:23:12.183][fs_conv.cc:80] Fuse node[Conv_41] and nextnode[Relu_43]
[DEBUG][2022-02-26 11:23:12.184][fs_conv.cc:80] Fuse node[Conv_39] and nextnode[Relu_40]
[DEBUG][2022-02-26 11:23:12.184][fs_conv.cc:80] Fuse node[Conv_37] and nextnode[Relu_38]
[DEBUG][2022-02-26 11:23:12.184][fs_conv.cc:80] Fuse node[Conv_33] and nextnode[Add_35]
[DEBUG][2022-02-26 11:23:12.184][fs_conv.cc:80] Fuse node[Conv_33] and nextnode[Relu_36]
[DEBUG][2022-02-26 11:23:12.185][fs_conv.cc:80] Fuse node[Conv_31] and nextnode[Relu_32]
[DEBUG][2022-02-26 11:23:12.185][fs_conv.cc:80] Fuse node[Conv_29] and nextnode[Relu_30]
[DEBUG][2022-02-26 11:23:12.185][fs_conv.cc:80] Fuse node[Conv_26] and nextnode[Add_27]
[DEBUG][2022-02-26 11:23:12.185][fs_conv.cc:80] Fuse node[Conv_26] and nextnode[Relu_28]
[DEBUG][2022-02-26 11:23:12.185][fs_conv.cc:80] Fuse node[Conv_24] and nextnode[Relu_25]
[DEBUG][2022-02-26 11:23:12.186][fs_conv.cc:80] Fuse node[Conv_22] and nextnode[Relu_23]
[DEBUG][2022-02-26 11:23:12.186][fs_conv.cc:80] Fuse node[Conv_19] and nextnode[Add_20]
[DEBUG][2022-02-26 11:23:12.186][fs_conv.cc:80] Fuse node[Conv_19] and nextnode[Relu_21]
[DEBUG][2022-02-26 11:23:12.186][fs_conv.cc:80] Fuse node[Conv_17] and nextnode[Relu_18]
[DEBUG][2022-02-26 11:23:12.186][fs_conv.cc:80] Fuse node[Conv_15] and nextnode[Relu_16]
[DEBUG][2022-02-26 11:23:12.186][fs_conv.cc:80] Fuse node[Conv_11] and nextnode[Add_13]
[DEBUG][2022-02-26 11:23:12.187][fs_conv.cc:80] Fuse node[Conv_11] and nextnode[Relu_14]
[DEBUG][2022-02-26 11:23:12.187][fs_conv.cc:80] Fuse node[Conv_9] and nextnode[Relu_10]
[DEBUG][2022-02-26 11:23:12.187][fs_conv.cc:80] Fuse node[Conv_7] and nextnode[Relu_8]
[DEBUG][2022-02-26 11:23:12.187][fs_conv.cc:80] Fuse node[Conv_4] and nextnode[Relu_5]
[DEBUG][2022-02-26 11:23:12.187][fs_conv.cc:80] Fuse node[Conv_2] and nextnode[Relu_3]
[DEBUG][2022-02-26 11:23:12.187][fs_conv.cc:80] Fuse node[Conv_0] and nextnode[Relu_1]
[INFO][2022-02-26 11:23:12.192][opt_graph.cc:311] added 261 new bridge kernels
[INFO][2022-02-26 11:23:12.724][algo_conv_hmma.cc:126] Compiling Conv_0

opened by stujiajia 7
[x86-compile] error: impossible constraint in ‘asm’

I tried to compile the latest master.

| CPU | result |
| ------- | ------------- |
| Core i5-9500(not support avx512) | error: impossible constraint in ‘asm’ |
| Xeon 6130(support avx512) | pass |

I see that the latest commit adds AVX-512 support. If this is a bug, will ppl support CPUs without AVX-512, and is there a macro to separate out the AVX-512 code? Thanks.
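For illustration only, here is one common way to keep AVX-512 code out of builds that do not target it. This is a minimal sketch, not PPLNN's actual code layout: the function names `add_f32`, `add_f32_avx512` and `add_f32_generic` are hypothetical. It assumes GCC or Clang, which define `__AVX512F__` when AVX-512F code generation is enabled (for example with `-mavx512f`) and provide `__builtin_cpu_supports` for a runtime check.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical kernels for illustration; these names are not from PPLNN.

#if defined(__AVX512F__) && defined(__GNUC__)
#include <immintrin.h>
// Compiled only when the compiler targets AVX-512F, so toolchains or
// build configurations without it never see the 512-bit intrinsics.
static void add_f32_avx512(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
    for (; i + 16 <= n; i += 16) {  // 16 floats per 512-bit register
        __m512 va = _mm512_loadu_ps(a + i);
        __m512 vb = _mm512_loadu_ps(b + i);
        _mm512_storeu_ps(out + i, _mm512_add_ps(va, vb));
    }
    for (; i < n; ++i) out[i] = a[i] + b[i];  // scalar tail
}
#endif

// Portable fallback, always compiled.
static void add_f32_generic(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
}

// Dispatcher: takes the AVX-512 path only if it was compiled in AND the
// CPU reports support at run time; otherwise uses the fallback.
void add_f32(const float* a, const float* b, float* out, std::size_t n) {
    assert(a != nullptr && b != nullptr && out != nullptr);
#if defined(__AVX512F__) && defined(__GNUC__)
    if (__builtin_cpu_supports("avx512f")) {
        add_f32_avx512(a, b, out, n);
        return;
    }
#endif
    add_f32_generic(a, b, out, n);
}
```

In a real project the AVX-512 kernels would normally live in their own source files compiled with per-file flags, so that the fallback is not itself built with AVX-512 instructions; the single-file layout here is only to keep the sketch short.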

opened by alanzhai219 7
pytorch wrapper

Hi guys, is it possible to provide a torch wrapper for ppl.nn? It would make ppl.nn much easier to use. The wrapper could parse an ONNX file and accept a torch.Tensor for the forward pass.
improvement

opened by ShiyangZhang 0

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator

ONNX Runtime is a cross-platform inference and training machine-learning accelerator. ONNX Runtime inference can enable faster customer experiences an

8k Jan 4, 2023

High performance Cross-platform Inference-engine, you could run Anakin on x86-cpu,arm, nv-gpu, amd-gpu,bitmain and cambricon devices.

Anakin2.0 Welcome to the Anakin GitHub. Anakin is a cross-platform, high-performance inference engine, which is originally developed by Baidu engineer

514 Dec 28, 2022

A library for low-memory inferencing in PyTorch.

Pylomin Pylomin (PYtorch LOw-Memory INference) is a library for low-memory inferencing in PyTorch. Installation ... Usage For example, the following c

3 Oct 26, 2022

A modular, primitive-first, python-first PyTorch library for Reinforcement Learning.

TorchRL Disclaimer This library is not officially released yet and is subject to change. The features are available before an official release so that

860 Jan 7, 2023

PyTorch implementation of paper: HPNet: Deep Primitive Segmentation Using Hybrid Representations.

HPNet This repository contains the PyTorch implementation of paper: HPNet: Deep Primitive Segmentation Using Hybrid Representations. Installation The

42 Dec 7, 2022

Bayesian-Torch is a library of neural network layers and utilities extending the core of PyTorch to enable the user to perform stochastic variational inference in Bayesian deep neural networks

Bayesian-Torch is a library of neural network layers and utilities extending the core of PyTorch to enable the user to perform stochastic variational inference in Bayesian deep neural networks. Bayesian-Torch is designed to be flexible and seamless in extending a deterministic deep neural network architecture to corresponding Bayesian form by simply replacing the deterministic layers with Bayesian layers.

210 Jan 4, 2023

LightSeq is a high performance training and inference library for sequence processing and generation implemented in CUDA

LightSeq: A High Performance Library for Sequence Processing and Generation

2.5k Jan 6, 2023

Code for "Primitive Representation Learning for Scene Text Recognition" (CVPR 2021)

Primitive Representation Learning Network (PREN) This repository contains the code for our paper accepted by CVPR 2021 Primitive Representation Learni

76 Jan 2, 2023

CPU inference engine that delivers unprecedented performance for sparse models

The DeepSparse Engine is a CPU runtime that delivers unprecedented performance by taking advantage of natural sparsity within neural networks to reduce compute required as well as accelerate memory bound workloads. It is focused on model deployment and scaling machine learning pipelines, fitting seamlessly into your existing deployments as an inference backend.

1.2k Jan 9, 2023

A modular, research-friendly framework for high-performance and inference of sequence models at many scales

T5X T5X is a modular, composable, research-friendly framework for high-performance, configurable, self-service training, evaluation, and inference of

1.1k Jan 8, 2023

Jittor is a high-performance deep learning framework based on JIT compiling and meta-operators.

Jittor: a Just-in-time(JIT) deep learning framework Quickstart | Install | Tutorial | Chinese Jittor is a high-performance deep learning framework bas

2.7k Jan 3, 2023

PyTorch Implementation of [1611.06440] Pruning Convolutional Neural Networks for Resource Efficient Inference

PyTorch implementation of [1611.06440 Pruning Convolutional Neural Networks for Resource Efficient Inference] This demonstrates pruning a VGG16 based

836 Dec 26, 2022

A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.

Website | Documentation | Tutorials | Installation | Release Notes CatBoost is a machine learning method based on gradient boosting over decision tree

6.9k Jan 4, 2023

A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.

Website | Documentation | Tutorials | Installation | Release Notes CatBoost is a machine learning method based on gradient boosting over decision tree

5.7k Feb 12, 2021

Deep learning (neural network) based remote photoplethysmography: how to extract pulse signal from video using deep learning tools

Deep-rPPG: Camera-based pulse estimation using deep learning tools Deep learning (neural network) based remote photoplethysmography: how to extract pu

138 Dec 17, 2022

MetaBalance: High-Performance Neural Networks for Class-Imbalanced Data

This repository is the official PyTorch implementation of Meta-Balance. Find the paper on arxiv MetaBalance: High-Performance Neural Networks for Clas

20 Oct 18, 2021

[ICLR 2021] "CPT: Efficient Deep Neural Network Training via Cyclic Precision" by Yonggan Fu, Han Guo, Meng Li, Xin Yang, Yining Ding, Vikas Chandra, Yingyan Lin

CPT: Efficient Deep Neural Network Training via Cyclic Precision Yonggan Fu, Han Guo, Meng Li, Xin Yang, Yining Ding, Vikas Chandra, Yingyan Lin Accep

26 Oct 25, 2022

Torchserve server using a YoloV5 model running on docker with GPU and static batch inference to perform production ready inference.

Yolov5 running on TorchServe (GPU compatible) ! This is a dockerfile to run TorchServe for Yolo v5 object detection model. (TorchServe (PyTorch librar

82 Nov 29, 2022

Monocular 3D pose estimation. OpenVINO. CPU inference or iGPU (OpenCL) inference.

human-pose-estimation-3d-python-cpp RealSenseD435 (RGB) 480x640 + CPU Corei9 45 FPS (Depth is not used) 1. Run 1-1. RealSenseD435 (RGB) 480x640 + CPU

8 Oct 3, 2022