What are the problems?(screenshots or detailed error messages)
I observe that for some models (e.g. YOLOX-s and DBNet-r18; others such as ResNet-18 are fine), subsequent CUDA function calls or kernel launches may fail after a runtime is created using RuntimeBuilder.
I first hit the CUDA invalid argument error when testing ppl.nn using mmdeploy's test.py. It occurred after runtime creation and before inference, while copying data from host to device. Later I ran into the same problem when testing with mmdeploy's SDK.
After digging around for a while, I found that the simplest way to reproduce the problem is with pplnn.py:
Insert the following code
import torch
t = torch.Tensor([[1,1],[1,1]]).cuda()
at
https://github.com/openppl-public/ppl.nn/blob/1ae5d95f3ee49b3e582564cc004443931fbe2f7a/tools/pplnn.py#L564
and then run
python pplnn.py --use-cuda --onnx-model model.onnx --in-shape 1_3_640_640 --quick-select
which gives
INFO: PPLNN version: [0.8.0], commit: [02418bb57bef2d888b57d44589a599080cb806d9]
[INFO][2022-07-06 22:23:06.057][utils.cc:456] total partition(s) of graph[torch-jit-export]: 1.
[INFO][2022-07-06 22:23:06.067][opt_graph.cc:324] added 1020 new bridge kernels
[INFO][2022-07-06 22:23:06.223][opt_graph.cc:581] deleted 990 bridge kernels
Traceback (most recent call last):
  File "pplnn.py", line 567, in <module>
    t = torch.Tensor([[1,1],[1,1]]).cuda()
RuntimeError: CUDA error: invalid argument
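To narrow down which stage leaves the CUDA context in a bad state, a small probe like the one below could be called after each step in pplnn.py. This is only a sketch: `cuda_probe` and its `alloc` parameter are hypothetical names, not part of ppl.nn or PyTorch, and the default probe assumes PyTorch with CUDA support is installed.

```python
def cuda_probe(label, alloc=None):
    """Try a tiny host-to-device copy and report whether it succeeds.

    `alloc` is a hypothetical hook: any zero-argument callable that touches
    the GPU. When omitted, a small PyTorch tensor is copied to the device.
    """
    if alloc is None:
        # Imported lazily so the helper can be defined without torch installed.
        import torch

        def alloc():
            torch.ones(2, 2).cuda()
            # Wait for pending GPU work so asynchronous errors surface here.
            torch.cuda.synchronize()

    try:
        alloc()
    except RuntimeError as exc:
        # PyTorch reports CUDA failures as RuntimeError (or a subclass).
        print(f"[probe] {label}: FAILED ({exc})")
        return False
    print(f"[probe] {label}: ok")
    return True
```

For example, calling `cuda_probe("after runtime creation")` right after the runtime is built, and again after each earlier stage, would show the first point at which a plain host-to-device copy stops working. Running with the environment variable `CUDA_LAUNCH_BLOCKING=1` may also help, since it makes kernel launches synchronous so that errors are reported closer to the call that caused them.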
Which version(commit id or tag) of ppl.nn is used?
02418bb57bef2d888b57d44589a599080cb806d9
What's the operating system ppl.nn runs on?
Ubuntu 18.04
What's the compiler and its version?
GCC-7.5, CUDA-11.1
What are the commands used to build ppl.nn?
cmake .. \
    -DCMAKE_INSTALL_PREFIX=/workspace/ppl.nn/install \
    -DPPLNN_ENABLE_PYTHON_API=ON \
    -DPPLNN_USE_X86_64=ON \
    -DPPLNN_USE_CUDA=ON \
    -DPPL_USE_X86_AVX512=OFF \
    -DPPLNN_ENABLE_CUDA_JIT=OFF \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_CUDA_ARCHITECTURES=75