Hi, I got the following error when training on my own dataset. However, I can reproduce the results on the S3DIS dataset without any error. Could you please look into the error and suggest what might have gone wrong? My dataset has the same format as S3DIS (xyz + rgb).
I have the following configuration:
CUDA = V10.2.89
torch = 1.11.0
spconv-cu102 = 2.1.21
Ubuntu 18.04
I am sorry for the long error message.
Thank you for your help!
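For reference, this is how the versions can be checked from inside the conda environment (a minimal sketch; it assumes spconv 2.x, which exposes spconv.__version__):

```python
# Minimal environment check inside the softgroup2 env (assumes spconv 2.x,
# which exposes spconv.__version__).
import torch
import spconv

print("torch:", torch.__version__)                 # 1.11.0
print("CUDA (torch build):", torch.version.cuda)   # 10.2
print("spconv:", spconv.__version__)               # 2.1.21
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```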
(softgroup2) shrijan_pf@BQ-DX1100-CT2:~/SoftGroup$ ./tools/dist_train.sh configs/softgroup_s3dis_backbone_fold5_mintA.yaml 1
2022-04-28 13:43:26,027 - INFO - Config:
model:
  channels: 32
  num_blocks: 7
  semantic_classes: 3 # changed, original 13
  instance_classes: 3 # changed, original 13
  sem2ins_classes: [0, 1]
  semantic_only: True
  ignore_label: -100
  grouping_cfg:
    score_thr: 0.2
    radius: 0.04
    mean_active: 300
    class_numpoint_mean: [1823, 7457, 6189, 7424, 34229, 1724, 5439,
                          6016, 39796, 5279, 5092, 12210, 10225]
    npoint_thr: 0.05 # absolute if class_numpoint == -1, relative if class_numpoint != -1
    ignore_classes: [-99]
  instance_voxel_cfg:
    scale: 50
    spatial_shape: 20
  train_cfg:
    max_proposal_num: 200
    pos_iou_thr: 0.5
  test_cfg:
    x4_split: True
    cls_score_thr: 0.001
    mask_score_thr: -0.5
    min_npoint: 100
  fixed_modules: []
data:
  train:
    type: 's3dis'
    data_root: 'dataset/s3dis/preprocess_mint_sample' # changed, original 'dataset/s3dis/preprocess'
    prefix: ['Area_2'] # changed, original ['Area_1', 'Area_2', 'Area_3', 'Area_4', 'Area_6']
    suffix: '_inst_nostuff.pth'
    repeat: 20
    training: True
    voxel_cfg:
      scale: 50
      spatial_shape: [128, 512]
      max_npoint: 2500000
      min_npoint: 5000
  test:
    type: 's3dis'
    data_root: 'dataset/s3dis/preprocess_mint_sample' # changed, original 'dataset/s3dis/preprocess'
    prefix: 'Area_1' # changed, original 'Area_5'
    suffix: '_inst_nostuff.pth'
    training: False
    voxel_cfg:
      scale: 50
      spatial_shape: [128, 512]
      max_npoint: 2500000
      min_npoint: 5000
dataloader:
  train:
    batch_size: 1 # changed, original 4
    num_workers: 4
  test:
    batch_size: 1
    num_workers: 1
optimizer:
  type: 'Adam'
  lr: 0.004
save_cfg:
  semantic: True
  offset: True
  instance: False
fp16: False
epochs: 20
step_epoch: 0
save_freq: 2
pretrain: './hais_ckpt_spconv2.pth'
work_dir: ''
2022-04-28 13:43:26,027 - INFO - Distributed: True
2022-04-28 13:43:26,027 - INFO - Mix precision training: False
2022-04-28 13:43:27,398 - INFO - Load train dataset: 580 scans
2022-04-28 13:43:27,399 - INFO - Load test dataset: 32 scans
2022-04-28 13:43:27,399 - INFO - Load pretrain from ./hais_ckpt_spconv2.pth
2022-04-28 13:43:27,629 - INFO - removed keys in source state_dict due to size mismatch: semantic_linear.3.weight, semantic_linear.3.bias
2022-04-28 13:43:27,629 - INFO - missing keys in source state_dict: semantic_linear.3.weight, semantic_linear.3.bias
2022-04-28 13:43:27,629 - INFO - unexpected key in source state_dict: tiny_unet.blocks.block0.conv_branch.0.weight, tiny_unet.blocks.block0.conv_branch.0.bias, tiny_unet.blocks.block0.conv_branch.0.running_mean, tiny_unet.blocks.block0.conv_branch.0.running_var, tiny_unet.blocks.block0.conv_branch.0.num_batches_tracked, tiny_unet.blocks.block0.conv_branch.2.weight, tiny_unet.blocks.block0.conv_branch.3.weight, tiny_unet.blocks.block0.conv_branch.3.bias, tiny_unet.blocks.block0.conv_branch.3.running_mean, tiny_unet.blocks.block0.conv_branch.3.running_var, tiny_unet.blocks.block0.conv_branch.3.num_batches_tracked, tiny_unet.blocks.block0.conv_branch.5.weight, tiny_unet.blocks.block1.conv_branch.0.weight, tiny_unet.blocks.block1.conv_branch.0.bias, tiny_unet.blocks.block1.conv_branch.0.running_mean, tiny_unet.blocks.block1.conv_branch.0.running_var, tiny_unet.blocks.block1.conv_branch.0.num_batches_tracked, tiny_unet.blocks.block1.conv_branch.2.weight, tiny_unet.blocks.block1.conv_branch.3.weight, tiny_unet.blocks.block1.conv_branch.3.bias, tiny_unet.blocks.block1.conv_branch.3.running_mean, tiny_unet.blocks.block1.conv_branch.3.running_var, tiny_unet.blocks.block1.conv_branch.3.num_batches_tracked, tiny_unet.blocks.block1.conv_branch.5.weight, tiny_unet.conv.0.weight, tiny_unet.conv.0.bias, tiny_unet.conv.0.running_mean, tiny_unet.conv.0.running_var, tiny_unet.conv.0.num_batches_tracked, tiny_unet.conv.2.weight, tiny_unet.u.blocks.block0.conv_branch.0.weight, tiny_unet.u.blocks.block0.conv_branch.0.bias, tiny_unet.u.blocks.block0.conv_branch.0.running_mean, tiny_unet.u.blocks.block0.conv_branch.0.running_var, tiny_unet.u.blocks.block0.conv_branch.0.num_batches_tracked, tiny_unet.u.blocks.block0.conv_branch.2.weight, tiny_unet.u.blocks.block0.conv_branch.3.weight, tiny_unet.u.blocks.block0.conv_branch.3.bias, tiny_unet.u.blocks.block0.conv_branch.3.running_mean, tiny_unet.u.blocks.block0.conv_branch.3.running_var, tiny_unet.u.blocks.block0.conv_branch.3.num_batches_tracked, tiny_unet.u.blocks.block0.conv_branch.5.weight, tiny_unet.u.blocks.block1.conv_branch.0.weight, tiny_unet.u.blocks.block1.conv_branch.0.bias, tiny_unet.u.blocks.block1.conv_branch.0.running_mean, tiny_unet.u.blocks.block1.conv_branch.0.running_var, tiny_unet.u.blocks.block1.conv_branch.0.num_batches_tracked, tiny_unet.u.blocks.block1.conv_branch.2.weight, tiny_unet.u.blocks.block1.conv_branch.3.weight, tiny_unet.u.blocks.block1.conv_branch.3.bias, tiny_unet.u.blocks.block1.conv_branch.3.running_mean, tiny_unet.u.blocks.block1.conv_branch.3.running_var, tiny_unet.u.blocks.block1.conv_branch.3.num_batches_tracked, tiny_unet.u.blocks.block1.conv_branch.5.weight, tiny_unet.deconv.0.weight, tiny_unet.deconv.0.bias, tiny_unet.deconv.0.running_mean, tiny_unet.deconv.0.running_var, tiny_unet.deconv.0.num_batches_tracked, tiny_unet.deconv.2.weight, tiny_unet.blocks_tail.block0.i_branch.0.weight, tiny_unet.blocks_tail.block0.conv_branch.0.weight, tiny_unet.blocks_tail.block0.conv_branch.0.bias, tiny_unet.blocks_tail.block0.conv_branch.0.running_mean, tiny_unet.blocks_tail.block0.conv_branch.0.running_var, tiny_unet.blocks_tail.block0.conv_branch.0.num_batches_tracked, tiny_unet.blocks_tail.block0.conv_branch.2.weight, tiny_unet.blocks_tail.block0.conv_branch.3.weight, tiny_unet.blocks_tail.block0.conv_branch.3.bias, tiny_unet.blocks_tail.block0.conv_branch.3.running_mean, tiny_unet.blocks_tail.block0.conv_branch.3.running_var, tiny_unet.blocks_tail.block0.conv_branch.3.num_batches_tracked, 
tiny_unet.blocks_tail.block0.conv_branch.5.weight, tiny_unet.blocks_tail.block1.conv_branch.0.weight, tiny_unet.blocks_tail.block1.conv_branch.0.bias, tiny_unet.blocks_tail.block1.conv_branch.0.running_mean, tiny_unet.blocks_tail.block1.conv_branch.0.running_var, tiny_unet.blocks_tail.block1.conv_branch.0.num_batches_tracked, tiny_unet.blocks_tail.block1.conv_branch.2.weight, tiny_unet.blocks_tail.block1.conv_branch.3.weight, tiny_unet.blocks_tail.block1.conv_branch.3.bias, tiny_unet.blocks_tail.block1.conv_branch.3.running_mean, tiny_unet.blocks_tail.block1.conv_branch.3.running_var, tiny_unet.blocks_tail.block1.conv_branch.3.num_batches_tracked, tiny_unet.blocks_tail.block1.conv_branch.5.weight, tiny_unet_outputlayer.0.weight, tiny_unet_outputlayer.0.bias, tiny_unet_outputlayer.0.running_mean, tiny_unet_outputlayer.0.running_var, tiny_unet_outputlayer.0.num_batches_tracked, iou_score_linear.weight, iou_score_linear.bias, mask_linear.0.weight, mask_linear.0.bias, mask_linear.2.weight, mask_linear.2.bias
2022-04-28 13:43:27,630 - INFO - Training
[Exception|implicit_gemm]feat=torch.Size([17843, 64]),w=torch.Size([32, 2, 2, 2, 64]),pair=torch.Size([8, 17848]),act=17848,issubm=False,istrain=True
SPCONV_DEBUG_SAVE_PATH not found, you can specify SPCONV_DEBUG_SAVE_PATH as debug data save path to save debug data which can be attached in a issue.
Traceback (most recent call last):
  File "./tools/train.py", line 185, in <module>
    main()
  File "./tools/train.py", line 178, in main
    train(epoch, model, optimizer, scaler, train_loader, cfg, logger, writer)
  File "./tools/train.py", line 48, in train
    loss, log_vars = model(batch, return_loss=True)
  File "/home/shrijan_pf/anaconda3/envs/softgroup2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/shrijan_pf/anaconda3/envs/softgroup2/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 963, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/shrijan_pf/anaconda3/envs/softgroup2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/shrijan_pf/SoftGroup/softgroup/model/softgroup.py", line 97, in forward
    return self.forward_train(**batch)
  File "/home/shrijan_pf/SoftGroup/softgroup/util/utils.py", line 171, in wrapper
    return func(*new_args, **new_kwargs)
  File "/home/shrijan_pf/SoftGroup/softgroup/model/softgroup.py", line 109, in forward_train
    semantic_scores, pt_offsets, output_feats = self.forward_backbone(input, v2p_map)
  File "/home/shrijan_pf/SoftGroup/softgroup/model/softgroup.py", line 263, in forward_backbone
    output = self.unet(output)
  File "/home/shrijan_pf/anaconda3/envs/softgroup2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/shrijan_pf/SoftGroup/softgroup/model/blocks.py", line 139, in forward
    output_decoder = self.deconv(output_decoder)
  File "/home/shrijan_pf/anaconda3/envs/softgroup2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/shrijan_pf/anaconda3/envs/softgroup2/lib/python3.7/site-packages/spconv/pytorch/modules.py", line 137, in forward
    input = module(input)
  File "/home/shrijan_pf/anaconda3/envs/softgroup2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/shrijan_pf/anaconda3/envs/softgroup2/lib/python3.7/site-packages/spconv/pytorch/conv.py", line 446, in forward
    input._timer, self.fp32_accum)
  File "/home/shrijan_pf/anaconda3/envs/softgroup2/lib/python3.7/site-packages/torch/cuda/amp/autocast_mode.py", line 118, in decorate_fwd
    return fwd(*args, **kwargs)
  File "/home/shrijan_pf/anaconda3/envs/softgroup2/lib/python3.7/site-packages/spconv/pytorch/functional.py", line 200, in forward
    raise e
  File "/home/shrijan_pf/anaconda3/envs/softgroup2/lib/python3.7/site-packages/spconv/pytorch/functional.py", line 191, in forward
    fp32_accum)
  File "/home/shrijan_pf/anaconda3/envs/softgroup2/lib/python3.7/site-packages/spconv/pytorch/ops.py", line 1118, in implicit_gemm
    fp32_accum=fp32_accum)
  File "/home/shrijan_pf/anaconda3/envs/softgroup2/lib/python3.7/site-packages/spconv/algo.py", line 661, in tune_and_cache
    GemmMainUnitTest.stream_synchronize(stream)
RuntimeError: /io/build/temp.linux-x86_64-3.7/spconv/build/src/cumm/gemm/main/GemmMainUnitTest/GemmMainUnitTest_stream_synchronize.cc(11)
CUDA error 700
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1646755861072/work/c10/cuda/CUDACachingAllocator.cpp:1230 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7f5183c301bd in /home/shrijan_pf/anaconda3/envs/softgroup2/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: + 0x1ec97 (0x7f5183e9ac97 in /home/shrijan_pf/anaconda3/envs/softgroup2/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x23a (0x7f5183e9f1fa in /home/shrijan_pf/anaconda3/envs/softgroup2/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #3: + 0x2edaf8 (0x7f51ca243af8 in /home/shrijan_pf/anaconda3/envs/softgroup2/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #4: c10::TensorImpl::release_resources() + 0x175 (0x7f5183c16fb5 in /home/shrijan_pf/anaconda3/envs/softgroup2/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #5: + 0x1daa99 (0x7f51ca130a99 in /home/shrijan_pf/anaconda3/envs/softgroup2/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #6: + 0x4cf71c (0x7f51ca42571c in /home/shrijan_pf/anaconda3/envs/softgroup2/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #7: THPVariable_subclass_dealloc(_object*) + 0x299 (0x7f51ca425a39 in /home/shrijan_pf/anaconda3/envs/softgroup2/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #8: + 0xfc359 (0x55c9017e7359 in /home/shrijan_pf/anaconda3/envs/softgroup2/bin/python)
frame #9: + 0xfac88 (0x55c9017e5c88 in /home/shrijan_pf/anaconda3/envs/softgroup2/bin/python)
frame #10: + 0xfac88 (0x55c9017e5c88 in /home/shrijan_pf/anaconda3/envs/softgroup2/bin/python)
frame #11: + 0xfa6a8 (0x55c9017e56a8 in /home/shrijan_pf/anaconda3/envs/softgroup2/bin/python)
frame #12: + 0xfb128 (0x55c9017e6128 in /home/shrijan_pf/anaconda3/envs/softgroup2/bin/python)
frame #13: + 0xfb13c (0x55c9017e613c in /home/shrijan_pf/anaconda3/envs/softgroup2/bin/python)
frame #14: + 0xfb13c (0x55c9017e613c in /home/shrijan_pf/anaconda3/envs/softgroup2/bin/python)
frame #15: + 0xfb13c (0x55c9017e613c in /home/shrijan_pf/anaconda3/envs/softgroup2/bin/python)
frame #16: + 0xfb13c (0x55c9017e613c in /home/shrijan_pf/anaconda3/envs/softgroup2/bin/python)
frame #17: + 0x12b4a7 (0x55c9018164a7 in /home/shrijan_pf/anaconda3/envs/softgroup2/bin/python)
frame #18: PyDict_SetItemString + 0x89 (0x55c901822a19 in /home/shrijan_pf/anaconda3/envs/softgroup2/bin/python)
frame #19: PyImport_Cleanup + 0xab (0x55c901897b8b in /home/shrijan_pf/anaconda3/envs/softgroup2/bin/python)
frame #20: Py_FinalizeEx + 0x64 (0x55c90190c714 in /home/shrijan_pf/anaconda3/envs/softgroup2/bin/python)
frame #21: + 0x232e20 (0x55c90191de20 in /home/shrijan_pf/anaconda3/envs/softgroup2/bin/python)
frame #22: _Py_UnixMain + 0x3c (0x55c90191e18c in /home/shrijan_pf/anaconda3/envs/softgroup2/bin/python)
frame #23: __libc_start_main + 0xe7 (0x7f51e65b4c87 in /lib/x86_64-linux-gnu/libc.so.6)
frame #24: + 0x1d803a (0x55c9018c303a in /home/shrijan_pf/anaconda3/envs/softgroup2/bin/python)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 13703) of binary: /home/shrijan_pf/anaconda3/envs/softgroup2/bin/python
Traceback (most recent call last):
  File "/home/shrijan_pf/anaconda3/envs/softgroup2/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.11.0', 'console_scripts', 'torchrun')())
  File "/home/shrijan_pf/anaconda3/envs/softgroup2/lib/python3.7/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/shrijan_pf/anaconda3/envs/softgroup2/lib/python3.7/site-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/home/shrijan_pf/anaconda3/envs/softgroup2/lib/python3.7/site-packages/torch/distributed/run.py", line 718, in run
    )(*cmd_args)
  File "/home/shrijan_pf/anaconda3/envs/softgroup2/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/shrijan_pf/anaconda3/envs/softgroup2/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
./tools/train.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2022-04-28_13:43:50
host : BQ-DX1100-CT2
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 13703)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 13703
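In case it helps with narrowing this down, below is the kind of sanity check I can run on my preprocessed scans. It is only a sketch: it assumes each *_inst_nostuff.pth file stores the same tuple as the S3DIS preprocessing output (xyz, rgb, semantic label, instance label); if the layout differs, the unpacking needs to be adjusted.

```python
# Hypothetical sanity check of the preprocessed scans (not part of SoftGroup).
# Assumes each *_inst_nostuff.pth stores (xyz, rgb, semantic_label, instance_label)
# as in the S3DIS preprocessing output.
import glob
import numpy as np
import torch

for path in sorted(glob.glob("dataset/s3dis/preprocess_mint_sample/*_inst_nostuff.pth")):
    xyz, rgb, sem_label, inst_label = torch.load(path)
    sem_label = np.asarray(sem_label)
    print(path,
          "| points:", len(xyz),
          "| semantic labels:", np.unique(sem_label),  # should be 0..semantic_classes-1 or the ignore label
          "| instances:", len(np.unique(np.asarray(inst_label))))
```

The idea is to confirm that, after reducing semantic_classes/instance_classes to 3, the labels in every file stay inside the expected range.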