A CUDA error occasionally occurs during training. How can I fix it?
Here is the training log and stack trace.
(torch) ➜ DCSR git:(master) ✗ sh train.sh
Making model...
Total params: 3.19M
Preparing loss function:
use_vgg: True
use_vgg: True
1.000 * L1
0.050 * contextual_ref
0.010 * contextual_hr
/home/laizeqiang/miniconda3/envs/torch/lib/python3.6/site-packages/torch/optim/lr_scheduler.py:417: UserWarning: To get the last learning rate computed by the scheduler, please use `get_last_lr()`.
"please use `get_last_lr()`.", UserWarning)
[Epoch 1] Learning rate: 1.00e-4
/home/laizeqiang/miniconda3/envs/torch/lib/python3.6/site-packages/torch/nn/functional.py:3063: UserWarning: Default upsampling behavior when mode=bicubic is changed to align_corners=False since 0.4.0. Please specify align_corners=True if the old behavior is desired. See the documentation of nn.Upsample for details.
"See the documentation of nn.Upsample for details.".format(mode))
/home/laizeqiang/miniconda3/envs/torch/lib/python3.6/site-packages/torch/nn/functional.py:3103: UserWarning: The default behavior for interpolate/upsample with float scale_factor changed in 1.6.0 to align with other frameworks/libraries, and now uses scale_factor directly, instead of relying on the computed output size. If you wish to restore the old behavior, please set recompute_scale_factor=True. See the documentation of nn.Upsample for details.
warnings.warn("The default behavior for interpolate/upsample with float scale_factor changed "
[400/39300] [L1: 0.0565][contextual_ref: 0.1948][contextual_hr: 0.0653][Total: 0.3167] 49.5+1.9s
[800/39300] [L1: 0.0414][contextual_ref: 0.1722][contextual_hr: 0.0554][Total: 0.2690] 49.1+0.1s
[1200/39300] [L1: 0.0338][contextual_ref: 0.1618][contextual_hr: 0.0507][Total: 0.2462] 49.5+0.1s
[1600/39300] [L1: 0.0308][contextual_ref: 0.1564][contextual_hr: 0.0480][Total: 0.2352] 49.7+0.1s
[2000/39300] [L1: 0.0274][contextual_ref: 0.1526][contextual_hr: 0.0461][Total: 0.2260] 49.9+0.1s
[2400/39300] [L1: 0.0253][contextual_ref: 0.1497][contextual_hr: 0.0449][Total: 0.2199] 49.9+0.1s
[2800/39300] [L1: 0.0236][contextual_ref: 0.1473][contextual_hr: 0.0437][Total: 0.2146] 49.8+0.1s
[3200/39300] [L1: 0.0224][contextual_ref: 0.1453][contextual_hr: 0.0428][Total: 0.2105] 49.8+0.1s
/opt/conda/conda-bld/pytorch_1607370120218/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [93,0,0], thread: [32,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1607370120218/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [93,0,0], thread: [33,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1607370120218/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [93,0,0], thread: [34,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1607370120218/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [93,0,0], thread: [35,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1607370120218/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [93,0,0], thread: [36,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1607370120218/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [93,0,0], thread: [37,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1607370120218/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [93,0,0], thread: [38,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1607370120218/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [93,0,0], thread: [39,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1607370120218/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [93,0,0], thread: [40,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1607370120218/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [93,0,0], thread: [41,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1607370120218/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [93,0,0], thread: [42,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1607370120218/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [93,0,0], thread: [43,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1607370120218/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [93,0,0], thread: [44,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1607370120218/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [93,0,0], thread: [45,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1607370120218/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [93,0,0], thread: [46,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1607370120218/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [93,0,0], thread: [47,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1607370120218/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [93,0,0], thread: [48,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1607370120218/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [93,0,0], thread: [49,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1607370120218/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [93,0,0], thread: [50,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1607370120218/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [93,0,0], thread: [51,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1607370120218/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [93,0,0], thread: [52,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1607370120218/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [93,0,0], thread: [53,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1607370120218/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [93,0,0], thread: [54,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1607370120218/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [93,0,0], thread: [55,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1607370120218/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [93,0,0], thread: [56,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1607370120218/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [93,0,0], thread: [57,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1607370120218/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [93,0,0], thread: [58,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1607370120218/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [93,0,0], thread: [59,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1607370120218/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [93,0,0], thread: [60,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1607370120218/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [93,0,0], thread: [61,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1607370120218/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [93,0,0], thread: [62,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1607370120218/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [93,0,0], thread: [63,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
Traceback (most recent call last):
  File "main.py", line 31, in <module>
    main()
  File "main.py", line 25, in main
    trainer.train()
  File "/media/exthdd/laizeqiang/lzq/projects/ref-sr/related_work/DCSR/trainer.py", line 49, in train
    sr = self.model(lr, ref)
  File "/home/laizeqiang/miniconda3/envs/torch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/media/exthdd/laizeqiang/lzq/projects/ref-sr/related_work/DCSR/model/__init__.py", line 100, in forward
    return self.model(x, ref,False)
  File "/home/laizeqiang/miniconda3/envs/torch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/media/exthdd/laizeqiang/lzq/projects/ref-sr/related_work/DCSR/model/dcsr.py", line 117, in forward
    ref_features_aligned = self.aa3(input, ref_p, index_map, ref_features1)
  File "/home/laizeqiang/miniconda3/envs/torch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/media/exthdd/laizeqiang/lzq/projects/ref-sr/related_work/DCSR/model/attention.py", line 115, in forward
    warpped_features = self.align(warpped_features,lr,warpped_ref)
  File "/home/laizeqiang/miniconda3/envs/torch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/media/exthdd/laizeqiang/lzq/projects/ref-sr/related_work/DCSR/model/alignment.py", line 55, in forward
    p = self._get_p(affine, dtype)
  File "/media/exthdd/laizeqiang/lzq/projects/ref-sr/related_work/DCSR/model/alignment.py", line 124, in _get_p
    p_n = self._get_p_n(N, dtype)
  File "/media/exthdd/laizeqiang/lzq/projects/ref-sr/related_work/DCSR/model/alignment.py", line 102, in _get_p_n
    p_n = p_n.view(1, 2*N, 1, 1).type(dtype)
RuntimeError: CUDA error: device-side assert triggered
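
One possible way to narrow this down (a sketch, not a confirmed fix): the assertion comes from ScatterGatherKernel.cu, so the failing op is most likely a `torch.gather` or scatter call that received an index outside the valid range. CUDA kernels run asynchronously, so the Python line shown in the traceback (`alignment.py`, line 102) is not necessarily the op that failed. The helper below, `checked_gather`, is a hypothetical name and not part of DCSR; it assumes the suspect call is a `torch.gather` and checks the index range before running it, so an out-of-range index is reported at the call site with the offending values.

```python
import torch


def checked_gather(src: torch.Tensor, dim: int, index: torch.Tensor) -> torch.Tensor:
    """Hypothetical debugging helper: torch.gather with an explicit range check.

    The int(...) conversions force a host/device sync, so this is meant for
    debugging only, not for the normal training loop.
    """
    size = src.size(dim)
    if index.numel() > 0:
        lo, hi = int(index.min()), int(index.max())
        if lo < 0 or hi >= size:
            raise IndexError(
                f"gather index range [{lo}, {hi}] is outside [0, {size - 1}] for dim {dim}"
            )
    return torch.gather(src, dim, index)


# Small CPU example with made-up data: a valid index passes through unchanged.
src = torch.arange(6).view(2, 3)
good = torch.tensor([[0, 2], [1, 0]])
print(checked_gather(src, 1, good))
```

Independently of this helper, running the script with the environment variable `CUDA_LAUNCH_BLOCKING=1` makes kernel launches synchronous, so the traceback should point at the op that actually triggered the assertion.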