Thanks for your amazing work!
I got RuntimeError: CUDA error: device-side assert triggered at around step 200 during training. The error occurs every time, even after rerunning the program multiple times and, following #22, setting a higher epsilon (I've tried both 1e-8 and 1e-6).
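One thing I noticed while debugging: raising the clamp epsilon cannot help once the scores themselves have already become nan, because clamping leaves nan untouched (every comparison against nan is false, in torch.clamp just as in plain Python). A minimal pure-Python sketch of that behavior:

```python
import math

def clamp(x: float, lo: float, hi: float) -> float:
    """Clamp x into [lo, hi]. NaN propagates through unchanged,
    because every comparison against NaN evaluates to False."""
    return min(max(x, lo), hi)

print(clamp(0.5, 1e-8, 1.0))           # 0.5, already in range
print(clamp(-3.0, 1e-8, 1.0))          # clamped up to 1e-8
print(clamp(float("nan"), 1e-8, 1.0))  # still nan -- epsilon cannot fix this
```

So the epsilon only guards against scores that are merely out of range, not against scores that have already diverged to nan.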
This is the command I use for training. I have tried PyTorch 1.6 and 1.7, both with CUDA 10.1.
CUDA_LAUNCH_BLOCKING=1 python tools/train_net.py --config-file "configs/voc/V_16_voc07.yaml" --use-tensorboard \
OUTPUT_DIR output \
SOLVER.IMS_PER_BATCH 1 \
SOLVER.ITER_SIZE 8 \
DB.METHOD none
Here is the log:
2020-12-08 00:13:05,575 wetectron.trainer INFO: eta: 1 day, 5:33:33 iter: 180 loss: 0.4550 (0.6005) loss_img: 0.2575 (0.2831) loss_ref_cls0: 0.0003 (0.0011) loss_ref_reg0: 0.0000 (0.0002) loss_ref_cls1: 0.1219 (0.1517) loss_ref_reg1: 0.0277 (0.0274) loss_ref_cls2: 0.0612 (0.1122) loss_ref_reg2: 0.0119 (0.0247) acc_img: 0.0000 (0.2319) acc_ref0: 0.0000 (0.0690) acc_ref1: 0.0000 (0.2546) acc_ref2: 0.0000 (0.2713) time: 0.4167 (0.4437) data: 0.0097 (0.0117) lr: 0.004100 max mem: 4047
tensor([0.0061, 0.0272, 0.0212, 0.0003, 0.0143, 0.0203, 0.0304, 0.2264, 0.0059,
0.2383, 0.0125, 0.0261, 0.0525, 0.1852, 0.0306, 0.0003, 0.0211, 0.0074,
0.0092, 0.0183, 0.0403], device='cuda:0', grad_fn=<ClampBackward>)
tensor([3.4809e-03, 2.0176e-02, 1.5387e-02, 3.3157e-05, 7.3178e-03, 2.7314e-02,
1.9246e-02, 4.2303e-01, 1.8498e-03, 2.4402e-01, 8.2913e-03, 2.3048e-02,
4.3761e-02, 1.8561e-01, 1.7174e-02, 5.2509e-05, 1.1843e-02, 3.0689e-03,
5.3479e-03, 8.2327e-03, 2.8905e-02], device='cuda:0',
grad_fn=<ClampBackward>)
tensor([0.0063, 0.0345, 0.0294, 0.0005, 0.0227, 0.0191, 0.0276, 0.1232, 0.0156,
0.2002, 0.0143, 0.0268, 0.0588, 0.2178, 0.0280, 0.0011, 0.0264, 0.0082,
0.0126, 0.0231, 0.0392], device='cuda:0', grad_fn=<ClampBackward>)
tensor([4.6997e-03, 2.4954e-02, 1.9726e-02, 8.2294e-05, 1.1063e-02, 2.8040e-02,
2.3942e-02, 3.1750e-01, 4.0102e-03, 2.3422e-01, 1.1085e-02, 2.5569e-02,
4.9987e-02, 1.9260e-01, 2.2549e-02, 1.3269e-04, 1.6148e-02, 5.0528e-03,
7.5706e-03, 1.2965e-02, 3.4329e-02], device='cuda:0',
grad_fn=<ClampBackward>)
tensor([1.0000e-08, 5.4938e-07, 3.5194e-06, 1.3429e-08, 4.6343e-08, 1.0000e-08,
1.0000e-08, 3.2079e-05, 1.9538e-08, 4.8927e-03, 9.9997e-05, 1.4685e-07,
3.2431e-01, 7.7275e-02, 1.0000e-08, 9.3841e-01, 1.4319e-03, 1.0000e-08,
1.0000e-08, 1.9370e-08, 1.7341e-06], device='cuda:0',
grad_fn=<ClampBackward>)
tensor([1.0000e-08, 2.8905e-08, 9.4311e-06, 1.0000e-08, 1.0000e-08, 1.0000e-08,
1.0000e-08, 6.9164e-04, 1.0000e-08, 1.7446e-02, 2.6333e-08, 2.8428e-07,
1.0017e-01, 7.1074e-02, 1.0000e-08, 9.8742e-01, 1.4429e-06, 1.0000e-08,
1.0000e-08, 1.0000e-08, 6.2333e-07], device='cuda:0',
grad_fn=<ClampBackward>)
tensor([1.0000e-08, 5.2896e-06, 6.3474e-07, 8.0780e-07, 4.2772e-08, 1.0000e-08,
1.0000e-08, 1.0000e-08, 1.0000e-08, 2.0732e-02, 1.0000e-08, 2.8691e-08,
2.5120e-01, 1.0660e-02, 1.0000e-08, 8.6331e-01, 2.0186e-03, 1.0000e-08,
1.0000e-08, 1.0000e-08, 3.8909e-07], device='cuda:0',
grad_fn=<ClampBackward>)
tensor([1.0000e-08, 5.0810e-07, 1.7225e-06, 1.0000e-08, 1.0000e-08, 1.0000e-08,
1.0000e-08, 1.0000e-08, 1.9051e-07, 2.6197e-02, 5.0034e-05, 1.4151e-07,
4.4598e-01, 7.7825e-02, 1.0000e-08, 2.2080e-01, 9.6862e-03, 1.0000e-08,
1.0000e-08, 2.3176e-08, 1.8984e-06], device='cuda:0',
grad_fn=<ClampBackward>)
tensor([1.0000e-08, 1.0000e-08, 2.9017e-08, 1.0000e-08, 1.0000e-08, 4.3676e-07,
1.0000e-08, 9.8482e-01, 1.0000e-08, 2.6416e-03, 1.7355e-06, 5.0040e-08,
3.8472e-01, 9.1854e-03, 1.0000e-08, 9.9960e-01, 5.8315e-05, 1.0000e-08,
1.0000e-08, 1.0000e-08, 4.1070e-07], device='cuda:0',
grad_fn=<ClampBackward>)
tensor([1.0000e-08, 3.2842e-06, 2.7481e-06, 1.5629e-07, 3.9651e-06, 1.0004e-07,
9.6509e-08, 1.3372e-04, 1.6091e-08, 1.6981e-02, 2.7259e-04, 1.5076e-05,
1.5756e-01, 6.3610e-02, 3.7470e-07, 9.4090e-01, 2.4577e-04, 1.0000e-08,
1.0000e-08, 4.2767e-08, 1.5077e-04], device='cuda:0',
grad_fn=<ClampBackward>)
tensor([1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
1.0000e-08, 1.0000e-08, 1.0000e-08, 7.5319e-04, 1.0000e-08, 1.0000e-08,
1.7518e-03, 1.7722e-01, 1.0000e-08, 9.8997e-01, 6.3139e-02, 1.0000e-08,
1.0000e-08, 1.0000e-08, 1.0000e-08], device='cuda:0',
grad_fn=<ClampBackward>)
tensor([1.0000e-08, 4.1463e-05, 5.7913e-04, 1.0175e-05, 8.5911e-06, 1.0000e-08,
2.1342e-07, 6.7830e-02, 1.5353e-06, 2.2693e-02, 1.1492e-07, 1.2851e-05,
3.3217e-01, 1.1930e-01, 4.0176e-06, 8.4664e-01, 4.7693e-03, 1.0000e-08,
1.0000e-08, 1.9446e-06, 7.9586e-05], device='cuda:0',
grad_fn=<ClampBackward>)
tensor([1.0000e-08, 1.0000e-08, 1.0000e+00, 1.0000e-08, 1.0000e-08, 1.0000e-08,
1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
1.0000e-08, 1.0000e-08, 1.0000e-08], device='cuda:0',
grad_fn=<ClampBackward>)
tensor([1.0000e-08, 1.0000e-08, 1.0000e+00, 1.0000e-08, 1.0000e-08, 1.0000e-08,
1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e+00, 1.0000e-08, 1.0000e-08,
1.0000e-08, 1.0000e-08, 1.0000e-08], device='cuda:0',
grad_fn=<ClampBackward>)
[the tensor above repeats, identically, five more times]
tensor([1.0000e-08, 1.0000e-08, 1.0000e+00, 1.0000e-08, 1.0000e-08, 1.0000e-08,
1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e+00, 1.0000e-08, 1.0000e-08,
1.0000e-08, 1.0000e-08, 1.0000e-08], device='cuda:0',
grad_fn=<ClampBackward>)
2020-12-08 00:13:14,823 wetectron.trainer INFO: eta: 1 day, 5:40:51 iter: 200 loss: 5.6391 (16.8278) loss_img: 0.9144 (0.5902) loss_ref_cls0: 0.0000 (0.0021) loss_ref_reg0: 0.0000 (0.0008) loss_ref_cls1: 0.0425 (0.1522) loss_ref_reg1: 0.0017 (0.0262) loss_ref_cls2: 0.0000 (14.2085) loss_ref_reg2: 0.0000 (1.8478) acc_img: 0.0000 (0.2213) acc_ref0: 0.0000 (0.0704) acc_ref1: 0.0000 (0.2392) acc_ref2: 0.0000 (0.2625) time: 0.4264 (0.4456) data: 0.0115 (0.0117) lr: 0.004167 max mem: 4047
tensor([1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e+00, 1.0000e+00,
1.0000e-08, 1.0000e-08, 1.0000e-08], device='cuda:0',
grad_fn=<ClampBackward>)
tensor([1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e+00,
1.0000e-08, 1.0000e-08, 1.0000e-08], device='cuda:0',
grad_fn=<ClampBackward>)
tensor([1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e+00, 1.0000e+00,
1.0000e-08, 1.0000e-08, 1.0000e-08], device='cuda:0',
grad_fn=<ClampBackward>)
tensor([1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e+00, 1.0000e+00,
1.0000e-08, 1.0000e-08, 1.0000e-08], device='cuda:0',
grad_fn=<ClampBackward>)
tensor([1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e+00, 1.0000e+00, 1.0000e+00,
1.0000e-08, 1.0000e-08, 1.0000e-08], device='cuda:0',
grad_fn=<ClampBackward>)
tensor([1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e+00,
1.0000e-08, 1.0000e-08, 1.0000e-08], device='cuda:0',
grad_fn=<ClampBackward>)
tensor([1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e+00,
1.0000e-08, 1.0000e-08, 1.0000e-08], device='cuda:0',
grad_fn=<ClampBackward>)
tensor([1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e+00, 1.0000e-08, 1.0000e+00,
1.0000e-08, 1.0000e-08, 1.0000e-08], device='cuda:0',
grad_fn=<ClampBackward>)
tensor([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
device='cuda:0', grad_fn=<ClampBackward>)
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [0,0,0] Assertion `input_val >= zero && input_val <= one` failed.
[the same assertion repeats for threads 1 through 20]
Traceback (most recent call last):
  File "tools/train_net.py", line 301, in <module>
    main()
  File "tools/train_net.py", line 280, in main
    use_tensorboard=args.use_tensorboard
  File "tools/train_net.py", line 92, in train
    meters
  File "/home/unnc/Desktop/sota/wetectron/wetectron/engine/trainer.py", line 94, in do_train
    loss_dict, metrics = model(images, targets, rois)
  File "/home/unnc/anaconda3/envs/wetectron1.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/unnc/anaconda3/envs/wetectron1.7/lib/python3.7/site-packages/apex/amp/_initialize.py", line 197, in new_fwd
    **applier(kwargs, input_caster))
  File "/home/unnc/Desktop/sota/wetectron/wetectron/modeling/detector/generalized_rcnn.py", line 61, in forward
    x, result, detector_losses, accuracy = self.roi_heads(features, proposals, targets, model_cdb)
  File "/home/unnc/anaconda3/envs/wetectron1.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/unnc/Desktop/sota/wetectron/wetectron/modeling/roi_heads/weak_head/weak_head.py", line 106, in forward
    loss_img, accuracy_img = self.loss_evaluator([cls_score], [det_score], ref_scores, ref_bbox_preds, proposals, targets)
  File "/home/unnc/Desktop/sota/wetectron/wetectron/modeling/roi_heads/weak_head/loss.py", line 254, in __call__
    return_loss_dict['loss_img'] += F.binary_cross_entropy(img_score_per_im, labels_per_im.clamp(0, 1))
  File "/home/unnc/anaconda3/envs/wetectron1.7/lib/python3.7/site-packages/torch/nn/functional.py", line 2526, in binary_cross_entropy
    input, target, weight, reduction_enum)
RuntimeError: CUDA error: device-side assert triggered
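For anyone hitting the same assert: the kernel in Loss.cu requires every input to binary_cross_entropy to lie in [0, 1], and a nan fails that check. As a debugging aid, here is a small pure-Python sketch (a hypothetical helper, not part of wetectron) of a BCE that fails fast with a readable error instead of a device-side assert, so the first bad iteration can be inspected:

```python
import math

def safe_bce(p: float, y: float, eps: float = 1e-8) -> float:
    """Binary cross-entropy for one prediction/target pair, enforcing
    the same precondition the CUDA kernel asserts: inputs in [0, 1].
    Non-finite predictions are reported instead of crashing."""
    if not math.isfinite(p):
        raise ValueError(f"prediction is non-finite: {p}")
    if not (0.0 <= p <= 1.0) or not (0.0 <= y <= 1.0):
        raise ValueError(f"inputs outside [0, 1]: p={p}, y={y}")
    p = min(max(p, eps), 1.0 - eps)  # keep log() away from exactly 0
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))

safe_bce(0.9, 1.0)          # fine: equals -log(0.9)
# safe_bce(float("nan"), 1.0)  # would raise ValueError, not a CUDA assert
```

The same idea in the real code would be an assertion on torch.isfinite(img_score_per_im).all() just before the F.binary_cross_entropy call in loss.py, which pinpoints the iteration where the scores first diverge.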