I'm using V100 GPUs (32 GB each) for my experiments, but training still runs out of memory partway through the first epoch. I'm not sure what the cause is at this moment. The arguments, training log, and traceback are below.
Namespace(aux_loss=True, backbone='resnet50', batch_size=4, bbox_loss_coef=5, clip_max_norm=0.1, coco_panoptic_path=None, coco_path='./coco2017/', dataset_file='coco', dec_layers=6, device='cuda', dice_loss_coef=1, dilation=False, dim_feedforward=1024, dist_backend='nccl', dist_url='env://', distributed=True, dropout=0.1, enc_layers=6, eos_coef=0.1, epochs=300, eval=False, frozen_weights=None, giou_loss_coef=2, gpu=0, hidden_dim=256, lr=0.0005, lr_backbone=1e-05, lr_drop=200, mask_loss_coef=1, masks=False, nheads=8, num_queries=100, num_workers=2, output_dir='./output', position_embedding='sine', pre_norm=False, rank=0, remove_difficult=False, resume='', seed=42, set_cost_bbox=5, set_cost_class=1, set_cost_giou=2, start_epoch=0, weight_decay=0.0001, world_size=8)
Downloading: "https://download.pytorch.org/models/resnet50-19c8e357.pth" to /home/tiger/.cache/torch/hub/checkpoints/resnet50-19c8e357.pth
100%|██████████| 97.8M/97.8M [00:09<00:00, 10.3MB/s]
number of params: 36104659
loading annotations into memory...
Done (t=13.57s)
creating index...
index created!
loading annotations into memory...
Done (t=0.44s)
creating index...
index created!
Start training
Epoch: [0] [ 0/3696] eta: 2:32:25 lr: 0.000100 loss: 7.6000 (7.6000) at: 7.6000 (7.6000) at_unscaled: 7.6000 (7.6000) time: 2.4743 data: 0.5030 max mem: 14737
Epoch: [0] [ 10/3696] eta: 0:59:14 lr: 0.000100 loss: 7.5261 (7.5307) at: 7.5261 (7.5307) at_unscaled: 7.5261 (7.5307) time: 0.9643 data: 0.0806 max mem: 25656
Epoch: [0] [ 20/3696] eta: 0:56:49 lr: 0.000100 loss: 7.4746 (7.4774) at: 7.4746 (7.4774) at_unscaled: 7.4746 (7.4774) time: 0.8501 data: 0.0390 max mem: 25656
Epoch: [0] [ 30/3696] eta: 0:54:22 lr: 0.000100 loss: 7.3449 (7.4215) at: 7.3449 (7.4215) at_unscaled: 7.3449 (7.4215) time: 0.8489 data: 0.0374 max mem: 25656
Epoch: [0] [ 40/3696] eta: 0:54:59 lr: 0.000100 loss: 7.2054 (7.3429) at: 7.2054 (7.3429) at_unscaled: 7.2054 (7.3429) time: 0.8761 data: 0.0356 max mem: 25656
Epoch: [0] [ 50/3696] eta: 0:53:30 lr: 0.000100 loss: 7.0288 (7.2657) at: 7.0288 (7.2657) at_unscaled: 7.0288 (7.2657) time: 0.8662 data: 0.0362 max mem: 25656
Epoch: [0] [ 60/3696] eta: 0:53:44 lr: 0.000100 loss: 6.8423 (7.1774) at: 6.8423 (7.1774) at_unscaled: 6.8423 (7.1774) time: 0.8553 data: 0.0368 max mem: 26623
Epoch: [0] [ 70/3696] eta: 0:53:36 lr: 0.000100 loss: 6.6867 (7.0967) at: 6.6867 (7.0967) at_unscaled: 6.6867 (7.0967) time: 0.9036 data: 0.0359 max mem: 26623
Epoch: [0] [ 80/3696] eta: 0:52:42 lr: 0.000100 loss: 6.5043 (7.0184) at: 6.5043 (7.0184) at_unscaled: 6.5043 (7.0184) time: 0.8368 data: 0.0351 max mem: 26623
Epoch: [0] [ 90/3696] eta: 0:52:17 lr: 0.000100 loss: 6.4531 (6.9577) at: 6.4531 (6.9577) at_unscaled: 6.4531 (6.9577) time: 0.8094 data: 0.0362 max mem: 26623
Epoch: [0] [ 100/3696] eta: 0:51:33 lr: 0.000100 loss: 6.4151 (6.8982) at: 6.4151 (6.8982) at_unscaled: 6.4151 (6.8982) time: 0.8019 data: 0.0386 max mem: 26623
Epoch: [0] [ 110/3696] eta: 0:51:10 lr: 0.000100 loss: 6.3319 (6.8437) at: 6.3319 (6.8437) at_unscaled: 6.3319 (6.8437) time: 0.7937 data: 0.0392 max mem: 26623
Epoch: [0] [ 120/3696] eta: 0:50:56 lr: 0.000100 loss: 6.2714 (6.7969) at: 6.2714 (6.7969) at_unscaled: 6.2714 (6.7969) time: 0.8268 data: 0.0377 max mem: 26623
Epoch: [0] [ 130/3696] eta: 0:50:36 lr: 0.000100 loss: 6.2584 (6.7519) at: 6.2584 (6.7519) at_unscaled: 6.2584 (6.7519) time: 0.8254 data: 0.0372 max mem: 26623
Epoch: [0] [ 140/3696] eta: 0:50:25 lr: 0.000100 loss: 6.2035 (6.7111) at: 6.2035 (6.7111) at_unscaled: 6.2035 (6.7111) time: 0.8266 data: 0.0372 max mem: 29528
Epoch: [0] [ 150/3696] eta: 0:49:55 lr: 0.000100 loss: 6.1476 (6.6716) at: 6.1476 (6.6716) at_unscaled: 6.1476 (6.6716) time: 0.8011 data: 0.0375 max mem: 29528
Epoch: [0] [ 160/3696] eta: 0:49:27 lr: 0.000100 loss: 6.0711 (6.6330) at: 6.0711 (6.6330) at_unscaled: 6.0711 (6.6330) time: 0.7585 data: 0.0372 max mem: 29528
Epoch: [0] [ 170/3696] eta: 0:49:10 lr: 0.000100 loss: 6.0247 (6.5969) at: 6.0247 (6.5969) at_unscaled: 6.0247 (6.5969) time: 0.7769 data: 0.0358 max mem: 29528
Epoch: [0] [ 180/3696] eta: 0:49:27 lr: 0.000100 loss: 5.9822 (6.5631) at: 5.9822 (6.5631) at_unscaled: 5.9822 (6.5631) time: 0.8812 data: 0.0361 max mem: 29528
Epoch: [0] [ 190/3696] eta: 0:49:06 lr: 0.000100 loss: 5.9351 (6.5278) at: 5.9351 (6.5278) at_unscaled: 5.9351 (6.5278) time: 0.8712 data: 0.0371 max mem: 29528
Epoch: [0] [ 200/3696] eta: 0:48:45 lr: 0.000100 loss: 5.8904 (6.4953) at: 5.8904 (6.4953) at_unscaled: 5.8904 (6.4953) time: 0.7744 data: 0.0355 max mem: 29528
Epoch: [0] [ 210/3696] eta: 0:48:35 lr: 0.000100 loss: 5.8645 (6.4635) at: 5.8645 (6.4635) at_unscaled: 5.8645 (6.4635) time: 0.7968 data: 0.0348 max mem: 29528
Epoch: [0] [ 220/3696] eta: 0:48:17 lr: 0.000100 loss: 5.8032 (6.4343) at: 5.8032 (6.4343) at_unscaled: 5.8032 (6.4343) time: 0.7998 data: 0.0354 max mem: 29528
Epoch: [0] [ 230/3696] eta: 0:47:58 lr: 0.000100 loss: 5.7949 (6.4067) at: 5.7949 (6.4067) at_unscaled: 5.7949 (6.4067) time: 0.7687 data: 0.0362 max mem: 29528
Epoch: [0] [ 240/3696] eta: 0:47:45 lr: 0.000100 loss: 5.7568 (6.3776) at: 5.7568 (6.3776) at_unscaled: 5.7568 (6.3776) time: 0.7808 data: 0.0371 max mem: 29528
Epoch: [0] [ 250/3696] eta: 0:47:30 lr: 0.000100 loss: 5.7063 (6.3502) at: 5.7063 (6.3502) at_unscaled: 5.7063 (6.3502) time: 0.7889 data: 0.0366 max mem: 29528
Epoch: [0] [ 260/3696] eta: 0:47:11 lr: 0.000100 loss: 5.6821 (6.3225) at: 5.6821 (6.3225) at_unscaled: 5.6821 (6.3225) time: 0.7617 data: 0.0362 max mem: 29528
Epoch: [0] [ 270/3696] eta: 0:47:00 lr: 0.000100 loss: 5.6091 (6.2965) at: 5.6091 (6.2965) at_unscaled: 5.6091 (6.2965) time: 0.7725 data: 0.0366 max mem: 29528
Epoch: [0] [ 280/3696] eta: 0:46:48 lr: 0.000100 loss: 5.6024 (6.2713) at: 5.6024 (6.2713) at_unscaled: 5.6024 (6.2713) time: 0.7982 data: 0.0366 max mem: 29528
Epoch: [0] [ 290/3696] eta: 0:46:48 lr: 0.000100 loss: 5.5578 (6.2455) at: 5.5578 (6.2455) at_unscaled: 5.5578 (6.2455) time: 0.8433 data: 0.0370 max mem: 29528
Epoch: [0] [ 300/3696] eta: 0:46:36 lr: 0.000100 loss: 5.5396 (6.2221) at: 5.5396 (6.2221) at_unscaled: 5.5396 (6.2221) time: 0.8398 data: 0.0373 max mem: 29528
Epoch: [0] [ 310/3696] eta: 0:46:23 lr: 0.000100 loss: 5.5059 (6.1994) at: 5.5059 (6.1994) at_unscaled: 5.5059 (6.1994) time: 0.7842 data: 0.0374 max mem: 29528
Epoch: [0] [ 320/3696] eta: 0:46:12 lr: 0.000100 loss: 5.4888 (6.1767) at: 5.4888 (6.1767) at_unscaled: 5.4888 (6.1767) time: 0.7882 data: 0.0370 max mem: 29528
Epoch: [0] [ 330/3696] eta: 0:45:58 lr: 0.000100 loss: 5.4756 (6.1560) at: 5.4756 (6.1560) at_unscaled: 5.4756 (6.1560) time: 0.7820 data: 0.0365 max mem: 29528
Epoch: [0] [ 340/3696] eta: 0:45:49 lr: 0.000100 loss: 5.4458 (6.1354) at: 5.4458 (6.1354) at_unscaled: 5.4458 (6.1354) time: 0.7886 data: 0.0363 max mem: 29528
Epoch: [0] [ 350/3696] eta: 0:45:42 lr: 0.000100 loss: 5.4504 (6.1157) at: 5.4504 (6.1157) at_unscaled: 5.4504 (6.1157) time: 0.8230 data: 0.0364 max mem: 29528
Epoch: [0] [ 360/3696] eta: 0:45:34 lr: 0.000100 loss: 5.4683 (6.0973) at: 5.4683 (6.0973) at_unscaled: 5.4683 (6.0973) time: 0.8292 data: 0.0370 max mem: 29528
Epoch: [0] [ 370/3696] eta: 0:45:30 lr: 0.000100 loss: 5.4665 (6.0802) at: 5.4665 (6.0802) at_unscaled: 5.4665 (6.0802) time: 0.8410 data: 0.0357 max mem: 29528
Epoch: [0] [ 380/3696] eta: 0:45:22 lr: 0.000100 loss: 5.4943 (6.0647) at: 5.4943 (6.0647) at_unscaled: 5.4943 (6.0647) time: 0.8443 data: 0.0360 max mem: 29528
Epoch: [0] [ 390/3696] eta: 0:45:13 lr: 0.000100 loss: 5.4801 (6.0489) at: 5.4801 (6.0489) at_unscaled: 5.4801 (6.0489) time: 0.8209 data: 0.0371 max mem: 29528
Epoch: [0] [ 400/3696] eta: 0:45:14 lr: 0.000100 loss: 5.4442 (6.0338) at: 5.4442 (6.0338) at_unscaled: 5.4442 (6.0338) time: 0.8706 data: 0.0372 max mem: 29528
Epoch: [0] [ 410/3696] eta: 0:45:03 lr: 0.000100 loss: 5.4351 (6.0182) at: 5.4351 (6.0182) at_unscaled: 5.4351 (6.0182) time: 0.8613 data: 0.0376 max mem: 29528
Epoch: [0] [ 420/3696] eta: 0:44:50 lr: 0.000100 loss: 5.3845 (6.0028) at: 5.3845 (6.0028) at_unscaled: 5.3845 (6.0028) time: 0.7759 data: 0.0373 max mem: 29528
Epoch: [0] [ 430/3696] eta: 0:45:03 lr: 0.000100 loss: 5.3922 (5.9884) at: 5.3922 (5.9884) at_unscaled: 5.3922 (5.9884) time: 0.9318 data: 0.0361 max mem: 29528
Epoch: [0] [ 440/3696] eta: 0:44:50 lr: 0.000100 loss: 5.4115 (5.9759) at: 5.4115 (5.9759) at_unscaled: 5.4115 (5.9759) time: 0.9331 data: 0.0361 max mem: 29528
Epoch: [0] [ 450/3696] eta: 0:44:43 lr: 0.000100 loss: 5.4180 (5.9631) at: 5.4180 (5.9631) at_unscaled: 5.4180 (5.9631) time: 0.8017 data: 0.0359 max mem: 29528
Epoch: [0] [ 460/3696] eta: 0:44:29 lr: 0.000100 loss: 5.3881 (5.9501) at: 5.3881 (5.9501) at_unscaled: 5.3881 (5.9501) time: 0.7948 data: 0.0355 max mem: 29528
Epoch: [0] [ 470/3696] eta: 0:44:18 lr: 0.000100 loss: 5.3906 (5.9391) at: 5.3906 (5.9391) at_unscaled: 5.3906 (5.9391) time: 0.7668 data: 0.0371 max mem: 29528
Epoch: [0] [ 480/3696] eta: 0:44:10 lr: 0.000100 loss: 5.3906 (5.9277) at: 5.3906 (5.9277) at_unscaled: 5.3906 (5.9277) time: 0.8013 data: 0.0390 max mem: 29528
Epoch: [0] [ 490/3696] eta: 0:44:03 lr: 0.000100 loss: 5.4143 (5.9179) at: 5.4143 (5.9179) at_unscaled: 5.4143 (5.9179) time: 0.8300 data: 0.0391 max mem: 29528
Epoch: [0] [ 500/3696] eta: 0:43:54 lr: 0.000100 loss: 5.4093 (5.9075) at: 5.4093 (5.9075) at_unscaled: 5.4093 (5.9075) time: 0.8303 data: 0.0378 max mem: 29528
Epoch: [0] [ 510/3696] eta: 0:43:43 lr: 0.000100 loss: 5.3890 (5.8972) at: 5.3890 (5.8972) at_unscaled: 5.3890 (5.8972) time: 0.7958 data: 0.0367 max mem: 29528
Epoch: [0] [ 520/3696] eta: 0:43:31 lr: 0.000100 loss: 5.3959 (5.8872) at: 5.3959 (5.8872) at_unscaled: 5.3959 (5.8872) time: 0.7730 data: 0.0355 max mem: 29528
Epoch: [0] [ 530/3696] eta: 0:43:22 lr: 0.000100 loss: 5.3743 (5.8775) at: 5.3743 (5.8775) at_unscaled: 5.3743 (5.8775) time: 0.7915 data: 0.0358 max mem: 29528
Epoch: [0] [ 540/3696] eta: 0:43:12 lr: 0.000100 loss: 5.3725 (5.8675) at: 5.3725 (5.8675) at_unscaled: 5.3725 (5.8675) time: 0.8013 data: 0.0355 max mem: 29528
Epoch: [0] [ 550/3696] eta: 0:43:02 lr: 0.000100 loss: 5.3403 (5.8580) at: 5.3403 (5.8580) at_unscaled: 5.3403 (5.8580) time: 0.7922 data: 0.0349 max mem: 29528
Epoch: [0] [ 560/3696] eta: 0:42:52 lr: 0.000100 loss: 5.3460 (5.8494) at: 5.3460 (5.8494) at_unscaled: 5.3460 (5.8494) time: 0.7893 data: 0.0355 max mem: 29528
Epoch: [0] [ 570/3696] eta: 0:42:43 lr: 0.000100 loss: 5.3509 (5.8408) at: 5.3509 (5.8408) at_unscaled: 5.3509 (5.8408) time: 0.7901 data: 0.0359 max mem: 29528
Epoch: [0] [ 580/3696] eta: 0:42:31 lr: 0.000100 loss: 5.3509 (5.8328) at: 5.3509 (5.8328) at_unscaled: 5.3509 (5.8328) time: 0.7762 data: 0.0358 max mem: 29528
Epoch: [0] [ 590/3696] eta: 0:42:22 lr: 0.000100 loss: 5.3572 (5.8243) at: 5.3572 (5.8243) at_unscaled: 5.3572 (5.8243) time: 0.7785 data: 0.0351 max mem: 29528
Epoch: [0] [ 600/3696] eta: 0:42:11 lr: 0.000100 loss: 5.3541 (5.8163) at: 5.3541 (5.8163) at_unscaled: 5.3541 (5.8163) time: 0.7857 data: 0.0343 max mem: 29528
Epoch: [0] [ 610/3696] eta: 0:41:59 lr: 0.000100 loss: 5.3445 (5.8085) at: 5.3445 (5.8085) at_unscaled: 5.3445 (5.8085) time: 0.7585 data: 0.0351 max mem: 29528
Epoch: [0] [ 620/3696] eta: 0:41:54 lr: 0.000100 loss: 5.3499 (5.8015) at: 5.3499 (5.8015) at_unscaled: 5.3499 (5.8015) time: 0.8055 data: 0.0354 max mem: 29528
Epoch: [0] [ 630/3696] eta: 0:41:42 lr: 0.000100 loss: 5.3499 (5.7940) at: 5.3499 (5.7940) at_unscaled: 5.3499 (5.7940) time: 0.8031 data: 0.0343 max mem: 29528
Epoch: [0] [ 640/3696] eta: 0:41:31 lr: 0.000100 loss: 5.3273 (5.7865) at: 5.3273 (5.7865) at_unscaled: 5.3273 (5.7865) time: 0.7553 data: 0.0356 max mem: 29528
Epoch: [0] [ 650/3696] eta: 0:41:22 lr: 0.000100 loss: 5.3314 (5.7792) at: 5.3314 (5.7792) at_unscaled: 5.3314 (5.7792) time: 0.7825 data: 0.0378 max mem: 29528
Epoch: [0] [ 660/3696] eta: 0:41:16 lr: 0.000100 loss: 5.3259 (5.7719) at: 5.3259 (5.7719) at_unscaled: 5.3259 (5.7719) time: 0.8199 data: 0.0371 max mem: 29528
Epoch: [0] [ 670/3696] eta: 0:41:06 lr: 0.000100 loss: 5.2930 (5.7651) at: 5.2930 (5.7651) at_unscaled: 5.2930 (5.7651) time: 0.8170 data: 0.0351 max mem: 29528
Epoch: [0] [ 680/3696] eta: 0:40:57 lr: 0.000100 loss: 5.2930 (5.7582) at: 5.2930 (5.7582) at_unscaled: 5.2930 (5.7582) time: 0.7851 data: 0.0354 max mem: 29528
Epoch: [0] [ 690/3696] eta: 0:40:49 lr: 0.000100 loss: 5.2727 (5.7514) at: 5.2727 (5.7514) at_unscaled: 5.2727 (5.7514) time: 0.8068 data: 0.0353 max mem: 29528
Epoch: [0] [ 700/3696] eta: 0:40:41 lr: 0.000100 loss: 5.2917 (5.7451) at: 5.2917 (5.7451) at_unscaled: 5.2917 (5.7451) time: 0.8184 data: 0.0348 max mem: 29528
Epoch: [0] [ 710/3696] eta: 0:40:31 lr: 0.000100 loss: 5.2949 (5.7387) at: 5.2949 (5.7387) at_unscaled: 5.2949 (5.7387) time: 0.7904 data: 0.0358 max mem: 29528
Epoch: [0] [ 720/3696] eta: 0:40:21 lr: 0.000100 loss: 5.2874 (5.7325) at: 5.2874 (5.7325) at_unscaled: 5.2874 (5.7325) time: 0.7719 data: 0.0376 max mem: 29528
Epoch: [0] [ 730/3696] eta: 0:40:10 lr: 0.000100 loss: 5.2801 (5.7262) at: 5.2801 (5.7262) at_unscaled: 5.2801 (5.7262) time: 0.7581 data: 0.0372 max mem: 29528
Epoch: [0] [ 740/3696] eta: 0:40:02 lr: 0.000100 loss: 5.2634 (5.7196) at: 5.2634 (5.7196) at_unscaled: 5.2634 (5.7196) time: 0.7769 data: 0.0357 max mem: 29528
Epoch: [0] [ 750/3696] eta: 0:39:53 lr: 0.000100 loss: 5.2367 (5.7135) at: 5.2367 (5.7135) at_unscaled: 5.2367 (5.7135) time: 0.8039 data: 0.0365 max mem: 29528
Epoch: [0] [ 760/3696] eta: 0:39:43 lr: 0.000100 loss: 5.2874 (5.7082) at: 5.2874 (5.7082) at_unscaled: 5.2874 (5.7082) time: 0.7800 data: 0.0367 max mem: 29528
Epoch: [0] [ 770/3696] eta: 0:39:33 lr: 0.000100 loss: 5.2954 (5.7024) at: 5.2954 (5.7024) at_unscaled: 5.2954 (5.7024) time: 0.7681 data: 0.0356 max mem: 29528
Epoch: [0] [ 780/3696] eta: 0:39:23 lr: 0.000100 loss: 5.3127 (5.6975) at: 5.3127 (5.6975) at_unscaled: 5.3127 (5.6975) time: 0.7632 data: 0.0361 max mem: 29528
Epoch: [0] [ 790/3696] eta: 0:39:14 lr: 0.000100 loss: 5.3130 (5.6919) at: 5.3130 (5.6919) at_unscaled: 5.3130 (5.6919) time: 0.7715 data: 0.0359 max mem: 29528
Epoch: [0] [ 800/3696] eta: 0:39:06 lr: 0.000100 loss: 5.2498 (5.6860) at: 5.2498 (5.6860) at_unscaled: 5.2498 (5.6860) time: 0.7954 data: 0.0369 max mem: 29528
Epoch: [0] [ 810/3696] eta: 0:38:58 lr: 0.000100 loss: 5.2336 (5.6804) at: 5.2336 (5.6804) at_unscaled: 5.2336 (5.6804) time: 0.8095 data: 0.0380 max mem: 29528
Epoch: [0] [ 820/3696] eta: 0:38:50 lr: 0.000100 loss: 5.2354 (5.6755) at: 5.2354 (5.6755) at_unscaled: 5.2354 (5.6755) time: 0.8130 data: 0.0356 max mem: 29528
Epoch: [0] [ 830/3696] eta: 0:38:39 lr: 0.000100 loss: 5.2691 (5.6704) at: 5.2691 (5.6704) at_unscaled: 5.2691 (5.6704) time: 0.7757 data: 0.0355 max mem: 29528
Epoch: [0] [ 840/3696] eta: 0:38:31 lr: 0.000100 loss: 5.2588 (5.6653) at: 5.2588 (5.6653) at_unscaled: 5.2588 (5.6653) time: 0.7692 data: 0.0369 max mem: 29528
Epoch: [0] [ 850/3696] eta: 0:38:23 lr: 0.000100 loss: 5.2564 (5.6606) at: 5.2564 (5.6606) at_unscaled: 5.2564 (5.6606) time: 0.8133 data: 0.0363 max mem: 29528
Epoch: [0] [ 860/3696] eta: 0:38:15 lr: 0.000100 loss: 5.2448 (5.6556) at: 5.2448 (5.6556) at_unscaled: 5.2448 (5.6556) time: 0.8129 data: 0.0352 max mem: 29528
Epoch: [0] [ 870/3696] eta: 0:38:05 lr: 0.000100 loss: 5.2326 (5.6506) at: 5.2326 (5.6506) at_unscaled: 5.2326 (5.6506) time: 0.7795 data: 0.0351 max mem: 29528
Epoch: [0] [ 880/3696] eta: 0:37:56 lr: 0.000100 loss: 5.2049 (5.6456) at: 5.2049 (5.6456) at_unscaled: 5.2049 (5.6456) time: 0.7750 data: 0.0364 max mem: 29528
Epoch: [0] [ 890/3696] eta: 0:37:47 lr: 0.000100 loss: 5.2049 (5.6407) at: 5.2049 (5.6407) at_unscaled: 5.2049 (5.6407) time: 0.7812 data: 0.0367 max mem: 29528
Epoch: [0] [ 900/3696] eta: 0:37:37 lr: 0.000100 loss: 5.1690 (5.6354) at: 5.1690 (5.6354) at_unscaled: 5.1690 (5.6354) time: 0.7607 data: 0.0348 max mem: 29528
Epoch: [0] [ 910/3696] eta: 0:37:31 lr: 0.000100 loss: 5.1836 (5.6309) at: 5.1836 (5.6309) at_unscaled: 5.1836 (5.6309) time: 0.8035 data: 0.0355 max mem: 29528
Epoch: [0] [ 920/3696] eta: 0:37:22 lr: 0.000100 loss: 5.2129 (5.6261) at: 5.2129 (5.6261) at_unscaled: 5.2129 (5.6261) time: 0.8221 data: 0.0381 max mem: 29528
Epoch: [0] [ 930/3696] eta: 0:37:13 lr: 0.000100 loss: 5.1586 (5.6210) at: 5.1586 (5.6210) at_unscaled: 5.1586 (5.6210) time: 0.7758 data: 0.0377 max mem: 29528
Epoch: [0] [ 940/3696] eta: 0:37:05 lr: 0.000100 loss: 5.1586 (5.6162) at: 5.1586 (5.6162) at_unscaled: 5.1586 (5.6162) time: 0.7975 data: 0.0355 max mem: 29528
Epoch: [0] [ 950/3696] eta: 0:36:56 lr: 0.000100 loss: 5.1713 (5.6120) at: 5.1713 (5.6120) at_unscaled: 5.1713 (5.6120) time: 0.7970 data: 0.0358 max mem: 29528
Epoch: [0] [ 960/3696] eta: 0:36:47 lr: 0.000100 loss: 5.1839 (5.6077) at: 5.1839 (5.6077) at_unscaled: 5.1839 (5.6077) time: 0.7714 data: 0.0367 max mem: 29528
Epoch: [0] [ 970/3696] eta: 0:36:38 lr: 0.000100 loss: 5.1800 (5.6036) at: 5.1800 (5.6036) at_unscaled: 5.1800 (5.6036) time: 0.7812 data: 0.0363 max mem: 29528
Epoch: [0] [ 980/3696] eta: 0:36:30 lr: 0.000100 loss: 5.2028 (5.5995) at: 5.2028 (5.5995) at_unscaled: 5.2028 (5.5995) time: 0.7996 data: 0.0349 max mem: 29528
Epoch: [0] [ 990/3696] eta: 0:36:23 lr: 0.000100 loss: 5.2028 (5.5954) at: 5.2028 (5.5954) at_unscaled: 5.2028 (5.5954) time: 0.8110 data: 0.0353 max mem: 29528
Epoch: [0] [1000/3696] eta: 0:36:14 lr: 0.000100 loss: 5.1880 (5.5914) at: 5.1880 (5.5914) at_unscaled: 5.1880 (5.5914) time: 0.7950 data: 0.0369 max mem: 29528
Epoch: [0] [1010/3696] eta: 0:36:04 lr: 0.000100 loss: 5.1773 (5.5870) at: 5.1773 (5.5870) at_unscaled: 5.1773 (5.5870) time: 0.7645 data: 0.0368 max mem: 29528
Epoch: [0] [1020/3696] eta: 0:35:57 lr: 0.000100 loss: 5.2493 (5.5836) at: 5.2493 (5.5836) at_unscaled: 5.2493 (5.5836) time: 0.7915 data: 0.0360 max mem: 29528
Epoch: [0] [1030/3696] eta: 0:35:49 lr: 0.000100 loss: 5.1982 (5.5793) at: 5.1982 (5.5793) at_unscaled: 5.1982 (5.5793) time: 0.8164 data: 0.0363 max mem: 29528
Epoch: [0] [1040/3696] eta: 0:35:41 lr: 0.000100 loss: 5.1446 (5.5754) at: 5.1446 (5.5754) at_unscaled: 5.1446 (5.5754) time: 0.8053 data: 0.0375 max mem: 29528
Epoch: [0] [1050/3696] eta: 0:35:31 lr: 0.000100 loss: 5.1319 (5.5714) at: 5.1319 (5.5714) at_unscaled: 5.1319 (5.5714) time: 0.7766 data: 0.0359 max mem: 29528
Epoch: [0] [1060/3696] eta: 0:35:22 lr: 0.000100 loss: 5.2017 (5.5679) at: 5.2017 (5.5679) at_unscaled: 5.2017 (5.5679) time: 0.7481 data: 0.0365 max mem: 29528
Epoch: [0] [1070/3696] eta: 0:35:13 lr: 0.000100 loss: 5.2017 (5.5642) at: 5.2017 (5.5642) at_unscaled: 5.2017 (5.5642) time: 0.7754 data: 0.0387 max mem: 29528
Epoch: [0] [1080/3696] eta: 0:35:03 lr: 0.000100 loss: 5.1192 (5.5603) at: 5.1192 (5.5603) at_unscaled: 5.1192 (5.5603) time: 0.7605 data: 0.0383 max mem: 29528
Epoch: [0] [1090/3696] eta: 0:34:56 lr: 0.000100 loss: 5.1105 (5.5560) at: 5.1105 (5.5560) at_unscaled: 5.1105 (5.5560) time: 0.7700 data: 0.0379 max mem: 29528
Epoch: [0] [1100/3696] eta: 0:34:47 lr: 0.000100 loss: 5.1321 (5.5524) at: 5.1321 (5.5524) at_unscaled: 5.1321 (5.5524) time: 0.8007 data: 0.0380 max mem: 29528
Epoch: [0] [1110/3696] eta: 0:34:39 lr: 0.000100 loss: 5.1603 (5.5489) at: 5.1603 (5.5489) at_unscaled: 5.1603 (5.5489) time: 0.7850 data: 0.0382 max mem: 29528
Epoch: [0] [1120/3696] eta: 0:34:30 lr: 0.000100 loss: 5.1443 (5.5452) at: 5.1443 (5.5452) at_unscaled: 5.1443 (5.5452) time: 0.7765 data: 0.0383 max mem: 29528
Epoch: [0] [1130/3696] eta: 0:34:21 lr: 0.000100 loss: 5.1185 (5.5413) at: 5.1185 (5.5413) at_unscaled: 5.1185 (5.5413) time: 0.7790 data: 0.0372 max mem: 29528
Epoch: [0] [1140/3696] eta: 0:34:13 lr: 0.000100 loss: 5.0800 (5.5374) at: 5.0800 (5.5374) at_unscaled: 5.0800 (5.5374) time: 0.7986 data: 0.0356 max mem: 29528
Epoch: [0] [1150/3696] eta: 0:34:04 lr: 0.000100 loss: 5.1101 (5.5337) at: 5.1101 (5.5337) at_unscaled: 5.1101 (5.5337) time: 0.7654 data: 0.0345 max mem: 29528
Epoch: [0] [1160/3696] eta: 0:33:56 lr: 0.000100 loss: 5.1744 (5.5307) at: 5.1744 (5.5307) at_unscaled: 5.1744 (5.5307) time: 0.7695 data: 0.0344 max mem: 29528
Epoch: [0] [1170/3696] eta: 0:33:47 lr: 0.000100 loss: 5.1829 (5.5277) at: 5.1829 (5.5277) at_unscaled: 5.1829 (5.5277) time: 0.7968 data: 0.0362 max mem: 29528
Epoch: [0] [1180/3696] eta: 0:33:40 lr: 0.000100 loss: 5.1845 (5.5246) at: 5.1845 (5.5246) at_unscaled: 5.1845 (5.5246) time: 0.8120 data: 0.0374 max mem: 29528
Epoch: [0] [1190/3696] eta: 0:33:32 lr: 0.000100 loss: 5.1798 (5.5216) at: 5.1798 (5.5216) at_unscaled: 5.1798 (5.5216) time: 0.8169 data: 0.0371 max mem: 29528
Epoch: [0] [1200/3696] eta: 0:33:23 lr: 0.000100 loss: 5.1929 (5.5188) at: 5.1929 (5.5188) at_unscaled: 5.1929 (5.5188) time: 0.7739 data: 0.0361 max mem: 29528
Epoch: [0] [1210/3696] eta: 0:33:16 lr: 0.000100 loss: 5.1929 (5.5158) at: 5.1929 (5.5158) at_unscaled: 5.1929 (5.5158) time: 0.7985 data: 0.0340 max mem: 29528
Epoch: [0] [1220/3696] eta: 0:33:07 lr: 0.000100 loss: 5.1322 (5.5126) at: 5.1322 (5.5126) at_unscaled: 5.1322 (5.5126) time: 0.8027 data: 0.0350 max mem: 29528
Epoch: [0] [1230/3696] eta: 0:32:59 lr: 0.000100 loss: 5.1595 (5.5096) at: 5.1595 (5.5096) at_unscaled: 5.1595 (5.5096) time: 0.7881 data: 0.0374 max mem: 29528
Epoch: [0] [1240/3696] eta: 0:32:50 lr: 0.000100 loss: 5.1620 (5.5067) at: 5.1620 (5.5067) at_unscaled: 5.1620 (5.5067) time: 0.7849 data: 0.0365 max mem: 29528
Epoch: [0] [1250/3696] eta: 0:32:42 lr: 0.000100 loss: 5.1620 (5.5038) at: 5.1620 (5.5038) at_unscaled: 5.1620 (5.5038) time: 0.7893 data: 0.0357 max mem: 29528
Epoch: [0] [1260/3696] eta: 0:32:34 lr: 0.000100 loss: 5.1245 (5.5005) at: 5.1245 (5.5005) at_unscaled: 5.1245 (5.5005) time: 0.8002 data: 0.0359 max mem: 29528
Epoch: [0] [1270/3696] eta: 0:32:26 lr: 0.000100 loss: 5.1023 (5.4975) at: 5.1023 (5.4975) at_unscaled: 5.1023 (5.4975) time: 0.8015 data: 0.0362 max mem: 29528
Epoch: [0] [1280/3696] eta: 0:32:17 lr: 0.000100 loss: 5.1132 (5.4946) at: 5.1132 (5.4946) at_unscaled: 5.1132 (5.4946) time: 0.7906 data: 0.0349 max mem: 29528
Epoch: [0] [1290/3696] eta: 0:32:09 lr: 0.000100 loss: 5.1292 (5.4918) at: 5.1292 (5.4918) at_unscaled: 5.1292 (5.4918) time: 0.7743 data: 0.0334 max mem: 29528
Epoch: [0] [1300/3696] eta: 0:32:01 lr: 0.000100 loss: 5.1292 (5.4890) at: 5.1292 (5.4890) at_unscaled: 5.1292 (5.4890) time: 0.7875 data: 0.0339 max mem: 29528
Epoch: [0] [1310/3696] eta: 0:31:54 lr: 0.000100 loss: 5.1232 (5.4863) at: 5.1232 (5.4863) at_unscaled: 5.1232 (5.4863) time: 0.8117 data: 0.0343 max mem: 29528
Epoch: [0] [1320/3696] eta: 0:31:45 lr: 0.000100 loss: 5.1016 (5.4832) at: 5.1016 (5.4832) at_unscaled: 5.1016 (5.4832) time: 0.8161 data: 0.0341 max mem: 29528
Epoch: [0] [1330/3696] eta: 0:31:38 lr: 0.000100 loss: 5.0905 (5.4805) at: 5.0905 (5.4805) at_unscaled: 5.0905 (5.4805) time: 0.8149 data: 0.0343 max mem: 29528
Traceback (most recent call last):
File "main.py", line 257, in <module>
main(args)
File "main.py", line 207, in main
args.clip_max_norm, learning_rate_schedule)
File "/opt/tiger/intro/Stable-Pix2Seq/engine.py", line 98, in train_one_epoch
losses.backward()
File "/home/tiger/.local/lib/python3.7/site-packages/torch/tensor.py", line 245, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/tiger/.local/lib/python3.7/site-packages/torch/autograd/__init__.py", line 147, in backward
allow_unreachable=True, accumulate_grad=True) # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 216.00 MiB (GPU 7; 31.75 GiB total capacity; 29.63 GiB already allocated; 213.75 MiB free; 29.95 GiB reserved in total by PyTorch)
Traceback (most recent call last):
File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/tiger/.local/lib/python3.7/site-packages/torch/distributed/launch.py", line 340, in <module>
main()
File "/home/tiger/.local/lib/python3.7/site-packages/torch/distributed/launch.py", line 326, in main
sigkill_handler(signal.SIGTERM, None) # not coming back
File "/home/tiger/.local/lib/python3.7/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-u', 'main.py', '--coco_path', './coco2017/', '--batch_size', '4', '--lr', '0.0005', '--output_dir', './output']' returned non-zero exit status 1.
Killing subprocess 5627
Killing subprocess 5628
Killing subprocess 5629
Killing subprocess 5630
Killing subprocess 5631
Killing subprocess 5632
Killing subprocess 5633
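
To make the memory growth in the log above easier to see, here is a minimal sketch that scans log lines for the "max mem" field and reports the iterations at which the peak increased. The function name find_mem_jumps and the regular expression are my own and only assume the line layout shown in this log; the value is presumably in MB, but I have not confirmed the unit.

```python
import re

# Assumed line layout, taken from the log in this post:
# "Epoch: [0] [ 140/3696] eta: ... max mem: 29528"
LINE_RE = re.compile(r"Epoch: \[(\d+)\]\s+\[\s*(\d+)/\d+\].*max mem: (\d+)")


def find_mem_jumps(log_lines):
    """Return (epoch, iteration, max_mem) for every line where max mem rose."""
    jumps = []
    peak = None
    for line in log_lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # skip tracebacks and other non-progress lines
        epoch, iteration, mem = (int(g) for g in match.groups())
        if peak is None or mem > peak:
            jumps.append((epoch, iteration, mem))
            peak = mem
    return jumps


if __name__ == "__main__":
    # A few lines copied from the log above (middle fields shortened).
    sample = [
        "Epoch: [0] [ 0/3696] eta: 2:32:25 lr: 0.000100 max mem: 14737",
        "Epoch: [0] [ 10/3696] eta: 0:59:14 lr: 0.000100 max mem: 25656",
        "Epoch: [0] [ 20/3696] eta: 0:56:49 lr: 0.000100 max mem: 25656",
        "Epoch: [0] [ 60/3696] eta: 0:53:44 lr: 0.000100 max mem: 26623",
        "Epoch: [0] [ 140/3696] eta: 0:50:25 lr: 0.000100 max mem: 29528",
        "Epoch: [0] [1330/3696] eta: 0:31:38 lr: 0.000100 max mem: 29528",
    ]
    for epoch, iteration, mem in find_mem_jumps(sample):
        print(f"epoch {epoch} iter {iteration}: max mem {mem}")
```

On this log the peak rises in a few discrete steps (14737, 25656, 26623, 29528) and then stays flat from iteration 140 until the crash after iteration 1330. That might point to occasional larger batches rather than a steady leak, though that is only a guess from the numbers.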