Instructions To Reproduce the 🐛 Bug:
- what changes you made (git diff) or what code you wrote:
No changes were made to the code.
- what exact command you ran:
python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --coco_path ../data/COCO2017 --output_dir output/conddetr_r50_epoch50
- what you observed (including full logs; see the note on the failing assertion after this list):
| distributed init (rank 2): env://
| distributed init (rank 0): env://
| distributed init (rank 4): env://
| distributed init (rank 3): env://
| distributed init (rank 5): env://
| distributed init (rank 1): env://
| distributed init (rank 7): env://
| distributed init (rank 6): env://
fatal: Not a git repository (or any parent up to mount point /research/d4)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
fatal: Not a git repository (or any parent up to mount point /research/d4)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
fatal: Not a git repository (or any parent up to mount point /research/d4)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
fatal: Not a git repository (or any parent up to mount point /research/d4)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
fatal: Not a git repository (or any parent up to mount point /research/d4)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
fatal: Not a git repository (or any parent up to mount point /research/d4)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
fatal: Not a git repository (or any parent up to mount point /research/d4)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
git:
sha: N/A, status: clean, branch: N/A
fatal: Not a git repository (or any parent up to mount point /research/d4)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
Namespace(aux_loss=True, backbone='resnet50', batch_size=2, bbox_loss_coef=5, clip_max_norm=0.1, cls_loss_coef=2, coco_panoptic_path=None, coco_path='../data/COCO2017', dataset_file='coco', dec_layers=6, device='cuda', dice_loss_coef=1, dilation=False, dim_feedforward=2048, dist_backend='nccl', dist_url='env://', distributed=True, dropout=0.1, enc_layers=6, epochs=50, eval=False, focal_alpha=0.25, frozen_weights=None, giou_loss_coef=2, gpu=0, hidden_dim=256, lr=0.0001, lr_backbone=1e-05, lr_drop=40, mask_loss_coef=1, masks=False, nheads=8, num_queries=300, num_workers=2, output_dir='output/conddetr_r50_epoch50', position_embedding='sine', pre_norm=False, rank=0, remove_difficult=False, resume='', seed=42, set_cost_bbox=5, set_cost_class=2, set_cost_giou=2, start_epoch=0, weight_decay=0.0001, world_size=8)
number of params: 43196001
loading annotations into memory...
Done (t=20.78s)
creating index...
index created!
loading annotations into memory...
Done (t=0.56s)
creating index...
index created!
Start training
Epoch: [0] [ 0/7393] eta: 7:05:21 lr: 0.000100 class_error: 85.57 loss: 45.1821 (45.1821) loss_bbox: 3.7751 (3.7751) loss_bbox_0: 3.7823 (3.7823) loss_bbox_1: 3.7808 (3.7808) loss_bbox_2: 3.7756 (3.7756) loss_bbox_3: 3.7911 (3.7911) loss_bbox_4: 3.7856 (3.7856) loss_ce: 1.9574 (1.9574) loss_ce_0: 2.0151 (2.0151) loss_ce_1: 2.0196 (2.0196) loss_ce_2: 2.1484 (2.1484) loss_ce_3: 2.0683 (2.0683) loss_ce_4: 2.0683 (2.0683) loss_giou: 1.7011 (1.7011) loss_giou_0: 1.7000 (1.7000) loss_giou_1: 1.7040 (1.7040) loss_giou_2: 1.7059 (1.7059) loss_giou_3: 1.7022 (1.7022) loss_giou_4: 1.7012 (1.7012) cardinality_error_unscaled: 293.1250 (293.1250) cardinality_error_0_unscaled: 293.1250 (293.1250) cardinality_error_1_unscaled: 293.1250 (293.1250) cardinality_error_2_unscaled: 281.9375 (281.9375) cardinality_error_3_unscaled: 293.1250 (293.1250) cardinality_error_4_unscaled: 293.1250 (293.1250) class_error_unscaled: 85.5712 (85.5712) loss_bbox_unscaled: 0.7550 (0.7550) loss_bbox_0_unscaled: 0.7565 (0.7565) loss_bbox_1_unscaled: 0.7562 (0.7562) loss_bbox_2_unscaled: 0.7551 (0.7551) loss_bbox_3_unscaled: 0.7582 (0.7582) loss_bbox_4_unscaled: 0.7571 (0.7571) loss_ce_unscaled: 0.9787 (0.9787) loss_ce_0_unscaled: 1.0076 (1.0076) loss_ce_1_unscaled: 1.0098 (1.0098) loss_ce_2_unscaled: 1.0742 (1.0742) loss_ce_3_unscaled: 1.0341 (1.0341) loss_ce_4_unscaled: 1.0342 (1.0342) loss_giou_unscaled: 0.8506 (0.8506) loss_giou_0_unscaled: 0.8500 (0.8500) loss_giou_1_unscaled: 0.8520 (0.8520) loss_giou_2_unscaled: 0.8530 (0.8530) loss_giou_3_unscaled: 0.8511 (0.8511) loss_giou_4_unscaled: 0.8506 (0.8506) time: 3.4521 data: 0.4687 max mem: 2932
Epoch: [0] [ 100/7393] eta: 1:17:39 lr: 0.000100 class_error: 85.74 loss: 28.2629 (33.7855) loss_bbox: 1.5517 (2.3437) loss_bbox_0: 1.5566 (2.3695) loss_bbox_1: 1.5482 (2.3519) loss_bbox_2: 1.5535 (2.3396) loss_bbox_3: 1.5641 (2.3476) loss_bbox_4: 1.5637 (2.3431) loss_ce: 1.5467 (1.6584) loss_ce_0: 1.5650 (1.6414) loss_ce_1: 1.5443 (1.6461) loss_ce_2: 1.5557 (1.6477) loss_ce_3: 1.5392 (1.6545) loss_ce_4: 1.5541 (1.6667) loss_giou: 1.5534 (1.6289) loss_giou_0: 1.5514 (1.6296) loss_giou_1: 1.5541 (1.6292) loss_giou_2: 1.5695 (1.6291) loss_giou_3: 1.5526 (1.6289) loss_giou_4: 1.5519 (1.6296) cardinality_error_unscaled: 293.1875 (293.2420) cardinality_error_0_unscaled: 293.1875 (293.2420) cardinality_error_1_unscaled: 293.1875 (293.2420) cardinality_error_2_unscaled: 293.1875 (293.1312) cardinality_error_3_unscaled: 293.1875 (293.2420) cardinality_error_4_unscaled: 293.1875 (293.1658) class_error_unscaled: 75.6680 (75.4478) loss_bbox_unscaled: 0.3103 (0.4687) loss_bbox_0_unscaled: 0.3113 (0.4739) loss_bbox_1_unscaled: 0.3096 (0.4704) loss_bbox_2_unscaled: 0.3107 (0.4679) loss_bbox_3_unscaled: 0.3128 (0.4695) loss_bbox_4_unscaled: 0.3127 (0.4686) loss_ce_unscaled: 0.7733 (0.8292) loss_ce_0_unscaled: 0.7825 (0.8207) loss_ce_1_unscaled: 0.7722 (0.8231) loss_ce_2_unscaled: 0.7779 (0.8239) loss_ce_3_unscaled: 0.7696 (0.8272) loss_ce_4_unscaled: 0.7770 (0.8334) loss_giou_unscaled: 0.7767 (0.8145) loss_giou_0_unscaled: 0.7757 (0.8148) loss_giou_1_unscaled: 0.7771 (0.8146) loss_giou_2_unscaled: 0.7847 (0.8146) loss_giou_3_unscaled: 0.7763 (0.8144) loss_giou_4_unscaled: 0.7760 (0.8148) time: 0.6098 data: 0.0105 max mem: 4353
Traceback (most recent call last):
File "main.py", line 258, in <module>
main(args)
File "main.py", line 206, in main
train_stats = train_one_epoch(
File "/research/d4/gds/zwang21/ConditionalDETR/engine.py", line 41, in train_one_epoch
loss_dict = criterion(outputs, targets)
File "/research/d4/gds/zwang21/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/research/d4/gds/zwang21/ConditionalDETR/models/conditional_detr.py", line 254, in forward
indices = self.matcher(outputs_without_aux, targets)
File "/research/d4/gds/zwang21/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/research/d4/gds/zwang21/anaconda3/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/research/d4/gds/zwang21/ConditionalDETR/models/matcher.py", line 79, in forward
cost_giou = -generalized_box_iou(box_cxcywh_to_xyxy(out_bbox), box_cxcywh_to_xyxy(tgt_bbox))
File "/research/d4/gds/zwang21/ConditionalDETR/util/box_ops.py", line 59, in generalized_box_iou
assert (boxes1[:, 2:] >= boxes1[:, :2]).all()
AssertionError
Traceback (most recent call last):
File "/research/d4/gds/zwang21/anaconda3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/research/d4/gds/zwang21/anaconda3/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/research/d4/gds/zwang21/anaconda3/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in <module>
main()
File "/research/d4/gds/zwang21/anaconda3/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main
sigkill_handler(signal.SIGTERM, None) # not coming back
File "/research/d4/gds/zwang21/anaconda3/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/research/d4/gds/zwang21/anaconda3/bin/python', '-u', 'main.py', '--coco_path', '../data/COCO2017', '--output_dir', 'output/conddetr_r50_epoch50']' returned non-zero exit status 1.
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Killing subprocess 29668
Killing subprocess 29669
Killing subprocess 29670
Killing subprocess 29671
Killing subprocess 29672
Killing subprocess 29673
Killing subprocess 29674
Killing subprocess 29675
- please simplify the steps as much as possible so they do not require additional resources to run, such as a private dataset.
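Note on the failing assertion: the check in util/box_ops.py (line 59 in the traceback) requires every box in [x0, y0, x1, y1] format to satisfy x1 >= x0 and y1 >= y0, and it also fails when any coordinate is NaN or Inf, since comparisons against NaN evaluate to False. The sketch below is a hypothetical, standalone diagnostic (report_bad_boxes is not part of this repo; box_cxcywh_to_xyxy only mirrors the conversion in util/box_ops.py) that distinguishes non-finite predictions, e.g. from a diverging loss, from genuinely degenerate boxes:

```python
import torch

def box_cxcywh_to_xyxy(x):
    # Same conversion as in util/box_ops.py: (cx, cy, w, h) -> (x0, y0, x1, y1)
    x_c, y_c, w, h = x.unbind(-1)
    return torch.stack([x_c - 0.5 * w, y_c - 0.5 * h,
                        x_c + 0.5 * w, y_c + 0.5 * h], dim=-1)

def report_bad_boxes(name, boxes_cxcywh):
    # Hypothetical helper: flags the two conditions that make
    # (boxes[:, 2:] >= boxes[:, :2]).all() in generalized_box_iou fail.
    if not torch.isfinite(boxes_cxcywh).all():
        print(f"{name}: non-finite (NaN/Inf) coordinates found")
    xyxy = box_cxcywh_to_xyxy(boxes_cxcywh)
    degenerate = (xyxy[..., 2:] < xyxy[..., :2]).any(dim=-1)
    if degenerate.any():
        print(f"{name}: {int(degenerate.sum())} boxes with negative width/height")
        print(boxes_cxcywh[degenerate][:3])

# Example inputs: one NaN prediction and one box with negative width
preds = torch.tensor([[0.5, 0.5, 0.2, 0.1],
                      [float("nan"), 0.5, 0.1, 0.1],
                      [0.5, 0.5, -0.2, 0.1]])
report_bad_boxes("pred boxes", preds)
```

Either condition would explain the traceback above; I have not yet determined which one applies in this run.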
Expected behavior:
Training should run through the full 50 epochs without crashing; instead it fails during the first epoch with the AssertionError in generalized_box_iou shown above.
Environment:
Provide your environment information using the following command:
python -m torch.utils.collect_env
Collecting environment information...
PyTorch version: 1.8.0
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A
OS: CentOS Linux release 7.9.2009 (Core) (x86_64)
GCC version: (GCC) 11.2.0
Clang version: Could not collect
CMake version: version 2.8.12.2
Python version: 3.8 (64-bit runtime)
Is CUDA available: False
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.22.2
[pip3] numpydoc==1.1.0
[pip3] pytorch-ignite==0.2.0
[pip3] pytorch-metric-learning==0.9.99
[pip3] torch==1.8.0
[pip3] torchaudio==0.8.0a0+a751e1d
[pip3] torchfile==0.1.0
[pip3] torchsampler==0.1.1
[pip3] torchsummary==1.5.1
[pip3] torchvision==0.9.0
[conda] blas 1.0 mkl
[conda] cudatoolkit 10.2.89 hfd86e86_1
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] mkl 2021.2.0 h06a4308_296
[conda] mkl-service 2.3.0 py38h27cfd23_1
[conda] mkl_fft 1.3.0 py38h42c9631_2
[conda] mkl_random 1.2.1 py38ha9443f7_2
[conda] numpy 1.22.2 pypi_0 pypi
[conda] numpydoc 1.1.0 pyhd3eb1b0_1
[conda] pytorch 1.8.0 py3.8_cuda10.2_cudnn7.6.5_0 pytorch
[conda] pytorch-ignite 0.2.0 pypi_0 pypi
[conda] pytorch-metric-learning 0.9.99 pypi_0 pypi
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] torch 1.10.0 pypi_0 pypi
[conda] torchaudio 0.8.0 py38 pytorch
[conda] torchfile 0.1.0 pypi_0 pypi
[conda] torchsampler 0.1.1 pypi_0 pypi
[conda] torchsummary 1.5.1 pypi_0 pypi
[conda] torchvision 0.9.0 py38_cu102 pytorch