Hi,
I really enjoyed reading your paper and code. Great work.
I am trying to reproduce your results by running the code on an HPC cluster (a single node with 2 GPUs). As described in the training section of the README, I ran the following command in an interactive SLURM session:
"
python -m torch.distributed.launch --nproc_per_node=2 --use_env src/train.py with \ crowdhuman
deformable
multi_frame
tracking
output_dir=models/crowdhuman_deformable_multi_frame \ "
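For context, my understanding is that --use_env makes the launcher export RANK, WORLD_SIZE and LOCAL_RANK as environment variables for each of the two processes, and that dist_url='env://' in the config then reads them back. Roughly like this sketch (my own summary of the init path, not your actual train.py code):

```python
import os
import torch

# Sketch of the env:// init path as I understand it (not the repo's code).
# torch.distributed.launch --use_env exports RANK, WORLD_SIZE and LOCAL_RANK
# for each spawned process; init_method="env://" reads them back.
rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])
local_rank = int(os.environ["LOCAL_RANK"])

torch.cuda.set_device(local_rank)
torch.distributed.init_process_group(
    backend="nccl",
    init_method="env://",
    world_size=world_size,
    rank=rank,
)
```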
However, the run hangs at this line:
"model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu], find_unused_parameters=True)"
Could you please help me figure out what is going wrong? The output up to the point where it hangs is below:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
WARNING - root - Changed type of config entry "train_split" from str to NoneType
WARNING - train - No observers have been added to this run
WARNING - root - Changed type of config entry "train_split" from str to NoneType
WARNING - train - No observers have been added to this run
INFO - train - Running command 'load_config'
INFO - train - Started
INFO - train - Running command 'load_config'
INFO - train - Started
Configuration (modified, added, typechanged, doc):
aux_loss = True
backbone = 'resnet50'
batch_size = 1
bbox_loss_coef = 5.0
clip_max_norm = 0.1
cls_loss_coef = 2.0
coco_and_crowdhuman_prev_frame_rnd_augs = 0.2
coco_min_num_objects = 0
coco_panoptic_path = None
coco_path = 'data/coco_2017'
coco_person_train_split = None
crowdhuman_path = 'data/CrowdHuman'
crowdhuman_train_split = 'train_val'
dataset = 'mot_crowdhuman'
debug = False
dec_layers = 6
dec_n_points = 4
deformable = True
device = 'cuda'
dice_loss_coef = 1.0
dilation = False
dim_feedforward = 1024
dist_url = 'env://'
dropout = 0.1
enc_layers = 6
enc_n_points = 4
eos_coef = 0.1
epochs = 80
eval_only = False
eval_train = False
focal_alpha = 0.25
focal_gamma = 2
focal_loss = True
freeze_detr = False
giou_loss_coef = 2
hidden_dim = 288
load_mask_head_from_model = None
lr = 0.0002
lr_backbone = 2e-05
lr_backbone_names = ['backbone.0']
lr_drop = 50
lr_linear_proj_mult = 0.1
lr_linear_proj_names = ['reference_points', 'sampling_offsets']
lr_track = 0.0001
mask_loss_coef = 1.0
masks = False
merge_frame_features = False
mot_path_train = 'data/MOT17'
mot_path_val = 'data/MOT17'
multi_frame_attention = True
multi_frame_attention_separate_encoder = True
multi_frame_encoding = True
nheads = 8
no_vis = False
num_feature_levels = 4
num_queries = 500
num_workers = 2
output_dir = 'models/crowdhuman_deformable_multi_frame'
overflow_boxes = True
overwrite_lr_scheduler = False
overwrite_lrs = False
position_embedding = 'sine'
pre_norm = False
resume = ''
resume_optim = False
resume_shift_neuron = False
resume_vis = False
save_model_interval = 5
seed = 42
set_cost_bbox = 5.0
set_cost_class = 2.0
set_cost_giou = 2.0
start_epoch = 1
track_attention = False
track_backprop_prev_frame = False
track_prev_frame_range = 5
track_prev_frame_rnd_augs = 0.01
track_prev_prev_frame = False
track_query_false_negative_prob = 0.4
track_query_false_positive_eos_weight = True
track_query_false_positive_prob = 0.1
tracking = True
tracking_eval = True
train_split = None
two_stage = False
val_interval = 5
val_split = 'mot17_train_cross_val_frame_0_5_to_1_0_coco'
vis_and_log_interval = 50
vis_port = 8097
vis_server = ''
weight_decay = 0.0001
with_box_refine = True
world_size = 2
img_transform:
max_size = 1333
val_width = 800
INFO - train - Completed after 0:00:00
Namespace(aux_loss=True, backbone='resnet50', batch_size=1, bbox_loss_coef=5.0, clip_max_norm=0.1, cls_loss_coef=2.0, coco_and_crowdhuman_prev_frame_rnd_augs=0.2, coco_min_num_objects=0, coco_panoptic_path=None, coco_path='data/coco_2017', coco_person_train_split=None, crowdhuman_path='data/CrowdHuman', crowdhuman_train_split='train_val', dataset='mot_crowdhuman', debug=False, dec_layers=6, dec_n_points=4, deformable=True, device='cuda', dice_loss_coef=1.0, dilation=False, dim_feedforward=1024, dist_url='env://', dropout=0.1, enc_layers=6, enc_n_points=4, eos_coef=0.1, epochs=80, eval_only=False, eval_train=False, focal_alpha=0.25, focal_gamma=2, focal_loss=True, freeze_detr=False, giou_loss_coef=2, hidden_dim=288, img_transform=Namespace(max_size=1333, val_width=800), load_mask_head_from_model=None, lr=0.0002, lr_backbone=2e-05, lr_backbone_names=['backbone.0'], lr_drop=50, lr_linear_proj_mult=0.1, lr_linear_proj_names=['reference_points', 'sampling_offsets'], lr_track=0.0001, mask_loss_coef=1.0, masks=False, merge_frame_features=False, mot_path_train='data/MOT17', mot_path_val='data/MOT17', multi_frame_attention=True, multi_frame_attention_separate_encoder=True, multi_frame_encoding=True, nheads=8, no_vis=False, num_feature_levels=4, num_queries=500, num_workers=2, output_dir='models/crowdhuman_deformable_multi_frame', overflow_boxes=True, overwrite_lr_scheduler=False, overwrite_lrs=False, position_embedding='sine', pre_norm=False, resume='', resume_optim=False, resume_shift_neuron=False, resume_vis=False, save_model_interval=5, seed=42, set_cost_bbox=5.0, set_cost_class=2.0, set_cost_giou=2.0, start_epoch=1, track_attention=False, track_backprop_prev_frame=False, track_prev_frame_range=5, track_prev_frame_rnd_augs=0.01, track_prev_prev_frame=False, track_query_false_negative_prob=0.4, track_query_false_positive_eos_weight=True, track_query_false_positive_prob=0.1, tracking=True, tracking_eval=True, train_split=None, two_stage=False, val_interval=5, val_split='mot17_train_cross_val_frame_0_5_to_1_0_coco', vis_and_log_interval=50, vis_port=8097, vis_server='', weight_decay=0.0001, with_box_refine=True, world_size=2)
using distributed mode
| distributed init (rank 1): env://
Configuration (modified, added, typechanged, doc):
aux_loss = True
backbone = 'resnet50'
batch_size = 1
bbox_loss_coef = 5.0
clip_max_norm = 0.1
cls_loss_coef = 2.0
coco_and_crowdhuman_prev_frame_rnd_augs = 0.2
coco_min_num_objects = 0
coco_panoptic_path = None
coco_path = 'data/coco_2017'
coco_person_train_split = None
crowdhuman_path = 'data/CrowdHuman'
crowdhuman_train_split = 'train_val'
dataset = 'mot_crowdhuman'
debug = False
dec_layers = 6
dec_n_points = 4
deformable = True
device = 'cuda'
dice_loss_coef = 1.0
dilation = False
dim_feedforward = 1024
dist_url = 'env://'
dropout = 0.1
enc_layers = 6
enc_n_points = 4
eos_coef = 0.1
epochs = 80
eval_only = False
eval_train = False
focal_alpha = 0.25
focal_gamma = 2
focal_loss = True
freeze_detr = False
giou_loss_coef = 2
hidden_dim = 288
load_mask_head_from_model = None
lr = 0.0002
lr_backbone = 2e-05
lr_backbone_names = ['backbone.0']
lr_drop = 50
lr_linear_proj_mult = 0.1
lr_linear_proj_names = ['reference_points', 'sampling_offsets']
lr_track = 0.0001
mask_loss_coef = 1.0
masks = False
merge_frame_features = False
mot_path_train = 'data/MOT17'
mot_path_val = 'data/MOT17'
multi_frame_attention = True
multi_frame_attention_separate_encoder = True
multi_frame_encoding = True
nheads = 8
no_vis = False
num_feature_levels = 4
num_queries = 500
num_workers = 2
output_dir = 'models/crowdhuman_deformable_multi_frame'
overflow_boxes = True
overwrite_lr_scheduler = False
overwrite_lrs = False
position_embedding = 'sine'
pre_norm = False
resume = ''
resume_optim = False
resume_shift_neuron = False
resume_vis = False
save_model_interval = 5
seed = 42
set_cost_bbox = 5.0
set_cost_class = 2.0
set_cost_giou = 2.0
start_epoch = 1
track_attention = False
track_backprop_prev_frame = False
track_prev_frame_range = 5
track_prev_frame_rnd_augs = 0.01
track_prev_prev_frame = False
track_query_false_negative_prob = 0.4
track_query_false_positive_eos_weight = True
track_query_false_positive_prob = 0.1
tracking = True
tracking_eval = True
train_split = None
two_stage = False
val_interval = 5
val_split = 'mot17_train_cross_val_frame_0_5_to_1_0_coco'
vis_and_log_interval = 50
vis_port = 8097
vis_server = ''
weight_decay = 0.0001
with_box_refine = True
world_size = 2
img_transform:
max_size = 1333
val_width = 800
INFO - train - Completed after 0:00:00
Namespace(aux_loss=True, backbone='resnet50', batch_size=1, bbox_loss_coef=5.0, clip_max_norm=0.1, cls_loss_coef=2.0, coco_and_crowdhuman_prev_frame_rnd_augs=0.2, coco_min_num_objects=0, coco_panoptic_path=None, coco_path='data/coco_2017', coco_person_train_split=None, crowdhuman_path='data/CrowdHuman', crowdhuman_train_split='train_val', dataset='mot_crowdhuman', debug=False, dec_layers=6, dec_n_points=4, deformable=True, device='cuda', dice_loss_coef=1.0, dilation=False, dim_feedforward=1024, dist_url='env://', dropout=0.1, enc_layers=6, enc_n_points=4, eos_coef=0.1, epochs=80, eval_only=False, eval_train=False, focal_alpha=0.25, focal_gamma=2, focal_loss=True, freeze_detr=False, giou_loss_coef=2, hidden_dim=288, img_transform=Namespace(max_size=1333, val_width=800), load_mask_head_from_model=None, lr=0.0002, lr_backbone=2e-05, lr_backbone_names=['backbone.0'], lr_drop=50, lr_linear_proj_mult=0.1, lr_linear_proj_names=['reference_points', 'sampling_offsets'], lr_track=0.0001, mask_loss_coef=1.0, masks=False, merge_frame_features=False, mot_path_train='data/MOT17', mot_path_val='data/MOT17', multi_frame_attention=True, multi_frame_attention_separate_encoder=True, multi_frame_encoding=True, nheads=8, no_vis=False, num_feature_levels=4, num_queries=500, num_workers=2, output_dir='models/crowdhuman_deformable_multi_frame', overflow_boxes=True, overwrite_lr_scheduler=False, overwrite_lrs=False, position_embedding='sine', pre_norm=False, resume='', resume_optim=False, resume_shift_neuron=False, resume_vis=False, save_model_interval=5, seed=42, set_cost_bbox=5.0, set_cost_class=2.0, set_cost_giou=2.0, start_epoch=1, track_attention=False, track_backprop_prev_frame=False, track_prev_frame_range=5, track_prev_frame_rnd_augs=0.01, track_prev_prev_frame=False, track_query_false_negative_prob=0.4, track_query_false_positive_eos_weight=True, track_query_false_positive_prob=0.1, tracking=True, tracking_eval=True, train_split=None, two_stage=False, val_interval=5, val_split='mot17_train_cross_val_frame_0_5_to_1_0_coco', vis_and_log_interval=50, vis_port=8097, vis_server='', weight_decay=0.0001, with_box_refine=True, world_size=2)
using distributed mode
| distributed init (rank 0): env://
git:
sha: d62d81023dbffb4a1820db39ce527b66df6d7b61, status: has uncommited changes, branch: main