Hi,
Thank you for the great work! In the results section of your paper, you report results from training on mixed datasets for 200 epochs. I tried training from scratch on the 3DPW dataset alone for 30 epochs, but the results were unexpected: the validation error plateaus at roughly mPVE 158, mPJPE 137, and PAmPJPE 88.5 (see the log below). I'd appreciate any advice on how to solve this problem.
Thanks in advance.
2022-09-27 10:27:33,273 METRO INFO: Using 1 GPUs
2022-09-27 10:27:37,447 METRO INFO: Update config parameter num_hidden_layers: 12 -> 4
2022-09-27 10:27:37,447 METRO INFO: Update config parameter hidden_size: 768 -> 1024
2022-09-27 10:27:37,447 METRO INFO: Update config parameter num_attention_heads: 12 -> 4
2022-09-27 10:27:38,310 METRO INFO: Init model from scratch.
2022-09-27 10:27:38,310 METRO INFO: Update config parameter num_hidden_layers: 12 -> 4
2022-09-27 10:27:38,310 METRO INFO: Update config parameter hidden_size: 768 -> 256
2022-09-27 10:27:38,310 METRO INFO: Update config parameter num_attention_heads: 12 -> 4
2022-09-27 10:27:38,486 METRO INFO: Init model from scratch.
2022-09-27 10:27:38,486 METRO INFO: Update config parameter num_hidden_layers: 12 -> 4
2022-09-27 10:27:38,486 METRO INFO: Update config parameter hidden_size: 768 -> 128
2022-09-27 10:27:38,486 METRO INFO: Update config parameter num_attention_heads: 12 -> 4
2022-09-27 10:27:38,569 METRO INFO: Init model from scratch.
2022-09-27 10:27:40,009 METRO INFO: => loading hrnet-v2-w64 model
2022-09-27 10:27:40,012 METRO INFO: Transformers total parameters: 102256646
2022-09-27 10:27:40,016 METRO INFO: Backbone total parameters: 128059944
2022-09-27 10:27:40,216 METRO INFO: Training parameters Namespace(data_dir='datasets', train_yaml='pw3d_tsv_reproduce/train.yaml', val_yaml='pw3d_tsv_reproduce/test.yaml', num_workers=4, img_scale_factor=1, model_name_or_path='metro/modeling/bert/bert-base-uncased/', resume_checkpoint=None, output_dir='output/', config_name='', per_gpu_train_batch_size=20, per_gpu_eval_batch_size=30, lr=0.0001, num_train_epochs=30, vertices_loss_weight=100.0, joints_loss_weight=1000.0, vloss_w_full=0.33, vloss_w_sub=0.33, vloss_w_sub2=0.33, drop_out=0.1, arch='hrnet-w64', num_hidden_layers=4, hidden_size=128, num_attention_heads=4, intermediate_size=-1, input_feat_dim='2051,512,128', hidden_feat_dim='1024,256,128', legacy_setting=True, run_eval_only=False, logging_steps=1000, device=device(type='cuda'), seed=88, local_rank=0, num_gpus=1, distributed=False)
2022-09-27 10:37:39,084 METRO INFO: eta: 5:30:01 epoch: 0 iter: 1000 max mem : 19359 loss: 43.8094, 2d joint loss: 0.0363, 3d joint loss: 0.0242, vertex loss: 0.1603, compute: 0.5986, data: 0.0054, lr: 0.000100
2022-09-27 10:44:41,439 METRO INFO: Validation epoch: 1 mPVE: 216.89, mPJPE: 163.97, PAmPJPE: 110.12, Data Count: 35515.00
2022-09-27 10:53:16,153 METRO INFO: eta: 6:50:32 epoch: 1 iter: 2000 max mem : 19359 loss: 32.0019, 2d joint loss: 0.0250, 3d joint loss: 0.0167, vertex loss: 0.1277, compute: 0.7678, data: 0.1754, lr: 0.000100
2022-09-27 11:01:39,414 METRO INFO: Validation epoch: 2 mPVE: 213.65, mPJPE: 161.82, PAmPJPE: 105.72, Data Count: 35515.00
2022-09-27 11:08:53,971 METRO INFO: eta: 7:07:05 epoch: 2 iter: 3000 max mem : 19359 loss: 26.3174, 2d joint loss: 0.0201, 3d joint loss: 0.0134, vertex loss: 0.1088, compute: 0.8245, data: 0.2321, lr: 0.000100
2022-09-27 11:18:37,952 METRO INFO: Validation epoch: 3 mPVE: 204.17, mPJPE: 154.88, PAmPJPE: 102.00, Data Count: 35515.00
2022-09-27 11:24:28,939 METRO INFO: eta: 7:07:11 epoch: 3 iter: 4000 max mem : 19359 loss: 22.8643, 2d joint loss: 0.0172, 3d joint loss: 0.0115, vertex loss: 0.0963, compute: 0.8521, data: 0.2601, lr: 0.000100
2022-09-27 11:36:15,641 METRO INFO: Validation epoch: 4 mPVE: 182.91, mPJPE: 147.08, PAmPJPE: 96.03, Data Count: 35515.00
2022-09-27 11:36:17,768 METRO INFO: Save checkpoint to output/checkpoint-4-4544
2022-09-27 11:41:03,895 METRO INFO: eta: 7:06:50 epoch: 4 iter: 5000 max mem : 19359 loss: 20.4471, 2d joint loss: 0.0152, 3d joint loss: 0.0102, vertex loss: 0.0874, compute: 0.8807, data: 0.2837, lr: 0.000100
......
2022-09-27 18:08:18,140 METRO INFO: Validation epoch: 27 mPVE: 156.67, mPJPE: 136.94, PAmPJPE: 89.08, Data Count: 35515.00
2022-09-27 18:08:20,040 METRO INFO: Save checkpoint to output/checkpoint-27-30672
2022-09-27 18:11:35,319 METRO INFO: eta: 0:46:05 epoch: 27 iter: 31000 max mem : 19359 loss: 7.0103, 2d joint loss: 0.0049, 3d joint loss: 0.0030, vertex loss: 0.0350, compute: 0.8979, data: 0.3039, lr: 0.000010
2022-09-27 18:25:21,996 METRO INFO: Validation epoch: 28 mPVE: 157.29, mPJPE: 137.61, PAmPJPE: 88.35, Data Count: 35515.00
2022-09-27 18:25:23,883 METRO INFO: Save checkpoint to output/checkpoint-28-31808
2022-09-27 18:27:18,918 METRO INFO: eta: 0:31:10 epoch: 28 iter: 32000 max mem : 19359 loss: 6.8592, 2d joint loss: 0.0048, 3d joint loss: 0.0029, vertex loss: 0.0344, compute: 0.8993, data: 0.3053, lr: 0.000010
2022-09-27 18:42:27,939 METRO INFO: Validation epoch: 29 mPVE: 158.00, mPJPE: 137.04, PAmPJPE: 88.93, Data Count: 35515.00
2022-09-27 18:43:01,725 METRO INFO: eta: 0:16:12 epoch: 29 iter: 33000 max mem : 19359 loss: 6.7176, 2d joint loss: 0.0047, 3d joint loss: 0.0029, vertex loss: 0.0338, compute: 0.9006, data: 0.3065, lr: 0.000010
2022-09-27 18:53:01,465 METRO INFO: eta: 0:01:11 epoch: 29 iter: 34000 max mem : 19359 loss: 6.5830, 2d joint loss: 0.0046, 3d joint loss: 0.0028, vertex loss: 0.0333, compute: 0.8918, data: 0.2977, lr: 0.000010
2022-09-27 18:53:49,660 METRO INFO: eta: 0:00:00 epoch: 30 iter: 34080 max mem : 19359 loss: 6.5728, 2d joint loss: 0.0046, 3d joint loss: 0.0028, vertex loss: 0.0333, compute: 0.8911, data: 0.2970, lr: 0.000001
2022-09-27 18:59:31,358 METRO INFO: Validation epoch: 30 mPVE: 158.38, mPJPE: 137.46, PAmPJPE: 88.50, Data Count: 35515.00
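
In case it helps when comparing runs, here is a minimal sketch for pulling the validation metrics out of a log like the one above. It assumes the "Validation epoch: ..." line format shown here; `parse_validation` and `best_epoch` are hypothetical helper names of my own, not part of the METRO code.

```python
import re

# Matches validation lines in the format seen in the log above, e.g.
# "... METRO INFO: Validation epoch: 4 mPVE: 182.91, mPJPE: 147.08, PAmPJPE: 96.03, Data Count: 35515.00"
# (assumption: other runs use the same line layout)
VAL_RE = re.compile(
    r"Validation epoch:\s*(\d+)\s+"
    r"mPVE:\s*([\d.]+),\s*mPJPE:\s*([\d.]+),\s*PAmPJPE:\s*([\d.]+)"
)


def parse_validation(log_text):
    """Return one dict per validation line found in log_text (hypothetical helper)."""
    rows = []
    for line in log_text.splitlines():
        m = VAL_RE.search(line)
        if m:
            rows.append({
                "epoch": int(m.group(1)),
                "mPVE": float(m.group(2)),
                "mPJPE": float(m.group(3)),
                "PAmPJPE": float(m.group(4)),
            })
    return rows


def best_epoch(rows, metric="PAmPJPE"):
    """Return the row with the lowest value of the given metric (lower is better)."""
    return min(rows, key=lambda r: r[metric])


if __name__ == "__main__":
    # Two lines copied from the log above.
    sample = (
        "2022-09-27 18:08:18,140 METRO INFO: Validation epoch: 27 mPVE: 156.67, "
        "mPJPE: 136.94, PAmPJPE: 89.08, Data Count: 35515.00\n"
        "2022-09-27 18:25:21,996 METRO INFO: Validation epoch: 28 mPVE: 157.29, "
        "mPJPE: 137.61, PAmPJPE: 88.35, Data Count: 35515.00\n"
    )
    rows = parse_validation(sample)
    print(best_epoch(rows, "PAmPJPE"))
```

Running it on the full log file instead of the two sample lines would give the best epoch for whichever metric is passed in.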