I'm training on DF2K with 4 GPUs.
My config is based on the "train_HAT_SRx4_finetune_from_ImageNet_pretrain.yml" file.
I only changed dataroot_gt and dataroot_lq for train and val.
I also changed num_worker_per_gpu and batch_size_per_gpu as follows (the lines starting with ### are the original values):
### num_worker_per_gpu: 6
### batch_size_per_gpu: 4
num_worker_per_gpu: 3
batch_size_per_gpu: 8
However, after 80,000 iterations, l_pix still has not converged:
[
2022-09-19 20:15:50,914 INFO: [train..][epoch:738, iter: 79,000, lr:(1.000e-05,)] [eta: 6 days, 12:51:28, time (data): 3.029 (0.004)] l_pix: 2.1359e-02
2022-09-19 20:15:50,915 INFO: Saving models and training states.
2022-09-19 20:21:18,215 INFO: [train..][epoch:739, iter: 79,100, lr:(1.000e-05,)] [eta: 6 days, 12:45:51, time (data): 3.218 (0.341)] l_pix: 3.3533e-02
2022-09-19 20:26:43,030 INFO: [train..][epoch:740, iter: 79,200, lr:(1.000e-05,)] [eta: 6 days, 12:40:10, time (data): 2.068 (0.031)] l_pix: 1.9019e-02
2022-09-19 20:32:16,706 INFO: [train..][epoch:741, iter: 79,300, lr:(1.000e-05,)] [eta: 6 days, 12:34:47, time (data): 3.264 (0.337)] l_pix: 1.9070e-02
2022-09-19 20:37:43,115 INFO: [train..][epoch:742, iter: 79,400, lr:(1.000e-05,)] [eta: 6 days, 12:29:08, time (data): 3.401 (0.004)] l_pix: 1.7958e-02
2022-09-19 20:42:36,323 INFO: [train..][epoch:742, iter: 79,500, lr:(1.000e-05,)] [eta: 6 days, 12:22:19, time (data): 2.954 (0.020)] l_pix: 1.5392e-02
2022-09-19 20:48:30,628 INFO: [train..][epoch:743, iter: 79,600, lr:(1.000e-05,)] [eta: 6 days, 12:17:40, time (data): 3.378 (0.003)] l_pix: 2.8961e-02
2022-09-19 20:53:45,430 INFO: [train..][epoch:744, iter: 79,700, lr:(1.000e-05,)] [eta: 6 days, 12:11:37, time (data): 3.156 (0.225)] l_pix: 3.7259e-02
2022-09-19 20:59:13,519 INFO: [train..][epoch:745, iter: 79,800, lr:(1.000e-05,)] [eta: 6 days, 12:06:02, time (data): 3.902 (0.031)] l_pix: 2.7916e-02
2022-09-19 21:04:49,328 INFO: [train..][epoch:746, iter: 79,900, lr:(1.000e-05,)] [eta: 6 days, 12:00:44, time (data): 3.374 (0.410)] l_pix: 2.1746e-02
2022-09-19 21:10:27,211 INFO: [train..][epoch:747, iter: 80,000, lr:(1.000e-05,)] [eta: 6 days, 11:55:30, time (data): 3.748 (0.094)] l_pix: 2.1582e-02
2022-09-19 21:10:27,213 INFO: Saving models and training states.
2022-09-19 21:22:35,811 INFO: Validation open
# psnr: 20.3545 Best: 20.3660 @ 65000 iter
# ssim: 0.4768 Best: 0.4769 @ 65000 iter
2022-09-19 21:27:52,322 INFO: [train..][epoch:748, iter: 80,100, lr:(1.000e-05,)] [eta: 6 days, 12:15:17, time (data): 3.176 (0.366)] l_pix: 2.4691e-02
2022-09-19 21:33:13,818 INFO: [train..][epoch:749, iter: 80,200, lr:(1.000e-05,)] [eta: 6 days, 12:09:25, time (data): 3.303 (0.093)] l_pix: 2.2727e-02
2022-09-19 21:38:51,310 INFO: [train..][epoch:750, iter: 80,300, lr:(1.000e-05,)] [eta: 6 days, 12:04:08, time (data): 3.374 (0.419)] l_pix: 1.5810e-02
2022-09-19 21:44:40,636 INFO: [train..][epoch:751, iter: 80,400, lr:(1.000e-05,)] [eta: 6 days, 11:59:15, time (data): 3.433 (0.393)] l_pix: 1.9958e-02
2022-09-19 21:50:00,407 INFO: [train..][epoch:752, iter: 80,500, lr:(1.000e-05,)] [eta: 6 days, 11:53:20, time (data): 3.198 (0.192)] l_pix: 2.1157e-02
2022-09-19 21:55:30,407 INFO: [train..][epoch:753, iter: 80,600, lr:(1.000e-05,)] [eta: 6 days, 11:47:47, time (data): 3.248 (0.231)] l_pix: 2.8304e-02
2022-09-19 22:00:58,110 INFO: [train..][epoch:754, iter: 80,700, lr:(1.000e-05,)] [eta: 6 days, 11:42:09, time (data): 3.279 (0.391)] l_pix: 2.4832e-02
2022-09-19 22:06:35,306 INFO: [train..][epoch:755, iter: 80,800, lr:(1.000e-05,)] [eta: 6 days, 11:36:50, time (data): 3.326 (0.384)] l_pix: 2.9092e-02
2022-09-19 22:12:15,613 INFO: [train..][epoch:756, iter: 80,900, lr:(1.000e-05,)] [eta: 6 days, 11:31:38, time (data): 3.429 (0.409)] l_pix: 2.6695e-02
2022-09-19 22:17:41,607 INFO: [train..][epoch:757, iter: 81,000, lr:(1.000e-05,)] [eta: 6 days, 11:25:57, time (data): 3.343 (0.408)] l_pix: 3.1762e-02
2022-09-19 22:17:41,609 INFO: Saving models and training states.
]
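The individual l_pix values above jump around from one log line to the next, so a smoothed view may make the trend easier to judge. Here is a minimal sketch, assuming the log line format shown above; `parse_l_pix` and `moving_average` are hypothetical helper names, not part of HAT or BasicSR, and the sample contains only three lines copied from the log.

```python
import re
from statistics import mean

# Captures the iteration counter and the l_pix value from one log line.
# The iteration group may pick up a trailing comma, which is stripped below.
LINE_RE = re.compile(r"iter:\s*([\d,]+).*?l_pix:\s*([0-9.eE+-]+)")


def parse_l_pix(log_text):
    """Return (iteration, l_pix) pairs for every training line in the log text."""
    pairs = []
    for line in log_text.splitlines():
        m = LINE_RE.search(line)
        if m:
            iteration = int(m.group(1).replace(",", ""))
            pairs.append((iteration, float(m.group(2))))
    return pairs


def moving_average(values, window):
    """Trailing moving average; early points use however many samples exist."""
    return [mean(values[max(0, i - window + 1): i + 1]) for i in range(len(values))]


# Three lines copied from the log above (hypothetical usage; a real run would
# read the whole log file instead).
sample = """\
2022-09-19 20:15:50,914 INFO: [train..][epoch:738, iter: 79,000, lr:(1.000e-05,)] [eta: 6 days, 12:51:28, time (data): 3.029 (0.004)] l_pix: 2.1359e-02
2022-09-19 20:15:50,915 INFO: Saving models and training states.
2022-09-19 20:21:18,215 INFO: [train..][epoch:739, iter: 79,100, lr:(1.000e-05,)] [eta: 6 days, 12:45:51, time (data): 3.218 (0.341)] l_pix: 3.3533e-02
"""

pairs = parse_l_pix(sample)
smoothed = moving_average([loss for _, loss in pairs], window=2)
for (iteration, loss), avg in zip(pairs, smoothed):
    print(f"iter {iteration}: l_pix={loss:.4e}, moving avg={avg:.4e}")
```

Each logged l_pix comes from a single point in training, so some noise between lines is expected; a larger window over the full log would show whether the average is still decreasing.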
Do you know why the loss is not converging?
My .yaml file is attached.
Any advice would be appreciated.
train_HAT_SRx4_my_others_to_open.yml--.log