I have a problem when running the training code. After about 4,000 iterations (epoch 133), training crashed with an error. Here is the training log:
```
OrderedDict([('manual_seed', 10), ('lr_G', 0.0005), ('weight_decay_G', 0), ('beta1', 0.9), ('beta2', 0.99), ('lr_scheme', 'MultiStepLR'), ('warmup_iter', 10), ('lr_steps_rel', [0.5, 0.75, 0.9, 0.95]), ('lr_gamma', 0.5), ('weight_l1', 0), ('weight_fl', 1), ('niter', 30000), ('val_freq', 200), ('lr_steps', [15000, 22500, 27000, 28500])])
Disabled distributed training.
22-05-17 21:58:49.336 - INFO: name: train_color_as_full_z_nosieMapBugFixed_noavgpool
use_tb_logger: True
model: LLFlow
distortion: sr
scale: 1
gpu_ids: [0]
dataset: LoL
optimize_all_z: False
cond_encoder: ConEncoder1
train_gt_ratio: 0.2
avg_color_map: False
concat_histeq: True
histeq_as_input: False
concat_color_map: False
gray_map: False
align_condition_feature: False
align_weight: 0.001
align_maxpool: True
to_yuv: False
encode_color_map: False
le_curve: False
datasets:[
train:[
root: data/LOL
quant: 32
use_shuffle: True
n_workers: 4
batch_size: 16
use_flip: True
color: RGB
use_crop: True
GT_size: 160
noise_prob: 0
noise_level: 5
log_low: True
gamma_aug: False
phase: train
scale: 1
data_type: img
]
val:[
root: data/LOL
n_workers: 1
quant: 32
n_max: 20
batch_size: 1
log_low: True
phase: val
scale: 1
data_type: img
]
]
dataroot_unpaired: data/LOL/eval15/low
dataroot_GT: data/LOL/eval15/high
dataroot_LR: data/LOL/eval15/low
model_path: trained_models/trained.pth
heat: 0
network_G:[
which_model_G: LLFlow
in_nc: 3
out_nc: 3
nf: 64
nb: 24
train_RRDB: False
train_RRDB_delay: 0.5
flow:[
K: 12
L: 3
noInitialInj: True
coupling: CondAffineSeparatedAndCond
additionalFlowNoAffine: 2
split:[
enable: False
]
fea_up0: True
stackRRDB:[
blocks: [1, 3, 5, 7]
concat: True
]
]
scale: 1
]
path:[
strict_load: True
resume_state: auto
root: /home/jaemin/Desktop/LLFlow-main
experiments_root: /home/jaemin/Desktop/LLFlow-main/experiments/train_color_as_full_z_nosieMapBugFixed_noavgpool
models: /home/jaemin/Desktop/LLFlow-main/experiments/train_color_as_full_z_nosieMapBugFixed_noavgpool/models
training_state: /home/jaemin/Desktop/LLFlow-main/experiments/train_color_as_full_z_nosieMapBugFixed_noavgpool/training_state
log: /home/jaemin/Desktop/LLFlow-main/experiments/train_color_as_full_z_nosieMapBugFixed_noavgpool
val_images: /home/jaemin/Desktop/LLFlow-main/experiments/train_color_as_full_z_nosieMapBugFixed_noavgpool/val_images
]
train:[
manual_seed: 10
lr_G: 0.0005
weight_decay_G: 0
beta1: 0.9
beta2: 0.99
lr_scheme: MultiStepLR
warmup_iter: 10
lr_steps_rel: [0.5, 0.75, 0.9, 0.95]
lr_gamma: 0.5
weight_l1: 0
weight_fl: 1
niter: 30000
val_freq: 200
lr_steps: [15000, 22500, 27000, 28500]
]
val:[
n_sample: 4
]
test:[
heats: [0.0, 0.7, 0.8, 0.9]
]
logger:[
print_freq: 200
save_checkpoint_freq: 1000.0
]
is_train: True
dist: False
22-05-17 21:58:49.351 - INFO: Random seed: 10
rrdb params 0
22-05-17 21:58:56.276 - INFO: Model [LLFlowModel] is created.
Parameters of full network 38.8595 and encoder 17.4968
22-05-17 21:58:56.286 - INFO: Start training from epoch: 0, iter: 0
/home/jaemin/anaconda3/envs/mymodel/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:136: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
<epoch: 0, iter: 1, lr:5.000e-05, t:-1.00e+00, td:2.79e-01, eta:-8.33e+00, nll:0.000e+00>
/home/jaemin/anaconda3/envs/mymodel/lib/python3.7/site-packages/torch/nn/functional.py:3103: UserWarning: The default behavior for interpolate/upsample with float scale_factor changed in 1.6.0 to align with other frameworks/libraries, and now uses scale_factor directly, instead of relying on the computed output size. If you wish to restore the old behavior, please set recompute_scale_factor=True. See the documentation of nn.Upsample for details.
warnings.warn("The default behavior for interpolate/upsample with float scale_factor changed "
22-05-17 21:59:00.339 - INFO: Parameters of full network 37.6774 and encoder 17.4211
<epoch: 0, iter: 2, lr:1.000e-04, t:-1.00e+00, td:5.03e-03, eta:-8.33e+00, nll:-7.233e+00>
<epoch: 0, iter: 3, lr:1.500e-04, t:3.81e+00, td:5.90e-03, eta:3.17e+01, nll:-6.932e+00>
<epoch: 0, iter: 4, lr:2.000e-04, t:1.40e+00, td:1.65e-03, eta:1.16e+01, nll:-7.222e+00>
<epoch: 0, iter: 5, lr:2.500e-04, t:1.53e+00, td:1.21e-03, eta:1.28e+01, nll:9.519e+02>
<epoch: 0, iter: 6, lr:3.000e-04, t:1.44e+00, td:7.74e-03, eta:1.20e+01, nll:1.007e+03>
<epoch: 0, iter: 7, lr:3.500e-04, t:1.40e+00, td:1.87e-03, eta:1.17e+01, nll:8.534e+02>
<epoch: 0, iter: 8, lr:4.000e-04, t:1.43e+00, td:1.58e-03, eta:1.19e+01, nll:9.690e+02>
<epoch: 0, iter: 9, lr:4.500e-04, t:1.39e+00, td:1.63e-03, eta:1.16e+01, nll:6.836e+02>
<epoch: 0, iter: 10, lr:4.500e-04, t:1.43e+00, td:1.73e-03, eta:1.20e+01, nll:5.680e+02>
<epoch: 0, iter: 11, lr:4.500e-04, t:1.42e+00, td:1.61e-03, eta:1.19e+01, nll:5.967e+02>
<epoch: 0, iter: 12, lr:4.500e-04, t:1.39e+00, td:1.52e-03, eta:1.16e+01, nll:5.568e+02>
<epoch: 0, iter: 13, lr:4.500e-04, t:1.43e+00, td:1.57e-03, eta:1.19e+01, nll:4.397e+02>
<epoch: 0, iter: 14, lr:4.500e-04, t:1.39e+00, td:2.70e-03, eta:1.16e+01, nll:5.610e+02>
<epoch: 0, iter: 15, lr:4.500e-04, t:1.49e+00, td:1.53e-03, eta:1.24e+01, nll:1.028e+02>
<epoch: 0, iter: 16, lr:4.500e-04, t:1.54e+00, td:1.48e-03, eta:1.28e+01, nll:1.578e+01>
<epoch: 0, iter: 17, lr:4.500e-04, t:1.53e+00, td:1.65e-03, eta:1.28e+01, nll:1.097e+01>
<epoch: 0, iter: 18, lr:4.500e-04, t:1.50e+00, td:1.56e-03, eta:1.25e+01, nll:1.119e+01>
<epoch: 0, iter: 19, lr:4.500e-04, t:1.53e+00, td:2.73e-03, eta:1.28e+01, nll:7.515e+00>
<epoch: 0, iter: 20, lr:4.500e-04, t:1.49e+00, td:2.75e-03, eta:1.24e+01, nll:3.359e+00>
<epoch: 0, iter: 21, lr:4.500e-04, t:1.52e+00, td:2.58e-03, eta:1.27e+01, nll:1.562e+00>
<epoch: 0, iter: 22, lr:4.500e-04, t:1.48e+00, td:2.86e-03, eta:1.24e+01, nll:-2.830e-01>
<epoch: 0, iter: 23, lr:4.500e-04, t:1.53e+00, td:2.14e-03, eta:1.27e+01, nll:-1.822e+00>
<epoch: 0, iter: 24, lr:4.500e-04, t:1.50e+00, td:2.55e-03, eta:1.25e+01, nll:-2.389e+00>
<epoch: 6, iter: 200, lr:4.500e-04, t:1.57e+00, td:1.48e-02, eta:1.30e+01, nll:-1.198e+01>
train.py:291: RuntimeWarning: divide by zero encountered in float_scalars
cropped_sr_img_adjust = np.clip(cropped_sr_img * (mean_gray_gt / mean_gray_out), 0, 1)
train.py:291: RuntimeWarning: invalid value encountered in multiply
cropped_sr_img_adjust = np.clip(cropped_sr_img * (mean_gray_gt / mean_gray_out), 0, 1)
22-05-17 22:04:39.757 - INFO: # Validation # PSNR: nan SSIM: nan
22-05-17 22:04:39.758 - INFO: <epoch: 6, iter: 200> psnr: nan SSIM: nan
<epoch: 13, iter: 400, lr:4.500e-04, t:1.71e+00, td:1.56e-02, eta:1.41e+01, nll:-1.342e+01>
22-05-17 22:10:20.482 - INFO: # Validation # PSNR: 1.9680e+01 SSIM: nan
22-05-17 22:10:20.483 - INFO: <epoch: 13, iter: 400> psnr: 1.9680e+01 SSIM: nan
22-05-17 22:10:20.483 - INFO: Saving best models
<epoch: 19, iter: 600, lr:4.500e-04, t:1.70e+00, td:1.37e-02, eta:1.39e+01, nll:-1.219e+01>
22-05-17 22:16:00.804 - INFO: # Validation # PSNR: 1.8884e+01 SSIM: nan
22-05-17 22:16:00.805 - INFO: <epoch: 19, iter: 600> psnr: 1.8884e+01 SSIM: nan
<epoch: 26, iter: 800, lr:4.500e-04, t:1.70e+00, td:1.55e-02, eta:1.38e+01, nll:-1.317e+01>
22-05-17 22:21:41.478 - INFO: # Validation # PSNR: 2.0391e+01 SSIM: 7.2959e-01
22-05-17 22:21:41.479 - INFO: <epoch: 26, iter: 800> psnr: 2.0391e+01 SSIM: 7.2959e-01
22-05-17 22:21:41.479 - INFO: Saving best models
<epoch: 33, iter: 1,000, lr:4.500e-04, t:1.70e+00, td:1.54e-02, eta:1.37e+01, nll:-1.284e+01>
22-05-17 22:27:22.198 - INFO: # Validation # PSNR: 2.1104e+01 SSIM: 7.4706e-01
22-05-17 22:27:22.199 - INFO: <epoch: 33, iter: 1,000> psnr: 2.1104e+01 SSIM: 7.4706e-01
22-05-17 22:27:22.199 - INFO: Saving models and training states.
22-05-17 22:27:22.785 - INFO: Saving best models
<epoch: 39, iter: 1,200, lr:4.500e-04, t:1.71e+00, td:1.36e-02, eta:1.37e+01, nll:-1.376e+01>
22-05-17 22:33:03.967 - INFO: # Validation # PSNR: 1.9835e+01 SSIM: 7.1762e-01
22-05-17 22:33:03.967 - INFO: <epoch: 39, iter: 1,200> psnr: 1.9835e+01 SSIM: 7.1762e-01
<epoch: 46, iter: 1,400, lr:4.500e-04, t:1.70e+00, td:1.56e-02, eta:1.35e+01, nll:-1.423e+01>
22-05-17 22:38:44.834 - INFO: # Validation # PSNR: 1.7979e+01 SSIM: 6.7875e-01
22-05-17 22:38:44.834 - INFO: <epoch: 46, iter: 1,400> psnr: 1.7979e+01 SSIM: 6.7875e-01
<epoch: 53, iter: 1,600, lr:4.500e-04, t:1.71e+00, td:1.58e-02, eta:1.35e+01, nll:-1.479e+01>
22-05-17 22:44:26.174 - INFO: # Validation # PSNR: 1.9058e+01 SSIM: 6.8164e-01
22-05-17 22:44:26.174 - INFO: <epoch: 53, iter: 1,600> psnr: 1.9058e+01 SSIM: 6.8164e-01
<epoch: 59, iter: 1,800, lr:4.500e-04, t:1.70e+00, td:1.36e-02, eta:1.33e+01, nll:-1.263e+01>
22-05-17 22:50:06.876 - INFO: # Validation # PSNR: 2.1273e+01 SSIM: 7.5831e-01
22-05-17 22:50:06.876 - INFO: <epoch: 59, iter: 1,800> psnr: 2.1273e+01 SSIM: 7.5831e-01
22-05-17 22:50:06.876 - INFO: Saving best models
<epoch: 66, iter: 2,000, lr:4.500e-04, t:1.71e+00, td:1.55e-02, eta:1.33e+01, nll:-1.427e+01>
22-05-17 22:55:48.322 - INFO: # Validation # PSNR: 2.1426e+01 SSIM: 7.4877e-01
22-05-17 22:55:48.322 - INFO: <epoch: 66, iter: 2,000> psnr: 2.1426e+01 SSIM: 7.4877e-01
22-05-17 22:55:48.322 - INFO: Saving models and training states.
22-05-17 22:55:48.860 - INFO: Saving best models
<epoch: 73, iter: 2,200, lr:4.500e-04, t:1.71e+00, td:1.54e-02, eta:1.32e+01, nll:-1.477e+01>
22-05-17 23:01:30.147 - INFO: # Validation # PSNR: 2.1490e+01 SSIM: 7.6223e-01
22-05-17 23:01:30.148 - INFO: <epoch: 73, iter: 2,200> psnr: 2.1490e+01 SSIM: 7.6223e-01
22-05-17 23:01:30.148 - INFO: Saving best models
<epoch: 79, iter: 2,400, lr:4.500e-04, t:1.71e+00, td:1.37e-02, eta:1.31e+01, nll:-1.484e+01>
22-05-17 23:07:11.724 - INFO: # Validation # PSNR: 1.8654e+01 SSIM: 6.7717e-01
22-05-17 23:07:11.725 - INFO: <epoch: 79, iter: 2,400> psnr: 1.8654e+01 SSIM: 6.7717e-01
<epoch: 86, iter: 2,600, lr:4.500e-04, t:1.70e+00, td:1.56e-02, eta:1.30e+01, nll:-1.458e+01>
22-05-17 23:12:52.514 - INFO: # Validation # PSNR: 2.1183e+01 SSIM: 7.5730e-01
22-05-17 23:12:52.514 - INFO: <epoch: 86, iter: 2,600> psnr: 2.1183e+01 SSIM: 7.5730e-01
<epoch: 93, iter: 2,800, lr:4.500e-04, t:1.70e+00, td:1.57e-02, eta:1.29e+01, nll:-1.419e+01>
22-05-17 23:18:33.419 - INFO: # Validation # PSNR: 2.1128e+01 SSIM: 7.5815e-01
22-05-17 23:18:33.420 - INFO: <epoch: 93, iter: 2,800> psnr: 2.1128e+01 SSIM: 7.5815e-01
<epoch: 99, iter: 3,000, lr:4.500e-04, t:1.70e+00, td:1.37e-02, eta:1.28e+01, nll:-1.454e+01>
22-05-17 23:24:13.746 - INFO: # Validation # PSNR: 2.1340e+01 SSIM: 7.6262e-01
22-05-17 23:24:13.747 - INFO: <epoch: 99, iter: 3,000> psnr: 2.1340e+01 SSIM: 7.6262e-01
22-05-17 23:24:13.747 - INFO: Saving models and training states.
<epoch:106, iter: 3,200, lr:4.500e-04, t:1.71e+00, td:1.55e-02, eta:1.27e+01, nll:-1.447e+01>
22-05-17 23:29:56.327 - INFO: # Validation # PSNR: 2.2221e+01 SSIM: 7.6925e-01
22-05-17 23:29:56.327 - INFO: <epoch:106, iter: 3,200> psnr: 2.2221e+01 SSIM: 7.6925e-01
22-05-17 23:29:56.327 - INFO: Saving best models
<epoch:113, iter: 3,400, lr:4.500e-04, t:1.71e+00, td:1.58e-02, eta:1.27e+01, nll:-1.527e+01>
22-05-17 23:35:39.072 - INFO: # Validation # PSNR: 2.0904e+01 SSIM: 7.5687e-01
22-05-17 23:35:39.073 - INFO: <epoch:113, iter: 3,400> psnr: 2.0904e+01 SSIM: 7.5687e-01
<epoch:119, iter: 3,600, lr:4.500e-04, t:1.71e+00, td:1.37e-02, eta:1.25e+01, nll:-1.365e+01>
22-05-17 23:41:20.888 - INFO: # Validation # PSNR: 2.0487e+01 SSIM: 7.4803e-01
22-05-17 23:41:20.888 - INFO: <epoch:119, iter: 3,600> psnr: 2.0487e+01 SSIM: 7.4803e-01
<epoch:126, iter: 3,800, lr:4.500e-04, t:1.69e+00, td:1.56e-02, eta:1.23e+01, nll:6.914e+00>
22-05-17 23:46:58.775 - INFO: # Validation # PSNR: nan SSIM: nan
22-05-17 23:46:58.775 - INFO: <epoch:126, iter: 3,800> psnr: nan SSIM: nan
<epoch:133, iter: 4,000, lr:4.500e-04, t:1.65e+00, td:1.56e-02, eta:1.19e+01, nll:nan>
22-05-17 23:52:29.097 - INFO: # Validation # PSNR: nan SSIM: nan
22-05-17 23:52:29.098 - INFO: <epoch:133, iter: 4,000> psnr: nan SSIM: nan
22-05-17 23:52:29.098 - INFO: Saving models and training states.
Intel MKL ERROR: Parameter 4 was incorrect on entry to SLASCL.
Intel MKL ERROR: Parameter 4 was incorrect on entry to SLASCL.
Traceback (most recent call last):
File "train.py", line 343, in
main()
File "train.py", line 191, in main
nll = model.optimize_parameters(current_step)
File "/home/jaemin/Desktop/LLFlow-main/code/models/LLFlow_model.py", line 208, in optimize_parameters
self.scaler.scale(total_loss).backward()
File "/home/jaemin/anaconda3/envs/mymodel/lib/python3.7/site-packages/torch/tensor.py", line 221, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/jaemin/anaconda3/envs/mymodel/lib/python3.7/site-packages/torch/autograd/init.py", line 132, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: svd_cuda: the updating process of SBDSDC did not converge (error: 22)
```
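A few things I noticed in the log. First, the `lr_scheduler.step()` warning at startup: as far as I understand, since PyTorch 1.1.0 the optimizer has to be stepped before the scheduler. A toy example of the call order (not the repo's actual training loop):

```python
import torch

# Toy setup just to illustrate the call order; not the repo's training loop.
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, betas=(0.9, 0.99))
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[15000, 22500, 27000, 28500], gamma=0.5)

for _ in range(3):
    optimizer.zero_grad()
    loss = model(torch.randn(16, 4)).pow(2).mean()
    loss.backward()
    optimizer.step()   # update the weights first...
    scheduler.step()   # ...then advance the LR schedule, so no warning
```

That warning is probably harmless here (it only skips the first value of the LR schedule), but I mention it for completeness.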
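Second, the divide-by-zero warning in validation comes from train.py:291, where the output is brightness-aligned to the GT crop. When the network output collapses to all zeros, `mean_gray_out` is 0 and PSNR/SSIM become nan, which matches the `PSNR: nan SSIM: nan` lines. A guard like the following would at least keep the metrics finite (a minimal sketch; the `eps` guard is my own addition, not in the repo, and the variables are illustrative stand-ins):

```python
import numpy as np

# Illustrative stand-ins for the variables computed earlier in the
# validation loop (names from the log; values chosen to trigger the case).
cropped_sr_img = np.zeros((160, 160, 3), dtype=np.float32)  # collapsed output
mean_gray_out = float(cropped_sr_img.mean())                # -> 0.0
mean_gray_gt = 0.35                                         # mean gray of the GT crop

eps = 1e-8  # hypothetical guard, not in the original train.py
cropped_sr_img_adjust = np.clip(
    cropped_sr_img * (mean_gray_gt / max(mean_gray_out, eps)), 0, 1)
print(np.isfinite(cropped_sr_img_adjust).all())  # True, no RuntimeWarning
```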
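Finally, the actual crash: the nll goes to nan at iter 4,000 right before the `svd_cuda: ... SBDSDC did not converge` error, so I suspect the SVD failure in the backward pass is a symptom of non-finite values in the flow rather than the root cause. Would it make sense to skip non-finite losses and clip gradients in `optimize_parameters`? Something like this (my own guess at a guard, not the actual code in LLFlow_model.py):

```python
import torch

def optimize_parameters_guarded(model, optimizer, scaler, total_loss, max_norm=5.0):
    """Hypothetical variant of LLFlowModel.optimize_parameters: skip
    non-finite losses and clip gradients before the optimizer step."""
    if not torch.isfinite(total_loss):
        optimizer.zero_grad()  # drop this batch instead of poisoning the weights
        return float('nan')
    scaler.scale(total_loss).backward()
    scaler.unscale_(optimizer)  # unscale so clipping sees true gradient norms
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    scaler.step(optimizer)      # GradScaler skips the step if grads are inf/nan
    scaler.update()
    optimizer.zero_grad()
    return total_loss.item()
```

Lowering `lr_G` or resuming from the last good checkpoint might also help, but I'd like to understand the root cause.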
Does anyone have a solution to this problem?
Thanks.