As the title says: while training MPR_Net with PaddleGAN, the GPU runs flat out and GPU memory usage keeps climbing until it reaches 100%.
Environment: AI Studio, latest GPU 32G instance (BML Codelab 2.2.1, Python 3)
Code: latest PaddleGAN release (v2.1.0)
Training command:
python -u tools/main.py --config-file configs/mprnet_test.yaml
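For reference, here is a minimal polling sketch for watching the climb while the job runs (illustrative only: it assumes nvidia-smi is on PATH; the interval and GPU index are arbitrary):

# monitor_gpu_mem.py -- poll nvidia-smi and print used GPU memory over time.
# Illustrative only: assumes nvidia-smi is on PATH; interval is arbitrary.
import subprocess
import time

def used_mib(gpu_index=0):
    # Query the used memory of one GPU in MiB, without header or units.
    out = subprocess.check_output([
        "nvidia-smi",
        "--query-gpu=memory.used",
        "--format=csv,noheader,nounits",
        "-i", str(gpu_index),
    ])
    return int(out.decode().strip())

if __name__ == "__main__":
    while True:
        print(time.strftime("%H:%M:%S"), used_mib(), "MiB used")
        time.sleep(10)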
Config file (mprnet_test.yaml):
total_iters: 100000
output_dir: output/mprnet
enable_visualdl: True

model:
  name: MPRModel
  generator:
    name: MPRNet
    n_feat: 40
    scale_unetfeats: 20
    scale_orsnetfeats: 16
  char_criterion:
    name: CharbonnierLoss
  edge_criterion:
    name: EdgeLoss

dataset:
  train:
    name: MPRTrain
    rgb_dir: data/mydata/train
    num_workers: 0
    batch_size: 1
    img_options:
      patch_size: 16
  test:
    name: MPRTrain
    rgb_dir: data/mydata/val
    num_workers: 0
    batch_size: 1
    img_options:
      patch_size: 16

lr_scheduler:
  name: CosineAnnealingRestartLR
  learning_rate: !!float 2e-4
  periods: [25000, 25000, 25000, 25000]
  restart_weights: [1, 1, 1, 1]
  eta_min: !!float 1e-6

validate:
  interval: 2000
  save_img: false
  metrics:
    psnr: # metric name, can be arbitrary
      name: PSNR
      crop_border: 4
      test_y_channel: True
    ssim:
      name: SSIM
      crop_border: 4
      test_y_channel: True

optimizer:
  name: Adam
  # add parameters of net_name to optim
  # name should in self.nets
  net_names:
    - generator
  beta1: 0.9
  beta2: 0.999
  epsilon: 1e-8

log_config:
  interval: 50
  visiual_interval: 200

snapshot_config:
  interval: 2000
Partial training log:
[01/18 14:28:56] ppgan.engine.trainer INFO: Iter: 950/100000 lr: 1.993e-04 loss: 7.585 batch_cost: 0.13380 sec reader_cost: 0.00008 sec ips: 7.47377 images/s eta: 3:40:53
[01/18 14:29:03] ppgan.engine.trainer INFO: Iter: 1000/100000 lr: 1.992e-04 loss: 10.412 batch_cost: 0.13585 sec reader_cost: 0.00009 sec ips: 7.36111 images/s eta: 3:44:09
[01/18 14:29:09] ppgan.engine.trainer INFO: Iter: 1050/100000 lr: 1.991e-04 loss: 26.876 batch_cost: 0.12891 sec reader_cost: 0.00008 sec ips: 7.75758 images/s eta: 3:32:35
[01/18 14:29:16] ppgan.engine.trainer INFO: Iter: 1100/100000 lr: 1.991e-04 loss: 13.800 batch_cost: 0.14733 sec reader_cost: 0.00010 sec ips: 6.78764 images/s eta: 4:02:50
[01/18 14:29:23] ppgan.engine.trainer INFO: Iter: 1150/100000 lr: 1.990e-04 loss: 13.804 batch_cost: 0.12896 sec reader_cost: 0.00008 sec ips: 7.75449 images/s eta: 3:32:27
[01/18 14:29:29] ppgan.engine.trainer INFO: Iter: 1200/100000 lr: 1.989e-04 loss: 4.678 batch_cost: 0.13077 sec reader_cost: 0.00008 sec ips: 7.64728 images/s eta: 3:35:19
[01/18 14:29:36] ppgan.engine.trainer INFO: Iter: 1250/100000 lr: 1.988e-04 loss: 11.096 batch_cost: 0.13756 sec reader_cost: 0.00008 sec ips: 7.26970 images/s eta: 3:46:23
[01/18 14:29:44] ppgan.engine.trainer INFO: Iter: 1300/100000 lr: 1.987e-04 loss: 24.029 batch_cost: 0.14615 sec reader_cost: 0.00010 sec ips: 6.84228 images/s eta: 4:00:25
[01/18 14:29:51] ppgan.engine.trainer INFO: Iter: 1350/100000 lr: 1.986e-04 loss: 14.311 batch_cost: 0.14573 sec reader_cost: 0.00009 sec ips: 6.86211 images/s eta: 3:59:36
[01/18 14:29:58] ppgan.engine.trainer INFO: Iter: 1400/100000 lr: 1.985e-04 loss: 17.655 batch_cost: 0.14838 sec reader_cost: 0.00010 sec ips: 6.73934 images/s eta: 4:03:50
[01/18 14:30:06] ppgan.engine.trainer INFO: Iter: 1450/100000 lr: 1.984e-04 loss: 9.815 batch_cost: 0.14712 sec reader_cost: 0.00010 sec ips: 6.79716 images/s eta: 4:01:38
[01/18 14:30:13] ppgan.engine.trainer INFO: Iter: 1500/100000 lr: 1.982e-04 loss: 20.609 batch_cost: 0.15017 sec reader_cost: 0.00010 sec ips: 6.65920 images/s eta: 4:06:31
[01/18 14:30:21] ppgan.engine.trainer INFO: Iter: 1550/100000 lr: 1.981e-04 loss: 22.655 batch_cost: 0.14877 sec reader_cost: 0.00010 sec ips: 6.72189 images/s eta: 4:04:06
[01/18 14:30:28] ppgan.engine.trainer INFO: Iter: 1600/100000 lr: 1.980e-04 loss: 22.751 batch_cost: 0.14651 sec reader_cost: 0.00010 sec ips: 6.82556 images/s eta: 4:00:16
[01/18 14:30:35] ppgan.engine.trainer INFO: Iter: 1650/100000 lr: 1.979e-04 loss: 368.427 batch_cost: 0.14723 sec reader_cost: 0.00010 sec ips: 6.79201 images/s eta: 4:01:20
[01/18 14:30:43] ppgan.engine.trainer INFO: Iter: 1700/100000 lr: 1.977e-04 loss: 19.512 batch_cost: 0.14658 sec reader_cost: 0.00009 sec ips: 6.82226 images/s eta: 4:00:08
[01/18 14:30:50] ppgan.engine.trainer INFO: Iter: 1750/100000 lr: 1.976e-04 loss: 9.517 batch_cost: 0.14538 sec reader_cost: 0.00009 sec ips: 6.87872 images/s eta: 3:58:03
[01/18 14:30:57] ppgan.engine.trainer INFO: Iter: 1800/100000 lr: 1.975e-04 loss: 17.127 batch_cost: 0.13771 sec reader_cost: 0.00009 sec ips: 7.26190 images/s eta: 3:45:22
[01/18 14:31:03] ppgan.engine.trainer INFO: Iter: 1850/100000 lr: 1.973e-04 loss: 9.396 batch_cost: 0.13269 sec reader_cost: 0.00009 sec ips: 7.53654 images/s eta: 3:37:03
[01/18 14:31:10] ppgan.engine.trainer INFO: Iter: 1900/100000 lr: 1.972e-04 loss: 11.732 batch_cost: 0.12869 sec reader_cost: 0.00008 sec ips: 7.77042 images/s eta: 3:30:24
[01/18 14:31:17] ppgan.engine.trainer INFO: Iter: 1950/100000 lr: 1.970e-04 loss: 6.371 batch_cost: 0.14135 sec reader_cost: 0.00009 sec ips: 7.07441 images/s eta: 3:50:59
[01/18 14:31:24] ppgan.engine.trainer INFO: Iter: 2000/100000 lr: 1.969e-04 loss: 19.288 batch_cost: 0.14611 sec reader_cost: 0.00009 sec ips: 6.84437 images/s eta: 3:58:38
[01/18 14:31:24] ppgan.engine.trainer INFO: Test iter: [0/836]
/home/aistudio/work/PaddleGAN/ppgan/metrics/psnr_ssim.py:176: RuntimeWarning: Mean of empty slice.
return ssim_map.mean()
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/numpy/core/_methods.py:188: RuntimeWarning: invalid value encountered in double_scalars
ret = ret.dtype.type(ret / rcount)
[01/18 14:31:30] ppgan.engine.trainer INFO: Test iter: [50/836]
[01/18 14:31:36] ppgan.engine.trainer INFO: Test iter: [100/836]
[01/18 14:31:41] ppgan.engine.trainer INFO: Test iter: [150/836]
[01/18 14:31:47] ppgan.engine.trainer INFO: Test iter: [200/836]
[01/18 14:31:52] ppgan.engine.trainer INFO: Test iter: [250/836]
[01/18 14:31:58] ppgan.engine.trainer INFO: Test iter: [300/836]
[01/18 14:32:04] ppgan.engine.trainer INFO: Test iter: [350/836]
[01/18 14:32:09] ppgan.engine.trainer INFO: Test iter: [400/836]
[01/18 14:32:15] ppgan.engine.trainer INFO: Test iter: [450/836]
[01/18 14:32:20] ppgan.engine.trainer INFO: Test iter: [500/836]
[01/18 14:32:26] ppgan.engine.trainer INFO: Test iter: [550/836]
[01/18 14:32:31] ppgan.engine.trainer INFO: Test iter: [600/836]
[01/18 14:32:37] ppgan.engine.trainer INFO: Test iter: [650/836]
[01/18 14:32:42] ppgan.engine.trainer INFO: Test iter: [700/836]
[01/18 14:32:48] ppgan.engine.trainer INFO: Test iter: [750/836]
[01/18 14:32:54] ppgan.engine.trainer INFO: Test iter: [800/836]
[01/18 14:32:58] ppgan.engine.trainer INFO: Metric psnr: inf
[01/18 14:32:58] ppgan.engine.trainer INFO: Metric ssim: nan
[01/18 14:33:05] ppgan.engine.trainer INFO: Iter: 2050/100000 lr: 1.967e-04 loss: 4.794 batch_cost: 0.12932 sec reader_cost: 0.00008 sec ips: 7.73275 images/s eta: 3:31:06
[01/18 14:33:11] ppgan.engine.trainer INFO: Iter: 2100/100000 lr: 1.966e-04 loss: 6.932 batch_cost: 0.13047 sec reader_cost: 0.00008 sec ips: 7.66441 images/s eta: 3:32:53
[01/18 14:33:18] ppgan.engine.trainer INFO: Iter: 2150/100000 lr: 1.964e-04 loss: 11.207 batch_cost: 0.13163 sec reader_cost: 0.00009 sec ips: 7.59711 images/s eta: 3:34:39
[01/18 14:33:24] ppgan.engine.trainer INFO: Iter: 2200/100000 lr: 1.962e-04 loss: 11.202 batch_cost: 0.13088 sec reader_cost: 0.00008 sec ips: 7.64079 images/s eta: 3:33:19
[01/18 14:33:31] ppgan.engine.trainer INFO: Iter: 2250/100000 lr: 1.961e-04 loss: 11.765 batch_cost: 0.13034 sec reader_cost: 0.00008 sec ips: 7.67220 images/s eta: 3:32:20
[01/18 14:33:37] ppgan.engine.trainer INFO: Iter: 2300/100000 lr: 1.959e-04 loss: 12.314 batch_cost: 0.13326 sec reader_cost: 0.00009 sec ips: 7.50428 images/s eta: 3:36:59
[01/18 14:33:44] ppgan.engine.trainer INFO: Iter: 2350/100000 lr: 1.957e-04 loss: 3.787 batch_cost: 0.13138 sec reader_cost: 0.00009 sec ips: 7.61134 images/s eta: 3:33:49
[01/18 14:33:50] ppgan.engine.trainer INFO: Iter: 2400/100000 lr: 1.955e-04 loss: 11.284 batch_cost: 0.12889 sec reader_cost: 0.00008 sec ips: 7.75870 images/s eta: 3:29:39
[01/18 14:33:57] ppgan.engine.trainer INFO: Iter: 2450/100000 lr: 1.953e-04 loss: 22.345 batch_cost: 0.12972 sec reader_cost: 0.00008 sec ips: 7.70919 images/s eta: 3:30:53
[01/18 14:34:03] ppgan.engine.trainer INFO: Iter: 2500/100000 lr: 1.951e-04 loss: 16.613 batch_cost: 0.12862 sec reader_cost: 0.00008 sec ips: 7.77481 images/s eta: 3:29:00
Exception in thread Thread-2:
Traceback (most recent call last):
File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/threading.py", line 926, in _bootstrap_inner
self.run()
File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 213, in _thread_loop
self._thread_done_event)
File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dataloader/fetcher.py", line 121, in fetch
data.append(self.dataset[idx])
File "/home/aistudio/work/PaddleGAN/ppgan/datasets/mpr_dataset.py", line 105, in __getitem__
inp_img = to_tensor(inp_img)
File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/vision/transforms/functional.py", line 82, in to_tensor
return F_pil.to_tensor(pic, data_format)
File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/vision/transforms/functional_pil.py", line 88, in to_tensor
img = paddle.cast(img, np.float32) / 255.
File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/layers/tensor.py", line 249, in cast
out = _C_ops.cast(x, 'in_dtype', x.dtype, 'out_dtype', dtype)
SystemError: (Fatal) Operator cast raises an paddle::memory::allocation::BadAlloc exception.
The exception content is
:ResourceExhaustedError:
Out of memory error on GPU 0. Cannot allocate 3.000244MB memory on GPU 0, 31.716675GB memory has been allocated and available memory is only 2.625000MB.
Please check whether there is any other process using GPU 0.
1. If yes, please stop them, or start PaddlePaddle on another GPU.
2. If no, please decrease the batch size of your model.
(at /paddle/paddle/fluid/memory/allocation/cuda_allocator.cc:79)
. (at /paddle/paddle/fluid/imperative/tracer.cc:221)
Traceback (most recent call last):
File "tools/main.py", line 56, in <module>
main(args, cfg)
File "tools/main.py", line 46, in main
trainer.train()
File "/home/aistudio/work/PaddleGAN/ppgan/engine/trainer.py", line 191, in train
self.model.train_iter(self.optimizers)
File "/home/aistudio/work/PaddleGAN/ppgan/models/mpr_model.py", line 59, in train_iter
restored = self.nets['generator'](self.lq)
File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 914, in __call__
outputs = self.forward(*inputs, **kwargs)
File "/home/aistudio/work/PaddleGAN/ppgan/models/generators/mpr.py", line 510, in forward
x3_cat = self.stage3_orsnet(x3_cat, feat2, res2)
File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 914, in __call__
outputs = self.forward(*inputs, **kwargs)
File "/home/aistudio/work/PaddleGAN/ppgan/models/generators/mpr.py", line 338, in forward
x = self.orb3(x)
File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 914, in __call__
outputs = self.forward(*inputs, **kwargs)
File "/home/aistudio/work/PaddleGAN/ppgan/models/generators/mpr.py", line 276, in forward
res = self.body(x)
File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 914, in __call__
outputs = self.forward(*inputs, **kwargs)
File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/container.py", line 98, in forward
input = layer(input)
File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 914, in __call__
outputs = self.forward(*inputs, **kwargs)
File "/home/aistudio/work/PaddleGAN/ppgan/models/generators/mpr.py", line 65, in forward
res = self.body(x)
File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 914, in __call__
outputs = self.forward(*inputs, **kwargs)
File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/container.py", line 98, in forward
input = layer(input)
File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 914, in __call__
outputs = self.forward(*inputs, **kwargs)
File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/nn/layer/conv.py", line 677, in forward
use_cudnn=self._use_cudnn)
File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/nn/functional/conv.py", line 123, in _conv_nd
pre_bias = getattr(_C_ops, op_type)(x, weight, *attrs)
SystemError: (Fatal) Operator conv2d raises an paddle::memory::allocation::BadAlloc exception.
The exception content is
:ResourceExhaustedError:
Out of memory error on GPU 0. Cannot allocate 334.750000kB memory on GPU 0, 31.716675GB memory has been allocated and available memory is only 2.625000MB.
Please check whether there is any other process using GPU 0.
1. If yes, please stop them, or start PaddlePaddle on another GPU.
2. If no, please decrease the batch size of your model.
(at /paddle/paddle/fluid/memory/allocation/cuda_allocator.cc:79)
. (at /paddle/paddle/fluid/imperative/tracer.cc:221)
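If it helps narrow things down, here is a minimal sketch (not verified on this exact image) that repeatedly runs forward/backward on a fixed random patch with the generator settings from the config above, to check whether the model step alone already leaks. It assumes paddle.device.cuda.memory_allocated exists in the installed Paddle build (if not, the nvidia-smi polling snippet above can be used instead) and that MPRNet accepts the same kwargs as in mprnet_test.yaml:

# leak_check.py -- run MPRNet forward/backward on a fixed input and watch allocation.
# Sketch only: assumes paddle.device.cuda.memory_allocated is available in this
# Paddle build and that MPRNet takes the same kwargs as in mprnet_test.yaml.
import paddle
from ppgan.models.generators.mpr import MPRNet

net = MPRNet(n_feat=40, scale_unetfeats=20, scale_orsnetfeats=16)
opt = paddle.optimizer.Adam(learning_rate=2e-4, parameters=net.parameters())
x = paddle.randn([1, 3, 16, 16])  # same shape as one training patch (patch_size: 16)
y = paddle.randn([1, 3, 16, 16])

for step in range(1, 501):
    restored = net(x)              # MPRNet returns one output per stage
    loss = sum((r - y).abs().mean() for r in restored)
    loss.backward()
    opt.step()
    opt.clear_grad()
    if step % 50 == 0:
        mem = paddle.device.cuda.memory_allocated() / 1024 ** 2
        print(f"step {step:4d}  allocated {mem:.1f} MiB")

If the printed allocation stays flat here, the growth presumably comes from the validation loop or the dataloader rather than from train_iter itself.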