Dear csjliang,

When I run distributed training following the README, training proceeds normally until validation runs at iteration 20,000. At that point every rank crashes with an UnboundLocalError on gt_img, and every torchelastic restart afterwards times out while initializing the process group, until all retry attempts are exhausted.
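For reference, I launched training roughly like this (3 GPUs, adapted from the README's distributed-training command; the option file is a placeholder for my local config):

PYTHONPATH="./:${PYTHONPATH}" CUDA_VISIBLE_DEVICES=0,1,2 python -m torch.distributed.launch --nproc_per_node=3 --master_port=4321 codes/train.py -opt <my_train_config.yml> --launcher pytorch

The full log follows: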
2021-07-25 20:04:34,446 INFO: [LPTN_..][epoch: 16, iter: 19,800, lr:(1.000e-04,)] [eta: 2 days, 18:43:45, time (data): 0.178 (0.001)] l_g_pix: 3.1238e+01 l_g_gan: 8.7738e+01 l_d_real: 7.0186e+01 out_d_real: -7.0186e+01 l_d_fake: -8.7537e+01 out_d_fake: -8.7537e+01
2021-07-25 20:06:00,337 INFO: [LPTN_..][epoch: 17, iter: 19,900, lr:(1.000e-04,)] [eta: 2 days, 18:42:21, time (data): 0.504 (0.001)] l_g_pix: 2.0734e+01 l_g_gan: 9.3872e+01 l_d_real: 6.7697e+01 out_d_real: -6.7697e+01 l_d_fake: -9.4580e+01 out_d_fake: -9.4580e+01
2021-07-25 20:07:30,459 INFO: [LPTN_..][epoch: 17, iter: 20,000, lr:(1.000e-04,)] [eta: 2 days, 18:41:57, time (data): 0.202 (0.001)] l_g_pix: 3.0153e+01 l_g_gan: 9.9768e+01 l_d_real: 7.4591e+01 out_d_real: -7.4591e+01 l_d_fake: -9.9862e+01 out_d_fake: -9.9862e+01
2021-07-25 20:07:30,460 INFO: Saving models and training states.
0%| | 0/998 [00:00<?, ?image/s]
2021-07-25 20:07:30,515 INFO: Only support single GPU validation.
Traceback (most recent call last):
File "codes/train.py", line 249, in <module>
main()
File "codes/train.py", line 226, in main
model.validation(val_loader, current_iter, tb_logger,
File "/home/delight-gpu/project/LPTN/codes/models/base_model.py", line 45, in validation
self.dist_validation(dataloader, current_iter, tb_logger, save_img)
File "/home/delight-gpu/project/LPTN/codes/models/lptn_model.py", line 169, in dist_validation
self.nondist_validation(dataloader, current_iter, tb_logger, save_img)
File "/home/delight-gpu/project/LPTN/codes/models/lptn_model.py", line 225, in nondist_validation
metric_module, metric_type)(result_img, gt_img, **opt_)
UnboundLocalError: local variable 'gt_img' referenced before assignment
(The other two ranks print the identical UnboundLocalError traceback.)
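In case it helps, my reading of the UnboundLocalError is that gt_img is only assigned when a validation batch actually contains a 'gt' image, so the metric call in nondist_validation crashes when the configured val set has no ground truth. A minimal, self-contained reproduction of that pattern (made-up names for illustration, not the repo's actual code):

# Hypothetical reproduction of the failing pattern; not LPTN's real code.
def validate_one(val_data, with_metrics=True):
    result_img = val_data['lq']        # stand-in for the network output, always bound
    if 'gt' in val_data:
        gt_img = val_data['gt']        # gt_img is only bound when GT is provided
    if with_metrics:
        # Without a 'gt' entry this raises:
        # UnboundLocalError: local variable 'gt_img' referenced before assignment
        return abs(result_img - gt_img)

validate_one({'lq': 0.5, 'gt': 0.7})   # works
validate_one({'lq': 0.5})              # reproduces the UnboundLocalError

So it looks like the metrics assume paired validation data; please correct me if the real cause is different.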
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 11662) of binary: /home/delight-gpu/anaconda3/envs/lptn/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 3/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=1
master_addr=127.0.0.1
master_port=4321
group_rank=0
group_world_size=1
local_ranks=[0, 1, 2]
role_ranks=[0, 1, 2]
global_ranks=[0, 1, 2]
role_world_sizes=[3, 3, 3]
global_world_sizes=[3, 3, 3]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_l8eumjpm/none_twj5_557/attempt_1/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_l8eumjpm/none_twj5_557/attempt_1/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_l8eumjpm/none_twj5_557/attempt_1/2/error.json
Traceback (most recent call last):
File "codes/train.py", line 249, in <module>
main()
File "codes/train.py", line 128, in main
opt = parse_options(is_train=True)
File "codes/train.py", line 43, in parse_options
init_dist(args.launcher)
File "/home/delight-gpu/project/LPTN/codes/utils/dist_util.py", line 14, in init_dist
_init_dist_pytorch(backend, **kwargs)
File "/home/delight-gpu/project/LPTN/codes/utils/dist_util.py", line 25, in _init_dist_pytorch
dist.init_process_group(backend=backend, **kwargs)
File "/home/delight-gpu/anaconda3/envs/lptn/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
_store_based_barrier(rank, store, timeout)
File "/home/delight-gpu/anaconda3/envs/lptn/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 219, in _store_based_barrier
raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=3, worker_count=6, timeout=0:30:00)
(Ranks 1 and 2 print the same traceback interleaved, also with worker_count=6.)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 28662) of binary: /home/delight-gpu/anaconda3/envs/lptn/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 2/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=2
master_addr=127.0.0.1
master_port=4321
group_rank=0
group_world_size=1
local_ranks=[0, 1, 2]
role_ranks=[0, 1, 2]
global_ranks=[0, 1, 2]
role_world_sizes=[3, 3, 3]
global_world_sizes=[3, 3, 3]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_l8eumjpm/none_twj5_557/attempt_2/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_l8eumjpm/none_twj5_557/attempt_2/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_l8eumjpm/none_twj5_557/attempt_2/2/error.json
Traceback (most recent call last):
File "codes/train.py", line 249, in
main()
File "codes/train.py", line 128, in main
opt = parse_options(is_train=True)
File "codes/train.py", line 43, in parse_options
init_dist(args.launcher)
File "/home/delight-gpu/project/LPTN/codes/utils/dist_util.py", line 14, in init_dist
_init_dist_pytorch(backend, **kwargs)
File "/home/delight-gpu/project/LPTN/codes/utils/dist_util.py", line 25, in _init_dist_pytorch
dist.init_process_group(backend=backend, **kwargs)
File "/home/delight-gpu/anaconda3/envs/lptn/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
_store_based_barrier(rank, store, timeout)
File "/home/delight-gpu/anaconda3/envs/lptn/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 219, in _store_based_barrier
raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 2, for key: store_based_barrier_key:1 (world_size=3, worker_count=9, timeout=0:30:00)
(Ranks 0 and 1 print the same timed-out traceback interleaved, also with worker_count=9.)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 30527) of binary: /home/delight-gpu/anaconda3/envs/lptn/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 1/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=3
master_addr=127.0.0.1
master_port=4321
group_rank=0
group_world_size=1
local_ranks=[0, 1, 2]
role_ranks=[0, 1, 2]
global_ranks=[0, 1, 2]
role_world_sizes=[3, 3, 3]
global_world_sizes=[3, 3, 3]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_l8eumjpm/none_twj5_557/attempt_3/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_l8eumjpm/none_twj5_557/attempt_3/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_l8eumjpm/none_twj5_557/attempt_3/2/error.json
(Ranks 0 and 1 print the same timed-out traceback interleaved, with worker_count=12; rank 2's copy follows.)
Traceback (most recent call last):
File "codes/train.py", line 249, in
main()
File "codes/train.py", line 128, in main
opt = parse_options(is_train=True)
File "codes/train.py", line 43, in parse_options
init_dist(args.launcher)
File "/home/delight-gpu/project/LPTN/codes/utils/dist_util.py", line 14, in init_dist
_init_dist_pytorch(backend, **kwargs)
File "/home/delight-gpu/project/LPTN/codes/utils/dist_util.py", line 25, in _init_dist_pytorch
dist.init_process_group(backend=backend, **kwargs)
File "/home/delight-gpu/anaconda3/envs/lptn/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
_store_based_barrier(rank, store, timeout)
File "/home/delight-gpu/anaconda3/envs/lptn/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 219, in _store_based_barrier
raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 2, for key: store_based_barrier_key:1 (world_size=3, worker_count=12, timeout=0:30:00)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 32378) of binary: /home/delight-gpu/anaconda3/envs/lptn/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (FAILED). Waiting 300 seconds for other agents to finish
/home/delight-gpu/anaconda3/envs/lptn/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:70: FutureWarning: This is an experimental API and will be changed in future.
warnings.warn(
INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.0010943412780761719 seconds
{"name": "torchelastic.worker.status.FAILED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 0, "group_rank": 0, "worker_id": "32378", "role": "default", "hostname": "LIGHT-24B.PC.CS.CMU.EDU", "state": "FAILED", "total_run_time": 22569, "rdzv_backend": "static", "raw_error": "{"message": ""}", "metadata": "{"group_world_size": 1, "entry_point": "python", "local_rank": [0], "role_rank": [0], "role_world_size": [3]}", "agent_restarts": 3}}
{"name": "torchelastic.worker.status.FAILED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 1, "group_rank": 0, "worker_id": "32379", "role": "default", "hostname": "LIGHT-24B.PC.CS.CMU.EDU", "state": "FAILED", "total_run_time": 22569, "rdzv_backend": "static", "raw_error": "{"message": ""}", "metadata": "{"group_world_size": 1, "entry_point": "python", "local_rank": [1], "role_rank": [1], "role_world_size": [3]}", "agent_restarts": 3}}
{"name": "torchelastic.worker.status.FAILED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 2, "group_rank": 0, "worker_id": "32380", "role": "default", "hostname": "LIGHT-24B.PC.CS.CMU.EDU", "state": "FAILED", "total_run_time": 22569, "rdzv_backend": "static", "raw_error": "{"message": ""}", "metadata": "{"group_world_size": 1, "entry_point": "python", "local_rank": [2], "role_rank": [2], "role_world_size": [3]}", "agent_restarts": 3}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "AGENT", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": null, "group_rank": 0, "worker_id": null, "role": "default", "hostname": "LIGHT-24B.PC.CS.CMU.EDU", "state": "SUCCEEDED", "total_run_time": 22569, "rdzv_backend": "static", "raw_error": null, "metadata": "{"group_world_size": 1, "entry_point": "python"}", "agent_restarts": 3}}
/home/delight-gpu/anaconda3/envs/lptn/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py:354: UserWarning:
CHILD PROCESS FAILED WITH NO ERROR_FILE
CHILD PROCESS FAILED WITH NO ERROR_FILE
Child process 32378 (local_rank 0) FAILED (exitcode 1)
Error msg: Process failed with exitcode 1
Without writing an error file to <N/A>.
While this DOES NOT affect the correctness of your application,
no trace information about the error will be available for inspection.
Consider decorating your top level entrypoint function with
torch.distributed.elastic.multiprocessing.errors.record. Example:
from torch.distributed.elastic.multiprocessing.errors import record
@record
def trainer_main(args):
# do train
warnings.warn(_no_error_file_warning_msg(rank, failure))
Traceback (most recent call last):
File "/home/delight-gpu/anaconda3/envs/lptn/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/delight-gpu/anaconda3/envs/lptn/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/delight-gpu/anaconda3/envs/lptn/lib/python3.8/site-packages/torch/distributed/launch.py", line 173, in
main()
File "/home/delight-gpu/anaconda3/envs/lptn/lib/python3.8/site-packages/torch/distributed/launch.py", line 169, in main
run(args)
File "/home/delight-gpu/anaconda3/envs/lptn/lib/python3.8/site-packages/torch/distributed/run.py", line 621, in run
elastic_launch(
File "/home/delight-gpu/anaconda3/envs/lptn/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 116, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/delight-gpu/anaconda3/envs/lptn/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 348, in wrapper
return f(*args, **kwargs)
File "/home/delight-gpu/anaconda3/envs/lptn/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
codes/train.py FAILED
=======================================
Root Cause:
[0]:
time: 2021-07-25_21:37:41
rank: 0 (local_rank: 0)
exitcode: 1 (pid: 32378)
error_file: <N/A>
msg: "Process failed with exitcode 1"
Other Failures:
[1]:
time: 2021-07-25_21:37:41
rank: 1 (local_rank: 1)
exitcode: 1 (pid: 32379)
error_file: <N/A>
msg: "Process failed with exitcode 1"
[2]:
time: 2021-07-25_21:37:41
rank: 2 (local_rank: 2)
exitcode: 1 (pid: 32380)
error_file: <N/A>
msg: "Process failed with exitcode 1"