root@4460709f4b11:/workspace# bash scripts/test_training.sh
/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py:163: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
logger.warn(
The module torch.distributed.launch is deprecated and going to be removed in future.Migrate to torch.distributed.run
WARNING:torch.distributed.run:torch.distributed.launch
is Deprecated. Use torch.distributed.run
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
entrypoint : train.py
min_nodes : 1
max_nodes : 1
nproc_per_node : 1
run_id : none
rdzv_backend : static
rdzv_endpoint : 127.0.0.1:29500
rdzv_configs : {'rank': 0, 'timeout': 900}
max_restarts : 3
monitor_interval : 5
log_dir : None
metrics_cfg : {}
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_oe27bhot/none_vqivpge9
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:52: FutureWarning: This is an experimental API and will be changed in future.
warnings.warn(
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=0
master_addr=127.0.0.1
master_port=29500
group_rank=0
group_world_size=1
local_ranks=[0]
role_ranks=[0]
global_ranks=[0]
role_world_sizes=[1]
global_world_sizes=[1]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_oe27bhot/none_vqivpge9/attempt_0/0/error.json
Downloading: "https://download.pytorch.org/models/vgg19-dcbb9e9d.pth" to /root/.cache/torch/hub/checkpoints/vgg19-dcbb9e9d.pth
100%|█████████████████████████████████████████████████████████████████████| 548M/548M [01:40<00:00, 5.71MB/s]
/opt/conda/lib/python3.8/site-packages/torch/nn/functional.py:3590: UserWarning: Default upsampling behavior when mode=bicubic is changed to align_corners=False since 0.4.0. Please specify align_corners=True if the old behavior is desired. See the documentation of nn.Upsample for details.
warnings.warn(
/opt/conda/lib/python3.8/site-packages/torch/nn/functional.py:3638: UserWarning: The default behavior for interpolate/upsample with float scale_factor changed in 1.6.0 to align with other frameworks/libraries, and now uses scale_factor directly, instead of relying on the computed output size. If you wish to restore the old behavior, please set recompute_scale_factor=True. See the documentation of nn.Upsample for details.
warnings.warn(
/opt/conda/lib/python3.8/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at ../c10/core/TensorImpl.h:1153.)
return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
INFO:torch.distributed.elastic.agent.server.api:[default] worker group successfully finished. Waiting 300 seconds for other agents to finish.
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (SUCCEEDED). Waiting 300 seconds for other agents to finish
/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:70: FutureWarning: This is an experimental API and will be changed in future.
warnings.warn(
INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.0005445480346679688 seconds
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 0, "group_rank": 0, "worker_id": "386", "role": "default", "hostname": "4460709f4b11", "state": "SUCCEEDED", "total_run_time": 125, "rdzv_backend": "static", "raw_error": null, "metadata": "{"group_world_size": 1, "entry_point": "python", "local_rank": [0], "role_rank": [0], "role_world_size": [1]}", "agent_restarts": 0}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "AGENT", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": null, "group_rank": 0, "worker_id": null, "role": "default", "hostname": "4460709f4b11", "state": "SUCCEEDED", "total_run_time": 125, "rdzv_backend": "static", "raw_error": null, "metadata": "{"group_world_size": 1, "entry_point": "python"}", "agent_restarts": 0}}
python -m torch.distributed.launch --nproc_per_node=1 train.py --config configs/unit_test/spade.yaml >> /tmp/unit_test.log [Success]
/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py:163: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
logger.warn(
The module torch.distributed.launch is deprecated and going to be removed in future.Migrate to torch.distributed.run
WARNING:torch.distributed.run:torch.distributed.launch
is Deprecated. Use torch.distributed.run
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
entrypoint : train.py
min_nodes : 1
max_nodes : 1
nproc_per_node : 1
run_id : none
rdzv_backend : static
rdzv_endpoint : 127.0.0.1:29500
rdzv_configs : {'rank': 0, 'timeout': 900}
max_restarts : 3
monitor_interval : 5
log_dir : None
metrics_cfg : {}
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_4bbnta_g/none_vmrfh3_4
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:52: FutureWarning: This is an experimental API and will be changed in future.
warnings.warn(
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=0
master_addr=127.0.0.1
master_port=29500
group_rank=0
group_world_size=1
local_ranks=[0]
role_ranks=[0]
global_ranks=[0]
role_world_sizes=[1]
global_world_sizes=[1]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_4bbnta_g/none_vmrfh3_4/attempt_0/0/error.json
/opt/conda/lib/python3.8/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at ../c10/core/TensorImpl.h:1153.)
return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
INFO:torch.distributed.elastic.agent.server.api:[default] worker group successfully finished. Waiting 300 seconds for other agents to finish.
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (SUCCEEDED). Waiting 300 seconds for other agents to finish
/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:70: FutureWarning: This is an experimental API and will be changed in future.
warnings.warn(
INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.0006163120269775391 seconds
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 0, "group_rank": 0, "worker_id": "936", "role": "default", "hostname": "4460709f4b11", "state": "SUCCEEDED", "total_run_time": 20, "rdzv_backend": "static", "raw_error": null, "metadata": "{"group_world_size": 1, "entry_point": "python", "local_rank": [0], "role_rank": [0], "role_world_size": [1]}", "agent_restarts": 0}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "AGENT", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": null, "group_rank": 0, "worker_id": null, "role": "default", "hostname": "4460709f4b11", "state": "SUCCEEDED", "total_run_time": 20, "rdzv_backend": "static", "raw_error": null, "metadata": "{"group_world_size": 1, "entry_point": "python"}", "agent_restarts": 0}}
python -m torch.distributed.launch --nproc_per_node=1 train.py --config configs/unit_test/pix2pixHD.yaml >> /tmp/unit_test.log [Success]
/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py:163: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
logger.warn(
The module torch.distributed.launch is deprecated and going to be removed in future.Migrate to torch.distributed.run
WARNING:torch.distributed.run:torch.distributed.launch
is Deprecated. Use torch.distributed.run
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
entrypoint : train.py
min_nodes : 1
max_nodes : 1
nproc_per_node : 1
run_id : none
rdzv_backend : static
rdzv_endpoint : 127.0.0.1:29500
rdzv_configs : {'rank': 0, 'timeout': 900}
max_restarts : 3
monitor_interval : 5
log_dir : None
metrics_cfg : {}
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_62ca57fw/none_cbejg5ie
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:52: FutureWarning: This is an experimental API and will be changed in future.
warnings.warn(
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=0
master_addr=127.0.0.1
master_port=29500
group_rank=0
group_world_size=1
local_ranks=[0]
role_ranks=[0]
global_ranks=[0]
role_world_sizes=[1]
global_world_sizes=[1]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_62ca57fw/none_cbejg5ie/attempt_0/0/error.json
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
INFO:torch.distributed.elastic.agent.server.api:[default] worker group successfully finished. Waiting 300 seconds for other agents to finish.
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (SUCCEEDED). Waiting 300 seconds for other agents to finish
/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:70: FutureWarning: This is an experimental API and will be changed in future.
warnings.warn(
INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.0006213188171386719 seconds
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 0, "group_rank": 0, "worker_id": "1313", "role": "default", "hostname": "4460709f4b11", "state": "SUCCEEDED", "total_run_time": 20, "rdzv_backend": "static", "raw_error": null, "metadata": "{"group_world_size": 1, "entry_point": "python", "local_rank": [0], "role_rank": [0], "role_world_size": [1]}", "agent_restarts": 0}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "AGENT", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": null, "group_rank": 0, "worker_id": null, "role": "default", "hostname": "4460709f4b11", "state": "SUCCEEDED", "total_run_time": 20, "rdzv_backend": "static", "raw_error": null, "metadata": "{"group_world_size": 1, "entry_point": "python"}", "agent_restarts": 0}}
python -m torch.distributed.launch --nproc_per_node=1 train.py --config configs/unit_test/munit.yaml >> /tmp/unit_test.log [Success]
/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py:163: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
logger.warn(
The module torch.distributed.launch is deprecated and going to be removed in future.Migrate to torch.distributed.run
WARNING:torch.distributed.run:torch.distributed.launch
is Deprecated. Use torch.distributed.run
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
entrypoint : train.py
min_nodes : 1
max_nodes : 1
nproc_per_node : 1
run_id : none
rdzv_backend : static
rdzv_endpoint : 127.0.0.1:29500
rdzv_configs : {'rank': 0, 'timeout': 900}
max_restarts : 3
monitor_interval : 5
log_dir : None
metrics_cfg : {}
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_peoaucrn/none_je_rxzez
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:52: FutureWarning: This is an experimental API and will be changed in future.
warnings.warn(
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=0
master_addr=127.0.0.1
master_port=29500
group_rank=0
group_world_size=1
local_ranks=[0]
role_ranks=[0]
global_ranks=[0]
role_world_sizes=[1]
global_world_sizes=[1]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_peoaucrn/none_je_rxzez/attempt_0/0/error.json
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
INFO:torch.distributed.elastic.agent.server.api:[default] worker group successfully finished. Waiting 300 seconds for other agents to finish.
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (SUCCEEDED). Waiting 300 seconds for other agents to finish
/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:70: FutureWarning: This is an experimental API and will be changed in future.
warnings.warn(
INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.0005629062652587891 seconds
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 0, "group_rank": 0, "worker_id": "1561", "role": "default", "hostname": "4460709f4b11", "state": "SUCCEEDED", "total_run_time": 20, "rdzv_backend": "static", "raw_error": null, "metadata": "{"group_world_size": 1, "entry_point": "python", "local_rank": [0], "role_rank": [0], "role_world_size": [1]}", "agent_restarts": 0}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "AGENT", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": null, "group_rank": 0, "worker_id": null, "role": "default", "hostname": "4460709f4b11", "state": "SUCCEEDED", "total_run_time": 20, "rdzv_backend": "static", "raw_error": null, "metadata": "{"group_world_size": 1, "entry_point": "python"}", "agent_restarts": 0}}
python -m torch.distributed.launch --nproc_per_node=1 train.py --config configs/unit_test/munit_patch.yaml >> /tmp/unit_test.log [Success]
/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py:163: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
logger.warn(
The module torch.distributed.launch is deprecated and going to be removed in future.Migrate to torch.distributed.run
WARNING:torch.distributed.run:torch.distributed.launch
is Deprecated. Use torch.distributed.run
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
entrypoint : train.py
min_nodes : 1
max_nodes : 1
nproc_per_node : 1
run_id : none
rdzv_backend : static
rdzv_endpoint : 127.0.0.1:29500
rdzv_configs : {'rank': 0, 'timeout': 900}
max_restarts : 3
monitor_interval : 5
log_dir : None
metrics_cfg : {}
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_gf_rgw_1/none_yd9hagbt
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:52: FutureWarning: This is an experimental API and will be changed in future.
warnings.warn(
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=0
master_addr=127.0.0.1
master_port=29500
group_rank=0
group_world_size=1
local_ranks=[0]
role_ranks=[0]
global_ranks=[0]
role_world_sizes=[1]
global_world_sizes=[1]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_gf_rgw_1/none_yd9hagbt/attempt_0/0/error.json
/opt/conda/lib/python3.8/site-packages/torch/nn/functional.py:3638: UserWarning: The default behavior for interpolate/upsample with float scale_factor changed in 1.6.0 to align with other frameworks/libraries, and now uses scale_factor directly, instead of relying on the computed output size. If you wish to restore the old behavior, please set recompute_scale_factor=True. See the documentation of nn.Upsample for details.
warnings.warn(
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
INFO:torch.distributed.elastic.agent.server.api:[default] worker group successfully finished. Waiting 300 seconds for other agents to finish.
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (SUCCEEDED). Waiting 300 seconds for other agents to finish
/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:70: FutureWarning: This is an experimental API and will be changed in future.
warnings.warn(
INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.0006194114685058594 seconds
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 0, "group_rank": 0, "worker_id": "1809", "role": "default", "hostname": "4460709f4b11", "state": "SUCCEEDED", "total_run_time": 15, "rdzv_backend": "static", "raw_error": null, "metadata": "{"group_world_size": 1, "entry_point": "python", "local_rank": [0], "role_rank": [0], "role_world_size": [1]}", "agent_restarts": 0}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "AGENT", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": null, "group_rank": 0, "worker_id": null, "role": "default", "hostname": "4460709f4b11", "state": "SUCCEEDED", "total_run_time": 15, "rdzv_backend": "static", "raw_error": null, "metadata": "{"group_world_size": 1, "entry_point": "python"}", "agent_restarts": 0}}
python -m torch.distributed.launch --nproc_per_node=1 train.py --config configs/unit_test/unit.yaml >> /tmp/unit_test.log [Success]
/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py:163: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
logger.warn(
The module torch.distributed.launch is deprecated and going to be removed in future.Migrate to torch.distributed.run
WARNING:torch.distributed.run:torch.distributed.launch
is Deprecated. Use torch.distributed.run
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
entrypoint : train.py
min_nodes : 1
max_nodes : 1
nproc_per_node : 1
run_id : none
rdzv_backend : static
rdzv_endpoint : 127.0.0.1:29500
rdzv_configs : {'rank': 0, 'timeout': 900}
max_restarts : 3
monitor_interval : 5
log_dir : None
metrics_cfg : {}
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_k4k83su_/none_doq690x_
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:52: FutureWarning: This is an experimental API and will be changed in future.
warnings.warn(
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=0
master_addr=127.0.0.1
master_port=29500
group_rank=0
group_world_size=1
local_ranks=[0]
role_ranks=[0]
global_ranks=[0]
role_world_sizes=[1]
global_world_sizes=[1]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_k4k83su_/none_doq690x_/attempt_0/0/error.json
INFO:torch.distributed.elastic.agent.server.api:[default] worker group successfully finished. Waiting 300 seconds for other agents to finish.
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (SUCCEEDED). Waiting 300 seconds for other agents to finish
/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:70: FutureWarning: This is an experimental API and will be changed in future.
warnings.warn(
INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.0004730224609375 seconds
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 0, "group_rank": 0, "worker_id": "2358", "role": "default", "hostname": "4460709f4b11", "state": "SUCCEEDED", "total_run_time": 15, "rdzv_backend": "static", "raw_error": null, "metadata": "{"group_world_size": 1, "entry_point": "python", "local_rank": [0], "role_rank": [0], "role_world_size": [1]}", "agent_restarts": 0}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "AGENT", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": null, "group_rank": 0, "worker_id": null, "role": "default", "hostname": "4460709f4b11", "state": "SUCCEEDED", "total_run_time": 15, "rdzv_backend": "static", "raw_error": null, "metadata": "{"group_world_size": 1, "entry_point": "python"}", "agent_restarts": 0}}
python -m torch.distributed.launch --nproc_per_node=1 train.py --config configs/unit_test/funit.yaml >> /tmp/unit_test.log [Success]
/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py:163: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
logger.warn(
The module torch.distributed.launch is deprecated and going to be removed in future.Migrate to torch.distributed.run
WARNING:torch.distributed.run:torch.distributed.launch
is Deprecated. Use torch.distributed.run
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
entrypoint : train.py
min_nodes : 1
max_nodes : 1
nproc_per_node : 1
run_id : none
rdzv_backend : static
rdzv_endpoint : 127.0.0.1:29500
rdzv_configs : {'rank': 0, 'timeout': 900}
max_restarts : 3
monitor_interval : 5
log_dir : None
metrics_cfg : {}
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_qmiyb0yp/none_nw6jcn_x
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:52: FutureWarning: This is an experimental API and will be changed in future.
warnings.warn(
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=0
master_addr=127.0.0.1
master_port=29500
group_rank=0
group_world_size=1
local_ranks=[0]
role_ranks=[0]
global_ranks=[0]
role_world_sizes=[1]
global_world_sizes=[1]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_qmiyb0yp/none_nw6jcn_x/attempt_0/0/error.json
INFO:torch.distributed.elastic.agent.server.api:[default] worker group successfully finished. Waiting 300 seconds for other agents to finish.
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (SUCCEEDED). Waiting 300 seconds for other agents to finish
/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:70: FutureWarning: This is an experimental API and will be changed in future.
warnings.warn(
INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.0007641315460205078 seconds
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 0, "group_rank": 0, "worker_id": "2600", "role": "default", "hostname": "4460709f4b11", "state": "SUCCEEDED", "total_run_time": 15, "rdzv_backend": "static", "raw_error": null, "metadata": "{"group_world_size": 1, "entry_point": "python", "local_rank": [0], "role_rank": [0], "role_world_size": [1]}", "agent_restarts": 0}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "AGENT", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": null, "group_rank": 0, "worker_id": null, "role": "default", "hostname": "4460709f4b11", "state": "SUCCEEDED", "total_run_time": 15, "rdzv_backend": "static", "raw_error": null, "metadata": "{"group_world_size": 1, "entry_point": "python"}", "agent_restarts": 0}}
python -m torch.distributed.launch --nproc_per_node=1 train.py --config configs/unit_test/coco_funit.yaml >> /tmp/unit_test.log [Success]
Traceback (most recent call last):
  File "train.py", line 168, in <module>
    main()
  File "train.py", line 92, in main
    trainer = get_trainer(cfg, net_G, net_D,
  File "/workspace/imaginaire/utils/trainer.py", line 59, in get_trainer
    trainer = trainer_lib.Trainer(cfg, net_G, net_D,
  File "/workspace/imaginaire/trainers/vid2vid.py", line 44, in __init__
    super(Trainer, self).__init__(cfg, net_G, net_D, opt_G,
  File "/workspace/imaginaire/trainers/base.py", line 99, in __init__
    self._init_loss(cfg)
  File "/workspace/imaginaire/trainers/vid2vid.py", line 145, in _init_loss
    self.criteria['Flow'] = FlowLoss(cfg)
  File "/workspace/imaginaire/losses/flow.py", line 59, in __init__
    self.flowNet = flow_module.FlowNet(pretrained=True)
  File "/workspace/imaginaire/third_party/flow_net/flow_net.py", line 30, in __init__
    checkpoint = torch.load(flownet2_path,
  File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 608, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 777, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: invalid load key, '<'.
python train.py --single_gpu --config configs/unit_test/vid2vid_street.yaml >> /tmp/unit_test.log [Failure]
root@4460709f4b11:/workspace#
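Note on the single failure above: "invalid load key, '<'" from torch.load usually means the file being unpickled starts with a '<' character, i.e. an HTML page was saved in place of the real checkpoint (a common symptom of an interrupted or confirmation-gated download, e.g. from Google Drive). In other words, the FlowNet2 weights loaded by imaginaire/third_party/flow_net/flow_net.py are likely a bad download rather than a code bug. Below is a minimal diagnostic sketch; the checkpoint path is an assumption for illustration, so substitute whatever flownet2_path resolves to in flow_net.py.

# Minimal diagnostic sketch (assumed path, not the exact one used by imaginaire):
# print the size and first bytes of the supposed checkpoint so an HTML page is obvious.
import os

def inspect_checkpoint(path):
    """Show whether a 'checkpoint' file looks like a real binary or a stray HTML page."""
    if not os.path.isfile(path):
        print(f"{path}: missing")
        return
    with open(path, "rb") as f:
        head = f.read(16)
    print(f"{path}: {os.path.getsize(path)} bytes, starts with {head!r}")
    if head.lstrip().startswith(b"<"):
        print("Looks like HTML, not a checkpoint -- re-download the FlowNet2 weights.")

inspect_checkpoint("/workspace/checkpoints/flownet2.pth.tar")  # hypothetical path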