tandard_attention_real_3072_partial_points_rot_90_scale_1.2_translation_0.1.json
Have set cuda visible devices to 0,1,2,3,4,5,6,7
The distributed url we use is tcp://0.0.0.0:44507
['train.py', '--config=exp_configs/mvp_configs/config_standard_attention_real_3072_partial_points_rot_90_scale_1.2_translation_0.1.json', '--group_name=group_2022_10_19-021536', '--dist_url=tcp://0.0.0.0:44507', '--rank=0']
['train.py', '--config=exp_configs/mvp_configs/config_standard_attention_real_3072_partial_points_rot_90_scale_1.2_translation_0.1.json', '--group_name=group_2022_10_19-021536', '--dist_url=tcp://0.0.0.0:44507', '--rank=1']
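The two argument lists printed above are identical except for the trailing `--rank` flag, which suggests one `train.py` process is started per rank. The sketch below shows one way such lists could be built. It assumes that one-process-per-rank layout, and the helper name `build_rank_args` is hypothetical. This is not the repository's actual launcher.

```python
from typing import List


def build_rank_args(config: str, group_name: str, dist_url: str, num_ranks: int) -> List[List[str]]:
    """Return one argv list per rank, shaped like the lists printed in the log.

    Hypothetical helper for illustration only; not the repository's launcher.
    """
    base = [
        "train.py",
        f"--config={config}",
        f"--group_name={group_name}",
        f"--dist_url={dist_url}",
    ]
    # Every rank shares the same base arguments and differs only in --rank.
    return [base + [f"--rank={rank}"] for rank in range(num_ranks)]


if __name__ == "__main__":
    cfg = ("exp_configs/mvp_configs/config_standard_attention_real_3072_partial_points"
           "_rot_90_scale_1.2_translation_0.1.json")
    for argv in build_rank_args(cfg, "group_2022_10_19-021536", "tcp://0.0.0.0:44507", 2):
        # Prints lists in the same form as the two log lines above.
        print(argv)
        # A real launcher would then start a process per list, for example
        # subprocess.Popen([sys.executable] + argv).
```

All ranks rendezvous at the same `--dist_url`, so any one rank failing during initialization typically stalls or fails the others.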
Traceback (most recent call last):
  File "train.py", line 714, in <module>
    train(num_gpus, args.config, args.rank, args.group_name, **train_config)
  File "train.py", line 335, in train
    init_distributed(rank, num_gpus, group_name, **dist_config)
  File "/home/hm/guoxiaofan/Point_Diffusion_Refinement/pointnet2/distributed.py", line 57, in init_distributed
    group_name=group_name)
  File "/home/hm/anaconda3/envs/pdr/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 455, in init_process_group
    barrier()
  File "/home/hm/anaconda3/envs/pdr/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1960, in barrier
    work = _default_pg.barrier()
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1607370116979/work/torch/lib/c10d/ProcessGroupNCCL.cpp:31, unhandled cuda error, NCCL version 2.7.8

Traceback (most recent call last):
  File "train.py", line 714, in <module>
    train(num_gpus, args.config, args.rank, args.group_name, **train_config)
  File "train.py", line 335, in train
    init_distributed(rank, num_gpus, group_name, **dist_config)
  File "/home/hm/guoxiaofan/Point_Diffusion_Refinement/pointnet2/distributed.py", line 57, in init_distributed
    group_name=group_name)
  File "/home/hm/anaconda3/envs/pdr/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 455, in init_process_group
    barrier()
  File "/home/hm/anaconda3/envs/pdr/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1960, in barrier
    work = _default_pg.barrier()
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1607370116979/work/torch/lib/c10d/ProcessGroupNCCL.cpp:31, unhandled cuda error, NCCL version 2.7.8