Hi @SimiaoZuo , I encountered problems when running bash bert_base_mnli_example.sh
The error information is below. Thanks very much!
/home/user/anaconda3/envs/MoEBERT/lib/python3.7/site-packages/torch/distributed/launch.py:164: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
"The module torch.distributed.launch is deprecated "
The module torch.distributed.launch is deprecated and going to be removed in future.Migrate to torch.distributed.run
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
WARNING:torch.distributed.run:--use_env is deprecated and will be removed in future releases.
Please read local_rank from `os.environ('LOCAL_RANK')` instead.
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
entrypoint : examples/text-classification/run_glue.py
min_nodes : 1
max_nodes : 1
nproc_per_node : 8
run_id : none
rdzv_backend : static
rdzv_endpoint : 127.0.0.1:29500
rdzv_configs : {'rank': 0, 'timeout': 900}
max_restarts : 3
monitor_interval : 5
log_dir : None
metrics_cfg : {}
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_x6q4uwtj/none_xdo7jqx4
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
/home/user/anaconda3/envs/MoEBERT/lib/python3.7/site-packages/torch/distributed/elastic/utils/store.py:53: FutureWarning: This is an experimental API and will be changed in future.
"This is an experimental API and will be changed in future.", FutureWarning
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=0
master_addr=127.0.0.1
master_port=29500
group_rank=0
group_world_size=1
local_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
role_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
global_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
role_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]
global_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_x6q4uwtj/none_xdo7jqx4/attempt_0/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_x6q4uwtj/none_xdo7jqx4/attempt_0/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_x6q4uwtj/none_xdo7jqx4/attempt_0/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic_x6q4uwtj/none_xdo7jqx4/attempt_0/3/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker4 reply file to: /tmp/torchelastic_x6q4uwtj/none_xdo7jqx4/attempt_0/4/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker5 reply file to: /tmp/torchelastic_x6q4uwtj/none_xdo7jqx4/attempt_0/5/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker6 reply file to: /tmp/torchelastic_x6q4uwtj/none_xdo7jqx4/attempt_0/6/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker7 reply file to: /tmp/torchelastic_x6q4uwtj/none_xdo7jqx4/attempt_0/7/error.json
08/17/2022 10:52:17 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: True
08/17/2022 10:52:17 - INFO - __main__ - Training/evaluation parameters TrainingArguments(output_dir=mnli/model, overwrite_output_dir=True, do_train=True, do_eval=True, do_predict=False, evaluation_strategy=IntervalStrategy.STEPS, prediction_loss_only=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=5.0, max_steps=-1, lr_scheduler_type=SchedulerType.LINEAR, warmup_ratio=0.0, warmup_steps=0, logging_dir=mnli/log, logging_strategy=IntervalStrategy.STEPS, logging_first_step=False, logging_steps=20, save_strategy=IntervalStrategy.NO, save_steps=500, save_total_limit=None, no_cuda=False, seed=0, fp16=True, fp16_opt_level=O1, fp16_backend=auto, fp16_full_eval=False, local_rank=0, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=500, dataloader_num_workers=0, past_index=-1, run_name=mnli/model, disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, sharded_ddp=[], deepspeed=None, label_smoothing_factor=0.0, adafactor=False, group_by_length=False, report_to=['tensorboard'], ddp_find_unused_parameters=None, dataloader_pin_memory=True, skip_memory_metrics=False, _n_gpu=1, cls_dropout=None, use_deterministic_algorithms=False)
Traceback (most recent call last):
  File "examples/text-classification/run_glue.py", line 729, in <module>
    main()
  File "examples/text-classification/run_glue.py", line 281, in main
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
  File "/home/user/MoEBERT/src/transformers/hf_argparser.py", line 187, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 67, in __init__
  File "/home/user/MoEBERT/src/transformers/training_args.py", line 552, in __post_init__
    if is_torch_available() and self.device.type != "cuda" and (self.fp16 or self.fp16_full_eval):
  File "/home/user/MoEBERT/src/transformers/file_utils.py", line 1430, in wrapper
    return func(*args, **kwargs)
  File "/home/user/MoEBERT/src/transformers/training_args.py", line 695, in device
    return self._setup_devices
  File "/home/user/MoEBERT/src/transformers/file_utils.py", line 1420, in __get__
    cached = self.fget(obj)
  File "/home/user/MoEBERT/src/transformers/file_utils.py", line 1430, in wrapper
    return func(*args, **kwargs)
  File "/home/user/MoEBERT/src/transformers/training_args.py", line 685, in _setup_devices
    torch.cuda.set_device(device)
  File "/home/user/anaconda3/envs/MoEBERT/lib/python3.7/site-packages/torch/cuda/__init__.py", line 264, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
(The same traceback is printed, interleaved, by each of the other failing worker processes.)
Downloading: 28.8kB [00:00, 16.0MB/s]
Downloading: 28.7kB [00:00, 16.7MB/s]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 4113193) of binary: /home/user/anaconda3/envs/MoEBERT/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 3/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=1
master_addr=127.0.0.1
master_port=29500
group_rank=0
group_world_size=1
local_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
role_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
global_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
role_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]
global_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_x6q4uwtj/none_xdo7jqx4/attempt_1/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_x6q4uwtj/none_xdo7jqx4/attempt_1/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_x6q4uwtj/none_xdo7jqx4/attempt_1/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic_x6q4uwtj/none_xdo7jqx4/attempt_1/3/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker4 reply file to: /tmp/torchelastic_x6q4uwtj/none_xdo7jqx4/attempt_1/4/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker5 reply file to: /tmp/torchelastic_x6q4uwtj/none_xdo7jqx4/attempt_1/5/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker6 reply file to: /tmp/torchelastic_x6q4uwtj/none_xdo7jqx4/attempt_1/6/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker7 reply file to: /tmp/torchelastic_x6q4uwtj/none_xdo7jqx4/attempt_1/7/error.json
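In case it helps narrow this down: the launcher starts nproc_per_node=8 workers and each worker calls torch.cuda.set_device with its local rank, so I suspect the "invalid device ordinal" may simply mean fewer than 8 GPUs are visible on my machine. Below is a minimal diagnostic sketch I ran to check how many devices PyTorch can see (this snippet is mine, not from the repo):

import torch

# Diagnostic sketch: the launch script starts 8 workers (nproc_per_node=8),
# and each worker calls torch.cuda.set_device(local_rank). If fewer than 8
# GPUs are visible, every rank >= torch.cuda.device_count() fails with
# "CUDA error: invalid device ordinal".
print("CUDA available:", torch.cuda.is_available())
print("Visible GPU count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"cuda:{i} -> {torch.cuda.get_device_name(i)}")

If the count is below 8, I assume lowering --nproc_per_node in bert_base_mnli_example.sh (or restricting CUDA_VISIBLE_DEVICES) to match it would avoid this particular error, but I'm not sure whether that is the intended setup for reproducing your results.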