Dear developers,
I am trying to run the gpt2_3d example, but it fails with the assertion below. It looks as if the model is not receiving a batch size the 3D layers can split: by the time the input reaches the 3D embedding layer, its batch dimension has size 1 and cannot be divided evenly across a world size of 2. I hope to get some advice.
Thanks.
Error
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/nn/layer/parallel_3d/_operation.py", line 281, in split_tensor_3d
assert dim_size % world_size == 0, \
AssertionError: The dimension 0 to split, size (1) is not a multiple of world size (2), cannot split tensor evenly.
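For context, below is a minimal sketch of the check that appears to fail, reconstructed only from the assertion message above; it is not the actual Colossal-AI implementation. The cube depth of 2 (since tensor parallel size 8 = 2^3) and the batch dimension already being 1 when the split happens are my assumptions.

```python
# Minimal sketch of the failing check (my reconstruction, not the actual
# colossalai.nn.layer.parallel_3d._operation code).
# Assumption: tensor parallel size 8 gives a 3D cube of depth 2, so each
# split divides the batch dimension by 2.
import torch

def split_tensor_3d_sketch(tensor: torch.Tensor, dim: int, world_size: int) -> torch.Tensor:
    dim_size = tensor.size(dim)
    assert dim_size % world_size == 0, (
        f"The dimension {dim} to split, size ({dim_size}) is not a multiple of "
        f"world size ({world_size}), cannot split tensor evenly."
    )
    # Keep only this rank's chunk along the split dimension.
    return tensor.chunk(world_size, dim=dim)[0]

depth = 2                                            # hypothetical cube depth for tensor parallel size 8
input_ids = torch.zeros(1, 1024, dtype=torch.long)   # batch dim is already 1 when the assert fires
split_tensor_3d_sketch(input_ids, 0, depth)          # raises the AssertionError shown above
```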
Command
torchrun --standalone --nproc_per_node=8 train_gpt.py --config=gpt2_configs/gpt2_3d.py --from_torch
Environment
- colossalai 0.1.2
- nvcc 11.3.109
- python 3.8.13
- pytorch 1.11.0
- GPUs: 40G A100 * 8
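For reference, here is my reconstruction of the relevant config values, taken from the "Your Config" block echoed later in the error details. This is not the literal contents of gpt2_configs/gpt2_3d.py, just the parsed values I believe matter for the split (the gpt2_small and loss entries are omitted).

```python
# Reconstructed from the "========== Your Config ========" block in the log
# below; the actual gpt2_3d.py may define these slightly differently.
from colossalai.amp import AMP_TYPE

BATCH_SIZE = 4
NUM_EPOCHS = 60
SEQ_LEN = 1024
TENSOR_PARALLEL = 8

fp16 = dict(mode=AMP_TYPE.NAIVE)
optimizer = dict(lr=0.00015, weight_decay=0.01)
model = dict(checkpoint=True)
parallel = dict(pipeline=1, tensor=dict(mode='3d', size=TENSOR_PARALLEL))
```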
Error details
$ torchrun --standalone --nproc_per_node=8 ./train_gpt.py --config=./gpt2_configs/gpt2_3d.py --from_torch
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/apex-0.1-py3.8-linux-x86_64.egg/apex/pyprof/__init__.py:5: FutureWarning: pyprof will be removed by the end of June, 2022
warnings.warn("pyprof will be removed by the end of June, 2022", FutureWarning)
/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/apex-0.1-py3.8-linux-x86_64.egg/apex/pyprof/__init__.py:5: FutureWarning: pyprof will be removed by the end of June, 2022
warnings.warn("pyprof will be removed by the end of June, 2022", FutureWarning)
/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/apex-0.1-py3.8-linux-x86_64.egg/apex/pyprof/__init__.py:5: FutureWarning: pyprof will be removed by the end of June, 2022
warnings.warn("pyprof will be removed by the end of June, 2022", FutureWarning)
/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/apex-0.1-py3.8-linux-x86_64.egg/apex/pyprof/__init__.py:5: FutureWarning: pyprof will be removed by the end of June, 2022
warnings.warn("pyprof will be removed by the end of June, 2022", FutureWarning)
/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/apex-0.1-py3.8-linux-x86_64.egg/apex/pyprof/__init__.py:5: FutureWarning: pyprof will be removed by the end of June, 2022
warnings.warn("pyprof will be removed by the end of June, 2022", FutureWarning)
/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/apex-0.1-py3.8-linux-x86_64.egg/apex/pyprof/__init__.py:5: FutureWarning: pyprof will be removed by the end of June, 2022
warnings.warn("pyprof will be removed by the end of June, 2022", FutureWarning)
/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/apex-0.1-py3.8-linux-x86_64.egg/apex/pyprof/__init__.py:5: FutureWarning: pyprof will be removed by the end of June, 2022
warnings.warn("pyprof will be removed by the end of June, 2022", FutureWarning)
/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/apex-0.1-py3.8-linux-x86_64.egg/apex/pyprof/__init__.py:5: FutureWarning: pyprof will be removed by the end of June, 2022
warnings.warn("pyprof will be removed by the end of June, 2022", FutureWarning)
[05/01/22 10:53:55] INFO colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/context/parallel_context.py:509 set_device
INFO colossalai - colossalai - INFO: process rank 2 is bound to device 2
INFO colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/context/parallel_context.py:545 set_seed
INFO colossalai - colossalai - INFO: initialized seed on rank 2, numpy: 1024, python random: 1024,
ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1026,the default parallel seed is
ParallelMode.DATA.
[05/01/22 10:53:55] INFO colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/context/parallel_context.py:509 set_device
[05/01/22 10:53:55] INFO colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/context/parallel_context.py:509 set_device
[05/01/22 10:53:55] INFO colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/context/parallel_context.py:509 set_device
[05/01/22 10:53:55] INFO colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/context/parallel_context.py:509 set_device
[05/01/22 10:53:55] INFO colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/context/parallel_context.py:509 set_device
[05/01/22 10:53:55] INFO colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/context/parallel_context.py:509 set_device
INFO colossalai - colossalai - INFO: process rank 3 is bound to device 3
INFO colossalai - colossalai - INFO: process rank 7 is bound to device 7
INFO colossalai - colossalai - INFO: process rank 1 is bound to device 1
[05/01/22 10:53:55] INFO colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/context/parallel_context.py:509 set_device
INFO colossalai - colossalai - INFO: process rank 4 is bound to device 4
INFO colossalai - colossalai - INFO: process rank 5 is bound to device 5
INFO colossalai - colossalai - INFO: process rank 0 is bound to device 0
INFO colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/context/parallel_context.py:545 set_seed
INFO colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/context/parallel_context.py:545 set_seed
INFO colossalai - colossalai - INFO: process rank 6 is bound to device 6
INFO colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/context/parallel_context.py:545 set_seed
INFO colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/context/parallel_context.py:545 set_seed
INFO colossalai - colossalai - INFO: initialized seed on rank 3, numpy: 1024, python random: 1024,
ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1027,the default parallel seed is
ParallelMode.DATA.
INFO colossalai - colossalai - INFO: initialized seed on rank 7, numpy: 1024, python random: 1024,
ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1031,the default parallel seed is
ParallelMode.DATA.
INFO colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/context/parallel_context.py:545 set_seed
INFO colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/context/parallel_context.py:545 set_seed
INFO colossalai - colossalai - INFO: initialized seed on rank 1, numpy: 1024, python random: 1024,
ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1025,the default parallel seed is
ParallelMode.DATA.
INFO colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/context/parallel_context.py:545 set_seed
INFO colossalai - colossalai - INFO: initialized seed on rank 4, numpy: 1024, python random: 1024,
ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1028,the default parallel seed is
ParallelMode.DATA.
INFO colossalai - colossalai - INFO: initialized seed on rank 0, numpy: 1024, python random: 1024,
ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024,the default parallel seed is
ParallelMode.DATA.
INFO colossalai - colossalai - INFO: initialized seed on rank 5, numpy: 1024, python random: 1024,
ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1029,the default parallel seed is
ParallelMode.DATA.
INFO colossalai - colossalai - INFO: initialized seed on rank 6, numpy: 1024, python random: 1024,
ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1030,the default parallel seed is
ParallelMode.DATA.
INFO colossalai - colossalai - INFO:
/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/initialize.py:109 launch
INFO colossalai - colossalai - INFO: Distributed environment is initialized, data parallel size: 1, pipeline parallel size: 1, tensor parallel size: 8
INFO colossalai - colossalai - INFO: ./train_gpt_0.1.2.py:45 main
INFO colossalai - colossalai - INFO: Build data loader
INFO colossalai - colossalai - INFO: ./train_gpt_0.1.2.py:54 main
INFO colossalai - colossalai - INFO: Build model
[05/01/22 10:54:01] INFO colossalai - colossalai - INFO: ./train_gpt_0.1.2.py:84 main
INFO colossalai - colossalai - INFO: Build optimizer
[05/01/22 10:54:01] WARNING colossalai - colossalai - WARNING:
/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/initialize.py:281 initialize
INFO colossalai - colossalai - INFO:
/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/initialize.py:240 initialize
[05/01/22 10:54:01] WARNING colossalai - colossalai - WARNING:
/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/initialize.py:281 initialize
WARNING colossalai - colossalai - WARNING: Initializing an non ZeRO model with optimizer class
WARNING colossalai - colossalai - WARNING: Initializing an non ZeRO model with optimizer class
[05/01/22 10:54:01] WARNING colossalai - colossalai - WARNING:
/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/initialize.py:281 initialize
[05/01/22 10:54:01] WARNING colossalai - colossalai - WARNING:
/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/initialize.py:281 initialize
WARNING colossalai - colossalai - WARNING: Initializing an non ZeRO model with optimizer class
WARNING colossalai - colossalai - WARNING: Initializing an non ZeRO model with optimizer class
[05/01/22 10:54:01] WARNING colossalai - colossalai - WARNING:
/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/initialize.py:281 initialize
INFO colossalai - colossalai - INFO:
========== Your Config ========
{'BATCH_SIZE': 4,
'NUM_EPOCHS': 60,
'SEQ_LEN': 1024,
'TENSOR_PARALLEL': 8,
'fp16': {'mode': <AMP_TYPE.NAIVE: 'naive'>},
'gpt2_small': <function gpt2_small at 0x7f32a53354c0>,
'loss': {'type': <class 'model_zoo.gpt.gpt.GPTLMLoss'>},
'model': {'checkpoint': True},
'optimizer': {'lr': 0.00015, 'weight_decay': 0.01},
'parallel': {'pipeline': 1, 'tensor': {'mode': '3d', 'size': 8}}}
================================
INFO colossalai - colossalai - INFO:
/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/initialize.py:252 initialize
WARNING colossalai - colossalai - WARNING: Initializing an non ZeRO model with optimizer class
[05/01/22 10:54:01] WARNING colossalai - colossalai - WARNING:
/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/initialize.py:281 initialize
INFO colossalai - colossalai - INFO: cuDNN benchmark = True, deterministic = False
WARNING colossalai - colossalai - WARNING: Initializing an non ZeRO model with optimizer class
WARNING colossalai - colossalai - WARNING:
/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/initialize.py:281 initialize
WARNING colossalai - colossalai - WARNING: Initializing an non ZeRO model with optimizer class
[05/01/22 10:54:02] WARNING colossalai - colossalai - WARNING:
/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/initialize.py:281 initialize
WARNING colossalai - colossalai - WARNING: Initializing an non ZeRO model with optimizer class
[05/01/22 10:54:02] WARNING colossalai - colossalai - WARNING:
/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/initialize.py:409 initialize
WARNING colossalai - colossalai - WARNING: No PyTorch DDP or gradient handler is set up, please make
sure you do not need to all-reduce the gradients after a training step.
INFO colossalai - colossalai - INFO: ./train_gpt_0.1.2.py:98 main
INFO colossalai - colossalai - INFO: Init done, global batch size = 4
INFO colossalai - colossalai - INFO:
/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/trainer/_trainer.py:315 fit
INFO colossalai - colossalai - INFO: Using LossHook for training, priority = 0
INFO colossalai - colossalai - INFO:
/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/trainer/_trainer.py:315 fit
INFO colossalai - colossalai - INFO: Using LRSchedulerHook for training, priority = 1
INFO colossalai - colossalai - INFO:
/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/trainer/_trainer.py:315 fit
INFO colossalai - colossalai - INFO: Using LogMetricByEpochHook for training, priority = 10
INFO colossalai - colossalai - INFO:
/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/trainer/_trainer.py:315 fit
INFO colossalai - colossalai - INFO: Using ThroughputHook for training, priority = 10
INFO colossalai - colossalai - INFO:
/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/trainer/_trainer.py:315 fit
INFO colossalai - colossalai - INFO: Using LogMetricByStepHook for training, priority = 10
INFO colossalai - colossalai - INFO:
/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/trainer/_trainer.py:315 fit
INFO colossalai - colossalai - INFO: Using LogMemoryByEpochHook for training, priority = 10
INFO colossalai - colossalai - INFO:
/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/trainer/_trainer.py:319 fit
INFO colossalai - colossalai - INFO: Lower value means higher priority for calling hook function
INFO colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/utils/memory_utils/memory_monitor.py:63 report_memory_usage
INFO colossalai - colossalai - INFO: Before-train: GPU: allocated 91.75 MB, max allocated 92.3 MB,
cached: 96.0 MB, max cached: 96.0 MB
[Epoch 0 / Train]: 0%| | 0/5 [00:00<?, ?it/s]Traceback (most recent call last):
Traceback (most recent call last):
File "./train_gpt_0.1.2.py", line 132, in <module>
Traceback (most recent call last):
File "./train_gpt_0.1.2.py", line 132, in <module>
File "./train_gpt_0.1.2.py", line 132, in <module>
main()Traceback (most recent call last):
File "./train_gpt_0.1.2.py", line 132, in <module>
File "./train_gpt_0.1.2.py", line 120, in main
Traceback (most recent call last):
File "./train_gpt_0.1.2.py", line 132, in <module>
main()
File "./train_gpt_0.1.2.py", line 120, in main
main()
main()
File "./train_gpt_0.1.2.py", line 120, in main
File "./train_gpt_0.1.2.py", line 120, in main
trainer.fit(
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/trainer/_trainer.py", line 334, in fit
trainer.fit(
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/trainer/_trainer.py", line 334, in fit
trainer.fit(
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/trainer/_trainer.py", line 334, in fit
trainer.fit(
self._train_epoch(
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/trainer/_trainer.py", line 185, in _train_epoch
self._train_epoch(
logits, label, loss = self.engine.execute_schedule(
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/trainer/_trainer.py", line 185, in _train_epoch
main() File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/_base_engine.py", line 198, in execute_schedule
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/trainer/_trainer.py", line 334, in fit
logits, label, loss = self.engine.execute_schedule(
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/_base_engine.py", line 198, in execute_schedule
File "./train_gpt_0.1.2.py", line 120, in main
self._train_epoch(
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/trainer/_trainer.py", line 185, in _train_epoch
trainer.fit(
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/trainer/_trainer.py", line 334, in fit
Traceback (most recent call last):
self._train_epoch(
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/trainer/_trainer.py", line 185, in _train_epoch
File "./train_gpt_0.1.2.py", line 132, in <module>
output, label, loss = self._schedule.forward_backward_step(self, data_iter, **kwargs) output, label, loss = self._schedule.forward_backward_step(self, data_iter, **kwargs)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/schedule/_non_pipeline_schedule.py", line 49, in forward_backward_step
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/schedule/_non_pipeline_schedule.py", line 49, in forward_backward_step
logits, label, loss = self.engine.execute_schedule(
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/_base_engine.py", line 198, in execute_schedule
self._train_epoch(
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/trainer/_trainer.py", line 185, in _train_epoch
logits, label, loss = self.engine.execute_schedule(
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/_base_engine.py", line 198, in execute_schedule
output = self._call_engine(engine, data)output = self._call_engine(engine, data)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/schedule/_base_schedule.py", line 105, in _call_engine
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/schedule/_base_schedule.py", line 105, in _call_engine
main()
output, label, loss = self._schedule.forward_backward_step(self, data_iter, **kwargs)
File "./train_gpt_0.1.2.py", line 120, in main
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/schedule/_non_pipeline_schedule.py", line 49, in forward_backward_step
logits, label, loss = self.engine.execute_schedule(
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/_base_engine.py", line 198, in execute_schedule
output, label, loss = self._schedule.forward_backward_step(self, data_iter, **kwargs)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/schedule/_non_pipeline_schedule.py", line 49, in forward_backward_step
return engine(**inputs)
return engine(**inputs)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/_base_engine.py", line 183, in __call__
output = self._call_engine(engine, data)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/schedule/_base_schedule.py", line 105, in _call_engine
trainer.fit(
output = self._call_engine(engine, data)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/schedule/_base_schedule.py", line 105, in _call_engine
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/trainer/_trainer.py", line 334, in fit
return self.model(*args, **kwargs)
return engine(**inputs) File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return engine(**inputs)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/_base_engine.py", line 183, in __call__
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/_base_engine.py", line 183, in __call__
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/_base_engine.py", line 183, in __call__
output, label, loss = self._schedule.forward_backward_step(self, data_iter, **kwargs)
return self.model(*args, **kwargs)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/schedule/_non_pipeline_schedule.py", line 49, in forward_backward_step
self._train_epoch(
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/trainer/_trainer.py", line 185, in _train_epoch
return self.model(*args, **kwargs)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
output = self._call_engine(engine, data)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/schedule/_base_schedule.py", line 105, in _call_engine
return self.model(*args, **kwargs)logits, label, loss = self.engine.execute_schedule(
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/_base_engine.py", line 198, in execute_schedule
return engine(**inputs)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/_base_engine.py", line 183, in __call__
output, label, loss = self._schedule.forward_backward_step(self, data_iter, **kwargs)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/schedule/_non_pipeline_schedule.py", line 49, in forward_backward_step
return self.model(*args, **kwargs)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
output = self._call_engine(engine, data)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/schedule/_base_schedule.py", line 105, in _call_engine
return engine(**inputs)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/_base_engine.py", line 183, in __call__
return self.model(*args, **kwargs)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/amp/naive_amp/naive_amp.py", line 145, in forward
return forward_call(*input, **kwargs)return forward_call(*input, **kwargs)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/amp/naive_amp/naive_amp.py", line 145, in forward
return forward_call(*input, **kwargs)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/amp/naive_amp/naive_amp.py", line 145, in forward
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/amp/naive_amp/naive_amp.py", line 145, in forward
out = self.model(*args, **kwargs)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)out = self.model(*args, **kwargs)out = self.model(*args, **kwargs)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/amp/naive_amp/naive_amp.py", line 145, in forward
return forward_call(*input, **kwargs)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/amp/naive_amp/naive_amp.py", line 145, in forward
out = self.model(*args, **kwargs)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
out = self.model(*args, **kwargs)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
out = self.model(*args, **kwargs)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/model_zoo/gpt/gpt.py", line 291, in forward
return forward_call(*input, **kwargs)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/model_zoo/gpt/gpt.py", line 291, in forward
return forward_call(*input, **kwargs)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/model_zoo/gpt/gpt.py", line 291, in forward
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/model_zoo/gpt/gpt.py", line 291, in forward
return forward_call(*input, **kwargs)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/model_zoo/gpt/gpt.py", line 291, in forward
x = self.embed(input_ids)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
x = self.embed(input_ids)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
x = self.embed(input_ids)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/model_zoo/gpt/gpt.py", line 291, in forward
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
x = self.embed(input_ids)
return forward_call(*input, **kwargs) File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/model_zoo/gpt/gpt.py", line 50, in forward
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/model_zoo/gpt/gpt.py", line 50, in forward
return forward_call(*input, **kwargs)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/model_zoo/gpt/gpt.py", line 50, in forward
x = self.embed(input_ids)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
x = self.word_embeddings(input_ids) + self.position_embeddings(position_ids)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1128, in _call_impl
x = self.word_embeddings(input_ids) + self.position_embeddings(position_ids)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1128, in _call_impl
x = self.word_embeddings(input_ids) + self.position_embeddings(position_ids)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1128, in _call_impl
[Epoch 0 / Train]: 0%| | 0/5 [00:00<?, ?it/s]Traceback (most recent call last):
File "./train_gpt_0.1.2.py", line 132, in <module>
result = forward_call(*input, **kwargs)result = forward_call(*input, **kwargs)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/nn/layer/colossalai_layer/_utils.py", line 38, in forward
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/nn/layer/colossalai_layer/_utils.py", line 38, in forward
result = forward_call(*input, **kwargs)
main() File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/nn/layer/colossalai_layer/_utils.py", line 38, in forward
File "./train_gpt_0.1.2.py", line 120, in main
trainer.fit(
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/trainer/_trainer.py", line 334, in fit
x = self.embed(input_ids)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/model_zoo/gpt/gpt.py", line 50, in forward
self._train_epoch(
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/trainer/_trainer.py", line 185, in _train_epoch
return forward_call(*input, **kwargs)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/model_zoo/gpt/gpt.py", line 50, in forward
logits, label, loss = self.engine.execute_schedule(
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/_base_engine.py", line 198, in execute_schedule
x = self.word_embeddings(input_ids) + self.position_embeddings(position_ids)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1128, in _call_impl
return self._forward_func(*args)return self._forward_func(*args) output, label, loss = self._schedule.forward_backward_step(self, data_iter, **kwargs)
x = self.word_embeddings(input_ids) + self.position_embeddings(position_ids) File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/nn/layer/parallel_3d/layers.py", line 976, in forward
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/nn/layer/parallel_3d/layers.py", line 976, in forward
return forward_call(*input, **kwargs)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/schedule/_non_pipeline_schedule.py", line 49, in forward_backward_step
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/model_zoo/gpt/gpt.py", line 50, in forward
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1128, in _call_impl
output = self._call_engine(engine, data)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/schedule/_base_schedule.py", line 105, in _call_engine
input_ = split_tensor_3d(input_, 0, self.weight_parallel_mode)input_ = split_tensor_3d(input_, 0, self.weight_parallel_mode)
x = self.word_embeddings(input_ids) + self.position_embeddings(position_ids)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/nn/layer/parallel_3d/_operation.py", line 281, in split_tensor_3d
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/nn/layer/parallel_3d/_operation.py", line 281, in split_tensor_3d
return self._forward_func(*args)
result = forward_call(*input, **kwargs) File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/nn/layer/parallel_3d/layers.py", line 976, in forward
return engine(**inputs)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/nn/layer/colossalai_layer/_utils.py", line 38, in forward
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/_base_engine.py", line 183, in __call__
return self.model(*args, **kwargs)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return self._forward_func(*args)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/nn/layer/parallel_3d/layers.py", line 976, in forward
input_ = split_tensor_3d(input_, 0, self.weight_parallel_mode)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/nn/layer/parallel_3d/_operation.py", line 281, in split_tensor_3d
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1128, in _call_impl
return forward_call(*input, **kwargs)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/amp/naive_amp/naive_amp.py", line 145, in forward
assert dim_size % world_size == 0, \assert dim_size % world_size == 0, \
result = forward_call(*input, **kwargs)AssertionErrorout = self.model(*args, **kwargs)
:
The dimension 0 to split, size (1) is not a multiple of world size (2), cannot split tensor evenly File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/nn/layer/colossalai_layer/_utils.py", line 38, in forward
AssertionError File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
: The dimension 0 to split, size (1) is not a multiple of world size (2), cannot split tensor evenly
return self._forward_func(*args)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/nn/layer/parallel_3d/layers.py", line 976, in forward
result = forward_call(*input, **kwargs)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/nn/layer/colossalai_layer/_utils.py", line 38, in forward
return self._forward_func(*args)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/nn/layer/parallel_3d/layers.py", line 976, in forward
input_ = split_tensor_3d(input_, 0, self.weight_parallel_mode)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/nn/layer/parallel_3d/_operation.py", line 281, in split_tensor_3d
return forward_call(*input, **kwargs)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/model_zoo/gpt/gpt.py", line 291, in forward
x = self.embed(input_ids)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
input_ = split_tensor_3d(input_, 0, self.weight_parallel_mode)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/nn/layer/parallel_3d/_operation.py", line 281, in split_tensor_3d
assert dim_size % world_size == 0, \
AssertionError: The dimension 0 to split, size (1) is not a multiple of world size (2), cannot split tensor evenly
assert dim_size % world_size == 0, \
AssertionError: The dimension 0 to split, size (1) is not a multiple of world size (2), cannot split tensor evenly
return forward_call(*input, **kwargs)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/model_zoo/gpt/gpt.py", line 50, in forward
input_ = split_tensor_3d(input_, 0, self.weight_parallel_mode)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/nn/layer/parallel_3d/_operation.py", line 281, in split_tensor_3d
x = self.word_embeddings(input_ids) + self.position_embeddings(position_ids)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1128, in _call_impl
assert dim_size % world_size == 0, \
AssertionError: The dimension 0 to split, size (1) is not a multiple of world size (2), cannot split tensor evenly
result = forward_call(*input, **kwargs)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/nn/layer/colossalai_layer/_utils.py", line 38, in forward
return self._forward_func(*args)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/nn/layer/parallel_3d/layers.py", line 976, in forward
input_ = split_tensor_3d(input_, 0, self.weight_parallel_mode)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/nn/layer/parallel_3d/_operation.py", line 281, in split_tensor_3d
assert dim_size % world_size == 0, \
AssertionError: The dimension 0 to split, size (1) is not a multiple of world size (2), cannot split tensor evenly
assert dim_size % world_size == 0, \
AssertionError: The dimension 0 to split, size (1) is not a multiple of world size (2), cannot split tensor evenly
Traceback (most recent call last):
File "./train_gpt_0.1.2.py", line 132, in <module>
main()
File "./train_gpt_0.1.2.py", line 120, in main
trainer.fit(
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/trainer/_trainer.py", line 334, in fit
self._train_epoch(
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/trainer/_trainer.py", line 185, in _train_epoch
logits, label, loss = self.engine.execute_schedule(
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/_base_engine.py", line 198, in execute_schedule
output, label, loss = self._schedule.forward_backward_step(self, data_iter, **kwargs)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/schedule/_non_pipeline_schedule.py", line 49, in forward_backward_step
output = self._call_engine(engine, data)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/schedule/_base_schedule.py", line 105, in _call_engine
return engine(**inputs)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/_base_engine.py", line 183, in __call__
return self.model(*args, **kwargs)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/amp/naive_amp/naive_amp.py", line 145, in forward
out = self.model(*args, **kwargs)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/model_zoo/gpt/gpt.py", line 291, in forward
x = self.embed(input_ids)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/model_zoo/gpt/gpt.py", line 50, in forward
x = self.word_embeddings(input_ids) + self.position_embeddings(position_ids)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1128, in _call_impl
result = forward_call(*input, **kwargs)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/nn/layer/colossalai_layer/_utils.py", line 38, in forward
return self._forward_func(*args)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/nn/layer/parallel_3d/layers.py", line 976, in forward
input_ = split_tensor_3d(input_, 0, self.weight_parallel_mode)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/nn/layer/parallel_3d/_operation.py", line 281, in split_tensor_3d
assert dim_size % world_size == 0, \
AssertionError: The dimension 0 to split, size (1) is not a multiple of world size (2), cannot split tensor evenly
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: driver shutting down
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from query at /opt/conda/conda-bld/pytorch_1646755903507/work/aten/src/ATen/cuda/CUDAEvent.h:95 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7f0bb282b1bd in /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x11a (0x7f0bf06ba6ea in /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x50 (0x7f0bf06bccd0 in /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #3: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x145 (0x7f0bf06bdf65 in /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #4: <unknown function> + 0xc9039 (0x7f0c48562039 in /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/lib/../../../../libstdc++.so.6)
frame #5: <unknown function> + 0x7ea5 (0x7f0c6ecd8ea5 in /lib64/libpthread.so.0)
frame #6: clone + 0x6d (0x7f0c6ea019fd in /lib64/libc.so.6)
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: driver shutting down
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from query at /opt/conda/conda-bld/pytorch_1646755903507/work/aten/src/ATen/cuda/CUDAEvent.h:95 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7fe8efe431bd in /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x11a (0x7fe92dcd26ea in /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x50 (0x7fe92dcd4cd0 in /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #3: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x145 (0x7fe92dcd5f65 in /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #4: <unknown function> + 0xc9039 (0x7fe985b3a039 in /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/lib/../../../../libstdc++.so.6)
frame #5: <unknown function> + 0x7ea5 (0x7fe9ac2f0ea5 in /lib64/libpthread.so.0)
frame #6: clone + 0x6d (0x7fe9ac0199fd in /lib64/libc.so.6)
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: driver shutting down
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from query at /opt/conda/conda-bld/pytorch_1646755903507/work/aten/src/ATen/cuda/CUDAEvent.h:95 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7fdfff31b1bd in /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x11a (0x7fe03d1aa6ea in /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x50 (0x7fe03d1accd0 in /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #3: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x145 (0x7fe03d1adf65 in /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #4: <unknown function> + 0xc9039 (0x7fe095012039 in /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/lib/../../../../libstdc++.so.6)
frame #5: <unknown function> + 0x7ea5 (0x7fe0bb7c8ea5 in /lib64/libpthread.so.0)
frame #6: clone + 0x6d (0x7fe0bb4f19fd in /lib64/libc.so.6)
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: driver shutting down
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from query at /opt/conda/conda-bld/pytorch_1646755903507/work/aten/src/ATen/cuda/CUDAEvent.h:95 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7f835f9611bd in /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x11a (0x7f839d7f06ea in /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x50 (0x7f839d7f2cd0 in /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #3: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x145 (0x7f839d7f3f65 in /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #4: <unknown function> + 0xc9039 (0x7f83f5658039 in /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/lib/../../../../libstdc++.so.6)
frame #5: <unknown function> + 0x7ea5 (0x7f841be0eea5 in /lib64/libpthread.so.0)
frame #6: clone + 0x6d (0x7f841bb379fd in /lib64/libc.so.6)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 184844) of binary: /home/asc/.conda/envs/nlp/bin/python
Traceback (most recent call last):
File "/home/asc/.conda/envs/nlp/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==1.11.0', 'console_scripts', 'torchrun')())
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
run(args)
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
elastic_launch(
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
./train_gpt_0.1.2.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2022-05-01_10:54:05
host : localhost.localdomain
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 184845)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 184845
[2]:
time : 2022-05-01_10:54:05
host : localhost.localdomain
rank : 2 (local_rank: 2)
exitcode : -6 (pid: 184846)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 184846
[3]:
time : 2022-05-01_10:54:05
host : localhost.localdomain
rank : 3 (local_rank: 3)
exitcode : -6 (pid: 184847)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 184847
[4]:
time : 2022-05-01_10:54:05
host : localhost.localdomain
rank : 4 (local_rank: 4)
exitcode : -6 (pid: 184848)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 184848
[5]:
time : 2022-05-01_10:54:05
host : localhost.localdomain
rank : 5 (local_rank: 5)
exitcode : 1 (pid: 184849)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
time : 2022-05-01_10:54:05
host : localhost.localdomain
rank : 6 (local_rank: 6)
exitcode : 1 (pid: 184850)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[7]:
time : 2022-05-01_10:54:05
host : localhost.localdomain
rank : 7 (local_rank: 7)
exitcode : 1 (pid: 184851)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2022-05-01_10:54:05
host : localhost.localdomain
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 184844)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================