ColossalAI-Examples - Examples of training models with hybrid parallelism using ColossalAI

Overview

ColossalAI-Examples

This repository contains examples of training models with ColossalAI. These examples fall under three categories:

  1. Computer Vision
  2. Natural Language Processing
  3. General examples to demonstrate ColossalAI's features

Discussion

Discussion about the Colossal-AI project and its examples is always welcome! We would love to exchange ideas with the community to help this project grow. If you think there is a need to discuss anything, you can jump to our discussion forum and create a topic there.

If you encounter any problems while running these examples, you may want to raise an issue in this repository.

Contributing

This project welcomes constructive ideas and implementations from the community. If you wish to add an example for a specific application, please commit your code to either the image or language folder. If you wish to add new examples that explain our features, you can commit your code to the features folder; we may then invite you to put up a tutorial or blog post in the ColossalAI Documentation.

Comments
  • [feature] New example: MAE pretraining on ImageNet 1000 dataset

    Colossal-AI implementation of MAE (arXiv).

    As an example, we only cover the pretraining phase, using the ImageNet 1000 (mini) dataset. Helpers under the util/ subdirectory are taken from facebookresearch/deit, under the Apache License 2.0.

    About the coding style

    The coding style might look a little different from other examples such as run_resnet_cifar10_with_engine.py: the configuration file config/pretrain.py handles rich initialization logic and default values.

    The DeiT and MAE code has a complicated and intertwined initialization process. By making full use of Colossal-AI's dynamic Python configuration, we can keep things simple enough for newcomers to understand.
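    As an illustration (a hedged sketch only, not the actual contents of config/pretrain.py), a Colossal-AI Python config can derive defaults from other values instead of hard-coding everything; all names and numbers below are hypothetical:

    import os
    from colossalai.amp import AMP_TYPE

    # Hypothetical base settings for a pretraining run.
    TOTAL_BATCH_SIZE = 4096
    NUM_GPUS = int(os.environ.get('WORLD_SIZE', '8'))

    # Because the config is ordinary Python, defaults can be computed
    # from one another rather than spelled out by hand.
    BATCH_SIZE = TOTAL_BATCH_SIZE // NUM_GPUS
    LEARNING_RATE = 1.5e-4 * TOTAL_BATCH_SIZE / 256  # linear LR scaling
    NUM_EPOCHS = 400
    WARMUP_EPOCHS = 40

    fp16 = dict(mode=AMP_TYPE.NAIVE)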

    opened by ofey404 9
  • ZeRO without using shard_param

    🐛 Describe the bug

    When I use ZeRO without shard_param, the following error occurs:

    Traceback (most recent call last):
      File "train.py", line 175, in <module>
        main()
      File "train.py", line 39, in main
        with ZeroInitContext(target_device=torch.cuda.current_device(), shard_strategy=shard_strategy, shard_param=False):
      File "/usr/local/Python-3.8.6/lib/python3.8/site-packages/colossalai/zero/init_ctx/init_context.py", line 75, in __init__
        self.config = ZeroContextConfig(target_device=target_device, replicated=True, shard_param=shard_param)
      File "/usr/local/Python-3.8.6/lib/python3.8/site-packages/colossalai/zero/init_ctx/init_context.py", line 37, in __init__
        assert target_device.type == 'cuda', "Replicated no-shard paramters should locate in cuda."
    AttributeError: 'int' object has no attribute 'type'
    
    

    My init code is:

    def main():
        parser = colossalai.get_default_parser()
        parser.add_argument('--use_trainer', action='store_true', help='whether to use trainer')
        args = parser.parse_args()
    
        colossalai.launch_from_torch(config='./config.py')
    
        logger = get_dist_logger()
    
        rank = int(os.environ['RANK'])
        # build resnet
        use_zero3 = hasattr(gpc.config, 'zero')
        if use_zero3:
            shard_strategy = TensorShardStrategy()
            with ZeroInitContext(target_device=torch.cuda.current_device(), shard_strategy=shard_strategy, shard_param=False):
                model = resnet34(num_classes=10)
        else:
            model = resnet34(num_classes=10)
    

    My config is:

    from colossalai.amp import AMP_TYPE
    from colossalai.zero.shard_utils import TensorShardStrategy
    from colossalai.nn.optimizer import HybridAdam
    
    zero = dict(
        model_config=dict(
            tensor_placement_policy='cuda',
            shard_strategy=TensorShardStrategy(),
            reuse_fp16_shard=False
        ),
        optimizer_config=dict()
    )
    
    optimizer = dict(
        type=HybridAdam,
        lr=0.001,
        # weight_decay=1e-2,
    )
    
    BATCH_SIZE = 64
    NUM_EPOCHS = 20
    LOGGING_FREQUNCE = 20
    OUTPUT = './'
    
    gradient_clipping = 5.0
    

    Environment

    pip install colossalai==0.1.5+torch1.10cu11.1 -f https://release.colossalai.org

    Ubuntu 18.04
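    Judging from the AttributeError above, target_device is expected to be a torch.device (the assertion reads target_device.type), whereas torch.cuda.current_device() returns a plain integer index. A minimal, hedged sketch of the adjusted call, assuming the colossalai 0.1.5 API shown in the traceback:

    import torch
    from torchvision.models import resnet34
    from colossalai.zero.init_ctx import ZeroInitContext
    from colossalai.zero.shard_utils import TensorShardStrategy

    # Pass an explicit torch.device instead of the integer index returned
    # by torch.cuda.current_device().
    target_device = torch.device('cuda', torch.cuda.current_device())
    with ZeroInitContext(target_device=target_device,
                         shard_strategy=TensorShardStrategy(),
                         shard_param=False):
        model = resnet34(num_classes=10)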

    opened by powermano 7
  • Failed to run gpt2_3d example

    Dear developers,

    I am trying to run the gpt2_3d example but it failed. It looks like the model didn't load the correct batch size. I hope to get some advice.

    Thanks.

    Error

    File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/nn/layer/parallel_3d/_operation.py", line 281, in split_tensor_3d
    
    assert dim_size % world_size == 0, \
    
    AssertionError: The dimension 0 to split, size (1) is not a multiple of world size (2), cannot split tensor evenly.
    

    Command

    torchrun --standalone --nproc_per_node=8 train_gpt.py --config=gpt2_configs/gpt2_3d.py --from_torch
    

    Environment

    • colossalai 0.1.2
    • nvcc 11.3.109
    • python 3.8.13
    • pytorch 1.11.0
    • GPUs: 40G A100 * 8

    Error details

    $ torchrun --standalone --nproc_per_node=8 ./train_gpt.py --config=./gpt2_configs/gpt2_3d.py  --from_torch
    WARNING:torch.distributed.run:
    *****************************************
    Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
    *****************************************
    /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/apex-0.1-py3.8-linux-x86_64.egg/apex/pyprof/__init__.py:5: FutureWarning: pyprof will be removed by the end of June, 2022
      warnings.warn("pyprof will be removed by the end of June, 2022", FutureWarning)
    /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/apex-0.1-py3.8-linux-x86_64.egg/apex/pyprof/__init__.py:5: FutureWarning: pyprof will be removed by the end of June, 2022
      warnings.warn("pyprof will be removed by the end of June, 2022", FutureWarning)
    /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/apex-0.1-py3.8-linux-x86_64.egg/apex/pyprof/__init__.py:5: FutureWarning: pyprof will be removed by the end of June, 2022
      warnings.warn("pyprof will be removed by the end of June, 2022", FutureWarning)
    /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/apex-0.1-py3.8-linux-x86_64.egg/apex/pyprof/__init__.py:5: FutureWarning: pyprof will be removed by the end of June, 2022
      warnings.warn("pyprof will be removed by the end of June, 2022", FutureWarning)
    /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/apex-0.1-py3.8-linux-x86_64.egg/apex/pyprof/__init__.py:5: FutureWarning: pyprof will be removed by the end of June, 2022
      warnings.warn("pyprof will be removed by the end of June, 2022", FutureWarning)
    /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/apex-0.1-py3.8-linux-x86_64.egg/apex/pyprof/__init__.py:5: FutureWarning: pyprof will be removed by the end of June, 2022
      warnings.warn("pyprof will be removed by the end of June, 2022", FutureWarning)
    /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/apex-0.1-py3.8-linux-x86_64.egg/apex/pyprof/__init__.py:5: FutureWarning: pyprof will be removed by the end of June, 2022
      warnings.warn("pyprof will be removed by the end of June, 2022", FutureWarning)
    /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/apex-0.1-py3.8-linux-x86_64.egg/apex/pyprof/__init__.py:5: FutureWarning: pyprof will be removed by the end of June, 2022
      warnings.warn("pyprof will be removed by the end of June, 2022", FutureWarning)
    [05/01/22 10:53:55] INFO     colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossal                             ai/context/parallel_context.py:509 set_device
                        INFO     colossalai - colossalai - INFO: process rank 2 is bound to device 2
                        INFO     colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossal                             ai/context/parallel_context.py:545 set_seed
                        INFO     colossalai - colossalai - INFO: initialized seed on rank 2, numpy: 1024, python random: 1024,
                                 ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1026,the default parallel seed is
                                 ParallelMode.DATA.
    [05/01/22 10:53:55] INFO     colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossal                             ai/context/parallel_context.py:509 set_device
    [05/01/22 10:53:55] INFO     colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossal                             ai/context/parallel_context.py:509 set_device
    [05/01/22 10:53:55] INFO     colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossal                             ai/context/parallel_context.py:509 set_device
    [05/01/22 10:53:55] INFO     colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossal                             ai/context/parallel_context.py:509 set_device
    [05/01/22 10:53:55] INFO     colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossal                             ai/context/parallel_context.py:509 set_device
    [05/01/22 10:53:55] INFO     colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossal                             ai/context/parallel_context.py:509 set_device
                        INFO     colossalai - colossalai - INFO: process rank 3 is bound to device 3
                        INFO     colossalai - colossalai - INFO: process rank 7 is bound to device 7
                        INFO     colossalai - colossalai - INFO: process rank 1 is bound to device 1
    [05/01/22 10:53:55] INFO     colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossal                             ai/context/parallel_context.py:509 set_device
                        INFO     colossalai - colossalai - INFO: process rank 4 is bound to device 4
                        INFO     colossalai - colossalai - INFO: process rank 5 is bound to device 5
                        INFO     colossalai - colossalai - INFO: process rank 0 is bound to device 0
                        INFO     colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossal                             ai/context/parallel_context.py:545 set_seed
                        INFO     colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossal                             ai/context/parallel_context.py:545 set_seed
                        INFO     colossalai - colossalai - INFO: process rank 6 is bound to device 6
                        INFO     colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossal                             ai/context/parallel_context.py:545 set_seed
                        INFO     colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossal                             ai/context/parallel_context.py:545 set_seed
                        INFO     colossalai - colossalai - INFO: initialized seed on rank 3, numpy: 1024, python random: 1024,
                                 ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1027,the default parallel seed is
                                 ParallelMode.DATA.
                        INFO     colossalai - colossalai - INFO: initialized seed on rank 7, numpy: 1024, python random: 1024,
                                 ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1031,the default parallel seed is
                                 ParallelMode.DATA.
                        INFO     colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossal                             ai/context/parallel_context.py:545 set_seed
                        INFO     colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossal                             ai/context/parallel_context.py:545 set_seed
                        INFO     colossalai - colossalai - INFO: initialized seed on rank 1, numpy: 1024, python random: 1024,
                                 ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1025,the default parallel seed is
                                 ParallelMode.DATA.
                        INFO     colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossal                             ai/context/parallel_context.py:545 set_seed
                        INFO     colossalai - colossalai - INFO: initialized seed on rank 4, numpy: 1024, python random: 1024,
                                 ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1028,the default parallel seed is
                                 ParallelMode.DATA.
                        INFO     colossalai - colossalai - INFO: initialized seed on rank 0, numpy: 1024, python random: 1024,
                                 ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024,the default parallel seed is
                                 ParallelMode.DATA.
                        INFO     colossalai - colossalai - INFO: initialized seed on rank 5, numpy: 1024, python random: 1024,
                                 ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1029,the default parallel seed is
                                 ParallelMode.DATA.
                        INFO     colossalai - colossalai - INFO: initialized seed on rank 6, numpy: 1024, python random: 1024,
                                 ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1030,the default parallel seed is
                                 ParallelMode.DATA.
                        INFO     colossalai - colossalai - INFO:
                                 /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/initialize.py:109 launch
                        INFO     colossalai - colossalai - INFO: Distributed environment is initialized, data parallel size: 1,                             pipeline parallel size: 1, tensor parallel size: 8
                        INFO     colossalai - colossalai - INFO: ./train_gpt_0.1.2.py:45 main
                        INFO     colossalai - colossalai - INFO: Build data loader
                        INFO     colossalai - colossalai - INFO: ./train_gpt_0.1.2.py:54 main
                        INFO     colossalai - colossalai - INFO: Build model
    [05/01/22 10:54:01] INFO     colossalai - colossalai - INFO: ./train_gpt_0.1.2.py:84 main
                        INFO     colossalai - colossalai - INFO: Build optimizer
    [05/01/22 10:54:01] WARNING  colossalai - colossalai - WARNING:
                                 /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/initialize.py:281 initialize
                        INFO     colossalai - colossalai - INFO:
                                 /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/initialize.py:240 initialize
    [05/01/22 10:54:01] WARNING  colossalai - colossalai - WARNING:
                                 /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/initialize.py:281 initialize
                        WARNING  colossalai - colossalai - WARNING: Initializing an non ZeRO model with optimizer class
                        WARNING  colossalai - colossalai - WARNING: Initializing an non ZeRO model with optimizer class
    [05/01/22 10:54:01] WARNING  colossalai - colossalai - WARNING:
                                 /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/initialize.py:281 initialize
    [05/01/22 10:54:01] WARNING  colossalai - colossalai - WARNING:
                                 /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/initialize.py:281 initialize
                        WARNING  colossalai - colossalai - WARNING: Initializing an non ZeRO model with optimizer class
                        WARNING  colossalai - colossalai - WARNING: Initializing an non ZeRO model with optimizer class
    [05/01/22 10:54:01] WARNING  colossalai - colossalai - WARNING:
                                 /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/initialize.py:281 initialize
                        INFO     colossalai - colossalai - INFO:
                                 ========== Your Config ========
                                 {'BATCH_SIZE': 4,
                                  'NUM_EPOCHS': 60,
                                  'SEQ_LEN': 1024,
                                  'TENSOR_PARALLEL': 8,
                                  'fp16': {'mode': <AMP_TYPE.NAIVE: 'naive'>},
                                  'gpt2_small': <function gpt2_small at 0x7f32a53354c0>,
                                  'loss': {'type': <class 'model_zoo.gpt.gpt.GPTLMLoss'>},
                                  'model': {'checkpoint': True},
                                  'optimizer': {'lr': 0.00015, 'weight_decay': 0.01},
                                  'parallel': {'pipeline': 1, 'tensor': {'mode': '3d', 'size': 8}}}
                                 ================================
    
                        INFO     colossalai - colossalai - INFO:
                                 /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/initialize.py:252 initialize
                        WARNING  colossalai - colossalai - WARNING: Initializing an non ZeRO model with optimizer class
    [05/01/22 10:54:01] WARNING  colossalai - colossalai - WARNING:
                                 /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/initialize.py:281 initialize
                        INFO     colossalai - colossalai - INFO: cuDNN benchmark = True, deterministic = False
                        WARNING  colossalai - colossalai - WARNING: Initializing an non ZeRO model with optimizer class
                        WARNING  colossalai - colossalai - WARNING:
                                 /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/initialize.py:281 initialize
                        WARNING  colossalai - colossalai - WARNING: Initializing an non ZeRO model with optimizer class
    [05/01/22 10:54:02] WARNING  colossalai - colossalai - WARNING:
                                 /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/initialize.py:281 initialize
                        WARNING  colossalai - colossalai - WARNING: Initializing an non ZeRO model with optimizer class
    [05/01/22 10:54:02] WARNING  colossalai - colossalai - WARNING:
                                 /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/initialize.py:409 initialize
                        WARNING  colossalai - colossalai - WARNING: No PyTorch DDP or gradient handler is set up, please make
                                 sure you do not need to all-reduce the gradients after a training step.
                        INFO     colossalai - colossalai - INFO: ./train_gpt_0.1.2.py:98 main
                        INFO     colossalai - colossalai - INFO: Init done, global batch size = 4
                        INFO     colossalai - colossalai - INFO:
                                 /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/trainer/_trainer.py:315 fit
                        INFO     colossalai - colossalai - INFO: Using LossHook for training, priority = 0
                        INFO     colossalai - colossalai - INFO:
                                 /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/trainer/_trainer.py:315 fit
                        INFO     colossalai - colossalai - INFO: Using LRSchedulerHook for training, priority = 1
                        INFO     colossalai - colossalai - INFO:
                                 /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/trainer/_trainer.py:315 fit
                        INFO     colossalai - colossalai - INFO: Using LogMetricByEpochHook for training, priority = 10
                        INFO     colossalai - colossalai - INFO:
                                 /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/trainer/_trainer.py:315 fit
                        INFO     colossalai - colossalai - INFO: Using ThroughputHook for training, priority = 10
                        INFO     colossalai - colossalai - INFO:
                                 /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/trainer/_trainer.py:315 fit
                        INFO     colossalai - colossalai - INFO: Using LogMetricByStepHook for training, priority = 10
                        INFO     colossalai - colossalai - INFO:
                                 /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/trainer/_trainer.py:315 fit
                        INFO     colossalai - colossalai - INFO: Using LogMemoryByEpochHook for training, priority = 10
                        INFO     colossalai - colossalai - INFO:
                                 /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/trainer/_trainer.py:319 fit
                        INFO     colossalai - colossalai - INFO: Lower value means higher priority for calling hook function
                        INFO     colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossal                             ai/utils/memory_utils/memory_monitor.py:63 report_memory_usage
                        INFO     colossalai - colossalai - INFO: Before-train: GPU: allocated 91.75 MB, max allocated 92.3 MB,
                                 cached: 96.0 MB, max cached: 96.0 MB
    [Epoch 0 / Train]:   0%|                                                                             | 0/5 [00:00<?, ?it/s]Traceback (most recent call last):
    Traceback (most recent call last):
      File "./train_gpt_0.1.2.py", line 132, in <module>
    Traceback (most recent call last):
      File "./train_gpt_0.1.2.py", line 132, in <module>
      File "./train_gpt_0.1.2.py", line 132, in <module>
        main()Traceback (most recent call last):
    
      File "./train_gpt_0.1.2.py", line 132, in <module>
      File "./train_gpt_0.1.2.py", line 120, in main
    Traceback (most recent call last):
          File "./train_gpt_0.1.2.py", line 132, in <module>
    main()
      File "./train_gpt_0.1.2.py", line 120, in main
        main()
        main()
      File "./train_gpt_0.1.2.py", line 120, in main
      File "./train_gpt_0.1.2.py", line 120, in main
        trainer.fit(
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/trainer/_trainer.py", line 334, in fit
        trainer.fit(
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/trainer/_trainer.py", line 334, in fit
        trainer.fit(
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/trainer/_trainer.py", line 334, in fit
    trainer.fit(
    self._train_epoch(
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/trainer/_trainer.py", line 185, in _train_epoch
        self._train_epoch(
        logits, label, loss = self.engine.execute_schedule(
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/trainer/_trainer.py", line 185, in _train_epoch
    main()  File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/_base_engine.py", line 198, in execute_schedule
          File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/trainer/_trainer.py", line 334, in fit
    
        logits, label, loss = self.engine.execute_schedule(
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/_base_engine.py", line 198, in execute_schedule
      File "./train_gpt_0.1.2.py", line 120, in main
        self._train_epoch(
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/trainer/_trainer.py", line 185, in _train_epoch
        trainer.fit(
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/trainer/_trainer.py", line 334, in fit
    Traceback (most recent call last):
    self._train_epoch(
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/trainer/_trainer.py", line 185, in _train_epoch
      File "./train_gpt_0.1.2.py", line 132, in <module>
        output, label, loss = self._schedule.forward_backward_step(self, data_iter, **kwargs)    output, label, loss = self._schedule.forward_backward_step(self, data_iter, **kwargs)
    
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/schedule/_non_pipeline_schedule.py", line 49, in forward_backward_step
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/schedule/_non_pipeline_schedule.py", line 49, in forward_backward_step
            logits, label, loss = self.engine.execute_schedule(
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/_base_engine.py", line 198, in execute_schedule
        self._train_epoch(
          File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/trainer/_trainer.py", line 185, in _train_epoch
    logits, label, loss = self.engine.execute_schedule(
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/_base_engine.py", line 198, in execute_schedule
            output = self._call_engine(engine, data)output = self._call_engine(engine, data)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/schedule/_base_schedule.py", line 105, in _call_engine
    
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/schedule/_base_schedule.py", line 105, in _call_engine
        main()
    output, label, loss = self._schedule.forward_backward_step(self, data_iter, **kwargs)
      File "./train_gpt_0.1.2.py", line 120, in main
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/schedule/_non_pipeline_schedule.py", line 49, in forward_backward_step
    logits, label, loss = self.engine.execute_schedule(
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/_base_engine.py", line 198, in execute_schedule
    output, label, loss = self._schedule.forward_backward_step(self, data_iter, **kwargs)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/schedule/_non_pipeline_schedule.py", line 49, in forward_backward_step
            return engine(**inputs)
    return engine(**inputs)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/_base_engine.py", line 183, in __call__
        output = self._call_engine(engine, data)
          File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/schedule/_base_schedule.py", line 105, in _call_engine
    trainer.fit(
    output = self._call_engine(engine, data)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/schedule/_base_schedule.py", line 105, in _call_engine
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/trainer/_trainer.py", line 334, in fit
        return self.model(*args, **kwargs)
            return engine(**inputs)  File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    
    return engine(**inputs)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/_base_engine.py", line 183, in __call__
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/_base_engine.py", line 183, in __call__
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/_base_engine.py", line 183, in __call__
        output, label, loss = self._schedule.forward_backward_step(self, data_iter, **kwargs)
        return self.model(*args, **kwargs)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/schedule/_non_pipeline_schedule.py", line 49, in forward_backward_step
        self._train_epoch(
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/trainer/_trainer.py", line 185, in _train_epoch
        return self.model(*args, **kwargs)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        output = self._call_engine(engine, data)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/schedule/_base_schedule.py", line 105, in _call_engine
            return self.model(*args, **kwargs)logits, label, loss = self.engine.execute_schedule(
    
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/_base_engine.py", line 198, in execute_schedule
        return engine(**inputs)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/_base_engine.py", line 183, in __call__
        output, label, loss = self._schedule.forward_backward_step(self, data_iter, **kwargs)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/schedule/_non_pipeline_schedule.py", line 49, in forward_backward_step
        return self.model(*args, **kwargs)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        output = self._call_engine(engine, data)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/schedule/_base_schedule.py", line 105, in _call_engine
        return engine(**inputs)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/_base_engine.py", line 183, in __call__
        return self.model(*args, **kwargs)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
            return forward_call(*input, **kwargs)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/amp/naive_amp/naive_amp.py", line 145, in forward
        return forward_call(*input, **kwargs)return forward_call(*input, **kwargs)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/amp/naive_amp/naive_amp.py", line 145, in forward
        return forward_call(*input, **kwargs)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/amp/naive_amp/naive_amp.py", line 145, in forward
    
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/amp/naive_amp/naive_amp.py", line 145, in forward
            out = self.model(*args, **kwargs)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
            return forward_call(*input, **kwargs)out = self.model(*args, **kwargs)out = self.model(*args, **kwargs)
    
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/amp/naive_amp/naive_amp.py", line 145, in forward
        return forward_call(*input, **kwargs)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/amp/naive_amp/naive_amp.py", line 145, in forward
        out = self.model(*args, **kwargs)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        out = self.model(*args, **kwargs)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        out = self.model(*args, **kwargs)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/model_zoo/gpt/gpt.py", line 291, in forward
        return forward_call(*input, **kwargs)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/model_zoo/gpt/gpt.py", line 291, in forward
        return forward_call(*input, **kwargs)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/model_zoo/gpt/gpt.py", line 291, in forward
    
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/model_zoo/gpt/gpt.py", line 291, in forward
        return forward_call(*input, **kwargs)
          File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/model_zoo/gpt/gpt.py", line 291, in forward
    x = self.embed(input_ids)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        x = self.embed(input_ids)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
        x = self.embed(input_ids)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/model_zoo/gpt/gpt.py", line 291, in forward
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        x = self.embed(input_ids)
            return forward_call(*input, **kwargs)  File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    
    return forward_call(*input, **kwargs)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/model_zoo/gpt/gpt.py", line 50, in forward
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/model_zoo/gpt/gpt.py", line 50, in forward
        return forward_call(*input, **kwargs)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/model_zoo/gpt/gpt.py", line 50, in forward
        x = self.embed(input_ids)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        x = self.word_embeddings(input_ids) + self.position_embeddings(position_ids)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1128, in _call_impl
        x = self.word_embeddings(input_ids) + self.position_embeddings(position_ids)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1128, in _call_impl
        x = self.word_embeddings(input_ids) + self.position_embeddings(position_ids)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1128, in _call_impl
    [Epoch 0 / Train]:   0%|                                                                             | 0/5 [00:00<?, ?it/s]Traceback (most recent call last):
      File "./train_gpt_0.1.2.py", line 132, in <module>
            result = forward_call(*input, **kwargs)result = forward_call(*input, **kwargs)
    
          File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/nn/layer/colossalai_layer/_utils.py", line 38, in forward
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/nn/layer/colossalai_layer/_utils.py", line 38, in forward
    result = forward_call(*input, **kwargs)
    main()  File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/nn/layer/colossalai_layer/_utils.py", line 38, in forward
    
      File "./train_gpt_0.1.2.py", line 120, in main
        trainer.fit(
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/trainer/_trainer.py", line 334, in fit
        x = self.embed(input_ids)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
          File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/model_zoo/gpt/gpt.py", line 50, in forward
    self._train_epoch(
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/trainer/_trainer.py", line 185, in _train_epoch
        return forward_call(*input, **kwargs)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/model_zoo/gpt/gpt.py", line 50, in forward
        logits, label, loss = self.engine.execute_schedule(
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/_base_engine.py", line 198, in execute_schedule
        x = self.word_embeddings(input_ids) + self.position_embeddings(position_ids)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1128, in _call_impl
                return self._forward_func(*args)return self._forward_func(*args)    output, label, loss = self._schedule.forward_backward_step(self, data_iter, **kwargs)
    
    x = self.word_embeddings(input_ids) + self.position_embeddings(position_ids)  File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/nn/layer/parallel_3d/layers.py", line 976, in forward
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/nn/layer/parallel_3d/layers.py", line 976, in forward
    
    
    return forward_call(*input, **kwargs)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/schedule/_non_pipeline_schedule.py", line 49, in forward_backward_step
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/model_zoo/gpt/gpt.py", line 50, in forward
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1128, in _call_impl
            output = self._call_engine(engine, data)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/schedule/_base_schedule.py", line 105, in _call_engine
        input_ = split_tensor_3d(input_, 0, self.weight_parallel_mode)input_ = split_tensor_3d(input_, 0, self.weight_parallel_mode)
    x = self.word_embeddings(input_ids) + self.position_embeddings(position_ids)
    
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/nn/layer/parallel_3d/_operation.py", line 281, in split_tensor_3d
          File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/nn/layer/parallel_3d/_operation.py", line 281, in split_tensor_3d
        return self._forward_func(*args)
    result = forward_call(*input, **kwargs)  File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/nn/layer/parallel_3d/layers.py", line 976, in forward
    
        return engine(**inputs)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/nn/layer/colossalai_layer/_utils.py", line 38, in forward
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/_base_engine.py", line 183, in __call__
        return self.model(*args, **kwargs)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return self._forward_func(*args)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/nn/layer/parallel_3d/layers.py", line 976, in forward
        input_ = split_tensor_3d(input_, 0, self.weight_parallel_mode)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/nn/layer/parallel_3d/_operation.py", line 281, in split_tensor_3d
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1128, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/amp/naive_amp/naive_amp.py", line 145, in forward
            assert dim_size % world_size == 0, \assert dim_size % world_size == 0, \
    
            result = forward_call(*input, **kwargs)AssertionErrorout = self.model(*args, **kwargs)
    :
    The dimension 0 to split, size (1) is not a multiple of world size (2), cannot split tensor evenly  File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/nn/layer/colossalai_layer/_utils.py", line 38, in forward
    AssertionError  File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    
    : The dimension 0 to split, size (1) is not a multiple of world size (2), cannot split tensor evenly
        return self._forward_func(*args)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/nn/layer/parallel_3d/layers.py", line 976, in forward
        result = forward_call(*input, **kwargs)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/nn/layer/colossalai_layer/_utils.py", line 38, in forward
        return self._forward_func(*args)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/nn/layer/parallel_3d/layers.py", line 976, in forward
        input_ = split_tensor_3d(input_, 0, self.weight_parallel_mode)
          File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/nn/layer/parallel_3d/_operation.py", line 281, in split_tensor_3d
    return forward_call(*input, **kwargs)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/model_zoo/gpt/gpt.py", line 291, in forward
        x = self.embed(input_ids)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        input_ = split_tensor_3d(input_, 0, self.weight_parallel_mode)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/nn/layer/parallel_3d/_operation.py", line 281, in split_tensor_3d
        assert dim_size % world_size == 0, \
    AssertionError: The dimension 0 to split, size (1) is not a multiple of world size (2), cannot split tensor evenly
        assert dim_size % world_size == 0, \
    AssertionError: The dimension 0 to split, size (1) is not a multiple of world size (2), cannot split tensor evenly
        return forward_call(*input, **kwargs)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/model_zoo/gpt/gpt.py", line 50, in forward
        input_ = split_tensor_3d(input_, 0, self.weight_parallel_mode)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/nn/layer/parallel_3d/_operation.py", line 281, in split_tensor_3d
        x = self.word_embeddings(input_ids) + self.position_embeddings(position_ids)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1128, in _call_impl
        assert dim_size % world_size == 0, \
    AssertionError: The dimension 0 to split, size (1) is not a multiple of world size (2), cannot split tensor evenly
        result = forward_call(*input, **kwargs)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/nn/layer/colossalai_layer/_utils.py", line 38, in forward
        return self._forward_func(*args)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/nn/layer/parallel_3d/layers.py", line 976, in forward
        input_ = split_tensor_3d(input_, 0, self.weight_parallel_mode)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/nn/layer/parallel_3d/_operation.py", line 281, in split_tensor_3d
        assert dim_size % world_size == 0, \
    AssertionError: The dimension 0 to split, size (1) is not a multiple of world size (2), cannot split tensor evenly
        assert dim_size % world_size == 0, \
    AssertionError: The dimension 0 to split, size (1) is not a multiple of world size (2), cannot split tensor evenly
    Traceback (most recent call last):
      File "./train_gpt_0.1.2.py", line 132, in <module>
        main()
      File "./train_gpt_0.1.2.py", line 120, in main
        trainer.fit(
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/trainer/_trainer.py", line 334, in fit
        self._train_epoch(
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/trainer/_trainer.py", line 185, in _train_epoch
        logits, label, loss = self.engine.execute_schedule(
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/_base_engine.py", line 198, in execute_schedule
        output, label, loss = self._schedule.forward_backward_step(self, data_iter, **kwargs)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/schedule/_non_pipeline_schedule.py", line 49, in forward_backward_step
        output = self._call_engine(engine, data)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/schedule/_base_schedule.py", line 105, in _call_engine
        return engine(**inputs)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/_base_engine.py", line 183, in __call__
        return self.model(*args, **kwargs)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/amp/naive_amp/naive_amp.py", line 145, in forward
        out = self.model(*args, **kwargs)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/model_zoo/gpt/gpt.py", line 291, in forward
        x = self.embed(input_ids)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/model_zoo/gpt/gpt.py", line 50, in forward
        x = self.word_embeddings(input_ids) + self.position_embeddings(position_ids)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1128, in _call_impl
        result = forward_call(*input, **kwargs)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/nn/layer/colossalai_layer/_utils.py", line 38, in forward
        return self._forward_func(*args)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/nn/layer/parallel_3d/layers.py", line 976, in forward
        input_ = split_tensor_3d(input_, 0, self.weight_parallel_mode)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/nn/layer/parallel_3d/_operation.py", line 281, in split_tensor_3d
        assert dim_size % world_size == 0, \
    AssertionError: The dimension 0 to split, size (1) is not a multiple of world size (2), cannot split tensor evenly
    terminate called after throwing an instance of 'c10::CUDAError'
      what():  CUDA error: driver shutting down
    CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
    For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
    Exception raised from query at /opt/conda/conda-bld/pytorch_1646755903507/work/aten/src/ATen/cuda/CUDAEvent.h:95 (most recent call first):
    frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7f0bb282b1bd in /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/lib/libc10.so)
    frame #1: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x11a (0x7f0bf06ba6ea in /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
    frame #2: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x50 (0x7f0bf06bccd0 in /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
    frame #3: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x145 (0x7f0bf06bdf65 in /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
    frame #4: <unknown function> + 0xc9039 (0x7f0c48562039 in /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/lib/../../../../libstdc++.so.6)
    frame #5: <unknown function> + 0x7ea5 (0x7f0c6ecd8ea5 in /lib64/libpthread.so.0)
    frame #6: clone + 0x6d (0x7f0c6ea019fd in /lib64/libc.so.6)
    
    terminate called after throwing an instance of 'c10::CUDAError'
      what():  CUDA error: driver shutting down
    CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
    For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
    Exception raised from query at /opt/conda/conda-bld/pytorch_1646755903507/work/aten/src/ATen/cuda/CUDAEvent.h:95 (most recent call first):
    frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7fe8efe431bd in /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/lib/libc10.so)
    frame #1: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x11a (0x7fe92dcd26ea in /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
    frame #2: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x50 (0x7fe92dcd4cd0 in /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
    frame #3: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x145 (0x7fe92dcd5f65 in /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
    frame #4: <unknown function> + 0xc9039 (0x7fe985b3a039 in /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/lib/../../../../libstdc++.so.6)
    frame #5: <unknown function> + 0x7ea5 (0x7fe9ac2f0ea5 in /lib64/libpthread.so.0)
    frame #6: clone + 0x6d (0x7fe9ac0199fd in /lib64/libc.so.6)
    
    terminate called after throwing an instance of 'c10::CUDAError'
      what():  CUDA error: driver shutting down
    CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
    For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
    Exception raised from query at /opt/conda/conda-bld/pytorch_1646755903507/work/aten/src/ATen/cuda/CUDAEvent.h:95 (most recent call first):
    frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7fdfff31b1bd in /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/lib/libc10.so)
    frame #1: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x11a (0x7fe03d1aa6ea in /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
    frame #2: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x50 (0x7fe03d1accd0 in /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
    frame #3: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x145 (0x7fe03d1adf65 in /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
    frame #4: <unknown function> + 0xc9039 (0x7fe095012039 in /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/lib/../../../../libstdc++.so.6)
    frame #5: <unknown function> + 0x7ea5 (0x7fe0bb7c8ea5 in /lib64/libpthread.so.0)
    frame #6: clone + 0x6d (0x7fe0bb4f19fd in /lib64/libc.so.6)
    
    terminate called after throwing an instance of 'c10::CUDAError'
      what():  CUDA error: driver shutting down
    CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
    For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
    Exception raised from query at /opt/conda/conda-bld/pytorch_1646755903507/work/aten/src/ATen/cuda/CUDAEvent.h:95 (most recent call first):
    frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7f835f9611bd in /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/lib/libc10.so)
    frame #1: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x11a (0x7f839d7f06ea in /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
    frame #2: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x50 (0x7f839d7f2cd0 in /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
    frame #3: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x145 (0x7f839d7f3f65 in /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
    frame #4: <unknown function> + 0xc9039 (0x7f83f5658039 in /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/lib/../../../../libstdc++.so.6)
    frame #5: <unknown function> + 0x7ea5 (0x7f841be0eea5 in /lib64/libpthread.so.0)
    frame #6: clone + 0x6d (0x7f841bb379fd in /lib64/libc.so.6)
    
    ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 184844) of binary: /home/asc/.conda/envs/nlp/bin/python
    Traceback (most recent call last):
      File "/home/asc/.conda/envs/nlp/bin/torchrun", line 33, in <module>
        sys.exit(load_entry_point('torch==1.11.0', 'console_scripts', 'torchrun')())
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
        return f(*args, **kwargs)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
        run(args)
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
        elastic_launch(
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
        return launch_agent(self._config, self._entrypoint, list(args))
      File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent    raise ChildFailedError(
    torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
    ============================================================
    ./train_gpt_0.1.2.py FAILED
    ------------------------------------------------------------
    Failures:
    [1]:
      time      : 2022-05-01_10:54:05
      host      : localhost.localdomain
      rank      : 1 (local_rank: 1)
      exitcode  : -6 (pid: 184845)
      error_file: <N/A>
      traceback : Signal 6 (SIGABRT) received by PID 184845
    [2]:
      time      : 2022-05-01_10:54:05
      host      : localhost.localdomain
      rank      : 2 (local_rank: 2)
      exitcode  : -6 (pid: 184846)
      error_file: <N/A>
      traceback : Signal 6 (SIGABRT) received by PID 184846
    [3]:
      time      : 2022-05-01_10:54:05
      host      : localhost.localdomain
      rank      : 3 (local_rank: 3)
      exitcode  : -6 (pid: 184847)
      error_file: <N/A>
      traceback : Signal 6 (SIGABRT) received by PID 184847
    [4]:
      time      : 2022-05-01_10:54:05
      host      : localhost.localdomain
      rank      : 4 (local_rank: 4)
      exitcode  : -6 (pid: 184848)
      error_file: <N/A>
      traceback : Signal 6 (SIGABRT) received by PID 184848
    [5]:
      time      : 2022-05-01_10:54:05
      host      : localhost.localdomain
      rank      : 5 (local_rank: 5)
      exitcode  : 1 (pid: 184849)
      error_file: <N/A>
      traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
    [6]:
      time      : 2022-05-01_10:54:05
      host      : localhost.localdomain
      rank      : 6 (local_rank: 6)
      exitcode  : 1 (pid: 184850)
      error_file: <N/A>
      traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
    [7]:
      time      : 2022-05-01_10:54:05
      host      : localhost.localdomain
      rank      : 7 (local_rank: 7)
      exitcode  : 1 (pid: 184851)
      error_file: <N/A>
      traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
    ------------------------------------------------------------
    Root Cause (first observed failure):
    [0]:
      time      : 2022-05-01_10:54:05
      host      : localhost.localdomain
      rank      : 0 (local_rank: 0)
      exitcode  : 1 (pid: 184844)
      error_file: <N/A>
      traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
    ============================================================
    
    opened by FJRFrancio 7
  • Error when running feature/zero/train_v2.py with gradient_checkpointing enabled on the model

    Error when running feature/zero/train_v2.py with gradient_checkpointing enabled on the model

    🐛 Describe the bug

    Traceback (most recent call last):
      File "/data1/users/jizhong1/ColossalAI-Examples/features/zero/train_v2.py", line 133, in <module>
        main()
      File "/dirname/ColossalAI-Examples/features/zero/train_v2.py", line 123, in main
        optimizer.backward(loss)
      File "/python_path/lib/python3.9/site-packages/colossalai/zero/zero_optimizer.py", line 154, in backward
        self.module.backward(loss)
      File "/python_path/lib/python3.9/site-packages/colossalai/nn/parallel/data_parallel.py", line 266, in backward
        loss.backward()
      File "/python_path/lib/python3.9/site-packages/torch/_tensor.py", line 388, in backward
        return handle_torch_function(
      File "/python_path/lib/python3.9/site-packages/torch/overrides.py", line 1498, in handle_torch_function
        result = torch_func_method(public_api, types, args, kwargs)
      File "/python_path/lib/python3.9/site-packages/colossalai/tensor/colo_tensor.py", line 171, in __torch_function__
        ret = func(*args, **kwargs)
      File "/python_path/lib/python3.9/site-packages/torch/_tensor.py", line 396, in backward
        torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
      File "/python_path/lib/python3.9/site-packages/torch/autograd/__init__.py", line 173, in backward
        Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
      File "/python_path/lib/python3.9/site-packages/torch/autograd/function.py", line 253, in apply
        return user_fn(self, *args)
      File "/python_path/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 130, in backward
        outputs = ctx.run_function(*detached_inputs)
      File "/python_path/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 887, in custom_forward
        return module(*inputs, use_cache, output_attentions)
      File "/python_path/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
        return forward_call(*input, **kwargs)
      File "/python_path/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 400, in forward
        hidden_states = self.ln_1(hidden_states)
      File "/python_path/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
        return forward_call(*input, **kwargs)
      File "/python_path/lib/python3.9/site-packages/torch/nn/modules/normalization.py", line 189, in forward
        return F.layer_norm(
      File "/python_path/lib/python3.9/site-packages/torch/nn/functional.py", line 2503, in layer_norm
        return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
    RuntimeError: The tensor has a non-zero number of elements, but its data is not allocated yet. Caffe2 uses a lazy allocation, so you will need to call mutable_data() or raw_mutable_data() to actually allocate memory.
    ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 40088) of binary: /python_path/bin/python
    Traceback (most recent call last):
      File "/python_path/bin/torchrun", line 33, in <module>
        sys.exit(load_entry_point('torch==1.12.1', 'console_scripts', 'torchrun')())
      File "/python_path/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
        return f(*args, **kwargs)
      File "/python_path/lib/python3.9/site-packages/torch/distributed/run.py", line 761, in main
        run(args)
      File "/python_path/lib/python3.9/site-packages/torch/distributed/run.py", line 752, in run
        elastic_launch(
      File "/python_path/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
        return launch_agent(self._config, self._entrypoint, list(args))
      File "/python_path/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
        raise ChildFailedError(
    torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

    Environment

    No response
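    For context, a minimal, hypothetical reproduction sketch of the setup this issue describes: a Hugging Face GPT-2 model with gradient (activation) checkpointing enabled before it is handed to ZeRO. Only standard transformers/torch calls are used below; the ColossalAI ZeRO wrapping step is omitted because its API differs between versions, so treat this as an illustration of the triggering configuration, not the example's actual code.

    import torch
    from transformers import GPT2Config, GPT2LMHeadModel

    # Small GPT-2 with gradient checkpointing turned on, as in the failing run.
    config = GPT2Config(n_layer=2, n_head=2, n_embd=128)
    model = GPT2LMHeadModel(config)
    model.gradient_checkpointing_enable()  # re-runs each block's forward during backward

    input_ids = torch.randint(0, config.vocab_size, (2, 32))
    loss = model(input_ids, labels=input_ids).loss
    loss.backward()  # works with plain PyTorch; the error above appears once the model is
                     # wrapped by ZeRO, which manages parameter storage that the checkpointed
                     # re-forward still expects to be materialized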

    opened by wjizhong 5
  • wikiextractor raise BdbQuit

    wikiextractor raise BdbQuit

    🐛 Describe the bug

    Hi all, when I run the extraction step from the BERT example in language/bert (the # extract step: wikiextractor --json enwiki-latest-pages-articles.xml.bz2), wikiextractor raises BdbQuit. This seems to be solved here by changing the version of wikiextractor to 3.0.4, but after that the example code still doesn't work, because 3.0.4 does not support --json.

    Environment

    No response

    opened by saleelirenyun 3
  • [enhancement] Examplify `all_reduce()` for tensor_parallel_*

    [enhancement] Examplify `all_reduce()` for tensor_parallel_*

    The 1D Tensor Parallelism tutorial mentions the use of all_reduce(), but the attached example doesn't show how it is done.

    Quote:

    on each processor, then use an all-reduce to aggregate the results as $Z = Y_1 B_1 + Y_2 B_2$

    So I made this enhancement, which prints the relevant tensors before and after calling `all_reduce()`.

    Output:

    Weight of the first linear layer: torch.Size([256, 512])
    Weight of the second linear layer: torch.Size([512, 256])
    Output of the first linear layer: torch.Size([16, 512])
    Output of the second linear layer: torch.Size([16, 256])
    Output of the dropout layer: torch.Size([16, 256])
    On rank 0, first 10 elements of x:
    tensor([-0.1215, -0.3460, -0.2717, -0.0932, -0.4238, -0.0999, -0.0000,  0.2923,
            -0.1130, -0.0000], device='cuda:0', grad_fn=<SliceBackward0>)
    
    On rank 1, first 10 elements of x:
    tensor([-0.1215, -0.3460, -0.2717, -0.0932, -0.4238, -0.0999, -0.0000,  0.2923,
            -0.1130, -0.0000], device='cuda:1', grad_fn=<SliceBackward0>)
    
    After `all_reduce()`, first 10 elements of x:
    tensor([-0.2431, -0.6920, -0.5434, -0.1864, -0.8475, -0.1998, -0.0000,  0.5845,
            -0.2259, -0.0000], device='cuda:0', grad_fn=<SliceBackward0>)
    
    Output of the all_reduce operation: torch.Size([16, 256])
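    As a complement, a minimal sketch (not the enhancement's actual code) of the row-parallel pattern the tutorial describes: each rank computes its partial product Y_i B_i, and torch.distributed.all_reduce sums the partials so every rank ends up with Z = Y_1 B_1 + Y_2 B_2. The shapes and the process-group setup below are illustrative assumptions.

    import torch
    import torch.distributed as dist

    def row_parallel_linear(y_local: torch.Tensor, b_local: torch.Tensor) -> torch.Tensor:
        # Each rank holds one column-slice of Y and the matching row-slice of B.
        z_partial = y_local @ b_local                      # Y_i B_i, shape (batch, out_features)
        dist.all_reduce(z_partial, op=dist.ReduceOp.SUM)   # in-place sum across all ranks
        return z_partial                                   # now Z = sum_i Y_i B_i on every rank

    if __name__ == "__main__":
        dist.init_process_group(backend="nccl")            # assumes launch via torchrun
        torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
        y = torch.randn(16, 256, device="cuda")            # local column-slice of Y
        b = torch.randn(256, 256, device="cuda")           # local row-slice of B
        z = row_parallel_linear(y, b)
        print(f"rank {dist.get_rank()}: first elements of Z: {z.flatten()[:5]}")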
    
    opened by ofey404 3
  • knowledge graph embedding examples - Bin Shang

    knowledge graph embedding examples - Bin Shang

    My name is Bin Shang. As requested by Professor Yong You, I have added three knowledge graph embedding examples: DistMult, ComplEx, and RotatE. Please check the code and feel free to contact me if you have any questions.

    opened by MiracleDesigner 3
  • failed to run gpt example

    failed to run gpt example

    🐛 Describe the bug

    cd ColossalAI/examples/language/gpt
    torchrun --standalone --nproc_per_node=1 train_gpt.py --config=gpt2_configs/gpt2_zero3.py --from_torch
    

    bash: /opt/lcsoftware/spack/opt/spack/linux-ubuntu20.04-zen2/gcc-9.3.0/miniconda3-4.10.3-u6p3tgreee7aigtnvuhr44yqo7vcg6r6/lib/libtinfo.so.6: no version information available (required by bash)
    Colossalai should be built with cuda extension to use the FP16 optimizer
    /home/lcfjr/.local/lib/python3.9/site-packages/torch/cuda/__init__.py:143: UserWarning:
        NVIDIA A100-PCIE-80GB with CUDA capability sm_80 is not compatible with the current PyTorch installation.
        The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
        If you want to use the NVIDIA A100-PCIE-80GB GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

    warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
    colossalai - colossalai - 2022-02-24 15:04:02,751 INFO: process rank 0 is bound to device 0
    colossalai - colossalai - 2022-02-24 15:04:02,772 INFO: initialized seed on rank 0, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024, the default parallel seed is ParallelMode.DATA.
    colossalai - colossalai - 2022-02-24 15:04:02,772 INFO: Distributed environment is initialized, data parallel size: 1, pipeline parallel size: 1, tensor parallel size: 1
    colossalai - colossalai - 2022-02-24 15:04:02,772 INFO: Build data loader
    colossalai - colossalai - 2022-02-24 15:04:02,864 INFO: Build model
    Traceback (most recent call last):
      File "/home/lcfjr/codes/ColossalAI/examples/language/gpt/train_gpt.py", line 118, in <module>
        main()
      File "/home/lcfjr/codes/ColossalAI/examples/language/gpt/train_gpt.py", line 49, in main
        model = gpc.config.model.pop('type')(**gpc.config.model)
      File "/home/lcfjr/.local/lib/python3.9/site-packages/model_zoo/gpt/gpt.py", line 402, in gpt2_small
        return create_gpt_model(**model_kwargs)
      File "/home/lcfjr/.local/lib/python3.9/site-packages/model_zoo/gpt/gpt.py", line 368, in create_gpt_model
        model = GPT(**model_kwargs)
      File "/home/lcfjr/.local/lib/python3.9/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 254, in wrapper
        f(module, *args, **kwargs)
      File "/home/lcfjr/.local/lib/python3.9/site-packages/model_zoo/gpt/gpt.py", line 261, in __init__
        self.embed = GPTEmbedding(embedding_dim=dim,
      File "/home/lcfjr/.local/lib/python3.9/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 254, in wrapper
        f(module, *args, **kwargs)
      File "/home/lcfjr/.local/lib/python3.9/site-packages/model_zoo/gpt/gpt.py", line 33, in __init__
        self.word_embeddings = col_nn.Embedding(vocab_size, embedding_dim, padding_idx=padding_idx, dtype=dtype)
      File "/home/lcfjr/.local/lib/python3.9/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 254, in wrapper
        f(module, *args, **kwargs)
      File "/home/lcfjr/.local/lib/python3.9/site-packages/colossalai/nn/layer/colossalai_layer/embedding.py", line 69, in __init__
        weight_initializer(self.embed.weight, fan_in=num_embeddings, fan_out=embedding_dim)
      File "/home/lcfjr/.local/lib/python3.9/site-packages/colossalai/nn/init.py", line 31, in initializer
        return nn.init.normal_(tensor, mean, std)
      File "/home/lcfjr/.local/lib/python3.9/site-packages/torch/nn/init.py", line 151, in normal_
        return _no_grad_normal_(tensor, mean, std)
      File "/home/lcfjr/.local/lib/python3.9/site-packages/torch/nn/init.py", line 19, in _no_grad_normal_
        return tensor.normal_(mean, std)
    RuntimeError: CUDA error: no kernel image is available for execution on the device
    CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
    WARNING:torch.distributed.elastic.rendezvous.dynamic_rendezvous:The node 'HPC-AI_1150681_0' has failed to send a keep-alive heartbeat to the rendezvous 'a5650b64-ab96-467e-861a-b345eaa8ab3b' due to an error of type RendezvousConnectionError.
    ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1150747) of binary: /opt/lcsoftware/spack/opt/spack/linux-ubuntu20.04-zen2/gcc-9.3.0/miniconda3-4.10.3-u6p3tgreee7aigtnvuhr44yqo7vcg6r6/bin/python
    ERROR:torch.distributed.elastic.agent.server.api:Error waiting on exit barrier.
    Elapsed: 0.00041747093200683594 seconds
    Traceback (most recent call last):
      File "/home/lcfjr/.local/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", line 899, in _exit_barrier
        store_util.barrier(
      File "/home/lcfjr/.local/lib/python3.9/site-packages/torch/distributed/elastic/utils/store.py", line 67, in barrier
        synchronize(store, data, rank, world_size, key_prefix, barrier_timeout)
      File "/home/lcfjr/.local/lib/python3.9/site-packages/torch/distributed/elastic/utils/store.py", line 52, in synchronize
        store.set(f"{key_prefix}{rank}", data)
    RuntimeError: Broken pipe
    WARNING:torch.distributed.elastic.rendezvous.dynamic_rendezvous:The node 'HPC-AI_1150681_0' has failed to shutdown the rendezvous 'a5650b64-ab96-467e-861a-b345eaa8ab3b' due to an error of type RendezvousConnectionError.
    Traceback (most recent call last):
      File "/home/lcfjr/.local/bin/torchrun", line 10, in <module>
        sys.exit(main())
      File "/home/lcfjr/.local/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
        return f(*args, **kwargs)
      File "/home/lcfjr/.local/lib/python3.9/site-packages/torch/distributed/run.py", line 719, in main
        run(args)
      File "/home/lcfjr/.local/lib/python3.9/site-packages/torch/distributed/run.py", line 710, in run
        elastic_launch(
      File "/home/lcfjr/.local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
        return launch_agent(self._config, self._entrypoint, list(args))
      File "/home/lcfjr/.local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
        raise ChildFailedError(
    torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

    train_gpt.py FAILED

    Failures:
      <NO_OTHER_FAILURES>

    Root Cause (first observed failure):
    [0]:
      time      : 2022-02-24_15:04:10
      host      : HPC-AI
      rank      : 0 (local_rank: 0)
      exitcode  : 1 (pid: 1150747)
      error_file: <N/A>
      traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
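    As a side note, the UserWarning earlier in the log says the installed PyTorch wheel was built only for sm_37/sm_50/sm_60/sm_70, while an A100 needs sm_80, which would explain the "no kernel image is available" error. A small diagnostic sketch (plain PyTorch, nothing ColossalAI-specific) to confirm this on the affected machine:

    import torch

    print(torch.__version__, torch.version.cuda)
    print("compiled for:", torch.cuda.get_arch_list())                 # e.g. ['sm_37', 'sm_50', 'sm_60', 'sm_70']
    print("device capability:", torch.cuda.get_device_capability(0))   # (8, 0) for an A100
    # If the device capability is not covered by the compiled arch list, reinstall a
    # CUDA 11 build of PyTorch (see https://pytorch.org/get-started/locally/).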

    Environment

    No response

    opened by feifeibear 3
  • BERT example fails to run: ColossalAI-Examples/language/bert/sequene_parallel/

    BERT example fails to run: ColossalAI-Examples/language/bert/sequene_parallel/

    🐛 Describe the bug

    I used the latest image 0.1.8 provided on Dockerhub, but the BERT sequence parallel example, ColossalAI-Examples/language/bert/sequene_parallel/, still fails to run and reports a missing module:

    Traceback (most recent call last):
      File "/workspace/ColossalAI-Examples/language/bert/sequene_parallel/train.py", line 10, in <module>
        from model.bert import BertForPretrain
      File "/workspace/ColossalAI-Examples/language/bert/sequene_parallel/model/bert.py", line 12, in <module>
        from colossalai.builder.pipeline import partition_uniform
    ModuleNotFoundError: No module named 'colossalai.builder.pipeline'


    Environment

    Docker image: docker pull hpcaitech/colossalai:0.1.8

    opened by ZXM1063694570 2
  • Problem with saving model state dict

    Problem with saving model state dict

    🐛 Describe the bug

    https://github.com/hpcaitech/ColossalAI-Examples/blob/f743872c2089d6bb5e593db6a8a48d427e6b2b1e/language/opt/run_clm.py#L504

    The code on this line should be model_state = model.state_dict(); however, even after fixing this, the saved state dict is all None.

    Traceback (most recent call last):
      File "generate.py", line 238, in <module>
        main()
      File "generate.py", line 211, in main
        model = OPTForCausalLM.from_pretrained(args.model_path)
      File "/mnt/datadisk0/ouyangliqi/miniconda3/envs/colossalai/lib/python3.8/site-packages/transformers/modeling_utils.py", line 2119, in from_pretrained
        model, missing_keys, unexpected_keys, mismatched_keys, error_msgs = cls._load_pretrained_model(
      File "/mnt/datadisk0/ouyangliqi/miniconda3/envs/colossalai/lib/python3.8/site-packages/transformers/modeling_utils.py", line 2376, in _load_pretrained_model
        raise RuntimeError(f"Error(s) in loading state_dict for {model.__class__.__name__}:\n\t{error_msg}")
    RuntimeError: Error(s) in loading state_dict for OPTForCausalLM:
        size mismatch for model.decoder.embed_tokens.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([50272, 4096]).
        size mismatch for model.decoder.embed_positions.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([2050, 4096]).
        size mismatch for model.decoder.final_layer_norm.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.decoder.final_layer_norm.bias: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.decoder.layers.0.self_attn.k_proj.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
        size mismatch for model.decoder.layers.0.self_attn.k_proj.bias: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.decoder.layers.0.self_attn.v_proj.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
        size mismatch for model.decoder.layers.0.self_attn.v_proj.bias: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.decoder.layers.0.self_attn.q_proj.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
        size mismatch for model.decoder.layers.0.self_attn.q_proj.bias: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.decoder.layers.0.self_attn.out_proj.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
        size mismatch for model.decoder.layers.0.self_attn.out_proj.bias: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.decoder.layers.0.self_attn_layer_norm.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.decoder.layers.0.self_attn_layer_norm.bias: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.decoder.layers.0.fc1.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([16384, 4096]).
        size mismatch for model.decoder.layers.0.fc1.bias: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([16384]).
        size mismatch for model.decoder.layers.0.fc2.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([4096, 16384]).
        size mismatch for model.decoder.layers.0.fc2.bias: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.decoder.layers.0.final_layer_norm.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([4096]).
        ...
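    For reference, a hypothetical sketch of the one-line fix the issue points at, plus the caveat that a ZeRO/Gemini-sharded model needs its full parameters gathered before saving, otherwise the checkpoint ends up with the empty torch.Size([0]) tensors shown above. The tiny stand-in model and the file name are placeholders, not the example's real code.

    import torch
    import torch.nn as nn

    model = nn.Linear(4, 4)            # stand-in; in run_clm.py this would be the OPT model

    # The fix referenced in the issue: call state_dict() as a method to get the tensors.
    model_state = model.state_dict()

    # With ZeRO the parameters live in shards, so each entry here can be an empty
    # torch.Size([0]) tensor unless the full parameters are gathered first.
    assert all(t.numel() > 0 for t in model_state.values()), "state dict contains empty shards"
    torch.save(model_state, "checkpoint.bin")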

    Environment

    CUDA: 11.3 Pytorch: 1.12 transformers: 4.21.0.dev0

    opened by ouyangliqi 2
  • failed to run gpt2 zero3 example

    failed to run gpt2 zero3 example

    🐛 Describe the bug

    Command:

    OMP_NUM_THREADS=32 torchrun --standalone --nnodes=1 --nproc_per_node 2 train_gpt.py --config=gpt2_configs/gpt2_zero3.py --from_torch
    

    Result:

    Traceback (most recent call last):
      File "train_gpt.py", line 130, in <module>
        main()
      File "train_gpt.py", line 56, in main
        ctx = ZeroInitContext(target_device=torch.cuda.current_device(),
    TypeError: __init__() missing 1 required positional argument: 'convert_fp16'
    Traceback (most recent call last):
      File "train_gpt.py", line 130, in <module>
        main()
      File "train_gpt.py", line 56, in main
        ctx = ZeroInitContext(target_device=torch.cuda.current_device(),
    TypeError: __init__() missing 1 required positional argument: 'convert_fp16'
    ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 38441) of binary: /home/toga/.conda/envs/ColAI/bin/python
    Traceback (most recent call last):
      File "/home/toga/.conda/envs/ColAI/bin/torchrun", line 33, in <module>
        sys.exit(load_entry_point('torch==1.10.1', 'console_scripts', 'torchrun')())
      File "/home/toga/.conda/envs/ColAI/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
        return f(*args, **kwargs)
      File "/home/toga/.conda/envs/ColAI/lib/python3.8/site-packages/torch/distributed/run.py", line 719, in main
        run(args)
      File "/home/toga/.conda/envs/ColAI/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
        elastic_launch(
      File "/home/toga/.conda/envs/ColAI/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
        return launch_agent(self._config, self._entrypoint, list(args))
      File "/home/toga/.conda/envs/ColAI/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
        raise ChildFailedError(
    torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
    ============================================================
    train_gpt.py FAILED
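    A possible workaround, sketched under the assumption that the installed colossalai 0.1.1 still exposes ZeroInitContext with the older signature implied by the TypeError (i.e. it requires a convert_fp16 argument that the example no longer passes). The import paths and the other arguments are illustrative and may need adjusting to the exact version; upgrading colossalai and the example together is the cleaner fix.

    import torch
    # Assumed import paths for the 0.1.x series; verify against your installed version.
    from colossalai.zero.init_ctx import ZeroInitContext
    from colossalai.zero.shard_utils import TensorShardStrategy

    ctx = ZeroInitContext(target_device=torch.cuda.current_device(),
                          shard_strategy=TensorShardStrategy(),
                          shard_param=True,
                          convert_fp16=True)   # the positional argument the traceback says is missing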
    

    Environment

    colossalai

    colossalai               0.1.1
    

    nvcc:

    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2021 NVIDIA Corporation
    Built on Mon_May__3_19:15:13_PDT_2021
    Cuda compilation tools, release 11.3, V11.3.109
    Build cuda_11.3.r11.3/compiler.29920130_0
    

    Python

    Python 3.8.12
    

    PyTorch

    torch                    1.10.1
    
    opened by CHN-ChenYi 2
  • connection failure

    connection failure

    🐛 Describe the bug

    I get a runtime error while running the code with the command line colossalai run --nproc_per_node 4 --master_port 29505 train.py: The client socket has failed to connect to any network address of (hcp-bb-03, 52873). The client socket has failed to connect to hcp-bb-03:52873 (errno: 110 - Connection timed out).
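    A quick connectivity check that may help narrow this down, using only the Python standard library; the hostname and port below are simply the ones shown in the error and should be replaced with whatever the launcher actually uses.

    import socket

    MASTER_ADDR = "hcp-bb-03"   # placeholder: the node named in the error
    MASTER_PORT = 52873         # placeholder: the port named in the error

    try:
        addr = socket.gethostbyname(MASTER_ADDR)
        with socket.create_connection((addr, MASTER_PORT), timeout=5):
            print(f"{MASTER_ADDR}:{MASTER_PORT} is reachable at {addr}")
    except OSError as exc:
        print(f"cannot reach {MASTER_ADDR}:{MASTER_PORT}: {exc}")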

    Environment


    opened by lhj-git 2
  • cannot import name 'OPTForCausalLM'

    cannot import name 'OPTForCausalLM'

    🐛 Describe the bug

    I tried to run the command in this link https://github.com/hpcaitech/ColossalAI-Examples/tree/main/language/opt, but errors occured.

    Traceback (most recent call last):
      File "run_clm.py", line 44, in <module>
        from transformers import (CONFIG_MAPPING, MODEL_MAPPING, AutoConfig, OPTForCausalLM, AutoTokenizer, SchedulerType,
    ImportError: cannot import name 'OPTForCausalLM'
    

    Environment

    python 3.6 CUDA 10.2 transformers 4.18.0
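    A likely explanation, offered as a guess rather than a confirmed fix: OPTForCausalLM only appeared in a later transformers release (around 4.19), so the transformers 4.18.0 install listed above cannot export it. A small guard along these lines makes the requirement explicit; the exact minimum version string is an assumption.

    import transformers
    from packaging import version

    # OPT support is not present in 4.18.x; fail early with a clear message instead of an ImportError.
    assert version.parse(transformers.__version__) >= version.parse("4.19.0"), (
        f"transformers {transformers.__version__} has no OPTForCausalLM; "
        "upgrade with: pip install 'transformers>=4.19.0'"
    )
    from transformers import OPTForCausalLM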

    opened by upwindflys 0
  • Outdated OPT example

    Outdated OPT example

    🐛 Describe the bug

    When running OPT example, I got the following errors:

    AttributeError: type object 'ChunkManager' has no attribute 'search_chunk_size'
    

    This is caused by an outdated API. Compared to the OPT example in ColossalAI, the example here has not been updated for a while.

    Environment

    No response

    opened by larry-fuy 0
  • CVE-2007-4559 Patch

    CVE-2007-4559 Patch

    Patching CVE-2007-4559

    Hi, we are security researchers from the Advanced Research Center at Trellix. We have begun a campaign to patch a widespread bug named CVE-2007-4559. CVE-2007-4559 is a 15-year-old bug in the Python tarfile package. By using extract() or extractall() on a tarfile object without sanitizing input, a maliciously crafted .tar file could perform a directory path traversal attack. We found at least one unsanitized extractall() in your codebase and are providing a patch for you via pull request. The patch essentially checks to see if all tarfile members will be extracted safely and throws an exception otherwise. We encourage you to use this patch or your own solution to secure against CVE-2007-4559. Further technical information about the vulnerability can be found in this blog.

    If you have further questions you may contact us through this project's lead researcher, Kasimir Schulz.
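    For readers unfamiliar with the bug, a generic sketch of the kind of check such a patch adds (not necessarily the exact code in the pull request): validate every member's resolved path before calling extractall(), so an entry containing ../ or an absolute path cannot escape the target directory.

    import os
    import tarfile

    def is_within_directory(directory: str, target: str) -> bool:
        abs_directory = os.path.abspath(directory)
        abs_target = os.path.abspath(target)
        return os.path.commonprefix([abs_directory, abs_target]) == abs_directory

    def safe_extract(tar: tarfile.TarFile, path: str = ".") -> None:
        # Refuse to extract members whose resolved path escapes `path`.
        for member in tar.getmembers():
            member_path = os.path.join(path, member.name)
            if not is_within_directory(path, member_path):
                raise Exception("Attempted path traversal in tar file")
        tar.extractall(path)

    # Usage: with tarfile.open("archive.tar.gz") as tar: safe_extract(tar, "output_dir")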

    opened by TrellixVulnTeam 0
  • Load ColossalAI GPT model as HuggingFace/Transformers Model

    Load ColossalAI GPT model as HuggingFace/Transformers Model

    Describe the feature

    Hi all,

    I'm trying to use a GPT model I trained with ColossalAI for inference in huggingface/transformers, but it isn't possible to load it directly as a Hugging Face model since it is implemented in plain PyTorch. How can I go about loading the model I trained using the huggingface/transformers library?

    Thanks so much for your help.

    Best, Red
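    One possible direction, sketched as a guess rather than an endorsed workflow: if the ColossalAI checkpoint is a plain PyTorch state dict whose parameter names line up with transformers' GPT2LMHeadModel (they often do not, in which case a key-remapping step is needed), it can be loaded with strict=False and then re-saved in the Hugging Face format. The checkpoint path and the config values are placeholders.

    import torch
    from transformers import GPT2Config, GPT2LMHeadModel

    config = GPT2Config()                          # match the architecture that was trained
    model = GPT2LMHeadModel(config)

    state_dict = torch.load("colossalai_gpt_checkpoint.pt", map_location="cpu")  # placeholder path
    missing, unexpected = model.load_state_dict(state_dict, strict=False)
    print("missing keys:", missing)                # inspect these to build a key-remapping table
    print("unexpected keys:", unexpected)

    model.save_pretrained("hf_gpt2_export")        # afterwards loadable via from_pretrained()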

    opened by Red-Giuliano 2
Owner
HPC-AI Tech
We are a global team to help you train and deploy your AI models