Colossal-AI: A Unified Deep Learning System for Large-Scale Parallel Training

Overview

ColossalAI

An integrated large-scale model training system with efficient parallelization techniques.

arXiv: Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training

Installation

PyPI

pip install colossalai
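
A quick smoke test after installation (a plain Python one-liner, not a Colossal-AI command); it should finish without an ImportError if the wheel installed correctly:

python -c "import colossalai"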

Install From Source

git clone git@github.com:hpcaitech/ColossalAI.git
cd ColossalAI
# install dependency
pip install -r requirements/requirements.txt

# install colossalai
pip install .

Install and enable CUDA kernel fusion (required when using the fused optimizer)

pip install -v --no-cache-dir --global-option="--cuda_ext" .
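
Whether the extension was actually built can be checked with the CLI command that appears in the issue reports below; after a successful cuda_ext build the check should report "CUDA Extension: ✓" rather than "x":

colossalai check -i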

Documentation

Quick View

Start Distributed Training in a Few Lines

import colossalai
from colossalai.engine import Engine
from colossalai.trainer import Trainer
from colossalai.core import global_context as gpc

model, train_dataloader, test_dataloader, criterion, optimizer, schedule, lr_scheduler = colossalai.initialize()
engine = Engine(
    model=model,
    criterion=criterion,
    optimizer=optimizer,
    lr_scheduler=lr_scheduler,
    schedule=schedule
)

trainer = Trainer(engine=engine,
                  hooks_cfg=gpc.config.hooks,
                  verbose=True)
trainer.fit(
    train_dataloader=train_dataloader,
    test_dataloader=test_dataloader,
    max_epochs=gpc.config.num_epochs,
    display_progress=True,
    test_interval=5
)
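
The snippet above assumes a config file consumed by colossalai.initialize(). Only gpc.config.hooks and gpc.config.num_epochs are actually read by this code; the sketch below is a minimal, hypothetical config.py, and every other key in it is an assumption rather than something prescribed by this page.

# config.py - minimal, hypothetical sketch
BATCH_SIZE = 128   # assumed extra setting, not read by the snippet above
num_epochs = 10    # read via gpc.config.num_epochs

# hook configurations consumed by Trainer via gpc.config.hooks
hooks = []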

Write a Simple 2D Parallel Model

Suppose we have a huge MLP model whose very large hidden size makes it difficult to fit into a single GPU. We can then distribute the model weights across GPUs in a 2D mesh while still writing the model in a familiar way.

from colossalai.nn import Linear2D
import torch.nn as nn


class MLP_2D(nn.Module):

    def __init__(self):
        super().__init__()
        self.linear_1 = Linear2D(in_features=1024, out_features=16384)
        self.linear_2 = Linear2D(in_features=16384, out_features=1024)

    def forward(self, x):
        x = self.linear_1(x)
        x = self.linear_2(x)
        return x
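
Linear2D only defines how a single layer is sharded; the 2D process mesh itself comes from the parallel settings in the config. Below is a sketch of what that section might look like for 4 GPUs arranged as a 2x2 mesh; the exact keys follow Colossal-AI's usual config convention and are an assumption here, not shown on this page.

# hypothetical config.py fragment: 4 GPUs form a 2x2 mesh for 2D tensor parallelism
parallel = dict(
    pipeline=1,
    tensor=dict(size=4, mode='2d'),
)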

Features

ColossalAI provides a collection of parallel training components for you. We aim to let you write distributed deep learning models just like you write a single-GPU model, and we provide friendly tools to kickstart distributed training in a few lines.

Comments
  • [BUG]: Memory consumption by fp16 is not normal

    ๐Ÿ› Describe the bug

    When I used PyTorch's original AMP, the GPU memory usage was much smaller than with Colossal-AI. Why? The config is:

    from colossalai.amp import AMP_TYPE
    from colossalai.zero.shard_utils import TensorShardStrategy
    from colossalai.nn.optimizer import HybridAdam
    
    fp16 = dict(
        mode=AMP_TYPE.TORCH,
    )
    
    optimizer = dict(
        type=HybridAdam,
        lr=0.001,
        # weight_decay=1e-2,
    )
    

    | model | dataset | machine | batch | gradient accumulate size | ZeRO | speed | GPU memory | OPT | tensor_placement_policy | setup |
    | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
    | ir18 | private dataset | 1 | 64 | 1 | no ZeRO | 24%, 2089/8549 [02:51<08:39, 12.43it/s] | 8703M | HybridAdam | | single machine + Engine |
    | ir18 | private dataset | 1 | 64 | 1 | no ZeRO | 19%, 1599/8549 [02:24<10:21, 11.17it/s] | 5769M | HybridAdam | | single machine + w/o Engine + pytorch origin fp16 |
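
    For reference, the "pytorch origin fp16" baseline in the last row presumably looks something like the standard torch.cuda.amp recipe below; this is a sketch with assumed variable names (model, criterion, optimizer, train_dataloader), not code from the report.

    import torch

    scaler = torch.cuda.amp.GradScaler()
    for img, label in train_dataloader:
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():    # run the forward pass in mixed precision
            output = model(img)
            loss = criterion(output, label)
        scaler.scale(loss).backward()      # scale the loss to avoid fp16 underflow
        scaler.step(optimizer)             # unscales gradients, then steps
        scaler.update()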

    Environment

    No response

    bug 
    opened by powermano 26
  • [BUG]: RuntimeError of "RANK" when running train.py of ResNet example on a single GPU

    ๐Ÿ› Describe the bug

    I met a problem today when running `python train.py`, as below:

    /home/user/software/python/anaconda/anaconda3/envs/conda-general/bin/python /home/user/***/***
    /ColossalAI-Examples/image/resnet/train.py
    Traceback (most recent call last):
      File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/colossalai/initialize.py", line 210, in launch_from_torch
        rank = int(os.environ['RANK'])
      File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/os.py", line 679, in __getitem__
        raise KeyError(key) from None
    KeyError: 'RANK'
    
    During handling of the above exception, another exception occurred:
    
    ...
    
    RuntimeError: Could not find 'RANK' in the torch environment, visit https://www.colossalai.org/ for more information on launching with torch
    

    Is this error due to the absence of the environment variable RANK on my Ubuntu machine?
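
    For context, launch_from_torch only reads the variables (RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT) that the torch launchers export, so a bare python train.py leaves them unset. Two ways around it; the explicit colossalai.launch arguments below are an assumption based on its usual signature, not taken from this report:

    # option 1: let the torch launcher export the variables
    #   torchrun --nproc_per_node 1 train.py
    # option 2 (sketch): supply the values explicitly instead of via the environment
    import colossalai

    colossalai.launch(config='./config.py', rank=0, world_size=1,
                      host='127.0.0.1', port=29500)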

    Environment

    Python: 3.10

    bug 
    opened by songyuc 23
  • [BUG]: type object 'ChunkManager' has no attribute 'search_chunk_size'

    ๐Ÿ› Describe the bug

    When I was training the diffusion model, this happened:

    Setting up LambdaLR scheduler...
    Traceback (most recent call last):
      File "/home/tongange/ColossalAI/examples/images/diffusion/main.py", line 804, in <module>
        trainer.fit(model, data)
      File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 578, in fit
        call._call_and_handle_interrupt(
      File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
        return trainer_fn(*args, **kwargs)
      File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 620, in _fit_impl
        self._run(model, ckpt_path=self.ckpt_path)
      File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1038, in _run
        self.strategy.setup(self)
      File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/colossalai.py", line 333, in setup
        self.setup_precision_plugin()
      File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/colossalai.py", line 270, in setup_precision_plugin
        chunk_size = self.chunk_size or ChunkManager.search_chunk_size(
    AttributeError: type object 'ChunkManager' has no attribute 'search_chunk_size'
    Setting up LambdaLR scheduler...
    /root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp.py:437: UserWarning: Error handling mechanism for deadlock detection is uninitialized. Skipping check.
      rank_zero_warn("Error handling mechanism for deadlock detection is uninitialized. Skipping check.")
    Summoning checkpoint.

    Traceback (most recent call last):
      File "/home/tongange/ColossalAI/examples/images/diffusion/main.py", line 804, in <module>
        trainer.fit(model, data)
      File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 578, in fit
        call._call_and_handle_interrupt(
      File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt
        return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
      File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 88, in launch
        return function(*args, **kwargs)
      File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 620, in _fit_impl
        self._run(model, ckpt_path=self.ckpt_path)
      File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1038, in _run
        self.strategy.setup(self)
      File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/colossalai.py", line 333, in setup
        self.setup_precision_plugin()
      File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/colossalai.py", line 270, in setup_precision_plugin
        chunk_size = self.chunk_size or ChunkManager.search_chunk_size(
    AttributeError: type object 'ChunkManager' has no attribute 'search_chunk_size'

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/home/tongange/ColossalAI/examples/images/diffusion/main.py", line 806, in <module>
        melk()
      File "/home/tongange/ColossalAI/examples/images/diffusion/main.py", line 789, in melk
        trainer.save_checkpoint(ckpt_path)
      File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1900, in save_checkpoint
        self._checkpoint_connector.save_checkpoint(filepath, weights_only=weights_only, storage_options=storage_options)
      File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 512, in save_checkpoint
        _checkpoint = self.dump_checkpoint(weights_only)
      File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 444, in dump_checkpoint
        "state_dict": self._get_lightning_module_state_dict(),
      File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 526, in _get_lightning_module_state_dict
        state_dict = self.trainer.strategy.lightning_module_state_dict()
      File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/colossalai.py", line 383, in lightning_module_state_dict
        assert isinstance(self.model, ZeroDDP)
    AssertionError

    Environment

    I trained following the steps below, all the same as described at https://github.com/hpcaitech/ColossalAI/tree/main/examples/images/diffusion

    bug 
    opened by Alfred-Duncan 16
  • [BUG]: colossalai/kernel/cuda_native/csrc/moe_cuda_kernel.cu:5:10: fatal error: cub/cub.cuh: No such file or directory (update: now with more build errors!)

    ๐Ÿ› Describe the bug

    Trying to run a finetune torchrun script, I get this error. ColossalAI was built from source as directed, but it still fails.

    anon@linuxmint:/media/anon/bighdd/ai/toolbox/training$ ./finetune.bash 
    + export BATCH_SIZE=4
    + BATCH_SIZE=4
    + export MODEL=/media/anon/bighdd/ai/models/opt-350m
    + MODEL=/media/anon/bighdd/ai/models/opt-350m
    + export NUMBER_OF_GPUS=1
    + NUMBER_OF_GPUS=1
    + export OUTPUT_DIR=checkpoints
    + OUTPUT_DIR=checkpoints
    ++ date +%Y-%m-%d_%H-%M-%S
    + LOG_NAME=2022-12-22_14-15-45
    + export HF_DATASETS_OFFLINE=1
    + HF_DATASETS_OFFLINE=1
    + mkdir -p checkpoints/logs
    + mkdir -p checkpoints/runs
    + torchrun --nproc_per_node 1 --master_port 19198 ./colossalai/run_clm.py --train_file ./data/train.json --learning_rate 2e-5 --checkpointing_steps 64 --mem_cap 0 --model_name_or_path /media/anon/bighdd/ai/models/opt-350m --output_dir checkpoints --per_device_eval_batch_size 4 --per_device_train_batch_size 4
    + tee checkpoints/logs/2022-12-22_14-15-45.log
    2022-12-22 14:15:51.339450: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
    To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
    Colossalai should be built with cuda extension to use the FP16 optimizer
    If you want to activate cuda mode for MoE, please install with cuda_ext!
    [12/22/22 14:15:54] INFO     colossalai - colossalai - INFO:                                                                              
                                 /home/anon/.local/lib/python3.8/site-packages/colossalai/context/parallel_context.py:521 set_device          
                        INFO     colossalai - colossalai - INFO: process rank 0 is bound to device 0                                          
    [12/22/22 14:15:55] INFO     colossalai - colossalai - INFO:                                                                              
                                 /home/anon/.local/lib/python3.8/site-packages/colossalai/context/parallel_context.py:557 set_seed            
                        INFO     colossalai - colossalai - INFO: initialized seed on rank 0, numpy: 1024, python random: 1024,                
                                 ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024,the default parallel seed is ParallelMode.DATA.           
                        INFO     colossalai - colossalai - INFO: /home/anon/.local/lib/python3.8/site-packages/colossalai/initialize.py:117   
                                 launch                                                                                                       
                        INFO     colossalai - colossalai - INFO: Distributed environment is initialized, data parallel size: 1, pipeline      
                                 parallel size: 1, tensor parallel size: 1                                                                    
                        INFO     colossalai - colossalai - INFO: ./colossalai/run_clm.py:309 main                                             
                        INFO     colossalai - colossalai - INFO: Start preparing dataset                                                      
    Using custom data configuration default-ced548c04fa8d0c8
    Found cached dataset json (/home/anon/.cache/huggingface/datasets/json/default-ced548c04fa8d0c8/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
    100%|██████████| 1/1 [00:00<00:00, 597.82it/s]
    Using custom data configuration default-ced548c04fa8d0c8
    Found cached dataset json (/home/anon/.cache/huggingface/datasets/json/default-ced548c04fa8d0c8/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
    Using custom data configuration default-ced548c04fa8d0c8
    Found cached dataset json (/home/anon/.cache/huggingface/datasets/json/default-ced548c04fa8d0c8/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
                        INFO     colossalai - colossalai - INFO: ./colossalai/run_clm.py:350 main                                             
                        INFO     colossalai - colossalai - INFO: Dataset is prepared                                                          
                        INFO     colossalai - colossalai - INFO: ./colossalai/run_clm.py:366 main                                             
                        INFO     colossalai - colossalai - INFO: Model config has been created                                                
    load model from /media/anon/bighdd/ai/models/opt-350m
                        INFO     colossalai - colossalai - INFO: ./colossalai/run_clm.py:373 main                                             
                        INFO     colossalai - colossalai - INFO: GPT2Tokenizer has been created                                               
                        INFO     colossalai - colossalai - INFO: ./colossalai/run_clm.py:388 main                                             
                        INFO     colossalai - colossalai - INFO: Finetune a pre-trained model                                                 
    [12/22/22 14:16:04] INFO     colossalai - ProcessGroup - INFO:                                                                            
                                 /home/anon/.local/lib/python3.8/site-packages/colossalai/tensor/process_group.py:24 get                      
                        INFO     colossalai - ProcessGroup - INFO: NCCL initialize ProcessGroup on [0]                                        
    [12/22/22 14:16:07] INFO     colossalai - colossalai - INFO: ./colossalai/run_clm.py:400 main                                             
                        INFO     colossalai - colossalai - INFO: using Colossal-AI version 0.1.13                                             
    searching chunk configuration is completed in 0.67 s.
    used number: 315.85 MB, wasted number: 3.01 MB
    total wasted percentage is 0.95%
    /home/anon/.local/lib/python3.8/site-packages/colossalai/gemini/chunk/chunk.py:40: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor._storage() instead of tensor.storage()
      return tensor.storage().size() == 0
    /home/anon/.local/lib/python3.8/site-packages/colossalai/gemini/chunk/chunk.py:45: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor._storage() instead of tensor.storage()
      tensor.storage().resize_(0)
    [12/22/22 14:16:09] INFO     colossalai - colossalai - INFO: ./colossalai/run_clm.py:415 main                                             
                        INFO     colossalai - colossalai - INFO: GeminiDDP has been created                                                   
    Running tokenizer on dataset: 100%|██████████| 10/10 [00:23<00:00,  2.34s/ba]
    Running tokenizer on dataset: 100%|██████████| 1/1 [00:01<00:00,  1.18s/ba]
    [12/22/22 14:16:37] WARNING  colossalai - colossalai - WARNING: ./colossalai/run_clm.py:444 main                                          
                        WARNING  colossalai - colossalai - WARNING: The tokenizer picked seems to have a very large `model_max_length`        
                                 (1000000000000000019884624838656). Picking 1024 instead. You can change that default value by passing        
                                 --block_size xxx.                                                                                            
    Grouping texts in chunks of 1024: 100%|██████████| 10/10 [00:05<00:00,  1.92ba/s]
    Grouping texts in chunks of 1024: 100%|██████████| 1/1 [00:00<00:00,  3.61ba/s]
    [12/22/22 14:16:42] INFO     colossalai - colossalai - INFO: ./colossalai/run_clm.py:503 main                                             
                        INFO     colossalai - colossalai - INFO: Dataloaders have been created                                                
    /home/anon/.local/lib/python3.8/site-packages/colossalai/tensor/colo_tensor.py:182: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor._storage() instead of tensor.storage()
      ret = func(*args, **kwargs)
    /home/anon/.local/lib/python3.8/site-packages/colossalai/nn/optimizer/nvme_optimizer.py:55: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor._storage() instead of tensor.storage()
      numel += p.storage().size()
    ╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
    │ /home/anon/.local/lib/python3.8/site-packages/colossalai/nn/optimizer/hybrid_adam.py:80 in       │
    │ __init__                                                                                         │
    │                                                                                                  │
    │    77 │   │   super(HybridAdam, self).__init__(model_params, default_args, nvme_offload_fracti   │
    │    78 │   │   self.adamw_mode = adamw_mode                                                       │
    │    79 │   │   try:                                                                               │
    │ ❱  80 │   │   │   import colossalai._C.cpu_optim                                                 │
    │    81 │   │   │   import colossalai._C.fused_optim                                               │
    │    82 │   │   except ImportError:                                                                │
    │    83 │   │   │   raise ImportError('Please install colossalai from source code to use HybridA   │
    ╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
    ModuleNotFoundError: No module named 'colossalai._C.cpu_optim'
    
    During handling of the above exception, another exception occurred:
    
    ╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
    │ /media/anon/bighdd/ai/toolbox/training/./colossalai/run_clm.py:643 in <module>                   │
    │                                                                                                  │
    │   640                                                                                            │
    │   641                                                                                            │
    │   642 if __name__ == "__main__":                                                                 │
    │ ❱ 643 │   main()                                                                                 │
    │   644                                                                                            │
    │                                                                                                  │
    │ /media/anon/bighdd/ai/toolbox/training/./colossalai/run_clm.py:519 in main                       │
    │                                                                                                  │
    │   516 │   │   },                                                                                 │
    │   517 │   ]                                                                                      │
    │   518 │                                                                                          │
    │ ❱ 519 │   optimizer = HybridAdam(optimizer_grouped_parameters, lr=args.learning_rate)            │
    │   520 │   optimizer = ZeroOptimizer(optimizer, model, initial_scale=2**14)                       │
    │   521 │                                                                                          │
    │   522 │   # Scheduler and math around the number of training steps.                              │
    │                                                                                                  │
    │ /home/anon/.local/lib/python3.8/site-packages/colossalai/nn/optimizer/hybrid_adam.py:83 in       │
    │ __init__                                                                                         │
    │                                                                                                  │
    │    80 │   │   │   import colossalai._C.cpu_optim                                                 │
    │    81 │   │   │   import colossalai._C.fused_optim                                               │
    │    82 │   │   except ImportError:                                                                │
    │ ❱  83 │   │   │   raise ImportError('Please install colossalai from source code to use HybridA   │
    │    84 │   │                                                                                      │
    │    85 │   │   self.cpu_adam_op = colossalai._C.cpu_optim.CPUAdamOptimizer(lr, betas[0], betas[   │
    │    86 │   │   │   │   │   │   │   │   │   │   │   │   │   │   │   │   │   adamw_mode)            │
    ╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
    ImportError: Please install colossalai from source code to use HybridAdam
    ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 206247) of binary: /usr/bin/python3
    Traceback (most recent call last):
      File "/home/anon/.local/bin/torchrun", line 8, in <module>
        sys.exit(main())
      File "/home/anon/.local/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
        return f(*args, **kwargs)
      File "/home/anon/.local/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
        run(args)
      File "/home/anon/.local/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
        elastic_launch(
      File "/home/anon/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
        return launch_agent(self._config, self._entrypoint, list(args))
      File "/home/anon/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
        raise ChildFailedError(
    torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
    ============================================================
    ./colossalai/run_clm.py FAILED
    ------------------------------------------------------------
    Failures:
      <NO_OTHER_FAILURES>
    ------------------------------------------------------------
    Root Cause (first observed failure):
    [0]:
      time      : 2022-12-22_14:16:47
      host      : linuxmint
      rank      : 0 (local_rank: 0)
      exitcode  : 1 (pid: 206247)
      error_file: <N/A>
      traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
    ============================================================
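
    The root cause is visible in the panels above: colossalai._C.cpu_optim only exists when the CUDA kernels were compiled, so HybridAdam raises. The fix is the source install with the cuda_ext flag from the installation section at the top of this page:

    pip install -v --no-cache-dir --global-option="--cuda_ext" .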
    

    Environment

    Python 3.8.10
    torch: 2.0.0.dev20221215+cu117
    colossalai-0.1.13
    Nvidia 3060 12GB
    NVIDIA-SMI 525.60.11, Driver Version: 525.60.11, CUDA Version: 12.0
    Cuda compilation tools, release 10.1, V10.1.243

    bug 
    opened by xznhj8129 15
  • [BUG]: ZeRO without using shard_param

    ๐Ÿ› Describe the bug

    ๐Ÿ› Describe the bug

    When i use ZeRO without shard_params, it occurs the following problems

    Traceback (most recent call last):
      File "train.py", line 175, in <module>
        main()
      File "train.py", line 39, in main
        with ZeroInitContext(target_device=torch.cuda.current_device(), shard_strategy=shard_strategy, shard_param=False):
      File "/usr/local/Python-3.8.6/lib/python3.8/site-packages/colossalai/zero/init_ctx/init_context.py", line 75, in __init__
        self.config = ZeroContextConfig(target_device=target_device, replicated=True, shard_param=shard_param)
      File "/usr/local/Python-3.8.6/lib/python3.8/site-packages/colossalai/zero/init_ctx/init_context.py", line 37, in __init__
        assert target_device.type == 'cuda', "Replicated no-shard paramters should locate in cuda."
    AttributeError: 'int' object has no attribute 'type'
    
    

    My init code is:

    def main():
        parser = colossalai.get_default_parser()
        parser.add_argument('--use_trainer', action='store_true', help='whether to use trainer')
        args = parser.parse_args()
    
        colossalai.launch_from_torch(config='./config.py')
    
        logger = get_dist_logger()
    
        rank = int(os.environ['RANK'])
        # build resnet
        use_zero3 = hasattr(gpc.config, 'zero')
        if use_zero3:
            shard_strategy = TensorShardStrategy()
            with ZeroInitContext(target_device=torch.cuda.current_device(), shard_strategy=shard_strategy, shard_param=False):
                model = resnet34(num_classes=10)
        else:
            model = resnet34(num_classes=10)
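
    The traceback points at the cause: torch.cuda.current_device() returns a plain int, while ZeroContextConfig asserts on target_device.type. Wrapping the index in a torch.device, a sketch of a likely workaround rather than an official fix, avoids the AttributeError:

    # sketch: pass a torch.device instead of the int index from current_device()
    target_device = torch.device('cuda', torch.cuda.current_device())
    with ZeroInitContext(target_device=target_device,
                         shard_strategy=shard_strategy,
                         shard_param=False):
        model = resnet34(num_classes=10)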
    

    My config is:

    from colossalai.amp import AMP_TYPE
    from colossalai.zero.shard_utils import TensorShardStrategy
    from colossalai.nn.optimizer import HybridAdam
    
    zero = dict(
        model_config=dict(
            tensor_placement_policy='cuda',
            shard_strategy=TensorShardStrategy(),
            reuse_fp16_shard=False
        ),
        optimizer_config=dict()
    )
    
    optimizer = dict(
        type=HybridAdam,
        lr=0.001,
        # weight_decay=1e-2,
    )
    
    BATCH_SIZE = 64
    NUM_EPOCHS = 20
    LOGGING_FREQUNCE = 20
    OUTPUT = './'
    
    gradient_clipping = 5.0
    

    Environment

    pip install colossalai==0.1.5+torch1.10cu11.1 -f https://release.colossalai.org

    ubuntu 18.04

    bug 
    opened by powermano 15
  • [BUG]: Issue with Colossal-AI on CUDA 11.4 and Docker?

    ๐Ÿ› Describe the bug

    Followed the installation guide here: https://github.com/hpcaitech/ColossalAI

    2001  mkdir colossalai
    2002  cd colossalai/
    2003  ll
    2004  colossalai
    2005  git clone https://github.com/hpcaitech/ColossalAI.git
    2006  cd ColossalAI
    2007  # install dependency
    2008  pip install -r requirements/requirements.txt
    2009  # install colossalai
    2010  pip install .
    2014  docker build -t colossalai ./docker

    2015  docker run -ti --gpus all --rm --ipc=host colossalai bash

    [root@dbf722d6d864 workspace]# colossalai check -i
    Colossalai should be built with cuda extension to use the FP16 optimizer
    If you want to activate cuda mode for MoE, please install with cuda_ext!
    CUDA Version: 11.3
    PyTorch Version: 1.10.1
    CUDA Version in PyTorch Build: 11.3
    PyTorch CUDA Version Match: ✓
    CUDA Extension: x

    The CUDA extension ^^^ isn't present?

    [root@dbf722d6d864 workspace]# colossalai benchmark --gpus 8
    Colossalai should be built with cuda extension to use the FP16 optimizer
    If you want to activate cuda mode for MoE, please install with cuda_ext!
    === Benchmarking Parameters ===
    gpus: 8
    batch_size: 8
    seq_len: 512
    dimension: 1024
    warmup_steps: 10
    profile_steps: 50
    layers: 2
    model: mlp

    Colossalai should be built with cuda extension to use the FP16 optimizer If you want to activate cuda mode for MoE, please install with cuda_ext!

    === size: 8, mode: None ===
    Average forward time: 0.0004958677291870118
    Average backward time: 0.0010803651809692383
    Max allocated GPU memory: 0.26564550399780273
    Max cached GPU memory: 0.287109375

    === size: 8, mode: 1d ===
    Average forward time: 0.004022541046142578
    Average backward time: 0.0007260799407958985
    Max allocated GPU memory: 0.2382950782775879
    Max cached GPU memory: 0.287109375

    === size: 8, mode: 2.5d, depth: 2 ===
    Average forward time: 0.001216425895690918
    Average backward time: 0.002291984558105469
    Max allocated GPU memory: 0.17383670806884766
    Max cached GPU memory: 0.2734375

    === size: 8, mode: 3d ===
    Average forward time: 0.000978093147277832
    Average backward time: 0.0016768646240234374
    Max allocated GPU memory: 0.05128049850463867
    Max cached GPU memory: 0.185546875

    Colossalai should be built with cuda extension to use the FP16 optimizer

    What does this ^^^ really mean?

    This is an A100-based system:

    $ nvidia-smi
    Thu May 26 18:43:56 2022
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  NVIDIA A100-SXM...  On   | 00000000:07:00.0 Off |                    0 |
    | N/A   27C    P0    52W / 400W |      0MiB / 40536MiB |      0%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+
    |   1  NVIDIA A100-SXM...  On   | 00000000:0F:00.0 Off |                    0 |
    | N/A   26C    P0    50W / 400W |      0MiB / 40536MiB |      0%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+
    |   2  NVIDIA A100-SXM...  On   | 00000000:47:00.0 Off |                    0 |
    | N/A   26C    P0    54W / 400W |      0MiB / 40536MiB |      0%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+
    |   3  NVIDIA A100-SXM...  On   | 00000000:4E:00.0 Off |                    0 |
    | N/A   25C    P0    53W / 400W |      0MiB / 40536MiB |      0%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+
    |   4  NVIDIA A100-SXM...  On   | 00000000:87:00.0 Off |                    0 |
    | N/A   30C    P0    54W / 400W |      0MiB / 40536MiB |      0%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+
    |   5  NVIDIA A100-SXM...  On   | 00000000:90:00.0 Off |                    0 |
    | N/A   29C    P0    53W / 400W |      0MiB / 40536MiB |      0%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+
    |   6  NVIDIA A100-SXM...  On   | 00000000:B7:00.0 Off |                    0 |
    | N/A   29C    P0    54W / 400W |      0MiB / 40536MiB |      0%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+
    |   7  NVIDIA A100-SXM...  On   | 00000000:BD:00.0 Off |                    0 |
    | N/A   29C    P0    53W / 400W |      0MiB / 40536MiB |      0%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+

    Environment

    This is an A100-based system:

    $ nvidia-smi
    Thu May 26 18:43:56 2022
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |

    bug 
    opened by Adrian-1234 15
  • [BUG]: Memory consumption by fp16 is not normal when using Engine.

    ๐Ÿ› Describe the bug

    When using colossalai.amp.convert_to_torch_amp to wrap the model, optimizer and criterion:

    if not use_colossai_engine:
        model, optimizer, criterion =  colossalai.amp.convert_to_torch_amp(model, optimizer, criterion)
    

    and then training normally, it also only consumes 4700M of memory.

    output, _ = model(img, label)
    train_loss = criterion(output, label)
    optimizer.backward(train_loss)
    optimizer.step()
    optimizer.zero_grad()
    

    But if you use colossalai.initialize to initialize, it consumes 7700M of memory. By reading the fp16 parameter in the config we did verify that, inside colossalai.initialize, the colossalai.amp.convert_to_torch_amp conversion is performed; yet when we then use the Engine for training, it consumes 7700M of memory. This is where I get confused.

    engine.zero_grad()
    output, _ = engine(img, label)
    train_loss = engine.criterion(output, label)
    engine.backward(train_loss)
    engine.step()   
    

    Environment

    No response

    bug 
    opened by powermano 14
  • [BUG]: examples/images/diffusion run failed

    ๐Ÿ› Describe the bug

    I ran the diffusion example according to https://github.com/hpcaitech/ColossalAI/tree/main/examples/images/diffusion:

    steps:
    conda env create -f environment.yaml
    conda activate ldm
    pip install colossalai==0.1.10+torch1.11cu11.3 -f https://release.colossalai.org
    git clone https://github.com/Lightning-AI/lightning && cd lightning && git reset --hard b04a7aa
    pip install -r requirements.txt && pip install .

    dataset: laion-400m

    run: bash train.sh

    failed info:

    /opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp.py:438: UserWarning: Error handling mechanism for deadlock detection is uninitialized. Skipping check.
      rank_zero_warn("Error handling mechanism for deadlock detection is uninitialized. Skipping check.")
    Traceback (most recent call last):
      File "/home/code/ColossalAI/examples/images/diffusion/main.py", line 811, in <module>
        trainer.fit(model, data)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 579, in fit
        call._call_and_handle_interrupt(
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
        return trainer_fn(*args, **kwargs)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 621, in _fit_impl
        self._run(model, ckpt_path=self.ckpt_path)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1058, in _run
        results = self._run_stage()
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1137, in _run_stage
        self._run_train()
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1160, in _run_train
        self.fit_loop.run()
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
        self.advance(*args, **kwargs)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 267, in advance
        self._outputs = self.epoch_loop.run(self._data_fetcher)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
        self.advance(*args, **kwargs)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 214, in advance
        batch_output = self.batch_loop.run(kwargs)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
        self.advance(*args, **kwargs)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 88, in advance
        outputs = self.optimizer_loop.run(optimizers, kwargs)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
        self.advance(*args, **kwargs)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 200, in advance
        result = self._run_optimization(kwargs, self._optimizers[self.optim_progress.optimizer_position])
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 247, in _run_optimization
        self._optimizer_step(optimizer, opt_idx, kwargs.get("batch_idx", 0), closure)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 357, in _optimizer_step
        self.trainer._call_lightning_module_hook(
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1302, in _call_lightning_module_hook
        output = fn(*args, **kwargs)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/core/module.py", line 1661, in optimizer_step
        optimizer.step(closure=optimizer_closure)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/core/optimizer.py", line 169, in step
        step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/colossalai.py", line 368, in optimizer_step
        return self.precision_plugin.optimizer_step(
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/plugins/precision/colossalai.py", line 74, in optimizer_step
        closure_result = closure()
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 147, in __call__
        self._result = self.closure(*args, **kwargs)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 133, in closure
        step_output = self._step_fn()
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 406, in _training_step
        training_step_output = self.trainer._call_strategy_hook("training_step", *kwargs.values())
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1440, in _call_strategy_hook
        output = fn(*args, **kwargs)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp.py", line 352, in training_step
        return self.model(*args, **kwargs)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/colossalai/nn/parallel/data_parallel.py", line 241, in forward
        outputs = self.module(*args, **kwargs)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/overrides/base.py", line 98, in forward
        output = self._forward_module.training_step(*inputs, **kwargs)
      File "/home/code/ColossalAI/examples/images/diffusion/ldm/models/diffusion/ddpm.py", line 411, in training_step
        loss, loss_dict = self.shared_step(batch)
      File "/home/code/ColossalAI/examples/images/diffusion/ldm/models/diffusion/ddpm.py", line 976, in shared_step
        loss = self(x, c)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/code/ColossalAI/examples/images/diffusion/ldm/models/diffusion/ddpm.py", line 988, in forward
        return self.p_losses(x, c, t, *args, **kwargs)
      File "/home/code/ColossalAI/examples/images/diffusion/ldm/models/diffusion/ddpm.py", line 1122, in p_losses
        model_output = self.apply_model(x_noisy, t, cond)
      File "/home/code/ColossalAI/examples/images/diffusion/ldm/models/diffusion/ddpm.py", line 1094, in apply_model
        x_recon = self.model(x_noisy, t, **cond)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/code/ColossalAI/examples/images/diffusion/ldm/models/diffusion/ddpm.py", line 1519, in forward
        out = self.diffusion_model(x, t, context=cc)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/code/ColossalAI/examples/images/diffusion/ldm/modules/diffusionmodules/openaimodel.py", line 927, in forward
        h = th.cat([h, hs.pop()], dim=1)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/colossalai/tensor/colo_tensor.py", line 170, in __torch_function__
        ret = func(*args, **kwargs)
    RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 8 but got size 7 for tensor number 1 in the list.

    Environment

    (environment details were attached as a screenshot in the original issue)

    bug 
    opened by GxjGit 13
  • add example of self-supervised SimCLR training - V2

    The previous version uses NVIDIA DALI to create a dataloader. I found that the data augmentations in DALI differ from those in torchvision, so the desired performance could not be achieved. In this version, the dataloader is implemented with colossalai.nn.data and torchvision. The final linear evaluation accuracy reaches up to 85.4%.

    documentation 
    opened by DevinCheung 13
  • [BUG]: RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one.

    ๐Ÿ› Describe the bug

    After following the ResNet50 example in the tutorial as closely as possible, I got the error in the title. The same thing happened the last time I used HF's accelerate; I can't figure out this complex problem on my first usage. Of course I have tried my best to solve it, and the reason is likely the following. `colossalai check -i` outputs:

    Colossalai should be built with cuda extension to use the FP16 optimizer
    If you want to activate cuda mode for MoE, please install with cuda_ext!
    CUDA Version: N/A (CUDA_HOME is not set)
    PyTorch Version: 1.11.0+cu102
    CUDA Version in PyTorch Build: 10.2
    PyTorch CUDA Version Match: x
    CUDA Extension: x

    but I tried on a machine with CUDA 11.3 and got the same error.

    Below is part of my code:

    logger = get_dist_logger()
    # args = colossalai.get_default_parser().parse_args()
    colossalai.launch_from_torch(config='config.py')
    config = Config()
    tokenizer = JiebaTokenizer.from_pretrained('Lowin/chinese-bigbird-base-4096')
    model = BB()
    optimizer = optim.AdamW(params=model.parameters(), lr=1e-5, weight_decay=1e-2)
    lossFunc = F.cross_entropy
    rouge = load_metric('rouge')

    valida = json.load(open("dataset/dev.json"))
    trains = json.load(open("dataset/train.json"))
    dataSetTrain = DS(trains, tokenizer, config)
    dataSetValid = DS(valida, tokenizer, config)
    tDL = DataLoader(dataSetTrain, batch_size=config.batch_size_train, shuffle=True)
    vDL = DataLoader(dataSetValid, batch_size=config.batch_size_valid)

    engine, tDL, vDL, _ = colossalai.initialize(
        model,
        optimizer,
        lossFunc,
        tDL,
        vDL
    )

    for epoch in range(gpc.config.NUM_EPOCH):
        tDL = tqdm(tDL, leave=False)
        engine.train()
        for batch in tDL:
            labels = batch.pop('labels').cuda()
            batch = {key: value.cuda() for key, value in batch.items()}
            logist = engine(batch)
            loss_sum = engine.criterion(logist.view(-1, config.vocab_size), labels.view(-1))
            title_length = labels.ne(0).sum().item()
            loss = loss_sum / title_length
            engine.backward(loss)
            engine.step()
            engine.zero_grad()
            tDL.set_description(f'Epoch:{epoch}:')
            tDL.set_postfix(loss=loss.item())
    

    Code of model construction

    class BB(torch.nn.Module):
    	def __init__(self):
    		super(BB,self).__init__()
    		self.transformer = BigBirdModel.from_pretrained('Lowin/chinese-bigbird-base-4096')
    		self.dropout = torch.nn.Dropout(0.2)
    		self.output = torch.nn.Linear(768,39999)
            
    
    	def forward(self,batch):
    		# batch = self._set_token_type_ids_(batch)
    		outputs = self.transformer(**batch).last_hidden_state  #bs token_num outputsize 
    		logits = self.output(self.dropout(outputs))  #bs token_num vocab_size
    		return logits
    

    here is the error info:

    /home/guxj/anaconda3/envs/NLP_colossalai/lib/python3.8/site-packages/transformers/models/big_bird/modeling_big_bird.py:981: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
      torch.arange(indices.shape[0] * indices.shape[1] * num_indices_to_gather, device=indices.device)
    /home/guxj/anaconda3/envs/NLP_colossalai/lib/python3.8/site-packages/transformers/models/big_bird/modeling_big_bird.py:981: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
      torch.arange(indices.shape[0] * indices.shape[1] * num_indices_to_gather, device=indices.device)
    Traceback (most recent call last):
      File "test3_v3.3.py", line 138, in <module>
        logist = engine(batch)
      File "/home/guxj/anaconda3/envs/NLP_colossalai/lib/python3.8/site-packages/colossalai/engine/_base_engine.py", line 183, in __call__
        return self.model(*args, **kwargs)
      File "/home/guxj/anaconda3/envs/NLP_colossalai/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/guxj/anaconda3/envs/NLP_colossalai/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 947, in forward
    Traceback (most recent call last):
      File "test3_v3.3.py", line 138, in <module>
        logist = engine(batch)
      File "/home/guxj/anaconda3/envs/NLP_colossalai/lib/python3.8/site-packages/colossalai/engine/_base_engine.py", line 183, in __call__
        return self.model(*args, **kwargs)
      File "/home/guxj/anaconda3/envs/NLP_colossalai/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/guxj/anaconda3/envs/NLP_colossalai/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 947, in forward
        if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
    RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel, and by making sure all forward function outputs participate in calculating loss. If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable). Parameter indices which did not receive grad for rank 0: 197 198 In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
        if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
    RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel, and by making sure all forward function outputs participate in calculating loss. If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable). Parameter indices which did not receive grad for rank 1: 197 198 In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
    ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 44596) of binary: /home/guxj/anaconda3/envs/NLP_colossalai/bin/python
    Traceback (most recent call last):
      File "/home/guxj/anaconda3/envs/NLP_colossalai/lib/python3.8/runpy.py", line 194, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "/home/guxj/anaconda3/envs/NLP_colossalai/lib/python3.8/runpy.py", line 87, in _run_code
        exec(code, run_globals)
      File "/home/guxj/anaconda3/envs/NLP_colossalai/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
        main()
      File "/home/guxj/anaconda3/envs/NLP_colossalai/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
        launch(args)
      File "/home/guxj/anaconda3/envs/NLP_colossalai/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
        run(args)
      File "/home/guxj/anaconda3/envs/NLP_colossalai/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
        elastic_launch(
      File "/home/guxj/anaconda3/envs/NLP_colossalai/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
        return launch_agent(self._config, self._entrypoint, list(args))
      File "/home/guxj/anaconda3/envs/NLP_colossalai/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
        raise ChildFailedError(
    torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
    ============================================================
    test3_v3.3.py FAILED
    ------------------------------------------------------------
    Failures:
    [1]:
      time      : 2022-05-18_01:27:08
      host      : dlp01
      rank      : 1 (local_rank: 1)
      exitcode  : 1 (pid: 44597)
      error_file: <N/A>
      traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
    ------------------------------------------------------------
    Root Cause (first observed failure):
    [0]:
      time      : 2022-05-18_01:27:08
      host      : dlp01
      rank      : 0 (local_rank: 0)
      exitcode  : 1 (pid: 44596)
      error_file: <N/A>
      traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
    ============================================================
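
    The error message itself names the standard remedy: some BigBird parameters (indices 197 and 198) receive no gradient, so DDP must be told to expect unused parameters. With plain PyTorch that looks like the sketch below; how to thread the flag through colossalai.initialize's Engine depends on the Colossal-AI version and is not shown in this report.

    # sketch of the remedy quoted in the error message, using plain PyTorch DDP;
    # `model` and `local_rank` are assumed to exist in the surrounding script
    from torch.nn.parallel import DistributedDataParallel as DDP

    ddp_model = DDP(model, device_ids=[local_rank], find_unused_parameters=True)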

    Environment

    CUDA: 10.2, PyTorch: 1.11.0, Python: 3.8.13 (miniconda)

    bug 
    opened by 480284856 12
  • [BUG]: CUDA extension build skipped when installing from source

    ๐Ÿ› Describe the bug

    Hi, I used the Install From Source option to install ColossalAI, but I encounter a problem like:

    /path/to/myconda/anaconda3/envs/py37-pt111-cu111-colai/lib/python3.7/site-packages/torch/autocast_mode.py:162: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
      warnings.warn('User provided device_type of \'cuda\', but CUDA is not available. Disabling')
    Colossalai should be built with cuda extension to use the FP16 optimizer
    If you want to activate cuda mode for MoE, please install with cuda_ext!

    I have installed torch 1.11 + cu11.3 and am using CUDA 11.1. Any suggestions?

    Environment

    Pytorch 1.11+cu11.3 CUDA 11.1

    bug 
    opened by imabackstabber 12
  • Train stable diffusion finetune stopped at "Summoning checkpoint"

    My machine: CPU 32 GB, GPU 16 GB, batch size = 1. It seems colossalai is not working well.

    {'accelerator': 'gpu', 'devices': 1, 'log_gpu_memory': 'all', 'max_epochs': 2, 'precision': 16, 'auto_select_gpus': False, 'strategy': {'target': 'strategies.ColossalAIStrategy', 'params': {'use_chunk': True, 'enable_distributed_storage': True, 'placement_policy': 'cuda', 'force_outputs_fp32': True}}, 'log_every_n_steps': 2, 'logger': True, 'default_root_dir': '/tmp/diff_log/'}
    Running on GPU
    Using FP16 = True
    No module 'xformers'. Proceeding without it.
    LatentDiffusion: Running in v-prediction mode
    DiffusionWrapper has 865.91 M params.
    making attention of type 'vanilla' with 512 in_channels
    Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
    making attention of type 'vanilla' with 512 in_channels
    Using strategy: strategies.ColossalAIStrategy
    Monitoring val/loss_simple_ema as checkpoint metric.
    Merged modelckpt-cfg: {'target': 'lightning.pytorch.callbacks.ModelCheckpoint', 'params': {'dirpath': '/tmp/2023-01-05T10-52-57_train_colossalai_teyvat/checkpoints', 'filename': '{epoch:06}', 'verbose': True, 'save_last': True, 'monitor': 'val/loss_simple_ema', 'save_top_k': 3}}
    GPU available: True (cuda), used: True
    TPU available: False, using: 0 TPU cores
    IPU available: False, using: 0 IPUs
    HPU available: False, using: 0 HPUs

    .... ....

    Lightning config:

    trainer:
      accelerator: gpu
      devices: 1
      log_gpu_memory: all
      max_epochs: 2
      precision: 16
      auto_select_gpus: false
      strategy:
        target: strategies.ColossalAIStrategy
        params:
          use_chunk: true
          enable_distributed_storage: true
          placement_policy: cuda
          force_outputs_fp32: true
      log_every_n_steps: 2
      logger: true
      default_root_dir: /tmp/diff_log/
    logger_config:
      wandb:
        target: loggers.WandbLogger
        params:
          name: nowname
          save_dir: /tmp/diff_log/
          offline: opt.debug
          id: nowname

    /home/ubuntu/anaconda3/envs/ldmco/lib/python3.9/site-packages/lightning/pytorch/loggers/tensorboard.py:261: UserWarning: Could not log computational graph to TensorBoard: The model.example_input_array attribute is not set or input_array was not given.
      rank_zero_warn(
    Epoch 0: 0%| | 0/234 [00:00<?, ?it/s]
    /home/ubuntu/anaconda3/envs/ldmco/lib/python3.9/site-packages/lightning/pytorch/utilities/data.py:85: UserWarning: Trying to infer the batch_size from an ambiguous collection. The batch size we found is 1. To avoid any miscalculations, use self.log(..., batch_size=batch_size).
      warning_cache.warn(
    /home/ubuntu/anaconda3/envs/ldmco/lib/python3.9/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py:233: UserWarning: You called self.log('global_step', ...) in your training_step but the value needs to be floating point. Converting it to torch.float32.
      warning_cache.warn(
    Summoning checkpoint.
    Killed
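    "Killed" with no Python traceback usually means the Linux OOM killer terminated the process, which would be consistent with exhausting the 32 GB of host RAM. A generic probe (not Colossal-AI specific; assumes psutil is installed) to confirm host-memory pressure around the "Summoning checkpoint" step:

    import os
    import psutil

    def log_rss(tag: str) -> None:
        # Print this process's resident set size in GiB; call it before and
        # after the suspect step to see how close host RAM is to the limit.
        rss_gib = psutil.Process(os.getpid()).memory_info().rss / 1024 ** 3
        print(f"[{tag}] RSS = {rss_gib:.1f} GiB")

    log_rss("before checkpoint")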

    opened by yufengyao-lingoace 1
  • [example] simplify opt example

    [example] simplify opt example

    Why

    Make sure the user can run OPT to profile performance in one minute: no data download, no complex training parameter setup; just run a few iterations.

    opened by feifeibear 0
  • [DOC]: wrong transformers version in examples

    [DOC]: wrong transformers version in examples

    ๐Ÿ“š The doc issue

    https://github.com/hpcaitech/ColossalAI/blob/9c9246c0d9e09fc261ff9d052deb5ef1e02e614c/examples/language/gpt/requirements.txt#L3 It should probably be 4.23.1.
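    For reference, a quick check of the locally installed version against the pin suggested in this report (4.23.1):

    import transformers

    # The report suggests the requirements file should pin transformers==4.23.1;
    # verify what is actually installed.
    print(transformers.__version__)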

    documentation 
    opened by yhcc 0
  • [device] find best logical mesh

    [device] find best logical mesh

    What does this PR do

    1. Implement the search_best_logical_mesh function, which finds the best logical mesh for the given device list.

      The best logical mesh is searched for in the following steps:

      1. Detect homogeneous device groups. We assume that the devices in the alpha_beta_dict are homogeneous if their beta values are close enough (a minimal sketch of this clustering idea follows this list).
      2. Find the best homogeneous device group that contains all the physical devices, i.e., the group with the lowest beta value among those containing all the physical devices. We require the group to contain all the physical devices because any devices outside the group would decrease the group's bandwidth.
      3. If the best homogeneous device group is found, we construct the largest ring for each device based on that group, and the best logical mesh is the union of all the rings. Otherwise, the best logical mesh falls back to a balanced logical mesh, e.g. shape (2, 2) for 4 devices.

      Usage:

      
          >>> physical_devices = [0, 1, 2, 3]
          >>> ab_profiler = AlphaBetaProfiler(physical_devices)
          >>> best_logical_mesh = ab_profiler.search_best_logical_mesh()
          >>> print(best_logical_mesh)
          [[0, 1], [2, 3]]
      
    2. Implement the extract_alpha_beta_for_device_mesh function, which extracts the mesh_alpha list and mesh_beta list based on the best logical mesh.

      Usage:

      
          >>> physical_devices = [0, 1, 2, 3]
          >>> ab_profiler = AlphaBetaProfiler(physical_devices)
          >>> mesh_alpha, mesh_beta = ab_profiler.extract_alpha_beta_for_device_mesh()
          >>> print(mesh_alpha)
          [2.5917552411556242e-05, 0.00010312341153621673]
          >>> print(mesh_beta)
          [5.875573704655635e-11, 4.7361584445959614e-12]
      
    3. Construct test cases to test the above features.
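    As a concrete illustration of step 1 above, here is a minimal sketch (not the PR's actual implementation) of clustering links by near-equal beta values; alpha_beta_dict is assumed to map device pairs to measured (alpha, beta) tuples:

        import math

        def group_homogeneous_links(alpha_beta_dict, rel_tol=0.1):
            # Each group is (representative_beta, device_set); a link joins an
            # existing group when its beta is within rel_tol of the representative.
            groups = []
            for (src, dst), (_alpha, beta) in alpha_beta_dict.items():
                for rep_beta, devices in groups:
                    if math.isclose(beta, rep_beta, rel_tol=rel_tol):
                        devices.update((src, dst))
                        break
                else:
                    groups.append((beta, {src, dst}))
            return groups

        # Fabricated measurements with two distinct link classes (values invented).
        ab = {(0, 1): (2.6e-05, 5.9e-11),
              (2, 3): (2.5e-05, 6.0e-11),
              (0, 2): (1.0e-04, 4.7e-12)}
        print(group_homogeneous_links(ab))
        # [(5.9e-11, {0, 1, 2, 3}), (4.7e-12, {0, 2})]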

    Run Build and Test 
    opened by YuliangLiu0306 0
Releases(v0.2.0)
  • v0.2.0(Jan 3, 2023)

    What's Changed

    Version

    Examples

    • [examples] using args and combining two versions for PaLM (#2284) by ZijianYY
    • [examples] replace einsum with matmul (#2210) by ZijianYY

    Doc

    • [doc] add feature diffusion v2, bloom, auto-parallel (#2282) by binmakeswell
    • [doc] updated the stable diffussion on docker usage (#2244) by Frank Lee

    Zero

    • [zero] polish low level zero optimizer (#2275) by HELSON
    • [zero] fix error for BEiT models (#2169) by HELSON

    Example

    • [example] add benchmark (#2276) by Ziyue Jiang
    • [example] fix save_load bug for dreambooth (#2280) by BlueRum
    • [example] GPT polish readme (#2274) by Jiarui Fang
    • [example] fix gpt example with 0.1.10 (#2265) by HELSON
    • [example] clear diffuser image (#2262) by Fazzie-Maqianli
    • [example] diffusion install from docker (#2239) by Jiarui Fang
    • [example] fix benchmark.sh for gpt example (#2229) by HELSON
    • [example] make palm + GeminiDPP work (#2227) by Jiarui Fang
    • [example] Palm adding gemini, still has bugs (#2221) by ZijianYY
    • [example] update gpt example (#2225) by HELSON
    • [example] add benchmark.sh for gpt (#2226) by Jiarui Fang
    • [example] update gpt benchmark (#2219) by HELSON
    • [example] update GPT example benchmark results (#2212) by Jiarui Fang
    • [example] update gpt example for larger model scale (#2211) by Jiarui Fang
    • [example] update gpt readme with performance (#2206) by Jiarui Fang
    • [example] polish doc (#2201) by ziyuhuang123
    • [example] Change some training settings for diffusion (#2195) by BlueRum
    • [example] support Dreamblooth (#2188) by Fazzie-Maqianli
    • [example] gpt demo more accuracy tflops (#2178) by Jiarui Fang
    • [example] add palm pytorch version (#2172) by Jiarui Fang
    • [example] update vit readme (#2155) by Jiarui Fang
    • [example] add zero1, zero2 example in GPT examples (#2146) by HELSON

    Autoparallel

    • [autoparallel] fix spelling error (#2270) by YuliangLiu0306
    • [autoparallel] gpt2 autoparallel examples (#2267) by YuliangLiu0306
    • [autoparallel] patch torch.flatten metainfo for autoparallel (#2247) by Boyuan Yao
    • [autoparallel] autoparallel initialize (#2238) by YuliangLiu0306
    • [autoparallel] fix construct meta info. (#2245) by Super Daniel
    • [autoparallel] record parameter attribute in colotracer (#2217) by YuliangLiu0306
    • [autoparallel] Attach input, buffer and output tensor to MetaInfo class (#2162) by Boyuan Yao
    • [autoparallel] new metainfoprop based on metainfo class (#2179) by Boyuan Yao
    • [autoparallel] update getitem handler (#2207) by YuliangLiu0306
    • [autoparallel] update_getattr_handler (#2193) by YuliangLiu0306
    • [autoparallel] add gpt2 performance test code (#2194) by YuliangLiu0306
    • [autoparallel] integrate_gpt_related_tests (#2134) by YuliangLiu0306
    • [autoparallel] memory estimation for shape consistency (#2144) by Boyuan Yao
    • [autoparallel] use metainfo in handler (#2149) by YuliangLiu0306

    Gemini

    • [Gemini] fix the convert_to_torch_module bug (#2269) by Jiarui Fang

    Pipeline middleware

    • [Pipeline Middleware] Reduce comm redundancy by getting accurate output (#2232) by Ziyue Jiang

    Builder

    • [builder] builder for scaled_upper_triang_masked_softmax (#2234) by Jiarui Fang
    • [builder] polish builder with better base class (#2216) by Jiarui Fang
    • [builder] raise Error when CUDA_HOME is not set (#2213) by Jiarui Fang
    • [builder] multihead attn runtime building (#2203) by Jiarui Fang
    • [builder] unified cpu_optim fused_optim inferface (#2190) by Jiarui Fang
    • [builder] use runtime builder for fused_optim (#2189) by Jiarui Fang
    • [builder] runtime adam and fused_optim builder (#2184) by Jiarui Fang
    • [builder] use builder() for cpu adam and fused optim in setup.py (#2187) by Jiarui Fang

    Diffusion

    • [diffusion] update readme (#2214) by HELSON

    Testing

    • [testing] add beit model for unit testings (#2196) by HELSON

    Exmaple

    • [exmaple] diffuser, support quant inference for stable diffusion (#2186) by BlueRum
    • [exmaple] add vit missing functions (#2154) by Jiarui Fang

    Pipeline middleware

    • [Pipeline Middleware ] Fix deadlock when num_microbatch=num_stage (#2156) by Ziyue Jiang

    Full Changelog: https://github.com/hpcaitech/ColossalAI/compare/v0.2.0...v0.1.13

    Source code(tar.gz)
    Source code(zip)
  • v0.1.13(Dec 20, 2022)

    What's Changed

    Gemini

    • [Gemini] GeminiDPP convert to PyTorch Module. (#2151) by Jiarui Fang
    • [Gemini] Update coloinit_ctx to support meta_tensor (#2147) by BlueRum
    • [Gemini] revert ZeROInitCtx related tracer (#2138) by Jiarui Fang
    • [Gemini] update API of the chunkmemstatscollector. (#2129) by Jiarui Fang
    • [Gemini] update the non model data record method in runtime memory tracer (#2128) by Jiarui Fang
    • [Gemini] test step-tensor mapping using repeated_computed_layers.py (#2127) by Jiarui Fang
    • [Gemini] update non model data calculation method (#2126) by Jiarui Fang
    • [Gemini] hotfix the unittest bugs (#2125) by Jiarui Fang
    • [Gemini] mapping of preop timestep and param (#2124) by Jiarui Fang
    • [Gemini] chunk init using runtime visited param order (#2115) by Jiarui Fang
    • [Gemini] chunk init use OrderedParamGenerator (#2110) by Jiarui Fang

    Nfc

    • [NFC] remove useless graph node code (#2150) by Jiarui Fang
    • [NFC] update chunk manager API (#2119) by Jiarui Fang
    • [NFC] polish comments for Chunk class (#2116) by Jiarui Fang

    Example

    • Merge pull request #2120 from Fazziekey/example/stablediffusion-v2 by Fazzie-Maqianli

    Optimizer

    • [optimizer] add div_scale for optimizers (#2117) by HELSON

    Pp middleware

    • [PP Middleware] Add bwd and step for PP middleware (#2111) by Ziyue Jiang

    Full Changelog: https://github.com/hpcaitech/ColossalAI/compare/v0.1.13...v0.1.12

    Source code(tar.gz)
    Source code(zip)
  • v0.1.12(Dec 9, 2022)

    What's Changed

    Zero

    • [zero] add L2 gradient clipping for ZeRO (#2112) by HELSON

    Gemini

    • [gemini] get the param visited order during runtime (#2108) by Jiarui Fang
    • [Gemini] NFC, polish search_chunk_configuration (#2107) by Jiarui Fang
    • [Gemini] gemini use the runtime memory tracer (RMT) (#2099) by Jiarui Fang
    • [Gemini] make RuntimeMemTracer work correctly (#2096) by Jiarui Fang
    • [Gemini] remove eval in gemini unittests! (#2092) by Jiarui Fang
    • [Gemini] remove GLOBAL_MODEL_DATA_TRACER (#2091) by Jiarui Fang
    • [Gemini] remove GLOBAL_CUDA_MEM_INFO (#2090) by Jiarui Fang
    • [Gemini] use MemStats in Runtime Memory tracer (#2088) by Jiarui Fang
    • [Gemini] use MemStats to store the tracing data. Seperate it from Collector. (#2084) by Jiarui Fang
    • [Gemini] remove static tracer (#2083) by Jiarui Fang
    • [Gemini] ParamOpHook -> ColoParamOpHook (#2080) by Jiarui Fang
    • [Gemini] polish runtime tracer tests (#2077) by Jiarui Fang
    • [Gemini] rename hooks related to runtime mem tracer (#2076) by Jiarui Fang
    • [Gemini] add albert in test models. (#2075) by Jiarui Fang
    • [Gemini] rename ParamTracerWrapper -> RuntimeMemTracer (#2073) by Jiarui Fang
    • [Gemini] remove not used MemtracerWrapper (#2072) by Jiarui Fang
    • [Gemini] fix grad unreleased issue and param recovery issue (#2052) by Zihao

    Colotensor

    • [ColoTensor] throw error when ColoInitContext meets meta parameter. (#2105) by Jiarui Fang

    Autoparallel

    • [autoparallel] support linear function bias addition (#2104) by YuliangLiu0306
    • [autoparallel] support addbmm computation (#2102) by YuliangLiu0306
    • [autoparallel] add sum handler (#2101) by YuliangLiu0306
    • [autoparallel] add bias addtion function class (#2098) by YuliangLiu0306
    • [autoparallel] complete gpt related module search (#2097) by YuliangLiu0306
    • [autoparallel]add embedding handler (#2089) by YuliangLiu0306
    • [autoparallel] add tensor constructor handler (#2082) by YuliangLiu0306
    • [autoparallel] add non_split linear strategy (#2078) by YuliangLiu0306
    • [autoparallel] Add F.conv metainfo (#2069) by Boyuan Yao
    • [autoparallel] complete gpt block searching (#2065) by YuliangLiu0306
    • [autoparallel] add binary elementwise metainfo for auto parallel (#2058) by Boyuan Yao
    • [autoparallel] fix forward memory calculation (#2062) by Boyuan Yao
    • [autoparallel] adapt solver with self attention (#2037) by YuliangLiu0306

    Pipeline middleware

    • [Pipeline Middleware] fix data race in Pipeline Scheduler for DAG (#2087) by Ziyue Jiang
    • [Pipeline Middleware] Adapt scheduler for Topo (#2066) by Ziyue Jiang

    Fx

    • [fx] An experimental version of ColoTracer.' (#2002) by Super Daniel

    Example

    • [example] update GPT README (#2095) by ZijianYY

    Test

    • [test] bert test in non-distributed way (#2074) by Jiarui Fang

    Full Changelog: https://github.com/hpcaitech/ColossalAI/compare/v0.1.12...v0.1.11rc5

    Source code(tar.gz)
    Source code(zip)
  • v0.1.11rc5(Nov 30, 2022)

    What's Changed

    Release

    • [release] update to 0.1.11rc5 (#2053) by Frank Lee

    Cli

    • [cli] updated installation cheheck with more inforamtion (#2050) by Frank Lee

    Gemini

    • [gemini] fix init bugs for modules (#2047) by HELSON
    • [gemini] add arguments (#2046) by HELSON
    • [Gemini] free and allocate cuda memory by tensor.storage, add grad hook (#2040) by Zihao
    • [Gemini] more tests for Gemini (#2038) by Jiarui Fang
    • [Gemini] more rigorous unit tests for run_fwd_bwd (#2034) by Jiarui Fang
    • [Gemini] paramWrapper paramTracerHook unitest (#2030) by Zihao
    • [Gemini] patch for supporting orch.add_ function for ColoTensor (#2003) by Jiarui Fang
    • [gemini] param_trace_hook (#2020) by Zihao
    • [Gemini] add unitests to check gemini correctness (#2015) by Jiarui Fang
    • [Gemini] ParamMemHook (#2008) by Zihao
    • [Gemini] param_tracer_wrapper and test case (#2009) by Zihao

    Setup

    • [setup] supported conda-installed torch (#2048) by Frank Lee

    Test

    • [test] align model name with the file name. (#2045) by Jiarui Fang

    Hotfix

    • [hotfix] hotfix Gemini for no leaf modules bug (#2043) by Jiarui Fang
    • [hotfix] add bert test for gemini fwd bwd (#2035) by Jiarui Fang
    • [hotfix] revert bug PRs (#2016) by Jiarui Fang

    Zero

    • [zero] fix testing parameters (#2042) by HELSON
    • [zero] fix unit-tests (#2039) by HELSON
    • [zero] test gradient accumulation (#1964) by HELSON

    Testing

    • [testing] fix testing models (#2036) by HELSON

    Autoparallel

    • [autoparallel] add split handler (#2032) by YuliangLiu0306
    • [autoparallel] add experimental permute handler (#2029) by YuliangLiu0306
    • [autoparallel] add runtime pass and numerical test for view handler (#2018) by YuliangLiu0306
    • [autoparallel] add experimental view handler (#2011) by YuliangLiu0306
    • [autoparallel] mix gather (#1977) by Genghan Zhang

    Fx

    • [fx]Split partition with DAG information (#2025) by Ziyue Jiang

    Workflow

    • [workflow] removed unused pypi release workflow (#2022) by Frank Lee

    Full Changelog: https://github.com/hpcaitech/ColossalAI/compare/v0.1.11rc5...v0.1.11rc4

    Source code(tar.gz)
    Source code(zip)
  • v0.1.11rc4(Nov 23, 2022)

    What's Changed

    Workflow

    • [workflow] fixed the python and cpu arch mismatch (#2010) by Frank Lee
    • [workflow] fixed the typo in condarc (#2006) by Frank Lee
    • [workflow] added conda cache and fixed no-compilation bug in release (#2005) by Frank Lee

    Gemini

    • [Gemini] add an inline_op_module to common test models and polish unitests. (#2004) by Jiarui Fang
    • [Gemini] open grad checkpoint when model building (#1984) by Jiarui Fang
    • [Gemini] add bert for MemtracerWrapper unintests (#1982) by Jiarui Fang
    • [Gemini] MemtracerWrapper unittests (#1981) by Jiarui Fang
    • [Gemini] memory trace hook (#1978) by Jiarui Fang
    • [Gemini] independent runtime tracer (#1974) by Jiarui Fang
    • [Gemini] ZeROHookV2 -> GeminiZeROHook (#1972) by Jiarui Fang
    • [Gemini] clean no used MemTraceOp (#1970) by Jiarui Fang
    • [Gemini] polish memstats collector (#1962) by Jiarui Fang
    • [Gemini] add GeminiAdamOptimizer (#1960) by Jiarui Fang

    Autoparallel

    • [autoparallel] Add metainfo support for F.linear (#1987) by Boyuan Yao
    • [autoparallel] use pytree map style to process data (#1989) by YuliangLiu0306
    • [autoparallel] adapt handlers with attention block (#1990) by YuliangLiu0306
    • [autoparallel] support more flexible data type (#1967) by YuliangLiu0306
    • [autoparallel] add pooling metainfo (#1968) by Boyuan Yao
    • [autoparallel] support distributed dataloader option (#1906) by YuliangLiu0306
    • [autoparallel] Add alpha beta (#1973) by Genghan Zhang
    • [autoparallel] add torch.nn.ReLU metainfo (#1868) by Boyuan Yao
    • [autoparallel] support addmm in tracer and solver (#1961) by YuliangLiu0306
    • [autoparallel] remove redundancy comm node (#1893) by YuliangLiu0306

    Fx

    • [fx] add more meta_registry for MetaTensor execution. (#2000) by Super Daniel

    Hotfix

    • [hotfix] make Gemini work for conv DNN (#1998) by Jiarui Fang

    Kernel

    • [kernel] move all symlinks of kernel to colossalai._C (#1971) by ver217

    Polish

    • [polish] remove useless file _mem_tracer_hook.py (#1963) by Jiarui Fang

    Zero

    • [zero] fix memory leak for zero2 (#1955) by HELSON

    Colotensor

    • [ColoTensor] reconfig ColoInitContext, decouple default_pg and default_dist_spec. (#1953) by Jiarui Fang
    • [ColoTensor] ColoInitContext initialize parameters in shard mode. (#1937) by Jiarui Fang

    Tutorial

    • [tutorial] polish all README (#1946) by binmakeswell
    • [tutorial] added missing dummy dataloader (#1944) by Frank Lee
    • [tutorial] fixed pipeline bug for sequence parallel (#1943) by Frank Lee

    Sc

    • [SC] remove redundant hands on (#1939) by Boyuan Yao

    Full Changelog: https://github.com/hpcaitech/ColossalAI/compare/v0.1.11rc4...v0.1.11rc3

    Source code(tar.gz)
    Source code(zip)
  • v0.1.11rc3(Nov 13, 2022)

    What's Changed

    Release

    • [release] update version (#1931) by ver217

    Tutorial

    • [tutorial] polish README and OPT files (#1930) by binmakeswell
    • [tutorial] add synthetic dataset for opt (#1924) by ver217
    • [tutorial] updated hybrid parallel readme (#1928) by Frank Lee
    • [tutorial] added synthetic data for sequence parallel (#1927) by Frank Lee
    • [tutorial] removed huggingface model warning (#1925) by Frank Lee
    • Hotfix/tutorial readme index (#1922) by Frank Lee
    • [tutorial] modify hands-on of auto activation checkpoint (#1920) by Boyuan Yao
    • [tutorial] added synthetic data for hybrid parallel (#1921) by Frank Lee
    • [tutorial] added synthetic data for hybrid parallel (#1919) by Frank Lee
    • [tutorial] added synthetic dataset for auto parallel demo (#1918) by Frank Lee
    • [tutorial] updated auto parallel demo with latest data path (#1917) by Frank Lee
    • [tutorial] added data script and updated readme (#1916) by Frank Lee
    • [tutorial] add cifar10 for diffusion (#1907) by binmakeswell
    • [tutorial] removed duplicated tutorials (#1904) by Frank Lee
    • [tutorial] edited hands-on practices (#1899) by BoxiangW

    Example

    • [example] update auto_parallel img path (#1910) by binmakeswell
    • [example] add cifar10 dadaset for diffusion (#1902) by Fazzie-Maqianli
    • [example] migrate diffusion and auto_parallel hands-on (#1871) by binmakeswell
    • [example] initialize tutorial (#1865) by binmakeswell
    • Merge pull request #1842 from feifeibear/jiarui/polish by Fazzie-Maqianli
    • [example] polish diffusion readme by jiaruifang

    Sc

    • [SC] add GPT example for auto checkpoint (#1889) by Boyuan Yao
    • [sc] add examples for auto checkpoint. (#1880) by Super Daniel

    Nfc

    • [NFC] polish colossalai/amp/naive_amp/init.py code style (#1905) by Junming Wu
    • [NFC] remove redundant dependency (#1869) by binmakeswell
    • [NFC] polish .github/workflows/scripts/build_colossalai_wheel.py code style (#1856) by yuxuan-lou
    • [NFC] polish .github/workflows/scripts/generate_release_draft.py code style (#1855) by Ofey Chan
    • [NFC] polish workflows code style (#1854) by Kai Wang (Victor Kai)
    • [NFC] polish colossalai/amp/apex_amp/init.py code style (#1853) by LuGY
    • [NFC] polish .readthedocs.yaml code style (#1852) by nuszzh
    • [NFC] polish <.github/workflows/release_nightly.yml> code style (#1851) by RichardoLuo
    • [NFC] polish amp.naive_amp.grad_scaler code style by zbian
    • [NFC] polish colossalai/auto_parallel/tensor_shard/deprecated/op_handler/operator_handler.py code style (#1845) by HELSON
    • [NFC] polish ./colossalai/amp/torch_amp/init.py code style (#1836) by Genghan Zhang
    • [NFC] polish .github/workflows/build.yml code style (#1837) by xyupeng
    • [NFC] polish colossalai/auto_parallel/tensor_shard/deprecated/op_handler/conv_handler.py code style (#1829) by Sze-qq
    • [NFC] polish colossalai/amp/torch_amp/_grad_scaler.py code style (#1823) by Ziyue Jiang
    • [NFC] polish .github/workflows/release_docker.yml code style by Maruyama_Aya
    • [NFC] polish .github/workflows/submodule.yml code style (#1822) by shenggan
    • [NFC] polish .github/workflows/draft_github_release_post.yml code style (#1820) by Arsmart1
    • [NFC] polish colossalai/amp/naive_amp/_fp16_optimizer.py code style (#1819) by Fazzie-Maqianli
    • [NFC] polish colossalai/amp/naive_amp/_utils.py code style (#1816) by CsRic
    • [NFC] polish .github/workflows/build_gpu_8.yml code style (#1813) by Zangwei Zheng
    • [NFC] polish MANIFEST.in code style (#1814) by Zirui Zhu
    • [NFC] polish strategies_constructor.py code style (#1806) by binmakeswell

    Zero

    • [zero] migrate zero1&2 (#1878) by HELSON

    Autoparallel

    • [autoparallel] user-friendly API for CheckpointSolver. (#1879) by Super Daniel
    • [autoparallel] fix linear logical convert issue (#1857) by YuliangLiu0306

    Hotfix

    • [hotfix] pass test_complete_workflow (#1877) by Jiarui Fang

    Inference

    • [inference] overlap comm and compute in Linear1D_Row when stream_chunk_num > 1 (#1876) by Jiarui Fang
    • [inference] streaming Linear 1D Row inference (#1874) by Jiarui Fang

    Amp

    • [amp] add torch amp test (#1860) by xcnick

    Diffusion

    • [diffusion] fix package conflicts (#1875) by HELSON

    Utils

    • [utils] fixed lazy init context (#1867) by Frank Lee
    • [utils] remove lazy_memory_allocate from ColoInitContext (#1844) by Jiarui Fang

    Full Changelog: https://github.com/hpcaitech/ColossalAI/compare/v0.1.11rc3...v0.1.11rc2

    Source code(tar.gz)
    Source code(zip)
  • v0.1.11rc2(Nov 8, 2022)

    What's Changed

    Autoparallel

    • [autoparallel] fix bugs caused by negative dim key (#1808) by YuliangLiu0306
    • [autoparallel] fix bias addition module (#1800) by YuliangLiu0306
    • [autoparallel] add batch norm metainfo (#1815) by Boyuan Yao
    • [autoparallel] add conv metainfo class for auto parallel (#1796) by Boyuan Yao
    • [autoparallel]add essential CommActions for broadcast oprands (#1793) by YuliangLiu0306
    • [autoparallel] refactor and add rotorc. (#1789) by Super Daniel
    • [autoparallel] add getattr handler (#1767) by YuliangLiu0306
    • [autoparallel] added matmul handler (#1763) by Frank Lee
    • [autoparallel] fix conv handler numerical test (#1771) by YuliangLiu0306
    • [autoparallel] move ckpt solvers to autoparallel folder / refactor code (#1764) by Super Daniel
    • [autoparallel] add numerical test for handlers (#1769) by YuliangLiu0306
    • [autoparallel] update CommSpec to CommActions (#1768) by YuliangLiu0306
    • [autoparallel] add numerical test for node strategies (#1760) by YuliangLiu0306
    • [autoparallel] refactor the runtime apply pass and add docstring to passes (#1757) by YuliangLiu0306
    • [autoparallel] added binary elementwise node handler (#1758) by Frank Lee
    • [autoparallel] fix param hook issue in transform pass (#1755) by YuliangLiu0306
    • [autoparallel] added addbmm handler (#1751) by Frank Lee
    • [autoparallel] shard param and buffer as expected (#1753) by YuliangLiu0306
    • [autoparallel] add sequential order to communication actions (#1735) by YuliangLiu0306
    • [autoparallel] recovered skipped test cases (#1748) by Frank Lee
    • [autoparallel] fixed wrong sharding strategy in conv handler (#1747) by Frank Lee
    • [autoparallel] fixed wrong generated strategy for dot op (#1746) by Frank Lee
    • [autoparallel] handled illegal sharding strategy in shape consistency (#1744) by Frank Lee
    • [autoparallel] handled illegal strategy in node handler (#1743) by Frank Lee
    • [autoparallel] handled illegal sharding strategy (#1728) by Frank Lee

    Gemini

    • [Gemini] make gemini usage simple (#1821) by Jiarui Fang

    Checkpointio

    • [CheckpointIO] a uniform checkpoint I/O module (#1689) by ver217

    Example

    • [example] remove useless readme in diffusion (#1831) by Jiarui Fang
    • [example] add TP to GPT example (#1828) by Jiarui Fang
    • [example] add stable diffuser (#1825) by Fazzie-Maqianli
    • [example] simplify the GPT2 huggingface example (#1826) by Jiarui Fang
    • [example] opt does not depend on Titans (#1811) by Jiarui Fang
    • [example] add GPT by Jiarui Fang
    • [example] add opt model in lauguage (#1809) by Jiarui Fang
    • [example] add diffusion to example (#1805) by Jiarui Fang

    Nfc

    • [NFC] update gitignore remove DS_Store (#1830) by Jiarui Fang
    • [NFC] polish type hint for shape consistency (#1801) by Jiarui Fang
    • [NFC] polish tests/test_layers/test_3d/test_3d.py code style (#1740) by Ziheng Qin
    • [NFC] polish tests/test_layers/test_3d/checks_3d/common.py code style (#1733) by lucasliunju
    • [NFC] polish colossalai/nn/metric/_utils.py code style (#1727) by Sze-qq
    • [NFC] polish tests/test_layers/test_3d/checks_3d/check_layer_3d.py code style (#1731) by Xue Fuzhao
    • [NFC] polish tests/test_layers/test_sequence/checks_seq/check_layer_seq.py code style (#1723) by xyupeng
    • [NFC] polish accuracy_2d.py code style (#1719) by Ofey Chan
    • [NFC] polish .github/workflows/scripts/build_colossalai_wheel.py code style (#1721) by Arsmart1
    • [NFC] polish _checkpoint_hook.py code style (#1722) by LuGY
    • [NFC] polish test_2p5d/checks_2p5d/check_operation_2p5d.py code style (#1718) by Kai Wang (Victor Kai)
    • [NFC] polish colossalai/zero/sharded_param/init.py code style (#1717) by CsRic
    • [NFC] polish colossalai/nn/lr_scheduler/linear.py code style (#1716) by yuxuan-lou
    • [NFC] polish tests/test_layers/test_2d/checks_2d/check_operation_2d.py code style (#1715) by binmakeswell
    • [NFC] polish colossalai/nn/metric/accuracy_2p5d.py code style (#1714) by shenggan

    Fx

    • [fx] add a symbolic_trace api. (#1812) by Super Daniel
    • [fx] skip diffusers unitest if it is not installed (#1799) by Jiarui Fang
    • [fx] Add linear metainfo class for auto parallel (#1783) by Boyuan Yao
    • [fx] support module with bias addition (#1780) by YuliangLiu0306
    • [fx] refactor memory utils and extend shard utils. (#1754) by Super Daniel
    • [fx] test tracer on diffuser modules. (#1750) by Super Daniel

    Hotfix

    • [hotfix] fix build error when torch version >= 1.13 (#1803) by xcnick
    • [hotfix] polish flash attention (#1802) by oahzxl
    • [hotfix] fix zero's incompatibility with checkpoint in torch-1.12 (#1786) by HELSON
    • [hotfix] polish chunk import (#1787) by Jiarui Fang
    • [hotfix] autoparallel unit test (#1752) by YuliangLiu0306

    Pipeline

    • [Pipeline]Adapt to Pipelinable OPT (#1782) by Ziyue Jiang

    Compatibility

    • [compatibility] ChunkMgr import error (#1772) by Jiarui Fang

    Feat

    • [feat] add flash attention (#1762) by oahzxl

    Fx/profiler

    • [fx/profiler] debug the fx.profiler / add an example test script for fx.profiler (#1730) by Super Daniel

    Workflow

    • [workflow] handled the git directory ownership error (#1741) by Frank Lee

    Full Changelog: https://github.com/hpcaitech/ColossalAI/compare/v0.1.11rc2...v0.1.11rc1

    Source code(tar.gz)
    Source code(zip)
  • v0.1.11rc1(Oct 19, 2022)

    What's Changed

    Release

    • [release] update to v0.1.11 (#1736) by Frank Lee

    Doc

    • [doc] update recommendation system catalogue (#1732) by binmakeswell
    • [doc] update recommedation system urls (#1725) by Jiarui Fang

    Zero

    • [zero] add chunk init function for users (#1729) by HELSON
    • [zero] add constant placement policy (#1705) by HELSON

    Pre-commit

    • [pre-commit] update pre-commit (#1726) by HELSON

    Autoparallel

    • [autoparallel] runtime_backward_apply (#1720) by YuliangLiu0306
    • [autoparallel] moved tests to test_tensor_shard (#1713) by Frank Lee
    • [autoparallel] resnet block runtime apply (#1709) by YuliangLiu0306
    • [autoparallel] fixed broken node handler tests (#1708) by Frank Lee
    • [autoparallel] refactored the autoparallel module for organization (#1706) by Frank Lee
    • [autoparallel] adapt runtime passes (#1703) by YuliangLiu0306
    • [autoparallel] collated all deprecated files (#1700) by Frank Lee
    • [autoparallel] init new folder structure (#1696) by Frank Lee
    • [autoparallel] adapt solver and CostGraph with new handler (#1695) by YuliangLiu0306
    • [autoparallel] add output handler and placeholder handler (#1694) by YuliangLiu0306
    • [autoparallel] add pooling handler (#1690) by YuliangLiu0306
    • [autoparallel] where_handler_v2 (#1688) by YuliangLiu0306
    • [autoparallel] fix C version rotor inconsistency (#1691) by Boyuan Yao
    • [autoparallel] added sharding spec conversion for linear handler (#1687) by Frank Lee
    • [autoparallel] add reshape handler v2 and fix some previous bug (#1683) by YuliangLiu0306
    • [autoparallel] add unary element wise handler v2 (#1674) by YuliangLiu0306
    • [autoparallel] add following node generator (#1673) by YuliangLiu0306
    • [autoparallel] add layer norm handler v2 (#1671) by YuliangLiu0306
    • [autoparallel] fix insecure subprocess (#1680) by Boyuan Yao
    • [autoparallel] add rotor C version (#1658) by Boyuan Yao
    • [autoparallel] added utils for broadcast operation (#1665) by Frank Lee
    • [autoparallel] update CommSpec (#1667) by YuliangLiu0306
    • [autoparallel] added bias comm spec to matmul strategy (#1664) by Frank Lee
    • [autoparallel] add batch norm handler v2 (#1666) by YuliangLiu0306
    • [autoparallel] remove no strategy nodes (#1652) by YuliangLiu0306
    • [autoparallel] added compute resharding costs for node handler (#1662) by Frank Lee
    • [autoparallel] added new strategy constructor template (#1661) by Frank Lee
    • [autoparallel] added node handler for bmm (#1655) by Frank Lee
    • [autoparallel] add conv handler v2 (#1663) by YuliangLiu0306
    • [autoparallel] adapt solver with gpt (#1653) by YuliangLiu0306
    • [autoparallel] implemented all matmul strategy generator (#1650) by Frank Lee
    • [autoparallel] change the following nodes strategies generation logic (#1636) by YuliangLiu0306
    • [autoparallel] where handler (#1651) by YuliangLiu0306
    • [autoparallel] implemented linear projection strategy generator (#1639) by Frank Lee
    • [autoparallel] adapt solver with mlp (#1638) by YuliangLiu0306
    • [autoparallel] Add pofo sequence annotation (#1637) by Boyuan Yao
    • [autoparallel] add elementwise handler (#1622) by YuliangLiu0306
    • [autoparallel] add embedding handler (#1620) by YuliangLiu0306
    • [autoparallel] protect bcast handler from invalid strategies (#1631) by YuliangLiu0306
    • [autoparallel] add layernorm handler (#1629) by YuliangLiu0306
    • [autoparallel] recover the merged node strategy index (#1613) by YuliangLiu0306
    • [autoparallel] added new linear module handler (#1616) by Frank Lee
    • [autoparallel] added new node handler (#1612) by Frank Lee
    • [autoparallel]add bcast matmul strategies (#1605) by YuliangLiu0306
    • [autoparallel] refactored the data structure for sharding strategy (#1610) by Frank Lee
    • [autoparallel] add bcast op handler (#1600) by YuliangLiu0306
    • [autoparallel] added all non-bcast matmul strategies (#1603) by Frank Lee
    • [autoparallel] added strategy generator and bmm strategies (#1602) by Frank Lee
    • [autoparallel] add reshape handler (#1594) by YuliangLiu0306
    • [autoparallel] refactored shape consistency to remove redundancy (#1591) by Frank Lee
    • [autoparallel] add resnet autoparallel unit test and add backward weight communication cost (#1589) by YuliangLiu0306
    • [autoparallel] added generate_sharding_spec to utils (#1590) by Frank Lee
    • [autoparallel] added solver option dataclass (#1588) by Frank Lee
    • [autoparallel] adapt solver with resnet (#1583) by YuliangLiu0306

    Fx/meta/rpc

    • [fx/meta/rpc] move _meta_registration.py to fx folder / register fx functions with compatibility checks / remove color debug (#1710) by Super Daniel

    Embeddings

    • [embeddings] add doc in readme (#1711) by Jiarui Fang
    • [embeddings] more detailed timer (#1692) by Jiarui Fang
    • [embeddings] cache option (#1635) by Jiarui Fang
    • [embeddings] use cache_ratio instead of cuda_row_num (#1611) by Jiarui Fang
    • [embeddings] add already_split_along_rank flag for tablewise mode (#1584) by CsRic

    Unittest

    • [unittest] added doc for the pytest wrapper (#1704) by Frank Lee
    • [unittest] supported condititonal testing based on env var (#1701) by Frank Lee

    Embedding

    • [embedding] rename FreqAwareEmbedding -> CachedEmbedding (#1699) by Jiarui Fang
    • [embedding] polish async copy (#1657) by Jiarui Fang
    • [embedding] add more detail profiling (#1656) by Jiarui Fang
    • [embedding] print profiling results (#1654) by Jiarui Fang
    • [embedding] non-blocking cpu-gpu copy (#1647) by Jiarui Fang
    • [embedding] isolate cache_op from forward (#1645) by CsRic
    • [embedding] rollback for better FAW performance (#1625) by Jiarui Fang
    • [embedding] updates some default parameters by Jiarui Fang

    Fx/profiler

    • [fx/profiler] assigned UUID to each unrecorded tensor/ improved performance on GPT-2 (#1679) by Super Daniel
    • [fx/profiler] provide a table of summary. (#1634) by Super Daniel
    • [fx/profiler] tuned the calculation of memory estimation (#1619) by Super Daniel

    Pipeline/fix-bug

    • [pipeline/fix-bug] num_microbatches support any integrate | stable chimera | launch tool for rpc pp framework (#1684) by Kirigaya Kazuto

    Pipeline/rank_recorder

    • [pipeline/rank_recorder] fix bug when process data before backward | add a tool for multiple ranks debug (#1681) by Kirigaya Kazuto

    Feature

    • [feature] A new ZeRO implementation (#1644) by HELSON
    • Revert "[feature] new zero implementation (#1623)" (#1643) by Jiarui Fang
    • [feature] new zero implementation (#1623) by HELSON

    Fx

    • [fx] Add concrete info prop (#1677) by Boyuan Yao
    • [fx] refactor code for profiler / enable fake tensor movement. (#1646) by Super Daniel
    • [fx] fix offload codegen test (#1648) by Boyuan Yao
    • [fx] Modify offload codegen (#1618) by Boyuan Yao
    • [fx] PoC of runtime shape consistency application (#1607) by YuliangLiu0306
    • [fx] Add pofo solver (#1608) by Boyuan Yao
    • [fx] Add offload codegen (#1598) by Boyuan Yao
    • [fx] provide an accurate estimation of memory. (#1587) by Super Daniel
    • [fx] Improve linearize and rotor solver (#1586) by Boyuan Yao
    • [fx] Add nested checkpoint in activation checkpoint codegen (#1585) by Boyuan Yao

    Pipeline/pytree

    • [pipeline/pytree] add pytree to process args and kwargs | provide data_process_func to process args and kwargs after forward (#1642) by Kirigaya Kazuto

    Fix

    • [fix] fixed the collective pattern name for consistency (#1649) by Frank Lee

    Moe

    • [moe] initialize MoE groups by ProcessGroup (#1640) by HELSON
    • [moe] fix moe bugs (#1633) by HELSON
    • [moe] fix MoE bugs (#1628) by HELSON

    Pipeline/chimera

    • [pipeline/chimera] test chimera | fix bug of initializing (#1615) by Kirigaya Kazuto
    • [pipeline/chimera] reconstruct PipelineBase and Worker to support more feasible custom schedule | finish Chimera (#1595) by Kirigaya Kazuto

    Workflow

    • [workflow] deactivate conda environment before removing (#1606) by Frank Lee

    Fx/tuning

    • [fx/tuning] tune performance on rotor with meta info. (#1599) by Super Daniel

    Nfc

    • [NFC] add OPT serving (#1581) by binmakeswell
    • [NFC] polish ./colossalai/trainer/hooks/_lr_scheduler_hook.py code style (#1576) by Boyuan Yao
    • [NFC] polish colossalai/zero/sharded_model/reduce_scatter.py code style (#1554) by Fazzie-Maqianli
    • [NFC] polish utils/tensor_detector/init.py code style (#1573) by CsRic
    • [NFC] polish colossalai/nn/lr_scheduler/multistep.py code style (#1572) by Sze-qq
    • [NFC] polish colossalai/nn/lr_scheduler/torch.py code style (#1571) by superhao1995
    • [NFC] polish colossalai/nn/parallel/data_parallel.py code style (#1570) by Jiatong Han
    • [NFC] polish colossalai/pipeline/utils.py code style (#1562) by Zirui Zhu
    • [NFC] polish colossalai/fx/tracer/meta_patch/patched_module/convolution.py code style (#1563) by Xue Fuzhao
    • [NFC] polish colossalai/gemini/update/chunkv2.py code style (#1565) by Zangwei Zheng
    • [NFC] polish colossalai/nn/layer/colossalai_layer/dropout.py code style (#1568) by DouJS
    • [NFC] polish colossalai/utils/tensor_detector/tensor_detector.py code style (#1566) by LuGY
    • [NFC] polish colossalai/nn/_ops/embedding.py code style (#1561) by BigOneLiXiaoMing
    • [NFC] polish colossalai/builder/init.py code style (#1560) by Ziheng Qin
    • [NFC] polish colossalai/testing/comparison.py code style. (#1558) by Super Daniel
    • [NFC] polish colossalai/nn/layer/colossalai_layer/linear.py (#1556) by Ofey Chan
    • [NFC] polish code colossalai/gemini/update/search_utils.py (#1557) by Kai Wang (Victor Kai)
    • [NFC] polish colossalai/nn/_ops/layernorm.py code style (#1555) by yuxuan-lou
    • [NFC] polish colossalai/nn/loss/loss_2p5d.py code style (#1553) by shenggan
    • [NFC] polish colossalai/nn/_ops/embedding_bag.py code style (#1552) by Maruyama_Aya
    • [NFC] polish colossalai/nn/lr_scheduler/cosine.py code style by binmakeswell
    • [NFC] polish colossalai/utils/multi_tensor_apply/multi_tensor_apply.py code style (#1559) by Kirigaya Kazuto

    Full Changelog: https://github.com/hpcaitech/ColossalAI/compare/v0.1.11rc1...v0.1.10

    Source code(tar.gz)
    Source code(zip)
  • v0.1.10(Sep 8, 2022)

    What's Changed

    Embedding

    • [embedding] cache_embedding small improvement (#1564) by CsRic
    • [embedding] polish parallel embedding tablewise (#1545) by Jiarui Fang
    • [embedding] freq_aware_embedding: add small functions for caller application (#1537) by CsRic
    • [embedding] fix a bug in table wise sharding (#1538) by Jiarui Fang
    • [embedding] tablewise sharding polish (#1535) by Jiarui Fang
    • [embedding] add tablewise sharding for FAW (#1526) by CsRic

    Pipeline/tuning

    • [pipeline/tuning] improve dispatch performance both time and space cost (#1544) by Kirigaya Kazuto

    Fx

    • [fx] provide a stable but not accurate enough version of profiler. (#1547) by Super Daniel
    • [fx] Add common node in model linearize (#1542) by Boyuan Yao
    • [fx] support meta tracing for aten level computation graphs like functorch. (#1536) by Super Daniel
    • [fx] Modify solver linearize and add corresponding test (#1531) by Boyuan Yao
    • [fx] add test for meta tensor. (#1527) by Super Daniel
    • [fx]patch nn.functional convolution (#1528) by YuliangLiu0306
    • [fx] Fix wrong index in annotation and minimal flops in ckpt solver (#1521) by Boyuan Yao
    • [fx] hack torch_dispatch for meta tensor and autograd. (#1515) by Super Daniel
    • [fx] Fix activation codegen dealing with checkpointing first op (#1510) by Boyuan Yao
    • [fx] fix the discretize bug (#1506) by Boyuan Yao
    • [fx] fix wrong variable name in solver rotor (#1502) by Boyuan Yao
    • [fx] Add activation checkpoint solver rotor (#1496) by Boyuan Yao
    • [fx] add more op patches for profiler and error message for unsupported ops. (#1495) by Super Daniel
    • [fx] fixed adapative pooling size concatenation error (#1489) by Frank Lee
    • [fx] add profiler for fx nodes. (#1480) by Super Daniel
    • [fx] Fix ckpt functions' definitions in forward (#1476) by Boyuan Yao
    • [fx] fix MetaInfoProp for incorrect calculations and add detections for inplace op. (#1466) by Super Daniel
    • [fx] add rules to linearize computation graphs for searching. (#1461) by Super Daniel
    • [fx] Add use_reentrant=False to checkpoint in codegen (#1463) by Boyuan Yao
    • [fx] fix test and algorithm bugs in activation checkpointing. (#1451) by Super Daniel
    • [fx] Use colossalai checkpoint and add offload recognition in codegen (#1439) by Boyuan Yao
    • [fx] fix the false interpretation of algorithm 3 in https://arxiv.org/abs/1604.06174. (#1446) by Super Daniel

    Autoparallel

    • [autoparallel]add backward cost info into strategies (#1524) by YuliangLiu0306
    • [autoparallel] support fucntion in operator handler (#1529) by YuliangLiu0306
    • [autoparallel] change the merge node logic (#1533) by YuliangLiu0306
    • [autoparallel] added liveness analysis (#1516) by Frank Lee
    • [autoparallel] add more sharding strategies to conv (#1487) by YuliangLiu0306
    • [autoparallel] add cost graph class (#1481) by YuliangLiu0306
    • [autoparallel] added namespace constraints (#1490) by Frank Lee
    • [autoparallel] integrate auto parallel with torch fx (#1479) by Frank Lee
    • [autoparallel] added dot handler (#1475) by Frank Lee
    • [autoparallel] introduced baseclass for op handler and reduced code redundancy (#1471) by Frank Lee
    • [autoparallel] standardize the code structure (#1469) by Frank Lee
    • [autoparallel] Add conv handler to generate strategies and costs info for conv (#1467) by YuliangLiu0306

    Utils

    • [utils] refactor parallel layers checkpoint and bcast model on loading checkpoint (#1548) by ver217
    • [utils] optimize partition_tensor_parallel_state_dict (#1546) by ver217
    • [utils] Add use_reetrant=False in utils.activation_checkpoint (#1460) by Boyuan Yao
    • [utils] Impl clip_grad_norm for ColoTensor and ZeroOptimizer (#1442) by ver217

    Hotfix

    • [hotfix] change namespace for meta_trace. (#1541) by Super Daniel
    • [hotfix] fix init context (#1543) by ver217
    • [hotfix] avoid conflict of meta registry with torch 1.13.0. (#1530) by Super Daniel
    • [hotfix] fix coloproxy typos. (#1519) by Super Daniel

    Pipeline/pipleline_process_group

    • [pipeline/pipleline_process_group] finish PipelineProcessGroup to manage local abd global rank in TP,DP and PP (#1508) by Kirigaya Kazuto

    Doc

    • [doc] docstring for FreqAwareEmbeddingBag (#1525) by Jiarui Fang
    • [doc] update readme with the new xTrimoMultimer project (#1477) by Sze-qq
    • [doc] update docstring in ProcessGroup (#1468) by Jiarui Fang
    • [Doc] add more doc for ColoTensor. (#1458) by Jiarui Fang

    Faw

    • [FAW] cpu caching operations (#1520) by Jiarui Fang
    • [FAW] refactor reorder() for CachedParamMgr (#1514) by Jiarui Fang
    • [FAW] LFU initialize with dataset freq (#1513) by Jiarui Fang
    • [FAW] shrink freq_cnter size (#1509) by CsRic
    • [FAW] remove code related to chunk (#1501) by Jiarui Fang
    • [FAW] add more docs and fix a warning (#1500) by Jiarui Fang
    • [FAW] FAW embedding use LRU as eviction strategy intialized with dataset stats (#1494) by CsRic
    • [FAW] LFU cache for the FAW by CsRic
    • [FAW] init an LFU implementation for FAW (#1488) by Jiarui Fang
    • [FAW] reorganize the inheritance struct of FreqCacheEmbedding (#1448) by Geng Zhang

    Pipeline/rpc

    • [pipeline/rpc] update outstanding mechanism | optimize dispatching strategy (#1497) by Kirigaya Kazuto
    • [pipeline/rpc] implement distributed optimizer | test with assert_close (#1486) by Kirigaya Kazuto
    • [pipeline/rpc] support interleaving | fix checkpoint bug | change logic when dispatch data in work_list to ensure steady 1F1B (#1483) by Kirigaya Kazuto
    • [pipeline/rpc] implement a demo for PP with cuda rpc framework (#1470) by Kirigaya Kazuto

    Tensor

    • [tensor]add 1D device mesh (#1492) by YuliangLiu0306
    • [tensor] support runtime ShardingSpec apply (#1453) by YuliangLiu0306
    • [tensor] shape consistency generate transform path and communication cost (#1435) by YuliangLiu0306
    • [tensor] added linear implementation for the new sharding spec (#1416) by Frank Lee

    Fce

    • [FCE] update interface for frequency statistics in FreqCacheEmbedding (#1462) by Geng Zhang

    Workflow

    • [workflow] added TensorNVMe to compatibility test (#1449) by Frank Lee

    Test

    • [test] fixed the activation codegen test (#1447) by Frank Lee

    Engin/schedule

    • [engin/schedule] use p2p_v2 to recontruct pipeline_schedule (#1408) by Kirigaya Kazuto

    Full Changelog: https://github.com/hpcaitech/ColossalAI/compare/v0.1.10...v0.1.9

    Source code(tar.gz)
    Source code(zip)
  • v0.1.9(Aug 11, 2022)

    What's Changed

    Zero

    • [zero] add chunk_managerV2 for all-gather chunk (#1441) by HELSON
    • [zero] add chunk size searching algorithm for parameters in different groups (#1436) by HELSON
    • [zero] add has_inf_or_nan in AgChunk; enhance the unit test of AgChunk (#1426) by HELSON
    • [zero] add unit test for AgChunk's append, close, access (#1423) by HELSON
    • [zero] add AgChunk (#1417) by HELSON
    • [zero] ZeroDDP supports controlling outputs' dtype (#1399) by ver217
    • [zero] alleviate memory usage in ZeRODDP state_dict (#1398) by HELSON
    • [zero] chunk manager allows filtering ex-large params (#1393) by ver217
    • [zero] zero optim state_dict takes only_rank_0 (#1384) by ver217

    Fx

    • [fx] add vanilla activation checkpoint search with test on resnet and densenet (#1433) by Super Daniel
    • [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages (#1425) by Super Daniel
    • [fx] fixed torchaudio conformer tracing (#1392) by Frank Lee
    • [fx] patched torch.max and data movement operator (#1391) by Frank Lee
    • [fx] fixed indentation error in checkpointing codegen (#1385) by Frank Lee
    • [fx] patched torch.full for huggingface opt (#1386) by Frank Lee
    • [fx] update split module pass and add customized policy (#1373) by YuliangLiu0306
    • [fx] add torchaudio test (#1369) by Super Daniel
    • [fx] Add colotracer compatibility test on torchrec (#1370) by Boyuan Yao
    • [fx]add gpt2 passes for pipeline performance test (#1366) by YuliangLiu0306
    • [fx] added activation checkpoint codegen support for torch < 1.12 (#1359) by Frank Lee
    • [fx] added activation checkpoint codegen (#1355) by Frank Lee
    • [fx] fixed apex normalization patch exception (#1352) by Frank Lee
    • [fx] added activation checkpointing annotation (#1349) by Frank Lee
    • [fx] update MetaInforProp pass to process more complex node.meta (#1344) by YuliangLiu0306
    • [fx] refactor tracer to trace complete graph (#1342) by YuliangLiu0306
    • [fx] tested the complete workflow for auto-parallel (#1336) by Frank Lee
    • [fx]refactor tracer (#1335) by YuliangLiu0306
    • [fx] recovered skipped pipeline tests (#1338) by Frank Lee
    • [fx] fixed compatiblity issue with torch 1.10 (#1331) by Frank Lee
    • [fx] fixed unit tests for torch 1.12 (#1327) by Frank Lee
    • [fx] add balanced policy v2 (#1251) by YuliangLiu0306
    • [fx] Add unit test and fix bugs for transform_mlp_pass (#1299) by XYE
    • [fx] added apex normalization to patched modules (#1300) by Frank Lee

    Recommendation System

    • [FAW] export FAW in _ops (#1438) by Jiarui Fang
    • [FAW] move coloparam setting in test code. (#1429) by Jiarui Fang
    • [FAW] parallel FreqAwareEmbedding (#1424) by Jiarui Fang
    • [FAW] add cache manager for the cached embedding (#1419) by Jiarui Fang

    Global Tensor

    • [tensor] add shape consistency feature to support auto spec transform (#1418) by YuliangLiu0306
    • [tensor]build sharding spec to replace distspec in future. (#1405) by YuliangLiu0306

    Hotfix

    • [hotfix] zero optim prevents calling inner optim.zero_grad (#1422) by ver217
    • [hotfix] fix CPUAdam kernel nullptr (#1410) by ver217
    • [hotfix] adapt ProcessGroup and Optimizer to ColoTensor (#1388) by HELSON
    • [hotfix] fix a running error in test_colo_checkpoint.py (#1387) by HELSON
    • [hotfix] fix some bugs during gpt2 testing (#1379) by YuliangLiu0306
    • [hotfix] fix zero optim save/load state dict (#1381) by ver217
    • [hotfix] fix zero ddp buffer cast (#1376) by ver217
    • [hotfix] fix no optimizer in save/load (#1363) by HELSON
    • [hotfix] fix megatron_init in test_gpt2.py (#1357) by HELSON
    • [hotfix] ZeroDDP use new process group (#1333) by ver217
    • [hotfix] shared model returns cpu state_dict (#1328) by ver217
    • [hotfix] fix ddp for unit test test_gpt2 (#1326) by HELSON
    • [hotfix] fix unit test test_module_spec (#1321) by HELSON
    • [hotfix] fix PipelineSharedModuleGradientHandler (#1314) by ver217
    • [hotfix] fix ColoTensor GPT2 unitest (#1309) by HELSON
    • [hotfix] add missing file (#1308) by Jiarui Fang
    • [hotfix] remove potiential circle import (#1307) by Jiarui Fang
    • [hotfix] skip some unittest due to CI environment. (#1301) by YuliangLiu0306
    • [hotfix] fix shape error in backward when using ColoTensor (#1298) by HELSON
    • [hotfix] Dist Mgr gather torch version (#1284) by Jiarui Fang

    Communication

    • [communication] add p2p_v2.py to support communication with List[Any] (#1407) by Kirigaya Kazuto

    Device

    • [device] add DeviceMesh class to support logical device layout (#1394) by YuliangLiu0306

    Chunk

    • [chunk] add PG check for tensor appending (#1383) by Jiarui Fang

    DDP

    • [DDP] test ddp state dict uses more strict threshold (#1382) by ver217

    Checkpoint

    • [checkpoint] add kwargs for load_state_dict (#1374) by HELSON
    • [checkpoint] use args, kwargs in save_checkpoint, load_checkpoint (#1368) by HELSON
    • [checkpoint] sharded optim save/load grad scaler (#1350) by ver217
    • [checkpoint] use gather_tensor in checkpoint and update its unit test (#1339) by HELSON
    • [checkpoint] add ColoOptimizer checkpointing (#1316) by Jiarui Fang
    • [checkpoint] add test for bert and hotfix save bugs (#1297) by Jiarui Fang

    Util

    • [util] standard checkpoint function naming (#1377) by Frank Lee

    Nvme

    • [nvme] CPUAdam and HybridAdam support NVMe offload (#1360) by ver217

    Colotensor

    • [colotensor] use cpu memory to store state_dict (#1367) by HELSON
    • [colotensor] add Tensor.view op and its unit test (#1343) by HELSON

    Unit test

    • [unit test] add megatron init test in zero_optim (#1358) by HELSON

    Docker

    • [docker] add tensornvme in docker (#1354) by ver217

    Doc

    • [doc] update rst and docstring (#1351) by ver217

    Refactor

    • [refactor] refactor ColoTensor's unit tests (#1340) by HELSON

    Workflow

    • [workflow] update docker build workflow to use proxy (#1334) by Frank Lee
    • [workflow] update 8-gpu test to use torch 1.11 (#1332) by Frank Lee
    • [workflow] roll back to use torch 1.11 for unit testing (#1325) by Frank Lee
    • [workflow] fixed trigger condition for 8-gpu unit test (#1323) by Frank Lee
    • [workflow] updated release bdist workflow (#1318) by Frank Lee
    • [workflow] disable SHM for compatibility CI on rtx3080 (#1315) by Frank Lee
    • [workflow] updated pytorch compatibility test (#1311) by Frank Lee

    Test

    • [test] removed outdated unit test for meta context (#1329) by Frank Lee

    Utils

    • [utils] integrated colotensor with lazy init context (#1324) by Frank Lee

    Optimizer

    • [Optimizer] Remove useless ColoOptimizer (#1312) by Jiarui Fang
    • [Optimizer] polish the init method of ColoOptimizer (#1310) by Jiarui Fang

    Full Changelog: https://github.com/hpcaitech/ColossalAI/compare/v0.1.9...v0.1.8

    Source code(tar.gz)
    Source code(zip)
  • v0.1.8(Jul 12, 2022)

    What's Changed

    Hotfix

    • [hotfix] torchvison fx unittests miss import pytest (#1277) by Jiarui Fang
    • [hotfix] fix an assertion bug in base schedule. (#1250) by YuliangLiu0306
    • [hotfix] fix sharded optim step and clip_grad_norm (#1226) by ver217
    • [hotfix] fx get comm size bugs (#1233) by Jiarui Fang
    • [hotfix] fx shard 1d pass bug fixing (#1220) by Jiarui Fang
    • [hotfix]fixed p2p process send stuck (#1181) by YuliangLiu0306
    • [hotfix]different overflow status lead to communication stuck. (#1175) by YuliangLiu0306
    • [hotfix]fix some bugs caused by refactored schedule. (#1148) by YuliangLiu0306

    Tensor

    • [tensor] distributed checkpointing for parameters (#1240) by Jiarui Fang
    • [tensor] redistribute among different process groups (#1247) by Jiarui Fang
    • [tensor] a shorter shard and replicate spec (#1245) by Jiarui Fang
    • [tensor] redirect .data.get to a tensor instance (#1239) by HELSON
    • [tensor] add zero_like colo op, important for Optimizer (#1236) by Jiarui Fang
    • [tensor] fix some unittests (#1234) by Jiarui Fang
    • [tensor] fix a assertion in colo_tensor cross_entropy (#1232) by HELSON
    • [tensor] add unitest for colo_tensor 1DTP cross_entropy (#1230) by HELSON
    • [tensor] torch function return colotensor (#1229) by Jiarui Fang
    • [tensor] improve robustness of class 'ProcessGroup' (#1223) by HELSON
    • [tensor] sharded global process group (#1219) by Jiarui Fang
    • [Tensor] add cpu group to ddp (#1200) by Jiarui Fang
    • [tensor] remove gpc in tensor tests (#1186) by Jiarui Fang
    • [tensor] revert local view back (#1178) by Jiarui Fang
    • [Tensor] rename some APIs in TensorSpec and Polish view unittest (#1176) by Jiarui Fang
    • [Tensor] rename parallel_action (#1174) by Ziyue Jiang
    • [Tensor] distributed view supports inter-process hybrid parallel (#1169) by Jiarui Fang
    • [Tensor] remove ParallelAction, use ComputeSpec instread (#1166) by Jiarui Fang
    • [tensor] add embedding bag op (#1156) by ver217
    • [tensor] add more element-wise ops (#1155) by ver217
    • [tensor] fixed non-serializable colo parameter during model checkpointing (#1153) by Frank Lee
    • [tensor] dist spec s2s uses all-to-all (#1136) by ver217
    • [tensor] added repr to spec (#1147) by Frank Lee

    Fx

    • [fx] added ndim property to proxy (#1253) by Frank Lee
    • [fx] fixed tracing with apex-based T5 model (#1252) by Frank Lee
    • [fx] refactored the file structure of patched function and module (#1238) by Frank Lee
    • [fx] methods to get fx graph property. (#1246) by YuliangLiu0306
    • [fx]add split module pass and unit test from pipeline passes (#1242) by YuliangLiu0306
    • [fx] fixed huggingface OPT and T5 results misalignment (#1227) by Frank Lee
    • [fx]get communication size between partitions (#1224) by YuliangLiu0306
    • [fx] added patches for tracing swin transformer (#1228) by Frank Lee
    • [fx] fixed timm tracing result misalignment (#1225) by Frank Lee
    • [fx] added timm model tracing testing (#1221) by Frank Lee
    • [fx] added torchvision model tracing testing (#1216) by Frank Lee
    • [fx] temporarily used (#1215) by XYE
    • [fx] added testing for all albert variants (#1211) by Frank Lee
    • [fx] added testing for all gpt variants (#1210) by Frank Lee
    • [fx]add uniform policy (#1208) by YuliangLiu0306
    • [fx] added testing for all bert variants (#1207) by Frank Lee
    • [fx] supported model tracing for huggingface bert (#1201) by Frank Lee
    • [fx] added module patch for pooling layers (#1197) by Frank Lee
    • [fx] patched conv and normalization (#1188) by Frank Lee
    • [fx] supported data-dependent control flow in model tracing (#1185) by Frank Lee

    Rename

    • [rename] convert_to_dist -> redistribute (#1243) by Jiarui Fang

    Checkpoint

    • [checkpoint] save sharded optimizer states (#1237) by Jiarui Fang
    • [checkpoint]support generalized scheduler (#1222) by Yi Zhao
    • [checkpoint] make unitest faster (#1217) by Jiarui Fang
    • [checkpoint] checkpoint for ColoTensor Model (#1196) by Jiarui Fang

    Polish

    • [polish] polish repr for ColoTensor, DistSpec, ProcessGroup (#1235) by HELSON

    Refactor

    • [refactor] move process group from _DistSpec to ColoTensor. (#1203) by Jiarui Fang
    • [refactor] remove gpc dependency in colotensor's _ops (#1189) by Jiarui Fang
    • [refactor] move chunk and chunkmgr to directory gemini (#1182) by Jiarui Fang

    Context

    • [context]support arbitrary module materialization. (#1193) by YuliangLiu0306
    • [context]use meta tensor to init model lazily. (#1187) by YuliangLiu0306

    Ddp

    • [ddp] ColoDDP uses bucket all-reduce (#1177) by ver217
    • [ddp] refactor ColoDDP and ZeroDDP (#1146) by ver217

    Colotensor

    • [ColoTensor] add independent process group (#1179) by Jiarui Fang
    • [ColoTensor] rename APIs and add output_replicate to ComputeSpec (#1168) by Jiarui Fang
    • [ColoTensor] improves init functions. (#1150) by Jiarui Fang

    Zero

    • [zero] sharded optim supports loading local state dict (#1170) by ver217
    • [zero] zero optim supports loading local state dict (#1171) by ver217

    Workflow

    • [workflow] polish readme and dockerfile (#1165) by Frank Lee
    • [workflow] auto-publish docker image upon release (#1164) by Frank Lee
    • [workflow] fixed release post workflow (#1154) by Frank Lee
    • [workflow] fixed format error in yaml file (#1145) by Frank Lee
    • [workflow] added workflow to auto draft the release post (#1144) by Frank Lee

    Gemini

    • [gemini] refactor gemini mgr (#1151) by ver217

    Ci

    • [ci] added scripts to auto-generate release post text (#1142) by Frank Lee

    Full Changelog: https://github.com/hpcaitech/ColossalAI/compare/v0.1.7...v0.1.8

    Source code(tar.gz)
    Source code(zip)
  • v0.1.7(Jun 21, 2022)

    Version v0.1.7 Released Today

    Highlights

    • Started torch.fx support for auto-parallel training (see the tracing sketch after this list)
    • Updated the ZeRO mechanism with ColoTensor
    • Fixed various bugs
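
    As background for the torch.fx highlight: a model is first symbolically traced into a torch.fx graph, and the passes in this line of work (partitioning, communication-size analysis) then operate on that graph. Below is a minimal tracing sketch; the ColoTracer class and its meta_args keyword are assumptions based on the colossalai.fx module of this era, so check the release docs before relying on them.

    import torch
    import torch.nn as nn
    from colossalai.fx import ColoTracer  # assumed module path for this era


    class TinyMLP(nn.Module):
        def __init__(self):
            super().__init__()
            self.linear = nn.Linear(16, 16)

        def forward(self, x):
            return self.linear(x).relu()


    model = TinyMLP()
    tracer = ColoTracer()
    # tracing with meta tensors avoids allocating real memory for activations
    graph = tracer.trace(model, meta_args={'x': torch.rand(2, 16, device='meta')})
    gm = torch.fx.GraphModule(model, graph)
    print(gm.code)  # inspect the captured graph as generated Python code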

    What's Changed

    Hotfix

    • [hotfix] prevent nested ZeRO (#1140) by ver217
    • [hotfix]fix bugs caused by refactored pipeline (#1133) by YuliangLiu0306
    • [hotfix] fix param op hook (#1131) by ver217
    • [hotfix] fix zero init ctx numel (#1128) by ver217
    • [hotfix]change to fit latest p2p (#1100) by YuliangLiu0306
    • [hotfix] fix chunk comm src rank (#1072) by ver217

    Zero

    • [zero] avoid zero hook spam by changing log to debug level (#1137) by Frank Lee
    • [zero] added error message to handle on-the-fly import of torch Module class (#1135) by Frank Lee
    • [zero] fixed api consistency (#1098) by Frank Lee
    • [zero] zero optim copy chunk rather than copy tensor (#1070) by ver217

    Optim

    • [optim] refactor fused sgd (#1134) by ver217

    Ddp

    • [ddp] add save/load state dict for ColoDDP (#1127) by ver217
    • [ddp] add set_params_to_ignore for ColoDDP (#1122) by ver217
    • [ddp] supported customized torch ddp configuration (#1123) by Frank Lee

    Pipeline

    • [pipeline]support List of Dict data (#1125) by YuliangLiu0306
    • [pipeline] supported more flexible dataflow control for pipeline parallel training (#1108) by Frank Lee
    • [pipeline] refactor the pipeline module (#1087) by Frank Lee

    Gemini

    • [gemini] gemini mgr supports "cpu" placement policy (#1118) by ver217
    • [gemini] zero supports gemini (#1093) by ver217

    Test

    • [test] fixed hybrid parallel test case on 8 GPUs (#1106) by Frank Lee
    • [test] skip tests when not enough GPUs are detected (#1090) by Frank Lee
    • [test] ignore 8 gpu test (#1080) by Frank Lee

    Release

    • [release] update version.txt (#1103) by Frank Lee

    Tensor

    • [tensor] refactor param op hook (#1097) by ver217
    • [tensor] refactor chunk mgr and impl MemStatsCollectorV2 (#1077) by ver217
    • [Tensor] fix equal assert (#1091) by Ziyue Jiang
    • [Tensor] 1d row embedding (#1075) by Ziyue Jiang
    • [tensor] chunk manager monitor mem usage (#1076) by ver217
    • [Tensor] fix optimizer for CPU parallel (#1069) by Ziyue Jiang
    • [Tensor] add hybrid device demo and fix bugs (#1059) by Ziyue Jiang

    Amp

    • [amp] included dict for type casting of model output (#1102) by Frank Lee

    Workflow

    • [workflow] fixed 8-gpu test workflow (#1101) by Frank Lee
    • [workflow] added regular 8 GPU testing (#1099) by Frank Lee
    • [workflow] disable p2p via shared memory on non-nvlink machine (#1086) by Frank Lee

    Engine

    • [engine] fixed empty op hook check (#1096) by Frank Lee

    Doc

    • [doc] added documentation to chunk and chunk manager (#1094) by Frank Lee

    Context

    • [context] support lazy init of module (#1088) by Frank Lee
    • [context] maintain the context object in with statement (#1073) by Frank Lee

    Refactory

    • [refactory] add nn.parallel module (#1068) by Jiarui Fang

    Cudnn

    • [cudnn] set False to cudnn benchmark by default (#1063) by Frank Lee

    Full Changelog: https://github.com/hpcaitech/ColossalAI/compare/v0.1.6...v0.1.7

    Source code(tar.gz)
    Source code(zip)
  • v0.1.6(Jun 2, 2022)

    Main features

    1. ColoTensor supports hybrid parallelism (tensor parallelism plus data parallelism)
    2. ColoTensor supports ZeRO (with chunks)
    3. Tensor parallelism can be configured per module via ColoTensor
    4. ZeroInitContext and ShardedModelV2 support loading checkpoints and Hugging Face from_pretrained() (see the sketch after this list)
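
    For feature 4, the point is that pretrained weights can be materialized directly into sharded storage instead of first building the full model on one device. A minimal sketch, assuming the v0.1.x constructor arguments (target_device, shard_strategy, shard_param) and an installed transformers package:

    import torch
    from transformers import BertForSequenceClassification
    from colossalai.zero.init_ctx import ZeroInitContext
    from colossalai.zero.shard_utils import TensorShardStrategy

    # weights loaded by from_pretrained() land directly in ZeRO-sharded storage
    with ZeroInitContext(target_device=torch.device('cuda'),
                         shard_strategy=TensorShardStrategy(),
                         shard_param=True):
        model = BertForSequenceClassification.from_pretrained('bert-base-uncased')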

    What's Changed

    ColoTensor

    • [tensor] refactor colo-tensor by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/992
    • [tensor] refactor parallel action by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/1007
    • [tensor] impl ColoDDP for ColoTensor by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/1009
    • [Tensor] add module handler for linear by @Wesley-Jzy in https://github.com/hpcaitech/ColossalAI/pull/1021
    • [Tensor] add module check and bert test by @Wesley-Jzy in https://github.com/hpcaitech/ColossalAI/pull/1031
    • [Tensor] add Parameter inheritance for ColoParameter by @Wesley-Jzy in https://github.com/hpcaitech/ColossalAI/pull/1041
    • [tensor] ColoTensor supports ZeRo by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/1015
    • [zero] add chunk size search for chunk manager by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/1052

    Zero

    • [zero] add load_state_dict for sharded model by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/894
    • [zero] add zero optimizer for ColoTensor by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/1046

    Hotfix

    • [hotfix] fix colo init context by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/1026
    • [hotfix] fix some bugs caused by size mismatch. by @YuliangLiu0306 in https://github.com/hpcaitech/ColossalAI/pull/1011
    • [kernel] fixed the include bug in dropout kernel by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/999
    • fix typo in constants by @ryanrussell in https://github.com/hpcaitech/ColossalAI/pull/1027
    • [engine] fixed bug in gradient accumulation dataloader to keep the last step by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/1030
    • [hotfix] fix dist spec mgr by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/1045
    • [hotfix] fix import error in sharded model v2 by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/1053

    Unit test

    • [unit test] refactor test tensor by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/1005

    CI

    • [ci] update the docker image name by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/1017
    • [ci] added nightly build (#1018) by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/1019
    • [ci] fixed nightly build workflow by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/1022
    • [ci] fixed nightly build workflow by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/1029
    • [ci] fixed nightly build workflow by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/1040

    CLI

    • [cli] remove unused imports by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/1001

    Documentation

    • Hotfix/format by @binmakeswell in https://github.com/hpcaitech/ColossalAI/pull/987
    • [doc] update docker instruction by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/1020

    Misc

    • [NFC] Hotfix/format by @binmakeswell in https://github.com/hpcaitech/ColossalAI/pull/984
    • Revert "[NFC] Hotfix/format" by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/986
    • remove useless import in tensor dir by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/997
    • [NFC] fix download link by @binmakeswell in https://github.com/hpcaitech/ColossalAI/pull/998
    • [Bot] Synchronize Submodule References by @github-actions in https://github.com/hpcaitech/ColossalAI/pull/1003
    • [NFC] polish colossalai/kernel/cuda_native/csrc/colossal_C_frontend.c… by @zhengzangw in https://github.com/hpcaitech/ColossalAI/pull/1010
    • [NFC] fix paper link by @binmakeswell in https://github.com/hpcaitech/ColossalAI/pull/1012
    • [p2p]add object list send/recv by @YuliangLiu0306 in https://github.com/hpcaitech/ColossalAI/pull/1024
    • [Bot] Synchronize Submodule References by @github-actions in https://github.com/hpcaitech/ColossalAI/pull/1034
    • [NFC] add inference by @binmakeswell in https://github.com/hpcaitech/ColossalAI/pull/1044
    • [titans]remove model zoo by @YuliangLiu0306 in https://github.com/hpcaitech/ColossalAI/pull/1042
    • [NFC] add inference submodule in path by @binmakeswell in https://github.com/hpcaitech/ColossalAI/pull/1047
    • [release] update version.txt by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/1048
    • [Bot] Synchronize Submodule References by @github-actions in https://github.com/hpcaitech/ColossalAI/pull/1049
    • updated collective ops api by @kurisusnowdeng in https://github.com/hpcaitech/ColossalAI/pull/1054
    • [pipeline]refactor ppschedule to support tensor list by @YuliangLiu0306 in https://github.com/hpcaitech/ColossalAI/pull/1050

    New Contributors

    • @ryanrussell made their first contribution in https://github.com/hpcaitech/ColossalAI/pull/1027

    Full Changelog: https://github.com/hpcaitech/ColossalAI/compare/v0.1.5...v0.1.6

    Source code(tar.gz)
    Source code(zip)
  • v0.1.5(May 17, 2022)

    Main Features

    1. Enhance ColoTensor and build a demo that trains BERT (from Hugging Face) with tensor parallelism without modifying the model (see the sketch after this list).
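
    The mechanism behind this demo is that every parameter created inside the init context is a ColoTensor, so a tensor-parallel spec can be attached to chosen weights afterwards rather than rewriting the model. A minimal sketch, assuming the ColoInitContext API of this era (the module path and device keyword are assumptions) and an installed transformers package:

    import torch
    from transformers import BertModel
    from colossalai.utils.model.colo_init_context import ColoInitContext

    # parameters created inside the context are ColoTensors; a 1D row/col shard
    # spec can later be set on selected weights without touching the model code
    with ColoInitContext(device=torch.device('cpu')):
        model = BertModel.from_pretrained('bert-base-uncased')

    print(type(next(model.parameters())))  # expected: a ColoTensor/ColoParameter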

    What's Changed

    ColoTensor

    • [Tensor] add ColoTensor TP1Dcol Embedding by @Wesley-Jzy in https://github.com/hpcaitech/ColossalAI/pull/899
    • [Tensor] add embedding tp1d row by @Wesley-Jzy in https://github.com/hpcaitech/ColossalAI/pull/904
    • [Tensor] update pytest.mark.parametrize in tensor tests by @Wesley-Jzy in https://github.com/hpcaitech/ColossalAI/pull/913
    • [Tensor] init ColoParameter by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/914
    • [Tensor] add a basic bert. by @Wesley-Jzy in https://github.com/hpcaitech/ColossalAI/pull/911
    • [Tensor] polish model test by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/915
    • [Tensor] fix test_model by @Wesley-Jzy in https://github.com/hpcaitech/ColossalAI/pull/916
    • [Tensor] add 1d vocab loss by @Wesley-Jzy in https://github.com/hpcaitech/ColossalAI/pull/918
    • [Graph] building computing graph with ColoTensor, Linear only by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/917
    • [Tensor] add from_pretrained support and bert pretrained test by @Wesley-Jzy in https://github.com/hpcaitech/ColossalAI/pull/921
    • [Tensor] test pretrain loading on multi-process by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/922
    • [tensor] hijack addmm for colo tensor by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/923
    • [tensor] colo tensor overrides mul by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/927
    • [Tensor] simplify named param by @Wesley-Jzy in https://github.com/hpcaitech/ColossalAI/pull/928
    • [Tensor] fix init context by @Wesley-Jzy in https://github.com/hpcaitech/ColossalAI/pull/931
    • [Tensor] add optimizer to bert test by @Wesley-Jzy in https://github.com/hpcaitech/ColossalAI/pull/933
    • [tensor] design DistSpec and DistSpecManager for ColoTensor by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/934
    • [Tensor] add DistSpec for loss and test_model by @Wesley-Jzy in https://github.com/hpcaitech/ColossalAI/pull/947
    • [tensor] derive compute pattern from dist spec by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/971

    Pipeline Parallelism

    • [pipelinable]use pipelinable to support GPT model. by @YuliangLiu0306 in https://github.com/hpcaitech/ColossalAI/pull/903

    CI

    • [CI] add CI for releasing bdist wheel by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/901
    • [CI] fix release bdist CI by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/902
    • [ci] added wheel build scripts by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/910

    Misc

    • [Bot] Synchronize Submodule References by @github-actions in https://github.com/hpcaitech/ColossalAI/pull/907
    • [Bot] Synchronize Submodule References by @github-actions in https://github.com/hpcaitech/ColossalAI/pull/912
    • [setup] update cuda ext cc flags by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/919
    • [setup] support more cuda architectures by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/920
    • [NFC] update results on a single GPU, highlight quick view by @binmakeswell in https://github.com/hpcaitech/ColossalAI/pull/981

    Full Changelog: https://github.com/hpcaitech/ColossalAI/compare/v0.1.4...v0.1.5

    Source code(tar.gz)
    Source code(zip)
  • v0.1.4(Apr 28, 2022)

    Main Features

    Here are the main improvements of this release:

    1. ColoTensor: a data structure that unifies the tensor representation of different parallel methods.
    2. Gemini: a more efficient Gemini implementation that reduces the overhead of collecting model data statistics.
    3. CLI: a command-line tool that helps users launch distributed training tasks more easily (see the sketch after this list).
    4. Pipeline Parallelism (PP): a more user-friendly API for PP.
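
    For the CLI item, a training script that initializes the distributed environment can be started with the new launcher. A minimal sketch; the exact launcher invocation (e.g. colossalai run --nproc_per_node 4 train.py) is an assumption for this release, so consult colossalai --help:

    # train.py
    import colossalai
    from colossalai.context import ParallelMode
    from colossalai.core import global_context as gpc

    # rank and world size are read from the environment set up by the launcher
    colossalai.launch_from_torch(config={})
    print(f'rank {gpc.get_global_rank()} / {gpc.get_world_size(ParallelMode.GLOBAL)} ready')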

    What's Changed

    ColoTensor

    • [tensor]fix colo_tensor torch_function by @Wesley-Jzy in https://github.com/hpcaitech/ColossalAI/pull/825
    • [tensor]fix test_linear by @Wesley-Jzy in https://github.com/hpcaitech/ColossalAI/pull/826
    • [tensor] ZeRO use ColoTensor as the base class. by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/828
    • [tensor] revert zero tensors back by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/829
    • [Tensor] overriding parameters() for Module using ColoTensor by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/889
    • [tensor] refine linear and add gather for layernorm by @Wesley-Jzy in https://github.com/hpcaitech/ColossalAI/pull/893
    • [Tensor] test parameters() as member function by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/896
    • [Tensor] activation is an attr of ColoTensor by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/897
    • [Tensor] initialize the ColoOptimizer by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/898
    • [tensor] reorganize files by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/820
    • [Tensor] apply ColoTensor on Torch functions by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/821
    • [Tensor] update ColoTensor torch_function by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/822
    • [tensor] lazy init by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/823
    • [WIP] Applying ColoTensor on TP-1D-row Linear. by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/831
    • Init Context supports lazy allocate model memory by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/842
    • [Tensor] TP Linear 1D row by @Wesley-Jzy in https://github.com/hpcaitech/ColossalAI/pull/843
    • [Tensor] add assert for colo_tensor 1Drow by @Wesley-Jzy in https://github.com/hpcaitech/ColossalAI/pull/846
    • [Tensor] init a simple network training with ColoTensor by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/849
    • [Tensor] Add 1Drow weight reshard by spec by @Wesley-Jzy in https://github.com/hpcaitech/ColossalAI/pull/854
    • [Tensor] add layer norm Op by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/852
    • [tensor] an initial idea of tensor spec by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/865
    • [Tensor] colo init context add device attr. by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/866
    • [tensor] add cross_entropy_loss by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/868
    • [Tensor] Add function to spec and update linear 1Drow and unit tests by @Wesley-Jzy in https://github.com/hpcaitech/ColossalAI/pull/869
    • [tensor] customized op returns ColoTensor by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/875
    • [Tensor] get named parameters for model using ColoTensors by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/874
    • [Tensor] Add some attributes to ColoTensor by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/877
    • [Tensor] make a simple net works with 1D row TP by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/879
    • [tensor] wrap function in the torch_tensor to ColoTensor by @Wesley-Jzy in https://github.com/hpcaitech/ColossalAI/pull/881
    • [Tensor] make ColoTensor more robust for getattr by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/886
    • [Tensor] test model check results for a simple net by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/887
    • [tensor] add ColoTensor 1Dcol by @Wesley-Jzy in https://github.com/hpcaitech/ColossalAI/pull/888

    Gemini + ZeRO

    • [zero] add zero tensor shard strategy by @1SAA in https://github.com/hpcaitech/ColossalAI/pull/793
    • Revert "[zero] add zero tensor shard strategy" by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/806
    • [gemini] a new tensor structure by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/818
    • [gemini] APIs to set cpu memory capacity by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/809
    • [DO NOT MERGE] [zero] init fp16 params directly in ZeroInitContext by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/808
    • [gemini] collect cpu-gpu moving volume in each iteration by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/813
    • [gemini] add GeminiMemoryManger by @1SAA in https://github.com/hpcaitech/ColossalAI/pull/832
    • [zero] use GeminiMemoryManager when sampling model data by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/850
    • [gemini] polish code by @1SAA in https://github.com/hpcaitech/ColossalAI/pull/855
    • [gemini] add stateful tensor container by @1SAA in https://github.com/hpcaitech/ColossalAI/pull/867
    • [gemini] polish stateful_tensor_mgr by @1SAA in https://github.com/hpcaitech/ColossalAI/pull/876
    • [gemini] accelerate adjust_layout() by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/878

    CLI

    • [cli] added distributed launcher command by @YuliangLiu0306 in https://github.com/hpcaitech/ColossalAI/pull/791
    • [cli] added micro benchmarking for tp by @YuliangLiu0306 in https://github.com/hpcaitech/ColossalAI/pull/789
    • [cli] add missing requirement by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/805
    • [cli] fixed a bug in user args and refactored the module structure by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/807
    • [cli] fixed single-node process launching by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/812
    • [cli] added check installation cli by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/815
    • [CLI] refactored the launch CLI and fixed bugs in multi-node launching by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/844
    • [cli] refactored micro-benchmarking cli and added more metrics by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/858

    Pipeline Parallelism

    • [pipelinable]use pipelinable context to initialize non-pipeline model by @YuliangLiu0306 in https://github.com/hpcaitech/ColossalAI/pull/816
    • [pipelinable]use ColoTensor to replace dummy tensor. by @YuliangLiu0306 in https://github.com/hpcaitech/ColossalAI/pull/853

    Misc

    • [hotfix] fix auto tensor placement policy by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/775
    • [hotfix] change the check assert in split batch 2d by @Wesley-Jzy in https://github.com/hpcaitech/ColossalAI/pull/772
    • [hotfix] fix bugs in zero by @1SAA in https://github.com/hpcaitech/ColossalAI/pull/781
    • [hotfix] fix grad offload when enabling reuse_fp16_shard by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/784
    • [refactor] moving memtracer to gemini by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/801
    • [log] display tflops if available by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/802
    • [refactor] moving grad acc logic to engine by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/804
    • [log] local throughput metrics by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/811
    • [Bot] Synchronize Submodule References by @github-actions in https://github.com/hpcaitech/ColossalAI/pull/810
    • [Bot] Synchronize Submodule References by @github-actions in https://github.com/hpcaitech/ColossalAI/pull/819
    • [refactor] moving InsertPostInitMethodToModuleSubClasses to utils. by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/824
    • [setup] allow installation with python 3.6 by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/834
    • Revert "[WIP] Applying ColoTensor on TP-1D-row Linear." by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/835
    • [dependency] removed torchvision by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/833
    • [Bot] Synchronize Submodule References by @github-actions in https://github.com/hpcaitech/ColossalAI/pull/827
    • [unittest] refactored unit tests for change in dependency by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/838
    • [setup] use env var instead of option for cuda ext by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/839
    • [hotfix] ColoTensor pin_memory by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/840
    • modified the pp build for ckpt adaptation by @Gy-Lu in https://github.com/hpcaitech/ColossalAI/pull/803
    • [hotfix] the bug of numel() in ColoTensor by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/845
    • [hotfix] fix _post_init_method of zero init ctx by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/847
    • [hotfix] add deconstructor for stateful tensor by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/848
    • [utils] refactor profiler by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/837
    • [ci] cache cuda extension by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/860
    • hotfix tensor unittest bugs by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/862
    • [usability] added assertion message in registry by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/864
    • [doc] improved docstring in the communication module by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/863
    • [doc] improved docstring in the logging module by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/861
    • [doc] improved docstring in the amp module by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/857
    • [usability] improved error messages in the context module by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/856
    • [doc] improved error messages in initialize by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/872
    • [doc] improved assertion messages in trainer by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/873
    • [doc] improved docstring and assertion messages for the engine module by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/871
    • [hotfix] fix import error by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/880
    • [setup] add local version label by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/890
    • [model_zoo] change qkv processing by @Gy-Lu in https://github.com/hpcaitech/ColossalAI/pull/870

    Full Changelog: https://github.com/hpcaitech/ColossalAI/compare/v0.1.3...v0.1.4

    Source code(tar.gz)
    Source code(zip)
  • v0.1.3(Apr 16, 2022)

    Overview

    Here are the main improvements of this release:

    1. Gemini: a heterogeneous memory space manager (see the sketch after this list)
    2. Refactored the API of pipeline parallelism
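
    Gemini decides where model tensors live (CPU or GPU) according to a tensor placement policy, which this release makes configurable. A minimal sketch of selecting a policy through the ZeRO config; the exact config schema is an assumption based on this release's zero module, and 'cpu', 'cuda' and 'auto' are the policies referenced in the PRs below:

    # config.py
    from colossalai.zero.shard_utils import TensorShardStrategy

    zero = dict(
        model_config=dict(
            shard_strategy=TensorShardStrategy(),
            # 'auto' lets Gemini move tensors between CPU and GPU based on memory stats
            tensor_placement_policy='auto',
        ),
    )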

    What's Changed

    Features

    • [zero] initialize a stateful tensor manager by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/614
    • [pipeline] refactor pipeline by @YuliangLiu0306 in https://github.com/hpcaitech/ColossalAI/pull/679
    • [zero] stateful tensor manager by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/687
    • [zero] adapt zero hooks for unsharded module by @1SAA in https://github.com/hpcaitech/ColossalAI/pull/699
    • [zero] refactor memstats collector by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/706
    • [zero] improve adaptability for not-shard parameters by @1SAA in https://github.com/hpcaitech/ColossalAI/pull/708
    • [zero] check whether gradients have inf and nan in gpu by @1SAA in https://github.com/hpcaitech/ColossalAI/pull/712
    • [refactor] refactor the memory utils by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/715
    • [util] support detection of number of processes on current node by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/723
    • [utils] add synchronized cuda memory monitor by @1SAA in https://github.com/hpcaitech/ColossalAI/pull/740
    • [zero] refactor ShardedParamV2 by @1SAA in https://github.com/hpcaitech/ColossalAI/pull/742
    • [zero] add tensor placement policies by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/743
    • [zero] use factory pattern for tensor_placement_policy by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/752
    • [zero] refactor memstats_collector by @1SAA in https://github.com/hpcaitech/ColossalAI/pull/746
    • [gemini] init gemini individual directory by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/754
    • refactor shard and gather operation by @1SAA in https://github.com/hpcaitech/ColossalAI/pull/773

    Bug Fix

    • [zero] fix init bugs in zero context by @1SAA in https://github.com/hpcaitech/ColossalAI/pull/686
    • [hotfix] update requirements-test by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/701
    • [hotfix] fix a bug in 3d vocab parallel embedding by @kurisusnowdeng in https://github.com/hpcaitech/ColossalAI/pull/707
    • [compatibility] fixed tensor parallel compatibility with torch 1.9 by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/700
    • [hotfix]fixed bugs of assigning grad states to non leaf nodes by @Gy-Lu in https://github.com/hpcaitech/ColossalAI/pull/711
    • [hotfix] fix stateful tensor manager's cuda model data size by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/710
    • [bug] fixed broken test_found_inf by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/725
    • [util] fixed activation checkpointing on torch 1.9 by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/719
    • [util] fixed communication API with PyTorch 1.9 by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/721
    • [bug] removed zero installation requirements by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/731
    • [hotfix] remove duplicated param register to stateful tensor manager by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/728
    • [utils] correct cpu memory used and capacity in the context of multi-process by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/726
    • [bug] fixed grad scaler compatibility with torch 1.8 by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/735
    • [bug] fixed DDP compatibility with torch 1.8 by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/739
    • [hotfix] fix memory leak in backward of sharded model by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/741
    • [hotfix] fix initialize about zero by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/748
    • [hotfix] fix prepare grads in sharded optim by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/749
    • [hotfix] layernorm by @kurisusnowdeng in https://github.com/hpcaitech/ColossalAI/pull/750
    • [hotfix] fix auto tensor placement policy by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/753
    • [hotfix] fix reuse_fp16_shard of sharded model by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/756
    • [hotfix] fix test_stateful_tensor_mgr by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/762
    • [compatibility] used backward-compatible API for global process group by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/758
    • [hotfix] fix the ckpt hook bugs when using DDP by @Gy-Lu in https://github.com/hpcaitech/ColossalAI/pull/769
    • [hotfix] polish sharded optim docstr and warning by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/770

    Unit Testing

    • [ci] replace the ngc docker image with self-built pytorch image by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/672
    • [ci] fixed compatibility workflow by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/678
    • [ci] update workflow trigger condition and support options by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/691
    • [ci] added missing field in workflow by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/692
    • [ci] remove ipc config for rootless docker by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/694
    • [test] added missing decorators to model checkpointing tests by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/727
    • [unitest] add checkpoint for moe zero test by @1SAA in https://github.com/hpcaitech/ColossalAI/pull/729
    • [test] added a decorator for address already in use error with backward compatibility by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/760
    • [test] refactored with the new rerun decorator by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/763

    Documentation

    • add PaLM link by @binmakeswell in https://github.com/hpcaitech/ColossalAI/pull/704
    • [doc] removed outdated installation command by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/730
    • add video by @binmakeswell in https://github.com/hpcaitech/ColossalAI/pull/732
    • [readme] polish readme by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/764
    • [readme] sync CN readme by @binmakeswell in https://github.com/hpcaitech/ColossalAI/pull/766

    Miscellaneous

    • [Bot] Synchronize Submodule References by @github-actions in https://github.com/hpcaitech/ColossalAI/pull/556
    • [Bot] Synchronize Submodule References by @github-actions in https://github.com/hpcaitech/ColossalAI/pull/695
    • [refactor] zero directory by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/724
    • [Bot] Synchronize Submodule References by @github-actions in https://github.com/hpcaitech/ColossalAI/pull/751

    Full Changelog: https://github.com/hpcaitech/ColossalAI/compare/v0.1.2...v0.1.3

    Source code(tar.gz)
    Source code(zip)
  • v0.1.2(Apr 6, 2022)

    Overview

    Here are the main improvements of this release:

    1. MoE and BERT models can be trained with ZeRO.
    2. Provide a unified checkpoint mechanism for all kinds of parallelism.
    3. Optimize ZeRO-offload and improve model scaling.
    4. Design a unified model memory tracer.
    5. Implement an efficient hybrid Adam with CPU and CUDA kernels (see the sketch after this list).
    6. Improve activation offloading.
    7. Beta version of the profiler TensorBoard plugin.
    8. Refactor the pipeline module for closer integration with the engine.
    9. Chinese tutorials, WeChat and Slack user groups.
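
    The hybrid Adam runs a CPU kernel for parameters resident in CPU memory and a fused CUDA kernel for parameters on the GPU, so a single optimizer covers both when offloading is enabled. A minimal usage sketch (constructor arguments beyond lr are assumptions, and the fused path requires the CUDA extensions to be built at installation time):

    import torch
    from colossalai.nn.optimizer import HybridAdam

    model = torch.nn.Linear(1024, 1024).cuda()
    optimizer = HybridAdam(model.parameters(), lr=1e-3)

    loss = model(torch.randn(8, 1024, device='cuda')).sum()
    loss.backward()
    optimizer.step()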

    What's Changed

    Features

    • [zero] get memory usage for sharded param by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/536
    • [zero] improve the accuracy of get_memory_usage of sharded param by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/538
    • [zero] refactor model data tracing by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/537
    • [zero] get memory usage of sharded optim v2. by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/542
    • [zero] polish ZeroInitContext by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/540
    • [zero] optimize grad offload by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/539
    • [zero] non model data tracing by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/545
    • [zero] add zero config to neutralize zero context init by @1SAA in https://github.com/hpcaitech/ColossalAI/pull/546
    • [zero] dump memory stats for sharded model by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/548
    • [zero] add stateful tensor by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/549
    • [zero] label state for param fp16 and grad by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/551
    • [zero] hijack p.grad in sharded model by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/554
    • [utils] update colo tensor moving APIs by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/553
    • [polish] rename col_attr -> colo_attr by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/558
    • [zero] trace states of fp16/32 grad and fp32 param by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/571
    • [zero] adapt zero for unsharded parameters by @1SAA in https://github.com/hpcaitech/ColossalAI/pull/561
    • [refactor] memory utils by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/577
    • Feature/checkpoint gloo by @kurisusnowdeng in https://github.com/hpcaitech/ColossalAI/pull/589
    • [zero] add sampling time for memstats collector by @Gy-Lu in https://github.com/hpcaitech/ColossalAI/pull/610
    • [model checkpoint] checkpoint utils by @kurisusnowdeng in https://github.com/hpcaitech/ColossalAI/pull/592
    • [model checkpoint][hotfix] unified layers for save&load by @kurisusnowdeng in https://github.com/hpcaitech/ColossalAI/pull/593
    • Feature/checkpoint 2D by @kurisusnowdeng in https://github.com/hpcaitech/ColossalAI/pull/595
    • Feature/checkpoint 1D by @kurisusnowdeng in https://github.com/hpcaitech/ColossalAI/pull/594
    • [model checkpoint] CPU communication ops by @kurisusnowdeng in https://github.com/hpcaitech/ColossalAI/pull/590
    • Feature/checkpoint 2.5D by @kurisusnowdeng in https://github.com/hpcaitech/ColossalAI/pull/596
    • Feature/Checkpoint 3D by @kurisusnowdeng in https://github.com/hpcaitech/ColossalAI/pull/597
    • [model checkpoint] checkpoint hook by @kurisusnowdeng in https://github.com/hpcaitech/ColossalAI/pull/598
    • Feature/Checkpoint tests by @kurisusnowdeng in https://github.com/hpcaitech/ColossalAI/pull/599
    • [zero] adapt zero for unsharded parameters (Optimizer part) by @1SAA in https://github.com/hpcaitech/ColossalAI/pull/601
    • [zero] polish init context by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/645
    • refactor pipeline---put runtime schedule into engine. by @YuliangLiu0306 in https://github.com/hpcaitech/ColossalAI/pull/627

    Bug Fix

    • [Zero] process no-leaf-module in Zero by @1SAA in https://github.com/hpcaitech/ColossalAI/pull/535
    • Add gather_out arg to Linear by @Wesley-Jzy in https://github.com/hpcaitech/ColossalAI/pull/541
    • [hotfix] fix parallel_input flag for Linear1D_Col gather_output by @Wesley-Jzy in https://github.com/hpcaitech/ColossalAI/pull/579
    • [hotfix] add hybrid adam to init by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/584
    • Hotfix/path check util by @kurisusnowdeng in https://github.com/hpcaitech/ColossalAI/pull/591
    • [hotfix] fix sharded optim zero grad by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/604
    • Add tensor parallel input check by @Wesley-Jzy in https://github.com/hpcaitech/ColossalAI/pull/621
    • [hotfix] Raise messages for indivisible batch sizes with tensor parallelism by @number1roy in https://github.com/hpcaitech/ColossalAI/pull/622
    • [zero] fixed the activation offload by @Gy-Lu in https://github.com/hpcaitech/ColossalAI/pull/647
    • fixed bugs in CPU adam by @1SAA in https://github.com/hpcaitech/ColossalAI/pull/633
    • Revert "[zero] polish init context" by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/657
    • [hotfix] fix a bug in model data stats tracing by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/655
    • fix bugs for unsharded parameters when restore data by @1SAA in https://github.com/hpcaitech/ColossalAI/pull/664

    Unit Testing

    • [zero] test zero tensor utils by @FredHuang99 in https://github.com/hpcaitech/ColossalAI/pull/609
    • remove hybrid adam in test_moe_zero_optim by @1SAA in https://github.com/hpcaitech/ColossalAI/pull/659

    Documentation

    • Refactored docstring to google style by @number1roy in https://github.com/hpcaitech/ColossalAI/pull/532
    • [docs] updated docs of hybrid adam and cpu adam by @Gy-Lu in https://github.com/hpcaitech/ColossalAI/pull/552
    • html refactor by @number1roy in https://github.com/hpcaitech/ColossalAI/pull/555
    • [doc] polish docstring of zero by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/612
    • [doc] update rst by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/615
    • [doc] polish amp docstring by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/616
    • [doc] polish moe docstring by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/618
    • [doc] polish optimizer docstring by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/619
    • [doc] polish utils docstring by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/620
    • [NFC] polish colossalai/kernel/cuda_native/csrc/kernels/cuda_util.cu … by @GaryGky in https://github.com/hpcaitech/ColossalAI/pull/625
    • [doc] polish checkpoint docstring by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/637
    • update GPT-2 experiment result by @Sze-qq in https://github.com/hpcaitech/ColossalAI/pull/666
    • [NFC] polish code by @binmakeswell in https://github.com/hpcaitech/ColossalAI/pull/646

    Model Zoo

    • [model zoo] add activation offload for gpt model by @Gy-Lu in https://github.com/hpcaitech/ColossalAI/pull/582

    Miscellaneous

    • [logging] polish logger format by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/543
    • [profiler] add MemProfiler by @raejaf in https://github.com/hpcaitech/ColossalAI/pull/356
    • [Bot] Synchronize Submodule References by @github-actions in https://github.com/hpcaitech/ColossalAI/pull/501
    • [tool] create .clang-format for pre-commit by @BoxiangW in https://github.com/hpcaitech/ColossalAI/pull/578
    • [GitHub] Add prefix and label in issue template by @binmakeswell in https://github.com/hpcaitech/ColossalAI/pull/652

    Full Changelog: https://github.com/hpcaitech/ColossalAI/compare/v0.1.1...v0.1.2

    Source code(tar.gz)
    Source code(zip)
  • v0.1.1(Mar 26, 2022)

    What's Changed

    Features

    • [MOE] changed parallelmode to dist process group by @1SAA in https://github.com/hpcaitech/ColossalAI/pull/460
    • [MOE] redirect moe_env from global_variables to core by @1SAA in https://github.com/hpcaitech/ColossalAI/pull/467
    • [zero] zero init ctx receives a dp process group by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/471
    • [zero] ZeRO supports pipeline parallel by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/477
    • add LinearGate for MOE in NaiveAMP context by @1SAA in https://github.com/hpcaitech/ColossalAI/pull/480
    • [zero] polish sharded param name by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/484
    • [zero] sharded optim support hybrid cpu adam by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/486
    • [zero] polish sharded optimizer v2 by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/490
    • [MOE] support PR-MOE by @1SAA in https://github.com/hpcaitech/ColossalAI/pull/488
    • [zero] sharded model manages ophooks individually by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/492
    • [MOE] remove old MoE legacy by @1SAA in https://github.com/hpcaitech/ColossalAI/pull/493
    • [zero] sharded model support the reuse of fp16 shard by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/495
    • [polish] polish singleton and global context by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/500
    • [memory] add model data tensor moving api by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/503
    • [memory] set cuda mem frac by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/506
    • [zero] use colo model data api in sharded optimv2 by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/511
    • [MOE] add MOEGPT model by @1SAA in https://github.com/hpcaitech/ColossalAI/pull/510
    • [zero] zero init ctx enable rm_torch_payload_on_the_fly by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/512
    • [zero] show model data cuda memory usage after zero context init. by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/515
    • [log] polish disable_existing_loggers by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/519
    • [zero] add model data tensor inline moving API by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/521
    • [cuda] modify the fused adam, support hybrid of fp16 and fp32 by @Gy-Lu in https://github.com/hpcaitech/ColossalAI/pull/497
    • [zero] refactor model data tracing by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/522
    • [zero] added hybrid adam, removed loss scale in adam by @Gy-Lu in https://github.com/hpcaitech/ColossalAI/pull/527

    Bug Fix

    • fix discussion button in issue template by @binmakeswell in https://github.com/hpcaitech/ColossalAI/pull/504
    • [zero] fix grad offload by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/528

    Unit Testing

    • [MOE] add unitest for MOE experts layout, gradient handler and kernel by @1SAA in https://github.com/hpcaitech/ColossalAI/pull/469
    • [test] added rerun on exception for testing by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/475
    • [zero] fix init device bug in zero init context unittest by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/516
    • [test] fixed rerun_on_exception and adapted test cases by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/487

    CI/CD

    • [devops] remove tsinghua source for pip by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/505
    • [devops] remove tsinghua source for pip by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/507
    • [devops] recover tsinghua pip source due to proxy issue by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/509

    Documentation

    • [doc] update rst by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/470
    • Update Experiment result about Colossal-AI with ZeRO by @Sze-qq in https://github.com/hpcaitech/ColossalAI/pull/479
    • [doc] docs get correct release version by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/489
    • Update README.md by @fastalgo in https://github.com/hpcaitech/ColossalAI/pull/514
    • [doc] update apidoc by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/530

    Model Zoo

    • [model zoo] fix attn mask shape of gpt by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/472
    • [model zoo] gpt embedding remove attn mask by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/474

    Miscellaneous

    • [install] run without rich by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/513
    • [refactor] remove old zero code by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/517
    • [format] polish name format for MOE by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/481

    New Contributors

    • @fastalgo made their first contribution in https://github.com/hpcaitech/ColossalAI/pull/514

    Full Changelog: https://github.com/hpcaitech/ColossalAI/compare/v0.1.0...v0.1.1

    Source code(tar.gz)
    Source code(zip)
  • v0.1.0(Mar 19, 2022)

    Overview

    We are happy to release version v0.1.0 today. Compared to the previous version, it ships a brand-new ZeRO module and updates to many aspects of the system for better performance and usability. The latest version can now be installed with pip install colossalai. We will update our examples and documentation accordingly over the next few days.
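
    The new ZeRO module is composed of a sharded model wrapper and a sharded optimizer built around sharded parameters. A rough sketch of how the pieces fit together; module paths and constructor arguments are assumptions for this early release, and the supported entry point remains passing a zero config to colossalai.initialize:

    import torch
    import torch.nn as nn
    from colossalai.nn.optimizer import CPUAdam
    from colossalai.zero.init_ctx import ZeroInitContext
    from colossalai.zero.shard_utils import TensorShardStrategy
    from colossalai.zero.sharded_model import ShardedModelV2
    from colossalai.zero.sharded_optim import ShardedOptimizerV2

    shard_strategy = TensorShardStrategy()
    # parameters are sharded across data-parallel ranks as they are created
    with ZeroInitContext(target_device=torch.device('cuda'),
                         shard_strategy=shard_strategy,
                         shard_param=True):
        model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 256))

    model = ShardedModelV2(model, shard_strategy)
    optimizer = ShardedOptimizerV2(model, CPUAdam(model.parameters(), lr=1e-3))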

    Highlights:

    Note: a. Only the major base commits are shown; successive commits that enhance or update a base commit are omitted.
    b. Some commits lack an associated pull request ID for unknown reasons.
    c. The list is ordered by time.

    Features

    • add moe context, moe utilities and refactor gradient handler (#455) by @1SAA
    • [zero] Update initialize for ZeRO (#458) by @ver217
    • [zero] hybrid cpu adam (#445) by @feifeibear
    • added Multiply Jitter and capacity factor eval for MOE (#434) by @1SAA
    • [fp16] refactored fp16 optimizer (#392) by @FrankLeeeee
    • [zero] memtracer to record cuda memory usage of model data and overall system (#395) by @feifeibear
    • Added tensor detector (#393) by @Gy-Lu
    • Added activation offload (#331) by @Gy-Lu
    • [zero] zero init context collect numel of model (#375) by @feifeibear
    • Added PCIE profiler to detect data transmission (#373) by @1SAA
    • Added Profiler Context to manage all profilers (#340) by @1SAA
    • set criterion as optional in colossalai initialize (#336) by @FrankLeeeee
    • [zero] Update sharded model v2 using sharded param v2 (#323) by @ver217
    • [zero] zero init context (#321) by @feifeibear
    • Added profiler communication operations by @1SAA
    • added buffer sync to naive amp model wrapper (#291) by @FrankLeeeee
    • [zero] cpu adam kernel (#288) by @Gy-Lu
    • Feature/zero (#279) by @feifeibear @FrankLeeeee @ver217
    • impl shard optim v2 and add unit test by @ver217
    • [profiler] primary memory tracer by @raejaf
    • add sharded adam by @ver217

    Unit Testing

    • [test] fixed amp convergence comparison test (#454) by @FrankLeeeee
    • [test] optimized zero data parallel test (#452) by @FrankLeeeee
    • [test] make zero engine test really work (#447) by @feifeibear
    • optimized context test time consumption (#446) by @FrankLeeeee
    • [unittest] polish zero config in unittest (#438) by @feifeibear
    • added testing module (#435) by @FrankLeeeee
    • [zero] polish ShardedOptimV2 unittest (#385) by @feifeibear
    • [unit test] Refactored test cases with component func (#339) by @FrankLeeeee

    Documentation

    • [doc] Update docstring for ZeRO (#459) by @ver217
    • update README and images path (#384) by @binmakeswell
    • add badge and contributor list by @FrankLeeeee
    • add community group and update issue template (#271) by @binmakeswell
    • update experimental visualization (#253) by @Sze-qq
    • add Chinese README by @binmakeswell

    CI/CD

    • update github CI with the current workflow (#441) by @FrankLeeeee
    • update unit testing CI rules by @FrankLeeeee
    • added compatibility CI and options for release ci by @FrankLeeeee
    • added pypi publication CI and remove formatting CI by @FrankLeeeee

    Bug Fix

    • fix gpt attention mask (#461) by @ver217
    • [bug] Fixed device placement bug in memory monitor thread (#433) by @FrankLeeeee
    • fixed fp16 optimizer none grad bug (#432) by @FrankLeeeee
    • fixed gpt attention mask in pipeline (#430) by @FrankLeeeee
    • [hotfix] fixed bugs in ShardStrategy and PcieProfiler (#394) by @1SAA
    • fixed bug in activation checkpointing test (#387) by @FrankLeeeee
    • [profiler] Fixed bugs in CommProfiler and PcieProfiler (#377) by @1SAA
    • fixed CI dataset directory; fixed import error of 2.5d accuracy (#255) by @kurisusnowdeng
    • fixed padding index issue for vocab parallel embedding layers; updated 3D linear to be compatible with examples in the tutorial by @kurisusnowdeng

    Miscellaneous

    • [log] better logging display with rich (#426) by @feifeibear
    Source code(tar.gz)
    Source code(zip)
  • v0.0.2(Feb 15, 2022)

    Change Log

    Added

    • Unified distributed layers
    • MoE support
    • DevOps tools such as GitHub Actions, code review automation, etc.
    • New project official website

    Changes

    • Refactored the APIs for usability, flexibility and modularity
    • Adapted PyTorch AMP for tensor parallelism
    • Refactored utilities for tensor parallelism and pipeline parallelism
    • Separated benchmarks and examples into independent repositories
    • Updated pipeline parallelism to support non-interleaved and interleaved versions
    • Refactored installation scripts for convenience

    Fixed

    • ZeRO level 3 runtime error
    • Incorrect calculation in gradient clipping
    Source code(tar.gz)
    Source code(zip)
  • v0.0.1-beta(Oct 28, 2021)

    Features

    • Data Parallelism
    • Pipeline Parallelism (experimental)
    • 1D, 2D, 2.5D, 3D and sequence tensor parallelism
    • Easy-to-use trainer and engine
    • Extensibility for user-defined parallelism
    • Mixed Precision Training
    • Zero Redundancy Optimizer (ZeRO)
    Source code(tar.gz)
    Source code(zip)