Colossal-AI: A Unified Deep Learning System for Large-Scale Parallel Training

Overview

ColossalAI

An integrated large-scale model training system with efficient parallelization techniques.

arXiv: Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training

Installation

PyPI

pip install colossalai
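
A quick smoke test after installation (a plain Python one-liner, not a Colossal-AI command); it should finish without an ImportError if the wheel installed correctly:

python -c "import colossalai"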

Install From Source

git clone git@github.com:hpcaitech/ColossalAI.git
cd ColossalAI
# install dependency
pip install -r requirements/requirements.txt

# install colossalai
pip install .

Install and enable CUDA kernel fusion (required when using the fused optimizer)

pip install -v --no-cache-dir --global-option="--cuda_ext" .
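
Whether the extension was actually built can be checked with the CLI command that appears in the issue reports below; after a successful cuda_ext build the check should report "CUDA Extension: ✓" rather than "x":

colossalai check -i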

Documentation

Quick View

Start Distributed Training in a Few Lines

import colossalai
from colossalai.engine import Engine
from colossalai.trainer import Trainer
from colossalai.core import global_context as gpc

model, train_dataloader, test_dataloader, criterion, optimizer, schedule, lr_scheduler = colossalai.initialize()
engine = Engine(
    model=model,
    criterion=criterion,
    optimizer=optimizer,
    lr_scheduler=lr_scheduler,
    schedule=schedule
)

trainer = Trainer(engine=engine,
                  hooks_cfg=gpc.config.hooks,
                  verbose=True)
trainer.fit(
    train_dataloader=train_dataloader,
    test_dataloader=test_dataloader,
    max_epochs=gpc.config.num_epochs,
    display_progress=True,
    test_interval=5
)
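
The snippet above assumes a config file consumed by colossalai.initialize(). Only gpc.config.hooks and gpc.config.num_epochs are actually read by this code; the sketch below is a minimal, hypothetical config.py, and every other key in it is an assumption rather than something prescribed by this page.

# config.py - minimal, hypothetical sketch
BATCH_SIZE = 128   # assumed extra setting, not read by the snippet above
num_epochs = 10    # read via gpc.config.num_epochs

# hook configurations consumed by Trainer via gpc.config.hooks
hooks = []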

Write a Simple 2D Parallel Model

Suppose we have a huge MLP model whose very large hidden size makes it difficult to fit into a single GPU. We can then distribute the model weights across GPUs in a 2D mesh while still writing the model in a familiar way.

from colossalai.nn import Linear2D
import torch.nn as nn


class MLP_2D(nn.Module):

    def __init__(self):
        super().__init__()
        self.linear_1 = Linear2D(in_features=1024, out_features=16384)
        self.linear_2 = Linear2D(in_features=16384, out_features=1024)

    def forward(self, x):
        x = self.linear_1(x)
        x = self.linear_2(x)
        return x
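
Linear2D only defines how a single layer is sharded; the 2D process mesh itself comes from the parallel settings in the config. Below is a sketch of what that section might look like for 4 GPUs arranged as a 2x2 mesh; the exact keys follow Colossal-AI's usual config convention and are an assumption here, not shown on this page.

# hypothetical config.py fragment: 4 GPUs form a 2x2 mesh for 2D tensor parallelism
parallel = dict(
    pipeline=1,
    tensor=dict(size=4, mode='2d'),
)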

Features

ColossalAI provides a collection of parallel training components for you. We aim to let you write distributed deep learning models just like you write a single-GPU model, and we provide friendly tools to kickstart distributed training in a few lines.

Comments
  • [BUG]: Memory consumption by fp16 is not normal

    ๐Ÿ› Describe the bug

    When I used PyTorch's original AMP, the GPU memory usage was much smaller than with Colossal-AI. Why? The config is:

    from colossalai.amp import AMP_TYPE
    from colossalai.zero.shard_utils import TensorShardStrategy
    from colossalai.nn.optimizer import HybridAdam
    
    fp16 = dict(
        mode=AMP_TYPE.TORCH,
    )
    
    optimizer = dict(
        type=HybridAdam,
        lr=0.001,
        # weight_decay=1e-2,
    )
    

    | model | dataset | machine | batch | gradient accumulate size | ZeRO | speed | GPU memory | OPT | tensor_placement_policy | setup |
    | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
    | ir18 | private dataset | 1 | 64 | 1 | no ZeRO | 24%, 2089/8549 [02:51<08:39, 12.43it/s] | 8703M | HybridAdam | | single machine + Engine |
    | ir18 | private dataset | 1 | 64 | 1 | no ZeRO | 19%, 1599/8549 [02:24<10:21, 11.17it/s] | 5769M | HybridAdam | | single machine + w/o Engine + pytorch origin fp16 |
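
    For reference, the "pytorch origin fp16" baseline in the last row presumably looks something like the standard torch.cuda.amp recipe below; this is a sketch with assumed variable names (model, criterion, optimizer, train_dataloader), not code from the report.

    import torch

    scaler = torch.cuda.amp.GradScaler()
    for img, label in train_dataloader:
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():    # run the forward pass in mixed precision
            output = model(img)
            loss = criterion(output, label)
        scaler.scale(loss).backward()      # scale the loss to avoid fp16 underflow
        scaler.step(optimizer)             # unscales gradients, then steps
        scaler.update()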

    Environment

    No response

    bug 
    opened by powermano 26
  • [BUG]: RuntimeError of "RANK" when running train.py of ResNet example on a single GPU

    ๐Ÿ› Describe the bug

    I met a problem today when running `python train.py`, as below:

    /home/user/software/python/anaconda/anaconda3/envs/conda-general/bin/python /home/user/***/***
    /ColossalAI-Examples/image/resnet/train.py
    Traceback (most recent call last):
      File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/colossalai/initialize.py", line 210, in launch_from_torch
        rank = int(os.environ['RANK'])
      File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/os.py", line 679, in __getitem__
        raise KeyError(key) from None
    KeyError: 'RANK'
    
    During handling of the above exception, another exception occurred:
    
    ...
    
    RuntimeError: Could not find 'RANK' in the torch environment, visit https://www.colossalai.org/ for more information on launching with torch
    

    Is this error due to the absence of the environment variable RANK on my Ubuntu machine?
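
    For context, launch_from_torch only reads the variables (RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT) that the torch launchers export, so a bare python train.py leaves them unset. Two ways around it; the explicit colossalai.launch arguments below are an assumption based on its usual signature, not taken from this report:

    # option 1: let the torch launcher export the variables
    #   torchrun --nproc_per_node 1 train.py
    # option 2 (sketch): supply the values explicitly instead of via the environment
    import colossalai

    colossalai.launch(config='./config.py', rank=0, world_size=1,
                      host='127.0.0.1', port=29500)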

    Environment

    Python: 3.10

    bug 
    opened by songyuc 23
  • [BUG]: type object 'ChunkManager' has no attribute 'search_chunk_size'

    ๐Ÿ› Describe the bug

    When I was training the diffusion model, this happened:

    Setting up LambdaLR scheduler...
    Traceback (most recent call last):
      File "/home/tongange/ColossalAI/examples/images/diffusion/main.py", line 804, in <module>
        trainer.fit(model, data)
      File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 578, in fit
        call._call_and_handle_interrupt(
      File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
        return trainer_fn(*args, **kwargs)
      File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 620, in _fit_impl
        self._run(model, ckpt_path=self.ckpt_path)
      File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1038, in _run
        self.strategy.setup(self)
      File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/colossalai.py", line 333, in setup
        self.setup_precision_plugin()
      File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/colossalai.py", line 270, in setup_precision_plugin
        chunk_size = self.chunk_size or ChunkManager.search_chunk_size(
    AttributeError: type object 'ChunkManager' has no attribute 'search_chunk_size'
    Setting up LambdaLR scheduler...
    /root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp.py:437: UserWarning: Error handling mechanism for deadlock detection is uninitialized. Skipping check.
      rank_zero_warn("Error handling mechanism for deadlock detection is uninitialized. Skipping check.")
    Summoning checkpoint.

    Traceback (most recent call last):
      File "/home/tongange/ColossalAI/examples/images/diffusion/main.py", line 804, in <module>
        trainer.fit(model, data)
      File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 578, in fit
        call._call_and_handle_interrupt(
      File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt
        return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
      File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 88, in launch
        return function(*args, **kwargs)
      File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 620, in _fit_impl
        self._run(model, ckpt_path=self.ckpt_path)
      File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1038, in _run
        self.strategy.setup(self)
      File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/colossalai.py", line 333, in setup
        self.setup_precision_plugin()
      File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/colossalai.py", line 270, in setup_precision_plugin
        chunk_size = self.chunk_size or ChunkManager.search_chunk_size(
    AttributeError: type object 'ChunkManager' has no attribute 'search_chunk_size'

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/home/tongange/ColossalAI/examples/images/diffusion/main.py", line 806, in <module>
        melk()
      File "/home/tongange/ColossalAI/examples/images/diffusion/main.py", line 789, in melk
        trainer.save_checkpoint(ckpt_path)
      File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1900, in save_checkpoint
        self._checkpoint_connector.save_checkpoint(filepath, weights_only=weights_only, storage_options=storage_options)
      File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 512, in save_checkpoint
        _checkpoint = self.dump_checkpoint(weights_only)
      File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 444, in dump_checkpoint
        "state_dict": self._get_lightning_module_state_dict(),
      File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 526, in _get_lightning_module_state_dict
        state_dict = self.trainer.strategy.lightning_module_state_dict()
      File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/colossalai.py", line 383, in lightning_module_state_dict
        assert isinstance(self.model, ZeroDDP)
    AssertionError

    Environment

    I trained following the steps below, all the same as described at https://github.com/hpcaitech/ColossalAI/tree/main/examples/images/diffusion

    bug 
    opened by Alfred-Duncan 16
  • [BUG]: colossalai/kernel/cuda_native/csrc/moe_cuda_kernel.cu:5:10: fatal error: cub/cub.cuh: No such file or directory (update: now with more build errors!)

    ๐Ÿ› Describe the bug

    Trying to run a finetune torchrun script, I get this error. ColossalAI was built from source as directed, but it still fails.

    anon@linuxmint:/media/anon/bighdd/ai/toolbox/training$ ./finetune.bash 
    + export BATCH_SIZE=4
    + BATCH_SIZE=4
    + export MODEL=/media/anon/bighdd/ai/models/opt-350m
    + MODEL=/media/anon/bighdd/ai/models/opt-350m
    + export NUMBER_OF_GPUS=1
    + NUMBER_OF_GPUS=1
    + export OUTPUT_DIR=checkpoints
    + OUTPUT_DIR=checkpoints
    ++ date +%Y-%m-%d_%H-%M-%S
    + LOG_NAME=2022-12-22_14-15-45
    + export HF_DATASETS_OFFLINE=1
    + HF_DATASETS_OFFLINE=1
    + mkdir -p checkpoints/logs
    + mkdir -p checkpoints/runs
    + torchrun --nproc_per_node 1 --master_port 19198 ./colossalai/run_clm.py --train_file ./data/train.json --learning_rate 2e-5 --checkpointing_steps 64 --mem_cap 0 --model_name_or_path /media/anon/bighdd/ai/models/opt-350m --output_dir checkpoints --per_device_eval_batch_size 4 --per_device_train_batch_size 4
    + tee checkpoints/logs/2022-12-22_14-15-45.log
    2022-12-22 14:15:51.339450: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
    To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
    Colossalai should be built with cuda extension to use the FP16 optimizer
    If you want to activate cuda mode for MoE, please install with cuda_ext!
    [12/22/22 14:15:54] INFO     colossalai - colossalai - INFO:                                                                              
                                 /home/anon/.local/lib/python3.8/site-packages/colossalai/context/parallel_context.py:521 set_device          
                        INFO     colossalai - colossalai - INFO: process rank 0 is bound to device 0                                          
    [12/22/22 14:15:55] INFO     colossalai - colossalai - INFO:                                                                              
                                 /home/anon/.local/lib/python3.8/site-packages/colossalai/context/parallel_context.py:557 set_seed            
                        INFO     colossalai - colossalai - INFO: initialized seed on rank 0, numpy: 1024, python random: 1024,                
                                 ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024,the default parallel seed is ParallelMode.DATA.           
                        INFO     colossalai - colossalai - INFO: /home/anon/.local/lib/python3.8/site-packages/colossalai/initialize.py:117   
                                 launch                                                                                                       
                        INFO     colossalai - colossalai - INFO: Distributed environment is initialized, data parallel size: 1, pipeline      
                                 parallel size: 1, tensor parallel size: 1                                                                    
                        INFO     colossalai - colossalai - INFO: ./colossalai/run_clm.py:309 main                                             
                        INFO     colossalai - colossalai - INFO: Start preparing dataset                                                      
    Using custom data configuration default-ced548c04fa8d0c8
    Found cached dataset json (/home/anon/.cache/huggingface/datasets/json/default-ced548c04fa8d0c8/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
    100%|██████████| 1/1 [00:00<00:00, 597.82it/s]
    Using custom data configuration default-ced548c04fa8d0c8
    Found cached dataset json (/home/anon/.cache/huggingface/datasets/json/default-ced548c04fa8d0c8/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
    Using custom data configuration default-ced548c04fa8d0c8
    Found cached dataset json (/home/anon/.cache/huggingface/datasets/json/default-ced548c04fa8d0c8/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
                        INFO     colossalai - colossalai - INFO: ./colossalai/run_clm.py:350 main                                             
                        INFO     colossalai - colossalai - INFO: Dataset is prepared                                                          
                        INFO     colossalai - colossalai - INFO: ./colossalai/run_clm.py:366 main                                             
                        INFO     colossalai - colossalai - INFO: Model config has been created                                                
    load model from /media/anon/bighdd/ai/models/opt-350m
                        INFO     colossalai - colossalai - INFO: ./colossalai/run_clm.py:373 main                                             
                        INFO     colossalai - colossalai - INFO: GPT2Tokenizer has been created                                               
                        INFO     colossalai - colossalai - INFO: ./colossalai/run_clm.py:388 main                                             
                        INFO     colossalai - colossalai - INFO: Finetune a pre-trained model                                                 
    [12/22/22 14:16:04] INFO     colossalai - ProcessGroup - INFO:                                                                            
                                 /home/anon/.local/lib/python3.8/site-packages/colossalai/tensor/process_group.py:24 get                      
                        INFO     colossalai - ProcessGroup - INFO: NCCL initialize ProcessGroup on [0]                                        
    [12/22/22 14:16:07] INFO     colossalai - colossalai - INFO: ./colossalai/run_clm.py:400 main                                             
                        INFO     colossalai - colossalai - INFO: using Colossal-AI version 0.1.13                                             
    searching chunk configuration is completed in 0.67 s.
    used number: 315.85 MB, wasted number: 3.01 MB
    total wasted percentage is 0.95%
    /home/anon/.local/lib/python3.8/site-packages/colossalai/gemini/chunk/chunk.py:40: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor._storage() instead of tensor.storage()
      return tensor.storage().size() == 0
    /home/anon/.local/lib/python3.8/site-packages/colossalai/gemini/chunk/chunk.py:45: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor._storage() instead of tensor.storage()
      tensor.storage().resize_(0)
    [12/22/22 14:16:09] INFO     colossalai - colossalai - INFO: ./colossalai/run_clm.py:415 main                                             
                        INFO     colossalai - colossalai - INFO: GeminiDDP has been created                                                   
    Running tokenizer on dataset: 100%|██████████| 10/10 [00:23<00:00,  2.34s/ba]
    Running tokenizer on dataset: 100%|██████████| 1/1 [00:01<00:00,  1.18s/ba]
    [12/22/22 14:16:37] WARNING  colossalai - colossalai - WARNING: ./colossalai/run_clm.py:444 main                                          
                        WARNING  colossalai - colossalai - WARNING: The tokenizer picked seems to have a very large `model_max_length`        
                                 (1000000000000000019884624838656). Picking 1024 instead. You can change that default value by passing        
                                 --block_size xxx.                                                                                            
    Grouping texts in chunks of 1024: 100%|██████████| 10/10 [00:05<00:00,  1.92ba/s]
    Grouping texts in chunks of 1024: 100%|██████████| 1/1 [00:00<00:00,  3.61ba/s]
    [12/22/22 14:16:42] INFO     colossalai - colossalai - INFO: ./colossalai/run_clm.py:503 main                                             
                        INFO     colossalai - colossalai - INFO: Dataloaders have been created                                                
    /home/anon/.local/lib/python3.8/site-packages/colossalai/tensor/colo_tensor.py:182: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor._storage() instead of tensor.storage()
      ret = func(*args, **kwargs)
    /home/anon/.local/lib/python3.8/site-packages/colossalai/nn/optimizer/nvme_optimizer.py:55: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor._storage() instead of tensor.storage()
      numel += p.storage().size()
    ╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
    │ /home/anon/.local/lib/python3.8/site-packages/colossalai/nn/optimizer/hybrid_adam.py:80 in       │
    │ __init__                                                                                         │
    │                                                                                                  │
    │    77 │   │   super(HybridAdam, self).__init__(model_params, default_args, nvme_offload_fracti   │
    │    78 │   │   self.adamw_mode = adamw_mode                                                       │
    │    79 │   │   try:                                                                               │
    │ ❱  80 │   │   │   import colossalai._C.cpu_optim                                                 │
    │    81 │   │   │   import colossalai._C.fused_optim                                               │
    │    82 │   │   except ImportError:                                                                │
    │    83 │   │   │   raise ImportError('Please install colossalai from source code to use HybridA   │
    ╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
    ModuleNotFoundError: No module named 'colossalai._C.cpu_optim'
    
    During handling of the above exception, another exception occurred:
    
    ╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
    │ /media/anon/bighdd/ai/toolbox/training/./colossalai/run_clm.py:643 in <module>                   │
    │                                                                                                  │
    │   640                                                                                            │
    │   641                                                                                            │
    │   642 if __name__ == "__main__":                                                                 │
    │ ❱ 643 │   main()                                                                                 │
    │   644                                                                                            │
    │                                                                                                  │
    │ /media/anon/bighdd/ai/toolbox/training/./colossalai/run_clm.py:519 in main                       │
    │                                                                                                  │
    │   516 │   │   },                                                                                 │
    │   517 │   ]                                                                                      │
    │   518 │                                                                                          │
    │ ❱ 519 │   optimizer = HybridAdam(optimizer_grouped_parameters, lr=args.learning_rate)            │
    │   520 │   optimizer = ZeroOptimizer(optimizer, model, initial_scale=2**14)                       │
    │   521 │                                                                                          │
    │   522 │   # Scheduler and math around the number of training steps.                              │
    │                                                                                                  │
    │ /home/anon/.local/lib/python3.8/site-packages/colossalai/nn/optimizer/hybrid_adam.py:83 in       │
    │ __init__                                                                                         │
    │                                                                                                  │
    │    80 │   │   │   import colossalai._C.cpu_optim                                                 │
    │    81 │   │   │   import colossalai._C.fused_optim                                               │
    │    82 │   │   except ImportError:                                                                │
    │ ❱  83 │   │   │   raise ImportError('Please install colossalai from source code to use HybridA   │
    │    84 │   │                                                                                      │
    │    85 │   │   self.cpu_adam_op = colossalai._C.cpu_optim.CPUAdamOptimizer(lr, betas[0], betas[   │
    │    86 │   │   │   │   │   │   │   │   │   │   │   │   │   │   │   │   │   adamw_mode)            │
    ╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
    ImportError: Please install colossalai from source code to use HybridAdam
    ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 206247) of binary: /usr/bin/python3
    Traceback (most recent call last):
      File "/home/anon/.local/bin/torchrun", line 8, in <module>
        sys.exit(main())
      File "/home/anon/.local/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
        return f(*args, **kwargs)
      File "/home/anon/.local/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
        run(args)
      File "/home/anon/.local/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
        elastic_launch(
      File "/home/anon/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
        return launch_agent(self._config, self._entrypoint, list(args))
      File "/home/anon/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
        raise ChildFailedError(
    torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
    ============================================================
    ./colossalai/run_clm.py FAILED
    ------------------------------------------------------------
    Failures:
      <NO_OTHER_FAILURES>
    ------------------------------------------------------------
    Root Cause (first observed failure):
    [0]:
      time      : 2022-12-22_14:16:47
      host      : linuxmint
      rank      : 0 (local_rank: 0)
      exitcode  : 1 (pid: 206247)
      error_file: <N/A>
      traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
    ============================================================
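
    The root cause is visible in the panels above: colossalai._C.cpu_optim only exists when the CUDA kernels were compiled, so HybridAdam raises. The fix is the source install with the cuda_ext flag from the installation section at the top of this page:

    pip install -v --no-cache-dir --global-option="--cuda_ext" .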
    

    Environment

    Python 3.8.10
    torch: 2.0.0.dev20221215+cu117
    colossalai-0.1.13
    Nvidia 3060 12GB
    NVIDIA-SMI 525.60.11, Driver Version: 525.60.11, CUDA Version: 12.0
    Cuda compilation tools, release 10.1, V10.1.243

    bug 
    opened by xznhj8129 15
  • [BUG]: ZeRO without using shard_param

    ๐Ÿ› Describe the bug

    ๐Ÿ› Describe the bug

    When i use ZeRO without shard_params, it occurs the following problems

    Traceback (most recent call last):
      File "train.py", line 175, in <module>
        main()
      File "train.py", line 39, in main
        with ZeroInitContext(target_device=torch.cuda.current_device(), shard_strategy=shard_strategy, shard_param=False):
      File "/usr/local/Python-3.8.6/lib/python3.8/site-packages/colossalai/zero/init_ctx/init_context.py", line 75, in __init__
        self.config = ZeroContextConfig(target_device=target_device, replicated=True, shard_param=shard_param)
      File "/usr/local/Python-3.8.6/lib/python3.8/site-packages/colossalai/zero/init_ctx/init_context.py", line 37, in __init__
        assert target_device.type == 'cuda', "Replicated no-shard paramters should locate in cuda."
    AttributeError: 'int' object has no attribute 'type'
    
    

    My init code is:

    def main():
        parser = colossalai.get_default_parser()
        parser.add_argument('--use_trainer', action='store_true', help='whether to use trainer')
        args = parser.parse_args()
    
        colossalai.launch_from_torch(config='./config.py')
    
        logger = get_dist_logger()
    
        rank = int(os.environ['RANK'])
        # build resnet
        use_zero3 = hasattr(gpc.config, 'zero')
        if use_zero3:
            shard_strategy = TensorShardStrategy()
            with ZeroInitContext(target_device=torch.cuda.current_device(), shard_strategy=shard_strategy, shard_param=False):
                model = resnet34(num_classes=10)
        else:
            model = resnet34(num_classes=10)
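
    The traceback points at the cause: torch.cuda.current_device() returns a plain int, while ZeroContextConfig asserts on target_device.type. Wrapping the index in a torch.device, a sketch of a likely workaround rather than an official fix, avoids the AttributeError:

    # sketch: pass a torch.device instead of the int index from current_device()
    target_device = torch.device('cuda', torch.cuda.current_device())
    with ZeroInitContext(target_device=target_device,
                         shard_strategy=shard_strategy,
                         shard_param=False):
        model = resnet34(num_classes=10)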
    

    My config is:

    from colossalai.amp import AMP_TYPE
    from colossalai.zero.shard_utils import TensorShardStrategy
    from colossalai.nn.optimizer import HybridAdam
    
    zero = dict(
        model_config=dict(
            tensor_placement_policy='cuda',
            shard_strategy=TensorShardStrategy(),
            reuse_fp16_shard=False
        ),
        optimizer_config=dict()
    )
    
    optimizer = dict(
        type=HybridAdam,
        lr=0.001,
        # weight_decay=1e-2,
    )
    
    BATCH_SIZE = 64
    NUM_EPOCHS = 20
    LOGGING_FREQUNCE = 20
    OUTPUT = './'
    
    gradient_clipping = 5.0
    

    Environment

    pip install colossalai==0.1.5+torch1.10cu11.1 -f https://release.colossalai.org

    ubuntu 18.04

    bug 
    opened by powermano 15
  • [BUG]: Issue with Colossal-AI on CUDA 11.4 and Docker?

    ๐Ÿ› Describe the bug

    Followed the installation guide here: https://github.com/hpcaitech/ColossalAI

    2001  mkdir colossalai
    2002  cd colossalai/
    2003  ll
    2004  colossalai
    2005  git clone https://github.com/hpcaitech/ColossalAI.git
    2006  cd ColossalAI
    2007  # install dependency
    2008  pip install -r requirements/requirements.txt
    2009  # install colossalai
    2010  pip install .
    2014  docker build -t colossalai ./docker

    2015  docker run -ti --gpus all --rm --ipc=host colossalai bash

    [root@dbf722d6d864 workspace]# colossalai check -i
    Colossalai should be built with cuda extension to use the FP16 optimizer
    If you want to activate cuda mode for MoE, please install with cuda_ext!
    CUDA Version: 11.3
    PyTorch Version: 1.10.1
    CUDA Version in PyTorch Build: 11.3
    PyTorch CUDA Version Match: ✓
    CUDA Extension: x

    The CUDA extension ^^^ isn't present?

    [root@dbf722d6d864 workspace]# colossalai benchmark --gpus 8
    Colossalai should be built with cuda extension to use the FP16 optimizer
    If you want to activate cuda mode for MoE, please install with cuda_ext!
    === Benchmarking Parameters ===
    gpus: 8
    batch_size: 8
    seq_len: 512
    dimension: 1024
    warmup_steps: 10
    profile_steps: 50
    layers: 2
    model: mlp

    Colossalai should be built with cuda extension to use the FP16 optimizer If you want to activate cuda mode for MoE, please install with cuda_ext!

    === size: 8, mode: None ===
    Average forward time: 0.0004958677291870118
    Average backward time: 0.0010803651809692383
    Max allocated GPU memory: 0.26564550399780273
    Max cached GPU memory: 0.287109375

    === size: 8, mode: 1d ===
    Average forward time: 0.004022541046142578
    Average backward time: 0.0007260799407958985
    Max allocated GPU memory: 0.2382950782775879
    Max cached GPU memory: 0.287109375

    === size: 8, mode: 2.5d, depth: 2 ===
    Average forward time: 0.001216425895690918
    Average backward time: 0.002291984558105469
    Max allocated GPU memory: 0.17383670806884766
    Max cached GPU memory: 0.2734375

    === size: 8, mode: 3d ===
    Average forward time: 0.000978093147277832
    Average backward time: 0.0016768646240234374
    Max allocated GPU memory: 0.05128049850463867
    Max cached GPU memory: 0.185546875

    Colossalai should be built with cuda extension to use the FP16 optimizer

    What does this ^^^ really mean?

    This is an A100-based system:

    $ nvidia-smi
    Thu May 26 18:43:56 2022
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  NVIDIA A100-SXM...  On   | 00000000:07:00.0 Off |                    0 |
    | N/A   27C    P0    52W / 400W |      0MiB / 40536MiB |      0%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+
    |   1  NVIDIA A100-SXM...  On   | 00000000:0F:00.0 Off |                    0 |
    | N/A   26C    P0    50W / 400W |      0MiB / 40536MiB |      0%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+
    |   2  NVIDIA A100-SXM...  On   | 00000000:47:00.0 Off |                    0 |
    | N/A   26C    P0    54W / 400W |      0MiB / 40536MiB |      0%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+
    |   3  NVIDIA A100-SXM...  On   | 00000000:4E:00.0 Off |                    0 |
    | N/A   25C    P0    53W / 400W |      0MiB / 40536MiB |      0%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+
    |   4  NVIDIA A100-SXM...  On   | 00000000:87:00.0 Off |                    0 |
    | N/A   30C    P0    54W / 400W |      0MiB / 40536MiB |      0%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+
    |   5  NVIDIA A100-SXM...  On   | 00000000:90:00.0 Off |                    0 |
    | N/A   29C    P0    53W / 400W |      0MiB / 40536MiB |      0%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+
    |   6  NVIDIA A100-SXM...  On   | 00000000:B7:00.0 Off |                    0 |
    | N/A   29C    P0    54W / 400W |      0MiB / 40536MiB |      0%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+
    |   7  NVIDIA A100-SXM...  On   | 00000000:BD:00.0 Off |                    0 |
    | N/A   29C    P0    53W / 400W |      0MiB / 40536MiB |      0%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+

    Environment

    This is an A100-based system:

    $ nvidia-smi
    Thu May 26 18:43:56 2022
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |

    bug 
    opened by Adrian-1234 15
  • [BUG]: Memory consumption by fp16 is not normal when using Engine.

    ๐Ÿ› Describe the bug

    When using colossalai.amp.convert_to_torch_amp to wrap the model, optimizer and criterion:

    if not use_colossai_engine:
        model, optimizer, criterion =  colossalai.amp.convert_to_torch_amp(model, optimizer, criterion)
    

    and then training normally, it also only consumes 4700M of memory.

    output, _ = model(img, label)
    train_loss = criterion(output, label)
    optimizer.backward(train_loss)
    optimizer.step()
    optimizer.zero_grad()
    

    But if you use colossalai.initialize to initialize, it consumes 7700M of memory. By reading the fp16 parameter in the config we did verify that, inside colossalai.initialize, the colossalai.amp.convert_to_torch_amp conversion is performed; yet when we then use the Engine for training, it consumes 7700M of memory. This is where I get confused.

    engine.zero_grad()
    output, _ = engine(img, label)
    train_loss = engine.criterion(output, label)
    engine.backward(train_loss)
    engine.step()   
    

    Environment

    No response

    bug 
    opened by powermano 14
  • [BUG]: examples/images/diffusion run failed

    ๐Ÿ› Describe the bug

    I ran the diffusion example according to https://github.com/hpcaitech/ColossalAI/tree/main/examples/images/diffusion:

    steps:
    conda env create -f environment.yaml
    conda activate ldm
    pip install colossalai==0.1.10+torch1.11cu11.3 -f https://release.colossalai.org
    git clone https://github.com/Lightning-AI/lightning && cd lightning && git reset --hard b04a7aa
    pip install -r requirements.txt && pip install .

    dataset: laion-400m

    run: bash train.sh

    failed info:

    /opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp.py:438: UserWarning: Error handling mechanism for deadlock detection is uninitialized. Skipping check.
      rank_zero_warn("Error handling mechanism for deadlock detection is uninitialized. Skipping check.")
    Traceback (most recent call last):
      File "/home/code/ColossalAI/examples/images/diffusion/main.py", line 811, in <module>
        trainer.fit(model, data)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 579, in fit
        call._call_and_handle_interrupt(
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
        return trainer_fn(*args, **kwargs)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 621, in _fit_impl
        self._run(model, ckpt_path=self.ckpt_path)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1058, in _run
        results = self._run_stage()
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1137, in _run_stage
        self._run_train()
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1160, in _run_train
        self.fit_loop.run()
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
        self.advance(*args, **kwargs)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 267, in advance
        self._outputs = self.epoch_loop.run(self._data_fetcher)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
        self.advance(*args, **kwargs)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 214, in advance
        batch_output = self.batch_loop.run(kwargs)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
        self.advance(*args, **kwargs)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 88, in advance
        outputs = self.optimizer_loop.run(optimizers, kwargs)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
        self.advance(*args, **kwargs)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 200, in advance
        result = self._run_optimization(kwargs, self._optimizers[self.optim_progress.optimizer_position])
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 247, in _run_optimization
        self._optimizer_step(optimizer, opt_idx, kwargs.get("batch_idx", 0), closure)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 357, in _optimizer_step
        self.trainer._call_lightning_module_hook(
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1302, in _call_lightning_module_hook
        output = fn(*args, **kwargs)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/core/module.py", line 1661, in optimizer_step
        optimizer.step(closure=optimizer_closure)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/core/optimizer.py", line 169, in step
        step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/colossalai.py", line 368, in optimizer_step
        return self.precision_plugin.optimizer_step(
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/plugins/precision/colossalai.py", line 74, in optimizer_step
        closure_result = closure()
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 147, in __call__
        self._result = self.closure(*args, **kwargs)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 133, in closure
        step_output = self._step_fn()
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 406, in _training_step
        training_step_output = self.trainer._call_strategy_hook("training_step", *kwargs.values())
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1440, in _call_strategy_hook
        output = fn(*args, **kwargs)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp.py", line 352, in training_step
        return self.model(*args, **kwargs)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/colossalai/nn/parallel/data_parallel.py", line 241, in forward
        outputs = self.module(*args, **kwargs)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/overrides/base.py", line 98, in forward
        output = self._forward_module.training_step(*inputs, **kwargs)
      File "/home/code/ColossalAI/examples/images/diffusion/ldm/models/diffusion/ddpm.py", line 411, in training_step
        loss, loss_dict = self.shared_step(batch)
      File "/home/code/ColossalAI/examples/images/diffusion/ldm/models/diffusion/ddpm.py", line 976, in shared_step
        loss = self(x, c)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/code/ColossalAI/examples/images/diffusion/ldm/models/diffusion/ddpm.py", line 988, in forward
        return self.p_losses(x, c, t, *args, **kwargs)
      File "/home/code/ColossalAI/examples/images/diffusion/ldm/models/diffusion/ddpm.py", line 1122, in p_losses
        model_output = self.apply_model(x_noisy, t, cond)
      File "/home/code/ColossalAI/examples/images/diffusion/ldm/models/diffusion/ddpm.py", line 1094, in apply_model
        x_recon = self.model(x_noisy, t, **cond)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/code/ColossalAI/examples/images/diffusion/ldm/models/diffusion/ddpm.py", line 1519, in forward
        out = self.diffusion_model(x, t, context=cc)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/code/ColossalAI/examples/images/diffusion/ldm/modules/diffusionmodules/openaimodel.py", line 927, in forward
        h = th.cat([h, hs.pop()], dim=1)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/colossalai/tensor/colo_tensor.py", line 170, in __torch_function__
        ret = func(*args, **kwargs)
    RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 8 but got size 7 for tensor number 1 in the list.

    Environment

    (environment details were attached as a screenshot in the original issue)

    bug 
    opened by GxjGit 13
  • add example of self-supervised SimCLR training - V2

    The previous version uses NVIDIA DALI to create a dataloader. I found that the data augmentations in DALI differ from those in torchvision, so the desired performance could not be achieved. In this version, the dataloader is implemented with colossalai.nn.data and torchvision. The final linear evaluation accuracy reaches up to 85.4%.

    documentation 
    opened by DevinCheung 13
  • [BUG]: RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one.

    ๐Ÿ› Describe the bug

    After following the ResNet50 example in the tutorial as closely as possible, I got the error in the title. The same thing happened the last time I used HF's accelerate; I can't figure out this complex problem on my first usage. Of course I have tried my best to solve it, and the reason is likely the following. `colossalai check -i` outputs:

    Colossalai should be built with cuda extension to use the FP16 optimizer
    If you want to activate cuda mode for MoE, please install with cuda_ext!
    CUDA Version: N/A (CUDA_HOME is not set)
    PyTorch Version: 1.11.0+cu102
    CUDA Version in PyTorch Build: 10.2
    PyTorch CUDA Version Match: x
    CUDA Extension: x

    but I tried on a machine with CUDA 11.3 and got the same error.

    Below is part of my code:

    logger = get_dist_logger()
    # args = colossalai.get_default_parser().parse_args()
    colossalai.launch_from_torch(config='config.py')
    config = Config()
    tokenizer = JiebaTokenizer.from_pretrained('Lowin/chinese-bigbird-base-4096')
    model = BB()
    optimizer = optim.AdamW(params=model.parameters(), lr=1e-5, weight_decay=1e-2)
    lossFunc = F.cross_entropy
    rouge = load_metric('rouge')

    valida = json.load(open("dataset/dev.json"))
    trains = json.load(open("dataset/train.json"))
    dataSetTrain = DS(trains, tokenizer, config)
    dataSetValid = DS(valida, tokenizer, config)
    tDL = DataLoader(dataSetTrain, batch_size=config.batch_size_train, shuffle=True)
    vDL = DataLoader(dataSetValid, batch_size=config.batch_size_valid)

    engine, tDL, vDL, _ = colossalai.initialize(
        model,
        optimizer,
        lossFunc,
        tDL,
        vDL
    )

    for epoch in range(gpc.config.NUM_EPOCH):
        tDL = tqdm(tDL, leave=False)
        engine.train()
        for batch in tDL:
            labels = batch.pop('labels').cuda()
            batch = {key: value.cuda() for key, value in batch.items()}
            logist = engine(batch)
            loss_sum = engine.criterion(logist.view(-1, config.vocab_size), labels.view(-1))
            title_length = labels.ne(0).sum().item()
            loss = loss_sum / title_length
            engine.backward(loss)
            engine.step()
            engine.zero_grad()
            tDL.set_description(f'Epoch:{epoch}:')
            tDL.set_postfix(loss=loss.item())
    

    Code of model construction

    class BB(torch.nn.Module):
    	def __init__(self):
    		super(BB,self).__init__()
    		self.transformer = BigBirdModel.from_pretrained('Lowin/chinese-bigbird-base-4096')
    		self.dropout = torch.nn.Dropout(0.2)
    		self.output = torch.nn.Linear(768,39999)
            
    
    	def forward(self,batch):
    		# batch = self._set_token_type_ids_(batch)
    		outputs = self.transformer(**batch).last_hidden_state  #bs token_num outputsize 
    		logits = self.output(self.dropout(outputs))  #bs token_num vocab_size
    		return logits
    

    here is the error info:

    /home/guxj/anaconda3/envs/NLP_colossalai/lib/python3.8/site-packages/transformers/models/big_bird/modeling_big_bird.py:981: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
      torch.arange(indices.shape[0] * indices.shape[1] * num_indices_to_gather, device=indices.device)
    /home/guxj/anaconda3/envs/NLP_colossalai/lib/python3.8/site-packages/transformers/models/big_bird/modeling_big_bird.py:981: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
      torch.arange(indices.shape[0] * indices.shape[1] * num_indices_to_gather, device=indices.device)
    Traceback (most recent call last):
      File "test3_v3.3.py", line 138, in <module>
        logist = engine(batch)
      File "/home/guxj/anaconda3/envs/NLP_colossalai/lib/python3.8/site-packages/colossalai/engine/_base_engine.py", line 183, in __call__
        return self.model(*args, **kwargs)
      File "/home/guxj/anaconda3/envs/NLP_colossalai/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/guxj/anaconda3/envs/NLP_colossalai/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 947, in forward
    Traceback (most recent call last):
      File "test3_v3.3.py", line 138, in <module>
        logist = engine(batch)
      File "/home/guxj/anaconda3/envs/NLP_colossalai/lib/python3.8/site-packages/colossalai/engine/_base_engine.py", line 183, in __call__
        return self.model(*args, **kwargs)
      File "/home/guxj/anaconda3/envs/NLP_colossalai/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/guxj/anaconda3/envs/NLP_colossalai/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 947, in forward
        if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
    RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel, and by making sure all forward function outputs participate in calculating loss. If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable). Parameter indices which did not receive grad for rank 0: 197 198 In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
        if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
    RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel, and by making sure all forward function outputs participate in calculating loss. If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable). Parameter indices which did not receive grad for rank 1: 197 198 In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
    ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 44596) of binary: /home/guxj/anaconda3/envs/NLP_colossalai/bin/python
    Traceback (most recent call last):
      File "/home/guxj/anaconda3/envs/NLP_colossalai/lib/python3.8/runpy.py", line 194, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "/home/guxj/anaconda3/envs/NLP_colossalai/lib/python3.8/runpy.py", line 87, in _run_code
        exec(code, run_globals)
      File "/home/guxj/anaconda3/envs/NLP_colossalai/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
        main()
      File "/home/guxj/anaconda3/envs/NLP_colossalai/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
        launch(args)
      File "/home/guxj/anaconda3/envs/NLP_colossalai/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
        run(args)
      File "/home/guxj/anaconda3/envs/NLP_colossalai/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
        elastic_launch(
      File "/home/guxj/anaconda3/envs/NLP_colossalai/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
        return launch_agent(self._config, self._entrypoint, list(args))
      File "/home/guxj/anaconda3/envs/NLP_colossalai/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
        raise ChildFailedError(
    torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
    ============================================================
    test3_v3.3.py FAILED
    ------------------------------------------------------------
    Failures:
    [1]:
      time      : 2022-05-18_01:27:08
      host      : dlp01
      rank      : 1 (local_rank: 1)
      exitcode  : 1 (pid: 44597)
      error_file: <N/A>
      traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
    ------------------------------------------------------------
    Root Cause (first observed failure):
    [0]:
      time      : 2022-05-18_01:27:08
      host      : dlp01
      rank      : 0 (local_rank: 0)
      exitcode  : 1 (pid: 44596)
      error_file: <N/A>
      traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
    ============================================================
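
    The error message itself names the standard remedy: some BigBird parameters (indices 197 and 198) receive no gradient, so DDP must be told to expect unused parameters. With plain PyTorch that looks like the sketch below; how to thread the flag through colossalai.initialize's Engine depends on the Colossal-AI version and is not shown in this report.

    # sketch of the remedy quoted in the error message, using plain PyTorch DDP;
    # `model` and `local_rank` are assumed to exist in the surrounding script
    from torch.nn.parallel import DistributedDataParallel as DDP

    ddp_model = DDP(model, device_ids=[local_rank], find_unused_parameters=True)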

    Environment

    CUDA: 10.2, PyTorch: 1.11.0, Python: 3.8.13 (miniconda)

    bug 
    opened by 480284856 12
  • [BUG]: CUDA extension build skipped when installing from source

    ๐Ÿ› Describe the bug

    Hi, I used the Install From Source option to install ColossalAI, but I encounter a problem like:

    /path/to/myconda/anaconda3/envs/py37-pt111-cu111-colai/lib/python3.7/site-packages/torch/autocast_mode.py:162: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
      warnings.warn('User provided device_type of \'cuda\', but CUDA is not available. Disabling')
    Colossalai should be built with cuda extension to use the FP16 optimizer
    If you want to activate cuda mode for MoE, please install with cuda_ext!

    I have installed torch 1.11 + cu11.3 and am using CUDA 11.1. Any suggestions?

    Environment

    Pytorch 1.11+cu11.3 CUDA 11.1

    bug 
    opened by imabackstabber 12
  • Train stable diffusion finetune stopped at "Summoning checkpoint"

    My machine: CPU 32 GB, GPU 16 GB, batch size = 1. It seems colossalai is not working well.

    {'accelerator': 'gpu', 'devices': 1, 'log_gpu_memory': 'all', 'max_epochs': 2, 'precision': 16, 'auto_select_gpus': False, 'strategy': {'target': 'strategies.ColossalAIStrategy', 'params': {'use_chunk': True, 'enable_distributed_storage': True, 'placement_policy': 'cuda', 'force_outputs_fp32': True}}, 'log_every_n_steps': 2, 'logger': True, 'default_root_dir': '/tmp/diff_log/'}
    Running on GPU
    Using FP16 = True
    No module 'xformers'. Proceeding without it.
    LatentDiffusion: Running in v-prediction mode
    DiffusionWrapper has 865.91 M params.
    making attention of type 'vanilla' with 512 in_channels
    Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
    making attention of type 'vanilla' with 512 in_channels
    Using strategy: strategies.ColossalAIStrategy
    Monitoring val/loss_simple_ema as checkpoint metric.
    Merged modelckpt-cfg: {'target': 'lightning.pytorch.callbacks.ModelCheckpoint', 'params': {'dirpath': '/tmp/2023-01-05T10-52-57_train_colossalai_teyvat/checkpoints', 'filename': '{epoch:06}', 'verbose': True, 'save_last': True, 'monitor': 'val/loss_simple_ema', 'save_top_k': 3}}
    GPU available: True (cuda), used: True
    TPU available: False, using: 0 TPU cores
    IPU available: False, using: 0 IPUs
    HPU available: False, using: 0 HPUs

    .... ....

    Lightning config:

    trainer:
      accelerator: gpu
      devices: 1
      log_gpu_memory: all
      max_epochs: 2
      precision: 16
      auto_select_gpus: false
      strategy:
        target: strategies.ColossalAIStrategy
        params:
          use_chunk: true
          enable_distributed_storage: true
          placement_policy: cuda
          force_outputs_fp32: true
      log_every_n_steps: 2
      logger: true
      default_root_dir: /tmp/diff_log/
    logger_config:
      wandb:
        target: loggers.WandbLogger
        params:
          name: nowname
          save_dir: /tmp/diff_log/
          offline: opt.debug
          id: nowname

    /home/ubuntu/anaconda3/envs/ldmco/lib/python3.9/site-packages/lightning/pytorch/loggers/tensorboard.py:261: UserWarning: Could not log computational graph to TensorBoard: The model.example_input_array attribute is not set or input_array was not given.
      rank_zero_warn(
    Epoch 0: 0%| | 0/234 [00:00<?, ?it/s]
    /home/ubuntu/anaconda3/envs/ldmco/lib/python3.9/site-packages/lightning/pytorch/utilities/data.py:85: UserWarning: Trying to infer the batch_size from an ambiguous collection. The batch size we found is 1. To avoid any miscalculations, use self.log(..., batch_size=batch_size).
      warning_cache.warn(
    /home/ubuntu/anaconda3/envs/ldmco/lib/python3.9/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py:233: UserWarning: You called self.log('global_step', ...) in your training_step but the value needs to be floating point. Converting it to torch.float32.
      warning_cache.warn(
    Summoning checkpoint.
    Killed
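    "Killed" with no Python traceback usually means the Linux OOM killer terminated the process, which would be consistent with exhausting the 32 GB of host RAM. A generic probe (not Colossal-AI specific; assumes psutil is installed) to confirm host-memory pressure around the "Summoning checkpoint" step:

    import os
    import psutil

    def log_rss(tag: str) -> None:
        # Print this process's resident set size in GiB; call it before and
        # after the suspect step to see how close host RAM is to the limit.
        rss_gib = psutil.Process(os.getpid()).memory_info().rss / 1024 ** 3
        print(f"[{tag}] RSS = {rss_gib:.1f} GiB")

    log_rss("before checkpoint")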

    opened by yufengyao-lingoace 1
  • [example] simplify opt example

    [example] simplify opt example

    Why

    Make sure the user can run OPT to profile performance in one minute: no data download, no complex training parameter setup; just run a few iterations.

    opened by feifeibear 0
  • [DOC]: wrong transformers version in examples

    [DOC]: wrong transformers version in examples

    ๐Ÿ“š The doc issue

    https://github.com/hpcaitech/ColossalAI/blob/9c9246c0d9e09fc261ff9d052deb5ef1e02e614c/examples/language/gpt/requirements.txt#L3 It should probably be 4.23.1.
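    For reference, a quick check of the locally installed version against the pin suggested in this report (4.23.1):

    import transformers

    # The report suggests the requirements file should pin transformers==4.23.1;
    # verify what is actually installed.
    print(transformers.__version__)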

    documentation 
    opened by yhcc 0
  • [device] find best logical mesh

    [device] find best logical mesh

    What does this PR do

    1. Implement the search_best_logical_mesh function, which finds the best logical mesh for the given device list.

      The best logical mesh is searched for in the following steps:

      1. Detect homogeneous device groups. We assume that the devices in the alpha_beta_dict are homogeneous if their beta values are close enough (a minimal sketch of this clustering idea follows this list).
      2. Find the best homogeneous device group that contains all the physical devices, i.e., the group with the lowest beta value among those containing all the physical devices. We require the group to contain all the physical devices because any devices outside the group would decrease the group's bandwidth.
      3. If the best homogeneous device group is found, we construct the largest ring for each device based on that group, and the best logical mesh is the union of all the rings. Otherwise, the best logical mesh falls back to a balanced logical mesh, e.g. shape (2, 2) for 4 devices.

      Usage:

      
          >>> physical_devices = [0, 1, 2, 3]
          >>> ab_profiler = AlphaBetaProfiler(physical_devices)
          >>> best_logical_mesh = ab_profiler.search_best_logical_mesh()
          >>> print(best_logical_mesh)
          [[0, 1], [2, 3]]
      
    2. Implement the extract_alpha_beta_for_device_mesh function, which extracts the mesh_alpha list and mesh_beta list based on the best logical mesh.

      Usage:

      
          >>> physical_devices = [0, 1, 2, 3]
          >>> ab_profiler = AlphaBetaProfiler(physical_devices)
          >>> mesh_alpha, mesh_beta = ab_profiler.extract_alpha_beta_for_device_mesh()
          >>> print(mesh_alpha)
          [2.5917552411556242e-05, 0.00010312341153621673]
          >>> print(mesh_beta)
          [5.875573704655635e-11, 4.7361584445959614e-12]
      
    3. Construct test cases to test the above features.
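    As a concrete illustration of step 1 above, here is a minimal sketch (not the PR's actual implementation) of clustering links by near-equal beta values; alpha_beta_dict is assumed to map device pairs to measured (alpha, beta) tuples:

        import math

        def group_homogeneous_links(alpha_beta_dict, rel_tol=0.1):
            # Each group is (representative_beta, device_set); a link joins an
            # existing group when its beta is within rel_tol of the representative.
            groups = []
            for (src, dst), (_alpha, beta) in alpha_beta_dict.items():
                for rep_beta, devices in groups:
                    if math.isclose(beta, rep_beta, rel_tol=rel_tol):
                        devices.update((src, dst))
                        break
                else:
                    groups.append((beta, {src, dst}))
            return groups

        # Fabricated measurements with two distinct link classes (values invented).
        ab = {(0, 1): (2.6e-05, 5.9e-11),
              (2, 3): (2.5e-05, 6.0e-11),
              (0, 2): (1.0e-04, 4.7e-12)}
        print(group_homogeneous_links(ab))
        # [(5.9e-11, {0, 1, 2, 3}), (4.7e-12, {0, 2})]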

    Run Build and Test 
    opened by YuliangLiu0306 0
Releases(v0.2.0)
  • v0.2.0(Jan 3, 2023)

    What's Changed

    Version

    Examples

    • [examples] using args and combining two versions for PaLM (#2284) by ZijianYY
    • [examples] replace einsum with matmul (#2210) by ZijianYY

    Doc

    • [doc] add feature diffusion v2, bloom, auto-parallel (#2282) by binmakeswell
    • [doc] updated the stable diffussion on docker usage (#2244) by Frank Lee

    Zero

    • [zero] polish low level zero optimizer (#2275) by HELSON
    • [zero] fix error for BEiT models (#2169) by HELSON

    Example

    • [example] add benchmark (#2276) by Ziyue Jiang
    • [example] fix save_load bug for dreambooth (#2280) by BlueRum
    • [example] GPT polish readme (#2274) by Jiarui Fang
    • [example] fix gpt example with 0.1.10 (#2265) by HELSON
    • [example] clear diffuser image (#2262) by Fazzie-Maqianli
    • [example] diffusion install from docker (#2239) by Jiarui Fang
    • [example] fix benchmark.sh for gpt example (#2229) by HELSON
    • [example] make palm + GeminiDPP work (#2227) by Jiarui Fang
    • [example] Palm adding gemini, still has bugs (#2221) by ZijianYY
    • [example] update gpt example (#2225) by HELSON
    • [example] add benchmark.sh for gpt (#2226) by Jiarui Fang
    • [example] update gpt benchmark (#2219) by HELSON
    • [example] update GPT example benchmark results (#2212) by Jiarui Fang
    • [example] update gpt example for larger model scale (#2211) by Jiarui Fang
    • [example] update gpt readme with performance (#2206) by Jiarui Fang
    • [example] polish doc (#2201) by ziyuhuang123
    • [example] Change some training settings for diffusion (#2195) by BlueRum
    • [example] support Dreamblooth (#2188) by Fazzie-Maqianli
    • [example] gpt demo more accuracy tflops (#2178) by Jiarui Fang
    • [example] add palm pytorch version (#2172) by Jiarui Fang
    • [example] update vit readme (#2155) by Jiarui Fang
    • [example] add zero1, zero2 example in GPT examples (#2146) by HELSON

    Autoparallel

    • [autoparallel] fix spelling error (#2270) by YuliangLiu0306
    • [autoparallel] gpt2 autoparallel examples (#2267) by YuliangLiu0306
    • [autoparallel] patch torch.flatten metainfo for autoparallel (#2247) by Boyuan Yao
    • [autoparallel] autoparallel initialize (#2238) by YuliangLiu0306
    • [autoparallel] fix construct meta info. (#2245) by Super Daniel
    • [autoparallel] record parameter attribute in colotracer (#2217) by YuliangLiu0306
    • [autoparallel] Attach input, buffer and output tensor to MetaInfo class (#2162) by Boyuan Yao
    • [autoparallel] new metainfoprop based on metainfo class (#2179) by Boyuan Yao
    • [autoparallel] update getitem handler (#2207) by YuliangLiu0306
    • [autoparallel] update_getattr_handler (#2193) by YuliangLiu0306
    • [autoparallel] add gpt2 performance test code (#2194) by YuliangLiu0306
    • [autoparallel] integrate_gpt_related_tests (#2134) by YuliangLiu0306
    • [autoparallel] memory estimation for shape consistency (#2144) by Boyuan Yao
    • [autoparallel] use metainfo in handler (#2149) by YuliangLiu0306

    Gemini

    • [Gemini] fix the convert_to_torch_module bug (#2269) by Jiarui Fang

    Pipeline middleware

    • [Pipeline Middleware] Reduce comm redundancy by getting accurate output (#2232) by Ziyue Jiang

    Builder

    • [builder] builder for scaled_upper_triang_masked_softmax (#2234) by Jiarui Fang
    • [builder] polish builder with better base class (#2216) by Jiarui Fang
    • [builder] raise Error when CUDA_HOME is not set (#2213) by Jiarui Fang
    • [builder] multihead attn runtime building (#2203) by Jiarui Fang
    • [builder] unified cpu_optim fused_optim inferface (#2190) by Jiarui Fang
    • [builder] use runtime builder for fused_optim (#2189) by Jiarui Fang
    • [builder] runtime adam and fused_optim builder (#2184) by Jiarui Fang
    • [builder] use builder() for cpu adam and fused optim in setup.py (#2187) by Jiarui Fang

    Diffusion

    • [diffusion] update readme (#2214) by HELSON

    Testing

    • [testing] add beit model for unit testings (#2196) by HELSON

    Exmaple

    • [exmaple] diffuser, support quant inference for stable diffusion (#2186) by BlueRum
    • [exmaple] add vit missing functions (#2154) by Jiarui Fang

    Pipeline middleware

    • [Pipeline Middleware ] Fix deadlock when num_microbatch=num_stage (#2156) by Ziyue Jiang

    Full Changelog: https://github.com/hpcaitech/ColossalAI/compare/v0.2.0...v0.1.13

    Source code(tar.gz)
    Source code(zip)
  • v0.1.13(Dec 20, 2022)

    What's Changed

    Gemini

    • [Gemini] GeminiDPP convert to PyTorch Module. (#2151) by Jiarui Fang
    • [Gemini] Update coloinit_ctx to support meta_tensor (#2147) by BlueRum
    • [Gemini] revert ZeROInitCtx related tracer (#2138) by Jiarui Fang
    • [Gemini] update API of the chunkmemstatscollector. (#2129) by Jiarui Fang
    • [Gemini] update the non model data record method in runtime memory tracer (#2128) by Jiarui Fang
    • [Gemini] test step-tensor mapping using repeated_computed_layers.py (#2127) by Jiarui Fang
    • [Gemini] update non model data calculation method (#2126) by Jiarui Fang
    • [Gemini] hotfix the unittest bugs (#2125) by Jiarui Fang
    • [Gemini] mapping of preop timestep and param (#2124) by Jiarui Fang
    • [Gemini] chunk init using runtime visited param order (#2115) by Jiarui Fang
    • [Gemini] chunk init use OrderedParamGenerator (#2110) by Jiarui Fang

    Nfc

    • [NFC] remove useless graph node code (#2150) by Jiarui Fang
    • [NFC] update chunk manager API (#2119) by Jiarui Fang
    • [NFC] polish comments for Chunk class (#2116) by Jiarui Fang

    Example

    • Merge pull request #2120 from Fazziekey/example/stablediffusion-v2 by Fazzie-Maqianli

    Optimizer

    • [optimizer] add div_scale for optimizers (#2117) by HELSON

    Pp middleware

    • [PP Middleware] Add bwd and step for PP middleware (#2111) by Ziyue Jiang

    Full Changelog: https://github.com/hpcaitech/ColossalAI/compare/v0.1.13...v0.1.12

    Source code(tar.gz)
    Source code(zip)
  • v0.1.12(Dec 9, 2022)

    What's Changed

    Zero

    • [zero] add L2 gradient clipping for ZeRO (#2112) by HELSON

    Gemini

    • [gemini] get the param visited order during runtime (#2108) by Jiarui Fang
    • [Gemini] NFC, polish search_chunk_configuration (#2107) by Jiarui Fang
    • [Gemini] gemini use the runtime memory tracer (RMT) (#2099) by Jiarui Fang
    • [Gemini] make RuntimeMemTracer work correctly (#2096) by Jiarui Fang
    • [Gemini] remove eval in gemini unittests! (#2092) by Jiarui Fang
    • [Gemini] remove GLOBAL_MODEL_DATA_TRACER (#2091) by Jiarui Fang
    • [Gemini] remove GLOBAL_CUDA_MEM_INFO (#2090) by Jiarui Fang
    • [Gemini] use MemStats in Runtime Memory tracer (#2088) by Jiarui Fang
    • [Gemini] use MemStats to store the tracing data. Seperate it from Collector. (#2084) by Jiarui Fang
    • [Gemini] remove static tracer (#2083) by Jiarui Fang
    • [Gemini] ParamOpHook -> ColoParamOpHook (#2080) by Jiarui Fang
    • [Gemini] polish runtime tracer tests (#2077) by Jiarui Fang
    • [Gemini] rename hooks related to runtime mem tracer (#2076) by Jiarui Fang
    • [Gemini] add albert in test models. (#2075) by Jiarui Fang
    • [Gemini] rename ParamTracerWrapper -> RuntimeMemTracer (#2073) by Jiarui Fang
    • [Gemini] remove not used MemtracerWrapper (#2072) by Jiarui Fang
    • [Gemini] fix grad unreleased issue and param recovery issue (#2052) by Zihao

    Colotensor

    • [ColoTensor] throw error when ColoInitContext meets meta parameter. (#2105) by Jiarui Fang

    Autoparallel

    • [autoparallel] support linear function bias addition (#2104) by YuliangLiu0306
    • [autoparallel] support addbmm computation (#2102) by YuliangLiu0306
    • [autoparallel] add sum handler (#2101) by YuliangLiu0306
    • [autoparallel] add bias addtion function class (#2098) by YuliangLiu0306
    • [autoparallel] complete gpt related module search (#2097) by YuliangLiu0306
    • [autoparallel]add embedding handler (#2089) by YuliangLiu0306
    • [autoparallel] add tensor constructor handler (#2082) by YuliangLiu0306
    • [autoparallel] add non_split linear strategy (#2078) by YuliangLiu0306
    • [autoparallel] Add F.conv metainfo (#2069) by Boyuan Yao
    • [autoparallel] complete gpt block searching (#2065) by YuliangLiu0306
    • [autoparallel] add binary elementwise metainfo for auto parallel (#2058) by Boyuan Yao
    • [autoparallel] fix forward memory calculation (#2062) by Boyuan Yao
    • [autoparallel] adapt solver with self attention (#2037) by YuliangLiu0306

    Pipeline middleware

    • [Pipeline Middleware] fix data race in Pipeline Scheduler for DAG (#2087) by Ziyue Jiang
    • [Pipeline Middleware] Adapt scheduler for Topo (#2066) by Ziyue Jiang

    Fx

    • [fx] An experimental version of ColoTracer.' (#2002) by Super Daniel

    Example

    • [example] update GPT README (#2095) by ZijianYY

    Test

    • [test] bert test in non-distributed way (#2074) by Jiarui Fang

    Full Changelog: https://github.com/hpcaitech/ColossalAI/compare/v0.1.12...v0.1.11rc5

    Source code(tar.gz)
    Source code(zip)
  • v0.1.11rc5(Nov 30, 2022)

    What's Changed

    Release

    • [release] update to 0.1.11rc5 (#2053) by Frank Lee

    Cli

    • [cli] updated installation cheheck with more inforamtion (#2050) by Frank Lee

    Gemini

    • [gemini] fix init bugs for modules (#2047) by HELSON
    • [gemini] add arguments (#2046) by HELSON
    • [Gemini] free and allocate cuda memory by tensor.storage, add grad hook (#2040) by Zihao
    • [Gemini] more tests for Gemini (#2038) by Jiarui Fang
    • [Gemini] more rigorous unit tests for run_fwd_bwd (#2034) by Jiarui Fang
    • [Gemini] paramWrapper paramTracerHook unitest (#2030) by Zihao
    • [Gemini] patch for supporting orch.add_ function for ColoTensor (#2003) by Jiarui Fang
    • [gemini] param_trace_hook (#2020) by Zihao
    • [Gemini] add unitests to check gemini correctness (#2015) by Jiarui Fang
    • [Gemini] ParamMemHook (#2008) by Zihao
    • [Gemini] param_tracer_wrapper and test case (#2009) by Zihao

    Setup

    • [setup] supported conda-installed torch (#2048) by Frank Lee

    Test

    • [test] align model name with the file name. (#2045) by Jiarui Fang

    Hotfix

    • [hotfix] hotfix Gemini for no leaf modules bug (#2043) by Jiarui Fang
    • [hotfix] add bert test for gemini fwd bwd (#2035) by Jiarui Fang
    • [hotfix] revert bug PRs (#2016) by Jiarui Fang

    Zero

    • [zero] fix testing parameters (#2042) by HELSON
    • [zero] fix unit-tests (#2039) by HELSON
    • [zero] test gradient accumulation (#1964) by HELSON

    Testing

    • [testing] fix testing models (#2036) by HELSON

    Autoparallel

    • [autoparallel] add split handler (#2032) by YuliangLiu0306
    • [autoparallel] add experimental permute handler (#2029) by YuliangLiu0306
    • [autoparallel] add runtime pass and numerical test for view handler (#2018) by YuliangLiu0306
    • [autoparallel] add experimental view handler (#2011) by YuliangLiu0306
    • [autoparallel] mix gather (#1977) by Genghan Zhang

    Fx

    • [fx]Split partition with DAG information (#2025) by Ziyue Jiang

    Workflow

    • [workflow] removed unused pypi release workflow (#2022) by Frank Lee

    Full Changelog: https://github.com/hpcaitech/ColossalAI/compare/v0.1.11rc5...v0.1.11rc4

    Source code(tar.gz)
    Source code(zip)
  • v0.1.11rc4(Nov 23, 2022)

    What's Changed

    Workflow

    • [workflow] fixed the python and cpu arch mismatch (#2010) by Frank Lee
    • [workflow] fixed the typo in condarc (#2006) by Frank Lee
    • [workflow] added conda cache and fixed no-compilation bug in release (#2005) by Frank Lee

    Gemini

    • [Gemini] add an inline_op_module to common test models and polish unitests. (#2004) by Jiarui Fang
    • [Gemini] open grad checkpoint when model building (#1984) by Jiarui Fang
    • [Gemini] add bert for MemtracerWrapper unintests (#1982) by Jiarui Fang
    • [Gemini] MemtracerWrapper unittests (#1981) by Jiarui Fang
    • [Gemini] memory trace hook (#1978) by Jiarui Fang
    • [Gemini] independent runtime tracer (#1974) by Jiarui Fang
    • [Gemini] ZeROHookV2 -> GeminiZeROHook (#1972) by Jiarui Fang
    • [Gemini] clean no used MemTraceOp (#1970) by Jiarui Fang
    • [Gemini] polish memstats collector (#1962) by Jiarui Fang
    • [Gemini] add GeminiAdamOptimizer (#1960) by Jiarui Fang

    Autoparallel

    • [autoparallel] Add metainfo support for F.linear (#1987) by Boyuan Yao
    • [autoparallel] use pytree map style to process data (#1989) by YuliangLiu0306
    • [autoparallel] adapt handlers with attention block (#1990) by YuliangLiu0306
    • [autoparallel] support more flexible data type (#1967) by YuliangLiu0306
    • [autoparallel] add pooling metainfo (#1968) by Boyuan Yao
    • [autoparallel] support distributed dataloader option (#1906) by YuliangLiu0306
    • [autoparallel] Add alpha beta (#1973) by Genghan Zhang
    • [autoparallel] add torch.nn.ReLU metainfo (#1868) by Boyuan Yao
    • [autoparallel] support addmm in tracer and solver (#1961) by YuliangLiu0306
    • [autoparallel] remove redundancy comm node (#1893) by YuliangLiu0306

    Fx

    • [fx] add more meta_registry for MetaTensor execution. (#2000) by Super Daniel

    Hotfix

    • [hotfix] make Gemini work for conv DNN (#1998) by Jiarui Fang

    Kernel

    • [kernel] move all symlinks of kernel to colossalai._C (#1971) by ver217

    Polish

    • [polish] remove useless file _mem_tracer_hook.py (#1963) by Jiarui Fang

    Zero

    • [zero] fix memory leak for zero2 (#1955) by HELSON

    Colotensor

    • [ColoTensor] reconfig ColoInitContext, decouple default_pg and default_dist_spec. (#1953) by Jiarui Fang
    • [ColoTensor] ColoInitContext initialize parameters in shard mode. (#1937) by Jiarui Fang

    Tutorial

    • [tutorial] polish all README (#1946) by binmakeswell
    • [tutorial] added missing dummy dataloader (#1944) by Frank Lee
    • [tutorial] fixed pipeline bug for sequence parallel (#1943) by Frank Lee

    Sc

    • [SC] remove redundant hands on (#1939) by Boyuan Yao

    Full Changelog: https://github.com/hpcaitech/ColossalAI/compare/v0.1.11rc4...v0.1.11rc3

    Source code(tar.gz)
    Source code(zip)
  • v0.1.11rc3(Nov 13, 2022)

    What's Changed

    Release

    • [release] update version (#1931) by ver217

    Tutorial

    • [tutorial] polish README and OPT files (#1930) by binmakeswell
    • [tutorial] add synthetic dataset for opt (#1924) by ver217
    • [tutorial] updated hybrid parallel readme (#1928) by Frank Lee
    • [tutorial] added synthetic data for sequence parallel (#1927) by Frank Lee
    • [tutorial] removed huggingface model warning (#1925) by Frank Lee
    • Hotfix/tutorial readme index (#1922) by Frank Lee
    • [tutorial] modify hands-on of auto activation checkpoint (#1920) by Boyuan Yao
    • [tutorial] added synthetic data for hybrid parallel (#1921) by Frank Lee
    • [tutorial] added synthetic data for hybrid parallel (#1919) by Frank Lee
    • [tutorial] added synthetic dataset for auto parallel demo (#1918) by Frank Lee
    • [tutorial] updated auto parallel demo with latest data path (#1917) by Frank Lee
    • [tutorial] added data script and updated readme (#1916) by Frank Lee
    • [tutorial] add cifar10 for diffusion (#1907) by binmakeswell
    • [tutorial] removed duplicated tutorials (#1904) by Frank Lee
    • [tutorial] edited hands-on practices (#1899) by BoxiangW

    Example

    • [example] update auto_parallel img path (#1910) by binmakeswell
    • [example] add cifar10 dadaset for diffusion (#1902) by Fazzie-Maqianli
    • [example] migrate diffusion and auto_parallel hands-on (#1871) by binmakeswell
    • [example] initialize tutorial (#1865) by binmakeswell
    • Merge pull request #1842 from feifeibear/jiarui/polish by Fazzie-Maqianli
    • [example] polish diffusion readme by jiaruifang

    Sc

    • [SC] add GPT example for auto checkpoint (#1889) by Boyuan Yao
    • [sc] add examples for auto checkpoint. (#1880) by Super Daniel

    Nfc

    • [NFC] polish colossalai/amp/naive_amp/init.py code style (#1905) by Junming Wu
    • [NFC] remove redundant dependency (#1869) by binmakeswell
    • [NFC] polish .github/workflows/scripts/build_colossalai_wheel.py code style (#1856) by yuxuan-lou
    • [NFC] polish .github/workflows/scripts/generate_release_draft.py code style (#1855) by Ofey Chan
    • [NFC] polish workflows code style (#1854) by Kai Wang (Victor Kai)
    • [NFC] polish colossalai/amp/apex_amp/init.py code style (#1853) by LuGY
    • [NFC] polish .readthedocs.yaml code style (#1852) by nuszzh
    • [NFC] polish <.github/workflows/release_nightly.yml> code style (#1851) by RichardoLuo
    • [NFC] polish amp.naive_amp.grad_scaler code style by zbian
    • [NFC] polish colossalai/auto_parallel/tensor_shard/deprecated/op_handler/operator_handler.py code style (#1845) by HELSON
    • [NFC] polish ./colossalai/amp/torch_amp/init.py code style (#1836) by Genghan Zhang
    • [NFC] polish .github/workflows/build.yml code style (#1837) by xyupeng
    • [NFC] polish colossalai/auto_parallel/tensor_shard/deprecated/op_handler/conv_handler.py code style (#1829) by Sze-qq
    • [NFC] polish colossalai/amp/torch_amp/_grad_scaler.py code style (#1823) by Ziyue Jiang
    • [NFC] polish .github/workflows/release_docker.yml code style by Maruyama_Aya
    • [NFC] polish .github/workflows/submodule.yml code style (#1822) by shenggan
    • [NFC] polish .github/workflows/draft_github_release_post.yml code style (#1820) by Arsmart1
    • [NFC] polish colossalai/amp/naive_amp/_fp16_optimizer.py code style (#1819) by Fazzie-Maqianli
    • [NFC] polish colossalai/amp/naive_amp/_utils.py code style (#1816) by CsRic
    • [NFC] polish .github/workflows/build_gpu_8.yml code style (#1813) by Zangwei Zheng
    • [NFC] polish MANIFEST.in code style (#1814) by Zirui Zhu
    • [NFC] polish strategies_constructor.py code style (#1806) by binmakeswell

    Zero

    • [zero] migrate zero1&2 (#1878) by HELSON

    Autoparallel

    • [autoparallel] user-friendly API for CheckpointSolver. (#1879) by Super Daniel
    • [autoparallel] fix linear logical convert issue (#1857) by YuliangLiu0306

    Hotfix

    • [hotfix] pass test_complete_workflow (#1877) by Jiarui Fang

    Inference

    • [inference] overlap comm and compute in Linear1D_Row when stream_chunk_num > 1 (#1876) by Jiarui Fang
    • [inference] streaming Linear 1D Row inference (#1874) by Jiarui Fang

    Amp

    • [amp] add torch amp test (#1860) by xcnick

    Diffusion

    • [diffusion] fix package conflicts (#1875) by HELSON

    Utils

    • [utils] fixed lazy init context (#1867) by Frank Lee
    • [utils] remove lazy_memory_allocate from ColoInitContext (#1844) by Jiarui Fang

    Full Changelog: https://github.com/hpcaitech/ColossalAI/compare/v0.1.11rc3...v0.1.11rc2

    Source code(tar.gz)
    Source code(zip)
  • v0.1.11rc2(Nov 8, 2022)

    What's Changed

    Autoparallel

    • [autoparallel] fix bugs caused by negative dim key (#1808) by YuliangLiu0306
    • [autoparallel] fix bias addition module (#1800) by YuliangLiu0306
    • [autoparallel] add batch norm metainfo (#1815) by Boyuan Yao
    • [autoparallel] add conv metainfo class for auto parallel (#1796) by Boyuan Yao
    • [autoparallel]add essential CommActions for broadcast oprands (#1793) by YuliangLiu0306
    • [autoparallel] refactor and add rotorc. (#1789) by Super Daniel
    • [autoparallel] add getattr handler (#1767) by YuliangLiu0306
    • [autoparallel] added matmul handler (#1763) by Frank Lee
    • [autoparallel] fix conv handler numerical test (#1771) by YuliangLiu0306
    • [autoparallel] move ckpt solvers to autoparallel folder / refactor code (#1764) by Super Daniel
    • [autoparallel] add numerical test for handlers (#1769) by YuliangLiu0306
    • [autoparallel] update CommSpec to CommActions (#1768) by YuliangLiu0306
    • [autoparallel] add numerical test for node strategies (#1760) by YuliangLiu0306
    • [autoparallel] refactor the runtime apply pass and add docstring to passes (#1757) by YuliangLiu0306
    • [autoparallel] added binary elementwise node handler (#1758) by Frank Lee
    • [autoparallel] fix param hook issue in transform pass (#1755) by YuliangLiu0306
    • [autoparallel] added addbmm handler (#1751) by Frank Lee
    • [autoparallel] shard param and buffer as expected (#1753) by YuliangLiu0306
    • [autoparallel] add sequential order to communication actions (#1735) by YuliangLiu0306
    • [autoparallel] recovered skipped test cases (#1748) by Frank Lee
    • [autoparallel] fixed wrong sharding strategy in conv handler (#1747) by Frank Lee
    • [autoparallel] fixed wrong generated strategy for dot op (#1746) by Frank Lee
    • [autoparallel] handled illegal sharding strategy in shape consistency (#1744) by Frank Lee
    • [autoparallel] handled illegal strategy in node handler (#1743) by Frank Lee
    • [autoparallel] handled illegal sharding strategy (#1728) by Frank Lee

    Gemini

    • [Gemini] make gemini usage simple (#1821) by Jiarui Fang

    Checkpointio

    • [CheckpointIO] a uniform checkpoint I/O module (#1689) by ver217

    Example

    • [example] remove useless readme in diffusion (#1831) by Jiarui Fang
    • [example] add TP to GPT example (#1828) by Jiarui Fang
    • [example] add stable diffuser (#1825) by Fazzie-Maqianli
    • [example] simplify the GPT2 huggingface example (#1826) by Jiarui Fang
    • [example] opt does not depend on Titans (#1811) by Jiarui Fang
    • [example] add GPT by Jiarui Fang
    • [example] add opt model in lauguage (#1809) by Jiarui Fang
    • [example] add diffusion to example (#1805) by Jiarui Fang

    Nfc

    • [NFC] update gitignore remove DS_Store (#1830) by Jiarui Fang
    • [NFC] polish type hint for shape consistency (#1801) by Jiarui Fang
    • [NFC] polish tests/test_layers/test_3d/test_3d.py code style (#1740) by Ziheng Qin
    • [NFC] polish tests/test_layers/test_3d/checks_3d/common.py code style (#1733) by lucasliunju
    • [NFC] polish colossalai/nn/metric/_utils.py code style (#1727) by Sze-qq
    • [NFC] polish tests/test_layers/test_3d/checks_3d/check_layer_3d.py code style (#1731) by Xue Fuzhao
    • [NFC] polish tests/test_layers/test_sequence/checks_seq/check_layer_seq.py code style (#1723) by xyupeng
    • [NFC] polish accuracy_2d.py code style (#1719) by Ofey Chan
    • [NFC] polish .github/workflows/scripts/build_colossalai_wheel.py code style (#1721) by Arsmart1
    • [NFC] polish _checkpoint_hook.py code style (#1722) by LuGY
    • [NFC] polish test_2p5d/checks_2p5d/check_operation_2p5d.py code style (#1718) by Kai Wang (Victor Kai)
    • [NFC] polish colossalai/zero/sharded_param/init.py code style (#1717) by CsRic
    • [NFC] polish colossalai/nn/lr_scheduler/linear.py code style (#1716) by yuxuan-lou
    • [NFC] polish tests/test_layers/test_2d/checks_2d/check_operation_2d.py code style (#1715) by binmakeswell
    • [NFC] polish colossalai/nn/metric/accuracy_2p5d.py code style (#1714) by shenggan

    Fx

    • [fx] add a symbolic_trace api. (#1812) by Super Daniel
    • [fx] skip diffusers unitest if it is not installed (#1799) by Jiarui Fang
    • [fx] Add linear metainfo class for auto parallel (#1783) by Boyuan Yao
    • [fx] support module with bias addition (#1780) by YuliangLiu0306
    • [fx] refactor memory utils and extend shard utils. (#1754) by Super Daniel
    • [fx] test tracer on diffuser modules. (#1750) by Super Daniel

    Hotfix

    • [hotfix] fix build error when torch version >= 1.13 (#1803) by xcnick
    • [hotfix] polish flash attention (#1802) by oahzxl
    • [hotfix] fix zero's incompatibility with checkpoint in torch-1.12 (#1786) by HELSON
    • [hotfix] polish chunk import (#1787) by Jiarui Fang
    • [hotfix] autoparallel unit test (#1752) by YuliangLiu0306

    Pipeline

    • [Pipeline]Adapt to Pipelinable OPT (#1782) by Ziyue Jiang

    Compatibility

    • [compatibility] ChunkMgr import error (#1772) by Jiarui Fang

    Feat

    • [feat] add flash attention (#1762) by oahzxl

    Fx/profiler

    • [fx/profiler] debug the fx.profiler / add an example test script for fx.profiler (#1730) by Super Daniel

    Workflow

    • [workflow] handled the git directory ownership error (#1741) by Frank Lee

    Full Changelog: https://github.com/hpcaitech/ColossalAI/compare/v0.1.11rc2...v0.1.11rc1

    Source code(tar.gz)
    Source code(zip)
  • v0.1.11rc1(Oct 19, 2022)

    What's Changed

    Release

    • [release] update to v0.1.11 (#1736) by Frank Lee

    Doc

    • [doc] update recommendation system catalogue (#1732) by binmakeswell
    • [doc] update recommedation system urls (#1725) by Jiarui Fang

    Zero

    • [zero] add chunk init function for users (#1729) by HELSON
    • [zero] add constant placement policy (#1705) by HELSON

    Pre-commit

    • [pre-commit] update pre-commit (#1726) by HELSON

    Autoparallel

    • [autoparallel] runtime_backward_apply (#1720) by YuliangLiu0306
    • [autoparallel] moved tests to test_tensor_shard (#1713) by Frank Lee
    • [autoparallel] resnet block runtime apply (#1709) by YuliangLiu0306
    • [autoparallel] fixed broken node handler tests (#1708) by Frank Lee
    • [autoparallel] refactored the autoparallel module for organization (#1706) by Frank Lee
    • [autoparallel] adapt runtime passes (#1703) by YuliangLiu0306
    • [autoparallel] collated all deprecated files (#1700) by Frank Lee
    • [autoparallel] init new folder structure (#1696) by Frank Lee
    • [autoparallel] adapt solver and CostGraph with new handler (#1695) by YuliangLiu0306
    • [autoparallel] add output handler and placeholder handler (#1694) by YuliangLiu0306
    • [autoparallel] add pooling handler (#1690) by YuliangLiu0306
    • [autoparallel] where_handler_v2 (#1688) by YuliangLiu0306
    • [autoparallel] fix C version rotor inconsistency (#1691) by Boyuan Yao
    • [autoparallel] added sharding spec conversion for linear handler (#1687) by Frank Lee
    • [autoparallel] add reshape handler v2 and fix some previous bug (#1683) by YuliangLiu0306
    • [autoparallel] add unary element wise handler v2 (#1674) by YuliangLiu0306
    • [autoparallel] add following node generator (#1673) by YuliangLiu0306
    • [autoparallel] add layer norm handler v2 (#1671) by YuliangLiu0306
    • [autoparallel] fix insecure subprocess (#1680) by Boyuan Yao
    • [autoparallel] add rotor C version (#1658) by Boyuan Yao
    • [autoparallel] added utils for broadcast operation (#1665) by Frank Lee
    • [autoparallel] update CommSpec (#1667) by YuliangLiu0306
    • [autoparallel] added bias comm spec to matmul strategy (#1664) by Frank Lee
    • [autoparallel] add batch norm handler v2 (#1666) by YuliangLiu0306
    • [autoparallel] remove no strategy nodes (#1652) by YuliangLiu0306
    • [autoparallel] added compute resharding costs for node handler (#1662) by Frank Lee
    • [autoparallel] added new strategy constructor template (#1661) by Frank Lee
    • [autoparallel] added node handler for bmm (#1655) by Frank Lee
    • [autoparallel] add conv handler v2 (#1663) by YuliangLiu0306
    • [autoparallel] adapt solver with gpt (#1653) by YuliangLiu0306
    • [autoparallel] implemented all matmul strategy generator (#1650) by Frank Lee
    • [autoparallel] change the following nodes strategies generation logic (#1636) by YuliangLiu0306
    • [autoparallel] where handler (#1651) by YuliangLiu0306
    • [autoparallel] implemented linear projection strategy generator (#1639) by Frank Lee
    • [autoparallel] adapt solver with mlp (#1638) by YuliangLiu0306
    • [autoparallel] Add pofo sequence annotation (#1637) by Boyuan Yao
    • [autoparallel] add elementwise handler (#1622) by YuliangLiu0306
    • [autoparallel] add embedding handler (#1620) by YuliangLiu0306
    • [autoparallel] protect bcast handler from invalid strategies (#1631) by YuliangLiu0306
    • [autoparallel] add layernorm handler (#1629) by YuliangLiu0306
    • [autoparallel] recover the merged node strategy index (#1613) by YuliangLiu0306
    • [autoparallel] added new linear module handler (#1616) by Frank Lee
    • [autoparallel] added new node handler (#1612) by Frank Lee
    • [autoparallel]add bcast matmul strategies (#1605) by YuliangLiu0306
    • [autoparallel] refactored the data structure for sharding strategy (#1610) by Frank Lee
    • [autoparallel] add bcast op handler (#1600) by YuliangLiu0306
    • [autoparallel] added all non-bcast matmul strategies (#1603) by Frank Lee
    • [autoparallel] added strategy generator and bmm strategies (#1602) by Frank Lee
    • [autoparallel] add reshape handler (#1594) by YuliangLiu0306
    • [autoparallel] refactored shape consistency to remove redundancy (#1591) by Frank Lee
    • [autoparallel] add resnet autoparallel unit test and add backward weight communication cost (#1589) by YuliangLiu0306
    • [autoparallel] added generate_sharding_spec to utils (#1590) by Frank Lee
    • [autoparallel] added solver option dataclass (#1588) by Frank Lee
    • [autoparallel] adapt solver with resnet (#1583) by YuliangLiu0306

    Fx/meta/rpc

    • [fx/meta/rpc] move _meta_registration.py to fx folder / register fx functions with compatibility checks / remove color debug (#1710) by Super Daniel

    Embeddings

    • [embeddings] add doc in readme (#1711) by Jiarui Fang
    • [embeddings] more detailed timer (#1692) by Jiarui Fang
    • [embeddings] cache option (#1635) by Jiarui Fang
    • [embeddings] use cache_ratio instead of cuda_row_num (#1611) by Jiarui Fang
    • [embeddings] add already_split_along_rank flag for tablewise mode (#1584) by CsRic

    Unittest

    • [unittest] added doc for the pytest wrapper (#1704) by Frank Lee
    • [unittest] supported condititonal testing based on env var (#1701) by Frank Lee

    Embedding

    • [embedding] rename FreqAwareEmbedding -> CachedEmbedding (#1699) by Jiarui Fang
    • [embedding] polish async copy (#1657) by Jiarui Fang
    • [embedding] add more detail profiling (#1656) by Jiarui Fang
    • [embedding] print profiling results (#1654) by Jiarui Fang
    • [embedding] non-blocking cpu-gpu copy (#1647) by Jiarui Fang
    • [embedding] isolate cache_op from forward (#1645) by CsRic
    • [embedding] rollback for better FAW performance (#1625) by Jiarui Fang
    • [embedding] updates some default parameters by Jiarui Fang

    Fx/profiler

    • [fx/profiler] assigned UUID to each unrecorded tensor/ improved performance on GPT-2 (#1679) by Super Daniel
    • [fx/profiler] provide a table of summary. (#1634) by Super Daniel
    • [fx/profiler] tuned the calculation of memory estimation (#1619) by Super Daniel

    Pipeline/fix-bug

    • [pipeline/fix-bug] num_microbatches support any integrate | stable chimera | launch tool for rpc pp framework (#1684) by Kirigaya Kazuto

    Pipeline/rank_recorder

    • [pipeline/rank_recorder] fix bug when process data before backward | add a tool for multiple ranks debug (#1681) by Kirigaya Kazuto

    Feature

    • [feature] A new ZeRO implementation (#1644) by HELSON
    • Revert "[feature] new zero implementation (#1623)" (#1643) by Jiarui Fang
    • [feature] new zero implementation (#1623) by HELSON

    Fx

    • [fx] Add concrete info prop (#1677) by Boyuan Yao
    • [fx] refactor code for profiler / enable fake tensor movement. (#1646) by Super Daniel
    • [fx] fix offload codegen test (#1648) by Boyuan Yao
    • [fx] Modify offload codegen (#1618) by Boyuan Yao
    • [fx] PoC of runtime shape consistency application (#1607) by YuliangLiu0306
    • [fx] Add pofo solver (#1608) by Boyuan Yao
    • [fx] Add offload codegen (#1598) by Boyuan Yao
    • [fx] provide an accurate estimation of memory. (#1587) by Super Daniel
    • [fx] Improve linearize and rotor solver (#1586) by Boyuan Yao
    • [fx] Add nested checkpoint in activation checkpoint codegen (#1585) by Boyuan Yao

    Pipeline/pytree

    • [pipeline/pytree] add pytree to process args and kwargs | provide data_process_func to process args and kwargs after forward (#1642) by Kirigaya Kazuto

    Fix

    • [fix] fixed the collective pattern name for consistency (#1649) by Frank Lee

    Moe

    • [moe] initialize MoE groups by ProcessGroup (#1640) by HELSON
    • [moe] fix moe bugs (#1633) by HELSON
    • [moe] fix MoE bugs (#1628) by HELSON

    Pipeline/chimera

    • [pipeline/chimera] test chimera | fix bug of initializing (#1615) by Kirigaya Kazuto
    • [pipeline/chimera] reconstruct PipelineBase and Worker to support more feasible custom schedule | finish Chimera (#1595) by Kirigaya Kazuto

    Workflow

    • [workflow] deactivate conda environment before removing (#1606) by Frank Lee

    Fx/tuning

    • [fx/tuning] tune performance on rotor with meta info. (#1599) by Super Daniel

    Nfc

    • [NFC] add OPT serving (#1581) by binmakeswell
    • [NFC] polish ./colossalai/trainer/hooks/_lr_scheduler_hook.py code style (#1576) by Boyuan Yao
    • [NFC] polish colossalai/zero/sharded_model/reduce_scatter.py code style (#1554) by Fazzie-Maqianli
    • [NFC] polish utils/tensor_detector/init.py code style (#1573) by CsRic
    • [NFC] polish colossalai/nn/lr_scheduler/multistep.py code style (#1572) by Sze-qq
    • [NFC] polish colossalai/nn/lr_scheduler/torch.py code style (#1571) by superhao1995
    • [NFC] polish colossalai/nn/parallel/data_parallel.py code style (#1570) by Jiatong Han
    • [NFC] polish colossalai/pipeline/utils.py code style (#1562) by Zirui Zhu
    • [NFC] polish colossalai/fx/tracer/meta_patch/patched_module/convolution.py code style (#1563) by Xue Fuzhao
    • [NFC] polish colossalai/gemini/update/chunkv2.py code style (#1565) by Zangwei Zheng
    • [NFC] polish colossalai/nn/layer/colossalai_layer/dropout.py code style (#1568) by DouJS
    • [NFC] polish colossalai/utils/tensor_detector/tensor_detector.py code style (#1566) by LuGY
    • [NFC] polish colossalai/nn/_ops/embedding.py code style (#1561) by BigOneLiXiaoMing
    • [NFC] polish colossalai/builder/init.py code style (#1560) by Ziheng Qin
    • [NFC] polish colossalai/testing/comparison.py code style. (#1558) by Super Daniel
    • [NFC] polish colossalai/nn/layer/colossalai_layer/linear.py (#1556) by Ofey Chan
    • [NFC] polish code colossalai/gemini/update/search_utils.py (#1557) by Kai Wang (Victor Kai)
    • [NFC] polish colossalai/nn/_ops/layernorm.py code style (#1555) by yuxuan-lou
    • [NFC] polish colossalai/nn/loss/loss_2p5d.py code style (#1553) by shenggan
    • [NFC] polish colossalai/nn/_ops/embedding_bag.py code style (#1552) by Maruyama_Aya
    • [NFC] polish colossalai/nn/lr_scheduler/cosine.py code style by binmakeswell
    • [NFC] polish colossalai/utils/multi_tensor_apply/multi_tensor_apply.py code style (#1559) by Kirigaya Kazuto

    Full Changelog: https://github.com/hpcaitech/ColossalAI/compare/v0.1.11rc1...v0.1.10

    Source code(tar.gz)
    Source code(zip)
  • v0.1.10(Sep 8, 2022)

    What's Changed

    Embedding

    • [embedding] cache_embedding small improvement (#1564) by CsRic
    • [embedding] polish parallel embedding tablewise (#1545) by Jiarui Fang
    • [embedding] freq_aware_embedding: add small functions for caller application (#1537) by CsRic
    • [embedding] fix a bug in table wise sharding (#1538) by Jiarui Fang
    • [embedding] tablewise sharding polish (#1535) by Jiarui Fang
    • [embedding] add tablewise sharding for FAW (#1526) by CsRic

    Pipeline/tuning

    • [pipeline/tuning] improve dispatch performance both time and space cost (#1544) by Kirigaya Kazuto

    Fx

    • [fx] provide a stable but not accurate enough version of profiler. (#1547) by Super Daniel
    • [fx] Add common node in model linearize (#1542) by Boyuan Yao
    • [fx] support meta tracing for aten level computation graphs like functorch. (#1536) by Super Daniel
    • [fx] Modify solver linearize and add corresponding test (#1531) by Boyuan Yao
    • [fx] add test for meta tensor. (#1527) by Super Daniel
    • [fx]patch nn.functional convolution (#1528) by YuliangLiu0306
    • [fx] Fix wrong index in annotation and minimal flops in ckpt solver (#1521) by Boyuan Yao
    • [fx] hack torch_dispatch for meta tensor and autograd. (#1515) by Super Daniel
    • [fx] Fix activation codegen dealing with checkpointing first op (#1510) by Boyuan Yao
    • [fx] fix the discretize bug (#1506) by Boyuan Yao
    • [fx] fix wrong variable name in solver rotor (#1502) by Boyuan Yao
    • [fx] Add activation checkpoint solver rotor (#1496) by Boyuan Yao
    • [fx] add more op patches for profiler and error message for unsupported ops. (#1495) by Super Daniel
    • [fx] fixed adapative pooling size concatenation error (#1489) by Frank Lee
    • [fx] add profiler for fx nodes. (#1480) by Super Daniel
    • [fx] Fix ckpt functions' definitions in forward (#1476) by Boyuan Yao
    • [fx] fix MetaInfoProp for incorrect calculations and add detections for inplace op. (#1466) by Super Daniel
    • [fx] add rules to linearize computation graphs for searching. (#1461) by Super Daniel
    • [fx] Add use_reentrant=False to checkpoint in codegen (#1463) by Boyuan Yao
    • [fx] fix test and algorithm bugs in activation checkpointing. (#1451) by Super Daniel
    • [fx] Use colossalai checkpoint and add offload recognition in codegen (#1439) by Boyuan Yao
    • [fx] fix the false interpretation of algorithm 3 in https://arxiv.org/abs/1604.06174. (#1446) by Super Daniel

    Autoparallel

    • [autoparallel]add backward cost info into strategies (#1524) by YuliangLiu0306
    • [autoparallel] support fucntion in operator handler (#1529) by YuliangLiu0306
    • [autoparallel] change the merge node logic (#1533) by YuliangLiu0306
    • [autoparallel] added liveness analysis (#1516) by Frank Lee
    • [autoparallel] add more sharding strategies to conv (#1487) by YuliangLiu0306
    • [autoparallel] add cost graph class (#1481) by YuliangLiu0306
    • [autoparallel] added namespace constraints (#1490) by Frank Lee
    • [autoparallel] integrate auto parallel with torch fx (#1479) by Frank Lee
    • [autoparallel] added dot handler (#1475) by Frank Lee
    • [autoparallel] introduced baseclass for op handler and reduced code redundancy (#1471) by Frank Lee
    • [autoparallel] standardize the code structure (#1469) by Frank Lee
    • [autoparallel] Add conv handler to generate strategies and costs info for conv (#1467) by YuliangLiu0306

    Utils

    • [utils] refactor parallel layers checkpoint and bcast model on loading checkpoint (#1548) by ver217
    • [utils] optimize partition_tensor_parallel_state_dict (#1546) by ver217
    • [utils] Add use_reetrant=False in utils.activation_checkpoint (#1460) by Boyuan Yao
    • [utils] Impl clip_grad_norm for ColoTensor and ZeroOptimizer (#1442) by ver217

    Hotfix

    • [hotfix] change namespace for meta_trace. (#1541) by Super Daniel
    • [hotfix] fix init context (#1543) by ver217
    • [hotfix] avoid conflict of meta registry with torch 1.13.0. (#1530) by Super Daniel
    • [hotfix] fix coloproxy typos. (#1519) by Super Daniel

    Pipeline/pipleline_process_group

    • [pipeline/pipleline_process_group] finish PipelineProcessGroup to manage local abd global rank in TP,DP and PP (#1508) by Kirigaya Kazuto

    Doc

    • [doc] docstring for FreqAwareEmbeddingBag (#1525) by Jiarui Fang
    • [doc] update readme with the new xTrimoMultimer project (#1477) by Sze-qq
    • [doc] update docstring in ProcessGroup (#1468) by Jiarui Fang
    • [Doc] add more doc for ColoTensor. (#1458) by Jiarui Fang

    Faw

    • [FAW] cpu caching operations (#1520) by Jiarui Fang
    • [FAW] refactor reorder() for CachedParamMgr (#1514) by Jiarui Fang
    • [FAW] LFU initialize with dataset freq (#1513) by Jiarui Fang
    • [FAW] shrink freq_cnter size (#1509) by CsRic
    • [FAW] remove code related to chunk (#1501) by Jiarui Fang
    • [FAW] add more docs and fix a warning (#1500) by Jiarui Fang
    • [FAW] FAW embedding use LRU as eviction strategy intialized with dataset stats (#1494) by CsRic
    • [FAW] LFU cache for the FAW by CsRic
    • [FAW] init an LFU implementation for FAW (#1488) by Jiarui Fang
    • [FAW] reorganize the inheritance struct of FreqCacheEmbedding (#1448) by Geng Zhang

    Pipeline/rpc

    • [pipeline/rpc] update outstanding mechanism | optimize dispatching strategy (#1497) by Kirigaya Kazuto
    • [pipeline/rpc] implement distributed optimizer | test with assert_close (#1486) by Kirigaya Kazuto
    • [pipeline/rpc] support interleaving | fix checkpoint bug | change logic when dispatch data in work_list to ensure steady 1F1B (#1483) by Kirigaya Kazuto
    • [pipeline/rpc] implement a demo for PP with cuda rpc framework (#1470) by Kirigaya Kazuto

    Tensor

    • [tensor]add 1D device mesh (#1492) by YuliangLiu0306
    • [tensor] support runtime ShardingSpec apply (#1453) by YuliangLiu0306
    • [tensor] shape consistency generate transform path and communication cost (#1435) by YuliangLiu0306
    • [tensor] added linear implementation for the new sharding spec (#1416) by Frank Lee

    Fce

    • [FCE] update interface for frequency statistics in FreqCacheEmbedding (#1462) by Geng Zhang

    Workflow

    • [workflow] added TensorNVMe to compatibility test (#1449) by Frank Lee

    Test

    • [test] fixed the activation codegen test (#1447) by Frank Lee

    Engin/schedule

    • [engin/schedule] use p2p_v2 to recontruct pipeline_schedule (#1408) by Kirigaya Kazuto

    Full Changelog: https://github.com/hpcaitech/ColossalAI/compare/v0.1.10...v0.1.9

    Source code(tar.gz)
    Source code(zip)
  • v0.1.9(Aug 11, 2022)

    What's Changed

    Zero

    • [zero] add chunk_managerV2 for all-gather chunk (#1441) by HELSON
    • [zero] add chunk size searching algorithm for parameters in different groups (#1436) by HELSON
    • [zero] add has_inf_or_nan in AgChunk; enhance the unit test of AgChunk (#1426) by HELSON
    • [zero] add unit test for AgChunk's append, close, access (#1423) by HELSON
    • [zero] add AgChunk (#1417) by HELSON
    • [zero] ZeroDDP supports controlling outputs' dtype (#1399) by ver217
    • [zero] alleviate memory usage in ZeRODDP state_dict (#1398) by HELSON
    • [zero] chunk manager allows filtering ex-large params (#1393) by ver217
    • [zero] zero optim state_dict takes only_rank_0 (#1384) by ver217

    Fx

    • [fx] add vanilla activation checkpoint search with test on resnet and densenet (#1433) by Super Daniel
    • [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages (#1425) by Super Daniel
    • [fx] fixed torchaudio conformer tracing (#1392) by Frank Lee
    • [fx] patched torch.max and data movement operator (#1391) by Frank Lee
    • [fx] fixed indentation error in checkpointing codegen (#1385) by Frank Lee
    • [fx] patched torch.full for huggingface opt (#1386) by Frank Lee
    • [fx] update split module pass and add customized policy (#1373) by YuliangLiu0306
    • [fx] add torchaudio test (#1369) by Super Daniel
    • [fx] Add colotracer compatibility test on torchrec (#1370) by Boyuan Yao
    • [fx]add gpt2 passes for pipeline performance test (#1366) by YuliangLiu0306
    • [fx] added activation checkpoint codegen support for torch < 1.12 (#1359) by Frank Lee
    • [fx] added activation checkpoint codegen (#1355) by Frank Lee
    • [fx] fixed apex normalization patch exception (#1352) by Frank Lee
    • [fx] added activation checkpointing annotation (#1349) by Frank Lee
    • [fx] update MetaInforProp pass to process more complex node.meta (#1344) by YuliangLiu0306
    • [fx] refactor tracer to trace complete graph (#1342) by YuliangLiu0306
    • [fx] tested the complete workflow for auto-parallel (#1336) by Frank Lee
    • [fx]refactor tracer (#1335) by YuliangLiu0306
    • [fx] recovered skipped pipeline tests (#1338) by Frank Lee
    • [fx] fixed compatiblity issue with torch 1.10 (#1331) by Frank Lee
    • [fx] fixed unit tests for torch 1.12 (#1327) by Frank Lee
    • [fx] add balanced policy v2 (#1251) by YuliangLiu0306
    • [fx] Add unit test and fix bugs for transform_mlp_pass (#1299) by XYE
    • [fx] added apex normalization to patched modules (#1300) by Frank Lee

    Recommendation System

    • [FAW] export FAW in _ops (#1438) by Jiarui Fang
    • [FAW] move coloparam setting in test code. (#1429) by Jiarui Fang
    • [FAW] parallel FreqAwareEmbedding (#1424) by Jiarui Fang
    • [FAW] add cache manager for the cached embedding (#1419) by Jiarui Fang

    Global Tensor

    • [tensor] add shape consistency feature to support auto spec transform (#1418) by YuliangLiu0306
    • [tensor]build sharding spec to replace distspec in future. (#1405) by YuliangLiu0306

    Hotfix

    • [hotfix] zero optim prevents calling inner optim.zero_grad (#1422) by ver217
    • [hotfix] fix CPUAdam kernel nullptr (#1410) by ver217
    • [hotfix] adapt ProcessGroup and Optimizer to ColoTensor (#1388) by HELSON
    • [hotfix] fix a running error in test_colo_checkpoint.py (#1387) by HELSON
    • [hotfix] fix some bugs during gpt2 testing (#1379) by YuliangLiu0306
    • [hotfix] fix zero optim save/load state dict (#1381) by ver217
    • [hotfix] fix zero ddp buffer cast (#1376) by ver217
    • [hotfix] fix no optimizer in save/load (#1363) by HELSON
    • [hotfix] fix megatron_init in test_gpt2.py (#1357) by HELSON
    • [hotfix] ZeroDDP use new process group (#1333) by ver217
    • [hotfix] shared model returns cpu state_dict (#1328) by ver217
    • [hotfix] fix ddp for unit test test_gpt2 (#1326) by HELSON
    • [hotfix] fix unit test test_module_spec (#1321) by HELSON
    • [hotfix] fix PipelineSharedModuleGradientHandler (#1314) by ver217
    • [hotfix] fix ColoTensor GPT2 unitest (#1309) by HELSON
    • [hotfix] add missing file (#1308) by Jiarui Fang
    • [hotfix] remove potiential circle import (#1307) by Jiarui Fang
    • [hotfix] skip some unittest due to CI environment. (#1301) by YuliangLiu0306
    • [hotfix] fix shape error in backward when using ColoTensor (#1298) by HELSON
    • [hotfix] Dist Mgr gather torch version (#1284) by Jiarui Fang

    Communication

    • [communication] add p2p_v2.py to support communication with List[Any] (#1407) by Kirigaya Kazuto

    Device

    • [device] add DeviceMesh class to support logical device layout (#1394) by YuliangLiu0306

    Chunk

    • [chunk] add PG check for tensor appending (#1383) by Jiarui Fang

    DDP

    • [DDP] test ddp state dict uses more strict threshold (#1382) by ver217

    Checkpoint

    • [checkpoint] add kwargs for load_state_dict (#1374) by HELSON
    • [checkpoint] use args, kwargs in save_checkpoint, load_checkpoint (#1368) by HELSON
    • [checkpoint] sharded optim save/load grad scaler (#1350) by ver217
    • [checkpoint] use gather_tensor in checkpoint and update its unit test (#1339) by HELSON
    • [checkpoint] add ColoOptimizer checkpointing (#1316) by Jiarui Fang
    • [checkpoint] add test for bert and hotfix save bugs (#1297) by Jiarui Fang

    Util

    • [util] standard checkpoint function naming (#1377) by Frank Lee

    Nvme

    • [nvme] CPUAdam and HybridAdam support NVMe offload (#1360) by ver217

    Colotensor

    • [colotensor] use cpu memory to store state_dict (#1367) by HELSON
    • [colotensor] add Tensor.view op and its unit test (#1343) by HELSON

    Unit test

    • [unit test] add megatron init test in zero_optim (#1358) by HELSON

    Docker

    • [docker] add tensornvme in docker (#1354) by ver217

    Doc

    • [doc] update rst and docstring (#1351) by ver217

    Refactor

    • [refactor] refactor ColoTensor's unit tests (#1340) by HELSON

    Workflow

    • [workflow] update docker build workflow to use proxy (#1334) by Frank Lee
    • [workflow] update 8-gpu test to use torch 1.11 (#1332) by Frank Lee
    • [workflow] roll back to use torch 1.11 for unit testing (#1325) by Frank Lee
    • [workflow] fixed trigger condition for 8-gpu unit test (#1323) by Frank Lee
    • [workflow] updated release bdist workflow (#1318) by Frank Lee
    • [workflow] disable SHM for compatibility CI on rtx3080 (#1315) by Frank Lee
    • [workflow] updated pytorch compatibility test (#1311) by Frank Lee

    Test

    • [test] removed outdated unit test for meta context (#1329) by Frank Lee

    Utils

    • [utils] integrated colotensor with lazy init context (#1324) by Frank Lee

    Optimizer

    • [Optimizer] Remove useless ColoOptimizer (#1312) by Jiarui Fang
    • [Optimizer] polish the init method of ColoOptimizer (#1310) by Jiarui Fang

    Full Changelog: https://github.com/hpcaitech/ColossalAI/compare/v0.1.9...v0.1.8

    Source code(tar.gz)
    Source code(zip)
  • v0.1.8(Jul 12, 2022)

    What's Changed

    Hotfix

    • [hotfix] torchvison fx unittests miss import pytest (#1277) by Jiarui Fang
    • [hotfix] fix an assertion bug in base schedule. (#1250) by YuliangLiu0306
    • [hotfix] fix sharded optim step and clip_grad_norm (#1226) by ver217
    • [hotfix] fx get comm size bugs (#1233) by Jiarui Fang
    • [hotfix] fx shard 1d pass bug fixing (#1220) by Jiarui Fang
    • [hotfix]fixed p2p process send stuck (#1181) by YuliangLiu0306
    • [hotfix]different overflow status lead to communication stuck. (#1175) by YuliangLiu0306
    • [hotfix]fix some bugs caused by refactored schedule. (#1148) by YuliangLiu0306

    Tensor

    • [tensor] distributed checkpointing for parameters (#1240) by Jiarui Fang
    • [tensor] redistribute among different process groups (#1247) by Jiarui Fang
    • [tensor] a shorter shard and replicate spec (#1245) by Jiarui Fang
    • [tensor] redirect .data.get to a tensor instance (#1239) by HELSON
    • [tensor] add zero_like colo op, important for Optimizer (#1236) by Jiarui Fang
    • [tensor] fix some unittests (#1234) by Jiarui Fang
    • [tensor] fix a assertion in colo_tensor cross_entropy (#1232) by HELSON
    • [tensor] add unitest for colo_tensor 1DTP cross_entropy (#1230) by HELSON
    • [tensor] torch function return colotensor (#1229) by Jiarui Fang
    • [tensor] improve robustness of class 'ProcessGroup' (#1223) by HELSON
    • [tensor] sharded global process group (#1219) by Jiarui Fang
    • [Tensor] add cpu group to ddp (#1200) by Jiarui Fang
    • [tensor] remove gpc in tensor tests (#1186) by Jiarui Fang
    • [tensor] revert local view back (#1178) by Jiarui Fang
    • [Tensor] rename some APIs in TensorSpec and Polish view unittest (#1176) by Jiarui Fang
    • [Tensor] rename parallel_action (#1174) by Ziyue Jiang
    • [Tensor] distributed view supports inter-process hybrid parallel (#1169) by Jiarui Fang
    • [Tensor] remove ParallelAction, use ComputeSpec instread (#1166) by Jiarui Fang
    • [tensor] add embedding bag op (#1156) by ver217
    • [tensor] add more element-wise ops (#1155) by ver217
    • [tensor] fixed non-serializable colo parameter during model checkpointing (#1153) by Frank Lee
    • [tensor] dist spec s2s uses all-to-all (#1136) by ver217
    • [tensor] added repr to spec (#1147) by Frank Lee

    Fx

    • [fx] added ndim property to proxy (#1253) by Frank Lee
    • [fx] fixed tracing with apex-based T5 model (#1252) by Frank Lee
    • [fx] refactored the file structure of patched function and module (#1238) by Frank Lee
    • [fx] methods to get fx graph property. (#1246) by YuliangLiu0306
    • [fx]add split module pass and unit test from pipeline passes (#1242) by YuliangLiu0306
    • [fx] fixed huggingface OPT and T5 results misalignment (#1227) by Frank Lee
    • [fx]get communication size between partitions (#1224) by YuliangLiu0306
    • [fx] added patches for tracing swin transformer (#1228) by Frank Lee
    • [fx] fixed timm tracing result misalignment (#1225) by Frank Lee
    • [fx] added timm model tracing testing (#1221) by Frank Lee
    • [fx] added torchvision model tracing testing (#1216) by Frank Lee
    • [fx] temporarily used (#1215) by XYE
    • [fx] added testing for all albert variants (#1211) by Frank Lee
    • [fx] added testing for all gpt variants (#1210) by Frank Lee
    • [fx]add uniform policy (#1208) by YuliangLiu0306
    • [fx] added testing for all bert variants (#1207) by Frank Lee
    • [fx] supported model tracing for huggingface bert (#1201) by Frank Lee
    • [fx] added module patch for pooling layers (#1197) by Frank Lee
    • [fx] patched conv and normalization (#1188) by Frank Lee
    • [fx] supported data-dependent control flow in model tracing (#1185) by Frank Lee

    Rename

    • [rename] convert_to_dist -> redistribute (#1243) by Jiarui Fang

    Checkpoint

    • [checkpoint] save sharded optimizer states (#1237) by Jiarui Fang
    • [checkpoint]support generalized scheduler (#1222) by Yi Zhao
    • [checkpoint] make unitest faster (#1217) by Jiarui Fang
    • [checkpoint] checkpoint for ColoTensor Model (#1196) by Jiarui Fang

    Polish

    • [polish] polish repr for ColoTensor, DistSpec, ProcessGroup (#1235) by HELSON

    Refactor

    • [refactor] move process group from _DistSpec to ColoTensor. (#1203) by Jiarui Fang
    • [refactor] remove gpc dependency in colotensor's _ops (#1189) by Jiarui Fang
    • [refactor] move chunk and chunkmgr to directory gemini (#1182) by Jiarui Fang

    Context

    • [context]support arbitrary module materialization. (#1193) by YuliangLiu0306
    • [context]use meta tensor to init model lazily. (#1187) by YuliangLiu0306

    Ddp

    • [ddp] ColoDDP uses bucket all-reduce (#1177) by ver217
    • [ddp] refactor ColoDDP and ZeroDDP (#1146) by ver217

    Colotensor

    • [ColoTensor] add independent process group (#1179) by Jiarui Fang
    • [ColoTensor] rename APIs and add output_replicate to ComputeSpec (#1168) by Jiarui Fang
    • [ColoTensor] improves init functions. (#1150) by Jiarui Fang

    Zero

    • [zero] sharded optim supports loading local state dict (#1170) by ver217
    • [zero] zero optim supports loading local state dict (#1171) by ver217

    Workflow

    • [workflow] polish readme and dockerfile (#1165) by Frank Lee
    • [workflow] auto-publish docker image upon release (#1164) by Frank Lee
    • [workflow] fixed release post workflow (#1154) by Frank Lee
    • [workflow] fixed format error in yaml file (#1145) by Frank Lee
    • [workflow] added workflow to auto draft the release post (#1144) by Frank Lee

    Gemini

    • [gemini] refactor gemini mgr (#1151) by ver217

    Ci

    • [ci] added scripts to auto-generate release post text (#1142) by Frank Lee

    Full Changelog: https://github.com/hpcaitech/ColossalAI/compare/v0.1.7...v0.1.8

    Source code(tar.gz)
    Source code(zip)
  • v0.1.7(Jun 21, 2022)

    Version v0.1.7 Released Today

    Highlights

    • Started torch.fx support for auto-parallel training (see the tracing sketch after this list)
    • Updated the ZeRO mechanism with ColoTensor
    • Fixed various bugs
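
    As background for the torch.fx highlight: a model is first symbolically traced into a torch.fx graph, and the passes in this line of work (partitioning, communication-size analysis) then operate on that graph. Below is a minimal tracing sketch; the ColoTracer class and its meta_args keyword are assumptions based on the colossalai.fx module of this era, so check the release docs before relying on them.

    import torch
    import torch.nn as nn
    from colossalai.fx import ColoTracer  # assumed module path for this era


    class TinyMLP(nn.Module):
        def __init__(self):
            super().__init__()
            self.linear = nn.Linear(16, 16)

        def forward(self, x):
            return self.linear(x).relu()


    model = TinyMLP()
    tracer = ColoTracer()
    # tracing with meta tensors avoids allocating real memory for activations
    graph = tracer.trace(model, meta_args={'x': torch.rand(2, 16, device='meta')})
    gm = torch.fx.GraphModule(model, graph)
    print(gm.code)  # inspect the captured graph as generated Python code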

    What's Changed

    Hotfix

    • [hotfix] prevent nested ZeRO (#1140) by ver217
    • [hotfix]fix bugs caused by refactored pipeline (#1133) by YuliangLiu0306
    • [hotfix] fix param op hook (#1131) by ver217
    • [hotfix] fix zero init ctx numel (#1128) by ver217
    • [hotfix]change to fit latest p2p (#1100) by YuliangLiu0306
    • [hotfix] fix chunk comm src rank (#1072) by ver217

    Zero

    • [zero] avoid zero hook spam by changing log to debug level (#1137) by Frank Lee
    • [zero] added error message to handle on-the-fly import of torch Module class (#1135) by Frank Lee
    • [zero] fixed api consistency (#1098) by Frank Lee
    • [zero] zero optim copy chunk rather than copy tensor (#1070) by ver217

    Optim

    • [optim] refactor fused sgd (#1134) by ver217

    Ddp

    • [ddp] add save/load state dict for ColoDDP (#1127) by ver217
    • [ddp] add set_params_to_ignore for ColoDDP (#1122) by ver217
    • [ddp] supported customized torch ddp configuration (#1123) by Frank Lee

    Pipeline

    • [pipeline]support List of Dict data (#1125) by YuliangLiu0306
    • [pipeline] supported more flexible dataflow control for pipeline parallel training (#1108) by Frank Lee
    • [pipeline] refactor the pipeline module (#1087) by Frank Lee

    Gemini

    • [gemini] gemini mgr supports "cpu" placement policy (#1118) by ver217
    • [gemini] zero supports gemini (#1093) by ver217

    Test

    • [test] fixed hybrid parallel test case on 8 GPUs (#1106) by Frank Lee
    • [test] skip tests when not enough GPUs are detected (#1090) by Frank Lee
    • [test] ignore 8 gpu test (#1080) by Frank Lee

    Release

    • [release] update version.txt (#1103) by Frank Lee

    Tensor

    • [tensor] refactor param op hook (#1097) by ver217
    • [tensor] refactor chunk mgr and impl MemStatsCollectorV2 (#1077) by ver217
    • [Tensor] fix equal assert (#1091) by Ziyue Jiang
    • [Tensor] 1d row embedding (#1075) by Ziyue Jiang
    • [tensor] chunk manager monitor mem usage (#1076) by ver217
    • [Tensor] fix optimizer for CPU parallel (#1069) by Ziyue Jiang
    • [Tensor] add hybrid device demo and fix bugs (#1059) by Ziyue Jiang

    Amp

    • [amp] included dict for type casting of model output (#1102) by Frank Lee

    Workflow

    • [workflow] fixed 8-gpu test workflow (#1101) by Frank Lee
    • [workflow] added regular 8 GPU testing (#1099) by Frank Lee
    • [workflow] disable p2p via shared memory on non-nvlink machine (#1086) by Frank Lee

    Engine

    • [engine] fixed empty op hook check (#1096) by Frank Lee

    Doc

    • [doc] added documentation to chunk and chunk manager (#1094) by Frank Lee

    Context

    • [context] support lazy init of module (#1088) by Frank Lee
    • [context] maintain the context object in with statement (#1073) by Frank Lee

    Refactory

    • [refactory] add nn.parallel module (#1068) by Jiarui Fang

    Cudnn

    • [cudnn] set False to cudnn benchmark by default (#1063) by Frank Lee

    Full Changelog: https://github.com/hpcaitech/ColossalAI/compare/v0.1.6...v0.1.7

    Source code(tar.gz)
    Source code(zip)
  • v0.1.6(Jun 2, 2022)

    Main features

    1. ColoTensor supports hybrid parallelism (tensor parallelism plus data parallelism)
    2. ColoTensor supports ZeRO (with chunks)
    3. Tensor parallelism can be configured per module via ColoTensor
    4. ZeroInitContext and ShardedModelV2 support loading checkpoints and Hugging Face from_pretrained() (see the sketch after this list)
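
    For feature 4, the point is that pretrained weights can be materialized directly into sharded storage instead of first building the full model on one device. A minimal sketch, assuming the v0.1.x constructor arguments (target_device, shard_strategy, shard_param) and an installed transformers package:

    import torch
    from transformers import BertForSequenceClassification
    from colossalai.zero.init_ctx import ZeroInitContext
    from colossalai.zero.shard_utils import TensorShardStrategy

    # weights loaded by from_pretrained() land directly in ZeRO-sharded storage
    with ZeroInitContext(target_device=torch.device('cuda'),
                         shard_strategy=TensorShardStrategy(),
                         shard_param=True):
        model = BertForSequenceClassification.from_pretrained('bert-base-uncased')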

    What's Changed

    ColoTensor

    • [tensor] refactor colo-tensor by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/992
    • [tensor] refactor parallel action by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/1007
    • [tensor] impl ColoDDP for ColoTensor by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/1009
    • [Tensor] add module handler for linear by @Wesley-Jzy in https://github.com/hpcaitech/ColossalAI/pull/1021
    • [Tensor] add module check and bert test by @Wesley-Jzy in https://github.com/hpcaitech/ColossalAI/pull/1031
    • [Tensor] add Parameter inheritance for ColoParameter by @Wesley-Jzy in https://github.com/hpcaitech/ColossalAI/pull/1041
    • [tensor] ColoTensor supports ZeRo by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/1015
    • [zero] add chunk size search for chunk manager by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/1052

    Zero

    • [zero] add load_state_dict for sharded model by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/894
    • [zero] add zero optimizer for ColoTensor by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/1046

    Hotfix

    • [hotfix] fix colo init context by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/1026
    • [hotfix] fix some bugs caused by size mismatch. by @YuliangLiu0306 in https://github.com/hpcaitech/ColossalAI/pull/1011
    • [kernel] fixed the include bug in dropout kernel by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/999
    • fix typo in constants by @ryanrussell in https://github.com/hpcaitech/ColossalAI/pull/1027
    • [engine] fixed bug in gradient accumulation dataloader to keep the last step by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/1030
    • [hotfix] fix dist spec mgr by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/1045
    • [hotfix] fix import error in sharded model v2 by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/1053

    Unit test

    • [unit test] refactor test tensor by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/1005

    CI

    • [ci] update the docker image name by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/1017
    • [ci] added nightly build (#1018) by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/1019
    • [ci] fixed nightly build workflow by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/1022
    • [ci] fixed nightly build workflow by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/1029
    • [ci] fixed nightly build workflow by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/1040

    CLI

    • [cli] remove unused imports by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/1001

    Documentation

    • Hotfix/format by @binmakeswell in https://github.com/hpcaitech/ColossalAI/pull/987
    • [doc] update docker instruction by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/1020

    Misc

    • [NFC] Hotfix/format by @binmakeswell in https://github.com/hpcaitech/ColossalAI/pull/984
    • Revert "[NFC] Hotfix/format" by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/986
    • remove useless import in tensor dir by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/997
    • [NFC] fix download link by @binmakeswell in https://github.com/hpcaitech/ColossalAI/pull/998
    • [Bot] Synchronize Submodule References by @github-actions in https://github.com/hpcaitech/ColossalAI/pull/1003
    • [NFC] polish colossalai/kernel/cuda_native/csrc/colossal_C_frontend.c… by @zhengzangw in https://github.com/hpcaitech/ColossalAI/pull/1010
    • [NFC] fix paper link by @binmakeswell in https://github.com/hpcaitech/ColossalAI/pull/1012
    • [p2p]add object list send/recv by @YuliangLiu0306 in https://github.com/hpcaitech/ColossalAI/pull/1024
    • [Bot] Synchronize Submodule References by @github-actions in https://github.com/hpcaitech/ColossalAI/pull/1034
    • [NFC] add inference by @binmakeswell in https://github.com/hpcaitech/ColossalAI/pull/1044
    • [titans]remove model zoo by @YuliangLiu0306 in https://github.com/hpcaitech/ColossalAI/pull/1042
    • [NFC] add inference submodule in path by @binmakeswell in https://github.com/hpcaitech/ColossalAI/pull/1047
    • [release] update version.txt by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/1048
    • [Bot] Synchronize Submodule References by @github-actions in https://github.com/hpcaitech/ColossalAI/pull/1049
    • updated collective ops api by @kurisusnowdeng in https://github.com/hpcaitech/ColossalAI/pull/1054
    • [pipeline]refactor ppschedule to support tensor list by @YuliangLiu0306 in https://github.com/hpcaitech/ColossalAI/pull/1050

    New Contributors

    • @ryanrussell made their first contribution in https://github.com/hpcaitech/ColossalAI/pull/1027

    Full Changelog: https://github.com/hpcaitech/ColossalAI/compare/v0.1.5...v0.1.6

    Source code(tar.gz)
    Source code(zip)
  • v0.1.5(May 17, 2022)

    Main Features

    1. Enhance ColoTensor and build a demo that trains BERT (from Hugging Face) with tensor parallelism without modifying the model (see the sketch after this list).
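
    The mechanism behind this demo is that every parameter created inside the init context is a ColoTensor, so a tensor-parallel spec can be attached to chosen weights afterwards rather than rewriting the model. A minimal sketch, assuming the ColoInitContext API of this era (the module path and device keyword are assumptions) and an installed transformers package:

    import torch
    from transformers import BertModel
    from colossalai.utils.model.colo_init_context import ColoInitContext

    # parameters created inside the context are ColoTensors; a 1D row/col shard
    # spec can later be set on selected weights without touching the model code
    with ColoInitContext(device=torch.device('cpu')):
        model = BertModel.from_pretrained('bert-base-uncased')

    print(type(next(model.parameters())))  # expected: a ColoTensor/ColoParameter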

    What's Changed

    ColoTensor

    • [Tensor] add ColoTensor TP1Dcol Embedding by @Wesley-Jzy in https://github.com/hpcaitech/ColossalAI/pull/899
    • [Tensor] add embedding tp1d row by @Wesley-Jzy in https://github.com/hpcaitech/ColossalAI/pull/904
    • [Tensor] update pytest.mark.parametrize in tensor tests by @Wesley-Jzy in https://github.com/hpcaitech/ColossalAI/pull/913
    • [Tensor] init ColoParameter by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/914
    • [Tensor] add a basic bert. by @Wesley-Jzy in https://github.com/hpcaitech/ColossalAI/pull/911
    • [Tensor] polish model test by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/915
    • [Tensor] fix test_model by @Wesley-Jzy in https://github.com/hpcaitech/ColossalAI/pull/916
    • [Tensor] add 1d vocab loss by @Wesley-Jzy in https://github.com/hpcaitech/ColossalAI/pull/918
    • [Graph] building computing graph with ColoTensor, Linear only by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/917
    • [Tensor] add from_pretrained support and bert pretrained test by @Wesley-Jzy in https://github.com/hpcaitech/ColossalAI/pull/921
    • [Tensor] test pretrain loading on multi-process by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/922
    • [tensor] hijack addmm for colo tensor by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/923
    • [tensor] colo tensor overrides mul by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/927
    • [Tensor] simplify named param by @Wesley-Jzy in https://github.com/hpcaitech/ColossalAI/pull/928
    • [Tensor] fix init context by @Wesley-Jzy in https://github.com/hpcaitech/ColossalAI/pull/931
    • [Tensor] add optimizer to bert test by @Wesley-Jzy in https://github.com/hpcaitech/ColossalAI/pull/933
    • [tensor] design DistSpec and DistSpecManager for ColoTensor by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/934
    • [Tensor] add DistSpec for loss and test_model by @Wesley-Jzy in https://github.com/hpcaitech/ColossalAI/pull/947
    • [tensor] derive compute pattern from dist spec by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/971

    Pipeline Parallelism

    • [pipelinable]use pipelinable to support GPT model. by @YuliangLiu0306 in https://github.com/hpcaitech/ColossalAI/pull/903

    CI

    • [CI] add CI for releasing bdist wheel by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/901
    • [CI] fix release bdist CI by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/902
    • [ci] added wheel build scripts by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/910

    Misc

    • [Bot] Synchronize Submodule References by @github-actions in https://github.com/hpcaitech/ColossalAI/pull/907
    • [Bot] Synchronize Submodule References by @github-actions in https://github.com/hpcaitech/ColossalAI/pull/912
    • [setup] update cuda ext cc flags by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/919
    • [setup] support more cuda architectures by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/920
    • [NFC] update results on a single GPU, highlight quick view by @binmakeswell in https://github.com/hpcaitech/ColossalAI/pull/981

    Full Changelog: https://github.com/hpcaitech/ColossalAI/compare/v0.1.4...v0.1.5

    Source code(tar.gz)
    Source code(zip)
  • v0.1.4(Apr 28, 2022)

    Main Features

    Here are the main improvements of this release:

    1. ColoTensor: a data structure that unifies the tensor representation of different parallel methods.
    2. Gemini: a more efficient Gemini implementation that reduces the overhead of collecting model data statistics.
    3. CLI: a command-line tool that helps users launch distributed training tasks more easily (see the sketch after this list).
    4. Pipeline Parallelism (PP): a more user-friendly API for PP.
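
    For the CLI item, a training script that initializes the distributed environment can be started with the new launcher. A minimal sketch; the exact launcher invocation (e.g. colossalai run --nproc_per_node 4 train.py) is an assumption for this release, so consult colossalai --help:

    # train.py
    import colossalai
    from colossalai.context import ParallelMode
    from colossalai.core import global_context as gpc

    # rank and world size are read from the environment set up by the launcher
    colossalai.launch_from_torch(config={})
    print(f'rank {gpc.get_global_rank()} / {gpc.get_world_size(ParallelMode.GLOBAL)} ready')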

    What's Changed

    ColoTensor

    • [tensor]fix colo_tensor torch_function by @Wesley-Jzy in https://github.com/hpcaitech/ColossalAI/pull/825
    • [tensor]fix test_linear by @Wesley-Jzy in https://github.com/hpcaitech/ColossalAI/pull/826
    • [tensor] ZeRO use ColoTensor as the base class. by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/828
    • [tensor] revert zero tensors back by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/829
    • [Tensor] overriding parameters() for Module using ColoTensor by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/889
    • [tensor] refine linear and add gather for layernorm by @Wesley-Jzy in https://github.com/hpcaitech/ColossalAI/pull/893
    • [Tensor] test parameters() as member function by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/896
    • [Tensor] activation is an attr of ColoTensor by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/897
    • [Tensor] initialize the ColoOptimizer by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/898
    • [tensor] reorganize files by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/820
    • [Tensor] apply ColoTensor on Torch functions by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/821
    • [Tensor] update ColoTensor torch_function by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/822
    • [tensor] lazy init by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/823
    • [WIP] Applying ColoTensor on TP-1D-row Linear. by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/831
    • Init Context supports lazy allocate model memory by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/842
    • [Tensor] TP Linear 1D row by @Wesley-Jzy in https://github.com/hpcaitech/ColossalAI/pull/843
    • [Tensor] add assert for colo_tensor 1Drow by @Wesley-Jzy in https://github.com/hpcaitech/ColossalAI/pull/846
    • [Tensor] init a simple network training with ColoTensor by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/849
    • [Tensor] Add 1Drow weight reshard by spec by @Wesley-Jzy in https://github.com/hpcaitech/ColossalAI/pull/854
    • [Tensor] add layer norm Op by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/852
    • [tensor] an initial idea of tensor spec by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/865
    • [Tensor] colo init context add device attr. by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/866
    • [tensor] add cross_entropy_loss by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/868
    • [Tensor] Add function to spec and update linear 1Drow and unit tests by @Wesley-Jzy in https://github.com/hpcaitech/ColossalAI/pull/869
    • [tensor] customized op returns ColoTensor by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/875
    • [Tensor] get named parameters for model using ColoTensors by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/874
    • [Tensor] Add some attributes to ColoTensor by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/877
    • [Tensor] make a simple net works with 1D row TP by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/879
    • [tensor] wrap function in the torch_tensor to ColoTensor by @Wesley-Jzy in https://github.com/hpcaitech/ColossalAI/pull/881
    • [Tensor] make ColoTensor more robust for getattr by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/886
    • [Tensor] test model check results for a simple net by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/887
    • [tensor] add ColoTensor 1Dcol by @Wesley-Jzy in https://github.com/hpcaitech/ColossalAI/pull/888

    Gemini + ZeRO

    • [zero] add zero tensor shard strategy by @1SAA in https://github.com/hpcaitech/ColossalAI/pull/793
    • Revert "[zero] add zero tensor shard strategy" by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/806
    • [gemini] a new tensor structure by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/818
    • [gemini] APIs to set cpu memory capacity by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/809
    • [DO NOT MERGE] [zero] init fp16 params directly in ZeroInitContext by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/808
    • [gemini] collect cpu-gpu moving volume in each iteration by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/813
    • [gemini] add GeminiMemoryManger by @1SAA in https://github.com/hpcaitech/ColossalAI/pull/832
    • [zero] use GeminiMemoryManager when sampling model data by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/850
    • [gemini] polish code by @1SAA in https://github.com/hpcaitech/ColossalAI/pull/855
    • [gemini] add stateful tensor container by @1SAA in https://github.com/hpcaitech/ColossalAI/pull/867
    • [gemini] polish stateful_tensor_mgr by @1SAA in https://github.com/hpcaitech/ColossalAI/pull/876
    • [gemini] accelerate adjust_layout() by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/878

    CLI

    • [cli] added distributed launcher command by @YuliangLiu0306 in https://github.com/hpcaitech/ColossalAI/pull/791
    • [cli] added micro benchmarking for tp by @YuliangLiu0306 in https://github.com/hpcaitech/ColossalAI/pull/789
    • [cli] add missing requirement by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/805
    • [cli] fixed a bug in user args and refactored the module structure by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/807
    • [cli] fixed single-node process launching by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/812
    • [cli] added check installation cli by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/815
    • [CLI] refactored the launch CLI and fixed bugs in multi-node launching by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/844
    • [cli] refactored micro-benchmarking cli and added more metrics by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/858

    Pipeline Parallelism

    • [pipelinable]use pipelinable context to initialize non-pipeline model by @YuliangLiu0306 in https://github.com/hpcaitech/ColossalAI/pull/816
    • [pipelinable]use ColoTensor to replace dummy tensor. by @YuliangLiu0306 in https://github.com/hpcaitech/ColossalAI/pull/853

    Misc

    • [hotfix] fix auto tensor placement policy by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/775
    • [hotfix] change the check assert in split batch 2d by @Wesley-Jzy in https://github.com/hpcaitech/ColossalAI/pull/772
    • [hotfix] fix bugs in zero by @1SAA in https://github.com/hpcaitech/ColossalAI/pull/781
    • [hotfix] fix grad offload when enabling reuse_fp16_shard by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/784
    • [refactor] moving memtracer to gemini by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/801
    • [log] display tflops if available by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/802
    • [refactor] moving grad acc logic to engine by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/804
    • [log] local throughput metrics by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/811
    • [Bot] Synchronize Submodule References by @github-actions in https://github.com/hpcaitech/ColossalAI/pull/810
    • [Bot] Synchronize Submodule References by @github-actions in https://github.com/hpcaitech/ColossalAI/pull/819
    • [refactor] moving InsertPostInitMethodToModuleSubClasses to utils. by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/824
    • [setup] allow installation with python 3.6 by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/834
    • Revert "[WIP] Applying ColoTensor on TP-1D-row Linear." by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/835
    • [dependency] removed torchvision by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/833
    • [Bot] Synchronize Submodule References by @github-actions in https://github.com/hpcaitech/ColossalAI/pull/827
    • [unittest] refactored unit tests for change in dependency by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/838
    • [setup] use env var instead of option for cuda ext by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/839
    • [hotfix] ColoTensor pin_memory by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/840
    • modified the pp build for ckpt adaptation by @Gy-Lu in https://github.com/hpcaitech/ColossalAI/pull/803
    • [hotfix] the bug of numel() in ColoTensor by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/845
    • [hotfix] fix _post_init_method of zero init ctx by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/847
    • [hotfix] add deconstructor for stateful tensor by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/848
    • [utils] refactor profiler by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/837
    • [ci] cache cuda extension by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/860
    • hotfix tensor unittest bugs by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/862
    • [usability] added assertion message in registry by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/864
    • [doc] improved docstring in the communication module by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/863
    • [doc] improved docstring in the logging module by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/861
    • [doc] improved docstring in the amp module by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/857
    • [usability] improved error messages in the context module by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/856
    • [doc] improved error messages in initialize by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/872
    • [doc] improved assertion messages in trainer by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/873
    • [doc] improved docstring and assertion messages for the engine module by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/871
    • [hotfix] fix import error by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/880
    • [setup] add local version label by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/890
    • [model_zoo] change qkv processing by @Gy-Lu in https://github.com/hpcaitech/ColossalAI/pull/870

    Full Changelog: https://github.com/hpcaitech/ColossalAI/compare/v0.1.3...v0.1.4

    Source code(tar.gz)
    Source code(zip)
  • v0.1.3(Apr 16, 2022)

    Overview

    Here are the main improvements of this release:

    1. Gemini: a heterogeneous memory space manager (see the sketch after this list)
    2. Refactored the API of pipeline parallelism
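
    Gemini decides where model tensors live (CPU or GPU) according to a tensor placement policy, which this release makes configurable. A minimal sketch of selecting a policy through the ZeRO config; the exact config schema is an assumption based on this release's zero module, and 'cpu', 'cuda' and 'auto' are the policies referenced in the PRs below:

    # config.py
    from colossalai.zero.shard_utils import TensorShardStrategy

    zero = dict(
        model_config=dict(
            shard_strategy=TensorShardStrategy(),
            # 'auto' lets Gemini move tensors between CPU and GPU based on memory stats
            tensor_placement_policy='auto',
        ),
    )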

    What's Changed

    Features

    • [zero] initialize a stateful tensor manager by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/614
    • [pipeline] refactor pipeline by @YuliangLiu0306 in https://github.com/hpcaitech/ColossalAI/pull/679
    • [zero] stateful tensor manager by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/687
    • [zero] adapt zero hooks for unsharded module by @1SAA in https://github.com/hpcaitech/ColossalAI/pull/699
    • [zero] refactor memstats collector by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/706
    • [zero] improve adaptability for not-shard parameters by @1SAA in https://github.com/hpcaitech/ColossalAI/pull/708
    • [zero] check whether gradients have inf and nan in gpu by @1SAA in https://github.com/hpcaitech/ColossalAI/pull/712
    • [refactor] refactor the memory utils by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/715
    • [util] support detection of number of processes on current node by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/723
    • [utils] add synchronized cuda memory monitor by @1SAA in https://github.com/hpcaitech/ColossalAI/pull/740
    • [zero] refactor ShardedParamV2 by @1SAA in https://github.com/hpcaitech/ColossalAI/pull/742
    • [zero] add tensor placement policies by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/743
    • [zero] use factory pattern for tensor_placement_policy by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/752
    • [zero] refactor memstats_collector by @1SAA in https://github.com/hpcaitech/ColossalAI/pull/746
    • [gemini] init gemini individual directory by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/754
    • refactor shard and gather operation by @1SAA in https://github.com/hpcaitech/ColossalAI/pull/773

    Bug Fix

    • [zero] fix init bugs in zero context by @1SAA in https://github.com/hpcaitech/ColossalAI/pull/686
    • [hotfix] update requirements-test by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/701
    • [hotfix] fix a bug in 3d vocab parallel embedding by @kurisusnowdeng in https://github.com/hpcaitech/ColossalAI/pull/707
    • [compatibility] fixed tensor parallel compatibility with torch 1.9 by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/700
    • [hotfix]fixed bugs of assigning grad states to non leaf nodes by @Gy-Lu in https://github.com/hpcaitech/ColossalAI/pull/711
    • [hotfix] fix stateful tensor manager's cuda model data size by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/710
    • [bug] fixed broken test_found_inf by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/725
    • [util] fixed activation checkpointing on torch 1.9 by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/719
    • [util] fixed communication API with PyTorch 1.9 by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/721
    • [bug] removed zero installation requirements by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/731
    • [hotfix] remove duplicated param register to stateful tensor manager by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/728
    • [utils] correct cpu memory used and capacity in the context of multi-process by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/726
    • [bug] fixed grad scaler compatibility with torch 1.8 by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/735
    • [bug] fixed DDP compatibility with torch 1.8 by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/739
    • [hotfix] fix memory leak in backward of sharded model by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/741
    • [hotfix] fix initialize about zero by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/748
    • [hotfix] fix prepare grads in sharded optim by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/749
    • [hotfix] layernorm by @kurisusnowdeng in https://github.com/hpcaitech/ColossalAI/pull/750
    • [hotfix] fix auto tensor placement policy by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/753
    • [hotfix] fix reuse_fp16_shard of sharded model by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/756
    • [hotfix] fix test_stateful_tensor_mgr by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/762
    • [compatibility] used backward-compatible API for global process group by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/758
    • [hotfix] fix the ckpt hook bugs when using DDP by @Gy-Lu in https://github.com/hpcaitech/ColossalAI/pull/769
    • [hotfix] polish sharded optim docstr and warning by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/770

    Unit Testing

    • [ci] replace the ngc docker image with self-built pytorch image by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/672
    • [ci] fixed compatibility workflow by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/678
    • [ci] update workflow trigger condition and support options by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/691
    • [ci] added missing field in workflow by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/692
    • [ci] remove ipc config for rootless docker by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/694
    • [test] added missing decorators to model checkpointing tests by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/727
    • [unitest] add checkpoint for moe zero test by @1SAA in https://github.com/hpcaitech/ColossalAI/pull/729
    • [test] added a decorator for address already in use error with backward compatibility by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/760
    • [test] refactored with the new rerun decorator by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/763

    Documentation

    • add PaLM link by @binmakeswell in https://github.com/hpcaitech/ColossalAI/pull/704
    • [doc] removed outdated installation command by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/730
    • add video by @binmakeswell in https://github.com/hpcaitech/ColossalAI/pull/732
    • [readme] polish readme by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/764
    • [readme] sync CN readme by @binmakeswell in https://github.com/hpcaitech/ColossalAI/pull/766

    Miscellaneous

    • [Bot] Synchronize Submodule References by @github-actions in https://github.com/hpcaitech/ColossalAI/pull/556
    • [Bot] Synchronize Submodule References by @github-actions in https://github.com/hpcaitech/ColossalAI/pull/695
    • [refactor] zero directory by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/724
    • [Bot] Synchronize Submodule References by @github-actions in https://github.com/hpcaitech/ColossalAI/pull/751

    Full Changelog: https://github.com/hpcaitech/ColossalAI/compare/v0.1.2...v0.1.3

    Source code(tar.gz)
    Source code(zip)
  • v0.1.2(Apr 6, 2022)

    Overview

    Here are the main improvements of this release:

    1. MoE and BERT models can be trained with ZeRO.
    2. Provide a unified checkpoint mechanism for all kinds of parallelism.
    3. Optimize ZeRO-offload and improve model scaling.
    4. Design a unified model memory tracer.
    5. Implement an efficient hybrid Adam with CPU and CUDA kernels (see the sketch after this list).
    6. Improve activation offloading.
    7. Beta version of the profiler TensorBoard plugin.
    8. Refactor the pipeline module for closer integration with the engine.
    9. Chinese tutorials, WeChat and Slack user groups.
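
    The hybrid Adam runs a CPU kernel for parameters resident in CPU memory and a fused CUDA kernel for parameters on the GPU, so a single optimizer covers both when offloading is enabled. A minimal usage sketch (constructor arguments beyond lr are assumptions, and the fused path requires the CUDA extensions to be built at installation time):

    import torch
    from colossalai.nn.optimizer import HybridAdam

    model = torch.nn.Linear(1024, 1024).cuda()
    optimizer = HybridAdam(model.parameters(), lr=1e-3)

    loss = model(torch.randn(8, 1024, device='cuda')).sum()
    loss.backward()
    optimizer.step()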

    What's Changed

    Features

    • [zero] get memory usage for sharded param by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/536
    • [zero] improve the accuracy of get_memory_usage of sharded param by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/538
    • [zero] refactor model data tracing by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/537
    • [zero] get memory usage of sharded optim v2. by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/542
    • [zero] polish ZeroInitContext by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/540
    • [zero] optimize grad offload by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/539
    • [zero] non model data tracing by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/545
    • [zero] add zero config to neutralize zero context init by @1SAA in https://github.com/hpcaitech/ColossalAI/pull/546
    • [zero] dump memory stats for sharded model by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/548
    • [zero] add stateful tensor by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/549
    • [zero] label state for param fp16 and grad by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/551
    • [zero] hijack p.grad in sharded model by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/554
    • [utils] update colo tensor moving APIs by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/553
    • [polish] rename col_attr -> colo_attr by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/558
    • [zero] trace states of fp16/32 grad and fp32 param by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/571
    • [zero] adapt zero for unsharded parameters by @1SAA in https://github.com/hpcaitech/ColossalAI/pull/561
    • [refactor] memory utils by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/577
    • Feature/checkpoint gloo by @kurisusnowdeng in https://github.com/hpcaitech/ColossalAI/pull/589
    • [zero] add sampling time for memstats collector by @Gy-Lu in https://github.com/hpcaitech/ColossalAI/pull/610
    • [model checkpoint] checkpoint utils by @kurisusnowdeng in https://github.com/hpcaitech/ColossalAI/pull/592
    • [model checkpoint][hotfix] unified layers for save&load by @kurisusnowdeng in https://github.com/hpcaitech/ColossalAI/pull/593
    • Feature/checkpoint 2D by @kurisusnowdeng in https://github.com/hpcaitech/ColossalAI/pull/595
    • Feature/checkpoint 1D by @kurisusnowdeng in https://github.com/hpcaitech/ColossalAI/pull/594
    • [model checkpoint] CPU communication ops by @kurisusnowdeng in https://github.com/hpcaitech/ColossalAI/pull/590
    • Feature/checkpoint 2.5D by @kurisusnowdeng in https://github.com/hpcaitech/ColossalAI/pull/596
    • Feature/Checkpoint 3D by @kurisusnowdeng in https://github.com/hpcaitech/ColossalAI/pull/597
    • [model checkpoint] checkpoint hook by @kurisusnowdeng in https://github.com/hpcaitech/ColossalAI/pull/598
    • Feature/Checkpoint tests by @kurisusnowdeng in https://github.com/hpcaitech/ColossalAI/pull/599
    • [zero] adapt zero for unsharded parameters (Optimizer part) by @1SAA in https://github.com/hpcaitech/ColossalAI/pull/601
    • [zero] polish init context by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/645
    • refactor pipeline---put runtime schedule into engine. by @YuliangLiu0306 in https://github.com/hpcaitech/ColossalAI/pull/627

    Bug Fix

    • [Zero] process no-leaf-module in Zero by @1SAA in https://github.com/hpcaitech/ColossalAI/pull/535
    • Add gather_out arg to Linear by @Wesley-Jzy in https://github.com/hpcaitech/ColossalAI/pull/541
    • [hotfix] fix parallel_input flag for Linear1D_Col gather_output by @Wesley-Jzy in https://github.com/hpcaitech/ColossalAI/pull/579
    • [hotfix] add hybrid adam to init by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/584
    • Hotfix/path check util by @kurisusnowdeng in https://github.com/hpcaitech/ColossalAI/pull/591
    • [hotfix] fix sharded optim zero grad by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/604
    • Add tensor parallel input check by @Wesley-Jzy in https://github.com/hpcaitech/ColossalAI/pull/621
    • [hotfix] Raise messages for indivisible batch sizes with tensor parallelism by @number1roy in https://github.com/hpcaitech/ColossalAI/pull/622
    • [zero] fixed the activation offload by @Gy-Lu in https://github.com/hpcaitech/ColossalAI/pull/647
    • fixed bugs in CPU adam by @1SAA in https://github.com/hpcaitech/ColossalAI/pull/633
    • Revert "[zero] polish init context" by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/657
    • [hotfix] fix a bug in model data stats tracing by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/655
    • fix bugs for unsharded parameters when restore data by @1SAA in https://github.com/hpcaitech/ColossalAI/pull/664

    Unit Testing

    • [zero] test zero tensor utils by @FredHuang99 in https://github.com/hpcaitech/ColossalAI/pull/609
    • remove hybrid adam in test_moe_zero_optim by @1SAA in https://github.com/hpcaitech/ColossalAI/pull/659

    Documentation

    • Refactored docstring to google style by @number1roy in https://github.com/hpcaitech/ColossalAI/pull/532
    • [docs] updated docs of hybrid adam and cpu adam by @Gy-Lu in https://github.com/hpcaitech/ColossalAI/pull/552
    • html refactor by @number1roy in https://github.com/hpcaitech/ColossalAI/pull/555
    • [doc] polish docstring of zero by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/612
    • [doc] update rst by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/615
    • [doc] polish amp docstring by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/616
    • [doc] polish moe docstring by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/618
    • [doc] polish optimizer docstring by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/619
    • [doc] polish utils docstring by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/620
    • [NFC] polish colossalai/kernel/cuda_native/csrc/kernels/cuda_util.cu … by @GaryGky in https://github.com/hpcaitech/ColossalAI/pull/625
    • [doc] polish checkpoint docstring by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/637
    • update GPT-2 experiment result by @Sze-qq in https://github.com/hpcaitech/ColossalAI/pull/666
    • [NFC] polish code by @binmakeswell in https://github.com/hpcaitech/ColossalAI/pull/646

    Model Zoo

    • [model zoo] add activation offload for gpt model by @Gy-Lu in https://github.com/hpcaitech/ColossalAI/pull/582

    Miscellaneous

    • [logging] polish logger format by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/543
    • [profiler] add MemProfiler by @raejaf in https://github.com/hpcaitech/ColossalAI/pull/356
    • [Bot] Synchronize Submodule References by @github-actions in https://github.com/hpcaitech/ColossalAI/pull/501
    • [tool] create .clang-format for pre-commit by @BoxiangW in https://github.com/hpcaitech/ColossalAI/pull/578
    • [GitHub] Add prefix and label in issue template by @binmakeswell in https://github.com/hpcaitech/ColossalAI/pull/652

    Full Changelog: https://github.com/hpcaitech/ColossalAI/compare/v0.1.1...v0.1.2

    Source code(tar.gz)
    Source code(zip)
  • v0.1.1(Mar 26, 2022)

    What's Changed

    Features

    • [MOE] changed parallelmode to dist process group by @1SAA in https://github.com/hpcaitech/ColossalAI/pull/460
    • [MOE] redirect moe_env from global_variables to core by @1SAA in https://github.com/hpcaitech/ColossalAI/pull/467
    • [zero] zero init ctx receives a dp process group by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/471
    • [zero] ZeRO supports pipeline parallel by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/477
    • add LinearGate for MOE in NaiveAMP context by @1SAA in https://github.com/hpcaitech/ColossalAI/pull/480
    • [zero] polish sharded param name by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/484
    • [zero] sharded optim support hybrid cpu adam by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/486
    • [zero] polish sharded optimizer v2 by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/490
    • [MOE] support PR-MOE by @1SAA in https://github.com/hpcaitech/ColossalAI/pull/488
    • [zero] sharded model manages ophooks individually by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/492
    • [MOE] remove old MoE legacy by @1SAA in https://github.com/hpcaitech/ColossalAI/pull/493
    • [zero] sharded model support the reuse of fp16 shard by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/495
    • [polish] polish singleton and global context by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/500
    • [memory] add model data tensor moving api by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/503
    • [memory] set cuda mem frac by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/506
    • [zero] use colo model data api in sharded optimv2 by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/511
    • [MOE] add MOEGPT model by @1SAA in https://github.com/hpcaitech/ColossalAI/pull/510
    • [zero] zero init ctx enable rm_torch_payload_on_the_fly by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/512
    • [zero] show model data cuda memory usage after zero context init. by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/515
    • [log] polish disable_existing_loggers by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/519
    • [zero] add model data tensor inline moving API by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/521
    • [cuda] modify the fused adam, support hybrid of fp16 and fp32 by @Gy-Lu in https://github.com/hpcaitech/ColossalAI/pull/497
    • [zero] refactor model data tracing by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/522
    • [zero] added hybrid adam, removed loss scale in adam by @Gy-Lu in https://github.com/hpcaitech/ColossalAI/pull/527

    Bug Fix

    • fix discussion button in issue template by @binmakeswell in https://github.com/hpcaitech/ColossalAI/pull/504
    • [zero] fix grad offload by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/528

    Unit Testing

    • [MOE] add unitest for MOE experts layout, gradient handler and kernel by @1SAA in https://github.com/hpcaitech/ColossalAI/pull/469
    • [test] added rerun on exception for testing by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/475
    • [zero] fix init device bug in zero init context unittest by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/516
    • [test] fixed rerun_on_exception and adapted test cases by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/487

    CI/CD

    • [devops] remove tsinghua source for pip by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/505
    • [devops] remove tsinghua source for pip by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/507
    • [devops] recover tsinghua pip source due to proxy issue by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/509

    Documentation

    • [doc] update rst by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/470
    • Update Experiment result about Colossal-AI with ZeRO by @Sze-qq in https://github.com/hpcaitech/ColossalAI/pull/479
    • [doc] docs get correct release version by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/489
    • Update README.md by @fastalgo in https://github.com/hpcaitech/ColossalAI/pull/514
    • [doc] update apidoc by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/530

    Model Zoo

    • [model zoo] fix attn mask shape of gpt by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/472
    • [model zoo] gpt embedding remove attn mask by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/474

    Miscellaneous

    • [install] run without rich by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/513
    • [refactor] remove old zero code by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/517
    • [format] polish name format for MOE by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/481

    New Contributors

    • @fastalgo made their first contribution in https://github.com/hpcaitech/ColossalAI/pull/514

    Full Changelog: https://github.com/hpcaitech/ColossalAI/compare/v0.1.0...v0.1.1

    Source code(tar.gz)
    Source code(zip)
  • v0.1.0(Mar 19, 2022)

    Overview

    We are happy to release version v0.1.0 today. Compared to the previous version, it ships a brand-new ZeRO module and updates to many aspects of the system for better performance and usability. The latest version can now be installed with pip install colossalai. We will update our examples and documentation accordingly over the next few days.
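
    The new ZeRO module is composed of a sharded model wrapper and a sharded optimizer built around sharded parameters. A rough sketch of how the pieces fit together; module paths and constructor arguments are assumptions for this early release, and the supported entry point remains passing a zero config to colossalai.initialize:

    import torch
    import torch.nn as nn
    from colossalai.nn.optimizer import CPUAdam
    from colossalai.zero.init_ctx import ZeroInitContext
    from colossalai.zero.shard_utils import TensorShardStrategy
    from colossalai.zero.sharded_model import ShardedModelV2
    from colossalai.zero.sharded_optim import ShardedOptimizerV2

    shard_strategy = TensorShardStrategy()
    # parameters are sharded across data-parallel ranks as they are created
    with ZeroInitContext(target_device=torch.device('cuda'),
                         shard_strategy=shard_strategy,
                         shard_param=True):
        model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 256))

    model = ShardedModelV2(model, shard_strategy)
    optimizer = ShardedOptimizerV2(model, CPUAdam(model.parameters(), lr=1e-3))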

    Highlights:

    Note: a. Only the major base commits are shown; successive commits that enhance or update a base commit are omitted.
    b. Some commits lack an associated pull request ID for unknown reasons.
    c. The list is ordered by time.

    Features

    • add moe context, moe utilities and refactor gradient handler (#455) by @1SAA
    • [zero] Update initialize for ZeRO (#458) by @ver217
    • [zero] hybrid cpu adam (#445) by @feifeibear
    • added Multiply Jitter and capacity factor eval for MOE (#434) by @1SAA
    • [fp16] refactored fp16 optimizer (#392) by @FrankLeeeee
    • [zero] memtracer to record cuda memory usage of model data and overall system (#395) by @feifeibear
    • Added tensor detector (#393) by @Gy-Lu
    • Added activation offload (#331) by @Gy-Lu
    • [zero] zero init context collect numel of model (#375) by @feifeibear
    • Added PCIE profiler to detect data transmission (#373) by @1SAA
    • Added Profiler Context to manage all profilers (#340) by @1SAA
    • set criterion as optional in colossalai initialize (#336) by @FrankLeeeee
    • [zero] Update sharded model v2 using sharded param v2 (#323) by @ver217
    • [zero] zero init context (#321) by @feifeibear
    • Added profiler communication operations by @1SAA
    • added buffer sync to naive amp model wrapper (#291) by @FrankLeeeee
    • [zero] cpu adam kernel (#288) by @Gy-Lu
    • Feature/zero (#279) by @feifeibear @FrankLeeeee @ver217
    • impl shard optim v2 and add unit test by @ver217
    • [profiler] primary memory tracer by @raejaf
    • add sharded adam by @ver217

    Unit Testing

    • [test] fixed amp convergence comparison test (#454) by @FrankLeeeee
    • [test] optimized zero data parallel test (#452) by @FrankLeeeee
    • [test] make zero engine test really work (#447) by @feifeibear
    • optimized context test time consumption (#446) by @FrankLeeeee
    • [unittest] polish zero config in unittest (#438) by @feifeibear
    • added testing module (#435) by @FrankLeeeee
    • [zero] polish ShardedOptimV2 unittest (#385) by @feifeibear
    • [unit test] Refactored test cases with component func (#339) by @FrankLeeeee

    Documentation

    • [doc] Update docstring for ZeRO (#459) by @ver217
    • update README and images path (#384) by @binmakeswell
    • add badge and contributor list by @FrankLeeeee
    • add community group and update issue template (#271) by @binmakeswell
    • update experimental visualization (#253) by @Sze-qq
    • add Chinese README by @binmakeswell

    CI/CD

    • update github CI with the current workflow (#441) by @FrankLeeeee
    • update unit testing CI rules by @FrankLeeeee
    • added compatibility CI and options for release ci by @FrankLeeeee
    • added pypi publication CI and remove formatting CI by @FrankLeeeee

    Bug Fix

    • fix gpt attention mask (#461) by @ver217
    • [bug] Fixed device placement bug in memory monitor thread (#433) by @FrankLeeeee
    • fixed fp16 optimizer none grad bug (#432) by @FrankLeeeee
    • fixed gpt attention mask in pipeline (#430) by @FrankLeeeee
    • [hotfix] fixed bugs in ShardStrategy and PcieProfiler (#394) by @1SAA
    • fixed bug in activation checkpointing test (#387) by @FrankLeeeee
    • [profiler] Fixed bugs in CommProfiler and PcieProfiler (#377) by @1SAA
    • fixed CI dataset directory; fixed import error of 2.5d accuracy (#255) by @kurisusnowdeng
    • fixed padding index issue for vocab parallel embedding layers; updated 3D linear to be compatible with examples in the tutorial by @kurisusnowdeng

    Miscellaneous

    • [log] better logging display with rich (#426) by @feifeibear
    Source code(tar.gz)
    Source code(zip)
  • v0.0.2(Feb 15, 2022)

    Change Log

    Added

    • Unified distributed layers
    • MoE support
    • DevOps tools such as GitHub Actions, code review automation, etc.
    • New project official website

    Changes

    • Refactored the APIs for usability, flexibility and modularity
    • Adapted PyTorch AMP for tensor parallelism
    • Refactored utilities for tensor parallelism and pipeline parallelism
    • Separated benchmarks and examples into independent repositories
    • Updated pipeline parallelism to support non-interleaved and interleaved versions
    • Refactored installation scripts for convenience

    Fixed

    • ZeRO level 3 runtime error
    • Incorrect calculation in gradient clipping
    Source code(tar.gz)
    Source code(zip)
  • v0.0.1-beta(Oct 28, 2021)

    Features

    • Data Parallelism
    • Pipeline Parallelism (experimental)
    • 1D, 2D, 2.5D, 3D and sequence tensor parallelism
    • Easy-to-use trainer and engine
    • Extensibility for user-defined parallelism
    • Mixed Precision Training
    • Zero Redundancy Optimizer (ZeRO)
    Source code(tar.gz)
    Source code(zip)