Hardware: RTX 3090, Ryzen 3600, 64 GB of RAM.
I am trying to train the 1.3B parameter model on a custom dataset. Training on this dataset definitely takes more memory, apparently because the inputs are longer, so I am trying to use DeepSpeed. The only changes I have made are switching to the smaller 1.3B model and reducing the batch size to 4.
One issue I am having is that the loss (I think it is the loss) is overflowing. I know this comes from using mixed or half precision to reduce memory usage. Training on the provided dataset also hits the overflow at first, but it is quickly resolved through internal adjustments, so it is not a problem there. Is there some configuration change I can make so that this custom dataset will train without overflowing?
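For reference, the loss-scaling knobs live in the `fp16` section of the DeepSpeed config. The sketch below lists the documented keys as a Python dict; the values are illustrative assumptions, not the settings used in this run.

```python
import json

# Illustrative fp16 section for a DeepSpeed config.
# The keys are DeepSpeed's documented fp16 options; the values are assumptions.
fp16_section = {
    "fp16": {
        "enabled": True,
        "loss_scale": 0,            # 0 selects dynamic loss scaling
        "initial_scale_power": 16,  # starting loss scale is 2 ** 16
        "loss_scale_window": 1000,  # overflow-free steps before the scale is raised
        "hysteresis": 2,            # delay before the scale is lowered on overflow
        "min_loss_scale": 1,        # floor for the dynamic loss scale
    }
}

# Print the fragment in the JSON form a DeepSpeed config file would use.
print(json.dumps(fp16_section, indent=2))
```

A lower `initial_scale_power` only shortens the initial run of skipped steps; it would not help if overflow persists once the scale has already reached `min_loss_scale`.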
Below are some logs, you can see that it still is overflowing even with the loss scale at 1.
python gpt_neo_xl_deepspeed.py
Special tokens have been added in the vocabulary, make sure the associated word embedding are fine-tuned or trained.
Max length: 384
[2021-06-08 12:02:15,302] [INFO] [distributed.py:47:init_distributed] Initializing torch distributed with backend: nccl
[2021-06-08 12:02:15,601] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed info: version=0.3.13+12a53b4, git-hash=12a53b4, git-branch=HEAD
[2021-06-08 12:02:15,622] [INFO] [engine.py:77:_initialize_parameter_parallel_groups] data_parallel_size: 1, parameter_parallel_size: 1
Adam Optimizer #0 is created with scalar arithmetic capability.
Config: alpha=0.000050, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1
[2021-06-08 12:02:17,843] [INFO] [engine.py:602:_configure_optimizer] Using DeepSpeed Optimizer param name adamw as basic optimizer
[2021-06-08 12:02:17,843] [INFO] [engine.py:606:_configure_optimizer] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
[2021-06-08 12:02:17,843] [INFO] [logging.py:60:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer
Initializing ZeRO Stage 3
[2021-06-08 12:02:17,844] [WARNING] [stage3.py:35:] apex was installed without --cpp_ext. Falling back to Python flatten and unflatten.
[2021-06-08 12:02:17,844] [INFO] [utils.py:555:see_memory_usage] Stage 3 intialize beginning
[2021-06-08 12:02:17,845] [INFO] [utils.py:560:see_memory_usage] MA 2.5 GB Max_MA 5.14 GB CA 5.14 GB Max_CA 5 GB
[2021-06-08 12:02:17,845] [INFO] [utils.py:565:see_memory_usage] CPU Virtual Memory: used = 24.87 GB, percent = 39.6%
[2021-06-08 12:02:17,845] [INFO] [stage3.py:586:init] Reduce bucket size 500000000
[2021-06-08 12:02:17,845] [INFO] [stage3.py:587:init] Allgather bucket size 50000000
[2021-06-08 12:02:22,511] [INFO] [stage3.py:730:init] optimizer state initialized
[2021-06-08 12:02:23,014] [INFO] [utils.py:555:see_memory_usage] After initializing ZeRO optimizer
[2021-06-08 12:02:23,014] [INFO] [utils.py:560:see_memory_usage] MA 0.43 GB Max_MA 5.14 GB CA 5.53 GB Max_CA 6 GB
[2021-06-08 12:02:23,014] [INFO] [utils.py:565:see_memory_usage] CPU Virtual Memory: used = 47.03 GB, percent = 74.9%
[2021-06-08 12:02:23,014] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed Final Optimizer = adamw
[2021-06-08 12:02:23,014] [INFO] [engine.py:439:_configure_lr_scheduler] DeepSpeed using configured LR scheduler = WarmupLR
[2021-06-08 12:02:23,014] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed LR Scheduler = <deepspeed.runtime.lr_schedules.WarmupLR object at 0x7fb428040ed0>
[2021-06-08 12:02:23,014] [INFO] [logging.py:60:log_dist] [Rank 0] step=0, skipped=0, lr=[5e-05], mom=[[0.9, 0.999]]
[2021-06-08 12:02:23,014] [INFO] [config.py:737:print] DeepSpeedEngine configuration:
[2021-06-08 12:02:23,016] [INFO] [config.py:747:print] json = {
0%| | 0/63645 [00:00<?, ?it/s][2021-06-08 12:02:25,515] [INFO] [stage3.py:2323:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4294967296, reducing to 4294967296
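The attempted loss scale in the last log line, 4294967296, equals 2 ** 32. As a rough illustration of how a dynamic loss scale reacts to repeated overflow, here is a simplified model. It is a sketch, not DeepSpeed's implementation: the class name `DynamicLossScale` and its parameters are hypothetical, and it ignores hysteresis, so it halves the scale on the very first overflow, whereas the log line above shows the scale left unchanged.

```python
class DynamicLossScale:
    """Simplified model of dynamic loss scaling (hypothetical, not DeepSpeed's code)."""

    def __init__(self, init_scale=2.0 ** 32, factor=2.0, window=1000, min_scale=1.0):
        self.scale = init_scale      # current loss scale
        self.factor = factor         # multiplier used to shrink or grow the scale
        self.window = window         # overflow-free steps required before growing
        self.min_scale = min_scale   # the scale never drops below this floor
        self.good_steps = 0          # consecutive steps without overflow

    def update(self, overflow: bool) -> bool:
        """Record one step; return True if the optimizer step would be applied."""
        if overflow:
            # Skip the step and shrink the scale, clamped at the floor.
            self.scale = max(self.scale / self.factor, self.min_scale)
            self.good_steps = 0
            return False
        self.good_steps += 1
        if self.good_steps % self.window == 0:
            # A full window without overflow: try a larger scale again.
            self.scale *= self.factor
        return True


scaler = DynamicLossScale()
# Forty overflowing steps in a row: 32 halvings reach the floor, the rest are clamped.
for _ in range(40):
    scaler.update(overflow=True)
print(scaler.scale)  # 1.0
```

In this model, a run that overflows on every step ends up pinned at a scale of 1 with every step skipped, which corresponds to the "still overflowing even with the loss scale at 1" behaviour described above.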
