Hardware: RTX 3090, Ryzen 3600, 64 GB RAM.
I am trying to train the 1.3B parameter model on a custom dataset. Training this model takes more memory, apparently because of the longer inputs (it definitely uses more memory), so I am trying to use DeepSpeed. I have changed nothing other than switching to the smaller 1.3B model and reducing the batch size to 4.
The issue I am having is that the loss (I think it's the loss) is overflowing. I know this is due to using mixed/half precision to reduce memory usage. With the provided dataset this is not a problem: it also overflows initially, but the overflow is quickly resolved through internal adjustments. Is there some configuration change I can make so that this custom dataset will train without overflowing?
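As I understand it, the "internal adjustments" on the provided dataset are just DeepSpeed's dynamic loss scaler: it starts at a huge scale (2^32 here, per `initial_dynamic_scale` in the config dump) and halves it after every overflowed step, down to `min_scale`. A toy sketch of that behavior as I understand it (my own simplification, not DeepSpeed's actual code; it ignores the `delayed_shift=2` behavior, which is why the very first log line says "reducing to 4294967296"):

```python
def simulate_loss_scale(num_overflow_steps, init_scale=2**32, min_scale=1):
    """Replay the scale schedule visible in the log: halve on every
    overflowed step, but never drop below min_scale."""
    scale = init_scale
    history = []
    for _ in range(num_overflow_steps):
        history.append(scale)
        scale = max(scale / 2, min_scale)  # halve after an overflow step
    return history

# Matches the log: 2**32, 2**31, ..., down to 1, then stuck at 1.
print(simulate_loss_scale(40)[:3], simulate_loss_scale(40)[-1])
```

On the provided dataset this settles at a scale the gradients can live with; on my dataset it bottoms out at 1 and keeps overflowing, which is what the log below shows.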
Below are some logs; you can see it is still overflowing even with the loss scale at 1.
python gpt_neo_xl_deepspeed.py
Special tokens have been added in the vocabulary, make sure the associated word embedding are fine-tuned or trained.
Max length: 384
[2021-06-08 12:02:15,302] [INFO] [distributed.py:47:init_distributed] Initializing torch distributed with backend: nccl
[2021-06-08 12:02:15,601] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed info: version=0.3.13+12a53b4, git-hash=12a53b4, git-branch=HEAD
[2021-06-08 12:02:15,622] [INFO] [engine.py:77:_initialize_parameter_parallel_groups] data_parallel_size: 1, parameter_parallel_size: 1
Adam Optimizer #0 is created with scalar arithmetic capability.
Config: alpha=0.000050, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1
[2021-06-08 12:02:17,843] [INFO] [engine.py:602:_configure_optimizer] Using DeepSpeed Optimizer param name adamw as basic optimizer
[2021-06-08 12:02:17,843] [INFO] [engine.py:606:_configure_optimizer] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
[2021-06-08 12:02:17,843] [INFO] [logging.py:60:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer
Initializing ZeRO Stage 3
[2021-06-08 12:02:17,844] [WARNING] [stage3.py:35:] apex was installed without --cpp_ext. Falling back to Python flatten and unflatten.
[2021-06-08 12:02:17,844] [INFO] [utils.py:555:see_memory_usage] Stage 3 intialize beginning
/home/blake/anaconda3/envs/gpt/lib/python3.7/site-packages/torch/cuda/memory.py:346: FutureWarning: torch.cuda.memory_cached has been renamed to torch.cuda.memory_reserved
FutureWarning)
/home/blake/anaconda3/envs/gpt/lib/python3.7/site-packages/torch/cuda/memory.py:354: FutureWarning: torch.cuda.max_memory_cached has been renamed to torch.cuda.max_memory_reserved
FutureWarning)
[2021-06-08 12:02:17,845] [INFO] [utils.py:560:see_memory_usage] MA 2.5 GB Max_MA 5.14 GB CA 5.14 GB Max_CA 5 GB
[2021-06-08 12:02:17,845] [INFO] [utils.py:565:see_memory_usage] CPU Virtual Memory: used = 24.87 GB, percent = 39.6%
[2021-06-08 12:02:17,845] [INFO] [stage3.py:586:init] Reduce bucket size 500000000
[2021-06-08 12:02:17,845] [INFO] [stage3.py:587:init] Allgather bucket size 50000000
[2021-06-08 12:02:22,511] [INFO] [stage3.py:730:init] optimizer state initialized
[2021-06-08 12:02:23,014] [INFO] [utils.py:555:see_memory_usage] After initializing ZeRO optimizer
[2021-06-08 12:02:23,014] [INFO] [utils.py:560:see_memory_usage] MA 0.43 GB Max_MA 5.14 GB CA 5.53 GB Max_CA 6 GB
[2021-06-08 12:02:23,014] [INFO] [utils.py:565:see_memory_usage] CPU Virtual Memory: used = 47.03 GB, percent = 74.9%
[2021-06-08 12:02:23,014] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed Final Optimizer = adamw
[2021-06-08 12:02:23,014] [INFO] [engine.py:439:_configure_lr_scheduler] DeepSpeed using configured LR scheduler = WarmupLR
[2021-06-08 12:02:23,014] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed LR Scheduler = <deepspeed.runtime.lr_schedules.WarmupLR object at 0x7fb428040ed0>
[2021-06-08 12:02:23,014] [INFO] [logging.py:60:log_dist] [Rank 0] step=0, skipped=0, lr=[5e-05], mom=[[0.9, 0.999]]
[2021-06-08 12:02:23,014] [INFO] [config.py:737:print] DeepSpeedEngine configuration:
[2021-06-08 12:02:23,015] [INFO] [config.py:741:print] activation_checkpointing_config {
"contiguous_memory_optimization": false,
"cpu_checkpointing": false,
"number_checkpoints": null,
"partition_activations": false,
"profile": false,
"synchronize_checkpoint_boundary": false
}
[2021-06-08 12:02:23,015] [INFO] [config.py:741:print] allreduce_always_fp32 ........ False
[2021-06-08 12:02:23,015] [INFO] [config.py:741:print] amp_enabled .................. False
[2021-06-08 12:02:23,015] [INFO] [config.py:741:print] amp_params ................... False
[2021-06-08 12:02:23,015] [INFO] [config.py:741:print] checkpoint_tag_validation_enabled True
[2021-06-08 12:02:23,015] [INFO] [config.py:741:print] checkpoint_tag_validation_fail False
[2021-06-08 12:02:23,015] [INFO] [config.py:741:print] disable_allgather ............ False
[2021-06-08 12:02:23,015] [INFO] [config.py:741:print] dump_state ................... False
[2021-06-08 12:02:23,015] [INFO] [config.py:741:print] dynamic_loss_scale_args ...... {'init_scale': 4294967296, 'scale_window': 1000, 'delayed_shift': 2, 'min_scale': 1}
[2021-06-08 12:02:23,015] [INFO] [config.py:741:print] elasticity_enabled ........... False
[2021-06-08 12:02:23,015] [INFO] [config.py:741:print] flops_profiler_config ........ {
"detailed": true,
"enabled": false,
"module_depth": -1,
"profile_step": 1,
"top_modules": 3
}
[2021-06-08 12:02:23,015] [INFO] [config.py:741:print] fp16_enabled ................. True
[2021-06-08 12:02:23,015] [INFO] [config.py:741:print] global_rank .................. 0
[2021-06-08 12:02:23,015] [INFO] [config.py:741:print] gradient_accumulation_steps .. 1
[2021-06-08 12:02:23,015] [INFO] [config.py:741:print] gradient_clipping ............ 1.0
[2021-06-08 12:02:23,015] [INFO] [config.py:741:print] gradient_predivide_factor .... 1.0
[2021-06-08 12:02:23,015] [INFO] [config.py:741:print] initial_dynamic_scale ........ 4294967296
[2021-06-08 12:02:23,015] [INFO] [config.py:741:print] loss_scale ................... 0
[2021-06-08 12:02:23,015] [INFO] [config.py:741:print] memory_breakdown ............. False
[2021-06-08 12:02:23,015] [INFO] [config.py:741:print] optimizer_legacy_fusion ...... False
[2021-06-08 12:02:23,015] [INFO] [config.py:741:print] optimizer_name ............... adamw
[2021-06-08 12:02:23,015] [INFO] [config.py:741:print] optimizer_params ............. {'lr': 5e-05, 'betas': [0.9, 0.999], 'eps': 1e-08}
[2021-06-08 12:02:23,015] [INFO] [config.py:741:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2021-06-08 12:02:23,015] [INFO] [config.py:741:print] pld_enabled .................. False
[2021-06-08 12:02:23,015] [INFO] [config.py:741:print] pld_params ................... False
[2021-06-08 12:02:23,015] [INFO] [config.py:741:print] prescale_gradients ........... False
[2021-06-08 12:02:23,015] [INFO] [config.py:741:print] scheduler_name ............... WarmupLR
[2021-06-08 12:02:23,015] [INFO] [config.py:741:print] scheduler_params ............. {'warmup_min_lr': 0, 'warmup_max_lr': 5e-05, 'warmup_num_steps': 100}
[2021-06-08 12:02:23,015] [INFO] [config.py:741:print] sparse_attention ............. None
[2021-06-08 12:02:23,015] [INFO] [config.py:741:print] sparse_gradients_enabled ..... False
[2021-06-08 12:02:23,015] [INFO] [config.py:741:print] steps_per_print .............. 10
[2021-06-08 12:02:23,015] [INFO] [config.py:741:print] tensorboard_enabled .......... False
[2021-06-08 12:02:23,015] [INFO] [config.py:741:print] tensorboard_job_name ......... DeepSpeedJobName
[2021-06-08 12:02:23,015] [INFO] [config.py:741:print] tensorboard_output_path ......
[2021-06-08 12:02:23,015] [INFO] [config.py:741:print] train_batch_size ............. 4
[2021-06-08 12:02:23,015] [INFO] [config.py:741:print] train_micro_batch_size_per_gpu 4
[2021-06-08 12:02:23,015] [INFO] [config.py:741:print] wall_clock_breakdown ......... False
[2021-06-08 12:02:23,015] [INFO] [config.py:741:print] world_size ................... 1
[2021-06-08 12:02:23,015] [INFO] [config.py:741:print] zero_allow_untested_optimizer False
[2021-06-08 12:02:23,015] [INFO] [config.py:741:print] zero_config .................. {
"allgather_bucket_size": 500000000,
"allgather_partitions": true,
"contiguous_gradients": true,
"cpu_offload": true,
"cpu_offload_params": true,
"cpu_offload_use_pin_memory": false,
"elastic_checkpoint": true,
"load_from_fp32_weights": true,
"max_live_parameters": 1000000000,
"max_reuse_distance": 1000000000,
"overlap_comm": true,
"param_persistence_threshold": 100000,
"prefetch_bucket_size": 50000000,
"reduce_bucket_size": 500000000,
"reduce_scatter": true,
"stage": 3,
"sub_group_size": 1000000000000
}
[2021-06-08 12:02:23,016] [INFO] [config.py:741:print] zero_enabled ................. True
[2021-06-08 12:02:23,016] [INFO] [config.py:741:print] zero_optimization_stage ...... 3
[2021-06-08 12:02:23,016] [INFO] [config.py:747:print] json = {
"fp16":{
"enabled":true,
"min_loss_scale":1,
"opt_level":"O3"
},
"gradient_accumulation_steps":1,
"gradient_clipping":1.0,
"optimizer":{
"params":{
"betas":[
0.9,
0.999
],
"eps":1e-08,
"lr":5e-05
},
"type":"AdamW"
},
"scheduler":{
"params":{
"warmup_max_lr":5e-05,
"warmup_min_lr":0,
"warmup_num_steps":100
},
"type":"WarmupLR"
},
"train_micro_batch_size_per_gpu":4,
"zero_optimization":{
"contiguous_gradients":true,
"cpu_offload":true,
"cpu_offload_params":true,
"overlap_comm":true,
"stage":3
}
}
0%| | 0/63645 [00:00<?, ?it/s][2021-06-08 12:02:25,515] [INFO] [stage3.py:2323:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4294967296, reducing to 4294967296
0%| | 1/63645 [00:02<44:05:31, 2.49s/it][2021-06-08 12:02:27,976] [INFO] [stage3.py:2323:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4294967296, reducing to 2147483648.0
0%| | 2/63645 [00:04<43:44:33, 2.47s/it][2021-06-08 12:02:30,098] [INFO] [stage3.py:2323:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2147483648.0, reducing to 1073741824.0
0%| | 3/63645 [00:07<40:53:49, 2.31s/it][2021-06-08 12:02:32,216] [INFO] [stage3.py:2323:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1073741824.0, reducing to 536870912.0
0%| | 4/63645 [00:09<39:31:55, 2.24s/it][2021-06-08 12:02:34,369] [INFO] [stage3.py:2323:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 536870912.0, reducing to 268435456.0
0%| | 5/63645 [00:11<39:00:10, 2.21s/it][2021-06-08 12:02:36,493] [INFO] [stage3.py:2323:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 268435456.0, reducing to 134217728.0
0%| | 6/63645 [00:13<38:30:25, 2.18s/it][2021-06-08 12:02:38,621] [INFO] [stage3.py:2323:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 134217728.0, reducing to 67108864.0
0%| | 7/63645 [00:15<38:13:03, 2.16s/it][2021-06-08 12:02:40,756] [INFO] [stage3.py:2323:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 67108864.0, reducing to 33554432.0
0%| | 8/63645 [00:17<38:03:59, 2.15s/it][2021-06-08 12:02:42,877] [INFO] [stage3.py:2323:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 33554432.0, reducing to 16777216.0
0%| | 9/63645 [00:19<37:53:00, 2.14s/it][2021-06-08 12:02:45,001] [INFO] [stage3.py:2323:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 16777216.0, reducing to 8388608.0
[2021-06-08 12:02:45,002] [INFO] [timer.py:157:stop] 0/10, SamplesPerSec=1.8825260836384787
0%| | 10/63645 [00:21<37:46:42, 2.14s/it][2021-06-08 12:02:47,123] [INFO] [stage3.py:2323:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 8388608.0, reducing to 4194304.0
0%| | 11/63645 [00:24<37:41:45, 2.13s/it][2021-06-08 12:02:49,251] [INFO] [stage3.py:2323:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4194304.0, reducing to 2097152.0
0%| | 12/63645 [00:26<37:40:18, 2.13s/it][2021-06-08 12:02:51,371] [INFO] [stage3.py:2323:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2097152.0, reducing to 1048576.0
0%| | 13/63645 [00:28<37:36:38, 2.13s/it][2021-06-08 12:02:53,504] [INFO] [stage3.py:2323:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1048576.0, reducing to 524288.0
0%| | 14/63645 [00:30<37:38:20, 2.13s/it][2021-06-08 12:02:55,643] [INFO] [stage3.py:2323:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 524288.0, reducing to 262144.0
0%| | 15/63645 [00:32<37:41:12, 2.13s/it][2021-06-08 12:02:57,770] [INFO] [stage3.py:2323:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144.0, reducing to 131072.0
0%| | 16/63645 [00:34<37:39:38, 2.13s/it][2021-06-08 12:02:59,892] [INFO] [stage3.py:2323:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072.0, reducing to 65536.0
0%| | 17/63645 [00:36<37:37:35, 2.13s/it][2021-06-08 12:03:02,036] [INFO] [stage3.py:2323:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536.0, reducing to 32768.0
0%| | 18/63645 [00:39<37:41:26, 2.13s/it][2021-06-08 12:03:04,169] [INFO] [stage3.py:2323:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
0%| | 19/63645 [00:41<37:41:35, 2.13s/it][2021-06-08 12:03:06,275] [INFO] [stage3.py:2323:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 16384.0, reducing to 8192.0
[2021-06-08 12:03:06,275] [INFO] [timer.py:157:stop] 0/20, SamplesPerSec=1.8830545084715222
0%| | 20/63645 [00:43<37:32:58, 2.12s/it][2021-06-08 12:03:08,413] [INFO] [stage3.py:2323:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 8192.0, reducing to 4096.0
0%| | 21/63645 [00:45<37:37:14, 2.13s/it][2021-06-08 12:03:10,537] [INFO] [stage3.py:2323:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4096.0, reducing to 2048.0
0%| | 22/63645 [00:47<37:35:54, 2.13s/it][2021-06-08 12:03:12,696] [INFO] [stage3.py:2323:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2048.0, reducing to 1024.0
0%| | 23/63645 [00:49<37:45:52, 2.14s/it][2021-06-08 12:03:14,855] [INFO] [stage3.py:2323:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1024.0, reducing to 512.0
0%| | 24/63645 [00:51<37:52:39, 2.14s/it][2021-06-08 12:03:17,010] [INFO] [stage3.py:2323:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 512.0, reducing to 256.0
0%| | 25/63645 [00:53<37:56:31, 2.15s/it][2021-06-08 12:03:19,143] [INFO] [stage3.py:2323:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 256.0, reducing to 128.0
0%| | 26/63645 [00:56<37:51:51, 2.14s/it][2021-06-08 12:03:21,267] [INFO] [stage3.py:2323:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 128.0, reducing to 64.0
0%| | 27/63645 [00:58<37:45:52, 2.14s/it][2021-06-08 12:03:23,390] [INFO] [stage3.py:2323:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 64.0, reducing to 32.0
0%| | 28/63645 [01:00<37:41:21, 2.13s/it][2021-06-08 12:03:25,526] [INFO] [stage3.py:2323:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 32.0, reducing to 16.0
0%| | 29/63645 [01:02<37:42:30, 2.13s/it][2021-06-08 12:03:27,669] [INFO] [stage3.py:2323:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 16.0, reducing to 8.0
[2021-06-08 12:03:27,670] [INFO] [timer.py:157:stop] 0/30, SamplesPerSec=1.8793183327344205
0%| | 30/63645 [01:04<37:45:17, 2.14s/it][2021-06-08 12:03:29,803] [INFO] [stage3.py:2323:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 8.0, reducing to 4.0
0%| | 31/63645 [01:06<37:44:34, 2.14s/it][2021-06-08 12:03:31,936] [INFO] [stage3.py:2323:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4.0, reducing to 2.0
0%| | 32/63645 [01:08<37:43:25, 2.13s/it][2021-06-08 12:03:34,071] [INFO] [stage3.py:2323:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2.0, reducing to 1.0
0%| | 33/63645 [01:11<37:43:30, 2.13s/it][2021-06-08 12:03:36,202] [INFO] [stage3.py:2323:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1.0, reducing to 1
0%| | 34/63645 [01:13<37:42:09, 2.13s/it][2021-06-08 12:03:38,344] [INFO] [stage3.py:2323:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
0%| | 35/63645 [01:15<37:44:52, 2.14s/it][2021-06-08 12:03:40,449] [INFO] [stage3.py:2323:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
0%| | 36/63645 [01:17<37:34:48, 2.13s/it][2021-06-08 12:03:42,570] [INFO] [stage3.py:2323:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
0%| | 37/63645 [01:19<37:32:51, 2.13s/it][2021-06-08 12:03:44,722] [INFO] [stage3.py:2323:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
0%| | 38/63645 [01:21<37:41:34, 2.13s/it][2021-06-08 12:03:46,892] [INFO] [stage3.py:2323:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
0%| | 39/63645 [01:23<37:53:09, 2.14s/it][2021-06-08 12:03:49,014] [INFO] [stage3.py:2323:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
[2021-06-08 12:03:49,015] [INFO] [timer.py:157:stop] 0/40, SamplesPerSec=1.8786745912560845
 0%| | 40/63645 [01:25<37:46:00, 2.14s/it][2021-06-08 12:03:51,134] [INFO] [stage3.py:2323:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
 0%| | 41/63645 [01:28<37:40:15, 2.13s/it][2021-06-08 12:03:53,292] [INFO] [stage3.py:2323:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
(... every subsequent step overflows the same way, with the loss scale stuck at 1 ...)
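For reference, the only mitigations I have come up with so far (untested guesses on my part) are starting the dynamic scaler lower so it settles faster, or disabling fp16 entirely, via the `fp16` section of the DeepSpeed config, something like:

```json
{
  "fp16": {
    "enabled": true,
    "initial_scale_power": 16,
    "loss_scale_window": 1000,
    "min_loss_scale": 1
  }
}
```

Disabling fp16 (`"enabled": false`) would presumably avoid the overflow altogether but defeats the memory savings I need on a 24 GB card, so I am hoping there is a config-level fix that keeps half precision working.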